A previous client loved the automation I'd built to export all their SVG icons and asked me to help build an image catalog for their sales team. The company had a web portal where all 6,000 images lived. Each image had its own title, category, and description. But there was no search functionality, no category browsing, and no index. Salespeople clicked through pages at random, hoping to stumble across the right image. One sales director told me his team was spending 2–3 hours per pitch deck hunting for images.
They wanted a PDF catalog organized by category—people, restaurants, lifestyle, etc.—with titles, thumbnails, and download links. Something sales could actually search. Why not just fix the website? Getting a giant company to fix its website takes months. I could build the catalog in a weekend.
The Scraping Strategy: Working from Image IDs
Each image had a detail page following the URL pattern: /images/[IMAGE_ID].jpg. The images had auto-generated names like fhdyskein87.jpg. With a list of all image IDs, I could scrape each detail page for its metadata. They sent a hard drive with all the image files (for the PDF thumbnails) and a CSV listing every image ID. The metadata only existed on the website, so I still needed to scrape.
Shoutout to the development team at this monolith of a company for making it easy for me to scrape the data. Still, it took a few attempts to get the scraper working: the site required users to accept a cookie and log in before showing anything. Puppeteer was perfect for this, and running it with headless set to false let me feel like Hackerman.

Setting Up Puppeteer for an Authenticated Site
Scraping a website that requires login with Puppeteer means you need to automate the full browser session — navigate to the login page, fill in credentials, submit the form, and confirm you're authenticated before touching any content pages. For each image ID, Puppeteer navigates to its detail page, waits for the page to load, then Cheerio parses the HTML to extract the title, category, and description. (Puppeteer handles the browser navigation; Cheerio parses the static HTML snapshot — each tool does what it's best at.) Save to CSV. Wait 2 seconds, then move to the next ID. Repeat that 6,000 times.
The login kept failing. This wasn't my first go-round with Puppeteer, and filling in username and password fields is preeeeeeeeeeetty easy. The real culprit was the cookie banner: it had to be dismissed before anything else would load. When I'd tested manually, I'd clicked it without thinking. The sequence that actually worked:
1. Navigate to the site homepage.
2. Wait for the cookie consent banner to appear.
3. Click the "Accept" button to dismiss the banner.
4. Navigate to the login page.
5. Fill in credentials and submit.
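In Puppeteer, that sequence might look like the sketch below. All the selectors and the /login path are placeholder assumptions; substitute whatever the real site uses:

```javascript
// Sketch of the login sequence. Every selector and URL here is a
// placeholder assumption, not the client's real markup.
const SELECTORS = {
  cookieAccept: 'button#accept-cookies',
  username: 'input[name="username"]',
  password: 'input[name="password"]',
  submit: 'button[type="submit"]',
};

async function login(page, { base, user, pass }) {
  // 1. Homepage first — the cookie banner blocks everything else.
  await page.goto(base, { waitUntil: 'networkidle2' });
  // 2–3. Wait for the banner, then dismiss it.
  await page.waitForSelector(SELECTORS.cookieAccept);
  await page.click(SELECTORS.cookieAccept);
  // 4. Now the login page will actually load.
  await page.goto(`${base}/login`, { waitUntil: 'networkidle2' });
  // 5. Credentials in, submit, and wait for the post-login navigation.
  await page.type(SELECTORS.username, user);
  await page.type(SELECTORS.password, pass);
  await Promise.all([
    page.waitForNavigation({ waitUntil: 'networkidle2' }),
    page.click(SELECTORS.submit),
  ]);
}
```

The `Promise.all` around submit-and-wait is the standard Puppeteer idiom: clicking first and then waiting for navigation separately can miss a fast redirect.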
Gotcha: Dealing with Inconsistent CSS Selectors
The CSS selectors were a mess. Some pages stored categories in data attributes (data-category="people"), others in nested divs. Not every page had the same fields—some had descriptions, others didn't. The conditional logic looked like: check if data-category exists → use that. Otherwise, look for div.category → extract text. If neither exists, mark as "Uncategorized."
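That fallback chain generalizes to a small helper: try each extractor in order and take the first non-empty result. The helper below is dependency-free; the commented usage shows how it would plug into a Cheerio document (the selectors are assumptions):

```javascript
// Try each extractor in order; the first non-empty result wins.
function firstMatch(extractors, fallback) {
  for (const extract of extractors) {
    const value = extract();
    if (value && value.trim()) return value.trim();
  }
  return fallback;
}

// With a Cheerio document loaded as $ (selector names are assumptions):
//
// const category = firstMatch([
//   () => $('[data-category]').attr('data-category'), // newer pages
//   () => $('div.category').text(),                   // older pages
// ], 'Uncategorized');
```

Adding a new page variant then means appending one extractor instead of deepening an if/else ladder.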
Batch Processing: Saving to CSV in Groups of 100
The 2-second delay between requests kept me under the radar: tripping a rate limit would have alerted the webmaster, and slow, steady traffic looks like a human rather than a bot. At 2 seconds per page, 6,000 pages works out to roughly 3.5 hours. Batching in sets of 100 saved memory and meant a crash at image 3,500 wouldn't lose everything. I tested the scraper on 20 images first to verify the selectors were right, which is far easier than debugging after a multi-hour run. Once that worked, I let it run overnight for the full 6,000. All told, the scraper took about 3 hours to write and debug and about 3.5 hours to run, including overhead for authentication and error handling.
Sorting the Data for InDesign Data Merge
I combined the batch CSVs in Google Sheets and sorted everything by category. InDesign's data merge populates PDFs in CSV order, so sorting meant the final PDF would have all "People" images grouped together, all "Restaurants" together, etc. Without sorting, it would be 400 pages of random images—a salesperson looking for restaurant photos would have to scan the entire PDF.
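The same combine-and-sort step could be done in Node instead of Google Sheets: concatenate the batch rows and sort by the category column. A sketch, assuming category is the third field and no embedded commas in the sorted columns:

```javascript
// Sort combined CSV rows by category so InDesign's data merge groups
// each category together. Assumes the category is field index 2 and
// that fields contain no embedded commas (an assumption, not a given).
function sortByCategory(rows, categoryIndex = 2) {
  return [...rows].sort((a, b) => {
    const ca = a.split(',')[categoryIndex] ?? '';
    const cb = b.split(',')[categoryIndex] ?? '';
    return ca.localeCompare(cb);
  });
}
```

`Array.prototype.sort` is stable in modern JavaScript engines, so images keep their original order within each category.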
Making Images Clickable in InDesign
InDesign's data merge is straightforward for text, but adding image thumbnails and clickable links gets messy. I tested layouts to fit as many images as possible on each page. InDesign won't let you make images clickable directly. The workaround: add an invisible box over each image in the template and link it to the download URL field. When InDesign runs the data merge, it automatically populates 6,000 linked boxes.
After generating the catalog came the grunt work: creating section breaks, adding page navigation, and compressing the PDF. I sent the file off, went for a coffee, and got on with my day.
Results: From 30-Minute Search to 30 Seconds
The sales team got a 400-page PDF they could search with CMD+F. What used to take 30 minutes of random clicking now took 30 seconds. The first week, I got thank-you emails from five different salespeople. One guy said he'd been searching for a specific restaurant photo for three days and found it in 10 seconds with the PDF.
Want the Scraper Code?
The scraper is configurable—swap in your selectors, adjust the batch size, and you're running. The repo includes the Puppeteer scraper, CSV batch processor, and configuration examples.
GitHub code: https://github.com/kostimarko/Scraper-Image-Library