Scraping with Puppeteer
I won't go into the details of what Puppeteer is; you can read that on their website. We just want to code.
What can Puppeteer do
Puppeteer can do lots of things, but the most common use cases are:
- automate form filling and submissions
- crawl a website and extract data from it
- take screenshots of a website, either the full page or even specific sections (see the short sketch right after this list)
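For example, a full-page screenshot takes only a few lines. This is a minimal sketch; the file name screenshot.png is just an illustration:

const puppeteer = require('puppeteer');

(async () => {
  // launch a headless Chromium browser
  const browser = await puppeteer.launch({ headless: true });
  const page = await browser.newPage();
  // navigate to the site and capture the whole page as an image
  await page.goto('https://www.starnieuws.com/');
  await page.screenshot({ path: 'screenshot.png', fullPage: true });
  await browser.close();
})();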
What we'll be building
An MVP of a simple scraper that crawls a website and returns the data in JSON format.
Requirements
- Node.js; you can install it the traditional way by downloading the installer, or use NVM to manage multiple versions of Node.js.
- VSCode or any other text editor
Project setup
Initialize the project (git init, then npm init) and install puppeteer by running npm install puppeteer.
Open the project with VSCode. If you don't have the code command available on your PATH, please follow this guide.
Let's code
Alright, we're done setting up the project. Let's start coding! We'll use a local news website called Starnieuws as our example.
Basics
Basic usage of the puppeteer package looks like this:
const puppeteer = require('puppeteer');
const url = 'https://www.starnieuws.com/';

// this is an async function that calls itself (an IIFE), so there's no need to call it manually
(async () => {
  // launch a Chromium browser; set headless to true if you're deploying to production
  const browser = await puppeteer.launch({
    headless: false,
  });
  // create a new page object
  const page = await browser.newPage();
  // set viewport width and height
  await page.setViewport({
    width: 1920,
    height: 1080,
  });
  // navigate to the url
  await page.goto(url);
})();
Run it with node <filename>.js via your VSCode terminal. Voilà, a browser should open and you should see the website.
Catch errors (try catch)
We'll add a try catch block to catch any errors that might occur within the async function:
let browser = null;
try {
  // launch a Chromium browser (set headless to true in production)
  browser = await puppeteer.launch({
    headless: false,
  });
  // create a new page object
  const page = await browser.newPage();
  // set viewport width and height
  await page.setViewport({
    width: 1920,
    height: 1080,
  });
  await page.goto(url);
  // do something with the page ...
} catch (err) {
  console.log(`Error: ${err.message}`);
} finally {
  if (browser) {
    await browser.close();
  }
  console.log(`\nScraping ${url} done!`);
}
Scraping
Let's scrape something now. We want the title of each news article and its URL. Looking at the page source, we can see that every headline is a list item matching the selector .headlines_content > ul > li; the list sits under the .headlines_content class.
Scraping content in puppeteer is done with page.evaluate(); read more about evaluate() here.
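As a tiny sketch (assuming the page object from the code above), evaluate() runs a function inside the browser page and returns its serializable result back to Node:

  // runs in the browser context and returns the document title back to Node
  const pageTitle = await page.evaluate(() => document.title);
  console.log(pageTitle);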
Each element looks like this: <a href="https://www.starnieuws.com/index.php/welcome/index/nieuwsitem/71066">LVV: Stijgend waterpeil Nannizwamp wordt gemonitord</a>, so we want the href attribute and the text content. Add this snippet right after the await page.goto(url); line:
let data = await page.evaluate(() => {
  let results = [];
  let items = document.querySelectorAll('.headlines_content > ul > li');
  // iterate through the items and push the data to the results array
  items.forEach((item) => {
    results.push({
      title: item.querySelector('a').innerText,
      url: item.querySelector('a').href,
    });
  });
  return results;
});

// do something with the scraped data
console.log(data);
Complete source code
Your complete code should look like this
const puppeteer = require('puppeteer');
const url = 'https://www.starnieuws.com/';

(async () => {
  let browser = null;
  try {
    // launch a Chromium browser (set headless to true in production)
    browser = await puppeteer.launch({
      headless: false,
    });
    // create a new page object
    const page = await browser.newPage();
    // set viewport width and height
    await page.setViewport({
      width: 1920,
      height: 1080,
    });
    await page.goto(url);
    // extract the title and url of every headline on the page
    let data = await page.evaluate(() => {
      let results = [];
      let items = document.querySelectorAll('.headlines_content > ul > li');
      items.forEach((item) => {
        results.push({
          title: item.querySelector('a').innerText,
          url: item.querySelector('a').href,
        });
      });
      return results;
    });
    // do something with the scraped data
    console.log(data);
  } catch (err) {
    console.log(`Error: ${err.message}`);
  } finally {
    if (browser) {
      await browser.close();
    }
    console.log(`\nScraping ${url} done!`);
  }
})();
Run it with node app.js and you should see something like this:
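The output is an array of objects, one per headline; using the example element from the Scraping section, an entry looks roughly like this:

[
  {
    title: 'LVV: Stijgend waterpeil Nannizwamp wordt gemonitord',
    url: 'https://www.starnieuws.com/index.php/welcome/index/nieuwsitem/71066'
  },
  // ... one object per headline
]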
Storing your results
- As a file. We can write the output to a file with the fs module; read more about it here (a short sketch follows below).
- In a database (DB); you can use MongoDB, MySQL, PostgreSQL, etc.
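As a minimal sketch of the file option (the file name output.json is just an assumption), you could replace the console.log(data) call in the complete source code with:

  // at the top of the file
  const fs = require('fs');

  // instead of console.log(data): write the results to disk as pretty-printed JSON
  fs.writeFileSync('output.json', JSON.stringify(data, null, 2));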
Wrapping up
So, what have we learned?
- Setting up a git repo and initializing npm
- Getting started with Puppeteer
- Catching errors with try catch blocks
- Scraping data
Thank you so much for reading and following along. See you soon!