Web Scraping in Node.js: Top 7 Best Tools

Web scraping is a technique of using scripts to crawl web pages for you and return invaluable data. It’s a powerful way of obtaining large amounts of information that can be further processed to extract insights.

Before the web scraping era, people had to manually find and go through websites to copy-paste the information. It was hectic, time-consuming, and downright annoying.

In the JavaScript world, countless packages claim to be the ultimate web scraping solution. Because the options on the market are so diverse, I created a list of 7 tools that are worth discovering.

In this article, you’ll first go through the steps needed to start exploring each tool. Then, we’ll go over the 7 essential tools for web scraping in Node.js and their details. Along the way, you’ll get recommendations and suggestions for making a smart choice.

Let’s start!

Essential steps to follow to use one of the 7 web scraping tools
Best tool options for web scraping in Node.js
Axios
JSDom
Cheerio
Puppeteer
Playwright
Nightmare
X-RAY
Osmosis
Suggestions and recommendations

Essential steps to follow to use one of the 7 web scraping tools

Step 1: Node.js and NPM installation

First, we will need a server-side language since we request and parse HTML programmatically. We will use Node.js for this.

Node.js is an open-source JavaScript runtime that lets developers write command-line tools in JS. It is also used for server-side scripting, such as producing dynamic web page content before the page is sent to the user’s web browser.

To use any of the 7 tools, make sure you have up-to-date versions of Node.js (at least 12.0.0) and npm (Node Package Manager) installed on your machine. If you already have them, you can still follow the steps to check that they are installed correctly.

1. Navigate to the Node.js website and download the latest version (14.15.5 at the moment of writing this article). The installer also includes the npm package manager.

2. Once the download has finished, open your Downloads folder or browse the location where you saved the file and launch the installer.

3. After installing, you can check for the Node.js version by opening a terminal and typing the following command:

node -v

You can do the same for NPM:

npm -v

Once you have those installed, we can move to the next steps.
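If you would rather have a script check the version for you, here is a minimal sketch. The helper name meetsMinimum is made up for this example, and the 12.0.0 threshold is simply the minimum mentioned above.

```javascript
// Minimum Node.js version mentioned in this article.
const MINIMUM_NODE_VERSION = '12.0.0';

// Hypothetical helper: compares dotted version strings such as "14.15.5" and "12.0.0".
// Returns true when `version` is greater than or equal to `minimum`.
function meetsMinimum(version, minimum) {
  const current = version.split('.').map(Number);
  const required = minimum.split('.').map(Number);
  for (let i = 0; i < required.length; i++) {
    const part = current[i] || 0;
    if (part > required[i]) return true;
    if (part < required[i]) return false;
  }
  return true;
}

// process.versions.node holds the running version without the leading "v".
if (!meetsMinimum(process.versions.node, MINIMUM_NODE_VERSION)) {
  console.error(
    `Node.js ${MINIMUM_NODE_VERSION} or newer is required, found ${process.versions.node}`
  );
}
```

Running this at the top of a scraper gives a readable message instead of a confusing syntax error on an outdated runtime.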

Step 2: Explore the target

We’ll explore the target website and find selectors that would allow us to extract items and find out how to scrape data.

First of all, let’s look at the example.com website to analyze the target.

It is a straightforward website, but it does contain a title, a description, and a link. We want to extract these items by scraping the website with the tools presented below.

Step 3: Decode the URLs

URL stands for Uniform Resource Locator and represents the web address that you use to navigate to a particular website on the internet. A URL can be more than just an address. It can also contain parameters, passed to the server behind the website, that control the results returned.

While you navigate the website, pay close attention to the site’s URLs as they change. This is how you can learn more about the parameters passed to the server.
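To see which parameters a URL carries, you can also take it apart in code. The sketch below uses the built-in URL class; the search URL and its parameter names (q, page, sort) are hypothetical and only serve as an illustration.

```javascript
// Hypothetical search URL; the path and parameter names are made up for illustration.
const pageUrl = new URL('https://www.example.com/search?q=web+scraping&page=2&sort=newest');

// searchParams exposes every query parameter as a decoded key/value pair.
const params = Object.fromEntries(pageUrl.searchParams);

console.log(pageUrl.pathname); // "/search"
console.log(params.q);         // "web scraping"
console.log(params.page);      // "2" (query values are always strings)
```

Comparing these values between two pages of the same site usually shows which parameter controls pagination, sorting, or filtering.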

There are multiple ways to decode a URL, such as browser extensions, websites, or simply the JavaScript decodeURI() and encodeURI() functions, as shown in the example below:

const uri = 'https://mozilla.org/?x=шеллы';
const encoded = encodeURI(uri);
console.log(encoded);
// expected output: "https://mozilla.org/?x=%D1%88%D0%B5%D0%BB%D0%BB%D1%8B"

try {
  console.log(decodeURI(encoded));
  // expected output: "https://mozilla.org/?x=шеллы"
} catch (e) { // catches a malformed URI
  console.error(e);
}

If you know the parameters that the web servers are looking for, you don’t need to use the web page to submit requests to the web server. You can generate a URL and directly submit it to the server using the address bar of your browser or a script.
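Generating such a URL from a script could look like the sketch below. The buildSearchUrl helper, the /search path, and the parameter names are assumptions for illustration; a real site will use its own.

```javascript
// Hypothetical helper: appends query parameters to a base URL.
function buildSearchUrl(baseUrl, params) {
  const url = new URL(baseUrl);
  for (const [key, value] of Object.entries(params)) {
    // searchParams takes care of encoding spaces and special characters.
    url.searchParams.set(key, String(value));
  }
  return url.toString();
}

const nextPage = buildSearchUrl('https://www.example.com/search', {
  q: 'web scraping',
  page: 3,
});
console.log(nextPage); // https://www.example.com/search?q=web+scraping&page=3
```

Letting URLSearchParams do the encoding avoids subtle bugs that come from concatenating strings by hand.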

Step 4: Inspect using developer tools

Every website is different, and before writing code to parse the content you want, you need to take a look at the HTML of that page rendered by the browser.

Right-click the element you are interested in and choose Inspect to open it in Developer Tools.

Once you do that, you can see the HTML and CSS behind that particular element. Hovering your mouse over the HTML of a specific element highlights it on the page, so you can identify the corresponding visual element for that code.

You will often need to filter the content. You can do this by using CSS selectors, or if you want to be more precise, you can write functions that filter through the content of elements. Regular expressions (Regex) are also handy in many situations.
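As an illustration of filtering with a function and a regular expression, here is a small sketch. The product strings are invented sample data standing in for the text content of scraped elements, and extractPrices is a hypothetical helper.

```javascript
// Invented sample data, standing in for the text content of scraped elements.
const scrapedTexts = [
  'Wireless mouse - $24.99',
  'Free shipping on all orders',
  'Mechanical keyboard - $89.00',
];

// Matches a dollar amount such as "$24.99" and captures the numeric part.
const PRICE_PATTERN = /\$(\d+(?:\.\d{2})?)/;

// Hypothetical helper: keeps only the texts that contain a price
// and converts the captured amount to a number.
function extractPrices(texts) {
  return texts
    .map((text) => text.match(PRICE_PATTERN))
    .filter((match) => match !== null)
    .map((match) => Number(match[1]));
}

console.log(extractPrices(scrapedTexts)); // [ 24.99, 89 ]
```

The same idea applies to dates, phone numbers, or any other value that follows a recognizable pattern.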

Getting the right data out of the website requires a bit of creativity combined with some pattern recognition.

Best tool options for web scraping in Node.js

Axios

Axios is a robust promise-based HTTP client for both the browser and Node.js. It is a well-known package used in tons of projects. It lets you make HTTP requests from Node.js using promises and download data from the internet quickly and easily.

With Axios, there is no need to call a .json() method on the result of the HTTP request. Axios parses JSON responses for you and exposes the result on the response’s data property. Furthermore, by default any response with a status code outside the 2xx range rejects the promise, so it triggers the .catch() block right out of the box.

We need to request the HTML of the page, and Axios will help us with that. To install it, open a terminal in the root of your project folder and run the following command:

npm install axios
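Once it is installed, a request with basic error handling might look like the sketch below. The fetchHtml name and the 10-second timeout are choices made for this example, and the client is passed in as an optional argument only so the function can be tried out without network access.

```javascript
// Hypothetical helper: downloads a page and returns its body.
// `client` defaults to axios (assumes `npm install axios` has been run).
async function fetchHtml(url, client = require('axios')) {
  try {
    // The 10 second timeout is an arbitrary value chosen for this sketch.
    const response = await client.get(url, { timeout: 10000 });
    // For HTML pages, response.data is the page source as a string.
    return response.data;
  } catch (error) {
    // error.response is set when the server answered with a non-2xx status.
    if (error.response) {
      throw new Error(`Request to ${url} failed with status ${error.response.status}`);
    }
    // Otherwise the request never got a response (DNS failure, timeout, ...).
    throw new Error(`Request to ${url} failed: ${error.message}`);
  }
}

// Usage:
// fetchHtml('https://www.example.com').then((html) => console.log(html.length));
```

The returned string is what the parsing tools in the next sections take as input.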

JSDom

JSDom, used with Node.js, is a pure-JavaScript implementation of several web standards. It’s super handy for web scraping and application testing.

const axios = require('axios');
const jsdom = require('jsdom');
const { JSDOM } = jsdom;

(async () => {
  const html = await axios.get('https://www.example.com');
  const dom = new JSDOM(html.data);

  const title = dom.window.document.querySelector('h1');

  if (title) {
    console.log(title.textContent);
  }
})();

JSDom is the closest thing to a headless browser, meaning that it gives a very accurate representation of what’s actually on the page while remaining lean and quick.

Its most powerful ability is that it can execute scripts inside the page, meaning that these scripts can modify the page’s content. Script execution is disabled by default, so it has to be enabled with the runScripts option, which should only be used with content you trust:

const dom = new JSDOM(`<body>
  <script>document.body.appendChild(document.createElement("hr"));</script>
</body>`, { runScripts: "dangerously" });

Other desirable features include setting timers, injecting user actions, accessing logs from the console, and many more. It can perform the same operations as PhantomJS (a headless browser) in half the time.

Unfortunately, JSDom does not have WebSockets support, so you can’t return content loaded through WebSockets.

Cheerio

Cheerio is similar to JSDom, but it was designed to be more lightweight. It is like a server-side version of jQuery, providing an API that many developers are familiar with.

const axios = require('axios');
const cheerio = require('cheerio');

(async () => {
  const html = await axios.get('https://www.example.com/');
  // cheerio.load() is synchronous and returns a jQuery-like selector function
  const $ = cheerio.load(html.data);
  const data = [];
  $('body').each((i, elem) => {
    data.push({
      title: $(elem).find('h1').text(),
      paragraph: $(elem).find('p').text(),
      link: $(elem).find('a').attr('href')
    });
  });
  console.log(data);
})();

Parsing, manipulating, and rendering are incredibly efficient with Cheerio because it works with a simple, consistent DOM model. Cheerio can parse nearly any HTML or XML document.

Unlike JSDom, Cheerio doesn’t produce a visual rendering, apply CSS, load external resources, or execute JavaScript.

Puppeteer

Puppeteer provides a high-level API to control Chrome or Chromium. The tool was designed by Google and runs headless by default, but it can be configured to run full (non-headless) Chrome or Chromium.

Using Puppeteer, you can do most of the things you can do manually in your browser. That includes generating screenshots and PDFs of pages, UI testing, automating form submission, web scraping, etc.

Example: navigating to example.com and saving a screenshot as example.png

const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://example.com');
  await page.screenshot({ path: 'example.png' });

  await browser.close();
})();

People using other browser testing frameworks will find Puppeteer very familiar. You create an instance of a browser, open pages, and then manipulate them with Puppeteer’s API.
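For scraping rather than screenshots, the same launch, open, and manipulate flow can return text from the rendered page. This is a minimal sketch: scrapeText is a hypothetical helper, and the puppeteer module is an optional argument only so the function can be exercised without launching a real browser.

```javascript
// Hypothetical helper: returns the trimmed text of the first element matching `selector`.
// `puppeteer` defaults to the real package (assumes `npm install puppeteer`).
async function scrapeText(url, selector, puppeteer = require('puppeteer')) {
  const browser = await puppeteer.launch();
  try {
    const page = await browser.newPage();
    // Wait until network activity has mostly settled, so content
    // rendered by client-side JavaScript has a chance to appear.
    await page.goto(url, { waitUntil: 'networkidle2' });
    // $eval runs the callback inside the page against the first match
    // and throws if nothing matches the selector.
    return await page.$eval(selector, (element) => element.textContent.trim());
  } finally {
    // Always close the browser, even when navigation or extraction fails.
    await browser.close();
  }
}

// Usage:
// scrapeText('https://example.com', 'h1').then(console.log);
```

Closing the browser in a finally block matters in long-running scrapers, where leaked browser processes quickly add up.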

Playwright

Playwright is a Node.js library built to work with Chromium, Firefox, and WebKit (the engine behind Safari) through a single API. It is designed to enable cross-browser web automation.

While working with Playwright, you need to declare which browser you are using explicitly. The code below navigates to https://example.com in Firefox and executes a script in the page context.

const { firefox } = require('playwright');

(async () => {
  const browser = await firefox.launch();
  const context = await browser.newContext();
  const page = await context.newPage();
  await page.goto('https://www.example.com/');
  const dimensions = await page.evaluate(() => {
    return {
      width: document.documentElement.clientWidth,
      height: document.documentElement.clientHeight,
      deviceScaleFactor: window.devicePixelRatio
    };
  });
  console.log(dimensions);

  await browser.close();
})();

Playwright is very similar to Puppeteer. In fact, it was created by developers who previously worked on Puppeteer.
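To illustrate the single API across browsers, the sketch below runs the same steps in all three engines. titlesAcrossBrowsers is a hypothetical helper, and the playwright module is an optional argument only so the function can be tried without installed browsers.

```javascript
// Hypothetical helper: loads `url` in each engine and collects the page title.
// `playwright` defaults to the real package (assumes `npm install playwright`).
async function titlesAcrossBrowsers(url, playwright = require('playwright')) {
  const titles = {};
  for (const browserName of ['chromium', 'firefox', 'webkit']) {
    // The same calls work for every engine; only the launcher differs.
    const browser = await playwright[browserName].launch();
    try {
      const page = await browser.newPage();
      await page.goto(url);
      titles[browserName] = await page.title();
    } finally {
      await browser.close();
    }
  }
  return titles;
}

// Usage:
// titlesAcrossBrowsers('https://www.example.com/').then(console.log);
```

Running the same scraper in more than one engine can reveal pages that render differently depending on the browser.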

Nightmare

Nightmare is a high-level browser automation library for Node.js that exposes simple methods that imitate user behavior, such as goto, type, and click.

const Nightmare = require('nightmare')
const nightmare = Nightmare({ show: true })

nightmare
  .goto('https://duckduckgo.com')
  .type('#search_form_input_homepage', 'github nightmare')
  .click('#search_button_homepage')
  .wait('#r1-0 a.result__a')
  .evaluate(() => document.querySelector('#r1-0 a.result__a').href)
  .end()
  .then(console.log)
  .catch(error => {
    console.error('Search failed:', error)
  })

It uses Electron, which is a faster and more modern alternative to PhantomJS.

Initially, Nightmare was created to automate tasks across sites that don’t have APIs. However, it is most often used for testing and web scraping.

X-RAY

X-ray provides a composable API, supporting pagination, concurrency, throttles, delays, pluggable drivers (currently supports HTTP and PhantomJS Driver), and many more, giving you great flexibility.

You can extract data in any way you choose since the schema is not tied to the scraped page’s structure.

Here are some examples:

Scrape a single tag

const Xray = require('x-ray');
const xray = Xray();

xray('http://google.com', 'title')(function(err, title) {
  console.log(title) // Google
})

Scrape an attribute

xray('http://techcrunch.com', 'img.logo@src')(fn)

Scrape innerHTML

xray('http://news.ycombinator.com', 'body@html')(fn)

The flow is predictable, following a breadth-first crawl through each of the pages.
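To make the idea of delays between requests more concrete, here is a library-agnostic sketch. It is not X-ray’s implementation or API: crawlSequentially and sleep are hypothetical helpers, and the one-second default delay is an arbitrary choice.

```javascript
// Resolves after `ms` milliseconds.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Hypothetical helper: visits the URLs one at a time, in order,
// pausing `delayMs` between consecutive requests.
async function crawlSequentially(urls, handler, delayMs = 1000) {
  const results = [];
  for (let i = 0; i < urls.length; i++) {
    // No need to wait before the very first request.
    if (i > 0) await sleep(delayMs);
    results.push(await handler(urls[i]));
  }
  return results;
}

// Usage, with any function that takes a URL and returns a promise:
// crawlSequentially(['https://www.example.com/'], (url) => fetchPage(url), 2000);
```

Spacing out requests like this keeps the load on the target site low and makes IP blocks less likely.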

Osmosis

Osmosis is an HTML/XML parser written in Node.js, packed with CSS3/XPath selectors and a lightweight HTTP wrapper. If you’re familiar with jQuery, Osmosis will feel somewhat similar.

Features like loading and searching AJAX content, embedded and remote scripts, redirects, cookies, form submission, single or multiple proxies, basic authentication, and many more are included in the Osmosis library.

const osmosis = require('osmosis');

osmosis
  .get('https://www.example.com/')
  .set({
    title: 'h1',
    description: 'p',
    link: 'a@href'
  })
  .data(item => console.log(item))

Osmosis is lightweight and easy to use. The main advantage is that it has no large dependencies like jQuery, Cheerio, or JSDom.

Suggestions and recommendations

Now that we’ve gone through all seven web scraping tools in Node.js, let’s get back to the biggest question: which tool is the most suitable for you?

Let’s be honest. Only you can make that decision. Still, here is some advice. If you need something lightweight and easy to use with some basic functionality, you could go for Cheerio, Osmosis, or X-RAY.

If you have something more complex in mind, you could go for JSDom, Puppeteer/Playwright or Nightmare, since they have more features and provide high-level API control. As mentioned above, JSDom is the closest to a headless browser.

It is also essential to know that there are always alternatives for what you are looking for. For example, if you don’t have the time to build a web scraper by yourself, and simply want to make your life easier, a great alternative could be a premade web scraping API. These are tools that carry out the heavy lifting for you and bring you closer to web data.

WebScrapingAPI is a good example of this alternative.

WebScrapingAPI is a handy web scraping tool and, as the name suggests, an API that allows you to scrape any online source. You don’t have to download, install, or set it up, and it comes with lots of benefits: it is easy to use, reliable, and customizable on request.

It uses 100M+ rotating proxies to increase reliability and avoid IP blocks. Thus, you won’t have to deal with CAPTCHAs, proxies, or IP rotations, because WebScrapingAPI handles these blockers on the backend. Other desirable features include JavaScript rendering, mass crawling operations, unlimited bandwidth, global geotargeting, customization on request, a speed-obsessed architecture, and much more.

Moreover, the tool follows a freemium model, so you can test it at any time with 1000 free API calls. If you are curious, you can always find out more here.

webscrapingapi.com

Final thoughts

I hope the list of tools in this article was helpful. Take advantage of this information for your own projects or business. If you are curious and want to learn more about web scraping, essential tools, and everything in between, here are some articles that I enjoyed reading:

CEO & Co-Founder @Knoxon, Full Stack Developer