The Easiest Way to Build a Web Scraper Using JavaScript and NodeJS

As you probably already know, the amount of data on the web grows every day. Finding something on the Internet, though, is pretty easy: open up a browser window, google what you need, and that’s pretty much it!

But what do you do when you need to collect some data from a website?

Just as you thought that gathering the data was the easy step, you hit a wall. Even the biggest search engines join the other side: when a website detects that you’re scraping it without permission, it restricts your access. Some sites will even block you entirely based on your physical location, if your requests come from regions they consider untrustworthy.

So, in the following article, I will help you build your own web scraper using NodeJS without getting blocked. But before we get straight into the subject, let’s find out more about web scraping.

What is web scraping?
Why use JavaScript?
Creating your web scraper
1. Choose the page you want to scrape
2. Inspect the code of the website
3. Write the code
4. Run the code
5. Store your extracted data
Extracting data on your own has never been simpler

What is web scraping?

A web scraper is a tool that automates the process of gathering a website’s data. Without one, you have to request the page yourself, inspect the HTML, and break it down by hand to get the data you need.

For those of you who don’t already know what a web scraper can be used for, I’m going to mention some of the main use cases below:

  • Price comparison
  • Academic research
  • Market analysis
  • Lead generation
  • Collecting training and testing datasets for Machine Learning

Let me give you a more practical example: using web scraping technology, a company called Brisk Voyage helps its users save up to 80% on last-minute weekend trips.

They manage to do this by constantly checking flight and hotel prices, and right at the moment their tool finds a trip that’s a low-priced outlier, the user gets an email with the booking instructions. Pretty neat, right?

Why use JavaScript?

JavaScript is one of the Internet’s most used and easy-to-learn programming languages. It helps developers add complex features to their website like displaying dynamic content, interactive maps, scrolling video jukeboxes, etc. Every time a website does more than present some static information, it most probably uses JavaScript.

In the following section of this article, I will help you write your own web scraper application using NodeJS. This lean, fast, cross-platform JavaScript runtime is useful for both servers and desktop applications, and it’s popular because it lets you build and run network applications in just a couple of minutes.

Creating your web scraper

First of all, please make sure you have all the tools you need for the following process:

  • Chrome (or any other browser, for that matter). You can download it here.
  • VSCode (or some other code editor). You can download it here.
  • NodeJS and npm. The easiest way to install NodeJS and npm is to download one of the official NodeJS installers and run it. After the installation finishes, you can verify that everything went well by running node -v and npm -v in a new terminal window. If you’re having issues with this process, you can check out these instructions.

Now create a new folder for this project, open a new terminal window, navigate to the newly created folder, and run npm init -y.

  • Axios. Run npm install axios in the newly created folder.
  • Cheerio. Same as before, run npm install cheerio in the project’s folder.

Please keep in mind, if you have chosen a Single Page Application to scrape, things might be a little more complex, and the tools I have selected for this tutorial might not work. Take a look at Puppeteer if this happens to you.
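If you do end up needing Puppeteer, here is a minimal sketch of how it can be used. It assumes you have run npm install puppeteer, and the fetchRenderedTitles name and the selector are just placeholders of my own. Puppeteer drives a real headless Chrome instance, so content rendered by JavaScript is available once the page has loaded:

const puppeteer = require("puppeteer");

// A minimal sketch, assuming `npm install puppeteer` has been run.
const fetchRenderedTitles = async (url) => {
  // Launch a headless Chrome instance and open a new tab
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  // Wait until the page (and most of its network activity) has settled
  await page.goto(url, { waitUntil: "networkidle2" });

  // Grab the text of every element matching a selector of your choice
  const titles = await page.$$eval("p.title > a", (links) =>
    links.map((link) => link.textContent)
  );

  await browser.close();
  return titles;
};

fetchRenderedTitles("https://old.reddit.com/r/movies/").then((titles) =>
  console.log(titles)
);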

Now, let’s begin building the scraper:

First, you need to access the website you want to scrape using Chrome or any other web browser. To successfully scrape the data, you have to understand the website’s structure. For the following steps, I chose to scrape the information on the /r/movies subreddit.

After you’ve accessed the website, try to imagine what a regular user would do: check out the posts on the main page by clicking on them, read the comments, upvote or downvote a post based on your preferences, or sort the posts by day, week, month, or year.

Let’s try to understand how the information is structured on any subreddit. This will also give you a clearer idea of the data you’re about to extract.

Chrome Dev Tools gives you a way of exploring the Document Object Model of the website you are trying to scrape. Just right-click anywhere on the page and select the “Inspect” option.

A new panel will open. Select the “Elements” tab to see the interactive HTML structure of the website.

You can interact with the website through this panel: hover over elements in the HTML structure to highlight them on the page, expand and collapse nodes, edit them, or even delete them. Note that these changes are only visible to you.

In order to keep things simple, I’m going to only collect the posts’ titles. Let’s get to work!

Let’s create a new file called index.js and type or just copy the following lines:

const axios = require("axios");
const cheerio = require("cheerio");

const fetchTitles = async () => {
  try {
    // Request the subreddit's main page
    const response = await axios.get('https://old.reddit.com/r/movies/');

    const html = response.data;

    // Load the HTML into Cheerio so we can query it with CSS selectors
    const $ = cheerio.load(html);

    const titles = [];

    // Each post title is an anchor inside a <p class="title"> element
    $('div > p.title > a').each((_idx, el) => {
      const title = $(el).text();
      titles.push(title);
    });

    return titles;
  } catch (error) {
    throw error;
  }
};

fetchTitles().then((titles) => console.log(titles));

To better understand the code written above, I’m going to explain what the main asynchronous function does:

First, I make a GET request to the old Reddit website using the previously installed library, Axios. The result of that request is then loaded into Cheerio. Using the Chrome Dev Tools, I found that the elements containing the desired information are a couple of anchor tags. To be sure that I only select the anchor tags that contain a post’s title, I also select their parents, using the following selector: $('div > p.title > a').

To get each title individually, and not just one big chunk of text that makes no sense, I loop through each post using the each() function. Finally, calling text() on each item returns the title of that specific post.

To run it, just type node index.js in the terminal and hit enter. You should see an array containing all the titles of the posts.
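If you later want more than just the title, Cheerio can also read an element’s attributes. Here is a small, hypothetical variation of the script above (the fetchPosts name and the object shape are my own) that collects each post’s link alongside its title using attr():

const axios = require("axios");
const cheerio = require("cheerio");

// Hypothetical variation of fetchTitles: collect each post's link as well.
const fetchPosts = async () => {
  const response = await axios.get('https://old.reddit.com/r/movies/');
  const $ = cheerio.load(response.data);

  const posts = [];

  $('div > p.title > a').each((_idx, el) => {
    posts.push({
      title: $(el).text(),
      // attr() reads an attribute of the matched element, here the anchor's href
      url: $(el).attr('href'),
    });
  });

  return posts;
};

fetchPosts().then((posts) => console.log(posts));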

Depending on what you’re going to use the scraped data for, you might want to store it in a CSV file, a database, or just keep it in a plain old array. Let me show you how you can store the titles in a new CSV file.

In the same index.js file you wrote before, replace the last line of code with the following (fs is Node’s built-in file system module, so there is nothing extra to install):

const fs = require("fs");

fetchTitles().then((titles) => {
  // Put each title on its own line to build the CSV content
  const csv = titles.join("\n");

  // Write the result to a file called titles.csv in the project folder
  fs.writeFileSync("titles.csv", csv);
});

First, the titles array is joined into a single string, with each title on its own line. Then Node’s built-in fs module writes that string to a file called titles.csv in the project folder. Run node index.js again and you should find the file right next to your script. Keep in mind this is the simplest possible CSV, a single column with no escaping; if your data can contain commas or quotes, a dedicated CSV library is a safer choice.

Extracting data on your own has never been simpler

That’s how you scrape all the information you need from a single web page using JavaScript, NodeJS, and Cheerio. I hope this article made the process a little bit more bearable.

As you can see, scraping a single web page is not the most fun way to spend your time on the Internet. Besides being time-consuming, you have to repeat the process for every additional page you need to scrape.

Now, if you need to do it at scale, you’ll need slightly more advanced tooling. From my personal experience, using a web scraping API can save you a lot of time. There are plenty of tools out there, but you can start by checking out WebScrapingAPI to see if it fits your requirements.

Thank you for sticking till the end & have an awesome day!

CEO & Co-Founder @Knoxon, Full Stack Developer
