How to Collect Data for Machine Learning in 6 Steps via Web Scraping

Dan Suciu
7 min read · Apr 5, 2021

Machine learning algorithms are powerful tools for analyzing large amounts of data. You can use them to detect fraud, predict stock prices, and even generate accurate medical diagnoses.

The problem is that these algorithms require vast quantities of training data to function correctly. And how do you get that data when online sources don't exist or aren't enough? Collecting it by hand could take hours and hours of recording, logging, and transcribing, with plenty of errors along the way. There has to be a better way.

This is where web scraping shines: developers who need more training data than they have access to can use a web scraping tool to extract the right kind of information from publicly available websites.

By the end of this article, you should have a firm idea of how to use web scraping for machine learning training purposes and how web scraping can help you save time (and effort) in training your models.

Why you need training data
How web scraping helps ML developers
Gather training data for your ML model with WebScrapingAPI
Finding good sources of data
Inspecting the source code
Register to WebScrapingAPI
Scrape the HTML
Extract the data
Feeding the data to your machine learning algorithm
The many uses of web scraping

Why you need training data

If you’re familiar with machine learning (ML), you’re also aware of the value of training data in machine learning projects. The learning algorithm receives a collection of training data to learn from and train the ML model.

We pass data to machines, teach them to think logically using algorithms (also referred to as models), and then let them apply what they have learned to new data sets. The more data you pass to your model, the higher its performance and accuracy.

For example, an ML algorithm that predicts whether a website's content is suitable for children will receive training data containing both acceptable and unacceptable material (say, in an 80/20 split). Training on this data gives us a model that can determine whether new material is age-appropriate.
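To make that concrete, here is a minimal sketch of such a classifier, assuming scikit-learn and a tiny, made-up set of labelled page texts (none of this comes from a real dataset):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy, made-up training data: roughly 80% acceptable (1) and 20% unacceptable (0)
texts = [
    "fun cartoons and games for kids",
    "homework help and study tips",
    "science experiments for children",
    "colouring pages to print at home",
    "graphic violence and explicit gore",
]
labels = [1, 1, 1, 1, 0]

# Turn the raw text into word-count features and fit a simple classifier
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)
model = MultinomialNB().fit(X, labels)

# The trained model can now label new, unseen material
print(model.predict(vectorizer.transform(["free games for children"])))

The more (and more varied) labelled examples you feed it, the better it generalizes, which is exactly why collecting data matters.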

How web scraping helps ML developers

In the realm of machine learning, web scraping is an active area of research.

So, what exactly does web scraping mean? To put it simply, web scraping refers to programs that extract data from HTML and present it in a structured form that statistical and linguistic analysis software can use without further manipulation.

You can usually find free, downloadable datasets of various sizes built specifically for training purposes. However, these datasets don't fit every ML model. If you need images, videos, or articles, for instance, you have to find and download them yourself. Given how much data an ML application needs, that quickly becomes a tedious, time-consuming chore.

Web scraping, by definition, means extracting large quantities of data in a short amount of time. It is, and will remain, a crucial skill as more ML applications rely on it for training data.

Gather training data for your ML model with WebScrapingAPI

Let's put all of this theory into practice.

For the scraping part, I will use WebScrapingAPI, as it is a simple yet powerful API designed to extract data from any website. Still, you are free to choose whatever other tool you are comfortable with.

I will scrape the comments of this Reddit post about worldwide news and use them as training data for a sentiment analysis algorithm. In the end, you will see a file containing each comment and the emotion it evokes (positive, negative, or neutral).

For this tutorial, you need to install Python 3 (because we're talking about machine learning, after all), as well as the following libraries:

  • requests: to send the HTTP request to the API;
  • beautifulsoup4 (imported as bs4): to parse and extract specific sections from the HTML document;
  • nltk: to process the natural language.
pip install requests beautifulsoup4 nltk

Finding good sources of data

For a sentiment analysis algorithm, you can easily find online datasets of reviews, tweets, etc., with the purpose of training your model. However, there is never too much training for an ML algorithm, especially when it comes to new data.

Reddit is an excellent place to start gathering this information, as users often share their views about various topics.

Inspecting the source code

Of course, the first thing you need to do is to take a look at the HTML code and see how you can extract the post comments.

You may notice right away a lot of auto-generated class names that are likely to change in the future, so they won't help us much.

Luckily, one attribute caught our attention: data-test-id. It exists for testing purposes, so we can be reasonably confident it won't change in the future.

Besides that, we can also see that each comment is made up of multiple <p> tags wrapped in a <div>.
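To make that structure concrete, here is a small, self-contained sketch using made-up markup that only mimics the layout described above; it shows how the data-test-id attribute can be selected with BeautifulSoup:

from bs4 import BeautifulSoup

# Simplified, made-up markup that mimics the comment structure described above
html = """
<div data-test-id="comment">
  <p>First paragraph of a comment.</p>
  <p>Second paragraph of the same comment.</p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
for div in soup.select('[data-test-id="comment"]'):
    print(" ".join(p.text for p in div.find_all("p")))

This is exactly the selector and paragraph-joining logic we will use on the real page later on.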

Register to WebScrapingAPI

Now we need to configure our scraping API. Skip to the next step if you already have an account or another scraping tool on hand.

Otherwise, sign up for a free account so you can receive an API key. Copy its value, and you are ready to start writing the code.
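One optional precaution: instead of hardcoding the key in your script, you can read it from an environment variable. A minimal sketch, assuming you export a variable named WSA_API_KEY (the name is just an example):

import os

# Read the key from the environment instead of pasting it into the source
API_KEY = os.environ.get("WSA_API_KEY", "YOUR_API_KEY")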

Scrape the HTML

Let’s extract the previous HTML document, the one that you saw in your browser. First, we import the libraries that we need:

import requests
from bs4 import BeautifulSoup

Then, we specify the parameters, which will customize the scraping process according to our needs. One of the benefits of using an API is that you can access its documentation, which describes each parameter and its role.

Remember to replace the value of the API key!

endpoint = "https://api.webscrapingapi.com/v1"

params = {
    "api_key": "YOUR_API_KEY",
    # Pass the target address as a plain URL; requests will percent-encode it for us
    "url": "https://www.reddit.com/r/europe/comments/mewexr/what_happened_in_your_country_this_week_20210328/",
    "device": "desktop",
    "proxy_type": "datacenter",
    "wait_until": "domcontentloaded",
    "render_js": "1"
}

In the end, we make an HTTP request and slightly process its result:

  • extract only the <body> content;
  • remove all the <script> and <style> tags.

This way, we can focus only on consistent HTML.

page = requests.request("GET", endpoint, params=params)
page_soup = BeautifulSoup(page.content, 'html.parser')
body_soup = page_soup.find('body')

# Drop the <script> and <style> tags so only the visible content remains
for s in body_soup.select('script'):
    s.extract()

for s in body_soup.select('style'):
    s.extract()

Extract the data

Now that we have the HTML content that interests us, we will retrieve the post’s comments by searching for the previous elements with a data-test-id attribute.

comments_file = open("comments.txt", "w", encoding="utf-8")

for comment in body_soup.select('[data-test-id="comment"]'):
    # Join all of the comment's paragraphs into a single piece of text
    paragraphs = comment.find_all('p')
    comment_text = ' '.join(p.text for p in paragraphs)
    # Collapse extra whitespace, then write one comment per line
    clean_comment = ' '.join(comment_text.split())
    comments_file.write(clean_comment + '\n\n')

comments_file.close()

From these elements’ content, we extract only the paragraphs (<p> tags) and apply some simple preprocessing to remove extra spaces and tabs. The final result is the comment’s body, which we will record in a text file for further analysis.

Feeding the data to your machine learning algorithm

With the generated dataset, we can now train our machine learning algorithm. I used nltk's pre-built sentiment analyzer for simplicity, but normally this is the point where you would build your own machine learning model.

We have to import the rest of the libraries:

import nltk
from nltk.sentiment import SentimentIntensityAnalyzer
import csv

nltk.download('vader_lexicon')  # the analyzer needs the VADER lexicon, downloaded once

Then we open and read the previous file which contains the post’s comments:

comments_file = open("comments.txt", mode="r", encoding="utf-8")
comments = comments_file.readlines()

And finally, we use the sentiment analyzer to compute the score for each comment. It has four values, but only three are relevant: negative, neutral, and positive.
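Before wiring it into the loop, you can get a feel for the analyzer's output with a quick standalone check (the exact numbers will vary; only the dictionary keys are fixed):

from nltk.sentiment import SentimentIntensityAnalyzer

sia = SentimentIntensityAnalyzer()
# The result is a dict with 'neg', 'neu', 'pos' and 'compound' keys
print(sia.polarity_scores("The new measures were a pleasant surprise."))

With that in mind, here is the full loop over the scraped comments: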

train_file = open("traindata.csv", mode="a", encoding="utf-8")
train_writer = csv.writer(train_file, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
sia = SentimentIntensityAnalyzer()

for comment in comments:
    if comment != '\n':
        clean_comment = ' '.join(comment.split())
        # polarity_scores returns 'neg', 'neu', 'pos' (and 'compound') values
        score = sia.polarity_scores(clean_comment)
        train_writer.writerow([clean_comment, score.get('neg'), score.get('neu'), score.get('pos')])

train_file.close()

We store each comment and its score values in a CSV file for any possible future processing or review.

We used this script to determine how Redditors feel about their respective country’s situation at a given moment. The majority of the comments came out neutral, while the negative ones outnumbered the positives. A rather depressing result but hey, at least the code works.
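If you want to reproduce that kind of summary yourself, one simple (hypothetical) approach is to label each comment in traindata.csv by its highest score and count the labels:

import csv
from collections import Counter

counts = Counter()
with open("traindata.csv", encoding="utf-8", newline="") as f:
    for row in csv.reader(f):
        if len(row) < 4:
            continue  # skip any blank or malformed rows
        text, neg, neu, pos = row[0], row[1], row[2], row[3]
        # Label each comment by whichever of the three scores is highest
        scores = {"negative": float(neg), "neutral": float(neu), "positive": float(pos)}
        counts[max(scores, key=scores.get)] += 1

print(counts)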

The many uses of web scraping

Web scraping is a great way for developers to gather large quantities of training data for their machine learning algorithms. It’s simple to do, quick to finish, and inexpensive.

Do you want to hear another cool thing? Web scraping's applications are far from limited to machine learning. As the approach grows in popularity, its uses are spreading through every sector that needs large volumes of data in a limited time.

If you want to see more applied examples, here’s an analysis of five different APIs and their ability to scrape Amazon for product data.
