The Complete Manual to Legal & Ethical Web Scraping in 2021

Dan Suciu
7 min read · Apr 22, 2021

Web scraping is growing in popularity so quickly that you are almost guaranteed to get conflicting answers when you ask the big question: Is it legal? 👀

If you are exploring the Internet to find a legitimate answer that best fits your needs, you’ve come to the right place. This article aims to outline the legal concerns you must be aware of when scraping and also offer insights on how to minimize risks.

Spoiler alert: there is no single, definitive answer to whether web scraping is legal. It depends on many factors, and some of them vary with each country’s laws and regulations.

But first, let’s briefly define what web scraping is for those unfamiliar with the concept before diving into the legalities.

Short saga of web scraping

Web scraping is the automated gathering and organizing of publicly available information on the Internet. The result is usually structured data stored in a table, such as an Excel spreadsheet, that presents the extracted data in a “readable” format.

This practice requires a software agent that automatically downloads the desired information by imitating the way your browser interacts with a website. This “robot” can access multiple pages simultaneously, saving you the hassle of wasting precious time copy-pasting data.

To accomplish that, the web scraper sends many more requests per second than a human ever could. For that reason, your scraping engine needs to remain anonymous to avoid being detected and blocked. If you want to read more about how to prevent being left out of the data party, I recommend reading this article before choosing a web scraping provider.

So, now that we have a big picture of what a web scraping tool can do, let’s discover how to use it and still sleep peacefully at night.

Is the process of web scraping illegal?

Using a web scraper to harvest data off the Internet is not a criminal act on its own. Many times, it is absolutely legal to scrape a website, but the way you intend to use that data may be illegal.

The legality of the process depends on several factors, which vary with the particular situation:

  • The kind of data you are scraping
  • What you want to do with the scraped data
  • How you collected the data from the website

Let’s talk about specific types of data and how to gracefully handle them.

Data such as rainfall or temperature measurements, demographic statistics, prices, and ratings may seem perfectly legal to scrape because they are neither covered by copyright nor personal data. But if the website that hosts the information forbids scraping in its terms and conditions, you may still find yourself in trouble.

So let’s dive into each of the two types of sensitive data to better understand how to scrape smart:

  1. Personal Data
  2. Copyrighted Data

Personal Data

Any data that could be used to identify a specific individual is considered personal data (or PII, personally identifiable information, in more technical terms).

One of the hottest topics in today’s business world is the General Data Protection Regulation (GDPR), the legislation that sets the rules for gathering and processing the personal data of individuals in the European Union (EU).

As a rule of thumb, you need a lawful reason to obtain, store, and use personal data, and if you don’t have the user’s consent, you have to rely on one of the other lawful reasons the GDPR recognizes.

The vast majority of the time, companies use web scraping to gather data for lead generation, sales intelligence, and similar purposes. Those purposes usually don’t fit any of the other lawful reasons, such as Official Authority, which lets you access personal data without consent when it is a matter of public interest.

Keep in mind: from a legal point of view, you are more likely to scrape safely if you stay away from extracting personal data, especially when it belongs to people in the EU or California.

Copyrighted data

Data is king. And every king has guards on duty to protect him. And one of the most unmerciful soldiers in this scenario is Copyright. This one prohibits you from scraping, storing and/or reproducing data without the author’s blessing.

Much like the case of copyrighted photographs and music, the fact that data is publicly accessible on the Internet does not automatically make it legal to scrape without the owner’s permission. Businesses and people who own copyrighted data have specific control over how it is captured and reused.

Creative works, like the photographs and music mentioned above, are usually strongly protected under copyright law.

Here’s an observation that could save the day: in general, the legal risk with copyrighted data comes from reusing or publishing it, not from scraping it for your own analysis.

Oh, before we forget…

Do you remember that box you have to check every time you create an account? Because the box remembers you. And if somehow you manage to scrape a website that clearly forbids using automated engines to access their content, you can get in trouble.

Terms of service are the legal agreement between a service provider (a website) and the person who uses that service (to access its information). Hence, users must accept the terms and conditions if they want to use the website.

Data Scraping is something that has to be done responsibly. So it’s better for you to review the Terms and Conditions before scraping a website.

How to make sure your scraping remains legal and ethical

1. Check the Robots.txt file

Back in the olden days, when the Internet was just learning its first words, developers had already discovered a way to scrape, crawl and index newborn pages.

The little fellas that carry out such operations are nicknamed “robots” or “spiders”, and they would occasionally stray into websites that weren’t meant to be scraped or indexed. Martijn Koster, the creator of Aliweb, which is often regarded as the web’s first search engine, proposed a solution: a set of rules that each robot should obey.

To ground the definition: robots.txt is a text file in the root directory of a website that tells web robots which pages they may crawl.

So, in order to scrape harmoniously, you must carefully check and follow the rules in robots.txt. There is a little trick that can help you take a peek behind a website’s curtain: add /robots.txt to the end of the site’s root URL (https://www.example.com/robots.txt).

If the Terms of Service or the robots.txt file clearly prohibit content scraping, you should get written permission from the website owner before you start harvesting their data.
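
As a minimal sketch of that check, Python’s standard library ships a robots.txt parser. The rules in SAMPLE_ROBOTS_TXT, the “my-scraper” user agent, and the helper names below are made up for illustration; a real file would come from the site you intend to scrape.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used here so the example runs offline.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text into a parser that can answer allow/deny questions."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def is_allowed(parser: RobotFileParser, user_agent: str, url: str) -> bool:
    """Return True if the rules let this user agent fetch the URL."""
    return parser.can_fetch(user_agent, url)


parser = build_parser(SAMPLE_ROBOTS_TXT)
products_ok = is_allowed(parser, "my-scraper", "https://www.example.com/products")
private_ok = is_allowed(parser, "my-scraper", "https://www.example.com/private/data")
print(products_ok, private_ok)
```

Against a live site you would instead call parser.set_url("https://www.example.com/robots.txt") followed by parser.read() before asking can_fetch. Keep in mind that robots.txt says nothing about the Terms of Service, so both still need checking.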

2. Defend your web scraping identity

If you’re scraping the web for marketing objectives, anonymization is the first protective step you can take. A pattern of repeated and consistent requests sent from the same IP address can trigger lots of alarm signals. Websites can distinguish web crawlers from real users by tracking a browser’s activity, checking the IP address, setting honeypots, attaching CAPTCHAs, or even restricting the request rate.

There are different ways you can protect your identity, to name a few:

  • Use a strong proxy pool
  • Use rotating proxies
  • Use residential IPs
  • Take anti-fingerprinting measures
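
To make the “rotating proxies” idea concrete, here is a rough sketch of round-robin rotation. The addresses in PROXY_POOL are placeholders from a range reserved for documentation, and proxy_rotation is a hypothetical helper; a real pool would come from your proxy provider.

```python
from itertools import cycle

# Placeholder proxy addresses (documentation-only IP range), not real servers.
PROXY_POOL = [
    "http://203.0.113.10:8080",
    "http://203.0.113.11:8080",
    "http://203.0.113.12:8080",
]


def proxy_rotation(pool):
    """Yield one proxy mapping per request, cycling round-robin through the pool."""
    for proxy in cycle(pool):
        # Route both plain and TLS traffic through the same proxy.
        yield {"http": proxy, "https": proxy}


rotation = proxy_rotation(PROXY_POOL)
first_four = [next(rotation)["http"] for _ in range(4)]
print(first_four)
```

Each mapping uses the {"http": ..., "https": ...} shape that HTTP clients such as the requests library accept for their proxies argument, so consecutive requests leave from different addresses and the fourth one wraps around to the first proxy again.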

For more detailed information about the subject, I highly recommend reading The Biggest Web Scraping Roadblocks and How to Avoid Them.

3. Don’t get greedy — only collect what you need

Companies often abuse the power of a web scraper by gathering the largest quantity of data possible. That’s because they think it might come in handy in the future, but in most cases, data also has an expiry date.
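
One simple way to enforce this is to decide up front which fields you need and drop everything else before storing a record. The sketch below assumes a scraped record is a plain dictionary; the field names and the keep_needed_fields helper are invented for illustration.

```python
# Hypothetical list of the only fields the analysis actually needs.
NEEDED_FIELDS = ("name", "price", "rating")


def keep_needed_fields(record, needed=NEEDED_FIELDS):
    """Return a copy of the record that contains only the fields we need."""
    return {key: record[key] for key in needed if key in record}


# Made-up scraped record that includes personal data we have no use for.
scraped = {
    "name": "Widget",
    "price": "19.99",
    "rating": "4.5",
    "seller_email": "seller@example.com",
    "seller_phone": "555-0100",
}

trimmed = keep_needed_fields(scraped)
print(trimmed)
```

Filtering with an allow-list, instead of deleting known-bad fields, means that any new field the website adds later is left out by default.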

4. Check for copyright violations

As the data on some websites may be protected by copyright, it would be an intelligent move to look for a copyright notice or license before you start scraping.

Make sure you don’t reuse or republish scraped content without first checking the website’s license or getting written permission from the copyright holder.

5. Extract public data only

If you want to sleep tight at night, we suggest going for public data harvesting only. If the desired content is private, you must make sure you obtain proper approval from the site owner.

Final thoughts

So there you have it: we’ve covered all of the major points that decide whether or not your web scraping is legitimate. In the vast majority of cases, what businesses want to scrape is perfectly legitimate, as long as the rules and ethics allow it.

Nevertheless, I recommend you always double-check by asking yourself these three questions:

  1. Is the data protected by Copyright?
  2. Am I scraping personal data?
  3. Am I violating the Terms and Conditions?

If the answer to all of these questions is NO, then congratulations: you’re most likely free to web scrape from a legal perspective.
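
If you like to keep such checks next to your scraping jobs, the three questions can be written down as a tiny pre-flight checklist. This is only an illustrative sketch: the ScrapeCheck class and its field names are hypothetical, and the answers still have to come from a human who has read the site’s terms.

```python
from dataclasses import dataclass


@dataclass
class ScrapeCheck:
    """Answers to the three questions for one scraping target."""
    protected_by_copyright: bool
    contains_personal_data: bool
    violates_terms: bool

    def concerns(self):
        """List every question that was answered with yes."""
        flags = {
            "data is protected by copyright": self.protected_by_copyright,
            "personal data is being scraped": self.contains_personal_data,
            "Terms and Conditions are violated": self.violates_terms,
        }
        return [reason for reason, flagged in flags.items() if flagged]

    def all_clear(self):
        """True only when all three answers are no."""
        return not self.concerns()


weather_site = ScrapeCheck(False, False, False)
profile_site = ScrapeCheck(False, True, True)
print(weather_site.all_clear(), profile_site.concerns())
```

A target with any open concern is a signal to stop and ask for permission or legal advice before scraping, not a verdict on its own.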

Just aim to find the right balance between collecting all the desired data and obeying the website’s rules and regulations.

Also, don’t forget that the primary goal of harvested data is to be analyzed, not republished.

I hope I’ve managed to answer your questions regarding the legal gray area where web scraping makes its appearance. Now you’re better prepared to leverage data extraction and meet your business goals.

Until next time, stay safe, scrape smart, not hard!
