4 Things to Consider Before Starting Your Web Scraping Project
Web scraping and data extraction in general have seen a rise in popularity over the last few years. The benefits are clear to see, but despite that, many still shy away from the idea because of a few perceived problems.
While some of these roadblocks are very much real, others are often blown out of proportion. In this article, we will take a look at the possible difficulties of data extraction and how to avoid them.
1. The legality of web scraping
Whether it’s curious developers looking for an interesting project or companies interested in getting an edge over the competition, many pass up web scraping because of the perceived legal risks.
It’s not a completely unfounded concern, especially since there has been a legal battle between LinkedIn (owned by Microsoft) and a data science company named hiQ Labs. The latter scraped public profiles from LinkedIn to gather data for its clients, and LinkedIn tried to stop that.
Ultimately, the court ruled in favor of hiQ Labs, deciding that web scraping public sites does not violate the CFAA (Computer Fraud and Abuse Act).
The matter is quite simple, really: the act of web scraping is legal, but extracting sensitive data or violating the terms and conditions of a website isn’t. So, you are free to scrape to your heart’s content as long as you do it with care and in good faith.
Here are a few tips you should follow while scraping to make sure you don’t risk any costly mistakes:
Read the terms of service before scraping
If a website doesn’t want to be scraped, it will likely say so in its terms of service. Remember that logging in to a website implies that you have read those terms and accepted them.
Avoid needless and excessive scraping
There are many reasons why you should only extract the data that you actually need: it’s less costly, you’ll have less data to process, it’s less likely that you’ll be blocked or cause problems for the website’s host, and you’re less likely to scrape data unlawfully.
Check the robots.txt file
Websites use this file to tell bots how to behave on their pages and what they may access. The file is meant for the bots themselves, but you can read it yourself before running any software: just add /robots.txt to the end of the site’s homepage URL.
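You can also check the rules programmatically. Here’s a minimal sketch using Python’s standard-library urllib.robotparser; example.com, the path, and the user agent name are placeholders for illustration:

```python
from urllib.robotparser import RobotFileParser

# example.com is a placeholder; point this at the site you want to check
robots = RobotFileParser("https://example.com/robots.txt")
robots.read()

# can_fetch() answers: may this user agent request this URL?
url = "https://example.com/some-category/"
if robots.can_fetch("MyScraperBot", url):
    print("robots.txt allows scraping", url)
else:
    print("robots.txt disallows", url, "- skip it")
```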
Now that we’ve cleared the air on how web scraping can be used without fear of a court battle, let’s move on to the next matter.
2. The difficulty of building a web scraper
This is only partly true. If you want a well-oiled web scraping machine, it will certainly take time and skill. After all, you’ll need a headless browser to render JavaScript, a good proxy pool to avoid getting blocked, and you’ll have to make constant adjustments for every website you want to get data from.
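To give you an idea of what one of those moving parts looks like, here’s a minimal sketch of loading a JavaScript-rendered page with headless Chrome via Selenium. The URL is a placeholder, and your setup may differ:

```python
from selenium import webdriver
from selenium.webdriver.chrome.options import Options  # pip install selenium

options = Options()
options.add_argument("--headless=new")  # run Chrome without opening a window

driver = webdriver.Chrome(options=options)
try:
    driver.get("https://example.com")  # placeholder URL
    # page_source holds the HTML *after* JavaScript has run
    print(driver.page_source[:500])
finally:
    driver.quit()
```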
But, and this is a big but, building a basic scraper for small or simple projects is surprisingly easy. There are two reasons for that: tutorials and pre-existing libraries.
Even if you build the scraper yourself, that doesn’t mean you can’t rely on ready-made components.
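To show just how little code a basic scraper can take, here’s a minimal sketch using two popular Python libraries, requests and BeautifulSoup; example.com stands in for your target page:

```python
import requests                # pip install requests
from bs4 import BeautifulSoup  # pip install beautifulsoup4

# example.com stands in for whatever page you actually need
response = requests.get("https://example.com", timeout=10)
response.raise_for_status()

soup = BeautifulSoup(response.text, "html.parser")

# Print the text and destination of every link, just to show the idea
for link in soup.find_all("a"):
    print(link.get_text(strip=True), "->", link.get("href"))
```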
I’ve made two tutorials on building web scrapers. Consider reading them after you finish reading this article:
- The Easiest Way to Build a Web Scraper Using JavaScript and NodeJS
- 7 Easy Steps for Creating Your Own Web Scraper Using Python
After you give them a read, I’d love to hear your thoughts on the matter in the comments. In these tutorials, I also go over the different libraries or frameworks that will make your job a whole lot easier.
Python and JavaScript are probably the most popular programming languages for data extraction tools, but you can build one in many other languages too. I plan on expanding the tutorial list, but for the time being, rest assured that there are other guides out there.
Of course, there are plenty of tools that you can use straight out of the box. These cost money, almost all having a SaaS model, but it’s a lot more cost-effective than having someone manually gather the data. It’s up to you to weigh your time, money, and expertise to find the best tool for your specific needs.
3. Websites differ greatly and change over time
While the first two points were widely believed even though they’re not entirely true, this one is the opposite.
More people should know just how big the differences between two pages can be and that they also change from time to time. Maybe even more importantly, it’s essential that people understand what that means for web scraping projects.
Amazon is a prime example (excuse the pun). Besides their many anti-bot measures, there is the simple fact that different product pages and product category pages have varying layouts. It’s not that the people at Amazon hate consistency; it’s that they know how much bots love it.
A web scraper, be it an API, a visual tool, or a browser extension, chooses what to scrape based on the instructions you give it.
To know exactly what you want, you’ll have to inspect the source code of a page. Then you instruct the scraper to gather specific data based on your use case; for example, you might want all the bolded text on the page. Yet when you move to a new page (on a new website, or even the same one), the information you need may sit under different HTML elements or attributes, so the scraper won’t return the targeted data until you update the targeting specifications.
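One common way to cope with this is to keep a list of fallback selectors for the same piece of data. Here’s a small sketch of the idea; the markup snippets and selectors below are invented for the example:

```python
from bs4 import BeautifulSoup  # pip install beautifulsoup4

# Two invented snippets showing the "same" price under different markup
page_a = '<span class="price">19.99</span>'
page_b = '<div data-price="usd">19.99</div>'

# Scrapers often keep a list of known selectors per field and try
# them in order, since layouts differ between pages and over time
PRICE_SELECTORS = ["span.price", "div[data-price]"]

def extract_price(html):
    soup = BeautifulSoup(html, "html.parser")
    for selector in PRICE_SELECTORS:
        match = soup.select_one(selector)
        if match:
            return match.get_text(strip=True)
    return None  # layout changed again: the selectors need updating

print(extract_price(page_a))  # 19.99
print(extract_price(page_b))  # 19.99
```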
4. What to do with the gathered data
With terms like “big data” floating around, it’s easy to get caught up in all the excitement and start gathering information without a clear goal in mind. I understand the sentiment, but it does more harm than good.
Before even choosing what kind of tool to use, you should ask yourself a few questions:
- Why do I need more data from the web?
- What kind of information do I need?
- How will I use that information?
- How will I store it?
Since web scrapers return raw HTML, the output will need a bit of tidying up before it’s useful. If you just gather data indiscriminately, that task becomes much harder, since you’ll struggle to sort and store the information.
Once you know exactly what information you need and why, you can start looking for scrapers. At the same time, think about how you’ll process the information. In some cases, saving the data to an Excel file and doing a few calculations or searches might be enough. Other times, you’ll want to feed the data to a different software product that can parse it, save it, and send it wherever necessary.
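For the Excel route, a few lines of Python with pandas are often enough. A minimal sketch, with records invented for illustration:

```python
import pandas as pd  # pip install pandas openpyxl

# Hypothetical records, already extracted and tidied from raw HTML
products = [
    {"name": "Widget A", "price": 19.99},
    {"name": "Widget B", "price": 24.50},
]

df = pd.DataFrame(products)
df.to_excel("products.xlsx", index=False)    # openpyxl handles the .xlsx
print("Average price:", df["price"].mean())  # a quick calculation
```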
It all depends on your use case, so think ahead. To quote Benjamin Franklin, “By failing to prepare, you are preparing to fail.”
We’ve reached the end of the article and I’d like to do a lightning-fast recap of the points discussed:
- Web scraping is legal. Violating terms of use or copyright laws isn’t.
- Building a basic web scraper isn’t hard if you use freely available assets.
- Data extraction is a methodical process that needs recurring attention.
- Understanding what data is necessary and why is the first step when web scraping.
Now that you understand web scraping better, do you feel prepared to start turning web data into tangible results for your business? Great, here’s an article that will help you find the right software for the job.