The 5 Burning Questions You Need to Answer Before You Start Scraping the Web
I’ve been writing about web scraping for a while now, covering both general subjects and niche use cases or projects I found interesting. I like to think I’ve done a good job, but I’ve lately realized that before you embark on your next web scraping adventure, there are some essential details you need to explore.
I’ve compiled a list of frequently asked questions and done my best to answer them. Maybe you already know some of these questions, answers included, but they’re all things you should think about before you start scraping, so a refresher should still help.
All set? Great, let’s get this show on the road!
1. What kind of data do you need?
After learning about web scraping, you might be tempted to start gathering all sorts of data, even if it might not actually be useful. I call that the “kid in a candy shop” effect.
I definitely don’t want to take away your enthusiasm, but before you even think about scraping, you should consider what data you need and why.
Let’s say you’re trying to determine the optimal price for a product you’re creating. In that case, you’ll want to look at the prices similar products go for and maybe a list of their features, to determine what each feature is worth. While product reviews might be useful, they have more to do with user experience, so, unless you plan on analyzing those separately, you shouldn’t scrape them at this point.
Remember that more collected data means more details to analyze and consider. While “less is more” isn’t generally associated with research or big data, you don’t want to have useless facts weighing you down.
Another important point is the format of that data. Do you only need text, which can be stored in strings, or will you need images as well? Text itself comes in different types: numbers (for prices or inventory), dates (for release dates), long strings (for product descriptions), or short strings (for titles or meta descriptions).
Determine exactly what you’ll need because it will be of huge help once you build or purchase a web scraping tool, and once you parse the extracted data.
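One way to pin those formats down is to sketch a small record type before writing any scraping code. The fields below are hypothetical examples for the pricing scenario above, not a prescribed schema:

```python
from dataclasses import dataclass
from datetime import date
from decimal import Decimal

@dataclass
class ProductRecord:
    title: str        # short string
    description: str  # long string
    price: Decimal    # exact number; safer than float for money
    stock: int        # inventory count
    released: date    # release date

# A made-up record standing in for one scraped product.
item = ProductRecord(
    title="Example Widget",
    description="A placeholder product used to illustrate the schema.",
    price=Decimal("19.99"),
    stock=42,
    released=date(2021, 6, 1),
)
print(item.price)  # prints 19.99
```

Writing the schema first also makes parsing failures obvious: if a scraped value doesn’t fit one of these fields, you know immediately that the page structure changed.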
2. What web pages will you be visiting?
Once you know what to look for, the next step is to know where to search. For web scraping, that means identifying the websites that hold valuable information.
I can’t really offer you much help here, since data sources depend on what info you need and for what reason. The advice I can give to you is this:
- Keep in mind that well-known and advanced websites, like Amazon, will have more scraping countermeasures than others so you’ll have to work harder for the data. Look into possible alternatives if you don’t want the headache.
- Read the Terms of Service and robots.txt files before extracting data. A website’s content is the intellectual property of the site owner and you wouldn’t want to get in trouble. Here’s a guide to keep you on the straight and narrow.
- Some websites have their own API specifically to share data. Those may be locked behind paywalls or offer only partial data, but they do make your job a whole lot easier.
- Remember that heavy bot activity can slow down or even overload the server that hosts the website. Avoid causing problems for the site owner or other visitors.
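The robots.txt check from the list above can be automated with Python’s standard library. This is a minimal sketch; the site, the policy lines, and the user agent name are all made up:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical target site; swap in the one you actually plan to scrape.
robots = RobotFileParser("https://example.com/robots.txt")
# robots.read() would fetch the real file over the network; here we parse
# a sample policy directly so the sketch stays self-contained.
robots.parse([
    "User-agent: *",
    "Disallow: /private/",
    "Crawl-delay: 2",
])

print(robots.can_fetch("my-scraper", "/products/widget"))  # True
print(robots.can_fetch("my-scraper", "/private/data"))     # False

# Pause at least this long between requests so you don't overload the host.
print(robots.crawl_delay("my-scraper"))  # 2
```

Honoring the crawl delay between requests addresses the last bullet too: it keeps your bot from hammering the server.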
3. How will you find the data on the page?
Some web scraping tools use visual interfaces to select what content to scrape, which makes the process pretty easy. If you’re using something like an API, though, you’ll have to inspect the page structure so you gather exactly the data you need. It’s easy: just right-click on the information you want and select “Inspect Element”.
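Once you’ve found the element you need (say, a `<span class="price">`), a parser can pull it out of the fetched HTML. Here’s a standard-library sketch over a made-up fragment; real projects typically reach for a dedicated library such as Beautiful Soup:

```python
from html.parser import HTMLParser

class PriceExtractor(HTMLParser):
    """Collects the text of every <span class="price"> element."""
    def __init__(self):
        super().__init__()
        self.in_price = False
        self.prices = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag.
        if tag == "span" and ("class", "price") in attrs:
            self.in_price = True

    def handle_endtag(self, tag):
        if tag == "span":
            self.in_price = False

    def handle_data(self, data):
        if self.in_price:
            self.prices.append(data.strip())

# A made-up fragment standing in for a fetched product page.
sample = '<div><h1>Widget</h1><span class="price">$19.99</span></div>'
parser = PriceExtractor()
parser.feed(sample)
print(parser.prices)  # ['$19.99']
```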
Dynamic websites are great for people, since they enrich the user experience. For bots, though, not so much. Robots don’t surf the Internet the way we do; they only need the code.
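One common workaround: many dynamic pages load their content from a JSON endpoint behind the scenes, which you can usually spot in your browser’s Network tab and scrape directly, skipping the JavaScript entirely. A sketch of decoding such a payload (the payload here is made up):

```python
import json

# A made-up response body from a hypothetical product-listing endpoint.
payload = '{"products": [{"title": "Widget", "price": 19.99}]}'

data = json.loads(payload)
for product in data["products"]:
    print(product["title"], product["price"])  # prints: Widget 19.99
```

Structured JSON like this is far easier to work with than rendered HTML, which is why checking for a hidden API is worth a few minutes before writing any parser.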
4. What will you do once your web scraper gets blocked?
Make no mistake, even with the most high-tech data extraction tool, you will get noticed sooner or later. That’s why you should avoid extracting data without a proxy.
You’re in luck! Plenty of web scraping products come with incorporated proxy services, so you don’t have to look for different tools, mix and match different plans, and keep track of it all. It’s simple, actually — you choose a plan that lets you make as many requests as you need and you have access to the software’s proxy pool.
Keep an eye out for the following terms, though:
- Datacenter proxies — they won’t work on every website, but they’re cost-effective.
- Residential proxies — nearly indistinguishable from normal users; expensive, but they get the job done.
- Rotating proxies — automatically switching IPs so your web scraper is harder to identify and track.
If you’d like, you can still get the scraper and the proxies separately, but I’d say that it’s not worth the bother.
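If you do go the separate route, the rotation itself is simple to sketch with nothing but the standard library. The proxy addresses and URLs below are made up, and the commented-out line shows how the popular `requests` library would use the selected proxy:

```python
from itertools import cycle

# Hypothetical proxy addresses; a real pool would come from your provider.
proxy_pool = cycle([
    "http://10.0.0.1:8080",
    "http://10.0.0.2:8080",
    "http://10.0.0.3:8080",
])

urls = [
    "https://example.com/page1",
    "https://example.com/page2",
    "https://example.com/page3",
    "https://example.com/page4",
]

for url in urls:
    proxy = next(proxy_pool)  # each request goes out through the next IP
    # With the `requests` library you would pass it along like:
    #   requests.get(url, proxies={"http": proxy, "https": proxy})
    print(f"{url} via {proxy}")
```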
5. Will the scope of your data extraction project change?
Let’s say you did your homework, determined what you need, and found the right app for the job. You start scraping, getting data, and gaining new insights. Then you realize that you’ll need new information, from new sources. That complicates matters.
It’s not always possible to predict how a product will change, so it’s not always easy to prepare for such an event. What you can do, though, is choose a web scraping tool that can help with more than one use case.
Scalability and a wealth of functionalities are the keys to turning a one-time-deal app into an always-present tool in your company’s arsenal. This doesn’t apply only to web scraping software, but to just about any application out there. Plan for success by picking something that you won’t outgrow easily.
The next step
I consider these five questions prerequisites before choosing a tool or putting any money on the table. In certain cases, there will be more questions or maybe different ones, but I’m sure you’ll be clever enough to deal with them; after all, you were clever enough to reach the end of the article.
Jokes aside, the next step in your pre-scraping process should be looking at a few options and getting to know the market. Here’s a good article to start with — a top 10.