Web Crawlers: 5 Tips and Tricks to Improve Your Results

Web scraping is a well-established and mature technology used by thousands of programmers every day around the world. You can spend many hours building your own web crawler and put a lot of work into optimising the code, yet something is still missing.

Either you get blocked by the website, the HTML document is incomplete, or you wait ages before you see any HTML at all. There is a lot of grunt work involved in making your web crawler genuinely effective.

Web scraping is no longer new, but there are still many tricks to be learned, and things that might seem obvious to experienced users may not have occurred to newcomers yet.

This article will cover five tips that I am confident will improve your results if you’re using a web crawler to search for data across the web.

What are web crawlers?
What defines a web crawler’s efficiency?
Five tips on how to make your web scrapers more efficient
1. API digger
2. Don’t get caught
3. Fewer requests
4. Keep writing
5. Consider the copy
Don’t have the time to make your own web crawler?

What are web crawlers?

But first, let’s talk about web crawlers. You may also know them under the name of spiders, automatic indexers, web robots or ants.

No matter the name, the process and its purpose are the same. A web crawler traverses the web looking for data. It typically starts at one or more seed URLs and follows the hyperlinks on those pages, adding each new link to its list of destinations to visit.
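To make the traversal concrete, here is a minimal sketch of that loop using only Python's standard library. The `crawl` function, the `LinkExtractor` class and the in-memory `PAGES` dictionary are my own illustrative names, and the dictionary stands in for real HTTP requests so the example runs offline. Treat it as one possible shape for a crawler, not a production design.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the href of every <a> tag in an HTML document."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed_urls, fetch, max_pages=100):
    """Breadth-first crawl. `fetch(url)` returns HTML text, or None on failure."""
    queue = deque(seed_urls)   # destinations still to visit
    seen = set(seed_urls)      # every URL ever queued, to avoid loops
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        visited.append(url)
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            absolute = urljoin(url, href)  # resolve relative links
            if absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return visited


# Stand-in for the network: a tiny in-memory "website" (hypothetical pages).
PAGES = {
    "https://example.com/": '<a href="/a">A</a> <a href="/b">B</a>',
    "https://example.com/a": '<a href="/b">B</a>',
    "https://example.com/b": '<a href="/">Home</a>',
}

print(crawl(["https://example.com/"], PAGES.get))
```

In a real crawler you would replace `PAGES.get` with a function that downloads the page, and you would probably restrict the crawl to the domains you care about.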

When the first crawlers appeared, their sole purpose was to optimise search engines’ indexing strategies. Nowadays, they have a wider variety of use cases: real estate investments, market analysis, price and product intelligence, lead generation, brand monitoring, machine learning, recruitment, etc.

What defines a web crawler’s efficiency?

There are multiple features to consider when talking about a truly effective web crawler, but they all come down to three significant characteristics: speed, complete and accurate data, and scalability.

Time is money, so a web crawler that takes hours to perform a request is not worth it, no matter how good and complete the data is.

That doesn’t mean you should ignore the data’s consistency. Your crawler must capture all of the website’s components, especially the ones generated by JavaScript. Besides that, the information you scrape can vary with different factors, so accuracy is another essential consideration.

What happens when the amount of input significantly increases? It is an inevitable situation that your crawler should be able to handle. Scalability is always important so that you can expand your project with a minimum of technical and human resources.

Five tips on how to make your web scrapers more efficient

1. API digger

Check if the website you want to scrape has a public API. If it does, you’re in luck and now have a lot of time on your hands.

A public API means that a server already provides most (maybe even all) of the information you see on the website. Simply calling the API’s endpoints will give you the data you need in much less time, already organised in a well-known format (usually JSON or XML).
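As a rough illustration, reading from a JSON API can be as short as the sketch below. The endpoint URL, the `products` key and the field names are all assumptions I made up for the example; a real API will document its own endpoints and response shape.

```python
import json
from urllib.request import Request, urlopen

# Hypothetical endpoint, used only for illustration.
API_URL = "https://example.com/api/products?page=1"


def fetch_json(url, timeout=10):
    """Request a URL and decode the JSON body."""
    request = Request(url, headers={"Accept": "application/json"})
    with urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def extract_products(payload):
    """Pick the fields we care about out of the decoded response."""
    return [(item["name"], item["price"]) for item in payload.get("products", [])]


# Sample of what such an endpoint might return (made-up data).
sample = json.loads('{"products": [{"name": "Lamp", "price": 19.9}]}')
print(extract_products(sample))
```

Against a live API you would call `extract_products(fetch_json(API_URL))`. There is no HTML parsing at all, which is where most of the time saving comes from.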

2. Don’t get caught

Websites implement a lot of anti-bot techniques for various reasons. If your crawler falls into these traps, the process will become increasingly difficult.

Luckily, there are just as many ways to get around them: proxy servers, geotargeting, rotating user agents, etc. You can find them built into most existing web scraping tools.
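If you are building your own crawler, two of these techniques can be sketched with the standard library alone: picking a random user agent per request and routing traffic through a proxy. The user agent strings and proxy addresses below are placeholders I chose for illustration, so swap in values you actually have access to.

```python
import random
from urllib.request import ProxyHandler, Request, build_opener

# Illustrative values only; replace with agents and proxies you actually use.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/17.0 Safari/605.1.15",
]
PROXIES = ["http://proxy1.example.com:8080", "http://proxy2.example.com:8080"]


def build_request(url):
    """Create a request with a randomly chosen User-Agent header."""
    return Request(url, headers={"User-Agent": random.choice(USER_AGENTS)})


def build_proxy_opener():
    """Create an opener that routes traffic through a randomly chosen proxy."""
    proxy = random.choice(PROXIES)
    return build_opener(ProxyHandler({"http": proxy, "https": proxy}))


request = build_request("https://example.com/")
opener = build_proxy_opener()
# opener.open(request, timeout=10) would send the request through the proxy.
print(request.get_header("User-agent"))
```

Nothing is sent over the network here; the final `opener.open(...)` call is left as a comment because the proxies are fictional.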

3. Fewer requests

Try to make as few requests as possible while extracting the data you need. It will make your crawler faster, and you will make better use of the resources you pay for (e.g. proxies).

For example, instead of sending a request to the website for each piece of data you need, you can retrieve the whole HTML document once, save it externally, and extract the information from the saved copy.
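Here is a small sketch of that idea. The HTML string, the class names (`title`, `price`, `stock`) and the `FieldExtractor` helper are invented for the example; in a real crawler the string would be the body of a single HTTP response.

```python
import tempfile
from html.parser import HTMLParser
from pathlib import Path


class FieldExtractor(HTMLParser):
    """Collects the text of elements whose class attribute is in `wanted`."""

    def __init__(self, wanted):
        super().__init__()
        self.wanted = set(wanted)
        self.current = None   # class of the element we are currently inside
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class")
        if css_class in self.wanted:
            self.current = css_class

    def handle_data(self, data):
        if self.current and data.strip():
            self.fields[self.current] = data.strip()
            self.current = None


# In a real crawler this string would come from a single HTTP request.
html = (
    '<h1 class="title">Desk lamp</h1>'
    '<span class="price">19.90</span>'
    '<p class="stock">In stock</p>'
)

# Save the document once...
saved = Path(tempfile.mkdtemp()) / "page.html"
saved.write_text(html, encoding="utf-8")

# ...then extract every field from the local copy, with no further requests.
parser = FieldExtractor(["title", "price", "stock"])
parser.feed(saved.read_text(encoding="utf-8"))
print(parser.fields)
```

If you later decide you need another field, you can re-run the extraction on the saved file instead of hitting the website again.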

4. Keep writing

Building a web scraper from scratch involves many roadblocks and errors. No matter how much data you have to scrape, keep writing it to an external file as you go. Instead of starting all over again after every failure, use your CSV/JSON file as a checkpoint for your web crawler.

Later on, once the functional errors are under control, you will start scaling, and the crawler may need to process more websites. There is always a chance that some of them will fail. The crawler should not stop at that point; it should log the failure in an external file and carry on.
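Both habits, checkpointing and logging failures, fit in a few lines. The sketch below assumes a JSON Lines results file (one JSON object per line) and a plain-text failure log; the file names, the `run` function and the deliberately failing `fake_scrape` are my own choices for the example.

```python
import json
import tempfile
from pathlib import Path


def load_done(results_path):
    """Return the set of URLs already stored in the results file."""
    if not results_path.exists():
        return set()
    with results_path.open(encoding="utf-8") as f:
        return {json.loads(line)["url"] for line in f if line.strip()}


def run(urls, scrape, results_path, failures_path):
    """Scrape each URL once, appending results and logging failures."""
    done = load_done(results_path)
    for url in urls:
        if url in done:
            continue  # already scraped in an earlier run
        try:
            record = {"url": url, "data": scrape(url)}
        except Exception as error:
            with failures_path.open("a", encoding="utf-8") as f:
                f.write(f"{url}\t{error}\n")
            continue  # keep going instead of stopping the crawl
        with results_path.open("a", encoding="utf-8") as f:
            f.write(json.dumps(record) + "\n")


def fake_scrape(url):
    """Stand-in for real scraping logic; fails on one URL on purpose."""
    if url.endswith("/broken"):
        raise ValueError("could not parse page")
    return len(url)


workdir = Path(tempfile.mkdtemp())
results = workdir / "results.jsonl"
failures = workdir / "failures.log"
urls = ["https://example.com/a", "https://example.com/broken", "https://example.com/b"]

run(urls, fake_scrape, results, failures)
run(urls, fake_scrape, results, failures)  # second run skips finished URLs
print(sorted(load_done(results)))
```

The second call to `run` only retries the URL that failed, because the two successful ones are already in the results file. Each result is appended as soon as it is available, so a crash in the middle of a run loses at most the page being processed.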

5. Consider the copy

In some cases, you may want to crawl a website whose HTML structure and data don’t change very often. For these situations, instead of scraping the original website, you can use a cached copy of it where one is available, such as the Google cache, which is a more lightweight version of the page.

Don’t have the time to make your own web crawler?

I hope these pieces of advice gave you more insight into what a quality web crawler means. They may not seem easy or quick to implement, but they are necessary for achieving your goals efficiently.

If you feel overwhelmed and think these tips are too time-consuming, the web scraping market offers many pre-built tools that get the job done for you. I recommend that you investigate further using this analysis of multiple web scraping tools in your quest to find the best fit.

CEO & Co-Founder @Knoxon, Full Stack Developer
