Creating your own web scraper can be quite an easy task if you know which tools to use. Python, for example, is one of the most popular programming languages for extracting data from the web.
In one of my previous articles, I shared how to create your own Python web scraper in just a few steps, but there are plenty of other libraries that can get the job done too.
Now we’ll see which of these packages can help us and what their advantages are in building our web scraper. So fasten your seatbelts, we are going for a ride!
Why build your own web scraper?
I am certain you already know what purpose your web scraper will serve, but let’s take a moment to see what other uses it can have in your daily scraping activities. Whether your reasons are personal or business-oriented, knowing a little more never hurts! Now let’s have a look at those use cases:
- Price optimization: Scraping websites to monitor market fluctuations can help your business grow by adapting to your competitors. It can also help when you simply want to purchase a product, because extracting and comparing data will help you find the best offers.
- Lead generation: Looking through a phone book doesn’t sound so productive, does it? Scraping a directory website and structuring the data in an easy-to-read format sounds more advantageous.
- Research: Gathering data for a research study can be pretty time-consuming. Using a web scraping tool can speed up the process by collecting data reports and statistics for you.
I hope these use case examples are helpful for you, and if you wish to learn more about what problems a web scraping tool can solve, have a look at these top 7 use cases for data extraction.
The 5 best Python libraries and frameworks for building a web scraper
1. Requests
Requests is one of the most popular Python packages available, and for good reason! With the help of this library, you can send HTTP/1.1 requests without having to add query strings to the URL or form-encode your POST data by hand. Requests also handles keep-alive and HTTP connection pooling automatically.
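For instance, query parameters and form data can be passed as plain dictionaries, and the library encodes them for you. Here is a minimal sketch (assuming the library is already installed, which we cover in a moment; the URLs are just placeholders for illustration):
import requests

# Requests builds the query string from a dict, no manual encoding needed
r = requests.get('https://en.wikipedia.org/w/index.php', params={'search': 'Kangaroo'})
print(r.url)

# Form data for a POST request is passed the same way
r = requests.post('https://httpbin.org/post', data={'animal': 'kangaroo'})
print(r.status_code)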
How can we use this library? Well, getting the raw HTML of a web page is easy; after that, you just have to parse it and extract the data you need. Let’s look at an example where we scrape a Wikipedia page about kangaroos, because who doesn’t like kangaroos?
To install the Requests package, you just need to run the following command in your command prompt: pip install requests
After we’ve installed the library, we need to import it into our project. Then, we need to make a GET request to the URL, and voila!
import requests

r = requests.get('https://en.wikipedia.org/wiki/Kangaroo')
print(r.content)
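Before parsing anything, it’s worth making sure the request actually succeeded. A small, optional addition to the snippet above:
import requests

r = requests.get('https://en.wikipedia.org/wiki/Kangaroo')
if r.status_code == 200:       # or call r.raise_for_status() to fail loudly
    print(r.headers['Content-Type'])
    print(r.text[:500])        # r.text is the decoded body; r.content is raw bytes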
2. BeautifulSoup
This package is called beautiful for a reason, as it helps you parse the extracted data with ease, navigate through it, and select only the data you are interested in.
The parse tree can easily be modified based on your needs, and searching within it is child’s play for this package, as it can match specific patterns or CSS selectors.
Using the above example of extracting the raw HTML with Requests, let’s parse all the paragraphs and print their content. First, we need to install BeautifulSoup4 using this command: pip install beautifulsoup4
from bs4 import BeautifulSoup
import requests

r = requests.get('https://en.wikipedia.org/wiki/Kangaroo')
content = r.content
soup = BeautifulSoup(content, features="html.parser")
for element in soup.find_all('p'):
    print(element.text)
Then, in the code above, we told BeautifulSoup to use the HTML parser on the extracted content and to select all the <p> tags for us.
And finally, we iterate through the selected tags and print the text contained within them. Pretty easy, right?
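As mentioned earlier, the parse tree can also be searched with CSS selectors instead of tag names. Here is a quick sketch of the same page using select(); the selectors themselves are only illustrative:
from bs4 import BeautifulSoup
import requests

r = requests.get('https://en.wikipedia.org/wiki/Kangaroo')
soup = BeautifulSoup(r.content, features="html.parser")

# CSS selector: every link that sits inside a paragraph
for link in soup.select('p a[href]'):
    print(link['href'])

# Navigating the tree: the text of the page's first heading
print(soup.find('h1').get_text())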
3. LXML
LXML is one of the fastest toolkits for processing XML and HTML in Python, and it can make your daily scraping activities a lot easier. It also comes with extensive documentation and examples to help you better understand its features.
In this example, we’ll try to get all the links that appear on a web page. To install the package, we run the same kind of command as before: pip install lxml
We are going to use the requests package again to get the raw HTML code of the webpage and then parse it using LXML.
import lxml.html
import requests

r = requests.get('https://en.wikipedia.org/wiki/Kangaroo')
content = r.content
doc = lxml.html.fromstring(content)
for element in doc.xpath('//a/@href'):
    print(element)
Above, we used an XPath expression to select all the links we could find and print them out.
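XPath can also pull out text, not just attributes. Here is a small variation on the example above, assuming the same page:
import lxml.html
import requests

r = requests.get('https://en.wikipedia.org/wiki/Kangaroo')
doc = lxml.html.fromstring(r.content)

# The text of every paragraph, with the markup inside the tags flattened
for paragraph in doc.xpath('//p'):
    print(paragraph.text_content())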
4. Selenium
Selenium is an umbrella project with a set of tools and libraries for web browser automation. You can use it for more than just scraping: it can simulate actions made by end users, such as clicking, entering text into fields, moving the mouse, and more.
Although Selenium is mainly used for front-end testing of websites, it can also be used for data extraction. It is especially useful for content that only loads after the initial page load, and running it with a headless browser keeps things fast and lightweight.
To do that, we must download ChromeDriver and then install Selenium with this command in our command prompt: pip install selenium
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service

# Point Selenium at the chromedriver we downloaded earlier (Selenium 4 syntax)
driver = webdriver.Chrome(service=Service("/your/path/here/chromedriver"))
driver.get('https://en.wikipedia.org/wiki/Kangaroo')

# Collect every <a> element and print its href attribute
links = driver.find_elements(By.TAG_NAME, 'a')
for element in links:
    print(element.get_attribute('href'))
Notice that we didn’t need help from any other libraries to get the contents of the web page, as Selenium can do everything single-handedly!
After we told the webdriver where the chromedriver is located and which URL to scrape, we must specify what we are searching for within the extracted data. Let’s try to get all the links again.
We tell the driver to find all the <a> tags, iterate through them, and print out the href attribute of each. That’s about it!
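And since we mentioned headless browsing and simulated user actions, here is a rough sketch of both using the Selenium 4 API. The search box name used below is an assumption about Wikipedia’s current markup, so treat it as illustrative:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument('--headless=new')   # run Chrome without opening a window

# Recent Selenium versions can locate chromedriver automatically;
# otherwise, pass a Service pointing at it, as in the previous snippet.
driver = webdriver.Chrome(options=options)
driver.get('https://en.wikipedia.org/wiki/Kangaroo')

# Simulate a user typing into the search box and pressing Enter
search_box = driver.find_element(By.NAME, 'search')   # field name is assumed
search_box.send_keys('Wallaby')
search_box.send_keys(Keys.RETURN)

print(driver.title)
driver.quit()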
5. MechanicalSoup
Built on top of two other big Python libraries, Requests (to handle HTTP sessions) and BeautifulSoup (to navigate the document), MechanicalSoup is designed to automate interactions with websites.
You can simulate human behavior by easily navigating a webpage: MechanicalSoup can fill in and submit forms, store and send cookies, follow links found on the page, and even download files.
The downside is that it doesn’t do well when JavaScript is involved. MechanicalSoup has been actively maintained by a small team since 2017.
Here is a code example of how to scrape links from a given URL. It is very similar to the ones we did earlier, as it is built on both Requests and BeautifulSoup.
import mechanicalsoup

browser = mechanicalsoup.StatefulBrowser(user_agent='MechanicalSoup')
browser.open('https://en.wikipedia.org/wiki/Kangaroo')
for link in browser.links():
    print(link.attrs['href'])
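Form handling is where MechanicalSoup really shines. Here is a minimal sketch of filling in and submitting Wikipedia’s search form; the '#searchform' selector and the 'search' field name are assumptions about the page’s current markup:
import mechanicalsoup

browser = mechanicalsoup.StatefulBrowser(user_agent='MechanicalSoup')
browser.open('https://en.wikipedia.org/wiki/Kangaroo')

# Select the search form, fill in its text field, and submit it
browser.select_form('#searchform')    # CSS selector for the form (assumed)
browser['search'] = 'Wallaby'         # input field name (assumed)
browser.submit_selected()

print(browser.get_url())              # we should now be on the results page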
Is a home-built web scraper the answer?
As you can see, building your own web scraping tool isn’t hard at all! But does it meet your needs? Scraping more complex websites can be difficult for a home-built tool, as roadblocks such as anti-bot countermeasures may appear along the way. If you want to know more about web scraping roadblocks and how to avoid them, you can always take a look.
Managing a proxy pool is also necessary if you want to scrape undetected or at scale, but handling a large number of proxies can be rather tricky and time-consuming. Have a glance at how important proxies are when scraping, and you’ll quickly understand what I mean.
Have you ever thought about opting for a pre-built web scraping tool? If you are curious enough, I’ve written an article about the difference between a DIY and a pre-made tool.