08 February 2020

How to Build a Basic Web Crawler to Pull Information From a Website


Programs that read information from websites, or web crawlers, have all kinds of useful applications. You can scrape for stock information, sports scores, text from a Twitter account, or pull prices from shopping websites.

Writing these web crawling programs is easier than you might think. Python has a great library for writing scripts that extract information from websites. Let’s look at how to create a web crawler using Scrapy.

Installing Scrapy

Scrapy is a Python library that was created to scrape the web and build web crawlers. It is fast, simple, and can navigate through multiple web pages without much effort.

Scrapy is installed with pip (PIP, short for Pip Installs Python), the Python package installer. Here’s a refresher on how to install PIP on Windows, Mac, and Linux.

Using a Python Virtual Environment is preferred because it will allow you to install Scrapy in a virtual directory that leaves your system files alone. Scrapy’s documentation recommends doing this to get the best results.

Create a directory and initialize a virtual environment.

mkdir crawler
cd crawler
virtualenv venv
. venv/bin/activate

You can now install Scrapy into that directory using a PIP command.

pip install scrapy

Run a quick check to make sure Scrapy is installed properly:

scrapy
# prints
Scrapy 1.4.0 - no active project

Usage:
  scrapy <command> [options] [args]

Available commands:
  bench         Run quick benchmark test
  fetch         Fetch a URL using the Scrapy downloader
  genspider     Generate new spider using pre-defined templates
  runspider     Run a self-contained spider (without creating a project)
...
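The same check can be made from inside Python, which is handy when a script should fail early if Scrapy is missing. This is a minimal sketch that assumes Python 3.8 or newer (for importlib.metadata); the helper name installed_version is hypothetical.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of a package, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


version = installed_version("scrapy")
if version is None:
    print("Scrapy is not installed in this environment")
else:
    print("Scrapy", version)
```

Run it with the virtual environment activated, otherwise it reports on the system-wide Python instead.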

How to Build a Web Crawler

Now that the environment is ready you can start building the web crawler. Let’s scrape some information from a Wikipedia page on batteries: https://en.wikipedia.org/wiki/Battery_(electricity).

The first step in writing a crawler is defining a Python class that extends scrapy.Spider. This gives you access to all the functions and features in Scrapy. Let’s call this class spider1 and save it in a file named spider1.py.

A spider class needs a few pieces of information:

  • a name for identifying the spider
  • a start_urls variable containing a list of URLs to crawl from (the Wikipedia URL will be the example in this tutorial)
  • a parse() method which is used to process the webpage to extract information

import scrapy

class spider1(scrapy.Spider):
    name = 'Wikipedia'
    start_urls = ['https://en.wikipedia.org/wiki/Battery_(electricity)']

    def parse(self, response):
        pass

Run a quick test to make sure everything is working properly:

scrapy runspider spider1.py
# prints
2017-11-23 09:09:21 [scrapy.utils.log] INFO: Scrapy 1.4.0 started (bot: scrapybot)
2017-11-23 09:09:21 [scrapy.utils.log] INFO: Overridden settings: {'SPIDER_LOADER_WARN_ONLY': True}
2017-11-23 09:09:21 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.logstats.LogStats',
...

Turning Off Logging

Running Scrapy with this class prints log information that won’t help you right now. Let’s make it simple by removing this excess log information. Raise Scrapy’s log level to WARNING by adding this code to the beginning of the file, so that only warnings and errors are printed.

import logging
logging.getLogger('scrapy').setLevel(logging.WARNING)

Now when you run the script again, the log information will not print.
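One call on the 'scrapy' logger is enough because Python loggers are hierarchical: a logger such as scrapy.utils.log has no level of its own and inherits the one set on its parent. The sketch below shows this with the standard library only, without running Scrapy. ListHandler is a hypothetical helper that collects messages in a list so they can be inspected.

```python
import logging


class ListHandler(logging.Handler):
    """Hypothetical helper: collects log messages in a list instead of printing them."""

    def __init__(self):
        super().__init__()
        self.messages = []

    def emit(self, record):
        self.messages.append(f"{record.name} {record.levelname}: {record.getMessage()}")


handler = ListHandler()
root = logging.getLogger()
root.addHandler(handler)        # stand-in for the handler Scrapy installs
root.setLevel(logging.DEBUG)    # let everything through at the root

# The same call as in the spider file.
logging.getLogger('scrapy').setLevel(logging.WARNING)

child = logging.getLogger('scrapy.utils.log')  # inherits WARNING from 'scrapy'
child.info('Scrapy started')                   # below WARNING, dropped
child.warning('something looks wrong')         # kept

print(handler.messages)
```

Only the warning reaches the handler. To see the startup messages again while debugging, set the level back to logging.INFO.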

Using the Chrome Inspector

Everything on a web page is stored in HTML elements. The elements are arranged in the Document Object Model (DOM). Understanding the DOM is critical to getting the most out of your web crawler. A web crawler searches through all of the HTML elements on a page to find information, so knowing how they’re arranged is important.

Google Chrome has tools that help you find HTML elements faster. You can locate the HTML for any element you see on the web page using the inspector.

  • Navigate to a page in Chrome
  • Place the mouse on the element you would like to view
  • Right-click and select Inspect from the menu

These steps will open the developer console with the Elements tab selected. At the bottom of the console, you will see a tree of elements. This tree is how you will get information for your script.
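To get a feel for the tree the inspector shows, the nesting can also be printed from Python. The sketch below uses the standard library's html.parser rather than Scrapy, and the HTML is a simplified stand-in, not the real Wikipedia markup. TreePrinter is a hypothetical helper.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not increase the depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class TreePrinter(HTMLParser):
    """Hypothetical helper: records each opening tag, indented by nesting depth."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.lines = []

    def handle_starttag(self, tag, attrs):
        label = tag
        element_id = dict(attrs).get("id")
        if element_id:
            label += "#" + element_id   # same notation as a CSS id selector
        self.lines.append("  " * self.depth + label)
        if tag not in VOID_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS:
            self.depth -= 1


html = """
<div id="content">
  <h1 id="firstHeading">Battery (electricity)</h1>
  <div id="mw-content-text">
    <div><p>An electric battery is a device.</p></div>
  </div>
</div>
"""

printer = TreePrinter()
printer.feed(html)
print("\n".join(printer.lines))
```

Each level of indentation in the output is one parent-child step, which is what a CSS selector walks through when it locates an element.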

Extracting the Title

Let’s get the script to do some work for us: a simple crawl to get the title text of the web page.

Start the script by adding some code to the parse() method that extracts the title.

...
    def parse(self, response):
        # print() with parentheses works in both Python 2 and Python 3
        print(response.css('h1#firstHeading::text').extract())
...

The response argument supports a method called css() that selects elements from the page using the CSS selector you provide.

In this example, the selector is h1#firstHeading, which matches the h1 element whose id is firstHeading. Adding ::text to the selector is what gives you the text content of the element. Finally, the extract() method returns the matched text as a list of strings.

Running this script in Scrapy prints the title in text form.

[u'Battery (electricity)']

Finding the Description

Now that we’ve scraped the title text, let’s do more with the script. The crawler is going to find the first paragraph after the title and extract its text.

Here’s the element tree in the Chrome Developer Console:

div#mw-content-text>div>p

The right arrow (>) indicates a parent-child relationship between the elements.

This location will return all of the p elements matched, which includes the entire description. To get the first p element you can write this code:

response.css('div#mw-content-text>div>p')[0]

Just like the title, you add the ::text selector to get the text content of the element.

response.css('div#mw-content-text>div>p')[0].css('::text')

The final expression uses extract() to return a list of text fragments, one for each piece of text inside the paragraph. You can use Python’s join() method to combine the fragments into a single string.

    def parse(self, response):
        # join the text fragments of the first paragraph into one string
        print(''.join(response.css('div#mw-content-text>div>p')[0].css('::text').extract()))

The result is the first paragraph of the text!

An electric battery is a device consisting of one or more electrochemical cells with external connections provided to power electrical devices such as flashlights, smartphones, and electric cars.[1] When a battery is supplying electric power, its positive terminal is
...

Collecting JSON Data

Scrapy can extract information in text form, which is useful. Scrapy can also export the data as JavaScript Object Notation (JSON). JSON is a neat way to organize information and is widely used in web development. JSON works pretty nicely with Python as well.

When you need to collect data as JSON, you can use Python’s yield statement inside parse(). Scrapy collects every dictionary that parse() yields and can write the results to a JSON file.

Here’s a new version of the script using a yield statement. Instead of getting the first p element in text format, this will grab all of the p elements and organize them in JSON format.

...
    def parse(self, response):
        for e in response.css('div#mw-content-text>div>p'):
            yield { 'para' : ''.join(e.css('::text').extract()).strip() }
...

You can now run the spider by specifying an output JSON file:

scrapy runspider spider3.py -o joe.json

The spider now writes all of the p elements to joe.json:

[
{"para": "An electric battery is a device consisting of one or more electrochemical cells with external connections provided to power electrical devices such as flashlights, smartphones, and electric cars.[1] When a battery is supplying electric power, its positive terminal is the cathode and its negative terminal is the anode.[2] The terminal marked negative is the source of electrons that when connected to an external circuit will flow and deliver energy to an external device. When a battery is connected to an external circuit, electrolytes are able to move as ions within, allowing the chemical reactions to be completed at the separate terminals and so deliver energy to the external circuit. It is the movement of those ions within the battery which allows current to flow out of the battery to perform work.[3] Historically the term \"battery\" specifically referred to a device composed of multiple cells, however the usage has evolved additionally to include devices composed of a single cell.[4]"},
{"para": "Primary (single-use or \"disposable\") batteries are used once and discarded; the electrode materials are irreversibly changed during discharge. Common examples are the alkaline battery used for flashlights and a multitude of portable electronic devices. Secondary (rechargeable) batteries can be discharged and recharged multiple
...
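Once the file exists, it can be read back with Python's json module. The sketch below is self-contained, so it first writes a small stand-in for joe.json to a temporary directory; the three sample items are shortened placeholders, not real spider output.

```python
import json
import os
import tempfile

# Stand-in for the file written by `-o joe.json`: a JSON array of {"para": ...} objects.
sample = [
    {"para": "An electric battery is a device consisting of one or more electrochemical cells."},
    {"para": ""},
    {"para": "Primary batteries are used once and discarded."},
]

path = os.path.join(tempfile.mkdtemp(), "joe.json")
with open(path, "w", encoding="utf-8") as f:
    json.dump(sample, f)

# Load the items back and drop paragraphs that were empty after strip().
with open(path, encoding="utf-8") as f:
    items = json.load(f)

paragraphs = [item["para"] for item in items if item["para"]]
print(len(paragraphs), "non-empty paragraphs")
```

With a real crawl, skip the writing step and point path at the file the spider produced.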

Scraping Multiple Elements

So far the web crawler has scraped the title and one kind of element from the page. Scrapy can also extract information from different types of elements in one script.

Let’s extract the top IMDb Box Office hits for a weekend. This information is pulled from http://www.imdb.com/chart/boxoffice, where it sits in a table with one row per movie and a column for each metric.

The parse() method can extract more than one field from the row. Using the Chrome Developer Tools you can find the elements nested inside the table.

...
    def parse(self, response):
        for e in response.css('div#boxoffice>table>tbody>tr'):
            yield {
                'title': ''.join(e.css('td.titleColumn>a::text').extract()).strip(),
                'weekend': ''.join(e.css('td.ratingColumn')[0].css('::text').extract()).strip(),
                'gross': ''.join(e.css('td.ratingColumn')[1].css('span.secondaryInfo::text').extract()).strip(),
                'weeks': ''.join(e.css('td.weeksColumn::text').extract()).strip(),
                'image': e.css('td.posterColumn img::attr(src)').extract_first(),
            }
...

The image selector specifies that img is a descendant of td.posterColumn. To extract the right attribute, use the expression ::attr(src).

Running the spider returns JSON:

[
{"gross": "$93.8M", "weeks": "1", "weekend": "$93.8M", "image": "https://images-na.ssl-images-amazon.com/images/M/MV5BYWVhZjZkYTItOGIwYS00NmRkLWJlYjctMWM0ZjFmMDU4ZjEzXkEyXkFqcGdeQXVyMTMxODk2OTU@._V1_UY67_CR0,0,45,67_AL_.jpg", "title": "Justice League"},
{"gross": "$27.5M", "weeks": "1", "weekend": "$27.5M", "image": "https://images-na.ssl-images-amazon.com/images/M/MV5BYjFhOWY0OTgtNDkzMC00YWJkLTk1NGEtYWUxNjhmMmQ5ZjYyXkEyXkFqcGdeQXVyMjMxOTE0ODA@._V1_UX45_CR0,0,45,67_AL_.jpg", "title": "Wonder"},
{"gross": "$247.3M", "weeks": "3", "weekend": "$21.7M", "image": "https://images-na.ssl-images-amazon.com/images/M/MV5BMjMyNDkzMzI1OF5BMl5BanBnXkFtZTgwODcxODg5MjI@._V1_UY67_CR0,0,45,67_AL_.jpg", "title": "Thor: Ragnarok"},
...
]

More Web Scrapers and Bots

Scrapy is a full-featured library that can do just about any kind of web crawling that you ask it to. When it comes to finding information in HTML elements, combined with the support of Python, it’s hard to beat. Whether you’re building a web crawler or learning about the basics of web scraping, the only limit is how much you’re willing to learn.

If you’re looking for more ways to build crawlers or bots you can try to build Twitter and Instagram bots using Python. Python can build some amazing things in web development, so it’s worth going beyond web crawlers when exploring this language.
