Scrapy is an open source Python web crawler that can scrape hundreds of pages in a single run. Because it is robust and asynchronous, it is well suited to scraping large websites.
Scrapy uses a combination of XPath and CSS selectors to scrape the contents of a web page. It also supports regular expressions through the re() method on selectors, which returns the parts of the selected text that match a pattern.
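Here is a minimal sketch of a spider that combines the three techniques; the quotes.toscrape.com URL (Scrapy's demo site) and the field names are only illustrative:

```python
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ["https://quotes.toscrape.com"]

    def parse(self, response):
        for quote in response.css("div.quote"):
            yield {
                # CSS selector
                "text": quote.css("span.text::text").get(),
                # XPath selector
                "author": quote.xpath(".//small[@class='author']/text()").get(),
                # re() runs a regular expression over the selected text
                "tags": quote.css("div.tags a.tag::text").re(r"\w+"),
            }
```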
It has an asynchronous scheduler that runs many requests concurrently instead of one after the other, which improves overall efficiency and reduces the time a crawler spends waiting for responses.
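How much concurrency the scheduler is allowed is controlled by a handful of settings in settings.py; the values below are illustrative rather than recommendations:

```python
# settings.py (sketch): tune these to the site you are crawling
CONCURRENT_REQUESTS = 32             # total requests in flight across all domains
CONCURRENT_REQUESTS_PER_DOMAIN = 8   # cap for any single domain
DOWNLOAD_DELAY = 0.25                # seconds to wait between requests to the same site
```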
The data can be saved to files in various formats such as JSON, XML and CSV, which makes the scraped information easy to feed into other tools.
To save the data, you define an "Item" in Scrapy that describes the fields you want to capture, then export the scraped items to a file whose extension matches the format you want the data stored in.
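A minimal Item declaration might look like this (the class and field names are illustrative):

```python
import scrapy


class QuoteItem(scrapy.Item):
    # Each Field() declares an attribute the spider is allowed to populate
    text = scrapy.Field()
    author = scrapy.Field()
    tags = scrapy.Field()
```

Running the spider with an output file, for example scrapy crawl quotes -o quotes.json, then exports the collected items, with the format inferred from the file extension. A couple of settings give you finer control over the export: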
FEED_FORMAT is an optional setting that fixes the format in which Scrapy will store the exported data (JSON, JSON Lines, CSV or XML), while the companion FEED_URI setting specifies the location of the file where the extracted data should be written; recent Scrapy releases fold both into a single FEEDS setting.
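A sketch of what those settings look like in settings.py (the file paths are illustrative):

```python
# Legacy feed-export settings, still common in older projects and tutorials
FEED_FORMAT = "json"
FEED_URI = "quotes.json"

# Equivalent FEEDS form introduced in Scrapy 2.1, which replaces the two above
# FEEDS = {
#     "quotes.json": {"format": "json"},
# }
```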
Once an Item has been scraped, Scrapy processes it through the item pipeline before it is written out. This is a collection of Python classes that are executed sequentially to handle the items being scraped by your spiders.
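A minimal pipeline sketch that drops items missing a required field; the field name and the drop-on-missing behaviour are assumptions for illustration, not part of any built-in pipeline:

```python
from scrapy.exceptions import DropItem


class RequireTextPipeline:
    def process_item(self, item, spider):
        # Discard items without a "text" value; everything else passes through
        if not item.get("text"):
            raise DropItem("Missing 'text' field")
        return item
```

Pipelines are enabled through the ITEM_PIPELINES setting, for example ITEM_PIPELINES = {"myproject.pipelines.RequireTextPipeline": 300}, where the number sets the order in which pipelines run (the myproject path is a placeholder for your own project).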
You can also customize how responses and items are processed with middlewares. The middlewares are a series of hooks into Scrapy's spider processing mechanism which let you plug in custom functionality to handle your spiders' responses and items.
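As a hedged sketch, a spider middleware that tags every scraped item with the URL it came from might look like this; the class name and the source_url field are illustrative:

```python
class SourceUrlMiddleware:
    def process_spider_output(self, response, result, spider):
        # result contains the items and follow-up requests yielded by the spider
        for item_or_request in result:
            if isinstance(item_or_request, dict):
                item_or_request["source_url"] = response.url
            yield item_or_request
```

It would be activated through the SPIDER_MIDDLEWARES setting, just as pipelines are activated through ITEM_PIPELINES.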
The scrapy-testmaster library, for example, contains several middlewares that automatically validate your spiders' results and write unit tests for them. It does this through custom logic that checks each spider's output against a set of criteria: if the output is acceptable, the testmaster parse function is called to write the unit tests; if it is not, an informative error message is written instead.
Scrapy also has a shell mode, an interactive Python console used to check the XPath or CSS selectors you plan to run against a web page. This is useful for testing a scraper's extraction logic before using the spider to scrape the rest of the data.
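A typical session, started with scrapy shell "https://quotes.toscrape.com", might try selectors like these (the selectors themselves are illustrative):

```python
# Inside the Scrapy shell, response is already bound to the fetched page
response.css("span.text::text").get()                        # text of the first quote, or None
response.xpath("//small[@class='author']/text()").getall()   # every author name on the page
view(response)                                               # open the downloaded page in a browser
```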
The AutoThrottle extension is a handy tool for managing how aggressively your spiders crawl. It works by monitoring a website's download latency and automatically adjusting the number of simultaneous requests to avoid overloading the server you are crawling.
For example, if you scrape StackOverflow and the site's rate limiter is in place, the AutoThrottle extension helps keep your spider below the rate limit by limiting the number of requests it can make at any given time.
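The extension is configured through a few settings in settings.py; the numbers below are illustrative and should be tuned to the site you are crawling:

```python
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1.0         # initial download delay in seconds
AUTOTHROTTLE_MAX_DELAY = 30.0          # highest delay used when latency spikes
AUTOTHROTTLE_TARGET_CONCURRENCY = 2.0  # average parallel requests per remote server
AUTOTHROTTLE_DEBUG = True              # log every throttling adjustment
```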