The main interface of finscraper is the Spider API, which aims to make web scraping as easy as possible. Each spider class represents a specific type of content that can be scraped, and nothing more.


If you need more flexibility in terms of the content to be scraped, you might want to implement your own spiders with the lower-level library Scrapy.


Some spiders have their own parameters, which typically describe the content to be crawled. All spiders also share common parameters, such as jobdir, log_level and progress_bar; these mainly control the working directory and verbosity of the spider.
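For example, instantiating a spider with these common parameters might look roughly like this. ILArticle, one of finscraper's spider classes, is used here; the exact accepted values are an assumption and worth checking against the API reference.

```python
from finscraper.spiders import ILArticle

# A sketch, assuming the common parameters are passed as keyword
# arguments at construction time (the parameter values are illustrative).
spider = ILArticle(
    jobdir='ilarticle_job',  # working directory for state and scraped items
    log_level='info',        # logging verbosity
    progress_bar=True,       # show a progress bar while scraping
)
```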


The scrape method starts making HTTP requests and parsing the desired items. In most cases, the spider keeps track of already visited pages to avoid fetching the same content more than once.
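A minimal usage sketch, assuming scrape takes the number of items to fetch and using the spider instance from above:

```python
# Fetch roughly ten items; calling scrape again continues from the
# saved state instead of re-fetching already visited pages.
spider.scrape(10)
```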

The workflow of a typical spider in pseudocode would look something like:

1. Start from a certain URL

2. While there are links to follow:
    a) Find new links to follow
    b) Find links with desired content and parse them
    c) Stop if a stopping condition is met (e.g. the number of scraped items)
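The same loop can be sketched in plain Python as below. This is purely illustrative: all of the helper functions are hypothetical, and finscraper delegates the actual crawling to Scrapy.

```python
def crawl(start_url, max_items, fetch, extract_links, parse_item, is_item_page):
    """Illustrative crawl loop, not finscraper's implementation."""
    to_visit = [start_url]
    visited = set()
    items = []
    while to_visit and len(items) < max_items:   # c) stopping condition
        url = to_visit.pop()
        if url in visited:
            continue                             # skip already visited pages
        visited.add(url)
        page = fetch(url)                        # HTTP request
        to_visit.extend(extract_links(page))     # a) find new links to follow
        if is_item_page(url):
            items.append(parse_item(page))       # b) parse desired content
    return items
```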


Some websites try to prevent bots from crawling them, and violating their rules could lead to your IP address being banned. By default, finscraper obeys website-specific crawling rules (robots.txt) to avoid this.
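This default corresponds to Scrapy's standard ROBOTSTXT_OBEY setting; for reference, the equivalent configuration in a plain Scrapy project would be roughly:

```python
# settings.py of a plain Scrapy project -- shown only for reference,
# finscraper applies the robots.txt behaviour for you by default.
ROBOTSTXT_OBEY = True    # respect the site's robots.txt rules
DOWNLOAD_DELAY = 1.0     # optional politeness delay between requests
```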


Scraped items and the state of the spider are saved to disk in the directory defined by jobdir. By default, this is your temporary directory.

The items are saved in JSON lines format into a single file at spider.items_save_path. When you call the get method, the data is read from disk and returned as a pandas.DataFrame.
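For example, continuing with the spider from above:

```python
df = spider.get()              # scraped items as a pandas.DataFrame
print(df.shape)                # (number of items, number of columns)
print(spider.items_save_path)  # path of the underlying JSON lines file
```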

Save & load

You can save and load spiders to continue scraping later on. Moreover, you can use the clear method to reset the state and scraped items of an existing spider.
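A sketch of that round trip, under the assumption that save returns the job directory and load is a class method that accepts it (check the API reference for the exact signatures):

```python
from finscraper.spiders import ILArticle

spider = ILArticle()
spider.scrape(10)

# Persist the spider's state and items, then restore and continue later.
jobdir = spider.save()           # assumed to return the job directory
spider = ILArticle.load(jobdir)
spider.scrape(10)                # continues from the previous state

# Reset the state and remove the scraped items of an existing spider.
spider.clear()
```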