finscraper package

Submodules

finscraper.extensions module

Module for Scrapy extensions.

class finscraper.extensions.ProgressBar(crawler)

Bases: object

Scrapy extension that displays a progress bar.

Enabled via the PROGRESS_BAR_ENABLED Scrapy setting.
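
The spiders in finscraper.spiders manage this extension through their progress_bar parameter, but as a minimal sketch, the relevant Scrapy settings would look as follows (the priority value 500 is illustrative):

   settings = {
       'EXTENSIONS': {'finscraper.extensions.ProgressBar': 500},
       'PROGRESS_BAR_ENABLED': True,
   }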

classmethod from_crawler(crawler)
on_error(failure, response, spider)
on_item_scraped(item, spider)
on_response(response, request, spider)
spider_closed(spider)
spider_opened(spider)

finscraper.middlewares module

Module for Scrapy middlewares.

class finscraper.middlewares.SeleniumCallbackMiddleware(settings)

Bases: object

Middleware that processes requests with a given callback.

Headless mode can be disabled via the DISABLE_HEADLESS Scrapy setting. In non-headless mode, the window can be minimized via the MINIMIZE_WINDOW Scrapy setting.
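
As a minimal sketch, the relevant settings might look as follows (placing the class under DOWNLOADER_MIDDLEWARES is inferred from the process_request hook, and the priority 800 is illustrative):

   settings = {
       'DOWNLOADER_MIDDLEWARES': {
           'finscraper.middlewares.SeleniumCallbackMiddleware': 800,
       },
       'DISABLE_HEADLESS': True,   # run Chrome with a visible window
       'MINIMIZE_WINDOW': True,    # and minimize that window
   }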

classmethod from_crawler(crawler)
process_request(request, spider)
spider_closed(spider)
spider_opened(spider)

finscraper.pipelines module

Module for Scrapy pipelines.

class finscraper.pipelines.DefaultValueNonePipeline

Bases: object

Pipeline that sets the default value of every item field to None.

process_item(item, spider)
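
The idea, as a minimal sketch (not necessarily the package's exact source):

   class DefaultValueNonePipeline:

       def process_item(self, item, spider):
           # Fill in None for every declared field the spider
           # left unpopulated.
           for field in item.fields:
               item.setdefault(field, None)
           return item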

finscraper.request module

Module for custom Scrapy request components.

class finscraper.request.SeleniumCallbackRequest(*args, **kwargs)

Bases: Request

Process a request with a given callback using Selenium.

Parameters

selenium_callback (func or None, optional) – Function that will be called with the Chrome webdriver. The function should accept the parameters (request, spider, driver) and return a request, a response or None. If None, the driver is used to fetch the page and a response is returned. Defaults to None.
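
For instance, a hedged sketch of a callback that scrolls a page before building the response (the helper name and URL are illustrative):

   from scrapy.http import HtmlResponse

   from finscraper.request import SeleniumCallbackRequest

   def scroll_and_capture(request, spider, driver):
       # Load the page, scroll to the bottom, and build a response
       # from whatever the browser then renders.
       driver.get(request.url)
       driver.execute_script(
           'window.scrollTo(0, document.body.scrollHeight);')
       return HtmlResponse(driver.current_url, body=driver.page_source,
                           encoding='utf-8', request=request)

   request = SeleniumCallbackRequest(
       'https://example.com', selenium_callback=scroll_and_capture)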

attributes: Tuple[str, ...] = ('url', 'callback', 'method', 'headers', 'body', 'cookies', 'meta', 'encoding', 'priority', 'dont_filter', 'errback', 'flags', 'cb_kwargs')

A tuple of str objects containing the name of all public attributes of the class that are also keyword parameters of the __init__ method.

Currently used by Request.replace(), Request.to_dict() and request_from_dict().

property body: bytes
property cb_kwargs: dict
copy() Request
property encoding: str
classmethod from_curl(curl_command: str, ignore_unknown_options: bool = True, **kwargs) RequestTypeVar

Create a Request object from a string containing a cURL command. It populates the HTTP method, the URL, the headers, the cookies and the body. It accepts the same arguments as the Request class, taking preference and overriding the values of the same arguments contained in the cURL command.

Unrecognized options are ignored by default. To raise an error when finding unknown options call this method by passing ignore_unknown_options=False.

Caution

Using from_curl() from Request subclasses, such as JSONRequest, or XmlRpcRequest, as well as having downloader middlewares and spider middlewares enabled, such as DefaultHeadersMiddleware, UserAgentMiddleware, or HttpCompressionMiddleware, may modify the Request object.

To translate a cURL command into a Scrapy request, you may use curl2scrapy.
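
For example (the cURL string is illustrative):

   from finscraper.request import SeleniumCallbackRequest

   request = SeleniumCallbackRequest.from_curl(
       "curl 'https://example.com/api' -H 'Accept: application/json'")
   print(request.method, request.url)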

property meta: dict
replace(*args, **kwargs) Request

Create a new Request with the same attributes except for those given new values

to_dict(*, spider: Optional[Spider] = None) dict

Return a dictionary containing the Request’s data.

Use request_from_dict() to convert back into a Request object.

If a spider is given, this method will try to find out the name of the spider methods used as callback and errback and include them in the output dict, raising an exception if they cannot be found.

property url: str

finscraper.settings module

Module for finscraper’s default Scrapy settings.

finscraper.spiders module

Module for Spider API - the main interface of finscraper.

class finscraper.spiders.ILArticle(jobdir=None, progress_bar=True, log_level=None)

Bases: _SpiderWrapper

Fetch Iltalehti news articles.

Parameters
  • jobdir (str or None, optional) – Working directory of the spider. Defaults to None, which creates a temp directory to be used. Note that this directory will only be deleted through the clear method!

  • progress_bar (bool, optional) – Whether to enable progress bar or not. This parameter is ignored if log_level is not None. Defaults to True.

  • log_level (str or None, optional) –

    Logging level to display. Should be in [‘debug’, ‘info’, ‘warn’, ‘error’, ‘critical’] or None (disabled). Defaults to None.

    Note

    This parameter can be overridden through Scrapy settings (LOG_LEVEL, LOG_ENABLED) within the scrape method.

Returned fields:
  • url (str): URL of the scraped web page.

  • time (int): UNIX timestamp of the scraping.

  • title (str): Title of the article.

  • ingress (str): Ingress (lead paragraph) of the article.

  • content (str): Contents of the article.

  • published (str): Publish time of the article.

  • author (str): Author of the article.

  • images (list of dict): Images of the article.
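
A minimal usage sketch (scraped content depends on the live site):

   from finscraper.spiders import ILArticle

   spider = ILArticle()      # a temp jobdir is created automatically
   spider.scrape(n=10)       # attempt to scrape 10 articles
   df = spider.get()         # items as a pandas DataFrame
   spider.clear()            # remove the jobdir contents when done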

clear()

Clear contents of jobdir.

get(fmt='df')

Return scraped data as DataFrame or list.

Parameters

fmt (str, optional) – Format to return parsed items as. Should be in [‘df’, ‘list’]. Defaults to ‘df’.

Returns

If fmt = 'df', DataFrame of scraped items. If fmt = 'list', list of dict of scraped items.

Raises

ValueError – If fmt not in allowed values.

property items_save_path

Save path of the scraped items.

Cannot be changed after initialization of a spider.

property jobdir

Working directory of the spider.

Can be changed after initialization of a spider.

classmethod load(jobdir)

Load existing spider from jobdir.

Parameters

jobdir (str) – Path to job directory.

Returns

Spider loaded from job directory.

property log_level

Logging level of the spider.

This attribute can be changed after initialization of a spider.

property progress_bar

Whether progress bar is enabled or not.

Can be changed after initialization of a spider.

save()

Save spider in jobdir for later use.

Returns

Path to job directory.

Return type

str

scrape(n=10, timeout=60, settings=None)

Scrape given number of items.

Parameters
  • n (int, optional) – Number of items to attempt to scrape. Zero corresponds to no limit. Defaults to 10.

  • timeout (int, optional) – Timeout in seconds to wait before stopping the spider. Zero corresponds to no limit. Defaults to 60.

  • settings (dict or None, optional) – Scrapy spider settings to use. Defaults to None, which corresponds to the default settings. See the list of available settings at: https://docs.scrapy.org/en/latest/topics/settings.html.

Returns

self

property spider_save_path

Save path of the spider.

Cannot be changed after initialization of a spider.

class finscraper.spiders.ISArticle(jobdir=None, progress_bar=True, log_level=None)

Bases: _SpiderWrapper

Fetch IltaSanomat news articles.

Parameters
  • jobdir (str or None, optional) – Working directory of the spider. Defaults to None, which creates a temp directory to be used. Note that this directory will only be deleted through the clear method!

  • progress_bar (bool, optional) – Whether to enable progress bar or not. This parameter is ignored if log_level is not None. Defaults to True.

  • log_level (str or None, optional) –

    Logging level to display. Should be in [‘debug’, ‘info’, ‘warn’, ‘error’, ‘critical’] or None (disabled). Defaults to None.

    Note

    This parameter can be overridden through Scrapy settings (LOG_LEVEL, LOG_ENABLED) within the scrape method.

Returned fields:
  • url (str): URL of the scraped web page.

  • time (int): UNIX timestamp of the scraping.

  • title (str): Title of the article.

  • ingress (str): Ingress (lead paragraph) of the article.

  • content (str): Contents of the article.

  • published (str): Publish time of the article.

  • author (str): Author of the article.

  • images (list of dict): Images of the article.
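
A minimal sketch of persisting and resuming a spider (the resume semantics are inferred from the save and load methods documented below):

   from finscraper.spiders import ISArticle

   spider = ISArticle()
   spider.scrape(n=5)
   jobdir = spider.save()             # returns the path to the job directory

   resumed = ISArticle.load(jobdir)   # later: resume from the same jobdir
   items = resumed.scrape(n=5).get(fmt='list')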

clear()

Clear contents of jobdir.

get(fmt='df')

Return scraped data as DataFrame or list.

Parameters

fmt (str, optional) – Format to return parsed items as. Should be in [‘df’, ‘list’]. Defaults to ‘df’.

Returns

If fmt = 'df', DataFrame of scraped items. If fmt = 'list', list of dict of scraped items.

Raises

ValueError – If fmt not in allowed values.

property items_save_path

Save path of the scraped items.

Cannot be changed after initialization of a spider.

property jobdir

Working directory of the spider.

Can be changed after initialization of a spider.

classmethod load(jobdir)

Load existing spider from jobdir.

Parameters

jobdir (str) – Path to job directory.

Returns

Spider loaded from job directory.

property log_level

Logging level of the spider.

This attribute can be changed after initialization of a spider.

property progress_bar

Whether progress bar is enabled or not.

Can be changed after initialization of a spider.

save()

Save spider in jobdir for later use.

Returns

Path to job directory.

Return type

str

scrape(n=10, timeout=60, settings=None)

Scrape given number of items.

Parameters
  • n (int, optional) – Number of items to attempt to scrape. Zero corresponds to no limit. Defaults to 10.

  • timeout (int, optional) – Timeout in seconds to wait before stopping the spider. Zero corresponds to no limit. Defaults to 60.

  • settings (dict or None, optional) – Scrapy spider settings to use. Defaults to None, which corresponds to the default settings. See the list of available settings at: https://docs.scrapy.org/en/latest/topics/settings.html.

Returns

self

property spider_save_path

Save path of the spider.

Cannot be changed after initialization of a spider.

class finscraper.spiders.MNetPage(jobdir=None, progress_bar=True, log_level=None)

Bases: _SpiderWrapper

Fetch threads from muusikoiden.net discussions.

Parameters
  • jobdir (str or None, optional) – Working directory of the spider. Defaults to None, which creates a temp directory to be used. Note that this directory will only be deleted through the clear method!

  • progress_bar (bool, optional) – Whether to enable progress bar or not. This parameter is ignored if log_level is not None. Defaults to True.

  • log_level (str or None, optional) –

    Logging level to display. Should be in [‘debug’, ‘info’, ‘warn’, ‘error’, ‘critical’] or None (disabled). Defaults to None.

    Note

    This parameter can be overridden through Scrapy settings (LOG_LEVEL, LOG_ENABLED) within the scrape method.

Returned page fields:
  • url (str): URL of the thread page.

  • time (int): UNIX timestamp of the scraping.

  • title (str): Title of the thread.

  • page_number (int): Number of the page in the thread.

  • messages (list of dict): Messages on the thread page (see the message fields below).

Returned message fields:
  • author (str): Author of the message.

  • time_posted (str): Publish time of the message.

  • quotes (list of str): List of quotes in the message.

  • content (str): Contents of the message.
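
A minimal usage sketch of reading the nested message fields (scraped content depends on the live site):

   from finscraper.spiders import MNetPage

   pages = MNetPage().scrape(n=10).get(fmt='list')   # scrape returns self
   for page in pages:
       print(page['title'], page['page_number'])
       for message in page['messages']:
           print(' -', message['author'], message['time_posted'])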

clear()

Clear contents of jobdir.

get(fmt='df')

Return scraped data as DataFrame or list.

Parameters

fmt (str, optional) – Format to return parsed items as. Should be in [‘df’, ‘list’]. Defaults to ‘df’.

Returns

If fmt = 'df', DataFrame of scraped items. If fmt = 'list', list of dict of scraped items.

Raises

ValueError – If fmt not in allowed values.

property items_save_path

Save path of the scraped items.

Cannot be changed after initialization of a spider.

property jobdir

Working directory of the spider.

Can be changed after initialization of a spider.

classmethod load(jobdir)

Load existing spider from jobdir.

Parameters

jobdir (str) – Path to job directory.

Returns

Spider loaded from job directory.

property log_level

Logging level of the spider.

This attribute can be changed after initialization of a spider.

property progress_bar

Whether progress bar is enabled or not.

Can be changed after initialization of a spider.

save()

Save spider in jobdir for later use.

Returns

Path to job directory.

Return type

str

scrape(n=10, timeout=60, settings=None)

Scrape given number of items.

Parameters
  • n (int, optional) – Number of items to attempt to scrape. Zero corresponds to no limit. Defaults to 10.

  • timeout (int, optional) – Timeout in seconds to wait before stopping the spider. Zero corresponds to no limit. Defaults to 60.

  • settings (dict or None, optional) – Scrapy spider settings to use. Defaults to None, which corresponds to the default settings. See the list of available settings at: https://docs.scrapy.org/en/latest/topics/settings.html.

Returns

self

property spider_save_path

Save path of the spider.

Cannot be changed after initialization of a spider.

class finscraper.spiders.OikotieApartment(area=None, jobdir=None, progress_bar=True, log_level=None)

Bases: _SpiderWrapper

Fetch oikotie.fi apartments.

Parameters
  • area (str or None, optional) – Scrape listings based on area, e.g. “helsinki” or “hausjärvi”. The final URL will be formed as: ‘https://asunnot.oikotie.fi/myytavat-asunnot/{area}’. Defaults to None.

  • jobdir (str or None, optional) – Working directory of the spider. Defaults to None, which creates a temp directory to be used. Note that this directory will only be deleted through the clear method!

  • progress_bar (bool, optional) – Whether to enable progress bar or not. This parameter is ignored if log_level is not None. Defaults to True.

  • log_level (str or None, optional) –

    Logging level to display. Should be in [‘debug’, ‘info’, ‘warn’, ‘error’, ‘critical’] or None (disabled). Defaults to None.

    Note

    This parameter can be overridden through Scrapy settings (LOG_LEVEL, LOG_ENABLED) within the scrape method.

Returned fields:
  • url (str): URL of the scraped web page.

  • time (int): UNIX timestamp of the scraping.

  • title (str): Title of the web browser tab.

  • overview (str): Overview text of the apartment.

  • contact_person_name (str): Name of the contact person.

  • contact_person_job_title (str): Job title of the contact person.

  • contact_person_phone_number (str): Phone number of the contact person.

  • contact_person_company (str): Company of the contact person.

  • location (str): Sijainti

  • city (str): Kaupunki

  • district (str): Kaupunginosa

  • oikotie_id (str): Kohdenumero

  • floor (str): Kerros

  • life_sq (str): Asuinpinta-ala

  • property_sq (str): Tontin pinta-ala

  • total_sq (str): Kokonaispinta-ala

  • room_info (str): Huoneiston kokoonpano

  • number_of_rooms (str): Huoneita

  • condition (str): Kunto

  • condition_details (str): Kunnon lisätiedot

  • availability (str): Lisätietoa vapautumisesta

  • kitchen_appliances (str): Keittiön varusteet

  • bathroom_appliances (str): Kylpyhuoneen varusteet

  • window_direction (str): Ikkunoiden suunta

  • has_balcony (str): Parveke

  • balcony_details (str): Parvekkeen lisätiedot

  • storage_space (str): Säilytystilat

  • view (str): Näkymät

  • future_renovations (str): Tulevat remontit

  • completed_renovations (str): Tehdyt remontit

  • has_sauna (str): Asunnossa sauna

  • sauna_details (str): Saunan lisätiedot

  • housing_type (str): Asumistyyppi

  • services (str): Palvelut

  • additional_info (str): Lisätiedot

  • property_id (str): Kiinteistötunnus

  • apartment_is (str): Kohde on

  • telecommunication_services (str): Tietoliikennepalvelut

  • price_no_tax (str): Velaton hinta

  • sales_price (str): Myyntihinta

  • shared_loan_payment (str): Lainaosuuden maksu

  • price_per_sq (str): Neliöhinta

  • share_of_liabilities (str): Velkaosuus

  • mortgages (str): Kiinnitykset

  • financial_charge (str): Rahoitusvastike

  • condominium_payment (str): Hoitovastike

  • maintenance_charge (str): Yhtiövastike

  • water_charge (str): Vesimaksu

  • water_charge_details (str): Vesimaksun lisätiedot

  • heating_charge (str): Lämmityskustannukset

  • other_costs (str): Muut kustannukset

  • is_brand_new (str): Uudiskohde

  • housing_company_name (str): Taloyhtiön nimi

  • building_type (str): Rakennuksen tyyppi

  • build_year (str): Rakennusvuosi

  • build_year_details (str): Rakennusvuoden lisätiedot

  • number_of_apartments (str): Huoneistojen lukumäärä

  • total_floors (str): Kerroksia

  • building_has_elevator (str): Hissi

  • building_has_sauna (str): Taloyhtiössä on sauna

  • building_material (str): Rakennusmateriaali

  • roof_type (str): Kattotyyppi

  • energy_class (str): Energialuokka

  • has_energy_certificate (str): Energiatodistus

  • antenna_system (str): Kiinteistön antennijärjestelmä

  • property_size (str): Tontin koko

  • maintenance (str): Kiinteistönhoito

  • real_estate_management (str): Isännöinti

  • plan_info (str): Kaavatiedot

  • plan (str): Kaavatilanne

  • traffic_communication (str): Liikenneyhteydet

  • heating (str): Lämmitys

  • parking_space_description (str): Pysäköintitilan kuvaus

  • common_spaces (str): Yhteiset tilat

  • wallcovering (str): Pintamateriaalit
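
A minimal usage sketch (the area value is illustrative; scraped content depends on the live site):

   from finscraper.spiders import OikotieApartment

   # Scrapes listings under
   # https://asunnot.oikotie.fi/myytavat-asunnot/helsinki
   spider = OikotieApartment(area='helsinki')
   spider.scrape(n=20, timeout=120)
   df = spider.get()
   print(df[['title', 'city', 'district', 'sales_price']].head())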

clear()

Clear contents of jobdir.

get(fmt='df')

Return scraped data as DataFrame or list.

Parameters

fmt (str, optional) – Format to return parsed items as. Should be in [‘df’, ‘list’]. Defaults to ‘df’.

Returns

If fmt = 'df', DataFrame of scraped items. If fmt = 'list', list of dict of scraped items.

Raises

ValueError – If fmt not in allowed values.

property items_save_path

Save path of the scraped items.

Cannot be changed after initialization of a spider.

property jobdir

Working directory of the spider.

Can be changed after initialization of a spider.

classmethod load(jobdir)

Load existing spider from jobdir.

Parameters

jobdir (str) – Path to job directory.

Returns

Spider loaded from job directory.

property log_level

Logging level of the spider.

This attribute can be changed after initialization of a spider.

property progress_bar

Whether progress bar is enabled or not.

Can be changed after initialization of a spider.

save()

Save spider in jobdir for later use.

Returns

Path to job directory.

Return type

str

scrape(n=10, timeout=60, settings=None)

Scrape given number of items.

Parameters
  • n (int, optional) – Number of items to attempt to scrape. Zero corresponds to no limit. Defaults to 10.

  • timeout (int, optional) – Timeout in seconds to wait before stopping the spider. Zero corresponds to no limit. Defaults to 60.

  • settings (dict or None, optional) – Scrapy spider settings to use. Defaults to None, which corresponds to the default settings. See the list of available settings at: https://docs.scrapy.org/en/latest/topics/settings.html.

Returns

self

property spider_save_path

Save path of the spider.

Cannot be changed after initialization of a spider.

class finscraper.spiders.Suomi24Page(jobdir=None, progress_bar=True, log_level=None)

Bases: _SpiderWrapper

Fetch comments from suomi24.fi.

Parameters
  • jobdir (str or None, optional) – Working directory of the spider. Defaults to None, which creates a temp directory to be used. Note that this directory will only be deleted through the clear method!

  • progress_bar (bool, optional) – Whether to enable progress bar or not. This parameter is ignored if log_level is not None. Defaults to True.

  • log_level (str or None, optional) –

    Logging level to display. Should be in [‘debug’, ‘info’, ‘warn’, ‘error’, ‘critical’] or None (disabled). Defaults to None.

    Note

    This parameter can be overridden through Scrapy settings (LOG_LEVEL, LOG_ENABLED) within the scrape method.

Returned page fields:
  • url (str): URL of the scraped web page.

  • time (int): UNIX timestamp of the scraping.

  • title (str): Title of the thread.

  • content (str): Content of the first message.

  • comments (list of dict): Comments of the thread page (see the comment fields below).

  • published (str): Publish time of the thread.

  • author (str): Author of the thread.

  • n_comments (int): Number of comments in the thread.

  • views (str): Number of views.

Returned comment fields:
  • author (str): Author of the comment.

  • date (str): Publish time of the comment.

  • quotes (list of str): List of quotes in the comment.

  • responses (list of dict): Response comments to this comment.

  • content (str): Contents of the comment.

Returned comment response fields:
  • author (str): Author of the comment response.

  • date (str): Publish time of the comment response.

  • quotes (list of str): List of quotes in the comment response.

  • content (str): Contents of the comment response.
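
A minimal usage sketch with logging instead of a progress bar (scraped content depends on the live site):

   from finscraper.spiders import Suomi24Page

   spider = Suomi24Page(log_level='info')   # progress_bar is then ignored
   spider.scrape(n=10, timeout=30)
   for thread in spider.get(fmt='list'):
       print(thread['title'], thread['n_comments'])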

clear()

Clear contents of jobdir.

get(fmt='df')

Return scraped data as DataFrame or list.

Parameters

fmt (str, optional) – Format to return parsed items as. Should be in [‘df’, ‘list’]. Defaults to ‘df’.

Returns

If fmt = 'df', DataFrame of scraped items. If fmt = 'list', list of dict of scraped items.

Raises

ValueError – If fmt not in allowed values.

property items_save_path

Save path of the scraped items.

Cannot be changed after initialization of a spider.

property jobdir

Working directory of the spider.

Can be changed after initialization of a spider.

classmethod load(jobdir)

Load existing spider from jobdir.

Parameters

jobdir (str) – Path to job directory.

Returns

Spider loaded from job directory.

property log_level

Logging level of the spider.

This attribute can be changed after initialization of a spider.

property progress_bar

Whether progress bar is enabled or not.

Can be changed after initialization of a spider.

save()

Save spider in jobdir for later use.

Returns

Path to job directory.

Return type

str

scrape(n=10, timeout=60, settings=None)

Scrape given number of items.

Parameters
  • n (int, optional) – Number of items to attempt to scrape. Zero corresponds to no limit. Defaults to 10.

  • timeout (int, optional) – Timeout in seconds to wait before stopping the spider. Zero corresponds to no limit. Defaults to 60.

  • settings (dict or None, optional) – Scrapy spider settings to use. Defaults to None, which corresponds to the default settings. See the list of available settings at: https://docs.scrapy.org/en/latest/topics/settings.html.

Returns

self

property spider_save_path

Save path of the spider.

Cannot be changed after initialization of a spider.

class finscraper.spiders.ToriDeal(jobdir=None, progress_bar=True, log_level=None)

Bases: _SpiderWrapper

Fetch deals from tori.fi.

Parameters
  • jobdir (str or None, optional) – Working directory of the spider. Defaults to None, which creates a temp directory to be used. Note that this directory will only be deleted through the clear method!

  • progress_bar (bool, optional) – Whether to enable progress bar or not. This parameter is ignored if log_level is not None. Defaults to True.

  • log_level (str or None, optional) –

    Logging level to display. Should be in [‘debug’, ‘info’, ‘warn’, ‘error’, ‘critical’] or None (disabled). Defaults to None.

    Note

    This parameter can be overridden through Scrapy settings (LOG_LEVEL, LOG_ENABLED) within the scrape method.

Returned fields:
  • url (str): URL of the scraped web page.

  • time (int): UNIX timestamp of the scraping.

  • seller (str): Seller of the item.

  • name (str): Name of the item.

  • description (list of str): Description of the item.

  • price (str): Price of the item.

  • type (str): Type of the deal.

  • published (str): Publish time of the deal.

  • images (list of dict): Images of the item.
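
A minimal usage sketch passing custom Scrapy settings through scrape (the setting value is illustrative):

   from finscraper.spiders import ToriDeal

   spider = ToriDeal()
   spider.scrape(n=10, settings={'DOWNLOAD_DELAY': 1.0})
   df = spider.get()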

clear()

Clear contents of jobdir.

get(fmt='df')

Return scraped data as DataFrame or list.

Parameters

fmt (str, optional) – Format to return parsed items as. Should be in [‘df’, ‘list’]. Defaults to ‘df’.

Returns

If fmt = 'df', DataFrame of scraped items. If fmt = 'list', list of dict of scraped items.

Raises

ValueError – If fmt not in allowed values.

property items_save_path

Save path of the scraped items.

Cannot be changed after initialization of a spider.

property jobdir

Working directory of the spider.

Can be changed after initialization of a spider.

classmethod load(jobdir)

Load existing spider from jobdir.

Parameters

jobdir (str) – Path to job directory.

Returns

Spider loaded from job directory.

property log_level

Logging level of the spider.

This attribute can be changed after initialization of a spider.

property progress_bar

Whether progress bar is enabled or not.

Can be changed after initialization of a spider.

save()

Save spider in jobdir for later use.

Returns

Path to job directory.

Return type

str

scrape(n=10, timeout=60, settings=None)

Scrape given number of items.

Parameters
  • n (int, optional) – Number of items to attempt to scrape. Zero corresponds to no limit. Defaults to 10.

  • timeout (int, optional) – Timeout in seconds to wait before stopping the spider. Zero corresponds to no limit. Defaults to 60.

  • settings (dict or None, optional) – Scrapy spider settings to use. Defaults to None, which corresponds to the default settings. See the list of available settings at: https://docs.scrapy.org/en/latest/topics/settings.html.

Returns

self

property spider_save_path

Save path of the spider.

Cannot be changed after initialization of a spider.

class finscraper.spiders.VauvaPage(jobdir=None, progress_bar=True, log_level=None)

Bases: _SpiderWrapper

Fetch comments from vauva.fi.

Parameters
  • jobdir (str or None, optional) – Working directory of the spider. Defaults to None, which creates a temp directory to be used. Note that this directory will only be deleted through the clear method!

  • progress_bar (bool, optional) – Whether to enable progress bar or not. This parameter is ignored if log_level is not None. Defaults to True.

  • log_level (str or None, optional) –

    Logging level to display. Should be in [‘debug’, ‘info’, ‘warn’, ‘error’, ‘critical’] or None (disabled). Defaults to None.

    Note

    This parameter can be overridden through Scrapy settings (LOG_LEVEL, LOG_ENABLED) within the scrape method.

Returned page fields:
  • url (str): URL of the scraped web page.

  • time (int): UNIX timestamp of the scraping.

  • title (str): Title of the thread.

  • page (int): Page number of the thread.

  • pages (int): Total number of pages in the thread.

  • comments (list of dict): Comments of the thread page (see the comment fields below).

  • published (str): Publish time of the thread.

  • author (str): Author of the thread.

Returned comment fields:
  • author (str): Author of the comment.

  • date (str): Publish time of the comment.

  • quotes (list of str): List of quotes in the comment.

  • content (str): Contents of the comment.

  • upvotes (int): Upvotes of the comment.

  • downvotes (int): Downvotes of the comment.
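
A minimal usage sketch where the timeout, not the item count, stops the run (scraped content depends on the live site):

   from finscraper.spiders import VauvaPage

   spider = VauvaPage()
   spider.scrape(n=0, timeout=60)   # n=0 means no item limit
   df = spider.get()
   print(df[['title', 'page', 'pages']].head())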

clear()

Clear contents of jobdir.

get(fmt='df')

Return scraped data as DataFrame or list.

Parameters

fmt (str, optional) – Format to return parsed items as. Should be in [‘df’, ‘list’]. Defaults to ‘df’.

Returns

If fmt = 'df', DataFrame of scraped items. If fmt = 'list', list of dict of scraped items.

Raises

ValueError – If fmt not in allowed values.

property items_save_path

Save path of the scraped items.

Cannot be changed after initialization of a spider.

property jobdir

Working directory of the spider.

Can be changed after initialization of a spider.

classmethod load(jobdir)

Load existing spider from jobdir.

Parameters

jobdir (str) – Path to job directory.

Returns

Spider loaded from job directory.

property log_level

Logging level of the spider.

This attribute can be changed after initialization of a spider.

property progress_bar

Whether progress bar is enabled or not.

Can be changed after initialization of a spider.

save()

Save spider in jobdir for later use.

Returns

Path to job directory.

Return type

str

scrape(n=10, timeout=60, settings=None)

Scrape given number of items.

Parameters
  • n (int, optional) – Number of items to attempt to scrape. Zero corresponds to no limit. Defaults to 10.

  • timeout (int, optional) – Timeout in seconds to wait before stopping the spider. Zero corresponds to no limit. Defaults to 60.

  • settings (dict or None, optional) – Scrapy spider settings to use. Defaults to None, which corresponds to the default settings. See the list of available settings at: https://docs.scrapy.org/en/latest/topics/settings.html.

Returns

self

property spider_save_path

Save path of the spider.

Cannot be changed after initialization of a spider.

class finscraper.spiders.YLEArticle(jobdir=None, progress_bar=True, log_level=None)

Bases: _SpiderWrapper

Fetch YLE news articles.

Parameters
  • jobdir (str or None, optional) – Working directory of the spider. Defaults to None, which creates a temp directory to be used. Note that this directory will only be deleted through the clear method!

  • progress_bar (bool, optional) – Whether to enable progress bar or not. This parameter is ignored if log_level is not None. Defaults to True.

  • log_level (str or None, optional) –

    Logging level to display. Should be in [‘debug’, ‘info’, ‘warn’, ‘error’, ‘critical’] or None (disabled). Defaults to None.

    Note

    This parameter can be overridden through Scrapy settings (LOG_LEVEL, LOG_ENABLED) within the scrape method.

Returned fields:
  • url (str): URL of the scraped web page.

  • time (int): UNIX timestamp of the scraping.

  • title (str): Title of the article.

  • ingress (str): Ingress (lead paragraph) of the article.

  • content (str): Contents of the article.

  • published (str): Publish time of the article.

  • author (str): Author of the article.

  • images (list of dict): Images of the article.
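
A minimal usage sketch with an explicit jobdir so results persist across sessions (the path is illustrative):

   from finscraper.spiders import YLEArticle

   spider = YLEArticle(jobdir='yle_jobdir')
   spider.scrape(n=10)
   df = spider.get()
   print(df[['title', 'published', 'author']].head())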

clear()

Clear contents of jobdir.

get(fmt='df')

Return scraped data as DataFrame or list.

Parameters

fmt (str, optional) – Format to return parsed items as. Should be in [‘df’, ‘list’]. Defaults to ‘df’.

Returns

If fmt = 'df', DataFrame of scraped items. If fmt = 'list', list of dict of scraped items.

Raises

ValueError – If fmt not in allowed values.

property items_save_path

Save path of the scraped items.

Cannot be changed after initialization of a spider.

property jobdir

Working directory of the spider.

Can be changed after initialization of a spider.

classmethod load(jobdir)

Load existing spider from jobdir.

Parameters

jobdir (str) – Path to job directory.

Returns

Spider loaded from job directory.

property log_level

Logging level of the spider.

This attribute can be changed after initialization of a spider.

property progress_bar

Whether progress bar is enabled or not.

Can be changed after initialization of a spider.

save()

Save spider in jobdir for later use.

Returns

Path to job directory.

Return type

str

scrape(n=10, timeout=60, settings=None)

Scrape given number of items.

Parameters
  • n (int, optional) – Number of items to attempt to scrape. Zero corresponds to no limit. Defaults to 10.

  • timeout (int, optional) – Timeout in seconds to wait before stopping the spider. Zero corresponds to no limit. Defaults to 60.

  • settings (dict or None, optional) – Scrapy spider settings to use. Defaults to None, which corresponds to the default settings. See the list of available settings at: https://docs.scrapy.org/en/latest/topics/settings.html.

Returns

self

property spider_save_path

Save path of the spider.

Cannot be changed after initialization of a spider.

finscraper.text_utils module

Module for text processing utility functions and classes.

finscraper.text_utils.drop_empty_elements(text_list)
finscraper.text_utils.paragraph_join(text_list)
finscraper.text_utils.replace(text, source, target)
finscraper.text_utils.safe_cast_int(text)
finscraper.text_utils.strip_elements(text_list)
finscraper.text_utils.strip_join(text_list, join_with=' ')
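
The behaviors sketched below are inferred from the function names and signatures, and should be checked against the source:

   from finscraper.text_utils import safe_cast_int, strip_join

   # Assumed behavior: elements are stripped and joined with join_with.
   strip_join(['  Hello ', 'world  '])   # -> 'Hello world'
   # Assumed behavior: returns int(text), or None if the cast fails.
   safe_cast_int('42')                   # -> 42
   safe_cast_int('n/a')                  # -> None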

finscraper.utils module

Module for utility functions and classes.

class finscraper.utils.QueueHandler(queue)

Bases: Handler

Sends events to a queue, allowing multiprocessing.

This handler checks for picklability before saving items into the queue. Modified from: https://gist.github.com/vsajip/591589.
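
A minimal sketch of attaching the handler to a logger:

   import logging
   import multiprocessing

   from finscraper.utils import QueueHandler

   # Emit log records into a multiprocessing-safe queue so another
   # process can consume them.
   log_queue = multiprocessing.Queue()
   logger = logging.getLogger('finscraper')
   logger.addHandler(QueueHandler(log_queue))
   logger.warning('hello')   # the record lands in log_queue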

acquire()

Acquire the I/O thread lock.

addFilter(filter)

Add the specified filter to this handler.

close()

Tidy up any resources used by the handler.

This version removes the handler from an internal map of handlers, _handlers, which is used for handler lookup by name. Subclasses should ensure that this gets called from overridden close() methods.

createLock()

Acquire a thread lock for serializing access to the underlying I/O.

emit(record)

Do whatever it takes to actually log the specified logging record.

This version is intended to be implemented by subclasses and so raises a NotImplementedError.

enqueue(record)
filter(record)

Determine if a record is loggable by consulting all the filters.

The default is to allow the record to be logged; any filter can veto this and the record is then dropped. Returns a zero value if a record is to be dropped, else non-zero.

Changed in version 3.2: Allow filters to be just callables.

flush()

Ensure all logging output has been flushed.

This version does nothing and is intended to be implemented by subclasses.

format(record)

Format the specified record.

If a formatter is set, use it. Otherwise, use the default formatter for the module.

get_name()
handle(record)

Conditionally emit the specified logging record.

Emission depends on filters which may have been added to the handler. Wrap the actual emission of the record with acquisition/release of the I/O thread lock. Returns whether the filter passed the record for emission.

handleError(record)

Handle errors which occur during an emit() call.

This method should be called from handlers when an exception is encountered during an emit() call. If raiseExceptions is false, exceptions get silently ignored. This is what is mostly wanted for a logging system - most users will not care about errors in the logging system, they are more interested in application errors. You could, however, replace this with a custom handler if you wish. The record which was being processed is passed in to this method.

property name
prepare(record)
release()

Release the I/O thread lock.

removeFilter(filter)

Remove the specified filter from this handler.

setFormatter(fmt)

Set the formatter for this handler.

setLevel(level)

Set the logging level of this handler. level must be an int or a str.

set_name(name)

class finscraper.utils.TqdmLogger(logger)

Bases: StringIO

File-like object that redirects its buffer to stdout.
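
A minimal sketch of using it as the output stream of a tqdm progress bar:

   import logging

   from tqdm import tqdm

   from finscraper.utils import TqdmLogger

   logger = logging.getLogger('finscraper')
   for _ in tqdm(range(100), file=TqdmLogger(logger)):
       pass   # progress updates flow through the file-like wrapper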

close()

Close the IO object.

Attempting any further operation after the object is closed will raise a ValueError.

This method has no effect if the file is already closed.

closed
detach()

Separate the underlying buffer from the TextIOBase and return it.

After the underlying buffer has been detached, the TextIO is in an unusable state.

encoding

Encoding of the text stream.

Subclasses should override.

errors

The error setting of the decoder or encoder.

Subclasses should override.

fileno()

Returns underlying file descriptor if one exists.

OSError is raised if the IO object does not use a file descriptor.

flush()

Flush write buffers, if applicable.

This is not implemented for read-only and non-blocking streams.

getvalue()

Retrieve the entire contents of the object.

isatty()

Return whether this is an ‘interactive’ stream.

Return False if it can’t be determined.

line_buffering
newlines
read(size=-1, /)

Read at most size characters, returned as a string.

If the argument is negative or omitted, read until EOF is reached. Return an empty string at EOF.

readable()

Returns True if the IO object can be read.

readline(size=-1, /)

Read until newline or EOF.

Returns an empty string if EOF is hit immediately.

readlines(hint=-1, /)

Return a list of lines from the stream.

hint can be specified to control the number of lines read: no more lines will be read if the total size (in bytes/characters) of all lines so far exceeds hint.

seek(pos, whence=0, /)

Change stream position.

Seek to character offset pos relative to position indicated by whence:

  • 0 – Start of stream (the default); pos should be >= 0.

  • 1 – Current position; pos must be 0.

  • 2 – End of stream; pos must be 0.

Returns the new absolute position.

seekable()

Returns True if the IO object can be seeked.

tell()

Tell the current file position.

truncate(pos=None, /)

Truncate size to pos.

The pos argument defaults to the current file position, as returned by tell(). The current file position is unchanged. Returns the new absolute position.

writable()

Returns True if the IO object can be written.

write(buf)

Write string to file.

Returns the number of characters written, which is always equal to the length of the string.

writelines(lines, /)

Write a list of lines to stream.

Line separators are not added, so it is usual for each of the lines provided to have a line separator at the end.

finscraper.utils.get_chromedriver(options=None, settings=None)

Get chromedriver automatically.

Parameters
  • options (selenium.webdriver.chrome.options.Options, optional) – Options to start chromedriver with. If None, will use default settings. Defaults to None.

  • settings (scrapy.settings.Settings, optional) – Scrapy settings to take into consideration when starting chromedriver. If None, will not be taken into consideration. Defaults to None.

Returns

Selenium webdriver for Chrome (selenium.webdriver.Chrome).
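
A minimal usage sketch (the option and URL are illustrative):

   from selenium.webdriver.chrome.options import Options

   from finscraper.utils import get_chromedriver

   options = Options()
   options.add_argument('--headless')
   driver = get_chromedriver(options=options)
   driver.get('https://example.com')
   driver.quit()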

finscraper.wrappers module

Module for wrapping Scrapy spiders.

Module contents