finscraper package

Submodules

finscraper.extensions module

Module for Scrapy extensions.

class finscraper.extensions.ProgressBar(crawler)

Bases: object

Scrapy extension that displays a progress bar.

Enabled via the PROGRESS_BAR_ENABLED Scrapy setting.
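
The spiders in finscraper.spiders manage this extension through their progress_bar parameter, but as a minimal sketch, the relevant Scrapy settings would look as follows (the priority value 500 is illustrative):

   settings = {
       'EXTENSIONS': {'finscraper.extensions.ProgressBar': 500},
       'PROGRESS_BAR_ENABLED': True,
   }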

classmethod from_crawler(crawler)
on_error(failure, response, spider)
on_item_scraped(item, spider)
on_response(response, request, spider)
spider_closed(spider)
spider_opened(spider)

finscraper.middlewares module

Module for Scrapy middlewares.

class finscraper.middlewares.SeleniumCallbackMiddleware(settings)

Bases: object

Middleware that processes requests with a given callback.

Headless mode can be disabled via the DISABLE_HEADLESS Scrapy setting. In non-headless mode, the window can be minimized via the MINIMIZE_WINDOW Scrapy setting.
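
As a minimal sketch, the relevant settings might look as follows (placing the class under DOWNLOADER_MIDDLEWARES is inferred from the process_request hook, and the priority 800 is illustrative):

   settings = {
       'DOWNLOADER_MIDDLEWARES': {
           'finscraper.middlewares.SeleniumCallbackMiddleware': 800,
       },
       'DISABLE_HEADLESS': True,   # run Chrome with a visible window
       'MINIMIZE_WINDOW': True,    # and minimize that window
   }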

classmethod from_crawler(crawler)
process_request(request, spider)
spider_closed(spider)
spider_opened(spider)

finscraper.pipelines module

Module for Scrapy pipelines.

class finscraper.pipelines.DefaultValueNonePipeline

Bases: object

Pipeline that sets the default value of every item field to None.

process_item(item, spider)
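
The idea, as a minimal sketch (not necessarily the package's exact source):

   class DefaultValueNonePipeline:

       def process_item(self, item, spider):
           # Fill in None for every declared field the spider
           # left unpopulated.
           for field in item.fields:
               item.setdefault(field, None)
           return item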

finscraper.request module

Module for custom Scrapy request components.

class finscraper.request.SeleniumCallbackRequest(*args, **kwargs)

Bases: Request

Process a request with a given callback using Selenium.

Parameters

selenium_callback (func or None, optional) – Function that will be called with the Chrome webdriver. The function should accept the parameters (request, spider, driver) and return a request, a response or None. If None, the driver is used to fetch the page and a response is returned. Defaults to None.
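
For instance, a hedged sketch of a callback that scrolls a page before building the response (the helper name and URL are illustrative):

   from scrapy.http import HtmlResponse

   from finscraper.request import SeleniumCallbackRequest

   def scroll_and_capture(request, spider, driver):
       # Load the page, scroll to the bottom, and build a response
       # from whatever the browser then renders.
       driver.get(request.url)
       driver.execute_script(
           'window.scrollTo(0, document.body.scrollHeight);')
       return HtmlResponse(driver.current_url, body=driver.page_source,
                           encoding='utf-8', request=request)

   request = SeleniumCallbackRequest(
       'https://example.com', selenium_callback=scroll_and_capture)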

attributes: Tuple[str, ...] = ('url', 'callback', 'method', 'headers', 'body', 'cookies', 'meta', 'encoding', 'priority', 'dont_filter', 'errback', 'flags', 'cb_kwargs')

A tuple of str objects containing the name of all public attributes of the class that are also keyword parameters of the __init__ method.

Currently used by Request.replace(), Request.to_dict() and request_from_dict().

property body: bytes
property cb_kwargs: dict
copy() Request
property encoding: str
classmethod from_curl(curl_command: str, ignore_unknown_options: bool = True, **kwargs) RequestTypeVar

Create a Request object from a string containing a cURL command. It populates the HTTP method, the URL, the headers, the cookies and the body. It accepts the same arguments as the Request class, taking preference and overriding the values of the same arguments contained in the cURL command.

Unrecognized options are ignored by default. To raise an error when finding unknown options call this method by passing ignore_unknown_options=False.

Caution

Using from_curl() from Request subclasses, such as JSONRequest, or XmlRpcRequest, as well as having downloader middlewares and spider middlewares enabled, such as DefaultHeadersMiddleware, UserAgentMiddleware, or HttpCompressionMiddleware, may modify the Request object.

To translate a cURL command into a Scrapy request, you may use curl2scrapy.
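
For example (the cURL string is illustrative):

   from finscraper.request import SeleniumCallbackRequest

   request = SeleniumCallbackRequest.from_curl(
       "curl 'https://example.com/api' -H 'Accept: application/json'")
   print(request.method, request.url)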

property meta: dict
replace(*args, **kwargs) Request

Create a new Request with the same attributes except for those given new values

to_dict(*, spider: Optional[Spider] = None) dict

Return a dictionary containing the Request’s data.

Use request_from_dict() to convert back into a Request object.

If a spider is given, this method will try to find out the name of the spider methods used as callback and errback and include them in the output dict, raising an exception if they cannot be found.

property url: str

finscraper.settings module

Module for finscraper’s default Scrapy settings.

finscraper.spiders module

Module for Spider API - the main interface of finscraper.

class finscraper.spiders.ILArticle(jobdir=None, progress_bar=True, log_level=None)

Bases: _SpiderWrapper

Fetch Iltalehti news articles.

Parameters
  • jobdir (str or None, optional) – Working directory of the spider. Defaults to None, which creates a temp directory to be used. Note that this directory will only be deleted through the clear method!

  • progress_bar (bool, optional) – Whether to enable progress bar or not. This parameter is ignored if log_level is not None. Defaults to True.

  • log_level (str or None, optional) –

    Logging level to display. Should be in [‘debug’, ‘info’, ‘warn’, ‘error’, ‘critical’] or None (disabled). Defaults to None.

    Note

    This parameter can be overridden through Scrapy settings (LOG_LEVEL, LOG_ENABLED) within the scrape method.

Returned fields:
  • url (str): URL of the scraped web page.

  • time (int): UNIX timestamp of the scraping.

  • title (str): Title of the article.

  • ingress (str): Ingress (lead paragraph) of the article.

  • content (str): Contents of the article.

  • published (str): Publish time of the article.

  • author (str): Author of the article.

  • images (list of dict): Images of the article.
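
A minimal usage sketch (scraped content depends on the live site):

   from finscraper.spiders import ILArticle

   spider = ILArticle()      # a temp jobdir is created automatically
   spider.scrape(n=10)       # attempt to scrape 10 articles
   df = spider.get()         # items as a pandas DataFrame
   spider.clear()            # remove the jobdir contents when done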

clear()

Clear contents of jobdir.

get(fmt='df')

Return scraped data as DataFrame or list.

Parameters

fmt (str, optional) – Format to return parsed items as. Should be in [‘df’, ‘list’]. Defaults to ‘df’.

Returns

If fmt = 'df', DataFrame of scraped items. If fmt = 'list', list of dict of scraped items.

Raises

ValueError – If fmt not in allowed values.

property items_save_path

Save path of the scraped items.

Cannot be changed after initialization of a spider.

property jobdir

Working directory of the spider.

Can be changed after initialization of a spider.

classmethod load(jobdir)

Load existing spider from jobdir.

Parameters

jobdir (str) – Path to job directory.

Returns

Spider loaded from job directory.

property log_level

Logging level of the spider.

This attribute can be changed after initialization of a spider.

property progress_bar

Whether progress bar is enabled or not.

Can be changed after initialization of a spider.

save()

Save spider in jobdir for later use.

Returns

Path to job directory.

Return type

str

scrape(n=10, timeout=60, settings=None)

Scrape given number of items.

Parameters
  • n (int, optional) – Number of items to attempt to scrape. Zero corresponds to no limit. Defaults to 10.

  • timeout (int, optional) – Timeout in seconds to wait before stopping the spider. Zero corresponds to no limit. Defaults to 60.

  • settings (dict or None, optional) – Scrapy spider settings to use. Defaults to None, which corresponds to the default settings. See the list of available settings at: https://docs.scrapy.org/en/latest/topics/settings.html.

Returns

self

property spider_save_path

Save path of the spider.

Cannot be changed after initialization of a spider.

class finscraper.spiders.ISArticle(jobdir=None, progress_bar=True, log_level=None)

Bases: _SpiderWrapper

Fetch IltaSanomat news articles.

Parameters
  • jobdir (str or None, optional) – Working directory of the spider. Defaults to None, which creates a temp directory to be used. Note that this directory will only be deleted through the clear method!

  • progress_bar (bool, optional) – Whether to enable progress bar or not. This parameter is ignored if log_level is not None. Defaults to True.

  • log_level (str or None, optional) –

    Logging level to display. Should be in [‘debug’, ‘info’, ‘warn’, ‘error’, ‘critical’] or None (disabled). Defaults to None.

    Note

    This parameter can be overridden through Scrapy settings (LOG_LEVEL, LOG_ENABLED) within the scrape method.

Returned fields:
  • url (str): URL of the scraped web page.

  • time (int): UNIX timestamp of the scraping.

  • title (str): Title of the article.

  • ingress (str): Ingress (lead paragraph) of the article.

  • content (str): Contents of the article.

  • published (str): Publish time of the article.

  • author (str): Author of the article.

  • images (list of dict): Images of the article.
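
A minimal sketch of persisting and resuming a spider (the resume semantics are inferred from the save and load methods documented below):

   from finscraper.spiders import ISArticle

   spider = ISArticle()
   spider.scrape(n=5)
   jobdir = spider.save()             # returns the path to the job directory

   resumed = ISArticle.load(jobdir)   # later: resume from the same jobdir
   items = resumed.scrape(n=5).get(fmt='list')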

clear()

Clear contents of jobdir.

get(fmt='df')

Return scraped data as DataFrame or list.

Parameters

fmt (str, optional) – Format to return parsed items as. Should be in [‘df’, ‘list’]. Defaults to ‘df’.

Returns

If fmt = 'df', DataFrame of scraped items. If fmt = 'list', list of dict of scraped items.

Raises

ValueError – If fmt not in allowed values.

property items_save_path

Save path of the scraped items.

Cannot be changed after initialization of a spider.

property jobdir

Working directory of the spider.

Can be changed after initialization of a spider.

classmethod load(jobdir)

Load existing spider from jobdir.

Parameters

jobdir (str) – Path to job directory.

Returns

Spider loaded from job directory.

property log_level

Logging level of the spider.

This attribute can be changed after initialization of a spider.

property progress_bar

Whether progress bar is enabled or not.

Can be changed after initialization of a spider.

save()

Save spider in jobdir for later use.

Returns

Path to job directory.

Return type

str

scrape(n=10, timeout=60, settings=None)

Scrape given number of items.

Parameters
  • n (int, optional) – Number of items to attempt to scrape. Zero corresponds to no limit. Defaults to 10.

  • timeout (int, optional) – Timeout in seconds to wait before stopping the spider. Zero corresponds to no limit. Defaults to 60.

  • settings (dict or None, optional) – Scrapy spider settings to use. Defaults to None, which corresponds to the default settings. See the list of available settings at: https://docs.scrapy.org/en/latest/topics/settings.html.

Returns

self

property spider_save_path

Save path of the spider.

Cannot be changed after initialization of a spider.

class finscraper.spiders.MNetPage(jobdir=None, progress_bar=True, log_level=None)

Bases: _SpiderWrapper

Fetch threads from muusikoiden.net discussions.

Parameters
  • jobdir (str or None, optional) – Working directory of the spider. Defaults to None, which creates a temp directory to be used. Note that this directory will only be deleted through the clear method!

  • progress_bar (bool, optional) – Whether to enable progress bar or not. This parameter is ignored if log_level is not None. Defaults to True.

  • log_level (str or None, optional) –

    Logging level to display. Should be in [‘debug’, ‘info’, ‘warn’, ‘error’, ‘critical’] or None (disabled). Defaults to None.

    Note

    This parameter can be overridden through Scrapy settings (LOG_LEVEL, LOG_ENABLED) within the scrape method.

Returned page fields:
  • url (str): URL of the thread page.

  • time (int): UNIX timestamp of the scraping.

  • title (str): Title of the thread.

  • page_number (int): Number of the page in the thread.

  • messages (list of dict): Messages on the thread page (see the message fields below).

Returned message fields:
  • author (str): Author of the message.

  • time_posted (str): Publish time of the message.

  • quotes (list of str): List of quotes in the message.

  • content (str): Contents of the message.
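
A minimal usage sketch of reading the nested message fields (scraped content depends on the live site):

   from finscraper.spiders import MNetPage

   pages = MNetPage().scrape(n=10).get(fmt='list')   # scrape returns self
   for page in pages:
       print(page['title'], page['page_number'])
       for message in page['messages']:
           print(' -', message['author'], message['time_posted'])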

clear()

Clear contents of jobdir.

get(fmt='df')

Return scraped data as DataFrame or list.

Parameters

fmt (str, optional) – Format to return parsed items as. Should be in [‘df’, ‘list’]. Defaults to ‘df’.

Returns

If fmt = 'df', DataFrame of scraped items. If fmt = 'list', list of dict of scraped items.

Raises

ValueError – If fmt not in allowed values.

property items_save_path

Save path of the scraped items.

Cannot be changed after initialization of a spider.

property jobdir

Working directory of the spider.

Can be changed after initialization of a spider.

classmethod load(jobdir)

Load existing spider from jobdir.

Parameters

jobdir (str) – Path to job directory.

Returns

Spider loaded from job directory.

property log_level

Logging level of the spider.

This attribute can be changed after initialization of a spider.

property progress_bar

Whether progress bar is enabled or not.

Can be changed after initialization of a spider.

save()

Save spider in jobdir for later use.

Returns

Path to job directory.

Return type

str

scrape(n=10, timeout=60, settings=None)

Scrape given number of items.

Parameters
  • n (int, optional) – Number of items to attempt to scrape. Zero corresponds to no limit. Defaults to 10.

  • timeout (int, optional) – Timeout in seconds to wait before stopping the spider. Zero corresponds to no limit. Defaults to 60.

  • settings (dict or None, optional) – Scrapy spider settings to use. Defaults to None, which corresponds to the default settings. See the list of available settings at: https://docs.scrapy.org/en/latest/topics/settings.html.

Returns

self

property spider_save_path

Save path of the spider.

Cannot be changed after initialization of a spider.

class finscraper.spiders.OikotieApartment(area=None, jobdir=None, progress_bar=True, log_level=None)

Bases: _SpiderWrapper

Fetch oikotie.fi apartments.

Parameters
  • area (str or None, optional) – Scrape listings based on area, e.g. “helsinki” or “hausjärvi”. The final URL will be formed as: ‘https://asunnot.oikotie.fi/myytavat-asunnot/{area}’. Defaults to None.

  • jobdir (str or None, optional) – Working directory of the spider. Defaults to None, which creates a temp directory to be used. Note that this directory will only be deleted through the clear method!

  • progress_bar (bool, optional) – Whether to enable progress bar or not. This parameter is ignored if log_level is not None. Defaults to True.

  • log_level (str or None, optional) –

    Logging level to display. Should be in [‘debug’, ‘info’, ‘warn’, ‘error’, ‘critical’] or None (disabled). Defaults to None.

    Note

    This parameter can be overridden through Scrapy settings (LOG_LEVEL, LOG_ENABLED) within the scrape method.

Returned fields:
  • url (str): URL of the scraped web page.

  • time (int): UNIX timestamp of the scraping.

  • title (str): Title of the web browser tab.

  • overview (str): Overview text of the apartment.

  • contact_person_name (str): Name of the contact person.

  • contact_person_job_title (str): Job title of the contact person.

  • contact_person_phone_number (str): Phone number of the contact person.

  • contact_person_company (str): Company of the contact person.

  • location (str): Sijainti

  • city (str): Kaupunki

  • district (str): Kaupunginosa

  • oikotie_id (str): Kohdenumero

  • floor (str): Kerros

  • life_sq (str): Asuinpinta-ala

  • property_sq (str): Tontin pinta-ala

  • total_sq (str): Kokonaispinta-ala

  • room_info (str): Huoneiston kokoonpano

  • number_of_rooms (str): Huoneita

  • condition (str): Kunto

  • condition_details (str): Kunnon lisätiedot

  • availability (str): Lisätietoa vapautumisesta

  • kitchen_appliances (str): Keittiön varusteet

  • bathroom_appliances (str): Kylpyhuoneen varusteet

  • window_direction (str): Ikkunoiden suunta

  • has_balcony (str): Parveke

  • balcony_details (str): Parvekkeen lisätiedot

  • storage_space (str): Säilytystilat

  • view (str): Näkymät

  • future_renovations (str): Tulevat remontit

  • completed_renovations (str): Tehdyt remontit

  • has_sauna (str): Asunnossa sauna

  • sauna_details (str): Saunan lisätiedot

  • housing_type (str): Asumistyyppi

  • services (str): Palvelut

  • additional_info (str): Lisätiedot

  • property_id (str): Kiinteistötunnus

  • apartment_is (str): Kohde on

  • telecommunication_services (str): Tietoliikennepalvelut

  • price_no_tax (str): Velaton hinta

  • sales_price (str): Myyntihinta

  • shared_loan_payment (str): Lainaosuuden maksu

  • price_per_sq (str): Neliöhinta

  • share_of_liabilities (str): Velkaosuus

  • mortgages (str): Kiinnitykset

  • financial_charge (str): Rahoitusvastike

  • condominium_payment (str): Hoitovastike

  • maintenance_charge (str): Yhtiövastike

  • water_charge (str): Vesimaksu

  • water_charge_details (str): Vesimaksun lisätiedot

  • heating_charge (str): Lämmityskustannukset

  • other_costs (str): Muut kustannukset

  • is_brand_new (str): Uudiskohde

  • housing_company_name (str): Taloyhtiön nimi

  • building_type (str): Rakennuksen tyyppi

  • build_year (str): Rakennusvuosi

  • build_year_details (str): Rakennusvuoden lisätiedot

  • number_of_apartments (str): Huoneistojen lukumäärä

  • total_floors (str): Kerroksia

  • building_has_elevator (str): Hissi

  • building_has_sauna (str): Taloyhtiössä on sauna

  • building_material (str): Rakennusmateriaali

  • roof_type (str): Kattotyyppi

  • energy_class (str): Energialuokka

  • has_energy_certificate (str): Energiatodistus

  • antenna_system (str): Kiinteistön antennijärjestelmä

  • property_size (str): Tontin koko

  • maintenance (str): Kiinteistönhoito

  • real_estate_management (str): Isännöinti

  • plan_info (str): Kaavatiedot

  • plan (str): Kaavatilanne

  • traffic_communication (str): Liikenneyhteydet

  • heating (str): Lämmitys

  • parking_space_description (str): Pysäköintitilan kuvaus

  • common_spaces (str): Yhteiset tilat

  • wallcovering (str): Pintamateriaalit
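
A minimal usage sketch (the area value is illustrative; scraped content depends on the live site):

   from finscraper.spiders import OikotieApartment

   # Scrapes listings under
   # https://asunnot.oikotie.fi/myytavat-asunnot/helsinki
   spider = OikotieApartment(area='helsinki')
   spider.scrape(n=20, timeout=120)
   df = spider.get()
   print(df[['title', 'city', 'district', 'sales_price']].head())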

clear()

Clear contents of jobdir.

get(fmt='df')

Return scraped data as DataFrame or list.

Parameters

fmt (str, optional) – Format to return parsed items as. Should be in [‘df’, ‘list’]. Defaults to ‘df’.

Returns

If fmt = 'df', DataFrame of scraped items. If fmt = 'list', list of dict of scraped items.

Raises

ValueError – If fmt not in allowed values.

property items_save_path

Save path of the scraped items.

Cannot be changed after initialization of a spider.

property jobdir

Working directory of the spider.

Can be changed after initialization of a spider.

classmethod load(jobdir)

Load existing spider from jobdir.

Parameters

jobdir (str) – Path to job directory.

Returns

Spider loaded from job directory.

property log_level

Logging level of the spider.

This attribute can be changed after initialization of a spider.

property progress_bar

Whether progress bar is enabled or not.

Can be changed after initialization of a spider.

save()

Save spider in jobdir for later use.

Returns

Path to job directory.

Return type

str

scrape(n=10, timeout=60, settings=None)

Scrape given number of items.

Parameters
  • n (int, optional) – Number of items to attempt to scrape. Zero corresponds to no limit. Defaults to 10.

  • timeout (int, optional) – Timeout in seconds to wait before stopping the spider. Zero corresponds to no limit. Defaults to 60.

  • settings (dict or None, optional) – Scrapy spider settings to use. Defaults to None, which corresponds to the default settings. See the list of available settings at: https://docs.scrapy.org/en/latest/topics/settings.html.

Returns

self

property spider_save_path

Save path of the spider.

Cannot be changed after initialization of a spider.

class finscraper.spiders.Suomi24Page(jobdir=None, progress_bar=True, log_level=None)

Bases: _SpiderWrapper

Fetch comments from suomi24.fi.

Parameters
  • jobdir (str or None, optional) – Working directory of the spider. Defaults to None, which creates a temp directory to be used. Note that this directory will only be deleted through the clear method!

  • progress_bar (bool, optional) – Whether to enable progress bar or not. This parameter is ignored if log_level is not None. Defaults to True.

  • log_level (str or None, optional) –

    Logging level to display. Should be in [‘debug’, ‘info’, ‘warn’, ‘error’, ‘critical’] or None (disabled). Defaults to None.

    Note

    This parameter can be overridden through Scrapy settings (LOG_LEVEL, LOG_ENABLED) within the scrape method.

Returned page fields:
  • url (str): URL of the scraped web page.

  • time (int): UNIX timestamp of the scraping.

  • title (str): Title of the thread.

  • content (str): Content of the first message.

  • comments (list of dict): Comments of the thread page (see the comment fields below).

  • published (str): Publish time of the thread.

  • author (str): Author of the thread.

  • n_comments (int): Number of comments in the thread.

  • views (str): Number of views.

Returned comment fields:
  • author (str): Author of the comment.

  • date (str): Publish time of the comment.

  • quotes (list of str): List of quotes in the comment.

  • responses (list of dict): Response comments to this comment.

  • content (str): Contents of the comment.

Returned comment response fields:
  • author (str): Author of the comment response.

  • date (str): Publish time of the comment response.

  • quotes (list of str): List of quotes in the comment response.

  • content (str): Contents of the comment response.
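
A minimal usage sketch with logging instead of a progress bar (scraped content depends on the live site):

   from finscraper.spiders import Suomi24Page

   spider = Suomi24Page(log_level='info')   # progress_bar is then ignored
   spider.scrape(n=10, timeout=30)
   for thread in spider.get(fmt='list'):
       print(thread['title'], thread['n_comments'])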

clear()

Clear contents of jobdir.

get(fmt='df')

Return scraped data as DataFrame or list.

Parameters

fmt (str, optional) – Format to return parsed items as. Should be in [‘df’, ‘list’]. Defaults to ‘df’.

Returns

If fmt = 'df', DataFrame of scraped items. If fmt = 'list', list of dict of scraped items.

Raises

ValueError – If fmt not in allowed values.

property items_save_path

Save path of the scraped items.

Cannot be changed after initialization of a spider.

property jobdir

Working directory of the spider.

Can be changed after initialization of a spider.

classmethod load(jobdir)

Load existing spider from jobdir.

Parameters

jobdir (str) – Path to job directory.

Returns

Spider loaded from job directory.

property log_level

Logging level of the spider.

This attribute can be changed after initialization of a spider.

property progress_bar

Whether progress bar is enabled or not.

Can be changed after initialization of a spider.

save()

Save spider in jobdir for later use.

Returns

Path to job directory.

Return type

str

scrape(n=10, timeout=60, settings=None)

Scrape given number of items.

Parameters
  • n (int, optional) – Number of items to attempt to scrape. Zero corresponds to no limit. Defaults to 10.

  • timeout (int, optional) – Timeout in seconds to wait before stopping the spider. Zero corresponds to no limit. Defaults to 60.

  • settings (dict or None, optional) – Scrapy spider settings to use. Defaults to None, which corresponds to the default settings. See the list of available settings at: https://docs.scrapy.org/en/latest/topics/settings.html.

Returns

self

property spider_save_path

Save path of the spider.

Cannot be changed after initialization of a spider.

class finscraper.spiders.ToriDeal(jobdir=None, progress_bar=True, log_level=None)

Bases: _SpiderWrapper

Fetch deals from tori.fi.

Parameters
  • jobdir (str or None, optional) – Working directory of the spider. Defaults to None, which creates a temp directory to be used. Note that this directory will only be deleted through the clear method!

  • progress_bar (bool, optional) – Whether to enable progress bar or not. This parameter is ignored if log_level is not None. Defaults to True.

  • log_level (str or None, optional) –

    Logging level to display. Should be in [‘debug’, ‘info’, ‘warn’, ‘error’, ‘critical’] or None (disabled). Defaults to None.

    Note

    This parameter can be overridden through Scrapy settings (LOG_LEVEL, LOG_ENABLED) within the scrape method.

Returned fields:
  • url (str): URL of the scraped web page.

  • time (int): UNIX timestamp of the scraping.

  • seller (str): Seller of the item.

  • name (str): Name of the item.

  • description (list of str): Description of the item.

  • price (str): Price of the item.

  • type (str): Type of the deal.

  • published (str): Publish time of the deal.

  • images (list of dict): Images of the item.
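
A minimal usage sketch passing custom Scrapy settings through scrape (the setting value is illustrative):

   from finscraper.spiders import ToriDeal

   spider = ToriDeal()
   spider.scrape(n=10, settings={'DOWNLOAD_DELAY': 1.0})
   df = spider.get()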

clear()

Clear contents of jobdir.

get(fmt='df')

Return scraped data as DataFrame or list.

Parameters

fmt (str, optional) – Format to return parsed items as. Should be in [‘df’, ‘list’]. Defaults to ‘df’.

Returns

If fmt = 'df', DataFrame of scraped items. If fmt = 'list', list of dict of scraped items.

Raises

ValueError – If fmt not in allowed values.

property items_save_path

Save path of the scraped items.

Cannot be changed after initialization of a spider.

property jobdir

Working directory of the spider.

Can be changed after initialization of a spider.

classmethod load(jobdir)

Load existing spider from jobdir.

Parameters

jobdir (str) – Path to job directory.

Returns

Spider loaded from job directory.

property log_level

Logging level of the spider.

This attribute can be changed after initialization of a spider.

property progress_bar

Whether progress bar is enabled or not.

Can be changed after initialization of a spider.

save()

Save spider in jobdir for later use.

Returns

Path to job directory.

Return type

str

scrape(n=10, timeout=60, settings=None)

Scrape given number of items.

Parameters
  • n (int, optional) – Number of items to attempt to scrape. Zero corresponds to no limit. Defaults to 10.

  • timeout (int, optional) – Timeout in seconds to wait before stopping the spider. Zero corresponds to no limit. Defaults to 60.

  • settings (dict or None, optional) – Scrapy spider settings to use. Defaults to None, which corresponds to the default settings. See the list of available settings at: https://docs.scrapy.org/en/latest/topics/settings.html.

Returns

self

property spider_save_path

Save path of the spider.

Cannot be changed after initialization of a spider.

class finscraper.spiders.VauvaPage(jobdir=None, progress_bar=True, log_level=None)

Bases: _SpiderWrapper

Fetch comments from vauva.fi.

Parameters
  • jobdir (str or None, optional) – Working directory of the spider. Defaults to None, which creates a temp directory to be used. Note that this directory will only be deleted through the clear method!

  • progress_bar (bool, optional) – Whether to enable progress bar or not. This parameter is ignored if log_level is not None. Defaults to True.

  • log_level (str or None, optional) –

    Logging level to display. Should be in [‘debug’, ‘info’, ‘warn’, ‘error’, ‘critical’] or None (disabled). Defaults to None.

    Note

    This parameter can be overridden through Scrapy settings (LOG_LEVEL, LOG_ENABLED) within the scrape method.

Returned page fields:
  • url (str): URL of the scraped web page.

  • time (int): UNIX timestamp of the scraping.

  • title (str): Title of the thread.

  • page (int): Page number of the thread.

  • pages (int): Total number of pages in the thread.

  • comments (list of dict): Comments of the thread page (see the comment fields below).

  • published (str): Publish time of the thread.

  • author (str): Author of the thread.

Returned comment fields:
  • author (str): Author of the comment.

  • date (str): Publish time of the comment.

  • quotes (list of str): List of quotes in the comment.

  • content (str): Contents of the comment.

  • upvotes (int): Upvotes of the comment.

  • downvotes (int): Downvotes of the comment.
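
A minimal usage sketch where the timeout, not the item count, stops the run (scraped content depends on the live site):

   from finscraper.spiders import VauvaPage

   spider = VauvaPage()
   spider.scrape(n=0, timeout=60)   # n=0 means no item limit
   df = spider.get()
   print(df[['title', 'page', 'pages']].head())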

clear()

Clear contents of jobdir.

get(fmt='df')

Return scraped data as DataFrame or list.

Parameters

fmt (str, optional) – Format to return parsed items as. Should be in [‘df’, ‘list’]. Defaults to ‘df’.

Returns

If fmt = 'df', DataFrame of scraped items. If fmt = 'list', list of dict of scraped items.

Raises

ValueError – If fmt not in allowed values.

property items_save_path

Save path of the scraped items.

Cannot be changed after initialization of a spider.

property jobdir

Working directory of the spider.

Can be changed after initialization of a spider.

classmethod load(jobdir)

Load existing spider from jobdir.

Parameters

jobdir (str) – Path to job directory.

Returns

Spider loaded from job directory.

property log_level

Logging level of the spider.

This attribute can be changed after initialization of a spider.

property progress_bar

Whether progress bar is enabled or not.

Can be changed after initialization of a spider.

save()

Save spider in jobdir for later use.

Returns

Path to job directory.

Return type

str

scrape(n=10, timeout=60, settings=None)

Scrape given number of items.

Parameters
  • n (int, optional) – Number of items to attempt to scrape. Zero corresponds to no limit. Defaults to 10.

  • timeout (int, optional) – Timeout in seconds to wait before stopping the spider. Zero corresponds to no limit. Defaults to 60.

  • settings (dict or None, optional) – Scrapy spider settings to use. Defaults to None, which corresponds to the default settings. See the list of available settings at: https://docs.scrapy.org/en/latest/topics/settings.html.

Returns

self

property spider_save_path

Save path of the spider.

Cannot be changed after initialization of a spider.

class finscraper.spiders.YLEArticle(jobdir=None, progress_bar=True, log_level=None)

Bases: _SpiderWrapper

Fetch YLE news articles.

Parameters
  • jobdir (str or None, optional) – Working directory of the spider. Defaults to None, which creates a temp directory to be used. Note that this directory will only be deleted through the clear method!

  • progress_bar (bool, optional) – Whether to enable progress bar or not. This parameter is ignored if log_level is not None. Defaults to True.

  • log_level (str or None, optional) –

    Logging level to display. Should be in [‘debug’, ‘info’, ‘warn’, ‘error’, ‘critical’] or None (disabled). Defaults to None.

    Note

    This parameter can be overridden through Scrapy settings (LOG_LEVEL, LOG_ENABLED) within the scrape method.

Returned fields:
  • url (str): URL of the scraped web page.

  • time (int): UNIX timestamp of the scraping.

  • title (str): Title of the article.

  • ingress (str): Ingress (lead paragraph) of the article.

  • content (str): Contents of the article.

  • published (str): Publish time of the article.

  • author (str): Author of the article.

  • images (list of dict): Images of the article.
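
A minimal usage sketch with an explicit jobdir so results persist across sessions (the path is illustrative):

   from finscraper.spiders import YLEArticle

   spider = YLEArticle(jobdir='yle_jobdir')
   spider.scrape(n=10)
   df = spider.get()
   print(df[['title', 'published', 'author']].head())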

clear()

Clear contents of jobdir.

get(fmt='df')

Return scraped data as DataFrame or list.

Parameters

fmt (str, optional) – Format to return parsed items as. Should be in [‘df’, ‘list’]. Defaults to ‘df’.

Returns

If fmt = 'df', DataFrame of scraped items. If fmt = 'list', list of dict of scraped items.

Raises

ValueError – If fmt not in allowed values.

property items_save_path

Save path of the scraped items.

Cannot be changed after initialization of a spider.

property jobdir

Working directory of the spider.

Can be changed after initialization of a spider.

classmethod load(jobdir)

Load existing spider from jobdir.

Parameters

jobdir (str) – Path to job directory.

Returns

Spider loaded from job directory.

property log_level

Logging level of the spider.

This attribute can be changed after initialization of a spider.

property progress_bar

Whether progress bar is enabled or not.

Can be changed after initialization of a spider.

save()

Save spider in jobdir for later use.

Returns

Path to job directory.

Return type

str

scrape(n=10, timeout=60, settings=None)

Scrape given number of items.

Parameters
  • n (int, optional) – Number of items to attempt to scrape. Zero corresponds to no limit. Defaults to 10.

  • timeout (int, optional) – Timeout in seconds to wait before stopping the spider. Zero corresponds to no limit. Defaults to 60.

  • settings (dict or None, optional) – Scrapy spider settings to use. Defaults to None, which corresponds to the default settings. See the list of available settings at: https://docs.scrapy.org/en/latest/topics/settings.html.

Returns

self

property spider_save_path

Save path of the spider.

Cannot be changed after initialization of a spider.

finscraper.text_utils module

Module for text processing utility functions and classes.

finscraper.text_utils.drop_empty_elements(text_list)
finscraper.text_utils.paragraph_join(text_list)
finscraper.text_utils.replace(text, source, target)
finscraper.text_utils.safe_cast_int(text)
finscraper.text_utils.strip_elements(text_list)
finscraper.text_utils.strip_join(text_list, join_with=' ')
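
The behaviors sketched below are inferred from the function names and signatures, and should be checked against the source:

   from finscraper.text_utils import safe_cast_int, strip_join

   # Assumed behavior: elements are stripped and joined with join_with.
   strip_join(['  Hello ', 'world  '])   # -> 'Hello world'
   # Assumed behavior: returns int(text), or None if the cast fails.
   safe_cast_int('42')                   # -> 42
   safe_cast_int('n/a')                  # -> None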

finscraper.utils module

Module for utility functions and classes.

class finscraper.utils.QueueHandler(queue)

Bases: Handler

Sends events to a queue, allowing multiprocessing.

This handler checks for picklability before saving items into the queue. Modified from: https://gist.github.com/vsajip/591589.
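
A minimal sketch of attaching the handler to a logger:

   import logging
   import multiprocessing

   from finscraper.utils import QueueHandler

   # Emit log records into a multiprocessing-safe queue so another
   # process can consume them.
   log_queue = multiprocessing.Queue()
   logger = logging.getLogger('finscraper')
   logger.addHandler(QueueHandler(log_queue))
   logger.warning('hello')   # the record lands in log_queue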

acquire()

Acquire the I/O thread lock.

addFilter(filter)

Add the specified filter to this handler.

close()

Tidy up any resources used by the handler.

This version removes the handler from an internal map of handlers, _handlers, which is used for handler lookup by name. Subclasses should ensure that this gets called from overridden close() methods.

createLock()

Acquire a thread lock for serializing access to the underlying I/O.

emit(record)

Do whatever it takes to actually log the specified logging record.

This version is intended to be implemented by subclasses and so raises a NotImplementedError.

enqueue(record)
filter(record)

Determine if a record is loggable by consulting all the filters.

The default is to allow the record to be logged; any filter can veto this and the record is then dropped. Returns a zero value if a record is to be dropped, else non-zero.

Changed in version 3.2: Allow filters to be just callables.

flush()

Ensure all logging output has been flushed.

This version does nothing and is intended to be implemented by subclasses.

format(record)

Format the specified record.

If a formatter is set, use it. Otherwise, use the default formatter for the module.

get_name()
handle(record)

Conditionally emit the specified logging record.

Emission depends on filters which may have been added to the handler. Wrap the actual emission of the record with acquisition/release of the I/O thread lock. Returns whether the filter passed the record for emission.

handleError(record)

Handle errors which occur during an emit() call.

This method should be called from handlers when an exception is encountered during an emit() call. If raiseExceptions is false, exceptions get silently ignored. This is what is mostly wanted for a logging system - most users will not care about errors in the logging system, they are more interested in application errors. You could, however, replace this with a custom handler if you wish. The record which was being processed is passed in to this method.

property name
prepare(record)
release()

Release the I/O thread lock.

removeFilter(filter)

Remove the specified filter from this handler.

setFormatter(fmt)

Set the formatter for this handler.

setLevel(level)

Set the logging level of this handler. level must be an int or a str.

set_name(name)

class finscraper.utils.TqdmLogger(logger)

Bases: StringIO

File-like object that redirects its buffer to stdout.
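
A minimal sketch of using it as the output stream of a tqdm progress bar:

   import logging

   from tqdm import tqdm

   from finscraper.utils import TqdmLogger

   logger = logging.getLogger('finscraper')
   for _ in tqdm(range(100), file=TqdmLogger(logger)):
       pass   # progress updates flow through the file-like wrapper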

close()

Close the IO object.

Attempting any further operation after the object is closed will raise a ValueError.

This method has no effect if the file is already closed.

closed
detach()

Separate the underlying buffer from the TextIOBase and return it.

After the underlying buffer has been detached, the TextIO is in an unusable state.

encoding

Encoding of the text stream.

Subclasses should override.

errors

The error setting of the decoder or encoder.

Subclasses should override.

fileno()

Returns underlying file descriptor if one exists.

OSError is raised if the IO object does not use a file descriptor.

flush()

Flush write buffers, if applicable.

This is not implemented for read-only and non-blocking streams.

getvalue()

Retrieve the entire contents of the object.

isatty()

Return whether this is an ‘interactive’ stream.

Return False if it can’t be determined.

line_buffering
newlines
read(size=-1, /)

Read at most size characters, returned as a string.

If the argument is negative or omitted, read until EOF is reached. Return an empty string at EOF.

readable()

Returns True if the IO object can be read.

readline(size=-1, /)

Read until newline or EOF.

Returns an empty string if EOF is hit immediately.

readlines(hint=-1, /)

Return a list of lines from the stream.

hint can be specified to control the number of lines read: no more lines will be read if the total size (in bytes/characters) of all lines so far exceeds hint.

seek(pos, whence=0, /)

Change stream position.

Seek to character offset pos relative to position indicated by whence:

  • 0 – Start of stream (the default); pos should be >= 0.

  • 1 – Current position; pos must be 0.

  • 2 – End of stream; pos must be 0.

Returns the new absolute position.

seekable()

Returns True if the IO object can be seeked.

tell()

Tell the current file position.

truncate(pos=None, /)

Truncate size to pos.

The pos argument defaults to the current file position, as returned by tell(). The current file position is unchanged. Returns the new absolute position.

writable()

Returns True if the IO object can be written.

write(buf)

Write string to file.

Returns the number of characters written, which is always equal to the length of the string.

writelines(lines, /)

Write a list of lines to stream.

Line separators are not added, so it is usual for each of the lines provided to have a line separator at the end.

finscraper.utils.get_chromedriver(options=None, settings=None)

Get chromedriver automatically.

Parameters
  • options (selenium.webdriver.chrome.options.Options, optional) – Options to start chromedriver with. If None, will use default settings. Defaults to None.

  • settings (scrapy.settings.Settings, optional) – Scrapy settings to take into consideration when starting chromedriver. If None, will not be taken into consideration. Defaults to None.

Returns

Selenium webdriver for Chrome (selenium.webdriver.Chrome).
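
A minimal usage sketch (the option and URL are illustrative):

   from selenium.webdriver.chrome.options import Options

   from finscraper.utils import get_chromedriver

   options = Options()
   options.add_argument('--headless')
   driver = get_chromedriver(options=options)
   driver.get('https://example.com')
   driver.quit()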

finscraper.wrappers module

Module for wrapping Scrapy spiders.

Module contents