finscraper package

Subpackages
- finscraper.scrapy_spiders package
- Submodules
- finscraper.scrapy_spiders.demipage module
- finscraper.scrapy_spiders.ilarticle module
- finscraper.scrapy_spiders.isarticle module
- finscraper.scrapy_spiders.mixins module
- finscraper.scrapy_spiders.oikotieapartment module
- finscraper.scrapy_spiders.suomi24page module
- finscraper.scrapy_spiders.torideal module
- finscraper.scrapy_spiders.vauvapage module
- finscraper.scrapy_spiders.ylearticle module
- Module contents
Submodules
finscraper.extensions module

Module for Scrapy extensions.

class finscraper.extensions.ProgressBar(crawler)
Bases: object

Scrapy extension that displays a progress bar. Enabled via the PROGRESS_BAR_ENABLED Scrapy setting.

classmethod from_crawler(crawler)

on_error(failure, response, spider)

on_item_scraped(item, spider)

on_response(response, request, spider)
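An illustrative sketch of the wiring (the extension path and setting name appear above; the priority value and the manual configuration are assumptions, since finscraper's spiders normally manage this through their progress_bar parameter):

    # Hypothetical manual configuration of the extension in Scrapy settings.
    custom_settings = {
        'EXTENSIONS': {'finscraper.extensions.ProgressBar': 100},
        'PROGRESS_BAR_ENABLED': True,
    }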
finscraper.middlewares module

Module for Scrapy middlewares.

class finscraper.middlewares.SeleniumCallbackMiddleware(settings)
Bases: object

Middleware that processes requests with a given callback. Headless mode can be disabled via the DISABLE_HEADLESS Scrapy setting.

classmethod from_crawler(crawler)

process_request(request, spider)

spider_closed(spider)

spider_opened(spider)
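For debugging, headless mode could be disabled through the scrape settings; a sketch (that this particular spider drives Selenium is an assumption):

    from finscraper.spiders import OikotieApartment

    spider = OikotieApartment()
    # Show the browser window while scraping instead of running headless.
    spider.scrape(n=1, settings={'DISABLE_HEADLESS': True})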
finscraper.pipelines module

Module for Scrapy pipelines.

finscraper.request module

Module for custom Scrapy request components.
class finscraper.request.SeleniumCallbackRequest(*args, selenium_callback=None, **kwargs)
Bases: scrapy.http.request.Request

Process a request with a given callback using Selenium.

Parameters: selenium_callback (func or None, optional) – Function that will be called with the Chrome WebDriver. The function should take the parameters (request, spider, driver) and return a request, a response, or None. If None, the driver will be used for fetching the page, and the return value is a response. Defaults to None.
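For instance, a minimal callback might scroll the page before handing back a rendered response; the sketch below uses standard Selenium and Scrapy calls, but the callback itself is purely illustrative:

    from scrapy.http import HtmlResponse

    def scroll_callback(request, spider, driver):
        # Fetch the page, scroll to the bottom, then wrap the rendered
        # source into a Scrapy response.
        driver.get(request.url)
        driver.execute_script('window.scrollTo(0, document.body.scrollHeight);')
        return HtmlResponse(
            driver.current_url,
            body=driver.page_source,
            encoding='utf-8',
            request=request)

A request would then be created as SeleniumCallbackRequest(url, selenium_callback=scroll_callback).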
property body

property cb_kwargs

copy()
Return a copy of this Request.

property encoding

classmethod from_curl(curl_command, ignore_unknown_options=True, **kwargs)
Create a Request object from a string containing a cURL command. It populates the HTTP method, the URL, the headers, the cookies, and the body. It accepts the same arguments as the Request class, taking preference and overriding the values of the same arguments contained in the cURL command. Unrecognized options are ignored by default. To raise an error when finding unknown options, call this method passing ignore_unknown_options=False.

Caution: Using from_curl() from Request subclasses, such as JSONRequest or XmlRpcRequest, as well as having downloader middlewares and spider middlewares enabled, such as DefaultHeadersMiddleware, UserAgentMiddleware, or HttpCompressionMiddleware, may modify the Request object.

To translate a cURL command into a Scrapy request, you may use curl2scrapy.

property meta

replace(*args, **kwargs)
Create a new Request with the same attributes, except for those given new values.

property url
finscraper.settings module

Module for finscraper's default Scrapy settings.

finscraper.spiders module

Module for the Spider API - the main interface of finscraper.
class finscraper.spiders.DemiPage(jobdir=None, progress_bar=True, log_level=None)
Bases: finscraper.wrappers._SpiderWrapper

Fetch comments from demi.fi.

Parameters:
- jobdir (str or None, optional) – Working directory of the spider. Defaults to None, which creates a temp directory to be used. Note that this directory will only be deleted through the clear method!
- progress_bar (bool, optional) – Whether to enable the progress bar or not. This parameter is ignored if log_level is not None. Defaults to True.
- log_level (str or None, optional) – Logging level to display. Should be in ['debug', 'info', 'warn', 'error', 'critical'] or None (disabled). Defaults to None. Note that this parameter can be overridden through the Scrapy settings (LOG_LEVEL, LOG_ENABLED) within the scrape method.

Returned page fields:
- url (str): URL of the scraped web page.
- time (int): UNIX timestamp of the scraping.
- title (str): Title of the thread.
- comments (str): Comments of the thread page.
- published (str): Publish time of the thread.
- author (str): Author of the thread.

Returned comment fields:
- author (str): Author of the comment.
- date (str): Publish time of the comment.
- quotes (list of str): List of quotes in the comment.
- content (str): Contents of the comment.
- numbering (str): Numbering of the comment.
- likes (int): Like count of the comment.

clear()
Clear contents of jobdir.

get(fmt='df')
Return scraped data as a DataFrame or a list.
Parameters: fmt (str, optional) – Format to return parsed items in. Should be in ['df', 'list']. Defaults to 'df'.
Returns: If fmt='df', a DataFrame of scraped items. If fmt='list', a list of dicts of scraped items.
Raises: ValueError – If fmt is not one of the allowed values.

property items_save_path
Save path of the scraped items. Cannot be changed after initialization of a spider.

property jobdir
Working directory of the spider. Can be changed after initialization of a spider.

classmethod load(jobdir)
Load an existing spider from jobdir.
Parameters: jobdir (str) – Path to the job directory.
Returns: Spider loaded from the job directory.

property log_level
Logging level of the spider. Can be changed after initialization of a spider.

property progress_bar
Whether the progress bar is enabled or not. Can be changed after initialization of a spider.

save()
Save the spider in jobdir for later use.
Returns: Path to the job directory.
Return type: str

scrape(n=10, timeout=60, settings=None)
Scrape the given number of items.
Parameters:
- n (int, optional) – Number of items to attempt to scrape. Zero corresponds to no limit. Defaults to 10.
- timeout (int, optional) – Timeout in seconds to wait before stopping the spider. Zero corresponds to no limit. Defaults to 60.
- settings (dict or None, optional) – Scrapy spider settings to use. Defaults to None, which corresponds to the default settings. See the list of available settings at https://docs.scrapy.org/en/latest/topics/settings.html.
Returns: self

property spider_save_path
Save path of the spider. Cannot be changed after initialization of a spider.
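For example, a minimal end-to-end workflow (the spider discovers URLs on its own):

    from finscraper.spiders import DemiPage

    spider = DemiPage()              # jobdir=None -> a temp directory is created
    df = spider.scrape(n=10).get()   # scrape returns self; get defaults to a DataFrame
    print(df[['title', 'published']].head())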
class finscraper.spiders.ILArticle(jobdir=None, progress_bar=True, log_level=None)
Bases: finscraper.wrappers._SpiderWrapper

Fetch Iltalehti news articles.

Parameters:
- jobdir (str or None, optional) – Working directory of the spider. Defaults to None, which creates a temp directory to be used. Note that this directory will only be deleted through the clear method!
- progress_bar (bool, optional) – Whether to enable the progress bar or not. This parameter is ignored if log_level is not None. Defaults to True.
- log_level (str or None, optional) – Logging level to display. Should be in ['debug', 'info', 'warn', 'error', 'critical'] or None (disabled). Defaults to None. Note that this parameter can be overridden through the Scrapy settings (LOG_LEVEL, LOG_ENABLED) within the scrape method.

Returned fields:
- url (str): URL of the scraped web page.
- time (int): UNIX timestamp of the scraping.
- title (str): Title of the article.
- ingress (str): Ingress of the article.
- content (str): Contents of the article.
- published (str): Publish time of the article.
- author (str): Author of the article.
- images (list of dict): Images of the article.

clear()
Clear contents of jobdir.

get(fmt='df')
Return scraped data as a DataFrame or a list.
Parameters: fmt (str, optional) – Format to return parsed items in. Should be in ['df', 'list']. Defaults to 'df'.
Returns: If fmt='df', a DataFrame of scraped items. If fmt='list', a list of dicts of scraped items.
Raises: ValueError – If fmt is not one of the allowed values.

property items_save_path
Save path of the scraped items. Cannot be changed after initialization of a spider.

property jobdir
Working directory of the spider. Can be changed after initialization of a spider.

classmethod load(jobdir)
Load an existing spider from jobdir.
Parameters: jobdir (str) – Path to the job directory.
Returns: Spider loaded from the job directory.

property log_level
Logging level of the spider. Can be changed after initialization of a spider.

property progress_bar
Whether the progress bar is enabled or not. Can be changed after initialization of a spider.

save()
Save the spider in jobdir for later use.
Returns: Path to the job directory.
Return type: str

scrape(n=10, timeout=60, settings=None)
Scrape the given number of items.
Parameters:
- n (int, optional) – Number of items to attempt to scrape. Zero corresponds to no limit. Defaults to 10.
- timeout (int, optional) – Timeout in seconds to wait before stopping the spider. Zero corresponds to no limit. Defaults to 60.
- settings (dict or None, optional) – Scrapy spider settings to use. Defaults to None, which corresponds to the default settings. See the list of available settings at https://docs.scrapy.org/en/latest/topics/settings.html.
Returns: self

property spider_save_path
Save path of the spider. Cannot be changed after initialization of a spider.
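Because scrape returns self and items accumulate in jobdir, a spider can be persisted and resumed later; a sketch:

    from finscraper.spiders import ILArticle

    spider = ILArticle()
    spider.scrape(10)
    jobdir = spider.save()            # returns the job directory path (str)

    spider = ILArticle.load(jobdir)   # later: restore the spider
    df = spider.scrape(10).get()      # previously scraped items are included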
class finscraper.spiders.ISArticle(jobdir=None, progress_bar=True, log_level=None)
Bases: finscraper.wrappers._SpiderWrapper

Fetch Ilta-Sanomat news articles.

Parameters:
- jobdir (str or None, optional) – Working directory of the spider. Defaults to None, which creates a temp directory to be used. Note that this directory will only be deleted through the clear method!
- progress_bar (bool, optional) – Whether to enable the progress bar or not. This parameter is ignored if log_level is not None. Defaults to True.
- log_level (str or None, optional) – Logging level to display. Should be in ['debug', 'info', 'warn', 'error', 'critical'] or None (disabled). Defaults to None. Note that this parameter can be overridden through the Scrapy settings (LOG_LEVEL, LOG_ENABLED) within the scrape method.

Returned fields:
- url (str): URL of the scraped web page.
- time (int): UNIX timestamp of the scraping.
- title (str): Title of the article.
- ingress (str): Ingress of the article.
- content (str): Contents of the article.
- published (str): Publish time of the article.
- author (str): Author of the article.
- images (list of dict): Images of the article.

clear()
Clear contents of jobdir.

get(fmt='df')
Return scraped data as a DataFrame or a list.
Parameters: fmt (str, optional) – Format to return parsed items in. Should be in ['df', 'list']. Defaults to 'df'.
Returns: If fmt='df', a DataFrame of scraped items. If fmt='list', a list of dicts of scraped items.
Raises: ValueError – If fmt is not one of the allowed values.

property items_save_path
Save path of the scraped items. Cannot be changed after initialization of a spider.

property jobdir
Working directory of the spider. Can be changed after initialization of a spider.

classmethod load(jobdir)
Load an existing spider from jobdir.
Parameters: jobdir (str) – Path to the job directory.
Returns: Spider loaded from the job directory.

property log_level
Logging level of the spider. Can be changed after initialization of a spider.

property progress_bar
Whether the progress bar is enabled or not. Can be changed after initialization of a spider.

save()
Save the spider in jobdir for later use.
Returns: Path to the job directory.
Return type: str

scrape(n=10, timeout=60, settings=None)
Scrape the given number of items.
Parameters:
- n (int, optional) – Number of items to attempt to scrape. Zero corresponds to no limit. Defaults to 10.
- timeout (int, optional) – Timeout in seconds to wait before stopping the spider. Zero corresponds to no limit. Defaults to 60.
- settings (dict or None, optional) – Scrapy spider settings to use. Defaults to None, which corresponds to the default settings. See the list of available settings at https://docs.scrapy.org/en/latest/topics/settings.html.
Returns: self

property spider_save_path
Save path of the spider. Cannot be changed after initialization of a spider.
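The same data can be retrieved as plain Python objects instead of a DataFrame:

    from finscraper.spiders import ISArticle

    spider = ISArticle()
    items = spider.scrape(5).get(fmt='list')   # list of dicts
    print(items[0]['url'], items[0]['title'])
    # Any other value, e.g. get(fmt='json'), raises ValueError.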
class finscraper.spiders.OikotieApartment(jobdir=None, progress_bar=True, log_level=None)
Bases: finscraper.wrappers._SpiderWrapper

Fetch oikotie.fi apartments.

Parameters:
- jobdir (str or None, optional) – Working directory of the spider. Defaults to None, which creates a temp directory to be used. Note that this directory will only be deleted through the clear method!
- progress_bar (bool, optional) – Whether to enable the progress bar or not. This parameter is ignored if log_level is not None. Defaults to True.
- log_level (str or None, optional) – Logging level to display. Should be in ['debug', 'info', 'warn', 'error', 'critical'] or None (disabled). Defaults to None. Note that this parameter can be overridden through the Scrapy settings (LOG_LEVEL, LOG_ENABLED) within the scrape method.
Returned fields (the Finnish terms below are the field labels as they appear on oikotie.fi, with English glosses in parentheses):
- url (str): URL of the scraped web page.
- time (int): UNIX timestamp of the scraping.
- title (str): Title of the web browser tab.
- overview (str): Overview text of the apartment.
- contact_person_name (str): Name of the contact person.
- contact_person_job_title (str): Job title of the contact person.
- contact_person_phone_number (str): Phone number of the contact person.
- contact_person_company (str): Company of the contact person.
- location (str): Sijainti (location).
- city (str): Kaupunki (city).
- district (str): Kaupunginosa (district).
- oikotie_id (str): Kohdenumero (listing number).
- floor (str): Kerros (floor).
- life_sq (str): Asuinpinta-ala (living area).
- property_sq (str): Tontin pinta-ala (plot area).
- total_sq (str): Kokonaispinta-ala (total area).
- room_info (str): Huoneiston kokoonpano (apartment layout).
- number_of_rooms (str): Huoneita (number of rooms).
- condition (str): Kunto (condition).
- condition_details (str): Kunnon lisätiedot (condition details).
- availability (str): Lisätietoa vapautumisesta (availability details).
- kitchen_appliances (str): Keittiön varusteet (kitchen equipment).
- bathroom_appliances (str): Kylpyhuoneen varusteet (bathroom equipment).
- window_direction (str): Ikkunoiden suunta (direction of the windows).
- has_balcony (str): Parveke (balcony).
- balcony_details (str): Parvekkeen lisätiedot (balcony details).
- storage_space (str): Säilytystilat (storage space).
- view (str): Näkymät (view).
- future_renovations (str): Tulevat remontit (upcoming renovations).
- completed_renovations (str): Tehdyt remontit (completed renovations).
- has_sauna (str): Asunnossa sauna (apartment has a sauna).
- sauna_details (str): Saunan lisätiedot (sauna details).
- housing_type (str): Asumistyyppi (housing type).
- services (str): Palvelut (services).
- additional_info (str): Lisätiedot (additional information).
- property_id (str): Kiinteistötunnus (property identifier).
- apartment_is (str): Kohde on (the property is).
- telecommunication_services (str): Tietoliikennepalvelut (telecommunication services).
- price_no_tax (str): Velaton hinta (debt-free price).
- sales_price (str): Myyntihinta (sales price).
- shared_loan_payment (str): Lainaosuuden maksu (payment of the loan share).
- price_per_sq (str): Neliöhinta (price per square meter).
- share_of_liabilities (str): Velkaosuus (share of liabilities).
- mortgages (str): Kiinnitykset (mortgages).
- financial_charge (str): Rahoitusvastike (financial charge).
- condominium_payment (str): Hoitovastike (maintenance charge).
- maintenance_charge (str): Yhtiövastike (housing company charge).
- water_charge (str): Vesimaksu (water charge).
- water_charge_details (str): Vesimaksun lisätiedot (water charge details).
- heating_charge (str): Lämmityskustannukset (heating costs).
- other_costs (str): Muut kustannukset (other costs).
- is_brand_new (str): Uudiskohde (new development).
- housing_company_name (str): Taloyhtiön nimi (name of the housing company).
- building_type (str): Rakennuksen tyyppi (building type).
- build_year (str): Rakennusvuosi (build year).
- build_year_details (str): Rakennusvuoden lisätiedot (build year details).
- number_of_apartments (str): Huoneistojen lukumäärä (number of apartments).
- total_floors (str): Kerroksia (number of floors).
- building_has_elevator (str): Hissi (elevator).
- building_has_sauna (str): Taloyhtiössä on sauna (housing company has a sauna).
- building_material (str): Rakennusmateriaali (building material).
- roof_type (str): Kattotyyppi (roof type).
- energy_class (str): Energialuokka (energy class).
- has_energy_certificate (str): Energiatodistus (energy certificate).
- antenna_system (str): Kiinteistön antennijärjestelmä (antenna system of the property).
- property_size (str): Tontin koko (plot size).
- maintenance (str): Kiinteistönhoito (property maintenance).
- real_estate_management (str): Isännöinti (property management).
- plan_info (str): Kaavatiedot (zoning information).
- plan (str): Kaavatilanne (zoning status).
- traffic_communication (str): Liikenneyhteydet (transport connections).
- heating (str): Lämmitys (heating).
- parking_space_description (str): Pysäköintitilan kuvaus (description of the parking space).
- common_spaces (str): Yhteiset tilat (common spaces).
- wallcovering (str): Pintamateriaalit (surface materials).
clear()
Clear contents of jobdir.

get(fmt='df')
Return scraped data as a DataFrame or a list.
Parameters: fmt (str, optional) – Format to return parsed items in. Should be in ['df', 'list']. Defaults to 'df'.
Returns: If fmt='df', a DataFrame of scraped items. If fmt='list', a list of dicts of scraped items.
Raises: ValueError – If fmt is not one of the allowed values.

property items_save_path
Save path of the scraped items. Cannot be changed after initialization of a spider.

property jobdir
Working directory of the spider. Can be changed after initialization of a spider.

classmethod load(jobdir)
Load an existing spider from jobdir.
Parameters: jobdir (str) – Path to the job directory.
Returns: Spider loaded from the job directory.

property log_level
Logging level of the spider. Can be changed after initialization of a spider.

property progress_bar
Whether the progress bar is enabled or not. Can be changed after initialization of a spider.

save()
Save the spider in jobdir for later use.
Returns: Path to the job directory.
Return type: str

scrape(n=10, timeout=60, settings=None)
Scrape the given number of items.
Parameters:
- n (int, optional) – Number of items to attempt to scrape. Zero corresponds to no limit. Defaults to 10.
- timeout (int, optional) – Timeout in seconds to wait before stopping the spider. Zero corresponds to no limit. Defaults to 60.
- settings (dict or None, optional) – Scrapy spider settings to use. Defaults to None, which corresponds to the default settings. See the list of available settings at https://docs.scrapy.org/en/latest/topics/settings.html.
Returns: self

property spider_save_path
Save path of the spider. Cannot be changed after initialization of a spider.
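A sketch of explicit job-directory management (the directory name here is arbitrary):

    from finscraper.spiders import OikotieApartment

    spider = OikotieApartment(jobdir='oikotie_job')
    spider.scrape(5)
    print(spider.items_save_path)   # fixed after initialization
    spider.clear()                  # contents of jobdir are removed only on request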
class finscraper.spiders.Suomi24Page(jobdir=None, progress_bar=True, log_level=None)
Bases: finscraper.wrappers._SpiderWrapper

Fetch comments from suomi24.fi.

Parameters:
- jobdir (str or None, optional) – Working directory of the spider. Defaults to None, which creates a temp directory to be used. Note that this directory will only be deleted through the clear method!
- progress_bar (bool, optional) – Whether to enable the progress bar or not. This parameter is ignored if log_level is not None. Defaults to True.
- log_level (str or None, optional) – Logging level to display. Should be in ['debug', 'info', 'warn', 'error', 'critical'] or None (disabled). Defaults to None. Note that this parameter can be overridden through the Scrapy settings (LOG_LEVEL, LOG_ENABLED) within the scrape method.

Returned page fields:
- url (str): URL of the scraped web page.
- time (int): UNIX timestamp of the scraping.
- title (str): Title of the thread.
- content (str): Content of the first message.
- comments (str): Comments of the thread page.
- published (str): Publish time of the thread.
- author (str): Author of the thread.
- n_comments (int): Number of comments in the thread.
- views (str): Number of views.

Returned comment fields:
- author (str): Author of the comment.
- date (str): Publish time of the comment.
- quotes (list of str): List of quotes in the comment.
- responses (list of dict): Response comments to this comment.
- content (str): Contents of the comment.

Returned comment response fields:
- author (str): Author of the comment response.
- date (str): Publish time of the comment response.
- quotes (list of str): List of quotes in the comment response.
- content (str): Contents of the comment response.

clear()
Clear contents of jobdir.

get(fmt='df')
Return scraped data as a DataFrame or a list.
Parameters: fmt (str, optional) – Format to return parsed items in. Should be in ['df', 'list']. Defaults to 'df'.
Returns: If fmt='df', a DataFrame of scraped items. If fmt='list', a list of dicts of scraped items.
Raises: ValueError – If fmt is not one of the allowed values.

property items_save_path
Save path of the scraped items. Cannot be changed after initialization of a spider.

property jobdir
Working directory of the spider. Can be changed after initialization of a spider.

classmethod load(jobdir)
Load an existing spider from jobdir.
Parameters: jobdir (str) – Path to the job directory.
Returns: Spider loaded from the job directory.

property log_level
Logging level of the spider. Can be changed after initialization of a spider.

property progress_bar
Whether the progress bar is enabled or not. Can be changed after initialization of a spider.

save()
Save the spider in jobdir for later use.
Returns: Path to the job directory.
Return type: str

scrape(n=10, timeout=60, settings=None)
Scrape the given number of items.
Parameters:
- n (int, optional) – Number of items to attempt to scrape. Zero corresponds to no limit. Defaults to 10.
- timeout (int, optional) – Timeout in seconds to wait before stopping the spider. Zero corresponds to no limit. Defaults to 60.
- settings (dict or None, optional) – Scrapy spider settings to use. Defaults to None, which corresponds to the default settings. See the list of available settings at https://docs.scrapy.org/en/latest/topics/settings.html.
Returns: self

property spider_save_path
Save path of the spider. Cannot be changed after initialization of a spider.
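As noted above, logging can also be controlled per call through Scrapy settings; an illustrative override:

    from finscraper.spiders import Suomi24Page

    spider = Suomi24Page()
    spider.scrape(n=10, settings={'LOG_ENABLED': True, 'LOG_LEVEL': 'INFO'})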
class finscraper.spiders.ToriDeal(jobdir=None, progress_bar=True, log_level=None)
Bases: finscraper.wrappers._SpiderWrapper

Fetch deals from tori.fi.

Parameters:
- jobdir (str or None, optional) – Working directory of the spider. Defaults to None, which creates a temp directory to be used. Note that this directory will only be deleted through the clear method!
- progress_bar (bool, optional) – Whether to enable the progress bar or not. This parameter is ignored if log_level is not None. Defaults to True.
- log_level (str or None, optional) – Logging level to display. Should be in ['debug', 'info', 'warn', 'error', 'critical'] or None (disabled). Defaults to None. Note that this parameter can be overridden through the Scrapy settings (LOG_LEVEL, LOG_ENABLED) within the scrape method.

Returned fields:
- url (str): URL of the scraped web page.
- time (int): UNIX timestamp of the scraping.
- seller (str): Seller of the item.
- name (str): Name of the item.
- description (list of str): Description of the item.
- price (str): Price of the item.
- type (str): Type of the deal.
- published (str): Publish time of the deal.
- images (list of dict): Images of the item.

clear()
Clear contents of jobdir.

get(fmt='df')
Return scraped data as a DataFrame or a list.
Parameters: fmt (str, optional) – Format to return parsed items in. Should be in ['df', 'list']. Defaults to 'df'.
Returns: If fmt='df', a DataFrame of scraped items. If fmt='list', a list of dicts of scraped items.
Raises: ValueError – If fmt is not one of the allowed values.

property items_save_path
Save path of the scraped items. Cannot be changed after initialization of a spider.

property jobdir
Working directory of the spider. Can be changed after initialization of a spider.

classmethod load(jobdir)
Load an existing spider from jobdir.
Parameters: jobdir (str) – Path to the job directory.
Returns: Spider loaded from the job directory.

property log_level
Logging level of the spider. Can be changed after initialization of a spider.

property progress_bar
Whether the progress bar is enabled or not. Can be changed after initialization of a spider.

save()
Save the spider in jobdir for later use.
Returns: Path to the job directory.
Return type: str

scrape(n=10, timeout=60, settings=None)
Scrape the given number of items.
Parameters:
- n (int, optional) – Number of items to attempt to scrape. Zero corresponds to no limit. Defaults to 10.
- timeout (int, optional) – Timeout in seconds to wait before stopping the spider. Zero corresponds to no limit. Defaults to 60.
- settings (dict or None, optional) – Scrapy spider settings to use. Defaults to None, which corresponds to the default settings. See the list of available settings at https://docs.scrapy.org/en/latest/topics/settings.html.
Returns: self

property spider_save_path
Save path of the spider. Cannot be changed after initialization of a spider.
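With n=0 there is no item limit, so the timeout becomes the stopping condition:

    from finscraper.spiders import ToriDeal

    spider = ToriDeal()
    df = spider.scrape(n=0, timeout=30).get()   # scrape for at most 30 seconds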
class finscraper.spiders.VauvaPage(jobdir=None, progress_bar=True, log_level=None)
Bases: finscraper.wrappers._SpiderWrapper

Fetch comments from vauva.fi.

Parameters:
- jobdir (str or None, optional) – Working directory of the spider. Defaults to None, which creates a temp directory to be used. Note that this directory will only be deleted through the clear method!
- progress_bar (bool, optional) – Whether to enable the progress bar or not. This parameter is ignored if log_level is not None. Defaults to True.
- log_level (str or None, optional) – Logging level to display. Should be in ['debug', 'info', 'warn', 'error', 'critical'] or None (disabled). Defaults to None. Note that this parameter can be overridden through the Scrapy settings (LOG_LEVEL, LOG_ENABLED) within the scrape method.

Returned page fields:
- url (str): URL of the scraped web page.
- time (int): UNIX timestamp of the scraping.
- title (str): Title of the thread.
- page (int): Page number of the thread.
- pages (int): Pages in the thread.
- comments (str): Comments of the thread page.
- published (str): Publish time of the thread.
- author (str): Author of the thread.

Returned comment fields:
- author (str): Author of the comment.
- date (str): Publish time of the comment.
- quotes (list of str): List of quotes in the comment.
- content (str): Contents of the comment.
- upvotes (int): Upvotes of the comment.
- downvotes (int): Downvotes of the comment.

clear()
Clear contents of jobdir.

get(fmt='df')
Return scraped data as a DataFrame or a list.
Parameters: fmt (str, optional) – Format to return parsed items in. Should be in ['df', 'list']. Defaults to 'df'.
Returns: If fmt='df', a DataFrame of scraped items. If fmt='list', a list of dicts of scraped items.
Raises: ValueError – If fmt is not one of the allowed values.

property items_save_path
Save path of the scraped items. Cannot be changed after initialization of a spider.

property jobdir
Working directory of the spider. Can be changed after initialization of a spider.

classmethod load(jobdir)
Load an existing spider from jobdir.
Parameters: jobdir (str) – Path to the job directory.
Returns: Spider loaded from the job directory.

property log_level
Logging level of the spider. Can be changed after initialization of a spider.

property progress_bar
Whether the progress bar is enabled or not. Can be changed after initialization of a spider.

save()
Save the spider in jobdir for later use.
Returns: Path to the job directory.
Return type: str

scrape(n=10, timeout=60, settings=None)
Scrape the given number of items.
Parameters:
- n (int, optional) – Number of items to attempt to scrape. Zero corresponds to no limit. Defaults to 10.
- timeout (int, optional) – Timeout in seconds to wait before stopping the spider. Zero corresponds to no limit. Defaults to 60.
- settings (dict or None, optional) – Scrapy spider settings to use. Defaults to None, which corresponds to the default settings. See the list of available settings at https://docs.scrapy.org/en/latest/topics/settings.html.
Returns: self

property spider_save_path
Save path of the spider. Cannot be changed after initialization of a spider.
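The log_level and progress_bar properties may be changed between scrapes, for example:

    from finscraper.spiders import VauvaPage

    spider = VauvaPage(log_level='info')   # progress bar is ignored while log_level is set
    spider.scrape(10)
    spider.log_level = None                # disable logging again
    spider.progress_bar = True
    spider.scrape(10)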
class finscraper.spiders.YLEArticle(jobdir=None, progress_bar=True, log_level=None)
Bases: finscraper.wrappers._SpiderWrapper

Fetch YLE news articles.

Parameters:
- jobdir (str or None, optional) – Working directory of the spider. Defaults to None, which creates a temp directory to be used. Note that this directory will only be deleted through the clear method!
- progress_bar (bool, optional) – Whether to enable the progress bar or not. This parameter is ignored if log_level is not None. Defaults to True.
- log_level (str or None, optional) – Logging level to display. Should be in ['debug', 'info', 'warn', 'error', 'critical'] or None (disabled). Defaults to None. Note that this parameter can be overridden through the Scrapy settings (LOG_LEVEL, LOG_ENABLED) within the scrape method.

Returned fields:
- url (str): URL of the scraped web page.
- time (int): UNIX timestamp of the scraping.
- title (str): Title of the article.
- ingress (str): Ingress of the article.
- content (str): Contents of the article.
- published (str): Publish time of the article.
- author (str): Author of the article.
- images (list of dict): Images of the article.

clear()
Clear contents of jobdir.

get(fmt='df')
Return scraped data as a DataFrame or a list.
Parameters: fmt (str, optional) – Format to return parsed items in. Should be in ['df', 'list']. Defaults to 'df'.
Returns: If fmt='df', a DataFrame of scraped items. If fmt='list', a list of dicts of scraped items.
Raises: ValueError – If fmt is not one of the allowed values.

property items_save_path
Save path of the scraped items. Cannot be changed after initialization of a spider.

property jobdir
Working directory of the spider. Can be changed after initialization of a spider.

classmethod load(jobdir)
Load an existing spider from jobdir.
Parameters: jobdir (str) – Path to the job directory.
Returns: Spider loaded from the job directory.

property log_level
Logging level of the spider. Can be changed after initialization of a spider.

property progress_bar
Whether the progress bar is enabled or not. Can be changed after initialization of a spider.

save()
Save the spider in jobdir for later use.
Returns: Path to the job directory.
Return type: str

scrape(n=10, timeout=60, settings=None)
Scrape the given number of items.
Parameters:
- n (int, optional) – Number of items to attempt to scrape. Zero corresponds to no limit. Defaults to 10.
- timeout (int, optional) – Timeout in seconds to wait before stopping the spider. Zero corresponds to no limit. Defaults to 60.
- settings (dict or None, optional) – Scrapy spider settings to use. Defaults to None, which corresponds to the default settings. See the list of available settings at https://docs.scrapy.org/en/latest/topics/settings.html.
Returns: self

property spider_save_path
Save path of the spider. Cannot be changed after initialization of a spider.
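Since get returns a pandas DataFrame by default, results can be inspected directly:

    from finscraper.spiders import YLEArticle

    spider = YLEArticle()
    df = spider.scrape(20, timeout=120).get()
    print(df[['published', 'title']].head())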
finscraper.text_utils module

Module for text processing utility functions and classes.

finscraper.text_utils.drop_empty_elements(text_list)

finscraper.text_utils.paragraph_join(text_list)

finscraper.text_utils.replace(text, source, target)

finscraper.text_utils.safe_cast_int(text)

finscraper.text_utils.strip_elements(text_list)

finscraper.text_utils.strip_join(text_list, join_with=' ')
finscraper.utils module

Module for utility functions and classes.

class finscraper.utils.QueueHandler(queue)
Bases: logging.Handler

Sends events to a queue, allowing multiprocessing. This handler checks records for picklability before saving them into the queue. Modified from https://gist.github.com/vsajip/591589.

acquire()
Acquire the I/O thread lock.

addFilter(filter)
Add the specified filter to this handler.

close()
Tidy up any resources used by the handler. This version removes the handler from an internal map of handlers, _handlers, which is used for handler lookup by name. Subclasses should ensure that this gets called from overridden close() methods.

createLock()
Acquire a thread lock for serializing access to the underlying I/O.

emit(record)
Do whatever it takes to actually log the specified logging record. This version is intended to be implemented by subclasses and so raises a NotImplementedError.

enqueue(record)

filter(record)
Determine if a record is loggable by consulting all the filters. The default is to allow the record to be logged; any filter can veto this, and the record is then dropped. Returns a zero value if the record is to be dropped, else non-zero. Changed in version 3.2: allow filters to be just callables.

flush()
Ensure all logging output has been flushed. This version does nothing and is intended to be implemented by subclasses.

format(record)
Format the specified record. If a formatter is set, use it. Otherwise, use the default formatter for the module.

get_name()

handle(record)
Conditionally emit the specified logging record. Emission depends on any filters which have been added to the handler. Wraps the actual emission of the record with acquisition/release of the I/O thread lock. Returns whether the filter passed the record for emission.

handleError(record)
Handle errors which occur during an emit() call. This method should be called from handlers when an exception is encountered during an emit() call. If raiseExceptions is false, exceptions get silently ignored. This is what is mostly wanted for a logging system: most users will not care about errors in the logging system, they are more interested in application errors. You could, however, replace this with a custom handler if you wish. The record which was being processed is passed in to this method.

property name

prepare(record)

release()
Release the I/O thread lock.

removeFilter(filter)
Remove the specified filter from this handler.

setFormatter(fmt)
Set the formatter for this handler.

setLevel(level)
Set the logging level of this handler. level must be an int or a str.

set_name(name)
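A minimal sketch of the intended use (the logger name and message are arbitrary):

    import logging
    import multiprocessing as mp

    from finscraper.utils import QueueHandler

    queue = mp.Queue()
    logger = logging.getLogger('worker')
    logger.addHandler(QueueHandler(queue))
    logger.warning('scraped 10 items')   # record is checked for picklability and enqueued
    print(queue.get().getMessage())      # in real use, consumed by another process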
class finscraper.utils.TqdmLogger(logger)
Bases: _io.StringIO

File-like object that redirects its buffer to stdout.

close()
Close the IO object. Attempting any further operation after the object is closed will raise a ValueError. This method has no effect if the file is already closed.

closed

detach()
Separate the underlying buffer from the TextIOBase and return it. After the underlying buffer has been detached, the TextIO is in an unusable state.

encoding
Encoding of the text stream. Subclasses should override.

errors
The error setting of the decoder or encoder. Subclasses should override.

fileno()
Returns the underlying file descriptor if one exists. OSError is raised if the IO object does not use a file descriptor.

flush()
Flush write buffers, if applicable. This is not implemented for read-only and non-blocking streams.

getvalue()
Retrieve the entire contents of the object.

isatty()
Return whether this is an 'interactive' stream. Return False if it can't be determined.

line_buffering

newlines
Line endings translated so far. Only line endings translated during reading are considered. Subclasses should override.

read()
Read at most size characters, returned as a string. If the argument is negative or omitted, read until EOF is reached. Return an empty string at EOF.

readable()
Returns True if the IO object can be read.

readline()
Read until newline or EOF. Returns an empty string if EOF is hit immediately.

readlines()
Return a list of lines from the stream. hint can be specified to control the number of lines read: no more lines will be read if the total size (in bytes/characters) of all lines so far exceeds hint.

seek()
Change the stream position. Seek to character offset pos relative to the position indicated by whence:
- 0 – start of stream (the default); pos should be >= 0.
- 1 – current position; pos must be 0.
- 2 – end of stream; pos must be 0.
Returns the new absolute position.

seekable()
Returns True if the IO object can be seeked.

tell()
Tell the current file position.

truncate()
Truncate size to pos. The pos argument defaults to the current file position, as returned by tell(). The current file position is unchanged. Returns the new absolute position.

writable()
Returns True if the IO object can be written.

write(buf)
Write a string to the file. Returns the number of characters written, which is always equal to the length of the string.

writelines()
Write a list of lines to the stream. Line separators are not added, so it is usual for each of the lines provided to have a line separator at the end.
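A sketch of the assumed usage, passing the object as tqdm's file target so that progress-bar output flows through the file-like interface:

    import logging

    from tqdm import tqdm
    from finscraper.utils import TqdmLogger

    logger = logging.getLogger('progress')
    with tqdm(total=100, file=TqdmLogger(logger)) as pbar:
        pbar.update(10)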
finscraper.utils.console(text, bold=False)

finscraper.utils.get_chromedriver(options=None)
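A hedged sketch of get_chromedriver; that options is a Selenium ChromeOptions instance is an assumption based on the parameter name:

    from selenium.webdriver.chrome.options import Options

    from finscraper.utils import get_chromedriver

    options = Options()                        # assumption: selenium ChromeOptions
    options.add_argument('--headless')
    driver = get_chromedriver(options=options)
    driver.get('https://example.com')
    driver.quit()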
finscraper.wrappers module

Module for wrapping Scrapy spiders.