The crawlster module
class crawlster.Crawlster(config=None)
    Base class for web crawlers.
    Any crawler must subclass this and provide a valid Configuration object as the config class attribute.
finalize()
    Performs the finalize action on all item handlers and helpers.
get_pool()
    Creates and returns the worker pool.
init_context()
    Initializes the crawler context (the queue and the worker pool).
inject_config_and_crawler(to_be_injected)
    Injects the config instance and the crawler instance into the object.
    The crawler instance is accessible through the .crawler attribute and the config instance through the .config attribute. After injection, .initialize() is called on the object to perform its init actions.
    Parameters: to_be_injected (object) – The object that receives the config and crawler attributes. Must have an .initialize() method.
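The injection order described above can be sketched as follows. This is a minimal stand-in written as a plain function, not the library's implementation; `DemoHelper`, the `ready` flag, and the placeholder `crawler` and `config` values are assumptions made for illustration.

```python
class DemoHelper:
    """Hypothetical helper; only the initialize() method is required."""

    def initialize(self):
        # Runs after injection, so both attributes are already available.
        self.ready = self.config is not None and self.crawler is not None


def inject_config_and_crawler(crawler, config, to_be_injected):
    """Stand-in for the method described above, written as a plain function."""
    to_be_injected.crawler = crawler
    to_be_injected.config = config
    # initialize() is called last, once both attributes are set.
    to_be_injected.initialize()


crawler = object()             # placeholder for a crawler instance
config = {"demo.option": 1}    # placeholder for a Configuration object
helper = DemoHelper()
inject_config_and_crawler(crawler, config, helper)
```

Because initialize() runs only after both attributes are assigned, the object can safely read .config and .crawler inside it.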
inject_handlers()
    Injects and initializes all the known item handlers.
inject_helpers()
    Injects and initializes all the known helpers.
iter_helpers()
    Iterates through all the known helpers.
iter_item_handlers()
    Iterates through all the known item handlers.
populate_config()
    Populates the config with the options from helpers and item handlers.
    Each helper and item handler defines a list of options that it uses. This method visits each helper and item handler to populate the config instance with those options.
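The visiting pattern described above might look roughly like the sketch below. It is written under stated assumptions: the `options` attribute, its name-to-default mapping, the option names, and the choice to keep values that are already set are all hypothetical and not taken from the library.

```python
class DemoHelper:
    # Hypothetical option declarations: option name -> default value.
    options = {"demo.timeout": 10}


class DemoItemHandler:
    options = {"demo.output_path": "items.jsonl"}


def populate_config(config, helpers, item_handlers):
    """Visit every component and register the options it declares."""
    for component in list(helpers) + list(item_handlers):
        for name, default in component.options.items():
            # Keep a value that is already set; otherwise use the default.
            config.setdefault(name, default)
    return config


config = populate_config(
    {"demo.timeout": 30}, [DemoHelper()], [DemoItemHandler()]
)
```

In this sketch a plain dict stands in for the config instance, which keeps the example independent of the real Configuration class.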
process_job(job)
    Processes a single job and enqueues the results.
report_error(e, failed_job)
    Reports a failed job.
    Parameters:
    - e (Exception) – The exception instance that was raised
    - failed_job (Job) – The job instance that caused the exception
schedule(func, *args, **kwargs)
    Schedules the next step to be executed by workers.
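One plausible way to read the schedule signature is that the callable and its arguments are captured now and run later by a worker. The sketch below illustrates that idea only; the `Job` fields, the module-level `pending` queue, and the `greet` function are hypothetical and do not describe the library's internals.

```python
import queue
from collections import namedtuple

# Hypothetical job record: the callable plus the arguments to call it with.
Job = namedtuple("Job", ["func", "args", "kwargs"])

pending = queue.Queue()


def schedule(func, *args, **kwargs):
    """Capture the call for later execution instead of running it now."""
    pending.put(Job(func, args, kwargs))


def greet(name, punctuation="."):
    return "hello " + name + punctuation


schedule(greet, "crawler", punctuation="!")

# A worker would later take the job off the queue and execute it.
job = pending.get()
result = job.func(*job.args, **job.kwargs)
```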
start()
    Starts crawling based on the config.
submit_item(item)
    Submits an item to be handled by the item handlers.
    Parameters: item (dict) – The item that has to be processed
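As an illustration of handing one item to every item handler, the following sketch uses a hypothetical `CollectingHandler` class with a hypothetical `handle` method; the real item handler interface may differ.

```python
class CollectingHandler:
    """Hypothetical item handler that simply stores what it receives."""

    def __init__(self):
        self.items = []

    def handle(self, item):
        self.items.append(item)


def submit_item(item, item_handlers):
    """Pass the same item dict to each handler in turn."""
    for handler in item_handlers:
        handler.handle(item)


first, second = CollectingHandler(), CollectingHandler()
submit_item({"title": "Example", "url": "http://example.com"}, [first, second])
```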
worker()
    Worker body that executes the jobs.
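The interplay between the queue, the worker body, and error reporting can be sketched with the standard library alone. This is an assumption-laden illustration, not the library's code: jobs are plain `(func, args)` tuples, each step returns a list of follow-up jobs, `None` is used as a stop sentinel, and failures are collected in a list in place of report_error.

```python
import queue
import threading


def fetch(url):
    # Pretend fetch step: its result is a follow-up parse job.
    return [(parse, (url,))]


def parse(url):
    if "bad" in url:
        raise ValueError("cannot parse " + url)
    return []  # no further steps


def worker(jobs, errors):
    """Take jobs until the sentinel arrives; enqueue whatever each job returns."""
    while True:
        job = jobs.get()
        try:
            if job is None:  # sentinel: stop this worker
                return
            func, args = job
            for next_job in func(*args):
                jobs.put(next_job)
        except Exception as exc:
            # Stand-in for report_error(e, failed_job).
            errors.append((exc, job))
        finally:
            jobs.task_done()


jobs = queue.Queue()
errors = []
threads = [threading.Thread(target=worker, args=(jobs, errors)) for _ in range(2)]
for thread in threads:
    thread.start()

jobs.put((fetch, ("http://example.com/good",)))
jobs.put((fetch, ("http://example.com/bad",)))

jobs.join()              # wait until every job, including follow-ups, is done
for _ in threads:
    jobs.put(None)       # one sentinel per worker
for thread in threads:
    thread.join()
```

Follow-up jobs are put on the queue before task_done() is called for the current job, so jobs.join() does not return while work is still outstanding.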
crawlster.start(method)
    Decorator for specifying the start step.
    Must decorate exactly one method of the crawler class.
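A marker-style decorator is one common way to implement this kind of contract. The sketch below shows that pattern under stated assumptions: the `is_start_step` attribute, the `find_start_step` helper, and `DemoCrawler` are hypothetical, and the real decorator may work differently.

```python
def start(method):
    """Mark a method as the start step (the marker attribute is hypothetical)."""
    method.is_start_step = True
    return method


def find_start_step(crawler_cls):
    """Return the name of the single method marked with @start."""
    marked = [
        name
        for name, attr in vars(crawler_cls).items()
        if getattr(attr, "is_start_step", False)
    ]
    if len(marked) != 1:
        raise ValueError("expected exactly one start step, found %d" % len(marked))
    return marked[0]


class DemoCrawler:
    @start
    def begin(self):
        return "first step"

    def parse(self):
        return "later step"


start_name = find_start_step(DemoCrawler)
```

Checking the count at lookup time enforces the rule that only one method may carry the decorator.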