The crawlster module

class crawlster.Crawlster(config=None)

Base class for web crawlers

Any crawler must subclass this and provide a valid Configuration object as the config class attribute.

finalize()

Performs the finalize action on all item handlers and helpers

get_pool()

Creates and returns the worker pool

init_context()

Initializes the crawler context (the queue and the worker pool)

inject_config_and_crawler(to_be_injected)

Injects the config instance and crawler instance into the object

The crawler instance is accessible through the .crawler attribute and the config instance through the .config attribute. After injection, the object's .initialize() method is called to perform its init actions.

Parameters: to_be_injected (object which has initialize()) – The object into which the config and crawler attributes will be injected. Must have an .initialize() method

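
The injection contract described above can be illustrated with a minimal, self-contained sketch. It does not import crawlster; FakeConfig, FakeCrawler and LoggingHelper are hypothetical stand-ins, and the real implementation may differ in its details.

```python
class FakeConfig:
    """Hypothetical stand-in for a Configuration object."""

    def __init__(self, options):
        self.options = options


class FakeCrawler:
    """Hypothetical stand-in that mimics only the injection step."""

    def __init__(self, config):
        self.config = config

    def inject_config_and_crawler(self, to_be_injected):
        # Attach both instances first, then let the object initialize itself.
        to_be_injected.crawler = self
        to_be_injected.config = self.config
        to_be_injected.initialize()


class LoggingHelper:
    """Hypothetical helper; the only requirement is an initialize() method."""

    def __init__(self):
        self.initialized = False

    def initialize(self):
        # Both attributes are already set by the time this runs.
        self.initialized = self.config is not None and self.crawler is not None


helper = LoggingHelper()
crawler = FakeCrawler(FakeConfig({"log.level": "info"}))
crawler.inject_config_and_crawler(helper)
```

Because initialize() runs only after both attributes are set, the injected object can safely read its options from self.config during initialization.
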
inject_handlers()

Injects and initializes all the known item handlers

inject_helpers()

Injects and initializes all the known helpers

iter_helpers()

Iterates through all the known helpers

iter_item_handlers()

Iterates through all the known item handlers

populate_config()

Populates the config with the options from helpers and item handlers

Each helper and item handler defines a list of options that it uses. This method will visit each helper and item handler to populate the config instance with those options.
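
One possible shape of this option-collection pass is sketched below. This is an assumption-laden illustration, not crawlster's code: the config is modelled as a plain dict, each component exposes a hypothetical options mapping of names to defaults, and explicitly configured values are assumed to take precedence over defaults.

```python
class HttpHelper:
    """Hypothetical helper declaring the options it uses."""
    options = {"http.timeout": 30, "http.retries": 3}


class JsonItemHandler:
    """Hypothetical item handler declaring the options it uses."""
    options = {"json.path": "items.json"}


def populate(config, components):
    """Visit every component and register its options in the config."""
    for component in components:
        for name, default in component.options.items():
            # Assumed policy: keep a value the user already set,
            # otherwise fall back to the component's default.
            config.setdefault(name, default)
    return config


config = populate({"http.timeout": 5}, [HttpHelper(), JsonItemHandler()])
```

With this policy, the user-provided http.timeout survives, while http.retries and json.path are filled in from the defaults declared by the components.
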

process_job(job)

Processes a single job and enqueues the results

report_error(e, failed_job)

Reports a failed job

Parameters:
  • e (Exception) – The exception instance that was thrown
  • failed_job (Job) – The job instance that caused the exception

schedule(func, *args, **kwargs)

Schedules the next step to be executed by workers

start()

Starts crawling based on the config

submit_item(item)

Submits an item to be handled by the item handlers

Parameters: item (dict) – The item that has to be processed

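
As a rough illustration of item submission, the sketch below passes one dict to every registered handler. The handle() method name and both handler classes are hypothetical; the source only states that items are dicts and that item handlers process them.

```python
class CollectingHandler:
    """Hypothetical handler that stores every item it receives."""

    def __init__(self):
        self.items = []

    def handle(self, item):
        self.items.append(item)


class CountingHandler:
    """Hypothetical handler that only counts items."""

    def __init__(self):
        self.count = 0

    def handle(self, item):
        self.count += 1


def submit_item(item, handlers):
    """Pass the same item to each handler in turn."""
    for handler in handlers:
        handler.handle(item)


collector = CollectingHandler()
counter = CountingHandler()
submit_item({"title": "Example", "url": "http://example.com"}, [collector, counter])
```
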
worker()

Worker body that executes the jobs
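
How schedule(), worker() and report_error() might cooperate around a shared queue can be sketched with the standard library alone. This is a simplified model under stated assumptions, not the library's implementation: jobs are modelled as (func, args, kwargs) tuples, the pool is two plain threads, and a None sentinel stops each worker.

```python
import queue
import threading

jobs = queue.Queue()   # shared job queue (part of the "context")
visited = []           # depths reached by the example step
errors = []            # (exception, failed job) pairs
lock = threading.Lock()


def schedule(func, *args, **kwargs):
    """Enqueue the next step so that some worker picks it up."""
    jobs.put((func, args, kwargs))


def report_error(e, failed_job):
    """Record a failed job instead of letting it kill the worker."""
    with lock:
        errors.append((e, failed_job))


def worker():
    """Worker body: run jobs until a None sentinel arrives."""
    while True:
        job = jobs.get()
        try:
            if job is None:
                return
            func, args, kwargs = job
            func(*args, **kwargs)
        except Exception as e:
            report_error(e, job)
        finally:
            jobs.task_done()


def fetch(depth):
    """Example step that schedules a follow-up step up to depth 2."""
    with lock:
        visited.append(depth)
    if depth < 2:
        schedule(fetch, depth + 1)


def broken():
    """Example step that always fails."""
    raise ValueError("simulated failure")


pool = [threading.Thread(target=worker) for _ in range(2)]
for thread in pool:
    thread.start()

schedule(fetch, 0)
schedule(broken)
jobs.join()            # wait until every scheduled job has finished

for _ in pool:
    jobs.put(None)     # one stop sentinel per worker
for thread in pool:
    thread.join()
```

A step schedules its follow-up before its own task_done() call, so jobs.join() cannot return while work is still pending. The failing step ends up in errors rather than terminating its worker thread.
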

crawlster.start(method)

Decorator for specifying the start step.

Must decorate exactly one method of the crawler class
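
A decorator of this kind could work by tagging the method and looking the tag up later. The sketch below is one plausible mechanism, not the library's actual code: the _is_start_step marker, find_start_step() and MyCrawler are hypothetical names, and the local start function only imitates crawlster.start.

```python
def start(method):
    """Mark a method as the start step (hypothetical marker attribute)."""
    method._is_start_step = True
    return method


def find_start_step(crawler_cls):
    """Return the name of the single decorated method, or raise."""
    marked = [
        name
        for name, attr in vars(crawler_cls).items()
        if getattr(attr, "_is_start_step", False)
    ]
    if len(marked) != 1:
        raise ValueError(
            "expected exactly one start step, found %d" % len(marked)
        )
    return marked[0]


class MyCrawler:
    @start
    def first_step(self):
        return "started"

    def second_step(self):
        return "continued"


start_name = find_start_step(MyCrawler)
result = getattr(MyCrawler(), start_name)()
```

Raising when zero or several methods are marked mirrors the requirement that a single method carries the decorator.
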