Extending the crawler with helpers
The crawlster library makes it very easy to extend the functionality
of the crawler through helpers. A helper is simply a utility class that is
attached to the crawler instance.
Core helpers:
crawlster.helpers.RequestsHelper, available as http
crawlster.helpers.UrlsHelper, available as urls
crawlster.helpers.ExtractHelper, available as extract
crawlster.helpers.StatsHelper, available as stats
crawlster.helpers.LoggingHelper, available as log
crawlster.helpers.QueueHelper, available as queue
crawlster.helpers.RegexHelper, available as regex
Create your own helper
To create your own helper and enhance your crawler with super powers,
subclass the crawlster.helpers.BaseHelper base class.
Then you can start implementing the functionality you need.
Methods
No method is required to be overridden, but some methods can be overridden to act as hooks. So far the only two available hooks are:
crawlster.helpers.BaseHelper.initialize(), which performs actions on crawler start.
crawlster.helpers.BaseHelper.finalize(), which performs actions on crawler stop (when there are no more items to process).
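As a minimal sketch of the two hooks, the example below defines a hypothetical TimingHelper that records how long a crawl ran. So that it runs without crawlster installed, it uses a stand-in BaseHelper containing only the two no-op hooks described above; in a real project you would subclass crawlster.helpers.BaseHelper instead. TimingHelper, started_at and elapsed are illustrative names and not part of the library.

```python
import time


class BaseHelper:
    """Stand-in for crawlster.helpers.BaseHelper (assumption: no-op hooks)."""

    def initialize(self):
        pass

    def finalize(self):
        pass


class TimingHelper(BaseHelper):
    """Hypothetical helper that measures how long the crawl ran."""

    def __init__(self):
        super().__init__()
        self.started_at = None
        self.elapsed = None

    def initialize(self):
        # Hook for crawler start: remember the starting time.
        self.started_at = time.monotonic()

    def finalize(self):
        # Hook for crawler stop: compute the total run time in seconds.
        self.elapsed = time.monotonic() - self.started_at


# Call the hooks by hand, in the order the crawler is described as calling them.
helper = TimingHelper()
helper.initialize()
helper.finalize()
```

Doing the setup in initialize() rather than in __init__() means the clock starts when the crawl starts, not when the crawler class is defined.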
Configuration
Helpers can take advantage of the configuration system the library provides by
defining the config_options attribute, a mapping of option name to
option value.
Attributes
The two attributes that are available inside the helper are config and crawler.
The config attribute holds the Configuration instance used to
initialize the crawler. You can read values from the configuration using
the self.config.get(option_name) method.
The crawler attribute holds the current crawler instance, through which
the helper can access other helpers. Although it is recommended to keep
a helper as independent as possible, sometimes you will need to use
functionality already provided by an existing helper (stats
aggregation, logging, etc.).
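The sketch below illustrates how a helper might use both attributes: it reads an option through self.config.get(option_name) and reaches another helper through self.crawler. Everything except those two attribute names and the config_options attribute is an assumption made for illustration: the BaseHelper, DictConfig, CounterStats and FakeCrawler classes are stand-ins so the example runs without crawlster installed, and TruncateHelper, the option name truncate.max_length and the incr() method are hypothetical. The exact form of option values the library expects is not covered here.

```python
class BaseHelper:
    """Stand-in for crawlster.helpers.BaseHelper (assumption)."""

    config = None   # expected to hold the Configuration instance
    crawler = None  # expected to hold the crawler the helper is attached to


class DictConfig:
    """Stand-in configuration exposing only get(option_name)."""

    def __init__(self, values):
        self._values = dict(values)

    def get(self, option_name):
        return self._values[option_name]


class CounterStats:
    """Stand-in for a stats helper; incr() is a hypothetical method."""

    def __init__(self):
        self.counts = {}

    def incr(self, key):
        self.counts[key] = self.counts.get(key, 0) + 1


class FakeCrawler:
    """Stand-in crawler that exposes a stats helper as an attribute."""

    def __init__(self):
        self.stats = CounterStats()


class TruncateHelper(BaseHelper):
    """Hypothetical helper that shortens text to a configured length."""

    # Option name mapped to its value, as the Configuration section describes.
    config_options = {"truncate.max_length": 10}

    def truncate(self, text):
        max_length = self.config.get("truncate.max_length")
        if len(text) > max_length:
            # Reuse another helper through the crawler instance.
            self.crawler.stats.incr("truncate.shortened")
            return text[:max_length]
        return text


# Wire the attributes by hand; the crawler is expected to set them itself.
helper = TruncateHelper()
helper.config = DictConfig(TruncateHelper.config_options)
helper.crawler = FakeCrawler()

short = helper.truncate("crawling the web")
```

The helper touches the crawler in exactly one place, which keeps it mostly independent and easy to exercise with small stand-ins like the ones above.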
Attaching the helper to the crawler
In the crawler definition, provide the helper instance as a class attribute:
class MyCrawler(Crawlster):
    my_helper = MyHelperClass()
    # ...

    def some_step(self, url):
        # ...
        self.my_helper.do_amazing_things()
        # ...