crawlster - small and light web crawlers¶
A simple, lightweight web crawling framework
Features:
- HTTP crawling
- Various data extraction methods (regex, CSS selectors, XPath)
- Very configurable and extensible
What is crawlster?¶
Crawlster is a web crawling library designed for building lightweight and reusable web crawlers. It is highly extensible and provides shortcuts for the most common tasks in a web crawler, such as sending HTTP requests, parsing responses and extracting data.
It was created out of the need for a lighter web crawling framework, as an alternative to Scrapy.
Installation¶
From PyPI:
pip install crawlster
From source:
git clone https://github.com/vladcalin/crawlster.git
cd crawlster
python setup.py install
Quick example¶
This is the hello world equivalent for this library:
import crawlster
from crawlster.handlers import JsonLinesHandler


class MyCrawler(crawlster.Crawlster):
    # items will be saved to items.jsonl
    item_handler = JsonLinesHandler('items.jsonl')

    @crawlster.start
    def step_start(self, url):
        resp = self.http.get(url)
        # select elements matching the CSS expression and keep only the
        # 'href' attribute; for this example, we take just the first result
        events_uri = self.extract.css(resp.text, '#events > a', attr='href')[0]
        # specify which method should be called next
        self.schedule(self.step_events_page, self.urls.join(url, events_uri))

    def step_events_page(self, url):
        resp = self.http.get(url)
        # extract the text content of all the selected titles
        events = self.extract.css(resp.text, 'h3.event-title a', content=True)
        for event_name in events:
            # submit items to be processed by the item handler
            self.submit_item({'event': event_name})


if __name__ == '__main__':
    # define the configuration
    config = crawlster.Configuration({
        # the start pages
        'core.start_urls': ['https://www.python.org/'],
        # the method that will process the start pages
        'core.start_step': 'step_start',
        # to see in depth what happens
        'log.level': 'debug'
    })
    # start the crawler
    crawler = MyCrawler(config)
    # this will block until everything finishes
    crawler.start()
    # print some run stats, such as the number of requests and how many
    # items were submitted
    print(crawler.stats.dump())
Running the above code will fetch the event names from python.org and save them to an items.jsonl file in the current directory.
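The JsonLinesHandler writes one JSON object per line. The exact event names depend on what python.org lists at crawl time, so the items.jsonl contents below are purely illustrative:

{"event": "PyCon US"}
{"event": "EuroPython"}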
For more advanced usage, consult the documentation.
Helpers¶
A helper is a utility class that provides a specific piece of functionality. The Crawlster class requires the .log, .stats, .http and .queue helpers to be provided (and they are by default) for its internal behaviour. These are called core helpers.
Besides the core helpers, the Crawlster class also provides the .urls, .extract and .regex helpers for common tasks.
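Two of these conveniences already appear in the quick example; inside a step method they are typically combined like this (the URL and selector below are placeholders):

# join a relative URI against the current page URL, fetch it,
# then pull the text content out of the matching elements
next_url = self.urls.join(url, '/events/')
resp = self.http.get(next_url)
titles = self.extract.css(resp.text, 'h3.event-title a', content=True)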
You can also create other helpers and attach them to the crawler to enhance it.
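The full documentation covers the helper API in detail; the sketch below only illustrates the idea, and the plain-class pattern of attaching a helper as a class attribute is an assumption modelled on how item_handler is declared in the quick example, not confirmed crawlster API:

# HYPOTHETICAL sketch: a helper as a plain class attached as a class
# attribute, mirroring the item_handler declaration above
class WordCounter:
    def count_words(self, text):
        # number of whitespace-separated tokens in the text
        return len(text.split())

class MyCrawler(crawlster.Crawlster):
    words = WordCounter()

    def step_page(self, url):
        resp = self.http.get(url)
        # the helper is then available on the crawler instance
        self.submit_item({'words': self.words.count_words(resp.text)})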