Tutorial
We will build a crawler from scratch that uses all the major features of this framework. We are going to write a crawler for the python.org documentation that extracts, for every standard library module, an item containing the module name and the URL of its documentation page.
You can find the complete example in examples/python_org.py.
Firstly, import the required modules
from crawlster import Crawlster, start, Configuration
from crawlster.handlers.jsonl import JsonLinesHandler
from crawlster.handlers.log_handler import LogItemHandler
- crawlster.Crawlster is the base class that our crawler will inherit from
- crawlster.start() is a decorator that marks the first step to be executed
- crawlster.handlers.JsonLinesHandler is the item handler that writes all found items (results) to a file, one JSON document per line
- crawlster.handlers.LogItemHandler is the item handler that writes all found items to the console
Next, we need to start implementing the crawler class and configure it
class PythonOrgCrawler(Crawlster):
    """
    This is an example crawler used to crawl info about all the Python modules
    """
    item_handler = [LogItemHandler(),
                    JsonLinesHandler('items.jsonl')]
Here, we subclass the Crawlster base class and provide the item_handler attribute as a list of item handlers. When we submit an item, it will be passed to each item handler in the order they are defined.
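To make that fan-out behaviour concrete, here is a minimal standalone sketch of the "pass each item through a list of handlers" pattern. The log_handler, jsonl_handler and submit_item names below are hypothetical stand-ins, not crawlster's actual handler interface:

# Standalone illustration only; crawlster's real handlers have their own
# interface, this just shows how items fan out to every handler in order.
import json

def log_handler(item):
    # print the item to the console
    print(item)

def jsonl_handler(item, path='items.jsonl'):
    # append the item as one JSON document per line
    with open(path, 'a') as f:
        f.write(json.dumps(item) + '\n')

handlers = [log_handler, jsonl_handler]

def submit_item(item):
    # every submitted item is passed to each handler, in order
    for handler in handlers:
        handler(item)

submit_item({'name': 're', 'url': 'https://docs.python.org/3/library/re.html'})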
Next, we will start implementing the start step
@start
def step_start(self, url):
    data = self.http.get(url)
    if not data:
        return
Here we decorate the step_start method so it is used as the entry point of the crawling process, then we fetch the start url using the core http helper (see crawlster.helpers.RequestsHelper). If data is None, the request failed, so we return immediately.
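Judging by its name, the RequestsHelper presumably wraps the requests library. As a rough standalone illustration of this fetch-and-bail-on-failure step (not crawlster's actual implementation, and the fetch function below is hypothetical):

# Rough standalone equivalent of the fetch step (illustration only).
import requests

def fetch(url):
    try:
        resp = requests.get(url, timeout=10)
        resp.raise_for_status()
    except requests.RequestException:
        # on any failure, behave like the helper returning None
        return None
    return resp

data = fetch('https://docs.python.org/3/library/index.html')
if data is None:
    print('request failed, nothing to do')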
Then we parse and extract data from the response
self.urls.mark_seen(url)
hrefs = self.extract.css(data.body, 'a', attr='href')
full_links = self.urls.multi_join(url, hrefs)
Here, with the urls core helper, we mark the url as seen so that it will not be processed multiple times (the Python docs contain a lot of cross references between pages). Read more about it in crawlster.helpers.UrlsHelper.
Then we extract from the body only the content of the href attributes of all a elements. The extract helper provides the css method, which we use to select all relevant elements from the page content (the anchors) and then return only the part we need, the href attribute.
Most of the extracted values are relative paths (e.g. /path/to/some/page.html) and we need to convert them to full URLs (https://python.org/path/to/some/page.html). The urls helper has a method, multi_join, which performs a urljoin on every element of a list at once.
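To illustrate what these two steps do, here is a rough standalone sketch using only the standard library; the parsing is much simpler than the real extract.css helper, and the HrefCollector class is just an illustration:

# Standalone illustration of "extract hrefs, then join them against the base url".
# Not crawlster's implementation; just the underlying idea with the stdlib.
from html.parser import HTMLParser
from urllib.parse import urljoin

class HrefCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # keep only the href attribute of anchor elements
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value:
                    self.hrefs.append(value)

base_url = 'https://docs.python.org/3/library/index.html'
body = '<a href="re.html">re</a> <a href="struct.html">struct</a>'

collector = HrefCollector()
collector.feed(body)

# equivalent of urls.multi_join(url, hrefs): urljoin each relative path
full_links = [urljoin(base_url, href) for href in collector.hrefs]
print(full_links)
# ['https://docs.python.org/3/library/re.html',
#  'https://docs.python.org/3/library/struct.html']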
Next we need to schedule the next steps in the crawling process
for link in full_links:
    if '#' in link:
        continue
    self.schedule(self.process_page, link)
For each extracted link (skipping in-page anchors that contain a # fragment), we send it to self.process_page for processing. We do that using the crawlster.Crawlster.schedule() method because all steps are executed in parallel, inside workers that run in separate threads.
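Conceptually, schedule() hands a (step, arguments) pair to a pool of worker threads. The sketch below shows that idea with concurrent.futures; it is only an assumption-level illustration, not crawlster's internals:

# Conceptual sketch of scheduling steps onto a pool of worker threads.
from concurrent.futures import ThreadPoolExecutor

pool = ThreadPoolExecutor(max_workers=3)

def schedule(step, *args):
    # every scheduled step runs in parallel inside a worker thread
    pool.submit(step, *args)

def process_page(url):
    print('processing', url)

for link in ['https://docs.python.org/3/library/re.html',
             'https://docs.python.org/3/library/struct.html']:
    schedule(process_page, link)

pool.shutdown(wait=True)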
Next, we need a way to tell if the current page represents a module reference page.
def looks_like_module_page(self, page_content):
    return b'Source code:' in page_content
I know, kind of lame but hey… it does the job!
Next, we do the request fetching again, in the next step's method:
def process_page(self, url):
    if not self.urls.can_crawl(url):
        return
    resp = self.http.get(url)
    self.urls.mark_seen(url)
This time we check if the current page can be crawled (in other words, if it has not already been crawled). We don't want to get stuck in an infinite loop because of the numerous cross references between pages.
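The deduplication the urls helper provides boils down to keeping a thread-safe set of already-seen URLs. A minimal sketch of that idea, with a hypothetical SeenUrls class that is not the actual crawlster.helpers.UrlsHelper implementation:

# Minimal sketch of URL deduplication with a thread-safe "seen" set.
import threading

class SeenUrls:
    def __init__(self):
        self._seen = set()
        self._lock = threading.Lock()

    def can_crawl(self, url):
        # a url can be crawled only if we have not marked it as seen yet
        with self._lock:
            return url not in self._seen

    def mark_seen(self, url):
        with self._lock:
            self._seen.add(url)

urls = SeenUrls()
print(urls.can_crawl('https://docs.python.org/3/library/re.html'))  # True
urls.mark_seen('https://docs.python.org/3/library/re.html')
print(urls.can_crawl('https://docs.python.org/3/library/re.html'))  # False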
Then we check whether the page is a module reference page, using the method defined earlier, and extract the module name:
if not self.looks_like_module_page(resp.body):
    return
module_name = self.extract.css(resp.body,
                               'h1 a.reference.internal code span',
                               content=True)
if not module_name:
    return
self.submit_item({'name': module_name[0], 'url': url})
Here we extract only the text content of the matched elements. In some cases that element does not exist, so we skip the page as it is not a valid module page (e.g. https://docs.python.org/3/library/idle.html). When we find a module name, we submit it and send it through the item handlers with the crawlster.Crawlster.submit_item() method.
The crawler class is done!
All that is left to do is starting it:
if __name__ == '__main__':
    crawler = PythonOrgCrawler(Configuration({
        "core.start_urls": [
            "https://docs.python.org/3/library/index.html"
        ],
        "log.level": "debug",
        "pool.workers": 3,
    }))
    crawler.start()
    pprint.pprint(crawler.stats.dump())
There, we initialize the PythonOrgCrawler with a configuration.

- core.start_urls is a list of starting urls. The start step will be called once for each item in this list.
- log.level sets the logging level so that we'll see more output in the console when we run the crawler.
- pool.workers sets the worker thread pool's size. For this example, a concurrency level of 3 is more than enough.
By calling the crawlster.Crawlster.start() method we start the crawling process, and after that we print some nice stats about what happened.
Now go to a terminal and, assuming you wrote the crawler in a python_org.py file, run:
python python_org.py
Everything should work and, after approximately 15 seconds, the crawler should finish. The stats are printed to the console:
{'http.download': 16373171,
'http.requests': 300,
'http.upload': 0,
'items': 185,
'time.duration': 14.879282,
'time.finish': datetime.datetime(2018, 1, 1, 17, 39, 5, 917190),
'time.start': datetime.datetime(2018, 1, 1, 17, 38, 51, 37908)}
and the results should be in the items.jsonl file in the current directory:
(crawlster) vladcalin@mylaptop ~/crawlster $ tail items.jsonl
{"name": "calendar", "url": "https://docs.python.org/3/library/calendar.html"}
{"name": "struct", "url": "https://docs.python.org/3/library/struct.html"}
{"name": "stringprep", "url": "https://docs.python.org/3/library/stringprep.html"}
{"name": "textwrap", "url": "https://docs.python.org/3/library/textwrap.html"}
{"name": "difflib", "url": "https://docs.python.org/3/library/difflib.html"}
{"name": "collections", "url": "https://docs.python.org/3/library/collections.html"}
{"name": "codecs", "url": "https://docs.python.org/3/library/codecs.html"}
{"name": "string", "url": "https://docs.python.org/3/library/string.html"}
{"name": "re", "url": "https://docs.python.org/3/library/re.html"}
{"name": "datetime", "url": "https://docs.python.org/3/library/datetime.html"}
That’s all! Have fun crawling (in a responsible manner)!
Here’s the whole crawler code after putting everything together:
import pprint

from crawlster import Crawlster, start, Configuration
from crawlster.handlers.jsonl import JsonLinesHandler
from crawlster.handlers.log_handler import LogItemHandler


class PythonOrgCrawler(Crawlster):
    """
    This is an example crawler used to crawl info about all the Python modules
    """
    item_handler = [LogItemHandler(),
                    JsonLinesHandler('items.jsonl')]

    @start
    def step_start(self, url):
        data = self.http.get(url)
        if not data:
            return
        self.urls.mark_seen(url)
        hrefs = self.extract.css(data.body, 'a', attr='href')
        full_links = self.urls.multi_join(url, hrefs)
        for link in full_links:
            if '#' in link:
                continue
            self.schedule(self.process_page, link)

    def process_page(self, url):
        if not self.urls.can_crawl(url):
            return
        resp = self.http.get(url)
        self.urls.mark_seen(url)
        if not self.looks_like_module_page(resp.body):
            return
        module_name = self.extract.css(resp.body,
                                       'h1 a.reference.internal code span',
                                       content=True)
        if not module_name:
            return
        self.submit_item({'name': module_name[0], 'url': url})

    def looks_like_module_page(self, page_content):
        return b'Source code:' in page_content


if __name__ == '__main__':
    crawler = PythonOrgCrawler(Configuration({
        "core.start_urls": [
            "https://docs.python.org/3/library/index.html"
        ],
        "log.level": "debug",
        "pool.workers": 3,
    }))
    crawler.start()
    pprint.pprint(crawler.stats.dump())