In the previous post, I wrote about the things I learned from Scrapy's architecture. In this one, I'll try to explain how to run Scrapy in distributed mode, or even better, how to write your own spider code to support it.
I said earlier that Scrapy has 4 parts that make things work. But why 4 parts, you may wonder?
Let's forget about Scrapy for a moment and use the worker + consumer pattern to get a deeper understanding of how a spider works.
Spiders are the workers. They crawl pages, producing more and more links to crawl, which we call tasks. Each spider crawls a page, stores the new links it finds in one place, stores the target content in another place, and goes on crawling.
Then who is the consumer, and what do these “places” mean?
As the consumer, you can use anything that can pull content from the “places” and do stuff with it, like the Item Pipeline part of Scrapy, or your own code.
What are the “places”? They can be a MySQL table, a FIFO queue backed by Redis, or anything else with locking support. The idea is to expose all our content to both workers and consumers, so they can share the same data. With locking support, we can ensure that no race condition happens here.
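To make this concrete, here is a minimal sketch of the “places” idea, using two Redis lists as FIFO queues via the redis-py client. The queue names (links_queue, items_queue) are just placeholders I picked for illustration; Redis list operations like LPUSH and BRPOP are atomic, so multiple workers and consumers can share these queues without extra locking.

```python
# A minimal sketch of the "places" idea: two Redis lists used as FIFO queues.
# The queue names are placeholders; LPUSH/BRPOP are atomic, so workers and
# consumers can share these queues without extra locking.
import json

import redis

r = redis.Redis(host="localhost", port=6379, db=0)

def push_link(url):
    # Worker side: store a newly discovered link as a task.
    r.lpush("links_queue", url)

def pop_link(timeout=5):
    # Worker side: take the next link to crawl (blocks up to `timeout` seconds).
    result = r.brpop("links_queue", timeout=timeout)
    return result[1].decode() if result else None

def push_item(item):
    # Worker side: store the target content for consumers.
    r.lpush("items_queue", json.dumps(item))

def pop_item(timeout=5):
    # Consumer side: take one crawled item to process.
    result = r.brpop("items_queue", timeout=timeout)
    return json.loads(result[1]) if result else None
```

A worker keeps calling pop_link, crawls the page, and calls push_link for each new link and push_item for the scraped content; a consumer keeps calling pop_item and does whatever it needs with the result.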
So, are there any libraries that can help?
If you are using the Scrapy framework, scrapy-redis would be a nice one to look at. I use rediscrapy in my project.
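As a rough sketch of what this looks like with scrapy-redis: you point the scheduler and duplicate filter at a shared Redis instance in settings.py, and have your spider read its start URLs from a Redis key. The class paths below follow the scrapy-redis documentation; the Redis URL, spider name, and key name are placeholders for your own project.

```python
# settings.py -- share the request queue and duplicate filter via Redis,
# using the settings documented by scrapy-redis (REDIS_URL is a placeholder).
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
SCHEDULER_PERSIST = True               # keep the queue between runs
REDIS_URL = "redis://localhost:6379"

# myspider.py -- a spider that reads its start URLs from a Redis list,
# so any number of worker processes can pull tasks from the same place.
from scrapy_redis.spiders import RedisSpider

class MySpider(RedisSpider):
    name = "my_spider"                   # placeholder name
    redis_key = "my_spider:start_urls"   # LPUSH URLs into this key to feed the crawl

    def parse(self, response):
        # Yield scraped items and/or follow-up requests as usual.
        yield {"url": response.url, "title": response.css("title::text").get()}
```

With this setup, you can start the same spider on several machines; they all pull requests from, and push new requests into, the shared Redis queue.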
If not, try to understand the worker + consumer pattern and write your own.
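For the do-it-yourself route, the consumer side can be as small as a loop that keeps popping items from the shared queue and processing them. This is only a sketch under the same Redis-queue assumption as above; “items_queue” and the print() call are placeholders for your own storage or processing step.

```python
# A hand-rolled consumer: keep popping crawled items from a shared Redis list
# and process them one by one. "items_queue" and the print() step are
# placeholders for your own storage or processing logic.
import json

import redis

r = redis.Redis(host="localhost", port=6379, db=0)

def consume_forever():
    while True:
        # BRPOP blocks until an item arrives, so an idle consumer doesn't spin.
        _, raw = r.brpop("items_queue")
        item = json.loads(raw)
        print("processing", item)  # replace with a DB insert, file write, etc.

if __name__ == "__main__":
    consume_forever()
```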
And finally, here are the articles I read on the topic: