In the previous post, I wrote about what I learned from Scrapy's architecture. In this one, I'll explain how to run Scrapy in distributed mode, or, even better, how to write your own spider code to support it.
I mentioned before that Scrapy has four parts that work together. But why four parts, you may wonder?
Let's forget about Scrapy for a moment and use the worker + consumer pattern to get a deeper understanding of how a spider works.

Spiders are the workers. They crawl pages and keep producing more links to crawl, which we call tasks. Each spider crawls a page, extracts new links and stores them in one place, stores the target content in another place, and then goes on crawling.
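The worker loop above can be sketched in a few lines. This is a minimal, in-process illustration, not real Scrapy code: `task_queue`, `item_queue`, and `fake_fetch` are hypothetical names, and the "fetch" step is simulated instead of making HTTP requests.

```python
import queue

# Hypothetical names for illustration: task_queue holds URLs to crawl
# (the tasks), item_queue is the "place" where scraped content goes.
task_queue: "queue.Queue[str]" = queue.Queue()
item_queue: "queue.Queue[dict]" = queue.Queue()

def fake_fetch(url: str) -> tuple[dict, list[str]]:
    """Stand-in for a real HTTP fetch + parse step."""
    item = {"url": url, "title": f"Title of {url}"}
    # Pretend each page links to one deeper page, two levels at most.
    depth = url.count("/page")
    new_links = [f"{url}/page{depth + 1}"] if depth < 2 else []
    return item, new_links

def worker() -> None:
    """Crawl until the shared task queue is drained."""
    while True:
        try:
            url = task_queue.get_nowait()
        except queue.Empty:
            return
        item, links = fake_fetch(url)
        item_queue.put(item)       # store target content in one place
        for link in links:
            task_queue.put(link)   # store new links in another place
        task_queue.task_done()

task_queue.put("http://example.com")
worker()
print(item_queue.qsize())  # 3 pages crawled
```

Because the worker both consumes tasks and produces new ones, any number of such workers can run against the same shared queue, which is exactly what makes the pattern distributable.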
Then who is the consumer, and what do the "places" mean? The consumer can be anything that takes content from the "places" and does something with it, like the Item Pipeline part of Scrapy, or your own code written to consume it.
What are the "places"? They can be a MySQL table, a FIFO queue built on Redis, or anything else with locking support. The idea is to expose all our content to the workers and consumers so they can share the same state. With locking, we can ensure that no race conditions happen.
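Here is a minimal sketch of why locking matters, using Python's stdlib as an in-process stand-in for a shared "place": `queue.Queue` is internally locked, so several consumers can pull from it concurrently without losing or duplicating work, just as a Redis list or a locked MySQL table would behave across processes. All names here are made up for the example.

```python
import queue
import threading

tasks: "queue.Queue[int]" = queue.Queue()
results: list[int] = []
results_lock = threading.Lock()  # protects the shared result store

def consumer() -> None:
    while True:
        n = tasks.get()
        if n is None:            # sentinel value: no more work
            tasks.task_done()
            return
        with results_lock:       # lock support prevents races on `results`
            results.append(n * n)
        tasks.task_done()

threads = [threading.Thread(target=consumer) for _ in range(4)]
for t in threads:
    t.start()
for n in range(100):
    tasks.put(n)
for _ in threads:
    tasks.put(None)              # one sentinel per consumer
tasks.join()
for t in threads:
    t.join()
print(len(results))  # 100 -- nothing lost, nothing duplicated
```

Swap the in-memory queue for a Redis list (with atomic `LPUSH`/`BRPOP`) and the workers and consumers can live on different machines; the pattern stays identical.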
So, are there any libraries that can help? (scrapy-redis is a popular one; it moves Scrapy's request queue into Redis.) If none fits your case, try to understand the worker + consumer pattern and write your own.
And finally, here are the articles I read: