how to run Scrapy in distributed mode

In the previous post, I wrote about what I learned from the Scrapy architecture.

In this one, I try to explain how to run Scrapy in distributed mode, or even better, how to write your own spider code to support that.

I once said that Scrapy has 4 parts that work together. But why 4 parts, you may wonder?

Let’s forget about Scrapy for a moment and use the worker + consumer pattern to get a deeper understanding of how a spider works.

Spiders are the workers. They crawl pages and keep producing more links to crawl, which we call tasks, right? Each spider just fetches a page, stores the new links it finds in one place, stores the target content in another place, and goes on crawling.

Then who is the consumer, and what are these “places”?

Well, as the consumer you can use anything that reads content from the “places” and processes it, like the Item Pipeline part of Scrapy, or you can write your own consuming code.

What are the “places”? They can be a MySQL table, a FIFO queue based on Redis, or anything else that supports locking or atomic operations. The idea is to expose all our content to the workers and consumers so they can share the same data. With lock support, we can ensure that no race conditions happen here.
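
As a rough illustration, here is a minimal sketch of such a shared “place” built on a Redis list, assuming a local Redis server and the redis-py client; the key names are just examples:

```python
import json
import redis

# Assumption: a Redis server is running locally on the default port.
r = redis.Redis(host="localhost", port=6379, db=0)

QUEUE_KEY = "crawl:queue"      # pending links (tasks) for the workers
CONTENT_KEY = "crawl:content"  # scraped content waiting for the consumers

def push_task(url):
    # LPUSH is atomic, so many spiders can add tasks without stepping on each other.
    r.lpush(QUEUE_KEY, url)

def pop_task(timeout=5):
    # BRPOP atomically pops one task, so two workers never grab the same link.
    item = r.brpop(QUEUE_KEY, timeout=timeout)
    return item[1].decode() if item else None

def push_content(data):
    # Store the crawled content in a second list for the consumer side.
    r.lpush(CONTENT_KEY, json.dumps(data))
```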

So, are there any libraries that can help?

If you are using the Scrapy framework, scrapy-redis is a nice one to look at; I use it in my project.
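
For reference, here is a minimal sketch of what wiring scrapy-redis in looks like. The spider name, redis key, and URL below are my own placeholders, not from my actual setup:

```python
# settings.py -- point Scrapy's scheduler and dupefilter at Redis,
# so every worker process shares the same request queue.
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
REDIS_URL = "redis://localhost:6379"

# spiders/my_spider.py -- a spider that reads its start URLs from Redis.
from scrapy_redis.spiders import RedisSpider

class MySpider(RedisSpider):
    name = "my_spider"
    redis_key = "my_spider:start_urls"  # push index URLs into this Redis list

    def parse(self, response):
        # Every yielded request goes back through the shared Redis scheduler,
        # so any worker in the cluster can pick it up.
        for href in response.css("a::attr(href)").getall():
            yield response.follow(href, callback=self.parse)
```

Roughly speaking, every machine then runs the same `scrapy crawl my_spider` command, and you push the first index URL into the `my_spider:start_urls` list to kick the whole cluster off.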

If not, try to understand the worker + consumer pattern and write your own version, as sketched below.
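
If you go that way, the shape of the pattern is basically two loops around two shared queues. Here is a single-process sketch using Python's built-in queue module; the fake fetch step and the print statement are placeholders for real downloading and real storage:

```python
import queue
import threading

tasks = queue.Queue()    # links to crawl (the workers' "place")
results = queue.Queue()  # scraped content (the consumers' "place")

def fetch(url):
    # Placeholder download step; a real worker would issue an HTTP request here.
    return f"<html>fake page for {url}</html>"

def worker():
    while True:
        url = tasks.get()                        # blocks until a task is available
        page = fetch(url)
        # A real worker would also extract new links and put them back:
        # tasks.put(new_link)
        results.put({"url": url, "body": page})  # hand content to the consumer
        tasks.task_done()

def consumer():
    while True:
        item = results.get()                     # blocks until content is available
        print("storing", item["url"])            # a real consumer would write to a DB
        results.task_done()

threading.Thread(target=worker, daemon=True).start()
threading.Thread(target=consumer, daemon=True).start()

tasks.put("http://example.com")                  # seed the queue with an index URL
tasks.join()
results.join()
```

In a real distributed setup, you would swap the in-process queues for something shared like the Redis lists above, so the workers and consumers can live on different machines.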

And finally, here are the articles I once read:

things I learned from Scrapy

Earlier this month, I used Scrapy to write some spiders. And I have to admit that I spent almost 2-3 hours reading the docs again to get familiar with how it works.

But I had written the same kind of code only one year ago.

It’s just amazing how easily we forget what we have done. So I think writing down its architecture would be a great thing for me, or for a better me. =)

So, how does a spider work?

Let’s take a look at the Scrapy one.

Scrapy has 4 parts plus a core engine connecting them.

In general, when you crawl a website, you need an index URL to start from.

First, the core Scrapy Engine starts; it is built on the Twisted framework. It connects the 4 components and transfers messages between them in specific directions.

Second, we send the index URL to the Downloader, which downloads the page.

Then the Spiders come in and grab all the things you want, like links, images, content and so on. If a spider finds new links, it sends them to the Scheduler, which queues them up as the next crawling tasks.
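
To make that concrete, here is a minimal spider sketch in the style of the official Scrapy tutorial; the site and the CSS selectors come from that tutorial, not from anything specific to my project:

```python
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    # The index URL everything starts from.
    start_urls = ["http://quotes.toscrape.com/"]

    def parse(self, response):
        # Grab the content we care about and hand it to the Item Pipeline.
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
            }
        # New links go back to the Scheduler as the next crawling tasks.
        next_page = response.css("li.next a::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)
```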

Besides sending the new links to the Scheduler, the Spiders also send the scraped content to the Item Pipeline. In the Item Pipeline, the raw content is cleaned up and stored in a database like MySQL.
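
A pipeline is just a class with a process_item method. Here is a rough sketch that writes items to a JSON lines file instead of MySQL, only to keep the example self-contained; the class name and file name are mine:

```python
import json

class JsonLinesPipeline:
    """A tiny Item Pipeline: take each item from the spider and append it to a file."""

    def open_spider(self, spider):
        self.file = open("items.jl", "w", encoding="utf-8")

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        # This is where you would normalize fields or insert into MySQL instead.
        self.file.write(json.dumps(dict(item)) + "\n")
        return item
```

You then enable it in settings.py, for example ITEM_PIPELINES = {"myproject.pipelines.JsonLinesPipeline": 300}, where the module path is whatever your project uses.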

So, what does Scrapy teach me here?

One, split your program into parts that each do only one thing.

Second, use middleware to make your program more flexible (see the sketch after the last point).

Third, an architecture diagram says a lot more than words.
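
On the middleware point, here is a small downloader middleware sketch, assuming all you want is to tag every outgoing request with a custom User-Agent; the class name and header value are examples:

```python
class CustomUserAgentMiddleware:
    # A downloader middleware sits between the Engine and the Downloader.
    def process_request(self, request, spider):
        request.headers["User-Agent"] = "my-crawler/0.1"
        return None  # returning None lets the request continue as usual
```

It gets switched on through the DOWNLOADER_MIDDLEWARES setting, with a number that decides its position in the chain.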