spider related

How to deploy

  • scrapyd + supervisord + crontab + redis
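A rough sketch of how these pieces could fit together: supervisord keeps the scrapyd daemon running, and a crontab entry kicks off crawls by calling scrapyd's schedule.json API. Below is a minimal Python script (using requests; the project and spider names are placeholders) that such a cron job could run.

```python
# Minimal sketch: ask a running scrapyd instance to schedule one crawl.
# "myproject" and "myspider" are placeholder names for illustration.
import requests

SCRAPYD_URL = "http://localhost:6800/schedule.json"

def schedule_crawl(project, spider):
    # scrapyd exposes a small JSON API; schedule.json queues one spider run
    resp = requests.post(SCRAPYD_URL, data={"project": project, "spider": spider})
    resp.raise_for_status()
    return resp.json()["jobid"]

if __name__ == "__main__":
    print(schedule_crawl("myproject", "myspider"))
```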

Some libraries you can use

Distributed crawling

Reference blogs

Getting started

Putting it together

Industry materials

Industry demand

  • [lagou]

Other

For example, how to avoid getting banned

Here are some tips to keep in mind when dealing with these kinds of sites:
- rotate your user agent from a pool of well-known ones from browsers (google around to get a list of them)
- disable cookies (see COOKIES_ENABLED) as some sites may use cookies to spot bot behaviour
- use download delays (2 or higher). See the DOWNLOAD_DELAY setting.
- if possible, use Google cache to fetch pages, instead of hitting the sites directly
- use a pool of rotating IPs. For example, the free Tor project or paid services like ProxyMesh
- use a highly distributed downloader that circumvents bans internally, so you can just focus on parsing clean pages. One example of such downloaders is Crawlera
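A minimal sketch of what the first few tips look like in a Scrapy project, assuming a custom downloader middleware (the module name and the User-Agent list are placeholders):

```python
# settings.py (fragment): slow down and stop sending cookies
DOWNLOAD_DELAY = 2          # at least 2 seconds between requests
COOKIES_ENABLED = False     # some sites use cookies to spot bot behaviour
DOWNLOADER_MIDDLEWARES = {
    # "myproject" is a placeholder module name
    "myproject.middlewares.RotateUserAgentMiddleware": 400,
}

# middlewares.py (fragment): rotate the User-Agent header per request
import random

USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)",
    "Mozilla/5.0 (X11; Linux x86_64)",
]

class RotateUserAgentMiddleware:
    def process_request(self, request, spider):
        # pick a well-known browser User-Agent for every outgoing request
        request.headers["User-Agent"] = random.choice(USER_AGENTS)
```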

reading notes on the totoro library for python

Background

Back at ziz, the CTO's technology choices were poor; he lacked the necessary understanding of the technologies in use and picked old, heavyweight middleware. Since the core developer of tcelery no longer maintains the code, we had to choose a more suitable technical solution.

Reading

The original tcelery has a few problems:

- With the Redis backend there is a race condition; see the corresponding issue in the Git repository for details.
- Timeout handling is broken: if a worker runs a task longer than the allowed time, the client connection hangs.
- The code is getting old, and the author mher no longer maintains it or reviews code.

Later I got to know someone who had submitted a pull request. Because the author was no longer maintaining the project or accepting changes, he built a new tool, totoro. The current project is developed on top of it.

PS: there is a dedicated question on SegmentFault discussing this bug; you can also look there for the solution (actually I asked and answered it myself).

how to run scrapy in distributed mode

In the previous post, I wrote down the things I learned from the Scrapy architecture.

In this one, I try to explain how to run Scrapy in distributed mode, or even better, how to write your own spider code to support that.

So I once said that Scrapy has 4 parts to get things to work. But why 4 parts, you may wonder?

Let’s just forget about Scrapy, and use the worker + consumer pattern to get a deeper understanding of how a spider works.

Spiders are the workers. They crawl pages, producing more and more links to crawl, which we call tasks, right? All the spiders do is crawl a page, put the new links in one place, store the target content in another place, and go on crawling.

Then who is the consumer, and what do the “places” mean?

Well, as the consumer, you can choose anything that can pull content from the “places” and do stuff with it, like the Item Pipeline part of Scrapy. Or write your own code to consume.

What are the “places”? They can be a MySQL table, a FIFO queue based on Redis, or anything else that has lock support. The idea is to expose all our content to workers and consumers, so they can share the same content. With lock support, we can ensure that no race condition happens here.
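Here is a minimal sketch of that pattern with a Redis list as the shared “place”, assuming the redis-py client (the key names and helper functions are made up for illustration): the worker pushes new links and scraped content, the consumer pops content and stores it.

```python
# Minimal worker/consumer sketch around a shared Redis instance.
# Key names ("links", "items") are illustrative, not from the post.
import json
import redis

r = redis.Redis()

def worker(crawl_page):
    """Pop a link, crawl it, push the new links and the scraped content."""
    while True:
        _, url = r.blpop("links")               # blocks until a link is available
        new_links, content = crawl_page(url.decode())
        for link in new_links:
            r.rpush("links", link)              # more tasks for other workers
        r.rpush("items", json.dumps(content))   # target content for consumers

def consumer(store_item):
    """Pop scraped content and hand it to whatever does the storing."""
    while True:
        _, raw = r.blpop("items")
        store_item(json.loads(raw))
```

Redis list operations like BLPOP and RPUSH are atomic, which plays the role of the “lock support” mentioned above: two workers can never pop the same link.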

So, are there any libraries that can help?

If you are using the Scrapy framework, scrapy-redis would be a nice one to watch. I use scrapy-redis in my project.
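If you take the scrapy-redis route, the integration is mostly a matter of a few settings; a rough sketch (the Redis URL is just a local default, adjust to taste):

```python
# settings.py (fragment): let scrapy-redis share the queue and the dupe filter
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
SCHEDULER_PERSIST = True              # keep the request queue between runs
REDIS_URL = "redis://localhost:6379"  # the shared "place" for all spider nodes
```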

If not, try to understand what the worker + consumer pattern is, and write your own.

And finally, here are the articles I once read:

things I learn from scrapy

Early this month, I used Scrapy to write spiders. And I won’t deny that I spent almost 2-3 hours reading the docs again, just to get familiar with how it works.

But I had written the same kind of code just one year ago.

It’s just amazing how often people forget what they have done. And I think writing down its architecture would be a great thing for me, or for a better me. =)

So, how does a spider work?

Let’s take a look at the Scrapy one.

There are 4 parts in Scrapy, plus a core engine connecting them.

In general, when you crawl a website, an index URL is needed.

First, start the core Scrapy Engine, which is built on the Twisted framework. It connects the 4 components and transfers messages in specific directions.

Second, we send the index URL to the Downloader, which downloads the page.

Then, the Spiders come in to grab all the things you want, like links, images, content and so on. If a spider gets new links, it just sends them to the Scheduler, preparing them as the next crawling tasks.

Besides sending the new links to the Scheduler, the Spiders also send stuff like the content to the Item Pipeline. In the Item Pipeline, raw content is cleaned up and stored in a database like MySQL.
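To make the flow concrete, here is a minimal sketch of a spider plus an item pipeline, assuming a reasonably recent Scrapy version (the site URL, selectors and class names are placeholders for illustration):

```python
# spider (fragment): yields content to the Item Pipeline and new links
# back through the Engine to the Scheduler.
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ["http://quotes.toscrape.com/"]

    def parse(self, response):
        for quote in response.css("div.quote"):
            # scraped content goes to the Item Pipeline
            yield {"text": quote.css("span.text::text").get()}
        # a new link goes back to the Scheduler as the next crawling task
        next_page = response.css("li.next a::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)

# pipelines.py (fragment): where raw content gets cleaned and stored
class StorePipeline:
    def process_item(self, item, spider):
        item["text"] = item["text"].strip()   # tidy up the raw content
        # ... store it in MySQL or another db here ...
        return item
```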

So, what does Scrapy teach me here?

First, try to split the work into parts that each do only one thing.

Second, use middleware to extend the flexibility of your program.

Third, an architecture diagram says a lot more than words.

StackContext in Tornado

Background

Recently I have been reading the Tornado source code, and I think I roughly understand its skeleton now. For the overall structure, a quick Google search will turn up plenty of analyses; what I mainly want to talk about is the role of the StackContext class in the tornado/stack_context.py file.

What is StackContext

Taken literally, StackContext translates to "stack context". Breaking it apart: it uses a "stack" data structure to save "context".

Why the StackContext concept was introduced

On the Tornado mailing list, the author of the motor library made this point. Roughly: when you hand a function over to an asynchronous call (I am digging a hole here to talk about Futures in a later post), if that callback itself raises, the exception information that comes out is wrong. To patch this "bug", the StackContext concept was introduced.

Where StackContext is used

As for how to use StackContext, I first read the brief introduction on the official site and did not really understand it.

Later I found the stack_context_test.py file in Tornado's test directory, which contains a whole set of test cases to read alongside it. Looking at the author's code there, you can clearly see what it does.

Its main purpose is what was mentioned above: saving the context information in which a function was set up.
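A minimal sketch of how that looks in practice, assuming an older Tornado release that still ships tornado.stack_context (it was deprecated in 5.x and removed in 6.0; the function names below are placeholders): the callback is registered inside the context, so when it later raises inside the IOLoop, the exception is routed back to our handler instead of getting lost.

```python
# Minimal sketch for Tornado versions that still ship stack_context.
from tornado.ioloop import IOLoop
from tornado.stack_context import ExceptionStackContext

def handle_exception(typ, value, tb):
    # the context saved at registration time routes the error back here
    print("caught:", value)
    IOLoop.current().stop()
    return True                      # True means "handled, do not re-raise"

def broken_callback():
    raise ValueError("boom from inside the IOLoop")

with ExceptionStackContext(handle_exception):
    # the callback gets wrapped with the current stack context here
    IOLoop.current().add_callback(broken_callback)

IOLoop.current().start()
```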

More references

PS: while reading the code I googled a lot of other material, which I won't list here. The point is that you still have to read the code yourself, and actively look up the things you don't understand. And don't be afraid of spending time on it; with a bit of effort you can figure it out. I am definitely not going to tell you that, in order to understand what the StackContext class does, I basically went through the whole of Tornado, which took forever... XD

How to write a dht crawler in python

I recently wrote a DHT crawler in Python.

What is a DHT? Well, you can think of it as a protocol, the same way we treat BT.
And I am not going to tell you that I couldn’t understand it until I read the Python source code that implements it.
LOL… It is the truth: the paper is sometimes just hard to read.

I call it “bt-share”, meaning that it provides a seed (torrent) search service for people.

It is composed of three parts: the web interface, the DHT protocol crawler, and the BT downloader.

The basic structure is shown below:

It is inspired by a dht-crawler written in Erlang. I drew a picture of its structure.
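To give a feel for what the DHT protocol crawler part does at the lowest level, here is a tiny illustrative sketch (not code from bt-share): it sends a bencoded KRPC find_node query over UDP to a public bootstrap node and reads back the raw response. The bencode helper, the random node ids, and the bootstrap address are all assumptions made for the example.

```python
# Illustrative sketch: one KRPC find_node query, the basic move a DHT
# crawler repeats to walk the routing tables of other nodes.
import os
import socket

BOOTSTRAP_NODE = ("router.bittorrent.com", 6881)   # a common public router

def bencode(value):
    """Encode ints, bytes and dicts the way the KRPC protocol expects."""
    if isinstance(value, int):
        return b"i%de" % value
    if isinstance(value, bytes):
        return b"%d:%s" % (len(value), value)
    if isinstance(value, dict):
        items = b"".join(bencode(k) + bencode(v) for k, v in sorted(value.items()))
        return b"d%se" % items
    raise TypeError("unsupported type: %r" % type(value))

def find_node_query(node_id, target_id):
    """Build a find_node query asking the peer for nodes close to target_id."""
    return bencode({
        b"t": b"aa",                       # transaction id
        b"y": b"q",                        # message type: query
        b"q": b"find_node",                # query name
        b"a": {b"id": node_id, b"target": target_id},
    })

if __name__ == "__main__":
    my_id = os.urandom(20)                 # random 160-bit node id
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.settimeout(5)
    sock.sendto(find_node_query(my_id, os.urandom(20)), BOOTSTRAP_NODE)
    data, addr = sock.recvfrom(65536)      # raw bencoded reply with "nodes"
    print("got %d bytes from %s" % (len(data), addr))
```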

The big django

I always have this urge to work through an entire framework before I feel happy, but I find it wears me out and I still don't absorb it well...
Especially when I ran into the magical Django: her completeness, thorough documentation, convenient development and active community kept me happily hacking away for several months... OK, actually it was what the project needed...

But as some people say, when a framework's documentation is so complete it makes your head spin, doesn't that also mean the framework is hard to change?
Well, a bit. Django gives me the feeling that the developer is there to write configuration files, detached from development in the traditional sense. That feeling
is just like writing Rails: everything has been arranged for you, and you just fill in the blanks following the formula.

Rant over; actually I am here to take notes...


webpy-cherryServer-analytics

Introduction

So you want to know how web.py works?

In this article, I will walk through web.py’s simple server, which is adapted from CherryPy.

Things you need to know

what is a web server?

In my words, a web server is simply a request/response manager.
It listens for incoming requests from clients, and dispatches each request to the appropriate application based on its rules.
Then it gets the response back from that application and sends it back to the client.
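A minimal web.py application makes that dispatch idea concrete (a sketch with a made-up URL mapping): the server matches the request path against the `urls` tuple and hands the request to the matching handler class.

```python
# Minimal web.py app: the server dispatches "/" to the index class.
import web

urls = ("/", "index")          # dispatch rules: path pattern -> handler class

class index:
    def GET(self):
        # the response the server sends back to the client
        return "Hello, world!"

if __name__ == "__main__":
    app = web.application(urls, globals())
    app.run()                  # starts the built-in (CherryPy-derived) server
```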

There are a couple of things we need to dig deeper into.

  • What is a request?

When a user clicks a link or posts a form, a request is created. How?


my first udacity course

so it’s my second article written in english~ great!!!

I recently took an online course on Udacity, using Python to write a search engine~

not just awesome, it also shows me how a monster like Google gets built. and I love Python!

it shows that a coding course can be taught this way, and that online education interaction can play out
like this…

so, here is the link, and I wonder how many people will do the final homework. I will find time to put the little search engine code on GAE, and make a GUI for it~

if you come from China, why not go to V2EX to share your learning experience with geeks….

find the course on Udacity, and keep learning.