How to deploy
- scrapyd + supervisord + crontab + redis
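One way the pieces above could fit together: supervisord keeps scrapyd running, and a crontab entry runs a small script that asks scrapyd to start a spider. The sketch below builds the POST request for scrapyd's `schedule.json` endpoint. The host and port are scrapyd's defaults, and the project and spider names passed in are hypothetical placeholders, not names from these notes.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Assumption: scrapyd is running locally on its default port.
SCRAPYD_URL = "http://localhost:6800"


def build_schedule_request(project, spider, base_url=SCRAPYD_URL):
    """Build the POST request for scrapyd's schedule.json endpoint."""
    data = urlencode({"project": project, "spider": spider}).encode("utf-8")
    return Request(base_url + "/schedule.json", data=data, method="POST")


def schedule(project, spider):
    """Send the request; this needs a running scrapyd instance."""
    with urlopen(build_schedule_request(project, spider), timeout=10) as resp:
        return resp.read().decode("utf-8")
```

A crontab line could then call a script that runs `schedule("myproject", "myspider")` at the desired interval; both names are placeholders.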
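Some useful libraries
- The weibo login flow is worth studying
- This crawler's architecture is worth studying: it uses a queue-based master and jobber pattern to distribute the work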
- A seriously impressive crawler framework built on tornado; damn, this one deserves careful study
- Use this library for the deduplication logic
- Use this one for deployment and control
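To make the queue-based master and jobber idea and the deduplication step concrete, here is a minimal in-process sketch. It is not the code of any library listed above: `Master`, `run_jobber`, and `fetch_links` are hypothetical names, and the standard-library `Queue` and `set` stand in for what would be a shared Redis list and set in a real distributed setup.

```python
import hashlib
from queue import Queue, Empty


class Master:
    """Owns the work queue and the fingerprints of URLs already seen."""

    def __init__(self):
        self.queue = Queue()  # stand-in for a shared Redis list
        self.seen = set()     # stand-in for a shared Redis set

    @staticmethod
    def fingerprint(url):
        # Simplification: hash the stripped URL with no further normalization.
        return hashlib.sha1(url.strip().encode("utf-8")).hexdigest()

    def submit(self, url):
        """Enqueue url unless it was seen before; return True if enqueued."""
        fp = self.fingerprint(url)
        if fp in self.seen:
            return False
        self.seen.add(fp)
        self.queue.put(url)
        return True


def run_jobber(master, fetch_links):
    """Pull URLs until the queue is empty, feeding discovered links back."""
    processed = []
    while True:
        try:
            url = master.queue.get_nowait()
        except Empty:
            return processed
        processed.append(url)
        # fetch_links is a hypothetical callable: url -> iterable of links.
        for link in fetch_links(url):
            master.submit(link)
```

In a distributed deployment several jobbers would run on different machines against the same shared queue and seen-set, which is why the deduplication check lives with the master rather than inside each jobber.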
Distributed crawling
Reference blogs
Getting started
Integration
Industry material
Industry demand
- [lagou]
Other
For example, how to avoid getting banned
Here are some tips to keep in mind when dealing with these kinds of sites:
- rotate your user agent from a pool of well-known ones from browsers (google around to get a list of them)
- disable cookies (see COOKIES_ENABLED) as some sites may use cookies to spot bot behaviour
- use download delays (2 or higher). See DOWNLOAD_DELAY setting.
- if possible, use Google cache to fetch pages, instead of hitting the sites directly
- use a pool of rotating IPs. For example, the free Tor project or paid services like ProxyMesh
- use a highly distributed downloader that circumvents bans internally, so you can just focus on parsing clean pages. One example of such downloaders is Crawlera
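The first three tips can be sketched as Scrapy settings plus a small downloader middleware. This is an illustration under assumptions, not a definitive setup: the module path `myproject.middlewares`, the priority value 400, the class name, and the sample user agent strings are all placeholders. `COOKIES_ENABLED`, `DOWNLOAD_DELAY`, `DOWNLOADER_MIDDLEWARES`, and the `process_request(request, spider)` hook are standard Scrapy names.

```python
import random

# Settings that would live in a Scrapy project's settings.py.
COOKIES_ENABLED = False  # some sites use cookies to spot bot behaviour
DOWNLOAD_DELAY = 2       # seconds to wait between requests
DOWNLOADER_MIDDLEWARES = {
    # Hypothetical module path and priority; adjust to the real project.
    "myproject.middlewares.RotateUserAgentMiddleware": 400,
}

# Sample pool only; a real list should be collected and kept up to date.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/115.0",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/16.5 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36",
]


class RotateUserAgentMiddleware:
    """Downloader middleware that picks a random user agent per request."""

    def __init__(self, user_agents=USER_AGENTS, rng=random):
        self.user_agents = list(user_agents)
        self.rng = rng

    def process_request(self, request, spider):
        # Overwrite the header, then return None so Scrapy keeps processing.
        request.headers["User-Agent"] = self.rng.choice(self.user_agents)
        return None
```

Rotating IPs would follow the same pattern, with the middleware setting `request.meta["proxy"]` from a pool of proxy URLs instead of a header.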