scrapy rabbitmq 分布式爬虫
对于没接触过rabbitmq的同学,可以看这个文章:https://blog.csdn.net/hellozpc/article/details/81436980
rabbitmq是个不错的消息队列服务,可以配合scrapy作为消息队列.
下面是一个简单的demo:
启动spider:
然后往rabbitmq里面推送数据:
推送数据后,scrapy会马上接受到队里里面的数据。
注意不能在start_requests里面写等待队列的命令,因为start_requests函数需要返回一个生成器,否则程序会报错。
待续。。。
###### 2019-08-29 更新 ###################
发现一个坑,就是rabbitMQ在接受到数据后,无法在回调函数里面使用yield生成器。
rabbitmq是个不错的消息队列服务,可以配合scrapy作为消息队列.
下面是一个简单的demo:
import re
import requests
import scrapy
from scrapy import Request
from rabbit_spider import settings
from scrapy.log import logger
import json
from rabbit_spider.items import RabbitSpiderItem
import datetime
from scrapy.selector import Selector
import pika
# from scrapy_rabbitmq.spiders import RabbitMQMixin
# from scrapy.contrib.spiders import CrawlSpider
class Website(scrapy.Spider):
name = "rabbit"
def start_requests(self):
headers = {'Accept': '*/*',
'Accept-Encoding': 'gzip, deflate, br',
'Accept-Language': 'en-US,en;q=0.9,zh-CN;q=0.8,zh;q=0.7',
'Host': '36kr.com',
'Referer': 'https://36kr.com/information/web_news',
'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.108 Safari/537.36'
}
url = 'https://36kr.com/information/web_news'
yield Request(url=url,
headers=headers)
def parse(self, response):
credentials = pika.PlainCredentials('admin', 'admin')
connection = pika.BlockingConnection(pika.ConnectionParameters('192.168.1.101', 5672, '/', credentials))
channel = connection.channel()
channel.exchange_declare(exchange='direct_log', exchange_type='direct')
result = channel.queue_declare(exclusive=True, queue='')
queue_name = result.method.queue
# print(queue_name)
# infos = sys.argv[1:] if len(sys.argv)>1 else ['info']
info = 'info'
# 绑定多个值
channel.queue_bind(
exchange='direct_log',
routing_key=info,
queue=queue_name
)
print('start to receive [{}]'.format(info))
channel.basic_consume(
on_message_callback=self.callback_func,
queue=queue_name,
auto_ack=True,
)
channel.start_consuming()
def callback_func(self, ch, method, properties, body):
print(body)
启动spider:
from scrapy import cmdline
cmdline.execute('scrapy crawl rabbit'.split())
然后往rabbitmq里面推送数据:
import pika
import settings
credentials = pika.PlainCredentials('admin','admin')
connection = pika.BlockingConnection(pika.ConnectionParameters('192.168.1.101',5672,'/',credentials))
channel = connection.channel()
channel.exchange_declare(exchange='direct_log',exchange_type='direct') # fanout 就是组播
routing_key = 'info'
message='https://36kr.com/pp/api/aggregation-entity?type=web_latest_article&b_id=59499&per_page=30'
channel.basic_publish(
exchange='direct_log',
routing_key=routing_key,
body=message
)
print('sending message {}'.format(message))
connection.close()
推送数据后,scrapy会马上接受到队里里面的数据。
注意不能在start_requests里面写等待队列的命令,因为start_requests函数需要返回一个生成器,否则程序会报错。
待续。。。
###### 2019-08-29 更新 ###################
发现一个坑,就是rabbitMQ在接受到数据后,无法在回调函数里面使用yield生成器。