python并行编程手册 勘误

书籍李魔佛 发表了文章 • 0 个评论 • 132 次浏览 • 2019-07-28 17:06 • 来自相关话题

python并行编程手册中文版
 
65页的进程创建, p.join() 不能写到循环里面,不然的话会阻塞下一次进程的创建,因为下一次进程要卡在join这里。
 
可以改成这样的 p0 = multiprocessing.Process(name=str(0), target=foo, args=(0,))
p0.start()
p1 = multiprocessing.Process(name=str(1), target=foo, args=(1,))
p1.start()
p2 = multiprocessing.Process(name=str(2), target=foo, args=(2,))
p2.start()
p3 = multiprocessing.Process(name=str(3), target=foo, args=(3,))
p3.start()
p4 = multiprocessing.Process(name=str(4), target=foo, args=(4,))
p4.start()

p5 = multiprocessing.Process(name=str(5), target=foo, args=(5,))
p5.start()

p0.join()
p1.join()
p2.join()
p3.join()
p4.join()
p5.join() 
而且后面发现,整本书都是有这个问题的。 查看全部
python并行编程手册中文版
 
65页的进程创建, p.join() 不能写到循环里面,不然的话会阻塞下一次进程的创建,因为下一次进程要卡在join这里。
 
可以改成这样的
 p0 = multiprocessing.Process(name=str(0), target=foo, args=(0,))
p0.start()
p1 = multiprocessing.Process(name=str(1), target=foo, args=(1,))
p1.start()
p2 = multiprocessing.Process(name=str(2), target=foo, args=(2,))
p2.start()
p3 = multiprocessing.Process(name=str(3), target=foo, args=(3,))
p3.start()
p4 = multiprocessing.Process(name=str(4), target=foo, args=(4,))
p4.start()

p5 = multiprocessing.Process(name=str(5), target=foo, args=(5,))
p5.start()

p0.join()
p1.join()
p2.join()
p3.join()
p4.join()
p5.join()
 
而且后面发现,整本书都是有这个问题的。

mongodb find得到的数据顺序每次都是一样的

数据库李魔佛 发表了文章 • 0 个评论 • 137 次浏览 • 2019-07-26 09:00 • 来自相关话题

只要用的find内容不变,那么返回的内容顺序也就都一样的。
只要用的find内容不变,那么返回的内容顺序也就都一样的。

kindle使用率低

书籍liwanqiang 回复了问题 • 4 人关注 • 4 个回复 • 928 次浏览 • 2019-07-25 17:49 • 来自相关话题

[Articles to save]

闲聊李魔佛 发表了文章 • 0 个评论 • 147 次浏览 • 2019-07-21 15:31 • 来自相关话题

Since on Raspberrypi and can't launch note application , using this web page to save articles link to store later.
 
https://www.jisilu.cn/question/321759 -Done
https://www.80shihua.com/archives/1590 -Done
  查看全部
Since on Raspberrypi and can't launch note application , using this web page to save articles link to store later.
 
https://www.jisilu.cn/question/321759 -Done
https://www.80shihua.com/archives/1590 -Done
 

Raspberrypi 2 Install or upgrade Python3.6

树莓派李魔佛 发表了文章 • 0 个评论 • 108 次浏览 • 2019-07-21 14:55 • 来自相关话题

Since no chinese input method in my raspberrypi, i can only write with English.
 
Raspberrypi has python2. 7 and python3.4, but i want to upgrade to python3.6+.
 
Python3.6 support some new feature such as print(f'{name}') and x=1_000_242_200 expression.
 
How to upgrade ?
 

$ wget https://www.python.org/ftp/pyt ... 1.tgz $ tar zxvf Python-3.6.1.tgz $ cd Python-3.6.1

then run command:

$ sudo ./configure && sudo make && sudo make install

wait for about 20mins (low perf of raspberrypi :( )
 
then you run command:
python3
 
it will using the new python3.6 version:
 

Python 3.6.1 (default, Jul 21 2019, 14:26:28) 
[GCC 4.9.2] on linux
Type "help", "copyright", "credits" or "license" for more information.
 

 
Enjoy it ! 查看全部
Since no chinese input method in my raspberrypi, i can only write with English.
 
Raspberrypi has python2. 7 and python3.4, but i want to upgrade to python3.6+.
 
Python3.6 support some new feature such as print(f'{name}') and x=1_000_242_200 expression.
 
How to upgrade ?
 

$ wget https://www.python.org/ftp/pyt ... 1.tgz 
$ tar zxvf Python-3.6.1.tgz $ cd Python-3.6.1

then run command:

$ sudo ./configure && sudo make && sudo make install

wait for about 20mins (low perf of raspberrypi :( )
 
then you run command:
python3
 
it will using the new python3.6 version:
 


Python 3.6.1 (default, Jul 21 2019, 14:26:28) 
[GCC 4.9.2] on linux
Type "help", "copyright", "credits" or "license" for more information.
 


 
Enjoy it !

frontera运行link_follower.py 报错:doesn't define any object named 'FIFO'

python爬虫李魔佛 发表了文章 • 0 个评论 • 139 次浏览 • 2019-07-18 11:29 • 来自相关话题

代码如下:
from __future__ import print_function

import re

import requests

from frontera.contrib.requests.manager import RequestsFrontierManager
# from frontera.contrib.requests.manager import RequestsFrontierManager
from frontera import Settings

from six.moves.urllib.parse import urljoin


SETTINGS = Settings()
SETTINGS.BACKEND = 'frontera.contrib.backends.memory.FIFO'
# SETTINGS.BACKEND = 'frontera.contrib.backends.memory.MemoryDistributedBackend'

SETTINGS.LOGGING_MANAGER_ENABLED = True
SETTINGS.LOGGING_BACKEND_ENABLED = True
SETTINGS.MAX_REQUESTS = 100
SETTINGS.MAX_NEXT_REQUESTS = 10

SEEDS = [
'http://www.imdb.com',
]

LINK_RE = re.compile(r'<a.+?href="(.*?)".?>', re.I)


def extract_page_links(response):
return [urljoin(response.url, link) for link in LINK_RE.findall(response.text)]

if __name__ == '__main__':

frontier = RequestsFrontierManager(SETTINGS)
frontier.add_seeds([requests.Request(url=url) for url in SEEDS])
while True:
next_requests = frontier.get_next_requests()
if not next_requests:
break
for request in next_requests:
try:
response = requests.get(request.url)
links = [
requests.Request(url=url)
for url in extract_page_links(response)
]
frontier.page_crawled(response)
print('Crawled', response.url, '(found', len(links), 'urls)')

if links:
frontier.links_extracted(request, links)
except requests.RequestException as e:
error_code = type(e).__name__
frontier.request_error(request, error_code)
print('Failed to process request', request.url, 'Error:', e)

 无论用的py2或者py3,都会报以下的错误。raise NameError("Module '%s' doesn't define any object named '%s'" % (module, name))
NameError: Module 'frontera.contrib.backends.memory' doesn't define any object named 'FIFO' 查看全部
代码如下:
from __future__ import print_function

import re

import requests

from frontera.contrib.requests.manager import RequestsFrontierManager
# from frontera.contrib.requests.manager import RequestsFrontierManager
from frontera import Settings

from six.moves.urllib.parse import urljoin


SETTINGS = Settings()
SETTINGS.BACKEND = 'frontera.contrib.backends.memory.FIFO'
# SETTINGS.BACKEND = 'frontera.contrib.backends.memory.MemoryDistributedBackend'

SETTINGS.LOGGING_MANAGER_ENABLED = True
SETTINGS.LOGGING_BACKEND_ENABLED = True
SETTINGS.MAX_REQUESTS = 100
SETTINGS.MAX_NEXT_REQUESTS = 10

SEEDS = [
'http://www.imdb.com',
]

LINK_RE = re.compile(r'<a.+?href="(.*?)".?>', re.I)


def extract_page_links(response):
return [urljoin(response.url, link) for link in LINK_RE.findall(response.text)]

if __name__ == '__main__':

frontier = RequestsFrontierManager(SETTINGS)
frontier.add_seeds([requests.Request(url=url) for url in SEEDS])
while True:
next_requests = frontier.get_next_requests()
if not next_requests:
break
for request in next_requests:
try:
response = requests.get(request.url)
links = [
requests.Request(url=url)
for url in extract_page_links(response)
]
frontier.page_crawled(response)
print('Crawled', response.url, '(found', len(links), 'urls)')

if links:
frontier.links_extracted(request, links)
except requests.RequestException as e:
error_code = type(e).__name__
frontier.request_error(request, error_code)
print('Failed to process request', request.url, 'Error:', e)

 无论用的py2或者py3,都会报以下的错误。
raise NameError("Module '%s' doesn't define any object named '%s'" % (module, name))
NameError: Module 'frontera.contrib.backends.memory' doesn't define any object named 'FIFO'

scrapy-rabbitmq 不支持python3 [修改源码使它支持]

python爬虫李魔佛 发表了文章 • 0 个评论 • 133 次浏览 • 2019-07-17 17:24 • 来自相关话题

官方版本在2015年就没有更新了。
在python3上运行的收会报错。
 
需要修改以下地方:
 
待续。。
官方版本在2015年就没有更新了。
在python3上运行的收会报错。
 
需要修改以下地方:
 
待续。。

scrapy rabbitmq 分布式爬虫

python爬虫李魔佛 发表了文章 • 0 个评论 • 213 次浏览 • 2019-07-17 16:59 • 来自相关话题

对于没接触过rabbitmq的同学,可以看这个文章:https://blog.csdn.net/hellozpc/article/details/81436980
rabbitmq是个不错的消息队列服务,可以配合scrapy作为消息队列.
 
下面是一个简单的demo:import re
import requests
import scrapy
from scrapy import Request
from rabbit_spider import settings
from scrapy.log import logger
import json
from rabbit_spider.items import RabbitSpiderItem
import datetime
from scrapy.selector import Selector
import pika

# from scrapy_rabbitmq.spiders import RabbitMQMixin
# from scrapy.contrib.spiders import CrawlSpider

class Website(scrapy.Spider):
name = "rabbit"

def start_requests(self):
headers = {'Accept': '*/*',
'Accept-Encoding': 'gzip, deflate, br',
'Accept-Language': 'en-US,en;q=0.9,zh-CN;q=0.8,zh;q=0.7',
'Host': '36kr.com',
'Referer': 'https://36kr.com/information/web_news',
'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.108 Safari/537.36'
}

url = 'https://36kr.com/information/web_news'


yield Request(url=url,
headers=headers)

def parse(self, response):


credentials = pika.PlainCredentials('admin', 'admin')
connection = pika.BlockingConnection(pika.ConnectionParameters('192.168.1.101', 5672, '/', credentials))

channel = connection.channel()
channel.exchange_declare(exchange='direct_log', exchange_type='direct')

result = channel.queue_declare(exclusive=True, queue='')

queue_name = result.method.queue

# print(queue_name)
# infos = sys.argv[1:] if len(sys.argv)>1 else ['info']
info = 'info'

# 绑定多个值

channel.queue_bind(
exchange='direct_log',
routing_key=info,
queue=queue_name
)
print('start to receive [{}]'.format(info))

channel.basic_consume(
on_message_callback=self.callback_func,
queue=queue_name,
auto_ack=True,
)

channel.start_consuming()


def callback_func(self, ch, method, properties, body):
print(body)
 启动spider:
from scrapy import cmdline
cmdline.execute('scrapy crawl rabbit'.split())
 然后往rabbitmq里面推送数据:
import pika
import settings

credentials = pika.PlainCredentials('admin','admin')
connection = pika.BlockingConnection(pika.ConnectionParameters('192.168.1.101',5672,'/',credentials))

channel = connection.channel()
channel.exchange_declare(exchange='direct_log',exchange_type='direct') # fanout 就是组播

routing_key = 'info'
message='https://36kr.com/pp/api/aggregation-entity?type=web_latest_article&b_id=59499&per_page=30'
channel.basic_publish(
exchange='direct_log',
routing_key=routing_key,
body=message
)

print('sending message {}'.format(message))
connection.close()
 
推送数据后,scrapy会马上接受到队里里面的数据。
注意不能在start_requests里面写等待队列的命令,因为start_requests函数需要返回一个生成器,否则程序会报错。
 
待续。。。
  查看全部
对于没接触过rabbitmq的同学,可以看这个文章:https://blog.csdn.net/hellozpc/article/details/81436980
rabbitmq是个不错的消息队列服务,可以配合scrapy作为消息队列.
 
下面是一个简单的demo:
import re
import requests
import scrapy
from scrapy import Request
from rabbit_spider import settings
from scrapy.log import logger
import json
from rabbit_spider.items import RabbitSpiderItem
import datetime
from scrapy.selector import Selector
import pika

# from scrapy_rabbitmq.spiders import RabbitMQMixin
# from scrapy.contrib.spiders import CrawlSpider

class Website(scrapy.Spider):
name = "rabbit"

def start_requests(self):
headers = {'Accept': '*/*',
'Accept-Encoding': 'gzip, deflate, br',
'Accept-Language': 'en-US,en;q=0.9,zh-CN;q=0.8,zh;q=0.7',
'Host': '36kr.com',
'Referer': 'https://36kr.com/information/web_news',
'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.108 Safari/537.36'
}

url = 'https://36kr.com/information/web_news'


yield Request(url=url,
headers=headers)

def parse(self, response):


credentials = pika.PlainCredentials('admin', 'admin')
connection = pika.BlockingConnection(pika.ConnectionParameters('192.168.1.101', 5672, '/', credentials))

channel = connection.channel()
channel.exchange_declare(exchange='direct_log', exchange_type='direct')

result = channel.queue_declare(exclusive=True, queue='')

queue_name = result.method.queue

# print(queue_name)
# infos = sys.argv[1:] if len(sys.argv)>1 else ['info']
info = 'info'

# 绑定多个值

channel.queue_bind(
exchange='direct_log',
routing_key=info,
queue=queue_name
)
print('start to receive [{}]'.format(info))

channel.basic_consume(
on_message_callback=self.callback_func,
queue=queue_name,
auto_ack=True,
)

channel.start_consuming()


def callback_func(self, ch, method, properties, body):
print(body)

 启动spider:
from scrapy import cmdline
cmdline.execute('scrapy crawl rabbit'.split())

 然后往rabbitmq里面推送数据:
import pika
import settings

credentials = pika.PlainCredentials('admin','admin')
connection = pika.BlockingConnection(pika.ConnectionParameters('192.168.1.101',5672,'/',credentials))

channel = connection.channel()
channel.exchange_declare(exchange='direct_log',exchange_type='direct') # fanout 就是组播

routing_key = 'info'
message='https://36kr.com/pp/api/aggregation-entity?type=web_latest_article&b_id=59499&per_page=30'
channel.basic_publish(
exchange='direct_log',
routing_key=routing_key,
body=message
)

print('sending message {}'.format(message))
connection.close()

 
推送数据后,scrapy会马上接受到队里里面的数据。
注意不能在start_requests里面写等待队列的命令,因为start_requests函数需要返回一个生成器,否则程序会报错。
 
待续。。。
 

exchange_declare() got an unexpected keyword argument 'type'

python李魔佛 发表了文章 • 0 个评论 • 125 次浏览 • 2019-07-16 14:40 • 来自相关话题

In new version of pika, now it is using 
exchange_type instead of type
 
credentials = pika.PlainCredentials('admin','admin')
connection = pika.BlockingConnection(pika.ConnectionParameters('192.168.1.101',5672,'/',credentials))

channel = connection.channel()

channel.exchange_declare(exchange='logs',exchange_type='fanout') 查看全部
In new version of pika, now it is using 
exchange_type instead of type
 
	credentials = pika.PlainCredentials('admin','admin')
connection = pika.BlockingConnection(pika.ConnectionParameters('192.168.1.101',5672,'/',credentials))

channel = connection.channel()

channel.exchange_declare(exchange='logs',exchange_type='fanout')

twisted的getPage已经不建议使用,新接口为twisted.web.client.Agent

python爬虫李魔佛 发表了文章 • 0 个评论 • 159 次浏览 • 2019-07-12 11:31 • 来自相关话题

Twisted-16.7.0 is coming soon, and it deprecates twisted.web.client.getPage (and client.HTTPClientFactory). We use these in some of the unit tests, to fetch one of the HTTP WAPI/WUI pages and make sure the contents look right.

We need to change these tests to use twisted.web.client.Agent instead, or a package named "treq", which is a Twisted flavor of the excellent (but blocking) requests library.

 
  查看全部


Twisted-16.7.0 is coming soon, and it deprecates twisted.web.client.getPage (and client.HTTPClientFactory). We use these in some of the unit tests, to fetch one of the HTTP WAPI/WUI pages and make sure the contents look right.

We need to change these tests to use twisted.web.client.Agent instead, or a package named "treq", which is a Twisted flavor of the excellent (but blocking) requests library.