通知设置 新通知
薅“疫情公益”羊毛,黑产恶意爬取各大出版社电子书上万册
python爬虫 • Magiccc 发表了文章 • 0 个评论 • 2847 次浏览 • 2020-02-26 13:17
个人的知识星球
量化交易-Ptrade-QMT • 李魔佛 发表了文章 • 0 个评论 • 3506 次浏览 • 2020-02-23 11:10
微信扫一扫加入我的知识星球
星球的第一篇文章
python获取全市场LOF基金折溢价数据并进行套利
市场是总共的LOF基金有301只(上图右下角的圈圈是所有基金的条数),而集思录上只有120只左右,所以有些溢价厉害(大于10%)的LOF基金并没有在集思录的网站上显示,这对于专注于套利的投资者来说,会损失很多潜在的套利机会。
点击查看大图
我回复了该贴后,有大量的人私信我,问我能否提供一份这个数据,或者教对方如何获取这些数据。 因为人数众多,也没有那么多精力来一一回答。毕竟不同人的水平背景不一样,逐个回答起来也很累,所以就回答了几个朋友的问题后就一一婉拒了。
然后在几个投资群里,居然也有人提到这个数据,在咨询如何才能获取到这个完整的数据,并且可以实时更新显示。 因为我的微信群昵称和集思录是一样的,所以不少人@我,我也都简单的回复了下,是使用python抓取的数据,数据保存到Mysql和MongoDB。 代码行数不多,100行都不到。
具体实现在星球会有完整代码。 查看全部
微信扫一扫加入我的知识星球
星球的第一篇文章
python获取全市场LOF基金折溢价数据并进行套利
市场是总共的LOF基金有301只(上图右下角的圈圈是所有基金的条数),而集思录上只有120只左右,所以有些溢价厉害(大于10%)的LOF基金并没有在集思录的网站上显示,这对于专注于套利的投资者来说,会损失很多潜在的套利机会。
点击查看大图
我回复了该贴后,有大量的人私信我,问我能否提供一份这个数据,或者教对方如何获取这些数据。 因为人数众多,也没有那么多精力来一一回答。毕竟不同人的水平背景不一样,逐个回答起来也很累,所以就回答了几个朋友的问题后就一一婉拒了。
然后在几个投资群里,居然也有人提到这个数据,在咨询如何才能获取到这个完整的数据,并且可以实时更新显示。 因为我的微信群昵称和集思录是一样的,所以不少人@我,我也都简单的回复了下,是使用python抓取的数据,数据保存到Mysql和MongoDB。 代码行数不多,100行都不到。
具体实现在星球会有完整代码。
证券etf和券商etf的区别
券商万一免五 • 李魔佛 发表了文章 • 0 个评论 • 47765 次浏览 • 2020-02-10 23:48
而券商etf是:
华宝中证全指证券公司ETF (512000)LOF/ETF
二者都是指数/lof基金,而且持仓差不多都一样的。
证券etf的持仓:
券商etf的持仓:
不同的是规模,证券ETF的规模要比券商etf的规模要大得多。目前是2倍左右的差距【2019年的数据】。 所以如果你关心的是流动性,那么可以买证券ETF。
最新的规模【2020-06】其实二者在缩小,具体数据见文末。
证券ETF的规模是190亿
而券商ETF规模为160亿。(最近增长比较多,笔者在1年前记录的数据为30亿,可以到http://fundf10.eastmoney.com/gmbd_512000.html 查看历史规模。现在券商ETF的规模翻了一番。
更多量化分析,关注公众号:可转债量化分析
券商开户股票万一免5,基金申购一折,拖拉机6+1,关注公众号留言。
可转债手续费 百万分之二,一万块手续费才2分钱,没有最低限制(没有最低收1元,1毛这种)
(教你使用python进行量化分析股票,可转债数据)
查看全部
而券商etf是:
华宝中证全指证券公司ETF (512000)LOF/ETF
二者都是指数/lof基金,而且持仓差不多都一样的。
证券etf的持仓:
券商etf的持仓:
不同的是规模,证券ETF的规模要比券商etf的规模要大得多。目前是2倍左右的差距【2019年的数据】。 所以如果你关心的是流动性,那么可以买证券ETF。
最新的规模【2020-06】其实二者在缩小,具体数据见文末。
证券ETF的规模是190亿
而券商ETF规模为160亿。(最近增长比较多,笔者在1年前记录的数据为30亿,可以到http://fundf10.eastmoney.com/gmbd_512000.html 查看历史规模。现在券商ETF的规模翻了一番。
更多量化分析,关注公众号:可转债量化分析
券商开户股票万一免5,基金申购一折,拖拉机6+1,关注公众号留言。
可转债手续费 百万分之二,一万块手续费才2分钱,没有最低限制(没有最低收1元,1毛这种)
(教你使用python进行量化分析股票,可转债数据)
dataframe 根据日期重采样 计算个数
量化交易-Ptrade-QMT • 李魔佛 发表了文章 • 0 个评论 • 3004 次浏览 • 2019-12-19 09:07
bandcamp移动开发更简单
数据库 • linxiaojue 发表了文章 • 0 个评论 • 3042 次浏览 • 2019-12-14 05:12
http://TalkingData.bandcamp.com/
http://Bugly.bandcamp.com/
http://Box2D.bandcamp.com/
http://aineice.bandcamp.com/
http://wyyp.bandcamp.com/
http://Prepo.bandcamp.com/
http://Chipmunk.bandcamp.com/
http://openinstall.bandcamp.com/
http://MobileInsight.bandcamp.com/
http://zhugelo.bandcamp.com/
http://CobubRazor.bandcamp.com/
http://Testin.bandcamp.com/
http://crashlytics.bandcamp.com/
http://APKProtect.bandcamp.com/
http://Ucloud.bandcamp.com/
http://ydkfpgj.bandcamp.com/releases
http://TalkingData.bandcamp.com/releases
http://Bugly.bandcamp.com/releases
http://Box2D.bandcamp.com/releases
http://aineice.bandcamp.com/releases
http://wyyp.bandcamp.com/releases
http://Prepo.bandcamp.com/releases
http://Chipmunk.bandcamp.com/releases
http://openinstall.bandcamp.com/releases
http://MobileInsight.bandcamp.com/releases
http://zhugelo.bandcamp.com/releases
http://CobubRazor.bandcamp.com/releases
http://Testin.bandcamp.com/releases
http://crashlytics.bandcamp.com/releases
http://APKProtect.bandcamp.com/releases
http://Ucloud.bandcamp.com/releases 查看全部
http://TalkingData.bandcamp.com/
http://Bugly.bandcamp.com/
http://Box2D.bandcamp.com/
http://aineice.bandcamp.com/
http://wyyp.bandcamp.com/
http://Prepo.bandcamp.com/
http://Chipmunk.bandcamp.com/
http://openinstall.bandcamp.com/
http://MobileInsight.bandcamp.com/
http://zhugelo.bandcamp.com/
http://CobubRazor.bandcamp.com/
http://Testin.bandcamp.com/
http://crashlytics.bandcamp.com/
http://APKProtect.bandcamp.com/
http://Ucloud.bandcamp.com/
http://ydkfpgj.bandcamp.com/releases
http://TalkingData.bandcamp.com/releases
http://Bugly.bandcamp.com/releases
http://Box2D.bandcamp.com/releases
http://aineice.bandcamp.com/releases
http://wyyp.bandcamp.com/releases
http://Prepo.bandcamp.com/releases
http://Chipmunk.bandcamp.com/releases
http://openinstall.bandcamp.com/releases
http://MobileInsight.bandcamp.com/releases
http://zhugelo.bandcamp.com/releases
http://CobubRazor.bandcamp.com/releases
http://Testin.bandcamp.com/releases
http://crashlytics.bandcamp.com/releases
http://APKProtect.bandcamp.com/releases
http://Ucloud.bandcamp.com/releases
mongodb日期条件查找
数据库 • 李魔佛 发表了文章 • 0 个评论 • 2430 次浏览 • 2019-12-10 10:41
RuntimeError: `get_session` is not available when using TensorFlow 2.0.
深度学习 • 李魔佛 发表了文章 • 0 个评论 • 13053 次浏览 • 2019-11-28 15:10
pip install tensorflow==1.15 --upgradeHere, we will see how we can upgrade our code to work with tensorflow 2.0.
This error is usually faced when we are loading pre-trained model with tensorflow session/graph or we are building flask api over a pre-trained model and loading model in tensorflow graph to avoid collision of sessions while application is getting multiple requests at once or say in case of multi-threadinng
After tensorflow 2.0 upgrade, i also started facing above error in one of my project when i had built api of pre-trained model with flask. So i looked around in tensorflow 2.0 documents to find a workaround, to avoid this runtime error and upgrade my code to work with tensorflow 2.0 as well rather than downgrading it to tensorflow 1.x .
I had a project on which i had written tutorial as well on how to build Flask api on trained keras model of text classification and then use it in production
But this project was not working after tensorflow upgrade and was facing runtime error.
Stacktrace of error was something like below:
File "/Users/Upasana/Documents/playground/deploy-keras-model-in-production/src/main.py", line 37, in model_predict
with backend.get_session().graph.as_default() as g:
File "/Users/Upasana/Documents/playground/deploy-keras-model-in-production/venv-tf2/lib/python3.6/site-packages/keras/backend/tensorflow_backend.py", line 379, in get_session
'`get_session` is not available '
RuntimeError: `get_session` is not available when using TensorFlow 2.0.
Related code to get model
with backend.get_session().graph.as_default() as g:
model = SentimentService.get_model1()
Related code to load model
def load_deep_model(self, model):
json_file = open('./src/mood-saved-models/' + model + '.json', 'r')
loaded_model_json = json_file.read()
loaded_model = model_from_json(loaded_model_json)
loaded_model.load_weights("./src/mood-saved-models/" + model + ".h5")
loaded_model._make_predict_function()
return loaded_model
get_session is removed in tensorflow 2.0 and hence not available.
so, in order to load saved model we switched methods. Rather than using keras’s load_model, we used tensorflow to load model so that we can load model using distribution strategy.
Note
The tf.distribute.Strategy API provides an abstraction for distributing your training across multiple processing units.
New code to get model
another_strategy = tf.distribute.MirroredStrategy()
with another_strategy.scope():
model = SentimentService.get_model1()
New code to load model
def load_deep_model(self, model):
loaded_model = tf.keras.models.load_model("./src/mood-saved-models/"model + ".h5")
return loaded_model
This worked and solved the problem with runtime error of get_session not available in tensorflow 2.0 . You can refer to Tensorflow 2.0 upgraded article too
Hope, this will solve your problem too. Thanks for following this article. 查看全部
pip install tensorflow==1.15 --upgrade
Here, we will see how we can upgrade our code to work with tensorflow 2.0.
This error is usually faced when we are loading pre-trained model with tensorflow session/graph or we are building flask api over a pre-trained model and loading model in tensorflow graph to avoid collision of sessions while application is getting multiple requests at once or say in case of multi-threadinng
After tensorflow 2.0 upgrade, i also started facing above error in one of my project when i had built api of pre-trained model with flask. So i looked around in tensorflow 2.0 documents to find a workaround, to avoid this runtime error and upgrade my code to work with tensorflow 2.0 as well rather than downgrading it to tensorflow 1.x .
I had a project on which i had written tutorial as well on how to build Flask api on trained keras model of text classification and then use it in production
But this project was not working after tensorflow upgrade and was facing runtime error.
Stacktrace of error was something like below:
File "/Users/Upasana/Documents/playground/deploy-keras-model-in-production/src/main.py", line 37, in model_predict
with backend.get_session().graph.as_default() as g:
File "/Users/Upasana/Documents/playground/deploy-keras-model-in-production/venv-tf2/lib/python3.6/site-packages/keras/backend/tensorflow_backend.py", line 379, in get_session
'`get_session` is not available '
RuntimeError: `get_session` is not available when using TensorFlow 2.0.
Related code to get model
with backend.get_session().graph.as_default() as g:
model = SentimentService.get_model1()
Related code to load model
def load_deep_model(self, model):
json_file = open('./src/mood-saved-models/' + model + '.json', 'r')
loaded_model_json = json_file.read()
loaded_model = model_from_json(loaded_model_json)
loaded_model.load_weights("./src/mood-saved-models/" + model + ".h5")
loaded_model._make_predict_function()
return loaded_model
get_session is removed in tensorflow 2.0 and hence not available.
so, in order to load saved model we switched methods. Rather than using keras’s load_model, we used tensorflow to load model so that we can load model using distribution strategy.
Note
The tf.distribute.Strategy API provides an abstraction for distributing your training across multiple processing units.
New code to get model
another_strategy = tf.distribute.MirroredStrategy()
with another_strategy.scope():
model = SentimentService.get_model1()
New code to load model
def load_deep_model(self, model):
loaded_model = tf.keras.models.load_model("./src/mood-saved-models/"model + ".h5")
return loaded_model
This worked and solved the problem with runtime error of get_session not available in tensorflow 2.0 . You can refer to Tensorflow 2.0 upgraded article too
Hope, this will solve your problem too. Thanks for following this article.
yolo voc_label 源码分析
深度学习 • 李魔佛 发表了文章 • 0 个评论 • 2909 次浏览 • 2019-11-27 15:19
import xml.etree.ElementTree as ET
import pickle
import os
from os import listdir, getcwd
from os.path import join
sets = [('2012', 'train'), ('2012', 'val'), ('2007', 'train'), ('2007', 'val'), ('2007', 'test')]
# 20类
classes = ["aeroplane", "bicycle", "bird", "boat", "bottle", "bus", "car", "cat", "chair", "cow", "diningtable", "dog",
"horse", "motorbike", "person", "pottedplant", "sheep", "sofa", "train", "tvmonitor"]
# size w,h
# box x-min,x-max,y-min,y-max
def convert(size, box):
dw = 1. / size[0]
dh = 1. / size[1]
x = (box[0] + box[1]) / 2.0 # 中心点位置
y = (box[2] + box[3]) / 2.0
w = box[1] - box[0]
h = box[3] - box[2]
x = x * dw
w = w * dw
y = y * dh
h = h * dh # 全部转化为相对坐标
return (x, y, w, h)
def convert_annotation(year, image_id):
# 找到2个同样的文件
in_file = open('VOCdevkit/VOC%s/Annotations/%s.xml' % (year, image_id))
out_file = open('VOCdevkit/VOC%s/labels/%s.txt' % (year, image_id), 'w')
tree = ET.parse(in_file)
root = tree.getroot()
size = root.find('size')
w = int(size.find('width').text)
h = int(size.find('height').text)
for obj in root.iter('object'):
difficult = obj.find('difficult').text
cls = obj.find('name').text
if cls not in classes or int(difficult) == 1: # difficult ==1 的不要了
continue
cls_id = classes.index(cls) # 排在第几位
xmlbox = obj.find('bndbox')
b = (float(xmlbox.find('xmin').text), float(xmlbox.find('xmax').text), float(xmlbox.find('ymin').text),
float(xmlbox.find('ymax').text))
# 传入的是w,h 与框框的周边
bb = convert((w, h), b)
out_file.write(str(cls_id) + " " + " ".join([str(a) for a in bb]) + '\n')
wd = getcwd()
for year, image_set in sets:
# ('2012', 'train') 循环5次
# 创建目录 一次性
if not os.path.exists('VOCdevkit/VOC%s/labels/' % (year)):
os.makedirs('VOCdevkit/VOC%s/labels/' % (year))
# 图片的id数据
image_ids = open('VOCdevkit/VOC%s/ImageSets/Main/%s.txt' % (year, image_set)).read().strip().split()
# 结果写入这个文件
list_file = open('%s_%s.txt' % (year, image_set), 'w')
for image_id in image_ids:
list_file.write('%s/VOCdevkit/VOC%s/JPEGImages/%s.jpg\n' % (wd, year, image_id)) # 补全路径
convert_annotation(year, image_id)
list_file.close()
查看全部
import xml.etree.ElementTree as ET
import pickle
import os
from os import listdir, getcwd
from os.path import join
sets = [('2012', 'train'), ('2012', 'val'), ('2007', 'train'), ('2007', 'val'), ('2007', 'test')]
# 20类
classes = ["aeroplane", "bicycle", "bird", "boat", "bottle", "bus", "car", "cat", "chair", "cow", "diningtable", "dog",
"horse", "motorbike", "person", "pottedplant", "sheep", "sofa", "train", "tvmonitor"]
# size w,h
# box x-min,x-max,y-min,y-max
def convert(size, box):
dw = 1. / size[0]
dh = 1. / size[1]
x = (box[0] + box[1]) / 2.0 # 中心点位置
y = (box[2] + box[3]) / 2.0
w = box[1] - box[0]
h = box[3] - box[2]
x = x * dw
w = w * dw
y = y * dh
h = h * dh # 全部转化为相对坐标
return (x, y, w, h)
def convert_annotation(year, image_id):
# 找到2个同样的文件
in_file = open('VOCdevkit/VOC%s/Annotations/%s.xml' % (year, image_id))
out_file = open('VOCdevkit/VOC%s/labels/%s.txt' % (year, image_id), 'w')
tree = ET.parse(in_file)
root = tree.getroot()
size = root.find('size')
w = int(size.find('width').text)
h = int(size.find('height').text)
for obj in root.iter('object'):
difficult = obj.find('difficult').text
cls = obj.find('name').text
if cls not in classes or int(difficult) == 1: # difficult ==1 的不要了
continue
cls_id = classes.index(cls) # 排在第几位
xmlbox = obj.find('bndbox')
b = (float(xmlbox.find('xmin').text), float(xmlbox.find('xmax').text), float(xmlbox.find('ymin').text),
float(xmlbox.find('ymax').text))
# 传入的是w,h 与框框的周边
bb = convert((w, h), b)
out_file.write(str(cls_id) + " " + " ".join([str(a) for a in bb]) + '\n')
wd = getcwd()
for year, image_set in sets:
# ('2012', 'train') 循环5次
# 创建目录 一次性
if not os.path.exists('VOCdevkit/VOC%s/labels/' % (year)):
os.makedirs('VOCdevkit/VOC%s/labels/' % (year))
# 图片的id数据
image_ids = open('VOCdevkit/VOC%s/ImageSets/Main/%s.txt' % (year, image_set)).read().strip().split()
# 结果写入这个文件
list_file = open('%s_%s.txt' % (year, image_set), 'w')
for image_id in image_ids:
list_file.write('%s/VOCdevkit/VOC%s/JPEGImages/%s.jpg\n' % (wd, year, image_id)) # 补全路径
convert_annotation(year, image_id)
list_file.close()
docker run 和 create 区别
Linux • 李魔佛 发表了文章 • 0 个评论 • 7344 次浏览 • 2019-11-25 13:49
docker create command creates a writeable container from the image and prepares it for running.
docker run command creates the container (same as docker create ) and starts it.
查看全部
docker create command creates a writeable container from the image and prepares it for running.
docker run command creates the container (same as docker create ) and starts it.
用docker编译go代码
Linux • 李魔佛 发表了文章 • 0 个评论 • 3701 次浏览 • 2019-11-25 13:45
1. 用文本编辑你的go代码,现在以hello world为例:
package main import "fmt" func main() {
/* 这是我的第一个简单的程序 */
fmt.Println("Hello, World!")
}
2. 然后直接使用docker执行编译。docker首先会自动去下载go的编译器,顺便把所有的依赖给解决掉
docker run --rm -v "$(pwd)":/usr/src/hello -w /usr/src/hello golang:1.3 go build -v
最后会在本地生成一个编译好的hello静态文件。
上述docker命令的具体含义就是把当前路径挂在到docker容器里头,然后切换到改到改路径下,然后进行编译。 查看全部
如果偶尔需要编译go代码,但是又不想要安装一堆乱七八糟的依赖和Go编译器,可以利用docker来实现。 应该是解决起来话费时间最小的。
1. 用文本编辑你的go代码,现在以hello world为例:
package main import "fmt" func main() {
/* 这是我的第一个简单的程序 */
fmt.Println("Hello, World!")
}
2. 然后直接使用docker执行编译。docker首先会自动去下载go的编译器,顺便把所有的依赖给解决掉
docker run --rm -v "$(pwd)":/usr/src/hello -w /usr/src/hello golang:1.3 go build -v
最后会在本地生成一个编译好的hello静态文件。
上述docker命令的具体含义就是把当前路径挂在到docker容器里头,然后切换到改到改路径下,然后进行编译。
docker实战 代码勘误
Linux • 李魔佛 发表了文章 • 0 个评论 • 2403 次浏览 • 2019-11-25 11:24
看这本书的作者一定要看,不然坑挺多的。
一路采坑过来的哭着说。Last updated August 21, 2016
In an effort to offer continued support beyond publication, we have listed many updates to code due to version updates.
[code - omission] Page 18
The command to start the "mailer" is missing a line. Where the book reads:
docker run -d \
--name mailer \
the proper command should read:
docker run -d \
--name mailer \
dockerinaction/ch2_mailer
[code - regression] Page 68
Newer versions of Docker have changed the structure of the JSON returned by the docker inspect subcommand. If the following command does not work then use the replacement. Original:
docker inspect --format "{{json .Volumes}}" reader
Replacement:
docker inspect --format "{{json .Mounts}}" reader
[code - regression] Page 69
Newer versions of Docker have changed the structure of the JSON returned by the docker inspect subcommand. If the following command does not work then use the replacement. Original:
docker inspect --format "{{json .Volumes}}" student
Replacement:
docker inspect --format "{{json .Mounts}}" student
[code - regression] Page 74
The alpine image entrypoint has changed since original publication and has been unset. The last command on the page should now read:
docker run --rm \
--volumes-from tools \
--entrypoint /bin/sh \
alpine:latest \
-c 'ls /operations/*'
[code - regression] Page 75
The docker exec example on the top of page 75 was printed with the wrong tool name. The correct command is:
docker exec important_application /operations/tools/diagnostics
[code - regression] Page 86
It appears that nslookup behavior in the alpine image has changed. To run the example use the busybox:1 image.
docker run --rm \
--hostname barker \
busybox:1 \
nslookup barker
[code - regression] Page 87 (top)
It appears that nslookup behavior in the alpine image has changed. To run the example use the busybox:1 image.
docker run --rm \
--dns 8.8.8.8 \
busybox:1 \
nslookup docker.com
[code - regression] Page 87 (bottom)
It appears that nslookup behavior in the alpine image has changed. To run the example use the busybox:1 image.
docker run --rm \
--dns-search docker.com \
busybox:1 \
nslookup registry.hub
[code - regression] Page 88 (bottom)
It appears that nslookup behavior in the alpine image has changed. To run the example use the busybox:1 image.
docker run --rm \
--add-host test:10.10.10.255 \
busybox:1 \
nslookup test
[code - regression] Page 106
There are a few new problems with this example. First, the named repository (dockerfile/mariadb) no longer exists. You can use mariadb:5.5 as replacement. However, the second problem is that containers created from the mariadb image perform certain initialization at startup. That initialization work requires certain capabilities and to be started with the default user. The system should instead drop permissions after the initialization work is complete. Note that the real value of this example is in demonstrating different resource isolation mechanisms. It is not so important that you get it working. You can start the database with the following command:
docker run -d --name ch6_mariadb \
--memory 256m \
--cpu-shares 1024 \
--cap-drop net_raw \
-e MYSQL_ROOT_PASSWORD=test \
mariadb:5.5
[code - regression] Page 107
Containers created from the wordpress:4.1 image perform certain initialization at startup and expect certain environment variables. That initialization work requires certain capabilities and to be started with the default user. The system should instead drop permissions after the initialization work is complete. Note that the real value of this example is in demonstrating different resource isolation mechanisms. It is not so important that you get it working. You can start wordpress with the following command:
docker run -d -P --name ch6_wordpress \
--memory 512m \
--cpu-shares 512 \
--cap-drop net_raw \
-e WORDPRESS_DB_PASSWORD=test \
mariadb:5.5
[code - typo] Page 109
The device access example is missing the "run" subcommand. The command listed as:
docker -it --rm \
--device /dev/video0:/dev/video0 \
ubuntu:latest ls -al /dev
should have been written as:
docker run -it --rm \
--device /dev/video0:/dev/video0 \
ubuntu:latest ls -al /dev
[code - typo] Page 110 - 111
Several commands are missing the "run" subcommand. In each case the command begins with
docker -d ...
and should have been written as:
docker run -d ...
[code - regression] Page 115 (bottom)
The busybox and alpine images have been updated to fix the problem described in the paragraph below. The 'su' command does not have the SUID bit set and will not provide any automatic privilege escalation.
[command correction] Page 116
Boot2Docker has been discontinued and rolled into a newer project called Docker Machine. Because a reader is unlikely to have the boot2docker command installed, the command at the top of this page should be changed from:
boot2docker ssh
to the Docker Machine equivalent:
docker-machine ssh default
where default is the name of the machine you created.
[code - regression] Page 119
The ifconfig command has since been removed from ubuntu:latest. Instead of using the ubuntu:latest image for these examples use ubuntu:trusty. The example using ifconfig should look like:
docker run --rm \
--privileged \
ubuntu:trusty ifconfig
[Illustration Mistake] Page 136
Image layer ced2 on the left side of the illustration is listed at c3d2 on the right side. These two layers should represent the same item.
[code - typo] Page 140
Containers need not be in a running state in order to export their file system. The first command on page 140 uses the "run" subcommand but the command listed will never be able to start. Replace "run" with "create." The command should appear as follows:
docker create --name export-test \
dockerinaction/ch7_packed:latest ./echo For Export
[code - missing line] Page 146
In the example Dockerfile near the top of the page the line with the RUN directive is missing part of the instruction. That line should read:
RUN apt-get update && apt-get install -y git
[code - evolution] Page 215
The registry:2 configuration file now requires the population of additional fields under "maintenance > uploadpurging." The example should currently look like:
# Filename: s3-config.yml
version: 0.1
log:
level: debug
fields:
service: registry
environment: development
storage:
cache:
layerinfo: inmemory
s3:
accesskey:
secretkey:
region:
bucket:
encrypt: true
secure: true
v4auth: true
chunksize: 5242880
rootdirectory: /s3/object/name/prefix
maintenance:
uploadpurging:
enabled: true
age: 168h
interval: 24h
dryrun: false
readonly:
enabled: false
http:
addr: :5000
secret: asecretforlocaldevelopment
debug:
addr: localhost:5001
[code - evolution] Page 216
The registry:2 configuration file now requires the population of additional fields under "maintenance > uploadpurging." The example should currently look like:
# Filename: rados-config.yml
version: 0.1
log:
level: debug
fields:
service: registry
environment: development
storage:
cache:
layerinfo: inmemory
rados:
poolname: radospool
username: radosuser
chunksize: 4194304
maintenance:
uploadpurging:
enabled: false
age: 168h
interval: 24h
dryrun: false
readonly:
enabled: false
http:
addr: :5000
secret: asecretforlocaldevelopment
debug:
addr: localhost:5001
[code - evolution] Page 218
The registry:2 configuration file now requires the population of additional fields under "maintenance > uploadpurging." The example should currently look like:
# Filename: redis-config.yml
version: 0.1
log:
level: debug
fields:
service: registry
environment: development
http:
addr: :5000
secret: asecretforlocaldevelopment
debug:
addr: localhost:5001
storage:
cache:
blobdescriptor: redis
s3:
accesskey:
secretkey:
region:
bucket:
encrypt: true
secure: true
v4auth: true
chunksize: 5242880
rootdirectory: /s3/object/name/prefix
maintenance:
uploadpurging:
enabled: true
age: 168h
interval: 24h
dryrun: false
readonly:
enabled: false
redis:
addr: redis-host:6379
password: asecret
dialtimeout: 10ms
readtimeout: 10ms
writetimeout: 10ms
pool:
maxidle: 16
maxactive: 64
idletimeout: 300s
[code - typo] Page 220
The name of the file shown should be scalable-config.yml as in previous examples. This example also requires the addition of the newer uploadpurging attributes. The mainenance section of the file should be as follows:
maintenance:
uploadpurging:
enabled: true
age: 168h
interval: 24h
dryrun: false
readonly:
enabled: false
[text - typo] Page 240
In the second paragraph the reader is instructed to, "Open ./coffee/api/api.py" this is not the correct location of the file. The correct file location is at, "./coffee/app/api.py."
[text - typo] Page 262
The refere nce to "flock.json" in the first sentence of the third paragraph should be "flock.yml."
[code - typo] Page 270
The git clone command uses the ssh protocol instead of https. The command should read as follows:
git clone https://github.com/dockerinact ... i.git 查看全部
看这本书的作者一定要看,不然坑挺多的。
一路采坑过来的哭着说。
Last updated August 21, 2016
In an effort to offer continued support beyond publication, we have listed many updates to code due to version updates.
[code - omission] Page 18
The command to start the "mailer" is missing a line. Where the book reads:
docker run -d \
--name mailer \
the proper command should read:
docker run -d \
--name mailer \
dockerinaction/ch2_mailer
[code - regression] Page 68
Newer versions of Docker have changed the structure of the JSON returned by the docker inspect subcommand. If the following command does not work then use the replacement. Original:
docker inspect --format "{{json .Volumes}}" reader
Replacement:
docker inspect --format "{{json .Mounts}}" reader
[code - regression] Page 69
Newer versions of Docker have changed the structure of the JSON returned by the docker inspect subcommand. If the following command does not work then use the replacement. Original:
docker inspect --format "{{json .Volumes}}" student
Replacement:
docker inspect --format "{{json .Mounts}}" student
[code - regression] Page 74
The alpine image entrypoint has changed since original publication and has been unset. The last command on the page should now read:
docker run --rm \
--volumes-from tools \
--entrypoint /bin/sh \
alpine:latest \
-c 'ls /operations/*'
[code - regression] Page 75
The docker exec example on the top of page 75 was printed with the wrong tool name. The correct command is:
docker exec important_application /operations/tools/diagnostics
[code - regression] Page 86
It appears that nslookup behavior in the alpine image has changed. To run the example use the busybox:1 image.
docker run --rm \
--hostname barker \
busybox:1 \
nslookup barker
[code - regression] Page 87 (top)
It appears that nslookup behavior in the alpine image has changed. To run the example use the busybox:1 image.
docker run --rm \
--dns 8.8.8.8 \
busybox:1 \
nslookup docker.com
[code - regression] Page 87 (bottom)
It appears that nslookup behavior in the alpine image has changed. To run the example use the busybox:1 image.
docker run --rm \
--dns-search docker.com \
busybox:1 \
nslookup registry.hub
[code - regression] Page 88 (bottom)
It appears that nslookup behavior in the alpine image has changed. To run the example use the busybox:1 image.
docker run --rm \
--add-host test:10.10.10.255 \
busybox:1 \
nslookup test
[code - regression] Page 106
There are a few new problems with this example. First, the named repository (dockerfile/mariadb) no longer exists. You can use mariadb:5.5 as replacement. However, the second problem is that containers created from the mariadb image perform certain initialization at startup. That initialization work requires certain capabilities and to be started with the default user. The system should instead drop permissions after the initialization work is complete. Note that the real value of this example is in demonstrating different resource isolation mechanisms. It is not so important that you get it working. You can start the database with the following command:
docker run -d --name ch6_mariadb \
--memory 256m \
--cpu-shares 1024 \
--cap-drop net_raw \
-e MYSQL_ROOT_PASSWORD=test \
mariadb:5.5
[code - regression] Page 107
Containers created from the wordpress:4.1 image perform certain initialization at startup and expect certain environment variables. That initialization work requires certain capabilities and to be started with the default user. The system should instead drop permissions after the initialization work is complete. Note that the real value of this example is in demonstrating different resource isolation mechanisms. It is not so important that you get it working. You can start wordpress with the following command:
docker run -d -P --name ch6_wordpress \
--memory 512m \
--cpu-shares 512 \
--cap-drop net_raw \
-e WORDPRESS_DB_PASSWORD=test \
mariadb:5.5
[code - typo] Page 109
The device access example is missing the "run" subcommand. The command listed as:
docker -it --rm \
--device /dev/video0:/dev/video0 \
ubuntu:latest ls -al /dev
should have been written as:
docker run -it --rm \
--device /dev/video0:/dev/video0 \
ubuntu:latest ls -al /dev
[code - typo] Page 110 - 111
Several commands are missing the "run" subcommand. In each case the command begins with
docker -d ...
and should have been written as:
docker run -d ...
[code - regression] Page 115 (bottom)
The busybox and alpine images have been updated to fix the problem described in the paragraph below. The 'su' command does not have the SUID bit set and will not provide any automatic privilege escalation.
[command correction] Page 116
Boot2Docker has been discontinued and rolled into a newer project called Docker Machine. Because a reader is unlikely to have the boot2docker command installed, the command at the top of this page should be changed from:
boot2docker ssh
to the Docker Machine equivalent:
docker-machine ssh default
where default is the name of the machine you created.
[code - regression] Page 119
The ifconfig command has since been removed from ubuntu:latest. Instead of using the ubuntu:latest image for these examples use ubuntu:trusty. The example using ifconfig should look like:
docker run --rm \
--privileged \
ubuntu:trusty ifconfig
[Illustration Mistake] Page 136
Image layer ced2 on the left side of the illustration is listed at c3d2 on the right side. These two layers should represent the same item.
[code - typo] Page 140
Containers need not be in a running state in order to export their file system. The first command on page 140 uses the "run" subcommand but the command listed will never be able to start. Replace "run" with "create." The command should appear as follows:
docker create --name export-test \
dockerinaction/ch7_packed:latest ./echo For Export
[code - missing line] Page 146
In the example Dockerfile near the top of the page the line with the RUN directive is missing part of the instruction. That line should read:
RUN apt-get update && apt-get install -y git
[code - evolution] Page 215
The registry:2 configuration file now requires the population of additional fields under "maintenance > uploadpurging." The example should currently look like:
# Filename: s3-config.yml
version: 0.1
log:
level: debug
fields:
service: registry
environment: development
storage:
cache:
layerinfo: inmemory
s3:
accesskey:
secretkey:
region:
bucket:
encrypt: true
secure: true
v4auth: true
chunksize: 5242880
rootdirectory: /s3/object/name/prefix
maintenance:
uploadpurging:
enabled: true
age: 168h
interval: 24h
dryrun: false
readonly:
enabled: false
http:
addr: :5000
secret: asecretforlocaldevelopment
debug:
addr: localhost:5001
[code - evolution] Page 216
The registry:2 configuration file now requires the population of additional fields under "maintenance > uploadpurging." The example should currently look like:
# Filename: rados-config.yml
version: 0.1
log:
level: debug
fields:
service: registry
environment: development
storage:
cache:
layerinfo: inmemory
rados:
poolname: radospool
username: radosuser
chunksize: 4194304
maintenance:
uploadpurging:
enabled: false
age: 168h
interval: 24h
dryrun: false
readonly:
enabled: false
http:
addr: :5000
secret: asecretforlocaldevelopment
debug:
addr: localhost:5001
[code - evolution] Page 218
The registry:2 configuration file now requires the population of additional fields under "maintenance > uploadpurging." The example should currently look like:
# Filename: redis-config.yml
version: 0.1
log:
level: debug
fields:
service: registry
environment: development
http:
addr: :5000
secret: asecretforlocaldevelopment
debug:
addr: localhost:5001
storage:
cache:
blobdescriptor: redis
s3:
accesskey:
secretkey:
region:
bucket:
encrypt: true
secure: true
v4auth: true
chunksize: 5242880
rootdirectory: /s3/object/name/prefix
maintenance:
uploadpurging:
enabled: true
age: 168h
interval: 24h
dryrun: false
readonly:
enabled: false
redis:
addr: redis-host:6379
password: asecret
dialtimeout: 10ms
readtimeout: 10ms
writetimeout: 10ms
pool:
maxidle: 16
maxactive: 64
idletimeout: 300s
[code - typo] Page 220
The name of the file shown should be scalable-config.yml as in previous examples. This example also requires the addition of the newer uploadpurging attributes. The mainenance section of the file should be as follows:
maintenance:
uploadpurging:
enabled: true
age: 168h
interval: 24h
dryrun: false
readonly:
enabled: false
[text - typo] Page 240
In the second paragraph the reader is instructed to, "Open ./coffee/api/api.py" this is not the correct location of the file. The correct file location is at, "./coffee/app/api.py."
[text - typo] Page 262
The refere nce to "flock.json" in the first sentence of the third paragraph should be "flock.yml."
[code - typo] Page 270
The git clone command uses the ssh protocol instead of https. The command should read as follows:
git clone https://github.com/dockerinact ... i.git
jieba.posseg TypeError: cannot unpack non-iterable pair object 词性分析报错
python • 李魔佛 发表了文章 • 0 个评论 • 4119 次浏览 • 2019-11-23 10:12
例子:import jieba.posseg as pseg
seg_list = pseg.cut("我爱北京天安门")
for word,flag in seg_list:
print(word)
print(flag)
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-5-f105f6980f88> in <module>()
1 import jieba.posseg as pseg
2 seg_list = pseg.cut("我爱北京天安门")
----> 3 for word,flag in seg_list:
4 print(word)
5 print(flag)
TypeError: cannot unpack non-iterable pair object原因是新版本中seg_list是一个生成器,所以只能 for win seg_list然后从word中解包出来
print(w.word)
print(w.flag)
这样问题就解决了。 查看全部
例子:
import jieba.posseg as pseg
seg_list = pseg.cut("我爱北京天安门")
for word,flag in seg_list:
print(word)
print(flag)
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-5-f105f6980f88> in <module>()
1 import jieba.posseg as pseg
2 seg_list = pseg.cut("我爱北京天安门")
----> 3 for word,flag in seg_list:
4 print(word)
5 print(flag)
TypeError: cannot unpack non-iterable pair object
原因是新版本中seg_list是一个生成器,所以只能 for win seg_list
然后从word中解包出来
print(w.word)
print(w.flag)
这样问题就解决了。
华为AI云市场的modelarts是要收费的<坑>
闲聊 • 李魔佛 发表了文章 • 0 个评论 • 5015 次浏览 • 2019-11-21 10:57
看了介绍说觉得不错,就注册了,然后还要求实名才能使用。想着华为大公司应该不会拿个人数据到处倒卖吧,就按教程下一步下一步注册了。
拿了几张样本上传到服务器,自己打标注,生成模型。然后测试一下效果,感觉产品做得不错。但整个过程没提示说要收费。
然后晚上就收到几条短信说华为的AI云处于欠费状态,晕,原来是秋后算账呀。
点击查看大图
还以为不部署到服务器的话就不需要缴费呢,怎么随便测试了下就扣费了呢。
查看全部
scrapy在settings中定义变量不能包含小写!
python爬虫 • 李魔佛 发表了文章 • 0 个评论 • 2706 次浏览 • 2019-11-16 16:39
比如定义了一个叫 Redis_host = '192.168.1.1',的值
然后在spider中,如果你调用self.settings.get('Redis_host')
那么返回值是 None。
如果用REDIS_HOST定义,那么就可以正确返回它的值。
如果你一定要用小写,也有其他方法可正常调用。
先导入settings文件
fromt xxxx import setttings # xxx为项目名
host = settings.Redis_host # 直接导入一个文件的形式来调用是可以的 查看全部
比如定义了一个叫 Redis_host = '192.168.1.1',的值
然后在spider中,如果你调用self.settings.get('Redis_host')
那么返回值是 None。
如果用REDIS_HOST定义,那么就可以正确返回它的值。
如果你一定要用小写,也有其他方法可正常调用。
先导入settings文件
fromt xxxx import setttings # xxx为项目名
host = settings.Redis_host # 直接导入一个文件的形式来调用是可以的
etree.strip_tags的用法
python爬虫 • 李魔佛 发表了文章 • 0 个评论 • 3928 次浏览 • 2019-10-24 11:24
它把参数中的标签从源htmlelement中删除,并且把里面的标签文本给合并进来。
举个例子:from lxml.html import etree
from lxml.html import fromstring, HtmlElement
test_html = '''<p><span>hello</span><span>world</span></p>'''
test_element = fromstring(test_html)
etree.strip_tags(test_element,'span') # 清除span标签
etree.tostring(test_element)
因为上述操作直接应用于test_element上的,所以test_element的值已经被修改了。
所以现在test_element 的值是
b'<p>helloworld</p>'
原创文章,转载请注明出处
http://30daydo.com/article/553
查看全部
它把参数中的标签从源htmlelement中删除,并且把里面的标签文本给合并进来。
举个例子:
from lxml.html import etree
from lxml.html import fromstring, HtmlElement
test_html = '''<p><span>hello</span><span>world</span></p>'''
test_element = fromstring(test_html)
etree.strip_tags(test_element,'span') # 清除span标签
etree.tostring(test_element)
因为上述操作直接应用于test_element上的,所以test_element的值已经被修改了。
所以现在test_element 的值是
b'<p>helloworld</p>'
原创文章,转载请注明出处
http://30daydo.com/article/553
pycharm自带的版本控制软件挺好用的
闲聊 • 李魔佛 发表了文章 • 0 个评论 • 2382 次浏览 • 2019-10-24 09:02
pycharm自带的git,svn版本控制工具已经很好用的了,所以以后可以直接不用sourcetree这种专业的GUI管理软件了
android monitor 系统找不到文件 lib\monitor-Could。
Android • 李魔佛 发表了文章 • 0 个评论 • 3320 次浏览 • 2019-10-17 09:53
但我觉得这是一个很好用的监控日志工具。
在sdk的tool目录底下启动
monitor.bat
然后就可以看到报错:
系统找不到文件 lib\monitor-Could
解决办法:
使用android-studio的sdk管理工具下载一个android-19的API,最新的api少了部分jar文件。
查看全部
但我觉得这是一个很好用的监控日志工具。
在sdk的tool目录底下启动
monitor.bat
然后就可以看到报错:
系统找不到文件 lib\monitor-Could
解决办法:
使用android-studio的sdk管理工具下载一个android-19的API,最新的api少了部分jar文件。
mumu模拟器adb无法识别
python爬虫 • 李魔佛 发表了文章 • 0 个评论 • 4970 次浏览 • 2019-10-17 08:41
<Forwarding name="ADB_PORT" proto="1" hostip="127.0.0.1" hostport="7555" guestport="5555"/>
在mumu浏览器里面可以看到这个配置信息。
adb connect 127.0.0.1:7555
然后adb shell 就可以了。
配置文件名是:myandrovm_vbox86.nemu 查看全部
<Forwarding name="ADB_PORT" proto="1" hostip="127.0.0.1" hostport="7555" guestport="5555"/>
在mumu浏览器里面可以看到这个配置信息。
adb connect 127.0.0.1:7555
然后adb shell 就可以了。
配置文件名是:myandrovm_vbox86.nemu
最近是忙的飞起,没有更新文章
闲聊 • 李魔佛 发表了文章 • 0 个评论 • 2152 次浏览 • 2019-09-24 11:12
真是希望时间能够停顿下来让我歇歇。
真是希望时间能够停顿下来让我歇歇。
aiohttp异步下载图片
python爬虫 • 李魔佛 发表了文章 • 0 个评论 • 4498 次浏览 • 2019-09-16 17:14
headers={'User-Agent':'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'}
async def getPage(num):
async with aiohttp.ClientSession() as session:
async with session.get(url.format(num),headers=headers) as resp:
if resp.status==200:
f= await aiofiles.open('{}.jpg'.format(num),mode='wb')
await f.write(await resp.read())
await f.close()
loop = asyncio.get_event_loop()
tasks = [getPage(i) for i in range(5)]
loop.run_until_complete(asyncio.wait(tasks))
原创文章,
转载请注明出处:
http://30daydo.com/article/537
查看全部
url = 'http://xyhz.huizhou.gov.cn/static/js/common/jigsaw/images/{}.jpg'
headers={'User-Agent':'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'}
async def getPage(num):
async with aiohttp.ClientSession() as session:
async with session.get(url.format(num),headers=headers) as resp:
if resp.status==200:
f= await aiofiles.open('{}.jpg'.format(num),mode='wb')
await f.write(await resp.read())
await f.close()
loop = asyncio.get_event_loop()
tasks = [getPage(i) for i in range(5)]
loop.run_until_complete(asyncio.wait(tasks))
原创文章,
转载请注明出处:
http://30daydo.com/article/537
基于文本及符号密度的网页正文提取方法 python实现
python • 李魔佛 发表了文章 • 0 个评论 • 4697 次浏览 • 2019-09-10 15:19
项目路径https://github.com/Rockyzsu/CodePool/tree/master/GeneralNewsExtractor
完成后在本文详细介绍,
请密切关注。 查看全部
项目路径https://github.com/Rockyzsu/CodePool/tree/master/GeneralNewsExtractor
完成后在本文详细介绍,
请密切关注。
根据东财股吧爬虫数据进行自然语言分析,展示股市热度
股票 • 李魔佛 发表了文章 • 0 个评论 • 5339 次浏览 • 2019-09-10 09:27
项目开展中.....
https://github.com/Rockyzsu/StockPredict
完工后会把代码搬上来并加注释。
### 2019-11-17 更新 ######
股市舆情情感分类可视化系统
此Web基于Django+Bootstrap+Echarts等框架,个股交易行情数据调用了Tushare接口。对于舆情文本数据采取先爬取东方财富网股吧论坛标题词语设置机器学习训练集,在此基础上运用scikit-learn机器学习朴素贝叶斯方法构建文本分类器。通过Django Web框架,将所得数据传递到前端经过Bootstrap渲染过的html,对数据使用Echarts进行图表可视化处理
不足之处或交流学习欢迎通过邮箱联系我
目前的功能:
个股历史交易行情
个股相关词云展示
情感字典舆情预测
朴素贝叶斯舆情预测
Quick Start
在项目当前目录下: $ python manage.py runserver
浏览器打开127.0.0.1:8000
查看全部
项目开展中.....
https://github.com/Rockyzsu/StockPredict
完工后会把代码搬上来并加注释。
### 2019-11-17 更新 ######
股市舆情情感分类可视化系统
此Web基于Django+Bootstrap+Echarts等框架,个股交易行情数据调用了Tushare接口。对于舆情文本数据采取先爬取东方财富网股吧论坛标题词语设置机器学习训练集,在此基础上运用scikit-learn机器学习朴素贝叶斯方法构建文本分类器。通过Django Web框架,将所得数据传递到前端经过Bootstrap渲染过的html,对数据使用Echarts进行图表可视化处理
不足之处或交流学习欢迎通过邮箱联系我
目前的功能:
个股历史交易行情
个股相关词云展示
情感字典舆情预测
朴素贝叶斯舆情预测
Quick Start
在项目当前目录下: $ python manage.py runserver
浏览器打开127.0.0.1:8000
子弹短信 --已经下架了
闲聊 • 李魔佛 发表了文章 • 0 个评论 • 1899 次浏览 • 2019-09-08 09:25
轻报APP --骗子请注意点
闲聊 • 李魔佛 发表了文章 • 0 个评论 • 2916 次浏览 • 2019-09-08 08:52
第一次见这种公司。
事情缘由:
该公司在拉勾上以招聘兼职为由,加你微信,然后借口说已测试一下应聘者的的水平,要求对方写一个爬取一个他们想要爬的网站,而且是用一个第三方的网站 神箭手 的平台代码来写的。 意味着,他可以拿着你的代码直接在上面运行,爬取他们想要的数据。
因为他们要的网站我曾经爬过,我直接把数据接了图给他们。 他们就急着要我用神箭手重新写一次。这时我就妥妥地确定他们就是想要空手白狼的人。 然后就拉黑了哦。
注意,那个负责人叫王锦锋。 查看全部
第一次见这种公司。
事情缘由:
该公司在拉勾上以招聘兼职为由,加你微信,然后借口说已测试一下应聘者的的水平,要求对方写一个爬取一个他们想要爬的网站,而且是用一个第三方的网站 神箭手 的平台代码来写的。 意味着,他可以拿着你的代码直接在上面运行,爬取他们想要的数据。
因为他们要的网站我曾经爬过,我直接把数据接了图给他们。 他们就急着要我用神箭手重新写一次。这时我就妥妥地确定他们就是想要空手白狼的人。 然后就拉黑了哦。
注意,那个负责人叫王锦锋。
性能对比 pypy vs python
python • 李魔佛 发表了文章 • 0 个评论 • 4753 次浏览 • 2019-09-06 17:04
不试不知道,一试吓一跳。
如果是CPU密集型的程序,pypy3的执行速度比python要快上一百倍。
talk is cheap, show me the code!
代码很简单,运行加法运算:
执行2千万次
import time
LOOP = 2*10**8
def add(x,y):
return x+y
def cpu_pressure(loop):
for i in range(loop):
result = add(i,i+1)
if __name__ == '__main__':
start = time.time()
cpu_pressure(LOOP)
print(f'time used {time.time()-start}s')
python执行:
python main.py
返回用时:time used 21.422261476516724s
pypy执行:
pypy main.py
返回用时:time used 0.1925642490386963s
差距真的很大。 查看全部
不试不知道,一试吓一跳。
如果是CPU密集型的程序,pypy3的执行速度比python要快上一百倍。
talk is cheap, show me the code!
代码很简单,运行加法运算:
执行2千万次
import time
LOOP = 2*10**8
def add(x,y):
return x+y
def cpu_pressure(loop):
for i in range(loop):
result = add(i,i+1)
if __name__ == '__main__':
start = time.time()
cpu_pressure(LOOP)
print(f'time used {time.time()-start}s')
python执行:
python main.py
返回用时:time used 21.422261476516724s
pypy执行:
pypy main.py
返回用时:time used 0.1925642490386963s
差距真的很大。
scrapy源码分析<一>:入口函数以及是如何运行
python爬虫 • 李魔佛 发表了文章 • 0 个评论 • 5566 次浏览 • 2019-08-31 10:47
下面我们从源码分析一下scrapy执行的流程:
执行scrapy crawl 命令时,调用的是Command类class Command(ScrapyCommand):
requires_project = True
def syntax(self):
return '[options]'
def short_desc(self):
return 'Runs all of the spiders - My Defined'
def run(self,args,opts):
print('==================')
print(type(self.crawler_process))
spider_list = self.crawler_process.spiders.list() # 找到爬虫类
for name in spider_list:
print('=================')
print(name)
self.crawler_process.crawl(name,**opts.__dict__)
self.crawler_process.start()
然后我们去看看crawler_process,这个是来自ScrapyCommand,而ScrapyCommand又是CrawlerProcess的子类,而CrawlerProcess又是CrawlerRunner的子类
在CrawlerRunner构造函数里面主要作用就是这个 def __init__(self, settings=None):
if isinstance(settings, dict) or settings is None:
settings = Settings(settings)
self.settings = settings
self.spider_loader = _get_spider_loader(settings) # 构造爬虫
self._crawlers = set()
self._active = set()
self.bootstrap_failed = False
1. 加载配置文件def _get_spider_loader(settings):
cls_path = settings.get('SPIDER_LOADER_CLASS')
# settings文件没有定义SPIDER_LOADER_CLASS,所以这里获取到的是系统的默认配置文件,
# 默认配置文件在接下来的代码块A
# SPIDER_LOADER_CLASS = 'scrapy.spiderloader.SpiderLoader'
loader_cls = load_object(cls_path)
# 这个函数就是根据路径转为类对象,也就是上面crapy.spiderloader.SpiderLoader 这个
# 字符串变成一个类对象
# 具体的load_object 对象代码见下面代码块B
return loader_cls.from_settings(settings.frozencopy())
默认配置文件defautl_settting.py# 代码块A
#......省略若干
SCHEDULER = 'scrapy.core.scheduler.Scheduler'
SCHEDULER_DISK_QUEUE = 'scrapy.squeues.PickleLifoDiskQueue'
SCHEDULER_MEMORY_QUEUE = 'scrapy.squeues.LifoMemoryQueue'
SCHEDULER_PRIORITY_QUEUE = 'scrapy.pqueues.ScrapyPriorityQueue'
SPIDER_LOADER_CLASS = 'scrapy.spiderloader.SpiderLoader' 就是这个值
SPIDER_LOADER_WARN_ONLY = False
SPIDER_MIDDLEWARES = {}
load_object的实现# 代码块B 为了方便,我把异常处理的去除
from importlib import import_module #导入第三方库
def load_object(path):
dot = path.rindex('.')
module, name = path[:dot], path[dot+1:]
# 上面把路径分为基本路径+模块名
mod = import_module(module)
obj = getattr(mod, name)
# 获取模块里面那个值
return obj
测试代码:In [33]: mod = import_module(module)
In [34]: mod
Out[34]: <module 'scrapy.spiderloader' from '/home/xda/anaconda3/lib/python3.7/site-packages/scrapy/spiderloader.py'>
In [35]: getattr(mod,name)
Out[35]: scrapy.spiderloader.SpiderLoader
In [36]: obj = getattr(mod,name)
In [37]: obj
Out[37]: scrapy.spiderloader.SpiderLoader
In [38]: type(obj)
Out[38]: type
在代码块A中,loader_cls是SpiderLoader,最后返回的的是SpiderLoader.from_settings(settings.frozencopy())
接下来看看SpiderLoader.from_settings, def from_settings(cls, settings):
return cls(settings)
返回类对象自己,所以直接看__init__函数即可class SpiderLoader(object):
"""
SpiderLoader is a class which locates and loads spiders
in a Scrapy project.
"""
def __init__(self, settings):
self.spider_modules = settings.getlist('SPIDER_MODULES')
# 获得settting中的模块名字,创建scrapy的时候就默认帮你生成了
# 你可以看看你的settings文件里面的内容就可以找到这个值,是一个list
self.warn_only = settings.getbool('SPIDER_LOADER_WARN_ONLY')
self._spiders = {}
self._found = defaultdict(list)
self._load_all_spiders() # 加载所有爬虫
核心就是这个_load_all_spiders:
走起:def _load_all_spiders(self):
for name in self.spider_modules:
for module in walk_modules(name): # 这个遍历文件夹里面的文件,然后再转化为类对象,
# 保存到字典:self._spiders = {}
self._load_spiders(module) # 模块变成spider
self._check_name_duplicates() # 去重,如果名字一样就异常
接下来看看_load_spiders
核心就是下面的。def iter_spider_classes(module):
from scrapy.spiders import Spider
for obj in six.itervalues(vars(module)): # 找到模块里面的变量,然后迭代出来
if inspect.isclass(obj) and \
issubclass(obj, Spider) and \
obj.__module__ == module.__name__ and \
getattr(obj, 'name', None): # 有name属性,继承于Spider
yield obj
这个obj就是我们平时写的spider类了。
原来分析了这么多,才找到了我们平时写的爬虫类
待续。。。。
原创文章
转载请注明出处
http://30daydo.com/article/530
查看全部
下面我们从源码分析一下scrapy执行的流程:
执行scrapy crawl 命令时,调用的是Command类
class Command(ScrapyCommand):
requires_project = True
def syntax(self):
return '[options]'
def short_desc(self):
return 'Runs all of the spiders - My Defined'
def run(self,args,opts):
print('==================')
print(type(self.crawler_process))
spider_list = self.crawler_process.spiders.list() # 找到爬虫类
for name in spider_list:
print('=================')
print(name)
self.crawler_process.crawl(name,**opts.__dict__)
self.crawler_process.start()
然后我们去看看crawler_process,这个是来自ScrapyCommand,而ScrapyCommand又是CrawlerProcess的子类,而CrawlerProcess又是CrawlerRunner的子类
在CrawlerRunner构造函数里面主要作用就是这个
def __init__(self, settings=None):
if isinstance(settings, dict) or settings is None:
settings = Settings(settings)
self.settings = settings
self.spider_loader = _get_spider_loader(settings) # 构造爬虫
self._crawlers = set()
self._active = set()
self.bootstrap_failed = False
1. 加载配置文件
def _get_spider_loader(settings):
cls_path = settings.get('SPIDER_LOADER_CLASS')
# settings文件没有定义SPIDER_LOADER_CLASS,所以这里获取到的是系统的默认配置文件,
# 默认配置文件在接下来的代码块A
# SPIDER_LOADER_CLASS = 'scrapy.spiderloader.SpiderLoader'
loader_cls = load_object(cls_path)
# 这个函数就是根据路径转为类对象,也就是上面crapy.spiderloader.SpiderLoader 这个
# 字符串变成一个类对象
# 具体的load_object 对象代码见下面代码块B
return loader_cls.from_settings(settings.frozencopy())
默认配置文件defautl_settting.py
# 代码块A
#......省略若干
SCHEDULER = 'scrapy.core.scheduler.Scheduler'
SCHEDULER_DISK_QUEUE = 'scrapy.squeues.PickleLifoDiskQueue'
SCHEDULER_MEMORY_QUEUE = 'scrapy.squeues.LifoMemoryQueue'
SCHEDULER_PRIORITY_QUEUE = 'scrapy.pqueues.ScrapyPriorityQueue'
SPIDER_LOADER_CLASS = 'scrapy.spiderloader.SpiderLoader' 就是这个值
SPIDER_LOADER_WARN_ONLY = False
SPIDER_MIDDLEWARES = {}
load_object的实现
# 代码块B 为了方便,我把异常处理的去除
from importlib import import_module #导入第三方库
def load_object(path):
dot = path.rindex('.')
module, name = path[:dot], path[dot+1:]
# 上面把路径分为基本路径+模块名
mod = import_module(module)
obj = getattr(mod, name)
# 获取模块里面那个值
return obj
测试代码:
In [33]: mod = import_module(module)
In [34]: mod
Out[34]: <module 'scrapy.spiderloader' from '/home/xda/anaconda3/lib/python3.7/site-packages/scrapy/spiderloader.py'>
In [35]: getattr(mod,name)
Out[35]: scrapy.spiderloader.SpiderLoader
In [36]: obj = getattr(mod,name)
In [37]: obj
Out[37]: scrapy.spiderloader.SpiderLoader
In [38]: type(obj)
Out[38]: type
在代码块A中,loader_cls是SpiderLoader,最后返回的的是SpiderLoader.from_settings(settings.frozencopy())
接下来看看SpiderLoader.from_settings,
def from_settings(cls, settings):
return cls(settings)
返回类对象自己,所以直接看__init__函数即可
class SpiderLoader(object):
"""
SpiderLoader is a class which locates and loads spiders
in a Scrapy project.
"""
def __init__(self, settings):
self.spider_modules = settings.getlist('SPIDER_MODULES')
# 获得settting中的模块名字,创建scrapy的时候就默认帮你生成了
# 你可以看看你的settings文件里面的内容就可以找到这个值,是一个list
self.warn_only = settings.getbool('SPIDER_LOADER_WARN_ONLY')
self._spiders = {}
self._found = defaultdict(list)
self._load_all_spiders() # 加载所有爬虫
核心就是这个_load_all_spiders:
走起:
def _load_all_spiders(self):
for name in self.spider_modules:
for module in walk_modules(name): # 这个遍历文件夹里面的文件,然后再转化为类对象,
# 保存到字典:self._spiders = {}
self._load_spiders(module) # 模块变成spider
self._check_name_duplicates() # 去重,如果名字一样就异常
接下来看看_load_spiders
核心就是下面的。
def iter_spider_classes(module):
from scrapy.spiders import Spider
for obj in six.itervalues(vars(module)): # 找到模块里面的变量,然后迭代出来
if inspect.isclass(obj) and \
issubclass(obj, Spider) and \
obj.__module__ == module.__name__ and \
getattr(obj, 'name', None): # 有name属性,继承于Spider
yield obj
这个obj就是我们平时写的spider类了。
原来分析了这么多,才找到了我们平时写的爬虫类
待续。。。。
原创文章
转载请注明出处
http://30daydo.com/article/530