nunpy中的std标准差是样本差吗
# 测试一下那个方差X.mean()
x=[1,2,3,4,5,6,7,8,9,10]
X = np.array(x)
5.5
X.std() # 标准差
2.8722813232690143
手工计算一下:
def my_fangca(X):
l=len(X)
mean=X.mean()
sum_ = 0
sum_std=0
for i in X:
sum_+=(i-mean)**2
var_=sum_/l
std_=(sum_/(l))**0.5
return var_,std_
result = my_fangca(X)
得到的result
(8.25, 2.8722813232690143)
说明numpy的std是标准差,不是样本差 收起阅读 »
喜马拉雅app 爬取音频文件
因为喜马拉雅的源码格式改了,所以爬虫代码也更新了一波
# -*- coding: utf-8 -*-
# website: http://30daydo.com
# @Time : 2019/6/30 12:03
# @File : main.py
import requests
import re
import os
url = 'http://180.153.255.6/mobile/v1/album/track/ts-1571294887744?albumId=23057324&device=android&isAsc=true&isQueryInvitationBrand=true&pageId={}&pageSize=20&pre_page=0'
headers = {'User-Agent': 'Xiaomi'}
def download():
for i in range(1, 3):
r = requests.get(url=url.format(i), headers=headers)
js_data = r.json()
data_list = js_data.get('data', {}).get('list', [])
for item in data_list:
trackName = item.get('title')
trackName = re.sub('[\/\\\:\*\?\"\<\>\|]', '_', trackName)
# trackName=re.sub(':','',trackName)
src_url = item.get('playUrl64')
filename = '{}.mp3'.format(trackName)
if not os.path.exists(filename):
try:
r0 = requests.get(src_url, headers=headers)
except Exception as e:
print(e)
print(trackName)
r0 = requests.get(src_url, headers=headers)
else:
with open(filename, 'wb') as f:
f.write(r0.content)
print('{} downloaded'.format(trackName))
else:
print(f'{filename}已经下载过了')
import shutil
def rename_():
for i in range(1, 3):
r = requests.get(url=url.format(i), headers=headers)
js_data = r.json()
data_list = js_data.get('data', {}).get('list', [])
for item in data_list:
trackName = item.get('title')
trackName = re.sub('[\/\\\:\*\?\"\<\>\|]', '_', trackName)
src_url = item.get('playUrl64')
orderNo=item.get('orderNo')
filename = '{}.mp3'.format(trackName)
try:
if os.path.exists(filename):
new_file='{}_{}.mp3'.format(orderNo,trackName)
shutil.move(filename,new_file)
except Exception as e:
print(e)
if __name__=='__main__':
rename_()
音频文件也更新了,详情见百度网盘。
======== 2018-10=============
爬取喜马拉雅app上 杨继东的投资之道 的音频文件
运行环境:python3
# -*- coding: utf-8 -*-
# website: http://30daydo.com
# @Time : 2019/6/30 12:03
# @File : main.py
import requests
import re
url = 'https://www.ximalaya.com/revision/play/album?albumId=23057324&pageNum=1&sort=1&pageSize=60'
headers={'User-Agent':'Xiaomi'}
r = requests.get(url=url,headers=headers)
js_data = r.json()
data_list = js_data.get('data',{}).get('tracksAudioPlay',)
for item in data_list:
trackName=item.get('trackName')
trackName=re.sub(':','',trackName)
src_url = item.get('src')
try:
r0=requests.get(src_url,headers=headers)
except Exception as e:
print(e)
print(trackName)
else:
with open('{}.m4a'.format(trackName),'wb') as f:
f.write(r0.content)
print('{} downloaded'.format(trackName))
保存为main.py
然后运行 python main.py
稍微等几分钟就自动下载好了。
附下载好的音频文件:
链接:https://pan.baidu.com/s/1t_vJhTvSJSeFdI1IaDS6fA
提取码:e3zb
原创文章
转载请注明出处
http://30daydo.com/article/503 收起阅读 »
python3与python2迭代器的写法的区别
附一个例子:
def iter_demo():收起阅读 »
class DefineIter(object):
def __init__(self,length):
self.length = length
self.data = range(self.length)
self.index=0
def __iter__(self):
return self
def __next__(self):
if self.index >=self.length:
# return None
raise StopIteration
d = self.data[self.index]*50
self.index =self.index + 1
return d
a = DefineIter(10)
print(type(a))
for i in a:
print(i)
PyCharm 快捷键快速插入当前时间
方式
通过 Live Template 快速添加时间
步骤
1、添加一个 Template Group 命名为 Common
2、添加一个 Live Template 设置如下
Abbreviation: time
Description : current time
Template Text: $time$
Edit Variables -> Expresssion : date("yyyy-MM-dd HH:mm:ss")
3、让设置生效
Define->Everywhere
4、使用
输入 time 后 按下tab键 就能转换为当前时间了
收起阅读 »
深圳转债转股后不可以撤单
说明转股后也是不能撤单的。
修改easytrader国金证券的默认启动路径
pywinauto.application.AppStartError: Could not create the process "C:\全能行证券交易终端\xiadan.exe"
Error returned by CreateProcess: (2, 'CreateProcess', '系统找不到指定的文件。')
看了配置文件,也是没有具体的参数可以修改,只好修改源代码。
别听到改源代码就害怕,只是需要改一行就可以了。
找到文件:
site-package\easytrader\config\client.py
找过这一行:
class GJ(CommonConfig):只要修改上面的路径就可以了。注意用双斜杠。
DEFAULT_EXE_PATH = "C:\\Tool\\xiadan.exe"
收起阅读 »
使用pymongo中的find_one_and_update出错:需要分片键
File "C:\ProgramData\Anaconda3\lib\site-packages\pymongo\helpers.py", line 155, in _check_command_response
raise OperationFailure(msg % errmsg, code, response)
pymongo.errors.OperationFailure: Query for sharded findAndModify must contain the shard key
2019-06-10 16:14:32 [scrapy.core.engine] INFO: Closing spider (finished)
2019-06-10 16:14:32 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
需要在查询语句中把分片键也添加进去。
因为findOneModify只会找一个记录,但是到底在哪个分片的记录呢? 因为不确定,所以才需要把shard加上去。
参考官方:
Targeted Operations vs. Broadcast Operations收起阅读 »
Generally, the fastest queries in a sharded environment are those that mongos route to a single shard, using the shard key and the cluster meta data from the config server. These targeted operations use the shard key value to locate the shard or subset of shards that satisfy the query document.
For queries that don’t include the shard key, mongos must query all shards, wait for their responses and then return the result to the application. These “scatter/gather” queries can be long running operations.
Broadcast Operations
mongos instances broadcast queries to all shards for the collection unless the mongos can determine which shard or subset of shards stores this data.
After the mongos receives responses from all shards, it merges the data and returns the result document. The performance of a broadcast operation depends on the overall load of the cluster, as well as variables like network latency, individual shard load, and number of documents returned per shard. Whenever possible, favor operations that result in targeted operation over those that result in a broadcast operation.
Multi-update operations are always broadcast operations.
The updateMany() and deleteMany() methods are broadcast operations, unless the query document specifies the shard key in full.
Targeted Operations
mongos can route queries that include the shard key or the prefix of a compound shard key a specific shard or set of shards. mongos uses the shard key value to locate the chunk whose range includes the shard key value and directs the query at the shard containing that chunk.
For example, if the shard key is:
copy
{ a: 1, b: 1, c: 1 }
The mongos program can route queries that include the full shard key or either of the following shard key prefixes at a specific shard or set of shards:
copy
{ a: 1 }
{ a: 1, b: 1 }
All insertOne() operations target to one shard. Each document in the insertMany() array targets to a single shard, but there is no guarantee all documents in the array insert into a single shard.
All updateOne(), replaceOne() and deleteOne() operations must include the shard key or _id in the query document. MongoDB returns an error if these methods are used without the shard key or _id.
Depending on the distribution of data in the cluster and the selectivity of the query, mongos may still perform a broadcast operation to fulfill these queries.
Index Use
If the query does not include the shard key, the mongos must send the query to all shards as a “scatter/gather” operation. Each shard will, in turn, use either the shard key index or another more efficient index to fulfill the query.
If the query includes multiple sub-expressions that reference the fields indexed by the shard key and the secondary index, the mongos can route the queries to a specific shard and the shard will use the index that will allow it to fulfill most efficiently.
Sharded Cluster Security
Use Internal Authentication to enforce intra-cluster security and prevent unauthorized cluster components from accessing the cluster. You must start each mongod or mongos in the cluster with the appropriate security settings in order to enforce internal authentication.
See Deploy Sharded Cluster with Keyfile Access Control for a tutorial on deploying a secured shardedcluster.
Cluster Users
Sharded clusters support Role-Based Access Control (RBAC) for restricting unauthorized access to cluster data and operations. You must start each mongod in the cluster, including the config servers, with the --auth option in order to enforce RBAC. Alternatively, enforcing Internal Authentication for inter-cluster security also enables user access controls via RBAC.
With RBAC enforced, clients must specify a --username, --password, and --authenticationDatabase when connecting to the mongos in order to access cluster resources.
Each cluster has its own cluster users. These users cannot be used to access individual shards.
See Enable Access Control for a tutorial on enabling adding users to an RBAC-enabled MongoDB deployment.
jupyter notebook格式的文件损坏如何修复
使用下面的代码:
# 拯救损坏的jupyter 文件只要把你要修复的文件替换一下就可以了。 收起阅读 »
import re
import codecs
pattern = re.compile('"source": \[(.*?)\]\s+\},',re.S)
filename = 'tushare_usage.ipynb'
with codecs.open(filename,encoding='utf8') as f:
content = f.read()
source = pattern.findall(content)
for s in source:
t=s.replace('\\n','')
t=re.sub('"','',t)
t=re.sub('(,$)','',t)
print(t)
python连接mongodb集群 cluster
连接方法如下:
import pymongo上面默认的端口do都是27017,如果是其他端口,需要这样修改:
db = pymongo.MongoClient('mongodb://10.18.6.46,10.18.6.26,10.18.6.102')
db = pymongo.MongoClient('mongodb://10.18.6.46:8888,10.18.6.26:9999,10.18.6.102:7777')
然后就可以正常读写数据库:
读:
coll=db['testdb']['testcollection'].find()输出内容:
for i in coll:
print(i)
{'_id': ObjectId('5cf4c7981ee9edff72e5c503'), 'username': 'hello'}
{'_id': ObjectId('5cf4c7991ee9edff72e5c504'), 'username': 'hello'}
{'_id': ObjectId('5cf4c7991ee9edff72e5c505'), 'username': 'hello'}
{'_id': ObjectId('5cf4c79a1ee9edff72e5c506'), 'username': 'hello'}
{'_id': ObjectId('5cf4c7b21ee9edff72e5c507'), 'username': 'hello world'}
写:
collection = db['testdb']['testcollection']
for i in range(10):
collection.insert({'username':'huston{}'.format(i)})
原创文章,转载请注明出处:
http://30daydo.com/article/494
收起阅读 »
ElasticSearch查看已经存在的文档保存在哪个分片
比如我有以下的文档:
{
"took" : 3,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : 5,
"max_score" : 1.0,
"hits" : [
{
"_index" : "test",
"_type" : "mydoc",
"_id" : "XxyrM2kBVzdNcvl_GHv2",
"_score" : 1.0,
"_source" : {
"name" : "Shiled",
"twitter" : "Sonny sql is awesome",
"date" : "2018-12-27",
"id" : 1240,
"tags" : [
"red",
"shit"
]
}
},
{
"_index" : "test",
"_type" : "mydoc",
"_id" : "YByrM2kBVzdNcvl_tnvm",
"_score" : 1.0,
"_source" : {
"name" : "YYerk",
"twitter" : "sql is awesome",
"date" : "2008-12-27",
"id" : 12357,
"tags" : [
"red"
]
}
},
{
"_index" : "test",
"_type" : "mydoc",
"_id" : "7777",
"_score" : 1.0,
"_source" : {
"name" : "Rocky Chen",
"twitter" : "sql is awesome",
"date" : "2008-12-27",
"id" : 9999
}
},
{
"_index" : "test",
"_type" : "mydoc",
"_id" : "YhzDN2kBVzdNcvl_enuT",
"_score" : 1.0,
"_source" : {
"name" : "YYerk",
"twitter" : "sql is awesome",
"date" : "2008-12-27",
"id" : 888888,
"tags" : [
"red",
"green"
]
}
},
{
"_index" : "test",
"_type" : "mydoc",
"_id" : "YxzDN2kBVzdNcvl_u3th",
"_score" : 1.0,
"_source" : {
"name" : "YYerk",
"twitter" : "sql is awesome",
"date" : "2008-12-27",
"id" : 888888,
"tags" : [
"red",
"green"
]
}
}
]
}
}
如果我想看看id是 "_id" : "YxzDN2kBVzdNcvl_u3th",
这个文档是保存在哪个分片,如何查看?
引用:
路由一个文档到一个分片中编辑
当索引一个文档的时候,文档会被存储到一个主分片中。 Elasticsearch 如何知道一个文档应该存放到哪个分片中呢?当我们创建文档时,它如何决定这个文档应当被存储在分片 1 还是分片 2 中呢?
首先这肯定不会是随机的,否则将来要获取文档的时候我们就不知道从何处寻找了。实际上,这个过程是根据下面这个公式决定的:
shard = hash(routing) % number_of_primary_shards
routing 是一个可变值,默认是文档的 _id ,也可以设置成一个自定义的值。 routing 通过 hash 函数生成一个数字,然后这个数字再除以 number_of_primary_shards (主分片的数量)后得到 余数 。这个分布在 0 到 number_of_primary_shards-1 之间的余数,就是我们所寻求的文档所在分片的位置。
这就解释了为什么我们要在创建索引的时候就确定好主分片的数量 并且永远不会改变这个数量:因为如果数量变化了,那么所有之前路由的值都会无效,文档也再也找不到了。
那么可以使用
GET test/_search_shards?routing=ID号 来查看你要查询的id所在的分片
得到的结果:
{
"nodes" : {
"yl-qYmh1SXqzJsfI4d1ddw" : {
"name" : "node-3",
"ephemeral_id" : "UsJ9rFELTiCW07oHE9YMdg",
"transport_address" : "10.18.6.26:9300",
"attributes" : {
"ml.machine_memory" : "6088101888",
"rack" : "r1",
"ml.max_open_jobs" : "20",
"xpack.installed" : "true",
"ml.enabled" : "true"
}
},
"wT7wUd3iTkujYUsbVNVv-w" : {
"name" : "node-1",
"ephemeral_id" : "fP-vgSb0SdemnHDyaJUsWw",
"transport_address" : "10.18.6.102:9300",
"attributes" : {
"ml.machine_memory" : "8256720896",
"rack" : "r1",
"xpack.installed" : "true",
"ml.max_open_jobs" : "20",
"ml.enabled" : "true"
}
}
},
"indices" : {
"test" : { }
},
"shards" : [
[
{
"state" : "STARTED",
"primary" : true,
"node" : "wT7wUd3iTkujYUsbVNVv-w",
"relocating_node" : null,
"shard" : 1,
"index" : "test",
"allocation_id" : {
"id" : "k-8E4dL7QmGgwcsNsUCP6Q"
}
},
{
"state" : "STARTED",
"primary" : false,
"node" : "yl-qYmh1SXqzJsfI4d1ddw",
"relocating_node" : null,
"shard" : 1,
"index" : "test",
"allocation_id" : {
"id" : "lvOpQIKgRUibkulr3nRfEw"
}
}
]
]
}
只需要关注shards字段就可以,从上面可以看到,该文档存在shard 1 分片上。 分别在node1和node3节点,一个是主分片,一个是副本分片 收起阅读 »
elasticsearch在match查询里面使用了type字段 报错
POST get-together/_search
{
"query":
{
"match": {
"name": {
"type":"phrase",
"query":"enterprise london",
"slop":1
}}
},
"_source": "name"
}
报错:
{
"error": {
"root_cause": [
{
"type": "parsing_exception",
"reason": "[match] query does not support [type]",
"line": 6,
"col": 13
}
],
"type": "parsing_exception",
"reason": "[match] query does not support [type]",
"line": 6,
"col": 13
},
"status": 400
}
在6.x已经不支持在math里面使用type,
可以修改为以下语法:
POST get-together/_search
{
"query":
{
"match_phrase": {
"name": {
"query":"enterprise london",
"slop":1
}}
},
"_source": "name"
}
得到的效果是一致的:
{收起阅读 »
"took" : 2,
"timed_out" : false,
"_shards" : {
"total" : 2,
"successful" : 2,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : 1,
"max_score" : 1.3243701,
"hits" : [
{
"_index" : "get-together",
"_type" : "_doc",
"_id" : "5",
"_score" : 1.3243701,
"_source" : {
"name" : "Enterprise search London get-together"
}
}
]
}
}
elasticsearch 更新文档的坑
POST cnbeta/doc/cUxO42oB9O-zF2ru-rs-/_update
{
"doc":{
"title":"中国操作系统"
}
}
那个body里面的”doc" 不能少
不然会报错:
{
"error": {
"root_cause": [
{
"type": "action_request_validation_exception",
"reason": "Validation Failed: 1: script or doc is missing;"
}
],
"type": "action_request_validation_exception",
"reason": "Validation Failed: 1: script or doc is missing;"
},
"status": 400
} 收起阅读 »
python的mixin类
maxin类似多重继承的一种限制形式:
关于Python的Mixin模式
像C或C++这类语言都支持多重继承,一个子类可以有多个父类,这样的设计常被人诟病。因为继承应该是个”is-a”关系。比如轿车类继承交通工具类,因为轿车是一个(“is-a”)交通工具。一个物品不可能是多种不同的东西,因此就不应该存在多重继承。不过有没有这种情况,一个类的确是需要继承多个类呢?
答案是有,我们还是拿交通工具来举例子,民航飞机是一种交通工具,对于土豪们来说直升机也是一种交通工具。对于这两种交通工具,它们都有一个功能是飞行,但是轿车没有。所以,我们不可能将飞行功能写在交通工具这个父类中。但是如果民航飞机和直升机都各自写自己的飞行方法,又违背了代码尽可能重用的原则(如果以后飞行工具越来越多,那会出现许多重复代码)。怎么办,那就只好让这两种飞机同时继承交通工具以及飞行器两个父类,这样就出现了多重继承。这时又违背了继承必须是”is-a”关系。这个难题该怎么破?
不同的语言给出了不同的方法,让我们先来看下Java。Java提供了接口interface功能,来实现多重继承:
public abstract class Vehicle {
}
public interface Flyable {
public void fly();
}
public class FlyableImpl implements Flyable {
public void fly() {
System.out.println("I am flying");
}
}
public class Airplane extends Vehicle implements Flyable {
private flyable;
public Airplane() {
flyable = new FlyableImpl();
}
public void fly() {
flyable.fly();
}
}
现在我们的飞机同时具有了交通工具及飞行器两种属性,而且我们不需要重写飞行器中的飞行方法,同时我们没有破坏单一继承的原则。飞机就是一种交通工具,可飞行的能力是是飞机的属性,通过继承接口来获取。
回到主题,Python语言可没有接口功能,但是它可以多重继承。那Python是不是就该用多重继承来实现呢?是,也不是。说是,因为从语法上看,的确是通过多重继承实现的。说不是,因为它的继承依然遵守”is-a”关系,从含义上看依然遵循单继承的原则。这个怎么理解呢?我们还是看例子吧。
class Vehicle(object):
pass
class PlaneMixin(object):
def fly(self):
print 'I am flying'
class Airplane(Vehicle, PlaneMixin):
pass
可以看到,上面的Airplane类实现了多继承,不过它继承的第二个类我们起名为PlaneMixin,而不是Plane,这个并不影响功能,但是会告诉后来读代码的人,这个类是一个Mixin类。所以从含义上理解,Airplane只是一个Vehicle,不是一个Plane。这个Mixin,表示混入(mix-in),它告诉别人,这个类是作为功能添加到子类中,而不是作为父类,它的作用同Java中的接口。
使用Mixin类实现多重继承要非常小心
- 首先它必须表示某一种功能,而不是某个物品,如同Java中的Runnable,Callable等
- 其次它必须责任单一,如果有多个功能,那就写多个Mixin类
- 然后,它不依赖于子类的实现
- 最后,子类即便没有继承这个Mixin类,也照样可以工作,就是缺少了某个功能。(比如飞机照样可以载客,就是不能飞了^_^)
原创文章,转载请注明出处
http://30daydo.com/article/480
收起阅读 »
截止今天(2019-05-14)银行股今年的涨幅排名
ticker secShortName secFullName y_chgPct收起阅读 »
31 601288 农业银行 中国农业银行股份有限公司 -0.341178
11 600015 华夏银行 华夏银行股份有限公司 1.856174
45 601988 中国银行 中国银行股份有限公司 2.248533
32 601328 交通银行 交通银行股份有限公司 3.532657
30 601229 上海银行 上海银行股份有限公司 3.725781
35 601398 工商银行 中国工商银行股份有限公司 4.771403
40 601818 光大银行 中国光大银行股份有限公司 5.643119
27 601169 北京银行 北京银行股份有限公司 6.205580
14 600016 民生银行 中国民生银行股份有限公司 7.092815
5 002936 郑州银行 郑州银行股份有限公司 7.551112
49 601998 中信银行 中信银行股份有限公司 8.181956
43 601939 建设银行 中国建设银行股份有限公司 9.402651
41 601838 成都银行 成都银行股份有限公司 9.424554
52 603323 苏农银行 江苏苏州农村商业银行股份有限公司 12.375732
20 600926 杭州银行 杭州银行股份有限公司 12.933645
8 600000 浦发银行 上海浦东发展银行股份有限公司 14.752244
39 601577 长沙银行 长沙银行股份有限公司 14.792683
18 600908 无锡银行 无锡农村商业银行股份有限公司 16.181704
3 002807 江阴银行 江苏江阴农村商业银行股份有限公司 19.274586
48 601997 贵阳银行 贵阳银行股份有限公司 20.489563
4 002839 张家港行 江苏张家港农村商业银行股份有限公司 20.599511
25 601166 兴业银行 兴业银行股份有限公司 21.206503
24 601128 常熟银行 江苏常熟农村商业银行股份有限公司 21.571187
19 600919 江苏银行 江苏银行股份有限公司 23.218299
22 601009 南京银行 南京银行股份有限公司 26.297500
16 600036 招商银行 招商银行股份有限公司 27.518708
0 000001 平安银行 平安银行股份有限公司 31.624747
2 002142 宁波银行 宁波银行股份有限公司 31.729062
6 002948 青岛银行 青岛银行股份有限公司 48.602573
7 002958 青农商行 青岛农村商业银行股份有限公司 108.983776
42 601860 紫金银行 江苏紫金农村商业银行股份有限公司 115.147347
21 600928 西安银行 西安银行股份有限公司 128.496683
正则表达式替换中文换行符【python】
使用正则表达式替换换行符。(也可以替换为任意字符)
js=re.sub('\r\n','',js)
完毕。
request header显示Provisional headers are shown
把插件卸载了问题就解决了。
异步爬虫aiohttp post提交数据
async def fetch(session,url, data):
async with session.post(url=url, data=data, headers=headers) as response:
return await response.json()
完整的例子:
import aiohttp收起阅读 »
import asyncio
page = 30
post_data = {
'page': 1,
'pageSize': 10,
'keyWord': '',
'dpIds': '',
}
headers = {
"Accept-Encoding": "gzip, deflate",
"Accept-Language": "en-US,en;q=0.9",
"User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.108 Safari/537.36",
"X-Requested-With": "XMLHttpRequest",
}
result=
async def fetch(session,url, data):
async with session.post(url=url, data=data, headers=headers) as response:
return await response.json()
async def parse(html):
xzcf_list = html.get('newtxzcfList')
if xzcf_list is None:
return
for i in xzcf_list:
result.append(i)
async def downlod(page):
data=post_data.copy()
data['page']=page
url = 'http://credit.chaozhou.gov.cn/tfieldTypeActionJson!initXzcfListnew.do'
async with aiohttp.ClientSession() as session:
html=await fetch(session,url,data)
await parse(html)
loop = asyncio.get_event_loop()
tasks=[asyncio.ensure_future(downlod(i)) for i in range(1,page)]
tasks=asyncio.gather(*tasks)
# print(tasks)
loop.run_until_complete(tasks)
# loop.close()
# print(result)
count=0
for i in result:
print(i.get('cfXdrMc'))
count+=1
print(f'total {count}')
python异步aiohttp爬虫 - 异步爬取链家数据
import requests收起阅读 »
from lxml import etree
import asyncio
import aiohttp
import pandas
import re
import math
import time
loction_info = ''' 1→杭州
2→武汉
3→北京
按ENTER确认:'''
loction_select = input(loction_info)
loction_dic = {'1': 'hz',
'2': 'wh',
'3': 'bj'}
city_url = 'https://{}.lianjia.com/ershoufang/'.format(loction_dic[loction_select])
down = input('请输入价格下限(万):')
up = input('请输入价格上限(万):')
inter_list = [(int(down), int(up))]
def half_inter(inter):
lower = inter[0]
upper = inter[1]
delta = int((upper - lower) / 2)
inter_list.remove(inter)
print('已经缩小价格区间', inter)
inter_list.append((lower, lower + delta))
inter_list.append((lower + delta, upper))
pagenum = {}
def get_num(inter):
url = city_url + 'bp{}ep{}/'.format(inter[0], inter[1])
r = requests.get(url).text
print(r)
num = int(etree.HTML(r).xpath("//h2[@class='total fl']/span/text()")[0].strip())
pagenum[(inter[0], inter[1])] = num
return num
totalnum = get_num(inter_list[0])
judge = True
while judge:
a = [get_num(x) > 3000 for x in inter_list]
if True in a:
judge = True
else:
judge = False
for i in inter_list:
if get_num(i) > 3000:
half_inter(i)
print('价格区间缩小完毕!')
url_lst = []
url_lst_failed = []
url_lst_successed = []
url_lst_duplicated = []
for i in inter_list:
totalpage = math.ceil(pagenum[i] / 30)
for j in range(1, totalpage + 1):
url = city_url + 'pg{}bp{}ep{}/'.format(j, i[0], i[1])
url_lst.append(url)
print('url列表获取完毕!')
info_lst = []
async def get_info(url):
async with aiohttp.ClientSession() as session:
async with session.get(url, timeout=5) as resp:
if resp.status != 200:
url_lst_failed.append(url)
else:
url_lst_successed.append(url)
r = await resp.text()
nodelist = etree.HTML(r).xpath("//ul[@class='sellListContent']/li")
# print('-------------------------------------------------------------')
# print('开始抓取第{}个页面的数据,共计{}个页面'.format(url_lst.index(url),len(url_lst)))
# print('开始抓取第{}个页面的数据,共计{}个页面'.format(url_lst.index(url), len(url_lst)))
# print('开始抓取第{}个页面的数据,共计{}个页面'.format(url_lst.index(url), len(url_lst)))
# print('-------------------------------------------------------------')
info_dic = {}
index = 1
print('开始抓取{}'.format(resp.url))
print('开始抓取{}'.format(resp.url))
print('开始抓取{}'.format(resp.url))
for node in nodelist:
try:
info_dic['title'] = node.xpath(".//div[@class='title']/a/text()")[0]
except:
info_dic['title'] = '/'
try:
info_dic['href'] = node.xpath(".//div[@class='title']/a/@href")[0]
except:
info_dic['href'] = '/'
try:
info_dic['xiaoqu'] = \
node.xpath(".//div[@class='houseInfo']")[0].xpath('string(.)').replace(' ', '').split('|')[0]
except:
info_dic['xiaoqu'] = '/'
try:
info_dic['huxing'] = \
node.xpath(".//div[@class='houseInfo']")[0].xpath('string(.)').replace(' ', '').split('|')[1]
except:
info_dic['huxing'] = '/'
try:
info_dic['area'] = \
node.xpath(".//div[@class='houseInfo']")[0].xpath('string(.)').replace(' ', '').split('|')[2]
except:
info_dic['area'] = '/'
try:
info_dic['chaoxiang'] = \
node.xpath(".//div[@class='houseInfo']")[0].xpath('string(.)').replace(' ', '').split('|')[3]
except:
info_dic['chaoxiang'] = '/'
try:
info_dic['zhuangxiu'] = \
node.xpath(".//div[@class='houseInfo']")[0].xpath('string(.)').replace(' ', '').split('|')[4]
except:
info_dic['zhuangxiu'] = '/'
try:
info_dic['dianti'] = \
node.xpath(".//div[@class='houseInfo']")[0].xpath('string(.)').replace(' ', '').split('|')[5]
except:
info_dic['dianti'] = '/'
try:
info_dic['louceng'] = re.findall('\((.*)\)', node.xpath(".//div[@class='positionInfo']/text()")[0])
except:
info_dic['louceng'] = '/'
try:
info_dic['nianxian'] = re.findall('\)(.*?)年', node.xpath(".//div[@class='positionInfo']/text()")[0])
except:
info_dic['nianxian'] = '/'
try:
info_dic['guanzhu'] = ''.join(re.findall('[0-9]', node.xpath(".//div[@class='followInfo']/text()")[
0].replace(' ', '').split('/')[0]))
except:
info_dic['guanzhu'] = '/'
try:
info_dic['daikan'] = ''.join(re.findall('[0-9]',
node.xpath(".//div[@class='followInfo']/text()")[0].replace(
' ', '').split('/')[1]))
except:
info_dic['daikan'] = '/'
try:
info_dic['fabu'] = node.xpath(".//div[@class='followInfo']/text()")[0].replace(' ', '').split('/')[
2]
except:
info_dic['fabu'] = '/'
try:
info_dic['totalprice'] = node.xpath(".//div[@class='totalPrice']/span/text()")[0]
except:
info_dic['totalprice'] = '/'
try:
info_dic['unitprice'] = node.xpath(".//div[@class='unitPrice']/span/text()")[0].replace('单价', '')
except:
info_dic['unitprice'] = '/'
if True in [info_dic['href'] in dic.values() for dic in info_lst]:
url_lst_duplicated.append(info_dic)
else:
info_lst.append(info_dic)
print('第{}条: {}→房屋信息抓取完毕!'.format(index, info_dic['title']))
index += 1
info_dic = {}
start = time.time()
# 首次抓取url_lst中的信息,部分url没有对其发起请求,不知道为什么
tasks = [asyncio.ensure_future(get_info(url)) for url in url_lst]
loop = asyncio.get_event_loop()
loop.run_until_complete(asyncio.wait(tasks))
# 将没有发起请求的url放入一个列表,对其进行循环抓取,直到所有url都被发起请求
url_lst_unrequested = []
for url in url_lst:
if url not in url_lst_successed or url_lst_failed:
url_lst_unrequested.append(url)
while len(url_lst_unrequested) > 0:
tasks_unrequested = [asyncio.ensure_future(get_info(url)) for url in url_lst_unrequested]
loop.run_until_complete(asyncio.wait(tasks_unrequested))
url_lst_unrequested = []
for url in url_lst:
if url not in url_lst_successed:
url_lst_unrequested.append(url)
end = time.time()
print('当前价格区间段内共有{}套二手房源\(包含{}条重复房源\),实际获得{}条房源信息。'.format(totalnum, len(url_lst_duplicated), len(info_lst)))
print('总共耗时{}秒'.format(end - start))
df = pandas.DataFrame(info_lst)
df.to_csv("ljwh.csv", encoding='gbk')
神经网络中数值梯度的计算 python代码
import matplotlib.pyplot as plt收起阅读 »
import numpy as np
import time
from collections import OrderedDict
def softmax(a):
a = a - np.max(a)
exp_a = np.exp(a)
exp_a_sum = np.sum(exp_a)
return exp_a / exp_a_sum
def cross_entropy_error(t, y):
delta = 1e-7
s = -1 * np.sum(t * np.log(y + delta))
# print('cross entropy ',s)
return s
class simpleNet:
def __init__(self):
self.W = np.random.randn(2, 3)
def predict(self, x):
print('current w',self.W)
return np.dot(x, self.W)
def loss(self, x, t):
z = self.predict(x)
# print(z)
# print(z.ndim)
y = softmax(z)
# print('y',y)
loss = cross_entropy_error(y, t) # y为预测的值
return loss
def numerical_gradient_(f, x): # 针对2维的情况 甚至是多维
h = 1e-4 # 0.0001
grad = np.zeros_like(x)
it = np.nditer(x, flags=['multi_index'], op_flags=['readwrite'])
while not it.finished:
idx = it.multi_index
print('idx', idx)
tmp_val = x[idx]
x[idx] = float(tmp_val) + h
fxh1 = f(x) # f(x+h)
print('fxh1 ', fxh1)
# print('current W', net.W)
x[idx] = tmp_val - h
fxh2 = f(x) # f(x-h)
print('fxh2 ', fxh2)
# print('next currnet W ', net.W)
grad[idx] = (fxh1 - fxh2) / (2 * h)
x[idx] = tmp_val # 还原值
it.iternext()
return grad
net = simpleNet()
x=np.array([0.6,0.9])
t = np.array([0.0,0.0,1.0])
def f(W):
return net.loss(x,t)
grads =numerical_gradient_(f,net.W)
print(grads)
【可转债剩余转股比例数据排序】【2019-05-06】
剩余的比例越少,上市公司下调转股价的欲望就越少。 也就是会任由可转债在那里晾着,不会积极拉正股。
数据定期更新。
原创文章,
转载请注明出处:
http://30daydo.com/article/472
收起阅读 »
pycharm激活码 有效期到2019年11月
MTW881U3Z5-eyJsaWNlbnNlSWQiOiJNVFc4ODFVM1o1IiwibGljZW5zZWVOYW1lIjoiTnNzIEltIiwiYXNzaWduZWVOYW1lIjoiIiwiYXNzaWduZWVFbWFpbCI6IiIsImxpY2Vuc2VSZXN0cmljdGlvbiI6IkZvciBlZHVjYXRpb25hbCB1c2Ugb25seSIsImNoZWNrQ29uY3VycmVudFVzZSI6ZmFsc2UsInByb2R1Y3RzIjpbeyJjb2RlIjoiSUkiLCJwYWlkVXBUbyI6IjIwMTktMTEtMDYifSx7ImNvZGUiOiJBQyIsInBhaWRVcFRvIjoiMjAxOS0xMS0wNiJ9LHsiY29kZSI6IkRQTiIsInBhaWRVcFRvIjoiMjAxOS0xMS0wNiJ9LHsiY29kZSI6IlBTIiwicGFpZFVwVG8iOiIyMDE5LTExLTA2In0seyJjb2RlIjoiR08iLCJwYWlkVXBUbyI6IjIwMTktMTEtMDYifSx7ImNvZGUiOiJETSIsInBhaWRVcFRvIjoiMjAxOS0xMS0wNiJ9LHsiY29kZSI6IkNMIiwicGFpZFVwVG8iOiIyMDE5LTExLTA2In0seyJjb2RlIjoiUlMwIiwicGFpZFVwVG8iOiIyMDE5LTExLTA2In0seyJjb2RlIjoiUkMiLCJwYWlkVXBUbyI6IjIwMTktMTEtMDYifSx7ImNvZGUiOiJSRCIsInBhaWRVcFRvIjoiMjAxOS0xMS0wNiJ9LHsiY29kZSI6IlBDIiwicGFpZFVwVG8iOiIyMDE5LTExLTA2In0seyJjb2RlIjoiUk0iLCJwYWlkVXBUbyI6IjIwMTktMTEtMDYifSx7ImNvZGUiOiJXUyIsInBhaWRVcFRvIjoiMjAxOS0xMS0wNiJ9LHsiY29kZSI6IkRCIiwicGFpZFVwVG8iOiIyMDE5LTExLTA2In0seyJjb2RlIjoiREMiLCJwYWlkVXBUbyI6IjIwMTktMTEtMDYifSx7ImNvZGUiOiJSU1UiLCJwYWlkVXBUbyI6IjIwMTktMTEtMDYifV0sImhhc2giOiIxMDgyODE0Ni8wIiwiZ3JhY2VQZXJpb2REYXlzIjowLCJhdXRvUHJvbG9uZ2F0ZWQiOmZhbHNlLCJpc0F1dG9Qcm9sb25nYXRlZCI6ZmFsc2V9-aKyalfjUfiV5UXfhaMGgOqrMzTYy2rnsmobL47k8tTpR/jvG6HeL3FxxleetI+W+Anw3ZSe8QAMsSxqVS4podwlQgIe7f+3w7zyAT1j8HMVlfl2h96KzygdGpDSbwTbwOkJ6/5TQOPgAP86mkaSiM97KgvkZV/2nXQHRz1yhm+MT+OsioTwxDhd/22sSGq6KuIztZ03UvSciEmyrPdl2ueJw1WuT9YmFjdtTm9G7LuXvCM6eav+BgCRm+wwtUeDfoQqigbp0t6FQgkdQrcjoWvLSB0IUgp/f4qGf254fA7lXskT2VCFdDvi0jgxLyMVct1cKnPdM6fkHnbdSXKYDWw==-MIIElTCCAn2gAwIBAgIBCTANBgkqhkiG9w0BAQsFADAYMRYwFAYDVQQDDA1KZXRQcm9maWxlIENBMB4XDTE4MTEwMTEyMjk0NloXDTIwMTEwMjEyMjk0NlowaDELMAkGA1UEBhMCQ1oxDjAMBgNVBAgMBU51c2xlMQ8wDQYDVQQHDAZQcmFndWUxGTAXBgNVBAoMEEpldEJyYWlucyBzLnIuby4xHTAbBgNVBAMMFHByb2QzeS1mcm9tLTIwMTgxMTAxMIIBIjANBgkqhkiG9w0BAQEFAAOCAQ8AMIIBCgKCAQEAxcQkq+zdxlR2mmRYBPzGbUNdMN6OaXiXzxIWtMEkrJMO/5oUfQJbLLuMSMK0QHFmaI37WShyxZcfRCidwXjot4zmNBKnlyHodDij/78TmVqFl8nOeD5+07B8VEaIu7c3E1N+e1doC6wht4I4+IEmtsPAdoaj5WCQVQbrI8KeT8M9VcBIWX7fD0fhexfg3ZRt0xqwMcXGNp3DdJHiO0rCdU+Itv7EmtnSVq9jBG1usMSFvMowR25mju2JcPFp1+I4ZI+FqgR8gyG8oiNDyNEoAbsR3lOpI7grUYSvkB/xVy/VoklPCK2h0f0GJxFjnye8NT1PAywoyl7RmiAVRE/EKwIDAQABo4GZMIGWMAkGA1UdEwQCMAAwHQYDVR0OBBYEFGEpG9oZGcfLMGNBkY7SgHiMGgTcMEgGA1UdIwRBMD+AFKOetkhnQhI2Qb1t4Lm0oFKLl/GzoRykGjAYMRYwFAYDVQQDDA1KZXRQcm9maWxlIENBggkA0myxg7KDeeEwEwYDVR0lBAwwCgYIKwYBBQUHAwEwCwYDVR0PBAQDAgWgMA0GCSqGSIb3DQEBCwUAA4ICAQAF8uc+YJOHHwOFcPzmbjcxNDuGoOUIP+2h1R75Lecswb7ru2LWWSUMtXVKQzChLNPn/72W0k+oI056tgiwuG7M49LXp4zQVlQnFmWU1wwGvVhq5R63Rpjx1zjGUhcXgayu7+9zMUW596Lbomsg8qVve6euqsrFicYkIIuUu4zYPndJwfe0YkS5nY72SHnNdbPhEnN8wcB2Kz+OIG0lih3yz5EqFhld03bGp222ZQCIghCTVL6QBNadGsiN/lWLl4JdR3lJkZzlpFdiHijoVRdWeSWqM4y0t23c92HXKrgppoSV18XMxrWVdoSM3nuMHwxGhFyde05OdDtLpCv+jlWf5REAHHA201pAU6bJSZINyHDUTB+Beo28rRXSwSh3OUIvYwKNVeoBY+KwOJ7WnuTCUq1meE6GkKc4D/cXmgpOyW/1SmBz3XjVIi/zprZ0zf3qH5mkphtg6ksjKgKjmx1cXfZAAX6wcDBNaCL+Ortep1Dh8xDUbqbBVNBL4jbiL3i3xsfNiyJgaZ5sX7i8tmStEpLbPwvHcByuf59qJhV/bZOl8KqJBETCDJcY6O2aqhTUy+9x93ThKs1GKrRPePrWPluud7ttlgtRveit/pcBrnQcXOl1rHq7ByB8CFAxNotRUYL9IF5n3wJOgkPojMy6jetQA5Ogc8Sm7RG6vg1yow==原创文章
转载请注明出处:
http://30daydo.com/article/470
收起阅读 »
发现numpy一个很坑的问题,要一定级别的高手才能发现问题
y=X0**2+X1**2 # **2 是平方
def function_2(x):
return x[0]**2+x[1]**2
下面是计算y的偏导数,分布计算X0和X1的偏导
def numerical_gradient(f,x):
grad = np.zeros_like(x)
h=1e-4
for idx in range(x.size):
temp_v = x[idx]
x[idx]=temp_v+h
f1=f(x)
print(x,f1)
x[idx]=temp_v-h
f2=f(x)
print(x,f2)
ret = (f1-f2)/(2*h)
print(ret)
x[idx]=temp_v
grad[idx]=ret
return grad
然后调用
numerical_gradient(function_2,np.array([3,4]))
计算的是二元一次方程 y=X0**2+X1**2 在点(3,4)的偏导的值
得到的是什么结果?
为什么会得到这样的结果?
小白一般要花点时间才能找到原因。
收起阅读 »
numpy和dataframe轴的含义,axis为负数的含义
a=np.array([[[1,2],[3,4]],[[11,12],[13,14]]])
a
array([[[ 1, 2],
[ 3, 4]],
[[11, 12],
[13, 14]]])
a有3个中括号,那么就有3条轴,从0开始到2,分别是axis=0,1,2
那么我要对a进行求和,分别用axis=0,1,2进行运行。
a.sum(axis=0)得到:
array([[12, 14],意思是去掉一个中括号,然后运行。
[16, 18]])
同理:
a.sum(axis=1)对a去掉2个中括号,然后运行。
得到:
array([[ 4, 6],那么对a.sum(axis=2)的结果呢?读者可以自己上机去尝试吧。
[24, 26]])
而轴的负数,axis=-3和axis=0的意思是一样的,对于有3层轴的数组来说的话。
a.sum(axis=-3)
array([[12, 14],
[16, 18]])
收起阅读 »
np.nonzero()的用法【numpy小白】
返回值为元组, 两个值分别为两个维度, 包含了相应维度上非零元素的目录值。
比如:
n1=np.array([0,1,0,0,0,0,1,0,0,0,0,0,0,1])
n1.nonzero()
返回的是:
(array([ 1, 6, 13], dtype=int64),)
注意上面是一个yu元组要获取里面的值,需要用 n1.nonzero()[0] 来获取。
原创文章
转载请注明出处:
http://30daydo.com/article/466
收起阅读 »
ndarray和array的区别【numpy小白】
np.array([1,2,3,4])
上面代码创建了一个对象,这个对象就是ndarray。 所以ndarray是一个类对象象,而array是一个方法。原创文章
转载请注明出处:
http://30daydo.com/article/465
收起阅读 »