基于文本及符号密度的网页正文提取方法 python实现

李魔佛 发表了文章 • 0 个评论 • 328 次浏览 • 2019-09-10 15:19 • 来自相关话题

基于文本及符号密度的网页正文提取方法 python实现
 项目路径https://github.com/Rockyzsu/CodePool/tree/master/GeneralNewsExtractor
完成后在本文详细介绍,
请密切关注。 查看全部
基于文本及符号密度的网页正文提取方法 python实现
 项目路径https://github.com/Rockyzsu/CodePool/tree/master/GeneralNewsExtractor
完成后在本文详细介绍,
请密切关注。

python exchange保存备份邮件

李魔佛 发表了文章 • 0 个评论 • 242 次浏览 • 2019-09-09 10:50 • 来自相关话题

python exchange保存备份邮件
 方便自己平时备份邮件。# -*-coding=utf-8-*-

# @Time : 2019/9/9 9:25
# @File : mail_backup.py
# @Author :
import codecs
import re
import config
import os
from exchangelib import DELEGATE, Account, Credentials, Configuration, NTLM, Message, Mailbox, HTMLBody,FileAttachment,ItemAttachment
from exchangelib.protocol import BaseProtocol, NoVerifyHTTPAdapter


#此句用来消除ssl证书错误,exchange使用自签证书需加上
BaseProtocol.HTTP_ADAPTER_CLS = NoVerifyHTTPAdapter


# 输入你的域账号如example\xxx
cred = Credentials(r'example\xxx', 你的邮箱密码)

configx = Configuration(server='mail.credlink.com', credentials=cred, auth_type=NTLM)
a = Account(
primary_smtp_address='你的邮箱地址', config=configx, autodiscover=False, access_type=DELEGATE
)


for item in a.inbox.all().order_by('-datetime_received')[:100]:
print(item.subject, item.sender, item.unique_body,item.datetime_received)

name = item.subject
name = re.sub('[\/:*?"<>|]', '-', name)
local_path = os.path.join('inbox', name+'.html')
with codecs.open(local_path, 'w','utf-8') as f:
f.write(item.unique_body)

for attachment in item.attachments:
if isinstance(attachment, FileAttachment):
name = attachment.name
name = re.sub('[\/:*?"<>|]','-',name)
local_path = os.path.join('inbox', attachment.name)
with codecs.open(local_path, 'wb') as f:
f.write(attachment.content)
print('Saved attachment to', local_path)

elif isinstance(attachment, ItemAttachment):
if isinstance(attachment.item, Message):
name=attachment.item.subject
name = re.sub('[\/:*?"<>|]', '-', name)
local_path = os.path.join('inbox', 'attachment')
with codecs.open(local_path, 'w') as f:
f.write(attachment.item.body)
原创文章,
转载请注明出处
http://30daydo.com/article/534
  查看全部
python exchange保存备份邮件
 方便自己平时备份邮件。
# -*-coding=utf-8-*-

# @Time : 2019/9/9 9:25
# @File : mail_backup.py
# @Author :
import codecs
import re
import config
import os
from exchangelib import DELEGATE, Account, Credentials, Configuration, NTLM, Message, Mailbox, HTMLBody,FileAttachment,ItemAttachment
from exchangelib.protocol import BaseProtocol, NoVerifyHTTPAdapter


#此句用来消除ssl证书错误,exchange使用自签证书需加上
BaseProtocol.HTTP_ADAPTER_CLS = NoVerifyHTTPAdapter


# 输入你的域账号如example\xxx
cred = Credentials(r'example\xxx', 你的邮箱密码)

configx = Configuration(server='mail.credlink.com', credentials=cred, auth_type=NTLM)
a = Account(
primary_smtp_address='你的邮箱地址', config=configx, autodiscover=False, access_type=DELEGATE
)


for item in a.inbox.all().order_by('-datetime_received')[:100]:
print(item.subject, item.sender, item.unique_body,item.datetime_received)

name = item.subject
name = re.sub('[\/:*?"<>|]', '-', name)
local_path = os.path.join('inbox', name+'.html')
with codecs.open(local_path, 'w','utf-8') as f:
f.write(item.unique_body)

for attachment in item.attachments:
if isinstance(attachment, FileAttachment):
name = attachment.name
name = re.sub('[\/:*?"<>|]','-',name)
local_path = os.path.join('inbox', attachment.name)
with codecs.open(local_path, 'wb') as f:
f.write(attachment.content)
print('Saved attachment to', local_path)

elif isinstance(attachment, ItemAttachment):
if isinstance(attachment.item, Message):
name=attachment.item.subject
name = re.sub('[\/:*?"<>|]', '-', name)
local_path = os.path.join('inbox', 'attachment')
with codecs.open(local_path, 'w') as f:
f.write(attachment.item.body)

原创文章,
转载请注明出处
http://30daydo.com/article/534
 

性能对比 pypy vs python

李魔佛 发表了文章 • 0 个评论 • 180 次浏览 • 2019-09-06 17:04 • 来自相关话题

性能对比 pypy vs python
 不试不知道,一试吓一跳。
如果是CPU密集型的程序,pypy3的执行速度比python要快上一百倍。
talk is cheap, show me the code!
 
代码很简单,运行加法运算:
执行2千万次
 import time

LOOP = 2*10**8

def add(x,y):
return x+y

def cpu_pressure(loop):

for i in range(loop):
result = add(i,i+1)


if __name__ == '__main__':
start = time.time()
cpu_pressure(LOOP)
print(f'time used {time.time()-start}s')
python执行:
python main.py
返回用时:time used 21.422261476516724s
 
pypy执行:
pypy main.py
返回用时:time used 0.1925642490386963s
 
差距真的很大。 查看全部
性能对比 pypy vs python
 不试不知道,一试吓一跳。
如果是CPU密集型的程序,pypy3的执行速度比python要快上一百倍。
talk is cheap, show me the code!
 
代码很简单,运行加法运算:
执行2千万次
 
import time

LOOP = 2*10**8

def add(x,y):
return x+y

def cpu_pressure(loop):

for i in range(loop):
result = add(i,i+1)


if __name__ == '__main__':
start = time.time()
cpu_pressure(LOOP)
print(f'time used {time.time()-start}s')

python执行:
python main.py
返回用时:time used 21.422261476516724s
 
pypy执行:
pypy main.py
返回用时:time used 0.1925642490386963s
 
差距真的很大。

anaconda环境下无法启动jupyter notebook

李魔佛 发表了文章 • 0 个评论 • 505 次浏览 • 2019-08-19 17:16 • 来自相关话题

运行 jupyter notebook
报错: from . import (constants, error, message, context,
ImportError: DLL load failed: 找不到指定的模块。

但是可以直接在Anaconda navigator中直接启动,所以判断是环境问题。
切换到anaconda的虚拟环境,(在菜单中进入anaconda prompt command),在当前命令行下执行 jupyter notebook就能够正常运行。
 
  查看全部
运行 jupyter notebook
报错:
    from . import (constants, error, message, context,
ImportError: DLL load failed: 找不到指定的模块。

但是可以直接在Anaconda navigator中直接启动,所以判断是环境问题。
切换到anaconda的虚拟环境,(在菜单中进入anaconda prompt command),在当前命令行下执行 jupyter notebook就能够正常运行。
 
 

random.randint的用法

李魔佛 发表了文章 • 0 个评论 • 260 次浏览 • 2019-08-01 16:31 • 来自相关话题

random.randint的用法:
from random import randint

randint(0,1)
Out[25]: 1

randint(0,1)
Out[26]: 1

randint(0,1)
Out[27]: 1

randint(0,1)
Out[28]: 1

randint(0,1)
Out[29]: 0

randint(0,1)
Out[30]: 1
random.randint(a,b)
 
输出的整数范围包含a和b,和之间的整数
  查看全部
random.randint的用法:
from random import randint

randint(0,1)
Out[25]: 1

randint(0,1)
Out[26]: 1

randint(0,1)
Out[27]: 1

randint(0,1)
Out[28]: 1

randint(0,1)
Out[29]: 0

randint(0,1)
Out[30]: 1

random.randint(a,b)
 
输出的整数范围包含a和b,和之间的整数
 

exchange_declare() got an unexpected keyword argument 'type'

李魔佛 发表了文章 • 0 个评论 • 221 次浏览 • 2019-07-16 14:40 • 来自相关话题

In new version of pika, now it is using 
exchange_type instead of type
 
credentials = pika.PlainCredentials('admin','admin')
connection = pika.BlockingConnection(pika.ConnectionParameters('192.168.1.101',5672,'/',credentials))

channel = connection.channel()

channel.exchange_declare(exchange='logs',exchange_type='fanout') 查看全部
In new version of pika, now it is using 
exchange_type instead of type
 
	credentials = pika.PlainCredentials('admin','admin')
connection = pika.BlockingConnection(pika.ConnectionParameters('192.168.1.101',5672,'/',credentials))

channel = connection.channel()

channel.exchange_declare(exchange='logs',exchange_type='fanout')

twisted reactor运行后,添加了addBoth函数,但是还是无法停止

李魔佛 发表了文章 • 0 个评论 • 318 次浏览 • 2019-07-11 09:43 • 来自相关话题

代码如下:
  from scrapy.selector import Selector

def get_response_callback(content):
txt = str(content,encoding='utf-8')
resp = Selector(text=txt)
title = resp.xpath('//title/text()').extract_first()
print(title)

@defer.inlineCallbacks
def task():
url = 'http://www.baidu.com'
d=getPage(url.encode('utf-8'))
d.addCallback(get_response_callback)
yield d

def done():
reactor.stop()

def done1(*args,**kwargs):
reactor.stop()

task_list =
for i in range(4):
d=task()
task_list.append(d)

dd = defer.DeferredList(task_list)

dd.addBoth(done)

reactor.run()
上面的代码是无法停止的,如果使用的是 
dd.addBoth(done)
 
done函数的定义是没有参数的。 
 
而使用另一个done函数带参数的done(*args,**kwargs)
是可以正常退出的,done里面写了reactor.stop() 函数
 
原创文章
转载请注明出处:
http://30daydo.com/article/509
  查看全部
代码如下:
 
	from scrapy.selector import Selector

def get_response_callback(content):
txt = str(content,encoding='utf-8')
resp = Selector(text=txt)
title = resp.xpath('//title/text()').extract_first()
print(title)

@defer.inlineCallbacks
def task():
url = 'http://www.baidu.com'
d=getPage(url.encode('utf-8'))
d.addCallback(get_response_callback)
yield d

def done():
reactor.stop()

def done1(*args,**kwargs):
reactor.stop()

task_list =
for i in range(4):
d=task()
task_list.append(d)

dd = defer.DeferredList(task_list)

dd.addBoth(done)

reactor.run()

上面的代码是无法停止的,如果使用的是 
dd.addBoth(done)
 
done函数的定义是没有参数的。 
 
而使用另一个done函数带参数的done(*args,**kwargs)
是可以正常退出的,done里面写了reactor.stop() 函数
 
原创文章
转载请注明出处:
http://30daydo.com/article/509
 

cv2 distanceTransform函数的用法 python

李魔佛 发表了文章 • 0 个评论 • 984 次浏览 • 2019-07-08 15:35 • 来自相关话题

distanceTransform
Calculates the distance to the closest zero pixel for each pixel of the source image.


Python: cv2.distanceTransform(src, distanceType, maskSize[, dst]) → dst

Python: cv.DistTransform(src, dst, distance_type=CV_DIST_L2, mask_size=3, mask=None, labels=None) → None
Parameters:
src – 8-bit, single-channel (binary) source image.
dst – Output image with calculated distances. It is a 32-bit floating-point, single-channel image of the same size as src .
distanceType – Type of distance. It can be CV_DIST_L1, CV_DIST_L2 , or CV_DIST_C .
maskSize – Size of the distance transform mask. It can be 3, 5, or CV_DIST_MASK_PRECISE (the latter option is only supported by the first function). In case of the CV_DIST_L1 or CV_DIST_C distance type, the parameter is forced to 3 because a 3\times 3 mask gives the same result as 5\times 5 or any larger aperture.
labels – Optional output 2D array of labels (the discrete Voronoi diagram). It has the type CV_32SC1 and the same size as src . See the details below.
labelType – Type of the label array to build. If labelType==DIST_LABEL_CCOMP then each connected component of zeros in src (as well as all the non-zero pixels closest to the connected component) will be assigned the same label. If labelType==DIST_LABEL_PIXEL then each zero pixel (and all the non-zero pixels closest to it) gets its own label.
The functions distanceTransform calculate the approximate or precise distance from every binary image pixel to the nearest zero pixel. For zero image pixels, the distance will obviously be zero.


When maskSize == CV_DIST_MASK_PRECISE and distanceType == CV_DIST_L2 , the function runs the algorithm described in [Felzenszwalb04]. This algorithm is parallelized with the TBB library.

In other cases, the algorithm [Borgefors86] is used. This means that for a pixel the function finds the shortest path to the nearest zero pixel consisting of basic shifts: horizontal, vertical, diagonal, or knight’s move (the latest is available for a 5\times 5 mask). The overall distance is calculated as a sum of these basic distances. Since the distance function should be symmetric, all of the horizontal and vertical shifts must have the same cost (denoted as a ), all the diagonal shifts must have the same cost (denoted as b ), and all knight’s moves must have the same cost (denoted as c ). For the CV_DIST_C and CV_DIST_L1 types, the distance is calculated precisely, whereas for CV_DIST_L2 (Euclidean distance) the distance can be calculated only with a relative error (a 5\times 5 mask gives more accurate results). For a,``b`` , and c , OpenCV uses the values suggested in the original paper:

CV_DIST_C (3\times 3) a = 1, b = 1
CV_DIST_L1 (3\times 3) a = 1, b = 2
CV_DIST_L2 (3\times 3) a=0.955, b=1.3693
CV_DIST_L2 (5\times 5) a=1, b=1.4, c=2.1969
Typically, for a fast, coarse distance estimation CV_DIST_L2, a 3\times 3 mask is used. For a more accurate distance estimation CV_DIST_L2 , a 5\times 5 mask or the precise algorithm is used. Note that both the precise and the approximate algorithms are linear on the number of pixels.

The second variant of the function does not only compute the minimum distance for each pixel (x, y) but also identifies the nearest connected component consisting of zero pixels (labelType==DIST_LABEL_CCOMP) or the nearest zero pixel (labelType==DIST_LABEL_PIXEL). Index of the component/pixel is stored in \texttt{labels}(x, y) . When labelType==DIST_LABEL_CCOMP, the function automatically finds connected components of zero pixels in the input image and marks them with distinct labels. When labelType==DIST_LABEL_CCOMP, the function scans through the input image and marks all the zero pixels with distinct labels.

In this mode, the complexity is still linear. That is, the function provides a very fast way to compute the Voronoi diagram for a binary image. Currently, the second variant can use only the approximate distance transform algorithm, i.e. maskSize=CV_DIST_MASK_PRECISE is not supported yet.

Note
An example on using the distance transform can be found at opencv_source_code/samples/cpp/distrans.cpp
(Python) An example on using the distance transform can be found at opencv_source/samples/python2/distrans.py 

  查看全部
distanceTransform
Calculates the distance to the closest zero pixel for each pixel of the source image.


Python: cv2.distanceTransform(src, distanceType, maskSize[, dst]) → dst

Python: cv.DistTransform(src, dst, distance_type=CV_DIST_L2, mask_size=3, mask=None, labels=None) → None

Parameters:
src – 8-bit, single-channel (binary) source image.
dst – Output image with calculated distances. It is a 32-bit floating-point, single-channel image of the same size as src .

distanceType – Type of distance. It can be CV_DIST_L1, CV_DIST_L2 , or CV_DIST_C .
maskSize – Size of the distance transform mask. It can be 3, 5, or CV_DIST_MASK_PRECISE (the latter option is only supported by the first function). In case of the CV_DIST_L1 or CV_DIST_C distance type, the parameter is forced to 3 because a 3\times 3 mask gives the same result as 5\times 5 or any larger aperture.

labels – Optional output 2D array of labels (the discrete Voronoi diagram). It has the type CV_32SC1 and the same size as src . See the details below.

labelType – Type of the label array to build. If labelType==DIST_LABEL_CCOMP then each connected component of zeros in src (as well as all the non-zero pixels closest to the connected component) will be assigned the same label. If labelType==DIST_LABEL_PIXEL then each zero pixel (and all the non-zero pixels closest to it) gets its own label.
The functions distanceTransform calculate the approximate or precise distance from every binary image pixel to the nearest zero pixel. For zero image pixels, the distance will obviously be zero.


When maskSize == CV_DIST_MASK_PRECISE and distanceType == CV_DIST_L2 , the function runs the algorithm described in [Felzenszwalb04]. This algorithm is parallelized with the TBB library.

In other cases, the algorithm [Borgefors86] is used. This means that for a pixel the function finds the shortest path to the nearest zero pixel consisting of basic shifts: horizontal, vertical, diagonal, or knight’s move (the latest is available for a 5\times 5 mask). The overall distance is calculated as a sum of these basic distances. Since the distance function should be symmetric, all of the horizontal and vertical shifts must have the same cost (denoted as a ), all the diagonal shifts must have the same cost (denoted as b ), and all knight’s moves must have the same cost (denoted as c ). For the CV_DIST_C and CV_DIST_L1 types, the distance is calculated precisely, whereas for CV_DIST_L2 (Euclidean distance) the distance can be calculated only with a relative error (a 5\times 5 mask gives more accurate results). For a,``b`` , and c , OpenCV uses the values suggested in the original paper:

CV_DIST_C (3\times 3) a = 1, b = 1
CV_DIST_L1 (3\times 3) a = 1, b = 2
CV_DIST_L2 (3\times 3) a=0.955, b=1.3693
CV_DIST_L2 (5\times 5) a=1, b=1.4, c=2.1969
Typically, for a fast, coarse distance estimation CV_DIST_L2, a 3\times 3 mask is used. For a more accurate distance estimation CV_DIST_L2 , a 5\times 5 mask or the precise algorithm is used. Note that both the precise and the approximate algorithms are linear on the number of pixels.

The second variant of the function does not only compute the minimum distance for each pixel (x, y) but also identifies the nearest connected component consisting of zero pixels (labelType==DIST_LABEL_CCOMP) or the nearest zero pixel (labelType==DIST_LABEL_PIXEL). Index of the component/pixel is stored in \texttt{labels}(x, y) . When labelType==DIST_LABEL_CCOMP, the function automatically finds connected components of zero pixels in the input image and marks them with distinct labels. When labelType==DIST_LABEL_CCOMP, the function scans through the input image and marks all the zero pixels with distinct labels.

In this mode, the complexity is still linear. That is, the function provides a very fast way to compute the Voronoi diagram for a binary image. Currently, the second variant can use only the approximate distance transform algorithm, i.e. maskSize=CV_DIST_MASK_PRECISE is not supported yet.

Note
An example on using the distance transform can be found at opencv_source_code/samples/cpp/distrans.cpp
(Python) An example on using the distance transform can be found at opencv_source/samples/python2/distrans.py
 

 

Win10下PhantomJS无法运行 【版本兼容问题】

李魔佛 发表了文章 • 0 个评论 • 442 次浏览 • 2019-07-04 09:07 • 来自相关话题

以前在win7上运行的好好的。
在win10下就报错:
selenium.common.exceptions.WebDriverException: Message: Service C:\Tool\phantomjs-2.5.0-beta2-windows\phantomjs-2.5.0-beta2-windows\bin\phantomjs.exe unexpectedly exited. Status code was: 4294967295
 
后来替换了一个旧的版本,发现问题就这么解决了。
旧版本:phantomjs-2.1.1-windows
 
原创文章
转载请注明出处 
http://30daydo.com/article/505
  查看全部
以前在win7上运行的好好的。
在win10下就报错:
selenium.common.exceptions.WebDriverException: Message: Service C:\Tool\phantomjs-2.5.0-beta2-windows\phantomjs-2.5.0-beta2-windows\bin\phantomjs.exe unexpectedly exited. Status code was: 4294967295
 
后来替换了一个旧的版本,发现问题就这么解决了。
旧版本:phantomjs-2.1.1-windows
 
原创文章
转载请注明出处 
http://30daydo.com/article/505
 

python3与python2迭代器的写法的区别

李魔佛 发表了文章 • 0 个评论 • 248 次浏览 • 2019-06-26 11:22 • 来自相关话题

大部分相同,只是python2里面需要实现在类中实现next()方法,而python3里面需要实现__next__()方法。
 
附一个例子:
def iter_demo():

class DefineIter(object):

def __init__(self,length):
self.length = length
self.data = range(self.length)
self.index=0

def __iter__(self):
return self


def __next__(self):

if self.index >=self.length:
# return None
raise StopIteration

d = self.data[self.index]*50
self.index =self.index + 1

return d

a = DefineIter(10)
print(type(a))
for i in a:
print(i) 查看全部
大部分相同,只是python2里面需要实现在类中实现next()方法,而python3里面需要实现__next__()方法。
 
附一个例子:
def iter_demo():

class DefineIter(object):

def __init__(self,length):
self.length = length
self.data = range(self.length)
self.index=0

def __iter__(self):
return self


def __next__(self):

if self.index >=self.length:
# return None
raise StopIteration

d = self.data[self.index]*50
self.index =self.index + 1

return d

a = DefineIter(10)
print(type(a))
for i in a:
print(i)