Python code for scraping blog post content

Python tutorials 51源码 2024-01-24

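I threw this together quickly, so its use cases are limited; feel free to extend it yourselves. I needed to migrate an article from a site where even forced copying did not work, so I wrote this little thing.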

import re
import requests
from lxml import etree

post_url = input('Enter the article URL: ')
# Fetch the page at the article URL
res = requests.get(post_url, timeout=10)
xx = res.content.decode('utf-8')
x = etree.HTML(xx)
# The XPath of the parent (container) element is required
# XPath example: //*[@id="article-container"]
# If you are unsure how to find it, look it up online
xpath = input('Enter the XPath (you can find it in the browser console): ')
content = x.xpath(xpath + '//*')
# Strip class and id attributes; [^"]* stops at the attribute's closing quote
ree = re.compile(r'\s(?:class|id)="[^"]*"')
# Match src values that start with "/", i.e. site-relative paths
urll = re.compile(r'(?<=src=")/[^"]*')
# Base address for images: placeholder, replace it with the real one.
# No trailing slash is needed, only the leading part, e.g. https://dreamtea.top
# You need to test this yourself.
base = 'https://cdn.con'
with open('result.txt', 'w', encoding='utf-8') as file:
    tep1 = ''
    for i in content:
        tep = etree.tostring(i, encoding='utf-8').decode('utf-8').strip()
        tep = ree.sub('', tep)
        # If an image uses a relative path, rewrite it as an absolute path
        tep = urll.sub(lambda m: base + m.group(), tep)
        # Skip elements already written as part of the previous (parent) element
        if tep != tep1 and tep in tep1:
            continue
        file.write(tep)
        tep1 = tep
print('Export complete!')
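
One detail of the attribute-stripping step is worth checking before relying on it: a greedy `.*` inside `class=".*"` runs to the last double quote on the line, so it can delete text and other attributes along with the class. The comparison below is only a sketch on a made-up HTML fragment (the `sample` string and the `bounded` pattern are illustrative assumptions, not part of the original script).

```python
import re

# Made-up fragment for illustration; not taken from any real page.
sample = '<p class="intro" data-x="1">Hello <a id="link" href="/a">world</a></p>'

# Greedy: ".*" extends to the LAST double quote on the line.
greedy = re.compile(r'class=".*"|id=".*"')
# Bounded: [^"]* cannot cross the attribute's own closing quote.
bounded = re.compile(r'\s(?:class|id)="[^"]*"')

greedy_result = greedy.sub('', sample)
bounded_result = bounded.sub('', sample)

print(greedy_result)   # the text "Hello" and the href are lost
print(bounded_result)  # only the class and id attributes are removed
```

With real pages the safest check is to run both patterns on a few serialized elements and compare the output by eye.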

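This could be extended into something more automatic, but I'm lazy. I hope some expert with time to spare extends it, so that I can borrow (copy) from it~~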
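
One possible step toward "more automatic", offered only as a sketch: instead of typing the image base address by hand, the relative paths could be resolved against the article URL with the standard library's `urllib.parse.urljoin`. The helper name `absolutize_srcs`, the page path `/posts/demo/` and the host `cdn.example.com` are assumptions made up for this example.

```python
import re
from urllib.parse import urljoin

# Match the value of every src="..." attribute (non-empty values only).
SRC_ATTR = re.compile(r'(?<=src=")[^"]+')


def absolutize_srcs(fragment, page_url):
    """Resolve each src value in an HTML fragment against page_url."""
    return SRC_ATTR.sub(lambda m: urljoin(page_url, m.group()), fragment)


# Hypothetical article URL and markup, for illustration only.
page = 'https://dreamtea.top/posts/demo/'
html = ('<img src="/img/a.png">'
        '<img src="b.png">'
        '<img src="https://cdn.example.com/c.png">')

fixed = absolutize_srcs(html, page)
print(fixed)
```

`urljoin` leaves URLs that are already absolute untouched, so the same call handles site-relative, page-relative and absolute image paths. It does not help when images are served from a separate CDN under a relative path, in which case the base address still has to be found by hand.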
