Scraping a novel site in seconds with async

Author: jaaks  Posted: 2023-9-24 14:07:14

[Python]
from bs4 import BeautifulSoup
import os, re, json, aiohttp, asyncio

url_list = []
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.87 Safari/537.36"
}
directory = "txt"  # relative path; the txt directory is created under the current working directory
if not os.path.exists(directory):
    os.makedirs(directory)

async def fetch_post(url, headers, data):
    async with aiohttp.ClientSession() as session:
        async with session.post(url, headers=headers, data=data) as response:
            return await response.text()

async def fetch_get(url, headers):
    async with aiohttp.ClientSession() as session:
        async with session.get(url, headers=headers) as response:
            return await response.text()

async def get_list(bookid):  # fetch the chapter list
    data = {"bookId": bookid}
    r = await fetch_post("https://bookapi.zongheng.com/api/chapter/getChapterList", data=data, headers=headers)
    response_data = json.loads(r)
    chapter_list = response_data["result"]["chapterList"]
    for chapter in chapter_list:
        for chapte in chapter["chapterViewList"]:
            chapterId = chapte["chapterId"]
            url_list.append(f"https://read.zongheng.com/chapter/{bookid}/{chapterId}.html")
    return True

async def get_text(url):  # fetch the chapter body
    p_text = ""
    r = await fetch_get(url, headers=headers)
    soup = BeautifulSoup(r, 'html.parser')
    name = soup.find(class_="title_txtbox").text  # title
    contents = soup.find('div', class_="content")  # body
    content = contents.find_all("p")
    for conten in content:
        p_text += conten.text + "\n\n"
    name = re.sub('[?|&]', "", name.strip())  # regex filter: strip ?, | and & from the title
    # write the title and content to a file
    file_name = os.path.join(directory, name + ".txt")
    await sava_file(file_name, p_text)
    await asyncio.sleep(2)
    print(name)

async def sava_file(name, text):
    with open(name, "w", encoding="utf8") as f:
        f.write(text)

async def main():
    loop = asyncio.get_running_loop()  # not used below
    task = [asyncio.ensure_future(get_text(url)) for url in url_list]
    await asyncio.gather(*task)

Chapter = asyncio.run(get_list("1249806"))  # fetch the chapter list
print("Length: " + str(len(url_list)))
print(url_list)
if Chapter:
    asyncio.run(main())
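Multithreaded version, scraping the same novel site: https://www.52pojie.cn/thread-1834722-1-1.html
This is based on the same source code, just changed to an async implementation so that it scrapes in seconds. I couldn't find a good way to deal with blocking network requests, so I learned async.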
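One call in the script above still blocks: sava_file is declared async, but it calls open and write directly, so the event loop waits while each file is written. Below is a minimal sketch, not the author's method, of handing that write to a thread pool with loop.run_in_executor. The names write_text, save_file_async and demo are hypothetical, and the demo writes to a temporary directory rather than the txt directory.

```python
import asyncio
import os
import tempfile


def write_text(path, text):
    # Plain blocking write; it runs in a worker thread, not on the event loop.
    with open(path, "w", encoding="utf8") as f:
        f.write(text)


async def save_file_async(path, text):
    # run_in_executor(None, ...) uses the loop's default thread pool.
    loop = asyncio.get_event_loop()
    await loop.run_in_executor(None, write_text, path, text)


def demo():
    # Write one small file into a temporary directory and read it back.
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "chapter.txt")
        loop = asyncio.new_event_loop()
        try:
            loop.run_until_complete(save_file_async(path, "hello\n\n"))
        finally:
            loop.close()
        with open(path, encoding="utf8") as f:
            return f.read()
```

For files as small as a single chapter the gain is probably minor; the pattern matters more when a blocking call is slow.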


joy95611 2023-9-24 14:08:04

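Good material to learn from. I ran into a problem:
module 'asyncio' has no attribute 'run'
It turns out my Python is 3.6 (asyncio.run was only added in 3.7). I followed the fix described in
Fixing AttributeError: module 'asyncio' has no attribute 'run' in Python - wzqwer - cnblogs
https://www.cnblogs.com/wzbk/p/14119401.html
The changed code is as follows:
..... same as before ...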
async def main():
    #loop = asyncio.get_running_loop()
    loop = asyncio.get_event_loop()
    task = [asyncio.ensure_future(get_text(url)) for url in url_list]
    await asyncio.gather(*task)

#Chapter = asyncio.run(get_list("1249806"))  # fetch the chapter list
loop = asyncio.get_event_loop()
Chapter = loop.run_until_complete(get_list("1249806"))
print("Length: " + str(len(url_list)))
print(url_list)
print(Chapter)
#loop = asyncio.get_event_loop()
if Chapter:
    result = loop.run_until_complete(main())
    #asyncio.run(main())
The data scraped successfully!
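The same workaround can be wrapped in a small helper so one script runs on both Python 3.6 and newer versions. This is only a sketch: run_compat and add are hypothetical names, and add stands in for a real coroutine such as get_list.

```python
import asyncio


def run_compat(coro):
    # asyncio.run exists from Python 3.7 onward; prefer it when available.
    if hasattr(asyncio, "run"):
        return asyncio.run(coro)
    # Python 3.6 fallback: drive the coroutine on the event loop directly.
    loop = asyncio.get_event_loop()
    return loop.run_until_complete(coro)


async def add(a, b):
    # Stand-in coroutine for the demo; yields to the loop once.
    await asyncio.sleep(0)
    return a + b
```

With this in place, the two asyncio.run calls in the original script could become run_compat(get_list("1249806")) and run_compat(main()).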

sssguo 2023-9-24 14:08:35

Thanks for sharing!!

daraxi 2023-9-24 14:09:30

Bookmarking this to study later.

吖力锅 2023-9-24 14:10:23

I haven't learned async yet; I'll learn from you.

lookfeiji 2023-9-24 14:11:16

Async really is handy; sadly I don't know how to use it yet.

fengxiaoxiao7 2023-9-24 14:12:01

Async really is impressive.

黑金刚 2023-9-24 14:12:43

[The remote computer refused the network connection.] Am I scraping too fast?
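Too many simultaneous connections is one plausible cause: main() in the original post starts a task for every chapter at once, and the asyncio.sleep(2) in get_text runs only after the request has finished, so it does not space the requests out. Below is a minimal sketch of capping concurrency with asyncio.Semaphore. The names run_limited, demo and fake_fetch are hypothetical, fake_fetch stands in for get_text so the example runs offline, and the limit values are arbitrary.

```python
import asyncio


async def run_limited(worker, items, limit=5):
    # At most `limit` workers are inside the `async with` block at once.
    sem = asyncio.Semaphore(limit)

    async def guarded(item):
        async with sem:
            return await worker(item)

    return await asyncio.gather(*(guarded(item) for item in items))


async def demo(limit=3, n=10):
    active = 0  # workers currently running
    peak = 0    # highest value `active` reached

    async def fake_fetch(i):
        # Stand-in for a network request; records how many run concurrently.
        nonlocal active, peak
        active += 1
        peak = max(peak, active)
        await asyncio.sleep(0.01)
        active -= 1
        return i * 2

    results = await run_limited(fake_fetch, range(n), limit=limit)
    return results, peak
```

In the original script, the body of main() could then become something like await run_limited(get_text, url_list, limit=5), trading some speed for fewer refused connections.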

sinyzh 2023-9-24 14:13:37

Works great, thanks OP.

qinren051 2023-9-24 14:14:27

I'd like to ask: which part do I change to switch to a different novel?
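Judging from the posted code, the book is selected by the ID passed in get_list("1249806"), which goes into both the API request and each chapter URL. For another book on the same site, changing that ID is presumably enough; a different site would need its own URLs and HTML selectors. The sketch below shows only the URL-building step, using the keys the original get_list reads. The function name build_chapter_urls and the sample response are hypothetical, not real API output.

```python
def build_chapter_urls(book_id, response_data):
    # Walk the same keys the original get_list reads from the API response.
    urls = []
    for volume in response_data["result"]["chapterList"]:
        for chapter in volume["chapterViewList"]:
            urls.append(
                f"https://read.zongheng.com/chapter/{book_id}/{chapter['chapterId']}.html"
            )
    return urls


# Hypothetical response, shaped like the keys the post's code accesses.
sample = {
    "result": {
        "chapterList": [
            {"chapterViewList": [{"chapterId": 101}, {"chapterId": 102}]},
            {"chapterViewList": [{"chapterId": 201}]},
        ]
    }
}
urls = build_chapter_urls("1249806", sample)
```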
