Picked a random novel site to download a book from.
The scraper source code is below.
If some chapters fail to come down because of network problems or the like, just run the script again: it only fetches the chapters that are still missing, so you don't need to worry about wasting time on duplicate downloads.
# -*- coding:utf-8 -*-
import requests
from lxml import etree
import os
from multiprocessing.dummy import Pool
def create_path(file_path):
    # Create the target directory if it does not already exist
    if not os.path.exists(file_path):
        os.makedirs(file_path)
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/128.0.0.0 Safari/537.36 Edg/128.0.0.0'
}
# Index page of the target book
book_url = 'https://www.bqvvxg8.cc/wenzhang/1/1424/index.html'
book_detail_content = requests.get(url=book_url, headers=headers)
book_detail_content.encoding = 'gbk'  # the site serves GBK-encoded pages
book_detail_content = book_detail_content.text
book_detail_tree = etree.HTML(book_detail_content)
# Book title, used as the name of the download directory
book_name = book_detail_tree.xpath('//div[@class="book"]/div[@class="info"]/h2/text()')[0]
create_path('./' + book_name)
# Every <dd> in the chapter list holds one chapter link
chapter_dd_list = book_detail_tree.xpath('//div[@class="listmain"]/dl/dd')
def down_chapter(dd):
    chapter_url = 'https://www.bqvvxg8.cc/' + dd.xpath('./a/@href')[0]
    # Swap the ASCII '?' for the full-width '?', which is legal in Windows file names
    chapter_title = dd.xpath('./a/text()')[0].replace('?', '?')
    chapter_txt_path = './' + book_name + '/' + chapter_title + '.txt'
    # Skip chapters that already exist, so re-runs only fetch what is missing
    if not os.path.exists(chapter_txt_path):
        chapter_content = requests.get(url=chapter_url, headers=headers).text
        chapter_tree = etree.HTML(chapter_content)
        chapter_text = chapter_tree.xpath('//*[@id="content"]/text()')
        # Save the chapter, dropping the first line and the last three
        # (site navigation/ad text wrapped around the body)
        with open(chapter_txt_path, 'w', encoding='UTF-8') as file:
            file.write(chapter_title + '\n')
            for i in range(1, len(chapter_text) - 3):
                file.write(chapter_text[i])
        print(chapter_title, 'downloaded successfully')
# Download all chapters concurrently on a 20-thread pool
pool = Pool(20)
pool.map(down_chapter, chapter_dd_list)
pool.close()
pool.join()
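
Since flaky networks are the main reason chapters go missing, one optional tweak is to add a timeout and a simple retry loop around the requests. This is a minimal sketch, not part of the original script; get_with_retry is a hypothetical helper name, and you would call it in place of the bare requests.get inside down_chapter (skipping the chapter when it returns None, so the next run picks it up).

import time
import requests

def get_with_retry(url, headers, retries=3, timeout=10):
    # Hypothetical helper (not in the original script): try the request up to
    # `retries` times with a short exponential backoff, and return None when
    # every attempt fails so the caller can skip and retry on the next run.
    for attempt in range(retries):
        try:
            resp = requests.get(url=url, headers=headers, timeout=timeout)
            resp.raise_for_status()  # treat HTTP 4xx/5xx as failures too
            return resp
        except requests.RequestException:
            time.sleep(2 ** attempt)  # back off 1s, 2s, 4s, ...
    return None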