Long-time lurker, sharing source code for scraping photos of girls

Author: xinxiu  Posted: 2023-4-10 03:00:27

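I have been learning web scraping lately too, so here is the source code for a scraper that grabs photos of girls. It only uses multithreading and does no duplicate checking.
Images are saved under the J:\xiezhen\ folder; change the path as needed.
This is my first post. If it breaks any rules, moderators please delete it. Thanks.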
[Python]
import concurrent.futures
import os
import time

import requests
from lxml import etree

# Browser-like User-Agent shared by every request.
HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'}


def download_image(url, img_path):
    # Download one image and save it in img_path, named after the last URL segment.
    response = requests.get(url, headers=HEADERS)
    img_name = url.split('/')[-1]
    with open(os.path.join(img_path, img_name), 'wb') as f:
        f.write(response.content)
    print(f'Image {img_path}/{img_name} downloaded!')


def process_page(page):
    # Fetch one listing page, then visit every album linked from it.
    url = f'https://www.xiezhen.xyz/page/{page}'
    response = requests.get(url, headers=HEADERS)
    html = etree.HTML(response.content)
    main_urls = html.xpath('//div[@class="excerpts"]/article/a/@href')
    for album_url in main_urls:
        response = requests.get(album_url, headers=HEADERS)
        html = etree.HTML(response.content)
        img_nodes = html.xpath('//article/p/img')
        # The page title up to the first '-' becomes the folder name;
        # strip() removes the surrounding whitespace left by the split.
        img_title = html.xpath('//title/text()')[0].split('-')[0].strip()
        img_path = f'J:/xiezhen/{img_title}'
        # Create the album folder if it does not exist yet.
        os.makedirs(img_path, exist_ok=True)
        # Download all images of this album in parallel.
        with concurrent.futures.ThreadPoolExecutor() as executor:
            futures = [executor.submit(download_image, node.attrib['src'], img_path)
                       for node in img_nodes]
            for future in concurrent.futures.as_completed(futures):
                pass
        time.sleep(0.5)


if __name__ == '__main__':
    # Process listing pages 1 to 572 in parallel.
    with concurrent.futures.ThreadPoolExecutor() as executor:
        futures = [executor.submit(process_page, page) for page in range(1, 573)]
        for future in concurrent.futures.as_completed(futures):
            pass
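
Since the post mentions that there is no duplicate check, rerunning the script downloads every image again. Below is a minimal sketch of one way to skip images that are already on disk, assuming the file name alone is enough to identify an image. `target_path` and `already_downloaded` are hypothetical helper names, not part of the original script.

```python
import os


def target_path(url, img_path):
    # Same naming rule as download_image: the last URL segment is the file name.
    return os.path.join(img_path, url.split('/')[-1])


def already_downloaded(url, img_path):
    # True when a non-empty file with that name already exists on disk.
    # Assumption: names are compared, not image contents.
    path = target_path(url, img_path)
    return os.path.isfile(path) and os.path.getsize(path) > 0
```

Calling `already_downloaded(url, img_path)` at the top of `download_image` and returning early when it is true would skip both the request and the write.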


xinxiu (OP) 2023-4-10 03:01:02

oxding posted on 2023-4-9 23:38
img_path=f'{os.getcwd()}\\aaa'
with open(os.mkdir(os.path.join(img_path, f'{b}.png'),"wb")) as f: ...
I don't follow. Did you modify my code, or are you simply trying to write code that creates a folder automatically?
[Python]
img_path = f'J:/xiezhen/{img_title}'
if not os.path.exists(img_path):
    os.makedirs(img_path)
These lines create the folder named by img_path if it does not already exist.
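
For the `aaa` case in the quote, here is a minimal sketch, assuming the folder should sit in the current working directory (`os.getcwd()`); `save_bytes` is a hypothetical helper name. The quoted line cannot work as written because `os.mkdir` takes an integer mode as its second argument and returns `None`, so its result cannot be handed to `open`. Creating the folder and opening the file need to be two separate steps:

```python
import os


def save_bytes(data, folder_name, file_name):
    # Build the folder path under the current working directory.
    img_path = os.path.join(os.getcwd(), folder_name)
    # Step 1: create the folder; exist_ok=True makes this a no-op if it exists.
    os.makedirs(img_path, exist_ok=True)
    # Step 2: open the file inside that folder in binary write mode.
    full_path = os.path.join(img_path, file_name)
    with open(full_path, 'wb') as f:
        f.write(data)
    return full_path
```

In the scraper this would be called as something like `save_bytes(response.content, 'aaa', img_name)`.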

oxding 2023-4-10 03:01:41

xinxiu posted on 2023-4-9 23:31
Change this to
[mw_shl_c ...
img_path=f'{os.getcwd()}\\aaa'
with open(os.mkdir(os.path.join(img_path, f'{b}.png'),"wb")) as f:
Exception occurred: TypeError

an integer is required (got type str)
It raises this error. What should I do?

oxding 2023-4-10 03:02:24

img_name = url.split('/')[-1]
with open(os.path.join(img_path, img_name), 'wb') as f:
How do I automatically create an xxx folder in the program's root directory and write the images into it?

antness 2023-4-10 03:03:24

Nice site, hahaha

xinxiu (OP) 2023-4-10 03:04:14

oxding posted on 2023-4-9 23:09
img_name = url.split('/')[-1]
with open(os.path.join(img_path, img_name), 'wb') as f:
Change img_path.

oxding 2023-4-10 03:05:09

xinxiu posted on 2023-4-9 23:12
Change img_path.
Change it how? For example, if I want an aaa folder created automatically, what do I write?

dujiu3611 2023-4-10 03:06:07

Looks like I just found a pretty good site

Andrea 2023-4-10 03:07:07

This site takes a lot out of you~

chinap 2023-4-10 03:07:45

Learned something, thanks for sharing
