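Longtime lurker sharing the source for a scraper that downloads photo sets of girls

Views 214 | Replies 11
Author: xinxiu
I've been learning web scraping lately, so here is the source for a scraper that downloads photo sets of girls. It only uses multithreading and does no duplicate checking.
Images are saved under the J:\xiezhen\ folder; change that as you like.
This is my first post. If it breaks the rules, moderators please delete it. Thanks.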
[Python]
import concurrent.futures
import os
import time

import requests
from lxml import etree

# Browser-like User-Agent sent with every request.
HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'}


def download_image(url, img_path):
    """Download one image into img_path, named after the last URL segment."""
    response = requests.get(url, headers=HEADERS, timeout=30)
    img_name = url.split('/')[-1]
    with open(os.path.join(img_path, img_name), 'wb') as f:
        f.write(response.content)
    print(f'Image {img_path}/{img_name} downloaded!')


def process_page(page):
    """Visit every post linked from one listing page and download its images."""
    url = f'https://www.xiezhen.xyz/page/{page}'
    response = requests.get(url, headers=HEADERS, timeout=30)
    html = etree.HTML(response.content)
    mail_url = html.xpath('//div[@class="excerpts"]/article/a/@href')
    for url in mail_url:
        response = requests.get(url, headers=HEADERS, timeout=30)
        html = etree.HTML(response.content)
        sub_url = html.xpath('//article/p/img')
        # Folder name: the page title up to the first '-', without stray spaces.
        img_title = html.xpath('//title/text()')[0].split('-')[0].strip()
        img_path = f'J:/xiezhen/{img_title}'
        if not os.path.exists(img_path):
            os.makedirs(img_path)
        # Download all images of this post in parallel.
        with concurrent.futures.ThreadPoolExecutor() as executor:
            futures = []
            for s_url in sub_url:
                img_url = s_url.attrib['src']
                futures.append(executor.submit(download_image, img_url, img_path))
            for future in concurrent.futures.as_completed(futures):
                pass
        time.sleep(0.5)


if __name__ == '__main__':
    # One task per listing page (pages 1 to 572).
    with concurrent.futures.ThreadPoolExecutor() as executor:
        futures = []
        for page in range(1, 573):
            futures.append(executor.submit(process_page, page))
        for future in concurrent.futures.as_completed(futures):
            pass
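
The post mentions that the script does no duplicate checking, so every run downloads every image again. One simple way to avoid that, sketched here as an assumption rather than as part of the original script, is to skip any image whose file already exists and is not empty. The helper name `needs_download` is hypothetical.

```python
import os
import tempfile


def needs_download(img_path, img_name):
    """Return True unless img_path/img_name already exists as a non-empty file.

    Hypothetical helper, not part of the original script.
    """
    target = os.path.join(img_path, img_name)
    return not (os.path.isfile(target) and os.path.getsize(target) > 0)


# Small demonstration in a throwaway directory (no network access needed).
with tempfile.TemporaryDirectory() as demo_dir:
    first_check = needs_download(demo_dir, 'a.jpg')    # nothing saved yet
    with open(os.path.join(demo_dir, 'a.jpg'), 'wb') as f:
        f.write(b'fake image bytes')
    second_check = needs_download(demo_dir, 'a.jpg')   # the file exists now
    print(first_check, second_check)
```

In `download_image`, the check would go before `requests.get`, returning early when `needs_download(img_path, img_name)` is false. This only catches repeats by file name in the same folder; it does not detect the same picture saved under a different name.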


xinxiu
OP
  


oxding posted on 2023-4-9 23:38
img_path=f'{os.getcwd()}\\aaa'
with open(os.mkdir(os.path.join(img_path, f'{b}.png'),"wb")) as f: ...

I don't follow. Did you change this starting from my code, or are you just trying to write something that creates a folder automatically?
[Python]
img_path = f'J:/xiezhen/{img_title}'
if not os.path.exists(img_path):
    os.makedirs(img_path)
These lines create the folder named by img_path if it does not already exist.
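
Because the script runs many pages in parallel, two threads could in principle pass the `os.path.exists` check at the same moment and then both call `os.makedirs`, with the slower one failing. A minimal sketch of a single-call alternative follows, using the standard `exist_ok=True` argument; the helper name `ensure_dir` and the folder names are made up for the demonstration.

```python
import concurrent.futures
import os
import tempfile


def ensure_dir(path):
    """Create path and any missing parents; do nothing if it already exists.

    Hypothetical helper, not part of the original script.
    """
    os.makedirs(path, exist_ok=True)
    return path


with tempfile.TemporaryDirectory() as base:
    target = os.path.join(base, 'xiezhen', 'some title')
    # Eight threads ask for the same folder at once; none of them raises.
    with concurrent.futures.ThreadPoolExecutor(max_workers=8) as executor:
        results = list(executor.map(ensure_dir, [target] * 8))
    created = os.path.isdir(target)
```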
oxding   


xinxiu posted on 2023-4-9 23:31
Change this to
[mw_shl_c ...

img_path=f'{os.getcwd()}\\aaa'
with open(os.mkdir(os.path.join(img_path, f'{b}.png'),"wb")) as f:
Exception raised: TypeError
an integer is required (got type str)
This throws an error. What should I do?
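
A likely reading of this TypeError: the closing parenthesis is in the wrong place, so "wb" is passed as the second argument of `os.mkdir`, which expects an integer permission mode. Even with a valid mode, `os.mkdir` returns `None`, so there would be nothing for `open` to open. The sketch below separates the steps; `save_bytes` is a hypothetical helper and `'1.png'` stands in for `f'{b}.png'`, since `b` is not defined in the quoted snippet.

```python
import os
import tempfile


def save_bytes(img_path, img_name, data):
    """Create img_path if needed, then write data to img_path/img_name.

    Hypothetical helper, not part of the original script.
    """
    os.makedirs(img_path, exist_ok=True)         # step 1: make the folder
    target = os.path.join(img_path, img_name)    # step 2: build the file path
    with open(target, 'wb') as f:                # step 3: 'wb' belongs to open()
        f.write(data)
    return target


with tempfile.TemporaryDirectory() as base:
    # Reproduce the reported mistake: a str where os.mkdir expects an int mode.
    try:
        os.mkdir(os.path.join(base, 'broken'), 'wb')
        raised = False
    except TypeError:
        raised = True

    # The corrected order of operations, writing into an 'aaa' folder.
    saved = save_bytes(os.path.join(base, 'aaa'), '1.png', b'fake image bytes')
    saved_ok = os.path.isfile(saved)
```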
oxding

img_name = url.split('/')[-1]
with open(os.path.join(img_path, img_name), 'wb') as f:
How do I automatically create an xxx folder in the program's root directory and write the images into it?
antness

Nice site, haha
xinxiu
OP

oxding posted on 2023-4-9 23:09
img_name = url.split('/')[-1]
with open(os.path.join(img_path, img_name), 'wb') as f:

Change img_path.
oxding

xinxiu posted on 2023-4-9 23:12
Change img_path.

Change it how? Say I want an aaa folder created automatically; how would I write that?
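
One possible answer, offered as a sketch rather than as what xinxiu had in mind: build the folder path from the location of the running script and create it before any image is written. Note that `os.getcwd()` is the directory the program was started from, which is not always the folder the script lives in. The names `script_dir` and `make_output_dir` are hypothetical.

```python
import sys
from pathlib import Path


def script_dir():
    """Folder of the running script, or the working directory if unknown."""
    main_file = getattr(sys.modules.get('__main__'), '__file__', None)
    return Path(main_file).resolve().parent if main_file else Path.cwd()


def make_output_dir(name='aaa', base=None):
    """Create base/name and return it; base defaults to the script's folder.

    Hypothetical helper, not part of the original script.
    """
    root = Path(base) if base is not None else script_dir()
    out_dir = root / name
    out_dir.mkdir(parents=True, exist_ok=True)   # no error if it already exists
    return out_dir
```

In the posted script this would replace the hard-coded drive path, for example `img_path = str(make_output_dir('aaa') / img_title)`, followed by the existing folder-creation lines for the per-title subfolder.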
dujiu3611

I think I've spotted a pretty good site here.
Andrea

This site takes a lot out of you~
chinap

Learned something. Thanks for sharing.