Scraping a site's sitemap.xml with Python to get each page's URL, title, and description

Author: oneai  Posted: 2023-5-29 21:00:28

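Use Python to fetch a site's sitemap.xml, then visit each URL in it to get the page's current title, description, keywords, and URL.
Steps: use the requests and beautifulsoup4 libraries to fetch the data and parse the XML file, use a regular expression to keep only the URLs with the wanted suffix, read the title and meta tags from each page with BeautifulSoup, and save the results to a txt file.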
[Python]
import re

import requests
from bs4 import BeautifulSoup

# Fetch the sitemap XML file
sitemap_url = 'https://www.xxxxxxxxxxxxxxxx/sitemap.xml'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'}
response = requests.get(sitemap_url, headers=headers)

# Parse the XML file (the 'xml' parser needs the lxml package installed)
soup = BeautifulSoup(response.text, 'xml')
urls = [loc.text.strip() for loc in soup.find_all('loc')]

# Keep only URLs ending in .html, then get title, description and keywords
pattern = re.compile(r'.*\.html$')
for url in urls:
    if pattern.match(url):
        response = requests.get(url, headers=headers)
        soup = BeautifulSoup(response.text, 'html.parser')
        # Any of these tags may be missing, so fall back to an empty string
        title = soup.title.string if soup.title else ''
        keywords_tag = soup.find('meta', attrs={'name': 'keywords'})
        description_tag = soup.find('meta', attrs={'name': 'description'})
        keywords = keywords_tag.get('content', '') if keywords_tag else ''
        description = description_tag.get('content', '') if description_tag else ''
        # Append the result to a txt file
        with open('output.txt', 'a', encoding='utf-8') as f:
            f.write('Title: {}\n'.format(title))
            f.write('Keywords: {}\n'.format(keywords))
            f.write('Description: {}\n'.format(description))
            f.write('URL: {}\n\n'.format(url))
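
The sitemap step can also be tried offline. Below is a minimal sketch that swaps BeautifulSoup for the standard library's xml.etree.ElementTree, so nothing needs to be installed. The sample sitemap, the example.com URLs, and the function name html_urls_from_sitemap are made up for illustration; it also assumes the sitemap uses the standard sitemaps.org namespace.

```python
import xml.etree.ElementTree as ET

# Made-up sitemap for illustration; example.com is a placeholder domain
SAMPLE_SITEMAP = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/a.html</loc></url>
  <url><loc>https://example.com/about/</loc></url>
  <url><loc> https://example.com/b.html </loc></url>
</urlset>"""

# ElementTree writes namespaced tags as {namespace}tag
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def html_urls_from_sitemap(xml_text):
    """Return the <loc> values that end in .html, in document order."""
    root = ET.fromstring(xml_text.encode("utf-8"))
    urls = []
    for loc in root.iter(SITEMAP_NS + "loc"):
        url = (loc.text or "").strip()  # tolerate empty or padded <loc>
        if url.endswith(".html"):
            urls.append(url)
    return urls


print(html_urls_from_sitemap(SAMPLE_SITEMAP))
```

Here str.endswith does the same job as the `.*\.html$` regular expression above. The regex is only needed if the suffix rule gets more complicated.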

Klock0828 2023-5-29 21:01:25

Thanks for sharing, the forum is better because of you!

806785900 2023-5-29 21:01:55

Thanks to the OP for sharing. Here is a rewrite that parses with XPath!
[Python]
import re

import parsel
import requests

# Fetch the sitemap XML file
sitemap_url = 'https://www.xxxxxxxxxxxxxxxx/sitemap.xml'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'}
response = requests.get(sitemap_url, headers=headers)

# Parse the XML file
selector = parsel.Selector(text=response.text)
urls = selector.xpath('//loc/text()').getall()

# Keep only URLs ending in .html, then get title, description and keywords
pattern = re.compile(r'.*\.html$')
for url in urls:
    url = url.strip()
    if pattern.match(url):
        response = requests.get(url, headers=headers)
        selector = parsel.Selector(text=response.text)
        # get(default='') returns an empty string when the tag is missing
        title = selector.css('title::text').get(default='')
        keywords = selector.xpath('//meta[@name="keywords"]/@content').get(default='')
        description = selector.xpath('//meta[@name="description"]/@content').get(default='')
        # Append the result to a txt file
        with open('output.txt', 'a', encoding='utf-8') as f:
            f.write('Title: {}\n'.format(title))
            f.write('Keywords: {}\n'.format(keywords))
            f.write('Description: {}\n'.format(description))
            f.write('URL: {}\n\n'.format(url))
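
Both versions in this thread read the same three fields from each page, and real pages often lack one of them. The sketch below shows the extraction step on a made-up HTML string, with missing tags falling back to an empty string. It uses the standard library's html.parser instead of BeautifulSoup or parsel so that it runs without extra packages. The MetaExtractor class, the extract_meta function, and the sample page are hypothetical names and data, not part of either script above.

```python
from html.parser import HTMLParser


class MetaExtractor(HTMLParser):
    """Collects <title> text plus the keywords and description meta tags."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.fields = {"title": "", "keywords": "", "description": ""}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self.in_title = True
        elif tag == "meta":
            name = (attrs.get("name") or "").lower()
            if name in ("keywords", "description"):
                self.fields[name] = attrs.get("content") or ""

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        # Title text may arrive in several chunks, so accumulate it
        if self.in_title:
            self.fields["title"] += data


def extract_meta(html_text):
    """Return a dict with title, keywords and description ('' if missing)."""
    parser = MetaExtractor()
    parser.feed(html_text)
    parser.close()
    fields = dict(parser.fields)
    fields["title"] = fields["title"].strip()
    return fields


# Made-up page with no description meta tag
SAMPLE_PAGE = """<html><head>
<title> Demo page </title>
<meta name="keywords" content="python, sitemap">
</head><body><p>No description meta tag here.</p></body></html>"""

print(extract_meta(SAMPLE_PAGE))
```

In a real run, the argument to extract_meta would be the response.text of each page fetched in the loop.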
