How do I scrape Baidu Scholar search result pages?

Author: yurika34
When I use the requests library I get identified as a crawler and the site just returns empty content. Here is my code:

[Python]
import requests
import time
search_url = 'https://xueshu.baidu.com/s?wd=%E8%A7%86%E9%A2%91%E6%8F%8F%E8%BF%B0&rsv_bp=0&tn=SE_baiduxueshu_c1gjeupa&rsv_spt=3&ie=utf-8&f=8&rsv_sug2=0&sc_f_para=sc_tasktype%3D%7BfirstSimpleSearch%7D'
headers = {
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36 Edg/108.0.1462.46',
        'cookie':'BDUSS=np0bUdsQWtBWXdmN2NYWldLTkNBU3c5bkRjN35GY2JrbkhocUFzenMzc2ZHaTFrRVFBQUFBJCQAAAAAAAAAAAEAAAAFM3r2cXdmZ8qxtPoAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAB-NBWQfjQVkY; BDUSS_BFESS=np0bUdsQWtBWXdmN2NYWldLTkNBU3c5bkRjN35GY2JrbkhocUFzenMzc2ZHaTFrRVFBQUFBJCQAAAAAAAAAAAEAAAAFM3r2cXdmZ8qxtPoAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAB-NBWQfjQVkY; BAIDU_WISE_UID=wapp_1678505642304_564; BAIDUID=65F895C23BF832A9420E4AEFC67B4BAB:FG=1; BAIDUID_BFESS=65F895C23BF832A9420E4AEFC67B4BAB:FG=1; BIDUPSID=65F895C23BF832A9420E4AEFC67B4BAB; PSTM=1683082612; ZFY=enLlhhs:BDuzXxvTN8DSru2QkC3DZ:AfVKtu4:BqvNQdS8:C; RT="z=1&dm=baidu.com&si=2a8178ae-479b-44e4-8225-02ee4de854a1&ss=lhh6onkb&sl=4&tt=6ix&bcn=https%3A%2F%2Ffclog.baidu.com%2Flog%2Fweirwood%3Ftype%3Dperf&ld=7jk&nu=47cvtjw2&cl=295i&ul=2o27&hd=2o3e"; ariaDefaultTheme=undefined; delPer=0; BD_HOME=0; H_PS_PSSID=; Hm_lvt_f28578486a5410f35e6fbd0da5361e5f=1683769241,1683862430; Hm_lpvt_f28578486a5410f35e6fbd0da5361e5f=1683862430; BDRCVFR[w2jhEs_Zudc]=mbxnW11j9Dfmh7GuZR8mvqV; BD_CK_SAM=1; PSINO=1; BDSVRTM=372',
        'referer': 'https://xueshu.baidu.com/',
        'Host': 'xueshu.baidu.com',
    }
res = requests.get(url=search_url, headers=headers)
print(res.text)

Referer and Cookie are both set, so how is the system still identifying me and returning empty content?
Any help from an expert would be much appreciated!


T4DNA   


yurika34 posted on 2023-5-13 09:02
OK, here are my code and the results: https://www.aliyundrive.com/s/3p8NG2P3zHf

It is indeed an IP limit. With cookies you can hold out for a few more requests, so make the sleep longer. Either buy crawler proxies or crack that rotating captcha.
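The "sleep longer" advice above can be sketched as a small throttling helper. The base/jitter values here are placeholder defaults, not tuned thresholds, and `get` stands for any request callable you choose:

```python
import random
import time

def polite_delay(base=5.0, jitter=3.0):
    # Randomized interval so requests don't arrive at a machine-regular rate
    return base + random.uniform(0, jitter)

def fetch_all(get, urls, base=5.0, jitter=3.0):
    # `get` is any callable like requests.get or httpx.get;
    # sleep a randomized interval between consecutive requests
    pages = []
    for url in urls:
        pages.append(get(url))
        time.sleep(polite_delay(base, jitter))
    return pages
```

To combine it with a proxy, pass something like `lambda u: requests.get(u, headers=headers, proxies=proxies)` as `get`.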
yurika34
OP
  

Sorry, the code above came out mangled. Here it is again:
[Python]
import requests
import time
search_url = 'https://xueshu.baidu.com/s?wd=%E8%A7%86%E9%A2%91%E6%8F%8F%E8%BF%B0&rsv_bp=0&tn=SE_baiduxueshu_c1gjeupa&rsv_spt=3&ie=utf-8&f=8&rsv_sug2=0&sc_f_para=sc_tasktype%3D%7BfirstSimpleSearch%7D'
headers = {
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36 Edg/108.0.1462.46',
        'cookie':'BDUSS=np0bUdsQWtBWXdmN2NYWldLTkNBU3c5bkRjN35GY2JrbkhocUFzenMzc2ZHaTFrRVFBQUFBJCQAAAAAAAAAAAEAAAAFM3r2cXdmZ8qxtPoAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAB-NBWQfjQVkY; BDUSS_BFESS=np0bUdsQWtBWXdmN2NYWldLTkNBU3c5bkRjN35GY2JrbkhocUFzenMzc2ZHaTFrRVFBQUFBJCQAAAAAAAAAAAEAAAAFM3r2cXdmZ8qxtPoAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAB-NBWQfjQVkY; BAIDU_WISE_UID=wapp_1678505642304_564; BAIDUID=65F895C23BF832A9420E4AEFC67B4BAB:FG=1; BAIDUID_BFESS=65F895C23BF832A9420E4AEFC67B4BAB:FG=1; BIDUPSID=65F895C23BF832A9420E4AEFC67B4BAB; PSTM=1683082612; ZFY=enLlhhs:BDuzXxvTN8DSru2QkC3DZ:AfVKtu4:BqvNQdS8:C; RT="z=1&dm=baidu.com&si=2a8178ae-479b-44e4-8225-02ee4de854a1&ss=lhh6onkb&sl=4&tt=6ix&bcn=https%3A%2F%2Ffclog.baidu.com%2Flog%2Fweirwood%3Ftype%3Dperf&ld=7jk&nu=47cvtjw2&cl=295i&ul=2o27&hd=2o3e"; ariaDefaultTheme=undefined; delPer=0; BD_HOME=0; H_PS_PSSID=; Hm_lvt_f28578486a5410f35e6fbd0da5361e5f=1683769241,1683862430; Hm_lpvt_f28578486a5410f35e6fbd0da5361e5f=1683862430; BDRCVFR[w2jhEs_Zudc]=mbxnW11j9Dfmh7GuZR8mvqV; BD_CK_SAM=1; PSINO=1; BDSVRTM=372',
        'referer': 'https://xueshu.baidu.com/',
        'Host': 'xueshu.baidu.com',
    }
res = requests.get(url=search_url,headers=headers)
print(res.url)
Laodao2333   

[Python]
import time
from selenium import webdriver
from lxml import etree
browser = webdriver.Edge()
# open the site homepage
browser.get('https://xueshu.baidu.com')
# get the page source
html = browser.page_source
# pause while you scan the QR code to log in
time.sleep(20)
for page in range(8):
    browser.get('https://xueshu.baidu.com/s?wd=%E8%A7%86%E9%A2%91%E6%8F%8F%E8%BF%B0&pn='+str(10*page)+'&tn=SE_baiduxueshu_c1gjeupa&ie=utf-8&sc_f_para=sc_tasktype%3D%7BfirstSimpleSearch%7D&sc_hit=1')
    # get the page source
    html = browser.page_source
    tree = etree.HTML(html)
    links = tree.xpath('//*[@id]/div/h3/a/@href')
    for link in links:
        browser.get(link)
        print(link)
        print(etree.HTML(browser.page_source).xpath('//*[@id="dtl_l"]/div[1]/h3/a/text()'))
browser.quit()
After the browser window pops up, click Login at the top and scan the QR code to log in first.
T4DNA   

The site probably checks for requests-specific fingerprints, so this isn't something you can avoid just by changing a few parameters; if you insist on using requests you would need to patch the library itself. Try making the request with urllib or httpx instead: with just a User-Agent you can get the full content.
[Python]
import urllib.request
search_url = 'https://xueshu.baidu.com/s?wd=%E8%A7%86%E9%A2%91%E6%8F%8F%E8%BF%B0&rsv_bp=0&tn=SE_baiduxueshu_c1gjeupa&rsv_spt=3&ie=utf-8&f=8&rsv_sug2=0&sc_f_para=sc_tasktype%3D%7BfirstSimpleSearch%7D'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/113.0.0.0 Safari/537.36 Edg/113.0.1774.35',
}
request = urllib.request.Request(search_url, headers=headers)
response = urllib.request.urlopen(request)
print(response.read().decode('utf-8'))
[Python]
import httpx
user_agent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Ch' \
    'rome/58.0.3029.81 Safari/537.36'
headers = {
    'User-Agent': user_agent,
}
url = 'https://xueshu.baidu.com/s?wd=%E8%A7%86%E9%A2%91%E6%8F%8F%E8%BF%B0&rsv_bp=0&tn=SE_baiduxueshu_c1gjeupa&rsv_spt=3&ie=utf-8&f=8&rsv_sug2=0&sc_f_para=sc_tasktype%3D%7BfirstSimpleSearch%7D'
r = httpx.get(url, headers=headers)
print(r.text)
yurika34
OP
  


T4DNA posted on 2023-5-12 16:52
The site probably checks for requests-specific fingerprints, so this isn't something you can avoid just by changing a few parameters; if you insist on using requests you would need to patch the library ...

I tried it. The urllib version didn't get the full content, but httpx did. After scraping twenty-odd pages, though, Baidu detected it as well and started returning 302 redirects. Is there a way around this?
[HTML]
302 Found.
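Recent httpx versions do not follow redirects by default, which is consistent with the 302 body shown above, so the 302 can be caught and treated as a back-off signal instead of followed. A sketch; the `fetch` callable and the cooldown length are assumptions, not measured values:

```python
import time

def get_with_backoff(fetch, url, max_retries=3, cooldown=60):
    # fetch(url) should return a response object,
    # e.g. lambda u: httpx.get(u, headers=headers)
    resp = None
    for attempt in range(max_retries):
        resp = fetch(url)
        if resp.status_code != 302:
            return resp  # normal page: hand it back
        time.sleep(cooldown * (attempt + 1))  # got a 302: wait longer each retry
    return resp  # still 302 after all retries; caller decides what to do
```

If the 302 keeps coming back even after long cooldowns, that points to fingerprinting or a captcha rather than simple rate limiting.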
T4DNA   


yurika34 posted on 2023-5-12 21:02
I tried it. The urllib version didn't get the full content, but httpx did. After scraping twenty-odd pages, though, Baidu detected it as well and returned 302 re ...

Could you upload a copy of your crawl list to Baidu Cloud for me to look at?
T4DNA   


yurika34 posted on 2023-5-12 21:02
I tried it. The urllib version didn't get the full content, but httpx did. After scraping twenty-odd pages, though, Baidu detected it as well and returned 302 re ...

In principle it shouldn't be a per-IP request-count limit.
yurika34
OP
  


T4DNA posted on 2023-5-12 21:42
In principle it shouldn't be a per-IP request-count limit

OK, here are my code and the results: https://www.aliyundrive.com/s/3p8NG2P3zHf
Laodao2333   

Give Selenium a try.