Please help me fix this Python code

Views 128 | Replies 9
Author: liangyiyi
Once Limit exceeds 4500 the scrape fails. Page: https://www.sceea.cn/Information/Sunshine?id=210&nf=2023#
[Python]
import requests
import json
web_url = "https://www.sceea.cn/Information/GetSunshineList"
payload = {
    "Limit": "200000",
    "Page": "1",
    "gslbid": "210",
    "domParam": "",
    "Key": ""
}
headers = {
    "Origin": "https://www.sceea.cn",
    "Referer": "https://wwwsceeacn/Information/Sunshine?id=210&nf=2023",
    "Content-Type": "application/x-www-form-urlencoded; charset=utf-8"
}
response = requests.post(web_url, data=payload, headers=headers)
json_data = response.json()["gsb"]
print(response.text)
ar_data = []
for item in json_data:
    ar_data.append([
        item["GSXM1"],
        item["GSXM2"],
        item["GSXM3"],
        item["GSXM4"],
        item["GSXM5"],
        item["GSXM6"],
        item["GSXM7"],
        item["GSLBID"],
        item["ID"]
    ])
# Write the data to the Excel workbook (this writes to the active sheet, starting at row 2)
import openpyxl
workbook = openpyxl.Workbook()
sheet = workbook.active
for row in range(len(ar_data)):
    for col in range(len(ar_data[row])):
        sheet.cell(row=row+2, column=col+1).value = ar_data[row][col]
workbook.save("output.xlsx")

The code above just can't get past that limit.

CANTON   

[Python]
import requests
import json
import openpyxl
web_url = "https://www.sceea.cn/Information/Sunshine?id=210&nf=2023#"
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
}
response = requests.get(web_url, headers=headers)
content = response.content.decode("utf-8")
# Parse the data embedded in the page
start_index = content.find("gsb: ") + len("gsb: ")
end_index = content.find(",\n};", start_index)
json_data = json.loads(content[start_index:end_index])
ar_data = []
for item in json_data:
    ar_data.append([
        item["GSXM1"],
        item["GSXM2"],
        item["GSXM3"],
        item["GSXM4"],
        item["GSXM5"],
        item["GSXM6"],
        item["GSXM7"],
        item["GSLBID"],
        item["ID"]
    ])
# Write the data to the Excel workbook (this writes to the active sheet, starting at row 2)
workbook = openpyxl.Workbook()
sheet = workbook.active
for row in range(len(ar_data)):
    for col in range(len(ar_data[row])):
        sheet.cell(row=row+2, column=col+1).value = ar_data[row][col]
workbook.save("output.xlsx")
youth96   

Try a smaller Limit and use Page as well; don't leave Page stuck at 1.
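
A minimal sketch of that idea, reusing web_url, headers, and the payload fields from the original post; the 1000-row batch size is an arbitrary guess, untested against the live site.

[Python]
page = 1
all_rows = []
while True:
    payload = {"Limit": "1000", "Page": str(page), "gslbid": "210", "domParam": "", "Key": ""}
    batch = requests.post(web_url, data=payload, headers=headers).json().get("gsb") or []
    if not batch:
        break  # an empty page means there is no more data
    all_rows.extend(batch)
    page += 1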
liangyiyi
OP
  


CANTON posted on 2023-6-29 10:08
import requests
import json
import openpyxl

Traceback (most recent call last):
  File "C:/Users/amzing/Desktop/add score.py", line 17, in
    json_data = json.loads(content[start_index:end_index])
  File "C:\Users\amzing\AppData\Local\Programs\Python\Python38\lib\json\__init__.py", line 357, in loads
    return _default_decoder.decode(s)
  File "C:\Users\amzing\AppData\Local\Programs\Python\Python38\lib\json\decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "C:\Users\amzing\AppData\Local\Programs\Python\Python38\lib\json\decoder.py", line 355, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
Still throwing the same error.
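
A likely cause: str.find() returns -1 when the marker is missing, so if the rendered HTML does not contain the literal text "gsb: ", start_index becomes -1 + 5 = 4 and the slice handed to json.loads() is plain HTML. A quick guard makes that visible (a debugging sketch, not a fix for the page structure):

[Python]
start_index = content.find("gsb: ")
if start_index == -1:
    raise SystemExit("marker 'gsb: ' not found; the data is probably served by the "
                     "POST endpoint rather than embedded in the page HTML")
start_index += len("gsb: ")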
grekevin   

Suggestions:
1. If 4500 is the cap, use 4000 per page and paginate; see the sketch below.
2. Add a delay between requests; crawling too fast causes errors.
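
One way to combine both points, as a sketch: fetch_page, the retry count, and the delay values are invented for illustration, while web_url and headers are the ones from the original post.

[Python]
import time

def fetch_page(page, limit=4000, retries=3, delay=2):
    """POST one page of results, sleeping between attempts so the server is not hammered."""
    payload = {"Limit": str(limit), "Page": str(page), "gslbid": "210", "domParam": "", "Key": ""}
    for _ in range(retries):
        response = requests.post(web_url, data=payload, headers=headers, timeout=30)
        if response.ok:
            try:
                return response.json().get("gsb") or []
            except ValueError:  # the body was not JSON, e.g. an HTML error page
                pass
        time.sleep(delay)  # back off before the next attempt
    return []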
95827   

GPT4.0: From your description, I understand the problem is that you cannot fetch the data once the requested amount exceeds 4,500 records. This is likely because the server limits how much data a single request may return. To work around it, we can request the data in batches of 1,000 records each, until we reach 200,000 records or there is no more data, and then save everything to one Excel file. Here is the modified code:
[Python]
import requests
import json
import openpyxl
web_url = "https://www.sceea.cn/Information/GetSunshineList"
headers = {
    "Origin": "https://www.sceea.cn",
    "Referer": "https://wwwsceeacn/Information/Sunshine?id=210&nf=2023",
    "Content-Type": "application/x-www-form-urlencoded; charset=utf-8"
}
def get_data(page, limit):
    payload = {
        "Limit": limit,
        "Page": page,
        "gslbid": "210",
        "domParam": "",
        "Key": ""
    }
    response = requests.post(web_url, data=payload, headers=headers)
    json_data = response.json()["gsb"]
    return json_data
def save_data_to_excel(ar_data, file_name):
    workbook = openpyxl.Workbook()
    sheet = workbook.active
    for row in range(len(ar_data)):
        for col in range(len(ar_data[row])):
            sheet.cell(row=row+2, column=col+1).value = ar_data[row][col]
    workbook.save(file_name)
limit_per_request = 1000
total_data_limit = 200000
ar_data = []
page = 1
while len(ar_data) < total_data_limit:
    json_data = get_data(page, limit_per_request)
    if not json_data:
        break  # no more rows available
    for item in json_data:
        ar_data.append([
            item["GSXM1"], item["GSXM2"], item["GSXM3"],
            item["GSXM4"], item["GSXM5"], item["GSXM6"],
            item["GSXM7"], item["GSLBID"], item["ID"]
        ])
    page += 1
save_data_to_excel(ar_data, "output.xlsx")
liangyiyi
OP
  


95827 posted on 2023-6-29 12:02
GPT4.0: From your description, I understand the problem is that you cannot fetch the data once the requested amount exceeds 4,500 ...

Traceback (most recent call last):
  File "C:\Users\amzing\AppData\Local\Programs\Python\Python38\lib\site-packages\requests\models.py", line 971, in json
    return complexjson.loads(self.text, **kwargs)
  File "C:\Users\amzing\AppData\Local\Programs\Python\Python38\lib\json\__init__.py", line 357, in loads
    return _default_decoder.decode(s)
  File "C:\Users\amzing\AppData\Local\Programs\Python\Python38\lib\json\decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "C:\Users\amzing\AppData\Local\Programs\Python\Python38\lib\json\decoder.py", line 355, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "C:/Users/amzing/Desktop/ad.py", line 40, in
    json_data = get_data(page, limit_per_request)
  File "C:/Users/amzing/Desktop/ad.py", line 21, in get_data
    json_data = response.json()["gsb"]
  File "C:\Users\amzing\AppData\Local\Programs\Python\Python38\lib\site-packages\requests\models.py", line 975, in json
    raise RequestsJSONDecodeError(e.msg, e.doc, e.pos)
requests.exceptions.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
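
"Expecting value: line 1 column 1" means the response body was not JSON at all, often an HTML error page or an empty reply. Before calling .json(), it helps to look at what actually came back. A debugging sketch:

[Python]
response = requests.post(web_url, data=payload, headers=headers)
print(response.status_code)
print(response.text[:200])  # peek at the raw body; HTML here explains the JSON decode error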
fei5788   

[Python]
import requests
import json
limit = 4500  # desired limit; the server caps a single request at 4500, so larger values are clamped
web_url = "https://www.sceea.cn/Information/GetSunshineList"
payload = {
    "Limit": str(min(limit, 4500)),  # 取limit和4500中的较小值
    "Page": "1",
    "gslbid": "210",
    "domParam": "",
    "Key": ""
}
headers = {
    "Origin": "https://www.sceea.cn",
    "Referer": "https://wwwsceeacn/Information/Sunshine?id=210&nf=2023",
    "Content-Type": "application/x-www-form-urlencoded; charset=utf-8"
}
response = requests.post(web_url, data=payload, headers=headers)
json_data = response.json()["gsb"]
print(response.text)
ar_data = []
for item in json_data:
    ar_data.append([
        item["GSXM1"],
        item["GSXM2"],
        item["GSXM3"],
        item["GSXM4"],
        item["GSXM5"],
        item["GSXM6"],
        item["GSXM7"],
        item["GSLBID"],
        item["ID"]
    ])
# Write the data to the Excel workbook (this writes to the active sheet, starting at row 2)
import openpyxl
workbook = openpyxl.Workbook()
sheet = workbook.active
for row in range(len(ar_data)):
    for col in range(len(ar_data[row])):
        sheet.cell(row=row+2, column=col+1).value = ar_data[row][col]
workbook.save("output.xlsx")
liangyiyi
OP
  


fei5788 posted on 2023-6-29 16:32
import requests
import json

Bro, this still doesn't get everything. The full dataset is about 100k rows. How do I handle that?
freelive   


liangyiyi posted on 2023-6-29 18:13
Bro, this still doesn't get everything. The full dataset is about 100k rows. How ...

Don't fetch a huge amount of data in one request. Set the limit to a smaller value, add a delay between requests, and read and save in batches until the total count is covered.
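
Putting the whole thread's advice together (small batches, a delay between requests, one final Excel write), a consolidated sketch follows. It reuses the endpoint, payload fields, and GSXM column names from the earlier posts; the 1000-row batch size and 1-second delay are guesses, and it has not been run against the live site.

[Python]
import time
import requests
import openpyxl

web_url = "https://www.sceea.cn/Information/GetSunshineList"
headers = {
    "Origin": "https://www.sceea.cn",
    "Referer": "https://www.sceea.cn/Information/Sunshine?id=210&nf=2023",
    "Content-Type": "application/x-www-form-urlencoded; charset=utf-8",
}
FIELDS = ["GSXM1", "GSXM2", "GSXM3", "GSXM4",
          "GSXM5", "GSXM6", "GSXM7", "GSLBID", "ID"]

ar_data = []
page = 1
while True:
    payload = {"Limit": "1000", "Page": str(page),
               "gslbid": "210", "domParam": "", "Key": ""}
    response = requests.post(web_url, data=payload, headers=headers, timeout=30)
    batch = response.json().get("gsb") or []
    if not batch:
        break  # an empty page means every row has been fetched
    ar_data.extend([item[f] for f in FIELDS] for item in batch)
    print(f"page {page}: {len(batch)} rows, {len(ar_data)} total")
    page += 1
    time.sleep(1)  # pause between requests so the server is not hammered

# one write at the end, starting at row 2 so row 1 stays free for headers
workbook = openpyxl.Workbook()
sheet = workbook.active
for r, row_values in enumerate(ar_data, start=2):
    for c, value in enumerate(row_values, start=1):
        sheet.cell(row=r, column=c, value=value)
workbook.save("output.xlsx")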