This is the code I have written so far:
[Python]
import fitz
import re
import os
import pandas as pd
from tqdm import tqdm


def extract_text_from_pdf(pdf_file):
    # Return the text of every page in the PDF, or "" if the file is empty.
    try:
        doc = fitz.open(pdf_file)
    except fitz.EmptyFileError:
        print(f"Skipping empty file: {pdf_file}")
        return ""
    text = ''
    for page_num in range(doc.page_count):
        page = doc.load_page(page_num)
        text += page.get_text()
    doc.close()
    return text


def extract_data_from_text(text, pattern, table_pattern):
    # Collect every keyword match and every table-header match in the text.
    matches = list(re.finditer(pattern, text))
    table_matches = list(re.finditer(table_pattern, text))
    if not matches:
        return 0
    for match in matches:
        for table_match in table_matches:
            if table_match.start()
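I want the 本期增加 (increase in the current period) value from the PDF written to the Excel sheet, but the script seems to have detected that the keyword appears twice in the text and filled in 2 instead.
The 住房公积金 (housing provident fund) value was not read either.
And this one is clearly in the PDF, yet the search did not find it at all.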
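One possible reading of the "filled in 2" symptom is that the number of keyword matches ended up in the cell rather than the amount printed next to the keyword. The following is only a minimal sketch of capturing the number that follows a label. The helper name `find_value_after_keyword`, the `NUMBER` pattern, and the sample text are all made up for illustration and are not from the original script.

```python
import re

# Assumed number format: optional minus sign, digits with optional
# thousands separators, optional decimal part. Adjust to the real PDFs.
NUMBER = r"-?\d[\d,]*(?:\.\d+)?"


def find_value_after_keyword(text, keyword):
    """Return the first number that directly follows `keyword`, or None.

    Hypothetical helper, not part of the original script.
    """
    # \s* also matches line breaks, which PDF text extraction often
    # places between a label and its value.
    pattern = re.escape(keyword) + r"\s*(" + NUMBER + r")"
    match = re.search(pattern, text)
    if match is None:
        return None
    return float(match.group(1).replace(",", ""))


# Invented sample text, only to show the difference between the two results.
sample = "住房公积金\n1,234.50\n其他 12"
print(len(re.findall("住房公积金", sample)))            # how many times the keyword occurs
print(find_value_after_keyword(sample, "住房公积金"))   # the amount after the keyword
```

This only works when the value comes immediately after its label in the extracted text. When 本期增加 is a column header, the text after it is usually the next header, so a table-aware lookup is likely a better fit.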
Could someone please help me extract the value under 本期增加 from the table in the PDF?
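One way to approach this, assuming the PDFs contain real (non-scanned) tables and PyMuPDF 1.23.0 or newer is installed, is to let `page.find_tables()` recover the rows and then read the cell where the 本期增加 column meets the wanted row. This is a sketch, not a tested solution for these particular files. The function names `lookup_column_value` and `lookup_in_pdf` and the sample rows are hypothetical.

```python
def lookup_column_value(rows, column_keyword, row_keyword):
    """Find the cell where the `column_keyword` column meets the `row_keyword` row.

    `rows` is a list of rows, each a list of cell strings (or None), which is
    the shape returned by PyMuPDF's `Table.extract()`. Hypothetical helper.
    """
    col_index = None
    for row in rows:
        # Normalise cells: None becomes "", line breaks inside a cell are dropped.
        cells = [(cell or "").replace("\n", "").strip() for cell in row]
        if col_index is None:
            # Still looking for the header row that names the column.
            for i, cell in enumerate(cells):
                if column_keyword in cell:
                    col_index = i
                    break
            continue
        if col_index < len(cells) and any(row_keyword in c for c in cells):
            return cells[col_index]
    return None


def lookup_in_pdf(pdf_file, column_keyword, row_keyword):
    """Search every table on every page; return the first matching cell or None."""
    import fitz  # PyMuPDF, imported here so the helper above runs without it

    with fitz.open(pdf_file) as doc:
        for page in doc:
            for table in page.find_tables().tables:
                value = lookup_column_value(
                    table.extract(), column_keyword, row_keyword
                )
                if value is not None:
                    return value
    return None


# Invented rows, only to illustrate the lookup.
example_rows = [
    ["项目", "期初余额", "本期增加", "本期减少", "期末余额"],
    ["住房公积金", "0.00", "1,234.50", "1,234.50", "0.00"],
]
print(lookup_column_value(example_rows, "本期增加", "住房公积金"))
```

The returned cell is a string, so it would still need converting to a number before being written to Excel with pandas. If `find_tables()` finds nothing, the table may be an image or drawn without ruling lines, and a different extraction strategy would be needed.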