Using Tencent Cloud to convert PDF to "Word"

Author: lianxiang1122 | Posted: 2024-6-19 07:00:31

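Converting PDF to Word is a real headache, especially for scanned documents that mix images and tables; those are basically very hard to get right.
I just noticed that Tencent Cloud recently released a new OCR product: Smart Structured Recognition (智能结构化)!
https://cloud.tencent.com/product/smart-ocr
After opening the page, log in with WeChat and activate the service to use it for free. Tencent Cloud is as generous as ever: you get 1,000 free calls per month!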


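Click the demo to try it out first. It can only recognize one PDF or image at a time. After recognition it generates a Markdown file; open that file in any suitable program and you can copy and paste the content into a Word document, which gets you PDF to Word indirectly.
My skills are limited, so I can't do it in one step; this is the best I have for now. Hopefully some expert can work out a direct conversion to Word.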
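For anyone who wants to skip the copy-and-paste step, here is a minimal sketch that hands the generated .md file to the pandoc command-line tool. This is not part of the original post: it assumes pandoc is installed and on PATH, the helper names `build_pandoc_command` and `markdown_to_docx` are made up for illustration, and how well tables and images survive depends on how they are written in the Markdown.

```python
import shutil
import subprocess
from pathlib import Path


def build_pandoc_command(md_path, docx_path):
    """Build the argument list for `pandoc <input.md> -o <output.docx>`."""
    # pandoc infers the output format from the .docx extension
    return ["pandoc", str(md_path), "-o", str(docx_path)]


def markdown_to_docx(md_path, docx_path=None):
    """Convert a Markdown file to .docx with pandoc and return the output path."""
    md_path = Path(md_path)
    # Default: same folder and name as the .md file, with a .docx extension
    docx_path = Path(docx_path) if docx_path else md_path.with_suffix(".docx")
    if shutil.which("pandoc") is None:
        raise RuntimeError("pandoc was not found on PATH")
    subprocess.run(build_pandoc_command(md_path, docx_path), check=True)
    return docx_path
```

Calling `markdown_to_docx("result.md")` would write `result.docx` next to the Markdown file.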


For example, open the .md file in PyCharm and its content is displayed; select all, copy, paste into Word, and you're done.


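Of course, recognizing files one at a time gets tedious. Tencent Cloud also provides an API: open the access guide in the console, find API 3.0 Explorer in step 2, and click through.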


This opens API 3.0 Explorer. I've shared how to use it before, so I won't explain every step here.


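Copy the generated code into your IDE and you can test it.
Note: a single call can recognize at most 10 pages of a PDF; if the file has more than 10 pages, you have to call the API several times.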
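To make the 10-page limit concrete, here is a small sketch of how a page count could be split into ranges for separate calls. The function name `page_ranges` is made up for illustration, and it assumes the start and end page numbers are 1-based and inclusive.

```python
def page_ranges(page_count, max_pages=10):
    """Return 1-based, inclusive (start, end) page ranges that cover the PDF.

    Each range spans at most max_pages pages, so every range fits in one call.
    """
    if page_count < 0 or max_pages < 1:
        raise ValueError("page_count must be >= 0 and max_pages must be >= 1")
    ranges = []
    for start in range(1, page_count + 1, max_pages):
        end = min(start + max_pages - 1, page_count)
        ranges.append((start, end))
    return ranges


# A 23-page PDF needs three calls: pages 1-10, 11-20 and 21-23
example = page_ranges(23)
```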
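I won't go through every step in detail. With the help of ChatGPT, I made a version with a GUI for your reference.
A brief explanation: first select the PDF file, then choose where to save the generated Markdown file, and enter your ID and KEY. The program first checks whether an idkey.txt file exists on drive D. If it does, the ID and KEY are read from it automatically; if not, you need to enter them. When you click the convert button, the ID and KEY are saved to drive D, so next time they are read from there and you don't have to enter them again. When the PDF has more than 10 pages, it is handled in chunks with multiple requests, the results are joined, and a single .md file is generated at the end.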
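The ID/KEY caching described above could look roughly like the sketch below. The post does not show the layout of idkey.txt, so the two-line format (ID on the first line, KEY on the second) and the helper names `load_credentials` and `save_credentials` are assumptions for illustration only.

```python
import tempfile
from pathlib import Path


def load_credentials(path):
    """Return (secret_id, secret_key) from the cache file, or None if unusable."""
    path = Path(path)
    if not path.exists():
        return None
    lines = path.read_text(encoding="utf-8").splitlines()
    # Assumed layout: line 1 is the ID, line 2 is the KEY
    if len(lines) < 2 or not lines[0].strip() or not lines[1].strip():
        return None
    return lines[0].strip(), lines[1].strip()


def save_credentials(path, secret_id, secret_key):
    """Write the ID and KEY to the cache file, one per line."""
    Path(path).write_text(f"{secret_id}\n{secret_key}\n", encoding="utf-8")


# Demo with a temporary file standing in for D:/idkey.txt
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "idkey.txt"
    first_read = load_credentials(demo_path)   # no cache file yet
    save_credentials(demo_path, "my-id", "my-key")
    second_read = load_credentials(demo_path)  # now read back from the file
```

In the GUI, `load_credentials` would run at startup to prefill the input fields, and `save_credentials` would run when the convert button is clicked.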
[Python]
import os
import tkinter as tk
from tkinter import filedialog, messagebox
import base64
import fitz  # PyMuPDF
import json
from tencentcloud.common.common_client import CommonClient
from tencentcloud.common import credential
from tencentcloud.common.profile.client_profile import ClientProfile
from tencentcloud.common.profile.http_profile import HttpProfile
from tencentcloud.common.exception.tencent_cloud_sdk_exception import TencentCloudSDKException


def read_pdf_to_base64(pdf_path):
    """Read the PDF file and return its content as a Base64 string."""
    with open(pdf_path, 'rb') as pdf_file:
        binary_data = pdf_file.read()
    return base64.b64encode(binary_data).decode('utf-8')


def decode_base64_to_markdown(base64_str):
    """Decode the Base64 string returned by the API into Markdown text."""
    decoded_bytes = base64.b64decode(base64_str)
    return decoded_bytes.decode('utf-8')


def save_as_md_file(content, filename):
    """Write the Markdown text to a UTF-8 file."""
    with open(filename, 'w', encoding='utf-8') as file:
        file.write(content)


def ocr_markdown(base64_string, FileStartPageNumber, FileEndPageNumber, secret_id, secret_key):
    """Call ReconstructDocument for one page range; return Base64 Markdown or None."""
    cred = credential.Credential(secret_id, secret_key)
    httpProfile = HttpProfile()
    httpProfile.endpoint = "ocr.tencentcloudapi.com"
    clientProfile = ClientProfile()
    clientProfile.httpProfile = httpProfile
    params = {
        "FileType": "PDF",
        "FileBase64": base64_string,
        "FileStartPageNumber": FileStartPageNumber,
        "FileEndPageNumber": FileEndPageNumber
    }
    common_client = CommonClient("ocr", "2018-11-19", cred, "ap-guangzhou", profile=clientProfile)
    try:
        response = common_client.call_json("ReconstructDocument", params)
        return response['Response']['MarkdownBase64']
    except TencentCloudSDKException as err:
        error_message = f"An error occurred: {err}\nPlease enter the correct API Secret ID and Secret Key."
        messagebox.showerror("Error", error_message)

        # The cached credentials may be wrong, so offer to delete the cache file
        idkey_file = 'D:/idkey.txt'
        if os.path.exists(idkey_file):
            if messagebox.askyesno("Delete idkey.txt", "The idkey.txt file exists. Do you want to delete it?"):
                os.remove(idkey_file)
                messagebox.showinfo("Deleted", "idkey.txt file has been deleted. Please enter the correct API Secret ID and Secret Key.")
        return None


def process_pdf(pdf_path, secret_id, secret_key, output_dir):
    base64_string = read_pdf_to_base64(pdf_path)
    doc = fitz.open(pdf_path)
    page_count = doc.page_count
    markdown_output = ''
    # The output file takes the PDF's base name with a .md extension
    pdf_filename = os.path.splitext(os.path.basename(pdf_path))[0]
    output_filepath = os.path.join(output_dir, f"{pdf_filename}.md")
    if page_count
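The listing stops at `if page_count`, so the rest of the author's program is not shown. As a rough sketch of the "several requests, then join" step described earlier (not the author's code), the loop below calls an OCR function once per page range and concatenates the decoded Markdown. The name `convert_in_chunks` and the `ocr_chunk` callable are hypothetical, and page numbers are assumed to be 1-based and inclusive; a fake OCR function stands in for the real API so the example runs offline.

```python
import base64


def convert_in_chunks(page_count, ocr_chunk, max_pages=10):
    """Call ocr_chunk(start, end) for each page range and join the Markdown.

    ocr_chunk is a hypothetical callable that returns Base64-encoded Markdown,
    or None when the request failed.
    """
    parts = []
    for start in range(1, page_count + 1, max_pages):
        end = min(start + max_pages - 1, page_count)
        encoded = ocr_chunk(start, end)
        if encoded is None:
            return None  # stop at the first failed request
        parts.append(base64.b64decode(encoded).decode("utf-8"))
    # A blank line between chunks keeps the Markdown blocks separate
    return "\n\n".join(parts)


# Offline demo: a fake OCR function that records its calls
calls = []


def fake_ocr(start, end):
    calls.append((start, end))
    text = f"pages {start}-{end}"
    return base64.b64encode(text.encode("utf-8")).decode("ascii")


result = convert_in_chunks(12, fake_ocr)
```

With the functions from the listing, `ocr_chunk` could be something like `lambda s, e: ocr_markdown(base64_string, s, e, secret_id, secret_key)`, and the joined text would then go to `save_as_md_file`.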
