A Python function for extracting Chinese text from images with pytesseract

Published: 2023-12-26 08:32:45

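To extract Chinese text from an image with pytesseract in Python, first install the pytesseract package and the Tesseract OCR engine, and add the directory containing the Tesseract executable to the PATH environment variable. The Simplified Chinese language data (chi_sim) must also be installed for Tesseract, because the function relies on it. The following function then extracts the Chinese text: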

import pytesseract
from PIL import Image

def extract_chinese_from_image(image_path):
    # Path to the Tesseract executable; 'tesseract' works when it is on PATH
    pytesseract.pytesseract.tesseract_cmd = 'tesseract'

    # Open the image and convert it to grayscale
    image = Image.open(image_path).convert('L')

    # --psm 6: treat the image as a single uniform block of text
    tesseract_config = '--psm 6'

    # Run OCR with the Simplified Chinese language data (chi_sim).
    # The extra English-only pass was dropped because its result was never used.
    text_chinese = pytesseract.image_to_string(image, lang='chi_sim', config=tesseract_config)

    # Return the extracted Chinese text
    return text_chinese

Usage example:

image_path = 'example.jpg'  # Path to the image
chinese_text = extract_chinese_from_image(image_path)  # Extract the Chinese text
print(chinese_text)
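
If the call fails because Tesseract or the chi_sim language data cannot be found, a quick environment check can help narrow the problem down. The sketch below is only an illustration: the helper name check_tesseract is hypothetical, and it assumes the executable is named tesseract and is on PATH. It uses Tesseract's --list-langs option to see which language data is installed.

```python
import shutil
import subprocess

def check_tesseract(lang="chi_sim"):
    """Return (found, has_lang): whether tesseract is on PATH and lists `lang`."""
    exe = shutil.which("tesseract")
    if exe is None:
        return False, False
    # --list-langs prints a header line followed by one language code per line;
    # some versions write this to stderr, so both streams are scanned.
    result = subprocess.run([exe, "--list-langs"], capture_output=True, text=True)
    lines = (result.stdout + "\n" + result.stderr).splitlines()
    langs = {line.strip() for line in lines}
    return True, lang in langs

found, has_lang = check_tesseract()
print("tesseract on PATH:", found)
print("chi_sim installed:", has_lang)
```

If the first value is False, the PATH setup needs attention. If only the second is False, the chi_sim data file is probably missing from Tesseract's tessdata directory.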

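Note: pytesseract's accuracy on Chinese text may not be high enough out of the box, so the image may need preprocessing (such as adjusting brightness or contrast, or sharpening) to improve the results.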
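
The preprocessing steps mentioned in the note could be sketched with Pillow as follows. This is a minimal illustration, not a tuned pipeline: the helper name preprocess_for_ocr, the default factors, and the optional threshold parameter are assumptions made for this example, and suitable values depend on the image.

```python
from PIL import Image, ImageEnhance, ImageFilter

def preprocess_for_ocr(image, brightness=1.0, contrast=1.5, threshold=None):
    """Return a grayscale copy of `image` adjusted for OCR (values are illustrative)."""
    # Convert to grayscale first, as in the extraction function
    image = image.convert('L')
    # A factor of 1.0 leaves the image unchanged; larger values brighten or add contrast
    image = ImageEnhance.Brightness(image).enhance(brightness)
    image = ImageEnhance.Contrast(image).enhance(contrast)
    # Sharpen edges with Pillow's built-in filter
    image = image.filter(ImageFilter.SHARPEN)
    # Optional binarization: pixels above the threshold become white, the rest black
    if threshold is not None:
        image = image.point(lambda p: 255 if p > threshold else 0)
    return image
```

The returned image can be passed to pytesseract.image_to_string in place of the plain grayscale image, for example preprocess_for_ocr(Image.open('example.jpg'), threshold=128).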