Explore amazon-textract-textractor

What is amazon-textract-textractor?

The native Textract API returns one huge JSON object. To interpret it you have to read the "Text Detection and Document Analysis Response Objects" documentation, and then write your own program to parse the JSON and process the data further.

amazon-textract-textractor is an open-source Python project from AWS that aims to make Textract easier to use. Put simply, it is a further layer of wrapping around this JSON object.
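
To illustrate what hand-parsing the raw response involves, here is a minimal sketch. It assumes the response layout described in the Textract documentation: a flat `Blocks` list whose entries point at each other through `Relationships`. The tiny `response` dict is hand-made for illustration, not real Textract output, and `lines_with_words` is a hypothetical helper name.

```python
from typing import Dict, List

# Hand-made miniature of a Textract response (illustrative only).
response = {
    "Blocks": [
        {"Id": "p1", "BlockType": "PAGE",
         "Relationships": [{"Type": "CHILD", "Ids": ["l1"]}]},
        {"Id": "l1", "BlockType": "LINE", "Text": "Carrie Rodgers",
         "Relationships": [{"Type": "CHILD", "Ids": ["w1", "w2"]}]},
        {"Id": "w1", "BlockType": "WORD", "Text": "Carrie"},
        {"Id": "w2", "BlockType": "WORD", "Text": "Rodgers"},
    ]
}


def lines_with_words(resp: dict) -> Dict[str, List[str]]:
    """Map each LINE block's text to the texts of its child WORD blocks."""
    blocks_by_id = {block["Id"]: block for block in resp["Blocks"]}
    result: Dict[str, List[str]] = {}
    for block in resp["Blocks"]:
        if block["BlockType"] != "LINE":
            continue
        # Children are referenced by ID, so every lookup goes through the index.
        child_ids = [
            child_id
            for rel in block.get("Relationships", [])
            if rel["Type"] == "CHILD"
            for child_id in rel["Ids"]
        ]
        result[block["Text"]] = [blocks_by_id[cid]["Text"] for cid in child_ids]
    return result


print(lines_with_words(response))
# {'Carrie Rodgers': ['Carrie', 'Rodgers']}
```

Even this small case needs an ID lookup table and a relationship walk, which is the kind of bookkeeping the wrapper libraries take over.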

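amazon-textract-textractor is the top-level project. It is built from the following modules:

  • amazon-textract-caller: a wrapper around boto3, needed because boto3's API functions expose no explicit signature and no type hints.

  • amazon-textract-response-parser: an object-oriented wrapper around the JSON response.

These two are the core of amazon-textract-textractor and are installed automatically as its dependencies.

  • amazon-textract-overlayer: draws bounding boxes on a PDF or image.

  • amazon-textract-prettyprinter: converts Textract results into other formats such as CSV and Markdown.

  • amazon-textract-geofinder: searches Textract entities by their coordinates. It is backed by a SQLite database.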

[1]:
# Install all of them up front
%pip install amazon-textract-textractor
%pip install amazon-textract-caller
%pip install amazon-textract-response-parser
%pip install amazon-textract-geofinder
%pip install amazon-textract-prettyprinter
%pip install amazon-textract-overlayer
Requirement already satisfied: amazon-textract-textractor in /Users/sanhehu/venvs/python/3.8.11/dev_exp_share_venv/lib/python3.8/site-packages (1.0.18)
Requirement already satisfied: tabulate==0.8.* in /Users/sanhehu/venvs/python/3.8.11/dev_exp_share_venv/lib/python3.8/site-packages (from amazon-textract-textractor) (0.8.10)
Requirement already satisfied: amazon-textract-response-parser==0.1.33 in /Users/sanhehu/venvs/python/3.8.11/dev_exp_share_venv/lib/python3.8/site-packages (from amazon-textract-textractor) (0.1.33)
Requirement already satisfied: XlsxWriter==3.0.* in /Users/sanhehu/venvs/python/3.8.11/dev_exp_share_venv/lib/python3.8/site-packages (from amazon-textract-textractor) (3.0.3)
Requirement already satisfied: jsonschema in /Users/sanhehu/venvs/python/3.8.11/dev_exp_share_venv/lib/python3.8/site-packages (from amazon-textract-textractor) (4.4.0)
Requirement already satisfied: amazon-textract-caller==0.0.24 in /Users/sanhehu/venvs/python/3.8.11/dev_exp_share_venv/lib/python3.8/site-packages (from amazon-textract-textractor) (0.0.24)
Requirement already satisfied: pyxDamerauLevenshtein==1.7.* in /Users/sanhehu/venvs/python/3.8.11/dev_exp_share_venv/lib/python3.8/site-packages (from amazon-textract-textractor) (1.7.1)
Requirement already satisfied: boto3==1.24.* in /Users/sanhehu/venvs/python/3.8.11/dev_exp_share_venv/lib/python3.8/site-packages (from amazon-textract-textractor) (1.24.96)
Requirement already satisfied: Pillow in /Users/sanhehu/venvs/python/3.8.11/dev_exp_share_venv/lib/python3.8/site-packages (from amazon-textract-textractor) (9.2.0)
Requirement already satisfied: botocore in /Users/sanhehu/venvs/python/3.8.11/dev_exp_share_venv/lib/python3.8/site-packages (from amazon-textract-caller==0.0.24->amazon-textract-textractor) (1.27.96)
Requirement already satisfied: marshmallow==3.14.1 in /Users/sanhehu/venvs/python/3.8.11/dev_exp_share_venv/lib/python3.8/site-packages (from amazon-textract-response-parser==0.1.33->amazon-textract-textractor) (3.14.1)
Requirement already satisfied: jmespath<2.0.0,>=0.7.1 in /Users/sanhehu/venvs/python/3.8.11/dev_exp_share_venv/lib/python3.8/site-packages (from boto3==1.24.*->amazon-textract-textractor) (0.10.0)
Requirement already satisfied: s3transfer<0.7.0,>=0.6.0 in /Users/sanhehu/venvs/python/3.8.11/dev_exp_share_venv/lib/python3.8/site-packages (from boto3==1.24.*->amazon-textract-textractor) (0.6.0)
Requirement already satisfied: urllib3<1.27,>=1.25.4 in /Users/sanhehu/venvs/python/3.8.11/dev_exp_share_venv/lib/python3.8/site-packages (from botocore->amazon-textract-caller==0.0.24->amazon-textract-textractor) (1.26.7)
Requirement already satisfied: python-dateutil<3.0.0,>=2.1 in /Users/sanhehu/venvs/python/3.8.11/dev_exp_share_venv/lib/python3.8/site-packages (from botocore->amazon-textract-caller==0.0.24->amazon-textract-textractor) (2.8.2)
Requirement already satisfied: six>=1.5 in /Users/sanhehu/venvs/python/3.8.11/dev_exp_share_venv/lib/python3.8/site-packages (from python-dateutil<3.0.0,>=2.1->botocore->amazon-textract-caller==0.0.24->amazon-textract-textractor) (1.16.0)
Requirement already satisfied: pyrsistent!=0.17.0,!=0.17.1,!=0.17.2,>=0.14.0 in /Users/sanhehu/venvs/python/3.8.11/dev_exp_share_venv/lib/python3.8/site-packages (from jsonschema->amazon-textract-textractor) (0.18.1)
Requirement already satisfied: attrs>=17.4.0 in /Users/sanhehu/venvs/python/3.8.11/dev_exp_share_venv/lib/python3.8/site-packages (from jsonschema->amazon-textract-textractor) (21.4.0)
Requirement already satisfied: importlib-resources>=1.4.0 in /Users/sanhehu/venvs/python/3.8.11/dev_exp_share_venv/lib/python3.8/site-packages (from jsonschema->amazon-textract-textractor) (5.7.1)
Requirement already satisfied: zipp>=3.1.0 in /Users/sanhehu/venvs/python/3.8.11/dev_exp_share_venv/lib/python3.8/site-packages (from importlib-resources>=1.4.0->jsonschema->amazon-textract-textractor) (3.8.0)
WARNING: You are using pip version 21.2.4; however, version 22.3.1 is available.
You should consider upgrading via the '/Users/sanhehu/venvs/python/3.8.11/dev_exp_share_venv/bin/python -m pip install --upgrade pip' command.
Note: you may need to restart the kernel to use updated packages.
Requirement already satisfied: amazon-textract-caller in /Users/sanhehu/venvs/python/3.8.11/dev_exp_share_venv/lib/python3.8/site-packages (0.0.24)
Requirement already satisfied: boto3 in /Users/sanhehu/venvs/python/3.8.11/dev_exp_share_venv/lib/python3.8/site-packages (from amazon-textract-caller) (1.24.96)
Requirement already satisfied: botocore in /Users/sanhehu/venvs/python/3.8.11/dev_exp_share_venv/lib/python3.8/site-packages (from amazon-textract-caller) (1.27.96)
Requirement already satisfied: jmespath<2.0.0,>=0.7.1 in /Users/sanhehu/venvs/python/3.8.11/dev_exp_share_venv/lib/python3.8/site-packages (from boto3->amazon-textract-caller) (0.10.0)
Requirement already satisfied: s3transfer<0.7.0,>=0.6.0 in /Users/sanhehu/venvs/python/3.8.11/dev_exp_share_venv/lib/python3.8/site-packages (from boto3->amazon-textract-caller) (0.6.0)
Requirement already satisfied: urllib3<1.27,>=1.25.4 in /Users/sanhehu/venvs/python/3.8.11/dev_exp_share_venv/lib/python3.8/site-packages (from botocore->amazon-textract-caller) (1.26.7)
Requirement already satisfied: python-dateutil<3.0.0,>=2.1 in /Users/sanhehu/venvs/python/3.8.11/dev_exp_share_venv/lib/python3.8/site-packages (from botocore->amazon-textract-caller) (2.8.2)
Requirement already satisfied: six>=1.5 in /Users/sanhehu/venvs/python/3.8.11/dev_exp_share_venv/lib/python3.8/site-packages (from python-dateutil<3.0.0,>=2.1->botocore->amazon-textract-caller) (1.16.0)
Requirement already satisfied: amazon-textract-response-parser in /Users/sanhehu/venvs/python/3.8.11/dev_exp_share_venv/lib/python3.8/site-packages (0.1.33)
Requirement already satisfied: marshmallow==3.14.1 in /Users/sanhehu/venvs/python/3.8.11/dev_exp_share_venv/lib/python3.8/site-packages (from amazon-textract-response-parser) (3.14.1)
Requirement already satisfied: boto3 in /Users/sanhehu/venvs/python/3.8.11/dev_exp_share_venv/lib/python3.8/site-packages (from amazon-textract-response-parser) (1.24.96)
Requirement already satisfied: jmespath<2.0.0,>=0.7.1 in /Users/sanhehu/venvs/python/3.8.11/dev_exp_share_venv/lib/python3.8/site-packages (from boto3->amazon-textract-response-parser) (0.10.0)
Requirement already satisfied: s3transfer<0.7.0,>=0.6.0 in /Users/sanhehu/venvs/python/3.8.11/dev_exp_share_venv/lib/python3.8/site-packages (from boto3->amazon-textract-response-parser) (0.6.0)
Requirement already satisfied: botocore<1.28.0,>=1.27.96 in /Users/sanhehu/venvs/python/3.8.11/dev_exp_share_venv/lib/python3.8/site-packages (from boto3->amazon-textract-response-parser) (1.27.96)
Requirement already satisfied: python-dateutil<3.0.0,>=2.1 in /Users/sanhehu/venvs/python/3.8.11/dev_exp_share_venv/lib/python3.8/site-packages (from botocore<1.28.0,>=1.27.96->boto3->amazon-textract-response-parser) (2.8.2)
Requirement already satisfied: urllib3<1.27,>=1.25.4 in /Users/sanhehu/venvs/python/3.8.11/dev_exp_share_venv/lib/python3.8/site-packages (from botocore<1.28.0,>=1.27.96->boto3->amazon-textract-response-parser) (1.26.7)
Requirement already satisfied: six>=1.5 in /Users/sanhehu/venvs/python/3.8.11/dev_exp_share_venv/lib/python3.8/site-packages (from python-dateutil<3.0.0,>=2.1->botocore<1.28.0,>=1.27.96->boto3->amazon-textract-response-parser) (1.16.0)
Requirement already satisfied: amazon-textract-geofinder in /Users/sanhehu/venvs/python/3.8.11/dev_exp_share_venv/lib/python3.8/site-packages (0.0.6)
Requirement already satisfied: amazon-textract-response-parser>=0.1.17 in /Users/sanhehu/venvs/python/3.8.11/dev_exp_share_venv/lib/python3.8/site-packages (from amazon-textract-geofinder) (0.1.33)
Requirement already satisfied: marshmallow==3.14.1 in /Users/sanhehu/venvs/python/3.8.11/dev_exp_share_venv/lib/python3.8/site-packages (from amazon-textract-response-parser>=0.1.17->amazon-textract-geofinder) (3.14.1)
Requirement already satisfied: boto3 in /Users/sanhehu/venvs/python/3.8.11/dev_exp_share_venv/lib/python3.8/site-packages (from amazon-textract-response-parser>=0.1.17->amazon-textract-geofinder) (1.24.96)
Requirement already satisfied: s3transfer<0.7.0,>=0.6.0 in /Users/sanhehu/venvs/python/3.8.11/dev_exp_share_venv/lib/python3.8/site-packages (from boto3->amazon-textract-response-parser>=0.1.17->amazon-textract-geofinder) (0.6.0)
Requirement already satisfied: botocore<1.28.0,>=1.27.96 in /Users/sanhehu/venvs/python/3.8.11/dev_exp_share_venv/lib/python3.8/site-packages (from boto3->amazon-textract-response-parser>=0.1.17->amazon-textract-geofinder) (1.27.96)
Requirement already satisfied: jmespath<2.0.0,>=0.7.1 in /Users/sanhehu/venvs/python/3.8.11/dev_exp_share_venv/lib/python3.8/site-packages (from boto3->amazon-textract-response-parser>=0.1.17->amazon-textract-geofinder) (0.10.0)
Requirement already satisfied: python-dateutil<3.0.0,>=2.1 in /Users/sanhehu/venvs/python/3.8.11/dev_exp_share_venv/lib/python3.8/site-packages (from botocore<1.28.0,>=1.27.96->boto3->amazon-textract-response-parser>=0.1.17->amazon-textract-geofinder) (2.8.2)
Requirement already satisfied: urllib3<1.27,>=1.25.4 in /Users/sanhehu/venvs/python/3.8.11/dev_exp_share_venv/lib/python3.8/site-packages (from botocore<1.28.0,>=1.27.96->boto3->amazon-textract-response-parser>=0.1.17->amazon-textract-geofinder) (1.26.7)
Requirement already satisfied: six>=1.5 in /Users/sanhehu/venvs/python/3.8.11/dev_exp_share_venv/lib/python3.8/site-packages (from python-dateutil<3.0.0,>=2.1->botocore<1.28.0,>=1.27.96->boto3->amazon-textract-response-parser>=0.1.17->amazon-textract-geofinder) (1.16.0)
Requirement already satisfied: amazon-textract-prettyprinter in /Users/sanhehu/venvs/python/3.8.11/dev_exp_share_venv/lib/python3.8/site-packages (0.0.16)
Requirement already satisfied: tabulate==0.8.10 in /Users/sanhehu/venvs/python/3.8.11/dev_exp_share_venv/lib/python3.8/site-packages (from amazon-textract-prettyprinter) (0.8.10)
Requirement already satisfied: boto3 in /Users/sanhehu/venvs/python/3.8.11/dev_exp_share_venv/lib/python3.8/site-packages (from amazon-textract-prettyprinter) (1.24.96)
Requirement already satisfied: botocore in /Users/sanhehu/venvs/python/3.8.11/dev_exp_share_venv/lib/python3.8/site-packages (from amazon-textract-prettyprinter) (1.27.96)
Requirement already satisfied: amazon-textract-response-parser>=0.1.27 in /Users/sanhehu/venvs/python/3.8.11/dev_exp_share_venv/lib/python3.8/site-packages (from amazon-textract-prettyprinter) (0.1.33)
Requirement already satisfied: marshmallow==3.14.1 in /Users/sanhehu/venvs/python/3.8.11/dev_exp_share_venv/lib/python3.8/site-packages (from amazon-textract-response-parser>=0.1.27->amazon-textract-prettyprinter) (3.14.1)
Requirement already satisfied: s3transfer<0.7.0,>=0.6.0 in /Users/sanhehu/venvs/python/3.8.11/dev_exp_share_venv/lib/python3.8/site-packages (from boto3->amazon-textract-prettyprinter) (0.6.0)
Requirement already satisfied: jmespath<2.0.0,>=0.7.1 in /Users/sanhehu/venvs/python/3.8.11/dev_exp_share_venv/lib/python3.8/site-packages (from boto3->amazon-textract-prettyprinter) (0.10.0)
Requirement already satisfied: python-dateutil<3.0.0,>=2.1 in /Users/sanhehu/venvs/python/3.8.11/dev_exp_share_venv/lib/python3.8/site-packages (from botocore->amazon-textract-prettyprinter) (2.8.2)
Requirement already satisfied: urllib3<1.27,>=1.25.4 in /Users/sanhehu/venvs/python/3.8.11/dev_exp_share_venv/lib/python3.8/site-packages (from botocore->amazon-textract-prettyprinter) (1.26.7)
Requirement already satisfied: six>=1.5 in /Users/sanhehu/venvs/python/3.8.11/dev_exp_share_venv/lib/python3.8/site-packages (from python-dateutil<3.0.0,>=2.1->botocore->amazon-textract-prettyprinter) (1.16.0)
Requirement already satisfied: amazon-textract-overlayer in /Users/sanhehu/venvs/python/3.8.11/dev_exp_share_venv/lib/python3.8/site-packages (0.0.10)
Requirement already satisfied: Pillow>=9.2.* in /Users/sanhehu/venvs/python/3.8.11/dev_exp_share_venv/lib/python3.8/site-packages (from amazon-textract-overlayer) (9.2.0)
Requirement already satisfied: amazon-textract-caller>=0.0.11 in /Users/sanhehu/venvs/python/3.8.11/dev_exp_share_venv/lib/python3.8/site-packages (from amazon-textract-overlayer) (0.0.24)
Requirement already satisfied: botocore in /Users/sanhehu/venvs/python/3.8.11/dev_exp_share_venv/lib/python3.8/site-packages (from amazon-textract-overlayer) (1.27.96)
Requirement already satisfied: PyPDF2>=2.5.* in /Users/sanhehu/venvs/python/3.8.11/dev_exp_share_venv/lib/python3.8/site-packages (from amazon-textract-overlayer) (2.11.2)
Requirement already satisfied: boto3 in /Users/sanhehu/venvs/python/3.8.11/dev_exp_share_venv/lib/python3.8/site-packages (from amazon-textract-overlayer) (1.24.96)
Requirement already satisfied: typing_extensions>=3.10.0.0 in /Users/sanhehu/venvs/python/3.8.11/dev_exp_share_venv/lib/python3.8/site-packages (from PyPDF2>=2.5.*->amazon-textract-overlayer) (4.2.0)
Requirement already satisfied: s3transfer<0.7.0,>=0.6.0 in /Users/sanhehu/venvs/python/3.8.11/dev_exp_share_venv/lib/python3.8/site-packages (from boto3->amazon-textract-overlayer) (0.6.0)
Requirement already satisfied: jmespath<2.0.0,>=0.7.1 in /Users/sanhehu/venvs/python/3.8.11/dev_exp_share_venv/lib/python3.8/site-packages (from boto3->amazon-textract-overlayer) (0.10.0)
Requirement already satisfied: python-dateutil<3.0.0,>=2.1 in /Users/sanhehu/venvs/python/3.8.11/dev_exp_share_venv/lib/python3.8/site-packages (from botocore->amazon-textract-overlayer) (2.8.2)
Requirement already satisfied: urllib3<1.27,>=1.25.4 in /Users/sanhehu/venvs/python/3.8.11/dev_exp_share_venv/lib/python3.8/site-packages (from botocore->amazon-textract-overlayer) (1.26.7)
Requirement already satisfied: six>=1.5 in /Users/sanhehu/venvs/python/3.8.11/dev_exp_share_venv/lib/python3.8/site-packages (from python-dateutil<3.0.0,>=2.1->botocore->amazon-textract-overlayer) (1.16.0)

Set AWS Credential

[1]:
from boto_session_manager import BotoSesManager
from textractor import Textractor
from s3pathlib import context

aws_profile = "aws_data_lab_sanhe_us_east_1"

bsm = BotoSesManager(profile_name=aws_profile)
context.attach_boto_session(bsm.boto_ses)

# The top-level Textractor API
extractor = Textractor(profile_name=aws_profile)

Enumerate Important Local Path and S3 Path

Here we do some preparation first: convert the PDF to images, upload it to S3, and so on.

[24]:
import os
from pathlib_mate import Path
from s3pathlib import S3Path

#--- Local
dir_here = Path(os.getcwd()).absolute()

path_cms1500_pdf = dir_here / "cms1500-carrie-rodgers.pdf"
path_cms1500_png = dir_here / "page-1.png"

#--- S3
s3dir_root = S3Path("aws-data-lab-sanhe-for-everything", "poc", "2022-12-04-textractor").to_dir()
s3dir_input = s3dir_root.joinpath("input").to_dir()
s3dir_output = s3dir_root.joinpath("output").to_dir()
s3path_cms1500_pdf = s3dir_input / path_cms1500_pdf.basename

#--- Upload
print(f"preview: {s3dir_root.console_url}")

s3path_cms1500_pdf.upload_file(path_cms1500_pdf.abspath, overwrite=True)
preview: https://console.aws.amazon.com/s3/buckets/aws-data-lab-sanhe-for-everything?prefix=poc/2022-12-04-textractor/
[27]:
# Use PyMuPDF to split the PDF into pages and convert each page to an image.
import fitz

# bytes protocol
doc = fitz.open(stream=path_cms1500_pdf.read_bytes())

for page_num, page in enumerate(doc, start=1):
    print(page_num)
    # split page
    one_page_doc = fitz.open()  # new empty PDF
    one_page_doc.insert_pdf(doc, from_page=page_num-1, to_page=page_num-1)
    p = dir_here / f"page-{page_num}.pdf"
    #
    # # you cannot write document to io.BytesIO
    # one_page_doc.save(f"{p}")

    # convert page to image
    pix = page.get_pixmap(dpi=200)

    p = dir_here / f"page-{page_num}.ppm"
    # you cannot write pix map to io.BytesIO
    p.write_bytes(pix.tobytes("ppm"))
    # pix.save(f"{p}")

1
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
Input In [27], in <cell line: 7>()
      9 # split page
     10 one_page_doc = fitz.open()  # new empty PDF
---> 11 one_page_doc.insert_pdf(doc, from_page=page_num-1, to_page=page_num-1)
     12 p = dir_here / f"page-{page_num}.pdf"
     13 #
     14 # # you cannot write document to io.BytesIO
     15 # one_page_doc.save(f"{p}")
     16
     17 # convert page to image

File ~/venvs/python/3.8.11/dev_exp_share_venv/lib/python3.8/site-packages/fitz/fitz.py:4608, in Document.insert_pdf(self, docsrc, from_page, to_page, start_at, rotate, links, annots, show_progress, final, _gmap)
   4604     _gmap = Graftmap(self)
   4605     self.Graftmaps[isrt] = _gmap
-> 4608 val = _fitz.Document_insert_pdf(self, docsrc, from_page, to_page, start_at, rotate, links, annots, show_progress, final, _gmap)
   4610 self._reset_page_refs()
   4611 if links:

RuntimeError: source object number out of range

Detect Document Text

[89]:
document = extractor.detect_document_text(file_source=s3path_cms1500_pdf.uri)
type(document)
[89]:
textractor.entities.document.Document
[90]:
document.lines
[90]:
[Mail completed forms to:,
 Department of Labor and Industries,
 PO Box 44269,
 Olympia WA 98504-4269,
 HEALTH INSURANCE CLAIM FORM,
 CARRIER,
 APPROVED BY NATIONAL UNIF ORM CLAIM COMMITTEE (NUCC) 02/12,
 PICA,
 PICA,
 OTHER,
 1a INSURED'S ID NUMBER,
 FECA,
 GROUP,
 CHAMPVA,
 (For Program in Item 1),
 TRICARE,
 MEDICAID,
 1. MEDICARE,
 IKLUNG,
 HEALTH PLAN,
 (ID#),
 (ID#),
 (ID#),
 (Member ID#),
 (ID#/DoD#),
 (Medicaid#),
 (Medicare#),
 SEX,
 3. PATIENT'S BIRTH DATE,
 4. INSURED'S NAME (Last Name, First Name, Middle Initial),
 2 PATIENT'S NAME (Last Name, First Name, Middle Initial),
 YY,
 DD,
 MM,
 F,
 18,
 ALCON LABORATORIES,
 1974,
 9,
 7 INSURED'S ADDRESS (No., Street),
 Carrie Rodgers,
 6. PATIENT RELATIONSHIP TOINSURED,
 5 PATIENT'S ADDRESS (No., Street),
 Other,
 Child,
 Self,
 Spouse,
 6201 S freeway,
 2805 28th StNw,
 STATE,
 CITY,
 8. RESERVED FOR NUCC USE,
 STATE,
 CITY,
 DC,
 Tx,
 fort worth,
 Washington,
 ZIP CODE,
 TELEPHONE (include Area Code),
 TELEPHONE (Include Area Code),
 ZIP CODE,
 (815)571-3008,
 76134,
 (202)614-5824,
 20008,
 11. INSURED'S POLICY GROUP OR FECA NUMBER,
 10. IS PATIENT'S CONDITION RELATED TO,
 9. OTHER INSURED'S NAME (Last Name, First Name, Middle Initial),
 INFORMATION,
 FUR4398,
 SEX,
 a. INSURED'S DATE OF BIRTH,
 a OTHER INSURED'S POLICY OR GROUP NUMBER,
 a. EMPLOYMENT? (Current or Previous),
 YY,
 DD,
 MM,
 F,
 M,
 NO,
 YES,
 X1573,
 INSURED,
 4,
 11 1978,
 b RESERVED FOR NUCC USE,
 b. AUTO ACCIDENT?,
 b. OTHER CLAIM ID (Designated by NUCC),
 PLACE (State),
 NO,
 AND,
 YES,
 Y41 FUR4398,
 C. OTHER ACCIDENT?,
 C. RESERVED FOR NUCC USE,
 C. INSURANCE PLAN NAME OR PROGR AM NAME,
 YES,
 NO,
 Travelers,
 d. INSURANCE PLAN NAME OR PROGRAM NAME,
 d. IS THERE ANOTHER HEALTH BENEFIT PLAN?,
 10d. CLAIM CODES (Designated by NUCC),
 PATIENT,
 NO,
 YES,
 If yes, complete items 9, 9a and 9d,
 READ BACK OF FORM BEFORE COMPLETING & SIGNING THIS FORM.,
 13. INSURED'S OR AUTHORIZED PERSON'S SIGNATURE I authorize,
 12. PATIENT'S OR AUTHORIZED PERSON'S SIGNATURE I authorize the release of any medical or other information necessary,
 payment of medical benefits to the undersigned chysid an or supplier for,
 services described below.,
 to process this claim. I also request payment of government benefits ither to myself or to the party who accepts assignment,
 below,
 Curion,
 Cummium,
 SIGNED,
 SIGNED,
 DATE 03/08/22,
 15. OTHER DATE,
 16. DATES PATIENT UNABLE TO WORK IN CURRENT OCCUPATION,
 14 DATE OF CURRENT ILLNESS INJURY, or PREGNANCY (LMP),
 YY,
 MM,
 DD.,
 MM,
 YY,
 YY,
 DD,
 DD,
 YY,
 MM,
 DD,
 MM,
 QUAL,
 TO,
 FROM,
 QUAL,
 12,
 439,
 06,
 01,
 2013,
 431,
 07,
 06,
 12,
 21,
 21,
 18. HOSPITALIZATION DATES RELATED TO CURRENT SERVICES,
 17. NAME OF REFERRING PROVIDER OR OTHER SOURCE,
 17a.,
 MM,
 YY,
 MM,
 DD,
 DD,
 YY,
 MDF15577,
 FROM,
 TO,
 NPI,
 17b,
 DN JOSE FUENTES MD,
 1235184821,
 20. OUTSIDE LAB?,
 $CHARGES,
 19. ADDITIONAL CLAIM INFORMATION (Designated by NUCC),
 YES,
 NO,
 21. DIAGNOSIS OR NATURE OF ILLNESS OR INJURY Relate AL to service line below (24E),
 22 RESUBMISSION,
 ICD Ind.,
 CODE,
 ORIGINAL REF. NO,
 This is David,
 M25512,
 S16 1XXA,
 C.,
 A,
 M75 42,
 B,
 D.,
 23 PRIOR AUTHORIZATION NUMBER,
 This is H,
 This is F,
 This is,
 This is E,
 E,
 H.,
 G,
 F.,
 This is LETTER,
 L.,
 This is Jack,
 This is King,
 I.,
 J.,
 This is iphone,
 K,
 02FUR4398,
 B,
 E,
 C.,
 F.,
 24. A,
 G.,
 H,
 D. PROCEDURES, SERVICES, OR SUPPLIES,
 DATE(S) OF SERVICE,
 I.,
 J.,
 DAYS,
 EPSD1,
 PLACE OF,
 To,
 From,
 DIAGNOSIS,
 (Explain Unusual Circumstances),
 RENDERING,
 ID.,
 OR,
 Family,
 OPT/HCPCS,
 YY,
 MM,
 EMG,
 YY,
 DD,
 SERVICE,
 DD,
 MM,
 POINTER,
 MODIFIER,
 CHARGES,
 Plan,
 UNITS,
 QUAL,
 PROVIDER ID,
 OB,
 OT105516TX,
 1,
 3,
 25,
 97110 Go,
 NPI,
 25,
 22103,
 22111,
 ABC 189.48,
 03,
 1023439049,
 OB,
 5516TX,
 2,
 22,
 25,
 25,
 03,
 NPI,
 82.18,
 ABC,
 03,
 95730 Go,
 1023439049,
 3,
 NPI,
 SUPPLIER,
 4,
 NPI,
 5,
 NPI,
 6,
 PHYSICIAN,
 NPI,
 SSN BIN,
 25. FEDERAL TAX I.D. NUMBER,
 26 PATIENT'S ACCOUNT NO,
 27 ACCEPT ASSIGNMENT?,
 28 TOTAL CHARGE,
 29. AMOUNT PAID,
 30. Rsvd. for NUCC Use,
 For govt claims see tack),
 YES,
 NO,
 $,
 MONROOOD,
 X,
 x,
 203721804,
 $ 271.66,
 31. SIGNATURE OF PHYSICIAN OR SUPPLIER,
 32. SERVICE FACILITY LOCATION INFORMATION,
 33 BILLING PROVIDER,
 INCLUDING DEGREES OR CREDENTIALS,
 INFO & PH # (214) 953-9431,
 NORTH TEAXS EHABILITATION,
 (I certify that the statements on the reverse,
 NORTH TEXAS REHABILITATION,
 apply to this bill and are made a part thereof),
 PO BOX 226656,
 2601 SCOTT AVE 102, 76103,
 CHRISTY L. HOBBY,
 & DALLAS TX 75222-6656,
 4/4/2022,
 1508095761 b,
 SIGNED,
 150809576,
 b. OTTX,
 PLEASE PRINT OR TYPE,
 NUCC Instruction Manual available at: www.nucc.org,
 APPROVED OMB-0938 1197 FORM 1500 (02-12),
 F245-127-000 CMS 1500 02-2012,
 RESET,
 Scanned with CamScanner]
[91]:
results = document.search_lines("patient name, (Last Name, First Name, Middle Initial)", 3)
results
[91]:
[2 PATIENT'S NAME (Last Name, First Name, Middle Initial),
 4. INSURED'S NAME (Last Name, First Name, Middle Initial),
 9. OTHER INSURED'S NAME (Last Name, First Name, Middle Initial)]
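
The idea behind this kind of keyword search can be sketched with the standard library. The following is only an illustration of ranking lines by string similarity with `difflib`; it is not Textractor's implementation, whose similarity metric may differ, and `rank_lines_by_similarity` is a hypothetical name.

```python
from difflib import SequenceMatcher
from typing import List


def rank_lines_by_similarity(lines: List[str], keyword: str, top_k: int = 1) -> List[str]:
    """Return the top_k lines most similar to keyword (case-insensitive)."""

    def score(line: str) -> float:
        return SequenceMatcher(None, keyword.lower(), line.lower()).ratio()

    return sorted(lines, key=score, reverse=True)[:top_k]


# A few lines copied from the detection output above.
sample_lines = [
    "SEX",
    "4. INSURED'S NAME (Last Name, First Name, Middle Initial)",
    "2 PATIENT'S NAME (Last Name, First Name, Middle Initial)",
    "ZIP CODE",
]
best = rank_lines_by_similarity(
    sample_lines,
    "patient name, (Last Name, First Name, Middle Initial)",
    top_k=1,
)
print(best[0])
```

Ranking by similarity rather than exact match is what lets a loosely typed query find a label that OCR rendered slightly differently.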
[92]:
results[0].bbox
[92]:
x: 0.020324068143963814, y: 0.1542317271232605, width: 0.2638625502586365, height: 0.008735546842217445
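
The bounding box is expressed as fractions of the page width and height, so it has to be scaled before it can be drawn on an image. A minimal sketch of that conversion follows; `bbox_to_pixels` and `PixelBox` are hypothetical names, and the page size is a parameter you supply (for example the pixel size of the rendered page image).

```python
from typing import NamedTuple


class PixelBox(NamedTuple):
    x_min: float
    y_min: float
    x_max: float
    y_max: float


def bbox_to_pixels(
    x: float,
    y: float,
    width: float,
    height: float,
    page_width: int,
    page_height: int,
) -> PixelBox:
    """Scale a normalized (0..1) bounding box to pixel coordinates."""
    return PixelBox(
        x_min=x * page_width,
        y_min=y * page_height,
        # Horizontal extent uses the box width, vertical extent the box height.
        x_max=(x + width) * page_width,
        y_max=(y + height) * page_height,
    )


box = bbox_to_pixels(0.5, 0.25, 0.25, 0.125, page_width=1000, page_height=2000)
print(box)
# PixelBox(x_min=500.0, y_min=500.0, x_max=750.0, y_max=750.0)
```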

Form and Table

[19]:
from textractor.data.constants import TextractFeatures

analyzed_document = extractor.analyze_document(
    file_source=path_cms1500_pdf.abspath,
    features=[
        TextractFeatures.FORMS,
        TextractFeatures.TABLES,
    ]
)
print("done")
done
[59]:
import json

Path(dir_here, "test_1.json").write_text(json.dumps(analyzed_document.response, indent=4))
[59]:
2006605

Key Value

[57]:
analyzed_document.key_values
[57]:
[1a INSURED'S ID NUMBER : (For Program in Item 1),
 4. INSURED'S NAME (Last Name, First Name, Middle Initial) : ALCON LABORATORIES,
 2 PATIENT'S NAME (Last Name, First Name, Middle Initial) : Carrie Rodgers,
 YY : 1974,
 MM : 9,
 DD : 18,
 7 INSURED'S ADDRESS (No., Street) : 6201 S freeway,
 5 PATIENT'S ADDRESS (No., Street) : 2805 28th StNw,
 CITY : fort worth,
 8. RESERVED FOR NUCC USE : ,
 STATE : DC,
 CITY : Washington,
 ZIP CODE : 76134,
 TELEPHONE (include Area Code) : (815)571-3008,
 TELEPHONE (Include Area Code) : (202)614-5824,
 ZIP CODE : 20008,
 11. INSURED'S POLICY GROUP OR FECA NUMBER : FUR4398,
 9. OTHER INSURED'S NAME (Last Name, First Name, Middle Initial) : ,
 a OTHER INSURED'S POLICY OR GROUP NUMBER : X1573,
 DD : 11,
 MM : 4,
 1978 YY : ,
 b RESERVED FOR NUCC USE : ,
 b. OTHER CLAIM ID (Designated by NUCC) : Y41 FUR4398,
 C. INSURANCE PLAN NAME OR PROGR AM NAME : Travelers,
 C. RESERVED FOR NUCC USE : ,
 d. INSURANCE PLAN NAME OR PROGRAM NAME : ,
 10d. CLAIM CODES (Designated by NUCC) : ,
 SIGNED : Cummium,
 DATE : 03/08/22,
 SIGNED : Curion,
 YY : 2013,
 DD. : 06,
 MM : ,
 YY : 21,
 MM : 07,
 YY : 21,
 DD : ,
 YY : ,
 MM : 12,
 DD : 06,
 MM : 12,
 QUAL : 439,
 QUAL : 431,
 17. NAME OF REFERRING PROVIDER OR OTHER SOURCE : DN JOSE FUENTES MD,
 MM : ,
 DD : ,
 YY : ,
 MM : ,
 DD : ,
 YY : ,
 $CHARGES : ,
 19. ADDITIONAL CLAIM INFORMATION (Designated by NUCC) : ,
 22 RESUBMISSION CODE : ,
 ORIGINAL REF. NO : ,
 C. : M25512,
 A : S16 1XXA,
 B : M75 42,
 D. : This is David,
 23 PRIOR AUTHORIZATION NUMBER : 02FUR4398,
 G : This is,
 E : This is E,
 F. : This is F,
 J. : This is Jack,
 I. : This is iphone,
 K : This is King,
 SSN : ,
 25. FEDERAL TAX I.D. NUMBER : 203721804,
 26 PATIENT'S ACCOUNT NO : MONROOOD,
 29. AMOUNT PAID : $,
 30. Rsvd. for NUCC Use : ,
 32. SERVICE FACILITY LOCATION INFORMATION : NORTH 2601 TEAXS SCOTT AVE EHABILITATION 102, 76103,
 33 BILLING PROVIDER INFO & PH # : NORTH PO BOX & TEXAS DALLAS 226656 REHABILITATION (214) 953-9431 TX 75222-6656,
 b : ,
  : 1508095761,
  : 150809576,
 b. : OTTX,
  : L. 4/4/2022 HOBBY,
 SIGNED : CHRISTY,
 APPROVED OMB-0938 : 1197,
 FORM : 1500 (02-12)]
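
For comparison, this is roughly what extracting the same key-value pairs from the raw JSON looks like. The sketch assumes the documented `KEY_VALUE_SET` layout, in which a KEY block points to its VALUE block through a `VALUE` relationship and both point to their words through `CHILD` relationships. The `blocks` list is hand-made for illustration, and `child_text` and `key_value_pairs` are hypothetical helper names.

```python
from typing import Dict, List, Tuple

# Hand-made blocks mimicking two form fields (illustrative only).
blocks = [
    {"Id": "k1", "BlockType": "KEY_VALUE_SET", "EntityTypes": ["KEY"],
     "Relationships": [{"Type": "VALUE", "Ids": ["v1"]},
                       {"Type": "CHILD", "Ids": ["w1"]}]},
    {"Id": "v1", "BlockType": "KEY_VALUE_SET", "EntityTypes": ["VALUE"],
     "Relationships": [{"Type": "CHILD", "Ids": ["w2"]}]},
    {"Id": "w1", "BlockType": "WORD", "Text": "DATE"},
    {"Id": "w2", "BlockType": "WORD", "Text": "03/08/22"},
    {"Id": "k2", "BlockType": "KEY_VALUE_SET", "EntityTypes": ["KEY"],
     "Relationships": [{"Type": "VALUE", "Ids": ["v2"]},
                       {"Type": "CHILD", "Ids": ["w3"]}]},
    {"Id": "v2", "BlockType": "KEY_VALUE_SET", "EntityTypes": ["VALUE"]},
    {"Id": "w3", "BlockType": "WORD", "Text": "SSN"},
]


def child_text(block: dict, by_id: Dict[str, dict]) -> str:
    """Join the text of all WORD children of a block."""
    words = [
        by_id[cid]["Text"]
        for rel in block.get("Relationships", [])
        if rel["Type"] == "CHILD"
        for cid in rel["Ids"]
        if by_id[cid]["BlockType"] == "WORD"
    ]
    return " ".join(words)


def key_value_pairs(all_blocks: List[dict]) -> List[Tuple[str, str]]:
    """Return (key text, value text) for every KEY block, in document order."""
    by_id = {block["Id"]: block for block in all_blocks}
    pairs = []
    for block in all_blocks:
        if block["BlockType"] != "KEY_VALUE_SET" or "KEY" not in block.get("EntityTypes", []):
            continue
        value_ids = [
            vid
            for rel in block.get("Relationships", [])
            if rel["Type"] == "VALUE"
            for vid in rel["Ids"]
        ]
        value = " ".join(child_text(by_id[vid], by_id) for vid in value_ids)
        pairs.append((child_text(block, by_id), value))
    return pairs


print(key_value_pairs(blocks))
# [('DATE', '03/08/22'), ('SSN', '')]
```

The sketch returns a list of tuples rather than a dict because keys such as MM, DD and YY repeat many times on this form, and a dict would silently keep only one of them.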
[38]:
key_value_I = analyzed_document.get(key="I")[0]
print(key_value_I)
doc_width = 2480
doc_height = 3509
x = key_value_I.bbox.x * doc_width
y = key_value_I.bbox.y * doc_height
width = key_value_I.bbox.width * doc_width
height = key_value_I.bbox.height * doc_height

x_min = x
y_min = y
x_max = x + width
y_max = y + height
(x_min, x_max, y_min, y_max)
WARNING:root:Key contains no words objects.
WARNING:root:Key contains no words objects.
WARNING:root:Key contains no words objects.
WARNING:root:Key contains no words objects.
I. : This is iphone
[38]:
(61.32811039686203, 74.83414195477962, 2069.866504251957, 2083.3725358098745)
[10]:
key_value = analyzed_document.key_values[2]
print(key_value.key)
print(key_value.value)
2 PATIENT'S NAME (Last Name, First Name, Middle Initial)
Carrie Rodgers
[11]:
key_value = analyzed_document.get("INSURED POLICY GROUP".lower(), 3)[0]
print(f"{key_value.key} = {key_value.value}")
WARNING:root:Key contains no words objects.
WARNING:root:Key contains no words objects.
WARNING:root:Key contains no words objects.
WARNING:root:Key contains no words objects.
11. INSURED'S POLICY GROUP OR FECA NUMBER = FUR4398

Checkbox

[12]:
analyzed_document.checkboxes
[12]:
[[ ] PICA,
 [X] PICA,
 [ ] OTHER (ID#),
 [ ] FECA (ID#) IKLUNG,
 [ ] GROUP HEALTH (ID#) PLAN,
 [ ] CHAMPVA (Member ID#),
 [ ] MEDICAID (Medicaid#),
 [ ] TRICARE (ID#/DoD#),
 [X] MEDICARE (Medicare#),
 [ ] F,
 [X] ,
 [ ] Other,
 [ ] Child,
 [ ] Self,
 [ ] Spouse,
 [X] STATE,
 [X] M,
 [ ] F,
 [X] YES,
 [ ] NO,
 [ ] PLACE (State),
 [ ] YES,
 [X] NO,
 [ ] YES,
 [X] NO,
 [X] NO,
 [ ] YES,
 [ ] DD,
 [ ] YES,
 [X] NO,
 [ ] ICD Ind.,
 [ ] H.,
 [ ] L.,
 [X] BIN,
 [ ] 28 TOTAL CHARGE,
 [X] YES,
 [ ] NO]
[13]:
checkbox = analyzed_document.checkboxes[0]
checkbox.bbox
[13]:
x: 0.8941406011581421, y: 0.1098247841000557, width: 0.02421991527080536, height: 0.006424476392567158
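
In the raw JSON, a checkbox is a `SELECTION_ELEMENT` block whose `SelectionStatus` is either `SELECTED` or `NOT_SELECTED`. The sketch below shows how the `[X]` / `[ ]` rendering can be derived from that field. The `selection_blocks` list is hand-made for illustration and `render_checkboxes` is a hypothetical helper name.

```python
from typing import List, Tuple

# Hand-made blocks (illustrative only): two checkboxes and one unrelated word.
selection_blocks = [
    {"Id": "c1", "BlockType": "SELECTION_ELEMENT", "SelectionStatus": "SELECTED"},
    {"Id": "w1", "BlockType": "WORD", "Text": "YES"},
    {"Id": "c2", "BlockType": "SELECTION_ELEMENT", "SelectionStatus": "NOT_SELECTED"},
]

MARKS = {"SELECTED": "[X]", "NOT_SELECTED": "[ ]"}


def render_checkboxes(all_blocks: List[dict]) -> List[Tuple[str, str]]:
    """Return (block id, mark) for every SELECTION_ELEMENT block."""
    return [
        (block["Id"], MARKS[block["SelectionStatus"]])
        for block in all_blocks
        if block["BlockType"] == "SELECTION_ELEMENT"
    ]


print(render_checkboxes(selection_blocks))
# [('c1', '[X]'), ('c2', '[ ]')]
```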

Overlay

Geo Finder

[3]:
from textractgeofinder.ocrdb import AreaSelection
from textractgeofinder.tgeofinder import KeyValue, TGeoFinder, SelectionElement
from textractprettyprinter.t_pretty_print import get_forms_string
from textractcaller import call_textract
from textractcaller.t_call import Textract_Features
import trp.trp2 as t2
[4]:
js: dict = call_textract(
    input_document=s3path_cms1500_pdf.uri,
    features=[
        Textract_Features.FORMS,
        Textract_Features.TABLES,
    ]
)
[53]:
import json

Path(dir_here, "test.json").write_text(json.dumps(js, indent=4))
[53]:
2131574
[5]:
document: t2.TDocument = t2.TDocumentSchema().load(js)
type(document)
[5]:
trp.trp2.TDocument
[6]:
doc_width = 2480
doc_height = 3509
geofinder_doc = TGeoFinder(js, doc_height=doc_height, doc_width=doc_width)
geofinder_doc
[6]:
<textractgeofinder.tgeofinder.TGeoFinder at 0x1338ccd60>
[10]:
geofinder_doc.__del__()
print("done")
done
[7]:
key_21_phrase = geofinder_doc.find_phrase_on_page("DIAGNOSIS OR NATURE OF ILLNESS OR INJURY")[0]
key_21_phrase
[7]:
TWord(text='diagnosis or nature of illness or injury', original_text='DIAGNOSIS OR NATURE OF ILLNESS OR INJURY', text_type='phrase', confidence=99.78182002476284, id='e10c607d-31a4-4f69-acdc-dcc99cbe224e', xmin=94, ymin=1904, xmax=671, ymax=1927, page_number=1, doc_width=2480, doc_height=3509, child_relationships='', reference=None, resolver=None)
[8]:
from PIL import Image, ImageDraw

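def show_bounding_box(path, phrase, fill=None):
    """Draw the bounding box of ``phrase`` on the image at ``path`` and show it."""
    with Image.open(path) as img:
        img_width, img_height = img.size
        print(img_width, img_height)
        # Use the reference page size stored on the phrase itself,
        # then rescale its pixel coordinates to the actual image size.
        doc_width = phrase.doc_width
        doc_height = phrase.doc_height
        draw = ImageDraw.Draw(img)
        xy = [
            phrase.xmin / doc_width * img_width,
            phrase.ymin / doc_height * img_height,
            phrase.xmax / doc_width * img_width,
            phrase.ymax / doc_height * img_height,
        ]
        draw.rectangle(
            xy=xy,
            outline=128,
            fill=fill,
            width=2,
        )
        img.show()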
[15]:
show_bounding_box(path_cms1500_png.abspath, key_21_phrase)
2480 3509
[9]:
key_diagnosis_pointer = geofinder_doc.find_phrase_on_page("DIAGNOSIS POINTER")[0]
key_diagnosis_pointer
[9]:
TWord(text='diagnosis pointer', original_text='DIAGNOSIS POINTER', text_type='phrase', confidence=99.80844497680664, id='35d36ccd-c909-4294-8c25-4ae1f4764aa6', xmin=1326, ymin=2129, xmax=1466, ymax=2181, page_number=1, doc_width=2480, doc_height=3509, child_relationships='', reference=None, resolver=None)
[18]:
show_bounding_box(path_cms1500_png.abspath, key_diagnosis_pointer)
2480 3509
[10]:
top_left = t2.TPoint(x=50, y=key_21_phrase.ymin-50)
lower_right = t2.TPoint(x=key_diagnosis_pointer.xmax+50, y=key_diagnosis_pointer.ymin+100)
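
The two anchor phrases define a rectangular search area in pixel coordinates. As a sketch of what selecting by area means, the code below keeps only the boxes that lie fully inside that rectangle. This illustrates the geometry only: whether TGeoFinder requires full containment or mere overlap is not shown in this notebook, and `LabeledBox` and `boxes_in_area` are hypothetical names. The first two boxes use the coordinates printed above; the third is made up.

```python
from dataclasses import dataclass
from typing import List, Tuple

Point = Tuple[int, int]  # (x, y) in pixels


@dataclass(frozen=True)
class LabeledBox:
    label: str
    xmin: int
    ymin: int
    xmax: int
    ymax: int


def boxes_in_area(boxes: List[LabeledBox], top_left: Point, lower_right: Point) -> List[LabeledBox]:
    """Keep the boxes that are fully contained in the rectangle."""
    left, top = top_left
    right, bottom = lower_right
    return [
        box for box in boxes
        if box.xmin >= left and box.ymin >= top
        and box.xmax <= right and box.ymax <= bottom
    ]


key_21 = LabeledBox("diagnosis or nature of illness or injury", 94, 1904, 671, 1927)
pointer = LabeledBox("diagnosis pointer", 1326, 2129, 1466, 2181)
outside = LabeledBox("made-up box outside the area", 1600, 1900, 1700, 1950)

# Same arithmetic as the cell above: pad the anchors by 50 / 100 pixels.
area_top_left = (50, key_21.ymin - 50)
area_lower_right = (pointer.xmax + 50, pointer.ymin + 100)

selected = boxes_in_area([key_21, pointer, outside], area_top_left, area_lower_right)
print([box.label for box in selected])
# ['diagnosis or nature of illness or injury', 'diagnosis pointer']
```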
[11]:
# a_to_l_fields = geofinder_doc.get_form_fields_in_area(
#     area_selection=AreaSelection(top_left=top_left, lower_right=lower_right, page_number=1)
# )
a_to_l_fields = geofinder_doc.get_form_fields_in_area(
    area_selection=AreaSelection(
        top_left=t2.TPoint(x=0, y=0),
        lower_right=t2.TPoint(x=doc_width, y=doc_height),
        page_number=1,
    )
)
print(len(a_to_l_fields))
for field in sorted(
    a_to_l_fields,
    key=lambda x: x.key.text,
):
    # print(field.key.text, field.value.text)
    # print(field.key.text, field.value)
    print(field.key.text)

115
$ charges
10d. claim codes (designated by nucc)
11. insured's policy group or feca number
17. name of referring provider or other source
19. additional claim information (designated by nucc)
1a. insured's i.d. number
2 patient's name (last name, first name, middle initial)
22. resubmission code
23. prior authorization number
25. federal tax i.d. number
26. patient's account no
28. total charge
29. amount paid
30. rsvd. for nucc use
32. service facility location information
33 billing provider info & ph #
4. insured's name (last name, first name, middle initial)
4/4/2022 date
5 patient's address (no. street)
7. insured's address (no., street)
8. reserved for nucc use
9 other insured's name (last name, first name, middle initial)
a
a.
a. other insured's policy or group number
approved
b
b
b.
b. other claim id (designated by nucc)
b. reserved for nucc use
c.
c. insurance plan name or program name
c. reserved for nucc use
champva (member id#)
child
city
city
d insurance plan name or program name
d.
date
dd
dd
dd
dd
dd
dd
dd
dd.
e
ein
f
f
f.
feca blklung (id#)
form
g
group health plan (id#)
h.
icd ind.
j.
k
l.
m
medicaid (medicaid#)
medicare (medicare#)
mm
mm
mm
mm
mm
mm
mm
mm
no
no
no
no
no
no
original ref. no
other
other (id#)
pica
pica
place (state)
qual
qual
self
signed
signed
signed
spouse
ssn
state
state
telephone (include area code)
telephone (include area code)
tricare (id#/dod#)
yes
yes
yes
yes
yes
yes
yy
yy
yy
yy
yy
yy
yy 1974
yy 1978
zip code
zip code
[24]:
print(key_21_phrase.xmin, key_diagnosis_pointer.xmin, key_diagnosis_pointer.xmax)
94 1326 1466
[36]:
# (61.32811039686203, 74.83414195477962, 2069.866504251957, 2083.3725358098745)
print(top_left.x, lower_right.x, top_left.y, lower_right.y)
50 1516 1854 2129
[29]:
for field in sorted(
    a_to_l_fields,
    key=lambda x: x.key.text,
):
    print(field.key.text, field.value.text)
a s16 1xxa
b m75 42
c. m25512
d. this is david
e this is e
f. this is f
g this is g
h. this is h
icd ind. not_selected
j. this is jack
k this is king
l. this is letter