{ "cells": [ { "cell_type": "markdown", "source": [ "# Explore amazon-textract-textractor\n", "\n", "## What is amazon-textract-textractor?\n", "\n", "The native Textract API returns its result as one huge JSON object. To interpret it, you first need to read [Text Detection and Document Analysis Response Objects](https://docs.aws.amazon.com/textract/latest/dg/how-it-works-document-layout.html), and then write your own code to parse the JSON and process the data further.\n", "\n", "[amazon-textract-textractor](https://github.com/aws-samples/amazon-textract-textractor) is an open-source Python project from AWS that aims to make Textract easier to use. In short, it is a higher-level wrapper around that JSON object.\n", "\n", "``amazon-textract-textractor`` is the top-level project. It is built on these modules:\n", "\n", "- amazon-textract-caller: a wrapper around boto3, since boto3's API functions have no explicit signatures or type hints.\n", "- amazon-textract-response-parser: an object-oriented wrapper around the JSON response.\n", "\n", "These two are the core of ``amazon-textract-textractor`` and are installed automatically along with it.\n", "\n", "- amazon-textract-overlayer: draws bounding boxes on a PDF or image.\n", "- amazon-textract-prettyprinter: converts Textract results into other formats such as CSV and Markdown.\n", "- amazon-textract-geofinder: searches Textract entities by their coordinates; it is implemented on top of a SQLite database." 
], "metadata": { "collapsed": false, "pycharm": { "name": "#%% md\n" } } }, { "cell_type": "code", "execution_count": 1, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Requirement already satisfied: amazon-textract-textractor in /Users/sanhehu/venvs/python/3.8.11/dev_exp_share_venv/lib/python3.8/site-packages (1.0.18)\r\n", "Requirement already satisfied: tabulate==0.8.* in /Users/sanhehu/venvs/python/3.8.11/dev_exp_share_venv/lib/python3.8/site-packages (from amazon-textract-textractor) (0.8.10)\r\n", "Requirement already satisfied: amazon-textract-response-parser==0.1.33 in /Users/sanhehu/venvs/python/3.8.11/dev_exp_share_venv/lib/python3.8/site-packages (from amazon-textract-textractor) (0.1.33)\r\n", "Requirement already satisfied: XlsxWriter==3.0.* in /Users/sanhehu/venvs/python/3.8.11/dev_exp_share_venv/lib/python3.8/site-packages (from amazon-textract-textractor) (3.0.3)\r\n", "Requirement already satisfied: jsonschema in /Users/sanhehu/venvs/python/3.8.11/dev_exp_share_venv/lib/python3.8/site-packages (from amazon-textract-textractor) (4.4.0)\r\n", "Requirement already satisfied: amazon-textract-caller==0.0.24 in /Users/sanhehu/venvs/python/3.8.11/dev_exp_share_venv/lib/python3.8/site-packages (from amazon-textract-textractor) (0.0.24)\r\n", "Requirement already satisfied: pyxDamerauLevenshtein==1.7.* in /Users/sanhehu/venvs/python/3.8.11/dev_exp_share_venv/lib/python3.8/site-packages (from amazon-textract-textractor) (1.7.1)\r\n", "Requirement already satisfied: boto3==1.24.* in /Users/sanhehu/venvs/python/3.8.11/dev_exp_share_venv/lib/python3.8/site-packages (from amazon-textract-textractor) (1.24.96)\r\n", "Requirement already satisfied: Pillow in /Users/sanhehu/venvs/python/3.8.11/dev_exp_share_venv/lib/python3.8/site-packages (from amazon-textract-textractor) (9.2.0)\r\n", "Requirement already satisfied: botocore in /Users/sanhehu/venvs/python/3.8.11/dev_exp_share_venv/lib/python3.8/site-packages (from 
amazon-textract-caller==0.0.24->amazon-textract-textractor) (1.27.96)\r\n", "Requirement already satisfied: marshmallow==3.14.1 in /Users/sanhehu/venvs/python/3.8.11/dev_exp_share_venv/lib/python3.8/site-packages (from amazon-textract-response-parser==0.1.33->amazon-textract-textractor) (3.14.1)\r\n", "Requirement already satisfied: jmespath<2.0.0,>=0.7.1 in /Users/sanhehu/venvs/python/3.8.11/dev_exp_share_venv/lib/python3.8/site-packages (from boto3==1.24.*->amazon-textract-textractor) (0.10.0)\r\n", "Requirement already satisfied: s3transfer<0.7.0,>=0.6.0 in /Users/sanhehu/venvs/python/3.8.11/dev_exp_share_venv/lib/python3.8/site-packages (from boto3==1.24.*->amazon-textract-textractor) (0.6.0)\r\n", "Requirement already satisfied: urllib3<1.27,>=1.25.4 in /Users/sanhehu/venvs/python/3.8.11/dev_exp_share_venv/lib/python3.8/site-packages (from botocore->amazon-textract-caller==0.0.24->amazon-textract-textractor) (1.26.7)\r\n", "Requirement already satisfied: python-dateutil<3.0.0,>=2.1 in /Users/sanhehu/venvs/python/3.8.11/dev_exp_share_venv/lib/python3.8/site-packages (from botocore->amazon-textract-caller==0.0.24->amazon-textract-textractor) (2.8.2)\r\n", "Requirement already satisfied: six>=1.5 in /Users/sanhehu/venvs/python/3.8.11/dev_exp_share_venv/lib/python3.8/site-packages (from python-dateutil<3.0.0,>=2.1->botocore->amazon-textract-caller==0.0.24->amazon-textract-textractor) (1.16.0)\r\n", "Requirement already satisfied: pyrsistent!=0.17.0,!=0.17.1,!=0.17.2,>=0.14.0 in /Users/sanhehu/venvs/python/3.8.11/dev_exp_share_venv/lib/python3.8/site-packages (from jsonschema->amazon-textract-textractor) (0.18.1)\r\n", "Requirement already satisfied: attrs>=17.4.0 in /Users/sanhehu/venvs/python/3.8.11/dev_exp_share_venv/lib/python3.8/site-packages (from jsonschema->amazon-textract-textractor) (21.4.0)\r\n", "Requirement already satisfied: importlib-resources>=1.4.0 in /Users/sanhehu/venvs/python/3.8.11/dev_exp_share_venv/lib/python3.8/site-packages (from 
jsonschema->amazon-textract-textractor) (5.7.1)\r\n", "Requirement already satisfied: zipp>=3.1.0 in /Users/sanhehu/venvs/python/3.8.11/dev_exp_share_venv/lib/python3.8/site-packages (from importlib-resources>=1.4.0->jsonschema->amazon-textract-textractor) (3.8.0)\r\n", "\u001B[33mWARNING: You are using pip version 21.2.4; however, version 22.3.1 is available.\r\n", "You should consider upgrading via the '/Users/sanhehu/venvs/python/3.8.11/dev_exp_share_venv/bin/python -m pip install --upgrade pip' command.\u001B[0m\r\n", "Note: you may need to restart the kernel to use updated packages.\n", "Requirement already satisfied: amazon-textract-caller in /Users/sanhehu/venvs/python/3.8.11/dev_exp_share_venv/lib/python3.8/site-packages (0.0.24)\r\n", "Requirement already satisfied: boto3 in /Users/sanhehu/venvs/python/3.8.11/dev_exp_share_venv/lib/python3.8/site-packages (from amazon-textract-caller) (1.24.96)\r\n", "Requirement already satisfied: botocore in /Users/sanhehu/venvs/python/3.8.11/dev_exp_share_venv/lib/python3.8/site-packages (from amazon-textract-caller) (1.27.96)\r\n", "Requirement already satisfied: jmespath<2.0.0,>=0.7.1 in /Users/sanhehu/venvs/python/3.8.11/dev_exp_share_venv/lib/python3.8/site-packages (from boto3->amazon-textract-caller) (0.10.0)\r\n", "Requirement already satisfied: s3transfer<0.7.0,>=0.6.0 in /Users/sanhehu/venvs/python/3.8.11/dev_exp_share_venv/lib/python3.8/site-packages (from boto3->amazon-textract-caller) (0.6.0)\r\n", "Requirement already satisfied: urllib3<1.27,>=1.25.4 in /Users/sanhehu/venvs/python/3.8.11/dev_exp_share_venv/lib/python3.8/site-packages (from botocore->amazon-textract-caller) (1.26.7)\r\n", "Requirement already satisfied: python-dateutil<3.0.0,>=2.1 in /Users/sanhehu/venvs/python/3.8.11/dev_exp_share_venv/lib/python3.8/site-packages (from botocore->amazon-textract-caller) (2.8.2)\r\n", "Requirement already satisfied: six>=1.5 in /Users/sanhehu/venvs/python/3.8.11/dev_exp_share_venv/lib/python3.8/site-packages 
(from python-dateutil<3.0.0,>=2.1->botocore->amazon-textract-caller) (1.16.0)\r\n", "\u001B[33mWARNING: You are using pip version 21.2.4; however, version 22.3.1 is available.\r\n", "You should consider upgrading via the '/Users/sanhehu/venvs/python/3.8.11/dev_exp_share_venv/bin/python -m pip install --upgrade pip' command.\u001B[0m\r\n", "Note: you may need to restart the kernel to use updated packages.\n", "Requirement already satisfied: amazon-textract-response-parser in /Users/sanhehu/venvs/python/3.8.11/dev_exp_share_venv/lib/python3.8/site-packages (0.1.33)\r\n", "Requirement already satisfied: marshmallow==3.14.1 in /Users/sanhehu/venvs/python/3.8.11/dev_exp_share_venv/lib/python3.8/site-packages (from amazon-textract-response-parser) (3.14.1)\r\n", "Requirement already satisfied: boto3 in /Users/sanhehu/venvs/python/3.8.11/dev_exp_share_venv/lib/python3.8/site-packages (from amazon-textract-response-parser) (1.24.96)\r\n", "Requirement already satisfied: jmespath<2.0.0,>=0.7.1 in /Users/sanhehu/venvs/python/3.8.11/dev_exp_share_venv/lib/python3.8/site-packages (from boto3->amazon-textract-response-parser) (0.10.0)\r\n", "Requirement already satisfied: s3transfer<0.7.0,>=0.6.0 in /Users/sanhehu/venvs/python/3.8.11/dev_exp_share_venv/lib/python3.8/site-packages (from boto3->amazon-textract-response-parser) (0.6.0)\r\n", "Requirement already satisfied: botocore<1.28.0,>=1.27.96 in /Users/sanhehu/venvs/python/3.8.11/dev_exp_share_venv/lib/python3.8/site-packages (from boto3->amazon-textract-response-parser) (1.27.96)\r\n", "Requirement already satisfied: python-dateutil<3.0.0,>=2.1 in /Users/sanhehu/venvs/python/3.8.11/dev_exp_share_venv/lib/python3.8/site-packages (from botocore<1.28.0,>=1.27.96->boto3->amazon-textract-response-parser) (2.8.2)\r\n", "Requirement already satisfied: urllib3<1.27,>=1.25.4 in /Users/sanhehu/venvs/python/3.8.11/dev_exp_share_venv/lib/python3.8/site-packages (from botocore<1.28.0,>=1.27.96->boto3->amazon-textract-response-parser) 
(1.26.7)\r\n", "Requirement already satisfied: six>=1.5 in /Users/sanhehu/venvs/python/3.8.11/dev_exp_share_venv/lib/python3.8/site-packages (from python-dateutil<3.0.0,>=2.1->botocore<1.28.0,>=1.27.96->boto3->amazon-textract-response-parser) (1.16.0)\r\n", "\u001B[33mWARNING: You are using pip version 21.2.4; however, version 22.3.1 is available.\r\n", "You should consider upgrading via the '/Users/sanhehu/venvs/python/3.8.11/dev_exp_share_venv/bin/python -m pip install --upgrade pip' command.\u001B[0m\r\n", "Note: you may need to restart the kernel to use updated packages.\n", "Requirement already satisfied: amazon-textract-geofinder in /Users/sanhehu/venvs/python/3.8.11/dev_exp_share_venv/lib/python3.8/site-packages (0.0.6)\r\n", "Requirement already satisfied: amazon-textract-response-parser>=0.1.17 in /Users/sanhehu/venvs/python/3.8.11/dev_exp_share_venv/lib/python3.8/site-packages (from amazon-textract-geofinder) (0.1.33)\r\n", "Requirement already satisfied: marshmallow==3.14.1 in /Users/sanhehu/venvs/python/3.8.11/dev_exp_share_venv/lib/python3.8/site-packages (from amazon-textract-response-parser>=0.1.17->amazon-textract-geofinder) (3.14.1)\r\n", "Requirement already satisfied: boto3 in /Users/sanhehu/venvs/python/3.8.11/dev_exp_share_venv/lib/python3.8/site-packages (from amazon-textract-response-parser>=0.1.17->amazon-textract-geofinder) (1.24.96)\r\n", "Requirement already satisfied: s3transfer<0.7.0,>=0.6.0 in /Users/sanhehu/venvs/python/3.8.11/dev_exp_share_venv/lib/python3.8/site-packages (from boto3->amazon-textract-response-parser>=0.1.17->amazon-textract-geofinder) (0.6.0)\r\n", "Requirement already satisfied: botocore<1.28.0,>=1.27.96 in /Users/sanhehu/venvs/python/3.8.11/dev_exp_share_venv/lib/python3.8/site-packages (from boto3->amazon-textract-response-parser>=0.1.17->amazon-textract-geofinder) (1.27.96)\r\n", "Requirement already satisfied: jmespath<2.0.0,>=0.7.1 in 
/Users/sanhehu/venvs/python/3.8.11/dev_exp_share_venv/lib/python3.8/site-packages (from boto3->amazon-textract-response-parser>=0.1.17->amazon-textract-geofinder) (0.10.0)\r\n", "Requirement already satisfied: python-dateutil<3.0.0,>=2.1 in /Users/sanhehu/venvs/python/3.8.11/dev_exp_share_venv/lib/python3.8/site-packages (from botocore<1.28.0,>=1.27.96->boto3->amazon-textract-response-parser>=0.1.17->amazon-textract-geofinder) (2.8.2)\r\n", "Requirement already satisfied: urllib3<1.27,>=1.25.4 in /Users/sanhehu/venvs/python/3.8.11/dev_exp_share_venv/lib/python3.8/site-packages (from botocore<1.28.0,>=1.27.96->boto3->amazon-textract-response-parser>=0.1.17->amazon-textract-geofinder) (1.26.7)\r\n", "Requirement already satisfied: six>=1.5 in /Users/sanhehu/venvs/python/3.8.11/dev_exp_share_venv/lib/python3.8/site-packages (from python-dateutil<3.0.0,>=2.1->botocore<1.28.0,>=1.27.96->boto3->amazon-textract-response-parser>=0.1.17->amazon-textract-geofinder) (1.16.0)\r\n", "\u001B[33mWARNING: You are using pip version 21.2.4; however, version 22.3.1 is available.\r\n", "You should consider upgrading via the '/Users/sanhehu/venvs/python/3.8.11/dev_exp_share_venv/bin/python -m pip install --upgrade pip' command.\u001B[0m\r\n", "Note: you may need to restart the kernel to use updated packages.\n", "Requirement already satisfied: amazon-textract-prettyprinter in /Users/sanhehu/venvs/python/3.8.11/dev_exp_share_venv/lib/python3.8/site-packages (0.0.16)\r\n", "Requirement already satisfied: tabulate==0.8.10 in /Users/sanhehu/venvs/python/3.8.11/dev_exp_share_venv/lib/python3.8/site-packages (from amazon-textract-prettyprinter) (0.8.10)\r\n", "Requirement already satisfied: boto3 in /Users/sanhehu/venvs/python/3.8.11/dev_exp_share_venv/lib/python3.8/site-packages (from amazon-textract-prettyprinter) (1.24.96)\r\n", "Requirement already satisfied: botocore in /Users/sanhehu/venvs/python/3.8.11/dev_exp_share_venv/lib/python3.8/site-packages (from amazon-textract-prettyprinter) 
(1.27.96)\r\n", "Requirement already satisfied: amazon-textract-response-parser>=0.1.27 in /Users/sanhehu/venvs/python/3.8.11/dev_exp_share_venv/lib/python3.8/site-packages (from amazon-textract-prettyprinter) (0.1.33)\r\n", "Requirement already satisfied: marshmallow==3.14.1 in /Users/sanhehu/venvs/python/3.8.11/dev_exp_share_venv/lib/python3.8/site-packages (from amazon-textract-response-parser>=0.1.27->amazon-textract-prettyprinter) (3.14.1)\r\n", "Requirement already satisfied: s3transfer<0.7.0,>=0.6.0 in /Users/sanhehu/venvs/python/3.8.11/dev_exp_share_venv/lib/python3.8/site-packages (from boto3->amazon-textract-prettyprinter) (0.6.0)\r\n", "Requirement already satisfied: jmespath<2.0.0,>=0.7.1 in /Users/sanhehu/venvs/python/3.8.11/dev_exp_share_venv/lib/python3.8/site-packages (from boto3->amazon-textract-prettyprinter) (0.10.0)\r\n", "Requirement already satisfied: python-dateutil<3.0.0,>=2.1 in /Users/sanhehu/venvs/python/3.8.11/dev_exp_share_venv/lib/python3.8/site-packages (from botocore->amazon-textract-prettyprinter) (2.8.2)\r\n", "Requirement already satisfied: urllib3<1.27,>=1.25.4 in /Users/sanhehu/venvs/python/3.8.11/dev_exp_share_venv/lib/python3.8/site-packages (from botocore->amazon-textract-prettyprinter) (1.26.7)\r\n", "Requirement already satisfied: six>=1.5 in /Users/sanhehu/venvs/python/3.8.11/dev_exp_share_venv/lib/python3.8/site-packages (from python-dateutil<3.0.0,>=2.1->botocore->amazon-textract-prettyprinter) (1.16.0)\r\n", "\u001B[33mWARNING: You are using pip version 21.2.4; however, version 22.3.1 is available.\r\n", "You should consider upgrading via the '/Users/sanhehu/venvs/python/3.8.11/dev_exp_share_venv/bin/python -m pip install --upgrade pip' command.\u001B[0m\r\n", "Note: you may need to restart the kernel to use updated packages.\n", "Requirement already satisfied: amazon-textract-overlayer in /Users/sanhehu/venvs/python/3.8.11/dev_exp_share_venv/lib/python3.8/site-packages (0.0.10)\r\n", "Requirement already satisfied: 
Pillow>=9.2.* in /Users/sanhehu/venvs/python/3.8.11/dev_exp_share_venv/lib/python3.8/site-packages (from amazon-textract-overlayer) (9.2.0)\r\n", "Requirement already satisfied: amazon-textract-caller>=0.0.11 in /Users/sanhehu/venvs/python/3.8.11/dev_exp_share_venv/lib/python3.8/site-packages (from amazon-textract-overlayer) (0.0.24)\r\n", "Requirement already satisfied: botocore in /Users/sanhehu/venvs/python/3.8.11/dev_exp_share_venv/lib/python3.8/site-packages (from amazon-textract-overlayer) (1.27.96)\r\n", "Requirement already satisfied: PyPDF2>=2.5.* in /Users/sanhehu/venvs/python/3.8.11/dev_exp_share_venv/lib/python3.8/site-packages (from amazon-textract-overlayer) (2.11.2)\r\n", "Requirement already satisfied: boto3 in /Users/sanhehu/venvs/python/3.8.11/dev_exp_share_venv/lib/python3.8/site-packages (from amazon-textract-overlayer) (1.24.96)\r\n", "Requirement already satisfied: typing_extensions>=3.10.0.0 in /Users/sanhehu/venvs/python/3.8.11/dev_exp_share_venv/lib/python3.8/site-packages (from PyPDF2>=2.5.*->amazon-textract-overlayer) (4.2.0)\r\n", "Requirement already satisfied: s3transfer<0.7.0,>=0.6.0 in /Users/sanhehu/venvs/python/3.8.11/dev_exp_share_venv/lib/python3.8/site-packages (from boto3->amazon-textract-overlayer) (0.6.0)\r\n", "Requirement already satisfied: jmespath<2.0.0,>=0.7.1 in /Users/sanhehu/venvs/python/3.8.11/dev_exp_share_venv/lib/python3.8/site-packages (from boto3->amazon-textract-overlayer) (0.10.0)\r\n", "Requirement already satisfied: python-dateutil<3.0.0,>=2.1 in /Users/sanhehu/venvs/python/3.8.11/dev_exp_share_venv/lib/python3.8/site-packages (from botocore->amazon-textract-overlayer) (2.8.2)\r\n", "Requirement already satisfied: urllib3<1.27,>=1.25.4 in /Users/sanhehu/venvs/python/3.8.11/dev_exp_share_venv/lib/python3.8/site-packages (from botocore->amazon-textract-overlayer) (1.26.7)\r\n", "Requirement already satisfied: six>=1.5 in /Users/sanhehu/venvs/python/3.8.11/dev_exp_share_venv/lib/python3.8/site-packages (from 
python-dateutil<3.0.0,>=2.1->botocore->amazon-textract-overlayer) (1.16.0)\r\n", "\u001B[33mWARNING: You are using pip version 21.2.4; however, version 22.3.1 is available.\r\n", "You should consider upgrading via the '/Users/sanhehu/venvs/python/3.8.11/dev_exp_share_venv/bin/python -m pip install --upgrade pip' command.\u001B[0m\r\n", "Note: you may need to restart the kernel to use updated packages.\n" ] } ], "source": [ "# Install all of the packages here\n", "%pip install amazon-textract-textractor\n", "%pip install amazon-textract-caller\n", "%pip install amazon-textract-response-parser\n", "%pip install amazon-textract-geofinder\n", "%pip install amazon-textract-prettyprinter\n", "%pip install amazon-textract-overlayer" ], "metadata": { "collapsed": false, "pycharm": { "name": "#%%\n" } } }, { "cell_type": "markdown", "source": [ "# Set AWS Credential" ], "metadata": { "collapsed": false, "pycharm": { "name": "#%% md\n" } } }, { "cell_type": "code", "execution_count": 1, "outputs": [], "source": [ "from boto_session_manager import BotoSesManager\n", "from textractor import Textractor\n", "from s3pathlib import context\n", "\n", "aws_profile = \"aws_data_lab_sanhe_us_east_1\"\n", "\n", "bsm = BotoSesManager(profile_name=aws_profile)\n", "context.attach_boto_session(bsm.boto_ses)\n", "\n", "# Top-level Textractor API\n", "extractor = Textractor(profile_name=aws_profile)" ], "metadata": { "collapsed": false, "pycharm": { "name": "#%%\n" } } }, { "cell_type": "markdown", "source": [ "## Enumerate Important Local Path and S3 Path\n", "\n", "Here we do some preparation work first: converting the PDF to images and uploading it to S3." 
], "metadata": { "collapsed": false, "pycharm": { "name": "#%% md\n" } } }, { "cell_type": "code", "execution_count": 24, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "preview: https://console.aws.amazon.com/s3/buckets/aws-data-lab-sanhe-for-everything?prefix=poc/2022-12-04-textractor/\n" ] } ], "source": [ "import os\n", "from pathlib_mate import Path\n", "from s3pathlib import S3Path\n", "\n", "#--- Local\n", "dir_here = Path(os.getcwd()).absolute()\n", "\n", "path_cms1500_pdf = dir_here / \"cms1500-carrie-rodgers.pdf\"\n", "path_cms1500_png = dir_here / \"page-1.png\"\n", "\n", "#--- S3\n", "s3dir_root = S3Path(\"aws-data-lab-sanhe-for-everything\", \"poc\", \"2022-12-04-textractor\").to_dir()\n", "s3dir_input = s3dir_root.joinpath(\"input\").to_dir()\n", "s3dir_output = s3dir_root.joinpath(\"output\").to_dir()\n", "s3path_cms1500_pdf = s3dir_input / path_cms1500_pdf.basename\n", "\n", "#--- Upload\n", "print(f\"preview: {s3dir_root.console_url}\")\n", "\n", "s3path_cms1500_pdf.upload_file(path_cms1500_pdf.abspath, overwrite=True)" ], "metadata": { "collapsed": false, "pycharm": { "name": "#%%\n" } } }, { "cell_type": "code", "execution_count": 27, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "1\n" ] }, { "ename": "RuntimeError", "evalue": "source object number out of range", "output_type": "error", "traceback": [ "\u001B[0;31m---------------------------------------------------------------------------\u001B[0m", "\u001B[0;31mRuntimeError\u001B[0m Traceback (most recent call last)", "Input \u001B[0;32mIn [27]\u001B[0m, in \u001B[0;36m\u001B[0;34m()\u001B[0m\n\u001B[1;32m 9\u001B[0m \u001B[38;5;66;03m# split page\u001B[39;00m\n\u001B[1;32m 10\u001B[0m one_page_doc \u001B[38;5;241m=\u001B[39m fitz\u001B[38;5;241m.\u001B[39mopen() \u001B[38;5;66;03m# new empty PDF\u001B[39;00m\n\u001B[0;32m---> 11\u001B[0m 
\u001B[43mone_page_doc\u001B[49m\u001B[38;5;241;43m.\u001B[39;49m\u001B[43minsert_pdf\u001B[49m\u001B[43m(\u001B[49m\u001B[43mdoc\u001B[49m\u001B[43m,\u001B[49m\u001B[43m \u001B[49m\u001B[43mfrom_page\u001B[49m\u001B[38;5;241;43m=\u001B[39;49m\u001B[43mpage_num\u001B[49m\u001B[38;5;241;43m-\u001B[39;49m\u001B[38;5;241;43m1\u001B[39;49m\u001B[43m,\u001B[49m\u001B[43m \u001B[49m\u001B[43mto_page\u001B[49m\u001B[38;5;241;43m=\u001B[39;49m\u001B[43mpage_num\u001B[49m\u001B[38;5;241;43m-\u001B[39;49m\u001B[38;5;241;43m1\u001B[39;49m\u001B[43m)\u001B[49m\n\u001B[1;32m 12\u001B[0m p \u001B[38;5;241m=\u001B[39m dir_here \u001B[38;5;241m/\u001B[39m \u001B[38;5;124mf\u001B[39m\u001B[38;5;124m\"\u001B[39m\u001B[38;5;124mpage-\u001B[39m\u001B[38;5;132;01m{\u001B[39;00mpage_num\u001B[38;5;132;01m}\u001B[39;00m\u001B[38;5;124m.pdf\u001B[39m\u001B[38;5;124m\"\u001B[39m\n\u001B[1;32m 13\u001B[0m \u001B[38;5;66;03m#\u001B[39;00m\n\u001B[1;32m 14\u001B[0m \u001B[38;5;66;03m# # you cannot write document to io.BytesIO\u001B[39;00m\n\u001B[1;32m 15\u001B[0m \u001B[38;5;66;03m# one_page_doc.save(f\"{p}\")\u001B[39;00m\n\u001B[1;32m 16\u001B[0m \n\u001B[1;32m 17\u001B[0m \u001B[38;5;66;03m# convert page to image\u001B[39;00m\n", "File \u001B[0;32m~/venvs/python/3.8.11/dev_exp_share_venv/lib/python3.8/site-packages/fitz/fitz.py:4608\u001B[0m, in \u001B[0;36mDocument.insert_pdf\u001B[0;34m(self, docsrc, from_page, to_page, start_at, rotate, links, annots, show_progress, final, _gmap)\u001B[0m\n\u001B[1;32m 4604\u001B[0m _gmap \u001B[38;5;241m=\u001B[39m Graftmap(\u001B[38;5;28mself\u001B[39m)\n\u001B[1;32m 4605\u001B[0m \u001B[38;5;28mself\u001B[39m\u001B[38;5;241m.\u001B[39mGraftmaps[isrt] \u001B[38;5;241m=\u001B[39m _gmap\n\u001B[0;32m-> 4608\u001B[0m val \u001B[38;5;241m=\u001B[39m \u001B[43m_fitz\u001B[49m\u001B[38;5;241;43m.\u001B[39;49m\u001B[43mDocument_insert_pdf\u001B[49m\u001B[43m(\u001B[49m\u001B[38;5;28;43mself\u001B[39;49m\u001B[43m,\u001B[49m\u001B[43m 
\u001B[49m\u001B[43mdocsrc\u001B[49m\u001B[43m,\u001B[49m\u001B[43m \u001B[49m\u001B[43mfrom_page\u001B[49m\u001B[43m,\u001B[49m\u001B[43m \u001B[49m\u001B[43mto_page\u001B[49m\u001B[43m,\u001B[49m\u001B[43m \u001B[49m\u001B[43mstart_at\u001B[49m\u001B[43m,\u001B[49m\u001B[43m \u001B[49m\u001B[43mrotate\u001B[49m\u001B[43m,\u001B[49m\u001B[43m \u001B[49m\u001B[43mlinks\u001B[49m\u001B[43m,\u001B[49m\u001B[43m \u001B[49m\u001B[43mannots\u001B[49m\u001B[43m,\u001B[49m\u001B[43m \u001B[49m\u001B[43mshow_progress\u001B[49m\u001B[43m,\u001B[49m\u001B[43m \u001B[49m\u001B[43mfinal\u001B[49m\u001B[43m,\u001B[49m\u001B[43m \u001B[49m\u001B[43m_gmap\u001B[49m\u001B[43m)\u001B[49m\n\u001B[1;32m 4610\u001B[0m \u001B[38;5;28mself\u001B[39m\u001B[38;5;241m.\u001B[39m_reset_page_refs()\n\u001B[1;32m 4611\u001B[0m \u001B[38;5;28;01mif\u001B[39;00m links:\n", "\u001B[0;31mRuntimeError\u001B[0m: source object number out of range" ] } ], "source": [ "# Use PyMuPDF to split the PDF into pages and convert each page to an image.\n", "import fitz\n", "\n", "# bytes protocol\n", "doc = fitz.open(stream=path_cms1500_pdf.read_bytes())\n", "\n", "for page_num, page in enumerate(doc, start=1):\n", "    print(page_num)\n", "    # split page\n", "    one_page_doc = fitz.open() # new empty PDF\n", "    one_page_doc.insert_pdf(doc, from_page=page_num-1, to_page=page_num-1)\n", "    p = dir_here / f\"page-{page_num}.pdf\"\n", "    #\n", "    # # you cannot write document to io.BytesIO\n", "    # one_page_doc.save(f\"{p}\")\n", "\n", "    # convert page to image\n", "    pix = page.get_pixmap(dpi=200)\n", "\n", "    p = dir_here / f\"page-{page_num}.ppm\"\n", "    # you cannot write pix map to io.BytesIO\n", "    p.write_bytes(pix.tobytes(\"ppm\"))\n", "    # pix.save(f\"{p}\")\n" ], "metadata": { "collapsed": false, "pycharm": { "name": "#%%\n" } } }, { "cell_type": "markdown", "source": [ "## Detect Document Text" ], "metadata": { "collapsed": false, "pycharm": { "name": "#%% md\n" } } }, { "cell_type": "code", "execution_count": 89, "outputs": [ { "data": { "text/plain": 
"textractor.entities.document.Document" }, "execution_count": 89, "metadata": {}, "output_type": "execute_result" } ], "source": [ "document = extractor.detect_document_text(file_source=s3path_cms1500_pdf.uri)\n", "type(document)" ], "metadata": { "collapsed": false, "pycharm": { "name": "#%%\n" } } }, { "cell_type": "code", "execution_count": 90, "outputs": [ { "data": { "text/plain": "[Mail completed forms to:,\n Department of Labor and Industries,\n PO Box 44269,\n Olympia WA 98504-4269,\n HEALTH INSURANCE CLAIM FORM,\n CARRIER,\n APPROVED BY NATIONAL UNIF ORM CLAIM COMMITTEE (NUCC) 02/12,\n PICA,\n PICA,\n OTHER,\n 1a INSURED'S ID NUMBER,\n FECA,\n GROUP,\n CHAMPVA,\n (For Program in Item 1),\n TRICARE,\n MEDICAID,\n 1. MEDICARE,\n IKLUNG,\n HEALTH PLAN,\n (ID#),\n (ID#),\n (ID#),\n (Member ID#),\n (ID#/DoD#),\n (Medicaid#),\n (Medicare#),\n SEX,\n 3. PATIENT'S BIRTH DATE,\n 4. INSURED'S NAME (Last Name, First Name, Middle Initial),\n 2 PATIENT'S NAME (Last Name, First Name, Middle Initial),\n YY,\n DD,\n MM,\n F,\n 18,\n ALCON LABORATORIES,\n 1974,\n 9,\n 7 INSURED'S ADDRESS (No., Street),\n Carrie Rodgers,\n 6. PATIENT RELATIONSHIP TOINSURED,\n 5 PATIENT'S ADDRESS (No., Street),\n Other,\n Child,\n Self,\n Spouse,\n 6201 S freeway,\n 2805 28th StNw,\n STATE,\n CITY,\n 8. RESERVED FOR NUCC USE,\n STATE,\n CITY,\n DC,\n Tx,\n fort worth,\n Washington,\n ZIP CODE,\n TELEPHONE (include Area Code),\n TELEPHONE (Include Area Code),\n ZIP CODE,\n (815)571-3008,\n 76134,\n (202)614-5824,\n 20008,\n 11. INSURED'S POLICY GROUP OR FECA NUMBER,\n 10. IS PATIENT'S CONDITION RELATED TO,\n 9. OTHER INSURED'S NAME (Last Name, First Name, Middle Initial),\n INFORMATION,\n FUR4398,\n SEX,\n a. INSURED'S DATE OF BIRTH,\n a OTHER INSURED'S POLICY OR GROUP NUMBER,\n a. EMPLOYMENT? (Current or Previous),\n YY,\n DD,\n MM,\n F,\n M,\n NO,\n YES,\n X1573,\n INSURED,\n 4,\n 11 1978,\n b RESERVED FOR NUCC USE,\n b. AUTO ACCIDENT?,\n b. 
OTHER CLAIM ID (Designated by NUCC),\n PLACE (State),\n NO,\n AND,\n YES,\n Y41 FUR4398,\n C. OTHER ACCIDENT?,\n C. RESERVED FOR NUCC USE,\n C. INSURANCE PLAN NAME OR PROGR AM NAME,\n YES,\n NO,\n Travelers,\n d. INSURANCE PLAN NAME OR PROGRAM NAME,\n d. IS THERE ANOTHER HEALTH BENEFIT PLAN?,\n 10d. CLAIM CODES (Designated by NUCC),\n PATIENT,\n NO,\n YES,\n If yes, complete items 9, 9a and 9d,\n READ BACK OF FORM BEFORE COMPLETING & SIGNING THIS FORM.,\n 13. INSURED'S OR AUTHORIZED PERSON'S SIGNATURE I authorize,\n 12. PATIENT'S OR AUTHORIZED PERSON'S SIGNATURE I authorize the release of any medical or other information necessary,\n payment of medical benefits to the undersigned chysid an or supplier for,\n services described below.,\n to process this claim. I also request payment of government benefits ither to myself or to the party who accepts assignment,\n below,\n Curion,\n Cummium,\n SIGNED,\n SIGNED,\n DATE 03/08/22,\n 15. OTHER DATE,\n 16. DATES PATIENT UNABLE TO WORK IN CURRENT OCCUPATION,\n 14 DATE OF CURRENT ILLNESS INJURY, or PREGNANCY (LMP),\n YY,\n MM,\n DD.,\n MM,\n YY,\n YY,\n DD,\n DD,\n YY,\n MM,\n DD,\n MM,\n QUAL,\n TO,\n FROM,\n QUAL,\n 12,\n 439,\n 06,\n 01,\n 2013,\n 431,\n 07,\n 06,\n 12,\n 21,\n 21,\n 18. HOSPITALIZATION DATES RELATED TO CURRENT SERVICES,\n 17. NAME OF REFERRING PROVIDER OR OTHER SOURCE,\n 17a.,\n MM,\n YY,\n MM,\n DD,\n DD,\n YY,\n MDF15577,\n FROM,\n TO,\n NPI,\n 17b,\n DN JOSE FUENTES MD,\n 1235184821,\n 20. OUTSIDE LAB?,\n $CHARGES,\n 19. ADDITIONAL CLAIM INFORMATION (Designated by NUCC),\n YES,\n NO,\n 21. DIAGNOSIS OR NATURE OF ILLNESS OR INJURY Relate AL to service line below (24E),\n 22 RESUBMISSION,\n ICD Ind.,\n CODE,\n ORIGINAL REF. 
NO,\n This is David,\n M25512,\n S16 1XXA,\n C.,\n A,\n M75 42,\n B,\n D.,\n 23 PRIOR AUTHORIZATION NUMBER,\n This is H,\n This is F,\n This is,\n This is E,\n E,\n H.,\n G,\n F.,\n This is LETTER,\n L.,\n This is Jack,\n This is King,\n I.,\n J.,\n This is iphone,\n K,\n 02FUR4398,\n B,\n E,\n C.,\n F.,\n 24. A,\n G.,\n H,\n D. PROCEDURES, SERVICES, OR SUPPLIES,\n DATE(S) OF SERVICE,\n I.,\n J.,\n DAYS,\n EPSD1,\n PLACE OF,\n To,\n From,\n DIAGNOSIS,\n (Explain Unusual Circumstances),\n RENDERING,\n ID.,\n OR,\n Family,\n OPT/HCPCS,\n YY,\n MM,\n EMG,\n YY,\n DD,\n SERVICE,\n DD,\n MM,\n POINTER,\n MODIFIER,\n CHARGES,\n Plan,\n UNITS,\n QUAL,\n PROVIDER ID,\n OB,\n OT105516TX,\n 1,\n 3,\n 25,\n 97110 Go,\n NPI,\n 25,\n 22103,\n 22111,\n ABC 189.48,\n 03,\n 1023439049,\n OB,\n 5516TX,\n 2,\n 22,\n 25,\n 25,\n 03,\n NPI,\n 82.18,\n ABC,\n 03,\n 95730 Go,\n 1023439049,\n 3,\n NPI,\n SUPPLIER,\n 4,\n NPI,\n 5,\n NPI,\n 6,\n PHYSICIAN,\n NPI,\n SSN BIN,\n 25. FEDERAL TAX I.D. NUMBER,\n 26 PATIENT'S ACCOUNT NO,\n 27 ACCEPT ASSIGNMENT?,\n 28 TOTAL CHARGE,\n 29. AMOUNT PAID,\n 30. Rsvd. for NUCC Use,\n For govt claims see tack),\n YES,\n NO,\n $,\n MONROOOD,\n X,\n x,\n 203721804,\n $ 271.66,\n 31. SIGNATURE OF PHYSICIAN OR SUPPLIER,\n 32. SERVICE FACILITY LOCATION INFORMATION,\n 33 BILLING PROVIDER,\n INCLUDING DEGREES OR CREDENTIALS,\n INFO & PH # (214) 953-9431,\n NORTH TEAXS EHABILITATION,\n (I certify that the statements on the reverse,\n NORTH TEXAS REHABILITATION,\n apply to this bill and are made a part thereof),\n PO BOX 226656,\n 2601 SCOTT AVE 102, 76103,\n CHRISTY L. HOBBY,\n & DALLAS TX 75222-6656,\n 4/4/2022,\n 1508095761 b,\n SIGNED,\n 150809576,\n b. 
OTTX,\n PLEASE PRINT OR TYPE,\n NUCC Instruction Manual available at: www.nucc.org,\n APPROVED OMB-0938 1197 FORM 1500 (02-12),\n F245-127-000 CMS 1500 02-2012,\n RESET,\n Scanned with CamScanner]" }, "execution_count": 90, "metadata": {}, "output_type": "execute_result" } ], "source": [ "document.lines" ], "metadata": { "collapsed": false, "pycharm": { "name": "#%%\n" } } }, { "cell_type": "code", "execution_count": 91, "outputs": [ { "data": { "text/plain": "[2 PATIENT'S NAME (Last Name, First Name, Middle Initial),\n 4. INSURED'S NAME (Last Name, First Name, Middle Initial),\n 9. OTHER INSURED'S NAME (Last Name, First Name, Middle Initial)]" }, "execution_count": 91, "metadata": {}, "output_type": "execute_result" } ], "source": [ "results = document.search_lines(\"patient name, (Last Name, First Name, Middle Initial)\", 3)\n", "results" ], "metadata": { "collapsed": false, "pycharm": { "name": "#%%\n" } } }, { "cell_type": "code", "execution_count": 92, "outputs": [ { "data": { "text/plain": "x: 0.020324068143963814, y: 0.1542317271232605, width: 0.2638625502586365, height: 0.008735546842217445" }, "execution_count": 92, "metadata": {}, "output_type": "execute_result" } ], "source": [ "results[0].bbox" ], "metadata": { "collapsed": false, "pycharm": { "name": "#%%\n" } } }, { "cell_type": "markdown", "source": [ "## Form and Table" ], "metadata": { "collapsed": false, "pycharm": { "name": "#%% md\n" } } }, { "cell_type": "code", "execution_count": 19, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "done\n" ] } ], "source": [ "import json  # used below to dump the raw Textract response\n", "\n", "from textractor.data.constants import TextractFeatures\n", "\n", "analyzed_document = extractor.analyze_document(\n", "    file_source=path_cms1500_pdf.abspath,\n", "    features=[\n", "        TextractFeatures.FORMS,\n", "        TextractFeatures.TABLES,\n", "    ]\n", ")\n", "print(\"done\")" ], "metadata": { "collapsed": false, "pycharm": { "name": "#%%\n" } } }, { "cell_type": "code", "execution_count": 59, "outputs": [ { "data": { 
"text/plain": "2006605" }, "execution_count": 59, "metadata": {}, "output_type": "execute_result" } ], "source": [ "Path(dir_here, \"test_1.json\").write_text(json.dumps(analyzed_document.response, indent=4))" ], "metadata": { "collapsed": false, "pycharm": { "name": "#%%\n" } } }, { "cell_type": "markdown", "source": [ "### Key Value" ], "metadata": { "collapsed": false, "pycharm": { "name": "#%% md\n" } } }, { "cell_type": "code", "execution_count": 57, "outputs": [ { "data": { "text/plain": "[1a INSURED'S ID NUMBER : (For Program in Item 1),\n 4. INSURED'S NAME (Last Name, First Name, Middle Initial) : ALCON LABORATORIES,\n 2 PATIENT'S NAME (Last Name, First Name, Middle Initial) : Carrie Rodgers,\n YY : 1974,\n MM : 9,\n DD : 18,\n 7 INSURED'S ADDRESS (No., Street) : 6201 S freeway,\n 5 PATIENT'S ADDRESS (No., Street) : 2805 28th StNw,\n CITY : fort worth,\n 8. RESERVED FOR NUCC USE : ,\n STATE : DC,\n CITY : Washington,\n ZIP CODE : 76134,\n TELEPHONE (include Area Code) : (815)571-3008,\n TELEPHONE (Include Area Code) : (202)614-5824,\n ZIP CODE : 20008,\n 11. INSURED'S POLICY GROUP OR FECA NUMBER : FUR4398,\n 9. OTHER INSURED'S NAME (Last Name, First Name, Middle Initial) : ,\n a OTHER INSURED'S POLICY OR GROUP NUMBER : X1573,\n DD : 11,\n MM : 4,\n 1978 YY : ,\n b RESERVED FOR NUCC USE : ,\n b. OTHER CLAIM ID (Designated by NUCC) : Y41 FUR4398,\n C. INSURANCE PLAN NAME OR PROGR AM NAME : Travelers,\n C. RESERVED FOR NUCC USE : ,\n d. INSURANCE PLAN NAME OR PROGRAM NAME : ,\n 10d. CLAIM CODES (Designated by NUCC) : ,\n SIGNED : Cummium,\n DATE : 03/08/22,\n SIGNED : Curion,\n YY : 2013,\n DD. : 06,\n MM : ,\n YY : 21,\n MM : 07,\n YY : 21,\n DD : ,\n YY : ,\n MM : 12,\n DD : 06,\n MM : 12,\n QUAL : 439,\n QUAL : 431,\n 17. NAME OF REFERRING PROVIDER OR OTHER SOURCE : DN JOSE FUENTES MD,\n MM : ,\n DD : ,\n YY : ,\n MM : ,\n DD : ,\n YY : ,\n $CHARGES : ,\n 19. ADDITIONAL CLAIM INFORMATION (Designated by NUCC) : ,\n 22 RESUBMISSION CODE : ,\n ORIGINAL REF. 
NO : ,\n C. : M25512,\n A : S16 1XXA,\n B : M75 42,\n D. : This is David,\n 23 PRIOR AUTHORIZATION NUMBER : 02FUR4398,\n G : This is,\n E : This is E,\n F. : This is F,\n J. : This is Jack,\n I. : This is iphone,\n K : This is King,\n SSN : ,\n 25. FEDERAL TAX I.D. NUMBER : 203721804,\n 26 PATIENT'S ACCOUNT NO : MONROOOD,\n 29. AMOUNT PAID : $,\n 30. Rsvd. for NUCC Use : ,\n 32. SERVICE FACILITY LOCATION INFORMATION : NORTH 2601 TEAXS SCOTT AVE EHABILITATION 102, 76103,\n 33 BILLING PROVIDER INFO & PH # : NORTH PO BOX & TEXAS DALLAS 226656 REHABILITATION (214) 953-9431 TX 75222-6656,\n b : ,\n : 1508095761,\n : 150809576,\n b. : OTTX,\n : L. 4/4/2022 HOBBY,\n SIGNED : CHRISTY,\n APPROVED OMB-0938 : 1197,\n FORM : 1500 (02-12)]" }, "execution_count": 57, "metadata": {}, "output_type": "execute_result" } ], "source": [ "analyzed_document.key_values" ], "metadata": { "collapsed": false, "pycharm": { "name": "#%%\n" } } }, { "cell_type": "code", "execution_count": 38, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "WARNING:root:Key contains no words objects.\n", "WARNING:root:Key contains no words objects.\n", "WARNING:root:Key contains no words objects.\n", "WARNING:root:Key contains no words objects.\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "I. 
: This is iphone\n" ] }, { "data": { "text/plain": "(61.32811039686203, 74.83414195477962, 2069.866504251957, 2083.3725358098745)" }, "execution_count": 38, "metadata": {}, "output_type": "execute_result" } ], "source": [ "key_value_I = analyzed_document.get(key=\"I\")[0]\n", "print(key_value_I)\n", "# Convert the normalized bbox (0..1) to pixel coordinates.\n", "doc_width = 2480\n", "doc_height = 3509\n", "x = key_value_I.bbox.x * doc_width\n", "y = key_value_I.bbox.y * doc_height\n", "width = key_value_I.bbox.width * doc_width\n", "height = key_value_I.bbox.height * doc_height\n", "\n", "x_min = x\n", "y_min = y\n", "x_max = x + width\n", "y_max = y + height\n", "(x_min, x_max, y_min, y_max)" ], "metadata": { "collapsed": false, "pycharm": { "name": "#%%\n" } } }, { "cell_type": "code", "execution_count": 10, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "2 PATIENT'S NAME (Last Name, First Name, Middle Initial)\n", "Carrie Rodgers\n" ] } ], "source": [ "key_value = analyzed_document.key_values[2]\n", "print(key_value.key)\n", "print(key_value.value)" ], "metadata": { "collapsed": false, "pycharm": { "name": "#%%\n" } } }, { "cell_type": "code", "execution_count": 11, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "WARNING:root:Key contains no words objects.\n", "WARNING:root:Key contains no words objects.\n", "WARNING:root:Key contains no words objects.\n", "WARNING:root:Key contains no words objects.\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "11. 
INSURED'S POLICY GROUP OR FECA NUMBER = FUR4398\n" ] } ], "source": [ "key_value = analyzed_document.get(\"INSURED POLICY GROUP\".lower(), 3)[0]\n", "print(f\"{key_value.key} = {key_value.value}\")" ], "metadata": { "collapsed": false, "pycharm": { "name": "#%%\n" } } }, { "cell_type": "markdown", "source": [ "### Checkbox" ], "metadata": { "collapsed": false, "pycharm": { "name": "#%% md\n" } } }, { "cell_type": "code", "execution_count": 12, "outputs": [ { "data": { "text/plain": "[[ ] PICA,\n [X] PICA,\n [ ] OTHER (ID#),\n [ ] FECA (ID#) IKLUNG,\n [ ] GROUP HEALTH (ID#) PLAN,\n [ ] CHAMPVA (Member ID#),\n [ ] MEDICAID (Medicaid#),\n [ ] TRICARE (ID#/DoD#),\n [X] MEDICARE (Medicare#),\n [ ] F,\n [X] ,\n [ ] Other,\n [ ] Child,\n [ ] Self,\n [ ] Spouse,\n [X] STATE,\n [X] M,\n [ ] F,\n [X] YES,\n [ ] NO,\n [ ] PLACE (State),\n [ ] YES,\n [X] NO,\n [ ] YES,\n [X] NO,\n [X] NO,\n [ ] YES,\n [ ] DD,\n [ ] YES,\n [X] NO,\n [ ] ICD Ind.,\n [ ] H.,\n [ ] L.,\n [X] BIN,\n [ ] 28 TOTAL CHARGE,\n [X] YES,\n [ ] NO]" }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "analyzed_document.checkboxes" ], "metadata": { "collapsed": false, "pycharm": { "name": "#%%\n" } } }, { "cell_type": "code", "execution_count": 13, "outputs": [ { "data": { "text/plain": "x: 0.8941406011581421, y: 0.1098247841000557, width: 0.02421991527080536, height: 0.006424476392567158" }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "checkbox = analyzed_document.checkboxes[0]\n", "checkbox.bbox" ], "metadata": { "collapsed": false, "pycharm": { "name": "#%%\n" } } }, { "cell_type": "markdown", "source": [ "## Overlay" ], "metadata": { "collapsed": false, "pycharm": { "name": "#%% md\n" } } }, { "cell_type": "markdown", "source": [ "## Geo Finder" ], "metadata": { "collapsed": false, "pycharm": { "name": "#%% md\n" } } }, { "cell_type": "code", "execution_count": 3, "outputs": [], "source": [ "from textractgeofinder.ocrdb 
import AreaSelection\n", "from textractgeofinder.tgeofinder import KeyValue, TGeoFinder, AreaSelection, SelectionElement\n", "from textractprettyprinter.t_pretty_print import get_forms_string\n", "from textractcaller import call_textract\n", "from textractcaller.t_call import Textract_Features\n", "import trp.trp2 as t2" ], "metadata": { "collapsed": false, "pycharm": { "name": "#%%\n" } } }, { "cell_type": "code", "execution_count": 4, "outputs": [], "source": [ "js: dict = call_textract(\n", " input_document=s3path_cms1500_pdf.uri,\n", " features=[\n", " Textract_Features.FORMS,\n", " Textract_Features.TABLES,\n", " ]\n", ")" ], "metadata": { "collapsed": false, "pycharm": { "name": "#%%\n" } } }, { "cell_type": "code", "execution_count": 53, "outputs": [ { "data": { "text/plain": "2131574" }, "execution_count": 53, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import json\n", "\n", "Path(dir_here, \"test.json\").write_text(json.dumps(js, indent=4))" ], "metadata": { "collapsed": false, "pycharm": { "name": "#%%\n" } } }, { "cell_type": "code", "execution_count": 5, "outputs": [ { "data": { "text/plain": "trp.trp2.TDocument" }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "document: t2.TDocument = t2.TDocumentSchema().load(js)\n", "type(document)" ], "metadata": { "collapsed": false, "pycharm": { "name": "#%%\n" } } }, { "cell_type": "code", "execution_count": 6, "outputs": [ { "data": { "text/plain": "" }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "doc_width = 2480\n", "doc_height = 3509\n", "geofinder_doc = TGeoFinder(js, doc_height=doc_height, doc_width=doc_width)\n", "geofinder_doc" ], "metadata": { "collapsed": false, "pycharm": { "name": "#%%\n" } } }, { "cell_type": "code", "execution_count": 10, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "done\n" ] } ], "source": [ "geofinder_doc.__del__()\n", "print(\"done\")" ], "metadata": 
{ "collapsed": false, "pycharm": { "name": "#%%\n" } } }, { "cell_type": "code", "execution_count": 7, "outputs": [ { "data": { "text/plain": "TWord(text='diagnosis or nature of illness or injury', original_text='DIAGNOSIS OR NATURE OF ILLNESS OR INJURY', text_type='phrase', confidence=99.78182002476284, id='e10c607d-31a4-4f69-acdc-dcc99cbe224e', xmin=94, ymin=1904, xmax=671, ymax=1927, page_number=1, doc_width=2480, doc_height=3509, child_relationships='', reference=None, resolver=None)" }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "key_21_phrase = geofinder_doc.find_phrase_on_page(\"DIAGNOSIS OR NATURE OF ILLNESS OR INJURY\")[0]\n", "key_21_phrase" ], "metadata": { "collapsed": false, "pycharm": { "name": "#%%\n" } } }, { "cell_type": "code", "execution_count": 8, "outputs": [], "source": [ "from PIL import Image, ImageDraw\n", "\n", "def show_bounding_box(path, phrase, fill=None):\n", "    # Draw the bounding box of ``phrase`` on the image at ``path`` and display it.\n", "    with Image.open(path) as img:\n", "        x, y = img.size\n", "        print(x, y)\n", "        # Use the document size stored on the phrase that was passed in.\n", "        doc_width = phrase.doc_width\n", "        doc_height = phrase.doc_height\n", "        draw = ImageDraw.Draw(img)\n", "        # Rescale from document coordinates to image pixels.\n", "        xy = [\n", "            phrase.xmin / doc_width * x,\n", "            phrase.ymin / doc_height * y,\n", "            phrase.xmax / doc_width * x,\n", "            phrase.ymax / doc_height * y,\n", "        ]\n", "        draw.rectangle(\n", "            xy=xy,\n", "            outline=128,\n", "            fill=fill,\n", "            width=2,\n", "        )\n", "        img.show()" ], "metadata": { "collapsed": false, "pycharm": { "name": "#%%\n" } } }, { "cell_type": "code", "execution_count": 15, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "2480 3509\n" ] } ], "source": [ "show_bounding_box(path_cms1500_png.abspath, key_21_phrase)" ], "metadata": { "collapsed": false, "pycharm": { "name": "#%%\n" } } }, { "cell_type": "code", "execution_count": 9, "outputs": [ { "data": { "text/plain": "TWord(text='diagnosis pointer', original_text='DIAGNOSIS POINTER', text_type='phrase', confidence=99.80844497680664, 
id='35d36ccd-c909-4294-8c25-4ae1f4764aa6', xmin=1326, ymin=2129, xmax=1466, ymax=2181, page_number=1, doc_width=2480, doc_height=3509, child_relationships='', reference=None, resolver=None)" }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "key_diagnosis_pointer = geofinder_doc.find_phrase_on_page(\"DIAGNOSIS POINTER\")[0]\n", "key_diagnosis_pointer" ], "metadata": { "collapsed": false, "pycharm": { "name": "#%%\n" } } }, { "cell_type": "code", "execution_count": 18, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "2480 3509\n" ] } ], "source": [ "show_bounding_box(path_cms1500_png.abspath, key_diagnosis_pointer)" ], "metadata": { "collapsed": false, "pycharm": { "name": "#%%\n" } } }, { "cell_type": "code", "execution_count": 10, "outputs": [], "source": [ "top_left = t2.TPoint(x=50, y=key_21_phrase.ymin-50)\n", "lower_right = t2.TPoint(x=key_diagnosis_pointer.xmax+50, y=key_diagnosis_pointer.ymin+100)" ], "metadata": { "collapsed": false, "pycharm": { "name": "#%%\n" } } }, { "cell_type": "code", "execution_count": 11, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "115\n", "$ charges\n", "10d. claim codes (designated by nucc)\n", "11. insured's policy group or feca number\n", "17. name of referring provider or other source\n", "19. additional claim information (designated by nucc)\n", "1a. insured's i.d. number\n", "2 patient's name (last name, first name, middle initial)\n", "22. resubmission code\n", "23. prior authorization number\n", "25. federal tax i.d. number\n", "26. patient's account no\n", "28. total charge\n", "29. amount paid\n", "30. rsvd. for nucc use\n", "32. service facility location information\n", "33 billing provider info & ph #\n", "4. insured's name (last name, first name, middle initial)\n", "4/4/2022 date\n", "5 patient's address (no. street)\n", "7. insured's address (no., street)\n", "8. 
reserved for nucc use\n", "9 other insured's name (last name, first name, middle initial)\n", "a\n", "a.\n", "a. other insured's policy or group number\n", "approved\n", "b\n", "b\n", "b.\n", "b. other claim id (designated by nucc)\n", "b. reserved for nucc use\n", "c.\n", "c. insurance plan name or program name\n", "c. reserved for nucc use\n", "champva (member id#)\n", "child\n", "city\n", "city\n", "d insurance plan name or program name\n", "d.\n", "date\n", "dd\n", "dd\n", "dd\n", "dd\n", "dd\n", "dd\n", "dd\n", "dd.\n", "e\n", "ein\n", "f\n", "f\n", "f.\n", "feca blklung (id#)\n", "form\n", "g\n", "group health plan (id#)\n", "h.\n", "icd ind.\n", "j.\n", "k\n", "l.\n", "m\n", "medicaid (medicaid#)\n", "medicare (medicare#)\n", "mm\n", "mm\n", "mm\n", "mm\n", "mm\n", "mm\n", "mm\n", "mm\n", "no\n", "no\n", "no\n", "no\n", "no\n", "no\n", "original ref. no\n", "other\n", "other (id#)\n", "pica\n", "pica\n", "place (state)\n", "qual\n", "qual\n", "self\n", "signed\n", "signed\n", "signed\n", "spouse\n", "ssn\n", "state\n", "state\n", "telephone (include area code)\n", "telephone (include area code)\n", "tricare (id#/dod#)\n", "yes\n", "yes\n", "yes\n", "yes\n", "yes\n", "yes\n", "yy\n", "yy\n", "yy\n", "yy\n", "yy\n", "yy\n", "yy 1974\n", "yy 1978\n", "zip code\n", "zip code\n" ] } ], "source": [ "# a_to_l_fields = geofinder_doc.get_form_fields_in_area(\n", "# area_selection=AreaSelection(top_left=top_left, lower_right=lower_right, page_number=1)\n", "# )\n", "a_to_l_fields = geofinder_doc.get_form_fields_in_area(\n", " area_selection=AreaSelection(\n", " top_left=t2.TPoint(x=0, y=0),\n", " lower_right=t2.TPoint(x=doc_width, y=doc_height),\n", " page_number=1,\n", " )\n", ")\n", "print(len(a_to_l_fields))\n", "for field in sorted(\n", " a_to_l_fields,\n", " key=lambda x: x.key.text,\n", "):\n", " # print(field.key.text, field.value.text)\n", " # print(field.key.text, field.value)\n", " print(field.key.text)\n" ], "metadata": { "collapsed": false, "pycharm": { 
"name": "#%%\n" } } }, { "cell_type": "code", "execution_count": 24, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "94 1326 1466\n" ] } ], "source": [ "print(key_21_phrase.xmin, key_diagnosis_pointer.xmin, key_diagnosis_pointer.xmax)" ], "metadata": { "collapsed": false, "pycharm": { "name": "#%%\n" } } }, { "cell_type": "code", "execution_count": 36, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "50 1516 1854 2129\n" ] } ], "source": [ "# (61.32811039686203, 74.83414195477962, 2069.866504251957, 2083.3725358098745)\n", "print(top_left.x, lower_right.x, top_left.y, lower_right.y)" ], "metadata": { "collapsed": false, "pycharm": { "name": "#%%\n" } } }, { "cell_type": "code", "execution_count": 29, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "a s16 1xxa\n", "b m75 42\n", "c. m25512\n", "d. this is david\n", "e this is e\n", "f. this is f\n", "g this is g\n", "h. this is h\n", "icd ind. not_selected\n", "j. this is jack\n", "k this is king\n", "l. this is letter\n" ] } ], "source": [ "for field in sorted(\n", " a_to_l_fields,\n", " key=lambda x: x.key.text,\n", "):\n", " print(field.key.text, field.value.text)" ], "metadata": { "collapsed": false, "pycharm": { "name": "#%%\n" } } }, { "cell_type": "code", "execution_count": null, "outputs": [], "source": [], "metadata": { "collapsed": false, "pycharm": { "name": "#%%\n" } } } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 2 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython2", "version": "2.7.6" } }, "nbformat": 4, "nbformat_minor": 0 }