Text Insight Solution ============================================================================== Keywords: text insight solution, unstructured document to structured data, pdf to data Summary ------------------------------------------------------------------------------ "Text Insight" is an "Unstructured Document to Structured Data" solution that can extract high quality, machine readable, structured data from PDF, Image, or any None Text data format (has potential to extend to process Audio / Video). Target vertical industry include: - Law Service - Financial Document - Insurance Document - Health Care Document Architect ------------------------------------------------------------------------------ .. raw:: html :file: ./pdf-to-data-solution.drawio.html .. tab:: 1. Raw Data User or Machine uploads raw document to S3 bucket. S3 Folder Structure: - Raw File: ``2022-01-01-financial-report.pdf`` - S3 Object: ``s3://my-bucket/my-folder/01-raw/${45781d436b1c285fbf11eb90e60a2a93_MD5}.dat``, the ``4578...`` is the MD5 of ``2022-01-01-financial-report.pdf`` for deduplication. The original file name can be stored as a S3 Object Tag .. tab:: 2. Trigger Text Tract Once the raw file been uploaded to S3 bucket, the S3 put object event will trigger a Lambda Function, and the Lambda Function calls the Textract **async** API. .. tab:: 3. Textract Once the text-extract operation is done, Textract will store the "machine readable extracted data" in JSON format in S3 bucket. Since this process may takes long (if it is 100+ pages PDF), you can configure to send an notification to the SNS topic when it is done. .. tab:: 4. Extracted Text The extracted text data is stored in S3 bucket The Machine readable extracted data: .. code-block:: python # Sample textract output JSON { "Blocks": [ { "Id": "c6dac97a-ec9d-4b74-b9f4-554853bd88a4", "BlockType": "PAGE | LINE | WORD", "Text": "your text here", "Geometry": { "BoundingBox": {...}, "Polygon": [...] }, "Relationships": [...], ... }, ... ] } Convert to Human readable extracted text: .. code-block:: python # Create a pure-text merged view of the extracted text data data = json.loads(s3path.read_text()) lines = list() for block in data["Blocks"]: s.add(block["BlockType"]) if block["BlockType"] == "LINE": lines.append(block["Text"]) content = "\n".join(lines) .. tab:: 5. SNS Topic Textract will send a message to SNS topic when the async operation is done. It can trigger subsequence job as required. .. tab:: 6. Trigger Comprehend The SNS message triggers a Lambda Function that invoke the Comprehend API, try to detect entities from extracted text. The input of the comprehend is the "Human readable extracted text" data. .. tab:: 7. Comprehend Once the detect-entity operation is done, it will store the machine readable detected entities data in JSON in S3 Bucket. .. tab:: 8. Detected Entities Sample comprehend output data: .. code-block:: python # Machine readable extracted text { "Entities": [ { "Score": 0.851378858089447, "Type": "ORGANIZATION", "Text": "CENTER FOR MEDICARE", "BeginOffset": 0, "EndOffset": 86 }, ... ] } .. tab:: 9. Trigger HIL The Comprehend output JSON file creation event will trigger a Lambda Function, and the Lambda Function can do necessary post process on Textract and Comprehend output, and it will trigger the Human in Loop to verify the quality of extracted data. .. tab:: 10. Human In Loop A HIL task is created by the Lambda Function. .. tab:: 11. Human Review The Human workers receive the assign HIL, and be able to provide feed back in HIL GUI. Sample GUI: .. image:: ./hil-ui.png .. tab:: 12. HIL Output The HIL output data will be saved to S3 bucket. Sample HIL Output: .. code-block:: python [ { "Change Reason1": "looks weird", "True Prediction1": "sanhe prediction", "predicted1": "0.1544346809387207", "predicted2": "0.4938497543334961", "predicted3": "0.23486430943012238", "rating1": { "agree": true, "disagree": false }, "rating2": { "agree": false, "disagree": true }, "rating3": { "agree": true, "disagree": false } } ] .. tab:: 13. Save to Data Store The creation HIL Output event will trigger a Lambda Function that merges HIL output with the Textract / Comprehend output, and store validated data to final Data Store. .. tab:: 14. Data Store The required structured data of the original document will be stored in proper data store backend for future use. .. tab:: 15. Status Tracker Dynamodb The entire workflow has multiple steps, we could store the status information for each step in Dynamodb and be able to use a simple query to continues the workflow from any step.