Understanding Document Types

Hyperscience extracts data from documents and converts it to a machine-readable format. We support Structured, Semi-structured, and Additional documents. 

In this article, you'll learn how to differentiate between these three document types and understand how Hyperscience processes them differently.

Structured

Structured documents are those where the layout of information on the page is consistent (that is, the information is in the same location every time you receive the same form). These documents are often printed and filled out by hand, or they are PDFs with fields that can be completed electronically.

For example, consider a tax form issued by your local government. Everyone who fills it out uses the same form and enters their own information in each field, but the location of each field stays the same.

Essentially, documents with consistent location of data types and data fields are Structured.

Semi-structured

Semi-structured documents do not necessarily have a consistent location for information across all documents.

For example, consider a business that receives hundreds of thousands of invoices from hundreds of different vendors. Each vendor will likely place common pieces of information (e.g., the invoice identification number or the payer's address) in different places on their respective invoices. Even among documents received from a single vendor, the exact location of some pieces of information may vary due to the number of line items (e.g., "Total" may be on page 1 in some documents and page 4 in others).

Essentially, documents with consistent data types but inconsistent location of data fields are Semi-structured.

Additional

Additional documents do not have any fields for extraction. Typically, these are contextual components that belong in a submission but do not have any valuable information for downstream processes. 

For example, consider an insurance claim composed of multiple pages meant for extraction. The claim will almost certainly be introduced by a cover page, but that page is unlikely to provide any information beyond what is already contained in the rest of the document.

Essentially, these documents do not have data for extraction – just categorization for context.
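The three rules of thumb above can be sketched as a small decision function. This is only an illustration of the reasoning in this article, not part of the Hyperscience product or its API; the names DocumentType and document_type, and both parameters, are hypothetical.

```python
from enum import Enum


class DocumentType(Enum):
    STRUCTURED = "Structured"
    SEMI_STRUCTURED = "Semi-structured"
    ADDITIONAL = "Additional"


def document_type(has_fields_to_extract: bool,
                  consistent_field_locations: bool) -> DocumentType:
    """Apply the article's rules of thumb (hypothetical helper)."""
    if not has_fields_to_extract:
        # No data for extraction, only categorization for context.
        return DocumentType.ADDITIONAL
    if consistent_field_locations:
        # The same fields sit in the same place on every copy of the form.
        return DocumentType.STRUCTURED
    # Same kinds of data, but their position varies between documents.
    return DocumentType.SEMI_STRUCTURED


print(document_type(True, True).value)    # e.g., a government tax form
print(document_type(True, False).value)   # e.g., invoices from many vendors
print(document_type(False, False).value)  # e.g., a cover page
```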

Workflow differences based on document type

Layout setup

Regardless of the document type, you need to create a layout before processing a document through Hyperscience. However, the layout setup process differs among Structured, Semi-structured, and Additional documents. 

  • For Structured layouts, the original unfilled form is uploaded to create the layout. Layout setup is complete once a user has drawn the bounding boxes for all desired fields and completed the required metadata (e.g., Field Name and Data Type). To learn more, see Creating Structured Layouts.
    • You can also create layout variations for forms that vary only slightly and share all or most of the same fields. When you create multiple variations of a layout, those variations collectively make up the layout, along with any field customizations you’ve created for releases. See Adding a Variation to a Layout for more information. 
  • For Semi-structured layouts, no document needs to be uploaded to create the layout. Layout setup consists only of defining the fields desired for extraction along with their metadata. More details can be found in Creating Semi-structured Layouts.
  • For Additional layouts, no document needs to be uploaded to create the layout, nor does any metadata need to be defined. To learn more, see Creating Additional Layouts.

Supervision tasks

Because the document types differ in nature, some Supervision tasks apply only to certain document types, as the following table shows. For more information on Supervision tasks, see What is Extraction Supervision?

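| Document Type   | Document Classification | Identification | Transcription |
|-----------------|-------------------------|----------------|---------------|
| Structured      | X                       |                | X             |
| Semi-structured | X                       | X              | X             |
| Additional      | X                       |                | *             |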

*Additional documents can go through Transcription if the Manual Extract workflow is enabled during layout creation.
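If it helps to reason about the table and its footnote programmatically, they can be expressed as a simple lookup. This is a minimal sketch for illustration; SUPERVISION_TASKS, applicable_tasks, and the manual_extract_enabled flag are hypothetical names, not Hyperscience settings or API calls.

```python
# Which Supervision tasks apply to each document type, per the table above.
SUPERVISION_TASKS = {
    "Structured": {"Document Classification", "Transcription"},
    "Semi-structured": {"Document Classification", "Identification", "Transcription"},
    "Additional": {"Document Classification"},
}


def applicable_tasks(document_type: str, manual_extract_enabled: bool = False) -> set:
    """Return the Supervision tasks that can apply to a document type."""
    tasks = set(SUPERVISION_TASKS[document_type])
    # Footnote: Additional documents can go through Transcription only if the
    # Manual Extract workflow was enabled during layout creation.
    if document_type == "Additional" and manual_extract_enabled:
        tasks.add("Transcription")
    return tasks


print(sorted(applicable_tasks("Structured")))
print(sorted(applicable_tasks("Additional", manual_extract_enabled=True)))
```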

Document Classification

Document Classification tasks are generated based on the system settings for Document Classification tasks. For more information about these settings, see Document Classification Settings.

For more information about this kind of Supervision task, see Document Classification.

Identification

There are two types of Identification tasks that apply exclusively to Semi-structured documents: Field Identification and Table Identification. For more information about the related settings, see Identification Settings.

  1. Field Identification tasks can be created in two cases:
    • Before a model has been trained and deployed, all fields will require manual Field Identification.
    • After a trained model has been deployed, and when the machine is uncertain about the location of a given field, a manual Field Identification task will be created.
  2. Table Identification tasks will always be created for documents associated with a layout that contains tables.
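The rules in the list above can be sketched as follows. Note the assumptions: this article does not say how the machine measures uncertainty, so the sketch models it as a confidence score compared against a made-up threshold. The function name, its parameters, and the threshold value are all hypothetical and do not reflect the product's implementation.

```python
from typing import Dict, Optional

# Assumption for illustration only: "uncertain" means a confidence score
# below this cutoff. The real criterion is not described in this article.
HYPOTHETICAL_THRESHOLD = 0.9


def identification_tasks(
    field_confidences: Dict[str, Optional[float]],
    model_deployed: bool,
    layout_has_tables: bool,
    threshold: float = HYPOTHETICAL_THRESHOLD,
) -> Dict[str, object]:
    """Decide which Identification work a Semi-structured document needs."""
    if model_deployed:
        # Case 2: only fields whose location the model is unsure about.
        fields_to_review = [
            name
            for name, confidence in field_confidences.items()
            if confidence is None or confidence < threshold
        ]
    else:
        # Case 1: no trained model yet, so every field is identified manually.
        fields_to_review = list(field_confidences)
    return {
        "field_identification": fields_to_review,
        # Table Identification is always created when the layout has tables.
        "table_identification": layout_has_tables,
    }


confidences = {"invoice_number": 0.98, "total": 0.62}
print(identification_tasks(confidences, model_deployed=False, layout_has_tables=True))
print(identification_tasks(confidences, model_deployed=True, layout_has_tables=False))
```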

For more information about these kinds of Supervision tasks, see Table Identification and Field Identification.

Transcription

Transcription tasks are created whenever the machine is uncertain about the transcription of a specific field. For more information about the settings that control these tasks, see Transcription Settings.

For more information about this kind of Supervision task, see Transcription.

Quality Assurance

Quality Assurance (QA) serves two purposes within Hyperscience:

  1. To provide accuracy estimates for Field Output, Transcription, and Field Identification
  2. To provide training data for the system to improve itself

QA for Structured Documents

For Structured documents, individual fields are sampled according to a configurable sample rate and put through a consensus process to determine the "correct" answer. For more information on the consensus process, see Scoring Transcription Accuracy and Scoring Field Identification Accuracy.

This QA data is used to improve transcription accuracy and reduce manual intervention by identifying patterns in the data. For more information on this, see Transcription Settings.

QA for Semi-structured Documents

For Semi-structured documents, individual documents are sampled according to a configurable sample rate and put through a QA process to measure Field Identification accuracy. For more information on this process, see the article Field Identification Quality Assurance.

This document-level QA data is also used for training models to automate Field Identification of Semi-structured documents. For more information on training models, see Training a New Field Identification Model.
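The key difference between the two QA approaches is the unit being sampled: individual fields for Structured documents, whole documents for Semi-structured ones. The sketch below illustrates that difference with the same helper applied at both levels. It assumes independent random sampling at the configured rate, which this article does not confirm, and it does not model the consensus process. The function sample_for_qa and the sample data are hypothetical.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def sample_for_qa(items: Iterable[T], sample_rate: float,
                  seed: Optional[int] = None) -> List[T]:
    """Keep each item with probability sample_rate (assumed sampling scheme)."""
    if not 0.0 <= sample_rate <= 1.0:
        raise ValueError("sample_rate must be between 0 and 1")
    rng = random.Random(seed)
    return [item for item in items if rng.random() < sample_rate]


# Structured: the unit of sampling is an individual field.
fields = [("form_1", "first_name"), ("form_1", "last_name"), ("form_2", "first_name")]
qa_fields = sample_for_qa(fields, sample_rate=0.5, seed=1)

# Semi-structured: the unit of sampling is a whole document.
documents = ["invoice_001", "invoice_002", "invoice_003"]
qa_documents = sample_for_qa(documents, sample_rate=0.5, seed=1)

print(len(qa_fields), "fields and", len(qa_documents), "documents selected for QA")
```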

 
