AI-Augmented OCR with Amazon Textract

Artificial Intelligence & MLOps

Learn how organizations can eliminate manual data extraction with Amazon Textract, a cutting-edge tool that uses machine learning to extract and organize text and data from scanned documents.

If you’ve experienced applications that extract text from a scanned image or document, then you’re likely familiar with optical character recognition (OCR). This functionality has been around for many years, and the use cases are clear – online data capture, automated text validation, and more efficient workflows, to name a few. Amazon Textract has taken this idea and turned it into a cloud-native, data-driven, Machine Learning tool with all the modernized benefits of being in the AWS ecosystem.

What is Amazon Textract?

Textract offers a sense of near-immediate satisfaction: it quickly extracts text and structured data – typed or handwritten – from scanned documents, then it organizes this data and outputs it back to you as you see fit.

However, it’s not your typical OCR. Amazon Textract is a Machine Learning tool; it identifies the format of the data, understands the structure, and organizes it in a way that is easy to consume. The more it does this, the smarter it gets.

If you’re dealing with particularly sensitive information, you can choose to use additional features – such as Amazon Augmented AI – to allow an extra layer of human review of your data, as well. We’ll touch on this later in the article.

Why Use Textract?

Simple: Textract quickly extracts written or typed words from documents and transforms them into usable data formats. It does this efficiently and accurately, which saves time, energy, and money.

Imagine being presented with a stack of invoices and having to manually enter that data into a spreadsheet to keep track of business expenses. Or maybe you’re a healthcare worker managing files for new patients – all written by hand with differing levels of legibility – and you have to manually input that form data into the hospital database.

These are just a few scenarios for which Textract can be used to automate data extraction and organization. A key takeaway here is the speed and high accuracy that Textract offers for both printed and handwritten data. The risk of human error alone in both of the real-life examples above could be costly. What would take the business or healthcare worker hours every day to complete, Textract can do in a matter of minutes or less, freeing up workers to focus on other tasks.

This same smart technology can be utilized to optimize your organization’s data management. As long as the data you’re working with is in the form of text or images with text (letters, numbers, and symbols included within Textract’s current capabilities), Textract can see it!

How Does It Really Work?

Alright, here's the deep-dive. Currently, Textract detects text within documents that fits the Standard English alphabet and ASCII symbols, as well as signatures, forms, and tables written in English, German, French, Spanish, Italian, and Portuguese. Files can be sent to Textract for processing as PNGs, JPGs, TIFFs, or PDFs.

While there’s a lot of “vision” talk going on here, when it comes to software, there’s always more going on than meets the eye. Textract shines brightest when partnered with additional cloud-native services, such as Amazon S3, Amazon Simple Notification Service (SNS), and AWS Lambda, to create fully functional solutions. These services can work together to automatically batch process large numbers of documents for optimal efficiency. A simple pipeline could look something like this:

This solution demonstrates a case where a document is uploaded to an S3 bucket. Amazon S3 Event Notifications trigger a Lambda function, which calls Textract asynchronously to process and extract text from the document. Once the job is complete, the results are written back to the same S3 bucket.
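As a rough illustration of the first half of this pipeline, the Lambda function behind the S3 trigger might look something like the sketch below. It is a minimal example under stated assumptions, not a production handler: the `start_textract_jobs` helper and the `textract-output` prefix are placeholder names of our own, and the Textract client is passed in so the logic can be exercised without AWS credentials.

```python
import urllib.parse


def start_textract_jobs(event, textract_client=None):
    """Start one asynchronous Textract job per object in an S3 event."""
    if textract_client is None:
        import boto3  # imported lazily so the helper can be tested offline

        textract_client = boto3.client("textract")

    job_ids = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # S3 URL-encodes object keys in event notifications.
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])

        response = textract_client.start_document_text_detection(
            DocumentLocation={"S3Object": {"Bucket": bucket, "Name": key}},
            # Hypothetical prefix: Textract writes its results to the same bucket.
            OutputConfig={"S3Bucket": bucket, "S3Prefix": "textract-output"},
        )
        job_ids.append(response["JobId"])
    return job_ids


def lambda_handler(event, context):
    # Entry point that the S3 Event Notification would invoke.
    return {"statusCode": 200, "jobIds": start_textract_jobs(event)}
```

Because the results land in the same bucket that fires the trigger, it is worth scoping the S3 Event Notification to an input prefix so that Textract's own output doesn't invoke the function again.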

The solution can handle more complexity by integrating other native AWS services, such as Amazon Comprehend, which works with Textract to utilize Machine Learning to find insightful relationships amongst extracted text. A more involved workflow might look something like this:

Here, SNS publishes a notification after Textract completes its work. That notification triggers a second Lambda function, which processes Textract's results and runs them through Comprehend to find relationships within the extracted data.
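To make the second Lambda function a little more concrete, here is a hedged sketch of what it could do. It assumes the Textract job was started with a notification channel pointing at the SNS topic, that the job was a text-detection job, and that the documents are in English; the helper names and the 5,000-character cap on the text sent to Comprehend are conservative choices of our own, not requirements from either service's documentation.

```python
import json


def collect_lines(textract_client, job_id):
    """Gather the text of every LINE block across all result pages."""
    lines, token = [], None
    while True:
        kwargs = {"JobId": job_id}
        if token:
            kwargs["NextToken"] = token
        page = textract_client.get_document_text_detection(**kwargs)
        lines += [b["Text"] for b in page.get("Blocks", []) if b["BlockType"] == "LINE"]
        token = page.get("NextToken")
        if not token:
            return lines


def handle_textract_notification(event, textract_client, comprehend_client):
    """Read Textract's SNS completion messages and run Comprehend on the text."""
    results = []
    for record in event.get("Records", []):
        message = json.loads(record["Sns"]["Message"])
        if message.get("Status") != "SUCCEEDED":
            # Nothing to analyze for failed jobs; report them and move on.
            results.append({"jobId": message.get("JobId"), "entities": None})
            continue
        text = "\n".join(collect_lines(textract_client, message["JobId"]))
        # Assumption: a conservative cap to stay within Comprehend's request size limit.
        found = comprehend_client.detect_entities(Text=text[:5000], LanguageCode="en")
        results.append({"jobId": message["JobId"], "entities": found["Entities"]})
    return results


def lambda_handler(event, context):
    import boto3  # imported lazily so the helpers above can be tested offline

    return handle_textract_notification(
        event, boto3.client("textract"), boto3.client("comprehend")
    )
```

Entity detection is only one of Comprehend's operations; the same shape would work for key phrases or sentiment by swapping the Comprehend call.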

Textract APIs

So, how does Textract do this? Textract currently exposes several APIs that enable the extraction and organization of data in a variety of ways. One of the most commonly used is AnalyzeDocument, which looks for key relationships between data within a document. Another popular one is AnalyzeExpense, which specifically looks for financial relationships amongst the data. Textract’s APIs can be called either synchronously or asynchronously, depending on the number of documents to analyze and the use case.

For the developers out there, you might use something like the following Python code via AWS SDK to make a call to AnalyzeDocument to detect text, forms, or tables in an image (in this case, either PNG or JPG format):
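A minimal sketch of such a call, using the AWS SDK for Python (boto3), might look like this. The function names are our own, and the client is an optional argument so the example can be tried without credentials; with none supplied, it falls back to a default Textract client, which assumes your AWS credentials and region are already configured.

```python
def read_image(path):
    """Read a local PNG or JPG file into memory."""
    with open(path, "rb") as image_file:
        return image_file.read()


def analyze_image_bytes(image_bytes, feature_types=("TABLES", "FORMS"), client=None):
    """Send image bytes to Textract's synchronous AnalyzeDocument API."""
    if client is None:
        import boto3  # imported lazily so the function can be tested offline

        client = boto3.client("textract")

    return client.analyze_document(
        Document={"Bytes": image_bytes},
        # Ask Textract to look for tables and form fields as well as raw text.
        FeatureTypes=list(feature_types),
    )
```

With a hypothetical file named scanned-form.png, the call would be `analyze_image_bytes(read_image("scanned-form.png"))`.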

Textract’s response will come back as a list of blocks that describes the text elements detected in the image.
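To give a feel for that list of blocks, the sketch below summarizes a response by counting each block type and pulling out the detected lines with their confidence scores. The `sample_response` is a heavily abbreviated, hand-written stand-in for a real response (real blocks also carry geometry and relationships), and `summarize_blocks` is a helper name of our own.

```python
from collections import Counter


def summarize_blocks(response):
    """Count block types and list each detected line with its confidence."""
    blocks = response.get("Blocks", [])
    return {
        "counts": dict(Counter(block["BlockType"] for block in blocks)),
        "lines": [
            (block["Text"], round(block["Confidence"], 1))
            for block in blocks
            if block["BlockType"] == "LINE"
        ],
    }


# Abbreviated, hand-written sample shaped like a Textract response.
sample_response = {
    "Blocks": [
        {"BlockType": "PAGE", "Id": "page-1"},
        {"BlockType": "LINE", "Id": "line-1", "Text": "Hello world", "Confidence": 99.52},
        {"BlockType": "WORD", "Id": "word-1", "Text": "Hello", "Confidence": 99.61},
        {"BlockType": "WORD", "Id": "word-2", "Text": "world", "Confidence": 99.43},
    ]
}

summary = summarize_blocks(sample_response)
print(summary["counts"])
print(summary["lines"])
```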

For a Lambda function, you might make a call that looks something like this, building in error statuses so we can troubleshoot when Textract isn’t able to read text from a provided file:
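One possible shape for that Lambda function is sketched below. The status codes and the `NoTextDetected` label are illustrative choices of our own rather than anything Textract returns, and the error handling reads the service error code from the exception's `response` attribute, which is where boto3's client errors carry it.

```python
import json
import urllib.parse

# Textract error codes that point at a problem with the submitted document.
DOCUMENT_ERROR_CODES = {
    "UnsupportedDocumentException",
    "BadDocumentException",
    "DocumentTooLargeException",
    "InvalidS3ObjectException",
    "InvalidParameterException",
}


def extract_text(event, textract_client):
    """Run synchronous text detection on the uploaded object and report a status."""
    record = event["Records"][0]
    bucket = record["s3"]["bucket"]["name"]
    key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])

    try:
        response = textract_client.detect_document_text(
            Document={"S3Object": {"Bucket": bucket, "Name": key}}
        )
    except Exception as error:
        # boto3 client errors expose the service error code under .response.
        code = getattr(error, "response", {}).get("Error", {}).get("Code", "Unknown")
        status = 400 if code in DOCUMENT_ERROR_CODES else 500
        return {"statusCode": status, "body": json.dumps({"error": code, "document": key})}

    lines = [b["Text"] for b in response.get("Blocks", []) if b["BlockType"] == "LINE"]
    if not lines:
        # Hypothetical label for a readable file that contained no detectable text.
        body = {"error": "NoTextDetected", "document": key}
        return {"statusCode": 422, "body": json.dumps(body)}
    return {"statusCode": 200, "body": json.dumps({"document": key, "lines": lines})}


def lambda_handler(event, context):
    import boto3  # imported lazily so extract_text can be tested offline

    return extract_text(event, boto3.client("textract"))
```

Separating document problems (400) from everything else (500) makes it easier to tell a bad scan from a permissions or throttling issue when troubleshooting.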

Textract Feature Types

Beyond the traditional OCR mechanism, Textract contains some features that make it easier to get the desired data from the document. Most of them are focused on the question, "What is the value of a given table, form, or query?" rather than dealing with other cumbersome factors, such as page layout and position of the data in the document.

To explore these feature types we’ll use the PDF for IRS Form W-9. Using this well-known, complex document as an example is a great way to see Textract feature types in action.

Textract Feature Types – Generic Analysis Method

Let's see how a complete analysis can be done on a document that contains multiple pages. The trick is to follow the NextToken property returned by "get_document_analysis" to page through the results. This method also reports the processing status of the job identified by the JobId that "start_document_analysis" returns.

The example below shows how the pagination can be implemented and will be used in the following examples:
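Here is a hedged sketch of that pagination pattern. It polls "get_document_analysis" until the job finishes, then follows NextToken until no more pages remain. The function names, the polling interval, and the retry limit are our own choices; in a production pipeline you would more likely wait for the SNS completion notification than poll.

```python
import time


def get_all_blocks(job_id, client, poll_seconds=5, max_polls=60):
    """Wait for an analysis job to finish, then collect blocks from every page."""
    for _ in range(max_polls):
        response = client.get_document_analysis(JobId=job_id)
        status = response["JobStatus"]
        if status in ("SUCCEEDED", "PARTIAL_SUCCESS"):
            break
        if status == "FAILED":
            raise RuntimeError(response.get("StatusMessage", "Textract job failed"))
        time.sleep(poll_seconds)  # still IN_PROGRESS
    else:
        raise TimeoutError(f"Job {job_id} did not finish after {max_polls} polls")

    blocks = list(response.get("Blocks", []))
    next_token = response.get("NextToken")
    while next_token:
        # Each call returns one page of results plus a token for the next page.
        response = client.get_document_analysis(JobId=job_id, NextToken=next_token)
        blocks.extend(response.get("Blocks", []))
        next_token = response.get("NextToken")
    return blocks


def analyze_s3_document(bucket, key, feature_types, client, **extra_params):
    """Start an asynchronous analysis of an S3 object and return all of its blocks."""
    job = client.start_document_analysis(
        DocumentLocation={"S3Object": {"Bucket": bucket, "Name": key}},
        FeatureTypes=list(feature_types),
        **extra_params,
    )
    return get_all_blocks(job["JobId"], client)
```

The `extra_params` pass-through leaves room for feature-specific settings, such as the query configuration used by the "QUERIES" feature type.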

Textract Feature Type – Tables

The "TABLES" feature type extracts data from tables within the PDF file. For example, consider the following table in the document:

After running the analysis using the Feature Type "TABLES," the following data is extracted from the file:

We can see that the first row of the table was successfully extracted. The rest of the data can be extracted the same way; only the first row is presented here for brevity. If you want to extract all of the data from a table, here is code that would help you achieve that:
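The sketch below is one way to do it, written as a standalone helper that takes the list of blocks returned by a "TABLES" analysis. It relies on the structure Textract uses for tables: a TABLE block points to its CELL blocks through CHILD relationships, and each CELL points to the WORD blocks it contains. The function name `extract_tables` is our own, and merged cells are ignored for simplicity.

```python
def extract_tables(blocks):
    """Rebuild every TABLE block as a list of rows, each a list of cell strings."""
    by_id = {block["Id"]: block for block in blocks}

    def children(block):
        ids = [
            child_id
            for relationship in block.get("Relationships", [])
            if relationship["Type"] == "CHILD"
            for child_id in relationship["Ids"]
        ]
        return [by_id[child_id] for child_id in ids if child_id in by_id]

    def cell_text(cell):
        return " ".join(
            child["Text"] for child in children(cell) if child["BlockType"] == "WORD"
        )

    tables = []
    for table in blocks:
        if table["BlockType"] != "TABLE":
            continue
        grid = {}  # row index -> {column index -> cell text}
        for cell in children(table):
            if cell["BlockType"] == "CELL":
                grid.setdefault(cell["RowIndex"], {})[cell["ColumnIndex"]] = cell_text(cell)
        # Sort by row, then by column, so cells come out in reading order.
        tables.append(
            [[grid[row][col] for col in sorted(grid[row])] for row in sorted(grid)]
        )
    return tables
```

Each table comes back as a list of rows, so the first row of the first table is `extract_tables(blocks)[0][0]`.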

Textract Feature Type – Forms

The "FORMS" Feature Type is used when we want to extract values from a form within the PDF file. The values are returned as "KEY_VALUE_SET" blocks, where the key is the label of the form field, and the value is the corresponding data for the given key.

Consider the following form in the PDF file:

All the fields are empty since this is an example, but it is interesting to observe that this feature type can identify fields and checkboxes.

After running the analysis using the feature type "FORMS," we can get the following results:

The code block below shows how to perform this analysis and see Textract collect the values of the forms:
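As a sketch of that step, the helper below walks the blocks returned by a "FORMS" analysis and pairs each key with its value. It follows the relationships Textract uses for forms: a KEY block links to its VALUE block, and both link to the WORD or SELECTION_ELEMENT blocks that hold the content. The name `extract_form_fields` is our own, and repeated field labels simply overwrite earlier ones in this simplified version.

```python
def extract_form_fields(blocks):
    """Map each form key (field label) to the text or checkbox state of its value."""
    by_id = {block["Id"]: block for block in blocks}

    def related(block, relationship_type):
        ids = [
            related_id
            for relationship in block.get("Relationships", [])
            if relationship["Type"] == relationship_type
            for related_id in relationship["Ids"]
        ]
        return [by_id[related_id] for related_id in ids if related_id in by_id]

    def text_of(block):
        parts = []
        for child in related(block, "CHILD"):
            if child["BlockType"] == "WORD":
                parts.append(child["Text"])
            elif child["BlockType"] == "SELECTION_ELEMENT":
                # Checkboxes report SELECTED or NOT_SELECTED instead of text.
                parts.append(child["SelectionStatus"])
        return " ".join(parts)

    fields = {}
    for block in blocks:
        is_key = block["BlockType"] == "KEY_VALUE_SET" and "KEY" in block.get("EntityTypes", [])
        if is_key:
            values = [text_of(value) for value in related(block, "VALUE")]
            fields[text_of(block)] = " ".join(values)
    return fields
```

For an empty form like the one above, most values come back as empty strings, while checkboxes come back as NOT_SELECTED.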

Textract Feature Type – Query

This feature type is one of the most interesting: it lets the user pose "questions" that Textract interprets. Depending on what was asked, and regardless of where the answer sits in the document, Textract returns the matching value.

Given the following header of the sample PDF, there is some interesting information that can be queried, such as the revision date and the form ID.

Based on this header, we can ask questions such as "What is the Rev.?" and "What is this Form?" Impressively, Textract will understand these questions and return their values. Check out the execution output:

If you want to try this execution on your side, here is an example to achieve the "QUERIES" analysis. Feel free to play around with different queries:
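A hedged sketch of the two pieces involved follows: building the query configuration that is sent along with the "QUERIES" feature type, and matching each QUERY block in the response to its QUERY_RESULT answer. The helper names and the aliases in the usage note are our own.

```python
def build_queries_config(questions):
    """Turn an {alias: question} mapping into Textract's QueriesConfig structure."""
    return {
        "Queries": [
            {"Text": question, "Alias": alias} for alias, question in questions.items()
        ]
    }


def extract_query_answers(blocks):
    """Map each question to its first answer (or None) from a QUERIES analysis."""
    by_id = {block["Id"]: block for block in blocks}
    answers = {}
    for block in blocks:
        if block["BlockType"] != "QUERY":
            continue
        # A QUERY block links to its QUERY_RESULT blocks via ANSWER relationships.
        answer_ids = [
            answer_id
            for relationship in block.get("Relationships", [])
            if relationship["Type"] == "ANSWER"
            for answer_id in relationship["Ids"]
        ]
        results = [by_id[answer_id] for answer_id in answer_ids if answer_id in by_id]
        answers[block["Query"]["Text"]] = results[0]["Text"] if results else None
    return answers
```

The configuration would be passed as the QueriesConfig parameter of "start_document_analysis" (or AnalyzeDocument) together with FeatureTypes=["QUERIES"], for example `build_queries_config({"REVISION": "What is the Rev.?", "FORM": "What is this Form?"})`.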

Amazon Augmented AI (A2I)

When it comes to sensitive information, or even documents that may not have scanned clearly and thus have text that could be difficult to read, it can be helpful to have a second glance for extra reassurance that we’re getting the highest accuracy possible. This is where Amazon Augmented AI (A2I) can come in handy.

For example, if Textract assigns a low confidence score to some of our data, this can trigger a call for human reviewers to double-check the associated text. Using A2I, we can add activation conditions that fire these triggers when the confidence for important form keys falls below a preset threshold.
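In code, attaching a human review workflow to a Textract call might look like the sketch below. It assumes a flow definition (the review workflow, including its activation conditions) has already been created in A2I; the flow definition ARN and loop name are values you would supply, and the function name and the shape of the returned dictionary are our own.

```python
def analyze_with_human_review(image_bytes, flow_definition_arn, loop_name, client):
    """Analyze a form and let A2I start a human review if its conditions are met."""
    response = client.analyze_document(
        Document={"Bytes": image_bytes},
        FeatureTypes=["FORMS"],
        HumanLoopConfig={
            # The loop name identifies this particular review request.
            "HumanLoopName": loop_name,
            "FlowDefinitionArn": flow_definition_arn,
        },
    )
    activation = response.get("HumanLoopActivationOutput", {})
    return {
        "blocks": response.get("Blocks", []),
        # Set only when a human review was actually started.
        "human_loop_arn": activation.get("HumanLoopArn"),
        "reasons": activation.get("HumanLoopActivationReasons", []),
    }
```

When `human_loop_arn` comes back empty, the activation conditions were not met and Textract's output can be used as-is.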

Seamlessly Integrated

On the other hand, if we know in advance that our data is sensitive, triggers can be preset to call for manual intervention, and A2I isn’t your only option.

If you’re already using AWS to meet some of your business needs, it’s important to recognize that since Textract is cloud-native, it can be seamlessly integrated into your current business AWS workflows. (No need to start from scratch!) For example, values with high confidence scores could be written to your database, while values with low confidence scores could be published to an SNS topic for your internal team to double-check.
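That routing idea can be sketched in a few lines. Everything here is illustrative: the field dictionaries (with `key`, `value`, and `confidence`) are a shape of our own choosing rather than a Textract structure, the database write is left as a callable you provide, and the threshold is whatever your use case tolerates.

```python
import json


def route_fields(fields, threshold, save_to_database, send_for_review):
    """Save high-confidence fields and flag the rest for a human to check."""
    saved, flagged = [], []
    for field in fields:
        if field["confidence"] >= threshold:
            save_to_database(field)
            saved.append(field["key"])
        else:
            send_for_review(field)
            flagged.append(field["key"])
    return {"saved": saved, "flagged": flagged}


def sns_review_sender(sns_client, topic_arn):
    """Build a send_for_review callable that publishes a field to an SNS topic."""

    def send(field):
        sns_client.publish(
            TopicArn=topic_arn,
            Subject="Textract field needs review",
            Message=json.dumps(field),
        )

    return send
```

Passing the two destinations in as callables keeps the routing rule independent of where the data ends up, so swapping the SNS topic for another review channel would not change `route_fields`.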

Let’s Use Textract!

When you’re ready to build a custom Textract solution to reduce manual data entry, improve data quality, and speed up critical workflows, Caylent can help! Our knowledgeable and experienced development teams can optimize custom builds that utilize Textract alongside other cloud-native services in ways that bring the greatest benefit and value to meet your company’s specific needs. We’re excited to see eye-to-eye on the many ways Textract can work for you.

Dianna Cameron


Dianna is a Cloud Software Engineer whose professional journey weaves together a uniquely creative background with an itch for learning and exploring ways that technology can benefit broader communities. She supports Caylent’s initiatives on the Cloud Native Applications team for a wide range of client projects as both a frontend and backend developer, and at the root of it all, Dianna’s passion for tech lies in fostering human connections. She thrives in collaborative environments and is grateful each day to be crafting innovative solutions to complex problems alongside an amazingly talented team.

Vinícius Defeo


Vinícius Defeo is a Software Architect with almost 20 years of experience in diverse technologies, including desktop development, mobile development, back-end development, and software architecture. Vinicius is passionate about technology and does not limit himself to one of them, comfortably navigating through different stacks and paradigms such as C#, Python, JS, event-driven, scalable, and fault-tolerant environments. He worked for many companies and big players in different industries, helping them successfully achieve their business goals, always keeping security, performance, and cost optimization in mind. He enjoys learning and helping his teammates to succeed as well.

