If you’ve used applications that extract text from a scanned image or document, then you’re likely familiar with optical character recognition (OCR). This functionality has been around for many years, and the use cases are clear: online data capture, automated text validation, and more efficient workflows, to name a few. Amazon Textract has taken this idea and turned it into a cloud-native, data-driven machine learning service that plugs into the rest of the AWS ecosystem.
What is Amazon Textract?
Textract offers a sense of near-immediate satisfaction: it quickly extracts text and structured data, whether typed or handwritten, from scanned documents, then organizes that data and returns it as structured output you can feed into your own applications.
However, it’s not your typical OCR. Amazon Textract is a machine learning service; it identifies the format of the data, understands the structure, and organizes it in a way that is easy to consume. AWS continually trains the underlying models on new data, so the service improves over time.
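To make "understands the structure" a little more concrete, here is a minimal Python sketch of how form fields might be read back as key-value pairs. It assumes the boto3 SDK is installed and AWS credentials are configured; the function names (`text_of`, `form_fields`, `analyze_form`) and the default region are illustrative choices, not part of Textract itself. The parsing relies on the `KEY_VALUE_SET` blocks that Textract returns when the `FORMS` feature is requested.

```python
def text_of(block, blocks_by_id):
    """Join the WORD children of a block into a single string."""
    words = []
    for rel in block.get("Relationships", []):
        if rel["Type"] == "CHILD":
            for child_id in rel["Ids"]:
                child = blocks_by_id[child_id]
                if child["BlockType"] == "WORD":
                    words.append(child["Text"])
    return " ".join(words)


def form_fields(response):
    """Map each detected form key to its value text."""
    blocks_by_id = {b["Id"]: b for b in response.get("Blocks", [])}
    fields = {}
    for block in blocks_by_id.values():
        is_key = (
            block["BlockType"] == "KEY_VALUE_SET"
            and "KEY" in block.get("EntityTypes", [])
        )
        if not is_key:
            continue
        value_text = ""
        for rel in block.get("Relationships", []):
            if rel["Type"] == "VALUE":
                # A KEY block points at its VALUE block(s) by Id.
                value_text = " ".join(
                    text_of(blocks_by_id[value_id], blocks_by_id)
                    for value_id in rel["Ids"]
                )
        fields[text_of(block, blocks_by_id)] = value_text
    return fields


def analyze_form(image_bytes, region_name="us-east-1"):
    """Send one document image to Textract and return its form fields."""
    import boto3  # imported here so the parsing helpers work without the SDK

    client = boto3.client("textract", region_name=region_name)
    response = client.analyze_document(
        Document={"Bytes": image_bytes},
        FeatureTypes=["FORMS"],
    )
    return form_fields(response)
```

Keeping the parsing separate from the API call means the helpers can be exercised against a saved response without touching AWS.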
If you’re dealing with particularly sensitive information, you can also use additional features, such as Amazon Augmented AI, to add a layer of human review to your data. We’ll touch on this later in the article.
Why Use Textract?
Simple: Textract quickly extracts written or typed words from documents and transforms them into usable data formats. It does this efficiently and accurately, which saves time, energy, and money.
Imagine being presented with a stack of invoices and having to manually enter that data into a spreadsheet to keep track of business expenses. Or maybe you’re a healthcare worker managing files for new patients – all written by hand with differing levels of legibility – and you have to manually input that form data into the hospital database.
These are just two scenarios in which Textract can automate data extraction and organization. The key takeaway is the speed and high accuracy that Textract offers for both printed and handwritten data. In both of the real-life examples above, the risk of human error alone could be costly. What would take the business owner or healthcare worker hours every day to complete, Textract can do in a matter of minutes or less, freeing up workers to focus on other tasks.
The same technology can help streamline your organization’s data management. As long as the data you’re working with is text or images containing text (letters, numbers, and symbols that fall within Textract’s current capabilities), Textract can see it!
How Does It Really Work?
Alright, here’s the deep dive. Currently, Textract detects text made up of standard English alphabet characters and ASCII symbols, and it can process signatures, forms, and tables in documents written in English, German, French, Spanish, Italian, and Portuguese. Files can be sent to Textract for processing as PNG, JPG, TIFF, or PDF files.
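As a rough illustration of what sending a file looks like, the Python sketch below submits a single local file to Textract and keeps only the detected lines of text. It assumes boto3 is installed and AWS credentials are configured; the helper names (`is_supported`, `extract_lines`, `detect_lines`), the extension list, and the default region are assumptions made for this example rather than anything Textract prescribes.

```python
from pathlib import Path

# Extensions corresponding to the file formats named above (an assumed mapping).
SUPPORTED_SUFFIXES = {".png", ".jpg", ".jpeg", ".tif", ".tiff", ".pdf"}


def is_supported(path):
    """Return True if the file extension matches one of the formats above."""
    return Path(path).suffix.lower() in SUPPORTED_SUFFIXES


def extract_lines(response, min_confidence=0.0):
    """Collect the text of each LINE block from a Textract response dict.

    Textract reports confidence on a 0-100 scale; lines scoring below
    min_confidence are dropped.
    """
    return [
        block["Text"]
        for block in response.get("Blocks", [])
        if block.get("BlockType") == "LINE"
        and block.get("Confidence", 0.0) >= min_confidence
    ]


def detect_lines(path, region_name="us-east-1"):
    """Send one local file to Textract and return its detected lines."""
    import boto3  # imported here so the helpers above work without the SDK

    if not is_supported(path):
        raise ValueError(f"Unsupported file type: {path}")
    client = boto3.client("textract", region_name=region_name)
    response = client.detect_document_text(
        Document={"Bytes": Path(path).read_bytes()}
    )
    return extract_lines(response)
```

A call such as `detect_lines("invoice.png")` (a hypothetical file name) would return a list of strings, one per detected line. Raising `min_confidence` is one simple way to flag low-confidence handwriting for a second look.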
While there’s a lot of “vision” talk going on here, when it comes to software, there’s always more going on than meets the eye. Textract shines brightest when partnered with additional cloud-native services, such as Amazon S3, Amazon Simple Notification Service (SNS), and AWS Lambda, to create fully functional solutions. These services can work together to automatically batch process large numbers of documents for optimal efficiency. A simple pipeline could look something like this: