AWS launches Textract, machine learning for text and data extraction

Need to extract content from a document quickly and automatically? You're in luck if you're an Amazon Web Services (AWS) customer. Amazon today announced the general availability of Textract, a cloud-hosted and fully managed service that uses machine learning to parse data tables, forms, and whole pages for text and data.

It's available today in AWS' US East (Ohio), US East (N. Virginia), US West (Oregon), and EU (Ireland) regions and will expand to additional regions in the coming year.

Textract is more capable than your average optical character recognition system. From files stored in an Amazon S3 bucket, it's able to suss out the contents of fields and tables and the context in which this information is presented, like names and social security numbers in tax forms or totals from photographed receipts. As Amazon notes in a press release, Textract accepts scans, PDFs, and photos as input, and it ingests a range of document types, including those specific to financial services, insurance, and health care.
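For readers who want to see what reading a file from S3 looks like in practice, here is a minimal Python sketch using boto3's Textract client and its synchronous `analyze_document` call. The bucket name, object key, and the helper names `make_client`, `analyze_s3_document`, and `count_block_types` are illustrative assumptions, not anything from Amazon's announcement. Multi-page PDFs go through the asynchronous `start_document_analysis` operation instead, which is not shown here.

```python
def make_client(region="us-east-1"):
    """Create a Textract client. Requires boto3 and configured AWS credentials."""
    import boto3  # imported lazily so the helpers below work without boto3 installed
    return boto3.client("textract", region_name=region)


def analyze_s3_document(client, bucket, key, features=("TABLES", "FORMS")):
    """Run synchronous analysis on a single document stored in S3.

    `bucket` and `key` are placeholders for your own S3 location.
    """
    return client.analyze_document(
        Document={"S3Object": {"Bucket": bucket, "Name": key}},
        FeatureTypes=list(features),
    )


def count_block_types(response):
    """Tally the block types (PAGE, LINE, WORD, TABLE, ...) in a response."""
    counts = {}
    for block in response.get("Blocks", []):
        block_type = block["BlockType"]
        counts[block_type] = counts.get(block_type, 0) + 1
    return counts


# Hypothetical usage (bucket and key are made up):
# response = analyze_s3_document(make_client(), "my-example-bucket", "receipts/photo.jpg")
# print(count_block_types(response))
```

Passing the client in as an argument keeps the call easy to exercise with a stub, without touching AWS.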

Textract returns results through an API as JSON annotated with the page number, section, form labels, and data types. It optionally integrates with database and analytics services like Amazon Elasticsearch Service, Amazon DynamoDB, and Amazon Athena, and with machine learning products like Amazon Comprehend, Amazon Comprehend Medical, Amazon Translate, and Amazon SageMaker for post-processing. Alternatively, extracted data can be fed directly into third-party cloud environments for use in accounting, auditing, and compliance software or to build smart searches on document archives.
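To give a feel for that JSON, the sketch below pulls form fields out of an already-parsed Textract response. It assumes the block layout documented for form analysis: `KEY_VALUE_SET` blocks tagged `KEY` or `VALUE`, linked to each other and to `WORD` blocks through `Relationships`. The function names are hypothetical, and the sample response is hand-written for illustration rather than real service output.

```python
def block_text(block, blocks_by_id):
    """Join the text of the WORD blocks that are children of `block`."""
    words = []
    for rel in block.get("Relationships", []):
        if rel["Type"] != "CHILD":
            continue
        for child_id in rel["Ids"]:
            child = blocks_by_id[child_id]
            if child["BlockType"] == "WORD":
                words.append(child["Text"])
    return " ".join(words)


def extract_form_fields(response):
    """Map each detected form label to the text of its value."""
    blocks_by_id = {b["Id"]: b for b in response["Blocks"]}
    fields = {}
    for block in response["Blocks"]:
        is_key = (block["BlockType"] == "KEY_VALUE_SET"
                  and "KEY" in block.get("EntityTypes", []))
        if not is_key:
            continue
        label = block_text(block, blocks_by_id)
        values = []
        for rel in block.get("Relationships", []):
            if rel["Type"] == "VALUE":
                for value_id in rel["Ids"]:
                    values.append(block_text(blocks_by_id[value_id], blocks_by_id))
        fields[label] = " ".join(v for v in values if v)
    return fields


# Hand-written sample in the shape of a Textract response (not real output).
sample = {"Blocks": [
    {"Id": "w1", "BlockType": "WORD", "Text": "Name:"},
    {"Id": "w2", "BlockType": "WORD", "Text": "Jane"},
    {"Id": "w3", "BlockType": "WORD", "Text": "Doe"},
    {"Id": "k1", "BlockType": "KEY_VALUE_SET", "EntityTypes": ["KEY"],
     "Relationships": [{"Type": "VALUE", "Ids": ["v1"]},
                       {"Type": "CHILD", "Ids": ["w1"]}]},
    {"Id": "v1", "BlockType": "KEY_VALUE_SET", "EntityTypes": ["VALUE"],
     "Relationships": [{"Type": "CHILD", "Ids": ["w2", "w3"]}]},
]}

print(extract_form_fields(sample))  # {'Name:': 'Jane Doe'}
```

A dictionary like this is the kind of structure that could then be written to a database or passed to a downstream service for post-processing.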

Textract can "accurately" process millions of document pages in "just a few hours," Amazon says.

A slew of AWS customers are already using Textract, including the Globe and Mail, the U.K.'s national weather service, PricewaterhouseCoopers, nonprofit managed care organization Healthfirst, and robotic process automation companies UiPath, Ripcord, and Blue Prism. Candor, a startup that aims to bring transparency to the mortgage industry, taps Textract to read documents such as bank statements, pay stubs, and tax documents to expedite underwriting, while financial tech firm Informed uses it to extract text from pay stubs, bank statements, tax returns, and tens of thousands of other documents on behalf of financial institutions.

"The power of Amazon Textract is that it accurately extracts text and structured data from virtually any document with no machine learning experience required," said Amazon Machine Learning VP Swami Sivasubramanian. "In addition to the integration with other AWS services, the rich partner community developing around Amazon Textract makes it possible for customers to gain real meaning from their file collections, operate more efficiently, improve security compliance, automate data entry, and facilitate faster business decisions."
