How to Extract Structured Data from Documents - Complete Tutorial
In this project, we learn how to build AI methods for information extraction from documents that comprehensively beat traditional OCR.

Around 18 billion invoices are issued each year in the USA and Europe alone. Form-like documents such as invoices, purchase orders, tax forms and insurance quotes are common in everyday business, but current techniques for processing them still require a large amount of manual effort and time, or rely on OCR-based heuristics for extraction. Although OCR has been fairly successful at digitizing machine-printed text, it has significant limitations when dealing with form-like data.

Using AI to process form-like data is a challenging task, since it involves both Computer Vision and NLP. In addition, the data entered in forms need not be natural language, so the NLP algorithms have to be trained to deal with unknown words. In this article, we look at the challenges involved in dealing with dynamic data and at how various AI techniques can be used to attack the problem, along with corresponding code references.
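To make the unknown-words point concrete, here is a minimal sketch of one common workaround: replacing out-of-vocabulary tokens with a coarse "word shape" so that a model can still generalize over invoice numbers, amounts and similar fields. This is an illustration under stated assumptions, not the method used in the original post; the function names `word_shape` and `normalize` and the toy vocabulary are hypothetical.

```python
def word_shape(token: str) -> str:
    """Collapse a token into a coarse shape.

    Uppercase letters become 'A', lowercase letters 'a', digits 'd';
    other characters are kept. Consecutive repeats are merged.
    """
    shape = []
    for ch in token:
        if ch.isupper():
            symbol = "A"
        elif ch.islower():
            symbol = "a"
        elif ch.isdigit():
            symbol = "d"
        else:
            symbol = ch
        # Merge runs so that "2041" and "77" share the same shape "d".
        if not shape or shape[-1] != symbol:
            shape.append(symbol)
    return "".join(shape)


def normalize(tokens, vocab):
    """Keep known words (lowercased); map unknown ones to a shape placeholder."""
    return [
        t.lower() if t.lower() in vocab else f"<{word_shape(t)}>"
        for t in tokens
    ]


# Hypothetical toy vocabulary and OCR tokens, for illustration only.
vocab = {"total", "invoice", "date"}
tokens = ["Invoice", "INV-2041", "Total", "$1,250.00"]
print(normalize(tokens, vocab))
```

With this kind of normalization, two invoice numbers that never appeared in training data map to the same placeholder, which is one simple way to give a downstream model something to learn from when the text is not natural language.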


Co-Founder @ Vadoo backed by JioGenNext, Ef | Ex-Samsung | IIT Delhi | Author at Paperspace