PDF to Excel using advanced Python NLP and Computer Vision, AKA Document AI
PDF Types & Structures:
Today most documents come in the form of a PDF, either a scanned PDF or a text-based (readable) PDF. These PDFs can be divided into structured / semi-structured / unstructured data. To feed the PDF data into a database in some structured format, these documents first need to be extracted, and there are several ways to extract the data depending on the type of PDF:
- Scanned PDF — (structured / semi-structured / unstructured data / text in the wild)
- Readable PDF — (structured / semi-structured / unstructured data)
- The advantage of readable PDFs over scanned documents is that the text to be extracted is already in digital format, hence there is no task whatsoever of finding the text in an image and recognizing it.
- The disadvantage of readable PDFs is that they sometimes come in different file formats or encodings; the best-practice workaround is to convert the PDF into an image file, then use any recent OCR software (which reconstructs tables automatically from the picture) to get the data, as sketched below.
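A minimal sketch of that workaround using the pdf2image library (which wraps poppler) together with pytesseract; the file name is a placeholder and the DPI is a typical, not tuned, value:
from pdf2image import convert_from_path
import pytesseract

# Render each PDF page as an image (assumes poppler is installed locally)
pages = convert_from_path('invoice.pdf', dpi=300)

# Run OCR on every rendered page and collect the text
for i, page in enumerate(pages):
    text = pytesseract.image_to_string(page)
    print(f"--- page {i + 1} ---")
    print(text)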
Scanned PDF: These documents are challenging as there are a number of hidden tasks attached to them:
- Finding the text and getting its relative coordinates with respect to the scanned page
- Recognizing the text (printed / handwritten)
To make things easy there are tools that do this end to end; although none have achieved 100% accuracy on all the tasks, they have divided documents into various categories such as Invoice, ID Card, Purchase Order, Income Proof, Tax Form and Mortgage Form.
1. Structured PDF — If it is tabular data we can use the camelot, tabula or pdftotext libraries to directly convert the data into a dataframe. To extract tabular content we can either identify a table through its vertical and horizontal lines, or let the libraries mentioned above identify the table for us. Another convenient hack for readable PDFs is to get the table columns, header and footer and decipher each value on the basis of its coordinates with respect to the column area.
import camelot
import PyPDF2 as pyPdf
from tabula import read_pdf
import matplotlib.pyplot as plt

# Extract tables with camelot and visualize the detected text and grid
tables = camelot.read_pdf('invoice.pdf', flavor='stream')
camelot.plot(tables[0], kind='text')
plt.show()
camelot.plot(tables[0], kind='grid')
plt.show()

# Extract tables page by page with tabula
reader = pyPdf.PdfFileReader(open(r"C:\Users\riley\Desktop\Bank Statements\50340.pdf", mode='rb'))
n = reader.getNumPages()
df = []
for page in [str(i + 1) for i in range(n)]:
    if page == "1":
        # restrict the first page to the table area (top, left, bottom, right in points)
        df.append(read_pdf(r"C:\Users\riley\Desktop\Bank Statements\50340.pdf", area=(530, 12.75, 790.5, 561), pages=page))
    else:
        df.append(read_pdf(r"C:\Users\riley\Desktop\Bank Statements\50340.pdf", pages=page))
2. Structured PDF & semi-structured PDF — If it is a text PDF then PDFMiner, PyPDF2, PDFQuery or xpdf-python converts the data into textual format. As the text is structured, we can use search-based information extraction, i.e. keywords surrounding the entity (search for a keyword to the left and a keyword to the right of it, or diagonal to it, or above and below the text to be extracted). While this works well for structured data, in semi-structured data the randomness can be handled in order of increasing complexity: Using Regex (check Image Hacks for the code) for simple data such as phone numbers, pincodes, email IDs, company initials, etc.
Using deep learning algorithms — the most popularly used libraries are spaCy and transformers; pattern recognition, either by algorithm or with a trained CNN1D model, can also be used. Relation Extraction — extracting the relationship between an entity and the nodes within it. Named Entity Recognition (NER) — the task of tagging entities in text with their corresponding type, typically using BIO notation: (B) the beginning of the entity, (I) inside/continuation of the entity and (O) outside of the entity. A transformers-based snippet follows, with a short spaCy sketch after it.
import torch
from transformers import BertTokenizer, BertForTokenClassification

# Load tokenizer and model from a pretrained checkpoint
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForTokenClassification.from_pretrained('bert-base-uncased')

# Tokenize a sample sentence and build dummy token-level labels
encoding = tokenizer("Invoice issued by Acme Corp", return_tensors='pt')
input_ids = encoding['input_ids']
labels = torch.zeros_like(input_ids)  # one label id per token

# In pytorch-pretrained-bert the forward call used to return the loss directly:
# loss = model(input_ids, labels=labels)
# In transformers the loss is the first element of the output tuple:
outputs = model(input_ids, labels=labels)
loss = outputs[0]
# The logits are also available:
loss, logits = outputs[:2]
# And even the attention weights if you configure the model to output them
# (and other outputs too, see the docstrings and documentation):
model = BertForTokenClassification.from_pretrained('bert-base-uncased', output_attentions=True)
outputs = model(input_ids, labels=labels)
loss, logits, attentions = outputs[:3]
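For comparison, a minimal spaCy NER sketch, assuming the small English model en_core_web_sm is installed; the input string is just a placeholder standing in for text pulled out of a PDF:
import spacy

# Load a pretrained English pipeline (python -m spacy download en_core_web_sm)
nlp = spacy.load('en_core_web_sm')

# Placeholder text standing in for OCR / PDF-extracted output
doc = nlp("Invoice INV-1042 was issued by Acme Corp on 12 March 2020 for $1,250.")

# Each entity comes back with its text, label and character offsets
for ent in doc.ents:
    print(ent.text, ent.label_, ent.start_char, ent.end_char)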
3. Unstructured PDF (a thesis topic, to the best of my knowledge) — Can only be solved by a machine learning engine with knowledge gained on the specific area of interest (i.e. the document domain: invoices, research papers, financial papers, etc.) combined with all of the model strategies listed below.
• Reinforcement learning, where the environment for the model to learn in could be set up as an online test environment; a video of the test would help in engaging with tasks such as masking, question answering and match-the-following.
• Relation Extraction along with NER could help in predicting Entity Linking (Entity-aware Attention for Single Shot Visual Text Extraction)
• Information Retrieval — the task of ranking a list of entities or entity search results in response to a query or the attention focused upon it.
• Language Modelling (Dependency Parsing, Sentence Pair Modeling, Natural Language Inference) — Dependency parsing builds a grammatical representation of a sentence to extract relations between a "head" word and the associated words that modify that head. Sentence Pair Modeling pairs sentences through their embeddings, which roots back to their similarity. Natural language inference is the task where, for a given "premise", the model predicts whether a "hypothesis" is entailment, contradiction, or undetermined (i.e. true, false or neutral). A dependency-parsing sketch follows this list.
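As an illustration of dependency parsing, a minimal spaCy sketch (again assuming en_core_web_sm is installed); the sentence is a placeholder:
import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp("The vendor issued the invoice on Monday.")

# For each token, print its dependency relation and the head word it modifies
for token in doc:
    print(f"{token.text:<10} --{token.dep_}--> {token.head.text}")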
The General Pipeline (Theory):
1. Preprocessing: Most scanned documents are noisy and have artefacts, so for the OCR and information extraction pipeline to work well it is necessary to pre-process the document images. Common pre-processing methods include greyscaling, noise removal and thresholding (binarization):
• Grayscaling: converting a 3-channel image into a single channel, in simple terms a black-and-white pixel image.
• Noise removal: typically involves removing Poisson noise, speckle noise, salt-and-pepper noise or Gaussian noise.
• Thresholding: Most OCR engines work well on grayscale images. This can be achieved by thresholding, which is the assignment of pixel values in relation to a provided threshold value. Each pixel value is compared with the threshold: if the pixel value is smaller than the threshold it is set to 0 (a black pixel), otherwise it is set to a maximum value (generally 255, a white pixel). The most commonly used techniques are adaptive thresholding and simple thresholding; a short adaptive-thresholding sketch follows.
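A minimal OpenCV sketch of adaptive thresholding (the Image Hacks section below shows simple/Otsu thresholding); the file name is a placeholder and the block size and constant are typical, not tuned, values:
import cv2

# Read the scanned page and convert to a single grayscale channel
gray = cv2.cvtColor(cv2.imread('Document_image.jpg'), cv2.COLOR_BGR2GRAY)

# Adaptive thresholding: each pixel is compared against a threshold computed
# from its 11x11 neighbourhood (Gaussian-weighted mean minus the constant 2)
binary = cv2.adaptiveThreshold(gray, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
                               cv2.THRESH_BINARY, 11, 2)
cv2.imwrite('binarized.jpg', binary)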
2. Text Detection & Recognition (OCR): The next step in the pipeline is OCR. It is used to read text from images such as a scanned document or a picture, converting virtually any kind of image containing written text (typed, handwritten or printed) into machine-readable text data. OCR involves two steps — text detection and text recognition. Text detection ranges from basic OpenCV contour detection to deep learning models (CRAFT, EAST); text recognition then takes the clean cropped image containing just the text to be extracted, on which we can use any of the OCR tools or custom deep learning models. There are a number of approaches to OCR.
The conventional computer vision approach is to:
• Use filters to separate the characters from the background — applying filters, masking and blurring images helps classify characters better for text recognition.
- Apply contour detection to recognize the filtered characters — contour matching is mostly used as a patch of code in semi-structured documents. While finding contours we have to specify the masking ratio, and since the text in documents comes in varying fonts we have to write down all the possible masking ratios, so it requires a lot of effort to generalize across document types. Contours are usually used in template-specific problems where the document is static and there is not a lot of movement, and then an ROI is taken into consideration to extract the text.
- Use image classification to identify the characters — deep learning approaches generalize very well. The most popular approaches for text detection are EAST and CRAFT.
EAST (Efficient and Accurate Scene Text detector) is a scene text detector based on a U-Net; the model can predict words or text lines of arbitrary orientation in an image.
CRAFT (Character Region Awareness for Text Detection) is a scene text detection method that effectively detects text areas via the affinity between characters and by focusing on each individual character, exploiting both the given character-level annotations of synthetic images and the estimated character-level ground truths of real images acquired by the learned interim model. A sketch of running EAST through OpenCV follows.
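A minimal sketch of running the publicly released EAST model through OpenCV's DNN module, assuming a local copy of frozen_east_text_detection.pb; decoding the geometry map into rotated boxes and applying non-maximum suppression are left out for brevity:
import cv2

# Load the pre-trained EAST detector (assumed to be downloaded locally)
net = cv2.dnn.readNet('frozen_east_text_detection.pb')

image = cv2.imread('Document_image.jpg')
h, w = image.shape[:2]
# EAST expects input dimensions that are multiples of 32
new_w, new_h = (w // 32) * 32, (h // 32) * 32

blob = cv2.dnn.blobFromImage(image, 1.0, (new_w, new_h),
                             (123.68, 116.78, 103.94), swapRB=True, crop=False)
net.setInput(blob)
scores, geometry = net.forward(['feature_fusion/Conv_7/Sigmoid',
                                'feature_fusion/concat_3'])

# scores holds per-location text confidences, geometry the box offsets and angles;
# threshold the scores, decode geometry into boxes and run NMS to get final regions
print(scores.shape, geometry.shape)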
3. Information Extraction: The traditional approach to extracting information was to divide documents into specific templates and then write rules for each particular template type. This was exhaustive but very accurate. The newer techniques do not demolish the older approach but add a new pipeline on top, making the processing more generic and not template-oriented. Methods like object detection are used to detect different types of objects within a document; for example a research paper has formulas, tables, paragraphs, lists and figures. Before performing the OCR, the document image is cropped into different labeled images, making it easier to extract the relevant information (see the crop-and-OCR sketch below). Commonly used object detection algorithms are YOLO and Faster R-CNN.
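A minimal sketch of that crop-then-OCR step; the bounding boxes below are hypothetical stand-ins for the output of whatever detector (YOLO, Faster R-CNN, etc.) is used:
import cv2
import pytesseract

image = cv2.imread('Document_image.jpg')

# Hypothetical detector output: (label, x, y, width, height) per region
regions = [('header', 40, 30, 500, 80),
           ('table', 40, 150, 520, 300)]

extracted = {}
for label, x, y, w, h in regions:
    crop = image[y:y + h, x:x + w]                          # crop the labeled region
    extracted[label] = pytesseract.image_to_string(crop)    # OCR only that region

print(extracted)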
4. Data Dump: Once the information is extracted, the data can be stored in any format required. Usually it is either fed into a database or a spreadsheet, but converting the data into JSON is quite handy at times; a small sketch follows.
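A minimal sketch of dumping extracted fields to both Excel and JSON with pandas; the field names and values are placeholders:
import json
import pandas as pd

# Placeholder extraction result: one dict per processed document
records = [{'invoice_no': 'INV-1042', 'date': '12:03:2020', 'total': 1250.0}]

df = pd.DataFrame(records)
df.to_excel('extracted.xlsx', index=False)   # spreadsheet dump

with open('extracted.json', 'w') as f:       # JSON dump
    json.dump(records, f, indent=2)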
Image Hacks:
- Pre-processing: The general rules of thumb for pre-processing images are deskewing and cropping the images, increasing sharpness and contrast, adaptive thresholding, affine/perspective correction, Gaussian noise removal and resizing of the image; template matching comes in handy when identifying logos or watermarks.
import cv2
import numpy as np

img = cv2.imread('image.jpg')

# get grayscale image
def get_grayscale(image):
    return cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

# noise removal
def remove_noise(image):
    return cv2.medianBlur(image, 5)

# thresholding
def thresholding(image):
    return cv2.threshold(image, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)[1]

# dilation
def dilate(image):
    kernel = np.ones((5, 5), np.uint8)
    return cv2.dilate(image, kernel, iterations=1)

# erosion
def erode(image):
    kernel = np.ones((5, 5), np.uint8)
    return cv2.erode(image, kernel, iterations=1)

# opening - erosion followed by dilation
def opening(image):
    kernel = np.ones((5, 5), np.uint8)
    return cv2.morphologyEx(image, cv2.MORPH_OPEN, kernel)

# canny edge detection
def canny(image):
    return cv2.Canny(image, 100, 200)

# denoising (image should have 3 channels)
def denoising(image):
    b, g, r = cv2.split(image)
    # switch it to rgb
    rgb_img = cv2.merge([r, g, b])
    # Denoising
    dst = cv2.fastNlMeansDenoisingColored(image, None, 10, 10, 7, 21)
    # get b,g,r
    b, g, r = cv2.split(dst)
    return cv2.merge([r, g, b])

# skew correction
def deskew(image):
    coords = np.column_stack(np.where(image > 0))
    angle = cv2.minAreaRect(coords)[-1]
    if angle < -45:
        angle = -(90 + angle)
    else:
        angle = -angle
    (h, w) = image.shape[:2]
    center = (w // 2, h // 2)
    M = cv2.getRotationMatrix2D(center, angle, 1.0)
    rotated = cv2.warpAffine(image, M, (w, h), flags=cv2.INTER_CUBIC, borderMode=cv2.BORDER_REPLICATE)
    return rotated

# template matching
def match_template(image, template):
    return cv2.matchTemplate(image, template, cv2.TM_CCOEFF_NORMED)
When the text is too noisy and general pre-processing does not work, cropping the desired area and sending it to the model works better.
import cv2
import pytesseract

img = cv2.imread('image.jpg')

# Adding custom options
custom_config = r'--oem 3 --psm 6'
print(pytesseract.image_to_string(img, config=custom_config))
Some more analysis with Tesseract and OpenCV:
import cv2
import pytesseract
from pytesseract import Output
import numpy as np

image = cv2.imread('Document_image.jpg')
img = image.copy()
mask = np.zeros(image.shape, dtype=np.uint8)
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
thresh = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)[1]

# Filter for ROI using contour area and aspect ratio
contours = cv2.findContours(thresh, cv2.RETR_TREE, cv2.CHAIN_APPROX_SIMPLE)
contours = contours[0] if len(contours) == 2 else contours[1]
for c in contours:
    area = cv2.contourArea(c)
    peri = cv2.arcLength(c, True)
    approx = cv2.approxPolyDP(c, 0.05 * peri, True)
    x, y, w, h = cv2.boundingRect(approx)
    aspect_ratio = w / float(h)
    if area > 2000 and aspect_ratio > .5:
        mask[y:y+h, x:x+w] = image[y:y+h, x:x+w]

# Draw character-level boxes returned by Tesseract
h, w, c = img.shape
boxes = pytesseract.image_to_boxes(img)
for b in boxes.splitlines():
    b = b.split(' ')
    img = cv2.rectangle(img, (int(b[1]), h - int(b[2])), (int(b[3]), h - int(b[4])), (0, 255, 0), 2)

# Draw word-level boxes with a confidence filter
d = pytesseract.image_to_data(img, output_type=Output.DICT)
n_boxes = len(d['text'])
img2 = img.copy()
for i in range(n_boxes):
    if int(d['conf'][i]) > 60:
        (x, y, w, h) = (d['left'][i], d['top'][i], d['width'][i], d['height'][i])
        img2 = cv2.rectangle(img2, (x, y), (x + w, y + h), (0, 255, 0), 2)

cv2.imshow('img', img)
cv2.imshow('img2', img2)

# Perform OCR with pytesseract on the masked ROIs
data = pytesseract.image_to_string(mask, lang='eng', config='--psm 6')
print(data)
cv2.imshow('thresh', thresh)
cv2.imshow('mask', mask)
cv2.waitKey(0)
2. Post-processing: OCR tends to produce erroneous output when the quality of the input image is poor or when the input image has not been pre-processed correctly. Extracting the list of OCR errors and their corresponding correctly spelt words is done by making use of two similarity measures, namely similarity in meaning and similarity at the character level. Following this, a dictionary is used to distinguish the correctly spelt word from the erroneous word.
import pandas as pd

# values in dict: the custom dictionary of valid codes to correct against
df1 = pd.read_excel(r"C:\Users\Testing Custom Dictionaries_v2.xlsx")["Diagnosis Codes"].tolist()

# assumed confusion maps between look-alike characters (common OCR mistakes)
letter2number = {'O': '0', 'I': '1', 'l': '1', 'S': '5', 'B': '8'}
number2letter = {'0': 'O', '1': 'IL', '5': 'S', '8': 'B'}

def listfromdict(ocr_value):
    # second character must be a digit
    if not ocr_value[1].isdigit():
        ocr_value_list = list(ocr_value)
        ocr_value_list[1] = str(letter2number.get(ocr_value[1]))
        ocr_value = ''.join(ocr_value_list)
    # first character must be a letter
    valalphalist = []
    if not ocr_value[0].isalpha():
        ocr_value_list = list(ocr_value)
        alphalist = number2letter.get(ocr_value[0])  # candidate letters for this digit
        for i, let in enumerate(alphalist):
            ocr_value_list[0] = let
            valalphalist.append(''.join(ocr_value_list))
    else:
        ocr_value_list = list(ocr_value)
        ocr_value_list[0] = ocr_value[0].upper()
        ocr_value = ''.join(ocr_value_list)
        valalphalist.append(ocr_value)
    return valalphalist  # candidate corrected codes to check against the dictionary
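The character-level similarity mentioned above can also be approximated with Python's standard difflib, matching each OCR output against the custom dictionary; df1 here is the code list loaded above:
import difflib

# Pick the closest dictionary entries to a (possibly erroneous) OCR value;
# cutoff controls how similar a candidate must be to count as a match
def correct_with_dictionary(ocr_value, dictionary, cutoff=0.8):
    matches = difflib.get_close_matches(ocr_value, dictionary, n=3, cutoff=cutoff)
    return matches[0] if matches else ocr_value  # fall back to the raw OCR value

# Example: correct an OCR'd code against the custom dictionary loaded above
print(correct_with_dictionary('A1O.5', df1))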
Regex — we extract all information matching a known number pattern, for example a date:
import re

# regex pattern for a date, assuming the format DD:MM:YYYY
datePattern = r'(0[1-9]|[12][0-9]|3[01])[:](0[1-9]|1[012])[:](19|20)\d\d'
raw_text = "Invoice generated on 12:03:2020 for account 50340"  # sample OCR output
date = re.search(datePattern, raw_text).group()
print(date)
Useful tools/libraries:
Note: here is a general overview of the advancements and methodologies.
List of OCR tools/libraries (there are a lot of OCR solutions; these are among the best in the market):
- Tesseract-OCR
- OCR engine
List of Document AI tools that help with end-to-end extraction:
Deep learning Document AI models that try to achieve end-to-end modelling, or play a part in the Document AI pipeline:
1. EATEN (Entity-aware Attention for Single Shot Visual Text Extraction) — The model proposed in this paper has a series of entity-aware decoders. After feature extraction from the image, the features pass through an entity-aware attention network where each of the entity-aware decoders is accountable for predicting a certain set of predefined entities.
The model accuracy is evaluated on the basis of mean entity accuracy, which shows an improvement over existing methods. The accuracy increases further if pattern blurring is used to strengthen the attention, which gives it a huge advantage over traditional methods.
2. Object detection and image segmentation — Documents can be considered as live templates consisting of sparse sections, i.e. the sections in these live templates are mostly similar but are organized in different formats. For example, in the case of an invoice document, the various components are company logo, from address, to address, tables, disclaimer, GST, invoice details, etc. If we treat them as different objects we can use object detection algorithms to extract these components. Using object detection comes with its own challenges, as in the case of table extraction (sometimes graphs, formulas and images are part of a table), different font styles, different languages, template variation, etc.
For document parsing using object detection, the page is first segmented into text and non-text areas. Later these are segregated and looked at individually, extracting the different components from the ROI (such as tables and sections from the non-text parts), which is basically object detection followed by classification. Treating document images as regular images opens up object detection, segmentation and classification as new ways to tackle parsing, compared to traditional rule-based methods where generalization is difficult and a lot of dependent variables are involved.
At ICDAR (International Conference on Document Analysis and Recognition) a paper was proposed by Xiaohan Yi and his team, whose aim was to detect regions of interest in document images using CNNs. To detect objects in the image they used a rough proposal and a pruning strategy. In the rough-proposal stage, a kind of Breadth-First Search (BFS) is used to find all the eight-connected component areas (after filtering the binary image). All these components are replaced by their bounding rectangles once generated, which removes the irrelevant information from the images. In the second stage, columns in pages are detected and regions spanning multiple columns are filtered out using the pruning strategy. This paper was specifically designed to extract information from research papers; they used a Spatial Pyramid Pooling (SPP) layer on top of a VGG-16 network, which creates fixed-size feature maps via fixed-scale down-sampling. The network was trained with SGD (Stochastic Gradient Descent).
• Segmentation: "Learning to Extract Semantic Structure from Documents Using Multimodal Fully Convolutional Neural Networks" is a collaboration between Adobe and The Pennsylvania State University. They proposed a multi-modal CNN for extracting semantic structures from PDFs. The model takes a PDF page image as input, splits it into different segments (regions of interest), recognizes the label of each region and outputs the segment labels.
The images are not annotated directly; instead there is a text embedding (a map of text similarity). The text embedding can differentiate between captions, paragraphs and lists. This kind of text separation creates a semantic structure analysis that mirrors the image semantics, where each region is categorized into semantically relevant classes such as tables, lists and section headings. The paper illustrates this with a page image as input and a segmented, labeled output.
Conclusion:
All in all, Document AI is a very broad and challenging topic and was quite a difficult one to tackle. I have tried my best to generalize and provide insights over this area of research, which will hopefully help you go further into the depths of Document AI. I hope you found this article helpful.
References:
(Note: all the papers, including their image descriptions, are linked below)
[1] Xinyu Zhou, Cong Yao, He Wen, Yuzhi Wang, Shuchang Zhou, Weiran He, and Jiajun Liang, EAST: An Efficient and Accurate Scene Text Detector (2017), arXiv (arxiv.org)
[2] Youngmin Baek, Bado Lee, Dongyoon Han, Sangdoo Yun, and Hwalsuk Lee, Character Region Awareness for Text Detection (2019), Clova AI Research, NAVER Corp, arXiv (arxiv.org)
[3] He Guo, Xiameng Qin, Jiaming Liu, Junyu Han, Jingtuo Liu and Errui Ding, EATEN: Entity-aware Attention for Single Shot Visual Text Extraction (2019), arXiv (arxiv.org)
[4] Xiao Yang, Ersin Yumer, Paul Asente, Mike Kraley, Daniel Kifer and C. Lee Giles, Learning to Extract Semantic Structure from Documents Using Multimodal Fully Convolutional Neural Networks, The Pennsylvania State University and Adobe Research
[5] Xiaohan Yi, Liangcai Gao, Yuan Liao, Xiaode Zhang, Runtao Liu and Zhuoren Jiang, CNN Based Page Object Detection in Document Images (2017), 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), Kyoto