PDF Extraction

Written 2023-12-02

A full pipeline that includes the techniques below and more: VikParuchuri/marker: Convert PDF to markdown quickly with high accuracy
Tesseract is one of the leading OCR applications/libraries
- Usually want page segmentation mode (PSM) 4 (single column of text of variable sizes) or 6 (single uniform block of text)
- Tesseract Page Segmentation Modes (PSMs) Explained: How to Improve Your OCR Accuracy - PyImageSearch
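
A minimal sketch of passing a PSM to Tesseract from Python, assuming the pytesseract wrapper and a local Tesseract install (neither is named in these notes). The helper names `tesseract_config` and `ocr_image` are hypothetical; `--psm` is the Tesseract command-line flag, and the 0-13 range assumes Tesseract 4 or later.

```python
def tesseract_config(psm: int = 6) -> str:
    """Build the config string that selects a page segmentation mode."""
    # Tesseract 4+ defines PSM values 0 through 13.
    if not 0 <= psm <= 13:
        raise ValueError(f"unsupported PSM: {psm}")
    return f"--psm {psm}"


def ocr_image(image, psm: int = 6, lang: str = "eng") -> str:
    """Run OCR on a PIL image or NumPy array with the chosen PSM."""
    # Imported lazily so the config helper works without pytesseract installed.
    import pytesseract

    return pytesseract.image_to_string(image, lang=lang, config=tesseract_config(psm))
```

For a single-column page, `ocr_image(page, psm=4)` would be the call to try first; for a cropped paragraph or table cell, `psm=6`.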

OCR often works best with some preprocessing, e.g. converting to grayscale and binarizing before running Tesseract

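import cv2

# Convert a BGR image (as loaded by cv2.imread) to grayscale
def get_grayscale(image):
    return cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

# Binarize a grayscale image; Otsu's method picks the threshold automatically
def thresholding(image):
    return cv2.threshold(image, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)[1]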

Deskew

Table extraction
- Multi-Column Table OCR - PyImageSearch
- Microsoft Table Transformer (TATR)
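
A simplified sketch of the multi-column idea: once OCR gives word boxes, group them into columns by x-coordinate and rows by y-coordinate. This uses plain gap-based grouping as a stand-in for a proper clustering step, and it is not the code from either resource above. The `(text, x_left, y_top)` box format, the function names, and the gap thresholds are all hypothetical.

```python
from typing import List, Sequence, Tuple

Box = Tuple[str, int, int]  # (text, x_left, y_top) in pixels; hypothetical format


def group_by_gap(values: Sequence[int], max_gap: int) -> List[List[int]]:
    """Cluster 1-D values: start a new cluster when the gap exceeds max_gap."""
    clusters: List[List[int]] = []
    for v in sorted(values):
        if clusters and v - clusters[-1][-1] <= max_gap:
            clusters[-1].append(v)
        else:
            clusters.append([v])
    return clusters


def boxes_to_table(boxes: Sequence[Box], col_gap: int = 40, row_gap: int = 10) -> List[List[str]]:
    """Arrange word boxes into a row-major grid of cell strings."""
    cols = group_by_gap([x for _, x, _ in boxes], col_gap)
    rows = group_by_gap([y for _, _, y in boxes], row_gap)

    def index_of(value: int, clusters: List[List[int]]) -> int:
        return next(i for i, c in enumerate(clusters) if value in c)

    table = [["" for _ in cols] for _ in rows]
    # Visit boxes top-to-bottom, left-to-right so words in a cell stay in order.
    for text, x, y in sorted(boxes, key=lambda b: (b[2], b[1])):
        r, c = index_of(y, rows), index_of(x, cols)
        table[r][c] = (table[r][c] + " " + text).strip()
    return table


boxes = [("Name", 10, 5), ("Qty", 200, 6), ("Apple", 12, 40),
         ("3", 205, 41), ("Pear", 11, 75), ("12", 203, 74)]
table = boxes_to_table(boxes)
# table == [["Name", "Qty"], ["Apple", "3"], ["Pear", "12"]]
```

Grouping on left edges only works when cells are left-aligned and hold short values; multi-word cells or right-aligned numbers would need full bounding boxes and a more careful clustering step.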