PDF Extraction
Written
- A full pipeline that includes the techniques below and more: VikParuchuri/marker (converts PDF to Markdown quickly with high accuracy)
- tesseract is one of the leading OCR applications/libraries
- Usually you want PSM 4 (single column of text of variable sizes) or PSM 6 (single uniform block of text)
- Tesseract Page Segmentation Modes (PSMs) Explained: How to Improve Your OCR Accuracy - PyImageSearch
- OCR often works best with some preprocessing
- Convert to grayscale
- Deskew
- Table extraction
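
To make the PSM note above concrete, here is a minimal sketch of calling the tesseract command-line tool with an explicit page segmentation mode. It assumes the `tesseract` executable is installed and on `PATH`. The helper names (`build_tesseract_command`, `ocr_image`) and the default language `eng` are illustrative choices, not part of these notes or of any library.

```python
import shutil
import subprocess

# The two PSM values mentioned in the notes above.
PSM_SINGLE_COLUMN = 4  # single column of text of variable sizes
PSM_SINGLE_BLOCK = 6   # single uniform block of text


def build_tesseract_command(image_path, psm=PSM_SINGLE_BLOCK, lang="eng"):
    """Return the argument list for a tesseract CLI call that prints text to stdout.

    Hypothetical helper: it only assembles arguments and does not run anything.
    """
    if not 0 <= psm <= 13:
        raise ValueError(f"psm must be between 0 and 13, got {psm}")
    # "stdout" as the output base tells tesseract to print the text
    # instead of writing it to a file.
    return ["tesseract", str(image_path), "stdout", "--psm", str(psm), "-l", lang]


def ocr_image(image_path, psm=PSM_SINGLE_BLOCK, lang="eng"):
    """Run tesseract on an (ideally preprocessed) image and return the text."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    result = subprocess.run(
        build_tesseract_command(image_path, psm, lang),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if tesseract exits non-zero
    )
    return result.stdout
```

For a single-column page, a call such as `ocr_image("page.png", psm=PSM_SINGLE_COLUMN)` would be a reasonable starting point. Trying both modes on a sample page is usually the quickest way to see which one suits a given layout.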