Tesseract is an open-source OCR (Optical Character Recognition) engine that extracts text from images. It supports over 100 languages, common image formats (PNG, JPEG, TIFF), and several output formats, including plain text, searchable PDF, and HTML (hOCR). The current major version, 5, includes both a neural-network (LSTM) engine that recognizes whole lines of text and a legacy engine that recognizes character patterns. Originally developed by HP and later maintained by Google, it is now community-maintained and provides both command-line tools and C/C++ APIs for developers.
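
As a rough illustration of the command-line side, the sketch below builds a `tesseract` invocation from Python. It assumes the `tesseract` executable is installed and on `PATH` and that the requested language data is available; the helper names `build_tesseract_command` and `run_ocr` and the file names are hypothetical, chosen only for this example. The `-l` and `--oem` flags and the `txt`, `pdf`, and `hocr` output configs are those of the standard CLI.

```python
import shutil
import subprocess

# Values for the CLI's --oem (OCR engine mode) flag.
OEM_LEGACY = 0  # legacy character-pattern engine (needs traineddata that includes it)
OEM_LSTM = 1    # neural-network LSTM engine


def build_tesseract_command(image_path, output_base, lang="eng",
                            oem=OEM_LSTM, formats=("txt",)):
    """Return the argument list for one tesseract CLI call (hypothetical helper).

    output_base is the output path without extension; tesseract appends
    one extension per requested output config (.txt, .pdf, .hocr).
    """
    return ["tesseract", image_path, output_base,
            "-l", lang, "--oem", str(oem), *formats]


def run_ocr(image_path, output_base, **options):
    """Run tesseract, assuming the executable is installed and on PATH."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    command = build_tesseract_command(image_path, output_base, **options)
    subprocess.run(command, check=True)


if __name__ == "__main__":
    # Print the command instead of running it, so no installation is needed.
    print(" ".join(build_tesseract_command(
        "scan.png", "scan", formats=("txt", "pdf", "hocr"))))
```

Calling `run_ocr("scan.png", "scan", formats=("txt", "pdf"))` would, under the assumptions above, write `scan.txt` and `scan.pdf` next to the working directory. Keeping command construction separate from execution makes the flags easy to inspect without Tesseract installed.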
