textract
Description:
Extract text from many different file types
Type: Formula  |  Tracked Since: Dec 28, 2025
Links: Homepage  |  formulae.brew.sh
Category: Developer tools
Tags: text-extraction document-processing python pdf ocr
Install: brew install textract
About:
Textract is a Python library for extracting text from a wide variety of file formats, including PDFs, Word documents, images, and spreadsheets. It provides a single interface to document content, so callers do not need to learn a separate API for each file type. The library selects the backend parser from the file's extension, which keeps document-processing code short and uniform across formats.
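A minimal sketch of the single-call interface described above, assuming the Python package `textract` is importable in the current environment (a Homebrew install may expose only a command-line tool). The helper name `extract_text` and its `extractor` parameter are hypothetical, added so the function can be exercised without the library installed; `textract.process` is the library's documented entry point and returns bytes.

```python
from typing import Callable, Optional


def extract_text(path: str, extractor: Optional[Callable[[str], bytes]] = None) -> str:
    """Return the text content of `path` as a str.

    `extractor` is a hypothetical hook for illustration; by default the
    call is forwarded to textract.process.
    """
    if extractor is None:
        # Imported lazily so this helper can be defined even when
        # textract is not installed.
        import textract

        extractor = textract.process
    raw = extractor(path)
    # textract.process returns bytes; decode them for downstream use.
    if isinstance(raw, bytes):
        return raw.decode("utf-8", errors="replace")
    return raw
```

With the library available, `extract_text("report.pdf")` and `extract_text("notes.docx")` would go through the same call, which is the point of the unified interface.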
Key Features:
  • Unified API for multiple file formats
  • Automatic parser selection based on file extension
  • Support for images using OCR (Tesseract)
  • Pure-Python parsing for several formats; others (such as OCR via Tesseract) rely on external tools
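The idea of one entry point that picks a parser automatically can be sketched as a small registry. This is an illustration of the pattern only, not Textract's own code: the names `PARSERS`, `register`, `parse_plain`, and `process` are hypothetical, and the sketch keys the lookup on the file extension for simplicity.

```python
import os
from typing import Callable, Dict, Optional

# Hypothetical registry mapping a lowercase extension to a parser function.
PARSERS: Dict[str, Callable[[str], str]] = {}


def register(*extensions: str):
    """Decorator that registers a parser for one or more extensions."""
    def decorator(func: Callable[[str], str]) -> Callable[[str], str]:
        for ext in extensions:
            PARSERS[ext.lower()] = func
        return func
    return decorator


@register(".txt", ".md")
def parse_plain(path: str) -> str:
    # Plain-text formats need no conversion step.
    with open(path, encoding="utf-8", errors="replace") as fh:
        return fh.read()


def process(path: str, extension: Optional[str] = None) -> str:
    """Single entry point: choose a parser, then delegate to it."""
    # An explicit extension overrides the one taken from the file name.
    ext = (extension or os.path.splitext(path)[1]).lower()
    try:
        parser = PARSERS[ext]
    except KeyError:
        raise ValueError(f"no parser registered for {ext!r}") from None
    return parser(path)
```

Supporting a new format then means registering one more function; callers keep using `process(path)` unchanged.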
Use Cases:
  • Building document indexing and search systems
  • Automating data extraction from reports and forms
  • Content migration and digitization workflows
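As a rough sketch of the indexing use case, the snippet below builds a tiny inverted index over extracted text. The functions `build_index` and `search` are hypothetical, and the `extract` argument stands in for whatever returns a document's text as a string (for example a thin wrapper that decodes the bytes returned by `textract.process`); tokenization here is deliberately naive.

```python
import re
from collections import defaultdict
from typing import Callable, Dict, Iterable, Set


def build_index(paths: Iterable[str], extract: Callable[[str], str]) -> Dict[str, Set[str]]:
    """Map each lowercase word to the set of paths whose text contains it."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for path in paths:
        text = extract(str(path))
        for token in re.findall(r"\w+", text.lower()):
            index[token].add(str(path))
    return dict(index)


def search(index: Dict[str, Set[str]], query: str) -> Set[str]:
    """Return the paths that contain every word of the query."""
    terms = re.findall(r"\w+", query.lower())
    if not terms:
        return set()
    # Intersect the posting sets of all query terms.
    return set.intersection(*(index.get(term, set()) for term in terms))
```

A real search system would add stemming, ranking, and persistent storage; the sketch only shows where text extraction fits into the pipeline.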
Alternatives:
  • Apache Tika – Java-based toolkit with broader format coverage; heavier, since it needs a JVM and is often run as a separate server process.
  • PyPDF2 – Python library focused solely on PDF manipulation; other formats require separate libraries.
Version History
Detected  |  Version  |  Rev  |  Change  |  Commit
Sep 14, 2024 10:44pm  |    |  0  |  VERSION_BUMP  |  be1c001d