textract

Description:
Extract text from many different types of files

Type: Formula | Tracked Since: Dec 28, 2025

Links: Homepage | formulae.brew.sh

Category: Developer tools

Tags: text-extraction document-processing python pdf ocr

Install: brew install textract

About:
Textract is a Python library and command-line tool for extracting text from a wide variety of file formats, including PDFs, Word documents, images, and spreadsheets. It provides a single interface to file content, so you do not need to learn a separate API for each file type. The tool selects a backend parser based on the file's extension, which lets one call handle a mixed set of documents.
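
The single-call interface described above can be sketched as follows. This is a minimal illustration, assuming the textract Python package is installed; `extract_text` is a hypothetical helper name, and the UTF-8 decoding assumes textract's default output encoding.

```python
def extract_text(path: str, encoding: str = "utf-8") -> str:
    """Hypothetical helper: extract text from `path` and return it as a str."""
    # Imported lazily so this module can be loaded even where textract is not installed.
    import textract

    # textract.process returns a byte string; textract itself picks the parser.
    raw = textract.process(path)
    return raw.decode(encoding)
```

Calling `extract_text("report.pdf")` and `extract_text("notes.docx")` would go through the same function; the file names here are placeholders.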

Key Features:

Unified API for multiple file formats
Automatic parser selection based on file extension
Support for images using OCR (Tesseract)
Pure-Python parsing for some formats; others rely on external tools such as pdftotext, antiword, or Tesseract
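
As a rough sketch of how these features surface in the Python API: textract's documentation describes keyword arguments to `textract.process` for choosing a backend (`method`), passing an OCR language (`language`), and overriding the detected extension (`extension`). The helper names below are hypothetical, and the OCR path assumes Tesseract is installed on the system.

```python
def ocr_text(path: str, language: str = "eng") -> str:
    """Hypothetical helper: force the Tesseract backend, e.g. for a scanned PDF."""
    import textract  # lazy import; requires textract plus a Tesseract install

    raw = textract.process(path, method="tesseract", language=language)
    return raw.decode("utf-8")


def extract_as(path: str, extension: str) -> str:
    """Hypothetical helper: parse a file whose name lacks a usable extension."""
    import textract

    # `extension` tells textract which parser to use instead of inferring it.
    raw = textract.process(path, extension=extension)
    return raw.decode("utf-8")
```

Keeping the overrides in small wrappers like these leaves the default call untouched for the common case.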

Use Cases:

Building document indexing and search systems
Automating data extraction from reports and forms
Content migration and digitization workflows
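
To make the indexing use case concrete, here is a minimal sketch of a word-to-files index built over a directory tree. It is not part of textract: `build_index` is a hypothetical function, and the `extract` argument stands in for whatever extractor you use (for example, a wrapper around `textract.process`).

```python
import re
from collections import defaultdict
from pathlib import Path
from typing import Callable, Dict, Set


def build_index(root: str, extract: Callable[[str], str]) -> Dict[str, Set[str]]:
    """Map each lowercase word to the set of file paths whose text contains it."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file():
            continue
        try:
            text = extract(str(path))
        except Exception:
            # Skip files the extractor cannot handle (unsupported type, missing tool).
            continue
        for word in re.findall(r"\w+", text.lower()):
            index[word].add(str(path))
    return index
```

Passing the extractor in as a parameter keeps the indexing logic independent of any one extraction library.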

Alternatives:

Apache Tika – Java-based content analysis toolkit; supports more formats but is heavier, requires a Java runtime, and is often run as a separate server process.
PyPDF2 – Python library focused solely on PDF manipulation (now maintained as pypdf); requires separate libraries for other formats.

Version History

Detected	Version	Rev	Change	Commit
Sep 14, 2024 10:44pm		0	VERSION_BUMP	be1c001d
Jan 18, 2024 7:07pm		0	VERSION_BUMP	cda0d0a9