Description: Discovery, extraction and processing for Web text
Type: Formula
Latest Version: 2.0.0@4
Tracked Since: Oct 11, 2025
Links: Homepage, formulae.brew.sh
Category: Developer tools
Tags: web-scraping, data-extraction, nlp, python, text-processing
Install: brew install trafilatura
About:
Trafilatura is a Python package and command-line tool for extracting the main content and metadata from web pages. It focuses on reliability and efficiency, using heuristics and structural analysis to find the text while filtering out boilerplate, ads, and navigation elements. It can produce structured output such as JSON, XML, and CSV, which suits large-scale web data collection and NLP pipelines.
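To make the idea of heuristic boilerplate filtering concrete, here is a toy sketch that uses only the Python standard library. It is not Trafilatura's algorithm: the link-density rule, the thresholds, the skipped-tag list, and every name in it (`BlockCollector`, `main_text`, `SAMPLE`) are assumptions chosen for illustration.

```python
from html.parser import HTMLParser

# Elements treated as boilerplate containers in this toy sketch (an assumption).
SKIP_TAGS = {"nav", "header", "footer", "aside"}


class BlockCollector(HTMLParser):
    """Collect <p> blocks and the share of each block's text that sits inside links."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0   # > 0 while inside a SKIP_TAGS element
        self.in_block = False
        self.in_link = False
        self.text = []
        self.link_chars = 0
        self.blocks = []      # (text, approximate link density) tuples

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        elif tag == "p" and self.skip_depth == 0:
            self.in_block, self.text, self.link_chars = True, [], 0
        elif tag == "a":
            self.in_link = True

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag == "a":
            self.in_link = False
        elif tag == "p" and self.in_block:
            text = " ".join("".join(self.text).split())
            if text:
                self.blocks.append((text, self.link_chars / len(text)))
            self.in_block = False

    def handle_data(self, data):
        if self.in_block:
            self.text.append(data)
            if self.in_link:
                self.link_chars += len(data.strip())


def main_text(html, max_link_density=0.5, min_chars=40):
    """Keep paragraphs that are long enough and not dominated by link text."""
    parser = BlockCollector()
    parser.feed(html)
    parser.close()
    return [t for t, d in parser.blocks if d <= max_link_density and len(t) >= min_chars]


SAMPLE = """
<html><body>
<nav><p><a href="/">Home</a> <a href="/about">About</a></p></nav>
<p><a href="/a">Related story one</a> | <a href="/b">Related story two and more</a></p>
<p>Trafilatura extracts the main text of a web page and discards navigation, ads and other boilerplate.</p>
<footer><p>Copyright notice and legal links that are long enough to pass the length check.</p></footer>
</body></html>
"""

print(main_text(SAMPLE))
```

A rule this simple is easy to fool, which is one reason to prefer a dedicated extractor over hand-rolled filters for real pages.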
Key Features:
- High-precision content extraction using heuristics
- Metadata discovery (title, author, date, tags)
- Multiple output formats (JSON, XML, CSV, Markdown)
- Fast and lightweight with minimal dependencies
- Built-in URL fetching and encoding detection
Use Cases:
- Building datasets for Large Language Model (LLM) training
- Web scraping and data mining for research or business intelligence
- Automating content aggregation and news monitoring
- Preprocessing web data for Natural Language Processing (NLP) tasks
Alternatives:
- newspaper3k: More focused on newspaper-style articles; trafilatura is generally faster and handles a wider variety of page structures.
- readability-lxml: A classic library for content extraction; trafilatura adds features such as metadata extraction and more thorough boilerplate removal out of the box.
| Detected | Version | Rev | Change | Commit |
| Jan 11, 2026 8:23am | | 4 | REVISION_ONLY | b2475967 |
| Oct 11, 2025 9:25pm | | 1 | VERSION_BUMP | c901ff4a |
| Nov 20, 2024 8:14am | | 1 | VERSION_BUMP | 1344a737 |
| Sep 12, 2024 4:09am | | 0 | VERSION_BUMP | ccfabbb9 |
| Sep 10, 2024 3:10pm | | 0 | VERSION_BUMP | ed211b14 |
| Aug 20, 2024 12:51pm | | 0 | VERSION_BUMP | 8cf1b9ae |
| Jul 30, 2024 3:53pm | | 0 | VERSION_BUMP | 17d65db4 |
| Nov 29, 2023 11:14am | | 0 | VERSION_BUMP | f8982028 |
| Nov 26, 2023 9:30am | | 2 | VERSION_BUMP | 947ea301 |
| Oct 19, 2023 2:36am | | 2 | VERSION_BUMP | 9e801bde |
| Oct 6, 2023 7:54pm | | 1 | VERSION_BUMP | 197a5ca1 |