trafilatura

Description:
Discovery, extraction and processing for Web text

Type: Formula | Latest Version: 2.0.0@4 | Tracked Since: Oct 11, 2025

Links: Homepage | formulae.brew.sh

Category: Developer tools

Tags: web-scraping data-extraction nlp python text-processing

Install: brew install trafilatura

About:
Trafilatura is a Python package and command-line tool for extracting the main content and metadata from web pages. It aims for reliability and efficiency, using heuristics and structural analysis to locate the main text while filtering out boilerplate, ads, and navigation elements. It can output JSON, XML, CSV, and Markdown, which suits large-scale web data collection and NLP pipelines.

Key Features:

High-precision content extraction using heuristics
Metadata discovery (title, author, date, tags)
Multiple output formats (JSON, XML, CSV, Markdown)
Fast and lightweight with minimal dependencies
Built-in URL fetching and encoding detection
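
As a minimal sketch of the extraction and metadata features listed above, the following assumes the Python package is importable and calls `trafilatura.extract` with `output_format="json"` and `with_metadata=True`. The helper names `extract_record` and `summarize` are hypothetical, and the JSON keys picked out in `summarize` are assumptions about the output, so they are read with `.get()`.

```python
import json
from typing import Optional


def extract_record(html: Optional[str], url: Optional[str] = None) -> Optional[dict]:
    """Return main text plus metadata for one HTML document, or None."""
    if not html:
        return None
    # Imported lazily so this module still loads where trafilatura is absent.
    import trafilatura

    raw = trafilatura.extract(
        html,
        url=url,
        output_format="json",  # other formats include "xml", "csv", "markdown"
        with_metadata=True,
    )
    # extract() returns None when no usable content is found.
    return json.loads(raw) if raw else None


def summarize(record: dict) -> dict:
    """Keep a few common fields; the key names are assumed, hence .get()."""
    return {key: record.get(key) for key in ("title", "author", "date", "text")}
```

For a live page, something like `extract_record(trafilatura.fetch_url(url), url=url)` would combine this with the built-in downloader. `fetch_url` returns `None` when the download fails, which `extract_record` passes through as `None`.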

Use Cases:

Building datasets for Large Language Model (LLM) training
Web scraping and data mining for research or business intelligence
Automating content aggregation and news monitoring
Preprocessing web data for Natural Language Processing (NLP) tasks
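
The dataset-building use case could look roughly like the sketch below, which turns a directory of saved HTML files into a JSON Lines file. The function name `build_jsonl`, the `*.html` layout, and the record fields are all hypothetical. The extractor defaults to `trafilatura.extract` but can be swapped out, which is an illustrative design choice rather than anything the package requires.

```python
import json
from pathlib import Path
from typing import Callable, Optional


def build_jsonl(
    html_dir: str,
    out_path: str,
    extractor: Optional[Callable[[str], Optional[str]]] = None,
) -> int:
    """Write one JSON line per HTML file with extractable text; return the count."""
    if extractor is None:
        # Imported lazily; assumes the trafilatura package is installed.
        import trafilatura

        extractor = trafilatura.extract

    written = 0
    with open(out_path, "w", encoding="utf-8") as out:
        for path in sorted(Path(html_dir).glob("*.html")):
            html = path.read_text(encoding="utf-8", errors="replace")
            text = extractor(html)
            if not text:
                continue  # skip pages where nothing was extracted
            record = {"file": path.name, "text": text}
            out.write(json.dumps(record, ensure_ascii=False) + "\n")
            written += 1
    return written
```

Calling `build_jsonl("pages/", "corpus.jsonl")` would then produce one line per page that yielded text; the directory and file names here are placeholders.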

Alternatives:

newspaper3k – More focused on newspaper-style articles; trafilatura is generally faster and handles a wider variety of page structures.
readability-lxml – A classic library for content extraction; trafilatura adds metadata extraction and stronger boilerplate removal out of the box.

Version History

Detected	Rev	Change	Commit
Jan 11, 2026 8:23am	4	REVISION_ONLY	b2475967
Oct 11, 2025 9:25pm	1	VERSION_BUMP	c901ff4a
Nov 20, 2024 8:14am	1	VERSION_BUMP	1344a737
Sep 12, 2024 4:09am	0	VERSION_BUMP	ccfabbb9
Sep 10, 2024 3:10pm	0	VERSION_BUMP	ed211b14
Aug 20, 2024 12:51pm	0	VERSION_BUMP	8cf1b9ae
Jul 30, 2024 3:53pm	0	VERSION_BUMP	17d65db4
Nov 29, 2023 11:14am	0	VERSION_BUMP	f8982028
Nov 26, 2023 9:30am	2	VERSION_BUMP	947ea301
Oct 19, 2023 2:36am	2	VERSION_BUMP	9e801bde
Oct 6, 2023 7:54pm	1	VERSION_BUMP	197a5ca1