trafilatura
Description:
Discovery, extraction and processing for Web text
Type: Formula  |  Latest Version: 2.0.0@4  |  Tracked Since: Oct 11, 2025
Links: Homepage  |  formulae.brew.sh
Category: Developer tools
Tags: web-scraping data-extraction nlp python text-processing
Install: brew install trafilatura
About:
Trafilatura is a Python package and command-line tool that extracts the main content and metadata from web pages. It uses heuristics and structural analysis to locate the main text while filtering out boilerplate such as ads and navigation elements. Output is available in structured formats such as JSON, XML, and CSV, which suits large-scale web data collection and NLP pipelines.
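As a minimal sketch of the extraction step described above, the helper below wraps `trafilatura.extract`, which returns the main text of an HTML string or `None` when nothing usable is found. The name `main_text` and the injectable `extractor` parameter are illustrative assumptions, not part of the package; the parameter only exists so the helper can be tried with a stand-in function.

```python
from typing import Callable, Optional


def _trafilatura_extract(html: str) -> Optional[str]:
    # Imported lazily: trafilatura is a third-party package that must be installed.
    import trafilatura

    return trafilatura.extract(html)


def main_text(
    html: str,
    extractor: Callable[[str], Optional[str]] = _trafilatura_extract,
) -> str:
    """Return the extracted main text, or an empty string if nothing was found."""
    text = extractor(html)
    return text.strip() if text else ""
```

With the package installed, `main_text(html)` uses the default extractor; passing a different callable swaps in another extraction backend without touching the calling code.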
Key Features:
  • High-precision content extraction using heuristics
  • Metadata discovery (title, author, date, tags)
  • Multiple output formats (JSON, XML, CSV, Markdown)
  • Fast and lightweight with minimal dependencies
  • Built-in URL fetching and encoding detection
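The metadata and output-format features can be combined in one call. The sketch below asks `trafilatura.extract` for JSON output with metadata and keeps a few fields. The function name `extract_record` and the choice of fields are assumptions made for illustration; fields are read with `.get()` because a page may not provide every one of them.

```python
import json
from typing import Any, Callable, Dict, Optional

# Hypothetical selection of fields to keep from the JSON output.
FIELDS = ("title", "author", "date", "text")


def extract_record(
    html: str,
    url: Optional[str] = None,
    extractor: Optional[Callable[..., Optional[str]]] = None,
) -> Optional[Dict[str, Any]]:
    """Extract selected fields from an HTML page, or None if extraction fails."""
    if extractor is None:
        # Imported lazily: trafilatura is a third-party package.
        from trafilatura import extract as extractor

    raw = extractor(html, url=url, output_format="json", with_metadata=True)
    if raw is None:
        return None

    data = json.loads(raw)
    # Missing fields become None instead of raising KeyError.
    return {key: data.get(key) for key in FIELDS}
```

Returning a plain dictionary keeps the result easy to store or serialize again, whatever other fields the JSON output carries.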
Use Cases:
  • Building datasets for Large Language Model (LLM) training
  • Web scraping and data mining for research or business intelligence
  • Automating content aggregation and news monitoring
  • Preprocessing web data for Natural Language Processing (NLP) tasks
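For the dataset-building use case, one possible shape is a generator that turns already-downloaded pages into JSON Lines records and skips pages where no main text is found. This is a sketch under assumptions: `to_jsonl`, the `(url, html)` input pairs, and the record layout are all hypothetical, and downloading is deliberately left out.

```python
import json
from typing import Callable, Iterable, Iterator, Optional, Tuple


def to_jsonl(
    pages: Iterable[Tuple[str, str]],
    extractor: Optional[Callable[..., Optional[str]]] = None,
) -> Iterator[str]:
    """Yield one JSON line per page that yields non-empty main text."""
    if extractor is None:
        # Imported lazily: trafilatura is a third-party package.
        from trafilatura import extract as extractor

    for url, html in pages:
        text = extractor(html, url=url)
        if not text:
            # Nothing extracted (None or empty string): leave the page out.
            continue
        yield json.dumps({"url": url, "text": text}, ensure_ascii=False)
```

Because the function is a generator, the lines can be written to a file one at a time, so a large crawl never has to be held in memory as a whole.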
Alternatives:
  • newspaper3k – Focused on newspaper-style articles; trafilatura is generally faster and handles a wider variety of page structures.
  • readability-lxml – A classic library for content extraction; trafilatura adds metadata extraction and stronger boilerplate removal out of the box.
Version History
Detected  |  Version  |  Rev  |  Change  |  Commit
Jan 11, 2026 8:23am  |    |  4  |  REVISION_ONLY  |  b2475967
Oct 11, 2025 9:25pm  |    |  1  |  VERSION_BUMP  |  c901ff4a
Nov 20, 2024 8:14am  |    |  1  |  VERSION_BUMP  |  1344a737