html2md - HTML to Markdown Converter
Convert HTML content to Markdown with intelligent content extraction and custom processing
A modern HTML to Markdown converter with HTML structure analysis, custom extractors, async I/O, and parallel processing capabilities.
Overview
The html2md tool (v3.4.0) provides a robust solution for converting HTML content to Markdown format, with fine-grained control over the conversion process. Built with Python 3.10+ and modern async architecture, it focuses on intelligent content extraction and conversion.
New in v3.4.0: Custom extractor plugin system for site-specific content extraction.
Note: Web scraping functionality has been moved to the separate scrape tool for better modularity. Use scrape to download websites, then html2md to convert the downloaded HTML files.
Key Features
- Custom Extractor System: Create site-specific extractors for optimal content extraction
- HTML Structure Analysis: Analyze HTML files to find optimal content selectors
- Intelligent Content Extraction: Use CSS selectors to extract specific content
- Async I/O: High-performance concurrent file processing
- API Mode: Programmatic access for integration with other tools
- Type Safety: Full type annotations throughout the codebase
- Modern Architecture: Clean modular design
- Workflow Integration: .scrapes directory structure for organized processing
- Claude AI Integration: AI-powered HTML analysis and conversion
- Recursive directory scanning for batch conversion
- Smart internal link handling (HTML → Markdown)
- Customizable element filtering and removal
- YAML frontmatter generation
- Heading level adjustment
- Code block language detection
- Character encoding detection and conversion
- Parallel processing for faster conversion
Quick Start
# Basic conversion of all HTML files in a directory
m1f-html2md convert ./website -o ./docs
# Use a custom extractor for site-specific conversion
m1f-html2md convert ./website -o ./docs \
--extractor ./extractors/custom_extractor.py
# Extract only main content from HTML files
m1f-html2md convert ./website -o ./docs \
--content-selector "main.content" --ignore-selectors nav .sidebar footer
# Skip YAML frontmatter and adjust heading levels
m1f-html2md convert ./website -o ./docs \
--no-frontmatter --heading-offset 1
# Analyze HTML structure to find best selectors
m1f-html2md analyze ./html/*.html --suggest-selectors
# Use Claude AI to intelligently analyze HTML structure
m1f-html2md analyze ./html/ --claude
# Convert HTML to Markdown using Claude AI (clean content extraction)
m1f-html2md convert ./html/ -o ./markdown/ --claude --model opus --sleep 2
Command Line Interface
The html2md tool uses subcommands for different operations:
Convert Command
Convert local HTML files to Markdown:
m1f-html2md convert <source> -o <output> [options]
Option | Description |
---|---|
source | Source file or directory |
-o, --output | Output file or directory (required) |
-c, --config | Configuration file path (YAML format) |
--format | Output format: markdown, m1f_bundle, json (default: markdown) |
--extractor | Path to custom extractor Python file |
--content-selector | CSS selector for main content |
--ignore-selectors | CSS selectors to ignore (space-separated) |
--heading-offset | Offset heading levels (default: 0) |
--no-frontmatter | Don’t add YAML frontmatter |
--parallel | Enable parallel processing |
--claude | Use Claude AI to convert HTML to Markdown |
--model | Claude model to use: opus, sonnet (default: sonnet) |
--sleep | Sleep time in seconds between Claude API calls (default: 1.0) |
-v, --verbose | Enable verbose output |
-q, --quiet | Suppress all output except errors |
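When the converter is driven from a script, the options in the table above can be assembled into an argument list before handing it to a subprocess. The following is a minimal sketch under that assumption; `build_convert_command` is a hypothetical helper and not part of html2md, and it only uses flags listed in the table.

```python
from typing import List, Optional, Sequence


def build_convert_command(
    source: str,
    output: str,
    content_selector: Optional[str] = None,
    ignore_selectors: Sequence[str] = (),
    heading_offset: int = 0,
    frontmatter: bool = True,
    parallel: bool = False,
) -> List[str]:
    """Build an argv list for `m1f-html2md convert` (hypothetical helper)."""
    cmd = ["m1f-html2md", "convert", source, "-o", output]
    if content_selector:
        cmd += ["--content-selector", content_selector]
    if ignore_selectors:
        # The CLI takes the selectors as space-separated values after one flag
        cmd += ["--ignore-selectors", *ignore_selectors]
    if heading_offset:
        cmd += ["--heading-offset", str(heading_offset)]
    if not frontmatter:
        cmd.append("--no-frontmatter")
    if parallel:
        cmd.append("--parallel")
    return cmd
```

The resulting list can be passed to `subprocess.run(cmd, check=True)`, which avoids shell quoting problems with selectors such as `main.content`.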
Analyze Command
Analyze HTML structure for optimal content extraction:
m1f-html2md analyze <paths> [options]
Option | Description |
---|---|
paths | HTML files or directories to analyze |
--show-structure | Show detailed HTML structure |
--common-patterns | Find common patterns across files |
--suggest-selectors | Suggest CSS selectors for content extraction |
--claude | Use Claude AI to intelligently select files and suggest selectors |
--analyze-files | Number of files to analyze with Claude (1-20, default: 5) |
-v, --verbose | Enable verbose output |
-q, --quiet | Suppress all output except errors |
Config Command
Generate a configuration file template:
m1f-html2md config [options]
Option | Description |
---|---|
-o, --output | Output configuration file (default: config.yaml) |
--format | Configuration format: yaml, toml, json (default: yaml) |
Complete Workflow Example
# Step 1: Create project structure
mkdir -p .scrapes/my-project/{html,md,extractors}
# Step 2: Download website using scrape tool
m1f-scrape https://example.com -o .scrapes/my-project/html
# Step 3: Analyze HTML structure (optional)
m1f-html2md analyze .scrapes/my-project/html/ --suggest-selectors
# Step 4: Create custom extractor (optional)
# Use Claude to analyze and create site-specific extractor
# Step 5: Convert with custom extractor
m1f-html2md convert .scrapes/my-project/html -o .scrapes/my-project/md \
--extractor .scrapes/my-project/extractors/custom_extractor.py
Claude AI Integration
html2md offers optional Claude AI integration for intelligent HTML analysis and conversion:
Claude Command Detection
The tool automatically detects Claude Code installations in various locations:
- Standard PATH locations
- ~/.claude/local/claude (common for local installations)
- /usr/local/bin/claude and /usr/bin/claude
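The lookup order described above can be sketched with the standard library. This is an illustration of the idea, not the tool's actual detection code; `find_claude` and `DEFAULT_CANDIDATES` are hypothetical names, and the sketch assumes a POSIX system.

```python
import os
import shutil
from pathlib import Path
from typing import Iterable, Optional

# Fallback locations taken from the list above
DEFAULT_CANDIDATES = (
    Path.home() / ".claude" / "local" / "claude",
    Path("/usr/local/bin/claude"),
    Path("/usr/bin/claude"),
)


def find_claude(candidates: Iterable[Path] = DEFAULT_CANDIDATES) -> Optional[str]:
    """Return the path of a `claude` executable, or None if none is found."""
    # 1. Anything reachable through PATH wins
    found = shutil.which("claude")
    if found:
        return found
    # 2. Otherwise probe the known install locations in order
    for candidate in candidates:
        if candidate.is_file() and os.access(candidate, os.X_OK):
            return str(candidate)
    return None
```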
AI-Powered Analysis
Use Claude to automatically select representative HTML files and suggest optimal CSS selectors:
# Analyze a directory of HTML files with Claude
m1f-html2md analyze ./scraped-site/ --claude
# Analyze more files for better coverage (up to 20)
m1f-html2md analyze ./scraped-site/ --claude --analyze-files 10
Features of Claude Analysis:
- Project Context: Provides project description to help Claude understand the content
- Important File Priority: Can specify important files for Claude to prioritize
- Multi-phase Analysis: Individual file analysis followed by synthesis
- Transparent Process: Creates temporary analysis files in the m1f/ directory
- Smart Subprocess Handling: Uses subprocess.run() for reliable Claude CLI integration
- Streaming Output: Real-time progress display during Claude analysis
- Robust Config Loading: Handles Claude-generated configs with unknown fields gracefully
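Tolerating unknown fields, as the last item above describes, usually means filtering a parsed config down to the keys the model knows about instead of failing on the extras. The sketch below shows that idea with a plain dataclass. The requirements list pydantic for the real config models, so `ExtractorSettings`, its fields, and `load_settings` are hypothetical stand-ins.

```python
from dataclasses import dataclass, field, fields
from typing import Any, Dict, List, Optional, Tuple


@dataclass
class ExtractorSettings:
    """Hypothetical subset of settings, named after the CLI options."""
    content_selector: Optional[str] = None
    ignore_selectors: List[str] = field(default_factory=list)
    heading_offset: int = 0


def load_settings(raw: Dict[str, Any]) -> Tuple[ExtractorSettings, List[str]]:
    """Build settings from a parsed config, reporting unknown keys instead of failing."""
    known = {f.name for f in fields(ExtractorSettings)}
    unknown = sorted(key for key in raw if key not in known)
    settings = ExtractorSettings(**{k: v for k, v in raw.items() if k in known})
    return settings, unknown
```

Returning the ignored keys lets the caller log a warning, so a typo in a hand-written config does not go unnoticed.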
AI-Powered Conversion
Use Claude to convert HTML to clean Markdown, extracting only the main content:
# Convert all HTML files using Claude AI
m1f-html2md convert ./html/ -o ./markdown/ --claude
# Use Opus model for higher quality (default is Sonnet)
m1f-html2md convert ./html/ -o ./markdown/ --claude --model opus
# Add delay between API calls to avoid rate limits
m1f-html2md convert ./html/ -o ./markdown/ --claude --sleep 3
The Claude conversion:
- Extracts only the main content (no navigation, ads, etc.)
- Preserves document structure and formatting
- Handles complex HTML layouts intelligently
- Generates clean, readable Markdown
Usage Examples
Basic Conversion
# Simple conversion of all HTML files in a directory
m1f-html2md convert ./website -o ./docs
# Convert files with verbose logging
m1f-html2md convert ./website -o ./docs --verbose
# Convert to m1f bundle format
m1f-html2md convert ./website -o ./docs.m1f --format m1f_bundle
# Convert to JSON format for processing
m1f-html2md convert ./website -o ./data.json --format json
Content Selection
# Extract only the main content and ignore navigation elements
m1f-html2md convert ./website -o ./docs \
--content-selector "main" --ignore-selectors nav .sidebar footer
# Extract article content from specific selectors
m1f-html2md convert ./website -o ./docs \
--content-selector "article.content" \
--ignore-selectors .author-bio .share-buttons .related-articles
HTML Analysis
# Analyze HTML files to find optimal selectors
m1f-html2md analyze ./html/ --suggest-selectors
# Show detailed structure of HTML files
m1f-html2md analyze ./html/ --show-structure
# Find common patterns across multiple files
m1f-html2md analyze ./html/ --common-patterns
# Get all analysis options
m1f-html2md analyze ./html/ \
--show-structure --common-patterns --suggest-selectors
Formatting Options
# Adjust heading levels (e.g., h1 → h2, h2 → h3)
m1f-html2md convert ./website -o ./docs \
--heading-offset 1
# Skip frontmatter generation
m1f-html2md convert ./website -o ./docs \
--no-frontmatter
# Use configuration file for advanced options
m1f-html2md convert ./website -o ./docs -c config.yaml
Performance Optimization
# Use parallel processing for faster conversion
m1f-html2md convert ./website -o ./docs --parallel
Custom Extractors
The custom extractor system allows you to create site-specific content extraction logic for optimal results.
Creating a Custom Extractor
Function-based Extractor
# extractors/simple_extractor.py
import re
from typing import Any, Dict, Optional

from bs4 import BeautifulSoup

def extract(soup: BeautifulSoup, config: Optional[Dict[str, Any]] = None) -> BeautifulSoup:
    """Extract main content from HTML."""
    # Remove navigation elements
    for nav in soup.find_all(['nav', 'header', 'footer']):
        nav.decompose()
    # Find main content
    main = soup.find('main') or soup.find('article')
    if main:
        new_soup = BeautifulSoup('<html><body></body></html>', 'html.parser')
        new_soup.body.append(main)
        return new_soup
    return soup

def postprocess(markdown: str, config: Optional[Dict[str, Any]] = None) -> str:
    """Clean up the converted markdown."""
    # Collapse runs of three or more newlines into a single blank line
    return re.sub(r'\n{3,}', '\n\n', markdown)
Class-based Extractor
# extractors/advanced_extractor.py
from typing import Any, Dict, Optional

from bs4 import BeautifulSoup
from tools.html2md.extractors import BaseExtractor

class Extractor(BaseExtractor):
    """Custom extractor for specific website."""

    def extract(self, soup: BeautifulSoup, config: Optional[Dict[str, Any]] = None) -> BeautifulSoup:
        """Extract content with site-specific logic."""
        # Custom extraction logic
        return soup

    def preprocess(self, html: str, config: Optional[Dict[str, Any]] = None) -> str:
        """Preprocess raw HTML before parsing."""
        # Fix common HTML issues: replace non-breaking-space entities with plain spaces
        return html.replace('&nbsp;', ' ')

    def postprocess(self, markdown: str, config: Optional[Dict[str, Any]] = None) -> str:
        """Post-process converted markdown."""
        # Clean up site-specific artifacts
        return markdown
Using Custom Extractors
# Use with CLI
m1f-html2md convert ./html -o ./markdown \
--extractor ./extractors/my_extractor.py
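A file passed via --extractor has to be imported from an arbitrary path and its hooks located, whether they are module-level functions or methods on an `Extractor` class. The sketch below shows one common way to do this with `importlib`; it is an assumption about how such a plugin loader could work, not html2md's actual loader, and `load_extractor_module` and `resolve_hook` are hypothetical names.

```python
import importlib.util
from pathlib import Path
from types import ModuleType
from typing import Callable, Optional, Union


def load_extractor_module(path: Union[str, Path]) -> ModuleType:
    """Import a Python file from an arbitrary path as a module."""
    path = Path(path)
    spec = importlib.util.spec_from_file_location(path.stem, path)
    if spec is None or spec.loader is None:
        raise ImportError(f"Cannot load extractor from {path}")
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module


def resolve_hook(module: ModuleType, name: str) -> Optional[Callable]:
    """Find a hook such as 'extract', 'preprocess' or 'postprocess'.

    Module-level functions take precedence; otherwise an `Extractor`
    class is instantiated and its method of that name is returned.
    """
    hook = getattr(module, name, None)
    if callable(hook):
        return hook
    extractor_cls = getattr(module, "Extractor", None)
    if extractor_cls is not None:
        return getattr(extractor_cls(), name, None)
    return None
```

A missing hook resolves to `None`, so the caller can treat every hook as optional.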
.scrapes Directory Structure
The recommended workflow uses a .scrapes directory (gitignored) for organizing scraping projects:
.scrapes/
└── project-name/
    ├── html/ # Raw HTML files from scraping
    ├── md/ # Converted Markdown files
    └── extractors/ # Custom extraction scripts
        └── custom_extractor.py
Advanced Features
YAML Frontmatter
By default, the converter adds YAML frontmatter to each Markdown file:
---
title: Extracted from HTML
source_file: original.html
date_converted: 2023-06-15T14:30:21
date_modified: 2023-06-12T10:15:33
---
To disable frontmatter generation:
m1f-html2md convert ./website -o ./docs --no-frontmatter
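For post-processing scripts that need to produce or compare the same header, the frontmatter block shown above can be reproduced with the standard library. This is a sketch only: `build_frontmatter` is a hypothetical helper, the field names are copied from the example, and it assumes the title needs no YAML quoting.

```python
from datetime import datetime
from pathlib import Path


def build_frontmatter(title: str, source: Path, converted: datetime, modified: datetime) -> str:
    """Render a YAML frontmatter block with the four fields from the example.

    Assumes `title` contains no characters that would need YAML quoting
    (such as a colon followed by a space).
    """
    lines = [
        "---",
        f"title: {title}",
        f"source_file: {source.name}",
        f"date_converted: {converted.isoformat(timespec='seconds')}",
        f"date_modified: {modified.isoformat(timespec='seconds')}",
        "---",
    ]
    return "\n".join(lines) + "\n"
```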
Heading Level Adjustment
The --heading-offset option shifts every heading by the given number of levels:
# Make h1 headings become h2 headings
m1f-html2md convert ./website -o ./docs --heading-offset 1
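The effect of a heading offset can be illustrated on already-converted Markdown. The sketch below is not the converter's implementation; `shift_headings` is a hypothetical helper. It assumes ATX-style (`#`) headings, clamps the result to levels 1 through 6, and skips lines inside fenced code blocks so that shell comments are not mistaken for headings.

```python
import re

# One to six '#' characters followed by whitespace or end of line
_ATX_HEADING = re.compile(r"(#{1,6})(?=\s|$)")
_FENCE = "`" * 3  # Markdown code fence marker


def shift_headings(markdown: str, offset: int) -> str:
    """Shift ATX heading levels by `offset`, clamped to the range 1..6."""
    out = []
    in_fence = False
    for line in markdown.split("\n"):
        if line.lstrip().startswith(_FENCE):
            in_fence = not in_fence  # entering or leaving a code block
        elif not in_fence:
            match = _ATX_HEADING.match(line)
            if match:
                level = min(6, max(1, len(match.group(1)) + offset))
                line = "#" * level + line[match.end():]
        out.append(line)
    return "\n".join(out)
```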
Code Block Language Detection
The converter automatically detects language hints from HTML code blocks:
<pre><code class="language-python">def example():
    return "Hello, world!"
</code></pre>
Converts to:
```python
def example():
    return "Hello, world!"
```
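The mapping from a `class` attribute to a fence label can be sketched as follows. This illustrates the idea rather than the converter's own code: `detect_language` and `to_fenced_block` are hypothetical helpers, the `language-` prefix comes from the example above, and accepting `lang-` as well is an added assumption.

```python
import re
from typing import Optional

# Matches class tokens such as "language-python"; "lang-" is an assumed variant
_LANG_CLASS = re.compile(r"^(?:language|lang)-([\w+#.-]+)$")
_FENCE = "`" * 3  # Markdown code fence marker


def detect_language(class_attr: Optional[str]) -> str:
    """Return the language hint from a class attribute, or '' if there is none."""
    for token in (class_attr or "").split():
        match = _LANG_CLASS.match(token)
        if match:
            return match.group(1)
    return ""


def to_fenced_block(code: str, class_attr: Optional[str] = None) -> str:
    """Wrap code in a Markdown fence labelled with the detected language."""
    return f"{_FENCE}{detect_language(class_attr)}\n{code.rstrip()}\n{_FENCE}"
```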
Character Encoding Handling
The converter provides robust character encoding detection:
- Automatically detects the encoding of source HTML files
- Properly handles UTF-8, UTF-16, and other encodings
- All output files are written in UTF-8 encoding
- Handles BOM (Byte Order Mark) detection for Unicode files
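A BOM-first decoding strategy along the lines listed above can be sketched with the `codecs` module. This is a simplified stand-in, not the tool's detection logic (which may use chardet when installed); `decode_html` is a hypothetical helper, and the `latin-1` fallback is an assumption.

```python
import codecs

# UTF-32 must be checked before UTF-16: the UTF-32 LE BOM starts with the UTF-16 LE BOM
_BOMS = (
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
)


def decode_html(raw: bytes, fallback: str = "latin-1") -> str:
    """Decode HTML bytes: honour a BOM if present, else try UTF-8, else fall back."""
    for bom, encoding in _BOMS:
        if raw.startswith(bom):
            # These codecs consume the BOM and use it to pick the byte order
            return raw.decode(encoding)
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode(fallback)
```

Writing the decoded text back with `Path.write_text(text, encoding="utf-8")` then yields UTF-8 output regardless of the input encoding.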
Architecture
HTML2MD v3.4.0 features a modern, modular architecture:
tools/html2md/
├── __init__.py # Package initialization
├── __main__.py # Entry point for module execution
├── api.py # Programmatic API for other tools
├── cli.py # Command-line interface
├── config/ # Configuration management
│ ├── __init__.py
│ ├── loader.py # Config file loader
│ └── models.py # Config data models
├── core.py # Core conversion logic
├── extractors.py # Custom extractor system
├── preprocessors.py # HTML preprocessing
├── analyze_html.py # HTML structure analysis
└── utils.py # Utility functions
Integration with m1f
The html2md tool works well with the m1f tool for comprehensive documentation handling:
# 1. First convert HTML files to Markdown
m1f-html2md convert ./html-docs -o ./markdown-docs
# 2. Then use m1f to combine the Markdown files
m1f -s ./markdown-docs -o ./combined-docs.m1f.txt \
--separator-style Markdown
This workflow is ideal for:
- Converting documentation from HTML to Markdown format
- Consolidating documentation from multiple sources
- Preparing content for LLM context windows
- Creating searchable knowledge bases
Performance Considerations
- For large websites with many HTML files, use the --parallel option
- Conversion speed depends on file size, complexity, and number of files
- Memory usage scales with file sizes when parallel processing is enabled
- The tool uses async I/O for efficient file operations
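The trade-off between throughput and memory noted above comes from how many files are in flight at once. The sketch below bounds that number with a semaphore. It is not the tool's implementation: `convert_all` is a hypothetical helper, `asyncio.to_thread` stands in for the aiofiles-based I/O named in the requirements, and `convert` is any caller-supplied HTML-to-Markdown function.

```python
import asyncio
from pathlib import Path
from typing import Callable, List, Sequence


async def convert_all(
    files: Sequence[Path],
    out_dir: Path,
    convert: Callable[[str], str],
    limit: int = 8,
) -> List[Path]:
    """Convert files concurrently, keeping at most `limit` of them in memory."""
    out_dir.mkdir(parents=True, exist_ok=True)
    semaphore = asyncio.Semaphore(limit)

    async def _one(src: Path) -> Path:
        async with semaphore:
            # Blocking reads, conversion, and writes run in worker threads
            html = await asyncio.to_thread(src.read_text, encoding="utf-8")
            markdown = await asyncio.to_thread(convert, html)
            dest = out_dir / (src.stem + ".md")
            await asyncio.to_thread(dest.write_text, markdown, encoding="utf-8")
            return dest

    # gather() returns results in the same order as the input files
    return await asyncio.gather(*(_one(f) for f in files))
```

Lowering `limit` reduces peak memory at the cost of throughput, which matters when individual HTML files are large.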
Requirements and Dependencies
- Python 3.10 or newer
- Required packages:
  - beautifulsoup4: For HTML parsing
  - markdownify: For HTML to Markdown conversion
  - aiofiles: For async file operations
  - rich: For console output
  - pydantic: For configuration models
- Optional packages:
  - chardet: For encoding detection
  - pyyaml: For YAML configuration files
  - toml: For TOML configuration files
Related Tools
- scrape - Download websites before conversion
- m1f - Bundle converted Markdown files
- s1f - Extract files from bundles
- token-counter - Estimate token usage
Next Steps
- Learn about web scraping to download HTML content
- Explore m1f bundling to combine converted files
- Check out Claude integration for AI workflows
- Review auto-bundling for automated processes