html2md - HTML to Markdown Converter

A modern HTML to Markdown converter with HTML structure analysis, custom extractors, async I/O, and parallel processing capabilities.

Overview

The html2md tool (v3.4.0) converts HTML content to Markdown with fine-grained control over the conversion process. It requires Python 3.10+, is built on an async architecture, and focuses on content extraction and conversion.

New in v3.4.0: Custom extractor plugin system for site-specific content extraction.

Note: Web scraping functionality has been moved to the separate scrape tool for better modularity. Use scrape to download websites, then html2md to convert the downloaded HTML files.

Key Features

Custom Extractor System: Create site-specific extractors for optimal content extraction
HTML Structure Analysis: Analyze HTML files to find optimal content selectors
Intelligent Content Extraction: Use CSS selectors to extract specific content
Async I/O: High-performance concurrent file processing
API Mode: Programmatic access for integration with other tools
Type Safety: Full type annotations throughout the codebase
Modern Architecture: Clean modular design
Workflow Integration: .scrapes directory structure for organized processing
Claude AI Integration: AI-powered HTML analysis and conversion
Recursive directory scanning for batch conversion
Smart internal link handling (HTML → Markdown)
Customizable element filtering and removal
YAML frontmatter generation
Heading level adjustment
Code block language detection
Character encoding detection and conversion
Parallel processing for faster conversion

Quick Start

# Basic conversion of all HTML files in a directory
m1f-html2md convert ./website -o ./docs

# Use a custom extractor for site-specific conversion
m1f-html2md convert ./website -o ./docs \
  --extractor ./extractors/custom_extractor.py

# Extract only main content from HTML files
m1f-html2md convert ./website -o ./docs \
  --content-selector "main.content" --ignore-selectors nav .sidebar footer

# Skip YAML frontmatter and adjust heading levels
m1f-html2md convert ./website -o ./docs \
  --no-frontmatter --heading-offset 1

# Analyze HTML structure to find best selectors
m1f-html2md analyze ./html/*.html --suggest-selectors

# Use Claude AI to intelligently analyze HTML structure
m1f-html2md analyze ./html/ --claude

# Convert HTML to Markdown using Claude AI (clean content extraction)
m1f-html2md convert ./html/ -o ./markdown/ --claude --model opus --sleep 2

Command Line Interface

The html2md tool uses subcommands for different operations:

Convert Command

Convert local HTML files to Markdown:

m1f-html2md convert <source> -o <output> [options]

Option	Description
`source`	Source file or directory
`-o, --output`	Output file or directory (required)
`-c, --config`	Configuration file path (YAML format)
`--format`	Output format: markdown, m1f_bundle, json (default: markdown)
`--extractor`	Path to custom extractor Python file
`--content-selector`	CSS selector for main content
`--ignore-selectors`	CSS selectors to ignore (space-separated)
`--heading-offset`	Offset heading levels (default: 0)
`--no-frontmatter`	Don’t add YAML frontmatter
`--parallel`	Enable parallel processing
`--claude`	Use Claude AI to convert HTML to Markdown
`--model`	Claude model to use: opus, sonnet (default: sonnet)
`--sleep`	Sleep time in seconds between Claude API calls (default: 1.0)
`-v, --verbose`	Enable verbose output
`-q, --quiet`	Suppress all output except errors

Analyze Command

Analyze HTML structure for optimal content extraction:

m1f-html2md analyze <paths> [options]

Option	Description
`paths`	HTML files or directories to analyze
`--show-structure`	Show detailed HTML structure
`--common-patterns`	Find common patterns across files
`--suggest-selectors`	Suggest CSS selectors for content extraction
`--claude`	Use Claude AI to intelligently select files and suggest selectors
`--analyze-files`	Number of files to analyze with Claude (1-20, default: 5)
`-v, --verbose`	Enable verbose output
`-q, --quiet`	Suppress all output except errors

Config Command

Generate a configuration file template:

m1f-html2md config [options]

Option	Description
`-o, --output`	Output configuration file (default: config.yaml)
`--format`	Configuration format: yaml, toml, json (default: yaml)

Complete Workflow Example

# Step 1: Create project structure
mkdir -p .scrapes/my-project/{html,md,extractors}

# Step 2: Download website using scrape tool
m1f-scrape https://example.com -o .scrapes/my-project/html

# Step 3: Analyze HTML structure (optional)
m1f-html2md analyze .scrapes/my-project/html/ --suggest-selectors

# Step 4: Create custom extractor (optional)
# Use Claude to analyze the HTML and create a site-specific extractor

# Step 5: Convert with custom extractor
m1f-html2md convert .scrapes/my-project/html -o .scrapes/my-project/md \
  --extractor .scrapes/my-project/extractors/custom_extractor.py

Claude AI Integration

html2md offers optional Claude AI integration for intelligent HTML analysis and conversion:

Claude Command Detection

The tool automatically detects Claude Code installations in various locations:

Standard PATH locations
~/.claude/local/claude (common for local installations)
/usr/local/bin/claude and /usr/bin/claude
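
The lookup described above can be sketched as follows. This is a minimal illustration, not the tool's actual detection code: the function name `find_claude_command`, its parameters, and the search order (PATH first, then the fixed locations) are assumptions, and the executable check assumes a POSIX system.

```python
import os
import shutil
from pathlib import Path
from typing import List, Optional

# Fixed locations taken from the list above; the order is an assumption.
CANDIDATE_PATHS: List[Path] = [
    Path.home() / ".claude" / "local" / "claude",
    Path("/usr/local/bin/claude"),
    Path("/usr/bin/claude"),
]


def find_claude_command(
    candidates: Optional[List[Path]] = None,
    search_path: Optional[str] = None,
) -> Optional[str]:
    """Return the first usable `claude` executable, or None if none is found."""
    # 1. Standard PATH lookup (search_path=None means "use $PATH").
    on_path = shutil.which("claude", path=search_path)
    if on_path:
        return on_path
    # 2. Fall back to the fixed candidate locations.
    for candidate in CANDIDATE_PATHS if candidates is None else candidates:
        if candidate.is_file() and os.access(candidate, os.X_OK):
            return str(candidate)
    return None
```

A caller could treat a `None` result as "Claude features unavailable" and fall back to the non-AI analysis path.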

AI-Powered Analysis

Use Claude to automatically select representative HTML files and suggest optimal CSS selectors:

# Analyze a directory of HTML files with Claude
m1f-html2md analyze ./scraped-site/ --claude

# Analyze more files for better coverage (up to 20)
m1f-html2md analyze ./scraped-site/ --claude --analyze-files 10

Features of Claude Analysis:

Project Context: Provides project description to help Claude understand the content
Important File Priority: Can specify important files for Claude to prioritize
Multi-phase Analysis: Individual file analysis followed by synthesis
Transparent Process: Creates temporary analysis files in the m1f/ directory
Smart Subprocess Handling: Uses subprocess.run() for reliable Claude CLI integration
Streaming Output: Real-time progress display during Claude analysis
Robust Config Loading: Handles Claude-generated configs with unknown fields gracefully
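
As an illustration of the "Robust Config Loading" point, the sketch below drops unknown keys instead of failing on them. It is an assumption-laden example: `ExtractorConfig`, its fields, and `load_config` are hypothetical names, and it uses standard-library dataclasses rather than the pydantic models the tool lists as a dependency.

```python
from dataclasses import dataclass, field, fields
from typing import Any, Dict, List, Optional, Tuple


@dataclass
class ExtractorConfig:
    """Hypothetical config model; field names mirror the CLI options."""
    content_selector: Optional[str] = None
    ignore_selectors: List[str] = field(default_factory=list)
    heading_offset: int = 0


def load_config(raw: Dict[str, Any]) -> Tuple[ExtractorConfig, List[str]]:
    """Build a config from known keys and report the keys that were ignored."""
    known = {f.name for f in fields(ExtractorConfig)}
    unknown = sorted(key for key in raw if key not in known)
    config = ExtractorConfig(**{k: v for k, v in raw.items() if k in known})
    return config, unknown
```

Returning the ignored keys lets the caller log a warning rather than silently discarding what the generated config contained.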

AI-Powered Conversion

Use Claude to convert HTML to clean Markdown, extracting only the main content:

# Convert all HTML files using Claude AI
m1f-html2md convert ./html/ -o ./markdown/ --claude

# Use Opus model for higher quality (default is Sonnet)
m1f-html2md convert ./html/ -o ./markdown/ --claude --model opus

# Add delay between API calls to avoid rate limits
m1f-html2md convert ./html/ -o ./markdown/ --claude --sleep 3

The Claude conversion:

Extracts only the main content (no navigation, ads, etc.)
Preserves document structure and formatting
Handles complex HTML layouts intelligently
Generates clean, readable Markdown

Usage Examples

Basic Conversion

# Simple conversion of all HTML files in a directory
m1f-html2md convert ./website -o ./docs

# Convert files with verbose logging
m1f-html2md convert ./website -o ./docs --verbose

# Convert to m1f bundle format
m1f-html2md convert ./website -o ./docs.m1f --format m1f_bundle

# Convert to JSON format for processing
m1f-html2md convert ./website -o ./data.json --format json

Content Selection

# Extract only the main content and ignore navigation elements
m1f-html2md convert ./website -o ./docs \
  --content-selector "main" --ignore-selectors nav .sidebar footer

# Extract article content from specific selectors
m1f-html2md convert ./website -o ./docs \
  --content-selector "article.content" \
  --ignore-selectors .author-bio .share-buttons .related-articles

HTML Analysis

# Analyze HTML files to find optimal selectors
m1f-html2md analyze ./html/ --suggest-selectors

# Show detailed structure of HTML files
m1f-html2md analyze ./html/ --show-structure

# Find common patterns across multiple files
m1f-html2md analyze ./html/ --common-patterns

# Run all analysis options together
m1f-html2md analyze ./html/ \
  --show-structure --common-patterns --suggest-selectors
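
To give a feel for what finding common patterns across files can mean, here is a rough standard-library sketch. It is not the analyzer's real algorithm: `common_selectors` and the `min_share` threshold are hypothetical, and the approach (counting `tag.class` and `tag#id` combinations per document) is only one simple way to do it.

```python
from collections import Counter
from html.parser import HTMLParser
from typing import Iterable, List, Set


class _SelectorCollector(HTMLParser):
    """Collect simple `tag.class` and `tag#id` selectors seen in one document."""

    def __init__(self) -> None:
        super().__init__()
        self.selectors: Set[str] = set()

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        if attr_map.get("id"):
            self.selectors.add(f"{tag}#{attr_map['id']}")
        for cls in (attr_map.get("class") or "").split():
            self.selectors.add(f"{tag}.{cls}")


def common_selectors(documents: Iterable[str], min_share: float = 1.0) -> List[str]:
    """Return selectors present in at least `min_share` of the documents."""
    counts: Counter = Counter()
    total = 0
    for html in documents:
        total += 1
        collector = _SelectorCollector()
        collector.feed(html)
        counts.update(collector.selectors)  # each selector counts once per document
    return sorted(s for s, n in counts.items() if total and n / total >= min_share)
```

Selectors that appear in every file, such as a shared `main.content` wrapper, are natural candidates for `--content-selector`.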

Formatting Options

# Adjust heading levels (e.g., h1 → h2, h2 → h3)
m1f-html2md convert ./website -o ./docs \
  --heading-offset 1

# Skip frontmatter generation
m1f-html2md convert ./website -o ./docs \
  --no-frontmatter

# Use configuration file for advanced options
m1f-html2md convert ./website -o ./docs -c config.yaml

Performance Optimization

# Use parallel processing for faster conversion
m1f-html2md convert ./website -o ./docs --parallel

Custom Extractors

The custom extractor system allows you to create site-specific content extraction logic for optimal results.

Creating a Custom Extractor

Function-based Extractor

# extractors/simple_extractor.py
import re
from typing import Any, Dict, Optional

from bs4 import BeautifulSoup

def extract(soup: BeautifulSoup, config: Optional[Dict[str, Any]] = None) -> BeautifulSoup:
    """Extract main content from HTML."""
    # Remove navigation elements
    for nav in soup.find_all(['nav', 'header', 'footer']):
        nav.decompose()

    # Find main content, preferring <main> over <article>
    main = soup.find('main') or soup.find('article')
    if main:
        # Move the content node into a fresh, otherwise empty document
        new_soup = BeautifulSoup('<html><body></body></html>', 'html.parser')
        new_soup.body.append(main)
        return new_soup

    # No main content found: fall back to the cleaned full document
    return soup

def postprocess(markdown: str, config: Optional[Dict[str, Any]] = None) -> str:
    """Clean up the converted markdown."""
    # Collapse runs of three or more newlines into a single blank line
    return re.sub(r'\n{3,}', '\n\n', markdown)

Class-based Extractor

# extractors/advanced_extractor.py
from tools.html2md.extractors import BaseExtractor
from bs4 import BeautifulSoup
from typing import Optional, Dict, Any

class Extractor(BaseExtractor):
    """Custom extractor for specific website."""

    def extract(self, soup: BeautifulSoup, config: Optional[Dict[str, Any]] = None) -> BeautifulSoup:
        """Extract content with site-specific logic."""
        # Custom extraction logic
        return soup

    def preprocess(self, html: str, config: Optional[Dict[str, Any]] = None) -> str:
        """Preprocess raw HTML before parsing."""
        # Fix common HTML issues
        return html.replace('&nbsp;', ' ')

    def postprocess(self, markdown: str, config: Optional[Dict[str, Any]] = None) -> str:
        """Post-process converted markdown."""
        # Clean up site-specific artifacts
        return markdown

Using Custom Extractors

# Use with CLI
m1f-html2md convert ./html -o ./markdown \
  --extractor ./extractors/my_extractor.py
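
For readers curious how a Python file passed via --extractor could be loaded at runtime, the following sketch uses the standard `importlib` machinery. It is an assumption about the mechanism, not the tool's confirmed loader: `load_extractor_module` is a hypothetical helper, and the check for an `extract` function or `Extractor` class simply mirrors the two extractor styles shown above.

```python
import importlib.util
from pathlib import Path
from types import ModuleType


def load_extractor_module(path: Path) -> ModuleType:
    """Import an extractor file by path and verify it exposes a known hook."""
    spec = importlib.util.spec_from_file_location(path.stem, path)
    if spec is None or spec.loader is None:
        raise ImportError(f"Cannot load extractor from {path}")
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)  # runs the file's top-level code
    # Accept either style: a module-level extract() or an Extractor class.
    if not (hasattr(module, "extract") or hasattr(module, "Extractor")):
        raise AttributeError(f"{path} defines neither extract() nor Extractor")
    return module
```

Because loading executes the file's top-level code, extractor files should only come from sources you trust.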

.scrapes Directory Structure

The recommended workflow uses a .scrapes directory (gitignored) for organizing scraping projects:

.scrapes/
└── project-name/
    ├── html/         # Raw HTML files from scraping
    ├── md/           # Converted Markdown files
    └── extractors/   # Custom extraction scripts
        └── custom_extractor.py

Advanced Features

YAML Frontmatter

By default, the converter adds YAML frontmatter to each Markdown file:

---
title: Extracted from HTML
source_file: original.html
date_converted: 2023-06-15T14:30:21
date_modified: 2023-06-12T10:15:33
---

To disable frontmatter generation:

m1f-html2md convert ./website -o ./docs --no-frontmatter
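
A frontmatter block like the one above could be assembled roughly as follows. This is a simplified sketch: `build_frontmatter` is a hypothetical helper, the field names are copied from the example, and values are written unquoted, so a title containing YAML-special characters (such as a colon) would need proper escaping via a YAML library.

```python
from datetime import datetime
from pathlib import Path


def build_frontmatter(
    title: str,
    source: Path,
    converted_at: datetime,
    modified_at: datetime,
) -> str:
    """Render the four frontmatter fields shown in the example above."""
    lines = [
        "---",
        f"title: {title}",  # NOTE: not escaped; fine only for plain titles
        f"source_file: {source.name}",
        f"date_converted: {converted_at.isoformat(timespec='seconds')}",
        f"date_modified: {modified_at.isoformat(timespec='seconds')}",
        "---",
    ]
    return "\n".join(lines) + "\n"
```

In practice `modified_at` could come from the source file's modification time, for example `datetime.fromtimestamp(source.stat().st_mtime)`.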

Heading Level Adjustment

The --heading-offset option shifts every heading level by the given amount:

# Make h1 headings become h2 headings
m1f-html2md convert ./website -o ./docs --heading-offset 1
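
The effect of a heading offset can be illustrated on Markdown text directly. The sketch below is an assumption about behavior, not the converter's implementation (which may adjust levels on the HTML side): `shift_headings` is a hypothetical function, levels are clamped to the 1-6 range, and lines inside fenced code blocks are left alone.

```python
import re

_ATX_HEADING = re.compile(r"^(#{1,6})(\s+.*)$")


def shift_headings(markdown: str, offset: int) -> str:
    """Shift ATX heading levels by `offset`, clamped to levels 1 through 6."""
    out = []
    in_fence = False
    for line in markdown.split("\n"):
        # Toggle on fence markers so '# comment' lines in code stay untouched.
        if line.lstrip().startswith("```"):
            in_fence = not in_fence
        match = None if in_fence else _ATX_HEADING.match(line)
        if match:
            level = min(6, max(1, len(match.group(1)) + offset))
            line = "#" * level + match.group(2)
        out.append(line)
    return "\n".join(out)
```

Clamping means deeply nested headings collapse onto level 6 rather than producing invalid seven-hash headings.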

Code Block Language Detection

The converter automatically detects language hints from HTML code blocks:

<pre><code class="language-python">def example():
    return "Hello, world!"
</code></pre>

Converts to:

```python
def example():
    return "Hello, world!"
```
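
One plausible way to read the language hint is to scan the `class` attribute for a `language-` token, as in this sketch. The helper name `detect_language` is hypothetical, and accepting the shorter `lang-` prefix is an assumption beyond what the example above shows.

```python
import re
from typing import Optional

# 'language-xxx' matches the example above; 'lang-xxx' is an assumed variant.
_LANG_CLASS = re.compile(r"^(?:language|lang)-([\w+#-]+)$")


def detect_language(class_attr: Optional[str]) -> str:
    """Return the language named in a class attribute, or '' if none is found."""
    for token in (class_attr or "").split():
        match = _LANG_CLASS.match(token)
        if match:
            return match.group(1).lower()
    return ""
```

An empty result would simply yield a fence without a language tag.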

Character Encoding Handling

The converter provides robust character encoding detection:

Automatically detects the encoding of source HTML files
Properly handles UTF-8, UTF-16, and other encodings
All output files are written in UTF-8 encoding
Handles BOM (Byte Order Mark) detection for Unicode files
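
The BOM handling in the list above can be sketched with the standard `codecs` constants. This is a simplified illustration, not the converter's actual detection logic: `decode_html` is a hypothetical helper, the `cp1252` fallback is an assumption, and it does not use chardet, which the tool lists as an optional dependency.

```python
import codecs

# UTF-32 BOMs are checked before UTF-16 because the UTF-32 LE BOM
# begins with the same two bytes as the UTF-16 LE BOM.
_BOMS = [
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def decode_html(data: bytes, fallback: str = "cp1252") -> str:
    """Decode raw HTML bytes: BOM first, then UTF-8, then a lenient fallback."""
    for bom, encoding in _BOMS:
        if data.startswith(bom):
            # These codecs consume the BOM and use it to pick the byte order.
            return data.decode(encoding)
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        return data.decode(fallback, errors="replace")
```

Writing the result back with `encoding="utf-8"` then gives the UTF-8 output described above.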

Architecture

html2md v3.4.0 has a modular architecture:

tools/html2md/
├── __init__.py       # Package initialization
├── __main__.py       # Entry point for module execution
├── api.py            # Programmatic API for other tools
├── cli.py            # Command-line interface
├── config/           # Configuration management
│   ├── __init__.py
│   ├── loader.py     # Config file loader
│   └── models.py     # Config data models
├── core.py           # Core conversion logic
├── extractors.py     # Custom extractor system
├── preprocessors.py  # HTML preprocessing
├── analyze_html.py   # HTML structure analysis
└── utils.py          # Utility functions

Integration with m1f

The html2md tool works well with the m1f tool for comprehensive documentation handling:

# 1. First convert HTML files to Markdown
m1f-html2md convert ./html-docs -o ./markdown-docs

# 2. Then use m1f to combine the Markdown files
m1f -s ./markdown-docs -o ./combined-docs.m1f.txt \
  --separator-style Markdown

This workflow is ideal for:

Converting documentation from HTML to Markdown format
Consolidating documentation from multiple sources
Preparing content for LLM context windows
Creating searchable knowledge bases

Performance Considerations

For large websites with many HTML files, use the --parallel option
Conversion speed depends on file size, complexity, and number of files
Memory usage scales with file sizes when parallel processing is enabled
The tool uses async I/O for efficient file operations
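
The trade-off between concurrency and memory noted above can be seen in a small bounded-concurrency sketch. It is illustrative only: `convert_all`, the `convert` callback, and `max_concurrency` are hypothetical names, and it uses `asyncio.to_thread` from the standard library instead of the aiofiles package the tool depends on.

```python
import asyncio
from pathlib import Path
from typing import Callable, List


async def convert_all(
    sources: List[Path],
    out_dir: Path,
    convert: Callable[[str], str],
    max_concurrency: int = 8,
) -> List[Path]:
    """Convert files concurrently, keeping at most `max_concurrency` in flight."""
    out_dir.mkdir(parents=True, exist_ok=True)
    semaphore = asyncio.Semaphore(max_concurrency)

    def convert_one_sync(src: Path) -> Path:
        # Blocking read, convert, and write for a single file.
        target = out_dir / (src.stem + ".md")
        target.write_text(convert(src.read_text(encoding="utf-8")), encoding="utf-8")
        return target

    async def convert_one(src: Path) -> Path:
        async with semaphore:  # bounds how many files are held in memory at once
            return await asyncio.to_thread(convert_one_sync, src)

    # gather() returns results in the same order as `sources`.
    return list(await asyncio.gather(*(convert_one(s) for s in sources)))
```

Lowering `max_concurrency` caps peak memory at the cost of throughput, which is the balance the notes above describe.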

Requirements and Dependencies

Python 3.10 or newer
Required packages:
- beautifulsoup4: For HTML parsing
- markdownify: For HTML to Markdown conversion
- aiofiles: For async file operations
- rich: For console output
- pydantic: For configuration models
Optional packages:
- chardet: For encoding detection
- pyyaml: For YAML configuration files
- toml: For TOML configuration files

Related Tools

scrape - Download websites before conversion
m1f - Bundle converted Markdown files
s1f - Extract files from bundles
token-counter - Estimate token usage

Next Steps

Learn about web scraping to download HTML content
Explore m1f bundling to combine converted files
Check out Claude integration for AI workflows
Review auto-bundling for automated processes