html2md - HTML to Markdown Converter

Convert HTML content to Markdown with intelligent content extraction and custom processing

A modern HTML to Markdown converter with HTML structure analysis, custom extractors, async I/O, and parallel processing capabilities.

Overview

The html2md tool (v3.4.0) converts HTML content to Markdown and gives fine-grained control over the conversion process. It requires Python 3.10 or newer, is built on an async architecture, and focuses on intelligent content extraction.

New in v3.4.0: Custom extractor plugin system for site-specific content extraction.

Note: Web scraping functionality has been moved to the separate scrape tool for better modularity. Use scrape to download websites, then html2md to convert the downloaded HTML files.

Key Features

  • Custom Extractor System: Create site-specific extractors for optimal content extraction
  • HTML Structure Analysis: Analyze HTML files to find optimal content selectors
  • Intelligent Content Extraction: Use CSS selectors to extract specific content
  • Async I/O: High-performance concurrent file processing
  • API Mode: Programmatic access for integration with other tools
  • Type Safety: Full type annotations throughout the codebase
  • Modern Architecture: Clean modular design
  • Workflow Integration: .scrapes directory structure for organized processing
  • Claude AI Integration: AI-powered HTML analysis and conversion
  • Recursive directory scanning for batch conversion
  • Smart internal link handling (HTML → Markdown)
  • Customizable element filtering and removal
  • YAML frontmatter generation
  • Heading level adjustment
  • Code block language detection
  • Character encoding detection and conversion
  • Parallel processing for faster conversion

Quick Start

# Basic conversion of all HTML files in a directory
m1f-html2md convert ./website -o ./docs

# Use a custom extractor for site-specific conversion
m1f-html2md convert ./website -o ./docs \
  --extractor ./extractors/custom_extractor.py

# Extract only main content from HTML files
m1f-html2md convert ./website -o ./docs \
  --content-selector "main.content" --ignore-selectors nav .sidebar footer

# Skip YAML frontmatter and adjust heading levels
m1f-html2md convert ./website -o ./docs \
  --no-frontmatter --heading-offset 1

# Analyze HTML structure to find best selectors
m1f-html2md analyze ./html/*.html --suggest-selectors

# Use Claude AI to intelligently analyze HTML structure
m1f-html2md analyze ./html/ --claude

# Convert HTML to Markdown using Claude AI (clean content extraction)
m1f-html2md convert ./html/ -o ./markdown/ --claude --model opus --sleep 2

Command Line Interface

The html2md tool uses subcommands for different operations:

Convert Command

Convert local HTML files to Markdown:

m1f-html2md convert <source> -o <output> [options]

Option              Description
source              Source file or directory
-o, --output        Output file or directory (required)
-c, --config        Configuration file path (YAML format)
--format            Output format: markdown, m1f_bundle, json (default: markdown)
--extractor         Path to custom extractor Python file
--content-selector  CSS selector for main content
--ignore-selectors  CSS selectors to ignore (space-separated)
--heading-offset    Offset heading levels (default: 0)
--no-frontmatter    Don’t add YAML frontmatter
--parallel          Enable parallel processing
--claude            Use Claude AI to convert HTML to Markdown
--model             Claude model to use: opus, sonnet (default: sonnet)
--sleep             Sleep time in seconds between Claude API calls (default: 1.0)
-v, --verbose       Enable verbose output
-q, --quiet         Suppress all output except errors

Analyze Command

Analyze HTML structure for optimal content extraction:

m1f-html2md analyze <paths> [options]

Option               Description
paths                HTML files or directories to analyze
--show-structure     Show detailed HTML structure
--common-patterns    Find common patterns across files
--suggest-selectors  Suggest CSS selectors for content extraction
--claude             Use Claude AI to intelligently select files and suggest selectors
--analyze-files      Number of files to analyze with Claude (1-20, default: 5)
-v, --verbose        Enable verbose output
-q, --quiet          Suppress all output except errors

Config Command

Generate a configuration file template:

m1f-html2md config [options]

Option        Description
-o, --output  Output configuration file (default: config.yaml)
--format      Configuration format: yaml, toml, json (default: yaml)

Complete Workflow Example

# Step 1: Create project structure
mkdir -p .scrapes/my-project/{html,md,extractors}

# Step 2: Download website using scrape tool
m1f-scrape https://example.com -o .scrapes/my-project/html

# Step 3: Analyze HTML structure (optional)
m1f-html2md analyze .scrapes/my-project/html/ --suggest-selectors

# Step 4: Create custom extractor (optional)
# Use Claude to analyze and create site-specific extractor

# Step 5: Convert with custom extractor
m1f-html2md convert .scrapes/my-project/html -o .scrapes/my-project/md \
  --extractor .scrapes/my-project/extractors/custom_extractor.py
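
The same workflow can be driven from a script. The sketch below only assembles the command lines from the flags documented above and prints them. The helper name `convert_command` is hypothetical and not part of html2md.

```python
import shlex
from typing import List, Optional


def convert_command(
    source: str,
    output: str,
    extractor: Optional[str] = None,
    content_selector: Optional[str] = None,
    ignore_selectors: Optional[List[str]] = None,
) -> List[str]:
    """Build an argv list for `m1f-html2md convert` (hypothetical helper)."""
    argv = ["m1f-html2md", "convert", source, "-o", output]
    if extractor:
        argv += ["--extractor", extractor]
    if content_selector:
        argv += ["--content-selector", content_selector]
    if ignore_selectors:
        # The CLI takes the selectors as separate, space-separated arguments.
        argv += ["--ignore-selectors", *ignore_selectors]
    return argv


project = ".scrapes/my-project"
steps = [
    ["m1f-scrape", "https://example.com", "-o", f"{project}/html"],
    ["m1f-html2md", "analyze", f"{project}/html/", "--suggest-selectors"],
    convert_command(
        f"{project}/html",
        f"{project}/md",
        extractor=f"{project}/extractors/custom_extractor.py",
    ),
]

for step in steps:
    # Dry run: print each command. To execute, replace the print with
    # subprocess.run(step, check=True).
    print(shlex.join(step))
```

Keeping each command as a list of arguments avoids shell quoting problems with selectors such as `main.content`.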

Claude AI Integration

html2md offers optional Claude AI integration for intelligent HTML analysis and conversion:

Claude Command Detection

The tool automatically detects Claude Code installations in various locations:

  • Standard PATH locations
  • ~/.claude/local/claude (common for local installations)
  • /usr/local/bin/claude and /usr/bin/claude
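
A lookup along these lines can be sketched with the standard library. The function name `find_claude` and the search order (PATH first, then the fixed locations) are assumptions for illustration; the tool's actual detection logic may differ.

```python
import os
import shutil
from pathlib import Path
from typing import Iterable, Optional

# Fallback locations taken from the list above; their order is an assumption.
DEFAULT_CANDIDATES = (
    Path.home() / ".claude" / "local" / "claude",
    Path("/usr/local/bin/claude"),
    Path("/usr/bin/claude"),
)


def find_claude(
    command: str = "claude",
    candidates: Iterable[Path] = DEFAULT_CANDIDATES,
) -> Optional[str]:
    """Return the path of the first usable executable, or None if none is found."""
    # 1. Standard PATH lookup.
    on_path = shutil.which(command)
    if on_path:
        return on_path
    # 2. Known install locations that may not be on PATH.
    for candidate in candidates:
        if candidate.is_file() and os.access(candidate, os.X_OK):
            return str(candidate)
    return None
```

Calling `find_claude()` before passing `--claude` gives a quick way to check whether an installation is visible from the current environment.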

AI-Powered Analysis

Use Claude to automatically select representative HTML files and suggest optimal CSS selectors:

# Analyze a directory of HTML files with Claude
m1f-html2md analyze ./scraped-site/ --claude

# Analyze more files for better coverage (up to 20)
m1f-html2md analyze ./scraped-site/ --claude --analyze-files 10

Features of Claude Analysis:

  • Project Context: Provides project description to help Claude understand the content
  • Important File Priority: Can specify important files for Claude to prioritize
  • Multi-phase Analysis: Individual file analysis followed by synthesis
  • Transparent Process: Creates temporary analysis files in the m1f/ directory
  • Smart Subprocess Handling: Uses subprocess.run() for reliable Claude CLI integration
  • Streaming Output: Real-time progress display during Claude analysis
  • Robust Config Loading: Handles Claude-generated configs with unknown fields gracefully

AI-Powered Conversion

Use Claude to convert HTML to clean Markdown, extracting only the main content:

# Convert all HTML files using Claude AI
m1f-html2md convert ./html/ -o ./markdown/ --claude

# Use Opus model for higher quality (default is Sonnet)
m1f-html2md convert ./html/ -o ./markdown/ --claude --model opus

# Add delay between API calls to avoid rate limits
m1f-html2md convert ./html/ -o ./markdown/ --claude --sleep 3

The Claude conversion:

  • Extracts only the main content (no navigation, ads, etc.)
  • Preserves document structure and formatting
  • Handles complex HTML layouts intelligently
  • Generates clean, readable Markdown

Usage Examples

Basic Conversion

# Simple conversion of all HTML files in a directory
m1f-html2md convert ./website -o ./docs

# Convert files with verbose logging
m1f-html2md convert ./website -o ./docs --verbose

# Convert to m1f bundle format
m1f-html2md convert ./website -o ./docs.m1f --format m1f_bundle

# Convert to JSON format for processing
m1f-html2md convert ./website -o ./data.json --format json

Content Selection

# Extract only the main content and ignore navigation elements
m1f-html2md convert ./website -o ./docs \
  --content-selector "main" --ignore-selectors nav .sidebar footer

# Extract article content from specific selectors
m1f-html2md convert ./website -o ./docs \
  --content-selector "article.content" \
  --ignore-selectors .author-bio .share-buttons .related-articles

HTML Analysis

# Analyze HTML files to find optimal selectors
m1f-html2md analyze ./html/ --suggest-selectors

# Show detailed structure of HTML files
m1f-html2md analyze ./html/ --show-structure

# Find common patterns across multiple files
m1f-html2md analyze ./html/ --common-patterns

# Get all analysis options
m1f-html2md analyze ./html/ \
  --show-structure --common-patterns --suggest-selectors

Formatting Options

# Adjust heading levels (e.g., h1 → h2, h2 → h3)
m1f-html2md convert ./website -o ./docs \
  --heading-offset 1

# Skip frontmatter generation
m1f-html2md convert ./website -o ./docs \
  --no-frontmatter

# Use configuration file for advanced options
m1f-html2md convert ./website -o ./docs -c config.yaml

Performance Optimization

# Use parallel processing for faster conversion
m1f-html2md convert ./website -o ./docs --parallel

Custom Extractors

The custom extractor system allows you to create site-specific content extraction logic for optimal results.

Creating a Custom Extractor

Function-based Extractor

# extractors/simple_extractor.py
import re
from typing import Any, Dict, Optional

from bs4 import BeautifulSoup

def extract(soup: BeautifulSoup, config: Optional[Dict[str, Any]] = None) -> BeautifulSoup:
    """Extract main content from HTML."""
    # Remove navigation elements
    for nav in soup.find_all(['nav', 'header', 'footer']):
        nav.decompose()

    # Find main content
    main = soup.find('main') or soup.find('article')
    if main:
        # Move the main element into a fresh, otherwise empty document
        new_soup = BeautifulSoup('<html><body></body></html>', 'html.parser')
        new_soup.body.append(main)
        return new_soup

    return soup

def postprocess(markdown: str, config: Optional[Dict[str, Any]] = None) -> str:
    """Clean up the converted markdown."""
    # Collapse runs of three or more newlines into a single blank line
    return re.sub(r'\n{3,}', '\n\n', markdown)

Class-based Extractor

# extractors/advanced_extractor.py
from tools.html2md.extractors import BaseExtractor
from bs4 import BeautifulSoup
from typing import Optional, Dict, Any

class Extractor(BaseExtractor):
    """Custom extractor for specific website."""

    def extract(self, soup: BeautifulSoup, config: Optional[Dict[str, Any]] = None) -> BeautifulSoup:
        """Extract content with site-specific logic."""
        # Custom extraction logic
        return soup

    def preprocess(self, html: str, config: Optional[Dict[str, Any]] = None) -> str:
        """Preprocess raw HTML before parsing."""
        # Fix common HTML issues
        return html.replace('&nbsp;', ' ')

    def postprocess(self, markdown: str, config: Optional[Dict[str, Any]] = None) -> str:
        """Post-process converted markdown."""
        # Clean up site-specific artifacts
        return markdown

Using Custom Extractors

# Use with CLI
m1f-html2md convert ./html -o ./markdown \
  --extractor ./extractors/my_extractor.py
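
Before passing an extractor file to the CLI, it can help to check that it imports cleanly and defines the expected hooks. The sketch below loads a file by path with `importlib` and reports which of the hook names used in the examples above are present. Both function names are hypothetical, and how html2md itself loads extractor files is not documented here.

```python
import importlib.util
from pathlib import Path
from types import ModuleType
from typing import List


def load_extractor_module(path: Path) -> ModuleType:
    """Import a Python file by path and return the resulting module."""
    spec = importlib.util.spec_from_file_location(path.stem, path)
    if spec is None or spec.loader is None:
        raise ImportError(f"Cannot load extractor from {path}")
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module


def available_hooks(module: ModuleType) -> List[str]:
    """List the hooks defined by an `Extractor` class or by module-level functions."""
    # Assumption: a class-based extractor is exposed under the name `Extractor`.
    target = getattr(module, "Extractor", module)
    names = ("preprocess", "extract", "postprocess")
    return [name for name in names if callable(getattr(target, name, None))]
```

For the function-based example above, `available_hooks` would report `extract` and `postprocess`.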

.scrapes Directory Structure

The recommended workflow uses a .scrapes directory (gitignored) for organizing scraping projects:

.scrapes/
└── project-name/
    ├── html/         # Raw HTML files from scraping
    ├── md/           # Converted Markdown files
    └── extractors/   # Custom extraction scripts
        └── custom_extractor.py

Advanced Features

YAML Frontmatter

By default, the converter adds YAML frontmatter to each Markdown file:

---
title: Extracted from HTML
source_file: original.html
date_converted: 2023-06-15T14:30:21
date_modified: 2023-06-12T10:15:33
---

To disable frontmatter generation:

m1f-html2md convert ./website -o ./docs --no-frontmatter
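
For reference, a block with the four fields shown above can be produced with a few lines of Python. This is a minimal sketch, not the converter's own code: `build_frontmatter` is a hypothetical name, and values are written without YAML escaping, so a title containing a colon would need a real YAML serializer such as pyyaml.

```python
from datetime import datetime
from pathlib import Path


def build_frontmatter(
    title: str, source: Path, converted: datetime, modified: datetime
) -> str:
    """Render the four documented fields as a YAML frontmatter block."""
    fields = {
        "title": title,
        "source_file": source.name,
        "date_converted": converted.isoformat(timespec="seconds"),
        "date_modified": modified.isoformat(timespec="seconds"),
    }
    lines = ["---"]
    # No escaping is applied here; values are assumed to be plain scalars.
    lines += [f"{key}: {value}" for key, value in fields.items()]
    lines.append("---")
    return "\n".join(lines) + "\n"


print(
    build_frontmatter(
        "Extracted from HTML",
        Path("website/original.html"),
        datetime(2023, 6, 15, 14, 30, 21),
        datetime(2023, 6, 12, 10, 15, 33),
    )
)
```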

Heading Level Adjustment

The --heading-offset option shifts every heading by the given number of levels, so an offset of 1 turns h1 into h2, h2 into h3, and so on:

# Make h1 headings become h2 headings
m1f-html2md convert ./website -o ./docs --heading-offset 1
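
The effect of the offset can be illustrated on Markdown text directly. The sketch below is not the converter's implementation. It assumes ATX-style headings, clamps the result to levels 1 through 6, and skips lines inside fenced code blocks so that shell comments are left alone. The name `shift_headings` is hypothetical.

```python
import re

# An ATX heading: one to six '#' characters followed by whitespace and text.
_HEADING = re.compile(r"^(#{1,6})([ \t]+.*)$")
_FENCE = "`" * 3


def shift_headings(markdown: str, offset: int) -> str:
    """Shift ATX headings by `offset` levels, clamped to h1..h6."""
    out = []
    in_fence = False
    for line in markdown.splitlines():
        if line.lstrip().startswith(_FENCE):
            # Toggle on every fence line so code content is never rewritten.
            in_fence = not in_fence
        elif not in_fence:
            match = _HEADING.match(line)
            if match:
                level = max(1, min(6, len(match.group(1)) + offset))
                line = "#" * level + match.group(2)
        out.append(line)
    return "\n".join(out)
```

Clamping at level 6 is a design choice for this sketch; a heading that is already h6 stays h6 rather than becoming plain text.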

Code Block Language Detection

The converter automatically detects language hints from HTML code blocks:

<pre><code class="language-python">def example():
    return "Hello, world!"
</code></pre>

Converts to:

```python
def example():
    return "Hello, world!"
```
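
One way to picture the detection is to read the `language-*` class from each `<code>` element inside a `<pre>`. The sketch below does this with the standard library's `html.parser`. It is an illustration under that assumption, not the converter's code, and the names `CodeBlockParser` and `to_fenced` are hypothetical.

```python
from html.parser import HTMLParser
from typing import List, Optional, Tuple


class CodeBlockParser(HTMLParser):
    """Collect (language, code) pairs from <pre><code class="language-x"> blocks."""

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self.blocks: List[Tuple[Optional[str], str]] = []
        self._in_pre = False
        self._language: Optional[str] = None
        self._buffer: Optional[List[str]] = None

    def handle_starttag(self, tag, attrs):
        if tag == "pre":
            self._in_pre = True
        elif tag == "code" and self._in_pre:
            classes = (dict(attrs).get("class") or "").split()
            prefix = "language-"
            # Take the first class with the prefix; None if there is no hint.
            self._language = next(
                (c[len(prefix):] for c in classes if c.startswith(prefix)), None
            )
            self._buffer = []

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "code" and self._buffer is not None:
            self.blocks.append((self._language, "".join(self._buffer)))
            self._buffer = None
        elif tag == "pre":
            self._in_pre = False


def to_fenced(html: str) -> str:
    """Render every detected code block as a fenced Markdown block."""
    parser = CodeBlockParser()
    parser.feed(html)
    parser.close()
    fence = "`" * 3
    return "\n\n".join(
        f"{fence}{lang or ''}\n{code.rstrip()}\n{fence}"
        for lang, code in parser.blocks
    )
```

A block without a `language-*` class is still emitted, just with a bare fence and no language tag.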

Character Encoding Handling

The converter detects and normalizes character encodings:

  1. Detects the encoding of each source HTML file automatically
  2. Reads UTF-8, UTF-16, and other encodings
  3. Writes all output files as UTF-8
  4. Recognizes a BOM (Byte Order Mark) in Unicode files
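
The BOM check and the UTF-8 normalization can be sketched with the standard library alone. This is a simplified illustration: it does not use chardet, the `latin-1` fallback is an assumption, and `decode_html` and `to_utf8` are hypothetical names.

```python
import codecs
from typing import Tuple

# Longer BOMs come first: the UTF-32 LE BOM starts with the UTF-16 LE BOM.
_BOMS = (
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
)


def decode_html(raw: bytes, fallback: str = "latin-1") -> Tuple[str, str]:
    """Return (text, encoding_used) for raw HTML bytes."""
    for bom, encoding in _BOMS:
        if raw.startswith(bom):
            # These codecs read the BOM to pick the byte order, then strip it.
            return raw.decode(encoding), encoding
    try:
        return raw.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # Assumed last resort; a detector such as chardet could be used instead.
        return raw.decode(fallback), fallback


def to_utf8(raw: bytes) -> bytes:
    """Re-encode arbitrary HTML bytes as UTF-8 without a BOM."""
    text, _ = decode_html(raw)
    return text.encode("utf-8")
```

Checking the BOM before attempting a plain UTF-8 decode matters because UTF-16 data would otherwise fail or be misread.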

Architecture

html2md v3.4.0 is organized as a modular package:

tools/html2md/
├── __init__.py       # Package initialization
├── __main__.py       # Entry point for module execution
├── api.py            # Programmatic API for other tools
├── cli.py            # Command-line interface
├── config/           # Configuration management
│   ├── __init__.py
│   ├── loader.py     # Config file loader
│   └── models.py     # Config data models
├── core.py           # Core conversion logic
├── extractors.py     # Custom extractor system
├── preprocessors.py  # HTML preprocessing
├── analyze_html.py   # HTML structure analysis
└── utils.py          # Utility functions

Integration with m1f

The html2md tool pairs with the m1f tool: convert the HTML files to Markdown first, then bundle the Markdown files into a single file:

# 1. First convert HTML files to Markdown
m1f-html2md convert ./html-docs -o ./markdown-docs

# 2. Then use m1f to combine the Markdown files
m1f -s ./markdown-docs -o ./combined-docs.m1f.txt \
  --separator-style Markdown

This workflow is ideal for:

  • Converting documentation from HTML to Markdown format
  • Consolidating documentation from multiple sources
  • Preparing content for LLM context windows
  • Creating searchable knowledge bases

Performance Considerations

  • For large websites with many HTML files, use the --parallel option
  • Conversion speed depends on file size, complexity, and number of files
  • Memory usage scales with file sizes when parallel processing is enabled
  • The tool uses async I/O for efficient file operations
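
The trade-off between speed and memory can be seen in a small sketch of bounded concurrent conversion. It is written under stated assumptions: the blocking work runs in worker threads via `asyncio.to_thread` rather than aiofiles, `convert` stands in for the real HTML to Markdown step, and `convert_all` and `max_concurrency` are hypothetical names.

```python
import asyncio
from pathlib import Path
from typing import Callable, List


async def convert_all(
    source_dir: Path,
    output_dir: Path,
    convert: Callable[[str], str],
    max_concurrency: int = 8,
) -> List[Path]:
    """Convert every .html file under source_dir, a bounded number at a time."""
    # The semaphore caps how many files are held in memory at once.
    semaphore = asyncio.Semaphore(max_concurrency)

    def _convert_file(src: Path) -> Path:
        dest = (output_dir / src.relative_to(source_dir)).with_suffix(".md")
        dest.parent.mkdir(parents=True, exist_ok=True)
        dest.write_text(convert(src.read_text(encoding="utf-8")), encoding="utf-8")
        return dest

    async def _bounded(src: Path) -> Path:
        async with semaphore:
            # Run the blocking read, convert, and write in a worker thread.
            return await asyncio.to_thread(_convert_file, src)

    sources = sorted(source_dir.rglob("*.html"))
    return list(await asyncio.gather(*(_bounded(src) for src in sources)))
```

A call such as `asyncio.run(convert_all(Path("html"), Path("md"), my_convert))` would mirror the directory tree under `md`. Lowering `max_concurrency` reduces peak memory at the cost of throughput.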

Requirements and Dependencies

  • Python 3.10 or newer
  • Required packages:
    • beautifulsoup4: For HTML parsing
    • markdownify: For HTML to Markdown conversion
    • aiofiles: For async file operations
    • rich: For console output
    • pydantic: For configuration models
  • Optional packages:
    • chardet: For encoding detection
    • pyyaml: For YAML configuration files
    • toml: For TOML configuration files

Related Tools

  • scrape - Download websites before conversion
  • m1f - Bundle converted Markdown files
  • s1f - Extract files from bundles
  • token-counter - Estimate token usage

Next Steps