Web Scraper Backends Guide

Comparison and configuration of different web scraping backends

The m1f suite includes powerful web scraping capabilities through the html2md and scrape tools. This guide helps you choose and configure the right scraping backend for your specific needs.

Overview

The m1f scraper backend system gives you the flexibility to choose the most appropriate tool for your web scraping needs; every backend is selected with the same --scraper flag (see the example after this list):

  • Static websites: BeautifulSoup4 (default) - Fast and lightweight
  • Complete mirroring: HTTrack - Professional website copying
  • Performance-critical: httpx + selectolax - Blazing fast parsing
  • Enterprise scraping: Scrapy - Industrial-strength framework
  • JavaScript-heavy sites: Playwright - Full browser automation
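
For example (URLs and output paths are placeholders):

# Default parser for static HTML
m1f-scrape https://example.com -o output/ --scraper beautifulsoup

# Full browser automation for JavaScript-heavy sites
m1f-scrape https://example.com -o output/ --scraper playwright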

Backend Comparison

Backend         Speed      Memory   JavaScript   Best For
BeautifulSoup   Fast       Low      No           Static HTML, documentation sites
HTTrack         Medium     Medium   No           Complete website mirroring
Selectolax      Fastest    Lowest   No           High-performance scraping
Scrapy          Very Fast  Low      No           Large-scale, production scraping
Playwright      Slow       High     Yes          SPAs, dynamic content

Available Backends

BeautifulSoup4 (Default)

BeautifulSoup4 is the default backend, ideal for scraping static HTML websites. It provides an excellent balance of ease of use and functionality.

Pros:

  • Easy to use and lightweight
  • Fast for simple websites
  • Excellent HTML parsing capabilities
  • Good encoding detection
  • Mature and well-documented

Cons:

  • No JavaScript support
  • Basic crawling capabilities
  • Single-threaded by default

Installation:

# Included with m1f by default
pip install beautifulsoup4 lxml

Usage Examples:

# Default backend (no need to specify)
m1f-scrape https://example.com -o output/

# Explicitly specify BeautifulSoup
m1f-scrape https://example.com -o output/ --scraper beautifulsoup

# With custom options for documentation sites
m1f-scrape https://docs.example.com -o docs/ \
  --scraper beautifulsoup \
  --max-depth 10 \
  --max-pages 500 \
  --request-delay 0.5

Configuration:

# beautifulsoup-config.yaml
parser: "lxml"          # Options: "html.parser", "lxml", "html5lib"
features: "lxml"        # Parser features
encoding: "auto"        # Or specific encoding like "utf-8"
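
To apply a custom configuration file, pass it with --scraper-config (the filename here is illustrative):

m1f-scrape https://docs.example.com -o docs/ \
  --scraper beautifulsoup \
  --scraper-config beautifulsoup-config.yaml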

HTTrack

HTTrack is a professional website copier that creates complete offline mirrors, preserving the original site structure.

Pros:

  • Complete website mirroring
  • Preserves directory structure
  • Resume interrupted downloads
  • Automatic robots.txt compliance
  • Handles complex websites well

Cons:

  • Requires system installation
  • Less flexible for custom parsing
  • Larger resource footprint

Installation:

# Ubuntu/Debian
sudo apt-get install httrack

# macOS
brew install httrack

# Windows
# Download installer from https://www.httrack.com/

Usage Examples:

# Basic mirroring
m1f-scrape https://example.com -o mirror/ --scraper httrack

# Complete backup with custom settings
m1f-scrape https://example.com -o backup/ \
  --scraper httrack \
  --max-depth 10 \
  --concurrent-requests 8 \
  --scraper-config httrack.yaml

Configuration:

# httrack-config.yaml
mirror_options:
  - "--assume-insecure"  # For HTTPS issues
  - "--robots=3"         # Strict robots.txt compliance
  - "--continue"         # Resume interrupted downloads
extra_filters:
  - "+*.css"
  - "+*.js"
  - "-*.zip"
  - "-*.exe"

Selectolax (httpx + selectolax)

The selectolax backend is the fastest option, pairing httpx for modern HTTP/2 networking with selectolax for C-based HTML parsing.

Pros:

  • Blazing fast performance
  • Minimal memory footprint
  • Modern async HTTP/2 support
  • Excellent for large-scale simple scraping

Cons:

  • No JavaScript support
  • Limited parsing features vs BeautifulSoup
  • Less mature ecosystem

Installation:

pip install httpx selectolax

Usage Examples:

# High-performance scraping
m1f-scrape https://example.com -o output/ --scraper selectolax

# Large-scale scraping with high concurrency
m1f-scrape https://example.com -o output/ \
  --scraper selectolax \
  --concurrent-requests 50 \
  --request-delay 0.1 \
  --max-pages 10000

# Using httpx alias
m1f-scrape https://example.com -o output/ --scraper httpx

Configuration:

# selectolax-config.yaml
http_config:
  timeout: 30
  http2: true
  max_redirects: 10
parser_config:
  strict: false
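
As with the other backends, a configuration file is supplied via --scraper-config (the filename here is illustrative):

m1f-scrape https://example.com -o output/ \
  --scraper selectolax \
  --scraper-config selectolax-config.yaml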

Scrapy

Scrapy is an industrial-strength web scraping framework with advanced features for production use.

Pros:

  • Battle-tested in production
  • Built-in retry logic and error handling
  • Auto-throttle based on server response
  • Distributed crawling support
  • Advanced caching and queuing
  • Extensive middleware system

Cons:

  • Steeper learning curve
  • Heavier than simple scrapers
  • Twisted-based async model

Installation:

pip install scrapy

Usage Examples:

# Enterprise scraping
m1f-scrape https://example.com -o output/ --scraper scrapy

# With auto-throttle and custom settings
m1f-scrape https://example.com -o output/ \
  --scraper scrapy \
  --scraper-config scrapy.yaml \
  --max-pages 10000

# Large-scale crawling
m1f-scrape https://example.com -o output/ \
  --scraper scrapy \
  --concurrent-requests 16 \
  --max-depth 20

Configuration:

# scrapy-config.yaml
settings:
  AUTOTHROTTLE_ENABLED: true
  AUTOTHROTTLE_START_DELAY: 0.5
  AUTOTHROTTLE_MAX_DELAY: 10
  AUTOTHROTTLE_TARGET_CONCURRENCY: 8.0
  ROBOTSTXT_OBEY: true
  HTTPCACHE_ENABLED: true
  RETRY_TIMES: 3
  DOWNLOAD_TIMEOUT: 30
middlewares:
  - "scrapy.downloadermiddlewares.useragent.UserAgentMiddleware"
  - "scrapy.downloadermiddlewares.retry.RetryMiddleware"

Playwright

Playwright provides full browser automation for JavaScript-heavy websites and single-page applications (SPAs).

Pros:

  • Full JavaScript execution
  • Handles SPAs and dynamic content
  • Multiple browser engines (Chromium, Firefox, WebKit)
  • Screenshot and PDF generation
  • Mobile device emulation
  • Network interception capabilities

Cons:

  • High resource usage
  • Significantly slower than HTML-only scrapers
  • Requires browser installation

Installation:

pip install playwright
playwright install  # Install browser binaries
playwright install-deps  # Install system dependencies (Linux)

Usage Examples:

# Basic JavaScript site scraping
m1f-scrape https://spa.example.com -o output/ --scraper playwright

# With custom browser settings
m1f-scrape https://example.com -o output/ \
  --scraper playwright \
  --scraper-config playwright.yaml \
  --request-delay 2.0

# Mobile emulation
m1f-scrape https://m.example.com -o mobile/ \
  --scraper playwright \
  --concurrent-requests 1

Configuration:

# playwright-config.yaml
browser: "chromium"  # Options: chromium, firefox, webkit
launch_options:
  headless: true
  slow_mo: 0
context_options:
  viewport:
    width: 1920
    height: 1080
  user_agent: "Mozilla/5.0..."
  locale: "en-US"
page_options:
  wait_until: "networkidle"  # or "load", "domcontentloaded"
  timeout: 30000

Command Line Options

Common options available for all scrapers:

--scraper BACKEND           # Choose backend (beautifulsoup, httrack, selectolax, scrapy, playwright)
--max-depth N               # Maximum crawl depth (default: 5)
--max-pages N               # Maximum pages to crawl (default: 1000)
--request-delay SECONDS     # Delay between requests (default: 15.0)
--concurrent-requests N     # Number of concurrent requests (default: 2)
--user-agent STRING         # Custom user agent
--scraper-config PATH       # Path to backend-specific config file
--list-files                # List all downloaded files after completion
-v, --verbose               # Enable verbose output
-q, --quiet                 # Suppress output except errors

Note: robots.txt is always respected and cannot be disabled.
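
For example, a polite documentation crawl might combine several of these options (all values are illustrative):

m1f-scrape https://docs.example.com -o docs/ \
  --max-depth 8 \
  --max-pages 2000 \
  --request-delay 1.0 \
  --concurrent-requests 2 \
  --user-agent "m1f-scrape (docs mirror)" \
  --verbose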

Use Case Recommendations

Static Documentation Sites

For technical documentation, API references, and wikis:

m1f-scrape https://docs.example.com -o docs/ \
  --scraper beautifulsoup \
  --max-depth 10 \
  --request-delay 0.2

Complete Website Backup

For creating offline mirrors and backups:

m1f-scrape https://example.com -o backup/ \
  --scraper httrack \
  --max-pages 10000

High-Performance Scraping

For large-scale data extraction from simple HTML:

m1f-scrape https://catalog.example.com -o data/ \
  --scraper selectolax \
  --concurrent-requests 50 \
  --max-pages 100000

Enterprise Data Collection

For production scraping with reliability requirements:

m1f-scrape https://data.example.com -o output/ \
  --scraper scrapy \
  --scraper-config production-scrapy.yaml

JavaScript-Heavy Applications

For SPAs, React/Vue/Angular sites, and dynamic content:

m1f-scrape https://app.example.com -o app-docs/ \
  --scraper playwright \
  --request-delay 3.0 \
  --concurrent-requests 1

Performance Tuning

Optimizing BeautifulSoup

# Use lxml parser for better performance
parser: "lxml"
# Select the fastest available HTML tree builder
features: ["html", "fast"]

Optimizing HTTrack

# Increase connections for faster mirroring
--concurrent-requests 8
# Exclude unnecessary files
--scraper-config httrack-optimized.yaml

Optimizing Selectolax

# Maximum concurrency for static sites
--concurrent-requests 100
--request-delay 0.05

Optimizing Scrapy

# Raise concurrency limits (keys go under the settings: block of the Scrapy config)
settings:
  CONCURRENT_REQUESTS: 32
  CONCURRENT_REQUESTS_PER_DOMAIN: 16
  # Cache responses to speed up repeat crawls
  HTTPCACHE_ENABLED: true

Optimizing Playwright

# Reduce page weight for faster loading
context_options:
  extra_http_headers:
    "Accept": "text/html"
launch_options:
  args: ["--blink-settings=imagesEnabled=false"]  # Chromium: skip image loading

Troubleshooting

Encoding Issues

# Force UTF-8 encoding for BeautifulSoup
encoding: "utf-8"

SSL Certificate Problems

# HTTrack: ignore SSL errors (use with caution)
mirror_options: ["--assume-insecure"]

JavaScript Not Loading

# Playwright: increase wait time
page_options:
  wait_until: "networkidle"
  timeout: 60000

Rate Limiting

# Increase delay between requests
--request-delay 2.0
--concurrent-requests 1

API Usage

For programmatic access:

import asyncio
from tools.html2md.scrapers import create_scraper, ScraperConfig

async def scrape_with_backend():
    config = ScraperConfig(
        max_depth=5,
        max_pages=100,
        request_delay=0.5
    )
    
    # Create scraper instance
    scraper = create_scraper('playwright', config)
    
    # Scrape entire site
    async with scraper:
        async for page in scraper.scrape_site('https://example.com'):
            print(f"Scraped: {page.url}")
            # Process page content
            
asyncio.run(scrape_with_backend())