Web Scraper Backends Guide

Comparison and configuration of different web scraping backends

The m1f suite includes powerful web scraping capabilities through the html2md and scrape tools. This guide helps you choose and configure the right scraping backend for your specific needs.

Overview

The m1f scraper backend system gives you the flexibility to choose the most appropriate tool for your web scraping needs; every backend is selected with the same --scraper flag (see the example after this list):

  • Static websites: BeautifulSoup4 (default) - Fast and lightweight
  • Complete mirroring: HTTrack - Professional website copying
  • Performance-critical: httpx + selectolax - Blazing fast parsing
  • Enterprise scraping: Scrapy - Industrial-strength framework
  • JavaScript-heavy sites: Playwright - Full browser automation
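
For example (URLs and output paths are placeholders):

# Default parser for static HTML
m1f-scrape https://example.com -o output/ --scraper beautifulsoup

# Full browser automation for JavaScript-heavy sites
m1f-scrape https://example.com -o output/ --scraper playwright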

Backend Comparison

Backend         Speed      Memory   JavaScript   Best For
BeautifulSoup   Fast       Low      No           Static HTML, documentation sites
HTTrack         Medium     Medium   No           Complete website mirroring
Selectolax      Fastest    Lowest   No           High-performance scraping
Scrapy          Very Fast  Low      No           Large-scale, production scraping
Playwright      Slow       High     Yes          SPAs, dynamic content

Available Backends

BeautifulSoup4 (Default)

BeautifulSoup4 is the default backend, ideal for scraping static HTML websites. It provides an excellent balance of ease of use and functionality.

Pros:

  • Easy to use and lightweight
  • Fast for simple websites
  • Excellent HTML parsing capabilities
  • Good encoding detection
  • Mature and well-documented

Cons:

  • No JavaScript support
  • Basic crawling capabilities
  • Single-threaded by default

Installation:

# Included with m1f by default
pip install beautifulsoup4 lxml

Usage Examples:

# Default backend (no need to specify)
m1f-scrape https://example.com -o output/

# Explicitly specify BeautifulSoup
m1f-scrape https://example.com -o output/ --scraper beautifulsoup

# With custom options for documentation sites
m1f-scrape https://docs.example.com -o docs/ \
  --scraper beautifulsoup \
  --max-depth 10 \
  --max-pages 500 \
  --request-delay 0.5

Configuration:

# beautifulsoup-config.yaml
parser: "lxml"          # Options: "html.parser", "lxml", "html5lib"
features: "lxml"        # Parser features
encoding: "auto"        # Or specific encoding like "utf-8"
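
To apply a custom configuration file, pass it with --scraper-config (the filename here is illustrative):

m1f-scrape https://docs.example.com -o docs/ \
  --scraper beautifulsoup \
  --scraper-config beautifulsoup-config.yaml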

HTTrack

HTTrack is a professional website copier that creates complete offline mirrors, preserving the original site structure.

Pros:

  • Complete website mirroring
  • Preserves directory structure
  • Resume interrupted downloads
  • Automatic robots.txt compliance
  • Handles complex websites well

Cons:

  • Requires system installation
  • Less flexible for custom parsing
  • Larger resource footprint

Installation:

# Ubuntu/Debian
sudo apt-get install httrack

# macOS
brew install httrack

# Windows
# Download installer from https://www.httrack.com/

Usage Examples:

# Basic mirroring
m1f-scrape https://example.com -o mirror/ --scraper httrack

# Complete backup with custom settings
m1f-scrape https://example.com -o backup/ \
  --scraper httrack \
  --max-depth 10 \
  --concurrent-requests 8 \
  --scraper-config httrack.yaml

Configuration:

# httrack-config.yaml
mirror_options:
  - "--assume-insecure"  # For HTTPS issues
  - "--robots=3"         # Strict robots.txt compliance
  - "--continue"         # Resume interrupted downloads
extra_filters:
  - "+*.css"
  - "+*.js"
  - "-*.zip"
  - "-*.exe"

Selectolax (httpx + selectolax)

The selectolax backend is the fastest option, pairing httpx for modern HTTP/2 networking with selectolax for C-based HTML parsing.

Pros:

  • Blazing fast performance
  • Minimal memory footprint
  • Modern async HTTP/2 support
  • Excellent for large-scale simple scraping

Cons:

  • No JavaScript support
  • Limited parsing features vs BeautifulSoup
  • Less mature ecosystem

Installation:

pip install httpx selectolax

Usage Examples:

# High-performance scraping
m1f-scrape https://example.com -o output/ --scraper selectolax

# Large-scale scraping with high concurrency
m1f-scrape https://example.com -o output/ \
  --scraper selectolax \
  --concurrent-requests 50 \
  --request-delay 0.1 \
  --max-pages 10000

# Using httpx alias
m1f-scrape https://example.com -o output/ --scraper httpx

Configuration:

# selectolax-config.yaml
http_config:
  timeout: 30
  http2: true
  max_redirects: 10
parser_config:
  strict: false
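
As with the other backends, a configuration file is supplied via --scraper-config (the filename here is illustrative):

m1f-scrape https://example.com -o output/ \
  --scraper selectolax \
  --scraper-config selectolax-config.yaml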

Scrapy

Scrapy is an industrial-strength web scraping framework with advanced features for production use.

Pros:

  • Battle-tested in production
  • Built-in retry logic and error handling
  • Auto-throttle based on server response
  • Distributed crawling support
  • Advanced caching and queuing
  • Extensive middleware system

Cons:

  • Steeper learning curve
  • Heavier than simple scrapers
  • Twisted-based async model

Installation:

pip install scrapy

Usage Examples:

# Enterprise scraping
m1f-scrape https://example.com -o output/ --scraper scrapy

# With auto-throttle and custom settings
m1f-scrape https://example.com -o output/ \
  --scraper scrapy \
  --scraper-config scrapy.yaml \
  --max-pages 10000

# Large-scale crawling
m1f-scrape https://example.com -o output/ \
  --scraper scrapy \
  --concurrent-requests 16 \
  --max-depth 20

Configuration:

# scrapy-config.yaml
settings:
  AUTOTHROTTLE_ENABLED: true
  AUTOTHROTTLE_START_DELAY: 0.5
  AUTOTHROTTLE_MAX_DELAY: 10
  AUTOTHROTTLE_TARGET_CONCURRENCY: 8.0
  ROBOTSTXT_OBEY: true
  HTTPCACHE_ENABLED: true
  RETRY_TIMES: 3
  DOWNLOAD_TIMEOUT: 30
middlewares:
  - "scrapy.downloadermiddlewares.useragent.UserAgentMiddleware"
  - "scrapy.downloadermiddlewares.retry.RetryMiddleware"

Playwright

Playwright provides full browser automation for JavaScript-heavy websites and single-page applications (SPAs).

Pros:

  • Full JavaScript execution
  • Handles SPAs and dynamic content
  • Multiple browser engines (Chromium, Firefox, WebKit)
  • Screenshot and PDF generation
  • Mobile device emulation
  • Network interception capabilities

Cons:

  • High resource usage
  • Significantly slower than HTML-only scrapers
  • Requires browser installation

Installation:

pip install playwright
playwright install  # Install browser binaries
playwright install-deps  # Install system dependencies (Linux)

Usage Examples:

# Basic JavaScript site scraping
m1f-scrape https://spa.example.com -o output/ --scraper playwright

# With custom browser settings
m1f-scrape https://example.com -o output/ \
  --scraper playwright \
  --scraper-config playwright.yaml \
  --request-delay 2.0

# Mobile emulation
m1f-scrape https://m.example.com -o mobile/ \
  --scraper playwright \
  --concurrent-requests 1

Configuration:

# playwright-config.yaml
browser: "chromium"  # Options: chromium, firefox, webkit
launch_options:
  headless: true
  slow_mo: 0
context_options:
  viewport:
    width: 1920
    height: 1080
  user_agent: "Mozilla/5.0..."
  locale: "en-US"
page_options:
  wait_until: "networkidle"  # or "load", "domcontentloaded"
  timeout: 30000

Command Line Options

Common options available for all scrapers:

--scraper BACKEND           # Choose backend (beautifulsoup, httrack, selectolax, scrapy, playwright)
--max-depth N               # Maximum crawl depth (default: 5)
--max-pages N               # Maximum pages to crawl (default: 1000)
--request-delay SECONDS     # Delay between requests (default: 15.0)
--concurrent-requests N     # Number of concurrent requests (default: 2)
--user-agent STRING         # Custom user agent
--scraper-config PATH       # Path to backend-specific config file
--list-files                # List all downloaded files after completion
-v, --verbose               # Enable verbose output
-q, --quiet                 # Suppress output except errors

Note: robots.txt is always respected and cannot be disabled.
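
For example, a polite documentation crawl might combine several of these options (all values are illustrative):

m1f-scrape https://docs.example.com -o docs/ \
  --max-depth 8 \
  --max-pages 2000 \
  --request-delay 1.0 \
  --concurrent-requests 2 \
  --user-agent "m1f-scrape (docs mirror)" \
  --verbose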

Use Case Recommendations

Static Documentation Sites

For technical documentation, API references, and wikis:

m1f-scrape https://docs.example.com -o docs/ \
  --scraper beautifulsoup \
  --max-depth 10 \
  --request-delay 0.2

Complete Website Backup

For creating offline mirrors and backups:

m1f-scrape https://example.com -o backup/ \
  --scraper httrack \
  --max-pages 10000

High-Performance Scraping

For large-scale data extraction from simple HTML:

m1f-scrape https://catalog.example.com -o data/ \
  --scraper selectolax \
  --concurrent-requests 50 \
  --max-pages 100000

Enterprise Data Collection

For production scraping with reliability requirements:

m1f-scrape https://data.example.com -o output/ \
  --scraper scrapy \
  --scraper-config production-scrapy.yaml

JavaScript-Heavy Applications

For SPAs, React/Vue/Angular sites, and dynamic content:

m1f-scrape https://app.example.com -o app-docs/ \
  --scraper playwright \
  --request-delay 3.0 \
  --concurrent-requests 1

Performance Tuning

Optimizing BeautifulSoup

# Use lxml parser for better performance
parser: "lxml"
# Select the fastest available HTML tree builder
features: ["html", "fast"]

Optimizing HTTrack

# Increase connections for faster mirroring
--concurrent-requests 8
# Exclude unnecessary files
--scraper-config httrack-optimized.yaml

Optimizing Selectolax

# Maximum concurrency for static sites
--concurrent-requests 100
--request-delay 0.05

Optimizing Scrapy

# Raise concurrency limits (keys go under the settings: block of the Scrapy config)
settings:
  CONCURRENT_REQUESTS: 32
  CONCURRENT_REQUESTS_PER_DOMAIN: 16
  # Cache responses to speed up repeat crawls
  HTTPCACHE_ENABLED: true

Optimizing Playwright

# Reduce page weight for faster loading
context_options:
  extra_http_headers:
    "Accept": "text/html"
launch_options:
  args: ["--blink-settings=imagesEnabled=false"]  # Chromium: skip image loading

Troubleshooting

Encoding Issues

# Force UTF-8 encoding for BeautifulSoup
encoding: "utf-8"

SSL Certificate Problems

# HTTrack: ignore SSL errors (use with caution)
mirror_options: ["--assume-insecure"]

JavaScript Not Loading

# Playwright: increase wait time
page_options:
  wait_until: "networkidle"
  timeout: 60000

Rate Limiting

# Increase delay between requests
--request-delay 2.0
--concurrent-requests 1

API Usage

For programmatic access:

import asyncio
from tools.html2md.scrapers import create_scraper, ScraperConfig

async def scrape_with_backend():
    config = ScraperConfig(
        max_depth=5,
        max_pages=100,
        request_delay=0.5
    )
    
    # Create scraper instance
    scraper = create_scraper('playwright', config)
    
    # Scrape entire site
    async with scraper:
        async for page in scraper.scrape_site('https://example.com'):
            print(f"Scraped: {page.url}")
            # Process page content
            
asyncio.run(scrape_with_backend())