Web Scraper Backends Guide
Comparison and configuration of different web scraping backends
The m1f suite includes powerful web scraping capabilities through the html2md and scrape tools. This guide helps you choose and configure the right scraping backend for your specific needs.
Overview
The m1f scraper backend system lets you pick the most appropriate tool for each job:
- Static websites: BeautifulSoup4 (default) - Fast and lightweight
- Complete mirroring: HTTrack - Professional website copying
- Performance-critical: httpx + selectolax - Blazing fast parsing
- Enterprise scraping: Scrapy - Industrial-strength framework
- JavaScript-heavy sites: Playwright - Full browser automation
Backend Comparison
| Backend       | Speed     | Memory | JavaScript | Best For                          |
| ------------- | --------- | ------ | ---------- | --------------------------------- |
| BeautifulSoup | Fast      | Low    | ❌         | Static HTML, documentation sites  |
| HTTrack       | Medium    | Medium | ❌         | Complete website mirroring        |
| Selectolax    | Fastest   | Lowest | ❌         | High-performance scraping         |
| Scrapy        | Very Fast | Low    | ❌         | Large-scale, production scraping  |
| Playwright    | Slow      | High   | ✅         | SPAs, dynamic content             |
Available Backends
BeautifulSoup4 (Default)
BeautifulSoup4 is the default backend, ideal for scraping static HTML websites. It provides an excellent balance of ease of use and functionality.
Pros:
- Easy to use and lightweight
- Fast for simple websites
- Excellent HTML parsing capabilities
- Good encoding detection
- Mature and well-documented
Cons:
- No JavaScript support
- Basic crawling capabilities
- Single-threaded by default
Installation:
# Included with m1f by default; manual installation is only needed for standalone use
pip install beautifulsoup4 lxml
Usage Examples:
# Default backend (no need to specify)
m1f-scrape https://example.com -o output/
# Explicitly specify BeautifulSoup
m1f-scrape https://example.com -o output/ --scraper beautifulsoup
# With custom options for documentation sites
m1f-scrape https://docs.example.com -o docs/ \
--scraper beautifulsoup \
--max-depth 10 \
--max-pages 500 \
--request-delay 0.5
Configuration:
# beautifulsoup-config.yaml
parser: "lxml" # Options: "html.parser", "lxml", "html5lib"
features: "lxml" # Parser features
encoding: "auto" # Or specific encoding like "utf-8"
HTTrack
HTTrack is a professional website copier that creates complete offline mirrors, preserving the original site structure.
Pros:
- Complete website mirroring
- Preserves directory structure
- Resumes interrupted downloads
- Automatic robots.txt compliance
- Handles complex websites well
Cons:
- Requires system installation
- Less flexible for custom parsing
- Larger resource footprint
Installation:
# Ubuntu/Debian
sudo apt-get install httrack
# macOS
brew install httrack
# Windows
# Download installer from https://www.httrack.com/
Usage Examples:
# Basic mirroring
m1f-scrape https://example.com -o mirror/ --scraper httrack
# Complete backup with custom settings
m1f-scrape https://example.com -o backup/ \
--scraper httrack \
--max-depth 10 \
--concurrent-requests 8 \
--scraper-config httrack.yaml
Configuration:
# httrack-config.yaml
mirror_options:
- "--assume-insecure" # For HTTPS issues
- "--robots=3" # Strict robots.txt compliance
- "--continue" # Resume interrupted downloads
extra_filters:
- "+*.css"
- "+*.js"
- "-*.zip"
- "-*.exe"
Selectolax (httpx + selectolax)
The fastest backend, combining httpx for modern async HTTP/2 networking with selectolax for C-based HTML parsing.
Pros:
- Blazing fast performance
- Minimal memory footprint
- Modern async HTTP/2 support
- Excellent for large-scale simple scraping
Cons:
- No JavaScript support
- Fewer parsing features than BeautifulSoup
- Less mature ecosystem
Installation:
pip install httpx selectolax
Usage Examples:
# High-performance scraping
m1f-scrape https://example.com -o output/ --scraper selectolax
# Large-scale scraping with high concurrency
m1f-scrape https://example.com -o output/ \
--scraper selectolax \
--concurrent-requests 50 \
--request-delay 0.1 \
--max-pages 10000
# Using httpx alias
m1f-scrape https://example.com -o output/ --scraper httpx
Configuration:
# selectolax-config.yaml
http_config:
  timeout: 30
  http2: true
  max_redirects: 10
parser_config:
  strict: false
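The two libraries compose in a few lines. This sketch mirrors the http_config above (http2: true requires the httpx[http2] extra):

import httpx
from selectolax.parser import HTMLParser

# Fetch one page over HTTP/2, then parse it with the C-based parser
with httpx.Client(http2=True, timeout=30, follow_redirects=True) as client:
    resp = client.get("https://example.com")

tree = HTMLParser(resp.text)
print(tree.css_first("title").text())  # page title
for node in tree.css("a[href]"):       # outgoing links
    print(node.attributes["href"])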
Scrapy
Scrapy is an industrial-strength web scraping framework with advanced features for production use.
Pros:
- Battle-tested in production
- Built-in retry logic and error handling
- Auto-throttle based on server response
- Distributed crawling support
- Advanced caching and queuing
- Extensive middleware system
Cons:
- Steeper learning curve
- Heavier than simple scrapers
- Twisted-based async model
Installation:
pip install scrapy
Usage Examples:
# Enterprise scraping
m1f-scrape https://example.com -o output/ --scraper scrapy
# With auto-throttle and custom settings
m1f-scrape https://example.com -o output/ \
--scraper scrapy \
--scraper-config scrapy.yaml \
--max-pages 10000
# Large-scale crawling
m1f-scrape https://example.com -o output/ \
--scraper scrapy \
--concurrent-requests 16 \
--max-depth 20
Configuration:
# scrapy-config.yaml
settings:
  AUTOTHROTTLE_ENABLED: true
  AUTOTHROTTLE_START_DELAY: 0.5
  AUTOTHROTTLE_MAX_DELAY: 10
  AUTOTHROTTLE_TARGET_CONCURRENCY: 8.0
  ROBOTSTXT_OBEY: true
  HTTPCACHE_ENABLED: true
  RETRY_TIMES: 3
  DOWNLOAD_TIMEOUT: 30
middlewares:
  - "scrapy.downloadermiddlewares.useragent.UserAgentMiddleware"
  - "scrapy.downloadermiddlewares.retry.RetryMiddleware"
Playwright
Playwright provides full browser automation for JavaScript-heavy websites and single-page applications (SPAs).
Pros:
- Full JavaScript execution
- Handles SPAs and dynamic content
- Multiple browser engines (Chromium, Firefox, WebKit)
- Screenshot and PDF generation
- Mobile device emulation
- Network interception capabilities
Cons:
- High resource usage
- Significantly slower than HTML-only scrapers
- Requires browser installation
Installation:
pip install playwright
playwright install # Install browser binaries
playwright install-deps # Install system dependencies (Linux)
Usage Examples:
# Basic JavaScript site scraping
m1f-scrape https://spa.example.com -o output/ --scraper playwright
# With custom browser settings
m1f-scrape https://example.com -o output/ \
--scraper playwright \
--scraper-config playwright.yaml \
--request-delay 2.0
# Mobile emulation
m1f-scrape https://m.example.com -o mobile/ \
--scraper playwright \
--concurrent-requests 1
Configuration:
# playwright-config.yaml
browser: "chromium"  # Options: chromium, firefox, webkit
launch_options:
  headless: true
  slow_mo: 0
context_options:
  viewport:
    width: 1920
    height: 1080
  user_agent: "Mozilla/5.0..."
  locale: "en-US"
page_options:
  wait_until: "networkidle"  # or "load", "domcontentloaded"
  timeout: 30000
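These options map directly onto Playwright's Python API. A minimal sketch of a single fetch using the same settings:

import asyncio
from playwright.async_api import async_playwright

async def fetch(url: str) -> str:
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        context = await browser.new_context(
            viewport={"width": 1920, "height": 1080}, locale="en-US"
        )
        page = await context.new_page()
        # "networkidle" and the 30s timeout match page_options above
        await page.goto(url, wait_until="networkidle", timeout=30_000)
        html = await page.content()
        await browser.close()
        return html

print(asyncio.run(fetch("https://example.com"))[:200])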
Command Line Options
Common options available for all scrapers:
--scraper BACKEND # Choose backend (beautifulsoup, httrack, selectolax, scrapy, playwright)
--max-depth N # Maximum crawl depth (default: 5)
--max-pages N # Maximum pages to crawl (default: 1000)
--request-delay SECONDS # Delay between requests (default: 15.0)
--concurrent-requests N # Number of concurrent requests (default: 2)
--user-agent STRING # Custom user agent
--scraper-config PATH # Path to backend-specific config file
--list-files # List all downloaded files after completion
-v, --verbose # Enable verbose output
-q, --quiet # Suppress output except errors
Note: robots.txt is always respected and cannot be disabled.
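To see in advance which URLs a crawl will skip, you can run a check equivalent to the one the scraper performs, using Python's standard library:

from urllib.robotparser import RobotFileParser

rp = RobotFileParser("https://example.com/robots.txt")
rp.read()
# URLs disallowed by robots.txt are skipped during the crawl
print(rp.can_fetch("*", "https://example.com/private/"))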
Use Case Recommendations
Static Documentation Sites
For technical documentation, API references, and wikis:
m1f-scrape https://docs.example.com -o docs/ \
--scraper beautifulsoup \
--max-depth 10 \
--request-delay 0.2
Complete Website Backup
For creating offline mirrors and backups:
m1f-scrape https://example.com -o backup/ \
--scraper httrack \
--max-pages 10000
High-Performance Scraping
For large-scale data extraction from simple HTML:
m1f-scrape https://catalog.example.com -o data/ \
--scraper selectolax \
--concurrent-requests 50 \
--max-pages 100000
Enterprise Data Collection
For production scraping with reliability requirements:
m1f-scrape https://data.example.com -o output/ \
--scraper scrapy \
--scraper-config production-scrapy.yaml
JavaScript-Heavy Applications
For SPAs, React/Vue/Angular sites, and dynamic content:
m1f-scrape https://app.example.com -o app-docs/ \
--scraper playwright \
--request-delay 3.0 \
--concurrent-requests 1
Performance Tuning
Optimizing BeautifulSoup
# Use lxml parser for better performance
parser: "lxml"
# Feature-based parser lookup: ["html", "fast"] resolves to the fastest installed HTML parser
features: ["html", "fast"]
Optimizing HTTrack
# Increase connections for faster mirroring
--concurrent-requests 8
# Exclude unnecessary files
--scraper-config httrack-optimized.yaml
Optimizing Selectolax
# Maximum concurrency for static sites
--concurrent-requests 100
--request-delay 0.05
Optimizing Scrapy
# Enable concurrent requests
CONCURRENT_REQUESTS: 32
CONCURRENT_REQUESTS_PER_DOMAIN: 16
# Use HTTP caching
HTTPCACHE_ENABLED: true
Optimizing Playwright
# Skip image loading for faster page fetches (Chromium only)
launch_options:
  args: ["--blink-settings=imagesEnabled=false"]
Troubleshooting
Encoding Issues
# Force UTF-8 encoding for BeautifulSoup
encoding: "utf-8"
SSL Certificate Problems
# HTTrack: ignore SSL errors (use with caution)
mirror_options: ["--assume-insecure"]
JavaScript Not Loading
# Playwright: increase wait time
page_options:
  wait_until: "networkidle"
  timeout: 60000
Rate Limiting
# Increase delay between requests
--request-delay 2.0
--concurrent-requests 1
API Usage
For programmatic access:
import asyncio
from tools.html2md.scrapers import create_scraper, ScraperConfig

async def scrape_with_backend():
    config = ScraperConfig(
        max_depth=5,
        max_pages=100,
        request_delay=0.5,
    )

    # Create scraper instance
    scraper = create_scraper('playwright', config)

    # Scrape entire site
    async with scraper:
        async for page in scraper.scrape_site('https://example.com'):
            print(f"Scraped: {page.url}")
            # Process page content

asyncio.run(scrape_with_backend())
Related Documentation
- html2md Tool Reference - Convert HTML to Markdown
- scrape Tool Reference - Web scraping command
- Web Scraping Guide - General scraping guide
- Markdown Conversion Guide - Working with converted content