Tools

scrape - Web Scraper

Download websites with multiple backend options, intelligent crawling, and duplicate prevention

A modern web scraping tool for downloading websites with multiple backend options, async I/O, and intelligent crawling capabilities.

Overview

The scrape tool provides a robust solution for downloading websites for offline viewing and analysis. Built with Python 3.10+ and modern async architecture, it features pluggable scraper backends for different use cases.

Primary Use Case: Download online documentation to make it available to LLMs (like Claude) for analysis and reference. The downloaded HTML files can be converted to Markdown with html2md, then bundled into a single file with m1f for optimal LLM context usage.

Key Features

  • Multiple Scraper Backends: Choose from BeautifulSoup (default), HTTrack, Scrapy, Playwright, or Selectolax
  • Async I/O: High-performance concurrent downloading
  • Intelligent Crawling: Automatically respects robots.txt, follows redirects, handles encoding
  • Duplicate Prevention: Three-layer deduplication system:
    • Canonical URL checking (enabled by default)
    • Content-based deduplication (enabled by default)
    • GET parameter normalization (optional, enabled with --ignore-get-params; see the sketch after this list)
  • Metadata Preservation: Saves HTTP headers and metadata alongside HTML files
  • Domain Restriction: Automatically restricts crawling to the starting domain
  • Subdirectory Restriction: When the URL contains a path, crawling is limited to that subdirectory
  • Rate Limiting: Configurable delays between requests
  • Progress Tracking: Real-time download progress with file listing
  • Resume Support: Interrupt and resume scraping sessions with SQLite tracking
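
The GET-parameter layer of the deduplication system above is the simplest to picture: with --ignore-get-params, URLs that differ only in their query string are treated as the same page. A minimal Python sketch of that normalization, for illustration only (not the tool's actual code):

from urllib.parse import urlsplit, urlunsplit

def strip_get_params(url: str) -> str:
    """Drop the query string and fragment so URL variants collapse to one key."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))

# https://example.com/docs?page=2 normalizes to https://example.com/docs
assert strip_get_params("https://example.com/docs?page=2") == "https://example.com/docs"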

Quick Start

# Basic website download
m1f-scrape https://example.com -o ./downloaded_html

# Download with specific depth and page limits
m1f-scrape https://example.com -o ./html \
  --max-pages 50 \
  --max-depth 3

# Use different scraper backend
m1f-scrape https://example.com -o ./html --scraper httrack

# List downloaded files after completion
m1f-scrape https://example.com -o ./html --list-files

# Resume interrupted scraping (with verbose mode to see progress)
m1f-scrape https://example.com -o ./html -v

Command Line Interface

m1f-scrape <url> -o <output> [options]

Required Arguments

Option          Description
url             URL to start scraping from
-o, --output    Output directory

Optional Arguments

Option                    Description (default in parentheses)
--scraper                 Scraper backend: httrack, beautifulsoup, bs4, selectolax, httpx, scrapy, playwright (default: beautifulsoup)
--scraper-config          Path to scraper-specific config file, YAML or JSON (default: None)
--max-depth               Maximum crawl depth (default: 5)
--max-pages               Maximum pages to crawl (default: 1000)
--request-delay           Delay between requests in seconds (default: 15.0)
--concurrent-requests     Number of concurrent requests (default: 2)
--user-agent              Custom user agent string (default: Mozilla/5.0)
--ignore-get-params       Ignore GET parameters in URLs (default: False)
--ignore-canonical        Ignore canonical URL tags (default: False)
--ignore-duplicates       Ignore duplicate content detection (default: False)
--list-files              List all downloaded files after completion (default: False)
-v, --verbose             Enable verbose output (default: False)
-q, --quiet               Suppress all output except errors (default: False)
--show-db-stats           Show scraping statistics from the database (default: False)
--show-errors             Show URLs that had errors during scraping (default: False)
--show-scraped-urls       List all scraped URLs from the database (default: False)
--version                 Show version information and exit

Scraper Backends

BeautifulSoup (default)

  • Best for: General purpose scraping, simple websites
  • Features: Fast HTML parsing, good encoding detection
  • Limitations: No JavaScript support
m1f-scrape https://example.com -o ./html --scraper beautifulsoup

HTTrack

  • Best for: Complete website mirroring, preserving structure
  • Features: External links handling, advanced mirroring options
  • Limitations: Requires HTTrack to be installed separately
m1f-scrape https://example.com -o ./html --scraper httrack

Scrapy

  • Best for: Large-scale crawling, complex scraping rules
  • Features: Advanced crawling settings, middleware support
  • Limitations: More complex configuration
m1f-scrape https://example.com -o ./html --scraper scrapy

Playwright

  • Best for: JavaScript-heavy sites, SPAs
  • Features: Full browser automation, JavaScript execution
  • Limitations: Slower, requires more resources
m1f-scrape https://example.com -o ./html --scraper playwright

Selectolax

  • Best for: Speed-critical applications
  • Features: Fastest HTML parsing, minimal overhead
  • Limitations: Basic feature set
m1f-scrape https://example.com -o ./html --scraper selectolax

Usage Examples

Basic Website Download

# Download a simple website
m1f-scrape https://docs.example.com -o ./docs_html

# Download with verbose output
m1f-scrape https://docs.example.com -o ./docs_html -v

Canonical URL Checking

By default, the scraper checks for canonical URLs to avoid downloading duplicate content:

# Pages with different canonical URLs are automatically skipped
m1f-scrape https://example.com -o ./html

# Ignore canonical tags if you want all page versions
m1f-scrape https://example.com -o ./html --ignore-canonical

When enabled (default), the scraper:

  • Checks the <link rel="canonical"> tag on each page
  • Skips pages where the canonical URL differs from the current URL
  • Prevents downloading duplicate content (e.g., print versions, mobile versions)
  • Logs skipped pages with their canonical URLs for transparency
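
The check itself is small; here is a rough sketch of the idea, assuming BeautifulSoup is used for parsing (illustrative only, not the tool's actual implementation):

from urllib.parse import urljoin
from bs4 import BeautifulSoup

def skip_for_canonical(html: str, page_url: str) -> bool:
    """Return True when the page declares a canonical URL that points elsewhere."""
    soup = BeautifulSoup(html, "html.parser")
    link = soup.find("link", rel="canonical")
    if link is None or not link.get("href"):
        return False  # no canonical tag, keep the page
    canonical = urljoin(page_url, link["href"]).rstrip("/")
    return canonical != page_url.rstrip("/")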

Content Deduplication

By default, the scraper detects and skips pages with duplicate content:

# Content deduplication is enabled by default
m1f-scrape https://example.com -o ./html

# Disable content deduplication if needed
m1f-scrape https://example.com -o ./html --ignore-duplicates

This feature:

  • Extracts plain text from HTML (removes all tags, scripts, styles)
  • Calculates SHA-256 checksum of the normalized text
  • Skips pages with identical text content
  • Works together with canonical URL checking for thorough deduplication
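
A rough sketch of this layer, again using BeautifulSoup for text extraction (the tool's exact normalization may differ):

import hashlib
from bs4 import BeautifulSoup

seen_checksums: set[str] = set()

def is_duplicate_content(html: str) -> bool:
    """Hash the visible text of a page and report whether it was seen before."""
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup(["script", "style"]):
        tag.decompose()  # strip non-content elements before hashing
    text = " ".join(soup.get_text().split())  # normalize whitespace
    checksum = hashlib.sha256(text.encode("utf-8")).hexdigest()
    if checksum in seen_checksums:
        return True
    seen_checksums.add(checksum)
    return False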

Subdirectory Restriction

When you specify a URL with a path, the scraper automatically restricts crawling to that subdirectory:

# Only scrape pages under /docs subdirectory
m1f-scrape https://example.com/docs -o ./docs_only

# Only scrape API documentation pages
m1f-scrape https://api.example.com/v2/reference -o ./api_docs

# This will NOT scrape /products, /blog, etc. - only /tutorials/*
m1f-scrape https://learn.example.com/tutorials -o ./tutorials_only
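
Conceptually, the restriction is a host check plus a path-prefix check on every discovered link; a minimal sketch of that logic (not the tool's actual implementation):

from urllib.parse import urlparse

def in_scope(candidate_url: str, start_url: str) -> bool:
    """Keep only URLs on the same host and under the starting path."""
    start, cand = urlparse(start_url), urlparse(candidate_url)
    if cand.netloc != start.netloc:
        return False  # domain restriction
    base_path = start.path.rstrip("/") + "/"
    return (cand.path.rstrip("/") + "/").startswith(base_path)

# /docs/intro is in scope, /blog is not
assert in_scope("https://example.com/docs/intro", "https://example.com/docs")
assert not in_scope("https://example.com/blog", "https://example.com/docs")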

Controlled Crawling

# Limit crawl depth for shallow scraping
m1f-scrape https://blog.example.com -o ./blog \
  --max-depth 2 \
  --max-pages 20

# Slow crawling to be respectful
m1f-scrape https://example.com -o ./html \
  --request-delay 2.0 \
  --concurrent-requests 2
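
Conceptually, --concurrent-requests caps parallelism while --request-delay spaces requests out; a rough asyncio sketch of that interplay, using httpx for illustration (this is not m1f-scrape's actual implementation):

import asyncio
import httpx

CONCURRENT_REQUESTS = 2
REQUEST_DELAY = 2.0  # seconds between requests per worker

async def fetch_all(urls: list[str]) -> list[str]:
    sem = asyncio.Semaphore(CONCURRENT_REQUESTS)  # cap simultaneous connections

    async def fetch(client: httpx.AsyncClient, url: str) -> str:
        async with sem:
            response = await client.get(url)
            await asyncio.sleep(REQUEST_DELAY)  # be polite between requests
            return response.text

    async with httpx.AsyncClient(follow_redirects=True) as client:
        return await asyncio.gather(*(fetch(client, u) for u in urls))

# usage: asyncio.run(fetch_all(["https://example.com/", "https://example.com/about/"]))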

Custom Configuration

# Use custom user agent
m1f-scrape https://example.com -o ./html \
  --user-agent "MyBot/1.0 (Compatible)"

# Use scraper-specific configuration
m1f-scrape https://example.com -o ./html \
  --scraper scrapy \
  --scraper-config ./scrapy-settings.yaml

Output Structure

Downloaded files are organized to mirror the website structure:

output_directory/
├── scrape_tracker.db         # SQLite database for resume functionality
└── example.com/
    ├── index.html
    ├── index.meta.json
    ├── about/
    │   ├── index.html
    │   └── index.meta.json
    ├── blog/
    │   ├── post1/
    │   │   ├── index.html
    │   │   └── index.meta.json
    │   └── post2/
    │       ├── index.html
    │       └── index.meta.json
    └── contact/
        ├── index.html
        └── index.meta.json

Metadata Files

Each HTML file has an accompanying .meta.json file containing:

{
  "url": "https://example.com/about/",
  "title": "About Us - Example",
  "encoding": "utf-8",
  "status_code": 200,
  "headers": {
    "Content-Type": "text/html; charset=utf-8",
    "Last-Modified": "2024-01-15T10:30:00Z"
  },
  "metadata": {
    "description": "Learn more about Example company",
    "og:title": "About Us",
    "canonical": "https://example.com/about/"
  }
}
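
Because every HTML file has a sibling .meta.json in the format above, a few lines of Python are enough to list what was downloaded and from where (assuming the ./downloaded_html output directory used in the Quick Start examples):

import json
from pathlib import Path

for meta_path in Path("./downloaded_html").rglob("*.meta.json"):
    meta = json.loads(meta_path.read_text(encoding="utf-8"))
    print(meta.get("status_code"), meta.get("url"))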

Integration with m1f Workflow

scrape is designed as the first step in a workflow to provide documentation to LLMs:

# Step 1: Download documentation website
m1f-scrape https://docs.example.com -o ./html_files

# Step 2: Analyze HTML structure
m1f-html2md analyze ./html_files/*.html --suggest-selectors

# Step 3: Convert to Markdown
m1f-html2md convert ./html_files -o ./markdown \
  --content-selector "main.content" \
  --ignore-selectors "nav" ".sidebar"

# Step 4: Bundle for LLM consumption
m1f -s ./markdown -o ./docs_bundle.txt \
  --remove-scraped-metadata

# Now docs_bundle.txt contains all documentation in a single file
# that can be provided to Claude or other LLMs for analysis

Complete Documentation Download Example

# Download React documentation for LLM analysis
m1f-scrape https://react.dev/learn -o ./react_docs \
  --max-pages 100 \
  --max-depth 3

# Convert to clean Markdown
m1f-html2md convert ./react_docs -o ./react_md \
  --content-selector "article" \
  --ignore-selectors "nav" "footer" ".sidebar"

# Create single file for LLM
m1f -s ./react_md -o ./react_documentation.txt

# Now you can provide react_documentation.txt to Claude

Resume Functionality

The scraper supports interrupting and resuming downloads, making it ideal for large websites or unreliable connections.

How It Works

  • SQLite Database: Creates scrape_tracker.db in the output directory to track:
    • URL of each scraped page
    • HTTP status code and target filename
    • Timestamp and error messages (if any)
  • Progress Display: Shows real-time progress in verbose mode
  • Graceful Interruption: Press Ctrl+C to interrupt cleanly

Resume Example

# Start scraping with verbose mode
m1f-scrape https://docs.example.com -o ./docs --max-pages 100 -v

# Interrupt with Ctrl+C when needed
# Resume by running the exact same command:
m1f-scrape https://docs.example.com -o ./docs --max-pages 100 -v

# You'll see:
# Resuming crawl - found 25 previously scraped URLs
# Populating queue from previously scraped pages...
# Found 187 URLs to visit after analyzing scraped pages
# Processing: https://docs.example.com/new-page (page 26)

Database Inspection

# Show scraping statistics
m1f-scrape -o docs/ --show-db-stats

# View all scraped URLs with status codes
m1f-scrape -o docs/ --show-scraped-urls

# Check for errors
m1f-scrape -o docs/ --show-errors

# Combine multiple queries
m1f-scrape -o docs/ --show-db-stats --show-errors
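
If the built-in flags are not enough, scrape_tracker.db is a regular SQLite file and can be opened directly. Its tables and columns are not documented here, so the sketch below only prints the actual schema; adapt any further queries to what it reports:

import sqlite3

con = sqlite3.connect("docs/scrape_tracker.db")
for name, sql in con.execute("SELECT name, sql FROM sqlite_master WHERE type = 'table'"):
    print(name)
    print(sql)
con.close()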

Best Practices

  1. Respect robots.txt: The tool automatically respects robots.txt files
  2. Use appropriate delays: Set --request-delay to avoid overwhelming servers (default: 15 seconds)
  3. Limit concurrent requests: Use --concurrent-requests responsibly (default: 2 connections)
  4. Test with small crawls: Start with --max-pages 10 to test your settings
  5. Check output: Use --list-files to verify what was downloaded
  6. Use verbose mode: Add -v flag to see progress and resume information
  7. Keep commands consistent: Use the exact same command to resume a session

Dealing with Cloudflare Protection

Many websites use Cloudflare or similar services to protect against bots. The scraper includes conservative defaults to help avoid detection:

Default Conservative Settings

  • Request delay: 15 seconds between requests
  • Concurrent requests: 2 simultaneous connections
  • HTTrack backend: Limited to 0.5 connections/second max
  • Bandwidth limiting: 100KB/s for HTTrack backend
  • Robots.txt: Always respected (cannot be disabled)

For Heavy Cloudflare Protection

For heavily protected sites, set even more conservative values manually:

m1f-scrape https://protected-site.com -o ./output \
  --request-delay 30 \
  --concurrent-requests 1 \
  --max-pages 50 \
  --scraper httrack

When Cloudflare Still Blocks

If conservative settings don’t work:

  1. Try Playwright backend: Uses real browser automation

    m1f-scrape https://site.com -o ./output --scraper playwright
    
  2. Manual download: Some sites require manual browsing

  3. API access: Check if the site offers an API

  4. Contact site owner: Request permission or access

Troubleshooting

No files downloaded

  • Check if the website blocks automated access
  • Try a different scraper backend
  • Verify the URL is accessible

Incomplete downloads

  • Increase --max-depth if pages are deeply nested
  • Increase --max-pages if hitting the limit
  • Check for JavaScript-rendered content (use Playwright)

Encoding issues

  • The tool automatically detects encoding
  • Check .meta.json files for encoding information
  • Use html2md with proper encoding settings for conversion

Next Steps

  • html2md - Convert downloaded HTML to Markdown
  • m1f - Bundle converted content for LLMs
  • s1f - Extract files from bundles
  • token-counter - Estimate token usage