scrape - Web Scraper
A modern web scraping tool for downloading websites, with multiple backend options, async I/O, intelligent crawling, and duplicate prevention.
Overview
The scrape tool provides a robust solution for downloading websites for offline viewing and analysis. Built with Python 3.10+ and modern async architecture, it features pluggable scraper backends for different use cases.
Primary Use Case: Download online documentation to make it available to LLMs (like Claude) for analysis and reference. The downloaded HTML files can be converted to Markdown with html2md, then bundled into a single file with m1f for optimal LLM context usage.
Key Features
- Multiple Scraper Backends: Choose from BeautifulSoup (default), HTTrack, Scrapy, Playwright, or Selectolax
- Async I/O: High-performance concurrent downloading
- Intelligent Crawling: Automatically respects robots.txt, follows redirects, handles encoding
- Duplicate Prevention: Three-layer deduplication system:
  - Canonical URL checking (enabled by default)
  - Content-based deduplication (enabled by default)
  - GET parameter normalization (optional with --ignore-get-params; see the sketch after this list)
- Metadata Preservation: Saves HTTP headers and metadata alongside HTML files
- Domain Restriction: Automatically restricts crawling to the starting domain
- Subdirectory Restriction: When the URL contains a path, only scrapes within that subdirectory
- Rate Limiting: Configurable delays between requests
- Progress Tracking: Real-time download progress with file listing
- Resume Support: Interrupt and resume scraping sessions with SQLite tracking
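The optional GET parameter normalization mentioned above can be pictured as stripping the query string before URLs are compared. The following is a minimal illustrative sketch in Python, not the actual m1f-scrape implementation, which may keep selected parameters or handle fragments differently:
from urllib.parse import urlsplit, urlunsplit

def normalize_url(url: str) -> str:
    # Illustrative only: drop the query string and fragment so URL variants compare equal.
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))

# Both variants collapse to https://example.com/docs/page
print(normalize_url("https://example.com/docs/page?utm_source=feed"))
print(normalize_url("https://example.com/docs/page?session=42#top"))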
Quick Start
# Basic website download
m1f-scrape https://example.com -o ./downloaded_html
# Download with specific depth and page limits
m1f-scrape https://example.com -o ./html \
--max-pages 50 \
--max-depth 3
# Use different scraper backend
m1f-scrape https://example.com -o ./html --scraper httrack
# List downloaded files after completion
m1f-scrape https://example.com -o ./html --list-files
# Resume interrupted scraping (with verbose mode to see progress)
m1f-scrape https://example.com -o ./html -v
Command Line Interface
m1f-scrape <url> -o <output> [options]
Required Arguments
Option | Description |
---|---|
url | URL to start scraping from |
-o, --output | Output directory |
Optional Arguments
Option | Description | Default |
---|---|---|
--scraper | Scraper backend (httrack, beautifulsoup, bs4, selectolax, httpx, scrapy, playwright) | beautifulsoup |
--scraper-config | Path to scraper-specific config file (YAML/JSON) | None |
--max-depth | Maximum crawl depth | 5 |
--max-pages | Maximum pages to crawl | 1000 |
--request-delay | Delay between requests in seconds | 15.0 |
--concurrent-requests | Number of concurrent requests | 2 |
--user-agent | Custom user agent string | Mozilla/5.0 |
--ignore-get-params | Ignore GET parameters in URLs | False |
--ignore-canonical | Ignore canonical URL tags | False |
--ignore-duplicates | Ignore duplicate content detection | False |
--list-files | List all downloaded files after completion | False |
-v, --verbose | Enable verbose output | False |
-q, --quiet | Suppress all output except errors | False |
--show-db-stats | Show scraping statistics from the database | False |
--show-errors | Show URLs that had errors during scraping | False |
--show-scraped-urls | List all scraped URLs from the database | False |
--version | Show version information and exit | - |
Scraper Backends
BeautifulSoup (default)
- Best for: General purpose scraping, simple websites
- Features: Fast HTML parsing, good encoding detection
- Limitations: No JavaScript support
m1f-scrape https://example.com -o ./html --scraper beautifulsoup
HTTrack
- Best for: Complete website mirroring, preserving structure
- Features: External links handling, advanced mirroring options
- Limitations: Requires HTTrack to be installed separately
m1f-scrape https://example.com -o ./html --scraper httrack
Scrapy
- Best for: Large-scale crawling, complex scraping rules
- Features: Advanced crawling settings, middleware support
- Limitations: More complex configuration
m1f-scrape https://example.com -o ./html --scraper scrapy
Playwright
- Best for: JavaScript-heavy sites, SPAs
- Features: Full browser automation, JavaScript execution
- Limitations: Slower, requires more resources
m1f-scrape https://example.com -o ./html --scraper playwright
Selectolax
- Best for: Speed-critical applications
- Features: Fastest HTML parsing, minimal overhead
- Limitations: Basic feature set
m1f-scrape https://example.com -o ./html --scraper selectolax
Usage Examples
Basic Website Download
# Download a simple website
m1f-scrape https://docs.example.com -o ./docs_html
# Download with verbose output
m1f-scrape https://docs.example.com -o ./docs_html -v
Canonical URL Checking
By default, the scraper checks for canonical URLs to avoid downloading duplicate content:
# Pages with different canonical URLs are automatically skipped
m1f-scrape https://example.com -o ./html
# Ignore canonical tags if you want all page versions
m1f-scrape https://example.com -o ./html --ignore-canonical
When enabled (default), the scraper:
- Checks the <link rel="canonical"> tag on each page
- Skips pages where the canonical URL differs from the current URL
- Prevents downloading duplicate content (e.g., print versions, mobile versions)
- Logs skipped pages with their canonical URLs for transparency
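Conceptually, the check compares each page's canonical link with the URL that was actually fetched. Here is a rough sketch of that idea using BeautifulSoup; it illustrates the technique and is not the code used internally by m1f-scrape:
from urllib.parse import urljoin
from bs4 import BeautifulSoup

def points_to_itself(page_url: str, html: str) -> bool:
    # True if the page has no canonical link, or the canonical link is the page itself.
    soup = BeautifulSoup(html, "html.parser")
    link = soup.find("link", rel="canonical")
    if link is None or not link.get("href"):
        return True
    canonical = urljoin(page_url, link["href"]).rstrip("/")
    return canonical == page_url.rstrip("/")

html = '<head><link rel="canonical" href="https://example.com/about/"></head>'
print(points_to_itself("https://example.com/about/?ref=nav", html))  # False -> page would be skipped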
Content Deduplication
By default, the scraper detects and skips pages with duplicate content:
# Content deduplication is enabled by default
m1f-scrape https://example.com -o ./html
# Disable content deduplication if needed
m1f-scrape https://example.com -o ./html --ignore-duplicates
This feature:
- Extracts plain text from HTML (removes all tags, scripts, styles)
- Calculates SHA-256 checksum of the normalized text
- Skips pages with identical text content
- Works together with canonical URL checking for thorough deduplication
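A minimal sketch of this kind of content fingerprinting, assuming text extraction with BeautifulSoup and simple whitespace normalization (the exact normalization used by m1f-scrape may differ):
import hashlib
from bs4 import BeautifulSoup

def content_fingerprint(html: str) -> str:
    # Hash the visible text of a page so identical content can be detected.
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup(["script", "style"]):
        tag.decompose()
    text = " ".join(soup.get_text(separator=" ").split())  # collapse whitespace
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

seen_fingerprints: set[str] = set()

def is_duplicate(html: str) -> bool:
    digest = content_fingerprint(html)
    if digest in seen_fingerprints:
        return True
    seen_fingerprints.add(digest)
    return False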
Subdirectory Restriction
When you specify a URL with a path, the scraper automatically restricts crawling to that subdirectory:
# Only scrape pages under /docs subdirectory
m1f-scrape https://example.com/docs -o ./docs_only
# Only scrape API documentation pages
m1f-scrape https://api.example.com/v2/reference -o ./api_docs
# This will NOT scrape /products, /blog, etc. - only /tutorials/*
m1f-scrape https://learn.example.com/tutorials -o ./tutorials_only
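The restriction can be thought of as a same-host, path-prefix test against the start URL. A rough sketch of that test (the real crawler also has to deal with redirects, trailing slashes, and similar edge cases):
from urllib.parse import urlsplit

def in_scope(start_url: str, candidate_url: str) -> bool:
    # Keep a URL only if it is on the same host and under the start path.
    start, cand = urlsplit(start_url), urlsplit(candidate_url)
    if cand.netloc != start.netloc:
        return False
    prefix = start.path if start.path.endswith("/") else start.path + "/"
    return cand.path == start.path or cand.path.startswith(prefix)

print(in_scope("https://example.com/docs", "https://example.com/docs/install"))  # True
print(in_scope("https://example.com/docs", "https://example.com/blog/post"))     # False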
Controlled Crawling
# Limit crawl depth for shallow scraping
m1f-scrape https://blog.example.com -o ./blog \
--max-depth 2 \
--max-pages 20
# Slow crawling to be respectful
m1f-scrape https://example.com -o ./html \
--request-delay 2.0 \
--concurrent-requests 2
Custom Configuration
# Use custom user agent
m1f-scrape https://example.com -o ./html \
--user-agent "MyBot/1.0 (Compatible)"
# Use scraper-specific configuration
m1f-scrape https://example.com -o ./html \
--scraper scrapy \
--scraper-config ./scrapy-settings.yaml
Output Structure
Downloaded files are organized to mirror the website structure:
output_directory/
├── scrape_tracker.db          # SQLite database for resume functionality
├── example.com/
│   ├── index.html
│   ├── index.meta.json
│   ├── about/
│   │   ├── index.html
│   │   └── index.meta.json
│   ├── blog/
│   │   ├── post1/
│   │   │   ├── index.html
│   │   │   └── index.meta.json
│   │   └── post2/
│   │       ├── index.html
│   │       └── index.meta.json
│   └── contact/
│       ├── index.html
│       └── index.meta.json
Metadata Files
Each HTML file has an accompanying .meta.json file containing:
{
  "url": "https://example.com/about/",
  "title": "About Us - Example",
  "encoding": "utf-8",
  "status_code": 200,
  "headers": {
    "Content-Type": "text/html; charset=utf-8",
    "Last-Modified": "2024-01-15T10:30:00Z"
  },
  "metadata": {
    "description": "Learn more about Example company",
    "og:title": "About Us",
    "canonical": "https://example.com/about/"
  }
}
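Because every page carries its metadata in a sidecar file, post-processing scripts can walk the output directory and build a simple index. A small example using only the fields shown above (individual fields may be missing for some pages):
import json
from pathlib import Path

def index_metadata(output_dir: str) -> list[dict]:
    # Collect URL, title, and status code from every .meta.json sidecar file.
    index = []
    for meta_path in Path(output_dir).rglob("*.meta.json"):
        meta = json.loads(meta_path.read_text(encoding="utf-8"))
        index.append({
            "url": meta.get("url"),
            "title": meta.get("title"),
            "status": meta.get("status_code"),
            "html_file": str(meta_path).replace(".meta.json", ".html"),
        })
    return index

for entry in index_metadata("./downloaded_html"):
    print(entry["status"], entry["url"])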
Integration with m1f Workflow
scrape is designed as the first step in a workflow to provide documentation to LLMs:
# Step 1: Download documentation website
m1f-scrape https://docs.example.com -o ./html_files
# Step 2: Analyze HTML structure
m1f-html2md analyze ./html_files/*.html --suggest-selectors
# Step 3: Convert to Markdown
m1f-html2md convert ./html_files -o ./markdown \
--content-selector "main.content" \
--ignore-selectors "nav" ".sidebar"
# Step 4: Bundle for LLM consumption
m1f -s ./markdown -o ./docs_bundle.txt \
--remove-scraped-metadata
# Now docs_bundle.txt contains all documentation in a single file
# that can be provided to Claude or other LLMs for analysis
Complete Documentation Download Example
# Download React documentation for LLM analysis
m1f-scrape https://react.dev/learn -o ./react_docs \
--max-pages 100 \
--max-depth 3
# Convert to clean Markdown
m1f-html2md convert ./react_docs -o ./react_md \
--content-selector "article" \
--ignore-selectors "nav" "footer" ".sidebar"
# Create single file for LLM
m1f -s ./react_md -o ./react_documentation.txt
# Now you can provide react_documentation.txt to Claude
Resume Functionality
The scraper supports interrupting and resuming downloads, making it ideal for large websites or unreliable connections.
How It Works
- SQLite Database: Creates scrape_tracker.db in the output directory to track:
  - URL of each scraped page
  - HTTP status code and target filename
  - Timestamp and error messages (if any)
- Progress Display: Shows real-time progress in verbose mode
- Graceful Interruption: Press Ctrl+C to interrupt cleanly
Resume Example
# Start scraping with verbose mode
m1f-scrape https://docs.example.com -o ./docs --max-pages 100 -v
# Interrupt with Ctrl+C when needed
# Resume by running the exact same command:
m1f-scrape https://docs.example.com -o ./docs --max-pages 100 -v
# You'll see:
# Resuming crawl - found 25 previously scraped URLs
# Populating queue from previously scraped pages...
# Found 187 URLs to visit after analyzing scraped pages
# Processing: https://docs.example.com/new-page (page 26)
Database Inspection
# Show scraping statistics
m1f-scrape -o docs/ --show-db-stats
# View all scraped URLs with status codes
m1f-scrape -o docs/ --show-scraped-urls
# Check for errors
m1f-scrape -o docs/ --show-errors
# Combine multiple queries
m1f-scrape -o docs/ --show-db-stats --show-errors
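You can also open the tracking database directly with Python's sqlite3 module, since it is a regular SQLite file. The snippet below lists tables and row counts without assuming a particular schema, because the internal table layout is an implementation detail:
import sqlite3

con = sqlite3.connect("docs/scrape_tracker.db")
tables = [row[0] for row in con.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table'")]
for table in tables:
    (count,) = con.execute(f"SELECT COUNT(*) FROM {table}").fetchone()
    print(f"{table}: {count} rows")
con.close()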
Best Practices
- Respect robots.txt: The tool automatically respects robots.txt files
- Use appropriate delays: Set --request-delay to avoid overwhelming servers (default: 15 seconds)
- Limit concurrent requests: Use --concurrent-requests responsibly (default: 2 connections)
- Test with small crawls: Start with --max-pages 10 to test your settings
- Check output: Use --list-files to verify what was downloaded
- Use verbose mode: Add the -v flag to see progress and resume information
- Keep commands consistent: Use the exact same command to resume a session
Dealing with Cloudflare Protection
Many websites use Cloudflare or similar services to protect against bots. The scraper includes conservative defaults to help avoid detection:
Default Conservative Settings
- Request delay: 15 seconds between requests
- Concurrent requests: 2 simultaneous connections
- HTTrack backend: Limited to 0.5 connections/second max
- Bandwidth limiting: 100KB/s for HTTrack backend
- Robots.txt: Always respected (cannot be disabled)
For Heavy Cloudflare Protection
For heavily protected sites, manually set very conservative values:
m1f-scrape https://protected-site.com -o ./output \
--request-delay 30 \
--concurrent-requests 1 \
--max-pages 50 \
--scraper httrack
When Cloudflare Still Blocks
If conservative settings don't work:
- Try Playwright backend: Uses real browser automation (see the sketch after this list)
  m1f-scrape https://site.com -o ./output --scraper playwright
- Manual download: Some sites require manual browsing
- API access: Check if the site offers an API
- Contact site owner: Request permission or access
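For reference, this is roughly what "real browser automation" means: a Playwright-driven browser renders the page, including JavaScript, before the HTML is saved. The standalone sketch below uses Playwright's synchronous API directly and is not the m1f-scrape implementation; it assumes the browser has been installed with playwright install chromium:
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://example.com", wait_until="networkidle")
    html = page.content()  # HTML after JavaScript has executed
    browser.close()

with open("example.html", "w", encoding="utf-8") as f:
    f.write(html)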
Troubleshooting
No files downloaded
- Check if the website blocks automated access
- Try a different scraper backend
- Verify the URL is accessible
Incomplete downloads
- Increase --max-depth if pages are deeply nested
- Increase --max-pages if hitting the limit
- Check for JavaScript-rendered content (use Playwright)
Encoding issues
- The tool automatically detects encoding
- Check .meta.json files for encoding information
- Use html2md with proper encoding settings for conversion
Related Tools
- html2md - Convert downloaded HTML to Markdown
- m1f - Bundle converted content for LLMs
- s1f - Extract files from bundles
- token-counter - Estimate token usage
Next Steps
- Learn about HTML to Markdown conversion for processing scraped content
- Explore m1f bundling to combine processed files
- Check out Claude integration for AI workflows
- Review auto-bundling for automated processes