token-counter - Token Estimation Tool
Estimate token usage for LLM context planning and optimization
The token-counter tool (v2.0.0) estimates token usage for LLM context planning, helping you optimize your use of large language models by managing context window limits.
Overview
When working with LLMs like ChatGPT, Claude, or GPT-4, understanding token consumption is essential for effective prompt engineering and context management. Built with Python 3.10+, the token-counter tool lets you estimate how many tokens your combined files will use, helping you stay within the context window limits of your chosen LLM.
Key Features
- Uses OpenAI’s tiktoken library for accurate estimates
- Supports different encoding schemes for various LLMs
- Helps optimize context usage for LLMs
- Simple command-line interface
- Fast token counting for large files
- Error handling for encoding issues
Quick Start
# Check token count of a file
m1f-token-counter ./combined.txt
# Use a specific encoding model
m1f-token-counter ./combined.txt -e p50k_base
# Check multiple files
m1f-token-counter ./file1.txt ./file2.txt ./file3.txt
Command Line Options
Option | Description |
---|---|
file_path | Path to the text file(s) to analyze |
-e, --encoding | The tiktoken encoding to use (default: cl100k_base) |
-v, --verbose | Show detailed information about token counting |
-s, --summary | Show summary statistics for multiple files |
Usage Examples
Basic Token Counting
# Basic usage with default encoding (cl100k_base, used by GPT-4 and ChatGPT)
m1f-token-counter combined_output.txt
# Output:
# File: combined_output.txt
# Tokens: 15,432
# Characters: 87,234
# Token-to-character ratio: 0.177
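Under the hood, the count comes from tiktoken, so you can reproduce it directly. A minimal sketch, assuming tiktoken is installed and using the same default encoding:
# Reproduce the count with tiktoken directly (sketch; assumes tiktoken is installed)
python -c "
import tiktoken
enc = tiktoken.get_encoding('cl100k_base')
text = open('combined_output.txt', encoding='utf-8').read()
print(f'Tokens: {len(enc.encode(text)):,}')
"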
Using Specific Encodings
# Use p50k_base encoding for text-davinci models
m1f-token-counter myfile.txt -e p50k_base
# Use r50k_base encoding for older GPT-3 models
m1f-token-counter myfile.txt -e r50k_base
Multiple Files
# Count tokens in multiple files
m1f-token-counter file1.txt file2.txt file3.txt
# Show summary statistics
m1f-token-counter *.txt -s
Verbose Output
# Show detailed information
m1f-token-counter combined.txt -v
# Output:
# File: combined.txt
# Encoding: cl100k_base
# File size: 256.3 KB
# Character count: 262,144
# Token count: 45,678
# Token-to-character ratio: 0.174
# Estimated cost (GPT-4): $0.46
Encoding Models
The tool supports different encoding models depending on which LLM you’re using:
Available Encodings
- cl100k_base: Default; used by GPT-4, ChatGPT, and GPT-3.5-Turbo
- p50k_base: Used by text-davinci-003 and text-davinci-002
- r50k_base: Used by older GPT-3 models (davinci, curie, babbage, ada)
Choosing the Right Encoding
# For GPT-4 and ChatGPT (default)
m1f-token-counter file.txt -e cl100k_base
# For GPT-3.5-Turbo (also cl100k_base)
m1f-token-counter file.txt -e cl100k_base
# For older GPT-3 models
m1f-token-counter file.txt -e r50k_base
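The same file can produce noticeably different counts under different encodings. A quick comparison loop (a sketch; it parses the Tokens: line shown in the examples above):
# Compare counts across encodings for the same file
for enc in cl100k_base p50k_base r50k_base; do
  printf '%s: ' "$enc"
  m1f-token-counter file.txt -e "$enc" | grep -m1 'Tokens:'
done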
Token Limits by Model
Understanding token limits is crucial for effective usage:
Model | Token Limit | Recommended Encoding | Context Window |
---|---|---|---|
GPT-4 Turbo | 128,000 | cl100k_base | 128K tokens |
GPT-4 | 8,192 | cl100k_base | 8K tokens |
GPT-3.5-Turbo | 16,385 | cl100k_base | 16K tokens |
Claude 3.5 Sonnet | 200,000 | - | 200K tokens |
Claude 3 Opus | 200,000 | - | 200K tokens |
Claude 3 Sonnet | 200,000 | - | 200K tokens |
Claude 3 Haiku | 200,000 | - | 200K tokens |
Note: Claude models use a different tokenization scheme than OpenAI models, so token counts may not be directly comparable.
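You can turn this table into a quick pre-flight check. A minimal sketch using a shell lookup of the limits above (requires bash 4+ for associative arrays; the model keys are illustrative):
# Check a bundle against a model's context window (limits from the table above)
declare -A LIMITS=( [gpt-4-turbo]=128000 [gpt-4]=8192 [gpt-3.5-turbo]=16385 )
MODEL=gpt-4
TOKENS=$(m1f-token-counter ./combined.txt | grep -m1 'Tokens:' | grep -oE '[0-9,]+' | tr -d ',')
if [ "$TOKENS" -gt "${LIMITS[$MODEL]}" ]; then
  echo "Over the $MODEL limit: $TOKENS > ${LIMITS[$MODEL]} tokens"
else
  echo "Fits: $TOKENS of ${LIMITS[$MODEL]} tokens"
fi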
Integration with m1f
The token-counter tool is particularly useful when used with m1f to check if your combined files will fit within the token limit of your chosen LLM:
Basic Workflow
# 1. First, combine files with m1f
m1f -s ./project -o ./combined.txt --include-extensions .py .js
# 2. Then, check the token count
m1f-token-counter ./combined.txt
# 3. If too many tokens, adjust your selection
m1f -s ./project -o ./combined.txt --include-extensions .py --max-file-size 50KB
m1f-token-counter ./combined.txt
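Step 3 can be automated by tightening the size filter until the bundle fits. A sketch, with illustrative size steps and an assumed 100K budget:
# Tighten the size filter until the bundle fits the budget
for SIZE in 200KB 100KB 50KB; do
  m1f -s ./project -o ./combined.txt --include-extensions .py --max-file-size "$SIZE"
  TOKENS=$(m1f-token-counter ./combined.txt | grep -m1 'Tokens:' | grep -oE '[0-9,]+' | tr -d ',')
  [ "$TOKENS" -le 100000 ] && break
done
echo "Final bundle: $TOKENS tokens (max file size $SIZE)"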
Advanced Optimization
# Create different bundles for different purposes
m1f -s ./src -o ./core.txt --includes "src/core/**"
m1f -s ./src -o ./utils.txt --includes "src/utils/**"
m1f -s ./docs -o ./docs.txt --docs-only
# Check token counts for each
m1f-token-counter ./core.txt ./utils.txt ./docs.txt -s
# Combine the most important ones
cat core.txt utils.txt > priority.txt
m1f-token-counter ./priority.txt
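To decide which bundles to prioritize, it can help to rank them by size first. A small sketch that sorts bundles by token count (again parsing the Tokens: line):
# Rank bundles by token count, smallest first
for f in core.txt utils.txt docs.txt; do
  t=$(m1f-token-counter "$f" | grep -m1 'Tokens:' | grep -oE '[0-9,]+' | tr -d ',')
  echo "$t $f"
done | sort -n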
Optimizing Token Usage
To reduce token consumption while maintaining context quality:
1. Be Selective with Files
# Include only the most relevant files
m1f -s ./project -o ./focused.txt --includes "src/auth/**" "src/api/**"
m1f-token-counter ./focused.txt
2. Use Minimal Separator Style
# The None separator style uses fewer tokens
m1f -s ./project -o ./minimal.txt --separator-style None
m1f-token-counter ./minimal.txt
3. Filter by File Size
# Skip large files that might not be essential
m1f -s ./project -o ./small.txt --max-file-size 100KB
m1f-token-counter ./small.txt
4. Focus on Key Files
# Create a curated list of important files
echo "src/main.py" > important.txt
echo "src/config.py" >> important.txt
echo "README.md" >> important.txt
m1f -i important.txt -o ./curated.txt
m1f-token-counter ./curated.txt
5. Use Documentation Only
# Sometimes documentation is all you need
m1f -s ./project -o ./docs.txt --docs-only
m1f-token-counter ./docs.txt
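These strategies compose; combining a path filter, a size cap, and the None separator style often gives the biggest savings in a single pass (flags as documented above, with illustrative values):
# Combine strategies 1-3 in a single pass
m1f -s ./project -o ./lean.txt --includes "src/**" --max-file-size 100KB --separator-style None
m1f-token-counter ./lean.txt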
Cost Estimation
The tool can help estimate costs for paid LLM APIs:
GPT-4 Pricing (example)
# Assume GPT-4 costs $0.03 per 1K tokens for input
m1f-token-counter ./combined.txt -v
# If file has 50,000 tokens:
# Cost = (50,000 / 1,000) * $0.03 = $1.50
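This arithmetic is easy to script. A sketch that computes the input cost from the reported count, with the $0.03 per 1K rate as an assumed example:
# Compute input cost from the reported token count (rate is an assumed example)
TOKENS=$(m1f-token-counter ./combined.txt | grep -m1 'Tokens:' | grep -oE '[0-9,]+' | tr -d ',')
awk -v t="$TOKENS" -v rate=0.03 'BEGIN { printf "Estimated input cost: $%.2f\n", t / 1000 * rate }'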
Token Budget Planning
# Plan your token budget
TARGET_TOKENS=100000 # 100K token limit
CURRENT_TOKENS=$(m1f-token-counter ./combined.txt | grep -m1 'Tokens:' | grep -oE '[0-9,]+' | tr -d ',')
if [ "$CURRENT_TOKENS" -gt "$TARGET_TOKENS" ]; then
echo "File too large: $CURRENT_TOKENS tokens (limit: $TARGET_TOKENS)"
else
echo "File fits: $CURRENT_TOKENS tokens (limit: $TARGET_TOKENS)"
fi
Architecture
Token counter v2.0.0 features a simple but effective design:
tools/token_counter/
├── __init__.py # Package initialization
├── __main__.py # Entry point for module execution
├── cli.py # Command-line interface
├── counter.py # Core token counting logic
├── models.py # Data models for results
└── utils.py # Utility functions
Key Components
- Module Structure: Can be run as a module (m1f-token-counter)
- Type Safety: Full type hints for better IDE support
- Error Handling: Graceful handling of encoding errors and file issues
- Performance: Efficient token counting for large files
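Because the package ships an __main__.py, it should also be runnable with python -m; the module path below is an assumption based on the layout above:
# Run as a module instead of the console script (module path is an assumption)
python -m tools.token_counter ./combined.txt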
Requirements
- Python 3.10 or newer
- The tiktoken Python package: pip install tiktoken
This dependency is included in the project’s requirements.txt file.
Tips for Accurate Token Counting
1. Model-Specific Encoding
Always use the encoding that matches your target LLM:
# For GPT-4
m1f-token-counter file.txt -e cl100k_base
# For older GPT-3 models
m1f-token-counter file.txt -e r50k_base
2. Include Prompts in Your Calculation
Remember to count tokens in your prompts as well as the context:
# Count your prompt tokens too
echo "Please analyze this code and suggest improvements:" > prompt.txt
m1f-token-counter prompt.txt combined.txt -s
3. Leave Buffer Space
Leave 10-20% buffer for model responses:
# If model limit is 100K tokens, aim for 80K tokens in context
TARGET_TOKENS=80000
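You can derive the target from the model limit rather than hard-coding it; a short sketch:
# Reserve a 20% buffer off the model limit
MODEL_LIMIT=100000
TARGET_TOKENS=$((MODEL_LIMIT * 80 / 100))
echo "Context budget: $TARGET_TOKENS tokens"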
4. Regular Checks
Re-check token counts after file modifications:
#!/bin/bash
# Script to check token count after changes
m1f -s ./project -o ./temp.txt --exclude-paths-file .gitignore
TOKENS=$(m1f-token-counter ./temp.txt | grep -m1 'Tokens:' | grep -oE '[0-9,]+' | tr -d ',')
echo "Current token count: $TOKENS"
rm ./temp.txt
Common Patterns
Token-Aware Bundling
# Function to create optimal bundles
create_optimal_bundle() {
local max_tokens=$1
local output_file=$2
# Start with core files
m1f -s ./src -o ./temp.txt --includes "src/main.py" "src/config.py"
# Add files until we approach the limit
while [ "$(m1f-token-counter ./temp.txt | grep -m1 'Tokens:' | grep -oE '[0-9,]+' | tr -d ',')" -lt "$max_tokens" ]; do
# Add more files...
break
done
mv ./temp.txt "$output_file"
}
create_optimal_bundle 100000 ./optimal.txt
Multi-File Analysis
# Analyze multiple bundles
for file in *.txt; do
echo "=== $file ==="
m1f-token-counter "$file"
echo
done
Related Tools
- m1f - Create bundles to analyze
- s1f - Extract files from bundles
- html2md - Convert HTML before bundling
- scrape - Download web content for analysis
Next Steps
- Learn about m1f bundling to create optimized file combinations
- Explore presets for consistent token management
- Check out Claude integration for AI workflows
- Review auto-bundling for automated token-aware processes