Tools

token-counter - Token Estimation Tool

Estimate token usage for LLM context planning and optimization

The token-counter tool (v2.0.0) estimates token usage for LLM context planning, helping you optimize your use of large language models by managing context window limits.

Overview

When working with LLMs such as ChatGPT, Claude, or GPT-4, understanding token consumption is essential for effective prompt engineering and context management. Built with Python 3.10+, the token-counter tool measures how many tokens your combined files will use under a given encoding, helping you stay within the context window limits of your chosen LLM.

Key Features

  • Uses OpenAI’s tiktoken library for accurate estimates
  • Supports different encoding schemes for various LLMs
  • Helps optimize context usage for LLMs
  • Simple command-line interface
  • Fast token counting for large files
  • Error handling for encoding issues
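
For a sense of what "accurate estimates" means here, the core idea behind tiktoken-based counting fits in a few lines of Python. A minimal sketch (assuming tiktoken is installed), not the tool's actual implementation:

import tiktoken

# Load the default encoding used by GPT-4 and ChatGPT
encoding = tiktoken.get_encoding("cl100k_base")

text = "Hello, world!"
tokens = encoding.encode(text)  # list of integer token IDs
print(f"Tokens: {len(tokens)}, Characters: {len(text)}")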

Quick Start

# Check token count of a file
m1f-token-counter ./combined.txt

# Use a specific encoding model
m1f-token-counter ./combined.txt -e p50k_base

# Check multiple files
m1f-token-counter ./file1.txt ./file2.txt ./file3.txt

Command Line Options

Option             Description
file_path          Path to the text file(s) to analyze
-e, --encoding     The tiktoken encoding to use (default: cl100k_base)
-v, --verbose      Show detailed information about token counting
-s, --summary      Show summary statistics for multiple files

Usage Examples

Basic Token Counting

# Basic usage with default encoding (cl100k_base, used by GPT-4 and ChatGPT)
m1f-token-counter combined_output.txt

# Output:
# File: combined_output.txt
# Tokens: 15,432
# Characters: 87,234
# Token-to-character ratio: 0.177

Using Specific Encodings

# Use p50k_base encoding for text-davinci-003 and text-davinci-002
m1f-token-counter myfile.txt -e p50k_base

# Use r50k_base encoding for older GPT-3 models
m1f-token-counter myfile.txt -e r50k_base

Multiple Files

# Count tokens in multiple files
m1f-token-counter file1.txt file2.txt file3.txt

# Show summary statistics
m1f-token-counter *.txt -s

Verbose Output

# Show detailed information
m1f-token-counter combined.txt -v

# Output:
# File: combined.txt
# Encoding: cl100k_base
# File size: 256.3 KB
# Character count: 262,144
# Token count: 45,678
# Token-to-character ratio: 0.174
# Estimated cost (GPT-4): $0.46

Encoding Models

The tool supports different encoding models depending on which LLM you’re using:

Available Encodings

  • cl100k_base - Default, used by GPT-4, ChatGPT, GPT-3.5-Turbo
  • p50k_base - Used by text-davinci-003, text-davinci-002
  • r50k_base - Used by older GPT-3 models (davinci, curie, babbage, ada)
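
If you are unsure which encoding a given model uses, tiktoken can resolve it by model name. A small sketch (assuming tiktoken is installed):

import tiktoken

# Look up the encoding for a model by name
for model in ("gpt-4", "gpt-3.5-turbo", "text-davinci-003"):
    print(model, "->", tiktoken.encoding_for_model(model).name)

# gpt-4 -> cl100k_base
# gpt-3.5-turbo -> cl100k_base
# text-davinci-003 -> p50k_base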

Choosing the Right Encoding

# For GPT-4 and ChatGPT (default)
m1f-token-counter file.txt -e cl100k_base

# For GPT-3.5-Turbo (uses the same cl100k_base encoding as GPT-4)
m1f-token-counter file.txt -e cl100k_base

# For older GPT-3 models
m1f-token-counter file.txt -e r50k_base

Token Limits by Model

Understanding token limits is crucial for effective usage:

Model              Token Limit   Recommended Encoding   Context Window
GPT-4 Turbo        128,000       cl100k_base            128K tokens
GPT-4              8,192         cl100k_base            8K tokens
GPT-3.5-Turbo      16,385        cl100k_base            16K tokens
Claude 3.5 Sonnet  200,000       -                      200K tokens
Claude 3 Opus      200,000       -                      200K tokens
Claude 3 Sonnet    200,000       -                      200K tokens
Claude 3 Haiku     200,000       -                      200K tokens

Note: Claude models use a different tokenization scheme than OpenAI models, so token counts may not be directly comparable.
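
To put the table to work, you can compare a measured count against a model's window while reserving room for the response. A hypothetical helper (limits hard-coded from the table above; the 20% buffer follows the guidance later in this document):

# Context window sizes from the table above
LIMITS = {"gpt-4-turbo": 128_000, "gpt-4": 8_192, "gpt-3.5-turbo": 16_385}

def fits(token_count: int, model: str, buffer: float = 0.2) -> bool:
    """True if token_count leaves a `buffer` fraction of the window free."""
    return token_count <= LIMITS[model] * (1 - buffer)

print(fits(15_432, "gpt-4"))        # False: exceeds 8,192 * 0.8
print(fits(15_432, "gpt-4-turbo"))  # True: well under 128,000 * 0.8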

Integration with m1f

The token-counter tool is particularly useful when used with m1f to check if your combined files will fit within the token limit of your chosen LLM:

Basic Workflow

# 1. First, combine files with m1f
m1f -s ./project -o ./combined.txt --include-extensions .py .js

# 2. Then, check the token count
m1f-token-counter ./combined.txt

# 3. If too many tokens, adjust your selection
m1f -s ./project -o ./combined.txt --include-extensions .py --max-file-size 50KB
m1f-token-counter ./combined.txt

Advanced Optimization

# Create different bundles for different purposes
m1f -s ./src -o ./core.txt --includes "src/core/**"
m1f -s ./src -o ./utils.txt --includes "src/utils/**"
m1f -s ./docs -o ./docs.txt --docs-only

# Check token counts for each
m1f-token-counter ./core.txt ./utils.txt ./docs.txt -s

# Combine the most important ones
cat core.txt utils.txt > priority.txt
m1f-token-counter ./priority.txt

Optimizing Token Usage

To reduce token consumption while maintaining context quality:

1. Be Selective with Files

# Include only the most relevant files
m1f -s ./project -o ./focused.txt --includes "src/auth/**" "src/api/**"
m1f-token-counter ./focused.txt

2. Use Minimal Separator Style

# The None separator style uses fewer tokens
m1f -s ./project -o ./minimal.txt --separator-style None
m1f-token-counter ./minimal.txt

3. Filter by File Size

# Skip large files that might not be essential
m1f -s ./project -o ./small.txt --max-file-size 100KB
m1f-token-counter ./small.txt

4. Focus on Key Files

# Create a curated list of important files
echo "src/main.py" > important.txt
echo "src/config.py" >> important.txt
echo "README.md" >> important.txt

m1f -i important.txt -o ./curated.txt
m1f-token-counter ./curated.txt

5. Use Documentation Only

# Sometimes documentation is all you need
m1f -s ./project -o ./docs.txt --docs-only
m1f-token-counter ./docs.txt

Cost Estimation

The tool can help estimate costs for paid LLM APIs:

GPT-4 Pricing (example)

# Assume GPT-4 costs $0.03 per 1K tokens for input
m1f-token-counter ./combined.txt -v

# If file has 50,000 tokens:
# Cost = (50,000 / 1,000) * $0.03 = $1.50
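
The same arithmetic as a small helper function (the $0.03 per 1K tokens rate is only an example; check your provider's current pricing):

def estimate_cost(tokens: int, rate_per_1k: float = 0.03) -> float:
    """Estimated input cost in USD for a given per-1K-token rate."""
    return tokens / 1000 * rate_per_1k

print(f"${estimate_cost(50_000):.2f}")  # $1.50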

Token Budget Planning

# Plan your token budget
TARGET_TOKENS=100000  # 100K token limit
# Grab the number from the "Tokens:" line of the report
CURRENT_TOKENS=$(m1f-token-counter ./combined.txt | grep -m1 'Tokens:' | tr -dc '0-9')

if [ "$CURRENT_TOKENS" -gt "$TARGET_TOKENS" ]; then
    echo "File too large: $CURRENT_TOKENS tokens (limit: $TARGET_TOKENS)"
else
    echo "File fits: $CURRENT_TOKENS tokens (limit: $TARGET_TOKENS)"
fi

Architecture

Token counter v2.0.0 features a simple but effective design:

tools/token_counter/
├── __init__.py       # Package initialization
├── __main__.py       # Entry point for module execution
├── cli.py            # Command-line interface
├── counter.py        # Core token counting logic
├── models.py         # Data models for results
└── utils.py          # Utility functions

Key Components

  • Module Structure: runs as a Python package with a __main__.py entry point, exposed via the m1f-token-counter command
  • Type Safety: Full type hints for better IDE support
  • Error Handling: Graceful handling of encoding errors and file issues
  • Performance: Efficient token counting for large files
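
In spirit, the core of counter.py reduces to reading a file and encoding it. An illustrative sketch, not the actual source:

from pathlib import Path
import tiktoken

def count_tokens(path: Path, encoding_name: str = "cl100k_base") -> int:
    """Count tokens in a text file with the given tiktoken encoding."""
    try:
        encoding = tiktoken.get_encoding(encoding_name)
    except ValueError as err:
        raise SystemExit(f"Unknown encoding {encoding_name!r}: {err}")
    text = path.read_text(encoding="utf-8", errors="replace")
    return len(encoding.encode(text))

print(count_tokens(Path("combined.txt")))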

Requirements

  • Python 3.10 or newer
  • The tiktoken Python package:
pip install tiktoken

This dependency is included in the project’s requirements.txt file.

Tips for Accurate Token Counting

1. Model-Specific Encoding

Always use the encoding that matches your target LLM:

# For GPT-4
m1f-token-counter file.txt -e cl100k_base

# For older GPT-3 models
m1f-token-counter file.txt -e r50k_base

2. Include Prompts in Your Calculation

Remember to count tokens in your prompts as well as the context:

# Count your prompt tokens too
echo "Please analyze this code and suggest improvements:" > prompt.txt
m1f-token-counter prompt.txt combined.txt -s

3. Leave Buffer Space

Leave 10-20% buffer for model responses:

# If model limit is 100K tokens, aim for 80K tokens in context
TARGET_TOKENS=80000

4. Regular Checks

Re-check token counts after file modifications:

#!/bin/bash
# Check the token count after changes to the project
m1f -s ./project -o ./temp.txt --exclude-paths-file .gitignore
TOKENS=$(m1f-token-counter ./temp.txt | grep -m1 'Tokens:' | tr -dc '0-9')
echo "Current token count: $TOKENS"
rm ./temp.txt

Common Patterns

Token-Aware Bundling

# Build a bundle include-by-include, stopping before the token budget is exceeded
create_optimal_bundle() {
    local max_tokens=$1
    local output_file=$2
    shift 2  # remaining arguments: include patterns, in priority order
    local includes=()

    for pattern in "$@"; do
        includes+=("$pattern")
        m1f -s ./src -o ./temp.txt --includes "${includes[@]}"
        tokens=$(m1f-token-counter ./temp.txt | grep -m1 'Tokens:' | tr -dc '0-9')
        if [ "$tokens" -gt "$max_tokens" ]; then
            echo "Budget reached before adding $pattern ($tokens tokens)"
            break
        fi
        mv ./temp.txt "$output_file"  # keep the last bundle that fit
    done
    rm -f ./temp.txt
}

create_optimal_bundle 100000 ./optimal.txt "src/main.py" "src/config.py" "src/utils/**"

Multi-File Analysis

# Analyze multiple bundles
for file in *.txt; do
    echo "=== $file ==="
    m1f-token-counter "$file"
    echo
done

Related Tools

  • m1f - Create bundles to analyze
  • s1f - Extract files from bundles
  • html2md - Convert HTML before bundling
  • scrape - Download web content for analysis

Next Steps