Bundle Format Specification

The m1f tool generates bundle files that combine multiple source files into a single output file. This document provides a complete technical specification of the bundle format, including separator styles, metadata structure, and deduplication mechanisms.

Overview

An m1f bundle consists of:

File separators that mark boundaries between files
File metadata containing information about each file
File content with optional processing applied
Deduplication references for files with identical content

Separator Styles

m1f supports five different separator styles, each designed for specific use cases:

Standard Style

A simple, concise separator showing only the file path:

======= path/to/file.py ======

Use cases:

Quick file navigation
Minimal overhead for AI tools
Human-readable bundles

Detailed Style (Default)

Comprehensive separator including full metadata:

========================================================================================
== FILE: path/to/file.py
== DATE: 2025-05-15 14:30:21 | SIZE: 2.50 KB | TYPE: .py
== CHECKSUM_SHA256: abcdef1234567890abcdef1234567890abcdef1234567890abcdef1234567890
========================================================================================

Metadata fields:

FILE: Relative path from the source directory
DATE: Last modification timestamp (local timezone)
SIZE: Human-readable file size
TYPE: File extension
CHECKSUM_SHA256: SHA-256 hash of the file content
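
The fields above can be assembled into a header along the following lines. This is a minimal sketch, not m1f's actual implementation: `human_size` and `detailed_header` are hypothetical helpers, and the rule width and the 1024-based size formatting are assumptions.

```python
import hashlib
from datetime import datetime
from pathlib import PurePosixPath

RULE = "=" * 88  # assumed width; match it to the rule line in your own bundles


def human_size(num_bytes: int) -> str:
    """Format a byte count like the SIZE field (assumed 1024-based units)."""
    if num_bytes < 1024:
        return f"{num_bytes} B"
    if num_bytes < 1024 ** 2:
        return f"{num_bytes / 1024:.2f} KB"
    return f"{num_bytes / 1024 ** 2:.2f} MB"


def detailed_header(rel_path: str, data: bytes, mtime: datetime) -> str:
    """Build a Detailed-style separator for one file (illustrative only)."""
    ext = PurePosixPath(rel_path).suffix          # e.g. ".py", including the dot
    checksum = hashlib.sha256(data).hexdigest()   # 64 lowercase hex characters
    return "\n".join([
        RULE,
        f"== FILE: {rel_path}",
        f"== DATE: {mtime:%Y-%m-%d %H:%M:%S} | SIZE: {human_size(len(data))} | TYPE: {ext}",
        f"== CHECKSUM_SHA256: {checksum}",
        RULE,
    ])


print(detailed_header("path/to/file.py", b"x" * 2560, datetime(2025, 5, 15, 14, 30, 21)))
```

The checksum is computed over the raw bytes here; whether m1f hashes the bytes before or after any content processing is not stated in this document.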

Markdown Style

Formats metadata as Markdown with proper code blocks:

## path/to/file.py

**Date Modified:** 2025-05-15 14:30:21 | **Size:** 2.50 KB | **Type:** .py | **Checksum (SHA256):** abcdef1234567890...

```python
# File content starts here
def example():
    return "Hello, world!"
```

Features:

Uses file extension for syntax highlighting
Renders nicely in Markdown viewers
Ideal for documentation bundles
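
One way to picture how the extension drives the code-fence language is sketched below. The `FENCE_LANGUAGES` table and the `markdown_entry` helper are hypothetical; m1f's real extension mapping may differ.

```python
from pathlib import PurePosixPath

# Hypothetical extension-to-language table; m1f's real mapping may differ.
FENCE_LANGUAGES = {".py": "python", ".js": "javascript", ".md": "markdown"}
FENCE = "`" * 3  # built at runtime so this example can itself sit inside a fence


def markdown_entry(rel_path: str, content: str, date: str, size: str, checksum: str) -> str:
    """Render one file as a Markdown-style bundle entry (illustrative only)."""
    ext = PurePosixPath(rel_path).suffix
    lang = FENCE_LANGUAGES.get(ext, "")  # unknown extensions get a bare fence
    meta = (
        f"**Date Modified:** {date} | **Size:** {size} | "
        f"**Type:** {ext} | **Checksum (SHA256):** {checksum}"
    )
    return "\n".join([f"## {rel_path}", "", meta, "", f"{FENCE}{lang}", content, FENCE])


print(markdown_entry("path/to/file.py", "x = 1", "2025-05-15 14:30:21", "2.50 KB", "abcdef1234567890..."))
```

Note that file content which itself contains a triple-backtick line would close the fence early, so a parser cannot rely on the fence alone to find the end of a file.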

MachineReadable Style

A robust format for automated parsing and processing:

--- PYMK1F_BEGIN_FILE_METADATA_BLOCK_12345678-1234-1234-1234-123456789abc ---
METADATA_JSON:
{
    "original_filepath": "path/to/file.py",
    "original_filename": "file.py",
    "timestamp_utc_iso": "2025-05-15T14:30:21Z",
    "type": ".py",
    "size_bytes": 2560,
    "checksum_sha256": "abcdef1234567890abcdef1234567890abcdef1234567890abcdef1234567890",
    "encoding": "utf-8",
    "original_encoding": "utf-8"
}
--- PYMK1F_END_FILE_METADATA_BLOCK_12345678-1234-1234-1234-123456789abc ---
--- PYMK1F_BEGIN_FILE_CONTENT_BLOCK_12345678-1234-1234-1234-123456789abc ---

# File content here

--- PYMK1F_END_FILE_CONTENT_BLOCK_12345678-1234-1234-1234-123456789abc ---

Features:

UUID-based block markers for reliable parsing
JSON metadata for easy extraction
Timestamps in UTC ISO format
Encoding information included
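
To make the marker layout concrete, the sketch below writes one block in this shape. It is an illustration under assumptions rather than m1f's own writer: `machine_readable_block` is a hypothetical helper, the content is assumed to be UTF-8 text already, and the exact blank-line padding around the content may differ from real bundles.

```python
import hashlib
import json
import uuid
from pathlib import PurePosixPath


def machine_readable_block(rel_path: str, text: str, timestamp_utc_iso: str) -> str:
    """Wrap one file in MachineReadable-style markers (illustrative sketch)."""
    block_id = str(uuid.uuid4())  # one fresh UUID shared by all four markers
    data = text.encode("utf-8")
    metadata = {
        "original_filepath": rel_path,
        "original_filename": PurePosixPath(rel_path).name,
        "timestamp_utc_iso": timestamp_utc_iso,
        "type": PurePosixPath(rel_path).suffix,
        "size_bytes": len(data),
        "checksum_sha256": hashlib.sha256(data).hexdigest(),
        "encoding": "utf-8",
        "original_encoding": "utf-8",  # assumed: no encoding detection in this sketch
    }
    return "\n".join([
        f"--- PYMK1F_BEGIN_FILE_METADATA_BLOCK_{block_id} ---",
        "METADATA_JSON:",
        json.dumps(metadata, indent=4),
        f"--- PYMK1F_END_FILE_METADATA_BLOCK_{block_id} ---",
        f"--- PYMK1F_BEGIN_FILE_CONTENT_BLOCK_{block_id} ---",
        text,
        f"--- PYMK1F_END_FILE_CONTENT_BLOCK_{block_id} ---",
    ])


print(machine_readable_block("path/to/file.py", "print('hi')", "2025-05-15T14:30:21Z"))
```

Because the UUID is generated per file, a marker line that happens to appear inside some file's content will not carry the right identifier, which is what makes the boundaries reliable to parse.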

None Style

Files are concatenated directly without any separators:

# First file content
def function1():
    pass
# Second file content
def function2():
    pass

Use cases:

Creating single concatenated files
Minimal processing needed
When file boundaries don’t matter

Content Deduplication

By default, m1f deduplicates files with identical content to save space and reduce token usage.

Deduplication Format

When a duplicate file is encountered:

========================================================================================
== FILE: src/components/Header.js
== DUPE OF: src/widgets/Header.js
========================================================================================

Behavior:

First occurrence includes full content
Subsequent duplicates show reference only
Based on SHA-256 checksum matching
Can be disabled with --allow-duplicate-files
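
The behavior listed above amounts to remembering the first path seen for each checksum. A minimal sketch follows, assuming files are visited in bundle order; `plan_deduplication` is a hypothetical name and not part of m1f.

```python
import hashlib
from typing import Dict, Optional


def plan_deduplication(files: Dict[str, bytes]) -> Dict[str, Optional[str]]:
    """Map each path to None (emit full content) or to the path it duplicates."""
    first_seen: Dict[str, str] = {}        # checksum -> first path with that content
    plan: Dict[str, Optional[str]] = {}
    for path, data in files.items():       # dicts preserve insertion order
        checksum = hashlib.sha256(data).hexdigest()
        if checksum in first_seen:
            plan[path] = first_seen[checksum]   # would be written as "== DUPE OF: ..."
        else:
            first_seen[checksum] = path
            plan[path] = None
    return plan


plan = plan_deduplication({
    "src/widgets/Header.js": b"export default Header;",
    "src/components/Header.js": b"export default Header;",
    "src/index.js": b"import Header from './widgets/Header';",
})
for path, original in plan.items():
    print(path, "->", original or "full content")
```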

Deduplication Example

========================================================================================
== FILE: src/utils/constants.js
== DATE: 2025-05-15 10:00:00 | SIZE: 150 B | TYPE: .js
== CHECKSUM_SHA256: 123abc...
========================================================================================

export const API_KEY = "example";
export const API_URL = "https://api.example.com";

========================================================================================
== FILE: lib/shared/constants.js
== DUPE OF: src/utils/constants.js
========================================================================================

========================================================================================
== FILE: backup/constants.js
== DUPE OF: src/utils/constants.js
========================================================================================

Metadata Structure

Core Metadata Fields

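Separator styles that carry metadata (Detailed, Markdown, and MachineReadable) draw on these fields; Standard shows only the file path, and None emits nothing: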

Field	Description	Format
`filepath`	Relative path from source directory	String
`filename`	Base filename	String
`timestamp`	File modification time	ISO 8601 or Unix timestamp
`size`	File size	Bytes (number) or human-readable
`type`	File extension	String (including dot)
`checksum_sha256`	Content hash	64-character hex string

Extended Metadata (MachineReadable)

Additional fields in JSON format:

Field	Description	Example
`encoding`	Output encoding	`"utf-8"`
`original_encoding`	Detected source encoding	`"windows-1252"`
`line_count`	Number of lines	`150`
`binary`	Is binary file	`false`
`mime_type`	MIME type (if detected)	`"text/javascript"`

Encoding Handling

Character Encoding

Default output encoding: UTF-8
Original encoding detected and preserved in metadata
Configurable with --convert-to-charset
Supported encodings: utf-8, utf-16, ascii, latin-1, cp1252
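
The relationship between the output encoding and the `original_encoding` metadata field can be pictured as a decode followed by an encode. This is a minimal sketch, assuming the source encoding has already been detected by some other step; `reencode` is a hypothetical helper and says nothing about how m1f performs detection.

```python
from typing import Dict, Tuple


def reencode(raw: bytes, original_encoding: str, target: str = "utf-8") -> Tuple[bytes, Dict[str, str]]:
    """Decode with the source encoding, then encode for the bundle."""
    text = raw.decode(original_encoding)   # raises UnicodeDecodeError on a bad guess
    metadata = {"encoding": target, "original_encoding": original_encoding}
    return text.encode(target), metadata


raw = "café".encode("cp1252")              # b'caf\xe9' as stored on disk
converted, meta = reencode(raw, "cp1252")
print(converted, meta)
```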

Line Endings

Preserved from source files by default
Separator line endings configurable with --line-ending
Options: lf (Unix) or crlf (Windows)

Binary File Handling

When --include-binary-files is enabled:

========================================================================================
== FILE: images/logo.png
== DATE: 2025-05-15 12:00:00 | SIZE: 45.2 KB | TYPE: .png
== BINARY FILE: Content read as UTF-8 with errors ignored
========================================================================================

[Binary content appears as garbled text]

Note: Binary content is not suitable for text processing and significantly increases bundle size.
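
"Read as UTF-8 with errors ignored" means that any byte sequence that is not valid UTF-8 is silently dropped, so the original bytes cannot be recovered from the bundle. The sketch below demonstrates the effect with Python's standard decoder; whether m1f uses exactly this call is an assumption.

```python
def read_as_text_lossy(data: bytes) -> str:
    """Decode bytes as UTF-8, silently dropping anything that is not valid."""
    return data.decode("utf-8", errors="ignore")


png_signature = b"\x89PNG\r\n\x1a\n"   # the first eight bytes of every PNG file
text = read_as_text_lossy(png_signature)

print(repr(text))                       # 'PNG\r\n\x1a\n' (the invalid 0x89 byte is gone)
print(len(png_signature), len(text))    # 8 7
print(text.encode("utf-8") == png_signature)  # False: the round trip is lossy
```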

Special Processing

Scraped Content Metadata Removal

When --remove-scraped-metadata is enabled, m1f removes metadata blocks from HTML2MD files:

# Original file with metadata
Article content here...

---
**Source:** https://example.com/article
**Scraped:** 2025-05-15 10:00:00
---

# After processing
Article content here...
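
The transformation shown above can be approximated with a single regular expression. This is a sketch that mirrors the example only: it assumes the block sits at the end of the file, is fenced by `---` lines, and contains rows of the form `**Key:** value`. `strip_scraped_metadata` is a hypothetical helper, not m1f's code.

```python
import re

# Assumed shape: a trailing block fenced by "---" lines whose rows look like
# "**Key:** value". This mirrors the example above, not m1f's actual rules.
SCRAPED_BLOCK = re.compile(
    r"\n*^---[ \t]*\n(?:\*\*[^*\n]+:\*\*[^\n]*\n)+---[ \t]*\n*\Z",
    re.MULTILINE,
)


def strip_scraped_metadata(text: str) -> str:
    """Remove a trailing scraped-metadata block, keeping one final newline."""
    stripped, count = SCRAPED_BLOCK.subn("", text)
    return stripped + "\n" if count else text


source = (
    "Article content here...\n"
    "\n"
    "---\n"
    "**Source:** https://example.com/article\n"
    "**Scraped:** 2025-05-15 10:00:00\n"
    "---\n"
)
print(repr(strip_scraped_metadata(source)))
```

Requiring at least one `**Key:** value` row keeps an ordinary Markdown horizontal rule at the end of a document from being mistaken for a metadata block.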

Content Processors

When using presets, files can have processors applied:

per_file_settings:
  "*.min.js":
    processors:
      - minify_content

Processors change a file's content only; the structure of its metadata is unaffected.

Bundle File Structure Example

A complete bundle showing various features:

========================================================================================
== FILE: README.md
== DATE: 2025-05-15 09:00:00 | SIZE: 1.2 KB | TYPE: .md
== CHECKSUM_SHA256: abc123...
========================================================================================

# My Project

This is the main documentation...

========================================================================================
== FILE: src/index.js
== DATE: 2025-05-15 10:30:00 | SIZE: 3.5 KB | TYPE: .js
== CHECKSUM_SHA256: def456...
========================================================================================

import { utils } from './utils';

export function main() {
    // Main application code
}

========================================================================================
== FILE: src/utils.js
== DATE: 2025-05-15 10:00:00 | SIZE: 2.1 KB | TYPE: .js
== CHECKSUM_SHA256: ghi789...
========================================================================================

export const utils = {
    // Utility functions
};

========================================================================================
== FILE: tests/index.test.js
== DUPE OF: src/index.js
========================================================================================

Parsing Bundle Files

Regular Expression Patterns

For parsing Standard style:

^=======\s+(.+?)\s+======$
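
Applied with `re.MULTILINE`, the Standard pattern is enough to split a bundle into files. A minimal sketch follows; `split_standard_bundle` is a hypothetical helper, and it assumes no line of file content happens to look like a separator.

```python
import re
from typing import List, Tuple

STANDARD_SEPARATOR = re.compile(r"^=======\s+(.+?)\s+======$", re.MULTILINE)


def split_standard_bundle(bundle: str) -> List[Tuple[str, str]]:
    """Return (path, content) pairs from a Standard-style bundle."""
    matches = list(STANDARD_SEPARATOR.finditer(bundle))
    files = []
    # Pair each separator with the next one; the last file runs to end of text.
    for current, following in zip(matches, matches[1:] + [None]):
        end = following.start() if following else len(bundle)
        content = bundle[current.end():end].strip("\n")
        files.append((current.group(1), content))
    return files


bundle = (
    "======= a.py ======\n"
    "print(1)\n"
    "\n"
    "======= b/c.txt ======\n"
    "hello\n"
    "world\n"
)
print(split_standard_bundle(bundle))
```

Stripping newlines around each content span discards leading and trailing blank lines, so this sketch is suited to inspection rather than byte-exact extraction.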

For parsing Detailed style:

^==\s+FILE:\s+(.+)$
^==\s+DATE:\s+(.+?)\s+\|\s+SIZE:\s+(.+?)\s+\|\s+TYPE:\s+(.+)$
^==\s+CHECKSUM_SHA256:\s+([a-fA-F0-9]{64})$
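
A line-by-line use of the Detailed patterns might look like the sketch below. `parse_detailed_header` is a hypothetical helper, and the `DUPE OF` pattern is not one of the patterns listed here; it is inferred from the deduplication examples.

```python
import re

FILE_RE = re.compile(r"^==\s+FILE:\s+(.+)$")
DATE_RE = re.compile(r"^==\s+DATE:\s+(.+?)\s+\|\s+SIZE:\s+(.+?)\s+\|\s+TYPE:\s+(.+)$")
CHECKSUM_RE = re.compile(r"^==\s+CHECKSUM_SHA256:\s+([a-fA-F0-9]{64})$")
DUPE_RE = re.compile(r"^==\s+DUPE OF:\s+(.+)$")  # assumption: inferred from the dedup examples


def parse_detailed_header(header_lines):
    """Collect the fields found in the '== ...' lines of one Detailed header."""
    info = {}
    for line in header_lines:
        line = line.rstrip("\r\n")  # tolerate both LF and CRLF separators
        if (m := FILE_RE.match(line)):
            info["filepath"] = m.group(1)
        elif (m := DATE_RE.match(line)):
            info["date"], info["size"], info["type"] = m.groups()
        elif (m := CHECKSUM_RE.match(line)):
            info["checksum_sha256"] = m.group(1).lower()
        elif (m := DUPE_RE.match(line)):
            info["dupe_of"] = m.group(1)
    return info


header = [
    "== FILE: path/to/file.py",
    "== DATE: 2025-05-15 14:30:21 | SIZE: 2.50 KB | TYPE: .py",
    "== CHECKSUM_SHA256: " + "abcdef1234567890" * 4,
]
print(parse_detailed_header(header))
```

The checksum pattern requires all 64 hex characters, so abbreviated checksums such as the `123abc...` placeholders used in this document's examples will not match it.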

For parsing MachineReadable blocks:

^---\s+PYMK1F_BEGIN_FILE_METADATA_BLOCK_([a-f0-9-]+)\s+---$
^---\s+PYMK1F_END_FILE_METADATA_BLOCK_([a-f0-9-]+)\s+---$

Example Parser (Python)

import json
import re

def parse_machine_readable_bundle(content):
    """Parse a MachineReadable format bundle into a list of metadata dicts."""
    files = []

    # Pattern for metadata blocks; \1 forces the END marker to carry the same UUID
    metadata_pattern = re.compile(
        r'--- PYMK1F_BEGIN_FILE_METADATA_BLOCK_(.+?) ---\n'
        r'METADATA_JSON:\n(.+?)\n'
        r'--- PYMK1F_END_FILE_METADATA_BLOCK_\1 ---',
        re.DOTALL
    )

    # Pattern for content blocks
    content_pattern = re.compile(
        r'--- PYMK1F_BEGIN_FILE_CONTENT_BLOCK_(.+?) ---\n'
        r'(.*?)\n'
        r'--- PYMK1F_END_FILE_CONTENT_BLOCK_\1 ---',
        re.DOTALL
    )

    # Index every content block by UUID so that each metadata block is paired
    # with its own content (a plain search() would always return the first block)
    contents = {m.group(1): m.group(2) for m in content_pattern.finditer(content)}

    # Extract all metadata and attach the matching content, if any
    for match in metadata_pattern.finditer(content):
        uuid = match.group(1)
        metadata = json.loads(match.group(2))
        if uuid in contents:
            metadata['content'] = contents[uuid]
        files.append(metadata)

    return files

Version Compatibility

v3.2+: Current format as documented
v3.1: Standard separator included SHA256 in the separator line
v2.x: Different metadata format, fewer separator styles
s1f tool: Can parse all m1f bundle formats

Best Practices

Choose appropriate separator style:
- Standard for AI tools (minimal tokens)
- Detailed for debugging and verification
- Markdown for documentation
- MachineReadable for automation
- None for simple concatenation

Deduplication considerations:
- Enable for large codebases with repeated files
- Disable when preserving file instances matters
- Check DUPE references when debugging

Encoding handling:
- Use UTF-8 output for maximum compatibility
- Preserve original encoding when needed
- Include encoding metadata for processing tools

Bundle organization:
- Group related files together
- Use consistent source paths
- Include metadata for traceability