Bundle Format Specification

Technical specification of the m1f bundle file format

The m1f tool generates bundle files that combine multiple source files into a single output file. This document provides a complete technical specification of the bundle format, including separator styles, metadata structure, and deduplication mechanisms.

Overview

An m1f bundle consists of:

  • File separators that mark boundaries between files
  • File metadata containing information about each file
  • File content with optional processing applied
  • Deduplication references for files with identical content

Separator Styles

m1f supports five different separator styles, each designed for specific use cases:

Standard Style

A simple, concise separator showing only the file path:

======= path/to/file.py ======

Use cases:

  • Quick file navigation
  • Minimal overhead for AI tools
  • Human-readable bundles
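
As a minimal sketch, a Standard-style bundle can be split back into files by scanning for separator lines. The function name `split_standard_bundle` is hypothetical and not part of m1f, and the code assumes that no line of file content happens to look like a separator.

```python
import re

# Separator as shown above: seven '=' before the path, six after.
SEPARATOR = re.compile(r"^=======\s+(.+?)\s+======$")

def split_standard_bundle(text):
    """Return {path: content} for a Standard-style bundle (illustrative only)."""
    files = {}
    current_path = None
    buffer = []
    for line in text.splitlines():
        match = SEPARATOR.match(line)
        if match:
            # A new separator closes the previous file, if any.
            if current_path is not None:
                files[current_path] = "\n".join(buffer).strip("\n")
            current_path = match.group(1)
            buffer = []
        elif current_path is not None:
            buffer.append(line)
    # Flush the last file in the bundle.
    if current_path is not None:
        files[current_path] = "\n".join(buffer).strip("\n")
    return files
```

Leading and trailing blank lines around each file are stripped here for convenience; a tool that must reproduce files byte for byte would need to know exactly which blank lines the bundler adds.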

Detailed Style (Default)

Comprehensive separator including full metadata:

========================================================================================
== FILE: path/to/file.py
== DATE: 2025-05-15 14:30:21 | SIZE: 2.50 KB | TYPE: .py
== CHECKSUM_SHA256: abcdef1234567890abcdef1234567890abcdef1234567890abcdef1234567890
========================================================================================

Metadata fields:

  • FILE: Relative path from the source directory
  • DATE: Last modification timestamp (local timezone)
  • SIZE: Human-readable file size
  • TYPE: File extension
  • CHECKSUM_SHA256: SHA-256 hash of the file content
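
To make the field definitions concrete, the following sketch builds a Detailed-style header from a path, the raw file bytes, and a modification time. It is not m1f's own code: `human_size` and `detailed_separator` are hypothetical names, the rule width is a guess, and the 1024-based, two-decimal size formatting is an assumption inferred from the `2.50 KB` example.

```python
import hashlib
import os
from datetime import datetime

RULE = "=" * 88  # rule width is an assumption; check it against a real bundle

def human_size(num_bytes):
    """Format a byte count with 1024-based units (assumed convention)."""
    if num_bytes < 1024:
        return f"{num_bytes} B"
    size = num_bytes / 1024
    for unit in ("KB", "MB", "GB"):
        if size < 1024 or unit == "GB":
            return f"{size:.2f} {unit}"
        size /= 1024

def detailed_separator(path, data, mtime):
    """Build a Detailed-style header for one file (sketch only)."""
    checksum = hashlib.sha256(data).hexdigest()  # 64 hex characters
    extension = os.path.splitext(path)[1]        # includes the dot, e.g. ".py"
    stamp = mtime.strftime("%Y-%m-%d %H:%M:%S")
    return "\n".join([
        RULE,
        f"== FILE: {path}",
        f"== DATE: {stamp} | SIZE: {human_size(len(data))} | TYPE: {extension}",
        f"== CHECKSUM_SHA256: {checksum}",
        RULE,
    ])
```

Calling `detailed_separator("path/to/file.py", data, datetime(2025, 5, 15, 14, 30, 21))` with 2560 bytes of data yields the same `DATE`, `SIZE`, and `TYPE` line as the example above.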

Markdown Style

Formats metadata as Markdown with proper code blocks:

## path/to/file.py

**Date Modified:** 2025-05-15 14:30:21 | **Size:** 2.50 KB | **Type:** .py | **Checksum (SHA256):** abcdef1234567890...

```python
# File content starts here
def example():
    return "Hello, world!"
```

Features:

  • Uses file extension for syntax highlighting
  • Renders nicely in Markdown viewers
  • Ideal for documentation bundles

MachineReadable Style

A robust format for automated parsing and processing:

--- PYMK1F_BEGIN_FILE_METADATA_BLOCK_12345678-1234-1234-1234-123456789abc ---
METADATA_JSON:
{
    "original_filepath": "path/to/file.py",
    "original_filename": "file.py",
    "timestamp_utc_iso": "2025-05-15T14:30:21Z",
    "type": ".py",
    "size_bytes": 2560,
    "checksum_sha256": "abcdef1234567890abcdef1234567890abcdef1234567890abcdef1234567890",
    "encoding": "utf-8",
    "original_encoding": "utf-8"
}
--- PYMK1F_END_FILE_METADATA_BLOCK_12345678-1234-1234-1234-123456789abc ---
--- PYMK1F_BEGIN_FILE_CONTENT_BLOCK_12345678-1234-1234-1234-123456789abc ---

# File content here

--- PYMK1F_END_FILE_CONTENT_BLOCK_12345678-1234-1234-1234-123456789abc ---

Features:

  • UUID-based block markers for reliable parsing
  • JSON metadata for easy extraction
  • Timestamps in UTC ISO format
  • Encoding information included
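
The sketch below shows how one file could be wrapped in this layout. It assumes that a fresh random UUID is generated per file and reused in all four markers, which is what the example suggests but is not confirmed here; `machine_readable_block` is a hypothetical helper, and the blank lines around the content in the example are not reproduced.

```python
import json
import uuid

def machine_readable_block(metadata, content):
    """Wrap one file in MachineReadable markers (illustrative sketch)."""
    block_id = str(uuid.uuid4())  # assumption: one UUID shared by all four markers
    return "\n".join([
        f"--- PYMK1F_BEGIN_FILE_METADATA_BLOCK_{block_id} ---",
        "METADATA_JSON:",
        json.dumps(metadata, indent=4),
        f"--- PYMK1F_END_FILE_METADATA_BLOCK_{block_id} ---",
        f"--- PYMK1F_BEGIN_FILE_CONTENT_BLOCK_{block_id} ---",
        content,
        f"--- PYMK1F_END_FILE_CONTENT_BLOCK_{block_id} ---",
    ])
```

Because the identifier is unpredictable, file content is very unlikely to contain a line that collides with its own end marker, which is what makes this style safer to parse than the fixed separators.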

None Style

Files are concatenated directly without any separators:

# First file content
def function1():
    pass
# Second file content
def function2():
    pass

Use cases:

  • Creating single concatenated files
  • Minimal processing needed
  • When file boundaries don’t matter

Content Deduplication

By default, m1f deduplicates files with identical content to save space and reduce token usage.

Deduplication Format

When a duplicate file is encountered:

========================================================================================
== FILE: src/components/Header.js
== DUPE OF: src/widgets/Header.js
========================================================================================

Behavior:

  • First occurrence includes full content
  • Subsequent duplicates show reference only
  • Based on SHA-256 checksum matching
  • Can be disabled with --allow-duplicate-files
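
The behavior above amounts to remembering the first path seen for each checksum. A minimal sketch, using a hypothetical `deduplicate` function rather than m1f's actual implementation:

```python
import hashlib

def deduplicate(files):
    """Split (path, bytes) pairs into unique files and DUPE OF references."""
    first_seen = {}   # checksum -> path of the first file with that content
    unique = []       # (path, data) pairs written with full content
    duplicates = []   # (path, original_path) pairs written as references
    for path, data in files:
        checksum = hashlib.sha256(data).hexdigest()
        if checksum in first_seen:
            duplicates.append((path, first_seen[checksum]))
        else:
            first_seen[checksum] = path
            unique.append((path, data))
    return unique, duplicates
```

Which file counts as the "first occurrence" depends on the order in which files are processed, so the same source tree can produce different `DUPE OF` targets if that order changes.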

Deduplication Example

========================================================================================
== FILE: src/utils/constants.js
== DATE: 2025-05-15 10:00:00 | SIZE: 150 B | TYPE: .js
== CHECKSUM_SHA256: 123abc...
========================================================================================

export const API_KEY = "example";
export const API_URL = "https://api.example.com";

========================================================================================
== FILE: lib/shared/constants.js
== DUPE OF: src/utils/constants.js
========================================================================================

========================================================================================
== FILE: backup/constants.js
== DUPE OF: src/utils/constants.js
========================================================================================

Metadata Structure

Core Metadata Fields

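These are the core fields a separator can carry. Detailed, Markdown, and MachineReadable show most of them; Standard shows only the file path, and None shows nothing: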

Field           | Description                         | Format
----------------|-------------------------------------|-----------------------------------
filepath        | Relative path from source directory | String
filename        | Base filename                       | String
timestamp       | File modification time              | ISO 8601 or Unix timestamp
size            | File size                           | Bytes (number) or human-readable
type            | File extension                      | String (including dot)
checksum_sha256 | Content hash                        | 64-character hex string

Extended Metadata (MachineReadable)

Additional fields in JSON format:

Field             | Description              | Example
------------------|--------------------------|-------------------
encoding          | Output encoding          | "utf-8"
original_encoding | Detected source encoding | "windows-1252"
line_count        | Number of lines          | 150
binary            | Is binary file           | false
mime_type         | MIME type (if detected)  | "text/javascript"

Encoding Handling

Character Encoding

  • Default output encoding: UTF-8
  • Original encoding detected and preserved in metadata
  • Configurable with --convert-to-charset
  • Supported encodings: utf-8, utf-16, ascii, latin-1, cp1252
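
The relationship between the output encoding and the original encoding can be sketched as a decode followed by a re-encode, with both names recorded for the metadata. `convert_encoding` is a hypothetical helper; how m1f detects the source encoding is not covered here, so the detected name is simply passed in.

```python
def convert_encoding(raw, original_encoding, target_encoding="utf-8"):
    """Re-encode raw bytes and report both encodings (sketch only)."""
    text = raw.decode(original_encoding)
    metadata = {
        "encoding": target_encoding,            # what the bundle contains
        "original_encoding": original_encoding,  # what the source file used
    }
    return text.encode(target_encoding), metadata
```

For example, the cp1252 byte `0xE9` (é) becomes the two-byte UTF-8 sequence `0xC3 0xA9`, while `original_encoding` still records `cp1252` so a tool can restore the source bytes later.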

Line Endings

  • Preserved from source files by default
  • Separator line endings configurable with --line-ending
  • Options: lf (Unix) or crlf (Windows)

Binary File Handling

When --include-binary-files is enabled:

========================================================================================
== FILE: images/logo.png
== DATE: 2025-05-15 12:00:00 | SIZE: 45.2 KB | TYPE: .png
== BINARY FILE: Content read as UTF-8 with errors ignored
========================================================================================

[Binary content appears as garbled text]

Note: Binary content is not suitable for text processing and significantly increases bundle size.
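
The header text "read as UTF-8 with errors ignored" corresponds to Python's `errors="ignore"` decoding mode. The sketch below, with a hypothetical `read_as_text` helper, shows why the result is garbled and cannot be converted back to the original file:

```python
def read_as_text(data):
    """Decode raw bytes as UTF-8, silently dropping undecodable bytes."""
    return data.decode("utf-8", errors="ignore")

# The 8-byte PNG signature starts with 0x89, which is not valid UTF-8.
png_signature = b"\x89PNG\r\n\x1a\n"
text = read_as_text(png_signature)
# The invalid byte is gone, so re-encoding no longer yields the original.
lossy = text.encode("utf-8") != png_signature
```

Here `text` is `"PNG\r\n\x1a\n"` and `lossy` is `True`: one byte was discarded, so the embedded content is for inspection only.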

Special Processing

Scraped Content Metadata Removal

When --remove-scraped-metadata is enabled, m1f removes metadata blocks from HTML2MD files:

# Original file with metadata
Article content here...

---
**Source:** https://example.com/article
**Scraped:** 2025-05-15 10:00:00
---

# After processing
Article content here...
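
A rough sketch of this kind of cleanup follows. The exact matching rules m1f applies are not documented here, so the pattern is an assumption based only on the example: a trailing block fenced by `---` lines whose first entry is `**Source:**`. The name `remove_scraped_metadata` is hypothetical.

```python
import re

# Assumed shape: '---', a '**Source:** ...' line, optional '**Key:** ...' lines,
# then a closing '---' at the very end of the file.
SCRAPED_BLOCK = re.compile(
    r"\n---\n\*\*Source:\*\* .*\n(?:\*\*[^*\n]+:\*\* .*\n)*---\s*\Z"
)

def remove_scraped_metadata(text):
    """Strip a trailing scraped-metadata block, if one is present."""
    return SCRAPED_BLOCK.sub("", text).rstrip() + "\n"
```

Anchoring the pattern to the end of the file keeps ordinary `---` horizontal rules elsewhere in the article untouched.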

Content Processors

When using presets, files can have processors applied:

per_file_settings:
  "*.min.js":
    processors:
      - minify_content

This affects the content but not the metadata structure.

Bundle File Structure Example

A complete bundle showing various features:

========================================================================================
== FILE: README.md
== DATE: 2025-05-15 09:00:00 | SIZE: 1.2 KB | TYPE: .md
== CHECKSUM_SHA256: abc123...
========================================================================================

# My Project

This is the main documentation...

========================================================================================
== FILE: src/index.js
== DATE: 2025-05-15 10:30:00 | SIZE: 3.5 KB | TYPE: .js
== CHECKSUM_SHA256: def456...
========================================================================================

import { utils } from './utils';

export function main() {
    // Main application code
}

========================================================================================
== FILE: src/utils.js
== DATE: 2025-05-15 10:00:00 | SIZE: 2.1 KB | TYPE: .js
== CHECKSUM_SHA256: ghi789...
========================================================================================

export const utils = {
    // Utility functions
};

========================================================================================
== FILE: tests/index.test.js
== DUPE OF: src/index.js
========================================================================================

Parsing Bundle Files

Regular Expression Patterns

For parsing Standard style:

^=======\s+(.+?)\s+======$

For parsing Detailed style:

^==\s+FILE:\s+(.+)$
^==\s+DATE:\s+(.+?)\s+\|\s+SIZE:\s+(.+?)\s+\|\s+TYPE:\s+(.+)$
^==\s+CHECKSUM_SHA256:\s+([a-fA-F0-9]{64})$
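
The Detailed-style patterns above can be applied line by line, as in this sketch. `parse_detailed_header` is a hypothetical helper, and the `DUPE OF` pattern is an assumption inferred from the deduplication examples, since no pattern for it is given here. The assignment expressions require Python 3.8 or later.

```python
import re

FILE_RE = re.compile(r"^==\s+FILE:\s+(.+)$")
DATE_RE = re.compile(
    r"^==\s+DATE:\s+(.+?)\s+\|\s+SIZE:\s+(.+?)\s+\|\s+TYPE:\s+(.+)$"
)
CHECKSUM_RE = re.compile(r"^==\s+CHECKSUM_SHA256:\s+([a-fA-F0-9]{64})$")
DUPE_RE = re.compile(r"^==\s+DUPE OF:\s+(.+)$")  # assumed, not from the spec

def parse_detailed_header(lines):
    """Collect fields from the '== ' lines of one Detailed-style header."""
    info = {}
    for line in lines:
        if m := FILE_RE.match(line):
            info["filepath"] = m.group(1)
        elif m := DATE_RE.match(line):
            info["date"], info["size"], info["type"] = m.groups()
        elif m := CHECKSUM_RE.match(line):
            info["checksum_sha256"] = m.group(1)
        elif m := DUPE_RE.match(line):
            info["duplicate_of"] = m.group(1)
    return info
```

Rule lines made only of `=` characters match none of the patterns and are skipped, so a whole header including its rules can be passed in unchanged.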

For parsing MachineReadable blocks:

^---\s+PYMK1F_BEGIN_FILE_METADATA_BLOCK_([a-f0-9-]+)\s+---$
^---\s+PYMK1F_END_FILE_METADATA_BLOCK_([a-f0-9-]+)\s+---$

Example Parser (Python)

import json
import re

def parse_machine_readable_bundle(content):
    """Parse a MachineReadable format bundle."""
    files = []
    
    # Pattern for metadata blocks
    metadata_pattern = re.compile(
        r'--- PYMK1F_BEGIN_FILE_METADATA_BLOCK_(.+?) ---\n'
        r'METADATA_JSON:\n(.+?)\n'
        r'--- PYMK1F_END_FILE_METADATA_BLOCK_\1 ---',
        re.DOTALL
    )
    
    # Pattern for content blocks
    content_pattern = re.compile(
        r'--- PYMK1F_BEGIN_FILE_CONTENT_BLOCK_(.+?) ---\n'
        r'(.*?)\n'
        r'--- PYMK1F_END_FILE_CONTENT_BLOCK_\1 ---',
        re.DOTALL
    )
    
    # Index every content block by its UUID so each metadata block is
    # paired with its own content, not just the first block in the bundle
    contents = {
        m.group(1): m.group(2)
        for m in content_pattern.finditer(content)
    }

    # Extract all metadata and attach the matching content
    for match in metadata_pattern.finditer(content):
        block_id = match.group(1)
        metadata = json.loads(match.group(2))

        if block_id in contents:
            metadata['content'] = contents[block_id]

        files.append(metadata)
    
    return files

Version Compatibility

  • v3.2+: Current format as documented
  • v3.1: Standard separator included SHA256 in the separator line
  • v2.x: Different metadata format, fewer separator styles
  • s1f tool: Can parse all m1f bundle formats

Best Practices

  1. Choose appropriate separator style:

    • Standard for AI tools (minimal tokens)
    • Detailed for debugging and verification
    • Markdown for documentation
    • MachineReadable for automation
    • None for simple concatenation
  2. Deduplication considerations:

    • Enable for large codebases with repeated files
    • Disable when preserving file instances matters
    • Check DUPE references when debugging
  3. Encoding handling:

    • Use UTF-8 output for maximum compatibility
    • Preserve original encoding when needed
    • Include encoding metadata for processing tools
  4. Bundle organization:

    • Group related files together
    • Use consistent source paths
    • Include metadata for traceability
