git-repo-parser

Overview

git-repo-parser is a powerful tool to scrape all files from a GitHub repository and convert them into multiple formats: JSON, Token-Oriented Object Notation (TOON), or RepoScript (an LLM-first transcript format). Perfect for feeding codebases to AI models or creating structured repository snapshots.

Installation

Install globally via npm:

npm install -g git-repo-parser

Or add to your project:

npm install git-repo-parser

Usage

The package provides three CLI commands:

1. JSON Export

git-repo-to-json https://github.com/username/repo-name.git

Scrapes repository and saves as files.json with structured file data.

2. TOON Export (Token-Oriented Object Notation)

git-repo-to-toon https://github.com/username/repo-name.git

Optimized format for token-efficient LLM consumption.

3. RepoScript Transcript (LLM-First Format)

# Basic transcript without metadata
git-repo-to-text https://github.com/username/repo-name.git --format=transcript

# Transcript with metadata lines and token count
git-repo-to-text https://github.com/username/repo-name.git --format=transcript --meta --tokens

# Alternate syntaxes
git-repo-to-text https://github.com/username/repo-name.git --format=json
git-repo-to-text https://github.com/username/repo-name.git --format=toon

Command Options

  • --format=<format>: Output format (json, toon, transcript)
  • --meta / --no-meta: Toggle RepoScript metadata lines (default: no metadata)
  • --tokens / -t: Print token count using CL100K vocabulary
  • --token-count: Alias for --tokens

Output Formats

JSON Format

Structured JSON with file paths, contents, and metadata:

{
  "files": [
    {
      "path": "src/index.ts",
      "content": "...",
      "size": 1234,
      "language": "TypeScript"
    }
  ]
}

TOON Format (Token-Oriented Object Notation)

Compact, token-efficient format optimized for LLM context windows.

RepoScript Format

LLM-first transcript format with optional metadata:

=== File: src/index.ts ===
[TypeScript, 1234 bytes, 45 lines]

<file content>

=== File: README.md ===
...

Benchmark Suite

Run the bundled benchmark to evaluate scrape runtime and token usage:

npm run build
npm run benchmark

Benchmark Outputs

Results are saved under benchmark/:

  • results.json – Machine-readable summary (durations, token counts, output sizes)
  • results.md – Markdown report per repository/format
  • *.preview.txt – First 100 lines of each export for spot-checking

Programmatic Usage

import { scrapeRepo } from 'git-repo-parser';

const result = await scrapeRepo('https://github.com/username/repo.git', {
  format: 'json',
  includeMeta: true,
  calculateTokens: true
});

console.log(`Files: ${result.files.length}`);
console.log(`Tokens: ${result.tokenCount}`);

Use Cases

  • πŸ€– LLM Context Preparation: Feed entire codebases to AI models
  • πŸ“Š Repository Analysis: Analyze code patterns and structure
  • πŸ“ Documentation Generation: Extract and process repository contents
  • πŸ” Code Search: Create searchable repository snapshots
  • πŸ“¦ Backup & Archiving: Create structured backups of Git repositories

Technical Highlights

  • Token Counting: Uses CL100K vocabulary for accurate token estimation
  • Format Flexibility: Multiple export formats for different use cases
  • Performance Optimized: Efficient file processing and memory management
  • CLI & Programmatic: Works as both CLI tool and Node.js library
  • TypeScript Support: Full TypeScript definitions included

Features

βœ… Clone and parse any public GitHub repository βœ… Multiple output formats (JSON, TOON, RepoScript) βœ… Token counting with CL100K vocabulary βœ… Metadata inclusion (file size, language, line count) βœ… Benchmark suite for performance testing βœ… Programmatic API for Node.js integration βœ… CLI commands for quick repository processing

Performance

Efficiently processes repositories of all sizes:

  • Small repos (<100 files): ~1-2 seconds
  • Medium repos (100-1000 files): ~3-10 seconds
  • Large repos (1000+ files): ~15-60 seconds

Token counts provided for all formats to help estimate LLM context usage.


License: MIT License - see LICENSE for details

Β© 2026 Made withΒ Β by Arnab Chatterjee