Overview
git-repo-parser scrapes every file from a GitHub repository and converts the result into one of three formats: JSON, Token-Oriented Object Notation (TOON), or RepoScript (an LLM-first transcript format). It is intended for feeding codebases to AI models and for creating structured repository snapshots.
Installation
Install globally via npm:
npm install -g git-repo-parser
Or add to your project:
npm install git-repo-parser
Usage
The package provides three CLI commands:
1. JSON Export
git-repo-to-json https://github.com/username/repo-name.git
Scrapes the repository and saves the result as files.json, which contains structured data for each file.
2. TOON Export (Token-Oriented Object Notation)
git-repo-to-toon https://github.com/username/repo-name.git
Optimized format for token-efficient LLM consumption.
3. RepoScript Transcript (LLM-First Format)
# Basic transcript without metadata
git-repo-to-text https://github.com/username/repo-name.git --format=transcript
# Transcript with metadata lines and token count
git-repo-to-text https://github.com/username/repo-name.git --format=transcript --meta --tokens
# Other formats through the same command
git-repo-to-text https://github.com/username/repo-name.git --format=json
git-repo-to-text https://github.com/username/repo-name.git --format=toon
Command Options
- --format=<format>: Output format (json, toon, transcript)
- --meta / --no-meta: Toggle RepoScript metadata lines (default: no metadata)
- --tokens / -t: Print token count using CL100K vocabulary
- --token-count: Alias for --tokens
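To make the interaction between these flags concrete, here is a minimal sketch of how they could resolve into an options object. It is not the package's actual parser: `CliOptions` and `parseFlags` are hypothetical names, and the sketch leaves the format undefined when `--format` is absent because the default is not documented here.

```typescript
// Hypothetical sketch of flag resolution; not taken from git-repo-parser's source.
type OutputFormat = "json" | "toon" | "transcript";

interface CliOptions {
  format: OutputFormat | undefined; // undefined when --format is not passed
  meta: boolean;                    // RepoScript metadata lines (documented default: off)
  tokens: boolean;                  // print a token count
}

const FORMATS: readonly OutputFormat[] = ["json", "toon", "transcript"];

function parseFlags(argv: string[]): CliOptions {
  const opts: CliOptions = { format: undefined, meta: false, tokens: false };
  for (const arg of argv) {
    if (arg.startsWith("--format=")) {
      const value = arg.slice("--format=".length);
      if (!FORMATS.includes(value as OutputFormat)) {
        throw new Error(`Unknown format: ${value}`);
      }
      opts.format = value as OutputFormat;
    } else if (arg === "--meta") {
      opts.meta = true;
    } else if (arg === "--no-meta") {
      opts.meta = false;
    } else if (arg === "--tokens" || arg === "-t" || arg === "--token-count") {
      // All three spellings are documented as equivalent.
      opts.tokens = true;
    }
    // Anything else (such as the repository URL) is ignored in this sketch.
  }
  return opts;
}
```

In this sketch the last of `--meta` and `--no-meta` wins, which is a common convention but an assumption here.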
Output Formats
JSON Format
Structured JSON with file paths, contents, and metadata:
{
"files": [
{
"path": "src/index.ts",
"content": "...",
"size": 1234,
"language": "TypeScript"
}
]
}
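As an illustration of consuming this output, the sketch below types the structure shown above and totals file sizes per language. The interface names and the `bytesByLanguage` helper are hypothetical, and the field list is inferred only from the sample, so the real files.json may carry additional fields.

```typescript
// Shape inferred from the sample above; field names beyond it are not assumed.
interface RepoFile {
  path: string;
  content: string;
  size: number;     // bytes, as in the sample
  language: string;
}

interface RepoSnapshot {
  files: RepoFile[];
}

// Sum file sizes per language (hypothetical helper, not part of the package).
function bytesByLanguage(snapshot: RepoSnapshot): Map<string, number> {
  const totals = new Map<string, number>();
  for (const file of snapshot.files) {
    totals.set(file.language, (totals.get(file.language) ?? 0) + file.size);
  }
  return totals;
}

// Inline stand-in for the contents of files.json.
const sample = `{
  "files": [
    { "path": "src/index.ts", "content": "...", "size": 1234, "language": "TypeScript" },
    { "path": "src/cli.ts", "content": "...", "size": 766, "language": "TypeScript" },
    { "path": "README.md", "content": "...", "size": 420, "language": "Markdown" }
  ]
}`;

const snapshot = JSON.parse(sample) as RepoSnapshot;
const totals = bytesByLanguage(snapshot);
totals.forEach((bytes, language) => console.log(`${language}: ${bytes} bytes`));
```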
TOON Format (Token-Oriented Object Notation)
Compact, token-efficient format optimized for LLM context windows.
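To give a feel for why TOON is compact, the sketch below encodes a uniform list of records in a tabular form, where field names are declared once in a header and each record becomes one delimited row. This is a simplified illustration under stated assumptions: `encodeTable` and `encodeValue` are hypothetical helpers, the quoting rule is deliberately conservative, and the actual output of git-repo-to-toon may differ.

```typescript
// Simplified sketch of a TOON-style tabular array; not the package's encoder.
type Scalar = string | number | boolean | null;

// Quote strings that could be misread as structure, numbers, or keywords.
function encodeValue(value: Scalar): string {
  if (typeof value !== "string") return String(value);
  const needsQuotes =
    value === "" ||
    value !== value.trim() ||
    /[",:\\\n\r\t\[\]{}]/.test(value) ||
    /^(true|false|null)$/.test(value) ||
    /^-?\d/.test(value) ||
    value.startsWith("-");
  return needsQuotes ? JSON.stringify(value) : value;
}

// Emit `key[N]{field,...}:` followed by one indented, comma-separated row per record.
function encodeTable(key: string, rows: Record<string, Scalar>[]): string {
  if (rows.length === 0) return `${key}[0]:`;
  const fields = Object.keys(rows[0]);
  const header = `${key}[${rows.length}]{${fields.join(",")}}:`;
  const body = rows.map(
    (row) => "  " + fields.map((field) => encodeValue(row[field] ?? null)).join(",")
  );
  return [header, ...body].join("\n");
}

console.log(
  encodeTable("files", [
    { path: "src/index.ts", size: 1234, language: "TypeScript" },
    { path: "README.md", size: 42, language: "Markdown" },
  ])
);
```

Compared with the JSON form, the keys `path`, `size`, and `language` appear once rather than once per file, which is where most of the token savings in this layout would come from.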
RepoScript Format
LLM-first transcript format with optional metadata:
=== File: src/index.ts ===
[TypeScript, 1234 bytes, 45 lines]
<file content>
=== File: README.md ===
...
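The layout above can be reproduced with a few lines of code. The following is a minimal sketch, assuming each file record carries the same fields as the JSON export; `renderTranscript` is a hypothetical helper rather than the package's implementation, and counting lines by splitting on newlines is an assumption.

```typescript
// Hypothetical renderer for the transcript layout shown above.
interface TranscriptFile {
  path: string;
  content: string;
  size: number;     // bytes
  language: string;
}

function renderTranscript(files: TranscriptFile[], includeMeta: boolean): string {
  const sections = files.map((file) => {
    const lines = [`=== File: ${file.path} ===`];
    if (includeMeta) {
      // Assumption: line count is the number of newline-separated segments.
      const lineCount = file.content === "" ? 0 : file.content.split("\n").length;
      lines.push(`[${file.language}, ${file.size} bytes, ${lineCount} lines]`);
    }
    lines.push(file.content);
    return lines.join("\n");
  });
  return sections.join("\n");
}

const demo: TranscriptFile[] = [
  {
    path: "src/index.ts",
    content: "const x = 1;\nexport default x;",
    size: 30,
    language: "TypeScript",
  },
];

console.log(renderTranscript(demo, true));
```

Passing `false` as the second argument omits the bracketed metadata line, mirroring the documented default of no metadata.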
Benchmark Suite
Run the bundled benchmark to evaluate scrape runtime and token usage:
npm run build
npm run benchmark
Benchmark Outputs
Results are saved under benchmark/:
- results.json: Machine-readable summary (durations, token counts, output sizes)
- results.md: Markdown report per repository/format
- *.preview.txt: First 100 lines of each export for spot-checking
Programmatic Usage
import { scrapeRepo } from 'git-repo-parser';
const result = await scrapeRepo('https://github.com/username/repo.git', {
format: 'json',
includeMeta: true,
calculateTokens: true
});
console.log(`Files: ${result.files.length}`);
console.log(`Tokens: ${result.tokenCount}`);
Use Cases
- LLM Context Preparation: Feed entire codebases to AI models
- Repository Analysis: Analyze code patterns and structure
- Documentation Generation: Extract and process repository contents
- Code Search: Create searchable repository snapshots
- Backup & Archiving: Create structured backups of Git repositories
Technical Highlights
- Token Counting: Uses CL100K vocabulary for accurate token estimation
- Format Flexibility: Multiple export formats for different use cases
- Performance Optimized: Efficient file processing and memory management
- CLI & Programmatic: Works as both CLI tool and Node.js library
- TypeScript Support: Full TypeScript definitions included
Features
- Clone and parse any public GitHub repository
- Multiple output formats (JSON, TOON, RepoScript)
- Token counting with CL100K vocabulary
- Metadata inclusion (file size, language, line count)
- Benchmark suite for performance testing
- Programmatic API for Node.js integration
- CLI commands for quick repository processing
Performance
Approximate processing times by repository size:
- Small repos (<100 files): ~1-2 seconds
- Medium repos (100-1000 files): ~3-10 seconds
- Large repos (1000+ files): ~15-60 seconds
Token counts are available for all formats to help estimate LLM context usage.
License
MIT License. See LICENSE for details.