code_skim

December 16, 2025

Transform source code by removing implementation details whilst preserving structure. Achieves 60-80% character reduction for optimising AI context windows.

Status

🔒 Disabled by default - Enable with ENABLE_ADDITIONAL_TOOLS=code_skim

⚠️ Platform Availability: Due to tree-sitter's CGO dependency, code_skim is only available on:

macOS (darwin) - included in GitHub release binaries
Linux AMD64 with CGO enabled - included in GitHub release binaries

Docker images (built with CGO_ENABLED=0 for minimal size) and the Linux ARM64 and Windows builds exclude this tool. If you need code_skim on those platforms, build from source with CGO enabled.

The code_skim tool uses tree-sitter to parse source code and strip function/method bodies whilst preserving signatures, types, and overall structure. Language is automatically detected from file extensions. Results are paginated to prevent overwhelming context windows.

Supported languages:

Python (.py)
Go (.go)
JavaScript (.js, .jsx)
TypeScript (.ts, .tsx)
Rust (.rs)
C (.c, .h)
C++ (.cpp, .cc, .cxx, .hpp, .hxx, .hh)
Bash (.sh, .bash)
HTML (.html, .htm)
CSS (.css)
Swift (.swift)
Java (.java)
YAML (.yml, .yaml)
HCL/Terraform (.hcl, .tf)
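
The extension-based detection described above can be pictured as a simple lookup table. The following is a minimal sketch, not the tool's actual implementation: the function name `detect_language`, the language identifiers other than "python", and the case-insensitive matching are all assumptions made for illustration.

```python
from pathlib import Path
from typing import Optional

# Extensions are taken from the supported-languages list; the language
# identifiers on the right are assumed names, not confirmed tool output.
EXTENSION_LANGUAGES = {
    ".py": "python", ".go": "go",
    ".js": "javascript", ".jsx": "javascript",
    ".ts": "typescript", ".tsx": "typescript",
    ".rs": "rust", ".c": "c", ".h": "c",
    ".cpp": "cpp", ".cc": "cpp", ".cxx": "cpp",
    ".hpp": "cpp", ".hxx": "cpp", ".hh": "cpp",
    ".sh": "bash", ".bash": "bash",
    ".html": "html", ".htm": "html", ".css": "css",
    ".swift": "swift", ".java": "java",
    ".yml": "yaml", ".yaml": "yaml",
    ".hcl": "hcl", ".tf": "hcl",
}


def detect_language(path: str) -> Optional[str]:
    """Return the assumed language name for a path, or None if unsupported."""
    # Lower-casing the suffix is an assumption; the tool may be case-sensitive.
    return EXTENSION_LANGUAGES.get(Path(path).suffix.lower())
```

A file whose extension is missing from the table yields `None`, which corresponds to the "unsupported language" case.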

Why Use code_skim?

When working with large codebases, you often don't need implementation details to understand architecture, APIs, or structure. The code_skim tool addresses the context attention problem:

Large contexts degrade model performance (attention dilution)
80% of the time, you don't need implementation details
Focus on what code does, not how it does it

Character reduction example:

Original: ~200,000 characters
Structure mode: ~60,000 characters (70% reduction)
Fits more code in limited context windows
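
The arithmetic behind the 70% figure can be checked with a few lines. This is an illustrative sketch; the helper name `reduction_percentage` is chosen to mirror the response field, and the exact rounding the tool applies is not documented here.

```python
def reduction_percentage(original_chars: int, transformed_chars: int) -> float:
    """Percentage of characters removed relative to the original size."""
    if original_chars <= 0:
        return 0.0  # avoid dividing by zero for empty input
    return (original_chars - transformed_chars) * 100 / original_chars


# 200,000 characters skimmed down to 60,000 characters
print(round(reduction_percentage(200_000, 60_000)))  # 70
```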

Parameters

Required

source (array): Array of file paths, directory paths, or glob patterns
- Single file: ["/path/to/file.py"]
- Directory: ["/path/to/directory"] (recursively finds supported files)
- Glob pattern: ["/path/to/**/*.py"] (matches using glob syntax)
- Multiple: ["/path/to/file1.py", "/path/to/file2.go", "/path/**/*.ts"]
- Multiple sources are automatically deduplicated

Optional

clear_cache (boolean): Clear cache entry before processing
- Default: false
starting_line (number): Line number to start from (1-based) for pagination
- Use when the previous response was truncated
- Take the value from the next_starting_line field of the truncated response
filter (array): Array of glob patterns to filter function/method/class names
- Single pattern: ["handle_*"], ["test_*"], ["*Controller"]
- Multiple patterns: ["handle_*", "process_*", "get*"]
- Inverse filter (exclusion): Prefix with ! (e.g., ["!temp_*"], ["!test_*"])
- Combined: ["handle_*", "!handle_temp*"] (include handle_* but exclude handle_temp*)
- Exclusions take priority over inclusions
- Returns matched_items, total_items, filtered_items counts in response
extract_graph (boolean): Extract relationship graph including imports, calls, and inheritance
- Default: false
- Adds graph field to file results with structured relationship data
output_format (string): Output format for the transformed code
- "json" (default): Standard JSON response
- "sigil": Compressed notation optimised for LLM context (see Sigil Format below)
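
The include/exclude rules for filter can be sketched as follows. This is an assumption-laden illustration rather than the tool's code: it uses Python's `fnmatch` as a stand-in for whatever glob matcher the tool uses, and it assumes that a filter made up only of exclusions keeps every name that is not excluded. The helper names `matches_filter` and `filter_counts` are hypothetical.

```python
from fnmatch import fnmatchcase


def matches_filter(name, patterns):
    """Decide whether a function/method/class name passes the filter."""
    includes = [p for p in patterns if not p.startswith("!")]
    excludes = [p[1:] for p in patterns if p.startswith("!")]
    # Exclusions take priority over inclusions.
    if any(fnmatchcase(name, p) for p in excludes):
        return False
    # Assumption: with no inclusion patterns, everything else is kept.
    if not includes:
        return True
    return any(fnmatchcase(name, p) for p in includes)


def filter_counts(names, patterns):
    """Mirror the matched_items / total_items / filtered_items counters."""
    matched = [n for n in names if matches_filter(n, patterns)]
    return {
        "matched_items": len(matched),
        "total_items": len(names),
        "filtered_items": len(names) - len(matched),
    }
```

With `["handle_*", "!handle_temp*"]`, a name such as `handle_request` is kept while `handle_temp_file` is dropped, because the exclusion is checked first.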

How It Works

The tool removes function/method bodies whilst preserving:

Function and method signatures
Class declarations
Type definitions
Overall code structure

Character reduction: 60-80%

Example:

# Before
def process_user(user):
    validated = validate_user(user)
    if not validated:
        raise ValueError("Invalid user")
    normalised = normalise_data(user)
    return save_to_database(normalised)

# After transformation
def process_user(user): { /* ... */ }
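
To make the idea of body stripping concrete, here is a small stand-in that uses Python's standard `ast` module instead of tree-sitter. It is only a sketch of the technique for Python sources: the real tool works across many languages through tree-sitter, and its placeholder is `{ /* ... */ }`, whereas this sketch emits `...`. The names `BodyStripper` and `skim` are hypothetical, and `ast.unparse` requires Python 3.9 or later.

```python
import ast


class BodyStripper(ast.NodeTransformer):
    """Replace every function or method body with a single `...` statement."""

    def _strip(self, node):
        # Nested functions inside the body are discarded along with it.
        node.body = [ast.Expr(value=ast.Constant(value=...))]
        return node

    visit_FunctionDef = _strip
    visit_AsyncFunctionDef = _strip


def skim(source: str) -> str:
    """Return the source with signatures and classes kept, bodies removed."""
    tree = BodyStripper().visit(ast.parse(source))
    ast.fix_missing_locations(tree)
    return ast.unparse(tree)


example = '''
class UserService:
    def process_user(self, user):
        validated = validate_user(user)
        return save_to_database(validated)
'''
print(skim(example))
```

Class declarations are left untouched because the transformer only rewrites function nodes, so methods keep their signatures while their bodies collapse to a placeholder.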

Line Limiting

By default, results are limited to 10,000 lines per file to prevent overwhelming context windows. When results exceed this limit:

Response includes truncated: true
total_lines shows the total line count of the transformed output
returned_lines shows how many lines were returned
next_starting_line specifies where to continue from

Configure the limit with the CODE_SKIM_MAX_LINES environment variable.
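
The truncation fields fit together as in the sketch below. It is a simplified model, not the tool's code: `paginate` is a hypothetical helper, and it assumes that total_lines counts lines of the transformed output and that next_starting_line is omitted on the final page.

```python
def paginate(lines, starting_line=1, max_lines=10_000):
    """Return one page of lines plus the pagination fields."""
    total = len(lines)
    start = max(starting_line, 1) - 1  # starting_line is 1-based
    page = lines[start:start + max_lines]
    end = start + len(page)
    result = {
        "transformed": "\n".join(page),
        "truncated": end < total,
        "total_lines": total,
        "returned_lines": len(page),
    }
    if result["truncated"]:
        # The caller passes this value as starting_line on the next request.
        result["next_starting_line"] = end + 1
    return result
```

For a 25,000-line result, three calls with starting_line set to 1, 10001, and 20001 would cover the whole output, and only the last response would report truncated as false.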

Examples

Transform a single file

{
  "source": ["/path/to/src/api.py"]
}

Transform all supported files in a directory

{
  "source": ["/path/to/src"]
}

Transform files matching a glob pattern

{
  "source": ["/path/to/src/**/*.ts"]
}

Clear cache and re-process

{
  "source": ["/path/to/app.js"],
  "clear_cache": true
}

Paginate through a large file

{
  "source": ["/path/to/large_file.py"],
  "starting_line": 10001
}

Filter by function name pattern

{
  "source": ["/path/to/api.py"],
  "filter": ["handle_*"]
}

Show only test functions

{
  "source": ["/path/to/tests.py"],
  "filter": ["test_*"]
}

Multiple source files

{
  "source": [
    "/path/to/api.py",
    "/path/to/handlers.py",
    "/path/to/models.py"
  ]
}

Multiple filter patterns

{
  "source": ["/path/to/api.py"],
  "filter": ["handle_*", "process_*", "validate_*"]
}

Exclude specific patterns (inverse filter)

{
  "source": ["/path/to/api.py"],
  "filter": ["handle_*", "!handle_temp*"]
}

Show everything except test functions

{
  "source": ["/path/to/src"],
  "filter": ["!test_*"]
}

Response Format

Single File

{
  "files": [
    {
      "path": "/path/to/api.py",
      "transformed": "def hello(name): { /* ... */ }",
      "language": "python",
      "from_cache": false,
      "truncated": false,
      "total_lines": 8,
      "returned_lines": 8,
      "reduction_percentage": 65
    }
  ],
  "total_files": 1,
  "processed_files": 1,
  "failed_files": 0,
  "processing_time_ms": 15
}

With Filtering

{
  "files": [
    {
      "path": "/path/to/api.py",
      "transformed": "def handle_request(): { /* ... */ }\ndef handle_response(): { /* ... */ }",
      "language": "python",
      "from_cache": false,
      "truncated": false,
      "total_lines": 4,
      "returned_lines": 4,
      "reduction_percentage": 75,
      "matched_items": 2,
      "total_items": 10,
      "filtered_items": 8
    }
  ],
  "total_files": 1,
  "processed_files": 1,
  "failed_files": 0,
  "processing_time_ms": 18
}

Truncated Response (Pagination)

{
  "files": [
    {
      "path": "/path/to/large_file.py",
      "transformed": "...first 10,000 lines...",
      "language": "python",
      "from_cache": false,
      "truncated": true,
      "total_lines": 25000,
      "returned_lines": 10000,
      "next_starting_line": 10001
    }
  ],
  "total_files": 1,
  "processed_files": 1,
  "failed_files": 0
}

Response Fields:

files: Array of file results
- path: Absolute file path
- transformed: Transformed source code
- language: Detected language
- from_cache: Whether result came from cache
- truncated: Whether output was truncated due to line limit
- total_lines: Total line count of transformed output
- returned_lines: Number of lines returned in this response
- next_starting_line: Line number to use for next request (if truncated)
- reduction_percentage: Percentage of token/character reduction from original (0-100)
- matched_items: Number of functions/methods/classes that matched filter (only when filtering)
- total_items: Total number of functions/methods/classes found (only when filtering)
- filtered_items: Number of functions/methods/classes excluded by filter (only when filtering)
- error: Error message (if file processing failed)
total_files: Total number of files found
processed_files: Number of successfully processed files
failed_files: Number of files that failed processing
processing_time_ms: Total processing time in milliseconds

Graph Extraction

When extract_graph: true, the response includes relationship data:

{
  "files": [
    {
      "path": "/path/to/handler.py",
      "graph": {
        "imports": ["os", "json", "typing.Optional"],
        "functions": [
          {
            "name": "handle_request",
            "calls": ["validate", "process", "respond"],
            "connectivity": 3
          }
        ],
        "classes": [
          {
            "name": "RequestHandler",
            "extends": "BaseHandler",
            "implements": ["Loggable"],
            "methods": ["__init__", "handle"]
          }
        ]
      }
    }
  ]
}

Graph Fields:

imports: Module/package imports
functions: Function details with call relationships
- calls: Functions called by this function
- connectivity: Total number of relationships (shown as a ★n rating in sigil output)
classes: Class details with inheritance
- extends: Parent class
- implements: Implemented interfaces
- methods: Method names

Sigil Format

The output_format: "sigil" option provides compressed notation optimised for LLM consumption:

# /path/to/handler.py [python]
!os !json !typing.Optional
$RequestHandler < BaseHandler & Loggable
  #__init__() -> #_setup_logging
  #handle() -> #validate #process ★3
#main() -> $RequestHandler.#handle ★1

Sigil Meanings:

! - import/module
$ - class/type
# - function/method
< - extends
& - implements
-> - calls (outgoing)
★n - connectivity rating (n relationships)
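
As a rough illustration of how the sigils map onto the graph data shown under Graph Extraction, the sketch below renders a graph dictionary into sigil-style lines. The mapping is an assumption: the function `render_sigil` is hypothetical, the handling of multiple implemented interfaces (joined with `&`) is a guess, and per-method call lists are omitted because the JSON graph example lists method names only.

```python
def render_sigil(path, language, graph):
    """Render a graph dict into sigil-style notation (illustrative only)."""
    lines = [f"# {path} [{language}]"]
    imports = graph.get("imports", [])
    if imports:
        lines.append(" ".join(f"!{name}" for name in imports))
    for cls in graph.get("classes", []):
        header = f"${cls['name']}"
        if cls.get("extends"):
            header += f" < {cls['extends']}"
        if cls.get("implements"):
            header += " & " + " & ".join(cls["implements"])
        lines.append(header)
        # Methods are indented beneath their class.
        lines.extend(f"  #{method}()" for method in cls.get("methods", []))
    for fn in graph.get("functions", []):
        line = f"#{fn['name']}()"
        if fn.get("calls"):
            line += " -> " + " ".join(f"#{callee}" for callee in fn["calls"])
        if fn.get("connectivity"):
            line += f" ★{fn['connectivity']}"
        lines.append(line)
    return "\n".join(lines)
```

The compact form repeats no JSON keys or brackets, which is where most of the saving over the "json" output format would come from.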

Example with Sigil Format:

{
  "source": ["/path/to/api.py"],
  "extract_graph": true,
  "output_format": "sigil"
}

Caching

Results are cached using a key based on:

File path
Language
Filter patterns (if applied)
Source code hash (SHA256)

Cache behaviour:

First call: Processes and caches result (from_cache: false)
Subsequent calls: Returns cached result if file content unchanged (from_cache: true)
Clear cache: Set clear_cache: true to force re-processing
Each file in batch operations is cached independently
Pagination: Cached transformed output is reused for different line ranges
Different filter patterns create separate cache entries
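
One way to picture a cache key built from those four inputs is shown below. The exact key layout used by the tool is not documented here, so this is a sketch under assumptions: the function name `cache_key`, the separator, and the decision to sort filter patterns are all illustrative choices.

```python
import hashlib
import json


def cache_key(path, language, source, filter_patterns=None):
    """Derive a cache key from path, language, filters, and content hash."""
    content_hash = hashlib.sha256(source.encode("utf-8")).hexdigest()
    # Assumption: pattern order does not matter, so patterns are sorted.
    filters = json.dumps(sorted(filter_patterns or []))
    parts = [path, language, filters, content_hash]
    return hashlib.sha256("\x00".join(parts).encode("utf-8")).hexdigest()
```

Because the content hash is part of the key, editing a file changes the key and the stale entry is simply never hit again; a different filter list likewise produces a separate entry.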

Use Cases

1. Codebase Overview

Quickly understand code structure without implementation noise:

{
  "source": ["/path/to/src"]
}

2. API Documentation

Extract function signatures for documentation:

{
  "source": ["/path/to/api.py"]
}

3. Architecture Analysis

Analyse entire packages or modules:

{
  "source": ["/path/to/project/**/*.go"]
}

4. Context Window Optimisation

Fit more code into limited AI context windows by removing implementation noise.

When to Use

✅ Use when:

Analysing code structure without implementation details
Fitting large codebases into limited AI context windows
Providing architectural overviews
Examining API surfaces and function signatures
Understanding "what" code does without the "how" details

❌ Don't use when:

Debugging implementation logic
Examining algorithm details
Reviewing line-by-line code quality
Actual implementation is required for the task
Working with unsupported languages

Limits

Maximum file size: 500KB per individual file
Maximum total memory: 4GB across all files being processed
Maximum AST depth: 500 levels (prevents stack overflow)
Maximum AST nodes: 100,000 per file (prevents memory exhaustion)
Parallel workers: Up to 10 concurrent file processors

Files exceeding these limits are skipped with detailed error messages in the response.

Implementation Details

Built on go-tree-sitter
Uses tree-sitter parsers for accurate AST analysis
Parallel processing with worker pool (up to 10 workers)
In-memory caching with SHA256 hashing for performance
File access controlled by security integration
Batch processing for directories and glob patterns using doublestar
Memory-safe with configurable limits

Related Tools

code_search: Semantic search over indexed code using natural language
find_long_files: Identify large files that may benefit from skimming
get_library_documentation: Get focused library documentation
fetch_url: Fetch web content (can be combined with skimming)

Extended Help

Use the get_tool_help tool to access detailed usage information:

{
  "tool_name": "code_skim"
}

This provides:

Detailed examples for all languages
Common usage patterns
Troubleshooting tips
Parameter explanations
When to use / when not to use guidance