Calculate Total Size of Git LFS Files

December 25, 2024

human (Dec 25, 2024, 10:32 AM)

I have a git repo with a LOT of large LFS files

How can I calculate the total size without downloading them?

assistant (Dec 25, 2024, 10:33 AM)

I'll help you create a script to calculate the total size of Git LFS files without downloading them. We can use the git lfs ls-files command which shows file sizes.

create Git LFS Repository Size Calculator


#!/usr/bin/env python3
import subprocess
import re
from pathlib import Path
import argparse
from typing import Tuple, List

def parse_size(size_str: str) -> int:
    """Convert Git LFS size string to bytes."""
    units = {'B': 1, 'KB': 1024, 'MB': 1024**2, 'GB': 1024**3, 'TB': 1024**4}
    
    # Extract number and unit from string (e.g., "123.45 MB")
    match = re.match(r'^([\d.]+)\s*([A-Z]+)$', size_str.strip())
    if not match:
        raise ValueError(f"Invalid size format: {size_str}")
    
    num, unit = match.groups()
    if unit not in units:
        raise ValueError(f"Unknown unit: {unit}")
    
    return int(float(num) * units[unit])

def get_repo_lfs_size(repo_path: Path = None) -> Tuple[int, List[Tuple[str, int]]]:
    """
    Calculate total size of LFS objects without downloading them.
    Returns tuple of (total_size, list of (filename, size) tuples).
    Sizes come from git-lfs's rounded display values, so the total is approximate.
    """
    if repo_path:
        repo_path = Path(repo_path).resolve()
        if not (repo_path / '.git').exists():
            raise ValueError(f"Not a git repository: {repo_path}")

    try:
        # -l prints full OIDs; -s appends each object's size in parentheses.
        # cwd= runs git inside the repo without changing our own working directory
        # (pathlib.Path has no chdir() method).
        result = subprocess.run(
            ['git', 'lfs', 'ls-files', '-l', '-s'],
            capture_output=True,
            text=True,
            check=True,
            cwd=repo_path
        )
    except subprocess.CalledProcessError as e:
        raise RuntimeError(f"Error running git lfs command: {e.stderr}") from e

    total_size = 0
    file_sizes = []

    # Each line looks like: "<oid> - <filename> (<size> <unit>)"
    line_re = re.compile(r'^\S+ [-*] (.+) \(([\d.]+ [A-Z]+)\)$')
    for line in result.stdout.splitlines():
        match = line_re.match(line.strip())
        if not match:  # Skip malformed lines
            continue

        try:
            filename, size_str = match.groups()
            size = parse_size(size_str)
            total_size += size
            file_sizes.append((filename, size))
        except ValueError as e:
            print(f"Warning: Couldn't parse line: {line} ({e})")
            continue

    return total_size, file_sizes

def format_size(size_bytes: int) -> str:
    """Format size in bytes to human readable string."""
    for unit in ['B', 'KB', 'MB', 'GB', 'TB']:
        if size_bytes < 1024:
            return f"{size_bytes:.2f} {unit}"
        size_bytes /= 1024
    return f"{size_bytes:.2f} PB"

def main():
    parser = argparse.ArgumentParser(description='Calculate total size of Git LFS objects')
    parser.add_argument('path', nargs='?', type=Path, default=None,
                       help='Path to git repository (optional)')
    parser.add_argument('--files', action='store_true',
                       help='List individual file sizes')
    args = parser.parse_args()

    try:
        total_size, file_sizes = get_repo_lfs_size(args.path)
        
        print(f"\nTotal LFS objects size: {format_size(total_size)}")
        
        if args.files:
            print("\nIndividual file sizes:")
            # Sort by size, largest first
            for filename, size in sorted(file_sizes, key=lambda x: x[1], reverse=True):
                print(f"{filename}: {format_size(size)}")
            
    except Exception as e:
        print(f"Error: {e}")
        return 1

    return 0

if __name__ == '__main__':
    exit(main())

I've created a Python script that calculates the total size of Git LFS objects in your repository without downloading them. The script:

  1. Uses git lfs ls-files -l to list all LFS files with their sizes
  2. Parses the output and converts size strings (e.g., "123.45 MB") to bytes
  3. Provides both total size and individual file sizes if requested

You can use it in two ways:

  1. Basic usage (just total size):
python git_lfs_size.py
  2. With individual file sizes:
python git_lfs_size.py --files

You can also specify a different repository path:

python git_lfs_size.py /path/to/repo --files

The script handles different size units (B, KB, MB, GB, TB) and provides human-readable output. Let me know if you need any modifications or have questions about how it works!

human (Dec 25, 2024, 10:34 AM)

How do I git clone the repo without it downloading those files?

assistant (Dec 25, 2024, 10:34 AM)

You can use the GIT_LFS_SKIP_SMUDGE environment variable when cloning to prevent Git LFS from downloading the large files. Here are two ways to do it:

  1. Set it just for the clone command:
GIT_LFS_SKIP_SMUDGE=1 git clone <repository-url>
  2. Or set it globally first:
git config --global filter.lfs.smudge "git-lfs smudge --skip -- %f"
git config --global filter.lfs.process "git-lfs filter-process --skip"
git clone <repository-url>

After cloning, your repository will have the LFS files' pointers instead of the actual files. If you later want to download specific LFS files, you can:

git lfs pull --include="path/to/file"

Or to reset back to normal LFS behavior:

git config --global filter.lfs.smudge "git-lfs smudge -- %f"
git config --global filter.lfs.process "git-lfs filter-process"
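Because a clone made this way leaves small pointer files in the working tree, the exact byte sizes can also be read straight from them. Each pointer is a short text file that starts with a `version https://git-lfs.github.com/spec/v1` line and contains a `size <bytes>` line. Here is a minimal sketch of that idea; the function names `pointer_size` and `total_pointer_size` are made up for illustration, and it assumes every LFS-tracked file in the checkout is still an un-smudged pointer:

```python
from pathlib import Path
from typing import Optional

# First line of a Git LFS pointer file, per the LFS pointer spec.
POINTER_HEADER = b"version https://git-lfs.github.com/spec/v1"


def pointer_size(path: Path) -> Optional[int]:
    """Return the byte size recorded in an LFS pointer file, or None if it is not one."""
    try:
        with path.open("rb") as fh:
            head = fh.read(1024)  # pointer files are tiny, so 1 KiB is plenty
    except OSError:
        return None
    if not head.startswith(POINTER_HEADER):
        return None
    for line in head.splitlines():
        if line.startswith(b"size "):
            try:
                return int(line[5:])
            except ValueError:
                return None
    return None


def total_pointer_size(repo_root: Path) -> int:
    """Sum the sizes of all LFS pointer files under repo_root, skipping .git."""
    total = 0
    for path in repo_root.rglob("*"):
        # Ignore git's own metadata directory and anything that is not a regular file.
        if ".git" in path.relative_to(repo_root).parts or not path.is_file():
            continue
        size = pointer_size(path)
        if size is not None:
            total += size
    return total
```

Calling `total_pointer_size(Path("."))` from the repository root should give the total in bytes. Files that have already been downloaded with `git lfs pull` are no longer pointers, so this sketch would not count them.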

human (Dec 25, 2024, 10:35 AM)

give me a simple bash one-liner I can run in the repo to calculate the total size

assistant (Dec 25, 2024, 10:35 AM)

Here's a one-liner that calculates the total size of LFS files:

git lfs ls-files -l | awk '{total += $3} END {print total/1024/1024 " MB"}'

This will show the total in MB. If you prefer GB, you can divide by 1024 one more time:

git lfs ls-files -l | awk '{total += $3} END {print total/1024/1024/1024 " GB"}'

human (Dec 25, 2024, 10:35 AM)

output of git lfs ls-files -l is this no size:

5fc36dd41088bf6ad2e362c00f5901b12d12fc91b2da2b5a10f1c951f7fba445 - model-00148-of-000163.safetensors

assistant (Dec 25, 2024, 10:35 AM)

Ah, you need the -s flag to show the size. Here's the corrected one-liner:

git lfs ls-files -s | awk '{total += $1} END {print total/1024/1024 " MB"}'

This will work with output that shows the size first, like:

2998450961 model-00148-of-000163.safetensors

human (Dec 25, 2024, 10:36 AM)

git lfs ls-files -s
3f4e5fcec2 - model-00001-of-000163.safetensors (5.2 GB)
4fb0c2abdd - model-00002-of-000163.safetensors (4.3 GB)

assistant (Dec 25, 2024, 10:36 AM)

Ah, now I see the format. Here's the correct one-liner that handles the size with units in parentheses:

git lfs ls-files -s | grep -o '[0-9.]\+ GB' | awk '{sum += $1} END {print sum " GB"}'

This will:

  1. Get the LFS files list with sizes
  2. Extract just the numbers followed by GB (like "5.2 GB")
  3. Sum up the numbers and print the total in GB

If you also have files with MB or other units, let me know and I'll modify it to handle multiple units.

human (Dec 25, 2024, 10:36 AM)

handle all units

assistant (Dec 25, 2024, 10:36 AM)

Here's a one-liner that handles different size units (MB, GB, TB, etc):

git lfs ls-files -s | grep -o '[0-9.]\+ [KMGT]B' | awk '{
    split($0, a, " ");
    size=a[1];
    unit=a[2];
    if(unit=="KB") size*=1024;
    else if(unit=="MB") size*=1024^2;
    else if(unit=="GB") size*=1024^3;
    else if(unit=="TB") size*=1024^4;
    total+=size
} END {
    if(total<1024) print total " B";
    else if(total<1024^2) print total/1024 " KB";
    else if(total<1024^3) print total/1024^2 " MB";
    else if(total<1024^4) print total/1024^3 " GB";
    else print total/1024^4 " TB"
}'

This will:

  1. Extract sizes and units (like "5.2 GB")
  2. Convert everything to bytes during the sum
  3. Convert the final total to the most appropriate unit

So if you have a mix like:

file1 (5.2 GB)
file2 (800 MB)
file3 (1.5 TB)

It will give you the correct total in the most appropriate unit.
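If the awk version gets hard to read, the same three steps can be sketched in Python. This is only an illustration: `sum_lfs_sizes` and `human` are hypothetical names, the multipliers copy the 1024-based factors from the one-liner (swap in powers of 1000 if your git-lfs version reports decimal units), and the total is approximate because the listing shows rounded sizes.

```python
import re
import subprocess

# Multipliers mirror the 1024-based factors used in the awk one-liner.
UNITS = {"B": 1, "KB": 1024, "MB": 1024**2, "GB": 1024**3, "TB": 1024**4}

# Matches the trailing "(5.2 GB)" that `git lfs ls-files -s` appends to each line.
SIZE_RE = re.compile(r"\((\d+(?:\.\d+)?) ([KMGT]?B)\)\s*$")


def sum_lfs_sizes(listing: str) -> float:
    """Return the approximate total in bytes for `git lfs ls-files -s` output."""
    total = 0.0
    for line in listing.splitlines():
        match = SIZE_RE.search(line)
        if match:  # lines without a parenthesized size are skipped
            number, unit = match.groups()
            total += float(number) * UNITS[unit]
    return total


def human(total: float) -> str:
    """Format a byte count using the largest unit that keeps the value under 1024."""
    for unit in ("B", "KB", "MB", "GB"):
        if total < 1024:
            return f"{total:.2f} {unit}"
        total /= 1024
    return f"{total:.2f} TB"


def total_from_repo() -> str:
    """Run git in the current directory and return the formatted total."""
    result = subprocess.run(
        ["git", "lfs", "ls-files", "-s"],
        capture_output=True, text=True, check=True,
    )
    return human(sum_lfs_sizes(result.stdout))
```

Running `print(total_from_repo())` inside the repository should print a single total. Unlike the grep pattern above, this sketch anchors on the parentheses at the end of the line, so a filename that happens to contain something like "2 GB" is not counted twice.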