Calculate Total Size of Git LFS Files
December 25, 2024
human (Dec 25, 2024, 10:32 AM)
I have a git repo with a LOT of large LFS files
How can I calculate the total size without downloading them?
assistant (Dec 25, 2024, 10:33 AM)
I'll help you create a script to calculate the total size of Git LFS files without downloading them. We can use the git lfs ls-files command which shows file sizes.
create Git LFS Repository Size Calculator
#!/usr/bin/env python3
import argparse
import re
import subprocess
import sys
from pathlib import Path
from typing import List, Optional, Tuple

# `git lfs ls-files -s` prints lines such as:
#   3f4e5fcec2 - model-00001-of-000163.safetensors (5.2 GB)
LINE_RE = re.compile(r'^\S+ [*-] (.+) \(([\d.]+\s*[A-Z]+)\)$')


def parse_size(size_str: str) -> int:
    """Convert Git LFS size string to bytes."""
    units = {'B': 1, 'KB': 1024, 'MB': 1024**2, 'GB': 1024**3, 'TB': 1024**4}
    # Extract number and unit from string (e.g., "123.45 MB")
    match = re.match(r'^([\d.]+)\s*([A-Z]+)$', size_str.strip())
    if not match:
        raise ValueError(f"Invalid size format: {size_str}")
    num, unit = match.groups()
    if unit not in units:
        raise ValueError(f"Unknown unit: {unit}")
    return int(float(num) * units[unit])


def get_repo_lfs_size(repo_path: Optional[Path] = None) -> Tuple[int, List[Tuple[str, int]]]:
    """
    Calculate total size of LFS objects without downloading them.
    Returns tuple of (total_size, list of (filename, size) tuples)
    """
    if repo_path:
        repo_path = Path(repo_path).resolve()
        if not (repo_path / '.git').exists():
            raise ValueError(f"Not a git repository: {repo_path}")
    try:
        # Run git lfs ls-files inside the repo (cwd=None means the current directory)
        result = subprocess.run(
            ['git', 'lfs', 'ls-files', '-s'],
            capture_output=True,
            text=True,
            check=True,
            cwd=repo_path,
        )
    except subprocess.CalledProcessError as e:
        raise RuntimeError(f"Error running git lfs command: {e.stderr}")
    total_size = 0
    file_sizes = []
    # Parse output lines
    for line in result.stdout.splitlines():
        match = LINE_RE.match(line.strip())
        if not match:  # Skip malformed lines
            continue
        filename, size_str = match.groups()
        try:
            size = parse_size(size_str)
        except ValueError as e:
            print(f"Warning: Couldn't parse line: {line} ({e})")
            continue
        total_size += size
        file_sizes.append((filename, size))
    return total_size, file_sizes


def format_size(size_bytes: float) -> str:
    """Format size in bytes to human readable string."""
    for unit in ['B', 'KB', 'MB', 'GB', 'TB']:
        if size_bytes < 1024:
            return f"{size_bytes:.2f} {unit}"
        size_bytes /= 1024
    return f"{size_bytes:.2f} PB"


def main() -> int:
    parser = argparse.ArgumentParser(description='Calculate total size of Git LFS objects')
    parser.add_argument('path', nargs='?', type=Path, default=None,
                        help='Path to git repository (optional)')
    parser.add_argument('--files', action='store_true',
                        help='List individual file sizes')
    args = parser.parse_args()
    try:
        total_size, file_sizes = get_repo_lfs_size(args.path)
        print(f"\nTotal LFS objects size: {format_size(total_size)}")
        if args.files:
            print("\nIndividual file sizes:")
            # Sort by size, largest first
            for filename, size in sorted(file_sizes, key=lambda x: x[1], reverse=True):
                print(f"{filename}: {format_size(size)}")
    except Exception as e:
        print(f"Error: {e}")
        return 1
    return 0


if __name__ == '__main__':
    sys.exit(main())
I've created a Python script that calculates the total size of Git LFS objects in your repository without downloading them. The script:
- Uses git lfs ls-files -s to list all LFS files with their sizes
- Parses the output and converts size strings (e.g., "123.45 MB") to bytes. Because the listed sizes are rounded, the total is approximate
- Provides both total size and individual file sizes if requested
You can use it in two ways:
- Basic usage (just total size):
python git_lfs_size.py
- With individual file sizes:
python git_lfs_size.py --files
You can also specify a different repository path:
python git_lfs_size.py /path/to/repo --files
The script handles different size units (B, KB, MB, GB, TB) and provides human-readable output. Let me know if you need any modifications or have questions about how it works!
human (Dec 25, 2024, 10:34 AM)
How do I git clone the repo without it downloading those files?
assistant (Dec 25, 2024, 10:34 AM)
You can use the GIT_LFS_SKIP_SMUDGE environment variable when cloning to prevent Git LFS from downloading the large files. Here are two ways to do it:
- Set it just for the clone command:
GIT_LFS_SKIP_SMUDGE=1 git clone <repository-url>
- Or configure the LFS filters globally so that they skip downloads:
git config --global filter.lfs.smudge "git-lfs smudge --skip -- %f"
git config --global filter.lfs.process "git-lfs filter-process --skip"
git clone <repository-url>
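If the clone is driven from a script, the first option can be sketched in Python by passing the variable through the child process environment. This is a minimal sketch, not a definitive implementation: `skip_smudge_env` and `clone_without_lfs` are hypothetical helper names, and the URL and destination are whatever you supply.

```python
import os
import subprocess
from typing import Dict, Optional


def skip_smudge_env(base: Optional[Dict[str, str]] = None) -> Dict[str, str]:
    """Return a copy of the environment with GIT_LFS_SKIP_SMUDGE=1 set."""
    env = dict(os.environ if base is None else base)
    env["GIT_LFS_SKIP_SMUDGE"] = "1"
    return env


def clone_without_lfs(url: str, dest: str) -> None:
    """Clone `url` into `dest`, leaving LFS pointer files in the working tree."""
    # The variable only applies to this one git process, so no global
    # configuration is changed.
    subprocess.run(["git", "clone", url, dest], env=skip_smudge_env(), check=True)
```

Setting the variable per process keeps the change local to that one clone, which is the same trade-off as prefixing the shell command.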
After cloning, your repository will have the LFS files' pointers instead of the actual files. If you later want to download specific LFS files, you can:
git lfs pull --include="path/to/file"
Or to reset back to normal LFS behavior:
git config --global filter.lfs.smudge "git-lfs smudge -- %f"
git config --global filter.lfs.process "git-lfs filter-process"
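Once the repository has been cloned this way, each pointer file in the working tree records the exact byte size of the object it stands for, so the total can also be computed by reading the pointers directly. The sketch below assumes the pointer layout from the Git LFS pointer specification (a `version https://git-lfs.github.com/spec/v1` line followed by `oid` and `size` lines, in a file smaller than 1024 bytes). The names `pointer_size` and `total_pointer_bytes` are hypothetical.

```python
from pathlib import Path
from typing import Optional

POINTER_VERSION = "version https://git-lfs.github.com/spec/v1"
MAX_POINTER_BYTES = 1024  # pointer files are small text files; larger files are skipped


def pointer_size(text: str) -> Optional[int]:
    """Return the size recorded in an LFS pointer, or None if `text` is not a pointer."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != POINTER_VERSION:
        return None
    for line in lines[1:]:
        key, _, value = line.partition(" ")
        if key == "size" and value.strip().isdigit():
            return int(value)
    return None


def total_pointer_bytes(repo: Path) -> int:
    """Sum the sizes recorded in every pointer file under `repo`, ignoring .git."""
    total = 0
    for path in repo.rglob("*"):
        if ".git" in path.relative_to(repo).parts or not path.is_file():
            continue
        if path.stat().st_size >= MAX_POINTER_BYTES:
            continue
        try:
            text = path.read_text(encoding="utf-8")
        except (UnicodeDecodeError, OSError):
            continue  # binary or unreadable file, so not a pointer
        size = pointer_size(text)
        if size is not None:
            total += size
    return total
```

Calling `total_pointer_bytes(Path("."))` from the repository root would then give a total in bytes that does not depend on the rounded sizes printed by `git lfs ls-files -s`. Files that have already been downloaded are no longer pointers and are not counted.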
human (Dec 25, 2024, 10:35 AM)
give me a simple bash one-liner I can run in the repo to calculate the total size
assistant (Dec 25, 2024, 10:35 AM)
Here's a one-liner that calculates the total size of LFS files:
git lfs ls-files -l | awk '{total += $3} END {print total/1024/1024 " MB"}'
This will show the total in MB. If you prefer GB, you can divide by 1024 one more time:
git lfs ls-files -l | awk '{total += $3} END {print total/1024/1024/1024 " GB"}'
human (Dec 25, 2024, 10:35 AM)
output of git lfs ls-files -l is this no size:
5fc36dd41088bf6ad2e362c00f5901b12d12fc91b2da2b5a10f1c951f7fba445 - model-00148-of-000163.safetensors
assistant (Dec 25, 2024, 10:35 AM)
Ah, you need the -s flag to show the size. Here's the corrected one-liner:
git lfs ls-files -s | awk '{total += $1} END {print total/1024/1024 " MB"}'
This will work with output that shows the size first, like:
2998450961 model-00148-of-000163.safetensors
human (Dec 25, 2024, 10:36 AM)
git lfs ls-files -s
3f4e5fcec2 - model-00001-of-000163.safetensors (5.2 GB)
4fb0c2abdd - model-00002-of-000163.safetensors (4.3 GB)
assistant (Dec 25, 2024, 10:36 AM)
Ah, now I see the format. Here's the correct one-liner that handles the size with units in parentheses:
git lfs ls-files -s | grep -o '[0-9.]\+ GB' | awk '{sum += $1} END {print sum " GB"}'
This will:
- Get the LFS files list with sizes
- Extract just the numbers followed by GB (like "5.2 GB")
- Sum up the numbers and print the total in GB
If you also have files with MB or other units, let me know and I'll modify it to handle multiple units.
human (Dec 25, 2024, 10:36 AM)
handle all units
assistant (Dec 25, 2024, 10:36 AM)
Here's a one-liner that handles different size units (MB, GB, TB, etc):
git lfs ls-files -s | grep -o '[0-9.]\+ [KMGT]B' | awk '{
  split($0, a, " ");
  size=a[1];
  unit=a[2];
  if(unit=="KB") size*=1024;
  else if(unit=="MB") size*=1024^2;
  else if(unit=="GB") size*=1024^3;
  else if(unit=="TB") size*=1024^4;
  total+=size
} END {
  if(total<1024) print total " B";
  else if(total<1024^2) print total/1024 " KB";
  else if(total<1024^3) print total/1024^2 " MB";
  else if(total<1024^4) print total/1024^3 " GB";
  else print total/1024^4 " TB"
}'
This will:
- Extract sizes and units (like "5.2 GB")
- Convert everything to bytes during the sum
- Convert the final total to the most appropriate unit
So if you have a mix like:
file1 (5.2 GB)
file2 (800 MB)
file3 (1.5 TB)
It will give you the correct total in the most appropriate unit.
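For comparison, the same conversion can be sketched in Python, which also picks up sizes reported in plain bytes (`B`). This is a rough sketch under two assumptions: the units are treated as powers of 1024, as in the awk version (whether Git LFS prints 1024- or 1000-based units is not confirmed here), and the listed sizes are rounded, so the result is approximate. The names `total_bytes`, `human`, and `lfs_total` are hypothetical.

```python
import re
import subprocess
from typing import Iterable

# Assumption: 1024-based units, matching the awk version.
UNITS = {"B": 1, "KB": 1024, "MB": 1024**2, "GB": 1024**3, "TB": 1024**4}
# Match a trailing "(5.2 GB)"-style size at the end of each line.
SIZE_RE = re.compile(r"\(([\d.]+) ([KMGT]?B)\)\s*$")


def total_bytes(lines: Iterable[str]) -> float:
    """Sum the parenthesised sizes found at the end of each line, in bytes."""
    total = 0.0
    for line in lines:
        match = SIZE_RE.search(line)
        if match:
            total += float(match.group(1)) * UNITS[match.group(2)]
    return total


def human(n: float) -> str:
    """Format a byte count using the largest unit that keeps the value under 1024."""
    for unit in ("B", "KB", "MB", "GB"):
        if n < 1024:
            return f"{n:.2f} {unit}"
        n /= 1024
    return f"{n:.2f} TB"


def lfs_total(repo: str = ".") -> str:
    """Run `git lfs ls-files -s` in `repo` and return the formatted total."""
    out = subprocess.run(["git", "lfs", "ls-files", "-s"], cwd=repo,
                         capture_output=True, text=True, check=True).stdout
    return human(total_bytes(out.splitlines()))
```

For the three-file mix above, `human(total_bytes(["file1 (5.2 GB)", "file2 (800 MB)", "file3 (1.5 TB)"]))` returns `"1.51 TB"`.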