LLAMA.CL
November 7, 2025
A Common Lisp implementation for Llama inference operations
Table of Contents
- About the Project
- Objectives
- Built With
- Getting Started
- Prerequisites
- Installation
- Usage
- Performance
- Roadmap
- Contributing
- License
- Contact
About the Project
LLAMA.CL is a Common Lisp implementation of Llama inference operations, designed for rapid experimentation, research, and as a reference implementation for the Common Lisp community. This project enables researchers and developers to explore LLM techniques within the Common Lisp ecosystem, leveraging the language's capabilities for interactive development and integration with symbolic AI systems.
Objectives
- Research-oriented interface: Provide a platform for experimenting with LLM inference techniques in an interactive development environment.
- Reference implementation: Serve as a canonical example of implementing modern neural network inference in Common Lisp.
- Integration capabilities: Enable seamless combination with other AI paradigms available in Common Lisp, including expert systems, graph algorithms, and constraint-based programming.
- Simplicity and clarity: Maintain readable, idiomatic Common Lisp code that prioritizes understanding over premature optimization.
Built With
Getting Started
Prerequisites
LLAMA.CL requires:
- A Common Lisp implementation (currently SBCL-only as of version 0.0.5; pull requests for other implementations are welcome)
- Quicklisp or another ASDF-compatible system loader
- Pre-trained model weights in binary format
All dependencies are available through Quicklisp.
Installation
Getting the source
- Clone the repository to a location accessible to ASDF:
cd ~/common-lisp
git clone https://github.com/snunez1/llama.cl.git
- Clear the ASDF source registry to recognize the new system:
(asdf:clear-source-registry)
Obtaining model weights
Download pre-trained models from Karpathy's llama2.c repository. For initial experimentation, the TinyStories models are recommended:
wget https://huggingface.co/karpathy/tinyllamas/resolve/main/stories15M.bin
wget https://huggingface.co/karpathy/tinyllamas/resolve/main/stories42M.bin
wget https://huggingface.co/karpathy/tinyllamas/resolve/main/stories110M.bin
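To sanity-check a downloaded file before loading it, the header can be inspected directly. The sketch below assumes the legacy llama2.c checkpoint layout, in which the file starts with seven little-endian 32-bit integers (dim, hidden_dim, n_layers, n_heads, n_kv_heads, vocab_size, seq_len). The function names `read-int32-le` and `read-checkpoint-header` are hypothetical helpers written for this illustration; they are not part of LLAMA.CL, whose own loader may work differently.

```lisp
;; Illustrative helpers only; not part of the LLAMA.CL API.
(defun read-int32-le (stream)
  "Read one little-endian signed 32-bit integer from a binary STREAM."
  (let ((u 0))
    (dotimes (i 4)
      (setf u (logior u (ash (read-byte stream) (* 8 i)))))
    ;; Convert from unsigned to two's-complement signed.
    (if (logbitp 31 u) (- u (ash 1 32)) u)))

(defun read-checkpoint-header (path)
  "Return the seven header fields of a llama2.c-style checkpoint as a plist.
Assumes the legacy layout: seven int32 values at the start of the file."
  (with-open-file (in path :element-type '(unsigned-byte 8))
    (loop for key in '(:dim :hidden-dim :n-layers :n-heads
                       :n-kv-heads :vocab-size :seq-len)
          collect key
          collect (read-int32-le in))))
```

Calling `(read-checkpoint-header #P"stories15M.bin")` returns a property list such as `(:dim ... :hidden-dim ... )`. In llama2.c, a negative vocab_size marks a checkpoint whose classifier weights are not shared with the token embedding, so the absolute value is the actual vocabulary size.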
Loading dependencies
Use Quicklisp to obtain required dependencies:
(ql:quickload :llama)
Usage
Initialize and generate text using the following workflow:
;; Load the system
(ql:quickload :llama)
;; Switch to the LLAMA package
(in-package :llama)
;; Initialize with model and tokenizer
(init #P"stories15M.bin" #P"tokenizer.bin" 32000)
;; Generate text
(generate *model* *tokenizer*)
The system supports various generation parameters including temperature control, custom prompts, and different sampling strategies. Consult the source code for detailed parameter specifications.
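As a rough illustration of what temperature control means during sampling, here is a minimal, dependency-free sketch. It follows the usual scheme (divide the logits by the temperature, apply softmax, draw from the resulting distribution, and fall back to arg-max when the temperature is 0). The names `softmax-with-temperature` and `sample-token` are assumptions made for this example and are not LLAMA.CL's actual functions or parameters.

```lisp
;; Illustrative sketch; LLAMA.CL's own sampler may be organized differently.
(defun softmax-with-temperature (logits temperature)
  "Return a vector of probabilities for LOGITS scaled by TEMPERATURE (> 0)."
  (let* ((scaled (map 'vector (lambda (x) (/ x temperature)) logits))
         ;; Subtract the maximum before EXP for numerical stability.
         (max-logit (reduce #'max scaled))
         (exps (map 'vector (lambda (x) (exp (- x max-logit))) scaled))
         (total (reduce #'+ exps)))
    (map 'vector (lambda (e) (/ e total)) exps)))

(defun sample-token (logits &key (temperature 1.0)
                                 (random-state *random-state*))
  "Pick a token index from LOGITS. A TEMPERATURE of 0 means greedy arg-max."
  (if (zerop temperature)
      (position (reduce #'max logits) logits)
      (let ((probs (softmax-with-temperature logits temperature)))
        ;; Walk the cumulative distribution until it passes a uniform draw.
        (loop with r = (random 1.0 random-state)
              with cumulative = 0.0
              for p across probs
              for index from 0
              do (incf cumulative p)
              when (> cumulative r) return index
              ;; Guard against floating-point round-off in the running sum.
              finally (return (1- (length probs)))))))
```

Lower temperatures concentrate probability on the highest-scoring tokens, while higher temperatures flatten the distribution and give more varied output.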
The implementation has been validated with models up to llama-2-7B. Larger models may require additional optimization or hardware acceleration.
Performance
On a reference system (Intel(R) Core(TM) Ultra 7 155H, 16 cores/22 threads, 32GB DDR4 RAM), the stories110M model achieves approximately 3 tokens/second using SBCL and Common Lisp alone, and 22 tokens/second with SBCL+LLA, using 9 threads for lparallel and 3 for MKL BLAS.
Performance characteristics vary based on model size and hardware configuration. For the stories15M model, parallelization overhead may exceed the benefits on some systems. See the file benchmarks.md for benchmarking instructions. Tune the number of lparallel and BLAS threads to find the sweet spot for your machine and model.
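One way to search for that sweet spot is to time the same generation run at several thread counts. The following is a generic sketch, not the procedure from benchmarks.md: `tokens-per-second` and `sweep-thread-counts` are hypothetical helpers, and the caller supplies the function that actually configures the threads and runs generation.

```lisp
;; Illustrative timing helpers; not part of LLAMA.CL.
(defun tokens-per-second (thunk n-tokens)
  "Call THUNK once and return N-TOKENS divided by elapsed wall-clock seconds."
  (let ((start (get-internal-real-time)))
    (funcall thunk)
    ;; Treat a zero reading as one clock tick to avoid dividing by zero.
    (let ((ticks (max 1 (- (get-internal-real-time) start))))
      (float (/ (* n-tokens internal-time-units-per-second) ticks)))))

(defun sweep-thread-counts (thread-counts run-with-threads n-tokens)
  "Time RUN-WITH-THREADS for each entry of THREAD-COUNTS.
RUN-WITH-THREADS is a caller-supplied function of one argument (the thread
count) that should configure the threads and generate N-TOKENS tokens.
Returns an alist of (THREADS . TOKENS-PER-SECOND), fastest first."
  (sort (loop for n in thread-counts
              collect (cons n (tokens-per-second
                               (lambda () (funcall run-with-threads n))
                               n-tokens)))
        #'> :key #'cdr))
```

With lparallel loaded, the supplied function would typically install a kernel of the requested size, for example `(setf lparallel:*kernel* (lparallel:make-kernel n))`, before calling `generate`. MKL usually takes its thread count from the `MKL_NUM_THREADS` environment variable; how LLAMA.CL itself expects these to be set is best checked in benchmarks.md.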
Roadmap
- Extend compatibility to additional Common Lisp implementations
- Add support for quantized models
Contributing
Contributions are welcome. Please submit pull requests for bug fixes, performance improvements, or additional Common Lisp implementation support. See the project's issue tracker for current priorities.
License
Distributed under the MIT License. See LICENSE for more information.
Contact
Project Link: https://github.com/snunez1/llama.cl