Rllama
April 3, 2026 · View on GitHub
Ruby bindings for llama.cpp to run open-source language models locally. Run models like Gemma 4, Qwen 3.5, GLM 4.7, Nemotron, LFM2, Llama 3, and many others directly in your Ruby application code.
Installation
Add this line to your application's Gemfile:
gem 'rllama'
And then execute:
bundle install
Or install it yourself as:
gem install rllama
CLI Chat
The rllama command-line utility provides an interactive chat interface for conversing with language models. After installing the gem, you can start chatting immediately:
rllama
When you run rllama without arguments, it will display:
- Downloaded models: any models you've already downloaded to ~/.rllama/models/
- Popular models: a curated list of popular models available for download, including:
- Gemma 4 E4B / Gemma 4 26B-A4B
- Nemotron 3 Nano 4B
- Qwen 3.5 35B-A3B
- LFM2 24B-A2B
- GLM 4.7 Flash
- GPT-OSS 20B
- Llama 3.2 3B
- Phi-4
Simply enter the number of the model you want to use. If you select a model that hasn't been downloaded yet, it will be automatically downloaded from Hugging Face.
You can also specify a model path or URL directly:
rllama path/to/your/model.gguf
rllama https://huggingface.co/microsoft/phi-4-gguf/resolve/main/phi-4-Q3_K_S.gguf
Once the model has loaded, you can start chatting.
Usage
Text Generation
Generate text completions using local language models:
require 'rllama'
# Load a model
model = Rllama.load_model('lmstudio-community/gemma-3-1B-it-QAT-GGUF/gemma-3-1B-it-QAT-Q4_0.gguf')
# Generate text
result = model.generate('What is the capital of France?')
puts result.text
# => "The capital of France is Paris."
# Access generation statistics
puts "Tokens generated: #{result.stats[:tokens_generated]}"
puts "Tokens per second: #{result.stats[:tps]}"
puts "Duration: #{result.stats[:duration]} seconds"
# Don't forget to close the model when done
model.close
Generation parameters
Adjust generation behavior with sampling parameters:
result = model.generate(
'Write a short poem about Ruby programming',
max_tokens: 2024,
temperature: 0.8,
top_k: 40,
top_p: 0.95,
min_p: 0.05
)
Streaming generation
Stream generated text token-by-token:
model.generate('Explain quantum computing') do |token|
print token
end
System prompt
Include a system prompt to guide model behavior:
result = model.generate(
'What are best practices for Ruby development?',
system: 'You are an expert Ruby developer with 10 years of experience.'
)
Messages list
Pass multiple messages with roles for more complex interactions:
result = model.generate([
{ role: 'system', content: 'You are a helpful assistant.' },
{ role: 'user', content: 'What is the capital of France?' },
{ role: 'assistant', content: 'The capital of France is Paris.' },
{ role: 'user', content: 'What is its population?' }
])
puts result.text
Chat
For ongoing conversations, use a context object that maintains the conversation history:
# Initialize a chat context
context = model.init_context
# Send messages and maintain conversation history
response1 = context.message('What is the capital of France?')
puts response1.text
# => "The capital of France is Paris."
response2 = context.message('What is the population of that city?')
puts response2.text
# => "Paris has a population of approximately 2.1 million people..."
response3 = context.message('What was my first message?')
puts response3.text
# => "Your first message was asking about the capital of France."
# The context remembers all previous messages in the conversation
# Close context when done
context.close
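The pattern above extends naturally to an interactive loop. A minimal sketch, using only the `init_context`, `message`, and `close` calls shown above (the model path is illustrative, and the loop simply reads stdin until an empty line or EOF):

```ruby
require 'rllama'

# Load a model and open a chat context (path is illustrative)
model = Rllama.load_model('lmstudio-community/gemma-3-1B-it-QAT-GGUF/gemma-3-1B-it-QAT-Q4_0.gguf')
context = model.init_context

# Read user input line by line; the context keeps the conversation history
loop do
  print '> '
  line = $stdin.gets
  break if line.nil? || line.strip.empty?

  response = context.message(line.strip)
  puts response.text
end

context.close
model.close
```

Because the context object accumulates the history, each `message` call sees everything the user and model have said so far, without any bookkeeping on your side.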
Embeddings
Generate vector embeddings for text using embedding models:
require 'rllama'
# Load an embedding model
model = Rllama.load_model('lmstudio-community/embeddinggemma-300m-qat-GGUF/embeddinggemma-300m-qat-Q4_0.gguf')
# Generate embedding for a single text
embedding = model.embed('Hello, world!')
puts embedding.length
# => 768 (depending on your model)
# Generate embeddings for multiple sentences
embeddings = model.embed([
'roses are red',
'violets are blue',
'sugar is sweet'
])
puts embeddings.length
# => 3
puts embeddings[0].length
# => 768
model.close
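Embeddings like these are typically compared with cosine similarity, which for unit-length (normalized) vectors reduces to a plain dot product. A minimal pure-Ruby sketch; the three-dimensional vectors here are made-up stand-ins for real `embed` output:

```ruby
# Cosine similarity between two vectors: dot(a, b) / (|a| * |b|).
# For normalized embeddings the denominator is 1, leaving just the dot product.
def cosine_similarity(a, b)
  dot = a.zip(b).sum { |x, y| x * y }
  norm_a = Math.sqrt(a.sum { |x| x * x })
  norm_b = Math.sqrt(b.sum { |x| x * x })
  dot / (norm_a * norm_b)
end

# Toy vectors standing in for model.embed output
roses   = [0.9, 0.1, 0.0]
violets = [0.8, 0.2, 0.1]
sugar   = [0.1, 0.9, 0.3]

puts cosine_similarity(roses, violets) # close to 1.0 (similar)
puts cosine_similarity(roses, sugar)   # much lower
```

Ranking texts by cosine similarity against a query embedding is the basis of semantic search over the vectors returned by `embed`.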
Vector parameters
By default, embedding vectors are normalized. You can disable normalization with normalize: false:
# Generate unnormalized embeddings
embedding = model.embed('Sample text', normalize: false)
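For reference, "normalized" here means scaled to unit length (an L2 norm of 1). A pure-Ruby sketch of that operation, applied to a made-up raw vector:

```ruby
# Scale a vector to unit length by dividing by its Euclidean (L2) norm
def l2_normalize(vector)
  norm = Math.sqrt(vector.sum { |x| x * x })
  vector.map { |x| x / norm }
end

raw = [3.0, 4.0]          # made-up raw vector; its L2 norm is 5.0
unit = l2_normalize(raw)
puts unit.inspect                        # => [0.6, 0.8]
puts Math.sqrt(unit.sum { |x| x * x })   # => 1.0
```

Unnormalized vectors preserve magnitude information, which some downstream uses need; normalized ones make cosine similarity a simple dot product.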
Finding Models
You can download GGUF format models from various sources:
- Hugging Face: search for models published in the GGUF format
License
MIT
Contributing
Bug reports and pull requests are welcome on GitHub at https://github.com/docusealco/rllama.