Session 3: Open-Source Models in Foundry Local
November 11, 2025
Abstract
Discover how to bring Hugging Face and other open-source models into Foundry Local. Learn selection strategies, community contribution workflows, performance comparison methodology, and how to extend Foundry with custom model registrations. This session maps to the weekly "Model Mondays" exploration themes and equips you to evaluate and operationalize open-source models locally before scaling to Azure.
Learning Objectives
By the end you will be able to:
- Discover & Evaluate: Identify candidate models (mistral, gemma, qwen, deepseek) using quality vs resource trade-offs.
- Load & Run: Use Foundry Local CLI to download, cache, and launch community models.
- Benchmark: Apply consistent latency + token throughput + quality heuristics.
- Extend: Register or adapt a custom model wrapper following SDK-compatible patterns.
- Compare: Produce structured comparisons for SLM vs mid-size LLM selection decisions.
Prerequisites
- Sessions 1 & 2 completed
- Python environment with foundry-local-sdk installed
- At least 15 GB free disk for multiple model caches
Cross-Platform Environment Quick Start
Windows PowerShell:
py -m venv .venv
.\.venv\Scripts\Activate.ps1
pip install --upgrade pip
pip install foundry-local-sdk openai numpy
macOS / Linux:
python3 -m venv .venv
source .venv/bin/activate
python -m pip install --upgrade pip
pip install foundry-local-sdk openai numpy
When benchmarking from macOS against a Windows host service, set:
export FOUNDRY_LOCAL_ENDPOINT=http://<windows-host>:5273/v1
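A client script can pick this variable up and fall back to the local default when it is unset. The following is a minimal sketch, assuming the service exposes its OpenAI-compatible API at the URL shown above; `resolve_endpoint` is a hypothetical helper name and `winbox` is a placeholder hostname, neither comes from the Foundry Local SDK.

```python
import os

DEFAULT_ENDPOINT = "http://localhost:5273/v1"  # local default used in this session


def resolve_endpoint(env=None) -> str:
    """Return the base URL from FOUNDRY_LOCAL_ENDPOINT, or the local default."""
    env = os.environ if env is None else env
    value = (env.get("FOUNDRY_LOCAL_ENDPOINT") or "").strip()
    # Drop a trailing slash so the URL has one canonical form.
    return value.rstrip("/") or DEFAULT_ENDPOINT


# "winbox" is a placeholder hostname for illustration only.
print(resolve_endpoint({}))
print(resolve_endpoint({"FOUNDRY_LOCAL_ENDPOINT": "http://winbox:5273/v1/"}))
```

The returned string can be passed as `base_url` when constructing the `OpenAI` client.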
Demo Flow (30 min)
1. Load Hugging Face Models via CLI (8 min)
# List catalog entries (filter manually if needed)
foundry model list
# Download a set of comparison targets
foundry model download mistral-7b
foundry model download gemma-2-2b
foundry model download qwen2.5-0.5b
# Verify cache
foundry cache list
2. Run & Quick Probe (5 min)
foundry model run qwen2.5-0.5b
foundry model run qwen2.5-0.5b --prompt "List three benefits of local inference."
foundry model run mistral-7b
foundry model run mistral-7b --prompt "Explain retrieval augmented generation in one paragraph."
3. Benchmark Script (8 min)
Create samples/03-oss-models/benchmark_models.py:
#!/usr/bin/env python3
"""Lightweight benchmarking for open-source models in Foundry Local.
Metrics: total latency (avg, p95) and tokens/sec (approximated from the usage field).
First-token latency is not measured here because requests use stream=False.
Reference SDK: https://github.com/microsoft/Foundry-Local/tree/main/sdk/python
"""
import json
import os
import statistics
import time

from openai import OpenAI

MODELS = [
    "qwen2.5-0.5b",
    "gemma-2-2b",
    "mistral-7b",
]
PROMPT = "Explain the principle of retrieval augmented generation in 2 sentences."
ROUNDS = 3

# Honor FOUNDRY_LOCAL_ENDPOINT when set; otherwise use the local default.
BASE_URL = os.environ.get("FOUNDRY_LOCAL_ENDPOINT", "http://localhost:5273/v1")
client = OpenAI(base_url=BASE_URL, api_key="not-needed")


def run_round(model: str):
    """Send one non-streaming chat request and time it end to end."""
    start = time.perf_counter()  # monotonic clock, suited to measuring intervals
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": PROMPT}],
        max_tokens=150,
        temperature=0.2,
        stream=False,
    )
    elapsed = time.perf_counter() - start
    usage = getattr(resp, "usage", None)
    # total_tokens counts prompt + completion tokens, so tokens/sec is approximate.
    total_tokens = usage.total_tokens if usage else None
    content = resp.choices[0].message.content or ""
    return {
        "latency_sec": elapsed,
        "total_tokens": total_tokens,
        "tokens_per_sec": (total_tokens / elapsed) if (total_tokens and elapsed > 0) else None,
        "sample_response": content[:160].replace("\n", " "),
    }


def benchmark_model(model: str):
    """Run ROUNDS requests against one model and aggregate the timings."""
    results = [run_round(model) for _ in range(ROUNDS)]
    latencies = [r["latency_sec"] for r in results]
    tps = [r["tokens_per_sec"] for r in results if r["tokens_per_sec"]]
    # method="inclusive" keeps p95 within the observed range; the default
    # (exclusive) method can extrapolate past the maximum with so few samples.
    p95 = (
        statistics.quantiles(latencies, n=20, method="inclusive")[-1]
        if len(latencies) > 1
        else latencies[0]
    )
    return {
        "model": model,
        "rounds": ROUNDS,
        "latency_avg": statistics.mean(latencies),
        "latency_p95": p95,
        "tokens_per_sec_avg": statistics.mean(tps) if tps else None,
        "sample": results[-1]["sample_response"],
    }


def main():
    summary = [benchmark_model(m) for m in MODELS]
    print(json.dumps(summary, indent=2))
    print("\nInterpretation Tips:\n- latency_avg: lower better\n- tokens_per_sec_avg: higher better\n- Compare quality manually on sample outputs.")


if __name__ == "__main__":
    main()
Run:
python samples/03-oss-models/benchmark_models.py
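The script sends non-streaming requests, so it reports total latency only. First-token latency needs a streamed response. Below is a rough sketch of how that measurement could look, assuming chunks follow the OpenAI streaming shape (`chunk.choices[0].delta.content`); `first_token_latency` and `fake_chunk` are hypothetical names, and the fake chunks stand in for a live server so the example runs offline.

```python
import time
from types import SimpleNamespace


def first_token_latency(open_stream, clock=time.perf_counter):
    """Return (seconds_to_first_text, full_text).

    open_stream is a zero-argument callable that sends the request and
    returns an iterable of chunks; timing starts before it is called.
    """
    start = clock()
    first = None
    parts = []
    for chunk in open_stream():
        if not chunk.choices:
            continue  # some chunks carry no choices
        text = chunk.choices[0].delta.content
        if text:
            if first is None:
                first = clock() - start  # time until the first non-empty text
            parts.append(text)
    return first, "".join(parts)


def fake_chunk(text):
    # Stand-in for a streamed chunk so the sketch runs without a server.
    delta = SimpleNamespace(content=text)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])


ttft, text = first_token_latency(
    lambda: [fake_chunk(None), fake_chunk("Local "), fake_chunk("inference")]
)
print(f"first text after {ttft:.6f}s: {text!r}")
```

Against a running service, `open_stream` would be something like `lambda: client.chat.completions.create(model=model, messages=messages, stream=True)`.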
4. Compare Performance (5 min)
Discuss trade-offs: load time, memory footprint (observe Task Manager / nvidia-smi / OS resource monitor), and output quality vs speed. Use the benchmark script from step 3 for latency and throughput, then repeat the run after enabling GPU acceleration.
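To make the before/after comparison concrete, two saved summaries can be joined by model name. This is a sketch only: it assumes each summary is a list of dicts with the `model`, `latency_avg`, and `tokens_per_sec_avg` keys printed by the benchmark script, `speedup_table` is a hypothetical helper, and the numbers are illustrative placeholders, not measurements.

```python
def speedup_table(baseline, accelerated):
    """Compare two benchmark summaries keyed by model name."""
    fast = {row["model"]: row for row in accelerated}
    rows = []
    for base in baseline:
        other = fast.get(base["model"])
        if other is None:
            continue  # model missing from the second run
        base_tps = base.get("tokens_per_sec_avg")
        other_tps = other.get("tokens_per_sec_avg")
        rows.append({
            "model": base["model"],
            # > 1.0 means the accelerated run answered faster
            "latency_speedup": base["latency_avg"] / other["latency_avg"],
            "tps_gain": other_tps / base_tps if base_tps and other_tps else None,
        })
    return rows


# Illustrative placeholder numbers, not real measurements.
cpu_run = [{"model": "qwen2.5-0.5b", "latency_avg": 2.0, "tokens_per_sec_avg": 40.0}]
gpu_run = [{"model": "qwen2.5-0.5b", "latency_avg": 0.5, "tokens_per_sec_avg": 160.0}]
for row in speedup_table(cpu_run, gpu_run):
    print(row)
```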
5. Starter Project (4 min)
Create a model comparison report generator (extend benchmarking script with markdown export).
Starter Project: Extend 03-huggingface-models
Enhance the existing sample by:
- Adding benchmark aggregation + CSV/Markdown output.
- Implementing simple qualitative scoring (prompt pair set + manual annotation stub file).
- Introducing a JSON config (models.json) for a pluggable model list & prompt set.
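One possible shape for the config loader is sketched below. The schema (top-level `models` and `prompts` keys, each a list of strings) is an assumption made for illustration; the sample repository may use a different layout, and `load_config` is a hypothetical helper.

```python
import json
from pathlib import Path

# Fallbacks mirror the values hard-coded in the benchmark script.
DEFAULTS = {
    "models": ["qwen2.5-0.5b", "gemma-2-2b", "mistral-7b"],
    "prompts": ["Explain the principle of retrieval augmented generation in 2 sentences."],
}


def load_config(path="models.json"):
    """Load the model list and prompt set, falling back to DEFAULTS per key."""
    file = Path(path)
    data = json.loads(file.read_text(encoding="utf-8")) if file.exists() else {}
    if not isinstance(data, dict):
        raise ValueError("config root must be a JSON object")
    config = {key: data.get(key) or default for key, default in DEFAULTS.items()}
    for key, value in config.items():
        is_valid = isinstance(value, list) and all(isinstance(v, str) and v for v in value)
        if not is_valid:
            raise ValueError(f"'{key}' must be a list of non-empty strings")
    return config


# With no file at this (assumed absent) path, the defaults are returned.
print(load_config("no-such-models.json"))
```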
Validation Checklist
foundry cache list
foundry model run qwen2.5-0.5b
curl http://localhost:5273/v1/models
All target models should appear and respond to a probe chat request.
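The "all targets appear" check can be automated once the `/v1/models` response has been parsed. This sketch assumes the response follows the OpenAI list shape (`{"data": [{"id": ...}]}`) and that a served id contains the alias as a substring; both are assumptions, `missing_models` is a hypothetical helper, and the sample id is illustrative.

```python
def missing_models(models_payload, targets):
    """Return the targets with no matching id in a /v1/models style payload."""
    ids = [str(item.get("id", "")).lower() for item in models_payload.get("data", [])]
    # Substring match: served ids may carry variant suffixes beyond the alias.
    return [t for t in targets if not any(t.lower() in model_id for model_id in ids)]


# Illustrative payload; real ids depend on the variants your machine downloaded.
payload = {"data": [{"id": "qwen2.5-0.5b-instruct-generic-cpu"}]}
print(missing_models(payload, ["qwen2.5-0.5b", "mistral-7b"]))
```

The payload itself can be fetched with the standard library, for example `json.load(urllib.request.urlopen("http://localhost:5273/v1/models"))`.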
Sample Scenario & Workshop Mapping
| Workshop Script | Scenario | Goal | Prompt / Dataset Source |
|---|---|---|---|
| samples/session03/benchmark_oss_models.py / notebooks/session03_benchmark_oss_models.ipynb | Edge platform team selecting default SLM for embedded summarizer | Produce latency + p95 + tokens/sec comparison across candidate models | Inline PROMPT var + environment BENCH_MODELS list |
Scenario Narrative
Product engineering must choose a default lightweight summarization model for an offline meeting-notes feature. They run controlled deterministic benchmarks (temperature=0) across a fixed prompt set (see example below) and collect latency + throughput metrics with and without GPU acceleration.
Example Prompt Set JSON (expandable)
[
"Explain the principle of retrieval augmented generation in 2 sentences.",
"List 3 privacy benefits of local inference.",
"Summarize why model size impacts latency on consumer hardware.",
"Provide two scenarios where an SLM is preferable to an LLM."
]
Loop over every prompt for each model and capture per-prompt latency, so you can derive distribution metrics and detect outliers.
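One way to turn per-prompt latencies into distribution metrics is the interquartile-range rule sketched here. The choice of the IQR rule and the `k=1.5` multiplier are assumptions, not something the workshop prescribes; `latency_profile` is a hypothetical helper and the sample values are made up.

```python
import statistics


def latency_profile(latencies, k=1.5):
    """Summarize latencies and flag slow outliers with the IQR rule."""
    if len(latencies) < 4:
        raise ValueError("need at least 4 samples for meaningful quartiles")
    q1, median, q3 = statistics.quantiles(latencies, n=4, method="inclusive")
    upper = q3 + k * (q3 - q1)  # only slow outliers matter for latency
    return {
        "mean": statistics.mean(latencies),
        "median": median,
        "max": max(latencies),
        "outliers": [x for x in latencies if x > upper],
    }


# Made-up latencies in seconds; the last one simulates a cold-start spike.
samples = [1.0, 1.1, 0.9, 1.0, 5.0]
print(latency_profile(samples))
```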
Model Selection Framework
| Dimension | Metric | Why It Matters |
|---|---|---|
| Latency | avg / p95 | User experience consistency |
| Throughput | tokens/sec | Batch & streaming scalability |
| Memory | resident size | Device fit & concurrency |
| Quality | rubric prompts | Task suitability |
| Footprint | disk cache | Distribution & updates |
| License | use allowance | Commercial compliance |
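The numeric dimensions in this table can be combined into a single ranking. The sketch below uses a weighted sum of min-max normalized metrics, which is one common approach rather than the method the workshop mandates; the weights, metric names, and candidate values are placeholders, and `rank_models` is a hypothetical helper.

```python
def rank_models(candidates, weights):
    """Rank candidates by a weighted sum of min-max normalized metrics.

    weights maps metric name -> (weight, higher_is_better).
    """
    scores = {name: 0.0 for name in candidates}
    for metric, (weight, higher_is_better) in weights.items():
        values = [metrics[metric] for metrics in candidates.values()]
        lo, hi = min(values), max(values)
        for name, metrics in candidates.items():
            # All-equal metrics carry no signal; give every model a neutral 0.5.
            norm = 0.5 if hi == lo else (metrics[metric] - lo) / (hi - lo)
            scores[name] += weight * (norm if higher_is_better else 1.0 - norm)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Placeholder values for illustration only.
candidates = {
    "small": {"latency_p95": 1.0, "tokens_per_sec": 80.0, "quality": 3.0},
    "large": {"latency_p95": 4.0, "tokens_per_sec": 20.0, "quality": 4.5},
}
weights = {
    "latency_p95": (0.4, False),
    "tokens_per_sec": (0.2, True),
    "quality": (0.4, True),
}
ranked = rank_models(candidates, weights)
print(ranked)
```

License is usually better handled as a pass/fail filter applied before scoring than as a weighted term.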
Extending With Custom Model
High-level pattern (pseudo):
# pseudo_adapter.py (conceptual)
class CustomModelAdapter:
    def load(self, weights_path: str): ...
    def generate(self, prompt: str, **params) -> str: ...

# Register with local routing (future extensibility point)
Consult the official repo for evolving adapter interfaces: https://github.com/microsoft/Foundry-Local/tree/main/sdk/python
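Until a concrete adapter interface is published, the pattern can be exercised with a toy implementation. Everything in this sketch is hypothetical: `ModelAdapter`, `EchoAdapter`, `REGISTRY`, and `register` are illustrative names and an in-process dictionary, not Foundry Local APIs, and the adapter runs no model.

```python
from typing import Protocol


class ModelAdapter(Protocol):
    """Hypothetical interface matching the conceptual adapter above."""

    def load(self, weights_path: str) -> None: ...
    def generate(self, prompt: str, **params) -> str: ...


class EchoAdapter:
    """Toy adapter used only to exercise the interface; it runs no model."""

    def __init__(self):
        self.weights_path = None

    def load(self, weights_path: str) -> None:
        self.weights_path = weights_path  # a real adapter would load weights here

    def generate(self, prompt: str, **params) -> str:
        if self.weights_path is None:
            raise RuntimeError("call load() before generate()")
        return prompt[: params.get("max_chars", 80)]


REGISTRY: dict[str, ModelAdapter] = {}  # in-process stand-in for local routing


def register(alias: str, adapter: ModelAdapter) -> None:
    if alias in REGISTRY:
        raise ValueError(f"alias already registered: {alias}")
    REGISTRY[alias] = adapter


adapter = EchoAdapter()
adapter.load("weights/echo.bin")  # placeholder path; nothing is read from disk
register("echo", adapter)
print(REGISTRY["echo"].generate("hello world", max_chars=5))
```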
Troubleshooting
| Issue | Cause | Fix |
|---|---|---|
| OOM on mistral-7b | Insufficient RAM/GPU | Stop other models; try smaller variant |
| Slow first response | Cold load | Keep warm with a periodic lightweight prompt |
| Download stalls | Network instability | Retry; prefetch during off-peak |
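The "Retry" fix for stalled downloads can be scripted with exponential backoff. This is a minimal sketch: `retry` is a hypothetical helper, the delay values are arbitrary, and the demo uses a simulated failure and a fake sleep so it finishes instantly.

```python
import time


def retry(action, attempts=3, base_delay=2.0, sleep=time.sleep):
    """Call action() until it succeeds, doubling the wait after each failure."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except Exception:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** (attempt - 1))


calls = []
waits = []


def flaky_download():
    # Simulated stall: fails twice, then succeeds.
    calls.append(1)
    if len(calls) < 3:
        raise ConnectionError("simulated stall")
    return "ok"


result = retry(flaky_download, attempts=4, sleep=waits.append)
print(result, waits)
```

For a real download, the action could be `lambda: subprocess.run(["foundry", "model", "download", "mistral-7b"], check=True)`.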
References
- Foundry Local SDK: https://github.com/microsoft/Foundry-Local/tree/main/sdk/python
- Model Mondays: https://aka.ms/model-mondays
- Hugging Face Model Discovery: https://huggingface.co/models
Session Duration: 30 min (+ optional deep dive)
Difficulty: Intermediate
Optional Enhancements
| Enhancement | Benefit | How |
|---|---|---|
| Streaming First-Token Latency | Measures perceived responsiveness | Run benchmark with BENCH_STREAM=1 (enhanced script in Workshop/samples/session03) |
| Deterministic Mode | Stable regression comparisons | temperature=0, fixed prompt set, capture JSON outputs under version control |
| Quality Rubric Scoring | Adds qualitative dimension | Maintain prompts.json with expected facets; annotate scores (1–5) manually or via secondary model |
| CSV / Markdown Export | Shareable report | Extend script to write benchmark_report.md with table & highlights |
| Model Capability Tags | Helps automated routing later | Maintain models.json with {alias: {capabilities:[], size_mb:..}} |
| Cache Warmup Phase | Reduce cold-start bias | Execute one warm round before timing loop (already implemented in the enhanced script) |
| Percentile Accuracy | Robust tail latency | Use numpy percentile (already in refactored script) |
| Token Cost Approximation | Economic comparison | Approx cost = (avg tokens per request / tokens/sec) * energy heuristic, i.e. compute seconds per request times an estimated cost per second |
| Auto-Skipping Failed Models | Resilience in batch runs | Wrap each benchmark in try/except and mark status field |
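The auto-skipping row can be sketched as a thin wrapper around whatever per-model benchmark function you use. `benchmark_all` and `fake_benchmark` are hypothetical names; the fake function simulates one failing model so the example runs without a service.

```python
def benchmark_all(models, benchmark_fn):
    """Run benchmark_fn per model; record failures instead of aborting the batch."""
    summary = []
    for model in models:
        try:
            row = dict(benchmark_fn(model))
            row["status"] = "ok"
        except Exception as exc:  # broad on purpose: one bad model must not stop the run
            row = {
                "model": model,
                "status": "failed",
                "error": f"{type(exc).__name__}: {exc}",
            }
        summary.append(row)
    return summary


def fake_benchmark(model):
    # Stand-in for a real benchmark; fails for one made-up model name.
    if model == "missing-model":
        raise RuntimeError("model not found in cache")
    return {"model": model, "latency_avg": 1.0}


summary = benchmark_all(["qwen2.5-0.5b", "missing-model"], fake_benchmark)
for row in summary:
    print(row)
```

Downstream report code should then filter on `status == "ok"` before formatting numeric columns.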
Minimal Markdown Export Snippet
with open("benchmark_report.md", "w", encoding="utf-8") as f:
    f.write("|Model|Avg Latency|p95|TPS|\n|---|---|---|---|\n")
    for row in summary:
        # Rows use the "model" key produced by benchmark_model().
        f.write(f"|{row['model']}|{row['latency_avg']:.2f}|{row['latency_p95']:.2f}|{(row.get('tokens_per_sec_avg') or 0):.1f}|\n")
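A CSV counterpart to the Markdown export might look like the sketch below. It assumes summary rows keyed by `model`, `latency_avg`, `latency_p95`, and `tokens_per_sec_avg`, as returned by the benchmark script in this session; `write_csv` is a hypothetical helper and the row values are placeholders.

```python
import csv
import io

FIELDS = ["model", "latency_avg", "latency_p95", "tokens_per_sec_avg"]


def write_csv(summary, stream):
    """Write benchmark rows as CSV, ignoring keys outside FIELDS."""
    writer = csv.DictWriter(stream, fieldnames=FIELDS, extrasaction="ignore")
    writer.writeheader()
    for row in summary:
        # Missing or None metrics become empty cells.
        writer.writerow({key: row.get(key, "") for key in FIELDS})


# Placeholder row for illustration; not a real measurement.
rows = [{
    "model": "qwen2.5-0.5b",
    "latency_avg": 1.25,
    "latency_p95": 1.5,
    "tokens_per_sec_avg": None,
    "sample": "ignored by the CSV export",
}]
buffer = io.StringIO()
write_csv(rows, buffer)
print(buffer.getvalue())
```

When writing to a file instead of a `StringIO`, open it with `newline=""` as the `csv` module documentation recommends.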
Deterministic Prompt Set Example
[
"Summarize retrieval augmented generation.",
"List 3 privacy benefits of local inference.",
"Explain when to choose an SLM over an LLM."
]
Loop the static list instead of random prompts for comparable metrics across commits.
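To keep cross-commit comparisons honest, each result file can record a fingerprint of the prompt set it was produced with. This is a sketch under assumptions: `prompt_set_fingerprint` and `comparable` are hypothetical helpers, and the `prompt_set` field name is invented for illustration.

```python
import hashlib
import json


def prompt_set_fingerprint(prompts):
    """Short, stable hash of an ordered prompt list."""
    canonical = json.dumps(list(prompts), ensure_ascii=False, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


def comparable(run_a, run_b):
    """Two runs are comparable only if they used the same prompt set."""
    return run_a["prompt_set"] == run_b["prompt_set"]


PROMPTS = [
    "Summarize retrieval augmented generation.",
    "List 3 privacy benefits of local inference.",
    "Explain when to choose an SLM over an LLM.",
]
run_old = {"commit": "old", "prompt_set": prompt_set_fingerprint(PROMPTS)}
run_new = {"commit": "new", "prompt_set": prompt_set_fingerprint(PROMPTS)}
print(comparable(run_old, run_new))
```

Because order is part of the hash, reordering the list counts as a different prompt set, which is the conservative choice when per-prompt results are compared by position.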