Session 3: Open-Source Model Discovery and Management
September 30, 2025
Overview
This session focuses on practical model discovery and management with Foundry Local. You'll learn how to list available models, test different options, and understand basic performance characteristics. The approach emphasizes hands-on exploration with the foundry CLI to help you select the right models for your use cases.
Learning Objectives
- Master foundry CLI commands for model discovery and management
- Understand model cache and local storage patterns
- Learn to quickly test and compare different models
- Establish practical workflows for model selection and benchmarking
- Explore the growing ecosystem of models available through Foundry Local
Prerequisites
- Completed Session 1: Getting Started with Foundry Local
- Foundry Local CLI installed and accessible
- Sufficient storage space for model downloads (models can range from 1GB to 20GB+)
- Basic understanding of model types and use cases
Part 6: Hands-On Exercise
Exercise: Model Discovery and Comparison
Create your own model evaluation script based on Sample 03:
REM create_model_test.cmd
@echo off
echo Model Discovery and Testing Script
echo =====================================
echo.
echo Step 1: List available models
foundry model list
echo.
echo Step 2: Check what's cached
foundry cache list
echo.
echo Step 3: Start phi-4-mini for testing
foundry model run phi-4-mini --verbose
echo.
echo Step 4: Test with a simple prompt
curl -X POST http://localhost:8000/v1/chat/completions ^
-H "Content-Type: application/json" ^
-d "{\"model\":\"phi-4-mini\",\"messages\":[{\"role\":\"user\",\"content\":\"Hello, please introduce yourself.\"}],\"max_tokens\":100}"
echo.
echo Model test complete!
Your Task
- Run the Sample 03 script: samples\03\list_and_bench.cmd
- Try different models: Test at least 3 different models
- Compare performance: Note differences in speed and response quality
- Document findings: Create a simple comparison chart
Example Comparison Format
Model Comparison Results:
========================
phi-4-mini: Fast (~2s), good for general chat
qwen2.5-7b: Slower (~5s), better reasoning
deepseek-r1-7b: Medium (~3s), excellent for code
Recommendation: Start with phi-4-mini for development,
switch to qwen2.5-7b for production reasoning tasks.
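A chart in this format can also be generated from recorded measurements rather than typed by hand. The following is a minimal Python sketch, assuming you have collected one latency figure and one short note per model; the function name `format_comparison` and the tuple layout are illustrative and not part of Foundry Local or Sample 03.

```python
def format_comparison(results):
    """Render (model, latency_sec, note) tuples as an aligned text chart, fastest first."""
    rows = sorted(results, key=lambda r: r[1])
    width = max(len(model) for model, _, _ in rows)
    lines = ["Model Comparison Results:", "=" * 25]
    for model, latency, note in rows:
        lines.append(f"{model.ljust(width)}  {latency:5.1f}s  {note}")
    return "\n".join(lines)


# Placeholder numbers mirroring the example format, not real measurements.
sample = [
    ("qwen2.5-7b", 5.0, "better reasoning"),
    ("phi-4-mini", 2.0, "good for general chat"),
    ("deepseek-r1-7b", 3.0, "excellent for code"),
]
print(format_comparison(sample))
```

Sorting by latency puts the fastest model first, which makes the speed versus quality trade-off easier to scan. The function expects at least one result.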
Part 7: Troubleshooting and Best Practices
Common Issues and Solutions
Model Won't Start:
REM Check service status
foundry service status
REM Restart service if needed
foundry service stop
foundry service start
REM Try with verbose output
foundry model run phi-4-mini --verbose
Insufficient Memory:
- Start with smaller models (phi-4-mini)
- Close other applications
- Upgrade RAM if frequently hitting limits
Slow Performance:
- Ensure model is fully loaded (check verbose output)
- Close unnecessary background applications
- Consider faster storage (SSD)
Best Practices
- Start Small: Begin with phi-4-mini to validate setup
- One Model at a Time: Stop previous models before starting new ones
- Monitor Resources: Keep an eye on memory usage
- Test Consistently: Use the same prompts for fair comparisons
- Document Results: Keep notes on model performance for your use cases
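One way to follow the "Test Consistently" and "Document Results" practices is to log every run in the same column layout. The sketch below writes results as CSV using only the Python standard library; the column names in `FIELDS` and the function name `write_results` are assumptions chosen for illustration.

```python
import csv
import io

# Hypothetical column layout; adapt it to whatever you want to track.
FIELDS = ["model", "prompt", "latency_sec", "notes"]


def write_results(rows, stream):
    """Write result dicts to `stream` as CSV with a fixed column order."""
    writer = csv.DictWriter(stream, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)


# Demo: write to an in-memory buffer; pass an open file object to persist results.
buffer = io.StringIO()
write_results(
    [{"model": "phi-4-mini", "prompt": "Explain edge AI in one sentence.",
      "latency_sec": 2.0, "notes": "placeholder entry"}],
    buffer,
)
print(buffer.getvalue())
```

A fixed column order keeps files from different test sessions comparable and easy to merge.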
Part 8: Next Steps and References
Preparing for Session 4
- Session 4 Focus: Optimization tools and techniques
- Prerequisites: Comfortable with model switching and basic performance testing
- Recommended: Have 2-3 favorite models identified from this session
Additional Resources
- Foundry Local Documentation: Official documentation
- CLI Reference: Complete command reference
- Model Mondays: Weekly model spotlights
- Foundry Local GitHub: Community and issues
- Sample 03: Model Discovery: Hands-on example script
Key Takeaways
✅ Model Discovery: Use foundry model list to explore available models
✅ Quick Testing: The list_and_bench.cmd pattern for rapid evaluation
✅ Performance Monitoring: Basic resource usage and response time measurement
✅ Model Selection: Practical guidelines for choosing models by use case
✅ Cache Management: Understanding storage and cleanup procedures
You now have the practical skills to discover, test, and select appropriate models for your AI applications using Foundry Local's straightforward CLI approach.
This session explores how to bring open-source models to Foundry Local: selecting community models, integrating Hugging Face content, and adopting “bring your own model” (BYOM) strategies. You’ll also discover the Model Mondays series for continuous learning and model discovery.
References:
- Foundry Local docs: https://learn.microsoft.com/en-us/azure/ai-foundry/foundry-local/
- Compile Hugging Face models: https://learn.microsoft.com/en-us/azure/ai-foundry/foundry-local/how-to/how-to-compile-hugging-face-models
- Model Mondays: https://aka.ms/model-mondays
- Foundry Local GitHub: https://github.com/microsoft/Foundry-Local
Learning Objectives
- Discover and evaluate open-source models for local inference
- Compile and run select Hugging Face models within Foundry Local
- Apply model selection strategies for accuracy, latency, and resource needs
- Manage models locally with cache and versioning
Part 1: Model Discovery with Foundry CLI
Basic Model Management Commands
The foundry CLI provides straightforward commands for model discovery and management:
REM List all available models in the catalog
foundry model list
REM List cached (downloaded) models
foundry cache list
REM Show the cache directory location (run foundry cache --help to confirm subcommands)
foundry cache location
Running Your First Models
Start with popular, well-tested models to understand performance characteristics:
REM Run Phi-4-Mini (lightweight, fast)
foundry model run phi-4-mini --verbose
REM Run Qwen 2.5 7B (larger, more capable)
foundry model run qwen2.5-7b --verbose
REM Run DeepSeek (specialized for coding)
foundry model run deepseek-r1-7b --verbose
Note: The --verbose flag provides detailed startup information, including:
- Model download progress (on first run)
- Memory allocation details
- Service binding information
- Performance initialization metrics
Understanding Model Categories
Small Language Models (SLMs):
- phi-4-mini: Fast, efficient, great for general chat
- phi-4: More capable version with better reasoning
Medium Models:
- qwen2.5-7b: Excellent reasoning and longer context
- deepseek-r1-7b: Optimized for code generation
Larger Models:
- llama-3.2: Meta's latest open-source model
- qwen2.5-14b: Enterprise-grade reasoning
Part 2: Quick Model Testing and Comparison
Sample 03 Approach: Simple List and Bench
Based on our Sample 03 pattern, here's the minimal workflow:
@echo off
REM Sample 03 - List and bench pattern
echo Listing available models...
foundry model list
echo.
echo Checking cached models...
foundry cache list
echo.
echo Starting phi-4-mini with verbose output...
foundry model run phi-4-mini --verbose
Testing Model Performance
Once a model is running, test it with consistent prompts:
REM Test via curl (Windows Command Prompt)
curl -X POST http://localhost:8000/v1/chat/completions ^
-H "Content-Type: application/json" ^
-d "{\"model\":\"phi-4-mini\",\"messages\":[{\"role\":\"user\",\"content\":\"Explain edge AI in one sentence.\"}],\"max_tokens\":50}"
PowerShell Testing Alternative
# PowerShell approach for testing
$body = @{
model = "phi-4-mini"
messages = @(
@{
role = "user"
content = "Explain edge AI in one sentence."
}
)
max_tokens = 50
} | ConvertTo-Json -Depth 3
Invoke-RestMethod -Uri "http://localhost:8000/v1/chat/completions" -Method Post -Body $body -ContentType "application/json"
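For scripts that should not depend on curl or PowerShell, the same request can be sketched in Python with only the standard library. This assumes the endpoint used in the examples above (http://localhost:8000/v1/chat/completions); adjust the URL if your service listens elsewhere. The helper names `build_chat_request` and `ask` are illustrative.

```python
import json
import urllib.request

# Assumed endpoint: same URL as the curl and PowerShell examples.
ENDPOINT = "http://localhost:8000/v1/chat/completions"


def build_chat_request(model, prompt, max_tokens=50, endpoint=ENDPOINT):
    """Build (but do not send) a chat completion request."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        endpoint,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(model, prompt, timeout=30):
    """Send the request and return the first reply; the model must already be running."""
    request = build_chat_request(model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

With a model running, `ask("phi-4-mini", "Explain edge AI in one sentence.")` would return the reply text. Separating request construction from sending makes the payload easy to inspect before anything goes over the wire.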
Part 3: Model Cache and Storage Management
Understanding the Model Cache
Foundry Local automatically manages model downloads and caching:
REM List the models currently in the cache
foundry cache list
REM Show the cache directory location
foundry cache location
REM Remove a model you no longer need (run foundry cache --help to confirm subcommands)
foundry cache remove phi-4-mini
Model Storage Considerations
Typical Model Sizes:
- phi-4-mini: ~2.5 GB
- qwen2.5-7b: ~4.1 GB
- deepseek-r1-7b: ~4.3 GB
- llama-3.2: ~4.9 GB
- qwen2.5-14b: ~8.2 GB
Storage Best Practices:
- Keep 2-3 models cached for quick switching
- Remove unused models to free space (check foundry cache --help for the removal subcommand available in your version)
- Monitor disk usage, especially on smaller SSDs
- Consider model size vs. capability trade-offs
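Before downloading several models, it can help to estimate the combined footprint against free disk space. The sketch below uses the approximate sizes listed above and `shutil.disk_usage` from the standard library; the 5 GB headroom default and the function names are assumptions, and real download sizes may differ by model variant.

```python
import shutil

# Approximate sizes in GB, copied from the list above; actual downloads may differ.
MODEL_SIZES_GB = {
    "phi-4-mini": 2.5,
    "qwen2.5-7b": 4.1,
    "deepseek-r1-7b": 4.3,
    "llama-3.2": 4.9,
    "qwen2.5-14b": 8.2,
}


def cache_footprint_gb(models):
    """Sum the approximate size of the given models."""
    return round(sum(MODEL_SIZES_GB[m] for m in models), 1)


def fits_on_disk(models, path=".", headroom_gb=5.0):
    """Check whether the models fit on the drive holding `path`, leaving some headroom."""
    free_gb = shutil.disk_usage(path).free / 1024**3
    return cache_footprint_gb(models) + headroom_gb <= free_gb


wanted = ["phi-4-mini", "qwen2.5-7b", "deepseek-r1-7b"]
print(f"Planned cache: {cache_footprint_gb(wanted)} GB, fits: {fits_on_disk(wanted)}")
```

Pass the cache directory as `path` so the check runs against the drive that actually holds the models.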
Model Performance Monitoring
While models are running, monitor system resources:
Windows Task Manager:
- Watch memory usage (models stay loaded in RAM)
- Monitor CPU utilization during inference
- Check disk I/O during initial model loading
Command Line Monitoring:
# Check memory usage (run in PowerShell)
Get-Process | Where-Object {$_.ProcessName -like "*foundry*"} | Select-Object ProcessName, WorkingSet64
# List models currently loaded in the service
foundry service ps
Part 4: Practical Model Selection Guidelines
Choosing Models by Use Case
For General Chat and Q&A:
- Start with: phi-4-mini (fast, efficient)
- Upgrade to: phi-4 (better reasoning)
- Advanced: qwen2.5-7b (longer context)
For Code Generation:
- Recommended: deepseek-r1-7b
- Alternative: qwen2.5-7b (also good for code)
For Complex Reasoning:
- Best: qwen2.5-7b or qwen2.5-14b
- Budget option: phi-4
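These guidelines can be encoded as a small lookup so that scripts choose a sensible default. The following is a minimal sketch: the use-case keys (`chat`, `code`, `reasoning`) and the function `pick_model` are hypothetical names, and the preference order is one reading of the recommendations above.

```python
# Recommendations from the guidelines above, in order of preference.
RECOMMENDATIONS = {
    "chat": ["phi-4-mini", "phi-4", "qwen2.5-7b"],
    "code": ["deepseek-r1-7b", "qwen2.5-7b"],
    "reasoning": ["qwen2.5-7b", "qwen2.5-14b", "phi-4"],
}


def pick_model(use_case, cached=()):
    """Return the first recommended model that is already cached, else the top pick."""
    candidates = RECOMMENDATIONS[use_case]
    for model in candidates:
        if model in cached:
            return model
    return candidates[0]


print(pick_model("code"))
print(pick_model("chat", cached={"phi-4"}))
```

Preferring a cached model avoids a multi-gigabyte download when an acceptable alternative is already on disk.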
Hardware Requirements Guide
Minimum System Requirements:
phi-4-mini: 8GB RAM, entry-level CPU
phi-4: 12GB RAM, mid-range CPU
qwen2.5-7b: 16GB RAM, mid-range CPU
deepseek-r1-7b: 16GB RAM, mid-range CPU
qwen2.5-14b: 24GB RAM, high-end CPU
Recommended for Best Performance:
- 32GB+ RAM for comfortable multi-model switching
- SSD storage for faster model loading
- Modern CPU with good single-thread performance
- NPU support (Windows 11 Copilot+ PCs) for acceleration
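To narrow the candidates for a given machine, the minimum RAM figures above can be turned into a simple filter. This is a sketch under the assumption that RAM is the deciding constraint; the function name `models_for_ram` is illustrative, and the DeepSeek entry uses the `deepseek-r1-7b` name from the run commands.

```python
# Minimum RAM in GB per model, taken from the guide above.
MIN_RAM_GB = {
    "phi-4-mini": 8,
    "phi-4": 12,
    "qwen2.5-7b": 16,
    "deepseek-r1-7b": 16,
    "qwen2.5-14b": 24,
}


def models_for_ram(ram_gb):
    """List models whose minimum RAM requirement fits, smallest requirement first."""
    fits = [m for m, need in MIN_RAM_GB.items() if need <= ram_gb]
    return sorted(fits, key=lambda m: MIN_RAM_GB[m])


print(models_for_ram(16))
```

The figures are minimums, so a model at the top of your machine's range may still run slowly if other applications are using memory.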
Model Switching Workflow
REM Stop current model (if needed)
foundry service stop
REM Start different model
foundry model run qwen2.5-7b
REM Verify model is running
foundry service status
Part 5: Simple Model Benchmarking
Basic Performance Testing
Here's a straightforward approach to compare model performance:
# simple_bench.py - Based on Sample 03 patterns
import time

import requests


def test_model_response(model_name, prompt="Explain edge AI in one sentence."):
    """Test a single model with a prompt and measure response time."""
    start_time = time.time()
    try:
        response = requests.post(
            "http://localhost:8000/v1/chat/completions",
            headers={"Content-Type": "application/json"},
            json={
                "model": model_name,
                "messages": [{"role": "user", "content": prompt}],
                "max_tokens": 64,
            },
            timeout=30,
        )
        elapsed = time.time() - start_time
        if response.status_code == 200:
            result = response.json()
            return {
                "model": model_name,
                "latency_sec": round(elapsed, 3),
                "response": result["choices"][0]["message"]["content"],
                "status": "success",
            }
        # Non-200 responses are reported as errors rather than raised
        return {
            "model": model_name,
            "status": "error",
            "error": f"HTTP {response.status_code}",
        }
    except Exception as e:
        # Connection failures and timeouts end up here
        return {
            "model": model_name,
            "status": "error",
            "error": str(e),
        }


if __name__ == "__main__":
    # Each model must already be running before it is tested
    test_models = ["phi-4-mini", "qwen2.5-7b", "deepseek-r1-7b"]
    print("Model Performance Test")
    print("=" * 50)
    for model in test_models:
        print(f"\nTesting {model}...")
        print(f"Note: Make sure this model is running first with 'foundry model run {model}'")
        result = test_model_response(model)
        if result["status"] == "success":
            print(f"✅ {model}: {result['latency_sec']}s")
            print(f"   Response: {result['response'][:100]}...")
        else:
            print(f"❌ {model}: {result['error']}")
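A single timed request can be skewed by model loading or background activity, so repeating the call and reporting the median usually gives a steadier figure. The sketch below is one way to do that; `time_calls` is a hypothetical helper that accepts any zero-argument callable, and the warm-up count is an assumption.

```python
import statistics
import time


def time_calls(send, runs=5, warmup=1):
    """Call `send()` repeatedly and summarize wall-clock latency in seconds."""
    for _ in range(warmup):
        send()  # first call may include model load time, so it is not measured
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        send()
        samples.append(time.perf_counter() - start)
    return {
        "runs": runs,
        "median_sec": round(statistics.median(samples), 3),
        "min_sec": round(min(samples), 3),
        "max_sec": round(max(samples), 3),
    }


# Stand-in workload so the sketch runs without a model; swap in a real request.
print(time_calls(lambda: time.sleep(0.01), runs=3))
```

With the script above, a call such as `time_calls(lambda: test_model_response("phi-4-mini"))` would time the real endpoint. The median is less sensitive to a single slow outlier than the mean.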
Manual Quality Assessment
For each model, test with consistent prompts and manually evaluate:
Test Prompts:
- "Explain quantum computing in simple terms."
- "Write a Python function to sort a list."
- "What are the pros and cons of remote work?"
- "Summarize the benefits of edge AI."
Evaluation Criteria:
- Accuracy: Is the information correct?
- Clarity: Is the explanation easy to understand?
- Completeness: Does it address the full question?
- Speed: How quickly does it respond?
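To make manual assessments comparable across models, each response can be rated per criterion and the ratings averaged. This sketch assumes a 1 to 5 scale, which the session does not prescribe; the names `CRITERIA` and `average_scores` are illustrative.

```python
CRITERIA = ("accuracy", "clarity", "completeness", "speed")


def average_scores(ratings):
    """Average ratings per criterion across prompts.

    `ratings` is a list of dicts, one per test prompt, keyed by criterion.
    """
    if not ratings:
        raise ValueError("need at least one rated prompt")
    return {c: round(sum(r[c] for r in ratings) / len(ratings), 2) for c in CRITERIA}


# Hypothetical hand-assigned ratings for two prompts (assumed 1-5 scale).
notes = [
    {"accuracy": 4, "clarity": 5, "completeness": 3, "speed": 5},
    {"accuracy": 5, "clarity": 4, "completeness": 4, "speed": 4},
]
print(average_scores(notes))
```

Keeping the per-criterion averages separate, rather than collapsing them into one number, shows where a model is strong or weak for your use case.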
Resource Usage Monitoring
REM Monitor while testing different models
REM Start model
foundry model run phi-4-mini
REM In another terminal, monitor resources
foundry service status
foundry service ps
# Check system resources (run in PowerShell)
Get-Process | Where-Object ProcessName -Like "*foundry*" | Format-Table ProcessName, WorkingSet64, CPU
Part 6: Next Steps
- Subscribe to Model Mondays for new models and tips: https://aka.ms/model-mondays
- Contribute findings to your team’s models.json
- Prepare for Session 4: comparing LLMs vs SLMs, local vs cloud inference, and hands-on demos