Session 3: Open-Source Model Discovery and Management

September 30, 2025

Overview

This session focuses on practical model discovery and management with Foundry Local. You'll learn how to list available models, test different options, and understand basic performance characteristics. The approach emphasizes hands-on exploration with the foundry CLI to help you select the right models for your use cases.

Learning Objectives

Master foundry CLI commands for model discovery and management
Understand model cache and local storage patterns
Learn to quickly test and compare different models
Establish practical workflows for model selection and benchmarking
Explore the growing ecosystem of models available through Foundry Local

Prerequisites

Completed Session 1: Getting Started with Foundry Local
Foundry Local CLI installed and accessible
Sufficient storage space for model downloads (models can range from 1GB to 20GB+)
Basic understanding of model types and use cases

Part 6: Hands-On Exercise

Exercise: Model Discovery and Comparison

Create your own model evaluation script based on Sample 03:

REM create_model_test.cmd
@echo off
echo Model Discovery and Testing Script
echo =====================================

echo.
echo Step 1: List available models
foundry model list

echo.
echo Step 2: Check what's cached
foundry cache list

echo.
echo Step 3: Start phi-4-mini for testing
foundry model run phi-4-mini --verbose

echo.
echo Step 4: Test with a simple prompt
curl -X POST http://localhost:8000/v1/chat/completions ^
  -H "Content-Type: application/json" ^
  -d "{\"model\":\"phi-4-mini\",\"messages\":[{\"role\":\"user\",\"content\":\"Hello, please introduce yourself.\"}],\"max_tokens\":100}"

echo.
echo Model test complete!

Your Task

Run the Sample 03 script: samples\03\list_and_bench.cmd
Try different models: Test at least 3 different models
Compare performance: Note differences in speed and response quality
Document findings: Create a simple comparison chart

Example Comparison Format

Model Comparison Results:
========================
phi-4-mini:       Fast (~2s), good for general chat
qwen2.5-7b:       Slower (~5s), better reasoning
deepseek-r1-7b:   Medium (~3s), excellent for code

Recommendation: Start with phi-4-mini for development, 
switch to qwen2.5-7b for production reasoning tasks.
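
If you keep your notes in a script, a chart in this format can be generated rather than typed by hand. The following is a minimal sketch: the function name format_comparison and the tuple layout are illustrative assumptions, and the values simply mirror the example chart rather than real measurements.

```python
def format_comparison(rows):
    """Render (model, speed_note, strength) tuples as an aligned text chart."""
    width = max(len(model) for model, _, _ in rows) + 1  # room for the colon
    lines = ["Model Comparison Results:", "=" * 24]
    for model, speed, strength in rows:
        label = f"{model}:"
        lines.append(f"{label:<{width}}  {speed}, {strength}")
    return "\n".join(lines)


if __name__ == "__main__":
    # Illustrative values mirroring the example chart; replace with your notes.
    notes = [
        ("phi-4-mini", "Fast (~2s)", "good for general chat"),
        ("qwen2.5-7b", "Slower (~5s)", "better reasoning"),
        ("deepseek-r1-7b", "Medium (~3s)", "excellent for code"),
    ]
    print(format_comparison(notes))
```

Padding each label to the longest model name keeps the columns aligned when you add models with longer aliases.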

Part 7: Troubleshooting and Best Practices

Common Issues and Solutions

Model Won't Start:

REM Check service status
foundry service status

REM Restart service if needed
foundry service stop
foundry service start

REM Try with verbose output
foundry model run phi-4-mini --verbose

Insufficient Memory:

Start with smaller models (phi-4-mini)
Close other applications
Upgrade RAM if frequently hitting limits

Slow Performance:

Ensure model is fully loaded (check verbose output)
Close unnecessary background applications
Consider faster storage (SSD)

Best Practices

Start Small: Begin with phi-4-mini to validate setup
One Model at a Time: Stop previous models before starting new ones
Monitor Resources: Keep an eye on memory usage
Test Consistently: Use the same prompts for fair comparisons
Document Results: Keep notes on model performance for your use cases

Part 8: Next Steps and References

Preparing for Session 4

Session 4 Focus: Optimization tools and techniques
Prerequisites: Comfortable with model switching and basic performance testing
Recommended: Have 2-3 favorite models identified from this session

Additional Resources

Foundry Local Documentation: Official documentation
CLI Reference: Complete command reference
Model Mondays: Weekly model spotlights
Foundry Local GitHub: Community and issues
Sample 03: Model Discovery: Hands-on example script

Key Takeaways

✅ Model Discovery: Use foundry model list to explore available models
✅ Quick Testing: The list_and_bench.cmd pattern for rapid evaluation
✅ Performance Monitoring: Basic resource usage and response time measurement
✅ Model Selection: Practical guidelines for choosing models by use case
✅ Cache Management: Understanding storage and cleanup procedures

You now have the practical skills to discover, test, and select appropriate models for your AI applications using Foundry Local's straightforward CLI approach.

This session explores how to bring open-source models to Foundry Local: selecting community models, integrating Hugging Face content, and adopting “bring your own model” (BYOM) strategies. You’ll also discover the Model Mondays series for continuous learning and model discovery.

References:

Foundry Local docs: https://learn.microsoft.com/en-us/azure/ai-foundry/foundry-local/
Compile Hugging Face models: https://learn.microsoft.com/en-us/azure/ai-foundry/foundry-local/how-to/how-to-compile-hugging-face-models
Model Mondays: https://aka.ms/model-mondays
Foundry Local GitHub: https://github.com/microsoft/Foundry-Local

Learning Objectives

Discover and evaluate open-source models for local inference
Compile and run select Hugging Face models within Foundry Local
Apply model selection strategies for accuracy, latency, and resource needs
Manage models locally with cache and versioning

Part 1: Model Discovery with Foundry CLI

Basic Model Management Commands

The foundry CLI provides straightforward commands for model discovery and management:

REM List all available models in the catalog
foundry model list

REM List cached (downloaded) models
foundry cache list

REM Show the cache directory location
foundry cache location

Running Your First Models

Start with popular, well-tested models to understand performance characteristics:

REM Run Phi-4-Mini (lightweight, fast)
foundry model run phi-4-mini --verbose

REM Run Qwen 2.5 7B (larger, more capable)
foundry model run qwen2.5-7b --verbose

REM Run DeepSeek (specialized for coding)
foundry model run deepseek-r1-7b --verbose

Note: The --verbose flag provides detailed startup information, including:

Model download progress (on first run)
Memory allocation details
Service binding information
Performance initialization metrics

Understanding Model Categories

Small Language Models (SLMs):

phi-4-mini: Fast, efficient, great for general chat
phi-4: More capable version with better reasoning

Medium Models:

qwen2.5-7b: Excellent reasoning and longer context
deepseek-r1-7b: Optimized for code generation

Larger Models:

llama-3.2: Meta's open-weight model family
qwen2.5-14b: Enterprise-grade reasoning

Part 2: Quick Model Testing and Comparison

Sample 03 Approach: Simple List and Bench

Based on our Sample 03 pattern, here's the minimal workflow:

@echo off
REM Sample 03 - List and bench pattern
echo Listing available models...
foundry model list

echo.
echo Checking cached models...
foundry cache list

echo.
echo Starting phi-4-mini with verbose output...
foundry model run phi-4-mini --verbose

Testing Model Performance

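Once a model is running, test it with consistent prompts. The examples below assume the service listens on port 8000; if your installation reports a different endpoint (for example in the output of foundry service status), substitute that address: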

REM Test via curl (Windows Command Prompt)
curl -X POST http://localhost:8000/v1/chat/completions ^
  -H "Content-Type: application/json" ^
  -d "{\"model\":\"phi-4-mini\",\"messages\":[{\"role\":\"user\",\"content\":\"Explain edge AI in one sentence.\"}],\"max_tokens\":50}"

PowerShell Testing Alternative

# PowerShell approach for testing
$body = @{
    model = "phi-4-mini"
    messages = @(
        @{
            role = "user"
            content = "Explain edge AI in one sentence."
        }
    )
    max_tokens = 50
} | ConvertTo-Json -Depth 3

Invoke-RestMethod -Uri "http://localhost:8000/v1/chat/completions" -Method Post -Body $body -ContentType "application/json"
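
Hand-escaping JSON for Command Prompt is easy to get wrong, so it can help to generate the request body instead. The sketch below builds the same body as the curl example; build_chat_body is a hypothetical helper name, not part of Foundry Local.

```python
import json


def build_chat_body(model, prompt, max_tokens=50):
    """Return the JSON request body used by the curl and PowerShell examples."""
    return json.dumps(
        {
            "model": model,
            "messages": [{"role": "user", "content": prompt}],
            "max_tokens": max_tokens,
        },
        separators=(",", ":"),  # compact form, matching the curl example
    )


if __name__ == "__main__":
    # Prints the body; redirect the output to a file to reuse it with curl.
    print(build_chat_body("phi-4-mini", "Explain edge AI in one sentence."))
```

One way to use it: save the output to a file (body.json is an assumed name) and pass it to curl with -d @body.json, which avoids the backslash-escaped quotes entirely.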

Part 3: Model Cache and Storage Management

Understanding the Model Cache

Foundry Local automatically manages model downloads and caching:

REM List cached models
foundry cache list

REM View cache location
foundry cache location

REM Remove a cached model you no longer need (example: qwen2.5-14b)
foundry cache remove qwen2.5-14b

Model Storage Considerations

Typical Model Sizes:

phi-4-mini: ~2.5 GB
qwen2.5-7b: ~4.1 GB
deepseek-r1-7b: ~4.3 GB
llama-3.2: ~4.9 GB
qwen2.5-14b: ~8.2 GB

Storage Best Practices:

Keep 2-3 models cached for quick switching
Remove unused models to free space: foundry cache remove <model>
Monitor disk usage, especially on smaller SSDs
Consider model size vs. capability trade-offs
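
To make the size trade-off concrete, the following sketch adds up the approximate sizes listed above and compares them with free disk space. The helper names and the 10 GB reserve are assumptions chosen for illustration, and "." is only a stand-in for the drive that actually holds your model cache.

```python
import shutil

# Approximate download sizes in GB, copied from the list above.
MODEL_SIZES_GB = {
    "phi-4-mini": 2.5,
    "qwen2.5-7b": 4.1,
    "deepseek-r1-7b": 4.3,
    "llama-3.2": 4.9,
    "qwen2.5-14b": 8.2,
}


def cache_footprint_gb(models):
    """Sum the approximate sizes of the models you plan to keep cached."""
    return round(sum(MODEL_SIZES_GB[name] for name in models), 1)


def fits_on_disk(models, free_gb, reserve_gb=10.0):
    """True if the models fit while leaving reserve_gb of free space."""
    return cache_footprint_gb(models) + reserve_gb <= free_gb


if __name__ == "__main__":
    keep = ["phi-4-mini", "qwen2.5-7b", "deepseek-r1-7b"]
    # "." is a stand-in; point this at the drive that holds your model cache.
    free_gb = shutil.disk_usage(".").free / 1024**3
    print(f"Planned cache: {cache_footprint_gb(keep)} GB, free: {free_gb:.1f} GB")
    print("Fits" if fits_on_disk(keep, free_gb) else "Free up space first")
```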

Model Performance Monitoring

While models are running, monitor system resources:

Windows Task Manager:

Watch memory usage (models stay loaded in RAM)
Monitor CPU utilization during inference
Check disk I/O during initial model loading

Command Line Monitoring:

REM Check memory usage (calls PowerShell from Command Prompt)
powershell -NoProfile -Command "Get-Process | Where-Object { $_.ProcessName -like '*foundry*' } | Select-Object ProcessName, WorkingSet64"

REM Monitor running models
foundry service ps

Part 4: Practical Model Selection Guidelines

Choosing Models by Use Case

For General Chat and Q&A:

Start with: phi-4-mini (fast, efficient)
Upgrade to: phi-4 (better reasoning)
Advanced: qwen2.5-7b (longer context)

For Code Generation:

Recommended: deepseek-r1-7b
Alternative: qwen2.5-7b (also good for code)

For Complex Reasoning:

Best: qwen2.5-7b or qwen2.5-14b
Budget option: phi-4
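
If you want these recommendations available to scripts, they can be captured as a small lookup table. This is a sketch only: the use-case keys and the suggest_models name are assumptions, and the ordering simply follows the lists above.

```python
# Recommendations copied from the lists above, first choice first.
RECOMMENDATIONS = {
    "chat": ["phi-4-mini", "phi-4", "qwen2.5-7b"],
    "code": ["deepseek-r1-7b", "qwen2.5-7b"],
    "reasoning": ["qwen2.5-7b", "qwen2.5-14b", "phi-4"],
}


def suggest_models(use_case):
    """Return the candidate models for a use case, first choice first."""
    key = use_case.strip().lower()
    if key not in RECOMMENDATIONS:
        known = ", ".join(sorted(RECOMMENDATIONS))
        raise ValueError(f"Unknown use case '{use_case}'. Try one of: {known}")
    return list(RECOMMENDATIONS[key])


if __name__ == "__main__":
    for case in RECOMMENDATIONS:
        print(f"{case}: {', '.join(suggest_models(case))}")
```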

Hardware Requirements Guide

Minimum System Requirements:

phi-4-mini:       8GB RAM,  entry-level CPU
phi-4:           12GB RAM,  mid-range CPU
qwen2.5-7b:      16GB RAM,  mid-range CPU
deepseek-r1-7b:  16GB RAM,  mid-range CPU
qwen2.5-14b:     24GB RAM,  high-end CPU

Recommended for Best Performance:

32GB+ RAM for comfortable multi-model switching
SSD storage for faster model loading
Modern CPU with good single-thread performance
NPU support (Windows 11 Copilot+ PCs) for acceleration
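
As a quick check against the minimum requirements above, the sketch below filters the table by installed RAM. The models_for_ram helper is hypothetical, and the figures are the guide's minimums, not measured values.

```python
# Minimum RAM in GB, copied from the requirements table above.
MIN_RAM_GB = {
    "phi-4-mini": 8,
    "phi-4": 12,
    "qwen2.5-7b": 16,
    "deepseek-r1-7b": 16,
    "qwen2.5-14b": 24,
}


def models_for_ram(installed_gb):
    """Return models whose minimum RAM does not exceed installed_gb."""
    return [name for name, need in MIN_RAM_GB.items() if need <= installed_gb]


if __name__ == "__main__":
    for ram in (8, 16, 32):
        print(f"{ram} GB: {', '.join(models_for_ram(ram))}")
```

Meeting the minimum only means the model should load; other running applications still compete for the same memory.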

Model Switching Workflow

REM Stop current model (if needed)
foundry service stop

REM Start different model
foundry model run qwen2.5-7b

REM Verify model is running
foundry service status

Part 5: Simple Model Benchmarking

Basic Performance Testing

Here's a straightforward approach to compare model performance:

# simple_bench.py - Based on Sample 03 patterns
import time

import requests  # third-party: pip install requests

def test_model_response(model_name, prompt="Explain edge AI in one sentence."):
    """Test a single model with a prompt and measure response time."""
    start_time = time.time()
    
    try:
        response = requests.post(
            "http://localhost:8000/v1/chat/completions",
            headers={"Content-Type": "application/json"},
            json={
                "model": model_name,
                "messages": [{"role": "user", "content": prompt}],
                "max_tokens": 64
            },
            timeout=30
        )
        
        elapsed = time.time() - start_time
        
        if response.status_code == 200:
            result = response.json()
            return {
                "model": model_name,
                "latency_sec": round(elapsed, 3),
                "response": result["choices"][0]["message"]["content"],
                "status": "success"
            }
        else:
            return {
                "model": model_name,
                "status": "error",
                "error": f"HTTP {response.status_code}"
            }
            
    except Exception as e:
        return {
            "model": model_name,
            "status": "error", 
            "error": str(e)
        }

# Test the currently running model
if __name__ == "__main__":
    # Test with different models (start each model first)
    test_models = ["phi-4-mini", "qwen2.5-7b", "deepseek-r1-7b"]
    
    print("Model Performance Test")
    print("=" * 50)
    
    for model in test_models:
        print(f"\nTesting {model}...")
        print(f"Note: Make sure this model is running first with 'foundry model run {model}'")
        
        result = test_model_response(model)
        
        if result["status"] == "success":
            print(f"✅ {model}: {result['latency_sec']}s")
            print(f"   Response: {result['response'][:100]}...")
        else:
            print(f"❌ {model}: {result['error']}")
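
A single timed request can be noisy, so it may be worth repeating the measurement and reporting the median. The helper below is a sketch under that assumption: time_calls is a hypothetical name, it accepts any zero-argument callable, and the sleep call is only a stand-in so the example runs without a model.

```python
import statistics
import time


def time_calls(send_request, runs=5, warmup=1):
    """Call send_request repeatedly and summarize latency in seconds."""
    for _ in range(warmup):
        send_request()  # untimed; the first call often includes load time
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        send_request()
        samples.append(time.perf_counter() - start)
    return {
        "runs": runs,
        "median_sec": round(statistics.median(samples), 3),
        "min_sec": round(min(samples), 3),
        "max_sec": round(max(samples), 3),
    }


if __name__ == "__main__":
    # Stand-in workload; swap in a function that posts to your local
    # endpoint, such as a wrapper around test_model_response.
    print(time_calls(lambda: time.sleep(0.01), runs=3))
```

The median is less sensitive than the mean to one unusually slow request, which makes comparisons between models more stable.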

Manual Quality Assessment

For each model, test with consistent prompts and manually evaluate:

Test Prompts:

"Explain quantum computing in simple terms."
"Write a Python function to sort a list."
"What are the pros and cons of remote work?"
"Summarize the benefits of edge AI."

Evaluation Criteria:

Accuracy: Is the information correct?
Clarity: Is the explanation easy to understand?
Completeness: Does it address the full question?
Speed: How quickly does it respond?
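
To keep manual ratings comparable across models, you could record a score per criterion for each prompt and average them. The 1-5 scale and the average_scores helper are assumptions made for this sketch; the criteria names come from the list above.

```python
CRITERIA = ("accuracy", "clarity", "completeness", "speed")


def average_scores(ratings):
    """Average 1-5 ratings per criterion across prompts for one model."""
    if not ratings:
        raise ValueError("Provide at least one rating")
    totals = {c: 0 for c in CRITERIA}
    for rating in ratings:
        for c in CRITERIA:
            score = rating[c]
            if not 1 <= score <= 5:
                raise ValueError(f"{c} score {score} is outside 1-5")
            totals[c] += score
    return {c: round(totals[c] / len(ratings), 2) for c in CRITERIA}


if __name__ == "__main__":
    # Made-up ratings for two prompts, for illustration only.
    example = [
        {"accuracy": 4, "clarity": 5, "completeness": 3, "speed": 5},
        {"accuracy": 5, "clarity": 4, "completeness": 4, "speed": 5},
    ]
    print(average_scores(example))
```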

Resource Usage Monitoring

REM Monitor while testing different models
REM Start model
foundry model run phi-4-mini

REM In another terminal, monitor resources
foundry service status
foundry service ps

REM Check system resources (calls PowerShell from Command Prompt)
powershell -NoProfile -Command "Get-Process | Where-Object ProcessName -Like '*foundry*' | Format-Table ProcessName, WorkingSet64, CPU"

Part 6: Next Steps

Subscribe to Model Mondays for new models and tips: https://aka.ms/model-mondays
Contribute findings to your team’s models.json
Prepare for Session 4: comparing LLMs vs SLMs, local vs cloud inference, and hands-on demos