Session 6: Foundry Local

October 28, 2025 · View on GitHub

Abstract

Treat models as composable tools inside a local AI operating layer. This session shows how to chain multiple specialized SLM/LLM calls, selectively route tasks, and expose a unified SDK surface to applications. You will build a lightweight model router + task planner, integrate it into an app script, and outline the scaling path to Azure AI Foundry for production workloads.

Learning Objectives

Conceptualize models as atomic tools with declared capabilities
Route requests based on intent / heuristic scoring
Chain outputs across multi-step tasks (decompose → solve → refine)
Integrate a unified client API for downstream applications
Scale design to cloud (same OpenAI-compatible contract)

Prerequisites

Sessions 1–5 completed
Multiple local models cached (e.g., phi-4-mini, deepseek-coder-1.3b, qwen2.5-0.5b)

Cross-Platform Environment Snippet

Windows PowerShell:

py -m venv .venv
 .\.venv\Scripts\Activate.ps1
pip install --upgrade pip
pip install foundry-local-sdk openai

macOS / Linux:

python3 -m venv .venv
source .venv/bin/activate
python -m pip install --upgrade pip
pip install foundry-local-sdk openai

Remote/VM service access from macOS:

export FOUNDRY_LOCAL_ENDPOINT=http://<windows-host>:5273/v1

Demo Flow (30 min)

1. Tool Capability Declaration (5 min)

Create samples/06-tools/models_catalog.py:

CATALOG = {
  "phi-4-mini": {
    "capabilities": ["general", "reasoning", "summarize"],
    "priority": 2
  },
  "deepseek-coder-1.3b": {
    "capabilities": ["code", "refactor", "explain_code"],
    "priority": 1
  },
  "qwen2.5-0.5b": {
    "capabilities": ["fast", "classification", "lightweight"],
    "priority": 3
  }
}

2. Intent Detection & Routing (8 min)

Create samples/06-tools/router.py:

#!/usr/bin/env python3
"""Model-as-tool router using Foundry Local OpenAI-compatible endpoint."""
from openai import OpenAI
from models_catalog import CATALOG
import re

client = OpenAI(base_url="http://localhost:5273/v1", api_key="not-needed")

INTENT_RULES = [
  (re.compile(r"code|function|refactor|bug|optimi", re.I), "code"),
  (re.compile(r"summari|abstract|tl;dr", re.I), "summarize"),
  (re.compile(r"classif|label|category", re.I), "classification"),
]

def detect_intent(prompt: str) -> str:
    for pat, intent in INTENT_RULES:
        if pat.search(prompt):
            return intent
    return "general"

def select_model(intent: str) -> str:
    # Score catalog: capability match first, then priority
    scored = []
    for name, meta in CATALOG.items():
        caps = meta["capabilities"]
        match = intent in caps
        scored.append((name, match, meta["priority"]))
    # Sort: match True first, then lowest priority value
    scored.sort(key=lambda t: (not t[1], t[2]))
    return scored[0][0]

def run(prompt: str):
    intent = detect_intent(prompt)
    model = select_model(intent)
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=400,
        temperature=0.5
    )
    return {"intent": intent, "model": model, "output": resp.choices[0].message.content}

if __name__ == "__main__":
    tests = [
        "Refactor this Python function for readability",
        "Summarize the importance of local AI governance",
        "Classify this feedback: 'The UI is slow and confusing'"
    ]
    for t in tests:
        r = run(t)
        print(f"Prompt: {t}\nModel: {r['model']} (intent={r['intent']})\nOutput: {r['output'][:160]}...\n")

3. Multi-Step Task Chaining (7 min)

Create samples/06-tools/pipeline.py:

#!/usr/bin/env python3
"""Multi-step pipeline: plan -> solve -> refine using specialized models."""
from openai import OpenAI
from router import detect_intent, select_model

client = OpenAI(base_url="http://localhost:5273/v1", api_key="not-needed")

def chat(model, content, temp=0.4):
    r = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": content}],
        max_tokens=350,
        temperature=temp
    )
    return r.choices[0].message.content

def pipeline(task: str):
    plan_model = select_model("general")
    plan = chat(plan_model, f"Break the task into 3 ordered steps. Task: {task}")
    steps = [s for s in plan.split('\n') if s.strip()][:3]
    outputs = []
    for step in steps:
        intent = detect_intent(step)
        model = select_model(intent)
        out = chat(model, step)
        outputs.append((step, model, out))
    refine_model = select_model("summarize")
    combined = '\n'.join(o[2] for o in outputs)
    refined = chat(refine_model, f"Condense results into a cohesive answer:\n{combined}")
    return {"plan": plan, "steps": outputs, "final": refined}

if __name__ == '__main__':
    result = pipeline("Generate a refactored version of a slow Python loop and summarize performance gains.")
    print("PLAN:\n", result['plan'])
    print("FINAL:\n", result['final'][:400])

4. Starter Project: Adapt `06-models-as-tools` (5 min)

Enhancements:

Add streaming token support (progressive UI update)
Add confidence scoring: lexical overlap or prompt rubric
Export trace JSON (intent → model → latency → token usage)
Implement cache reuse for repeated substeps

5. Scaling Path to Azure (5 min)

Layer	Local (Foundry)	Cloud (Azure AI Foundry)	Transition Strategy
Routing	Heuristic Python	Durable microservice	Containerize & deploy API
Models	SLMs cached	Managed deployments	Map local names to deployment IDs
Observability	CLI stats/manual	Central logging & metrics	Add structured trace events
Security	Local host only	Azure auth / networking	Introduce key vault for secrets
Cost	Device resource	Consumption billing	Add budget guardrails

Validation Checklist

foundry model run phi-4-mini
foundry model run deepseek-coder-1.3b
python samples/06-tools/router.py
python samples/06-tools/pipeline.py

Expect intent-based model selection and final refined output.

Troubleshooting

Problem	Cause	Fix
All tasks routed to same model	Weak rules	Enrich INTENT_RULES regex set
Pipeline fails mid-step	Missing model loaded	Run `foundry model run <model>`
Low output cohesion	No refine phase	Add summarization/validation pass

References

Foundry Local SDK: https://github.com/microsoft/Foundry-Local/tree/main/sdk/python
Azure AI Foundry Docs: https://learn.microsoft.com/azure/ai-foundry
Prompt Quality Patterns: See Session 2

Session Duration: 30 min
Difficulty: Expert

Sample Scenario & Workshop Mapping

Workshop Scripts / Notebooks	Scenario	Objective	Dataset / Catalog Source
`samples/session06/models_router.py` / `notebooks/session06_models_router.ipynb`	Developer assistant handling mixed intent prompts (refactor, summarize, classify)	Heuristic intent → model alias routing with token usage	Inline `CATALOG` + regex `RULES`
`samples/session06/models_pipeline.py` / `notebooks/session06_models_pipeline.ipynb`	Multi-step planning & refinement for complex coding assistance task	Decompose → specialized execution → summarization refine step	Same `CATALOG`; steps derived from plan output

Scenario Narrative

An engineering productivity tool receives heterogeneous tasks: refactor code, summarize architectural notes, classify feedback. To minimize latency & resource usage, a small general model plans and summarizes, a code‑specialized model handles refactoring, and a lightweight classification-capable model labels feedback. The pipeline script demonstrates chaining + refinement; the router script isolates adaptive single‑prompt routing.

Catalog Snapshot

CATALOG = {
    "phi-4-mini": {"capabilities": ["general", "summarize"], "priority": 2},
    "deepseek-coder-1.3b": {"capabilities": ["code", "refactor"], "priority": 1},
    "qwen2.5-0.5b": {"capabilities": ["classification", "fast"], "priority": 3}
}

Example Test Prompts

[
    "Refactor this Python function for readability",
    "Summarize the importance of small language models",
    "Classify this feedback: The UI is slow but pretty",
    "Generate a refactored version of a slow Python loop and summarize performance gains."
]

Trace Extension (Optional)

Add per-step trace JSON lines for models_pipeline.py:

trace.append({
    "step": step_idx,
    "intent": intent,
    "alias": alias,
    "latency_ms": round((end-start)*1000,2),
    "tokens": getattr(usage,'total_tokens',None)
})

Escalation Heuristic (Idea)

If plan contains keywords like "optimize", "security", or step length > 280 chars → escalate to bigger model (e.g., gpt-oss-20b) for that step only.

Optional Enhancements

Area	Enhancement	Value	Hint
Caching	Reuse manager + client objects	Lower latency, less overhead	Use `workshop_utils.get_client`
Usage Metrics	Capture tokens & per-step latency	Profiling & optimization	Time each routed call; store in trace list
Adaptive Routing	Confidence / cost aware	Better quality-cost trade-off	Add scoring: if prompt > N chars or regex matches domain → escalate to larger model
Dynamic Capability Registry	Hot reload catalog	No restart redeploy	Load `catalog.json` at runtime; watch file timestamp
Fallback Strategy	Robustness under failures	Higher availability	Try primary → on exception fallback alias
Streaming Pipeline	Early feedback	UX improvement	Stream each step and buffer final refine input
Vector Intent Embeddings	More nuanced routing	Higher intent accuracy	Embed prompt, cluster & map centroid → capability
Trace Export	Auditable chain	Compliance/reporting	Emit JSON lines: step, intent, model, latency_ms, tokens
Cost Simulation	Pre-cloud estimation	Budget planning	Assign notional cost/token per model & aggregate per task
Deterministic Mode	Repro reproducibility	Stable benchmarking	Env: `temperature=0`, fixed steps count

Trace Structure Example

trace.append({
  "step": idx,
  "intent": intent,
  "alias": alias,
  "latency_ms": round((end-start)*1000,2),
  "tokens": getattr(usage,'total_tokens',None)
})

Adaptive Escalation Sketch

if len(prompt) > 280 or 'compliance' in prompt.lower():
    # escalate to larger reasoning model if available
    alias = 'gpt-oss-20b'

Model Catalog Hot Reload

import json, time, os
CATALOG_PATH = 'catalog.json'
last_mtime = 0
def get_catalog():
    global last_mtime, CATALOG
    m = os.path.getmtime(CATALOG_PATH)
    if m != last_mtime:
        CATALOG = json.load(open(CATALOG_PATH))
        last_mtime = m
    return CATALOG