Session 6: Foundry Local

October 28, 2025

Abstract

Treat models as composable tools inside a local AI operating layer. This session shows how to chain multiple specialized SLM/LLM calls, selectively route tasks, and expose a unified SDK surface to applications. You will build a lightweight model router + task planner, integrate it into an app script, and outline the scaling path to Azure AI Foundry for production workloads.

Learning Objectives

  • Conceptualize models as atomic tools with declared capabilities
  • Route requests based on intent / heuristic scoring
  • Chain outputs across multi-step tasks (decompose → solve → refine)
  • Integrate a unified client API for downstream applications
  • Scale design to cloud (same OpenAI-compatible contract)

Prerequisites

  • Sessions 1–5 completed
  • Multiple local models cached (e.g., phi-4-mini, deepseek-coder-1.3b, qwen2.5-0.5b)

Cross-Platform Environment Snippet

Windows PowerShell:

py -m venv .venv
.\.venv\Scripts\Activate.ps1
python -m pip install --upgrade pip
pip install foundry-local-sdk openai

macOS / Linux:

python3 -m venv .venv
source .venv/bin/activate
python -m pip install --upgrade pip
pip install foundry-local-sdk openai

Remote/VM service access from macOS:

export FOUNDRY_LOCAL_ENDPOINT=http://<windows-host>:5273/v1

Demo Flow (30 min)

1. Tool Capability Declaration (5 min)

Create samples/06-tools/models_catalog.py:

CATALOG = {
  "phi-4-mini": {
    "capabilities": ["general", "reasoning", "summarize"],
    "priority": 2
  },
  "deepseek-coder-1.3b": {
    "capabilities": ["code", "refactor", "explain_code"],
    "priority": 1
  },
  "qwen2.5-0.5b": {
    "capabilities": ["fast", "classification", "lightweight"],
    "priority": 3
  }
}
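
One way to make the declared capabilities queryable is to invert the catalog into a capability-to-models index. The following is a minimal sketch, assuming the CATALOG structure above; `build_capability_index` is a hypothetical helper name introduced here for illustration and is not part of the Foundry Local SDK.

```python
from collections import defaultdict

# Copy of the catalog declared above so the sketch runs standalone.
CATALOG = {
    "phi-4-mini": {"capabilities": ["general", "reasoning", "summarize"], "priority": 2},
    "deepseek-coder-1.3b": {"capabilities": ["code", "refactor", "explain_code"], "priority": 1},
    "qwen2.5-0.5b": {"capabilities": ["fast", "classification", "lightweight"], "priority": 3},
}


def build_capability_index(catalog: dict) -> dict:
    """Map each capability to the models declaring it, lowest priority value first."""
    index = defaultdict(list)
    for name, meta in catalog.items():
        for cap in meta["capabilities"]:
            index[cap].append(name)
    for cap in index:
        index[cap].sort(key=lambda n: catalog[n]["priority"])
    return dict(index)


if __name__ == "__main__":
    for cap, models in sorted(build_capability_index(CATALOG).items()):
        print(f"{cap}: {models}")
```

An index like this makes it easy to spot capabilities that no model declares before the router ever sees a prompt.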

2. Intent Detection & Routing (8 min)

Create samples/06-tools/router.py:

#!/usr/bin/env python3
"""Model-as-tool router using Foundry Local OpenAI-compatible endpoint."""
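import os
import re

from openai import OpenAI
from models_catalog import CATALOG

# Honor FOUNDRY_LOCAL_ENDPOINT if set (e.g., a remote Windows host); otherwise use the local default.
BASE_URL = os.getenv("FOUNDRY_LOCAL_ENDPOINT", "http://localhost:5273/v1")
client = OpenAI(base_url=BASE_URL, api_key="not-needed")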

INTENT_RULES = [
  (re.compile(r"code|function|refactor|bug|optimi", re.I), "code"),
  (re.compile(r"summari|abstract|tl;dr", re.I), "summarize"),
  (re.compile(r"classif|label|category", re.I), "classification"),
]

def detect_intent(prompt: str) -> str:
    for pat, intent in INTENT_RULES:
        if pat.search(prompt):
            return intent
    return "general"

def select_model(intent: str) -> str:
    # Score catalog: capability match first, then priority
    scored = []
    for name, meta in CATALOG.items():
        caps = meta["capabilities"]
        match = intent in caps
        scored.append((name, match, meta["priority"]))
    # Sort: match True first, then lowest priority value
    scored.sort(key=lambda t: (not t[1], t[2]))
    return scored[0][0]

def run(prompt: str):
    intent = detect_intent(prompt)
    model = select_model(intent)
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=400,
        temperature=0.5
    )
    return {"intent": intent, "model": model, "output": resp.choices[0].message.content}

if __name__ == "__main__":
    tests = [
        "Refactor this Python function for readability",
        "Summarize the importance of local AI governance",
        "Classify this feedback: 'The UI is slow and confusing'"
    ]
    for t in tests:
        r = run(t)
        print(f"Prompt: {t}\nModel: {r['model']} (intent={r['intent']})\nOutput: {r['output'][:160]}...\n")
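
The routing decisions can be checked without a running service by exercising only the pure functions. The sketch below repeats the catalog and rules from this section so it runs standalone; `dry_run` is a hypothetical helper added for illustration.

```python
import re

# Local copies of the catalog and rules so this dry run needs no running service.
CATALOG = {
    "phi-4-mini": {"capabilities": ["general", "reasoning", "summarize"], "priority": 2},
    "deepseek-coder-1.3b": {"capabilities": ["code", "refactor", "explain_code"], "priority": 1},
    "qwen2.5-0.5b": {"capabilities": ["fast", "classification", "lightweight"], "priority": 3},
}

INTENT_RULES = [
    (re.compile(r"code|function|refactor|bug|optimi", re.I), "code"),
    (re.compile(r"summari|abstract|tl;dr", re.I), "summarize"),
    (re.compile(r"classif|label|category", re.I), "classification"),
]


def detect_intent(prompt: str) -> str:
    for pat, intent in INTENT_RULES:
        if pat.search(prompt):
            return intent
    return "general"


def select_model(intent: str) -> str:
    # Capability match first (False sorts before True), then lowest priority value.
    ranked = sorted(
        CATALOG.items(),
        key=lambda item: (intent not in item[1]["capabilities"], item[1]["priority"]),
    )
    return ranked[0][0]


def dry_run(prompts):
    """Return (prompt, intent, model) triples without calling any model."""
    results = []
    for prompt in prompts:
        intent = detect_intent(prompt)
        results.append((prompt, intent, select_model(intent)))
    return results


if __name__ == "__main__":
    for prompt, intent, model in dry_run([
        "Refactor this Python function for readability",
        "Summarize the importance of local AI governance",
        "Classify this feedback: 'The UI is slow and confusing'",
        "Explain how to decode base64",
    ]):
        print(f"{intent:15} {model:22} {prompt}")
```

Two behaviors are worth noting. An intent that no model declares falls through to the entry with the lowest priority value (here deepseek-coder-1.3b), so an explicit default may be preferable. The rules also match substrings, so a prompt containing "decode" is routed as code.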

3. Multi-Step Task Chaining (7 min)

Create samples/06-tools/pipeline.py:

#!/usr/bin/env python3
"""Multi-step pipeline: plan -> solve -> refine using specialized models."""
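import os

from openai import OpenAI
from router import detect_intent, select_model

# Honor FOUNDRY_LOCAL_ENDPOINT if set; otherwise use the local default.
client = OpenAI(
    base_url=os.getenv("FOUNDRY_LOCAL_ENDPOINT", "http://localhost:5273/v1"),
    api_key="not-needed",
)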

def chat(model, content, temp=0.4):
    r = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": content}],
        max_tokens=350,
        temperature=temp
    )
    return r.choices[0].message.content

def pipeline(task: str):
    plan_model = select_model("general")
    plan = chat(plan_model, f"Break the task into 3 ordered steps. Task: {task}")
    steps = [s for s in plan.split('\n') if s.strip()][:3]
    outputs = []
    for step in steps:
        intent = detect_intent(step)
        model = select_model(intent)
        out = chat(model, step)
        outputs.append((step, model, out))
    refine_model = select_model("summarize")
    combined = '\n'.join(o[2] for o in outputs)
    refined = chat(refine_model, f"Condense results into a cohesive answer:\n{combined}")
    return {"plan": plan, "steps": outputs, "final": refined}

if __name__ == '__main__':
    result = pipeline("Generate a refactored version of a slow Python loop and summarize performance gains.")
    print("PLAN:\n", result['plan'])
    print("FINAL:\n", result['final'][:400])
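
The pipeline splits the plan on newlines and keeps the first three non-empty lines. A model may prepend a preamble or number its steps, so a slightly more defensive parser can help. This is a sketch only; `parse_steps` is a hypothetical replacement for the list comprehension in `pipeline`, and the marker pattern is an assumption about typical plan formatting.

```python
import re

# Matches leading list markers such as "1.", "2)", "-", "*", or "•".
_MARKER = re.compile(r"^\s*(?:\d+[.)]|[-*•])\s+")


def parse_steps(plan: str, limit: int = 3) -> list:
    """Extract up to `limit` steps, preferring lines that start with a list marker."""
    steps = []
    for line in plan.splitlines():
        if _MARKER.match(line):
            text = _MARKER.sub("", line).strip()
            if text:
                steps.append(text)
    if not steps:
        # No markers found: fall back to every non-empty line.
        steps = [line.strip() for line in plan.splitlines() if line.strip()]
    return steps[:limit]


if __name__ == "__main__":
    sample = "Here is the plan:\n1. Profile the loop\n2) Rewrite it\n- Summarize gains"
    print(parse_steps(sample))
```

Stripping the markers also keeps digits and bullets out of the text passed to `detect_intent`.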

4. Starter Project: Adapt 06-models-as-tools (5 min)

Enhancements:

  • Add streaming token support (progressive UI update)
  • Add confidence scoring: lexical overlap or prompt rubric
  • Export trace JSON (intent → model → latency → token usage)
  • Implement cache reuse for repeated substeps
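
For the confidence-scoring enhancement, lexical overlap can be sketched in a few lines. This is a crude relevance proxy rather than a calibrated confidence, and `lexical_overlap` is a hypothetical name introduced here.

```python
import re


def _tokens(text: str) -> set:
    """Lowercased alphanumeric tokens, de-duplicated."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def lexical_overlap(prompt: str, output: str) -> float:
    """Fraction of distinct prompt tokens that reappear in the output (0.0 to 1.0)."""
    prompt_tokens = _tokens(prompt)
    if not prompt_tokens:
        return 0.0
    return len(prompt_tokens & _tokens(output)) / len(prompt_tokens)


if __name__ == "__main__":
    score = lexical_overlap("refactor the loop", "The loop was refactored")
    print(f"overlap={score:.2f}")
```

A low score could trigger a retry on a different model. The threshold is something to tune against your own prompts.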

5. Scaling Path to Azure (5 min)

| Layer | Local (Foundry) | Cloud (Azure AI Foundry) | Transition Strategy |
| --- | --- | --- | --- |
| Routing | Heuristic Python | Durable microservice | Containerize & deploy API |
| Models | SLMs cached | Managed deployments | Map local names to deployment IDs |
| Observability | CLI stats/manual | Central logging & metrics | Add structured trace events |
| Security | Local host only | Azure auth / networking | Introduce key vault for secrets |
| Cost | Device resource | Consumption billing | Add budget guardrails |
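
The "map local names to deployment IDs" strategy can be sketched as a small lookup layer in front of the router. The deployment names and the `ROUTER_TARGET` environment variable below are hypothetical placeholders, not real Azure AI Foundry identifiers; substitute the IDs from your own project.

```python
import os
from typing import Optional

# Hypothetical deployment names; replace with the IDs from your own cloud project.
DEPLOYMENT_MAP = {
    "phi-4-mini": "phi-4-mini-prod",
    "deepseek-coder-1.3b": "coder-small-prod",
    "qwen2.5-0.5b": "classifier-small-prod",
}


def resolve_model(alias: str, target: Optional[str] = None) -> str:
    """Return the local alias unchanged, or its mapped cloud deployment ID."""
    # ROUTER_TARGET is an assumed variable name: "local" (default) or "cloud".
    target = target or os.getenv("ROUTER_TARGET", "local")
    if target == "local":
        return alias
    if alias not in DEPLOYMENT_MAP:
        raise KeyError(f"No cloud deployment mapped for local model {alias!r}")
    return DEPLOYMENT_MAP[alias]


if __name__ == "__main__":
    print(resolve_model("phi-4-mini", "local"))
    print(resolve_model("phi-4-mini", "cloud"))
```

Because both sides expose the same OpenAI-compatible contract, the routing code can stay unchanged while only the model name and client configuration differ.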

Validation Checklist

foundry model run phi-4-mini
foundry model run deepseek-coder-1.3b
python samples/06-tools/router.py
python samples/06-tools/pipeline.py

Expect intent-based model selection and final refined output.

Troubleshooting

| Problem | Cause | Fix |
| --- | --- | --- |
| All tasks routed to same model | Weak rules | Enrich INTENT_RULES regex set |
| Pipeline fails mid-step | Model not loaded | Run `foundry model run <model>` |
| Low output cohesion | No refine phase | Add summarization/validation pass |

Session Duration: 30 min
Difficulty: Expert

Sample Scenario & Workshop Mapping

| Workshop Scripts / Notebooks | Scenario | Objective | Dataset / Catalog Source |
| --- | --- | --- | --- |
| samples/session06/models_router.py / notebooks/session06_models_router.ipynb | Developer assistant handling mixed intent prompts (refactor, summarize, classify) | Heuristic intent → model alias routing with token usage | Inline CATALOG + regex RULES |
| samples/session06/models_pipeline.py / notebooks/session06_models_pipeline.ipynb | Multi-step planning & refinement for complex coding assistance task | Decompose → specialized execution → summarization refine step | Same CATALOG; steps derived from plan output |

Scenario Narrative

An engineering productivity tool receives heterogeneous tasks: refactor code, summarize architectural notes, classify feedback. To minimize latency & resource usage, a small general model plans and summarizes, a code‑specialized model handles refactoring, and a lightweight classification-capable model labels feedback. The pipeline script demonstrates chaining + refinement; the router script isolates adaptive single‑prompt routing.

Catalog Snapshot

CATALOG = {
    "phi-4-mini": {"capabilities": ["general", "summarize"], "priority": 2},
    "deepseek-coder-1.3b": {"capabilities": ["code", "refactor"], "priority": 1},
    "qwen2.5-0.5b": {"capabilities": ["classification", "fast"], "priority": 3}
}

Example Test Prompts

[
    "Refactor this Python function for readability",
    "Summarize the importance of small language models",
    "Classify this feedback: The UI is slow but pretty",
    "Generate a refactored version of a slow Python loop and summarize performance gains."
]

Trace Extension (Optional)

Add per-step trace JSON lines for models_pipeline.py:

trace.append({
    "step": step_idx,
    "intent": intent,
    "alias": alias,
    "latency_ms": round((end-start)*1000,2),
    "tokens": getattr(usage,'total_tokens',None)
})
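
The fragment above assumes `start`, `end`, and `usage` already exist. One way to produce them is a small wrapper around each routed call. This is a sketch: `traced_call`, `dump_jsonl`, and `fake_call` are hypothetical names, and `fake_call` stands in for `client.chat.completions.create(...)` so the example runs without a model.

```python
import json
import time
from types import SimpleNamespace


def traced_call(trace, step_idx, intent, alias, call):
    """Run call(), append one trace record, and return the call's response."""
    start = time.perf_counter()
    response = call()
    end = time.perf_counter()
    usage = getattr(response, "usage", None)  # may be absent on some responses
    trace.append({
        "step": step_idx,
        "intent": intent,
        "alias": alias,
        "latency_ms": round((end - start) * 1000, 2),
        "tokens": getattr(usage, "total_tokens", None),
    })
    return response


def dump_jsonl(trace) -> str:
    """Serialize the trace as JSON Lines, one record per line."""
    return "\n".join(json.dumps(record) for record in trace)


def fake_call():
    """Stand-in for a chat completion; returns a response-like object with usage."""
    return SimpleNamespace(usage=SimpleNamespace(total_tokens=42))


if __name__ == "__main__":
    trace = []
    traced_call(trace, 0, "code", "deepseek-coder-1.3b", fake_call)
    print(dump_jsonl(trace))
```

In the pipeline, `call` would be a lambda that wraps the real completion request for that step.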

Escalation Heuristic (Idea)

If a plan step contains a keyword such as "optimize" or "security", or is longer than 280 characters, escalate that step alone to a larger model (e.g., gpt-oss-20b).

Optional Enhancements

| Area | Enhancement | Value | Hint |
| --- | --- | --- | --- |
| Caching | Reuse manager + client objects | Lower latency, less overhead | Use workshop_utils.get_client |
| Usage Metrics | Capture tokens & per-step latency | Profiling & optimization | Time each routed call; store in trace list |
| Adaptive Routing | Confidence / cost aware | Better quality-cost trade-off | Add scoring: if prompt > N chars or regex matches domain → escalate to larger model |
| Dynamic Capability Registry | Hot reload catalog | No restart or redeploy | Load catalog.json at runtime; watch file timestamp |
| Fallback Strategy | Robustness under failures | Higher availability | Try primary → on exception fallback alias |
| Streaming Pipeline | Early feedback | UX improvement | Stream each step and buffer final refine input |
| Vector Intent Embeddings | More nuanced routing | Higher intent accuracy | Embed prompt, cluster & map centroid → capability |
| Trace Export | Auditable chain | Compliance/reporting | Emit JSON lines: step, intent, model, latency_ms, tokens |
| Cost Simulation | Pre-cloud estimation | Budget planning | Assign notional cost/token per model & aggregate per task |
| Deterministic Mode | Reproducibility | Stable benchmarking | Env: temperature=0, fixed steps count |
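
The fallback strategy row (try primary, then fall back on exception) might look like the following. This is a sketch under stated assumptions: `call_with_fallback` is a hypothetical helper, and `flaky_call` simulates a model that is not loaded so the example runs without a service.

```python
from typing import Callable, Sequence, Tuple


def call_with_fallback(aliases: Sequence[str], call: Callable[[str], str]) -> Tuple[str, str]:
    """Try each alias in order; return (alias, output) from the first that succeeds."""
    errors = []
    for alias in aliases:
        try:
            return alias, call(alias)
        except Exception as exc:  # broad on purpose: any failure triggers the next alias
            errors.append(f"{alias}: {exc}")
    raise RuntimeError("All models failed: " + "; ".join(errors))


def flaky_call(alias: str) -> str:
    """Stand-in for a chat call; pretends the code model is not loaded."""
    if alias == "deepseek-coder-1.3b":
        raise ConnectionError("model not loaded")
    return f"ok from {alias}"


if __name__ == "__main__":
    print(call_with_fallback(["deepseek-coder-1.3b", "phi-4-mini"], flaky_call))
```

The alias order could come from the catalog ranking, so the fallback is simply the next-best match for the intent.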

Adaptive Escalation Sketch

if len(prompt) > 280 or 'compliance' in prompt.lower():
    # escalate to larger reasoning model if available
    alias = 'gpt-oss-20b'
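
The snippet above can be wrapped in a function that also checks whether the larger model is actually available. This is a sketch; `maybe_escalate`, the keyword list, and the `available` parameter are illustrative assumptions combining the keywords and the 280-character threshold mentioned in this section.

```python
ESCALATION_KEYWORDS = ("optimize", "security", "compliance")
LARGE_MODEL = "gpt-oss-20b"


def maybe_escalate(step: str, alias: str, available: set, max_chars: int = 280) -> str:
    """Return LARGE_MODEL for long or sensitive steps when it is available, else alias."""
    text = step.lower()
    needs_large = len(step) > max_chars or any(k in text for k in ESCALATION_KEYWORDS)
    if needs_large and LARGE_MODEL in available:
        return LARGE_MODEL
    return alias


if __name__ == "__main__":
    loaded = {"phi-4-mini", "gpt-oss-20b"}
    print(maybe_escalate("Review the security of the login flow", "phi-4-mini", loaded))
    print(maybe_escalate("Summarize the notes", "phi-4-mini", loaded))
```

Passing the set of loaded models keeps the router from selecting an alias that would fail at call time.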

Model Catalog Hot Reload

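import json
import os

CATALOG_PATH = 'catalog.json'
CATALOG = {}
last_mtime = 0.0

def get_catalog():
    """Return the catalog, reloading it only when the file's mtime changes."""
    global last_mtime, CATALOG
    m = os.path.getmtime(CATALOG_PATH)
    if m != last_mtime:
        # Context manager closes the file handle after parsing.
        with open(CATALOG_PATH, encoding='utf-8') as f:
            CATALOG = json.load(f)
        last_mtime = m
    return CATALOG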