Session 4: Explore Cutting-Edge Models

November 11, 2025

Abstract

Compare Large Language Models (LLMs) and Small Language Models (SLMs) for local and cloud inference. Learn deployment patterns that use ONNX Runtime acceleration, WebGPU execution, and hybrid RAG. The session includes a Chainlit RAG demo backed by a local model and an optional OpenWebUI exploration. You will adapt a WebGPU inference starter and evaluate the capability and cost/performance trade-offs of Phi versus GPT-OSS-20B.

Learning Objectives

Contrast SLMs and LLMs on latency, memory, and quality
Deploy models with ONNX Runtime and (where supported) WebGPU
Run browser-based inference (privacy-preserving interactive demo)
Integrate a Chainlit RAG pipeline with a local SLM backend
Evaluate using lightweight quality + cost heuristics

Prerequisites

Sessions 1–3 completed
chainlit installed (already in requirements.txt for Module08)
WebGPU-capable browser (Edge / Chrome latest on Windows 11)
Foundry Local running (foundry service status)

Cross-Platform Notes

Windows remains the primary target environment. For macOS developers awaiting native binaries:

Run Foundry Local in a Windows 11 VM (Parallels / UTM) OR a remote Windows workstation.
Expose the service (default port 5273) and set on macOS:

export FOUNDRY_LOCAL_ENDPOINT=http://<windows-host>:5273/v1

Use the same Python virtual environment steps as prior sessions.
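
The exported variable only helps if the session scripts actually read it. A minimal sketch, assuming scripts should fall back to the default local port when `FOUNDRY_LOCAL_ENDPOINT` is unset; `resolve_endpoint` is a hypothetical helper name, not part of the Foundry Local SDK:

```python
import os

DEFAULT_ENDPOINT = "http://localhost:5273/v1"  # default port noted above

def resolve_endpoint(env=None):
    """Return the OpenAI-compatible base URL, preferring FOUNDRY_LOCAL_ENDPOINT."""
    env = os.environ if env is None else env
    url = env.get("FOUNDRY_LOCAL_ENDPOINT", "").strip()
    # Drop a trailing slash so callers can append paths consistently.
    return (url or DEFAULT_ENDPOINT).rstrip("/")
```

A client can then be created with `OpenAI(base_url=resolve_endpoint(), api_key="not-needed")` on either platform.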

Chainlit install (both platforms):

pip install chainlit

Demo Flow (30 min)

1. Compare Phi (SLM) vs GPT-OSS-20B (LLM) (6 min)

foundry model run phi-4-mini
foundry model run gpt-oss-20b

# Quick capability probes (one-shot non-interactive)
foundry model run phi-4-mini   --prompt "Summarize retrieval augmented generation in 2 sentences."
foundry model run gpt-oss-20b --prompt "Summarize retrieval augmented generation in 2 sentences."

# Basic token / latency test (repeat a few times for intuition)
foundry model run phi-4-mini   --prompt "List 5 creative IoT edge AI ideas."
foundry model run gpt-oss-20b --prompt "List 5 creative IoT edge AI ideas."

Track: response depth, factual accuracy, stylistic richness, latency.
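
These dimensions are easier to compare when every probe is logged in the same shape. A minimal sketch, assuming the probes go through the OpenAI-compatible endpoint rather than the CLI; `compare_models` and `make_ask` are hypothetical helpers, and reply length is only a crude stand-in for response depth (accuracy and style still need a human read):

```python
import time

def make_ask(client, max_tokens=200):
    """Wrap an OpenAI-compatible client so it can be passed to compare_models."""
    def ask(model, prompt):
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            max_tokens=max_tokens,
        )
        return resp.choices[0].message.content
    return ask

def compare_models(ask, models, prompts):
    """Run every prompt against every model and return one row per pair.

    `ask(model, prompt)` is a caller-supplied function returning the reply text.
    """
    rows = []
    for model in models:
        for prompt in prompts:
            start = time.perf_counter()
            reply = ask(model, prompt)
            elapsed_ms = (time.perf_counter() - start) * 1000
            rows.append({
                "model": model,
                "prompt": prompt,
                "latency_ms": round(elapsed_ms, 1),
                "reply_chars": len(reply),  # crude proxy for response depth
                "reply": reply,             # keep the text for manual review
            })
    return rows
```

Passing `ask` in as a parameter keeps the logging logic independent of how the model is reached, so the same rows can be produced from a stub during a dry run.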

Observe throughput changes after enabling GPU vs CPU-only.
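
Throughput is easiest to compare as tokens per second. A minimal sketch, assuming the completion token count is read from the `usage.completion_tokens` field of an OpenAI-compatible response (whether the local service populates that field is an assumption to verify); `tokens_per_second` and `speedup` are hypothetical helper names:

```python
def tokens_per_second(completion_tokens, elapsed_s):
    """Generation throughput; returns 0.0 for non-positive durations."""
    if elapsed_s <= 0:
        return 0.0
    return completion_tokens / elapsed_s

def speedup(gpu_tps, cpu_tps):
    """How many times faster the GPU run was than the CPU-only run."""
    if cpu_tps <= 0:
        raise ValueError("cpu_tps must be positive")
    return gpu_tps / cpu_tps
```

For example, 200 completion tokens in 4 seconds is 50 tokens per second; against a CPU-only run at 12.5 tokens per second, that is a 4x speedup.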

3. WebGPU Inference in Browser (6 min)

Adapt starter 04-webgpu-inference (create samples/04-cutting-edge/webgpu_demo.html):

<!DOCTYPE html>
<html lang="en">
<head>
  <meta charset="UTF-8" />
  <title>Foundry Local WebGPU Demo</title>
  <style>body{font-family:system-ui;margin:2rem;max-width:820px;} textarea{width:100%;height:120px;} pre{background:#111;color:#eee;padding:1rem;} .resp{white-space:pre-wrap;margin-top:1rem;border:1px solid #444;padding:1rem;border-radius:6px;}</style>
</head>
<body>
  <h1>WebGPU Inference (Experimental)</h1>
  <p>Demonstration placeholder for a WebGPU-backed transformer (concept). Replace with actual JS runtime once exposed by Foundry Local or associated runtime libs.</p>
  <textarea id="prompt" placeholder="Enter your prompt..."></textarea>
  <button id="run">Generate</button>
  <div id="out" class="resp"></div>
  <script>
    document.getElementById('run').onclick = async () => {
      const out = document.getElementById('out');
      const p = document.getElementById('prompt').value.trim();
      if (!p) return;
      out.textContent = 'Running...';
      // Placeholder: this calls the local gateway endpoint; a real WebGPU build would call into a WASM/WebGPU pipeline instead.
      try {
        const r = await fetch('http://localhost:5273/v1/chat/completions', {
          method: 'POST', headers: { 'Content-Type': 'application/json' },
          body: JSON.stringify({
            model: 'phi-4-mini',
            messages: [ { role: 'user', content: p } ],
            max_tokens: 200, temperature: 0.5
          })
        });
        // fetch() resolves on HTTP error statuses too, so check explicitly.
        if (!r.ok) throw new Error('HTTP ' + r.status);
        const resp = await r.json();
        out.textContent = resp.choices?.[0]?.message?.content || JSON.stringify(resp, null, 2);
      } catch (e) {
        // Network failures and CORS rejections both land here.
        out.textContent = 'Error: ' + e;
      }
    };
  </script>
</body>
</html>

Open the file in a browser; observe low-latency local roundtrip.

4. Chainlit RAG Chat App (7 min)

Minimal samples/04-cutting-edge/chainlit_app.py:

#!/usr/bin/env python3
"""Chainlit RAG demo using Foundry Local SLM as backend."""
import os

import chainlit as cl
from openai import AsyncOpenAI

DOCS = [
    "Foundry Local enables local model execution with OpenAI-compatible APIs.",
    "RAG combines retrieval and generation for grounded answers.",
    "SLMs provide efficiency advantages on constrained hardware.",
]

# Honor FOUNDRY_LOCAL_ENDPOINT (see Cross-Platform Notes); fall back to the default local port.
BASE_URL = os.environ.get("FOUNDRY_LOCAL_ENDPOINT", "http://localhost:5273/v1")
# The async client keeps Chainlit's event loop responsive while the model generates.
client = AsyncOpenAI(base_url=BASE_URL, api_key="not-needed")

def build_context(query: str) -> str:
    """Return the two docs that share the most query words (naive lexical scoring)."""
    words = query.lower().split()
    scored = sorted(DOCS, key=lambda d: sum(w in d.lower() for w in words), reverse=True)
    return "\n".join(scored[:2])

@cl.on_message
async def main(message: cl.Message):
    ctx = build_context(message.content)
    resp = await client.chat.completions.create(
        model="phi-4-mini",
        messages=[
            {"role": "system", "content": "Answer using ONLY the provided context. If insufficient, say so."},
            {"role": "user", "content": f"Context:\n{ctx}\n\nQuestion: {message.content}"}
        ],
        max_tokens=300,
        temperature=0.3
    )
    await cl.Message(content=resp.choices[0].message.content).send()

Run:

chainlit run samples/04-cutting-edge/chainlit_app.py -w
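
The naive scorer in `build_context` matches substrings, so "is" counts as a hit inside "apis" while "RAG?" (with punctuation) matches nothing. A minimal sketch of a token-based variant, assuming whole-word overlap is an acceptable next step before moving to embeddings; the `tokenize` helper and the `top_k` parameter are additions for illustration, not part of the workshop sample:

```python
import re

DOCS = [
    "Foundry Local enables local model execution with OpenAI-compatible APIs.",
    "RAG combines retrieval and generation for grounded answers.",
    "SLMs provide efficiency advantages on constrained hardware.",
]

def tokenize(text):
    """Lowercase and split into alphanumeric words, dropping punctuation."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def build_context(query, docs=DOCS, top_k=2):
    """Rank docs by the number of whole words shared with the query."""
    q = tokenize(query)
    # sorted() is stable, so ties keep their original document order.
    scored = sorted(docs, key=lambda d: len(q & tokenize(d)), reverse=True)
    return "\n".join(scored[:top_k])
```

With this variant, `build_context("What is RAG?", top_k=1)` returns the RAG sentence, because "rag" now matches as a word and "is" no longer matches "APIs".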

5. Starter Project: Adapt `04-webgpu-inference` (6 min)

Deliverables:

Replace placeholder fetch logic with streaming tokens (use stream=True endpoint variant once enabled)
Add latency chart (client-side) for phi vs gpt-oss-20b toggles
Embed RAG context inline (textarea for reference docs)
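
The streaming and latency-chart deliverables share one pattern: accumulate chunks as they arrive and record when the first one lands. A minimal sketch in Python (to match the other session scripts; the browser version would follow the same shape), assuming the local service emits OpenAI-style streaming chunks once streaming is enabled. `consume_stream` is a hypothetical helper:

```python
import time

def consume_stream(open_stream, on_token=None):
    """Accumulate streamed text and record first-token and total latency.

    `open_stream` is a zero-argument callable that starts the request and
    returns an iterable of OpenAI-style chunks (objects exposing
    `choices[0].delta.content`). Timing starts before the request is opened.
    """
    start = time.perf_counter()
    first_token_ms = None
    parts = []
    for chunk in open_stream():
        if not chunk.choices:
            continue  # some servers emit chunks without choices
        piece = chunk.choices[0].delta.content
        if not piece:
            continue  # role-only or empty deltas carry no text
        if first_token_ms is None:
            first_token_ms = (time.perf_counter() - start) * 1000
        parts.append(piece)
        if on_token:
            on_token(piece)  # e.g. append to the UI incrementally
    total_ms = (time.perf_counter() - start) * 1000
    return "".join(parts), first_token_ms, total_ms
```

With the `openai` Python client this could be called as `consume_stream(lambda: client.chat.completions.create(model="phi-4-mini", messages=msgs, stream=True))`, where `client` and `msgs` are defined by the caller. Plotting `first_token_ms` and `total_ms` per model gives the data for the latency chart.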

Evaluation Heuristics

Category	Phi-4-mini	GPT-OSS-20B	Observation
Latency (cold)	Fast	Slower	SLM warms quickly
Memory	Low	High	Device feasibility
Context adherence	Good	Strong	Larger model may be more verbose
Cost (local)	Minimal	Higher (resource)	Energy/time trade-off
Best use case	Edge apps	Deep reasoning	Hybrid pipeline possible
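
The qualitative labels in this table can be backed by simple numbers. A minimal sketch, assuming answers are hand-rated on the 1–5 coherence / factuality / brevity rubric listed under Optional Enhancements; `summarize_scores` is a hypothetical helper and the ratings are supplied by the reviewer, not computed:

```python
from statistics import mean

RUBRIC = ("coherence", "factuality", "brevity")

def summarize_scores(ratings):
    """Average 1-5 rubric ratings per model.

    `ratings` maps a model name to a list of dicts, one per rated answer,
    each holding a score for every axis in RUBRIC.
    """
    summary = {}
    for model, rows in ratings.items():
        for row in rows:
            for key in RUBRIC:
                if not 1 <= row[key] <= 5:
                    raise ValueError(f"{key} must be between 1 and 5, got {row[key]}")
        per_axis = {key: mean(row[key] for row in rows) for key in RUBRIC}
        per_axis["overall"] = mean(per_axis[key] for key in RUBRIC)
        summary[model] = per_axis
    return summary
```

Rating the same fixed prompt list for both models keeps the averages comparable.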

Validating Environment

# List catalog (no --running flag; loaded models are those you have previously run)
foundry model list

# For runtime metrics use the Python benchmark script (Session 3) and OS tools (Task Manager / nvidia-smi) instead of 'model stats'
# Example:
#   cd Workshop/samples
#   set BENCH_MODELS=phi-4-mini,gpt-oss-20b
#   python -m session03.benchmark_oss_models

Troubleshooting

Symptom	Cause	Action
Web page fetch fails	CORS or service down	Use `curl` to verify endpoint; enable CORS proxy if needed
Chainlit blank	Env not active	Activate venv & reinstall deps
High latency	Model just loaded	Warm with small prompt sequence
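
The first row's `curl` check can also be scripted. A minimal sketch using only the standard library, assuming the service exposes the OpenAI-style `GET /v1/models` route (an assumption to verify against your install); `endpoint_ok` is a hypothetical helper:

```python
import json
import urllib.request

def endpoint_ok(base_url="http://localhost:5273/v1", timeout=3.0):
    """Return True if GET {base_url}/models answers 200 with valid JSON."""
    url = f"{base_url.rstrip('/')}/models"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            json.load(resp)  # raises ValueError on a non-JSON body
            return resp.status == 200
    except (OSError, ValueError):
        # Connection refused, timeouts, and HTTP errors are OSError subclasses.
        return False
```

If this returns `True` while the browser fetch still fails, the service is up and the likely cause is CORS rather than connectivity.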

References

Foundry Local SDK: https://github.com/microsoft/Foundry-Local/tree/main/sdk/python
Chainlit Docs: https://docs.chainlit.io
RAG Evaluation (Ragas): https://docs.ragas.io

Session Duration: 30 min
Difficulty: Advanced

Sample Scenario & Workshop Mapping

Workshop Artifacts	Scenario	Objective	Data / Prompt Source
`samples/session04/model_compare.py` / `notebooks/session04_model_compare.ipynb`	Architecture team evaluating SLM vs LLM for executive summary generator	Quantify latency + token usage delta	Single `COMPARE_PROMPT` env var
`chainlit_app.py` (RAG demo)	Internal knowledge assistant prototype	Ground short answers with minimal lexical retrieval	Inline `DOCS` list in file
`webgpu_demo.html`	Futuristic on‑device browser inference preview	Show low‑latency local roundtrip + UX narrative	Live user prompt only

Scenario Narrative

The product org wants an executive briefing generator. A lightweight SLM (phi‑4‑mini) drafts summaries; a larger LLM (gpt‑oss‑20b) may refine only high‑priority reports. Session scripts capture empirical latency & token metrics to justify a hybrid design, while the Chainlit demo illustrates how grounded retrieval keeps small model answers factual. The WebGPU concept page provides a vision path for fully client‑side processing when browser acceleration matures.

Minimal RAG Context (Chainlit)

DOCS = [
  "Foundry Local enables local model execution with OpenAI-compatible APIs.",
  "RAG combines retrieval and generation for grounded answers.",
  "SLMs provide efficiency advantages on constrained hardware."
]

Hybrid Draft→Refine Flow (Pseudo)

draft, _ = chat_once('phi-4-mini', messages=[{"role":"user","content":prompt}], max_tokens=280)
if len(draft) < 600:  # heuristic: escalate only for longer briefs or flagged topics
    final = draft
else:
    final, _ = chat_once('gpt-oss-20b', messages=[{"role":"user","content":f"Refine and polish:\n{draft}"}], max_tokens=220)

Track both latency components to report blended average cost.
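
A blended average needs both latency components recorded per request, including a zero for requests that were never escalated. A minimal sketch of the flow above with timing added, assuming a caller-supplied `chat(model, text)` function that returns reply text; `hybrid_generate` and `blended_average_ms` are hypothetical helpers:

```python
import time

def hybrid_generate(chat, prompt, escalate_chars=600):
    """Draft with the SLM; refine with the LLM only when the draft is long.

    Returns (final_text, timings_ms) where timings_ms has 'draft' and 'refine'.
    """
    t0 = time.perf_counter()
    draft = chat("phi-4-mini", prompt)
    timings = {"draft": (time.perf_counter() - t0) * 1000, "refine": 0.0}
    if len(draft) < escalate_chars:
        return draft, timings  # short brief: the SLM draft is final
    t1 = time.perf_counter()
    final = chat("gpt-oss-20b", f"Refine and polish:\n{draft}")
    timings["refine"] = (time.perf_counter() - t1) * 1000
    return final, timings

def blended_average_ms(all_timings):
    """Mean end-to-end latency across requests, escalated or not."""
    if not all_timings:
        return 0.0
    return sum(t["draft"] + t["refine"] for t in all_timings) / len(all_timings)
```

For instance, one 100 ms draft-only request and one request with a 100 ms draft plus a 400 ms refine blend to an average of 300 ms per request.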

Optional Enhancements

Focus	Enhancement	Why	Implementation Hint
Comparative Metrics	Track token usage + first-token latency	Holistic perf view	Use enhanced benchmark script (Session 3) with `BENCH_STREAM=1`
Hybrid Pipeline	SLM draft → LLM refine	Reduce latency & cost	Generate with phi-4-mini, refine summary w/ gpt-oss-20b
Streaming UI	Better UX in Chainlit	Incremental feedback	Use `stream=True` once local streaming is exposed; accumulate chunks
WebGPU Caching	Faster JS init	Reduce recompile overhead	Cache compiled shader artifacts (future runtime capability)
Deterministic QA Set	Fair model comparison	Remove variance	Fixed prompt list + `temperature=0` for evaluation runs
Output Scoring	Structured quality lens	Move beyond anecdotes	Simple rubric: coherence / factuality / brevity (1–5)
Energy / Resource Notes	Classroom discussion	Show trade-offs	Use OS monitors (Task Manager, `nvidia-smi`) + benchmark script outputs
Cost Emulation	Pre-cloud justification	Plan scaling	Map tokens to hypothetical cloud pricing for TCO narrative
Latency Decomposition	Identify bottlenecks	Target optimizations	Measure prompt prep, request send, first token, full completion
RAG + LLM Fallback	Quality safety net	Improve difficult queries	If SLM answer length < threshold or low confidence → escalate
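
The Cost Emulation row reduces to simple arithmetic once token counts are logged. A minimal sketch, assuming separate per-1,000-token prices for prompt and completion tokens; the prices are placeholders the presenter supplies, not real cloud pricing, and both function names are hypothetical:

```python
def emulate_cloud_cost(prompt_tokens, completion_tokens,
                       usd_per_1k_prompt, usd_per_1k_completion):
    """Map token counts to a hypothetical per-request cloud cost in USD."""
    prompt_cost = (prompt_tokens / 1000) * usd_per_1k_prompt
    completion_cost = (completion_tokens / 1000) * usd_per_1k_completion
    return prompt_cost + completion_cost

def monthly_cost(cost_per_request, requests_per_day, days=30):
    """Scale a per-request cost to a monthly figure for the TCO narrative."""
    return cost_per_request * requests_per_day * days
```

Running the same token logs through two placeholder price points (one for an SLM-class tier, one for an LLM-class tier) gives a side-by-side figure for the hybrid design discussion.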

Example Hybrid Draft/Refine Pattern

draft, _ = chat_once('phi-4-mini', messages=[{"role":"user","content":task}], max_tokens=300, temperature=0.4)
refine, _ = chat_once('gpt-oss-20b', messages=[{"role":"user","content":f"Improve clarity but keep facts:\n{draft}"}], max_tokens=220, temperature=0.3)

Latency Breakdown Sketch

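import time

# Sketch: chat_once, alias, and prompt come from the surrounding session script.
t0 = time.perf_counter()  # perf_counter is monotonic, so it suits short intervals
msgs = [{"role": "user", "content": prompt}]  # build messages
prep_ms = (time.perf_counter() - t0) * 1000
t1 = time.perf_counter()
text, _ = chat_once(alias, messages=msgs, max_tokens=180)
full_ms = (time.perf_counter() - t1) * 1000
print({"prep_ms": prep_ms, "full_gen_ms": full_ms})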

Use consistent measurement scaffolding across models for fair comparisons.
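
One way to keep the scaffolding consistent is to route every model through the same timing function with identical warmup and repeat counts. A minimal sketch, assuming the median is an acceptable summary for a handful of runs; `measure` is a hypothetical helper:

```python
import statistics
import time

def measure(fn, warmup=1, runs=5):
    """Time `fn()` using the same warmup and repeat counts for every model."""
    for _ in range(warmup):
        fn()  # discard cold-start cost such as model load
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000)
    return {
        "runs": runs,
        "median_ms": statistics.median(samples),
        "min_ms": min(samples),
        "max_ms": max(samples),
    }
```

Here `fn` would wrap a single fixed request per model (same prompt, same `max_tokens`), so differences in the result reflect the model rather than the harness.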