EdgeAI for Beginners - Workshop

November 11, 2025

Hands-On Learning Path for Building Production-Ready Edge AI Applications

Master local AI deployment with Microsoft Foundry Local, from first chat completion to multi-agent orchestration in 6 progressive sessions.


🎯 Introduction

Welcome to the EdgeAI for Beginners Workshop - your practical, hands-on guide to building intelligent applications that run entirely on local hardware. This workshop transforms theoretical Edge AI concepts into real-world skills through progressively challenging exercises using Microsoft Foundry Local and Small Language Models (SLMs).

Why This Workshop?

The Edge AI Revolution is Here

Organizations worldwide are shifting from cloud-dependent AI to edge computing for three critical reasons:

  1. Privacy & Compliance - Process sensitive data locally without cloud transmission (HIPAA, GDPR, financial regulations)
  2. Performance - Avoid network round-trips (50-500ms for local inference vs 500-2000ms for a cloud round-trip)
  3. Cost Control - Remove per-token API costs and scale without cloud expenses

But Edge AI is Different

Running AI on-premises requires new skills:

  • Model selection and optimization for resource constraints
  • Local service management and hardware acceleration
  • Prompt engineering for smaller models
  • Production deployment patterns for edge devices

This Workshop Delivers Those Skills

In 6 focused sessions (~3 hours total), you'll progress from "Hello World" to deploying production-ready multi-agent systems - all running locally on your machine.


📚 Learning Objectives

By completing this workshop, you will be able to:

Core Competencies

  1. Deploy and Manage Local AI Services

    • Install and configure Microsoft Foundry Local
    • Select appropriate models for edge deployment
    • Manage model lifecycle (download, load, cache)
    • Monitor resource usage and optimize performance
  2. Build AI-Powered Applications

    • Implement OpenAI-compatible chat completions locally
    • Design effective prompts for Small Language Models
    • Handle streaming responses for better UX
    • Integrate local models into existing applications
  3. Create RAG (Retrieval Augmented Generation) Systems

    • Build semantic search with embeddings
    • Ground LLM responses in domain-specific knowledge
    • Evaluate RAG quality with industry-standard metrics
    • Scale from prototype to production
  4. Optimize Model Performance

    • Benchmark multiple models for your use case
    • Measure latency, throughput, and first-token time
    • Select optimal models based on speed/quality tradeoffs
    • Compare SLM vs LLM trade-offs in real scenarios
  5. Orchestrate Multi-Agent Systems

    • Design specialized agents for different tasks
    • Implement agent memory and context management
    • Coordinate agents in complex workflows
    • Route requests intelligently across multiple models
  6. Deploy Production-Ready Solutions

    • Implement error handling and retry logic
    • Monitor token usage and system resources
    • Build scalable architectures with model-as-tools patterns
    • Plan migration paths from edge to hybrid (edge + cloud)

🎓 Learning Outcomes

What You'll Build

By the end of this workshop, you will have created:

| Session | Deliverable | Skills Demonstrated |
|---|---|---|
| 1 | Chat application with streaming | Service setup, basic completions, streaming UX |
| 2 | RAG system with evaluation | Embeddings, semantic search, quality metrics |
| 3 | Multi-model benchmark suite | Performance measurement, model comparison |
| 4 | SLM vs LLM comparator | Trade-off analysis, optimization strategies |
| 5 | Multi-agent orchestrator | Agent design, memory management, coordination |
| 6 | Intelligent routing system | Intent detection, model selection, scalability |

Competency Matrix

| Skill Level | Session 1-2 | Session 3-4 | Session 5-6 |
|---|---|---|---|
| Beginner | ✅ Setup & basics | ⚠️ Challenging | ❌ Too advanced |
| Intermediate | ✅ Quick review | ✅ Core learning | ⚠️ Stretch goals |
| Advanced | ✅ Breeze through | ✅ Refinement | ✅ Production patterns |

Career-Ready Skills

After this workshop, you'll be prepared to:

Build Privacy-First Applications

  • Healthcare apps handling PHI/PII locally
  • Financial services with compliance requirements
  • Government systems with data sovereignty needs

Optimize for Edge Environments

  • IoT devices with limited resources
  • Offline-first mobile applications
  • Low-latency real-time systems

Design Intelligent Architectures

  • Multi-agent systems for complex workflows
  • Hybrid edge-cloud deployments
  • Cost-optimized AI infrastructure

Lead Edge AI Initiatives

  • Evaluate Edge AI feasibility for projects
  • Select appropriate models and frameworks
  • Architect scalable local AI solutions

🗺️ Workshop Structure

Session Overview (6 Sessions × 30 Minutes = 3 Hours)

| Session | Topic | Focus | Duration |
|---|---|---|---|
| 1 | Getting Started with Foundry Local | Install, validate, first completions | 30 min |
| 2 | Building AI Solutions with RAG | Prompt engineering, embeddings, evaluation | 30 min |
| 3 | Open Source Models | Model discovery, benchmarking, selection | 30 min |
| 4 | Cutting Edge Models | SLM vs LLM, optimization, frameworks | 30 min |
| 5 | AI-Powered Agents | Agent design, orchestration, memory | 30 min |
| 6 | Models as Tools | Routing, chaining, scaling strategies | 30 min |

🚀 Quick Start

Prerequisites

System Requirements:

  • OS: Windows 10/11, macOS 11+, or Linux (Ubuntu 20.04+)
  • RAM: 8GB minimum, 16GB+ recommended
  • Storage: 10GB+ free space for models
  • CPU: Modern processor with AVX2 support
  • GPU (optional): CUDA-compatible or Qualcomm NPU for acceleration

Software Requirements:

Setup in 3 Steps

1. Install Foundry Local

Windows:

winget install Microsoft.FoundryLocal

macOS:

brew tap microsoft/foundrylocal
brew install foundrylocal

Verify Installation:

foundry --version
foundry service status

Ensure Foundry Local is running on a fixed port

# Set Foundry Local to use port 58123 (the port used in the examples below)
foundry service set --port 58123 --show

# Or use a different port
foundry service set --port 58000 --show

Verify it's working:

# Check service status
foundry service status

# Test the endpoint
curl http://127.0.0.1:58123/v1/models

Finding Available Models

To see which models are available in your Foundry Local instance, list them with the CLI or query the models endpoint:

# cmd/bash/powershell
foundry model list

Using Web Endpoint

# Windows PowerShell
powershell -Command "Invoke-RestMethod -Uri 'http://127.0.0.1:58123/v1/models' -Method Get"

# Or using curl (if available)
curl http://127.0.0.1:58123/v1/models
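The same check can be scripted. The sketch below is a minimal, standard-library-only illustration that assumes the endpoint returns the usual OpenAI-style list body (a `data` array of objects with an `id` field) and that the service listens on port 58123 as configured above. The helper names `extract_model_ids` and `list_models` are hypothetical and not part of the workshop code.

```python
import json
import urllib.request


def extract_model_ids(payload: dict) -> list[str]:
    """Pull model IDs out of an OpenAI-style /v1/models response body."""
    return [item["id"] for item in payload.get("data", []) if "id" in item]


def list_models(base_url: str = "http://127.0.0.1:58123/v1",
                timeout: float = 5.0) -> list[str]:
    """Query the models endpoint and return the model IDs it reports.

    Assumes the Foundry Local service is already running on this port.
    """
    with urllib.request.urlopen(f"{base_url}/models", timeout=timeout) as resp:
        return extract_model_ids(json.load(resp))
```

Calling `list_models()` while the service is running should return the IDs to pass as the `model` field in later requests. A connection error usually means the service is stopped or bound to a different port.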

2. Clone Repository & Install Dependencies

# Clone repository
git clone https://github.com/microsoft/edgeai-for-beginners.git
cd edgeai-for-beginners/Workshop

# Create virtual environment
python -m venv .venv

# Activate virtual environment
# Windows:
.\.venv\Scripts\activate
# macOS/Linux:
source .venv/bin/activate

# Install dependencies
pip install -r requirements.txt

3. Run Your First Sample

# Start Foundry Local and load a model
foundry model run phi-4-mini

# Run the chat bootstrap sample
cd samples
python -m session01.chat_bootstrap "What is edge AI?"

✅ Success! You should see a streaming response about edge AI.
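For orientation, here is a rough sketch of the kind of request a chat sample sends to the local service. It is not the code of `chat_bootstrap.py`. It uses only the standard library, assumes an OpenAI-compatible `/chat/completions` route under the base URL, and uses the hypothetical helper names `build_chat_request` and `chat`. The `model` value should be an ID reported by the models endpoint, which may differ from the short alias.

```python
import json
import urllib.request


def build_chat_request(model: str, prompt: str, temperature: float = 0.0) -> dict:
    """Build an OpenAI-style chat completion request body."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,
    }


def chat(base_url: str, model: str, prompt: str, timeout: float = 60.0) -> str:
    """POST one prompt to /chat/completions and return the reply text.

    Non-streaming for simplicity; assumes the service is running and the
    model is already loaded.
    """
    body = json.dumps(build_chat_request(model, prompt)).encode("utf-8")
    req = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        payload = json.load(resp)
    return payload["choices"][0]["message"]["content"]
```

A call such as `chat("http://127.0.0.1:58123/v1", model_id, "What is edge AI?")` would then return the full answer as a single string rather than streaming it.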


📦 Workshop Resources

Python Samples

Progressive hands-on examples demonstrating each concept:

| Session | Sample | Description | Run Time |
|---|---|---|---|
| 1 | chat_bootstrap.py | Basic & streaming chat | ~30s |
| 2 | rag_pipeline.py | RAG with embeddings | ~45s |
| 2 | rag_eval_ragas.py | RAG quality evaluation | ~60s |
| 3 | benchmark_oss_models.py | Multi-model benchmarking | ~2-3m |
| 4 | model_compare.py | SLM vs LLM comparison | ~45s |
| 5 | agents_orchestrator.py | Multi-agent system | ~60s |
| 6 | models_router.py | Intent-based routing | ~45s |
| 6 | models_pipeline.py | Multi-step pipeline | ~60s |

Jupyter Notebooks

Interactive exploration with explanations and visualizations:

| Session | Notebook | Description | Difficulty |
|---|---|---|---|
| 1 | session01_chat_bootstrap.ipynb | Chat basics & streaming | ⭐ Beginner |
| 2 | session02_rag_pipeline.ipynb | Build RAG system | ⭐⭐ Intermediate |
| 2 | session02_rag_eval_ragas.ipynb | Evaluate RAG quality | ⭐⭐ Intermediate |
| 3 | session03_benchmark_oss_models.ipynb | Model benchmarking | ⭐⭐ Intermediate |
| 4 | session04_model_compare.ipynb | Model comparison | ⭐⭐ Intermediate |
| 5 | session05_agents_orchestrator.ipynb | Agent orchestration | ⭐⭐⭐ Advanced |
| 6 | session06_models_router.ipynb | Intent routing | ⭐⭐⭐ Advanced |
| 6 | session06_models_pipeline.ipynb | Pipeline orchestration | ⭐⭐⭐ Advanced |

Documentation

Comprehensive guides and references:

| Document | Description | Use When |
|---|---|---|
| QUICK_START.md | Fast-track setup guide | Starting from scratch |
| QUICK_REFERENCE.md | Command & API cheat sheet | Need quick answers |
| FOUNDRY_SDK_QUICKREF.md | SDK patterns & examples | Writing code |
| ENV_CONFIGURATION.md | Environment variable guide | Configuring samples |
| notebooks/TROUBLESHOOTING.md | Common issues & fixes | Debugging problems |

🎓 Learning Path Recommendations

For Beginners (3-4 hours)

  1. ✅ Session 1: Getting Started (focus on setup and basic chat)
  2. ✅ Session 2: RAG Basics (skip evaluation initially)
  3. ✅ Session 3: Simple Benchmarking (2 models only)
  4. ⏭️ Skip Sessions 4-6 for now
  5. 🔄 Return to Sessions 4-6 after building first application

For Intermediate Developers (3 hours)

  1. ⚡ Session 1: Quick setup validation
  2. ✅ Session 2: Complete RAG pipeline with evaluation
  3. ✅ Session 3: Full benchmarking suite
  4. ✅ Session 4: Model optimization
  5. ✅ Sessions 5-6: Focus on architecture patterns

For Advanced Practitioners (2-3 hours)

  1. ⚡ Sessions 1-3: Quick review and validation
  2. ✅ Session 4: Optimization deep-dive
  3. ✅ Session 5: Multi-agent architecture
  4. ✅ Session 6: Production patterns and scaling
  5. 🚀 Extend: Build custom routing logic and hybrid deployments

Workshop Session Pack (Focused 30‑Minute Labs)

If you're following the condensed 6-session workshop format, use these dedicated guides (each maps to and complements the broader module docs above):

| Workshop Session | Guide | Core Focus |
|---|---|---|
| 1 | Session01-GettingStartedFoundryLocal | Install, validate, run phi & GPT-OSS-20B, acceleration |
| 2 | Session02-BuildAISolutionsRAG | Prompt engineering, RAG patterns, CSV & document grounding, migration |
| 3 | Session03-OpenSourceModels | Hugging Face integration, benchmarking, model selection |
| 4 | Session04-CuttingEdgeModels | SLM vs LLM, WebGPU, Chainlit RAG, ONNX acceleration |
| 5 | Session05-AIPoweredAgents | Agent roles, memory, tools, orchestration |
| 6 | Session06-ModelsAsTools | Routing, chaining, scaling path to Azure |

Each session file includes: abstract, learning objectives, 30‑minute demo flow, starter project, validation checklist, troubleshooting, and references to the official Foundry Local Python SDK.

Sample Scripts

Install workshop dependencies (Windows):

cd Workshop
py -m venv .venv
.\.venv\Scripts\activate
pip install -r requirements.txt

macOS / Linux:

cd Workshop
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

If the Foundry Local service runs on a different machine or VM (for example, a Windows host accessed from macOS), export its endpoint, using the port that service is actually configured to listen on:

export FOUNDRY_LOCAL_ENDPOINT=http://<windows-host>:5273/v1

| Session | Script(s) | Description |
|---|---|---|
| 1 | samples/session01/chat_bootstrap.py | Bootstrap service & streaming chat |
| 2 | samples/session02/rag_pipeline.py | Minimal RAG (in-memory embeddings) |
| 2 | samples/session02/rag_eval_ragas.py | RAG evaluation with ragas metrics |
| 3 | samples/session03/benchmark_oss_models.py | Multi-model latency & throughput benchmarking |
| 4 | samples/session04/model_compare.py | SLM vs LLM comparison (latency & sample output) |
| 5 | samples/session05/agents_orchestrator.py | Two‑agent research → editorial pipeline |
| 6 | samples/session06/models_router.py | Intent-based routing demo |
| 6 | samples/session06/models_pipeline.py | Multi-step plan/execute/refine chain |

Environment Variables (Common Across Samples)

| Variable | Purpose | Example |
|---|---|---|
| FOUNDRY_LOCAL_ALIAS | Default single model alias for basic samples | phi-4-mini |
| SLM_ALIAS / LLM_ALIAS | Explicit SLM vs larger model for comparison | phi-4-mini / gpt-oss-20b |
| BENCH_MODELS | Comma list of aliases to benchmark | qwen2.5-0.5b,mistral-7b |
| BENCH_ROUNDS | Benchmark repetitions per model | 3 |
| BENCH_PROMPT | Prompt used in benchmarking | Explain retrieval augmented generation briefly. |
| EMBED_MODEL | Sentence-transformers embedding model | sentence-transformers/all-MiniLM-L6-v2 |
| RAG_QUESTION | Override test query for RAG pipeline | Why use RAG with local inference? |
| AGENT_QUESTION | Override agents pipeline query | Explain why edge AI matters for compliance. |
| AGENT_MODEL_PRIMARY | Model alias for research agent | phi-4-mini |
| AGENT_MODEL_EDITOR | Model alias for editor agent (can differ) | gpt-oss-20b |
| SHOW_USAGE | When 1, prints token usage per completion | 1 |
| RETRY_ON_FAIL | When 1, retry once on transient chat errors | 1 |
| RETRY_BACKOFF | Seconds to wait before retry | 1.0 |

If a variable isn’t set, scripts fall back to sensible defaults. For single‑model demos you typically only need FOUNDRY_LOCAL_ALIAS.
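As an illustration of the fallback behaviour, the sketch below reads a few of these variables with defaults. The default values are assumptions copied from the Example column, not necessarily the ones the sample scripts use, and `load_config` is a hypothetical helper name.

```python
import os
from typing import Mapping


def load_config(env: Mapping[str, str] = os.environ) -> dict:
    """Read workshop settings from the environment, with assumed fallbacks."""
    bench_models = env.get("BENCH_MODELS", "qwen2.5-0.5b,mistral-7b")
    return {
        "alias": env.get("FOUNDRY_LOCAL_ALIAS", "phi-4-mini"),
        # Comma-separated list; tolerate stray spaces and trailing commas.
        "bench_models": [m.strip() for m in bench_models.split(",") if m.strip()],
        "bench_rounds": int(env.get("BENCH_ROUNDS", "3")),
        # Flags are enabled only by the literal value "1".
        "show_usage": env.get("SHOW_USAGE", "0") == "1",
        "retry_on_fail": env.get("RETRY_ON_FAIL", "0") == "1",
        "retry_backoff": float(env.get("RETRY_BACKOFF", "1.0")),
    }
```

Passing a plain dict instead of `os.environ` makes the parsing easy to check without touching the real environment.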

Utility Module

All samples share a helper module, samples/workshop_utils.py, which provides:

  • Cached FoundryLocalManager + OpenAI client creation
  • chat_once() helper with optional retry + usage printing
  • Simple token usage reporting (enable via SHOW_USAGE=1)

This reduces duplication and highlights best practices for efficient local model orchestration.
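The retry-once behaviour described for chat_once() can be pictured roughly as follows. This is a generic sketch of the pattern, not the contents of workshop_utils.py. The name `call_with_retry` and its parameters are hypothetical, chosen to mirror RETRY_ON_FAIL and RETRY_BACKOFF.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retry(fn: Callable[[], T],
                    retry_on_fail: bool = True,
                    backoff: float = 1.0,
                    sleep: Callable[[float], None] = time.sleep) -> T:
    """Call fn(); on an exception, optionally wait and try exactly once more.

    A second failure propagates to the caller unchanged.
    """
    try:
        return fn()
    except Exception:
        if not retry_on_fail:
            raise
        sleep(backoff)  # give a cold-starting model time to finish loading
        return fn()
```

Wrapping the completion call in a zero-argument function, for example `call_with_retry(lambda: client.chat.completions.create(...))`, keeps the retry policy separate from the request itself.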

Optional Enhancements (Cross-Session)

| Theme | Enhancement | Sessions | Env / Toggle |
|---|---|---|---|
| Determinism | Fixed temperature + stable prompt sets | 1–6 | Set temperature=0, top_p=1 |
| Token Usage Visibility | Consistent cost/efficiency teaching | 1–6 | SHOW_USAGE=1 |
| Streaming First Token | Perceived latency metric | 1,3,4,6 | BENCH_STREAM=1 (benchmark) |
| Retry Resilience | Handles transient cold-start | All | RETRY_ON_FAIL=1 + RETRY_BACKOFF |
| Multi-Model Agents | Heterogeneous role specialization | 5 | AGENT_MODEL_PRIMARY, AGENT_MODEL_EDITOR |
| Adaptive Routing | Intent + cost heuristics | 6 | Extend router with escalation logic |
| Vector Memory | Long-term semantic recall | 2,5,6 | Integrate FAISS/Chroma embedding index |
| Trace Export | Auditing & evaluation | 2,5,6 | Append JSON lines per step |
| Quality Rubrics | Qualitative tracking | 3–6 | Secondary scoring prompts |
| Smoke Tests | Quick pre-workshop validation | All | python Workshop/tests/smoke.py |
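As one example, the Trace Export enhancement ("append JSON lines per step") could be sketched as below. The record layout (`ts`, `step`, `data`) and the function names are assumptions made for illustration. The workshop does not prescribe a trace format.

```python
import json
import time
from pathlib import Path


def append_trace(path, step: str, payload: dict) -> None:
    """Append one JSON object per line describing a pipeline step."""
    record = {"ts": time.time(), "step": step, "data": payload}
    with Path(path).open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_trace(path) -> list[dict]:
    """Load every record back in order, skipping blank lines."""
    with Path(path).open(encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]
```

Because each step is a separate line, a trace from a RAG or agent run can be appended to incrementally and later diffed or filtered with ordinary text tools.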

Deterministic Quick Start

set FOUNDRY_LOCAL_ALIAS=phi-4-mini
set SHOW_USAGE=1
python Workshop\tests\smoke.py

Expect stable token counts across repeated identical inputs.

RAG Evaluation (Session 2)

Use rag_eval_ragas.py to compute answer relevancy, faithfulness, and context precision on a tiny synthetic dataset:

cd Workshop/samples
python -m session02.rag_eval_ragas

Extend by supplying a larger JSONL of questions, contexts, and ground truths, then converting to a Hugging Face Dataset.
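A minimal sketch of that extension step is shown below. The field names `question`, `contexts`, and `ground_truth` are assumptions, since the column names ragas expects vary between versions, and `jsonl_to_columns` is a hypothetical helper. The loader itself needs only the standard library. The final conversion is left as a comment because it requires the `datasets` package.

```python
import json


def jsonl_to_columns(path: str) -> dict:
    """Read a JSONL file into a column-oriented dict (one list per field)."""
    columns = {"question": [], "contexts": [], "ground_truth": []}
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            if not line.strip():
                continue  # tolerate blank lines between records
            row = json.loads(line)
            columns["question"].append(row["question"])
            columns["contexts"].append(list(row["contexts"]))
            columns["ground_truth"].append(row["ground_truth"])
    return columns


# With the `datasets` package installed, the conversion step would be:
#   from datasets import Dataset
#   ds = Dataset.from_dict(jsonl_to_columns("eval.jsonl"))
```

The generated answers would still need to be added as another column by running each question through the RAG pipeline before scoring.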

CLI Command Accuracy Appendix

The workshop deliberately uses only currently documented / stable Foundry Local CLI commands.

Stable Commands Referenced

| Category | Command | Purpose |
|---|---|---|
| Core | foundry --version | Show installed version |
| Service | foundry service start | Start local service (if not auto) |
| Service | foundry service status | Show service status |
| Models | foundry model list | List catalog / available models |
| Models | foundry model download <alias> | Download model weights into cache |
| Models | foundry model run <alias> | Launch (load) a model locally; combine with --prompt for one‑shot |
| Models | foundry model unload <alias> / foundry model stop <alias> | Unload a model from memory (if supported) |
| Cache | foundry cache list | List cached (downloaded) models |

One‑Shot Prompt Pattern

Instead of the deprecated foundry model chat subcommand, use:

foundry model run <alias> --prompt "Your question here"

This executes a single prompt/response cycle then exits.

Removed / Avoided Patterns

| Deprecated / Undocumented | Replacement / Guidance |
|---|---|
| foundry model chat <model> "..." | foundry model run <model> --prompt "..." |
| foundry model list --running | Use plain foundry model list + recent activity / logs |
| foundry model list --cached | foundry cache list |
| foundry model stats <model> | Use benchmark Python script + OS tools (Task Manager / nvidia-smi) |
| foundry model benchmark ... | samples/session03/benchmark_oss_models.py |

Benchmarking & Telemetry

  • Latency, p95, tokens/sec: samples/session03/benchmark_oss_models.py
  • First‑token latency (streaming): set BENCH_STREAM=1
  • Resource usage: OS monitors (Task Manager, Activity Monitor, nvidia-smi).

As new CLI telemetry commands stabilize upstream, they can be incorporated with minimal edits to session markdowns.
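To make the reported metrics concrete, the sketch below shows one way mean latency, p95, and tokens/sec could be derived from per-round measurements. It is not the benchmark script's implementation. The nearest-rank percentile definition and the function names are choices made here for illustration.

```python
import math
import statistics


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of
    all samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(latencies_s: list[float], completion_tokens: list[int]) -> dict:
    """Aggregate per-round latencies and token counts into headline metrics."""
    total_time = sum(latencies_s)
    return {
        "mean_s": statistics.mean(latencies_s),
        "p95_s": percentile(latencies_s, 95),
        "tokens_per_s": sum(completion_tokens) / total_time if total_time else 0.0,
    }
```

With only a few rounds per model (BENCH_ROUNDS=3, for instance), the nearest-rank p95 is simply the slowest round, so more rounds are needed before p95 says anything beyond the maximum.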

Automated Lint Guard

An automated linter prevents reintroduction of deprecated CLI patterns inside fenced code blocks of markdown files:

Script: Workshop/scripts/lint_markdown_cli.py

Deprecated patterns are blocked inside code fences.

Recommended replacements:

| Deprecated | Replacement |
|---|---|
| foundry model chat <a> "..." | foundry model run <a> --prompt "..." |
| model list --running | model list |
| model list --cached | cache list |
| model stats | Benchmark script + system tools |
| model benchmark | samples/session03/benchmark_oss_models.py |
| model list --available | model list |

Run locally:

python Workshop\scripts\lint_markdown_cli.py --verbose

GitHub Action: .github/workflows/markdown-cli-lint.yml runs on every push & PR.

Optional pre-commit hook:

printf '#!/bin/sh\npython Workshop/scripts/lint_markdown_cli.py\n' > .git/hooks/pre-commit
chmod +x .git/hooks/pre-commit
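The core idea of such a lint guard, scanning only fenced code blocks for deprecated commands, can be sketched as follows. This is an illustrative approximation, not the contents of lint_markdown_cli.py. The regular expressions are derived from the deprecated patterns listed above, and `find_deprecated` is a hypothetical name.

```python
import re

FENCE = "`" * 3  # the markdown code-fence marker

# Patterns derived from the deprecated commands listed in this section.
DEPRECATED = [
    r"foundry\s+model\s+chat\b",
    r"model\s+list\s+--(running|cached|available)\b",
    r"foundry\s+model\s+stats\b",
    r"foundry\s+model\s+benchmark\b",
]


def find_deprecated(markdown: str) -> list[tuple[int, str]]:
    """Return (line_number, line) for deprecated commands inside code fences."""
    hits = []
    in_fence = False
    for number, line in enumerate(markdown.splitlines(), start=1):
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence  # opening or closing fence line
            continue
        if in_fence and any(re.search(p, line) for p in DEPRECATED):
            hits.append((number, line.strip()))
    return hits
```

Prose outside fences is deliberately ignored, which is what allows documents like this one to name the deprecated commands in tables without tripping the check.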

Quick CLI → SDK Migration Table

| Task | CLI One-Liner | SDK (Python) Equivalent | Notes |
|---|---|---|---|
| Run a model once (prompt) | foundry model run phi-4-mini --prompt "Hello" | manager=FoundryLocalManager("phi-4-mini"); client=OpenAI(base_url=manager.endpoint, api_key=manager.api_key or "not-needed"); client.chat.completions.create(model=manager.get_model_info("phi-4-mini").id, messages=[{"role":"user","content":"Hello"}]) | SDK bootstraps service & caching automatically |
| Download (cache) model | foundry model download qwen2.5-0.5b | FoundryLocalManager("qwen2.5-0.5b") # triggers download/load | Manager picks best variant if alias maps to multiple builds |
| List catalog | foundry model list | # use manager for each alias or maintain known list | CLI aggregates; SDK currently per-alias instantiation |
| List cached models | foundry cache list | manager.list_cached_models() | After manager init (any alias) |
| Get endpoint URL | (implicit) | manager.endpoint | Used to create OpenAI-compatible client |
| Warm a model | foundry model run <alias> then first prompt | chat_once(alias, messages=[...]) (utility) | Utilities handle initial cold latency warmup |
| Measure latency | python -m session03.benchmark_oss_models | import benchmark_oss_models (or new exporter script) | Prefer script for consistent metrics |
| Stop / unload model | foundry model unload <alias> | (Not exposed – restart service / process) | Typically not required for workshop flow |
| Retrieve token usage | (view output) | resp.usage.total_tokens | Provided if backend returns usage object |

Benchmark Markdown Export

Use the script Workshop/scripts/export_benchmark_markdown.py to run a fresh benchmark (same logic as samples/session03/benchmark_oss_models.py) and emit a GitHub-friendly Markdown table plus raw JSON.

Example

python Workshop\scripts\export_benchmark_markdown.py --models "qwen2.5-0.5b,mistral-7b" --prompt "Explain retrieval augmented generation briefly." --rounds 3 --output benchmark_report.md

Generated files:

| File | Contents |
|---|---|
| benchmark_report.md | Markdown table + interpretation hints |
| benchmark_report.json | Raw metrics array (for diffing / trend tracking) |

Set BENCH_STREAM=1 in the environment to include first-token latency if supported.
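First-token latency is the time from sending the request until the first non-empty chunk of a streamed response arrives. The sketch below measures it over any iterable of text chunks. It illustrates the metric, is not the benchmark script's code, and assumes the caller has already reduced streaming events to plain strings. The name `time_stream` is hypothetical.

```python
import time
from typing import Callable, Iterable, Optional


def time_stream(chunks: Iterable[str],
                clock: Callable[[], float] = time.perf_counter) -> dict:
    """Consume a stream of text chunks, recording first-token and total latency."""
    start = clock()
    first: Optional[float] = None
    parts = []
    for chunk in chunks:
        if first is None and chunk:
            # Empty chunks (e.g. role-only deltas) do not count as a token.
            first = clock() - start
        parts.append(chunk)
    return {
        "first_token_s": first,
        "total_s": clock() - start,
        "text": "".join(parts),
    }
```

The timer starts when `time_stream` is called, so the stream should be created lazily or immediately beforehand. Otherwise request setup time is excluded from the measurement.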