Kubeflow SDK

March 25, 2026

Latest News 🔥

Overview

The Kubeflow SDK is a set of unified Pythonic APIs that let you run any AI workload at any scale, without the need to learn Kubernetes. It provides simple and consistent APIs across the Kubeflow ecosystem, enabling users to focus on building AI applications rather than managing complex infrastructure.

Kubeflow SDK Benefits

  • Unified Experience: Single SDK to interact with multiple Kubeflow projects through consistent Python APIs
  • Simplified AI Workloads: Abstract away Kubernetes complexity and work effortlessly across all Kubeflow projects using familiar Python APIs
  • Built for Scale: Seamlessly scale any AI workload, from a local laptop to a large-scale production cluster with thousands of GPUs, using the same APIs
  • Rapid Iteration: Reduced friction between development and production environments
  • Local Development: First-class support for local development without a Kubernetes cluster, requiring only a pip installation

Kubeflow SDK Introduction

The following KubeCon + CloudNativeCon 2025 talk provides an overview of Kubeflow SDK:

Kubeflow SDK

Additionally, check out these demos to deep dive into Kubeflow SDK capabilities:

Get Started

Install Kubeflow SDK

pip install -U kubeflow

Run your first PyTorch distributed job

from kubeflow.trainer import TrainerClient, CustomTrainer, TrainJobTemplate

def get_torch_dist(learning_rate: str, num_epochs: str):
    import os
    import torch
    import torch.distributed as dist

    dist.init_process_group(backend="gloo")
    print("PyTorch Distributed Environment")
    print(f"WORLD_SIZE: {dist.get_world_size()}")
    print(f"RANK: {dist.get_rank()}")
    print(f"LOCAL_RANK: {os.environ['LOCAL_RANK']}")

    lr = float(learning_rate)
    epochs = int(num_epochs)
    loss = 1.0 - (lr * 2) - (epochs * 0.01)

    if dist.get_rank() == 0:
        print(f"loss={loss}")

# Create the TrainJob template
template = TrainJobTemplate(
    runtime="torch-distributed",
    trainer=CustomTrainer(
        func=get_torch_dist,
        func_args={"learning_rate": "0.01", "num_epochs": "5"},
        num_nodes=3,
        resources_per_node={"cpu": 2},
    ),
)

# Create the TrainJob
job_id = TrainerClient().train(**template)

# Wait for TrainJob to complete
TrainerClient().wait_for_job_status(job_id)

# Print TrainJob logs
print("\n".join(TrainerClient().get_job_logs(name=job_id)))
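As a quick sanity check on what rank 0 prints, the placeholder loss formula from get_torch_dist can be evaluated locally without a cluster. This is a minimal sketch that only reproduces the arithmetic; the helper name toy_loss is an assumption made for illustration and is not part of the SDK.

```python
def toy_loss(learning_rate: str, num_epochs: str) -> float:
    """Reproduce the placeholder loss computed inside get_torch_dist.

    Hypothetical helper for illustration, not part of the Kubeflow SDK.
    """
    # func_args are passed as strings in the example, so convert them first
    lr = float(learning_rate)
    epochs = int(num_epochs)
    return 1.0 - (lr * 2) - (epochs * 0.01)


# Same arguments as func_args in the template above
loss = toy_loss("0.01", "5")
print(f"loss={loss:.2f}")  # 1.0 - 0.02 - 0.05, so this prints loss=0.93
```

The value is rounded for display here because the unrounded float may carry a tiny binary representation error.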

Optimize hyperparameters for your training

from kubeflow.optimizer import OptimizerClient, Search, TrialConfig

# Create OptimizationJob with the same template
optimization_id = OptimizerClient().optimize(
    trial_template=template,
    trial_config=TrialConfig(num_trials=10, parallel_trials=2),
    search_space={
        "learning_rate": Search.loguniform(0.001, 0.1),
        "num_epochs": Search.choice([5, 10, 15]),
    },
)

print(f"OptimizationJob created: {optimization_id}")
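To make the search space above more concrete, the sketch below draws trial parameters by hand. It assumes that loguniform means sampling uniformly in log space between the two bounds and that choice picks one of the listed values. It is a plain random search over the placeholder loss from the training example, written with the standard library only; the function names are hypothetical, and it does not reflect how OptimizerClient or Katib actually select trials.

```python
import math
import random


def sample_loguniform(rng: random.Random, low: float, high: float) -> float:
    # Uniform in log space, so each decade (0.001-0.01, 0.01-0.1) is equally likely
    return math.exp(rng.uniform(math.log(low), math.log(high)))


def toy_loss(lr: float, epochs: int) -> float:
    # Same placeholder formula as get_torch_dist in the training example
    return 1.0 - (lr * 2) - (epochs * 0.01)


def random_search(num_trials: int, seed: int = 0) -> list[dict]:
    rng = random.Random(seed)  # seeded so repeated runs draw the same trials
    trials = []
    for _ in range(num_trials):
        lr = sample_loguniform(rng, 0.001, 0.1)
        epochs = rng.choice([5, 10, 15])
        trials.append(
            {"learning_rate": lr, "num_epochs": epochs, "loss": toy_loss(lr, epochs)}
        )
    return trials


trials = random_search(num_trials=10)
best = min(trials, key=lambda t: t["loss"])
print(f"best trial: {best}")
```

With this placeholder objective the lowest loss always comes from the largest learning rate and epoch count, so the sketch is only useful for seeing how the search space is sampled.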

Run data processing with Spark Connect

Install Kubeflow Spark support:

pip install 'kubeflow[spark]'

To install the Spark Operator, see the installation guide.

from kubeflow.spark import KubernetesBackendConfig, SparkClient

client = SparkClient(KubernetesBackendConfig(namespace="spark-test"))
spark = client.connect()

df = spark.range(5)
df.show()

You should see the DataFrame:

+---+
| id|
+---+
|  0|
|  1|
|  2|
|  3|
|  4|
+---+

You can also configure the number of executors and their resources:

spark = client.connect(
    num_executors=5,
    resources_per_executor={"cpu": "5", "memory": "1Gi"},
)

df = spark.range(5)
df.show()
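Because resources_per_executor applies to each executor, the total request grows with num_executors. The sketch below works out the aggregate for the example above. total_request is a hypothetical helper, not an SDK function, and it assumes whole-number CPU strings and memory values with a Gi or Mi suffix. It counts executors only and ignores whatever the Spark driver requests.

```python
def total_request(num_executors: int, resources_per_executor: dict[str, str]) -> dict[str, str]:
    """Multiply per-executor resources by the executor count (hypothetical helper)."""
    cpu = int(resources_per_executor["cpu"]) * num_executors

    memory = resources_per_executor["memory"]
    # Split a value such as "1Gi" into the number and its two-letter binary suffix
    amount, unit = int(memory[:-2]), memory[-2:]
    if unit not in ("Gi", "Mi"):
        raise ValueError(f"unsupported memory unit: {unit}")

    return {"cpu": str(cpu), "memory": f"{amount * num_executors}{unit}"}


print(total_request(5, {"cpu": "5", "memory": "1Gi"}))
# {'cpu': '25', 'memory': '5Gi'}
```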

Manage models with Model Registry

Install Model Registry support:

pip install 'kubeflow[hub]'

To install the Model Registry server, see the installation guide.

from kubeflow.hub import ModelRegistryClient

client = ModelRegistryClient("https://model-registry.kubeflow.svc.cluster.local", author="Your Name")

# Register a model
model = client.register_model(
    name="my-model",
    uri="s3://bucket/path/to/model",
    version="v1.0.0",
    model_format_name="pytorch",
    model_format_version="2.0",
    version_description="My trained model"
)

# Get a registered model
model = client.get_model("my-model")

# List all models
for model in client.list_models():
    print(f"Model: {model.name}")

# List model versions
for version in client.list_model_versions("my-model"):
    print(f"Version: {version.name}")

You can also initialize the client using different port configurations:

ModelRegistryClient("https://example.org", port=456)  # Explicit port argument
ModelRegistryClient("https://example.org:456")        # Port parsed from base_url
ModelRegistryClient("https://example.org")            # Default port (443 for https, 8080 for http)
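The precedence implied by the three calls above can be sketched with the standard library: an explicit port argument wins, then a port embedded in the URL, then a default based on the scheme. resolve_port is a hypothetical function written for illustration; the real client may resolve ports differently, and the default values are taken from the comment in the example above.

```python
from typing import Optional
from urllib.parse import urlsplit

# Defaults as stated in the example comments above
DEFAULT_PORTS = {"https": 443, "http": 8080}


def resolve_port(base_url: str, port: Optional[int] = None) -> int:
    """Pick a port: explicit argument, then URL, then scheme default."""
    if port is not None:
        return port
    parts = urlsplit(base_url)
    if parts.port is not None:
        return parts.port
    return DEFAULT_PORTS[parts.scheme]


print(resolve_port("https://example.org", port=456))  # 456
print(resolve_port("https://example.org:456"))        # 456
print(resolve_port("https://example.org"))            # 443
```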

Local Development

The Kubeflow Trainer client supports local development without a Kubernetes cluster.

Available Backends

  • KubernetesBackend (default) - Production training on Kubernetes
  • ContainerBackend - Local development with Docker/Podman isolation
  • LocalProcessBackend - Quick prototyping with Python subprocesses

Quick Start: Install container support with pip install 'kubeflow[docker]' or pip install 'kubeflow[podman]' (the quotes keep shells such as zsh from expanding the brackets).

from kubeflow.trainer import TrainerClient, ContainerBackendConfig, CustomTrainer

# Switch to local container execution
client = TrainerClient(backend_config=ContainerBackendConfig())

# Your training runs locally in isolated containers.
# train_fn is a placeholder for your own training function.
job_id = client.train(trainer=CustomTrainer(func=train_fn))

Supported Kubeflow Projects

| Project | Status | Version Support | Description |
|---|---|---|---|
| Kubeflow Trainer | ✅ Available | v2.0.0+ | Train and fine-tune AI models with various frameworks |
| Kubeflow Katib | ✅ Available | v0.19.0+ | Hyperparameter optimization |
| Kubeflow Model Registry | ✅ Available | v0.3.0+ | Manage model artifacts, versions and ML artifacts metadata |
| Kubeflow Spark Operator | ✅ Available | v2.5.0+ | Manage Spark applications for data processing and feature engineering |
| Kubeflow Pipelines | 🚧 Planned | TBD | Build, run, and track AI workflows |
| Feast | 🚧 Planned | TBD | Feature store for machine learning |
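When checking a cluster against the minimum versions listed above, compare versions numerically rather than as strings. The sketch below is a hypothetical helper, not part of the SDK, and it assumes plain vX.Y.Z version strings without pre-release suffixes.

```python
# Minimum versions copied from the support table above
MINIMUM_VERSIONS = {
    "Kubeflow Trainer": "v2.0.0",
    "Kubeflow Katib": "v0.19.0",
    "Kubeflow Model Registry": "v0.3.0",
    "Kubeflow Spark Operator": "v2.5.0",
}


def parse_version(version: str) -> tuple[int, ...]:
    # "v2.5.0" -> (2, 5, 0); tuples compare element by element
    return tuple(int(part) for part in version.lstrip("v").split("."))


def is_supported(project: str, installed: str) -> bool:
    return parse_version(installed) >= parse_version(MINIMUM_VERSIONS[project])


print(is_supported("Kubeflow Katib", "v0.19.0"))  # True
print(is_supported("Kubeflow Katib", "v0.9.2"))   # False, since 9 < 19
```

A plain string comparison would rank "v0.9.2" above "v0.19.0", which is why the versions are parsed into integer tuples first.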

Community

Getting Involved

Contributing

Kubeflow SDK is a community project and is still under active development. We welcome contributions! Please see our CONTRIBUTING Guide for details.

Documentation

✨ Contributors

We couldn't have done it without these incredible people: