temporal-etcd-dynconfig

May 26, 2026

OSS Temporal Server ships with a file-based dynamic config client. It works, but it has real operational limits: you edit a YAML file, wait up to 10 seconds for the poll interval, and repeat that edit on every server host. In a multi-host or multi-cluster deployment this becomes error-prone — hosts can diverge silently, passive clusters drift from active ones, and there is no audit trail for what changed when.

This library replaces that client with one backed by etcd. All Temporal server hosts watch the same etcd prefix and receive config changes through etcd's watch API, so there is no polling, no per-host file management, and no drift between hosts. A single etcdctl put (or a call to WriteConfig) reaches every host watching that prefix, typically within milliseconds.

It implements both dynamicconfig.Client and dynamicconfig.NotifyingClient, so Temporal uses push-based updates rather than polling. It is a drop-in replacement: wire it in at server startup, point it at your etcd cluster, and the rest of your server code is unchanged.

When to use this:

You run multiple Temporal server hosts and want config changes applied simultaneously across all of them
You run active/passive multi-cluster Temporal and want a single source of truth for dynamic config
You run multiple environments (prod, staging, dev) and want prefix-isolated config on a shared etcd cluster
You want an audit log of every config change with old and new values

How it works
Prerequisites
Repository structure
Installation
Configuration
Usage
Storing dynamic config values in etcd
Startup behaviour
Shutdown
Connection resilience
Metrics
Differences from the OSS file-based client
Active/passive multi-cluster setup
Local etcd for development
Multi-environment setup
Production notes

How it works

On startup, bulk-loads all keys under globalKeyPrefix from etcd into an in-memory map
Opens an etcd Watch stream on the prefix — changes propagate immediately
Implements both dynamicconfig.Client and dynamicconfig.NotifyingClient, so Temporal uses push-based updates instead of polling
The watch supervisor handles etcd compaction, leader election, and connection resets transparently: on any disruption it reloads all values and opens a fresh watch stream
Every key change is logged at INFO with old and new values diffed
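The load-then-watch flow above can be pictured with a small stdlib-only model. This is a sketch under assumptions, not the library's code: the `event` and `store` types and the string-valued map are hypothetical stand-ins for etcd watch events and the parsed config map, and a single writer goroutine is assumed. It illustrates the copy-on-write idea: build a new map, then publish it with one atomic pointer swap so readers never observe a half-applied update.

```go
package main

import (
	"fmt"
	"strings"
	"sync/atomic"
)

// event is a hypothetical stand-in for an etcd watch event.
type event struct {
	Key     string // full etcd key, including the prefix
	Value   string // raw value; ignored for deletes
	Deleted bool
}

// store publishes immutable snapshots of the config map.
type store struct {
	prefix string
	snap   atomic.Pointer[map[string]string]
}

func newStore(prefix string) *store {
	s := &store{prefix: prefix}
	empty := map[string]string{}
	s.snap.Store(&empty)
	return s
}

// applyEvents copies the current snapshot, applies the events to the copy,
// and swaps the copy in atomically. Keys outside the prefix are ignored.
// It assumes a single writer goroutine.
func (s *store) applyEvents(events []event) {
	old := *s.snap.Load()
	next := make(map[string]string, len(old))
	for k, v := range old {
		next[k] = v
	}
	for _, ev := range events {
		name, ok := strings.CutPrefix(ev.Key, s.prefix)
		if !ok {
			continue
		}
		if ev.Deleted {
			delete(next, name)
		} else {
			next[name] = ev.Value
		}
	}
	s.snap.Store(&next)
}

// get reads from the current snapshot without taking a lock.
func (s *store) get(name string) (string, bool) {
	v, ok := (*s.snap.Load())[name]
	return v, ok
}

func main() {
	s := newStore("/temporal/dynamicconfig/")
	s.applyEvents([]event{{Key: "/temporal/dynamicconfig/frontend.rps", Value: "1200"}})
	v, ok := s.get("frontend.rps")
	fmt.Println(v, ok)
}
```

Readers pay one atomic load per lookup; the cost of copying the map is paid only on writes, which are rare for dynamic config.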

Prerequisites

Go 1.22+
A running etcd cluster (v3.5+)
OSS Temporal server

Repository structure

atomic.go     atomicValue[T] — typesafe sync/atomic.Value wrapper
client.go     Dynamic config client: GetValue, Subscribe, WriteConfig, DumpAll, LogAll, watch loop
config.go     Config/EtcdConfig structs, YAML tags, validation, BuildConfig helper
provider.go   NewEtcdClient — clientv3.Client with round-robin LB and startup health check
tls.go        newClientMTLSConfig — stdlib mTLS helper (cert/key/CA files)
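For readers unfamiliar with the pattern named in atomic.go, here is a minimal sketch of a generic wrapper around sync/atomic.Value. The method names and the `box` helper are assumptions made for illustration; the actual file may differ. The wrapper removes the type assertion from call sites and sidesteps two atomic.Value panics (storing nil, and storing values of inconsistent concrete types).

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// box gives every stored value the same concrete type, so atomic.Value never
// sees a nil or a change of dynamic type, even when T is an interface.
type box[T any] struct{ val T }

// atomicValue is a hypothetical type-safe wrapper around sync/atomic.Value.
type atomicValue[T any] struct {
	v atomic.Value
}

// Store atomically replaces the held value.
func (a *atomicValue[T]) Store(val T) { a.v.Store(box[T]{val}) }

// Load returns the held value, or the zero value and false if nothing
// has been stored yet.
func (a *atomicValue[T]) Load() (T, bool) {
	b, ok := a.v.Load().(box[T])
	return b.val, ok
}

func main() {
	var cfg atomicValue[map[string]int]
	cfg.Store(map[string]int{"frontend.rps": 1200})
	m, ok := cfg.Load()
	fmt.Println(m["frontend.rps"], ok)
}
```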

Installation

This library does not import go.temporal.io/server from the public module proxy — it requires a local checkout of the Temporal server source. This is intentional: the library compiles against the same server version you are running, so there is no version skew between the dynamic config client and the server internals it integrates with.

Step 1 — check out the Temporal server at the release tag matching the version you are deploying:

git clone https://github.com/temporalio/temporal.git /path/to/temporal
cd /path/to/temporal
git checkout v1.31.0   # use the tag matching your deployment

Step 2: in your go.mod, add replace directives for both the Temporal server and this library. This library is not published to the module proxy, and the server must resolve to your local checkout, so both need a local path:

replace (
    go.temporal.io/server                         => /path/to/temporal
    github.com/temporalio/temporal-etcd-dynconfig => /path/to/temporal-etcd-dynconfig
)

Important: always point the go.temporal.io/server replace directive at a release tag checkout, not master. The master branch uses pre-release versions of go.temporal.io/api that are not published to the module proxy, which will break go mod tidy for anyone who does not also have those pre-release modules locally.

Configuration

Config is a plain Go struct — populate it directly or unmarshal it from YAML.

Minimal (no TLS, local etcd)

cfg := etcddynconfig.Config{
    EtcdConfigs: []etcddynconfig.EtcdConfig{
        {Name: "primary", Endpoints: []string{"127.0.0.1:2379"}},
    },
    GlobalKeyPrefix: "/temporal/dynamicconfig/",
    DisableTLS:      true,
    ClientName:      "temporal-server",
}
cfg.EnsureDefaults()

Equivalent YAML (e.g. loaded from a file and passed to BuildConfig):

etcdConfigs:
  - name: primary
    endpoints:
      - "127.0.0.1:2379"
globalKeyPrefix: "/temporal/dynamicconfig/"
disableTLS: true
clientName: temporal-server

With mTLS

etcdConfigs:
  - name: primary
    endpoints:
      - "etcd-1.example.com:2379"
      - "etcd-2.example.com:2379"
globalKeyPrefix: "/temporal/dynamicconfig/"
disableTLS: false
clientTlsCaCertFile:   /etc/temporal/certs/ca.crt
clientTlsCertFile:     /etc/temporal/certs/client.crt
clientTlsKeyFile:      /etc/temporal/certs/client.key
clientName:            temporal-server
dialTimeout:           2s
maxCallSendMsgSize:    4194304

Config fields

Field	Required	Default	Description
`etcdConfigs`	yes	—	List of etcd clusters. Currently only the first entry is used.
`globalKeyPrefix`	yes	—	Prepended to every key. Use a unique prefix per environment for isolation.
`clientName`	yes	—	Used for TLS SNI and log context.
`disableTLS`	no	`false`	Set `true` for local dev without certs.
`clientTlsCaCertFile`	if TLS	—	PEM CA certificate for verifying the etcd server.
`clientTlsCertFile`	if TLS	—	PEM client certificate for mTLS.
`clientTlsKeyFile`	if TLS	—	PEM client private key for mTLS.
`dialTimeout`	no	`2s`	Timeout for the initial etcd connection.
`maxCallSendMsgSize`	no	`4 MiB`	Max gRPC message size. Must match etcd server's `--max-request-bytes`.
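The defaults and required-field rules in the table can be expressed in a few lines. The sketch below is not the library's EnsureDefaults or BuildConfig; `sketchConfig` and its field names are hypothetical, trimmed-down stand-ins used only to show the rules from the table.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// sketchConfig is a hypothetical mirror of the fields in the table above.
type sketchConfig struct {
	Endpoints          []string
	GlobalKeyPrefix    string
	ClientName         string
	DisableTLS         bool
	CACertFile         string
	CertFile           string
	KeyFile            string
	DialTimeout        time.Duration
	MaxCallSendMsgSize int
}

// ensureDefaults fills the two optional fields with the documented defaults.
func (c *sketchConfig) ensureDefaults() {
	if c.DialTimeout == 0 {
		c.DialTimeout = 2 * time.Second
	}
	if c.MaxCallSendMsgSize == 0 {
		c.MaxCallSendMsgSize = 4 << 20 // 4 MiB
	}
}

// validate reports every missing required field at once.
func (c *sketchConfig) validate() error {
	var errs []error
	if len(c.Endpoints) == 0 {
		errs = append(errs, errors.New("etcdConfigs: at least one endpoint is required"))
	}
	if c.GlobalKeyPrefix == "" {
		errs = append(errs, errors.New("globalKeyPrefix is required"))
	}
	if c.ClientName == "" {
		errs = append(errs, errors.New("clientName is required"))
	}
	if !c.DisableTLS && (c.CACertFile == "" || c.CertFile == "" || c.KeyFile == "") {
		errs = append(errs, errors.New("TLS is enabled: CA, cert, and key files are required"))
	}
	return errors.Join(errs...) // nil when errs is empty
}

func main() {
	c := sketchConfig{
		Endpoints:       []string{"127.0.0.1:2379"},
		GlobalKeyPrefix: "/temporal/dynamicconfig/",
		ClientName:      "temporal-server",
		DisableTLS:      true,
	}
	c.ensureDefaults()
	fmt.Println(c.DialTimeout, c.MaxCallSendMsgSize, c.validate())
}
```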

Environment variable wiring

The config fields map naturally to environment variables. A typical container entrypoint sets:

Env var	Maps to	Example
`ETCD_ENDPOINTS`	`etcdConfigs[0].endpoints` (comma-separated)	`etcd-1:2379,etcd-2:2379`
`ETCD_KEY_PREFIX`	`globalKeyPrefix`	`/temporal/dynamicconfig/`
`ETCD_CLIENT_NAME`	`clientName` (and `etcdConfigs[0].name`)	`temporal-server`
`ETCD_DISABLE_TLS`	`disableTLS` (`"true"` to disable)	`true`

Example wiring in Go:

etcdCfg := etcddynconfig.Config{
    EtcdConfigs: []etcddynconfig.EtcdConfig{{
        Name:      os.Getenv("ETCD_CLIENT_NAME"),
        Endpoints: strings.Split(os.Getenv("ETCD_ENDPOINTS"), ","),
    }},
    GlobalKeyPrefix: os.Getenv("ETCD_KEY_PREFIX"),
    ClientName:      os.Getenv("ETCD_CLIENT_NAME"),
    DisableTLS:      os.Getenv("ETCD_DISABLE_TLS") == "true",
}
etcdCfg.EnsureDefaults()
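One detail worth guarding against in the wiring above: strings.Split on an empty string returns a one-element slice containing "", so an unset ETCD_ENDPOINTS would yield a single blank endpoint. A small helper, hypothetical and not part of the library, can trim whitespace and drop empty entries before the value reaches the config:

```go
package main

import (
	"fmt"
	"strings"
)

// parseEndpoints splits a comma-separated endpoint list, trims whitespace,
// and drops empty entries. An empty input yields an empty (nil) slice.
func parseEndpoints(raw string) []string {
	var out []string
	for _, part := range strings.Split(raw, ",") {
		if p := strings.TrimSpace(part); p != "" {
			out = append(out, p)
		}
	}
	return out
}

func main() {
	fmt.Println(parseEndpoints("etcd-1:2379, etcd-2:2379"))
	fmt.Println(len(parseEndpoints("")))
}
```

With this in place, an empty result can be treated as a configuration error at startup instead of surfacing later as a dial failure.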

Usage

Wire into OSS Temporal server

The key constraint is that the etcd dynconfig client and the Temporal server must share a single metrics.Handler. If you pass a separate handler to each, the server starts its own Prometheus HTTP listener that conflicts with the one already bound by the handler you gave the etcd client — server metrics will fail to start or emit nothing.

Build the handler once from the server config, pass it to NewClient, and pass the same instance to temporal.WithCustomMetricsHandler.

package main

import (
    "context"
    "log"

    etcddynconfig "github.com/temporalio/temporal-etcd-dynconfig"
    "go.temporal.io/server/common/config"
    temporallog "go.temporal.io/server/common/log"
    "go.temporal.io/server/common/metrics"
    "go.temporal.io/server/temporal"
)

func main() {
    ctx := context.Background()

    // Load the Temporal server config (config file path, env, etc. — see config.Load docs).
    cfg, err := config.Load(config.WithEmbedded())
    if err != nil {
        log.Fatalf("load config: %v", err)
    }

    logger := temporallog.NewZapLogger(temporallog.BuildZapLogger(cfg.Log))

    // Build ONE shared metrics handler from the server's own metrics config.
    // This handler is passed to both NewClient and WithCustomMetricsHandler so
    // they share a single Prometheus registry and HTTP listener.
    metricsHandler, err := metrics.MetricsHandlerFromConfig(logger, cfg.Global.Metrics)
    if err != nil {
        log.Fatalf("create metrics handler: %v", err)
    }

    etcdCfg := etcddynconfig.Config{
        EtcdConfigs:     []etcddynconfig.EtcdConfig{{Name: "primary", Endpoints: []string{"127.0.0.1:2379"}}},
        GlobalKeyPrefix: "/temporal/dynamicconfig/",
        DisableTLS:      true,
        ClientName:      "temporal-server",
    }
    etcdCfg.EnsureDefaults()

    // Create the raw etcd client (performs startup connectivity check).
    etcdClient := etcddynconfig.NewEtcdClient(etcdCfg, logger)
    defer etcdClient.Close()

    // Tag dynconfig metrics with the service(s) this process is running.
    dcMetrics := metricsHandler.WithTags(metrics.StringTag("service_name", "frontend,history,matching,worker"))

    dcClient, err := etcddynconfig.NewClient(ctx, etcdClient, etcdCfg.GlobalKeyPrefix, logger, dcMetrics)
    if err != nil {
        log.Fatalf("create etcd dynconfig client: %v", err)
    }
    defer dcClient.Stop()

    server, err := temporal.NewServer(
        temporal.WithConfig(cfg),
        temporal.WithLogger(logger),
        temporal.WithDynamicConfigClient(dcClient),
        temporal.WithCustomMetricsHandler(metricsHandler), // same handler — prevents duplicate listener
        temporal.InterruptOn(temporal.InterruptCh()),
    )
    if err != nil {
        log.Fatalf("create server: %v", err)
    }
    if err := server.Start(); err != nil {
        log.Fatalf("start server: %v", err)
    }
}

Load config from YAML

import "gopkg.in/yaml.v3"

var raw map[string]any
_ = yaml.Unmarshal(yamlBytes, &raw)

cfg, err := etcddynconfig.BuildConfig(raw)
if err != nil {
    // validation error
}

BuildConfig validates all required fields and fills in defaults. Use it when loading config from a file or a custom datastore options map.

Storing dynamic config values in etcd

Each key is stored as <globalKeyPrefix><temporalKeyName>. The value is a YAML list of constrained values — the same format as the OSS file-based dynamic config.

The recommended prefix is /temporal/dynamicconfig/ (note the leading slash). The leading slash is required for etcd UI tools like etcdkeeper to display keys in a proper directory tree. Without it, keys are stored at the root level and most UIs won't show them.

Simple global value

# etcd key: /temporal/dynamicconfig/frontend.rps
- value: 1200
  constraints: {}

Per-namespace override with global fallback

# etcd key: /temporal/dynamicconfig/frontend.rps
- value: 500
  constraints:
    namespace: high-traffic-namespace
- value: 1200
  constraints: {}

Supported constraint fields

Constraint key	Type	Description
`namespace`	string	Namespace name
`namespaceId`	string	Namespace ID
`taskQueueName`	string	Task queue name
`taskType`	string or int	`Workflow` or `Activity`
`historyTaskType`	string or int	Internal history task type
`shardId`	int	History shard ID
`destination`	string	Nexus destination

Temporal evaluates constraints in precedence order (most specific wins). A value with constraints: {} acts as the global default.
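To make "most specific wins" concrete, here is a deliberately simplified model that considers only the namespace constraint. The types and the `resolve` function are hypothetical; Temporal's real precedence logic covers all the constraint fields in the table and is not reproduced here.

```go
package main

import "fmt"

// constraints and constrainedValue are simplified, hypothetical versions of
// the dynamic config types; only the namespace constraint is modeled.
type constraints struct{ Namespace string }

type constrainedValue struct {
	Constraints constraints
	Value       any
}

// resolve returns the value whose namespace constraint matches, falling back
// to the first entry with empty constraints (the global default).
func resolve(values []constrainedValue, namespace string) (any, bool) {
	var (
		fallback    any
		hasFallback bool
	)
	for _, cv := range values {
		if cv.Constraints.Namespace == "" {
			if !hasFallback {
				fallback, hasFallback = cv.Value, true
			}
			continue
		}
		if cv.Constraints.Namespace == namespace {
			return cv.Value, true // the more specific entry wins
		}
	}
	return fallback, hasFallback
}

func main() {
	values := []constrainedValue{
		{Constraints: constraints{Namespace: "high-traffic-namespace"}, Value: 500},
		{Value: 1200},
	}
	fmt.Println(resolve(values, "high-traffic-namespace"))
	fmt.Println(resolve(values, "some-other-namespace"))
}
```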

Writing values programmatically

import "go.temporal.io/server/common/dynamicconfig"

err := dcClient.WriteConfig(ctx,
    dynamicconfig.FrontendRPS,
    []dynamicconfig.ConstrainedValue{
        {
            Value:       500,
            Constraints: dynamicconfig.Constraints{Namespace: "high-traffic-namespace"},
        },
        {
            Value: 1200,
        },
    },
)

WriteConfig serializes the values to YAML, writes them to etcd, and immediately reloads the in-memory cache. Intended for CLI tooling and bootstrappers; not for hot paths.

Inspecting the loaded config (DumpAll / LogAll)

OSS Temporal has no built-in way to see what dynamic config values are currently active. The etcd client adds two methods for this.

DumpAll() returns a snapshot of the full in-memory map as map[string][]dynamicconfig.ConstrainedValue. The map is a copy — safe to iterate after the client is stopped:

snapshot := dcClient.DumpAll()
for key, values := range snapshot {
    fmt.Printf("%s: %+v\n", key, values)
}

Typical uses:

Expose it from a debug HTTP handler so you can curl the live state
Log it at startup to confirm all expected overrides were loaded from etcd
Diff two snapshots to see what changed between deployments
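The snapshot diff mentioned in the last item could look roughly like this. The `snapshot` type is a simplified, hypothetical stand-in for the map DumpAll returns (values are plain `any` rather than dynamicconfig.ConstrainedValue), so the sketch runs with the standard library alone.

```go
package main

import (
	"fmt"
	"reflect"
	"sort"
)

// snapshot is a simplified stand-in for the map returned by DumpAll.
type snapshot map[string][]any

// diffSnapshots returns the sorted key names that were added, removed, or
// changed between two snapshots.
func diffSnapshots(before, after snapshot) (added, removed, changed []string) {
	for k, newV := range after {
		oldV, ok := before[k]
		switch {
		case !ok:
			added = append(added, k)
		case !reflect.DeepEqual(oldV, newV):
			changed = append(changed, k)
		}
	}
	for k := range before {
		if _, ok := after[k]; !ok {
			removed = append(removed, k)
		}
	}
	sort.Strings(added)
	sort.Strings(removed)
	sort.Strings(changed)
	return added, removed, changed
}

func main() {
	before := snapshot{"frontend.rps": {1200}, "history.cacheMaxSize": {512}}
	after := snapshot{"frontend.rps": {2000}, "matching.rps": {800}}
	added, removed, changed := diffSnapshots(before, after)
	fmt.Println("added:", added, "removed:", removed, "changed:", changed)
}
```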

LogAll() writes every key and its constrained values to the logger at INFO level — one log line per key. Useful as a startup diagnostic without any extra wiring:

// call once after NewClient returns, before starting the server
dcClient.LogAll()

Example output (structured logging):

dynamic config dump start   totalKeys=12
dynamic config entry        key=frontend.rps          values=[{constraints:{} value:1200}]
dynamic config entry        key=history.cacheMaxSize  values=[{constraints:{} value:512}]
...
dynamic config dump end

Both methods read directly from the same atomic in-memory map that GetValue uses — no etcd round-trip, no lock contention.

Writing values with etcdctl

etcdctl put /temporal/dynamicconfig/frontend.rps -- '
- value: 1200
  constraints: {}
'

etcdctl put /temporal/dynamicconfig/history.defaultActivityRetryPolicy -- '
- value:
    initialInterval: 1s
    backoffCoefficient: 2.0
    maximumAttempts: 10
  constraints: {}
'

Note: the -- separator is required when the value starts with - (a YAML list), otherwise etcdctl interprets it as a flag.

Deleting a value (reverts to compiled-in default)

etcdctl del /temporal/dynamicconfig/frontend.rps

Listing all current dynamic config values

etcdctl get /temporal/dynamicconfig/ --prefix

Startup behaviour

NewEtcdClient performs a connectivity check before returning. It retries up to 3 times with exponential backoff (2s initial, 2× coefficient). If etcd is unreachable it calls logger.Fatal — the server should not start with a broken config backend.
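The retry schedule described above can be sketched as follows. Several things here are assumptions: the sketch treats 3 as the total number of attempts, it returns an error where the library calls logger.Fatal, and `retry`, `recordingSleeper`, `initialDelay`, and `errUnreachable` are hypothetical names. The sleeper is injected so the schedule can be inspected without actually waiting.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

const initialDelay = 2 * time.Second // documented initial backoff

var errUnreachable = errors.New("etcd unreachable")

// recordingSleeper stands in for time.Sleep and records each requested delay.
type recordingSleeper struct{ slept []time.Duration }

func (r *recordingSleeper) Sleep(d time.Duration) { r.slept = append(r.slept, d) }

// retry calls check up to attempts times. After each failure except the last
// it sleeps, then multiplies the delay by coefficient.
func retry(attempts int, initial time.Duration, coefficient float64,
	sleep func(time.Duration), check func() error) error {
	if attempts < 1 {
		attempts = 1
	}
	delay := initial
	var err error
	for i := 0; i < attempts; i++ {
		if err = check(); err == nil {
			return nil
		}
		if i < attempts-1 {
			sleep(delay)
			delay = time.Duration(float64(delay) * coefficient)
		}
	}
	return fmt.Errorf("gave up after %d attempts: %w", attempts, err)
}

func main() {
	r := &recordingSleeper{}
	err := retry(3, initialDelay, 2, r.Sleep, func() error { return errUnreachable })
	fmt.Println(r.slept, err)
}
```

In real code, pass time.Sleep as the sleeper.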

Shutdown

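defer etcdClient.Close()   // deferred first, runs last: closes the underlying gRPC connection
defer dcClient.Stop()      // deferred last, runs first: closes the etcd watcher, cancels watch goroutines

Call Stop() before Close(). Deferred calls run in last-in, first-out order, so defer Close() first and Stop() second, as in the wiring example under Usage.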
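Go runs deferred calls in last-in, first-out order, which determines how the two defers must be written to get Stop() before Close(). The following stdlib-only demonstration uses a hypothetical `recorder` in place of the real clients:

```go
package main

import "fmt"

// recorder notes the order in which the shutdown steps run.
type recorder struct{ calls []string }

func (r *recorder) stopWatcher() { r.calls = append(r.calls, "Stop") }
func (r *recorder) closeConn()   { r.calls = append(r.calls, "Close") }

// run defers Close first and Stop second, so Stop executes first on return.
func run(r *recorder) {
	defer r.closeConn()
	defer r.stopWatcher()
}

func main() {
	r := &recorder{}
	run(r)
	fmt.Println(r.calls) // prints [Stop Close]
}
```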

Connection resilience

The watch supervisor handles:

Event	Behaviour
Transient stream error	Reload all values, reopen Watch from new revision
etcd compaction past last-seen revision	Same — reloads and resubscribes
Leader election / connection reset	Same
Context cancellation (`Stop()`)	Exits cleanly, no reload

Backoff on reload failure: 100ms → doubles each attempt → caps at 30s.
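A capped doubling schedule like the one above fits in a few lines. This is an illustrative sketch; `nextBackoff` and the two constants are hypothetical names, not the library's identifiers.

```go
package main

import (
	"fmt"
	"time"
)

const (
	minBackoff = 100 * time.Millisecond
	maxBackoff = 30 * time.Second
)

// nextBackoff doubles the previous delay and clamps the result to the range
// [minBackoff, maxBackoff]. Pass 0 to get the first delay.
func nextBackoff(prev time.Duration) time.Duration {
	if prev < minBackoff {
		return minBackoff
	}
	if next := prev * 2; next < maxBackoff {
		return next
	}
	return maxBackoff
}

func main() {
	d := time.Duration(0)
	for i := 0; i < 10; i++ {
		d = nextBackoff(d)
		fmt.Println(d)
	}
}
```

Starting from 100ms, the delay reaches the 30s cap on the tenth consecutive failure (the uncapped value would be 51.2s) and stays there.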

Metrics

The client emits metrics through the same metrics.Handler the Temporal server already uses — Prometheus, OpenTelemetry, or any other backend your server is configured with.

You must share a single handler between the etcd client and the Temporal server. Build it once with metrics.MetricsHandlerFromConfig, pass it to NewClient, and pass the same instance to temporal.WithCustomMetricsHandler. Without WithCustomMetricsHandler, the server starts its own Prometheus HTTP listener that conflicts with the one already bound by the handler you passed to NewClient — server metrics will fail to start or emit nothing.

metricsHandler, err := metrics.MetricsHandlerFromConfig(logger, cfg.Global.Metrics)

// Tag dynconfig metrics with the Temporal service name(s) for this process.
dcClient, err := etcddynconfig.NewClient(ctx, etcdClient, prefix, logger,
    metricsHandler.WithTags(metrics.StringTag("service_name", "frontend")),
)

server, err := temporal.NewServer(
    temporal.WithDynamicConfigClient(dcClient),
    temporal.WithCustomMetricsHandler(metricsHandler), // same handler — no duplicate listener
    // ...
)

Pass metrics.NoopMetricsHandler to NewClient to disable dynconfig metrics entirely (you can still pass the real handler to WithCustomMetricsHandler for server metrics).

Emitted metrics

All metrics inherit any tags set on the handler passed to NewClient. WithTags returns a new derived handler — it does not mutate the original — so scoping the etcd client's handler with service_name has no effect on Temporal server metrics, which use the original handler and apply their own tags internally.

Metric	Type	Tags	Description
`dynconfig_key_updates_total`	counter	`operation` (DynamicConfigUpdate, DynamicConfigDelete), `key` (config key name)	Incremented on every key change received from etcd. Each server process increments independently — with 3 frontends + 5 history hosts + 2 matching + 1 worker, a single `etcdctl put` produces 11 increments across all services.
`dynconfig_watch_reconnects_total`	counter	`reason` (compacted, stream_ended)	Incremented whenever the watch supervisor has to reload and reopen the stream. A spike here indicates etcd instability.
`dynconfig_watch_active`	gauge	—	`1` while the watch stream is running, `0` while stopped or reconnecting. Alert on this going to `0`.
`dynconfig_keys_loaded`	gauge	—	Number of keys in the in-memory map after each full reload.
`dynconfig_load_duration_seconds`	timer	—	Time taken for a full prefix scan from etcd, on startup and each reconnect.
`dynconfig_write_total`	counter	`result` (success, error)	Outcome of each `WriteConfig` call.

Useful alert queries

# Watch is down on any service: config changes are not propagating
dynconfig_watch_active{service_name=~"frontend|history|matching|worker"} == 0

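# A config change was applied on some processes but not others in the last 5m
# (indicates a broken watch on specific nodes). timestamp() only reports scrape
# time, so this compares per-key increases instead. Pair it with a "for:" duration
# in the alert rule to ignore the brief skew during normal propagation.
max by (key) (increase(dynconfig_key_updates_total[5m])) - min by (key) (increase(dynconfig_key_updates_total[5m])) > 0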

# Frequent watch reconnects — etcd is unstable
rate(dynconfig_watch_reconnects_total[5m]) > 0.1

Differences from the OSS file-based client

	File-based client	etcd client
Update latency	Poll interval (default 10s)	Near-realtime via etcd watch
Write path	Edit file on disk	`WriteConfig()` or `etcdctl put`
Multi-server consistency	Depends on filesystem / config management	All servers in the cluster see the same value simultaneously
Resilience	File must be present at startup	Fails fast if etcd unreachable at startup; survives disruptions at runtime
Audit log	None	Every change logged at INFO with old/new values

Active/passive multi-cluster setup

In a multi-cluster Temporal deployment (active + one or more passive/standby clusters), dynamic config must be kept in sync across all clusters. With file-based dynamic config this is a manual and error-prone process — you edit a file on one cluster, then remember to apply the same change to every passive cluster. Miss one, and your standby diverges silently. When you fail over, the passive cluster runs with stale config.

With etcd-backed dynamic config, all clusters can read from one shared etcd deployment. A single scripted write pass reaches every cluster, active and passive, without per-host file edits.

How it works with shared etcd

All clusters point at the same etcd cluster, each using its own prefix:

/active/temporal/dynamicconfig/
/passive-us-west/temporal/dynamicconfig/
/passive-eu/temporal/dynamicconfig/

Set ETCD_KEY_PREFIX per cluster accordingly. Each cluster watches only its own prefix — the prefixes are isolated, so a change to the active cluster does not automatically touch the passive clusters.

Keeping passive clusters in sync

To apply a change to all clusters at once, write to every prefix in one pass. The loop issues one put per prefix, so the writes are sequential rather than a single atomic transaction:

# Update frontend.globalNamespaceRPS on every cluster
for prefix in /active /passive-us-west /passive-eu; do
  etcdctl put "${prefix}/temporal/dynamicconfig/frontend.globalNamespaceRPS" -- "- value: 2000"
done

All clusters receive the watch event and apply the change within milliseconds — no SSH, no config management tooling, no per-cluster scripts.
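The same fan-out can be done from Go tooling. In this sketch `putFunc` is a hypothetical stand-in for whatever performs the etcd write (for example a thin wrapper over an etcd client's Put), and `writeToAll` and `errDemo` are illustrative names. It collects the prefixes that failed instead of stopping at the first error, so a partial write is visible to the operator.

```go
package main

import (
	"errors"
	"fmt"
)

// putFunc is a hypothetical stand-in for a single etcd write.
type putFunc func(key, value string) error

var errDemo = errors.New("demo write failure")

// writeToAll writes the same value under every cluster prefix and reports
// every prefix that failed.
func writeToAll(put putFunc, prefixes []string, name, value string) error {
	var errs []error
	for _, p := range prefixes {
		if err := put(p+name, value); err != nil {
			errs = append(errs, fmt.Errorf("%s: %w", p, err))
		}
	}
	return errors.Join(errs...) // nil when every write succeeded
}

func main() {
	prefixes := []string{
		"/active/temporal/dynamicconfig/",
		"/passive-us-west/temporal/dynamicconfig/",
		"/passive-eu/temporal/dynamicconfig/",
	}
	// An in-memory map stands in for etcd in this example.
	written := map[string]string{}
	put := func(k, v string) error { written[k] = v; return nil }
	err := writeToAll(put, prefixes, "frontend.globalNamespaceRPS", "- value: 2000")
	fmt.Println(len(written), err)
}
```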

Why this matters for failover

When a passive cluster becomes active during a failover, it is already running with the same dynamic config as the cluster it is replacing, provided every change was written to all prefixes. There is no config drift to discover under pressure. Rate limits, cache sizes, partition counts, and persistence QPS settings are all identical, so the failover is behaviorally transparent.

Without this, a common failure mode after failover is: the passive cluster has outdated dynamic config (lower rate limits, wrong partition counts, stale feature flags) and starts behaving differently under production load, compounding the incident.

Shared vs. per-cluster etcd

You can also run a dedicated etcd cluster per Temporal cluster. In that case there is no prefix isolation needed, but you lose the single-write-to-all-clusters convenience. Use shared etcd when your clusters are in the same region or trust boundary; use per-cluster etcd when clusters are geographically separated and you want full isolation.

Local etcd for development

# Single-node etcd via Docker
docker run -d \
  --name etcd \
  -p 2379:2379 \
  gcr.io/etcd-development/etcd:v3.5.12 \
  etcd \
  --advertise-client-urls http://0.0.0.0:2379 \
  --listen-client-urls http://0.0.0.0:2379

# Verify
etcdctl --endpoints=127.0.0.1:2379 endpoint health

# Verify keys are visible
etcdctl --endpoints=127.0.0.1:2379 get /temporal/dynamicconfig/ --prefix --keys-only

Then use disableTLS: true in your config.

Multi-environment setup

A single etcd cluster can serve multiple Temporal environments (prod, staging, dev) by giving each a unique globalKeyPrefix. Each cluster only reads and watches its own prefix — a change to a staging key never touches prod.

Recommended prefix convention:

/prod/temporal/dynamicconfig/
/staging/temporal/dynamicconfig/
/dev/temporal/dynamicconfig/
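etcd matches prefixes byte by byte, so the trailing slash in these prefixes matters: a prefix of /prod would also match keys under /prod-eu. The sketch below uses two hypothetical helpers, `normalizePrefix` and `owns`, to show the convention and the pitfall with plain string operations.

```go
package main

import (
	"fmt"
	"strings"
)

// normalizePrefix adds the leading and trailing slash used by the
// convention above.
func normalizePrefix(p string) string {
	p = "/" + strings.Trim(p, "/")
	if p == "/" {
		return p
	}
	return p + "/"
}

// owns reports whether a key falls under a prefix, using the same byte-wise
// match that etcd applies to prefix reads and watches.
func owns(prefix, key string) bool { return strings.HasPrefix(key, prefix) }

func main() {
	prod := normalizePrefix("prod/temporal/dynamicconfig")
	fmt.Println(prod)
	fmt.Println(owns(prod, "/prod/temporal/dynamicconfig/frontend.rps"))
	fmt.Println(owns(prod, "/staging/temporal/dynamicconfig/frontend.rps"))
	// Without a trailing slash, an unrelated environment matches too.
	fmt.Println(owns("/prod", "/prod-eu/temporal/dynamicconfig/frontend.rps"))
}
```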

Set the prefix via the ETCD_KEY_PREFIX environment variable (or equivalent config) per deployment:

# prod
ETCD_KEY_PREFIX=/prod/temporal/dynamicconfig/

# staging
ETCD_KEY_PREFIX=/staging/temporal/dynamicconfig/

# dev
ETCD_KEY_PREFIX=/dev/temporal/dynamicconfig/

To update a value for staging only:

etcdctl put /staging/temporal/dynamicconfig/frontend.globalNamespaceRPS -- "- value: 800"

Prod is unaffected. To list all keys for a specific environment:

etcdctl get /prod/temporal/dynamicconfig/ --prefix --keys-only
etcdctl get /staging/temporal/dynamicconfig/ --prefix --keys-only

The default seeding from defaults.yaml applies independently per prefix — each environment gets its own copy of the defaults on first start.

Production notes

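Prefix isolation: use a unique globalKeyPrefix per cell or environment (e.g. /prod-us-east/temporal/dynamicconfig/, /staging/temporal/dynamicconfig/) to avoid key collisions when multiple clusters share an etcd cluster. Keep the leading slash so the keys match the convention used in the rest of this document.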
etcd sizing: dynamic config values are small and infrequently written. A 3-node etcd cluster used solely for this purpose can be very lightweight.
TLS: always enable mTLS in production. Generate client certs with the same CA as your etcd cluster.
Temporal server version: pin your replace directive to a release tag, not master. See the Installation section.