AWS
May 8, 2026 · View on GitHub
In this guide, we'll walk through how to run high-performance distributed training on AWS using Amazon Elastic Fabric Adapter (EFA) with dstack.
Overview
EFA is a network interface for Amazon EC2 that enables low-latency, high-bandwidth inter-node communication — essential for scaling distributed deep learning. With dstack, EFA is automatically enabled when you create fleets with supported instance types.
Prerequisite
Before you start, make sure the aws backend is properly configured.
projects:
- name: main
backends:
- type: aws
creds:
type: default
regions: ["us-west-2"]
!!! info "VPC" If you use a custom VPC, verify that it permits all internal traffic between nodes for EFA to function properly
Create a fleet
Once your backend is ready, define a fleet configuration.
```yaml
type: fleet
name: efa-fleet
nodes: 2
placement: cluster
resources:
gpu: H100:8
```
Provision the fleet with dstack apply:
$ dstack apply -f efa-fleet.dstack.yml
Provisioning...
---> 100%
FLEET INSTANCE BACKEND INSTANCE TYPE GPU PRICE STATUS CREATED
efa-fleet 0 aws (us-west-2) p4d.24xlarge H100:8:80GB \$98.32 idle 3 mins ago
1 aws (us-west-2) p4d.24xlarge H100:8:80GB \$98.32 idle 3 mins ago
??? info "Instance types"
dstack selects suitable instances automatically, but not
all types support EFA.
To enforce EFA, you can specify instance_types explicitly:
```yaml
type: fleet
name: efa-fleet
nodes: 2
placement: cluster
resources:
gpu: L4
instance_types: ["g6.8xlarge"] # If not specified, g6.xlarge is used (won't have EFA)
```
Run NCCL tests
To confirm that EFA is working, run NCCL tests:
type: task
name: nccl-tests
nodes: 2
startup_order: workers-first
stop_criteria: master-done
env:
- NCCL_DEBUG=INFO
commands:
- |
if [ $DSTACK_NODE_RANK -eq 0 ]; then
mpirun \
--allow-run-as-root \
--hostfile $DSTACK_MPI_HOSTFILE \
-n $DSTACK_GPUS_NUM \
-N $DSTACK_GPUS_PER_NODE \
--bind-to none \
/opt/nccl-tests/build/all_reduce_perf -b 8 -e 8G -f 2 -g 1
else
sleep infinity
fi
resources:
gpu: 1..8
shm_size: 16GB
Run it with dstack apply:
$ dstack apply -f nccl-tests.dstack.yml
Provisioning...
---> 100%
!!! info "Docker image"
You can use your own container by setting image. If omitted, dstack uses its default image with drivers, NCCL tests, and tools pre-installed.
Run distributed training
Here’s an example using torchrun for a simple multi-node PyTorch job:
type: task
name: train-distrib
nodes: 2
python: 3.12
env:
- NCCL_DEBUG=INFO
commands:
- git clone https://github.com/pytorch/examples.git pytorch-examples
- cd pytorch-examples/distributed/ddp-tutorial-series
- uv pip install -r requirements.txt
- |
torchrun \
--nproc-per-node=$DSTACK_GPUS_PER_NODE \
--node-rank=$DSTACK_NODE_RANK \
--nnodes=$DSTACK_NODES_NUM \
--master-addr=$DSTACK_MASTER_NODE_IP \
--master-port=12345 \
multinode.py 50 10
resources:
gpu: 1..8
shm_size: 16GB
Provision and launch it via dstack apply.
$ dstack apply -f train-distrib.dstack.yml
Provisioning...
---> 100%
Instead of setting python, you can specify your own Docker image using image. Make sure that the image is properly configured for EFA.
!!! info "What's next" 1. Learn more about distributed tasks and cluster placement 2. Check dev environments, services, and fleets