Distributed Inference
March 1, 2025 ยท View on GitHub
Distributed inference is easy with the help RPC backend. Basically, start one or more RPC server(s), then they can be used just like a backend device. Each RPC server corresponds to a single backend device. For example, if there are two GPUs in a box, then two RPC servers needs to be started, one for each GPU.
Start a Server
Use --serve_rpc SPEC to start a RPC server. Examples of SPEC:
-
8080: start a server on0.0.0.0:8080with device #0. -
127.0.0.1:9000: start a server on127.0.0.1:9000with device #0. -
8080@1: start a server on0.0.0.0:8080with device #1. -
127.0.0.1:9000@1: start a server on127.0.0.1:9000with device #1.
Don't forget to use --show_devices to check device IDs, and --log_level 2 to view more logs. Example:
main --serve_rpc 80 --log_level 2
Itrying to start RPC server at 0.0.0.0:80, using
IVulkan - Vulkan0 (NVIDIA GeForce ...)
I type: GPU
I memory total: .. B
I memory free : .. B
Use RPC Servers
After RPC servers are started, they can be used. Use --rcp_endpoints EPS to register RPC servers (each is called an endpoint.).
Each endpoint is specified by HOST:PORT, and when HOST is omitted, 127.0.0.1 is assumed. Multiple endpoints are joined by ;.
Use --show_devices to check if everything is Okay:
main --rpc_endpoints 80 --show_devices
0: .....
1: RPC - RPC[127.0.0.1:80] (NVIDIA GeForce ...)
type: GPU
memory total: .. B
memory free : .. B
1: CPU - CPU
...
Now, let go with distributed inference:
main --rpc_endpoints 80 -ngl 1:all -m ...