computing.md

August 23, 2023

GPU Comparisons

Contents

  1. My lab's computing setup
  2. Cloud GPU vs. whole-machine price comparison and notes
  3. Free compute from companies that professors can apply for
  4. Useful learning material on GPUs and on building your own compute cluster

Computing Setup of My Lab

In my lab, the Precognition Lab, using the start-up funds provided by the university, I have built 9 stand-alone machines equipped with a total of 32 RTX 3090/4090 GPUs (including 4-GPU and some 2-GPU machines). Additionally, I have established a cluster with 3 compute nodes, comprising a total of 24 RTX A6000 GPUs, and a 100TB NAS.

The rationale behind this hybrid setup is twofold: the stand-alone machines cost only 40% of what the cluster does, and they can be acquired quickly without necessitating additional machine room space.

For the cluster, I've found that 100 Gb Ethernet suffices for the compute network, so there is no need to invest in an InfiniBand switch, which can cost two to three times more. With 3 nodes on this network, multi-node training scales essentially linearly (for example, a job that takes 6 hours on 1 node takes about 2 hours on 3 nodes).
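
To make "linear scaling" concrete, the timing example above can be turned into a speedup and a scaling efficiency. This is only a minimal sketch: the function name `scaling_efficiency` is made up for illustration, and the only inputs are the 6-hour and 2-hour figures quoted above.

```python
def scaling_efficiency(single_node_hours: float,
                       multi_node_hours: float,
                       num_nodes: int) -> tuple[float, float]:
    """Return (speedup, efficiency) for a multi-node training run.

    speedup    = single-node time / multi-node time
    efficiency = speedup / number of nodes (1.0 means perfectly linear)
    """
    speedup = single_node_hours / multi_node_hours
    efficiency = speedup / num_nodes
    return speedup, efficiency


# Figures quoted in the text: 6 hours on 1 node, 2 hours on 3 nodes.
speedup, efficiency = scaling_efficiency(6.0, 2.0, 3)
print(f"speedup: {speedup:.2f}x, efficiency: {efficiency:.0%}")
# prints "speedup: 3.00x, efficiency: 100%"
```

An efficiency noticeably below 100% on your own jobs would suggest the network (or data loading) is becoming the bottleneck.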

Price Comparison

Vendors in mainland China (updated 07/2022):

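| Vendor | Machine | Duration | Price (RMB) | Note |
| --- | --- | --- | --- | --- |
| Alibaba Cloud (阿里) | 8xV100 (16GB) | 1 year | 800,000 | CentOS only |
| | | 1 month | 71,000 | |
| | | 1 hour | 248.42 | |
| Huawei Cloud (华为云) | 8xV100 (32GB) | 1 year | 630,000 | |
| | | 1 month | 63,000 | |
| | | 1 hour | 131.5 | |
| Tencent Cloud (腾讯云) | 8xV100 (32GB) | 1 year | 458,000 (17% off) | link |
| | | 1 month | 46,000 | |
| | | 1 hour (TIONE) | 147 | |
| | 8xA100 (40GB) | 1 year | 1,135,000 (17% off) | |
| | | 1 month | 114,000 | |
| Baidu Cloud (百度云) | 8xA100 (40GB) | 1 year | 997,000 (17% off) | link |
| | | 1 month | 100,000 | |
| | 8xV100 (32GB) | 1 year | 593,000 | |
| | | 1 month | 59,000 | |
| | | 1 hour | 124.14 | |
| 矩池云 | 8xV100 (16GB) | 1 hour | 48 | |
| 智星云 | 8x3090 (24GB) | 1 month | 21,000 | |
| | | 1 hour | 36 | |
| | 8xA100 (40GB) | 1 month | 45,000 | |
| | | 1 hour | 76 | |
| | 8xV100 (32GB) | 1 month | 28,000 | |
| | | 1 hour | 48 | |
| 极链AI云 | | | | |
| 恒源云 | | | | |
| AutoDL | | | | link, most popular |
| OpenBayes | | | | link |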
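
One way to read the table above is to ask how many hours per month a machine must be busy before the monthly plan beats hourly billing. The sketch below does that arithmetic for the rows that list both a monthly and an hourly price. The helper name `break_even_hours` and the 30-day month are assumptions for illustration, and the Tencent hourly figure is the TIONE price, which may not be the same product as the monthly plan.

```python
HOURS_PER_MONTH = 30 * 24  # assumption: a 30-day month

# (vendor, machine) -> (monthly price in RMB, hourly price in RMB), from the table above
PRICES = {
    ("Alibaba Cloud", "8xV100 (16GB)"): (71_000, 248.42),
    ("Huawei Cloud", "8xV100 (32GB)"): (63_000, 131.5),
    ("Tencent Cloud", "8xV100 (32GB)"): (46_000, 147),  # hourly price is TIONE
    ("Baidu Cloud", "8xV100 (32GB)"): (59_000, 124.14),
    ("智星云", "8x3090 (24GB)"): (21_000, 36),
    ("智星云", "8xA100 (40GB)"): (45_000, 76),
    ("智星云", "8xV100 (32GB)"): (28_000, 48),
}


def break_even_hours(monthly_price: float, hourly_price: float) -> float:
    """Hours of use per month above which the monthly plan is cheaper."""
    return monthly_price / hourly_price


for (vendor, machine), (monthly, hourly) in PRICES.items():
    hours = break_even_hours(monthly, hourly)
    share = hours / HOURS_PER_MONTH
    print(f"{vendor} {machine}: {hours:.0f} h/month ({share:.0%} utilization)")
```

If your expected utilization is below the printed share, hourly billing is the cheaper option under these list prices.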

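Whole-machine purchase (quotes requested 08/2022)

| Source | Machine | Price (RMB) / Note |
| --- | --- | --- |
| dbcloud 深脑云 (Taobao) | 8x3090 | From about 200,000 |
| Prof. Ming-Ming Cheng's experience | 8xV100 | link |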

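Junwei: GPU prices have dropped sharply recently (09/2022), so buying whole machines is clearly the better deal. The 3090's compute is comparable to the V100's, which makes it the most cost-effective card. I therefore think that several 8x3090 machines plus a network-attached storage (NAS) and kubeflow is the most economical and scalable setup. See the later section on building your own compute cluster.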
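
As a rough check on the buy-versus-rent argument, the sketch below compares the 8x3090 purchase quote above (from about 200,000 RMB) with the 智星云 8x3090 monthly rental price (21,000 RMB) from the cloud table. The function name and the `monthly_running_cost` parameter are hypothetical. The default of zero ignores electricity, hosting, maintenance, and resale value, so treat the result as a lower bound on the real payback time.

```python
def months_to_break_even(purchase_price: float,
                         monthly_rent: float,
                         monthly_running_cost: float = 0.0) -> float:
    """Months of continuous use after which buying is cheaper than renting.

    monthly_running_cost is a hypothetical placeholder for power, hosting,
    and maintenance; the source gives no figure for it.
    """
    monthly_saving = monthly_rent - monthly_running_cost
    if monthly_saving <= 0:
        raise ValueError("renting is never more expensive under these inputs")
    return purchase_price / monthly_saving


# 8x3090: about 200,000 RMB to buy vs. 21,000 RMB per month to rent.
months = months_to_break_even(200_000, 21_000)
print(f"break-even after about {months:.1f} months")
# prints "break-even after about 9.5 months"
```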

Vendors in NA (Updated 07/2022):

| Vendor | Machine | Duration | Price |
| --- | --- | --- | --- |
| Google Cloud asia-Taiwan | 8xV100 (32GB) | 1 month | $12,837.30 |
| | | 1 hour | $17 |
| Google Cloud asia-Tokyo | 8xA100 (40GB) | 1 month | $18,216.98 |
| vast.ai NA | 8xV100 (16GB) | 1 hour | $2.80 |
| | 8xA100 (40GB) | 1 hour | $8.80 |
| | 8xA6000 (48GB) | 1 hour | $4.40 |
| | 10x1080Ti (11GB) | 1 hour | $2 |
| | 8xA5000 (24GB) | 1 hour | $2.40 |
| | 4x3090 (24GB) | 1 hour | $1.20 |
| lambda NA | 8xV100 (16GB) | 1 hour | $4.40 |
| | 8xV100 (16GB) | 1 hour (>3 months) | $3.20 |
| | 8xA100 (40GB) | 1 hour (>3 months) | $8.00 |
| | 1xA100 (40GB) | 1 hour | $1.10 link |
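
Because the machines above have different GPU counts, it can help to normalize the on-demand hourly prices to a per-GPU figure. The following is a minimal sketch using only the vast.ai and lambda on-demand rows. The name `per_gpu_hourly` is made up here, and a raw per-GPU price says nothing about relative GPU speed or memory.

```python
# (vendor, GPU model, GPUs per machine, USD per machine-hour), from the table above
OFFERS = [
    ("vast.ai", "V100 (16GB)", 8, 2.80),
    ("vast.ai", "A100 (40GB)", 8, 8.80),
    ("vast.ai", "A6000 (48GB)", 8, 4.40),
    ("vast.ai", "1080Ti (11GB)", 10, 2.00),
    ("vast.ai", "A5000 (24GB)", 8, 2.40),
    ("vast.ai", "3090 (24GB)", 4, 1.20),
    ("lambda", "V100 (16GB)", 8, 4.40),
    ("lambda", "A100 (40GB)", 1, 1.10),
]


def per_gpu_hourly(num_gpus: int, machine_hourly_usd: float) -> float:
    """Hourly price of a single GPU within a multi-GPU machine."""
    return machine_hourly_usd / num_gpus


# Cheapest per-GPU offers first.
ranked = sorted(OFFERS, key=lambda offer: per_gpu_hourly(offer[2], offer[3]))
for vendor, gpu, num_gpus, price in ranked:
    print(f"{vendor:8s} {gpu:14s} ${per_gpu_hourly(num_gpus, price):.2f} per GPU-hour")
```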

Free Stuff

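| Provider | Note | Link |
| --- | --- | --- |
| 幻方AI | 10,000-GPU compute, free to apply for ("a summer of unfettered research") | link |
| NVIDIA | Has a grant program that provides one free GPU | |
| AWS | In courses at CMU, the professor of each course could apply for about $100 of cloud credit per student | |
| Google Cloud | Similar to AWS | |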

Learning Stuff

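| Note | Link |
| --- | --- |
| GPU guide from Lambda | link |
| Understanding GPUs and DL | link |
| Tencent TEG 星辰 and 机智 teams | link |
| MPIJob | link |
| Machine learning platform | link |
| Cluster storage, Ceph cluster | link |
| A discussion on machine prices on Twitter (for NA) | link |
| A discussion on 1xA100 vs. 6x3090 on Zhihu (知乎) | link |
| Prof. Ming-Ming Cheng's GPU cluster experience | link |
| Good cluster building guide from Lambda | link |
| How to decide on cloud GPUs vs. on-prem vs. hybrid | link |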