Configuration: Azure Cluster
March 27, 2020
For more customized configuration, please refer to the Configuration Section and Azure doc.
Azure Cluster specific configuration
We have greatly simplified Azure cluster configuration. At a minimum, you only need to create a config.yaml file under src/ClusterBootstrap that specifies the cluster name.
Cluster Name
Cluster name must be unique, and should be specified as:
cluster_name: <your cluster name>
Authentication
If you are not building a cluster for Microsoft employee usage, you will also need to configure Authentication.
Additional configuration.
You may provide or change the specification of the deployed Azure cluster by editing config.yaml. Here is an example:
cluster_name: <unique cluster name, e.g. useanothername>
azure_cluster:
  infra_node_num: 1
  infra_vm_size: <az vm size, such as Standard_B2s>
  azure_location: eastus
  worker_node_num: 2
  nfs_node_num: 1
  nfs_data_disk_sz: 31
  nfs_data_disk_num: 2
  worker_vm_size: <az vm size, such as Standard_B2s>
  nfs_vm_size: <az vm size, such as Standard_B2s>
  nfs_local_storage_sz: 1023
  vm_image: Canonical:UbuntuServer:18.04-LTS:18.04.201910030
  nfs_vm:
    - suffix: toad
      data_disk_num: 2
      data_disk_sz_gb: 31
      data_disk_sku: Premium_LRS
      data_disk_mnt_path: /data
nfs_mnt_setup:
  - server_suffix: toad
    mnt_point:
      firstshare:
        curphysicalmountpoint: /mntdlws/nfs
        filesharename: /data/share
        mountpoints: ''
datasource: MySQL
mysql_password: <password, e.g. useanotherpw!>
WinbindServers: []
priority: regular
nfs_client_CIDR:
  node_range:
    - "192.168.0.0/16"
  samba_range:
    - "s.a.m.0/24"
master_token: <DLTS master token for generating user passwords>
activeDirectory:
  tenant: <tenant ID, usually associated with a corp, such as Microsoft>
  clientId: <AAD app ID>
  clientSecret: <AAD app secret>
domain-offset:
  <url1>: <value1>
  <url2>: <value2>
  <can also set '*'>: <value0>
repair-manager:
  portal_url: <a domain name, e.g. dltshub.mydomain.com>
  ecc_rule:
    cordon_dry_run: False
    reboot_dry_run: True
    alert_job_owners: True
    days_until_node_reboot: 5
    time_sleep_after_pausing: 30
    attempts_for_pause_resume_jobs: 10
    rest_url: http://localhost:5000
    restore_from_rule_cache_dump: True
    rule_cache_dump: /etc/RepairManager/rule-cache.json
    job_owner_email_domain: <an email domain name like microsoft.com>
  latency_rule:
    alert_expiry: 4 # In hours
  smtp:
    smtp_url: <smtp, like xxx.com:587>
    smtp_from: <email address that is used to send alert emails>
    smtp_auth_username: <username used for authentication, e.g. same as smtp_from>
    smtp_auth_password: <password for the username above>
    default_recipients: <email address that would receive alert email>
    cc: <email address that alert email would be cc to>
WebUIauthorizedGroups: []
WebUIadminGroups: ["CCSAdmins"]
WebUIregisterGroups: ["MicrosoftUsers"]
DeployAuthentications : ["Corp"]
webuiport: 80
cloud_config_nsg_rules:
  default_admin_username: core
  dev_network:
    source_addresses_prefixes:
      # These are the dev boxes of the cluster. Only machines in the IP ranges below will have access to the cluster.
      - "b.a.0.0/16"
      - "z.x.0.0/16"
  nfs_share:
    source_ips:
      # IPs that we want to share NFS storage with
      - "x.y.z.0/24"
      - "a.b.0.0/16"
  nfs_ssh:
    source_ips:
      # IPs that we want to use to ssh to NFS nodes
      - "q.w.e.0/24"
      - "r.f.0.0/16"
    port: "22"
alert-manager:
  configured: True
  alert_users: False # True if we want to send out alert email to users, default False
  smtp_url: <smtp url>
  smtp_from: <email address used to send alert emails, e.g. 'dlts-bot@microsoft.com'>
  smtp_auth_username: <email account that would send email to receivers, such as 'dlts-bot@microsoft.com'>
  smtp_auth_password: <password for the email account above>
  receiver: <email address to send alert email to>
  reaper:
    dry-run: True # change to False if we want to kill idle jobs
    restful-url: http://localhost:5000
prometheus:
  cluster_name: <the unique cluster name> # will be used in link to job detail page
watchdog:
  vc_url: <url used for listing vc, e.g. http://localhost:5000/ListVCs?userName=Administrator>
job-manager:
  notifier:
    cluster: <cluster name>
    alert-manager-url: <url like http://localhost:9093/alert-manager>
registry_credential:
  <docker registry name 1>:
    username: <docker registry username 1>
    password: <docker registry password 1>
  <docker registry name 2>:
    username: <docker registry username 2>
    password: <docker registry password 2>
- cluster_name: A name without underscores or numbers (consisting purely of lowercase letters) is recommended.
- infra_node_num: Number of infrastructure nodes for the deployment. Should be odd (1, 3, or 5). 3 infrastructure nodes tolerate 1 failure, and 5 infrastructure nodes tolerate 2 failures. However, more infrastructure nodes (and more failure tolerance) come at the cost of performance.
- worker_node_num: Number of worker nodes used for the deployment.
- vm_image: Pins the image version, in case a change to the LTS image breaks the consistency of the deployment.
- nfs_vm: Each item, identified by its suffix, describes an NFS node and overrides the default NFS specs. A server_suffix entry in nfs_mnt_setup should map to this item.
- azure_location: Azure location of the cluster. Use the following command to list all available Azure locations:
  az account list-locations
- infra_vm_size, worker_vm_size: Infrastructure and worker VM sizes. Usually, a CPU VM is used for infra_vm_size and a GPU VM for worker_vm_size. To list all available Azure VM sizes in a specific region (e.g. West US 2), run:
  az vm list-sizes --location <location, e.g. westus2>
- registry_credential: Defines your access to private Docker registries. A Docker image name consists of three parts: registry name, image name, and image tag. If your job needs an image from a private registry, specify (1) the registry name, (2) your username, and (3) your password to gain access to it.
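The recommendations in the list above can be checked before deployment. The following is a minimal sketch, not part of the deployment tooling: the function names check_config and tolerated_failures are hypothetical, the configuration is assumed to be already loaded into a Python dict, and nfs_vm is assumed to sit under azure_cluster. The failure-tolerance formula is an assumption that happens to match the figures given above (3 nodes tolerate 1 failure, 5 tolerate 2).

```python
import re


def tolerated_failures(infra_node_num: int) -> int:
    """Assumed majority-style arithmetic: n nodes tolerate (n - 1) // 2 failures."""
    return (infra_node_num - 1) // 2


def check_config(config: dict) -> list:
    """Hypothetical helper: return warnings; an empty list means nothing was flagged."""
    warnings = []

    # cluster_name: purely lowercase letters is recommended.
    name = config.get("cluster_name", "")
    if not re.fullmatch(r"[a-z]+", name):
        warnings.append("cluster_name: lowercase letters only is recommended")

    # infra_node_num: optional, but should be 1, 3 or 5 when present.
    azure = config.get("azure_cluster", {})  # assumed location of the settings
    infra = azure.get("infra_node_num")
    if infra is not None and infra not in (1, 3, 5):
        warnings.append("infra_node_num: should be 1, 3 or 5")

    # Every server_suffix in nfs_mnt_setup should map to an nfs_vm item.
    vm_suffixes = {vm.get("suffix") for vm in azure.get("nfs_vm", [])}
    for mnt in config.get("nfs_mnt_setup", []):
        suffix = mnt.get("server_suffix")
        if suffix not in vm_suffixes:
            warnings.append(
                f"nfs_mnt_setup: server_suffix '{suffix}' has no matching nfs_vm item"
            )
    return warnings


example = {
    "cluster_name": "useanothername",
    "azure_cluster": {"infra_node_num": 1, "nfs_vm": [{"suffix": "toad"}]},
    "nfs_mnt_setup": [{"server_suffix": "toad"}],
}
print(check_config(example))   # → []
print(tolerated_failures(3))   # → 1
```

The checks only produce warnings, because cluster_name formatting is a recommendation rather than a hard requirement in the text above.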