Singularity Configuration

September 23, 2020

Singularity (Service) is configured by DropWizard via a YAML file referenced on the command line. Top-level configuration elements reside at the root of the configuration file alongside DropWizard configuration.

Root Configuration

Common Configuration

These are settings that are more likely to be altered.

General

| Parameter | Default | Description | Type |
|-----------|---------|-------------|------|
| allowRequestsWithoutOwners | true | If false, submitting a request without at least one owner will return a 400 | boolean |
| commonHostnameSuffixToOmit | null | If specified, will remove this hostname suffix from all taskIds | string |
| defaultAgentPlacement | GREEDY | See Agent Placement | enum / string [GREEDY, OPTIMISTIC, SEPARATE (deprecated), SEPARATE_BY_DEPLOY, SEPARATE_BY_REQUEST, SPREAD_ALL_AGENTS] |
| defaultValueForKillTasksOfPausedRequests | true | When a request is paused, the API allows for the tasks of that request to optionally not be killed. If that parameter is not set in the pause request, this value is used | boolean |
| deltaAfterWhichTasksAreLateMillis | 30000 (30 seconds) | The amount of time after a task's schedule time that Singularity will classify it (in state API and dashboard) as a late task | long |
| deployHealthyBySeconds | 120 | Default amount of time to allow pending deploys to run for before transitioning them into active deploys. If more than this time passes before a deploy can be considered healthy (all of its tasks either make it to TASK_RUNNING or pass healthchecks), then the deploy will be rejected | long |
| killNonLongRunningTasksInCleanupAfterSeconds | 86400 (1 day) | Kills scheduled and one-off tasks after this amount of time if they have been scheduled for cleaning (a new deploy succeeds, the underlying agent is decommissioned) | long |
| hostname | null | Hostname of this Singularity instance | string |
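
As a rough sketch of how root-level settings sit beside DropWizard's own configuration, a file might start like the fragment below. The `server` block uses standard DropWizard keys; the port, context path, and hostname are illustrative assumptions, not values taken from this document.

```yaml
# DropWizard settings live at the root of the file (illustrative values).
server:
  type: simple
  applicationContextPath: /singularity
  connector:
    type: http
    port: 7099

# Singularity root-level settings, shown with their documented defaults.
allowRequestsWithoutOwners: true
defaultAgentPlacement: GREEDY
deployHealthyBySeconds: 120
killNonLongRunningTasksInCleanupAfterSeconds: 86400

# Hypothetical hostname for this Singularity instance.
hostname: singularity.example.com
```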

Healthchecks and New Task Checks

| Parameter | Default | Description | Type |
|-----------|---------|-------------|------|
| considerTaskHealthyAfterRunningForSeconds | 5 | Tasks which make it to TASK_RUNNING and run for at least this long (that are not health-checked) are considered healthy | long |
| healthcheckIntervalSeconds | 5 | Default amount of time to wait in between attempting task healthchecks | int |
| healthcheckTimeoutSeconds | 5 | Default amount of time to wait for healthchecks to return before considering them failed | int |
| killAfterTasksDoNotRunDefaultSeconds | 600 (10 minutes) | Amount of time after which new tasks (that are not part of a deploy) will be killed if they do not enter TASK_RUNNING | long |
| healthcheckMaxRetries | | Default max number of times to retry a failed healthcheck for a task before considering the task to be unhealthy | int |
| startupDelaySeconds | | By default, wait this long before starting any healthchecks on a task | int |
| startupTimeoutSeconds | 45 | If a healthchecked task has not responded with a valid HTTP response in startupTimeoutSeconds, consider it unhealthy | int |
| startupIntervalSeconds | 2 | In the startup period (before a valid HTTP response has been received), wait this long between healthcheck attempts | int |
| healthcheckFailureStatusCodes | [] | If any of these status codes is received during a healthcheck, immediately consider the task unhealthy and do not retry the check | List |
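
A minimal sketch of how these healthcheck settings might appear at the root of the configuration file. The interval and timeout values are the documented defaults; the retry count and status codes are hypothetical choices for illustration only.

```yaml
# Documented defaults.
healthcheckIntervalSeconds: 5
healthcheckTimeoutSeconds: 5
startupTimeoutSeconds: 45
startupIntervalSeconds: 2

# Hypothetical values: no default is listed above for healthcheckMaxRetries,
# and the status code list is empty by default.
healthcheckMaxRetries: 3
healthcheckFailureStatusCodes: [500, 503]
```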

Deploys

| Parameter | Default | Description | Type |
|-----------|---------|-------------|------|
| defaultDeployStepWaitTimeMs | 0 | If using an incremental deploy, wait this long between deploy steps if not specified in the deploy | int |
| defaultDeployMaxTaskRetries | 0 | Allow this many tasks to fail and be retried before failing a new deploy | int |
| allowDeployOfPausedRequests | false | If true, paused requests can be deployed without unpausing or starting new tasks at deploy time | boolean |

Limits

| Parameter | Default | Description | Type |
|-----------|---------|-------------|------|
| maxDeployIdSize | 50 | Deploy ids over this size will cause deploy requests to fail with 400 | int |
| maxRequestIdSize | 100 | Request ids over this size will cause new requests to fail with 400 | int |

Cooldown

Cooldown is divided into two types, fast and slow. These are two sets of thresholds: the fast set acts quickly when failures are rapid, while the slow set still provides a notification/signal when failures are slow but repeated.

| Parameter | Default | Description | Type |
|-----------|---------|-------------|------|
| fastFailureCooldownCount / slowFailureCooldownCount | 3 / 5 | The number of sequential failures after which a request is placed into system cooldown | int |
| fastFailureCooldownMs / slowFailureCooldownMs | 30000 / 600000 | The time window during which ...CooldownCount failures must occur | long |
| fastCooldownExpiresMinutesWithoutFailure / slowCooldownExpiresMinutesWithoutFailure | 5 / 5 | If there are no failures after this time period, the request will exit cooldown | int |
| cooldownMinScheduleSeconds | 120 | When a request enters cooldown, new tasks are delayed by at least this long | long |
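
Written out as separate keys, the fast and slow thresholds might look like the sketch below. It assumes the paired defaults in the table are listed in fast/slow order; the comments restate the table rather than describe verified scheduler behavior.

```yaml
# Fast thresholds: 3 sequential failures within a 30 second window.
fastFailureCooldownCount: 3
fastFailureCooldownMs: 30000
fastCooldownExpiresMinutesWithoutFailure: 5

# Slow thresholds: 5 sequential failures within a 10 minute window.
slowFailureCooldownCount: 5
slowFailureCooldownMs: 600000
slowCooldownExpiresMinutesWithoutFailure: 5

# New tasks for a request in cooldown are delayed by at least this long.
cooldownMinScheduleSeconds: 120
```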

Load Balancer API

| Parameter | Default | Description | Type |
|-----------|---------|-------------|------|
| loadBalancerQueryParams | null | Additional query parameters to pass to the Load Balancer API | Map<String, String> |
| loadBalancerRequestTimeoutMillis | 2000 | The timeout for making API calls to the Load Balancer API (these will be retried) | long |
| loadBalancerUri | null | The URI of the Load Balancer API (Baragon) | string |
| deleteRemovedRequestsFromLoadBalancer | false | If a request is removed from Singularity, issue a DELETE to the load balancer for that service | boolean |
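
A hedged example of wiring Singularity to a load balancer API. The URI and the query parameter name and value are hypothetical placeholders; check the documentation of your load balancer service for the exact endpoint and parameters it expects.

```yaml
# Hypothetical endpoint of the Load Balancer API.
loadBalancerUri: http://baragon.example.com:8080/baragon/v2/request

# Hypothetical extra query parameter sent with every call.
loadBalancerQueryParams:
  authkey: replace-with-your-key

# Documented default timeout in milliseconds.
loadBalancerRequestTimeoutMillis: 2000

# Non-default: also clean up the load balancer when a request is removed.
deleteRemovedRequestsFromLoadBalancer: true
```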

User Interface

| Parameter | Default | Description | Type |
|-----------|---------|-------------|------|
| sandboxDefaultsToTaskId | false | If true, the Singularity API will return the sandbox view of root/taskId when queried without a path (useful when using SingularityExecutor) | boolean |
| enableCorsFilter | false | If true, provides a Bundle which will enable CORS | boolean |

Internal Scheduler Configuration

These settings are less likely to be changed, but were included in the configuration instead of hardcoding values.

Pollers

| Parameter | Default | Description | Type |
|-----------|---------|-------------|------|
| checkDeploysEverySeconds | 5 | Check the status (health) of pending deploys, promoting them to active or removing them on this interval | long |
| checkNewTasksEverySeconds | 5 | Check the health of new (non-deployed, non-healthchecked) tasks to make sure they eventually get to running on this interval | long |
| checkSchedulerEverySeconds | 5 | Runs scheduler checks (processes decommissions and pending queue) on this interval (these tasks also run when an offer is received) | long |
| checkWebhooksEveryMillis | 10000 (10 seconds) | Will check for and send new queued webhooks on this interval | long |
| cleanupEverySeconds | 5 | Will clean up request, task, and other queues on this interval | long |
| persistHistoryEverySeconds | 3600 (1 hour) | Moves stale historical task data from ZooKeeper into the database; setting to 0 will disable history persistence | long |
| saveStateEverySeconds | 60 | State about this Singularity instance is saved (available over API) on this interval | long |
| checkJobsEveryMillis | 600000 (10 mins) | Check for jobs running longer than the expected time on this interval | long |
| checkExpiringUserActionEveryMillis | 45000 | Check for expiring actions that should be expired on this interval | long |

Mesos

| Parameter | Default | Description | Type |
|-----------|---------|-------------|------|
| checkReconcileWhenRunningEveryMillis | 30000 (30 seconds) | When reconciling tasks, will re-request task updates on this interval until reconciliation finishes | long |
| startNewReconcileEverySeconds | 600 (10 minutes) | Starts a new reconciliation cycle (if one is not currently running) on this interval (a relatively costly operation that detects updates Mesos failed to deliver) | long |
| askDriverToKillTasksAgainAfterMillis | 300000 (5 minutes) | Amount of time to wait before instructing Mesos again to kill a task which has been killed by Singularity but is still running | long |

Thread Pools

| Parameter | Default | Description | Type |
|-----------|---------|-------------|------|
| checkNewTasksScheduledThreads | 3 | Max number of threads to use to check new tasks | int |
| healthcheckStartThreads | 3 | Max number of threads to use to start healthchecks | int |
| logFetchMaxThreads | 15 | Max number of threads to use to fetch log directories from Mesos REST API | int |

Operational

| Parameter | Default | Description | Type |
|-----------|---------|-------------|------|
| closeWaitSeconds | 5 | Will wait at least this many seconds when shutting down thread pools | long |
| compressLargeDataObjects | true | Will compress larger objects inside of ZooKeeper and the database | boolean |
| maxHealthcheckResponseBodyBytes | 8192 | Number of bytes to save from healthcheck responses (displayed in UI) | int |
| maxQueuedUpdatesPerWebhook | 50 | Max number of updates to queue for a given webhook url, after which some webhooks will not be delivered | int |
| zookeeperAsyncTimeout | 5000 | Milliseconds for ZooKeeper timeout. Calls to ZooKeeper which take over this timeout will cause the operations to fail and Singularity to abort | long |
| cacheStateForMillis | 30000 (30 seconds) | Amount of time to cache internal state for when requested over API | long |
| sandboxHttpTimeoutMillis | 5000 (5 seconds) | Sandbox HTTP calls will time out after this amount of time (fetching logs for emails / UI) | |
| newTaskCheckerBaseDelaySeconds | 1 | Added to the delay to wait before checking a new task | long |
| allowTestResourceCalls | false | If true, allows calls to be made to the test resource, which can test internal methods | boolean |
| deleteDeploysFromZkWhenNoDatabaseAfterHours | 336 (14 days) | Delete deploys from zk when they are older than this if we are not using a database | long |
| maxStaleDeploysPerRequestInZkWhenNoDatabase | infinite (disabled) | Delete oldest deploys from zk when there are more than this number for a given request, if we're not already persisting them to a database | int |
| deleteStaleRequestsFromZkWhenNoDatabaseAfterHours | 336 (14 days) | Delete stale requests after this amount of time if we are not using a database | long |
| maxRequestsWithHistoryInZkWhenNoDatabase | infinite (disabled) | Delete history of oldest requests from zk when there are more than this number of requests, if we're not already persisting them to a database | int |
| deleteTasksFromZkWhenNoDatabaseAfterHours | 168 (7 days) | Delete old tasks from zk after this amount of time if we are not using a database | long |
| maxStaleTasksPerRequestInZkWhenNoDatabase | infinite (disabled) | Delete oldest tasks from zk when there are more than this number for a given request, if we're not already persisting them to a database | int |
| taskPersistAfterStartupBufferMillis | 60000 (1 min) | Wait this long after a task starts before persisting it in history | long |
| deleteDeadAgentsAfterHours | 168 (7 days) | Remove dead agents from the list after this amount of time | long |
| deleteUndeliverableWebhooksAfterHours | 168 (7 days) | Delete (and stop retrying) failed webhooks after this amount of time | long |
| waitForListeners | true | If true, the event system waits for all listeners to finish processing an event | boolean |
| warnIfScheduledJobIsRunningForAtLeastMillis | 86400000 (1 day) | Warn if a scheduled job has been running for this long | long |
| warnIfScheduledJobIsRunningPastNextRunPct | 200 | Warn if a scheduled job has run this much past its next scheduled run time (e.g. 200 => ran through next two run times) | int |
| pendingDeployHoldTaskDuringDecommissionMillis | 600000 (10 minutes) | Don't kill tasks on a decommissioning agent that are part of a pending deploy for this amount of time, to allow the deploy to complete | long |
| defaultBounceExpirationMinutes | 60 | Expire a bounce after this many minutes if an expiration is not provided in the request to bounce | int |
| cacheOffers | false | Hold on to unused offers for up to cacheOffersForMillis | boolean |
| cacheOffersForMillis | | If cacheOffers is true, decline offers after this amount of time if they have not been used | long |
| offerCacheSize | | The maximum number of offers to cache at once | int |
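
The three offer-cache settings work together, so a sketch of enabling the cache may help. No defaults are listed above for `cacheOffersForMillis` or `offerCacheSize`, so the numbers here are hypothetical and should be tuned for your cluster.

```yaml
# Hold on to unused offers instead of declining them immediately.
cacheOffers: true

# Hypothetical: decline a cached offer if it is still unused after 1 minute.
cacheOffersForMillis: 60000

# Hypothetical: keep at most 100 offers in the cache at once.
offerCacheSize: 100
```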

Mesos Configuration

These settings should live under the "mesos" field inside the root configuration.

Framework

| Parameter | Default | Description | Type |
|-----------|---------|-------------|------|
| master | null | A comma-separated list of Mesos masters in the form http(s)://user:password@host:port. The user and password are optional, and http is used if no protocol is provided | String |
| frameworkName | null | | String |
| frameworkId | null | | String |
| frameworkFailoverTimeout | 0.0 | | double |
| frameworkRole | null | Specify the framework's desired role when Singularity registers with the master | String |
| checkpoint | true | | boolean |
| credentialPrincipal | | Used to enable authorization based on the authenticated principal | String |
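
For illustration, a `mesos` block using the master list format described above might look like the following. The hostnames, port, framework name, and framework id are hypothetical placeholders.

```yaml
mesos:
  # Comma-separated masters; http is assumed when no protocol is given.
  master: http://mesos-master-1.example.com:5050,http://mesos-master-2.example.com:5050

  # Hypothetical framework identity.
  frameworkName: Singularity
  frameworkId: Singularity

  # Documented defaults.
  frameworkFailoverTimeout: 0.0
  checkpoint: true
```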

Resource Limits

| Parameter | Default | Description | Type |
|-----------|---------|-------------|------|
| defaultCpus | 1 | Number of CPUs to request for a task if none are specified | int |
| defaultMemory | 64 | MB of memory to request for a task if none is specified | int |
| defaultDisk | 1024 | MB of disk to request for a task if none is specified | int |
| maxNumInstancesPerRequest | 25 | Max instances (tasks) to allow for a request (requests using over this will return a 400) | int |
| maxNumCpusPerInstance | 50 | Max number of CPUs allowed on a given task | int |
| maxNumCpusPerRequest | 900 | Max number of CPUs allowed for a given request (cpus per task * task instances) | int |
| maxMemoryMbPerInstance | 24000 | Max MB of memory allowed on a given task | int |
| maxMemoryMbPerRequest | 450000 | Max MB of memory allowed for a given request (memoryMb per task * task instances) | int |

Racks

| Parameter | Default | Description | Type |
|-----------|---------|-------------|------|
| rackIdAttributeKey | rackid | The Mesos agent attribute to denote a rack | string |
| defaultRackId | DEFAULT | The rackId to assign to an agent if no rackId attribute value is present | string |

Agents

| Parameter | Default | Description | Type |
|-----------|---------|-------------|------|
| agentHttpPort | 5051 | The port to talk to agents on | int |
| agentHttpsPort | absent | The HTTPS port to talk to agents on | Integer (Optional) |

Offers

| Parameter | Default | Description | Type |
|-----------|---------|-------------|------|
| allocatedResourceWeight | 0.5 | This portion of an offer's score depends on the amount of resources currently allocated by Mesos on the Mesos agent | double |
| inUseResourceWeight | 0.5 | This portion of an offer's score depends on the currently used resources on a Mesos agent as reported by the agent statistics endpoint | double |
| cpuWeight | 0.4 | The weight the agent's cpu carries when scoring an offer | double |
| memWeight | 0.4 | The weight the agent's memory carries when scoring an offer | double |
| diskWeight | 0.2 | The weight the agent's disk carries when scoring an offer | double |
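
A sketch of adjusting the offer scoring weights, assuming these settings sit under `mesos` like the rest of this section. The documented defaults in each group add up to 1.0 (0.5 + 0.5, and 0.4 + 0.4 + 0.2); the hypothetical values below keep that property, which is an assumption here rather than a documented requirement.

```yaml
mesos:
  # Documented defaults: allocated and in-use resources count equally.
  allocatedResourceWeight: 0.5
  inUseResourceWeight: 0.5

  # Hypothetical: favor agents with free CPU over free memory.
  cpuWeight: 0.5
  memWeight: 0.3
  diskWeight: 0.2
```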

Database

| Parameter | Default | Description | Type |
|-----------|---------|-------------|------|
| database | | The database connection for SingularityService follows the DropWizard DataSourceFactory format | DataSourceFactory |
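
Since the format is DropWizard's `DataSourceFactory`, a `database` block might be sketched as below. The key names are standard `DataSourceFactory` properties; the driver class, credentials, and JDBC URL are hypothetical and depend on the database you use.

```yaml
database:
  # Hypothetical JDBC driver and connection details.
  driverClass: com.mysql.jdbc.Driver
  user: singularity
  password: replace-with-your-password
  url: jdbc:mysql://db.example.com:3306/singularity
```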

Network Configuration

These settings should live under the "network" field of the root configuration.

| Parameter | Default | Description | Type |
|-----------|---------|-------------|------|
| defaultPortMapping | false | If no port mapping is provided, map all Mesos-provided ports to the host | boolean |

History Purging

These settings live under the "historyPurging" field in the root configuration.

| Parameter | Default | Description | Type |
|-----------|---------|-------------|------|
| deleteTaskHistoryAfterDays | 365 | Purge tasks older than this many days | int |
| deleteTaskHistoryAfterTasksPerRequest | 10000 | Purge oldest tasks when there are more than this many associated with a single request | int |
| deleteTaskHistoryBytesInsteadOfEntireRow | true | Only delete the taskHistoryBytes instead of the entire record of the task (e.g. to save space) | boolean |
| checkTaskHistoryEveryHours | 24 | Run the purge every x hours | int |
| enabled | false | Whether to run the database purge | boolean |
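
Purging is off by default, so a minimal sketch of turning it on may be useful. It assumes the field is spelled `historyPurging`; every value except `enabled` is the documented default.

```yaml
historyPurging:
  # Non-default: actually run the purge.
  enabled: true

  # Documented defaults.
  deleteTaskHistoryAfterDays: 365
  deleteTaskHistoryAfterTasksPerRequest: 10000
  deleteTaskHistoryBytesInsteadOfEntireRow: true
  checkTaskHistoryEveryHours: 24
```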

S3

These settings live under the "s3" field in the root configuration. If using the SingularityS3Uploader, this section will need to be provided in order to view lists of and download s3 logs from the SingularityUI.

| Parameter | Default | Description | Type |
|-----------|---------|-------------|------|
| maxS3Thread | 3 | Max threads to run for fetching logs from s3 | int |
| waitForS3ListSeconds | 5 | Timeout in seconds for fetching list of s3 logs | int |
| waitForS3LinksSeconds | 1 | Timeout in seconds for creating new s3 links | int |
| expireS3LinksAfterMillis | 86400000 (1 day) | Expire generated s3 log links after this amount of time | long |
| s3Bucket | | S3 bucket to search for logs | String |
| groupOverrides | | Extra s3 configurations provided such that individual requests may use separate s3 buckets. Each S3GroupOverrideConfiguration has a name specified by the Map key and consists of an s3Bucket, s3AccessKey, and s3SecretKey | Map<String, S3GroupOverrideConfiguration> |
| s3KeyFormat | | Search for logs with keys in this format; should be the same as the key format set in the SingularityS3Uploader | String |
| s3AccessKey | | AWS access key for the specified s3 bucket | String |
| s3SecretKey | | AWS secret key for the specified s3 bucket | String |
| missingTaskDefaultS3SearchPeriodMillis | 259200000 (3 days) | Search this far back for s3 logs when no task data is found | long |
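
An illustrative `s3` block with one group override follows. The bucket names, the group name `analytics`, the credentials, and the key format string are all hypothetical placeholders; the key format in particular should be copied from your SingularityS3Uploader configuration rather than from this sketch.

```yaml
s3:
  # Hypothetical bucket and credentials for the default log location.
  s3Bucket: example-singularity-logs
  s3AccessKey: replace-with-access-key
  s3SecretKey: replace-with-secret-key

  # Placeholder: must match the key format configured in the uploader.
  s3KeyFormat: "replace-with-the-uploader-key-format"

  # Hypothetical override: the map key ("analytics") names the group.
  groupOverrides:
    analytics:
      s3Bucket: example-analytics-logs
      s3AccessKey: replace-with-access-key
      s3SecretKey: replace-with-secret-key
```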

Sentry

These settings live under the "sentry" field in the root config and enable Singularity error reporting to sentry.

| Parameter | Default | Description | Type |
|-----------|---------|-------------|------|
| dsn | | Sentry DSN (Data Source Name) | String |
| prefix | "" | Prefix string for event culprit naming and messages | String |

SMTP

These settings live under the "smtp" field in the root config.

| Parameter | Default | Description | Type |
|-----------|---------|-------------|------|
| username | | SMTP username | String |
| password | | SMTP password | String |
| taskLogLength | 512 | Send this many lines of a task's log in emails | int |
| host | localhost | Host for SMTP session | String |
| port | 25 | Port for SMTP session | int |
| from | "singularity-no-reply@example.com" | Send emails from this address | String |
| mailMaxThreads | 3 | Max threads for email sending process | int |
| admins | [] | List of admin user emails | List<String> |
| rateLimitAfterNotifications | 5 | Rate limit email sending after this many notifications have been sent in rateLimitPeriodMillis | int |
| rateLimitPeriodMillis | 600000 (10 mins) | Time period for rateLimitAfterNotifications | long |
| rateLimitCooldownMillis | 3600000 (1 hour) | Cooldown time before rate limiting is removed | long |
| taskEmailTailFiles | [stdout, stderr] | Send the tail of these files in messages about tasks | List<String> |
| emails | See below | See below | Map<EmailType, List<EmailDestination>> |
| subjectPrefix | unset | String prepended to the email subject line | String |
| ssl | false | Connect to SMTP host over SSL | boolean |

You may need libmail-java installed on your Singularity master host in order to connect to your smtp server.
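
A minimal sketch of the connection-related part of the `smtp` block. The host, credentials, subject prefix, and admin address are hypothetical; the remaining values are the documented defaults.

```yaml
smtp:
  # Hypothetical server and credentials.
  host: smtp.example.com
  port: 25
  username: singularity
  password: replace-with-your-password
  ssl: false

  from: singularity-no-reply@example.com
  subjectPrefix: "[singularity] "   # hypothetical prefix

  # Hypothetical admin recipient, used by the ADMINS destination.
  admins:
    - ops@example.com

  # Documented defaults.
  taskLogLength: 512
  taskEmailTailFiles:
    - stdout
    - stderr
```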

Emails List

The emails list determines which recipients are notified and for which events. You can specify a map of EmailType to a list of EmailDestinations.

EmailType corresponds to the different events that can trigger emails, such as TASK_LOST or TASK_FAILED.

EmailDestination corresponds to one of OWNERS (as listed on the Singularity Request), ACTION_TAKER (the user who triggered the action causing the email update), or ADMINS (specified in the config as seen above).

An email list might look something like this:

smtp:
  emails:
    TASK_LOST:
      - OWNERS
    TASK_FAILED:
      - OWNERS
    TASK_FAILED_DECOMISSIONED:
      - OWNERS
    TASK_KILLED:
      - OWNERS
    TASK_KILLED_DECOMISSIONED:
      - OWNERS
    TASK_KILLED_UNHEALTHY:
      - OWNERS
    TASK_SCHEDULED_OVERDUE_TO_FINISH:
      - OWNERS
    TASK_FINISHED_ON_DEMAND:
      - OWNERS
    TASK_FINISHED_RUN_ONCE:
      - OWNERS
    TASK_FINISHED_SCHEDULED:
      - OWNERS
    TASK_FINISHED_LONG_RUNNING:
      - OWNERS
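
Each event can also list more than one destination. The variation below is a sketch using only the email types and destinations named above; the admin address is hypothetical.

```yaml
smtp:
  admins:
    - ops@example.com   # hypothetical address reached via ADMINS
  emails:
    # Lost tasks notify both the request owners and the admins.
    TASK_LOST:
      - OWNERS
      - ADMINS
    # Failed tasks notify the owners and whoever triggered the action.
    TASK_FAILED:
      - OWNERS
      - ACTION_TAKER
```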

UI Configuration

These settings live under the "ui" field in the root config.

| Parameter | Default | Description | Type |
|-----------|---------|-------------|------|
| title | "Singularity" | Title shown in the left of the menu bar in the UI | String |
| navColor | "" | Color for nav bar | String |
| baseUrl | | Base url where the UI will be hosted (e.g. http://localhost:7099/singularity) | String |
| runningTaskLogPath | stdout | Generate link to this log for running tasks on the request page | String |
| finishedTaskLogPath | stdout | Generate link to this log for finished tasks on the request page | String |
| hideNewDeployButton | false | Don't show the 'New Deploy' button | boolean |
| hideNewRequestButton | false | Don't show the 'New Request' button | boolean |
| rootUrlMode | INDEX_CATCHALL | INDEX_CATCHALL: UI is served off of / using a catchall resource. UI_REDIRECT: UI is served off of /ui, path and index redirects there. DISABLED: UI is served off of /ui and the root resource is not served at all | enum / String [INDEX_CATCHALL, UI_REDIRECT, DISABLED] |
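
As an example, a `ui` block that serves the UI from /ui might be sketched as follows. The base URL reuses the example from the table; the nav color is a hypothetical value.

```yaml
ui:
  title: Singularity
  baseUrl: http://localhost:7099/singularity

  # Hypothetical color to distinguish this cluster's UI.
  navColor: "#225588"

  # Serve the UI off of /ui and redirect the root path there.
  rootUrlMode: UI_REDIRECT

  # Documented defaults.
  runningTaskLogPath: stdout
  finishedTaskLogPath: stdout
```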

Zookeeper

These settings live under the "zookeeper" field in the root config.

| Parameter | Default | Description | Type |
|-----------|---------|-------------|------|
| quorum | | Comma-separated host:port list of zk hosts | String |
| sessionTimeoutMillis | 600_000 | ZooKeeper session timeout | int |
| connectTimeoutMillis | 60_000 | ZooKeeper connection timeout | int |
| retryBaseSleepTimeMilliseconds | 1_000 | Wait time between ZooKeeper connection retries | int |
| retryMaxTries | 3 | Max retries to obtain a ZooKeeper connection before aborting | int |
| zkNamespace | | Path under which to store Singularity data in zk (e.g. /singularity) | String |
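
A sketch of a `zookeeper` block for a three-node ensemble. The hostnames and port are hypothetical; the timeouts and retry count are the documented defaults, written as plain integers because the underscore separators shown above may not be accepted by every YAML parser.

```yaml
zookeeper:
  # Hypothetical ensemble members.
  quorum: zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181

  # Namespace path, following the example given above.
  zkNamespace: /singularity

  # Documented defaults.
  sessionTimeoutMillis: 600000
  connectTimeoutMillis: 60000
  retryBaseSleepTimeMilliseconds: 1000
  retryMaxTries: 3
```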