configure.md

June 4, 2024

Configuration

The default value of each property can be overridden either by setting the property in $HBOX_HOME/conf/hbox-site.xml on the Hbox client, or by passing it with the --conf parameter when submitting the application.
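
As an illustration of the first option, the sketch below renders a couple of overrides as an XML document. It assumes that hbox-site.xml uses the Hadoop-style `<configuration>`/`<property>` layout that the `-site.xml` name suggests; this page does not confirm the layout. The helper name `build_site_xml` and the override values are hypothetical.

```python
import xml.etree.ElementTree as ET


def build_site_xml(overrides):
    """Render overrides as a Hadoop-style <configuration> document.

    The layout is an assumption; check it against your own hbox-site.xml.
    """
    root = ET.Element("configuration")
    for name, value in overrides.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = str(value)
    if hasattr(ET, "indent"):  # pretty-printing is available from Python 3.9
        ET.indent(root)
    return ET.tostring(root, encoding="unicode")


# Hypothetical overrides: two workers with more memory than the default.
overrides = {"hbox.worker.num": 2, "hbox.worker.memory": "2048MB"}
site_xml = build_site_xml(overrides)
print(site_xml)
```

The printed text could be saved as $HBOX_HOME/conf/hbox-site.xml, so that every application submitted from that client picks up the overrides.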

Application Configuration

| Property Name | Default | Meaning |
| --- | --- | --- |
| hbox.driver.memory | 2048 | amount of memory to use for the AM (ApplicationMaster) process, in MB |
| hbox.driver.cores | 1 | number of cores to use for the AM process |
| hbox.worker.num | 1 | number of worker containers to use for the application |
| hbox.worker.memory | 1024MB | amount of memory to use for the worker process |
| hbox.worker.cores | 1 | number of cores to use for the worker process |
| hbox.chief.worker.memory | 1024 | amount of memory for the chief worker, in particular the index 0 worker of a TensorFlow application; defaults to the worker memory setting |
| hbox.evaluator.worker.memory | 1024 | amount of memory for the evaluator worker, in particular for a TensorFlow Estimator application; defaults to the worker memory setting |
| hbox.ps.num | 0 | number of ps (parameter server) containers to use for the application |
| hbox.ps.memory | 1024MB | amount of memory to use for the ps process |
| hbox.ps.cores | 1 | number of cores to use for the ps process |
| hbox.app.queue | DEFAULT | the queue that the application is submitted to |
| hbox.app.priority | 3 | priority of the application, from level 0 to 5, corresponding to DEFAULT, VERY_LOW, LOW, NORMAL, HIGH, VERY_HIGH |
| hbox.input.strategy | DOWNLOAD | loading strategy for input files: DOWNLOAD, STREAM, or PLACEHOLDER |
| hbox.inputfile.rename | false | whether to rename downloaded files under the DOWNLOAD input strategy |
| hbox.stream.epoch | 1 | number of times the input files are loaded under the STREAM input strategy |
| hbox.input.stream.shuffle | false | whether to shuffle the input splits under the STREAM input strategy |
| hbox.inputformat.class | org.apache.hadoop.mapred.TextInputFormat.class | the inputformat implementation to use under the STREAM input strategy |
| hbox.inputformat.cache | false | whether to cache the inputformat file locally when the stream epoch is greater than 1 |
| hbox.inputformat.cachefile.name | inputformatCache.gz | name of the local cache file for the inputformat |
| hbox.inputformat.cachesize.limit | 100*1024 | size limit of the local cache file (in MB) |
| hbox.output.local.dir | output | local directory for output files, used when no local output path is specified |
| hbox.output.strategy | UPLOAD | strategy for output files: UPLOAD or STREAM |
| hbox.outputformat.class | TextMultiOutputFormat.class | the outputformat implementation to use under the STREAM output strategy |
| hbox.interresult.dir | /interResult_ | HDFS subdirectory that intermediate output files are uploaded to |
| hbox.interresult.upload.timeout | 30 * 60 * 1000 | upload timeout for saving the intermediate output (in milliseconds) |
| hbox.interresult.save.inc | false | whether to upload intermediate output files incrementally; by default all output files are uploaded each time |
| hbox.tf.evaluator | false | whether to use the last worker as the evaluator in a distributed TensorFlow job that uses the Estimator API |
| hbox.tf.distribution.strategy | false | whether to use the TensorFlow distribution strategy API |
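
The numeric levels of hbox.app.priority can be read more easily as a lookup. The sketch below only restates the mapping given in the table; the class name `AppPriority` is hypothetical and is not part of Hbox.

```python
from enum import IntEnum


class AppPriority(IntEnum):
    """Priority levels as listed for hbox.app.priority (illustrative only)."""
    DEFAULT = 0
    VERY_LOW = 1
    LOW = 2
    NORMAL = 3
    HIGH = 4
    VERY_HIGH = 5


# The documented default value is 3.
default_priority = AppPriority(3)
print(default_priority.name)
```

Under this mapping the default value of 3 corresponds to NORMAL.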

Board Service Configuration

| Property Name | Default | Meaning |
| --- | --- | --- |
| hbox.tf.board.enable | true | whether the Board service is needed; set to false to disable it |
| hbox.tf.board.worker.index | 0 | index of the worker that starts the Board service |
| hbox.tf.board.log.dir | eventLog | directory in which the TensorBoard event log is saved |
| hbox.tf.board.history.dir | /tmp/hbox/eventLog | HDFS path that the TensorBoard event log is uploaded to |
| hbox.tf.board.reload.interval | 1 | how often the TensorBoard backend loads more event log data (in seconds) |
| hbox.board.modelpb | "" | model proto in ONNX format for VisualDL |
| hbox.board.cache.timeout | 20 | memory cache timeout for VisualDL (in seconds) |
| hbox.tf.board.path | tensorboard | path of the TensorBoard executable |
| hbox.board.path | visualDL | path of the VisualDL executable |

System Configuration

| Property Name | Default | Meaning |
| --- | --- | --- |
| hbox.container.extra.java.opts | "" | a string of extra JVM options that the ApplicationMaster passes when launching a container |
| hbox.allocate.interval | 1000ms | interval at which the AM fetches the container allocation state from the RM |
| hbox.status.update.interval | 1000ms | interval at which the AM reports its state to the RM |
| hbox.task.timeout | 5 * 60 * 1000 | communication timeout between the AM and a container (in milliseconds) |
| hbox.task.timeout.check.interval | 3 * 1000 | how often the AM checks containers for timeout (in milliseconds) |
| hbox.localresource.timeout | 5 * 60 * 1000 | timeout for downloading the localResources (in milliseconds) |
| hbox.messages.len.max | 1000 | maximum size (in bytes) of the message queue |
| hbox.execute.node.limit | 200 | maximum number of nodes that an application may use |
| hbox.staging.dir | /tmp/hbox/staging | HDFS directory that the application's local resources are uploaded to |
| hbox.cleanup.enable | true | whether to delete the resources after the application finishes |
| hbox.container.maxFailures.rate | 0.5 | maximum proportion of failed containers |
| hbox.download.file.retry | 3 | maximum number of retries for an input file download under the DOWNLOAD input strategy |
| hbox.download.file.thread.nums | 10 | number of threads used to download input files under the DOWNLOAD strategy |
| hbox.upload.output.thread.nums | 10 | number of threads used to upload output files under the UPLOAD strategy |
| hbox.container.heartbeat.interval | 10 * 1000 | interval between heartbeats sent from each container to the AM (in milliseconds) |
| hbox.container.heartbeat.retry | 3 | maximum number of retries when a container sends a heartbeat to the AM |
| hbox.container.update.appstatus.interval | 3 * 1000 | how often the containers fetch the state of the application process (in milliseconds) |
| hbox.container.auto.create.output.dir | true | if set to true, the containers create the local output path automatically |
| hbox.log.pull.interval | 10000 | interval at which the client fetches the log output of the AM (in milliseconds) |
| hbox.user.classpath.first | true | whether the user job jar is placed first on the classpath |
| hbox.worker.mem.autoscale | 0.5 | automatic memory scale ratio for workers when the application retries after a failure |
| hbox.ps.mem.autoscale | 0.2 | automatic memory scale ratio for ps when the application retries after a failure |
| hbox.app.max.attempts | 1 | number of application attempts; by default the application is not retried after a failure |
| hbox.report.container.status | true | whether the client reports the status of the containers |
| hbox.env.maxlength | 102400 | maximum length of an environment variable when a container executes the user program |
| hbox.am.env.[EnvironmentVariableName] | (none) | adds the environment variable specified by EnvironmentVariableName to the AM process; specify several of these properties to set several environment variables |
| hbox.container.env.[EnvironmentVariableName] | (none) | adds the environment variable specified by EnvironmentVariableName to the Container process; specify several of these properties to set several environment variables |
| hbox.am.nodeLabelExpression | (none) | a YARN node label expression that restricts the set of nodes the AM will be scheduled on |
| hbox.worker.nodeLabelExpression | (none) | a YARN node label expression that restricts the set of nodes the Worker will be scheduled on |
| hbox.ps.nodeLabelExpression | (none) | a YARN node label expression that restricts the set of nodes the PS will be scheduled on |
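
Several defaults above are written as products of milliseconds, such as 5 * 60 * 1000. The sketch below evaluates a few of them into readable durations. The helper `parse_millis` is hypothetical, and this page does not say whether Hbox itself accepts such expressions as property values, so a plain integer is probably the safer choice when overriding them.

```python
from datetime import timedelta
from math import prod


def parse_millis(expr):
    """Evaluate a product such as '5 * 60 * 1000' into a millisecond count."""
    return prod(int(factor) for factor in expr.split("*"))


# Default expressions copied from the table above.
defaults = {
    "hbox.task.timeout": "5 * 60 * 1000",
    "hbox.task.timeout.check.interval": "3 * 1000",
    "hbox.container.heartbeat.interval": "10 * 1000",
    "hbox.log.pull.interval": "10000",
}

for name, expr in defaults.items():
    millis = parse_millis(expr)
    print(f"{name}: {millis} ms ({timedelta(milliseconds=millis)})")
```

For example, hbox.task.timeout evaluates to 300000 ms, which is five minutes.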

History Configuration

| Property Name | Default | Meaning |
| --- | --- | --- |
| hbox.history.log.dir | /tmp/hbox/history | HDFS directory in which the history log is saved |
| hbox.history.log.delete-monitor-time-interval | 24 * 60 * 60 * 1000 | interval at which the application history logs are checked for cleanup (in milliseconds) |
| hbox.history.log.max-age-ms | 24 * 60 * 60 * 1000 | how long a history log is kept (in milliseconds) |
| hbox.history.port | 10021 | port for the history service |
| hbox.history.address | 0.0.0.0:10021 | address for the history service |
| hbox.history.webapp.port | 19886 | port for the history HTTP web service |
| hbox.history.webapp.address | 0.0.0.0:19886 | address for the history HTTP web service |
| hbox.history.webapp.https.port | 19885 | port for the history HTTPS web service |
| hbox.history.webapp.https.address | 0.0.0.0:19885 | address for the history HTTPS web service |

MPI Configuration

| Property Name | Default | Meaning |
| --- | --- | --- |
| hbox.mpi.install.dir | /usr/local/openmpi | installation path of Open MPI |
| hbox.mpi.extra.ld.library.path | (none) | extra library path that Open MPI needs |
| hbox.mpi.container.update.status.retry | 3 | number of retries for the container status update |

Docker Configuration

| Property Name | Default | Meaning |
| --- | --- | --- |
| hbox.container.type | yarn | container running type |
| hbox.docker.registry.host | (none) | Docker registry host |
| hbox.docker.registry.port | (none) | Docker registry port |
| hbox.docker.image | (none) | Docker image name |
| hbox.docker.worker.dir | /work | working directory of the Docker container |