arena submit mpijob
Submit an MPIJob as a training job.
Synopsis
Submit an MPIJob as a training job.
arena submit mpijob [flags]
Options
  -a, --annotation stringArray     the annotations
      --cpu string                 the CPU resource to use for the training, like 1 for 1 core
  -d, --data stringArray           the data source to mount into the job, in the form <name_of_datasource>:<mount_point_on_job>
      --data-dir stringArray       the data directory. Specifying /data mounts the host path /data at the container path /data
  -e, --env stringArray            the environment variables
      --gpus int                   the number of GPUs each worker uses for the training
  -h, --help                       help for mpijob
      --image string               the Docker image name of the training job
      --logdir string              the training logs directory (default "/training_logs")
      --memory string              the memory resource to use for the training, like 1Gi
      --name string                override name
      --rdma                       enable RDMA
      --retry int                  the number of times to retry
      --sync-image string          the Docker image used for syncing
      --sync-mode string           the sync mode; supported values: rsync, hdfs, git
      --sync-source string         the sync source. For rsync, it looks like 10.88.29.56::backup/data/logoRecoTrain.zip; for git, it looks like https://github.com/kubeflow/tf-operator.git
      --tensorboard                enable TensorBoard
      --tensorboard-image string   the Docker image for TensorBoard (default "registry.cn-zhangjiakou.aliyuncs.com/tensorflow-samples/tensorflow:1.12.0-devel")
      --workers int                the number of workers to run the distributed training (default 1)
      --working-dir string         working directory to extract the code into. When a sync mode is used, the code is placed in $workingDir/code (default "/root")
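Example

The flags above can be combined as in the following dry-run sketch. It only assembles and prints a command line; the job name, image, and resource values are hypothetical and chosen purely for illustration, and the git URL is the sample value from the --sync-source description. The synopsis lists only flags, so the training command itself is left out here.

```shell
#!/bin/sh
# Dry-run sketch: build an `arena submit mpijob` command line and print it.
# The job is only submitted when RUN=1 is set in the environment.
set -eu

# Hypothetical values, for illustration only.
job_name="mpi-demo"
image="registry.example.com/demo/mpi-train:latest"

# Every flag below appears in the Options list; the values are assumptions.
set -- arena submit mpijob \
  --name="$job_name" \
  --workers=2 \
  --gpus=1 \
  --cpu=4 \
  --memory=8Gi \
  --image="$image" \
  --sync-mode=git \
  --sync-source=https://github.com/kubeflow/tf-operator.git \
  --working-dir=/root

# Keep the assembled command as one string so it can be logged or reviewed.
cmd_line="$*"
echo "$cmd_line"

# Submit only when explicitly requested.
if [ "${RUN:-0}" = "1" ]; then
  "$@"
fi
```

Printing the command first makes it easy to review the resource requests (2 workers with 1 GPU, 4 cores, and 8Gi each in this sketch) before anything reaches the cluster.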
Options inherited from parent commands
      --arena-namespace string   the namespace of the arena system services, such as tf-operator (default "arena-system")
      --config string            path to a kubeconfig file; only required when running out-of-cluster
      --loglevel string          the logging level; one of: debug|info|warn|error (default "info")
  -n, --namespace string         the namespace of the job (default "default")
      --pprof                    enable CPU profiling
      --trace                    enable trace
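Example

The inherited flags can be passed alongside the mpijob flags. The sketch below is a dry run under stated assumptions: the namespace, job name, and image are hypothetical, and the log level is checked against the values listed for --loglevel before the command line is printed.

```shell
#!/bin/sh
# Dry-run sketch: add inherited flags to an `arena submit mpijob` command line.
# The job is only submitted when RUN=1 is set in the environment.
set -eu

# Hypothetical values, for illustration only.
namespace="team-a"
log_level="debug"

# Reject anything outside the documented set: debug|info|warn|error.
case "$log_level" in
  debug|info|warn|error) ;;
  *) echo "unsupported log level: $log_level" >&2; exit 1 ;;
esac

set -- arena submit mpijob \
  --name=mpi-demo \
  --image=registry.example.com/demo/mpi-train:latest \
  --namespace="$namespace" \
  --loglevel="$log_level"

# Keep the assembled command as one string so it can be logged or reviewed.
global_cmd_line="$*"
echo "$global_cmd_line"

# Submit only when explicitly requested.
if [ "${RUN:-0}" = "1" ]; then
  "$@"
fi
```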
SEE ALSO
- arena submit - Submit a job.