这个示例展示了如何使用 Arena 进行机器学习模型训练。它会从 git url 下载源代码,并使用 Tensorboard 可视化 Tensorflow 训练状态。
- 第一步是检查可用资源
arena top node
NAME IPADDRESS ROLE GPU(Total) GPU(Allocated)
i-j6c68vrtpvj708d9x6j0 192.168.1.116 master 0 0
i-j6c8ef8d9sqhsy950x7x 192.168.1.119 worker 1 0
i-j6c8ef8d9sqhsy950x7y 192.168.1.120 worker 1 0
i-j6c8ef8d9sqhsy950x7z 192.168.1.118 worker 1 0
i-j6ccue91mx9n2qav7qsm 192.168.1.115 master 0 0
i-j6ce09gzdig6cfcy1lwr 192.168.1.117 master 0 0
-----------------------------------------------------------------------------------------
Allocated/Total GPUs In Cluster:
0/3 (0%)
有 3 个带有 GPU 的可用节点用于运行训练作业。
2.现在,我们可以通过 arena submit 提交一个训练作业,这会从 github 下载源代码
#arena submit tf \
--name=tf-tensorboard \
--gpus=1 \
--image=tensorflow/tensorflow:1.5.0-devel-gpu \
--env=TEST_TMPDIR=code/tensorflow-sample-code/ \
--syncMode=git \
--syncSource=https://github.com/cheyang/tensorflow-sample-code.git \
--tensorboard \
--logdir=/training_logs \
"python code/tensorflow-sample-code/tfjob/docker/mnist/main.py --max_steps 5000"
configmap/tf-tensorboard-tfjob created
configmap/tf-tensorboard-tfjob labeled
service/tf-tensorboard-tensorboard created
deployment.extensions/tf-tensorboard-tensorboard created
tfjob.kubeflow.org/tf-tensorboard created
INFO[0001] The Job tf-tensorboard has been submitted successfully
INFO[0001] You can run `arena get tf-tensorboard --type tfjob` to check the job status
这会下载源代码,并将其解压缩到工作目录的
code/目录。默认的工作目录是/root,您也可以使用--workingDir加以指定。
logdir指示 Tensorboard 在何处读取 TensorFlow 的事件日志
3.列出所有作业
#arena list
NAME STATUS TRAINER AGE NODE
tf-tensorboard RUNNING TFJOB 0s 192.168.1.119
4.检查作业所使用的GPU资源
#arena top job
NAME STATUS TRAINER AGE NODE GPU(Requests) GPU(Allocated)
tf-tensorboard RUNNING TFJOB 26s 192.168.1.119 1 1
Total Allocated GPUs of Training Job:
0
Total Requested GPUs of Training Job:
1
5.检查集群所使用的GPU资源
#arena top node
NAME IPADDRESS ROLE GPU(Total) GPU(Allocated)
i-j6c68vrtpvj708d9x6j0 192.168.1.116 master 0 0
i-j6c8ef8d9sqhsy950x7x 192.168.1.119 worker 1 1
i-j6c8ef8d9sqhsy950x7y 192.168.1.120 worker 1 0
i-j6c8ef8d9sqhsy950x7z 192.168.1.118 worker 1 0
i-j6ccue91mx9n2qav7qsm 192.168.1.115 master 0 0
i-j6ce09gzdig6cfcy1lwr 192.168.1.117 master 0 0
-----------------------------------------------------------------------------------------
Allocated/Total GPUs In Cluster:
1/3 (33%)
6.获取特定作业的详细信息
#arena get tf-tensorboard
NAME STATUS TRAINER AGE INSTANCE NODE
tf-tensorboard RUNNING tfjob 15s tf-tensorboard-tfjob-586fcf4d6f-vtlxv 192.168.1.119
tf-tensorboard RUNNING tfjob 15s tf-tensorboard-tfjob-worker-0 192.168.1.119
Your tensorboard will be available on:
192.168.1.117:30670
注意:您可以使用
192.168.1.117:30670访问 Tensorboard。如果您通过笔记本电脑无法直接访问 Tensorboard,则可以考虑使用sshuttle。例如:sshuttle -r root@47.89.59.51 192.168.0.0/16

恭喜!您已经使用 arena 成功运行了训练作业,而且还能轻松检查 Tensorboard。