Deep Learning Notes: How to Train Neural Networks on Amazon Web Services GPUs

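In Udacity's Deep Learning Nanodegree, at least 3 of the 5 hands-on projects need a GPU to train the model. The course comes with $100 of Amazon Web Services (AWS) credit. These notes describe how to use AWS to train the models.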

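Create an Account

First, sign up for a free Amazon AWS account: Amazon Web Services Cloud

The projects use Elastic Compute Cloud (EC2), which can launch GPU-backed virtual servers. The instance type used here is p2.xlarge.

We will use this AMI (Amazon Machine Image) to define the required environment. Before using it, choose the AWS region closest to you: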

  • EU (Ireland)
  • Asia Pacific (Seoul)
  • Asia Pacific (Tokyo)
  • Asia Pacific (Sydney)
  • US East (N. Virginia)
  • US East (Ohio)
  • US West (Oregon)

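After choosing a region, open the EC2 Service Limit report and find the "Running On-Demand p2.xlarge instances" entry:

If the limit is 0, click the "Request limit increase" link on the right. Raising the limit is free; you are only charged while an instance is running.

The limit increase form asks for:

  • Region: the AWS region chosen in the earlier step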
  • Primary Instance Type: p2.xlarge
  • Limit: Instance Limit
  • New Limit Value: 1 (more if you like)
  • Use Case Description: I would like to use GPU instances for deep learning.

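If you have never launched an AWS service before, you may receive a confirmation email.

Enter the promotional code provided by Udacity on the Billing Management Console page.

Run the Instance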

Launch an Instance

Go to the EC2 Management Console and click "Launch Instance".

Select the AMI (Amazon Machine Image)

Go to AWS Marketplace and search for Deep Learning AMI with Source Code (CUDA 8, Ubuntu).

Select the Instance Type

In Step 2: Choose an Instance Type:

  • Filter the instance list to only show “GPU compute”
  • Select the p2.xlarge instance type
  • Review and Launch

Configure the Security Group

In Step 7: Review Instance Launch, click "Edit security groups".

On the "Configure Security Group" page:

  • Select "Create a new security group"
  • Set the "Security group name" (e.g. "Jupyter")
  • Click "Add Rule"
  • Set a "Custom TCP Rule"
    • Set the "Port Range" to "8888"
    • Select "Anywhere" as the "Source"
  • Click "Review and Launch" (again)
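Once the instance is up and Jupyter is running, one way to confirm that the port 8888 rule took effect is to try a TCP connection from your local machine. The following is a minimal sketch using only the Python standard library; the function name `port_is_open` is made up for this note, and a `False` result can also mean that nothing is listening on the port yet.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection raises OSError on refusal, timeout, or DNS failure.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

For example, `port_is_open("X.X.X.X", 8888)`, where X.X.X.X is the public IP of your EC2 instance.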

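Create an Authentication Key Pair

Select "Create a new key pair" and click the "Download Key Pair" button. Download the .pem file and keep it safe; you will need it to connect to the instance.

After the download completes, click the "Launch Instances" button.

Set Up a Billing Alert

From this point on, once the EC2 instance is launched, AWS starts charging. Rates are listed on the EC2 On-Demand Pricing page.

p2.xlarge: $0.9 per hour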

Most importantly, remember to “stop” (i.e. shutdown) your instances when you are not using them. Otherwise, your instances might run for a day, week, month, or longer without you remembering, and you’ll wind up with a large bill!
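To put numbers on that warning, here is a small sketch of the arithmetic. It assumes the $0.9 hourly rate and the $100 course credit mentioned in this post, and it counts compute time only; the function names are invented for illustration, and the current rate should be checked on the pricing page.

```python
HOURLY_RATE_USD = 0.90  # p2.xlarge rate quoted in this post (assumption: still current)
CREDIT_USD = 100.0      # AWS credit bundled with the course


def cost_usd(hours: float, rate: float = HOURLY_RATE_USD) -> float:
    """Compute-only cost of running one instance for the given number of hours."""
    return round(hours * rate, 2)


def hours_covered(credit: float = CREDIT_USD, rate: float = HOURLY_RATE_USD) -> float:
    """How many instance hours the credit pays for."""
    return credit / rate


if __name__ == "__main__":
    print(cost_usd(24))               # an instance left running for one day
    print(cost_usd(24 * 30))          # an instance left running for 30 days
    print(round(hours_covered(), 1))  # hours of training the credit covers
```

At this rate a forgotten instance costs $21.60 per day, and the credit covers roughly 111 hours of running time.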

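Log In to the Cloud Server

Once the instance is running, go to the directory where the .pem file is saved and run the following command (the IP is the one shown in the console, and it changes on every start):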

ssh -i DLND.pem ubuntu@13.115.162.209

At this point an error appears:

@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@         WARNING: UNPROTECTED PRIVATE KEY FILE!          @
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
Permissions 0644 for 'DLND.pem' are too open.
It is required that your private key files are NOT accessible by others.
This private key will be ignored.
Load key "DLND.pem": bad permissions
ubuntu@13.115.162.209: Permission denied (publickey).

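I found the answer in: Troubleshooting connecting to your instance - Amazon Elastic Compute Cloud

Your key must not be publicly viewable for SSH to work. To fix this error, run the following command: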

chmod 400 DLND.pem
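The error message says the key must not be accessible by others, which in terms of permission bits means that group and other must have no access at all. A minimal Python sketch of that check and of the `chmod 400` fix follows; the helper names are invented for this note, and it assumes a POSIX file system.

```python
import os
import stat


def key_is_too_open(path: str) -> bool:
    """True if group or others hold any permission bit on the file."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    return bool(mode & (stat.S_IRWXG | stat.S_IRWXO))


def restrict_key(path: str) -> None:
    """Same effect as `chmod 400 path`: read-only for the owner, nothing else."""
    os.chmod(path, stat.S_IRUSR)
```

With the 0644 permissions from the error above, `key_is_too_open("DLND.pem")` returns True; after `restrict_key("DLND.pem")` it returns False.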

Configure Jupyter Notebook

After connecting to the server, run the following command to create the Jupyter notebook configuration file:

jupyter notebook --generate-config

The server responds:

Writing default config to: /home/ubuntu/.jupyter/jupyter_notebook_config.py

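Then change the notebook's IP address setting so that it listens on all interfaces. The replacement drops the leading # so the setting is actually active:

sed -i -e "s/#c.NotebookApp.ip = 'localhost'/c.NotebookApp.ip = '*'/g" ~/.jupyter/jupyter_notebook_config.py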
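For reference, the intended edit (uncomment `c.NotebookApp.ip` and set it to `'*'`) can also be sketched in Python. This is only an illustration on a sample string: the function name and the sample text are made up, and applying it for real would mean reading and rewriting ~/.jupyter/jupyter_notebook_config.py.

```python
import re

# Matches a commented-out c.NotebookApp.ip line, whatever its current value.
PATTERN = re.compile(r"^#[ \t]*c\.NotebookApp\.ip[ \t]*=.*$", flags=re.MULTILINE)


def enable_all_interfaces(config_text: str) -> str:
    """Return config_text with c.NotebookApp.ip uncommented and set to '*'."""
    return PATTERN.sub("c.NotebookApp.ip = '*'", config_text)


# Hypothetical excerpt of a generated config file.
sample = (
    "# The IP address the notebook server will listen on.\n"
    "#c.NotebookApp.ip = 'localhost'\n"
    "#c.NotebookApp.port = 8888\n"
)

if __name__ == "__main__":
    print(enable_all_interfaces(sample))
```

Only the `ip` line changes; other commented settings such as the port are left untouched.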

Test the Instance

On the EC2 instance

  • Clone a GitHub repository
    git clone https://github.com/udacity/aind2-cnn.git
  • Enter the repo directory
    cd aind2-cnn
  • Install the requirements
    sudo python3 -m pip install -r requirements/requirements-gpu.txt
  • Start Jupyter notebook
    jupyter notebook --ip=0.0.0.0 --no-browser

From your local machine

  • You will need the token generated by your jupyter notebook to access it. On your instance terminal, there will be the following line: Copy/paste this URL into your browser when you connect for the first time, to login with a token:. Copy everything starting with the :8888/?token=.

    • http://13.115.162.209:8888/?token=94e72e170ca3fdbe1cd7c58a3fd898e9533e740beb6070fa
  • Access the Jupyter notebook index from your web browser by visiting: X.X.X.X:8888/?token=... (where X.X.X.X is the IP address of your EC2 instance and everything starting with :8888/?token= is what you just copied).

  • Click on "mnist_mlp" to enter the folder, and select the "mnist_mlp.ipynb" notebook.

  • Run each cell in the notebook.
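The URL step above amounts to swapping the host in the printed URL for the public IP of the instance while keeping the port and token. A minimal sketch with the standard library follows; it assumes the printed URL uses a host such as `localhost`, and the function name is invented for this note.

```python
from urllib.parse import urlsplit, urlunsplit


def notebook_url(printed_url: str, public_ip: str) -> str:
    """Replace the host of the URL printed by Jupyter with the instance's public IP."""
    parts = urlsplit(printed_url)
    port = parts.port or 8888  # fall back to the port opened in the security group
    netloc = f"{public_ip}:{port}"
    return urlunsplit((parts.scheme or "http", netloc, parts.path or "/", parts.query, ""))


if __name__ == "__main__":
    printed = "http://localhost:8888/?token=94e72e170ca3fdbe1cd7c58a3fd898e9533e740beb6070fa"
    print(notebook_url(printed, "13.115.162.209"))
```

The token in the example is the one shown earlier in this post; yours will differ on every start.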

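When you are done experimenting, remember to stop the instance.

Create a New Environment

Reference: Deep Learning Prerequisites and FAQ - DLND: Deep Learning Nanodegree - Udacity forum

Install conda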

wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh

bash Miniconda3-latest-Linux-x86_64.sh

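Create the environment (Python 3.6, to match the cp36 TensorFlow wheel installed below)

conda create -n dlnd python=3.6

Activate the environment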

source activate dlnd

Install TensorFlow

pip install --ignore-installed https://storage.googleapis.com/tensorflow/linux/gpu/tensorflow_gpu-1.1.0-cp36-cp36m-linux_x86_64.whl
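The `cp36-cp36m` part of the wheel filename means the wheel is built for CPython 3.6, so pip will refuse it under any other Python version. A small sketch for reading that tag and comparing it with an interpreter version follows; the helper names are made up for this note, and the regular expression only handles plain `cpXY` tags like the one in this URL.

```python
import re
import sys

WHEEL_URL = (
    "https://storage.googleapis.com/tensorflow/linux/gpu/"
    "tensorflow_gpu-1.1.0-cp36-cp36m-linux_x86_64.whl"
)


def wheel_cpython_version(wheel_name: str) -> tuple:
    """Extract (major, minor) from a CPython tag such as 'cp36' in a wheel filename."""
    match = re.search(r"-cp(\d)(\d+)-", wheel_name)
    if match is None:
        raise ValueError("no CPython tag found in " + wheel_name)
    return int(match.group(1)), int(match.group(2))


def matches_interpreter(wheel_name: str, version=sys.version_info[:2]) -> bool:
    """True if the wheel's CPython tag equals the given (major, minor) version."""
    return wheel_cpython_version(wheel_name) == tuple(version)


if __name__ == "__main__":
    print(wheel_cpython_version(WHEEL_URL))
    print(matches_interpreter(WHEEL_URL))
```

Run inside the activated environment, `matches_interpreter(WHEEL_URL)` tells you whether the interpreter there is the version the wheel expects.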

Re-entering the environment next time