Deep Learning Notes: Training Neural Networks on AWS GPUs

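In Udacity's Deep Learning Nanodegree, at least 3 of the 5 hands-on projects need a GPU to train their models. The course comes with $100 of Amazon Web Services (AWS) credit, and these notes describe how to use AWS to train a model.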

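Create an Account

First, sign up for a free AWS account: Amazon Web Services Cloud.

The projects use Elastic Compute Cloud (EC2), which can launch GPU-backed virtual servers. The instance type used here is p2.xlarge.

An AMI (Amazon Machine Image) defines the environment we need. Before using it, choose the AWS region closest to you from the following: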

EU (Ireland)
Asia Pacific (Seoul)
Asia Pacific (Tokyo)
Asia Pacific (Sydney)
US East (N. Virginia)
US East (Ohio)
US West (Oregon)

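After choosing a region, open the EC2 Service Limit report and find the "Running On-Demand p2.xlarge instances" entry.

If the limit is 0, click the "Request limit increase" link on the right. Raising the limit is free; you are only charged while an instance is running.

The limit increase form asks for:

Region: the AWS region chosen in the previous step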
Primary Instance Type: p2.xlarge
Limit: Instance Limit
New Limit Value: 1 (more if you like)
Use Case Description: I would like to use GPU instances for deep learning.

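If you have never launched an AWS service before, you may receive a confirmation email.

On the Billing Management Console page, enter the promotional code provided by Udacity.

Run the Instance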

Launch an Instance

Go to the EC2 Management Console and click "Launch Instance".

Select the AMI (Amazon Machine Image)

Open the AWS Marketplace tab and search for Deep Learning AMI with Source Code (CUDA 8, Ubuntu).

Select the Instance Type

In Step 2: Choose an Instance Type:

Filter the instance list to only show “GPU compute”
Select the p2.xlarge instance type
Review and Launch

Configure the Security Group

In Step 7: Review Instance Launch, click "Edit security groups".

On the "Configure Security Group" page:

Select "Create a new security group"
Set the "Security group name" (e.g. "Jupyter")
Click "Add Rule"
Set a "Custom TCP Rule"
- Set the "Port Range" to "8888"
- Select "Anywhere" as the "Source"
Click "Review and Launch" (again)

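Create an Authentication Key Pair

Select "Create a new key pair" and click the "Download Key Pair" button. Download the .pem file and keep it safe; you will need it later to log in to the instance.

Once the download finishes, click the "Launch Instances" button.

Set Up Billing Alerts

From the moment this EC2 instance is launched, AWS starts charging for it. Rates are listed on the EC2 On-Demand Pricing page.

p2.xlarge: $0.90 per hour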

Most importantly, remember to “stop” (i.e. shutdown) your instances when you are not using them. Otherwise, your instances might run for a day, week, month, or longer without you remembering, and you’ll wind up with a large bill!
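To make the size of a forgotten bill concrete, here is a rough sketch of the arithmetic. It assumes the $0.90 hourly rate and the $100 credit quoted in this post; the function names are made up for illustration, and actual rates should be checked on the pricing page.

```shell
#!/usr/bin/env bash
# Assumed values taken from this post; check the pricing page for current rates.
RATE_PER_HOUR=0.9   # on-demand price of one p2.xlarge, in USD per hour
CREDIT=100          # course credit, in USD

# cost_for_hours HOURS: print the cost of running one instance for HOURS hours.
cost_for_hours() {
  awk -v rate="$RATE_PER_HOUR" -v hours="$1" 'BEGIN { printf "%.2f\n", rate * hours }'
}

# hours_for_credit: print how many whole hours the credit covers.
hours_for_credit() {
  awk -v rate="$RATE_PER_HOUR" -v credit="$CREDIT" 'BEGIN { printf "%d\n", int(credit / rate) }'
}

echo "One day (24 hours):   \$$(cost_for_hours 24)"
echo "One week (168 hours): \$$(cost_for_hours 168)"
echo "Whole hours covered by the credit: $(hours_for_credit)"
```

Under these assumptions, an instance left running for a single week already costs more than the whole credit.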

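Log In to the Cloud Server

Once the instance is running, open a terminal, change to the directory where the .pem file is saved, and run the following command (use the public IP shown in the console; it changes every time the instance is started):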

ssh -i DLND.pem ubuntu@13.115.162.209

At this point an error appears:

@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@         WARNING: UNPROTECTED PRIVATE KEY FILE!          @
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
Permissions 0644 for 'DLND.pem' are too open.
It is required that your private key files are NOT accessible by others.
This private key will be ignored.
Load key "DLND.pem": bad permissions
ubuntu@13.115.162.209: Permission denied (publickey).

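The fix is described in Troubleshooting connecting to your instance - Amazon Elastic Compute Cloud:

Your key must not be publicly viewable for SSH to work. To fix this error, run the following command, then retry the ssh command: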

chmod 400 DLND.pem
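As a small sketch of what this fix does, the snippet below applies the same chmod to a throwaway file standing in for DLND.pem and prints the mode before and after. The function name secure_key is hypothetical, and stat -c is the GNU coreutils form (on macOS the equivalent is stat -f %Lp).

```shell
#!/usr/bin/env bash
# secure_key FILE: restrict FILE to owner read-only and print the resulting mode.
secure_key() {
  local key="$1"
  chmod 400 "$key"
  # GNU stat; on macOS use: stat -f %Lp "$key"
  stat -c '%a' "$key"
}

# Demonstration on a temporary file that stands in for the real key.
demo_key="$(mktemp)"
chmod 644 "$demo_key"   # the mode ssh complained about above
echo "before: $(stat -c '%a' "$demo_key")"
echo "after:  $(secure_key "$demo_key")"
rm -f "$demo_key"
```

Mode 400 means only the file's owner can read it and nobody can write to it, which is what ssh expects of a private key.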

Configure Jupyter Notebook

After connecting to the server, run the following command to create the Jupyter notebook configuration file:

jupyter notebook --generate-config

The server responds:

Writing default config to: /home/ubuntu/.jupyter/jupyter_notebook_config.py

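Then change the notebook's IP address setting so that it accepts connections on all interfaces. The substitution also removes the leading # so that the setting is no longer commented out:

sed -i -e "s/#c.NotebookApp.ip = 'localhost'/c.NotebookApp.ip = '*'/g" ~/.jupyter/jupyter_notebook_config.py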
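A setting in this file only takes effect once its leading # is removed. The sketch below tries such a substitution on a temporary sample file rather than the real config, so the result can be inspected safely. The function name uncomment_ip and the two-line sample are made up for illustration, and sed -i is used in its GNU form.

```shell
#!/usr/bin/env bash
# uncomment_ip FILE: enable the NotebookApp.ip setting in FILE and set it to '*'.
uncomment_ip() {
  sed -i -e "s/^#c.NotebookApp.ip = 'localhost'/c.NotebookApp.ip = '*'/" "$1"
}

# Build a tiny stand-in for jupyter_notebook_config.py.
sample="$(mktemp)"
printf '%s\n' "#c.NotebookApp.ip = 'localhost'" "#c.NotebookApp.port = 8888" > "$sample"

uncomment_ip "$sample"

# Show the ip line after the edit; it should no longer start with '#'.
grep "NotebookApp.ip" "$sample"
rm -f "$sample"
```

Running the same grep against ~/.jupyter/jupyter_notebook_config.py on the instance is a quick way to confirm the real file was changed.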

Test the Instance

On the EC2 instance

Clone a GitHub repository
git clone https://github.com/udacity/aind2-cnn.git
Enter the repo directory
cd aind2-cnn
Install the requirements
sudo python3 -m pip install -r requirements/requirements-gpu.txt
Start Jupyter notebook
jupyter notebook --ip=0.0.0.0 --no-browser

From your local machine

You will need the token generated by Jupyter notebook to access it. The instance terminal prints the line "Copy/paste this URL into your browser when you connect for the first time, to login with a token:" followed by a URL. Copy everything starting from :8888/?token=. For example:
- http://13.115.162.209:8888/?token=94e72e170ca3fdbe1cd7c58a3fd898e9533e740beb6070fa
Access the Jupyter notebook index from your web browser by visiting: X.X.X.X:8888/?token=... (where X.X.X.X is the IP address of your EC2 instance and everything starting with :8888/?token= is what you just copied).
Click on "mnist_mlp" to enter the folder, and select the "mnist_mlp.ipynb" notebook.
Run each cell in the notebook.
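The URL described in the steps above can be assembled with a few lines of shell. This is only a sketch: the function names are hypothetical, the sample line uses localhost as a stand-in for whatever host Jupyter prints, and the token is assumed to be a hexadecimal string like the one shown above.

```shell
#!/usr/bin/env bash
# notebook_url IP TOKEN: print the URL to open in the local browser.
notebook_url() {
  printf 'http://%s:8888/?token=%s\n' "$1" "$2"
}

# extract_token LINE: pull the hexadecimal token out of a URL printed by Jupyter.
extract_token() {
  printf '%s\n' "$1" | sed -n 's/.*[?&]token=\([0-9a-f]*\).*/\1/p'
}

# Hypothetical line as it might appear in the instance terminal.
line='http://localhost:8888/?token=94e72e170ca3fdbe1cd7c58a3fd898e9533e740beb6070fa'

# Combine the instance's public IP with the extracted token.
notebook_url 13.115.162.209 "$(extract_token "$line")"
```

Only the host part changes between sessions on the same server run; the token changes every time Jupyter notebook is restarted.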

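When you finish experimenting, remember to stop the instance.

Create a New Environment

Reference: Deep Learning Prerequisites and FAQ - DLND: Deep Learning Nanodegree - Udacity China (优达学城) forum

Install conda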

wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh

bash Miniconda3-latest-Linux-x86_64.sh

Create the environment

conda create -n dlnd python=3.6

Activate the environment

source activate dlnd

Install TensorFlow

pip install --ignore-installed https://storage.googleapis.com/tensorflow/linux/gpu/tensorflow_gpu-1.1.0-cp36-cp36m-linux_x86_64.whl

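Re-entering the Environment Later

Start the instance from the EC2 Management Console.
On your local machine, run ssh -i DLND.pem ubuntu@X.X.X.X (where X.X.X.X is the instance's current public IP).
After connecting to the server, activate the environment: source activate dlnd
Start Jupyter notebook: jupyter notebook --ip=0.0.0.0
Open the printed URL in your browser, for example: http://52.197.226.169:8888/?token=a55e1cfbc162df6d3358e3553d220b4d269e2789df6e5ddd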