Reflecting on docker build speed

in #docker, 7 years ago

This new trendy thing called docker

Hey all, I've recently been doing some development work inside of Docker containers and thought I'd reflect on my experience with it. Docker, as you are probably aware, is a set of software tools for handling application containers, which you can think of as being somewhat like the older traditional BSD jails, or a chroot on steroids.

It's getting a lot of attention because it makes deploying containers quite simple, with support for automated builds and caching built in. Using a Dockerfile allows you to build and then run a container from version control without needing to store an entire root filesystem in your version control repo.

Where it really seems to shine is in eliminating the classic "works on my machine" class of bugs: if your code runs inside of a docker container on your machine, it should run the same in a docker container on other machines.

Why I used to hate docker - slow builds

I used to hate Docker and considered it an overengineered replacement for stuff we already have: BSD jails, chroots, LXC, Solaris zones and good old fashioned paravirtualized VMs (think Xen).

The problem, as I saw it, was that the isolation provided by Docker only really works when you rebuild your images and restart containers after changing your code, and this build process is slow if your Dockerfile is written badly. You can work around this by only putting your application's dependencies into a standard image and then bind mounting your application's code into the container, but doing that loses the isolation that makes Docker so useful in the first place.

In practice, I would often install application dependencies on my host OS directly and then run code from the git checkout to eliminate the slow build problem. Doing this brings back "works on my machine" bugs, and it means you lose all the advantages of Docker.

Fixing the slow build issue - how to make everything fly

When developing new code it's important to have fast build times, and you should also be able to start and stop any container you use quickly. Doing this well means making proper use of Docker's caches so that rebuilds don't take forever. There are a few basic ways to do that:

  1. First option - use tiered base images
    The idea here is that you set up one base image that installs basics such as your programming language of choice (I'm a Python guy myself) and stuff like nginx, then you build another base image on top of this one to add app-specific dependencies, and finally you add another that sets up your actual application.
    Doing this means you only rebuild the base images when you need to. Docker doesn't even need to look up the cache for each layer in the base images; it simply uses the latest version of the base image.
  2. Second option - organise your Dockerfile properly so docker's cache can handle it
    With this approach, you don't use base images beyond stuff like the phusion base system (a docker-optimised image of Ubuntu), and you structure your Dockerfile to install all dependencies in one image.
    To make it run quickly, you must structure the Dockerfile so that the stuff that changes least often is earlier in the file. You should also be careful when it comes to ENV and ARG instructions and make sure you aren't invalidating the cache on every build.
    Ultimately, each instruction in your Dockerfile forms a layer which Docker can cache. Once an instruction changes (or, for COPY and ADD, the files it copies change), that instruction and every instruction after it must be rerun on a rebuild.
    Anything specific to your app that changes often in the code/build/run cycle should be at the bottom, while things that change less often and take longer should be towards the top. Depending on app specifics, the actual ordering may vary, but I hope to release a template demonstrating this approach soon.
  3. Third option - put dependencies in the Dockerfile, and mount app code from disk
    This is essentially treating Docker containers like VMs and is considered bad practice, but to make things really fly it can make sense. Put simply, you build your Docker image once and only once, and then run it with your application code bind mounted.
    Inside the container you need a way to automatically reload the code from disk when it changes. This can be as simple as having a script run your application in a loop and then using "docker exec" to kill it, ensuring the next invocation loads the newest code.
    While this should be VERY fast compared to rebuilding your image, it should be reserved for development work only, and you should run tests in a fresh container before distributing your code to others (whether in-house at a company, or simply pushing to GitHub for your free and open-source software).
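
The restart loop from the third option could look something like the Python sketch below. It is only an illustration, not the script I actually use: the name `run_in_loop`, the `max_runs` limit (there mainly so the loop can be tested) and the `delay` default are all assumptions.

```python
import subprocess
import time


def run_in_loop(command, max_runs=None, delay=1.0):
    """Run `command` again every time it exits.

    Each run is a fresh process, so code that changed on the bind mount
    is loaded on the next start. With max_runs=None the loop never ends.
    Returns (number of runs, exit code of the last run).
    """
    runs = 0
    last_code = None
    while max_runs is None or runs < max_runs:
        last_code = subprocess.run(command).returncode
        runs += 1
        time.sleep(delay)  # avoid a tight loop if the app dies on startup
    return runs, last_code
```

Inside the container this might be the main process, called as `run_in_loop(["python", "/app/app.py"])` (a hypothetical path to the bind-mounted code). After editing code on the host, killing the app process with "docker exec" makes the loop start it again with the new code.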

Combined approaches

In practice, it makes sense to combine the above approaches depending on context: write a Makefile that can build your container from scratch but also have another version of the container that can run your code from a bind mount. Write a common base which supports either form, and then have a new Dockerfile for each.

Unfortunately, the Dockerfile format has no conditionals, so you can't use a build argument to optionally run COPY commands. To make life easier, use good old-fashioned make to build each image, and keep the Dockerfile for your base and for each version of your container in your version control repo.
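
To picture the one-Dockerfile-per-variant layout, here is a rough sketch of the selection logic. make is what I recommend above; Python stands in here only to keep the example short. The file names `Dockerfile.base`, `Dockerfile.full` and `Dockerfile.dev` and the `myapp` image name are hypothetical, and the sketch assumes the variant Dockerfiles start with `FROM myapp:base`.

```python
import subprocess

# Hypothetical layout: one Dockerfile per variant, all kept in the repo.
DOCKERFILES = {
    "base": "Dockerfile.base",  # shared dependencies
    "full": "Dockerfile.full",  # COPYs the application code into the image
    "dev": "Dockerfile.dev",    # expects the code to be bind mounted at run time
}


def build_command(variant, image="myapp", context="."):
    """Return the `docker build` argument list for one variant."""
    dockerfile = DOCKERFILES[variant]  # raises KeyError for an unknown variant
    return ["docker", "build", "-f", dockerfile, "-t", f"{image}:{variant}", context]


def build(variant):
    """Build the base image first, then the requested variant on top of it."""
    for name in dict.fromkeys(["base", variant]):  # drops a duplicate "base"
        subprocess.run(build_command(name), check=True)
```

The choice between copying the code in and bind mounting it is made by picking a file, not by a condition inside any one Dockerfile.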


Conclusion - watch this space

When it's possible for me to do so, I'll be releasing a standard build system for web apps inside Docker that uses these approaches. The goal is to make development work go quickly while still allowing fresh rebuilds for production deployment.

Full disclosure: I intend to trial some upvote bots on this post.

I agree on all of the above. I tend to use a custom --entrypoint="/bin/bash" for cases when I want to override defaults for debugging, or exec -it for attaching to an already running container. A quite interesting feature of the volume mount, not described anywhere, is that with a fine-grained target it can be maliciously used to work around the 30-day limit of most software trials. And about the tiered base images, the people behind https://github.com/phusion/baseimage-docker did a really good job.

Phusion's base image is awesome - much prefer using that and then adding services rather than building on top of the various other base images around (seeing "FROM python" etc makes me weep).

About the 30 day trial limit thing - hasn't that always been possible to do if you're running the code on your own hardware?

It's fundamentally impossible to enforce such limits unless the software runs purely serverside.

Hi Gareth, I was looking for some NLP tools but I heard about Docker, thanks for the intro :)) Could you please recommend me some beginner tools for NLP? I am trying to make sentences out of YouTube's automatic captioning system to make transcripts. GNU Sed is just not enough anymore :D

NLTK

OK time to finally learn the Python then. I loved OOP in Objective C!

Hmm question not answered:
Stackoverflow, how to add punctuation

That's a VERY complex problem, no easy answer - you could possibly pull it off by creating a ruleset yourself though.

Basically the rules for when to start a new sentence can be defined in terms of what comes before and after the full stop - so write that ruleset and iterate through the words.
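
A very rough sketch of what such a ruleset might look like in Python. The two word lists are made-up examples, not a tested ruleset, and would need tuning for the speaker:

```python
# Hypothetical rules: words that often open a sentence in speech, and
# words that rarely close one. Both sets are illustrative only.
SENTENCE_STARTERS = {"so", "okay", "anyway"}
RARELY_LAST = {"and", "or", "but", "the", "a", "of", "to", "is"}


def add_full_stops(words):
    """Insert full stops into a list of unpunctuated words."""
    out = []
    new_sentence = True
    for i, word in enumerate(words):
        prev = words[i - 1].lower() if i > 0 else None
        # Rule: break before a starter word unless the word before it
        # rarely ends a sentence or is itself a starter ("okay so ...").
        if (
            i > 0
            and word.lower() in SENTENCE_STARTERS
            and prev not in RARELY_LAST
            and prev not in SENTENCE_STARTERS
        ):
            out[-1] += "."
            new_sentence = True
        # Capitalise only the first letter so words like BOINC survive.
        out.append(word[:1].upper() + word[1:] if new_sentence else word)
        new_sentence = False
    if out:
        out[-1] += "."
    return " ".join(out)


print(add_full_stops("the wallet syncs so we waited and so on".split()))
# prints: The wallet syncs. So we waited and so on.
```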

Yeah I basically took the auto-captions of YouTube I had already cleaned up for difficult words like "grid coin" and BOINC, as a vtt subtitle file.

I got rid of the vtt timecodes using GNU tools like sed.

I then loaded up the vtt in TextEdit and cmd-F to highlight words. I noticed that @CM-Steem aka customminer uses stop words like "So" a lot so I put periods before those.

https://steemit.com/gridcoin/@nutela/gridcoin-whaletank-rough-transcript-friday-8th-aug-2017

Here's the video:

I edited up to 15 mins or so.

You wouldn't believe how much text one can fill by simply talking for 15 minutes. Way too much work to do by hand.

You could try to make use of the natural pauses in speech to add the full stops as well.
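
Assuming you can get a start and end time for each word (that input format and the 0.7 second threshold are both assumptions, not something any particular tool is known to give you), the idea might be sketched like this:

```python
def punctuate_by_pauses(timed_words, min_pause=0.7):
    """Add a full stop wherever the silence before the next word is long.

    timed_words: list of (word, start_seconds, end_seconds) tuples.
    """
    out = []
    new_sentence = True
    for i, (word, _start, end) in enumerate(timed_words):
        if new_sentence:
            word = word[:1].upper() + word[1:]
        is_last = i == len(timed_words) - 1
        # Pause = start of the next word minus the end of this one.
        new_sentence = is_last or timed_words[i + 1][1] - end >= min_pause
        out.append(word + "." if new_sentence else word)
    return " ".join(out)


timed = [
    ("hello", 0.0, 0.4),
    ("everyone", 0.5, 1.0),
    ("welcome", 2.0, 2.5),
    ("back", 2.6, 3.0),
]
print(punctuate_by_pauses(timed))
# prints: Hello everyone. Welcome back.
```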

Hey that's a great idea! I wonder though how to get that, I was wondering if YouTube would offer any insight but their tool is closed off. IBM Watson looks much cooler and even has a github link but I'm not so sure of the quality. It couldn't keep up when testing real time (with Loopback) but then again real time is maybe too much to ask.
