Building AMIs with Packer


During the planning stages of our migration to AWS, we identified the need to create custom images (AMIs) as the base for new instances. While we are relatively experienced with Chef, we found that running Chef at instance launch time took far longer than was acceptable. Creating preconfigured custom AMIs (a practice known as baking) allowed us to shift the heavy lifting from instance launch time to an earlier, out-of-band stage.

In designing this process, we set several goals: the process needed to be reliable, repeatable, auditable and tested, with a fast spin-up time. This post explores our recent infrastructure automation efforts in this area.

With the goals above in mind, we reviewed many alternatives (VeeWee, Netflix Aminator, VMware Orchestrator, Chef Provisioning (aka Chef Metal) and plain old UserData scripts, amongst others) before finally settling on Packer by HashiCorp.

Packer is “a tool for creating identical machine images for multiple platforms from a single source configuration”. This multi-platform support means we can output an image for VirtualBox, or a template for VMware, from the same configuration we use to create an AWS AMI. Whatever the use case (e.g. local Chef testing against VirtualBox, or internal infrastructure on VMware), we know we are starting from a common base, so our subsequent configuration can be simpler.

Starting from a simple configuration file:

{
  "variables": {
    "vm_name": ""
  },
  "builders": [
    {
      "type": "amazon-ebs",
      "region": "eu-west-1",
      "source_ami": "ami-e99c1d9e",
      "iam_instance_profile": "packer-iam-instance-profile",
      "instance_type": "c4.large",
      "ssh_username": "Administrator",
      "ssh_timeout": "30m",
      "security_group_ids": [
        "sg-109e1de2",
        "sg-1d69fc12"
      ],
      "ami_name": "{{ user `vm_name` }}",
      "user_data_file": "scripts/install_ssh_server.ps1",
      "vpc_id": "vpc-e92eda83",
      "subnet_id": "subnet-1434dfa9"
    }
  ],
  "provisioners": [
    {
      "type": "shell",
      "remote_path": "/tmp/win-updates.bat",
      "binary": true,
      "execute_command": "{{.Vars}} cmd /c C:/Windows/Temp/win-updates.bat",
      "script": "./scripts/win-updates.bat"
    }
  ]
}

We can create a new AMI on AWS with ease:

packer build -var "vm_name=test-ami" win2012r2-template.json
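
When a template defines more than one builder (e.g. amazon-ebs alongside virtualbox-iso or vmware-iso), Packer's standard -only flag restricts a run to a single platform:

packer build -only=amazon-ebs -var "vm_name=test-ami" win2012r2-template.json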

Build chaining

One of the great benefits of Packer is its ability to take the output of one build and feed it back in as the input of a subsequent build. This allows us to create a “tree” of images, each building on the success of the last. This reduces the time to build a new image, as we are only building what has changed – a big benefit given that some components can take tens of minutes to install. It also allows variations of the image (rather than one “golden” image) with different installed software (e.g. only install a web server if you absolutely need it) to limit the attack surface.

[Image: tree of images produced by build chaining]
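
For illustration, a downstream template in the tree can accept its parent's AMI as a user variable. This is a minimal sketch with made-up names (our real templates carry more settings):

{
  "variables": {
    "source_ami": "",
    "vm_name": ""
  },
  "builders": [
    {
      "type": "amazon-ebs",
      "region": "eu-west-1",
      "source_ami": "{{ user `source_ami` }}",
      "instance_type": "c4.large",
      "ssh_username": "Administrator",
      "ami_name": "{{ user `vm_name` }}"
    }
  ],
  "provisioners": [
    {
      "type": "shell",
      "script": "./scripts/install_web_server.bat"
    }
  ]
}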

Once connected to a CI system (GoCD in this case), we can set up scheduled builds to ensure that the latest patches are installed at the top level, and then chain the builds together so that new AMIs are produced on a regular basis without human intervention.

[Image: the Packer build chain in GoCD]

Running this on a schedule means we get the latest security patches built into the image without having to push them out to the production environment. This gives us better testing of patch impact, as well as many of the benefits of immutable infrastructure.
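
Our helper gem passes the AMI ID between stages by reading and writing JSON files (more on that below), but the same chaining can be sketched with a few lines of shell against Packer's machine-readable output (the exact field positions depend on the Packer version):

packer build -machine-readable win2012r2-template.json | tee build.log
AMI_ID=$(grep 'artifact,0,id' build.log | cut -d, -f6 | cut -d: -f2)
packer build -var "source_ami=$AMI_ID" -var "vm_name=web-server-ami" web-server-template.json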

Testing

While developing, we spin up a new AWS instance and run Serverspec remotely to develop the tests and installation scripts. In the build pipeline, we run these same tests locally on the instance on every build, to ensure that everything has installed and built successfully. The test scripts are currently removed afterwards, but could easily remain on the target box so that they can be run at any point to confirm the box is in a healthy state.
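
A minimal sketch of such a spec (the service and port below are illustrative, not our actual checks):

require 'serverspec'

set :backend, :cmd          # run commands locally on the Windows instance
set :os, family: 'windows'

describe service('W3SVC') do
  it { should be_running }
end

describe port(80) do
  it { should be_listening }
end

In the Packer build output, the test run looks like this: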

Uploading spec => /tmp

Provisioning with shell script: ./scripts/run_serverspec_tests.bat

C:\>gem install serverspec --no-ri --no-rdoc
Successfully installed serverspec-2.10.1
1 gem installed

C:\>cd c:\Windows\Temp

C:\Windows\Temp>rake spec
..................
Finished in 3.51558 seconds (files took 1.4375 seconds to load)
18 examples, 0 failures


Build it and they will come

Once the images have been created, they are pretty useless unless they get used, so the next challenge was to make it as easy as possible for the consuming teams to use them. In our dev and test environments on AWS, we operate a “scale down to 0 out of business hours” model to contain costs. This has the benefit of making sure our automation is rock solid, and ensures that development teams don’t rely on the same VMs existing for long periods of time. Being able to rebuild and redeploy applications very easily means rebuilding the image underneath isn’t too much of an issue.

At this point, we have a manual process that teams need to follow to consume the image, as we want to ensure that applications are fully tested against the latest image before they rely on it. We are seeing that this pattern encourages some great behaviour around test automation (which, while already at a fairly good level, was not quite good enough to have the final say on upgrading the base image).

Our special sauce

We wrote a little helper gem to wrap Packer and streamline the development and build process. It includes functionality to:

  • Validate the Packer template – failing fast is always a good idea.
  • Strip out comments – the JSON template format doesn’t natively support them, but it’s sometimes useful to explain non-obvious settings (see the sketch after this list).
  • Clean up previous failed builds – after an interrupted build, the old artefacts are not always deleted.
  • Run post-steps for VMware to re-register the .vmx file, which is de-registered by default.
  • Read and write JSON files for build chaining.
  • Update the Vagrant box metadata file so that Vagrant can update local boxes.
  • Download files from, and upload them to, Artifactory.
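
As an example of the comment-stripping and validation steps, here is a simplified sketch of that part of the gem (not the actual implementation; names are illustrative):

require 'json'
require 'open3'

# Strip '//' comment lines so the template becomes valid JSON, parse it to
# fail fast on syntax errors, then let Packer itself validate the semantics.
def clean_and_validate(template_path, output_path)
  cleaned = File.readlines(template_path)
                .reject { |line| line.strip.start_with?('//') }
                .join
  JSON.parse(cleaned)                      # fail fast on malformed JSON
  File.write(output_path, cleaned)
  output, status = Open3.capture2e('packer', 'validate', output_path)
  raise "packer validate failed:\n#{output}" unless status.success?
end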

Security

As a company with a focus on building security in, a set period was mandated by which all instances must be running the new images. As we process large amounts of customer data and payment transactions, we obviously need to be PCI compliant; however, our internal standards are much higher than that. Making this happen in a streamlined, low-impact way is proving to be a challenge, but we are definitely on the right path.

Automated security testing of new images is currently triggered manually, but we plan to integrate it into the pipeline very shortly. We are also looking at including automated security testing of the deployed applications in a similar fashion, which will be a large time saver for security, freeing up more time for specialist manual security testing.

This DevOps thing

Above all, this has been a cross-functional, collaborative effort. Bringing developers, architects and SysAdmins together allowed us to get the best of all worlds. However, different skill-sets, backgrounds and even different philosophies meant that on occasion, some vigorous discussions were had while coming to workable solutions.

Packer appears at first glance to be a developer-focused tool, rather than a SysAdmin one. This, combined with the ServerSpec testing approach, led to some hesitance among those with more of an Ops background. However, sharing information around the why, combined with pairing, helped alleviate the concerns. There is definitely more progress needed here to get everyone comfortable.

One thing that we have tried to emphasise through this whole process is that it is an evolutionary approach, meaning that no solution is set in stone, and we are definitely looking to improve the process continually.

Relative comfort with various tool-sets also drove some of the technical direction. While the development community at thetrainline has a lot of experience with Chef, the Ops teams were more comfortable with PowerShell. This led us to choose PowerShell DSC as the main configuration management tool.
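
For a flavour of what that looks like, here is a canonical minimal DSC configuration (the feature below is illustrative, not one of our actual roles):

Configuration BaseWebServer {
  Node 'localhost' {
    # Ensure IIS is present on the box
    WindowsFeature IIS {
      Ensure = 'Present'
      Name   = 'Web-Server'
    }
  }
}

# Compile the configuration to a MOF and apply it
BaseWebServer
Start-DscConfiguration -Path .\BaseWebServer -Wait -Verbose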

Not all roses and unicorns

Once working, Packer is generally an easy thing to work with. However, it has not been without its challenges, especially as the Windows world lags behind the Linux world in this area.

For example, SSH is a given in the Linux world, but it’s rarely found on Windows. Instead, the recommended approach is WinRM which, while rapidly improving, is still not great; tooling support for it also lags behind SSH, although the packer-windows-plugins look promising. In effect, we had to work around this by installing an SSH server at the start of the Packer run and disabling it at the end. Not exactly ideal.
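
We won’t reproduce our exact install_ssh_server.ps1 here, but a plausible sketch of the user-data approach (the tool choice and commands below are assumptions, not our actual script) is:

<powershell>
# Hypothetical sketch: bootstrap Chocolatey, install an SSH server and open
# the firewall so the Packer amazon-ebs builder can connect over port 22.
iex ((New-Object System.Net.WebClient).DownloadString('https://chocolatey.org/install.ps1'))
choco install openssh -y
New-NetFirewallRule -DisplayName 'SSH' -Direction Inbound -Protocol TCP -LocalPort 22 -Action Allow
</powershell>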

Another major pain point was Windows Updates. These frequently take a long time to install, and do a lot of work at shutdown and startup. This was especially painful given the non-deterministic shutdown of the SSH daemon: without a delay, the next provisioner would reconnect before the reboot and then fail when the reboot disconnected the session. We ended up using the pause_before parameter on the next provisioner to wait out the shutdown. Even so, builds failed when “Patch Tuesday” rolled around and the extra patches delayed the reboot beyond the configured pause.
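
In template terms, the workaround is just an extra setting on the provisioner that follows the update script (the duration and script name here are illustrative):

{
  "type": "shell",
  "pause_before": "10m",
  "script": "./scripts/post-reboot-checks.bat"
}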

Packer on VMware is also less mature than ideal, especially around build chaining. You can easily build a template from an ISO, but you cannot (when using remote ESXi servers) build a template from a template. There appears to be some work in progress to address this, but it isn’t there yet. Another pain point is that the builder requires SSH access to the VMware host, rather than using the web API.

VirtualBox also comes with its own pain points. For build chaining, we need to manually unpack the previous box file to extract the OVF file for the next stage – not a huge issue, but not seamless. A bigger concern is that running VirtualBox on virtualised hardware is unfortunately extremely slow. Ideally we will drop it completely soon and move to a pure VMware-based approach for all internal virtual machines.

The future

While this approach is working fairly well for us, there is definitely room for improvement.

One improvement we are hoping to trial is application-specific images, with the application pre-deployed rather than installed at instance creation time. This will definitely help our spin-up time, but it will have a large impact on our build time, as the extra AMI creation can take up to 30 minutes on Windows.

A more promising approach is Docker and containers. This would give us much better application isolation on the servers, as well as much faster spin-up times. However, Docker and friends only run on Linux at this point, though Windows support is on the horizon. While there is good support for Mono, it will take a fair amount of time for everyone to get comfortable running under Mono and Linux. Once Windows Server support arrives, this will be a much more manageable situation.

While we are still early in the roll out of these golden images, we have already found benefits in terms of consistency, reliability, auditability and speed of provisioning. This can only help us achieve our continuous delivery goals faster.
