Critical Infrastructure Upgrade Process

Sep 05

How do we upgrade an essential piece of software in our infrastructure stack without impacting our customers?

At Docker, we have asked ourselves this question multiple times in the last year as we've caught our infrastructure software up to more modern versions. One of the more recent and major upgrades was HashiCorp's Consul. Consul is a piece of software that powers many of our internal services. It can perform many tasks, including service discovery, resilient key-value storage, multi-datacenter segmentation, and service-mesh networking. It is resilient to failure, using the Raft consensus algorithm for leader election and for managing data inside the cluster, while client nodes talk to one another via the gossip protocol. For our purposes, the standout features are ease of use, resilient key-value storage, and a distributed locking mechanism.

Even though we use a small subset of Consul features, we always want to stay on the latest stable version to ensure we get the latest security and bug fixes, while enabling our users to access the newer features. This is easier said than done, as we have skipped upgrades over the years and shoved them into the backlog. Finally, the day came to address this backlog item, and with such a core piece of our infrastructure stack, we needed to plan it out in advance. We will walk you through how we approached this major infrastructure software upgrade, with examples from our Consul upgrade process.

Planning

Achieving a zero-impact upgrade of a backbone piece of software requires planning with extreme attention to detail. We need to know exactly where we are and where we are going, and make sure that we understand every step along the way to our goal.

Catalog Your Current State

Know the state of your software. In order to get where you want to go, you must know where you are. For us, that meant Consul version 0.7.4. Based on the commits in the Consul repository, this version is over two years old! To say the least, we were ready for a new version of Consul: we had seen a few incidents with CPU spiking on client nodes, and we could not submit bug reports for unmaintained versions of the software.

We have several simple deployments. There are 2 production environments, 1 staging environment, and then multiple personal environments. Each environment gets a cluster of 3 servers and tens to hundreds of clients, depending on the environment. The hardest part of all of this is that we automate our deployment of Consul, so if Consul changes in a dramatic way, our automation needs to be able to handle that.
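
As a quick sanity check on that starting state, it also helps to ask the cluster itself what is running. Below is a minimal sketch, assuming SSH access to the nodes; the hosts file and the aws_node_ips helper used here are covered later in this post, and <consul-server-ip>, <stack>, and <role> are placeholders rather than real values from our setup.

# ask a single agent for the whole cluster's view; the Build column lists each node's Consul version
$ ssh ubuntu@<consul-server-ip> "consul members"

# or ask every host directly with pssh
$ aws_node_ips <stack> <role> > hosts
$ pssh -t 0 -l ubuntu -i -h hosts "consul version"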

Read the Changelog and Ask Questions

Most open source software has some sort of changelog that is available for public consumption. This is a goldmine of information and sometimes even points back to the specific pieces of code that implement a change. READ IT ALL! Seriously, it is time-consuming and I recommend you take breaks, but even if you don't need 80% of the features and bug fixes you are reading about, you now know that much more about the software you are managing in your infrastructure!

Not only should you read through it all, but note down any questions or comments you have. Frequently you will find a small feature you didn't know existed that might be very helpful in your scenario. Or maybe a bug you thought would never be fixed, because it was so minor to the community but so frustrating for your team to work around, actually gets fixed! Note it all down. You will then have your own little changelog that you can further groom into the most important items. For the minor things, you can then decide if they are in scope or if they should go into the backlog to work on later.

So, taking our own advice, we went to the Consul repository and read the whole changelog from 0.7.4 to 1.5.1. This took several hours and several cups of coffee, along with multiple breaks to ensure mental freshness. In the end, it produced several major questions about the upgrade. We then proceeded to answer each of these questions and adjust the plan based on that information. One of the more important questions was around the compatibility of Consul across versions: could we upgrade straight to 1.5.1? The answer turns out to be no, or at least it's not recommended, and here is why. Consul upgrades come with a compatibility promise, which is great, but you have to understand it properly to take advantage of it.
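
If you would rather read that delta locally than scroll through GitHub, one way to pull it is sketched below, assuming the repository keeps CHANGELOG.md at its root and tags releases as vX.Y.Z (which it did for these versions):

$ git clone https://github.com/hashicorp/consul.git && cd consul
# show only the changelog content that changed between our current and target releases
$ git diff v0.7.4 v1.5.1 -- CHANGELOG.md | less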

The most important thing to understand is that there are two protocol versions to be concerned with in Consul: the Consul protocol version and the Raft protocol version. When upgrading Consul, these versions need to be adjacent between Consul releases in order to maintain compatibility (and the compatibility promise) and reduce the chances of data loss or corruption. For example, moving from version 0.7.4 to 0.8.0 works well because both operate on Consul protocol version 2, but can use version 3 if needed. The Raft protocol version for 0.7.4 is 1, while the Raft protocol versions for 0.8.0 are 1, 2, and 3, meaning it can speak any of them but prefers the highest available. Since the versions are adjacent, we can depend on the compatibility promise of Consul to make the upgrade even easier. Here is a diagram detailing this theoretical version upgrade.

Consul – Assess Version Upgrade Risk

Our goal as operators during an upgrade like this is to limit the number of possible changes so as to reduce the chances of failure. The color of the arrows in the diagram above denotes risk: the risk level increases with the hue (white = lowest risk, red = highest risk). The less risk, the better!
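
You also do not have to take the protocol versions on faith; the agent reports them. Here is a small sketch of the checks that are useful at each hop (the exact output wording and columns vary between releases):

# prints the agent's version and its Consul protocol support,
# i.e. which protocol it speaks by default and which range it understands
$ consul version

# the Protocol column shows the Consul protocol version each member is currently speaking
$ consul members

# on a server, newer releases list the Raft peers and their Raft protocol here
$ consul operator raft list-peers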

With that information in hand, we decided on a multi-phase upgrade plan to minimize risk, while avoiding upgrading to every version in-between our current and target versions. That plan looked something like this:

Consul – Upgrade Path

Each of our environments was separated into two groups of hosts: Consul servers and Consul clients. From there, we upgraded each group separately for each version upgrade of Consul, to reduce the blast radius and the danger of cluster impairment. The last step for us was to bump the version of Consul in our base AMI and roll that out across all of our AWS Launch Configurations. Finally, we would remove a few nodes and let an AutoScalingGroup bring up a node with the new AMI to test that it works as well (a rough sketch of that step is shown just below). At this point, we are still only in the theoretical section of our upgrade process. Will the plan we made even work? How will this work if we need to run it against hundreds of Consul clients that need upgrading as well?
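
For reference, that last AMI roll can be exercised with the AWS CLI. This is a rough sketch rather than our exact tooling, and <instance-id>, <asg-name>, and <new-node-ip> are placeholders:

# terminate one old-AMI node but keep the desired capacity, so the
# AutoScalingGroup replaces it using the updated Launch Configuration
$ aws autoscaling terminate-instance-in-auto-scaling-group \
    --instance-id <instance-id> \
    --no-should-decrement-desired-capacity

# watch the replacement come up, then confirm it joined the cluster
$ aws autoscaling describe-auto-scaling-groups --auto-scaling-group-names <asg-name>
$ ssh ubuntu@<new-node-ip> "consul members"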

Testing our Theories and Thinking of Edge Cases

A plan isn't worth much if it is never executed. And if that execution happens first in your most important environment, say production, it's safe to say you are doing it wrong. Just like a software development life cycle, as operators upgrading software we need to test the upgrade before taking the plan to production. At Docker, we have several stages of testing available for our infrastructure. First, we have our personal environments. These are clones of production, but without most of the customer workloads. They have all of the infrastructure software needed to run the platform, but the only workloads are our own test services. This works well for us and provides an initial test bed for major changes. In our case, the Consul upgrade went extremely smoothly there, but upon thinking further we realized it might not be as smooth in other environments.

While brainstorming edge case scenarios, you must think of the other environments and figure out what might be different enough to cause problems. For us, the biggest issue was that the other environments were much larger, so doing all of the upgrade work manually was not feasible. So we developed scripts to automate as much as we could, while ensuring that the most important parts (Consul server upgrades) were left to a more manual process. In testing our scripts, we found that it didn't make sense to fully automate the upgrade of the servers, because we didn't want to risk missing an error only to have the automation script move on to the next server in the cluster, potentially breaking quorum. This meant that each upgrade of a group of nodes was broken out into two phases: a preparation phase and an execution phase.

Host Preparation – cluster-wide

In the preparation phase, we first get all nodes in a subgroup of our infrastructure and put their IPs into a file named hosts (don't worry about aws_node_ips; we'll cover it a little later). Then, we use the pssh tool to run a series of commands across all of the nodes listed in the hosts file. These commands result in the <version> binary of Consul being present on each of the hosts that we specified. For example, if <version> is set to "1.5.1", that results in a new Consul binary located at /usr/bin/consul_1.5.1.

$ export CONSUL_VERSION=<version>
$ export CONSUL_URL="https://releases.hashicorp.com/consul/${CONSUL_VERSION}/consul_${CONSUL_VERSION}_linux_amd64.zip"
$ aws_node_ips > hosts
$ pssh -t 0 -l ubuntu -i -h hosts "curl -o /tmp/consul.zip -L $CONSUL_URL && (cd /tmp && unzip /tmp/consul.zip && sudo mv /tmp/consul /usr/bin/consul_${CONSUL_VERSION})"

Host Execution – server or client

For the clients in the cluster, we only need to loop over all IPs in the hosts file and run a series of commands. We log the start and finish of upgrading each host, and in between those log lines we do several things:

  1. move the current Consul binary to a backup location
  2. move the new Consul binary to /usr/bin/consul
  3. restart the Consul service
$ export OLD_VERSION=<old_version>
$ for host in $(cat hosts); do echo "starting $host..." && ssh ubuntu@"${host}" "sudo mv /usr/bin/consul /usr/bin/consul_${OLD_VERSION} && sudo mv /usr/bin/consul_${CONSUL_VERSION} /usr/bin/consul && sudo systemctl restart consul.service" && echo "finished $host" && sleep 10; done
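
For the servers, the pass is essentially the same binary swap run one host at a time, with a human verifying the cluster between nodes. The following is a sketch, not our exact runbook, with ${host} set to the server currently being upgraded:

# upgrade a single server, exactly like a client...
$ ssh ubuntu@"${host}" "sudo mv /usr/bin/consul /usr/bin/consul_${OLD_VERSION} && sudo mv /usr/bin/consul_${CONSUL_VERSION} /usr/bin/consul && sudo systemctl restart consul.service"

# ...then confirm the node rejoined and the cluster still has a leader before
# touching the next server (a 3-server cluster only tolerates one server down)
# (older Consul releases use `consul operator raft -list-peers` instead)
$ ssh ubuntu@"${host}" "consul members && consul operator raft list-peers"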

Helper Script

In the host preparation phase, there was a reference to an aws_node_ips command. We use it to more easily get lists of IPs for groups of nodes in our AWS EC2 environments. Below is the definition of the script, along with some sample invocations:

aws_ip() {
  # pull the private IP of every instance out of describe-instances JSON
  jq -r '.Reservations[].Instances[].PrivateIpAddress' $@
}

aap() {
  # build the describe-instances command for the given filters, run it,
  # and print the matching private IPs
  cmd="$(aap_gen_cmd $@)"

  eval "$cmd | aws_ip"
}

aap_gen_cmd() {
  # generate an `aws ec2 describe-instances` command filtered by stack name,
  # role tag, and secondary-role tag; pass "-" (or nothing) to skip a filter
  stack_name="$1"
  role="$2"
  secondary_role="$3"

  cmd="aws ec2 describe-instances --filters"

  if [ "$stack_name" = "prod" ] || [ "$stack_name" = "live" ]; then
    cmd="$cmd 'Name=private-ip-address,Values=10.1.*'"
  else
    if [ ! -z "$stack_name" ] && [ "$stack_name" != "-" ]; then
      cmd="$cmd 'Name=tag:main-stack-name,Values=${stack_name}*'"
    fi
  fi

  if [ ! -z "$role" ] && [ "$role" != "-" ]; then
    cmd="$cmd Name=tag:role,Values=${role}"
  fi

  if [ ! -z "$secondary_role" ] && [ "$secondary_role" != "-" ]; then
    cmd="$cmd Name=tag:secondary-role,Values=${secondary_role}"
  fi

  cmd="$cmd Name=instance-state-name,Values=running --output=json"

  echo "$cmd"
}

# example usage; here we alias aws_node_ips to the aap function above
# so the earlier snippets work as written
alias aws_node_ips=aap

aws_node_ips prod docker infra > hosts # get our Consul servers' IPs
aws_node_ips prod swarm > hosts        # get all of the IPs of nodes in the Universal Control Plane cluster

More or less, our aap function is a wrapper around the aws and jq commands. We make sure that the results of the aws command are output in JSON format, which makes them trivial to parse with jq. The additional logic is mostly for filtering EC2 instances for a specific environment based on a human-readable name like "prod" or "live". The "role" and "secondary-role" that you see in the functions are EC2 tags that we put on all of our instances to more easily automate grouping instances together. You can think of "role" as a main group and "secondary-role" as a subgroup within the larger group. We then consume those tags in various other places in our infrastructure to more easily automate everything.

Documentation

We should not have to say this, but document everything! So yes, we are reading everything (the changelog) and documenting everything. It sounds like a lot of work, but when you need to pick the work back up after a long weekend, you will be happy you did it. Document the exact steps and any caveats or issues you ran into, and make sure that you can understand it later. During our Consul upgrade, we actually documented the whole upgrade process, rolled back the changes, and then re-applied the upgrade to the same environment the next day using only the information in the documentation. This ensures that everything you wrote down is documented in a way that is easy to consume and act on. It will help your team perform the upgrade, while also providing documentation to other readers who might be interested in how it was done.

Execution

Finally, we have made it! At this point in the process, you should have tested your automation and plan so much that you don't think twice about whether it will work. I wouldn't recommend upgrading production on a Friday afternoon, for your team's sanity, but your heart rate shouldn't go up when you upgrade your most important servers; it should be business as usual. It is common for something unexpected to come up, and you may need to do some live debugging to fix it. Go into the upgrade with that expectation and the assurance that you have prepared as well as you can to handle the situation. You can do it! And don't forget to document these issues: even though this is the last step in the process, you may want to refer back to your documentation for the next upgrade or while debugging something at a later date.

Further Thoughts on Infrastructure Software

What makes good infrastructure software? As we went through the upgrade process of our Consul clusters, we quickly fell in love with it…again. It made us think more about what makes good software for a distributed infrastructure from an operator’s point of view. The core principles we think are important are version compatibility promises, resiliency in the face of failure, and awesome documentation.

Consul has an awesome version compatibility promise, as we have already discussed. It takes some of the burden off the operator, who gets assurance from the software vendor that things will continue working when moving from one version to another. This may not matter as much for teams that have the time to always stay on the latest version, but most teams do not have that luxury. We have more important things to do, and as happened with our Consul deployments, things quickly fall out of date. With the compatibility promise from HashiCorp, we were able to see a clear path forward for an upgrade that minimized risk.

Resiliency to failure sounds like an obvious feature of distributed systems, but most operators would agree it isn't always present. There are many aspects of distributed computing that contribute to the failure modes of software, but let's focus on the ability to roll back and the lack of downtime. With Consul, we are able to run different versions of the software in the same cluster and all of the components still work. The only caveat is that you do not get any of the new version's features until the whole cluster is upgraded. This is great because we do not need to upgrade everything at the same time: each component can be upgraded individually and the cluster is none the wiser. If a component fails to upgrade, you can roll back that single component or take it out of the cluster and troubleshoot it. There is no need to do a cluster-wide upgrade and then roll back the whole thing in the face of failure. This also means there is less downtime, or even zero downtime, since each component can be upgraded individually.
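
With the binary layout from the execution phase above, rolling back a single node is just that swap run in reverse; here is a sketch, again with ${host} standing in for the node in question:

# move the new binary aside and restore the old one, then restart the agent;
# the rest of the cluster keeps running and never needs to know
$ ssh ubuntu@"${host}" "sudo mv /usr/bin/consul /usr/bin/consul_${CONSUL_VERSION} && sudo mv /usr/bin/consul_${OLD_VERSION} /usr/bin/consul && sudo systemctl restart consul.service"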

Lastly, awesome documentation is a must for any software. Consul has guides for operators and developers, along with the usual documentation for all of the APIs. It also goes into more depth on each main feature of the software, so if I need that information, I can get it; if I don't use a feature, I just skip over that part of the documentation. As we saw, the changelog and version upgrade guide were indispensable. Without them, we would have been flying blind, testing as we went and hoping for the best. All of these things together make for a great operator experience when performing upgrades or troubleshooting issues.

Conclusion

Upgrading infrastructure software can be difficult, but with good software and even better planning by operators, it can be turned into a seamless experience. Our process of upgrading Consul exhibits important pieces of the puzzle for a successful infrastructure software upgrade: reading the changelog, asking questions, paying attention to detail, testing, automating as much as possible, more testing, and documenting everything. After it is all done you will be glad you did all of the upfront work–sit back and eat some ice cream. Enjoy life and move on to your next big upgrade!