Seamlessly upgrading a production OpenStack cluster in 4 hours: with a 2K-line shell script


tl;dr:

To the question “what does it take to upgrade OpenStack?”, my personal answer is: less than 2K lines of dash script. I’ll describe its internals here, and why I believe it is the correct approach.

Why write this blog post

During FOSDEM 2024, I was asked “how do you handle upgrades?”. I answered with a big smile and a short “with a very small shell script”, as I couldn’t explain in 2 minutes how it was done. But just saying “it is great this way” doesn’t give readers enough hints to trust it. Why and how did I do it the right way? This blog post is an attempt to answer that question more thoroughly.

Upgrading OpenStack in production

I wrote this script maybe 2 or 3 years ago. I’m only blogging about it today because… I did such an upgrade on a public cloud in production last Tuesday evening (ie: the first region of the Infomaniak public cloud). I’d say the cluster is moderately large (as of today: about 8K+ VMs running, 83 compute nodes, 12 network nodes, … for a total of 10880 physical CPU cores and 125 TB of RAM if I only count the compute servers). It took “only” 4 hours to do the upgrade (though I already wrote some more code to speed this up for next time…). It went super smoothly, without a glitch. I mostly just sat there, reading the script output… and went to bed once it finished running. The next day, my colleagues at Infomaniak kindly congratulated me on how smooth it went (a big thanks to all of you who did). I couldn’t dream of a smoother upgrade! :)

Still not impressed? Boring read? Yeah… let’s dive into more technical details.

Intention behind the implementation

My script isn’t perfect. I won’t ever pretend it is. But at least, it minimizes the downtime of every OpenStack service. It also is a “by the book” implementation of what’s written in the OpenStack docs, following all the upstream advice. As a result, it is fully seamless for some OpenStack services, and as HA as OpenStack can be for others. The upgrade process is of course idempotent and can be re-run in case of failure. Here’s why.

General idea

My upgrade script does things in a certain order, respecting what is documented about upgrades in the OpenStack documentation. It basically does:

  • Upgrade all dependencies
  • Upgrade all services one by one, across the whole cluster

Installing dependencies

The first thing the upgrade script does is the following (a sketch of these steps follows the list):

  • disable puppet on all nodes of the cluster
  • switch the APT repository
  • apt-get update on all nodes
  • install library dependencies on all nodes
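
Here is a minimal sketch of these preparation steps, written sequentially for clarity (the real script parallelizes this, as described further down). The node list path, the sources.list file name and the release names are assumptions of mine, not the script’s real ones:

# Preparation: stop puppet from touching the configuration, point APT at the
# new OpenStack release, and refresh the package index on every node.
# Paths and release names below are illustrative.
NEW_RELEASE=caracal

for node in $(cat /etc/openstack-cluster/nodes.list); do
    ssh "$node" "
        puppet agent --disable 'OpenStack upgrade in progress'
        sed -i 's/bobcat/$NEW_RELEASE/' /etc/apt/sources.list.d/openstack.list
        apt-get update
    "
done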

For this last step, a static list of all needed dependency upgrades is maintained between each release of OpenStack, and for each type of node. Then, for every package in this list, the script checks with dpkg-query that the package really is installed, and with apt-cache policy that it really is going to be upgraded (maybe there’s an easier way to do this?). This way, no package gets marked as manually installed by mistake during the upgrade process. This ensures that “apt-get --purge autoremove” really does what it should, and that the script is truly idempotent.
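
The check itself amounts to something like this sketch, assuming a hypothetical helper name and an illustrative package list (not the script’s real ones):

# Only upgrade a dependency if it is already installed AND apt actually has
# a newer candidate version for it, so that no package ends up marked as
# manually installed. Helper name and package list are illustrative.
pkg_needs_upgrade() {
    pkg="$1"
    # Really installed?
    dpkg-query -W -f='${Status}' "$pkg" 2>/dev/null | grep -q "install ok installed" || return 1
    # Really going to be upgraded?
    installed=$(apt-cache policy "$pkg" | awk '/Installed:/ {print $2}')
    candidate=$(apt-cache policy "$pkg" | awk '/Candidate:/ {print $2}')
    [ "$installed" != "$candidate" ]
}

for p in python3-sqlalchemy python3-oslo.config python3-keystonemiddleware; do
    if pkg_needs_upgrade "$p"; then
        apt-get install -y "$p"
    fi
done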

The idea then is that, once all dependencies are installed, upgrading and restarting the leaf packages (ie: OpenStack services like Nova, Glance, Cinder, etc.) is very fast, because the apt-get command doesn’t need to install any dependencies. So at this point, doing “apt-get install python3-cinder” for example (which will also, thanks to dependencies, upgrade cinder-api and cinder-scheduler if it’s on a controller node) only takes a few seconds. This principle applies to all nodes (controller nodes, network nodes, compute nodes, etc.), which helps a lot in speeding up the upgrade and reducing unavailability.

hapc

At its core, the oci-cluster-upgrade-openstack-release script uses haproxy-cmd (ie: /usr/bin/hapc) to drain each to-be-upgraded API server from haproxy. Hapc is a simple Python wrapper around the haproxy admin socket: it sends commands to it through an easy-to-understand CLI. So it is possible to reliably upgrade an API service only after it has been drained away. Draining means simply waiting for the last request to finish and the client to disconnect before haproxy sends that backend server any more requests. If you do not know hapc / haproxy-cmd, I recommend trying it: it’s going to be hard for you to stop using it once you’ve tested it. Its bash-completion script makes it VERY easy to use, and it is helpful in production. But not only that: it is also nice to have when writing this type of upgrade script. Let’s dive into haproxy-cmd.

Example of how to use haproxy-cmd

Let me show you. First, ssh into one of the 3 controllers and find where the virtual IP (VIP) is located, with “crm resource locate openstack-api-vip” or with a (simpler) “crm status”. Let’s ssh into the server that has the VIP, and now, let’s drain it away from haproxy.

$ hapc list-backends
$ hapc drain-server --backend glancebe --server cl1-controller-1.infomaniak.ch --verbose --wait --timeout 50
$ apt-get install glance-api
$ hapc enable-server --backend glancebe --server cl1-controller-1.infomaniak.ch

Upgrading the control plane

My upgrade script leverages hapc just like above. For each OpenStack project, it’s done in this order on the first node holding the VIP (see the sketch after this list):

  • “hapc drain-server” of the API, so haproxy gracefully stops querying it
  • stop all services on that node (including non-API services): stop, disable and mask with systemd.
  • upgrade that service’s Python code. For example: “apt-get install python3-nova”, which will also pull nova-api, nova-conductor, nova-novncproxy, etc., but the services won’t start automatically as they’ve been stopped + disabled + masked in the previous bullet point.
  • perform the db_sync so that the db is up-to-date [1]
  • start all services (unmask, enable and start with systemd)
  • re-enable the API backend with “hapc enable-server”
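
For Nova, that sequence looks roughly like the sketch below. The “novabe” backend name and the unit list are assumptions on my part, and the real script of course handles every project generically:

# Per-project sequence on the node holding the VIP, using Nova as an example.
# The "novabe" backend name and the unit list are assumptions.
UNITS="nova-api nova-conductor nova-scheduler nova-novncproxy"

# 1. Gracefully drain the API backend out of haproxy
hapc drain-server --backend novabe --server "$(hostname -f)" --wait --timeout 50

# 2. Stop, disable and mask every Nova service on this node
for u in $UNITS; do
    systemctl stop "$u"
    systemctl disable "$u"
    systemctl mask "$u"
done

# 3. Upgrade the Python code (pulls nova-api, nova-conductor, etc. with it)
apt-get install -y python3-nova

# 4. Bring the database schema up-to-date
nova-manage api_db sync
nova-manage db sync

# 5. Unmask, enable and start everything again
for u in $UNITS; do
    systemctl unmask "$u"
    systemctl enable "$u"
    systemctl start "$u"
done

# 6. Put the API backend back into the haproxy pool
hapc enable-server --backend novabe --server "$(hostname -f)"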

Starting at [1], the risk is that the other nodes still run an old version of the code that may not be compatible with the new version of the database schema. But this window doesn’t last long, because the next step is to take care of the other (usually 2) nodes of the OpenStack control plane:

  • “hapc drain-server” of the API on the other 2 controllers
  • stop all services on these 2 controllers [2]
  • upgrade the packages
  • start all services

So while there’s technically zero downtime, some issues may still happen between [1] and [2] above, because the new DB schema and the old code (both for the API and the other services) are up and running at the same time. However, this should be rare (some OpenStack projects don’t even have db changes between two OpenStack releases, and most queries usually keep working against the upgraded db), and the cluster only stays in that state for a very short time, so that’s fine, and much better than a full API downtime.

Satellite services

Then there are the satellite services that need to be upgraded, like Neutron, Nova and Cinder. Nova is the least offender, as it has all the code to rewrite JSON object schemas on-the-fly so that it continues to work during an upgrade. Though it’s a known issue that Cinder doesn’t have that feature (last time I checked), and it’s probably the same for Neutron (maybe recent-ish versions of OpenStack do use oslo.versionedobjects?). Anyway, the upgrade of these nodes is done right after the control plane, for each service.

Parallelism and upgrade timings

As we’re dealing with potentially hundreds of nodes per cluster, a lot of operations are performed in parallel. I chose to simply use the shell’s & together with some “wait” calls, so that not too many jobs run in parallel. For example, disabling puppet over SSH on all nodes is done 24 nodes at a time, which is fine. The batch size depends on the type of operation being performed: while it’s perfectly OK to disable puppet on 24 nodes at the same time, it is not OK to do that with the Neutron services. In fact, each time a Neutron agent is restarted, the script explicitly waits for 30 seconds. This conveniently avoids a hailstorm of messages in RabbitMQ, and keeps neutron-rpc-server from becoming too busy. All of this waiting is necessary, and it is one of the reasons why it can sometimes take that long to upgrade a (moderately big) cluster.
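
A minimal sketch of that batching pattern, assuming hypothetical helper and variable names (the batch size used for Neutron below is only illustrative):

# Run a command on many nodes, at most N at a time, using plain & and wait.
# Helper and variable names are illustrative, not the script's real ones.
run_in_batches() {
    batch_size="$1"; cmd="$2"; shift 2
    count=0
    for node in "$@"; do
        ssh "$node" "$cmd" &
        count=$((count + 1))
        if [ "$count" -ge "$batch_size" ]; then
            wait    # let the whole batch finish before launching the next one
            count=0
        fi
    done
    wait            # wait for the last (possibly partial) batch
}

# Disabling puppet is harmless, so 24 nodes at a time is fine:
run_in_batches 24 "puppet agent --disable 'upgrade in progress'" $ALL_NODES

# Neutron agents are handled far more gently (one at a time here for
# illustration), with the explicit 30 second pause after each restart:
run_in_batches 1 "systemctl restart neutron-openvswitch-agent && sleep 30" $NETWORK_NODES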

Not using config management tooling

Some of my colleagues would have preferred that I use something like Ansible. However, there’s no reason to use such a tool if the idea is just to run some shell commands on every server. It is way more efficient (in terms of programming) to just use bash / dash to do the work. And if you want my point of view about Ansible: using YAML for this kind of programming would be crazy. YAML is simply not suited for a job where if, case, and loops are needed. I am well aware that Ansible has workarounds and that it could be done, but it wasn’t my choice.
