Upgrading an OpenStack cluster from one version of OpenStack to another has become easier, thanks to the versioning of objects on the rabbitmq message bus (if you want to know more, see what oslo.versionedobjects is). But upgrading from Stretch to Buster isn't easy at all, even with the same version of OpenStack (it is easier to run OpenStack Rocky backports on Stretch and upgrade to Rocky on Buster than to upgrade OpenStack at the same time as the system).
The reason it is difficult is that rabbitmq and corosync in Stretch can't talk to the versions shipped in Buster. Also, in a normal OpenStack cluster deployment, services on all machines constantly query the OpenStack API and exchange messages through the RabbitMQ message bus. One of the dangers, for example, would be a Neutron DHCP agent that could no longer exchange messages with the neutron-rpc-server: the VM instances in the OpenStack cluster could then lose connectivity.
While a constantly online HA upgrade with no downtime isn't possible, it is possible to reduce the downtime to just a few seconds by following the correct procedure. It took me more than 10 tries to do everything smoothly, understanding and working around all the issues. 10 tries means installing an OpenStack cluster on Stretch 10 times (which, even if fully automated, takes about 2 hours) and trying to upgrade it to Buster. All of this is very time consuming, and I haven't seen any web site documenting this process.
This blog post intends to document such a process, to save the readers the pain of hours of experimentation.
Note that this blog post assumes your cluster has been deployed using OCI (see: https://salsa.debian.org/openstack-team/debian/openstack-cluster-installer). However, it should also apply to any generic OpenStack installation, or even to any cluster running RabbitMQ and Corosync.
The root cause of the problem in more detail: incompatible RabbitMQ and Corosync in Stretch and Buster
RabbitMQ in Stretch is version 3.6.6, and Buster has version 3.7.8. In theory, the RabbitMQ documentation says it is possible to smoothly upgrade a cluster between these versions. In practice, however, the problem is the Erlang version rather than Rabbit itself: RabbitMQ in Buster will refuse to talk to a cluster running on Stretch (the daemon will even refuse to start).
In the same way, Corosync 3.0 in Buster will refuse to accept messages from Corosync 2.4 in Stretch.
Overview of the solution for RabbitMQ & Corosync
To minimize downtime, my method is to shut down RabbitMQ on node 1 and let all daemons (re-)connect to nodes 2 and 3. Then we fully upgrade node 1 and restart Rabbit on it. Then we shut down Rabbit on nodes 2 and 3, so that all daemons of the cluster reconnect to node 1. If done well, the only issue is if a message is still in the cluster of nodes 2 and 3 when the daemons fail over to node 1. In reality, this isn't really a problem, unless there's a lot of activity on the OpenStack API. If that is the case (for example, if running a public cloud), the advice would simply be to firewall the OpenStack API for the short upgrade period (which shouldn't last more than a few minutes), as sketched below.
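For example, something along these lines on the node currently holding the VIP would do (a sketch only: it assumes the API is served over HTTPS on port 443 of the VIP, and VIP_ADDRESS is a placeholder, so adjust to your deployment):
iptables -I INPUT -d VIP_ADDRESS -p tcp --dport 443 -j REJECT
[ ... fail the message bus over to node 1 ... ]
iptables -D INPUT -d VIP_ADDRESS -p tcp --dport 443 -j REJECT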
Then we upgrade nodes 2 and 3 and make them join the newly created RabbitMQ cluster on node 1.
For Corosync, node 1 will not start the VIP resource before node 2 is upgraded and both nodes can talk to each other. So we just upgrade node 2, and turn off the VIP resource on node 3 as soon as it is up on nodes 1 and 2 (which happens during the upgrade of node 2).
The above should be enough reading for most readers. If you're not that much into OpenStack, it's ok to stop reading this post. For those who are more involved users of OpenStack on Debian deployed with OCI, let's go into more detail…
Before you start: upgrading OCI
In previous versions of OCI, the haproxy configuration was missing an "option httpchk" for the MariaDB backend, and therefore, if a MySQL server on one node went down, haproxy wouldn't detect it, and the whole cluster could fail (re-)connecting to MySQL. As we're going to bring some MySQL servers down, make sure the puppet-master is running the latest version of puppet-module-oci, and that the changes have been applied on all OpenStack controller nodes.
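In practice (a sketch, assuming a standard OCI setup where the module is packaged as puppet-module-oci), this means upgrading the module on the puppet-master and then re-applying puppet on each controller:
apt-get update
apt-get install puppet-module-oci
puppet agent -t
with the first two commands run on the puppet-master, and the last one on every controller node.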
Upgrading compute nodes
Before we upgrade the controllers, it's best to start with the compute nodes, which are the easiest to do. The simplest way is to live-migrate all VMs away from the machine before proceeding. First, we disable the node, so no new VM can be spawned on it:
openstack compute service set --disable z-compute-1.example.com nova-compute
Then we list all VMs on that compute node:
openstack server list --all-projects --host z-compute-1.example.com
Finally we migrate all VMs away:
openstack server migrate --live hostname-compute-3.infomaniak.ch --block-migration 8dac2f33-d4fd-4c11-b814-5f6959fe9aac
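If there are many instances, the live-migrations can be scripted, for example like this (a sketch only: z-compute-2.example.com is a hypothetical target host, and whether you want --block-migration depends on your storage setup):
for vm in $(openstack server list --all-projects --host z-compute-1.example.com -f value -c ID) ; do
openstack server migrate --live z-compute-2.example.com --block-migration $vm
done
Wait until the server list for the node is empty before continuing.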
Now we can do the upgrade. First disable puppet, then tweak the sources.list, upgrade and reboot:
puppet agent --disable "Upgrading to buster"
apt-get remove python3-rgw python3-rbd python3-rados python3-cephfs librgw2 librbd1 librados2 libcephfs2
rm /etc/apt/sources.list.d/ceph.list
sed -i s/stretch/buster/g /etc/apt/sources.list
mv /etc/apt/sources.list.d/stretch-rocky.list /etc/apt/sources.list.d/buster-rocky.list
echo "deb http://stretch-rocky.debian.net/debian buster-rocky-proposed-updates main
deb-src http://stretch-rocky.debian.net/debian buster-rocky-proposed-updates main" >/etc/apt/sources.list.d/buster-rocky.list
apt-get update
apt-get dist-upgrade
reboot
Then we simply re-apply puppet:
puppet agent --enable ; puppet agent -t
apt-get purge linux-image-4.19.0-0.bpo.5-amd64 linux-image-4.9.0-9-amd64
Then we can re-enable the compute service:
openstack compute service set --enable z-compute-1.example.com nova-compute
Repeat the operation for all compute nodes, and then we're ready for the upgrade of the controller nodes.
Removing Ceph dependencies from nodes
Most likely, if running OpenStack Rocky on Stretch, you'd be running with upstream packages for Ceph Luminous. When upgrading to Buster, there's no upstream repository anymore, and packages will use Ceph Luminous directly from Buster. Unfortunately, the packages in Buster have a lower version than the packages from upstream. So before upgrading, we must remove all the Ceph packages from upstream. This is what was done just above for the compute nodes as well. Upstream Ceph packages are easily identifiable, because upstream uses "bpo90" instead of what we do in Debian (ie: bpo9), so the operation can be:
apt-get remove $(dpkg -l | grep bpo90 | awk '{print $2}' | tr '\n' ' ')
This will remove python3-nova, which is fine as it is also running on the other 2 controllers. After switching the /etc/apt/sources.list to buster, Nova can be installed again.
In a normal setup by OCI, here's the sequence of commands that needs to be run:
rm /etc/apt/sources.list.d/ceph.list
sed -i s/stretch/buster/g /etc/apt/sources.list
mv /etc/apt/sources.list.d/stretch-rocky.list /etc/apt/sources.list.d/buster-rocky.list
echo "deb http://stretch-rocky.debian.net/debian buster-rocky-proposed-updates main
deb-src http://stretch-rocky.debian.net/debian buster-rocky-proposed-updates main" >/etc/apt/sources.list.d/buster-rocky.list
apt-get update
apt-get dist-upgrade
apt-get install nova-api nova-conductor nova-consoleauth nova-consoleproxy nova-placement-api nova-scheduler
You may notice that we're replacing the Stretch Rocky backports repository with one for Buster. Indeed, even if all of Rocky is in Buster, a few packages are still pending review by the Debian stable release team before they can be uploaded to Buster, and we need the fixes for a smooth upgrade. See release team bugs #942201, #942102, #944594, #941901 and #939036 for more details.
Also, since we only did an "apt-get remove", the Nova configuration in nova.conf will have stayed in place, and Nova is already configured, so when we reinstall the services we removed along with the Ceph dependencies, they will be ready to go.
Upgrading the MariaDB galera cluster
In an HA OpenStack cluster, typically, a Galera MariaDB cluster is used. That isn't a problem when upgrading from Stretch to Buster, because the on-the-wire format stays the same. However, the xtrabackup library in Stretch is shipped by the MariaDB packages themselves, while in Buster, one must install the mariadb-backup package. As a consequence, the best approach is to simply turn off MariaDB on a node, do the Buster upgrade, install the mariadb-backup package, and restart MariaDB. To prevent the MariaDB package from attempting to restart the mysqld daemon during the upgrade, it's best to mask the systemd unit:
systemctl stop mysql.service
systemctl disable mysql.service
systemctl mask mysql.service
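Once the node is upgraded to Buster, the service can be brought back like this (a sketch; the unit name matches the one masked above):
apt-get install mariadb-backup
systemctl unmask mysql.service
systemctl enable mysql.service
systemctl start mysql.service
Then check that the node reports itself as synced with the Galera cluster, for example with:
mysql -e "SHOW STATUS LIKE 'wsrep_local_state_comment'"
which should show "Synced".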
Upgrading rabbitmq-server
Before doing anything, make sure your whole cluster is running with python3-oslo.messaging version >= 8.1.4. Indeed, version 8.1.3 suffers from a bug where daemons would constantly attempt to reconnect to the same server, instead of trying each of the servers described in the transport_url directive. Note that I've uploaded 8.1.4-1+deb10u1 to Buster, and that it is part of the 10.2 Buster point release. Be aware, though, that upgrading oslo.messaging will not restart the daemons automatically: this must be done manually.
The strategy for RabbitMQ is to completely upgrade one node, start Rabbit on it without any clustering, then shut down the service on the other 2 nodes of the cluster. If this is performed fast enough, no message will be lost on the message bus. However, there are a few traps. Running "rabbitmqctl forget_cluster_node" only removes a node from the cluster for the nodes that will still be running. It doesn't remove the other nodes from the one we want to upgrade. The way I've found to solve this is to simply remove the mnesia database of the first node, so that when it starts, RabbitMQ doesn't attempt to cluster with the other 2, which are running a different version of Erlang. If it did, it would just fail and refuse to start.
However, there's another issue to take care of. When upgrading the 1st node to Buster, we removed Nova because of the Ceph issue. Before we restart the RabbitMQ service on node 1, we need to reinstall Nova, so that it connects to either node 2 or 3. If we don't do that, then Nova on node 1 may connect to the RabbitMQ service on node 1, which at this point is a different RabbitMQ cluster than the one on nodes 2 and 3.
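A quick way to check the installed version on each node of the cluster:
dpkg -s python3-oslo.messaging | grep ^Version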
rabbitmqctl stop_app
systemctl stop rabbitmq-server.service
systemctl disable rabbitmq-server.service
systemctl mask rabbitmq-server.service
[ ... do the Buster upgrade fully ...]
[ ... reinstall Nova services we removed when removing Ceph ...]
rm -rf /var/lib/rabbitmq/mnesia
systemctl unmask rabbitmq-server.service
systemctl enable rabbitmq-server.service
systemctl start rabbitmq-server.service
At this point, since the node 1 RabbitMQ service was down, all daemons are connected to the RabbitMQ service on node 2 or 3. Removing the mnesia database removes all the credentials previously added to RabbitMQ. If nothing is done, the OpenStack daemons will not be able to connect to the RabbitMQ service on node 1. If, like me, one is using a config management system to populate the access rights, it's rather easy: simply re-apply the puppet manifests, which will re-add the credentials. However, that isn't enough: the RabbitMQ message queues are created when the OpenStack daemons start. As I experienced, daemons will reconnect to the message bus, but will not recreate the queues unless they are restarted. Therefore, the sequence is as follows:
Do "rabbitmqctl start_app" on the first node. Add all credentials to it (see the example commands after these steps). If your cluster was set up with OCI and puppet, simply look at the output of "puppet agent -t --debug" to capture the list of commands to perform the credential setup.
Do a "rabbitmqctl stop_app" on both remaining nodes 2 and 3. At this point, all daemons will reconnect to the only remaining server. However, they won't be able to exchange messages, as the queues aren't declared. This is when we must restart all daemons on one of the controllers. The whole operation normally doesn't take more than a few seconds, which is how long your message bus won't be available. To make sure everything works, check the logs in /var/log/nova/nova-compute.log on one of your compute nodes to verify that Nova is able to report its configuration to the placement service.
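The credential commands typically look like this (a sketch only: the user name, vhost and password below are placeholders, so use exactly what puppet outputs for your deployment):
rabbitmqctl add_user openstack THE_PASSWORD
rabbitmqctl set_permissions -p / openstack ".*" ".*" ".*"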
Once all of this is done, there's nothing to worry about anymore regarding RabbitMQ, as all daemons of the cluster are connected to the service on node 1. However, one must make sure that, when upgrading nodes 2 and 3, daemons don't reconnect to the message service on those nodes. So the best approach is to simply stop, disable and mask the service with systemd before continuing. Then, when restarting the Rabbit service on nodes 2 and 3, OCI's shell script "oci-auto-join-rabbitmq-cluster" will make them join the new Rabbit cluster, and everything should be fine regarding the message bus.
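To verify that a node has actually rejoined after its upgrade, the cluster can be inspected from any node:
rabbitmqctl cluster_status
All 3 nodes should eventually be listed as running again.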
Upgrading corosync
In an OpenStack cluster setup by OCI, 3 controllers are typically setup, serving the OpenStack API through a VIP (a Virtual IP). What we call a virtual IP is simply an IP address which is able to move from one node to another automatically depending on the cluster state. For example, with 3 nodes, if one goes down, one of the other 2 nodes will take over hosting the IP address which serves the OpenStack API. This is typically done with corosync/pacemaker, which is what OCI sets up.
Upgrading corosync is easier than the RabbitMQ case. The first node will refuse to start the corosync resource if it can't talk to at least a 2nd node. Therefore, upgrading the first node is transparent until we touch the 2nd node: the openstack-api resource won't be started on the first node, so we can finish the upgrade on it safely (ie: take care of RabbitMQ as per above). The first thing to do is probably to move the resource to the 3rd node:
crm_resource --move --resource openstack-api-vip --node z-controller-3.example.com
Once the first node is completely upgraded, we upgrade the 2nd node. When it is up again, we can check the corosync status to make sure it is running on both nodes 1 and 2:
crm status
If we see the service is up on node 1 and 2, we must quickly shutdown the corosync resource on node 3:
crm resource stop openstack-api-vip
If that's not done, then node 3 may also reclaim the VIP, and therefore, 2 nodes may be announcing it. If the VIP lives on a normal L2 network, switches will normally send traffic to only one of the machines declaring the VIP, so even if we don't take care of it immediately, the upgrade should be smooth anyway. If, like I do in production, you're running with BGP (OCI allows one to use BGP for the VIP, or simply use an IP on a normal L2 network), then the situation is even better, as the peering router will continue to route to one of the controllers in the cluster. So no stress: this must be done, but there's no need to hurry as much as for the RabbitMQ service.
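If in doubt about which controllers currently announce the VIP, a quick check (VIP_ADDRESS being a placeholder for the actual address of your API VIP) can be run on each node:
ip -o addr show | grep VIP_ADDRESS
Once the resource on node 3 is stopped, only one node should show it.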
Finalizing the upgrade
Once node 1 and 2 are up, most of the work is done, and the 3rd node can be upgraded without any stress.
Recap of the procedure for controllers
- Move all SNAT virtual routers running on node 1 to node 2 or 3 (see the router move example after this list; note that this isn't needed if the cluster has dedicated network nodes).
- Disable puppet on node 1.
- Remove all Ceph libraries from upstream on node 1, which also turns off some Nova services that depend on them at runtime.
- Shut down rabbitmq on node 1, including masking the service with systemd.
- Upgrade node 1 to Buster fully, then reboot it. This will probably trigger MySQL re-connections to node 2 or 3.
- Install mariadb-backup, start the mysql service, and make sure MariaDB is in sync with the other 2 nodes (check the log files).
- Reinstall the missing Nova services on node 1.
- Remove the mnesia db on node 1.
- Start rabbitmq on node 1 (which, at this point, isn't part of the RabbitMQ cluster on nodes 2 and 3).
- Disable puppet on node 2.
- Populate the RabbitMQ access rights on node 1. This can be done by simply applying puppet, but that may be dangerous if puppet restarts the OpenStack daemons (which may then connect to the RabbitMQ on node 1), so it's best to just re-apply the grant access commands only.
- Shut down rabbitmq on nodes 2 and 3 using "rabbitmqctl stop_app".
- Quickly restart all daemons on one controller (for example the daemons on node 1) to declare the message queues. Now all daemons must be reconnected to, and working with, the RabbitMQ cluster on node 1 alone.
- Re-enable puppet, and re-apply puppet on node 1.
- Move all Neutron virtual routers from node 2 to node 1.
- Make sure the RabbitMQ service is completely stopped on nodes 2 and 3 (mask the service with systemd).
- Upgrade node 2 to Buster (shutting down RabbitMQ completely, masking the service so it doesn't restart during the upgrade, removing the mnesia db for RabbitMQ, and finally making it rejoin the new single-node cluster on node 1 using oci-auto-join-rabbitmq-cluster: normally, puppet does that for us).
- Reboot node 2.
- When corosync on node 2 is up again, check the corosync status to make sure nodes 1 and 2 are clustering (maybe the resource on node 1 needs to be started), and shut down the corosync "openstack-api-vip" resource on node 3 to avoid the VIP being declared on both nodes.
- Re-enable puppet and run puppet agent -t on node 2.
- Make sure node 2's rabbitmq-server has joined the new cluster declared on node 1 (do: rabbitmqctl cluster_status) so we have HA for Rabbit again.
- Move all Neutron virtual routers from node 3 to node 1 or 2.
- Upgrade node 3 fully, reboot it, make sure Rabbit is connected to nodes 1 and 2 and corosync is working too, then re-apply puppet again.
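About moving the Neutron virtual routers between controllers: with the OpenStack client it looks roughly like the sketch below (the agent and router UUIDs are placeholders, and the exact options may vary with your client version):
openstack network agent list --agent-type l3
openstack router list --agent L3_AGENT_UUID_ON_NODE_1
openstack network agent remove router --l3 L3_AGENT_UUID_ON_NODE_1 ROUTER_UUID
openstack network agent add router --l3 L3_AGENT_UUID_ON_NODE_2 ROUTER_UUID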
Note that we do need to re-apply puppet each time, because of some differences between Stretch and Buster. For example, Neutron in Rocky isn't able to use iptables-nft, and puppet needs to run some update-alternatives commands to select iptables-legacy instead (I'm writing this because it isn't obvious: sometimes, Neutron simply fails to parse the output of iptables-nft…).
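In case you ever need to do that switch by hand (puppet normally takes care of it), on Buster it boils down to:
update-alternatives --set iptables /usr/sbin/iptables-legacy
update-alternatives --set ip6tables /usr/sbin/ip6tables-legacy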
Last words as a conclusion
While OpenStack itself has made a lot of progress regarding upgrades, it is very disappointing that the components OpenStack relies on (like corosync, which is typically used as the provider of high availability) aren't designed with backward compatibility in mind. It is also disappointing that the Erlang versions in Stretch and Buster are incompatible in this way.
However, with the correct procedure, it's still possible to keep services up and running, with a very small downtime, even to the point that a public cloud user wouldn't even notice it.
As the procedure isn't easy, I strongly suggest anyone attempting such an upgrade to train before proceeding. With OCI, it is easy to run a PoC using the openstack-cluster-installer-poc package, which is the perfect environment to train on: it's easy to reproduce, reinstall a cluster and restart the upgrade procedure.