A quick look into Storcli packaging horror

So, Megacli is to be replaced by Storcli, both being proprietary tools for configuring RAID cards from LSI.

So I went to download what’s provided by Lenovo, available here:
https://support.lenovo.com/fr/en/downloads/ds041827

It’s very annoying, because they force users to download a .zip file containing a deb file, instead of providing a Debian repository. Well, ok, at least there’s a deb file in there. Let’s have a look at what’s inside using my favorite tool before installing (ie: let’s run Lintian).
Then it’s a horror story. Not only are there obvious packaging mistakes, like the package shipping files in /opt, everything being statically linked with embedded copies of libm and ncurses, or the package being marked arch: all instead of arch: amd64 (in fact, the package contains both i386 and amd64 binaries…), but there are also some really wrong things going on:

E: storcli: arch-independent-package-contains-binary-or-object opt/MegaRAID/storcli/storcli
E: storcli: embedded-library opt/MegaRAID/storcli/storcli: libm
E: storcli: embedded-library opt/MegaRAID/storcli/storcli: ncurses
E: storcli: statically-linked-binary opt/MegaRAID/storcli/storcli
E: storcli: arch-independent-package-contains-binary-or-object opt/MegaRAID/storcli/storcli64
E: storcli: embedded-library opt/MegaRAID/storcli/storcli64: libm
E: storcli: embedded-library ... use --no-tag-display-limit to see all (or pipe to a file/program)
E: storcli: statically-linked-binary opt/MegaRAID/storcli/storcli64
E: storcli: changelog-file-missing-in-native-package
E: storcli: control-file-has-bad-permissions postinst 0775 != 0755
E: storcli: control-file-has-bad-owner postinst asif/asif != root/root
E: storcli: control-file-has-bad-permissions preinst 0775 != 0755
E: storcli: control-file-has-bad-owner preinst asif/asif != root/root
E: storcli: no-copyright-file
E: storcli: extended-description-is-empty
W: storcli: essential-no-not-needed
W: storcli: unknown-section storcli
E: storcli: depends-on-essential-package-without-using-version depends: bash
E: storcli: wrong-file-owner-uid-or-gid opt/ 1000/1000
W: storcli: non-standard-dir-perm opt/ 0775 != 0755
E: storcli: wrong-file-owner-uid-or-gid opt/MegaRAID/ 1000/1000
E: storcli: dir-or-file-in-opt opt/MegaRAID/
W: storcli: non-standard-dir-perm opt/MegaRAID/ 0775 != 0755
E: storcli: wrong-file-owner-uid-or-gid opt/MegaRAID/storcli/ 1000/1000
E: storcli: dir-or-file-in-opt opt/MegaRAID/storcli/
W: storcli: non-standard-dir-perm opt/MegaRAID/storcli/ 0775 != 0755
E: storcli: wrong-file-owner-uid-or-gid ... use --no-tag-display-limit to see all (or pipe to a file/program)
E: storcli: dir-or-file-in-opt opt/MegaRAID/storcli/storcli
E: storcli: dir-or-file-in-opt ... use --no-tag-display-limit to see all (or pipe to a file/program)

Some of the above are grave security problems, like the wrong Unix modes and ownership (uid/gid 1000 instead of root) on the installed directories, or the preinst and postinst scripts being shipped with a non-root owner.
I always wonder why this type of tool needs to be proprietary. They clearly don’t know how to get packaging right, so they’d better just provide the source code, and let us (the Debian community) do the work for them. I don’t think there’s any secret that they are keeping by hiding how to configure the cards, so it’s not in the vendor’s interest to keep everything closed. Or maybe they are just hiding really bad code in there, that they are ashamed to share? In any case, they’d better provide no package at all rather than this pile of dirt (and I’m trying to stay polite here…).

Upgrading an OpenStack Rocky cluster from Stretch to Buster

Upgrading an OpenStack cluster from one version of OpenStack to another has become easier, thanks to the versioning of objects in the rabbitmq message bus (if you want to know more, see what oslo.versionedobjects is). But upgrading from Stretch to Buster isn’t easy at all, even with the same version of OpenStack (it is easier to be running OpenStack Rocky backports on Stretch and upgrade to Rocky on Buster, rather than upgrading OpenStack at the same time as the system).

The reason it is difficult is that rabbitmq and corosync in Stretch can’t talk to the versions shipped in Buster. Also, in a normal OpenStack cluster deployment, services on all machines are constantly doing queries to the OpenStack API, and exchanging messages through the RabbitMQ message bus. One of the dangers, for example, would be if a Neutron DHCP agent could not exchange messages with the neutron-rpc-server. Your VM instances in the OpenStack cluster could then lose connectivity.

If a constantly online HA upgrade with no downtime isn’t possible, it is however possible to minimize downtime to just a few seconds, if following a correct procedure. It took me more than 10 tries to be able to do everything in a smooth way, understanding and working around all the issues. 10 tries means installing an OpenStack cluster in Stretch 10 times (which, even if fully automated, takes about 2 hours) and trying to upgrade it to Buster. All of this is very time consuming, and I haven’t seen any web site documenting this process.

This blog post intends to document such a process, to save the readers the pain of hours of experimentation.

Note that this blog post assumes your cluster has been deployed using OCI (see: https://salsa.debian.org/openstack-team/debian/openstack-cluster-installer). However, it should also apply to any generic OpenStack installation, or even to any cluster running RabbitMQ and Corosync.

The root cause of the problem in more detail: incompatible RabbitMQ and Corosync in Stretch and Buster

RabbitMQ in Stretch is version 3.6.6, and Buster has version 3.7.8. In theory, the documentation of RabbitMQ says it is possible to smoothly upgrade a cluster with these versions. However, in practice, the problem is the Erlang version rather than Rabbit itself: RabbitMQ in Buster will refuse to talk to a cluster running Stretch (the daemon will even refuse to start).

The same way, Corosync 3.0 in Buster will refuse to accept messages from Corosync 2.4 in Stretch.

Overview of the solution for RabbitMQ & Corosync

To minimize downtime, my method is to shut down RabbitMQ on node 1, and let all daemons (re-)connect to node 2 and 3. Then we upgrade node 1 fully, and then restart Rabbit on it. Then we shut down Rabbit on node 2 and 3, so that all daemons of the cluster reconnect to node 1. If done well, the only issue is if a message is still in the cluster of node 2 and 3 when daemons fail over to node 1. In reality, this isn’t really a problem, unless there’s a lot of activity on the API of OpenStack. If this was the case (for example, if running a public cloud), then the advice would simply be to firewall the OpenStack API for the short upgrade period (which shouldn’t last more than a few minutes).

Then we upgrade node 2 and 3 and make them join the newly created RabbitMQ cluster in node 1.

For Corosync, node 1 will not start the VIP resource before node 2 is upgraded and both nodes can talk to each other. So we just upgrade node 2, and turn off the VIP resource on node 3 as soon as it is up on node 1 and 2 (which happens during the upgrade of node 2).

The above should be enough reading for most readers. If you’re not that much into OpenStack, it’s ok to stop reading this post. For those who are more involved users of OpenStack on Debian deployed with OCI, let’s go into more detail…

Before you start: upgrading OCI

In previous versions of OCI, the haproxy configuration was missing an “option httpchk” for the MariaDB backend, and therefore, if a MySQL server on one node was going down, haproxy wouldn’t detect it, and the whole cluster could fail (re-)connecting to MySQL. As we’re going to bring some MySQL servers down, make sure the puppet-master is running with the latest version of puppet-module-oci, and that the changes have been applied on all OpenStack controller nodes.

Upgrading compute nodes

Before we upgrade the controllers, it’s best to start with the compute nodes, which are the easiest to do. The easiest way is to live-migrate all VMs away from the machine before proceeding. First, we disable the node, so no new VM can be spawned on it:

openstack compute service set --disable z-compute-1.example.com nova-compute

Then we list all VMs on that compute node:

openstack server list --all-projects --host z-compute-1.example.com

Finally we migrate all VMs away:

openstack server migrate --live hostname-compute-3.infomaniak.ch --block-migration 8dac2f33-d4fd-4c11-b814-5f6959fe9aac
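When a compute node hosts many VMs, the list and migrate steps above can be chained. The following is only a sketch: the function name is made up, and it assumes the same client syntax as the command above, plus the standard "-f value -c ID" output formatting of the openstack client.

```shell
#!/bin/sh
# Sketch only: live-migrate every instance away from one compute node.
# drain_compute_node is a hypothetical helper, not something shipped by OCI.
# $1 = node to empty, $2 = node receiving the instances.
drain_compute_node() {
    src_host="$1"
    dst_host="$2"
    # "-f value -c ID" prints one instance UUID per line.
    openstack server list --all-projects --host "$src_host" -f value -c ID |
    while read -r server_id; do
        # Skip empty lines, if any.
        [ -n "$server_id" ] || continue
        echo "Migrating $server_id to $dst_host"
        # stdin is redirected so the client cannot eat the rest of the list.
        openstack server migrate --live "$dst_host" --block-migration "$server_id" </dev/null
    done
}
```

It would be called as "drain_compute_node z-compute-1.example.com z-compute-3.example.com" (both host names are examples). The migrations are only requested here, so it is still worth running the list command again afterwards to check the node is really empty before upgrading it.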

Now we can do the upgrade. First disable puppet, then tweak the sources.list, upgrade and reboot:

puppet agent --disable "Upgrading to buster"
apt-get remove python3-rgw python3-rbd python3-rados python3-cephfs librgw2 librbd1 librados2 libcephfs2
rm /etc/apt/sources.list.d/ceph.list
sed -i s/stretch/buster/g /etc/apt/sources.list
mv /etc/apt/sources.list.d/stretch-rocky.list /etc/apt/sources.list.d/buster-rocky.list
echo "deb http://stretch-rocky.debian.net/debian buster-rocky-proposed-updates main
deb-src http://stretch-rocky.debian.net/debian buster-rocky-proposed-updates main" >/etc/apt/sources.list.d/buster-rocky.list
apt-get update
apt-get dist-upgrade
reboot

Then we simply re-apply puppet:

puppet agent --enable ; puppet agent -t
apt-get purge linux-image-4.19.0-0.bpo.5-amd64 linux-image-4.9.0-9-amd64

Then we can re-enable the compute service:

openstack compute service set --enable z-compute-1.example.com nova-compute

Repeat the operation for all compute nodes, then we’re ready for the upgrade of the controller nodes.

Removing Ceph dependencies from nodes

Most likely, if running with OpenStack Rocky on Stretch, you’d be running with upstream packages for Ceph Luminous. When upgrading to Buster, there’s no upstream repository anymore, and packages will use Ceph Luminous directly from Buster. Unfortunately, the packages from Buster are in a lower version than the packages from upstream. So before upgrading, we must remove all Ceph packages from upstream. This is what has been done just above for the compute nodes also. Upstream Ceph packages are easily identifiable, because upstream uses “bpo90” instead of what we do in Debian (ie: bpo9), so the operation can be:

apt-get remove $(dpkg -l | grep bpo90 | awk '{print $2}' | tr '\n' ' ')

This will remove python3-nova, which is fine as it is also running on the other 2 controllers. After switching the /etc/apt/sources.list to buster, Nova can be installed again.
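Before removing anything, it can be reassuring to preview which packages will be matched. Below is a small sketch of a stricter filter than the grep above: it only looks at the version field, so a package merely mentioning “bpo90” in its description would not match. The function name is hypothetical.

```shell
#!/bin/sh
# Sketch only: read "package version" pairs on stdin and print the names
# of the packages whose version string contains "bpo90" (upstream Ceph).
# list_upstream_ceph_packages is a made-up name for this example.
list_upstream_ceph_packages() {
    awk '$2 ~ /bpo90/ { print $1 }'
}

# Typical use, feeding it the installed package list:
#   dpkg-query -W -f='${Package} ${Version}\n' | list_upstream_ceph_packages
```

Debian’s own backports carry “bpo9+1” style suffixes, which this pattern does not match, so only the upstream builds are listed.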

In a normal setup by OCI, here’s the sequence of commands that needs to be run:

rm /etc/apt/sources.list.d/ceph.list
sed -i s/stretch/buster/g /etc/apt/sources.list
mv /etc/apt/sources.list.d/stretch-rocky.list /etc/apt/sources.list.d/buster-rocky.list
echo "deb http://stretch-rocky.debian.net/debian buster-rocky-proposed-updates main
deb-src http://stretch-rocky.debian.net/debian buster-rocky-proposed-updates main" >/etc/apt/sources.list.d/buster-rocky.list
apt-get update
apt-get dist-upgrade
apt-get install nova-api nova-conductor nova-consoleauth nova-consoleproxy nova-placement-api nova-scheduler

You may notice that we’re replacing the Stretch Rocky backports repository by one for Buster. Indeed, even if all of Rocky is in Buster, there are a few packages still pending review by the Debian stable release team before they can be uploaded to Buster, and we need the fixes for a smooth upgrade. See release team bugs #942201, #942102, #944594, #941901 and #939036 for more details.

Also, since we only did an “apt-get remove”, the Nova configuration in nova.conf has stayed in place, and nova is already configured, so when we reinstall the services we removed when removing the Ceph dependencies, they will be ready to go.

Upgrading the MariaDB galera cluster

In an HA OpenStack cluster, typically, a Galera MariaDB cluster is used. That isn’t a problem when upgrading from Stretch to Buster, because the on-the-wire format stays the same. However, the xtrabackup library in Stretch is shipped by the MariaDB packages themselves, while in Buster, one must install the mariadb-backup package. As a consequence, it is best to simply turn off MariaDB on a node, do the Buster upgrade, install the mariadb-backup package, and restart MariaDB. To prevent the MariaDB package from attempting to restart the mysqld daemon, it is best to mask the systemd unit:

systemctl stop mysql.service
systemctl disable mysql.service
systemctl mask mysql.service
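Once the node is upgraded and mariadb-backup is installed, the unit has to be unmasked and started again, and the node must catch up with the other two. Besides reading the log files, one way to check this is the Galera status variable wsrep_local_state_comment, which reads “Synced” when the node is in sync. The helper below is a sketch with a made-up name, and it assumes the mysql client can log in as root without a password prompt (for example through the unix socket).

```shell
#!/bin/sh
# Sketch only: succeed when the local Galera node reports itself "Synced".
# galera_is_synced is a hypothetical helper name.
galera_is_synced() {
    # -N drops the column headers, -B gives tab-separated batch output,
    # so the line looks like: wsrep_local_state_comment<TAB>Synced
    state="$(mysql -N -B -e "SHOW STATUS LIKE 'wsrep_local_state_comment'" | awk '{ print $2 }')"
    [ "$state" = "Synced" ]
}

# After the Buster upgrade of the node, one would typically run:
#   apt-get install mariadb-backup
#   systemctl unmask mysql.service
#   systemctl enable mysql.service
#   systemctl start mysql.service
#   until galera_is_synced; do sleep 5; done
```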

Upgrading rabbitmq-server

Before doing anything, make sure all of your cluster is running python3-oslo.messaging version >= 8.1.4. Indeed, version 8.1.3 suffers from a bug where daemons would constantly attempt to reconnect to the same server, instead of trying each of the servers described in the transport_url directive. Note that I’ve uploaded 8.1.4-1+deb10u1 to Buster, and that it is part of the 10.2 Buster point release. However, upgrading oslo.messaging will not restart daemons automatically: this must be done manually.

The strategy for RabbitMQ is to completely upgrade one node, start Rabbit on it without any clustering, then shut down the service on the other 2 nodes of the cluster. If this is performed fast enough, no message will be lost in the message bus. However, there are a few traps. Running “rabbitmqctl forget_cluster_node” only removes a node from the cluster for the nodes that will still be running. It doesn’t remove the other nodes from the one which we want to upgrade. The way I’ve found to solve this is to simply remove the mnesia database of the first node, so that when it starts, RabbitMQ doesn’t attempt to cluster with the other 2, which are running a different version of Erlang. If it did, then it would just fail and refuse to start.

However, there’s another issue to take care of. When upgrading the 1st node to Buster, we removed Nova, because of the Ceph issue. Before we restart the RabbitMQ service on node 1, we need to install Nova, so that it will connect to either node 2 or 3. If we don’t do that, then Nova on node 1 may connect to the RabbitMQ service on node 1, which at this point, is a different RabbitMQ cluster than the one on node 2 and 3.

rabbitmqctl stop_app
systemctl stop rabbitmq-server.service
systemctl disable rabbitmq-server.service
systemctl mask rabbitmq-server.service
[ ... do the Buster upgrade fully ...]
[ ... reinstall Nova services we removed when removing Ceph ...]
rm -rf /var/lib/rabbitmq/mnesia
systemctl unmask rabbitmq-server.service
systemctl enable rabbitmq-server.service
systemctl start rabbitmq-server.service

At this point, since the node 1 RabbitMQ service was down, all daemons are connected to the RabbitMQ service on node 2 or 3. Removing the mnesia database removes all the credentials previously added to rabbitmq. If nothing is done, OpenStack daemons will not be able to connect to the RabbitMQ service on node 1. If, like I do, one is using a config management system to populate the access rights, it’s rather easy: simply re-apply the puppet manifests, which will re-add the credentials. However, that isn’t enough: the RabbitMQ message queues are created when the OpenStack daemons start. As I experienced, daemons will reconnect to the message bus, but will not recreate the queues unless they are restarted. Therefore, the sequence is as follows:

Do “rabbitmqctl start_app” on the first node. Add all credentials to it. If your cluster was setup with OCI and puppet, simply look at the output of “puppet agent -t --debug” to capture the list of commands to perform the credential setup.

Do a “rabbitmqctl stop_app” on both remaining nodes 2 and 3. At this point, all daemons will reconnect to the only remaining server. However, they won’t be able to exchange messages, as the queues aren’t declared. This is when we must restart all daemons on one of the controllers. The whole operation normally doesn’t take more than a few seconds, which is how long your message bus won’t be available. To make sure everything works, check the logs in /var/log/nova/nova-compute.log of one of your compute nodes to make sure Nova is able to report its configuration to the placement service.
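For the credential part of the first step, what puppet ends up running boils down to rabbitmqctl calls. As a rough sketch (the helper name, the user names and the passwords are placeholders, and the exact set of users and rights on your cluster should be taken from the puppet debug output rather than from this example), re-adding one user with full rights on the default vhost looks like this:

```shell
#!/bin/sh
# Sketch only: re-create one RabbitMQ user after the mnesia db was wiped.
# restore_rabbit_user is a hypothetical helper; $1 = user, $2 = password.
restore_rabbit_user() {
    user="$1"
    pass="$2"
    # Create the account, then grant configure/write/read on vhost "/".
    rabbitmqctl add_user "$user" "$pass" &&
    rabbitmqctl set_permissions -p / "$user" '.*' '.*' '.*'
}

# Example with placeholder values, one call per OpenStack service user:
#   restore_rabbit_user nova "the-password-from-nova.conf"
```

The password has to match the one in the transport_url of each service, otherwise daemons will fail to authenticate when they fail over to node 1.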

Once all of this is done, there’s nothing more to worry about regarding RabbitMQ, as all daemons of the cluster are connected to the service on node 1. However, one must make sure that, when upgrading node 2 and 3, daemons don’t reconnect to the message service on node 2 and 3. So it is best to simply stop, disable and mask the service with systemd before continuing. Then, when restarting the Rabbit service on node 2 and 3, OCI’s shell script “oci-auto-join-rabbitmq-cluster” will make them join the new Rabbit cluster, and everything should be fine regarding the message bus.

Upgrading corosync

In an OpenStack cluster setup by OCI, 3 controllers are typically setup, serving the OpenStack API through a VIP (a Virtual IP). What we call a virtual IP is simply an IP address which is able to move from one node to another automatically depending on the cluster state. For example, with 3 nodes, if one goes down, one of the other 2 nodes will take over hosting the IP address which serves the OpenStack API. This is typically done with corosync/pacemaker, which is what OCI sets up.

The way to upgrade corosync is easier than the RabbitMQ case. The first node will refuse to start the corosync resource if it can’t talk to at least a 2nd node. Therefore, upgrading the first node is transparent until we touch the 2nd node: the openstack-api resource won’t be started on the first node, so we can finish the upgrade on it safely (ie: take care of RabbitMQ as per above). The first thing to do is probably to move the resource to the 3rd node:

crm_resource --move --resource openstack-api-vip --node z-controller-3.example.com

Once the first node is completely upgraded, we upgrade the 2nd node. When it is up again, we can check the corosync status to make sure it is running on both node 1 and 2:

crm status

If we see the service is up on node 1 and 2, we must quickly shut down the corosync resource on node 3:

crm resource stop openstack-api-vip

If that’s not done, then node 3 may also reclaim the VIP, and therefore, 2 nodes may hold it. If running with the VIP using L2 protocol, normally switches will connect only one of the machines declaring the VIP, so even if we don’t take care of it immediately, the upgrade should be smooth anyway. If, like I do in production, you’re running with BGP (OCI allows one to use BGP for the VIP, or simply use an IP on a normal L2 network), then the situation should be even better, as the peering router will continue to route to one of the controllers in the cluster. So no stress: this must be done, but there’s no need to hurry as much as for the RabbitMQ service.
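To see which node currently holds the VIP, on either side of the split, pacemaker’s crm_resource can locate the resource. The sketch below uses a made-up helper name and assumes the usual “resource NAME is running on: NODE” wording of crm_resource’s output, which may differ between pacemaker versions, so treat the parsing as an assumption.

```shell
#!/bin/sh
# Sketch only: print the node(s) on which a pacemaker resource is running.
# vip_locations is a hypothetical helper; $1 = resource name.
# Assumes crm_resource prints lines like:
#   resource openstack-api-vip is running on: z-controller-1.example.com
vip_locations() {
    crm_resource --resource "$1" --locate 2>/dev/null |
        sed -n 's/^resource .* is running on: \([^ ]*\).*$/\1/p'
}

# Example: vip_locations openstack-api-vip
```

Running it on node 1 (new cluster) and on node 3 (old cluster) shows whether both sides believe they own the address, in which case the resource on node 3 still has to be stopped.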

Finalizing the upgrade

Once node 1 and 2 are up, most of the work is done, and the 3rd node can be upgraded without any stress.

Recap of the procedure for controllers

  • Move all SNAT virtual routers running on node 1 to node 2 or 3 (note: this isn’t needed if the cluster has network nodes).
  • Disable puppet on node 1.
  • Remove all Ceph libraries from upstream on node 1, which also turns off some Nova services that depend on them at runtime.
  • shutdown rabbitmq on node 1, including masking the service with systemd.
  • upgrade node 1 to Buster, fully. Then reboot it. This probably will trigger MySQL re-connections to node 2 or 3.
  • install mariadb-backup, start the mysql service, and make sure MariaDB is in sync with the other 2 nodes (check the log files).
  • reinstall missing Nova services on node 1.
  • remove the mnesia db on node 1.
  • start rabbitmq on node 1 (which now, isn’t part of the RabbitMQ cluster on node 2 and 3).
  • Disable puppet on node 2.
  • populate RabbitMQ access rights on node 1. This can be done by simply applying puppet, but may be dangerous if puppet restarts the OpenStack daemons (which therefore may connect to the RabbitMQ on node 1), so best is to just re-apply the grant access commands only.
  • shutdown rabbitmq on node 2 and 3 using “rabbitmqctl stop_app”.
  • quickly restart all daemons on one controller (for example the daemons on node 1) to declare message queues. Now all daemons must be reconnected and working with the RabbitMQ cluster on node 1 alone.
  • Re-enable puppet, and re-apply puppet on node 1.
  • Move all Neutron virtual routers from node 2 to node 1.
  • Make sure the RabbitMQ services are completely stopped on node 2 and 3 (mask the service with systemd).
  • upgrade node 2 to Buster (shutting down RabbitMQ completely, masking the service to avoid it restarting during the upgrade, removing the mnesia db for RabbitMQ, and finally making it join the new single-node cluster on node 1 using oci-auto-join-rabbitmq-cluster: normally, puppet does that for us).
  • Reboot node 2.
  • When corosync on node 2 is up again, check corosync status to make sure we are clustering between node 1 and 2 (maybe the resource on node 1 needs to be started), and shut down the corosync “openstack-api-vip” resource on node 3 to avoid the VIP being declared on both sides.
  • Re-enable puppet and run puppet agent -t on node 2.
  • Make sure the rabbitmq-server on node 2 has joined the new cluster declared on node 1 (do: rabbitmqctl cluster_status) so we have HA for Rabbit again.
  • Move all Neutron virtual routers of node 3 to node 1 or 2.
  • Upgrade node 3 fully, reboot it, and make sure Rabbit is connected to node 1 and 2, as well as corosync working too, then re-apply puppet again.

Note that we do need to re-apply puppet each time, because of some differences between Stretch and Buster. For example, Neutron in Rocky isn’t able to use iptables-nft, and puppet needs to run some update-alternatives command to select iptables-legacy instead (I’m writing this because this isn’t obvious, it’s just that sometimes, Neutron fails to parse the output of iptables-nft…).
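For reference, switching to the legacy iptables backend by hand on Buster is done with update-alternatives. The following is only a sketch of that switch, with a made-up function name; what OCI’s puppet manifests actually run may be different or cover more alternatives.

```shell
#!/bin/sh
# Sketch only: select the legacy (non-nft) iptables backend on Buster.
# select_legacy_iptables is a hypothetical helper name.
select_legacy_iptables() {
    update-alternatives --set iptables /usr/sbin/iptables-legacy &&
    update-alternatives --set ip6tables /usr/sbin/ip6tables-legacy
}

# Example: run "select_legacy_iptables" as root, then restart the Neutron agents.
```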

Last words as a conclusion

While OpenStack itself has made a lot of progress regarding upgrades, it is very disappointing that the components on which OpenStack relies (like corosync, which is typically used as the provider of high availability) aren’t designed with backward compatibility in mind. It is also disappointing that the Erlang versions in Stretch and Buster are incompatible this way.

However, with the correct procedure, it’s still possible to keep services up and running, with a very small down time, even to the point that a public cloud user wouldn’t even notice it.

As the procedure isn’t easy, I strongly suggest that anyone attempting such an upgrade train before proceeding. With OCI, it is easy to run a PoC using the openstack-cluster-installer-poc package, which is the perfect environment to train on: it’s easy to reproduce, reinstall a cluster and restart the upgrade procedure.

My work during DebCamp / DebConf

Lots of uploads

Grepping my IRC log for the BTS bot output shows that I uploaded roughly 244 times in Curitiba.

Removing Python 2 from OpenStack by uploading OpenStack Stein in Sid

Most of these uploads were moving OpenStack Stein from Experimental to Sid, with a record of 96 uploads in a single day. As the work for Python 2 removal was done before the Buster release (uploads in Experimental), this effectively removed a lot of Python 2 support.

Removing Python 2 from Django packages

But once that was done, I started uploading some Django packages. Indeed, since Django 2.2 was uploaded to Sid with the removal of Python 2 support, a lot of dangling python-django-* packages needed to be fixed. Not only did Python 2 support need to be removed from them, but often, patches were needed in order to fix at least the unit tests, since Django 2.2 removed a lot of things that had been deprecated for a few versions. I went through all of the django packages we have in Debian, and I believe I fixed most of them. I did 43 uploads of Django packages, fixing 39 packages.

Removing Python 2 support from non-django or OpenStack packages

During the Python BoF at Curitiba, we collectively decided it was time to remove Python 2, and that we’ll try to do as much of that work as possible before Bullseye. Details of this will come from our dear leader p1otr, so I’ll let him write the document and won’t comment (yet) on how we’re going to proceed. Anyway, we already have a “python2-rm” release tracker. After the Python BoF, I then also started removing Python 2 support from a few packages with more generic usage, hopefully touching only leaf packages, without breaking things. I’m not sure of the total count of packages that I touched, probably a bit less than a dozen.

Horizon broken in Sid since the beginning of July

Unfortunately, Horizon, the OpenStack dashboard, is currently still broken in Debian Sid. Indeed, since Django 1.11, the login() function in views.py has been deprecated in favor of a LoginView class. And in Django 2.2, the support for the function has been removed. As a consequence, since the 9th of July, when Django 2.2 was uploaded, Horizon’s openstack_auth/views.py is broken. Upstream says they are targeting Django 2.2 for next February. That’s way too late. Hopefully, someone will be able to fix this situation with me (it’s probably a bit too much for my Django skills). Once this is fixed, I’ll be able to work on all the Horizon plugins which are still in Experimental. Note that I already fixed all of Horizon’s reverse dependencies in Sid, but some of the patches need to be upstreamed.

Next work (from home): fixing piuparts

I’ve already written a first attempt at a patch for piuparts, so that it uses Python 3 and not Python 2 anymore. That patch is already available as a merge request on Salsa, though I haven’t had the time to test it yet. What remains to be done is: actually test piuparts with this patch, and fix debian/control so that it switches to Python 3.

Wrote a Debian mirror setup puppet module in 3 hours

As I needed the functionality, I wrote this:

https://salsa.debian.org/openstack-team/puppet/puppet-module-debian-archvsync

The matching Debian package has been uploaded and is now in the NEW queue. Thanks a lot to Waldi for packaging ftpsync, which I’m using.

Comments and contributions are welcome.

OpenStack-cluster-installer in Buster

I’ve been working on this for more than a year, and finally, I am achieving my goal. I wrote an OpenStack cluster installer that is fully in Debian, and running in production at Infomaniak.

Note: I originally wrote this blog post a few weeks ago, though it was pending validation from my company (to make sure I wouldn’t disclose company business information).

What is it?

As per the package description and the package name, OCI (OpenStack Cluster Installer) is a piece of software to provision an OpenStack cluster automatically, with a “push button” interface. The OCI package depends on a DHCP server, a PXE (tftp-hpa) boot server, a web server, and a puppet-master.

Once computers in the cluster boot for the first time over network (PXE boot), a Debian live system squashfs image is served by OCI (served by Apache), to act as a discovery image. This live system then reports the hardware features of the booted machine back to OCI (CPU, memory, HDDs, network interfaces, etc.). The computers can then be installed with Debian from that live system. During this process, a puppet-agent is configured so that it will connect to the puppet-master of OCI. Upon first boot, OpenStack services are then installed and configured, depending on the server role in the cluster.

OCI is fully packaged in Debian, including all of the Puppet modules and so on. So just doing “apt-get install openstack-cluster-installer” is enough to bring in absolutely all dependencies, and no other artifacts are needed. This is very important, as one only needs a local Debian mirror to install an OpenStack cluster. No external components must be downloaded from the internet.

OCI setting-up a Swift cluster

At the beginning of OCI’s life, we first used it at Infomaniak (my employer) to setup a Swift cluster. Swift is the object server of OpenStack. It is a perfect solution for a (very) large backup system.

Think of a massive highly available cluster, with a capacity reaching petabytes, storing millions of objects/files 3 times (for redundancy). Swift can virtually scale to infinity as long as you size your ring correctly.

The Infomaniak setup is also redundant at the data center level, as our cluster spans 2 data centers, with at least one copy of everything stored in each data center (the location of the 3rd copy depends on many things, and explaining it is not in the scope of this post).

If one wishes to use swift, it’s ok to start with 7 machines: 3 machines for the controllers (holding the Keystone authentication, and a bit more), at least 1 swift-proxy machine, and 3 storage nodes. Though for redundancy purposes, it is IMO not good enough to start with only 3 storage nodes: if one fails, the proxy server will fall into timeouts waiting for the 3rd storage node. So 6 storage nodes feels like a better minimum. They don’t have to be top-notch servers though: a cluster made of refurbished old hardware with only a few disks can do it, if you don’t need to store too much data.

Setting-up an OpenStack compute cluster

Though swift was the first thing OCI did for us, it can now do way more than just Swift. Indeed, it can also setup a full OpenStack cluster with Nova (compute), Neutron (networking) and Cinder (network block devices). We also started using all of that, setup by OCI, at Infomaniak. Here’s the list of services currently supported:

  • Keystone (identity)
  • Heat (orchestration)
  • Aodh (alarming)
  • Barbican (key/secret manager)
  • Nova (compute)
  • Glance (VM images)
  • Swift (object store)
  • Panko (event)
  • Ceilometer (resource monitoring)
  • Neutron (networking)
  • Cinder (network block device)

On the backend, OCI can use LVM or Ceph for Cinder, local storage or Ceph for Nova instances.

Full HA redundancy

The nice thing is, absolutely every component setup by OCI is done in a high availability way. Each machine of the control plane of OpenStack is setup with an instance of the components: all OpenStack controller components, a MariaDB server part of the Galera cluster, etc.

HAProxy is also setup on all controllers, in front of all of the REST API servers of OpenStack. And finally, the web address where final clients will connect is in fact a virtual IP, that can move from one server to another, thanks to corosync. Routing to that VIP can be done either over L2 (ie: a static address on a local network), or over BGP (useful if you need multi-datacenter redundancy). So if one of the controllers is down, it’s not such a big deal, HAproxy will detect this within seconds, and if it was the server that had the virtual IP (matching the API endpoint), then this IP will move to one of the other servers.

Full SSL transport

One of the things that OCI does when installing Debian is to setup a PKI (ie: SSL certificates signed by a local root CA) so that everything in the cluster is transported over SSL. Haproxy, of course, does the SSL, but it also connects to the different API servers over SSL too. All connections to the RabbitMQ servers are also performed over SSL. If one wishes, it’s possible to replace the self-signed SSL certificates before the cluster is deployed, so that the OpenStack API endpoint can be exposed on a public address.

OCI as a quite modular system

If one decides to use Ceph for storage, then for every compute node of the cluster, it is possible to choose to use either Ceph for the storage of /var/lib/nova/instances, or use local storage. In the latter case, then of course, using RAID is strongly advised, to avoid any possible loss of data. It is possible to mix both types of compute node storage in a single cluster, and create server aggregates so it is later possible to decide which type of compute server to run the workload on.

If a Ceph cluster is part of the cluster, then on every compute node, the cinder-volume and cinder-backup services will be provisioned. They will be used to control the Cinder volumes of the Ceph cluster. Even though the network block storage itself will not run on the compute machines, it makes sense to do that: the number of these processes needs to scale along with the number of compute nodes. Also, on compute servers, the Ceph secret is already setup using libvirt, so it was also convenient to re-use this.

As for Glance, if you have Ceph, it will use it as backend. If not, it will use Swift. And if you don’t have a Swift cluster, it will fall back to the normal file backend, with a simple rsync from the first controller to the others. In such a setup, only the first controller is used for glance-api. The other controllers also run glance-api, but HAProxy doesn’t use them, as we really want the images to be stored on the first controller, so they can be rsynced to the others. In practice, it’s not such a big deal, because the images are in the cache of the compute servers anyway when in use.
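The rsync fallback can be pictured with a small sketch like the one below. It is an assumption about the general shape of such a sync, not OCI’s actual code: the controller host names are hypothetical, and /var/lib/glance/images/ is Glance’s usual file store directory rather than a path taken from OCI.

```shell
#!/bin/sh
# Sketch only: host names are hypothetical, the image directory is assumed.
# With DRY_RUN=1 (the default here) commands are printed, not executed.
DRY_RUN="${DRY_RUN:-1}"
IMAGE_DIR="${IMAGE_DIR:-/var/lib/glance/images/}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# sync_glance_images CONTROLLER...
# Run on the first controller: pushes its image directory to the others.
sync_glance_images() {
    for target in "$@"; do
        run rsync -a --delete "$IMAGE_DIR" "$target:$IMAGE_DIR"
    done
}
```

Calling "sync_glance_images ctrl-2 ctrl-3" prints one rsync command per secondary controller.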

If one sets up cinder volume nodes, then cinder-volume and cinder-backup will be installed there, and the system will automatically know that there’s Cinder with an LVM backend. Both Cinder over LVM and over Ceph can be set up on the same cluster (I never really tried this, though I don’t see why it wouldn’t work: normally, both backends will simply be available).

OCI in Buster vs current development

Lots of new features are being added to OCI. These, unfortunately, won’t make it to Buster, though the Buster release has just enough to be able to provision a working OpenStack cluster.

Future features

What I envision for OCI is to make it able to provision a cluster ready to serve as a public cloud. This means having all of the resource accounting set up, as well as cloudkitty (which is OpenStack’s resource rating engine). I’ve already played a bit with this, and it should be out fast. Then the only missing bit to go public will be billing of the rated resources, which obviously has to be done in-house, and doesn’t need to live within the OpenStack cluster itself.

The other thing I am planning to do is to add more and more services. Currently, even though OCI can set up a fully working OpenStack, it is still a basic one. I do want to add advanced features like Octavia (load balancer as a service), Magnum (Kubernetes cluster as a service), Designate (DNS), Manila (shared filesystems) and much more if possible. The number of available projects is really big, so it will probably keep me busy for a very long time.

At this point, what OCI also misses is a custom Debian installer ISO image that would include absolutely everything. It shouldn’t be hard to write, though I lack the basic knowledge on how to do this. Maybe I will work on this at this summer’s DebConf. In the end, it could be a Debian pure blend (ie: a fully integrated distro-in-the-distro system, just like debian-edu or debian-med). It’d be nice if this ISO image could include all of the packages for the cluster, so that no external resources would be needed. Then setting up an OpenStack cluster with no internet connectivity at all would become possible. In fact, only the API endpoint on port 443 and the virtual machines need internet access; your management network shouldn’t be connected (it’s much safer this way).

No, there weren’t 80 engineers who burned out in the process of implementing OCI

One thing that makes me proud is that I wrote all of my OpenStack installer nearly alone (truth: leveraging all the work of puppet-openstack, it wouldn’t have been possible without it…). That’s unique in the (small) OpenStack world. At companies like my previous employer, or a famous company working on RPM based distros, this kind of product is the work of dozens of engineers. I heard that Red Hat has nearly 100 employees working on TripleO. This was possible because I tried to keep OCI in the spirit of “keep it simple, stupid”. It does only what’s needed, implemented in the simplest way possible, so that it is easy to maintain.

For example, the hardware discovery agent is made of 63 lines of POSIX shell script (that is: not even bash… but dash), while I’ve seen others using really over-engineered stuff, like heavy Ruby or Python modules. Ironic-inspector, for example, in the Rocky release, is made of 98 files, for a total of 17974 lines. I really wonder what they are doing with all of this (I didn’t dare to look). There is one thing I’m sure of: what I did is really enough for OCI’s needs, and I don’t want to run a 250+ MB initrd as the discovery system. OCI’s live-build based discovery image, loaded over the web rather than PXE, is way smarter.

In the same spirit, the part that does the bare-metal provisioning is the same shell script that I wrote to create the official Debian OpenStack images. It was about 700 lines of shell script to install Debian on a .qcow2 image; it’s now about 1500 lines, in a single file. That’s the smallest footprint you’ll ever find. Still, it does all that’s needed, and probably even more.

In comparison, in Fuel, there was a super-complicated scheduler, written in Ruby, used to provision a full cluster with a single click of a button. There’s no such thing in OCI, because I do believe that’s a useless gadget. With OCI, a user simply needs to remember the order for setting up a cluster: Cephmon nodes need to be set up first, then CephOSD nodes, then controllers, and finally, in no particular order, the compute, swiftproxy, swiftstore and volume nodes. It’s really not a big deal to leave this to the end user, as it is not expected that one will set up multiple OpenStack clusters every day. And even so, if you use the “ocicli” tool, it shouldn’t be hard to automate these final bits. But I would consider this a useless gadget.
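For those who do want to script it, the ordering rule fits in a few lines of shell. The sketch below only sorts a list of machines; the "hostname role" input format and the lowercase role names are assumptions made for the example, and it deliberately does not call ocicli.

```shell
#!/bin/sh
# role_rank ROLE: prints the provisioning rank of a role (lower goes first).
role_rank() {
    case "$1" in
        cephmon)    echo 1 ;;
        cephosd)    echo 2 ;;
        controller) echo 3 ;;
        compute|swiftproxy|swiftstore|volume) echo 4 ;;
        *)          echo 9 ;;   # unknown roles go last
    esac
}

# provisioning_order: reads "hostname role" lines on stdin and prints them
# in an order that respects the rule above (ties sorted by hostname).
provisioning_order() {
    while read -r host role; do
        [ -n "$host" ] || continue
        echo "$(role_rank "$role") $host $role"
    done | sort -k1,1n -k2,2 | cut -d' ' -f2-
}
```

Feeding it a file with one "hostname role" pair per line gives back the same lines, Ceph monitors first and compute, Swift and volume nodes last.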

While every company jumped into the micro-services-in-containers thing, even now I continue to believe this is useless, and mostly driven by the needs of marketing people who need to sell features. Running OpenStack directly on bare metal is already hard, and the amount of complexity added by running OpenStack services in Docker is useless: it doesn’t bring any feature. I’ve been told that it makes upgrades easier. I very much doubt it: upgrades are complex for reasons other than just upgrading the running services themselves. Rather, they are complex because one needs to upgrade the cluster components in a given order, and scheduling this isn’t easy.

So this is how I managed to write an OpenStack installer alone, in less than a year, without compromising on features: because I wrote things simply, and avoided the over-engineering I saw at all levels on other products.

OpenStack Stein is coming

I’ve just pushed to Debian Experimental, and to https://buster-stein.debian.net/debian, the latest release of OpenStack (code name: Stein), which was released upstream on the 10th of April (yesterday, as I write these lines). I’ve been able to install Stein on top of Debian Buster, and I could start VMs on it: it’s all working as expected after a few changes in the puppet manifests of OCI. What’s needed now is testing upgrades from Stretch + Rocky to Buster + Stein. Normally, puppet-openstack can do that. Let’s see…

Want to know more?

Read on… the README.md is on https://salsa.debian.org/openstack-team/debian/openstack-cluster-installer

Last words, last thanks

This concludes a bit more than a year of development. All of this wouldn’t have been possible without my employer, Infomaniak, giving me total freedom in the way I implement things for going into production. So a big thanks to them, and also for being a platinum sponsor of this year’s DebConf in Brazil.

Also a big thanks to the whole of the OpenStack project, including (but not limited to) the Infra team and the puppet-openstack team.

Official Debian testing OpenStack image news

A few things happened to the testing image, thanks to Steve McIntyre, myself, and … some debconf18 foo!

  • The buster/testing image hadn’t been generated since last April; this is now fixed. Thanks to Steve for it.
  • The datasource_list is now correct, in both the Stretch and Testing images (previously, CloudStack was set too early in the list, which made the image wait 120 seconds for a data source which wasn’t available when booting on OpenStack).
  • The buster/testing image is now using the new package linux-image-cloud-amd64. This made the qcow file shrink from 614 MB to 493 MB. Unfortunately, we don’t have a matching arm64 cloud kernel image yet, but it’s still nice to have this for the amd64 arch.

Please use the new images, and report any issue or suggestion against the openstack-debian-images package.

Using a dummy network interface

For a long time, I’ve been very much annoyed by network setups on virtual machines. Either you choose a bridge interface (which is very easy with something like Virtualbox), or you choose NAT. The issue with NAT is that you can’t easily get into your VM (for example, Virtualbox doesn’t expose the gateway to your VM). With bridging, you get into trouble because your VM will attempt to get DHCP from the outside network, which means that first, you’ll get a different IP depending on where your laptop runs, and second, the external server may refuse your VM because it’s not authenticated (for example because of a MAC address filter, or 802.1X auth).

But there’s a solution to it. I’m now very happy with my network setup, which is using a dummy network interface. Let me share how it works.

In the modern Linux kernel, there’s a “fake” network interface available through a module called “dummy”. To add such an interface, simply load the kernel module (ie: “modprobe dummy”) and start playing. Then you can bridge that interface, add a tap device, and plug your VM into it. Since the dummy interface really lives in your computer, you have access to this internal network, with a route to it.

I’m using this setup for connecting both KVM and Virtualbox VMs; you can even mix both. For Virtualbox, simply use the dropdown list for the bridge. For KVM, use something like this in the command line:

-device e1000,netdev=net0,mac=08:00:27:06:CF:CF -netdev tap,id=net0,ifname=mytap0,script=no,downscript=no

Here’s a simple script to set that up, with masquerading on top for both ipv4 and ipv6:

# Load the dummy interface module
modprobe dummy

# Create a dummy interface called mynic0
ip link set name mynic0 dev dummy0

# Set its MAC address
ifconfig mynic0 hw ether 00:22:22:dd:ee:ff

# Add a tap device
ip tuntap add dev mytap0 mode tap user root

# Create a bridge, and bridge to it mynic0 and mytap0
brctl addbr mybr0
brctl addif mybr0 mynic0
brctl addif mybr0 mytap0

# Assign IPv4 and IPv6 addresses to the bridge
# (the IPv6 prefix length is /64, matching the ip6tables rule below)
ifconfig mybr0 192.168.100.1 netmask 255.255.255.0 up
ip addr add fd5d:12c9:2201:1::1/64 dev mybr0

# Make sure all interfaces are up
ip link set mybr0 up
ip link set mynic0 up
ip link set mytap0 up

# Set basic masquerading for both ipv4 and 6
iptables -I FORWARD -j ACCEPT
iptables -t nat -I POSTROUTING -s 192.168.100.0/24 -j MASQUERADE
ip6tables -I FORWARD -j ACCEPT
ip6tables -t nat -I POSTROUTING -s fd5d:12c9:2201:1::/64 -j MASQUERADE
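
To undo all of this, something along the lines of the sketch below should do. It is only a mirror image of the script above, reusing the same interface names and prefixes. The DRY_RUN switch and the run helper are additions for the example: by default the commands are only printed, so that nothing is touched before you have read them.

```shell
#!/bin/sh
# Sketch: reverse of the setup script. Needs root when DRY_RUN=0.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

teardown_dummy_net() {
    # Remove the masquerading and forwarding rules
    run iptables -t nat -D POSTROUTING -s 192.168.100.0/24 -j MASQUERADE
    run iptables -D FORWARD -j ACCEPT
    run ip6tables -t nat -D POSTROUTING -s fd5d:12c9:2201:1::/64 -j MASQUERADE
    run ip6tables -D FORWARD -j ACCEPT

    # Delete the bridge, the tap device and the dummy interface
    run ip link set mybr0 down
    run ip link delete mybr0 type bridge
    run ip tuntap del dev mytap0 mode tap
    run ip link delete mynic0 type dummy

    # Unload the dummy module
    run modprobe -r dummy
}
```

Call teardown_dummy_net once to review the commands, then again with DRY_RUN=0 as root to actually run them.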

Privacy breaches when unlocking a Xiaomi Mi 5s Plus

My little girl decided that my wife’s old OnePlus One had to take a swim in the toilet. So we had to buy a new phone. Since I know how bad standard ROMs are, I looked up the LineageOS list of compatible devices, found out that the Xiaomi Mi 5s Plus was not too bad, and we bought one. The phone itself looks quite nice: a fast 64-bit processor, a huge amount of RAM, a nice screen, etc. Then I tried the procedure for unlocking… because I care about privacy, and I knew the Chinese Xiaomi ROM is full of spyware (the phone was purchased in China). What I didn’t know is that the unlock procedure (needed before changing the ROM) is itself full of privacy breaches. Let me give you the details.

First, you have to register on Xiaomi’s website, and request permission to unlock the device. That’s already bad enough: why should I ask for permission to use the device I own as I please? Anyway, I did that. The procedure includes receiving an SMS. Again, more bad: why should I give up something as private as my phone number? Anyway, I did it, and received the code to activate my website account. Then I started the unlock program in a Virtualbox Windows XP VM (yeah right… I wasn’t expecting anything better anyway…), and then the program told me that I needed to add my Xiaomi account to the phone. Of course, it then sends a web request to Xiaomi’s server (it refused to work unless I connected the phone to WiFi). I was already not happy with all of this, but that’s not it. After all of these privacy breaches, the unlock app told me that I needed to wait 72 hours for my phone-to-account association to be activated. Since I won’t be available in the middle of the week, for me, that means waiting until next week-end to do that. Silly…

Let’s recap. During this unlock procedure, I had to give up:

  • My phone number (due to the SMS).
  • My phone ID (probably the IMEI was sent).
  • My email address (truth is: I could have given them a temporary email address).
  • Hours of my time understanding and running the stupid procedure, and I can’t even finish it in a single day.
  • My policy of not using Windows. I also consider that using Windows is a privacy breach, though here I have a way to roll back the Virtualbox image, and I only use it for this kind of bad software, so privacy wise, it’s kind of fine, because I’m used to this trick. The real issue here is that, to unlock freedom on that phone, one must use a proprietary OS.

So my advice: if you want an unlocked Android device, do not choose Xiaomi, unless you’re ok with giving up the above. It’s probably fine to pay a little bit more and reward a phone maker whose unlock experience isn’t that bad.

Testing OpenStack using tempest: all is packaged, try it yourself

tl;dr: this post explains how the new openstack-tempest-ci-live-booter package configures a machine to PXE boot a Debian Live system running on KVM in order to run functional testing of OpenStack. It may be of interest to you if you want to learn how to PXE boot a KVM virtual machine running Debian Live, even if you aren’t interested in OpenStack.

Moving my CI from one location to another led me to package it fully

After packaging a release of OpenStack, it’s kind of mandatory to functionally test the set of packages. This is done by running the tempest test suite on an already deployed OpenStack installation. I used to do that on real hardware, provided by my employer. But since I’ve lost my job (I’m still looking for a new employer at this time), I also lost access to the hardware they were providing to me.

As a consequence, I searched for a sponsor to provide the hardware to run tempest on. I first sent a mail to the openstack-dev list, asking for such hardware. Then Rochelle Grober and Stephen Li from Huawei got me in touch with Zachary Smith, the CEO of Packet.net. And packet.net gave me an account on their system. I am amazed by how good their service is. They provide baremetal servers around the world (15 data centers), provisioned using an API (meaning, fully automatically). A big thanks to them!

Anyway, even if I had planned for a few weeks to give a big thanks to the above people (they really deserve it!), this isn’t the only goal of this post. The other is to introduce how to run your own tempest CI on your own machine. Since my CI had to move twice, I decided to industrialize it, and fully automate the setup of the CI server. And what does a DD do when writing software? Package it of course. So I packaged it all, and uploaded it to the archive. Here’s how to use all of this.

General principle

The best way to run an OpenStack tempest CI is to run it on a Debian Live system. Why? Because setting up a full OpenStack environment takes a lot of time, mostly spent on disk I/O. And on a live system, everything runs on a RAM disk, so installing in this environment is the fastest way to do it. This is what I did when working with Mirantis: I had a real baremetal server, which I was PXE booting into a Debian Live system. However nice, this requires having access to 2 servers: one for running the Live system, and one running the dhcp/pxe/tftp server. Also, this means the boot server needs 2 nics, one on the internet, and one for booting the 2nd server that will run the Live system. It was not possible to have such a specific setup at packet.net, so I decided to replicate this using KVM, so it would become portable. And since the servers at packet.net are very fast, it isn’t much of an issue anymore to not run on baremetal.

Anyway, let’s dive into setting-up all of this.

Network topology

We’ll assume that one of your interfaces has internet access, let’s say eth0. Since we don’t want to destroy any of your network config, the openstack-tempest-ci-live-booter package will use a dummy network interface (ie: modprobe dummy) and bridge it to the network interface of the KVM virtual machine. That dummy network interface will be configured with 192.168.100.1, and the Debian Live KVM will use 192.168.100.2. This convenient default can be changed, but then you’ll have to pass your specific network configuration to each and every script (just read the beginning of each script to see the parameters).

Configure the host machine

First install the openstack-tempest-ci-live-booter package. It depends at runtime on isc-dhcp-server, tftpd-hpa, apache2, qemu-kvm and everything needed to run a Debian Live machine, booting it over PXE / iPXE (the package supports both, more on iPXE later). So, let’s do it:

apt-get install openstack-tempest-ci-live-booter

The package, once installed, doesn’t do much. To respect the Debian policy, it can’t touch configuration files of other packages in maintainer scripts. Therefore, you have to manually run:

openstack-tempest-ci-live-booter-config --configure-dummy-nick

Running this script will:

  • configure the kvm-intel module to allow nested virtualization (by unloading the module, adding “options kvm-intel nested=y” to /etc/modprobe.d, and reloading the module)
  • modprobe the dummy kernel module, run “ip link set name tempestnic0 dev dummy0” to create a tempestnic0 dummy interface
  • create a tempestbr bridge, set 192.168.100.1 for the bridge IP, bridge the tempestnic0 and tempesttap
  • configure tftpd-hpa to listen on 192.168.100.1
  • configure isc-dhcp-server to offer 192.168.100.2 on the tempestbr bridge, so that the KVM machine can boot up with an IP
  • configure apache2 to serve the filesystem.squashfs root filesystem, loaded by the Linux kernel at boot time. Note that you may need to manually start and/or reload apache after this setup though.
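
As an illustration of the first step in the list above, here is a minimal sketch of how nested virtualization can be checked and enabled by hand. It is not the code of the package: the file name under /etc/modprobe.d is hypothetical, and enable_nested must run as root with no VM currently using the module.

```shell
#!/bin/sh
# nested_enabled [PARAM_FILE]
# Succeeds if the kvm_intel "nested" parameter reads Y or 1.
nested_enabled() {
    param_file="${1:-/sys/module/kvm_intel/parameters/nested}"
    [ -r "$param_file" ] || return 1
    case "$(cat "$param_file")" in
        Y|y|1) return 0 ;;
        *)     return 1 ;;
    esac
}

# enable_nested: writes the module option, then reloads kvm-intel.
# Root only. The configuration file name is hypothetical.
enable_nested() {
    echo "options kvm-intel nested=y" > /etc/modprobe.d/kvm-intel-nested.conf
    modprobe -r kvm-intel
    modprobe kvm-intel
}
```

A typical use would be "nested_enabled || enable_nested", so that the module is only reloaded when needed.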

Again, you can change the IP addresses if you like. You can also use a real interface if you intend to boot real hardware rather than a KVM machine (in which case, just omit --configure-dummy-nick, and manually configure your 2nd interface).

Also, openstack-tempest-ci-live-booter provides a /etc/init.d/openstack-tempest-ci-live-booter script which will configure NAT on your server, so that the Debian Live machine has internet access (needed for apt-get operations). Edit the file if you need to change 192.168.100.1/24 to something else. The script will pick up the interface that is connected to the default gateway by itself.

The dhcp server is configured to support both legacy PXE and the new iPXE standard. I had to support iPXE, because that’s what the standard KVM ROM does, and also I wanted to keep legacy support for older baremetal hardware. The way iPXE works is that dhcpd tells the client where to fetch the iPXE script, which itself chains to lpxelinux.0 (instead of the standard pxelinux.0). It’s rather easy to set up once you understand how it works.
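
To make the mechanism more concrete, here is a sketch of what such a configuration can look like, written as two small shell functions that print the files. It is an illustration under assumptions, not the configuration shipped in the package: the boot.ipxe file name and the http URL are hypothetical, and the subnet is the default 192.168.100.0/24. The user-class test is the usual ISC dhcpd way to tell iPXE clients apart from legacy PXE ROMs.

```shell
#!/bin/sh
# print_dhcpd_conf SERVER_IP CLIENT_IP
# Prints an ISC dhcpd subnet block: iPXE clients are pointed at a script
# URL, legacy PXE clients get pxelinux.0 over TFTP.
print_dhcpd_conf() {
    server="$1"
    client="$2"
    cat <<EOF
subnet 192.168.100.0 netmask 255.255.255.0 {
    range $client $client;
    option routers $server;
    next-server $server;
    if exists user-class and option user-class = "iPXE" {
        filename "http://$server/boot.ipxe";
    } else {
        filename "pxelinux.0";
    }
}
EOF
}

# print_ipxe_script SERVER_IP
# Prints the script fetched by iPXE, which chains to lpxelinux.0.
print_ipxe_script() {
    cat <<EOF
#!ipxe
chain tftp://$1/lpxelinux.0
EOF
}
```

For instance, "print_dhcpd_conf 192.168.100.1 192.168.100.2" prints a block that could be pasted into dhcpd.conf for review.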

Build the live image

Now that the PXE server is configured, it’s time to build the Debian Live image. Simply run the following to build the image and copy the resulting files into the PXE server folder (ie: /var/lib/tempest-live-booter):

mkdir live
cd live
openstack-tempest-ci-build-live-image --debian-mirror-addr http://ftp.nl.debian.org/debian

Since we need to log into that server later on, the script will create an ssh key-pair. If you want to use your own keys, simply drop the id_rsa and id_rsa.pub files in your current folder before running the script. Then make sure this key-pair is used by default by the user who will run the tempest script (ie: copy id_rsa and id_rsa.pub into the ~/.ssh folder).
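
The copy into ~/.ssh can be done with a small helper like the sketch below. The function name is made up for the example, and note that it overwrites any id_rsa already present in the destination, so keep a backup if you already have a key there.

```shell
#!/bin/sh
# install_keypair SRC_DIR [DEST_DIR]
# Copies id_rsa and id_rsa.pub from SRC_DIR to DEST_DIR (default: ~/.ssh)
# with the permissions ssh expects. Overwrites existing files.
install_keypair() {
    src="$1"
    dest="${2:-$HOME/.ssh}"
    [ -f "$src/id_rsa" ] && [ -f "$src/id_rsa.pub" ] || return 1
    mkdir -p "$dest" && chmod 700 "$dest" || return 1
    cp "$src/id_rsa" "$src/id_rsa.pub" "$dest/" || return 1
    chmod 600 "$dest/id_rsa"
    chmod 644 "$dest/id_rsa.pub"
}
```

From the folder where the image was built, "install_keypair ." is enough.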

Running the openstack-tempest-ci

What the openstack-tempest-ci script does is (re-)start your KVM virtual machine, ssh into it, upgrade it to sid, install OpenStack, and finally run the whole tempest suite. There are 2 ways to run it: either install the openstack-tempest-ci package, optionally configure it (in /etc/default/openstack-tempest-ci), and simply run the “openstack-tempest-ci” command; or skip the installation of the package, and simply run it from source:

git clone http://anonscm.debian.org/git/openstack/debian/openstack-meta-packages.git
cd openstack-meta-packages/src
./openstack-tempest-ci

Indeed, the script is designed to copy all scripts from source into the Debian Live machine before using them. The reason is that we want to avoid the situation where a modification needs to be uploaded to Debian before it can be tested. It was also necessary to be able to run the openstack-tempest-ci script without installing a package (which would need root access that I don’t have on casulana.debian.org, where running tempest is needed to test official OpenStack Debian images). So definitely feel free to hack everything in openstack-meta-packages/src before running the tempest script. Also, openstack-tempest-ci will look for a sources.list file in the current directory, and upload it to the Debian Live system before doing the upgrade/install. This way, it is easy to use the closest mirror.
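
Such a sources.list can be generated with a couple of lines. The sketch below assumes that a plain sid entry with only the main component is what you want, which is a guess for the example and not something imposed by openstack-tempest-ci. The helper name is made up.

```shell
#!/bin/sh
# write_sources_list MIRROR [FILE]
# Writes a sid sources.list pointing at MIRROR (default file: ./sources.list).
write_sources_list() {
    mirror="$1"
    file="${2:-sources.list}"
    cat > "$file" <<EOF
deb $mirror sid main
deb-src $mirror sid main
EOF
}
```

For example, "write_sources_list http://ftp.nl.debian.org/debian" creates the file in the current directory, ready to be picked up before running ./openstack-tempest-ci.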

There’s cloud, and it can even be YOURS on YOUR computer

Each time I see the FSFE picture, just like on Daniel’s last post to planet.d.o, where it says:

“There is NO CLOUD, just other people’s computers”

it makes me so frustrated. There is such a thing as a private cloud, set up on your own set of servers. I’ve been working on delivering OpenStack to Debian for the last 6 and a half years, motivated exactly by fixing this issue: I refuse to accept that the only cloud people could use would be a closed source solution like GCE, AWS or Azure. The FSFE (and the FSF) completely dismissing this work is more than annoying: it is counterproductive. Not only should the FSFE not pull anyone away from the cloud, it should push the public to choose cloud providers using free software like OpenStack.

The openstack.org marketplace lists 23 public cloud providers using OpenStack, so there is now no excuse to use any other type of cloud: for sure, there’s one where you need it. If you use a free software solution like OpenStack, then the question of whether you’re running on your own hardware, on some rented hardware (on which you deployed OpenStack yourself), or on someone else’s OpenStack deployment is just a practical one, on which you can always backtrack quickly. That’s one of the very reasons why one should deploy on the cloud: so that it’s possible to redeploy quickly on another cloud provider, or even on your own private cloud. This gives you more freedom than you ever had, because it makes you no longer dependent on the hosting company you’ve selected: switching providers is just a matter of launching a script. The reality is that neither the FSFE nor RMS understands all of this. Please don’t fall for the FSFE’s very wrong message.