I’ve been working on this for more than a year, and finally, I am acheiving my goal. I wrote a OpenStack cluster installer that is fully in Debian, and running in production for Infomaniak.
Note: I originally wrote this blog post a few weeks ago, though it was pending validation from my company (to make sure I wouldn’t disclose company business information).
What is it?
As per the package description and the package name, OCI (OpenStack Cluster Installer) is a software to provision an OpenStack cluster automatically, with a “push button” interface. The OCI package depends on a DHCP server, a PXE (tftp-hpa) boot server, a web server, and a puppet-master.
Once computers in the cluster boot for the first time over network (PXE boot), a Debian live system squashfs image is served by OCI (served by Apache), to act as a discovery image. This live system then reports the hardware features of the booted machine back to OCI (CPU, memory, HDDs, network interfaces, etc.). The computers can then be installed with Debian from that live system. During this process, a puppet-agent is configured so that it will connect to the puppet-master of OCI. Uppong first boot, OpenStack services are then installed and configured, depending on the server role in the cluster.
OCI is fully packaged in Debian, including all of the Puppet modules and so on. So just doing “apt-get install openstack-cluster-installer” is enough to bring absolutely all dependencies, and no other artifact are needed. This is very important so one only needs a local Debian mirror to install an OpenStack cluster. No external components must be downloaded from internet.
OCI setting-up a Swift cluster
At the begining of OCI’s life, we first used it at Infomaniak (my employer) to setup a Swift cluster. Swift is the object server of OpenStack. It is perfect solution for a (very) large backup system.
Think of a massive highly available cluster, with a capacity reaching peta bytes, storing millions of objects/files 3 times (for redundancy). Swift can virtually scale to infinity as long as you size your ring correctly.
The Infomaniak setup is also redundant at the data center level, as our cluster spans over 2 data centers, with at least one copy everything stored on each data center (the location of the 3rd copy depends on many things, and explaining it is not in the scope of this post).
If one wishes to use swift, it’s ok to start with 7 machines to begin with: 3 machines for the controller (holding the Keystone authentication, and a bit more), at least 1 swift-proxy machine, and 3 storage nodes. Though for redundancy purpose, it is IMO not good enough to start with only 3 storage node: if one fails, the proxy server will fall into timeouts waiting for the 3rd storage node. So 6 storage nodes feels like a better minimum. Though it doesn’t have to be top-noch servers, a cluster made of refurbished old hardware with only a few disks can do it, if you don’t need to store too much data.
Setting-up an OpenStack compute cluster
Though swift was the first thing OCI did for us, it now can do a way more than just Swift. Indeed, it can also setup a full OpenStack cluster with Nova (compute), Neutron (networking) and Cinder (network block devices). We also started using all of that, setup by OCI, at Infomaniak. Here’s the list services currently supported:
- Keystone (identity)
- Heat (orchestration)
- Aodh (alarming)
- Barbican (key/secret manager)
- Nova (compute)
- Glance (VM images)
- Swift (object store)
- Panko (event)
- Ceilometer (resource monitoring)
- Neutron (networking)
- Cinder (network block device)
On the backend, OCI can use LVM or Ceph for Cinder, local storage or Ceph for Nova instances.
Full HA redundancy
The nice thing is, absolutely every component setup by OCI is done in a high availability way. Each machine of the control plane of OpenStack is setup with an instance of the components: all OpenStack controller components, a MariaDB server part of the Galera cluster, etc.
HAProxy is also setup on all controllers, in front of all of the REST API servers of OpenStack. And finally, the web address where final clients will connect is in fact a virtual IP, that can move from one server to another, thanks to corosync. Routing to that VIP can be done either over L2 (ie: a static address on a local network), or over BGP (useful if you need multi-datacenter redundancy). So if one of the controllers is down, it’s not such a big deal, HAproxy will detect this within seconds, and if it was the server that had the virtual IP (matching the API endpoint), then this IP will move to one of the other servers.
Full SSL transport
One of the things that OCI does when installing Debian, is setup a PKI (ie: SSL certificates signed by a local root CA) so that everything in the cluster is transported over SSL. Haproxy, of course does the SSL, but it also connects to the different API servers over SSL too. All connections to the RabbitMQ servers are also performed SSL. If one wishes, it’s possible to replace the self-signed SSL certificates before the cluster is deployed, so that the OpenStack API endpoint can be exposed on a public address.
OCI as a quite modular system
If one decides to use Ceph for storage, then for every compute node of the cluster, it is possible to choose to use either Ceph for the storage of /var/lib/nova/instance, or use local storage. On the later case, then of course, using RAID is strongly advised, to avoid any possible loss of data. It is possible to mix both types of compute node storage in a single cluster, and create server aggregates so it is later possible to decide which type of compute server to run the workload on.
If a cluster Ceph is part of the cluster, then on every compute node, the cinder-volume and cider-backup services will be provisioned. They will be in use to control the Cinder volumes of the Ceph cluster. Even though the network block storage itself will not run on the compute machines, it makes sense to do that. The idea is that the amount of these process needs to scale at the same time as the amount of compute nodes, so it makes sense to do that. Also, on compute servers, the Ceph secret is already setup using libvirt, so it was also convenient to re-use this.
As for Glance, if you have Ceph, it will use it as backend. If not, it will use Swift. And if you don’t have a Swift cluster, it will fall-back to the normal file backend, with a simple rsync from the first controller to the others. On such a setup, then only the first controller is used for glance-api. The other controllers also run glance-api, but haproxy doesn’t use them, as we really want the images to be stored on the first controller, so they can be rsync to the others. In practice, it’s not such a big deal, because the images are anyway in the cache of the compute servers when in use.
If one setup cinder volume nodes, then cinder-volume and cinder-backup will be installed there, and the system will automatically know that there’s cinder with LVM backend. Both Cinder over LVM and over Ceph can be setup on the same cluster (I never really tried this, though I don’t see why it wouldn’t work, normally, simply both backend will be available).
OCI in Buster vs current development
Lots of new features are being added to OCI. These, unfortunately, wont make it to Buster. Though the Buster release has just enough to be able to provision a working OpenStack cluster.
Future features
What I envision for OCI, is to make it able to provision a cluster ready for serving as a public cloud. This means having all of the resource accounting setup, as well as cloudkitty (which is OpenStack resource rating engine). I’ve already played a bit with this, and it should be out fast. Then the only missing bit to go public will be billing of the rated resources, which obviously, has to be done in-house, and doesn’t need to live within the OpenStack cluster itself.
The other things I am planning to do, is add more and more services. Currently, even though OCI can setup a fully working OpenStack, it is still a basic one. I do want to add advanced features like Octavia (load balancer as a service), Magnum (kubernets cluster as a service), Designate (DNS), Manila (shared filesystems) and much more if possible. The number of available projects is really big, so it probably will keep me busy for a very long time.
At this point, what OCI misses as well, is a custom ISO debian installer image that would include absolutely all. It shouldn’t be hard to write, though I lack the basic knowledge on how to do this. Maybe I will work on this at this summer’s DebConf. At the end, it could be a debian pure blend (ie: a fully integrated distro-in-the-distro system, just like debian-edu or debian-meds). It’d be nice if this ISO image could include all of the packages for the cluster, so that no external resources would be needed. The setting-up an OpenStack cluster with no internet connectivity at all would become possible. Because in fact, only the API endpoint on the port 443, and the virtual machines need internet access, your management network shouldn’t be connected (it’s much safer this way).
No, there wasn’t 80 engineers that burned-out in the process of implementing OCI
One thing that makes me proud, is that I wrote all of my OpenStack installer nearly alone (truth: leveraging all the work of puppet-openstack, it woudn’t have been possible without it…). That’s unique in the (small) OpenStack world. Companies like my previous employer, or a famous companies working on RPM based distros, this kind of product is the work of dozens of engineers. I heard that Red Hat has nearly 100 employees working on TripleO. This was possible because I tried to keep OCI in the spirit of “keep it simple stupid”. It is doing only what’s needed, and implemented the mot simple way possible, so that it is easy to maintain.
For example, the hardware discovery agent is made of 63 lines of ISO shell script (that is: not even bash… but dash), while I’ve seen others using really over engineered stuff, like heavy ruby or Python modules. Ironic-inspector, for example, in the Rocky release, is made of 98 files, for a total of 17974 lines. I really wonder what they are doing with all of this (I didn’t dare to look). There is one thing I’m sure: what I did is really enough for OCI’s needs, and I don’t want to run a 250+ MB initrd as the discovery system: OCI’s live build based discovery image loaded over the web rather than PXE is a way smarter.
On the same spirit, the part that does the bare-metal provisioning, is the same shell script that I wrote to create the official Debian OpenStack images. It was about 700 lines of shell script to install Debian on a .qcow2 image, it’s not about 1500 lines, and made of a single file. That’s the smallest footprint you’ll ever find. However, it does all what’s needed, still, and probably even more.
In comparison, in Fuel, there was a super-complicated scheduler, written in Ruby, used to be able to provision a full cluster by only a single click of a button. There’s no such thing in OCI, because I do believe that’s a useless gadget. With OCI, a user simply needs to remember the order for setting-up a cluster: Cephmon nodes needs to be setup first, then CephOSD nodes, then controllers, then finally, in no particular order, the computes, swiftproxy, swiftstore and volume nodes last. That’s really not a big deal to let this done by the final user, as it is not expected that one will setup multiple OpenStack every day. And even so, if you use the “ocicli” tool, it shouldn’t be hard to do these final bits of the automation. But I would consider this a useless gadget.
While every company jumped into the micro-service in container thing, even at this time, I continue to believe this is useless, and mostly driven by the needs marketing people that needs to sell features. Running OpenStack directly on bare metal is already hard, and the amount of complexity added by running OpenStack services in Docker is useless: it doesn’t bring any feature. I’ve been told that it makes upgrades easier, I very much doubt it: upgrades are complex for other reasons than just upgrading the running services themselves. Rather, they are complex because one needs to upgrade the cluster components with a given order, and scheduling this isn’t easy.
So this is how I managed to write an OpenStack installer alone, in less than a year, without compromising on features: because I wrote things simply, and avoided the over-engineering I saw at all levels on other products.
OpenStack Stein is comming
I’ve just pushed to Debian Experimental, and to https://buster-stein.debian.net/debian the last release of OpenStack (code name: Stein), which was released upstream on the 10th or April (yesterday, as I write these lines). I’ve been able to install Stein on top of Debian Buster, and I could start VMs on it: it’s all working as expected after a bit of changes in the puppet manifests of OCI. What’s needed now, is testing upgrades from Stretch + Rocky to Buster + Stein. Normally, puppet-openstack can do that. Let’s see…
Want to know more?
Read on… the README.md is on https://salsa.debian.org/openstack-team/debian/openstack-cluster-installer
Last words, last thanks
This concludes a bit more than a year of development. All of this wouldn’t have been possible without my employer, Infomaniak, giving me a total freedom on the way I implement things for going into production. So a big thanks to them, and also for being a platinium sponsor for this year’s Debconf in Brazil.
Also a big thanks to the whole of the OpenStack project, including (but not limited to) the Infra team and the puppet-openstack team.