/ OpenStack

Lessons from OpenStack Telemetry: Incubation

It was mostly around that time in 2012 that I and a couple of fellow open-source enthusiasts started working on Ceilometer, the first piece of software from the OpenStack Telemetry project. Six years have passed since then. I've been thinking about this blog post for several months (even years, maybe), but lacked the time and the hindsight needed to lay out my thoughts properly. In a series of posts, I would like to share my observations about the Ceilometer development history.

To understand the full picture here, I think it is fair to start with a small retrospective on the project. I'll try to keep it short, and it will be unmistakably biased, even if I'll do my best to stay objective – bear with me.

Incubation

Early 2012, I remember discussing with the first Ceilometer developers the right strategy to solve the problem we were trying to address. The company I worked for wanted to run a public cloud, and billing the resources usage was at the heart of the strategy. The fact that no components in OpenStack were exposing any consumption API was a problem.

We debated about how to implement those metering features in the cloud platform. There were two natural solutions: either achieving some resource accounting report in each OpenStack projects or building a new software on the side, covering for the lack of those functionalities.

At that time there were only less than a dozen of OpenStack projects. Still, the burden of patching every project seemed like an infinite task. Having code reviewed and merged in the most significant projects took several weeks, which, considering our timeline, was a show-stopper. We wanted to go fast.

Pragmatism won, and we started implementing Ceilometer using the features each OpenStack project was offering to help us: very little.

Our first and obvious candidate for usage retrieval was Nova, where Ceilometer aimed to retrieves statistics about virtual machines instances utilization. Nova offered no API to retrieve those data – and still doesn't. Since it was out of the equation to wait several months to have such an API exposed, we took the shortcut of polling directly libvirt, Xen or VMware from Ceilometer.

That's precisely how temporary hacks become historical design. Implementing this design broke the basis of the abstraction layer that Nova aims to offer.

As time passed, several leads were followed to mitigate those trade-offs in better ways. But on each development cycle, getting anything merged in OpenStack became harder and harder. It went from patches long to review, to having a long list of requirements to merge anything. Soon, you'd have to create a blueprint to track your work, write a full specification linked to that blueprint, with that specification being reviewed itself by a bunch of the so-called core developers. The specification had to be a thorough document covering every aspect of the work, from the problem that was trying to be solved, to the technical details of the implementation. Once the specification was approved, which could take an entire cycle (6 months), you'd have to make sure that the Nova team would make your blueprint a priority. To make sure it was, you would have to fly a few thousands of kilometers from home to an OpenStack Summit, and orally argue with developers in a room filled with hundreds of other folks about the urgency of your feature compared to other blueprints.

OpenStack Summit design session

An OpenStack design session in Hong-Kong, 2013

Even if you passed all of those ordeals, the code you'd send could be rejected, and you'd get back to updating your specification to shed light on some particular points that confused people. Back to square one.

Nobody wanted to play that game. Not in the Telemetry team at least.

So Ceilometer continued to grow, surfing the OpenStack hype curve. More developers were joining the project every cycle – each one with its list of ideas, features or requirements cooked by its in-house product manager.

But many features did not belong in Ceilometer. They should have been in different projects. Ceilometer was the first OpenStack project to pass through the OpenStack Technical Committee incubation process that existed before the rules were relaxed.

This incubation process was uncertain, long, and painful. We had to justify the existence of the project, and many technical choices that have been made. Where we were expecting the committee to challenge us at fundamental decisions, such as breaking abstraction layers, it was mostly nit-picking about Web frameworks or database storage.

Consequences

The rigidity of the process discouraged anyone to start a new project for anything related to telemetry. Therefore, everyone went ahead and started dumping its idea in Ceilometer itself. With more than ten companies interested, the frictions were high, and the project was at some point pulled apart in all directions. This phenomenon was happening to every OpenStack projects anyway.

On the one hand, many contributions brought marvelous pieces of technology to Ceilometer. We implemented several features you still don't find any metering system. Dynamically sharded, automatic horizontally scalable polling? Ceilometer has that for years, whereas you can't have it in, e.g., Prometheus.

On the other hand, there were tons of crappy features. Half-baked code merged because somebody needed to ship something. As the project grew further, some of us developers started to feel that this was getting out of control and could be disastrous. The technical debt was growing as fast as the project was.

Several technical choices made were definitely bad. The architecture was a mess; the messaging bus was easily overloaded, the storage engine was non-performant, etc. People would come to me (as I was the Project Team Leader at that time) and ask why the REST API would need 20 minutes to reply to an autoscaling request. The willingness to solve everything for everyone was killing Ceilometer. It's around that time that I decided to step out of my role of PTL and started working on Gnocchi to, at least, solve one of our biggest challenge: efficient data storage.

Ceilometer was also suffering from the poor quality of many OpenStack projects. As Ceilometer retrieves data from a dozen of other projects, it has to use their interface for data retrieval (API calls, notifications) – or sometimes, palliate for their lack of any interface. Users were complaining about Ceilometer dysfunctioning while the root of the problem was actually on the other side, in the polled project. The polling agent would try to retrieve the list of virtual machines running on Nova, but just listing and retrieving this information required several HTTP requests to Nova. And those basic retrieval requests would overload the Nova API. The API does not offer any genuine interface from where the data could be retrieved in a small number of calls. And it had terrible performances.
From the point of the view of the users, the load was generated by Ceilometer. Therefore, Ceilometer was the problem. We had to imagine new ways of circumventing tons of limitation from our siblings. That was exhausting.

At its peak, during the Juno and Kilo releases (early 2015), the code size of Ceilometer reached 54k lines of code, and the number of committers reached 100 individuals (20 regulars). We had close to zero happy user, operators were hating us, and everybody was wondering what the hell was going in those developer minds.

Nonetheless, despite the impediments, most of us had a great time working on Ceilometer. Nothing's ever perfect. I've learned tons of things during that period, which were actually mostly non-technical. Community management, social interactions, human behavior and politics were at the heart of the adventure, offering a great opportunity for self-improvement.

In the next blog post, I will cover what happened in the years that followed that booming period, up until today. Stay tuned!

Lessons from OpenStack Telemetry: Incubation
Share this