<?xml version="1.0" encoding="UTF-8"?><rss version="2.0" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>gnocchi — jd:/dev/blog</title><description>Posts tagged &quot;gnocchi&quot; on jd:/dev/blog.</description><link>https://julien.danjou.info/</link><item><title>Podcast.__init__: Gnocchi, a Time Series Database for your Metrics</title><link>https://julien.danjou.info/blog/podcast-init-gnocchi/</link><guid isPermaLink="true">https://julien.danjou.info/blog/podcast-init-gnocchi/</guid><description>A few weeks ago, Tobias Macey contacted me as he wanted to talk about Gnocchi, the time series database I&apos;ve been working on for the last few years.  It was a great opportunity to talk about the proje</description><pubDate>Tue, 11 Dec 2018 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;A few weeks ago, Tobias Macey contacted me as he wanted to talk about Gnocchi, the time series database I&apos;ve been working on for the last few years.&lt;/p&gt;
&lt;p&gt;It was a great opportunity to talk about the project, so I jumped on it! We talk about how Gnocchi came to life, how we built its architecture, the challenges we met, what kind of trade-offs we made, etc.&lt;/p&gt;
&lt;p&gt;You can listen to this episode &lt;a href=&quot;https://www.podcastinit.com/gnocchi-with-julien-danjou-episode-189/&quot;&gt;here&lt;/a&gt;.&lt;/p&gt;
</content:encoded><category>gnocchi</category><category>talks</category></item><item><title>Gnocchi engine optimization</title><link>https://julien.danjou.info/blog/gnocchi-engine-optimization/</link><guid isPermaLink="true">https://julien.danjou.info/blog/gnocchi-engine-optimization/</guid><description>Software speed is relative.  After all, it is the result of a set of trade-offs made between the ease of programming and the speed of hardware. The comfort of the developer and its use of multiple abs</description><pubDate>Tue, 27 Mar 2018 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Software speed is relative.&lt;/p&gt;
&lt;p&gt;After all, it is the result of a set of trade-offs made between ease of programming and the speed of hardware. The developer&apos;s comfort and use of multiple abstraction layers directly decrease the cost in time (and therefore in money), while on the other hand increasing the hardware expenditure, as the software is less performant. In the end, whichever of optimization and hardware is the cheapest gets privileged.&lt;/p&gt;
&lt;p&gt;Of course, there are terrible exceptions, such as picking the wrong algorithm or including &lt;code&gt;sleep()&lt;/code&gt; calls, but the essence of it is here. Pick C to be fast, saving money on hardware and spending it on development, or pick Java to save money on development, while making hardware manufacturers rich.&lt;/p&gt;
&lt;p&gt;Last month, a co-worker at &lt;a href=&quot;https://redhat.com&quot;&gt;Red Hat&lt;/a&gt; picked Gnocchi for a test run and was disappointed by the performance he saw for his particular usage. After correctly understanding the use case scenario, I wrote a small test case that implemented this scheme and popped out my favorite code profiler. You know how I roll.&lt;/p&gt;
&lt;p&gt;The profiling result made the performance issue obvious. Over its releases, Gnocchi evolved from a one-metric-at-a-time processing approach to a bunch-of-metrics-at-a-time approach – especially since Gnocchi 4 and the introduction of the &lt;em&gt;sacks&lt;/em&gt;. However, that batched approach is not yet complete in Gnocchi 4.2, and the processing engine still manipulates metrics one by one in parallel. The parallelization using processes and threads makes sure that CPU usage is high and that I/O latency does not impact the processing too much.&lt;/p&gt;
&lt;p&gt;Processing incoming measures can therefore be schematized like this:&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/gnocchi-engine-4-1.png&quot; alt=&quot;gnocchi-engine-4&quot; /&gt;&lt;/p&gt;
&lt;p&gt;In the schema above, each operation in red is an I/O operation. The three branches I drew represent three metrics being processed. Obviously, if there were ten metrics, there would be ten branches, creating even more I/O operations. With the current Gnocchi 4.2 code, the number of I/O operations for processing a sack of metrics can be roughly computed as &lt;code&gt;2 + (5 × M × D × G)&lt;/code&gt; actions, with &lt;code&gt;M&lt;/code&gt; the number of metrics, &lt;code&gt;D&lt;/code&gt; the number of definitions in the archive policy, and &lt;code&gt;G&lt;/code&gt; the number of aggregation methods. For my test scenario, I used &lt;code&gt;D=1&lt;/code&gt; and &lt;code&gt;G=1&lt;/code&gt;, which is what can be seen on the diagram above.&lt;/p&gt;
&lt;p&gt;The obvious solution is to merge those per-metric I/O operations into a single I/O operation for a bunch of metrics. This allows storage backends to batch the reading and writing operations, reducing latency and improving throughput.&lt;/p&gt;
&lt;p&gt;It took me a few tens of patches and a few code reviews from my peers to rework the internal storage engine of Gnocchi. The new engine is now ready to be used and merged into the &lt;em&gt;master&lt;/em&gt; branch.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/gnocchi-engine-5.png&quot; alt=&quot;gnocchi-engine-5&quot; /&gt;&lt;/p&gt;
&lt;p&gt;The new engine reduces the number of I/O operations needed to process a bunch of metrics to &lt;code&gt;5 + M&lt;/code&gt; – an (at least) fivefold reduction in the number of operations. In my case, for 1000 metrics being processed in a batch, with only one aggregation, that decreases the quantity of transactions from 5002 to 1005.&lt;/p&gt;
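&lt;p&gt;To make the arithmetic concrete, here is a small sketch (the function names are mine, not Gnocchi code) computing both counts for my test scenario:&lt;/p&gt;

```python
# Rough I/O-operation counts for processing one sack of metrics,
# per the formulas above (names are illustrative, not Gnocchi code).
def io_ops_4_2(metrics, definitions=1, aggregations=1):
    # Old engine: 2 + (5 × M × D × G)
    return 2 + 5 * metrics * definitions * aggregations

def io_ops_new(metrics):
    # New engine: 5 + M
    return 5 + metrics

old = io_ops_4_2(1000)   # 5002 transactions
new = io_ops_new(1000)   # 1005 transactions
```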
&lt;p&gt;A typical metric usually has 6 aggregation methods defined, so that could reduce the number of I/O operations from 40,002 to only 1005 for 1000 metrics – a forty times reduction in I/O operations. The benchmark code that I wrote, which implements the desired use case with a single aggregation, is now performing more than four times faster.&lt;/p&gt;
&lt;p&gt;Not all the drivers will benefit from this improvement, as some of them are better at doing batched operations than others; Redis is great at it, while Swift is not. And even if the number of I/O operations has been largely reduced, they still need to be fully executed, which can take time depending on the backend performance. It&apos;s a really great improvement, not a silver bullet.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;http://sileht.net&quot;&gt;Mehdi&lt;/a&gt; started to use that new internal driver API to implement a &lt;a href=&quot;http://rocksdb.org/&quot;&gt;RocksDB&lt;/a&gt; driver. While it has its own limitations (it has to be single-threaded) that we will need to circumvent, it could improve performance for the non-distributed use case by a large margin.&lt;/p&gt;
&lt;p&gt;This code will be included in the next Gnocchi major release in a few weeks, so stay tuned for further updates. And benchmarks, I hope!&lt;/p&gt;
</content:encoded><category>gnocchi</category></item><item><title>Gnocchi 4.2 release</title><link>https://julien.danjou.info/blog/gnocchi-4-2-release/</link><guid isPermaLink="true">https://julien.danjou.info/blog/gnocchi-4-2-release/</guid><description>The time of the release arrived. A little more than three months have passed since the latest minor version, 4.1, has been released. There are tons of improvement and a few nice significant features i</description><pubDate>Tue, 06 Feb 2018 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;The time of the release has arrived. A little more than three months have passed since the latest minor version, 4.1, was released. There are tons of improvements and a few nice, significant features in this release!&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/gnocchi-logo.png&quot; alt=&quot;Gnocchi logo&quot; /&gt;&lt;/p&gt;
&lt;p&gt;Most of the principal changes are recorded in &lt;a href=&quot;http://gnocchi.xyz/releasenotes/4.2.html&quot;&gt;the 4.2 release notes&lt;/a&gt;, but here are a few that I find particularly interesting. We merged 141 commits since 4.1.0. As a comparison, that is a lot less than the 228 we had between 4.0.0 and 4.1.0 or the 375 we had between 3.1.0 and 4.0.0!&lt;/p&gt;
&lt;p&gt;We added two compatibility endpoints on the REST API for &lt;a href=&quot;http://influxdb.org&quot;&gt;InfluxDB&lt;/a&gt; and &lt;a href=&quot;http://prometheus.io&quot;&gt;Prometheus&lt;/a&gt;. We want users coming from those other database systems, or using tools that are compatible with them, to be able to use Gnocchi too. This is now possible, as Gnocchi offers endpoints to write data using the InfluxDB line protocol and the Prometheus HTTP API. Reading data using their APIs is not supported yet, though. This has been tested with &lt;a href=&quot;https://www.influxdata.com/time-series-platform/telegraf/&quot;&gt;Telegraf&lt;/a&gt;, for example, and works perfectly fine!&lt;/p&gt;
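&lt;p&gt;As a rough illustration of what such a write looks like, here is a sketch that builds an InfluxDB line-protocol payload by hand – the measurement, tag, and field names are made up, and the exact endpoint path to POST it to is described in Gnocchi&apos;s REST documentation:&lt;/p&gt;

```python
# Sketch: building an InfluxDB line-protocol payload by hand, as a tool
# like Telegraf would send it. Measurement/tag/field names are made up.
def influx_line(measurement, tags, fields, timestamp_ns):
    tag_part = ",".join("%s=%s" % kv for kv in sorted(tags.items()))
    field_part = ",".join("%s=%s" % kv for kv in sorted(fields.items()))
    return "%s,%s %s %d" % (measurement, tag_part, field_part, timestamp_ns)

line = influx_line("cpu", {"host": "server01"}, {"usage": 42.5},
                   1517875200000000000)
# "cpu,host=server01 usage=42.5 1517875200000000000"
```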
&lt;p&gt;Some other improvements were made, such as enhanced ACL filtering when using &lt;a href=&quot;https://docs.openstack.org/keystone/latest/&quot;&gt;Keystone&lt;/a&gt; for authentication, a new batch format that passes more information about non-existing metrics so they can be created, and tons of performance improvements!&lt;/p&gt;
&lt;p&gt;We already started working on the next version of Gnocchi! Come and join us on &lt;a href=&quot;http://github.com/gnocchixyz&quot;&gt;GitHub&lt;/a&gt;! Star us, and stay tuned for some more awesome news around metrics.&lt;/p&gt;
</content:encoded><category>gnocchi</category></item><item><title>Gnocchi 4.1 is out</title><link>https://julien.danjou.info/blog/gnocchi-4-1-release/</link><guid isPermaLink="true">https://julien.danjou.info/blog/gnocchi-4-1-release/</guid><description>We did it again. A bit more of our usual four months were needed to do it, but Gnocchi 4.1 has been released. This is a great news and another big milestone for the project!  As usual, we enhanced Gno</description><pubDate>Fri, 27 Oct 2017 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;We did it again. A bit more than our usual four months was needed to do it, but Gnocchi 4.1 has been released. This is great news and another big milestone for the project!&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/gnocchi-logo.png&quot; alt=&quot;Gnocchi logo&quot; /&gt;&lt;/p&gt;
&lt;p&gt;As usual, we enhanced Gnocchi and added a bunch of new things that &lt;a href=&quot;http://gnocchi.xyz/releasenotes/4.1.html&quot;&gt;can all be seen in the online changelog&lt;/a&gt;. Nevertheless, I would like to talk of a few here!&lt;/p&gt;
&lt;p&gt;First, we added notification support to the Redis incoming driver. This feature makes sure that, when using Redis as an incoming measure driver, the metrics are processed as fast as possible, rather than waiting for &lt;code&gt;metric_processing_delay&lt;/code&gt;. This moves the incoming driver toward more of a push model than a pull model – even if it still uses both. That feature decreases the latency between the time measures are pushed and the time they are processed by &lt;em&gt;metricd&lt;/em&gt;, which is tremendous.&lt;/p&gt;
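&lt;p&gt;The idea can be sketched with a toy push model – this is not Gnocchi code, just a &lt;code&gt;threading.Event&lt;/code&gt; standing in for the Redis notification, so the worker wakes up immediately instead of sleeping between polls:&lt;/p&gt;

```python
import threading
import time

# Toy model of the notification feature: the worker blocks on an event
# (standing in for the Redis notification) instead of polling on a delay.
measures = []
processed = []
new_data = threading.Event()
stop = threading.Event()

def metricd_worker():
    while not stop.is_set():
        if new_data.wait(timeout=0.5):   # woken immediately on notify
            new_data.clear()
            while measures:
                processed.append(measures.pop(0))

worker = threading.Thread(target=metricd_worker)
worker.start()

measures.append(("2017-10-27T00:00:00", 42.0))
new_data.set()                           # notify: no polling delay

for _ in range(100):                     # bounded wait for the worker
    if processed:
        break
    time.sleep(0.01)

stop.set()
new_data.set()                           # wake the worker so it can exit
worker.join()
```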
&lt;p&gt;Secondly, the internal computing engine (measures aggregation) has been entirely ported from &lt;a href=&quot;http://pandas.pydata.org&quot;&gt;Pandas&lt;/a&gt; to &lt;a href=&quot;http://numpy.org&quot;&gt;Numpy&lt;/a&gt;. While Pandas is written on top of Numpy, it does more work than Numpy itself when used. Those extra features are beneficial when quickly writing data analysis code, but they are not needed by Gnocchi. They take CPU time, which means less throughput for &lt;em&gt;metricd&lt;/em&gt;. Pandas is still needed for the old and deprecated dynamic aggregation feature and will be entirely removed as a dependency in the next version of Gnocchi.&lt;/p&gt;
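&lt;p&gt;As an illustration of the approach (not Gnocchi&apos;s actual code), here is the kind of vectorized, pure-Numpy bucketed aggregation such an engine relies on – grouping points into fixed-granularity buckets and computing a mean per bucket:&lt;/p&gt;

```python
import numpy as np

# Timestamps in seconds and their values; aggregate into 60 s buckets.
timestamps = np.array([0, 10, 30, 70, 90, 130], dtype=np.int64)
values = np.array([1.0, 2.0, 3.0, 4.0, 6.0, 5.0])

buckets = timestamps // 60                        # bucket index of each point
keys, starts = np.unique(buckets, return_index=True)
sums = np.add.reduceat(values, starts)            # per-bucket sums
counts = np.diff(np.append(starts, len(values)))  # per-bucket point counts
means = sums / counts                             # mean aggregate per bucket
# bucket start times: keys * 60 is [0, 60, 120]; means are [2.0, 5.0, 5.0]
```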
&lt;p&gt;Finally, the biggest functionality that has landed is &lt;a href=&quot;http://gnocchi.xyz/rest.html#aggregates-on-the-fly-measurements-modification-and-aggregation&quot;&gt;the new &lt;code&gt;/v1/aggregates&lt;/code&gt; endpoint&lt;/a&gt;. This is a principal feature that allows not only retrieving aggregates, but also doing cross-metric aggregation in ways that were not possible before. For example, you can request &quot;the absolute value of the product of the averages of two metrics&quot; by writing: &lt;code&gt;(absolute (* (metric 32dd0731-c423-45aa-94f6-e4069989eb57 mean) (metric 942990de-b208-4bf7-a0ee-93e4890df73a mean)))&lt;/code&gt;. This endpoint supports fetching any metric from the database (by id or by search in resources) and applying any mathematical operation. The syntax is inspired by Lisp, which makes it easy to write both as a string and as JSON.&lt;/p&gt;
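&lt;p&gt;As a sketch, the same expression written as JSON is just nested arrays – the &lt;code&gt;metric&lt;/code&gt; helper below is mine, and the exact request body is described in the endpoint&apos;s documentation:&lt;/p&gt;

```python
import json

# The Lisp-like expression from above, written as nested JSON arrays.
# The metric() helper is a hypothetical convenience, not a Gnocchi API.
def metric(metric_id, aggregation):
    return ["metric", metric_id, aggregation]

expr = ["absolute",
        ["*",
         metric("32dd0731-c423-45aa-94f6-e4069989eb57", "mean"),
         metric("942990de-b208-4bf7-a0ee-93e4890df73a", "mean")]]

payload = json.dumps({"operations": expr})
```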
&lt;p&gt;Come and join us on &lt;a href=&quot;http://github.com/gnocchixyz&quot;&gt;GitHub&lt;/a&gt;! Star us, and stay tuned for some more awesome news around metrics.&lt;/p&gt;
</content:encoded><category>gnocchi</category></item><item><title>My interview with Cool Python Codes</title><link>https://julien.danjou.info/blog/interview-coolpythoncodes/</link><guid isPermaLink="true">https://julien.danjou.info/blog/interview-coolpythoncodes/</guid><description>A few days ago, I was contacted by Godson Rapture from Cool Python codes to answer a few questions about what I work on in open source.</description><pubDate>Thu, 05 Oct 2017 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;A few days ago, I was contacted by Godson Rapture from &lt;a href=&quot;http://coolpythoncodes.com/&quot;&gt;Cool Python codes&lt;/a&gt; to answer a few questions about what I work on in open source. Godson regularly interviews developers, and I invite you to check out his website!&lt;/p&gt;
&lt;p&gt;Here&apos;s a copy of &lt;a href=&quot;http://coolpythoncodes.com/julien-danjou/&quot;&gt;my original interview&lt;/a&gt;. Enjoy!&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Good day, Julien Danjou, welcome to Cool Python Codes. Thanks for taking your precious time to be here.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;You’re welcome!&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Could you kindly tell us about yourself like your full name, hobbies, nationality, education, and experience in programming?&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Sure. I’m Julien Danjou, I’m French and live in Paris, France. I studied computer science for 5 years, around 15 years ago, and have continued my career in that field ever since, specializing in open source projects.&lt;/p&gt;
&lt;p&gt;These last few years, I’ve been working as a software engineer at Red Hat. I’ve spent the last 10 years working with the Python programming language. Now I work on the Gnocchi project, which is a time series database.&lt;/p&gt;
&lt;p&gt;When I’m not coding, I enjoy running half-marathons and playing FPS games.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/pyconfr-2017-jd.jpg&quot; alt=&quot;Julien Danjou at PyCon France 2017&quot; /&gt;&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Can you narrate your first programming experience and what got you to start learning to program?&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I started programming around 2001, and my first serious programs were in Perl. I was contributing to a hosting platform for free software named VHFFS. It was a free software project itself, and I enjoyed being able to learn from other, more experienced developers and to contribute back. That’s what got me hooked on the world of open source projects.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Which programming language do you know and which is your favorite?&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I know quite a few; I’ve done serious programming in Perl, C, Lua, Common Lisp, Emacs Lisp, and Python.&lt;/p&gt;
&lt;p&gt;Obviously, my favorite is Common Lisp, but I was never able to use it for any serious project, for various reasons. So I spend most of my time hacking with Python, which I really enjoy as it is close to Lisp, in some ways. I see it as a small subset of Lisp.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;What inspired you to venture into the world of programming and drove you to learn a handful of programming languages?&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;It was mostly scratching my own itches when I started. Each time I saw something I wanted to do or a feature I wanted in an existing software, I learned what I needed to get going and get it working.&lt;/p&gt;
&lt;p&gt;I studied C and Lua while writing awesome, the window manager that I created 10 years ago and used for a while. I learned Emacs Lisp while writing extensions that I wanted to see in Emacs, etc. It’s the best way to start.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;What is your blog about?&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;My blog is a platform where I write about what I work on most of the time. Nowadays, it’s mostly about Python and the main project I contribute to, Gnocchi.&lt;/p&gt;
&lt;p&gt;When writing about Gnocchi, I usually try to explain what part of the project I worked on, what new features we achieved, etc.&lt;/p&gt;
&lt;p&gt;On Python, I try to share solutions to common problems I encountered or identified while doing, e.g., code reviews. Or I present a new library I created!&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Tell us more about your book, The Hacker’s Guide to Python.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;It’s a compilation of everything I learned those last years building large Python applications. I spent the last 6 years developing on a large code base with thousands of other developers.&lt;/p&gt;
&lt;p&gt;I’ve reviewed tons of code and identified the biggest issues, mistakes, and bad practices that developers tend to have. I decided to compile that into a guide, helping developers who have played a bit with Python learn the steps to become really productive with it.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;OpenStack is the biggest open source project in Python, Can you tell us more about OpenStack?&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;OpenStack is a cloud computing platform, started 7 years ago now. Its goal is to provide a programmatic platform to manage your infrastructure while being open source and avoiding vendor lock-in.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Who uses OpenStack? Is it for programmers, website owners?&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;It’s used by a lot of different organizations – not really by individuals. It’s a big piece of software. You can find it in some famous public cloud providers (Dreamhost, Rackspace…), and also as a private cloud in a lot of different organizations, from Bloomberg to eBay or CERN in Switzerland, a big OpenStack user. Tons of telecom providers also leverage OpenStack for their own internal infrastructure.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Have you participated in any OpenStack conference? What did you speak on if you did?&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I’ve attended the last 9 OpenStack summits and a few other OpenStack events around the world. I’ve been engaged in the upstream community for the last 6 years now.&lt;/p&gt;
&lt;p&gt;My area of expertise is telemetry, the stack of software that is in charge of collecting and storing metrics from the various OpenStack components. This is what I regularly talk about during those events.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;How can one join the OpenStack community?&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;There’s an entire documentation about that, called the &lt;a href=&quot;https://docs.openstack.org/infra/manual/developers.html&quot;&gt;Developer’s Guide&lt;/a&gt;. It explains how to set up your environment to send patches, and how to join the community using the mailing lists or IRC.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;What makes your book, &lt;a href=&quot;https://thehackerguidetopython.com&quot;&gt;The Hacker’s Guide to Python&lt;/a&gt; stand out from other Python books? Also, who exactly did you write this book for?&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I wrote the book that I always wanted to read about Python but never found. It’s not a book for people who want to learn Python from scratch. It’s a great guide for those who know the language but don’t know the details that experienced developers know and that make the difference: the best practices, the elegant solutions to common problems, etc. That’s why it also includes interviews with prominent Python developers, so they can share their advice on different areas.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;How can someone get your book?&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I’ve decided to self-publish my book, so it does not have a publisher like you may be used to seeing. The best place to get it is online, where you can pick the format you want, electronic or paper.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;What do you mean when you say you hack with Python?&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Unfortunately, most people refer to hacking as the activity of some bad guys trying to get access to whatever they’re not supposed to see. In the book title, I mean “hacking” as the elegant way of writing code and making things work smoothly, even when you were not expecting it to.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;You mentioned earlier that Gnocchi is a time series database. Can you please be more elaborate about Gnocchi? Is there also any documentation about Gnocchi?&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;So Gnocchi is a project I started a few years ago to store time series at large scale. Time series are basically series of tuples composed of a timestamp and a value.&lt;/p&gt;
&lt;p&gt;Imagine you wanted to store the temperature of all the rooms in the world at any point in time. You’d need a dedicated database with the right data structure for that. This is what Gnocchi does: it provides this data structure storage at very, very large scale.&lt;/p&gt;
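&lt;p&gt;In its simplest form, such a series is nothing more than this (a toy sketch, not Gnocchi code):&lt;/p&gt;

```python
from datetime import datetime, timezone

# A time series: an ordered list of (timestamp, value) tuples,
# e.g. the temperature of one room (the values here are made up).
room_temperature = [
    (datetime(2017, 10, 5, 12, 0, tzinfo=timezone.utc), 20.5),
    (datetime(2017, 10, 5, 12, 1, tzinfo=timezone.utc), 20.7),
    (datetime(2017, 10, 5, 12, 2, tzinfo=timezone.utc), 21.0),
]

values = [v for _, v in room_temperature]
mean = sum(values) / len(values)   # a simple aggregate over the series
```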
&lt;p&gt;The primary use case is infrastructure monitoring, so most people use it to store tons of metrics about their hardware, software, etc. It’s fully documented on &lt;a href=&quot;http://gnocchi.xyz&quot;&gt;its website&lt;/a&gt;.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;How can a programmer without much experience contribute to open source projects?&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The best way to start is to try to fix something that irritates you in some way. It might be a bug, it might be a missing feature. Start small. Don’t try big things first or you could be discouraged.&lt;/p&gt;
&lt;p&gt;Never stop.&lt;/p&gt;
&lt;p&gt;Also, don’t plunge right away in the community and start poking random people or spam them with questions. Do your homework, and listen to the community for a while to get a sense of how things are going. That can be joining IRC and lurking or following the mailing lists for example.&lt;/p&gt;
&lt;p&gt;Big open source communities dedicate programs to help you become engaged. It might be worth a try. Generic programs like Outreachy or Google Summer of Code are a great way to start if you don’t feel confident enough to jump by your own means in a community.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Just out of curiosity, do you write code in French?&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Never ever. I think it’s acceptable to write in your language if you are sure that your code will never be open sourced and that your whole team is talking in that language, no matter what – but it’s a ballsy assumption, clearly.&lt;/p&gt;
&lt;p&gt;Truth is that if you do open source, English is the standard, so go with it. Be sad if you want, but please be pragmatic.&lt;/p&gt;
&lt;p&gt;I’ve seen projects being open sourced by companies where all the source code comments were in Korean. It was impossible for any non-Korean people to get a glimpse of what the code and the project were doing, so it just failed and disappeared.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;How does a team of programmers handle bugs in a large open source project?&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I wish there was some magic recipe, but I don’t think it’s the case. What you want is to have a place where your users can feel safe reporting bugs. Include a template so they don’t forget any details: how to reproduce the bugs, what they expected, etc. The worst thing is to have users reporting “That does not work.” with no details. It’s a waste of time.&lt;/p&gt;
&lt;p&gt;What tool to use to log all of that really depends on the team size and culture.&lt;/p&gt;
&lt;p&gt;Once that works, the actual fixing of bugs doesn’t follow any rule. Most developers fix the bugs they encounter or the ones that are most critical for users. Smaller problems might not be fixed for a long time.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Can you tell us about the new book you are working on and when do we expect to get it?&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;That new book is entitled &lt;a href=&quot;https://scaling-python.com&quot;&gt;“Scaling Python”&lt;/a&gt;, and it provides insight into how to build highly scalable and distributed applications using Python.&lt;/p&gt;
&lt;p&gt;It is also based on my experience building this kind of software over the past years. The book also includes interviews with great Python hackers who work on scalable systems or know a thing or two about writing applications for performance – an important requirement for scalable applications.&lt;/p&gt;
&lt;p&gt;The book is in its final stage now, and it should be out at the beginning of 2018.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;How can someone get in contact with you?&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I’m reachable at &lt;a href=&quot;mailto:julien@danjou.info&quot;&gt;julien@danjou.info&lt;/a&gt; by email or via Twitter, &lt;a href=&quot;https://twitter.com/juldanjou&quot;&gt;@juldanjou&lt;/a&gt;.&lt;/p&gt;
</content:encoded><category>career</category><category>python</category><category>books</category><category>gnocchi</category><category>openstack</category></item><item><title>Gnocchi 4 performance</title><link>https://julien.danjou.info/blog/gnocchi-4-performance/</link><guid isPermaLink="true">https://julien.danjou.info/blog/gnocchi-4-performance/</guid><description>It has been a long time since I have tested Gnocchi performances. Last time was two years ago, on version 2. The current version for Gnocchi is 4.0, released a couple of months ago.</description><pubDate>Mon, 11 Sep 2017 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;It has been a long time since I have tested &lt;a href=&quot;http://gnocchi.xyz&quot;&gt;Gnocchi&lt;/a&gt;&apos;s performance. &lt;a href=&quot;https://julien.danjou.info/blog/2015/gnocchi-benchmarks&quot;&gt;Last time was two years ago, on version 2&lt;/a&gt;. The current version of Gnocchi is 4.0, &lt;a href=&quot;https://julien.danjou.info/blog/2017/gnocchi-4-release&quot;&gt;released a couple of months ago&lt;/a&gt;. It adds a lot of new features, such as a &lt;a href=&quot;http://redis.org&quot;&gt;Redis&lt;/a&gt; incoming driver and a new job distribution method.&lt;/p&gt;
&lt;p&gt;Many of the features and improvements implemented over the last couple of years were made with performance in mind. It is time to check whether they live up to our expectations.&lt;/p&gt;
&lt;h3&gt;Test protocol&lt;/h3&gt;
&lt;p&gt;I have pulled the servers I used a couple of years ago out of the dust, updated them to the latest RHEL 7, and installed Gnocchi 4.0.1 and Redis 4.0.1 on one of them. I used the other server as the benchmark client, in charge of generating a bunch of load.&lt;/p&gt;
&lt;p&gt;The hardware configuration for each server is:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;2 × Intel(R) Xeon(R) CPU E5-2609 v3 @ 1.90GHz (6 cores each)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;32 GB RAM&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;SanDisk Extreme Pro 240GB SSD&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;I have installed Gnocchi using &lt;code&gt;pip install gnocchi[postgresql,file,redis]&lt;/code&gt;, created a PostgreSQL database and wrote the following configuration file:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;[indexer]
url = postgresql://root:@localhost/gnocchi

## Uncomment when testing with Redis
## [incoming]
## driver = redis

[storage]
file_basepath = /root/gnocchi-venv/data
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The perk of having good default values: you only need to write a couple of configuration lines to get it working.&lt;/p&gt;
&lt;p&gt;I have used uWSGI as the Web server, using the configuration file &lt;a href=&quot;http://gnocchi.xyz/operating.html#running-api-as-a-wsgi-application&quot;&gt;provided in Gnocchi&apos;s documentation&lt;/a&gt;, and configured it with 64 processes and 16 threads.&lt;/p&gt;
&lt;p&gt;Since the hardware configurations are identical, I allow myself in this article to compare the performance of Gnocchi 2 and Gnocchi 4 directly.&lt;/p&gt;
&lt;h3&gt;Benchmark tools&lt;/h3&gt;
&lt;p&gt;For generating load, I have reused the code that I wrote and merged into &lt;a href=&quot;http://pypi.python.org/gnocchiclient&quot;&gt;python-gnocchiclient&lt;/a&gt;. It is still not that easy to generate a lot of parallel load in Python, but it is the best tool I found that was not too complicated to set up for things like CRUD operations.&lt;/p&gt;
&lt;p&gt;To benchmark measures, I needed something very fast on the client side to be sure to be able to overload the server. I have leveraged &lt;a href=&quot;https://github.com/wg/wrk&quot;&gt;wrk&lt;/a&gt;, which is written in C and is fast. It is scriptable in Lua, which made it easy to generate fake batches of data.&lt;/p&gt;
&lt;h3&gt;Metric CRUD operations&lt;/h3&gt;
&lt;p&gt;The first step is to benchmark the CRUD operations for metrics. Here are the results, compared to the benchmarks I did against Gnocchi 2.&lt;/p&gt;
&lt;p&gt;Without surprises (but with great pleasure), everything is between 13% and 26% faster. Those operations mostly consist of SQL operations for the backend and serialization on the API – nothing heavy.&lt;/p&gt;
&lt;h3&gt;Sending and getting measures&lt;/h3&gt;
&lt;p&gt;Writing measures is still the hottest topic! How fast can you push things into that time series database, and how efficient is it at retrieving them?&lt;/p&gt;
&lt;p&gt;Gnocchi has been supporting various batching methods for a while, and here the tested one is the simplest case, i.e., batching for one metric at a time.&lt;/p&gt;
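&lt;p&gt;For reference, that simplest case is just a JSON list of measures POSTed for one metric – a sketch with made-up values, with the endpoint path as documented in Gnocchi&apos;s REST API:&lt;/p&gt;

```python
import json

# Payload for the simplest batching case: N measures for a single metric,
# as POSTed to /v1/metric/{metric-id}/measures (values are made up).
measures = [
    {"timestamp": "2017-09-11T12:00:%02d" % second, "value": float(second)}
    for second in range(5)
]
payload = json.dumps(measures)
```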
&lt;p&gt;I think the chart speaks for itself. With Redis as a driver, I attained almost &lt;strong&gt;1 million measures per second&lt;/strong&gt;. I did not find a suitable tool to report performance with a payload bigger than 5000 points, so I stopped there. Those results are in line with what &lt;a href=&quot;https://medium.com/@gord.chung/gnocchi-4-introspective-a83055e99776&quot;&gt;Gordon Chung measured recently on Gnocchi 4&lt;/a&gt; – though he achieved &lt;strong&gt;1.3 million measures per second&lt;/strong&gt; with his bigger hardware!&lt;/p&gt;
&lt;p&gt;These are performances using HTTP as a protocol – with all its overhead and JSON serialization going on. Gnocchi does not implement any custom protocol so far because we never had any requirement for more performance. However, that would certainly be a good path to follow for anyone wanting to go even faster.&lt;/p&gt;
&lt;p&gt;Reading metrics is 54% faster here again. You can retrieve up to 400 000 measures per second (around 150 Mbit/s of data). That means you can retrieve a metric with a whole year of measures at a one-minute aggregate in 1.3 seconds. More realistically, you can retrieve the last 24 hours of data with one-minute precision for 280 metrics in just one second. That is more data than you could ever fit on your graph dashboard!&lt;/p&gt;
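&lt;p&gt;Those two figures fall out of simple arithmetic:&lt;/p&gt;

```python
# Back-of-the-envelope check of the retrieval figures above.
rate = 400000                     # measures retrieved per second

year_points = 365 * 24 * 60       # one year at one-minute granularity
seconds_for_a_year = year_points / rate   # about 1.3 seconds

day_points = 24 * 60              # 24 hours at one-minute precision
metrics_per_second = rate // day_points   # roughly 280 metrics per second
```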
&lt;p&gt;Most of the time is spent serializing points in JSON – again, a different retrieval mechanism could be envisioned to achieve even higher performance.&lt;/p&gt;
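&lt;p&gt;Those retrieval figures are easy to sanity-check from the 400 000 measures per second number quoted above:&lt;/p&gt;

```python
# Sanity-check the read figures quoted above.
rate = 400_000  # measures retrieved per second over HTTP

# A year of one-minute aggregates for a single metric:
points_per_year = 365 * 24 * 60   # 525 600 measures
print(points_per_year / rate)     # about 1.3 seconds to retrieve

# Last 24 hours at one-minute precision:
points_per_day = 24 * 60          # 1 440 measures per metric
print(rate // points_per_day)     # about 280 metrics fetched per second
```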
&lt;h3&gt;&lt;em&gt;Metricd&lt;/em&gt; speed&lt;/h3&gt;
&lt;p&gt;I did not benchmark metricd speed myself, as &lt;a href=&quot;https://medium.com/@gord.chung/gnocchi-4-introspective-a83055e99776&quot;&gt;Gordon wrote a complete report in the meantime&lt;/a&gt;. Gnocchi 4 doubles the processing speed of Gnocchi 2.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/gnocchi-metricd-speed.png&quot; alt=&quot;Chart showing Gnocchi metricd processing speed improvements from version 2 to 4&quot; /&gt;&lt;/p&gt;
&lt;p&gt;This speed is quite impressive and allows Gnocchi to ingest and pre-compute considerable amounts of data in a short time span. Some of the changes Gordon tested here are not yet released and will be part of the next minor release (4.1).&lt;/p&gt;
&lt;p&gt;Being that efficient means that with only one CPU, Gnocchi can aggregate roughly 700 measures per second. If you have 70 servers and gather 10 metrics per server every second, Gnocchi can process them without any delay.&lt;/p&gt;
&lt;p&gt;If you scale back your polling to one minute instead of one second (the most common scenario) and use a single computer with 12 cores, that means Gnocchi can &lt;strong&gt;aggregate the metrics from 50 400 servers with only one server&lt;/strong&gt;.&lt;/p&gt;
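&lt;p&gt;Those capacity numbers follow directly from the 700 measures per second per CPU figure:&lt;/p&gt;

```python
per_cpu = 700  # measures aggregated per second on one CPU

# One-second polling: 70 servers x 10 metrics each = 700 measures/s,
# exactly one CPU's worth of processing.
assert 70 * 10 == per_cpu

# One-minute polling on a 12-core machine: each metric now produces a
# measure 60x less often, so the same hardware covers 60x more metrics.
cores = 12
metrics_per_server = 10
servers = per_cpu * cores * 60 // metrics_per_server
print(servers)  # 50400 servers handled by a single box
```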
&lt;p&gt;Not that bad.&lt;/p&gt;
&lt;h3&gt;Conclusion&lt;/h3&gt;
&lt;p&gt;Our processing engine is now getting really mature. Hundreds of deployments are using it in production to gather metrics. The recent improvements made in Gnocchi 4 are a compelling argument for users to upgrade, and we are pretty proud of our work! We still have a few ideas on how to improve some corner cases, but the general use case is well covered. Add to that the native horizontal scalability that Gnocchi has provided since day one, and it is getting hard to find a time series database offering those features with this level of performance (but of course I&apos;m biased, haha).&lt;/p&gt;
&lt;p&gt;And if you have any questions, feel free to shoot them in the comment section. 😉&lt;/p&gt;
&lt;/content:encoded&gt;&lt;category&gt;gnocchi&lt;/category&gt;&lt;/item&gt;&lt;item&gt;&lt;title&gt;Gnocchi or Prometheus?&lt;/title&gt;&lt;link&gt;https://julien.danjou.info/blog/gnocchi-or-prometheus/&lt;/link&gt;&lt;guid isPermaLink=&quot;true&quot;&gt;https://julien.danjou.info/blog/gnocchi-or-prometheus/&lt;/guid&gt;&lt;description&gt;The realm of time series databases keeps expanding these last few years. Now and then a new contender appears from the fog. People keep asking me about the difference between Gnocchi and Prometheus.&lt;/description&gt;&lt;pubDate&gt;Wed, 30 Aug 2017 00:00:00 GMT&lt;/pubDate&gt;&lt;content:encoded&gt;&amp;lt;p&amp;gt;The realm of time series databases keeps expanding these last few years. Now and then a new contender appears from the fog. People keep asking me about the difference between &amp;lt;a href=&amp;quot;http://gnocchi.xyz&amp;quot;&amp;gt;Gnocchi&amp;lt;/a&amp;gt; and &amp;lt;a href=&amp;quot;http://prometheus.io&amp;quot;&amp;gt;Prometheus&amp;lt;/a&amp;gt;. It&amp;apos;s time to compare them.&amp;lt;/p&amp;gt;
&lt;p&gt;Gnocchi and Prometheus are two open source projects evolving in the same expertise area, time series handling. They are both licensed under the &lt;strong&gt;Apache 2.0 license&lt;/strong&gt; (see &lt;a href=&quot;https://github.com/gnocchixyz/gnocchi/blob/master/LICENSE&quot;&gt;Gnocchi license file&lt;/a&gt; and &lt;a href=&quot;https://github.com/prometheus/prometheus/blob/master/LICENSE&quot;&gt;Prometheus license file&lt;/a&gt;). And that&apos;s a good thing!&lt;/p&gt;
&lt;p&gt;Both Gnocchi and Prometheus offer a bunch of features. Here&apos;s a summary table of the features they each offer – or not.&lt;/p&gt;
&lt;table id=&quot;comparison&quot;&gt;
&lt;thead&gt;
&lt;tr&gt;&lt;th&gt;Feature&lt;/th&gt;&lt;th&gt;Prometheus&lt;/th&gt;&lt;th&gt;Gnocchi&lt;/th&gt;&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;Multi-tenant&lt;/td&gt;&lt;td&gt;❌&lt;/td&gt;&lt;td&gt;✓&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;User auth &amp;amp; ACL&lt;/td&gt;&lt;td&gt;❌&lt;/td&gt;&lt;td&gt;✓&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Resource history&lt;/td&gt;&lt;td&gt;❌&lt;/td&gt;&lt;td&gt;✓&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Metric polling&lt;/td&gt;&lt;td&gt;✓&lt;/td&gt;&lt;td&gt;❌&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Highly available&lt;/td&gt;&lt;td&gt;❌&lt;/td&gt;&lt;td&gt;✓&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Horizontal scalability&lt;/td&gt;&lt;td&gt;❌&lt;/td&gt;&lt;td&gt;✓&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Alerting engine&lt;/td&gt;&lt;td&gt;✓&lt;/td&gt;&lt;td&gt;❌&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Data compression&lt;/td&gt;&lt;td&gt;✓&lt;/td&gt;&lt;td&gt;✓&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Pre-computed aggregation&lt;/td&gt;&lt;td&gt;✓&lt;/td&gt;&lt;td&gt;✓&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Grafana support&lt;/td&gt;&lt;td&gt;✓&lt;/td&gt;&lt;td&gt;✓&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;collectd support&lt;/td&gt;&lt;td&gt;✓&lt;/td&gt;&lt;td&gt;✓&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;There&apos;s a lot of overlap between the two projects, but there are also some major differences.&lt;/p&gt;
&lt;p&gt;First, Gnocchi does not try to solve the metric retrieval problem. Prometheus provides a pull mechanism and takes charge of fetching the measurements itself. Gnocchi developers consider that there are plenty of tools already doing that job well, such as &lt;a href=&quot;http://collectd.org&quot;&gt;collectd&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/icon_siren.png&quot; alt=&quot;Alert siren icon&quot; /&gt;&lt;/p&gt;
&lt;p&gt;Secondly, Prometheus offers an &lt;a href=&quot;https://prometheus.io/docs/alerting/overview/&quot;&gt;alerting engine&lt;/a&gt;, statically configured with a YAML file. It is way better than Gnocchi, which offers nothing in comparison – for now. Gnocchi developers &lt;a href=&quot;https://github.com/gnocchixyz/gnocchi/issues/71&quot;&gt;are discussing the feature&lt;/a&gt; and while it&apos;s not on the roadmap yet, it will happen. It will, however, be controlled through a REST API, as it seems important to us to be able to define alerts programmatically.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/icon_storage.png&quot; alt=&quot;Storage icon&quot; /&gt;&lt;/p&gt;
&lt;p&gt;Then there are a bunch of features where Gnocchi shines compared to Prometheus, and they are at the core of its function: storing metrics. Gnocchi has a great storage engine that supports many storage backends (plain files, &lt;a href=&quot;https://docs.openstack.org/swift/latest/&quot;&gt;OpenStack Swift&lt;/a&gt;, &lt;a href=&quot;http://ceph.org&quot;&gt;Ceph&lt;/a&gt;…). This allows Gnocchi to scale horizontally and provide native high availability, whereas Prometheus remains a single point of failure.&lt;/p&gt;
&lt;p&gt;Multi-tenancy and authentication are also supported by Gnocchi, allowing a single instance to be shared by multiple accounts. System administrators do not commonly need this kind of feature, but application developers usually do.&lt;/p&gt;
&lt;p&gt;That brings me to the usage and querying of Prometheus and Gnocchi. Prometheus has its own small DSL (referred to as &lt;a href=&quot;https://prometheus.io/docs/querying/basics/&quot;&gt;PromQL&lt;/a&gt;) whereas Gnocchi has a &lt;a href=&quot;http://gnocchi.xyz/rest.html&quot;&gt;fully featured REST API&lt;/a&gt; that tries to expose proper semantics. There do not seem to be major differences between the two in terms of features.&lt;/p&gt;
&lt;p&gt;Both Prometheus and Gnocchi support aggregating values over time ranges on query time (&quot;give me the minimum value for every 5 minutes range over the last day&quot;). Gnocchi always aggregates metrics at writing time, and never at query time (unless doing it cross-metrics). This implies that Gnocchi needs a bit of CPU time at write time to pre-compute those aggregates, but it is blazingly fast at reading time as it has nothing to compute. Prometheus can do the same thing using &lt;a href=&quot;https://prometheus.io/docs/querying/rules/&quot;&gt;recording rules&lt;/a&gt;.&lt;/p&gt;
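&lt;p&gt;The write-time trade-off can be pictured with a toy sketch (simplified and hypothetical – the real engine handles multiple aggregation methods, granularities and serialization): the aggregate is updated as each measure arrives, so a read is a plain lookup.&lt;/p&gt;

```python
GRANULARITY = 300  # five-minute buckets, in seconds

class PreAggregated:
    """Toy write-time aggregation keeping only the minimum per bucket."""

    def __init__(self):
        self.mins = {}

    def write(self, timestamp, value):
        # CPU is spent here, at write time...
        bucket = timestamp - (timestamp % GRANULARITY)
        current = self.mins.get(bucket)
        self.mins[bucket] = value if current is None else min(current, value)

    def read(self, bucket):
        # ...so reading has nothing left to compute.
        return self.mins[bucket]

ts = PreAggregated()
ts.write(0, 4.2)
ts.write(120, 3.7)   # same five-minute bucket as above
ts.write(310, 5.0)   # next bucket
print(ts.read(0), ts.read(300))  # 3.7 5.0
```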
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/icon_clock.png&quot; alt=&quot;Clock icon&quot; /&gt;&lt;/p&gt;
&lt;p&gt;Prometheus has some limitations inherent to time series databases designed around the notion of &quot;monitoring&quot;: they tend to compute everything relative to &lt;code&gt;$NOW&lt;/code&gt;. For example, it seems impossible to inject data from the past. The timestamp for a value is the timestamp at which Prometheus read that value. If Prometheus misses values for a few hours, don&apos;t think about importing them back.&lt;/p&gt;
&lt;p&gt;I&apos;m noting this here as it makes it harder to benchmark Prometheus for ingestion. You need to build tons of fake metrics to poll. I did not find any reference on Prometheus performance online, though it is advertised to ingest &quot;millions of measures from thousands of sources&quot;.&lt;/p&gt;
&lt;p&gt;Query performance seems to vary on Prometheus, and I did not find any benchmark on that either. Gnocchi leverages a standard RDBMS (MySQL and PostgreSQL are supported) to query indexed data, and metric retrieval is always &lt;em&gt;O(1)&lt;/em&gt;, making it &lt;strong&gt;always fast&lt;/strong&gt;.&lt;/p&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;If you look at different and older areas, there has never been only one HTTP server. Many people use the Apache HTTP server, but you&apos;ll find plenty of users of nginx, Tomcat, HAProxy, Node.js or uwsgi, which are also common options nowadays. The same goes for RDBMS if you look at PostgreSQL, MySQL and other database solutions, etc. There will never be one project winning the entire market.&lt;/p&gt;
&lt;p&gt;It seems to me that time series storage and management is also growing in this category. There will probably be various projects that will enjoy some popularity and growth. Every project addresses the time series problem space with a different view and different trade-offs. There might never be a single project solving all problems at once.&lt;/p&gt;
&lt;p&gt;Prometheus seems oriented toward the monitoring of live systems. Gnocchi is oriented toward highly available time series storage at massive scale. Performance aside (I was not able to compare them anyway), both have different trade-offs in terms of features, philosophy, and orientation. Depending on your use cases, one might be a better fit than the other.&lt;/p&gt;
</content:encoded><category>gnocchi</category><category>monitoring</category></item><item><title>Using Gnocchi with Docker</title><link>https://julien.danjou.info/blog/using-gnocchi-with-docker/</link><guid isPermaLink="true">https://julien.danjou.info/blog/using-gnocchi-with-docker/</guid><description>I&apos;ve recently started to look into Docker to build images ready to be used with Gnocchi in it. I found it would be a great way to distribute a working instance of Gnocchi.</description><pubDate>Thu, 17 Aug 2017 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;I&apos;ve recently started to look into Docker to build images ready to be used with &lt;a href=&quot;http://gnocchi.xyz&quot;&gt;Gnocchi&lt;/a&gt; in it. I found it would be a great way to distribute a working instance of Gnocchi.&lt;/p&gt;
&lt;p&gt;To this end, we created the &lt;a href=&quot;https://github.com/gnocchixyz/gnocchi-docker&quot;&gt;gnocchi-docker&lt;/a&gt; repository on GitHub. It contains:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;an 11-line (only!) &lt;a href=&quot;https://github.com/gnocchixyz/gnocchi-docker/blob/master/gnocchi/Dockerfile&quot;&gt;Dockerfile&lt;/a&gt; to build a Linux image containing Gnocchi;&lt;/li&gt;
&lt;li&gt;a &lt;a href=&quot;https://github.com/gnocchixyz/gnocchi-docker/tree/master/grafana&quot;&gt;Dockerfile&lt;/a&gt; to create a &lt;a href=&quot;https://grafana.com/&quot;&gt;Grafana&lt;/a&gt; image that will use Gnocchi as datasource (preconfigured);&lt;/li&gt;
&lt;li&gt;a &lt;a href=&quot;https://github.com/gnocchixyz/gnocchi-docker/blob/master/collectd/Dockerfile&quot;&gt;Dockerfile&lt;/a&gt; to create a &lt;a href=&quot;http://collectd.org&quot;&gt;collectd&lt;/a&gt; image that gathers various metrics from your container in order to feed Gnocchi and have something to display in Grafana;&lt;/li&gt;
&lt;li&gt;a &lt;a href=&quot;https://github.com/gnocchixyz/gnocchi-docker/blob/master/docker-compose.yaml&quot;&gt;docker-compose file&lt;/a&gt; that orchestrates and runs those containers.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If you don&apos;t know &lt;a href=&quot;https://docs.docker.com/compose/&quot;&gt;docker-compose&lt;/a&gt;, it&apos;s a tool to define and run applications using multiple containers. This is very handy in our case, as we need to start a few services, and therefore a few containers, to have our whole stack running.&lt;/p&gt;
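&lt;p&gt;To give an idea of the shape of such a file, here is a trimmed-down, illustrative sketch (the service names and build paths are placeholders – the real file lives in the repository):&lt;/p&gt;

```yaml
# Illustrative only -- see docker-compose.yaml in gnocchi-docker for
# the real file. One service per container, wired together by Compose.
version: "2"
services:
  gnocchi:
    build: gnocchi
    ports:
      - "8041:8041"   # Gnocchi REST API
  grafana:
    build: grafana
    ports:
      - "3000:3000"   # Grafana UI, preconfigured with a Gnocchi datasource
    depends_on:
      - gnocchi
  collectd:
    build: collectd   # feeds host metrics into Gnocchi
    depends_on:
      - gnocchi
```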
&lt;p&gt;If you just want to use and run Gnocchi in a snap using this, it&apos;s easy. First clone the repository:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;$ git clone https://github.com/gnocchixyz/gnocchi-docker.git
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Then, just ask docker-compose to start your stack of containers:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;$ cd gnocchi-docker
$ docker-compose up
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;On the first run, &lt;code&gt;docker-compose&lt;/code&gt; will build the various images (this should take only a few minutes) and then will start them.&lt;/p&gt;
&lt;p&gt;Once everything is started, you can connect to Grafana by typing the URL &lt;code&gt;http://&amp;lt;ip of your docker server&amp;gt;:3000&lt;/code&gt; in your browser and using &quot;admin&quot; as username and &quot;password&quot; as password. Just click on the dashboard entitled &quot;Gnocchi&quot; and wait a few minutes: you will see the chart being drawn in real time!&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/gnocchi-docker-grafana-screenshot.png&quot; alt=&quot;gnocchi-docker-grafana-screenshot&quot; /&gt;&lt;/p&gt;
&lt;p&gt;The data fed into Gnocchi comes from the &lt;code&gt;collectd&lt;/code&gt; container, which gathers various metrics (CPU, network interface statistics, etc.).&lt;/p&gt;
&lt;p&gt;You can then edit the Docker files as you like to add new features or test your code. The files are also a good basis if you want to deploy Gnocchi in production with Docker!&lt;/p&gt;
&lt;p&gt;If you want to access and play with Gnocchi from the command line, just install &lt;a href=&quot;https://pypi.python.org/pypi/gnocchiclient&quot;&gt;gnocchiclient&lt;/a&gt; and do the following:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;$ export GNOCCHI_ENDPOINT=http://`docker-machine ip`:8041
$ gnocchi resource list
+----------+----------+------------+---------+----------------------+------------+----------+----------------+--------------+---------+
| id       | type     | project_id | user_id | original_resource_id | started_at | ended_at | revision_start | revision_end | creator |
+----------+----------+------------+---------+----------------------+------------+----------+----------------+--------------+---------+
| c31e4adc | collectd | None       | None    | collectd:fake-phy-   | 2017-08-17 | None     | 2017-08-17T12: | None         | admin   |
| -2cff-5f |          |            |         | host-719acbad336c    | T12:20:27. |          | 20:27.643790+0 |              |         |
| 78-8206- |          |            |         |                      | 643778+00: |          | 0:00           |              |         |
| f5ca66e4 |          |            |         |                      | 00         |          |                |              |         |
| 6cce     |          |            |         |                      |            |          |                |              |         |
+----------+----------+------------+---------+----------------------+------------+----------+----------------+--------------+---------+
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;You can now have fun creating new resources and metrics!&lt;/p&gt;
&lt;p&gt;Feel free to contribute patches to &lt;a href=&quot;http://github.com/gnocchixyz/gnocchi-docker&quot;&gt;the GitHub project&lt;/a&gt; too, obviously!&lt;/p&gt;
&lt;/content:encoded&gt;&lt;category&gt;gnocchi&lt;/category&gt;&lt;category&gt;linux&lt;/category&gt;&lt;/item&gt;&lt;item&gt;&lt;title&gt;Gnocchi 4 is out&lt;/title&gt;&lt;link&gt;https://julien.danjou.info/blog/gnocchi-4-release/&lt;/link&gt;&lt;guid isPermaLink=&quot;true&quot;&gt;https://julien.danjou.info/blog/gnocchi-4-release/&lt;/guid&gt;&lt;description&gt;Finally! Four months ago we pushed the Gnocchi 3.1 release and here we are now, releasing the 4th major version of that time series database.  A lot happened in the last 4 months.First, as I already wrot&lt;/description&gt;&lt;pubDate&gt;Tue, 13 Jun 2017 00:00:00 GMT&lt;/pubDate&gt;&lt;content:encoded&gt;&amp;lt;p&amp;gt;Finally! Four months ago we pushed the Gnocchi 3.1 release and here we are now, releasing the 4th major version of that time series database.&amp;lt;/p&amp;gt;
&lt;p&gt;A lot happened in the last 4 months. &lt;img src=&quot;https://julien.danjou.info/content/images/03/gnocchi-logo.png&quot; alt=&quot;Gnocchi logo&quot; /&gt;&lt;/p&gt;
&lt;p&gt;First, as I already wrote about, &lt;a href=&quot;https://julien.danjou.info/blog/gnocchi-independence&quot;&gt;we moved to GitHub for hosting our project&lt;/a&gt;. This slowed down our development pace for a couple of weeks, but we&apos;re now almost back to normal! We were a bit sad to quit the great infrastructure that we used before, but it feels great to be hosted on a platform everyone knows about and is more straightforward to use.&lt;/p&gt;
&lt;p&gt;Second, we implemented some major changes that should improve performance &lt;em&gt;again&lt;/em&gt;. We tend to do that in each release, I know, I know. As usual, the release notes contain most of &lt;a href=&quot;http://gnocchi.xyz/releasenotes/4.0.html&quot;&gt;the major changes we made and can be read online&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;But I&apos;d like to highlight a few changes here that I find very exciting. The work and performance tests that Alex Krzos did (and that &lt;a href=&quot;https://julien.danjou.info/blog/2017/openstack-summit-pike-boston-recap&quot;&gt;we presented during the last OpenStack Summit&lt;/a&gt;) were a great source of inspiration on where to improve performance.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&quot;http://redis.io&quot;&gt;Redis&lt;/a&gt;! We added a Redis driver which can store incoming measures and metric archives. Obviously, it&apos;s more meant for incoming measures. Remember, in Gnocchi 3.1 we split the storage driver into two parts: the incoming measure storage and the archive storage. Since you can use two different drivers for those different functions, with Gnocchi 4.0 you can use Redis to store your incoming measures in a very fast temporary storage service and then &lt;em&gt;metricd&lt;/em&gt; will process them and store the results in your favorite scalable storage such as &lt;a href=&quot;http://ceph.com&quot;&gt;Ceph&lt;/a&gt;, where it&apos;s mostly read.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Sacks! We rewrote the entire scheduling mechanism for &lt;em&gt;metricd&lt;/em&gt;. It now uses several &quot;sacks&quot; to store incoming measures in a distributed manner, instead of the previous one-sack-only storage. A hashring is then used to spread the processing workload across all the running &lt;em&gt;metricd&lt;/em&gt; daemons. Faster, simpler and more efficient scheduling should happen with this version!&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;S3! We fixed the S3 driver. It was a nice proof-of-concept in 3.1 and now it should work. For real.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
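&lt;p&gt;The sack mechanism can be sketched in a few lines of simplified, hypothetical Python: every metric hashes to a fixed sack, and the live &lt;em&gt;metricd&lt;/em&gt; workers partition the sacks among themselves (a crude round-robin stands in for the real hashring here):&lt;/p&gt;

```python
import hashlib

NUM_SACKS = 128  # number of sacks (configurable in real Gnocchi)

def sack_for_metric(metric_id):
    # Every measure for a given metric always lands in the same sack.
    digest = hashlib.md5(metric_id.encode()).hexdigest()
    return int(digest, 16) % NUM_SACKS

def sacks_for_worker(worker, workers):
    # Crude stand-in for the hashring: partition sacks round-robin over
    # the sorted list of live workers, so no sack is processed twice.
    ranked = sorted(workers)
    rank = ranked.index(worker)
    return [s for s in range(NUM_SACKS) if s % len(ranked) == rank]

workers = ["metricd-1", "metricd-2", "metricd-3"]
assignments = [sacks_for_worker(w, workers) for w in workers]

# Every sack is owned by exactly one worker:
print(sorted(s for a in assignments for s in a) == list(range(NUM_SACKS)))
```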
&lt;p&gt;That&apos;s mostly it. The rest of the changes are bug fixes here and there and some performance improvements, but this should be enough to get you excited to try it out.&lt;/p&gt;
&lt;p&gt;Come and join us on &lt;a href=&quot;http://github.com/gnocchixyz&quot;&gt;GitHub&lt;/a&gt;! Star us, and stay tuned for some more awesome news around metrics.&lt;/p&gt;
&lt;/content:encoded&gt;&lt;category&gt;gnocchi&lt;/category&gt;&lt;/item&gt;&lt;item&gt;&lt;title&gt;OpenStack Summit Boston 2017 recap&lt;/title&gt;&lt;link&gt;https://julien.danjou.info/blog/openstack-summit-pike-boston-recap/&lt;/link&gt;&lt;guid isPermaLink=&quot;true&quot;&gt;https://julien.danjou.info/blog/openstack-summit-pike-boston-recap/&lt;/guid&gt;&lt;description&gt;The first OpenStack Summit of 2017 was last week, in Boston, MA, USA. I was able to attend as I had been selected to give 3 talks, to help with a hands-on and to host an on-boarding session.&lt;/description&gt;&lt;pubDate&gt;Mon, 15 May 2017 00:00:00 GMT&lt;/pubDate&gt;&lt;content:encoded&gt;&amp;lt;p&amp;gt;The &amp;lt;a href=&amp;quot;https://www.openstack.org/summit/boston-2017/&amp;quot;&amp;gt;first OpenStack Summit of 2017&amp;lt;/a&amp;gt; was last week, in Boston, MA, USA. I was able to attend as I had been selected to give 3 talks, to help with a hands-on and to host an on-boarding session. This made sure I was a bit busy every day, which was good.&amp;lt;/p&amp;gt;
&lt;p&gt;This is the first summit to happen since the new &lt;a href=&quot;https://www.openstack.org/ptg/&quot;&gt;Project Team Gathering (PTG)&lt;/a&gt; took place last February. I was unable to attend that first PTG, as there was no way to justify my presence there. The OpenStack Telemetry team that I lead is pretty small, and its members don&apos;t really need to meet face to face to discuss things: therefore, we decided not to ask to be present during the last PTG event.&lt;/p&gt;
&lt;p&gt;The Telemetry on-boarding session that I organized with my fellow developer Gordon Chung on Tuesday had only 3 people showing up to ask a few questions about Telemetry. The session lasted 15 minutes out of the 90 planned. We shared that session with &lt;a href=&quot;https://wiki.openstack.org/wiki/CloudKitty&quot;&gt;CloudKitty&lt;/a&gt;, for which nobody showed up. When you think about it, this was really disappointing, but it did not come as a surprise.&lt;/p&gt;
&lt;p&gt;First, the number of companies engaging developers in OpenStack has shrunk drastically during the last year. Secondly, since there&apos;s now another event (the PTG) twice a year, it seems pretty clear that not every developer will be able to attend all 4 events every year, creating dispersion in the community.&lt;/p&gt;
&lt;p&gt;I personally was glad to attend the Summit rather than the PTG, as meeting operators and users to gather feedback is more valuable than meeting developers. However, meeting everyone at the same time would be great, especially for smaller teams. The PTG scattered some teams to the point that many developers of those teams will attend neither the PTG nor the Summit. As a consequence, I won&apos;t have any meeting point in the future with many of my fellow developers around OpenStack. I warned the Technical Committee about this last year when it was decided to reorganize the events. I&apos;m glad I was right, but I&apos;m a bit sad that the Foundation did not listen.&lt;/p&gt;
&lt;p&gt;Still, all the projects I work on tend to follow &lt;a href=&quot;https://julien.danjou.info/blog/foss-projects-management-bad-practice&quot;&gt;the good practices I wrote about last year&lt;/a&gt;, so I cannot say this has huge consequences for them. It is a loss nonetheless, as it makes it harder for some of us to reach users and operators. It also reduces our occasions for social interaction, which were a great benefit. But it will not prevent us from building great software anyway!&lt;/p&gt;
&lt;p&gt;The few other sessions of &lt;em&gt;&lt;a href=&quot;https://wiki.openstack.org/wiki/Forum&quot;&gt;The Forum&lt;/a&gt;&lt;/em&gt; (the space dedicated to developers during the Summit) that I attended discussed various technical things, and some sessions were pretty empty. I wonder if it was a lack of interest from people or if people were unable to travel to discuss those items. Anyhow, at this stage I am not sure it would have really mattered: this has been my 9th OpenStack Summit, and many of the subjects discussed have already been discussed multiple times with barely any change since. Talk is cheap. Furthermore, most of the discussions were not led by stakeholders of the various projects involved, but by people on the side, or by members of the Technical Committee. There is unfortunately just too much wishful thinking.&lt;/p&gt;
&lt;p&gt;On the talk side, my presentation with Alex Krzos entitled &lt;em&gt;Telemetry and the 10,000 instances&lt;/em&gt; went pretty well. We demonstrated how we tested the performance of the telemetry stack.&lt;/p&gt;
&lt;p&gt;Same goes for my hands-on with the CloudKitty developers, where we managed to explain how Ceilometer, Gnocchi, and CloudKitty were able to work with each other to create nice billing reports. The last day was concluded with my talk on collectd and Gnocchi with Emma, which was short and to the point.&lt;/p&gt;
&lt;p&gt;My final talk was about the status and roadmap of the OpenStack Telemetry team where I tried to explain how the Telemetry works and what we might do (or not) in the next cycles. It was pretty short as we barely have a roadmap, the project having 3 developers doing 80% of the work.&lt;/p&gt;
&lt;p&gt;I was also able to catch up with Nubeliu about their Gnocchi usage. They &lt;a href=&quot;https://www.youtube.com/watch?v=Hlt3UwsvgjU&quot;&gt;presented a nice demo of the cloud monitoring solution&lt;/a&gt; they built on top of Gnocchi. They completely understood how to use Gnocchi to store a large number of metrics at scale and how to leverage the API to render what&apos;s happening in your infrastructure. It is pretty amazing.&lt;/p&gt;
&lt;p&gt;While I missed the energy and the drive that the design session used to have in the first summits, it has been a pretty good summit. I was especially happy to be able to discuss OpenStack Telemetry and Gnocchi. The feedback I gathered was tremendous and terrific and I&apos;m looking forward to the work we&apos;ll achieve in the next months!&lt;/p&gt;
</content:encoded><category>openstack</category><category>gnocchi</category><category>talks</category></item><item><title>Gnocchi independence</title><link>https://julien.danjou.info/blog/gnocchi-independence/</link><guid isPermaLink="true">https://julien.danjou.info/blog/gnocchi-independence/</guid><description>Three years have passed since Gnocchi started. After being incubated inside OpenStack, the project is now going independent.</description><pubDate>Sat, 06 May 2017 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Three years have passed since I started working on &lt;a href=&quot;http://gnocchi.xyz&quot;&gt;Gnocchi&lt;/a&gt;. It&apos;s amazing to gaze at the path we wandered on.&lt;/p&gt;
&lt;p&gt;During all this time, Gnocchi has been &quot;incubated&quot; inside OpenStack. It has been created there and it grew with the rest of the ecosystem. But Gnocchi (developers) always stuck to some strange principles: autonomy and independence from the other OpenStack projects. This actually made the project a bit unpopular sometimes inside OpenStack, being stamped as some kind of &lt;em&gt;rebel&lt;/em&gt;.&lt;/p&gt;
&lt;p&gt;I&apos;ve spent the last years asserting that each project inside OpenStack should strive to live its own life. It is key to the success of any open source project to be usable in any context, not only the one it was built for. Having to use large bundles of projects together is not a good user story. I wish OpenStack would become a set of more autonomous building blocks.&lt;/p&gt;
&lt;p&gt;One of the projects most used by people not running an entire OpenStack installation has been &lt;a href=&quot;https://launchpad.net/swift&quot;&gt;Swift&lt;/a&gt;. That was possible because Swift always tried to be autonomous and not depend on any other service. It is able to leverage external services, but it can also work without any. And I feel that Swift is the most successful project if you measure success by adoption among people with zero knowledge of OpenStack.&lt;/p&gt;
&lt;p&gt;With the move toward the &lt;em&gt;Big Tent&lt;/em&gt;, it struck me that the OpenStack Foundation would end up as some sort of Apache Foundation. And I am pretty sure nobody forces you to use the &lt;a href=&quot;https://httpd.apache.org/&quot;&gt;Apache HTTP server&lt;/a&gt; if you want to use e.g. &lt;a href=&quot;http://lucene.apache.org/&quot;&gt;Lucene&lt;/a&gt; or &lt;a href=&quot;http://hbase.apache.org/&quot;&gt;HBase&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Being part of OpenStack for Gnocchi has been a great advantage at the beginning of the project. The infrastructure provided is awesome. The support we had from the community was great. The Gerrit workflow suited us well.&lt;/p&gt;
&lt;p&gt;But unfortunately, now that the project is getting more mature, many of the requirements of being an OpenStack project have become a real burden. The various processes imposed by OpenStack hurt the development pace. The contribution workflow based around Gerrit and &lt;a href=&quot;https://launchpad.net&quot;&gt;Launchpad&lt;/a&gt; is too complicated for most external contributors and therefore prevents new users from participating in development.&lt;br /&gt;
Worse, the bad image or reputation that OpenStack carries in certain situations or communities prevents Gnocchi from even being evaluated and, maybe, used.&lt;/p&gt;
&lt;p&gt;I think that many of those negative aspects are finally being taken into account by the OpenStack Technical Committee, as can be seen in the &lt;a href=&quot;https://review.openstack.org/#/c/453262/&quot;&gt;proposed vision for OpenStack two years from now&lt;/a&gt;.&lt;br /&gt;
Better late than never.&lt;/p&gt;
&lt;p&gt;So after spending a lot of time weighing the pros and the cons, we, Gnocchi&lt;br /&gt;
contributors, &lt;a href=&quot;http://lists.openstack.org/pipermail/openstack-dev/2017-March/114300.html&quot;&gt;finally decided to move Gnocchi out of OpenStack&lt;/a&gt;.&lt;br /&gt;
We started to move the project to a brand new &lt;a href=&quot;https://github.com/gnocchixyz&quot;&gt;Gnocchi organization on GitHub&lt;/a&gt;. At the time of this writing, only the main gnocchi repository is missing and should be moved soon after the OpenStack Summit happening next week.&lt;/p&gt;
&lt;p&gt;We also used that opportunity to adopt the new Gnocchi logo, courtesy of my friend Thierry Ung!&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/gnocchi-logo.png&quot; alt=&quot;Gnocchi logo&quot; /&gt;&lt;/p&gt;
&lt;p&gt;We&apos;ll see how everything turns out and whether the project gains more traction, as we hope. This will not change how projects such as &lt;a href=&quot;http://launchpad.net/ceilometer&quot;&gt;Ceilometer&lt;/a&gt; consume Gnocchi, and the project aims to remain a good friend of OpenStack. 😀&lt;/p&gt;
</content:encoded><category>gnocchi</category><category>openstack</category></item><item><title>Scalable metrics storage: Gnocchi on Amazon Web Services</title><link>https://julien.danjou.info/blog/metrics-on-amazon-with-gnocchi-s3-driver/</link><guid isPermaLink="true">https://julien.danjou.info/blog/metrics-on-amazon-with-gnocchi-s3-driver/</guid><description>As I wrote a few weeks ago in my post about Gnocchi 3.1 being released, one of the new features available in this version is the S3 driver.</description><pubDate>Wed, 22 Feb 2017 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;As I wrote a few weeks ago in my &lt;a href=&quot;https://julien.danjou.info/blog/gnocchi-3-1-unleashed&quot;&gt;post about Gnocchi 3.1 being released&lt;/a&gt;, one of the new features available in this version is the &lt;a href=&quot;https://aws.amazon.com/s3/&quot;&gt;S3&lt;/a&gt; driver. Today I would like to show you how easy it is to use it and store millions of metrics in the simple, durable and massively scalable object storage provided&lt;br /&gt;
by &lt;a href=&quot;https://aws.amazon.com/&quot;&gt;Amazon Web Services&lt;/a&gt;.&lt;/p&gt;
&lt;h2&gt;Installation&lt;/h2&gt;
&lt;p&gt;The installation of Gnocchi for this use case is no different from&lt;br /&gt;
the &lt;a href=&quot;http://gnocchi.xyz/install.html&quot;&gt;standard installation procedure described in the documentation&lt;/a&gt;. Simply install Gnocchi from &lt;a href=&quot;http://pypi.python.org&quot;&gt;PyPI&lt;/a&gt; using the following command:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;$ pip install gnocchi[s3,postgresql] gnocchiclient
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This will install Gnocchi with the dependencies for the S3 and PostgreSQL drivers and the command-line interface to talk with Gnocchi.&lt;/p&gt;
&lt;h2&gt;Configuring Amazon RDS&lt;/h2&gt;
&lt;p&gt;Since you need a SQL database for the indexer, the easiest way to get started is to create a database on &lt;a href=&quot;https://console.aws.amazon.com/rds/&quot;&gt;Amazon RDS&lt;/a&gt;. You can create a managed &lt;a href=&quot;http://postgresql.org&quot;&gt;PostgreSQL&lt;/a&gt; database instance in just a few clicks.&lt;/p&gt;
&lt;p&gt;Once you&apos;re on the homepage of &lt;a href=&quot;https://console.aws.amazon.com/rds/&quot;&gt;Amazon RDS&lt;/a&gt;, pick PostgreSQL as a&lt;br /&gt;
database:&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/gnocchi-rds-postgresql.png&quot; alt=&quot;gnocchi-rds-postgresql&quot; /&gt;&lt;/p&gt;
&lt;p&gt;You can then configure your PostgreSQL instance: I&apos;ve picked a dev/test instance with the basic options available within the RDS Free Tier, but you can pick whatever you think is needed for your production use. Set a username and a password and note them for later: we&apos;ll need them to configure Gnocchi.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/gnocchi-rds-postgresql-conf.png&quot; alt=&quot;gnocchi-rds-postgresql-conf&quot; /&gt;&lt;/p&gt;
&lt;p&gt;The next step is to configure the database in detail. Just set the database name to &quot;gnocchi&quot; and leave the other options at their default values (I&apos;m lazy).&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/gnocchi-rds-postgresql-details.png&quot; alt=&quot;gnocchi-rds-postgresql-details&quot; /&gt;&lt;/p&gt;
&lt;p&gt;After a few minutes, your instance should be created and running. Note down the endpoint. In this case, my instance is &lt;code&gt;gnocchi.cywagbaxpert.us-east-1.rds.amazonaws.com&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/gnocchi-rds-postgresql-running.png&quot; alt=&quot;gnocchi-rds-postgresql-running&quot; /&gt;&lt;/p&gt;
&lt;h2&gt;Configuring Gnocchi for S3 access&lt;/h2&gt;
&lt;p&gt;In order to give Gnocchi access to S3, you need to create access keys. The easiest way to create them is to go to &lt;a href=&quot;https://console.aws.amazon.com/iam&quot;&gt;IAM&lt;/a&gt; in your AWS console, pick a user with S3 access and click on the big gray button named &quot;Create access key&quot;.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/gnocchi-iam-create-keys.png&quot; alt=&quot;gnocchi-iam-create-keys&quot; /&gt;&lt;/p&gt;
&lt;p&gt;Once you do that, you&apos;ll get the &lt;em&gt;access key id&lt;/em&gt; and &lt;em&gt;secret access key&lt;/em&gt;. Note them down, we will need these later.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/gnocchi-iam-get-keys.png&quot; alt=&quot;gnocchi-iam-get-keys&quot; /&gt;&lt;/p&gt;
&lt;h2&gt;Creating &lt;code&gt;gnocchi.conf&lt;/code&gt;&lt;/h2&gt;
&lt;p&gt;Now it is time to create the &lt;code&gt;gnocchi.conf&lt;/code&gt; file. You can place it in &lt;code&gt;/etc/gnocchi&lt;/code&gt; if you want to deploy it system-wide, or in any other directory and add the &lt;code&gt;--config-file&lt;/code&gt; option to each Gnocchi command.&lt;/p&gt;
&lt;p&gt;Here are the values that you should retrieve and write in the configuration file:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;indexer.url&lt;/code&gt;: the PostgreSQL RDS instance endpoint and the credentials noted above.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;storage.s3_endpoint_url&lt;/code&gt;: the S3 endpoint URL – that depends on the region you want to use and &lt;a href=&quot;http://docs.aws.amazon.com/general/latest/gr/rande.html#s3_region&quot;&gt;they are listed here&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;storage.s3_region_name&lt;/code&gt;: the S3 region name matching the endpoint you picked.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;storage.s3_access_key_id&lt;/code&gt; and &lt;code&gt;storage.s3_secret_access_key&lt;/code&gt;: your AWS access key id and secret access key.&lt;/li&gt;
&lt;/ul&gt;
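&lt;p&gt;One detail worth noting: the credentials end up embedded in the &lt;code&gt;indexer.url&lt;/code&gt; value, so a password containing characters such as &lt;code&gt;@&lt;/code&gt;, &lt;code&gt;:&lt;/code&gt; or &lt;code&gt;/&lt;/code&gt; must be percent-encoded. A minimal sketch assembling the URL with the Python standard library (the credentials and password below are made up):&lt;/p&gt;

```python
from urllib.parse import quote_plus

# Hypothetical credentials and endpoint -- substitute your own RDS values.
user = "gnocchi"
password = "p@ss:w0rd/42"  # contains characters that would break a raw URL
host = "gnocchi.cywagbaxpert.us-east-1.rds.amazonaws.com"
database = "gnocchi"

# Percent-encode user and password so the URL stays parseable.
indexer_url = "postgresql://%s:%s@%s:5432/%s" % (
    quote_plus(user), quote_plus(password), host, database)

print(indexer_url)
# postgresql://gnocchi:p%40ss%3Aw0rd%2F42@gnocchi.cywagbaxpert.us-east-1.rds.amazonaws.com:5432/gnocchi
```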
&lt;p&gt;Your &lt;code&gt;gnocchi.conf&lt;/code&gt; file should then look like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;[indexer]
url = postgresql://gnocchi:gn0cch1rul3z@gnocchi.cywagbaxpert.us-east-1.rds.amazonaws.com:5432/gnocchi

[storage]
driver = s3
s3_endpoint_url = https://s3-eu-west-1.amazonaws.com
s3_region_name = eu-west-1
s3_access_key_id = &amp;lt;your access key id&amp;gt;
s3_secret_access_key = &amp;lt;your secret access key&amp;gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Once that&apos;s done, you can run &lt;code&gt;gnocchi-upgrade&lt;/code&gt; to initialize the Gnocchi indexer (PostgreSQL) and storage (S3):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;$ gnocchi-upgrade --config-file gnocchi.conf
2017-02-07 15:35:52.491 3660 INFO gnocchi.cli [-] Upgrading indexer &amp;lt;gnocchi.indexer.sqlalchemy.SQLAlchemyIndexer object at 0x108221950&amp;gt;
2017-02-07 15:36:04.127 3660 INFO gnocchi.cli [-] Upgrading storage &amp;lt;gnocchi.storage.s3.S3Storage object at 0x10ca943d0&amp;gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Then you can run the API using the test daemon &lt;code&gt;gnocchi-api&lt;/code&gt;, specifying its default port 8041:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;$ gnocchi-api --port 8041 -- --config-file gnocchi.conf
2017-02-07 15:53:06.823 6290 INFO gnocchi.rest.app [-] WSGI config used: /Users/jd/Source/gnocchi/gnocchi/rest/api-paste.ini
********************************************************************************
STARTING test server gnocchi.rest.app.build_wsgi_app
Available at http://127.0.0.1:8041/
DANGER! For testing only, do not use in production
********************************************************************************
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The best way to run the Gnocchi API is to use &lt;a href=&quot;http://gnocchi.xyz/master/running.html#running-api-as-a-wsgi-application&quot;&gt;uwsgi as documented&lt;/a&gt;, but in this case, using the testing daemon &lt;code&gt;gnocchi-api&lt;/code&gt; is good enough.&lt;/p&gt;
&lt;p&gt;Finally, in another terminal, you can start the &lt;code&gt;gnocchi-metricd&lt;/code&gt; daemon that will process metrics in background:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;$ gnocchi-metricd --config-file gnocchi.conf
2017-02-07 15:52:41.416 6262 INFO gnocchi.cli [-] 0 measurements bundles across 0 metrics wait to be processed.
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Once everything is running, you can use Gnocchi&apos;s client to query it and check that everything is OK. The backlog should be empty at this stage, obviously.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;$ gnocchi status
+-----------------------------------------------------+-------+
| Field                                               | Value |
+-----------------------------------------------------+-------+
| storage/number of metric having measures to process | 0     |
| storage/total number of measures to process         | 0     |
+-----------------------------------------------------+-------+
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Gnocchi is ready to be used!&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;$ # Create a generic resource &quot;foobar&quot; with a metric named &quot;visitor&quot;
$ gnocchi resource create foobar -n visitor
+-----------------------+-----------------------------------------------+
| Field                 | Value                                         |
+-----------------------+-----------------------------------------------+
| created_by_project_id |                                               |
| created_by_user_id    | admin                                         |
| creator               | admin                                         |
| ended_at              | None                                          |
| id                    | b4d568e4-7af1-5aec-ac3f-9c09fa3685a9          |
| metrics               | visitor: 05f45876-1a69-4a64-8575-03eea5b79407 |
| original_resource_id  | foobar                                        |
| project_id            | None                                          |
| revision_end          | None                                          |
| revision_start        | 2017-02-07T14:54:54.417447+00:00              |
| started_at            | 2017-02-07T14:54:54.417414+00:00              |
| type                  | generic                                       |
| user_id               | None                                          |
+-----------------------+-----------------------------------------------+

## Send the number of visitors at 2 different timestamps
$ gnocchi measures add --resource-id foobar -m 2017-02-07T15:56@23 visitor
$ gnocchi measures add --resource-id foobar -m 2017-02-07T15:57@42 visitor

## Check the average number of visitors
## (the --refresh option is given to be sure the measures are processed)
$ gnocchi measures show --resource-id foobar visitor --refresh
+---------------------------+-------------+-------+
| timestamp                 | granularity | value |
+---------------------------+-------------+-------+
| 2017-02-07T15:55:00+00:00 |       300.0 |  32.5 |
+---------------------------+-------------+-------+

## Now show the minimum number of visitors
$ gnocchi measures show --aggregation min --resource-id foobar visitor
+---------------------------+-------------+-------+
| timestamp                 | granularity | value |
+---------------------------+-------------+-------+
| 2017-02-07T15:55:00+00:00 |       300.0 |  23.0 |
+---------------------------+-------------+-------+
&lt;/code&gt;&lt;/pre&gt;
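&lt;p&gt;The numbers above are easy to reproduce by hand: measures are grouped into buckets by flooring their timestamp to the granularity (here 300 seconds), and each aggregation method is then applied per bucket. Here is a minimal sketch of that behavior – a simplification for illustration, not Gnocchi&apos;s actual code:&lt;/p&gt;

```python
from datetime import datetime, timezone

GRANULARITY = 300  # seconds, the finest level of the archive policy used above

# The two measures sent with `gnocchi measures add`
measures = [
    (datetime(2017, 2, 7, 15, 56, tzinfo=timezone.utc), 23.0),
    (datetime(2017, 2, 7, 15, 57, tzinfo=timezone.utc), 42.0),
]

# Group the measures into buckets by flooring the timestamp to the granularity
buckets = {}
for ts, value in measures:
    epoch = int(ts.timestamp())
    bucket = epoch - epoch % GRANULARITY
    buckets.setdefault(bucket, []).append(value)

for bucket, values in sorted(buckets.items()):
    when = datetime.fromtimestamp(bucket, tz=timezone.utc)
    print(when.isoformat(), "mean:", sum(values) / len(values), "min:", min(values))
# 2017-02-07T15:55:00+00:00 mean: 32.5 min: 23.0
```

Both measures fall into the same 15:55:00 bucket, which is why the CLI shows a single point whose mean is 32.5 and whose min is 23.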
&lt;p&gt;And voilà! You&apos;re ready to store millions of metrics and measures on your Amazon Web Services cloud platform. I hope you&apos;ll enjoy it, and feel free to ask any questions in the comment section or to reach out to me directly!&lt;/p&gt;
</content:encoded><category>gnocchi</category><category>web</category></item><item><title>Sending your collectd metrics to Gnocchi</title><link>https://julien.danjou.info/blog/gnocchi-collectd-setup/</link><guid isPermaLink="true">https://julien.danjou.info/blog/gnocchi-collectd-setup/</guid><description>Knowing that collectd is a daemon that collects system and applications metrics and that Gnocchi is a scalable timeseries database, it sounds like a good idea to combine them together.</description><pubDate>Thu, 16 Feb 2017 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Knowing that &lt;a href=&quot;http://collectd.org/&quot;&gt;collectd&lt;/a&gt; is a daemon that collects system and applications metrics and that &lt;a href=&quot;http://gnocchi.xyz&quot;&gt;Gnocchi&lt;/a&gt; is a scalable timeseries database, it sounds like a good idea to combine them together. &lt;em&gt;Cherry on the cake&lt;/em&gt;: you can easily draw charts using &lt;a href=&quot;http://grafana.org&quot;&gt;Grafana&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;While it&apos;s true that Gnocchi is well integrated with &lt;a href=&quot;http://openstack.org&quot;&gt;OpenStack&lt;/a&gt;, as it originally comes from this ecosystem, it actually works standalone by default. Starting with version 3.1, it is now easy to send metrics to &lt;em&gt;Gnocchi&lt;/em&gt; using &lt;em&gt;collectd&lt;/em&gt;.&lt;/p&gt;
&lt;h2&gt;Installation&lt;/h2&gt;
&lt;p&gt;What we&apos;ll need to install to accomplish this task is:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;collectd&lt;/li&gt;
&lt;li&gt;Gnocchi&lt;/li&gt;
&lt;li&gt;collectd-gnocchi&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;How you install them does not really matter. If they are packaged by your operating system, go ahead. For Gnocchi and collectd-gnocchi, you can also use &lt;em&gt;pip&lt;/em&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;## pip install gnocchi[file,postgresql]
[…]
Successfully installed gnocchi-3.1.0
## pip install collectd-gnocchi
Collecting collectd-gnocchi
  Using cached collectd-gnocchi-1.0.1.tar.gz
[…]
Installing collected packages: collectd-gnocchi
  Running setup.py install for collectd-gnocchi ... done
Successfully installed collectd-gnocchi-1.0.1
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The installation procedure for Gnocchi is &lt;a href=&quot;http://gnocchi.xyz/install.html#id1&quot;&gt;detailed in the documentation&lt;/a&gt;. Among other things, it explains which flavors are available – here I picked PostgreSQL and the file driver to store the metrics.&lt;/p&gt;
&lt;h2&gt;Configuration&lt;/h2&gt;
&lt;h3&gt;Gnocchi&lt;/h3&gt;
&lt;p&gt;Gnocchi is simple to configure and is again &lt;a href=&quot;http://gnocchi.xyz/configuration.html&quot;&gt;documented&lt;/a&gt;. The default configuration file is &lt;code&gt;/etc/gnocchi/gnocchi.conf&lt;/code&gt; – you can generate it with &lt;code&gt;gnocchi-config-generator&lt;/code&gt; if needed. However, it is also possible to specify another configuration file by appending the &lt;code&gt;--config-file&lt;/code&gt; option to any command line.&lt;/p&gt;
&lt;p&gt;In Gnocchi&apos;s configuration file, you need to set the &lt;code&gt;indexer.url&lt;/code&gt; configuration option to point to an existing PostgreSQL database and set &lt;code&gt;storage.file_basepath&lt;/code&gt; to an existing directory to store your metrics (the default is &lt;code&gt;/var/lib/gnocchi&lt;/code&gt;). That gives something like:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;[indexer]
url = postgresql://root:p4assw0rd@localhost/gnocchi

[storage]
file_basepath = /var/lib/gnocchi
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Once done, just run the &lt;code&gt;gnocchi-upgrade&lt;/code&gt; command to initialize the index and storage.&lt;/p&gt;
&lt;h3&gt;collectd&lt;/h3&gt;
&lt;p&gt;Collectd provides a default configuration file that loads a bunch of plugins by default, which will collect all sorts of metrics on your computer. You can check the &lt;a href=&quot;http://collectd.org/documentation.shtml&quot;&gt;documentation&lt;/a&gt; online to see how to disable or enable plugins.&lt;/p&gt;
&lt;p&gt;As the &lt;em&gt;collectd-gnocchi&lt;/em&gt; plugin is written in Python, you&apos;ll need to enable the Python plugin and load the &lt;em&gt;collectd-gnocchi&lt;/em&gt; module:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;LoadPlugin python

&amp;lt;Plugin python&amp;gt;
  Import &quot;collectd_gnocchi&quot;
  &amp;lt;Module collectd_gnocchi&amp;gt;
      endpoint &quot;http://localhost:8041&quot;
  &amp;lt;/Module&amp;gt;
&amp;lt;/Plugin&amp;gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;That is enough to enable the storage of metrics in Gnocchi.&lt;/p&gt;
&lt;h2&gt;Running the daemons&lt;/h2&gt;
&lt;p&gt;Once everything is configured, you can launch &lt;code&gt;gnocchi-metricd&lt;/code&gt; and the &lt;code&gt;gnocchi-api&lt;/code&gt; daemon:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;$ gnocchi-metricd
2017-01-26 15:22:49.018 15971 INFO gnocchi.cli [-] 0 measurements bundles
across 0 metrics wait to be processed.
[…]
## In another terminal
$ gnocchi-api --port 8041
[…]
STARTING test server gnocchi.rest.app.build_wsgi_app
Available at http://127.0.0.1:8041/
[…]
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;It&apos;s not recommended to run Gnocchi using the &lt;code&gt;gnocchi-api&lt;/code&gt; test daemon (as&lt;br /&gt;
&lt;a href=&quot;http://gnocchi.xyz/running.html#running-as-a-wsgi-application&quot;&gt;written in the documentation&lt;/a&gt;): using &lt;a href=&quot;https://uwsgi-docs.readthedocs.io/&quot;&gt;uwsgi&lt;/a&gt; is a better option. However, for rapid testing, the &lt;code&gt;gnocchi-api&lt;/code&gt; daemon is good enough.&lt;/p&gt;
&lt;p&gt;Once that&apos;s done, you can start &lt;code&gt;collectd&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;$ collectd
## Or to run in foreground with a different configuration file:
## $ collectd -C collectd.conf -f
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;If you have any problem launching &lt;em&gt;collectd&lt;/em&gt;, check syslog for more information: there might be an issue loading a module or plugin.&lt;/p&gt;
&lt;p&gt;If no errors are printed, then everything is working fine and you should soon see &lt;em&gt;gnocchi-api&lt;/em&gt; logging some requests such as:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;127.0.0.1 - - [26/Jan/2017 15:27:03] &quot;POST /v1/resource/collectd HTTP/1.1&quot; 409 113
127.0.0.1 - - [26/Jan/2017 15:27:03] &quot;POST /v1/batch/resources/metrics/measures?create_metrics=True HTTP/1.1&quot; 400 91
&lt;/code&gt;&lt;/pre&gt;
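&lt;p&gt;The second request is the interesting one: instead of one HTTP call per measure, the plugin batches the measures for all metrics of a resource into a single POST. The JSON payload is roughly of the following shape – a hand-written illustration based on my understanding of the batch endpoint, not a capture from the plugin:&lt;/p&gt;

```python
import json

# Illustrative batch payload: measures for two metrics of one resource.
# The resource id and metric names mirror the listing further below;
# the timestamps and values are made up.
payload = {
    "dd245138-00c7-5bdc-94f8-263e236812f7": {
        "load@load-0": [
            {"timestamp": "2017-01-26T15:27:03+00:00", "value": 2.6},
        ],
        "memory@memory-free-0": [
            {"timestamp": "2017-01-26T15:27:03+00:00", "value": 4200000000.0},
        ],
    },
}

print(json.dumps(payload, indent=2))
```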
&lt;h2&gt;Enjoying the result&lt;/h2&gt;
&lt;p&gt;Once everything runs, you can access your newly created resources and metrics by using the &lt;a href=&quot;http://pypi.python.org/pypi/gnocchiclient&quot;&gt;gnocchiclient&lt;/a&gt;. It should have been installed as a dependency of &lt;em&gt;collectd_gnocchi&lt;/em&gt;, but you can also install it manually using &lt;code&gt;pip install gnocchiclient&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;If you need to specify a different endpoint, you can use the &lt;code&gt;--endpoint&lt;/code&gt; option (which defaults to &lt;a href=&quot;http://localhost:8041&quot;&gt;http://localhost:8041&lt;/a&gt;). Do not hesitate to check the &lt;code&gt;--help&lt;/code&gt; option for more information.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;$ gnocchi resource list --details
+---------------+----------+------------+---------+----------------------+---------------+----------+----------------+--------------+---------+-----------+
| id            | type     | project_id | user_id | original_resource_id | started_at    | ended_at | revision_start | revision_end | creator | host      |
+---------------+----------+------------+---------+----------------------+---------------+----------+----------------+--------------+---------+-----------+
| dd245138-00c7 | collectd | None       | None    | dd245138-00c7-5bdc-  | 2017-01-26T14 | None     | 2017-01-26T14: | None         | admin   | localhost |
| -5bdc-94f8-26 |          |            |         | 94f8-263e236812f7    | :21:02.297466 |          | 21:02.297483+0 |              |         |           |
| 3e236812f7    |          |            |         |                      | +00:00        |          | 0:00           |              |         |           |
+---------------+----------+------------+---------+----------------------+---------------+----------+----------------+--------------+---------+-----------+
$ gnocchi resource show collectd:localhost
+-----------------------+-----------------------------------------------------------------------+
| Field                 | Value                                                                 |
+-----------------------+-----------------------------------------------------------------------+
| created_by_project_id |                                                                       |
| created_by_user_id    | admin                                                                 |
| creator               | admin                                                                 |
| ended_at              | None                                                                  |
| host                  | localhost                                                             |
| id                    | dd245138-00c7-5bdc-94f8-263e236812f7                                  |
| metrics               | interface-en0@if_errors-0: 5d60f224-2e9e-4247-b415-64d567cf5866       |
|                       | interface-en0@if_errors-1: 1df8b08b-555a-4cab-9186-f9b79a814b03       |
|                       | interface-en0@if_octets-0: 491b7517-7219-4a04-bdb6-934d3bacb482       |
|                       | interface-en0@if_octets-1: 8b5264b8-03f3-4aba-a7f8-3cd4b559e162       |
|                       | interface-en0@if_packets-0: 12efc12b-2538-45e7-aa66-f8b9960b5fa3      |
|                       | interface-en0@if_packets-1: 39377ff7-06e8-454a-a22a-942c8f2bca56      |
|                       | interface-en1@if_errors-0: c3c7e9fc-f486-4d0c-9d36-55cea855596a       |
|                       | interface-en1@if_errors-1: a90f1bec-3a60-4f58-a1d1-b3c09dce4359       |
|                       | interface-en1@if_octets-0: c1ee8c75-95bf-4096-8055-8c0c4ec8cd47       |
|                       | interface-en1@if_octets-1: cbb90a94-e133-4deb-ac10-3f37770e32f0       |
|                       | interface-en1@if_packets-0: ac93b1b9-da71-4876-96aa-76067b35c6c9      |
|                       | interface-en1@if_packets-1: 2f8528b2-12ae-4c4d-bec7-8cc987e7487b      |
|                       | interface-en2@if_errors-0: ddcf7203-4c49-400b-9320-9d3e0a63c6d5       |
|                       | interface-en2@if_errors-1: b249ea42-01ad-4742-9452-2c834010df71       |
|                       | interface-en2@if_octets-0: 8c23013a-604e-40bf-a07a-e2dc4fc5cbd7       |
|                       | interface-en2@if_octets-1: 806c1452-0607-4b56-b184-c4ffd48f52c0       |
|                       | interface-en2@if_packets-0: c5bc6103-6313-4b8b-997d-01930d1d8af4      |
|                       | interface-en2@if_packets-1: 478ae87e-e56b-44e4-83b0-ed28d99ed280      |
|                       | load@load-0: 5db2248d-2dca-401e-b2e2-bbaee23b623e                     |
|                       | load@load-1: 6f74ac93-78fd-4a74-a47e-d2add487a30f                     |
|                       | load@load-2: 1897aca1-356e-4791-907f-512e516992b5                     |
|                       | memory@memory-active-0: 83944a85-9c84-4fe4-b471-1a6cf8dce858          |
|                       | memory@memory-free-0: 0ccc7cfa-26a5-4441-a15f-9ebb2aa82c6d            |
|                       | memory@memory-inactive-0: 63736026-94c4-47c5-8d6f-a9d89d65025b        |
|                       | memory@memory-wired-0: b7217fd6-2cdc-4efd-b1a8-a1edd52eaa2e           |
| original_resource_id  | dd245138-00c7-5bdc-94f8-263e236812f7                                  |
| project_id            | None                                                                  |
| revision_end          | None                                                                  |
| revision_start        | 2017-01-26T14:21:02.297483+00:00                                      |
| started_at            | 2017-01-26T14:21:02.297466+00:00                                      |
| type                  | collectd                                                              |
| user_id               | None                                                                  |
+-----------------------+-----------------------------------------------------------------------+
$ gnocchi metric show -r collectd:localhost load@load-0
+------------------------------------+-----------------------------------------------------------------------+
| Field                              | Value                                                                 |
+------------------------------------+-----------------------------------------------------------------------+
| archive_policy/aggregation_methods | min, std, sum, median, mean, 95pct, count, max                        |
| archive_policy/back_window         | 0                                                                     |
| archive_policy/definition          | - timespan: 1:00:00, granularity: 0:05:00, points: 12                 |
|                                    | - timespan: 1 day, 0:00:00, granularity: 1:00:00, points: 24          |
|                                    | - timespan: 30 days, 0:00:00, granularity: 1 day, 0:00:00, points: 30 |
| archive_policy/name                | low                                                                   |
| created_by_project_id              |                                                                       |
| created_by_user_id                 | admin                                                                 |
| creator                            | admin                                                                 |
| id                                 | 5db2248d-2dca-401e-b2e2-bbaee23b623e                                  |
| name                               | load@load-0                                                           |
| resource/created_by_project_id     |                                                                       |
| resource/created_by_user_id        | admin                                                                 |
| resource/creator                   | admin                                                                 |
| resource/ended_at                  | None                                                                  |
| resource/id                        | dd245138-00c7-5bdc-94f8-263e236812f7                                  |
| resource/original_resource_id      | dd245138-00c7-5bdc-94f8-263e236812f7                                  |
| resource/project_id                | None                                                                  |
| resource/revision_end              | None                                                                  |
| resource/revision_start            | 2017-01-26T14:21:02.297483+00:00                                      |
| resource/started_at                | 2017-01-26T14:21:02.297466+00:00                                      |
| resource/type                      | collectd                                                              |
| resource/user_id                   | None                                                                  |
| unit                               | None                                                                  |
+------------------------------------+-----------------------------------------------------------------------+
$ gnocchi measures show -r collectd:localhost load@load-0
+---------------------------+-------------+--------------------+
| timestamp                 | granularity |              value |
+---------------------------+-------------+--------------------+
| 2017-01-26T00:00:00+00:00 |     86400.0 | 3.2705004391254193 |
| 2017-01-26T15:00:00+00:00 |      3600.0 | 3.2705004391254193 |
| 2017-01-26T15:00:00+00:00 |       300.0 | 2.6022800611413044 |
| 2017-01-26T15:05:00+00:00 |       300.0 |  3.561742940080275 |
| 2017-01-26T15:10:00+00:00 |       300.0 | 2.5605337960379466 |
| 2017-01-26T15:15:00+00:00 |       300.0 |  3.837517851142473 |
| 2017-01-26T15:20:00+00:00 |       300.0 | 3.9625948392427883 |
| 2017-01-26T15:25:00+00:00 |       300.0 | 3.2690042162698414 |
+---------------------------+-------------+--------------------+
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;As you can see, the command line works smoothly and can show you any kind of metric reported by &lt;em&gt;collectd&lt;/em&gt;. In this case, it was just running on my laptop, but you can imagine it&apos;s easy enough to poll thousands of hosts with &lt;em&gt;collectd&lt;/em&gt; and &lt;em&gt;Gnocchi&lt;/em&gt;.&lt;/p&gt;
&lt;h2&gt;Bonus: charting with Grafana&lt;/h2&gt;
&lt;p&gt;&lt;a href=&quot;http://grafana.org&quot;&gt;Grafana&lt;/a&gt;, a charting software, has a plugin for &lt;em&gt;Gnocchi&lt;/em&gt; as &lt;a href=&quot;http://gnocchi.xyz/grafana.html&quot;&gt;detailed in the documentation&lt;/a&gt;. Once installed, you can just configure &lt;em&gt;Grafana&lt;/em&gt; to point to &lt;em&gt;Gnocchi&lt;/em&gt; this way:&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/grafana-config-screen-gnocchi.png&quot; alt=&quot;Screenshot of Grafana configuration screen pointing to Gnocchi as data source&quot; /&gt;&lt;/p&gt;
&lt;p&gt;You can then create a new dashboard by filling the forms as you wish. See this other screenshot for a nice example:&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/grafana-gnocchi-load.png&quot; alt=&quot;Charts of my laptop&apos;s load average&quot; /&gt;&lt;/p&gt;
&lt;p&gt;I hope everything is clear and easy enough. If you have any question, feel free to write something in the comment section!&lt;/p&gt;
</content:encoded><category>gnocchi</category><category>monitoring</category></item><item><title>FOSDEM 2017, recap</title><link>https://julien.danjou.info/blog/fosdem-2017-recap/</link><guid isPermaLink="true">https://julien.danjou.info/blog/fosdem-2017-recap/</guid><description>Last week-end, I was in Brussels, Belgium for the 2017 edition of the FOSDEM, one of the greatest open source developer conference.</description><pubDate>Mon, 06 Feb 2017 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/fosdem.png&quot; alt=&quot;FOSDEM 2017 conference logo&quot; /&gt;&lt;/p&gt;
&lt;p&gt;Last week-end, I was in Brussels, Belgium for the 2017 edition of the &lt;a href=&quot;http://fosdem.org&quot;&gt;FOSDEM&lt;/a&gt;, one of the greatest open source developer conferences.&lt;/p&gt;
&lt;p&gt;This year, I decided to propose a talk about &lt;a href=&quot;http://gnocchi.xyz&quot;&gt;Gnocchi&lt;/a&gt; which was accepted in the &lt;a href=&quot;https://fosdem.org/2017/schedule/track/python/&quot;&gt;Python devroom&lt;/a&gt;. The track was very well organized (thanks to &lt;a href=&quot;https://wirtel.be/&quot;&gt;Stéphane Wirtel&lt;/a&gt;) and I was able to present Gnocchi to a room full of Python developers!&lt;/p&gt;
&lt;p&gt;I&apos;ve explained why we created Gnocchi and how we did it, and finally briefly explained how to use it with the command-line interface or in a Python application using the &lt;a href=&quot;http://gnocchi.xyz/gnocchiclient&quot;&gt;SDK&lt;/a&gt;.&lt;/p&gt;
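&lt;p&gt;For the curious, here is a minimal, hypothetical sketch of what talking to the Gnocchi REST API from Python can look like – the endpoint, token and metric id are placeholders, and the requests are only built, never sent:&lt;/p&gt;

```python
import json
import urllib.request

GNOCCHI = "http://localhost:8041"  # placeholder endpoint

def build_post(path, payload, token="placeholder-token"):
    """Build (but do not send) a JSON POST request for the Gnocchi API."""
    return urllib.request.Request(
        GNOCCHI + path,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json",
                 "X-Auth-Token": token},
        method="POST",
    )

# Create a metric using the "low" archive policy...
create_metric = build_post("/v1/metric", {"archive_policy_name": "low"})

# ...then push measures to it (the metric id here is a placeholder).
push = build_post(
    "/v1/metric/METRIC-UUID/measures",
    [{"timestamp": "2017-02-05T10:00:00", "value": 42.0}],
)
print(create_metric.get_method(), push.get_full_url())
```

&lt;p&gt;The &lt;a href=&quot;http://gnocchi.xyz/gnocchiclient&quot;&gt;SDK&lt;/a&gt; wraps these same endpoints behind a friendlier Python interface.&lt;/p&gt;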
&lt;p&gt;You can check the slides below and watch &lt;a href=&quot;https://video.fosdem.org/2017/UD2.120/storing_metrics_gnocchi.mp4&quot;&gt;the video of the talk&lt;/a&gt;.&lt;/p&gt;
</content:encoded><category>talks</category><category>python</category><category>gnocchi</category></item><item><title>Gnocchi 3.1 unleashed</title><link>https://julien.danjou.info/blog/gnocchi-3-1-unleashed/</link><guid isPermaLink="true">https://julien.danjou.info/blog/gnocchi-3-1-unleashed/</guid><description>It&apos;s always difficult to know when to release, and we really wanted to do it earlier. But it seems that each week more awesome work was being done in Gnocchi, so we kept delaying it while having no.</description><pubDate>Thu, 02 Feb 2017 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;It&apos;s always difficult to know when to release, and we really wanted to do it earlier. But it seems that each week more awesome work was being done in &lt;a href=&quot;http://gnocchi.xyz&quot;&gt;Gnocchi&lt;/a&gt;, so we kept delaying it while having no pressure to push it out.&lt;/p&gt;
&lt;p&gt;But now that the OpenStack cycle is finishing, even if Gnocchi does not strictly follow it, it seemed to be a good time to cut the leash and let this release out.&lt;/p&gt;
&lt;p&gt;There are again some major changes coming from 3.0. The previous version 3.0 was tagged in October and had 90 changes merged from 13 authors since 2.2. This 3.1 version has 200 changes merged from 24 different authors. This is a great improvement in our contributor base and our rate of change – even though our time to merge remains very low. Once again, we pushed the usage of release notes to document user-visible changes, and &lt;a href=&quot;http://gnocchi.xyz/releasenotes/3.1.html&quot;&gt;they can be read online&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Therefore, I am going to quickly summarize the major changes:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;The REST API authentication mechanism has been modularized. It&apos;s now simple to provide any authentication mechanism for Gnocchi as a plugin. The default is now an HTTP basic authentication mechanism that does not implement any kind of enforcement. The &lt;a href=&quot;http://docs.openstack.org/developer/keystone/&quot;&gt;Keystone&lt;/a&gt; authentication is still available, obviously.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Batching has been improved and can now create metrics on the fly, reducing the latency needed when pushing measures to non-existing metrics. This is leveraged by the &lt;a href=&quot;https://github.com/gnocchixyz/collectd-gnocchi&quot;&gt;collectd-gnocchi&lt;/a&gt; plugin, for example.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;The performance of the Carbonara-based back-ends has been largely improved. This is not really listed as a change as it&apos;s not user-visible, but an amazing amount of profiling and rewriting of code from &lt;a href=&quot;http://pandas.pydata.org/&quot;&gt;Pandas&lt;/a&gt; to &lt;a href=&quot;http://www.numpy.org/&quot;&gt;NumPy&lt;/a&gt; has been done. While Pandas is very developer-friendly and generic, using NumPy directly offers much more performance and should decrease &lt;code&gt;gnocchi-metricd&lt;/code&gt; CPU usage by a large factor.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;The storage has been split into two parts: the storage of incoming new measures to be processed, and the storage and archival of aggregated metrics. This makes it possible to use, e.g., the file driver to store new measures being sent, and once processed, store them into, e.g., Ceph. Before that change, all the new measures had to go into Ceph. While there&apos;s no specific driver yet for incoming measures, it&apos;s easy to envision a driver for systems like &lt;a href=&quot;https://redis.io&quot;&gt;Redis&lt;/a&gt; or &lt;a href=&quot;https://memcached.org&quot;&gt;Memcached&lt;/a&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;A new &lt;a href=&quot;https://aws.amazon.com/s3/&quot;&gt;Amazon S3&lt;/a&gt; driver has been merged. It works in the same way as the file or &lt;a href=&quot;http://docs.openstack.org/developer/swift/&quot;&gt;OpenStack Swift&lt;/a&gt; drivers.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
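&lt;p&gt;To illustrate the kind of work the NumPy rewrite enables, here is a toy aggregation – not Gnocchi&apos;s actual code – that buckets ordered points into fixed-granularity windows and averages them using only vectorized NumPy primitives:&lt;/p&gt;

```python
import numpy as np

# Toy version of a Carbonara-style aggregation: bucket raw, ordered
# (timestamp, value) points into fixed-granularity windows, then average.
timestamps = np.array([0, 10, 20, 70, 80, 130], dtype=np.int64)  # seconds
values = np.array([1.0, 2.0, 3.0, 4.0, 6.0, 9.0])
granularity = 60  # one-minute buckets

buckets = timestamps // granularity
# np.unique gives the first index of each bucket; reduceat sums each run.
keys, starts = np.unique(buckets, return_index=True)
sums = np.add.reduceat(values, starts)
counts = np.diff(np.append(starts, len(values)))
means = sums / counts
print(dict(zip((keys * granularity).tolist(), means.tolist())))
```

&lt;p&gt;No Python-level loop is involved: everything runs inside NumPy, which is where the CPU savings come from compared to a more generic Pandas pipeline.&lt;/p&gt;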
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/gnocchi-logo-old.jpg&quot; alt=&quot;Gnocchi logo&quot; /&gt;&lt;/p&gt;
&lt;p&gt;I will write more about some of these new features in the upcoming weeks, as they are very interesting for Gnocchi&apos;s users.&lt;/p&gt;
&lt;p&gt;We are planning to run a scalability test and benchmarks using the &lt;a href=&quot;http://scalelab.redhat.com/&quot;&gt;ScaleLab&lt;/a&gt; in a few weeks, if everything goes as planned. I will obviously share the results here, but we also submitted a talk for the next &lt;a href=&quot;https://www.openstack.org/summit/boston-2017/&quot;&gt;OpenStack Summit in Boston&lt;/a&gt; to present the results of our scalability and performance tests – hoping the session will be accepted.&lt;/p&gt;
&lt;p&gt;I will also be talking about Gnocchi &lt;a href=&quot;https://fosdem.org/2017/schedule/event/storing_metrics_gnocchi/&quot;&gt;this Sunday at FOSDEM&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;We don&apos;t have a very firm roadmap for Gnocchi for the next few weeks. Sure, we have a few ideas about what we want to implement, but we are also very easily influenced by the requests of our users: therefore, feel free to ask for anything!&lt;/p&gt;
</content:encoded><category>gnocchi</category></item><item><title>Attending OpenStack Summit Ocata</title><link>https://julien.danjou.info/blog/openstack-summit-ocata-barcelona-review/</link><guid isPermaLink="true">https://julien.danjou.info/blog/openstack-summit-ocata-barcelona-review/</guid><description>For the last time in 2016, I flew out to the OpenStack Summit in Barcelona, where I had the chance to meet (again) a lot of my fellow OpenStack contributors there.</description><pubDate>Mon, 31 Oct 2016 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;For the last time in 2016, I flew out to the &lt;a href=&quot;https://www.openstack.org/summit/barcelona-2016/&quot;&gt;OpenStack Summit in Barcelona&lt;/a&gt;, where I had the chance to meet (again) a lot of my fellow OpenStack contributors there.&lt;/p&gt;
&lt;h2&gt;How To Work Upstream with OpenStack&lt;/h2&gt;
&lt;p&gt;My week started by giving a talk about &lt;em&gt;How To Work Upstream with OpenStack&lt;/em&gt;, where, accompanied by Ryota and Ashiq, I explained to the audience how to contribute upstream to OpenStack. It went well and was well received by the public – you can watch the video below or download the slides.&lt;/p&gt;
&lt;h2&gt;Python 3 in telemetry projects&lt;/h2&gt;
&lt;p&gt;I&apos;ve attended a few interesting cross-project sessions, which helped me prioritize my work for the next few months.&lt;/p&gt;
&lt;p&gt;The Python 3 porting effort has been blocked for a while in Nova and Swift for various (mostly non-technical) reasons, while almost all other projects are working correctly. On the other hand, we have committed the telemetry projects to be the first ones to drop Python 2 support as soon as possible. The next steps are to make sure downstream is ready and to enable functional testing in devstack with Python 3.&lt;/p&gt;
&lt;h2&gt;Ceilometer deprecation&lt;/h2&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/gordon-gnocchi-talk.jpg&quot; alt=&quot;gordon-gnocchi-talk&quot; /&gt;&lt;/p&gt;
&lt;p&gt;The Ceilometer sessions were really interesting, as we mainly discussed deprecating and removing old cruft that is not, or should not be, used anymore. The main change will be the deprecation of the Ceilometer API. It has been clear for more than a year that &lt;a href=&quot;http://gnocchi.xyz&quot;&gt;Gnocchi&lt;/a&gt; is the way to go to store and provide access to metrics, but we failed at announcing that widely. A lot of the people I talked to during the summit were not aware that the Ceilometer API was not a good pick anymore, and that Gnocchi was now the recommended storage backend. Bad communication from our side – but we are going to fix it as of now.&lt;/p&gt;
&lt;p&gt;We also committed to simplifying the current architecture by removing the collector, which has now been made obsolete by the agent-based architecture that was implemented during the last development cycles.&lt;/p&gt;
&lt;h2&gt;Aodh alarm timeout&lt;/h2&gt;
&lt;p&gt;We had a feature proposal in Aodh for a while that we postponed for too long already: having a timeout triggered after some expected events have not been seen for a while. This seems to be a functionality requested by NFV users – something we want Aodh to cover.&lt;/p&gt;
&lt;p&gt;We spent some time discussing this feature, and now that we all have a clear understanding of the use case, we&apos;ll work on having a clear path to the implementation.&lt;/p&gt;
&lt;p&gt;I&apos;ve also attended a session with the &lt;a href=&quot;https://wiki.openstack.org/wiki/Vitrage&quot;&gt;Vitrage&lt;/a&gt; developers in order to discuss how we could work better together, as they rely on Aodh. It seems there might be some convergence in the future, which would be very welcome. Wait and see.&lt;/p&gt;
&lt;h2&gt;Gnocchi improvement, past and future&lt;/h2&gt;
&lt;p&gt;The Gnocchi session ran smoothly, and everyone seemed happy with the work we have done so far. We&apos;ve made some impressive improvements in Gnocchi 3.0 – as &lt;a href=&quot;https://julien.danjou.info/blog/2016/gnocchi-3.0-release&quot;&gt;I already covered previously&lt;/a&gt; – and Gordon Chung presented a short talk about the performance difference measured while working on this new version of Gnocchi:&lt;/p&gt;
&lt;p&gt;The return of the InfluxDB driver is on the table, as Sam Morrison proposed a patch for it a while back. While it&apos;s not as fast and scalable as the other drivers, it offers a good alternative for people who have to use it.&lt;/p&gt;
&lt;p&gt;Leandro Reox presented how to do capacity planning using Ceilometer and Gnocchi, presenting the projects at the same time:&lt;/p&gt;
&lt;p&gt;It is pretty impressive to see what they achieved with this project, and I&apos;m looking forward to being able to check how it works inside.&lt;/p&gt;
&lt;h2&gt;PTG and beyond&lt;/h2&gt;
&lt;p&gt;The next meeting is supposed to be the new &lt;a href=&quot;https://www.openstack.org/ptg/&quot;&gt;OpenStack PTG&lt;/a&gt; in February in Atlanta, though we did not request any specific space there. While the team loves seeing each other face-to-face every few months, we managed to follow &lt;a href=&quot;https://julien.danjou.info/blog/foss-projects-management-bad-practice&quot;&gt;all of the guidelines I listed recently&lt;/a&gt; on good open source project management, meaning we are able to work very well asynchronously and remotely. There is no need to put hard requirements on people wanting to participate in our community. Nevertheless, I expect the cross-project discussions that will happen there to still concern the OpenStack Telemetry projects.&lt;/p&gt;
&lt;p&gt;In the end, we&apos;re all very happy with our past and future roadmaps and I&apos;m looking forward to achieving our next big milestones with our amazing telemetry team!&lt;/p&gt;
</content:encoded><category>openstack</category><category>gnocchi</category><category>open-source</category><category>talks</category></item><item><title>Gnocchi 3.0 release</title><link>https://julien.danjou.info/blog/gnocchi-3-0-release/</link><guid isPermaLink="true">https://julien.danjou.info/blog/gnocchi-3-0-release/</guid><description>After a few weeks of hard work with the team, here is the new major version of Gnocchi, stamped 3.0.0. It was very challenging, as we wanted to implement a few big changes in it.</description><pubDate>Mon, 03 Oct 2016 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;After a few weeks of hard work with the team, here is the new major version of Gnocchi, stamped &lt;a href=&quot;https://launchpad.net/gnocchi/3.0/3.0.0&quot;&gt;3.0.0&lt;/a&gt;. It was very challenging, as we wanted to implement a few big changes in it.&lt;/p&gt;
&lt;p&gt;Gnocchi is now using &lt;a href=&quot;http://docs.openstack.org/developer/reno/&quot;&gt;reno&lt;/a&gt; to its maximum and you can read &lt;a href=&quot;http://gnocchi.xyz/releasenotes/3.0.html&quot;&gt;the release notes of the 3.0 branch&lt;/a&gt; online. Some notes might be missing as it is our first release with it, but we are making good progress at writing changelogs for most of our user facing and impacting changes.&lt;/p&gt;
&lt;p&gt;Therefore, I&apos;ll only write here about our big major feature that made us bump the major version number.&lt;/p&gt;
&lt;h2&gt;New storage engine&lt;/h2&gt;
&lt;p&gt;And so the most interesting thing that went in the 3.0 release, is the new storage engine that has been built by me and Gordon Chung during those last months. The original approach of writing data in Gnocchi was really naive, so we had an iterative improvement process since version 1.0, and we&apos;re getting close to something very solid.&lt;/p&gt;
&lt;p&gt;This new version leverages several important features which increase performance by a large factor on Ceph, our recommended back-end (using &lt;code&gt;write(offset)&lt;/code&gt; rather than &lt;code&gt;read()+write()&lt;/code&gt; to append new points).&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/gnocchi3_processtime_readwrite_vs_offset.png&quot; alt=&quot;gnocchi3_processtime_readwrite_vs_offset&quot; /&gt;&lt;/p&gt;
&lt;p&gt;To summarize: since most data points are sent sequentially and ordered, we enhanced the data format to take advantage of that fact, so new points can be appended without reading anything back. That only works on Ceph though, which provides the needed features.&lt;/p&gt;
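&lt;p&gt;A toy sketch of that idea, using a plain file as a stand-in for Ceph&apos;s &lt;code&gt;write(offset)&lt;/code&gt;: because records have a fixed size and arrive ordered, the write offset can be computed instead of having to read the existing data first:&lt;/p&gt;

```python
import os
import struct
import tempfile

# Toy illustration of the append-only idea: the next write offset is just
# a function of how many points are already stored, so nothing has to be
# read back. Ceph exposes this pattern via write(offset); a plain file
# stands in for it here.
POINT = struct.Struct("=dd")  # fixed-size (timestamp, value) record

def append_points(fd, count, points):
    """Append points at a computed offset without reading existing data."""
    data = b"".join(POINT.pack(ts, v) for ts, v in points)
    os.pwrite(fd, data, count * POINT.size)
    return count + len(points)

fd, path = tempfile.mkstemp()
n = append_points(fd, 0, [(1.0, 10.0), (2.0, 20.0)])
n = append_points(fd, n, [(3.0, 30.0)])
raw = os.pread(fd, n * POINT.size, 0)
points = [POINT.unpack_from(raw, i * POINT.size) for i in range(n)]
os.close(fd)
os.remove(path)
print(points)
```

&lt;p&gt;The real format is more involved, of course, but this is the core of why the &lt;code&gt;read()+write()&lt;/code&gt; round-trip could be dropped.&lt;/p&gt;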
&lt;p&gt;We also enabled data compression on all storage drivers by enabling LZ4 compression (&lt;a href=&quot;https://julien.danjou.info/blog/gnocchi-carbonara-timeseries-compression&quot;&gt;see my previous article and research on the subject&lt;/a&gt;), which obviously offers its own set of challenges when using append-only write. The results are tremendous and decrease data usage by a huge factor:&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/gnocchi3_disksize.png&quot; alt=&quot;gnocchi3_disksize&quot; /&gt;&lt;/p&gt;
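&lt;p&gt;The intuition behind those numbers is easy to reproduce: metric data is highly repetitive, so packed timeseries compress very well. The snippet below is only an illustration using zlib from the standard library – Gnocchi itself uses LZ4, which trades a bit of compression ratio for much higher speed:&lt;/p&gt;

```python
import struct
import zlib

# One hour of one-second measures with typical metric behavior (small,
# repetitive variations). Gnocchi compresses its timeseries with LZ4;
# zlib is used here only as a stdlib stand-in to show the size win.
values = [20.0 + (i % 5) * 0.1 for i in range(3600)]
raw = struct.pack("=%dd" % len(values), *values)
packed = zlib.compress(raw)
print(len(raw), len(packed))  # the compressed form is far smaller
```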
&lt;p&gt;The rest of the processing pipeline also has been largely improved:&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/gnocchi3_processtime_post.png&quot; alt=&quot;gnocchi3_processtime_post&quot; /&gt;&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/gnocchi3_processtime_compress_offset.png&quot; alt=&quot;gnocchi3_processtime_compress_offset&quot; /&gt;&lt;/p&gt;
&lt;p&gt;Overall, we&apos;re delighted with the performance improvement we achieved, and we&apos;re looking forward to making even more progress. Gnocchi is now one of the best-performing and most scalable time series databases out there.&lt;/p&gt;
&lt;h2&gt;Upcoming challenges&lt;/h2&gt;
&lt;p&gt;With that big change done, we&apos;re now heading toward a set of more lightweight improvements. Our &lt;a href=&quot;https://bugs.launchpad.net/gnocchi&quot;&gt;bug tracker&lt;/a&gt; is a good place to learn what might be on our mind (check for the &lt;em&gt;wishlist&lt;/em&gt; bugs).&lt;/p&gt;
&lt;p&gt;Improving our API features and offering a better experience for those coming from outside the realm of OpenStack are now at the top of my priority list.&lt;/p&gt;
&lt;p&gt;But obviously, let me know if there&apos;s any itch you need scratched. 😎&lt;/p&gt;
</content:encoded><category>gnocchi</category><category>openstack</category></item><item><title>A retrospective of the OpenStack Telemetry project Newton cycle</title><link>https://julien.danjou.info/blog/retrospective-openstack-telemetry-newton/</link><guid isPermaLink="true">https://julien.danjou.info/blog/retrospective-openstack-telemetry-newton/</guid><description>A few weeks ago, I recorded an interview with Krishnan Raghuram about what was discussed for this development cycle for OpenStack Telemetry at the Austin summit.  It&apos;s interesting to look back at this</description><pubDate>Mon, 05 Sep 2016 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;A few weeks ago, I recorded an interview with Krishnan Raghuram about what was discussed for this development cycle for OpenStack Telemetry at the Austin summit.&lt;/p&gt;
&lt;p&gt;It&apos;s interesting to look back at this video more than 3 months after recording it, and see what actually happened to Telemetry. It turns out that some of the things I thought were going to happen did not happen yet. As the first release candidate version is approaching, it&apos;s very unlikely that they will.&lt;/p&gt;
&lt;p&gt;And on the other side, some new fancy features arrived suddenly without me having a clue about them.&lt;/p&gt;
&lt;p&gt;As far as &lt;strong&gt;Ceilometer&lt;/strong&gt; is concerned, here&apos;s the list of what really happened in terms of user features:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Added full support for SNMP v3 USM model&lt;/li&gt;
&lt;li&gt;Added support for batch measurement in Gnocchi dispatcher&lt;/li&gt;
&lt;li&gt;Set ended_at timestamp in Gnocchi dispatcher&lt;/li&gt;
&lt;li&gt;Allow Swift pollster to specify regions&lt;/li&gt;
&lt;li&gt;Add L3 cache usage and memory bandwidth meters&lt;/li&gt;
&lt;li&gt;Split out the event code (REST API and storage) to a new &lt;strong&gt;Panko&lt;/strong&gt; project&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;And a few other minor things. I planned none of them except Panko (which I was responsible for), and the ones we planned (documentation update, pipeline rework and polling enhancement) did not happen yet.&lt;/p&gt;
&lt;p&gt;For &lt;strong&gt;Aodh&lt;/strong&gt;, we expected to rework the documentation entirely too, and that did not happen either. What we did instead:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Deprecate and disable combination alarms&lt;/li&gt;
&lt;li&gt;Add pagination support in REST API&lt;/li&gt;
&lt;li&gt;Deprecate all non-SQL database stores and provide a migration tool&lt;/li&gt;
&lt;li&gt;Support batch notification for aodh-notifier&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;It&apos;s definitely a good list of new features for Aodh – still small, but one that simplifies the project, removes technical debt, and continues building momentum around it.&lt;/p&gt;
&lt;p&gt;For &lt;strong&gt;Gnocchi&lt;/strong&gt;, we really had no plan, except maybe a few small features (they&apos;re usually tracked in the Launchpad bug list). It turned out we had a fancy new idea with Gordon Chung on how to boost our storage engine, so we worked on that. It kept us busy for a few weeks in the end, but the preliminary results look tremendous – so it was definitely worth it. We also have an AWS S3 storage driver on its way.&lt;/p&gt;
&lt;p&gt;I find this exercise interesting, as it really emphasizes how you can&apos;t really control what&apos;s happening in any open source project, where your contributors come and go and work on their own agenda.&lt;/p&gt;
&lt;p&gt;That does not mean we&apos;re dropping the themes and ideas I&apos;ve laid out in that video. We&apos;re still pushing our &quot;documentation is mandatory&quot; policy and improving our &quot;work by default&quot; scenario. It&apos;s just a longer road than we expected.&lt;/p&gt;
</content:encoded><category>openstack</category><category>gnocchi</category></item><item><title>Gnocchi talk at the Paris Monitoring Meetup #6</title><link>https://julien.danjou.info/blog/paris-monitoring-6-gnocchi/</link><guid isPermaLink="true">https://julien.danjou.info/blog/paris-monitoring-6-gnocchi/</guid><description>Last week was the sixth edition of the Paris Monitoring Meetup, where I was invited as a speaker to present and talk about Gnocchi.</description><pubDate>Fri, 27 May 2016 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Last week was the sixth edition of the &lt;a href=&quot;http://www.meetup.com/Paris-Monitoring/events/230515751/&quot;&gt;Paris Monitoring Meetup&lt;/a&gt;, where I was invited as a speaker to present and talk about &lt;a href=&quot;http://gnocchi.xyz&quot;&gt;Gnocchi&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/paris-monitoring.png&quot; alt=&quot;paris-monitoring&quot; /&gt;&lt;/p&gt;
&lt;p&gt;There were around 50 people in the room, listening to my presentation of Gnocchi.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/jd-gnocchi-paris-monitoring-meetup-6.jpg&quot; alt=&quot;jd-gnocchi-paris-monitoring-meetup-6&quot; /&gt;&lt;/p&gt;
&lt;p&gt;The talk went fine and I got a few interesting questions and some feedback. One interesting point that keeps coming up when talking about Gnocchi is its OpenStack label, which scares away a lot of people. We definitely need to keep explaining that the project works stand-alone and has no dependency on OpenStack, just a great integration with it.&lt;/p&gt;
&lt;p&gt;The &lt;a href=&quot;http://www.monitoring-fr.org/&quot;&gt;Monitoring-fr&lt;/a&gt; organization also &lt;a href=&quot;http://www.monitoring-fr.org/2016/05/meetup-paris-monitoring-6-interview-de-julien-danjou-pour-gnocchi-metric-as-a-service/&quot;&gt;interviewed me&lt;/a&gt; after the meetup about Gnocchi. The interview is in French, obviously. I talk about Gnocchi, what it does, how it does it and why we started the project a couple of years ago. Enjoy, and let me know what you think!&lt;/p&gt;
</content:encoded><category>talks</category><category>monitoring</category><category>gnocchi</category><category>openstack</category></item><item><title>OpenStack Summit Newton from a Telemetry point of view</title><link>https://julien.danjou.info/blog/openstack-summit-newton-austin-telemetry/</link><guid isPermaLink="true">https://julien.danjou.info/blog/openstack-summit-newton-austin-telemetry/</guid><description>It&apos;s again that time of the year, where we all fly out to a different country to chat about OpenStack and what we&apos;ll do during the next 6 months.</description><pubDate>Mon, 02 May 2016 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;It&apos;s again that time of the year, where we all fly out to a different country to chat about OpenStack and what we&apos;ll do during the next 6 months. This time, it was in &lt;a href=&quot;https://en.wikipedia.org/wiki/Austin,_Texas&quot;&gt;Austin, TX&lt;/a&gt; and we chatted about the new Newton release that will be out in October.&lt;/p&gt;
&lt;p&gt;As the &lt;em&gt;Project Team Leader&lt;/em&gt; for the Telemetry project, I organized and led the week for our team. We had 9 discussion slots of 40 minutes assigned, but in the end only used 8. We also, somehow, canceled the contributor team meet-up on the last day, as only a few of us developers were there and available.&lt;/p&gt;
&lt;p&gt;We took &lt;a href=&quot;https://wiki.openstack.org/wiki/Design_Summit/Newton/Etherpads#Telemetry&quot;&gt;a few notes in our Etherpads&lt;/a&gt;, but I think most of them are pretty sparse, as there was nothing really important that we talked about. Actually, many topics were already discussed and covered 6 months ago in Tokyo during the previous summit. We just did not have time to implement everything we wanted, so talking it over again would not have been of great help.&lt;/p&gt;
&lt;h2&gt;Reference architecture&lt;/h2&gt;
&lt;p&gt;Unfortunately, neither Gordon Chung nor the &lt;a href=&quot;https://osic.org/&quot;&gt;OpenStack Innovation Center&lt;/a&gt; had time to run the tests and benchmarks they wanted to run before the summit. We still discussed their plan to run tests and benchmarks of the whole Telemetry suite (Ceilometer, Gnocchi &amp;amp; Aodh). They should run their tests, for 3 weeks at most, in a few weeks. The window to run tests being narrow, they want to be sure they are prepared, and will reach out to us for help, ideas, and validation.&lt;/p&gt;
&lt;p&gt;I&apos;ve also asked them to, if possible, provide us some profiling (e.g. cProfile) data so we can get better knowledge of the areas we can optimize.&lt;/p&gt;
&lt;h2&gt;Gnocchi, next steps&lt;/h2&gt;
&lt;p&gt;This session went particularly smoothly, since most people in the room were not up-to-date with Gnocchi 2.1. Some people expressed concern about the InfluxDB driver removal, though they were not aware of the bugs it had, nor that Gnocchi was actually performing better – so they will very likely test Gnocchi directly instead.&lt;/p&gt;
&lt;p&gt;No particular fancy feature was requested, only a few bugs and ideas noted on Launchpad were discussed.&lt;/p&gt;
&lt;h2&gt;Enhancing Ceilometer polling&lt;/h2&gt;
&lt;p&gt;This session was not particularly productive, as everything we wanted to discuss was already on the Etherpad from… Tokyo, 6 months ago. It turns out nobody had time to pursue this project, so we&apos;ll see what happens. There&apos;s definitely some work to do to pursue our goal of splitting the pipeline definition into smaller files.&lt;/p&gt;
&lt;h2&gt;Aodh roadmap &amp;amp; improvements&lt;/h2&gt;
&lt;p&gt;First, we decided to definitely kill combination alarms in the future, in favor of the new composite alarm definitions that we like better.&lt;/p&gt;
&lt;p&gt;We should switch to &lt;a href=&quot;http://docs.openstack.org/developer/python-openstackclient/&quot;&gt;OpenStackClient&lt;/a&gt; in the future for &lt;a href=&quot;http://docs.openstack.org/developer/python-aodhclient/&quot;&gt;aodhclient&lt;/a&gt;. The OSC team indicated they are willing to provide a way to keep the &quot;aodh&quot; CLI command on its own, which is the thing that blocked us from moving to OSC.&lt;/p&gt;
&lt;p&gt;A bunch of people indicated interest in having support for alarm CRUD in the Horizon dashboard. They should work together with the Horizon team to complete what has been started in Horizon recently to add Aodh support.&lt;/p&gt;
&lt;h2&gt;Ceilometer splitting&lt;/h2&gt;
&lt;p&gt;A year ago, we decided to split Ceilometer and its alarm feature: Aodh was born. We did discuss doing it again 6 months ago, but nothing happened as we already had so much on our plate.&lt;/p&gt;
&lt;p&gt;As far as I&apos;m concerned, I think it&apos;s now time to split some Ceilometer functionality again, so I&apos;m going to do that this time with the event part. Gordon found a name, and this new project will be named &lt;em&gt;Panko&lt;/em&gt;.&lt;/p&gt;
&lt;h2&gt;Documentation&lt;/h2&gt;
&lt;p&gt;We then discussed our documentation. Users present in the room were particularly happy with the Gnocchi policy that we have applied since the beginning: no doc = no merge of your patch. The consensus is to move forward with this policy for all Telemetry projects, especially since it&apos;s now clear that the documentation team is not going to help us more. Ildikó, our documentation wizard, will take care of making some links between the official OpenStack documentation and our projects, avoiding content duplication.&lt;/p&gt;
&lt;p&gt;For this cycle, my personal plan is to document Aodh up to roughly 80 %, and then enforce that policy on newly implemented changes.&lt;/p&gt;
&lt;h2&gt;Events management&lt;/h2&gt;
&lt;p&gt;The event management part of Ceilometer and its API (soon to be split into its own project, as stated above) was discussed in this session. Nothing really exciting is happening here, as nobody is willing to enhance it for now. Which, again, makes it a great candidate for splitting out of Ceilometer.&lt;/p&gt;
&lt;h2&gt;Vitrage&lt;/h2&gt;
&lt;p&gt;The last session was dedicated to &lt;a href=&quot;https://wiki.openstack.org/wiki/Vitrage&quot;&gt;Vitrage&lt;/a&gt;, a root cause analysis tool built on OpenStack. The Vitrage team had a few features that they wanted to see in Aodh, so we discussed them at length. Notably, more support for sending notifications on events (alarm creation, deletion…) should be added in the next release.&lt;/p&gt;
&lt;p&gt;Also, a new alarm type that would be entirely managed and triggered over HTTP would be very useful for external projects such as Vitrage. We&apos;ll try to make that happen during this cycle too.&lt;/p&gt;
&lt;h2&gt;Talks&lt;/h2&gt;
&lt;p&gt;There were a few interesting talks about our telemetry projects during this summit, among other I highly recommend watching:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://www.youtube.com/watch?v=W5KT5GJKJw8&quot;&gt;OpenStack Ceilometer with Gnocchi and Aodh Feature&lt;/a&gt;, where Amol and Paul from Ericsson explain what Gnocchi and Aodh do and how they work, and then help people deploy it on their lab.&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.youtube.com/watch?v=BdebhsBFEJs&quot;&gt;DPDK, Collectd &amp;amp; Ceilometer The Missing Link&lt;/a&gt;, where Ryota Mibu, one of the contributors to Aodh, explains why he implemented the event alarm feature&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.youtube.com/watch?v=-K8NI38LPtU&quot;&gt;Showback &amp;amp; Chargeback!! OpenStack Gnocchi + Cloudkitty as a Whole Billing System&lt;/a&gt;, where Maximiliano Venesio (Nubeliu) and Stéphane Albert (Objectif Libre) talk about how they built an amazing scalable billing solution using &lt;a href=&quot;http://gnocchi.xyz&quot;&gt;Gnocchi&lt;/a&gt; and &lt;a href=&quot;https://wiki.openstack.org/wiki/CloudKitty&quot;&gt;CloudKitty&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.youtube.com/watch?v=0Q8pfbwxMb8&quot;&gt;Using Ceilometer Data for Effective Witch-Hunting&lt;/a&gt;, where Mike explains how Overstock.com leveraged Ceilometer to track anomalies in their cloud.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;All of this should keep me and the team busy for the next cycle. If you have any question about what has been discussed or the future of our projects, don&apos;t hesitate to leave a comment or ask us on the &lt;a href=&quot;http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev&quot;&gt;OpenStack development mailing list&lt;/a&gt;.&lt;/p&gt;
</content:encoded><category>openstack</category><category>gnocchi</category></item><item><title>Gnocchi 2.1 release</title><link>https://julien.danjou.info/blog/gnocchi-2-1-release/</link><guid isPermaLink="true">https://julien.danjou.info/blog/gnocchi-2-1-release/</guid><description>A little less than 2 months after our latest major release, here is the new minor version of Gnocchi, stamped 2.1.0.</description><pubDate>Wed, 13 Apr 2016 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;A little less than 2 months after our latest major release, here is the new minor version of Gnocchi, stamped &lt;a href=&quot;https://launchpad.net/gnocchi/2.1/2.1.0&quot;&gt;2.1.0&lt;/a&gt;. It was a smooth release, but with one major feature implemented by my fellow fantastic developer Mehdi Abaakouk: the ability to create resource types dynamically.&lt;/p&gt;
&lt;h2&gt;Resource types REST API&lt;/h2&gt;
&lt;p&gt;This new version of Gnocchi offers the long-awaited ability to create resource types dynamically. What does that mean? Well, until version 2.0, the resources that you were able to create in Gnocchi had a particular type that was defined in the code: instance, volume, SNMP host, Swift account, etc. All of them were tied to OpenStack, since it was our primary use case.&lt;/p&gt;
&lt;p&gt;Now, &lt;a href=&quot;http://gnocchi.xyz/rest.html#resource-types&quot;&gt;the API allows creating resource types dynamically&lt;/a&gt;. This means you can create your own custom types to describe your own architecture. You can then use the same features that were offered before: history of your resources, searching through them, associating metrics, and so on!&lt;/p&gt;
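&lt;p&gt;For illustration, here is what building such a resource-type definition might look like client-side. This is only a sketch: the &lt;code&gt;compute_host&lt;/code&gt; name and the attribute fields are made up for the example, so check the REST API documentation linked above for the exact schema.&lt;/p&gt;

```python
import json

# Hypothetical resource-type definition; the name and attribute
# fields are illustrative, not taken from the Gnocchi documentation.
resource_type = {
    "name": "compute_host",
    "attributes": {
        "display_name": {"type": "string", "required": True},
        "rack": {"type": "string", "required": False},
    },
}

# Serialized body for a resource-type creation request.
body = json.dumps(resource_type)
```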
&lt;h2&gt;Performance improvements&lt;/h2&gt;
&lt;p&gt;We did some profiling and benchmarking on Gnocchi and, with the help of my fellow developer Gordon Chung, improved metric performance.&lt;/p&gt;
&lt;p&gt;The API speed improved a bit, and I&apos;ve measured the Gnocchi API endpoint as being able to ingest up to 190k measures/s with only one node (the same as used in my &lt;a href=&quot;https://julien.danjou.info/blog/gnocchi-benchmarks&quot;&gt;previous benchmark&lt;/a&gt;) using &lt;a href=&quot;https://uwsgi-docs.readthedocs.org/&quot;&gt;uwsgi&lt;/a&gt;, a 50 % improvement. The time required to compute aggregations on new measures is now also metered and displayed in the &lt;code&gt;gnocchi-metricd&lt;/code&gt; log in debug mode, which is handy for getting an idea of how fast your measures are processed.&lt;/p&gt;
&lt;h2&gt;Ceph backend optimization&lt;/h2&gt;
&lt;p&gt;The Ceph back-end has been improved again by Mehdi. We&apos;re now relying on OMAP rather than xattr for finer-grained control and better performance.&lt;/p&gt;
&lt;p&gt;We already have a few new features being prepared for our next release, so stay tuned! And if you have any suggestion, feel free to say a word.&lt;/p&gt;
</content:encoded><category>gnocchi</category><category>openstack</category></item><item><title>Gnocchi 2.0 release</title><link>https://julien.danjou.info/blog/gnocchi-2-0-release/</link><guid isPermaLink="true">https://julien.danjou.info/blog/gnocchi-2-0-release/</guid><description>Gnocchi 2.0 is out with major new features including a Grafana datasource, Ceph storage driver, and a revamped REST API.</description><pubDate>Fri, 19 Feb 2016 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;A little more than 3 months after our latest minor release, here is the new major version of Gnocchi, stamped &lt;a href=&quot;https://launchpad.net/gnocchi/2.0/2.0.0&quot;&gt;2.0.0&lt;/a&gt;. It contains a lot of new and exciting features, and I&apos;d like to talk about some of them to celebrate!&lt;/p&gt;
&lt;p&gt;You may notice that this release happens in the middle of the OpenStack release cycle. Indeed, Gnocchi does not follow that 6-month cycle, and we release whenever our code is ready. That forces us to have a more iterative approach, less disruptive for other projects, and allows us to achieve a higher velocity, applying the good old mantra &lt;em&gt;release early, release often&lt;/em&gt;.&lt;/p&gt;
&lt;h2&gt;Documentation&lt;/h2&gt;
&lt;p&gt;This version features a large documentation update. Gnocchi is still the only OpenStack server project that implements a &quot;no doc, no merge&quot; policy, meaning any code must come with the documentation addition or change included in the patch. The full documentation is included in the source code and available online at &lt;a href=&quot;http://gnocchi.xyz/&quot;&gt;gnocchi.xyz&lt;/a&gt;.&lt;/p&gt;
&lt;h2&gt;Data split &amp;amp; compression&lt;/h2&gt;
&lt;p&gt;I&apos;ve already covered this change extensively in &lt;a href=&quot;https://julien.danjou.info/blog/gnocchi-carbonara-timeseries-compression&quot;&gt;my last blog about timeseries compression&lt;/a&gt;. Long story short, Gnocchi now splits timeseries archives in small chunks that are compressed, increasing speed and decreasing data size.&lt;/p&gt;
&lt;h2&gt;Measures batching support&lt;/h2&gt;
&lt;p&gt;Gnocchi now supports batching, which allows submitting several measures for different metrics in a single request. This is especially useful when your application tends to cache metrics for a while and is able to send them in a batch. Usage is &lt;a href=&quot;http://gnocchi.xyz/rest.html#measures-batching&quot;&gt;fully documented for the REST API&lt;/a&gt;.&lt;/p&gt;
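&lt;p&gt;As a sketch of the idea (the metric IDs and the exact document shape below are hypothetical, see the linked documentation for the real format), a client could accumulate measures locally and flush them as one JSON document:&lt;/p&gt;

```python
import json

# Hypothetical batch: measures for two metrics in a single request body.
batch = {
    "aaaaaaaa-0000-0000-0000-000000000001": [
        {"timestamp": "2016-02-19T10:00:00", "value": 42.0},
        {"timestamp": "2016-02-19T10:01:00", "value": 43.5},
    ],
    "aaaaaaaa-0000-0000-0000-000000000002": [
        {"timestamp": "2016-02-19T10:00:00", "value": 3.14},
    ],
}

body = json.dumps(batch)
# Three measures for two metrics travel in one HTTP request
# instead of three separate ones.
total_measures = sum(len(measures) for measures in batch.values())
```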
&lt;h2&gt;Group by support in aggregation&lt;/h2&gt;
&lt;p&gt;One of the most demanded features was the ability to do measure aggregation on resources, using a group-by type query. This is now possible using the &lt;a href=&quot;http://gnocchi.xyz/rest.html#aggregation-across-metrics&quot;&gt;new &lt;code&gt;groupby&lt;/code&gt; parameter to aggregation queries&lt;/a&gt;.&lt;/p&gt;
&lt;h2&gt;Ceph backend optimization&lt;/h2&gt;
&lt;p&gt;We improved the Ceph back-end a lot. Mehdi Abaakouk wrote a new Python binding for Ceph, called &lt;a href=&quot;https://github.com/sileht/pycradox&quot;&gt;Cradox&lt;/a&gt;, that is going to replace the current Python rados module in subsequent Ceph releases. Gnocchi makes use of this new module to speed things up, making the Ceph-based driver much, much faster than before. We also implemented asynchronous data deletion, which improves performance a bit.&lt;/p&gt;
&lt;p&gt;The next step will be to run some new benchmarks &lt;a href=&quot;https://julien.danjou.info/blog/gnocchi-benchmarks&quot;&gt;like I did a few months ago&lt;/a&gt; and compare with the Gnocchi 1.3 series. Stay tuned!&lt;/p&gt;
&lt;/content:encoded&gt;&lt;category&gt;gnocchi&lt;/category&gt;&lt;category&gt;openstack&lt;/category&gt;&lt;/item&gt;&lt;item&gt;&lt;title&gt;Timeseries storage and data compression&lt;/title&gt;&lt;link&gt;https://julien.danjou.info/blog/gnocchi-carbonara-timeseries-compression/&lt;/link&gt;&lt;guid isPermaLink=&quot;true&quot;&gt;https://julien.danjou.info/blog/gnocchi-carbonara-timeseries-compression/&lt;/guid&gt;&lt;description&gt;The first major version of the scalable timeseries database I work on, Gnocchi, was released a few months ago. In this first iteration, it took a rather naive approach to data storage.&lt;/description&gt;&lt;pubDate&gt;Mon, 15 Feb 2016 00:00:00 GMT&lt;/pubDate&gt;&lt;content:encoded&gt;&amp;lt;p&amp;gt;The first major version of the scalable timeseries database I work on, &amp;lt;a href=&amp;quot;http://gnocchi.xyz&amp;quot;&amp;gt;Gnocchi&amp;lt;/a&amp;gt;, was released a few months ago. In this first iteration, it took a rather naive approach to data storage. We had little idea about if and how our distributed back-ends were going to be heavily used, so we stuck to the code of the first proof-of-concept written a couple of years ago.&amp;lt;/p&amp;gt;
&lt;p&gt;Recently, we got more feedback from our users and ran a few &lt;a href=&quot;https://julien.danjou.info/blog/gnocchi-benchmarks&quot;&gt;benchmarks&lt;/a&gt;. That gave us enough input to start investigating improvements to our storage strategy.&lt;/p&gt;
&lt;h2&gt;Data split&lt;/h2&gt;
&lt;p&gt;Up to Gnocchi 1.3, all data for a single metric are stored in a single gigantic file per aggregation method (&lt;em&gt;min&lt;/em&gt;, &lt;em&gt;max&lt;/em&gt;, &lt;em&gt;average&lt;/em&gt;…). This means that the file can grow to several megabytes in size, which makes it slow to manipulate. For the next version of Gnocchi, our first task has been to rework that storage and split the data into smaller parts.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/gnocchi-carbonara-split.png&quot; alt=&quot;gnocchi-carbonara-split&quot; /&gt;&lt;/p&gt;
&lt;p&gt;The diagram above shows how data are organized inside Gnocchi. Until version 1.3, there would have been only one file per aggregation method.&lt;/p&gt;
&lt;p&gt;In the upcoming 2.0 version, Gnocchi will split all these data into smaller parts, where each data split is stored in a file/object. This allows manipulating smaller pieces of data and increases the parallelism of the CRUD operations on the back-end – leading to large speed improvements.&lt;/p&gt;
&lt;p&gt;In order to split timeseries into several chunks, Gnocchi defines a maximum number of N points to keep per chunk, limiting their maximum size. It then defines a hash function that produces a non-unique key for any timestamp, which makes it easy to find in which chunk any timestamp should be stored or retrieved.&lt;/p&gt;
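&lt;p&gt;A minimal sketch of such a keying function (the names and the chunk size below are made up for the example, this is not Gnocchi&apos;s actual code): truncating a timestamp to the start of its chunk span yields the same key for every point of the chunk.&lt;/p&gt;

```python
POINTS_PER_CHUNK = 3600  # hypothetical maximum number of points per chunk

def chunk_key(timestamp, interval, points_per_chunk=POINTS_PER_CHUNK):
    """Return the key of the chunk that owns `timestamp`.

    All timestamps within the same chunk span map to the same key,
    so both reads and writes can locate the right chunk directly,
    without scanning the existing chunks.
    """
    chunk_span = interval * points_per_chunk
    return timestamp - (timestamp % chunk_span)
```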
&lt;h2&gt;Data compression&lt;/h2&gt;
&lt;p&gt;Up to Gnocchi 1.3, the data stored for each metric is simply serialized using &lt;a href=&quot;http://msgpack.org&quot;&gt;msgpack&lt;/a&gt;, a fast and small serialization format. However, this format does not provide any compression. That means that each data point needs 8 bytes for a timestamp (a 64-bit timestamp with nanosecond precision) and 8 bytes for a value (a 64-bit double-precision floating-point number), plus some overhead (extra information and &lt;em&gt;msgpack&lt;/em&gt; itself).&lt;/p&gt;
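&lt;p&gt;Those 16 bytes per point are easy to verify with Python&apos;s &lt;code&gt;struct&lt;/code&gt; module. This only illustrates the sizes involved, it is not the actual &lt;em&gt;msgpack&lt;/em&gt; layout used by Gnocchi:&lt;/p&gt;

```python
import struct

# One uncompressed data point: a 64-bit nanosecond timestamp
# plus a 64-bit double-precision value.
timestamp_ns = 1455000000000000000
value = 42.5

point = struct.pack("<qd", timestamp_ns, value)
# len(point) is 16: 8 bytes of timestamp + 8 bytes of value,
# before any serialization overhead.
```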
&lt;p&gt;After looking around for ways to compress all these measures, I stumbled upon a paper from some &lt;a href=&quot;http://facebook.com&quot;&gt;Facebook&lt;/a&gt; engineers about Gorilla, their in-memory timeseries database, entitled &quot;&lt;em&gt;&lt;a href=&quot;http://www.vldb.org/pvldb/vol8/p1816-teller.pdf&quot;&gt;Gorilla: A Fast, Scalable, In-Memory Time Series Database&lt;/a&gt;&lt;/em&gt;&quot;. For reference, part of this encoding is also used by &lt;a href=&quot;https://docs.influxdata.com/influxdb/v0.9/concepts/storage_engine/&quot;&gt;InfluxDB&lt;/a&gt; in its new storage engine.&lt;/p&gt;
&lt;p&gt;The first technique I implemented is easy enough, and it&apos;s inspired by delta-of-delta encoding. Instead of storing each timestamp for each data point, and since all the data points are aggregated on a regular interval, we store each point as its time difference from the previous point, divided by the interval. For example, the sequence of timestamps &lt;code&gt;timestamps = [41230, 41235, 41240, 41250, 41255]&lt;/code&gt; is encoded as &lt;code&gt;timestamps = [41230, 1, 1, 2, 1], interval = 5&lt;/code&gt;. This allows regular compression algorithms to reduce the size of the integer list using &lt;a href=&quot;https://en.wikipedia.org/wiki/Run-length_encoding&quot;&gt;run-length encoding&lt;/a&gt;.&lt;/p&gt;
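&lt;p&gt;A sketch of that transposition, with hypothetical helper names, reproducing the example above:&lt;/p&gt;

```python
def encode_timestamps(timestamps, interval):
    """Keep the first timestamp, then each gap divided by the interval."""
    deltas = [(b - a) // interval for a, b in zip(timestamps, timestamps[1:])]
    return [timestamps[0]] + deltas

def decode_timestamps(encoded, interval):
    """Rebuild the original timestamps from the encoded list."""
    out = [encoded[0]]
    for delta in encoded[1:]:
        out.append(out[-1] + delta * interval)
    return out
```

&lt;p&gt;With a regular interval, most deltas are 1, which is exactly the kind of repetitive input that run-length encoding collapses well.&lt;/p&gt;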
&lt;p&gt;To actually compress the values, I tried two different algorithms:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&quot;https://en.wikipedia.org/wiki/LZ4_(compression_algorithm)&quot;&gt;LZ4&lt;/a&gt;, a fast compression/decompression algorithm&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;The XOR-based compression scheme described in the Gorilla paper mentioned above – which &lt;a href=&quot;https://gist.github.com/jd/b0aa5cbfa42f4eb23eb9&quot;&gt;I had to implement myself&lt;/a&gt;. For reference, a &lt;a href=&quot;http://golang.org&quot;&gt;Go&lt;/a&gt; implementation also exists in &lt;a href=&quot;https://github.com/dgryski/go-tsz&quot;&gt;go-tsz&lt;/a&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;I then benchmarked these solutions:&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/gnocchi-carbonara-compression-speed.png&quot; alt=&quot;gnocchi-carbonara-compression-speed&quot; /&gt;&lt;/p&gt;
&lt;p&gt;The XOR algorithm implemented in Python is pretty slow compared to LZ4. The truth is that &lt;a href=&quot;https://github.com/steeve/python-lz4&quot;&gt;python-lz4&lt;/a&gt; is fully implemented in C, which makes it fast. I profiled my XOR implementation in Python and discovered that one operation took 20 % of the time: &lt;code&gt;count_lead_and_trail_zeroes&lt;/code&gt;, which is in charge of counting the number of leading and trailing zeroes in a binary number.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/gnocchi-carbonara-xor-profiling.png&quot; alt=&quot;gnocchi-carbonara-xor-profiling&quot; /&gt;&lt;/p&gt;
&lt;p&gt;I tried two Python implementations of the same algorithm (and submitted them to my friend and Python developer &lt;a href=&quot;http://haypo-notes.readthedocs.org/&quot;&gt;Victor Stinner&lt;/a&gt;, by the way).&lt;/p&gt;
&lt;p&gt;The first version, using string search with &lt;code&gt;.index()&lt;/code&gt;, is 10× faster than the second one, which only does integer computation. Ah, Python… As Victor explained, each Python operation is slow and there are a lot of them in the second version, whereas &lt;code&gt;.index()&lt;/code&gt; is implemented in C, really well optimized, and only needs two Python operations.&lt;/p&gt;
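&lt;p&gt;To give the flavor of that string-based trick (a sketch, not the exact code I submitted): format the number as a fixed-width binary string, then let &lt;code&gt;.index()&lt;/code&gt; and &lt;code&gt;.rindex()&lt;/code&gt;, both implemented in C, locate the first and last set bits.&lt;/p&gt;

```python
def count_lead_and_trail_zeroes(value, width=64):
    """Count leading and trailing zero bits of `value` seen as `width` bits."""
    bits = format(value, "0%db" % width)
    try:
        leading = bits.index("1")
    except ValueError:
        return width, width  # value is zero: every bit is a zero
    trailing = (width - 1) - bits.rindex("1")
    return leading, trailing
```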
&lt;p&gt;Finally, I ended up optimizing that code by leveraging &lt;a href=&quot;https://cffi.readthedocs.org/en/latest/&quot;&gt;cffi&lt;/a&gt; to directly use &lt;code&gt;ffsll()&lt;/code&gt; and &lt;code&gt;flsll()&lt;/code&gt;. That decreased the run-time of &lt;code&gt;count_lead_and_trail_zeroes&lt;/code&gt; by 45 %, increasing the speed of the entire XOR compression code by a small 7 %. This is not enough to catch up with LZ4&apos;s speed. At this stage, the only way to achieve high speed would probably be a full C implementation.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/gnocchi-carbonara-compression-size.png&quot; alt=&quot;gnocchi-carbonara-compression-size&quot; /&gt;&lt;/p&gt;
&lt;p&gt;Considering the compression ratio of the different algorithms, they are pretty much identical. The worst-case scenario (random values) for LZ4 compresses down to 9 bytes per data point, whereas XOR can go down to 7.38 bytes per data point. In general, XOR encoding beats LZ4 by 15 %, except for cases where all values are 0 or 1. However, LZ4 is faster than XOR by a factor of 4×-70× depending on the case.&lt;/p&gt;
&lt;p&gt;That means we&apos;ll use LZ4 for data compression in Gnocchi 2.0. It&apos;s possible that we could achieve an equally fast compression/decompression implementation with XOR, but I don&apos;t think it&apos;s worth the effort right now – it&apos;d represent a lot of code to write and to maintain.&lt;/p&gt;
</content:encoded><category>gnocchi</category><category>python</category></item><item><title>Profiling Python using cProfile: a concrete case</title><link>https://julien.danjou.info/blog/guide-to-python-profiling-cprofile-concrete-case-carbonara/</link><guid isPermaLink="true">https://julien.danjou.info/blog/guide-to-python-profiling-cprofile-concrete-case-carbonara/</guid><description>Writing programs is fun, but making them fast can be a pain. Python programs are no exception to that, but the basic profiling toolchain is actually not that complicated to use. Here, I would like to</description><pubDate>Mon, 16 Nov 2015 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Writing programs is fun, but making them fast can be a pain. Python programs are no exception to that, but the basic profiling toolchain is actually not that complicated to use. Here, I would like to show you how you can quickly profile and analyze your Python code to find what part of the code you should optimize.&lt;/p&gt;
&lt;h2&gt;What&apos;s profiling?&lt;/h2&gt;
&lt;p&gt;Profiling a Python program means doing a dynamic analysis that measures the execution time of the program and everything that composes it. That means measuring the time spent in each of its functions. This gives you data about where your program is spending time, and what areas might be worth optimizing.&lt;/p&gt;
&lt;p&gt;It&apos;s a very interesting exercise. Many people focus on local optimizations, such as determining e.g. which of the Python functions &lt;code&gt;range&lt;/code&gt; or &lt;code&gt;xrange&lt;/code&gt; is going to be faster. It turns out that knowing which one is faster may never be an issue in your program, and that the time gained by one of the functions above might not be worth the time you spend researching that, or arguing about it with your colleague.&lt;/p&gt;
&lt;p&gt;Trying to blindly optimize a program without measuring where it is actually spending its time is a useless exercise. Following your guts alone is not always sufficient.&lt;/p&gt;
&lt;p&gt;There are many types of profiling, as there are many things you can measure. In this exercise, we&apos;ll focus on CPU utilization profiling, meaning the time spent by each function executing instructions. Obviously, we could do many more kinds of profiling and optimization, such as memory profiling, which would measure the memory used by each piece of code – something I talk about in &lt;a href=&quot;https://thehackerguidetopython.com&quot;&gt;The Hacker&apos;s Guide to Python&lt;/a&gt;.&lt;/p&gt;
&lt;h2&gt;cProfile&lt;/h2&gt;
&lt;p&gt;Since Python 2.5, Python provides a C module called &lt;em&gt;&lt;a href=&quot;https://docs.python.org/2/library/profile.html&quot;&gt;cProfile&lt;/a&gt;&lt;/em&gt; which has a reasonable overhead and offers a good enough feature set. Basic usage comes down to:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&amp;gt;&amp;gt;&amp;gt; import cProfile
&amp;gt;&amp;gt;&amp;gt; cProfile.run(&apos;2 + 2&apos;)
         2 function calls in 0.000 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    0.000    0.000 &amp;lt;string&amp;gt;:1(&amp;lt;module&amp;gt;)
        1    0.000    0.000    0.000    0.000 {method &apos;disable&apos; of &apos;_lsprof.Profiler&apos; objects}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Though you can also run a script with it, which turns out to be handy:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;$ python -m cProfile -s cumtime lwn2pocket.py
         72270 function calls (70640 primitive calls) in 4.481 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.004    0.004    4.481    4.481 lwn2pocket.py:2(&amp;lt;module&amp;gt;)
        1    0.001    0.001    4.296    4.296 lwn2pocket.py:51(main)
        3    0.000    0.000    4.286    1.429 api.py:17(request)
        3    0.000    0.000    4.268    1.423 sessions.py:386(request)
      4/3    0.000    0.000    3.816    1.272 sessions.py:539(send)
        4    0.000    0.000    2.965    0.741 adapters.py:323(send)
        4    0.000    0.000    2.962    0.740 connectionpool.py:421(urlopen)
        4    0.000    0.000    2.961    0.740 connectionpool.py:317(_make_request)
        2    0.000    0.000    2.675    1.338 api.py:98(post)
       30    0.000    0.000    1.621    0.054 ssl.py:727(recv)
       30    0.000    0.000    1.621    0.054 ssl.py:610(read)
       30    1.621    0.054    1.621    0.054 {method &apos;read&apos; of &apos;_ssl._SSLSocket&apos; objects}
        1    0.000    0.000    1.611    1.611 api.py:58(get)
        4    0.000    0.000    1.572    0.393 httplib.py:1095(getresponse)
        4    0.000    0.000    1.572    0.393 httplib.py:446(begin)
       60    0.000    0.000    1.571    0.026 socket.py:410(readline)
        4    0.000    0.000    1.571    0.393 httplib.py:407(_read_status)
        1    0.000    0.000    1.462    1.462 pocket.py:44(wrapped)
        1    0.000    0.000    1.462    1.462 pocket.py:152(make_request)
        1    0.000    0.000    1.462    1.462 pocket.py:139(_make_request)
        1    0.000    0.000    1.459    1.459 pocket.py:134(_post_request)
[…]
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This prints out all the functions called, with the time spent in each and the number of times they were called.&lt;/p&gt;
&lt;h3&gt;Advanced visualization with KCacheGrind&lt;/h3&gt;
&lt;p&gt;While useful, this output format is very basic and does not make it easy to gain insight into complete programs. For more advanced visualization, I leverage &lt;a href=&quot;https://kcachegrind.github.io/html/Home.html&quot;&gt;KCacheGrind&lt;/a&gt;. If you have done any C programming and profiling these last years, you may have used it, as it is primarily designed as a front-end for &lt;a href=&quot;http://valgrind.org/&quot;&gt;Valgrind&lt;/a&gt;-generated call-graphs.&lt;/p&gt;
&lt;p&gt;In order to use it, you need to generate a &lt;em&gt;cProfile&lt;/em&gt; result file, then convert it to the KCacheGrind format. To do that, I use &lt;em&gt;&lt;a href=&quot;https://pypi.python.org/pypi/pyprof2calltree&quot;&gt;pyprof2calltree&lt;/a&gt;&lt;/em&gt;.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;$ python -m cProfile -o myscript.cprof myscript.py
$ pyprof2calltree -k -i myscript.cprof
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;And the KCacheGrind window magically appears!&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/kcachegrind.png&quot; alt=&quot;kcachegrind&quot; /&gt;&lt;/p&gt;
&lt;h2&gt;Concrete case: Carbonara optimization&lt;/h2&gt;
&lt;p&gt;I was curious about the performance of &lt;a href=&quot;https://git.openstack.org/cgit/openstack/gnocchi/tree/gnocchi/carbonara.py&quot;&gt;Carbonara&lt;/a&gt;, the small timeseries library I wrote for &lt;a href=&quot;http://launchpad.net/gnocchi&quot;&gt;Gnocchi&lt;/a&gt;. I decided to do some basic profiling to see whether there was any obvious optimization to do.&lt;/p&gt;
&lt;p&gt;In order to profile a program, you need to run it. But running the whole program in profiling mode can generate &lt;em&gt;a lot&lt;/em&gt; of data that you don&apos;t care about, and adds noise to what you&apos;re trying to understand. Since Gnocchi has thousands of unit tests and a few for Carbonara itself, I decided to profile the code used by these unit tests, as it&apos;s a good reflection of basic features of the library.&lt;/p&gt;
&lt;p&gt;Note that this is a good strategy for a curious and naive first-pass profiling.&lt;br /&gt;
There&apos;s no way that you can make sure that the hotspots you will see in the unit tests are the actual hotspots you will encounter in production. Therefore, a profiling in conditions and with a scenario that mimics what&apos;s seen in production is often a necessity if you need to push your program optimization further and want to achieve perceivable and valuable gain.&lt;/p&gt;
&lt;p&gt;I activated &lt;em&gt;cProfile&lt;/em&gt; using the method described above, creating a &lt;code&gt;cProfile.Profile&lt;/code&gt; object around my tests (I actually &lt;a href=&quot;https://github.com/testing-cabal/testtools/pull/163&quot;&gt;started to implement that in testtools&lt;/a&gt;). I then ran &lt;em&gt;KCacheGrind&lt;/em&gt; as described above and used it to generate the following figures.&lt;/p&gt;
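&lt;p&gt;Wrapping code in a &lt;code&gt;cProfile.Profile&lt;/code&gt; object looks roughly like this. This is a generic sketch with a stand-in workload, not the testtools integration itself:&lt;/p&gt;

```python
import cProfile
import io
import pstats

profiler = cProfile.Profile()
profiler.enable()

# The code under test goes here; a stand-in workload for the example.
total = sum(i * i for i in range(10000))

profiler.disable()

# Print the ten most expensive calls, sorted by cumulative time.
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(10)
report = stream.getvalue()
```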
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/kcachegrind-carbonara-old-list.png&quot; alt=&quot;kcachegrind-carbonara-old-list&quot; /&gt;&lt;/p&gt;
&lt;p&gt;The test I profiled here is called &lt;code&gt;test_fetch&lt;/code&gt; and is pretty easy to understand: it puts data in a timeseries object, and then fetches the aggregated result. The above list shows that 88 % of the ticks are spent in &lt;code&gt;set_values&lt;/code&gt; (44 ticks out of 50). This function is used to insert values into the timeseries, not to fetch the values. That means that inserting data is really slow, while actually retrieving it is pretty fast.&lt;/p&gt;
&lt;p&gt;Reading the rest of the list indicates that several functions share the rest of the ticks, &lt;code&gt;update&lt;/code&gt;, &lt;code&gt;_first_block_timestamp&lt;/code&gt;, &lt;code&gt;_truncate&lt;/code&gt;, &lt;code&gt;_resample&lt;/code&gt;, etc. Some of the functions in the list are not part of Carbonara, so there&apos;s no point in looking to optimize them. The only thing that can be optimized is, sometimes, the number of times they&apos;re called.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/kcachegrind-carbonara-old-graph.png&quot; alt=&quot;kcachegrind-carbonara-old-graph&quot; /&gt;&lt;/p&gt;
&lt;p&gt;The call graph gives me a bit more insight about what&apos;s going on here. Using my knowledge about how Carbonara works, I don&apos;t think that the whole stack on the left for &lt;code&gt;_first_block_timestamp&lt;/code&gt; makes much sense. This function is supposed to find the first timestamp for an aggregate, e.g. with a timestamp of 13:34:45 and a period of 5 minutes, the function should return 13:30:00. The way it works currently is by calling the &lt;code&gt;resample&lt;/code&gt; function from Pandas on a timeserie with only one element, but that seems to be very slow. Indeed, currently this function represents 25 % of the time spent by &lt;code&gt;set_values&lt;/code&gt; (11 ticks on 44).&lt;/p&gt;
&lt;p&gt;Fortunately, I recently added a small function called &lt;code&gt;_round_timestamp&lt;/code&gt; that does exactly what &lt;code&gt;_first_block_timestamp&lt;/code&gt; needs without calling any Pandas function, so no &lt;code&gt;resample&lt;/code&gt;. So I ended up rewriting that function this way:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;     def _first_block_timestamp(self):
-        ts = self.ts[-1:].resample(self.block_size)
-        return (ts.index[-1] - (self.block_size * self.back_window))
+        rounded = self._round_timestamp(self.ts.index[-1], self.block_size)
+        return rounded - (self.block_size * self.back_window)
&lt;/code&gt;&lt;/pre&gt;
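&lt;p&gt;For reference, the rounding itself needs nothing more than a modulo. Here is a standalone sketch in the spirit of &lt;code&gt;_round_timestamp&lt;/code&gt; (the real method&apos;s signature differs):&lt;/p&gt;

```python
def round_timestamp(epoch_seconds, period_seconds):
    """Round a timestamp down to the start of its aggregation period."""
    return epoch_seconds - (epoch_seconds % period_seconds)

# 13:34:45 with a 5-minute period rounds down to 13:30:00.
seconds_into_day = 13 * 3600 + 34 * 60 + 45
rounded = round_timestamp(seconds_into_day, 5 * 60)
```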
&lt;p&gt;And then I re-ran the exact same test to compare the output of &lt;em&gt;cProfile&lt;/em&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/kcachegrind-carbonara-new-list.png&quot; alt=&quot;kcachegrind-carbonara-new-list&quot; /&gt;&lt;/p&gt;
&lt;p&gt;The list of functions looks quite different this time. The share of time used by &lt;code&gt;set_values&lt;/code&gt; dropped from 88 % to 71 %.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/kcachegrind-carbonara-new-graph.png&quot; alt=&quot;kcachegrind-carbonara-new-graph&quot; /&gt;&lt;/p&gt;
&lt;p&gt;The call stack for &lt;code&gt;set_values&lt;/code&gt; shows that pretty well: we can&apos;t even see the &lt;code&gt;_first_block_timestamp&lt;/code&gt; function as it is so fast that it totally disappeared from the display. It&apos;s now being considered insignificant by the profiler.&lt;/p&gt;
&lt;p&gt;So we just sped up the whole process of inserting values into Carbonara by a nice 25 % in a few minutes. Not that bad for a first naive pass, right?&lt;/p&gt;
&lt;p&gt;If you want to know more, I wrote a whole chapter about optimizing code in &lt;a href=&quot;https://scaling-python.com&quot;&gt;Scaling Python&lt;/a&gt;. Check it out!&lt;/p&gt;
</content:encoded><category>python</category><category>gnocchi</category></item><item><title>Gnocchi 1.3.0 release</title><link>https://julien.danjou.info/blog/gnocchi-1-3-0-released/</link><guid isPermaLink="true">https://julien.danjou.info/blog/gnocchi-1-3-0-released/</guid><description>Finally, Gnocchi 1.3.0 is out. This is our final release, more or less matching the OpenStack 6 months schedule, that concludes the Liberty development cycle.</description><pubDate>Wed, 04 Nov 2015 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Finally, &lt;a href=&quot;https://launchpad.net/gnocchi/trunk/1.3.0&quot;&gt;Gnocchi 1.3.0&lt;/a&gt; is out. This is our final release, more or less matching the OpenStack 6 months schedule, that concludes the Liberty development cycle.&lt;/p&gt;
&lt;p&gt;This release was supposed to happen a few weeks earlier, but our integration tests got completely blocked for several days just the week before the OpenStack Mitaka summit.&lt;/p&gt;
&lt;h2&gt;New website&lt;/h2&gt;
&lt;p&gt;We built a new dedicated website for Gnocchi at &lt;a href=&quot;http://gnocchi.xyz&quot;&gt;gnocchi.xyz&lt;/a&gt;. We want to promote Gnocchi outside of the &lt;a href=&quot;http://openstack.org&quot;&gt;OpenStack&lt;/a&gt; bubble, as it is a useful timeseries database on its own that can work without the rest of the stack. We&apos;ll try to improve the documentation. If you&apos;re curious, feel free to check it out and report anything you miss!&lt;/p&gt;
&lt;h2&gt;The speed bump&lt;/h2&gt;
&lt;p&gt;Obviously, if it had been a bug in Gnocchi that we hit, it would have been quick to fix. However, we found &lt;a href=&quot;https://bugs.launchpad.net/python-keystoneclient/+bug/1508424&quot;&gt;a nasty bug&lt;/a&gt; in Swift caused by the evil monkey-patching of Eventlet (once again) blended with a mixed usage of native threads and Eventlet threads in Swift. Shake all of that together, and you get some pretty bad race conditions when using the Keystone middleware authentication.&lt;/p&gt;
&lt;p&gt;In the meantime, we disabled Swift multi-threading by using mod_wsgi instead of Eventlet in devstack.&lt;/p&gt;
&lt;h2&gt;New features&lt;/h2&gt;
&lt;p&gt;So what&apos;s new in this new shiny release? A few interesting things:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Metric deletion is now asynchronous. That&apos;s not the most used feature in the REST API – weirdly, people do not often delete metrics – but it&apos;s now way faster and more reliable by being asynchronous. &lt;em&gt;Metricd&lt;/em&gt; is now in charge of cleaning things up.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Speed improvements. We are now confident that Gnocchi is even faster than in the &lt;a href=&quot;https://julien.danjou.info/blog/gnocchi-benchmarks&quot;&gt;latest benchmarks I ran&lt;/a&gt; (around 1.5-2× faster), which makes Gnocchi &lt;em&gt;really&lt;/em&gt; fast with its native storage back-ends. We profiled and optimized Carbonara and the REST API data validation.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Improved &lt;em&gt;metricd&lt;/em&gt; status reporting. It now reports the size of the backlog of the whole cluster, both in its log and via the REST API. Easy monitoring!&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Ceph driver enhancements. We had people testing the Ceph driver in production, so we made a few changes and fixes to make it more solid.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;And that&apos;s all we did in the last couple of months. We have a lot of pretty exciting things on the roadmap, and I&apos;ll surely talk about them in the coming weeks.&lt;/p&gt;
</content:encoded><category>gnocchi</category><category>openstack</category></item><item><title>OpenStack Summit Mitaka from a Telemetry point of view</title><link>https://julien.danjou.info/blog/openstack-summit-mitaka-tokyo-telemetry/</link><guid isPermaLink="true">https://julien.danjou.info/blog/openstack-summit-mitaka-tokyo-telemetry/</guid><description>Last week I was in Tokyo, Japan for the OpenStack Summit, discussing the new Mitaka version that will be released in 6 months.</description><pubDate>Mon, 02 Nov 2015 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Last week I was in Tokyo, Japan for the &lt;a href=&quot;https://www.openstack.org/summit/tokyo-2015/&quot;&gt;OpenStack Summit&lt;/a&gt;, discussing the new Mitaka version that will be released in 6 months.&lt;/p&gt;
&lt;p&gt;I attended the summit mainly to discuss and follow up on new developments in &lt;a href=&quot;http://launchpad.net/ceilometer&quot;&gt;Ceilometer&lt;/a&gt;, &lt;a href=&quot;http://launchpad.net/gnocchi&quot;&gt;Gnocchi&lt;/a&gt;, &lt;a href=&quot;http://launchpad.net/aodh&quot;&gt;Aodh&lt;/a&gt; and Oslo. It has been a pretty good week, and we were able to discuss and plan a few interesting things. Below is what I found remarkable during this summit concerning those projects.&lt;/p&gt;
&lt;h2&gt;Distributed lock manager&lt;/h2&gt;
&lt;p&gt;I did not attend this session, but I need to write something about it.&lt;/p&gt;
&lt;p&gt;See, when working in a distributed environment like OpenStack, it&apos;s almost obvious that sooner or later you end up needing a distributed lock mechanism. It started to become pretty obvious and a serious problem for us 2 years ago in Ceilometer. Back then, we proposed the &lt;a href=&quot;https://wiki.openstack.org/wiki/Oslo/blueprints/service-sync&quot;&gt;service-sync&lt;/a&gt; blueprint and talked about it during the OpenStack Icehouse Design Summit in Hong-Kong. The session at that time was a success, and in 20 minutes I convinced everyone it was the right thing to do. The night following the session, we picked a name, Tooz, for this new library. It was the first time I met Joshua Harlow, who has become one of the biggest Tooz contributors since then.&lt;/p&gt;
&lt;p&gt;Over the following months, we tried to move the lines in OpenStack. It was very hard to convince people that it was the solution to their problem. Most of the time, they did not seem to grasp the entirety of what was at stake.&lt;/p&gt;
&lt;p&gt;This time, it seems that we managed to convince everyone that a DLM is indeed needed. Joshua wrote an extensive specification called &lt;a href=&quot;https://review.openstack.org/#/c/209661/&quot;&gt;Chronicle of a DLM&lt;/a&gt;, which ended up being discussed and somehow adopted during that session in Tokyo.&lt;/p&gt;
&lt;p&gt;So yes, Tooz will be the weapon of choice for OpenStack. It avoids a hard requirement on any particular DLM solution. The best driver right now is the &lt;a href=&quot;https://zookeeper.apache.org/&quot;&gt;ZooKeeper&lt;/a&gt; one, but it&apos;ll still be possible for operators to use e.g. Redis.&lt;/p&gt;
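&lt;p&gt;To make the abstraction concrete, here is a purely illustrative, in-memory sketch of the kind of API a coordination library like Tooz exposes: you build a coordinator from a backend URL, ask it for a named lock, and hold that lock around a critical section. This stand-in is NOT Tooz&apos;s actual code; the names are only modeled on its general shape.&lt;/p&gt;

```python
# Toy illustration of a coordination library's interface: a coordinator
# hands out named locks, and the backend (ZooKeeper, Redis, ...) hides
# behind a driver URL. In-memory stand-in only, not Tooz itself.
import threading


class Coordinator:
    """Hands out named locks shared by every member in this process."""

    _locks = {}
    _guard = threading.Lock()

    def __init__(self, backend_url, member_id):
        self.backend_url = backend_url  # e.g. "zookeeper://host:2181"
        self.member_id = member_id

    def start(self):
        pass  # a real driver would connect to its backend here

    def get_lock(self, name):
        # Return the same lock object for the same name every time.
        with self._guard:
            return self._locks.setdefault(name, threading.Lock())


def get_coordinator(backend_url, member_id):
    coord = Coordinator(backend_url, member_id)
    coord.start()
    return coord


coord = get_coordinator("zookeeper://localhost:2181", "member-1")
with coord.get_lock("resource-42"):
    # Only one member at a time runs this critical section.
    print("lock held by", coord.member_id)
```

&lt;p&gt;Swapping the backend URL is all it takes to move from one DLM implementation to another, which is exactly the decoupling the session was about.&lt;/p&gt;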
&lt;p&gt;This is a great achievement for us, after spending years trying to fix features such as the &lt;a href=&quot;https://blueprints.launchpad.net/nova/+spec/tooz-for-service-groups&quot;&gt;Nova service group subsystem&lt;/a&gt; and seeing our proposals postponed forever.&lt;/p&gt;
&lt;p&gt;(If you want to know more, &lt;a href=&quot;http://lwn.net&quot;&gt;LWN.net&lt;/a&gt; has&lt;br /&gt;
&lt;a href=&quot;https://lwn.net/Articles/662140/&quot;&gt;a great article about that session&lt;/a&gt;.)&lt;/p&gt;
&lt;h2&gt;Telemetry team name&lt;/h2&gt;
&lt;p&gt;With the new projects launched this last year, Aodh &amp;amp; Gnocchi, alongside the old Ceilometer, plus the change from programs to the Big Tent in OpenStack, the team is having an identity issue. Being referred to as the &quot;Ceilometer team&quot; is not really accurate, as some of us only work on Aodh or on Gnocchi. So after discussing that, I &lt;a href=&quot;https://review.openstack.org/#/c/240809/&quot;&gt;proposed to rename the team to Telemetry&lt;/a&gt; instead. We&apos;ll see how it goes.&lt;/p&gt;
&lt;h2&gt;Alarms&lt;/h2&gt;
&lt;p&gt;The first session was about alarms and the Aodh project. It turns out that the project is in pretty good shape, but probably needs some more love, which I hope I&apos;ll be able to provide in the next months.&lt;/p&gt;
&lt;p&gt;The need for a new &lt;em&gt;aodhclient&lt;/em&gt; based on the technologies we recently used building &lt;em&gt;gnocchiclient&lt;/em&gt; has been reasserted, so we might end up working on that pretty soon. The Tempest support also needs some improvement, and we have a plan to enhance that.&lt;/p&gt;
&lt;h2&gt;Data visualisation&lt;/h2&gt;
&lt;p&gt;We got David Lyle in this session, the Project Technical Lead for &lt;a href=&quot;http://launchpad.net/horizon&quot;&gt;Horizon&lt;/a&gt;. It was an interesting discussion. It used to be technically challenging to draw charts from the data Ceilometer collects, but it&apos;s now very easy with Gnocchi and its API.&lt;/p&gt;
&lt;p&gt;While the technical side is resolved, the more political and user-experience questions of what to draw and how were discussed at length. We don&apos;t want to make people think that Ceilometer and Gnocchi are a full monitoring solution, so there are some precautions to take. Other than that, it would be pretty cool to have a view of the data in Horizon.&lt;/p&gt;
&lt;h2&gt;Rolling upgrade&lt;/h2&gt;
&lt;p&gt;It turns out that Ceilometer has an architecture that makes rolling upgrades easy. We just need to write proper documentation explaining how to do it and in which order the services should be upgraded.&lt;/p&gt;
&lt;h2&gt;Ceilometer splitting&lt;/h2&gt;
&lt;p&gt;The split of the alarm feature of Ceilometer into its own project Aodh in the last cycle was a great success for the whole team. We want to split other pieces of Ceilometer as well: they make sense on their own, and smaller components are easier to manage. There are also some projects that want to use them without the whole stack, so it&apos;s a good idea to make it happen.&lt;/p&gt;
&lt;h2&gt;CloudKitty &amp;amp; Gnocchi&lt;/h2&gt;
&lt;p&gt;I attended the 2 sessions that were allocated to &lt;a href=&quot;https://wiki.openstack.org/wiki/CloudKitty&quot;&gt;CloudKitty&lt;/a&gt;. It was pretty interesting, as they want to simplify their architecture and leverage what Gnocchi provides. I presented my view of the project architecture and how they could leverage more of Gnocchi to retrieve and store data. They want to go in that direction, though it&apos;s a large amount of work and refactoring on their side, so it&apos;ll take time.&lt;/p&gt;
&lt;p&gt;We also need to enhance Gnocchi&apos;s support for extension with new resource types, and that&apos;s something I hope I&apos;ll work on in the next months.&lt;/p&gt;
&lt;p&gt;Overall, this summit was pretty good and I got a tremendous amount of good feedback on Gnocchi. I again gathered enough ideas and tasks to tackle for the next 6 months. It will be interesting to see where the whole team goes from here. Stay tuned!&lt;/p&gt;
</content:encoded><category>openstack</category><category>gnocchi</category></item><item><title>Benchmarking Gnocchi for fun &amp; profit</title><link>https://julien.danjou.info/blog/gnocchi-benchmarks/</link><guid isPermaLink="true">https://julien.danjou.info/blog/gnocchi-benchmarks/</guid><description>We got pretty good feedback on Gnocchi so far, even if only a little. Recently, in order to get a better sense of where we were at, we wanted to know how fast (or slow) Gnocchi was.</description><pubDate>Tue, 13 Oct 2015 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;We got pretty good feedback on &lt;a href=&quot;http://launchpad.net/gnocchi&quot;&gt;Gnocchi&lt;/a&gt; so far, even if only a little. Recently, in order to get a better sense of where we were at, we wanted to know how fast (or slow) Gnocchi was.&lt;/p&gt;
&lt;p&gt;The &lt;a href=&quot;https://julien.danjou.info/openstack-ceilometer-the-gnocchi-experiment.html&quot;&gt;early benchmarks that some of the Mirantis engineers ran last year&lt;/a&gt; showed pretty good signs. But a year later, it was time to get real numbers and have a good understanding of Gnocchi capacity.&lt;/p&gt;
&lt;h2&gt;Benchmark tools&lt;/h2&gt;
&lt;p&gt;The first thing I realized when starting that process is that we were lacking tools to run benchmarks. Therefore I started to write some benchmark tools in &lt;a href=&quot;https://launchpad.net/python-gnocchiclient&quot;&gt;python-gnocchiclient&lt;/a&gt;, which provides a command line tool to interact with Gnocchi. I added a few basic commands to measure metric performance, such as:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;$ gnocchi benchmark metric create -w 48 -n 10000 -a low
+----------------------+------------------+
| Field                | Value            |
+----------------------+------------------+
| client workers       | 48               |
| create executed      | 10000            |
| create failures      | 0                |
| create failures rate | 0.00 %           |
| create runtime       | 8.80 seconds     |
| create speed         | 1136.96 create/s |
| delete executed      | 10000            |
| delete failures      | 0                |
| delete failures rate | 0.00 %           |
| delete runtime       | 39.56 seconds    |
| delete speed         | 252.75 delete/s  |
+----------------------+------------------+
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The command line tool supports the &lt;code&gt;--verbose&lt;/code&gt; switch to have detailed progress report on the benchmark progression. So far it supports metric operations only, but that&apos;s the most interesting part of Gnocchi.&lt;/p&gt;
&lt;h2&gt;Spinning up some hardware&lt;/h2&gt;
&lt;p&gt;I got a couple of bare metal servers to test Gnocchi on. I dedicated the first one to Gnocchi, and used the second one as the benchmark client, plugged on the same network. Each server is made of&lt;br /&gt;
2×&lt;a href=&quot;http://ark.intel.com/products/81897/Intel-Xeon-Processor-E5-2609-v3-15M-Cache-1_90-GHz&quot;&gt;Intel Xeon E5-2609 v3&lt;/a&gt; (12 cores in total) and 32 GB of RAM. That provides a lot of CPU to handle requests in parallel.&lt;/p&gt;
&lt;p&gt;Then I simply performed a basic &lt;a href=&quot;http://www.redhat.com/en/technologies/linux-platforms/enterprise-linux&quot;&gt;RHEL 7&lt;/a&gt; installation and ran &lt;a href=&quot;http://devstack.org&quot;&gt;devstack&lt;/a&gt; to spin up an installation of Gnocchi based on the master branch, disabling all of the other OpenStack components. I then tweaked the Apache httpd configuration to use the worker MPM and increased the maximum number of clients that can send requests simultaneously.&lt;/p&gt;
&lt;p&gt;I configured Gnocchi to use the &lt;em&gt;PostgreSQL&lt;/em&gt; indexer, as it&apos;s the recommended one, and the &lt;em&gt;file&lt;/em&gt; storage driver, based on Carbonara (Gnocchi&apos;s own storage engine). That means files were stored locally rather than in Ceph or Swift.&lt;/p&gt;
&lt;p&gt;Using the &lt;em&gt;file&lt;/em&gt; driver is less scalable (you have to run on only one node or use a technology like NFS to share the files), but it was good enough for this benchmark, to get some numbers, and to profile the beast.&lt;/p&gt;
&lt;p&gt;The OpenStack Keystone authentication middleware was not enabled in this setup, as it would add some delay validating the authentication token.&lt;/p&gt;
&lt;h2&gt;Metric CRUD operations&lt;/h2&gt;
&lt;p&gt;Metric creation is pretty fast: I easily reached 1300 metric/s. Deletion is now asynchronous, which means it&apos;s faster than in Gnocchi 1.2, but it&apos;s still slower than creation: 500 metric/s can be deleted. That does not sound like a huge issue since metric deletion is barely used in production.&lt;/p&gt;
&lt;p&gt;Retrieving metric information is also pretty fast and goes up to 800 metric/s. It&apos;d be easy to achieve much higher throughput here, as the result would be easy to cache, but we haven&apos;t felt the need to implement that so far.&lt;/p&gt;
&lt;p&gt;Another important thing is that all of these numbers are constant and barely depend on the number of metrics already managed by Gnocchi.&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Operation&lt;/th&gt;
&lt;th&gt;Details&lt;/th&gt;
&lt;th&gt;Rate&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Create metric&lt;/td&gt;
&lt;td&gt;Created 100k metrics in 77 seconds&lt;/td&gt;
&lt;td&gt;1300 metric/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Show metric&lt;/td&gt;
&lt;td&gt;Show a metric 100k times in 149 seconds&lt;/td&gt;
&lt;td&gt;670 metric/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Delete metric&lt;/td&gt;
&lt;td&gt;Deleted 100k metrics in 190 seconds&lt;/td&gt;
&lt;td&gt;524 metric/s&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h2&gt;Sending and getting measures&lt;/h2&gt;
&lt;p&gt;Pushing measures into metrics is one of the hottest topics. Starting with Gnocchi 1.1, pushed measures are treated asynchronously, which makes it much faster to push new measures. Getting new numbers on that feature was pretty interesting.&lt;/p&gt;
&lt;p&gt;The number of calls per second you can push depends on the batch size, meaning the number of actual measures you send per call. The naive approach is to push 1 measure per call, and in that case, Gnocchi is able to handle around 600 measures/s. With a batch containing 100 measures, the number of calls per second goes down to 450, but since each call pushes 100 measures, that means 45k measures per second pushed into Gnocchi!&lt;/p&gt;
&lt;p&gt;I&apos;ve pushed the test further, inspired by the recent &lt;a href=&quot;https://influxdb.com/blog/2015/10/07/the_new_influxdb_storage_engine_a_time_structured_merge_tree.html&quot;&gt;blog post of InfluxDB claiming to achieve 300k points per second&lt;/a&gt; with their new engine. I ran the same benchmark on the hardware I had, which is roughly two times smaller than the one they used. I managed to push Gnocchi to a little more than 120k measures per second. If I had the same hardware they used, interpolating the results suggests almost 250k measures/s pushed. Obviously, you can&apos;t strictly compare Gnocchi and InfluxDB since they are not doing exactly the same thing, but it still looks way better than what I expected.&lt;/p&gt;
&lt;p&gt;Batch sizes between 2k and 5k measures improve the throughput further, to around 120-125k measures/s.&lt;/p&gt;
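&lt;p&gt;The effect of batching is easy to picture on the client side: the same set of measures is grouped into fewer, larger JSON payloads, so fewer HTTP round-trips are needed. The sketch below is a minimal illustration; the endpoint path and payload shape shown in the comments are assumptions based on Gnocchi&apos;s metric API at the time, not verified against any specific version.&lt;/p&gt;

```python
# Sketch of how a client batches measures to cut the number of HTTP
# calls. The endpoint path and payload shape in the comments follow
# Gnocchi's measures API as I remember it; treat them as assumptions
# and check the API reference for your version.
import json


def make_batches(measures, batch_size):
    """Split (timestamp, value) pairs into JSON payloads."""
    payloads = []
    for i in range(0, len(measures), batch_size):
        chunk = measures[i:i + batch_size]
        payloads.append(json.dumps(
            [{"timestamp": ts, "value": value} for ts, value in chunk]
        ))
    return payloads


measures = [("2015-10-13T14:%02d:00" % m, float(m)) for m in range(60)]
payloads = make_batches(measures, batch_size=10)
# Each payload would be sent in one HTTP call, e.g.:
#   POST /v1/metric/{metric-id}/measures
print(len(payloads), "calls instead of", len(measures))
```

&lt;p&gt;With a batch size of 5000 instead of 10, the same 5M-measure run in the table below needs 1000 calls instead of 500,000, which is where the throughput gain comes from.&lt;/p&gt;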
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Operation&lt;/th&gt;
&lt;th&gt;Details&lt;/th&gt;
&lt;th&gt;Rate&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Push metric 5k&lt;/td&gt;
&lt;td&gt;Push 5M measures with batch of 5k measures in 40 seconds&lt;/td&gt;
&lt;td&gt;122k measures/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Push metric 4k&lt;/td&gt;
&lt;td&gt;Push 5M measures with batch of 4k measures in 40 seconds&lt;/td&gt;
&lt;td&gt;125k measures/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Push metric 3k&lt;/td&gt;
&lt;td&gt;Push 5M measures with batch of 3k measures in 40 seconds&lt;/td&gt;
&lt;td&gt;123k measures/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Push metric 2k&lt;/td&gt;
&lt;td&gt;Push 5M measures with batch of 2k measures in 41 seconds&lt;/td&gt;
&lt;td&gt;121k measures/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Push metric 1k&lt;/td&gt;
&lt;td&gt;Push 5M measures with batch of 1k measures in 44 seconds&lt;/td&gt;
&lt;td&gt;113k measures/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Push metric 500&lt;/td&gt;
&lt;td&gt;Push 5M measures with batch of 500 measures in 51 seconds&lt;/td&gt;
&lt;td&gt;98k measures/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Push metric 100&lt;/td&gt;
&lt;td&gt;Push 5M measures with batch of 100 measures in 112 seconds&lt;/td&gt;
&lt;td&gt;45k measures/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Push metric 10&lt;/td&gt;
&lt;td&gt;Push 5M measures with batch of 10 measures in 852 seconds&lt;/td&gt;
&lt;td&gt;6k measures/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Push metric 1&lt;/td&gt;
&lt;td&gt;Push 500k measures with batch of 1 measure in 800 seconds&lt;/td&gt;
&lt;td&gt;624 measures/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Get measures&lt;/td&gt;
&lt;td&gt;Push 43k measures of 1 metric&lt;/td&gt;
&lt;td&gt;260k measures/s&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;What about getting measures? Well, it&apos;s actually pretty fast too. Retrieving a metric with 1 month of data at a 1 minute interval (that&apos;s 43k points) takes less than 2 seconds.&lt;/p&gt;
&lt;p&gt;Though it&apos;s actually slower than what I expected. The reason seems to be that the JSON is 2 MB big and encoding it takes a lot of time for Python. I&apos;ll investigate that. Another point I discovered is that by default Gnocchi returns all the datapoints for each granularity available for the asked period, which might double the size of the returned data for nothing if you don&apos;t need it. It&apos;ll be easy to add an option to the API to only retrieve what you need though!&lt;/p&gt;
&lt;p&gt;Once benchmarked, that meant I was able to retrieve 6 metrics per second, which translates to around 260k measures/s.&lt;/p&gt;
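&lt;p&gt;For reference, here is the shape of the data a measures request sends back, as far as I can reconstruct it: a list of (timestamp, granularity, value) triples, with one series per available granularity. The values below are made up for illustration.&lt;/p&gt;

```json
[
    ["2015-10-13T14:00:00+00:00", 3600.0, 4.5],
    ["2015-10-13T14:00:00+00:00", 60.0, 4.2],
    ["2015-10-13T14:01:00+00:00", 60.0, 4.8]
]
```

&lt;p&gt;Both the 3600 s and the 60 s series come back for the same period, which is exactly the doubling of the returned data mentioned above.&lt;/p&gt;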
&lt;h2&gt;&lt;em&gt;Metricd&lt;/em&gt; speed&lt;/h2&gt;
&lt;p&gt;New measures pushed into Gnocchi are processed asynchronously by the &lt;code&gt;gnocchi-metricd&lt;/code&gt; daemon. When doing the benchmarks above, I ran into a very interesting issue: sending 10k measures on a metric would make &lt;code&gt;gnocchi-metricd&lt;/code&gt; use up to 2 GB of RAM and 120 % CPU for more than 10 minutes.&lt;/p&gt;
&lt;p&gt;After further investigation, I found that the naive approach we used to resample datapoints in Carbonara using &lt;a href=&quot;http://pandas.pydata.org/&quot;&gt;Pandas&lt;/a&gt; was causing that. I &lt;a href=&quot;https://github.com/pydata/pandas/issues/11217&quot;&gt;reported a bug on Pandas&lt;/a&gt; and the upstream author was kind enough to provide a nice workaround, that I sent as &lt;a href=&quot;https://github.com/pydata/pandas/pull/11242&quot;&gt;a pull request&lt;/a&gt; to Pandas documentation.&lt;/p&gt;
&lt;p&gt;I wrote a fix for Gnocchi based on that, and started using it. Computing the standard aggregation methods set (std, count, 95pct, min, max, sum, median, mean) for 10k batches of 1 measure (worst case scenario) for one metric with 10k measures now takes only 20 seconds and uses 100 MB of RAM – 45× faster. That means that in normal operations, where only a few new measures are processed, the operation of updating a metric only takes a few milliseconds. Awesome!&lt;/p&gt;
&lt;h2&gt;Comparison with Ceilometer&lt;/h2&gt;
&lt;p&gt;For comparison&apos;s sake, I&apos;ve quickly run some read-operation benchmarks on Ceilometer. I&apos;ve fed it with one month of samples for 100 instances polled every minute. That represents roughly 4.3M samples injected, and that took a while – almost 1 hour, whereas it would have taken less than a minute in Gnocchi. Then I tried to retrieve some statistics in the same way that we provide them in Gnocchi, which means aggregating them over a period of 60 seconds over a month.&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Operation&lt;/th&gt;
&lt;th&gt;Details&lt;/th&gt;
&lt;th&gt;Rate&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Read metric SQL&lt;/td&gt;
&lt;td&gt;Read measures for 1 metric&lt;/td&gt;
&lt;td&gt;2min 58s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Read metric MongoDB&lt;/td&gt;
&lt;td&gt;Read measures for 1 metric&lt;/td&gt;
&lt;td&gt;28s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Read metric Gnocchi&lt;/td&gt;
&lt;td&gt;Read measures for 1 metric&lt;/td&gt;
&lt;td&gt;2s&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Obviously, Ceilometer is very slow. It has to look into 4M samples to compute and return the result, which takes a lot of time. Gnocchi, on the other hand, just has to fetch a file and pass it over. That also means that the more samples you have (so the longer you collect data and the more resources you have), the slower Ceilometer becomes. This is not a problem with Gnocchi, as I emphasized when I started designing it.&lt;/p&gt;
&lt;p&gt;Most Gnocchi operations are &lt;em&gt;O(log R)&lt;/em&gt; where R is the number of metrics or resources, whereas most Ceilometer operations are &lt;em&gt;O(log S)&lt;/em&gt; where S is the number of samples (measures). Since R is millions of times smaller than S, Gnocchi gets to be much faster.&lt;/p&gt;
&lt;p&gt;And what&apos;s even more interesting is that Gnocchi is entirely horizontally scalable. Adding more Gnocchi servers (for the API and its background processing worker &lt;em&gt;metricd&lt;/em&gt;) will multiply Gnocchi&apos;s performance by the number of servers added.&lt;/p&gt;
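&lt;p&gt;A toy way to feel that difference: answering &quot;average for metric X&quot; by scanning every raw sample, versus looking up a series that was pre-aggregated at ingestion time. Purely illustrative; neither system is literally implemented this way.&lt;/p&gt;

```python
# Toy contrast between the two query strategies. The scan path touches
# every one of the S raw samples (Ceilometer-style), while the other
# path does a single indexed lookup on a pre-aggregated series keyed
# by metric (Gnocchi-style). Illustration only, not real code.
samples = []  # (metric_id, value) pairs, raw sample storage
for metric_id in range(1000):        # R = 1000 metrics...
    for v in range(100):             # ...S = 100k samples in total
        samples.append((metric_id, float(v)))

# Sample-scanning path: examine all S samples to answer one query.
values = [v for m, v in samples if m == 42]
scan_answer = sum(values) / len(values)

# Pre-aggregated path: the mean was computed when measures arrived,
# so a query is one lookup in a structure keyed by metric id.
precomputed = {42: 49.5}  # mean of 0..99, stored at ingestion time
lookup_answer = precomputed[42]

print(scan_answer == lookup_answer)  # True
```

&lt;p&gt;Both paths return the same number, but the scan cost grows with the amount of data collected while the lookup cost does not.&lt;/p&gt;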
&lt;h2&gt;Improvements&lt;/h2&gt;
&lt;p&gt;There are several things to improve in Gnocchi, such as splitting Carbonara archives to make them more efficient, especially for drivers such as Ceph and Swift. It&apos;s already on my plate, and I&apos;m looking forward to working on that!&lt;/p&gt;
&lt;p&gt;And if you have any questions, feel free to shoot them in the comment section. 😉&lt;/p&gt;
</content:encoded><category>gnocchi</category><category>openstack</category></item><item><title>Gnocchi talk at OpenStack Paris Meetup #16</title><link>https://julien.danjou.info/blog/openstack-france-paris-meetup-gnocchi-talk/</link><guid isPermaLink="true">https://julien.danjou.info/blog/openstack-france-paris-meetup-gnocchi-talk/</guid><description>Last week, I was invited to the OpenStack Paris meetup #16, whose subject was metrics in OpenStack. The last time I spoke at this meetup was back in 2012, during the OpenStack Paris meetup #2.</description><pubDate>Mon, 05 Oct 2015 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Last week, I was invited to the &lt;a href=&quot;http://www.meetup.com/OpenStack-France/events/225227112/&quot;&gt;OpenStack Paris meetup #16&lt;/a&gt;, whose subject was metrics in OpenStack. The last time I spoke at this meetup was back in 2012, during the &lt;a href=&quot;https://julien.danjou.info/blog/openstack-france-meetup-2&quot;&gt;OpenStack Paris meetup #2&lt;/a&gt;. A very long time ago!&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/gnocchi-talk-2.jpg&quot; alt=&quot;gnocchi-talk-2&quot; /&gt;&lt;/p&gt;
&lt;p&gt;I talked for half an hour about &lt;a href=&quot;http://launchpad.net/gnocchi&quot;&gt;Gnocchi&lt;/a&gt;, the OpenStack project I&apos;ve been running for 18 months now. I started by explaining the story behind the project and why we needed to build it. Ceilometer has an interesting history and has had a curious roadmap these last years, and I summarized that briefly. Then I talked about how Gnocchi works and what it offers to users and operators. The slides were full of JSON, but I imagine it offered an interesting view of what the API looks like and how easy it is to operate. This also allowed me to emphasize how many use cases are actually covered and solved, contrary to what Ceilometer did so far. The talk was well received and I got a few interesting questions at the end.&lt;/p&gt;
</content:encoded><category>talks</category><category>openstack</category><category>gnocchi</category></item><item><title>Visualize your OpenStack cloud: Gnocchi &amp; Grafana</title><link>https://julien.danjou.info/blog/openstack-gnocchi-grafana/</link><guid isPermaLink="true">https://julien.danjou.info/blog/openstack-gnocchi-grafana/</guid><description>We&apos;ve been working hard with the Gnocchi team these last months to store your metrics, and I guess it&apos;s time to show off a bit.  So far Gnocchi offers scalable metric storage and resource indexing,</description><pubDate>Mon, 14 Sep 2015 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;We&apos;ve been working hard with the Gnocchi team these last months to store your metrics, and I guess it&apos;s time to show off a bit.&lt;/p&gt;
&lt;p&gt;So far Gnocchi offers scalable metric storage and resource indexing, especially for OpenStack clouds – but not only: it&apos;s generic. It&apos;s cool to store metrics, but it&apos;s even better to have a way to visualize them!&lt;/p&gt;
&lt;h2&gt;Prototyping&lt;/h2&gt;
&lt;p&gt;We very soon started to build a little HTML interface. Being REST-friendly guys, we enabled it on the same endpoints that were being used to retrieve information and measures about metrics, sending back &lt;code&gt;text/html&lt;/code&gt; instead of &lt;code&gt;application/json&lt;/code&gt; if you were requesting those pages from a Web browser.&lt;/p&gt;
&lt;p&gt;But let&apos;s face it: we are back-end developers, we suck at any kind of front-end development. CSS, HTML, JavaScript? Bwah! So what we built was a starting point, hoping some magical Web developer would jump in and finish the job.&lt;/p&gt;
&lt;p&gt;Obviously it never happened.&lt;/p&gt;
&lt;h2&gt;Ok, so what&apos;s out there?&lt;/h2&gt;
&lt;p&gt;It turns out there are back-end agnostic solutions out there, and we decided to pick &lt;a href=&quot;http://grafana.org&quot;&gt;Grafana&lt;/a&gt;. Grafana is a complete graphing dashboard solution that can be plugged on top of any back-end. It already supports timeseries databases such as Graphite, InfluxDB and OpenTSDB.&lt;/p&gt;
&lt;p&gt;That was more than enough for my fellow developer &lt;a href=&quot;https://blog.sileht.net/&quot;&gt;Mehdi Abaakouk&lt;/a&gt; to jump in and start writing a Gnocchi plugin for Grafana! Consequently, there is now a basic but solid and working back-end for Grafana that lies in the &lt;em&gt;&lt;a href=&quot;https://github.com/grafana/grafana-plugins/tree/master/datasources/gnocchi&quot;&gt;grafana-plugins&lt;/a&gt;&lt;/em&gt; repository.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/gnocchi-grafana.png&quot; alt=&quot;gnocchi-grafana&quot; /&gt;&lt;/p&gt;
&lt;p&gt;With that plugin, you can graph anything that is stored in Gnocchi, from raw metrics to metrics tied to resources. You can use templating, but no annotation yet.&lt;/p&gt;
&lt;p&gt;The back-end supports Gnocchi with or without Keystone involved, and any type of authentication (basic auth or Keystone token). So yes, it even works if you&apos;re not running Gnocchi with the rest of OpenStack.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/gnocchi-grafana-group.png&quot; alt=&quot;gnocchi-grafana-group&quot; /&gt;&lt;/p&gt;
&lt;p&gt;It also supports advanced queries, so you can search for resources based on some criteria and graph their metrics.&lt;/p&gt;
&lt;h2&gt;I want to try it!&lt;/h2&gt;
&lt;p&gt;If you want to deploy it, all you need to do is to install Grafana and its plugins, and create a new datasource pointing to Gnocchi. It is that simple. There&apos;s some CORS middleware configuration involved if you&apos;re planning on using Keystone authentication, but it&apos;s pretty straightforward – just set the &lt;code&gt;cors.allowed_origin&lt;/code&gt; option to the URL of your Grafana dashboard.&lt;/p&gt;
&lt;p&gt;We added support of Grafana directly in Gnocchi devstack plugin. If you&apos;re running &lt;a href=&quot;http://devstack.org&quot;&gt;DevStack&lt;/a&gt; you can follow &lt;a href=&quot;http://docs.openstack.org/developer/gnocchi/devstack.html&quot;&gt;the instructions&lt;/a&gt; – which are basically adding the line &lt;code&gt;enable_service gnocchi-grafana&lt;/code&gt;.&lt;/p&gt;
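&lt;p&gt;For the Keystone case, the CORS part boils down to one option in Gnocchi&apos;s configuration. The exact section layout below and the Grafana hostname are assumptions for illustration; check the documentation of your Gnocchi version.&lt;/p&gt;

```ini
# gnocchi.conf - let the CORS middleware accept requests coming from
# the Grafana dashboard (needed when Keystone authentication is used).
# Section/option layout assumed from the oslo CORS middleware; the
# hostname is a placeholder for your own dashboard URL.
[cors]
allowed_origin = http://grafana.example.com:3000
```

&lt;p&gt;Without Keystone, no CORS configuration is needed at all: the datasource definition in Grafana is the only setup step.&lt;/p&gt;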
&lt;h2&gt;Moving to Grafana core&lt;/h2&gt;
&lt;p&gt;&lt;a href=&quot;https://github.com/grafana/grafana/pull/2716&quot;&gt;Mehdi just opened a pull request&lt;/a&gt; a few days ago to merge the plugin into Grafana core. It&apos;s actually one of the most unit-tested plugins in Grafana so far, so it should be on a good path to be merged, meaning Grafana would support Gnocchi directly without any plugin involved.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/grafana-gnocchi-unittests.png&quot; alt=&quot;grafana-gnocchi-unittests&quot; /&gt;&lt;/p&gt;
</content:encoded><category>gnocchi</category><category>openstack</category><category>monitoring</category></item><item><title>Ceilometer, Gnocchi &amp; Aodh: Liberty progress</title><link>https://julien.danjou.info/blog/ceilometer-gnocchi-aodh-liberty-progress/</link><guid isPermaLink="true">https://julien.danjou.info/blog/ceilometer-gnocchi-aodh-liberty-progress/</guid><description>It&apos;s been a while since I talked about Ceilometer and its companions, so I thought I&apos;d go ahead and write a bit about what&apos;s happening on this side of OpenStack. I&apos;m not going to cover new features and fa</description><pubDate>Tue, 04 Aug 2015 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;It&apos;s been a while since I talked about Ceilometer and its companions, so I thought I&apos;d go ahead and write a bit about what&apos;s happening on this side of OpenStack. I&apos;m not going to cover new features and fancy stuff today, but rather give a quick overview of the new project processes we initiated.&lt;/p&gt;
&lt;h2&gt;Ceilometer growing&lt;/h2&gt;
&lt;p&gt;&lt;a href=&quot;http://launchpad.net/ceilometer&quot;&gt;Ceilometer&lt;/a&gt; has grown a lot since we started it 3 years ago. It has evolved from a system designed to fetch and store measurements to a more complex system, with agents, alarms, events, databases, APIs, etc.&lt;/p&gt;
&lt;p&gt;All those features were needed and asked for by users and operators, but let&apos;s be honest, some of them should never have ended up in the Ceilometer code repository, especially not all at the same time.&lt;/p&gt;
&lt;p&gt;The reality is we picked a pragmatic approach due to the rigidity of the OpenStack Technical Committee with regard to new projects becoming OpenStack integrated – and, therefore, blessed – projects. Ceilometer was actually the first project to be incubated and then integrated. We had to go through the very first issues of that process.&lt;/p&gt;
&lt;p&gt;Fortunately, time has passed and all those constraints have been relaxed. To me, the &lt;a href=&quot;https://www.openstack.org/foundation&quot;&gt;OpenStack Foundation&lt;/a&gt; is turning into something that looks like the &lt;a href=&quot;http://www.apache.org/foundation/&quot;&gt;Apache Foundation&lt;/a&gt;, and there&apos;s, therefore, no need to tie technical solutions to political issues.&lt;/p&gt;
&lt;p&gt;Indeed, the &lt;a href=&quot;https://www.openstack.org/summit/vancouver-2015/summit-videos/presentation/the-big-tent-a-look-at-the-new-openstack-projects-governance&quot;&gt;Big Tent&lt;/a&gt; now allows much more flexibility. A year ago, we were afraid to bring Gnocchi into Ceilometer. Was the Technical Committee going to review the project? Was the project going to be in the scope of Ceilometer for the Technical Committee? We don&apos;t have to ask ourselves those questions anymore; that freedom empowers us to do what we think is good in terms of technical design without worrying too much about political issues.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/ceilometer-activity.png&quot; alt=&quot;ceilometer-activity&quot; /&gt;&lt;/p&gt;
&lt;h2&gt;Acknowledging Gnocchi&lt;/h2&gt;
&lt;p&gt;The first step in this new process was to continue working on &lt;a href=&quot;https://launchpad.net/gnocchi&quot;&gt;Gnocchi&lt;/a&gt; (a time series database and resource indexer designed to overcome historical Ceilometer storage issues) and to decide that merging it into Ceilometer as some REST API v3 was not the right call: it was better to keep it standalone.&lt;/p&gt;
&lt;p&gt;We managed to get traction for Gnocchi, gaining a few contributors and users. We&apos;re even seeing talks proposed for the next Tokyo Summit where people leverage Gnocchi, such as &quot;Service of predictive analytics on cost and performance in OpenStack&quot;, &quot;&lt;a href=&quot;https://wiki.openstack.org/wiki/Surveil&quot;&gt;Surveil&lt;/a&gt;&quot; and &quot;Cutting Edge NFV On OpenStack: Healing and Scaling Distributed Applications&quot;.&lt;/p&gt;
&lt;p&gt;We are also making some progress on pushing Gnocchi outside of the OpenStack community, as it is a self-sufficient time series and resource database that can be used without any OpenStack interaction.&lt;/p&gt;
&lt;h2&gt;Branching Aodh&lt;/h2&gt;
&lt;p&gt;Rather than continuing to grow Ceilometer, during the last summit we all decided that it was time to reorganize and split Ceilometer into the different components it is made of, leveraging a more &lt;a href=&quot;https://en.wikipedia.org/wiki/Service-oriented_architecture&quot;&gt;service-oriented architecture&lt;/a&gt;. The alarm subsystem of Ceilometer being mostly untied to the rest of Ceilometer, we decided it was the first and perfect candidate for that. I personally took on the work and created a new repository with only the alarm code from Ceilometer, named &lt;a href=&quot;https://launchpad.net/aodh&quot;&gt;Aodh&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/woman-fire.jpg&quot; alt=&quot;woman-fire&quot; /&gt;&lt;/p&gt;
&lt;p&gt;This made sense for a lot of reasons. First, because Aodh can now work completely standalone, using either Ceilometer or Gnocchi as a backend – or any new plugin you&apos;d write. I love the idea that OpenStack projects can work standalone – like Swift does, for example – without implying any other OpenStack component. I think it&apos;s a proof of good design. Secondly, because it allows us to reason about a smaller chunk of software – a benefit really under-estimated in OpenStack today. I believe that the size of your software should match a certain ratio to the size of your team.&lt;/p&gt;
&lt;p&gt;Aodh is, therefore, a new project under the OpenStack Telemetry program (or what remains of OpenStack programs now), alongside Ceilometer and Gnocchi, forked from the original Ceilometer alarm feature. We&apos;ll deprecate the latter with the Liberty release, and we&apos;ll remove it in the Mitaka release.&lt;/p&gt;
&lt;h2&gt;Lessons learned&lt;/h2&gt;
&lt;p&gt;Actually, moving that code out of Ceilometer (in the case of Aodh), or not merging it in (in the case of Gnocchi), had a few side effects that I admit we probably under-estimated back then.&lt;/p&gt;
&lt;p&gt;Indeed, the code base of Gnocchi or Aodh ended up being much smaller than the entire Ceilometer project – Gnocchi is 7× smaller and Aodh 5× smaller than Ceilometer – and therefore much easier to manipulate and hack on. That allowed us to merge dozens of patches in a few weeks, cleaning up and enhancing a lot of small things in the code. Those tasks are much harder in Ceilometer, due to its bigger code base and the small size of our team. Having our small team work on smaller chunks of changes – even when it meant doing more reviews – greatly improved our general velocity and the number of bugs fixed and features implemented.&lt;/p&gt;
&lt;p&gt;On the more sociological side, I think it gave the team the sensation of finally owning the project. Ceilometer was huge, and it was impossible for people to know every side of it. Now, it&apos;s possible for people inside a team to cover a much larger portion of those smaller projects, which gives them a greater sense of ownership and care – and that ends up being good for overall project quality.&lt;/p&gt;
&lt;p&gt;That also means we decided to have a different core team per project (Ceilometer, Gnocchi, and Aodh), as they all serve different purposes and can each be used standalone or together. It means we can have contributors working on one project while completely ignoring the others.&lt;/p&gt;
&lt;p&gt;All of that reminds me of discussions I heard about projects such as Glance trying to fit in new features that are really orthogonal to their original purpose. It&apos;s now clear to me that having different small components interacting together, each completely owned and cared for by a (small) team of contributors, is the way to go. People who can trust each other and easily bring new people in make a project incredibly more powerful. Having a project cover too wide a set of features makes things more difficult if you don&apos;t have enough manpower. This is clearly an issue that big projects inside OpenStack, such as Neutron or Nova, are facing now.&lt;/p&gt;
</content:encoded><category>openstack</category><category>gnocchi</category></item><item><title>OpenStack Summit Liberty from a Ceilometer &amp; Gnocchi point of view</title><link>https://julien.danjou.info/blog/openstack-summit-liberty-vancouver-ceilometer-gnocchi/</link><guid isPermaLink="true">https://julien.danjou.info/blog/openstack-summit-liberty-vancouver-ceilometer-gnocchi/</guid><description>Last week I was in Vancouver, BC for the OpenStack Summit, discussing the new Liberty version that will be released in 6 months.</description><pubDate>Tue, 26 May 2015 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Last week I was in &lt;a href=&quot;http://vancouver.ca/&quot;&gt;Vancouver, BC&lt;/a&gt; for the &lt;a href=&quot;https://www.openstack.org/summit/vancouver-2015/&quot;&gt;OpenStack Summit&lt;/a&gt;, discussing the new Liberty version that will be released in 6 months.&lt;/p&gt;
&lt;p&gt;I attended the summit mainly to discuss and follow up on new developments in Ceilometer, Gnocchi and Oslo. It was a pretty good week and we were able to discuss and plan a few interesting things.&lt;/p&gt;
&lt;h2&gt;Ops feedback&lt;/h2&gt;
&lt;p&gt;We had half a dozen Ceilometer sessions, and the first one was dedicated to gathering feedback from operators using Ceilometer. We had a few operators present, and a few members of the Ceilometer team. We had a constructive discussion, and my feeling is that operators struggle with two things so far: scaling Ceilometer storage, and keeping Ceilometer from overloading the rest of OpenStack.&lt;/p&gt;
&lt;p&gt;We discussed the first point as being addressed by &lt;a href=&quot;http://launchpad.net/gnocchi&quot;&gt;Gnocchi&lt;/a&gt;, and I briefly presented Gnocchi itself, as well as how and why it will fix the storage scalability issues operators have encountered so far.&lt;/p&gt;
&lt;p&gt;Ceilometer bringing down the OpenStack installation is a more interesting problem. Ceilometer pollsters request information from Nova, Glance… to gather statistics. Until Kilo, Ceilometer used to do that regularly and at fixed intervals, causing high load spikes in OpenStack. With the &lt;a href=&quot;http://docs.openstack.org/developer/ceilometer/architecture.html#polling-agents-asking-for-data&quot;&gt;introduction of jitter&lt;/a&gt; in Kilo, this should be less of a problem. However, Ceilometer hits various endpoints in OpenStack that are poorly designed, and hitting those endpoints of Nova or other components triggers a lot of load on the platform. Unfortunately, this makes operators blame Ceilometer rather than the components guilty of poor design. We&apos;d like to push forward on improving these components, but it&apos;s probably going to take a long time.&lt;/p&gt;
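&lt;p&gt;As a rough sketch of the jitter idea (an illustration, not Ceilometer&apos;s actual code): instead of every agent firing at the exact same fixed interval, each one waits the interval plus a small random fraction of it, so the requests spread out instead of landing all at once:&lt;/p&gt;

```python
import random

def next_poll_delay(interval, jitter_fraction=0.1, rng=random.random):
    """Delay before the next poll: the fixed interval plus a random
    jitter, so that many pollsters do not all fire at the same instant."""
    return interval + interval * jitter_fraction * rng()

# With a 600-second interval and 10% jitter, delays fall in [600, 660).
delay = next_poll_delay(600)
```

&lt;p&gt;Spreading each agent&apos;s start time this way flattens the load spikes without changing the average polling rate much.&lt;/p&gt;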
&lt;h2&gt;Componentisation&lt;/h2&gt;
&lt;p&gt;When I started the Gnocchi project last year, I pretty soon realized that we would be able to split Ceilometer itself into different smaller components that could work independently while leveraging each other. For example, Gnocchi can run standalone and store your metrics even if you don&apos;t use Ceilometer – or even OpenStack itself.&lt;/p&gt;
&lt;p&gt;My fellow developer &lt;a href=&quot;http://burningchrome.com/&quot;&gt;Chris Dent&lt;/a&gt; had the same idea about splitting Ceilometer a few months ago and drafted a proposal. The idea is to have Ceilometer split into different parts that people could assemble together or run on their own.&lt;/p&gt;
&lt;p&gt;Interestingly enough, we had three 40-minute sessions planned to talk and debate about this division of Ceilometer, though we all agreed in 5 minutes that it was the right thing to do. Five more minutes later, we agreed on which parts to split. The rest of the time was allocated to discussing various details of that split, and I committed to start doing the work on the Ceilometer alarming subsystem.&lt;/p&gt;
&lt;p&gt;I wrote a &lt;a href=&quot;https://review.openstack.org/#/c/184307/&quot;&gt;specification&lt;/a&gt; on the plane to Vancouver, which should be approved pretty soon now. I already started doing the implementation work. So fingers crossed, Ceilometer should have a new component in Liberty handling alarming on its own.&lt;/p&gt;
&lt;p&gt;This would allow users, for example, to deploy only Gnocchi and the Ceilometer alarm component. They would be able to feed data to Gnocchi using their own system, and build alarms using the Ceilometer alarm subsystem relying on Gnocchi&apos;s data.&lt;/p&gt;
&lt;h2&gt;Gnocchi&lt;/h2&gt;
&lt;p&gt;We didn&apos;t have a dedicated Gnocchi slot – mainly because I indicated I didn&apos;t feel we needed one. We discussed a few points around coffee anyway, and I&apos;ve been able to draw up a few new ideas and changes I&apos;d like to see in Gnocchi: mainly changing the API contract to be more asynchronous so we can support &lt;a href=&quot;http://influxdb.com/&quot;&gt;InfluxDB&lt;/a&gt; more correctly, and improving the drivers based on Carbonara (the library we created to manipulate timeseries) to be faster.&lt;/p&gt;
&lt;p&gt;All of that – plus a few Oslo tasks I&apos;d like to tackle – should keep me busy for the next cycle!&lt;/p&gt;
</content:encoded><category>openstack</category><category>gnocchi</category></item><item><title>Gnocchi 1.0: storing metrics and resources at scale</title><link>https://julien.danjou.info/blog/openstack-gnocchi-first-release/</link><guid isPermaLink="true">https://julien.danjou.info/blog/openstack-gnocchi-first-release/</guid><description>A few months ago, I wrote a long post about what I called back then the &quot; Gnocchi experiment &quot;. Time passed and we – me and the rest of the Gnocchi team – continued to work on that project, finalizing</description><pubDate>Tue, 21 Apr 2015 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;A few months ago, I wrote a long post about what I called back then the &quot;&lt;a href=&quot;https://julien.danjou.info/blog/openstack-ceilometer-the-gnocchi-experiment&quot;&gt;Gnocchi experiment&lt;/a&gt;&quot;. Time passed and we – me and the rest of the Gnocchi team – continued to work on that project, finalizing it.&lt;/p&gt;
&lt;p&gt;It is with great pleasure that we are going to release our first &lt;em&gt;1.0&lt;/em&gt; version this month, roughly at the same time as the integrated &lt;a href=&quot;http://openstack.org&quot;&gt;OpenStack&lt;/a&gt; projects release their Kilo milestone. The &lt;a href=&quot;https://pypi.python.org/pypi/gnocchi&quot;&gt;first release candidate numbered 1.0.0rc1&lt;/a&gt; has been released this morning!&lt;/p&gt;
&lt;h2&gt;The problem to solve&lt;/h2&gt;
&lt;p&gt;Before I dive into Gnocchi details, it&apos;s important to have a good view of what problems Gnocchi is trying to solve.&lt;/p&gt;
&lt;p&gt;Most IT infrastructures out there consist of a set of resources. These resources have properties: some of them are simple attributes, whereas others might be measurable quantities (also known as metrics).&lt;/p&gt;
&lt;p&gt;And in this context, cloud infrastructures are no exception. We talk about instances, volumes, networks… which are all different kinds of resources. The problem arising with the cloud trend is the scalability of storing all this data and being able to query it later, for whatever usage.&lt;/p&gt;
&lt;p&gt;What Gnocchi provides is a REST API that allows the user to manipulate resources (CRUD) and their attributes, while preserving the history of those resources and their attributes.&lt;/p&gt;
&lt;p&gt;Gnocchi is fully documented and the &lt;a href=&quot;http://gnocchi.xyz&quot;&gt;documentation is available online&lt;/a&gt;. We are the first OpenStack project to require that patches include documentation. We want to raise the bar, so we took a stand on that. It&apos;s part of our policy, the same way requiring unit tests is part of the OpenStack policy.&lt;/p&gt;
&lt;p&gt;I&apos;m not going to paraphrase the whole Gnocchi documentation, which covers things like installation (super easy), but I&apos;ll guide you through some basics of the features provided by the REST API. I will show you some examples so you can have a better understanding of what you can leverage using Gnocchi!&lt;/p&gt;
&lt;h2&gt;Handling metrics&lt;/h2&gt;
&lt;p&gt;Gnocchi provides a full REST API to manipulate time-series that are called &lt;em&gt;metrics&lt;/em&gt;. You can easily create a metric using a simple HTTP request:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;POST /v1/metric HTTP/1.1
Content-Type: application/json

{
  &quot;archive_policy_name&quot;: &quot;low&quot;
}

HTTP/1.1 201 Created
Location: http://localhost/v1/metric/387101dc-e4b1-4602-8f40-e7be9f0ed46a
Content-Type: application/json; charset=UTF-8

{
  &quot;archive_policy&quot;: {
    &quot;aggregation_methods&quot;: [
      &quot;std&quot;,
      &quot;sum&quot;,
      &quot;mean&quot;,
      &quot;count&quot;,
      &quot;max&quot;,
      &quot;median&quot;,
      &quot;min&quot;,
      &quot;95pct&quot;
    ],
    &quot;back_window&quot;: 0,
    &quot;definition&quot;: [
      {
        &quot;granularity&quot;: &quot;0:00:01&quot;,
        &quot;points&quot;: 3600,
        &quot;timespan&quot;: &quot;1:00:00&quot;
      },
      {
        &quot;granularity&quot;: &quot;0:30:00&quot;,
        &quot;points&quot;: 48,
        &quot;timespan&quot;: &quot;1 day, 0:00:00&quot;
      }
    ],
    &quot;name&quot;: &quot;low&quot;
  },
  &quot;created_by_project_id&quot;: &quot;e8afeeb3-4ae6-4888-96f8-2fae69d24c01&quot;,
  &quot;created_by_user_id&quot;: &quot;c10829c6-48e2-4d14-ac2b-bfba3b17216a&quot;,
  &quot;id&quot;: &quot;387101dc-e4b1-4602-8f40-e7be9f0ed46a&quot;,
  &quot;name&quot;: null,
  &quot;resource_id&quot;: null
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The &lt;code&gt;archive_policy_name&lt;/code&gt; parameter defines how the measures being sent are going to be aggregated. You can also define archive policies using the API and specify what kind of aggregation period and granularity you want. In this case, the &lt;em&gt;low&lt;/em&gt; archive policy keeps 1 hour of data aggregated over 1 second and 1 day of data aggregated over 30 minutes. The functions used for aggregation are standard mathematical functions – standard deviation, minimum, maximum… and even the 95th percentile. All of that is obviously customizable, and you can create your own archive policies.&lt;/p&gt;
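&lt;p&gt;The three fields of each archive policy definition are tied together by a simple relation: the timespan is just the granularity multiplied by the number of points. A quick check against the two definitions of the &lt;em&gt;low&lt;/em&gt; policy shown in the reply above:&lt;/p&gt;

```python
from datetime import timedelta

def policy_timespan(granularity, points):
    """Each definition keeps `points` aggregates of size `granularity`,
    so the retained timespan is simply their product."""
    return granularity * points

# The two definitions of the "low" policy: 3600 one-second points
# cover one hour, and 48 thirty-minute points cover one day.
one_second = policy_timespan(timedelta(seconds=1), 3600)
half_hour = policy_timespan(timedelta(minutes=30), 48)
```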
&lt;p&gt;If you don&apos;t want to specify the archive policy manually for each metric, you can also create &lt;em&gt;archive policy rules&lt;/em&gt;, which apply a specific archive policy based on the metric name; e.g. metrics matching &lt;code&gt;disk.*&lt;/code&gt; will be high-resolution metrics, so they will use the &lt;code&gt;high&lt;/code&gt; archive policy.&lt;/p&gt;
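&lt;p&gt;As a minimal sketch of the idea (the rule list below is made up, and real Gnocchi matching semantics may differ), such a rule engine can be as simple as first-match on glob patterns:&lt;/p&gt;

```python
import fnmatch

def pick_archive_policy(metric_name, rules, default="low"):
    """Return the policy of the first rule whose glob pattern matches
    the metric name, falling back to a default policy."""
    for pattern, policy in rules:
        if fnmatch.fnmatch(metric_name, pattern):
            return policy
    return default

# Hypothetical rules for illustration only.
rules = [("disk.*", "high"), ("network.*", "medium")]
```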
&lt;p&gt;It&apos;s also worth noting Gnocchi is precise up to the nanosecond and is not tied to the current time. You can manipulate and inject measures that are years old and precise to the nanosecond. You can also inject points with old timestamps (i.e. old compared to the most recent one in the timeseries) with an archive policy allowing it (see &lt;code&gt;back_window&lt;/code&gt; parameter).&lt;/p&gt;
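&lt;p&gt;A sketch of how such a back window could work – this is my simplified reading of the semantics, not Gnocchi&apos;s actual implementation: a point is accepted if it is at most &lt;code&gt;back_window&lt;/code&gt; aggregation periods older than the start of the most recent period:&lt;/p&gt;

```python
def accepts_timestamp(ts, newest_period_start, granularity, back_window):
    """Assumed semantics: a measure is accepted when its timestamp is at
    most `back_window` aggregation periods before the start of the most
    recent period already aggregated (all values in seconds here)."""
    return ts >= newest_period_start - back_window * granularity

# With back_window = 0, only the current period accepts new points;
# with back_window = 1, the previous period is still writable too.
current_ok = accepts_timestamp(100, 100, 60, 0)
previous_ok = accepts_timestamp(40, 100, 60, 1)
```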
&lt;p&gt;It&apos;s then possible to send measures to this metric:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;POST /v1/metric/387101dc-e4b1-4602-8f40-e7be9f0ed46a/measures HTTP/1.1
Content-Type: application/json

[
  {
    &quot;timestamp&quot;: &quot;2014-10-06T14:33:57&quot;,
    &quot;value&quot;: 43.1
  },
  {
    &quot;timestamp&quot;: &quot;2014-10-06T14:34:12&quot;,
    &quot;value&quot;: 12
  },
  {
    &quot;timestamp&quot;: &quot;2014-10-06T14:34:20&quot;,
    &quot;value&quot;: 2
  }
  ]
  
HTTP/1.1 204 No Content
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;These measures are synchronously aggregated and stored into the configured storage backend. Our most scalable storage drivers for now are either based on &lt;a href=&quot;http://launchpad.net/swift&quot;&gt;Swift&lt;/a&gt; or &lt;a href=&quot;http://ceph.com&quot;&gt;Ceph&lt;/a&gt; which are both scalable storage objects systems.&lt;/p&gt;
&lt;p&gt;It&apos;s then possible to retrieve these values:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;GET /v1/metric/387101dc-e4b1-4602-8f40-e7be9f0ed46a/measures HTTP/1.1

HTTP/1.1 200 OK
Content-Type: application/json; charset=UTF-8

[
  [
    &quot;2014-10-06T14:30:00.000000Z&quot;,
    1800.0,
    19.033333333333335
  ],
  [
    &quot;2014-10-06T14:33:57.000000Z&quot;,
    1.0,
    43.1
  ],
  [
    &quot;2014-10-06T14:34:12.000000Z&quot;,
    1.0,
    12.0
  ],
  [
    &quot;2014-10-06T14:34:20.000000Z&quot;,
    1.0,
    2.0
  ]
]
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;As older Ceilometer users might notice here, metrics only store points and values – nothing fancy such as metadata anymore.&lt;/p&gt;
&lt;p&gt;By default, values eagerly aggregated using mean are returned for all supported granularities. You can obviously specify a time range or a different aggregation function using the &lt;code&gt;aggregation&lt;/code&gt;, &lt;code&gt;start&lt;/code&gt; and &lt;code&gt;stop&lt;/code&gt; query parameters.&lt;/p&gt;
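&lt;p&gt;To make the aggregation concrete, here is a minimal sketch (not Gnocchi&apos;s Carbonara code) that buckets the three measures posted earlier by a 30-minute granularity and averages each bucket, reproducing the first row of the reply above:&lt;/p&gt;

```python
from collections import defaultdict
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)

def aggregate(measures, granularity, func=lambda vs: sum(vs) / len(vs)):
    """Group raw (timestamp, value) measures into fixed buckets of
    `granularity` and aggregate each bucket (mean by default)."""
    buckets = defaultdict(list)
    for ts, value in measures:
        seconds = (ts - EPOCH).total_seconds()
        start = seconds - seconds % granularity.total_seconds()
        buckets[EPOCH + timedelta(seconds=start)].append(value)
    return sorted((ts, func(values)) for ts, values in buckets.items())

measures = [
    (datetime(2014, 10, 6, 14, 33, 57), 43.1),
    (datetime(2014, 10, 6, 14, 34, 12), 12),
    (datetime(2014, 10, 6, 14, 34, 20), 2),
]
# All three measures fall into the same 30-minute bucket starting at
# 14:30:00, whose mean is about 19.0333, as in the reply above.
series = aggregate(measures, timedelta(minutes=30))
```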
&lt;p&gt;Gnocchi also supports doing aggregation across aggregated metrics:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;GET /v1/aggregation/metric?metric=65071775-52a8-4d2e-abb3-1377c2fe5c55&amp;amp;metric=9ccdd0d6-f56a-4bba-93dc-154980b6e69a&amp;amp;start=2014-10-06T14:34&amp;amp;aggregation=mean HTTP/1.1

HTTP/1.1 200 OK
Content-Type: application/json; charset=UTF-8

[
  [
    &quot;2014-10-06T14:34:12.000000Z&quot;,
    1.0,
    12.25
  ],
  [
    &quot;2014-10-06T14:34:20.000000Z&quot;,
    1.0,
    11.6
  ]
]
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This computes the mean of the means of metrics &lt;code&gt;65071775-52a8-4d2e-abb3-1377c2fe5c55&lt;/code&gt; and &lt;code&gt;9ccdd0d6-f56a-4bba-93dc-154980b6e69a&lt;/code&gt;, starting on October 6th, 2014 at 14:34 UTC.&lt;/p&gt;
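&lt;p&gt;A minimal sketch of this cross-metric aggregation (the values of the second metric below are invented so that the result matches the reply above): group the already-aggregated values by timestamp, then average each group:&lt;/p&gt;

```python
from collections import defaultdict

def cross_metric_aggregate(series_list, func=lambda vs: sum(vs) / len(vs)):
    """Aggregate across metrics that are already aggregated per metric:
    group values by timestamp, then apply the function (mean here)."""
    grouped = defaultdict(list)
    for series in series_list:
        for ts, value in series:
            grouped[ts].append(value)
    return sorted((ts, func(values)) for ts, values in grouped.items())

# metric_a reuses the per-metric means seen earlier; metric_b's values
# are hypothetical, chosen to reproduce the "mean of means" reply.
metric_a = [("14:34:12", 12.0), ("14:34:20", 2.0)]
metric_b = [("14:34:12", 12.5), ("14:34:20", 21.2)]
result = cross_metric_aggregate([metric_a, metric_b])
```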
&lt;h2&gt;Indexing your resources&lt;/h2&gt;
&lt;p&gt;Another object and concept that Gnocchi provides is the ability to manipulate resources. There is a basic type of resource, called &lt;em&gt;generic&lt;/em&gt;, which has very few attributes. You can extend this type to specialize it, and that&apos;s what Gnocchi does by default by providing resource types known to OpenStack such as &lt;em&gt;instance&lt;/em&gt;, &lt;em&gt;volume&lt;/em&gt;, &lt;em&gt;network&lt;/em&gt; or even &lt;em&gt;image&lt;/em&gt;.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;POST /v1/resource/generic HTTP/1.1

Content-Type: application/json

{
  &quot;id&quot;: &quot;75C44741-CC60-4033-804E-2D3098C7D2E9&quot;,
  &quot;project_id&quot;: &quot;BD3A1E52-1C62-44CB-BF04-660BD88CD74D&quot;,
  &quot;user_id&quot;: &quot;BD3A1E52-1C62-44CB-BF04-660BD88CD74D&quot;
}

HTTP/1.1 201 Created
Location: http://localhost/v1/resource/generic/75c44741-cc60-4033-804e-2d3098c7d2e9
ETag: &quot;e3acd0681d73d85bfb8d180a7ecac75fce45a0dd&quot;
Last-Modified: Fri, 17 Apr 2015 11:18:48 GMT
Content-Type: application/json; charset=UTF-8

{
  &quot;created_by_project_id&quot;: &quot;ec181da1-25dd-4a55-aa18-109b19e7df3a&quot;,
  &quot;created_by_user_id&quot;: &quot;4543aa2a-6ebf-4edd-9ee0-f81abe6bb742&quot;,
  &quot;ended_at&quot;: null,
  &quot;id&quot;: &quot;75c44741-cc60-4033-804e-2d3098c7d2e9&quot;,
  &quot;metrics&quot;: {},
  &quot;project_id&quot;: &quot;bd3a1e52-1c62-44cb-bf04-660bd88cd74d&quot;,
  &quot;revision_end&quot;: null,
  &quot;revision_start&quot;: &quot;2015-04-17T11:18:48.696288Z&quot;,
  &quot;started_at&quot;: &quot;2015-04-17T11:18:48.696275Z&quot;,
  &quot;type&quot;: &quot;generic&quot;,
  &quot;user_id&quot;: &quot;bd3a1e52-1c62-44cb-bf04-660bd88cd74d&quot;
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The resource is created with the UUID provided by the user. Gnocchi handles the history of the resource, and that&apos;s what the &lt;code&gt;revision_start&lt;/code&gt; and &lt;code&gt;revision_end&lt;/code&gt; fields are for. They indicate the lifetime of this revision of the resource. The &lt;code&gt;ETag&lt;/code&gt; and &lt;code&gt;Last-Modified&lt;/code&gt; headers are also unique to this resource revision and can be used in subsequent requests via the &lt;code&gt;If-Match&lt;/code&gt; or &lt;code&gt;If-None-Match&lt;/code&gt; headers, for example:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;GET /v1/resource/generic/75c44741-cc60-4033-804e-2d3098c7d2e9 HTTP/1.1
If-None-Match: &quot;e3acd0681d73d85bfb8d180a7ecac75fce45a0dd&quot;

HTTP/1.1 304 Not Modified
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Which is useful to synchronize and update any view of the resources you might have in your application.&lt;/p&gt;
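&lt;p&gt;The server-side logic these headers enable can be sketched in a few lines (a simplified illustration, not Gnocchi code): when the client already holds the current revision&apos;s ETag, the server answers &lt;em&gt;304 Not Modified&lt;/em&gt; with no body, and the client keeps its cached copy:&lt;/p&gt;

```python
def conditional_get(resource, if_none_match=None):
    """Conditional GET: answer 304 with no body when the client's
    If-None-Match value matches the current revision's ETag."""
    if if_none_match == resource["etag"]:
        return 304, None
    return 200, resource

# Hypothetical resource record; the ETag value is illustrative.
resource = {"etag": '"e3acd068"', "id": "75c44741"}
```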
&lt;p&gt;You can use the &lt;code&gt;PATCH&lt;/code&gt; HTTP method to modify properties of the resource, which will create a new revision of it. The history of the resources is obviously available via the REST API.&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;metrics&lt;/code&gt; property of a resource allows you to link metrics to it. You can link existing metrics or create new ones dynamically:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;POST /v1/resource/generic HTTP/1.1
Content-Type: application/json

{
  &quot;id&quot;: &quot;AB68DA77-FA82-4E67-ABA9-270C5A98CBCB&quot;,
  &quot;metrics&quot;: {
    &quot;temperature&quot;: {
      &quot;archive_policy_name&quot;: &quot;low&quot;
    }
  },
  &quot;project_id&quot;: &quot;BD3A1E52-1C62-44CB-BF04-660BD88CD74D&quot;,
  &quot;user_id&quot;: &quot;BD3A1E52-1C62-44CB-BF04-660BD88CD74D&quot;
}

HTTP/1.1 201 Created
Location: http://localhost/v1/resource/generic/ab68da77-fa82-4e67-aba9-270c5a98cbcb
ETag: &quot;9f64c8890989565514eb50c5517ff01816d12ff6&quot;
Last-Modified: Fri, 17 Apr 2015 14:39:22 GMT
Content-Type: application/json; charset=UTF-8

{
  &quot;created_by_project_id&quot;: &quot;cfa2ebb5-bbf9-448f-8b65-2087fbecf6ad&quot;,
  &quot;created_by_user_id&quot;: &quot;6aadfc0a-da22-4e69-b614-4e1699d9e8eb&quot;,
  &quot;ended_at&quot;: null,
  &quot;id&quot;: &quot;ab68da77-fa82-4e67-aba9-270c5a98cbcb&quot;,
  &quot;metrics&quot;: {
    &quot;temperature&quot;: &quot;ad53cf29-6d23-48c5-87c1-f3bf5e8bb4a0&quot;
  },
  &quot;project_id&quot;: &quot;bd3a1e52-1c62-44cb-bf04-660bd88cd74d&quot;,
  &quot;revision_end&quot;: null,
  &quot;revision_start&quot;: &quot;2015-04-17T14:39:22.181615Z&quot;,
  &quot;started_at&quot;: &quot;2015-04-17T14:39:22.181601Z&quot;,
  &quot;type&quot;: &quot;generic&quot;,
  &quot;user_id&quot;: &quot;bd3a1e52-1c62-44cb-bf04-660bd88cd74d&quot;
}
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Haystack, needle? Find!&lt;/h2&gt;
&lt;p&gt;With such a system, it becomes very easy to index all your resources, meter them and retrieve this data. What&apos;s even more interesting is to query the system to find and list the resources you are interested in!&lt;/p&gt;
&lt;p&gt;You can search for a resource based on any field, for example:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;POST /v1/search/resource/instance HTTP/1.1
Content-Type: application/json

{
  &quot;=&quot;: {
    &quot;user_id&quot;: &quot;bd3a1e52-1c62-44cb-bf04-660bd88cd74d&quot;
  }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;That query will return a list of all resources owned by the &lt;code&gt;user_id&lt;/code&gt; &lt;code&gt;bd3a1e52-1c62-44cb-bf04-660bd88cd74d&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;You can do fancier queries such as retrieving all the instances started by a user this month:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;POST /v1/search/resource/instance HTTP/1.1
Content-Type: application/json
Content-Length: 113

{
  &quot;and&quot;: [
    {
      &quot;=&quot;: {
        &quot;user_id&quot;: &quot;bd3a1e52-1c62-44cb-bf04-660bd88cd74d&quot;
      }
    },
    {
      &quot;&amp;gt;=&quot;: {
        &quot;started_at&quot;: &quot;2015-04-01&quot;
      }
    }
  ]
}
&lt;/code&gt;&lt;/pre&gt;
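&lt;p&gt;These query documents are just nested operator trees, so their semantics fit in a toy evaluator (a sketch for illustration – Gnocchi&apos;s indexer actually runs such queries against its database):&lt;/p&gt;

```python
import operator

OPS = {"=": operator.eq, ">=": operator.ge, "<=": operator.le}

def matches(query, resource):
    """Recursively evaluate a Gnocchi-style query tree against a
    resource represented as a flat dict (simplified sketch)."""
    (op, operand), = query.items()
    if op == "and":
        return all(matches(sub, resource) for sub in operand)
    if op == "or":
        return any(matches(sub, resource) for sub in operand)
    (field, value), = operand.items()
    return OPS[op](resource[field], value)

# ISO date strings compare correctly as plain strings.
resource = {"user_id": "bd3a1e52", "started_at": "2015-04-07"}
query = {"and": [{"=": {"user_id": "bd3a1e52"}},
                 {">=": {"started_at": "2015-04-01"}}]}
```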
&lt;p&gt;And you can do even fancier queries than the fancy ones (still following?). What if we wanted to retrieve all the instances that were on host &lt;code&gt;foobar&lt;/code&gt; on April 15th and already had an hour of uptime? Let&apos;s ask Gnocchi to look in the history!&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;POST /v1/search/resource/instance?history=true HTTP/1.1
Content-Type: application/json
Content-Length: 113

{
  &quot;and&quot;: [
    {
      &quot;=&quot;: {
        &quot;host&quot;: &quot;foobar&quot;
      }
    },
    {
      &quot;&amp;gt;=&quot;: {
        &quot;lifespan&quot;: &quot;1 hour&quot;
      }
    },
    {
      &quot;&amp;lt;=&quot;: {
        &quot;revision_start&quot;: &quot;2015-04-15&quot;
      }
    }

  ]
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I could also mention that you can &lt;a href=&quot;http://docs.openstack.org/developer/gnocchi/rest.html#searching-for-values-in-metrics&quot;&gt;search for values in metrics&lt;/a&gt;.&lt;br /&gt;
One feature that I will very likely include in Gnocchi 1.1 is the ability to search for resources whose metrics match some value – for example, the ability to search for instances whose CPU consumption was over 80% during a month.&lt;/p&gt;
&lt;h2&gt;Cherries on the cake&lt;/h2&gt;
&lt;p&gt;While Gnocchi is well integrated with and based on common OpenStack technology, please do note that it is completely able to function without any other OpenStack component and is pretty straightforward to deploy.&lt;/p&gt;
&lt;p&gt;Gnocchi also implements a full RBAC system based on the &lt;a href=&quot;http://docs.openstack.org/developer/oslo.policy/&quot;&gt;OpenStack standard oslo.policy&lt;/a&gt; and which allows pretty fine grained control of permissions.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/gnocchi-resource-html.png&quot; alt=&quot;gnocchi-resource-html&quot; /&gt;&lt;/p&gt;
&lt;p&gt;There is also some ongoing work on HTML rendering when browsing the API using a Web browser. While still simple, we&apos;d like to have a minimal Web interface served on top of the API for the same price!&lt;/p&gt;
&lt;p&gt;The Ceilometer alarm subsystem supports Gnocchi as of the Kilo release, meaning you can use it to trigger actions when a metric value crosses a threshold. And OpenStack &lt;a href=&quot;http://launchpad.net/heat&quot;&gt;Heat&lt;/a&gt; also supports auto-scaling your instances based on Ceilometer+Gnocchi alarms.&lt;/p&gt;
&lt;p&gt;And there are a few more API calls that I didn&apos;t talk about here, so don&apos;t hesitate to take a peek at the &lt;a href=&quot;http://gnocchi.xyz&quot;&gt;full documentation&lt;/a&gt;!&lt;/p&gt;
&lt;h2&gt;Towards Gnocchi 1.1!&lt;/h2&gt;
&lt;p&gt;Gnocchi is a different beast in the OpenStack community. It is under the umbrella of the Ceilometer program, but it&apos;s one of the first projects that is not part of the (old) integrated release. Therefore we decided on a release schedule not directly tied to OpenStack&apos;s, and we&apos;ll release more often than the rest of the old OpenStack components – probably once every 2 months or so.&lt;/p&gt;
&lt;p&gt;What&apos;s coming next is closer integration with Ceilometer (e.g. moving the dispatcher code from Gnocchi to Ceilometer) and probably more features as we get more requests from our users. We are also exploring different backends, such as InfluxDB (storage) or MongoDB (indexer).&lt;/p&gt;
&lt;p&gt;Stay tuned, and happy hacking!&lt;/p&gt;
</content:encoded><category>gnocchi</category><category>openstack</category></item><item><title>OpenStack Ceilometer and the Gnocchi experiment</title><link>https://julien.danjou.info/blog/openstack-ceilometer-the-gnocchi-experiment/</link><guid isPermaLink="true">https://julien.danjou.info/blog/openstack-ceilometer-the-gnocchi-experiment/</guid><description>A little more than 2 years ago, the Ceilometer project was launched inside the OpenStack ecosystem. Its main objective was to measure OpenStack cloud platforms in order to provide data and mechanisms.</description><pubDate>Mon, 18 Aug 2014 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;A little more than 2 years ago, the &lt;a href=&quot;http://launchpad.net/ceilometer&quot;&gt;Ceilometer&lt;/a&gt; project was launched inside the OpenStack ecosystem. Its main objective was to measure OpenStack cloud platforms in order to provide data and mechanisms for functionalities such as billing, alarming or capacity planning.&lt;/p&gt;
&lt;p&gt;In this article, I would like to relate what I&apos;ve been doing with other Ceilometer developers over the last 5 months. I&apos;ve lowered my direct involvement in Ceilometer itself to concentrate on solving one of its biggest issues at the source, and I think it&apos;s high time to take a break and talk about it.&lt;/p&gt;
&lt;h2&gt;Ceilometer early design&lt;/h2&gt;
&lt;p&gt;For the last years, Ceilometer didn&apos;t change its core architecture. Without diving too much into all its parts, one of the early design decisions was to build the metering around a data structure we called &lt;strong&gt;samples&lt;/strong&gt;. A sample is generated each time Ceilometer measures something. It is composed of a few fields, such as the id of the resource that is metered, the user and project ids owning that resource, the meter name, the measured value, a timestamp and a few free-form metadata fields. Each time Ceilometer measures something, one of its components (an agent, a pollster…) constructs and emits a sample headed for the storage component that we call the &lt;strong&gt;collector&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;This collector is responsible for storing the samples into a database. The Ceilometer collector uses a pluggable storage system, meaning that you can pick any database system you prefer. Our original implementation has been based on MongoDB from the beginning, but we then added a SQL driver, and people contributed things such as HBase or DB2 support.&lt;/p&gt;
&lt;p&gt;The REST API exposed by Ceilometer allows executing various read requests on this data store. It can return the list of resources that have been measured for a particular project, or compute statistics on metrics. Allowing such a large panel of possibilities with such a flexible data structure lets you do a lot of different things with Ceilometer, as you can query the data in almost any way you want.&lt;/p&gt;
&lt;h2&gt;The scalability issue&lt;/h2&gt;
&lt;p&gt;We soon started encountering scalability issues with many of the read requests made via the REST API. A lot of the requests require the data storage to do full scans of all the stored samples. Indeed, the fact that the API allows you to filter on any field, including the free-form metadata (meaning non-indexed key/value tuples), has a terrible cost in terms of performance (as pointed out before, the metadata is attached to each &lt;em&gt;sample&lt;/em&gt; generated by Ceilometer and stored as is). That basically means that the &lt;em&gt;sample&lt;/em&gt; data structure is stored, in most drivers, in just one table or collection, in order to be able to scan everything at once, and there&apos;s no good &quot;perfect&quot; sharding solution, making data storage scalability painful.&lt;/p&gt;
&lt;p&gt;It turns out that the Ceilometer REST API is unable to handle most requests in a timely manner, as most operations are &lt;em&gt;O(n)&lt;/em&gt; where &lt;em&gt;n&lt;/em&gt; is the number of samples recorded (see &lt;a href=&quot;http://en.wikipedia.org/wiki/Big_O_notation&quot;&gt;big O notation&lt;/a&gt; if you&apos;re unfamiliar with it). That number can grow very rapidly in an environment with thousands of metered nodes and a data retention of several weeks. Fortunately, there are a few optimizations that make things smoother in the general case, but as soon as you run specific queries, the API becomes barely usable.&lt;/p&gt;
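&lt;p&gt;The core of the problem can be shown in a few lines (a sketch for illustration, not Ceilometer code): a filter on free-form, non-indexed metadata can only be answered by inspecting every stored sample, a linear scan:&lt;/p&gt;

```python
def scan_samples(samples, **metadata):
    """Filtering on free-form metadata cannot use an index: every
    stored sample must be inspected, so the cost is O(n) in the total
    number of samples, whatever the filter actually selects."""
    return [s for s in samples
            if all(s["metadata"].get(k) == v for k, v in metadata.items())]

# Two hypothetical samples; in production, n is in the millions.
samples = [
    {"meter": "cpu", "value": 80, "metadata": {"host": "node1"}},
    {"meter": "cpu", "value": 20, "metadata": {"host": "node2"}},
]
```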
&lt;p&gt;During this last year, as the Ceilometer PTL, I experienced these issues first hand, since a lot of people were giving me this kind of feedback. We started several blueprints to improve the situation, but it was soon clear to me that this was not going to be enough anyway.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/unacceptable.jpg&quot; alt=&quot;unacceptable&quot; /&gt;&lt;/p&gt;
&lt;h2&gt;Thinking outside the box&lt;/h2&gt;
&lt;p&gt;Unfortunately, the PTL job didn&apos;t leave me enough time to work on the actual code nor to play with anything new. I was coping with most of the project bureaucracy and wasn&apos;t able to work on any good solution to tackle the issue at its root. Still, I had a few ideas that I wanted to try, and as soon as I stepped down from the PTL role, I stopped working on Ceilometer itself to try something new and think a bit outside the box.&lt;/p&gt;
&lt;p&gt;When one takes a look at what has been brought into Ceilometer recently, one can see that Ceilometer actually needs to handle two types of data: events and metrics.&lt;/p&gt;
&lt;p&gt;Events are data generated when something happens: an instance starts, a volume is attached, or an HTTP request is sent to a REST API server. These are events that Ceilometer needs to collect and store. Most OpenStack components are able to send such events using the notification system built into &lt;em&gt;&lt;a href=&quot;https://wiki.openstack.org/wiki/Oslo/Messaging&quot;&gt;oslo.messaging&lt;/a&gt;&lt;/em&gt;.&lt;/p&gt;
&lt;p&gt;Metrics are what Ceilometer needs to store that is not necessarily tied to an event. Think about an instance&apos;s CPU usage, a router&apos;s network bandwidth usage, the number of images that Glance is storing for you, etc. These are not events, since nothing is happening. These are facts, states we need to meter.&lt;/p&gt;
&lt;p&gt;Computing statistics for billing or capacity planning requires both of these data sources, but they should be distinct. Based on that assumption, and the fact that Ceilometer was getting support for storing events, I started to focus on getting the metric part right.&lt;/p&gt;
&lt;p&gt;I had been a system administrator for a decade before jumping into OpenStack development, so I know a thing or two on how monitoring is done in this area, and what kind of technology operators rely on. I also know that there&apos;s still no silver bullet – this made it a good challenge.&lt;/p&gt;
&lt;p&gt;The first thing that came to my mind was to use some kind of time-series database, and export its access via a REST API – as we do in all OpenStack services. This should cover the metric storage pretty well.&lt;/p&gt;
&lt;h2&gt;Cooking Gnocchi&lt;/h2&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/gnocchi-logo-old-2.jpg&quot; alt=&quot;gnocchi-logo-old-2&quot; /&gt;&lt;/p&gt;
&lt;p&gt;At the end of April 2014, this led me to start a new project code-named Gnocchi. For the record, the name was picked after having confused the OpenStack Marconi project with OpenStack Macaroni so many times. At least one OpenStack project should have a &quot;pasta&quot; name, right?&lt;/p&gt;
&lt;p&gt;The point of starting a new project rather than sending patches to Ceilometer was that, first, I had no clue whether it was going to produce anything better, and second, I wanted to be able to iterate more rapidly without being strongly coupled to the release process.&lt;/p&gt;
&lt;p&gt;The first prototype started around the following idea: what you want is to meter things. That means storing a list of (timestamp, value) tuples for each of them. I&apos;ve named these things &quot;entities&quot;, as no assumptions are made about what they are. An entity can represent the temperature in a room or the CPU usage of an instance. The service shouldn&apos;t care and should be agnostic in this regard.&lt;/p&gt;
&lt;p&gt;One feature we discussed over several OpenStack summits in the Ceilometer sessions was the idea of doing aggregation: aggregating samples over a period of time in order to only store a smaller amount of them. This is something that time-series formats such as &lt;a href=&quot;http://oss.oetiker.ch/rrdtool/&quot;&gt;RRDtool&lt;/a&gt; have been doing on the fly for a long time, and I decided it was a good trail to follow.&lt;/p&gt;
&lt;p&gt;I assumed this was going to be a requirement when storing metrics into Gnocchi: the user would need to specify what kind of archiving they need, such as 1-second precision over a day, 1-hour precision over a year, or even both.&lt;/p&gt;
&lt;p&gt;The first driver written to achieve that and store those metrics inside Gnocchi was based on &lt;a href=&quot;http://graphite.wikidot.com/whisper&quot;&gt;whisper&lt;/a&gt;, the file format used to store metrics for the &lt;a href=&quot;http://graphite.wikidot.com/&quot;&gt;Graphite&lt;/a&gt; project. For the actual storage, the driver uses Swift, which has the advantage of being both part of OpenStack and scalable.&lt;/p&gt;
&lt;p&gt;Storing the metrics of each entity in a different &lt;em&gt;whisper&lt;/em&gt; file and putting them in Swift turned out to have a fantastic algorithmic complexity: &lt;em&gt;O(1)&lt;/em&gt;. Indeed, the cost of storing and retrieving metrics doesn&apos;t depend on the number of metrics you have, nor on the number of things you are metering. That is already a huge win compared to the current Ceilometer collector design.&lt;/p&gt;
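&lt;p&gt;The constant per-operation cost comes simply from addressing each entity&apos;s time series as one object keyed by the entity id. Here is a minimal sketch of that idea in Python, using a plain dict as a stand-in for a Swift container; the names are illustrative, not the actual driver code:&lt;/p&gt;

```python
# Minimal sketch: one stored object per entity, keyed by the entity id.
# A dict stands in for a Swift container; names are illustrative only.

store = {}  # object name -> list of (timestamp, value) tuples


def _object_name(entity_id):
    return "entity_%s" % entity_id


def write_measures(entity_id, measures):
    """Append (timestamp, value) pairs for one entity.

    One get and one put per entity: the cost does not depend on how
    many other entities exist, hence O(1) per operation.
    """
    name = _object_name(entity_id)
    existing = store.get(name, [])
    store[name] = sorted(existing + measures)


def read_measures(entity_id):
    """Fetch the whole time series of one entity in a single lookup."""
    return store.get(_object_name(entity_id), [])


write_measures("cpu-instance-1", [(1, 0.5), (2, 0.7)])
write_measures("cpu-instance-1", [(3, 0.6)])
print(read_measures("cpu-instance-1"))
```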
&lt;p&gt;However, it turned out that &lt;em&gt;whisper&lt;/em&gt; has a few limitations that I was unable to circumvent in any manner. I needed to patch it to remove a lot of its assumptions about manipulating files, or that everything is relative to now (&lt;code&gt;time.time()&lt;/code&gt;). I started to hack on that in my own fork, but… then everything broke. The &lt;em&gt;whisper&lt;/em&gt; code base is, well, not state of the art, and has zero unit tests. I was staring at a huge effort to transform &lt;em&gt;whisper&lt;/em&gt; into the time-series format I wanted, without being sure I wasn&apos;t going to break everything (remember, no test coverage).&lt;/p&gt;
&lt;p&gt;I decided to take a break and look into alternatives, and stumbled upon &lt;a href=&quot;http://pandas.pydata.org/&quot;&gt;Pandas&lt;/a&gt;, a data manipulation and statistics library for Python. It turns out that Pandas supports time series natively, and that it can do a lot of the smart computation needed in Gnocchi. I built a new file format leveraging Pandas for computing the time series and named it &lt;strong&gt;carbonara&lt;/strong&gt; (a wink to both the &lt;a href=&quot;https://github.com/graphite-project/carbon&quot;&gt;Carbon&lt;/a&gt; project and pasta, how clever!). The code is quite small (a third of &lt;em&gt;whisper&lt;/em&gt;&apos;s, 200 SLOC vs 600 SLOC), does not have many of the &lt;em&gt;whisper&lt;/em&gt; limitations, and… it has test coverage. These Carbonara files are then, in the same fashion, stored into Swift containers.&lt;/p&gt;
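&lt;p&gt;To give an idea of the kind of computation Pandas makes easy (this is a sketch with made-up data, not the actual Carbonara code), downsampling raw samples to an archive granularity takes only a few lines:&lt;/p&gt;

```python
import pandas as pd

# Raw (timestamp, value) samples for one entity; purely illustrative data.
samples = pd.Series(
    [0.5, 0.7, 0.6, 0.9],
    index=pd.to_datetime([
        "2014-04-01 12:00:02",
        "2014-04-01 12:00:31",
        "2014-04-01 12:01:05",
        "2014-04-01 12:01:47",
    ]),
)

# Aggregate over a period, as an archiving policy would require:
# here, the mean over 1-minute buckets.
per_minute = samples.resample("1min").mean()
print(per_minute)
```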
&lt;p&gt;Anyway, the Gnocchi storage driver system is designed in the same spirit as the rest of the OpenStack and Ceilometer storage driver systems: it&apos;s a plug-in system with an API, so anyone can write their own driver. Eoghan Glynn has already started to write an &lt;a href=&quot;http://influxdb.com/&quot;&gt;InfluxDB&lt;/a&gt; driver, working closely with the upstream developers of that database, and Dina Belova has started to write an &lt;a href=&quot;http://opentsdb.net/&quot;&gt;OpenTSDB&lt;/a&gt; driver. This helps make sure the driver API is designed the right way from the start.&lt;/p&gt;
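&lt;p&gt;Such a plug-in system boils down to a small abstract contract that every driver implements. A rough sketch of what that can look like; the method and class names here are assumptions for illustration, not Gnocchi&apos;s actual driver API:&lt;/p&gt;

```python
import abc


class StorageDriver(abc.ABC):
    """Abstract contract every storage driver implements.

    Illustrative only; the real Gnocchi driver interface may differ.
    """

    @abc.abstractmethod
    def add_measures(self, entity_id, measures):
        """Store (timestamp, value) pairs for an entity."""

    @abc.abstractmethod
    def get_measures(self, entity_id, start=None):
        """Retrieve the stored measures for an entity, sorted by time."""


class MemoryDriver(StorageDriver):
    """Trivial in-memory implementation, useful for tests."""

    def __init__(self):
        self._data = {}

    def add_measures(self, entity_id, measures):
        self._data.setdefault(entity_id, []).extend(measures)

    def get_measures(self, entity_id, start=None):
        return sorted(
            (t, v) for t, v in self._data.get(entity_id, ())
            if start is None or t >= start
        )


driver = MemoryDriver()
driver.add_measures("room-temp", [(2, 21.0), (1, 20.5)])
print(driver.get_measures("room-temp"))
```

&lt;p&gt;A Swift-backed driver, an InfluxDB driver, or an OpenTSDB driver would each implement the same two methods, so the REST API layer never has to care which backend is in use.&lt;/p&gt;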
&lt;h2&gt;Handling resources&lt;/h2&gt;
&lt;p&gt;Measuring individual entities is great and needed, but you also need to link them with resources. When measuring the temperature and the number of people in a room, it is useful to link these two separate entities to a resource (in that case, the room) and to give a name to these relations, so one is able to identify which attribute of the resource is actually measured. It is also important to be able to store attributes on these resources, such as their owners, the time they started and ended their existence, etc.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/gnocchi-relationship.png&quot; alt=&quot;gnocchi-relationship&quot; /&gt;&lt;/p&gt;
&lt;p&gt;Once this list of resources is collected, the next step is to list and filter them based on any criteria. One might want to retrieve the list of resources created last week, or the list of instances hosted on a particular node right now.&lt;/p&gt;
&lt;p&gt;Resources also need to be specialized. Some resources have attributes that must be stored in order for filtering to be useful. Think about an instance name or a router network.&lt;/p&gt;
&lt;p&gt;All of these requirements led to the design of what&apos;s called the &lt;em&gt;indexer&lt;/em&gt;. The indexer is responsible for indexing entities and resources, and for linking them together. The initial implementation is based on &lt;a href=&quot;http://sqlalchemy.org&quot;&gt;SQLAlchemy&lt;/a&gt; and should be pretty efficient: it&apos;s easy enough to index the most requested attributes (columns), and they are also correctly typed.&lt;/p&gt;
&lt;p&gt;We plan to establish a model for all known OpenStack resources (instances, volumes, networks, …) to store and index them into the Gnocchi indexer in order to request them in an efficient way from one place. The generic resource class can be used to handle generic resources that are not tied to OpenStack. It&apos;d be up to the users to store extra attributes.&lt;/p&gt;
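&lt;p&gt;To make the indexer idea concrete, here is a minimal SQLAlchemy sketch of a generic resource linked to its entities; the table, column, and class names are assumptions for illustration, not the actual Gnocchi schema:&lt;/p&gt;

```python
from sqlalchemy import Column, DateTime, ForeignKey, String, create_engine
from sqlalchemy.orm import Session, declarative_base

Base = declarative_base()


class Resource(Base):
    # A generic resource; specialized resources (instance, volume, ...)
    # would add their own typed, indexed columns.
    __tablename__ = "resource"
    id = Column(String(36), primary_key=True)
    owner = Column(String(255), index=True)
    started_at = Column(DateTime)
    ended_at = Column(DateTime)


class Entity(Base):
    # A measured thing, linked to its resource under a relation name,
    # e.g. the "temperature" of a room.
    __tablename__ = "entity"
    id = Column(String(36), primary_key=True)
    resource_id = Column(String(36), ForeignKey("resource.id"), index=True)
    name = Column(String(255))


engine = create_engine("sqlite://")  # in-memory database for the example
Base.metadata.create_all(engine)

with Session(engine) as session:
    session.add(Resource(id="room-1", owner="alice"))
    session.add(Entity(id="e-1", resource_id="room-1", name="temperature"))
    session.commit()
    print(session.query(Entity).filter_by(resource_id="room-1").count())
```

&lt;p&gt;Because the filterable attributes are real, typed, indexed columns rather than free-form metadata, queries like &quot;all entities of this resource&quot; or &quot;all resources owned by alice&quot; stay fast as the tables grow.&lt;/p&gt;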
&lt;p&gt;Dropping the free form metadata we used to have in Ceilometer makes sure that querying the indexer is going to be efficient and scalable.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/gnocchi-classes.png&quot; alt=&quot;gnocchi-classes&quot; /&gt;&lt;/p&gt;
&lt;h2&gt;REST API&lt;/h2&gt;
&lt;p&gt;All of this is exported via a REST API that was partially designed and documented in the &lt;a href=&quot;http://git.openstack.org/cgit/openstack/ceilometer-specs/tree/specs/juno/gnocchi.rst&quot;&gt;Gnocchi specification in the Ceilometer repository&lt;/a&gt;; though the spec is not up-to-date yet. We plan to auto-generate the documentation from the code as we are currently doing in Ceilometer.&lt;/p&gt;
&lt;p&gt;The REST API is pretty easy to use, and you can use it to manipulate entities and resources, and request the information back.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/gnocchi-architecture.png&quot; alt=&quot;gnocchi-architecture&quot; /&gt;&lt;/p&gt;
&lt;h2&gt;Roadmap &amp;amp; Ceilometer integration&lt;/h2&gt;
&lt;p&gt;All of this plan has been exposed and discussed with the Ceilometer team&lt;br /&gt;
during the last &lt;a href=&quot;https://www.openstack.org/summit/openstack-summit-atlanta-2014/&quot;&gt;OpenStack summit in Atlanta&lt;/a&gt; in May 2014, for the Juno release. I led a session about this entire concept, and convinced the team that using Gnocchi for our metric storage would be a good approach to solve the Ceilometer collector scalability issue.&lt;/p&gt;
&lt;p&gt;It was decided to conduct this project experiment in parallel with the current Ceilometer collector for the time being, and to see where that would lead the project.&lt;/p&gt;
&lt;h2&gt;Early benchmarks&lt;/h2&gt;
&lt;p&gt;Some engineers from Mirantis did a few benchmarks around Ceilometer and also against an early version of Gnocchi, and Dina Belova presented them to us during the mid-cycle sprint we organized in Paris in early July.&lt;/p&gt;
&lt;p&gt;The following graph sums up the current Ceilometer performance issue pretty well: the more metrics you feed it, the slower it becomes.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/image03.png&quot; alt=&quot;image03&quot; /&gt;&lt;/p&gt;
&lt;p&gt;For Gnocchi, while the numbers themselves are not fantastic, what is interesting is that all the graphs below show that performance is stable, with no correlation with the number of resources, entities, or measures. This shows that, indeed, most of the code is built around &lt;em&gt;O(1)&lt;/em&gt; complexity, and not &lt;em&gt;O(n)&lt;/em&gt; anymore.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/image00.png&quot; alt=&quot;image00&quot; /&gt;&lt;br /&gt;
&lt;img src=&quot;https://julien.danjou.info/content/images/03/image01.png&quot; alt=&quot;image01&quot; /&gt;&lt;br /&gt;
&lt;img src=&quot;https://julien.danjou.info/content/images/03/image04.png&quot; alt=&quot;image04&quot; /&gt;&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/image05.png&quot; alt=&quot;image05&quot; /&gt;&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/image06.png&quot; alt=&quot;image06&quot; /&gt;&lt;/p&gt;
&lt;h2&gt;Next steps&lt;/h2&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/clement-drawing-gnocchi.jpg&quot; alt=&quot;clement-drawing-gnocchi&quot; /&gt;&lt;/p&gt;
&lt;p&gt;While the Juno cycle is being wrapped up for most projects, including Ceilometer, Gnocchi development is still ongoing. Fortunately, the composite architecture of Ceilometer allows a lot of its features to be replaced by other code dynamically. That, for example, enables Gnocchi to provide a Ceilometer dispatcher plugin for its collector without having to ship the actual code in Ceilometer itself. This should help Gnocchi&apos;s development not be slowed down by the release process for now.&lt;/p&gt;
&lt;p&gt;The Ceilometer team aims to provide Gnocchi as a sort of technology preview with the Juno release, allowing it to be deployed alongside Ceilometer and plugged into it. We&apos;ll probably discuss how to integrate it into the project in a more permanent and solid manner during the &lt;a href=&quot;https://www.openstack.org/summit/openstack-paris-summit-2014/&quot;&gt;OpenStack Summit for Kilo&lt;/a&gt; that will take place next November in Paris.&lt;/p&gt;
</content:encoded><category>openstack</category><category>gnocchi</category></item></channel></rss>