<?xml version="1.0" encoding="UTF-8"?><rss version="2.0" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>openstack — jd:/dev/blog</title><description>Posts tagged &quot;openstack&quot; on jd:/dev/blog.</description><link>https://julien.danjou.info/</link><item><title>Gnocchi 4.3.0 released</title><link>https://julien.danjou.info/blog/gnocchi-4-3-0-released/</link><guid isPermaLink="true">https://julien.danjou.info/blog/gnocchi-4-3-0-released/</guid><description>This new minor release of Gnocchi took a bit longer than usual to arrive, but here it is!</description><pubDate>Mon, 30 Jul 2018 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;This new minor release of Gnocchi took a bit longer than usual to arrive, but here it is!&lt;/p&gt;
&lt;p&gt;So what&apos;s new in this version of Gnocchi? Well, according to &lt;a href=&quot;https://gnocchi.xyz/releasenotes/4.3.html&quot;&gt;the release notes&lt;/a&gt;, not much. There are only two new features:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;em&gt;gnocchi-injector&lt;/em&gt;, which allows injecting data directly for &lt;em&gt;metricd&lt;/em&gt; consumption. This is useful to test &lt;em&gt;metricd&lt;/em&gt; performance.&lt;/li&gt;
&lt;li&gt;The ability for the &lt;code&gt;/v1/aggregation/resources&lt;/code&gt; endpoint to read a string rather than a JSON formatted payload for filtering.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Nothing exciting here… however, several other changes are not user-visible and therefore not in those notes:&lt;/p&gt;
&lt;p&gt;Performance boost, everywhere!&lt;/p&gt;
&lt;p&gt;The storage engine has been largely improved to batch a ton of operations that used to be done on a per-metric basis. When ingesting new measures, Gnocchi was already storing those new points in batches. However, the processing done later by &lt;em&gt;metricd&lt;/em&gt; was, for the most part, single-metric based. This did not leverage the efficiency that some backends can offer, and created more I/O operations than necessary.&lt;/p&gt;
&lt;p&gt;Each incoming data sack is now processed in batch mode, making &lt;em&gt;metricd&lt;/em&gt; much faster at aggregating metric data! In local benchmarks, some scenarios showed an 8x improvement.&lt;/p&gt;
&lt;p&gt;This new internal storage API is not used by the REST API yet, as many operations exposed by the API are oriented toward a single metric. Leveraging it there might be a significant improvement for the next version of Gnocchi&apos;s API.&lt;/p&gt;
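&lt;p&gt;To illustrate the idea (this is a hypothetical sketch, not Gnocchi&apos;s actual storage code – the &lt;code&gt;FakeBackend&lt;/code&gt; class and its &lt;code&gt;write&lt;/code&gt; method are invented for the example), here is how writing a whole sack in one batch reduces the number of backend I/O operations compared to per-metric writes:&lt;/p&gt;

```python
# Hypothetical sketch of per-metric vs. batched processing.
# FakeBackend and its write() method are invented for illustration;
# they are not Gnocchi's real storage API.

class FakeBackend:
    """Counts I/O operations so the two strategies can be compared."""
    def __init__(self):
        self.io_ops = 0

    def write(self, payload):
        # One I/O operation per call, regardless of payload size.
        self.io_ops += 1

def process_per_metric(backend, sack):
    # Old behavior: one backend operation per metric.
    for metric, points in sack.items():
        backend.write({metric: points})

def process_batched(backend, sack):
    # New behavior: the whole sack is written in a single operation.
    backend.write(sack)

sack = {"metric-%d" % i: [1.0, 2.0, 3.0] for i in range(100)}

per_metric = FakeBackend()
process_per_metric(per_metric, sack)
batched = FakeBackend()
process_batched(batched, sack)
print(per_metric.io_ops, batched.io_ops)  # 100 vs. 1
```

&lt;p&gt;On a backend where each operation carries network latency, collapsing a hundred round trips into one is exactly the kind of change that produces the speedups mentioned above.&lt;/p&gt;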
&lt;p&gt;Happy upgrade!&lt;/p&gt;
</content:encoded><category>open-source</category><category>openstack</category></item><item><title>Lessons from OpenStack Telemetry: Deflation</title><link>https://julien.danjou.info/blog/lessons-from-openstack-telemetry-deflation/</link><guid isPermaLink="true">https://julien.danjou.info/blog/lessons-from-openstack-telemetry-deflation/</guid><description>This post is the second and final episode of Lessons from OpenStack Telemetry. If you have missed the first post, you can read it here.</description><pubDate>Thu, 19 Apr 2018 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;This post is the second and final episode of &lt;em&gt;Lessons from OpenStack Telemetry&lt;/em&gt;. If you have missed the first post, you can read it &lt;a href=&quot;https://julien.danjou.info/blog/lessons-from-openstack-telemetry-incubation&quot;&gt;here&lt;/a&gt;.&lt;/p&gt;
&lt;h2&gt;Splitting&lt;/h2&gt;
&lt;p&gt;At some point, the rules on adding new projects relaxed with the Big Tent initiative, allowing us to rename ourselves the OpenStack Telemetry team and to split Ceilometer into several subprojects: Aodh (alarm evaluation functionality) and Panko (event storage). Gnocchi was able to join the OpenStack Telemetry party for its first anniversary.&lt;/p&gt;
&lt;p&gt;Finally being able to split Ceilometer into several independent pieces of software allowed us to tackle technical debt more rapidly. We built autonomous teams for each project and gave them the same liberty they had in Ceilometer. The cost of migrating the code base to several projects was higher than we wanted it to be, but we managed to build a clear migration path nonetheless.&lt;/p&gt;
&lt;h2&gt;Gnocchi Shamble&lt;/h2&gt;
&lt;p&gt;With Gnocchi in town, we stopped all efforts on the Ceilometer storage and API and expected people to adopt Gnocchi. What we underestimated is the unwillingness of many operators to think about telemetry. They did not want to deploy anything to get telemetry features in the first place, so adding yet another component (a timeseries database) to get proper metric features was seen as a burden – and sometimes not seen at all.&lt;br /&gt;
Indeed, we also did not communicate enough on our vision for that transition. After two years of existence, many operators were still asking what Gnocchi was and why they needed it. They had deployed Ceilometer with its bogus storage and API and were confused about needing yet another piece of software.&lt;/p&gt;
&lt;p&gt;It took us more than two years to deprecate the Ceilometer storage and API, which is way too long.&lt;/p&gt;
&lt;h2&gt;Deflation&lt;/h2&gt;
&lt;p&gt;In the meantime, people were leaving the OpenStack boat. Soon enough, we started to feel the shortage of human resources. Smartly, we had never followed the OpenStack trend of imposing blueprints, specs, bug reports or any other process on contributors, in line with my list of &lt;a href=&quot;https://julien.danjou.info/blog/foss-projects-management-bad-practice&quot;&gt;open source best practices&lt;/a&gt;. This flexibility allowed us to iterate more rapidly: compared to other OpenStack projects, we were going faster in proportion to the size of our contributor base.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://images.unsplash.com/photo-1520018319835-74bf61f79844?ixlib=rb-0.3.5&amp;amp;q=80&amp;amp;fm=jpg&amp;amp;crop=entropy&amp;amp;cs=tinysrgb&amp;amp;w=1080&amp;amp;fit=max&amp;amp;ixid=eyJhcHBfaWQiOjExNzczfQ&amp;amp;s=7b410a77641efbb205b4157f7b4c62b0&quot; alt=&quot;Capturer Le moment&quot; /&gt;&lt;/p&gt;
&lt;p&gt;Nonetheless, we felt like we were bailing out a sinking ship. Our contributors were disappearing while we were swamped with technical debt: half-baked features, unfinished migrations, legacy choices and temporary hacks. After the big party, we had to wash the dishes and sweep the floor.&lt;/p&gt;
&lt;p&gt;Being part of OpenStack started to feel like a burden in many ways. The inertia of OpenStack as a big project was beginning to surface, so we put in a lot of effort to dodge most of its implications. Consequently, the team was perceived as an outlier, which does not help, especially when you have to interact a lot with your neighbors.&lt;/p&gt;
&lt;p&gt;The OpenStack Foundation never understood the organization of our team. They would refer to us as &quot;Ceilometer&quot; even though we had formally renamed ourselves &quot;Telemetry&quot;, since we now encompassed four server projects and a few libraries. For example, while Gnocchi was an OpenStack project for two years before leaving, it was never listed on the &lt;a href=&quot;https://www.openstack.org/software/project-navigator/&quot;&gt;project navigator&lt;/a&gt; maintained by the foundation.&lt;/p&gt;
&lt;p&gt;That&apos;s a funny anecdote that demonstrates the peculiarity of our team, and how it has been both a strength and a weakness.&lt;/p&gt;
&lt;h2&gt;Competition&lt;/h2&gt;
&lt;p&gt;Nobody was trying to do what we were doing when we started Ceilometer. We filled the space of metering OpenStack. However, as the number of companies involved increased, so did the friction, and some people grew unhappy. The race to have a seat at the table and become a &lt;em&gt;Project Team Leader&lt;/em&gt; was strong, so some people preferred to create their own project rather than play the contribution game. In many areas, including ours, that divided the effort to a ridiculous point where several teams were doing exactly the same thing, or were trying to step on each other&apos;s toes to kill the competitors.&lt;/p&gt;
&lt;p&gt;We spent a significant amount of time trying to bring other teams into the Telemetry scope to unify our efforts, without much success. Some companies were not embracing open source because of their cultural differences, while others had no interest in joining a project where they would not be seen as the leader.&lt;/p&gt;
&lt;p&gt;That fragmentation did not help us, but also did not do much harm in the end. Most of those projects are now either dead or becoming irrelevant as the rest of the world caught up on what they were trying to do.&lt;/p&gt;
&lt;h2&gt;Epilogue&lt;/h2&gt;
&lt;p&gt;As of 2018, I&apos;m the PTL for Telemetry – because nobody else ran. The official list of maintainers for the Telemetry projects counts five people: two are inactive, and three are part-time. During the latest development cycle (Queens), 48 people committed to Ceilometer, though only three developers made impactful contributions. The code size has been halved since its peak: Ceilometer is now 25k lines of code long.&lt;/p&gt;
&lt;p&gt;Panko and Aodh have no active developers. A Red Hat colleague and I are keeping the projects afloat so they continue to work.&lt;/p&gt;
&lt;p&gt;Gnocchi has humbly thrived since it left OpenStack. The stains from having been part of OpenStack are not yet all gone. It has a small community, but its users see its real value and enjoy using it.&lt;/p&gt;
&lt;p&gt;Those last six years have been intense, and riding the OpenStack train has been amazing. As I concluded in the first blog post of this series, most of us had a great time overall; the point of those writings is not to complain, but to reflect.&lt;/p&gt;
&lt;p&gt;I find it fascinating to see how the evolution of a piece of software and the metamorphosis of its community are entangled. The amount of politics that a corporately-backed project of this size generates is majestic and has a prominent influence on the outcome of software development.&lt;/p&gt;
&lt;p&gt;So, what&apos;s next? Well, as far as Ceilometer is concerned, we still have ideas and plans to keep shrinking its footprint to a minimum. We hope that one day Ceilometer will become irrelevant – at least that&apos;s what we&apos;re trying to achieve, so we don&apos;t have anything left to maintain. That mainly depends on how the myriad of OpenStack projects choose to address their metering.&lt;/p&gt;
&lt;p&gt;We don&apos;t see any future for Panko or Aodh.&lt;/p&gt;
&lt;p&gt;Gnocchi, now blooming outside of OpenStack, is still young and promising. We have plenty of ideas, and every new release brings fancy new features. Storing timeseries at large scale is exciting. Users are happy, and the ecosystem is growing.&lt;/p&gt;
&lt;p&gt;We&apos;ll see how all of that concludes, but I&apos;m sure there will be new lessons to learn and write about in another six years!&lt;/p&gt;
</content:encoded><category>openstack</category></item><item><title>Lessons from OpenStack Telemetry: Incubation</title><link>https://julien.danjou.info/blog/lessons-from-openstack-telemetry-incubation/</link><guid isPermaLink="true">https://julien.danjou.info/blog/lessons-from-openstack-telemetry-incubation/</guid><description>It was around that time in 2012 that a couple of fellow open-source enthusiasts and I started working on Ceilometer, the first piece of software from the OpenStack Telemetry project. Six years</description><pubDate>Thu, 12 Apr 2018 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;It was around that time in 2012 that a couple of fellow open-source enthusiasts and I started working on Ceilometer, the first piece of software from the OpenStack Telemetry project. Six years have passed since then. I&apos;ve been thinking about this blog post for several months (even years, maybe), but lacked the time and the hindsight needed to lay out my thoughts properly. In a series of posts, I would like to share my observations about the Ceilometer development history.&lt;/p&gt;
&lt;p&gt;To understand the full picture here, I think it is fair to start with a small retrospective on the project. I&apos;ll try to keep it short, and it will be unmistakably biased, even though I&apos;ll do my best to stay objective – bear with me.&lt;/p&gt;
&lt;h2&gt;Incubation&lt;/h2&gt;
&lt;p&gt;In early 2012, I remember discussing with the first Ceilometer developers the right strategy to solve the problem we were trying to address. The company I worked for wanted to run a public cloud, and billing resource usage was at the heart of its strategy. The fact that no component in OpenStack exposed any consumption API was a problem.&lt;/p&gt;
&lt;p&gt;We debated how to implement those metering features in the cloud platform. There were two natural solutions: either adding some resource accounting and reporting to each OpenStack project, or building a new piece of software on the side to cover for the lack of those functionalities.&lt;/p&gt;
&lt;p&gt;At that time, there were fewer than a dozen OpenStack projects. Still, the burden of patching every project seemed like an infinite task. Having code reviewed and merged in the most significant projects took several weeks, which, considering our timeline, was a show-stopper. We wanted to go fast.&lt;/p&gt;
&lt;p&gt;Pragmatism won, and we started implementing Ceilometer using the features each OpenStack project was offering to help us: very little.&lt;/p&gt;
&lt;p&gt;Our first and obvious candidate for usage retrieval was Nova, from which Ceilometer aimed to retrieve statistics about virtual machine instance utilization. Nova offered no API to retrieve that data – and still doesn&apos;t. Since waiting several months to have such an API exposed was out of the question, we took the shortcut of polling libvirt, Xen or VMware directly from Ceilometer.&lt;/p&gt;
&lt;p&gt;That&apos;s precisely how temporary hacks become historical design. Implementing this design broke the basis of the abstraction layer that Nova aims to offer.&lt;/p&gt;
&lt;p&gt;As time passed, several leads were followed to mitigate those trade-offs in better ways. But with each development cycle, getting anything merged in OpenStack became harder and harder. It went from patches taking ages to be reviewed to a long list of requirements for merging anything. Soon, you&apos;d have to create a blueprint to track your work, then write a full specification linked to that blueprint, with that specification itself being reviewed by a bunch of the so-called core developers. The specification had to be a thorough document covering every aspect of the work, from the problem being solved to the technical details of the implementation. Once the specification was approved, which could take an entire cycle (6 months), you&apos;d have to make sure that the Nova team would make your blueprint a priority. To make sure it was, you would have to fly a few thousand kilometers from home to an OpenStack Summit and orally argue with developers, in a room filled with hundreds of other folks, about the urgency of your feature compared to other blueprints.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/04/ods_1-tripleo_design_session.jpg&quot; alt=&quot;An OpenStack design session in Hong-Kong, 2013&quot; /&gt;&lt;/p&gt;
&lt;p&gt;Even if you passed all of those ordeals, the code you&apos;d send could be rejected, and you&apos;d get back to updating your specification to shed light on some particular points that confused people. Back to square one.&lt;/p&gt;
&lt;p&gt;Nobody wanted to play that game. Not in the Telemetry team at least.&lt;/p&gt;
&lt;p&gt;So Ceilometer continued to grow, surfing the OpenStack hype curve. More developers joined the project every cycle – each one with their own list of ideas, features or requirements cooked up by their in-house product manager.&lt;/p&gt;
&lt;p&gt;But many features did not belong in Ceilometer. They should have been in different projects. Ceilometer was the first OpenStack project to pass through the OpenStack Technical Committee incubation process that existed before the rules were relaxed.&lt;/p&gt;
&lt;p&gt;This incubation process was uncertain, long, and painful. We had to justify the existence of the project and many of the technical choices that had been made. Where we were expecting the committee to challenge us on fundamental decisions, such as breaking abstraction layers, it mostly nit-picked about Web frameworks or database storage.&lt;/p&gt;
&lt;h2&gt;Consequences&lt;/h2&gt;
&lt;p&gt;The rigidity of the process discouraged anyone from starting a new project for anything related to telemetry. Therefore, everyone went ahead and dumped their ideas into Ceilometer itself. With more than ten companies interested, friction was high, and the project was at some point pulled apart in all directions. This phenomenon was happening to every OpenStack project &lt;em&gt;anyway&lt;/em&gt;.&lt;/p&gt;
&lt;p&gt;On the one hand, many contributions brought marvelous pieces of technology to Ceilometer. We implemented several features you still don&apos;t find in any other metering system. Dynamically sharded, automatically horizontally scalable polling? Ceilometer has had that for years, whereas you can&apos;t have it in, e.g., Prometheus.&lt;/p&gt;
&lt;p&gt;On the other hand, there were tons of crappy features. Half-baked code merged because somebody needed to ship something. As the project grew further, some of us developers started to feel that this was getting out of control and could be disastrous. The technical debt was growing as fast as the project was.&lt;/p&gt;
&lt;p&gt;Several of the technical choices made were definitely &lt;em&gt;bad&lt;/em&gt;. The architecture was a mess; the messaging bus was easily overloaded, the storage engine was non-performant, etc. People would come to me (as I was the &lt;em&gt;Project Team Leader&lt;/em&gt; at that time) and ask why the REST API needed 20 minutes to reply to an autoscaling request. The willingness to solve everything for everyone was killing Ceilometer. It&apos;s around that time that I decided to step out of my PTL role and started working on Gnocchi to solve at least one of our biggest challenges: efficient data storage.&lt;/p&gt;
&lt;p&gt;Ceilometer was also suffering from the poor quality of many OpenStack projects. As Ceilometer retrieves data from a dozen other projects, it has to use their interfaces for data retrieval (API calls, notifications) – or sometimes compensate for their lack of any interface. Users were complaining about Ceilometer malfunctioning while the root of the problem was actually on the other side, in the polled project. The polling agent would try to retrieve the list of virtual machines running on Nova, but just listing and retrieving this information required several HTTP requests to Nova. And those basic retrieval requests would overload the Nova API, which offers no genuine interface from which the data could be retrieved in a small number of calls. And it had terrible performance.&lt;br /&gt;
From the point of view of the users, the load was generated by Ceilometer. Therefore, Ceilometer &lt;strong&gt;was&lt;/strong&gt; the problem. We had to imagine new ways of circumventing tons of limitations in our sibling projects. That was exhausting.&lt;/p&gt;
&lt;p&gt;At its peak, during the Juno and Kilo releases (early 2015), the code size of Ceilometer reached 54k lines of code, and the number of committers reached 100 individuals (20 regulars). We had close to zero happy users, operators hated us, and everybody was wondering what the hell was going on in those developers&apos; minds.&lt;/p&gt;
&lt;p&gt;Nonetheless, despite the impediments, most of us had a great time working on Ceilometer. Nothing&apos;s ever perfect. I&apos;ve learned tons of things during that period, which were actually mostly non-technical. Community management, social interactions, human behavior and politics were at the heart of the adventure, offering a great opportunity for self-improvement.&lt;/p&gt;
&lt;p&gt;In the next blog post, I will cover what happened in the years that followed that booming period, up until today. Stay tuned!&lt;/p&gt;
</content:encoded><category>openstack</category></item><item><title>Scaling a polling Python application with tooz</title><link>https://julien.danjou.info/blog/scaling-a-python-application-tooz/</link><guid isPermaLink="true">https://julien.danjou.info/blog/scaling-a-python-application-tooz/</guid><description>This article is the final one of the series I wrote about scaling a large number of connections in a Python application. If you don&apos;t remember what the problem we&apos;re trying to solve is, here it is, co</description><pubDate>Mon, 05 Mar 2018 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;This article is the final one of the series I wrote about scaling a large number of connections in a Python application. If you don&apos;t remember what the problem we&apos;re trying to solve is, here it is, coming from one of my followers:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;It so happened that I&apos;m currently working on scaling some Python app. Specifically, now I&apos;m trying to figure out the best way to scale SSH connections - when one server has to connect to thousands (or even tens of thousands) of remote machines in a short period of time (say, several minutes).&lt;br /&gt;
How would you write an application that does that in a scalable way?&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The &lt;a href=&quot;https://julien.danjou.info/blog/scaling-python-application-threads&quot;&gt;first blog post&lt;/a&gt; was exploring a solution based on threads, while the &lt;a href=&quot;https://julien.danjou.info/blog/scaling-python-application-asyncio&quot;&gt;second blog post&lt;/a&gt; was exploring an architecture around &lt;em&gt;asyncio&lt;/em&gt;.&lt;/p&gt;
&lt;p&gt;In the first two articles, we wrote programs that could handle this problem by using multiple &lt;em&gt;threads&lt;/em&gt; or &lt;em&gt;asyncio&lt;/em&gt; – or both. While this worked pretty well, it had some limitations, such as only using one computer. So this time, we&apos;re going to take a different approach and use multiple computers!&lt;/p&gt;
&lt;h3&gt;The job&lt;/h3&gt;
&lt;p&gt;As we&apos;ve already seen, writing a Python application that connects to a host over ssh can be done using &lt;a href=&quot;http://docs.paramiko.org/en/&quot;&gt;Paramiko&lt;/a&gt; or &lt;a href=&quot;https://github.com/ronf/asyncssh&quot;&gt;asyncssh&lt;/a&gt;. Once again, that will not be the focus of this blog post, since it is pretty straightforward to do.&lt;/p&gt;
&lt;p&gt;To keep this exercise simple, we&apos;ll reuse our &lt;code&gt;ping&lt;/code&gt; function from the first article. It looked like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import subprocess

def ping(hostname):
    p = subprocess.Popen([&quot;ping&quot;, &quot;-c&quot;, &quot;3&quot;, &quot;-w&quot;, &quot;1&quot;, hostname],
                         stdout=subprocess.DEVNULL,
                         stderr=subprocess.DEVNULL)
    return p.wait() == 0
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;As a reminder, running this program alone and pinging serially 255 IP addresses takes more than 10 minutes. Let&apos;s try to make it faster by running it in parallel.&lt;/p&gt;
&lt;h3&gt;The architecture&lt;/h3&gt;
&lt;p&gt;Remember: if pinging 255 hosts takes 10 minutes, pinging the whole Internet is going to take forever – at roughly 2.4 seconds per host, covering the 4 billion IPv4 addresses would take centuries.&lt;/p&gt;
&lt;p&gt;With our ping experiment, we already divided our mission (i.e. &quot;who&apos;s alive on the Internet?&quot;) into very small tasks (&quot;ping one host&quot;). If we want to ping 4 billion hosts, we need to run those tasks in parallel. But one computer is not going to be enough: we need to distribute those tasks to different hosts, so we can use some massive parallelism to go even faster!&lt;/p&gt;
&lt;p&gt;There are two ways to distribute such a set of tasks:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Use a queue. That works well for jobs that are not determined in advance, such as user-submitted tasks, or jobs that are going to be executed only once.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Use a distribution algorithm. That works only for tasks that are determined in advance and that are scheduled regularly, such as polling.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;We are going to pick the second option here, as those ping tasks (or polling tasks, in the original problem) should be run regularly. That approach will allow us to spread the jobs across several processes, which can even be spread across several nodes over a network. We also won&apos;t have to &quot;maintain&quot; a queue (i.e. make it work and monitor it), so that&apos;s a bonus point.&lt;/p&gt;
&lt;p&gt;That&apos;s infinite horizontal scalability!&lt;/p&gt;
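&lt;p&gt;Before diving into the details, it helps to see the naive form of a distribution algorithm: hash each task and take it modulo the number of workers. Every worker computes the same answer independently, with no queue to maintain. The worker names and hosts in this sketch are made up for illustration:&lt;/p&gt;

```python
# Naive deterministic distribution: hash the task, modulo the worker count.
# Every worker runs this same code and reaches the same conclusion without
# any communication. Worker names and hosts are made up for the example.
import zlib

WORKERS = ["worker1", "worker2", "worker3"]

def worker_for(task, workers):
    # zlib.crc32 is stable across runs, unlike Python's randomized hash().
    return workers[zlib.crc32(task.encode()) % len(workers)]

for host in ["10.0.0.%d" % i for i in range(1, 6)]:
    print(host, worker_for(host, WORKERS))
```

&lt;p&gt;The catch: when the number of workers changes, almost every task maps to a different worker, so the whole workload gets reshuffled. That is precisely the problem consistent hashing solves.&lt;/p&gt;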
&lt;h3&gt;The distribution algorithm&lt;/h3&gt;
&lt;p&gt;The algorithm we&apos;re going to use to distribute this task is based on a &lt;a href=&quot;https://en.wikipedia.org/wiki/Consistent_hashing&quot;&gt;consistent hashring&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Here&apos;s how it works in short. Picture a circular ring. We map objects onto this ring. The ring is then split into partitions. Those partitions are distributed among all the workers. The workers take care of jobs that are in the partitions they are responsible for.&lt;/p&gt;
&lt;p&gt;When a new node joins the ring, it is inserted between two nodes and takes a bit of their workload. When a node leaves the ring, the partitions it was taking care of are reassigned to its adjacent nodes.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/consistent-hashing.png&quot; alt=&quot;Diagram of consistent hashing ring with partitions distributed among worker nodes&quot; /&gt;&lt;/p&gt;
&lt;p&gt;If you want more details, there are plenty of explanations of how this algorithm works. Feel free to look online!&lt;/p&gt;
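&lt;p&gt;As a complement, here is a toy hashring in a few lines of Python. It is a simplified illustration of the idea, not the implementation Tooz uses:&lt;/p&gt;

```python
# Toy consistent hashring: a simplified illustration, not Tooz's code.
import bisect
import hashlib

class HashRing:
    def __init__(self, nodes, replicas=16):
        self.replicas = replicas
        self.ring = {}          # position on the ring to node name
        self.sorted_keys = []
        for node in nodes:
            self.add_node(node)

    def _hash(self, key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def add_node(self, node):
        # Each node takes several positions ("virtual nodes") on the ring,
        # which evens out the distribution of partitions.
        for i in range(self.replicas):
            position = self._hash("%s-%d" % (node, i))
            self.ring[position] = node
            bisect.insort(self.sorted_keys, position)

    def remove_node(self, node):
        # Only this node's partitions get reassigned; the others do not move.
        for i in range(self.replicas):
            position = self._hash("%s-%d" % (node, i))
            del self.ring[position]
            self.sorted_keys.remove(position)

    def node_for(self, obj):
        # Walk clockwise to the first node position at or after the object.
        position = self._hash(obj)
        index = bisect.bisect(self.sorted_keys, position) % len(self.sorted_keys)
        return self.ring[self.sorted_keys[index]]

ring = HashRing(["client1", "client2"])
print(ring.node_for("foobar"))
```

&lt;p&gt;The key property: when a node is removed, only the objects it owned move to another node; everything else stays put.&lt;/p&gt;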
&lt;p&gt;However, to make this work, we need to know which nodes are alive or dead. This is another problem to solve, and the best way to tackle it is to use a coordination mechanism. There are plenty of those, from &lt;a href=&quot;https://zookeeper.apache.org/&quot;&gt;Apache ZooKeeper&lt;/a&gt; to &lt;a href=&quot;https://coreos.com/etcd/&quot;&gt;etcd&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Without going into too much detail, those pieces of software provide a network service that every node can connect to in order to manage its state. If a client gets disconnected or crashes, it&apos;s then easy to consider it removed. That enables the application to get the full list of nodes and split the ring accordingly. There&apos;s no need for any shared state between the nodes other than who&apos;s alive and running.&lt;/p&gt;
&lt;h3&gt;Using group membership&lt;/h3&gt;
&lt;p&gt;To get a list of nodes that are available to help us pinging the Internet, we need a service that provides this and a library to interact with it. Since the use case is pretty simple and I don&apos;t know which backends you like the most, we&apos;re going to use the &lt;a href=&quot;https://pypi.python.org/pypi/tooz&quot;&gt;Tooz&lt;/a&gt; library.&lt;/p&gt;
&lt;p&gt;Tooz provides a coordination mechanism on top of a large variety of backends: ZooKeeper or etcd, as suggested earlier, but also &lt;a href=&quot;https://redis.io&quot;&gt;Redis&lt;/a&gt; or &lt;a href=&quot;https://memcached.org&quot;&gt;memcached&lt;/a&gt; for those who want to live more dangerously. Indeed, while ZooKeeper or etcd can be set up in a synchronized cluster, memcached, on the other hand, is a &lt;a href=&quot;https://en.wikipedia.org/wiki/Single_point_of_failure&quot;&gt;SPOF&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;For the sake of the exercise, we&apos;re going to use a single instance of etcd here. Thanks to Tooz, switching to another backend would be a one-line change anyway.&lt;/p&gt;
&lt;p&gt;Tooz provides a &lt;code&gt;tooz.coordination.Coordinator&lt;/code&gt; object that represents the connection to the coordination subsystem. It then exposes an API based on groups and members. A member is a node connected through a &lt;code&gt;Coordinator&lt;/code&gt; instance. A group is a place that members can join or leave.&lt;/p&gt;
&lt;p&gt;Here&apos;s a first implementation of a member joining a group and printing the member list:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import sys
import time

from tooz import coordination

## Check that a client and group ids are passed as arguments
if len(sys.argv) != 3:
    print(&quot;Usage: %s &amp;lt;client id&amp;gt; &amp;lt;group id&amp;gt;&quot; % sys.argv[0])
    sys.exit(1)

## Get the Coordinator object
c = coordination.get_coordinator(
    &quot;etcd3://localhost&quot;,
    sys.argv[1].encode())
## Start it (initiate connection).
c.start(start_heart=True)

group = sys.argv[2].encode()

## Create the group
try:
    c.create_group(group).get()
except coordination.GroupAlreadyExist:
    pass

## Join the group
c.join_group(group).get()

try:
    while True:
        # Print the members list
        members = c.get_members(group)
        print(members.get())
        time.sleep(1)
finally:
    # Leave the group
    c.leave_group(group).get()

    # Stop when we&apos;re done
    c.stop()
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Don&apos;t forget to run etcd on your machine before running this program. Running a first instance of this program will print &lt;code&gt;set([&apos;client1&apos;])&lt;/code&gt; every second. As soon as you run a second instance, they both start to print &lt;code&gt;set([&apos;client1&apos;, &apos;client2&apos;])&lt;/code&gt;. If you shut down one of the clients, they will print the member list with only one member in it.&lt;/p&gt;
&lt;p&gt;This works with any number of clients. If a client crashes rather than disconnecting properly, its membership will automatically expire after a few seconds – you can configure this expiration period by passing a &lt;code&gt;timeout&lt;/code&gt; value in the Tooz URL.&lt;/p&gt;
&lt;h3&gt;Using consistent hashing&lt;/h3&gt;
&lt;p&gt;Now that we have a group, which will turn out to be our &lt;em&gt;ring&lt;/em&gt;, we can implement a consistent hashring on top of it. Fortunately, Tooz also provides a ready-to-use implementation of this. Rather than using the &lt;code&gt;join_group&lt;/code&gt; method, we&apos;re going to use the &lt;code&gt;join_partitioned_group&lt;/code&gt; method.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import sys
import time

from tooz import coordination

## Check that a client and group ids are passed as arguments
if len(sys.argv) != 3:
    print(&quot;Usage: %s &amp;lt;client id&amp;gt; &amp;lt;group id&amp;gt;&quot; % sys.argv[0])
    sys.exit(1)

## Get the Coordinator object
c = coordination.get_coordinator(
    &quot;etcd3://localhost&quot;,
    sys.argv[1].encode())
## Start it (initiate connection).
c.start(start_heart=True)

group = sys.argv[2].encode()

## Join the partitioned group
p = c.join_partitioned_group(group)

try:
    while True:
        # Print which member each of ten sample objects is assigned to
        for obj in range(10):
            print(&quot;%d handled by %s&quot; % (obj, p.members_for_object(obj)))
        time.sleep(1)
finally:
    # Leave the group
    c.leave_group(group).get()

    # Stop when we&apos;re done
    c.stop()
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Running this program on one node (or just one terminal) will output the following every second:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;$ python distribution.py client1 foobar
0 handled by set([&apos;client1&apos;])
1 handled by set([&apos;client1&apos;])
2 handled by set([&apos;client1&apos;])
3 handled by set([&apos;client1&apos;])
4 handled by set([&apos;client1&apos;])
5 handled by set([&apos;client1&apos;])
6 handled by set([&apos;client1&apos;])
7 handled by set([&apos;client1&apos;])
8 handled by set([&apos;client1&apos;])
9 handled by set([&apos;client1&apos;])
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;As soon as a second member joins (just run another copy of the script in another terminal), the output changes and both running programs output the same thing:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;0 handled by set([&apos;client2&apos;])
1 handled by set([&apos;client1&apos;])
2 handled by set([&apos;client1&apos;])
3 handled by set([&apos;client1&apos;])
4 handled by set([&apos;client1&apos;])
5 handled by set([&apos;client2&apos;])
6 handled by set([&apos;client2&apos;])
7 handled by set([&apos;client1&apos;])
8 handled by set([&apos;client1&apos;])
9 handled by set([&apos;client2&apos;])
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;They just shared the ten objects between them. They &lt;strong&gt;did not communicate with each other&lt;/strong&gt;. They only know about each other&apos;s presence, and since they use the same algorithm to compute where an object belongs, they compute the same results. You can repeat the test with a third copy of the program:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;0 handled by set([&apos;client2&apos;])
1 handled by set([&apos;client1&apos;])
2 handled by set([&apos;client1&apos;])
3 handled by set([&apos;client1&apos;])
4 handled by set([&apos;client1&apos;])
5 handled by set([&apos;client2&apos;])
6 handled by set([&apos;client2&apos;])
7 handled by set([&apos;client3&apos;])
8 handled by set([&apos;client1&apos;])
9 handled by set([&apos;client3&apos;])
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Here we have a third client in the mix, excellent! If we stop one of the clients, the rebalancing happens automatically.&lt;/p&gt;
&lt;p&gt;While the consistent hashing approach is great, it has a few characteristics you might want to know about:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;The distribution algorithm is not made to be perfectly even. If you have a vast number of objects, it might look pretty even statistically, but if you are trying to distribute two objects across two nodes, it is quite possible that one node will handle both objects and the other none.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;The distribution is not done in real time, meaning there&apos;s a small chance that an object might be owned by two nodes at the same time. This is not a problem in a scenario such as this one, since pinging a host twice is not a big deal, but if your job needs to be unique and executed once and only once, this is not an adequate distribution method. In that case, use a queue with the proper delivery guarantees instead.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
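To see why the distribution is statistical rather than perfectly even, here is a minimal, self-contained sketch of the hash ring idea in pure Python – this is not Tooz's actual implementation, and the helper names (`build_ring`, `node_for`) are invented for illustration:

```python
import bisect
import hashlib

RING_SIZE = 2**16

def position(key: bytes) -> int:
    """Map a key to a deterministic point on the ring."""
    return int.from_bytes(hashlib.md5(key).digest()[:4], "big") % RING_SIZE

def build_ring(nodes, vnodes=32):
    """Place `vnodes` replica points per node on the ring, sorted."""
    return sorted((position(b"%s-%d" % (node, i)), node)
                  for node in nodes for i in range(vnodes))

def node_for(ring, key: bytes):
    """An object belongs to the first node point clockwise of its position."""
    idx = bisect.bisect_left(ring, (position(key), b"")) % len(ring)
    return ring[idx][1]

ring = build_ring([b"client1", b"client2"])

# With only two objects, nothing guarantees an even split:
print({k: node_for(ring, k) for k in (b"a", b"b")})

# Over many objects the split becomes roughly, but only roughly, even:
counts = {b"client1": 0, b"client2": 0}
for i in range(10000):
    counts[node_for(ring, b"object-%d" % i)] += 1
print(counts)
```

Because every participant computes the same positions from the same membership list, no communication is needed to agree on ownership – which is exactly what the two clients above rely on.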
&lt;h3&gt;Distributed ping&lt;/h3&gt;
&lt;p&gt;Now that we have our hashring ready to distribute our job, we can implement our final program!&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import sys
import subprocess
import time

from tooz import coordination

## Check that a client and group ids are passed as arguments
if len(sys.argv) != 3:
    print(&quot;Usage: %s &amp;lt;client id&amp;gt; &amp;lt;group id&amp;gt;&quot; % sys.argv[0])
    sys.exit(1)

## Get the Coordinator object
c = coordination.get_coordinator(
    &quot;etcd3://localhost&quot;,
    sys.argv[1].encode())
## Start it (initiate connection).
c.start(start_heart=True)

group = sys.argv[2].encode()

## Join the partitioned group
p = c.join_partitioned_group(group)

class Host(object):
    def __init__(self, hostname):
        self.hostname = hostname

    def __tooz_hash__(self):
        &quot;&quot;&quot;Returns a unique byte identifier so Tooz can distribute this object.&quot;&quot;&quot;
        return self.hostname.encode()

    def __str__(self):
        return &quot;&amp;lt;%s: %s&amp;gt;&quot; % (self.__class__.__name__, self.hostname)

    def ping(self):
        p = subprocess.Popen([&quot;ping&quot;, &quot;-q&quot;, &quot;-c&quot;, &quot;3&quot;, &quot;-W&quot;, &quot;1&quot;,
                              self.hostname],
                             stdout=subprocess.DEVNULL,
                             stderr=subprocess.DEVNULL)
        return p.wait() == 0

hosts_to_ping = [Host(&quot;192.168.2.%d&quot; % i) for i in range(255)]

try:
    while True:
        for host in hosts_to_ping:
            c.run_watchers()
            if p.belongs_to_self(host):
                print(&quot;Pinging %s&quot; % host)
                if host.ping():
                    print(&quot;  %s is alive&quot; % host)
        time.sleep(1)
finally:
    # Leave the group
    c.leave_group(group).get()

    # Stop when we&apos;re done
    c.stop()
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;When the first client starts, it starts iterating over the hosts, and since it is alone, all hosts belong to it. So it starts pinging all of them:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;$ python3 ping.py client1 ping
Pinging &amp;lt;Host: 192.168.2.0&amp;gt;
  &amp;lt;Host: 192.168.2.0&amp;gt; is alive
Pinging &amp;lt;Host: 192.168.2.1&amp;gt;
  &amp;lt;Host: 192.168.2.1&amp;gt; is alive
Pinging &amp;lt;Host: 192.168.2.2&amp;gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Then, a second client starts pinging too, and the jobs are automatically split. The &lt;code&gt;client1&lt;/code&gt; instance starts skipping the nodes that now belong to &lt;code&gt;client2&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;## client1 output
Pinging &amp;lt;Host: 192.168.2.8&amp;gt;
  &amp;lt;Host: 192.168.2.8&amp;gt; is alive
Pinging &amp;lt;Host: 192.168.2.9&amp;gt;
Pinging &amp;lt;Host: 192.168.2.11&amp;gt;
Pinging &amp;lt;Host: 192.168.2.12&amp;gt;

## client2 output
Pinging &amp;lt;Host: 192.168.2.7&amp;gt;
Pinging &amp;lt;Host: 192.168.2.10&amp;gt;
Pinging &amp;lt;Host: 192.168.2.13&amp;gt;
  &amp;lt;Host: 192.168.2.13&amp;gt; is alive
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;On the other hand, &lt;code&gt;client2&lt;/code&gt; skips the nodes that belong to &lt;code&gt;client1&lt;/code&gt;. If we want to scale our application further, we can start new clients on other nodes of the network and expand our pinging system!&lt;/p&gt;
&lt;h3&gt;Just a first step&lt;/h3&gt;
&lt;p&gt;&lt;a href=&quot;https://scaling-python.com&quot;&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/the-hacker-guide-to-scaling-python.png&quot; alt=&quot;Cover of The Hacker&apos;s Guide to Scaling Python&quot; /&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;This &lt;code&gt;ping&lt;/code&gt; job does not use a lot of CPU time or I/O bandwidth, and neither would the original ssh case described by Alon. However, if it did, this method would be even more valuable, as the ability to scale out the resources would be key.&lt;/p&gt;
&lt;p&gt;These are just the first steps of the distribution and scalability mechanisms that you can implement using Python. A few other options are available on top of this mechanism, such as defining different weights for different nodes or using replicas to achieve high-availability scenarios. I&apos;ve covered those in my book &lt;a href=&quot;https://scaling-python.com&quot;&gt;Scaling Python&lt;/a&gt;, if you&apos;re interested in learning more!&lt;/p&gt;
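The replica idea mentioned above can be sketched in a few lines – again in pure Python with invented names, not Tooz's actual API: for high availability, an object is served by the first N distinct nodes found clockwise from its ring position, so each object has N owners:

```python
import bisect
import hashlib

RING_SIZE = 2**16

def position(key: bytes) -> int:
    """Map a key to a deterministic point on the ring."""
    return int.from_bytes(hashlib.md5(key).digest()[:4], "big") % RING_SIZE

def build_ring(nodes, vnodes=32):
    """Place `vnodes` replica points per node on the ring, sorted."""
    return sorted((position(b"%s-%d" % (node, i)), node)
                  for node in nodes for i in range(vnodes))

def members_for_object(ring, key: bytes, replicas=2):
    """Walk clockwise from the object's position, collecting the first
    `replicas` distinct nodes; each of them handles the object."""
    start = bisect.bisect_left(ring, (position(key), b"")) % len(ring)
    owners = []
    for i in range(len(ring)):
        node = ring[(start + i) % len(ring)][1]
        if node not in owners:
            owners.append(node)
            if len(owners) == replicas:
                break
    return owners

ring = build_ring([b"client1", b"client2", b"client3"])
owners = members_for_object(ring, b"foobar", replicas=2)
print(owners)
```

If the primary owner of an object disappears, the next node clockwise already owns a replica, which is what makes this layout attractive for high availability.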
</content:encoded><category>python</category><category>openstack</category></item><item><title>My interview with Cool Python Codes</title><link>https://julien.danjou.info/blog/interview-coolpythoncodes/</link><guid isPermaLink="true">https://julien.danjou.info/blog/interview-coolpythoncodes/</guid><description>A few days ago, I was contacted by Godson Rapture from Cool Python codes to answer a few questions about what I work on in open source.</description><pubDate>Thu, 05 Oct 2017 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;A few days ago, I was contacted by Godson Rapture from &lt;a href=&quot;http://coolpythoncodes.com/&quot;&gt;Cool Python codes&lt;/a&gt; to answer a few questions about what I work on in open source. Godson regularly interviews developers, and I invite you to check out his website!&lt;/p&gt;
&lt;p&gt;Here&apos;s a copy of &lt;a href=&quot;http://coolpythoncodes.com/julien-danjou/&quot;&gt;my original interview&lt;/a&gt;. Enjoy!&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Good day, Julien Danjou, welcome to Cool Python Codes. Thanks for taking your precious time to be here.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;You’re welcome!&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Could you kindly tell us about yourself like your full name, hobbies, nationality, education, and experience in programming?&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Sure. I’m Julien Danjou, I’m French and live in Paris, France. I studied computer science for 5 years, around 15 years ago, and have continued my career in that field since then, specializing in open source projects.&lt;/p&gt;
&lt;p&gt;These last years, I’ve been working as a software engineer at Red Hat. I’ve spent the last 10 years working with the Python programming language. Now I work on the Gnocchi project, which is a time series database.&lt;/p&gt;
&lt;p&gt;When I’m not coding, I enjoy running half-marathons and playing FPS games.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/pyconfr-2017-jd.jpg&quot; alt=&quot;Julien Danjou at PyCon France 2017&quot; /&gt;&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Can you narrate your first programming experience and what got you to start learning to program?&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I started programming around 2001, and my first serious programs were in Perl. I was contributing to a hosting platform for free software named VHFFS. It was a free software project itself, and I enjoyed being able to learn from other, more experienced developers and to contribute back to it. That’s what got me hooked on the world of open source projects.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Which programming language do you know and which is your favorite?&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I know quite a few, I’ve been doing serious programming in Perl, C, Lua, Common Lisp, Emacs Lisp and Python.&lt;/p&gt;
&lt;p&gt;Obviously, my favorite is Common Lisp, but I was never able to use it for any serious project, for various reasons. So I spend most of my time hacking with Python, which I really enjoy as it is close to Lisp, in some ways. I see it as a small subset of Lisp.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;What inspired you to venture into the world of programming and drove you to learn a handful of programming languages?&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;It was mostly scratching my own itches when I started. Each time I saw something I wanted to do or a feature I wanted in an existing software, I learned what I needed to get going and get it working.&lt;/p&gt;
&lt;p&gt;I studied C and Lua while writing awesome – the window manager that I created 10 years ago and used for a while. I learned Emacs Lisp while writing extensions that I wanted to see in Emacs, etc. It’s the best way to start.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;What is your blog about?&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;My blog is a platform where I write about what I work on most of the time. Nowadays, it’s mostly about Python and the main project I contribute to, Gnocchi.&lt;/p&gt;
&lt;p&gt;When writing about Gnocchi, I usually try to explain what part of the project I worked on, what new features we achieved, etc.&lt;/p&gt;
&lt;p&gt;On Python, I try to share solutions to common problems I encountered or identified while doing e.g. code reviews. Or presenting a new library I created!&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Tell us more about your book, The Hacker’s Guide to Python.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;It’s a compilation of everything I learned those last years building large Python applications. I spent the last 6 years developing on a large code base with thousands of other developers.&lt;/p&gt;
&lt;p&gt;I’ve reviewed tons of code and identified the biggest issues, mistakes, and bad practices that developers tend to have. I decided to compile that into a guide, helping developers who have played a bit with Python learn the steps to become really productive with it.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;OpenStack is the biggest open source project in Python, Can you tell us more about OpenStack?&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;OpenStack is a cloud computing platform, started 7 years ago now. Its goal is to provide a programmatic platform to manage your infrastructure while being open source and avoiding vendor lock-in.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Who uses OpenStack? Is it for programmers, website owners?&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;It’s used by a lot of different organizations – not really by individuals. It’s a big piece of software. You can find it in some famous public cloud providers (Dreamhost, Rackspace…), and also as a private cloud in a lot of different organizations, from Bloomberg to eBay or CERN in Switzerland, a big OpenStack user. Tons of telecom providers also leverage OpenStack for their own internal infrastructure.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Have you participated in any OpenStack conference? What did you speak on if you did?&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I’ve attended the last 9 OpenStack summits and a few other OpenStack events around the world. I’ve been engaged in the upstream community for the last 6 years now.&lt;/p&gt;
&lt;p&gt;My area of expertise is telemetry, the stack of software that is in charge of collecting and storing metrics from the various OpenStack components. This is what I regularly talk about during those events.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;How can one join the OpenStack community?&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;There’s an entire documentation about that, called the &lt;a href=&quot;https://docs.openstack.org/infra/manual/developers.html&quot;&gt;Developer’s Guide&lt;/a&gt;. It explains how to set up your environment to send patches and how to join the community using the mailing lists or IRC.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;What makes your book, &lt;a href=&quot;https://thehackerguidetopython.com&quot;&gt;The Hacker’s Guide to Python&lt;/a&gt; stand out from other Python books? Also, who exactly did you write this book for?&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I wrote the book that I always wanted to read about Python but never found. It’s not a book for people who want to learn Python from scratch. It’s a great guide for those who know the language but don’t know the details that experienced developers know and that make the difference: the best practices, the elegant solutions to common problems, etc. That’s why it also includes interviews with prominent Python developers, so they can share their advice on different areas.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;How can someone get your book?&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I’ve decided to self-publish my book, so it does not have a publisher like you might be used to seeing. The best place to get it is online, where you can pick the format you want, electronic or paper.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;What do you mean when you say you hack with Python?&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Unfortunately, most people refer to hacking as the activity of some bad guys trying to get access to whatever they’re not supposed to see. In the book title, I mean “hacking” as the elegant way of writing code and making things work smoothly, even in ways you were not expecting.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;You mentioned earlier that Gnocchi is a time series database. Can you please be more elaborate about Gnocchi? Is there also any documentation about Gnocchi?&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;So Gnocchi is a project I started a few years ago to store time series at large scale. Time series are basically series of tuples, each composed of a timestamp and a value.&lt;/p&gt;
&lt;p&gt;Imagine you wanted to store the temperature of all the rooms of the world at any point of time. You’d need a dedicated database for that with the right data structure. This is what Gnocchi does: it provides this data structure storage at very, very large scale.&lt;/p&gt;
&lt;p&gt;The primary use case is infrastructure monitoring, so most people use it to store tons of metrics about their hardware, software, etc. It’s fully documented on &lt;a href=&quot;http://gnocchi.xyz&quot;&gt;its website&lt;/a&gt;.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;How can a programmer without much experience contribute to open source projects?&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The best way to start is to try to fix something that irritates you in some way. It might be a bug, it might be a missing feature. Start small. Don’t try big things first or you could be discouraged.&lt;/p&gt;
&lt;p&gt;Never stop.&lt;/p&gt;
&lt;p&gt;Also, don’t plunge right away in the community and start poking random people or spam them with questions. Do your homework, and listen to the community for a while to get a sense of how things are going. That can be joining IRC and lurking or following the mailing lists for example.&lt;/p&gt;
&lt;p&gt;Big open source communities dedicate programs to help you become engaged. It might be worth a try. Generic programs like Outreachy or Google Summer of Code are a great way to start if you don’t feel confident enough to jump by your own means in a community.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Just out of curiosity, do you write code in French?&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Never ever. I think it’s acceptable to write in your language if you are sure that your code will never be open sourced and that your whole team is talking in that language, no matter what – but it’s a ballsy assumption, clearly.&lt;/p&gt;
&lt;p&gt;Truth is that if you do open source, English is the standard, so go with it. Be sad if you want, but please be pragmatic.&lt;/p&gt;
&lt;p&gt;I’ve seen projects being open sourced by companies where all the code source comments were in Korean. It was impossible for any non-Korean people to get a glance of what the code and the project was doing, so it just failed and disappeared.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;How does a team of programmers handle bugs in a large open source project?&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I wish there was some magic recipe, but I don’t think it’s the case. What you want is to have a place where your users can feel safe reporting bugs. Include a template so they don’t forget any details: how to reproduce the bugs, what they expected, etc. The worst thing is to have users reporting “That does not work.” with no details. It’s a waste of time.&lt;/p&gt;
&lt;p&gt;What tool to use to log all of that really depends on the team size and culture.&lt;/p&gt;
&lt;p&gt;Once that works, the actual fixing of bugs doesn’t follow any rule. Most developers fix the bugs they encounter or the ones that are the most critical for users. Smaller problems might not be fixed for a long time.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Can you tell us about the new book you are working on and when do we expect to get it?&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;That new book is entitled &lt;a href=&quot;https://scaling-python.com&quot;&gt;“Scaling Python”&lt;/a&gt; and it provides insight into how to build highly scalable and distributed applications using Python.&lt;/p&gt;
&lt;p&gt;It is also based on my experience building this kind of software during the past years. The book also includes interviews with great Python hackers who work on scalable systems or know a thing or two about writing applications for performance – an important requirement for scalable applications.&lt;/p&gt;
&lt;p&gt;The book is in its final stage now, and it should be out at the beginning of 2018.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;How can someone get in contact with you?&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I’m reachable at &lt;a href=&quot;mailto:julien@danjou.info&quot;&gt;julien@danjou.info&lt;/a&gt; by email or via Twitter, &lt;a href=&quot;https://twitter.com/juldanjou&quot;&gt;@juldanjou&lt;/a&gt;.&lt;/p&gt;
</content:encoded><category>career</category><category>python</category><category>books</category><category>gnocchi</category><category>openstack</category></item><item><title>OpenStack Summit Boston 2017 recap</title><link>https://julien.danjou.info/blog/openstack-summit-pike-boston-recap/</link><guid isPermaLink="true">https://julien.danjou.info/blog/openstack-summit-pike-boston-recap/</guid><description>The first OpenStack Summit of 2017 was last week, in Boston, MA, USA. I was able to attend as I&apos;d been selected to give 3 talks, to help with a hands-on, and to run an on-boarding session.</description><pubDate>Mon, 15 May 2017 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;The &lt;a href=&quot;https://www.openstack.org/summit/boston-2017/&quot;&gt;first OpenStack Summit of 2017&lt;/a&gt; was last week, in Boston, MA, USA. I was able to attend as I&apos;d been selected to give 3 talks, to help with a hands-on, and to run an on-boarding session. This made sure I was a bit busy every day, which was good.&lt;/p&gt;
&lt;p&gt;This is the first summit to take place since the new &lt;a href=&quot;https://www.openstack.org/ptg/&quot;&gt;Project Team Gathering (PTG)&lt;/a&gt; was held last February. I was unable to attend that first PTG, as there was no way to justify my presence there. The OpenStack Telemetry team that I lead is pretty small, and its people don&apos;t really need to talk face to face to get things done, so we decided not to ask to be present during the last PTG event.&lt;/p&gt;
&lt;p&gt;The Telemetry on-boarding session that I organized with my fellow developer Gordon Chung on Tuesday had only 3 people showing up to ask a few questions about Telemetry. The session lasted 15 minutes out of the 90 planned. We shared that session with &lt;a href=&quot;https://wiki.openstack.org/wiki/CloudKitty&quot;&gt;CloudKitty&lt;/a&gt;, for which nobody showed up at all. When you think about it, this was really disappointing, but it did not come as a surprise.&lt;/p&gt;
&lt;p&gt;First, the number of companies engaging developers in OpenStack has shrunk drastically during the last year. Secondly, since there&apos;s now another event (the PTG) twice a year, it seems pretty clear that not every developer will be able to attend all 4 events every year, creating dispersion in the community.&lt;/p&gt;
&lt;p&gt;I personally was glad to attend the Summit rather than the PTG, as it is more valuable to meet operators and users than developers to gather feedback. However, meeting everyone at the same time would be great, especially for smaller teams. The PTG scattered some teams to the point that many developers of those teams won&apos;t go to either the PTG or the Summit. As a consequence, I won&apos;t have any meeting point in the future with many of my fellow developers around OpenStack. I warned the Technical Committee about this last year, when it was decided to reorganize the events. I&apos;m glad I was right, but I&apos;m a bit sad that the Foundation did not listen.&lt;/p&gt;
&lt;p&gt;Fortunately, all the projects I work on tend to follow &lt;a href=&quot;https://julien.danjou.info/blog/foss-projects-management-bad-practice&quot;&gt;the good practices I wrote about last year&lt;/a&gt;, so I cannot say that this has huge consequences for them. It is a loss, as it makes it harder for some of us to reach users and operators. It also reduces our occasions for social interaction, which were a great benefit. But it will not prevent us from building great software anyway!&lt;/p&gt;
&lt;p&gt;The few other sessions of &lt;em&gt;&lt;a href=&quot;https://wiki.openstack.org/wiki/Forum&quot;&gt;The Forum&lt;/a&gt;&lt;/em&gt; (the space dedicated to developers during the Summit) that I attended discussed various technical things, and some sessions were pretty empty. I wonder if it was a lack of interest or if people were unable to travel to discuss those items. Anyhow, at this stage I am not sure it would have really mattered: this has been my 9th OpenStack Summit, and many of the subjects discussed have already been discussed multiple times, with barely any change since. Talk is cheap. Furthermore, most of the discussions were not led by stakeholders of the various projects involved, but by people on the side, or by members of the Technical Committee. There is unfortunately just too much wishful thinking.&lt;/p&gt;
&lt;p&gt;On the talk side, my presentation with Alex Krzos entitled &lt;em&gt;Telemetry and the 10,000 instances&lt;/em&gt; went pretty well. We demonstrated how we tested the performance of the telemetry stack.&lt;/p&gt;
&lt;p&gt;Same goes for my hands-on with the CloudKitty developers, where we managed to explain how Ceilometer, Gnocchi, and CloudKitty were able to work with each other to create nice billing reports. The last day was concluded with my talk on collectd and Gnocchi with Emma, which was short and to the point.&lt;/p&gt;
&lt;p&gt;My final talk was about the status and roadmap of the OpenStack Telemetry team, where I tried to explain how Telemetry works and what we might do (or not) in the next cycles. It was pretty short as we barely have a roadmap, the project having 3 developers doing 80% of the work.&lt;/p&gt;
&lt;p&gt;I was also able to catch up with Nubeliu about their Gnocchi usage. They &lt;a href=&quot;https://www.youtube.com/watch?v=Hlt3UwsvgjU&quot;&gt;presented a nice demo of the cloud monitoring solution&lt;/a&gt; they built on top of Gnocchi. They completely understood how to use Gnocchi to store a large number of metrics at scale and how to leverage the API to render what&apos;s happening in your infrastructure. It is pretty amazing.&lt;/p&gt;
&lt;p&gt;While I missed the energy and the drive that the design sessions used to have in the first summits, it has been a pretty good summit. I was especially happy to be able to discuss OpenStack Telemetry and Gnocchi. The feedback I gathered was tremendous, and I&apos;m looking forward to the work we&apos;ll achieve in the next months!&lt;/p&gt;
</content:encoded><category>openstack</category><category>gnocchi</category><category>talks</category></item><item><title>Gnocchi independence</title><link>https://julien.danjou.info/blog/gnocchi-independence/</link><guid isPermaLink="true">https://julien.danjou.info/blog/gnocchi-independence/</guid><description>Three years have passed since Gnocchi started. After being incubated inside OpenStack, the project is now going independent.</description><pubDate>Sat, 06 May 2017 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Three years have passed since I started working on &lt;a href=&quot;http://gnocchi.xyz&quot;&gt;Gnocchi&lt;/a&gt;. It&apos;s amazing to gaze at the path we wandered on.&lt;/p&gt;
&lt;p&gt;During all this time, Gnocchi has been &quot;incubated&quot; inside OpenStack. It was created there and it grew with the rest of the ecosystem. But Gnocchi developers always stuck to some strange principles: autonomy and independence from the other OpenStack projects. This actually made the project a bit unpopular at times inside OpenStack, getting it stamped as some kind of &lt;em&gt;rebel&lt;/em&gt;.&lt;/p&gt;
&lt;p&gt;I&apos;ve spent the last years asserting that each project inside OpenStack should strive to live its own life. Being usable in any context, not only the one it was built for, is a key to success for any open source project. Having to use large bundles of projects together is not a good user story. I wish OpenStack would become a set of more autonomous building blocks.&lt;/p&gt;
&lt;p&gt;One of the projects most used by people not running an entire OpenStack installation has been &lt;a href=&quot;https://launchpad.net/swift&quot;&gt;Swift&lt;/a&gt;. That was possible because Swift always tried to be autonomous and not to depend on any other service. It is able to leverage external services, but it can also work without any. And I feel that Swift is the most successful project if you measure that success by adoption among people with zero knowledge of OpenStack.&lt;/p&gt;
&lt;p&gt;With the move toward the &lt;em&gt;Big Tent&lt;/em&gt;, it struck me that the OpenStack Foundation will end up as some sort of an Apache Foundation. And I am pretty sure nobody forces you to use the &lt;a href=&quot;https://httpd.apache.org/&quot;&gt;Apache HTTP server&lt;/a&gt; if you want to use e.g. &lt;a href=&quot;http://lucene.apache.org/&quot;&gt;Lucene&lt;/a&gt; or &lt;a href=&quot;http://hbase.apache.org/&quot;&gt;HBase&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Being part of OpenStack for Gnocchi has been a great advantage at the beginning of the project. The infrastructure provided is awesome. The support we had from the community was great. The Gerrit workflow suited us well.&lt;/p&gt;
&lt;p&gt;But unfortunately, now that the project is getting more and more mature, many of the requirements of being an OpenStack project have become a real burden. The various processes imposed by OpenStack are hurting the development pace. The contribution workflow based around Gerrit and &lt;a href=&quot;https://launchpad.net&quot;&gt;Launchpad&lt;/a&gt; is too complicated for most external contributors and therefore prevents new users from participating in the development. Worse, the bad image or reputation that OpenStack carries in certain situations or communities prevents Gnocchi from being evaluated and, maybe, used.&lt;/p&gt;
&lt;p&gt;I think that many of those negative aspects are finally being taken into account by the OpenStack Technical Committee, as can be seen in the &lt;a href=&quot;https://review.openstack.org/#/c/453262/&quot;&gt;proposed vision of OpenStack 2 years from now&lt;/a&gt;. Better late than never.&lt;/p&gt;
&lt;p&gt;So after spending a lot of time weighing the pros and the cons, we, the Gnocchi contributors, &lt;a href=&quot;http://lists.openstack.org/pipermail/openstack-dev/2017-March/114300.html&quot;&gt;finally decided to move Gnocchi out of OpenStack&lt;/a&gt;. We started to move the project to a brand new &lt;a href=&quot;https://github.com/gnocchixyz&quot;&gt;Gnocchi organization on GitHub&lt;/a&gt;. At the time of this writing, only the main gnocchi repository is missing; it should be moved soon after the OpenStack Summit happening next week.&lt;/p&gt;
&lt;p&gt;We also used that opportunity to make usage of the new Gnocchi logo, courtesy of my friend Thierry Ung!&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/gnocchi-logo.png&quot; alt=&quot;Gnocchi logo&quot; /&gt;&lt;/p&gt;
&lt;p&gt;We&apos;ll see how everything turns out and whether the project gains more traction, as we hope. This will not change how Gnocchi is consumed by projects such as &lt;a href=&quot;http://launchpad.net/ceilometer&quot;&gt;Ceilometer&lt;/a&gt;, and the project aims to remain a good friend of OpenStack. 😀&lt;/p&gt;
</content:encoded><category>gnocchi</category><category>openstack</category></item><item><title>Attending OpenStack Summit Ocata</title><link>https://julien.danjou.info/blog/openstack-summit-ocata-barcelona-review/</link><guid isPermaLink="true">https://julien.danjou.info/blog/openstack-summit-ocata-barcelona-review/</guid><description>For the last time in 2016, I flew out to the OpenStack Summit in Barcelona, where I had the chance to meet (again) a lot of my fellow OpenStack contributors there.</description><pubDate>Mon, 31 Oct 2016 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;For the last time in 2016, I flew out to the &lt;a href=&quot;https://www.openstack.org/summit/barcelona-2016/&quot;&gt;OpenStack Summit in Barcelona&lt;/a&gt;, where I had the chance to meet (again) a lot of my fellow OpenStack contributors there.&lt;/p&gt;
&lt;h2&gt;How To Work Upstream with OpenStack&lt;/h2&gt;
&lt;p&gt;My week started by giving a talk about &lt;em&gt;How To Work Upstream with OpenStack&lt;/em&gt; where, accompanied by Ryota and Ashiq, I explained to the audience how to contribute upstream to OpenStack. It went well and was well received by the public – you can watch the video below or download the slides.&lt;/p&gt;
&lt;h2&gt;Python 3 in telemetry projects&lt;/h2&gt;
&lt;p&gt;I attended a few interesting cross-project sessions, which helped me prioritize my work for the next few months.&lt;/p&gt;
&lt;p&gt;The Python 3 porting effort has been blocked for a while in Nova and Swift for various (mostly non-technical) reasons, while almost all other projects are working correctly. On the other hand, we have committed the telemetry projects to being the first ones to drop Python 2 support as soon as possible. The next steps are to make sure downstream is ready and to enable functional testing in devstack with Python 3.&lt;/p&gt;
&lt;h2&gt;Ceilometer deprecation&lt;/h2&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/gordon-gnocchi-talk.jpg&quot; alt=&quot;gordon-gnocchi-talk&quot; /&gt;&lt;/p&gt;
&lt;p&gt;The Ceilometer sessions were really interesting, as we mainly discussed deprecating and removing old cruft that is not or should not be used anymore. The main change will be the deprecation of the Ceilometer API. It has been clear for more than a year that &lt;a href=&quot;http://gnocchi.xyz&quot;&gt;Gnocchi&lt;/a&gt; is the way to go to store and provide access to metrics, but we failed to announce it widely. A lot of the people I talked to during the summit were not aware that the Ceilometer API was not a good pick, and that Gnocchi was now the recommended storage backend. Bad communication from our side – but we are going to fix it as of now.&lt;/p&gt;
&lt;p&gt;We also committed to simplifying the current architecture by removing the collector, which has now been made obsolete by the agent-based architecture implemented during the last development cycles.&lt;/p&gt;
&lt;h2&gt;Aodh alarm timeout&lt;/h2&gt;
&lt;p&gt;We had a feature proposal in Aodh that we have postponed for too long already: triggering an alarm after a timeout when some expected events have not been seen. This seems to be a functionality requested by NFV users – something we want Aodh to cover.&lt;/p&gt;
&lt;p&gt;We spent some time discussing this feature, and now that we all have a clear understanding of the use case, we&apos;ll work on having a clear path to the implementation.&lt;/p&gt;
&lt;p&gt;I&apos;ve also attended a session with the &lt;a href=&quot;https://wiki.openstack.org/wiki/Vitrage&quot;&gt;Vitrage&lt;/a&gt; developers in order to discuss how we could work better together, as they rely on Aodh. It seems there might be some convergence in the future, which would be very welcome. Wait&apos;n see.&lt;/p&gt;
&lt;h2&gt;Gnocchi improvement, past and future&lt;/h2&gt;
&lt;p&gt;The Gnocchi session ran smoothly, and everyone seemed happy with the work we have done so far. We&apos;ve made some impressive improvements in Gnocchi 3.0 – as &lt;a href=&quot;https://julien.danjou.info/blog/2016/gnocchi-3.0-release&quot;&gt;I already covered previously&lt;/a&gt; – and Gordon Chung presented a short talk about the performance difference measured while working on this new version of Gnocchi:&lt;/p&gt;
&lt;p&gt;The return of the InfluxDB driver is on the table, as Sam Morrison proposed a patch for it a while back. While it&apos;s not as fast and scalable as the other drivers, it offers a good alternative for people who have to use it.&lt;/p&gt;
&lt;p&gt;Leandro Reox presented how to do capacity planning using Ceilometer and Gnocchi, presenting the projects at the same time:&lt;/p&gt;
&lt;p&gt;It is pretty impressive to see what they achieved with this project, and I&apos;m looking forward to being able to check how it works inside.&lt;/p&gt;
&lt;h2&gt;PTG and beyond&lt;/h2&gt;
&lt;p&gt;The next meeting is supposed to be the new &lt;a href=&quot;https://www.openstack.org/ptg/&quot;&gt;OpenStack PTG&lt;/a&gt; in February in Atlanta, though we did not request any specific space there. While the team loves seeing each other face-to-face every few months, we have managed to follow &lt;a href=&quot;https://julien.danjou.info/blog/foss-projects-management-bad-practice&quot;&gt;all of the guidelines I listed recently&lt;/a&gt; on good open source project management, meaning we are able to work very well asynchronously and remotely. There is no need to put hard requirements on people who want to participate in our community. Nevertheless, I expect the cross-project discussions that will happen there to still concern the OpenStack Telemetry projects.&lt;/p&gt;
&lt;p&gt;In the end, we&apos;re all very happy with our past and future roadmaps and I&apos;m looking forward to achieving our next big milestones with our amazing telemetry team!&lt;/p&gt;
</content:encoded><category>openstack</category><category>gnocchi</category><category>open-source</category><category>talks</category></item><item><title>Gnocchi 3.0 release</title><link>https://julien.danjou.info/blog/gnocchi-3-0-release/</link><guid isPermaLink="true">https://julien.danjou.info/blog/gnocchi-3-0-release/</guid><description>After a few weeks of hard work with the team, here is the new major version of Gnocchi, stamped 3.0.0. It was very challenging, as we wanted to implement a few big changes in it.</description><pubDate>Mon, 03 Oct 2016 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;After a few weeks of hard work with the team, here is the new major version of Gnocchi, stamped &lt;a href=&quot;https://launchpad.net/gnocchi/3.0/3.0.0&quot;&gt;3.0.0&lt;/a&gt;. It was very challenging, as we wanted to implement a few big changes in it.&lt;/p&gt;
&lt;p&gt;Gnocchi is now using &lt;a href=&quot;http://docs.openstack.org/developer/reno/&quot;&gt;reno&lt;/a&gt; to its maximum and you can read &lt;a href=&quot;http://gnocchi.xyz/releasenotes/3.0.html&quot;&gt;the release notes of the 3.0 branch&lt;/a&gt; online. Some notes might be missing as it is our first release with it, but we are making good progress at writing changelogs for most of our user facing and impacting changes.&lt;/p&gt;
&lt;p&gt;Therefore, I&apos;ll only write here about our big major feature that made us bump the major version number.&lt;/p&gt;
&lt;h2&gt;New storage engine&lt;/h2&gt;
&lt;p&gt;And so the most interesting thing that went into the 3.0 release is the new storage engine that Gordon Chung and I built over the last months. The original approach to writing data in Gnocchi was really naive, so we have gone through an iterative improvement process since version 1.0, and we&apos;re getting close to something very solid.&lt;/p&gt;
&lt;p&gt;This new version leverages several important features that increase performance by a large factor on Ceph, our recommended back-end (using &lt;code&gt;write(offset)&lt;/code&gt; rather than &lt;code&gt;read()+write()&lt;/code&gt; to append new points).&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/gnocchi3_processtime_readwrite_vs_offset.png&quot; alt=&quot;gnocchi3_processtime_readwrite_vs_offset&quot; /&gt;&lt;/p&gt;
&lt;p&gt;To summarize: since most data points are sent sequentially and in order, we enhanced the data format to take advantage of that fact, so new points can be appended without reading anything back. That only works on Ceph, though, which provides the needed features.&lt;/p&gt;
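&lt;p&gt;As a rough illustration of the difference (a sketch, not Gnocchi&apos;s actual code), here are the two append strategies against an in-memory buffer standing in for a Ceph RADOS object; real code would go through python-rados, e.g. &lt;code&gt;ioctx.write(name, data, offset)&lt;/code&gt;:&lt;/p&gt;

```python
import io

def append_read_modify_write(obj: io.BytesIO, new_points: bytes) -> None:
    # Old approach: read the whole object back, append in memory, then
    # rewrite everything. I/O cost grows with object size on every append.
    existing = obj.getvalue()
    obj.seek(0)
    obj.write(existing + new_points)

def append_at_offset(obj: io.BytesIO, new_points: bytes, size: int) -> int:
    # New approach: points arrive ordered, so only the new bytes are
    # written, at the known end offset. Returns the new object size.
    obj.seek(size)
    obj.write(new_points)
    return size + len(new_points)

obj = io.BytesIO()
size = append_at_offset(obj, b"point-1", 0)
size = append_at_offset(obj, b"point-2", size)

legacy = io.BytesIO()
append_read_modify_write(legacy, b"point-1")
append_read_modify_write(legacy, b"point-2")
print(legacy.getvalue() == obj.getvalue())  # True: same bytes, far less I/O
```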
&lt;p&gt;We also enabled data compression on all storage drivers by enabling LZ4 compression (&lt;a href=&quot;https://julien.danjou.info/blog/gnocchi-carbonara-timeseries-compression&quot;&gt;see my previous article and research on the subject&lt;/a&gt;), which obviously offers its own set of challenges when using append-only write. The results are tremendous and decrease data usage by a huge factor:&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/gnocchi3_disksize.png&quot; alt=&quot;gnocchi3_disksize&quot; /&gt;&lt;/p&gt;
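&lt;p&gt;To give an idea of why compression works so well here (a minimal sketch using the standard library&apos;s &lt;code&gt;zlib&lt;/code&gt; as a stand-in, since LZ4 needs a third-party package): serialized timeseries points are highly repetitive, so even a generic compressor shrinks them considerably.&lt;/p&gt;

```python
import struct
import zlib

# 1000 points, one per minute, with values cycling over a small pattern:
# typical regular timeseries data.
points = [(1472035200 + i * 60, 20.0 + (i % 5)) for i in range(1000)]
raw = b"".join(struct.pack("<dd", ts, value) for ts, value in points)

compressed = zlib.compress(raw)
print(len(raw), len(compressed))  # the compressed form is much smaller

# Compression is lossless: the original bytes round-trip exactly.
assert zlib.decompress(compressed) == raw
```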
&lt;p&gt;The rest of the processing pipeline also has been largely improved:&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/gnocchi3_processtime_post.png&quot; alt=&quot;gnocchi3_processtime_post&quot; /&gt;&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/gnocchi3_processtime_compress_offset.png&quot; alt=&quot;gnocchi3_processtime_compress_offset&quot; /&gt;&lt;/p&gt;
&lt;p&gt;Overall, we&apos;re delighted with the performance improvement we achieved, and we&apos;re looking forward to making even more progress. Gnocchi is now one of the best-performing and most scalable timeseries databases out there.&lt;/p&gt;
&lt;h2&gt;Upcoming challenges&lt;/h2&gt;
&lt;p&gt;With that big change done, we&apos;re now heading toward a set of more lightweight improvements. Our &lt;a href=&quot;https://bugs.launchpad.net/gnocchi&quot;&gt;bug tracker&lt;/a&gt; is a good place to learn what might be on our mind (check for the &lt;em&gt;wishlist&lt;/em&gt; bugs).&lt;/p&gt;
&lt;p&gt;Improving our API features and offering a better experience for those coming from outside the realm of OpenStack are now at the top of my priority list.&lt;/p&gt;
&lt;p&gt;But let me know if there&apos;s anything bugging you, obviously. 😎&lt;/p&gt;
</content:encoded><category>gnocchi</category><category>openstack</category></item><item><title>From decimal to timestamp with MySQL</title><link>https://julien.danjou.info/blog/python-sqlalchemy-from-decimal-to-timestamp/</link><guid isPermaLink="true">https://julien.danjou.info/blog/python-sqlalchemy-from-decimal-to-timestamp/</guid><description>When working with timestamps, one question that often arises is the precision of those timestamps. Most software is good enough with a precision up to the second, and that&apos;s easy. But in some cases, l</description><pubDate>Thu, 08 Sep 2016 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;When working with timestamps, one question that often arises is the precision of those timestamps. Most software is good enough with a precision up to the second, and that&apos;s easy. But in some cases, like working on metering, a finer precision is required.&lt;/p&gt;
&lt;p&gt;I don&apos;t know exactly why, and it makes me suffer every day, but &lt;a href=&quot;http://openstack.org&quot;&gt;OpenStack&lt;/a&gt; is really tied to &lt;a href=&quot;http://mysql.com&quot;&gt;MySQL&lt;/a&gt; (and its clones). It hurts because MySQL is a very poor solution if you want to leverage your database to actually solve problems. But that&apos;s how life is, unfair. And in the context of the projects I work on, that boils down to that we can&apos;t afford to not support MySQL.&lt;/p&gt;
&lt;p&gt;So here we are, needing to work with MySQL while requiring timestamps with finer precision than just seconds. And guess what: MySQL did not support that until 2011.&lt;/p&gt;
&lt;h2&gt;No microseconds in MySQL? No problem: DECIMAL!&lt;/h2&gt;
&lt;p&gt;MySQL 5.6.4 (released in 2011), a beta version of MySQL 5.6 (hello MySQL, ever heard of &lt;a href=&quot;http://semver.org&quot;&gt;Semantic Versioning&lt;/a&gt;?), brought microsecond precision to timestamps. But the first stable version supporting that, MySQL 5.6.10, was only released in 2013. So for a long time, there was a problem without any solution.&lt;/p&gt;
&lt;p&gt;The obvious workaround, in this case, is to reassess your choices in technologies, discover that &lt;a href=&quot;https://www.postgresql.org/docs/7.1/static/datatype-datetime.html&quot;&gt;PostgreSQL has supported microsecond precision for at least a decade&lt;/a&gt;, and the problem is solved.&lt;/p&gt;
&lt;p&gt;This is not what happened in our case, and in order to support MySQL, one had to find a workaround. And so they did in our &lt;a href=&quot;http://launchpad.net/ceilometer&quot;&gt;Ceilometer&lt;/a&gt; project, using a &lt;a href=&quot;https://dev.mysql.com/doc/refman/5.7/en/precision-math-decimal-characteristics.html&quot;&gt;&lt;code&gt;DECIMAL&lt;/code&gt;&lt;/a&gt; type instead of &lt;code&gt;DATETIME&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;DECIMAL&lt;/code&gt; type takes 2 arguments: the total number of digits you need to store, and how many of that total will be used for the fractional part. Knowing that the internal storage of MySQL uses 1 byte for 2 digits, 2 bytes for 4 digits, 3 bytes for 6 digits and 4 bytes for 9 digits, and that each part is stored independently, you want to pick numbers of digits that fit those boundaries so as not to waste storage space.&lt;/p&gt;
&lt;p&gt;This is why Ceilometer picked 14 for the integer part (9 digits on 4 bytes and 5 digits on 3 bytes) and 6 for the decimal part (3 bytes).&lt;/p&gt;
&lt;p&gt;Wait. It&apos;s stupid because:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;code&gt;DECIMAL(20, 6)&lt;/code&gt; implies that you use 14 digits for the integer part, which, using epoch as a reference, makes you able to encode timestamps up to &lt;code&gt;(10^14) - 1&lt;/code&gt;, which is year 3170843. I am certain Ceilometer won&apos;t last that far.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;14 digits is 9 + 5 digits in MySQL, which is 7 bytes, the same size that is used for 9 + 6 digits. So you could have had &lt;code&gt;DECIMAL(21, 6)&lt;/code&gt; for the same storage space (and gone up to year 31690708, which is a nice bonus, right?).&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Well, I guess the original author of the patch did not read the documentation entirely (&lt;code&gt;DECIMAL(20, 6)&lt;/code&gt; appears on the MySQL documentation page as an example, so I imagine it was just copy-pasted blindly?).&lt;/p&gt;
&lt;p&gt;The best choice for this use case would have been &lt;code&gt;DECIMAL(17, 6)&lt;/code&gt;, which allows storing 11 digits for the integer part (5 bytes), supporting timestamps up to &lt;code&gt;(10^11)-1&lt;/code&gt; (year 5138), and 6 digits for the decimal part (3 bytes), using only 8 bytes in total per timestamp.&lt;/p&gt;
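&lt;p&gt;The arithmetic above can be checked with a small sketch encoding MySQL&apos;s documented storage rule: 4 bytes per group of 9 digits, leftover digits costing 1 byte per 2 digits (rounded up), each part stored independently.&lt;/p&gt;

```python
# Bytes needed for the digits left over after the 9-digit groups,
# per the MySQL DECIMAL storage documentation.
LEFTOVER_BYTES = {0: 0, 1: 1, 2: 1, 3: 2, 4: 2, 5: 3, 6: 3, 7: 4, 8: 4}

def digits_bytes(digits: int) -> int:
    groups, leftover = divmod(digits, 9)
    return groups * 4 + LEFTOVER_BYTES[leftover]

def decimal_storage(precision: int, scale: int) -> int:
    # Integer and fractional parts are stored independently.
    return digits_bytes(precision - scale) + digits_bytes(scale)

print(decimal_storage(20, 6))  # Ceilometer's DECIMAL(20, 6): 7 + 3 = 10 bytes
print(decimal_storage(17, 6))  # DECIMAL(17, 6): 5 + 3 = 8 bytes

# 11 integer digits cover epoch seconds up to (10^11)-1, i.e. year 5138.
print(1970 + (10**11 - 1) // 31556952)  # 5138
```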
&lt;p&gt;Nonetheless, this workaround has been implemented using a &lt;a href=&quot;http://sqlalchemy.org&quot;&gt;SQLAlchemy&lt;/a&gt; custom type and works as expected:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import sqlalchemy

class PreciseTimestamp(sqlalchemy.types.TypeDecorator):
    &quot;&quot;&quot;Represents a timestamp precise to the microsecond.&quot;&quot;&quot;

    impl = sqlalchemy.DateTime

    def load_dialect_impl(self, dialect):
        if dialect.name == &apos;mysql&apos;:
            return dialect.type_descriptor(
                sqlalchemy.types.DECIMAL(precision=20,
                                         scale=6,
                                         asdecimal=True))
        return dialect.type_descriptor(self.impl)
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Microseconds in MySQL? Damn, migration!&lt;/h2&gt;
&lt;p&gt;As I said, MySQL 5.6.4 brought microsecond precision to the table (pun intended). Therefore, it&apos;s a great time to migrate away from this hackish format to the brand new one.&lt;/p&gt;
&lt;p&gt;First, be aware that the default &lt;code&gt;DATETIME&lt;/code&gt; type has no microsecond precision: &lt;a href=&quot;http://dev.mysql.com/doc/refman/5.7/en/datetime.html&quot;&gt;you have to specify how many digits you want as an argument&lt;/a&gt;.&lt;br /&gt;
To support microseconds, you should therefore use &lt;code&gt;DATETIME(6)&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;If we were using a great RDBMS, let&apos;s say, hum, PostgreSQL, we could do that&lt;br /&gt;
very easily, see:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;postgres=# CREATE TABLE foo (mytime decimal);
CREATE TABLE
postgres=# \d foo
      Table &quot;public.foo&quot;
 Column │  Type   │ Modifiers
────────┼─────────┼───────────
 mytime │ numeric │
postgres=# INSERT INTO foo (mytime) VALUES (1473254401.234);
INSERT 0 1
postgres=# ALTER TABLE foo ALTER COLUMN mytime SET DATA TYPE timestamp with time zone USING to_timestamp(mytime);
ALTER TABLE
postgres=# \d foo
              Table &quot;public.foo&quot;
 Column │           Type           │ Modifiers
────────┼──────────────────────────┼───────────
 mytime │ timestamp with time zone │

postgres=# select * from foo;
           mytime
────────────────────────────
 2016-09-07 13:20:01.234+00
(1 row)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;And since this is a pretty common use case, it&apos;s even &lt;a href=&quot;https://www.postgresql.org/docs/9.5/static/sql-altertable.html&quot;&gt;an example in the PostgreSQL documentation&lt;/a&gt;. The version from the documentation uses a calculation based on epoch, whereas my example here leverages the &lt;code&gt;to_timestamp()&lt;/code&gt; function. That&apos;s my personal touch.&lt;/p&gt;
&lt;p&gt;Obviously, doing this conversion in a single line is not possible with MySQL: it does not implement the &lt;code&gt;USING&lt;/code&gt; keyword on &lt;code&gt;ALTER TABLE … ALTER COLUMN&lt;/code&gt;. So what&apos;s the solution gonna be? Well, it&apos;s a 4-step job:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Create a new column of type &lt;code&gt;DATETIME(6)&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Copy data from the old column to the new column, converting them to the new format&lt;/li&gt;
&lt;li&gt;Delete the old column&lt;/li&gt;
&lt;li&gt;Rename the new column to the old column name.&lt;/li&gt;
&lt;/ol&gt;
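&lt;p&gt;For illustration, the same 4 steps can be replayed against SQLite (a stand-in here: it also lacks &lt;code&gt;ALTER TABLE … USING&lt;/code&gt;, the conversion that MySQL&apos;s &lt;code&gt;from_unixtime()&lt;/code&gt; would do is done in Python instead, and steps 3 and 4 need a recent SQLite):&lt;/p&gt;

```python
import sqlite3
from datetime import datetime, timezone

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE mytable (mytime NUMERIC)")
con.execute("INSERT INTO mytable VALUES (1325419200.213)")

# Step 1: create the new column.
con.execute("ALTER TABLE mytable ADD COLUMN mytime_ts TEXT")
# Step 2: copy the data over, converting epoch seconds to a datetime string.
for rowid, ts in con.execute("SELECT rowid, mytime FROM mytable").fetchall():
    dt = datetime.fromtimestamp(ts, tz=timezone.utc)
    con.execute("UPDATE mytable SET mytime_ts = ? WHERE rowid = ?",
                (dt.strftime("%Y-%m-%d %H:%M:%S.%f"), rowid))
# Step 3: delete the old column (needs SQLite >= 3.35).
con.execute("ALTER TABLE mytable DROP COLUMN mytime")
# Step 4: rename the new column to the old name (needs SQLite >= 3.25).
con.execute("ALTER TABLE mytable RENAME COLUMN mytime_ts TO mytime")

print(con.execute("SELECT mytime FROM mytable").fetchone()[0])
```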
&lt;p&gt;But I know what you&apos;re thinking: there are 4 steps, but that&apos;s not a problem, we&apos;ll just use a transaction and embed these operations inside.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;http://dev.mysql.com/doc/refman/5.7/en/cannot-roll-back.html&quot;&gt;MySQL does not support transactions on data definition language (DDL)&lt;/a&gt;.&lt;br /&gt;
So if any of those steps fails, you&apos;ll be unable to roll back steps 1, 3 and 4. Who knew that using MySQL was like living on the edge, right?&lt;/p&gt;
&lt;h2&gt;Doing this in Python with our friend Alembic&lt;/h2&gt;
&lt;p&gt;I like &lt;a href=&quot;http://alembic.zzzcomputing.com/&quot;&gt;Alembic&lt;/a&gt;. It&apos;s a Python library based on &lt;a href=&quot;http://sqlalchemy.org&quot;&gt;SQLAlchemy&lt;/a&gt; that handles schema migration for your favorite RDBMS.&lt;/p&gt;
&lt;p&gt;Once you have created a new Alembic migration script using &lt;code&gt;alembic revision&lt;/code&gt;, it&apos;s time to edit it and write something along these lines:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;from alembic import op
import sqlalchemy as sa
from sqlalchemy.dialects import mysql
from sqlalchemy.sql import func

class Timestamp(sa.types.TypeDecorator):
    &quot;&quot;&quot;Represents a timestamp precise to the microsecond.&quot;&quot;&quot;

    impl = sa.DateTime

    def load_dialect_impl(self, dialect):
        if dialect.name == &apos;mysql&apos;:
            return dialect.type_descriptor(mysql.DATETIME(fsp=6))
        return dialect.type_descriptor(self.impl)

def upgrade():
    bind = op.get_bind()
    if bind and bind.engine.name == &quot;mysql&quot;:
        existing_type = sa.types.DECIMAL(
            precision=20, scale=6, asdecimal=True)
        existing_col = sa.Column(&quot;mytime&quot;, existing_type, nullable=False)
        temp_col = sa.Column(&quot;mytime_ts&quot;, Timestamp(), nullable=False)
        # Step 1: ALTER TABLE mytable ADD COLUMN mytime_ts DATETIME(6)
        op.add_column(&quot;mytable&quot;, temp_col)
        t = sa.sql.table(&quot;mytable&quot;, existing_col, temp_col)
        # Step 2: UPDATE mytable SET mytime_ts=from_unixtime(mytime)
        op.execute(t.update().values(mytime_ts=func.from_unixtime(existing_col)))
        # Step 3: ALTER TABLE mytable DROP COLUMN mytime
        op.drop_column(&quot;mytable&quot;, &quot;mytime&quot;)
        # Step 4: ALTER TABLE mytable CHANGE mytime_ts mytime
        # Note: MySQL needs to have all the old/new information to just rename a column…
        op.alter_column(&quot;mytable&quot;,
                        &quot;mytime_ts&quot;,
                        nullable=False,
                        type_=Timestamp(),
                        existing_nullable=False,
                        existing_type=existing_type,
                        new_column_name=&quot;mytime&quot;)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In MySQL, the function to convert a UNIX timestamp (a float) to a datetime is &lt;code&gt;from_unixtime()&lt;/code&gt;, so the script leverages it to convert the data. As said, you&apos;ll notice we don&apos;t bother using any kind of transaction, so if anything goes wrong, there&apos;s no rollback, and it won&apos;t be possible to re-run the migration without manual intervention.&lt;/p&gt;
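&lt;p&gt;For reference, what &lt;code&gt;from_unixtime()&lt;/code&gt; computes can be sketched in Python – with the caveat that MySQL evaluates it in the &lt;em&gt;session&lt;/em&gt; time zone, which is why a UTC+1 session sees 13:00 for an epoch value that is 12:00 UTC:&lt;/p&gt;

```python
from datetime import datetime, timedelta, timezone

ts = 1325419200.213000  # the DECIMAL value stored by the old schema

utc = datetime.fromtimestamp(ts, tz=timezone.utc)
print(utc)  # 2012-01-01 12:00:00.213000+00:00

# A UTC+1 session (e.g. CET in winter) sees the same instant as 13:00.
cet = utc.astimezone(timezone(timedelta(hours=1)))
print(cet)  # 2012-01-01 13:00:00.213000+01:00
```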
&lt;p&gt;&lt;code&gt;Timestamp&lt;/code&gt; is a custom class that implements &lt;code&gt;sqlalchemy.DateTime&lt;/code&gt; using a &lt;code&gt;DATETIME(6)&lt;/code&gt; type for MySQL and a regular &lt;code&gt;sqlalchemy.DateTime&lt;/code&gt; type for other back-ends. It is used by the rest of the code (e.g. the ORM model), but I&apos;ve pasted it in this example for a better understanding.&lt;/p&gt;
&lt;p&gt;Once written, you can easily test your migration using &lt;a href=&quot;https://github.com/jd/pifpaf&quot;&gt;&lt;em&gt;pifpaf&lt;/em&gt;&lt;/a&gt; to run a temporary database:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;$ pifpaf run mysql $SHELL
$ alembic -c alembic/alembic.ini upgrade 1c98ac614015 # upgrade to the initial revision
$ mysql -S $PIFPAF_MYSQL_SOCKET pifpaf
mysql&amp;gt; INSERT INTO mytable (mytime) VALUES (1325419200.213000);
Query OK, 1 row affected (0.00 sec)

mysql&amp;gt; SELECT * FROM mytable;
+-------------------+
| mytime            |
+-------------------+
| 1325419200.213000 |
+-------------------+
1 row in set (0.00 sec)

$ alembic -c alembic/alembic.ini upgrade head

$ mysql -S $PIFPAF_MYSQL_SOCKET pifpaf
mysql&amp;gt; SELECT * FROM mytable;
+----------------------------+
| mytime                     |
+----------------------------+
| 2012-01-01 13:00:00.213000 |
+----------------------------+
1 row in set (0.00 sec)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;And voilà, we just unsafely migrated our data to a fancy new format. Thank you, Alembic, for solving a problem we would not have without MySQL. 😊&lt;/p&gt;
</content:encoded><category>python</category><category>databases</category><category>openstack</category></item><item><title>A retrospective of the OpenStack Telemetry project Newton cycle</title><link>https://julien.danjou.info/blog/retrospective-openstack-telemetry-newton/</link><guid isPermaLink="true">https://julien.danjou.info/blog/retrospective-openstack-telemetry-newton/</guid><description>A few weeks ago, I recorded an interview with Krishnan Raghuram about what was discussed for this development cycle for OpenStack Telemetry at the Austin summit.  It&apos;s interesting to look back at this</description><pubDate>Mon, 05 Sep 2016 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;A few weeks ago, I recorded an interview with Krishnan Raghuram about what was discussed for this development cycle for OpenStack Telemetry at the Austin summit.&lt;/p&gt;
&lt;p&gt;It&apos;s interesting to look back at this video more than 3 months after recording it, and see what actually happened to Telemetry. It turns out that some of the things that I thought were going to happen did not happen yet. As the first release candidate version is approaching, it&apos;s very unlikely they will.&lt;/p&gt;
&lt;p&gt;And on the other side, some new fancy features arrived suddenly without me having a clue about them.&lt;/p&gt;
&lt;p&gt;As far as &lt;strong&gt;Ceilometer&lt;/strong&gt; is concerned, here&apos;s the list of what really happened in terms of user features:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Added full support for SNMP v3 USM model&lt;/li&gt;
&lt;li&gt;Added support for batch measurement in Gnocchi dispatcher&lt;/li&gt;
&lt;li&gt;Set ended_at timestamp in Gnocchi dispatcher&lt;/li&gt;
&lt;li&gt;Allow Swift pollster to specify regions&lt;/li&gt;
&lt;li&gt;Add L3 cache usage and memory bandwidth meters&lt;/li&gt;
&lt;li&gt;Split out the event code (REST API and storage) to a new &lt;strong&gt;Panko&lt;/strong&gt; project&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;And a few other minor things. I planned none of them except Panko (which I was responsible for), and the ones we did plan (documentation update, pipeline rework and polling enhancement) have not happened yet.&lt;/p&gt;
&lt;p&gt;For &lt;strong&gt;Aodh&lt;/strong&gt;, we expected to rework the documentation entirely too, and that did not happen either. What we did instead:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Deprecate and disable combination alarms&lt;/li&gt;
&lt;li&gt;Add pagination support in REST API&lt;/li&gt;
&lt;li&gt;Deprecate all non-SQL database stores and provide a tool to migrate&lt;/li&gt;
&lt;li&gt;Support batch notification for aodh-notifier&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;It&apos;s definitely a good list of new features for Aodh – still small, but it simplifies the project, removes technical debt and continues building momentum around it.&lt;/p&gt;
&lt;p&gt;For &lt;strong&gt;Gnocchi&lt;/strong&gt;, we really had no plan, except maybe a few small features (they&apos;re usually tracked in the Launchpad bug list). It turned out we had some fancy new idea with Gordon Chung on how to boost our storage engine, so we worked on that. It kept us busy for a few weeks in the end, though the preliminary results look tremendous – so it was definitely worth it. We also have an AWS S3 storage driver on its way.&lt;/p&gt;
&lt;p&gt;I find this exercise interesting, as it really emphasizes how you can&apos;t really control what&apos;s happening in any open source project, where your contributors come and go and work on their own agenda.&lt;/p&gt;
&lt;p&gt;That does not mean we&apos;re dropping the themes and ideas I laid out in that video. We&apos;re still pushing our &quot;documentation is mandatory&quot; policy and improving our &quot;works by default&quot; scenario. It&apos;s just a longer road than we expected.&lt;/p&gt;
</content:encoded><category>openstack</category><category>gnocchi</category></item><item><title>The bad practice in FOSS projects management</title><link>https://julien.danjou.info/blog/foss-projects-management-bad-practice/</link><guid isPermaLink="true">https://julien.danjou.info/blog/foss-projects-management-bad-practice/</guid><description>During the OpenStack summit a few weeks ago, I had the chance to talk to some people about my experience on running open source projects. It turns out that after hanging out in communities and contrib</description><pubDate>Thu, 09 Jun 2016 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;During the OpenStack summit a few weeks ago, I had the chance to talk to some people about my experience on running open source projects. It turns out that after hanging out in communities and contributing to many projects for years, I may be able to provide some hindsight and an external eye to many of those who are new to it.&lt;/p&gt;
&lt;p&gt;There are plenty of resources out there explaining how to run an open source project. Today, I would like to take a different angle and emphasize what you should not &lt;em&gt;socially&lt;/em&gt; do in your projects. This list comes from various open source projects I have encountered these past years. I&apos;m going to go through some of the bad practices I&apos;ve spotted, in random order, illustrated by concrete examples.&lt;/p&gt;
&lt;h2&gt;Seeing contributors as an annoyance&lt;/h2&gt;
&lt;p&gt;When software developers and maintainers are busy, there&apos;s one thing they don&apos;t need: more work. To many people, the instinctive reaction to an external contribution is: damn, more work. And actually, it is.&lt;/p&gt;
&lt;p&gt;Therefore, some maintainers tend to avoid that surplus of work: they state that they don&apos;t want contributions, or they make contributors feel unwelcome. This can take a lot of different forms, from ignoring them to being unpleasant. It indeed avoids the immediate need to deal with the work that has been added to the maintainer&apos;s shoulders.&lt;/p&gt;
&lt;p&gt;This is one of the biggest mistakes and misconceptions in open source. If people are sending you more work, you should do whatever it takes to make them feel welcome so they continue working with you. They might pretty soon be doing the work you are doing instead of you. Think: retirement!&lt;/p&gt;
&lt;p&gt;Let&apos;s take a look at my friend Gordon, whom I saw start as a Ceilometer contributor in 2013. He was doing great code reviews, but he was actually giving me more work by catching bugs in my patches and sending patches I had to review. Instead of bullying him so he would stop making me rework my code and review his patches, &lt;a href=&quot;http://lists.openstack.org/pipermail/openstack-dev/2013-May/008975.html&quot;&gt;I requested that we trust him even more by adding him as a core reviewer&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;And if contributors feel unwelcome after a one-time contribution, they won&apos;t make a second one. They won&apos;t make any. Those projects may have just lost their new maintainers.&lt;/p&gt;
&lt;h2&gt;Letting people only do the grunt work&lt;/h2&gt;
&lt;p&gt;When new contributors arrive and want to contribute to a particular project, they may have very different motivations. Some of them are users, but some of them are just people curious about what contributing feels like – for the thrill of it, as an exercise, or out of a willingness to learn and start contributing back to the ecosystem they use.&lt;/p&gt;
&lt;p&gt;The usual response from maintainers is to push people into doing grunt work. That means jobs that are of no interest, little value, and probably no direct impact on the project.&lt;/p&gt;
&lt;p&gt;Some people actually have no problem with that, and some do. Some will feel offended at being given low-impact work, and some will love it as long as you give them some sort of acknowledgment. Be aware of this, and be sure to high-five the people doing it. That&apos;s the only way to keep them around.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/computer-coding.jpg&quot; alt=&quot;computer-coding&quot; /&gt;&lt;/p&gt;
&lt;h2&gt;Not valorizing small contributions&lt;/h2&gt;
&lt;p&gt;When the first patch that comes in from a new contributor is a typo fix, what do developers think? That they don&apos;t care, that you&apos;re wasting their precious time with your small contribution. And nobody cares about bad English in the documentation, do they?&lt;/p&gt;
&lt;p&gt;This is wrong. See my first contributions to &lt;a href=&quot;https://github.com/home-assistant/home-assistant/commit/36cb12cd157b22bdc1fa28b700ca0fb751cca7a4&quot;&gt;home-assistant&lt;/a&gt; and &lt;a href=&quot;https://github.com/marijnh/Postmodern/commit/ec537f72393e1032853b78e0b7b4d0ff98632a02&quot;&gt;Postmodern&lt;/a&gt;: I fixed typos in the documentation.&lt;/p&gt;
&lt;p&gt;I contributed to &lt;a href=&quot;http://orgmode.org&quot;&gt;Org-mode&lt;/a&gt; for a few years. &lt;a href=&quot;http://repo.or.cz/org-mode.git/commit/a153f5a31dffbc6b78a8c5d8d027961abe585a38&quot;&gt;My first patch to Org-mode&lt;/a&gt; was about fixing a docstring. Then, I sent 56 patches, fixing bugs and adding fancy features, and also wrote a few external modules. To this day, I&apos;m still #16 in the top-committer list of Org-mode, which has 390 contributors. So not what you would call a small contributor. I am sure the community is glad they did not despise my documentation fix.&lt;/p&gt;
&lt;h2&gt;Setting the bar too high for newcomers&lt;/h2&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/too-high.png&quot; alt=&quot;too-high&quot; /&gt;&lt;/p&gt;
&lt;p&gt;When new contributors arrive, their knowledge of the project, its context, and the technologies can vary largely. One of the mistakes people often make is to ask contributors for things too complicated for them to achieve. That scares them away (many people are shy or introverted), and they may just disappear, feeling too stupid to help.&lt;/p&gt;
&lt;p&gt;Before making any comment, you should not make any assumptions about their knowledge; that should avoid such situations. You also should be very delicate when assessing their skills, as some people might feel vexed if you underestimate them too much.&lt;/p&gt;
&lt;p&gt;Once that level has been properly evaluated (a few exchanges should be enough), you need to mentor your contributors to the right degree so they can blossom. It takes time and experience to master this, and you may well lose some of them in the process, but it&apos;s a path every maintainer has to take.&lt;/p&gt;
&lt;p&gt;Mentoring is a very important aspect of welcoming new contributors to your project, whatever it is. I am pretty sure that applies nicely outside free software too.&lt;/p&gt;
&lt;h2&gt;Requiring people to make sacrifices with their lives&lt;/h2&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/balance-stones.jpg&quot; alt=&quot;balance-stones&quot; /&gt;&lt;/p&gt;
&lt;p&gt;This is an aspect that varies a lot depending on the project and context, but it&apos;s really important. As a free software project, where most people will contribute on their own good will and sometimes spare time, you must not require them to make big sacrifices. This won&apos;t work.&lt;/p&gt;
&lt;p&gt;One of the worst implementations of that is requiring people to fly 5,000 kilometers to meet in some place to discuss the project. This puts contributors in an unfair position, depending on their ability to leave their family for a week, take a plane/boat/car/train, rent a hotel room, etc. This is not good, and you should avoid anything that &lt;em&gt;requires&lt;/em&gt; people to do that in order to participate, feel included in the project, and blend into your community. Don&apos;t get me wrong: that does not mean social activities should be prohibited, on the contrary. Just avoid excluding people from project discussions.&lt;/p&gt;
&lt;p&gt;The same applies to any other form of discussion that makes it complicated for everyone to participate: IRC meetings (it&apos;s hard for some people to set aside an hour, especially depending on the timezone they live in), video-conferences (especially using non-free software), etc.&lt;/p&gt;
&lt;p&gt;Everything that requires people to basically interact with the project in a synchronous manner for a period of time will put constraints on them that can make them uncomfortable.&lt;/p&gt;
&lt;p&gt;The best media are still e-mail and its asynchronous derivatives (bug trackers, etc.), as they allow people to work at their own pace and on their own time.&lt;/p&gt;
&lt;h2&gt;Not having an (implicit) CoC&lt;/h2&gt;
&lt;p&gt;Codes of conduct seem to be a trendy topic (and a touchy subject), as more and more communities are opening up to a wider audience than they used to – which is great.&lt;/p&gt;
&lt;p&gt;Actually, all communities have a code of conduct, whether it is written in black ink or carried unconsciously in everyone&apos;s mind. Its form is a matter of community size and culture.&lt;/p&gt;
&lt;p&gt;Now, depending on the size of your community and how you feel comfortable applying it, you may want to have it composed in a document, e.g. like &lt;a href=&quot;https://www.debian.org/code_of_conduct&quot;&gt;Debian did&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Having a code of conduct does not magically transform your whole project community into a bunch of carebears following its guidance. But it provides a reference point you can fall back on as soon as you need it. You can invoke it to indicate to some people that their behavior is not welcome in the project, and it somehow eases their potential exclusion – even if nobody generally wants to go that far, and it&apos;s rarely that useful.&lt;/p&gt;
&lt;p&gt;I don&apos;t think it&apos;s mandatory to have such a paper on smaller projects. But you have to keep in mind that the implicit code of conduct will be derived from &lt;em&gt;your&lt;/em&gt; own behavior. The way your leader(s) will communicate with others will set the entire social mood of the project. Do not underestimate that.&lt;/p&gt;
&lt;p&gt;When we started the &lt;a href=&quot;http://launchpad.net/ceilometer&quot;&gt;Ceilometer&lt;/a&gt; project, we implicitly followed the &lt;a href=&quot;https://www.openstack.org/legal/community-code-of-conduct/&quot;&gt;OpenStack Code of Conduct&lt;/a&gt; before it even existed, and probably set the bar a little higher. Being nice, welcoming and open-minded, we achieved a decent level of diversity, with up to 25% of our core team being women – way above the current ratio in OpenStack and most open source projects!&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/friends-beach.jpg&quot; alt=&quot;friends-beach&quot; /&gt;&lt;/p&gt;
&lt;h2&gt;Making non-native English speakers feel like outsiders&lt;/h2&gt;
&lt;p&gt;It&apos;s quite important to be aware that the vast majority of free software projects out there use English as their common language of communication. It makes a lot of sense: it&apos;s a widely spoken language, and it seems to do the job correctly.&lt;/p&gt;
&lt;p&gt;But a large part of the hackers out there are not native English speakers. Many are not able to speak English fluently. That means the rate at which they can communicate and carry a conversation might be very low, which can frustrate some people, especially native English speakers.&lt;/p&gt;
&lt;p&gt;The main demonstration of this phenomenon can be seen at social events (e.g. conferences) where people are debating. It can be very hard for people to explain their thoughts in English and to communicate at a decent rate, making the conversation and the transmission of ideas slow. The worst thing one can see in this context is a native English speaker cutting people off and ignoring them, just because they are talking too slowly. I do understand that it can be frustrating, but the problem here is not the non-native English speakers; it&apos;s the choice of an oral medium, which does not put everyone on the same level.&lt;/p&gt;
&lt;p&gt;To a lesser extent, the same applies to IRC meetings, which are relatively synchronous. Completely asynchronous media do not have this flaw, which is why, in my opinion, they should be preferred.&lt;/p&gt;
&lt;h2&gt;No vision, no delegation&lt;/h2&gt;
&lt;p&gt;These are two of the most commonly encountered mistakes in open source projects: the maintainer struggling with the growth of their project while people are trying to help.&lt;/p&gt;
&lt;p&gt;Indeed, when the flow of contributors starts coming in, adding new features, asking for feedback and directions, some maintainers choke and don&apos;t know how to respond. That ends up frustrating contributors, who therefore may simply vanish.&lt;/p&gt;
&lt;p&gt;It&apos;s important to have a vision for your project and to communicate it. Make it clear to contributors what you want or don&apos;t want in your project. Transmitting it in a clear (and non-aggressive, please) manner is a good way of lowering the friction with contributors. They&apos;ll know pretty soon whether they want to join your ship or not, and what to expect. So be a good captain.&lt;/p&gt;
&lt;p&gt;If they choose to work with you and contribute, you should start trusting them as soon as you can and delegate some of your responsibilities. This can be anything you used to do yourself: reviewing patches targeting some subsystem, fixing bugs, writing docs. Let people own an entire part of the project so they feel responsible for it and care about it as much as you do. Doing the opposite – being a control freak – is the best shot at staying alone with your open source software.&lt;/p&gt;
&lt;p&gt;And no project is going to grow and be successful that way.&lt;/p&gt;
&lt;p&gt;In 2009, when Uli Schlachter sent &lt;a href=&quot;http://article.gmane.org/gmane.comp.window-managers.awesome.devel/1746/match=uli+schlachter&quot;&gt;his first patch to awesome&lt;/a&gt;, this was more work for me. I had to review this patch, and I was already pretty busy designing the new versions of awesome and doing my day job! Uli&apos;s work was not perfect, and I had to fix it myself. More work. And what did I do? A few minutes later, I &lt;a href=&quot;http://article.gmane.org/gmane.comp.window-managers.awesome.devel/1747/match=uli+schlachter&quot;&gt;replied to him&lt;/a&gt; with a clear plan of what he should do and what I thought about his work.&lt;/p&gt;
&lt;p&gt;In response, Uli sent patches and improved the project. Do you know what Uli does today? He has been managing the awesome window manager project in my place since 2010. I managed to transmit my vision, delegate, and then retire!&lt;/p&gt;
&lt;h2&gt;Non-recognition of contributions&lt;/h2&gt;
&lt;p&gt;People contribute in different ways, and it&apos;s not always code. There are a lot of things around a free software project: documentation, bug triage, user support, user experience design, communication, translation…&lt;/p&gt;
&lt;p&gt;It took a while, for example, for &lt;a href=&quot;http://debian.org&quot;&gt;Debian&lt;/a&gt; to recognize that their translators could have the status of Debian Developer. &lt;a href=&quot;http://openstack.org&quot;&gt;OpenStack&lt;/a&gt; is working in the same direction by trying to &lt;a href=&quot;https://wiki.openstack.org/wiki/NonATCRecognition&quot;&gt;recognize non-technical contributions&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;As soon as your project starts attributing badges to some people and creating classes of members in the community, you should be very careful not to forget anyone. That&apos;s the easiest way to lose contributors along the road.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/heart-sign.jpg&quot; alt=&quot;heart-sign&quot; /&gt;&lt;/p&gt;
&lt;h2&gt;Don&apos;t forget to be thankful&lt;/h2&gt;
&lt;p&gt;This whole list has been inspired by many years of open source hacking and free software contributions. Everyone&apos;s experience and feelings might differ, or these malpractices may have appeared to you under different forms. Let me know if there&apos;s any other practice you encountered that blocked you from contributing to open source projects!&lt;/p&gt;
</content:encoded><category>open-source</category><category>openstack</category><category>awesome</category><category>debian</category><category>emacs</category></item><item><title>Gnocchi talk at the Paris Monitoring Meetup #6</title><link>https://julien.danjou.info/blog/paris-monitoring-6-gnocchi/</link><guid isPermaLink="true">https://julien.danjou.info/blog/paris-monitoring-6-gnocchi/</guid><description>Last week was the sixth edition of the Paris Monitoring Meetup, where I was invited as a speaker to present and talk about Gnocchi.</description><pubDate>Fri, 27 May 2016 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Last week was the sixth edition of the &lt;a href=&quot;http://www.meetup.com/Paris-Monitoring/events/230515751/&quot;&gt;Paris Monitoring Meetup&lt;/a&gt;, where I was invited as a speaker to present and talk about &lt;a href=&quot;http://gnocchi.xyz&quot;&gt;Gnocchi&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/paris-monitoring.png&quot; alt=&quot;paris-monitoring&quot; /&gt;&lt;/p&gt;
&lt;p&gt;There were around 50 people in the room, listening to my presentation of Gnocchi.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/jd-gnocchi-paris-monitoring-meetup-6.jpg&quot; alt=&quot;jd-gnocchi-paris-monitoring-meetup-6&quot; /&gt;&lt;/p&gt;
&lt;p&gt;The talk went fine and I got a few interesting questions and some feedback. One interesting point that keeps coming up when talking about Gnocchi is its OpenStack label, which scares away a lot of people. We definitely need to keep explaining that the project works stand-alone and has no dependency on OpenStack, just a great integration with it.&lt;/p&gt;
&lt;p&gt;The &lt;a href=&quot;http://www.monitoring-fr.org/&quot;&gt;Monitoring-fr&lt;/a&gt; organization also &lt;a href=&quot;http://www.monitoring-fr.org/2016/05/meetup-paris-monitoring-6-interview-de-julien-danjou-pour-gnocchi-metric-as-a-service/&quot;&gt;interviewed me&lt;/a&gt; after the meetup about Gnocchi. The interview is in French, obviously. I talk about Gnocchi, what it does, how it does it and why we started the project a couple of years ago. Enjoy, and let me know what you think!&lt;/p&gt;
</content:encoded><category>talks</category><category>monitoring</category><category>gnocchi</category><category>openstack</category></item><item><title>OpenStack Summit Newton from a Telemetry point of view</title><link>https://julien.danjou.info/blog/openstack-summit-newton-austin-telemetry/</link><guid isPermaLink="true">https://julien.danjou.info/blog/openstack-summit-newton-austin-telemetry/</guid><description>It&apos;s again that time of the year, where we all fly out to a different country to chat about OpenStack and what we&apos;ll do during the next 6 months.</description><pubDate>Mon, 02 May 2016 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;It&apos;s again that time of the year, where we all fly out to a different country to chat about OpenStack and what we&apos;ll do during the next 6 months. This time, it was in &lt;a href=&quot;https://en.wikipedia.org/wiki/Austin,_Texas&quot;&gt;Austin, TX&lt;/a&gt; and we chatted about the new Newton release that will be out in October.&lt;/p&gt;
&lt;p&gt;As the &lt;em&gt;Project Team Leader&lt;/em&gt; for the Telemetry project, I set up and animated the week for our team. We had 9 discussion slots of 40 minutes assigned, but finally only used 8. We also, somehow, canceled the contributor team meet-up on the last day, as only a few of us developers were there and available.&lt;/p&gt;
&lt;p&gt;We took &lt;a href=&quot;https://wiki.openstack.org/wiki/Design_Summit/Newton/Etherpads#Telemetry&quot;&gt;a few notes in our Etherpads&lt;/a&gt;, but I think most of them were pretty sparse, as there was nothing really important we talked about. Actually, many topics had already been discussed and covered 6 months ago in Tokyo during the previous summit. We just did not have time to implement everything we wanted, so talking it over again would not have been of great help.&lt;/p&gt;
&lt;h2&gt;Reference architecture&lt;/h2&gt;
&lt;p&gt;Unfortunately, neither Gordon Chung nor the &lt;a href=&quot;https://osic.org/&quot;&gt;OpenStack Innovation Center&lt;/a&gt; had time to run the tests and benchmarks they wanted to run before the summit. We still discussed their plan to run tests and benchmarks of the whole Telemetry suite (Ceilometer, Gnocchi &amp;amp; Aodh). They should run their tests in a few weeks, for 3 weeks, no more. The window to run the tests being narrow, they want to be sure they are prepared, and will reach out to us for help, ideas, and validation.&lt;/p&gt;
&lt;p&gt;I&apos;ve also requested them to, if possible, provide us some profiling (e.g. cProfile) data so we can have better knowledge of the area we can optimize.&lt;/p&gt;
&lt;h2&gt;Gnocchi, next steps&lt;/h2&gt;
&lt;p&gt;This session was particularly smooth since most people in the room were not up-to-date with Gnocchi 2.1. Some people expressed concern about the InfluxDB driver removal, though they were not aware of the bugs it had, nor that Gnocchi was actually performing better – so they may very likely test Gnocchi directly instead.&lt;/p&gt;
&lt;p&gt;No particular fancy feature was requested, only a few bugs and ideas noted on Launchpad were discussed.&lt;/p&gt;
&lt;h2&gt;Enhancing Ceilometer polling&lt;/h2&gt;
&lt;p&gt;This session was not particularly productive, as everything we wanted to discuss was already on the Etherpad from… Tokyo, 6 months ago. It turns out nobody had time to pursue this project, so we&apos;ll see what happens. There&apos;s definitely some work to do to pursue our goal of splitting the pipeline definition into smaller files.&lt;/p&gt;
&lt;h2&gt;Aodh roadmap &amp;amp; improvements&lt;/h2&gt;
&lt;p&gt;First, we decided to definitely kill the combination alarm in the future, in favor of the new composite alarms definition that we like better.&lt;/p&gt;
&lt;p&gt;We should switch to &lt;a href=&quot;http://docs.openstack.org/developer/python-openstackclient/&quot;&gt;OpenStackClient&lt;/a&gt; in the future for &lt;a href=&quot;http://docs.openstack.org/developer/python-aodhclient/&quot;&gt;aodhclient&lt;/a&gt;. The OSC team indicated they are willing to provide a way to keep the &quot;aodh&quot; CLI command on its own, which is something that had blocked us from moving to OSC.&lt;/p&gt;
&lt;p&gt;A bunch of people indicated interest in support for alarm CRUD in the Horizon dashboard. They should work together with the Horizon team to complete what has recently been started in Horizon to add Aodh support.&lt;/p&gt;
&lt;h2&gt;Ceilometer splitting&lt;/h2&gt;
&lt;p&gt;A year ago, we decided to split Ceilometer and its alarm feature: Aodh was born. We did discuss doing it again 6 months ago, but nothing happened as we already had so much on our plate.&lt;/p&gt;
&lt;p&gt;As far as I&apos;m concerned, I think it&apos;s now time to split some Ceilometer functionality again, so I&apos;m going to do that this time with the event part. Gordon found a name, and this new project will be named &lt;em&gt;Panko&lt;/em&gt;.&lt;/p&gt;
&lt;h2&gt;Documentation&lt;/h2&gt;
&lt;p&gt;We then discussed our documentation. Users present in the room were particularly happy with the Gnocchi policy that we have applied since the beginning: no doc = no merge of your patch. The consensus is to move forward with this policy for all Telemetry projects, especially since it&apos;s now clear that the documentation team is not going to help us more. Ildikó, our documentation wizard, will take care of making some links between the official OpenStack documentation and our projects, avoiding content duplication.&lt;/p&gt;
&lt;p&gt;For this cycle, my personal plan is to document Aodh up to roughly 80 %, and then force that policy on newly implemented changes.&lt;/p&gt;
&lt;h2&gt;Events management&lt;/h2&gt;
&lt;p&gt;The event management part of Ceilometer and its API (soon to be split into its own project, as stated above) was discussed in this session. Nothing really exciting came out of it, as nobody is willing to enhance it for now. Which, again, makes it a great candidate for splitting out of Ceilometer.&lt;/p&gt;
&lt;h2&gt;Vitrage&lt;/h2&gt;
&lt;p&gt;The last session was dedicated to &lt;a href=&quot;https://wiki.openstack.org/wiki/Vitrage&quot;&gt;Vitrage&lt;/a&gt;, a root cause analysis tool built on OpenStack. The Vitrage team had a few features that they wanted to see in Aodh, so we discussed that at length. Notably, more support for sending notifications on events (alarm creation, deletion…) should be added in this next release.&lt;/p&gt;
&lt;p&gt;Also, a new alarm type that would be entirely managed and triggered over HTTP would be very useful for external projects such as Vitrage. We&apos;ll try to make that happen during this cycle too.&lt;/p&gt;
&lt;h2&gt;Talks&lt;/h2&gt;
&lt;p&gt;There were a few interesting talks about our telemetry projects during this summit, among other I highly recommend watching:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://www.youtube.com/watch?v=W5KT5GJKJw8&quot;&gt;OpenStack Ceilometer with Gnocchi and Aodh Feature&lt;/a&gt;, where Amol and Paul from Ericsson explain what Gnocchi and Aodh do and how they work, and then help people deploy it on their lab.&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.youtube.com/watch?v=BdebhsBFEJs&quot;&gt;DPDK, Collectd &amp;amp; Ceilometer The Missing Link&lt;/a&gt;, where Ryota Mibu, one of the contributors to Aodh, explains why he implemented the event alarm feature&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.youtube.com/watch?v=-K8NI38LPtU&quot;&gt;Showback &amp;amp; Chargeback!! OpenStack Gnocchi + Cloudkitty as a Whole Billing System&lt;/a&gt;, where Maximiliano Venesio (Nubeliu) and Stéphane Albert (Objectif Libre) talk about how they built an amazing scalable billing solution using &lt;a href=&quot;http://gnocchi.xyz&quot;&gt;Gnocchi&lt;/a&gt; and &lt;a href=&quot;https://wiki.openstack.org/wiki/CloudKitty&quot;&gt;CloudKitty&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.youtube.com/watch?v=0Q8pfbwxMb8&quot;&gt;Using Ceilometer Data for Effective Witch-Hunting&lt;/a&gt;, where Mike explains how Overstock.com leveraged Ceilometer to track anomalies in their cloud.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;All of this should keep me and the team busy for the next cycle. If you have any question about what has been discussed or the future of our projects, don&apos;t hesitate to leave a comment or ask us on the &lt;a href=&quot;http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev&quot;&gt;OpenStack development mailing list&lt;/a&gt;.&lt;/p&gt;
</content:encoded><category>openstack</category><category>gnocchi</category></item><item><title>Gnocchi 2.1 release</title><link>https://julien.danjou.info/blog/gnocchi-2-1-release/</link><guid isPermaLink="true">https://julien.danjou.info/blog/gnocchi-2-1-release/</guid><description>A little less than 2 months after our latest major release, here is the new minor version of Gnocchi, stamped 2.1.0.</description><pubDate>Wed, 13 Apr 2016 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;A little less than 2 months after our latest major release, here is the new minor version of Gnocchi, stamped &lt;a href=&quot;https://launchpad.net/gnocchi/2.1/2.1.0&quot;&gt;2.1.0&lt;/a&gt;. It was a smooth release, but with one major feature implemented by my fellow fantastic developer Mehdi Abaakouk: the ability to create resource types dynamically.&lt;/p&gt;
&lt;h2&gt;Resource types REST API&lt;/h2&gt;
&lt;p&gt;This new version of Gnocchi offers the long-awaited ability to create resource types dynamically. What does that mean? Well, until version 2.0, the resources that you were able to create in Gnocchi had a particular type that was defined in the code: instance, volume, SNMP host, Swift account, etc. All of them were tied to OpenStack, since it was our primary use case.&lt;/p&gt;
&lt;p&gt;Now, &lt;a href=&quot;http://gnocchi.xyz/rest.html#resource-types&quot;&gt;the API allows creating resource types dynamically&lt;/a&gt;. This means you can create your own custom types to describe your own architecture. You can then exploit the same features that were offered before: history of your resources, searching through them, associating metrics, etc.!&lt;/p&gt;
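&lt;p&gt;To give an idea of what such a dynamic type definition looks like, here is a small sketch building the kind of JSON payload you would send to the &lt;code&gt;/v1/resource_type&lt;/code&gt; endpoint. The attribute schema follows the Gnocchi 2.1 REST documentation, but the &lt;em&gt;switch&lt;/em&gt; type and its attributes are made up for illustration – check the exact field names against the version you deploy:&lt;/p&gt;

```python
import json

# Hypothetical resource type definition for a network switch.
# "type", "required", "max_length" and "min" are attribute options
# described in the Gnocchi resource-type REST documentation; the
# "switch" name and its attributes are purely illustrative.
payload = {
    "name": "switch",
    "attributes": {
        "vendor": {"type": "string", "required": True, "max_length": 64},
        "port_count": {"type": "number", "required": False, "min": 0},
    },
}

# This JSON body would be POSTed to /v1/resource_type.
body = json.dumps(payload, sort_keys=True)
print(body)
```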
&lt;h2&gt;Performance improvements&lt;/h2&gt;
&lt;p&gt;We did some profiling and benchmarking on Gnocchi, and with the help of my fellow developer Gordon Chung, improved the metric processing performance.&lt;/p&gt;
&lt;p&gt;The API speed improved a bit, and I&apos;ve measured the Gnocchi API endpoint ingesting up to 190k measures/s with only one node (the same as used in my &lt;a href=&quot;https://julien.danjou.info/blog/gnocchi-benchmarks&quot;&gt;previous benchmark&lt;/a&gt;) using &lt;a href=&quot;https://uwsgi-docs.readthedocs.org/&quot;&gt;uwsgi&lt;/a&gt; – a 50 % improvement. The time required to compute aggregations on new measures is now also metered and displayed in the &lt;code&gt;gnocchi-metricd&lt;/code&gt; log in debug mode. Handy to get an idea of how fast your measures are processed.&lt;/p&gt;
&lt;h2&gt;Ceph backend optimization&lt;/h2&gt;
&lt;p&gt;The Ceph back-end has been improved again by Mehdi. We&apos;re now relying on OMAP rather than xattr for finer grained control and better performance.&lt;/p&gt;
&lt;p&gt;We already have a few new features being prepared for our next release, so stay tuned! And if you have any suggestion, feel free to say a word.&lt;/p&gt;
</content:encoded><category>gnocchi</category><category>openstack</category></item><item><title>Pifpaf, or how to run any daemon briefly</title><link>https://julien.danjou.info/blog/pifpaf-a-tool-to-run-daemon-briefly/</link><guid isPermaLink="true">https://julien.danjou.info/blog/pifpaf-a-tool-to-run-daemon-briefly/</guid><description>There&apos;s a lot of situation where you end up needing a software deployed temporarily. This can happen when testing something manually, when running a script or when launching a test suite.  Indeed, man</description><pubDate>Fri, 08 Apr 2016 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;There are a lot of situations where you end up needing software deployed temporarily. This can happen when testing something manually, when running a script or when launching a test suite.&lt;/p&gt;
&lt;p&gt;Indeed, many applications need to use and interconnect with external software: an RDBMS (&lt;a href=&quot;http://postgressql.org&quot;&gt;PostgreSQL&lt;/a&gt;, &lt;a href=&quot;http://mysql.org&quot;&gt;MySQL&lt;/a&gt;…), a cache (&lt;a href=&quot;http://memcached.org&quot;&gt;memcached&lt;/a&gt;, &lt;a href=&quot;http://redis.io&quot;&gt;Redis&lt;/a&gt;…) or any other external component. This tends to make running a software (or its test suite) more difficult. If you want to rely on this component being installed and deployed, you end up needing a full environment set up and properly configured to run your tests. Which is discouraging.&lt;/p&gt;
&lt;p&gt;The different &lt;a href=&quot;http://openstack.org&quot;&gt;OpenStack&lt;/a&gt; projects I work on ended up pretty soon spawning some of their back-ends temporarily to run their tests. Some of those unit tests somehow became entirely what you would call functional or integration tests. But that&apos;s just a name. In the end, what we ended up doing is testing that the software was really working. And there&apos;s no better way doing that than talking to a real PostgreSQL instance rather than mocking every call.&lt;/p&gt;
&lt;h2&gt;Pifpaf to the rescue&lt;/h2&gt;
&lt;p&gt;To solve that issue, I created a new tool, named &lt;em&gt;&lt;a href=&quot;https://github.com/jd/pifpaf&quot;&gt;Pifpaf&lt;/a&gt;&lt;/em&gt;. &lt;em&gt;Pifpaf&lt;/em&gt; makes it easy to run any daemon in test mode for a brief moment, before making it disappear completely. It&apos;s pretty easy to install as &lt;a href=&quot;http://pypi.python.org/pypi/pifpaf&quot;&gt;it is available on PyPI&lt;/a&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;$ pip install pifpaf
Collecting pifpaf
[…]
Installing collected packages: pifpaf
Successfully installed pifpaf-0.0.7
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;You can then use it to run any of the listed daemons:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;$ pifpaf list
+---------------+
| Daemons       |
+---------------+
| redis         |
| postgresql    |
| mongodb       |
| zookeeper     |
| aodh          |
| influxdb      |
| ceph          |
| elasticsearch |
| etcd          |
| mysql         |
| memcached     |
| rabbitmq      |
| gnocchi       |
+---------------+
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;em&gt;Pifpaf&lt;/em&gt; accepts any shell command line to execute after its arguments:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;$ pifpaf run postgresql -- psql
Expanded display is used automatically.
Line style is unicode.
SET
psql (9.5.2)
Type &quot;help&quot; for help.

template1=# \l
                              List of databases
   Name    │ Owner │ Encoding │   Collate   │    Ctype    │ Access privileges
───────────┼───────┼──────────┼─────────────┼─────────────┼───────────────────
 postgres  │ jd    │ UTF8     │ en_US.UTF-8 │ en_US.UTF-8 │
 template0 │ jd    │ UTF8     │ en_US.UTF-8 │ en_US.UTF-8 │ =c/jd            ↵
           │       │          │             │             │ jd=CTc/jd
 template1 │ jd    │ UTF8     │ en_US.UTF-8 │ en_US.UTF-8 │ =c/jd            ↵
           │       │          │             │             │ jd=CTc/jd
(3 rows)

template1=# create database foobar;
CREATE DATABASE
template1=# \l
                              List of databases
   Name    │ Owner │ Encoding │   Collate   │    Ctype    │ Access privileges
───────────┼───────┼──────────┼─────────────┼─────────────┼───────────────────
 foobar    │ jd    │ UTF8     │ en_US.UTF-8 │ en_US.UTF-8 │
 postgres  │ jd    │ UTF8     │ en_US.UTF-8 │ en_US.UTF-8 │
 template0 │ jd    │ UTF8     │ en_US.UTF-8 │ en_US.UTF-8 │ =c/jd            ↵
           │       │          │             │             │ jd=CTc/jd
 template1 │ jd    │ UTF8     │ en_US.UTF-8 │ en_US.UTF-8 │ =c/jd            ↵
           │       │          │             │             │ jd=CTc/jd
(4 rows)

template1=# \q
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;What &lt;em&gt;pifpaf&lt;/em&gt; does is run the different commands needed to create a new PostgreSQL cluster and then start PostgreSQL on a temporary port for you. So your &lt;em&gt;psql&lt;/em&gt; session actually connects to a temporary PostgreSQL server, which is trashed as soon as you quit &lt;em&gt;psql&lt;/em&gt;. And all of that in less than 10 seconds, without the use of any virtualization or container technology!&lt;/p&gt;
&lt;p&gt;You can see what it does in detail using the &lt;em&gt;debug&lt;/em&gt; mode:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;$ pifpaf --debug run mysql $SHELL
DEBUG: pifpaf.drivers: executing: [&apos;mysqld&apos;, &apos;--initialize-insecure&apos;, &apos;--datadir=/var/folders/7k/pwdhb_mj2cv4zyr0kyrlzjx40000gq/T/tmpkut9bg&apos;]
DEBUG: pifpaf.drivers: executing: [&apos;mysqld&apos;, &apos;--datadir=/var/folders/7k/pwdhb_mj2cv4zyr0kyrlzjx40000gq/T/tmpkut9bg&apos;, &apos;--pid-file=/var/folders/7k/pwdhb_mj2cv4zyr0kyrlzjx40000gq/T/tmpkut9bg/mysql.pid&apos;, &apos;--socket=/var/folders/7k/pwdhb_mj2cv4zyr0kyrlzjx40000gq/T/tmpkut9bg/mysql.socket&apos;, &apos;--skip-networking&apos;, &apos;--skip-grant-tables&apos;]
DEBUG: pifpaf.drivers: executing: [&apos;mysql&apos;, &apos;--no-defaults&apos;, &apos;-S&apos;, &apos;/var/folders/7k/pwdhb_mj2cv4zyr0kyrlzjx40000gq/T/tmpkut9bg/mysql.socket&apos;, &apos;-e&apos;, &apos;CREATE DATABASE test;&apos;]
[…]
$ exit
[…]
DEBUG: pifpaf.drivers: mysqld output: 2016-04-08T08:52:04.202143Z 0 [Note] InnoDB: Starting shutdown...
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;em&gt;Pifpaf&lt;/em&gt; also supports my pet project &lt;a href=&quot;http://launchpad.net/gnocchi&quot;&gt;Gnocchi&lt;/a&gt;, so you can run and try that timeseries database in a snap:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;$ pifpaf run gnocchi $SHELL
$ gnocchi metric create
+------------------------------------+-----------------------------------------------------------------------+
| Field                              | Value                                                                 |
+------------------------------------+-----------------------------------------------------------------------+
| archive_policy/aggregation_methods | std, count, 95pct, min, max, sum, median, mean                        |
| archive_policy/back_window         | 0                                                                     |
| archive_policy/definition          | - points: 12, granularity: 0:05:00, timespan: 1:00:00                 |
|                                    | - points: 24, granularity: 1:00:00, timespan: 1 day, 0:00:00          |
|                                    | - points: 30, granularity: 1 day, 0:00:00, timespan: 30 days, 0:00:00 |
| archive_policy/name                | low                                                                   |
| created_by_project_id              | admin                                                                 |
| created_by_user_id                 | admin                                                                 |
| id                                 | ff825d33-c8c8-46d4-b696-4b1e8f84a871                                  |
| name                               | None                                                                  |
| resource/id                        | None                                                                  |
+------------------------------------+-----------------------------------------------------------------------+
$ exit
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;And it takes less than 10 seconds to launch Gnocchi on my laptop using &lt;em&gt;pifpaf&lt;/em&gt;. I&apos;m then able to play with the &lt;code&gt;gnocchi&lt;/code&gt; command line tool. It&apos;s by far faster than using OpenStack &lt;a href=&quot;http://devstack.org&quot;&gt;devstack&lt;/a&gt; to deploy the whole software stack.&lt;/p&gt;
&lt;h2&gt;Using &lt;em&gt;pifpaf&lt;/em&gt; with your test suite&lt;/h2&gt;
&lt;p&gt;We leverage &lt;em&gt;Pifpaf&lt;/em&gt; in several of our OpenStack telemetry-related projects now, and even in &lt;a href=&quot;http://launchpad.net/tooz&quot;&gt;tooz&lt;/a&gt;. For example, to run unit/functional tests with a &lt;em&gt;memcached&lt;/em&gt; server available, a &lt;code&gt;tox.ini&lt;/code&gt; file should look like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;[testenv:py27-memcached]
commands = pifpaf run memcached -- python setup.py testr
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The tests can then use the environment variable &lt;code&gt;PIFPAF_MEMCACHED_PORT&lt;/code&gt; to connect to &lt;em&gt;memcached&lt;/em&gt; and run tests using it. As soon as the tests are finished, &lt;em&gt;memcached&lt;/em&gt; is killed by &lt;em&gt;pifpaf&lt;/em&gt; and the temporary data are trashed.&lt;/p&gt;
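&lt;p&gt;For instance, a test fixture can pick up the connection details exported by &lt;em&gt;pifpaf&lt;/em&gt; like this (a minimal sketch; the helper name is mine, and the fallback is simply the default &lt;em&gt;memcached&lt;/em&gt; port):&lt;/p&gt;

```python
import os

def memcached_address():
    # pifpaf exports PIFPAF_MEMCACHED_PORT for the wrapped command;
    # fall back to the default memcached port when running outside pifpaf.
    port = int(os.environ.get("PIFPAF_MEMCACHED_PORT", 11211))
    return "localhost:%d" % port
```

A memcached client in the test suite can then connect to that address, knowing the daemon only lives for the duration of the test run.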
&lt;p&gt;We have already moved a few OpenStack projects to &lt;em&gt;Pifpaf&lt;/em&gt;, and I&apos;m planning to make use of it in a few more. My fellow developer &lt;a href=&quot;http://sileht.net&quot;&gt;Mehdi Abaakouk&lt;/a&gt; added support for &lt;a href=&quot;http://rabbitmq.com&quot;&gt;RabbitMQ&lt;/a&gt; in &lt;em&gt;Pifpaf&lt;/em&gt; and &lt;a href=&quot;https://review.openstack.org/#/c/301771&quot;&gt;added support for more advanced tests&lt;/a&gt; in &lt;a href=&quot;http://launchpad.net/oslo.messaging&quot;&gt;oslo.messaging&lt;/a&gt; (such as failure scenarios) using &lt;em&gt;Pifpaf&lt;/em&gt;.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Pifpaf&lt;/em&gt; is a very small and handy tool. Give it a try and let me know how it works for you!&lt;/p&gt;
</content:encoded><category>python</category><category>openstack</category></item><item><title>The OpenStack Schizophrenia</title><link>https://julien.danjou.info/blog/openstack-schizophrenia/</link><guid isPermaLink="true">https://julien.danjou.info/blog/openstack-schizophrenia/</guid><description>When I started contributing to OpenStack, almost five years ago, it was a small ecosystem. There was no foundation, only a handful of projects, and you could understand the code base in a few days.</description><pubDate>Wed, 30 Mar 2016 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;When I started contributing to &lt;a href=&quot;http://openstack.org&quot;&gt;OpenStack&lt;/a&gt;, almost five years ago, it was a small ecosystem. There was no foundation, only a handful of projects, and you could understand the code base in a few days.&lt;/p&gt;
&lt;p&gt;Fast forward to 2016, and it is a totally different beast. The project grew to &lt;a href=&quot;http://governance.openstack.org/reference/projects/index.html&quot;&gt;no less than 54 teams&lt;/a&gt;, each team providing one or more deliverables. For example, the Nova and Swift teams each produce one service and its client, whereas the Telemetry team produces 3 services and 3 different clients.&lt;/p&gt;
&lt;p&gt;In 5 years, OpenStack went from a few &lt;a href=&quot;https://en.wikipedia.org/wiki/Infrastructure_as_a_service&quot;&gt;IaaS&lt;/a&gt; projects to 54 different teams tackling different areas related to cloud computing. Once upon a time, OpenStack was all about starting some virtual machines on a network, backed by images and volumes. Nowadays, it&apos;s also about orchestrating your network deployment over containers, while managing your application life-cycle using a database service, everything being metered and billed for.&lt;/p&gt;
&lt;p&gt;This exponential growth has been made possible with the decision of the &lt;a href=&quot;http://governance.openstack.org/reference/charter.html&quot;&gt;OpenStack Technical Committee&lt;/a&gt; to open the gates with &lt;a href=&quot;http://governance.openstack.org/resolutions/20141202-project-structure-reform-spec.html&quot;&gt;the project structure reform voted at the end of 2014&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;This amendment abolished the old OpenStack model of &quot;integrated projects&quot; (i.e. Nova, Glance, Swift…). The big tent, as it&apos;s called, allowed OpenStack to land new projects every month, growing from the 20 project teams of December 2014 to the 54 we have today – multiplying the number of projects by 2.7 in a little more than a year.&lt;/p&gt;
&lt;p&gt;Amazing growth, right?&lt;/p&gt;
&lt;p&gt;And this was clearly a good change. I sat on the Technical Committee in 2013, when projects were trying to apply to be &quot;integrated&quot;, after Ceilometer and Heat were. It was painful to see how the Technical Committee was trying to assess whether new projects should be brought in or not.&lt;/p&gt;
&lt;p&gt;But what I notice these days is how OpenStack is still stuck between its old and new models. On one side, it accepted a lot of new teams; on the other side, many are treated as second-class citizens. Efforts are still being made to build an OpenStack project that does not exist anymore.&lt;/p&gt;
&lt;p&gt;For example, there is a team, named &lt;a href=&quot;https://github.com/openstack/defcore&quot;&gt;DefCore&lt;/a&gt;, trying to define what the OpenStack core is – which projects are, somehow, actually OpenStack. This leads to weird situations, &lt;a href=&quot;http://lists.openstack.org/pipermail/openstack-dev/2016-March/090214.html&quot;&gt;such as non-DefCore projects seeing their docs rejected from installation guides&lt;/a&gt;.&lt;br /&gt;
Again, &lt;a href=&quot;http://lists.openstack.org/pipermail/openstack-dev/2016-March/090231.html&quot;&gt;I reiterated my proposal&lt;/a&gt; to publish documentation as part of each project&apos;s code to solve that dishonest situation and put everything on a level playing field.&lt;/p&gt;
&lt;p&gt;Some cross-project specs are also pushed without the involvement of all OpenStack projects. For example, the &lt;a href=&quot;https://specs.openstack.org/openstack/openstack-specs/specs/deprecate-cli.html&quot;&gt;deprecate-cli&lt;/a&gt; spec, which proposes to deprecate the command-line interface tools provided by each project, made a lot of sense in the old OpenStack model, where the goal was to build a unified and ubiquitous cloud platform. But when you now have tens of projects with largely different scopes, this starts making less sense. Still, this spec was merged by the OpenStack Technical Committee this cycle. Keystone is the first project to proudly force users to rely on&lt;br /&gt;
&lt;a href=&quot;http://docs.openstack.org/developer/python-openstackclient/&quot;&gt;openstack-client&lt;/a&gt;, removing its old &lt;code&gt;keystone&lt;/code&gt; command line tool. I find it odd to push that spec when it&apos;s pretty clear that some projects (e.g. Swift, Gnocchi…) have no intention to go down that path.&lt;/p&gt;
&lt;p&gt;Unfortunately, most specs pushed by the Technical Committee are in the realm of wishful thinking. It somehow makes sense, since only a few of the members are actively contributing to OpenStack projects, and they can&apos;t by themselves implement all of that magically. But OpenStack is no exception in the free software world and remains a do-ocracy.&lt;/p&gt;
&lt;p&gt;There is good cross-project content in OpenStack, such as &lt;a href=&quot;https://wiki.openstack.org/wiki/API_Working_Group&quot;&gt;the API working group&lt;/a&gt;. While the work done there should probably not be OpenStack-specific, there&apos;s a lot that teams have learned by building various HTTP REST APIs with different frameworks. Compiling this knowledge and offering it as guidance to various teams is a great help.&lt;/p&gt;
&lt;p&gt;My fellow developer &lt;a href=&quot;https://anticdent.org&quot;&gt;Chris Dent&lt;/a&gt; wrote a post about &lt;a href=&quot;https://anticdent.org/if-i-were-on-the-openstack-tc.html&quot;&gt;what he would do on the Technical Committee&lt;/a&gt;.&lt;br /&gt;
In this article, he points to a lot of the shortcomings I described here, and his confusion about whether OpenStack is a product or a kit is quite understandable. Indeed, the message broadcast by OpenStack is still very confusing after the opening of the big tent. Not enough user experience improvement is being done.&lt;/p&gt;
&lt;p&gt;The OpenStack Technical Committee election is open for April 2016, and from what I have read so far, many candidates are proposing to clean up the big tent, kicking out projects that no longer match certain criteria. This is probably a good idea, as there are some inactive projects lying around. But I don&apos;t think that will be enough to solve the identity crisis that OpenStack is experiencing.&lt;/p&gt;
&lt;p&gt;So this is why, once again this cycle, I will throw my hat in the ring and submit my candidacy for OpenStack Technical Committee.&lt;/p&gt;
</content:encoded><category>openstack</category></item><item><title>Gnocchi 2.0 release</title><link>https://julien.danjou.info/blog/gnocchi-2-0-release/</link><guid isPermaLink="true">https://julien.danjou.info/blog/gnocchi-2-0-release/</guid><description>Gnocchi 2.0 is out with major new features including a Grafana datasource, Ceph storage driver, and a revamped REST API.</description><pubDate>Fri, 19 Feb 2016 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;A little more than 3 months after our latest minor release, here is the new major version of Gnocchi, stamped &lt;a href=&quot;https://launchpad.net/gnocchi/2.0/2.0.0&quot;&gt;2.0.0&lt;/a&gt;. It contains a lot of new and exciting features, and I&apos;d like to talk about some of them to celebrate!&lt;/p&gt;
&lt;p&gt;You may notice that this release happens in the middle of the OpenStack release cycle. Indeed, Gnocchi does not follow that 6-month cycle, and we release whenever our code is ready. That forces us to have a more iterative approach, less disruptive for other projects, and allows us to achieve a higher velocity, applying the good old mantra &lt;em&gt;release early, release often&lt;/em&gt;.&lt;/p&gt;
&lt;h2&gt;Documentation&lt;/h2&gt;
&lt;p&gt;This version features a large documentation update. Gnocchi is still the only OpenStack server project that implements a &quot;no doc, no merge&quot; policy, meaning any code must come with the documentation addition or change included in the patch. The full documentation is included in the source code and available online at &lt;a href=&quot;http://gnocchi.xyz/&quot;&gt;gnocchi.xyz&lt;/a&gt;.&lt;/p&gt;
&lt;h2&gt;Data split &amp;amp; compression&lt;/h2&gt;
&lt;p&gt;I&apos;ve already covered this change extensively in &lt;a href=&quot;https://julien.danjou.info/blog/gnocchi-carbonara-timeseries-compression&quot;&gt;my last blog about timeseries compression&lt;/a&gt;. Long story short, Gnocchi now splits timeseries archives in small chunks that are compressed, increasing speed and decreasing data size.&lt;/p&gt;
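&lt;p&gt;To give a rough idea of the scheme, here is a toy sketch using &lt;code&gt;zlib&lt;/code&gt; from the standard library – this is not Gnocchi&apos;s actual serialization format, and the chunk size is arbitrary:&lt;/p&gt;

```python
import zlib

POINTS_PER_CHUNK = 3600  # arbitrary split size for this sketch

def split_and_compress(points):
    """Split a timeseries into fixed-size chunks and compress each one,
    so updating recent data only rewrites one small compressed blob."""
    chunks = []
    for i in range(0, len(points), POINTS_PER_CHUNK):
        chunk = points[i:i + POINTS_PER_CHUNK]
        # Serialize each (timestamp, value) pair, then compress the chunk.
        payload = "\n".join("%d %f" % (ts, v) for ts, v in chunk).encode()
        chunks.append(zlib.compress(payload))
    return chunks
```

Working on small compressed chunks rather than one monolithic archive is what brings both the speed gain and the smaller on-disk footprint.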
&lt;h2&gt;Measures batching support&lt;/h2&gt;
&lt;p&gt;Gnocchi now supports batching, which allows submitting several measures for different metrics in a single request. This is especially useful when your application tends to cache metrics for a while and is able to send them in a batch. Usage is &lt;a href=&quot;http://gnocchi.xyz/rest.html#measures-batching&quot;&gt;fully documented for the REST API&lt;/a&gt;.&lt;/p&gt;
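&lt;p&gt;The batch payload maps each metric to the list of measures to append, along these lines – a sketch with made-up metric IDs; see the linked documentation for the exact endpoint and format:&lt;/p&gt;

```python
import json

# Hypothetical metric IDs; each one maps to the measures to append to it.
batch = {
    "04d2dd16-0000-4000-8000-000000000001": [
        {"timestamp": "2016-02-19T10:00:00", "value": 10.5},
        {"timestamp": "2016-02-19T10:05:00", "value": 11.0},
    ],
    "04d2dd16-0000-4000-8000-000000000002": [
        {"timestamp": "2016-02-19T10:00:00", "value": 0.2},
    ],
}
# POSTing this JSON body in a single HTTP request stores all three
# measures at once, instead of one request per metric.
payload = json.dumps(batch)
```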
&lt;h2&gt;Group by support in aggregation&lt;/h2&gt;
&lt;p&gt;One of the most requested features was the ability to do measure aggregation across resources, using a group-by type query. This is now possible using the &lt;a href=&quot;http://gnocchi.xyz/rest.html#aggregation-across-metrics&quot;&gt;new &lt;code&gt;groupby&lt;/code&gt; parameter to aggregation queries&lt;/a&gt;.&lt;/p&gt;
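&lt;p&gt;A group-by aggregation then boils down to extra query-string parameters, roughly like this – a sketch where the endpoint path and attribute names are illustrative; the linked documentation is authoritative:&lt;/p&gt;

```python
from urllib.parse import urlencode

# Aggregate a metric across resources, grouped by project.
params = urlencode({"aggregation": "mean", "groupby": "project_id"})
url = "/v1/aggregation/resource/generic/metric/cpu_util?" + params
print(url)
```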
&lt;h2&gt;Ceph backend optimization&lt;/h2&gt;
&lt;p&gt;We improved the Ceph back-end a lot. Mehdi Abaakouk wrote a new Python binding for Ceph, called &lt;a href=&quot;https://github.com/sileht/pycradox&quot;&gt;Cradox&lt;/a&gt;, that is going to replace the current Python rados module in subsequent Ceph releases. Gnocchi makes use of this new module to speed things up, making the Ceph-based driver much, much faster than before. We also implemented asynchronous data deletion, which improves performance a bit.&lt;/p&gt;
&lt;p&gt;The next step will be to run some new benchmarks &lt;a href=&quot;https://julien.danjou.info/blog/gnocchi-benchmarks&quot;&gt;like I did a few months ago&lt;/a&gt; and compare with the Gnocchi 1.3 series. Stay tuned!&lt;/p&gt;
</content:encoded><category>gnocchi</category><category>openstack</category></item><item><title>Gnocchi 1.3.0 release</title><link>https://julien.danjou.info/blog/gnocchi-1-3-0-released/</link><guid isPermaLink="true">https://julien.danjou.info/blog/gnocchi-1-3-0-released/</guid><description>Finally, Gnocchi 1.3.0 is out. This is our final release, more or less matching the OpenStack 6 months schedule, that concludes the Liberty development cycle.</description><pubDate>Wed, 04 Nov 2015 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Finally, &lt;a href=&quot;https://launchpad.net/gnocchi/trunk/1.3.0&quot;&gt;Gnocchi 1.3.0&lt;/a&gt; is out. This is our final release, more or less matching the OpenStack 6 months schedule, that concludes the Liberty development cycle.&lt;/p&gt;
&lt;p&gt;This release was supposed to be released a few weeks earlier, but our integration test got completely blocked for several days just the week before the OpenStack Mitaka summit.&lt;/p&gt;
&lt;h2&gt;New website&lt;/h2&gt;
&lt;p&gt;We built a new dedicated website for Gnocchi at &lt;a href=&quot;http://gnocchi.xyz&quot;&gt;gnocchi.xyz&lt;/a&gt;. We want to promote Gnocchi outside of the &lt;a href=&quot;http://openstack.org&quot;&gt;OpenStack&lt;/a&gt; bubble, as it is a useful timeseries database in its own right that can work without the rest of the stack. We&apos;ll try to improve the documentation. If you&apos;re curious, feel free to check it out and report anything you miss!&lt;/p&gt;
&lt;h2&gt;The speed bump&lt;/h2&gt;
&lt;p&gt;Obviously, if it had been a bug in Gnocchi that we hit, it would have been quick to fix. However, we found &lt;a href=&quot;https://bugs.launchpad.net/python-keystoneclient/+bug/1508424&quot;&gt;a nasty bug&lt;/a&gt; in Swift caused by the evil monkey-patching of Eventlet (once again) blended with a mixed usage of native threads and Eventlet threads in Swift. Shake all of that together, and you get some pretty nasty race conditions when using the Keystone middleware authentication.&lt;/p&gt;
&lt;p&gt;In the meantime, we disabled Swift multi-threading by using mod_wsgi instead of Eventlet in devstack.&lt;/p&gt;
&lt;h2&gt;New features&lt;/h2&gt;
&lt;p&gt;So what&apos;s new in this new shiny release? A few interesting things:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Metric deletion is now asynchronous. That&apos;s not the most used feature in the REST API – weirdly, people do not often delete metrics – but it&apos;s now way faster and more reliable by being asynchronous. &lt;em&gt;Metricd&lt;/em&gt; is now in charge of cleaning things up.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Speed improvement. We are now confident that Gnocchi is even faster than in the &lt;a href=&quot;https://julien.danjou.info/blog/gnocchi-benchmarks&quot;&gt;latest benchmarks I ran&lt;/a&gt; (around 1.5-2× faster), which makes Gnocchi &lt;em&gt;really&lt;/em&gt; fast with its native storage back-ends. We profiled and optimized Carbonara and the REST API data validation.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Improved &lt;em&gt;metricd&lt;/em&gt; status reporting. It now reports the size of the backlog of the whole cluster, both in its log and via the REST API. Easy monitoring!&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Ceph driver enhancements. We had people testing the Ceph drivers in production, so we made a few changes and fixes to make them more solid.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;And that&apos;s all we did in the last couple of months. We have a lot of pretty exciting things on the roadmap, and I&apos;m sure I&apos;ll talk about them in the coming weeks.&lt;/p&gt;
</content:encoded><category>gnocchi</category><category>openstack</category></item><item><title>OpenStack Summit Mitaka from a Telemetry point of view</title><link>https://julien.danjou.info/blog/openstack-summit-mitaka-tokyo-telemetry/</link><guid isPermaLink="true">https://julien.danjou.info/blog/openstack-summit-mitaka-tokyo-telemetry/</guid><description>Last week I was in Tokyo, Japan for the OpenStack Summit, discussing the new Mitaka version that will be released in 6 months.</description><pubDate>Mon, 02 Nov 2015 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Last week I was in Tokyo, Japan for the &lt;a href=&quot;https://www.openstack.org/summit/tokyo-2015/&quot;&gt;OpenStack Summit&lt;/a&gt;, discussing the new Mitaka version that will be released in 6 months.&lt;/p&gt;
&lt;p&gt;I&apos;ve attended the summit mainly to discuss and follow up on new developments in &lt;a href=&quot;http://launchpad.net/ceilometer&quot;&gt;Ceilometer&lt;/a&gt;, &lt;a href=&quot;http://launchpad.net/gnocchi&quot;&gt;Gnocchi&lt;/a&gt;, &lt;a href=&quot;http://launchpad.net/aodh&quot;&gt;Aodh&lt;/a&gt; and Oslo. It has been a pretty good week and we were able to discuss and plan a few interesting things. Below is what I found remarkable during this summit concerning those projects.&lt;/p&gt;
&lt;h2&gt;Distributed lock manager&lt;/h2&gt;
&lt;p&gt;I did not attend this session, but I need to write something about it.&lt;/p&gt;
&lt;p&gt;See, when working in a distributed environment like OpenStack, it&apos;s almost obvious that sooner or later you end up needing a distributed lock mechanism. It started to become pretty obvious and a serious problem for us 2 years ago in Ceilometer. Back then, we proposed the &lt;a href=&quot;https://wiki.openstack.org/wiki/Oslo/blueprints/service-sync&quot;&gt;service-sync&lt;/a&gt; blueprint and talked about it during the OpenStack Icehouse Design Summit in Hong-Kong. The session at that time was a success, and in 20 minutes I convinced everyone it was the right thing to do. The night following the session, we picked a name, Tooz, for this new library. It was the first time I met Joshua Harlow, who has become one of the biggest Tooz contributors since then.&lt;/p&gt;
&lt;p&gt;For the following months, we tried to move things forward in OpenStack. It was very hard to convince people that it was the solution to their problem. Most of the time, they did not seem to grasp the entirety of what was at stake.&lt;/p&gt;
&lt;p&gt;This time, it seems that we managed to convince everyone that a DLM is indeed needed. Joshua wrote an extensive specification called &lt;a href=&quot;https://review.openstack.org/#/c/209661/&quot;&gt;Chronicle of a DLM&lt;/a&gt;, which ended up being discussed and somehow adopted during that session in Tokyo.&lt;/p&gt;
&lt;p&gt;So yes, Tooz will be the weapon of choice for OpenStack. It will avoid a hard requirement on any DLM solution directly. The best driver right now is the &lt;a href=&quot;https://zookeeper.apache.org/&quot;&gt;ZooKeeper&lt;/a&gt; one, but it&apos;ll still be possible for operators to use e.g. Redis.&lt;/p&gt;
&lt;p&gt;This is a great achievement for us, after spending years trying to fix features such as the &lt;a href=&quot;https://blueprints.launchpad.net/nova/+spec/tooz-for-service-groups&quot;&gt;Nova service group subsystem&lt;/a&gt; and seeing our proposals postponed forever.&lt;/p&gt;
&lt;p&gt;(If you want to know more, &lt;a href=&quot;http://lwn.net&quot;&gt;LWN.net&lt;/a&gt; has&lt;br /&gt;
&lt;a href=&quot;https://lwn.net/Articles/662140/&quot;&gt;a great article about that session&lt;/a&gt;.)&lt;/p&gt;
&lt;h2&gt;Telemetry team name&lt;/h2&gt;
&lt;p&gt;With the new projects launched this last year, Aodh &amp;amp; Gnocchi, in parallel of the old Ceilometer, plus the change from programs to the Big Tent in OpenStack, the team is having an identity issue. Being referred to as the &quot;Ceilometer team&quot; is not really accurate, as some of us only work on Aodh or on Gnocchi. So after discussing it, I &lt;a href=&quot;https://review.openstack.org/#/c/240809/&quot;&gt;proposed to rename the team to Telemetry&lt;/a&gt; instead. We&apos;ll see how it goes.&lt;/p&gt;
&lt;h2&gt;Alarms&lt;/h2&gt;
&lt;p&gt;The first session was about alarms and the Aodh project. It turns out that the project is in pretty good shape, but it probably needs some more love, which I hope I&apos;ll be able to provide in the next months.&lt;/p&gt;
&lt;p&gt;The need for a new &lt;em&gt;aodhclient&lt;/em&gt; based on the technologies we recently used building &lt;em&gt;gnocchiclient&lt;/em&gt; has been reasserted, so we might end up working on that pretty soon. The Tempest support also needs some improvement, and we have a plan to enhance that.&lt;/p&gt;
&lt;h2&gt;Data visualisation&lt;/h2&gt;
&lt;p&gt;We got David Lyle in this session, the Project Technical Leader for &lt;a href=&quot;http://openstack/horizon&quot;&gt;Horizon&lt;/a&gt;. It was an interesting discussion. It used to be technically challenging to draw charts from the data Ceilometer collects, but it&apos;s now very easy with Gnocchi and its API.&lt;/p&gt;
&lt;p&gt;While the technical side is resolved, the more political and user-experience questions of what to draw and how were discussed at length. We don&apos;t want to make people think that Ceilometer and Gnocchi are a full monitoring solution, so there are some precautions to take. Other than that, it would be pretty cool to have a view of the data in Horizon.&lt;/p&gt;
&lt;h2&gt;Rolling upgrade&lt;/h2&gt;
&lt;p&gt;It turns out that Ceilometer has an architecture that makes rolling upgrades easy. We just need to write proper documentation explaining how to do it and in which order the services should be upgraded.&lt;/p&gt;
&lt;h2&gt;Ceilometer splitting&lt;/h2&gt;
&lt;p&gt;The split of the alarm feature of Ceilometer into its own project, Aodh, in the last cycle was a great success for the whole team. We want to split other pieces out of Ceilometer, as they make sense on their own and are easier to manage that way. There are also some projects that want to use them without the whole stack, so it&apos;s a good idea to make it happen.&lt;/p&gt;
&lt;h2&gt;CloudKitty &amp;amp; Gnocchi&lt;/h2&gt;
&lt;p&gt;I attended the 2 sessions that were allocated to &lt;a href=&quot;https://wiki.openstack.org/wiki/CloudKitty&quot;&gt;CloudKitty&lt;/a&gt;. It was pretty interesting, as they want to simplify their architecture and leverage what Gnocchi provides. I presented my view of the project architecture and how they could leverage more of Gnocchi to retrieve and store data. They want to go in that direction, though it&apos;s a large amount of work and refactoring on their side, so it&apos;ll take time.&lt;/p&gt;
&lt;p&gt;We also need to enhance the support of extension for new resources in Gnocchi, and that&apos;s something I hope I&apos;ll work on in the next months.&lt;/p&gt;
&lt;p&gt;Overall, this summit was pretty good and I got a tremendous amount of good feedback on Gnocchi. I again managed to get enough ideas and tasks to tackle for the next 6 months. It really looks interesting to see where the whole team will go from that. Stay tuned!&lt;/p&gt;
</content:encoded><category>openstack</category><category>gnocchi</category></item><item><title>Benchmarking Gnocchi for fun &amp; profit</title><link>https://julien.danjou.info/blog/gnocchi-benchmarks/</link><guid isPermaLink="true">https://julien.danjou.info/blog/gnocchi-benchmarks/</guid><description>We got pretty good feedback on Gnocchi so far, even if we only had a little. Recently, in order to get a better sense of where we stood, we wanted to know how fast (or slow) Gnocchi was.</description><pubDate>Tue, 13 Oct 2015 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;We got pretty good feedback on &lt;a href=&quot;http://launchpad.net/gnocchi&quot;&gt;Gnocchi&lt;/a&gt; so far, even if we only had a little. Recently, in order to get a better sense of where we stood, we wanted to know how fast (or slow) Gnocchi was.&lt;/p&gt;
&lt;p&gt;The &lt;a href=&quot;https://julien.danjou.info/openstack-ceilometer-the-gnocchi-experiment.html&quot;&gt;early benchmarks that some of the Mirantis engineers ran last year&lt;/a&gt; showed pretty good signs. But a year later, it was time to get real numbers and have a good understanding of Gnocchi capacity.&lt;/p&gt;
&lt;h2&gt;Benchmark tools&lt;/h2&gt;
&lt;p&gt;The first thing I realized when starting that process is that we lacked tools to run benchmarks. Therefore I started to write some benchmark tools in &lt;a href=&quot;https://launchpad.net/python-gnocchiclient&quot;&gt;python-gnocchiclient&lt;/a&gt;, which provides a command line tool to query Gnocchi. I added a few basic commands to measure metric performance, such as:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;$ gnocchi benchmark metric create -w 48 -n 10000 -a low
+----------------------+------------------+
| Field                | Value            |
+----------------------+------------------+
| client workers       | 48               |
| create executed      | 10000            |
| create failures      | 0                |
| create failures rate | 0.00 %           |
| create runtime       | 8.80 seconds     |
| create speed         | 1136.96 create/s |
| delete executed      | 10000            |
| delete failures      | 0                |
| delete failures rate | 0.00 %           |
| delete runtime       | 39.56 seconds    |
| delete speed         | 252.75 delete/s  |
+----------------------+------------------+
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The command line tool supports the &lt;code&gt;--verbose&lt;/code&gt; switch to have detailed progress report on the benchmark progression. So far it supports metric operations only, but that&apos;s the most interesting part of Gnocchi.&lt;/p&gt;
&lt;h2&gt;Spinning up some hardware&lt;/h2&gt;
&lt;p&gt;I got a couple of bare metal servers to test Gnocchi on. I dedicated the first one to Gnocchi, and used the second one as the benchmark client, plugged on the same network. Each server is made of&lt;br /&gt;
2×&lt;a href=&quot;http://ark.intel.com/products/81897/Intel-Xeon-Processor-E5-2609-v3-15M-Cache-1_90-GHz&quot;&gt;Intel Xeon E5-2609 v3&lt;/a&gt; (12 cores in total) and 32 GB of RAM. That provides a lot of CPU to handle requests in parallel.&lt;/p&gt;
&lt;p&gt;Then I simply performed a basic &lt;a href=&quot;http://www.redhat.com/en/technologies/linux-platforms/enterprise-linux&quot;&gt;RHEL 7&lt;/a&gt; installation and ran &lt;a href=&quot;http://devstack.org&quot;&gt;devstack&lt;/a&gt; to spin up an installation of Gnocchi based on the master branch, disabling all of the other OpenStack components. I then tweaked the Apache httpd configuration to use the worker MPM and increased the maximum number of clients that can send requests simultaneously.&lt;/p&gt;
&lt;p&gt;I configured Gnocchi to use the &lt;em&gt;PostgreSQL&lt;/em&gt; indexer, as it&apos;s the recommended one, and the &lt;em&gt;file&lt;/em&gt; storage driver, based on Carbonara (Gnocchi&apos;s own storage engine). That means files were stored locally rather than in Ceph or Swift.&lt;/p&gt;
&lt;p&gt;Using the &lt;em&gt;file&lt;/em&gt; driver is less scalable (you have to run on only one node or use a technology like NFS to share the files), but it was good enough for this benchmark, to get some numbers and profile the beast.&lt;/p&gt;
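&lt;p&gt;The relevant part of &lt;code&gt;gnocchi.conf&lt;/code&gt; looked roughly like this – option names from memory and paths are only examples, so check the configuration reference before copying:&lt;/p&gt;

```ini
[indexer]
# PostgreSQL stores the metadata: metrics, resources, archive policies
url = postgresql://gnocchi:secret@localhost/gnocchi

[storage]
# Carbonara-based file driver: aggregates are written to the local disk
driver = file
file_basepath = /var/lib/gnocchi
```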
&lt;p&gt;The OpenStack Keystone authentication middleware was not enabled in this setup, as it would add some delay validating the authentication token.&lt;/p&gt;
&lt;h2&gt;Metric CRUD operations&lt;/h2&gt;
&lt;p&gt;Metric creation is pretty fast. I managed to reach 1300 metric/s created pretty easily. Deletion is now asynchronous, which means it&apos;s faster than in Gnocchi 1.2, but it&apos;s still slower than creation: 500 metric/s can be deleted. That does not sound like a huge issue, since metric deletion is barely used in production.&lt;/p&gt;
&lt;p&gt;Retrieving metric information is also pretty fast and goes up to 800 metric/s. It&apos;d be easy to achieve much higher throughput for this one, as it&apos;d be easy to cache, but we didn&apos;t feel the need to implement that so far.&lt;/p&gt;
&lt;p&gt;Another important thing is that all of these numbers are constant and barely depend on the number of metrics already managed by Gnocchi.&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Operation&lt;/th&gt;
&lt;th&gt;Details&lt;/th&gt;
&lt;th&gt;Rate&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Create metric&lt;/td&gt;
&lt;td&gt;Created 100k metrics in 77 seconds&lt;/td&gt;
&lt;td&gt;1300 metric/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Show metric&lt;/td&gt;
&lt;td&gt;Show a metric 100k times in 149 seconds&lt;/td&gt;
&lt;td&gt;670 metric/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Delete metric&lt;/td&gt;
&lt;td&gt;Deleted 100k metrics in 190 seconds&lt;/td&gt;
&lt;td&gt;524 metric/s&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h2&gt;Sending and getting measures&lt;/h2&gt;
&lt;p&gt;Pushing measures into metrics is one of the hottest topics. Starting with Gnocchi 1.1, pushed measures are processed asynchronously, which makes it much faster to push new measures. Getting new numbers on that feature was pretty interesting.&lt;/p&gt;
&lt;p&gt;The number of measures per second you can push depends on the batch size, meaning the number of actual measures you send per call. The naive approach is to push 1 measure per call, and in that case, Gnocchi is able to handle around 600 measures/s. With a batch containing 100 measures, the number of calls per second goes down to 450, but since you push 100 measures each time, that means 45k measures per second pushed into Gnocchi!&lt;/p&gt;
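&lt;p&gt;The arithmetic is simply the call rate multiplied by the number of measures per call:&lt;/p&gt;

```python
# With a batch of 100, the request rate drops but the measure rate soars.
calls_per_second = 450  # measured call rate at batch size 100
batch_size = 100
print(calls_per_second * batch_size)  # 45000, i.e. the 45k measures/s above
```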
&lt;p&gt;I pushed the test further, inspired by the recent &lt;a href=&quot;https://influxdb.com/blog/2015/10/07/the_new_influxdb_storage_engine_a_time_structured_merge_tree.html&quot;&gt;blog post of InfluxDB claiming to achieve 300k points per second&lt;/a&gt; with their new engine. I ran the same benchmark on the hardware I had, which is roughly two times smaller than the one they used. I managed to push Gnocchi to a little more than 120k measures per second. If I had the same hardware as they used, I could extrapolate the results to almost 250k measures/s pushed. Obviously, you can&apos;t strictly compare Gnocchi and InfluxDB since they are not doing exactly the same thing, but it still looks way better than what I expected.&lt;/p&gt;
&lt;p&gt;Using batch sizes of a few thousand measures improves the throughput further, up to around 125k measures/s.&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Operation&lt;/th&gt;
&lt;th&gt;Details&lt;/th&gt;
&lt;th&gt;Rate&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Push metric 5k&lt;/td&gt;
&lt;td&gt;Push 5M measures with batch of 5k measures in 40 seconds&lt;/td&gt;
&lt;td&gt;122k measures/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Push metric 4k&lt;/td&gt;
&lt;td&gt;Push 5M measures with batch of 4k measures in 40 seconds&lt;/td&gt;
&lt;td&gt;125k measures/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Push metric 3k&lt;/td&gt;
&lt;td&gt;Push 5M measures with batch of 3k measures in 40 seconds&lt;/td&gt;
&lt;td&gt;123k measures/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Push metric 2k&lt;/td&gt;
&lt;td&gt;Push 5M measures with batch of 2k measures in 41 seconds&lt;/td&gt;
&lt;td&gt;121k measures/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Push metric 1k&lt;/td&gt;
&lt;td&gt;Push 5M measures with batch of 1k measures in 44 seconds&lt;/td&gt;
&lt;td&gt;113k measures/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Push metric 500&lt;/td&gt;
&lt;td&gt;Push 5M measures with batch of 500 measures in 51 seconds&lt;/td&gt;
&lt;td&gt;98k measures/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Push metric 100&lt;/td&gt;
&lt;td&gt;Push 5M measures with batch of 100 measures in 112 seconds&lt;/td&gt;
&lt;td&gt;45k measures/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Push metric 10&lt;/td&gt;
&lt;td&gt;Push 5M measures with batch of 10 measures in 852 seconds&lt;/td&gt;
&lt;td&gt;6k measures/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Push metric 1&lt;/td&gt;
&lt;td&gt;Push 500k measures with batch of 1 measure in 800 seconds&lt;/td&gt;
&lt;td&gt;624 measures/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Get measures&lt;/td&gt;
&lt;td&gt;Retrieve 43k measures of 1 metric&lt;/td&gt;
&lt;td&gt;260k measures/s&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
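&lt;p&gt;For reference, the write path being benchmarked here can be sketched as follows. This is a minimal illustration, not the actual benchmark script: it assumes Gnocchi&apos;s &lt;code&gt;POST /v1/metric/{metric_id}/measures&lt;/code&gt; endpoint, which accepts a JSON list of timestamp/value objects; the URL and metric id in the comment are placeholders.&lt;/p&gt;

```python
import json
from datetime import datetime, timedelta

def build_measures_batch(start, count, interval_seconds=60):
    """Build a batch of measures shaped like the JSON body the
    measures-posting endpoint expects: a list of objects, each
    carrying a timestamp and a value."""
    return [
        {"timestamp": (start + timedelta(seconds=i * interval_seconds)).isoformat(),
         "value": float(i)}
        for i in range(count)
    ]

# One 5k-measure batch, as in the "Push metric 5k" benchmark row.
batch = build_measures_batch(datetime(2015, 10, 1), 5000)
body = json.dumps(batch)

# In a real run, the body would be POSTed to the API, for example with
# the requests library (gnocchi_url and metric_id are placeholders):
#   requests.post(gnocchi_url + "/v1/metric/" + metric_id + "/measures",
#                 data=body, headers={"Content-Type": "application/json"})
print(len(batch))
```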
&lt;p&gt;What about getting measures? Well, it&apos;s actually pretty fast too. Retrieving a metric with 1 month of data at a 1-minute interval (that&apos;s 43k points) takes less than 2 seconds.&lt;/p&gt;
&lt;p&gt;Though it&apos;s actually slower than I expected. The reason seems to be that the JSON payload is 2 MB and encoding it takes a lot of time in Python. I&apos;ll investigate that. Another thing I discovered is that, by default, Gnocchi returns the datapoints for every granularity available for the requested period, which might double the size of the returned data for nothing if you don&apos;t need it. It&apos;ll be easy to add an option to the API to only retrieve what you need, though!&lt;/p&gt;
&lt;p&gt;Once benchmarked, that meant I was able to retrieve 6 metrics per second, which translates to around 260k measures/s.&lt;/p&gt;
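&lt;p&gt;The arithmetic behind those numbers, assuming a 30-day month:&lt;/p&gt;

```python
# One month of data at a 1-minute interval, assuming a 30-day month:
points_per_metric = 30 * 24 * 60       # 43200 measures, the "43k points" above
metrics_per_second = 6                 # retrieval rate observed in the benchmark
measures_per_second = points_per_metric * metrics_per_second
print(measures_per_second)             # 259200, i.e. around 260k measures/s
```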
&lt;h2&gt;&lt;em&gt;Metricd&lt;/em&gt; speed&lt;/h2&gt;
&lt;p&gt;New measures pushed into Gnocchi are processed asynchronously by the &lt;code&gt;gnocchi-metricd&lt;/code&gt; daemon. While running the benchmarks above, I ran into a very interesting issue: sending 10k measures on a metric would make &lt;code&gt;gnocchi-metricd&lt;/code&gt; use up to 2 GB of RAM and 120 % CPU for more than 10 minutes.&lt;/p&gt;
&lt;p&gt;After further investigation, I found that the naive approach we used to resample datapoints in Carbonara with &lt;a href=&quot;http://pandas.pydata.org/&quot;&gt;Pandas&lt;/a&gt; was the cause. I &lt;a href=&quot;https://github.com/pydata/pandas/issues/11217&quot;&gt;reported a bug on Pandas&lt;/a&gt;, and the upstream author was kind enough to provide a nice workaround, which I sent as &lt;a href=&quot;https://github.com/pydata/pandas/pull/11242&quot;&gt;a pull request&lt;/a&gt; to the Pandas documentation.&lt;/p&gt;
&lt;p&gt;I wrote a fix for Gnocchi based on that and started using it. Computing the standard set of aggregation methods (std, count, 95pct, min, max, sum, median, mean) for 10k batches of 1 measure (the worst-case scenario) on one metric with 10k measures now takes only 20 seconds and uses 100 MB of RAM – 45× faster. That means that in normal operations, where only a few new measures are processed at a time, updating a metric only takes a few milliseconds. Awesome!&lt;/p&gt;
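&lt;p&gt;To give an idea of the workload, here is a minimal sketch of that kind of resampling with Pandas – not the actual Carbonara code, nor the exact workaround from the bug report, and with the 95th-percentile aggregation omitted for brevity:&lt;/p&gt;

```python
import numpy as np
import pandas as pd

# 10k measures, one per second, mimicking the worst-case batch above.
index = pd.date_range("2015-10-01", periods=10000, freq="s")
series = pd.Series(np.random.random(10000), index=index)

# Resample into 1-minute buckets and compute several of the standard
# aggregation methods in a single pass, rather than resampling once per
# method -- 10000 seconds of data gives 167 one-minute buckets.
aggregates = series.resample("60s").agg(
    ["std", "count", "min", "max", "sum", "median", "mean"])
print(len(aggregates))
```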
&lt;h2&gt;Comparison with Ceilometer&lt;/h2&gt;
&lt;p&gt;For comparison&apos;s sake, I quickly ran some read-operation benchmarks on Ceilometer. I fed it with one month of samples for 100 instances polled every minute. That represents roughly 4.3M samples injected, and it took a while – almost 1 hour, whereas it would have taken less than a minute in Gnocchi. Then I tried to retrieve some statistics in the same way that we provide them in Gnocchi, which means aggregating them over a period of 60 seconds over a month.&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Operation&lt;/th&gt;
&lt;th&gt;Details&lt;/th&gt;
&lt;th&gt;Rate&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Read metric SQL&lt;/td&gt;
&lt;td&gt;Read measures for 1 metric&lt;/td&gt;
&lt;td&gt;2min 58s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Read metric MongoDB&lt;/td&gt;
&lt;td&gt;Read measures for 1 metric&lt;/td&gt;
&lt;td&gt;28s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Read metric Gnocchi&lt;/td&gt;
&lt;td&gt;Read measures for 1 metric&lt;/td&gt;
&lt;td&gt;2s&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Obviously, Ceilometer is very slow. It has to look through 4M samples to compute and return the result, which takes a lot of time, whereas Gnocchi just has to fetch a file and send it back. That also means that the more samples you have (i.e. the longer you collect data and the more resources you have), the slower Ceilometer becomes. This is not a problem for Gnocchi, as I emphasized when I started designing it.&lt;/p&gt;
&lt;p&gt;Most Gnocchi operations are &lt;em&gt;O(log R)&lt;/em&gt; where R is the number of metrics or resources, whereas most Ceilometer operations are &lt;em&gt;O(log S)&lt;/em&gt; where S is the number of samples (measures). Since R is millions of times smaller than S, Gnocchi gets to be much faster.&lt;/p&gt;
&lt;p&gt;And what&apos;s even more interesting is that Gnocchi is entirely horizontally scalable. Adding more Gnocchi servers (for the API and its background processing worker &lt;em&gt;metricd&lt;/em&gt;) multiplies Gnocchi&apos;s performance by the number of servers added.&lt;/p&gt;
&lt;h2&gt;Improvements&lt;/h2&gt;
&lt;p&gt;There are several things to improve in Gnocchi, such as splitting Carbonara archives to make them more efficient, especially for drivers such as Ceph and Swift. It&apos;s already on my plate, and I&apos;m looking forward to working on that!&lt;/p&gt;
&lt;p&gt;And if you have any questions, feel free to shoot them in the comment section. 😉&lt;/p&gt;
</content:encoded><category>gnocchi</category><category>openstack</category></item><item><title>Gnocchi talk at OpenStack Paris Meetup #16</title><link>https://julien.danjou.info/blog/openstack-france-paris-meetup-gnocchi-talk/</link><guid isPermaLink="true">https://julien.danjou.info/blog/openstack-france-paris-meetup-gnocchi-talk/</guid><description>Last week, I&apos;ve been invited to the OpenStack Paris meetup #16, whose subject was about metrics in OpenStack. Last time I spoke at this meetup was back in 2012, during the OpenStack Paris meetup #2.</description><pubDate>Mon, 05 Oct 2015 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Last week, I&apos;ve been invited to the &lt;a href=&quot;http://www.meetup.com/OpenStack-France/events/225227112/&quot;&gt;OpenStack Paris meetup #16&lt;/a&gt;, whose subject was about metrics in OpenStack. Last time I spoke at this meetup was back in 2012, during the &lt;a href=&quot;https://julien.danjou.info/blog/openstack-france-meetup-2&quot;&gt;OpenStack Paris meetup #2&lt;/a&gt;. A very long time ago!&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/gnocchi-talk-2.jpg&quot; alt=&quot;gnocchi-talk-2&quot; /&gt;&lt;/p&gt;
&lt;p&gt;I talked for half an hour about &lt;a href=&quot;http://launchpad.net/gnocchi&quot;&gt;Gnocchi&lt;/a&gt;, the OpenStack project I&apos;ve been running for 18 months now. I started by explaining the story behind the project and why we needed to build it. Ceilometer has an interesting history and a curious roadmap these last years, and I summarized that briefly. Then I talked about how Gnocchi works and what it offers to users and operators. The slides were full of JSON, but I imagine they offered an interesting view of what the API looks like and how easy it is to operate. This also allowed me to emphasize how many use cases are actually covered and solved, contrary to what Ceilometer did so far. The talk was well received, and I got a few interesting questions at the end.&lt;/p&gt;
</content:encoded><category>talks</category><category>openstack</category><category>gnocchi</category></item><item><title>My interview in le Journal du Hacker</title><link>https://julien.danjou.info/blog/interview-journal-du-hacker/</link><guid isPermaLink="true">https://julien.danjou.info/blog/interview-journal-du-hacker/</guid><description>Le Journal du Hacker interviewed me about my work on OpenStack, my job at Red Hat, and my self-published book The Hacker&apos;s Guide to Python.</description><pubDate>Thu, 17 Sep 2015 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;A few days ago, the French equivalent of &lt;a href=&quot;https://news.ycombinator.com/&quot;&gt;Hacker News&lt;/a&gt;, called &quot;&lt;a href=&quot;https://www.journalduhacker.net/&quot;&gt;Le Journal du Hacker&lt;/a&gt;&quot;, &lt;a href=&quot;https://www.journalduhacker.net/s/l5qktw/journal_du_hacker_entretien_avec_julien_danjou_d_veloppeur_openstack&quot;&gt;interviewed me&lt;/a&gt; about my work on &lt;a href=&quot;http://openstack.org&quot;&gt;OpenStack&lt;/a&gt;, my job at &lt;a href=&quot;http://redhat.com&quot;&gt;Red Hat&lt;/a&gt; and my self-published book &lt;a href=&quot;https://thehackerguidetopython.com&quot;&gt;The Hacker&apos;s Guide to Python&lt;/a&gt;. I&apos;ve spent some time translating it into English so you can read it if you don&apos;t understand French! I hope you&apos;ll enjoy it.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Hi Julien, and thanks for participating in this interview for the Journal du Hacker. For our readers who don&apos;t know you, can you introduce yourself briefly?&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;You&apos;re welcome! My name is Julien, I&apos;m 31 years old, and I live in Paris. I have now been developing free software for around fifteen years. I&apos;ve had the pleasure to work (among other things) on &lt;a href=&quot;http://debian.org&quot;&gt;Debian&lt;/a&gt;, &lt;a href=&quot;https://www.gnu.org/software/emacs/&quot;&gt;Emacs&lt;/a&gt; and &lt;a href=&quot;http://awesome.naquadah.org&quot;&gt;awesome&lt;/a&gt; over these last years, and more recently on OpenStack. For a few months now, I&apos;ve been working at Red Hat as a Principal Software Engineer on &lt;a href=&quot;http://openstack.org&quot;&gt;OpenStack&lt;/a&gt;. I&apos;m in charge of upstream development for that cloud-computing platform, mainly around the Ceilometer, Aodh and Gnocchi projects.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Being a system architect myself, I&apos;ve been following your work on &lt;a href=&quot;http://openstack.org&quot;&gt;OpenStack&lt;/a&gt; for a while. It&apos;s uncommon to get the point of view of someone as involved as you are. Can you give us a summary of the state of the project, and then detail your activities in it?&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The &lt;a href=&quot;http://openstack.org&quot;&gt;OpenStack&lt;/a&gt; project has grown and changed a lot since I started 4 years ago. It started as a few projects providing the basics, like &lt;a href=&quot;https://launchpad.net/nova&quot;&gt;Nova&lt;/a&gt; (compute), &lt;a href=&quot;https://launchpad.net/swift&quot;&gt;Swift&lt;/a&gt; (object storage), &lt;a href=&quot;https://launchpad.net/cinder&quot;&gt;Cinder&lt;/a&gt; (volume), &lt;a href=&quot;https://launchpad.net/keystone&quot;&gt;Keystone&lt;/a&gt; (identity) and &lt;a href=&quot;https://launchpad.net/neutron&quot;&gt;Neutron&lt;/a&gt; (network), which form the basis of a cloud-computing platform, and it ended up composed of a lot more projects.&lt;/p&gt;
&lt;p&gt;For a while, the inclusion of projects was subject to a strict review by the technical committee. But in the last few months, the rules have been relaxed, and we&apos;re seeing a lot more projects connected to cloud-computing &lt;a href=&quot;http://governance.openstack.org/reference/projects/&quot;&gt;joining us&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;As far as I&apos;m concerned, I started the &lt;a href=&quot;http://governance.openstack.org/reference/projects/ceilometer.html&quot;&gt;Ceilometer&lt;/a&gt; project with a few other people in 2012, devoted to handling the metrics of OpenStack platforms. Our goal is to collect all the metrics and record them for later analysis. We also have a module providing the ability to trigger actions on threshold crossing (alarms).&lt;/p&gt;
&lt;p&gt;The project grew in a monolithic way, with a linear increase in the number of contributors, during the first two years. I was the PTL (Project Technical Leader) for a year. That leadership position demands a lot of time for bureaucratic matters and people management, so I decided to step down in order to spend more time solving the technical challenges that Ceilometer offered.&lt;/p&gt;
&lt;p&gt;I started the &lt;a href=&quot;https://launchpad.net/gnocchi&quot;&gt;Gnocchi&lt;/a&gt; project in 2014. The first stable version (1.0.0) was released a few months ago. It&apos;s a timeseries database offering a REST API and strong scalability. It was a necessary development to solve the problems tied to the large number of metrics created by a cloud-computing platform, where tens of thousands of virtual machines have to be metered as often as possible. The project works as a standalone deployment or with the rest of OpenStack.&lt;/p&gt;
&lt;p&gt;More recently, I started &lt;a href=&quot;https://launchpad.net/aodh&quot;&gt;Aodh&lt;/a&gt;, the result of moving the code and features of Ceilometer related to threshold-based action triggering (alarming) out into their own project. That&apos;s the logical continuation of what we started with Gnocchi: Ceilometer is being split into independent modules that can work together – with or without OpenStack. It seems to me that the features provided by Ceilometer, Aodh and Gnocchi can also be interesting for operators running more classical infrastructures. That&apos;s why I&apos;ve pushed the projects in that direction, and towards a more service-oriented architecture (&lt;a href=&quot;https://fr.wikipedia.org/wiki/Architecture_orient%C3%A9e_services&quot;&gt;SOA&lt;/a&gt;).&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;I&apos;d like to stop for a moment on Ceilometer. I think this solution was much anticipated, especially by cloud-computing providers using OpenStack to bill the resources sold to their customers. I remember reading a blog post where you talked about how quickly this component was built, and about features that were never supposed to end up there. Nowadays, with Gnocchi and Aodh, how solid are the Ceilometer component and the programs it relies on?&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Indeed, one of the first use cases for Ceilometer was the ability to get metrics to feed a billing tool. That goal has now been reached, since we have billing tools for OpenStack using Ceilometer, such as &lt;a href=&quot;https://wiki.openstack.org/wiki/CloudKitty&quot;&gt;CloudKitty&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;However, other use cases appeared rapidly, such as the ability to trigger alarms. This feature was necessary, for example, to implement the auto-scaling feature that &lt;a href=&quot;http://launchpad.net/heat&quot;&gt;Heat&lt;/a&gt; needed. At the time, for technical and political reasons, it was not possible to implement it in a new project, so the functionality ended up in Ceilometer, since it was using the metrics collected and stored by Ceilometer itself.&lt;/p&gt;
&lt;p&gt;Though, like I said, this feature is now in its own project, Aodh. The alarm feature has been used in production for a few cycles, and the Aodh project brings new features to the table. It can trigger actions on threshold crossing and is one of the few solutions able to work at high scale, with several thousands of alarms.&lt;br /&gt;
It&apos;s impossible to make Nagios fetch metrics and trigger alarms across millions of instances. Ceilometer and Aodh can do that easily on a few tens of nodes, automatically.&lt;/p&gt;
&lt;p&gt;On the other side, Ceilometer has long been painted as slow and complicated to use, because its metrics storage system used &lt;a href=&quot;https://www.mongodb.org/&quot;&gt;MongoDB&lt;/a&gt; by default. Clearly, the data structure model chosen was not optimal for what users were doing with the data.&lt;/p&gt;
&lt;p&gt;That&apos;s why I started Gnocchi last year, which is designed precisely for this use case. It offers constant access time to metrics (O(1) complexity) and fast access to resource data via an index.&lt;/p&gt;
&lt;p&gt;Today, with three projects – each with its own well-defined feature perimeter, and all able to work together – Ceilometer, Aodh and Gnocchi have finally erased the biggest problems and defects of the initial project.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;To end with OpenStack, one last question. You&apos;ve been a &lt;a href=&quot;http://www.python.org/&quot;&gt;Python&lt;/a&gt; developer for a long time and a fervent advocate of software testing and &lt;a href=&quot;https://en.wikipedia.org/wiki/Test_driven_development&quot;&gt;test-driven development&lt;/a&gt;. Several of your blog posts point out how important they are. Can you tell us more about the use of tests in OpenStack, and the testing prerequisites for contributing to OpenStack?&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I don&apos;t know any project that is as tested on every layer as OpenStack is. At the start of the project, test coverage was vague, made of a few unit tests. For each release, a bunch of new features were provided, and you had to keep your fingers crossed that they would work. That&apos;s already almost unacceptable. But the big issue was that there were also a lot of regressions: things that used to work stopped working, often corner cases that developers had forgotten about.&lt;/p&gt;
&lt;p&gt;Then the project decided to change its policy and started to refuse any patch – new feature or bug fix – that did not implement a minimal set of unit tests proving the patch worked. Quickly, regressions became history, and the number of bugs dropped steadily month after month.&lt;/p&gt;
&lt;p&gt;Then came the functional tests, with the &lt;a href=&quot;http://launchpad.net/tempest&quot;&gt;Tempest&lt;/a&gt; project, which runs a test battery on a complete OpenStack deployment.&lt;/p&gt;
&lt;p&gt;OpenStack now possesses a &lt;a href=&quot;http://status.openstack.org/zuul/&quot;&gt;complete test infrastructure&lt;/a&gt;, with operators hired full-time to maintain it. The developers write the tests, and the operators maintain an architecture based on Gerrit, Zuul, and Jenkins, which runs the test battery of each project for each patch sent.&lt;/p&gt;
&lt;p&gt;Indeed, for each version of a patch sent, a full OpenStack is deployed in a virtual machine, and a battery of thousands of unit and functional tests is run to catch any regression.&lt;/p&gt;
&lt;p&gt;To contribute to OpenStack, you need to know how to write a unit test – the policy on functional tests is laxer. The tools used are standard Python tools: unittest for the framework, and &lt;a href=&quot;https://pypi.python.org/pypi/tox&quot;&gt;tox&lt;/a&gt; to create a virtual environment (venv) and run them.&lt;/p&gt;
&lt;p&gt;It&apos;s also possible to use &lt;a href=&quot;http://docs.openstack.org/developer/devstack/&quot;&gt;DevStack&lt;/a&gt; to deploy an OpenStack platform on a virtual machine and run the functional tests. However, since the project infrastructure also does that when a patch is submitted, it&apos;s not mandatory to do it yourself locally.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The tools and tests you write for OpenStack are written in Python, a language which is very popular today. You seem to like it more than you have to, since you wrote a book about it, &lt;a href=&quot;https://thehackerguidetopython.com&quot;&gt;The Hacker&apos;s Guide to Python&lt;/a&gt;, which I really enjoyed. Can you explain what brought you to Python, the main strengths you attribute to this language (briefly), and how you went from developer to author?&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I stumbled upon Python by chance, around 2005. I don&apos;t remember how I heard about it, but I bought a first book to discover it and started toying with the language. At that time, I didn&apos;t find any project to contribute to or to start. My first Python project was rebuildd for Debian, a bit later in 2007.&lt;/p&gt;
&lt;p&gt;I like Python for its simplicity, its rather clean object orientation, the ease with which it can be deployed, and its rich open source ecosystem. Once you get the basics, it&apos;s very easy to evolve and to use it for anything, because the ecosystem makes it easy to find libraries that solve any kind of problem.&lt;/p&gt;
&lt;p&gt;I became an author by chance, writing blog posts from time to time about Python. I finally realized that after a few years studying Python internals (CPython), I had learned a lot of things. While writing a post about&lt;br /&gt;
&lt;a href=&quot;https://julien.danjou.info/blog/2013/guide-python-static-class-abstract-methods&quot;&gt;the differences between method types in Python&lt;/a&gt; – which is still one of the most read posts on my blog – I realized that a lot of things that seemed obvious to me were not for other developers.&lt;/p&gt;
&lt;p&gt;I wrote that initial post after thousands of hours spent doing code reviews on OpenStack. So I decided to note down all the developers&apos; pain points and write a book about them: a compilation of what years of experience taught me, and taught the other developers I decided to interview for the book.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;I&apos;ve been very interested in the publication of your book, for the subject itself, but also for the process you chose. You self-published the book, which seems very relevant nowadays. Was that a choice from the start? Did you look for a publisher? Can you tell us more about that?&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I&apos;ve been lucky to find out about other self-published authors, such as &lt;a href=&quot;http://nathanbarry.com/&quot;&gt;Nathan Barry&lt;/a&gt; – who even wrote a book on that subject, called &lt;a href=&quot;http://nathanbarry.com/authority/&quot;&gt;Authority&lt;/a&gt;. That&apos;s what convinced me it was possible and gave me hints for the project.&lt;/p&gt;
&lt;p&gt;I started writing in August 2013, and I ran the first interviews with other developers at that time. I began with the table of contents and then filled the pages with what I knew and what I wanted to share. I managed to finish the book around January 2014. The proofreading took more time than I expected, so the book was only released in March 2014. I wrote a &lt;a href=&quot;https://julien.danjou.info/blog/making-of-the-hacker-guide-to-python&quot;&gt;complete report&lt;/a&gt; about that on my blog, where I explain the full process in detail, from writing to launching.&lt;/p&gt;
&lt;p&gt;I did not look for a publisher, though I was offered some deals. The idea of self-publishing really convinced me, so I decided to go on my own, and I have no regrets. It&apos;s true that you have to wear two hats at the same time and handle a lot more things, but with a minimal audience and some help from the Internet, anything&apos;s possible!&lt;/p&gt;
&lt;p&gt;I&apos;ve since been approached by two publishers, a &lt;a href=&quot;http://item.jd.com/11685556.html&quot;&gt;Chinese&lt;/a&gt; and a &lt;a href=&quot;https://twitter.com/juldanjou/status/552056642322583552&quot;&gt;Korean&lt;/a&gt; one. I gave them the rights to translate and publish the book in their countries, so you can buy the Chinese and Korean versions of the first edition over there.&lt;/p&gt;
&lt;p&gt;Seeing how successful it was, I decided to launch a second edition in May 2015, and it&apos;s likely that a third edition will be released in 2016.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Nowadays, you work for &lt;a href=&quot;http://www.redhat.com&quot;&gt;Red Hat&lt;/a&gt;, a company that represents the success of using Free Software as a commercial business model. This company fascinates a lot in our community. What can you say about your employer from your point of view?&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;It&apos;s only been a year since I joined Red Hat (when they bought &lt;a href=&quot;http://www.enovance.com/&quot;&gt;eNovance&lt;/a&gt;), so my experience is quite recent.&lt;/p&gt;
&lt;p&gt;Though, Red Hat is really a special company on every level. It&apos;s hard to see from the outside how open it is and how it works. It&apos;s really close to, and really looks like, an open source project. For more details, you should read &lt;a href=&quot;https://www.redhat.com/en/explore/the-open-organization-book&quot;&gt;The Open Organization&lt;/a&gt;, a book written by Jim Whitehurst (CEO of Red Hat), which he just published. It describes perfectly how Red Hat works. To summarize, meritocracy and the absence of organizational silos are what make Red Hat a strong organization and&lt;br /&gt;
&lt;a href=&quot;http://www.forbes.com/innovative-companies/list/&quot;&gt;one of the most innovative companies&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;In the end, I&apos;m lucky enough to be autonomous on the projects I work on with my team around OpenStack, and I can spend 100% of my time working upstream and enhancing the Python ecosystem.&lt;/p&gt;
&lt;/content:encoded&gt;&lt;category&gt;career&lt;/category&gt;&lt;category&gt;openstack&lt;/category&gt;&lt;category&gt;books&lt;/category&gt;&lt;category&gt;python&lt;/category&gt;&lt;/item&gt;&lt;item&gt;&lt;title&gt;Visualize your OpenStack cloud: Gnocchi &amp;amp; Grafana&lt;/title&gt;&lt;link&gt;https://julien.danjou.info/blog/openstack-gnocchi-grafana/&lt;/link&gt;&lt;guid isPermaLink=&quot;true&quot;&gt;https://julien.danjou.info/blog/openstack-gnocchi-grafana/&lt;/guid&gt;&lt;description&gt;We&amp;apos;ve been working hard with the Gnocchi team these last months to store your metrics, and I guess it&amp;apos;s time to show off a bit.  So far Gnocchi offers scalable metric storage and resource indexation,&lt;/description&gt;&lt;pubDate&gt;Mon, 14 Sep 2015 00:00:00 GMT&lt;/pubDate&gt;&lt;content:encoded&gt;&amp;lt;p&amp;gt;We&amp;apos;ve been working hard with the Gnocchi team these last months to store your metrics, and I guess it&amp;apos;s time to show off a bit.&amp;lt;/p&amp;gt;
&lt;p&gt;So far Gnocchi offers scalable metric storage and resource indexation, especially for OpenStack clouds – but not only: it&apos;s generic. It&apos;s cool to store metrics, but it&apos;s even better to have a way to visualize them!&lt;/p&gt;
&lt;h2&gt;Prototyping&lt;/h2&gt;
&lt;p&gt;We started very early to build a little HTML interface. Being REST-friendly guys, we enabled it on the same endpoints that were used to retrieve information and measures about metrics, sending back &lt;code&gt;text/html&lt;/code&gt; instead of &lt;code&gt;application/json&lt;/code&gt; if you requested those pages from a Web browser.&lt;/p&gt;
&lt;p&gt;But let&apos;s face it: we are back-end developers, and we suck at any kind of front-end development. CSS, HTML, JavaScript? Bwah! So what we built was a starting point, hoping some magical Web developer would jump in and finish the job.&lt;/p&gt;
&lt;p&gt;Obviously it never happened.&lt;/p&gt;
&lt;h2&gt;Ok, so what&apos;s out there?&lt;/h2&gt;
&lt;p&gt;It turns out there are back-end agnostic solutions out there, and we decided to pick &lt;a href=&quot;http://grafana.org&quot;&gt;Grafana&lt;/a&gt;. Grafana is a complete graphing dashboard solution that can be plugged on top of any back-end. It already supports timeseries databases such as Graphite, InfluxDB and OpenTSDB.&lt;/p&gt;
&lt;p&gt;That was more than enough for my fellow developer &lt;a href=&quot;https://blog.sileht.net/&quot;&gt;Mehdi Abaakouk&lt;/a&gt; to jump in and start writing a Gnocchi plugin for Grafana! Consequently, there is now a basic but solid and working back-end for Grafana that lives in the &lt;em&gt;&lt;a href=&quot;https://github.com/grafana/grafana-plugins/tree/master/datasources/gnocchi&quot;&gt;grafana-plugins&lt;/a&gt;&lt;/em&gt;&lt;br /&gt;
repository.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/gnocchi-grafana.png&quot; alt=&quot;gnocchi-grafana&quot; /&gt;&lt;/p&gt;
&lt;p&gt;With that plugin, you can graph anything that is stored in Gnocchi, from raw metrics to metrics tied to resources. You can use templating, but no annotation yet.&lt;/p&gt;
&lt;p&gt;The back-end supports Gnocchi with or without Keystone involved, and any type of authentication (basic auth or Keystone token). So yes, it even works if you&apos;re not running Gnocchi with the rest of OpenStack.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/gnocchi-grafana-group.png&quot; alt=&quot;gnocchi-grafana-group&quot; /&gt;&lt;/p&gt;
&lt;p&gt;It also supports advanced queries, so you can search for resources based on some criteria and graph their metrics.&lt;/p&gt;
&lt;h2&gt;I want to try it!&lt;/h2&gt;
&lt;p&gt;If you want to deploy it, all you need to do is install Grafana and its plugins, and create a new datasource pointing to Gnocchi. It&apos;s that simple. There&apos;s some CORS middleware configuration involved if you&apos;re planning on using Keystone authentication, but it&apos;s pretty straightforward – just set the &lt;code&gt;cors.allowed_origin&lt;/code&gt; option to the URL of your Grafana dashboard.&lt;/p&gt;
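&lt;p&gt;As a hypothetical example, if your Grafana dashboard is served from &lt;code&gt;http://grafana.example.com:3000&lt;/code&gt; (a placeholder host), the Gnocchi configuration might contain something like the following; the exact option group and file location may vary with your Gnocchi and oslo.middleware versions:&lt;/p&gt;

```ini
# Hypothetical gnocchi.conf excerpt; the option name comes from the post,
# its group and location may differ depending on your deployment.
[cors]
allowed_origin = http://grafana.example.com:3000
```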
&lt;p&gt;We added support for Grafana directly to the Gnocchi DevStack plugin. If you&apos;re running &lt;a href=&quot;http://devstack.org&quot;&gt;DevStack&lt;/a&gt; you can follow &lt;a href=&quot;http://docs.openstack.org/developer/gnocchi/devstack.html&quot;&gt;the instructions&lt;/a&gt; – which basically boil down to adding the line &lt;code&gt;enable_service gnocchi-grafana&lt;/code&gt;.&lt;/p&gt;
&lt;h2&gt;Moving to Grafana core&lt;/h2&gt;
&lt;p&gt;&lt;a href=&quot;https://github.com/grafana/grafana/pull/2716&quot;&gt;Mehdi just opened a pull request&lt;/a&gt; a few days ago to merge the plugin into Grafana core. It&apos;s actually one of the most unit-tested plugins in Grafana so far, so it should be on a good path to being merged, bringing Gnocchi support directly into Grafana without any plugin involved.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/grafana-gnocchi-unittests.png&quot; alt=&quot;grafana-gnocchi-unittests&quot; /&gt;&lt;/p&gt;
</content:encoded><category>gnocchi</category><category>openstack</category><category>monitoring</category></item><item><title>Ceilometer, Gnocchi &amp; Aodh: Liberty progress</title><link>https://julien.danjou.info/blog/ceilometer-gnocchi-aodh-liberty-progress/</link><guid isPermaLink="true">https://julien.danjou.info/blog/ceilometer-gnocchi-aodh-liberty-progress/</guid><description>It&apos;s been a while since I talked about Ceilometer and its companions, so I thought I&apos;d go ahead and write a bit about what&apos;s going on this side of OpenStack. I&apos;m not going to cover new features and fa</description><pubDate>Tue, 04 Aug 2015 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;It&apos;s been a while since I talked about Ceilometer and its companions, so I thought I&apos;d go ahead and write a bit about what&apos;s going on this side of OpenStack. I&apos;m not going to cover new features and fancy stuff today, but rather a shallow overview of the new project processes we initiated.&lt;/p&gt;
&lt;h2&gt;Ceilometer growing&lt;/h2&gt;
&lt;p&gt;&lt;a href=&quot;http://launchpad.net/ceilometer&quot;&gt;Ceilometer&lt;/a&gt; has grown a lot since we started it 3 years ago. It has evolved from a system designed to fetch and store measurements into a more complex system, with agents, alarms, events, databases, APIs, etc.&lt;/p&gt;
&lt;p&gt;All those features were needed and asked for by users and operators, but let&apos;s be honest, some of them should never have ended up in the Ceilometer code repository, especially not all at the same time.&lt;/p&gt;
&lt;p&gt;The reality is that we picked a pragmatic approach, due to the rigidity of the OpenStack Technical Committee regarding new projects becoming OpenStack integrated – and, therefore, blessed – projects. Ceilometer was actually the first project to be incubated and then integrated, so we had to go through the very first issues of that process.&lt;/p&gt;
&lt;p&gt;Fortunately, time has passed and all those constraints have been relaxed. To me, the &lt;a href=&quot;https://www.openstack.org/foundation&quot;&gt;OpenStack Foundation&lt;/a&gt; is turning into something that looks like the &lt;a href=&quot;http://www.apache.org/foundation/&quot;&gt;Apache Foundation&lt;/a&gt;, and there&apos;s therefore no need to tie technical solutions to political issues.&lt;/p&gt;
&lt;p&gt;Indeed, the &lt;a href=&quot;https://www.openstack.org/summit/vancouver-2015/summit-videos/presentation/the-big-tent-a-look-at-the-new-openstack-projects-governance&quot;&gt;Big Tent&lt;/a&gt; now allows much more flexibility in all of that. A year ago, we were afraid to bring Gnocchi into Ceilometer. Was the Technical Committee going to review the project? Was the project going to be in the scope of Ceilometer for the Technical Committee? We no longer have to ask ourselves those questions. That freedom empowers us to do what we think is good in terms of technical design, without worrying too much about political issues.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/ceilometer-activity.png&quot; alt=&quot;ceilometer-activity&quot; /&gt;&lt;/p&gt;
&lt;h2&gt;Acknowledging Gnocchi&lt;/h2&gt;
&lt;p&gt;The first step in this new process was to continue working on &lt;a href=&quot;https://launchpad.net/gnocchi&quot;&gt;Gnocchi&lt;/a&gt; (a time-series database and resource indexer designed to overcome historical Ceilometer storage issues) and to decide that merging it into Ceilometer as some REST API v3 was not the right call: it was better to keep it standalone.&lt;/p&gt;
&lt;p&gt;We managed to get traction for Gnocchi, attracting a few contributors and users. We&apos;re even seeing talks proposed for the next Tokyo Summit where people leverage Gnocchi, such as &quot;Service of predictive analytics on cost and performance in OpenStack&quot;, &quot;&lt;a href=&quot;https://wiki.openstack.org/wiki/Surveil&quot;&gt;Surveil&lt;/a&gt;&quot; and &quot;Cutting Edge NFV On OpenStack: Healing and Scaling Distributed Applications&quot;.&lt;/p&gt;
&lt;p&gt;We are also making progress on pushing Gnocchi outside of the OpenStack community, as it is a self-sufficient time-series and resource database that can be used without any OpenStack interaction.&lt;/p&gt;
&lt;h2&gt;Branching Aodh&lt;/h2&gt;
&lt;p&gt;Rather than continuing to grow Ceilometer, during the last summit we all decided that it was time to reorganize and split Ceilometer into the different components it is made of, leveraging a more &lt;a href=&quot;https://en.wikipedia.org/wiki/Service-oriented_architecture&quot;&gt;service-oriented architecture&lt;/a&gt;. The alarm subsystem of Ceilometer being mostly untied from the rest of Ceilometer, we decided it was the first and perfect candidate for that. I personally took on the work and created a new repository with only the alarm code from Ceilometer, named &lt;a href=&quot;https://launchpad.net/aodh&quot;&gt;Aodh&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/woman-fire.jpg&quot; alt=&quot;woman-fire&quot; /&gt;&lt;/p&gt;
&lt;p&gt;This made sense for a lot of reasons. First, because Aodh can now work completely standalone, using either Ceilometer or Gnocchi as a backend – or any new plugin you&apos;d write. I love the idea that OpenStack projects can work standalone – like Swift does, for example – without implying any other OpenStack component. I think it&apos;s a proof of good design. Secondly, because it allows us to reason about a smaller chunk of software – a benefit really under-estimated today in OpenStack. I believe that the size of your software should match a certain ratio to the size of your team.&lt;/p&gt;
&lt;p&gt;Aodh is, therefore, a new project under the OpenStack Telemetry program (or what remains of OpenStack programs now), alongside Ceilometer and Gnocchi, forked from the original Ceilometer alarm feature. We&apos;ll deprecate the latter with the Liberty release, and we&apos;ll remove it in the Mitaka release.&lt;/p&gt;
&lt;h2&gt;Lessons learned&lt;/h2&gt;
&lt;p&gt;Actually, moving that code out of Ceilometer (in the case of Aodh), or not merging it in (in the case of Gnocchi), had a few side effects that, I admit, we probably under-estimated back then.&lt;/p&gt;
&lt;p&gt;Indeed, the code size of Gnocchi or Aodh ended up being much smaller than the entire Ceilometer project – Gnocchi is 7× smaller and Aodh 5× smaller than Ceilometer – and therefore much easier to manipulate and to hack on. That allowed us to merge dozens of patches in a few weeks, cleaning up and enhancing a lot of small things in the code. Those tasks are much harder in Ceilometer, due to the bigger size of the code base and the small size of our team. Having our small team work on smaller chunks of changes – even when it meant actually doing more reviews – greatly improved our general velocity and the number of bugs fixed and features implemented.&lt;/p&gt;
&lt;p&gt;On the more sociological side, I think it gave the team the sensation of finally owning the project. Ceilometer was huge, and it was impossible for people to know every side of it. Now, it&apos;s possible for people inside a team to cover a much larger portion of those smaller projects, which gives them a greater sense of ownership and caring – which ends up being good for overall project quality.&lt;/p&gt;
&lt;p&gt;That also means we decided to have a different core team for each project (Ceilometer, Gnocchi, and Aodh), as they all serve different purposes and can all be used standalone or with each other. This means contributors can focus on one project while completely ignoring the others.&lt;/p&gt;
&lt;p&gt;All of that reminds me of discussions I heard about projects such as Glance trying to fit in new features – some of them really orthogonal to the original purpose. It&apos;s now clear to me that having different small components interacting together, each of which can be completely owned and taken care of by a (small) team of contributors, is the way to go. People who can trust each other and easily bring new people in make a project incredibly more powerful. Having a project cover too wide a set of features makes things more difficult if you don&apos;t have enough manpower. This is clearly an issue that big projects inside OpenStack, such as Neutron or Nova, are facing now.&lt;/p&gt;
</content:encoded><category>openstack</category><category>gnocchi</category></item><item><title>Timezones and Python</title><link>https://julien.danjou.info/blog/python-and-timezones/</link><guid isPermaLink="true">https://julien.danjou.info/blog/python-and-timezones/</guid><description>Recently, I&apos;ve been fighting with the never ending issue of timezones. I never thought I would have plunged into this rabbit hole, but hacking on OpenStack and Gnocchi I fell into that trap easily,</description><pubDate>Tue, 16 Jun 2015 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Recently, I&apos;ve been fighting with the never ending issue of timezones. I never thought I would have plunged into this rabbit hole, but hacking on OpenStack and Gnocchi, I fell into that trap easily, thanks to Python.&lt;/p&gt;
&lt;h2&gt;“Why you really, really, should never ever deal with timezones”&lt;/h2&gt;
&lt;p&gt;To get a glimpse of the complexity of timezones, I recommend that you watch &lt;a href=&quot;http://www.tomscott.com/&quot;&gt;Tom Scott&lt;/a&gt;&apos;s video on the subject. It&apos;s fun and it summarizes remarkably well the nightmare that timezones are and why you should stop thinking that you&apos;re smart.&lt;/p&gt;
&lt;h2&gt;The importance of timezones in applications&lt;/h2&gt;
&lt;p&gt;Once you&apos;ve heard what Tom says, I think it gets pretty clear that a timestamp without any timezone attached does not give any useful information. It should be considered irrelevant and useless. Without the necessary context given by the timezone, you cannot infer what point in time your application is really referring to.&lt;/p&gt;
&lt;p&gt;That means your application should never handle timestamps with no timezone information. It should either try to guess the timezone, or raise an error if none is provided in its input.&lt;/p&gt;
&lt;p&gt;Of course, you can assume that having no timezone information means UTC. This sounds very handy, but can also be dangerous in certain applications or languages – such as Python, as we&apos;ll see.&lt;/p&gt;
&lt;p&gt;Indeed, in certain applications, converting timestamps to UTC and losing the timezone information is a terrible idea. Imagine that a user creates a recurring event every Wednesday at 10:00 in their local timezone, say CET. If you convert that to UTC, the event will end up being stored as every Wednesday at 09:00.&lt;/p&gt;
&lt;p&gt;Now imagine that the CET timezone switches from UTC+01:00 to UTC+02:00: your application will compute that the event starts at 11:00 CET every Wednesday. Which is wrong, because as the user told you, the event starts at 10:00 CET, whatever the definition of CET is. Not at 11:00 CET. So CET means CET, not necessarily UTC+1.&lt;/p&gt;
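The trap above can be reproduced with a few lines of Python. This is my own sketch using the standard `zoneinfo` module (available since Python 3.9; pytz was the usual choice when this post was written): converting the winter event to UTC and then replaying that fixed UTC time in summer shifts the local wall-clock time.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

paris = ZoneInfo("Europe/Paris")

# A recurring event at 10:00 local time, defined in winter (CET, UTC+01:00).
winter_event = datetime(2015, 1, 7, 10, 0, tzinfo=paris)  # a Wednesday
as_utc = winter_event.astimezone(timezone.utc)
print(as_utc.hour)  # 9 -- stored as "every Wednesday at 09:00 UTC"

# Replaying "09:00 UTC" on a Wednesday in summer (CEST, UTC+02:00)
# lands at 11:00 local time, not the 10:00 the user asked for.
summer_utc = datetime(2015, 7, 8, 9, 0, tzinfo=timezone.utc)
print(summer_utc.astimezone(paris).hour)  # 11
```

Storing the event in its original timezone ("10:00 Europe/Paris") avoids the shift, because the offset is resolved at each occurrence instead of being frozen at creation time.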
&lt;p&gt;As for endpoints like REST APIs, something I deal with daily, all timestamps should include timezone information. Otherwise, it&apos;s nearly impossible to know what timezone the timestamps are in: UTC? Server local? User local? There is no way to know.&lt;/p&gt;
&lt;h2&gt;Python design &amp;amp; defect&lt;/h2&gt;
&lt;p&gt;Python comes with a timestamp object named &lt;code&gt;datetime.datetime&lt;/code&gt;. It can store date and time precise to the microsecond, and is qualified as timezone &quot;aware&quot; or &quot;unaware&quot;, depending on whether it embeds timezone information or not.&lt;/p&gt;
&lt;p&gt;To build such an object based on the current time, one can use &lt;code&gt;datetime.datetime.utcnow()&lt;/code&gt; to retrieve the date and time for the UTC timezone, and &lt;code&gt;datetime.datetime.now()&lt;/code&gt; to retrieve the date and time for the current timezone, whatever it is.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&amp;gt;&amp;gt;&amp;gt; import datetime
&amp;gt;&amp;gt;&amp;gt; datetime.datetime.utcnow()
datetime.datetime(2015, 6, 15, 13, 24, 48, 27631)
&amp;gt;&amp;gt;&amp;gt; datetime.datetime.now()
datetime.datetime(2015, 6, 15, 15, 24, 52, 276161)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;As you can notice, none of these results contains timezone information. Indeed, the Python &lt;code&gt;datetime&lt;/code&gt; API always returns unaware &lt;code&gt;datetime&lt;/code&gt; objects, which is very unfortunate: as soon as you get one of these objects, there is no way to know what the timezone is, so these objects are pretty &quot;useless&quot; on their own.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;http://lucumr.pocoo.org/2011/7/15/eppur-si-muove/&quot;&gt;Armin Ronacher proposes that an application always treat the unaware &lt;code&gt;datetime&lt;/code&gt; objects from Python as UTC&lt;/a&gt;. As we just saw, that assumption cannot hold for objects returned by &lt;code&gt;datetime.datetime.now()&lt;/code&gt;, so I would not advise doing so. &lt;code&gt;datetime&lt;/code&gt; objects with no timezone should be considered a &quot;bug&quot; in the application.&lt;/p&gt;
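A quick illustration of my own (not from the original post) of why unaware objects are troublesome: Python refuses to even compare a naive datetime with an aware one, since it cannot place the naive value on the timeline.

```python
from datetime import datetime, timezone

naive = datetime.utcnow()           # unaware: no tzinfo attached
aware = datetime.now(timezone.utc)  # aware: tzinfo is UTC

# Comparing the two raises TypeError: Python cannot order a point
# in time whose timezone (and thus absolute position) is unknown.
try:
    print(naive < aware)
except TypeError as exc:
    print("comparison refused:", exc)
```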
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/timezone-map.jpg&quot; alt=&quot;timezone-map&quot; /&gt;&lt;/p&gt;
&lt;h2&gt;Recommendations&lt;/h2&gt;
&lt;p&gt;My recommendation list comes down to:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Always use aware &lt;code&gt;datetime&lt;/code&gt; object, i.e. with timezone information. That makes sure you can compare them directly (aware and unaware &lt;code&gt;datetime&lt;/code&gt; objects are not comparable) and will return them correctly to users. Leverage &lt;a href=&quot;http://pytz.sourceforge.net/&quot;&gt;pytz&lt;/a&gt; to have timezone objects.&lt;/li&gt;
&lt;li&gt;Use &lt;a href=&quot;https://en.wikipedia.org/wiki/ISO_8601&quot;&gt;ISO 8601&lt;/a&gt; as input and output string format. Use &lt;code&gt;datetime.datetime.isoformat()&lt;/code&gt; to return timestamps as string formatted using that format, which includes the timezone information.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;In Python, that&apos;s equivalent to having:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&amp;gt;&amp;gt;&amp;gt; import datetime
&amp;gt;&amp;gt;&amp;gt; import pytz
&amp;gt;&amp;gt;&amp;gt; def utcnow():
...     return datetime.datetime.now(tz=pytz.utc)
&amp;gt;&amp;gt;&amp;gt; utcnow()
datetime.datetime(2015, 6, 15, 14, 45, 19, 182703, tzinfo=&amp;lt;UTC&amp;gt;)
&amp;gt;&amp;gt;&amp;gt; utcnow().isoformat()
&apos;2015-06-15T14:45:21.982600+00:00&apos;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;If you need to parse strings containing ISO 8601 formatted timestamps, you can rely on the &lt;em&gt;&lt;a href=&quot;https://pypi.python.org/pypi/iso8601&quot;&gt;iso8601&lt;/a&gt;&lt;/em&gt; module, which returns timestamps with correct timezone information. This makes timestamps directly comparable:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&amp;gt;&amp;gt;&amp;gt; import iso8601
&amp;gt;&amp;gt;&amp;gt; iso8601.parse_date(utcnow().isoformat())
datetime.datetime(2015, 6, 15, 14, 46, 43, 945813, tzinfo=&amp;lt;FixedOffset &apos;+00:00&apos; datetime.timedelta(0)&amp;gt;)
&amp;gt;&amp;gt;&amp;gt; iso8601.parse_date(utcnow().isoformat()) &amp;lt; utcnow()
True
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;If you need to store those timestamps, the same rule should apply. If you rely on &lt;a href=&quot;http://mongodb.org&quot;&gt;MongoDB&lt;/a&gt;, it assumes that all the timestamps are in UTC, so be careful when storing them – you will have to normalize the timestamps to UTC.&lt;/p&gt;
&lt;p&gt;For &lt;a href=&quot;http://mysql.org&quot;&gt;MySQL&lt;/a&gt;, nothing is assumed, it&apos;s up to the application to insert them in a timezone that makes sense to it. Obviously, if you have multiple applications accessing the same database with different data sources, this can end up being a nightmare.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;http://postgresql.org&quot;&gt;PostgreSQL&lt;/a&gt; has a &lt;a href=&quot;http://www.postgresql.org/docs/9.4/static/datatype-datetime.html&quot;&gt;special data type that is recommended&lt;/a&gt; called &lt;code&gt;timestamp with time zone&lt;/code&gt;. That does not mean you should not use UTC in most cases; it just means you are sure that the timestamps are stored in UTC as soon as they are written to the database, without having to check whether another application inserted timestamps in a different timezone.&lt;/p&gt;
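Whatever the backend, the safe pattern is the same: refuse naive timestamps and normalize aware ones to UTC just before storage. A minimal sketch of such a helper (`normalize_for_storage` is my own hypothetical name, not an API from any of these databases):

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo  # standard library since Python 3.9

def normalize_for_storage(ts):
    """Reject naive timestamps; convert aware ones to UTC for storage."""
    if ts.tzinfo is None:
        raise ValueError("naive timestamp: timezone unknown, refusing to guess")
    return ts.astimezone(timezone.utc)

# 15:30 in Paris during summer (CEST, UTC+02:00) becomes 13:30 UTC.
local = datetime(2015, 6, 16, 15, 30, tzinfo=ZoneInfo("Europe/Paris"))
print(normalize_for_storage(local).isoformat())  # 2015-06-16T13:30:00+00:00
```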
&lt;h2&gt;OpenStack status&lt;/h2&gt;
&lt;p&gt;As a side note, I&apos;ve improved the OpenStack situation recently by changing the &lt;a href=&quot;http://docs.openstack.org/developer/oslo.utils/api/timeutils.html&quot;&gt;oslo.utils.timeutils&lt;/a&gt; module to deprecate some useless and dangerous functions. I&apos;ve also added support for returning timezone-aware objects when using the &lt;code&gt;oslo_utils.timeutils.utcnow()&lt;/code&gt; function. Unfortunately, it&apos;s not possible to make that the default for backward compatibility reasons, but it&apos;s there nevertheless, and using it is advised. Thanks to my colleague &lt;a href=&quot;http://haypo-notes.readthedocs.org/&quot;&gt;Victor&lt;/a&gt; for the help!&lt;/p&gt;
&lt;p&gt;Have a nice day, whatever your timezone is!&lt;/p&gt;
</content:encoded><category>python</category><category>openstack</category></item><item><title>OpenStack Summit Liberty from a Ceilometer &amp; Gnocchi point of view</title><link>https://julien.danjou.info/blog/openstack-summit-liberty-vancouver-ceilometer-gnocchi/</link><guid isPermaLink="true">https://julien.danjou.info/blog/openstack-summit-liberty-vancouver-ceilometer-gnocchi/</guid><description>Last week I was in Vancouver, BC for the OpenStack Summit, discussing the new Liberty version that will be released in 6 months.</description><pubDate>Tue, 26 May 2015 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Last week I was in &lt;a href=&quot;http://vancouver.ca/&quot;&gt;Vancouver, BC&lt;/a&gt; for the &lt;a href=&quot;https://www.openstack.org/summit/vancouver-2015/&quot;&gt;OpenStack Summit&lt;/a&gt;, discussing the new Liberty version that will be released in 6 months.&lt;/p&gt;
&lt;p&gt;I attended the summit mainly to discuss and follow up on new developments in Ceilometer, Gnocchi and Oslo. It has been a pretty good week, and we were able to discuss and plan a few interesting things.&lt;/p&gt;
&lt;h2&gt;Ops feedback&lt;/h2&gt;
&lt;p&gt;We had half a dozen Ceilometer sessions, and the first one was dedicated to getting feedback from operators using Ceilometer. We had a few operators present, as well as a few members of the Ceilometer team. We had a constructive discussion, and my feeling is that operators struggle with two things so far: scaling Ceilometer storage, and keeping Ceilometer from killing the rest of OpenStack.&lt;/p&gt;
&lt;p&gt;We discussed the first point as being addressed by &lt;a href=&quot;http://launchpad.net/gnocchi&quot;&gt;Gnocchi&lt;/a&gt;, and I presented Gnocchi itself a bit, as well as how and why it will fix the storage scalability issues operators have encountered so far.&lt;/p&gt;
&lt;p&gt;Ceilometer bringing down the OpenStack installation is a more interesting problem. Ceilometer pollsters request information from Nova, Glance… to gather statistics. Until Kilo, Ceilometer used to do that regularly and at a fixed interval, causing high peak load in OpenStack. With the &lt;a href=&quot;http://docs.openstack.org/developer/ceilometer/architecture.html#polling-agents-asking-for-data&quot;&gt;introduction of jitter&lt;/a&gt; in Kilo, this should be less of a problem. However, Ceilometer hits various endpoints in OpenStack that are poorly designed, and hitting those endpoints of Nova or other components triggers a lot of load on the platform. Unfortunately, this makes operators blame Ceilometer rather than the components guilty of poor design. We&apos;d like to push forward on improving these components, but it&apos;s probably going to take a long time.&lt;/p&gt;
&lt;h2&gt;Componentisation&lt;/h2&gt;
&lt;p&gt;When I started the Gnocchi project last year, I realized pretty soon that we would be able to split Ceilometer itself into different smaller components that could work independently, while still being able to leverage each other. For example, Gnocchi can run standalone and store your metrics even if you don&apos;t use Ceilometer – or even OpenStack itself.&lt;/p&gt;
&lt;p&gt;My fellow developer &lt;a href=&quot;http://burningchrome.com/&quot;&gt;Chris Dent&lt;/a&gt; had the same idea about splitting Ceilometer a few months ago and drafted a proposal. The idea is to have Ceilometer split into different parts that people could assemble together or run on their own.&lt;/p&gt;
&lt;p&gt;Interestingly enough, we had three 40-minute sessions planned to talk and debate about this division of Ceilometer, though we all agreed in 5 minutes that it was the right thing to do. Five more minutes later, we agreed on which parts to split out. The rest of the time was allocated to discussing various details of that split, and I committed to starting the work with the Ceilometer alarming subsystem.&lt;/p&gt;
&lt;p&gt;I wrote a &lt;a href=&quot;https://review.openstack.org/#/c/184307/&quot;&gt;specification&lt;/a&gt; on the plane bringing me to Vancouver, which should be approved pretty soon now. I have already started doing the implementation work. So fingers crossed, Ceilometer should have a new component in Liberty handling alarming on its own.&lt;/p&gt;
&lt;p&gt;This would allow users, for example, to deploy only Gnocchi and the Ceilometer alarm component. They would be able to feed data to Gnocchi using their own system, and build alarms using the Ceilometer alarm subsystem relying on Gnocchi&apos;s data.&lt;/p&gt;
&lt;h2&gt;Gnocchi&lt;/h2&gt;
&lt;p&gt;We didn&apos;t have a Gnocchi-dedicated slot – mainly because I indicated I didn&apos;t feel we needed one. We discussed a few points around coffee anyway, and I&apos;ve been able to draw up a few new ideas and changes I&apos;d like to see in Gnocchi: mainly changing the API contract to be more asynchronous so we can support &lt;a href=&quot;http://influxdb.com/&quot;&gt;InfluxDB&lt;/a&gt; more correctly, and improving the drivers based on Carbonara (the library we created to manipulate timeseries) to be faster.&lt;/p&gt;
&lt;p&gt;All of that – plus a few Oslo tasks I&apos;d like to tackle – should keep me busy for the next cycle!&lt;/p&gt;
</content:encoded><category>openstack</category><category>gnocchi</category></item><item><title>My interview about software tests and Python</title><link>https://julien.danjou.info/blog/interview-software-tests-in-python/</link><guid isPermaLink="true">https://julien.danjou.info/blog/interview-software-tests-in-python/</guid><description>Johannes Hubertz interviewed me for his upcoming German book about Python software testing, covering my work on OpenStack and testing best practices.</description><pubDate>Mon, 11 May 2015 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;I&apos;ve recently been contacted by &lt;a href=&quot;http://hubertz.de/blog/&quot;&gt;Johannes Hubertz&lt;/a&gt;, who is writing a new book about Python in German called &lt;em&gt;&quot;Softwaretests mit Python&quot;&lt;/em&gt;, which will be published by &lt;em&gt;Open Source Press, Munich&lt;/em&gt; this summer. His book will feature some interviews, and he was kind enough to let me write a bit about software testing. This is the interview I gave for his book: Johannes translated it to German for inclusion, and I decided to publish the original version on my blog today.&lt;/p&gt;
&lt;h2&gt;How did you come to Python?&lt;/h2&gt;
&lt;p&gt;I don&apos;t recall exactly, but around ten years ago, I saw more and more people using it and decided to take a look. Back then, I was more used to Perl. I didn&apos;t really like Perl and was not getting a good grip on its object system.&lt;/p&gt;
&lt;p&gt;As soon as I found an idea to work on – if I remember correctly that was rebuildd – I started to code in Python, learning the language at the same time.&lt;/p&gt;
&lt;p&gt;I liked how Python worked, and how fast I was able to develop in it and learn it, so I decided to keep using it for my next projects. I ended up diving into Python&apos;s core for various reasons, even doing things like briefly hacking on projects like Cython at some point, and finally ended up working on OpenStack.&lt;/p&gt;
&lt;p&gt;OpenStack is a cloud computing platform entirely written in Python. So I&apos;ve been writing Python every day since working on it.&lt;/p&gt;
&lt;p&gt;That&apos;s what pushed me to write &lt;a href=&quot;https://thehackerguidetopython.com&quot;&gt;The Hacker&apos;s Guide to Python&lt;/a&gt; in 2013 and then self-publish it a year later in 2014, a book where I talk about doing smart and efficient Python.&lt;/p&gt;
&lt;p&gt;It has been a great success, and has even been translated into Chinese and Korean, so I&apos;m currently working on a second edition of the book. It has been an amazing adventure!&lt;/p&gt;
&lt;h2&gt;Zen of Python: Which line is the most important for you and why?&lt;/h2&gt;
&lt;p&gt;I like &quot;There should be one – and preferably only one – obvious way to do it&quot;. The opposite is probably something that scared me in languages like Perl. Having one obvious way to do it is also something I tend to like in functional languages like Lisp, which are, in my humble opinion, even better at that.&lt;/p&gt;
&lt;h2&gt;For a python newbie, what are the most difficult subjects in Python?&lt;/h2&gt;
&lt;p&gt;I haven&apos;t been a newbie for a while, so it&apos;s hard for me to say. I don&apos;t think the language is hard to learn. There are some subtleties in the language itself when you dive deeply into the internals, but for beginners most of the concepts are pretty straightforward. If I had to pick, among the language basics, the most difficult thing would be around generator objects (yield).&lt;/p&gt;
&lt;p&gt;Nowadays I think the most difficult subject for newcomers is which version of Python to use, which libraries to rely on, and how to package and distribute projects. Fortunately, things are getting better.&lt;/p&gt;
&lt;h2&gt;When did you start using Test Driven Development and why?&lt;/h2&gt;
&lt;p&gt;I learned unit testing and TDD at school where teachers forced me to learn Java, and I hated it. The frameworks looked complicated, and I had the impression I was losing my time. Which I actually was, since I was writing disposable programs – that&apos;s the only thing you do at school.&lt;/p&gt;
&lt;p&gt;Years later, when I started to write real and bigger programs (e.g. rebuildd), I quickly ended up fixing bugs… that I had already fixed. That reminded me of unit tests, and that it might be a good idea to start using them to stop fixing the same things over and over again.&lt;/p&gt;
&lt;p&gt;For a few years, I wrote less Python and more C code and Lua (for the &lt;a href=&quot;http://awesome.naquadah.org&quot;&gt;awesome window manager&lt;/a&gt;), and I didn&apos;t use any testing. I probably lost hundreds of hours testing manually and fixing regressions – that was a good lesson. Though I had good excuses at that time – it is/was way harder to do testing in C/Lua than in Python.&lt;/p&gt;
&lt;p&gt;Since that period, I have never stopped writing &quot;tests&quot;. When I started to hack on OpenStack, the project was adopting a &quot;no test? no merge!&quot; policy due to the high number of regressions it had during the first releases.&lt;/p&gt;
&lt;p&gt;I honestly don&apos;t think I could work on any project that does not have – at least a minimal – test coverage. It&apos;s impossible to hack efficiently on a code base that you&apos;re not able to test with just a simple command. It&apos;s also a real problem for newcomers in the open source world. When there are no tests, you can hack something, send a patch, and get a &quot;you broke this&quot; in response.&lt;/p&gt;
&lt;p&gt;Nowadays, this kind of response sounds unacceptable to me: if there is no test, then I didn&apos;t break anything!&lt;/p&gt;
&lt;p&gt;In the end, it&apos;s just too much frustration to work on non tested projects as I demonstrated in &lt;a href=&quot;https://julien.danjou.info/blog/python-bad-practice-concrete-case&quot;&gt;my study of whisper source code&lt;/a&gt;.&lt;/p&gt;
&lt;h2&gt;What do you think to be the most often seen pitfalls of TDD and how to avoid them best?&lt;/h2&gt;
&lt;p&gt;The biggest problems are deciding when, and at what rate, to write tests.&lt;/p&gt;
&lt;p&gt;On one hand, some people start to write overly precise tests way too soon. Doing that slows you down, especially when you are prototyping some idea or concept you just had. That does not mean you should not write tests at all, but you should probably start with light coverage until you are pretty sure that you&apos;re not going to rip everything out and start over. On the other hand, some people postpone writing tests forever, and end up with no tests at all, or a too-thin layer of tests, which leaves the project with pretty low coverage.&lt;/p&gt;
&lt;p&gt;Basically, your test coverage should reflect the state of your project. If it&apos;s just starting, you should build a thin layer of tests so you can hack on it easily and remodel it if needed. The more your project grows, the more you should make it solid and add more tests.&lt;/p&gt;
&lt;p&gt;Having too detailed tests is painful to make the project evolve at the start. Having not enough in a big project makes it painful to maintain it.&lt;/p&gt;
&lt;h2&gt;Do you think, TDD fits and scales well for the big projects like OpenStack?&lt;/h2&gt;
&lt;p&gt;Not only do I think it fits and scales well, I also think it&apos;s just impossible not to use TDD in such big projects.&lt;/p&gt;
&lt;p&gt;When unit and functional tests coverage was weak in OpenStack – at its beginning – it was just impossible to fix a bug or write a new feature without breaking a lot of things without even noticing. We would release version N, and a ton of old bugs present in N-2 – but fixed in N-1 – were reopened.&lt;/p&gt;
&lt;p&gt;For big projects, with a lot of different use cases, configuration options, etc., you need belt and braces. You cannot throw code into a repository expecting it to just work, and you can&apos;t afford to test everything manually at each commit. That&apos;s just insane.&lt;/p&gt;
</content:encoded><category>career</category><category>python</category><category>books</category><category>openstack</category></item><item><title>Gnocchi 1.0: storing metrics and resources at scale</title><link>https://julien.danjou.info/blog/openstack-gnocchi-first-release/</link><guid isPermaLink="true">https://julien.danjou.info/blog/openstack-gnocchi-first-release/</guid><description>A few months ago, I wrote a long post about what I called back then the &quot; Gnocchi experiment &quot;. Time passed and we – me and the rest of the Gnocchi team – continued to work on that project, finalizing</description><pubDate>Tue, 21 Apr 2015 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;A few months ago, I wrote a long post about what I called back then the &quot;&lt;a href=&quot;https://julien.danjou.info/blog/openstack-ceilometer-the-gnocchi-experiment&quot;&gt;Gnocchi experiment&lt;/a&gt;&quot;. Time passed and we – me and the rest of the Gnocchi team – continued to work on that project, finalizing it.&lt;/p&gt;
&lt;p&gt;It&apos;s with a great pleasure that we are going to release our first &lt;em&gt;1.0&lt;/em&gt; version this month, roughly at the same time that the integrated &lt;a href=&quot;http://openstack.org&quot;&gt;OpenStack&lt;/a&gt; projects release their Kilo milestone. The &lt;a href=&quot;https://pypi.python.org/pypi/gnocchi&quot;&gt;first release candidate numbered 1.0.0rc1&lt;/a&gt; has been released this morning!&lt;/p&gt;
&lt;h2&gt;The problem to solve&lt;/h2&gt;
&lt;p&gt;Before I dive into Gnocchi details, it&apos;s important to have a good view of what problems Gnocchi is trying to solve.&lt;/p&gt;
&lt;p&gt;Most IT infrastructures out there consist of a set of resources. These resources have properties: some of them are simple attributes, whereas others might be measurable quantities (also known as metrics).&lt;/p&gt;
&lt;p&gt;In this context, cloud infrastructures are no exception. We talk about instances, volumes, networks… which are all different kinds of resources. The problem arising with the cloud trend is the scalability of storing all this data and being able to query it later, for whatever usage.&lt;/p&gt;
&lt;p&gt;What Gnocchi provides is a REST API that allows the user to manipulate resources (CRUD) and their attributes, while preserving the history of those resources and their attributes.&lt;/p&gt;
&lt;p&gt;Gnocchi is fully documented and the &lt;a href=&quot;http://gnocchi.xyz&quot;&gt;documentation is available online&lt;/a&gt;. We are the first OpenStack project to require that patches &lt;em&gt;integrate documentation&lt;/em&gt;. We want to raise the bar, so we took a stand on that. That&apos;s part of our policy, the same way it&apos;s part of the OpenStack policy to require unit tests.&lt;/p&gt;
&lt;p&gt;I&apos;m not going to paraphrase the whole Gnocchi documentation, which covers things like installation (super easy), but I&apos;ll guide you through some basics of the features provided by the REST API. I will show you some examples so you can get a better understanding of what you could leverage using Gnocchi!&lt;/p&gt;
&lt;h2&gt;Handling metrics&lt;/h2&gt;
&lt;p&gt;Gnocchi provides a full REST API to manipulate time-series that are called &lt;em&gt;metrics&lt;/em&gt;. You can easily create a metric using a simple HTTP request:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;POST /v1/metric HTTP/1.1
Content-Type: application/json

{
  &quot;archive_policy_name&quot;: &quot;low&quot;
}

HTTP/1.1 201 Created
Location: http://localhost/v1/metric/387101dc-e4b1-4602-8f40-e7be9f0ed46a
Content-Type: application/json; charset=UTF-8

{
  &quot;archive_policy&quot;: {
    &quot;aggregation_methods&quot;: [
      &quot;std&quot;,
      &quot;sum&quot;,
      &quot;mean&quot;,
      &quot;count&quot;,
      &quot;max&quot;,
      &quot;median&quot;,
      &quot;min&quot;,
      &quot;95pct&quot;
    ],
    &quot;back_window&quot;: 0,
    &quot;definition&quot;: [
      {
        &quot;granularity&quot;: &quot;0:00:01&quot;,
        &quot;points&quot;: 3600,
        &quot;timespan&quot;: &quot;1:00:00&quot;
      },
      {
        &quot;granularity&quot;: &quot;0:30:00&quot;,
        &quot;points&quot;: 48,
        &quot;timespan&quot;: &quot;1 day, 0:00:00&quot;
      }
    ],
    &quot;name&quot;: &quot;low&quot;
  },
  &quot;created_by_project_id&quot;: &quot;e8afeeb3-4ae6-4888-96f8-2fae69d24c01&quot;,
  &quot;created_by_user_id&quot;: &quot;c10829c6-48e2-4d14-ac2b-bfba3b17216a&quot;,
  &quot;id&quot;: &quot;387101dc-e4b1-4602-8f40-e7be9f0ed46a&quot;,
  &quot;name&quot;: null,
  &quot;resource_id&quot;: null
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The &lt;code&gt;archive_policy_name&lt;/code&gt; parameter defines how the measures that are being sent are going to be aggregated. You can also define archive policies using the API, specifying what kind of aggregation period and granularity you want. In this case, the &lt;em&gt;low&lt;/em&gt; archive policy keeps 1 hour of data aggregated at 1-second granularity and 1 day of data aggregated at 30-minute granularity. The functions used for aggregation are standard mathematical functions: standard deviation, minimum, maximum… and even the 95th percentile. All of that is obviously customizable, and you can create your own archive policies.&lt;/p&gt;
&lt;p&gt;If you don&apos;t want to specify the archive policy manually for each metric, you can also create an &lt;em&gt;archive policy rule&lt;/em&gt;, which applies a specific archive policy based on the metric name: e.g. metrics matching &lt;code&gt;disk.*&lt;/code&gt; are high-resolution metrics, so they will use the &lt;code&gt;high&lt;/code&gt; archive policy.&lt;/p&gt;
&lt;p&gt;It&apos;s also worth noting that Gnocchi is precise up to the nanosecond and is not tied to the current time: you can manipulate and inject measures that are years old. You can also inject points with old timestamps (i.e. older than the most recent one in the time series) if the archive policy allows it (see the &lt;code&gt;back_window&lt;/code&gt; parameter).&lt;/p&gt;
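&lt;p&gt;To make the relationship between these numbers concrete, here is a small Python sketch (a hypothetical helper, not part of Gnocchi) that recomputes the timespans of the &lt;em&gt;low&lt;/em&gt; policy from its granularities and point counts:&lt;/p&gt;

```python
# Hypothetical helper, not part of Gnocchi: shows how a definition's
# timespan follows from granularity x points, matching the JSON above.
import datetime

def timespan(granularity_seconds, points):
    return datetime.timedelta(seconds=granularity_seconds * points)

# The two definitions of the built-in "low" policy:
low_policy = [
    {"granularity": 1, "points": 3600},      # 1 hour of 1-second data
    {"granularity": 30 * 60, "points": 48},  # 1 day of 30-minute data
]

spans = [timespan(d["granularity"], d["points"]) for d in low_policy]
print(spans[0])  # 1:00:00
print(spans[1])  # 1 day, 0:00:00
```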
&lt;p&gt;It&apos;s then possible to send measures to this metric:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;POST /v1/metric/387101dc-e4b1-4602-8f40-e7be9f0ed46a/measures HTTP/1.1
Content-Type: application/json

[
  {
    &quot;timestamp&quot;: &quot;2014-10-06T14:33:57&quot;,
    &quot;value&quot;: 43.1
  },
  {
    &quot;timestamp&quot;: &quot;2014-10-06T14:34:12&quot;,
    &quot;value&quot;: 12
  },
  {
    &quot;timestamp&quot;: &quot;2014-10-06T14:34:20&quot;,
    &quot;value&quot;: 2
  }
]

HTTP/1.1 204 No Content
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;These measures are synchronously aggregated and stored in the configured storage backend. Our most scalable storage drivers for now are based on either &lt;a href=&quot;http://launchpad.net/swift&quot;&gt;Swift&lt;/a&gt; or &lt;a href=&quot;http://ceph.com&quot;&gt;Ceph&lt;/a&gt;, which are both scalable object storage systems.&lt;/p&gt;
&lt;p&gt;It&apos;s then possible to retrieve these values:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;GET /v1/metric/387101dc-e4b1-4602-8f40-e7be9f0ed46a/measures HTTP/1.1

HTTP/1.1 200 OK
Content-Type: application/json; charset=UTF-8

[
  [
    &quot;2014-10-06T14:30:00.000000Z&quot;,
    1800.0,
    19.033333333333335
  ],
  [
    &quot;2014-10-06T14:33:57.000000Z&quot;,
    1.0,
    43.1
  ],
  [
    &quot;2014-10-06T14:34:12.000000Z&quot;,
    1.0,
    12.0
  ],
  [
    &quot;2014-10-06T14:34:20.000000Z&quot;,
    1.0,
    2.0
  ]
]
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;As older Ceilometer users might notice here, metrics only store timestamps and values, nothing fancy such as metadata anymore.&lt;/p&gt;
&lt;p&gt;By default, values eagerly aggregated using &lt;em&gt;mean&lt;/em&gt; are returned for all supported granularities. You can obviously specify a time range or a different aggregation function using the &lt;code&gt;aggregation&lt;/code&gt;, &lt;code&gt;start&lt;/code&gt; and &lt;code&gt;stop&lt;/code&gt; query parameters.&lt;/p&gt;
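&lt;p&gt;As an illustration, here is a minimal sketch (a hypothetical client-side helper, not part of Gnocchi) of how such a request URL could be assembled with those query parameters:&lt;/p&gt;

```python
# Hypothetical client-side helper, not part of Gnocchi: assembles a
# measures URL with the aggregation/start/stop query parameters.
from urllib.parse import urlencode

def measures_url(base, metric_id, aggregation="mean", start=None, stop=None):
    params = {"aggregation": aggregation}
    if start is not None:
        params["start"] = start
    if stop is not None:
        params["stop"] = stop
    return "%s/v1/metric/%s/measures?%s" % (base, metric_id, urlencode(params))

url = measures_url("http://localhost", "387101dc-e4b1-4602-8f40-e7be9f0ed46a",
                   aggregation="max", start="2014-10-06T14:34")
```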
&lt;p&gt;Gnocchi also supports doing aggregation across aggregated metrics:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;GET /v1/aggregation/metric?metric=65071775-52a8-4d2e-abb3-1377c2fe5c55&amp;amp;metric=9ccdd0d6-f56a-4bba-93dc-154980b6e69a&amp;amp;start=2014-10-06T14:34&amp;amp;aggregation=mean HTTP/1.1

HTTP/1.1 200 OK
Content-Type: application/json; charset=UTF-8

[
  [
    &quot;2014-10-06T14:34:12.000000Z&quot;,
    1.0,
    12.25
  ],
  [
    &quot;2014-10-06T14:34:20.000000Z&quot;,
    1.0,
    11.6
  ]
]
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This computes the mean of the means of the metrics &lt;code&gt;65071775-52a8-4d2e-abb3-1377c2fe5c55&lt;/code&gt; and &lt;code&gt;9ccdd0d6-f56a-4bba-93dc-154980b6e69a&lt;/code&gt;, starting on 6th October 2014 at 14:34 UTC.&lt;/p&gt;
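&lt;p&gt;To see what &quot;mean of means&quot; does, here is a small client-side Python illustration (this is plain arithmetic, not Gnocchi code), with per-metric mean series whose values were chosen so that the result reproduces the response above:&lt;/p&gt;

```python
# Client-side arithmetic only, not Gnocchi code: two per-metric mean
# series, with values chosen to reproduce the aggregated response above.
from statistics import mean

series_a = {"2014-10-06T14:34:12": 12.0, "2014-10-06T14:34:20": 2.0}
series_b = {"2014-10-06T14:34:12": 12.5, "2014-10-06T14:34:20": 21.2}

# For each timestamp present in both series, average the per-metric means.
cross = {ts: mean([series_a[ts], series_b[ts]])
         for ts in sorted(series_a) if ts in series_b}
print(cross)
```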
&lt;h2&gt;Indexing your resources&lt;/h2&gt;
&lt;p&gt;Another object and concept that Gnocchi provides is the ability to manipulate resources. There is a basic type of resource, called &lt;em&gt;generic&lt;/em&gt;, which has very few attributes. You can extend this type to specialize it, and that&apos;s what Gnocchi does by default by providing resource types known for OpenStack such as &lt;em&gt;instance&lt;/em&gt;, &lt;em&gt;volume&lt;/em&gt;, &lt;em&gt;network&lt;/em&gt; or even &lt;em&gt;image&lt;/em&gt;.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;POST /v1/resource/generic HTTP/1.1
Content-Type: application/json

{
  &quot;id&quot;: &quot;75C44741-CC60-4033-804E-2D3098C7D2E9&quot;,
  &quot;project_id&quot;: &quot;BD3A1E52-1C62-44CB-BF04-660BD88CD74D&quot;,
  &quot;user_id&quot;: &quot;BD3A1E52-1C62-44CB-BF04-660BD88CD74D&quot;
}

HTTP/1.1 201 Created
Location: http://localhost/v1/resource/generic/75c44741-cc60-4033-804e-2d3098c7d2e9
ETag: &quot;e3acd0681d73d85bfb8d180a7ecac75fce45a0dd&quot;
Last-Modified: Fri, 17 Apr 2015 11:18:48 GMT
Content-Type: application/json; charset=UTF-8

{
  &quot;created_by_project_id&quot;: &quot;ec181da1-25dd-4a55-aa18-109b19e7df3a&quot;,
  &quot;created_by_user_id&quot;: &quot;4543aa2a-6ebf-4edd-9ee0-f81abe6bb742&quot;,
  &quot;ended_at&quot;: null,
  &quot;id&quot;: &quot;75c44741-cc60-4033-804e-2d3098c7d2e9&quot;,
  &quot;metrics&quot;: {},
  &quot;project_id&quot;: &quot;bd3a1e52-1c62-44cb-bf04-660bd88cd74d&quot;,
  &quot;revision_end&quot;: null,
  &quot;revision_start&quot;: &quot;2015-04-17T11:18:48.696288Z&quot;,
  &quot;started_at&quot;: &quot;2015-04-17T11:18:48.696275Z&quot;,
  &quot;type&quot;: &quot;generic&quot;,
  &quot;user_id&quot;: &quot;bd3a1e52-1c62-44cb-bf04-660bd88cd74d&quot;
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The resource is created with the UUID provided by the user. Gnocchi handles the history of the resource, and that&apos;s what the &lt;code&gt;revision_start&lt;/code&gt; and &lt;code&gt;revision_end&lt;/code&gt; fields are for. They indicate the lifetime of this revision of the resource. The &lt;code&gt;ETag&lt;/code&gt; and &lt;code&gt;Last-Modified&lt;/code&gt; headers are also unique to this resource revision and can be used in a subsequent request using the &lt;code&gt;If-Match&lt;/code&gt; or &lt;code&gt;If-None-Match&lt;/code&gt; headers, for example:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;GET /v1/resource/generic/75c44741-cc60-4033-804e-2d3098c7d2e9 HTTP/1.1
If-None-Match: &quot;e3acd0681d73d85bfb8d180a7ecac75fce45a0dd&quot;

HTTP/1.1 304 Not Modified
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This is useful to synchronize and update any view of the resources you might have in your application.&lt;/p&gt;
&lt;p&gt;You can use the &lt;code&gt;PATCH&lt;/code&gt; HTTP method to modify properties of the resource, which creates a new revision of the resource. The history of a resource is, obviously, available via the REST API.&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;metrics&lt;/code&gt; property of a resource allows you to link metrics to it. You can link existing metrics or create new ones dynamically:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;POST /v1/resource/generic HTTP/1.1
Content-Type: application/json

{
  &quot;id&quot;: &quot;AB68DA77-FA82-4E67-ABA9-270C5A98CBCB&quot;,
  &quot;metrics&quot;: {
    &quot;temperature&quot;: {
      &quot;archive_policy_name&quot;: &quot;low&quot;
    }
  },
  &quot;project_id&quot;: &quot;BD3A1E52-1C62-44CB-BF04-660BD88CD74D&quot;,
  &quot;user_id&quot;: &quot;BD3A1E52-1C62-44CB-BF04-660BD88CD74D&quot;
}

HTTP/1.1 201 Created
Location: http://localhost/v1/resource/generic/ab68da77-fa82-4e67-aba9-270c5a98cbcb
ETag: &quot;9f64c8890989565514eb50c5517ff01816d12ff6&quot;
Last-Modified: Fri, 17 Apr 2015 14:39:22 GMT
Content-Type: application/json; charset=UTF-8

{
  &quot;created_by_project_id&quot;: &quot;cfa2ebb5-bbf9-448f-8b65-2087fbecf6ad&quot;,
  &quot;created_by_user_id&quot;: &quot;6aadfc0a-da22-4e69-b614-4e1699d9e8eb&quot;,
  &quot;ended_at&quot;: null,
  &quot;id&quot;: &quot;ab68da77-fa82-4e67-aba9-270c5a98cbcb&quot;,
  &quot;metrics&quot;: {
    &quot;temperature&quot;: &quot;ad53cf29-6d23-48c5-87c1-f3bf5e8bb4a0&quot;
  },
  &quot;project_id&quot;: &quot;bd3a1e52-1c62-44cb-bf04-660bd88cd74d&quot;,
  &quot;revision_end&quot;: null,
  &quot;revision_start&quot;: &quot;2015-04-17T14:39:22.181615Z&quot;,
  &quot;started_at&quot;: &quot;2015-04-17T14:39:22.181601Z&quot;,
  &quot;type&quot;: &quot;generic&quot;,
  &quot;user_id&quot;: &quot;bd3a1e52-1c62-44cb-bf04-660bd88cd74d&quot;
}
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Haystack, needle? Find!&lt;/h2&gt;
&lt;p&gt;With such a system, it becomes very easy to index all your resources, meter them and retrieve this data. What&apos;s even more interesting is to query the system to find and list the resources you are interested in!&lt;/p&gt;
&lt;p&gt;You can search for a resource based on any field, for example:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;POST /v1/search/resource/instance HTTP/1.1
Content-Type: application/json

{
  &quot;=&quot;: {
    &quot;user_id&quot;: &quot;bd3a1e52-1c62-44cb-bf04-660bd88cd74d&quot;
  }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;That query will return a list of all resources owned by the &lt;code&gt;user_id&lt;/code&gt; &lt;code&gt;bd3a1e52-1c62-44cb-bf04-660bd88cd74d&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;You can do fancier queries such as retrieving all the instances started by a user this month:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;POST /v1/search/resource/instance HTTP/1.1
Content-Type: application/json
Content-Length: 113

{
  &quot;and&quot;: [
    {
      &quot;=&quot;: {
        &quot;user_id&quot;: &quot;bd3a1e52-1c62-44cb-bf04-660bd88cd74d&quot;
      }
    },
    {
      &quot;&amp;gt;=&quot;: {
        &quot;started_at&quot;: &quot;2015-04-01&quot;
      }
    }
  ]
}
&lt;/code&gt;&lt;/pre&gt;
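&lt;p&gt;If you build such filters programmatically, a tiny sketch like the following (hypothetical helpers, not a Gnocchi client library) keeps the nesting readable; the operators are the ones the search API accepts:&lt;/p&gt;

```python
# Hypothetical helpers, not a Gnocchi client library: compose the JSON
# filter shown above from small building blocks.
import json

def and_(*criteria):
    return {"and": list(criteria)}

def eq(field, value):
    return {"=": {field: value}}

def gte(field, value):
    return {">=": {field: value}}

query = and_(eq("user_id", "bd3a1e52-1c62-44cb-bf04-660bd88cd74d"),
             gte("started_at", "2015-04-01"))
payload = json.dumps(query)  # body for POST /v1/search/resource/instance
```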
&lt;p&gt;And you can even do fancier queries than the fancier ones (still following?). What if we wanted to retrieve all the instances that were on host &lt;code&gt;foobar&lt;/code&gt; on April 15th and that already had an hour of uptime? Let&apos;s ask Gnocchi to look in the history!&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;POST /v1/search/resource/instance?history=true HTTP/1.1
Content-Type: application/json
Content-Length: 113

{
  &quot;and&quot;: [
    {
      &quot;=&quot;: {
        &quot;host&quot;: &quot;foobar&quot;
      }
    },
    {
      &quot;&amp;gt;=&quot;: {
        &quot;lifespan&quot;: &quot;1 hour&quot;
      }
    },
    {
      &quot;&amp;lt;=&quot;: {
        &quot;revision_start&quot;: &quot;2015-04-15&quot;
      }
    }
  ]
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I could also mention the fact that you can &lt;a href=&quot;http://docs.openstack.org/developer/gnocchi/rest.html#searching-for-values-in-metrics&quot;&gt;search for values in metrics&lt;/a&gt;.&lt;br /&gt;
One feature that I will very likely include in Gnocchi 1.1 is the ability to search for resources whose metrics match some value: for example, being able to search for instances whose CPU consumption was over 80% during a month.&lt;/p&gt;
&lt;h2&gt;Cherries on the cake&lt;/h2&gt;
&lt;p&gt;While Gnocchi is well integrated with and based on common OpenStack technology, please do note that it is completely able to function without any other OpenStack component and is pretty straightforward to deploy.&lt;/p&gt;
&lt;p&gt;Gnocchi also implements a full RBAC system based on the &lt;a href=&quot;http://docs.openstack.org/developer/oslo.policy/&quot;&gt;OpenStack standard oslo.policy&lt;/a&gt;, which allows pretty fine-grained control of permissions.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/gnocchi-resource-html.png&quot; alt=&quot;gnocchi-resource-html&quot; /&gt;&lt;/p&gt;
&lt;p&gt;There is also some work ongoing to have HTML rendering when browsing the API using a Web browser. While still simple, we&apos;d like to have a minimal Web interface served on top of the API for the same price!&lt;/p&gt;
&lt;p&gt;The Ceilometer alarm subsystem supports Gnocchi as of the Kilo release, meaning you can use it to trigger actions when a metric value crosses some threshold. And OpenStack &lt;a href=&quot;http://launchpad.net/heat&quot;&gt;Heat&lt;/a&gt; also supports auto-scaling your instances based on Ceilometer+Gnocchi alarms.&lt;/p&gt;
&lt;p&gt;And there are a few more API calls that I didn&apos;t talk about here, so don&apos;t hesitate to take a peek at the &lt;a href=&quot;http://gnocchi.xyz&quot;&gt;full documentation&lt;/a&gt;!&lt;/p&gt;
&lt;h2&gt;Towards Gnocchi 1.1!&lt;/h2&gt;
&lt;p&gt;Gnocchi is a different beast in the OpenStack community. It is under the umbrella of the Ceilometer program, but it&apos;s one of the first projects that is not part of the (old) integrated release. Therefore we decided to have a release schedule not directly tied to OpenStack&apos;s, and we&apos;ll release more often than the rest of the old OpenStack components, probably once every 2 months or so.&lt;/p&gt;
&lt;p&gt;What&apos;s coming next is a close integration with Ceilometer (e.g. moving the dispatcher code from Gnocchi to Ceilometer) and probably more features as we have more requests from our users. We are also exploring different backends such as InfluxDB (storage) or MongoDB (indexer).&lt;/p&gt;
&lt;p&gt;Stay tuned, and happy hacking!&lt;/p&gt;
</content:encoded><category>gnocchi</category><category>openstack</category></item><item><title>Distributed group management and locking in Python with tooz</title><link>https://julien.danjou.info/blog/python-distributed-membership-lock-with-tooz/</link><guid isPermaLink="true">https://julien.danjou.info/blog/python-distributed-membership-lock-with-tooz/</guid><description>With OpenStack embracing the Tooz library more and more over the past year, I think it&apos;s a good start to write a bit about it.</description><pubDate>Fri, 21 Nov 2014 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;With &lt;a href=&quot;http://openstack.org&quot;&gt;OpenStack&lt;/a&gt; embracing the &lt;a href=&quot;http://launchpad.net/python-tooz&quot;&gt;Tooz&lt;/a&gt; library more and more over the past year, I think it&apos;s a good start to write a bit about it.&lt;/p&gt;
&lt;h2&gt;A bit of history&lt;/h2&gt;
&lt;p&gt;A little more than a year ago, with my colleague Yassine Lamgarchal and others at &lt;a href=&quot;http://enovance.com&quot;&gt;eNovance&lt;/a&gt;, we investigated how to solve a problem often encountered inside OpenStack: synchronization of multiple distributed workers. And while many people in our ecosystem continue to drive development by adding new bells and whistles, we made a point of solving new problems with a generic solution able to address the technical debt at the same time.&lt;/p&gt;
&lt;p&gt;Yassine wrote the first ideas of what should be the &lt;a href=&quot;https://wiki.openstack.org/wiki/Oslo/blueprints/service-sync&quot;&gt;group membership service&lt;/a&gt; that was needed for OpenStack, identifying several projects that could make use of this. I&apos;ve presented this concept during the &lt;a href=&quot;https://www.openstack.org/summit/openstack-summit-hong-kong-2013/&quot;&gt;OpenStack Summit in Hong-Kong&lt;/a&gt; during an Oslo session. It turned out that the idea was well-received, and the week following the summit we started the &lt;a href=&quot;http://launchpad.net/python-tooz&quot;&gt;tooz&lt;/a&gt; project on &lt;a href=&quot;http://ci.openstack.org/stackforge.html&quot;&gt;StackForge&lt;/a&gt;.&lt;/p&gt;
&lt;h2&gt;Goals&lt;/h2&gt;
&lt;p&gt;Tooz is a Python library that provides a coordination API. Its primary goal is to handle groups and membership of these groups in distributed systems.&lt;/p&gt;
&lt;p&gt;Tooz also provides another useful feature which is distributed locking. This allows distributed nodes to acquire and release locks in order to synchronize themselves (for example to access a shared resource).&lt;/p&gt;
&lt;h2&gt;The architecture&lt;/h2&gt;
&lt;p&gt;If you are familiar with distributed systems, you might be thinking that there are a lot of solutions already available to solve these issues: &lt;a href=&quot;http://zookeeper.apache.org/&quot;&gt;ZooKeeper&lt;/a&gt;, the &lt;a href=&quot;http://raftconsensus.github.io/&quot;&gt;Raft consensus algorithm&lt;/a&gt; or even &lt;a href=&quot;http://redis.io/&quot;&gt;Redis&lt;/a&gt; for example.&lt;/p&gt;
&lt;p&gt;You&apos;ll be thrilled to learn that Tooz is not the result of the &lt;a href=&quot;http://en.wikipedia.org/wiki/Not_invented_here&quot;&gt;NIH&lt;/a&gt; syndrome, but is an abstraction layer on top of all these solutions. It uses drivers to provide the actual functionality behind the scenes, and does not try to do anything fancy.&lt;/p&gt;
&lt;p&gt;Not all drivers have the same amount of functionality or robustness, but depending on your environment, any available driver might suffice. As in most of OpenStack, we let the deployers/operators/developers choose whichever backend they want to use, informing them of the potential trade-offs they are making.&lt;/p&gt;
&lt;p&gt;So far, Tooz provides drivers based on:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://pypi.python.org/pypi/kazoo&quot;&gt;Kazoo&lt;/a&gt; (ZooKeeper)&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://pypi.python.org/pypi/zake&quot;&gt;Zake&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;http://memcached.org&quot;&gt;memcached&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;http://redis.io&quot;&gt;redis&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;http://www.tldp.org/LDP/lpg/node21.html&quot;&gt;SysV IPC&lt;/a&gt; (only for distributed locks for now)&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;http://postgresql.org&quot;&gt;PostgreSQL&lt;/a&gt; (only for distributed locks for now)&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;http://mysql.org&quot;&gt;MySQL&lt;/a&gt; (only for distributed locks for now)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;All drivers are distributed across processes. Some can be distributed across the network (ZooKeeper, memcached, redis…) and some are only available on the same host (IPC).&lt;/p&gt;
&lt;p&gt;Also note that the Tooz API is completely asynchronous, allowing it to be more efficient, and potentially included in an event loop.&lt;/p&gt;
&lt;h2&gt;Features&lt;/h2&gt;
&lt;h3&gt;Group membership&lt;/h3&gt;
&lt;p&gt;Tooz provides an API to manage group membership. The basic operations provided are: the creation of a group, the ability to join it, leave it and list its members. It&apos;s also possible to be notified as soon as a member joins or leaves a group.&lt;/p&gt;
&lt;h3&gt;Leader election&lt;/h3&gt;
&lt;p&gt;Each group can have a leader elected. Each member can decide if it wants to run for the election. If the leader disappears, another one is elected from the list of current candidates. It&apos;s possible to be notified of the election result and to retrieve the leader of a group at any moment.&lt;/p&gt;
&lt;h3&gt;Distributed locking&lt;/h3&gt;
&lt;p&gt;When trying to synchronize several workers in a distributed environment, you may need a way to lock access to some resources. That&apos;s what a distributed lock can help you with.&lt;/p&gt;
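&lt;p&gt;As a toy, single-process illustration of the pattern (this is &lt;em&gt;not&lt;/em&gt; Tooz code; Tooz&apos;s real drivers provide the same shape across processes and hosts), workers ask a coordinator for a named lock before touching a shared resource:&lt;/p&gt;

```python
# Toy, single-process sketch of the coordination pattern, not Tooz itself:
# a coordinator hands out named locks that workers hold while they touch
# a shared resource, so no update is lost.
import threading

class ToyCoordinator:
    def __init__(self):
        self._locks = {}
        self._guard = threading.Lock()

    def get_lock(self, name):
        # Return the one lock object associated with this name.
        with self._guard:
            return self._locks.setdefault(name, threading.Lock())

coord = ToyCoordinator()
counter = [0]

def worker():
    for _ in range(1000):
        with coord.get_lock("shared-resource"):
            counter[0] += 1

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter[0])  # 4000: every increment happened under the lock
```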
&lt;h2&gt;Adoption in OpenStack&lt;/h2&gt;
&lt;p&gt;&lt;a href=&quot;http://launchpad.net/ceilometer&quot;&gt;Ceilometer&lt;/a&gt; is the first project in OpenStack to use Tooz. It has replaced part of the old alarm distribution system, where RPC was used to detect active alarm evaluator workers. The group membership feature of Tooz was leveraged by Ceilometer to coordinate between alarm evaluator workers.&lt;/p&gt;
&lt;p&gt;Another new feature, part of the Juno release of Ceilometer, is the distribution of the central agent&apos;s polling tasks among multiple workers. There&apos;s again a group membership issue in knowing which nodes are online and available to receive polling tasks, so Tooz is used here as well.&lt;/p&gt;
&lt;p&gt;The &lt;a href=&quot;http://wiki.openstack.org/Oslo&quot;&gt;Oslo&lt;/a&gt; team &lt;a href=&quot;https://review.openstack.org/#/c/122439/&quot;&gt;has accepted the adoption of Tooz&lt;/a&gt; during this release cycle. That means that it will be maintained by more developers, and will be part of the OpenStack release process.&lt;/p&gt;
&lt;p&gt;This opens the door to pushing Tooz further into OpenStack. Our next candidate would be to write a service group driver for &lt;a href=&quot;http://launchpad.net/nova&quot;&gt;Nova&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The &lt;a href=&quot;http://tooz.rtfd.org/&quot;&gt;complete documentation for Tooz is available online&lt;/a&gt; and has examples for the various features described here, go read it if you&apos;re curious and adventurous!&lt;/p&gt;
</content:encoded><category>python</category><category>openstack</category></item><item><title>Tracking OpenStack contributions in GitHub</title><link>https://julien.danjou.info/blog/tracking-openstack-contributions-in-github/</link><guid isPermaLink="true">https://julien.danjou.info/blog/tracking-openstack-contributions-in-github/</guid><description>I&apos;ve switched my Git repositories to GitHub recently, and started to watch my contributions statistics, which were very low considering I spend my days hacking on open source software, especially.</description><pubDate>Tue, 19 Aug 2014 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;I&apos;ve switched my Git repositories to &lt;a href=&quot;https://github.com&quot;&gt;GitHub&lt;/a&gt; recently, and started to watch my contributions statistics, which were very low considering I spend my days hacking on open source software, especially &lt;a href=&quot;https://openstack.org&quot;&gt;OpenStack&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/octocat-on-openstack-2.png&quot; alt=&quot;octocat-on-openstack-2&quot; /&gt;&lt;/p&gt;
&lt;p&gt;OpenStack hosts its Git repositories on its own infrastructure at &lt;a href=&quot;http://git.openstack.org&quot;&gt;git.openstack.org&lt;/a&gt;, but also mirrors them on GitHub. Logically, I was expecting GitHub to track my commits there too, as I&apos;m using the same email address everywhere.&lt;/p&gt;
&lt;p&gt;It turns out that it was not the case, and the &lt;a href=&quot;https://help.github.com/articles/why-are-my-contributions-not-showing-up-on-my-profile&quot;&gt;help page about that&lt;/a&gt; on GitHub describes the rule in place to compute statistics. Indeed, according to GitHub, I had no relations to the OpenStack repositories, as I never forked them nor opened a pull request on them (OpenStack uses &lt;a href=&quot;http://review.openstack.org&quot;&gt;Gerrit&lt;/a&gt;).&lt;/p&gt;
&lt;p&gt;Starring a repository is enough to build a relationship between a user and a repository, so this was the only thing needed to inform GitHub that I have contributed to those repositories. Considering OpenStack has hundreds of repositories, I decided to star them all using a small Python script based on &lt;a href=&quot;https://pypi.python.org/pypi/pygithub&quot;&gt;pygithub&lt;/a&gt;.&lt;/p&gt;
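&lt;p&gt;For the curious, here is a rough sketch of what such a script boils down to, using GitHub&apos;s raw REST endpoint for starring (PUT /user/starred/{owner}/{repo}) rather than pygithub; the token handling and the actual HTTP call are omitted, only the request construction is shown:&lt;/p&gt;

```python
# Sketch only: builds the request that stars one repository via GitHub's
# REST API (PUT /user/starred/{owner}/{repo}). Sending it would require
# an authenticated HTTP client, which is omitted here.
def star_request(owner, repo):
    return ("PUT", "https://api.github.com/user/starred/%s/%s" % (owner, repo))

# One such request per OpenStack repository to star:
method, url = star_request("openstack", "nova")
```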
&lt;p&gt;And voilà, &lt;a href=&quot;https://github.com/jd&quot;&gt;my statistics&lt;/a&gt; are now including all my contributions to OpenStack!&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/github-openstack-stats.png&quot; alt=&quot;github-openstack-stats&quot; /&gt;&lt;/p&gt;
</content:encoded><category>openstack</category><category>github</category></item><item><title>OpenStack Ceilometer and the Gnocchi experiment</title><link>https://julien.danjou.info/blog/openstack-ceilometer-the-gnocchi-experiment/</link><guid isPermaLink="true">https://julien.danjou.info/blog/openstack-ceilometer-the-gnocchi-experiment/</guid><description>A little more than 2 years ago, the Ceilometer project was launched inside the OpenStack ecosystem. Its main objective was to measure OpenStack cloud platforms in order to provide data and mechanisms.</description><pubDate>Mon, 18 Aug 2014 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;A little more than 2 years ago, the &lt;a href=&quot;http://launchpad.net/ceilometer&quot;&gt;Ceilometer&lt;/a&gt; project was launched inside the OpenStack ecosystem. Its main objective was to measure OpenStack cloud platforms in order to provide data and mechanisms for functionalities such as billing, alarming or capacity planning.&lt;/p&gt;
&lt;p&gt;In this article, I would like to relate what I&apos;ve been doing with other Ceilometer developers over the last 5 months. I&apos;ve lowered my direct involvement in Ceilometer itself to concentrate on solving one of its biggest issues at the source, and I think it&apos;s high time to take a break and talk about it.&lt;/p&gt;
&lt;h2&gt;Ceilometer early design&lt;/h2&gt;
&lt;p&gt;Over the last years, Ceilometer didn&apos;t change its core architecture. Without diving too much into all its parts, one of the early design decisions was to build the metering around a data structure we called &lt;strong&gt;samples&lt;/strong&gt;. A sample is generated each time Ceilometer measures something. It is composed of a few fields, such as the id of the metered resource, the user and project id owning that resource, the meter name, the measured value, a timestamp and some free-form metadata. Each time Ceilometer measures something, one of its components (an agent, a pollster…) constructs and emits a sample headed for the storage component that we call the &lt;strong&gt;collector&lt;/strong&gt;.&lt;/p&gt;
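&lt;p&gt;As a rough sketch (the field names here approximate Ceilometer&apos;s and are simplified for illustration), a sample can be pictured as a small record like this:&lt;/p&gt;

```python
# Rough sketch of the "sample" data structure described above; field
# names approximate Ceilometer's and are simplified for illustration.
def make_sample(resource_id, user_id, project_id, name, value,
                timestamp, metadata=None):
    return {
        "resource_id": resource_id,
        "user_id": user_id,
        "project_id": project_id,
        "counter_name": name,       # the meter name, e.g. cpu_util
        "counter_volume": value,    # the measured value
        "timestamp": timestamp,
        "resource_metadata": metadata or {},  # free-form, stored as-is
    }

s = make_sample("inst-1", "user-1", "proj-1", "cpu_util", 42.0,
                "2014-08-18T12:00:00")
```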
&lt;p&gt;This collector is responsible for storing the samples into a database. The Ceilometer collector uses a pluggable storage system, meaning that you can pick any database system you prefer. Our original implementation has been based on MongoDB from the beginning, but we then added a SQL driver, and people contributed things such as HBase or DB2 support.&lt;/p&gt;
&lt;p&gt;The REST API exposed by Ceilometer allows executing various read requests on this data store. It can return the list of resources that have been measured for a particular project, or compute statistics on metrics. Allowing such a wide range of possibilities with such a flexible data structure lets you do a lot of different things with Ceilometer, as you can query the data in almost any way you want.&lt;/p&gt;
&lt;h2&gt;The scalability issue&lt;/h2&gt;
&lt;p&gt;We soon started to encounter scalability issues in many of the read requests made via the REST API. A lot of the requests require the data storage to do full scans of all the stored samples. Indeed, the fact that the API allows you to filter on any field and also on the free-form metadata (meaning non-indexed key/value tuples) has a terrible cost in terms of performance (as pointed out before, the metadata are attached to each &lt;em&gt;sample&lt;/em&gt; generated by Ceilometer and stored as-is). That basically means that the &lt;em&gt;sample&lt;/em&gt; data structure is stored in most drivers in just one table or collection, in order to be able to scan them all at once, and there&apos;s no good &quot;perfect&quot; sharding solution, making data storage scalability painful.&lt;/p&gt;
&lt;p&gt;It turns out that the Ceilometer REST API is unable to handle most of the requests in a timely manner, as most operations are &lt;em&gt;O(n)&lt;/em&gt; where &lt;em&gt;n&lt;/em&gt; is the number of samples recorded (see &lt;a href=&quot;http://en.wikipedia.org/wiki/Big_O_notation&quot;&gt;big O notation&lt;/a&gt; if you&apos;re unfamiliar with it). That number of samples can grow very rapidly in an environment of thousands of metered nodes with a data retention of several weeks. Fortunately, there are a few optimizations that make things smoother in the general case, but as soon as you run specific queries, the API becomes barely usable.&lt;/p&gt;
&lt;p&gt;During this last year, as the Ceilometer PTL, I experienced these issues first hand, since a lot of people gave me this kind of feedback. We engaged several blueprints to improve the situation, but it was soon clear to me that this was not going to be enough anyway.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/unacceptable.jpg&quot; alt=&quot;unacceptable&quot; /&gt;&lt;/p&gt;
&lt;h2&gt;Thinking outside the box&lt;/h2&gt;
&lt;p&gt;Unfortunately, the PTL job didn&apos;t leave me enough time to work on the actual code nor to play with anything new. I was coping with most of the project bureaucracy and wasn&apos;t able to work on any good solution to tackle the issue at its root. Still, I had a few ideas I wanted to try, and as soon as I stepped down from the PTL role, I stopped working on Ceilometer itself to try something new and think a bit outside the box.&lt;/p&gt;
&lt;p&gt;When one takes a look at what has been brought into Ceilometer recently, one can see that Ceilometer actually needs to handle 2 types of data: events and metrics.&lt;/p&gt;
&lt;p&gt;Events are data generated when something happens: an instance starts, a volume is attached, or an HTTP request is sent to a REST API server. These are events that Ceilometer needs to collect and store. Most OpenStack components are able to send such events using the notification system built into &lt;em&gt;&lt;a href=&quot;https://wiki.openstack.org/wiki/Oslo/Messaging&quot;&gt;oslo.messaging&lt;/a&gt;&lt;/em&gt;.&lt;/p&gt;
&lt;p&gt;Metrics are what Ceilometer needs to store that is not necessarily tied to an event. Think about an instance&apos;s CPU usage, a router&apos;s network bandwidth usage, the number of images Glance is storing for you, etc… These are not events, since nothing is happening. These are facts, states we need to meter.&lt;/p&gt;
&lt;p&gt;Computing statistics for billing or capacity planning requires both of these data sources, but they should be distinct. Based on that assumption, and the fact that Ceilometer was getting support for storing events, I started to focus on getting the metric part right.&lt;/p&gt;
&lt;p&gt;I had been a system administrator for a decade before jumping into OpenStack development, so I know a thing or two on how monitoring is done in this area, and what kind of technology operators rely on. I also know that there&apos;s still no silver bullet – this made it a good challenge.&lt;/p&gt;
&lt;p&gt;The first thing that came to my mind was to use some kind of time-series database, and export its access via a REST API – as we do in all OpenStack services. This should cover the metric storage pretty well.&lt;/p&gt;
&lt;h2&gt;Cooking Gnocchi&lt;/h2&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/gnocchi-logo-old-2.jpg&quot; alt=&quot;gnocchi-logo-old-2&quot; /&gt;&lt;/p&gt;
&lt;p&gt;At the end of April 2014, this led me to start a new project code-named Gnocchi. For the record, the name was picked after I confused the OpenStack Marconi project with &quot;OpenStack Macaroni&quot; so many times. At least one OpenStack project should have a &quot;pasta&quot; name, right?&lt;/p&gt;
&lt;p&gt;The point of starting a new project rather than sending patches to Ceilometer was that, first, I had no clue whether it was going to result in anything better, and second, I wanted to be able to iterate more rapidly without being strongly coupled to the release process.&lt;/p&gt;
&lt;p&gt;The first prototype started around the following idea: what you want is to meter things. That means storing a list of (timestamp, value) tuples for them. I&apos;ve named these things &quot;entities&quot;, as no assumptions are made about what they are. An entity can represent the temperature in a room or the CPU usage of an instance. The service shouldn&apos;t care and should be agnostic in this regard.&lt;/p&gt;
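&lt;p&gt;As a toy sketch of that idea (the names below are made up for illustration, this is not Gnocchi code), an entity boils down to an append-only list of timestamped values, regardless of what is being measured:&lt;/p&gt;

```python
import datetime

# Toy illustration (not Gnocchi's actual data model): an "entity" is just a
# named list of (timestamp, value) tuples. A room temperature and an
# instance's CPU usage are handled exactly the same way.
entities = {}

def record(entity, timestamp, value):
    entities.setdefault(entity, []).append((timestamp, value))

record("room-42/temperature", datetime.datetime(2014, 4, 25, 12, 0), 21.5)
record("instance-a/cpu-util", datetime.datetime(2014, 4, 25, 12, 0), 63.0)
```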
&lt;p&gt;One feature that we discussed over several OpenStack summits in the Ceilometer sessions was the idea of doing aggregation: aggregating samples over a period of time so as to only store a smaller number of them. This is something that time-series tools such as &lt;a href=&quot;http://oss.oetiker.ch/rrdtool/&quot;&gt;RRDtool&lt;/a&gt; have been doing on the fly for a long time, and I decided it was a good trail to follow.&lt;/p&gt;
&lt;p&gt;I assumed that this was going to be a requirement when storing metrics into Gnocchi. The user would need to provide what kind of archiving it would need: 1 second precision over a day, 1 hour precision over a year, or even both.&lt;/p&gt;
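&lt;p&gt;To get a feel for what such archiving definitions imply storage-wise, a back-of-the-envelope sketch (a hypothetical helper, not Gnocchi&apos;s API):&lt;/p&gt;

```python
# Hypothetical helper, not Gnocchi's API: the number of points an archive
# definition keeps is simply its timespan divided by its granularity.
def points_needed(granularity_seconds, timespan_seconds):
    return timespan_seconds // granularity_seconds

DAY = 24 * 3600
YEAR = 365 * DAY

print(points_needed(1, DAY))      # 1-second precision over a day: 86400 points
print(points_needed(3600, YEAR))  # 1-hour precision over a year: 8760 points
```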
&lt;p&gt;The first driver written to achieve that and store those metrics inside Gnocchi was based on &lt;a href=&quot;http://graphite.wikidot.com/whisper&quot;&gt;whisper&lt;/a&gt;. Whisper is the file format used to store metrics for the &lt;a href=&quot;http://graphite.wikidot.com/&quot;&gt;Graphite&lt;/a&gt; project. For the actual storage, the driver uses Swift, which has the advantage of being part of OpenStack and scalable.&lt;/p&gt;
&lt;p&gt;Storing the metrics of each entity in a different &lt;em&gt;whisper&lt;/em&gt; file and putting them in Swift turned out to have a fantastic algorithmic complexity: &lt;em&gt;O(1)&lt;/em&gt;. Indeed, the complexity needed to store and retrieve metrics depends neither on the number of metrics you have nor on the number of things you are metering. That is already a huge win compared to the current Ceilometer collector design.&lt;/p&gt;
&lt;p&gt;However, it turned out that &lt;em&gt;whisper&lt;/em&gt; has a few limitations that I was unable to circumvent in any manner. I needed to patch it to remove a lot of its assumptions about manipulating files, or that everything is relative to now (&lt;code&gt;time.time()&lt;/code&gt;). I started to hack on that in my own fork, but… then everything broke. The &lt;em&gt;whisper&lt;/em&gt; project code base is, well, not the state of the art, and has zero unit tests. I was staring at a huge effort to transform &lt;em&gt;whisper&lt;/em&gt; into the time-series format I wanted, without being sure I wasn&apos;t going to break everything (remember, no test coverage).&lt;/p&gt;
&lt;p&gt;I decided to take a break and look into alternatives, and stumbled upon &lt;a href=&quot;http://pandas.pydata.org/&quot;&gt;Pandas&lt;/a&gt;, a data manipulation and statistics library for Python. It turns out that Pandas supports time series natively, and that it could do a lot of the smart computation needed in Gnocchi. I built a new file format leveraging Pandas for computing the time series and named it &lt;strong&gt;carbonara&lt;/strong&gt; (a wink to both the &lt;a href=&quot;https://github.com/graphite-project/carbon&quot;&gt;Carbon&lt;/a&gt; project and pasta, how clever!). The code is quite small (a third of &lt;em&gt;whisper&lt;/em&gt;&apos;s, 200 SLOC vs 600 SLOC), does not have many of the &lt;em&gt;whisper&lt;/em&gt; limitations and… it has test coverage. These Carbonara files are then, in the same fashion, stored into Swift containers.&lt;/p&gt;
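&lt;p&gt;To illustrate the kind of heavy lifting Pandas takes care of (a standalone sketch, not Carbonara&apos;s actual code), downsampling a series of points to a coarser granularity is a one-liner:&lt;/p&gt;

```python
import pandas as pd

# Ten one-minute measures...
series = pd.Series(range(10),
                   index=pd.date_range("2014-07-01", periods=10, freq="min"))

# ...aggregated down to 5-minute means, the kind of on-the-fly rollup
# that RRDtool-style formats perform.
aggregated = series.resample("5min").mean()
print(list(aggregated))  # [2.0, 7.0]
```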
&lt;p&gt;Anyway, the Gnocchi storage driver system is designed in the same spirit as the rest of the OpenStack and Ceilometer storage driver systems. It&apos;s a plug-in system with an API, so anyone can write their own driver. Eoghan Glynn has already started to write an &lt;a href=&quot;http://influxdb.com/&quot;&gt;InfluxDB&lt;/a&gt; driver, working closely with the upstream developers of that database. Dina Belova started to write an &lt;a href=&quot;http://opentsdb.net/&quot;&gt;OpenTSDB&lt;/a&gt; driver. This helps make sure the API is designed the right way from the start.&lt;/p&gt;
&lt;h2&gt;Handling resources&lt;/h2&gt;
&lt;p&gt;Measuring individual entities is great and needed, but you also need to link them to resources. When measuring the temperature and the number of people in a room, it is useful to link these two separate entities to a resource – in this case, the room – and to give a name to each relation, so one is able to identify what attribute of the resource is actually being measured. It is also important to be able to store attributes on these resources, such as their owners, the time they started and ended their existence, etc.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/gnocchi-relationship.png&quot; alt=&quot;gnocchi-relationship&quot; /&gt;&lt;/p&gt;
&lt;p&gt;Once this list of resources is collected, the next step is to list and filter them, based on any criteria. One might want to retrieve the list of resources created last week, or the list of instances hosted on a particular node right now.&lt;/p&gt;
&lt;p&gt;Resources also need to be specialized. Some resources have attributes that must be stored in order for filtering to be useful. Think about an instance name or a router network.&lt;/p&gt;
&lt;p&gt;All of these requirements led to the design of what&apos;s called the &lt;em&gt;indexer&lt;/em&gt;. The indexer is responsible for indexing entities and resources, and linking them together. The initial implementation is based on &lt;a href=&quot;http://sqlalchemy.org&quot;&gt;SQLAlchemy&lt;/a&gt; and should be pretty efficient. It&apos;s easy enough to index the most requested attributes (columns), and they are also correctly typed.&lt;/p&gt;
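&lt;p&gt;A heavily simplified sketch of what such an indexer schema could look like with SQLAlchemy (hypothetical models for illustration, not the actual Gnocchi code):&lt;/p&gt;

```python
from sqlalchemy import Column, DateTime, ForeignKey, String, create_engine
from sqlalchemy.orm import Session, declarative_base, relationship

Base = declarative_base()

# Hypothetical, simplified models -- not the real Gnocchi indexer schema.
class Resource(Base):
    __tablename__ = "resource"
    id = Column(String(64), primary_key=True)
    user_id = Column(String(64))        # typed, indexable attribute
    started_at = Column(DateTime)
    entities = relationship("Entity", back_populates="resource")

class Entity(Base):
    __tablename__ = "entity"
    id = Column(String(64), primary_key=True)
    # The name of the relation, e.g. "temperature" for a room resource.
    name = Column(String(64))
    resource_id = Column(String(64), ForeignKey("resource.id"))
    resource = relationship("Resource", back_populates="entities")

engine = create_engine("sqlite://")  # in-memory database for the sketch
Base.metadata.create_all(engine)

with Session(engine) as session:
    room = Resource(id="room-42")
    room.entities.append(Entity(id="e1", name="temperature"))
    session.add(room)
    session.commit()
```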
&lt;p&gt;We plan to establish a model for all known OpenStack resources (instances, volumes, networks, …) to store and index them into the Gnocchi indexer in order to request them in an efficient way from one place. The generic resource class can be used to handle generic resources that are not tied to OpenStack. It&apos;d be up to the users to store extra attributes.&lt;/p&gt;
&lt;p&gt;Dropping the free form metadata we used to have in Ceilometer makes sure that querying the indexer is going to be efficient and scalable.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/gnocchi-classes.png&quot; alt=&quot;gnocchi-classes&quot; /&gt;&lt;/p&gt;
&lt;h2&gt;REST API&lt;/h2&gt;
&lt;p&gt;All of this is exported via a REST API that was partially designed and documented in the &lt;a href=&quot;http://git.openstack.org/cgit/openstack/ceilometer-specs/tree/specs/juno/gnocchi.rst&quot;&gt;Gnocchi specification in the Ceilometer repository&lt;/a&gt;; though the spec is not up-to-date yet. We plan to auto-generate the documentation from the code as we are currently doing in Ceilometer.&lt;/p&gt;
&lt;p&gt;The REST API is pretty easy to use, and you can use it to manipulate entities and resources, and request the information back.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/gnocchi-architecture.png&quot; alt=&quot;gnocchi-architecture&quot; /&gt;&lt;/p&gt;
&lt;h2&gt;Roadmap &amp;amp; Ceilometer integration&lt;/h2&gt;
&lt;p&gt;This entire plan was presented to and discussed with the Ceilometer team&lt;br /&gt;
during the last &lt;a href=&quot;https://www.openstack.org/summit/openstack-summit-atlanta-2014/&quot;&gt;OpenStack summit in Atlanta&lt;/a&gt; in May 2014, for the Juno release. I led a session about this whole concept, and convinced the team that using Gnocchi for our metric storage would be a good approach to solving the Ceilometer collector scalability issue.&lt;/p&gt;
&lt;p&gt;It was decided to conduct this project experiment in parallel with the current Ceilometer collector for the time being, and see where it leads the project.&lt;/p&gt;
&lt;h2&gt;Early benchmarks&lt;/h2&gt;
&lt;p&gt;Some engineers from Mirantis did a few benchmarks around Ceilometer and also against an early version of Gnocchi, and Dina Belova presented them to us during the mid-cycle sprint we organized in Paris in early July.&lt;/p&gt;
&lt;p&gt;The following graph sums up the current Ceilometer performance issue pretty well: the more metrics you feed it, the slower it becomes.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/image03.png&quot; alt=&quot;image03&quot; /&gt;&lt;/p&gt;
&lt;p&gt;For Gnocchi, while the numbers themselves are not fantastic, what is interesting is that all the graphs below show that performance is stable, with no correlation to the number of resources, entities or measures. This proves that, indeed, most of the code is built around a complexity of &lt;em&gt;O(1)&lt;/em&gt;, and not &lt;em&gt;O(n)&lt;/em&gt; anymore.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/image00.png&quot; alt=&quot;image00&quot; /&gt;&lt;br /&gt;
&lt;img src=&quot;https://julien.danjou.info/content/images/03/image01.png&quot; alt=&quot;image01&quot; /&gt;&lt;br /&gt;
&lt;img src=&quot;https://julien.danjou.info/content/images/03/image04.png&quot; alt=&quot;image04&quot; /&gt;&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/image05.png&quot; alt=&quot;image05&quot; /&gt;&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/image06.png&quot; alt=&quot;image06&quot; /&gt;&lt;/p&gt;
&lt;h2&gt;Next steps&lt;/h2&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/clement-drawing-gnocchi.jpg&quot; alt=&quot;clement-drawing-gnocchi&quot; /&gt;&lt;/p&gt;
&lt;p&gt;While the Juno cycle is being wrapped up for most projects, including Ceilometer, Gnocchi development is still ongoing. Fortunately, the composite architecture of Ceilometer allows a lot of its features to be replaced by other code dynamically. That, for example, enables Gnocchi to provide a Ceilometer dispatcher plugin for its collector, without having to ship the actual code in Ceilometer itself. That should keep the development of Gnocchi from being slowed down by the release process for now.&lt;/p&gt;
&lt;p&gt;The Ceilometer team aims to provide Gnocchi as a sort of technology preview with the Juno release, allowing it to be deployed alongside Ceilometer and plugged into it. We&apos;ll probably discuss how to integrate it into the project in a more permanent and robust manner during the &lt;a href=&quot;https://www.openstack.org/summit/openstack-paris-summit-2014/&quot;&gt;OpenStack Summit for Kilo&lt;/a&gt; that will take place next November in Paris.&lt;/p&gt;
</content:encoded><category>openstack</category><category>gnocchi</category></item><item><title>OpenStack Design Summit Juno, from a Ceilometer point of view</title><link>https://julien.danjou.info/blog/openstack-summit-juno-ceilometer/</link><guid isPermaLink="true">https://julien.danjou.info/blog/openstack-summit-juno-ceilometer/</guid><description>Last week was the OpenStack Design Summit in Atlanta, GA where we, developers, discussed and designed the new OpenStack release (Juno) coming up.</description><pubDate>Fri, 30 May 2014 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Last week was the &lt;a href=&quot;https://www.openstack.org/summit/openstack-summit-atlanta-2014/&quot;&gt;OpenStack Design Summit&lt;/a&gt; in Atlanta, GA where we, developers, discussed and designed the new OpenStack release (Juno) coming up. I was there mainly to discuss Ceilometer&apos;s upcoming developments.&lt;/p&gt;
&lt;p&gt;The summit has been great. It was my third OpenStack design summit, and the first one where I was not a PTL, which made it a largely more relaxed summit for me!&lt;/p&gt;
&lt;p&gt;On Monday, we started with a 2.5-hour meeting with Ceilometer core developers and contributors about the Gnocchi experimental project that I started a few weeks ago. It was a great and productive afternoon, and allowed me to introduce and cover this topic extensively, something that would not have been possible in the allocated session we had later in the week.&lt;/p&gt;
&lt;p&gt;Ceilometer had its design sessions running mainly during Wednesday. We took a lot of notes and comments during the sessions in our &lt;a href=&quot;https://wiki.openstack.org/wiki/Summit/Juno/Etherpads#Ceilometer&quot;&gt;Etherpad instances&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Here is a short summary of the sessions I&apos;ve attended.&lt;/p&gt;
&lt;h2&gt;Scaling the central agent&lt;/h2&gt;
&lt;p&gt;I was in charge of the first session, and introduced the work that has been done so far on scaling the central agent. Six months ago, during the Havana summit, I proposed to scale the central agent by distributing the tasks among several nodes, using a library to handle the group membership aspect of it. That led to the creation of the &lt;a href=&quot;https://pypi.python.org/pypi/tooz&quot;&gt;tooz&lt;/a&gt; library that we worked on at eNovance during the last 6 months.&lt;/p&gt;
&lt;p&gt;Now that we have this foundation available, Cyril Roelandt started to replace the Ceilometer alarming job repartition code with Taskflow and Tooz. Starting with the alarm evaluators is simpler, and will serve as a first proof of concept to then be reused by the central agent. We plan to get this merged for Juno.&lt;/p&gt;
&lt;p&gt;For the central agent, the same work needs to be done, but since it&apos;s a bit more complicated, it will be done after the alarming evaluators are converted.&lt;/p&gt;
&lt;h2&gt;Test strategy&lt;/h2&gt;
&lt;p&gt;The next session discussed the test strategy and how we could improve Ceilometer unit and functional testing. There is a lot to be done in this area, and it is going to be one of the main focuses of the team in the upcoming weeks. Having Tempest tests run was a goal for Havana, and even if we made a lot of progress, we&apos;re still not there yet.&lt;/p&gt;
&lt;h2&gt;Complex queries and per-user/project data collection&lt;/h2&gt;
&lt;p&gt;This session, led by Ildikó Váncsa, was about adding finer-grained configuration to the pipeline configuration to allow per-user and per-project data retrieval. This was not really controversial – how exactly to implement it is still to be discussed – and the idea was well received. The other part of the session was about extending the complex queries feature provided by the v2 API.&lt;/p&gt;
&lt;h2&gt;Rethinking Ceilometer as a Time-Series-as-a-Service&lt;/h2&gt;
&lt;p&gt;This was my main session, the reason we met on Monday for a few hours, and one of the most promising session – I hope – of the week.&lt;/p&gt;
&lt;p&gt;It appears that the way Ceilometer designed its API and storage backends a long time ago is now a problem for scaling the data storage. Also, the events API we introduced in the last release partially overlaps some of the functionality provided by the samples API, which causes us scaling trouble.&lt;/p&gt;
&lt;p&gt;Therefore, I&apos;ve started to rethink the Ceilometer API by rebuilding it as a time-series read/write service, leaving the audit part of our previous sample API to the event subsystem. After some research and experimentation, I designed a new project called &lt;a href=&quot;https://wiki.openstack.org/Gnocchi&quot;&gt;Gnocchi&lt;/a&gt;, which provides exactly that functionality in a hopefully scalable way.&lt;/p&gt;
&lt;p&gt;Gnocchi is split in two parts: a time-series API with its driver, and a resource-indexing API with its own driver. Having two distinct driver sets allows it to use different technologies to store each data type in the best storage engine possible. The canonical driver for time-series handling is based on &lt;a href=&quot;http://pandas.pydata.org/&quot;&gt;Pandas&lt;/a&gt; and &lt;a href=&quot;https://launchpad.net/swift&quot;&gt;Swift&lt;/a&gt;. The canonical resource indexer driver&lt;br /&gt;
is based on &lt;a href=&quot;http://sqlalchemy.org&quot;&gt;SQLAlchemy&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The idea and project was well received and looked pretty exciting to most people. Our hope is to design a version 3 of the Ceilometer API around Gnocchi at some point during the Juno cycle, and have it ready as some sort of preview for the final release.&lt;/p&gt;
&lt;h2&gt;Revisiting the Ceilometer data model&lt;/h2&gt;
&lt;p&gt;This session, led by Alexei Kornienko, echoed the previous one, as it also clearly tried to address the Ceilometer scalability issue, but in a different way.&lt;/p&gt;
&lt;p&gt;Anyway, the SQL driver limitations were discussed, and Mehdi Abaakouk implemented some of the suggestions during the week, so we should very soon see better performance in Ceilometer with the current default storage driver.&lt;/p&gt;
&lt;h2&gt;Ceilometer devops session&lt;/h2&gt;
&lt;p&gt;We organized this session to get feedback from the devops community about deploying Ceilometer. It was very interesting; the list of things we could improve is long, and I think it will help us drive our future efforts.&lt;/p&gt;
&lt;h2&gt;SNMP inspectors&lt;/h2&gt;
&lt;p&gt;This session, led by Lianhao Lu, discussed various details of the future of SNMP support in Ceilometer.&lt;/p&gt;
&lt;h2&gt;Alarm and logs improvements&lt;/h2&gt;
&lt;p&gt;This mixed session, led by Nejc Saje and Gordon Chung, was about possible improvements on the alarm evaluation system provided by Ceilometer, and making logging in Ceilometer more effective. Both half-sessions were interesting and led to several ideas on how to improve both systems.&lt;/p&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;Considering the current QA problems with Ceilometer, Eoghan Glynn, the new &lt;em&gt;Project Technical Leader&lt;/em&gt; for Ceilometer, clearly indicated that this will be the main focus of the release cycle.&lt;/p&gt;
&lt;p&gt;Personally, I will be focused on working on Gnocchi, and will likely be joined by others in the coming weeks. Our idea is to develop a complete solution with high velocity over the next few weeks, and then work on its integration with Ceilometer itself.&lt;/p&gt;
</content:encoded><category>openstack</category></item><item><title>OpenStack Ceilometer Icehouse-2 milestone released</title><link>https://julien.danjou.info/blog/openstack-ceilometer-icehouse-2-milestone-released/</link><guid isPermaLink="true">https://julien.danjou.info/blog/openstack-ceilometer-icehouse-2-milestone-released/</guid><description>Yesterday, the second milestone of the Icehouse development branch of Ceilometer has been released and is now available for testing and download. This means the first half of the OpenStack Icehouse de</description><pubDate>Fri, 24 Jan 2014 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Yesterday, the second milestone of the Icehouse development branch of Ceilometer has been released and is now available for testing and download. This means the first half of the OpenStack &lt;em&gt;Icehouse&lt;/em&gt; development has&lt;br /&gt;
passed!&lt;/p&gt;
&lt;h2&gt;New features&lt;/h2&gt;
&lt;p&gt;For the &lt;a href=&quot;https://launchpad.net/ceilometer/+milestone/icehouse-1&quot;&gt;Icehouse-1 milestone&lt;/a&gt;, we barely had enough time to implement 2 blueprints. We almost did a better job this time, but in the end only &lt;a href=&quot;https://launchpad.net/ceilometer/+milestone/icehouse-2&quot;&gt;2 blueprints were implemented&lt;/a&gt; again. This is really far from what we planned initially. The infrastructure slowdown issues and the lower number of reviews are probably the root cause here.&lt;/p&gt;
&lt;p&gt;Anyway, Ceilometer now offers a &lt;a href=&quot;https://blueprints.launchpad.net/ceilometer/+spec/specify-event-api&quot;&gt;REST API to access the stored events&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The initial work to replace the &lt;code&gt;/v2/meters&lt;/code&gt; endpoint with something more RESTy has started with the &lt;a href=&quot;https://blueprints.launchpad.net/ceilometer/+spec/sample-api&quot;&gt;implementation of &lt;code&gt;/v2/samples&lt;/code&gt;&lt;/a&gt;.&lt;/p&gt;
&lt;h2&gt;Bug fixes&lt;/h2&gt;
&lt;p&gt;Thirty-one bugs were fixed, though most of them might not interest you so I won&apos;t elaborate too much on that. Go read &lt;a href=&quot;https://launchpad.net/ceilometer/+milestone/icehouse-2&quot;&gt;the list&lt;/a&gt; if you are curious.&lt;/p&gt;
&lt;h2&gt;Toward Icehouse 3&lt;/h2&gt;
&lt;p&gt;We now have 29 blueprints targeting &lt;a href=&quot;https://launchpad.net/ceilometer/+milestone/icehouse-3&quot;&gt;Ceilometer&apos;s third Icehouse milestone&lt;/a&gt;, some of which are already started and ready to merge. However, it&apos;s likely that we won&apos;t make all of them. As usual, the priority should indicate how confident we are that we want and need a feature. Still, it&apos;s likely the roadmap will be adjusted in the upcoming weeks. I&apos;ll try to make sure we get there without too much trouble for the 6th of March 2014. Stay tuned!&lt;/p&gt;
</content:encoded><category>openstack</category></item><item><title>Databases integration testing strategies with Python</title><link>https://julien.danjou.info/blog/db-integration-testing-strategies-python/</link><guid isPermaLink="true">https://julien.danjou.info/blog/db-integration-testing-strategies-python/</guid><description>The Ceilometer project supports various database backends that can be used as storage. Among them are MongoDB, SQLite, MySQL, PostgreSQL, HBase, DB2… All Ceilometer&apos;s code is unit tested, but when.</description><pubDate>Mon, 06 Jan 2014 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;The &lt;a href=&quot;http://launchpad.net/ceilometer&quot;&gt;Ceilometer&lt;/a&gt; project supports various database backends that can be used as storage. Among them are &lt;a href=&quot;http://www.mongodb.org/&quot;&gt;MongoDB&lt;/a&gt;, &lt;a href=&quot;http://sqlite.org&quot;&gt;SQLite&lt;/a&gt;, &lt;a href=&quot;http://mysql.com&quot;&gt;MySQL&lt;/a&gt;, &lt;a href=&quot;http://postgresql.org&quot;&gt;PostgreSQL&lt;/a&gt;, &lt;a href=&quot;http://hbase.apache.org/&quot;&gt;HBase&lt;/a&gt;, DB2… All of Ceilometer&apos;s code is unit tested, but when dealing with external storage services, one cannot be sure that the code is really working. You could be inserting data with an incorrect SQL statement, or into the wrong table. Only having the real database storage running and being used can tell you that.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/python_db_tests.png&quot; alt=&quot;python_db_tests&quot; /&gt;&lt;/p&gt;
&lt;p&gt;Over the months, we developed integration testing on top of our unit testing to validate that our storage drivers are able to deal with real world databases. That is not really different from generic &lt;a href=&quot;http://en.wikipedia.org/wiki/Integration_testing&quot;&gt;integration testing&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Integration testing is about plugging all the pieces of your software all together and running. In what I call &quot;database integration testing&quot;, the pieces will be both your software and the database system that you are going to rely on.&lt;/p&gt;
&lt;p&gt;The only difference here is that one of the modules does not come from the application itself but is an external project. The type of database that you use (RDBMS, NoSQL…) does not matter. Taking a step back, what I describe here could also apply to a lot of other software modules, even something that is not a database system at all.&lt;/p&gt;
&lt;h3&gt;Writing tests for integration&lt;/h3&gt;
&lt;p&gt;Presumably, your Python application has unit tests. In order to test against a database back-end, you need to write a few specific classes of tests that will use the database subsystem for real. For example:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import unittest
import os
import sqlalchemy

class TestDB(unittest.TestCase):
    def setUp(self):
        url = os.getenv(&quot;DB_TEST_URL&quot;)
        if not url:
            self.skipTest(&quot;No database URL set&quot;)
        self.engine = sqlalchemy.create_engine(url)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This code will try to fetch the database URL to use from an environment variable, and then will rely on &lt;a href=&quot;http://sqlalchemy.org&quot;&gt;SQLAlchemy&lt;/a&gt; to create a database connection.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import unittest
import os
import sqlalchemy

import myapp

class TestDB(unittest.TestCase):
    def setUp(self):
        url = os.getenv(&quot;DB_TEST_URL&quot;)
        if not url:
            self.skipTest(&quot;No database URL set&quot;)
        self.engine = sqlalchemy.create_engine(url)

    def test_foobar(self):
        self.assertTrue(myapp.store_integer(self.engine, 42))
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;You can then add as many tests as you want, using the connection stored in &lt;code&gt;self.engine&lt;/code&gt;. If no test database URL is set, the tests will be skipped; however, that decision is up to you. You may want to have these tests always run and fail if they can&apos;t be run.&lt;/p&gt;
&lt;p&gt;In the &lt;code&gt;setUp()&lt;/code&gt; method, you may also need to do more work, like creating and dropping a database.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import unittest
import os
import sqlalchemy

class TestDB(unittest.TestCase):
    def setUp(self):
        url = os.getenv(&quot;DB_TEST_URL&quot;)
        if not url:
            self.skipTest(&quot;No database URL set&quot;)
        self.engine = sqlalchemy.create_engine(url)
        self.connection = self.engine.connect()
        self.connection.execute(&quot;CREATE DATABASE testdb&quot;)

    def tearDown(self):
        self.connection.execute(&quot;DROP DATABASE testdb&quot;)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This will make sure that the database you need is clean and ready to be used for testing.&lt;/p&gt;
&lt;h3&gt;Launching modules, a.k.a. databases&lt;/h3&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/postgresql.png&quot; alt=&quot;postgresql&quot; /&gt;&lt;/p&gt;
&lt;p&gt;The main problem we encountered when building integration testing with databases is finding a way to start them. Most users are used to starting them system-wide with some sort of init script, but when running sandboxed tests, that is not really a good option. Browsing the documentation of each storage system allowed us to find a way to start each of them in the foreground and control them &quot;interactively&quot; via a shell script.&lt;/p&gt;
&lt;p&gt;The following is a script that you can use to run Python tests using &lt;a href=&quot;http://nose.readthedocs.org/&quot;&gt;nose&lt;/a&gt; and is heavily inspired by the one we wrote for Ceilometer.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;#!/bin/bash
set -e

clean_exit() {
    local error_code=&quot;$?&quot;
    kill -9 $(jobs -p) &amp;gt;/dev/null 2&amp;gt;&amp;amp;1 || true
    rm -rf &quot;$PGSQL_DATA&quot;
    return $error_code
}

check_for_cmd () {
    if ! which &quot;$1&quot; &amp;gt;/dev/null 2&amp;gt;&amp;amp;1
    then
        echo &quot;Could not find $1 command&quot; 1&amp;gt;&amp;amp;2
        exit 1
    fi
}

wait_for_line () {
    while read line
    do
        echo &quot;$line&quot; | grep -q &quot;$1&quot; &amp;amp;&amp;amp; break
    done &amp;lt; &quot;$2&quot;
    # Read the fifo forever, otherwise the process writing to it would block
    cat &quot;$2&quot; &amp;gt;/dev/null &amp;amp;
}

check_for_cmd postgres

trap &quot;clean_exit&quot; EXIT

## Start PostgreSQL process for tests
PGSQL_DATA=`mktemp -d /tmp/PGSQL-XXXXX`
PGSQL_PATH=`pg_config --bindir`
${PGSQL_PATH}/initdb ${PGSQL_DATA}
mkfifo ${PGSQL_DATA}/out
${PGSQL_PATH}/postgres -F -k ${PGSQL_DATA} -D ${PGSQL_DATA} &amp;amp;&amp;gt; ${PGSQL_DATA}/out &amp;amp;
## Wait for PostgreSQL to start listening to connections
wait_for_line &quot;database system is ready to accept connections&quot; ${PGSQL_DATA}/out
export DB_TEST_URL=&quot;postgresql:///?host=${PGSQL_DATA}&amp;amp;dbname=template1&quot;

## Run the tests
nosetests
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;If you use &lt;a href=&quot;http://tox.readthedocs.org&quot;&gt;tox&lt;/a&gt; to automate your test runs, you can use this script (I call it &lt;code&gt;run-tests.sh&lt;/code&gt;) in your &lt;code&gt;tox.ini&lt;/code&gt; file.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;[testenv]
commands = {toxinidir}/run-tests.sh {posargs}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/mysql.png&quot; alt=&quot;mysql&quot; /&gt;&lt;/p&gt;
&lt;p&gt;Most databases are able to run in some sort of standalone mode where you can connect to them using either a Unix domain socket or a fixed port. Here are the snippets used in Ceilometer to run MongoDB and MySQL:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;## Start MongoDB process for tests
MONGO_DATA=$(mktemp -d /tmp/MONGODB-XXXXX)
MONGO_PORT=29000
mkfifo ${MONGO_DATA}/out
mongod --maxConns 32 --nojournal --noprealloc --smallfiles --quiet --noauth --port ${MONGO_PORT} --dbpath &quot;${MONGO_DATA}&quot; --bind_ip localhost &amp;amp;&amp;gt;${MONGO_DATA}/out &amp;amp;
## Wait for Mongo to start listening to connections
wait_for_line &quot;waiting for connections on port ${MONGO_PORT}&quot; ${MONGO_DATA}/out
export DB_TEST_URL=&quot;mongodb://localhost:${MONGO_PORT}/testdb&quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/mongodb.png&quot; alt=&quot;mongodb&quot; /&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;## Start MySQL process for tests
MYSQL_DATA=$(mktemp -d /tmp/MYSQL-XXXXX)
mkfifo ${MYSQL_DATA}/out
mysqld --datadir=${MYSQL_DATA} --pid-file=${MYSQL_DATA}/mysql.pid --socket=${MYSQL_DATA}/mysql.socket --skip-networking --skip-grant-tables &amp;amp;&amp;gt; ${MYSQL_DATA}/out &amp;amp;
## Wait for MySQL to start listening to connections
wait_for_line &quot;mysqld: ready for connections.&quot; ${MYSQL_DATA}/out
export DB_TEST_URL=&quot;mysql://root@localhost/testdb?unix_socket=${MYSQL_DATA}/mysql.socket&amp;amp;charset=utf8&quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The mechanism is always the same: we create a &lt;em&gt;fifo&lt;/em&gt; with &lt;code&gt;mkfifo&lt;/code&gt;, and then run the database daemon with its output redirected to that fifo. We then read from it until we find a line stating that the database is ready to be used. At that point, we can continue and start running the tests. You have to keep reading continuously from the fifo, otherwise the process writing to it will block. We redirect the output to &lt;code&gt;/dev/null&lt;/code&gt;, but you could also redirect it to a log file, or not redirect it at all.&lt;/p&gt;
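&lt;p&gt;The same wait-for-a-ready-line trick is easy to reproduce in Python if shell is not your thing. A sketch, using a dummy process in place of a real database daemon:&lt;/p&gt;

```python
import subprocess

def wait_for_line(proc, marker):
    # Read the process output line by line until the marker shows up.
    # As with the fifo, output must keep being consumed afterwards,
    # otherwise the daemon may eventually block on a full pipe.
    for raw in iter(proc.stdout.readline, b""):
        if marker in raw.decode():
            return True
    return False  # the process exited without ever printing the marker

# Stand-in for a database daemon announcing its readiness on stdout.
proc = subprocess.Popen(
    ["python3", "-c",
     "print('starting up'); print('ready to accept connections')"],
    stdout=subprocess.PIPE)
ready = wait_for_line(proc, "ready to accept connections")
```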
&lt;blockquote&gt;
&lt;p&gt;Note: &lt;a href=&quot;http://www.die-welt.net/&quot;&gt;Evgeni Golov&lt;/a&gt; pointed out that there exist a &lt;a href=&quot;https://alioth.debian.org/scm/loggerhead/pkg-postgresql/postgresql-common/trunk/view/head:/pg_virtualenv&quot;&gt;pg_virtualenv&lt;/a&gt; for PostgreSQL and a &lt;a href=&quot;https://github.com/evgeni/my_virtualenv&quot;&gt;my_virtualenv&lt;/a&gt; for MySQL that do the same kind of thing, but with more bells and whistles.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h3&gt;One step further: using parallelism and scenarios&lt;/h3&gt;
&lt;p&gt;The approach described is quite simple, as it only supports one database type. When using an abstraction layer such as SQLAlchemy, it would be a good idea to run all these tests against different RDBMS, such as MySQL and PostgreSQL for example.&lt;/p&gt;
&lt;p&gt;The snippets above allow running both RDBMS in parallel, but the classic approach of unit tests does not allow that. Using one scenario for each database backend is a great idea here. To that end, you can use the &lt;a href=&quot;https://launchpad.net/testscenarios&quot;&gt;testscenarios&lt;/a&gt; library.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import unittest
import os
import sqlalchemy
import testscenarios

load_tests = testscenarios.load_tests_apply_scenarios

class TestDB(unittest.TestCase):
    scenarios = [
    (&apos;mysql&apos;, dict(database_connection=os.getenv(&quot;MYSQL_TEST_URL&quot;))),
    (&apos;postgresql&apos;, dict(database_connection=os.getenv(&quot;PGSQL_TEST_URL&quot;))),
    ]

    def setUp(self):
        if not self.database_connection:
            self.skipTest(&quot;No database URL set&quot;)
        self.engine = sqlalchemy.create_engine(self.database_connection)
        self.connection = self.engine.connect()
        self.connection.execute(&quot;CREATE DATABASE testdb&quot;)

    def tearDown(self):
        self.connection.execute(&quot;DROP DATABASE testdb&quot;)
&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;$ python -m subunit.run test_scenario | subunit2pyunit
test_scenario.TestDB.test_foobar(mysql)
test_scenario.TestDB.test_foobar(mysql) ... ok
test_scenario.TestDB.test_foobar(postgresql)
test_scenario.TestDB.test_foobar(postgresql) ... ok

---------------------------------------------------------
Ran 2 tests in 0.061s

OK
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;To speed up the test run, you could also run the tests in parallel. This is interesting because it spreads the workload across several CPUs. However, note that it may require a separate database for each test, or a locking mechanism: it&apos;s unlikely that all your tests will be able to work simultaneously on a single database.&lt;/p&gt;
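&lt;p&gt;One hypothetical way to give each parallel test its own database is to derive a unique database name per test (the &lt;code&gt;unique_db_name&lt;/code&gt; helper below is only a sketch, not part of testscenarios):&lt;/p&gt;

```python
import uuid


def unique_db_name(prefix="testdb"):
    """Return a database name unique to this test, so parallel test
    workers sharing one server never collide on the same database."""
    return "%s_%s" % (prefix, uuid.uuid4().hex[:8])


# In setUp() you would then run, per test:
#   self.db_name = unique_db_name()
#   self.connection.execute("CREATE DATABASE %s" % self.db_name)
# and drop that database again in tearDown().
```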
&lt;p&gt;(Both usage of scenarios and parallelism in testing will be covered in &lt;a href=&quot;https://thehackerguidetopython.com&quot;&gt;The Hacker&apos;s Guide to Python&lt;/a&gt;,&lt;br /&gt;
in case you wonder.)&lt;/p&gt;
</content:encoded><category>python</category><category>openstack</category></item><item><title>OpenStack Design Summit Icehouse, from a Ceilometer point of view</title><link>https://julien.danjou.info/blog/openstack-summit-icehouse-ceilometer/</link><guid isPermaLink="true">https://julien.danjou.info/blog/openstack-summit-icehouse-ceilometer/</guid><description>Last week was the OpenStack Design Summit Icehouse in Hong-Kong where we, OpenStack developers, discussed and designed the new OpenStack release (Icehouse) that is coming up.</description><pubDate>Wed, 13 Nov 2013 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Last week was the &lt;a href=&quot;http://www.openstack.org/summit/openstack-summit-hong-kong-2013/&quot;&gt;OpenStack Design Summit Icehouse&lt;/a&gt; in Hong-Kong where we, OpenStack developers, discussed and designed the new OpenStack release (Icehouse) that is coming up.&lt;/p&gt;
&lt;p&gt;The week has been wonderful. It was my second OpenStack design summit, and I loved it. Bumping into various people I&apos;ve never met so far and worked with online was a real pleasure. As it was to meet again with fellow OpenStack developers! The event organisation was great, as were the parties. :-)&lt;/p&gt;
&lt;p&gt;On the last day, I had the chance to present a talk with Eoghan Glynn and Nick Barcet about how we built the auto-scaling feature in Heat, implementing the &quot;alarming&quot; feature needed in Ceilometer.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/ods_icehouse_ceilometer_heat_nijaba_eglynn_jd.jpg&quot; alt=&quot;ods_icehouse_ceilometer_heat_nijaba_eglynn_jd&quot; /&gt;&lt;/p&gt;
&lt;h2&gt;Design sessions&lt;/h2&gt;
&lt;p&gt;This time, Ceilometer design sessions were spread on 3 days. Everything we talked about has its &lt;a href=&quot;https://wiki.openstack.org/wiki/Summit/Icehouse/Etherpads#Ceilometer&quot;&gt;Etherpad instance&lt;/a&gt;. The discussions were interesting, and the amount of feedback gathered is big and is going to be very useful.&lt;/p&gt;
&lt;p&gt;A lot of people and companies are using Ceilometer now, and the project is getting more and more traction. There are many different ways to use it and to bend it to one&apos;s needs. Considering the number of features and options provided, building functionality with a generic approach is making Ceilometer useful for many different and interesting use cases.&lt;/p&gt;
&lt;h2&gt;Icehouse roadmap&lt;/h2&gt;
&lt;p&gt;The &lt;a href=&quot;https://blueprints.launchpad.net/ceilometer/icehouse&quot;&gt;list of blueprints targeting Icehouse is available&lt;/a&gt;, but not yet complete. I expect people to start filling this list in the next days. If you want to propose blueprints, you&apos;re free to do so and inform us about it so we can validate it. The same applies if you wish to implement one of them!&lt;/p&gt;
&lt;p&gt;Below, I try to guess what the roadmap will look like in the upcoming weeks for Ceilometer, based on the discussions we had last week during the summit.&lt;/p&gt;
&lt;h3&gt;Events management&lt;/h3&gt;
&lt;p&gt;A lot of work is going to be put into event management. Ceilometer plans to store notifications sent using &lt;em&gt;oslo.messaging&lt;/em&gt; by OpenStack projects. Some work already got merged for Havana, but the API part and further improvements and ideas will continue to flow into the Icehouse release.&lt;/p&gt;
&lt;h3&gt;Agents and group management&lt;/h3&gt;
&lt;p&gt;A lot has been discussed around the polling agents and around the alarm evaluator agent.&lt;/p&gt;
&lt;p&gt;The current state of the &lt;em&gt;ceilometer-central-agent&lt;/em&gt; disallows any kind of high-availability and load-balancing, as the polling tasks are kept and scheduled on a single node.&lt;/p&gt;
&lt;p&gt;The high-availability part is already covered by a custom mechanism built into &lt;em&gt;ceilometer-alarm-evaluator&lt;/em&gt;, but it became clear to us that a more generic approach is needed. A lot of other projects need this kind of functionality, and a common pattern has been identified. A &lt;a href=&quot;https://wiki.openstack.org/wiki/Oslo/blueprints/service-sync&quot;&gt;blueprint about group membership&lt;/a&gt; was discussed in an Oslo session, and will result in a new Python library written to solve this in Ceilometer and in other projects.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://wiki.openstack.org/wiki/Taskflow&quot;&gt;TaskFlow&lt;/a&gt; will also probably be leveraged to solve the task distribution issue.&lt;/p&gt;
&lt;h3&gt;Documentation&lt;/h3&gt;
&lt;p&gt;Since a few weeks, Ceilometer auto-generates its &lt;a href=&quot;http://api.openstack.org/api-ref-metering.html&quot;&gt;API reference documentation&lt;/a&gt; using &lt;a href=&quot;https://git.openstack.org/cgit/stackforge/sphinxcontrib-docbookrestapi/&quot;&gt;sphinxcontrib-docbookrestapi&lt;/a&gt; that parses our API code that uses &lt;a href=&quot;https://pypi.python.org/pypi/WSME&quot;&gt;WSME&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;We also want to start writing a user guide, and we&apos;ll do that inside our own repository. That way, I hope that we will be the first project in OpenStack to require documentation to be incorporated into every patch that&apos;s being sent to Ceilometer. This is the best way to assure that nothing can be changed nor added without being accompanied with a documentation update.&lt;/p&gt;
&lt;h3&gt;Tempest testing&lt;/h3&gt;
&lt;p&gt;Testing of Ceilometer was already a subject at the previous design summit. We put a large effort into Tempest testing during this last cycle, but we encountered a lot of small issues that we had to tackle along the way. Some basic Ceilometer tests are already on their way into Tempest, so this is something that is going to be achieved very soon.&lt;/p&gt;
&lt;p&gt;Ultimately, I would also want Ceilometer moving towards providing its own set of Tempest tests as part of the code base. That way, it&apos;d be as easy for core reviewers to refuse a patch if it doesn&apos;t provide functional tests as it is to refuse it if it doesn&apos;t provide unit tests. As we&apos;ll do for the documentation.&lt;/p&gt;
&lt;h3&gt;API improvements&lt;/h3&gt;
&lt;p&gt;Once again, a few API improvements will probably be implemented, like aggregation or the ability to specify multiple queries with &lt;em&gt;OR&lt;/em&gt; and &lt;em&gt;AND&lt;/em&gt; operators.&lt;/p&gt;
&lt;h3&gt;Roll-up, archiving of data&lt;/h3&gt;
&lt;p&gt;There seems to be interest in archiving and rolling-up the data stored by Ceilometer, so work in this area is to be expected. Supporting multiple data storage drivers in parallel seems to be something that needs to be done for this and other aspects of the Ceilometer feature set.&lt;/p&gt;
&lt;h3&gt;Alarming&lt;/h3&gt;
&lt;p&gt;The alarming feature set is already big, and the work that has been accomplished is pretty amazing. A few improvements will be made, such as retrieving better metrics and building better statistics (e.g. excluding low-quality data points).&lt;/p&gt;
</content:encoded><category>openstack</category><category>talks</category></item><item><title>OpenStack Ceilometer Havana-3 milestone released</title><link>https://julien.danjou.info/blog/openstack-ceilometer-havana-3-milestone-released/</link><guid isPermaLink="true">https://julien.danjou.info/blog/openstack-ceilometer-havana-3-milestone-released/</guid><description>Last week, the third and last milestone of the Havana development branch of Ceilometer has been released and is now available for testing and download. This means the end of the OpenStack Havana devel</description><pubDate>Tue, 10 Sep 2013 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Last week, the third and last milestone of the Havana development branch of Ceilometer has been released and is now available for testing and download. This means the end of the OpenStack &lt;em&gt;Havana&lt;/em&gt; development time is coming, and that the features are now frozen.&lt;/p&gt;
&lt;h2&gt;New features&lt;/h2&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/blueprint-1.jpg&quot; alt=&quot;blueprint-1&quot; /&gt;&lt;/p&gt;
&lt;p&gt;Eleven blueprints have been implemented as you can see on the &lt;a href=&quot;https://launchpad.net/ceilometer/+milestone/havana-3&quot;&gt;release page&lt;/a&gt;. That&apos;s one more than during Havana-2, but fewer than initially planned, though it is still a pretty good score considering the size of our contributor team. I&apos;m going to talk through some of them here, those that are the most interesting for users.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Our favorite &lt;a href=&quot;https://wiki.openstack.org/wiki/OutreachProgramForWomen&quot;&gt;OPW&lt;/a&gt; intern Terri Yu implemented the long-awaited &lt;a href=&quot;https://blueprints.launchpad.net/ceilometer/+spec/api-group-by&quot;&gt;GROUP BY API feature&lt;/a&gt;, which allows grouping samples by fields before returning statistics.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Eoghan Glynn (Red Hat) continued his implementation of alarming features, and the &lt;a href=&quot;https://blueprints.launchpad.net/ceilometer/+spec/alarm-audit-api&quot;&gt;audit API&lt;/a&gt; has been merged. A few blueprints related to alarming slipped and will be delayed for RC1, as they have been granted feature freeze exceptions:&lt;br /&gt;
&lt;a href=&quot;https://blueprints.launchpad.net/ceilometer/+spec/alarming-logical-combination&quot;&gt;logical combinations of alarms&lt;/a&gt; and &lt;a href=&quot;https://blueprints.launchpad.net/ceilometer/+spec/alarm-service-partitioner&quot;&gt;alarm service partitioner&lt;/a&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;With the help of Gordon Chung (IBM), I&apos;ve worked on creating a &lt;a href=&quot;https://blueprints.launchpad.net/ceilometer/+spec/count-api-requests&quot;&gt;middleware to meter API requests&lt;/a&gt;. This has been merged into Oslo and is handled by Ceilometer. Gordon added another middleware on top of it to add CADF support for audit.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;The Ceilometer compute agent gained its second inspector for polling virtual machines, thanks to Alessandro Pilotti (Cloudbase) who implemented &lt;a href=&quot;https://blueprints.launchpad.net/ceilometer/+spec/hyper-v-agent&quot;&gt;the Hyper-V inspector&lt;/a&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Ceilometer will be able to meter Neutron bandwidth thanks to the eNovance folks who worked on the &lt;a href=&quot;https://blueprints.launchpad.net/ceilometer/+spec/ceilometer-quantum-bw-metering&quot;&gt;bandwidth metering blueprint&lt;/a&gt;, on both the Ceilometer and Neutron sides. This is also a long-awaited feature.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Finally, Ceilometer will be shipped with yet another storage back-end, as Tong Li (IBM) implemented a &lt;a href=&quot;https://blueprints.launchpad.net/ceilometer/+spec/ibm-db2-support&quot;&gt;DB2 driver&lt;/a&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Bug fixes&lt;/h2&gt;
&lt;p&gt;Fifty-six bugs were fixed, though most of them might not interest you so I won&apos;t elaborate too much on that. Go read &lt;a href=&quot;https://launchpad.net/ceilometer/+milestone/havana-3&quot;&gt;the list&lt;/a&gt; if you are curious.&lt;/p&gt;
&lt;h2&gt;Toward our final Havana release&lt;/h2&gt;
&lt;p&gt;With the feature freeze in place, we&apos;re now focusing on fixing bugs and improving documentation. I&apos;ll try to make sure we&apos;ll get there without too much trouble for the 17th October 2013. Stay tuned!&lt;/p&gt;
</content:encoded><category>openstack</category></item><item><title>OpenStack Ceilometer Havana-2 milestone released</title><link>https://julien.danjou.info/blog/openstack-ceilometer-havana-2-milestone-released/</link><guid isPermaLink="true">https://julien.danjou.info/blog/openstack-ceilometer-havana-2-milestone-released/</guid><description>Last week, the second milestone of the Havana development branch of Ceilometer has been released and is now available for testing and download. This means the first half of the OpenStack Havana develo</description><pubDate>Sat, 27 Jul 2013 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Last week, the second milestone of the Havana development branch of Ceilometer has been released and is now available for testing and download. This means the first half of the OpenStack &lt;em&gt;Havana&lt;/em&gt; development has passed!&lt;/p&gt;
&lt;h2&gt;New features&lt;/h2&gt;
&lt;p&gt;Ten blueprints have been implemented as you can see on the &lt;a href=&quot;https://launchpad.net/ceilometer/+milestone/havana-2&quot;&gt;release page&lt;/a&gt;. I&apos;m going to talk through some of them here, that are the most interesting for users.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/blueprint.jpg&quot; alt=&quot;blueprint&quot; /&gt;&lt;/p&gt;
&lt;p&gt;The Ceilometer API now returns &lt;a href=&quot;https://blueprints.launchpad.net/ceilometer/+spec/api-sample-sorted&quot;&gt;all the samples sorted by timestamp&lt;/a&gt;. This blueprint is the first one implemented by Terri Yu, our &lt;a href=&quot;https://wiki.openstack.org/wiki/OutreachProgramForWomen&quot;&gt;OPW&lt;/a&gt; intern! In the same spirit, I&apos;ve added the ability to &lt;a href=&quot;https://blueprints.launchpad.net/ceilometer/+spec/api-limit&quot;&gt;limit the number of samples returned&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;On the alarming front, things evolved a lot. I&apos;ve implemented the &lt;a href=&quot;https://blueprints.launchpad.net/ceilometer/+spec/alarm-notifier&quot;&gt;notifier system&lt;/a&gt; that will be used to run actions when alarms are triggered. To trigger these alarms, Eoghan Glynn (Red Hat) worked on the &lt;a href=&quot;https://blueprints.launchpad.net/ceilometer/+spec/alarm-distributed-threshold-evaluation&quot;&gt;alarm evaluation system&lt;/a&gt; that will use the Ceilometer API to check for alarm states.&lt;/p&gt;
&lt;p&gt;I&apos;ve reworked the publisher system so it now uses &lt;a href=&quot;https://blueprints.launchpad.net/ceilometer/+spec/pipeline-publisher-url&quot;&gt;URL-formatted targets&lt;/a&gt; for publication. This now allows publishing different meters to different targets using the same publishing protocol (e.g. via UDP toward different hosts).&lt;/p&gt;
&lt;p&gt;Sandy Walsh (Rackspace) has been working on the StackTach-like functionality and added the ability for the collector to optionally &lt;a href=&quot;https://blueprints.launchpad.net/ceilometer/+spec/collector-stores-events&quot;&gt;store the notification events received&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Finally, Mehdi Abaakouk (eNovance) implemented a &lt;a href=&quot;https://blueprints.launchpad.net/ceilometer/+spec/db-ttl&quot;&gt;TTL system for the database&lt;/a&gt;, so you&apos;re now able to expire your data whenever you like.&lt;/p&gt;
&lt;h2&gt;Bug fixes&lt;/h2&gt;
&lt;p&gt;Thirty-five bugs were fixed, though most of them might not interest you so I won&apos;t elaborate too much on that. Go read &lt;a href=&quot;https://launchpad.net/ceilometer/+milestone/havana-2&quot;&gt;the list&lt;/a&gt; if you are curious.&lt;/p&gt;
&lt;h2&gt;Toward Havana 3&lt;/h2&gt;
&lt;p&gt;We now have 30 blueprints targeting &lt;a href=&quot;https://launchpad.net/ceilometer/+milestone/havana-3&quot;&gt;Ceilometer&apos;s third Havana milestone&lt;/a&gt;, some of which are already started. I&apos;ll try to make sure we&apos;ll get there without too much trouble for the 6th September 2013. Stay tuned!&lt;/p&gt;
</content:encoded><category>openstack</category></item><item><title>OpenStack meets Lisp: cl-openstack-client</title><link>https://julien.danjou.info/blog/lisp-and-openstack-with-cl-openstack-client/</link><guid isPermaLink="true">https://julien.danjou.info/blog/lisp-and-openstack-with-cl-openstack-client/</guid><description>Building an OpenStack client library in Common Lisp, exploring what it takes to bring the OpenStack community beyond Python.</description><pubDate>Thu, 04 Jul 2013 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;A month ago, a mail hit the &lt;a href=&quot;http://openstack.org&quot;&gt;OpenStack&lt;/a&gt; mailing list entitled &quot;&lt;a href=&quot;https://lists.launchpad.net/openstack/msg24349.html&quot;&gt;The OpenStack Community Welcomes Developers in All Programming Languages&lt;/a&gt;&quot;. You may know that OpenStack is essentially built using Python, and therefore it is the reference language for the client libraries implementations. As a Lisp and OpenStack practitioner, I used this excuse to build a challenge for myself: let&apos;s prove this point by bringing Lisp into OpenStack!&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/cl-openstack-client-1.png&quot; alt=&quot;cl-openstack-client-1&quot; /&gt;&lt;/p&gt;
&lt;p&gt;Welcome &lt;a href=&quot;https://github.com/stackforge/cl-openstack-client&quot;&gt;cl-openstack-client&lt;/a&gt;, the OpenStack client library for &lt;a href=&quot;http://common-lisp.net/&quot;&gt;Common Lisp&lt;/a&gt;!&lt;/p&gt;
&lt;p&gt;The project is hosted on the classic OpenStack infrastructure for third-party projects, &lt;a href=&quot;http://ci.openstack.org/stackforge.html&quot;&gt;StackForge&lt;/a&gt;. It provides the &lt;a href=&quot;https://jenkins.openstack.org/job/gate-cl-openstack-client-run-tests/&quot;&gt;continuous integration system based on Jenkins&lt;/a&gt; and the Gerrit infrastructure used to review contributions.&lt;/p&gt;
&lt;h2&gt;How the tests work&lt;/h2&gt;
&lt;p&gt;OpenStack projects run a fabulous contribution workflow, &lt;a href=&quot;https://julien.danjou.info/blog/2013/rant-about-github-pull-request-workflow-implementation&quot;&gt;which I already talked about&lt;/a&gt;, based on tools like &lt;a href=&quot;http://gerrit.googlecode.com/&quot;&gt;Gerrit&lt;/a&gt; and &lt;a href=&quot;http://jenkins-ci.org/&quot;&gt;Jenkins&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;OpenStack Python projects usually run &lt;a href=&quot;https://pypi.python.org/pypi/tox&quot;&gt;tox&lt;/a&gt; to build a virtual environment and run the tests inside it. We don&apos;t have such a thing in Common Lisp as far as I know, so I had to build it myself.&lt;/p&gt;
&lt;p&gt;Fortunately, using &lt;a href=&quot;http://www.quicklisp.org/&quot;&gt;Quicklisp&lt;/a&gt;, the fabulous equivalent of Python&apos;s PyPI, it has been a breeze to set this up. &lt;em&gt;cl-openstack-client&lt;/em&gt; just includes a &lt;a href=&quot;https://github.com/stackforge/cl-openstack-client/blob/master/run-tests.sh&quot;&gt;basic shell script&lt;/a&gt; that does the following:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Download quicklisp.lisp&lt;/li&gt;
&lt;li&gt;Run a &lt;a href=&quot;https://github.com/stackforge/cl-openstack-client/blob/master/update-deps.lisp&quot;&gt;Lisp program to install the dependencies using Quicklisp&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Run a &lt;a href=&quot;https://github.com/stackforge/cl-openstack-client/blob/master/run-tests.lisp&quot;&gt;Lisp program running the test suite&lt;/a&gt; using &lt;a href=&quot;http://common-lisp.net/project/fiveam/&quot;&gt;FiveAM&lt;/a&gt;, that exits with 0 or 1 based on the test results.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;I just run the tests using &lt;a href=&quot;http://www.sbcl.org&quot;&gt;SBCL&lt;/a&gt;, though supporting more compilers would be a really good plan for the future, and should be straightforward. You can &lt;a href=&quot;https://jenkins.openstack.org/job/gate-cl-openstack-client-run-tests/4/console&quot;&gt;admire the log of a successful test run&lt;/a&gt;, done when I proposed a patch via Gerrit, to see what it looks like.&lt;/p&gt;
&lt;h2&gt;Implementation status&lt;/h2&gt;
&lt;p&gt;For the curious, here&apos;s an example of how it works:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;* (require &apos;cl-openstack-client)
* (use-package &apos;cl-keystone-client)
* (defvar k (make-instance &apos;connection-v2 :username &quot;demo&quot; :password &quot;somepassword&quot; :tenant-name &quot;demo&quot; :url &quot;http://devstack:5000&quot;))

K

* (authenticate k)

((:ISSUED--AT . &quot;2013-07-04T05:59:55.454226&quot;)
 (:EXPIRES . &quot;2013-07-05T05:59:55Z&quot;)
 (:ID
  . &quot;wNFQwNzo1OTo1NS40NTQyMthisisaverylongtokenwNFQwNzo1OTo1NS40NTQyM&quot;)
 (:TENANT (:DESCRIPTION) (:ENABLED . T)
  (:ID . &quot;1774fd545df4400380eb2b4f4985b3be&quot;) (:NAME . &quot;demo&quot;)))

* (connection-token-id k)

&quot;wNFQwNzo1OTo1NS40NTQyMthisisaverylongtokenwNFQwNzo1OTo1NS40NTQyM&quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Unfortunately, the implementation is far from complete. For now, it only implements Keystone token retrieval.&lt;/p&gt;
&lt;p&gt;I&apos;ve actually started this project to provide a working starting point. With this, future potential contributors will be able to spend their efforts on writing code, and not on setting up the basic continuous integration system or module infrastructure.&lt;/p&gt;
&lt;p&gt;If you wish to help me and contribute, just follow the &lt;a href=&quot;https://wiki.openstack.org/wiki/GerritWorkflow&quot;&gt;OpenStack Gerrit workflow howto&lt;/a&gt; or feel free to come by me and ask any question (I&apos;m hanging out on #lisp on Freenode too).&lt;/p&gt;
&lt;p&gt;See you soon, hoping to bring more Lisp into OpenStack!&lt;/p&gt;
</content:encoded><category>lisp</category><category>openstack</category></item><item><title>OpenStack Ceilometer Havana-1 milestone released</title><link>https://julien.danjou.info/blog/openstack-ceilometer-havana-1-milestone-released/</link><guid isPermaLink="true">https://julien.danjou.info/blog/openstack-ceilometer-havana-1-milestone-released/</guid><description>Yesterday, the first milestone of the Havana development branch of Ceilometer has been released and is now available for testing and download. This means the first quarter of the OpenStack Havana deve</description><pubDate>Fri, 31 May 2013 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Yesterday, the first milestone of the Havana development branch of Ceilometer has been released and is now available for testing and download. This means the first quarter of the OpenStack &lt;em&gt;Havana&lt;/em&gt; development has passed!&lt;/p&gt;
&lt;h2&gt;New features&lt;/h2&gt;
&lt;p&gt;Ten blueprints have been implemented as you can see on the &lt;a href=&quot;https://launchpad.net/ceilometer/+milestone/havana-1&quot;&gt;release page&lt;/a&gt;. I&apos;m going to talk through some of them here, that are the most interesting for users.&lt;/p&gt;
&lt;p&gt;Ceilometer can now &lt;a href=&quot;https://blueprints.launchpad.net/ceilometer/+spec/scheduler-counter&quot;&gt;count the scheduling attempts&lt;/a&gt; for instances made by &lt;em&gt;nova-scheduler&lt;/em&gt;. This can be useful for billing or auditing (implemented by me for eNovance).&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/hbase.png&quot; alt=&quot;hbase&quot; /&gt;&lt;/p&gt;
&lt;p&gt;People using the &lt;a href=&quot;http://hbase.apache.org/&quot;&gt;HBase&lt;/a&gt; backend can now do request filtering on any of the counter fields, something we call &lt;a href=&quot;https://blueprints.launchpad.net/ceilometer/+spec/hbase-metadata-query&quot;&gt;metadata queries&lt;/a&gt;, which was missing for this backend driver. Thanks to Shengjie Min (Dell) for the implementation.&lt;/p&gt;
&lt;p&gt;Counters can now be &lt;a href=&quot;https://blueprints.launchpad.net/ceilometer/+spec/udp-publishing&quot;&gt;sent over UDP&lt;/a&gt; instead of the Oslo RPC mechanism (AMQP based by default). This allows counter transmission to be done in a much faster, though less reliable, way. The primary use case is not audit or billing, but the alarming features that we are working on (implemented by me for eNovance).&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/siren.png&quot; alt=&quot;siren&quot; /&gt;&lt;/p&gt;
&lt;p&gt;The &lt;a href=&quot;https://blueprints.launchpad.net/ceilometer/+spec/alarm-api&quot;&gt;initial alarm API&lt;/a&gt; has been designed and implemented, thanks to Mehdi Abaakouk (eNovance) and Angus Salkeld (Red Hat) who tackled this. We&apos;re now able to do &lt;em&gt;CRUD&lt;/em&gt; actions on these.&lt;/p&gt;
&lt;p&gt;Posting meters via the HTTP API is now possible. This is another conduit that can be used to publish and collect meters. Thanks to Angus Salkeld (Red Hat) for implementing this.&lt;/p&gt;
&lt;p&gt;I&apos;ve been working on a somewhat experimental &lt;a href=&quot;https://blueprints.launchpad.net/ceilometer/+spec/oslo-multi-publisher&quot;&gt;notifier driver for Oslo&lt;/a&gt; notification that publishes Ceilometer counters instead of the standard notifications, using the Ceilometer pipeline setup.&lt;/p&gt;
&lt;p&gt;Sandy Walsh (Rackspace) has put in place the base needed to &lt;a href=&quot;https://blueprints.launchpad.net/ceilometer/+spec/add-event-table&quot;&gt;store raw notifications (events)&lt;/a&gt;, with the final goal of bringing more functionalities around these into Ceilometer.&lt;/p&gt;
&lt;p&gt;Obviously, none of these blueprints and bug fixes would have been implemented without the keen eyes of our entire team, reviewing code and tirelessly advising the developers. Thanks to them!&lt;/p&gt;
&lt;h2&gt;Bug fixes&lt;/h2&gt;
&lt;p&gt;Thirty-one bugs were fixed, though most of them might not interest you so I won&apos;t elaborate too much on that. Go read &lt;a href=&quot;https://launchpad.net/ceilometer/+milestone/havana-1&quot;&gt;the list&lt;/a&gt; if you are curious.&lt;/p&gt;
&lt;h2&gt;Toward Havana 2&lt;/h2&gt;
&lt;p&gt;We now have 21 blueprints targeting &lt;a href=&quot;https://launchpad.net/ceilometer/+milestone/havana-2&quot;&gt;Ceilometer&apos;s second Havana milestone&lt;/a&gt;, some of which are already started. I&apos;ll try to make sure we&apos;ll get there without too much trouble for the 18th July 2013. Stay tuned!&lt;/p&gt;
</content:encoded><category>openstack</category></item><item><title>Rant about Github pull-request workflow implementation</title><link>https://julien.danjou.info/blog/rant-about-github-pull-request-workflow-implementation/</link><guid isPermaLink="true">https://julien.danjou.info/blog/rant-about-github-pull-request-workflow-implementation/</guid><description>One of my recent innocent tweet about Gerrit vs Github triggered much more reponses and debate that I expected it to. I realize that it might be worth explaining a bit what I meant, in a text longer t</description><pubDate>Fri, 10 May 2013 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;One of my recent innocent tweet about &lt;em&gt;Gerrit vs Github&lt;/em&gt; triggered much more reponses and debate that I expected it to. I realize that it might be worth explaining a bit what I meant, in a text longer than 140 characters.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;I&apos;m having a hard time now contributing to projects not using Gerrit. Github isn&apos;t that good.&lt;/p&gt;
&lt;p&gt;— Julien Danjou (@juldanjou) &lt;a href=&quot;https://twitter.com/juldanjou/status/332076595521146881&quot;&gt;May 8, 2013&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2&gt;The problems with Github pull-requests&lt;/h2&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/github-1.svg&quot; alt=&quot;github-1&quot; /&gt;&lt;/p&gt;
&lt;p&gt;I always looked at Github with a distant eye, mainly because I always disliked their pull-request handling, and saw no value in the social hype it brings. Why?&lt;/p&gt;
&lt;h3&gt;One click away isn&apos;t one click effort&lt;/h3&gt;
&lt;p&gt;The pull-request system looks like an incredibly easy way to contribute to any project hosted on Github. You&apos;re one click away from sending your contribution to any software. But the problem is that a worthy contribution is never the effort of a single click.&lt;/p&gt;
&lt;p&gt;Doing any proper and useful contribution to a software is never done right the first time. There&apos;s a dance you will have to play. A slowly rhythmed back and forth between you and the software maintainer or team. You&apos;ll have to dance it until your contribution is correct and can be merged.&lt;/p&gt;
&lt;p&gt;But as a software maintainer, not everybody is going to follow you on this choreography, and you&apos;ll end up with pull requests that never get finished unless you wrap things up yourself. So in most cases, the gain from a pull request isn&apos;t really bigger than from a good bug report.&lt;/p&gt;
&lt;p&gt;This is where the social argument for Github falls apart. As soon as you&apos;re talking about projects bigger than a color theme for your favorite text editor, this feature is overrated.&lt;/p&gt;
&lt;h3&gt;Contribution rework&lt;/h3&gt;
&lt;p&gt;If you&apos;re lucky enough, your contributor will play along and follow you on this pull-request review process. You&apos;ll make suggestions, he will listen and will modify his pull-request to follow your advice.&lt;/p&gt;
&lt;p&gt;At this point, there are two techniques he can use to please you.&lt;/p&gt;
&lt;h4&gt;Technique #1: the Topping&lt;/h4&gt;
&lt;p&gt;Github&apos;s pull-requests invite you to send an entire branch, eclipsing the fact that it is composed of several commits. The problem is that a lot of one-click-away contributors have not mastered Git and/or do not make the effort to build a logical patchset, and nothing warns them that their branch history is wrong. So they tend to change stuff around, commit, make a mistake, commit, fix this mistake, commit, etc. This kind of branch exposes the whole construction process of your contributor&apos;s thinking, and is a real pain to review. To the point that I quite often give up.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/github-pull-request-iterative.png&quot; alt=&quot;github-pull-request-iterative&quot; /&gt;&lt;/p&gt;
&lt;p&gt;Without Github, the old method that all software used, and that many projects still use (e.g. Linux), is to send a patch set over e-mail (or any other medium like Gerrit). This method has one positive effect: it forces the contributor to acknowledge the list of commits he is going to publish. So if the contributor has fixup commits in his history, they are going to be seen as first-class citizens, and nobody wants to see that, neither your contributor nor the software maintainers. Therefore, such a system tends to push contributors to write atomic, logical and self-contained patchsets that can be more easily reviewed.&lt;/p&gt;
&lt;h4&gt;Technique #2: the History Rewriter&lt;/h4&gt;
&lt;p&gt;This is actually the right way to build a working and logical patchset with Git: rewriting history and amending problematic patches using the famous &lt;code&gt;git rebase --interactive&lt;/code&gt; trick.&lt;/p&gt;
&lt;p&gt;The problem is that if your contributor does this and then re-pushes the branch composing your pull-request to Github, you will both lose the previous review, every time. There&apos;s no history of the different versions of the branch that have been pushed.&lt;/p&gt;
&lt;p&gt;In an older alternative system like e-mail, no information is lost when reworked patches are resent, obviously. This is far better, because it eases following the iterative discussions that the patch triggered.&lt;/p&gt;
&lt;p&gt;Of course, it would be possible for Github to enhance this and fix it, but currently it doesn&apos;t handle this use case correctly.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/hylang-pull-request-157.png&quot; alt=&quot;Exercise for the doubtful readers: good luck finding all revisions of my patch in the pull-request #157 of Hy.&quot; /&gt;&lt;/p&gt;
&lt;h2&gt;A quick look at OpenStack workflow&lt;/h2&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/openstack-5.png&quot; alt=&quot;openstack-5&quot; /&gt;&lt;/p&gt;
&lt;p&gt;It&apos;s no secret that I&apos;ve been contributing to OpenStack as a daily routine for the last 18 months. The more I contribute, the more I like the contribution workflow and process. It&apos;s already &lt;a href=&quot;https://wiki.openstack.org/wiki/Gerrit_Workflow&quot;&gt;described at length on the wiki&lt;/a&gt;, so I&apos;ll just summarize my view and what I like about it here.&lt;/p&gt;
&lt;h3&gt;Gerrit&lt;/h3&gt;
&lt;p&gt;To send a contribution to any OpenStack project, you need to go through Gerrit. This is actually way simpler than doing a pull-request on Github: all you have to do is make your commit(s) and type &lt;a href=&quot;https://pypi.python.org/pypi/git-review&quot;&gt;&lt;code&gt;git review&lt;/code&gt;&lt;/a&gt;. That&apos;s it. Your patch will be pushed to Gerrit and available for review.&lt;/p&gt;
&lt;p&gt;Gerrit allows other developers to review your patch, add comments anywhere on it, and score your patch up or down. You can build any rule you want for the score needed for a patch to be merged; OpenStack requires a positive score from two core developers before a patch is merged.&lt;/p&gt;
&lt;p&gt;Until a patch is validated, it can be reworked and amended locally using Git, and then resent using &lt;code&gt;git review&lt;/code&gt; again. It&apos;s that simple. The history and the different versions of the patches remain available, along with all the comments. Gerrit doesn&apos;t lose any information about your workflow.&lt;/p&gt;
&lt;p&gt;Finally, you&apos;ll notice that this is actually the same kind of workflow projects use when they work with patches sent over e-mail. Gerrit just builds a single place to regroup and keep track of patchsets, which is really handy. It&apos;s also much easier for people to send patches using a command line tool than their MUA or &lt;em&gt;git send-email&lt;/em&gt;.&lt;/p&gt;
&lt;h3&gt;Gate testing&lt;/h3&gt;
&lt;p&gt;Testing is mandatory for any patch sent to OpenStack. Unit tests and functional tests are run for &lt;em&gt;each version of each patch of the patchset&lt;/em&gt; sent. And until your patch passes all tests, it will be &lt;em&gt;impossible&lt;/em&gt; to merge it.&lt;/p&gt;
&lt;p&gt;Yes, this implies that all patches in a patchset must be working commits that can be merged on their own, without the entire patchset going in! With such a restriction, it&apos;s impossible to get &quot;fixup commits&quot; merged into your project, polluting the history and the testability of the project.&lt;/p&gt;
&lt;p&gt;Once your patch is validated by core developers, the system checks that there are no merge conflicts. If there are none, tests are re-run, since the branch you are pushing to might have changed, and if everything&apos;s fine, the patch is merged.&lt;/p&gt;
&lt;p&gt;This is an incredible force for the quality of the project. It implies that no broken patchset can ever sneak in, and that the project always passes all its tests.&lt;/p&gt;
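&lt;p&gt;The gate logic described above boils down to a few sequential checks. Here is a minimal sketch of that logic in Python, with the review, conflict and test checks stubbed out as callables; the function and its names are purely illustrative, not the actual gate code:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;def gate(patch, is_approved, has_conflict, run_tests):
    # A patch only merges if it is approved, applies cleanly,
    # and still passes the tests against the current branch.
    if not is_approved(patch):
        return &apos;waiting for review&apos;
    if has_conflict(patch):
        return &apos;needs rebase&apos;
    # Tests are re-run at merge time because the target branch
    # may have moved since the patch was last verified.
    if not run_tests(patch):
        return &apos;tests failed&apos;
    return &apos;merged&apos;
&lt;/code&gt;&lt;/pre&gt;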
&lt;h2&gt;Conclusion: accessibility vs code review&lt;/h2&gt;
&lt;p&gt;In the end, I think that one of the keys to any development process, code review, is not well covered by Github&apos;s pull-request system. It is, along with history integrity, damaged by the goal of making contributions easier.&lt;/p&gt;
&lt;p&gt;Choosing between these features is probably a trade-off that each project should make carefully, considering its core goals and the quality of code it wants to reach.&lt;/p&gt;
&lt;p&gt;I tend to think that OpenStack found one of the best trade-offs available by using Gerrit and plugging testing automation into it via Jenkins, and I would recommend it for any project that takes code review and testing seriously.&lt;/p&gt;
</content:encoded><category>github</category><category>openstack</category></item><item><title>OpenStack Design Summit Havana, from a Ceilometer point of view</title><link>https://julien.danjou.info/blog/openstack-summit-havana-ceilometer/</link><guid isPermaLink="true">https://julien.danjou.info/blog/openstack-summit-havana-ceilometer/</guid><description>Last week was the OpenStack Design Summit in Portland, OR where we, developers, discussed and designed the new OpenStack release (Havana) coming up.</description><pubDate>Thu, 25 Apr 2013 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Last week was the &lt;a href=&quot;https://www.openstack.org/summit/portland-2013/&quot;&gt;OpenStack Design Summit&lt;/a&gt; in Portland, OR where we, developers, discussed and designed the new OpenStack release (Havana) coming up.&lt;/p&gt;
&lt;p&gt;The summit was wonderful. It was my first OpenStack design summit -- even more special as a PTL -- and bumping into various people I had so far only worked with online was a real pleasure!&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/ods_havana_ceilometer_nijaba_jd_talk.jpg&quot; alt=&quot;ods_havana_ceilometer_nijaba_jd_talk&quot; /&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;http://nicolas.barcet.com/&quot;&gt;Nick Barcet&lt;/a&gt; from &lt;a href=&quot;http://www.enovance.com&quot;&gt;eNovance&lt;/a&gt;, our dear previous Ceilometer PTL, and myself talked about Ceilometer and presented the work that has been done for Grizzly, with some previews of what we&apos;d like to see done for the Havana release.&lt;/p&gt;
&lt;h2&gt;Design sessions&lt;/h2&gt;
&lt;p&gt;Ceilometer had its design sessions during the last days of the summit. We noted a lot of things and commented during the sessions in our &lt;a href=&quot;https://wiki.openstack.org/wiki/Summit/Havana/Etherpads#Ceilometer&quot;&gt;Etherpad instances&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The first session was a description of the Ceilometer core architecture for interested people, and was a wonderful success considering that the room was packed. Our &lt;a href=&quot;http://doughellmann.com/&quot;&gt;Doug Hellmann&lt;/a&gt; did a wonderful job introducing people to Ceilometer and answering questions.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/ods_havana_ceilometer_dhellmann.jpg&quot; alt=&quot;ods_havana_ceilometer_dhellmann&quot; /&gt;&lt;/p&gt;
&lt;p&gt;The next session was about getting feedback from our users. We were surprised to discover wonderful real use-cases and deployments, like CERN using Ceilometer and generating 2 GB of data per day!&lt;/p&gt;
&lt;p&gt;The following sessions ran on Thursday and were much more about discussing new features. A lot of already existing blueprints were discussed and quickly validated during the first morning session. Then, &lt;a href=&quot;http://www.sandywalsh.com/&quot;&gt;Sandy Walsh&lt;/a&gt; introduced the architecture they use inside &lt;a href=&quot;https://github.com/rackerlabs/stacktach&quot;&gt;StackTach&lt;/a&gt;, so we can start thinking about bringing some of it into Ceilometer.&lt;/p&gt;
&lt;p&gt;API improvements were discussed without surprises and with a good consensus on what needs to be done. The four following sessions, which occupied a lot of the day, were related to alarming. All were led by Eoghan Glynn, from &lt;a href=&quot;http://redhat.com&quot;&gt;Red Hat&lt;/a&gt;, who did an amazing job presenting the possible architectures with their pros and cons. Actually, all we had to do was nod at his designs and acknowledge the plan on how to build this.&lt;/p&gt;
&lt;p&gt;The last two sessions were about discussing advanced models for billing, where we got some interesting feedback from Daniel Dyer from HP, followed by a quick follow-up of the StackTach presentation from the morning session.&lt;/p&gt;
&lt;h2&gt;Havana roadmap&lt;/h2&gt;
&lt;p&gt;The &lt;a href=&quot;https://blueprints.launchpad.net/ceilometer/havana&quot;&gt;list of blueprints targeting Havana is available&lt;/a&gt; and should be finished by next week. If you want to propose blueprints, you&apos;re free to do so; just inform us about it so we can validate them. The same applies if you wish to implement one of them!&lt;/p&gt;
&lt;h3&gt;API extension&lt;/h3&gt;
&lt;p&gt;I do think the API version 2 is going to be heavily extended during this release cycle. We need more features, like the &lt;a href=&quot;https://blueprints.launchpad.net/ceilometer/+spec/api-group-by&quot;&gt;group-by&lt;/a&gt; functionality.&lt;/p&gt;
&lt;h3&gt;Healthnmon&lt;/h3&gt;
&lt;p&gt;In parallel to the design sessions, discussions took place in the unconference room with the Healthnmon developers to figure out a plan to merge some of their efforts into Ceilometer. They should provide a component to help Ceilometer support more hypervisors than it currently does.&lt;/p&gt;
&lt;h3&gt;Alarming&lt;/h3&gt;
&lt;p&gt;Alarming is definitely going to be the next big project for Ceilometer. Today, Eoghan and I started building blueprints on alarming, &lt;a href=&quot;https://blueprints.launchpad.net/ceilometer/+spec/alarming&quot;&gt;centralised in a general blueprint&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;We know this is going to happen for real and very soon, thanks to the commitments of &lt;a href=&quot;http://enovance.com&quot;&gt;eNovance&lt;/a&gt; and &lt;a href=&quot;http://redhat.com&quot;&gt;Red Hat&lt;/a&gt;, who are dedicating resources to this amazing project!&lt;/p&gt;
</content:encoded><category>openstack</category><category>talks</category></item><item><title>Announcing Climate, the OpenStack capacity leasing project</title><link>https://julien.danjou.info/blog/openstack-climate-capacity-leasing/</link><guid isPermaLink="true">https://julien.danjou.info/blog/openstack-climate-capacity-leasing/</guid><description>While working on the XLcloud project (HPC on cloud) it appeared clear to us that OpenStack was missing a critical component towards resource reservations.</description><pubDate>Mon, 25 Mar 2013 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;While working on the &lt;a href=&quot;http://xlcloud.org/bin/view/Main/&quot;&gt;XLcloud project&lt;/a&gt; (HPC on cloud) it appeared clear to us that OpenStack was missing a critical component towards resource reservations.&lt;/p&gt;
&lt;p&gt;A capacity leasing service is something service providers really need, especially in the context of cloud platforms dedicated to HPC-style workloads. Instead of building something really specific, the decision has been made to build a new standalone OpenStack component aiming to provide this kind of functionality to OpenStack. In the spirit of other OpenStack components, it will be extensible to fulfill a large panel of needs around this problem.&lt;/p&gt;
&lt;p&gt;The project is named &lt;a href=&quot;http://launchpad.net/climate&quot;&gt;Climate&lt;/a&gt;, and is hosted on &lt;a href=&quot;http://ci.openstack.org/stackforge.html&quot;&gt;StackForge&lt;/a&gt;. It will follow the standard OpenStack development model. This service will be able to handle a calendar of reservations for various resources, based on various criteria.&lt;/p&gt;
&lt;p&gt;The project is still at its early design stage, and we plan to have an unconference session during &lt;a href=&quot;http://www.openstack.org/summit/portland-2013/&quot;&gt;the next OpenStack summit in Portland&lt;/a&gt; to present our plans and ideas for the future!&lt;/p&gt;
</content:encoded><category>openstack</category></item><item><title>Ceilometer bug squash day #2</title><link>https://julien.danjou.info/blog/ceilometer-bug-squash-day-2/</link><guid isPermaLink="true">https://julien.danjou.info/blog/ceilometer-bug-squash-day-2/</guid><description>The Ceilometer team is pleased to announce that tomorrow Tuesday 5th March 2013 will be the second bug squash day for Ceilometer.</description><pubDate>Mon, 04 Mar 2013 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;The Ceilometer team is pleased &lt;a href=&quot;http://lists.openstack.org/pipermail/openstack-dev/2013-March/006188.html&quot;&gt;to announce&lt;/a&gt; that tomorrow &lt;a href=&quot;http://wiki.openstack.org/Ceilometer/BugSquashingDay/20130304&quot;&gt;Tuesday 5th March 2013 will be the second bug squash day for Ceilometer&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;We wrote an extensive page about &lt;a href=&quot;http://wiki.openstack.org/Ceilometer/Contributing&quot;&gt;how you can contribute to Ceilometer&lt;/a&gt;, from updating the documentation to fixing bugs. There&apos;s a lot you can do. We have good support for Ceilometer built into &lt;a href=&quot;http://devstack.org&quot;&gt;Devstack&lt;/a&gt;, so installing a development platform is really easy.&lt;/p&gt;
&lt;p&gt;The main goal for this bug day will be to put Ceilometer in the best possible shape before the &lt;em&gt;grizzly-rc1&lt;/em&gt; release arrives (14th March 2013). This version of Ceilometer &lt;em&gt;should&lt;/em&gt; be the last one before the final Grizzly release, so it&apos;s a pretty important one.&lt;/p&gt;
&lt;p&gt;We&apos;ll be hanging out on the &lt;em&gt;#openstack-metering&lt;/em&gt; IRC channel on &lt;a href=&quot;http://freenode.net&quot;&gt;Freenode&lt;/a&gt;, as usual, so feel free to come by and join us!&lt;/p&gt;
</content:encoded><category>openstack</category></item><item><title>OpenStack Ceilometer and Heat projects graduated</title><link>https://julien.danjou.info/blog/openstack-ceilometer-heat-graduated/</link><guid isPermaLink="true">https://julien.danjou.info/blog/openstack-ceilometer-heat-graduated/</guid><description>The OpenStack Technical Committee has voted these last weeks about graduation of Heat and Ceilometer, to change their status from incubation to integrated.</description><pubDate>Wed, 27 Feb 2013 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/openstack-tech-committee.jpg&quot; alt=&quot;openstack-tech-committee&quot; /&gt;&lt;/p&gt;
&lt;p&gt;The &lt;a href=&quot;http://www.openstack.org/foundation/technical-committee/&quot;&gt;OpenStack Technical Committee&lt;/a&gt; has voted these last weeks about graduation of &lt;a href=&quot;https://launchpad.net/heat&quot;&gt;Heat&lt;/a&gt; and &lt;a href=&quot;http://launchpad.net/ceilometer&quot;&gt;Ceilometer&lt;/a&gt;, to change their status from &lt;strong&gt;incubation&lt;/strong&gt; to &lt;strong&gt;integrated&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;The details of the discussion can be found in the &lt;a href=&quot;http://eavesdrop.openstack.org/meetings/tc/2013/&quot;&gt;TC IRC meetings logs&lt;/a&gt; for the brave. The results are:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Approve graduation of Heat (to be integrated in common Havana release)? yes: 10, abstain: 1, no: 1&lt;/li&gt;
&lt;li&gt;Approve graduation of Ceilometer (to be integrated in common Havana release)? yes: 11, abstain: 1&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Therefore both projects have been graduated from &lt;em&gt;Incubation&lt;/em&gt; to &lt;em&gt;Integrated&lt;/em&gt; status. That means Heat and Ceilometer will be released as part of OpenStack for the next release cycle, &lt;em&gt;Havana&lt;/em&gt;, due in Autumn 2013.&lt;/p&gt;
&lt;p&gt;For the curious, we the Ceilometer team put up a &lt;a href=&quot;https://wiki.openstack.org/wiki/Ceilometer/Graduation&quot;&gt;nice wiki page about our status and why we thought we were ready to jump&lt;/a&gt;, and the &lt;a href=&quot;https://wiki.openstack.org/wiki/Governance/Foundation/TechnicalCommittee&quot;&gt;OpenStack Technical Committee charter&lt;/a&gt; has some explanations about the incubation and integration process.&lt;/p&gt;
&lt;h2&gt;What about Grizzly?&lt;/h2&gt;
&lt;p&gt;Both projects will be released with Grizzly too, obviously, since they already follow the release process of OpenStack.&lt;/p&gt;
&lt;h2&gt;What about core?&lt;/h2&gt;
&lt;p&gt;The question that has been raised to me several times is whether this means the projects are becoming &lt;em&gt;Core&lt;/em&gt; projects. The answer is no, because how to become a &lt;em&gt;Core&lt;/em&gt; project is still under discussion and is more a matter for the &lt;em&gt;Board of Directors&lt;/em&gt; than the &lt;em&gt;Technical Committee&lt;/em&gt;. But this is definitely a step in that direction.&lt;/p&gt;
&lt;p&gt;Anyway, from a technical point of view, this means both projects are now onboard with other OpenStack components so you can enjoy them!&lt;/p&gt;
</content:encoded><category>openstack</category></item><item><title>Cloud tools for Debian</title><link>https://julien.danjou.info/blog/cloud-init-utils-debian/</link><guid isPermaLink="true">https://julien.danjou.info/blog/cloud-init-utils-debian/</guid><description>Recently, I&apos;ve worked on the cloud utilities that are provided as standard in Ubuntu, and I ported them to Debian. Let&apos;s see how that brings Debian to the cloud!  Basics of a cloud image When starting</description><pubDate>Wed, 13 Feb 2013 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Recently, I&apos;ve worked on the cloud utilities that are provided as standard in Ubuntu, and I ported them to Debian. Let&apos;s see how that brings Debian to the cloud!&lt;/p&gt;
&lt;h2&gt;Basics of a cloud image&lt;/h2&gt;
&lt;p&gt;When starting an instance on an IaaS platform, your instance image is raw and unconfigured. Therefore, you need a way to configure it automagically at boot time, based on what you want to do with it. Usually, IaaS platforms provide a metadata server for this, like &lt;a href=&quot;http://aws.amazon.com/ec2&quot;&gt;Amazon EC2&lt;/a&gt; does. It&apos;s a special HTTP server listening on a special, hard-coded IP address that your instance can request to learn basic information about itself, like its hostname, and retrieve basic user metadata to auto-configure itself. You can check the &lt;a href=&quot;http://docs.openstack.org/trunk/openstack-compute/admin/content/metadata-service.html&quot;&gt;documentation about the OpenStack metadata service&lt;/a&gt; for more information.&lt;/p&gt;
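&lt;p&gt;As a quick illustration, fetching a key from such a metadata server is nothing more than an HTTP GET on the well-known address; the EC2-style endpoint, which OpenStack also implements, lives at 169.254.169.254. The small helper below is my own sketch, not part of any of the tools discussed here:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;METADATA_BASE = &apos;http://169.254.169.254/latest/meta-data/&apos;

def metadata_url(key):
    # Build the URL for a metadata key such as &apos;hostname&apos;
    # or &apos;instance-id&apos;.
    return METADATA_BASE + key

# From inside an instance, you would then simply do:
#   urllib2.urlopen(metadata_url(&apos;hostname&apos;)).read()
&lt;/code&gt;&lt;/pre&gt;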
&lt;p&gt;Also, images have a predefined size set at upload time. So when you run one on a platform, the disk size you request is usually bigger than the size of your image disk: you may need to resize and grow your image to use the full disk space that is allocated to your instance.&lt;/p&gt;
&lt;h2&gt;Needed tools&lt;/h2&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/debian-cloud-1.jpg&quot; alt=&quot;debian-cloud-1&quot; /&gt;&lt;/p&gt;
&lt;p&gt;To run on a cloud platform, and especially &lt;a href=&quot;http://aws.amazon.com/ec2&quot;&gt;Amazon EC2&lt;/a&gt; or &lt;a href=&quot;http://openstack.org&quot;&gt;OpenStack&lt;/a&gt;, you need to configure and update your image based on the context it&apos;s started in. This also includes extending your template image disk to use the full available disk size provided to the running instance.&lt;/p&gt;
&lt;p&gt;Ubuntu provides a set of cloud utilities, which is actually composed of different source packages (&lt;em&gt;cloud-init&lt;/em&gt;, &lt;em&gt;cloud-utils&lt;/em&gt; and &lt;em&gt;cloud-initramfs-tools&lt;/em&gt;).&lt;/p&gt;
&lt;p&gt;Combined, these 3 packages allow you to run a number of steps, from disk resizing at boot time to Puppet configuration handling.&lt;/p&gt;
&lt;p&gt;So &lt;em&gt;Ubuntu&lt;/em&gt; got this working right a long time ago, but unfortunately, Debian was really late on that.&lt;/p&gt;
&lt;p&gt;Until now.&lt;/p&gt;
&lt;p&gt;I&apos;ve worked on getting these into Debian, and you can now find these 3 packages adapted and uploaded to Debian sid.&lt;/p&gt;
&lt;p&gt;All you need to do is build a Debian image and then run:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;apt-get install cloud-init cloud-utils cloud-initramfs-growroot
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;And voilà: at the next reboot, your instance will extend its root partition size to the full available disk size, and ask the metadata server to configure things like its hostname.&lt;/p&gt;
&lt;p&gt;The package sources are available on Debian&apos;s git server for &lt;a href=&quot;http://anonscm.debian.org/gitweb/?p=collab-maint/cloud-utils.git;a=summary&quot;&gt;cloud-utils&lt;/a&gt; and &lt;a href=&quot;http://anonscm.debian.org/gitweb/?p=collab-maint/cloud-initramfs-tools.git;a=summary&quot;&gt;cloud-initramfs-tools&lt;/a&gt;, and you can build them yourself until the packages are processed by ftp-master and get out of the &lt;a href=&quot;http://ftp-master.debian.org/new.html&quot;&gt;NEW queue&lt;/a&gt;. cloud-init, on the other hand, is directly &lt;a href=&quot;http://packages.debian.org/search?keywords=cloud-init&quot;&gt;available in sid&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;One of the next steps would probably be to build or enhance a tool like &lt;a href=&quot;https://launchpad.net/vmbuilder&quot;&gt;vmbuilder&lt;/a&gt; to be able to build cloud-compatible Debian images with a simple command line.&lt;/p&gt;
</content:encoded><category>debian</category><category>openstack</category></item><item><title>Extending Swift with middleware: example with ClamAV</title><link>https://julien.danjou.info/blog/extending-swift-with-a-middleware-clamav/</link><guid isPermaLink="true">https://julien.danjou.info/blog/extending-swift-with-a-middleware-clamav/</guid><description>In this article, I&apos;m going to explain you how you can extend Swift, the OpenStack Object Storage project, so it performs extra action on files at upload or at download time.</description><pubDate>Tue, 22 Jan 2013 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;In this article, I&apos;m going to explain you how you can extend &lt;a href=&quot;http://launchpad.net/swift&quot;&gt;Swift&lt;/a&gt;, the &lt;a href=&quot;http://openstack.org&quot;&gt;OpenStack&lt;/a&gt; Object Storage project, so it performs extra action on files at upload or at download time.&lt;/p&gt;
&lt;p&gt;We&apos;re going to build an anti-virus filter inside Swift. The goal is to refuse uploaded data if it contains a virus. To help us with virus analysis, we&apos;ll use &lt;a href=&quot;http://www.clamav.net&quot;&gt;ClamAV&lt;/a&gt;.&lt;/p&gt;
&lt;h2&gt;WSGI, paste and middleware&lt;/h2&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/lolcat-tube.jpg&quot; alt=&quot;lolcat-tube&quot; /&gt;&lt;/p&gt;
&lt;p&gt;To do our content analysis, the best place to hook into the Swift architecture is at the beginning of every request, on &lt;strong&gt;swift-proxy&lt;/strong&gt;, before the file is actually stored on the cluster. The Swift proxy uses, like many other OpenStack projects, &lt;a href=&quot;https://pypi.python.org/pypi/Paste&quot;&gt;paste&lt;/a&gt; to build its HTTP architecture.&lt;/p&gt;
&lt;p&gt;Paste uses WSGI and provides an architecture based on a pipeline. The pipeline is composed of a succession of middleware, ending with one application. Each middleware gets the chance to look at the request or the response, can modify it, and then passes it on to the following middleware. The last component of the pipeline is the real application, in this case the Swift proxy server.&lt;/p&gt;
&lt;p&gt;If you&apos;ve already deployed Swift, you encountered a default pipeline in the &lt;em&gt;swift-proxy.conf&lt;/em&gt; configuration file:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;[pipeline:main]
pipeline = catch_errors healthcheck cache ratelimit tempauth proxy-logging proxy-server
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This is a really basic pipeline with a few middleware. The first one catches errors, the second one is in charge of returning a &lt;em&gt;200 OK&lt;/em&gt; response if you send a &lt;code&gt;GET /healthcheck&lt;/code&gt; request to your proxy server. The third one is in charge of caching, the fourth one is used for rate limiting, the fifth for authentication, the sixth for logging, and the final one is the actual proxy server, in charge of proxying the request to the account, container, or object servers (the other components of Swift). Of course, we could remove or add any middleware here at our convenience.&lt;/p&gt;
&lt;p&gt;Be aware that the order matters: for example, if you put &lt;em&gt;healthcheck&lt;/em&gt; after &lt;em&gt;tempauth&lt;/em&gt;, you won&apos;t be able to access the &lt;em&gt;/healthcheck&lt;/em&gt; URL without being authenticated!&lt;/p&gt;
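&lt;p&gt;To make this concrete, here is a minimal sketch in plain Python of how such a pipeline gets composed: each filter factory wraps the next element, so the first name in the pipeline ends up as the outermost layer and sees the request first. This illustrates the principle only; the real assembly is done by paste.deploy, and the toy filter names below are mine:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;def build_pipeline(filters, app):
    # Wrap the application with each filter, last one first, so
    # that filters[0] ends up outermost and sees the request first.
    for make_filter in reversed(filters):
        app = make_filter(app)
    return app

# Two toy filters that record their passage in the environ:
def healthcheck_filter(app):
    def middleware(env, start_response):
        env.setdefault(&apos;trace&apos;, []).append(&apos;healthcheck&apos;)
        return app(env, start_response)
    return middleware

def auth_filter(app):
    def middleware(env, start_response):
        env.setdefault(&apos;trace&apos;, []).append(&apos;auth&apos;)
        return app(env, start_response)
    return middleware
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;With &lt;code&gt;build_pipeline([healthcheck_filter, auth_filter], app)&lt;/code&gt;, a request traverses &lt;em&gt;healthcheck&lt;/em&gt; before &lt;em&gt;auth&lt;/em&gt;, which is exactly why the order of names on the &lt;code&gt;pipeline =&lt;/code&gt; line matters.&lt;/p&gt;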
&lt;h2&gt;ClamAV&lt;/h2&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/clamav.png&quot; alt=&quot;clamav&quot; /&gt;&lt;/p&gt;
&lt;p&gt;If you don&apos;t know &lt;a href=&quot;http://clamav.org&quot;&gt;ClamAV&lt;/a&gt;, it&apos;s an antivirus engine designed for detecting trojans, viruses, malware and other malicious threats. We&apos;re going to use it to scan every incoming file. To build the middleware, we&apos;ll use the Python binding &lt;a href=&quot;http://pypi.python.org/pypi/clamd&quot;&gt;pyclamd&lt;/a&gt;. The API is quite simple, see:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&amp;gt;&amp;gt;&amp;gt; import pyclamd
&amp;gt;&amp;gt;&amp;gt; pyclamd.init_unix_socket(&apos;/var/run/clamav/clamd.ctl&apos;)
&amp;gt;&amp;gt;&amp;gt; print pyclamd.scan_stream(pyclamd.EICAR)
{&apos;stream&apos;: &apos;Eicar-Test-Signature(44d88612fea8a8f36de82e1278abb02f:68)&apos;}
&amp;gt;&amp;gt;&amp;gt; print pyclamd.scan_stream(&quot;safe!&quot;)
None
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Anatomy of a WSGI middleware&lt;/h2&gt;
&lt;p&gt;Your WSGI middleware should consist of a callable object. Usually this is done with a class implementing the &lt;em&gt;__call__&lt;/em&gt; method. Here&apos;s a basic boilerplate:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;class SwiftClamavMiddleware(object):
    &quot;&quot;&quot;Middleware doing virus scan for Swift.&quot;&quot;&quot;

    def __init__(self, app, conf):
        # app is the final application
        self.app = app

    def __call__(self, env, start_response):
        return self.app(env, start_response)

def filter_factory(global_conf, **local_conf):
    conf = global_conf.copy()
    conf.update(local_conf)

    def clamav_filter(app):
        return SwiftClamavMiddleware(app, conf)
    return clamav_filter
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I&apos;m not going to expand more on why this is built this way, but if you want to have more info on this kind of filter middleware, you can read &lt;a href=&quot;http://pythonpaste.org/deploy/#paste-filter-factory&quot;&gt;their documentation on Paste&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;As it stands, this middleware does nothing: it simply passes every request it receives to the final application and returns the result.&lt;/p&gt;
&lt;h2&gt;Testing our basic middleware&lt;/h2&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/lolcat-testing.jpg&quot; alt=&quot;lolcat-testing&quot; /&gt;&lt;/p&gt;
&lt;p&gt;Now is a really good time to add unit tests. I hope you didn&apos;t think we were going to write code without some tests, right? It&apos;s really easy to test a middleware, as we&apos;re going to use &lt;a href=&quot;http://webob.org/&quot;&gt;WebOb&lt;/a&gt; for that.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import unittest
from webob import Request, Response

class FakeApp(object):
    def __call__(self, env, start_response):
        return Response(body=&quot;FAKE APP&quot;)(env, start_response)

class TestSwiftClamavMiddleware(unittest.TestCase):

    def setUp(self):
        self.app = SwiftClamavMiddleware(FakeApp(), {})

    def test_simple_request(self):
        resp = Request.blank(&apos;/&apos;,
                             environ={
                                 &apos;REQUEST_METHOD&apos;: &apos;GET&apos;,
                             }).get_response(self.app)
        self.assertEqual(resp.body, &quot;FAKE APP&quot;)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We create a FakeApp class that represents a fake WSGI application. You could also use a real application, or write a fake application that looks like the one you want to test. It&apos;ll require more time, but your tests will be closer to reality.&lt;/p&gt;
&lt;p&gt;Here we write the simplest test we can for our middleware. We&apos;re just sending a &lt;em&gt;GET /&lt;/em&gt; request to it, so it passes the request to the final application and returns the result. It is transparent, it does nothing.&lt;/p&gt;
&lt;p&gt;Now, with that solid base, we&apos;ll be able to add more features and test them incrementally.&lt;/p&gt;
&lt;h2&gt;Plugging ClamAV in&lt;/h2&gt;
&lt;p&gt;With our base ready, we can start thinking about how to plug ClamAV in. What we want to check here is the content of the file when it&apos;s uploaded. If we refer to the &lt;a href=&quot;http://docs.openstack.org/api/openstack-object-storage/1.0/content/&quot;&gt;OpenStack object storage API&lt;/a&gt;, a file upload is done via a &lt;em&gt;PUT&lt;/em&gt; request, so we&apos;re going to limit the check to that kind of request. Obviously, more checks could be added, but we&apos;ll keep things simple here for the sake of comprehensibility.&lt;/p&gt;
&lt;p&gt;With WSGI, the content of the request is available in &lt;code&gt;env[&apos;wsgi.input&apos;]&lt;/code&gt; as an object implementing a file interface. We&apos;ll scan that stream with ClamAV to check for viruses.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import pyclamd
from cStringIO import StringIO
from webob import Response

class SwiftClamavMiddleware(object):
    &quot;&quot;&quot;Middleware doing virus scan for Swift.&quot;&quot;&quot;

    def __init__(self, app, conf):
        pyclamd.init_unix_socket(&apos;/var/run/clamav/clamd.ctl&apos;)
        # app is the final application
        self.app = app

    def __call__(self, env, start_response):
        if env[&apos;REQUEST_METHOD&apos;] == &quot;PUT&quot;:
            # We have to read the whole content in memory because pyclamd
            # forces us to, but this is a bad idea if the file is huge.
            body = env[&apos;wsgi.input&apos;].read()
            scan = pyclamd.scan_stream(body)
            if scan:
                return Response(status=403,
                                body=&quot;Virus %s detected&quot; % scan[&apos;stream&apos;],
                                content_type=&quot;text/plain&quot;)(env, start_response)
            # Reading consumed the stream: put the body back so the
            # proxy server still receives it.
            env[&apos;wsgi.input&apos;] = StringIO(body)
        return self.app(env, start_response)

def filter_factory(global_conf, **local_conf):
    conf = global_conf.copy()
    conf.update(local_conf)

    def clamav_filter(app):
        return SwiftClamavMiddleware(app, conf)
    return clamav_filter
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;That&apos;s it. We only check &lt;em&gt;PUT&lt;/em&gt; requests, and if there&apos;s a virus in the file, we return a 403 Forbidden error with the name of the detected virus, entirely bypassing the rest of the middleware chain and the application handling.&lt;/p&gt;
&lt;p&gt;Then, we can simply test it.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import unittest
from cStringIO import StringIO

import pyclamd
from webob import Request, Response

class FakeApp(object):
    def __call__(self, env, start_response):
        return Response(body=&quot;FAKE APP&quot;)(env, start_response)

class TestSwiftClamavMiddleware(unittest.TestCase):
    def setUp(self):
        self.app = SwiftClamavMiddleware(FakeApp(), {})

    def test_put_empty(self):
        resp = Request.blank(&apos;/v1/account/container/object&apos;,
                             environ={
                                 &apos;REQUEST_METHOD&apos;: &apos;PUT&apos;,
                             }).get_response(self.app)
        self.assertEqual(resp.body, &quot;FAKE APP&quot;)

    def test_put_no_virus(self):
        resp = Request.blank(&apos;/v1/account/container/object&apos;,
                             environ={
                                 &apos;REQUEST_METHOD&apos;: &apos;PUT&apos;,
                                 &apos;wsgi.input&apos;: StringIO(&apos;foobar&apos;)
                             }).get_response(self.app)
        self.assertEqual(resp.body, &quot;FAKE APP&quot;)

    def test_put_virus(self):
        resp = Request.blank(&apos;/v1/account/container/object&apos;,
                             environ={
                                 &apos;REQUEST_METHOD&apos;: &apos;PUT&apos;,
                                 &apos;wsgi.input&apos;: StringIO(pyclamd.EICAR)
                             }).get_response(self.app)
        self.assertEqual(resp.status_code, 403)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The first test, &lt;em&gt;test_put_empty&lt;/em&gt;, simulates an empty &lt;em&gt;PUT&lt;/em&gt; request. The second one, &lt;em&gt;test_put_no_virus&lt;/em&gt;, simulates a regular &lt;em&gt;PUT&lt;/em&gt; request with a simple file containing no virus.&lt;/p&gt;
&lt;p&gt;Finally, the third and last test simulates the upload of a virus using the &lt;a href=&quot;http://www.eicar.org/&quot;&gt;EICAR&lt;/a&gt; test file. This is a special test file that is recognized as a virus, even though it&apos;s not a real one. Very handy for testing virus detection software!&lt;/p&gt;
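&lt;p&gt;For reference, the EICAR test string is public and can be assembled at runtime; pyclamd also ships it as the &lt;code&gt;pyclamd.EICAR&lt;/code&gt; constant used in the test above. A small standalone sketch:&lt;/p&gt;

```python
# The EICAR test string, assembled from two halves at runtime so that
# this source file does not itself get flagged by antivirus scanners.
# pyclamd exposes the same string as the pyclamd.EICAR constant.
EICAR = ('X5O!P%@AP[4\\PZX54(P^)7CC)7}$'
         'EICAR-STANDARD-ANTIVIRUS-TEST-FILE!$H+H*')

# The standard test file is exactly 68 bytes long.
print(len(EICAR))  # 68
```

Any ClamAV-compatible scanner should flag a stream containing this string, which makes it a safe stand-in for a real virus in tests.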
&lt;h2&gt;Configuring Swift proxy&lt;/h2&gt;
&lt;p&gt;Our middleware is ready! We can configure Swift&apos;s proxy server to use it. We need to add the following lines to our &lt;em&gt;swift-proxy.conf&lt;/em&gt; to teach it how to load the filter:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;[filter:clamav]
paste.filter_factory = swiftclamav:filter_factory
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We&apos;ll assume that our Python module is named &lt;em&gt;swiftclamav&lt;/em&gt; here. Now that we&apos;ve defined our filter and how to load it, we can use it in our pipeline:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;[pipeline:main]
pipeline = catch_errors healthcheck cache ratelimit tempauth clamav proxy-logging proxy-server
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Just before reaching the &lt;em&gt;proxy-server&lt;/em&gt;, and after the user has been authenticated, the content will be scanned for viruses. It&apos;s important to put this filter after authentication, because otherwise we might scan content that will then get rejected by the &lt;em&gt;tempauth&lt;/em&gt; middleware, thus scanning for nothing!&lt;/p&gt;
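&lt;p&gt;To picture what the pipeline line does: paste builds the chain by wrapping the application right to left, so a request traverses the filters left to right. A minimal sketch of that mechanism, with toy factories rather than the real Swift ones:&lt;/p&gt;

```python
# Minimal sketch of how a paste-style pipeline composes middleware.
# Filters listed before the app wrap it from the inside out, so a
# request traverses them left to right. The factories here are toy
# illustrations, not the real Swift ones.

def make_filter(name):
    """Build a filter factory that records the traversal order."""
    def filter_factory(app):
        def middleware(environ, start_response):
            environ.setdefault('traversed', []).append(name)
            return app(environ, start_response)
        return middleware
    return filter_factory

def proxy_server(environ, start_response):
    environ.setdefault('traversed', []).append('proxy-server')
    return ['OK']

# pipeline = tempauth clamav proxy-server (shortened for the example)
pipeline = [make_filter('tempauth'), make_filter('clamav')]
app = proxy_server
for factory in reversed(pipeline):
    app = factory(app)

environ = {}
app(environ, lambda *args: None)
print(environ['traversed'])  # ['tempauth', 'clamav', 'proxy-server']
```

This is why the clamav filter only sees requests that already made it past tempauth.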
&lt;h2&gt;Beyond scanning&lt;/h2&gt;
&lt;p&gt;And voilà, we now have a simple middleware checking uploaded content and refusing infected files. We could enhance it with various other things, like configuration handling, but I&apos;ll leave that as an exercise for the interested readers.&lt;/p&gt;
&lt;p&gt;We didn&apos;t exploit it here, but note that you can also manipulate request headers and modify them if needed. For example, we could have added a header &lt;em&gt;X-Object-Meta-Scanned-By: ClamAV&lt;/em&gt; to indicate that the file has been scanned by ClamAV.&lt;/p&gt;
&lt;p&gt;You should now be able to build your own middleware doing whatever you want with uploaded data. Happy hacking!&lt;/p&gt;
</content:encoded><category>openstack</category><category>security</category></item><item><title>Ceilometer bug squash day #1</title><link>https://julien.danjou.info/blog/ceilometer-bug-squash-day-1/</link><guid isPermaLink="true">https://julien.danjou.info/blog/ceilometer-bug-squash-day-1/</guid><description>In order to start the year in a good mood, what&apos;s better than squashing some bugs on OpenStack?</description><pubDate>Mon, 24 Dec 2012 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;In order to start the year in a good mood, what&apos;s better than squashing some bugs on OpenStack?&lt;/p&gt;
&lt;p&gt;Therefore, the Ceilometer team is pleased &lt;a href=&quot;http://lists.openstack.org/pipermail/openstack-dev/2012-December/004161.html&quot;&gt;to announce&lt;/a&gt; that it is organizing a &lt;a href=&quot;http://wiki.openstack.org/Ceilometer/BugSquashingDay/20130104&quot;&gt;bug squashing day on Friday, 4th January 2013&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;We wrote an extensive page about &lt;a href=&quot;http://wiki.openstack.org/Ceilometer/Contributing&quot;&gt;how you can contribute to Ceilometer&lt;/a&gt;, from updating the documentation to fixing bugs. There&apos;s a lot you can do. We have good support for Ceilometer built into &lt;a href=&quot;http://devstack.org&quot;&gt;Devstack&lt;/a&gt;, so installing a development platform is really easy.&lt;/p&gt;
&lt;p&gt;The main goal of this bug day will be to put Ceilometer in the best possible shape before the &lt;em&gt;grizzly-2&lt;/em&gt; milestone arrives (10th January 2013). This version of Ceilometer will aim to keep compatibility with &lt;em&gt;Folsom&lt;/em&gt;, so early deployers can enjoy some of our new features before upgrading to &lt;em&gt;Grizzly&lt;/em&gt;. After that date, we&apos;ll start merging more extensive changes.&lt;/p&gt;
&lt;p&gt;We&apos;ll be hanging out on the &lt;em&gt;#openstack-metering&lt;/em&gt; IRC channel on &lt;a href=&quot;http://freenode.net&quot;&gt;Freenode&lt;/a&gt;, as usual, so feel free to come by and join us!&lt;/p&gt;
</content:encoded><category>openstack</category></item><item><title>OpenStack France meetup #2</title><link>https://julien.danjou.info/blog/openstack-france-meetup-2/</link><guid isPermaLink="true">https://julien.danjou.info/blog/openstack-france-meetup-2/</guid><description>I was at the OpenStack France meetup 2 yesterday evening.</description><pubDate>Tue, 06 Nov 2012 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;I was at the &lt;a href=&quot;http://www.meetup.com/OpenStack-France/events/84177022/&quot;&gt;OpenStack France meetup 2&lt;/a&gt; yesterday evening.&lt;/p&gt;
&lt;p&gt;This has been a wonderful evening, talking about OpenStack with around 30-40 people. &lt;a href=&quot;http://nicolas.barcet.com/&quot;&gt;Nick Barcet&lt;/a&gt; and I presented &lt;a href=&quot;http://launchpad.net/ceilometer&quot;&gt;Ceilometer&lt;/a&gt; and received some good feedback about it.&lt;/p&gt;
&lt;p&gt;We should also thank &lt;a href=&quot;http://www.nebula.com/&quot;&gt;Nebula&lt;/a&gt;, who sponsored the evening, and &lt;a href=&quot;http://erwan.com/&quot;&gt;Erwan Gallen&lt;/a&gt; for the nice organization; free beers are always enjoyable.&lt;/p&gt;
</content:encoded><category>openstack</category><category>talks</category></item><item><title>Inside Synaps, a CloudWatch-like implementation for OpenStack</title><link>https://julien.danjou.info/blog/openstack-synaps-exploration/</link><guid isPermaLink="true">https://julien.danjou.info/blog/openstack-synaps-exploration/</guid><description>A few days ago, Samsung released the source code of Synaps, an implementation of the Amazon Web Service CloudWatch API for OpenStack.</description><pubDate>Mon, 22 Oct 2012 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;A few days ago, &lt;a href=&quot;http://www.samsung.com/&quot;&gt;Samsung&lt;/a&gt; released the source code of &lt;a href=&quot;https://github.com/spcs/synaps&quot;&gt;Synaps&lt;/a&gt;, an implementation of the &lt;a href=&quot;http://aws.amazon.com/cloudwatch/&quot;&gt;Amazon Web Service CloudWatch API&lt;/a&gt; for &lt;a href=&quot;http://openstack.org&quot;&gt;OpenStack&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Being a developer on the &lt;a href=&quot;http://launchpad.net/ceilometer&quot;&gt;Ceilometer&lt;/a&gt; project, I&apos;ve been curious to look on this project and how it could overlap with Ceilometer or other projects like &lt;a href=&quot;http://www.heat-api.org&quot;&gt;Heat&lt;/a&gt;.&lt;/p&gt;
&lt;h2&gt;What is CloudWatch?&lt;/h2&gt;
&lt;p&gt;CloudWatch is a monitoring system provided by Amazon on its Web Services platform to monitor services. It allows you to get notifications and trigger actions when certain thresholds are crossed.&lt;/p&gt;
&lt;p&gt;For example, it can be used to scale your architecture: by monitoring the number of requests it receives and its general load, you can start new servers automatically.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/cloudwatch.jpg&quot; alt=&quot;cloudwatch&quot; /&gt;&lt;/p&gt;
&lt;h2&gt;Synaps&lt;/h2&gt;
&lt;p&gt;Synaps is written in around 7k lines of Python (28% of which are comments), reuses at least one common OpenStack module (&lt;em&gt;openstack.common.cfg&lt;/em&gt;) and copies some modules from Nova. One thing that strikes me is that there seem to be only a few unit tests compared to most OpenStack projects. Also, many parts of the code and documentation contain text written in Korean, which won&apos;t be very helpful for most people! :-) It uses some external technologies: &lt;a href=&quot;http://storm-project.net/&quot;&gt;Storm&lt;/a&gt; for stream processing, &lt;a href=&quot;http://cassandra.apache.org/&quot;&gt;Cassandra&lt;/a&gt; to store its persistent data, and &lt;a href=&quot;http://pandas.pydata.org/&quot;&gt;Pandas&lt;/a&gt; to do data analysis.&lt;/p&gt;
&lt;p&gt;The API server provides an EC2-compatible API only: no OpenStack-specific API. This is probably not a bad thing for now, since I am not aware of any work in that direction. The API accesses the Cassandra back-end directly for read operations, but relies on RPC to do writes. This way, a set of daemons handles the writes using the Storm part of Synaps and does the data aggregation. Authentication only supports LDAP, but it should still be possible to add a driver for Keystone.&lt;/p&gt;
&lt;p&gt;A Java and a Python SDK are provided to record metrics into Synaps, but&lt;br /&gt;
there&apos;s not enough documentation for it to be useful.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/SynapsDeployment-1.jpg&quot; alt=&quot;SynapsDeployment-1&quot; /&gt;&lt;/p&gt;
&lt;h2&gt;Overlap with Heat&lt;/h2&gt;
&lt;p&gt;For now, there&apos;s not a lot of overlap with Heat, because Heat does not implement the CloudWatch API completely and still lacks a lot of its functions. But as soon as it implements the API fully, the overlap with Synaps will be complete in this regard.&lt;/p&gt;
&lt;p&gt;One divergence point however, is that Heat uses RPC to access data from the storage back-end via its engine (the central daemon), whereas Synaps directly connects to Cassandra. Also, Heat relies on SQLAlchemy, like most OpenStack projects needing a database.&lt;/p&gt;
&lt;h2&gt;Overlap with Ceilometer&lt;/h2&gt;
&lt;p&gt;One of the goals of Ceilometer is to provide data probes and pollsters for all OpenStack components (Nova, Swift, Quantum…), whereas Synaps lets OpenStack users put any kind of metric inside it, and therefore doesn&apos;t provide any probes for now.&lt;/p&gt;
&lt;p&gt;But the storage of metrics is the main common point between Synaps and Ceilometer. Synaps chose only one technology, Cassandra, to store its metrics, whereas Ceilometer took care of building an abstraction layer for the storage engine. Ceilometer currently allows an operator to use SQL or MongoDB, but Cassandra could likely be added.&lt;/p&gt;
&lt;p&gt;Data metric consolidation is done by Synaps. This makes sense, since Synaps doesn&apos;t need the full data history to trigger alarms. On the contrary, Ceilometer needs the full history to allow things like billing, and doesn&apos;t do any aggregation on data.&lt;/p&gt;
&lt;p&gt;Also, in Synaps, the data analysis is done using Pandas. This means the data are retrieved from the Cassandra back-end and then transformed by Pandas inside Synaps into something else. It&apos;s likely that in such a case, Synaps should use CQL to achieve that. Ceilometer manipulates the data near their storage: the computations are done by the back-end to be efficient (SQL, map-reduce…).&lt;/p&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;Considering Samsung open-sourced Synaps late in the development process, I don&apos;t feel like they aimed to have it become a core component. This is always sad, because the effort put into this implementation is big, and it would probably have cost little to add some abstraction layers to follow what the other OpenStack projects do. But this takes time and energy, and it&apos;s understandable that Samsung didn&apos;t want to achieve this in a short time frame.&lt;/p&gt;
&lt;p&gt;There&apos;s a part of the code and architecture that overlaps with Ceilometer and Heat. Ceilometer is becoming a specialized point to store data metrics from any source: so it&apos;s sad, but understandable, that Synaps did not try to reuse it. Fortunately, Heat is working with Ceilometer to achieve exactly that. This means OpenStack would have only one metrics storage point, used for billing, monitoring and alarming.&lt;/p&gt;
&lt;p&gt;Therefore, I think Synaps is an implementation of CloudWatch that should be looked at as an inspiration for Heat and Ceilometer to build a better and more integrated solution!&lt;/p&gt;
</content:encoded><category>openstack</category></item><item><title>Ceilometer 0.1 released</title><link>https://julien.danjou.info/blog/ceilometer-0-1-released/</link><guid isPermaLink="true">https://julien.danjou.info/blog/ceilometer-0-1-released/</guid><description>After 6 months of development, we are proud to announce the first release of Ceilometer, the OpenStack Metering project.</description><pubDate>Fri, 12 Oct 2012 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;After 6 months of development, we are proud to announce the first release of &lt;a href=&quot;http://launchpad.net/ceilometer&quot;&gt;Ceilometer&lt;/a&gt;, the &lt;a href=&quot;http://openstack.org&quot;&gt;OpenStack&lt;/a&gt; Metering project. This is a first and amazing milestone for us: we follow all the other projects by releasing a version for Folsom!&lt;/p&gt;
&lt;p&gt;Using Ceilometer, you should now be able to meter your OpenStack cloud and retrieve its usage to build statistics or bill your customer!&lt;/p&gt;
&lt;p&gt;You can read &lt;a href=&quot;https://lists.launchpad.net/openstack/msg17410.html&quot;&gt;our announcement on the OpenStack mailing list&lt;/a&gt;.&lt;/p&gt;
&lt;h2&gt;Architecture&lt;/h2&gt;
&lt;p&gt;We spent a good amount of time defining and refining &lt;a href=&quot;http://ceilometer.readthedocs.org/en/latest/architecture.html#high-level-description&quot;&gt;our architecture&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/Ceilometer_Architecture-1.png&quot; alt=&quot;Ceilometer_Architecture-1&quot; /&gt;&lt;/p&gt;
&lt;p&gt;One of its important points is that it has been designed to work without modifying any of the existing core components. Patching OpenStack components in an intrusive way to meter them was not an option for now, simply because we had no legitimacy to do so. This may change in the future, and it will likely be discussed next week during the &lt;a href=&quot;http://www.openstack.org/summit/san-diego-2012/&quot;&gt;OpenStack Summit&lt;/a&gt;.&lt;/p&gt;
&lt;h2&gt;Meters&lt;/h2&gt;
&lt;p&gt;Initially, we defined a bunch of meters we&apos;d like to have for a first release, and in the end, most of them are available. Some of them are still missing, like OpenStack Object Storage (Swift) ones, mainly due to lack of interest from the involved parties so far.&lt;/p&gt;
&lt;p&gt;Anyhow, with this first release, you should be able to meter your instances, their network usage, memory, CPU. Images, networks and volumes and their CRUD operations are metered too. For more detail, you can read the &lt;a href=&quot;http://ceilometer.readthedocs.org/en/latest/measurements.html&quot;&gt;complete list of implemented meters&lt;/a&gt;.&lt;/p&gt;
&lt;h2&gt;REST API&lt;/h2&gt;
&lt;p&gt;The HTTP REST API has been partially implemented. The provided methods should allow basic integration with a billing system.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;http://dreamhost.com/&quot;&gt;DreamHost&lt;/a&gt; is using Ceilometer in their deployment architecture and coupling it with their billing system!&lt;/p&gt;
&lt;h2&gt;Towards Grizzly&lt;/h2&gt;
&lt;p&gt;We don&apos;t have a clear and established road-map for Grizzly yet.&lt;/p&gt;
&lt;p&gt;We already have a couple of patches waiting in the queue to be merged, like the use of &lt;a href=&quot;https://review.openstack.org/#/c/13989/&quot;&gt;Keystone to authenticate API request&lt;/a&gt; and the &lt;a href=&quot;https://review.openstack.org/#/c/14185/&quot;&gt;removal of Nova DB access&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;On my side, these last days I&apos;ve been working on a small debug user interface for the API. The Ceilometer API server will return this interface if you do an API request from a browser (i.e. requesting &lt;code&gt;text/html&lt;/code&gt; instead of &lt;code&gt;application/json&lt;/code&gt;).&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/ceilometer-debug-interface.png&quot; alt=&quot;ceilometer-debug-interface&quot; /&gt;&lt;/p&gt;
&lt;p&gt;I hope this will help newcomers discover the Ceilometer API more easily and leverage it to build powerful tools!&lt;/p&gt;
&lt;p&gt;Anyhow, we have tons of idea and work to do, and I&apos;m sure the upcoming weeks will be very interesting. Also, we hope to be able to become an OpenStack incubated project soon. So stay tuned!&lt;/p&gt;
</content:encoded><category>openstack</category></item><item><title>Ceilometer, the OpenStack metering project</title><link>https://julien.danjou.info/blog/openstack-metering-ceilometer/</link><guid isPermaLink="true">https://julien.danjou.info/blog/openstack-metering-ceilometer/</guid><description>For the last months, I&apos;ve been working on a metering project for OpenStack, so it&apos;s time to talk a bit about it.</description><pubDate>Fri, 27 Jul 2012 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;For the last months, I&apos;ve been working on a metering project for &lt;a href=&quot;http://openstack.org&quot;&gt;OpenStack&lt;/a&gt;, so it&apos;s time to talk a bit about it.&lt;/p&gt;
&lt;p&gt;OpenStack is a growing cloud platform providing IaaS. A problem easily identified by everyone building a public cloud platform is that nothing is provided to retrieve the platform usage data. Some data are available in some places, but not everything is, and you have to do a lot of processing across the various components to get something useful in the end. But in order to bill customers that are using your public cloud platform, you need to do this.&lt;/p&gt;
&lt;p&gt;In this regard, a lot of companies running public OpenStack-based infrastructures wrote their own solutions to cover this functional area and become able to bill their customers.&lt;/p&gt;
&lt;p&gt;To avoid everybody building and maintaining such a stack in their own corner, the &lt;a href=&quot;http://launchpad.net/ceilometer&quot;&gt;Ceilometer&lt;/a&gt; project has been created.&lt;/p&gt;
&lt;p&gt;The project aims to cover the metering aspect of the OpenStack components, pulling usage data from every component and storing it in a single place. It then offers a retrieval point for this data via a REST API.&lt;/p&gt;
&lt;p&gt;The &lt;a href=&quot;http://wiki.openstack.org/EfficientMetering&quot;&gt;initial specifications&lt;/a&gt; have been written in April this year, and actual implementation started in May. The project is currently worked on by me, Dreamhost and Canonical.&lt;/p&gt;
&lt;p&gt;We already have designed &lt;a href=&quot;http://wiki.openstack.org/EfficientMetering/ArchitectureProposalV1&quot;&gt;an architecture&lt;/a&gt; that we are implementing, and we hope to release a first usable version with Folsom.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/ceilometer-architecture-1.png&quot; alt=&quot;ceilometer-architecture-1&quot; /&gt;&lt;/p&gt;
&lt;p&gt;I did a presentation of this project yesterday at &lt;a href=&quot;http://xlcloud.org/&quot;&gt;XLCloud&lt;/a&gt;, which has been very well received.&lt;/p&gt;
&lt;p&gt;If you are interested in helping us and contributing, feel free to join us during one of our &lt;a href=&quot;http://wiki.openstack.org/Meetings/MeteringAgenda&quot;&gt;weekly IRC meeting&lt;/a&gt; or fix &lt;a href=&quot;https://bugs.launchpad.net/ceilometer&quot;&gt;some bugs&lt;/a&gt;. :-)&lt;/p&gt;
</content:encoded><category>openstack</category><category>talks</category></item><item><title>OpenStack Swift eventual consistency analysis &amp; bottlenecks</title><link>https://julien.danjou.info/blog/openstack-swift-consistency-analysis/</link><guid isPermaLink="true">https://julien.danjou.info/blog/openstack-swift-consistency-analysis/</guid><description>Swift is the software behind the OpenStack Object Storage service.</description><pubDate>Mon, 23 Apr 2012 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;a href=&quot;https://launchpad.net/swift&quot;&gt;Swift&lt;/a&gt; is the software behind the &lt;a href=&quot;http://openstack.org/projects/storage/&quot;&gt;OpenStack Object Storage&lt;/a&gt; service.&lt;/p&gt;
&lt;p&gt;This service provides a simple storage service for applications using &lt;a href=&quot;http://docs.openstack.org/api/openstack-object-storage/1.0/content/&quot;&gt;RESTful interfaces&lt;/a&gt;, providing maximum data availability and storage capacity.&lt;/p&gt;
&lt;p&gt;I explain here how some parts of the storage and replication in Swift work, and show some of their current limitations.&lt;/p&gt;
&lt;p&gt;If you don&apos;t know Swift and want to read a more &quot;shallow&quot; overview first, you can read John Dickinson&apos;s &lt;a href=&quot;http://programmerthoughts.com/openstack/swift-tech-overview/&quot;&gt;Swift Tech Overview&lt;/a&gt;.&lt;/p&gt;
&lt;h2&gt;How Swift storage works&lt;/h2&gt;
&lt;p&gt;If we refer to the &lt;a href=&quot;http://en.wikipedia.org/wiki/CAP_theorem&quot;&gt;CAP theorem&lt;/a&gt;, Swift chose &lt;strong&gt;availability&lt;/strong&gt; and &lt;strong&gt;partition tolerance&lt;/strong&gt; and dropped &lt;strong&gt;consistency&lt;/strong&gt;. That means that you&apos;ll always get your data, it will be dispersed in many places, but you could get an old version of it (or no data at all) in some odd cases (like a server overload or failure). This compromise is made to allow maximum availability and scalability of the storage platform.&lt;/p&gt;
&lt;p&gt;But there are mechanisms built into Swift to minimize the potential data inconsistency window: they are responsible for data replication and consistency.&lt;/p&gt;
&lt;p&gt;The &lt;a href=&quot;http://swift.openstack.org/&quot;&gt;official Swift documentation&lt;/a&gt; explains the internal storage in a certain way, but I&apos;m going to write my own explanation here about this.&lt;/p&gt;
&lt;h3&gt;Consistent hashing&lt;/h3&gt;
&lt;p&gt;Swift uses the principle of &lt;a href=&quot;http://en.wikipedia.org/wiki/Consistent_hashing&quot;&gt;consistent hashing&lt;/a&gt;. It builds what it calls a &lt;em&gt;ring&lt;/em&gt;. A ring represents the space of all possible computed hash values divided in equivalent parts. Each part of this space is called a &lt;em&gt;partition&lt;/em&gt;.&lt;/p&gt;
&lt;p&gt;The following schema (stolen from the &lt;a href=&quot;http://wiki.basho.com/&quot;&gt;Riak&lt;/a&gt; project) shows the principle nicely:&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/riak-ring.png&quot; alt=&quot;riak-ring&quot; /&gt;&lt;/p&gt;
&lt;p&gt;In a simple world, if you wanted to store some objects and distribute them on 4 nodes, you would split your hash space in 4. You would have 4 partitions, and computing &lt;em&gt;hash(object) modulo 4&lt;/em&gt; would tell you where to store your object: on node 0, 1, 2 or 3.&lt;/p&gt;
&lt;p&gt;But since you want to be able to extend your storage cluster to more nodes without breaking the whole hash mapping and moving everything around, you need to build a lot more partitions. Let&apos;s say we&apos;re going to build 2&lt;sup&gt;10&lt;/sup&gt; = 1024 partitions. Since we have 4 nodes, each node will have &lt;code&gt;2&lt;sup&gt;10&lt;/sup&gt; ÷ 4 = 256&lt;/code&gt; partitions. If we ever want to add a 5th node, it&apos;s easy: we just have to re-balance the partitions and move 1⁄5 of the partitions from each node to this 5th node. That means all our nodes will end up with &lt;code&gt;2&lt;sup&gt;10&lt;/sup&gt; ÷ 5 ≈ 204&lt;/code&gt; partitions. We can also define a &lt;em&gt;weight&lt;/em&gt; for each node, in order for some nodes to get more partitions than others.&lt;/p&gt;
&lt;p&gt;With 2&lt;sup&gt;10&lt;/sup&gt; partitions, we can have up to 1024 nodes in our cluster. Yeepee.&lt;/p&gt;
&lt;p&gt;For reference, Gregory Holt, one of the Swift authors, also wrote &lt;a href=&quot;http://greg.brim.net/page/building_a_consistent_hashing_ring.html&quot;&gt;an explanation post about the ring&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Concretely, when building one Swift ring, you&apos;ll have to say how many partitions you want, and this is what this value is really about.&lt;/p&gt;
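&lt;p&gt;A toy illustration of the principle (this is not Swift&apos;s real ring-builder code, just a sketch): objects hash to a fixed partition forever, and adding a node only moves a minimal share of the partitions to it.&lt;/p&gt;

```python
# Toy partition ring: objects map to a fixed set of partitions, and
# partitions (not objects) are assigned to nodes. A sketch of the
# principle, not Swift's actual ring-builder algorithm.
from collections import Counter
from hashlib import md5

NUM_PARTITIONS = 2 ** 10  # 1024 partitions, as in the example above

def partition_for(name):
    # Object -> partition mapping: stable whatever the node count is.
    return int(md5(name.encode()).hexdigest(), 16) % NUM_PARTITIONS

# Start with 4 nodes, 256 partitions each.
assignment = {p: p % 4 for p in range(NUM_PARTITIONS)}

def add_node(assignment, new_node):
    # Rebalance by moving only the minimum number of partitions
    # from the existing nodes to the new one.
    nodes = set(assignment.values()) | {new_node}
    target = len(assignment) // len(nodes)
    loads = Counter(assignment.values())
    moved = 0
    for p, node in assignment.items():
        if loads[new_node] >= target:
            break
        if loads[node] > target:
            loads[node] -= 1
            loads[new_node] += 1
            assignment[p] = new_node
            moved += 1
    return moved

moved = add_node(assignment, 4)
print(moved)  # 204: only ~1/5 of the 1024 partitions changed node
```

Note that `partition_for` never changes when nodes come and go; only the partition-to-node assignment does, which is exactly what makes the scheme cheap to re-balance.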
&lt;h3&gt;Data duplication&lt;/h3&gt;
&lt;p&gt;Now, to ensure availability and partition tolerance (as seen in the &lt;em&gt;CAP theorem&lt;/em&gt;) we also want to store replicas of our objects. By default, Swift stores 3 copies of every object, but that&apos;s configurable.&lt;/p&gt;
&lt;p&gt;In that case, we need to store each partition defined above not only on 1 node, but on 2 others too. So Swift adds another concept: zones. A zone is an isolated space that does not depend on the other zones, so in case of an outage in one zone, the other zones are still available. Concretely, a zone is likely to be a disk, a server, or a whole cabinet, depending on the size of your cluster. It&apos;s up to you to choose anyway.&lt;/p&gt;
&lt;p&gt;Consequently, each partition no longer has to be mapped to 1 host only, but to N hosts. Each node will therefore store this number of partitions:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;number of partitions stored on one node = number of replicas × total number of partitions ÷ number of nodes
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Examples:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;We split the ring in 2&lt;sup&gt;10&lt;/sup&gt; = 1024 partitions. We have 3 nodes. We want 3 replicas of data.&lt;br /&gt;
→ Each node will store a copy of the full partition space: &lt;code&gt;3 × 2&lt;sup&gt;10&lt;/sup&gt; ÷ 3 = 2&lt;sup&gt;10&lt;/sup&gt; = 1024 partitions&lt;/code&gt;.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;blockquote&gt;
&lt;p&gt;We split the ring in 2&lt;sup&gt;11&lt;/sup&gt; = 2048 partitions. We have 5 nodes. We want 3 replicas of data.&lt;br /&gt;
→ Each node will store &lt;code&gt;2&lt;sup&gt;11&lt;/sup&gt; × 3 ÷ 5 ≈ 1229 partitions&lt;/code&gt;.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;blockquote&gt;
&lt;p&gt;We split the ring in 2&lt;sup&gt;11&lt;/sup&gt; = 2048 partitions. We have 6 nodes. We want 3 replicas of data.&lt;br /&gt;
→ Each node will store &lt;code&gt;2&lt;sup&gt;11&lt;/sup&gt; × 3 ÷ 6 = 1024 partitions&lt;/code&gt;.&lt;/p&gt;
&lt;/blockquote&gt;
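&lt;p&gt;The arithmetic in these examples follows directly from the formula above and can be checked in a couple of lines:&lt;/p&gt;

```python
# Partitions stored per node = replicas × total partitions ÷ nodes,
# applied to the three examples above.
def partitions_per_node(power, nodes, replicas=3):
    return replicas * 2 ** power / nodes

print(partitions_per_node(10, 3))  # 1024.0: each node holds the full space
print(partitions_per_node(11, 5))  # 1228.8: about 1229 partitions per node
print(partitions_per_node(11, 6))  # 1024.0
```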
&lt;h3&gt;Three rings to rule them all&lt;/h3&gt;
&lt;p&gt;In Swift, there are 3 categories of things to store: &lt;em&gt;accounts&lt;/em&gt;, &lt;em&gt;containers&lt;/em&gt; and &lt;em&gt;objects&lt;/em&gt;.&lt;/p&gt;
&lt;p&gt;An &lt;strong&gt;account&lt;/strong&gt; is what you&apos;d expect it to be, a user account. An account contains &lt;strong&gt;containers&lt;/strong&gt; (the equivalent of Amazon S3&apos;s buckets). Each container can contain user-defined keys and values (just like a hash table or a dictionary): the values are what Swift calls &lt;strong&gt;objects&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;Swift wants you to build 3 different and independent rings to store its 3 kinds of things (&lt;em&gt;accounts&lt;/em&gt;, &lt;em&gt;containers&lt;/em&gt; and &lt;em&gt;objects&lt;/em&gt;).&lt;/p&gt;
&lt;p&gt;Internally, the first two categories are stored as &lt;a href=&quot;http://www.sqlite.org/&quot;&gt;SQLite&lt;/a&gt; databases, whereas the last one is stored using regular files.&lt;/p&gt;
&lt;p&gt;Note that these 3 rings can be stored and managed on 3 completely different sets of servers.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/openstack-swift-storage-1.png&quot; alt=&quot;openstack-swift-storage-1&quot; /&gt;&lt;/p&gt;
&lt;h2&gt;Data replication&lt;/h2&gt;
&lt;p&gt;Now that we have our storage theory in place (accounts, containers and objects distributed into partitions, themselves stored in multiple zones), let&apos;s move on to the replication practice.&lt;/p&gt;
&lt;p&gt;When you put something in one of the 3 rings (being an account, a container or an object) it is uploaded into all the zones responsible for the ring partition the object belongs to. This upload into the different zones is the responsibility of the &lt;em&gt;swift-proxy&lt;/em&gt; daemon.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/openstack-swift-replication.png&quot; alt=&quot;openstack-swift-replication&quot; /&gt;&lt;/p&gt;
&lt;p&gt;But if one of the zones is failing, you can&apos;t upload all your copies to all zones at upload time. So you need a mechanism to be sure the failing zone will catch up to a correct state at some point.&lt;/p&gt;
&lt;p&gt;That&apos;s the role of the &lt;em&gt;swift-{container,account,object}-replicator&lt;/em&gt; processes. These processes are &lt;strong&gt;running on each node that is part of a zone&lt;/strong&gt; and replicate their contents to the nodes of the other zones.&lt;/p&gt;
&lt;p&gt;When they run, they walk through all the contents of all the partitions on the whole file system, and for each partition, issue a special &lt;em&gt;REPLICATE&lt;/em&gt; HTTP request to all the other zones responsible for that same partition. The other zones respond with information about the local state of the partition. That allows the replicator process to decide whether the remote zone has an up-to-date version of the partition.&lt;/p&gt;
&lt;p&gt;In the case of accounts and containers, it doesn&apos;t check at the partition level, but checks each account/container contained inside each partition.&lt;/p&gt;
&lt;p&gt;If something is not up-to-date, it will be pushed using &lt;em&gt;rsync&lt;/em&gt; by the replicator process. This is why you&apos;ll read that the replication updates are &lt;em&gt;&quot;push based&quot;&lt;/em&gt; in Swift documentation.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;## Pseudo code describing replication process for accounts
## The principle is exactly the same for containers
for account in accounts:
    # Determine the partition used to store this account
    partition = hash(account) % number_of_partitions
    # The number of zone is the number of replicas configured
    for zone in partition.get_zones_storing_this_partition():
        # Send a HTTP REPLICATE command to the remote swift-account-server process
        version_of_account = zone.send_HTTP_REPLICATE_for(account)
        if version_of_account &amp;lt; account.version():
            account.sync_to(zone)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This replication process is &lt;em&gt;O(number of accounts × number of replicas)&lt;/em&gt;. The more accounts you have and the more replicas you want for your data, the longer the replication of your accounts will take. The same rule applies to containers.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;## Pseudo code describing the replication process for objects
for partition in partitions_storing_objects:
    # The number of zones is the number of replicas configured
    for zone in partition.get_zones_storing_this_partition():
        # Send an HTTP REPLICATE command to the remote
        # swift-object-server process
        version_of_partition = zone.send_HTTP_REPLICATE_for(partition)
        if version_of_partition &amp;lt; partition.version():
            # Use rsync to synchronize the whole partition
            # and all its objects
            partition.rsync_to(zone)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This replication process is &lt;em&gt;O(number of object partitions × number of replicas)&lt;/em&gt;. As the number of object partitions grows and as you configure more replicas for your data, the time needed to replicate your objects grows accordingly.&lt;/p&gt;
&lt;p&gt;I think this is important to know when deciding how to build your Swift architecture: choose the number of replicas, partitions and nodes carefully.&lt;/p&gt;
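&lt;p&gt;To get an intuition of how these three numbers interact, here is a rough, back-of-the-envelope estimate of how many partition replicas each node has to walk during one replication pass. It assumes a perfectly balanced ring; the cluster sizes are made-up examples:&lt;/p&gt;

```python
def partition_replicas_per_node(part_power, replicas, nodes):
    """Rough estimate, assuming a uniformly balanced ring:
    total partition replicas divided by the number of nodes."""
    total_partitions = 2 ** part_power
    return total_partitions * replicas // nodes

# Hypothetical cluster: 2**18 partitions, 3 replicas, 60 nodes.
print(partition_replicas_per_node(18, 3, 60))  # 13107
```

Decreasing the partition power, dropping a replica, or adding nodes all shrink this number, and with it the duration of each replication pass.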
&lt;h2&gt;Replication process bottlenecks&lt;/h2&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/copy-cat.jpg&quot; alt=&quot;copy-cat&quot; /&gt;&lt;/p&gt;
&lt;h3&gt;File accesses&lt;/h3&gt;
&lt;p&gt;The problem, as you might have guessed, is that to replicate, &lt;strong&gt;it walks through every damn thing&lt;/strong&gt;, things being accounts, containers, or object partition hash files. This means it needs to open and read (part of) every file your node stores just to check whether the data needs to be replicated!&lt;/p&gt;
&lt;p&gt;For accounts &amp;amp; containers replication, this is done every 30 seconds by default, but it will likely take more than 30 seconds as soon as you hit around 12 000 containers on a node (see the measurements below). Therefore you&apos;ll end up checking the consistency of accounts &amp;amp; containers on every node &lt;strong&gt;all the time&lt;/strong&gt;, obviously using a lot of CPU time.&lt;/p&gt;
&lt;p&gt;For reference, &lt;a href=&quot;http://web.archive.org/web/20120903043209/http://alexyang.sinaapp.com/?p=115&quot;&gt;Alex Yang also did an analysis&lt;/a&gt; of that same problem.&lt;/p&gt;
&lt;h3&gt;TCP connections&lt;/h3&gt;
&lt;p&gt;Worse, the HTTP connections used to send the &lt;em&gt;REPLICATE&lt;/em&gt; commands are not pooled: a new TCP connection is established each time something has to be checked against its counterpart stored on a remote zone.&lt;/p&gt;
&lt;p&gt;This is why you&apos;ll see these lines in &lt;a href=&quot;http://swift.openstack.org/deployment_guide.html&quot;&gt;Swift&apos;s Deployment Guide&lt;/a&gt;, listed&lt;br /&gt;
under &lt;a href=&quot;http://swift.openstack.org/deployment_guide.html#general-system-tuning&quot;&gt;&quot;general system tuning&quot;&lt;/a&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;## disable TIME_WAIT.. wait..
net.ipv4.tcp_tw_recycle=1
net.ipv4.tcp_tw_reuse=1

## double amount of allowed conntrack
net.ipv4.netfilter.ip_conntrack_max = 262144
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In my humble opinion, this is more an ugly hack than a tuning. If you don&apos;t activate these settings and you have a lot of containers on your node, you&apos;ll soon end up with thousands of connections in the &lt;em&gt;TIME_WAIT&lt;/em&gt; state, and you do indeed risk overloading the IP conntrack module.&lt;/p&gt;
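&lt;p&gt;Pooling the connections would attack the root cause instead of the symptom. Here is a minimal sketch of the idea in Python — the host name and port are made up, and this is not how Swift is implemented, just an illustration of keeping one persistent connection per remote node:&lt;/p&gt;

```python
import http.client

class ReplicateClient:
    """Keep one persistent HTTP connection per remote node instead
    of opening a new TCP connection for every REPLICATE check."""

    def __init__(self):
        self._conns = {}

    def _get_conn(self, host, port):
        # Create the connection object lazily; http.client only opens
        # the TCP connection on the first request, and the same
        # connection is then reused for subsequent requests.
        key = (host, port)
        if key not in self._conns:
            self._conns[key] = http.client.HTTPConnection(host, port)
        return self._conns[key]

    def replicate(self, host, port, partition_path):
        conn = self._get_conn(host, port)
        conn.request("REPLICATE", partition_path)
        return conn.getresponse()

client = ReplicateClient()
# Two checks against the same node share a single pooled connection,
# so no TIME_WAIT socket is left behind per check:
print(client._get_conn("storage-node-2", 6000) is
      client._get_conn("storage-node-2", 6000))  # True
```

With one connection per remote node instead of one per check, the number of sockets stops scaling with the number of containers.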
&lt;h3&gt;Container deletion&lt;/h3&gt;
&lt;p&gt;We should also talk about container deletion. When a user deletes a container from their account, the container is &lt;strong&gt;marked as deleted&lt;/strong&gt;. And that&apos;s it. It&apos;s not actually deleted. Therefore the SQLite database file representing the container will keep being checked for synchronization, over and over.&lt;/p&gt;
&lt;p&gt;The only way to have a container permanently deleted is to &lt;strong&gt;mark an account as deleted&lt;/strong&gt;. This way the &lt;em&gt;swift-account-reaper&lt;/em&gt; will delete all its containers and, finally, the account.&lt;/p&gt;
&lt;h2&gt;Measurement&lt;/h2&gt;
&lt;p&gt;On a pretty big server, I measured the replication to run at a speed of around 350 {account,container,object-partition} checks per second, which can be a real problem if you choose to build a lot of partitions and you have a low &lt;em&gt;number_of_nodes ⁄ number_of_replicas&lt;/em&gt; ratio.&lt;/p&gt;
&lt;p&gt;For example, the default parameters run the container replication every 30 seconds. Checking the replication status of 12 000 containers stored on one node at a speed of 350 containers/second takes around 34 seconds. In the end, you&apos;ll never stop checking the replication of your containers, and the more containers you have, the more your &lt;strong&gt;inconsistency window will increase&lt;/strong&gt;.&lt;/p&gt;
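&lt;p&gt;The arithmetic behind that example, spelled out (the 350 checks/second rate is the measurement above; the 30 second interval is the default setting):&lt;/p&gt;

```python
containers = 12000   # containers stored on the node
rate = 350           # measured replication checks per second
interval = 30        # default container replication interval (seconds)

# Time needed for one full replication pass over all containers:
cycle_time = containers / rate
print(round(cycle_time, 1))   # 34.3 seconds
# The pass takes longer than the interval, so the replicator
# effectively never stops running:
print(cycle_time > interval)  # True
```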
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;Until some of the code is fixed (the HTTP connection pooling probably being the &quot;easiest&quot; fix), I warmly recommend choosing the different Swift parameters for your setup carefully. Optimizing the replication process consists in having the minimum number of partitions per node, which can be achieved by:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;decreasing the number of partitions&lt;/li&gt;
&lt;li&gt;decreasing the number of replicas&lt;/li&gt;
&lt;li&gt;increasing the number of nodes&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;For very large setups, some code to speed up account and container synchronization, and to actually remove deleted containers, will be required; but as far as I know, this does not exist yet.&lt;/p&gt;
</content:encoded><category>openstack</category><category>python</category></item><item><title>Ten years as a Debian developer</title><link>https://julien.danjou.info/blog/ten-years-as-a-debian-developer/</link><guid isPermaLink="true">https://julien.danjou.info/blog/ten-years-as-a-debian-developer/</guid><description>Ten years ago, I joined the Debian project as a developer.</description><pubDate>Fri, 24 Feb 2012 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Ten years ago, I joined the &lt;a href=&quot;http://www.debian.org&quot;&gt;Debian&lt;/a&gt; project as a developer.&lt;/p&gt;
&lt;p&gt;At that time, I was 18 and in my first year at university, hanging out with the &lt;a href=&quot;http://tuxfamily.org&quot;&gt;TuxFamily&lt;/a&gt; system administrators, who included three French Debian developers (sjg, igenibel and creis).&lt;/p&gt;
&lt;p&gt;I was learning Debian packaging while working on &lt;a href=&quot;http://vhffs.org&quot;&gt;VHFFS&lt;/a&gt;, and decided to package one or two not-yet-packaged programs for Debian. My friends pushed me into the &lt;a href=&quot;http://nm.debian.org&quot;&gt;NM process&lt;/a&gt;, and &lt;a href=&quot;https://nm.debian.org/nmstatus.php?email=acid@hno3.org&quot;&gt;less than 2 months later&lt;/a&gt; I was a Debian developer. One has to admit that back in the day, the NM process was really fast if you were able to reply to the questions quickly. :-) I think I became the youngest Debian developer at the time.&lt;/p&gt;
&lt;p&gt;Those were my first steps in a Free Software project, and it was really exciting.&lt;/p&gt;
&lt;p&gt;In 10 years, I&apos;ve been doing a lot of different things for Debian. Sure, I&apos;ve been using it all these years, but let&apos;s recap a bit of what I did, from what I recall.&lt;/p&gt;
&lt;p&gt;My first Debian only project was &lt;a href=&quot;http://packages.debian.org/apt-build&quot;&gt;apt-build&lt;/a&gt; around 2003, and later &lt;a href=&quot;http://packages.debian.org/rebuildd&quot;&gt;rebuildd&lt;/a&gt; in 2007.&lt;/p&gt;
&lt;p&gt;I built the &lt;a href=&quot;https://alioth.debian.org/projects/pkg-xen/&quot;&gt;Xen packaging team&lt;/a&gt; in 2005, I&apos;ve been a Stable Release Manager for a year in 2006, and did heavy bug squashing to release Etch that same year.&lt;/p&gt;
&lt;p&gt;I also was an &lt;a href=&quot;https://nm.debian.org/whoisam.php&quot;&gt;Application Manager in 2006&lt;/a&gt; and managed the application of 2&lt;br /&gt;
Debian developers (&lt;a href=&quot;https://nm.debian.org/nmstatus.php?email=joseparrella%40cantv.net&quot;&gt;Jose Parrella&lt;/a&gt; and &lt;a href=&quot;https://nm.debian.org/nmstatus.php?email=debian%40damianv.com.ar&quot;&gt;Damián Viano&lt;/a&gt;).&lt;/p&gt;
&lt;p&gt;I admit I&apos;ve been less active in Debian after 2007, mainly because I was busy working on &lt;a href=&quot;http://awesome.naquadah.org&quot;&gt;awesome&lt;/a&gt;, &lt;a href=&quot;http://www.gnu.org/software/emacs/&quot;&gt;GNU Emacs&lt;/a&gt; and other software.&lt;/p&gt;
&lt;p&gt;In 2011, I joined the &lt;a href=&quot;http://alioth.debian.org/projects/openstack/&quot;&gt;OpenStack packaging team&lt;/a&gt;, and I&apos;ve been working on OpenStack on an (almost) daily basis since.&lt;/p&gt;
&lt;p&gt;I don&apos;t know how many packages I have touched, managed or updated, but it should be one or two hundred. I still maintain &lt;a href=&quot;http://qa.debian.org/developer.php?login=acid&quot;&gt;53 of them&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;All in all, the adventure has been really pleasant, and I had the chance to work with and meet fabulous and smart people. I&apos;ve always liked this project and what it&apos;s trying to do.&lt;/p&gt;
&lt;p&gt;After all these years, I&apos;m definitely staying! See you in another 10 years, folks! :)&lt;/p&gt;
</content:encoded><category>debian</category><category>openstack</category><category>emacs</category><category>awesome</category></item><item><title>My OpenStack work</title><link>https://julien.danjou.info/blog/my-openstack-work/</link><guid isPermaLink="true">https://julien.danjou.info/blog/my-openstack-work/</guid><description>Like I already wrote here last week, I&apos;ve been heavily working on OpenStack for the last weeks.</description><pubDate>Fri, 16 Dec 2011 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Like I already wrote here last week, I&apos;ve been heavily working on &lt;a href=&quot;http://openstack.org&quot;&gt;OpenStack&lt;/a&gt; for the last weeks.&lt;/p&gt;
&lt;p&gt;My first assignment was to package OpenStack for Debian. The packages already present in unstable were mainly done by &lt;a href=&quot;http://thomas.goirand.fr/&quot;&gt;Thomas Goirand&lt;/a&gt;, who based his work on what had been done in &lt;a href=&quot;http://ubuntu.com&quot;&gt;Ubuntu&lt;/a&gt;. As a result, the packages were not in very good shape for Debian.&lt;/p&gt;
&lt;p&gt;Today Ghe Rivero and I (members of the &lt;a href=&quot;https://alioth.debian.org/projects/openstack&quot;&gt;OpenStack Debian packaging team&lt;/a&gt;) managed to push the &lt;a href=&quot;https://launchpad.net/openstack/+milestone/essex-2&quot;&gt;OpenStack Essex 2 milestone&lt;/a&gt; into unstable with great success. You can now test and deploy OpenStack Essex 2 very easily!&lt;/p&gt;
&lt;p&gt;Packaging OpenStack &lt;a href=&quot;https://review.openstack.org/#dashboard,1669&quot;&gt;made me write several patches&lt;/a&gt;, mainly related to packaging, all of which were accepted and merged by upstream. This is nice because most of the OpenStack Debian packages have now lost their &lt;em&gt;debian/patches&lt;/em&gt; directories!&lt;/p&gt;
&lt;p&gt;Finally, I&apos;ve finished implementing one blueprint I really missed: the &lt;a href=&quot;https://blueprints.launchpad.net/nova/+spec/support-kvm-boot-from-iso&quot;&gt;ability to boot from an ISO image&lt;/a&gt; using &lt;a href=&quot;http://libvirt.org&quot;&gt;libvirt&lt;/a&gt;. The code still needs a review, but it should be included in the Essex 3 milestone if everything goes right.&lt;/p&gt;
</content:encoded><category>openstack</category><category>debian</category></item><item><title>New job, new blog</title><link>https://julien.danjou.info/blog/new-job-new-blog/</link><guid isPermaLink="true">https://julien.danjou.info/blog/new-job-new-blog/</guid><description>It has been a while since I blogged but I&apos;ve been very busy, with my new job and this new blog!  New job! I quitted my job last September, and found another one that I started in October. I&apos;m now the</description><pubDate>Wed, 07 Dec 2011 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;It has been a while since I blogged but I&apos;ve been very busy, with my new job and this new blog!&lt;/p&gt;
&lt;h2&gt;New job!&lt;/h2&gt;
&lt;p&gt;I quit my job last September, and found another one that I started in October. I&apos;m now the lead developer of &lt;a href=&quot;http://www.enovance.com/fr/produits-solutions/opencloud-opensource/enovance-labs&quot;&gt;eNovance Labs&lt;/a&gt;, where I work on the &lt;a href=&quot;http://openstack.org/&quot;&gt;OpenStack&lt;/a&gt; project. So far, this has allowed me to contribute heavily to the &lt;a href=&quot;https://alioth.debian.org/projects/openstack&quot;&gt;Debian packaging of OpenStack&lt;/a&gt;.&lt;/p&gt;
&lt;h2&gt;New blog!&lt;/h2&gt;
&lt;p&gt;In the meantime, I took some time to redesign my personal homepage and this blog, which is now using &lt;a href=&quot;https://github.com/hyde/hyde&quot;&gt;Hyde&lt;/a&gt;, the &lt;a href=&quot;http://python.org&quot;&gt;Python&lt;/a&gt; equivalent of &lt;a href=&quot;http://jekyllrb.com/&quot;&gt;Jekyll&lt;/a&gt;, which is in &lt;a href=&quot;http://www.ruby-lang.org/&quot;&gt;Ruby&lt;/a&gt;. Since I dislike Ruby (sorry), I preferred to use a Python based generator, and I admit Hyde is really cool.&lt;br /&gt;
Since I really suck at Web design, this one is obviously based on &lt;a href=&quot;http://twitter.github.com/bootstrap/&quot;&gt;Twitter&apos;s Bootstrap&lt;/a&gt;.&lt;/p&gt;
</content:encoded><category>career</category><category>openstack</category><category>python</category><category>debian</category></item></channel></rss>