A bit of history
A little more than year ago, with my colleague Yassine Lamgarchal and others at eNovance, we investigated on how to solve a problem often encountered inside OpenStack: synchronization of multiple distributed workers. And while many people in our ecosystem continue to drive development by adding new bells and whistles, we made a point of solving new problems with a generic solution able to address the technical debt at the same time.
Yassine wrote the first ideas of what should be the group membership service that was needed for OpenStack, identifying several projects that could make use of this. I've presented this concept during the OpenStack Summit in Hong-Kong during an Oslo session. It turned out that the idea was well-received, and the week following the summit we started the tooz project on StackForge.
Tooz is a Python library that provides a coordination API. Its primary goal is to handle groups and membership of these groups in distributed systems.
Tooz also provides another useful feature which is distributed locking. This allows distributed nodes to acquire and release locks in order to synchronize themselves (for example to access a shared resource).
If you are familiar with distributed systems, you might be thinking that there are a lot of solutions already available to solve these issues: ZooKeeper, the Raft consensus algorithm or even Redis for example.
You'll be thrilled to learn that Tooz is not the result of the NIH syndrome, but is an abstraction layer on top of all these solutions. It uses drivers to provide the real functionalities behind, and does not try to do anything fancy.
All the drivers do not have the same amount of functionality of robustness, but depending on your environment, any available driver might be suffice. Like most of OpenStack, we let the deployers/operators/developers chose whichever backend they want to use, informing them of the potential trade-offs they will make.
So far, Tooz provides drivers based on:
- Kazoo (ZooKeeper)
- SysV IPC (only for distributed locks for now)
- PostgreSQL (only for distributed locks for now)
- MySQL (only for distributed locks for now)
All drivers are distributed across processes. Some can be distributed across the network (ZooKeeper, memcached, redis…) and some are only available on the same host (IPC).
Also note that the Tooz API is completely asynchronous, allowing it to be more efficient, and potentially included in an event loop.
Tooz provides an API to manage group membership. The basic operations provided are: the creation of a group, the ability to join it, leave it and list its members. It's also possible to be notified as soon as a member joins or leaves a group.
Each group can have a leader elected. Each member can decide if it wants to run for the election. If the leader disappears, another one is elected from the list of current candidates. It's possible to be notified of the election result and to retrieve the leader of a group at any moment.
When trying to synchronize several workers in a distributed environment, you may need a way to lock access to some resources. That's what a distributed lock can help you with.
Adoption in OpenStack
Ceilometer is the first project in OpenStack to use Tooz. It has replaced part of the old alarm distribution system, where RPC was used to detect active alarm evaluator workers. The group membership feature of Tooz was leveraged by Ceilometer to coordinate between alarm evaluator workers.
Another new feature part of the Juno release of Ceilometer is the distribution of polling tasks of the central agent among multiple workers. There's again a group membership issue to know which nodes are online and available to receive polling tasks, so Tooz is also being used here.
This opens the door to push Tooz further in OpenStack. Our next candidate would be write a service group driver for Nova.
The complete documentation for Tooz is available online and has examples for the various features described here, go read it if you're curious and adventurous!