Software speed is relative.
After all, it is the result of a set of trade-offs between the ease of programming and the speed of hardware. Developer comfort and the use of multiple abstraction layers directly reduce the cost in time (and therefore in money), while on the other hand they increase hardware expenditure, as the software is less performant. In the end, whichever of optimization and hardware is cheaper gets privileged.
Of course, there are terrible exceptions, such as picking the wrong algorithm or including
sleep() calls, but the essence of it is here. Pick C to be fast, saving money on hardware and spending it on development, or pick Java to save money on development, while making hardware manufacturers rich.
Last month, a co-worker at Red Hat picked Gnocchi for a test run and was disappointed by the performance he saw for his particular usage. After making sure I understood the use case, I wrote a small test case implementing it and pulled out my favorite code profiler. You know how I roll.
The profiling result made the performance issue obvious. Over its releases, Gnocchi evolved from processing one metric at a time to processing a batch of metrics at a time – especially since Gnocchi 4 and the introduction of sacks. However, that batched approach is not yet complete in Gnocchi 4.2, and the processing engine still manipulates metrics one by one, in parallel. The parallelization, using processes and threads, makes sure that CPU usage stays high and that I/O latency does not impact the processing too much.
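This one-metric-at-a-time scheme can be sketched roughly as follows. This is a simplified illustration, not actual Gnocchi code: `process_metric` and its list of I/O steps are hypothetical stand-ins for the real storage operations.

```python
from concurrent.futures import ThreadPoolExecutor

def process_metric(metric):
    # Each metric triggers its own set of I/O operations:
    # reading unprocessed measures, reading existing aggregates,
    # writing aggregates back, deleting processed measures.
    return [
        ("read_measures", metric),
        ("read_aggregates", metric),
        ("write_aggregates", metric),
        ("delete_measures", metric),
    ]

metrics = ["metric-%d" % i for i in range(3)]

# Threads keep CPU usage high and hide some of the I/O latency,
# but the total number of I/O operations still grows linearly
# with the number of metrics.
with ThreadPoolExecutor(max_workers=4) as executor:
    results = list(executor.map(process_metric, metrics))

total_io = sum(len(ops) for ops in results)
print(total_io)  # 12 operations for 3 metrics
```

Parallelism hides latency, but it does not reduce the number of round-trips to the storage backend.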
Processing incoming measures can therefore be schematized like this:
In the schema above, each operation in red is an I/O operation. The three branches I drew represent three metrics being processed. Obviously, if there were ten metrics, there would be ten branches, creating even more I/O operations. With the current Gnocchi 4.2 code, the number of I/O operations needed to process a sack of metrics can roughly be computed as 2 + (5 × M × D × G), where M is the number of metrics, D the number of definitions in the archive policy, and G the number of aggregation methods. For my test scenario, I used G = 1, which is what the diagram above shows.
The obvious solution is to merge those per-metric I/O operations into a single I/O operation covering a whole bunch of metrics. This allows storage backends to batch read and write operations, reducing latency and improving throughput.
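The idea can be sketched as follows. The storage class and its method names are hypothetical, but the pattern – one round-trip for N metrics instead of N round-trips – is the point:

```python
class BatchingStorage:
    """Illustrative storage backend that counts round-trips."""

    def __init__(self):
        self.round_trips = 0

    def read_one(self, metric):
        # One request per metric: latency is paid N times.
        self.round_trips += 1
        return {"metric": metric}

    def read_many(self, metrics):
        # One batched request fetches everything at once:
        # latency is paid a single time.
        self.round_trips += 1
        return [{"metric": m} for m in metrics]

metrics = ["cpu", "memory", "disk"]

unbatched = BatchingStorage()
for m in metrics:
    unbatched.read_one(m)       # one round-trip per metric

batched = BatchingStorage()
batched.read_many(metrics)      # one round-trip for all metrics

print(unbatched.round_trips, batched.round_trips)  # 3 1
```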
It took me a few dozen patches and a few rounds of code review from my peers to rework the internal storage engine of Gnocchi. The new engine is now ready to be used and merged into the master branch.
The new engine reduces the number of I/O operations needed to process a batch of metrics to
5 + M – an (at least) five-fold reduction in the number of operations. In my case, for 1000 metrics processed in a batch with only one aggregation, that decreases the number of transactions from 5002 to 1005.
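Writing the two formulas out makes the comparison concrete. These helper functions are just the formulas above expressed as code, not Gnocchi internals:

```python
def io_old(metrics, definitions, aggregations):
    # One-metric-at-a-time engine: 2 fixed operations plus
    # 5 per (metric, definition, aggregation) combination.
    return 2 + 5 * metrics * definitions * aggregations

def io_new(metrics):
    # Batched engine: 5 fixed operations plus 1 per metric,
    # regardless of definitions and aggregation methods.
    return 5 + metrics

old = io_old(metrics=1000, definitions=1, aggregations=1)
new = io_new(metrics=1000)
print(old, new)  # 5002 1005, roughly a five-fold reduction
```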
A typical metric usually has 6 aggregation methods defined, so that could reduce the number of I/O operations from 30,002 to only 1005 for 1000 metrics – roughly a thirty-fold reduction in I/O operations. The benchmark code that I wrote, which implements the desired use case with a single aggregation, now performs more than four times faster.
Not all drivers will benefit equally from this improvement, as some are better at batched operations than others: Redis is great at it, while Swift is not. And even if the number of I/O operations has been largely reduced, they still need to be fully executed, which can take time depending on the backend's performance. It's a really great improvement, but not a silver bullet.
Mehdi started using this new internal driver API to implement a RocksDB driver. While it has its own limitations (it has to be single-threaded) that we will need to circumvent, it could improve performance for the non-distributed use case by a large magnitude.
This code will be included in the next Gnocchi major release in a few weeks, so stay tuned for further updates. And benchmarks, I hope!