<?xml version="1.0" encoding="UTF-8"?><rss version="2.0" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>monitoring — jd:/dev/blog</title><description>Posts tagged &quot;monitoring&quot; on jd:/dev/blog.</description><link>https://julien.danjou.info/</link><item><title>Python Logging with Datadog</title><link>https://julien.danjou.info/blog/python-logging-with-datadog/</link><guid isPermaLink="true">https://julien.danjou.info/blog/python-logging-with-datadog/</guid><description>At Mergify, we generate a pretty large amount of logs. Every time an event is received from GitHub for a particular pull request, our engine computes a new state for it.</description><pubDate>Mon, 03 Feb 2020 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;At &lt;a href=&quot;https://mergify.io&quot;&gt;Mergify&lt;/a&gt;, we generate a pretty large amount of logs. Every time an event is received from GitHub for a particular pull request, our engine computes a new state for it. Doing so, it logs some informational statements about what it&apos;s doing — and any error that might happen.&lt;/p&gt;
&lt;p&gt;This information is precious to us. Without proper logging, it&apos;d be utterly impossible for us to debug any issue. As we needed to store and index our logs somewhere, we picked Datadog as our log storage provider.&lt;/p&gt;
&lt;p&gt;Datadog offers real-time indexing of our logs. The ability to search our records that fast is compelling, as we&apos;re able to retrieve logs about a GitHub repository or a pull request with a single click.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/01/Screenshot-2020-01-06-at-17.23.58.png&quot; alt=&quot;Our custom Datadog log facets&quot; /&gt;&lt;/p&gt;
&lt;p&gt;To achieve this result, we had to inject our Python application logs into Datadog. To set up the Python logging mechanism, we rely on &lt;a href=&quot;https://github.com/jd/daiquiri&quot;&gt;&lt;em&gt;daiquiri&lt;/em&gt;&lt;/a&gt;, a fantastic library I have maintained for several years now. &lt;em&gt;Daiquiri&lt;/em&gt; leverages the regular Python &lt;code&gt;logging&lt;/code&gt; module, making it a no-brainer to set up while offering a few extra features.&lt;/p&gt;
&lt;p&gt;We recently added native support for the Datadog agent in &lt;em&gt;daiquiri&lt;/em&gt;, making it even more straightforward to log from your Python application.&lt;/p&gt;
&lt;h2&gt;Enabling logs on the Datadog agent&lt;/h2&gt;
&lt;p&gt;Datadog has &lt;a href=&quot;https://docs.datadoghq.com/agent/logs/?tab=tailexistingfiles&quot;&gt;extensive documentation on how to configure its agent&lt;/a&gt;. It boils down to adding &lt;code&gt;logs_enabled: true&lt;/code&gt; to your agent configuration. Simple as that.&lt;/p&gt;
&lt;p&gt;You then need to create a new source for the agent. The easiest way to connect your application and the Datadog agent is to use a TCP socket: your application writes logs directly to the Datadog agent, which forwards the entries to the Datadog backend.&lt;/p&gt;
&lt;p&gt;Create a configuration file in &lt;code&gt;conf.d/python.d/conf.yaml&lt;/code&gt; declaring a new TCP log source for your application.&lt;/p&gt;
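&lt;p&gt;A minimal version of that file, assuming the default TCP port 10518 used later in this post and a placeholder service name, could look like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;logs:
  - type: tcp
    port: 10518
    service: &quot;my-python-app&quot;  # placeholder service name
    source: &quot;python&quot;
&lt;/code&gt;&lt;/pre&gt;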
&lt;h2&gt;Setting up &lt;code&gt;daiquiri&lt;/code&gt;&lt;/h2&gt;
&lt;p&gt;Once this is done, you need to configure your Python application to log to the TCP socket configured in the agent above.&lt;/p&gt;
&lt;p&gt;The Datadog agent expects logs to be sent in JSON format, which is what &lt;em&gt;daiquiri&lt;/em&gt; does for you. Using JSON allows embedding any extra fields to leverage fast search and indexing. As &lt;em&gt;daiquiri&lt;/em&gt; provides native handling for extra fields, you&apos;ll be able to send those extra fields without trouble.&lt;/p&gt;
&lt;p&gt;First, list &lt;em&gt;daiquiri&lt;/em&gt; in your application dependencies. Then, set up logging in your application this way:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import logging

import daiquiri

daiquiri.setup(
  outputs=[
    daiquiri.output.Datadog(),
  ],
  level=logging.INFO,
)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This configuration logs to the default TCP destination &lt;code&gt;localhost:10518&lt;/code&gt; — though you can pass the &lt;code&gt;host&lt;/code&gt; and &lt;code&gt;port&lt;/code&gt; arguments to change that. You can customize the outputs as you wish by checking out the &lt;a href=&quot;https://daiquiri.readthedocs.io/en/latest/&quot;&gt;daiquiri documentation&lt;/a&gt;. For example, you could also include logging to &lt;code&gt;stdout&lt;/code&gt; by adding &lt;code&gt;daiquiri.output.Stream(sys.stdout)&lt;/code&gt; to the output list.&lt;/p&gt;
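&lt;p&gt;For instance, a setup combining the Datadog output with a stdout stream for local debugging could look like the following sketch (the host and port are simply the defaults written out explicitly):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import logging
import sys

import daiquiri

daiquiri.setup(
  outputs=[
    # Send JSON log records to the Datadog agent TCP listener
    daiquiri.output.Datadog(host=&quot;localhost&quot;, port=10518),
    # Mirror logs to stdout for local debugging
    daiquiri.output.Stream(sys.stdout),
  ],
  level=logging.INFO,
)
&lt;/code&gt;&lt;/pre&gt;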
&lt;h2&gt;Using &lt;code&gt;extra&lt;/code&gt;&lt;/h2&gt;
&lt;p&gt;When using &lt;em&gt;daiquiri&lt;/em&gt;, you&apos;re free to use &lt;code&gt;logging.getLogger&lt;/code&gt; to get your regular logging object. However, by using the alternative &lt;code&gt;daiquiri.getLogger&lt;/code&gt; function, you&apos;re enabling the native use of extra arguments — which is quite handy. That means you can pass any arbitrary key/value to your log call and see it end up embedded in your log data — all the way up to Datadog.&lt;/p&gt;
&lt;p&gt;Here&apos;s an example:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import daiquiri

[…]

log = daiquiri.getLogger(__name__)
log.info(&quot;User did something important&quot;, user=user, request_id=request_id)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The extra keyword arguments passed to &lt;code&gt;log.info&lt;/code&gt; will show up directly as attributes in Datadog logs:&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/01/Screenshot-2020-01-06-at-18.22.04.png&quot; alt=&quot;One of the log lines of our Mergify engine&quot; /&gt;&lt;/p&gt;
&lt;p&gt;All those attributes can then be used for searching or for displaying custom views. This is really powerful for monitoring and debugging any kind of service.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/01/Screenshot-2020-01-06-at-18.39.05.png&quot; alt=&quot;Screenshot of Datadog log explorer showing custom attributes for search and display&quot; /&gt;&lt;/p&gt;
&lt;h2&gt;A log object per object&lt;/h2&gt;
&lt;p&gt;When passing &lt;em&gt;extra&lt;/em&gt; arguments, it is easy to make mistakes and forget some of them. This can especially happen when your application wants to log information about a particular object.&lt;/p&gt;
&lt;p&gt;The best pattern to avoid this is to create a custom log object per object:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import daiquiri

class MyObject:
    def __init__(self, x, y):
        self.x = x
        self.y = y
        # Bind the object&apos;s identifying fields to its logger once,
        # so every log call carries them automatically.
        self.log = daiquiri.getLogger(&quot;MyObject&quot;, x=self.x, y=self.y)

    def do_something(self):
        try:
            self.call_this()
        except Exception:
            # x and y are attached to this record without repeating them here
            self.log.error(&quot;Something bad happened&quot;)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;By using the &lt;code&gt;self.log&lt;/code&gt; object as defined above, there&apos;s no way for your application to miss some extra fields for an object. All your logs will follow the same style and will end up being indexed correctly in Datadog.&lt;/p&gt;
&lt;h2&gt;Log Design&lt;/h2&gt;
&lt;p&gt;The &lt;em&gt;extra&lt;/em&gt; arguments from the Python loggers are often dismissed, and many developers stick to logging strings with various pieces of information embedded inside. Having a proper explanation string, plus a few extra key/value pairs that are parsable by machines and humans, is a better way to do logging. Leveraging engines such as Datadog allows you to store and query those logs in a snap.&lt;/p&gt;
&lt;p&gt;This is way more efficient than trying to parse and grep strings yourself!&lt;/p&gt;
</content:encoded><category>python</category><category>mergify</category><category>monitoring</category></item><item><title>Gnocchi or Prometheus?</title><link>https://julien.danjou.info/blog/gnocchi-or-prometheus/</link><guid isPermaLink="true">https://julien.danjou.info/blog/gnocchi-or-prometheus/</guid><description>The realm of time series databases keeps expanding these last years. Now and then a new contender appears from the fog. People keep asking me about the difference between Gnocchi and Prometheus.</description><pubDate>Wed, 30 Aug 2017 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;The realm of time series databases keeps expanding these last years. Now and then a new contender appears from the fog. People keep asking me about the difference between &lt;a href=&quot;http://gnocchi.xyz&quot;&gt;Gnocchi&lt;/a&gt; and &lt;a href=&quot;http://prometheus.io&quot;&gt;Prometheus&lt;/a&gt;. It&apos;s time to compare them.&lt;/p&gt;
&lt;p&gt;Gnocchi and Prometheus are two open source projects evolving in the same area of expertise, time series handling. They are both licensed under the &lt;strong&gt;Apache 2.0 license&lt;/strong&gt; (see the &lt;a href=&quot;https://github.com/gnocchixyz/gnocchi/blob/master/LICENSE&quot;&gt;Gnocchi license file&lt;/a&gt; and the &lt;a href=&quot;https://github.com/prometheus/prometheus/blob/master/LICENSE&quot;&gt;Prometheus license file&lt;/a&gt;). And that&apos;s a good thing!&lt;/p&gt;
&lt;p&gt;Both Gnocchi and Prometheus offer a bunch of features. Here&apos;s a summary table of the features they offer – or not.&lt;/p&gt;
&lt;table id=&quot;comparison&quot;&gt;
&lt;thead&gt;
&lt;tr&gt;&lt;th&gt;Feature&lt;/th&gt;&lt;th&gt;Prometheus&lt;/th&gt;&lt;th&gt;Gnocchi&lt;/th&gt;&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;Multi-tenant&lt;/td&gt;&lt;td&gt;❌&lt;/td&gt;&lt;td&gt;✓&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;User auth &amp;amp; ACL&lt;/td&gt;&lt;td&gt;❌&lt;/td&gt;&lt;td&gt;✓&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Resource history&lt;/td&gt;&lt;td&gt;❌&lt;/td&gt;&lt;td&gt;✓&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Metric polling&lt;/td&gt;&lt;td&gt;✓&lt;/td&gt;&lt;td&gt;❌&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Highly available&lt;/td&gt;&lt;td&gt;❌&lt;/td&gt;&lt;td&gt;✓&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Horizontal scalability&lt;/td&gt;&lt;td&gt;❌&lt;/td&gt;&lt;td&gt;✓&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Alerting engine&lt;/td&gt;&lt;td&gt;✓&lt;/td&gt;&lt;td&gt;❌&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Data compression&lt;/td&gt;&lt;td&gt;✓&lt;/td&gt;&lt;td&gt;✓&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Pre-computed aggregation&lt;/td&gt;&lt;td&gt;✓&lt;/td&gt;&lt;td&gt;✓&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Grafana support&lt;/td&gt;&lt;td&gt;✓&lt;/td&gt;&lt;td&gt;✓&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;collectd support&lt;/td&gt;&lt;td&gt;✓&lt;/td&gt;&lt;td&gt;✓&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;There&apos;s a lot of overlap between the two projects, but there are also some major differences.&lt;/p&gt;
&lt;p&gt;First, Gnocchi does not try to solve the metric retrieval problem. Prometheus provides a pull mechanism and takes charge of getting the measurements. Gnocchi developers consider that there are plenty of tools already doing that job well, such as &lt;a href=&quot;http://collectd.org&quot;&gt;collectd&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/icon_siren.png&quot; alt=&quot;Alert siren icon&quot; /&gt;&lt;/p&gt;
&lt;p&gt;Secondly, Prometheus offers an &lt;a href=&quot;https://prometheus.io/docs/alerting/overview/&quot;&gt;alerting engine&lt;/a&gt;, statically configured with a YAML file. That is way better than Gnocchi, which offers nothing in comparison – for now. Gnocchi developers &lt;a href=&quot;https://github.com/gnocchixyz/gnocchi/issues/71&quot;&gt;are discussing the feature&lt;/a&gt;, and while it&apos;s not on the roadmap yet, it will happen. It will, however, be controlled through a REST API, as it seems important to us to be able to define alerts programmatically.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/icon_storage.png&quot; alt=&quot;Storage icon&quot; /&gt;&lt;/p&gt;
&lt;p&gt;Then there is a bunch of features where Gnocchi shines compared to Prometheus, and they are the core of its function: storing metrics. Gnocchi has a great storage engine that supports many storage backends (plain files, &lt;a href=&quot;https://docs.openstack.org/swift/latest/&quot;&gt;OpenStack Swift&lt;/a&gt;, &lt;a href=&quot;http://ceph.org&quot;&gt;Ceph&lt;/a&gt;…). This helps Gnocchi scale horizontally and provide native high availability, whereas Prometheus remains a single point of failure.&lt;/p&gt;
&lt;p&gt;Multi-tenancy and authentication are also supported by Gnocchi, allowing a single instance to be shared by multiple accounts. System administrators do not commonly use this kind of feature, but application developers usually need it.&lt;/p&gt;
&lt;p&gt;That brings me to the usage and querying of Prometheus and Gnocchi. Prometheus has its own small DSL (referred to as &lt;a href=&quot;https://prometheus.io/docs/querying/basics/&quot;&gt;PromQL&lt;/a&gt;) whereas Gnocchi has a &lt;a href=&quot;http://gnocchi.xyz/rest.html&quot;&gt;fully featured REST API&lt;/a&gt; that tries to expose proper semantics. There do not seem to be major differences between the two in terms of features.&lt;/p&gt;
&lt;p&gt;Both Prometheus and Gnocchi support aggregating values over time ranges at query time (&quot;give me the minimum value for every 5 minutes range over the last day&quot;). Gnocchi always aggregates metrics at write time, and never at query time (unless doing cross-metric aggregation). This implies that Gnocchi needs a bit of CPU time at write time to pre-compute those aggregates, but it is blazingly fast at read time as it has nothing to compute. Prometheus can do the same thing using &lt;a href=&quot;https://prometheus.io/docs/querying/rules/&quot;&gt;recording rules&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/icon_clock.png&quot; alt=&quot;Clock icon&quot; /&gt;&lt;/p&gt;
&lt;p&gt;Prometheus has some limitations inherent to time series databases designed around the notion of &quot;monitoring&quot;: they tend to compute everything relative to &lt;code&gt;$NOW&lt;/code&gt;. For example, it seems impossible to inject data from the past. The timestamp of a value is the time at which Prometheus read that value. If Prometheus misses values for a few hours, don&apos;t think about importing them back.&lt;/p&gt;
&lt;p&gt;I&apos;m noting this here as it makes it harder to benchmark Prometheus for ingestion. You need tons of fake metrics to poll in order to build data. I did not find any reference to Prometheus performance online, though it is advertised as ingesting &quot;millions of measures from thousands of sources&quot;.&lt;/p&gt;
&lt;p&gt;Query performance seems to vary on Prometheus, and I did not find any benchmark on that either. Gnocchi leverages a standard RDBMS (MySQL and PostgreSQL are supported) to query indexed data, and metric retrieval is always &lt;em&gt;O(1)&lt;/em&gt;, making it &lt;strong&gt;always fast&lt;/strong&gt;.&lt;/p&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;If you look at different and older areas, there has never been only one HTTP server. Many people use the Apache HTTP server, but you&apos;ll find plenty of users of nginx, Tomcat, HAProxy, Node.js or uwsgi, which are also common options nowadays. The same goes for RDBMS if you look at PostgreSQL, MySQL and other database solutions. There will never be one project winning all the market share.&lt;/p&gt;
&lt;p&gt;It seems to me that time series storage and management is also growing in this category. There will probably be various projects that will enjoy some popularity and growth. Every project addresses the time series problem space with a different view and different trade-offs. There might never be a single project solving all problems at once.&lt;/p&gt;
&lt;p&gt;Prometheus seems to be oriented toward monitoring of live systems. Gnocchi is oriented toward highly available time series storage at massive scale. Not considering performance (I was not able to compare anyway), both have different trade-offs in terms of features, philosophy, and orientation. Depending on your use cases, one might be a better fit than the other.&lt;/p&gt;
</content:encoded><category>gnocchi</category><category>monitoring</category></item><item><title>Sending your collectd metrics to Gnocchi</title><link>https://julien.danjou.info/blog/gnocchi-collectd-setup/</link><guid isPermaLink="true">https://julien.danjou.info/blog/gnocchi-collectd-setup/</guid><description>Knowing that collectd is a daemon that collects system and applications metrics and that Gnocchi is a scalable timeseries database, it sounds like a good idea to combine them together.</description><pubDate>Thu, 16 Feb 2017 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Knowing that &lt;a href=&quot;http://collectd.org/&quot;&gt;collectd&lt;/a&gt; is a daemon that collects system and applications metrics and that &lt;a href=&quot;http://gnocchi.xyz&quot;&gt;Gnocchi&lt;/a&gt; is a scalable timeseries database, it sounds like a good idea to combine them together. &lt;em&gt;Cherry on the cake&lt;/em&gt;: you can easily draw charts using &lt;a href=&quot;http://grafana.org&quot;&gt;Grafana&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;While it&apos;s true that Gnocchi is well integrated with &lt;a href=&quot;http://openstack.org&quot;&gt;OpenStack&lt;/a&gt;, as it originally comes from this ecosystem, it actually works standalone by default. Starting with the 3.1 version, it is now easy to send metrics to &lt;em&gt;Gnocchi&lt;/em&gt; using &lt;em&gt;collectd&lt;/em&gt;.&lt;/p&gt;
&lt;h2&gt;Installation&lt;/h2&gt;
&lt;p&gt;What we&apos;ll need to install to accomplish this task is:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;collectd&lt;/li&gt;
&lt;li&gt;Gnocchi&lt;/li&gt;
&lt;li&gt;collectd-gnocchi&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;How you install them does not really matter. If they are packaged by your operating system, go ahead. For Gnocchi and collectd-gnocchi, you can also use &lt;em&gt;pip&lt;/em&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;## pip install gnocchi[file,postgresql]
[…]
Successfully installed gnocchi-3.1.0
## pip install collectd-gnocchi
Collecting collectd-gnocchi
  Using cached collectd-gnocchi-1.0.1.tar.gz
[…]
Installing collected packages: collectd-gnocchi
  Running setup.py install for collectd-gnocchi ... done
Successfully installed collectd-gnocchi-1.0.1
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The installation procedure for Gnocchi is &lt;a href=&quot;http://gnocchi.xyz/install.html#id1&quot;&gt;detailed in the documentation&lt;/a&gt;. Among other things, it explains which flavors are available – here I picked PostgreSQL as the indexer and the file driver to store the metrics.&lt;/p&gt;
&lt;h2&gt;Configuration&lt;/h2&gt;
&lt;h3&gt;Gnocchi&lt;/h3&gt;
&lt;p&gt;Gnocchi is simple to configure and is, again, &lt;a href=&quot;http://gnocchi.xyz/configuration.html&quot;&gt;documented&lt;/a&gt;. The default configuration file is &lt;code&gt;/etc/gnocchi/gnocchi.conf&lt;/code&gt; – you can generate it with &lt;code&gt;gnocchi-config-generator&lt;/code&gt; if needed. However, it is also possible to specify another configuration file by appending the &lt;code&gt;--config-file&lt;/code&gt; option to any command line.&lt;/p&gt;
&lt;p&gt;In Gnocchi&apos;s configuration file, you need to set the &lt;code&gt;indexer.url&lt;/code&gt; configuration option to point to an existing PostgreSQL database and set &lt;code&gt;storage.file_basepath&lt;/code&gt; to an existing directory to store your metrics in (the default is &lt;code&gt;/var/lib/gnocchi&lt;/code&gt;). That gives something like:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;[indexer]
url = postgresql://root:p4assw0rd@localhost/gnocchi

[storage]
file_basepath = /var/lib/gnocchi
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Once done, just run the &lt;code&gt;gnocchi-upgrade&lt;/code&gt; command to initialize the index and storage.&lt;/p&gt;
&lt;h3&gt;collectd&lt;/h3&gt;
&lt;p&gt;collectd provides a default configuration file that loads a bunch of plugins by default, which will measure all sorts of metrics on your computer. You can check the &lt;a href=&quot;http://collectd.org/documentation.shtml&quot;&gt;documentation&lt;/a&gt; online to see how to disable or enable plugins.&lt;/p&gt;
&lt;p&gt;As the &lt;em&gt;collectd-gnocchi&lt;/em&gt; plugin is written in Python, you&apos;ll need to enable the Python plugin and load the &lt;em&gt;collectd-gnocchi&lt;/em&gt; module:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;LoadPlugin python

&amp;lt;Plugin python&amp;gt;
  Import &quot;collectd_gnocchi&quot;
  &amp;lt;Module collectd_gnocchi&amp;gt;
      endpoint &quot;http://localhost:8041&quot;
  &amp;lt;/Module&amp;gt;
&amp;lt;/Plugin&amp;gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;That is enough to enable the storage of metrics in Gnocchi.&lt;/p&gt;
&lt;h2&gt;Running the daemons&lt;/h2&gt;
&lt;p&gt;Once everything is configured, you can launch &lt;code&gt;gnocchi-metricd&lt;/code&gt; and the &lt;code&gt;gnocchi-api&lt;/code&gt; daemon:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;$ gnocchi-metricd
2017-01-26 15:22:49.018 15971 INFO gnocchi.cli [-] 0 measurements bundles
across 0 metrics wait to be processed.
[…]
## In another terminal
$ gnocchi-api --port 8041
[…]
STARTING test server gnocchi.rest.app.build_wsgi_app
Available at http://127.0.0.1:8041/
[…]
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;It&apos;s not recommended to run the Gnocchi API with the &lt;code&gt;gnocchi-api&lt;/code&gt; test server (as &lt;a href=&quot;http://gnocchi.xyz/running.html#running-as-a-wsgi-application&quot;&gt;written in the documentation&lt;/a&gt;): using &lt;a href=&quot;https://uwsgi-docs.readthedocs.io/&quot;&gt;uwsgi&lt;/a&gt; is a better option. However, for rapid testing, the &lt;code&gt;gnocchi-api&lt;/code&gt; daemon is good enough.&lt;/p&gt;
&lt;p&gt;Once that&apos;s done, you can start &lt;code&gt;collectd&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;$ collectd
## Or to run in foreground with a different configuration file:
## $ collectd -C collectd.conf -f
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;If you have any problem launching &lt;em&gt;collectd&lt;/em&gt;, check syslog for more information: there might be an issue loading a module or plugin.&lt;/p&gt;
&lt;p&gt;If no errors are printed, then everything&apos;s working fine and you should soon see &lt;em&gt;gnocchi-api&lt;/em&gt; printing some requests such as:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;127.0.0.1 - - [26/Jan/2017 15:27:03] &quot;POST /v1/resource/collectd HTTP/1.1&quot; 409 113
127.0.0.1 - - [26/Jan/2017 15:27:03] &quot;POST /v1/batch/resources/metrics/measures?create_metrics=True HTTP/1.1&quot; 400 91
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Enjoying the result&lt;/h2&gt;
&lt;p&gt;Once everything runs, you can access your newly created resources and metric by using the &lt;a href=&quot;http://pypi.python.org/pypi/gnocchiclient&quot;&gt;gnocchiclient&lt;/a&gt;. It should have been installed as a dependency of &lt;em&gt;collectd_gnocchi&lt;/em&gt;, but you can also install it manually using &lt;code&gt;pip install gnocchiclient&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;If you need to specify a different endpoint, you can use the &lt;code&gt;--endpoint&lt;/code&gt; option (which defaults to &lt;a href=&quot;http://localhost:8041&quot;&gt;http://localhost:8041&lt;/a&gt;). Do not hesitate to check the &lt;code&gt;--help&lt;/code&gt; option for more information.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;$ gnocchi resource list --details
+---------------+----------+------------+---------+----------------------+---------------+----------+----------------+--------------+---------+-----------+
| id            | type     | project_id | user_id | original_resource_id | started_at    | ended_at | revision_start | revision_end | creator | host      |
+---------------+----------+------------+---------+----------------------+---------------+----------+----------------+--------------+---------+-----------+
| dd245138-00c7 | collectd | None       | None    | dd245138-00c7-5bdc-  | 2017-01-26T14 | None     | 2017-01-26T14: | None         | admin   | localhost |
| -5bdc-94f8-26 |          |            |         | 94f8-263e236812f7    | :21:02.297466 |          | 21:02.297483+0 |              |         |           |
| 3e236812f7    |          |            |         |                      | +00:00        |          | 0:00           |              |         |           |
+---------------+----------+------------+---------+----------------------+---------------+----------+----------------+--------------+---------+-----------+
$ gnocchi resource show collectd:localhost
+-----------------------+-----------------------------------------------------------------------+
| Field                 | Value                                                                 |
+-----------------------+-----------------------------------------------------------------------+
| created_by_project_id |                                                                       |
| created_by_user_id    | admin                                                                 |
| creator               | admin                                                                 |
| ended_at              | None                                                                  |
| host                  | localhost                                                             |
| id                    | dd245138-00c7-5bdc-94f8-263e236812f7                                  |
| metrics               | interface-en0@if_errors-0: 5d60f224-2e9e-4247-b415-64d567cf5866       |
|                       | interface-en0@if_errors-1: 1df8b08b-555a-4cab-9186-f9b79a814b03       |
|                       | interface-en0@if_octets-0: 491b7517-7219-4a04-bdb6-934d3bacb482       |
|                       | interface-en0@if_octets-1: 8b5264b8-03f3-4aba-a7f8-3cd4b559e162       |
|                       | interface-en0@if_packets-0: 12efc12b-2538-45e7-aa66-f8b9960b5fa3      |
|                       | interface-en0@if_packets-1: 39377ff7-06e8-454a-a22a-942c8f2bca56      |
|                       | interface-en1@if_errors-0: c3c7e9fc-f486-4d0c-9d36-55cea855596a       |
|                       | interface-en1@if_errors-1: a90f1bec-3a60-4f58-a1d1-b3c09dce4359       |
|                       | interface-en1@if_octets-0: c1ee8c75-95bf-4096-8055-8c0c4ec8cd47       |
|                       | interface-en1@if_octets-1: cbb90a94-e133-4deb-ac10-3f37770e32f0       |
|                       | interface-en1@if_packets-0: ac93b1b9-da71-4876-96aa-76067b35c6c9      |
|                       | interface-en1@if_packets-1: 2f8528b2-12ae-4c4d-bec7-8cc987e7487b      |
|                       | interface-en2@if_errors-0: ddcf7203-4c49-400b-9320-9d3e0a63c6d5       |
|                       | interface-en2@if_errors-1: b249ea42-01ad-4742-9452-2c834010df71       |
|                       | interface-en2@if_octets-0: 8c23013a-604e-40bf-a07a-e2dc4fc5cbd7       |
|                       | interface-en2@if_octets-1: 806c1452-0607-4b56-b184-c4ffd48f52c0       |
|                       | interface-en2@if_packets-0: c5bc6103-6313-4b8b-997d-01930d1d8af4      |
|                       | interface-en2@if_packets-1: 478ae87e-e56b-44e4-83b0-ed28d99ed280      |
|                       | load@load-0: 5db2248d-2dca-401e-b2e2-bbaee23b623e                     |
|                       | load@load-1: 6f74ac93-78fd-4a74-a47e-d2add487a30f                     |
|                       | load@load-2: 1897aca1-356e-4791-907f-512e516992b5                     |
|                       | memory@memory-active-0: 83944a85-9c84-4fe4-b471-1a6cf8dce858          |
|                       | memory@memory-free-0: 0ccc7cfa-26a5-4441-a15f-9ebb2aa82c6d            |
|                       | memory@memory-inactive-0: 63736026-94c4-47c5-8d6f-a9d89d65025b        |
|                       | memory@memory-wired-0: b7217fd6-2cdc-4efd-b1a8-a1edd52eaa2e           |
| original_resource_id  | dd245138-00c7-5bdc-94f8-263e236812f7                                  |
| project_id            | None                                                                  |
| revision_end          | None                                                                  |
| revision_start        | 2017-01-26T14:21:02.297483+00:00                                      |
| started_at            | 2017-01-26T14:21:02.297466+00:00                                      |
| type                  | collectd                                                              |
| user_id               | None                                                                  |
+-----------------------+-----------------------------------------------------------------------+
% gnocchi metric show -r collectd:localhost load@load-0
+------------------------------------+-----------------------------------------------------------------------+
| Field                              | Value                                                                 |
+------------------------------------+-----------------------------------------------------------------------+
| archive_policy/aggregation_methods | min, std, sum, median, mean, 95pct, count, max                        |
| archive_policy/back_window         | 0                                                                     |
| archive_policy/definition          | - timespan: 1:00:00, granularity: 0:05:00, points: 12                 |
|                                    | - timespan: 1 day, 0:00:00, granularity: 1:00:00, points: 24          |
|                                    | - timespan: 30 days, 0:00:00, granularity: 1 day, 0:00:00, points: 30 |
| archive_policy/name                | low                                                                   |
| created_by_project_id              |                                                                       |
| created_by_user_id                 | admin                                                                 |
| creator                            | admin                                                                 |
| id                                 | 5db2248d-2dca-401e-b2e2-bbaee23b623e                                  |
| name                               | load@load-0                                                           |
| resource/created_by_project_id     |                                                                       |
| resource/created_by_user_id        | admin                                                                 |
| resource/creator                   | admin                                                                 |
| resource/ended_at                  | None                                                                  |
| resource/id                        | dd245138-00c7-5bdc-94f8-263e236812f7                                  |
| resource/original_resource_id      | dd245138-00c7-5bdc-94f8-263e236812f7                                  |
| resource/project_id                | None                                                                  |
| resource/revision_end              | None                                                                  |
| resource/revision_start            | 2017-01-26T14:21:02.297483+00:00                                      |
| resource/started_at                | 2017-01-26T14:21:02.297466+00:00                                      |
| resource/type                      | collectd                                                              |
| resource/user_id                   | None                                                                  |
| unit                               | None                                                                  |
+------------------------------------+-----------------------------------------------------------------------+
$ gnocchi measures show -r collectd:localhost load@load-0
+---------------------------+-------------+--------------------+
| timestamp                 | granularity |              value |
+---------------------------+-------------+--------------------+
| 2017-01-26T00:00:00+00:00 |     86400.0 | 3.2705004391254193 |
| 2017-01-26T15:00:00+00:00 |      3600.0 | 3.2705004391254193 |
| 2017-01-26T15:00:00+00:00 |       300.0 | 2.6022800611413044 |
| 2017-01-26T15:05:00+00:00 |       300.0 |  3.561742940080275 |
| 2017-01-26T15:10:00+00:00 |       300.0 | 2.5605337960379466 |
| 2017-01-26T15:15:00+00:00 |       300.0 |  3.837517851142473 |
| 2017-01-26T15:20:00+00:00 |       300.0 | 3.9625948392427883 |
| 2017-01-26T15:25:00+00:00 |       300.0 | 3.2690042162698414 |
+---------------------------+-------------+--------------------+
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;As you can see, the command line works smoothly and can show you any kind of metric reported by &lt;em&gt;collectd&lt;/em&gt;. In this case, it was just running on my laptop, but you can imagine it&apos;s easy enough to poll thousands of hosts with &lt;em&gt;collectd&lt;/em&gt; and &lt;em&gt;Gnocchi&lt;/em&gt;.&lt;/p&gt;
&lt;h2&gt;Bonus: charting with Grafana&lt;/h2&gt;
&lt;p&gt;&lt;a href=&quot;http://grafana.org&quot;&gt;Grafana&lt;/a&gt;, a charting software, has a plugin for &lt;em&gt;Gnocchi&lt;/em&gt; as &lt;a href=&quot;http://gnocchi.xyz/grafana.html&quot;&gt;detailed in the documentation&lt;/a&gt;. Once installed, you can just configure &lt;em&gt;Grafana&lt;/em&gt; to point to &lt;em&gt;Gnocchi&lt;/em&gt; this way:&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/grafana-config-screen-gnocchi.png&quot; alt=&quot;Screenshot of Grafana configuration screen pointing to Gnocchi as data source&quot; /&gt;&lt;/p&gt;
&lt;p&gt;You can then create a new dashboard by filling the forms as you wish. See this other screenshot for a nice example:&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/grafana-gnocchi-load.png&quot; alt=&quot;Charts of my laptop&apos;s load average&quot; /&gt;&lt;/p&gt;
&lt;p&gt;I hope everything is clear and easy enough. If you have any question, feel free to write something in the comment section!&lt;/p&gt;
</content:encoded><category>gnocchi</category><category>monitoring</category></item><item><title>Gnocchi talk at the Paris Monitoring Meetup #6</title><link>https://julien.danjou.info/blog/paris-monitoring-6-gnocchi/</link><guid isPermaLink="true">https://julien.danjou.info/blog/paris-monitoring-6-gnocchi/</guid><description>Last week was the sixth edition of the Paris Monitoring Meetup, where I was invited as a speaker to present and talk about Gnocchi.</description><pubDate>Fri, 27 May 2016 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Last week was the sixth edition of the &lt;a href=&quot;http://www.meetup.com/Paris-Monitoring/events/230515751/&quot;&gt;Paris Monitoring Meetup&lt;/a&gt;, where I was invited as a speaker to present and talk about &lt;a href=&quot;http://gnocchi.xyz&quot;&gt;Gnocchi&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/paris-monitoring.png&quot; alt=&quot;paris-monitoring&quot; /&gt;&lt;/p&gt;
&lt;p&gt;There were around 50 people in the room, listening to my presentation of Gnocchi.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/jd-gnocchi-paris-monitoring-meetup-6.jpg&quot; alt=&quot;jd-gnocchi-paris-monitoring-meetup-6&quot; /&gt;&lt;/p&gt;
&lt;p&gt;The talk went fine and I had a few interesting questions and feedback. One interesting point that keeps coming up when talking about Gnocchi is its OpenStack label, which scares away a lot of people. We definitely need to keep explaining that the project works stand-alone and has no dependency on OpenStack, just great integration with it.&lt;/p&gt;
&lt;p&gt;The &lt;a href=&quot;http://www.monitoring-fr.org/&quot;&gt;Monitoring-fr&lt;/a&gt; organization also &lt;a href=&quot;http://www.monitoring-fr.org/2016/05/meetup-paris-monitoring-6-interview-de-julien-danjou-pour-gnocchi-metric-as-a-service/&quot;&gt;interviewed me&lt;/a&gt; after the meetup about Gnocchi. The interview is in French, obviously. I talk about Gnocchi, what it does, how it does it and why we started the project a couple of years ago. Enjoy, and let me know what you think!&lt;/p&gt;
</content:encoded><category>talks</category><category>monitoring</category><category>gnocchi</category><category>openstack</category></item><item><title>Visualize your OpenStack cloud: Gnocchi &amp; Grafana</title><link>https://julien.danjou.info/blog/openstack-gnocchi-grafana/</link><guid isPermaLink="true">https://julien.danjou.info/blog/openstack-gnocchi-grafana/</guid><description>We&apos;ve been working hard with the Gnocchi team these last months to store your metrics, and I guess it&apos;s time to show off a bit.  So far Gnocchi offers scalable metric storage and resource indexation,</description><pubDate>Mon, 14 Sep 2015 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;We&apos;ve been working hard with the Gnocchi team these last months to store your metrics, and I guess it&apos;s time to show off a bit.&lt;/p&gt;
&lt;p&gt;So far Gnocchi offers scalable metric storage and resource indexation, especially for OpenStack clouds – but not only, as it&apos;s generic. It&apos;s cool to store metrics, but it can be even better to have a way to visualize them!&lt;/p&gt;
&lt;h2&gt;Prototyping&lt;/h2&gt;
&lt;p&gt;We very soon started to build a little HTML interface. Being REST-friendly guys, we enabled it on the same endpoints that were being used to retrieve information and measures about metrics, sending back &lt;code&gt;text/html&lt;/code&gt; instead of &lt;code&gt;application/json&lt;/code&gt; if you were requesting those pages from a Web browser.&lt;/p&gt;
&lt;p&gt;But let&apos;s face it: we are back-end developers, and we suck at any kind of front-end development. CSS, HTML, JavaScript? Bwah! So what we built was a starting point, hoping some magical Web developer would jump in and finish the job.&lt;/p&gt;
&lt;p&gt;Obviously it never happened.&lt;/p&gt;
&lt;h2&gt;Ok, so what&apos;s out there?&lt;/h2&gt;
&lt;p&gt;It turns out there are back-end agnostic solutions out there, and we decided to pick &lt;a href=&quot;http://grafana.org&quot;&gt;Grafana&lt;/a&gt;. Grafana is a complete graphing dashboard solution that can be plugged on top of any back-end. It already supports timeseries databases such as Graphite, InfluxDB and OpenTSDB.&lt;/p&gt;
&lt;p&gt;That was largely enough for my fellow developer &lt;a href=&quot;https://blog.sileht.net/&quot;&gt;Mehdi Abaakouk&lt;/a&gt; to jump in and start writing a Gnocchi plugin for Grafana! Consequently, there is now a basic but solid and working back-end for Grafana that lies in the &lt;em&gt;&lt;a href=&quot;https://github.com/grafana/grafana-plugins/tree/master/datasources/gnocchi&quot;&gt;grafana-plugins&lt;/a&gt;&lt;/em&gt; repository.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/gnocchi-grafana.png&quot; alt=&quot;gnocchi-grafana&quot; /&gt;&lt;/p&gt;
&lt;p&gt;With that plugin, you can graph anything that is stored in Gnocchi, from raw metrics to metrics tied to resources. You can use templating, but there are no annotations yet.&lt;/p&gt;
&lt;p&gt;The back-end supports Gnocchi with or without Keystone involved, and any type of authentication (basic auth or Keystone token). So yes, it even works if you&apos;re not running Gnocchi with the rest of OpenStack.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/gnocchi-grafana-group.png&quot; alt=&quot;gnocchi-grafana-group&quot; /&gt;&lt;/p&gt;
&lt;p&gt;It also supports advanced queries, so you can search for resources based on some criteria and graph their metrics.&lt;/p&gt;
&lt;h2&gt;I want to try it!&lt;/h2&gt;
&lt;p&gt;If you want to deploy it, all you need to do is install Grafana and its plugins, and create a new datasource pointing to Gnocchi. It is that simple. There&apos;s some CORS middleware configuration involved if you&apos;re planning on using Keystone authentication, but it&apos;s pretty straightforward – just set the &lt;code&gt;cors.allowed_origin&lt;/code&gt; option to the URL of your Grafana dashboard, as sketched below.&lt;/p&gt;
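&lt;p&gt;Assuming the option lives in Gnocchi&apos;s main configuration file and using a hypothetical Grafana URL, that could boil down to something like:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;[cors]
# Hypothetical URL of the Grafana dashboard allowed to query the Gnocchi API
allowed_origin = http://grafana.example.com:3000
&lt;/code&gt;&lt;/pre&gt;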
&lt;p&gt;We added support for Grafana directly in the Gnocchi DevStack plugin. If you&apos;re running &lt;a href=&quot;http://devstack.org&quot;&gt;DevStack&lt;/a&gt;, you can follow &lt;a href=&quot;http://docs.openstack.org/developer/gnocchi/devstack.html&quot;&gt;the instructions&lt;/a&gt; – which basically means adding the line &lt;code&gt;enable_service gnocchi-grafana&lt;/code&gt;.&lt;/p&gt;
&lt;h2&gt;Moving to Grafana core&lt;/h2&gt;
&lt;p&gt;&lt;a href=&quot;https://github.com/grafana/grafana/pull/2716&quot;&gt;Mehdi just opened a pull request&lt;/a&gt; a few days ago to merge the plugin into Grafana core. It&apos;s actually one of the most unit-tested plugins in Grafana so far, so it should be on a good path to being merged, bringing support for Gnocchi directly into Grafana without any plugin involved.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/grafana-gnocchi-unittests.png&quot; alt=&quot;grafana-gnocchi-unittests&quot; /&gt;&lt;/p&gt;
</content:encoded><category>gnocchi</category><category>openstack</category><category>monitoring</category></item><item><title>Python bad practice, a concrete case</title><link>https://julien.danjou.info/blog/python-bad-practice-concrete-case/</link><guid isPermaLink="true">https://julien.danjou.info/blog/python-bad-practice-concrete-case/</guid><description>A lot of people read up on good Python practice, and there&apos;s plenty of information about that on the Internet. Many tips are included in the book I wrote this year, The Hacker&apos;s Guide to Python.</description><pubDate>Mon, 15 Sep 2014 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;A lot of people read up on good Python practice, and there&apos;s plenty of information about that on the Internet. Many tips are included in the book I wrote this year, &lt;a href=&quot;https://thehackerguidetopython.com&quot;&gt;The Hacker&apos;s Guide to Python&lt;/a&gt;. Today I&apos;d like to show a concrete case of code that I don&apos;t consider to be state of the art.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/python-thumb-down.png&quot; alt=&quot;python-thumb-down&quot; /&gt;&lt;/p&gt;
&lt;p&gt;In my &lt;a href=&quot;https://julien.danjou.info/blog/openstack-ceilometer-the-gnocchi-experiment&quot;&gt;last article&lt;/a&gt; where I talked about my new project Gnocchi, I wrote about how I tested, hacked and then ditched &lt;em&gt;&lt;a href=&quot;http://graphite.wikidot.com/whisper&quot;&gt;whisper&lt;/a&gt;&lt;/em&gt; out. Here I&apos;m going to explain part of my thought process and a few things that raised my eyebrows when hacking this code.&lt;/p&gt;
&lt;p&gt;Before I start, please don&apos;t get the spirit of this article wrong. It&apos;s in no way a personal attack on the authors and contributors (whom I don&apos;t know). Furthermore, &lt;em&gt;whisper&lt;/em&gt; is a piece of code that is in production in thousands of installations, storing metrics for years. While I can argue that I consider the code not to be following best practice, it definitely works well enough and is valuable to a lot of people.&lt;/p&gt;
&lt;h2&gt;Tests&lt;/h2&gt;
&lt;p&gt;The first thing that I noticed when trying to hack on &lt;em&gt;whisper&lt;/em&gt; is the lack of tests. There&apos;s only one file containing tests, named &lt;code&gt;test_whisper.py&lt;/code&gt;, and the coverage it provides is pretty low. One can check that using the &lt;em&gt;coverage&lt;/em&gt; tool.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;$ coverage run test_whisper.py
...........
----------------------------------------------------------------------
Ran 11 tests in 0.014s

OK
$ coverage report
Name           Stmts   Miss  Cover
----------------------------------
test_whisper     134      4    97%
whisper          584    227    61%
----------------------------------
TOTAL            718    231    67%
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;While one would think that 61% is &quot;not so bad&quot;, taking a quick peek at the actual test code shows that the tests are incomplete. What I mean by incomplete is that they, for example, use the library to store values into a database, but they never check whether the results can be fetched and whether the fetched results are accurate. Here&apos;s a good reason one should never blindly trust the test coverage percentage as a quality metric.&lt;/p&gt;
&lt;p&gt;When I tried to modify &lt;em&gt;whisper&lt;/em&gt;, as the tests do not check the entire cycle of the values fed into the database, I ended up making wrong changes while the tests still passed.&lt;/p&gt;
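&lt;p&gt;The kind of roundtrip test that was missing would simply write a value and read it back. Here is a sketch of what I mean, based on my understanding of the &lt;code&gt;create&lt;/code&gt;, &lt;code&gt;update&lt;/code&gt; and &lt;code&gt;fetch&lt;/code&gt; signatures, so take the details with a grain of salt:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import os
import tempfile
import time
import unittest

import whisper


class TestRoundtrip(unittest.TestCase):
    def test_update_then_fetch(self):
        path = os.path.join(tempfile.mkdtemp(), &quot;roundtrip.wsp&quot;)
        # One archive: 1-second resolution, 60 points
        whisper.create(path, [(1, 60)])
        now = int(time.time())
        whisper.update(path, 42.0, now)
        (start, end, step), values = whisper.fetch(path, now - 10)
        # The value we stored must come back when fetching
        self.assertIn(42.0, values)


if __name__ == &quot;__main__&quot;:
    unittest.main()
&lt;/code&gt;&lt;/pre&gt;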
&lt;h2&gt;No PEP 8, no Python 3&lt;/h2&gt;
&lt;p&gt;The code doesn&apos;t respect PEP 8. A run of &lt;a href=&quot;https://flake8.readthedocs.org/&quot;&gt;flake8&lt;/a&gt; + &lt;a href=&quot;https://pypi.python.org/pypi/hacking&quot;&gt;hacking&lt;/a&gt; shows 732 errors… While it does not impact the code itself, it&apos;s more painful to hack on it than it is on most Python projects.&lt;/p&gt;
&lt;p&gt;The &lt;em&gt;hacking&lt;/em&gt; tool also shows that the code is not Python 3 ready as there is usage of Python 2 only syntax.&lt;/p&gt;
&lt;p&gt;A good way to fix that would be to set up &lt;a href=&quot;https://testrun.org/tox/latest/&quot;&gt;tox&lt;/a&gt; and add a few targets for PEP 8 checks and Python 3 tests. Even if the test suite is not complete, starting by having flake8 run without errors and the few unit tests working with Python 3 would put the project in a better light.&lt;/p&gt;
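&lt;p&gt;A minimal &lt;code&gt;tox.ini&lt;/code&gt; sketch of what I have in mind (the environment list and commands are just an assumption of how the project could be wired up):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;[tox]
envlist = py27,py34,pep8

[testenv]
commands = python test_whisper.py

[testenv:pep8]
deps = flake8
       hacking
commands = flake8 whisper.py test_whisper.py
&lt;/code&gt;&lt;/pre&gt;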
&lt;h2&gt;Not using idiomatic Python&lt;/h2&gt;
&lt;p&gt;A lot of the code could be simplified by using idiomatic Python. Let&apos;s take a simple example:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;def fetch(path,fromTime,untilTime=None,now=None):
  fh = None
  try:
    fh = open(path,&apos;rb&apos;)
    return file_fetch(fh, fromTime, untilTime, now)
  finally:
    if fh:
      fh.close()
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;That piece of code could be easily rewritten as:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;def fetch(path,fromTime,untilTime=None,now=None):
  with open(path, &apos;rb&apos;) as fh:
    return file_fetch(fh, fromTime, untilTime, now)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This way, the function actually looks so simple that one can even wonder why it should exist – but why not.&lt;/p&gt;
&lt;p&gt;Usage of loops could also be made more Pythonic:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;for i,archive in enumerate(archiveList):
  if i == len(archiveList) - 1:
    break
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;could be actually:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;for archive in itertools.islice(archiveList, len(archiveList) - 1):
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;That reduces the code size and makes it easier to read through the code.&lt;/p&gt;
&lt;h2&gt;Wrong abstraction level&lt;/h2&gt;
&lt;p&gt;Also, one thing that I noticed in &lt;em&gt;whisper&lt;/em&gt;, is that it abstracts its features at the wrong level.&lt;/p&gt;
&lt;p&gt;Take the &lt;code&gt;create()&lt;/code&gt; function, it&apos;s pretty obvious:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;def create(path,archiveList,xFilesFactor=None,aggregationMethod=None,sparse=False,useFallocate=False):
  # Set default params
  if xFilesFactor is None:
    xFilesFactor = 0.5
  if aggregationMethod is None:
    aggregationMethod = &apos;average&apos;

  #Validate archive configurations...
  validateArchiveList(archiveList)

  #Looks good, now we create the file and write the header
  if os.path.exists(path):
    raise InvalidConfiguration(&quot;File %s already exists!&quot; % path)
  fh = None
  try:
    fh = open(path,&apos;wb&apos;)
    if LOCK:
      fcntl.flock( fh.fileno(), fcntl.LOCK_EX )

    aggregationType = struct.pack( longFormat, aggregationMethodToType.get(aggregationMethod, 1) )
    oldest = max([secondsPerPoint * points for secondsPerPoint,points in archiveList])
    maxRetention = struct.pack( longFormat, oldest )
    xFilesFactor = struct.pack( floatFormat, float(xFilesFactor) )
    archiveCount = struct.pack(longFormat, len(archiveList))
    packedMetadata = aggregationType + maxRetention + xFilesFactor + archiveCount
    fh.write(packedMetadata)
    headerSize = metadataSize + (archiveInfoSize * len(archiveList))
    archiveOffsetPointer = headerSize

    for secondsPerPoint,points in archiveList:
      archiveInfo = struct.pack(archiveInfoFormat, archiveOffsetPointer, secondsPerPoint, points)
      fh.write(archiveInfo)
      archiveOffsetPointer += (points * pointSize)

    #If configured to use fallocate and capable of fallocate use that, else
    #attempt sparse if configure or zero pre-allocate if sparse isn&apos;t configured.
    if CAN_FALLOCATE and useFallocate:
      remaining = archiveOffsetPointer - headerSize
      fallocate(fh, headerSize, remaining)
    elif sparse:
      fh.seek(archiveOffsetPointer - 1)
      fh.write(&apos;\x00&apos;)
    else:
      remaining = archiveOffsetPointer - headerSize
      chunksize = 16384
      zeroes = &apos;\x00&apos; * chunksize
      while remaining &amp;gt; chunksize:
        fh.write(zeroes)
        remaining -= chunksize
      fh.write(zeroes[:remaining])

    if AUTOFLUSH:
      fh.flush()
      os.fsync(fh.fileno())
  finally:
    if fh:
      fh.close()
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The function is doing &lt;strong&gt;everything&lt;/strong&gt;: checking if the file doesn&apos;t exist already, opening it, building the structured data, writing this, building more structure, then writing that, etc.&lt;/p&gt;
&lt;p&gt;That means that the caller has to give a file path, even if it just wants a &lt;em&gt;whisper&lt;/em&gt; data structure to store itself elsewhere. &lt;code&gt;StringIO()&lt;/code&gt; could be used to fake a file handler, but it will fail if the call to &lt;code&gt;fcntl.flock()&lt;/code&gt; is not disabled – and it is inefficient anyway.&lt;/p&gt;
&lt;p&gt;There are a lot of other functions in the code, such as &lt;code&gt;setAggregationMethod()&lt;/code&gt;, that mix the handling of the files – even doing things like &lt;code&gt;os.fsync()&lt;/code&gt; – with the manipulation of structured data. This is definitely not a good design, especially for a library, as it turns out reusing the functions in a different context is nearly impossible.&lt;/p&gt;
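&lt;p&gt;A better split would keep the data packing pure and push the I/O into a thin layer on top of it. A rough sketch of the idea (the names, signatures and header layout below are mine, not &lt;em&gt;whisper&lt;/em&gt;&apos;s):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import struct

# Hypothetical header layout, for the sake of the example
longFormat = &quot;!L&quot;
floatFormat = &quot;!f&quot;


def pack_metadata(aggregation_type, max_retention, x_files_factor, archive_count):
    &quot;&quot;&quot;Build the packed header bytes without touching any file.&quot;&quot;&quot;
    return (struct.pack(longFormat, aggregation_type)
            + struct.pack(longFormat, max_retention)
            + struct.pack(floatFormat, x_files_factor)
            + struct.pack(longFormat, archive_count))


def write_header(fh, metadata):
    &quot;&quot;&quot;Only this layer knows about file handles, locking or fsync.&quot;&quot;&quot;
    fh.seek(0)
    fh.write(metadata)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;With such a split, the pure functions are trivial to unit test and work with any file-like object, while the file handling (locking, pre-allocation, flushing) stays in one place.&lt;/p&gt;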
&lt;h2&gt;Race conditions&lt;/h2&gt;
&lt;p&gt;There are race conditions, for example in &lt;code&gt;create()&lt;/code&gt; (see added comment):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;if os.path.exists(path):
  raise InvalidConfiguration(&quot;File %s already exists!&quot; % path)
fh = None
try:
  # TOO LATE I ALREADY CREATED THE FILE IN ANOTHER PROCESS YOU ARE GOING TO
  # FAIL WITHOUT GIVING ANY USEFUL INFORMATION TO THE CALLER :-(
  fh = open(path,&apos;wb&apos;)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;That code should be:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;try:
  fh = os.fdopen(os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL), &apos;wb&apos;)
except OSError as e:
  if e.errno == errno.EEXIST:
    raise InvalidConfiguration(&quot;File %s already exists!&quot; % path)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;to avoid any race condition.&lt;/p&gt;
&lt;h2&gt;Unwanted optimization&lt;/h2&gt;
&lt;p&gt;We saw earlier that the &lt;code&gt;fetch()&lt;/code&gt; function is barely useful, so let&apos;s take a look at the &lt;code&gt;file_fetch()&lt;/code&gt; function it&apos;s calling.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;def file_fetch(fh, fromTime, untilTime, now = None):
  header = __readHeader(fh)
[...]
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The first thing the function does is to read the header from the file handler.&lt;/p&gt;
&lt;p&gt;Let&apos;s take a look at that function:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;def __readHeader(fh):
  info = __headerCache.get(fh.name)
  if info:
    return info

  originalOffset = fh.tell()
  fh.seek(0)
  packedMetadata = fh.read(metadataSize)

  try:
    (aggregationType,maxRetention,xff,archiveCount) = struct.unpack(metadataFormat,packedMetadata)
  except:
    raise CorruptWhisperFile(&quot;Unable to read header&quot;, fh.name)
[...]
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The first thing the function does is to look into a cache. Why is there a cache?&lt;/p&gt;
&lt;p&gt;It actually caches the header with an index based on the file path (&lt;code&gt;fh.name&lt;/code&gt;). Except that if one, for example, decides not to use a file and cheats using &lt;code&gt;StringIO&lt;/code&gt;, then it does not have any name attribute. So this code path will raise an &lt;code&gt;AttributeError&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;One has to set a fake name manually on the &lt;code&gt;StringIO&lt;/code&gt; instance, and it must be unique so nobody messes with the cache:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import StringIO

packedMetadata = &amp;lt;some source&amp;gt;
fh = StringIO.StringIO(packedMetadata)
fh.name = &quot;myfakename&quot;
header = __readHeader(fh)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The cache may actually be useful when accessing files, but it&apos;s definitely useless when not using files. And it&apos;s not necessarily true that the complexity (even if small) that the cache adds is worth it. I doubt most &lt;em&gt;whisper&lt;/em&gt;-based tools are long-running processes, so the cache that is really used when accessing the files is the one handled by the operating system kernel, which is going to be much more efficient anyway, and shared between processes. There&apos;s also no expiry of that cache, which could end up wasting tons of memory.&lt;/p&gt;
&lt;h2&gt;Docstrings&lt;/h2&gt;
&lt;p&gt;None of the docstrings are written in a parsable syntax like &lt;a href=&quot;http://sphinx-doc.org/&quot;&gt;Sphinx&lt;/a&gt;. This means you cannot generate documentation in a nice format that a developer using the library could read easily.&lt;/p&gt;
&lt;p&gt;The documentation is also not up to date:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;def fetch(path,fromTime,untilTime=None,now=None):
  &quot;&quot;&quot;fetch(path,fromTime,untilTime=None)
[...]
&quot;&quot;&quot;

def create(path,archiveList,xFilesFactor=None,aggregationMethod=None,sparse=False,useFallocate=False):
  &quot;&quot;&quot;create(path,archiveList,xFilesFactor=0.5,aggregationMethod=&apos;average&apos;)
[...]
&quot;&quot;&quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This is something that could be avoided if a proper format was picked to write the docstrings. A tool could then be used to notice when there&apos;s a divergence between the actual function signature and the documented one, like a missing argument.&lt;/p&gt;
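&lt;p&gt;For example, &lt;code&gt;fetch&lt;/code&gt; documented with a Sphinx-style docstring (the parameter descriptions are my own guesses) would stay parsable and checkable:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;def fetch(path, fromTime, untilTime=None, now=None):
    &quot;&quot;&quot;Fetch data points from a whisper file.

    :param path: path to the whisper file
    :param fromTime: epoch timestamp to start fetching from
    :param untilTime: epoch timestamp to stop at, defaults to now
    :param now: override the current time, mainly useful for tests
    :return: the time info and the list of fetched values
    &quot;&quot;&quot;
    with open(path, &apos;rb&apos;) as fh:
        return file_fetch(fh, fromTime, untilTime, now)
&lt;/code&gt;&lt;/pre&gt;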
&lt;h2&gt;Duplicated code&lt;/h2&gt;
&lt;p&gt;Last but not least, there&apos;s a lot of code that is duplicated in the scripts provided by &lt;em&gt;whisper&lt;/em&gt; in its &lt;code&gt;bin&lt;/code&gt; directory. These scripts should be very lightweight and use the &lt;code&gt;console_scripts&lt;/code&gt; facility of &lt;em&gt;setuptools&lt;/em&gt;, but they actually contain a lot of (untested) code. Furthermore, some of that code is partially duplicated from the &lt;code&gt;whisper.py&lt;/code&gt; library, which is against &lt;a href=&quot;http://en.wikipedia.org/wiki/Don&apos;t_repeat_yourself&quot;&gt;DRY&lt;/a&gt;.&lt;/p&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;There are a few more things that made me stop considering &lt;em&gt;whisper&lt;/em&gt;, but these are part of the &lt;em&gt;whisper&lt;/em&gt; features, not necessarily code quality. One can also point out that the code is very condensed and hard to read, and that&apos;s a more general problem about how it is organized and abstracted.&lt;/p&gt;
&lt;p&gt;A lot of these defects are actually points that made me start writing &lt;a href=&quot;https://thehackerguidetopython.com&quot;&gt;The Hacker&apos;s Guide to Python&lt;/a&gt; a year ago. Running into this kind of code makes me think it was a really good idea to write a book of advice on writing better Python code!&lt;/p&gt;
</content:encoded><category>python</category><category>monitoring</category></item><item><title>First release of PyMuninCli</title><link>https://julien.danjou.info/blog/pymunincli-0-1/</link><guid isPermaLink="true">https://julien.danjou.info/blog/pymunincli-0-1/</guid><description>Today I release a Python client library to query Munin servers.</description><pubDate>Tue, 17 Apr 2012 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Today I release a &lt;a href=&quot;http://python.org&quot;&gt;Python&lt;/a&gt; client library to query &lt;a href=&quot;http://munin-monitoring.org/&quot;&gt;Munin&lt;/a&gt; servers.&lt;/p&gt;
&lt;p&gt;I wrote it as part of some experiments I did a few weeks ago. I discovered there was no client library to query a Munin server. There&apos;s &lt;a href=&quot;http://aouyar.github.com/PyMunin/&quot;&gt;PyMunin&lt;/a&gt; or &lt;a href=&quot;http://samuelks.com/python-munin/&quot;&gt;python-munin&lt;/a&gt;, which help with developing Munin plugins, but nothing to access the &lt;em&gt;munin-node&lt;/em&gt; and retrieve its data.&lt;/p&gt;
&lt;p&gt;So I decided to write a quick and simple one, and it&apos;s released under the name of &lt;a href=&quot;https://github.com/jd/pymunincli&quot;&gt;PyMuninCli&lt;/a&gt;, providing the &lt;em&gt;munin.client&lt;/em&gt; Python module.&lt;/p&gt;
</content:encoded><category>python</category><category>monitoring</category></item></channel></rss>