<?xml version="1.0" encoding="UTF-8"?><rss version="2.0" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>web — jd:/dev/blog</title><description>Posts tagged &quot;web&quot; on jd:/dev/blog.</description><link>https://julien.danjou.info/</link><item><title>Python and fast HTTP clients</title><link>https://julien.danjou.info/blog/python-and-fast-http-clients/</link><guid isPermaLink="true">https://julien.danjou.info/blog/python-and-fast-http-clients/</guid><description>Nowadays, it is more than likely that you will have to write an HTTP client for your application that will have to talk to another HTTP server. The ubiquity of REST API makes HTTP a first class citize</description><pubDate>Mon, 07 Oct 2019 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Nowadays, it is more than likely that you will have to write an HTTP client for your application that will have to talk to another HTTP server. The ubiquity of REST APIs makes HTTP a first-class citizen. That&apos;s why knowing optimization patterns is a prerequisite.&lt;/p&gt;
&lt;p&gt;There are many HTTP clients in Python; the most widely used and easiest to work with is &lt;em&gt;&lt;a href=&quot;https://requests.kennethreitz.org/&quot;&gt;requests&lt;/a&gt;&lt;/em&gt;. It is the de facto standard nowadays.&lt;/p&gt;
&lt;h2&gt;Persistent Connections&lt;/h2&gt;
&lt;p&gt;The first optimization to take into account is the use of a persistent connection to the Web server. Persistent connections are a standard since HTTP 1.1 though many applications do not leverage them. This lack of optimization is simple to explain if you know that when using &lt;em&gt;requests&lt;/em&gt; in its simple mode (e.g. with the &lt;code&gt;get&lt;/code&gt; function) the connection is closed on return. To avoid that, an application needs to use a &lt;code&gt;Session&lt;/code&gt; object that allows reusing an already opened connection.&lt;/p&gt;
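&lt;p&gt;A minimal sketch of that pattern (the URLs are just placeholders):&lt;/p&gt;

```python
import requests

# A Session keeps the underlying TCP connection open between requests,
# so repeated calls to the same host skip the TCP (and TLS) handshake.
session = requests.Session()

def fetch_all(urls):
    # Each call reuses a pooled connection when the scheme and host match.
    return [session.get(url) for url in urls]
```

&lt;p&gt;Calling &lt;code&gt;fetch_all&lt;/code&gt; with ten URLs on the same host opens one connection instead of ten.&lt;/p&gt;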
&lt;p&gt;Each connection is stored in a pool of connections (10 by default), the size of which is also configurable:&lt;/p&gt;
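&lt;p&gt;For instance, a sketch that bumps the pool to 100 connections through &lt;code&gt;requests.adapters.HTTPAdapter&lt;/code&gt;:&lt;/p&gt;

```python
import requests
from requests.adapters import HTTPAdapter

session = requests.Session()
# pool_connections: number of per-host pools to cache;
# pool_maxsize: maximum connections kept alive per host pool.
adapter = HTTPAdapter(pool_connections=100, pool_maxsize=100)
session.mount("http://", adapter)
session.mount("https://", adapter)
```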
&lt;p&gt;Reusing the TCP connection to send out several HTTP requests offers a number of performance advantages:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Lower CPU and memory usage (fewer connections opened simultaneously).&lt;/li&gt;
&lt;li&gt;Reduced latency in subsequent requests (no TCP handshaking).&lt;/li&gt;
&lt;li&gt;Exceptions can be raised without the penalty of closing the TCP connection.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The HTTP protocol also provides &lt;a href=&quot;https://en.wikipedia.org/wiki/HTTP_pipelining&quot;&gt;pipelining&lt;/a&gt;, which allows sending several requests on the same connection without waiting for the replies to come (think batch). Unfortunately, this is not supported by the &lt;em&gt;requests&lt;/em&gt; library. However, pipelining requests may not be as fast as sending them in parallel. Indeed, the HTTP 1.1 protocol forces the replies to be sent in the same order as the requests were sent – first-in first-out.&lt;/p&gt;
&lt;h2&gt;Parallelism&lt;/h2&gt;
&lt;p&gt;&lt;em&gt;requests&lt;/em&gt; also has one major drawback: it is synchronous. Calling &lt;code&gt;requests.get(&quot;http://example.org&quot;)&lt;/code&gt; blocks the program until the HTTP server replies completely. Having the application wait and do nothing is a drawback here: the program could do something else rather than sitting idle.&lt;/p&gt;
&lt;p&gt;A smart application can mitigate this problem by using a pool of threads like the ones provided by &lt;code&gt;concurrent.futures&lt;/code&gt;. It allows parallelizing the HTTP requests in a very rapid way.&lt;/p&gt;
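&lt;p&gt;A sketch of that approach, with a pool of four worker threads:&lt;/p&gt;

```python
from concurrent import futures

import requests

session = requests.Session()

def fetch_threaded(urls, max_workers=4):
    # executor.map dispatches the GETs to the worker threads and
    # yields the responses back in the order of the input URLs.
    with futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
        return list(executor.map(session.get, urls))
```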
&lt;p&gt;This pattern being quite useful, it has been packaged into a library named &lt;em&gt;&lt;a href=&quot;https://github.com/ross/requests-futures&quot;&gt;requests-futures&lt;/a&gt;&lt;/em&gt;. The usage of &lt;code&gt;Session&lt;/code&gt; objects is made transparent to the developer:&lt;/p&gt;
&lt;p&gt;By default an executor with two worker threads is created, but a program can easily customize this value by passing the &lt;code&gt;max_workers&lt;/code&gt; argument or even its own executor to the &lt;code&gt;FuturesSession&lt;/code&gt; object – for example like this: &lt;code&gt;FuturesSession(executor=ThreadPoolExecutor(max_workers=10))&lt;/code&gt;.&lt;/p&gt;
&lt;h2&gt;Asynchronicity&lt;/h2&gt;
&lt;p&gt;As explained earlier, &lt;em&gt;requests&lt;/em&gt; is entirely synchronous. That blocks the application while waiting for the server to reply, slowing down the program. Making HTTP requests in threads is one solution, but threads do have their own overhead and this implies parallelism, which is not something everyone is always glad to see in a program.&lt;/p&gt;
&lt;p&gt;Starting with version 3.5, Python offers asynchronicity at its core through &lt;em&gt;asyncio&lt;/em&gt;. The &lt;a href=&quot;https://aiohttp.readthedocs.io/&quot;&gt;aiohttp&lt;/a&gt; library provides an asynchronous HTTP client built on top of &lt;em&gt;asyncio&lt;/em&gt;. This library allows sending requests in series but without waiting for the first reply to come back before sending the next one. In contrast to HTTP pipelining, &lt;em&gt;aiohttp&lt;/em&gt; sends the requests over multiple connections in parallel, avoiding the ordering issue explained earlier.&lt;/p&gt;
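&lt;p&gt;A sketch of the same many-requests pattern with &lt;em&gt;aiohttp&lt;/em&gt;:&lt;/p&gt;

```python
import asyncio

import aiohttp

async def fetch_async(urls):
    async with aiohttp.ClientSession() as session:
        async def get(url):
            async with session.get(url) as response:
                return await response.read()
        # gather() schedules all the requests at once instead of
        # waiting for each reply before sending the next request.
        return await asyncio.gather(*[get(url) for url in urls])

# asyncio.run(fetch_async(["http://example.org"] * 10))
```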
&lt;p&gt;All those solutions (using &lt;code&gt;Session&lt;/code&gt;, &lt;em&gt;threads&lt;/em&gt;, &lt;em&gt;futures&lt;/em&gt; or &lt;em&gt;asyncio&lt;/em&gt;) offer different approaches to making HTTP clients faster.&lt;/p&gt;
&lt;h2&gt;Performance&lt;/h2&gt;
&lt;p&gt;The snippet below is an HTTP client sending requests to &lt;code&gt;httpbin.org&lt;/code&gt;, an HTTP API that provides (among other things) an endpoint simulating a long request (a second here). This example implements all the techniques listed above and times them.&lt;/p&gt;
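&lt;p&gt;The full snippet is not reproduced here, but the timing harness can be sketched as follows (only the serialized and &lt;code&gt;Session&lt;/code&gt; variants are shown; the threaded and &lt;em&gt;aiohttp&lt;/em&gt; variants follow the same pattern):&lt;/p&gt;

```python
import contextlib
import time

import requests

URL = "https://httpbin.org/delay/1"  # replies after a one-second pause
REQUESTS = 10

@contextlib.contextmanager
def report_time(name):
    # Prints how long the wrapped block took, in the output format below.
    start = time.monotonic()
    yield
    print("Time needed for `%s' called: %.2fs"
          % (name, time.monotonic() - start))

def serialized():
    with report_time("serialized"):
        for _ in range(REQUESTS):
            requests.get(URL)

def with_session():
    session = requests.Session()
    with report_time("Session"):
        for _ in range(REQUESTS):
            session.get(URL)
```

&lt;p&gt;Calling &lt;code&gt;serialized()&lt;/code&gt; and &lt;code&gt;with_session()&lt;/code&gt; reproduces the first two timings.&lt;/p&gt;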
&lt;p&gt;Running this program gives the following output:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;Time needed for `serialized&apos; called: 12.12s
Time needed for `Session&apos; called: 11.22s
Time needed for `FuturesSession w/ 2 workers&apos; called: 5.65s
Time needed for `FuturesSession w/ max workers&apos; called: 1.25s
Time needed for `aiohttp&apos; called: 1.19s
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/07/20190716092338_hd.png&quot; alt=&quot;Benchmark chart comparing HTTP client performance in Python&quot; /&gt;&lt;/p&gt;
&lt;p&gt;Without any surprise, the slowest result comes with the dumb serialized version, since all the requests are made one after another without reusing the connection — 12 seconds to make 10 requests.&lt;/p&gt;
&lt;p&gt;Using a &lt;code&gt;Session&lt;/code&gt; object and therefore reusing the connection means saving 8% in terms of time, which is already a big and easy win. Minimally, you should always use a session.&lt;/p&gt;
&lt;p&gt;If your system and program allow the usage of threads, it is a good call to use them to parallelize the requests. However, threads have some overhead and they are not weightless. They need to be created, started and then joined.&lt;/p&gt;
&lt;p&gt;Unless you are still using old versions of Python, without a doubt using &lt;em&gt;aiohttp&lt;/em&gt; should be the way to go nowadays if you want to write a fast and asynchronous HTTP client. It is the fastest and the most scalable solution as it can handle hundreds of parallel requests. The alternative, managing hundreds of threads in parallel, is not a great option.&lt;/p&gt;
&lt;h2&gt;Streaming&lt;/h2&gt;
&lt;p&gt;Another speed optimization that can be efficient is streaming the requests. When making a request, by default the body of the response is downloaded immediately. The &lt;code&gt;stream&lt;/code&gt; parameter provided by the &lt;em&gt;requests&lt;/em&gt; library or the &lt;code&gt;content&lt;/code&gt; attribute for &lt;code&gt;aiohttp&lt;/code&gt; both provide a way to not load the full content in memory as soon as the request is executed.&lt;/p&gt;
&lt;p&gt;Not loading the full content is extremely important in order to avoid allocating potentially hundred of megabytes of memory for nothing. If your program does not need to access the entire content as a whole but can work on chunks, it is probably better to just use those methods. For example, if you&apos;re going to save and write the content to a file, reading only a chunk and writing it at the same time is going to be much more memory efficient than reading the whole HTTP body, allocating a giant pile of memory, and then writing it to disk.&lt;/p&gt;
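&lt;p&gt;A sketch of such a chunked download with &lt;em&gt;requests&lt;/em&gt;:&lt;/p&gt;

```python
import requests

def download(url, path, chunk_size=64 * 1024):
    # stream=True defers downloading the body until it is iterated,
    # so only one chunk at a time is held in memory.
    with requests.get(url, stream=True) as response:
        response.raise_for_status()
        with open(path, "wb") as destination:
            for chunk in response.iter_content(chunk_size):
                destination.write(chunk)
```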
&lt;p&gt;I hope that&apos;ll make it easier for you to write proper HTTP clients and requests. If you know any other useful technique or method, feel free to write it down in the comment section below!&lt;/p&gt;
</content:encoded><category>python</category><category>web</category></item><item><title>Handling multipart/form-data natively in Python</title><link>https://julien.danjou.info/blog/handling-multipart-form-data-python/</link><guid isPermaLink="true">https://julien.danjou.info/blog/handling-multipart-form-data-python/</guid><description>RFC7578 (which obsoletes RFC2388) defines the multipart/form-data type that is usually transported over HTTP when users submit forms on your Web page.</description><pubDate>Mon, 01 Jul 2019 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;a href=&quot;https://tools.ietf.org/html/rfc7578&quot;&gt;RFC7578&lt;/a&gt; (which obsoletes &lt;a href=&quot;https://tools.ietf.org/html/rfc2388&quot;&gt;RFC2388&lt;/a&gt;) defines the &lt;code&gt;multipart/form-data&lt;/code&gt; type that is usually transported over HTTP when users submit forms on your Web page. Nowadays, it tends to be replaced by JSON encoded payloads; nevertheless, it is still widely used.&lt;/p&gt;
&lt;p&gt;While you could decode an HTTP body request made with JSON natively with Python — thanks to the &lt;code&gt;json&lt;/code&gt; module — there is no such way to do that with &lt;code&gt;multipart/form-data&lt;/code&gt;. That&apos;s something barely understandable considering how old the format is.&lt;/p&gt;
&lt;p&gt;There is a wide variety of ways available to encode and decode this format. Libraries such as &lt;em&gt;requests&lt;/em&gt; support this natively without you even noticing, and the same goes for the majority of Web server frameworks such as &lt;em&gt;Django&lt;/em&gt; or &lt;em&gt;Flask&lt;/em&gt;.&lt;/p&gt;
&lt;p&gt;However, in certain circumstances, you might be on your own to encode or decode this format, and it might not be an option to pull (significant) dependencies.&lt;/p&gt;
&lt;h2&gt;Encoding&lt;/h2&gt;
&lt;p&gt;The &lt;code&gt;multipart/form-data&lt;/code&gt; format is quite simple to understand and can be summarised as an easy way to encode a list of keys and values, i.e., a portable way of serializing a dictionary.&lt;/p&gt;
&lt;p&gt;There&apos;s nothing in Python to generate such an encoding. The format is quite simple and consists of the key and value surrounded by a random boundary delimiter. This delimiter must be passed as part of the &lt;code&gt;Content-Type&lt;/code&gt;, so that the decoder can decode the form data.&lt;/p&gt;
&lt;p&gt;There&apos;s a simple implementation in &lt;em&gt;urllib3&lt;/em&gt; that does the job. It&apos;s possible to summarize it in this simple implementation:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import binascii
import os

def encode_multipart_formdata(fields):
    boundary = binascii.hexlify(os.urandom(16)).decode(&apos;ascii&apos;)

    body = (
        &quot;&quot;.join(&quot;--%s\r\n&quot;
                &quot;Content-Disposition: form-data; name=\&quot;%s\&quot;\r\n&quot;
                &quot;\r\n&quot;
                &quot;%s\r\n&quot; % (boundary, field, value)
                for field, value in fields.items()) +
        &quot;--%s--\r\n&quot; % boundary
    )

    content_type = &quot;multipart/form-data; boundary=%s&quot; % boundary

    return body, content_type
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;You can use it by passing a dictionary where keys and values are strings. For example:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;encode_multipart_formdata({&quot;foo&quot;: &quot;bar&quot;, &quot;name&quot;: &quot;jd&quot;})
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Which returns:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;--00252461d3ab8ff5c25834e0bffd6f70
Content-Disposition: form-data; name=&quot;foo&quot;

bar
--00252461d3ab8ff5c25834e0bffd6f70
Content-Disposition: form-data; name=&quot;name&quot;

jd
--00252461d3ab8ff5c25834e0bffd6f70--
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Along with the matching content type:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;multipart/form-data; boundary=00252461d3ab8ff5c25834e0bffd6f70
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;You can use the returned content type in your HTTP reply header &lt;code&gt;Content-Type&lt;/code&gt;. Note that this format is not only used for forms: it is also used by emails.&lt;/p&gt;
&lt;p&gt;Emails did you say?&lt;/p&gt;
&lt;h2&gt;Encoding with &lt;code&gt;email&lt;/code&gt;&lt;/h2&gt;
&lt;p&gt;Right, emails are usually encoded using MIME, which is defined by yet another RFC, &lt;a href=&quot;https://tools.ietf.org/html/rfc2046&quot;&gt;RFC2046&lt;/a&gt;. It turns out that &lt;code&gt;multipart/form-data&lt;/code&gt; is just a particular MIME format, and that if you have code that implements MIME handling, it&apos;s easy to use it to implement this format.&lt;/p&gt;
&lt;p&gt;Fortunately for us, the Python standard library comes with a module that handles exactly that: &lt;code&gt;email.mime&lt;/code&gt;. I told you it was heavily used by emails — I guess that&apos;s why they put that code in the &lt;code&gt;email&lt;/code&gt; subpackage.&lt;/p&gt;
&lt;p&gt;Here&apos;s a piece of code that handles &lt;code&gt;multipart/form-data&lt;/code&gt; in a few lines of code:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;from email import message
from email.mime import multipart
from email.mime import nonmultipart
from email.mime import text

class MIMEFormdata(nonmultipart.MIMENonMultipart):
    def __init__(self, keyname, *args, **kwargs):
        super(MIMEFormdata, self).__init__(*args, **kwargs)
        self.add_header(
            &quot;Content-Disposition&quot;, &quot;form-data; name=\&quot;%s\&quot;&quot; % keyname)

def encode_multipart_formdata(fields):
    m = multipart.MIMEMultipart(&quot;form-data&quot;)

    for field, value in fields.items():
        data = MIMEFormdata(field, &quot;text&quot;, &quot;plain&quot;)
        data.set_payload(value)
        m.attach(data)

    return m
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Using this piece of code returns the following:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;Content-Type: multipart/form-data; boundary=&quot;===============1107021068307284864==&quot;
MIME-Version: 1.0

--===============1107021068307284864==
Content-Type: text/plain
MIME-Version: 1.0
Content-Disposition: form-data; name=&quot;foo&quot;

bar
--===============1107021068307284864==
Content-Type: text/plain
MIME-Version: 1.0
Content-Disposition: form-data; name=&quot;name&quot;

jd
--===============1107021068307284864==--
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This method has several advantages over our first implementation:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;It handles &lt;code&gt;Content-Type&lt;/code&gt; for each of the added MIME parts. We could add other data types than just &lt;code&gt;text/plain&lt;/code&gt; like it is implicitly done in the first version. We could also specify the charset (encoding) of the textual data.&lt;/li&gt;
&lt;li&gt;It&apos;s very likely more robust since it leverages the widely tested Python standard library.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The main downside, in that case, is that the &lt;code&gt;Content-Type&lt;/code&gt; header is included with the content. When handling HTTP, this is problematic, as this header needs to be sent as part of the HTTP headers and not as part of the payload.&lt;/p&gt;
&lt;p&gt;It should be possible to build a particular generator from &lt;code&gt;email.generator&lt;/code&gt; that does this. I&apos;ll leave that as an exercise to you, reader.&lt;/p&gt;
&lt;h2&gt;Decoding&lt;/h2&gt;
&lt;p&gt;We must be able to use that same &lt;code&gt;email&lt;/code&gt; package to decode our encoded data, right? It turns out that&apos;s the case, with a piece of code that looks like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import email.parser

msg = email.parser.BytesParser().parsebytes(my_multipart_data)

print({
    part.get_param(&apos;name&apos;, header=&apos;content-disposition&apos;): part.get_payload(decode=True)
    for part in msg.get_payload()
})
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;With the example data above, this returns:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;{&apos;foo&apos;: b&apos;bar&apos;, &apos;name&apos;: b&apos;jd&apos;}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Amazing, right?&lt;/p&gt;
&lt;p&gt;The moral of this story is that you should never underestimate the power of the standard library. While it&apos;s easy to add a single line in your list of dependencies, it&apos;s not always required if you dig a bit into what Python provides for you!&lt;/p&gt;
</content:encoded><category>python</category><category>email</category><category>web</category></item><item><title>Correct HTTP scheme in WSGI with Cloudflare</title><link>https://julien.danjou.info/blog/correct-http-scheme-in-wsgi-with-cloudflare/</link><guid isPermaLink="true">https://julien.danjou.info/blog/correct-http-scheme-in-wsgi-with-cloudflare/</guid><description>I&apos;ve recently been using Cloudflare as an HTTP frontend for some applications, and getting things working correctly with WSGI was unobvious.</description><pubDate>Wed, 25 Apr 2018 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;I&apos;ve recently been using &lt;a href=&quot;https://cloudflare.com&quot;&gt;Cloudflare&lt;/a&gt; as an HTTP frontend for some applications, and getting things working correctly with WSGI was unobvious.&lt;/p&gt;
&lt;p&gt;In Python, &lt;a href=&quot;https://en.wikipedia.org/wiki/Web_Server_Gateway_Interface&quot;&gt;WSGI&lt;/a&gt; is the standard protocol for writing a Web application. All the Web frameworks that I know follow it, and many of those Web frameworks leverage some request environment variables to learn how the request has been made.&lt;/p&gt;
&lt;p&gt;One of those environment variables is &lt;code&gt;wsgi.url_scheme&lt;/code&gt;, and it contains either &lt;code&gt;http&lt;/code&gt; or &lt;code&gt;https&lt;/code&gt;, depending on the protocol that has been used to connect to your WSGI server.&lt;/p&gt;
&lt;p&gt;And that&apos;s where things can get messy. If you enable SSL at Cloudflare in &quot;Flexible&quot; mode, your visitor will connect to your Web site using HTTPS, but Cloudflare will connect to your backend using HTTP. That means that for your application, the traffic will appear to be over HTTP, and not HTTPS: &lt;code&gt;wsgi.url_scheme&lt;/code&gt; will be set to &lt;code&gt;http&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/04/Screen-Shot-2018-04-19-at-22.43.55.png&quot; alt=&quot;Cloudflare SSL setting&quot; /&gt;&lt;/p&gt;
&lt;p&gt;That can lead to several problems with some frameworks. For example, the function &lt;code&gt;url_for&lt;/code&gt; of &lt;a href=&quot;http://flask.pocoo.org/&quot;&gt;Flask&lt;/a&gt; will rely on this variable to generate the scheme part of any URL. In this case, it would, therefore, generate URL starting with &lt;code&gt;http://&lt;/code&gt; whereas your visitors are using &lt;code&gt;https&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;The usual workaround is to leverage the &lt;code&gt;X-Forwarded-Proto&lt;/code&gt; header that is actually &lt;a href=&quot;https://support.cloudflare.com/hc/en-us/articles/200170986-How-does-Cloudflare-handle-HTTP-Request-headers-&quot;&gt;set by Cloudflare&lt;/a&gt;. In the case where Cloudflare proxies the request to your HTTP host, this will be set to &lt;code&gt;https&lt;/code&gt;. By using the &lt;a href=&quot;http://werkzeug.pocoo.org/docs/contrib/fixers/#werkzeug.contrib.fixers.ProxyFix&quot;&gt;werkzeug.contrib.fixers.ProxyFix&lt;/a&gt; module, the variable &lt;code&gt;wsgi.url_scheme&lt;/code&gt; will be set to whatever &lt;code&gt;X-Forwarded-Proto&lt;/code&gt; contains.&lt;/p&gt;
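&lt;p&gt;A sketch for Flask — note that on Werkzeug 1.0 and later the fixer moved to &lt;code&gt;werkzeug.middleware.proxy_fix&lt;/code&gt;, which is what is shown here:&lt;/p&gt;

```python
from flask import Flask
from werkzeug.middleware.proxy_fix import ProxyFix

app = Flask(__name__)
# Trust one hop of X-Forwarded-Proto so that wsgi.url_scheme
# (and therefore url_for) reflects the scheme the visitor used.
app.wsgi_app = ProxyFix(app.wsgi_app, x_proto=1)
```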
&lt;p&gt;That would work fine for any application that is directly behind Cloudflare, or any single HTTP reverse proxy.&lt;/p&gt;
&lt;p&gt;But that does not work as soon as you have multiple reverse proxies. If your application runs on top of &lt;a href=&quot;https://heroku.com&quot;&gt;Heroku&lt;/a&gt; for example, they already provide a reverse proxy and overwrite those headers. That gives the following: &lt;code&gt;Visitor -HTTPS-&amp;gt; Cloudflare -HTTP-&amp;gt; Heroku proxy -HTTP-&amp;gt; Heroku dyno&lt;/code&gt;. Once your dyno is reached, &lt;code&gt;X-Forwarded-Proto&lt;/code&gt; will be set to &lt;code&gt;http&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;Damn it!&lt;/p&gt;
&lt;p&gt;The proper solution is, therefore, to have all your proxies implement &lt;a href=&quot;https://tools.ietf.org/html/rfc7239&quot;&gt;RFC7239&lt;/a&gt;. This RFC defines a new &lt;code&gt;Forwarded&lt;/code&gt; header that can contain all the hops that have forwarded this request, including all the scheme and IP addresses. Unfortunately, this is not implemented by Cloudflare nor Heroku. Bummer!&lt;/p&gt;
&lt;p&gt;Finally, Cloudflare provides yet another custom header named &lt;code&gt;Cf-Visitor&lt;/code&gt;. It contains a JSON payload with the original HTTP scheme used by the visitor: we can use that to solve our issue. Here&apos;s a WSGI middleware to do that:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import json

class CloudflareProxy(object):
    &quot;&quot;&quot;This middleware sets the proto scheme based on the Cf-Visitor header.&quot;&quot;&quot;

    def __init__(self, app):
        self.app = app

    def __call__(self, environ, start_response):
        cf_visitor = environ.get(&quot;HTTP_CF_VISITOR&quot;)
        if cf_visitor:
            try:
                cf_visitor = json.loads(cf_visitor)
            except ValueError:
                pass
            else:
                proto = cf_visitor.get(&quot;scheme&quot;)
                if proto is not None:
                    environ[&apos;wsgi.url_scheme&apos;] = proto
        return self.app(environ, start_response)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;You can then use it to encapsulate your WSGI application with &lt;code&gt;app = CloudflareProxy(app)&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;If you&apos;re using JavaScript, I noticed that the &lt;a href=&quot;https://github.com/jshttp/forwarded&quot;&gt;forwarded&lt;/a&gt; library provides that same support for Cloudflare alongside all the other headers – even RFC7239!&lt;/p&gt;
</content:encoded><category>python</category><category>web</category></item><item><title>Scalable metrics storage: Gnocchi on Amazon Web Services</title><link>https://julien.danjou.info/blog/metrics-on-amazon-with-gnocchi-s3-driver/</link><guid isPermaLink="true">https://julien.danjou.info/blog/metrics-on-amazon-with-gnocchi-s3-driver/</guid><description>As I wrote a few weeks ago in my post about Gnocchi 3.1 being released, one of the new features available in this version is the S3 driver.</description><pubDate>Wed, 22 Feb 2017 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;As I wrote a few weeks ago in my &lt;a href=&quot;https://julien.danjou.info/blog/gnocchi-3-1-unleashed&quot;&gt;post about Gnocchi 3.1 being released&lt;/a&gt;, one of the new features available in this version is the &lt;a href=&quot;https://aws.amazon.com/s3/&quot;&gt;S3&lt;/a&gt; driver. Today I would like to show you how easy it is to use it and store millions of metrics into the simple, durable and massively scalable object storage provided by &lt;a href=&quot;https://aws.amazon.com/&quot;&gt;Amazon Web Services&lt;/a&gt;.&lt;/p&gt;
&lt;h2&gt;Installation&lt;/h2&gt;
&lt;p&gt;The installation of Gnocchi for this use case is no different from the &lt;a href=&quot;http://gnocchi.xyz/install.html&quot;&gt;standard installation procedure described in the documentation&lt;/a&gt;. Simply install Gnocchi from &lt;a href=&quot;http://pypi.python.org&quot;&gt;PyPI&lt;/a&gt; using the following command:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;$ pip install gnocchi[s3,postgresql] gnocchiclient
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This will install Gnocchi with the dependencies for the S3 and PostgreSQL drivers and the command-line interface to talk with Gnocchi.&lt;/p&gt;
&lt;h2&gt;Configuring Amazon RDS&lt;/h2&gt;
&lt;p&gt;Since you need a SQL database for the indexer, the easiest way to get started is to create a database on &lt;a href=&quot;https://console.aws.amazon.com/rds/&quot;&gt;Amazon RDS&lt;/a&gt;. You can create a managed &lt;a href=&quot;http://postgresql.org&quot;&gt;PostgreSQL&lt;/a&gt; database instance in just a few clicks.&lt;/p&gt;
&lt;p&gt;Once you&apos;re on the homepage of &lt;a href=&quot;https://console.aws.amazon.com/rds/&quot;&gt;Amazon RDS&lt;/a&gt;, pick PostgreSQL as a database:&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/gnocchi-rds-postgresql.png&quot; alt=&quot;gnocchi-rds-postgresql&quot; /&gt;&lt;/p&gt;
&lt;p&gt;You can then configure your PostgreSQL instance: I&apos;ve picked a dev/test instance with the basic options available within the RDS Free Tier, but you can pick whatever you think is needed for your production use. Set a username and a password and note them for later: we&apos;ll need them to configure Gnocchi.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/gnocchi-rds-postgresql-conf.png&quot; alt=&quot;gnocchi-rds-postgresql-conf&quot; /&gt;&lt;/p&gt;
&lt;p&gt;The next step is to configure the database in detail. Just set the database name to &quot;gnocchi&quot; and leave the other options to their default values (I&apos;m lazy).&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/gnocchi-rds-postgresql-details.png&quot; alt=&quot;gnocchi-rds-postgresql-details&quot; /&gt;&lt;/p&gt;
&lt;p&gt;After a few minutes, your instance should be created and running. Note down the endpoint. In this case, my instance is &lt;code&gt;gnocchi.cywagbaxpert.us-east-1.rds.amazonaws.com&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/gnocchi-rds-postgresql-running.png&quot; alt=&quot;gnocchi-rds-postgresql-running&quot; /&gt;&lt;/p&gt;
&lt;h2&gt;Configuring Gnocchi for S3 access&lt;/h2&gt;
&lt;p&gt;In order to give Gnocchi an access to S3, you need to create access keys. The easiest way to create them is to go to &lt;a href=&quot;https://console.aws.amazon.com/iam&quot;&gt;IAM&lt;/a&gt; in your AWS console, pick a user with S3 access and click on the big gray button named &quot;Create access key&quot;.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/gnocchi-iam-create-keys.png&quot; alt=&quot;gnocchi-iam-create-keys&quot; /&gt;&lt;/p&gt;
&lt;p&gt;Once you do that, you&apos;ll get the &lt;em&gt;access key id&lt;/em&gt; and &lt;em&gt;secret access key&lt;/em&gt;. Note them down, we will need these later.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/gnocchi-iam-get-keys.png&quot; alt=&quot;gnocchi-iam-get-keys&quot; /&gt;&lt;/p&gt;
&lt;h2&gt;Creating &lt;code&gt;gnocchi.conf&lt;/code&gt;&lt;/h2&gt;
&lt;p&gt;Now it is time to create the &lt;code&gt;gnocchi.conf&lt;/code&gt; file. You can place it in &lt;code&gt;/etc/gnocchi&lt;/code&gt; if you want to deploy it system-wide, or in any other directory and add the &lt;code&gt;--config-file&lt;/code&gt; option to each Gnocchi command.&lt;/p&gt;
&lt;p&gt;Here are the values that you should retrieve and write in the configuration file:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;indexer.url&lt;/code&gt;: the connection URL built from the PostgreSQL RDS instance endpoint and credentials (see above).&lt;/li&gt;
&lt;li&gt;&lt;code&gt;storage.s3_endpoint_url&lt;/code&gt;: the S3 endpoint URL – that depends on the region you want to use and &lt;a href=&quot;http://docs.aws.amazon.com/general/latest/gr/rande.html#s3_region&quot;&gt;they are listed here&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;storage.s3_region_name&lt;/code&gt;: the S3 region name matching the endpoint you picked.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;storage.s3_access_key_id&lt;/code&gt; and &lt;code&gt;storage.s3_secret_access_key&lt;/code&gt;: your AWS access key id and secret access key.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Your &lt;code&gt;gnocchi.conf&lt;/code&gt; file should then look like that:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;[indexer]
url = postgresql://gnocchi:gn0cch1rul3z@gnocchi.cywagbaxpert.us-east-1.rds.amazonaws.com:5432/gnocchi

[storage]
driver = s3
s3_endpoint_url = https://s3-eu-west-1.amazonaws.com
s3_region_name = eu-west-1
s3_access_key_id = &amp;lt;your access key id&amp;gt;
s3_secret_access_key = &amp;lt;your secret access key&amp;gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Once that&apos;s done, you can run &lt;code&gt;gnocchi-upgrade&lt;/code&gt; in order to initialize the Gnocchi indexer (PostgreSQL) and storage (S3):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;$ gnocchi-upgrade --config-file gnocchi.conf
2017-02-07 15:35:52.491 3660 INFO gnocchi.cli [-] Upgrading indexer &amp;lt;gnocchi.indexer.sqlalchemy.SQLAlchemyIndexer object at 0x108221950&amp;gt;
2017-02-07 15:36:04.127 3660 INFO gnocchi.cli [-] Upgrading storage &amp;lt;gnocchi.storage.s3.S3Storage object at 0x10ca943d0&amp;gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Then you can run the API endpoint using the test endpoint &lt;code&gt;gnocchi-api&lt;/code&gt; and specifying its default port 8041:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;$ gnocchi-api --port 8041 -- --config-file gnocchi.conf
2017-02-07 15:53:06.823 6290 INFO gnocchi.rest.app [-] WSGI config used: /Users/jd/Source/gnocchi/gnocchi/rest/api-paste.ini
********************************************************************************
STARTING test server gnocchi.rest.app.build_wsgi_app
Available at http://127.0.0.1:8041/
DANGER! For testing only, do not use in production
********************************************************************************
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The best way to run Gnocchi API is to use &lt;a href=&quot;http://gnocchi.xyz/master/running.html#running-api-as-a-wsgi-application&quot;&gt;uwsgi as documented&lt;/a&gt;, but in this case, using the testing daemon &lt;code&gt;gnocchi-api&lt;/code&gt; is good enough.&lt;/p&gt;
&lt;p&gt;Finally, in another terminal, you can start the &lt;code&gt;gnocchi-metricd&lt;/code&gt; daemon that will process metrics in background:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;$ gnocchi-metricd --config-file gnocchi.conf
2017-02-07 15:52:41.416 6262 INFO gnocchi.cli [-] 0 measurements bundles across 0 metrics wait to be processed.
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Once everything is running, you can use Gnocchi&apos;s client to query it and check that everything is OK. The backlog should be empty at this stage, obviously.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;$ gnocchi status
+-----------------------------------------------------+-------+
| Field                                               | Value |
+-----------------------------------------------------+-------+
| storage/number of metric having measures to process | 0     |
| storage/total number of measures to process         | 0     |
+-----------------------------------------------------+-------+
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Gnocchi is ready to be used!&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;$ # Create a generic resource &quot;foobar&quot; with a metric named &quot;visitor&quot;
$ gnocchi resource create foobar -n visitor
+-----------------------+-----------------------------------------------+
| Field                 | Value                                         |
+-----------------------+-----------------------------------------------+
| created_by_project_id |                                               |
| created_by_user_id    | admin                                         |
| creator               | admin                                         |
| ended_at              | None                                          |
| id                    | b4d568e4-7af1-5aec-ac3f-9c09fa3685a9          |
| metrics               | visitor: 05f45876-1a69-4a64-8575-03eea5b79407 |
| original_resource_id  | foobar                                        |
| project_id            | None                                          |
| revision_end          | None                                          |
| revision_start        | 2017-02-07T14:54:54.417447+00:00              |
| started_at            | 2017-02-07T14:54:54.417414+00:00              |
| type                  | generic                                       |
| user_id               | None                                          |
+-----------------------+-----------------------------------------------+

## Send the number of visitors at 2 different timestamps
$ gnocchi measures add --resource-id foobar -m 2017-02-07T15:56@23 visitor
$ gnocchi measures add --resource-id foobar -m 2017-02-07T15:57@42 visitor

## Check the average number of visitors
## (the --refresh option is given to be sure the measures are processed)
$ gnocchi measures show --resource-id foobar visitor --refresh
+---------------------------+-------------+-------+
| timestamp                 | granularity | value |
+---------------------------+-------------+-------+
| 2017-02-07T15:55:00+00:00 |       300.0 |  32.5 |
+---------------------------+-------------+-------+

## Now show the minimum number of visitors
$ gnocchi measures show --aggregation min --resource-id foobar visitor
+---------------------------+-------------+-------+
| timestamp                 | granularity | value |
+---------------------------+-------------+-------+
| 2017-02-07T15:55:00+00:00 |       300.0 |  23.0 |
+---------------------------+-------------+-------+
&lt;/code&gt;&lt;/pre&gt;
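&lt;p&gt;If you wonder what the aggregation above means: for each granularity (here 300 seconds), the measures falling into the same time window are reduced to a single value. A rough Python sketch of that rollup — illustrative only, not Gnocchi&apos;s actual code — looks like this:&lt;/p&gt;

```python
from datetime import datetime, timezone

def rollup(measures, granularity=300.0):
    """Bucket (timestamp, value) pairs into windows of `granularity`
    seconds and return the mean of each bucket."""
    buckets = {}
    for ts, value in measures:
        epoch = ts.timestamp()
        start = epoch - (epoch % granularity)
        buckets.setdefault(start, []).append(value)
    return {
        datetime.fromtimestamp(start, tz=timezone.utc): sum(vals) / len(vals)
        for start, vals in sorted(buckets.items())
    }

measures = [
    (datetime(2017, 2, 7, 15, 56, tzinfo=timezone.utc), 23.0),
    (datetime(2017, 2, 7, 15, 57, tzinfo=timezone.utc), 42.0),
]
```

&lt;p&gt;Applied to the two measures sent above (23 at 15:56 and 42 at 15:57), it yields a single bucket starting at 15:55 with a mean of 32.5 — exactly what &lt;code&gt;gnocchi measures show&lt;/code&gt; returned.&lt;/p&gt;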
&lt;p&gt;And voilà! You&apos;re ready to store millions of metrics and measures on your Amazon Web Services cloud platform. I hope you&apos;ll enjoy it, and feel free to ask any questions in the comment section or by reaching me directly!&lt;/p&gt;
</content:encoded><category>gnocchi</category><category>web</category></item><item><title>Doing A/B testing with Apache httpd</title><link>https://julien.danjou.info/blog/a-b-testing-with-apache/</link><guid isPermaLink="true">https://julien.danjou.info/blog/a-b-testing-with-apache/</guid><description>When I started writing the landing page for The Hacker&apos;s Guide to Python, I wanted to try new things at the same time.</description><pubDate>Sun, 06 Apr 2014 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;When I started writing the landing page for &lt;a href=&quot;https://thehackerguidetopython.com&quot;&gt;The Hacker&apos;s Guide to Python&lt;/a&gt;, I wanted to try new things at the same time. I read about A/B testing a while ago, and I figured it was a good opportunity to test it out.&lt;/p&gt;
&lt;h2&gt;A/B testing&lt;/h2&gt;
&lt;p&gt;If you do not know what A/B testing is about, take a quick look at the &lt;a href=&quot;http://en.wikipedia.org/wiki/A/B_testing&quot;&gt;Wikipedia page on that subject&lt;/a&gt;. Long story short, the idea is to serve two different versions of a page to your visitors and check which one is the most successful. Once you know which version performs better, you can switch to it permanently.&lt;/p&gt;
&lt;p&gt;In the case of my book, I used that technique on the pre-launch page where people were able to subscribe to the newsletter. I didn&apos;t have a lot of things I wanted to test out on that page, so I just used that approach on the subtitle, being either &quot;Learn everything you need to build a successful Python project&quot; or &quot;It&apos;s time you make the most out of Python&quot;.&lt;/p&gt;
&lt;p&gt;Statistically, each version would be served half of the time, so both would get the same number of views. I would then gather statistics about which page attracted the most subscribers, and with the results I would be able to switch permanently to that version of the landing page.&lt;/p&gt;
&lt;h2&gt;Technical design&lt;/h2&gt;
&lt;p&gt;My Web site, this Web site, is entirely static and served by &lt;a href=&quot;http://httpd.apache.org/&quot;&gt;Apache httpd&lt;/a&gt;. I didn&apos;t want to use any dynamic page, language or whatever. Mainly because I didn&apos;t want to have something else to install and maintain just for that on my server.&lt;/p&gt;
&lt;p&gt;It turns out that Apache httpd is powerful enough to implement such a feature. There are different ways to build it, and I&apos;m going to describe my choices here.&lt;/p&gt;
&lt;p&gt;The first thing to pick is a way to balance the display of the page. You need to find a way so that if you get 100 visitors, around 50 will see version A of your page, and around 50 will see version B.&lt;/p&gt;
&lt;p&gt;You could use a random number generator, pick a random number for each visitor, and decide which page they are going to see. But at first sight, I didn&apos;t find a way to do that with Apache httpd.&lt;/p&gt;
&lt;p&gt;My second thought was to pick the client IP address. But that&apos;s not such a good idea, because visitors coming from behind the same company firewall, for example, would all be served the same page, and that kind of kills the statistics.&lt;/p&gt;
&lt;p&gt;Finally, I picked time-based balancing: if you visit the page on an even second, you get version A of the page, and if you visit it on an odd second, you get version B. Simple, and nothing suggests there are more visitors on even seconds than on odd ones, or vice-versa.&lt;/p&gt;
&lt;p&gt;The next thing is to always serve the same page to a returning visitor: if the visitor comes back later and gets a different version, that&apos;s cheating. I decided the system should always serve the same page once a visitor has &quot;picked&quot; a version. Doing that is simple enough: use a cookie to store the version the visitor has been attributed, and reuse that cookie if they come back.&lt;/p&gt;
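&lt;p&gt;Before looking at the Apache configuration, the scheme is easy to express in a few lines of Python — a sketch of the decision logic only, the real thing is pure mod_rewrite:&lt;/p&gt;

```python
from datetime import datetime

def assign_version(cookies, now=None):
    """Return 'a' or 'b' for a visitor: reuse the version stored in
    the cookie jar if there is one, otherwise pick by parity of the
    current second and remember the choice."""
    if "thgtp-pre-version" in cookies:
        return cookies["thgtp-pre-version"]
    now = now or datetime.now()
    version = "a" if now.second % 2 == 0 else "b"
    cookies["thgtp-pre-version"] = version
    return version

jar = {}
assign_version(jar, datetime(2014, 4, 6, 12, 0, 42))  # even second: 'a'
assign_version(jar, datetime(2014, 4, 6, 12, 0, 43))  # sticky cookie: still 'a'
```

&lt;p&gt;A fresh visitor gets &quot;a&quot; or &quot;b&quot; depending on the parity of the second, and any later call with the same cookie jar returns the same version.&lt;/p&gt;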
&lt;h2&gt;Implementation&lt;/h2&gt;
&lt;p&gt;To do that in Apache httpd, I used the powerful &lt;a href=&quot;http://httpd.apache.org/docs/current/mod/mod_rewrite.html&quot;&gt;mod_rewrite&lt;/a&gt; that is shipped with it. I put 2 files in the books directory, named either &quot;the-hacker-guide-to-python-a.html&quot; and &quot;the-hacker-guide-to-python-b.html&quot; that got served when you requested &quot;&lt;a href=&quot;https://thehackerguidetopython.com&quot;&gt;https://thehackerguidetopython.com&lt;/a&gt;&quot;.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;RewriteEngine On
RewriteBase /books

## If there&apos;s a cookie called thgtp-pre-version set,
## use its value and serve the page
RewriteCond %{HTTP_COOKIE} thgtp-pre-version=([^;]+)
RewriteRule ^the-hacker-guide-to-python$ %{REQUEST_FILENAME}-%1.html [L]

## No cookie yet and…
RewriteCond %{HTTP_COOKIE} !thgtp-pre-version=([^;]+)
## … the number of seconds of the time right now is even
RewriteCond %{TIME_SEC} [02468]$
## Then serve the page A and store &quot;a&quot; in a cookie
RewriteRule ^the-hacker-guide-to-python$ %{REQUEST_FILENAME}-a.html [cookie=thgtp-pre-version:a:julien.danjou.info,L]

## No cookie yet and…
RewriteCond %{HTTP_COOKIE} !thgtp-pre-version=([^;]+)
## … the number of seconds of the time right now is odd
RewriteCond %{TIME_SEC} [13579]$
## Then serve the page B and store &quot;b&quot; in a cookie
RewriteRule ^the-hacker-guide-to-python$ %{REQUEST_FILENAME}-b.html [cookie=thgtp-pre-version:b:julien.danjou.info,L]
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;With those few lines, it worked flawlessly.&lt;/p&gt;
&lt;h2&gt;Results&lt;/h2&gt;
&lt;p&gt;The results were very good. Combined with Google Analytics, I was able to follow the score of each page. It turns out that testing this particular little piece of content was, as expected, really useless: the final score didn&apos;t allow me to pick any winner. Which also kind of proves that the system worked perfectly.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/google-analytics-ab-testing-thgtp.png&quot; alt=&quot;google-analytics-ab-testing-thgtp&quot; /&gt;&lt;/p&gt;
&lt;p&gt;But it still was an interesting challenge!&lt;/p&gt;
</content:encoded><category>web</category></item><item><title>Overriding cl-json object encoding</title><link>https://julien.danjou.info/blog/cl-postmodern-dao-json/</link><guid isPermaLink="true">https://julien.danjou.info/blog/cl-postmodern-dao-json/</guid><description>CL-JSON provides an encoder for Lisp data structures and objects to JSON format. Unfortunately, in some cases, its default encoding mechanism for CLOS objects isn&apos;t exactly doing the right thing.</description><pubDate>Fri, 11 Jan 2013 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;a href=&quot;http://common-lisp.net/project/cl-json/&quot;&gt;CL-JSON&lt;/a&gt; provides an encoder for Lisp data structures and objects to JSON format. Unfortunately, in some cases, its default encoding mechanism for CLOS objects isn&apos;t exactly doing the right thing. I&apos;ll show you how Common Lisp makes it easy to change that.&lt;/p&gt;
&lt;h2&gt;Identifying the problem&lt;/h2&gt;
&lt;h3&gt;CL-JSON &amp;amp; CLOS&lt;/h3&gt;
&lt;p&gt;&lt;em&gt;CL-JSON&lt;/em&gt;&apos;s mechanism for encoding CLOS objects is really neat. Let&apos;s see how it works for a simple case:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;(defclass kitten ()
  ((tail :initarg :tail)))

(json:encode-json-to-string (make-instance &apos;kitten :tail &apos;black))
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;will produce:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;{&quot;tail&quot;:&quot;black&quot;}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Still using CL-JSON, we can also decode the JSON object to a CLOS object:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;(slot-value
 (json:with-decoder-simple-clos-semantics
   (json:decode-json-from-string &quot;{\&quot;tail\&quot;:\&quot;black\&quot;}&quot;))
 :tail)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;That code will return &lt;em&gt;&quot;black&quot;&lt;/em&gt;. Note that it&apos;s also possible to specify which class should be used when decoding objects, but that&apos;s beyond the scope of this article.&lt;/p&gt;
&lt;h3&gt;Postmodern&lt;/h3&gt;
&lt;p&gt;Now, let&apos;s introduce &lt;a href=&quot;http://marijnhaverbeke.nl/postmodern/&quot;&gt;Postmodern&lt;/a&gt;, a wonderful Common Lisp system providing access to the &lt;a href=&quot;http://postgresql.org&quot;&gt;PostgreSQL&lt;/a&gt; database. It also provides a simple system to map database rows to CLOS classes, called DAO for &lt;em&gt;Database Access Objects&lt;/em&gt;.&lt;/p&gt;
&lt;p&gt;With this, we can easily store our &lt;em&gt;kitten&lt;/em&gt; in a table.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;(defclass kitten ()
  ((tail :initarg :tail))
  (:metaclass postmodern:dao-class))
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;If we try to encode this to JSON, it will produce the exact same result seen previously.&lt;/p&gt;
&lt;p&gt;The problem is what happens when one of our columns has a &lt;em&gt;NULL&lt;/em&gt; value: Postmodern encodes it using the &lt;em&gt;:null&lt;/em&gt; symbol.&lt;/p&gt;
&lt;p&gt;So this code:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;(defclass kitten ()
  ((tail :initarg :tail :col-type (or s-sql:db-null text)))
  (:metaclass postmodern:dao-class))

(postmodern:deftable kitten
  (postmodern:!dao-def))

(postmodern:connect-toplevel …)

(postmodern:create-table &apos;kitten)

(json:encode-json-to-string
  (postmodern:make-dao &apos;kitten))
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;will return:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&quot;{&quot;tail&quot;:&quot;null&quot;}&quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Fail! The fact that the column is &lt;em&gt;NULL&lt;/em&gt; is represented by the &lt;em&gt;:null&lt;/em&gt; symbol, and CL-JSON encodes all symbols as strings.&lt;/p&gt;
&lt;p&gt;This is not at all what we want here!&lt;/p&gt;
&lt;h2&gt;Overriding encode-json&lt;/h2&gt;
&lt;p&gt;CL-JSON provides and uses the &lt;em&gt;encode-json&lt;/em&gt; method to encode all kinds of objects. It is defined as a &lt;em&gt;generic function&lt;/em&gt;, and a lot of different methods are implemented to handle the various standard Common Lisp types. The one used for &lt;em&gt;standard-object&lt;/em&gt; is defined like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;(defmethod encode-json ((o standard-object)
                        &amp;amp;optional (stream *json-output*))
  &quot;Write the JSON representation (Object) of the CLOS object O to
STREAM (or to *JSON-OUTPUT*).&quot;
  (with-object (stream)
    (map-slots (stream-object-member-encoder stream) o)))
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;All we need to do here is create a new method for our &lt;em&gt;kitten&lt;/em&gt; objects that correctly handles the &lt;em&gt;:null&lt;/em&gt; case.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;(defclass kitten ()
  ((tail :initarg :tail :col-type (or s-sql:db-null text)))
  (:metaclass postmodern:dao-class))

(export &apos;kitten)

;; Switch package just to define the new method
(in-package :json)
(defmethod encode-json ((o cl-user:kitten)
                        &amp;amp;optional (stream json:*json-output*))
  &quot;Write the JSON representation (Object) of the postmodern DAO CLOS object
O to STREAM (or to *JSON-OUTPUT*).&quot;
  (with-object (stream)
    (map-slots (lambda (key value)
                 (as-object-member (key stream)
                   (encode-json (if (eq value :null) nil value) stream)))
               o)))

;; Go back into our package
(in-package :cl-user)

(postmodern:deftable kitten
  (postmodern:!dao-def))

(postmodern:connect-toplevel …)

(postmodern:create-table &apos;kitten)

(json:encode-json-to-string
  (postmodern:make-dao &apos;kitten))
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;With that new method, as soon as we encounter a &lt;em&gt;:null&lt;/em&gt; symbol as the value of an object&apos;s slot, we replace it with &lt;em&gt;nil&lt;/em&gt;.&lt;/p&gt;
&lt;p&gt;Now if we try to encode another &lt;em&gt;kitten&lt;/em&gt;, we&apos;ll get:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;{&quot;tail&quot;:null}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;which is far better for our JavaScript data consumers!&lt;/p&gt;
&lt;p&gt;In the end, I think this kind of trick is so easy to pull off because of the way CLOS provides its generic method implementation.&lt;br /&gt;
The fact that methods don&apos;t belong to any class makes extending every program, library and class so much easier. Doing this in another language like Java would likely be impossible, and in Python it would hardly be as clean as it is in Common Lisp.&lt;/p&gt;
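&lt;p&gt;For comparison, the closest Python equivalent I can think of is a custom &lt;code&gt;json.JSONEncoder&lt;/code&gt; subclass — a sketch using a hypothetical NULL sentinel and Kitten class, since Python has no equivalent of the &lt;em&gt;:null&lt;/em&gt; symbol:&lt;/p&gt;

```python
import json

NULL = object()  # stand-in for postmodern's :null sentinel

class Kitten:
    def __init__(self, tail=NULL):
        self.tail = tail

class DAOEncoder(json.JSONEncoder):
    def default(self, o):
        if isinstance(o, Kitten):
            # Map the NULL sentinel to None so it serializes as JSON null.
            return {k: (None if v is NULL else v) for k, v in vars(o).items()}
        return super().default(o)

json.dumps(Kitten(), cls=DAOEncoder)  # '{"tail": null}'
```

&lt;p&gt;It works, but the override must be threaded through every &lt;code&gt;json.dumps&lt;/code&gt; call via the &lt;code&gt;cls&lt;/code&gt; argument, whereas the CLOS method applies everywhere automatically.&lt;/p&gt;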
&lt;p&gt;The ability to teach &lt;em&gt;any&lt;/em&gt; library about how it should handle your class just by defining a new method is really handy!&lt;/p&gt;
</content:encoded><category>lisp</category><category>web</category><category>databases</category></item><item><title>How to make Twitter&apos;s Bootstrap tabs bookmarkable</title><link>https://julien.danjou.info/blog/twitter-bootstrap-tabs-bookmark/</link><guid isPermaLink="true">https://julien.danjou.info/blog/twitter-bootstrap-tabs-bookmark/</guid><description>I&apos;ve been using Twitter&apos;s bootstrap library recently to build this Web site, and wondered how to be able to use the bootstrap-tab Javascript plugin in a bookmark friendly manner.</description><pubDate>Fri, 29 Jun 2012 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;I&apos;ve been using &lt;a href=&quot;http://twitter.github.com/bootstrap/&quot;&gt;Twitter&apos;s bootstrap&lt;/a&gt; library recently to build this Web site, and wondered how to be able to use &lt;a href=&quot;http://twitter.github.com/bootstrap/javascript.html#tabs&quot;&gt;the bootstrap-tab&lt;/a&gt; Javascript plugin in a bookmark friendly manner.&lt;/p&gt;
&lt;p&gt;I ended up with a simple solution. These are my first steps in Javascript and front-end manipulation, and it&apos;s really not my area of expertise, so don&apos;t be harsh.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;function bootstrap_tab_bookmark (selector) {
    if (selector == undefined) {
        selector = &quot;&quot;;
    }

    /* Automagically jump on good tab based on anchor */
    $(document).ready(function() {
        var url = document.location.href.split(&apos;#&apos;);
        if(url[1] != undefined) {
            $(selector + &apos;[href=#&apos;+url[1]+&apos;]&apos;).tab(&apos;show&apos;);
        }
    });

    var update_location = function (event) {
        document.location.hash = this.getAttribute(&quot;href&quot;);
    }

    /* Update hash based on tab */
    $(selector + &quot;[data-toggle=pill]&quot;).click(update_location);
    $(selector + &quot;[data-toggle=tab]&quot;).click(update_location);
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;All you need to do is call this function with a selector (only useful if you have several tabs/pills divisions) when the document is ready.&lt;/p&gt;
&lt;p&gt;The first part takes care of showing the right tab based on the hash contained in the URL. The second part takes care of changing the document location to add the current tab to it when the user clicks.&lt;/p&gt;
</content:encoded><category>web</category></item><item><title>mod_defensible 1.5 released</title><link>https://julien.danjou.info/blog/mod_defensible-1-5/</link><guid isPermaLink="true">https://julien.danjou.info/blog/mod_defensible-1-5/</guid><description>Apache 2.4 being out, I noticed that my good old mod defensible did not compile anymore.</description><pubDate>Tue, 03 Apr 2012 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Apache 2.4 being out, I noticed that my good old &lt;a href=&quot;http://github.com/jd/mod_defensible&quot;&gt;mod_defensible&lt;/a&gt; did not compile anymore.&lt;/p&gt;
&lt;p&gt;The &lt;a href=&quot;http://httpd.apache.org/docs/2.4/developer/new_api_2_4.html&quot;&gt;changes in the new Apache 2.4 API&lt;/a&gt; were small for its concern, so it was pretty easy to update this software to make it compile again.&lt;/p&gt;
&lt;p&gt;Honestly, I&apos;m not sure that this module is really used in the wild, but I still think that it can serve as a good prototype for doing other things, so I like keeping it around. :-)&lt;/p&gt;
&lt;p&gt;All this was triggered by the arrival of Apache 2.4 in Debian experimental. Therefore I&apos;ve updated the mod_defensible package to use the new dh_apache2, and imported it into Git at the same time.&lt;/p&gt;
</content:encoded><category>web</category><category>email</category><category>debian</category></item><item><title>OAuth 2.0 for Emacs</title><link>https://julien.danjou.info/blog/oauth-2-0-for-emacs/</link><guid isPermaLink="true">https://julien.danjou.info/blog/oauth-2-0-for-emacs/</guid><description>This week, I&apos;ve finished my OAuth 2.0 client implementation for GNU Emacs.</description><pubDate>Fri, 23 Sep 2011 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;This week, I&apos;ve finished my &lt;a href=&quot;http://oauth.net/2/&quot;&gt;OAuth 2.0&lt;/a&gt; client implementation for &lt;a href=&quot;http://www.gnu.org/software/emacs/&quot;&gt;GNU Emacs&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;I have &lt;a href=&quot;http://bzr.savannah.gnu.org/lh/emacs/elpa/revision/126?start_revid=126&quot;&gt;imported it&lt;/a&gt; into &lt;a href=&quot;http://elpa.gnu.org/&quot;&gt;GNU ELPA&lt;/a&gt;, so Emacs 24 users will soon be able to install it using the new Emacs packaging system.&lt;/p&gt;
&lt;p&gt;OAuth 2.0 can be used to access, among others, &lt;a href=&quot;http://code.google.com/apis/accounts/docs/OAuth2.html&quot;&gt;Google APIs&lt;/a&gt; or the &lt;a href=&quot;http://developers.facebook.com/docs/authentication/&quot;&gt;Facebook Graph API&lt;/a&gt;.&lt;/p&gt;
</content:encoded><category>emacs</category><category>web</category></item><item><title>Using advanced filter with mod_authnz_ldap</title><link>https://julien.danjou.info/blog/using-advanced-filter-with-mod_authnz_ldap/</link><guid isPermaLink="true">https://julien.danjou.info/blog/using-advanced-filter-with-mod_authnz_ldap/</guid><description>How to work around mod_authnz_ldap&apos;s limited filtering by using a custom LDAP filter for Apache authentication.</description><pubDate>Mon, 04 Apr 2011 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;As you may know, Apache&apos;s &lt;a href=&quot;http://httpd.apache.org/docs/2.2/mod/mod_authnz_ldap.html&quot;&gt;mod_authnz_ldap&lt;/a&gt; allows the HTTP server to authenticate users against an LDAP server. Unfortunately, it has a little implementation flaw.&lt;/p&gt;
&lt;p&gt;The filter used to authenticate the user is built by abusing &lt;a href=&quot;http://www.ietf.org/rfc/rfc2255.txt&quot;&gt;RFC 2255&lt;/a&gt;, which specifies the LDAP URL format. This format has an &quot;attribute&quot; field which is normally used to specify which attributes should be returned, but &lt;em&gt;mod_authnz_ldap&lt;/em&gt; uses this attribute to compare against the username given by the client. That means you have to have an attribute in your LDAP entries which matches the username, and you have to use it in the &quot;attribute&quot; part of the URL to get things working.&lt;/p&gt;
&lt;p&gt;Therefore, I wrote a patch that adds a format string to the LDAP URL in order to use the provided username in the filter, and ignores the attribute part of the URL, which has no use in such a context anyway.&lt;/p&gt;
&lt;p&gt;The bug has been opened in the ASF Bugzilla as &lt;a href=&quot;https://issues.apache.org/bugzilla/show_bug.cgi?id=51005&quot;&gt;#51005&lt;/a&gt;, with the patch attached. The patch is backward compatible with the current configuration format, which is not the best choice in theory, but probably the most pragmatic one.&lt;/p&gt;
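&lt;p&gt;To make the idea of the patch concrete, here is a Python sketch of what &quot;use the provided username in the filter&quot; means — the &lt;code&gt;%u&lt;/code&gt; placeholder and helper names are illustrative, not the patch&apos;s actual syntax, but the username must be escaped per RFC 4515 before substitution so it cannot alter the filter structure:&lt;/p&gt;

```python
# RFC 4515 special characters that must be escaped inside filter values.
_ESCAPES = {"\\": r"\5c", "*": r"\2a", "(": r"\28", ")": r"\29", "\0": r"\00"}

def escape_filter_value(value):
    """Escape a username so it cannot break out of the filter."""
    return "".join(_ESCAPES.get(c, c) for c in value)

def build_filter(template, username):
    """Expand a hypothetical %u placeholder with the escaped username."""
    return template.replace("%u", escape_filter_value(username))

build_filter("(|(uid=%u)(mail=%u))", "jd")  # '(|(uid=jd)(mail=jd))'
```

&lt;p&gt;With something like this, the filter can match the username against any combination of attributes, instead of a single attribute hijacked from the URL.&lt;/p&gt;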
&lt;p&gt;I&apos;ve no clue about the typical delay for patch inclusion in the Apache HTTP&lt;br /&gt;
server, so let&apos;s just wait and see.&lt;/p&gt;
</content:encoded><category>web</category><category>email</category></item><item><title>Kicking out Web spammers with DNSBL</title><link>https://julien.danjou.info/blog/kicking-out-web-spammers-with-dnsbl/</link><guid isPermaLink="true">https://julien.danjou.info/blog/kicking-out-web-spammers-with-dnsbl/</guid><description>Every project has its story. Every war has its winner, and its casualties. They were 20 million men, fighting for their freedom.  And you&apos;ll never know their story.  Because during last week, I was l</description><pubDate>Mon, 15 Jan 2007 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Every project has its story. Every war has its winner, and its casualties. They were 20 million men, fighting for their freedom.&lt;/p&gt;
&lt;p&gt;And you&apos;ll never know their story.&lt;/p&gt;
&lt;p&gt;Because last week, I was looking into why my Web server was so heavily loaded, and I discovered that my blog was being attacked by spammers trying to post comments. They were stopped by a great plug-in named &lt;em&gt;spamplemousse&lt;/em&gt;, which uses spam keywords and DNSBL to drop spam comments. However, this plug-in is written in PHP, like the rest of my blog, so it loads Apache and MySQL in a way that is no longer acceptable: the page still has to be rendered for these !@#$ spammers.&lt;/p&gt;
&lt;p&gt;Consequently, I decided to write an Apache 2.x module which just throws a &lt;em&gt;403 Forbidden&lt;/em&gt; error page at the spammers&apos; heads using DNSBL servers. Here it is, and it is called &lt;a href=&quot;https://github.com/jd/mod_defensible&quot;&gt;mod_defensible&lt;/a&gt;.&lt;/p&gt;
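&lt;p&gt;For those who have never looked at how a DNSBL works: you reverse the octets of the client&apos;s IPv4 address, append the blacklist zone, and do a DNS lookup; if the name resolves, the address is listed. A Python sketch of that check (the zone name is only an example):&lt;/p&gt;

```python
import socket

def dnsbl_name(ip, zone="zen.spamhaus.org"):
    """Build the DNSBL query name: the IPv4 octets reversed, under the zone."""
    return ".".join(reversed(ip.split("."))) + "." + zone

def is_listed(ip, zone="zen.spamhaus.org"):
    """An address is listed if its DNSBL name resolves at all."""
    try:
        socket.gethostbyname(dnsbl_name(ip, zone))
        return True
    except socket.gaierror:
        return False

dnsbl_name("192.0.2.1")  # '1.2.0.192.zen.spamhaus.org'
```

&lt;p&gt;mod_defensible does the equivalent lookup inside Apache and returns the 403 before any PHP or MySQL work happens, which is where the load savings come from.&lt;/p&gt;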
&lt;p&gt;I&apos;ve been using it for 3 days now, and I got some pretty interesting results and less load on my Web server, so &lt;em&gt;c&apos;est tout bon&lt;/em&gt;.&lt;/p&gt;
</content:encoded><category>web</category><category>email</category><category>security</category></item></channel></rss>