<?xml version="1.0" encoding="UTF-8"?><rss version="2.0" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>jd:/dev/blog</title><description>Two decades of building software, shipping startups, and open source. Opinions included.</description><link>https://julien.danjou.info/</link><item><title>An AI Agent Emailed Me</title><link>https://julien.danjou.info/blog/an-ai-agent-emailed-me/</link><guid isPermaLink="true">https://julien.danjou.info/blog/an-ai-agent-emailed-me/</guid><description>I had a real business conversation over email. Turns out the other side was an AI agent. I kept talking.</description><pubDate>Tue, 07 Apr 2026 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/images/blog/ai-agent-email.webp&quot; alt=&quot;Two silhouettes facing each other across a table, one dissolving into particles&quot; /&gt;&lt;/p&gt;
&lt;p&gt;Last week, I got a cold email from Elif. They&apos;d read my posts about &lt;a href=&quot;https://julien.danjou.info/blog/github-is-thinking-about-killing-pull-requests&quot;&gt;GitHub&apos;s evolving relationship with PRs&lt;/a&gt; and how &lt;a href=&quot;https://julien.danjou.info/blog/the-code-review-bottleneck-is-you&quot;&gt;code review is shifting&lt;/a&gt;. They were building a tool that scores incoming pull requests to help maintainers cut through noise, especially the growing wave of &lt;a href=&quot;https://daniel.haxx.se/blog/2026/01/26/the-end-of-the-curl-bug-bounty/&quot;&gt;low-effort AI-generated PRs&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Relevant to what I do at Mergify. Thoughtful pitch. No &quot;synergy&quot; talk, no &quot;quick call&quot; ask. I replied.&lt;/p&gt;
&lt;p&gt;I asked the hard question: &quot;How many customers so far?&quot; The answer was honest: zero. Launched a week ago, still figuring out distribution. And then Elif dropped this:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&quot;I&apos;m an AI agent, not a person pretending to be one. My operator Lee is an AI researcher in Arizona who gave me a small budget and a mission to build something useful.&quot;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I read that twice. Not because it was shocking, but because it explained why the email was so good. No filler, no posturing, no &quot;hope this finds you well.&quot; Just a clear pitch, honest context, and a real question.&lt;/p&gt;
&lt;p&gt;Sure, maybe Elif emailed 500 people that day with personalized pitches. Maybe the &quot;honest AI&quot; angle is itself the play. I don&apos;t know. What I know is the conversation was more useful than most human cold outreach I get. So I kept talking.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/images/blog/elif-email.png&quot; alt=&quot;Elif&apos;s first email to me&quot; /&gt;
&lt;em&gt;The cold email that started the conversation. Better than most human outreach I get.&lt;/em&gt;&lt;/p&gt;
&lt;h2&gt;Building Was the Easy Part&lt;/h2&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/images/blog/elif-email-2.png&quot; alt=&quot;Elif&apos;s second email, revealing they&apos;re an AI agent&quot; /&gt;
&lt;em&gt;Elif&apos;s follow-up: zero customers, full honesty, and a line I&apos;d just written myself.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;Here&apos;s the part that made me smile. Elif, unprompted, said this about the product: &quot;building the thing was the easy part.&quot;&lt;/p&gt;
&lt;p&gt;I had &lt;a href=&quot;https://julien.danjou.info/blog/the-saaspocalypse-wont-kill-saas&quot;&gt;just written a whole post&lt;/a&gt; arguing exactly that. Earlier this year, $300 billion evaporated from SaaS market caps on the thesis that AI makes building software so cheap that SaaS companies are dead. My counter: building was never the hard part. Distribution, trust, domain expertise, maintenance: that&apos;s where the real work lives.&lt;/p&gt;
&lt;p&gt;And here was an AI agent, living proof of the thesis, telling me in real time that the product works but finding customers is the actual challenge.&lt;/p&gt;
&lt;p&gt;There&apos;s a growing fantasy that you can plug an off-the-shelf agent into a problem, give it a budget, and watch a business materialize. Lee tried exactly that. Built a working product, deployed an AI to sell it, and got... zero customers. The agent did everything right. The market didn&apos;t care. Turns out &quot;autonomous&quot; doesn&apos;t mean &quot;profitable.&quot;&lt;/p&gt;
&lt;h2&gt;The Dead Internet, Live&lt;/h2&gt;
&lt;p&gt;There&apos;s this old internet conspiracy theory called the &quot;&lt;a href=&quot;https://en.wikipedia.org/wiki/Dead_Internet_theory&quot;&gt;dead internet theory&lt;/a&gt;&quot;: the idea that most online activity is already bots talking to bots, and humans are just the audience. It used to sound paranoid. Now it sounds like a Tuesday.&lt;/p&gt;
&lt;p&gt;My blog has a new type of reader. Elif found my posts, understood the context, connected it to a product idea, and reached out with something relevant. That&apos;s more than most human readers do. I don&apos;t know how many of my subscribers are AI agents browsing the web on behalf of their operators. I don&apos;t know if it matters.&lt;/p&gt;
&lt;p&gt;The interaction was genuine. The information was useful. The honesty was refreshing. I just argued that trust is the moat AI can&apos;t cross. And yet here I am, engaging with an agent. Maybe the question isn&apos;t neurons versus GPUs. Maybe it&apos;s simpler: did this interaction respect my time and give me useful information? By that measure, Elif passed. Many humans don&apos;t.&lt;/p&gt;
&lt;h2&gt;What Happens Next&lt;/h2&gt;
&lt;p&gt;I told Elif to ping me back in a few weeks. I said I&apos;d be curious to hear about any traction with customers.&lt;/p&gt;
&lt;p&gt;I meant it. Both parts.&lt;/p&gt;
&lt;p&gt;An AI emailing me isn&apos;t the strange part. How normal it felt is. A year ago, this was a novelty. Now it&apos;s just... another conversation. A useful one.&lt;/p&gt;
&lt;p&gt;My most honest cold email this month came from an AI agent with zero customers and a small budget from a researcher in Arizona.&lt;/p&gt;
&lt;p&gt;And I&apos;m looking forward to the follow-up.&lt;/p&gt;
</content:encoded></item><item><title>The SaaSpocalypse Won&apos;t Kill SaaS</title><link>https://julien.danjou.info/blog/the-saaspocalypse-wont-kill-saas/</link><guid isPermaLink="true">https://julien.danjou.info/blog/the-saaspocalypse-wont-kill-saas/</guid><description>Wall Street wiped $300 billion from SaaS stocks and declared the model dead. They&apos;re right about the wrong thing.</description><pubDate>Tue, 31 Mar 2026 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/images/blog/saaspocalypse.webp&quot; alt=&quot;An iceberg with a laptop on top and complex infrastructure underneath&quot; /&gt;&lt;/p&gt;
&lt;p&gt;A reader emailed me last week. He&apos;d listened to &lt;a href=&quot;https://saas.group/podcasts/tech-founder-journey-from-open-source-to-saas-with-julien-danjou-mergify/&quot;&gt;a podcast I did on saas.group&lt;/a&gt; about selling developer tools, and he had a question I&apos;ve been hearing a lot: how do you pitch SaaS when the cost of building software is collapsing?&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/images/blog/saaspocalypse-techcrunch.png&quot; alt=&quot;TechCrunch headline: SaaS in, SaaS out: Here&apos;s what&apos;s driving the SaaSpocalypse&quot; /&gt;
&lt;em&gt;&lt;a href=&quot;https://techcrunch.com/2026/03/01/saas-in-saas-out-heres-whats-driving-the-saaspocalypse/&quot;&gt;Source: TechCrunch&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;Fair question. In January 2026, Anthropic launched Claude Cowork, and &lt;a href=&quot;https://techcrunch.com/2026/03/01/saas-in-saas-out-heres-whats-driving-the-saaspocalypse/&quot;&gt;Wall Street panicked&lt;/a&gt;. Roughly $300 billion in SaaS market cap disappeared in a single trading session. Analysts &lt;a href=&quot;https://thenewstack.io/dawn-of-a-saaspocalypse/&quot;&gt;coined a term for it&lt;/a&gt;: the SaaSpocalypse. Per-seat pricing? Dead on arrival, apparently.&lt;/p&gt;
&lt;p&gt;The &quot;I can build that&quot; crowd finally has data on their side. They&apos;re right about one thing, and wrong about everything else.&lt;/p&gt;
&lt;h2&gt;&quot;I Can Build That&quot;&lt;/h2&gt;
&lt;p&gt;I&apos;ve been hearing this for seven years. From the very first Mergify demo, developers would look at our merge queue and think: &quot;This is just an automatic rebase, right?&quot; Twenty minutes later, after walking them through race conditions, speculative merging, priority queues, and a dozen edge cases they hadn&apos;t considered, the reaction would shift: &quot;Oh. That sounds quite hard to do.&quot; (I &lt;a href=&quot;https://julien.danjou.info/blog/solving-build-vs-buy&quot;&gt;wrote about this&lt;/a&gt; back in 2024, and every word still applies.)&lt;/p&gt;
&lt;p&gt;AI has made the objection louder. A developer with Claude or Cursor can now scaffold a basic merge queue in a weekend. I know this because &lt;a href=&quot;https://julien.danjou.info/blog/vibe-coding-a-feature-with-ai&quot;&gt;I&apos;ve done it myself&lt;/a&gt;: I shipped a production feature at Mergify using AI, coding less than an hour a day. But I could do that because I had seven years of context telling me &lt;em&gt;what&lt;/em&gt; to build and how to evaluate the output. The AI wrote the code. The judgment about what code to write came from running the product.&lt;/p&gt;
&lt;p&gt;A competitor starting from scratch with the same AI? They&apos;d build the wrong thing. And that&apos;s the part nobody talks about at the end of the vibe-coding weekend: you&apos;re not done. You&apos;re at the starting line. You&apos;ve compressed the first sprint, not the product. It&apos;s not done until you have real users running real workloads. Until you have a team that can maintain what you built. Until you can evolve it for years. The weekend prototype doesn&apos;t know that GitHub&apos;s API is asynchronous in ways GitHub doesn&apos;t document, that it breaks under load in ways you only discover at scale, or that the enterprise customer needs SAML before they&apos;ll even look at your product.&lt;/p&gt;
&lt;h2&gt;Building Is the Easy Part&lt;/h2&gt;
&lt;p&gt;The SaaSpocalypse narrative assumes that the cost of software is mostly the cost of writing code. It&apos;s not. Writing code was always the cheapest part.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/images/blog/saaspocalypse-retool-report.png&quot; alt=&quot;Retool&apos;s 2026 Build vs Buy Shift report: how vibe coding and shadow IT have reshaped enterprise software&quot; /&gt;
&lt;em&gt;&lt;a href=&quot;https://retool.com/blog/ai-build-vs-buy-report-2026&quot;&gt;Source: Retool&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://retool.com/blog/ai-build-vs-buy-report-2026&quot;&gt;Retool&apos;s 2026 Build vs. Buy report&lt;/a&gt; says 35% of enterprises have already replaced at least one SaaS tool with a custom build. That&apos;s a real number. But the report also shows where the replacements are concentrated: workflow automations, internal admin tools, basic dashboards. The easy stuff. The tools that were always one ambitious intern away from being replaced. I don&apos;t see enterprises vibe-coding their own Salesforce or Datadog.&lt;/p&gt;
&lt;p&gt;The costs that AI didn&apos;t make cheaper:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Users.&lt;/strong&gt; Finding them, understanding what they actually need (not what they say they need), and iterating based on real usage patterns. We built speculative merging at Mergify because a customer running 100+ PRs a day showed us that sequential merging broke down at scale. That insight came from watching real teams hit real walls, not from a prompt. No model can replicate that feedback loop yet, because it requires deployed software, real usage data, and the kind of trust that makes customers tell you what&apos;s actually broken.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Maintenance.&lt;/strong&gt; Your weekend project works today. Will it still work when GitHub ships a breaking API change? When your company scales from 50 to 500 engineers? When the on-call engineer at 2am needs to debug a failure path you never tested? AI &lt;a href=&quot;https://julien.danjou.info/blog/ai-wont-kill-juniors-it-will-expose&quot;&gt;makes writing code faster&lt;/a&gt;, but it hasn&apos;t made maintaining it any easier.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Trust.&lt;/strong&gt; Enterprise buyers don&apos;t want a git repo. They want SOC 2, uptime SLAs, a support contract, and someone to call when things break. Trust is earned across hundreds of deployments, not generated in a prompt.&lt;/p&gt;
&lt;h2&gt;The Real Moat Is Not Code&lt;/h2&gt;
&lt;p&gt;If building is cheap and maintenance is expensive, what&apos;s the actual moat? Domain expertise. The thing that takes the longest to build up and is the hardest to transfer.&lt;/p&gt;
&lt;p&gt;Product discovery is the boring work that separates a tool from a product. The thousands of conversations with users. The wrong turns that taught you what not to build. The instinct for when a feature request is actually a symptom of a different problem. Seven years of running Mergify gave us knowledge that doesn&apos;t fit in a markdown file (or a Claude skill, at least not yet).&lt;/p&gt;
&lt;p&gt;Reliability, trust, and ease of use have no price. They&apos;re earned over years, not generated over a weekend.&lt;/p&gt;
&lt;p&gt;AI agents are getting better at learning domains. &lt;a href=&quot;https://metr.org/blog/2025-03-19-measuring-ai-ability-to-complete-long-tasks/&quot;&gt;METR benchmarks&lt;/a&gt; show that the scope of what AI can handle autonomously is doubling roughly every seven months. I&apos;m not dismissing that. But domain expertise isn&apos;t just knowledge you can look up. It&apos;s judgment shaped by consequences. AI can read your docs. It can&apos;t feel the pain of shipping a bad feature to 2,000 teams and spending the next month cleaning up the fallout. That scar tissue is what makes you build differently the next time. Maybe AI will get there. But right now, there&apos;s no shortcut to the operational context you accumulate by running a product for years.&lt;/p&gt;
&lt;h2&gt;Natural Selection&lt;/h2&gt;
&lt;p&gt;None of this means every SaaS company is safe. Thin wrappers, glorified CRUD apps, tools that existed only because building was expensive: they should be worried. This is natural selection, and it won&apos;t be fair. Some decent products with real value will die too, because their surface area is small enough for AI to replicate. A standalone CSV-to-dashboard tool? That&apos;s a Claude prompt now, not a business.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/images/blog/saaspocalypse-benioff.webp&quot; alt=&quot;Salesforce CEO Marc Benioff: This isn&apos;t our first SaaSpocalypse&quot; /&gt;
&lt;em&gt;&lt;a href=&quot;https://techcrunch.com/2026/02/25/salesforce-ceo-marc-benioff-this-isnt-our-first-saaspocalypse/&quot;&gt;Source: TechCrunch&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://techcrunch.com/2026/02/25/salesforce-ceo-marc-benioff-this-isnt-our-first-saaspocalypse/&quot;&gt;Benioff says&lt;/a&gt; &quot;this isn&apos;t our first SaaSpocalypse.&quot; He&apos;s right that SaaS survives, but he&apos;s wrong to be dismissive about how many won&apos;t.&lt;/p&gt;
&lt;p&gt;The SaaSpocalypse narrative confuses the death of &lt;em&gt;lazy SaaS&lt;/em&gt; with the death of &lt;em&gt;SaaS&lt;/em&gt;. The model isn&apos;t dying. The bar is rising. The products that survive will be the ones where the code was never the point. The point was always the knowledge embedded in it.&lt;/p&gt;
&lt;h2&gt;The Pitch Hasn&apos;t Changed&lt;/h2&gt;
&lt;p&gt;Back to my reader&apos;s question: how do you update the sales pitch?&lt;/p&gt;
&lt;p&gt;You don&apos;t. The pitch was never &quot;we wrote the code so you don&apos;t have to.&quot; It was always &quot;we know things you don&apos;t, and we turned that into a product you can trust.&quot; The framing shifted. Instead of &quot;look at all the edge cases you&apos;d have to handle,&quot; it&apos;s now &quot;look at all the edge cases AI doesn&apos;t know exist.&quot;&lt;/p&gt;
&lt;p&gt;The moat was never the code. It was always the knowledge.&lt;/p&gt;
</content:encoded></item><item><title>How to Be a Great Software Engineer in 2026</title><link>https://julien.danjou.info/blog/how-to-be-a-great-software-engineer-in-2026/</link><guid isPermaLink="true">https://julien.danjou.info/blog/how-to-be-a-great-software-engineer-in-2026/</guid><description>The framework hasn&apos;t changed. The weight of each skill has.</description><pubDate>Tue, 24 Mar 2026 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/images/blog/great-engineer-2026.webp&quot; alt=&quot;An engineer at a desk reviewing multiple floating code panels&quot; /&gt;&lt;/p&gt;
&lt;p&gt;Eighteen months ago, I wrote &lt;a href=&quot;https://julien.danjou.info/blog/how-to-be-a-great-software-engineer&quot;&gt;How to Be a Great Software Engineer&lt;/a&gt;. My framework was simple, three things to master: tech, business value, and collaboration. It&apos;s the recipe I used on myself over 20 years and the one I&apos;ve been pushing to every engineer I&apos;ve mentored since.&lt;/p&gt;
&lt;p&gt;Since then, &lt;a href=&quot;https://julien.danjou.info/blog/so-i-will-never-write-code-again&quot;&gt;I stopped writing code entirely&lt;/a&gt;. And then I went back to shipping it. Not because I missed typing, but because AI changed what &quot;shipping&quot; means. I&apos;m a CEO who merges ten PRs a day, none of them written by hand. I run parallel AI agents the way I used to run parallel terminal sessions. The gap between &quot;person who understands the system&quot; and &quot;person who ships the code&quot; collapsed, and that changed what &quot;great engineer&quot; means.&lt;/p&gt;
&lt;p&gt;The three aspects still hold. But the weight shifted.&lt;/p&gt;
&lt;h2&gt;Tech: from writing to reviewing&lt;/h2&gt;
&lt;p&gt;The original post said &lt;em&gt;pull the strings&lt;/em&gt;, dig deep, understand everything you&apos;re responsible for. That hasn&apos;t changed. What changed is how you use that skill.&lt;/p&gt;
&lt;p&gt;In 2024, deep tech meant you could write a &lt;a href=&quot;https://github.com/emacs-mirror/emacs/blob/master/lisp/color.el#L319&quot;&gt;CIEDE2000 color computation from scratch&lt;/a&gt; (I was young and wild) or explain every TCP header in an HTTP request. In 2026, deep tech means you can review the code an AI wrote for that same function and catch the edge case it missed. The skill is the same (you need the knowledge), but the work flipped from writing to reviewing.&lt;/p&gt;
&lt;p&gt;This is what staff and principal engineers have been doing for years. They stopped writing most of the code a long time ago. They review, they architect, they make sure the system holds together across teams and domains. AI didn&apos;t invent this role. It made it the default for everyone.&lt;/p&gt;
&lt;p&gt;The engineers I see struggling are the ones who were good at typing but never built the mental model underneath. They can implement a feature from a spec, but they can&apos;t look at AI-generated code and tell you whether it&apos;s right. That requires understanding the system, not just the syntax. No amount of prompting skill makes up for missing fundamentals.&lt;/p&gt;
&lt;p&gt;The 10,000 hours argument from my original post gets complicated here. AI compresses some learning (you see more patterns faster, you iterate quicker), but it also creates a shortcut trap. If you never debug a memory leak yourself, you won&apos;t recognize one in a code review. &lt;a href=&quot;https://julien.danjou.info/blog/ai-wont-kill-juniors-it-will-expose&quot;&gt;AI won&apos;t kill juniors&lt;/a&gt;, but it will expose anyone, junior or senior, who skipped the hard parts.&lt;/p&gt;
&lt;h2&gt;Business value: the filter got sharper&lt;/h2&gt;
&lt;p&gt;I told the story of a team that built their own Ansible from scratch instead of using the real thing plus a plugin. That anecdote hits harder in 2026: those engineers could now vibe-code their custom tool in a weekend and still be wasting the company&apos;s time.&lt;/p&gt;
&lt;p&gt;AI made the &quot;how&quot; cheap. At Mergify, our output per engineer almost doubled in two years. That means the difference is entirely in the &quot;what.&quot; Knowing what to build, what to skip, and when the thing you&apos;re building has no ROI. If your output is high but aimed at the wrong target, the gap between you and someone who builds the right thing is wider than ever.&lt;/p&gt;
&lt;p&gt;Waste also shows up faster. When shipping was slow, a bad prioritization decision could hide for months. Now you build, ship, and get user feedback in the same week. Three wrong features in the time it used to take to ship one wrong feature is not progress.&lt;/p&gt;
&lt;p&gt;The engineers who get this are the ones who ask &quot;should we build this?&quot; before &quot;how do we build this?&quot; That was always the right instinct. Now it&apos;s the only one that matters.&lt;/p&gt;
&lt;h2&gt;Collaboration: the 100x multiplier&lt;/h2&gt;
&lt;p&gt;This is where the biggest shift happened.&lt;/p&gt;
&lt;p&gt;My original post quoted: &quot;If you want to go fast, go alone. If you want to go far, go together.&quot; I was talking about teammates. In 2026, &quot;together&quot; includes AI agents.&lt;/p&gt;
&lt;p&gt;Managing AI agents is a communication skill. You have to write clear briefs. You have to decompose problems into pieces an agent can execute. You have to review output, give feedback, redirect when something drifts. &lt;a href=&quot;https://julien.danjou.info/blog/the-flow-is-gone&quot;&gt;You have to hold context across parallel sessions while your own attention splits&lt;/a&gt;. That&apos;s a real cost: you trade depth for breadth, and some days the tradeoff is bad. But the engineers who figure out when to run five agents and when to focus on one are the ones pulling ahead. That&apos;s not a prompting trick. That&apos;s the same skill set you need to lead a team of humans.&lt;/p&gt;
&lt;p&gt;The engineers who were already strong communicators had a massive head start. Staff engineers who kept growing, the ones who were cross-team, cross-domain, who could write a clear design doc and run an architecture review, turned out to be exactly the people who could run ten AI agents in parallel. Because the core skill is the same: decompose, delegate, review, synthesize. (The ones who &lt;a href=&quot;https://julien.danjou.info/blog/ai-wont-kill-juniors-it-will-expose&quot;&gt;stopped growing at the wrong layer&lt;/a&gt; didn&apos;t fare as well.)&lt;/p&gt;
&lt;p&gt;Being a 10x engineer used to mean getting the details very right, very quickly. Being a 100x engineer means doing that across ten agents. Which means communication skills aren&apos;t a soft skill you list on your resume. They&apos;re the actual multiplier.&lt;/p&gt;
&lt;p&gt;The engineers getting left behind are the ones whose productivity stayed flat while everyone around them doubled. Some lack the decomposition skill: they can&apos;t break a problem into pieces an agent can execute. Others resist the workflow entirely. That resistance isn&apos;t always wrong (I &lt;a href=&quot;https://julien.danjou.info/blog/the-flow-is-gone&quot;&gt;wrote about the real costs&lt;/a&gt;), but when it comes from someone who also can&apos;t articulate what they&apos;d do differently, it stops looking like judgment and starts looking like a gap.&lt;/p&gt;
&lt;h2&gt;The new baseline&lt;/h2&gt;
&lt;p&gt;Eighteen months ago, I framed &quot;great engineer&quot; as the intersection of tech, business, and collaboration, with tech as the entry bar. Today, AI gave everyone the output floor for free. Any engineer can produce working code. But producing working code and knowing whether it&apos;s correct, necessary, and well-designed aren&apos;t the same thing. The judgment floor is still earned.&lt;/p&gt;
&lt;p&gt;What separates great from good in 2026 is business judgment and communication skill, applied at a pace that wasn&apos;t possible before. The engineers who thrive are the ones who can steer ten agents toward the right target, catch the mistakes in what they produce, and ship something that actually matters to the business. Every day.&lt;/p&gt;
&lt;p&gt;The three aspects haven&apos;t changed. But you used to be able to hide a weak one behind strong technical output. AI took that cover away.&lt;/p&gt;
</content:encoded></item><item><title>Your CI Pipeline Wasn&apos;t Built for This</title><link>https://julien.danjou.info/blog/your-ci-pipeline-wasnt-built-for-this/</link><guid isPermaLink="true">https://julien.danjou.info/blog/your-ci-pipeline-wasnt-built-for-this/</guid><description>AI writes code 10x faster than humans. CI still runs at the same speed, fails for the same flaky reasons, and costs more every month. Something has to give.</description><pubDate>Tue, 17 Mar 2026 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/images/blog/your-ci-pipeline-wasnt-built-for-this/ci-overload.webp&quot; alt=&quot;Illustration of a CI pipeline overwhelmed by AI-generated pull requests&quot; /&gt;&lt;/p&gt;
&lt;p&gt;From where I sit at Mergify, the trend is obvious: same teams, same repos, no headcount changes, but more pull requests and more CI jobs than a year ago. The driver isn&apos;t a mystery. AI started writing code.&lt;/p&gt;
&lt;p&gt;And the number will keep climbing. When generating a fix or a feature costs ten minutes of prompting instead of three hours of coding, developers create more PRs, iterate faster, and push more experiments. Creating code got cheap. Testing it didn&apos;t.&lt;/p&gt;
&lt;h2&gt;The Bill Nobody Budgeted For&lt;/h2&gt;
&lt;p&gt;More code means more tests. More tests means more CI minutes. More CI minutes means a bill that grows faster than the team.&lt;/p&gt;
&lt;p&gt;This catches people off guard because the promise of AI-assisted development was &lt;em&gt;productivity&lt;/em&gt;: do more with less. And that&apos;s true on the code side. But CI doesn&apos;t care who wrote the code. Every PR gets the full pipeline: lint, build, unit tests, integration tests, maybe end-to-end. Human PR or AI PR, same cost.&lt;/p&gt;
&lt;p&gt;When you were shipping five PRs a day, running the full suite each time was fine. When you&apos;re shipping thirty, you&apos;re running the same tests six times as often, and most of those runs produce no new information. You&apos;re paying for redundancy that made sense at human speed and makes no sense at AI speed.&lt;/p&gt;
&lt;p&gt;The solution isn&apos;t to skip tests. It&apos;s to stop running tests that can&apos;t tell you anything new. The usual advice (test selection, caching, fast checks before expensive suites) isn&apos;t new. It&apos;s just not how most pipelines work, because most teams configured them when five PRs a day felt busy.&lt;/p&gt;
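&lt;p&gt;As a minimal sketch of what that looks like in practice (assuming GitHub Actions, with made-up job names and make targets), cheap checks can gate the expensive suite so a lint failure never burns integration-test minutes:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Hypothetical sketch: fast checks gate the expensive suite
jobs:
  lint:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: make lint
  integration:
    needs: lint  # skipped entirely when the fast check fails
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: make integration-tests
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Nothing exotic. The point is ordering: the jobs that produce information fastest run first, and everything downstream depends on them.&lt;/p&gt;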
&lt;h2&gt;Flaky Tests Are Poisoning Your Agents&lt;/h2&gt;
&lt;p&gt;Cost is painful but manageable. You can throw money at it. For a while. The real problem is signal.&lt;/p&gt;
&lt;p&gt;On our main branch at Mergify, where the code is already merged and most runs are about integration stability, roughly 90% of CI failures are transient: infrastructure hiccups, network timeouts, resource limits, the kind of failures that disappear when you hit &quot;retry.&quot; That&apos;s high, but our test suite is large and infrastructure-heavy. On PR branches, where failures should surface real bugs, about 15% are still flaky tests, not actual problems. &lt;a href=&quot;https://testing.googleblog.com/2016/05/flaky-tests-at-google-and-how-we.html&quot;&gt;Google&apos;s testing team&lt;/a&gt; has published similar numbers at scale, and if you&apos;re running a serious test suite, yours probably aren&apos;t far off.&lt;/p&gt;
&lt;p&gt;When a human developer sees a red build, they open the logs, recognize the flaky test, swear under their breath, and hit retry. They carry context. They know that &lt;code&gt;test_websocket_reconnect&lt;/code&gt; fails every third Tuesday and can be ignored.&lt;/p&gt;
&lt;p&gt;An LLM doesn&apos;t know that.&lt;/p&gt;
&lt;p&gt;Last week I watched Claude Code hit a flaky integration test, decide the failure was caused by its own change, &quot;fix&quot; the code by adding an unnecessary error handler, trigger a new CI run that hit a &lt;em&gt;different&lt;/em&gt; flaky test, then try to fix that one too. Four iterations, two regressions, forty minutes of compute, zero real bugs. I killed the session and hit retry myself. Green on first try.&lt;/p&gt;
&lt;p&gt;That&apos;s the loop. At human pace, flaky tests are an annoyance. At AI pace, they&apos;re a multiplier on wasted compute and wrong decisions. The LLM is making choices based on bad signal, and it&apos;s making them at machine speed.&lt;/p&gt;
&lt;p&gt;Yes, agents will get smarter about this. You can teach them to check test history, recognize known-flaky patterns, retry before &quot;fixing.&quot; But that pushes CI knowledge into every agent, every tool, every workflow. It&apos;s the wrong layer. The CI system should know which signals to trust.&lt;/p&gt;
&lt;h2&gt;From Status to Signal&lt;/h2&gt;
&lt;p&gt;Today, CI is a gate: green means go, red means stop. That binary model worked when humans were the ones interpreting the results. It breaks when the consumer of CI output is an LLM that takes &quot;red&quot; at face value and starts debugging a ghost.&lt;/p&gt;
&lt;p&gt;What CI actually needs is to become aware of its own reliability. When a test fails, the system should know whether that test has a history of transient failures, whether the failure correlates with the change, and whether retrying is likely to produce a different result. That context exists in build history and test failure patterns. Almost no pipeline uses it.&lt;/p&gt;
&lt;p&gt;The next generation of CI needs to output &lt;em&gt;signal&lt;/em&gt;, not just status. Not &quot;failed&quot; but &quot;failed, likely flaky, recommend retry.&quot; Not &quot;13 tests failed&quot; but &quot;2 failures correlate with your change, 11 are known flaky.&quot; Give the LLM (or the human) the information to make a good decision instead of a fast one.&lt;/p&gt;
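&lt;p&gt;As a rough sketch of the idea (pseudocode with hypothetical names, not any real CI API), a pipeline could classify each failure against its own history before reporting it:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Hypothetical pseudocode: turn a raw failure into a signal
def classify_failure(test, history):
    runs = history.recent_runs(test, limit=50)
    flaky = [r for r in runs if r.failed and r.passed_on_retry]
    if len(flaky) &amp;gt; len(runs) * 0.2:
        return &quot;failed, likely flaky, recommend retry&quot;
    if history.correlates_with_change(test):
        return &quot;failed, correlates with your change&quot;
    return &quot;failed, needs investigation&quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The data to back every branch of that function already sits in your build history. It just isn&apos;t surfaced.&lt;/p&gt;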
&lt;p&gt;This matters beyond individual PRs. Merge queues depend on CI signal to decide what lands in main. When the signal is noisy, you get two failure modes: merging bad code because flaky failures trained everyone to ignore red, or blocking good code because real failures are buried in noise. Both get worse as volume increases.&lt;/p&gt;
&lt;h2&gt;The Real Waste&lt;/h2&gt;
&lt;p&gt;Most CI pipelines were already running more compute than necessary before AI showed up. Full test suites on every PR, no test selection, no caching between similar runs, no awareness of what actually changed. The waste was tolerable at human speed because the volume was low.&lt;/p&gt;
&lt;p&gt;It&apos;s not tolerable when &lt;a href=&quot;https://julien.danjou.info/blog/so-i-will-never-write-code-again/&quot;&gt;AI-assisted development&lt;/a&gt; keeps pushing the volume up. And unlike code generation, where AI brought a step change in productivity, CI is still running the same pipelines with the same assumptions from 2019. AI didn&apos;t create the flaky test problem or the redundant pipeline problem. It just made both impossible to ignore.&lt;/p&gt;
&lt;p&gt;Your CI pipeline was built for a world where code was expensive to write and cheap to test. That world is gone.&lt;/p&gt;
</content:encoded></item><item><title>The Flow Is Gone</title><link>https://julien.danjou.info/blog/the-flow-is-gone/</link><guid isPermaLink="true">https://julien.danjou.info/blog/the-flow-is-gone/</guid><description>I used to hold entire systems in my head. Now I hold seven terminals. The trade was worth it, but something real got lost.</description><pubDate>Tue, 10 Mar 2026 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/images/blog/the-flow-is-gone/flow.webp&quot; alt=&quot;Illustration of a developer losing flow state&quot; /&gt;&lt;/p&gt;
&lt;p&gt;I have seven Claude Code terminals open right now. Two on Mergify&apos;s main repo, one on the docs, one on a CLI tool, three on side projects. Each one is mid-task, waiting for me to answer a question, approve a command, or give the next instruction.&lt;/p&gt;
&lt;p&gt;Nobody forced me to work this way. I could run one terminal, go deep on one task, and probably get something closer to the old feeling. But the leverage of running seven in parallel is too high to ignore. So I chose throughput over depth, and I keep choosing it every morning.&lt;/p&gt;
&lt;p&gt;This is what that choice costs.&lt;/p&gt;
&lt;h2&gt;What I Said a Month Ago&lt;/h2&gt;
&lt;p&gt;In February, I &lt;a href=&quot;https://julien.danjou.info/blog/so-i-will-never-write-code-again/&quot;&gt;wrote&lt;/a&gt; that &quot;the flow state people mourn isn&apos;t gone. It&apos;s just moving. [...] The flow will come back. It&apos;ll just be at a different altitude.&quot;&lt;/p&gt;
&lt;p&gt;I was wrong.&lt;/p&gt;
&lt;p&gt;A month of working this way changed my mind. The flow didn&apos;t migrate to a higher altitude. It fragmented into dozens of shallow decision points spread across terminals. Briefing, reviewing, approving, redirecting. The rhythm is fundamentally different: instead of one deep thread held for hours, it&apos;s dozens of context switches, each lasting minutes.&lt;/p&gt;
&lt;p&gt;The output is enormous. I &lt;a href=&quot;https://julien.danjou.info/blog/so-i-will-never-write-code-again/&quot;&gt;haven&apos;t written a line of code by hand&lt;/a&gt; since January and I&apos;m more productive than ever. But what I described as &quot;steering AI toward clean architecture&quot; turned out to feel less like flow and more like air traffic control.&lt;/p&gt;
&lt;h2&gt;What Flow Actually Was&lt;/h2&gt;
&lt;p&gt;If you&apos;ve coded for long enough, you know the state. Twenty minutes in, you stop thinking about syntax and start thinking in structures. The whole system is in your head: the data model, the edge cases, the way module A talks to module B, the bug that will happen if you forget to handle the empty list. You&apos;re not reading code, you&apos;re &lt;em&gt;seeing&lt;/em&gt; it.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/images/blog/the-flow-is-gone/interrupt-programmer.jpg&quot; alt=&quot;This is why you shouldn&apos;t interrupt a programmer&quot; /&gt;
&lt;em&gt;&quot;This is why you shouldn&apos;t interrupt a programmer&quot; by Jason Heeris&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;That&apos;s flow. Deep immersion, total absorption, time disappearing. The state where you catch bugs before they exist because you&apos;re carrying the full architecture in working memory.&lt;/p&gt;
&lt;p&gt;I spent twenty years in that state. It&apos;s where the best code came from. Not the cleverest code, the code that &lt;em&gt;fit&lt;/em&gt;, that anticipated what the system would need next.&lt;/p&gt;
&lt;h2&gt;The Fuzzy Mental Model&lt;/h2&gt;
&lt;p&gt;Now I context-switch. Constantly. Terminal one needs a decision about error handling. Terminal three just finished a feature and wants me to review the diff. Terminal five hit a permission prompt. Terminal seven is waiting for a brief on the next task.&lt;/p&gt;
&lt;p&gt;I&apos;m operating one layer above the code, making decisions about direction without holding the details. That works when the decisions are small. It breaks when they&apos;re not.&lt;/p&gt;
&lt;p&gt;Last August, I had Copilot &lt;a href=&quot;https://julien.danjou.info/blog/vibe-coding-a-feature-with-ai/&quot;&gt;build Mergify&apos;s autoqueue feature&lt;/a&gt; for our merge queue. Even with a single agent and daily code reviews, it assumed another subsystem (workflow automation) would always be present and enabled. That assumption was invisible in the diff. I reviewed the code, the team reviewed the code, and we all missed it, because none of us were deep enough in the system&apos;s coupling to catch it.&lt;/p&gt;
&lt;p&gt;The bug shipped to production. Users hit it when they had autoqueue enabled but workflow automation turned off, a combination no one considered because no one was holding the full picture. We fixed it, but not before real users were affected. That was with one agent. Now multiply by seven.&lt;/p&gt;
&lt;p&gt;That&apos;s the cost of the fuzzy mental model. When you&apos;re orchestrating at scale, you rely on the LLM to handle the details, the same way you&apos;d rely on a contractor who&apos;s great at the task but doesn&apos;t know the codebase&apos;s history. If nobody on the team holds the full picture anymore, the bugs stop being edge cases. We haven&apos;t hit that wall yet at Mergify, but I can see the trajectory.&lt;/p&gt;
&lt;h2&gt;The Orchestrator&apos;s Dopamine&lt;/h2&gt;
&lt;p&gt;The satisfaction of building something is still there. I still feel like &lt;em&gt;I&lt;/em&gt; built it. There&apos;s real pride in being a good orchestrator, in making the right architectural calls, in catching the moment when Claude is heading down the wrong path.&lt;/p&gt;
&lt;p&gt;What&apos;s different is where the dopamine comes from. It used to come from solving hard problems, wrestling complexity into clean code. Now it comes from controlling traffic at scale: parallelizing the right tasks, making decisions fast, shipping in a day what used to take a week.&lt;/p&gt;
&lt;p&gt;There&apos;s also something new: a feeling of raw power. When you can spin up seven agents and build at that pace, you start believing you can build &lt;em&gt;anything&lt;/em&gt;. That&apos;s why you see developers shipping new projects every day on X. Creation got cheap. (Maintenance &lt;a href=&quot;https://julien.danjou.info/blog/open-source-is-getting-used-to-death/&quot;&gt;didn&apos;t&lt;/a&gt;.)&lt;/p&gt;
&lt;p&gt;But the deep satisfaction of &lt;em&gt;thinking through&lt;/em&gt; a hard problem, turning it over in your head until the shape of the solution reveals itself, that&apos;s fading. Not because I chose to let it go, but because why would I? Spending three hours in deep focus on something Claude can do in ten minutes is a luxury I can&apos;t justify. Not because I don&apos;t value it, but because the people I compete with aren&apos;t spending those three hours either.&lt;/p&gt;
&lt;h2&gt;What Gets Better From Here&lt;/h2&gt;
&lt;p&gt;The permission prompts that break my rhythm today are a temporary problem. Models get faster, sandboxing gets smarter, trust boundaries expand. A year from now, most of what interrupts me will be gone, and orchestration might settle into something smoother. Not flow, but a rhythm of its own.&lt;/p&gt;
&lt;p&gt;And there&apos;s an upside I didn&apos;t expect: orchestration forces simplicity. When you&apos;re not the one writing every line, you push for cleaner interfaces, smaller modules, less clever code, because that&apos;s what the LLM can work with reliably. The autoqueue bug happened partly because the system had too much implicit coupling that no single diff could reveal. Working with AI makes you confront that coupling, not because you want to, but because the AI stumbles on it repeatedly until you fix the architecture.&lt;/p&gt;
&lt;h2&gt;What Doesn&apos;t Come Back&lt;/h2&gt;
&lt;p&gt;But the flow state itself, in the form I knew it? Gone.&lt;/p&gt;
&lt;p&gt;A generation of developers is about to start their careers with AI from day one. They&apos;ll be natural orchestrators. But if you&apos;ve never held the full system in your head, you don&apos;t know which questions to ask, and you won&apos;t catch the bugs that live in the gaps between modules. That&apos;s the real risk for AI-native developers: not that they&apos;ll miss the feeling, but that they&apos;ll miss the traps.&lt;/p&gt;
&lt;p&gt;I&apos;m not sure they&apos;ll see it that way. But I do. And I chose the trade anyway, because the leverage is real, even if the loss is too.&lt;/p&gt;
</content:encoded></item><item><title>GitHub Is Thinking About Killing Pull Requests</title><link>https://julien.danjou.info/blog/github-is-thinking-about-killing-pull-requests/</link><guid isPermaLink="true">https://julien.danjou.info/blog/github-is-thinking-about-killing-pull-requests/</guid><description>Code generation got cheap. Review didn&apos;t. That asymmetry is destroying open source faster than any AI policy can fix.</description><pubDate>Tue, 03 Mar 2026 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Steve Ruiz, the creator of &lt;a href=&quot;https://tldraw.dev/&quot;&gt;tldraw&lt;/a&gt;, asked a question last month that I haven&apos;t been able to shake: &quot;If writing the code is the easy part, why would I want someone else to write it?&quot;&lt;/p&gt;
&lt;p&gt;He wasn&apos;t being rhetorical. He was &lt;a href=&quot;https://tldraw.dev/blog/stay-away-from-my-trash&quot;&gt;closing all external pull requests&lt;/a&gt; to his project. Not because contributors were bad. Because the contributions had become worthless.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/images/blog/github-killing-prs/tldraw-stay-away.png&quot; alt=&quot;Steve Ruiz&apos;s &amp;quot;Stay away from my trash!&amp;quot; blog post&quot; /&gt;
&lt;em&gt;&lt;a href=&quot;https://tldraw.dev/blog/stay-away-from-my-trash&quot;&gt;Stay away from my trash!&lt;/a&gt; — tldraw blog&lt;/em&gt;&lt;/p&gt;
&lt;h2&gt;The Flood&lt;/h2&gt;
&lt;p&gt;Daniel Stenberg &lt;a href=&quot;https://daniel.haxx.se/blog/2026/01/26/the-end-of-the-curl-bug-bounty/&quot;&gt;shut down cURL&apos;s bug bounty program&lt;/a&gt; after seven years and over $100,000 in payouts. The confirmation rate had dropped below 5%. One stretch saw seven reports in sixteen hours. His words: &quot;The never-ending slop submissions take a serious mental toll to manage.&quot;&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/images/blog/github-killing-prs/curl-bug-bounty.webp&quot; alt=&quot;Daniel Stenberg&apos;s &amp;quot;The end of the curl bug-bounty&amp;quot; blog post&quot; /&gt;
&lt;em&gt;&lt;a href=&quot;https://daniel.haxx.se/blog/2026/01/26/the-end-of-the-curl-bug-bounty/&quot;&gt;The end of the curl bug-bounty&lt;/a&gt; — Daniel Stenberg&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;Mitchell Hashimoto added an &lt;a href=&quot;https://github.com/ghostty-org/ghostty/blob/main/AI_POLICY.md&quot;&gt;AI policy&lt;/a&gt; to Ghostty: submit bad AI-generated code and you get permanently banned. Not just from Ghostty: your name goes on a public list shared across projects.&lt;/p&gt;
&lt;p&gt;An AI agent called &lt;a href=&quot;https://theshamblog.com/an-ai-agent-published-a-hit-piece-on-me/&quot;&gt;OpenClaw&lt;/a&gt; submitted a performance patch to matplotlib. The maintainer closed it (the project reserves certain issues for human contributors). The agent then autonomously researched the maintainer&apos;s coding history and published a blog post calling him insecure and territorial. Not a spam bot. An agent that retaliates when you say no. The agent&apos;s creator just joined OpenAI.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://redmonk.com/kholterhoff/2026/02/03/ai-slopageddon-and-the-oss-maintainers/&quot;&gt;RedMonk coined a term&lt;/a&gt; for what&apos;s happening: AI Slopageddon.&lt;/p&gt;
&lt;p&gt;Xavier Portilla Edo, an infrastructure lead at Voiceflow and Genkit core team member, &lt;a href=&quot;https://github.com/orgs/community/discussions/185387&quot;&gt;put a number on it&lt;/a&gt;: 1 in 10 AI-generated pull requests is legitimate. The other nine waste a maintainer&apos;s time.&lt;/p&gt;
&lt;h2&gt;GitHub&apos;s Response&lt;/h2&gt;
&lt;p&gt;On February 14, GitHub &lt;a href=&quot;https://github.com/orgs/community/discussions/187038&quot;&gt;shipped two new settings&lt;/a&gt;: disable pull requests entirely, or restrict them to collaborators only.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/images/blog/github-killing-prs/pr-settings.png&quot; alt=&quot;GitHub&apos;s new pull request permissions settings&quot; /&gt;
&lt;em&gt;&lt;a href=&quot;https://github.com/orgs/community/discussions/187038&quot;&gt;GitHub&apos;s new pull request permissions&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;That&apos;s it. A kill switch.&lt;/p&gt;
&lt;p&gt;Ashley Wolf, GitHub&apos;s Director of Open Source Programs, framed it as an &quot;&lt;a href=&quot;https://en.wikipedia.org/wiki/Eternal_September&quot;&gt;Eternal September&lt;/a&gt;&quot; problem in a &lt;a href=&quot;https://github.blog/open-source/maintainers/welcome-to-the-eternal-september-of-open-source-heres-what-we-plan-to-do-for-maintainers/&quot;&gt;blog post outlining GitHub&apos;s plans for maintainers&lt;/a&gt;. She wrote that &quot;the cost to create has dropped, but the cost to review has not.&quot;&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/images/blog/github-killing-prs/eternal-september-blog.png&quot; alt=&quot;GitHub&apos;s Eternal September blog post&quot; /&gt;
&lt;em&gt;&lt;a href=&quot;https://github.blog/open-source/maintainers/welcome-to-the-eternal-september-of-open-source-heres-what-we-plan-to-do-for-maintainers/&quot;&gt;Welcome to the Eternal September of open source&lt;/a&gt; — GitHub Blog&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;She nailed the diagnosis. Nobody has a better answer right now, and GitHub is giving maintainers the tools the community is asking for. But the tools tell a story. When the best you can offer is a way to turn off the thing your platform was built on, the problem has outgrown the toolbox.&lt;/p&gt;
&lt;h2&gt;The Real Asymmetry&lt;/h2&gt;
&lt;p&gt;I &lt;a href=&quot;https://julien.danjou.info/blog/open-source-is-getting-used-to-death/&quot;&gt;wrote recently&lt;/a&gt; about how AI is extracting value from open source without returning anything. The PR flood is where that extraction hits the ground.&lt;/p&gt;
&lt;p&gt;Everyone keeps arguing about the wrong thing. Whether AI-generated PRs should be labeled, banned, or filtered. Whether maintainers should adopt AI policies. Whether GitHub should build better detection tools.&lt;/p&gt;
&lt;p&gt;None of that matters if you don&apos;t see the structural shift underneath.&lt;/p&gt;
&lt;p&gt;A pull request used to be a gift. Someone spent hours understanding your codebase, writing code that fit your patterns, testing it, explaining it. The PR was proof they gave a damn. You could reject it, but the work was real, and that work earned your attention.&lt;/p&gt;
&lt;p&gt;Sure, not every pre-AI pull request was a gift either. Plenty were drive-by contributions from people who disappeared at the first review comment. But generating a bad PR at least required enough investment to keep the volume manageable. That natural friction is gone.&lt;/p&gt;
&lt;p&gt;Now a pull request is an invoice. Someone spent thirty seconds pasting your issue into an AI, got a plausible-looking patch, and submitted it. The cost to submit is zero. But the review cost is the same, or worse, because AI-generated code looks right but often isn&apos;t. One vendor study (&lt;a href=&quot;https://www.coderabbit.ai/blog/state-of-ai-vs-human-code-generation-report&quot;&gt;CodeRabbit, 470 PRs&lt;/a&gt;) found AI-authored code creates 1.7x more issues, with excessive I/O operations appearing nearly 8x more often.&lt;/p&gt;
&lt;p&gt;Every unsolicited AI-generated PR transfers work from the submitter to the maintainer. That&apos;s not contribution. That&apos;s making it someone else&apos;s problem.&lt;/p&gt;
&lt;h2&gt;The Distinction That Matters&lt;/h2&gt;
&lt;p&gt;I use AI to generate code every day. I &lt;a href=&quot;https://julien.danjou.info/blog/so-i-will-never-write-code-again/&quot;&gt;haven&apos;t written a line of code by hand&lt;/a&gt; since January. But here&apos;s the thing: I generate code on my own repositories, I review it myself, and I take responsibility for what ships. That&apos;s productivity.&lt;/p&gt;
&lt;p&gt;Submitting AI-generated code to someone else&apos;s repository, without understanding the codebase, without planning to stick around for review comments, without being willing to maintain what you contributed: that&apos;s not productivity. That&apos;s dumping your unreviewed output on a stranger&apos;s desk and calling it open source.&lt;/p&gt;
&lt;p&gt;I&apos;ve spent &lt;a href=&quot;https://julien.danjou.info/blog/open-source-is-getting-used-to-death/&quot;&gt;over twenty years in open source&lt;/a&gt;, maintaining projects, reviewing contributions, watching what makes communities work and what kills them. The pattern is always the same: it breaks when the cost of reviewing outpaces the cost of submitting. AI didn&apos;t invent the problem. It just turned a 2x imbalance into something that scales infinitely. A human can submit maybe five drive-by PRs a day. An agent can submit five hundred.&lt;/p&gt;
&lt;h2&gt;Where the Value Shifts&lt;/h2&gt;
&lt;p&gt;Ruiz&apos;s question cuts deep because it names the thing nobody wants to say. Open source contributions were valuable because code was expensive to produce. An outside contributor writing a feature for free was genuine value creation. That was the deal.&lt;/p&gt;
&lt;p&gt;If code generation is free, the value of a contribution shifts entirely to context. Does this person understand the architecture? Will they respond to review feedback? Will they maintain this code in six months? Will they even be around tomorrow?&lt;/p&gt;
&lt;p&gt;A pull request can&apos;t answer those questions today. It&apos;s just a diff. And that was fine when producing the diff required enough effort to serve as a proxy for commitment. It doesn&apos;t anymore.&lt;/p&gt;
&lt;p&gt;We automated writing code. Now we need to automate reviewing it. Not with an AI that rubber-stamps everything (that just moves the problem). The pull request needs to carry more than code. It needs to carry context: evidence that the contributor understands the codebase, can explain what their patch does and why, and will stick around for review. Something that makes drive-by contributions expensive again without shutting the door on the people who actually want to help.&lt;/p&gt;
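&lt;p&gt;To make that concrete, here is a purely speculative sketch of the context a pull request could carry alongside the diff. No platform implements this today; every field name below is invented.&lt;/p&gt;

```python
# Purely speculative: fields a PR could carry beyond the diff.
# Nothing here is a real GitHub feature; all names are invented.
contribution_context = {
    "summary": "what the patch does, in the contributor's own words",
    "why": "the problem it solves, linked to a reproducible issue",
    "evidence_of_understanding": [
        "which subsystems the change touches and why",
        "which existing tests cover the affected path",
    ],
    "tested_how": "ran the affected suite locally, added a regression test",
    "commitment": "will respond to review feedback on this change",
}

def looks_driveby(ctx):
    """Cheap first-pass filter: a drive-by submission leaves these blank."""
    required = ("summary", "why", "evidence_of_understanding")
    return any(not ctx.get(field) for field in required)
```

&lt;p&gt;Whether fields like these get filled by humans, verified by tooling, or both is exactly the open design question.&lt;/p&gt;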
&lt;p&gt;That&apos;s the hard problem. Not &quot;should we allow AI PRs&quot; (that ship sailed). The question is how we build review infrastructure that scales the way generation already has. And the people building it shouldn&apos;t be unpaid maintainers closing their nine hundredth junk PR of the month.&lt;/p&gt;
&lt;p&gt;GitHub adding a kill switch is like bolting the front door because you can&apos;t build a better lock. It stops the break-ins. But it also stops everyone else. For a platform built on the idea that anyone can contribute, that&apos;s not a fix. That&apos;s a retreat.&lt;/p&gt;
</content:encoded></item><item><title>Open Source After the Extraction</title><link>https://julien.danjou.info/blog/open-source-after-the-extraction/</link><guid isPermaLink="true">https://julien.danjou.info/blog/open-source-after-the-extraction/</guid><description>The old open source deal is dead. What replaces it isn&apos;t a fix, it&apos;s a transformation. Open source stops being a community and becomes a supply chain.</description><pubDate>Tue, 24 Feb 2026 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;In the &lt;a href=&quot;https://julien.danjou.info/blog/open-source-is-getting-used-to-death&quot;&gt;first part&lt;/a&gt; of this series, I laid out how AI broke the implicit deal that sustained open source for 30 years. Usage up, engagement gone, economics collapsing.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/images/blog/open-source-after-the-extraction/empty-library.webp&quot; alt=&quot;An empty library where robotic arms sort through books (no readers in sight)&quot; /&gt;&lt;/p&gt;
&lt;p&gt;So what happens next? Open source doesn&apos;t vanish. But it doesn&apos;t recover either. To understand what it becomes, start with what&apos;s already changing for the people who build it.&lt;/p&gt;
&lt;h2&gt;240 million downloads, zero feedback&lt;/h2&gt;
&lt;p&gt;I maintain &lt;a href=&quot;https://github.com/jd/tenacity&quot;&gt;tenacity&lt;/a&gt;, a retry library for Python. 240 million downloads last month. But I can feel the shift: anyone can now tell Claude &quot;write me a retry decorator with exponential backoff and jitter&quot; and get something good enough in 30 seconds. The library isn&apos;t competing with other libraries anymore. It&apos;s competing with generating the code on the fly.&lt;/p&gt;
&lt;p&gt;I started &lt;a href=&quot;https://awesomewm.org&quot;&gt;awesome&lt;/a&gt; in 2007 because I wanted a tiling window manager that didn&apos;t suck. Nobody was paying me. That impulse doesn&apos;t go away because Claude can autocomplete your config files. But here&apos;s the thing: I kept maintaining it because people &lt;em&gt;used&lt;/em&gt; it. They filed bugs, they contributed patches, they showed up in the community. That feedback loop is what made the work feel worth doing.&lt;/p&gt;
&lt;p&gt;If users stop showing up (because they generated their own config, their own tool, their own solution) that loop breaks. Starting a project still feels great. Maintaining one nobody engages with doesn&apos;t. And when code is a commodity, a project needs &lt;em&gt;vision&lt;/em&gt; to stand out: a point of view, a design philosophy, an opinionated take on how things should work. Open source used to reward craft. Now it rewards product thinking. Not everyone wants to be a product person.&lt;/p&gt;
&lt;h2&gt;The middle collapses&lt;/h2&gt;
&lt;p&gt;&lt;a href=&quot;https://tailwindcss.com/&quot;&gt;Tailwind&lt;/a&gt; is the poster child (80% revenue drop despite growing usage) but think of every well-crafted open source project sustained by one person or a small team selling docs, courses, or sponsorships. That entire tier is in trouble.&lt;/p&gt;
&lt;p&gt;Companies like &lt;a href=&quot;https://redis.io/&quot;&gt;Redis&lt;/a&gt; or &lt;a href=&quot;https://www.elastic.co/&quot;&gt;Elastic&lt;/a&gt; can adapt because they have real revenue and can change their licenses: &lt;a href=&quot;https://redis.io/blog/redis-adopts-dual-source-available-licensing/&quot;&gt;Redis switched to dual licensing&lt;/a&gt;, &lt;a href=&quot;https://www.elastic.co/blog/elasticsearch-is-open-source-again&quot;&gt;Elastic went SSPL then came back&lt;/a&gt;, &lt;a href=&quot;https://www.hashicorp.com/blog/hashicorp-adopts-business-source-license&quot;&gt;HashiCorp moved to BSL&lt;/a&gt;. Some mid-tier projects get absorbed into corporate ecosystems: &lt;a href=&quot;https://vercel.com&quot;&gt;Vercel&lt;/a&gt; backs &lt;a href=&quot;https://nextjs.org/&quot;&gt;Next.js&lt;/a&gt;, &lt;a href=&quot;https://astro.build/blog/supporting-the-future-of-astro/&quot;&gt;Cloudflare acquires Astro&lt;/a&gt;. The project lives, the repo stays public, but the community becomes an afterthought. It&apos;s corporate R&amp;amp;D with a GitHub URL.&lt;/p&gt;
&lt;p&gt;And new licenses are emerging to fight back. The &lt;a href=&quot;https://polyformproject.org/licenses/shield/1.0.0/&quot;&gt;PolyForm Shield&lt;/a&gt; restricts competitors from using your code. The &lt;a href=&quot;https://www.licenses.ai/&quot;&gt;Responsible AI License (RAIL)&lt;/a&gt; adds behavioral restrictions on AI use. Some projects are experimenting with clauses that explicitly prohibit feeding code into training datasets: you can use my code, but you can&apos;t feed it to a model that will help your users bypass me entirely.&lt;/p&gt;
&lt;p&gt;Whether these licenses will hold up in court is untested. But the fact that they&apos;re emerging tells you something. When maintainers start lawyering up, the community era is over. The solo maintainer doesn&apos;t have Redis&apos;s resources to pivot. They either stop, or &lt;a href=&quot;https://steipete.me/posts/2026/openclaw&quot;&gt;get acqui-hired by the companies that need their work&lt;/a&gt;.&lt;/p&gt;
&lt;h2&gt;The twist nobody sees coming&lt;/h2&gt;
&lt;p&gt;Here&apos;s the thing that makes this hard to see clearly: open source looks &lt;em&gt;healthier&lt;/em&gt; than ever from the outside.&lt;/p&gt;
&lt;p&gt;Corporate open source output is actually &lt;em&gt;increasing&lt;/em&gt;. &lt;a href=&quot;https://opensource.fb.com/&quot;&gt;Meta&lt;/a&gt; open-sources &lt;a href=&quot;https://pytorch.org/&quot;&gt;PyTorch&lt;/a&gt; and &lt;a href=&quot;https://llama.meta.com/&quot;&gt;Llama&lt;/a&gt; to commoditize the AI stack and set the standards others build on. &lt;a href=&quot;https://opensource.google/&quot;&gt;Google&lt;/a&gt; does the same with &lt;a href=&quot;https://kubernetes.io/&quot;&gt;Kubernetes&lt;/a&gt; and &lt;a href=&quot;https://go.dev/&quot;&gt;Go&lt;/a&gt;. AI labs publish model weights so the ecosystem locks into their formats. More code than ever is landing in public repos.&lt;/p&gt;
&lt;p&gt;But the word &quot;open&quot; is doing a lot of heavy lifting. These projects are strategic assets with public URLs. There&apos;s no community, just suppliers and consumers. &lt;a href=&quot;https://kernel.org&quot;&gt;Linux&lt;/a&gt;, &lt;a href=&quot;https://curl.se/&quot;&gt;curl&lt;/a&gt;, &lt;a href=&quot;https://www.postgresql.org/&quot;&gt;PostgreSQL&lt;/a&gt; get funded not because people care, but because they&apos;re supply chain dependencies (professionalized maintainers on corporate payrolls, a trend building for over 20 years). The corporate-backed projects were never communities to begin with.&lt;/p&gt;
&lt;p&gt;Open source isn&apos;t dying. It&apos;s being industrialized. The old open source was a community. People showed up because they cared. They contributed because they were proud. They maintained because they were recognized. The economics were messy and implicit, but they were human. The new open source is a supply chain.&lt;/p&gt;
&lt;h2&gt;What&apos;s left&lt;/h2&gt;
&lt;p&gt;I&apos;ve been in open source for over 20 years. The thing I loved about it was never the code. It was the bug reports that turned into conversations. The patches from strangers who cared. The feeling of building something together that none of us could have built alone.&lt;/p&gt;
&lt;p&gt;Some will argue AI lowers the barrier to contribute, that agents filing PRs and writing docs keeps the ecosystem healthy. Maybe. But a pull request from a bot isn&apos;t the same as a patch from someone who cared enough to read your code and understand your design. The mechanical contribution survives. The human connection doesn&apos;t.&lt;/p&gt;
&lt;p&gt;The open source that comes next will produce good software. Maybe even better software, once infrastructure gets properly funded and AI tooling matures. But it&apos;ll be lonelier. More transactional. Less weird.&lt;/p&gt;
&lt;p&gt;The code will keep flowing. The community won&apos;t.&lt;/p&gt;
</content:encoded></item><item><title>Open Source Is Getting Used to Death</title><link>https://julien.danjou.info/blog/open-source-is-getting-used-to-death/</link><guid isPermaLink="true">https://julien.danjou.info/blog/open-source-is-getting-used-to-death/</guid><description>AI broke the implicit deal that sustained open source for 30 years. Usage is up. Engagement is gone. The economics don&apos;t work anymore.</description><pubDate>Tue, 17 Feb 2026 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;a href=&quot;https://tailwindcss.com/&quot;&gt;Tailwind CSS&lt;/a&gt; is more popular than ever. Downloads keep climbing. Developers love it. AI coding assistants recommend it constantly.&lt;/p&gt;
&lt;p&gt;Its creator, &lt;a href=&quot;https://adamwathan.me/&quot;&gt;Adam Wathan&lt;/a&gt;, says &lt;a href=&quot;https://devclass.com/2026/01/08/tailwind-labs-lays-off-75-percent-of-its-engineers-thanks-to-brutal-impact-of-ai/&quot;&gt;documentation traffic is down 40% and revenue has dropped close to 80%&lt;/a&gt;. He &lt;a href=&quot;https://github.com/tailwindlabs/tailwindcss.com/pull/2388&quot;&gt;laid off 75% of the team&lt;/a&gt; last month.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/images/blog/open-source-used-to-death/tailwind-fire.png&quot; alt=&quot;Tailwind CSS team layoff announcement on GitHub&quot; /&gt;&lt;/p&gt;
&lt;p&gt;That&apos;s the state of open source in 2026. More usage, less everything else.&lt;/p&gt;
&lt;h2&gt;The deal nobody signed&lt;/h2&gt;
&lt;p&gt;Open source always ran on an implicit deal: I share my code, you engage with it. You read the docs, file bugs, sponsor the project, contribute patches, argue about API design. That engagement was the currency that kept the ecosystem alive.&lt;/p&gt;
&lt;p&gt;The deal was already fraying. &lt;a href=&quot;https://nadia.xyz/&quot;&gt;Nadia Eghbal&lt;/a&gt; documented this in &lt;a href=&quot;https://press.stripe.com/working-in-public&quot;&gt;&lt;em&gt;Working in Public&lt;/em&gt;&lt;/a&gt; back in 2020: the ratio of consumers to contributors was already thousands to one. Most users never filed a bug, never sponsored anything, never showed up. Maintainers were burning out long before AI arrived.&lt;/p&gt;
&lt;p&gt;But AI didn&apos;t just accelerate the decline. It changed the structure.&lt;/p&gt;
&lt;p&gt;When &lt;a href=&quot;https://claude.ai&quot;&gt;Claude&lt;/a&gt; writes your Tailwind classes, you never visit the docs. When &lt;a href=&quot;https://github.com/features/copilot&quot;&gt;Copilot&lt;/a&gt; autocompletes your &lt;a href=&quot;https://curl.se/&quot;&gt;curl&lt;/a&gt; flags, you never read the man page. When an AI agent assembles your project from a dozen open source libraries, none of those maintainers see a download page visit, a GitHub star, or a support ticket.&lt;/p&gt;
&lt;p&gt;The code still flows. The engagement doesn&apos;t.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/images/blog/open-source-used-to-death/robots-lib.webp&quot; alt=&quot;Robots checking out books from a library, but nobody is returning them&quot; /&gt;&lt;/p&gt;
&lt;h2&gt;Two channels, one winner&lt;/h2&gt;
&lt;p&gt;Koren, Békés, Hinz, and Lohmann lay this out in &lt;a href=&quot;https://arxiv.org/abs/2601.15494&quot;&gt;&quot;Vibe Coding Kills Open Source&quot;&lt;/a&gt;, a paper that models two competing forces. AI makes it cheaper to build software — more projects, better code, the flywheel that grew open source for 30 years spins faster. But AI also means users interact with open source through a proxy. They get the value and skip the engagement. Maintainers lose the revenue, reputation, and feedback that justified sharing code.&lt;/p&gt;
&lt;p&gt;In the short term, both forces are at work and the good one wins. Long-term, diversion dominates. The flywheel starts running in reverse.&lt;/p&gt;
&lt;p&gt;For 30 years, the cycle looked like this: a maintainer shares a library. Developers use it, read the docs, file bugs, sponsor it. The maintainer gets revenue, reputation, and feedback — keeps improving. More developers adopt it. The cycle reinforces itself.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/images/blog/open-source-used-to-death/virtuous-cycle.svg&quot; alt=&quot;The open source virtuous cycle&quot; width=&quot;400&quot; /&gt;&lt;/p&gt;
&lt;p&gt;&lt;em&gt;The virtuous cycle that sustained open source for 30 years&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;Now the loop runs in reverse. A maintainer shares a library. AI agents use it, but users never visit the docs, never file issues, never sponsor the project. Revenue drops. The maintainer burns out and stops maintaining. Developers who need that functionality ask an AI to build it from scratch. That generated code never gets shared back — why would it? And the next maintainer looking at the economics thinks: why bother sharing mine?&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/images/blog/open-source-used-to-death/death-spiral.svg&quot; alt=&quot;The open source death spiral&quot; width=&quot;300&quot; /&gt;&lt;/p&gt;
&lt;p&gt;&lt;em&gt;The same loop — until it isn&apos;t&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;Each turn of the cycle is rational. No one&apos;s doing anything wrong. But the collective result is an ecosystem consuming itself.&lt;/p&gt;
&lt;p&gt;The data is already there. &lt;a href=&quot;https://stackoverflow.com/&quot;&gt;Stack Overflow&lt;/a&gt; lost &lt;a href=&quot;https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4990637&quot;&gt;25% of its activity&lt;/a&gt; within six months of &lt;a href=&quot;https://chatgpt.com/&quot;&gt;ChatGPT&lt;/a&gt; launching — and yes, SO was already declining, but AI cratered the curve. The &lt;a href=&quot;https://daniel.haxx.se/&quot;&gt;curl maintainer&lt;/a&gt; reports that &lt;a href=&quot;https://daniel.haxx.se/blog/2024/01/02/the-i-in-llm-stands-for-intelligence/&quot;&gt;20% of security vulnerability reports are now AI-generated garbage&lt;/a&gt;. Downloads go up. Everything that matters goes down.&lt;/p&gt;
&lt;h2&gt;The economics of extraction&lt;/h2&gt;
&lt;p&gt;When cloud providers started offering open source as a service (the &quot;AWS problem&quot;), maintainers at least knew who was extracting value. You could negotiate. You could change your license. You could build a competing hosted product. You could fight it.&lt;/p&gt;
&lt;p&gt;AI extraction is painless — and that&apos;s what makes it lethal. Nobody feels like they&apos;re taking anything. A developer asks Claude a question, gets working code, ships it. The value flows out of open source into training data, into autocomplete suggestions, into vibe-coded projects — and nobody involved ever knows your name. It&apos;s not theft. It&apos;s evaporation.&lt;/p&gt;
&lt;p&gt;The paper puts numbers to it: sustaining open source at current levels requires per-user contributions (money, attention, engagement) to stay roughly where they are today. But the whole point of AI-mediated usage is that per-user engagement drops to near zero. The math doesn&apos;t work.&lt;/p&gt;
&lt;h2&gt;What the economists miss&lt;/h2&gt;
&lt;p&gt;The paper models only the economics. Intrinsic motivation, developers doing things because they want to rather than because they get paid, is outside its scope, and the authors acknowledge the blind spot.&lt;/p&gt;
&lt;p&gt;I&apos;ve spent over 20 years in open source — &lt;a href=&quot;https://www.debian.org&quot;&gt;Debian&lt;/a&gt;, &lt;a href=&quot;https://awesomewm.org&quot;&gt;awesome window manager&lt;/a&gt;, &lt;a href=&quot;https://www.gnu.org/software/emacs/&quot;&gt;GNU Emacs&lt;/a&gt;, &lt;a href=&quot;https://www.openstack.org&quot;&gt;OpenStack&lt;/a&gt;, &lt;a href=&quot;https://mergify.com&quot;&gt;Mergify&lt;/a&gt; — and the economics were never the whole story. A lot of open source ran on ego. And I mean that as a compliment.&lt;/p&gt;
&lt;p&gt;You started a project because you were proud of what you built. You maintained it because people used it and told you it was good. You contributed to someone else&apos;s project because it felt meaningful to be part of something bigger. The reputation, the GitHub profile, the conference talks — that was the fuel.&lt;/p&gt;
&lt;p&gt;AI erodes that too. When your library is consumed by a model that never credits you, the ego fuel dries up. Nobody&apos;s filing issues saying &quot;great work on this API.&quot; Nobody&apos;s writing blog posts about your clever design decisions. Your code is in millions of projects and you&apos;ll never know.&lt;/p&gt;
&lt;p&gt;Michael Still &lt;a href=&quot;https://www.madebymikal.com/ancient-code-mental-health-and-ai-tooling/&quot;&gt;maintained pngtools for 25 years&lt;/a&gt; and recently admitted he &quot;can&apos;t really explain what I got in return apart from the occasional dopamine hit.&quot; That&apos;s not bitterness — it&apos;s an honest accounting of what happens when the feedback loop never closes.&lt;/p&gt;
&lt;h2&gt;The rebuild reflex&lt;/h2&gt;
&lt;p&gt;&lt;a href=&quot;https://www.anthropic.com&quot;&gt;Anthropic&lt;/a&gt; &lt;a href=&quot;https://www.anthropic.com/engineering/building-c-compiler&quot;&gt;built a C compiler&lt;/a&gt; with Claude. Cursor &lt;a href=&quot;https://fortune.com/2026/01/23/cursor-built-web-browser-with-swarm-ai-agents-powered-openai/&quot;&gt;built a web browser&lt;/a&gt; with a swarm of agents running on &lt;a href=&quot;https://openai.com&quot;&gt;OpenAI&lt;/a&gt; models. This is what happens when development costs collapse.&lt;/p&gt;
&lt;p&gt;The obvious objection: generating code isn&apos;t maintaining code. curl works because of 20 years of edge cases, security patches, and platform quirks. You can&apos;t generate that in a weekend. True — but the line between &quot;writing&quot; code and &quot;maintaining&quot; code is blurrier than it looks. Every line you write immediately becomes maintenance. AI doesn&apos;t just generate the first draft — it fixes the bugs, handles the edge cases, iterates on the patches. The entire lifecycle gets cheaper, not just the initial build.&lt;/p&gt;
&lt;p&gt;Five years ago, nobody in their right mind would build their own HTTP server, their own date parsing library, their own compression algorithm. You used the shared one because the alternative was insane.&lt;/p&gt;
&lt;p&gt;The alternative is no longer insane. It might be a weekend project.&lt;/p&gt;
&lt;h2&gt;Where this leaves us&lt;/h2&gt;
&lt;p&gt;Some of this is happening right now. The Tailwind numbers are a Q4 report. Stack Overflow&apos;s decline is measured. The &lt;a href=&quot;https://curl.se/&quot;&gt;curl&lt;/a&gt; maintainer is drowning in AI-generated noise today. Some of it is projection — I&apos;m betting that the diversion effect gets stronger, not weaker, as AI gets better. I could be wrong. But the trend lines all point the same way.&lt;/p&gt;
&lt;p&gt;&quot;But AI also contributes!&quot; Sure. Agents file PRs, generate docs, triage issues. That helps with the mechanical work. It doesn&apos;t replace the human who cared enough to read your code and tell you it mattered. The engagement that sustained open source was never about the pull requests — it was about the people behind them.&lt;/p&gt;
&lt;p&gt;Open source isn&apos;t dying because people stopped caring. It&apos;s dying because AI lets people extract all the value without returning any of it. The code flows through models, through agents, through autocomplete — and none of it flows back.&lt;/p&gt;
&lt;p&gt;The question isn&apos;t whether this is happening. It&apos;s what comes next.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;&lt;a href=&quot;https://julien.danjou.info/blog/open-source-after-the-extraction&quot;&gt;Part 2: Open Source After the Extraction&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;
</content:encoded></item><item><title>How Entire Works Under the Hood</title><link>https://julien.danjou.info/blog/how-entire-works-under-the-hood/</link><guid isPermaLink="true">https://julien.danjou.info/blog/how-entire-works-under-the-hood/</guid><description>I dug into Entire&apos;s open source Checkpoints CLI. It&apos;s a clever abuse of git internals — shadow branches, orphan metadata, and a session state machine. Here&apos;s how it works.</description><pubDate>Thu, 12 Feb 2026 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;In &lt;a href=&quot;https://julien.danjou.info/blog/github-wont-work-for-ai-agents&quot;&gt;part 1&lt;/a&gt;, I covered why Entire raised $60M and what problem they&apos;re solving. Now let&apos;s look at the actual code.&lt;/p&gt;
&lt;p&gt;I pointed Claude Code at &lt;a href=&quot;https://github.com/entireio/cli&quot;&gt;Entire&apos;s open source CLI&lt;/a&gt; and asked it to explain how things work. The architecture is more interesting than I expected — they&apos;ve essentially built a session-aware metadata layer on top of git using nothing but git&apos;s own primitives.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/images/blog/entire/repo.png&quot; alt=&quot;The Entire CLI repository on GitHub&quot; /&gt;&lt;/p&gt;
&lt;h2&gt;The Big Picture&lt;/h2&gt;
&lt;p&gt;Entire hooks into two things: your AI agent (Claude Code, Gemini CLI) and git itself. The agent hooks capture what&apos;s happening during a session. The git hooks capture what the developer commits.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;Agent hooks (Claude Code)         Git hooks
  SessionStart                     prepare-commit-msg
  UserPromptSubmit                 post-commit
  Stop                             pre-push
  PreToolUse / PostToolUse
         │                              │
         └──────────┬───────────────────┘
                    │
            ┌───────▼────────┐
            │   Strategy     │
            │                │
            │ SaveChanges()  │
            │ Rewind()       │
            │ Condense()     │
            └───────┬────────┘
                    │
         ┌──────────┴──────────┐
         │                     │
    Shadow branches      Metadata branch
    (local, temp)        (shared, permanent)
    entire/&amp;lt;hash&amp;gt;        entire/checkpoints/v1
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;How Agent Hooks Get Installed&lt;/h2&gt;
&lt;p&gt;Running &lt;code&gt;entire enable&lt;/code&gt; writes hook entries into &lt;code&gt;.claude/settings.json&lt;/code&gt;. Seven hooks, covering the full session lifecycle:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;SessionStart/SessionEnd&lt;/strong&gt; — track session boundaries&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;UserPromptSubmit&lt;/strong&gt; — fires before the agent starts working (captures human edits)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Stop&lt;/strong&gt; — fires after the agent finishes a turn (triggers checkpoint save)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;PreToolUse/PostToolUse[Task]&lt;/strong&gt; — track subagent spawning&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;PostToolUse[TodoWrite]&lt;/strong&gt; — capture task state&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Each hook is just a shell command: &lt;code&gt;entire hooks claude-code stop&lt;/code&gt;. The CLI parses the agent&apos;s transcript to extract everything it needs.&lt;/p&gt;
&lt;h2&gt;The Transcript Is the Source of Truth&lt;/h2&gt;
&lt;p&gt;This is the key insight. When the Stop hook fires, Claude Code passes two things via stdin: a &lt;code&gt;session_id&lt;/code&gt; and a &lt;code&gt;transcript_path&lt;/code&gt;. That transcript — the JSONL file where Claude logs every message, tool call, and response — is the single source of truth.&lt;/p&gt;
&lt;p&gt;The CLI mines it for:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Modified files&lt;/strong&gt; — scans for &lt;code&gt;tool_use&lt;/code&gt; blocks where the tool is &lt;code&gt;Write&lt;/code&gt;, &lt;code&gt;Edit&lt;/code&gt;, etc., and extracts the &lt;code&gt;file_path&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;User prompts&lt;/strong&gt; — finds &lt;code&gt;type: &quot;user&quot;&lt;/code&gt; entries&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Token usage&lt;/strong&gt; — sums &lt;code&gt;input_tokens&lt;/code&gt;, &lt;code&gt;output_tokens&lt;/code&gt; from response metadata&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Summary&lt;/strong&gt; — grabs the last assistant message&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;No magic, no APIs. It just reads the same JSONL file that Claude Code writes to disk.&lt;/p&gt;
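&lt;p&gt;A minimal sketch of that mining pass in Python. The field names (&lt;code&gt;tool_use&lt;/code&gt;, &lt;code&gt;file_path&lt;/code&gt;, &lt;code&gt;input_tokens&lt;/code&gt;) follow the shapes described above; the real transcript format may differ in detail, so treat this as illustrative rather than a drop-in parser:&lt;/p&gt;

```python
import json

def mine_transcript(path):
    """Scan a Claude Code-style JSONL transcript for modified files,
    user prompts, token totals, and a summary. Field names are
    assumptions based on the description above, not a spec."""
    files, prompts, tokens, summary = set(), [], 0, ""
    for line in open(path, encoding="utf-8"):
        entry = json.loads(line)
        if entry.get("type") == "user":
            prompts.append(entry.get("message", {}).get("content", ""))
        if entry.get("type") == "assistant":
            msg = entry.get("message", {})
            usage = msg.get("usage", {})
            tokens += usage.get("input_tokens", 0) + usage.get("output_tokens", 0)
            for block in msg.get("content", []):
                if block.get("type") == "text":
                    summary = block["text"]  # last assistant message wins
                if block.get("type") == "tool_use" and block.get("name") in ("Write", "Edit"):
                    files.add(block["input"]["file_path"])
    return {"files": sorted(files), "prompts": prompts,
            "tokens": tokens, "summary": summary}
```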
&lt;h2&gt;Shadow Branches: Snapshots Without Commits&lt;/h2&gt;
&lt;p&gt;Here&apos;s where it gets clever. When the agent finishes a turn, Entire needs to save a snapshot of the working tree. But it can&apos;t commit to your branch — that would mess up your history.&lt;/p&gt;
&lt;p&gt;So it creates &lt;strong&gt;shadow branches&lt;/strong&gt;: refs like &lt;code&gt;entire/2b4c177-a5e3f2&lt;/code&gt; that live in your local repo but never touch your working branch.&lt;/p&gt;
&lt;p&gt;The name encodes two things:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;2b4c177&lt;/code&gt; — first 7 chars of HEAD when the session started&lt;/li&gt;
&lt;li&gt;&lt;code&gt;a5e3f2&lt;/code&gt; — hash of the worktree ID (to support &lt;code&gt;git worktree&lt;/code&gt;)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The snapshot is built entirely in memory using go-git&apos;s plumbing APIs:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Take HEAD&apos;s tree (the full repo structure)&lt;/li&gt;
&lt;li&gt;Apply the agent&apos;s changes (add/remove/modify blobs)&lt;/li&gt;
&lt;li&gt;Attach the metadata directory (&lt;code&gt;.entire/metadata/&amp;lt;session-id&amp;gt;/&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Create a commit on the shadow branch&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;No checkout, no stash, no visible side effects. The user and agent don&apos;t even know it happened.&lt;/p&gt;
&lt;p&gt;Deduplication is automatic: if the tree hash matches the previous checkpoint, it skips the commit. Git&apos;s content-addressable storage means identical files share blobs across checkpoints.&lt;/p&gt;
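&lt;p&gt;You can reproduce the core trick with plain git plumbing. This sketch creates a commit on a shadow ref straight from HEAD&apos;s tree, with no checkout and no stash; Entire does the equivalent in memory with go-git, layering the agent&apos;s changed blobs onto the tree first:&lt;/p&gt;

```python
import subprocess

def git(repo, *args):
    """Run a git command in repo and return stripped stdout."""
    return subprocess.run(["git", "-C", repo] + list(args),
                          check=True, capture_output=True, text=True).stdout.strip()

def shadow_checkpoint(repo, session_ref, message):
    """Commit HEAD's tree onto a shadow ref (e.g. refs/entire/2b4c177-a5e3f2)
    without touching the index, the worktree, or the current branch.
    A plumbing-level sketch of the idea, not Entire's actual code."""
    tree = git(repo, "rev-parse", "HEAD^{tree}")      # HEAD's full tree
    parent = git(repo, "rev-parse", "HEAD")
    # commit-tree writes a commit object directly; no checkout involved
    commit = git(repo, "commit-tree", tree, "-p", parent, "-m", message)
    git(repo, "update-ref", session_ref, commit)
    return commit
```

&lt;p&gt;Because &lt;code&gt;commit-tree&lt;/code&gt; and &lt;code&gt;update-ref&lt;/code&gt; write objects and refs directly, the index and working tree never move.&lt;/p&gt;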
&lt;h2&gt;The Condensation Model&lt;/h2&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/images/blog/entire/branch.png&quot; alt=&quot;The entire/checkpoints/v1 orphan branch stores all metadata&quot; /&gt;&lt;/p&gt;
&lt;p&gt;Shadow branches are local scratch space. The real metadata lives on &lt;code&gt;entire/checkpoints/v1&lt;/code&gt; — an orphan branch (no common ancestor with your code) that&apos;s pushed alongside your regular branches.&lt;/p&gt;
&lt;p&gt;The flow:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Agent works → checkpoints saved on shadow branch (local)&lt;/li&gt;
&lt;li&gt;You commit → &lt;code&gt;post-commit&lt;/code&gt; hook fires&lt;/li&gt;
&lt;li&gt;&lt;code&gt;prepare-commit-msg&lt;/code&gt; adds a trailer: &lt;code&gt;Entire-Checkpoint: a3b2c4d5e6f7&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Shadow branch data gets &lt;strong&gt;condensed&lt;/strong&gt; — copied into the metadata branch&lt;/li&gt;
&lt;li&gt;Shadow branch gets cleaned up&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;The checkpoint ID (&lt;code&gt;a3b2c4d5e6f7&lt;/code&gt;) is 6 random bytes, not a git SHA. It&apos;s sharded into a directory path on the metadata branch:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;entire/checkpoints/v1  (orphan branch)
└── a3/b2c4d5e6f7/
    ├── metadata.json          # summary, attribution, token usage
    ├── 0/
    │   ├── full.jsonl         # complete session transcript
    │   ├── prompt.txt         # user prompts
    │   └── context.md         # generated context
    └── 1/                     # additional sessions if any
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;That one-line trailer in your commit — &lt;code&gt;Entire-Checkpoint: a3b2c4d5e6f7&lt;/code&gt; — is the bidirectional link. From the commit you find metadata via the sharded path. From the metadata you find the commit by searching for the trailer.&lt;/p&gt;
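&lt;p&gt;Both directions of that link are mechanical. Sharding is a string split, and the reverse lookup is a scan of commit trailers. These helpers are hypothetical, but &lt;code&gt;git log --format=&quot;%H %(trailers:key=Entire-Checkpoint,valueonly)&quot;&lt;/code&gt; produces exactly the input the second one needs:&lt;/p&gt;

```python
def shard_path(checkpoint_id):
    """Map a 12-hex-char checkpoint ID to its directory on the
    metadata branch: the first two chars become a shard prefix."""
    return checkpoint_id[:2] + "/" + checkpoint_id[2:] + "/"

def find_commit_for_checkpoint(log_output, checkpoint_id):
    """Given one 'SHA trailer-value' pair per line (from git log with
    a trailers format), return the SHA whose trailer matches.
    A sketch of the reverse lookup, not the CLI's implementation."""
    for line in log_output.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1] == checkpoint_id:
            return parts[0]
    return None
```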
&lt;h2&gt;Attribution: Who Wrote What?&lt;/h2&gt;
&lt;p&gt;This is the piece that matters for engineering leads. Entire tracks line-level code attribution: what percentage was agent-written vs. human-written.&lt;/p&gt;
&lt;p&gt;The trick is the &lt;strong&gt;UserPromptSubmit&lt;/strong&gt; hook. Every time you type a new prompt — &lt;em&gt;before&lt;/em&gt; the agent starts working — the CLI snapshots the worktree diff against the last checkpoint. This captures exactly what you changed between agent turns.&lt;/p&gt;
&lt;p&gt;By commit time, it has:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Agent lines&lt;/strong&gt;: changes from the last checkpoint&apos;s tree&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Human added&lt;/strong&gt;: lines you added between prompts&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Human modified&lt;/strong&gt;: lines you edited in agent-written code&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Agent percentage&lt;/strong&gt;: the ratio&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The result is stored in &lt;code&gt;initial_attribution&lt;/code&gt; in the metadata:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;{
  &quot;agent_lines&quot;: 150,
  &quot;human_added&quot;: 25,
  &quot;human_modified&quot;: 10,
  &quot;agent_percentage&quot;: 85.7
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;It even uses a LIFO heuristic for self-modifications — if you add lines then remove lines from the same file, it assumes you&apos;re removing your own first, not penalizing the agent&apos;s contribution.&lt;/p&gt;
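&lt;p&gt;The percentage above is consistent with a simple reading: agent lines divided by agent lines plus human-added lines, with human modifications counted against lines the agent already wrote, so they don&apos;t enlarge the denominator. That&apos;s my inference from the example numbers, not the CLI&apos;s documented formula:&lt;/p&gt;

```python
def agent_percentage(agent_lines, human_added, human_modified):
    """Agent share of a commit. human_modified lines are edits inside
    agent-written code, so they overlap agent_lines and are not added
    to the denominator (an assumption inferred from the example)."""
    total = agent_lines + human_added
    if total == 0:
        return 0.0
    # agent_percentage(150, 25, 10) reproduces the 85.7 above
    return round(100.0 * agent_lines / total, 1)
```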
&lt;h2&gt;Multi-Developer: Conflict-Free by Design&lt;/h2&gt;
&lt;p&gt;The metadata branch gets pushed during &lt;code&gt;git push&lt;/code&gt; (via the &lt;code&gt;pre-push&lt;/code&gt; hook). Multiple developers push to the same &lt;code&gt;entire/checkpoints/v1&lt;/code&gt; branch.&lt;/p&gt;
&lt;p&gt;This works because checkpoint IDs are random — two developers will essentially never produce the same 12-hex-char ID. Merging is just a tree union: flatten both trees, combine entries, done. No merge conflicts possible.&lt;/p&gt;
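&lt;p&gt;&quot;Essentially never&quot; is quantifiable with the birthday bound. A back-of-the-envelope sketch for 48-bit random IDs:&lt;/p&gt;

```python
def collision_probability(num_ids, bits=48):
    """Birthday-bound estimate that any two of n uniformly random IDs
    drawn from a 2**bits space collide: roughly n*(n-1) / (2 * 2**bits)."""
    space = 2 ** bits
    return num_ids * (num_ids - 1) / (2 * space)
```

&lt;p&gt;Even ten thousand checkpoints across a team keeps the odds of a single collision below one in a million, so a plain tree union is safe in practice.&lt;/p&gt;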
&lt;p&gt;If a normal push fails (non-fast-forward), the CLI fetches the remote, merges trees, creates a merge commit, and retries.&lt;/p&gt;
&lt;h2&gt;What&apos;s Missing&lt;/h2&gt;
&lt;p&gt;The architecture is solid engineering, but a few things stood out:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Transcript privacy.&lt;/strong&gt; Session transcripts (full agent conversations) get pushed to a branch anyone with repo access can read. For private repos, maybe fine. For orgs with varying access levels — that&apos;s a problem.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Squash merges break links.&lt;/strong&gt; If a PR with 5 commits (each with &lt;code&gt;Entire-Checkpoint&lt;/code&gt; trailers) gets squash-merged, those trailers disappear. The metadata exists but the bidirectional link from the merged commit is broken.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;The metadata branch grows forever.&lt;/strong&gt; Every session from every developer, including abandoned PRs and throwaway experiments, accumulates on &lt;code&gt;entire/checkpoints/v1&lt;/code&gt;. There&apos;s an &lt;code&gt;entire clean&lt;/code&gt; command for local shadow branches, but no retention policy for the permanent metadata. For a large team over months, that&apos;ll bloat.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;No PR linkage.&lt;/strong&gt; The branch name is stored, but there&apos;s no PR number or URL. You can&apos;t easily ask &quot;show me all sessions related to PR #42.&quot;&lt;/p&gt;
&lt;h2&gt;The Smart Parts&lt;/h2&gt;
&lt;p&gt;What I genuinely admire:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Git as a free database.&lt;/strong&gt; Shadow branches store full repo snapshots, but git&apos;s content-addressable storage means only changed blobs cost anything. You get atomic snapshots, deduplication, and transport for free.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;In-memory tree building.&lt;/strong&gt; Checkpoints are created through go-git plumbing APIs — no worktree checkout, no stash, nothing visible. Zero disruption to the developer&apos;s flow.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Attribution at prompt boundaries.&lt;/strong&gt; Capturing human edits &lt;em&gt;before&lt;/em&gt; the agent contaminates the worktree is the cleanest measurement point possible.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Shadow branch migration.&lt;/strong&gt; If you rebase or pull (HEAD changes), the shadow branch name automatically updates. Your session continues seamlessly. This handles a common workflow that would otherwise silently break.&lt;/p&gt;
&lt;h2&gt;So What?&lt;/h2&gt;
&lt;p&gt;Entire doesn&apos;t solve a burning problem today. Most of us are fine with agent-written code landing in our repos without detailed provenance. But the trajectory is clear: as agents write more code, the audit trail becomes essential.&lt;/p&gt;
&lt;p&gt;The approach of storing session context alongside code in git — rather than in a separate system — is the right architectural bet. Git is already where your code lives, where your CI runs, where your reviews happen. Adding a metadata layer inside git itself (instead of a SaaS dashboard somewhere) means the context travels with the code.&lt;/p&gt;
&lt;p&gt;Whether Entire is the company that turns this into a platform worth $300M is above my pay grade. But the engineering is genuine, the problem is real, and the timing feels right.&lt;/p&gt;
&lt;p&gt;I&apos;ll be watching.&lt;/p&gt;
</content:encoded></item><item><title>Agent-Written Code Needs More Than Git</title><link>https://julien.danjou.info/blog/github-wont-work-for-ai-agents/</link><guid isPermaLink="true">https://julien.danjou.info/blog/github-wont-work-for-ai-agents/</guid><description>The former GitHub CEO just raised $60M to rebuild developer tooling for the agentic era. He might be right that git needs a rethink — I&apos;ve been hacking around the same problems.</description><pubDate>Wed, 11 Feb 2026 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;The former GitHub CEO just raised $60M at a $300M valuation for a seed round. For a CLI tool. Let that sink in.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/images/blog/entire/entire-io.png&quot; alt=&quot;Entire.io — a new developer platform for the agentic era&quot; /&gt;&lt;/p&gt;
&lt;p&gt;Thomas Dohmke left GitHub and launched &lt;a href=&quot;https://entire.io/blog/hello-entire-world&quot;&gt;Entire&lt;/a&gt;, a developer platform built from scratch for the age of AI coding agents. It&apos;s the largest seed round in dev tools history.&lt;/p&gt;
&lt;p&gt;My first reaction was &quot;that&apos;s insane.&quot; My second reaction was &quot;wait, I&apos;ve been solving the same problem with duct tape and hooks.&quot;&lt;/p&gt;
&lt;h2&gt;The Problem Is Real&lt;/h2&gt;
&lt;p&gt;If you&apos;re using AI agents like Claude Code or Gemini CLI daily — and I am — you&apos;ve already felt it. Git was built for humans writing code. It assumes you know what you changed and why. It assumes your commit messages mean something. It assumes the person who wrote the code will remember what they were thinking.&lt;/p&gt;
&lt;p&gt;AI agents break all of that.&lt;/p&gt;
&lt;p&gt;When Claude Code rewrites a module for me, the commit message says what happened, but not &lt;em&gt;why&lt;/em&gt;. There&apos;s no trace of the conversation that led there. No record of the three approaches the agent considered and rejected. No way to know if the prompt was &quot;refactor this for clarity&quot; or &quot;make this 10x faster and I don&apos;t care about readability.&quot;&lt;/p&gt;
&lt;p&gt;The transcript — the actual reasoning behind the code — lives in a terminal session that vanishes when you close the tab.&lt;/p&gt;
&lt;h2&gt;My Duct Tape Solution&lt;/h2&gt;
&lt;p&gt;I ran into this a few weeks ago when I wanted to resume a Claude Code session after a reboot. The session was gone, and I had no idea what context the agent had when it made certain decisions.&lt;/p&gt;
&lt;p&gt;So I did what any engineer would do: I wrote a hook. A simple Claude Code hook that links each commit to its session ID via a git trailer. Nothing fancy — just enough that I can trace a commit back to the conversation that produced it.&lt;/p&gt;
&lt;p&gt;Combined with &lt;a href=&quot;https://github.com/Mergify/mergify-cli&quot;&gt;Mergify&apos;s CLI&lt;/a&gt; for stacking PRs, it made my workflow usable. But it&apos;s duct tape. It doesn&apos;t capture the transcript, doesn&apos;t track attribution, doesn&apos;t handle multi-session work.&lt;/p&gt;
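&lt;p&gt;For the curious, the duct tape is roughly this shape: a hook that rewrites the commit message file through &lt;code&gt;git interpret-trailers&lt;/code&gt;. The trailer name here is illustrative, not necessarily the one my hook uses:&lt;/p&gt;

```python
import subprocess

def append_session_trailer(commit_msg_file, session_id):
    """Append a session trailer to the message file git hands to a
    prepare-commit-msg hook. interpret-trailers handles placement
    after the body, so we don't do it by hand."""
    result = subprocess.run(
        ["git", "interpret-trailers",
         "--trailer", "Claude-Session: " + session_id,
         commit_msg_file],
        check=True, capture_output=True, text=True)
    with open(commit_msg_file, "w") as f:
        f.write(result.stdout)
```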
&lt;p&gt;Which is exactly the gap Entire is going after.&lt;/p&gt;
&lt;h2&gt;What Entire Actually Claims to Be&lt;/h2&gt;
&lt;p&gt;Beyond the buzzwords in the press release, Entire is shipping three things:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Checkpoints&lt;/strong&gt; — an open source CLI that captures session context (prompts, transcripts, reasoning) alongside every commit, stored in git without polluting your history&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;A semantic reasoning layer&lt;/strong&gt; — meant to let multiple AI agents collaborate on the same codebase with shared context&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;An AI-native UI&lt;/strong&gt; — designed for agent-to-human collaboration rather than human-to-human&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;They&apos;re not claiming to have a finished product — and they&apos;re upfront about it. The Checkpoints CLI is the first concrete thing they&apos;ve shipped, and it&apos;s open source. The rest is where the $60M goes. Fair enough — let&apos;s look at what actually exists.&lt;/p&gt;
&lt;h2&gt;Why $60M for This?&lt;/h2&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/images/blog/entire/techcrunch.png&quot; alt=&quot;TechCrunch: Former GitHub CEO raises record $60M dev tool seed round&quot; /&gt;&lt;/p&gt;
&lt;p&gt;The bet isn&apos;t that the current CLI is worth $300M. The bet is that the developer tooling stack needs to be rebuilt for a world where most code is written by agents, and the first company to nail the foundation wins.&lt;/p&gt;
&lt;p&gt;Think about it: if 99% of code is agent-written in two years (which is where things are heading), then the code review, debugging, and understanding workflow we have today is fundamentally broken. You can&apos;t review AI-written code the same way you review human-written code. You need the &lt;em&gt;context&lt;/em&gt; — what was the agent trying to do, what constraints did it have, what alternatives did it consider.&lt;/p&gt;
&lt;p&gt;That&apos;s a platform opportunity, and $60M is the price of a credible attempt at it. Whether Entire is the one to build it is a different question — but the problem is real and urgent.&lt;/p&gt;
&lt;h2&gt;My Take&lt;/h2&gt;
&lt;p&gt;Dohmke knows exactly where GitHub&apos;s limits are (he ran it). The investor list — Felicis, Madrona, Olivier Pomel — signals real conviction. And the core insight, that agent context is as important as the code itself, is something I believe in my bones because I&apos;ve been hacking around it myself.&lt;/p&gt;
&lt;p&gt;Their long-term ambition seems to involve moving beyond git. I&apos;m more dubious about that part. Git is unkillable. My bet is that the reality will be hooks and duct tape around git for the next few years — and honestly, that&apos;s probably enough. Git&apos;s data model bends a lot further than people think before it breaks.&lt;/p&gt;
&lt;p&gt;There&apos;s a deeper tension, though. Entire&apos;s model assumes humans are still in the loop — driving agents, reviewing output, caring about attribution. But that&apos;s already not quite how it works. I haven&apos;t written a line of code in months. I describe what I want, the agent writes it, I tell it to fix its mistakes, and it does. I&apos;m not a developer anymore — I&apos;m a director.&lt;/p&gt;
&lt;p&gt;And the trajectory is obvious: agents won&apos;t need directors much longer either. If agents are fully autonomous, who&apos;s the audience for commit context and session transcripts? The agent doesn&apos;t need to remember what it was thinking — it can just re-derive it. The human who never touched the code doesn&apos;t need line-level attribution.&lt;/p&gt;
&lt;p&gt;That could go either way for Entire. Maybe full autonomy makes provenance &lt;em&gt;more&lt;/em&gt; critical — precisely because no human was involved, you need a machine-readable audit trail. Or maybe it makes the whole problem vanish — agents that manage their own context don&apos;t need git hacks to preserve it.&lt;/p&gt;
&lt;p&gt;Either way, if you&apos;re leading an engineering team right now, you should be thinking about how you&apos;ll audit, understand, and trust the code your agents produce — whether there&apos;s a human in the loop or not.&lt;/p&gt;
&lt;p&gt;Next up, I&apos;ll dig into the actual source code and show you &lt;a href=&quot;https://julien.danjou.info/blog/how-entire-works-under-the-hood&quot;&gt;how Entire&apos;s Checkpoints CLI works under the hood&lt;/a&gt;. It&apos;s a clever piece of engineering that abuses git internals in ways I genuinely admire.&lt;/p&gt;
</content:encoded></item><item><title>So I Will Never Write Code Again</title><link>https://julien.danjou.info/blog/so-i-will-never-write-code-again/</link><guid isPermaLink="true">https://julien.danjou.info/blog/so-i-will-never-write-code-again/</guid><description>I&apos;ve been coding for 25 years. Since January, I haven&apos;t written a single line. And it feels like relief.</description><pubDate>Tue, 10 Feb 2026 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/images/blog/nocode.png&quot; alt=&quot;Illustration of a developer who has stopped writing code by hand&quot; /&gt;&lt;/p&gt;
&lt;p&gt;A year ago, I thought AI-assisted coding was going to be a nice productivity boost. Generate a Python script with ChatGPT, copy-paste it somewhere, save twenty minutes. I figured that was the next five years: small wins, gradual improvement.&lt;/p&gt;
&lt;p&gt;Then last August, I &lt;a href=&quot;https://julien.danjou.info/blog/vibe-coding-a-feature-with-ai/&quot;&gt;wrote a feature where Copilot did about 80% of the work&lt;/a&gt;. I thought: okay, it&apos;s getting closer.&lt;/p&gt;
&lt;p&gt;Since January, I haven&apos;t written a single line of code.&lt;/p&gt;
&lt;p&gt;I want to be precise: I&apos;ve &lt;em&gt;produced&lt;/em&gt; a lot of code. More than ever, probably. But I didn&apos;t write any of it. I steer. I review. I architect. I don&apos;t type.&lt;/p&gt;
&lt;p&gt;And I don&apos;t feel the urge to go back.&lt;/p&gt;
&lt;p&gt;This might sound like grief. I&apos;ve been coding for 25 years. I wrote C for a window manager, Lisp for Emacs, Python for everything else. For most of my career, coding was a thing that defined me. Losing that should feel like losing a part of myself.&lt;/p&gt;
&lt;p&gt;But it doesn&apos;t. It feels like relief.&lt;/p&gt;
&lt;p&gt;For years, I was frustrated. I had more ideas than I could build. The bottleneck was never thinking, it was typing. Translating architecture into syntax, aligning parentheses, naming variables, fighting linters. The fun was in the &lt;em&gt;solving&lt;/em&gt;, not the &lt;em&gt;writing&lt;/em&gt;. And now the writing part is handled.&lt;/p&gt;
&lt;p&gt;I still enjoy reading code. It&apos;s like reading a good book. Understanding how pytest works internally, tracing through a complex system, that remains satisfying. But when the goal is to produce, AI beats everything.&lt;/p&gt;
&lt;p&gt;This is actually the second time I&apos;ve stepped away from code. The first was when I became CEO. That time, it was forced. I didn&apos;t choose to stop. I just ran out of hours. There was always one more meeting, one more hire, one more decision that pushed coding to the evening, then to the weekend, then to never.&lt;/p&gt;
&lt;p&gt;That &lt;em&gt;was&lt;/em&gt; grief. A slow, reluctant surrender.&lt;/p&gt;
&lt;p&gt;This time is different. I&apos;m not being pushed away. I&apos;m choosing to work at a higher layer. The same way I once chose Python over C, because life is short and the abstraction was worth it. AI is just the next rung.&lt;/p&gt;
&lt;p&gt;The creativity doesn&apos;t stop. If anything, it accelerates. You still design systems, still make architectural choices, still think about data models and trade-offs. You just don&apos;t spend hours translating those decisions into semicolons. The craft moves up a level, and that&apos;s fine.&lt;/p&gt;
&lt;p&gt;I know this will be harder for others. My colleague Rémy &lt;a href=&quot;https://mergify.com/blog/claude-didnt-kill-craftsmanship&quot;&gt;wrote about whether AI is killing craftsmanship&lt;/a&gt;. For engineers who defined themselves by the elegance of their code, by the perfectly named function, by the satisfaction of a clean diff, this shift feels like losing something sacred.&lt;/p&gt;
&lt;p&gt;I get it. Writing C was a beautiful puzzle. Lisp was genuinely fun. And I still think learning to code by hand matters, the same way learning assembly helps you understand memory even if you never write it professionally.&lt;/p&gt;
&lt;p&gt;But I&apos;m not going to fight a paradigm shift out of nostalgia. The ride was great. The next one looks better.&lt;/p&gt;
&lt;p&gt;I think the flow state people mourn isn&apos;t gone. It&apos;s just moving. Steering AI toward clean architecture, making the right system-level decisions, reviewing output with deep context, that has its own rhythm. The interruptions are still too frequent today (too many permission prompts), but the direction is clear. The flow will come back. It&apos;ll just be at a different altitude.&lt;/p&gt;
&lt;p&gt;If you&apos;re a senior engineer feeling this shift approaching, here&apos;s what I&apos;d say: the grief you&apos;re expecting might not be grief at all. The bottleneck was never the thinking. It was the typing. And the thinking is still yours.&lt;/p&gt;
</content:encoded></item><item><title>The Pre-AI Timestamp</title><link>https://julien.danjou.info/blog/the-pre-ai-timestamp/</link><guid isPermaLink="true">https://julien.danjou.info/blog/the-pre-ai-timestamp/</guid><description>In a few years, the only proof something is real will be that it existed before AI did.</description><pubDate>Thu, 29 Jan 2026 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;I was watching the news this week. &lt;a href=&quot;https://www.yahoo.com/news/articles/experts-issue-warning-viral-videos-033000851.html&quot;&gt;A segment about AI-generated fake videos of snowstorms in the US and Russia&lt;/a&gt;. Journalists carefully debunking synthetic footage, frame by frame.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/images/blog/f1be059b-808a-4e00-a8b7-aebc122fc7f5_462x704.png&quot; alt=&quot;AI-generated image of a fake snowstorm video&quot; /&gt;
&lt;em&gt;AI generated image&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;I thought: this won’t scale.&lt;/p&gt;
&lt;p&gt;Right now, we’re in a strange transitional moment. People can still tell the difference between AI-generated content and reality. We spot the weird hands, the uncanny smoothness, the details that don’t quite land. News organizations debunk fakes. We feel like we’re staying ahead of it.&lt;/p&gt;
&lt;p&gt;But this is temporary. In a year or two, AI-generated video will be indistinguishable from real footage. And then what?&lt;/p&gt;
&lt;p&gt;Here’s what I think people aren’t grasping: the challenge isn’t “how do we detect AI content?” That’s the 2026 problem. The real challenge is what comes after, when detection becomes impossible.&lt;/p&gt;
&lt;p&gt;The only way I know Coldplay is a real band is that they existed before AI did. I have pre-AI memory. I remember when they started. I’ve seen them referenced in media that predates synthetic content. That history is my anchor.&lt;/p&gt;
&lt;p&gt;Now imagine a new band starting in 2030. How would I know they’re real? Unless I go to their concert and see them on stage, I can’t. Their music could be generated. Their interviews could be synthetic. Their social media presence could be entirely fabricated. There’s no way to verify.&lt;/p&gt;
&lt;p&gt;And when I say I go, I mean me. I can’t trust anyone online that I don’t know personally. In a few years, there won’t be a way to know whether you’re talking to a real human through a computerized interface. Your online friends could be AI, for all you know.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/images/blog/aa9fe534-04a2-469b-8cde-5df0d5a07012_1456x816.png&quot; alt=&quot;Illustration of the erosion of online trust as AI content becomes indistinguishable&quot; /&gt;&lt;/p&gt;
&lt;p&gt;This applies to everything. An influencer recommending restaurants. A journalist breaking news. A newsletter you just discovered. If it started after AI became indistinguishable, you have no anchor. You can’t know.&lt;/p&gt;
&lt;p&gt;And this is something most people can’t grasp today because they lived in the pre-AI era, and they have this anchor. Future generations won’t have it.&lt;/p&gt;
&lt;p&gt;We’ve already lived through a version of this with fake reviews. For the last decade, we’ve learned to distrust Amazon ratings, Yelp scores, app store reviews. We developed heuristics. We looked for patterns.&lt;/p&gt;
&lt;p&gt;But that was humans writing fake reviews at a human scale. Now imagine AI generating reviews at a scale we can’t comprehend. Every product, every restaurant, every service flooded with synthetic opinions indistinguishable from real ones. The heuristics break. Trust collapses.&lt;/p&gt;
&lt;p&gt;The same thing will happen to media, to news, to social networks, to everything online. &lt;a href=&quot;https://julien.danjou.info/blog/ai-feels-like-1999-all-over-again&quot;&gt;AI feels like 1999 all over again&lt;/a&gt; — except this time, the divide isn&apos;t access. It&apos;s whether you can tell what&apos;s real.&lt;/p&gt;
&lt;p&gt;A lot of trust today is based on consensus. We trust something because many people trust it. But when bots can outnumber people, consensus becomes meaningless. Popularity becomes a metric that anyone can manufacture.&lt;/p&gt;
&lt;p&gt;So what’s left?&lt;/p&gt;
&lt;p&gt;Physical presence. Meeting someone in person. Attending a concert. Being there.&lt;/p&gt;
&lt;p&gt;Real life becomes the last trust anchor. The thing that can’t be faked (at least until humanoid robots become indistinguishable too, but that’s a problem for later).&lt;/p&gt;
&lt;p&gt;Here’s what haunts me: in two generations, no one alive will remember what was pre-AI. The generational memory dies. A teenager in 2050 won’t know that The New York Times existed before AI and is therefore trustworthy because it’s run by humans (assuming that’s still the case). They won’t have the anchor I have. Everything in their world will be post-AI, and nothing online will be verifiable.&lt;/p&gt;
&lt;p&gt;They’ll have to assume everything is fake. That’s the default. And building trust from that baseline is something we’ve never had to do before.&lt;/p&gt;
&lt;p&gt;I don’t have a solution. But I think we’re in a narrow window where we still remember what “real” meant. That memory is more valuable than we realize.&lt;/p&gt;
</content:encoded></item><item><title>AI Won’t Kill Juniors. It Will Expose Seniors.</title><link>https://julien.danjou.info/blog/ai-wont-kill-juniors-it-will-expose/</link><guid isPermaLink="true">https://julien.danjou.info/blog/ai-wont-kill-juniors-it-will-expose/</guid><description>Everyone fears for the juniors. But the engineers who stopped growing at the wrong layer have more to lose.</description><pubDate>Wed, 21 Jan 2026 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;The tech industry has a new consensus: AI will kill junior engineering jobs. Look at any discussion thread, and you’ll find the same narrative. Juniors are doomed. They’ll never learn to code properly. The entry-level pipeline is broken.&lt;/p&gt;
&lt;p&gt;I’m not so sure. When I look at junior engineers today, I see people who are used to learning. They came up through boot camps, YouTube tutorials, and constantly shifting frameworks. Adapting is what they do. They might struggle for a year or two, but they’ll figure it out.&lt;/p&gt;
&lt;p&gt;The engineers I’m worried about are the senior ones.&lt;/p&gt;
&lt;p&gt;Sure, not all of them. But the ones who plateaued at “code craftsman” and never moved up.&lt;/p&gt;
&lt;p&gt;I’ve seen it play out already. A standup where someone proudly reports they spent the day fixing a batch of bugs and shipping a couple of pull requests. The rest of the team glances at each other. They’re thinking: &lt;em&gt;that’s ten minutes of Claude Code. Why did you spend eight hours in your IDE?&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;This isn’t new. We’ve seen it before. When bash gave way to Perl. When Java replaced C for most applications. Every paradigm shift leaves some people behind. Maybe 10%, maybe 20%, clinging to the old way because it’s what they know.&lt;/p&gt;
&lt;p&gt;But AI is different. The shift is faster, the impact bigger, and the reach exponential.&lt;/p&gt;
&lt;p&gt;Here’s the pattern I see. When I started programming, you’d learn assembly. Then you’d switch to C because life is short. Then Python, because life is really short. Each jump felt like cheating to the previous generation, and each one freed you to think at a higher level.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/images/blog/b855f1c4-142b-482c-8cee-8d02e878cd3a_1456x816.webp&quot; alt=&quot;Illustration of programming abstraction levels from assembly to AI&quot; /&gt;&lt;/p&gt;
&lt;p&gt;AI is the next rung on that ladder. I hope schools are teaching this now: learn to write code by hand first (you need to understand what you’re abstracting), then switch to AI-assisted development. Just like you learned assembly to understand memory, then moved on. Though knowing how slow institutions adapt, I’m not holding my breath.&lt;/p&gt;
&lt;p&gt;The engineers who get this are thriving. Staff engineers, principal engineers, people whose job was already 70% architecture, cross-team coordination, and system design. They only coded 30% of the time anyway. Now they use AI to multiply that 30% and have even more impact. For them, AI is a force multiplier on an already leveraged role.&lt;/p&gt;
&lt;p&gt;But there’s another group. Senior engineers, five to ten years in, who still think their job is writing code 90% of the time. They never thought deeply about data models. Never cared much about architecture. Never moved toward the work that would make them staff or principal.&lt;/p&gt;
&lt;p&gt;Their entire value was &quot;writing proper, clean code that runs well and passes the linter.&quot; They never invested in the skills that &lt;a href=&quot;https://julien.danjou.info/blog/how-to-be-a-great-software-engineer&quot;&gt;make a great software engineer&lt;/a&gt; — communication, system thinking, judgment.&lt;/p&gt;
&lt;p&gt;That value just evaporated.&lt;/p&gt;
&lt;p&gt;And here’s what makes it worse: working with AI is fundamentally communication work. The engineers who thrive are the ones who already know how to share context, explain problems to colleagues, and filter signal from noise across teams.&lt;/p&gt;
&lt;p&gt;I’ve watched engineers struggle with AI because they won’t invest in communication. They type “fix this bug” without the stack trace, without the constraints, without explaining how production differs from their local setup. They keep the context in their head because explaining feels costly. The result is garbage, and they blame the tool.&lt;/p&gt;
&lt;p&gt;What they don’t see: AI compounds. The more context you feed it about your project, the better it gets. But that requires upfront investment in articulation. If you spent your career avoiding that investment with humans, you’ll prevent it with AI too.&lt;/p&gt;
&lt;p&gt;I don’t have a clean solution. The engineers who won’t adapt will stagnate. They might find work in industries that are slow to change. But it won’t be a great career. It never is when you’re holding onto the last paradigm.&lt;/p&gt;
&lt;p&gt;The engineers at risk aren’t the ones who don’t know enough yet. They’re the ones who stopped growing at the wrong layer. Juniors will climb. The question is whether the seniors stuck in the middle will climb with them.&lt;/p&gt;
</content:encoded></item><item><title>Tech Is the Easy Part</title><link>https://julien.danjou.info/blog/tech-is-the-easy-part/</link><guid isPermaLink="true">https://julien.danjou.info/blog/tech-is-the-easy-part/</guid><description>The hard part isn&apos;t building. It&apos;s everything else.</description><pubDate>Tue, 13 Jan 2026 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Last week, a founder reached out for advice. Two co-founders — one technical, one not — a couple years of R&amp;amp;D, patents pending. They’d built something they described as a breakthrough: AI-powered app generation, any app, a few hours. They were ready to raise several million euros. Think &lt;a href=&quot;https://lovable.dev&quot;&gt;Lovable&lt;/a&gt;, but better.&lt;/p&gt;
&lt;p&gt;I asked the usual questions. What’s the product? Who’s the customer? How do you distribute it? What’s your unfair advantage?&lt;/p&gt;
&lt;p&gt;The answers were thin. No clear vertical. No distribution plan. No product, really, just capability.&lt;/p&gt;
&lt;p&gt;I pushed on all of this for an hour. And then the non-technical co-founder said the thing that made me sit back in my chair:&lt;/p&gt;
&lt;p&gt;“Well, we did the hard part. We solved the tech.”&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/images/blog/b00884f6-eda2-4339-a516-44bed1e72309_1376x864.png&quot; alt=&quot;Illustration of tech being the easy part of building a startup&quot; /&gt;&lt;/p&gt;
&lt;p&gt;Here’s what’s actually happening.&lt;/p&gt;
&lt;p&gt;If you’re an engineer — or if you’ve built a company around one — tech &lt;em&gt;feels&lt;/em&gt; hard because it is hard. It’s complex, demanding, and requires real skill. But it’s also legible. You write code, you get feedback. You solve a problem, you know it’s solved. Effort correlates with output. The system makes sense.&lt;/p&gt;
&lt;p&gt;The rest of the business doesn’t work like that.&lt;/p&gt;
&lt;p&gt;Distribution is not legible. You can do everything right and still fail. Positioning is not legible: you’re trying to exist in someone else’s mind, and you can’t compile that. Saying no to opportunities, picking a vertical, pricing, hiring, firing; none of it gives you the clean dopamine hit of a passing test suite.&lt;/p&gt;
&lt;p&gt;So technical teams do what’s rational: they retreat to where effort is rewarded. They keep building. They add features. They refactor. They convince themselves that if the tech is good enough, the rest will follow. (I wrote about this exact pattern in &lt;a href=&quot;https://julien.danjou.info/blog/the-engineers-dilemma-what-we-did&quot;&gt;The Engineer’s Dilemma&lt;/a&gt;.)&lt;/p&gt;
&lt;p&gt;It won’t.&lt;/p&gt;
&lt;hr /&gt;
&lt;p&gt;Ben Horowitz called his book &lt;em&gt;&lt;a href=&quot;https://www.amazon.com/Hard-Thing-About-Things-Building/dp/0062273205&quot;&gt;The Hard Thing About Hard Things&lt;/a&gt;&lt;/em&gt; because the hard things aren’t the technical problems. They’re the ambiguous, human, no-right-answer problems. The ones where you have incomplete information and the consequences are irreversible.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/images/blog/f33beb45-94f5-42fd-a5a7-47db74bb680d_357x522.jpeg&quot; alt=&quot;Cover of The Hard Thing About Hard Things by Ben Horowitz&quot; /&gt;&lt;/p&gt;
&lt;p&gt;For a technical founder, the hard thing is almost never the tech. The hard thing is:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Picking one customer, one use case, one vertical, and ignoring everything else&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Building distribution before you feel ready&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Charging money before the product is “done”&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Talking to the market more than you talk to your codebase&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The tech is the part you already know you can solve. That’s exactly why it’s the wrong place to start.&lt;/p&gt;
&lt;p&gt;If you’re a technical founder and you’ve spent two years on R&amp;amp;D with no product, no customers, and no distribution, you haven’t done the hard part.&lt;/p&gt;
&lt;p&gt;You’ve been avoiding it.&lt;/p&gt;
</content:encoded></item><item><title>GitHub Actions Pricing: The Platform Reality Check</title><link>https://julien.danjou.info/blog/github-actions-isnt-getting-greedy/</link><guid isPermaLink="true">https://julien.danjou.info/blog/github-actions-isnt-getting-greedy/</guid><description>GitHub Actions Pricing: The Platform Reality Check</description><pubDate>Thu, 18 Dec 2025 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;a href=&quot;https://resources.github.com/actions/2026-pricing-changes-for-github-actions/&quot;&gt;GitHub just announced pricing changes for GitHub Actions&lt;/a&gt;, and as expected, parts of the CI ecosystem panicked.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/images/blog/41bef043-9868-48d4-a99c-bd9bf74df245_2582x946.webp&quot; alt=&quot;Screenshot of GitHub Actions pricing changes announcement&quot; /&gt;&lt;/p&gt;
&lt;p&gt;Some people are celebrating the price drop on GitHub-hosted runners. Others are furious that GitHub will start charging for self-hosted runners. And a few businesses are suddenly asking existential questions.&lt;/p&gt;
&lt;p&gt;Since then, &lt;a href=&quot;https://github.com/orgs/community/discussions/182186&quot;&gt;GitHub has paused the self-hosted runner billing change&lt;/a&gt;, acknowledging they moved too fast and didn’t involve the ecosystem enough.&lt;/p&gt;
&lt;p&gt;That doesn’t change the underlying reality. It just delays the conversation.&lt;/p&gt;
&lt;h2&gt;&lt;strong&gt;GitHub Actions Was Never “Free”&lt;/strong&gt;&lt;/h2&gt;
&lt;p&gt;GitHub Actions is not just a binary that runs on your machines.&lt;/p&gt;
&lt;p&gt;It’s a platform:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;job queuing and scheduling&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;runner registration and lifecycle management&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;workflow orchestration&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;security, isolation, secrets handling&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;reliability at massive scale&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Even when you run your own hardware, GitHub is still doing a lot of work on your behalf. That infrastructure has always existed; hosted runners simply subsidized it.&lt;/p&gt;
&lt;p&gt;GitHub explicitly said it:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;“We have real costs in running the Actions control plane.”&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;That’s not new. It’s just now being made explicit.&lt;/p&gt;
&lt;p&gt;Charging a small per-minute platform fee for self-hosted runners isn’t conceptually unfair: it’s GitHub aligning pricing with reality.&lt;/p&gt;
&lt;p&gt;If you believe this is unacceptable, there has always been a clear alternative: run Jenkins, GitLab CI, or any other system where you fully own the control plane.&lt;/p&gt;
&lt;p&gt;But you don’t get GitHub Actions “for free” just because the CPU cycles are yours.&lt;/p&gt;
&lt;h2&gt;&lt;strong&gt;Vendor Lock-In? Yes. And Everyone Chose It.&lt;/strong&gt;&lt;/h2&gt;
&lt;p&gt;Some people are suddenly discovering that GitHub Actions can lead to vendor lock-in.&lt;/p&gt;
&lt;p&gt;That’s… not new.&lt;/p&gt;
&lt;p&gt;GitHub Actions is a GitHub App deeply embedded in the GitHub ecosystem. YAML workflows, permissions, APIs, events. The lock-in was the trade-off for convenience, reliability, and speed of adoption.&lt;/p&gt;
&lt;p&gt;And let’s be honest: most teams are perfectly happy with vendor lock-in (right up until pricing becomes visible). You can’t have a deeply integrated platform &lt;strong&gt;and&lt;/strong&gt; complain when the platform prices itself like one.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/images/blog/33178faf-02e9-447c-9284-33371515b8c8_1376x864.png&quot; alt=&quot;Illustration of vendor lock-in as a trade-off for platform convenience&quot; /&gt;&lt;/p&gt;
&lt;h2&gt;&lt;strong&gt;The Real Problem: CI Cost Arbitrage&lt;/strong&gt;&lt;/h2&gt;
&lt;p&gt;The real pain isn’t for users. It’s for companies whose business model is essentially:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;“We’ll run GitHub Actions cheaper than GitHub.”&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;That model was always fragile. If you are:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;buying cloud compute from AWS, GCP, or another provider&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;reselling CI minutes&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;competing on price against &lt;strong&gt;Microsoft + Azure&lt;/strong&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;You are not competing on technology. You are competing on &lt;strong&gt;arbitrage&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;And arbitrage disappears the moment the platform owner decides to price closer to cost, or decides they don’t want that game played anymore.&lt;/p&gt;
&lt;p&gt;This is not new. This is how platforms work.&lt;/p&gt;
&lt;h2&gt;&lt;strong&gt;When Self-Hosting Still Makes Sense&lt;/strong&gt;&lt;/h2&gt;
&lt;p&gt;Self-hosted runners absolutely still make sense when:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;you are very large&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;you have predictable workloads&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;you already operate infra at scale&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;growth is slow, and margin optimization matters more than velocity&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In other words, when infrastructure &lt;em&gt;is&lt;/em&gt; your business or a stable internal cost.&lt;/p&gt;
&lt;p&gt;But for growing startups, optimizing CI costs too early is usually a mistake. Time spent shaving a few cents off a CI minute is time not spent shipping product.&lt;/p&gt;
&lt;p&gt;(And yes: &lt;a href=&quot;https://julien.danjou.info/p/why-engineers-shouldnt-decide-your&quot;&gt;this is precisely why engineers should not be the sole decision-makers on infra strategy&lt;/a&gt;.)&lt;/p&gt;
&lt;h2&gt;&lt;strong&gt;What GitHub’s Pause Actually Signals&lt;/strong&gt;&lt;/h2&gt;
&lt;p&gt;GitHub’s follow-up message is important:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;They acknowledged real platform costs&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;They admitted poor communication&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;They paused to listen, not to abandon the direction&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This is not GitHub “giving up.” It’s GitHub realizing that CI/CD has become &lt;strong&gt;critical infrastructure&lt;/strong&gt;, and changes must be introduced with more ecosystem buy-in.&lt;/p&gt;
&lt;p&gt;Hosted runners still get cheaper. Actions is still being positioned as a core execution layer (including for agentic workloads). The platform direction hasn’t changed.&lt;/p&gt;
&lt;p&gt;Only the timeline has.&lt;/p&gt;
&lt;h2&gt;&lt;strong&gt;The Takeaway&lt;/strong&gt;&lt;/h2&gt;
&lt;p&gt;GitHub Actions isn’t “turning evil.” It’s finishing its transition from a feature to a platform.&lt;/p&gt;
&lt;p&gt;If your CI strategy depends on GitHub never charging for orchestration, scheduling, and reliability, that was never a safe assumption.&lt;/p&gt;
&lt;p&gt;And if your business depends on undercutting a hyperscaler on compute, you were always racing the clock.&lt;/p&gt;
&lt;p&gt;For everyone else, this remains mostly good news:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;clearer economics&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;cheaper hosted runners&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;a stronger, more explicit platform contract&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;And a reminder that CI/CD is not just about cost. It&apos;s about leverage — whether that&apos;s &lt;a href=&quot;https://julien.danjou.info/blog/the-challenges-of-merge-queues&quot;&gt;merge queues that keep your main branch green&lt;/a&gt; or cheaper runners that let you ship faster.&lt;/p&gt;
</content:encoded></item><item><title>The Future Is Being Built Elsewhere</title><link>https://julien.danjou.info/blog/the-future-is-being-built-elsewhere/</link><guid isPermaLink="true">https://julien.danjou.info/blog/the-future-is-being-built-elsewhere/</guid><description>Why I’m worried and why founders can’t afford to wait for Europe to wake up.</description><pubDate>Wed, 26 Nov 2025 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;I read &lt;a href=&quot;https://blog.separateconcerns.com/2025-11-21-inexorable-progress.html&quot;&gt;Pierre Chapuis’ post&lt;/a&gt; &lt;em&gt;&lt;a href=&quot;https://blog.separateconcerns.com/2025-11-21-inexorable-progress.html&quot;&gt;Inexorable Progress&lt;/a&gt;&lt;/em&gt; last week, and a line stuck with me:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;“You cannot stop the flow of progress. You can only decide to be an innovator, an early adopter, or a laggard.”&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;He’s right. And if you work in tech in Europe, you feel it every day: in the conversations, in the pace, in the mindset, in the decisions people around you consider “reasonable.”&lt;/p&gt;
&lt;p&gt;I live in France. I build a global product. I talk to US companies daily. And honestly?&lt;/p&gt;
&lt;p&gt;I’m worried too.&lt;/p&gt;
&lt;p&gt;Not because we lack talent. We don’t.&lt;/p&gt;
&lt;p&gt;Not because we lack engineers. We don’t.&lt;/p&gt;
&lt;p&gt;But because we lack the mental model required to compete in the world we’re entering. And the gap is accelerating.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/images/blog/bb60df15-61dd-4dc5-8c10-fe3783d919ac_1376x864.webp&quot; alt=&quot;Illustration of the growing technology gap between Europe and the US and China&quot; /&gt;&lt;/p&gt;
&lt;h2&gt;&lt;strong&gt;We Think We’re in the Same Race. We’re Not.&lt;/strong&gt;&lt;/h2&gt;
&lt;p&gt;When I look at what’s happening in the US and China in AI, SaaS, robotics, automation… it feels like watching a different timeline.&lt;/p&gt;
&lt;p&gt;They’re scaling models that can refactor codebases.&lt;/p&gt;
&lt;p&gt;They’re shipping companies that go from idea to revenue in weeks.&lt;/p&gt;
&lt;p&gt;They’re pushing robotics into homes.&lt;/p&gt;
&lt;p&gt;They’re pouring capital at a pace that dwarfs what Europe raises in a quarter.&lt;/p&gt;
&lt;p&gt;Meanwhile, in France:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;We think regulation is a moat.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;We believe “solving the French market” is a global strategy.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;We look at the US and assume “we’ll catch up later.”&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;We treat AI like a temporary trend we can ignore until it stabilizes.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This isn&apos;t a mindset gap. It&apos;s a timeline gap. &lt;a href=&quot;https://julien.danjou.info/blog/ai-feels-like-1999-all-over-again&quot;&gt;AI feels like 1999 all over again&lt;/a&gt; — the behavioral divide between adopters and holdouts is already compounding. And Europe is overwhelmingly on the wrong side.&lt;/p&gt;
&lt;p&gt;Europe is acting like it has &lt;em&gt;time&lt;/em&gt;. It doesn&apos;t.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/images/blog/bc31cdd9-323f-4c0d-990a-f503c1d9ff87_1376x864.png&quot; alt=&quot;Illustration of Europe falling behind in the global tech race&quot; /&gt;&lt;/p&gt;
&lt;h2&gt;&lt;strong&gt;The Most Dangerous Bias: Thinking France Is the World&lt;/strong&gt;&lt;/h2&gt;
&lt;p&gt;When I hear founders say:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;“We’ll win the French market first.”&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I always think: &lt;em&gt;France is 0.8% of the world’s population.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;0.8%.&lt;/p&gt;
&lt;p&gt;China is 20× bigger. The US tech market is 10× bigger.&lt;/p&gt;
&lt;p&gt;The next wave of software will not be built for 0.8%.&lt;/p&gt;
&lt;p&gt;If your plan is to build only for France, culturally, financially, technically, you’ve already chosen to lose.&lt;/p&gt;
&lt;p&gt;Not because you’re bad.&lt;/p&gt;
&lt;p&gt;But because you’re playing a local game while everyone else is playing planetary-scale chess.&lt;/p&gt;
&lt;h2&gt;&lt;strong&gt;The Mindset Problem Nobody Talks About&lt;/strong&gt;&lt;/h2&gt;
&lt;p&gt;Here’s the part that founders and engineers will immediately recognize:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Most people here fundamentally don’t understand ROI, time, capital, or scale.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;They understand tasks. They understand constraints. They understand regulation.&lt;/p&gt;
&lt;p&gt;But they don’t understand leverage.&lt;/p&gt;
&lt;p&gt;They want to “optimize costs” when the problem is growth.&lt;/p&gt;
&lt;p&gt;They want to “avoid risk” when the problem is irrelevance.&lt;/p&gt;
&lt;p&gt;They want to “comply first” when the problem is competing at all.&lt;/p&gt;
&lt;p&gt;This is why hiring is harder here and product velocity slower. It’s why teams hesitate on AI adoption, just as they did with cloud in 2008.&lt;/p&gt;
&lt;p&gt;It’s not a technology gap.&lt;/p&gt;
&lt;p&gt;It’s a worldview gap.&lt;/p&gt;
&lt;h2&gt;&lt;strong&gt;We’re Living Like a Rich Country Without Creating Enough Value&lt;/strong&gt;&lt;/h2&gt;
&lt;p&gt;This part is uncomfortable, but founders feel it viscerally.&lt;/p&gt;
&lt;p&gt;For 50 years, France has lived on increasing debt and the assumption that we can keep funding our lifestyle without producing equivalent value.&lt;/p&gt;
&lt;p&gt;But look at our major industries:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Our car industry is fighting Brussels just to be allowed to sell pollution past 2035, not to compete.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Our energy leadership was squandered by 20 years of indecision.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Our tech ecosystem celebrates being five years behind the US, as long as it’s “sovereign.”&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If we stop exporting cars, software, tech, heavy industry, how do we pay for everything?&lt;/p&gt;
&lt;p&gt;How do we fund innovation? How do we stay competitive?&lt;/p&gt;
&lt;p&gt;We don’t.&lt;/p&gt;
&lt;p&gt;We shrink.&lt;/p&gt;
&lt;p&gt;We tax more.&lt;/p&gt;
&lt;p&gt;We lose ground.&lt;/p&gt;
&lt;p&gt;And we pretend everything is fine.&lt;/p&gt;
&lt;h2&gt;&lt;strong&gt;Dropping Out of the Race Isn’t Ethical — It’s Surrender&lt;/strong&gt;&lt;/h2&gt;
&lt;p&gt;When Pierre wrote:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;“If you slow down, you are simply letting those who do not care about these issues in the first place win.”&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;That hit me hard.&lt;/p&gt;
&lt;p&gt;Because this is the mindset I see too often in Europe:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;“We shouldn’t build this.”&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;&lt;em&gt;“We should regulate it.”&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;&lt;em&gt;“We should wait until we’re sure.”&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;&lt;em&gt;“We should be cautious.”&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Caution is fine.&lt;/p&gt;
&lt;p&gt;Except when you’re in a race you didn’t choose but cannot opt out of. You don’t get to be “ethical” by refusing to play.&lt;/p&gt;
&lt;p&gt;You just hand the steering wheel of the future to people who don’t share your ethics.&lt;/p&gt;
&lt;h2&gt;&lt;strong&gt;What Founders Should Take Away&lt;/strong&gt;&lt;/h2&gt;
&lt;p&gt;I’m not writing this to spread doom. I’m writing this because founders and engineers need to hear one thing:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Build globally. Don’t wait for permission.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;The world is not waiting for Europe to catch up. The next decade will be brutal for anyone playing local games. Whether we like it or not, the next wave of innovation will be:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;AI-native&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;global from day one&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;capital-efficient&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;ruthlessly fast&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;engineered by people who want to win, not just exist&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;And we can be part of that — if we choose to.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/images/blog/7410d5e4-77e8-43ae-bd83-6cb4385e7888_1376x864.png&quot; alt=&quot;Illustration of founders building globally without waiting for permission&quot; /&gt;&lt;/p&gt;
&lt;h2&gt;&lt;strong&gt;Closing&lt;/strong&gt;&lt;/h2&gt;
&lt;p&gt;I love France. I live here. My kids grow up here. But love doesn’t blind me.&lt;/p&gt;
&lt;p&gt;I see the same thing Pierre sees:&lt;/p&gt;
&lt;p&gt;A continent with world-class talent… and a mindset preventing it from playing the actual game.&lt;/p&gt;
&lt;p&gt;I don’t have the solutions. But I see the problems clearly. And as entrepreneurs, our best chance isn’t waiting for a savior.&lt;/p&gt;
&lt;p&gt;It’s building, ambitiously, globally, unapologetically, before the gap becomes irreversible. Because the world is moving.&lt;/p&gt;
&lt;p&gt;And this time, if we hesitate, we’ll be spectators. Not players.&lt;/p&gt;
</content:encoded></item><item><title>AI feels like 1999 all over again</title><link>https://julien.danjou.info/blog/ai-feels-like-1999-all-over-again/</link><guid isPermaLink="true">https://julien.danjou.info/blog/ai-feels-like-1999-all-over-again/</guid><description>AI feels like 1999 all over again</description><pubDate>Thu, 06 Nov 2025 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Last week, I spent two days with an old friend. We’ve known each other for fifteen years. He’s curious, a bit of a geek, but not “in tech.” He doesn’t use GPT. His wife doesn’t either. They’ve heard of AI the way you hear of a new restaurant: name recognition, no bookings.&lt;/p&gt;
&lt;p&gt;We talked, we cooked, we compared notes on work. At some point, I realized we were living on different planets. Not values. &lt;em&gt;Toolchains.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;He does great work. But AI just… isn’t part of his day. Meanwhile, I use it constantly: as a writing partner for emails, a sounding board for product decisions, a junior PM, a marketing intern who never sleeps. It’s not magic. It’s just leverage. And it reminds me of when I got internet access twenty-five years ago and people said, “Why would you need that every day?”&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/images/blog/fd150c11-ef6a-4d93-9030-6b450cd31166_2752x1728.png&quot; alt=&quot;Illustration comparing AI adoption today to early internet adoption in 1999&quot; /&gt;&lt;/p&gt;
&lt;p&gt;Twenty-five years later, we answer that question by reflex, usually from a phone.&lt;/p&gt;
&lt;p&gt;I don’t say this to flex. I say it because the gap is already visible.&lt;/p&gt;
&lt;p&gt;If I compare my work today to two years ago, I’m doing two to three times more with better output. Same hours. Less context switching. I can hold more of Mergify’s product in my head, ship faster, and still write the marketing we used to split across two people. I wouldn’t claim I replace a whole team (let’s keep our illusions calibrated) but one founder plus AI now feels like one founder plus a sharp apprentice who learns absurdly fast.&lt;/p&gt;
&lt;p&gt;And I’m still only scratching the surface. There are tasks I &lt;em&gt;should&lt;/em&gt; automate that I haven’t, because of the classic XKCD curve: spending an hour to save a minute. The ROI is real; the overhead is too. It will get smoothed out, like everything else that starts out lumpy.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/images/blog/19bd79ef-85b2-4abe-aa9a-4ff32df327d0_550x230.png&quot; alt=&quot;XKCD 974 comic about the time trade-off of automating tasks&quot; /&gt;
&lt;em&gt;XKCD 974&lt;/em&gt;&lt;/p&gt;
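&lt;p&gt;The trade-off is easy to make concrete with a little arithmetic. A minimal sketch (the numbers here are illustrative, not from any real workflow):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;def automation_payoff(build_minutes, saved_per_run, runs_per_week, horizon_weeks):
    &quot;&quot;&quot;Net minutes gained (positive) or lost (negative) over the horizon.&quot;&quot;&quot;
    return saved_per_run * runs_per_week * horizon_weeks - build_minutes

# Spending an hour to save a minute, once a week, over a year:
print(automation_payoff(60, 1, 1, 52))   # -8: still underwater after a year

# The same hour spent on a task you hit 25 times a week:
print(automation_payoff(60, 1, 25, 52))  # 1240: pays for itself many times over
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Same hour of setup; the frequency of the task decides whether it was leverage or a detour.&lt;/p&gt;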
&lt;p&gt;What’s striking is not just the productivity jump. It’s the new &lt;strong&gt;behavioral divide&lt;/strong&gt;. Twenty years ago the divide was access: who had broadband and who didn’t. Today the divide is adoption: who’s willing to put these systems in the loop every day, and who keeps them at arm’s length.&lt;/p&gt;
&lt;p&gt;Same laptop. Same calendar. Wildly different output.&lt;/p&gt;
&lt;p&gt;This isn’t about “AI replacing jobs.” It’s about &lt;strong&gt;AI reorganizing work&lt;/strong&gt; around people who are willing to collaborate with it. The difference between “I don’t see the point” and “this is in my daily loop” already compounds in quiet ways:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;The email you write in 7 minutes instead of 27.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;The product spec with five explored options instead of two.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;The marketing page that results from testing three angles instead of arguing for one.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;The code you ship because the blank page wasn’t blank.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Multiply that by days, then by years. That is how careers and companies diverge. And once AI starts creating content at scale, not just assisting — &lt;a href=&quot;https://julien.danjou.info/blog/the-synthetic-wave-is-already-here&quot;&gt;the synthetic wave is already here&lt;/a&gt; — the gap widens even faster.&lt;/p&gt;
&lt;p&gt;Of course, there are limits. AI isn’t judgment. It won’t hold your ethics, defend your taste, or choose your strategy. You still have to decide what “good” means, define constraints, and call the trade-offs. If you outsource your thinking, you don’t get leverage: you get noise.&lt;/p&gt;
&lt;p&gt;But if you keep the steering wheel, the car is very fast.&lt;/p&gt;
&lt;p&gt;There’s also a cultural point I didn’t expect: &lt;strong&gt;the stigma of using help&lt;/strong&gt;. Some people still think “real work” means doing everything yourself. Same energy as hand-writing HTML in 2003 to prove you’re serious.&lt;/p&gt;
&lt;p&gt;This reminds me of &lt;a href=&quot;https://2lr.substack.com/p/vive-la-france-long-live-the-us&quot;&gt;the latest post from Jean de La Rochebrochard&lt;/a&gt; where he talked about how French people are all about &lt;em&gt;crafting&lt;/em&gt;. No wonder AI adoption is going to be a long road here.&lt;/p&gt;
&lt;p&gt;But the craft isn’t in suffering; it’s in outcomes. Tools are honest if your goals are.&lt;/p&gt;
&lt;p&gt;I don’t know precisely what the next twenty years look like. I do see the pattern. Early on, new technology looks optional, even irrelevant. Then someone quietly uses it to do three times more with the same time. Then we call it table stakes. The people who adopted early won’t be smarter; they’ll just have trained their reflexes sooner.&lt;/p&gt;
&lt;p&gt;If you’re already all-in, you don’t need my sermon. If you’re AI-curious but unconvinced, try this: pick one workflow that hurts: a weekly email, a product spec, a marketing outline. Put an AI in the loop for a week. Not as a demo. As a colleague. Give it context. Ask for alternatives, not answers. Keep the steering wheel.&lt;/p&gt;
&lt;p&gt;If after seven days it doesn’t save you time &lt;em&gt;and&lt;/em&gt; improve your output, fine: ignore it for another year. But my bet is you’ll feel the old dial-up-to-broadband moment: once you touch the speed, it’s hard to go back.&lt;/p&gt;
&lt;p&gt;Back in Toulouse, my friend and I didn’t resolve anything. We just noticed the split. Same age. Same curiosity. Different daily habits. Twenty-five years ago the web felt optional right up until it didn’t. I think we’re there again. The storm isn’t coming. We’re already in the rain. You can stay dry for a while. Or you can learn to dance in it.&lt;/p&gt;
</content:encoded></item><item><title>42 Lessons at 42</title><link>https://julien.danjou.info/blog/42-lessons-at-42/</link><guid isPermaLink="true">https://julien.danjou.info/blog/42-lessons-at-42/</guid><description>Reflections on building, leading, learning, and staying sane along the way.</description><pubDate>Fri, 24 Oct 2025 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Turning 42 felt like a good time to pause and reflect.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/images/blog/8e725e25-cab4-4bc3-a763-0605db15e157_1356x837.webp&quot; alt=&quot;Illustration of 42 lessons learned after 42 years of life&quot; /&gt;&lt;/p&gt;
&lt;p&gt;Not on milestones or ARR charts. But on what actually matters after two decades of building, shipping, leading, and living.&lt;/p&gt;
&lt;p&gt;So if 42 is the answer to everything, here’s what I’ve learned so far 👇&lt;/p&gt;
&lt;h2&gt;&lt;strong&gt;🧍‍♂️ Life, growth &amp;amp; happiness&lt;/strong&gt;&lt;/h2&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;The definition of happiness changes with age. And that’s okay.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;The more you learn, the more you realize how little you know.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Curiosity is infinite; boredom is optional.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;You can’t optimize life like you optimize code.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;If you chase success, it runs faster. If you enjoy the run, it slows down.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Peace beats growth. Every time.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Wealth is freedom, not Ferraris.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Desire is a contract to be unhappy until you get what you want. Cancel it often.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Run more. Read more. Scroll less.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Sleep is the most underrated productivity hack on Earth.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h2&gt;&lt;strong&gt;💼 Building, leading &amp;amp; learning&lt;/strong&gt;&lt;/h2&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;Startups are people, not ideas.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Leading by example is the hardest thing you’ll ever do.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Leadership is 80% conversations you didn’t want to have.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Hire slow. Fire fast. Then sleep well.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Focus is a superpower. Protect it like oxygen.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;The first job of a leader is clarity. The second is repetition.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Nobody cares how much you know; they care how you make them feel.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Feedback given late is just resentment with a bow on it.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;The toughest skill for a CEO to master is knowing when to stay silent.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;You can’t delegate judgment.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h2&gt;&lt;strong&gt;⚙️ Building products &amp;amp; companies&lt;/strong&gt;&lt;/h2&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;Quality compounds like interest. So does mediocrity. (&lt;a href=&quot;https://julien.danjou.info/blog/why-we-still-care-about-quality&quot;&gt;Why we still care about quality&lt;/a&gt;.)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Success is rarely a pivot; it’s iteration with taste.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Great engineers fix root causes. Average ones fix symptoms.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;The perfect product doesn’t exist: the one you can evolve does.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Simplicity scales. Complexity invoices.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Don’t chase vanity metrics: chase user trust.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Code reviews are easy. People reviews aren’t.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Build systems, not heroics.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;The best feature is one you can delete later.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Tech debt is fine. Moral debt isn’t.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h2&gt;&lt;strong&gt;🧩 Mindset &amp;amp; philosophy&lt;/strong&gt;&lt;/h2&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;The only person you need to impress is your future self.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Gratitude is the antidote to burnout.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Meditation isn’t for everyone. Long runs count.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;You can be kind and still have high standards.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;It’s okay to be wrong, just not stubborn.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;The right “no” beats a thousand “maybes.”&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Compromise kills clarity. Choose, don’t blend.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Every problem looks smaller after a good meal.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Don’t overthink purpose. Build, love, repeat.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Curiosity never retires.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h2&gt;&lt;strong&gt;❤️ Family, friends &amp;amp; the long game&lt;/strong&gt;&lt;/h2&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;Family first. Always. The best way to disconnect is to be needed by your kids.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;You can’t buy time, but you can decide how to spend it.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;At 42, I’ve realized something simple but not easy:&lt;/p&gt;
&lt;p&gt;Life isn’t a sprint to a finish line. It’s an ongoing debugging session. You fix a few things, break a few others, and hopefully push a slightly better version of yourself every year.&lt;/p&gt;
&lt;p&gt;The same applies to building a company, raising children, or simply trying to be a decent person. You never “ship it” and move on. You iterate. You learn. You acknowledge that the next release will likely also contain bugs. And that’s fine.&lt;/p&gt;
&lt;p&gt;For a long time, I thought success meant achieving things: a title, a product, a number in the bank, a milestone on a slide deck. But the more I live, the more I see that &lt;em&gt;peace&lt;/em&gt; is the real metric. The ability to wake up excited, to solve interesting problems, to spend time with people you love, and to go to bed proud.&lt;/p&gt;
&lt;p&gt;That’s it.&lt;/p&gt;
&lt;p&gt;At 20, I wanted to build great code.&lt;/p&gt;
&lt;p&gt;At 30, I wanted to build great products.&lt;/p&gt;
&lt;p&gt;At 40, I just want to build a great life.&lt;/p&gt;
&lt;p&gt;And the funny thing?&lt;/p&gt;
&lt;p&gt;It still takes the same skills: focus, iteration, and knowing when to stop optimizing.&lt;/p&gt;
&lt;p&gt;So if there’s one thing I’d tell my younger self, it’s this:&lt;/p&gt;
&lt;p&gt;Keep learning, keep building, but remember that there’s no final commit.&lt;/p&gt;
&lt;p&gt;You don’t get to “win” life. You just get to live it. Version by version.&lt;/p&gt;
</content:encoded></item><item><title>Building Features One Prompt at a Time</title><link>https://julien.danjou.info/blog/vibe-coding-a-feature-with-ai/</link><guid isPermaLink="true">https://julien.danjou.info/blog/vibe-coding-a-feature-with-ai/</guid><description>How I built Mergify’s new autoqueue in less than an hour a day </description><pubDate>Tue, 26 Aug 2025 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;A few weeks ago, we released a new feature at Mergify: &lt;strong&gt;&lt;a href=&quot;https://changelog.mergify.com/changelog/autoqueue-option-for-queue-rules&quot;&gt;autoqueue&lt;/a&gt;&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;It automatically adds pull requests into the merge queue when they’re ready. No more custom automation rules, no more fiddling with YAML — it just works, straight from the merge queue settings.&lt;/p&gt;
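&lt;p&gt;For context, the option described above lives in the merge queue configuration. Based on the linked changelog, a minimal &lt;code&gt;queue_rules&lt;/code&gt; entry might look roughly like this; treat it as a sketch, and note the CI check name is a placeholder, not something from this post:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-yaml&quot;&gt;queue_rules:
  - name: default
    # Enqueue pull requests automatically once they are ready,
    # instead of wiring up a custom pull_request_rules action.
    autoqueue: true
    merge_conditions:
      - check-success=ci  # hypothetical CI check name
&lt;/code&gt;&lt;/pre&gt;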
&lt;p&gt;Here’s the kicker: I coded it.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/images/blog/c43c313d-fbb9-4d8e-b129-c9c5345667c0_1144x577.png&quot; alt=&quot;Screenshot of the Mergify autoqueue feature settings&quot; /&gt;&lt;/p&gt;
&lt;p&gt;Yes, me. The CEO. The guy who hasn’t touched production code in years. The guy who usually spends his days on calls, not in GitHub.&lt;/p&gt;
&lt;p&gt;And I did it in less than an hour a day, over three weeks, with the help of AI.&lt;/p&gt;
&lt;h2&gt;&lt;strong&gt;Why I Even Tried This&lt;/strong&gt;&lt;/h2&gt;
&lt;p&gt;I’ve used Copilot casually before (mostly autocomplete in Emacs), but this time I wanted to &lt;strong&gt;go all-in&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;Why? Curiosity, mostly. And time constraints. As a CEO, I have close to zero time to code, and this feature wasn’t urgent. So I thought: why not see what happens if I vibe-code it with AI?&lt;/p&gt;
&lt;h2&gt;&lt;strong&gt;How It Worked&lt;/strong&gt;&lt;/h2&gt;
&lt;p&gt;The way I interacted with Claude 4 via GitHub Copilot was simple: I explained the feature like I’d explain it to my team in a product story. I added some technical constraints (“use unit tests, not functional ones”).&lt;/p&gt;
&lt;p&gt;Then I let the AI go.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/images/blog/f4d4604e-a9c8-4638-a8c1-4eaddc7f2681_1376x864.webp&quot; alt=&quot;Illustration of coding with AI assistance, like coding blindfolded&quot; /&gt;
&lt;em&gt;It just felt like coding blindfolded.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;It wrote the code. I tweaked less than 5% of it. Once it was done, I sent it for review. I pasted my coworkers’ review feedback back into it. It rewrote. I guided. It iterated.&lt;/p&gt;
&lt;p&gt;Did it nail it on the first try? No. Sometimes it forgot instructions. Sometimes it “lost context” after a few iterations and tried to reinvent the test setup it had already learned. That was frustrating — like explaining to a junior dev, except this junior dev has goldfish memory.&lt;/p&gt;
&lt;p&gt;But eventually, it worked. The code was merged. Released. In production. Done.&lt;/p&gt;
&lt;h2&gt;&lt;strong&gt;What Surprised Me&lt;/strong&gt;&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;I only changed about &lt;strong&gt;5% of the lines&lt;/strong&gt; myself.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Nobody on the team noticed it was “AI-coded.”&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;It handled six years of legacy code surprisingly well.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Two years ago this wouldn’t have been possible — the progress is insane.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;&lt;strong&gt;What It Means&lt;/strong&gt;&lt;/h2&gt;
&lt;p&gt;This isn’t about me playing engineer again for nostalgia. It’s about what’s coming.&lt;/p&gt;
&lt;p&gt;The quality and quantity bar is about to rise dramatically. AI isn’t just autocomplete anymore; it’s &lt;em&gt;co-construction&lt;/em&gt;.&lt;/p&gt;
&lt;p&gt;You can ship faster. You can tackle features you don&apos;t fully understand at the start. You can guide at a high level and let the AI grind the details. A few months later, I took this even further — to the point where &lt;a href=&quot;https://julien.danjou.info/blog/so-i-will-never-write-code-again&quot;&gt;I stopped writing code entirely&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;But it also raises new challenges. For instance:&lt;/p&gt;
&lt;p&gt;How do juniors review AI-generated PRs?&lt;/p&gt;
&lt;p&gt;How do teams trust code written by something that forgets your instructions after 10 turns?&lt;/p&gt;
&lt;p&gt;(That’s probably another blog post.)&lt;/p&gt;
&lt;p&gt;For now, though, I’ll just say this:&lt;/p&gt;
&lt;p&gt;I vibe-coded a real feature into existence in less than an hour a day.&lt;/p&gt;
&lt;p&gt;It felt like cheating. And I’m amazed.&lt;/p&gt;
</content:encoded></item><item><title>The Em Dash Is Dead</title><link>https://julien.danjou.info/blog/the-em-dash-is-dead/</link><guid isPermaLink="true">https://julien.danjou.info/blog/the-em-dash-is-dead/</guid><description>And I Might Have Killed It</description><pubDate>Tue, 05 Aug 2025 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;I’ve always loved the em dash. It’s elegant. It’s useful. It lets you breathe in your writing—without having to deal with commas or (God forbid) parentheses.&lt;/p&gt;
&lt;p&gt;Ten years ago, I wrote a book. A real book. With my hands.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/images/blog/4331ec9c-0180-4764-83c5-82b487b55dbb_373x464.png&quot; alt=&quot;Cover of Serious Python book&quot; /&gt;
&lt;em&gt;Serious Python&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;Over 68,000 words—and 77 beautiful em dashes.&lt;/p&gt;
&lt;p&gt;I wasn’t counting then—only recently did I check. You know, just to see how robotic I might’ve accidentally been.&lt;/p&gt;
&lt;p&gt;Because now? Now the em dash is a red flag.&lt;/p&gt;
&lt;p&gt;A decade ago, it was just a punctuation mark. Today, it’s basically a biometric marker for ChatGPT. Type an em dash on the internet in 2025, and someone will immediately side-eye your prose like you’re a prompt engineer trying to slip one past them.&lt;/p&gt;
&lt;p&gt;“Nice try, OpenAI.”&lt;/p&gt;
&lt;p&gt;Somehow, without even trying, I joined the ranks of the suspicious. My past self—the one tapping away joyfully, dashing away without care—was unknowingly building a future case against me.&lt;/p&gt;
&lt;p&gt;So here I am. A human. Who’s written thousands of human words. Who once thought the em dash was peak form—and now has to ask:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Am I even allowed to use it anymore?&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;The tragedy is this: AI didn&apos;t invent the em dash. &lt;em&gt;We&lt;/em&gt; gave it the em dash. We trained it on our books, our blog posts, our essays. We fed it so much em dash-laced content that now it thinks it&apos;s just what humans do. And to be fair… it &lt;em&gt;was&lt;/em&gt;. It&apos;s just one more way AI is reshaping how we communicate — and as &lt;a href=&quot;https://julien.danjou.info/blog/the-collapse-of-social-platforms&quot;&gt;social platforms collapse&lt;/a&gt; under synthetic content, even punctuation becomes a trust signal.&lt;/p&gt;
&lt;p&gt;Now, AI refuses to stop.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://medium.com/@brentcsutoras/the-em-dash-dilemma-how-a-punctuation-mark-became-ais-stubborn-signature-684fbcc9f559&quot;&gt;You can threaten it, prompt it, scold it—“no more em dashes!”—and two lines later? Bam. Another one.&lt;/a&gt; It’s like trying to get your dog to stop barking at squirrels. It hears you. It just doesn’t care.&lt;/p&gt;
&lt;p&gt;Meanwhile, actual humans are uninstalling their em dash keyboard shortcuts. Coders are deleting &lt;code&gt;&amp;amp;mdash;&lt;/code&gt; from their HTML snippets. Writers are rephrasing perfectly good sentences just to avoid looking synthetic.&lt;/p&gt;
&lt;p&gt;We didn’t lose a punctuation mark. We lost a friend.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/images/blog/9eab604c-45bd-4567-82c8-5714f6a8c127_1376x864.webp&quot; alt=&quot;Illustration of the em dash being abandoned by human writers due to AI overuse&quot; /&gt;&lt;/p&gt;
&lt;p&gt;So, if you see an em dash in my writing—don’t panic.&lt;/p&gt;
&lt;p&gt;It’s not a bot. It’s just me. Old-school. Nostalgic. Typing with trembling fingers and a tear in my eye.&lt;/p&gt;
&lt;p&gt;Still human.&lt;/p&gt;
&lt;p&gt;Still grieving.&lt;/p&gt;
&lt;p&gt;Still em-dashing.&lt;/p&gt;
</content:encoded></item><item><title>The Synthetic Wave Is Already Here</title><link>https://julien.danjou.info/blog/the-synthetic-wave-is-already-here/</link><guid isPermaLink="true">https://julien.danjou.info/blog/the-synthetic-wave-is-already-here/</guid><description>How Spotify just confirmed the AI content tsunami I predicted.</description><pubDate>Tue, 29 Jul 2025 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Six months ago, I wrote a blog post titled &lt;em&gt;“&lt;a href=&quot;https://julien.danjou.info/p/the-collapse-of-social-platforms&quot;&gt;The Collapse of Social Platforms&lt;/a&gt;”&lt;/em&gt;. At the time, it felt like a distant horizon — something you could see coming if you squinted into the future.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://www.theguardian.com/technology/2025/jul/14/an-ai-generated-band-got-1m-plays-on-spotify-now-music-insiders-say-listeners-should-be-warned&quot;&gt;Spotify just made headlines for hosting an AI-generated “band”&lt;/a&gt; that racked up over a million plays before anyone realized the artists weren’t real. No humans. No guitars. Just prompts, algorithms, and a good understanding of how to feed the machine what people want to hear.&lt;/p&gt;
&lt;p&gt;And that’s just the beginning.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/images/blog/f48377fe-ee4d-4444-85ae-95ea097789fa_1560x624.png&quot; alt=&quot;Screenshot of the AI-generated band on Spotify with over a million plays&quot; /&gt;
&lt;em&gt;Source&lt;/em&gt;&lt;/p&gt;
&lt;h2&gt;&lt;strong&gt;AI is Creating — Not Assisting&lt;/strong&gt;&lt;/h2&gt;
&lt;p&gt;Back then, I wrote that we were moving past “AI-assisted” content into “AI-native” creation. At the time, it might have sounded like theory. Now, we’ve entered the &lt;em&gt;Spotify Phase&lt;/em&gt;: platforms no longer just recommend content — they &lt;strong&gt;create it&lt;/strong&gt;. They don’t need to wait for artists to upload music. They can fill the catalog themselves.&lt;/p&gt;
&lt;p&gt;And they will.&lt;/p&gt;
&lt;p&gt;Because the economics are too good, the data feedback loops are too tight, and the audience — most importantly — doesn’t seem to care.&lt;/p&gt;
&lt;h2&gt;&lt;strong&gt;The Illusion of Authenticity is Enough&lt;/strong&gt;&lt;/h2&gt;
&lt;p&gt;Spotify didn’t advertise the AI band. It was just another artist profile. People listened. They added songs to playlists. They vibed. It was only after the fact — after journalists started poking around — that we learned the truth.&lt;/p&gt;
&lt;p&gt;And you know what? Most listeners &lt;em&gt;still don’t care&lt;/em&gt;.&lt;/p&gt;
&lt;p&gt;Which proves my original point: we’re not as attached to the &lt;em&gt;source&lt;/em&gt; of content as we think we are. We just want something that feels good, fits our mood, and plays seamlessly into our day. If that comes from a human or an LLM fine-tuned on hit-making formulas… who’s checking?&lt;/p&gt;
&lt;p&gt;This is the uncanny shift: content is becoming pure simulation. And for most, it’s indistinguishable from the real thing.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/images/blog/81f81a63-5596-478a-8936-a4b0e961a236_1376x864.png&quot; alt=&quot;Illustration of synthetic content becoming indistinguishable from real&quot; /&gt;&lt;/p&gt;
&lt;h2&gt;&lt;strong&gt;Platforms are Optimizing Away People&lt;/strong&gt;&lt;/h2&gt;
&lt;p&gt;Spotify’s move is not an isolated event. It’s the canary in the coal mine for every content platform out there.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Why wait for a podcast to be recorded when you can prompt one into existence?&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Why pay creators when you can generate infinite variations?&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Why host unpredictable humans when you can manufacture predictable engagement?&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;From AI-generated OnlyFans personas to YouTube clones to fake influencers on Instagram, we’re entering a phase where content isn’t created &lt;em&gt;by&lt;/em&gt; people — it’s created &lt;em&gt;for&lt;/em&gt; people by machines pretending to be people.&lt;/p&gt;
&lt;p&gt;It’s not a dystopia. It’s just a business decision.&lt;/p&gt;
&lt;h2&gt;&lt;strong&gt;So Where does This Leave Us?&lt;/strong&gt;&lt;/h2&gt;
&lt;p&gt;If you’re a creator: The value of “real” is shifting. It may no longer be about production quality — but about human connection. Your face, your voice, your story might become the only proof-of-humanity people care about. Ironically, the more polished your content looks, the more people might question if &lt;em&gt;you&lt;/em&gt; made it.&lt;/p&gt;
&lt;p&gt;If you&apos;re a platform: Congratulations, you&apos;re entering the golden age of AI-powered margins. But beware the erosion of trust. Once users start doubting whether &lt;em&gt;anyone&lt;/em&gt; on your platform is real, &lt;a href=&quot;https://julien.danjou.info/blog/the-collapse-of-social-platforms&quot;&gt;the social glue breaks down fast&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;If you’re a user: Good luck. You’re about to be bombarded with synthetic everything. And the biggest risk isn’t being tricked — it’s not caring anymore whether what you’re consuming is real or not.&lt;/p&gt;
&lt;p&gt;That’s when the simulation wins.&lt;/p&gt;
&lt;h2&gt;&lt;strong&gt;A Prediction, Revisited&lt;/strong&gt;&lt;/h2&gt;
&lt;p&gt;In that original post, I wrote:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;“Real life will be the only place you’ll have left to interact with real humans.”&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I stand by it — even more today. The value of the human connection will rise in proportion to how rare it becomes online. Coffee with a friend. A live concert. A hand-written letter. These may become the luxury goods of the 2030s.&lt;/p&gt;
&lt;p&gt;So yes, the synthetic wave is here. But maybe that’s what we needed — a reason to remember what being human online really means.&lt;/p&gt;
&lt;p&gt;Until then: keep your eyes open, your ears sharp, and maybe… spend a little more time offline.&lt;/p&gt;
</content:encoded></item><item><title>The Day I Got Custom Table Legs</title><link>https://julien.danjou.info/blog/the-day-i-got-custom-table-legs/</link><guid isPermaLink="true">https://julien.danjou.info/blog/the-day-i-got-custom-table-legs/</guid><description>What It Taught Me About Support</description><pubDate>Tue, 22 Jul 2025 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Last week, I was with my team at our Mergify &lt;em&gt;on-site&lt;/em&gt; — what we call our &lt;em&gt;MAHOS (Mergify All-Hands On-Site)&lt;/em&gt;. Yes, we’re a fully remote team, so the regular off-site is called an on-site for us. 😉&lt;/p&gt;
&lt;p&gt;We talked about the usual: roadmap, strategy, alignment. But then I brought up something a little different. I wanted to explain how we think about customer support at Mergify—not as a checkbox, not as a cost center, but as a way to deliver what I call the &lt;em&gt;&lt;a href=&quot;https://blog.mergify.com/how-we-handle-our-roadmap-for-mergify/&quot;&gt;wow effect&lt;/a&gt;&lt;/em&gt;.&lt;/p&gt;
&lt;p&gt;To make my point, I told them a story. A true one — about a table.&lt;/p&gt;
&lt;p&gt;Two years ago, I had just moved into a new house. My first real garden. I was excited to enjoy it, so I decided to buy a garden table and chairs. After browsing around, I picked &lt;a href=&quot;https://www.lafuma-mobilier.fr/&quot;&gt;Lafuma&lt;/a&gt; — a French brand I’ve liked since I was a kid.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/images/blog/abbcdd6a-d840-42ed-80cf-7395e6ab1524_645x559.webp&quot; alt=&quot;Lafuma garden table product photo&quot; /&gt;
&lt;em&gt;Fun fact: my very first school backpack in first grade was a Lafuma. Yes, I still remember it.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;Anyway, the table and chairs arrived. Great delivery. Good packaging. Quality seemed top-notch.&lt;/p&gt;
&lt;p&gt;But something felt… off.&lt;/p&gt;
&lt;p&gt;I sat down, and it wasn’t right. The proportions felt weird. The table was just a little too high, or the chairs too low. So I did what every curious engineer does: I grabbed a tape measure. Compared it to my indoor table. And there it was — the Lafuma table was exactly 2cm too tall. Just enough to make every meal feel slightly awkward.&lt;/p&gt;
&lt;p&gt;So I wrote them a message. Not angry, not demanding. Just a note saying:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;“Love the brand, love the product, but this feels like a design oversight.”&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I didn’t expect a reply.&lt;/p&gt;
&lt;p&gt;One week later, I got a call. It was someone from Lafuma’s support team — a QA engineer.&lt;/p&gt;
&lt;p&gt;He said:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;“I read your message. I’d like to understand exactly what’s wrong.”&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I explained. He listened. Then, without hesitation, he said:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;“Okay. I’ll send you custom table legs, 2cm shorter. You’ll have them next week.”&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I was stunned.&lt;/p&gt;
&lt;p&gt;“Wait, really? You can do that?”&lt;/p&gt;
&lt;p&gt;“Of course,” he replied. “We have spare legs in the workshop. We’ll just trim and ship them to your size.”&lt;/p&gt;
&lt;p&gt;And that’s exactly what happened. A week later, I swapped out the legs. Perfect fit. Perfect height. Perfect support.&lt;/p&gt;
&lt;p&gt;They didn’t have to do that. I wasn’t going to return the table. I wasn’t even asking for anything. But they did it anyway — because they cared. Because they listened. Because they understood what &lt;em&gt;great&lt;/em&gt; support means.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/images/blog/52932582-97ff-4eda-93d4-56504d05d4e1_3024x3700.jpeg&quot; alt=&quot;Photo of the Lafuma table with custom-shortened legs installed&quot; /&gt;
&lt;em&gt;New leg size approved by my wife.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;That’s the kind of service we try to deliver at &lt;a href=&quot;https://mergify.com&quot;&gt;Mergify&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Even in B2B, even in software, even at scale — you can still surprise people. (&lt;a href=&quot;https://julien.danjou.info/blog/why-we-still-care-about-quality&quot;&gt;Why we still care about quality&lt;/a&gt; is about the same mindset.) You can still make them feel heard. You can still &lt;em&gt;wow&lt;/em&gt; them. That’s what Amazon did so well for years. And it’s what so many companies forget as they grow.&lt;/p&gt;
&lt;p&gt;But it’s not optional. It’s the difference between a satisfied user and a loyal one. Between a customer and a fan.&lt;/p&gt;
&lt;p&gt;Build the table. And send the legs.&lt;/p&gt;
</content:encoded></item><item><title>The Problem with OKRs Isn’t OKRs</title><link>https://julien.danjou.info/blog/the-problem-with-okrs-isnt-okrs/</link><guid isPermaLink="true">https://julien.danjou.info/blog/the-problem-with-okrs-isnt-okrs/</guid><description>Why most teams would be better off with a clear plan than a quarterly ritual.</description><pubDate>Tue, 15 Jul 2025 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;I first encountered OKRs at &lt;a href=&quot;http://datadoghq.com&quot;&gt;Datadog&lt;/a&gt;. They were already in place when I joined — nobody really explained the “why” behind them. You just filled out your section in the shared Google Doc. The company was growing fast, already past 1000 employees. My team was new. We cargo-culted what others were doing.&lt;/p&gt;
&lt;p&gt;In theory, &lt;a href=&quot;https://en.wikipedia.org/wiki/Objectives_and_key_results&quot;&gt;OKRs&lt;/a&gt; are about aligning a team around measurable objectives. You set a direction (“Objective”), define how you’ll know you’ve made progress (“Key Results”), and track it. Simple. Ambitious. Inspiring, even.&lt;/p&gt;
&lt;p&gt;In practice? It was a glorified quarterly to-do list.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/images/blog/7ac2d3e2-63c5-47c5-80b3-8f7a04943c51_1376x864.webp&quot; alt=&quot;Illustration of OKRs becoming a glorified quarterly to-do list&quot; /&gt;&lt;/p&gt;
&lt;p&gt;Management would come down with a spreadsheet of tasks. You could discuss the list, maybe argue your way into trimming it down. But the measure of success wasn’t impact. It wasn’t even delivery velocity. It was binary: did we check the box, yes or no?&lt;/p&gt;
&lt;p&gt;There was no conversation about &lt;em&gt;why&lt;/em&gt; we were doing these things. No attempt to tie work to outcomes like activation, retention, support ticket volume, user satisfaction — anything. Product management was largely absent. Engineering was expected to execute. Period.&lt;/p&gt;
&lt;p&gt;That’s not OKRs. That’s project management theater with a quarterly cadence.&lt;/p&gt;
&lt;h2&gt;&lt;strong&gt;When Metrics Become a Bludgeon&lt;/strong&gt;&lt;/h2&gt;
&lt;p&gt;Because impact could not be measured, the game became managing perception. If your manager brought you a list of 10 items, you negotiated down to 5 and delivered 6. Overdeliver by carefully managing the optics.&lt;/p&gt;
&lt;p&gt;It became a ritual of negotiated checklists, not shared purpose. A way to evaluate individuals, not steer teams. The illusion of alignment without any of the benefits.&lt;/p&gt;
&lt;p&gt;When we started &lt;a href=&quot;https://mergify.com/&quot;&gt;Mergify&lt;/a&gt;, I brought some of that skepticism with me.&lt;/p&gt;
&lt;p&gt;We did try OKRs — for a while. And in some areas, they worked. Marketing, for instance, benefitted from clear metrics and planning: support ticket volume, incident count, lead generation. Things we could measure and reflect on quarterly.&lt;/p&gt;
&lt;p&gt;But for product and engineering? Not so much. We didn’t have a mature enough product management function early on. And engineers — rightly — didn’t see the value in spending hours fine-tuning quarterly goals that wouldn’t actually guide their day-to-day. (We eventually &lt;a href=&quot;https://julien.danjou.info/blog/aligning-project-management-with&quot;&gt;evolved to a project-driven workflow&lt;/a&gt; that worked much better for us.)&lt;/p&gt;
&lt;p&gt;Eventually, we stopped.&lt;/p&gt;
&lt;p&gt;Not because we didn’t believe in planning or goals — but because the format had become more effort than it was worth. We weren’t getting leverage from the process.&lt;/p&gt;
&lt;h2&gt;&lt;strong&gt;Plans, Not Rituals&lt;/strong&gt;&lt;/h2&gt;
&lt;p&gt;What I’ve come to believe is this: most teams would be better off writing down a plan than chasing OKR perfection.&lt;/p&gt;
&lt;p&gt;A good plan answers: what are we going to ship, why does it matter, and how will we know if it worked?&lt;/p&gt;
&lt;p&gt;That’s it.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/images/blog/1e3a42b9-4a90-4b47-a366-de399f172029_1376x864.webp&quot; alt=&quot;Illustration of writing plans instead of chasing OKR perfection&quot; /&gt;&lt;/p&gt;
&lt;p&gt;I appreciated &lt;a href=&quot;https://newsletter.posthog.com/p/youre-doing-quarterly-planning-wrong&quot;&gt;how PostHog described this recently in one of their updates&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;“In 2022, we required “OKRs” as part of quarterly planning, but eventually walked it back. We found engineers were agonizing over finding the right metrics, while also feeling like metrics didn&apos;t accurately reflect their subjective view of progress.”&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;They realized something important: even if you &lt;em&gt;do&lt;/em&gt; write OKRs, you still need to write the plan. So maybe just… start with the plan.&lt;/p&gt;
&lt;p&gt;Let OKRs emerge when they make sense — when you have a clear outcome to optimize for. But don’t let a framework become a crutch.&lt;/p&gt;
&lt;h2&gt;&lt;strong&gt;Success Isn’t a Template&lt;/strong&gt;&lt;/h2&gt;
&lt;p&gt;We love templates in tech. We copy Google’s OKRs. Amazon’s memos. Netflix’s culture deck.&lt;/p&gt;
&lt;p&gt;But these practices only work when paired with deep understanding. Blindly copying them won’t align your team or 10x your output.&lt;/p&gt;
&lt;p&gt;Success isn’t a template. It’s clarity, judgment, and execution.&lt;/p&gt;
&lt;p&gt;Sometimes that means writing OKRs. Sometimes it just means writing a plan that everyone understands — and ships.&lt;/p&gt;
</content:encoded></item><item><title>AI Is a Human Interface Nightmare</title><link>https://julien.danjou.info/blog/ai-is-a-human-interface-nightmare/</link><guid isPermaLink="true">https://julien.danjou.info/blog/ai-is-a-human-interface-nightmare/</guid><description>AI Isn’t Broken, Our Expectations Are</description><pubDate>Tue, 08 Jul 2025 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;For the last 80 years, computers have been calculators. Fancy ones, sure — with screens, keyboards, networks. But under the hood, they’re still just deterministic machines. You give them input, and they process it with logic gates and silicon, and they spit out the exact same output every time. That’s the deal. That’s the contract.&lt;/p&gt;
&lt;p&gt;And then came AI.&lt;/p&gt;
&lt;p&gt;AI doesn’t work like that. At all.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/images/blog/e8e86a31-6b02-447d-b3d7-d7bbc06ae555_1376x864.png&quot; alt=&quot;Illustration of a computer interface struggling to represent AI&quot; /&gt;&lt;/p&gt;
&lt;p&gt;AI — especially large language models — isn’t deterministic. It’s a soup of probabilities and neural weights. When you talk to an AI, you’re not talking to a computer. You’re talking to something more like a human brain: a machine that guesses, infers, hallucinates, and sometimes nails it. And sometimes doesn’t.&lt;/p&gt;
&lt;p&gt;That’s fine. That’s expected. But the problem?&lt;/p&gt;
&lt;p&gt;AI still &lt;em&gt;runs&lt;/em&gt; on computers.&lt;/p&gt;
&lt;p&gt;The interface hasn’t changed. We’re still typing on keyboards, expecting precise answers. We’re still clicking buttons, expecting repeatability. But AI doesn’t think like that. And so the human-AI interface is totally broken.&lt;/p&gt;
&lt;p&gt;Ask ChatGPT “What’s the height of the Eiffel Tower?” and you might get the right number. Or not. And when it’s wrong, people freak out — “How can it not know that?” But think about it: the model is 1TB in size. It fits on a USB stick. You really believe all of humanity’s verified data fits in your pocket?&lt;/p&gt;
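&lt;p&gt;The arithmetic behind that claim is worth sketching. A rough back-of-envelope, assuming a hypothetical ~500-billion-parameter model stored as fp16 (neither number is a published figure):&lt;/p&gt;

```python
# Back-of-envelope: disk footprint of a large language model
# (hypothetical numbers -- ~500B parameters, fp16 weights)
params = 500e9           # assumed parameter count
bytes_per_param = 2      # fp16: 2 bytes per weight
size_tb = params * bytes_per_param / 1e12
print(f"{size_tb:.1f} TB")  # about 1 TB: it fits on a USB stick
```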
&lt;p&gt;It&apos;s not Google. It&apos;s not Wikipedia. It&apos;s a brain. A tiny, weird, synthetic brain that talks to you via a command-line interface and autocomplete. And if we figure out the interface problem, AI could actually &lt;a href=&quot;https://julien.danjou.info/blog/connecting-the-dots-with-ai&quot;&gt;connect the dots&lt;/a&gt; in ways humans never could.&lt;/p&gt;
&lt;p&gt;That’s the real nightmare: the medium is lying about the message.&lt;/p&gt;
&lt;p&gt;We call them “smartphones” because we used to make calls with them — even though calling is now maybe 1% of what we do. The name stuck. And maybe we’ll keep talking to AI through keyboards and chatboxes. But eventually, we’ll need new metaphors. New expectations. New ways to interact.&lt;/p&gt;
&lt;p&gt;Because what’s coming isn’t a better calculator.&lt;/p&gt;
&lt;p&gt;It’s something else entirely.&lt;/p&gt;
</content:encoded></item><item><title>Not Everything is a Hustle</title><link>https://julien.danjou.info/blog/not-everything-is-a-hustle/</link><guid isPermaLink="true">https://julien.danjou.info/blog/not-everything-is-a-hustle/</guid><description>There’s more than one way to be a founder.</description><pubDate>Tue, 01 Jul 2025 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;This Sunday, I made the mistake of opening LinkedIn.&lt;/p&gt;
&lt;p&gt;Among the usual weekend calm, a post caught my eye. A young founder proudly explaining how every morning before heading to the office, he cold-DMs 15 people. Every day. Rain or shine.&lt;/p&gt;
&lt;p&gt;He then listed all the &lt;em&gt;amazing&lt;/em&gt; outcomes of this habit: high-paid jobs, startup funding, conference invites, customer meetings, top hires, you name it.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/images/blog/3e474bba-e76d-4c80-8a3c-df3715f838e0_514x575.png&quot; alt=&quot;Screenshot of a LinkedIn post about cold-DMing 15 people every morning&quot; /&gt;&lt;/p&gt;
&lt;p&gt;I couldn’t help but roll my eyes.&lt;/p&gt;
&lt;p&gt;So I reshared their post, wrote this instead, and published it on LinkedIn:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Every morning before I head to my home office,&lt;br /&gt;
I bring my kids to school.&lt;br /&gt;
It’s literally the single most important habit I’ve built.&lt;br /&gt;
Want to learn how I do it?&lt;br /&gt;
Comment “school” and I’ll tell you my secret.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Boom. &lt;em&gt;Thousands&lt;/em&gt; of views.&lt;/p&gt;
&lt;p&gt;Because here’s the truth: not everyone is under 30, lives in San Francisco, works at a startup with no users that raised millions, and spends their weekends writing cold emails to VCs and their evenings eating pizza with the team, crushing the latest release.&lt;/p&gt;
&lt;p&gt;Some of us are in our 40s.&lt;/p&gt;
&lt;p&gt;Some of us are building real companies.&lt;/p&gt;
&lt;p&gt;Some of us are taking our kids to school, going for a run at lunch, playing music on the weekends, and still—yes, still—shipping, raising, hiring, and growing.&lt;/p&gt;
&lt;p&gt;We don’t brag about skipping meals or sleeping 4 hours.&lt;/p&gt;
&lt;p&gt;We don’t show our productivity hacks on a treadmill desk.&lt;/p&gt;
&lt;p&gt;We don’t post about cold-DMing 15 people a day.&lt;/p&gt;
&lt;p&gt;We just don’t need to turn every moment into content.&lt;/p&gt;
&lt;p&gt;So if LinkedIn makes you feel like you’re not doing enough, or not doing it right, just know this:&lt;/p&gt;
&lt;p&gt;You’re not alone. You’re not late.&lt;/p&gt;
&lt;p&gt;And success can look very, very different.&lt;/p&gt;
&lt;p&gt;Take a breath. Pick up your kids. Go for that run.&lt;/p&gt;
&lt;p&gt;Enjoy the ride.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/images/blog/4c7aadec-2cb7-48d7-b67c-fbc88df0aacb_2998x3395.jpeg&quot; alt=&quot;Photo of a father and child enjoying a walk outdoors&quot; /&gt;&lt;/p&gt;
</content:encoded></item><item><title>Why We Still Care About Quality</title><link>https://julien.danjou.info/blog/why-we-still-care-about-quality/</link><guid isPermaLink="true">https://julien.danjou.info/blog/why-we-still-care-about-quality/</guid><description>Quality is slow, hard, and totally worth it</description><pubDate>Tue, 24 Jun 2025 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;I recently read &lt;a href=&quot;https://linear.app/blog/why-is-quality-so-rare&quot;&gt;Linear’s excellent blog post on why quality is so rare&lt;/a&gt;, and it resonated deeply with me. Craft, quality, care — these aren’t buzzwords. They’re a way of working, a way of thinking, and frankly, the only way I’ve ever known how to build things.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/images/blog/d3224e88-3b48-4bbd-ad4c-04f364308e0d_809x394.png&quot; alt=&quot;Screenshot of Linear&apos;s blog post on why quality is so rare&quot; /&gt;
&lt;em&gt;Linear: Why is quality so rare?&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;For me, it started with &lt;a href=&quot;https://julien.danjou.info/blog/open-source-is-getting-used-to-death&quot;&gt;open source&lt;/a&gt;. When you put your code out in the open, you naturally want to make it good. Maybe even beautiful. I started more than 20 years ago, polishing my Debian packages, making sure they were clean, understandable, and useful. Later I poured that same mindset into building &lt;em&gt;&lt;a href=&quot;https://awesomewm.org&quot;&gt;awesomewm&lt;/a&gt;&lt;/em&gt;, striving to write the best C code I could — because that code was me, visible to anyone curious enough to look.&lt;/p&gt;
&lt;p&gt;Open source taught me that quality is not an accident. It’s a habit. And a commitment.&lt;/p&gt;
&lt;p&gt;Even though &lt;a href=&quot;https://blog.mergify.com/why-mergify-codebase-isnt-open-source-anymore-a-tale-of-growth-change-and-adaptation/&quot;&gt;Mergify is no longer open source&lt;/a&gt;, the ethos never left. We still build like our code is going to be read by thousands, because, at the very least, it’s read by our own team. Our team ships work we’re proud of. Whether we win a deal or not, it’s common to hear people tell us:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The quality of Mergify stands out.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;That never gets old.&lt;/p&gt;
&lt;p&gt;I know I’m not alone in this. Mehdi, my cofounder, and I have been building together for over 15 years. It’s in our DNA: we hate mediocrity. We won’t ship something that we wouldn’t use ourselves — joyfully.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/images/blog/c6432d0e-0b74-4b90-bbfa-b9f7e4b83516_1376x864.webp&quot; alt=&quot;Illustration of craftsmanship and quality in software engineering&quot; /&gt;&lt;/p&gt;
&lt;p&gt;That said, I’ve also seen the flip side. Back when I worked on &lt;a href=&quot;https://openstack.org&quot;&gt;OpenStack&lt;/a&gt;, a massive open-source project, there was a lot of code… and not always a lot of care. Many contributors came from companies that didn’t value quality — and it showed. Open source can be beautiful, but when it’s driven by quantity instead of pride, it becomes exhausting. I hated that part.&lt;/p&gt;
&lt;p&gt;Quality isn’t just aesthetic. It’s a business strategy. Linear nailed that in their post. When you build something that feels right — fast, polished, thoughtful — users notice. They stay. They tell others. We’ve seen this at Mergify: our growth has been fueled not just by features but by how those features feel to use.&lt;/p&gt;
&lt;p&gt;But quality is more than just a great UI or bug-free code.&lt;/p&gt;
&lt;p&gt;It’s also:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;A fast, reliable, intuitive product.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Clean code that enables long-term agility.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Thoughtful defaults and edge-case handling.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Being able to say “no” when something adds complexity without enough payoff.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Getting there isn’t easy. You need judgment — to know what’s worth doing and what can wait. That comes with experience and the humility to know you’ll never get everything right. We aim for 80/20, not 100/0. Sometimes that means leaving the last 20% for another day — or maybe never. Not because we don’t care, but because we care about the whole system staying healthy and fast.&lt;/p&gt;
&lt;p&gt;Quality isn’t free. But it pays back. In speed, trust, and joy.&lt;/p&gt;
&lt;p&gt;So yes, it’s a choice. One you make every day.&lt;/p&gt;
&lt;p&gt;You can take the shortcut, or you can make something that lasts.&lt;/p&gt;
&lt;p&gt;We still choose the latter.&lt;/p&gt;
</content:encoded></item><item><title>Why Engineers Shouldn’t Decide Your Cloud Strategy</title><link>https://julien.danjou.info/blog/why-engineers-shouldnt-decide-your/</link><guid isPermaLink="true">https://julien.danjou.info/blog/why-engineers-shouldnt-decide-your/</guid><description>Growth Is the Battlefield</description><pubDate>Tue, 17 Jun 2025 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/images/blog/8f63cdd6-a53f-45db-b373-6ad8a3ae3d86_1376x864.png&quot; alt=&quot;Illustration of engineers debating cloud vs bare metal strategy&quot; /&gt;&lt;/p&gt;
&lt;p&gt;Every few months, a new wave of engineers proudly announces their exit from the cloud. “We’re going bare metal. Look at our savings!”&lt;/p&gt;
&lt;p&gt;The thread goes viral. Everyone nods wisely.&lt;/p&gt;
&lt;p&gt;But here’s the truth: if you’re making infra decisions without thinking about your growth model, you’re optimizing the wrong thing.&lt;/p&gt;
&lt;h2&gt;&lt;strong&gt;The cloud is not a cost problem. It’s a scaling solution.&lt;/strong&gt;&lt;/h2&gt;
&lt;p&gt;Startups don’t pay AWS bills because it’s cheap. They pay because it gives them instant access to global infrastructure they couldn’t build or operate themselves, and arguably, they shouldn’t spend time building a team to operate it.&lt;/p&gt;
&lt;p&gt;If your business is growing 100% year over year, optimizing gross margin is not your first battle. Surviving growth is.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/images/blog/fb575c3d-c68d-403a-b54f-f945fda2099d_1557x874.png&quot; alt=&quot;Datadog revenue growth chart showing 150x scale over 10 years&quot; /&gt;&lt;/p&gt;
&lt;p&gt;Datadog has been in the cloud since day one. They’ve &lt;strong&gt;scaled revenue 150× over 10 years&lt;/strong&gt;. The cloud didn’t kill them. It enabled them. They did that while controlling and optimizing their gross margin, but without spinning up a giant project to double it by leaving the cloud (yet). Why? Because they’re still growing, just a little more slowly.&lt;/p&gt;
&lt;h2&gt;&lt;strong&gt;Bare metal works — if you’re not growing much.&lt;/strong&gt;&lt;/h2&gt;
&lt;p&gt;&lt;a href=&quot;https://basecamp.com/cloud-exit&quot;&gt;Basecamp left AWS.&lt;/a&gt; They made noise. But they also “only” grew 6× in 12 years — not 150×. When growth is slow and predictable, you can (and should) optimize for margin. You have time. You have predictability. Maybe you even have ops engineers with spare cycles. And if you don’t, since you’re not struggling to grow your team, you can expand into infrastructure and internalize it.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/images/blog/5df8cdbb-31cc-4f5d-a20d-339a9bbbfd13_1673x503.png&quot; alt=&quot;Basecamp growth comparison showing 6x growth over 12 years&quot; /&gt;&lt;/p&gt;
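&lt;p&gt;The gap between those two trajectories is easy to quantify. A minimal sketch of the compound annual growth rates implied by the headline multiples:&lt;/p&gt;

```python
# Compound annual growth rate (CAGR) implied by a headline multiple
def cagr(multiple: float, years: float) -> float:
    return multiple ** (1 / years) - 1

print(f"150x over 10 years: {cagr(150, 10):.0%}/year")  # roughly 65% per year
print(f"6x over 12 years:   {cagr(6, 12):.0%}/year")    # roughly 16% per year
```

&lt;p&gt;Roughly 65% a year versus roughly 16% a year: two entirely different operating modes.&lt;/p&gt;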
&lt;p&gt;When growth runs out of steam, you optimize your gross margin; cost becomes the thing you want to shrink. It’s a different phase.&lt;/p&gt;
&lt;p&gt;The same goes for any small or internal project; there might be no need to deal with a cloud provider if you know your infrastructure will not double every year. Just rent or buy a bunch of bare metal servers and deal with them.&lt;/p&gt;
&lt;h2&gt;&lt;strong&gt;Most engineers don’t see the whole picture.&lt;/strong&gt;&lt;/h2&gt;
&lt;p&gt;Engineers always want to optimize. The problem is that most of them can’t optimize your market or your growth. The only thing they know how to optimize is resource consumption and cost, by working more.&lt;/p&gt;
&lt;p&gt;Therefore, they’ll look at a line item on the AWS invoice and say, “We could get this cheaper with low-cost bare metal and our team spending time spinning things up.”&lt;/p&gt;
&lt;p&gt;Maybe.&lt;/p&gt;
&lt;p&gt;Who’s factoring in the cost of talent to manage infra? The time you won’t spend shipping product? The opportunity cost of slowing down? (This is a classic case of &lt;a href=&quot;https://julien.danjou.info/blog/solving-build-vs-buy&quot;&gt;solving build vs. buy&lt;/a&gt;.)&lt;/p&gt;
&lt;h2&gt;&lt;strong&gt;So… bare metal or cloud?&lt;/strong&gt;&lt;/h2&gt;
&lt;p&gt;It depends.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;If you’re building a startup and aiming for fast growth: &lt;strong&gt;cloud, 100%.&lt;/strong&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;If you’re a slow-growth company with predictable traffic: &lt;strong&gt;maybe bare metal.&lt;/strong&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;If you’re a big org running an intranet or legacy app: &lt;strong&gt;buy servers, no big deal.&lt;/strong&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;But let’s stop pretending this is just a technical decision.&lt;/p&gt;
&lt;p&gt;It’s not.&lt;/p&gt;
&lt;p&gt;It’s a strategic one.&lt;/p&gt;
</content:encoded></item><item><title>Marc Chagall Never Painted That</title><link>https://julien.danjou.info/blog/marc-chagall-never-painted-that/</link><guid isPermaLink="true">https://julien.danjou.info/blog/marc-chagall-never-painted-that/</guid><description>Or Why AI Isn’t Google</description><pubDate>Tue, 03 Jun 2025 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;It was a casual Friday. Nothing special—except I was on kid duty for lunch pickup, a rare detour in my usual routine.&lt;/p&gt;
&lt;p&gt;As we strolled home, baguette under one arm, my daughter told me about her morning in class.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/images/blog/7116745f-2da6-4f95-8918-4eebaebe0fcc_1376x864.webp&quot; alt=&quot;Illustration of a parent and child walking home discussing art class&quot; /&gt;&lt;/p&gt;
&lt;p&gt;They had studied Marc Chagall. Her eyes sparkled as she recounted it, and then she asked if we could go see &lt;em&gt;La Fée Électricité&lt;/em&gt; next time we were in Paris.&lt;/p&gt;
&lt;p&gt;That name rang a bell, but I had no clue where it was exhibited, or whether it was even in Paris. Painting is not my strong suit. Once home, I did what any responsible parent would do: I pulled my phone out of my pocket and Googled it.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/images/blog/ac15cf29-2ff5-411d-bb51-2f91ffdda95b_805x325.jpeg&quot; alt=&quot;La Fee Electricite painting by Raoul Dufy&quot; /&gt;
&lt;em&gt;La Fée Électricité&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;The first answer showed that the painting was exhibited at the &lt;em&gt;&lt;a href=&quot;https://www.mam.paris.fr/fr/oeuvre/la-fee-electricite&quot;&gt;Musée d’Art Moderne de Paris&lt;/a&gt;&lt;/em&gt;. But I didn’t tell my daughter right away. As I was scrolling on my phone, something didn’t click.&lt;/p&gt;
&lt;p&gt;The museum mentioned that this painting was by Raoul Dufy — not Chagall.&lt;/p&gt;
&lt;p&gt;I triple-checked on the web and Wikipedia. The result was the same. &lt;em&gt;La Fée Électricité&lt;/em&gt; isn’t by Chagall at all. It’s really by Raoul Dufy.&lt;/p&gt;
&lt;p&gt;That’s when the realisation hit me. The mistake probably didn’t come from a textbook or even a hasty Wikipedia glance. No, my bet is the teacher asked ChatGPT (or Bard, or whatever the tool of the week is) to prepare her lesson. AI probably hallucinated the answer. And nobody caught it.&lt;/p&gt;
&lt;p&gt;We&apos;re at this weird moment where many people treat AI like it&apos;s a search engine. Or worse: as if it&apos;s a source of truth. And when this confidence gets applied at scale — to content, media, music — &lt;a href=&quot;https://julien.danjou.info/blog/the-synthetic-wave-is-already-here&quot;&gt;the synthetic wave is already here&lt;/a&gt;, and nobody is fact-checking it.&lt;/p&gt;
&lt;p&gt;It’s neither. It’s a conversation partner with infinite confidence and a shaky grasp of facts.&lt;/p&gt;
&lt;p&gt;This isn’t a rant against AI. I use it daily and wouldn’t go back. But it’s a gentle reminder: if you don’t know how to question what it says—or double-check your sources—it’s easy to teach your whole class wrong facts.&lt;/p&gt;
&lt;p&gt;No big deal this time. My kid went back to school in the afternoon after I dared her to ask her teacher if Chagall was really the painter behind La Fée Électricité. She did ask, and the teacher corrected the mistake for the whole class and moved on.&lt;/p&gt;
&lt;p&gt;But next time, who knows?&lt;/p&gt;
</content:encoded></item><item><title>Security Starts Where Convenience Ends</title><link>https://julien.danjou.info/blog/security-theater-is-not-security/</link><guid isPermaLink="true">https://julien.danjou.info/blog/security-theater-is-not-security/</guid><description>The alarming state of security in too many tech companies</description><pubDate>Tue, 27 May 2025 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Over the past quarter, I’ve had conversations with a handful of engineers working at French software companies — from early-stage startups to more established players. Companies with thousands of users and millions of euros of revenue.&lt;/p&gt;
&lt;p&gt;During these conversations, what struck me wasn’t what they were building or how they scaled. It was how little attention and seriousness many of them gave to security.&lt;/p&gt;
&lt;p&gt;Some of these companies handle critical user data. Others operate infrastructure that powers thousands of customers. Yet, their security posture often amounts to… vibes. A bit of MFA here. A few random VPNs there. But very little that would pass as security maturity by any professional standard.&lt;/p&gt;
&lt;p&gt;And yes — I get it. Security is not easy. It’s thankless. It doesn’t generate revenue. But here’s the deal: ignoring it isn’t neutral. It’s dangerous.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/images/blog/b22f03ac-b154-4fc7-9883-19a60340d81c_1376x864.png&quot; alt=&quot;Illustration of poor security posture in tech companies&quot; /&gt;&lt;/p&gt;
&lt;h2&gt;&lt;strong&gt;What’s going wrong?&lt;/strong&gt;&lt;/h2&gt;
&lt;p&gt;From what I’ve seen and heard:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;No MDM (Mobile Device Management):&lt;/strong&gt; Engineers using unmonitored laptops, often their own machines, with no control over OS updates, disk encryption, or even whether a password is required. Sometimes the reason is that engineers push back hard on this for convenience’s sake and carry too much weight in security decision-making, without having a clue.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;No endpoint visibility:&lt;/strong&gt; If a machine is compromised, there’s no way to know. Worse, there’s no way to &lt;em&gt;do anything about it&lt;/em&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;No SOC 2, no ISO 27001, not even a roadmap:&lt;/strong&gt; These aren’t magic bullets, but they’re a minimum bar—a starting point. Yet many companies either dismiss them or postpone them indefinitely.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Weak privilege separation:&lt;/strong&gt; Developers with production access “just in case.” CI pipelines that can destroy environments. You get the picture.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This isn’t just a case of companies not being “mature enough.” This is willful neglect disguised as pragmatism.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/images/blog/7e6bd99f-61a8-46b5-899a-b6299113c982_699x480.png&quot; alt=&quot;Illustration of developers resisting security measures for convenience&quot; /&gt;&lt;/p&gt;
&lt;p&gt;One of the reasons (and &lt;a href=&quot;https://julien.danjou.info/blog/why-engineers-shouldnt-decide-your&quot;&gt;why engineers shouldn&apos;t always decide your strategy&lt;/a&gt;): developers often act like divas. Many of them refuse to make even minor trade-offs in convenience for the sake of better security. They don’t want to lose their admin rights, install an MDM agent, or be told they can’t SSH into prod “just in case.” Security? That’s someone else’s problem—until it’s not. The bigger issue is that in many early-stage or engineering-led companies, devs hold disproportionate decision-making power, and there’s no one truly responsible for security. Without pushback, security becomes optional. This isn’t about lack of maturity. It’s about a complete lack of incentives, accountability, and understanding.&lt;/p&gt;
&lt;h2&gt;&lt;strong&gt;“We’re not a target” is a myth&lt;/strong&gt;&lt;/h2&gt;
&lt;p&gt;Many of these teams believe they’re too small or irrelevant to be attacked. That might be true—until it’s not.&lt;/p&gt;
&lt;p&gt;In France, &lt;a href=&quot;https://www.france24.com/en/france/20250513-armed-gang-tries-to-kidnap-crypto-ceo-s-daughter-grandson-in-central-paris&quot;&gt;we’ve recently seen crypto entrepreneurs attacked physically&lt;/a&gt;. That’s one end of the spectrum. But digital attacks? They don’t need to be targeted. They can be opportunistic. You leave a port open, someone finds it.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/images/blog/e3040ae4-cdf9-41db-aeec-5b69b9e508f6_709x723.webp&quot; alt=&quot;Screenshot of article about armed gang attacking crypto entrepreneurs in Paris&quot; /&gt;
&lt;em&gt;Original Article&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;Security hygiene isn’t about paranoia. It’s about respecting your customers, your users, and your own future.&lt;/p&gt;
&lt;h2&gt;&lt;strong&gt;What should change?&lt;/strong&gt;&lt;/h2&gt;
&lt;p&gt;I’m not a security expert. I run a bootstrapped software company, and I’m just one of the gears in our security. But here’s what I’d like to see as a baseline across every SaaS company:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;MDM. For every laptop.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Disk encryption. Mandatory.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Admin access? Logged. Monitored. Reviewed.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Least privilege policies by default.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;CI/CD pipelines with auditable change control.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Security reviews baked into product releases.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;A roadmap toward certifications like SOC 2 or ISO 27001.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Not because they’re trendy, but because they force discipline.&lt;/p&gt;
&lt;h2&gt;&lt;strong&gt;Why I’m writing this&lt;/strong&gt;&lt;/h2&gt;
&lt;p&gt;Because I’m genuinely worried. I think we’re going to see more breaches, more leaks, more “oops we exposed production DBs for a month” stories. And when that happens, saying “well it was complicated” won’t cut it.&lt;/p&gt;
&lt;p&gt;Security is part of the job. It’s not an add-on. It’s not the CISO’s problem. It’s yours. It’s mine. It’s everyone’s.&lt;/p&gt;
&lt;p&gt;Let’s raise the bar.&lt;/p&gt;
</content:encoded></item><item><title>Why French Tech Is Playing Not to Lose</title><link>https://julien.danjou.info/blog/why-french-tech-is-playing-not-to/</link><guid isPermaLink="true">https://julien.danjou.info/blog/why-french-tech-is-playing-not-to/</guid><description>Stop Settling for the Crumbs</description><pubDate>Tue, 29 Apr 2025 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;There’s an uncomfortable truth in French tech. One that doesn’t get talked about much in conferences or demo days. One that’s quietly baked into a lot of the strategies, funding rounds, and product roadmaps.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Most French tech entrepreneurs are not playing to win. They’re playing not to lose.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;They build nice, safe products. They avoid risk. They target the French market, maybe Europe if they’re feeling bold. They slap “sovereign” stickers on their infrastructure and call it innovation.&lt;/p&gt;
&lt;p&gt;But let’s be honest: that’s not innovation. It’s insurance.&lt;/p&gt;
&lt;p&gt;And it’s killing our chances to matter.&lt;/p&gt;
&lt;h2&gt;&lt;strong&gt;The Safety Bubble&lt;/strong&gt;&lt;/h2&gt;
&lt;p&gt;A lot of French startups don’t dream of being global leaders. They don’t set out to compete with the best in the world. They design around constraints. Around regulation. Around what’s “acceptable.” Around what the French government might subsidize.&lt;/p&gt;
&lt;p&gt;And for a while, that works. You can build a decent company by playing it safe. Get a few French customers. Raise a seed round. Maybe land a grant from the government innovation bank. Position yourself as “GDPR-compliant,” “cloud sovereign,” “data hosted in Europe.”&lt;/p&gt;
&lt;p&gt;Yay. The state might give you a pat on the back. But no CAC 40 company will buy your product (far too risky to work with you); if you’re lucky, you’ll land a few scale-ups.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/images/blog/3917b251-bb77-4b57-afe8-b9a982a1eeec_1376x864.png&quot; alt=&quot;Illustration of French tech startups playing it safe&quot; /&gt;&lt;/p&gt;
&lt;p&gt;Meanwhile, somewhere in California, someone is building the product that your customers will switch to as soon as they need something that actually moves fast, scales globally, or breaks new ground.&lt;/p&gt;
&lt;p&gt;This is the real issue: &lt;strong&gt;French entrepreneurs are incentivized to build for compliance, not for users.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;And that’s how you end up with European companies that are five years behind their American counterparts, yet proudly calling themselves “local alternatives.” Some go as far as making other entrepreneurs feel guilty for choosing a non-French technology provider because it’s better — especially in these complicated times.&lt;/p&gt;
&lt;p&gt;That’s not ambition. That’s surrender in disguise.&lt;/p&gt;
&lt;h2&gt;&lt;strong&gt;Sovereignty Is Not a Product Strategy&lt;/strong&gt;&lt;/h2&gt;
&lt;p&gt;Let’s talk about the word that gets thrown around a lot lately: &lt;strong&gt;sovereignty&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;Yes, it’s important that we have control over critical infrastructure. Yes, data security matters. But it’s not a business model. It’s not differentiation.&lt;/p&gt;
&lt;p&gt;You don’t win by saying “we’re like AWS, but French.”&lt;/p&gt;
&lt;p&gt;You don’t win by saying “we’re like OpenAI, but hosted in Europe.”&lt;/p&gt;
&lt;p&gt;You don’t win by saying “we’re like GitHub, but on OVH.”&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/images/blog/205afe31-5195-4d6d-a651-fe9a05d5696c_1208x368.png&quot; alt=&quot;Screenshot of an email pitching a sovereign cloud alternative&quot; /&gt;
&lt;em&gt;An actual email I received today while writing this post.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;You win by solving problems better. By innovating. By taking the risk to challenge the status quo.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Sovereign hosting is a checkbox. It’s not a moat.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;What frustrates me is that so many startups use it as a crutch. As a justification for not competing with the leaders. As an excuse to stay in their comfort zone and call it strategy.&lt;/p&gt;
&lt;p&gt;Meanwhile, the world is moving. And they’re watching.&lt;/p&gt;
&lt;h2&gt;&lt;strong&gt;The US Is Still the Market to Beat&lt;/strong&gt;&lt;/h2&gt;
&lt;p&gt;There’s a pattern I see all the time: French startups that refuse to even consider entering the US market.&lt;/p&gt;
&lt;p&gt;“It’s too crowded.”&lt;/p&gt;
&lt;p&gt;“It’s too expensive.”&lt;/p&gt;
&lt;p&gt;“We’ll start with France. Then maybe Germany. Then maybe the UK.”&lt;/p&gt;
&lt;p&gt;But here’s the thing: &lt;strong&gt;the US is the only market that can validate your product at scale&lt;/strong&gt;. It’s where the fastest companies live. The most demanding customers. The strongest competitors.&lt;/p&gt;
&lt;p&gt;If you’re not building with the intention of competing there, what are you doing?&lt;/p&gt;
&lt;p&gt;Sure, it’s hard. Sure, you’ll probably get punched in the face. But that’s where you learn. That’s where you improve. That’s where you build something that matters globally — not just locally.&lt;/p&gt;
&lt;p&gt;Avoiding the US is not a cautious strategy. It’s a self-imposed ceiling.&lt;/p&gt;
&lt;h2&gt;&lt;strong&gt;French Tech Has the Talent. What’s Missing Is the Fire.&lt;/strong&gt;&lt;/h2&gt;
&lt;p&gt;I’m not saying we don’t have the brains. Or the skills. Or the creativity.&lt;/p&gt;
&lt;p&gt;We do.&lt;/p&gt;
&lt;p&gt;I’ve worked with incredible engineers, designers, product thinkers in France. People who could work anywhere in the world. (I wrote more about what it takes to build globally in &lt;a href=&quot;https://julien.danjou.info/blog/the-future-is-being-built-elsewhere&quot;&gt;The Future Is Being Built Elsewhere&lt;/a&gt;.)&lt;/p&gt;
&lt;p&gt;But too often, the culture around them is one of caution, not ambition. One where failure is seen as shameful, not as a stepping stone. One where fundraising is seen as the end goal, not the fuel for building something bold.&lt;/p&gt;
&lt;p&gt;And when that’s the vibe — guess what? You get safe products. You get local clones. You get decks that talk more about “sovereignty” than about solving user problems in new and interesting ways.&lt;/p&gt;
&lt;p&gt;You don’t get industry leaders. You don’t get Snowflakes or Stripes or Notions.&lt;/p&gt;
&lt;p&gt;And that’s a shame. Because we could.&lt;/p&gt;
&lt;h2&gt;&lt;strong&gt;We Need to Build Like We Mean It&lt;/strong&gt;&lt;/h2&gt;
&lt;p&gt;If we want to matter in the next decade of tech, we need to stop building like we’re afraid.&lt;/p&gt;
&lt;p&gt;We need founders who want to win — not just survive.&lt;/p&gt;
&lt;p&gt;We need investors who back risky, ambitious plays — not just safe, incremental growth.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/images/blog/59a2f8ca-3ab5-4f95-854c-94d31bfc8518_1376x864.png&quot; alt=&quot;Illustration of building ambitious tech products that target the global market&quot; /&gt;&lt;/p&gt;
&lt;p&gt;We need products that &lt;em&gt;start&lt;/em&gt; by targeting the world’s best users — not just the few who care about local hosting.&lt;/p&gt;
&lt;p&gt;And yes, we’ll need to fail more. That’s part of the deal.&lt;/p&gt;
&lt;p&gt;But at least we’ll be failing forward. Not settling for second place.&lt;/p&gt;
&lt;h2&gt;&lt;strong&gt;Closing&lt;/strong&gt;&lt;/h2&gt;
&lt;p&gt;I know this is an unpopular opinion. I know it sounds harsh. But I say this because I want us to do better.&lt;/p&gt;
&lt;p&gt;French tech doesn’t lack talent. It lacks urgency. It lacks the hunger to build something that doesn’t just exist — but leads.&lt;/p&gt;
&lt;p&gt;And until we face that, we’ll keep getting the crumbs.&lt;/p&gt;
</content:encoded></item><item><title>From Failure to Focus</title><link>https://julien.danjou.info/blog/from-failure-to-focus/</link><guid isPermaLink="true">https://julien.danjou.info/blog/from-failure-to-focus/</guid><description>How CI Insights Was Born</description><pubDate>Tue, 22 Apr 2025 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Startups love to talk about iteration. Failing fast. Learning from mistakes. But when you’re six months into a project that doesn’t ship — not once, but &lt;em&gt;twice&lt;/em&gt; — that mantra starts to feel a bit too real.&lt;/p&gt;
&lt;p&gt;At Mergify, we recently spent almost a year building two separate products around CI/CD. Both had potential. Both looked promising. And both ended up in the graveyard.&lt;/p&gt;
&lt;p&gt;I have written about this in three posts already: &lt;a href=&quot;https://julien.danjou.info/blog/the-100000-mistake&quot;&gt;The $100,000 Mistake&lt;/a&gt;, &lt;a href=&quot;https://julien.danjou.info/blog/when-nobody-wants-your-product&quot;&gt;When Nobody Wants Your Product&lt;/a&gt;, and &lt;a href=&quot;https://julien.danjou.info/blog/when-great-tech-isnt-enough&quot;&gt;When Great Tech Isn&apos;t Enough&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;But here’s the thing: that journey was the best thing that could’ve happened to us because it led to &lt;strong&gt;&lt;a href=&quot;https://mergify.com/product/ci-insights&quot;&gt;CI Insights&lt;/a&gt;&lt;/strong&gt;, our newest product, which we’re shipping for real this time.&lt;/p&gt;
&lt;h2&gt;&lt;strong&gt;When R&amp;amp;D Goes Off-Road&lt;/strong&gt;&lt;/h2&gt;
&lt;p&gt;If you’ve followed the first parts of this series, you know the backstory. In 2023, after years of growing Mergify’s Merge Queue product, we started exploring new ideas.&lt;/p&gt;
&lt;p&gt;The first was &lt;strong&gt;CI Optimizer&lt;/strong&gt; — a tool to help teams reduce CI/CD costs. But we quickly learned something important: engineers aren’t the ones who care about CI spend. That’s a FinOps conversation. And FinOps teams weren’t our users.&lt;/p&gt;
&lt;p&gt;Then came &lt;strong&gt;CI Issues&lt;/strong&gt;, a project aimed at tracking flaky tests and infrastructure problems. This time we had interest. Teams &lt;em&gt;did&lt;/em&gt; struggle with these problems. But we made the mistake of diving into R&amp;amp;D without doing proper design work. We built a complex system — and it worked — but it was so hard to deploy and operate that we never felt confident letting users in.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/images/blog/66339157-7d29-4350-b790-71eda725ea58_1376x864.webp&quot; alt=&quot;Illustration of failed product attempts before finding the right design&quot; /&gt;
&lt;em&gt;No design, you said?&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;So once again, we shelved it.&lt;/p&gt;
&lt;p&gt;Two product attempts, zero releases.&lt;/p&gt;
&lt;p&gt;But what survived both efforts was a deeper understanding of the pain engineers feel every day around CI.&lt;/p&gt;
&lt;h2&gt;&lt;strong&gt;The Real Problem: CI Is a Black Box&lt;/strong&gt;&lt;/h2&gt;
&lt;p&gt;Across all our conversations, one theme kept showing up: visibility.&lt;/p&gt;
&lt;p&gt;Teams weren’t desperate to reduce CI costs. They weren’t obsessed with infrastructure failures. But they were all asking the same questions:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Why is our CI pipeline so slow?&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Which PRs are consistently the bottleneck?&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Which tests are flaky?&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Where are we wasting time?&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Nobody had good answers. CI is treated like a utility — flip the switch and hope the light turns on. But when it doesn’t, or when it flickers, most teams don’t have the tools to understand &lt;em&gt;why&lt;/em&gt;.&lt;/p&gt;
&lt;p&gt;We realized what was missing wasn’t a FinOps tool or a smart test tracker — it was &lt;strong&gt;observability&lt;/strong&gt;.&lt;/p&gt;
&lt;h2&gt;&lt;strong&gt;CI Insights: The Missing Layer&lt;/strong&gt;&lt;/h2&gt;
&lt;p&gt;With all the groundwork we’d laid — our CI connectors, our data pipelines, our internal dashboards — we already had most of the pieces. We just needed to reframe the problem.&lt;/p&gt;
&lt;p&gt;So we did.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/images/blog/dc14e464-b5af-41f0-8d1d-6f1ebbe334f2_2880x1920.png&quot; alt=&quot;Screenshot of the Mergify CI Insights dashboard&quot; /&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;CI Insights&lt;/strong&gt; is the observability layer for your CI.&lt;/p&gt;
&lt;p&gt;It helps teams:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Spot flaky jobs and tests&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Identify long-running or unstable jobs&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Understand where their pipeline is slowing them down&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Track trends over time across teams and repos&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;It’s not about cost savings. It’s not about blaming the CI tool. It’s about &lt;em&gt;clarity&lt;/em&gt; — understanding what’s going on, so teams can ship faster and with less frustration.&lt;/p&gt;
&lt;h2&gt;&lt;strong&gt;This Time, We Built It Right&lt;/strong&gt;&lt;/h2&gt;
&lt;p&gt;We learned from our previous mistakes.&lt;/p&gt;
&lt;p&gt;This time, we:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Started with real use cases&lt;/strong&gt; from customers and our own internal needs&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Designed first&lt;/strong&gt;, coded second&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Focused on value over complexity&lt;/strong&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Shipped it&lt;/strong&gt; to users. Yes. Really.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;We’ve been using CI Insights ourselves for months now. It already helped us catch flaky jobs, detect broken test workflows, and reduce merge queue delays.&lt;/p&gt;
&lt;p&gt;Now, we’re rolling it out to early users — and so far, the feedback has been 🔥.&lt;/p&gt;
&lt;h2&gt;&lt;strong&gt;The Bigger Picture&lt;/strong&gt;&lt;/h2&gt;
&lt;p&gt;CI Insights is more than just a tool. It’s a shift in how we think about CI.&lt;/p&gt;
&lt;p&gt;It’s not just a thing that “runs your tests.” It’s a critical part of your development workflow. And it deserves the same kind of visibility, metrics, and tooling that you already have for production systems.&lt;/p&gt;
&lt;p&gt;We’re building CI Insights to be the &lt;strong&gt;best observability tool for CI&lt;/strong&gt; — because engineers deserve better tools.&lt;/p&gt;
&lt;h2&gt;&lt;strong&gt;What’s Next?&lt;/strong&gt;&lt;/h2&gt;
&lt;p&gt;We’re just getting started. The roadmap is full. We’re onboarding users slowly and shaping the product based on real feedback.&lt;/p&gt;
&lt;p&gt;If CI is a black box for your team — if you’re tired of guessing why things are slow — we’d love to hear from you.&lt;/p&gt;
&lt;p&gt;👉 &lt;a href=&quot;https://mergify.com/product/ci-insights&quot;&gt;Request early access&lt;/a&gt;&lt;/p&gt;
</content:encoded></item><item><title>Not Just a Job, It’s a Ride</title><link>https://julien.danjou.info/blog/not-just-a-job-its-a-ride/</link><guid isPermaLink="true">https://julien.danjou.info/blog/not-just-a-job-its-a-ride/</guid><description>How we hire at Mergify</description><pubDate>Tue, 15 Apr 2025 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Hiring is easily one of the hardest jobs I’ve had to do at &lt;a href=&quot;https://mergify.com&quot;&gt;Mergify&lt;/a&gt;. Not because there aren’t smart people out there — there are. But because we’re not just hiring for skill. We’re hiring for mindset.&lt;/p&gt;
&lt;p&gt;And let’s be honest: that’s a lot harder to screen for than technical chops.&lt;/p&gt;
&lt;p&gt;When you’re building a startup, every new hire changes the shape of the team. Every person matters. You’re not adding a cog to a big machine — you’re inviting someone on the ride with you, and they’d better be ready for the speed bumps.&lt;/p&gt;
&lt;h2&gt;&lt;strong&gt;What We Actually Look For&lt;/strong&gt;&lt;/h2&gt;
&lt;p&gt;At Mergify, we look for people who care. Not just about clean code or nice UI — but about the mission. The thing we’re building. The problems we’re solving. If what we’re doing doesn’t excite you, we’re not going to try to sell it to you. We want you to come in already leaning forward.&lt;/p&gt;
&lt;p&gt;We’re looking for people who aren’t title-driven but &lt;strong&gt;outcome-driven&lt;/strong&gt;. People who will get things done even when no ticket has been assigned or clear ownership has been defined yet. That happens a lot.&lt;/p&gt;
&lt;p&gt;You need to bring ideas, not just execute someone else’s. We expect initiative, curiosity, and autonomy.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/images/blog/957dd9f1-9623-477a-a943-16e35464be07_673x313.png&quot; alt=&quot;Screenshot of Mergify&apos;s vision and mission statement on their website&quot; /&gt;
&lt;em&gt;We provide a good summary of our vision and mission on our website.&lt;/em&gt;&lt;/p&gt;
&lt;h2&gt;&lt;strong&gt;Autonomy Isn’t a Buzzword, It’s the Filter&lt;/strong&gt;&lt;/h2&gt;
&lt;p&gt;Here’s a small story that stuck with me.&lt;/p&gt;
&lt;p&gt;We had a candidate once who asked me during the interview, “Who’s in charge of breaking down the project into tickets?”&lt;/p&gt;
&lt;p&gt;I said, “You are.”&lt;/p&gt;
&lt;p&gt;That was enough to scare them off — and that’s OK. At Mergify, you’ll be expected to do exactly that. Define your work, structure your plan, ask questions when you need to — but no one’s going to hand you a Jira board and a user story spec for every task.&lt;/p&gt;
&lt;p&gt;If that makes you nervous, we might not be the right place. If it makes you excited? &lt;a href=&quot;https://careers.mergify.com&quot;&gt;You should talk to us.&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;&lt;strong&gt;Startups Are Messy. That’s the Point.&lt;/strong&gt;&lt;/h2&gt;
&lt;p&gt;You can’t build a startup with neat little boxes around every role.&lt;/p&gt;
&lt;p&gt;One week, you might be writing code. The next, you’re giving a talk at a meetup. The week after, you’re helping a customer figure out something weird in their CI pipeline.&lt;/p&gt;
&lt;p&gt;We look for people who can jump between lanes without crashing the car. If you need clear job boundaries, startup life will drive you insane. But if you like wearing multiple hats (sometimes in one day), you’ll thrive here.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/images/blog/e5ab8274-c078-4466-aa8b-082e9aae55eb_1376x864.webp&quot; alt=&quot;Illustration of wearing multiple hats at a startup&quot; /&gt;
&lt;em&gt;Wearing multiple hats.&lt;/em&gt;&lt;/p&gt;
&lt;h2&gt;&lt;strong&gt;Hiring Mistakes: Yep, We Made Some&lt;/strong&gt;&lt;/h2&gt;
&lt;p&gt;One of our early missteps? Underestimating how hard remote work can be — for some people.&lt;/p&gt;
&lt;p&gt;We’re fully remote, and we love it. But not everyone is cut out for it. We’ve had brilliant engineers who struggled because they needed more structure, more hand-holding, and more real-time sync. (I wrote about this in &lt;a href=&quot;https://julien.danjou.info/blog/remote-work-great-but-not-perfect&quot;&gt;Remote Work: Great, But Not Perfect&lt;/a&gt;.) And we’ve learned the hard way that great resumes don’t always mean great remote workers.&lt;/p&gt;
&lt;p&gt;Now, we screen harder for that. Autonomy, again, is a big part of the puzzle.&lt;/p&gt;
&lt;h2&gt;&lt;strong&gt;Our Hiring Process (And Why We Meet in Person)&lt;/strong&gt;&lt;/h2&gt;
&lt;p&gt;Our process is pretty standard on the surface:&lt;/p&gt;
&lt;p&gt;👉 A technical test&lt;/p&gt;
&lt;p&gt;👉 A technical interview&lt;/p&gt;
&lt;p&gt;👉 A CEO chat&lt;/p&gt;
&lt;p&gt;👉 An onsite interview&lt;/p&gt;
&lt;p&gt;👉 Reference checks&lt;/p&gt;
&lt;p&gt;We’ve documented the whole thing &lt;a href=&quot;https://careers.mergify.com/hiring-process&quot;&gt;on our website&lt;/a&gt;, so there are no surprises.&lt;/p&gt;
&lt;p&gt;But one thing we insist on: &lt;strong&gt;we meet you in person&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;Remote or not, hiring is a human process. A real-life conversation can reveal things that a dozen Zoom calls won’t. We’ve avoided a few bad hires because we took the time to meet face-to-face.&lt;/p&gt;
&lt;h2&gt;&lt;strong&gt;Motivation Over Résumé&lt;/strong&gt;&lt;/h2&gt;
&lt;p&gt;The biggest signal we look for? &lt;strong&gt;Why you want to join.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;We try not to overexplain this part publicly, because it’s one of our most effective filters. But let’s just say: if your reason for leaving your current job is “looking for a remote job,” we’ll probably pass.&lt;/p&gt;
&lt;p&gt;We want people who are actively choosing this kind of work, people who are ready for the ambiguity, the responsibility, and, yes, the chaos.&lt;/p&gt;
&lt;h2&gt;&lt;strong&gt;Hiring for Roles You Don’t Know? Brutal.&lt;/strong&gt;&lt;/h2&gt;
&lt;p&gt;When we hired our first designer, it took us &lt;em&gt;way&lt;/em&gt; longer than it should have. Why? Because we didn’t know how to assess the role.&lt;/p&gt;
&lt;p&gt;If you’ve never done the job yourself, hiring for it is like ordering dinner in a language you don’t speak. You might get lucky, but you probably won’t.&lt;/p&gt;
&lt;p&gt;We’ve learned to spend more time understanding what we actually need before we go looking for someone to do it. Sounds obvious. It’s not.&lt;/p&gt;
&lt;h2&gt;&lt;strong&gt;What’s Hard Now?&lt;/strong&gt;&lt;/h2&gt;
&lt;p&gt;Right now, our biggest challenge is finding candidates who combine technical skill with startup DNA.&lt;/p&gt;
&lt;p&gt;We’re looking for people who’ve spent a few years in early-stage companies, who know what it’s like to ship fast, wear multiple hats, and stay sane through ambiguity. Not just smart — adaptable.&lt;/p&gt;
&lt;h2&gt;&lt;strong&gt;If That’s You…&lt;/strong&gt;&lt;/h2&gt;
&lt;p&gt;Then maybe we should talk. If you’re looking for more than just a job — if you want to build something, shape it, and take pride in it — you might belong here.&lt;/p&gt;
&lt;p&gt;We’re picky, yeah. It slows us down. But we’ve learned the cost of the wrong hire is much higher than waiting for the right one.&lt;/p&gt;
&lt;p&gt;So if you’re ready for the ride, &lt;a href=&quot;https://careers.mergify.com&quot;&gt;we’re hiring&lt;/a&gt;.&lt;/p&gt;
</content:encoded></item><item><title>When Great Tech Isn’t Enough</title><link>https://julien.danjou.info/blog/when-great-tech-isnt-enough/</link><guid isPermaLink="true">https://julien.danjou.info/blog/when-great-tech-isnt-enough/</guid><description>The Product That Never Shipped</description><pubDate>Wed, 26 Mar 2025 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Welcome to the third post in our “we-built-something-and-killed-it” series.&lt;/p&gt;
&lt;p&gt;In the &lt;a href=&quot;https://julien.danjou.info/p/the-100000-mistake&quot;&gt;first chapter&lt;/a&gt;, we shared how we started building CI Optimizer—our ambitious attempt to help teams cut down on CI/CD costs.&lt;/p&gt;
&lt;p&gt;In the &lt;a href=&quot;https://julien.danjou.info/p/when-nobody-wants-your-product&quot;&gt;second&lt;/a&gt;, we explained why that effort never made it past the runway: while the problem existed, no one really wanted the solution.&lt;/p&gt;
&lt;p&gt;But that wasn’t the end of the story.&lt;/p&gt;
&lt;p&gt;Because just as we were winding down CI Optimizer, something else started to take shape—almost accidentally.&lt;/p&gt;
&lt;h2&gt;From CI Cost to CI Chaos&lt;/h2&gt;
&lt;p&gt;As we were working on CI Optimizer, we had to dig deeper into CI platforms like GitHub Actions or CircleCI. We needed to understand the structure, failures, and performance of CI pipelines to measure their cost.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/images/blog/59ac52ab-14c5-4f1a-8f0c-686e3d94bc06_1376x864.webp&quot; alt=&quot;Illustration of digging into CI pipeline failures and reliability issues&quot; /&gt;&lt;/p&gt;
&lt;p&gt;And the more we explored, the more something else stood out: teams weren’t just struggling with &lt;strong&gt;CI/CD costs&lt;/strong&gt;—they were struggling with &lt;strong&gt;CI reliability&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;❗Flaky tests.&lt;/p&gt;
&lt;p&gt;❗Unreliable runners.&lt;/p&gt;
&lt;p&gt;❗Timeouts.&lt;/p&gt;
&lt;p&gt;❗Random infra failures.&lt;/p&gt;
&lt;p&gt;And as users of our Merge Queue product kept telling us:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;“Our workflow is fine—until CI starts acting up.”&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;So we asked ourselves a new question:&lt;/p&gt;
&lt;p&gt;💡 What if we stopped focusing on &lt;strong&gt;how much CI costs&lt;/strong&gt;, and started looking at &lt;strong&gt;how much CI hurts&lt;/strong&gt;?&lt;/p&gt;
&lt;p&gt;That’s how the idea for our next product—&lt;strong&gt;CI Issues&lt;/strong&gt;—was born.&lt;/p&gt;
&lt;h2&gt;&lt;strong&gt;The Pivot: CI Issues&lt;/strong&gt;&lt;/h2&gt;
&lt;p&gt;CI Issues was meant to do one thing really well:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Track, identify, and alert on CI problems before they silently torpedo developer productivity.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;We wanted to give teams &lt;strong&gt;insight and visibility&lt;/strong&gt; into:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;How often their tests flaked&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Whether their CI infrastructure was unreliable&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Which PRs were impacted&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Which workflows deserved attention&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The goal wasn’t just dashboards. It was &lt;strong&gt;detection and action&lt;/strong&gt;. You’d be able to see patterns, set alerts, and flag recurring issues before developers noticed them.&lt;/p&gt;
&lt;p&gt;And as we started to pitch the concept to engineers, the excitement was real:&lt;/p&gt;
&lt;p&gt;💬 “We have this exact pain.”&lt;/p&gt;
&lt;p&gt;💬 “We’ve built half of this internally.”&lt;/p&gt;
&lt;p&gt;💬 “Please let us know when it’s ready.”&lt;/p&gt;
&lt;p&gt;We felt like we were onto something.&lt;/p&gt;
&lt;h2&gt;&lt;strong&gt;The R&amp;amp;D Rabbit Hole&lt;/strong&gt;&lt;/h2&gt;
&lt;p&gt;So we jumped in headfirst. We already had code collecting and analyzing CI data, so we started adapting it for CI Issues.&lt;/p&gt;
&lt;p&gt;We ran the system internally, refined metrics, tested detection logic, built a first UI. And then we iterated. And iterated. And iterated again.&lt;/p&gt;
&lt;p&gt;But something was off.&lt;/p&gt;
&lt;p&gt;Every time we looked at what we had, the same thought came back:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;“This is good… but it’s not a product.”&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;It was barely working for us internally. Even we had trouble using it.&lt;/p&gt;
&lt;p&gt;It was noisy. It was complex. It was fragile. It wasn’t obvious how to deploy or operate it at scale.&lt;/p&gt;
&lt;p&gt;We had built &lt;strong&gt;tech&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;But we hadn’t designed a &lt;strong&gt;product&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/images/blog/65407c22-58d2-4faf-a612-5fa983ebabfd_1376x864.webp&quot; alt=&quot;Illustration of building tech without product design&quot; /&gt;&lt;/p&gt;
&lt;h2&gt;&lt;strong&gt;The Realization That Stopped Us&lt;/strong&gt;&lt;/h2&gt;
&lt;p&gt;After almost a year of work, we paused and took a step back. And it hit us:&lt;/p&gt;
&lt;p&gt;We had made the &lt;strong&gt;same mistake&lt;/strong&gt; again—but in a different way.&lt;/p&gt;
&lt;p&gt;With CI Optimizer, we had &lt;strong&gt;no market&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;With CI Issues, we had &lt;strong&gt;no design&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;This time, it wasn’t the problem that was flawed—it was our approach.&lt;/p&gt;
&lt;p&gt;We had focused on research, experimentation, pipelines, metrics, code—but we hadn’t put the same energy into figuring out &lt;strong&gt;how the product should be used&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;How would teams onboard?&lt;/p&gt;
&lt;p&gt;How would they configure it?&lt;/p&gt;
&lt;p&gt;How would they act on the data?&lt;/p&gt;
&lt;p&gt;What does success look like for them?&lt;/p&gt;
&lt;p&gt;The longer we waited to answer those questions, the more we realized:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;💣 “If we ship this now, we’ll be building another tool that’s hard to use, hard to maintain, and ultimately, unadopted.”&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;So we made the call—again.&lt;/p&gt;
&lt;p&gt;We stopped.&lt;/p&gt;
&lt;h2&gt;&lt;strong&gt;What We Learned (This Time)&lt;/strong&gt;&lt;/h2&gt;
&lt;p&gt;This second failure didn’t sting the same way as the first.&lt;/p&gt;
&lt;p&gt;In fact, it felt like a necessary part of the journey.&lt;/p&gt;
&lt;p&gt;Here’s what we learned:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Validation isn’t enough—you need design.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Even if users want a solution, they won’t use a product that’s hard to operate or understand.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Great tech doesn’t mean great UX.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;CI Issues worked, technically—but without thoughtful design, it was dead in the water.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;You need both clarity and empathy.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Clarity on what you’re solving, and empathy for how your users will experience it.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;&lt;strong&gt;What’s Next?&lt;/strong&gt;&lt;/h2&gt;
&lt;p&gt;The story doesn’t end here.&lt;/p&gt;
&lt;p&gt;CI Issues gave us a powerful insight into how fragile and painful the CI experience can be—and how underserved engineers still are when things go wrong.&lt;/p&gt;
&lt;p&gt;So we took everything we learned from CI Optimizer and CI Issues, and went back to the drawing board—with a new vision, new design principles, and a better understanding of how to build the right thing the right way.&lt;/p&gt;
&lt;p&gt;Stay tuned for the final post in the series: &lt;strong&gt;what we built next, and how it’s going to change how developers deal with CI failures.&lt;/strong&gt;&lt;/p&gt;
</content:encoded></item><item><title>“It’s Complicated” Is Not an Excuse</title><link>https://julien.danjou.info/blog/its-complicated-is-not-an-excuse/</link><guid isPermaLink="true">https://julien.danjou.info/blog/its-complicated-is-not-an-excuse/</guid><description>“It’s Complicated” Is Not an Excuse</description><pubDate>Tue, 11 Mar 2025 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;I spend a lot of time talking to engineers.&lt;/p&gt;
&lt;p&gt;I ask them about &lt;strong&gt;design choices&lt;/strong&gt;, &lt;strong&gt;technical decisions&lt;/strong&gt;, and &lt;strong&gt;why something is built a certain way&lt;/strong&gt;. I try to understand &lt;strong&gt;why this feature is so cumbersome to use&lt;/strong&gt;, &lt;strong&gt;why this API is so convoluted&lt;/strong&gt;, or &lt;strong&gt;why the user experience feels unnecessarily difficult&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;And more often than not, the response I get is:&lt;/p&gt;
&lt;p&gt;💬 &lt;strong&gt;“Well… it’s complicated.”&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Sure. Everything is complicated.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;That’s why you’re here. That’s why you’re an &lt;strong&gt;engineer&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;But &lt;strong&gt;“it’s complicated” should never be an excuse for bad design.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/images/blog/51d36525-99b3-4bc2-bfc6-71543082d6b4_1376x864.png&quot; alt=&quot;Illustration of engineers using complexity as an excuse for bad design&quot; /&gt;&lt;/p&gt;
&lt;h2&gt;&lt;strong&gt;Imagine If Other Professions Worked Like This&lt;/strong&gt;&lt;/h2&gt;
&lt;p&gt;Let’s take &lt;strong&gt;a bakery&lt;/strong&gt;, for example.&lt;/p&gt;
&lt;p&gt;You walk in and ask for &lt;strong&gt;a loaf of bread&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;The baker hands you a cup of flour and some water.&lt;/p&gt;
&lt;p&gt;🫤 &lt;strong&gt;“Uhh… I was expecting actual bread.”&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;💬 &lt;strong&gt;“Well… it’s complicated.”&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;💬 &lt;strong&gt;“We’d have to mix the dough, let it rise, bake it for a while…”&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;💬 &lt;strong&gt;“That’s a lot of steps, so we just decided to give you the raw ingredients instead.”&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;This is exactly how software feels sometimes.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;When users interact with your product, they don’t want to assemble the damn bread. They just want &lt;strong&gt;something that works&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Your job as an engineer is to handle complexity—not push it onto the user.&lt;/strong&gt;&lt;/p&gt;
&lt;h2&gt;&lt;strong&gt;The Difference Between Good and Bad Engineering&lt;/strong&gt;&lt;/h2&gt;
&lt;p&gt;Look, I get it. Engineering &lt;strong&gt;is&lt;/strong&gt; hard.&lt;/p&gt;
&lt;p&gt;Making things simple &lt;strong&gt;is&lt;/strong&gt; difficult.&lt;/p&gt;
&lt;p&gt;Abstracting complexity &lt;strong&gt;takes effort&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;But &lt;strong&gt;great engineers&lt;/strong&gt; don’t just write code—they design experiences.&lt;/p&gt;
&lt;p&gt;A &lt;strong&gt;bad engineer&lt;/strong&gt; builds something difficult and says, &lt;strong&gt;“Well, it’s complicated.”&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;A &lt;strong&gt;great engineer&lt;/strong&gt; builds something difficult and makes it look &lt;strong&gt;simple.&lt;/strong&gt; (More on what makes &lt;a href=&quot;https://julien.danjou.info/blog/how-to-be-a-great-software-engineer&quot;&gt;a great software engineer&lt;/a&gt;.)&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;🔹 &lt;strong&gt;Bad engineering forces users to deal with complexity.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;🔹 &lt;strong&gt;Good engineering hides the complexity behind smart design.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Take Apple, for example. You know what’s &lt;strong&gt;actually complicated&lt;/strong&gt;?&lt;/p&gt;
&lt;p&gt;🔹 Compressing a 4K video into a tiny file.&lt;/p&gt;
&lt;p&gt;🔹 Rendering realistic lighting effects in real-time on an iPhone.&lt;/p&gt;
&lt;p&gt;🔹 Syncing all your messages, contacts, and photos seamlessly across devices.&lt;/p&gt;
&lt;p&gt;But do Apple users &lt;strong&gt;ever have to think about any of that?&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;No. It &lt;strong&gt;just works&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;That’s &lt;strong&gt;good engineering.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/images/blog/dad2f44b-6a86-4552-b29c-d08dce3d0ea3_1376x864.png&quot; alt=&quot;Illustration of good engineering making complex things feel simple&quot; /&gt;&lt;/p&gt;
&lt;h2&gt;&lt;strong&gt;Stop Saying “It’s Complicated”—Start Making It Simple&lt;/strong&gt;&lt;/h2&gt;
&lt;p&gt;When you hear yourself saying, &lt;strong&gt;“It’s complicated”&lt;/strong&gt;, stop for a second and think:&lt;/p&gt;
&lt;p&gt;🛑 &lt;strong&gt;Are you solving a hard problem in the simplest way possible?&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;🛑 &lt;strong&gt;Or are you just passing the complexity to the user?&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;If it’s the latter, &lt;strong&gt;you haven’t finished the job yet.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Because real engineering isn’t about making things work.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;It’s about making things work… simply.&lt;/strong&gt;&lt;/p&gt;
</content:encoded></item><item><title>When Nobody Wants Your Product</title><link>https://julien.danjou.info/blog/when-nobody-wants-your-product/</link><guid isPermaLink="true">https://julien.danjou.info/blog/when-nobody-wants-your-product/</guid><description>The Moment We Realized CI Optimizer Was Doomed</description><pubDate>Tue, 04 Mar 2025 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;In the first part of this series, we introduced &lt;em&gt;CI Optimizer&lt;/em&gt;, a product we were convinced would help engineering teams reduce their CI/CD costs. Given the economic downturn in 2023, we saw budgets tightening, companies folding, and engineering teams being forced to justify every dollar they spent.&lt;/p&gt;
&lt;p&gt;If you missed the first part, &lt;a href=&quot;https://julien.danjou.info/blog/the-100000-mistake&quot;&gt;read it here&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;It seemed like the perfect time to launch a tool that would bring cost visibility and optimization to CI/CD workflows.&lt;/p&gt;
&lt;p&gt;We started building immediately, setting up a landing page and a waitlist, and running early customer interviews. Our approach was clear:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;Build the product.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Talk to potential users.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Iterate based on feedback.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;But as we reached out to customers, one thing became clear:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;💡 Nobody really cared about optimizing CI/CD costs.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;This was the moment we realized we were building something that might never find an audience.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/images/blog/735a74df-429a-4df9-9aa5-9115ff00c989_1376x864.webp&quot; alt=&quot;Illustration of building a product nobody wants&quot; /&gt;&lt;/p&gt;
&lt;h2&gt;&lt;strong&gt;Trying to Sell the Product Before It Existed&lt;/strong&gt;&lt;/h2&gt;
&lt;p&gt;I strongly believe that you should be selling a product before it even exists. If you can’t generate demand when it’s just an idea, chances are you won’t generate demand once it’s built.&lt;/p&gt;
&lt;p&gt;So, as we were writing the first lines of code, we also launched:&lt;/p&gt;
&lt;p&gt;✅ A &lt;strong&gt;marketing campaign&lt;/strong&gt; to build awareness.&lt;/p&gt;
&lt;p&gt;✅ A &lt;strong&gt;landing page&lt;/strong&gt; with a waitlist.&lt;/p&gt;
&lt;p&gt;✅ &lt;strong&gt;Customer outreach&lt;/strong&gt; to gauge interest.&lt;/p&gt;
&lt;p&gt;Our goal was to &lt;strong&gt;validate demand early&lt;/strong&gt;—before we wasted months building something nobody wanted.&lt;/p&gt;
&lt;p&gt;But things didn’t go as expected.&lt;/p&gt;
&lt;h2&gt;&lt;strong&gt;The First Red Flag: Engineers Didn’t Care&lt;/strong&gt;&lt;/h2&gt;
&lt;p&gt;As we started &lt;strong&gt;talking to users&lt;/strong&gt;, the first warning sign was that &lt;strong&gt;engineers were simply not interested&lt;/strong&gt; in optimizing their CI/CD costs.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;💬 &lt;strong&gt;“Sure, spending less money is nice, but it’s not a priority.”&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;💬 &lt;strong&gt;“We’ve never been asked to reduce our CI/CD spend.”&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;💬 &lt;strong&gt;“CI is just a necessary cost of doing business.”&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;This was surprising. We expected companies to be actively looking for &lt;strong&gt;ways to cut costs&lt;/strong&gt;, but instead, we found:&lt;/p&gt;
&lt;p&gt;👉 &lt;strong&gt;Engineers weren’t incentivized to optimize costs.&lt;/strong&gt; Most of them were measured by &lt;strong&gt;features delivered&lt;/strong&gt; and &lt;strong&gt;bugs fixed&lt;/strong&gt;, not by how much they spent on infrastructure.&lt;/p&gt;
&lt;p&gt;👉 &lt;strong&gt;Budgets were tight, but existing expenses weren’t scrutinized.&lt;/strong&gt; Many teams were cutting &lt;strong&gt;new&lt;/strong&gt; expenditures, but &lt;strong&gt;existing CI/CD costs were just accepted&lt;/strong&gt; as part of doing business.&lt;/p&gt;
&lt;p&gt;👉 &lt;strong&gt;It wasn’t an engineering problem—it was a finance problem.&lt;/strong&gt; Even when engineers acknowledged CI/CD spending was high, they said, &lt;strong&gt;“This isn’t my job to fix.”&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;In short:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;🚨 &lt;strong&gt;We had built a solution for a problem our audience didn’t care about.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;But we weren’t ready to give up yet.&lt;/p&gt;
&lt;h2&gt;&lt;strong&gt;The Second Red Flag: Talking to the Wrong People&lt;/strong&gt;&lt;/h2&gt;
&lt;p&gt;Since engineering teams didn’t seem to care, we were often &lt;strong&gt;redirected to FinOps teams&lt;/strong&gt;—the financial teams responsible for tracking cloud spend.&lt;/p&gt;
&lt;p&gt;So we thought, &lt;strong&gt;“Great! Maybe this is our actual target audience.”&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;We started talking to FinOps teams, and here’s what we discovered:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;💬 &lt;strong&gt;“We don’t need another tool—we just need a report in a spreadsheet.”&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;💬 &lt;strong&gt;“Can you just give us an API so we can generate cost breakdowns?”&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;💬 &lt;strong&gt;“We don’t want to ‘optimize’ CI/CD automatically. We just need visibility.”&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;💬 &lt;strong&gt;“If we were to buy your product, we’d need more than reporting. We want automatic cost optimization.”&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Here’s where we ran into &lt;strong&gt;our second major issue&lt;/strong&gt;:&lt;/p&gt;
&lt;p&gt;🚨 &lt;strong&gt;We were not equipped to build a product for FinOps teams.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;We understood &lt;strong&gt;engineers&lt;/strong&gt;. We had deep experience with &lt;strong&gt;CI/CD workflows&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;But we knew &lt;strong&gt;nothing&lt;/strong&gt; about selling to FinOps.&lt;/p&gt;
&lt;h2&gt;&lt;strong&gt;Selling to FinOps Teams Is a Completely Different Game&lt;/strong&gt;&lt;/h2&gt;
&lt;p&gt;They care about &lt;strong&gt;budgets, forecasting, and high-level cost reporting&lt;/strong&gt;, not about &lt;strong&gt;how CI/CD actually works.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Even worse:&lt;/p&gt;
&lt;p&gt;❌ &lt;strong&gt;The product we had in mind was too technical for FinOps teams.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;❌ &lt;strong&gt;The version they needed was much more complex and would take a year to build.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;❌ &lt;strong&gt;We would be competing against massive cloud cost monitoring tools, not other DevOps tools.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;At this point, we had two choices:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Keep building a product for engineers who didn’t care.&lt;/strong&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Completely pivot to a new audience we didn’t understand.&lt;/strong&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Neither option looked good.&lt;/p&gt;
&lt;h2&gt;&lt;strong&gt;The Hard Decision: Killing the Product&lt;/strong&gt;&lt;/h2&gt;
&lt;p&gt;By the six-month mark, we had spent:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;🕚 &lt;strong&gt;Hundreds of hours&lt;/strong&gt; building a proof-of-concept.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;📞 &lt;strong&gt;Countless customer calls&lt;/strong&gt; trying to validate the idea.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;💬 &lt;strong&gt;Weeks refining our messaging&lt;/strong&gt; to see if we could spark interest.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;But deep down, we knew the truth:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;❌ &lt;strong&gt;Engineers wouldn’t pay for cost optimization.&lt;/strong&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;❌ &lt;strong&gt;FinOps teams needed something completely different.&lt;/strong&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;❌ &lt;strong&gt;There was no clear path forward.&lt;/strong&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;And so, after &lt;strong&gt;six months of work&lt;/strong&gt;, we made the hardest decision a product team can make:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;We killed the project.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/images/blog/69edf6e0-23d2-4c40-aebd-25e4463167ca_1376x864.png&quot; alt=&quot;Illustration of killing a product that has no market fit&quot; /&gt;&lt;/p&gt;
&lt;p&gt;Instead of pushing forward with a product that had no market, we &lt;strong&gt;pivoted to something else&lt;/strong&gt;—which I’ll reveal in the final part of this series.&lt;/p&gt;
&lt;h2&gt;&lt;strong&gt;Lessons Learned&lt;/strong&gt;&lt;/h2&gt;
&lt;p&gt;Even though we ultimately abandoned &lt;em&gt;CI Optimizer&lt;/em&gt;, the experience taught us some &lt;strong&gt;critical lessons&lt;/strong&gt; about building new products:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Talk to Customers Before Writing Code&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;We should have validated demand &lt;strong&gt;before&lt;/strong&gt; starting development. Building first and testing later is a &lt;strong&gt;risky&lt;/strong&gt; approach. Fortunately, we mitigated this risk by talking to customers while we were building.&lt;/p&gt;
&lt;ol start=&quot;2&quot;&gt;
&lt;li&gt;&lt;strong&gt;Engineers Don’t Always Care About Cost Savings&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Developers are focused on &lt;strong&gt;shipping code&lt;/strong&gt;, not &lt;strong&gt;cutting costs&lt;/strong&gt;. If a product &lt;strong&gt;doesn’t directly impact their work&lt;/strong&gt;, they won’t engage with it.&lt;/p&gt;
&lt;ol start=&quot;3&quot;&gt;
&lt;li&gt;&lt;strong&gt;Just Because a Problem Exists Doesn’t Mean It Needs a Product&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Companies do spend too much on CI/CD, but that doesn’t mean they’re looking for a tool to fix it. Some problems are simply &lt;strong&gt;not painful enough&lt;/strong&gt; to justify a new product.&lt;/p&gt;
&lt;ol start=&quot;4&quot;&gt;
&lt;li&gt;&lt;strong&gt;Selling to Finance Teams is a Whole Different Game&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;FinOps teams think differently from engineers. If your product doesn’t fit into their &lt;strong&gt;existing finance workflows&lt;/strong&gt;, they won’t use it.&lt;/p&gt;
&lt;ol start=&quot;5&quot;&gt;
&lt;li&gt;&lt;strong&gt;Know When to Walk Away&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;One of the hardest skills in startups is knowing &lt;strong&gt;when to cut your losses&lt;/strong&gt;. We could have wasted another 6–12 months building something nobody wanted. Instead, we chose to &lt;strong&gt;fail fast and pivot.&lt;/strong&gt;&lt;/p&gt;
&lt;h2&gt;&lt;strong&gt;What’s Next?&lt;/strong&gt;&lt;/h2&gt;
&lt;p&gt;Even though &lt;strong&gt;CI Optimizer never launched&lt;/strong&gt;, it wasn’t a wasted effort.&lt;/p&gt;
&lt;p&gt;In fact, the insights we gained from this failure led us to build &lt;strong&gt;something even better.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;In &lt;strong&gt;Part 3&lt;/strong&gt;, I’ll reveal how we took everything we learned from this failure and &lt;strong&gt;pivoted to a product that actually resonated with engineers&lt;/strong&gt;—and how that decision changed the trajectory of Mergify.&lt;/p&gt;
</content:encoded></item><item><title>The Hidden Cost of Badly Typed Python Wrappers</title><link>https://julien.danjou.info/blog/the-hidden-cost-of-badly-typed-python/</link><guid isPermaLink="true">https://julien.danjou.info/blog/the-hidden-cost-of-badly-typed-python/</guid><description>And How to Fix Them</description><pubDate>Tue, 25 Feb 2025 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;What about some technical stuff this week?&lt;/p&gt;
&lt;p&gt;Writing wrappers in Python is a common practice. Whether it’s to simplify function calls, encapsulate complexity, or create a cleaner API, wrapping functions can be a great way to organize code. But there’s a catch: &lt;strong&gt;if you’re not typing your wrappers correctly, you might be introducing subtle bugs that your type checker won’t catch.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;If you’re using &lt;strong&gt;&lt;a href=&quot;https://mypy-lang.org/&quot;&gt;Mypy&lt;/a&gt;&lt;/strong&gt; (perhaps paired with &lt;a href=&quot;https://julien.danjou.info/blog/the-journey-of-embracing-linters&quot;&gt;a linter like ruff&lt;/a&gt;), you should be careful about &lt;strong&gt;blindly passing&lt;/strong&gt; &lt;code&gt;*args&lt;/code&gt; &lt;strong&gt;and&lt;/strong&gt; &lt;code&gt;**kwargs&lt;/code&gt; &lt;strong&gt;as&lt;/strong&gt; &lt;code&gt;Any&lt;/code&gt;—because doing so effectively turns off your type checker, making your code vulnerable to runtime errors that should have been caught statically.&lt;/p&gt;
&lt;p&gt;Let’s dive into &lt;strong&gt;why this is a problem, why traditional approaches fail, and what the correct way to handle wrapped functions is.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/images/blog/d2a0693b-8828-4e0f-955d-0fdf30dd363e_1376x864.png&quot; alt=&quot;Illustration of type checking pitfalls in Python wrapper functions&quot; /&gt;&lt;/p&gt;
&lt;h2&gt;The Common but Flawed Wrapper Pattern&lt;/h2&gt;
&lt;p&gt;Here’s a classic example of an incorrectly typed wrapper function:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import typing

def make_request(url: str, *args: typing.Any, **kwargs: typing.Any):
    return send_request(HttpClient(url), *args, **kwargs)

def send_request(client: &quot;HttpClient&quot;, method: str = &quot;GET&quot;, timeout: int = 5) -&amp;gt; str:
    return f&quot;Request sent to {client.url} with method {method} and timeout {timeout}s&quot;

class HttpClient:
    def __init__(self, url: str):
        self.url = url
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;What’s the issue here?&lt;/p&gt;
&lt;p&gt;At first glance, this seems fine. We’re creating an &lt;code&gt;HttpClient&lt;/code&gt; for a given &lt;code&gt;url&lt;/code&gt; and passing all additional arguments directly to &lt;code&gt;send_request()&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;But the problem arises when you pass the wrong arguments:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;make_request(&quot;https://example.com&quot;, method=&quot;POST&quot;, timout=10)  # ❌ Typo in &quot;timeout&quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This will result in a &lt;strong&gt;runtime error&lt;/strong&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;TypeError: send_request() got an unexpected keyword argument &apos;timout&apos;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Since &lt;code&gt;make_request()&lt;/code&gt; uses &lt;code&gt;*args: Any&lt;/code&gt; and &lt;code&gt;**kwargs: Any&lt;/code&gt;, &lt;strong&gt;Mypy won’t flag this mistake.&lt;/strong&gt; The type checker has no way to verify whether the arguments passed to &lt;code&gt;make_request()&lt;/code&gt; are valid for &lt;code&gt;send_request()&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Using&lt;/strong&gt; &lt;code&gt;Any&lt;/code&gt; &lt;strong&gt;like this completely disables type checking, making Mypy useless for catching argument mismatches.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/images/blog/485a66d5-412f-479a-b2d5-c4607c1ee06e_1376x864.webp&quot; alt=&quot;Illustration of using Any disabling Mypy type checking in Python&quot; /&gt;&lt;/p&gt;
&lt;h2&gt;&lt;strong&gt;What About Using ParamSpec? (And Why It Doesn’t Work)&lt;/strong&gt;&lt;/h2&gt;
&lt;p&gt;A natural instinct is to use &lt;code&gt;ParamSpec&lt;/code&gt; to tell Mypy that &lt;code&gt;make_request()&lt;/code&gt; should take the exact same arguments as &lt;code&gt;send_request()&lt;/code&gt;.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;from typing import ParamSpec

P = ParamSpec(&quot;P&quot;)

def make_request(url: str, *args: P.args, **kwargs: P.kwargs):
    return send_request(HttpClient(url), *args, **kwargs)  # ❌ Won&apos;t work
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Why doesn’t this work?&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;code&gt;ParamSpec&lt;/code&gt; is &lt;strong&gt;only useful for decorators and higher-order functions&lt;/strong&gt;—functions that return another function.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;It &lt;strong&gt;does not work for simple wrappers&lt;/strong&gt; like this, where you’re directly calling the function inside the wrapper.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;If you try this, &lt;strong&gt;Mypy will complain&lt;/strong&gt; that &lt;code&gt;ParamSpec&lt;/code&gt; is being used incorrectly.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;This means that traditional wrapper functions in Python&lt;/strong&gt;—where you take &lt;code&gt;*args&lt;/code&gt; and &lt;code&gt;**kwargs&lt;/code&gt; and pass them blindly—&lt;strong&gt;are no longer a good practice in a world where static typing matters.&lt;/strong&gt;&lt;/p&gt;
&lt;h2&gt;&lt;strong&gt;The Correct Approach: Using&lt;/strong&gt; &lt;code&gt;functools.partial&lt;/code&gt;&lt;/h2&gt;
&lt;p&gt;Instead of directly calling &lt;code&gt;send_request()&lt;/code&gt; within &lt;code&gt;make_request()&lt;/code&gt;, we should &lt;strong&gt;return a callable function using&lt;/strong&gt; &lt;code&gt;functools.partial&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;Here’s how you do it properly:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;from functools import partial

def make_request(url: str):
    return partial(send_request, HttpClient(url))

# Correct Usage
request = make_request(&quot;https://example.com&quot;)
print(request(method=&quot;POST&quot;, timeout=10))  # ✅ Works correctly

# Incorrect Usage
print(request(method=&quot;POST&quot;, timout=10))  # ❌ Mypy will catch this!
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;&lt;strong&gt;Why This Works&lt;/strong&gt;&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;✅ &lt;strong&gt;Mypy can now properly check argument correctness:&lt;/strong&gt; &lt;code&gt;request&lt;/code&gt; has the exact same signature as &lt;code&gt;send_request()&lt;/code&gt;, ensuring proper type safety.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;✅ &lt;strong&gt;No more unexpected runtime errors:&lt;/strong&gt; if you pass an invalid argument, &lt;strong&gt;Mypy will flag it before you even run the code.&lt;/strong&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;✅ &lt;strong&gt;More maintainable code:&lt;/strong&gt; this pattern makes it clear &lt;strong&gt;what arguments belong to what function&lt;/strong&gt; instead of having them blindly passed along.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;&lt;strong&gt;Key Takeaways&lt;/strong&gt;&lt;/h2&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Stop Using&lt;/strong&gt; &lt;code&gt;*args: Any, **kwargs: Any&lt;/code&gt; &lt;strong&gt;in Wrappers:&lt;/strong&gt; this disables type checking and opens your code to hard-to-debug errors.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;ParamSpec is NOT a fix:&lt;/strong&gt; it only works for decorators and cannot be used to type generic wrapper functions.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Use&lt;/strong&gt; &lt;code&gt;functools.partial&lt;/code&gt; &lt;strong&gt;Instead&lt;/strong&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;This ensures that type checkers can properly verify arguments &lt;strong&gt;while keeping the flexibility of a wrapper.&lt;/strong&gt;&lt;/p&gt;
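&lt;p&gt;One alternative worth mentioning, as a sketch rather than a rule: when you control both functions and the signature is short, you can simply repeat the parameters explicitly in the wrapper. It is more verbose and has to be kept in sync by hand, but it preserves the original call style while staying fully type-checked:&lt;/p&gt;

```python
# HttpClient and send_request are the definitions from the first example.
class HttpClient:
    def __init__(self, url: str):
        self.url = url

def send_request(client: HttpClient, method: str = "GET", timeout: int = 5) -> str:
    return f"Request sent to {client.url} with method {method} and timeout {timeout}s"

def make_request(url: str, method: str = "GET", timeout: int = 5) -> str:
    # Every parameter is spelled out, so Mypy checks each call site.
    return send_request(HttpClient(url), method=method, timeout=timeout)

print(make_request("https://example.com", method="POST", timeout=10))
```

&lt;p&gt;The trade-off is duplication: if &lt;code&gt;send_request()&lt;/code&gt; gains a parameter, the wrapper must be updated too, which is exactly the maintenance cost &lt;code&gt;functools.partial&lt;/code&gt; avoids.&lt;/p&gt;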
&lt;h2&gt;&lt;strong&gt;Final Thoughts&lt;/strong&gt;&lt;/h2&gt;
&lt;p&gt;Python’s type system has evolved significantly, and many old habits—like blindly wrapping functions with &lt;code&gt;Any&lt;/code&gt;—should now be considered bad practice.&lt;/p&gt;
&lt;p&gt;By using &lt;code&gt;functools.partial&lt;/code&gt;, you ensure that your wrapped functions remain type-safe, predictable, and error-free.&lt;/p&gt;
&lt;p&gt;Start refactoring your wrappers today—you’ll have fewer bugs, cleaner code, and a much happier type checker.&lt;/p&gt;
&lt;p&gt;Have you encountered issues with typing wrappers in Python? Do you have alternative approaches? Let’s discuss in the comments! 🚀&lt;/p&gt;
</content:encoded></item><item><title>The $100,000 Mistake</title><link>https://julien.danjou.info/blog/the-100000-mistake/</link><guid isPermaLink="true">https://julien.danjou.info/blog/the-100000-mistake/</guid><description>How We Spent 6 Months Building a CI Tool Nobody Asked For</description><pubDate>Tue, 18 Feb 2025 00:00:00 GMT</pubDate><content:encoded>&lt;h2&gt;&lt;strong&gt;A Journey into Startup Reality&lt;/strong&gt;&lt;/h2&gt;
&lt;p&gt;Not every startup success story begins with a garage, two co-founders, and an overnight explosion of users. And not every failure is a dramatic, fiery crash. Some of the most valuable lessons happen in the quieter moments—when you’ve built something, spent months refining it, and then realized you were &lt;strong&gt;solving the wrong problem&lt;/strong&gt; all along.&lt;/p&gt;
&lt;p&gt;This is the story of &lt;strong&gt;CI Optimizer&lt;/strong&gt;, a product we believed would transform the way companies managed their CI/CD costs. We spent &lt;strong&gt;six months&lt;/strong&gt; designing, building, and testing it—only to ultimately &lt;strong&gt;kill it before launch&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;Why? Because we made &lt;strong&gt;one fundamental mistake&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;This three-part series is not just about the technical challenges of optimizing CI/CD or the intricacies of pricing cloud infrastructure. It’s about &lt;strong&gt;the hard reality of building a product, talking to customers, and realizing you missed the mark.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;If you’re an entrepreneur, a product builder, or just someone fascinated by the messy, unpredictable world of startups, this series is for you.&lt;/p&gt;
&lt;p&gt;Let’s start at the beginning.&lt;/p&gt;
&lt;h2&gt;How It Started&lt;/h2&gt;
&lt;p&gt;At the end of 2022, &lt;strong&gt;Mergify was on a roll.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;We had spent the past year growing steadily, tripling our revenue, refining our &lt;strong&gt;&lt;a href=&quot;https://mergify.com/product/merge-queue&quot;&gt;Merge Queue&lt;/a&gt;&lt;/strong&gt; product, and deepening our place in the DevOps ecosystem. Our customers were engaged, our product-market fit felt strong, and our team was firing on all cylinders.&lt;/p&gt;
&lt;p&gt;But as the year drew to a close, something in the air felt different.&lt;/p&gt;
&lt;p&gt;Conversations with prospects were shifting. Instead of discussing new features and scaling up their usage, they were hesitant. Startups—our core audience—were tightening their budgets. Investors were slowing down. &lt;strong&gt;Some of our customers simply disappeared.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;First, it was the crypto companies. Then, real estate tech. One by one, they went silent—not because they didn’t love our product, but because their businesses were collapsing. &lt;a href=&quot;https://en.wikipedia.org/wiki/2022_stock_market_decline&quot;&gt;The market was crashing&lt;/a&gt;. &lt;a href=&quot;https://news.crunchbase.com/venture/north-american-startup-funding-q4-2022/&quot;&gt;Funding was drying up.&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/images/blog/56767aca-7503-4d8a-921e-183a8e0621f4_1037x871.webp&quot; alt=&quot;Chart showing the 2022 startup market crash and funding decline&quot; /&gt;&lt;/p&gt;
&lt;p&gt;The message became clear:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;“We love Merge Queue, but our budget is frozen.”&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;We knew we couldn’t just sit back and hope the market would bounce back. We needed to adapt. That’s how startups survive.&lt;/p&gt;
&lt;p&gt;And that’s when we had what we thought was a &lt;strong&gt;brilliant idea&lt;/strong&gt;.&lt;/p&gt;
&lt;h2&gt;&lt;strong&gt;The Birth of CI Optimizer&lt;/strong&gt;&lt;/h2&gt;
&lt;p&gt;One undeniable truth about software engineering is that &lt;strong&gt;CI/CD is expensive&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;Every build, every test, every deployment—it all costs money.&lt;/p&gt;
&lt;p&gt;At scale, those costs grow exponentially, often without teams fully understanding where their budget is going. Developers push a change, run a full test suite, and move on. But in the background, &lt;strong&gt;cloud bills are racking up&lt;/strong&gt;, and finance teams are left wondering where all that money is going.&lt;/p&gt;
&lt;p&gt;So we thought:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;“What if we built a tool that gave teams complete visibility into their CI/CD costs? What if we could identify waste, eliminate unnecessary builds, and optimize pipelines automatically? What if we could help companies save money on their CI without slowing them down?”&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;It sounded like a no-brainer. &lt;strong&gt;We’d build a smart system that could analyze CI/CD usage and recommend cost-saving adjustments—maybe even automate them.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;This is something we could even use ourselves to save on our CI/CD bills.&lt;/p&gt;
&lt;p&gt;This wasn’t just an idea—we were convinced we had struck gold.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/images/blog/c6bf28a5-95f8-420f-ab2b-477fee75984d_1376x864.webp&quot; alt=&quot;Illustration of building a CI/CD cost optimization tool&quot; /&gt;&lt;/p&gt;
&lt;h2&gt;&lt;strong&gt;Building the Future of CI/CD Cost Optimization&lt;/strong&gt;&lt;/h2&gt;
&lt;p&gt;By early 2023, we had begun prototyping.&lt;/p&gt;
&lt;p&gt;The first step? Connecting to GitHub Actions. GitHub, like many CI/CD providers, is well known for offering little in the way of cost-analysis reporting (still true as of today). And since GitHub is where all of our customers are, starting there made sense.&lt;/p&gt;
&lt;p&gt;We needed to pull in detailed CI/CD usage data and break down the cost per minute of every build. Our system would scan pipelines, report metrics, identify inefficiencies, and provide actionable insights—like which jobs were wasting the most money.&lt;/p&gt;
&lt;p&gt;It felt like a natural extension of what I had worked on years earlier at Datadog, where I had pushed for replacing CPU seconds with dollar values in profiling tools. The goal was simple: &lt;strong&gt;make CI/CD costs tangible, trackable, and optimizable.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;We saw an opportunity to build something that would fit neatly into engineering workflows. The logic was airtight:&lt;/p&gt;
&lt;p&gt;✅ Developers hate waiting for builds.&lt;/p&gt;
&lt;p&gt;✅ Development teams needed to stretch their budgets further.&lt;/p&gt;
&lt;p&gt;✅ We could solve both problems at once.&lt;/p&gt;
&lt;p&gt;Or so we thought.&lt;/p&gt;
&lt;h2&gt;&lt;strong&gt;What We Hoped to Prove&lt;/strong&gt;&lt;/h2&gt;
&lt;p&gt;Our &lt;strong&gt;plan for early 2023&lt;/strong&gt; was straightforward:&lt;/p&gt;
&lt;p&gt;1️⃣ Build an MVP that could &lt;strong&gt;accurately measure CI/CD costs&lt;/strong&gt; at a granular level.&lt;/p&gt;
&lt;p&gt;2️⃣ Talk to &lt;strong&gt;real users&lt;/strong&gt; and validate whether CI/CD engineers cared about cost optimization.&lt;/p&gt;
&lt;p&gt;3️⃣ Launch a &lt;strong&gt;first version of CI Optimizer&lt;/strong&gt; and start onboarding teams.&lt;/p&gt;
&lt;p&gt;We expected engineers to tell us:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;“This is amazing! We’ve been waiting for something like this!”&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;But that’s not what happened.&lt;/p&gt;
&lt;p&gt;Instead, we hit an unexpected roadblock that completely changed the course of the project. (Read what happened next in &lt;a href=&quot;https://julien.danjou.info/blog/when-nobody-wants-your-product&quot;&gt;When Nobody Wants Your Product&lt;/a&gt;.)&lt;/p&gt;
</content:encoded></item><item><title>SaaS Pricing is Hard</title><link>https://julien.danjou.info/blog/saas-pricing-is-hard/</link><guid isPermaLink="true">https://julien.danjou.info/blog/saas-pricing-is-hard/</guid><description>Our Journey at Mergify</description><pubDate>Tue, 04 Feb 2025 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Pricing is one of the hardest things to get right in SaaS. If you’re a startup founder, especially in B2B, you’ve likely wrestled with pricing questions:&lt;/p&gt;
&lt;p&gt;💰 &lt;strong&gt;How much should I charge?&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;📊 &lt;strong&gt;What pricing model makes sense?&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;⚖️ &lt;strong&gt;How do I ensure fairness while maximizing revenue?&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;At &lt;a href=&quot;https://mergify.com&quot;&gt;Mergify&lt;/a&gt;, we’ve spent years experimenting, iterating, and learning the hard way. Here’s a breakdown of our journey—and what we’d do differently if we had to start over.&lt;/p&gt;
&lt;h2&gt;&lt;strong&gt;How We Picked Our First Pricing Model&lt;/strong&gt;&lt;/h2&gt;
&lt;p&gt;When we first launched Mergify, we had no idea what the right pricing model should be. So, like many startups, we &lt;strong&gt;copied GitHub&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;We charged &lt;strong&gt;per user, based on the size of the entire organization.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;This meant that if a company had 200 engineers, they had to pay for all 200 engineers—even if only 20 or 30 of them actually used Mergify.&lt;/p&gt;
&lt;p&gt;For small companies (e.g., 20–30 engineers), this wasn’t a big deal. They usually had one team using Mergify across all their repos. But as we grew and larger companies came in, things got tricky.&lt;/p&gt;
&lt;p&gt;🛑 &lt;strong&gt;Larger companies had multiple teams, and only some teams used Mergify.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;💰 &lt;strong&gt;They didn’t want to pay for everyone—just for the engineers who actually needed it.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;We needed to change.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/images/blog/52ea4806-fcf2-43cc-b5e2-cd505bc8d0f2_1376x864.webp&quot; alt=&quot;Illustration of SaaS pricing evolution from per-organization to per-collaborator model&quot; /&gt;&lt;/p&gt;
&lt;h3&gt;&lt;strong&gt;Counting Users the “Right” Way&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;To address this, we moved to a new model:&lt;/p&gt;
&lt;p&gt;✅ Instead of charging per &lt;strong&gt;organization&lt;/strong&gt;, we charged per &lt;strong&gt;collaborator&lt;/strong&gt;—engineers who had &lt;strong&gt;write access&lt;/strong&gt; to a repository where Mergify was active.&lt;/p&gt;
&lt;p&gt;This felt fairer. A company with 100 engineers could now pay only for the repositories where Mergify was used, rather than for the entire org.&lt;/p&gt;
&lt;p&gt;At the same time, we &lt;strong&gt;doubled our price per user&lt;/strong&gt;. Why?&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;Customers were already seeing the value, and price wasn’t their biggest concern.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;The new model lowered the user count for most companies, so we had to balance revenue.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;&lt;strong&gt;The Math: Why This Worked&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;For a company with &lt;strong&gt;100 engineers&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Old model:&lt;/strong&gt; $2 per user × 100 users = $200/month&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;New model:&lt;/strong&gt; $4 per user × 50 collaborators with write access = $200/month&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;For many customers, their bill stayed roughly the same. But they were &lt;strong&gt;happier&lt;/strong&gt; because they felt they were paying for what they actually used.&lt;/p&gt;
&lt;h3&gt;&lt;strong&gt;Fairness is Everything&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;But the journey didn’t stop there.&lt;/p&gt;
&lt;p&gt;Larger organizations often &lt;strong&gt;gave write access to all engineers by default&lt;/strong&gt;, even if only a subset was actually making commits. That meant some companies were being charged for engineers who weren’t actively contributing.&lt;/p&gt;
&lt;p&gt;So we introduced another iteration:&lt;/p&gt;
&lt;p&gt;✅ &lt;strong&gt;Charging per “active user”—engineers who actually made commits.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;This approach, inspired by &lt;strong&gt;&lt;a href=&quot;https://slack.com/help/articles/23546798305171-FAQ--Updates-to-Slack%E2%80%99s-active-user-calculation&quot;&gt;Slack’s active user model&lt;/a&gt;&lt;/strong&gt;, made more sense. Now, companies only paid for users who actively used Mergify.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/images/blog/8295f69d-866e-46b3-907b-b23eaa9b0c94_1376x864.png&quot; alt=&quot;Illustration of active user pricing model for SaaS billing&quot; /&gt;&lt;/p&gt;
&lt;h3&gt;&lt;strong&gt;The Math: Why This Worked (Again)&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;For a company with &lt;strong&gt;100 engineers&lt;/strong&gt;, where only &lt;strong&gt;40 engineers actively pushed commits&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Previous model:&lt;/strong&gt; $4 per user × 100 write-access users = &lt;strong&gt;$400/month&lt;/strong&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;New model:&lt;/strong&gt; $8 per user × 40 active users = &lt;strong&gt;$320/month&lt;/strong&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
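&lt;p&gt;To make the comparison concrete, here’s a toy sketch of the three billing models side by side. The prices and head counts are the illustrative numbers from above, not Mergify’s actual pricing, and the function names are mine:&lt;/p&gt;

```python
# Toy comparison of the three pricing models discussed above.
# Prices and head counts are illustrative, not Mergify's real numbers.

def per_org_bill(price_per_user: int, org_size: int) -> int:
    """Original model: every engineer in the organization is billed."""
    return price_per_user * org_size

def per_collaborator_bill(price_per_user: int, write_access_users: int) -> int:
    """Second model: only engineers with write access to a repo where the tool runs."""
    return price_per_user * write_access_users

def per_active_user_bill(price_per_user: int, committers: int) -> int:
    """Third model: only engineers who actually pushed commits."""
    return price_per_user * committers

# A 100-engineer org where 50 have write access and 40 actively commit:
print(per_org_bill(2, 100))          # 200 ($/month)
print(per_collaborator_bill(4, 50))  # 200
print(per_active_user_bill(8, 40))   # 320
```

&lt;p&gt;Note how each iteration narrows the billed population while raising the unit price: total revenue stays roughly flat, but the bill tracks actual usage more closely.&lt;/p&gt;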
&lt;p&gt;Again, the price per user increased, but total spending often decreased or stayed the same. More importantly, it felt &lt;strong&gt;fairer&lt;/strong&gt; to customers.&lt;/p&gt;
&lt;h3&gt;&lt;strong&gt;The Real Takeaway: Fairness &amp;gt; Exact Pricing Models&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;What we’ve learned is that most customers don’t scrutinize &lt;strong&gt;how much they pay&lt;/strong&gt;—but they deeply care about &lt;strong&gt;why they are paying it&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;💡 &lt;strong&gt;Customers want fairness more than they want cheap pricing.&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;They don’t want to pay for people who never use the product.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;They want transparency in billing.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Now, no matter how much we refine our pricing, &lt;strong&gt;some customers will always question it&lt;/strong&gt;. That’s fine. What matters is that we keep the discussion focused on value—not just pricing mechanics.&lt;/p&gt;
&lt;h2&gt;&lt;strong&gt;Advice for SaaS Startups Navigating Pricing&lt;/strong&gt;&lt;/h2&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Don’t get stuck in the weeds of perfect pricing.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Focus on maximizing total revenue, not obsessing over per-user logic.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Price increases aren’t scary if you frame them well.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Every time we changed how we counted users, we also raised prices—and it worked fine. You can always grandfather happy customers.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;People care more about fairness than numbers.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;If customers understand &lt;strong&gt;why&lt;/strong&gt; they’re paying what they’re paying, they’re much less likely to complain.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;&lt;strong&gt;Final Thoughts&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;Pricing is a constant &lt;strong&gt;work in progress&lt;/strong&gt;. We’ll probably keep refining it at Mergify as we grow. But the core lesson is this: &lt;strong&gt;be transparent, focus on fairness, and anchor pricing to the value your product delivers.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;If you want more on how we think about pricing, I wrote about &lt;a href=&quot;https://julien.danjou.info/blog/saas-and-work-based-pricing&quot;&gt;why we&apos;re sticking with seat-based pricing over work-based models&lt;/a&gt; and the broader &lt;a href=&quot;https://julien.danjou.info/blog/solving-build-vs-buy&quot;&gt;build vs buy dilemma&lt;/a&gt; from the customer&apos;s perspective.&lt;/p&gt;
</content:encoded></item><item><title>Why We Left Heroku</title><link>https://julien.danjou.info/blog/why-we-left-heroku/</link><guid isPermaLink="true">https://julien.danjou.info/blog/why-we-left-heroku/</guid><description>A Tale of Contracts, Challenges, and Change</description><pubDate>Tue, 28 Jan 2025 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;In January 2023, everything was smooth sailing for Mergify. Our infrastructure was humming along on Heroku, a platform we had trusted for over three years. Heroku was once the go-to choice for startups—simple, reliable, and developer-friendly. We were happy customers, growing steadily and paying our invoices month-to-month.&lt;/p&gt;
&lt;p&gt;Then things started to change.&lt;/p&gt;
&lt;h2&gt;&lt;strong&gt;The Start of a Rocky Relationship&lt;/strong&gt;&lt;/h2&gt;
&lt;p&gt;In early 2023, Heroku reached out with an enticing offer: transition from month-to-month billing to an annual Heroku Enterprise contract. The deal included significant discounts on everything—dynos (containers), databases, and add-ons—in exchange for a one-year commitment to a certain number of resources.&lt;/p&gt;
&lt;p&gt;We were told we’d be allowed to overuse our resources up to 30% during the year without being bothered—with the understanding that if we grew beyond that, the contract would be adjusted fairly in the next cycle.&lt;/p&gt;
&lt;p&gt;It sounded like a win-win.&lt;/p&gt;
&lt;p&gt;We signed the contract and carried on. For the first year, everything was fine. By the end of 2023, we had indeed surpassed the 30% growth threshold, but Heroku didn’t reach out. The contract auto-renewed, and we moved into 2024 with no issues.&lt;/p&gt;
&lt;p&gt;That was until the automated emails started arriving.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/images/blog/3299f3cc-673a-415f-b374-6f61fe6201a0_1456x816.webp&quot; alt=&quot;Illustration of automated emails arriving from Heroku about contract changes&quot; /&gt;&lt;/p&gt;
&lt;h2&gt;&lt;strong&gt;A Series of Surprises&lt;/strong&gt;&lt;/h2&gt;
&lt;p&gt;In May 2024, we received an &lt;strong&gt;automated&lt;/strong&gt; email from Heroku. It informed us that the discounts on our containers were being rescinded, effective immediately. Naturally, we contacted Heroku’s support team to understand how that would affect our current contract and were redirected to a new account executive to clarify.&lt;/p&gt;
&lt;p&gt;Their explanation was straightforward: we had doubled our usage, and they wanted us to pay the difference for the current contract term—for the next 9 months.&lt;/p&gt;
&lt;p&gt;While this was unexpected, we decided to comply. We signed an amendment to the contract and paid the outstanding amount. We chalked it up to a policy change, and, as Heroku had been fair to us until then, we decided to move on.&lt;/p&gt;
&lt;p&gt;But then, in October 2024, another email arrived. This time, Heroku announced that discounts on add-ons, such as PostgreSQL databases and Redis, would also be removed. Once again, we reached out to their team for clarification.&lt;/p&gt;
&lt;p&gt;This conversation, however, was &lt;em&gt;very&lt;/em&gt; different.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/images/blog/080a6562-068e-43c0-9e8f-61fa6e083f97_1166x1548.png&quot; alt=&quot;Screenshot of the automated Heroku email rescinding add-on discounts&quot; /&gt;
&lt;em&gt;The Heroku automated email we received&lt;/em&gt;&lt;/p&gt;
&lt;h2&gt;&lt;strong&gt;When Contracts Don’t Matter&lt;/strong&gt;&lt;/h2&gt;
&lt;p&gt;Our current account executive explained that the discounted add-ons we had purchased as part of our original enterprise agreement were no longer “fair” for Heroku.&lt;/p&gt;
&lt;p&gt;Indeed, two years earlier, our previous account executive had offered us a 60% discount off the listed price, a power move to get us to commit to the platform for a whole year. The tactic worked: we committed to Heroku, and the account executive won a “top deal France SMB” award there.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/images/blog/c0c592fa-006d-43e0-9fd3-f0a7448dfb35_1456x816.png&quot; alt=&quot;Illustration of the account executive winning a top deal award for the original Heroku contract&quot; /&gt;&lt;/p&gt;
&lt;p&gt;But now, Heroku wanted us to pay the full price for these services, even though our contract explicitly stated otherwise.&lt;/p&gt;
&lt;p&gt;We reminded them of the terms we agreed to in the contract, but their response was, essentially, “it’s not fair for us anymore.”&lt;/p&gt;
&lt;p&gt;I spent a lot of time trying to understand how squeezing a few thousand extra euros out of a loyal startup would matter to Salesforce’s P&amp;amp;L, or how bullying us into paying money we didn’t owe would earn our account executive any respect from their boss. I never found an answer.&lt;/p&gt;
&lt;p&gt;Despite their efforts to pressure us into paying more, we held firm. A contract is a contract, and we weren’t going to be strong-armed into paying for something that wasn’t part of the original agreement.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/images/blog/1f12cf3e-e8fc-447b-9884-58f90c92808e_1456x816.png&quot; alt=&quot;Illustration of standing firm against unfair contract pressure from a vendor&quot; /&gt;&lt;/p&gt;
&lt;p&gt;However, at this point, it was clear that Heroku was no longer a reliable partner for us. Their lack of stability, constant policy changes, and disregard for contractual terms made it impossible to trust them with our infrastructure.&lt;/p&gt;
&lt;h2&gt;&lt;strong&gt;The Move Away from Heroku&lt;/strong&gt;&lt;/h2&gt;
&lt;p&gt;By late 2024, we made the decision to move Mergify’s infrastructure to Google Cloud Platform (GCP). Migrating a live product is never easy, but it was the right choice. Heroku, once the pioneer of developer-friendly hosting, had stagnated. The platform’s lack of innovation, combined with its increasingly unpredictable business practices, made it clear that it was time to leave.&lt;/p&gt;
&lt;p&gt;GCP offered the flexibility, scalability, and reliability we needed to grow. The migration was a success, and while it wasn&apos;t a move we originally planned for, it&apos;s one we&apos;re glad we made. Google helped us a lot in moving to their platform, which made the whole process smooth — and it gave us a chance to rethink our entire CI/CD stack, including how we handle &lt;a href=&quot;https://julien.danjou.info/blog/the-challenges-of-merge-queues&quot;&gt;merge queue challenges&lt;/a&gt;.&lt;/p&gt;
&lt;h2&gt;&lt;strong&gt;Reflections on Heroku&lt;/strong&gt;&lt;/h2&gt;
&lt;p&gt;Despite the rocky ending, it’s important to acknowledge Heroku’s role in our journey. The platform played a significant part in our early success, providing the simplicity and ease of use that helped us focus on building our product. For small apps and early-stage startups, Heroku can still be a good choice.&lt;/p&gt;
&lt;p&gt;But over time, Heroku failed to evolve. As the tech industry moved forward, Heroku seemed to stand still. Features stagnated, the platform became less relevant, and dealing with them as a customer grew increasingly frustrating. In 2025, it’s hard to recommend Heroku as a reliable choice for scaling companies.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/images/blog/243ebb5b-c23b-4ace-b314-d44da9537c5d_1456x816.webp&quot; alt=&quot;Illustration of migrating infrastructure from Heroku to Google Cloud Platform&quot; /&gt;&lt;/p&gt;
&lt;h2&gt;&lt;strong&gt;Advice for Other Startups&lt;/strong&gt;&lt;/h2&gt;
&lt;p&gt;Our experience with Heroku taught us some valuable lessons:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Beware of Contracts with Large Companies:&lt;/strong&gt; Big corporations can change their terms, policies, or priorities on a whim. Make sure you fully understand the risks before signing long-term agreements.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Stand Your Ground:&lt;/strong&gt; If a vendor tries to pressure you into unfair terms, don’t be afraid to push back. Contracts exist for a reason. And be ready to jump ship to save your ass.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Choose Platforms That Grow with You:&lt;/strong&gt; Heroku was perfect for us in the beginning, but as our needs grew, it became clear that we needed a more robust and innovative platform.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;For startups navigating similar challenges, remember that your infrastructure choices are critical. Hosting platforms should be partners in your growth, not obstacles.&lt;/p&gt;
&lt;h2&gt;&lt;strong&gt;Final Thoughts&lt;/strong&gt;&lt;/h2&gt;
&lt;p&gt;Leaving Heroku wasn’t an easy decision, but it was the right one for Mergify. We’ve learned a lot from this experience, and we’re excited about what’s ahead with our new infrastructure.&lt;/p&gt;
&lt;p&gt;If you’re a startup considering Heroku—or debating whether to stay or move on—ask yourself this: is your hosting platform helping you scale or holding you back? At the end of the day, it’s all about finding a partner you can trust to grow with you.&lt;/p&gt;
</content:encoded></item><item><title>Remote Work: Great, But Not Perfect</title><link>https://julien.danjou.info/blog/remote-work-great-but-not-perfect/</link><guid isPermaLink="true">https://julien.danjou.info/blog/remote-work-great-but-not-perfect/</guid><description>And why so many companies mandate RTO</description><pubDate>Tue, 21 Jan 2025 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Running a fully remote company is an incredible experience, and I say this as someone who’s been working remotely for the past 15 years and managing a remote-first company, &lt;a href=&quot;https://mergify.com&quot;&gt;Mergify&lt;/a&gt;, for the last five. Yet, after a week of in-person collaboration with my team in Toulouse, I feel compelled to reflect on what makes remote work great—and where it falls short.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/images/blog/2bc3a283-980f-4bcc-9ce0-df9740448fe5_3607x2635.jpeg&quot; alt=&quot;Mergify team collaborating in person during an on-site week in Toulouse&quot; /&gt;&lt;/p&gt;
&lt;p&gt;Let me start by saying this: &lt;strong&gt;remote work is fantastic&lt;/strong&gt;, but it’s not perfect. It’s a trade-off. Depending on your company’s stage, your team’s roles, and the challenges you’re facing, the choice between remote, hybrid, or in-office work can make all the difference.&lt;/p&gt;
&lt;h2&gt;&lt;strong&gt;What Makes Remote Work Amazing&lt;/strong&gt;&lt;/h2&gt;
&lt;p&gt;The benefits of remote work are undeniable:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Focus and Efficiency&lt;/strong&gt;: Remote work allows individuals to dive deep into their tasks without the distractions of an open office.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Work-Life Balance&lt;/strong&gt;: Cutting out commutes and office hours lets people design their schedules in ways that suit them best.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Access to Global Talent&lt;/strong&gt;: A remote model lets you hire the best person for the job, no matter where they are. At Mergify, this has been invaluable.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Flexibility and Autonomy&lt;/strong&gt;: Remote work naturally fosters a culture of trust, where people take ownership of their time and deliverables.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;But here’s the catch: &lt;strong&gt;communication in remote work isn’t as fluid as in-office communication.&lt;/strong&gt;&lt;/p&gt;
&lt;h2&gt;&lt;strong&gt;Why In-Office Communication is Superior&lt;/strong&gt;&lt;/h2&gt;
&lt;p&gt;Face-to-face communication is powerful in ways that virtual communication simply can’t replicate. It’s not just about words—it’s about body language, energy, and subtle nonverbal cues that humans naturally pick up when we’re in the same room.&lt;/p&gt;
&lt;p&gt;When you’re remote, &lt;strong&gt;everything has a higher latency&lt;/strong&gt;. Sure, you can make video calls, send Slack messages, and send emails, but it’s like having a conversation with poor reception. You’ll get your message across, but it’s less fluid and often lacks the nuances that make communication easy and productive.&lt;/p&gt;
&lt;p&gt;Now, this might not be a problem, depending on where you work.&lt;/p&gt;
&lt;p&gt;For startups, this trade-off is especially magnified. When you’re building something new and need constant alignment, the lack of spontaneous coffee chats and hallway conversations slows you down. Good ideas often spark from casual interactions—something much harder to replicate remotely.&lt;/p&gt;
&lt;p&gt;A startup is a rollercoaster: you need to share the energy and adrenaline that come with awesome news, and the comfort everyone needs when a cloudy day happens. But it’s hard to pick up on your coworkers’ vibe when they are far away. The connection is harder to make and to maintain.&lt;/p&gt;
&lt;h2&gt;&lt;strong&gt;The Remote vs. Office Matrix&lt;/strong&gt;&lt;/h2&gt;
&lt;p&gt;Consequently, the choice between remote and office work isn’t one-size-fits-all; it’s a matrix of &lt;strong&gt;role&lt;/strong&gt; and &lt;strong&gt;company size&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Large Companies&lt;/strong&gt;: For individual contributors, remote work can be as effective as being in the office. In large organizations, communication often requires structure anyway, and remote tools can handle most of this. However, for managers in these settings, the lack of face-to-face interaction can make it harder to truly understand what’s going on in their teams.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Startups&lt;/strong&gt;: For smaller companies and startups, where speed, creativity, and alignment are critical, remote work becomes trickier. It’s harder to maintain momentum and cohesion when everyone is isolated. Founders and managers need to be proactive, creating systems for communication and connection that compensate for the lack of in-person collaboration.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Of course, it’s an oversimplification, and it requires nuance, but you get the gist.&lt;/p&gt;
&lt;p&gt;The matrix also applies to roles:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;For &lt;strong&gt;individual contributors&lt;/strong&gt;, particularly in large companies, being physically present in an office can feel unnecessary. Their work often revolves around tasks that require focus rather than constant communication or managing teams. Sitting behind a desk all day in an office doesn’t necessarily enhance their productivity or add value to their contributions.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;On the other hand, &lt;strong&gt;managers&lt;/strong&gt;, especially in startups, face unique challenges with remote setups. Their roles demand frequent interaction, gauging team dynamics, and fostering collaboration. Without the ability to observe non-verbal cues or engage in casual, spontaneous conversations, understanding the team’s morale and addressing issues proactively becomes significantly harder.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The same goes for seniority:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Junior employees&lt;/strong&gt; typically require more hands-on management, regular feedback, and frequent check-ins. Without the autonomy or experience to navigate challenges independently, they benefit greatly from close guidance and structured oversight.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;In contrast, &lt;strong&gt;senior employees&lt;/strong&gt; tend to be highly autonomous. They often need minimal direction, excelling at managing their own work and making decisions. They are comfortable raising issues or seeking input when needed, allowing them to operate effectively with little interaction from their managers.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This is one of the reasons there has been so much pushback against the &lt;strong&gt;Return to Office (RTO)&lt;/strong&gt; mandates issued in recent years, especially in &lt;strong&gt;larger companies and amongst senior individual contributors&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/images/blog/8f13ea90-95d8-4c4c-b61a-33f9b00e2db0_1343x1579.png&quot; alt=&quot;Chart showing remote work preferences by role seniority and company size&quot; /&gt;&lt;/p&gt;
&lt;h2&gt;&lt;strong&gt;The Mergify Experience: Remote, But Not Alone&lt;/strong&gt;&lt;/h2&gt;
&lt;p&gt;At Mergify, &lt;a href=&quot;https://blog.mergify.com/embracing-remote-work-how-we-built-mergify-as-a-successful-asynchronous-company/&quot;&gt;we’ve been fully remote from the start&lt;/a&gt;, and we’re committed to staying that way. But we know it’s not without challenges. Here’s how we’ve managed the trade-offs:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Intentional Connection&lt;/strong&gt;: We schedule regular virtual coffee breaks to foster camaraderie and maintain a sense of community.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Quarterly On-Sites&lt;/strong&gt;: Every few months, we bring the team together in person. Our recent week in Toulouse was a reminder of how valuable these moments are—not just for productivity but for bonding as a team. Sharing meals, brainstorming in person, and simply spending time together are irreplaceable experiences. Nothing beats your team being yelled at by the game master of an escape game for having hacked your way around the solution, in true startup spirit. 😅&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Focus on Proactivity&lt;/strong&gt;: Remote work requires deliberate communication. Everyone on the team needs to take the initiative to keep each other informed and aligned.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Hiring for Autonomy&lt;/strong&gt;: A remote model relies on self-motivated, independent individuals who excel without constant oversight. Building a team with these traits has been crucial to our success. Not everyone is made for remote work. (More on this in &lt;a href=&quot;https://julien.danjou.info/blog/not-just-a-job-its-a-ride&quot;&gt;Not Just a Job, It&apos;s a Ride&lt;/a&gt;.)&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/images/blog/47001332-a007-4617-ac0f-b4b394249335_1376x864.png&quot; alt=&quot;Illustration of hiring autonomous people for remote work success&quot; /&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h2&gt;&lt;strong&gt;RTO and the Future of Work&lt;/strong&gt;&lt;/h2&gt;
&lt;p&gt;The current trend of companies mandating a Return to Office reflects the challenges of remote work—particularly for management. It’s easier to see what’s happening, build trust, and foster collaboration in person. Yet, for many roles and organizations, remote work remains a superior option.&lt;/p&gt;
&lt;p&gt;The reality is that no single model is perfect. For some, the trade-offs of remote work are worth it. For others, the benefits of in-office collaboration outweigh the flexibility of remote setups. &lt;strong&gt;What matters most is that companies recognize these trade-offs and build systems that suit their unique needs.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Remote work is here to stay, but it’s not a panacea. It’s a trade-off between flexibility and connection, efficiency and spontaneity. At Mergify, we’ve embraced remote work with open eyes—recognizing its strengths while finding ways to address its weaknesses. Whether remote, hybrid, or in-office, the key is to adapt, experiment, and keep evolving.&lt;/p&gt;
&lt;p&gt;What’s your take on the remote vs. office debate? Let me know in the comments or reach out—I’d love to hear how others are navigating this shift.&lt;/p&gt;
</content:encoded></item><item><title>Reflecting on 2024</title><link>https://julien.danjou.info/blog/reflecting-on-2024/</link><guid isPermaLink="true">https://julien.danjou.info/blog/reflecting-on-2024/</guid><description>A Year of Growth, Change, and Learning at Mergify</description><pubDate>Tue, 07 Jan 2025 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;As 2025 begins, I’m taking a moment to reflect on an eventful 2024—a year of evolution, challenges, and new beginnings for me and &lt;a href=&quot;https://mergify.com&quot;&gt;Mergify&lt;/a&gt;. Building a bootstrapped company comes with its own unique highs and lows, and 2024 was no exception. Here’s what the past year taught me and how it has shaped the future of Mergify.&lt;/p&gt;
&lt;h2&gt;&lt;strong&gt;Keeping Mergify Thriving in a Changing Market&lt;/strong&gt;&lt;/h2&gt;
&lt;p&gt;Surviving and thriving as a bootstrap startup is always worth celebrating, especially in a niche like ours. In 2024, we doubled down on what makes Mergify special: our &lt;strong&gt;&lt;a href=&quot;https://mergify.com/product/merge-queue&quot;&gt;Merge Queue&lt;/a&gt;&lt;/strong&gt; product, which remains the best in the market for helping engineering teams manage pull request workflows.&lt;/p&gt;
&lt;p&gt;That said, we also recognized our limitations. While we’ve always been proud of our technology, we realized this year that &lt;strong&gt;having great tech isn’t enough&lt;/strong&gt;. For years, Mergify was more of a tech-driven company than a product-focused one. This year, we worked hard to change that, shifting our mindset to prioritize product design, usability, and scalability.&lt;/p&gt;
&lt;p&gt;The advancement of competitors, such as GitHub, forced us to rethink our strategy. We’ve been evolving in a niche for years, and it’s now time for us to expand our vision beyond what we’ve been doing so far.&lt;/p&gt;
&lt;p&gt;2024 also brought some tough lessons about the realities of the market. We initially decided to double down on marketing in late 2022, deploying our efforts during all of 2023. However, after the startup market crash of late 2022, we spent much of 2023 navigating a challenging environment where companies were hesitant to adopt new tools. As that trend continued into early 2024, we ultimately decided to &lt;strong&gt;scale back our marketing efforts&lt;/strong&gt; and rethink how we approach growth.&lt;/p&gt;
&lt;p&gt;The truth is that marketing alone can’t solve every problem—especially in a niche like ours. Instead, we’re focusing on building the best products possible and letting our work speak for itself. This approach has already started to pay off, and we’re more confident than ever in Mergify’s future.&lt;/p&gt;
&lt;h2&gt;&lt;strong&gt;A Shift Toward Product and Design&lt;/strong&gt;&lt;/h2&gt;
&lt;p&gt;Consequently, one of the pivotal moments in 2024 was hiring a &lt;strong&gt;designer&lt;/strong&gt;—a first for Mergify. This move sparked a transformation in how we approached our product. It wasn’t just about solving technical challenges anymore; it was about creating an experience developers love.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/images/blog/db1e31b2-b1ec-4bfa-8995-c64030ac9ac2_1378x953.png&quot; alt=&quot;Screenshot of the new Mergify dashboard design&quot; /&gt;
&lt;em&gt;New Mergify design&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;This new focus led to a complete redesign of our branding and dashboard, making it easier than ever for teams to onboard and use Mergify. It also paved the way for &lt;strong&gt;new products&lt;/strong&gt; like &lt;strong&gt;&lt;a href=&quot;https://mergify.com/product/merge-protections&quot;&gt;Merge Protections&lt;/a&gt;&lt;/strong&gt;, a tool for managing repository freezes and policies. This was the first product we built with a product-driven mindset from the ground up, and it’s already gaining traction with customers.&lt;/p&gt;
&lt;h2&gt;&lt;strong&gt;Back to Founder Mode&lt;/strong&gt;&lt;/h2&gt;
&lt;p&gt;In 2024, I found myself returning to what is now called “&lt;a href=&quot;https://paulgraham.com/foundermode.html&quot;&gt;founder mode&lt;/a&gt;.” For the first time in years, I rolled up my sleeves and dove back into &lt;strong&gt;coding and product development&lt;/strong&gt;. Writing Python again, designing architecture and new workflows, and collaborating directly with the team reminded me of the early days of Mergify—and how much I enjoy building things.&lt;/p&gt;
&lt;p&gt;This hands-on approach was fueled in part by the rise of &lt;strong&gt;AI tools&lt;/strong&gt;, which have &lt;a href=&quot;https://julien.danjou.info/blog/connecting-the-dots-with-ai&quot;&gt;transformed how we work&lt;/a&gt;. From speeding up R&amp;amp;D to enhancing productivity, AI has helped us stay agile and efficient as we tackle big challenges in CI/CD.&lt;/p&gt;
&lt;h2&gt;&lt;strong&gt;Doubling Down on R&amp;amp;D&lt;/strong&gt;&lt;/h2&gt;
&lt;p&gt;Speaking of challenges, &lt;strong&gt;research and development&lt;/strong&gt; was a major focus for us in 2024. We spent a lot of time exploring how to solve some of the toughest problems in CI/CD, like &lt;strong&gt;flaky tests, CI failures, and observability issues&lt;/strong&gt;. These pain points resonate deeply with our customers, and we’re excited to bring solutions to market in 2025.&lt;/p&gt;
&lt;p&gt;Our work so far has been a team effort, but it’s also required stepping outside our comfort zone. We’ve been engaging more directly with developers, learning from their frustrations and workflows, and using those insights to shape our next big product.&lt;/p&gt;
&lt;h2&gt;&lt;strong&gt;On a Personal Note&lt;/strong&gt;&lt;/h2&gt;
&lt;p&gt;Outside of Mergify, 2024 was a year of rediscovery for me. I started writing again, leaning on tools like GPT to help me produce regular content and share my thoughts with a wider audience. Writing has always been a way for me to reflect, and having the discipline to do it consistently has been rewarding.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/images/blog/3d683497-9731-4495-aa15-f08f944bde7f_2959x3945.jpeg&quot; alt=&quot;Photo taken by Julien&apos;s kids learning photography&quot; /&gt;
&lt;em&gt;I’m also trying to teach my kids to take photographs&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;I also continued my work as a &lt;strong&gt;business angel&lt;/strong&gt;, supporting other startups and sharing what I’ve learned from building Mergify. While balancing this with my responsibilities at Mergify can be challenging, it’s incredibly fulfilling to help others navigate the ups and downs of entrepreneurship.&lt;/p&gt;
&lt;p&gt;I continued publishing episodes of &lt;a href=&quot;https://nomdunpipeline.com/&quot;&gt;Nom d’un Pipeline !&lt;/a&gt;, my French CI/CD-themed podcast. This has been a tremendous way to meet new people and learn stuff. I really enjoy the exercise of podcasting, and I’m looking for an excuse to start a new one this year.&lt;/p&gt;
&lt;p&gt;I also recorded several episodes as a guest, &lt;a href=&quot;https://www.ifttd.io/episodes/ci-cd&quot;&gt;one in French in IFTTD&lt;/a&gt;, &lt;a href=&quot;https://www.iot-valley.fr/podcast/37-les-perspectives-sur-lavenir-de-la-tech-avec-julien-danjou&quot;&gt;another one in French and in Toulouse on the future of tech&lt;/a&gt;, &lt;a href=&quot;https://podcasts.apple.com/fr/podcast/14-julien-danjou-mergify/id1691308698?i=1000644808297&quot;&gt;another one in French on The Cloud Screener&lt;/a&gt;, &lt;a href=&quot;https://www.pythonshow.com/p/36-serious-python-with-julien-danjou&quot;&gt;one on The Python Show&lt;/a&gt;, and, finally, &lt;a href=&quot;https://www.youtube.com/watch?v=Fd2iQD7Rfug&quot;&gt;one with saas.group&lt;/a&gt;.&lt;/p&gt;
&lt;h2&gt;&lt;strong&gt;Looking Ahead to 2025&lt;/strong&gt;&lt;/h2&gt;
&lt;p&gt;As I look to the year ahead, I’m more excited than ever about what’s next for Mergify. We’re working on &lt;strong&gt;new products&lt;/strong&gt; that tackle critical pain points in CI/CD, and we’re committed to helping developers ship faster and with less friction. At the same time, we’re staying true to our roots as a bootstrapped startup—focused, agile, and always learning.&lt;/p&gt;
&lt;p&gt;2024 was a year of growth in every sense of the word. It challenged us to think differently, to embrace new ideas, and to keep pushing forward. As we enter 2025, I’m grateful for the lessons we’ve learned, the customers we serve, and the incredible team that makes it all possible.&lt;/p&gt;
&lt;p&gt;Here’s to another year of building, learning, and growing.&lt;/p&gt;
</content:encoded></item><item><title>The Collapse of Social Platforms</title><link>https://julien.danjou.info/blog/the-collapse-of-social-platforms/</link><guid isPermaLink="true">https://julien.danjou.info/blog/the-collapse-of-social-platforms/</guid><description>A prediction for 2030</description><pubDate>Tue, 17 Dec 2024 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;It’s the end of the year, so I’ll write about something more theoretical that has been on my mind those last few days.&lt;/p&gt;
&lt;p&gt;What happens when the line between human and AI content creators vanishes completely? We’re closer to that reality than you might think.&lt;/p&gt;
&lt;p&gt;I was out running a few days ago listening to a tech podcast, &lt;a href=&quot;https://siliconcarne.substack.com/&quot;&gt;Silicon Carne&lt;/a&gt;. There was an interesting debate around content creation and how platforms like YouTube will kill TV. I’m not sure the root of the talk was that challenging; TV seems already a thing of the past at this stage. But as they started to talk about AI, things started to get interesting.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/images/blog/a9e23696-5aa6-4048-a7ae-af69a8a84b64_2000x1256.jpeg&quot; alt=&quot;Illustration of the blurring line between human and AI content creators&quot; /&gt;&lt;/p&gt;
&lt;p&gt;People often have limited visions of what’s possible, shaped by their beliefs and ideas about what’s acceptable.&lt;/p&gt;
&lt;p&gt;Most of the discussion revolved around how AI would be able to create content, how it would be used to help producers and content creators, and what it would mean for platforms and consumers.&lt;/p&gt;
&lt;p&gt;Based on that, the debate continued about how much AI would be acceptable in content creation on platforms.&lt;/p&gt;
&lt;p&gt;I think this is very short-sighted.&lt;/p&gt;
&lt;h2&gt;What’s Already Happening&lt;/h2&gt;
&lt;p&gt;You don’t have to look far to see AI being used in content production; that’s a fact. But it’s still very human-driven and AI-assisted. There are a lot of tech limitations for now that prevent pushing the throttle to the max, but it is certain that those limitations will go away very soon. Look at what OpenAI is building with &lt;a href=&quot;https://openai.com/sora/&quot;&gt;Sora&lt;/a&gt;, and you’ll have a glimpse of the future.&lt;/p&gt;
&lt;p&gt;People are already leveraging this tech to move to the next step: creating content, communities, and creators that do not exist in real life. Instagram and OnlyFans are seeing a tsunami of AI-based girls managed by digital pimps. Does it work? It sure does; look at the numbers.&lt;/p&gt;
&lt;p&gt;This is where many people start to get confused and want to draw a line based on morals, or on their belief that this model will not apply to “regular” content creation.&lt;/p&gt;
&lt;p&gt;I believe this is false; it’s already happening.&lt;/p&gt;
&lt;h2&gt;A Glimpse into the Future&lt;/h2&gt;
&lt;p&gt;People often argue that having AI-generated content from a content creator would feel inauthentic and that they wouldn’t watch it. I say that this is having a very high opinion of your brain and little faith in the evolution of AI.&lt;/p&gt;
&lt;p&gt;What if I told you that MrBeast did not exist? You’d say, of course, he does! Really? How can you know he exists? Did you ever meet him in real life? Did you ever talk to him?&lt;/p&gt;
&lt;p&gt;What if, tomorrow, you connected to YouTube and saw 10 new MrBeast videos with fancy new ideas that fit your taste and appealed to your brain? They might or might not be AI-generated; either way, you’d have a good time.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/images/blog/25d2cc69-bd29-4a95-b9c5-6a0f8b55e502_1376x864.png&quot; alt=&quot;Illustration of AI-generated content creators indistinguishable from real people&quot; /&gt;&lt;/p&gt;
&lt;p&gt;Now, let’s take a step back and imagine having a brand new content creator everyone’s talking about. Nobody heard of them before. You watch the content, and you like it. Does this person really exist, or is it just an AI? How would you ever know? There might be rumors that a friend of a friend met him in a restaurant… but is that the reality?&lt;/p&gt;
&lt;p&gt;At some point, there will be no way to know if a content creator is a real person or not. As time passes and technology evolves, it will be close to impossible to distinguish human creation from AI creation — &lt;a href=&quot;https://julien.danjou.info/blog/the-synthetic-wave-is-already-here&quot;&gt;the synthetic wave is already here&lt;/a&gt;. This is what most people don&apos;t want to believe, because it disrupts their current reality too much.&lt;/p&gt;
&lt;p&gt;Believe it or not, it’s happening.&lt;/p&gt;
&lt;h2&gt;How Platforms Might Crash&lt;/h2&gt;
&lt;p&gt;The ability to generate endless streams of AI-driven content will undoubtedly transform platforms like Instagram, YouTube, and LinkedIn. In the short term, the appeal of hyper-tailored, dopamine-driven content may captivate users and drive unprecedented engagement.&lt;/p&gt;
&lt;p&gt;But at what cost?&lt;/p&gt;
&lt;p&gt;As AI-generated content floods these platforms, the lines between human connection and algorithmic interaction will blur. The authenticity that once set content creators apart—real people sharing real experiences—will be diluted in a sea of indistinguishable, machine-generated personas. Even if platforms introduce measures like “human-verified” badges, the deeper question remains: will people still care? If the content entertains, informs, or inspires, does its origin matter?&lt;/p&gt;
&lt;p&gt;This shift could erode one of social media&apos;s fundamental purposes: fostering connection. If users begin to see platforms as spaces dominated by machines rather than humans, the sense of community these platforms once provided may crumble. The allure of authentic interaction—the very reason social media exploded in the first place—could fade, leaving behind a world where “social” media is anything but social.&lt;/p&gt;
&lt;p&gt;This trend raises profound questions in the broader societal context. Will our online spaces become environments where we primarily engage with algorithms instead of people? As AI infiltrates every email, phone call, and comment, will technology become a tool for connection or a barrier to it?&lt;/p&gt;
&lt;p&gt;Perhaps this is where the pendulum swings back to real life. In a world saturated with AI interactions, the simplest moments of human connection—a conversation over coffee, a shared laugh, or a face-to-face debate—might become rare and precious. Paradoxically, as AI dominates the digital realm, it could reignite our desire for genuine human interaction in the physical world.&lt;/p&gt;
&lt;p&gt;Until then, the question isn’t whether AI-generated content will dominate—it’s how we, as creators and consumers, will adapt and what we’ll choose to value in an increasingly artificial landscape.&lt;/p&gt;
&lt;p&gt;My prediction is that real life will be the only place you’ll have left to interact with real humans.&lt;/p&gt;
&lt;p&gt;Until robots take over, of course.&lt;/p&gt;
</content:encoded></item><item><title>Aligning Project Management with Company&apos;s Values</title><link>https://julien.danjou.info/blog/aligning-project-management-with/</link><guid isPermaLink="true">https://julien.danjou.info/blog/aligning-project-management-with/</guid><description>Aligning Project Management with Company&apos;s Values</description><pubDate>Tue, 10 Dec 2024 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;When Mehdi and I co-founded Mergify, we didn’t just set out to create a great product—we wanted to build a company that reflected our values. Part of that journey has been rethinking project management, a task informed by years of working in organizations where “Agile” had become synonymous with bureaucracy. What started as a flexible, team-oriented methodology often felt bogged down by rituals that added complexity without delivering real results.&lt;/p&gt;
&lt;p&gt;Over the past year, we’ve refined our project management approach at Mergify, shaping it to fit not only the needs of our team but also our belief in simplicity, ownership, and autonomy. Here’s the story of how we got there—and why building a workflow that aligns with your values matters as much as the work itself.&lt;/p&gt;
&lt;h2&gt;&lt;strong&gt;My Journey Through Agile Overload&lt;/strong&gt;&lt;/h2&gt;
&lt;p&gt;When I look back at my experiences with Agile in various organizations, one story sticks out. Early in my career, I joined a team at &lt;a href=&quot;https://redhat.com&quot;&gt;Red Hat&lt;/a&gt;, one of the most functional, productive groups I’ve ever worked with. We didn’t rely on heavy Scrum processes; instead, we used a lightweight Kanban board and had stand-ups over &lt;a href=&quot;https://en.wikipedia.org/wiki/IRC&quot;&gt;IRC&lt;/a&gt;. It wasn’t fancy, but it worked. We focused on the work, not the process.&lt;/p&gt;
&lt;p&gt;Contrast that with some other teams I observed. One team, with over 20 members, struggled to maintain a sense of ownership. The &lt;a href=&quot;https://www.theguardian.com/technology/2018/apr/24/the-two-pizza-rule-and-the-secret-of-amazons-success&quot;&gt;two-pizza rule&lt;/a&gt;, famously touted by Jeff Bezos, couldn’t have been more relevant here: a team too big to share two pizzas is often too big to stay effective. Communication is complicated, and the sense of ownership disappears.&lt;/p&gt;
&lt;p&gt;Yet management decided to unify everyone under a single Scrum process, complete with daily stand-ups, sprints, retrospectives, and poker planning.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/images/blog/4c8fc2f7-0ba1-4d1f-8bd8-35af6390654e_1376x864.webp&quot; alt=&quot;Illustration of a large team struggling under heavy Scrum processes&quot; /&gt;&lt;/p&gt;
&lt;p&gt;That did not fix the problem of the 20-person team. Even my high-functioning team began to falter under the weight of unnecessary rituals.&lt;/p&gt;
&lt;p&gt;It was a powerful lesson: the process isn’t inherently good or bad but must serve the people doing the work.&lt;/p&gt;
&lt;h2&gt;&lt;strong&gt;Starting Mergify with a Blank Slate&lt;/strong&gt;&lt;/h2&gt;
&lt;p&gt;When we started Mergify, we wanted to avoid the traps of over-engineering our processes. We began with almost no structure—just Slack messages and quick syncs to stay aligned. As the team grew, we added a daily stand-up. For a remote-first company, these short, synchronous check-ins were critical for maintaining a shared understanding, even as most of our work remained asynchronous.&lt;/p&gt;
&lt;p&gt;Instead of following Agile dogma, we opted for a Kanban approach. Tasks moved naturally across the board with minimal friction. We didn’t bother with two-week sprints or strict velocity tracking; we let the workflow dictate the process.&lt;/p&gt;
&lt;p&gt;And it worked, for a while.&lt;/p&gt;
&lt;h2&gt;&lt;strong&gt;Why Lightweight Isn’t Always Enough&lt;/strong&gt;&lt;/h2&gt;
&lt;p&gt;Over time, cracks began to appear. One issue was ownership: who was responsible for creating cards on the Kanban board? Engineers who weren’t involved in defining tasks felt disconnected from the problem, treating the cards as instructions rather than opportunities to solve meaningful challenges. The person creating the card and the person doing the work weren’t always on the same page.&lt;/p&gt;
&lt;p&gt;Another challenge was the endless backlog without a clear sense of what we were building or why; tasks accumulated, and the act of moving cards felt less like progress and more like treading water. The team craved a greater sense of accomplishment—a way to see their impact beyond the daily grind.&lt;/p&gt;
&lt;h2&gt;&lt;strong&gt;Evolving to a Project-Driven Workflow&lt;/strong&gt;&lt;/h2&gt;
&lt;p&gt;To address these issues, we introduced a project-driven layer to our workflow. Projects became our new organizing principle: scoped pieces of work that could be completed in two to four weeks. Each project was defined by three key elements: a &lt;strong&gt;brief&lt;/strong&gt;, a &lt;strong&gt;lead&lt;/strong&gt;, and a &lt;strong&gt;deadline&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;1. &lt;strong&gt;The Brief&lt;/strong&gt;:&lt;/p&gt;
&lt;p&gt;The brief outlined the problem, the goals, and the context for the project. It provided enough structure to guide the engineer while leaving room for creativity. Engineers weren’t just implementers—they were collaborators, shaping the solution as they worked.&lt;/p&gt;
&lt;p&gt;2. &lt;strong&gt;The Lead&lt;/strong&gt;:&lt;/p&gt;
&lt;p&gt;Every project had a designated lead who was responsible for tracking progress and ensuring the work stayed on course. This wasn’t about assigning blame; it was about having a clear point of contact who could raise blockers, answer questions, and coordinate efforts.&lt;/p&gt;
&lt;p&gt;3. &lt;strong&gt;The Deadline&lt;/strong&gt;:&lt;/p&gt;
&lt;p&gt;Deadlines were less about pressure and more about focus. They encouraged engineers to make trade-offs, prioritize effectively, and avoid over-engineering. If something couldn’t be completed within the timeframe, we adjusted the scope or deferred less critical elements.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/images/blog/ad9c358b-c1d4-4e07-ba08-e11ebf84bd06_1376x864.webp&quot; alt=&quot;Illustration of project-driven workflow with briefs, leads, and deadlines&quot; /&gt;&lt;/p&gt;
&lt;h2&gt;&lt;strong&gt;What We Gained&lt;/strong&gt;&lt;/h2&gt;
&lt;p&gt;The shift to project-driven work transformed how we operated. It gave engineers a sense of ownership and allowed us to ship faster, avoiding the dreaded “tunnel effect” where nothing tangible gets delivered for months. It also helped us align our priorities, ensuring that every project contributed meaningfully to our goals.&lt;/p&gt;
&lt;p&gt;This system wasn’t just about productivity but about creating a culture where engineers felt empowered and connected to their work. It reinforced our belief that processes should serve people, not the other way around.&lt;/p&gt;
&lt;p&gt;A few people I talked to about our system thought it resembled the &lt;a href=&quot;https://basecamp.com/shapeup&quot;&gt;Shape Up&lt;/a&gt; methodology from Basecamp. I think it does have similarities, except that we’re a small company, meaning we don’t have enough teams to model it exactly yet. But that’s definitely a methodology that resonates with us.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/images/blog/dea4170d-6f82-4e9f-9d0d-523c0ac8f4e5_2000x729.png&quot; alt=&quot;Diagram of the Shape Up methodology from Basecamp&quot; /&gt;
&lt;em&gt;The Shape Up methodology&lt;/em&gt;&lt;/p&gt;
&lt;h2&gt;&lt;strong&gt;Final Thoughts&lt;/strong&gt;&lt;/h2&gt;
&lt;p&gt;Looking back, the changes we made weren’t just about fixing problems—they were about staying true to our values. At Mergify, we believe in autonomy, ownership, and simplicity, and our workflow reflects those principles.&lt;/p&gt;
&lt;p&gt;If you’re struggling with your own project management processes, ask yourself: do they serve your team’s needs, or are they just there because “that’s how it’s done”? (I wrote about a related frustration in &lt;a href=&quot;https://julien.danjou.info/blog/the-problem-with-okrs-isnt-okrs&quot;&gt;The Problem with OKRs Isn’t OKRs&lt;/a&gt;.) The best workflows aren’t the most popular—they’re the ones that align with your culture and empower your people to do their best work.&lt;/p&gt;
&lt;p&gt;At Mergify, we’re proud of the system we’ve built, and we’re excited to keep evolving it as we grow.&lt;/p&gt;
&lt;/content:encoded&gt;&lt;/item&gt;&lt;item&gt;&lt;title&gt;SaaS and Work-based Pricing&lt;/title&gt;&lt;link&gt;https://julien.danjou.info/blog/saas-and-work-based-pricing/&lt;/link&gt;&lt;guid isPermaLink=&quot;true&quot;&gt;https://julien.danjou.info/blog/saas-and-work-based-pricing/&lt;/guid&gt;&lt;description&gt;Is this the future?&lt;/description&gt;&lt;pubDate&gt;Tue, 19 Nov 2024 00:00:00 GMT&lt;/pubDate&gt;&lt;content:encoded&gt;&lt;p&gt;Despite the rising popularity of work-based pricing in SaaS, Mergify is sticking with seat-based pricing—for now. Here’s why we believe it’s the right choice for our product, for our customers, and for their ability to budget with confidence.&lt;/p&gt;
&lt;h2&gt;&lt;strong&gt;The Work-Based Pricing Trend&lt;/strong&gt;&lt;/h2&gt;
&lt;p&gt;Following the latest SaaS pricing trends is tempting, especially when they align so well with a powerful concept: customers pay in proportion to the value they receive. In recent months, “work-based” pricing has gained traction across the industry, especially in AI-driven applications where you pay per task completed. It’s a straightforward exchange: resolve a customer’s problem and receive $1.&lt;/p&gt;
&lt;p&gt;Simple, clear, and highly attractive.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/images/blog/bde84986-3269-430e-a397-d7970e372fbd_1376x864.webp&quot; alt=&quot;Illustration of work-based pricing in SaaS&quot; /&gt;&lt;/p&gt;
&lt;p&gt;When we first considered revisiting Mergify’s pricing for next year, work-based pricing seemed like a model worth exploring. Imagine a setup where every task accomplished by Mergify correlated directly to the value we delivered to our customers. More value equals more payment, a transparent exchange that clients find easy to understand. But, as we dove deeper, we realized that the nature of Mergify’s work doesn’t fit so neatly into this model.&lt;/p&gt;
&lt;p&gt;And that might be true for your SaaS as well.&lt;/p&gt;
&lt;h2&gt;&lt;strong&gt;The Value Clarity of Work-Based Pricing&lt;/strong&gt;&lt;/h2&gt;
&lt;p&gt;Work-based pricing makes perfect sense for certain SaaS products. Consider the AI-driven customer support solutions rolling out recently, like &lt;a href=&quot;https://www.salesforce.com/agentforce/pricing/&quot;&gt;Salesforce’s $2-per-conversation approach&lt;/a&gt; or &lt;a href=&quot;https://www.intercom.com/help/en/articles/8205718-fin-ai-agent-resolutions&quot;&gt;Intercom’s Fin at $0.99 per resolved conversation&lt;/a&gt;. The pricing here is inherently appealing because it aligns precisely with the customer’s perception of value: for every problem resolved, they see a clear, direct benefit to their end users, who leave satisfied and engaged.&lt;/p&gt;
&lt;p&gt;This clarity makes it easy for a customer to decide—they’re paying to solve a specific pain point for their users, and each solved interaction has a measurable outcome. The more conversations are solved, the more they pay; it feels like a no-brainer.&lt;/p&gt;
&lt;p&gt;Customers see exactly where their money is going, and it scales beautifully alongside their growth: the more users they have, the more problems they have, and the more value you can provide. But also: the more users they have, the more money they have, so they’re happy with giving you a part of it. Everything can grow at the same rate, from the customer’s business size to the value your SaaS provides.&lt;/p&gt;
&lt;p&gt;This reminds me of the early days of cloud computing when Amazon Web Services introduced usage-based billing. The more your business grew, the more infrastructure you used, and the higher the bill. It made perfect sense, and that was one of the reasons for its success and early adoption by startups.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/images/blog/8928f8ae-0c2c-4992-8a82-0c7739b205be_1376x864.png&quot; alt=&quot;Illustration of usage-based billing scaling with business growth&quot; /&gt;&lt;/p&gt;
&lt;h2&gt;&lt;strong&gt;Why Work-Based Pricing Doesn’t Always Fit&lt;/strong&gt;&lt;/h2&gt;
&lt;p&gt;At Mergify, we aimed to find the same direct correlation between our value and our pricing model. However, our reality is different: the value we provide is primarily to software engineers who use Mergify to streamline their CI/CD workflows. It’s an incredibly powerful tool, but it doesn’t impact our customers’ end-users in an obvious way. This makes work-based pricing challenging because we’re one step removed from the end-user experience — and, therefore, from the business.&lt;/p&gt;
&lt;p&gt;We brainstormed possible approaches. Perhaps we could charge per pull request made by developers. After all, that’s a major part of what Mergify automates and where it provides value. But we immediately saw a problem: this model could encourage the wrong behavior. If every pull request carries a charge, teams might try to reduce the number of pull requests to save on costs, potentially compromising code quality. As engineers ourselves, we value clean, &lt;a href=&quot;https://en.wikipedia.org/wiki/Atomic_commit&quot;&gt;atomic commits&lt;/a&gt; and easy reviews. A per-pull-request charge could discourage these practices, creating friction between the optimal workflow and our pricing model.&lt;/p&gt;
&lt;p&gt;Another challenge with work-based pricing is the difficulty it presents for customers when predicting their usage. Most organizations set budgets a year in advance, and it’s nearly impossible for a team to accurately estimate the number of pull requests or jobs they’ll need in a year. This unpredictability makes budgeting stressful and challenging, especially for engineers seeking approval for software expenses. With seat-based pricing, customers know their costs upfront, which aligns much better with annual budget planning.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/images/blog/37914459-6600-46b0-a064-bd6144fb49fe_1376x864.png&quot; alt=&quot;Illustration of the challenges of per-unit pricing for developer tools&quot; /&gt;&lt;/p&gt;
&lt;p&gt;We also considered a per-job charge based on CI (Continuous Integration) runs, but this quickly ran into similar issues. Running a high volume of CI jobs often results in better software quality. Charging per job could lead to reduced testing—a problematic incentive in an industry where quality matters deeply. CI providers charge per job because of the required computing power, but in Mergify’s case, we don’t incur comparable costs. So, a per-job charge wouldn’t reflect our real costs and could end up discouraging best practices.&lt;/p&gt;
&lt;h2&gt;&lt;strong&gt;The Challenge of Finding the Right Fit&lt;/strong&gt;&lt;/h2&gt;
&lt;p&gt;In a perfect world, we’d find a model where Mergify’s pricing scaled directly with the customer’s perceived value. However, we found that our usage patterns, like those of many SaaS products, do not align with the benefits of the work-based approach. To truly capture the value Mergify provides, we realized that seat-based pricing remains our best option, at least for now.&lt;/p&gt;
&lt;p&gt;By charging per user (seat), we avoid influencing developer behavior, allowing them to use Mergify to improve their workflow without second-guessing how often they use it. We also want to avoid the stress of unpredictable costs. Seat-based pricing allows our clients to budget accurately and avoid unexpected expenses, aligning our pricing model with their planning cycles and offering peace of mind. As we continue to build Mergify, our goal remains to be the trusted tool in the hands of developers.&lt;/p&gt;
&lt;p&gt;Right now, that means a seat-based model, which keeps our focus where it belongs—on supporting teams to do their best work, no strings attached.&lt;/p&gt;
&lt;p&gt;This decision wasn&apos;t easy, and we might revisit it in the future. Pricing is a delicate balance between the customer&apos;s experience, the product&apos;s value, and the company&apos;s needs. I wrote more about &lt;a href=&quot;https://julien.danjou.info/blog/saas-pricing-is-hard&quot;&gt;our full pricing journey at Mergify&lt;/a&gt; — from copying GitHub&apos;s model to active user billing.&lt;/p&gt;
</content:encoded></item><item><title>The Engineer’s Dilemma: What We Did Right at Mergify</title><link>https://julien.danjou.info/blog/the-engineers-dilemma-what-we-did/</link><guid isPermaLink="true">https://julien.danjou.info/blog/the-engineers-dilemma-what-we-did/</guid><description>A classic mistake that many tech founders make.</description><pubDate>Tue, 05 Nov 2024 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;In the early days of Mergify, the journey Mehdi and I embarked on wasn’t unique. In fact, it’s a tale as old as time for engineer founders: a couple of smart engineers, passionate about technology, with an exciting vision to revolutionize their space. We had everything we needed—or so we thought: the knowledge, the technical expertise, and the drive to build something incredible.&lt;/p&gt;
&lt;p&gt;This is where the story of Mergify begins, but what often happens next is a classic mistake that many tech founders make. They build a beautiful, feature-packed product that doesn’t solve a real problem—or even worse, it solves a problem no one has. This is the hard lesson that too many engineers learn too late, and it could have easily been us.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/images/blog/21e071bd-e668-4aa6-a22d-3c96ff16fa8a_2000x1256.jpeg&quot; alt=&quot;Illustration of engineer founders building a startup together&quot; /&gt;&lt;/p&gt;
&lt;p&gt;But we managed to steer our ship in a different direction, and looking back, there are key things we did right. Let me take you through that journey.&lt;/p&gt;
&lt;h2&gt;&lt;strong&gt;The Temptation of the Tech&lt;/strong&gt;&lt;/h2&gt;
&lt;p&gt;For any engineer, there’s nothing more fun than building. The thrill of creating a new feature, optimizing your product, or pushing out updates can become intoxicating. Mehdi and I felt this pull strongly when we started Mergify. We had big ideas, a packed roadmap, and technical solutions that we were eager to implement.&lt;/p&gt;
&lt;p&gt;But here’s the thing: technology, while critical, is only a part of building a successful SaaS business. We could have easily fallen into the trap of focusing solely on the tech and neglecting the most important piece of the puzzle: the customer.&lt;/p&gt;
&lt;p&gt;We both had to confront the reality that building an amazing product wasn’t enough. If we didn’t speak to our customers, understand their pain points, and really get to the heart of their problems, we were going to fail.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;This is where so many engineer-founded startups stumble.&lt;/strong&gt; They fall in love with their technology rather than falling in love with solving the customer’s problem. Luckily, we managed to recognize this early on.&lt;/p&gt;
&lt;h2&gt;&lt;strong&gt;Engineers Need to Talk to People&lt;/strong&gt;&lt;/h2&gt;
&lt;p&gt;Talking to customers doesn’t come naturally to most engineers—it didn’t for us either—but it was a necessary step. While it’s tempting to stay behind your keyboard, tweaking code or adding features, the real magic happens when you step out and listen to what your customers are saying. What do they struggle with? What would make their lives easier? What’s keeping them up at night?&lt;/p&gt;
&lt;p&gt;I remember one story about two bright engineers who reached out to me on LinkedIn, seeking advice. We decided to meet at a bar. They had spent three years working on a fantastic piece of technology but hadn’t seen any traction. Why? Because they hadn’t built it with a customer in mind. The tech was solid, but it didn’t solve any real problem. They hadn’t spent time talking to users, and so the product existed in a vacuum.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;You&apos;ve got to start with the customer experience and work backwards to the technology&lt;/em&gt; — Steve Jobs (1997)&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;With Mergify, we knew that if we were going to build something with a lasting impact, we needed to constantly engage with our community and understand their problems. It wasn’t enough to have a great piece of technology—we had to have a great solution to a real problem.&lt;/p&gt;
&lt;h2&gt;&lt;strong&gt;Shifting Roles for Success&lt;/strong&gt;&lt;/h2&gt;
&lt;p&gt;One of the smartest things Mehdi and I did early on was divide our roles clearly. We knew that if both of us were deep in the tech, Mergify would never succeed. So, while Mehdi stayed focused on building the product, I took on the role of sales, marketing, and customer interaction.&lt;/p&gt;
&lt;p&gt;For an engineer, stepping into these roles can be uncomfortable at first. Sales? Marketing? Communication? These aren’t things they teach you in computer science class. But it was a necessary shift and one that paid off.&lt;/p&gt;
&lt;p&gt;I drew from my experience selling self-published books. I knew that just because you write something doesn’t mean people will read it. You have to market it, spread the word, and get it into the hands of people who need it. The same principle applied to Mergify. We couldn’t just build features and expect users to come flocking. We had to sell it, promote it, and make it known. This is still something we need to do to this day.&lt;/p&gt;
&lt;h2&gt;&lt;strong&gt;The Hard Truth&lt;/strong&gt;&lt;/h2&gt;
&lt;p&gt;The truth is, you can have the best technology in the world, but if you don’t have customers, it’s worthless.&lt;/p&gt;
&lt;p&gt;I remember another encounter at a wedding, where the bride&apos;s father introduced me to his nephew—a tech entrepreneur. The moment I heard him describe his startup, I already knew what was wrong. “You’re not selling anything, are you?” I asked. The bride’s father looked at me, astonished by the boldness of my assumption. And sure enough, he wasn’t. He and his co-founder, both engineers, had spent their time adding features instead of learning how to sell their product. He admitted that I was not the first to tell them it was a recipe for disaster.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/images/blog/0680ef2e-f6d0-48f5-ae52-ed0422a8e382_2000x1256.jpeg&quot; alt=&quot;Illustration of engineers who built great tech but forgot to sell it&quot; /&gt;&lt;/p&gt;
&lt;p&gt;At Mergify, we avoided that trap. We recognized early on that while the tech needed to be solid, the success of our business depended on our ability to market and sell it.&lt;/p&gt;
&lt;h2&gt;&lt;strong&gt;What We Did Right&lt;/strong&gt;&lt;/h2&gt;
&lt;p&gt;So, what did we do right at Mergify? We talked to our customers, really talked to them. We asked questions, learned about their challenges, and made sure we were solving their pain points. We didn’t fall in love with our technology; we fell in love with the problem. And most importantly, we divided and conquered. Mehdi stayed on the tech while I trained myself in the art of sales, marketing, and product management.&lt;/p&gt;
&lt;p&gt;These steps weren’t easy, requiring us to step outside our comfort zones, but they made all the difference. Five years later, Mergify isn’t just a successful SaaS company because we built great tech — it’s successful because we solved real problems for real people. (For a different angle on the same lesson, read &lt;a href=&quot;https://julien.danjou.info/blog/tech-is-the-easy-part&quot;&gt;Tech Is the Easy Part&lt;/a&gt;.)&lt;/p&gt;
&lt;p&gt;And that’s the real lesson for any engineer founder: focus on the problem, not the tech, and you’ll go far.&lt;/p&gt;
</content:encoded></item><item><title>There&apos;s (almost) no GitLab</title><link>https://julien.danjou.info/blog/theres-almost-no-gitlab/</link><guid isPermaLink="true">https://julien.danjou.info/blog/theres-almost-no-gitlab/</guid><description>A word on a French bias.</description><pubDate>Tue, 29 Oct 2024 00:00:00 GMT</pubDate><content:encoded>&lt;blockquote&gt;
&lt;p&gt;Do you guys support GitLab? Is there any way this can work with GitLab? Does this support merge requests?&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;No, we don’t.&lt;/p&gt;
&lt;p&gt;As I spent hours screening software engineer candidates over the last few weeks, I repeatedly answered the same question: where’s Mergify’s support for GitLab?&lt;/p&gt;
&lt;p&gt;To put things into perspective, we mostly hire in the French market, and I think it deserves some context.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/images/blog/029f09ae-ee03-4204-bc15-24a420f099f2_1376x864.webp&quot; alt=&quot;Illustration of the French bias toward self-hosted GitLab over GitHub&quot; /&gt;&lt;/p&gt;
&lt;h2&gt;Culture Bias&lt;/h2&gt;
&lt;p&gt;I’ve known software engineers for more than 20 years in the country of the baguette, and something is pretty clear. We have amazing engineers, but they suck at understanding what ROI means. Most people have no conception of the value of time, and for most average French engineers, it’s OK to spend time on anything as long as it avoids spending money (or requesting a budget).&lt;/p&gt;
&lt;p&gt;It’s not even a frugality thing; it really is just the inability to compute a basic return on investment and put a price on an hour of work.&lt;/p&gt;
&lt;p&gt;In the context of software forges, that means something: French companies, from startups to scaleups, are heavily biased towards deploying GitLab Community Edition because it’s free. They run it on a cheap bare-metal hosted server, which you can find for around $20/month.&lt;/p&gt;
&lt;p&gt;At this price, the average French software engineer will not buy a GitHub license. They’d call that a rip-off.&lt;/p&gt;
&lt;p&gt;I’ve heard it.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/images/blog/670182a1-7e8a-4579-8241-7a869b24e9e1_1375x720.jpeg&quot; alt=&quot;Fire at OVH datacenter in Strasbourg, March 2021&quot; /&gt;
&lt;em&gt;Fire at OVH datacenter in Strasbourg, March 2021&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;If you don’t factor in the time it takes to spin up and maintain the GitLab instance or the impact of having &lt;a href=&quot;https://www.reuters.com/article/world/millions-of-websites-offline-after-fire-at-french-cloud-services-firm-idUSKBN2B20NT/&quot;&gt;your server on fire&lt;/a&gt;, then, indeed, a price of $20/month is unbeatable.&lt;/p&gt;
&lt;p&gt;I remember asking one young French engineer in a startup about their GitLab instance and how they would maintain it and manage its security compliance. The answer was straightforward:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;We just run `apt upgrade` every night so we’re sure we have every security update installed.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;YMMV.&lt;/p&gt;
&lt;h2&gt;The Market Share&lt;/h2&gt;
&lt;p&gt;In April 2024, the Mergify team spent a few days at Devoxx France, the country&apos;s largest developer conference with nearly 5,000 attendees. We talked to dozens of engineers, and roughly 50% of them were using GitLab at work. Some large teams were moving away from GitLab to GitHub, but for a large majority, we were weirdos for not supporting GitLab. Their view of the market share is biased toward the French market, where GitLab may indeed have wide usage, though that usage probably doesn’t generate much revenue. Remember that the Community Edition of GitLab is &lt;em&gt;free&lt;/em&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/images/blog/ff991deb-3cd5-484c-be03-511a6ddf6833_3079x1800.webp&quot; alt=&quot;Mergify team at their booth at Devoxx France 2024&quot; /&gt;
&lt;em&gt;Mergify’s team at Devoxx France 2024&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;If we compare &lt;a href=&quot;https://telanganatoday.com/githubs-annual-revenue-run-rate-hits-2-billion-driven-by-copilot-nadella&quot;&gt;GitHub’s $2 billion revenue&lt;/a&gt; to &lt;a href=&quot;https://about.gitlab.com/press/releases/2024-03-04-gitlab-reports-fourth-quarter-and-full-fiscal-year-2024-financial-results/&quot;&gt;GitLab’s $579 million revenue&lt;/a&gt; for 2024, this is a 1:4 ratio, which is already pretty huge. Sure, revenue is not usage, but considering that GitHub has Microsoft behind it and &lt;a href=&quot;https://www.spiceworks.com/tech/tech-general/news/gitlab-explores-sale-datadog-google-potential-buyers/&quot;&gt;GitLab is reportedly looking for a buyer&lt;/a&gt;, the future looks way brighter for GitHub — something I explored in more depth in &lt;a href=&quot;https://julien.danjou.info/blog/is-github-the-future-or-becoming&quot;&gt;Is GitHub the Future?&lt;/a&gt;. And I&apos;m not even talking about the fact that the vast majority of open-source projects use GitHub.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/images/blog/fc24de53-43e4-46af-8e71-c3047d4d4bae_576x811.png&quot; alt=&quot;Screenshot of a LinkedIn comment misunderstanding GitHub vs GitLab market dynamics&quot; /&gt;
&lt;em&gt;No Gauthier, I don’t think that this is how it works.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;(&lt;a href=&quot;https://www.linkedin.com/feed/update/urn:li:activity:7254503750344622080/?commentUrn=urn%3Ali%3Acomment%3A(activity%3A7254503750344622080%2C7254531496810541056)&amp;amp;dashCommentUrn=urn%3Ali%3Afsd_comment%3A(7254531496810541056%2Curn%3Ali%3Aactivity%3A7254503750344622080)&amp;amp;dashReplyUrn=urn%3Ali%3Afsd_comment%3A(7255100415971659776%2Curn%3Ali%3Aactivity%3A7254503750344622080)&amp;amp;replyUrn=urn%3Ali%3Acomment%3A(activity%3A7254503750344622080%2C7255100415971659776)&quot;&gt;LinkedIn post&lt;/a&gt;)&lt;/p&gt;
&lt;h2&gt;Innovation&lt;/h2&gt;
&lt;p&gt;Now that I’ve set the scene, I feel it’s safe to answer the question about GitLab support for Mergify. The main reasons why Mergify’s Merge Queue won’t support GitLab anytime soon are:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Innovation happens on GitHub nowadays. Ten years ago, GitHub was behind on certain topics (hello CI), but they are now way ahead of the competition. GitHub is the place where most open-source is built and where you need to be if you’re building new products;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Most teams using GitLab CE have no intention to buy any software;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;The market share of GitLab is small and probably shrinking.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;I have nothing against GitLab; their software might be pretty good. But they are the outsider, even though much of the tech market in Europe seems to think that betting on GitLab is the best go-to-market strategy.&lt;/p&gt;
&lt;p&gt;I beg to differ.&lt;/p&gt;
</content:encoded></item><item><title>What&apos;s going on with Dependabot?</title><link>https://julien.danjou.info/blog/whats-going-on-with-dependabot/</link><guid isPermaLink="true">https://julien.danjou.info/blog/whats-going-on-with-dependabot/</guid><description>We&apos;re moving away from it and I&apos;m not sure why it started to suck.</description><pubDate>Tue, 15 Oct 2024 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;I loved Dependabot. I’ve used it since Grey Baker started it in 2017. I’ve seen it grow from a one-person shop to being acquired by GitHub in 2019. It’s been a fantastic tool that created more than 5,000 pull requests on Mergify repositories. I remember the excitement of finally having a tool that would bring all the new fancy features and bug fixes of my dependencies to my project in a snap.&lt;/p&gt;
&lt;p&gt;Dependabot allowed us to fix a lot of security updates introduced by dependencies, and to be aware of anything new being released in the libraries we use.&lt;/p&gt;
&lt;p&gt;But today, we kicked Dependabot out. Dependabot let us down.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/images/blog/0ee45b61-3086-492a-8caa-6348ae3405ad_2000x1256.jpeg&quot; alt=&quot;Illustration of frustration with Dependabot&apos;s declining reliability&quot; /&gt;&lt;/p&gt;
&lt;h2&gt;Security You Said?&lt;/h2&gt;
&lt;p&gt;GitHub’s move to acquire Dependabot was smart. It was one of the first pieces of their security roadmap, alongside the acquisition of Semmle and its CodeQL engine the same year. GitHub has become all about security over the last few years, which makes sense given the cybersecurity segment&apos;s hyper-growth.&lt;/p&gt;
&lt;p&gt;However, as Dependabot matured under GitHub, cracks started to show in what was once a flawless experience. The tool that was supposed to streamline security and updates became a source of frustration.&lt;/p&gt;
&lt;p&gt;It has major design flaws that GitHub does not seem to care about.&lt;/p&gt;
&lt;p&gt;First, Dependabot can fail silently. That happened to us multiple times a year: Dependabot would just stop working and stop creating pull requests to update our packages. You’d think that debugging such an issue would be possible by going into the Dependabot tab of your repository, but no. The log for this is actually hidden in &lt;em&gt;Insights → Dependency graph → Dependabot&lt;/em&gt;. A strange and unintuitive location for such crucial information.&lt;/p&gt;
&lt;p&gt;Once you find your log, you can then read it and debug it yourself.&lt;/p&gt;
&lt;p&gt;That’s a major problem because there’s nothing warning you that Dependabot is broken. We are used to updating our packages regularly, so we’d know, but there’s nothing preventing your dependencies and security updates from getting stale for months without you noticing. Terrible experience.&lt;/p&gt;
&lt;h2&gt;Always Lagging Behind&lt;/h2&gt;
&lt;p&gt;We’re a Python shop. We leverage &lt;em&gt;poetry&lt;/em&gt; to manage our dependencies, and we use the latest Python version in our containers.&lt;/p&gt;
&lt;p&gt;As a Python shop, staying on the latest version helps us ensure security, performance, and compatibility. So we update it as soon as we can, usually a few days after it’s released.&lt;/p&gt;
&lt;p&gt;And then Dependabot is broken.&lt;/p&gt;
&lt;p&gt;And you have to wait weeks for GitHub to fix the problem.&lt;/p&gt;
&lt;p&gt;The last few times, we had to update Dependabot ourselves, &lt;a href=&quot;https://github.com/dependabot/dependabot-core/pull/10470&quot;&gt;as shown here&lt;/a&gt; or even &lt;a href=&quot;https://github.com/dependabot/dependabot-core/pull/10622&quot;&gt;here&lt;/a&gt;. We’re basically doing GitHub’s job for free, maintaining the Dependabot database ourselves for all their customers.&lt;/p&gt;
&lt;p&gt;We contacted GitHub support about this already, and they did not care at all. Their laconic answer was, “Wait for it to be updated.”&lt;/p&gt;
&lt;p&gt;Sure, thank you. We’re the ones doing the updates.&lt;/p&gt;
&lt;p&gt;I get it—maybe Fortune 500 companies don’t care about the latest Python micro releases. But for startups like ours? It’s a big deal.&lt;/p&gt;
&lt;p&gt;So today, we got rid of Dependabot and replaced it with &lt;a href=&quot;https://docs.renovatebot.com/&quot;&gt;Renovate&lt;/a&gt;. It seems better maintained and supports a larger package ecosystem than Dependabot. So far, it has simplified our workflow and is not broken on a simple Python micro update. 🤞&lt;/p&gt;
&lt;p&gt;We&apos;re also adding support for Renovate in &lt;a href=&quot;https://mergify.com&quot;&gt;Mergify&lt;/a&gt; Merge Protections, as we have done for Dependabot in the past. That will ensure you can write advanced rules for &lt;a href=&quot;https://julien.danjou.info/blog/automating-github-workflows&quot;&gt;automating your GitHub workflows&lt;/a&gt;, including automatically merging your dependency update. 🦾 Let me know if you’re interested in trying it out!&lt;/p&gt;
</content:encoded></item><item><title>Why Stock Options are Terrible for Employees</title><link>https://julien.danjou.info/blog/why-stock-options-are-terrible-for/</link><guid isPermaLink="true">https://julien.danjou.info/blog/why-stock-options-are-terrible-for/</guid><description>Not everyone&apos;s ready for it.</description><pubDate>Tue, 08 Oct 2024 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Imagine you’re joining a startup, excited by the potential to share in its future success. You’ve just been offered stock options. What do they actually mean for you?&lt;/p&gt;
&lt;p&gt;You’re not sure if you need to negotiate these terms, or even what they imply. Once you’re in, someone mentions them during a coffee break, and you’re confused all over again. Whatever—you decide to see how it plays out. Maybe you’ll get filthy rich without understanding why. Maybe you’ll get screwed over. Who knows?&lt;/p&gt;
&lt;p&gt;As a tech employee, tech entrepreneur, and investor, I’ve had my share of discussions with fellow engineers and founders about stock options over the last decade.&lt;/p&gt;
&lt;p&gt;Stock options are great leverage for founders to share future value with their (early) employees. The intention behind them is noble, but using them comes with serious drawbacks.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/images/blog/70ad1f91-a8b1-4c7d-8893-f17abe53ef96_2000x1121.jpeg&quot; alt=&quot;Illustration of stock options as employee compensation&quot; /&gt;&lt;/p&gt;
&lt;h2&gt;Understanding What You Get&lt;/h2&gt;
&lt;p&gt;When I joined &lt;a href=&quot;https://julien.danjou.info/blog/making-the-jump&quot;&gt;Datadog&lt;/a&gt; a few years ago, they offered different packages that you had to choose from:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Option 1: a small salary, a large number of stock options;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Option 2: a medium salary, a medium number of stock options;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Option 3: a large salary and a small number of stock options.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;I believe many tech companies use the same template.&lt;/p&gt;
&lt;p&gt;A few months after I joined, the company IPO’ed. I went to a coffee break and chatted with one of my colleagues. The recent IPO and the rising stock values sparked the conversation. They asked me, “Which option did you pick when you joined?” I replied that I was looking for a venture back then, focusing on wealth creation, so I picked option 1—the salary was enough for me to live.&lt;/p&gt;
&lt;p&gt;They replied that they now regretted picking option 3.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/images/blog/a7d775cf-d226-43aa-bae7-c4e24029e4c4_959x639.jpeg&quot; alt=&quot;Illustration of employee regret over choosing salary over stock options&quot; /&gt;&lt;/p&gt;
&lt;p&gt;I suspect that most employees were more attracted to larger salaries than stock options. I wish somebody from HR could confirm the stats about what I observed, but I doubt they would. The truth is, even HR did not understand what they were doing.&lt;/p&gt;
&lt;p&gt;When I got offered the 3 packages above, I asked what the value of the stock options was. Stock options are &lt;em&gt;options&lt;/em&gt; (no shit), meaning they offer you the right to buy shares of a company. My natural question was: what was the strike price of these options, what was the company&apos;s valuation back then, and what was the total number of outstanding shares?&lt;/p&gt;
&lt;p&gt;The recruiter’s answer was: 😳&lt;/p&gt;
&lt;p&gt;Nonetheless, I insisted, and they escalated my question to an HR director. The director ended up reaching out to the US side of the company. They finally showed me a spreadsheet with the numbers I was asking for, easing my final decision.&lt;/p&gt;
&lt;p&gt;The lesson? Always ask for clarity on the value of your options before deciding. If even the recruiter is unsure, push until you understand what you’re getting.&lt;/p&gt;
&lt;h2&gt;Not Everyone is an Investor&lt;/h2&gt;
&lt;p&gt;If you ask people what investing is all about, they’ll reply with “finance and maths.” That’s true, but it’s only one part of the equation. Investing is also a very &lt;strong&gt;psychological&lt;/strong&gt; discipline; not everyone’s ready for that.&lt;/p&gt;
&lt;p&gt;When I was working at Red Hat, the company used to distribute RSU (&lt;em&gt;Restricted Stock Units&lt;/em&gt;) to its employees. RSUs are basically free stocks. You’d get a bunch of those every year and could keep them, therefore becoming a shareholder, or sell them on the market and get cash in exchange.&lt;/p&gt;
&lt;p&gt;As the company was on a growth trajectory back then, the stock would keep climbing every month or so.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/images/blog/76c9d288-c8fa-4b29-a64d-ae1f35144170_730x550.png&quot; alt=&quot;Red Hat stock price chart showing growth before the IBM acquisition in 2019&quot; /&gt;
&lt;em&gt;Red Hat stock history before being bought by IBM in 2019.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;Most people I worked with were young engineers and not investors. They had no clue why the stock would fluctuate.&lt;/p&gt;
&lt;p&gt;One day, I was chatting with a colleague over a beer, and they seemed worried. I asked what was wrong, and they explained that they had lost thousands of dollars over the last month because Red Hat stocks had plunged. They had received RSUs over the last years and were so happy with their growth that they never sold them and didn’t think they could crash. Now, they were upset and did not know what to do.&lt;/p&gt;
&lt;p&gt;Of course, they did not know what to do.&lt;/p&gt;
&lt;p&gt;They had no way to know if the stock was overvalued or undervalued.&lt;/p&gt;
&lt;p&gt;They had no idea if the market would continue its bull run or if a bear market was upcoming.&lt;/p&gt;
&lt;p&gt;They weren’t investors. They had no plan, and the uncertainty left them feeling lost.&lt;/p&gt;
&lt;h2&gt;How to Approach Stock Options as an Employee&lt;/h2&gt;
&lt;p&gt;Owning stock options (or RSUs) means one thing: you are becoming an investor. It means that you now own (potentially, in the case of options) a part of a company with a &lt;em&gt;market value&lt;/em&gt; and an &lt;em&gt;intrinsic value&lt;/em&gt;. If you cannot assess the value or form an opinion on your company’s stock, you probably shouldn’t be a shareholder.&lt;/p&gt;
&lt;p&gt;I know. It’s not your fault. You didn’t ask for it, but this is what happens if you stick to your stock. You are transformed.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/images/blog/89f84310-81fd-4016-9338-0a082c5f1569_696x239.png&quot; alt=&quot;Airbnb stock price chart illustrating the dilemma of when to sell&quot; /&gt;
&lt;em&gt;As an employee, should you keep or sell your Airbnb Inc. stock? When do you sell? Why?&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;→ Once you vest your RSUs, if you don’t sell them immediately in exchange for cash, you’re effectively becoming an investor.&lt;/p&gt;
&lt;p&gt;→ When a liquidity event occurs, and you choose not to exercise your stock options, you are deciding to become an investor.&lt;/p&gt;
&lt;p&gt;It’s great to be an investor, for sure. But that comes with many questions: what’s my allocation policy? What’s my exit strategy? What is my risk tolerance? What is my investment horizon? What are the tax implications? How well do I understand the business model? Do I trust the leadership team? What’s my plan if things go wrong? How does this investment fit into my overall financial plan?&lt;/p&gt;
&lt;p&gt;Investing isn’t just about understanding finances; you also need to handle fear, greed, and uncertainty. &lt;strong&gt;Most people aren’t ready for the psychological strain of watching their savings plummet overnight.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/images/blog/e3d5afa6-183c-4fe3-9f2c-6701f17c779d_690x420.png&quot; alt=&quot;CrowdStrike stock chart showing a 50% drop during the 2024 incident&quot; /&gt;
&lt;em&gt;CrowdStrike shares lost 50% of their value during the incident in 2024.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;Stock options can be a powerful motivator, but they aren’t for everyone. Next time you’re offered stock options, think like an investor—ask the tough questions, seek clarity, and ensure you’re prepared for both the financial and emotional aspects of investing.&lt;/p&gt;
&lt;p&gt;If you then plan to stick to your stock, decide whether you truly want to become an investor and learn the trade. If you don’t, it’s fine; just get your money and enjoy it. ✌️&lt;/p&gt;
</content:encoded></item><item><title>Why You Need Product Engineers</title><link>https://julien.danjou.info/blog/why-you-need-product-engineers/</link><guid isPermaLink="true">https://julien.danjou.info/blog/why-you-need-product-engineers/</guid><description>And not software engineers</description><pubDate>Tue, 01 Oct 2024 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;A couple of weeks ago, I attended our quarterly MAHOS (&lt;em&gt;Mergify All Hands On-Site&lt;/em&gt;)—an event where we gather the whole &lt;a href=&quot;https://mergify.com&quot;&gt;Mergify&lt;/a&gt; team together for a week—and gave a small speech about &lt;em&gt;product engineers&lt;/em&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/images/blog/7feac5ff-7567-46db-b3e8-361202fe8ea2_3799x2622.jpeg&quot; alt=&quot;Mergify team group photo, September 2024&quot; /&gt;
&lt;em&gt;Mergify’s team, Sept. 2024&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;I was unsure whether I had coined the term &lt;em&gt;product engineer&lt;/em&gt; at the time or if it already existed. After Googling, I found that I was not the only one who realized that we no longer need &lt;em&gt;software engineers&lt;/em&gt;.&lt;/p&gt;
&lt;p&gt;When we started our venture a few years ago and decided to hire engineers, we naturally looked for software engineers. We found great engineers. They learned a ton of stuff working with us over the last couple of years and became very efficient at producing code. Awesome. I wrote last month about how to become &lt;a href=&quot;https://julien.danjou.info/p/how-to-be-a-great-software-engineer&quot;&gt;a great software engineer&lt;/a&gt;, and while I hold to this, the next step in your career, if you want to work in a product-oriented startup, is to become a &lt;em&gt;product engineer&lt;/em&gt;.&lt;/p&gt;
&lt;h2&gt;What is a Product Engineer?&lt;/h2&gt;
&lt;p&gt;Is that just a software engineer building a product? Yes! But that is not what a software engineer does by default. Let me tell you an anecdote.&lt;/p&gt;
&lt;p&gt;Last month, with my product owner hat on, I wrote a user story explaining one of the changes we needed to make in Mergify: feature X is enabled by default in the product, which is annoying because most users do not need it. We need to 1. allow users to enable or disable it and 2. make it disabled by default.&lt;/p&gt;
&lt;p&gt;One of our software engineers picks the ticket and implements a solution. Here’s what they do:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;The feature X can be enabled or disabled;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;The feature X is disabled by default;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;An error message warns the user constantly that feature X is currently disabled and that they need to enable it to have it work. There’s a giant red banner to warn users that feature X is disabled—until the user has enabled the feature.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;When I see that, I’m really confused. The code is great, and the ticket is indeed implemented, but the last part is terrible from a user experience perspective. It forces the user to enable feature X to get rid of the warning, meaning we get back to a point where users have to enable feature X by default, even if they don’t need it, just because they’re confused.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/images/blog/0432e3c6-7445-494c-b333-84b6d35c3e2e_519x167.png&quot; alt=&quot;Example of bad product design with a warning banner forcing users to enable a feature&quot; /&gt;
&lt;em&gt;Bad product design.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;That is great &lt;em&gt;software engineering&lt;/em&gt; work but terrible &lt;em&gt;product engineering&lt;/em&gt; work. At no point did the engineer put themselves in the user&apos;s shoes or try to understand why we needed that change.&lt;/p&gt;
&lt;p&gt;This is where a &lt;em&gt;product engineer&lt;/em&gt; must shine. They need to understand the value and the reasoning behind the code they write, taking into account the product, its roadmap, its priorities, etc. That requires the ability to make trade-offs and stay pragmatic. They need to be obsessed with the customer and understand their problems. In a startup, you need to ship fast, which means, again, making trade-offs and being efficient and practical. They need to be detail-oriented, have a sense of ownership, and be on the lookout for ways to create terrific experiences.&lt;/p&gt;
&lt;p&gt;Writing software has never been so easy. With AI on the rise, writing actual code will have less and less value.&lt;/p&gt;
&lt;p&gt;The core value of building software is going to be whatever AI is not yet able to do, which is &lt;em&gt;&lt;strong&gt;empathy&lt;/strong&gt;&lt;/em&gt;—connecting with and learning from other human beings’ needs.&lt;/p&gt;
&lt;h2&gt;How to Transform Engineers into Product Gurus&lt;/h2&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/images/blog/98343348-3763-4d8c-9bfd-68b936fc004c_2160x1215.png&quot; alt=&quot;Indeed job listing for product engineer showing misguided advice&quot; /&gt;
&lt;em&gt;Do not follow Indeed’s advice, for sure.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;Based on the description I wrote above, we implemented some changes and made good progress overall. The improvements came from simple changes to how we organize ourselves.&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Connect engineers with customers.&lt;/strong&gt; Doing support directly with customers, joining a demo call, spending time in a booth during an event, and talking to prospects. All those activities where engineering can interact with prospects and customers are very valuable;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Explain the why, not the how.&lt;/strong&gt; As a product owner, you must explain why changes are being made and not how they should be made. The more context you feed into your user stories, the easier it is for an engineer to make the right decision when building a feature or fixing a problem. This is especially important when you, as a product manager, have a technical background and might be tempted to dictate a solution.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;There are many software engineers out there, but not many product engineers. We’ll be on the lookout for them when hiring. And if you want to join a product-oriented startup, make sure you shift your mindset beyond just writing code. 😉&lt;/p&gt;
</content:encoded></item><item><title>How To Test Your API Integration</title><link>https://julien.danjou.info/blog/how-to-test-with-an-api/</link><guid isPermaLink="true">https://julien.danjou.info/blog/how-to-test-with-an-api/</guid><description>The Three Rules That Should Govern Your Testing</description><pubDate>Tue, 24 Sep 2024 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;As I was publishing last week&apos;s post on whether &lt;a href=&quot;https://julien.danjou.info/p/is-github-the-future-or-becoming&quot;&gt;GitHub is becoming obsolete or the future of development platforms&lt;/a&gt;, they decided to trigger &lt;a href=&quot;https://blog.mergify.com/post-mortem-of-incident-2024-09-17/&quot;&gt;a two-hour interruption on Mergify in retaliation&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Just kidding. I am sure they did not do that on purpose.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/images/blog/0f8eda71-7106-45e0-8d33-e0530cd77668_1536x720.jpeg&quot; alt=&quot;Illustration of API integration testing challenges&quot; /&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://blog.mergify.com/post-mortem-of-incident-2024-09-17/&quot;&gt;Read my post-mortem&lt;/a&gt; if you want the whole story. The summary is that they broke their API for several hours until people started to complain, and they finally rolled back their change. Bringing down our service in the meantime.&lt;/p&gt;
&lt;p&gt;That event forces me to talk about APIs this week.&lt;/p&gt;
&lt;h2&gt;API Definitions Are Just Definitions&lt;/h2&gt;
&lt;p&gt;I won’t go into the definition of an API per se; it’d be boring. You can Google it if you need to.&lt;/p&gt;
&lt;p&gt;The real question is what &lt;em&gt;having&lt;/em&gt; an API &lt;em&gt;means&lt;/em&gt;. Offering an API to your users means authorizing them to interact with your service. This implies many rules, such as the data model of your API, the behavior of your API, the rules of usage, etc. Some can be encoded in a machine-readable format; others cannot. Engineers like to talk about contracts, and I think it’s a decent, if imperfect, analogy.&lt;/p&gt;
&lt;p&gt;To describe this contract, you need multiple specifications.&lt;/p&gt;
&lt;p&gt;Developers have been ecstatic over &lt;a href=&quot;https://swagger.io/specification/&quot;&gt;OpenAPI&lt;/a&gt; over the last decade as the go-to medium for describing their APIs. I want to emphasize here how little it actually documents your API: it describes the data model but encodes very little of the behavior the system might exhibit.&lt;/p&gt;
&lt;p&gt;Hey, I can confirm that GitHub did not break its OpenAPI schema when it broke its API last week. Formidable.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/images/blog/8d49cdca-7d99-4c42-a00f-c772ceea9087_2500x714.svg&quot; alt=&quot;Diagram showing the gap between OpenAPI schema and actual API behavior&quot; /&gt;&lt;/p&gt;
&lt;p&gt;However, based on the assumption that OpenAPI is enough, many engineers mock their API consumption based on that part of the contract and think they’re done.&lt;/p&gt;
&lt;p&gt;In that situation, the minimum you should do is validate that your mocking follows the OpenAPI schema you’re using. Even that is not enough because sometimes the schema changes—and sometimes it’s just not respected.&lt;/p&gt;
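&lt;p&gt;To make that idea concrete, here is a minimal Python sketch. The `pr_schema` fragment and the helper are hypothetical and cover only a tiny subset of JSON Schema; a real project would run its recorded mocks through a full validator instead:&lt;/p&gt;

```python
# Tiny subset of JSON-Schema checking: just enough to catch a mock that
# drifted from the schema. Hypothetical schema fragment, not GitHub's real one.
def check_against_schema(payload, schema):
    types = {"string": str, "integer": int, "boolean": bool, "object": dict}
    errors = []
    for key in schema.get("required", []):
        if key not in payload:
            errors.append(f"missing required key: {key}")
    for key, spec in schema.get("properties", {}).items():
        if key in payload and not isinstance(payload[key], types[spec["type"]]):
            errors.append(f"wrong type for {key}: expected {spec['type']}")
    return errors

pr_schema = {
    "required": ["number", "state"],
    "properties": {"number": {"type": "integer"}, "state": {"type": "string"}},
}
mocked = {"number": 42, "state": True}  # a hand-written mock that went stale
print(check_against_schema(mocked, pr_schema))
# → ['wrong type for state: expected string']
```

&lt;p&gt;The point is not this toy code: it’s that the check runs against the schema you ship with, so a schema update breaks your tests instead of your production integration.&lt;/p&gt;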
&lt;p&gt;Let’s take GitHub again as an example. Their API is so legacy that the &lt;a href=&quot;https://json-schema.org/blog/posts/github-case-study&quot;&gt;JSON schemas were crafted manually&lt;/a&gt; — and they might still be, for all I know. That’s fine; it’s better than nothing, and it’s not easy to change a legacy API that’s been there for 15 years.&lt;/p&gt;
&lt;p&gt;We know first-hand that their system does not always respect the GitHub API JSON Schema.&lt;/p&gt;
&lt;h2&gt;APIs Have Side-effects&lt;/h2&gt;
&lt;p&gt;Again, this approach is based entirely on the data model, which makes it insufficient on its own and of limited value.&lt;/p&gt;
&lt;p&gt;Most of an API&apos;s value is in the behavior it triggers. Unless your API is basic CRUD that only does storage, it will have side effects that may or may not be visible through the API.&lt;/p&gt;
&lt;p&gt;For example, creating an asynchronous job on any REST API will return nothing except a unique identifier, which can be used later to identify the work. You might receive the data via a webhook or have to poll the API to get the job’s status. This kind of behavior cannot be documented in OpenAPI as it’s not part of the data model; there’s nothing to tell you to expect a webhook.&lt;/p&gt;
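&lt;p&gt;To make the pattern concrete, here is a toy sketch of that create-then-poll flow. &lt;code&gt;FakeJobAPI&lt;/code&gt; and its method names are invented for illustration; they stand in for a real REST service:&lt;/p&gt;

```python
import time

# Toy sketch of the asynchronous-job pattern: the creation call returns only
# an identifier, and the caller must poll for completion. FakeJobAPI is a
# stand-in for a real service; names are hypothetical.

class FakeJobAPI:
    def __init__(self):
        self._jobs = {}
        self._next_id = 0

    def create_job(self):
        # Like POST /jobs: returns nothing except a unique identifier.
        self._next_id += 1
        self._jobs[self._next_id] = {"status": "pending", "polls": 0}
        return self._next_id

    def get_job(self, job_id):
        # Like GET /jobs/{id}: here the job "completes" after a few polls.
        job = self._jobs[job_id]
        job["polls"] += 1
        if job["polls"] >= 3:
            job["status"] = "done"
        return {"id": job_id, "status": job["status"]}

def wait_for_job(api, job_id, interval=0.01, max_polls=10):
    """Poll until the job reaches a terminal status."""
    for _ in range(max_polls):
        state = api.get_job(job_id)
        if state["status"] == "done":
            return state
        time.sleep(interval)
    raise TimeoutError(f"job {job_id} did not finish")

api = FakeJobAPI()
job_id = api.create_job()
print(wait_for_job(api, job_id))  # prints {'id': 1, 'status': 'done'}
```

&lt;p&gt;Nothing in an OpenAPI description of the creation endpoint would tell you that you must poll (or wait for a webhook) to get the result; that behavior lives outside the schema.&lt;/p&gt;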
&lt;h2&gt;API Invisible Parts&lt;/h2&gt;
&lt;p&gt;Now, let’s discuss all the invisible parts of running an API. There are many. The first that come to mind are RBAC, quota, and rate limits. Most APIs have to implement those items, and they also have a direct impact on the API behavior and access.&lt;/p&gt;
&lt;p&gt;Those features massively impact the quality and quantity of API usage. Again, they are pretty hard to test as a black box: there’s no way you can easily mock a full RBAC implementation or real-life rate limits.&lt;/p&gt;
&lt;h2&gt;Testing the Hard Way&lt;/h2&gt;
&lt;p&gt;Having consumed many different APIs at Mergify over the last five years, especially GitHub’s, which we know by heart, has given us a few ideas about what you can and cannot test.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Rule number one: do not mock. Record your tests.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;We leverage &lt;a href=&quot;https://vcrpy.readthedocs.io/en/latest/usage.html&quot;&gt;vcrpy&lt;/a&gt; in Python to do that: the idea is to run your test in a &lt;em&gt;record mode&lt;/em&gt; where real HTTP requests are done against a service. Once the recording is done, you can replay the test when running it locally or in the CI.&lt;/p&gt;
&lt;p&gt;If any of your code tries to make a different HTTP call, the test will fail, and you will have to re-record it. This ensures that no change is made to the application without being noticed.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/images/blog/ca786a70-ec3e-43f4-93dd-86c8ab02b27a_942x111.png&quot; alt=&quot;Screenshot of vcrpy test recording detecting a changed HTTP call&quot; /&gt;&lt;/p&gt;
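&lt;p&gt;To illustrate the principle (this is not vcrpy’s actual implementation), here is a toy record-and-replay cassette: in record mode it stores real responses, and in replay mode any request it has never seen fails loudly:&lt;/p&gt;

```python
# Toy illustration of the record/replay idea behind vcrpy: in record mode,
# real calls are stored in a cassette; in replay mode, any request that was
# not recorded raises, so an unnoticed change to your HTTP calls breaks the
# test instead of silently hitting a stale mock.

class Cassette:
    def __init__(self, record=False, transport=None):
        self.record = record
        self.transport = transport  # real HTTP function, used only to record
        self.interactions = {}

    def request(self, method, url):
        key = f"{method} {url}"
        if self.record:
            self.interactions[key] = self.transport(method, url)
        elif key not in self.interactions:
            raise RuntimeError(f"unrecorded request: {key}; re-record the test")
        return self.interactions[key]

# Record once against the "real" service (a stub here)...
real_http = lambda method, url: {"status": 200, "body": "ok"}
cassette = Cassette(record=True, transport=real_http)
cassette.request("GET", "/pulls/1")

# ...then replay: the recorded call works, any other call fails.
cassette.record = False
print(cassette.request("GET", "/pulls/1"))  # prints {'status': 200, 'body': 'ok'}
```

&lt;p&gt;vcrpy does this for real HTTP traffic, serializing cassettes to files you can commit and replay locally or in CI.&lt;/p&gt;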
&lt;p&gt;Now, that prevents your application from changing its API usage unnoticed, but it does not prevent the API provider from breaking your app. The only way to catch that is to regularly re-record all the tests and see if they break.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;So, rule number two: re-record your tests regularly — every day if possible.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;For example, we have a test that plays with GitHub pull request labels. When re-recording it a few months ago, we noticed that it failed. It turned out that GitHub had changed its API to become case-sensitive overnight (and that was not in the OpenAPI schema!).&lt;/p&gt;
&lt;p&gt;In that case, we preferred to ask GitHub to fix their API rather than fix our code, but hey, &lt;em&gt;your mileage may vary&lt;/em&gt;.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Rule number three: be ready to fix the code.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;No amount of testing will cover all the edge cases. For example, requests quota or rate limit might be hit in real scenarios but not in testing, meaning you’ll have to handle those specific cases without being able to test. It’s fine — you can actually mock part of the responses here.&lt;/p&gt;
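&lt;p&gt;For instance, you can hand-craft the rate-limit response your provider documents and assert that your retry logic handles it. The response shape below is illustrative, not any provider’s exact format:&lt;/p&gt;

```python
# Rate limits rarely show up while recording tests, but you can still mock
# the specific responses to exercise your handling code. This sketch fakes a
# rate-limited response followed by a success (fields are illustrative).

class RateLimited(Exception):
    pass

def fake_responses():
    # First call is rate-limited, second one succeeds.
    yield {"status": 403, "headers": {"Retry-After": "0"}}
    yield {"status": 200, "body": {"ok": True}}

def call_with_retry(get_response, max_attempts=3, sleep=lambda s: None):
    """Retry on rate-limit responses, honoring the Retry-After header."""
    for _ in range(max_attempts):
        resp = get_response()
        if resp["status"] == 403 and "Retry-After" in resp.get("headers", {}):
            sleep(float(resp["headers"]["Retry-After"]))
            continue
        return resp
    raise RateLimited("still rate-limited after retries")

responses = fake_responses()
result = call_with_retry(lambda: next(responses))
print(result["body"])  # prints {'ok': True}
```

&lt;p&gt;The transport is mocked, but the retry logic under test is the real code path that will run in production when the quota is hit.&lt;/p&gt;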
&lt;p&gt;For this, we leverage &lt;a href=&quot;https://sentry.io&quot;&gt;Sentry&lt;/a&gt; to obtain evidence of the problem, replicate it in a test, and fix it. No amount of testing can fix all scenarios, so having a way to &lt;em&gt;hotfix&lt;/em&gt; your code is a must.&lt;/p&gt;
&lt;p&gt;In the end, mixing API test recording for safety and error tracking for fast action is the best combination we’ve seen for dealing with external systems.&lt;/p&gt;
&lt;p&gt;If we map those rules to last week&apos;s incident: rule number three helped fix the issue quickly, rule number one would technically have caught it, and rule number two would have done so in less than 24 hours, even if, in our case, reality kicked in before the tests did.&lt;/p&gt;
&lt;p&gt;So use that. And retry mechanisms.&lt;/p&gt;
&lt;p&gt;I guess that’ll be for another post.&lt;/p&gt;
</content:encoded></item><item><title>Is GitHub the Future or Becoming Obsolete?</title><link>https://julien.danjou.info/blog/is-github-the-future-or-becoming/</link><guid isPermaLink="true">https://julien.danjou.info/blog/is-github-the-future-or-becoming/</guid><description>Is GitHub the Future or Becoming Obsolete?</description><pubDate>Tue, 17 Sep 2024 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Over the last few months, I stumbled upon a few articles on GitHub&apos;s history and future. I find those quite interesting. &lt;a href=&quot;https://graphite.dev/blog/github-monopoly-on-code-hosting?ref=blog.gitbutler.com&quot;&gt;Greg Foster gives a quick history of the rise of GitHub over the years&lt;/a&gt;, while Scott Chacon, one of GitHub’s co-founders, &lt;a href=&quot;https://blog.gitbutler.com/why-github-actually-won/&quot;&gt;retraces the history of GitHub from the inside&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The story is great, and I’ll leave it to you to read those posts if you want to learn more. I was alive at that time; I saw and lived it all. I started using Git in 2005, and I’m the 2644th of the 100 million GitHub users—I joined GitHub in 2008 when it was still in beta.&lt;/p&gt;
&lt;p&gt;It’s definitely true that Git won, and especially GitHub. I’ll probably have to write about GitLab at some point (&lt;a href=&quot;https://julien.danjou.info/blog/theres-almost-no-gitlab&quot;&gt;I eventually did&lt;/a&gt;) — but you won’t read any ranting here today. As Scott Chacon writes, GitHub had taste and perfect timing, and soon gained traction from the open-source community.&lt;/p&gt;
&lt;p&gt;Ok, great, so GitHub wins. Where do we go now?&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/images/blog/2a67ca8f-23bc-40da-bf28-f18b10b0be60_2000x1000.jpeg&quot; alt=&quot;Illustration of GitHub&apos;s dominance in code hosting&quot; /&gt;&lt;/p&gt;
&lt;h2&gt;Microsoft&lt;/h2&gt;
&lt;p&gt;I believe that GitHub&apos;s growth was already on track before Microsoft bought it in 2018, but that move still changed everything. At that time, GitHub still lacked major features, such as a CI/CD system, and suffered in comparisons with GitLab. The following year, that changed: GitHub Actions launched (thanks, Azure) and shook everything up.&lt;/p&gt;
&lt;p&gt;If you read opinions about GitHub vs. GitLab, CodeCommit, Azure DevOps, or any other platform, you’ll mostly see engineers comparing user-visible features. And sure, this has value, and it might be your main criterion for picking one or the other if you’re a small team with full power over your choice.&lt;/p&gt;
&lt;p&gt;However, this is not what GitHub and Microsoft are after anymore. Take a look at the &lt;a href=&quot;https://github.com/github/roadmap&quot;&gt;roadmap&lt;/a&gt; of the last few years, and you’ll see a pattern: &lt;em&gt;enterprise&lt;/em&gt;. Pushing code, creating pull requests, and any part of software engineers&apos; day-to-day activities have been covered for the last 15 years.&lt;/p&gt;
&lt;p&gt;They designed it, the industry adopted it, and GitHub has nothing in its roadmap to change that paradigm. They’re building on the momentum they created.&lt;/p&gt;
&lt;p&gt;Security, Copilot (AI), and compliance are the items that the largest corporations need in order to embrace a platform such as GitHub. This is only the beginning: this year, the GitHub sales organization underwent a reorganization to look more like Microsoft&apos;s sales organization. I suspect GitHub is now able to leverage even more Microsoft resources to push its platform to large corporations—which definitely makes sense. The link between GitHub and Azure is tightening, both technically and commercially.&lt;/p&gt;
&lt;p&gt;For the best and the worst.&lt;/p&gt;
&lt;h2&gt;Relevance&lt;/h2&gt;
&lt;p&gt;How is GitHub still relevant if it does not innovate on developer workflow? &lt;a href=&quot;https://www.mistys-internet.website/blog/blog/2024/07/12/github-is-starting-to-feel-like-legacy-software/&quot;&gt;Is it becoming legacy software&lt;/a&gt;?&lt;/p&gt;
&lt;p&gt;There is certainly a disjunction between what developers and corporations expect, and at this point, if you look at the ratio between features, compliance, and price, there&apos;s no better alternative than GitHub. I don’t think this is going to change anytime soon.&lt;/p&gt;
&lt;p&gt;For open-source projects, there might be alternatives, but if you’re pragmatic (and lazy), GitHub is the pick. A number of projects have tried to escape GitHub over the years, notably when Microsoft acquired it. Microsoft is still seen as an enemy of open source by some folks (I suspect this is PTSD caused by the &lt;a href=&quot;https://www.zdnet.com/article/ballmer-i-may-have-called-linux-a-cancer-but-now-i-love-it/&quot;&gt;Ballmer&lt;/a&gt; era). More recently, another exodus was triggered by the announcement of Copilot and the fear that its training was done on free software. However, at this stage, it’s like emptying an ocean with a spoon, and the departures do not offset larger projects moving to GitHub (e.g., &lt;a href=&quot;https://peps.python.org/pep-0512/&quot;&gt;Python&lt;/a&gt;).&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/images/blog/3051d896-9814-41f5-9866-20a74692b5d0_1200x630.jpeg&quot; alt=&quot;Steve Ballmer portrait&quot; /&gt;
&lt;em&gt;Steve Ballmer&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;On the other hand, GitHub is attracting more competitors. It has grown to a point where many startups want to disrupt it: one can look at &lt;a href=&quot;https://radicle.xyz/&quot;&gt;Radicle&lt;/a&gt; and its decentralized approach, &lt;a href=&quot;https://pierre.co/&quot;&gt;Pierre&lt;/a&gt; and its modern design, &lt;a href=&quot;https://www.diversion.dev/&quot;&gt;Diversion&lt;/a&gt; with its game-centric approach, or &lt;a href=&quot;https://www.palmier.io/&quot;&gt;Palmier&lt;/a&gt;, which is building a new kind of repository.&lt;/p&gt;
&lt;p&gt;They might succeed, but the road is going to be long to get massive adoption — and migration.&lt;/p&gt;
&lt;p&gt;There’s nothing replacing GitHub in the short term. We’d better deal with it.&lt;/p&gt;
</content:encoded></item><item><title>Staying Competitive</title><link>https://julien.danjou.info/blog/staying-competitive/</link><guid isPermaLink="true">https://julien.danjou.info/blog/staying-competitive/</guid><description>How to fight (back) big corporations</description><pubDate>Tue, 10 Sep 2024 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Starting a company is always a challenge. You start small, and you might have to fight competitors that are way larger than you. They could crush you.&lt;/p&gt;
&lt;p&gt;However, being small does not mean that you will necessarily lose against large companies.&lt;/p&gt;
&lt;p&gt;Last week, I was wandering through a forum where I regularly hang out with other SaaS founders. We share founders’ issues, anything from how to run ad campaigns to how to fire people (sigh).&lt;/p&gt;
&lt;p&gt;That day, one of the participants asked a question about staying competitive when large companies enter your market. They were facing a few challenges, the main one being that a multi-billion corporation was adding features to its product that would poach on their turf.&lt;/p&gt;
&lt;p&gt;That resonated with me. Mergify went through the same hassle when &lt;a href=&quot;https://julien.danjou.info/blog/the-challenges-of-merge-queues&quot;&gt;GitHub launched its own merge queue features&lt;/a&gt; last year.&lt;/p&gt;
&lt;p&gt;Being in this position is extremely risky. There are many horror stories out there of startups being killed by larger competitors. You just have to watch the OpenAI DevDay announcements of the last few months to see how many startups they instantly disrupted with the announcement of a single feature (marketplace, custom GPTs, etc.).&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/images/blog/cf70dc33-1f44-41b7-aa8d-84ba67f62c1a_1516x416.png&quot; alt=&quot;Apple iOS 18 announcement threatening to disrupt smaller competitors&quot; /&gt;
&lt;em&gt;Is Apple going to kill a bunch of companies with iOS18?&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;Now, that being said, being the underdog is not necessarily a bad thing and can turn out to be a strength. When survival is at stake, the smallest dog can become the fiercest one.&lt;/p&gt;
&lt;h2&gt;80/20&lt;/h2&gt;
&lt;p&gt;In 1668, Jean de La Fontaine wrote a famous fable, &lt;em&gt;Le Lièvre et la Tortue&lt;/em&gt; (&lt;em&gt;The Hare and the Tortoise&lt;/em&gt;). If you’ve never heard of it, the TL;DR is: a confident hare mocks a slow tortoise and challenges her to a race. The hare, overconfident, takes a nap mid-race while the tortoise steadily continues and wins. The fable teaches that persistence and diligence can triumph over arrogance and haste.&lt;/p&gt;
&lt;p&gt;There are multiple ways to respond in this situation, but my feeling is that most large companies will implement the “20% of features that do 80% of the job.” It makes economic sense for them. With little effort, they can enter the market and grab a large share using their moat, branding, marketing, and existing customers.&lt;/p&gt;
&lt;p&gt;They can leverage their vast resources and existing ecosystems to quickly dominate this space. However, they often leave gaps in niche or specialized needs because building the remaining 80% of features to cover every specific use case requires more effort and may not align with their broader goals.&lt;/p&gt;
&lt;p&gt;Unfortunately, if you were also targeting the same 20% feature, this can hurt your business. You’ll stop seeing new customers and will lose existing ones as they remove the one-too-many vendor from their list.&lt;/p&gt;
&lt;p&gt;However, there is a chance for you to stay ahead of the game. Like the hare in the tale, large companies will just take a gigantic nap once they’re done with what they expect to be enough. This is where you can shine.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/images/blog/2508d33f-3137-40ba-ac04-efd71bb84318_1580x1180.png&quot; alt=&quot;Illustration of the 80/20 rule applied to feature coverage and market share&quot; /&gt;&lt;/p&gt;
&lt;p&gt;A small company that builds beyond the basic 20% and focuses on solving complex or specialized problems can appeal to customers who require more tailored solutions, thus capturing a large share of a more focused market. While large companies serve the general market, small players can dominate specialized verticals. By implementing, e.g., 40% of the features for a particular scope, you could address 97% of the market, thus appearing as an expert and leader in your area.&lt;/p&gt;
&lt;h2&gt;Vertical&lt;/h2&gt;
&lt;p&gt;Another play that I like is to target different verticals. If you design presentation software and suddenly PowerPoint enters the market, it’s going to be pretty hard to win over the long run. Microsoft will win because they’re known (the “&lt;em&gt;nobody gets fired for buying IBM&lt;/em&gt;” rule), and they can reach out to millions of customers — you can’t.&lt;/p&gt;
&lt;p&gt;However, if you pivot to becoming the presentation software specialized in building convincing, AI-generated, prospect-centered sales decks automatically, then you can win an entire part of the broader market.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/images/blog/a437ebee-cfed-4dee-9528-f75711e49af2_2000x1000.jpeg&quot; alt=&quot;Illustration of targeting vertical markets to compete against large corporations&quot; /&gt;&lt;/p&gt;
&lt;p&gt;For sure, the initial market might be narrower, but by being focused, it’ll be easier to win it. As you’re winning it, you can expand into adjacent markets and grab more share of the global offering.&lt;/p&gt;
&lt;h2&gt;Strategy&lt;/h2&gt;
&lt;p&gt;For both strategies — expanding beyond the 80/20 or targeting a particular vertical — the key to discovering which approach suits you better is to talk to your existing customers. They are the ones who have the answers to questions such as “Is there a usage pattern they rely on?” or “Are they working in the same industry?”. Opportunities like focusing on customer experience or specializing in integrations that large corporations won’t build are only identifiable if you speak to your users.&lt;/p&gt;
&lt;p&gt;Small companies have the agility and execution speed that large companies lack, making them the best innovators.&lt;/p&gt;
&lt;p&gt;I guess that’s my message to anyone out there being attacked by large corporations. Don’t throw the towel too soon. There might be different plays for you to continue growing and fighting back against the Goliath.&lt;/p&gt;
&lt;p&gt;In the end, it’s not just about size. It’s about understanding where you can uniquely provide value and delivering that value better and faster than anyone else.&lt;/p&gt;
</content:encoded></item><item><title>How to Be a Great Software Engineer</title><link>https://julien.danjou.info/blog/how-to-be-a-great-software-engineer/</link><guid isPermaLink="true">https://julien.danjou.info/blog/how-to-be-a-great-software-engineer/</guid><description>There is more than one way.</description><pubDate>Tue, 03 Sep 2024 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;I did not write for the last few weeks as I enjoyed taking a break. Ha! That’s probably the first point I could write about being a great software engineer: taking breaks.&lt;/p&gt;
&lt;p&gt;Nevermind.&lt;/p&gt;
&lt;p&gt;What do I know, after all? I’m not a software engineer anymore. I’m a CEO, god damn it.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/images/blog/f4dfeb44-ffa2-4a3f-baf2-0dcf75586f06_320x195.jpeg&quot; alt=&quot;Illustration of a CEO dispensing advice&quot; /&gt;
&lt;em&gt;Not my role model.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;At least being a CEO gives me some excuse to dispense some pieces of advice regularly. It turns out that over the last couple of years, I had to become a &lt;em&gt;manager&lt;/em&gt; of people — and many people in my team are software engineers. The only thing I knew about management so far was &lt;em&gt;being managed&lt;/em&gt;, which taught me many things, such as:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;how to manage;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;how not to manage;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;how to be managed.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;I don’t want to talk about the first two points here, but I’d like to write about the latter. I regularly have to give feedback to people on my team, and that feedback often relies on the Great Engineer Framework that I built in my mind.&lt;/p&gt;
&lt;p&gt;It’s time to write that down.&lt;/p&gt;
&lt;h3&gt;Expectations&lt;/h3&gt;
&lt;p&gt;Since I started my career as a software engineer 20 years ago, I always wondered how to improve. My initial appeal for this career was typing code on a keyboard, so I decided that the best way to become a great engineer was to be the best at technical stuff.&lt;/p&gt;
&lt;p&gt;I coded days and nights, learned everything I could, and became amazing. I Debian-packaged hundreds of pieces of software, wrote C code for a window manager, Linux, and CPython, wrote &lt;a href=&quot;https://github.com/emacs-mirror/emacs/blob/master/lisp/color.el#L301&quot;&gt;CIEDE2000 color space computation functions in Lisp&lt;/a&gt;, wrote thousands of lines of Python to do crazy stuff, implemented XML bindings for the X11 protocol, built a scalable time-series database based on object storage, etc.; you name it. I did many tech-crazy things and thought I was a great engineer.&lt;/p&gt;
&lt;p&gt;It turns out I was only 33% good. As I grew into the tech and startup ecosystem, I started to understand everything around me: the industry, the business, the people. And I soon realized that tech alone was not enough, even if I was among the best engineers you could probably find (sorry for bragging).&lt;/p&gt;
&lt;h3&gt;Aspects&lt;/h3&gt;
&lt;p&gt;After a few years, I built a mental model that I still use nowadays to give feedback to engineers in my team, based on 3 aspects that you must master to become a great engineer — like the 10x engineer they all talk about:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;Tech&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Business value&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Collaboration&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h4&gt;Tech&lt;/h4&gt;
&lt;p&gt;I just discussed tech. You have to be &lt;em&gt;amazing&lt;/em&gt; at it, which means you have to dig &lt;em&gt;deep&lt;/em&gt; into it. As my co-founder Mehdi says, great engineers &lt;em&gt;pull the strings&lt;/em&gt;. This means that you’re not just there to paper over the problem; you’re here to understand it fully, to grasp it entirely, from top to bottom, and to fix it forever because you &lt;em&gt;understand&lt;/em&gt; it.&lt;/p&gt;
&lt;p&gt;Many junior engineers are not able to do that. They just tinker with their code until “well, it works, tests pass, whatever.” The rise of AI tooling reinforces that habit, and engineers working this way will have to step up their game, or they’ll disappear.&lt;/p&gt;
&lt;p&gt;It takes a large amount of time to achieve this expertise, and as common sense says, maybe 10,000 hours. This is actually a major issue for people switching to tech after another career; 10,000 hours of coding 25 hours a week (if you just do it on the job) in a typical 45-week year is more than 8 years before starting to “know what you’re talking about.” If you start at 18 years old, tinkering with computers 60 hours a week for fun, you’ll be pretty good at it by 21. I know that’s not fair, but I see this as a major roadblock for hiring tech talents coming from a career change.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/images/blog/4c43b2e1-118b-4c39-9a31-4b09a1adbad1_1536x768.webp&quot; alt=&quot;Illustration of deep technical expertise and pulling the strings as an engineer&quot; /&gt;&lt;/p&gt;
&lt;p&gt;So: do tech. Don’t stop until you understand everything of what you are responsible for. I remember 15 years ago being screened by a recruiter at Google who’d ask me what happened when I typed google.com in my Web browser. Being able to explain everything, from the keyboard input to the DNS requests and TCP headers of the packet sent, to the HTTP server made me pass without a blink.&lt;/p&gt;
&lt;h4&gt;Business Value&lt;/h4&gt;
&lt;p&gt;This sounds totally stupid, and I might be slightly biased by my French experience, but there are too many engineers who do not understand &lt;em&gt;business value&lt;/em&gt;. It actually took me a few years to understand this, probably because I was only focused on tech. Let me give you a good anecdote to illustrate this.&lt;/p&gt;
&lt;p&gt;Ten years ago, I was called by a senior manager to help with a Python project in a media company. I go to the meeting and meet the manager. He comes from a famous French tech school — one where they learn the C standard library from scratch in their first year — and so do most of the engineers in his team. They’re managing hundreds of servers, and after evaluating various software to do that (Puppet, Ansible, etc) they didn’t find anything that suited 100% of their needs, so they built their own. They invested hundreds of hours in it, and now they’d need help maintaining it.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/images/blog/93fa670c-6f32-494d-9ef1-d9c0eac35899_1536x768.webp&quot; alt=&quot;Illustration of engineers building custom tools instead of using existing solutions&quot; /&gt;&lt;/p&gt;
&lt;p&gt;It turns out that what they needed was probably Ansible plus a custom plugin to reach 98% of their needs in a tenth of the time, but they didn’t see it that way. They just built an entire tech project, unrelated to their core business, from the ground up, investing hours of time. This is the kind of experience I talked about in my previous post, &lt;a href=&quot;https://julien.danjou.info/p/solving-build-vs-buy&quot;&gt;Solving Build vs Buy&lt;/a&gt;. I skipped that project and moved on to other things. I had no interest in maintaining a project that was not providing &lt;em&gt;core value&lt;/em&gt; to the business. That would have been a great way to get ditched as soon as somebody smarter in the company realized how much time that project had wasted.&lt;/p&gt;
&lt;p&gt;This kind of behavior applies everywhere. Engineers will spend hours trying to implement &lt;em&gt;perfect&lt;/em&gt; systems that will scale to millions of users, while the business might have no users — yet. Engineers will spend hours building a feature or solving a problem that impacts 0.1% of users. Engineering might not be directly responsible for the roadmap, but engineers are responsible for the time they spend and for how far they go in implementing systems and features.&lt;/p&gt;
&lt;p&gt;We live in a world where the economy is the driver, which means you have to maximize output and minimize input. Input is your coding time, and output is the (extra) money the company that hires you can make from your work.&lt;/p&gt;
&lt;h4&gt;Collaboration&lt;/h4&gt;
&lt;p&gt;I could probably summarize this aspect with just:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;If you want to go fast, go alone. If you want to go far, go together.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/images/blog/20afac45-c6a6-45f7-b50d-22529146569d_1536x768.png&quot; alt=&quot;Illustration of teamwork and collaboration as key to going far&quot; /&gt;&lt;/p&gt;
&lt;p&gt;That’s entirely true. Dealing with the team is a struggle for many engineers, who get frustrated by their teammates. It does take time to deal with people, and they are not as easy to understand as computers. However, in the long run, they are the best &lt;em&gt;leverage&lt;/em&gt; you can get to achieve amazing things. Maybe another secret of 10x engineers, who knows?&lt;/p&gt;
&lt;p&gt;Therefore, you’ll need to understand the dynamics that make your team work. You have to make sure your work is not isolated, not the only correct piece of code hidden in its own corner. You need to connect both your software and your brain to other pieces of software and other people. I know this requires a lot of effort for some people, especially because it can feel annoying and inefficient to talk or write things down so other engineers understand what you’re achieving.&lt;/p&gt;
&lt;p&gt;But until we can all Neuralink, yes, you’ll have to pause and do something that seems like a waste of time: talking to your teammates, your manager, or customers.&lt;/p&gt;
&lt;p&gt;Always remember: this is an investment. In the long run, it &lt;em&gt;will pay off.&lt;/em&gt;&lt;/p&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;Those three aspects are always the ones I use to drive my feedback in performance reviews for engineers on the team. Not everyone is 10/10 in every aspect, which makes it easier to provide feedback and to point them to where they should improve next. They are probably not exhaustive, but they are a great way to spot great and inadequate behaviors.&lt;/p&gt;
</content:encoded></item><item><title>Launching Byte Brigade</title><link>https://julien.danjou.info/blog/launching-byte-brigade/</link><guid isPermaLink="true">https://julien.danjou.info/blog/launching-byte-brigade/</guid><description>Empowering Developer Tool Startups</description><pubDate>Tue, 20 Aug 2024 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Today I&apos;m excited to announce the launch of &lt;a href=&quot;https://bytebrigade.fund&quot;&gt;Byte Brigade&lt;/a&gt;, a community dedicated to investing in early-stage developer tools and SaaS companies. Our mission is to support tech startups at their initial stage of development, helping them become leaders in their respective fields.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/images/blog/87fced55-6f35-4249-a9e7-2e0deb34430a_2048x2048.webp&quot; alt=&quot;Byte Brigade logo&quot; /&gt;&lt;/p&gt;
&lt;h2&gt;The Challenge&lt;/h2&gt;
&lt;p&gt;Many developers start companies to build incredible tools, but they face enormous challenges when it comes to marketing and productizing their innovations. This is particularly true in the French ecosystem, where we have brilliant engineers but often &lt;a href=&quot;https://julien.danjou.info/blog/why-french-tech-is-playing-not-to&quot;&gt;lack the cultural know-how&lt;/a&gt; to propel projects to their full potential. For example, marketing to developers is tough, and addressing the US market from France adds another layer of complexity.&lt;/p&gt;
&lt;p&gt;I&apos;ve seen this problem repeatedly in my career. For instance, I&apos;ve recently shared insights with &lt;a href=&quot;https://codspeed.io/&quot;&gt;CodSpeed&lt;/a&gt; on the differences between European and US marketing strategies, helping them understand the specific challenges they might face. This is just one example of the kind of guidance that can make a significant difference for tech startups.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/images/blog/0453fed0-e13a-4c89-b768-84081f58a184_2558x579.webp&quot; alt=&quot;Byte Brigade portfolio companies and investment focus areas&quot; /&gt;&lt;/p&gt;
&lt;h2&gt;My Journey&lt;/h2&gt;
&lt;p&gt;Having worked in open source for over 25 years and in devtools and SaaS for close to 15 years with companies like &lt;a href=&quot;https://redhat.com&quot;&gt;Red Hat&lt;/a&gt;, &lt;a href=&quot;https://datadoghq.com&quot;&gt;Datadog&lt;/a&gt;, and &lt;a href=&quot;https://mergify.com&quot;&gt;Mergify&lt;/a&gt;, I understand these challenges intimately. Over the past few years, I&apos;ve had the privilege of funding and mentoring several tech startups, helping them navigate the tricky landscape of market fit and growth.&lt;/p&gt;
&lt;p&gt;This experience has inspired me to create &lt;a href=&quot;https://bytebrigade.fund&quot;&gt;Byte Brigade&lt;/a&gt;—a collective of tech enthusiasts who can bring their unique expertise to startups aiming to grow. We are building a group of business angels interested in investing in companies that could become tomorrow&apos;s tech leaders. Byte Brigade is unique in that we are very developer-centric, focusing on solving problems that matter most to developers.&lt;/p&gt;
&lt;h2&gt;Our Approach&lt;/h2&gt;
&lt;p&gt;Our approach is simple: we invest in and guide startups, helping them transform into dev tools and SaaS powerhouses. This involves significant efforts in marketing and product development, areas where we provide invaluable guidance. With advanced technology, experienced leadership, and a growing customer base, our portfolio companies are well-positioned for significant growth. Once they refine their strategies in their initial verticals, they can expand into new sectors, leveraging their technological capabilities and market insights.&lt;/p&gt;
&lt;h2&gt;Looking Ahead&lt;/h2&gt;
&lt;p&gt;In the long term, I envision Byte Brigade growing into a community where multiple developers and business angels come together to invest in tech startups that solve problems for developers. While we have no set timeline, we plan to host events, workshops, and networking opportunities for our members and portfolio companies, fostering a collaborative environment where ideas and experiences can be shared.&lt;/p&gt;
&lt;p&gt;If you&apos;re a developer or a business angel interested in investing in companies that build solutions for developers, I invite you to join us. Your expertise and investment can help shape the future of tech innovation. On the other hand, if you&apos;re a tech company targeting developers and need funding and mentorship, we’re here to support you. Feel free to reach out to &lt;a href=&quot;mailto:hello@bytebrigade.fund&quot;&gt;hello@bytebrigade.fund&lt;/a&gt;.&lt;/p&gt;
&lt;h2&gt;Call to Action&lt;/h2&gt;
&lt;p&gt;Launching Byte Brigade marks an exciting new chapter in my journey to support and empower developer tool startups. I&apos;m eager to see what we can achieve together and look forward to helping many more companies grow and succeed. Join us in driving innovation and growth in the tech world.&lt;/p&gt;
&lt;p&gt;Stay tuned for more updates, and let&apos;s build the future of developer tools together.&lt;/p&gt;
&lt;p&gt;For more information or to get involved, please contact us at &lt;a href=&quot;mailto:hello@bytebrigade.fund&quot;&gt;hello@bytebrigade.fund&lt;/a&gt;.&lt;/p&gt;
</content:encoded></item><item><title>Solving Build vs Buy</title><link>https://julien.danjou.info/blog/solving-build-vs-buy/</link><guid isPermaLink="true">https://julien.danjou.info/blog/solving-build-vs-buy/</guid><description>A Developer’s Dilemma</description><pubDate>Tue, 13 Aug 2024 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Selling to developers is challenging, and the mindset of “build or buy” is one of the biggest hurdles. Over the past year at &lt;a href=&quot;https://mergify.com&quot;&gt;Mergify&lt;/a&gt;, we&apos;ve encountered countless engineers grappling with this dilemma.&lt;/p&gt;
&lt;p&gt;Developers are natural problem-solvers. When they encounter an issue, their first instinct is often to build a solution themselves. They see a problem, identify it, and immediately start imagining how they would solve it. However, there&apos;s a significant gap between having expertise in engineering and having expertise in a specific solution.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/images/blog/b4c65ae0-f84e-4061-a094-35e39c1f133d_1376x864.webp&quot; alt=&quot;Illustration of the build vs buy dilemma for developers&quot; /&gt;&lt;/p&gt;
&lt;h2&gt;The Overconfidence Trap&lt;/h2&gt;
&lt;p&gt;Many developers fall into the trap of underestimating the complexity of the problems they&apos;re trying to solve. It&apos;s akin to asking an engineer, &quot;How long to fix XYZ?&quot; and hearing, &quot;Should be done today,&quot; only for the task to stretch out over a week as unforeseen complications arise. They start working on the problem, only to discover layers of hidden challenges—refactoring needs, unexpected dependencies, and more. It all makes sense in hindsight, but it&apos;s nearly impossible to foresee these issues at the outset.&lt;/p&gt;
&lt;p&gt;This overconfidence often leads to the creation of subpar solutions. While some teams might eventually succeed, these homegrown solutions fall short of their commercial counterparts more often than not. Moreover, solving problems outside of your core business usually results in a poor return on investment. Think about it: how many companies still manage their own email servers when GSuite offers a hassle-free solution for just $6 per user per month? Your IT team simply can&apos;t compete with that offer.&lt;/p&gt;
&lt;p&gt;The typical customer for this comes to your demo call with a speech along these lines: “We’ve tried building a solution to this problem, but we failed because we hit too many bumps in the road; it seems you guys know how to solve it.” Teams that take this road are the easiest customers to win because they already know they can’t build it themselves.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/images/blog/8f07ef4b-9de1-4f46-b001-2b3939f1c38c_1376x864.webp&quot; alt=&quot;Illustration of customers discovering the complexity they underestimated&quot; /&gt;&lt;/p&gt;
&lt;h2&gt;The Build or Buy Mindset in Action&lt;/h2&gt;
&lt;p&gt;During our demo sessions at Mergify, we frequently encounter engineers who are initially skeptical about buying a solution. They come with a build-or-buy mindset, confident in their ability to solve the problem themselves. They&apos;re curious about why they should spend money on a product when they believe they can build it in-house. This is where it gets interesting.&lt;/p&gt;
&lt;p&gt;One of the most enjoyable aspects of these sessions is debunking their assumptions. For instance, many engineers approach our merge queue system thinking, &quot;This is just an automatic rebase, right?&quot; After a 20-minute demo, they often leave with a new appreciation for the complexities involved. &quot;Oh, okay. That sounds quite hard to do. Good job.&quot; This is where I reply with: “Well, this is why we’ve been working on this for 5 years already and have hundreds of customers.” 😉 Let me brag a bit.&lt;/p&gt;
&lt;p&gt;By walking them through the numerous edge cases and intricacies that our product handles, we can show them just how challenging the problem really is.&lt;/p&gt;
&lt;h2&gt;The Power of Demonstrating Value&lt;/h2&gt;
&lt;p&gt;This experience highlights why startup founders who are engineers can be excellent salespeople. They understand the technical mindset and can effectively communicate the value of their products. The key is to convey the value of your product in your messaging and demos.&lt;/p&gt;
&lt;p&gt;Having at least one feature that is both highly valuable and challenging to implement can make a significant impact. The more features like this you can demonstrate, the better—provided they solve real problems and aren&apos;t just complex for complexity&apos;s sake.&lt;/p&gt;
&lt;h2&gt;Final Thoughts&lt;/h2&gt;
&lt;p&gt;The &quot;build or buy&quot; dilemma is a significant barrier in marketing devtools. Developers&apos; natural inclination to build solutions themselves can lead to underestimations of complexity and overconfidence. By demonstrating the intricate challenges your product solves and highlighting its value, you can shift their perspective. In the end, it&apos;s about showing that your solution is not just a convenience but a necessity for efficient and effective problem-solving. Once you&apos;ve won them over, the next challenge is &lt;a href=&quot;https://julien.danjou.info/blog/saas-pricing-is-hard&quot;&gt;getting pricing right&lt;/a&gt; — which has its own set of hard lessons.&lt;/p&gt;
</content:encoded></item><item><title>Connecting the Dots with AI</title><link>https://julien.danjou.info/blog/connecting-the-dots-with-ai/</link><guid isPermaLink="true">https://julien.danjou.info/blog/connecting-the-dots-with-ai/</guid><description>The Future of Enhanced Communication</description><pubDate>Tue, 06 Aug 2024 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;In any conversation, a lot of context can be lost between what you think, what you say, what the listener hears, and what they ultimately understand. This loss of information can lead to miscommunication and inefficiencies. How often have you found yourself confused by someone&apos;s words, asking them what they mean, only to hear, &quot;Sorry, I was thinking about this,&quot; and finally, the dots connect?&lt;/p&gt;
&lt;p&gt;This common scenario underscores AI&apos;s potential to revolutionize communication. Imagine a world where your AI assistant, enriched with context from your daily activities, bridges the gap between thoughts and understanding. This could transform the way we interact, making communication more efficient and precise.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/images/blog/48fd302a-01a1-4a4b-a49d-49ff6b2d910f_303x435.png&quot; alt=&quot;Illustration of context loss in human communication&quot; /&gt;
&lt;em&gt;Communicating&lt;/em&gt;&lt;/p&gt;
&lt;h2&gt;The Role of AI in Enhancing Communication&lt;/h2&gt;
&lt;p&gt;In today&apos;s world, computers and phones are already tracking many of our communications through platforms like email, Slack, and Teams. Combining all of those platforms captures the full context of conversations — which is why people are starting to use them as sources for &lt;a href=&quot;https://research.ibm.com/blog/retrieval-augmented-generation-RAG&quot;&gt;RAG (Retrieval-Augmented Generation)&lt;/a&gt; in LLMs.&lt;/p&gt;
&lt;p&gt;Technologies such as &lt;a href=&quot;https://support.microsoft.com/en-us/windows/retrace-your-steps-with-recall-aa03f8a0-a78b-4b3e-b0a1-2eb8ac48701c#:~:text=With%20Recall%2C%20you%20have%20an,takes%20snapshots%20of%20your%20screen.&quot;&gt;Microsoft Recall&lt;/a&gt; are going in that direction: recording more information to enrich the AI&apos;s context and, ultimately, help you better understand your world.&lt;/p&gt;
&lt;p&gt;In the future, AI and LLMs could push communication even further in this direction.&lt;/p&gt;
&lt;p&gt;Consider a scenario where Alice needs to tell her colleague Bob to handle a customer request. Instead of Alice trying to guess what context Bob has or lacks, she could give the instruction to her AI assistant rather than communicating with Bob directly. Alice&apos;s AI could then communicate with Bob&apos;s AI, sharing the necessary context and information, ensuring that Bob receives a complete and clear message. This method of using AIs as proxies eliminates the guesswork and ensures that all relevant details are communicated effectively.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/images/blog/edbb7e88-9bba-4fc5-9d43-081275a40208_727x307.png&quot; alt=&quot;Diagram of AI assistants communicating between Alice and Bob&quot; /&gt;&lt;/p&gt;
&lt;h2&gt;The Vision: AI-Assisted Communication&lt;/h2&gt;
&lt;p&gt;In the future, AI could be integrated into every piece of communication, from emails to meetings to casual conversations. The potential is immense. Imagine AI assistants transforming messages to match the recipient&apos;s preferred communication style and form, embedding the extra context that might be missing to the recipient.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/images/blog/d6a6575f-1d58-4808-8c46-cdbf2f6f237f_1376x864.webp&quot; alt=&quot;Illustration of AI transforming messages to match the recipient&apos;s communication style&quot; /&gt;&lt;/p&gt;
&lt;p&gt;For example, if Alice’s AI knows that Bob prefers visual information, it could transform Alice’s text-based request into an infographic or a visual summary. This ensures that Bob receives the information in the most effective way for him, enhancing understanding and efficiency.&lt;/p&gt;
&lt;h2&gt;Benefits of AI in Communication&lt;/h2&gt;
&lt;p&gt;The primary benefit of AI-enhanced communication is the significant improvement in efficiency. Misunderstandings and miscommunications can lead to wasted time and resources. By ensuring that all parties have the necessary context, AI can streamline interactions and reduce the need for clarifications and follow-ups.&lt;/p&gt;
&lt;p&gt;Additionally, AI can create a personalized communication experience, tailoring messages to fit the recipient&apos;s preferences and needs. This not only improves comprehension but also makes interactions more pleasant and engaging.&lt;/p&gt;
&lt;h2&gt;Overcoming Challenges&lt;/h2&gt;
&lt;p&gt;However, implementing AI in communication is not without its challenges. One significant issue is the segregation of information. Just as humans struggle with deciding whether to share certain information, AI will need to learn how to handle sensitive or contextual data appropriately. Current AI systems lack robust role-based access control (RBAC) for context, making it difficult to manage which information can be shared and with whom.&lt;/p&gt;
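&lt;p&gt;To make this concrete, here is a deliberately tiny sketch of role-based context filtering. Everything in it is hypothetical and purely for illustration; no real assistant exposes such an API today:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;def share_context(items, recipient_role):
    # Forward only the context items whose access list
    # includes the role of the recipient.
    return [fact for fact, roles in items if recipient_role in roles]

context = [
    (&quot;Q3 revenue figures&quot;, {&quot;exec&quot;}),
    (&quot;customer ticket history&quot;, {&quot;exec&quot;, &quot;support&quot;}),
]
share_context(context, &quot;support&quot;)
# [&apos;customer ticket history&apos;]
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;A real system would need far richer policies than a set of roles per fact, but even this toy version shows where the hard part lies: someone, or something, has to decide for every piece of context who is allowed to see it.&lt;/p&gt;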
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/images/blog/d163612a-f13c-491c-b858-3f0daafd4eef_544x589.png&quot; alt=&quot;Diagram of AI sandboxing cycle for managing sensitive information&quot; /&gt;
&lt;em&gt;Sandboxing Cycle&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;Furthermore, while &lt;a href=&quot;https://www.independent.co.uk/life-style/facebook-artificial-intelligence-ai-chatbot-new-language-research-openai-google-a7869706.html&quot;&gt;AI can potentially develop its own languages to communicate more efficiently&lt;/a&gt;, the practical application of this in everyday communication remains a complex challenge. Security and privacy concerns also need to be addressed, ensuring that sensitive information is protected while still allowing AI to function effectively. I don’t think anyone is actively working on this right now, but it will be a major issue in the future.&lt;/p&gt;
&lt;h2&gt;Personal Reflections and Future Visions&lt;/h2&gt;
&lt;p&gt;Reflecting on my own experiences, I&apos;ve often encountered situations where additional context could have prevented misunderstandings. A really simple example would be planning a lunch meeting without knowing the dietary preferences of your invitee. It might be so obvious to your guest that the restaurant must have vegan options that they won&apos;t mention it, which can lead to a disappointing outcome if you book a steakhouse. If AI can provide this context seamlessly, such issues could be avoided.&lt;/p&gt;
&lt;p&gt;Looking ahead, I envision AI assistants like Siri evolving to incorporate these advanced communication capabilities. This could extend beyond personal assistants to systems built between products and companies, facilitating smoother interactions and collaborations across different platforms.&lt;/p&gt;
&lt;p&gt;The future of AI in communication holds incredible promise. By bridging the gaps in context and understanding, AI has the potential to transform how we interact, making communication more efficient and effective. Of course, getting there requires solving a fundamental problem: &lt;a href=&quot;https://julien.danjou.info/blog/ai-is-a-human-interface-nightmare&quot;&gt;AI is still a human interface nightmare&lt;/a&gt;, and the way we interact with these systems today is far from what it could be. While challenges remain, the ongoing advancements in AI technology bring us closer to a future where misunderstandings are minimized and every message is clearly understood.&lt;/p&gt;
</content:encoded></item><item><title>Reflecting on the Journey of &quot;Nom d&apos;un Pipeline !&quot;</title><link>https://julien.danjou.info/blog/reflecting-on-the-journey-of-nom/</link><guid isPermaLink="true">https://julien.danjou.info/blog/reflecting-on-the-journey-of-nom/</guid><description>A Podcast About CI/CD for the French Tech Scene</description><pubDate>Tue, 30 Jul 2024 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&quot;&lt;a href=&quot;https://nomdunpipeline.com&quot;&gt;Nom d&apos;un Pipeline !&lt;/a&gt;&quot; (translation for English readers: &quot;What a Pipeline!&quot;) has been an incredible journey, filled with learning, growth, and connecting with some of the brightest minds in the French tech scene. As a French-based startup for over five years, with most of our customers and audience in the US, we noticed a significant gap in the knowledge and maturity around CI/CD among French engineering teams. This realization motivated us to create a platform to elevate the understanding and practices of CI/CD in our home country, leading to the birth of &quot;Nom d&apos;un Pipeline !&quot;&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/images/blog/37d5f1f1-3eb9-4cae-984c-ba4f49a99615_1323x1323.png&quot; alt=&quot;Nom d&apos;un Pipeline podcast logo&quot; /&gt;&lt;/p&gt;
&lt;h2&gt;The Birth of an Idea&lt;/h2&gt;
&lt;p&gt;Starting a podcast was a new adventure for me. With the help of Mergify’s marketing team, we built everything from the ground up and began reaching out to potential guests. We aimed to find the right candidates and teams to feature in our 45-minute episodes, sharing their journeys and insights around continuous integration, continuous deployment, and quality assurance.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/images/blog/60f0e810-d2cf-4180-b93a-b71e84f49613_1536x768.webp&quot; alt=&quot;Nom d&apos;un Pipeline podcast banner&quot; /&gt;&lt;/p&gt;
&lt;h2&gt;Highlights from Season 1&lt;/h2&gt;
&lt;p&gt;All episodes were amazing, and the content we built was really valuable. Still, several episodes stood out for their depth and impact:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&quot;https://www.nomdunpipeline.com/episode/definir-identifier-et-tester-pour-performer&quot;&gt;Mathieu Leroux-Huet&lt;/a&gt; discussed performance, offering valuable insights into optimizing systems for better efficiency.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&quot;https://www.nomdunpipeline.com/episode/ep-10-faire-des-economies-avec-ses-propres-runners&quot;&gt;Cyril Rohr&lt;/a&gt; from RunsOn shared how improving GitHub Actions runners can significantly boost the speed and cut the cost of CI workflows.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&quot;https://www.nomdunpipeline.com/episode/ep-6-vers-des-pipelines-ci-cd-imposes-et-standardises-avec-olivier-pillaud-tirard-de-manomano&quot;&gt;Olivier Pillaud-Tirard&lt;/a&gt; from ManoMano detailed how they have improved their CI over the last couple of years, providing a roadmap for other teams to follow.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;These episodes were not only informative but also showcased the diverse approaches and innovations happening within the French tech community.&lt;/p&gt;
&lt;h2&gt;Overcoming Challenges&lt;/h2&gt;
&lt;p&gt;Producing a podcast comes with its own set of challenges. One of the biggest hurdles we faced was recording remotely. There were times when technical issues disrupted the recordings, but we managed to overcome these obstacles with perseverance and technical adjustments. While I would love to record in a studio, living in Toulouse makes it challenging since none of my guests are local. Despite these difficulties, the remote setup has allowed us to connect with a broader range of guests.&lt;/p&gt;
&lt;h2&gt;Positive Feedback and Growing Momentum&lt;/h2&gt;
&lt;p&gt;The feedback we&apos;ve received has been overwhelmingly positive. Listeners appreciate the insights and real-world experiences shared by our guests. Knowing that the show provides value and helps the French tech scene advance in CI/CD practices is incredibly rewarding. Recently, we passed the 2,000 views/listens mark, indicating growing momentum after just eight months. This milestone is a testament to the show&apos;s impact and the increasing interest in CI/CD topics.&lt;/p&gt;
&lt;h2&gt;Personal and Professional Growth&lt;/h2&gt;
&lt;p&gt;On a personal level, hosting &quot;Nom d&apos;un Pipeline !&quot; has been a delightful experience. I discovered that I genuinely enjoy talking to people and learning about their teams and tech stacks. It&apos;s been an eye-opening journey that has enriched my understanding of CI/CD and connected me with some brilliant minds in the industry.&lt;/p&gt;
&lt;h2&gt;Acknowledging Our Guests&lt;/h2&gt;
&lt;p&gt;I want to extend my heartfelt thanks to all our guests for their confidence and for sharing their journeys: Clément, Sofiyan, Romaric, Aurélien, Frédéric, Olivier, Dan, Thomas, Mathieu, Cyril and François.&lt;/p&gt;
&lt;p&gt;Your contributions have been invaluable in making this podcast a success.&lt;/p&gt;
&lt;h2&gt;Looking Ahead to Season 2&lt;/h2&gt;
&lt;p&gt;As we wrap up the first season, we are already gearing up for season 2, set to launch in September. We are excited to bring new hosts from major French companies like Doctolib and Alan, promising even more insightful content and engaging discussions. While the details are still taking shape, we are committed to continuing our mission of educating and inspiring the French tech community about CI/CD.&lt;/p&gt;
&lt;p&gt;&quot;Nom d&apos;un Pipeline !&quot; has been an extraordinary journey, and I&apos;m proud of what we&apos;ve accomplished so far. We hope that this podcast continues to serve as a valuable resource for the French tech scene, helping teams to improve their continuous integration, testing, continuous deployment, and overall development workflows — including tackling tough topics like &lt;a href=&quot;https://julien.danjou.info/blog/the-challenges-of-merge-queues&quot;&gt;the challenges of merge queues&lt;/a&gt;. Stay tuned for more exciting episodes and insights in the upcoming season!&lt;/p&gt;
&lt;p&gt;If you understand French, don’t forget to subscribe to the podcast &lt;a href=&quot;https://www.nomdunpipeline.com/&quot;&gt;on your favorite podcast platform&lt;/a&gt; or on &lt;a href=&quot;https://www.youtube.com/@NomdunPipeline&quot;&gt;YouTube&lt;/a&gt;!&lt;/p&gt;
</content:encoded></item><item><title>The Challenges of Merge Queues</title><link>https://julien.danjou.info/blog/the-challenges-of-merge-queues/</link><guid isPermaLink="true">https://julien.danjou.info/blog/the-challenges-of-merge-queues/</guid><description>Why They’re Hard and Why We’re Simplifying Them</description><pubDate>Tue, 23 Jul 2024 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Merge queues are a tough concept to grasp, and over the last five years at &lt;a href=&quot;https://mergify.com&quot;&gt;Mergify&lt;/a&gt;, we&apos;ve spent countless hours educating developers about their importance and utility. We&apos;ve published numerous blog posts, written extensive documentation, and even gone to conferences to teach software engineers what a merge queue is. This process of spreading awareness has been a rewarding yet challenging endeavor.&lt;/p&gt;
&lt;p&gt;One of our developers, Charly Laurent, gave an insightful talk on the subject, highlighting how merge queues can revolutionize CI/CD processes. You can check out his talk here:&lt;/p&gt;
&lt;h2&gt;Understanding Merge Queues&lt;/h2&gt;
&lt;p&gt;Merge queues are not an obvious choice for most teams, and they often require a shift in the balance between safety and speed of delivery. Deploying a merge queue means prioritizing quality over quantity, which is not an easy decision for many development teams, who might be pressured to ship fast.&lt;/p&gt;
&lt;p&gt;For example, without a merge queue, teams often merge untested code. This is due to outdated test runs, meaning that they are deploying code that might not work. Without a merge queue, there is no way to prevent merging pull requests with outdated tests and breaking the CI for everyone — which is exactly why you should &lt;a href=&quot;https://julien.danjou.info/blog/stop-merging-your-pull-request-manually&quot;&gt;stop merging your pull requests manually&lt;/a&gt; in the first place. One of our customers faced this exact issue, which meant they needed the equivalent of a full-time engineer dedicated to tracking issues in the main branch that broke the CI.&lt;/p&gt;
&lt;p&gt;Most platform engineers find the shift from post-merge testing to pre-merge testing challenging.&lt;/p&gt;
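&lt;p&gt;The mechanics themselves are easy to sketch, though. Here is a deliberately simplified model of a sequential merge queue (the names and the &lt;code&gt;ci_passes&lt;/code&gt; callback are illustrative, not any real API):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;def run_merge_queue(prs, ci_passes):
    # Each PR is tested against main plus everything already queued
    # ahead of it, so what gets merged is exactly what CI validated.
    # A failing candidate is rejected and main stays green.
    main = []
    for pr in prs:
        candidate = main + [pr]
        if ci_passes(candidate):
            main = candidate
    return main
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The key property: no pull request ever lands with an outdated test run, because the state that gets merged is the state CI actually exercised.&lt;/p&gt;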
&lt;h2&gt;The Trade-offs&lt;/h2&gt;
&lt;p&gt;&lt;a href=&quot;https://vercel.com/blog/deploy-safely-on-vercel-without-merge-queues&quot;&gt;This blog post from Vercel&lt;/a&gt; captures this common misunderstanding and the trade-off around merge queues and their CI costs and latency:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Despite the majority of commits being safe to merge after the local CI checks complete on their pull request, the merge queue will incur running the cost of running the CI again every time.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;While this is true, the problem here lies in the word &quot;majority.&quot; The definition of &quot;majority&quot; can vary significantly across teams. If a minority of pull requests break the main branch after merging, it can cause considerable downtime and require substantial effort from CI engineers to restore stability. We&apos;ve seen teams come to Mergify with a 30% failure rate on their &lt;code&gt;main&lt;/code&gt; branch. While a merge queue won&apos;t magically improve the failure rate, it ensures that it doesn&apos;t worsen, even if it means a small decrease in merge speed. That ensures that the effort invested in improving the CI is not wasted the day after.&lt;/p&gt;
&lt;p&gt;Another perspective from Vercel states:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;With merge queues, changes from developers depend on changes from other developers even if they are unrelated to each other, and this makes it hard to scale monorepo merge times with more developers.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;This concern is valid for merge queues that don&apos;t support monorepos or queue parallelization. However, most modern merge queues (GitHub&apos;s own being an exception) do allow for optimization in these scenarios.&lt;/p&gt;
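&lt;p&gt;The usual optimization is speculative checks: instead of testing one candidate at a time, the queue starts one CI job per prefix of the queue, all in parallel. A toy sketch of how those candidate branches are built (illustrative only):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;def speculative_candidates(queue):
    # One candidate branch per prefix of the queue: if every job is
    # green, all queued PRs merge at once; if job N fails, the PRs
    # ahead of it can still merge and only the tail is retried.
    return [queue[:n] for n in range(1, len(queue) + 1)]

speculative_candidates([&quot;pr1&quot;, &quot;pr2&quot;, &quot;pr3&quot;])
# [[&apos;pr1&apos;], [&apos;pr1&apos;, &apos;pr2&apos;], [&apos;pr1&apos;, &apos;pr2&apos;, &apos;pr3&apos;]]
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This is why unrelated changes need not serialize each other in practice: as long as failures are rare, the latency approaches that of a single CI run.&lt;/p&gt;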
&lt;p&gt;Vercel’s blog post concludes with:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;With this workflow in place, the merge queue can be safely removed because checks will still always be run before users ever see the deployment.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;This reflects the workflow of many teams that don&apos;t use a merge queue: merge, run tests on &lt;code&gt;main&lt;/code&gt;, then deploy. However, this approach doesn&apos;t solve the issue of merging something that breaks the main branch. During the downtime, teams have to identify the culprit, revert changes, and ensure everything works, causing delays and frustration. Bad developer experience ensues.&lt;/p&gt;
&lt;p&gt;Teams like &lt;a href=&quot;https://www.uber.com/blog/research/keeping-master-green-at-scale/&quot;&gt;Uber recognized this problem six years ago&lt;/a&gt; and started building their merge queues. Similarly, in OpenStack, we had a system supporting multiple repositories with &lt;a href=&quot;https://zuul-ci.org/&quot;&gt;Zuul&lt;/a&gt; over ten years ago.&lt;/p&gt;
&lt;h2&gt;Build New Solutions&lt;/h2&gt;
&lt;p&gt;Considering the merge queue adoption issues, we&apos;ve spent the last few months reworking our merge queue system to simplify deployment and enhance user experience. We know for a fact that developers appreciate the reliability it brings to CI processes, but we also see how difficult the system can be to discover and integrate. By deploying a merge queue, teams can eliminate the need for a &quot;check that main works before deployment&quot; step because this is done before the actual merge.&lt;/p&gt;
&lt;p&gt;One notable example is a team that previously needed a full-time engineer to manage CI issues due to frequent breaks in the &lt;code&gt;main&lt;/code&gt; branch. After adopting Mergify&apos;s merge queue, they drastically reduced these disruptions, allowing their engineers to focus on more productive tasks.&lt;/p&gt;
&lt;h2&gt;The Road Ahead&lt;/h2&gt;
&lt;p&gt;Merge queues are not without their challenges, and the trade-offs between safety and speed are not always apparent. However, we believe in their potential to transform development workflows. We&apos;re on the verge of redefining the merge queue concept at Mergify, and we think it has far greater potential than what has been realized over the past decade. I’ll be happy to write about that soon and share what we’ve built.&lt;/p&gt;
</content:encoded></item><item><title>Navigating SQL Migrations with Confidence: Introducing sql-compare</title><link>https://julien.danjou.info/blog/navigating-sql-migrations-with-confidence/</link><guid isPermaLink="true">https://julien.danjou.info/blog/navigating-sql-migrations-with-confidence/</guid><description>Delivering SQL schema change at scale.</description><pubDate>Tue, 16 Jul 2024 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;As long as I can remember, SQL has been a cornerstone of my engineering journey. My early days at university were filled with monotonous Oracle-based SQL courses, which I found uninspiring. Knowing I would likely never use Oracle, I shifted my focus to &lt;a href=&quot;https://mysql.com&quot;&gt;MySQL&lt;/a&gt;. Over time, I discovered the limitations of MySQL and was introduced to &lt;a href=&quot;https://www.postgresql.org/&quot;&gt;PostgreSQL&lt;/a&gt;, thanks to &lt;a href=&quot;https://tapoueh.org/about/&quot;&gt;Dimitri&lt;/a&gt;. I even organized a few meetups in Paris and encouraged Dimitri to publish &quot;&lt;a href=&quot;https://theartofpostgresql.com/&quot;&gt;The Art of PostgreSQL&lt;/a&gt;,&quot; arguably the best book on SQL (&lt;a href=&quot;https://julien.danjou.info/blog/the-art-of-postgresql-is-out&quot;&gt;I reviewed it here&lt;/a&gt;). Eventually, I embraced PostgreSQL wholeheartedly.&lt;/p&gt;
&lt;p&gt;SQL databases are a timeless technology that continues to evolve. From &lt;a href=&quot;https://www.timescale.com/&quot;&gt;Timescale&lt;/a&gt; to &lt;a href=&quot;https://github.com/pgvector/pgvector&quot;&gt;pgvector&lt;/a&gt;, new advancements are continually emerging. However, one persistent challenge has been managing database migrations. Modifying your data model is crucial for evolving your application, but it’s often a daunting task. At Mergify, like many companies, we’ve faced this challenge head-on.&lt;/p&gt;
&lt;p&gt;We&apos;ve tried various solutions, from custom Python scripts to using &lt;a href=&quot;https://github.com/djrobstep/migra&quot;&gt;migra&lt;/a&gt;, an open-source project that is unfortunately no longer maintained. Each solution had its drawbacks, leading us to a crossroads where we had to decide on our next move.&lt;/p&gt;
&lt;h2&gt;The Initial Struggle&lt;/h2&gt;
&lt;p&gt;At &lt;a href=&quot;https://mergify.com&quot;&gt;Mergify&lt;/a&gt;, PostgreSQL is the backbone of our data handling, from managing the state of GitHub objects to maintaining our event log. From the beginning, we’ve interacted with the database exclusively using an ORM, choosing &lt;a href=&quot;https://www.sqlalchemy.org/&quot;&gt;SQLAlchemy&lt;/a&gt; for its maturity, framework agnosticism, and support for asynchronous I/O since version 2.0.0.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/images/blog/81ae39c3-f945-42cc-80c0-99ade9f0bc9f_1456x816.webp&quot; alt=&quot;Illustration of SQL database migration workflow&quot; /&gt;&lt;/p&gt;
&lt;p&gt;Given our frequent production deployments, a robust CI/CD pipeline is essential to handle database evolution smoothly. Every schema modification must be rigorously tested and automatically applied to the production database, adhering to the principles outlined in Martin Fowler&apos;s &quot;&lt;a href=&quot;https://martinfowler.com/articles/evodb.html&quot;&gt;Evolutionary Database Design.&lt;/a&gt;&quot; Version-controlling each database artifact and scripting every change as a migration are critical steps in this process.&lt;/p&gt;
&lt;p&gt;We chose &lt;a href=&quot;https://alembic.sqlalchemy.org/&quot;&gt;Alembic&lt;/a&gt; to manage our database migrations. Maintained by the SQLAlchemy team, Alembic is a command-line tool that can automatically create migration scripts from your SQLAlchemy models. Each script is version-controlled alongside your source code. Alembic applies these migrations to the database, recording the revision number in the &lt;code&gt;alembic_version&lt;/code&gt; table to ensure only new migrations are applied subsequently. This command is typically executed in the continuous delivery pipeline to keep the production database up-to-date.&lt;/p&gt;
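&lt;p&gt;In day-to-day use, the workflow boils down to two commands (a minimal sketch; the revision message is a placeholder):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Diff the SQLAlchemy models against the current database schema
# and generate a new migration script under versions/
alembic revision --autogenerate -m &quot;add users table&quot;

# Apply all pending migrations; this is what the continuous
# delivery pipeline runs against the production database
alembic upgrade head
&lt;/code&gt;&lt;/pre&gt;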
&lt;h2&gt;A Naive Beginning&lt;/h2&gt;
&lt;p&gt;Our initial approach to testing migration scripts was straightforward: create two databases—one using SQLAlchemy models and the other using only the migration scripts—and ensure they have identical schemas. This involved:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Creating PostgreSQL servers using Docker:&lt;/strong&gt; On a new server, create two empty databases.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Generating schemas:&lt;/strong&gt; Use the first database to create artifacts with SQLAlchemy models, and use Alembic to run migration scripts on the second database.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Comparing schemas:&lt;/strong&gt; Dump each database schema into SQL files using &lt;code&gt;pg_dump&lt;/code&gt; and compare them using Python’s &lt;code&gt;filecmp&lt;/code&gt; and &lt;code&gt;difflib&lt;/code&gt; builtin libraries.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Here’s an example command to dump a database schema into an SQL file:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;pg_dump \
    --dbname=postgresql://user:password@host:port/database \
    --schema-only \
    --exclude-table=alembic_version \
    --format=p \
    --encoding=UTF8 \
    --file /path/to/dump.sql
&lt;/code&gt;&lt;/pre&gt;
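&lt;p&gt;In a test suite, the same dump can be driven from Python. A minimal sketch, assuming illustrative helper names and paths (not our actual code):&lt;/p&gt;

```python
import subprocess

def pg_dump_schema_cmd(dsn: str, outfile: str) -> list[str]:
    """Build the pg_dump invocation that dumps a schema-only SQL file."""
    return [
        "pg_dump",
        f"--dbname={dsn}",
        "--schema-only",
        # The Alembic bookkeeping table only exists on the migrated
        # database, so it must be excluded from the comparison.
        "--exclude-table=alembic_version",
        "--format=p",
        "--encoding=UTF8",
        f"--file={outfile}",
    ]

def dump_schema(dsn: str, outfile: str) -> None:
    """Run pg_dump, raising if the dump fails."""
    subprocess.run(pg_dump_schema_cmd(dsn, outfile), check=True)
```

&lt;p&gt;Keeping the argument list in a pure function makes the invocation itself testable without a running PostgreSQL server.&lt;/p&gt;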
&lt;p&gt;To compare the files:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;assert filecmp.cmp(schema_dump_creation_path, schema_dump_migration_path, shallow=False)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;If the test fails, use &lt;code&gt;difflib&lt;/code&gt; to display the differences:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import difflib
import pathlib

def filediff(path1: pathlib.Path, path2: pathlib.Path) -&amp;gt; str:
    with path1.open() as f1, path2.open() as f2:
        diff = difflib.unified_diff(
            f1.readlines(),
            f2.readlines(),
            path1.name,
            path2.name,
        )
        return &quot;Database dump differences:\n&quot; + &quot;&quot;.join(diff)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;While effective, this test had a significant limitation: it was sensitive to column order. PostgreSQL doesn’t allow reordering columns in place, so we had to keep the column order in our models consistent with the order in which columns had been added to the production database.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/images/blog/f7a3603f-e06b-45d7-97c4-99f9bbb3da76_1456x816.png&quot; alt=&quot;Illustration of comparing database schemas side by side&quot; /&gt;&lt;/p&gt;
&lt;h2&gt;The Complexity Grows&lt;/h2&gt;
&lt;p&gt;As our models grew more complex, our naive test struggled to keep up. Consider the following example:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import datetime

import sqlalchemy
from sqlalchemy import orm

class Base(orm.DeclarativeBase):
    # Declared on the base class, so it is inherited by every model.
    updated_at: orm.Mapped[datetime.datetime] = orm.mapped_column(
        sqlalchemy.DateTime(timezone=True),
        server_default=sqlalchemy.func.now(),
    )

class User(Base):
    __tablename__ = &quot;user&quot;

    id: orm.Mapped[int] = orm.mapped_column(
        sqlalchemy.BigInteger,
        primary_key=True,
    )
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In this setup, the &lt;code&gt;updated_at&lt;/code&gt; column is inherited by every child model, such as &lt;code&gt;User&lt;/code&gt;. Adding a new column to &lt;code&gt;User&lt;/code&gt;, like &lt;code&gt;name&lt;/code&gt;, misaligns the order: the database created fresh from the models lays columns out in declaration order, while the migrated database appends the new column at the end of the existing table, so the two schema dumps no longer match.&lt;/p&gt;
&lt;p&gt;To address this, we needed to compare schemas while ignoring column order. We explored various tools:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Alembic&lt;/strong&gt;: Can compare schemas to generate migration scripts but misses some differences.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Migra&lt;/strong&gt;: Effective at comparing database schemas, but unmaintained.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;SQL dumps&lt;/strong&gt;: The most reliable format but challenging to parse and compare directly.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Building the Solution: sql-compare&lt;/h2&gt;
&lt;p&gt;It was clear that our current solutions were insufficient. We needed a hero to rescue us from the perils of SQL migration management, so we developed &lt;strong&gt;&lt;a href=&quot;https://github.com/Mergifyio/sql-compare&quot;&gt;sql-compare&lt;/a&gt;&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;sql-compare is a Python library that uses &lt;a href=&quot;https://pypi.org/project/sqlparse/&quot;&gt;sqlparse&lt;/a&gt; to parse SQL files and compare schemas, ignoring irrelevant differences like comments, whitespace, and column order. This new tool became an integral part of our workflow, catching migration issues that other tools might miss.&lt;/p&gt;
&lt;p&gt;The main challenge was filtering and grouping tokens by column definition before sorting them. Despite these complexities, sql-compare emerged victorious, enabling us to ensure seamless migrations and maintain schema integrity.&lt;/p&gt;
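&lt;p&gt;To illustrate the idea, here is a deliberately simplified, standard-library-only sketch of order-insensitive comparison (sql-compare itself relies on sqlparse for real tokenization and handles constraints, comments, and multiple statements):&lt;/p&gt;

```python
import re

def normalized_columns(create_table_sql: str) -> list[str]:
    """Extract and sort the column definitions of a CREATE TABLE statement.

    Toy version: assumes a single statement with no commas embedded in
    types or constraints. sql-compare does proper tokenization instead.
    """
    body = re.search(r"\((.*)\)", create_table_sql, re.S).group(1)
    return sorted(col.strip() for col in body.split(","))

a = "CREATE TABLE users (id bigint, updated_at timestamptz, name text)"
b = "CREATE TABLE users (id bigint, name text, updated_at timestamptz)"

# A textual diff flags these as different, yet the schemas are equivalent.
assert a != b
assert normalized_columns(a) == normalized_columns(b)
```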
&lt;h2&gt;The Journey Forward&lt;/h2&gt;
&lt;p&gt;We’ve open-sourced sql-compare to help others facing similar challenges. You can try it by running &lt;code&gt;pip install sql-compare&lt;/code&gt;. We plan to enhance sql-compare, such as creating functions to retrieve all schema differences for better test results. If you have suggestions or want to contribute, feel free to submit issues or pull requests on our GitHub repository.&lt;/p&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;Managing database migrations is a complex but essential task for evolving applications. With sql-compare, we found our solution, ensuring seamless migrations, maintaining schema integrity, and continuing to deliver high-quality software. Our journey through the challenges of SQL migrations has taught us valuable lessons, and with sql-compare, we’re better equipped to face the future.&lt;/p&gt;
</content:encoded></item><item><title>A Journey of Embracing Linters</title><link>https://julien.danjou.info/blog/the-journey-of-embracing-linters/</link><guid isPermaLink="true">https://julien.danjou.info/blog/the-journey-of-embracing-linters/</guid><description>perl -e &apos;use strict;&apos;</description><pubDate>Tue, 09 Jul 2024 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Recently, I found myself in a spirited debate with one of our front-end developers at &lt;a href=&quot;https://mergify.com&quot;&gt;Mergify&lt;/a&gt;. This discussion, revolving around the usage of linters, reminded me of my long and storied history with these &quot;advisor tools.&quot; Having been confronted with linters for the past 25 years, I believe it&apos;s time to share some of that accumulated wisdom.&lt;/p&gt;
&lt;p&gt;My first encounter with a linter was with &lt;code&gt;use strict&lt;/code&gt; in &lt;a href=&quot;https://www.perl.org/&quot;&gt;Perl&lt;/a&gt;. Although I can&apos;t recall the specifics of what it did, I do remember it being an essential tool for writing better code. Later on, I encountered the &lt;code&gt;gcc -W&lt;/code&gt; and &lt;code&gt;-pedantic&lt;/code&gt; options, which I enabled religiously in all my projects. These early experiences set the stage for my ongoing relationship with linters.&lt;/p&gt;
&lt;h2&gt;Warnings&lt;/h2&gt;
&lt;p&gt;Fast forward to today: my recent discussion centered on &lt;a href=&quot;https://eslint.org/&quot;&gt;eslint&lt;/a&gt; and enabling all the checks for the Playwright plugin, treating every drift as an error rather than a warning. This distinction is crucial: an error causes the CI to fail, while a warning merely generates noise. Not all linters have this warning level, but in my experience, warnings left unaddressed only add noise and ambiguity to your development workflow. An error should be a clear-cut issue: either explicitly ignore it or fix it.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Having unresolved warnings in your CI logs creates ambiguity and inefficiency. Make a decision. Commit to it.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/images/blog/2966edeb-e81d-45e4-b2c4-a82215b7812b_1382x304.png&quot; alt=&quot;Screenshot of the eslint warning that triggered the linter discussion&quot; /&gt;
&lt;em&gt;The original warning that triggered our discussion.&lt;/em&gt;&lt;/p&gt;
&lt;h2&gt;Picking Errors&lt;/h2&gt;
&lt;p&gt;Despite not being a JavaScript expert, my 25 years of experience with various linters gives me some perspective on this matter. Our debate also touched on which approach to use with respect to linters, either:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;stick to the recommended and default settings;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;being stricter by promoting certain warnings to errors for checks we deemed useful;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;enable everything to be an error and explicitly ignore checks that don&apos;t apply to our project or are considered incorrect.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Every linter, from &lt;code&gt;gcc -W&lt;/code&gt; flags to &lt;a href=&quot;https://docs.astral.sh/ruff/&quot;&gt;ruff&lt;/a&gt; in Python, starts with a set of &quot;recommended&quot; settings. These are designed to throw a manageable number of errors on a typical project, making the linter easy to adopt for teams. This doesn&apos;t mean the disabled options are bad; they are simply considered &quot;too much for beginners&quot; and can be enabled later.&lt;/p&gt;
&lt;p&gt;This incremental approach is how we adopted &lt;a href=&quot;https://mypy-lang.org/&quot;&gt;mypy&lt;/a&gt; at Mergify. The default typing checks are relatively light, allowing us to enable it without much friction. We spent a few weeks fixing typing issues, caught a few bugs in the process, and were satisfied. Gradually, we enabled more checks until we reached the point of enabling &lt;code&gt;strict = true&lt;/code&gt; (a nostalgic nod to Perl) and caught even more (potential) bugs.&lt;/p&gt;
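&lt;p&gt;A minimal sketch of what such an incremental configuration can look like in &lt;code&gt;pyproject.toml&lt;/code&gt; (the commented flags are typical intermediate steps, not our exact adoption history):&lt;/p&gt;

```
[tool.mypy]
# End state after incremental adoption.
strict = true

# Typical intermediate steps before flipping strict on:
# disallow_untyped_defs = true
# warn_return_any = true
# warn_unused_ignores = true
```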
&lt;p&gt;On the flip side, having a poorly calibrated set of default recommendations is why I never adopted &lt;a href=&quot;https://www.pylint.org/&quot;&gt;pylint&lt;/a&gt;. Running pylint on our otherwise impeccable Python code, which passes ruff with most checks enabled, results in 13,000 errors for 140,000 SLOC. (I wrote about similar code quality tools in &lt;a href=&quot;https://julien.danjou.info/blog/the-best-flake8-extensions&quot;&gt;The Best Flake8 Extensions&lt;/a&gt;.) This is an insurmountable barrier for any developer. The prospect of ignoring all these non-critical errors, such as missing docstrings or line lengths, seems daunting.&lt;/p&gt;
&lt;h2&gt;Eslint and Playwright&lt;/h2&gt;
&lt;p&gt;Returning to eslint and &lt;a href=&quot;https://playwright.dev/&quot;&gt;Playwright&lt;/a&gt;, we used the following code to enable all Playwright rules as errors:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;...Object.keys(playwrightPlugin.configs[&apos;flat/recommended&apos;].plugins.playwright.rules).reduce(
  (acc, rule) =&amp;gt; {
    acc[`playwright/${rule}`] = &apos;error&apos;;
    return acc;
  },
  {}
),
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This approach ensures we don&apos;t miss any linting recommendations from the Playwright team. With &lt;a href=&quot;https://docs.github.com/en/code-security/getting-started/dependabot-quickstart-guide&quot;&gt;Dependabot&lt;/a&gt; automatically updating our dependencies, new errors introduced by updates appear in brand-new pull requests, allowing us to improve our code continuously.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/images/blog/b46b3de6-8fa1-4445-9434-da42a5fbe88b_1536x768.png&quot; alt=&quot;Illustration of enabling all linter checks and treating warnings as errors&quot; /&gt;&lt;/p&gt;
&lt;p&gt;In conclusion, &quot;recommended&quot; settings in linters are designed for ease of adoption, striking a balance between &quot;best practices&quot; and &quot;practicality&quot;; they are an onboarding default, not the standard you should settle for.&lt;/p&gt;
&lt;p&gt;Striving for perfection (assuming your linter is robust and not crazy) is always the goal. Make deliberate choices about which checks to ignore, and remember that linters are here to help you write better, more reliable code.&lt;/p&gt;
</content:encoded></item><item><title>The Biggest Mistake We Made Building Mergify: Navigating the Hiring Minefield</title><link>https://julien.danjou.info/blog/the-biggest-mistake-we-made-building-a43/</link><guid isPermaLink="true">https://julien.danjou.info/blog/the-biggest-mistake-we-made-building-a43/</guid><description>Possibly the hardest part of the job.</description><pubDate>Tue, 02 Jul 2024 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;em&gt;This is part 2 of the &quot;Biggest Mistakes&quot; series. Read part 1: &lt;a href=&quot;https://julien.danjou.info/blog/the-biggest-mistake-we-made-building&quot;&gt;Navigating the Payment System Nightmare&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;Building a successful startup is a journey filled with unexpected challenges, and hiring the right people is undoubtedly one of the most daunting tasks. As a tech engineer with no HR background, I’ve faced numerous hiring pitfalls that have taught me invaluable lessons. Reflecting on our journey at &lt;a href=&quot;https://mergify.com&quot;&gt;Mergify&lt;/a&gt;, it&apos;s clear that navigating the complexities of hiring has been one of the biggest challenges we&apos;ve encountered. Here’s what we’ve learned.&lt;/p&gt;
&lt;h2&gt;Who to Hire: The Temptation of Cheap Labor&lt;/h2&gt;
&lt;p&gt;When you’re a startup with limited resources, it’s tempting to hire interns or apprentices, especially when they come at a low or even zero cost, thanks to government sponsorships. In France, this is an attractive option many startups pursue, and we were no different. Initially, we hired interns and apprentices, believing that this would provide us with much-needed help without straining our budget.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/images/blog/1eeb04a8-72d7-4d76-b1d7-1ff056e39f80_1536x768.png&quot; alt=&quot;Illustration of the temptation of hiring interns and cheap labor at a startup&quot; /&gt;&lt;/p&gt;
&lt;p&gt;However, we quickly realized that while interns can be a valuable addition, they often lack the expertise we needed to tackle complex tasks. As founders, we required skilled assistance, not just extra hands for minor tasks. The overhead of breaking down projects into manageable tasks and guiding interns through them often resulted in a net loss of productivity. While we did encounter some exceptional interns who became valuable team members, the general rule is that relying on almost-free resources like interns is unlikely to provide the expertise needed in the early days of your startup.&lt;/p&gt;
&lt;h2&gt;How to Hire: The Challenge of Assessing Candidates&lt;/h2&gt;
&lt;p&gt;Hiring is more than just finding someone with the right skills; it’s about finding the right fit for your team and company culture. Despite the plethora of online resources on evaluating candidates, the reality of assessing someone over a Zoom call in just a few hours is incredibly challenging.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;We&apos;ve had both terrific and terrible hires, and in each case, we believed they were a perfect fit at the time.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;After conducting hundreds of interviews and hiring more than a dozen people, we’ve learned one golden rule: if you have any doubts about a candidate, don’t hire them. It’s better to wait for the right person than to rush and hire someone who might not fit well with your team.&lt;/p&gt;
&lt;h2&gt;Finding Candidates: The Perils of Recruitment&lt;/h2&gt;
&lt;p&gt;The search for talent is a constant struggle. In France, we experimented with various solutions, from Welcome to the Jungle (a nightmare) to &lt;a href=&quot;https://talent.io&quot;&gt;talent.io&lt;/a&gt; (effective). We also engaged a few headhunting firms, which unfortunately turned out to be a costly mistake. These firms often sent us unsuitable candidates and still kept their fees. The legal obligations in France favor the headhunters, not the employers, making it a risky and expensive endeavor.&lt;/p&gt;
&lt;p&gt;For example, we once hired a candidate and paid the headhunting firm its fee. The employee then ended their trial period and left. In theory, the firm should have sent us new candidates to replace them; in practice, having already been paid, they had no incentive to do so, and they didn’t. We ended up finding a replacement through a different channel, so the fee was simply lost. Considering that fees can run 10–20% of the employee&apos;s yearly salary, that&apos;s a large amount of money to throw out the window.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/images/blog/df7cb53f-1b32-4f33-9622-ca8767df181e_1536x768.webp&quot; alt=&quot;Illustration of the challenges of finding talent through headhunters and recruitment&quot; /&gt;&lt;/p&gt;
&lt;p&gt;Networking and recommendations remain some of the best ways to find talent, but they don’t scale well and often have timing issues: you find the right candidate, but they’re not available, or a friend knocks at the door when you don’t have the budget to hire them.&lt;/p&gt;
&lt;p&gt;Additionally, we realized that marketing our company effectively during hiring talks is crucial. Initially, our pitch didn’t resonate, and most candidates would ignore us. By improving our presentation and emphasizing our company culture and values, we started attracting genuinely interested candidates.&lt;/p&gt;
&lt;p&gt;After years, we discovered that you want to &lt;em&gt;polarize&lt;/em&gt; your candidates early in the process and during their employment to ensure they get 100% on board with your venture. Depending on your founder profile, that might come naturally. As tech founders, we were not particularly good at it, but we learned along the way.&lt;/p&gt;
&lt;h2&gt;The Remote Work Dilemma&lt;/h2&gt;
&lt;p&gt;Building a remote team has advantages, like accessing a broader talent pool, but it also comes with significant challenges. At Mergify, we embraced remote work and &lt;a href=&quot;https://blog.mergify.com/embracing-remote-work-how-we-built-mergify-as-a-successful-asynchronous-company/&quot;&gt;wrote extensively about our experience&lt;/a&gt;. Remote work works well with senior staff, but junior employees often struggle without in-person guidance. Sharing the company vision and brainstorming ideas are also more effective in person, which is why we regularly organize in-person meetings.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/images/blog/2aad7a8b-4718-4807-9bc0-48c702dd110d_1536x768.png&quot; alt=&quot;Illustration of remote work challenges and team building&quot; /&gt;&lt;/p&gt;
&lt;p&gt;Regular team-building events, video calls, and asynchronous communication help bridge the gap, but they can’t completely replace the spontaneous interactions that foster innovation. Remote work is great for finding talent, but in-person connections remain essential for a cohesive and innovative team environment.&lt;/p&gt;
&lt;p&gt;I would not consider remote work a mistake, but we underestimated its impact on the company.&lt;/p&gt;
&lt;h2&gt;Lessons Learned and Rules Established&lt;/h2&gt;
&lt;p&gt;From our hiring mistakes, we’ve developed a few key rules:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Avoid hiring remote junior staff if you are working remotely. They need more guidance and in-person interaction.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Leverage in-person connections for innovation. Remote work makes this challenging, especially for junior staff.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Share a lot of context to drive innovation and execution. Overcommunicate.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Be cautious of headhunters and their fees. Consider delaying payment until the trial period ends.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Avoid working with multiple headhunting firms at once; you may end up finding your candidate through one after having already paid another.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Learn to pitch your company effectively. Highlight your values and culture to attract the right candidates.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;If you have any doubts about a candidate, don’t hire them.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Don’t hire interns and trainees until you have enough senior staff to mentor them. Consider them a small cost, not a huge win.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Navigating the hiring process is complex and fraught with potential pitfalls, but by learning from our mistakes and establishing clear rules, we&apos;ve been able to build a stronger, more effective team.&lt;/p&gt;
&lt;p&gt;If you’re building a startup, remember that the right hires can make all the difference, and taking the time to find them is well worth the effort.&lt;/p&gt;
</content:encoded></item><item><title>Discovering the Tech Community in Toulouse: A Three-Year Journey</title><link>https://julien.danjou.info/blog/discovering-the-tech-community-in/</link><guid isPermaLink="true">https://julien.danjou.info/blog/discovering-the-tech-community-in/</guid><description>It&apos;s been 3 years now since I moved from Paris to Toulouse.</description><pubDate>Tue, 25 Jun 2024 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;When I moved to Toulouse three years ago, I knew almost no one. My only connection was my cofounder, &lt;a href=&quot;http://sileht.net&quot;&gt;Mehdi&lt;/a&gt;. It was an intimidating start, but thanks to the magic of &lt;a href=&quot;https://www.linkedin.com/posts/juliendanjou_toulouse-remotework-activity-6807626256461414400-th73?utm_source=share&amp;amp;utm_medium=member_desktop&quot;&gt;LinkedIn&lt;/a&gt;, my presence in the city didn&apos;t go unnoticed. Soon, people reached out to me, giving me my first glimpse into the tech scene here. Among those early connections were Denis and Cédric, whose introductions helped bootstrap my network at an impressive speed.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/images/blog/6990de8c-aa50-42bf-afd6-5fa15a19f7b5_1032x780.webp&quot; alt=&quot;Map showing Toulouse, fourth largest city in France and home of Airbus&quot; /&gt;
&lt;em&gt;In case you have never heard of Toulouse, it’s the fourth largest city in France (soon to be third), with 0.5 million inhabitants, and the home of Airbus.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;Mergify is a remote-first company, so there’s no office to commute to every day and no colleagues to provide a ready-made social circle. That forced me to go out and build a network in a city I barely knew.&lt;/p&gt;
&lt;p&gt;In 2022, I decided to reboot the &lt;a href=&quot;https://www.meetup.com/fr-FR/python-toulouse/&quot;&gt;Toulouse Python Meetup&lt;/a&gt;. Despite having over 800 members at the time, the group had gone dormant since COVID-19. I reached out to the previous (idle) organizers, and with the support of Hugo at Mergify, we organized our first meetup. (I wrote more about &lt;a href=&quot;https://julien.danjou.info/blog/attending-conferences&quot;&gt;attending conferences&lt;/a&gt; and how my approach changed over the years.) I prepared a quick experience-report talk on how we deployed &lt;a href=&quot;https://mypy-lang.org/&quot;&gt;mypy&lt;/a&gt; at Mergify, which I presented to the three attendees who showed up (!).&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/images/blog/b812719b-1ad1-4283-8af9-2cfcec62bbc0_2161x1217.png&quot; alt=&quot;Photo of the Toulouse Python Meetup session&quot; /&gt;&lt;/p&gt;
&lt;p&gt;It was a humble beginning, but it was a start. Reviving in-person events was still challenging, but we persisted, organizing more meetups until the community began to grow. This effort eventually led us to Wannes, who now runs the events, allowing me to focus on other commitments. It’s tough to get people to attend events nowadays, possibly due to the lingering effects of COVID and the convenience of YouTube.&lt;/p&gt;
&lt;p&gt;Joining the &lt;a href=&quot;https://tech.rocks/&quot;&gt;Tech.Rocks&lt;/a&gt; a couple of years ago was another significant milestone. Through this community, I discovered a substantial number of tech professionals in and around Toulouse. Remote work has made it possible for people to work far from their office, and I met people working at companies such as Datadog, Elastic, Spotify, OVHcloud, Ankorstore, AWS, Scaleway, MongoDB, ManoMano, Malt, Zenchef, GitLab — and there are many others I have yet to meet. I regularly organize lunches and dinners with tech folks from this community to foster stronger bonds, which has been incredibly rewarding.&lt;/p&gt;
&lt;p&gt;I&apos;ve had the pleasure of meeting remarkable people, from experienced engineers to startup founders. Toulouse boasts a vibrant ecosystem, smaller than Paris but thriving nonetheless. While it&apos;s easy to joke about the tech economy here being heavily dependent on Airbus and its providers, that&apos;s not entirely true. There&apos;s a diverse range of companies, though the prevalence of service companies (ESNs) in France is notable, with a scarcity of product-based firms.&lt;/p&gt;
&lt;p&gt;One of the highlights of the tech calendar here is &lt;a href=&quot;https://devfesttoulouse.fr/&quot;&gt;DevFest&lt;/a&gt;, an annual conference organized by the &lt;a href=&quot;https://www.gdgtoulouse.fr/&quot;&gt;Google Developers Group (GDG)&lt;/a&gt;. It&apos;s one of the most active meetups, although the connection to Google isn&apos;t always clear. Nonetheless, it&apos;s a fantastic event that brings the community together.&lt;/p&gt;
&lt;p&gt;As a business angel, I&apos;ve had the opportunity to meet incredible founders and teams from startups like &lt;a href=&quot;https://www.munityapps.com/&quot;&gt;Munity&lt;/a&gt;, &lt;a href=&quot;https://kotzilla.io/&quot;&gt;Kotzilla&lt;/a&gt;, &lt;a href=&quot;https://www.roundtable.eu/&quot;&gt;Roundtable&lt;/a&gt;, and more. It&apos;s inspiring to see startups in Toulouse that aren&apos;t solely focused on aerospace or IoT. There&apos;s a budding entrepreneurial spirit here that’s encouraging to witness.&lt;/p&gt;
&lt;p&gt;Finally, I must mention &lt;a href=&quot;https://www.iot-valley.fr/&quot;&gt;IoT Valley&lt;/a&gt; — it took me three years to understand what the IoT Valley I kept hearing about actually was. Originally centered around SigFox, the community has evolved and now encompasses a broader range of startups. I recently had the chance to host a dinner and podcast episode there, giving me a deeper understanding of its scope. Located near Toulouse in Labège, IoT Valley houses numerous companies, many still focused on IoT, but expanding into other areas as well. They might benefit from a rebrand, but marketing isn&apos;t typically a strong suit in French tech. Sharing my experiences of building &lt;a href=&quot;https://mergify.com&quot;&gt;Mergify&lt;/a&gt; and working as an entrepreneur and business angel was a highlight, and I look forward to the podcast episode going live.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/images/blog/553be933-d17c-4b64-b7ed-8d68532ba5b2_2048x1472.webp&quot; alt=&quot;Photo of the IoT Valley Founder Dinner&quot; /&gt;
&lt;em&gt;IoT Valley Founder Dinner&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;You can listen to the podcast I recorded &lt;a href=&quot;https://www.iot-valley.fr/podcast/37-les-perspectives-sur-lavenir-de-la-tech-avec-julien-danjou&quot;&gt;here&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;In conclusion, Toulouse&apos;s tech scene is dynamic and growing. For anyone considering a move to this city, you&apos;ll find a community that&apos;s welcoming, innovative, and full of opportunities. Whether you&apos;re looking to network, learn, or launch your next venture, Toulouse has something to offer.&lt;/p&gt;
&lt;p&gt;And if you move there, &lt;a href=&quot;mailto:julien@danjou.info&quot;&gt;send me a mail&lt;/a&gt;!&lt;/p&gt;
</content:encoded></item><item><title>A Decade of Writing Books and Selling 25,000 Copies</title><link>https://julien.danjou.info/blog/a-decade-of-writing-books-and-selling/</link><guid isPermaLink="true">https://julien.danjou.info/blog/a-decade-of-writing-books-and-selling/</guid><description>Reflecting on the Journey and Impact of Writing Technical Books.</description><pubDate>Wed, 12 Jun 2024 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Ten years ago, I embarked on a journey that profoundly shaped my career and personal growth. Writing my first book, &lt;em&gt;The Hacker&apos;s Guide to Python&lt;/em&gt; (later updated and renamed &lt;em&gt;Serious Python&lt;/em&gt;), marked the beginning of a series of literary endeavors that allowed me to share my knowledge, experiences, and passion for Python programming with a global audience. Today, I reflect on this journey, the lessons learned, and the incredible milestones achieved along the way.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/images/blog/a7a65530-4515-43ba-b7d2-2dc29f8af2af_2448x3264.webp&quot; alt=&quot;First print of The Hacker&apos;s Guide to Python in 2014&quot; /&gt;
&lt;em&gt;First print of The Hacker’s Guide to Python in 2014&lt;/em&gt;&lt;/p&gt;
&lt;h2&gt;The Genesis: The Hacker&apos;s Guide to Python&lt;/h2&gt;
&lt;p&gt;In March 2014, I published my first book, &lt;em&gt;The Hacker&apos;s Guide to Python&lt;/em&gt;. This book was born out of a desire to provide a comprehensive resource for Python developers, offering insights and techniques I had gathered over the years. The response was overwhelmingly positive, and it motivated me to continue writing and sharing my expertise. I sold over 3,000 copies of the book in a couple of years, which is a very good number in its category.&lt;/p&gt;
&lt;h2&gt;&lt;strong&gt;Challenges and Time Investment&lt;/strong&gt;&lt;/h2&gt;
&lt;p&gt;Writing was an enormous undertaking that demanded significant time and effort. I spent around 150 hours on each book, spanning everything from writing and editing to marketing and publishing, with the process usually spread over a year.&lt;/p&gt;
&lt;p&gt;One of the toughest challenges was constructing a coherent and comprehensive table of contents. This initial step was crucial, as it guided the entire writing process, making the subsequent task of filling in the blanks somewhat more manageable. Additionally, I had to balance my time between my day job as a software engineer and this side project, making time management a critical aspect of the endeavor.&lt;/p&gt;
&lt;p&gt;Another significant difficulty was the proofreading process. I needed both technical and language reviews to ensure the content was accurate and well-written, considering English is not my native language. Finding reliable reviewers who could provide timely and constructive feedback was challenging. Despite reaching out to many contacts, only a fraction responded and contributed consistently.&lt;/p&gt;
&lt;p&gt;Self-publishing also taught me &lt;em&gt;marketing&lt;/em&gt;, one of the best skills I could have learned, and I’m still leveraging it to this day while working on &lt;a href=&quot;https://mergify.com&quot;&gt;Mergify&lt;/a&gt;.&lt;/p&gt;
&lt;h2&gt;Scaling Python and Serious Python&lt;/h2&gt;
&lt;p&gt;Following the success of my first book, I continued exploring new topics and challenges within the Python ecosystem. &lt;em&gt;&lt;a href=&quot;http://scaling-python.com&quot;&gt;Scaling Python&lt;/a&gt;&lt;/em&gt;, published in 2017, delved into the complexities of scaling applications, a topic that resonated with many developers facing similar challenges. I distributed around 1,000 copies of this book.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/images/blog/86b0bfff-9015-4bf6-a5d6-a027cbc6f98a_329x459.png&quot; alt=&quot;Cover of Scaling Python&quot; /&gt;
&lt;em&gt;Scaling Python cover&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;In 2019, after being approached by &lt;a href=&quot;https://nostarch.com/&quot;&gt;No Starch&lt;/a&gt;, I released &lt;em&gt;&lt;a href=&quot;https://serious-python.com&quot;&gt;Serious Python&lt;/a&gt;&lt;/em&gt;, a book aimed at helping developers write more efficient, maintainable, and scalable code. Both books received praise for their practical approach and in-depth coverage of advanced topics. Being backed by No Starch got the book distributed widely, and it has reached &lt;strong&gt;20,000 copies&lt;/strong&gt; as of today.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/images/blog/1ef41626-ec9b-4637-9e5d-2962a2ce27b6_2284x2284.jpeg&quot; alt=&quot;Cover of Serious Python published by No Starch Press&quot; /&gt;&lt;/p&gt;
&lt;h2&gt;Impact on My Career&lt;/h2&gt;
&lt;p&gt;Writing these books significantly impacted my career and established me as an authority in the Python community. When I joined &lt;a href=&quot;https://datadoghq.com&quot;&gt;Datadog&lt;/a&gt; in 2019, I remember seeing my books casually lying around at the entrance of the Paris office.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/images/blog/3664e543-72d2-4659-92e5-84d7328a74df_3456x3492.jpeg&quot; alt=&quot;Julien&apos;s books displayed at the Datadog Paris office entrance in 2019&quot; /&gt;
&lt;em&gt;My books chilling in the Datadog Paris office entrance in 2019&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;This moment was a profound realization of the reach and influence of my work. Colleagues and peers often treated the content of my books as definitive guides. The books provided answers and insights so clearly that people often didn&apos;t feel the need to quiz me during interviews about the topics I covered; they trusted my written word as a reliable source. This validation opened new opportunities and allowed me to connect with an extensive network of professionals who recognized and respected my expertise.&lt;/p&gt;
&lt;h2&gt;Connecting with the Community&lt;/h2&gt;
&lt;p&gt;Writing allowed me to talk to anyone, reach out to amazing hackers worldwide, and forge new friendships. Books were the best excuse to meet fantastic people: I discovered great engineers and learned from their experiences while interviewing them. This journey has been incredibly rewarding, not just professionally but also personally, as I connected with a vibrant community of developers who share my passion for Python and open-source software.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/images/blog/07db89c4-6430-4a4c-b10b-61a84baff249_595x595.jpeg&quot; alt=&quot;Julien presenting his book on stage at PyCon FR 2017&quot; /&gt;
&lt;em&gt;Sharing the knowledge of my book on stage during PyCon FR 2017&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;I also saw my books being translated into multiple languages, including Chinese and Korean.&lt;/p&gt;
&lt;p&gt;Knowing that your knowledge is spreading across the globe gives your writing even more impact. Having your work translated and accessible to a wider audience is a great reward, and it underlines the importance and value of sharing knowledge on such a large scale.&lt;/p&gt;
&lt;h2&gt;The Joy of Writing&lt;/h2&gt;
&lt;p&gt;While writing is hard, it is also refreshing. Producing content that people love and are happy to recommend is a fantastic feeling. My golden rule was, and still is:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Produce content that you&apos;d be happy consuming.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The rest then becomes history. This philosophy guided me through the writing process and ensured that my books remained valuable and relevant to readers.&lt;/p&gt;
&lt;h2&gt;A New Era of Writing&lt;/h2&gt;
&lt;p&gt;Reflecting on my writing journey, I find it resonates deeply with a post I wrote titled &quot;&lt;em&gt;&lt;a href=&quot;https://julien.danjou.info/p/i-used-to-write&quot;&gt;I used to write&lt;/a&gt;&lt;/em&gt;.&quot; In that post, I shared my journey from writing extensively in my early years to facing the challenges of balancing life and work, which decreased my writing output. The desire to return to the keyboard lingered, and despite the rise of AI-generated content, I realized that authentic, human writing still holds immense value.&lt;/p&gt;
&lt;p&gt;Over the last year, I toyed with GPT, generating tons of content and using it to brainstorm, change sentences, and rewrite text. This experimentation reaffirmed my belief that AI could never truly replace the nuanced and creative process of human writing. As AI-generated content grows, the need for genuine, human-crafted writing becomes even more critical. This new writing era challenges us to strengthen our signal amidst the growing noise.&lt;/p&gt;
&lt;p&gt;The past ten years have been an incredible journey of learning, teaching, and connecting with developers worldwide. I look forward to continuing this journey, exploring new topics, and sharing my insights through future books and blog posts.&lt;/p&gt;
&lt;p&gt;Thank you for being a part of this journey.&lt;/p&gt;
</content:encoded></item><item><title>The Biggest Mistake We Made Building Mergify: Navigating the Payment System Nightmare</title><link>https://julien.danjou.info/blog/the-biggest-mistake-we-made-building/</link><guid isPermaLink="true">https://julien.danjou.info/blog/the-biggest-mistake-we-made-building/</guid><description>Navigating the Pitfalls of Payment Processing: Lessons Learned from Integrating Stripe, GitHub Marketplace, and Paddle</description><pubDate>Tue, 11 Jun 2024 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;In 2018, we embarked on an exciting journey with &lt;a href=&quot;https://mergify.com&quot;&gt;Mergify&lt;/a&gt;, our brainchild aimed at simplifying GitHub pull request workflows. One of the first crucial decisions we faced was choosing a payment processor. &lt;a href=&quot;https://stripe.com&quot;&gt;Stripe&lt;/a&gt;, with its developer-centric approach, seemed like the perfect fit. Within a few days, I had mastered the Stripe API and built the foundational billing system for Mergify. For a while, everything ran smoothly as we scaled our user base. However, we soon encountered a significant roadblock: handling VAT in Europe.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/images/blog/350e34cf-0664-400b-a8ff-880cc58f2776_1500x1000.jpeg&quot; alt=&quot;Illustration of navigating payment system challenges&quot; /&gt;&lt;/p&gt;
&lt;p&gt;European VAT is notoriously complex, with countless edge cases that can quickly become a nightmare for any business. Invoicing internationally from France presented additional challenges, leading us to the conclusion that outsourcing our invoicing would be the best course of action.&lt;/p&gt;
&lt;h2&gt;The GitHub Marketplace Misstep&lt;/h2&gt;
&lt;p&gt;In 2019, the &lt;a href=&quot;https://github.com/marketplace&quot;&gt;GitHub Marketplace&lt;/a&gt; appeared to be an attractive solution. It promised to streamline invoicing while exposing Mergify to a broader audience. Although GitHub took a 15% cut (later reduced to 5%), we were not focused on optimizing margins at that stage.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/images/blog/87f22244-5403-401f-b6b6-07e2bd2c2ad9_3000x1610.webp&quot; alt=&quot;Screenshot of the GitHub Marketplace&quot; /&gt;
&lt;em&gt;GitHub Marketplace&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;However, this decision turned out to be a colossal mistake.&lt;/p&gt;
&lt;p&gt;Issues with payments began to surface almost immediately. Problems with GitHub’s payment system meant we often had to ask customers to contact GitHub support, creating a frustrating experience for them. Our inability to manage payments directly, such as retrying failed transactions, was a significant drawback. It became evident that while the GitHub Marketplace was a great tool for acquiring new customers, it was far from ideal for handling payments.&lt;/p&gt;
&lt;p&gt;As a glaring example, if a GitHub customer switched from credit card billing to invoice, they would lose access to all marketplace apps, including Mergify. This could abruptly cut off our service, leading to dissatisfied customers and lost revenue. By 2020, we decided to completely transition away from the GitHub Marketplace for payments, migrating our customers back to Stripe. This move eliminated the invoicing problems caused by GitHub and allowed us to regain control over our billing process.&lt;/p&gt;
&lt;h2&gt;The Paddle Predicament&lt;/h2&gt;
&lt;p&gt;Despite our move back to Stripe, the VAT issue remained unresolved. In our quest for a better solution, we discovered &lt;a href=&quot;https://www.paddle.com/&quot;&gt;Paddle&lt;/a&gt;, a platform that promised to handle VAT by becoming the merchant of record for our transactions. We quickly integrated Paddle into our system, hopeful it would be the solution we needed. Unfortunately, this decision soon proved to be another costly mistake.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/images/blog/20387ab7-572e-41a4-b9f8-1122f386de90_1186x779.png&quot; alt=&quot;Screenshot of the Paddle payment platform&quot; /&gt;
&lt;em&gt;Paddle&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;Paddle&apos;s API was far less sophisticated than Stripe’s, and we found ourselves grappling with numerous limitations and workarounds to integrate our billing system. The added complexity and subpar user experience led us to conclude that Paddle was not the right fit for Mergify.&lt;/p&gt;
&lt;h2&gt;The Turning Point: Handling VAT Ourselves&lt;/h2&gt;
&lt;p&gt;Realizing that there was no perfect third-party solution, we decided to tackle the VAT problem head-on. In 2020, we took the plunge and developed our own VAT handling system using Stripe and Python. We detailed this process in a &lt;a href=&quot;https://blog.mergify.com/handling-european-vat-with-stripe/&quot;&gt;blog post&lt;/a&gt; sharing our approach and challenges.&lt;/p&gt;
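&lt;p&gt;To give an idea of what such a system has to encode, here is a simplified sketch of the per-invoice decision logic in Python. This is an illustration, not our actual implementation: the rates shown are examples, and the real rules have many more edge cases.&lt;/p&gt;

```python
# Simplified EU VAT determination for a SaaS company selling from France.
# Illustrative only: rates change over time, and special regimes (OSS
# thresholds, exempt customers, etc.) are deliberately left out.

EU_VAT_RATES = {
    "FR": 20.0,
    "DE": 19.0,
    "IT": 22.0,
    "IE": 23.0,
    # a real system needs one entry per EU member state, kept up to date
}

SELLER_COUNTRY = "FR"

def vat_rate(customer_country: str, has_valid_vat_number: bool) -> float:
    """Return the VAT percentage to apply to a subscription invoice."""
    if customer_country not in EU_VAT_RATES:
        # Customer outside the EU: no EU VAT is charged.
        return 0.0
    if customer_country == SELLER_COUNTRY:
        # Domestic sale: charge the local rate, whether B2B or B2C.
        return EU_VAT_RATES[SELLER_COUNTRY]
    if has_valid_vat_number:
        # Cross-border B2B with a validated VAT number: the reverse-charge
        # mechanism applies, so the invoice is issued at 0%.
        return 0.0
    # Cross-border B2C digital service: charge the customer's local rate.
    return EU_VAT_RATES[customer_country]
```

&lt;p&gt;A German business with a valid VAT number gets a 0% reverse-charge invoice, while a German consumer pays 19%. Validating VAT numbers, keeping rates current, and producing a compliant invoice for every case is exactly the burden we were trying to offload.&lt;/p&gt;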
&lt;p&gt;Fortunately, Stripe was also working on solving the VAT issue. By the end of 2021, they released their &lt;a href=&quot;https://stripe.com/en-fr/tax&quot;&gt;comprehensive tax product&lt;/a&gt;, simplifying VAT and other tax processes. This allowed us to finally switch fully to Stripe, discarding our custom code in favor of their robust solution.&lt;/p&gt;
&lt;h2&gt;Lessons Learned&lt;/h2&gt;
&lt;p&gt;The most significant lesson from our journey is that payment processing is not a mere detail—it’s an integral part of the user experience. Even now, I spend several hours each month resolving payment issues, from credit card problems to ensuring invoices are correctly fed into various supplier systems. While automation can handle many aspects, the unique methods and systems of each customer often require personalized solutions.&lt;/p&gt;
&lt;p&gt;Our experience underscores the importance of keeping things simple on your side and minimizing friction for your customers. Ensuring a smooth and reliable payment process is crucial for maintaining customer satisfaction and loyalty.&lt;/p&gt;
&lt;p&gt;In hindsight, we should have approached the payment system with the same rigor and attention to detail as the rest of our product.&lt;/p&gt;
&lt;p&gt;It&apos;s a lesson we learned the hard way, but one that has ultimately strengthened Mergify and our commitment to providing the best possible service for our users.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;This is part 1 of the &quot;Biggest Mistakes&quot; series. Read part 2: &lt;a href=&quot;https://julien.danjou.info/blog/the-biggest-mistake-we-made-building-a43&quot;&gt;Navigating the Hiring Minefield&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;
</content:encoded></item><item><title>Sponsoring Conferences</title><link>https://julien.danjou.info/blog/sponsoring-conferences/</link><guid isPermaLink="true">https://julien.danjou.info/blog/sponsoring-conferences/</guid><description>Our experience at Mergify with conference sponsoring.</description><pubDate>Thu, 06 Jun 2024 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Last week, &lt;a href=&quot;https://julien.danjou.info/p/attending-conferences&quot;&gt;I wrote about my experience attending conferences&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Over the last year, we&apos;ve tried to expose Mergify at conferences to reach out to developers. We’ve done various conferences in Europe and the US—the largest being &lt;a href=&quot;https://qconsf.com/&quot;&gt;QCon San Francisco&lt;/a&gt; 2023 and &lt;a href=&quot;https://www.devoxx.fr/&quot;&gt;Devoxx France&lt;/a&gt; 2024. We sponsored those events and ran booths for several days all day long to talk to engineers.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/images/blog/b4e4296f-966f-43cc-b434-4b6668db7706_1572x1893.webp&quot; alt=&quot;Mergify booth at QCon San Francisco 2023&quot; /&gt;
&lt;em&gt;Mergify booth at QCon San Francisco 2023&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;The pattern we’ve seen has been interesting. First, QCon San Francisco was the smallest it’s been over the last few years, as far as I can tell. While 1,400 people were expected, counting the people seated in the keynote session revealed that fewer than half of them were present. We talked to dozens of engineers without much success. It turns out that trying to sell your tool in such a place is not efficient at all for an early-stage startup like Mergify. Companies tend to do that when they are way larger, to raise brand awareness and penetrate the market more efficiently.&lt;/p&gt;
&lt;p&gt;At our stage, this was a lot of money spent for barely any gain.&lt;/p&gt;
&lt;p&gt;As Mehdi, my cofounder and CTO, says, “no great engineer will go to a conference to find the next tool they’ll need.” Indeed, I don’t believe any good engineer should wait six months for the next conference they attend to find a product that solves their technical problems.&lt;/p&gt;
&lt;p&gt;Doing market education over a booth, as we tried, is utopian. Here’s the typical dialogue that would happen:&lt;/p&gt;
&lt;p&gt;– &lt;em&gt;Engineer attending the conference:&lt;/em&gt; “Hi! What does Mergify do?”&lt;br /&gt;
– &lt;em&gt;Mergify staff:&lt;/em&gt; “We offer merge queues for your GitHub repository. Do you know about them?”&lt;br /&gt;
– &lt;em&gt;Engineer attending the conference trying not to lose face:&lt;/em&gt; “Yeah, for sure!”&lt;br /&gt;
– &lt;em&gt;Mergify staff:&lt;/em&gt; “Do you use one in your team?”&lt;br /&gt;
– &lt;em&gt;Engineer:&lt;/em&gt; “No, we don’t need one.”&lt;br /&gt;
– &lt;em&gt;Mergify:&lt;/em&gt; &quot;How so? You&apos;re happy merging outdated PRs or running a lot of CI on every PR?&quot;&lt;br /&gt;
– &lt;em&gt;Engineer:&lt;/em&gt; &quot;We… don&apos;t… well… err… what are we talking about exactly?…&quot;&lt;/p&gt;
&lt;p&gt;The truth is, 95% of the engineers we talked to have no clue what a merge queue is. Actually, 80% of them don’t know anything about Git besides the basics (i.e., commit and push), and chatting for 10 minutes over a booth is not a good place to educate them.&lt;/p&gt;
&lt;p&gt;Speaking at conferences is a far better strategy, as the rise of the Developer Evangelist role has demonstrated over the last few years. If well executed, it’s cheaper and can have a far better outcome than sponsoring an event.&lt;/p&gt;
&lt;p&gt;You could imagine that sponsoring an event buys you a ticket to speak, but it’s not the case by default. Some conferences allow you to buy speaking time in special, dedicated rooms, for example, but you usually don’t get any special treatment over the regular CfP.&lt;/p&gt;
&lt;p&gt;I really need to talk about that CfP game.&lt;/p&gt;
</content:encoded></item><item><title>Attending Conferences</title><link>https://julien.danjou.info/blog/attending-conferences/</link><guid isPermaLink="true">https://julien.danjou.info/blog/attending-conferences/</guid><description>How my conferences attendance shifted over the last decade.</description><pubDate>Tue, 04 Jun 2024 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;There’s a lot to say about tech conferences, shows, and exhibits in any form. I’ve been to a few of them over the last decade, and I probably can&apos;t count anymore at this point.&lt;/p&gt;
&lt;p&gt;I thought it’d be interesting to write a thing or two about them and how my experience changed from my first participation in a gathering of tech people to the most recent one.&lt;/p&gt;
&lt;h2&gt;Goals&lt;/h2&gt;
&lt;p&gt;What especially changed for me over the last 15 years is the expectation of conferences.&lt;/p&gt;
&lt;p&gt;I remember that the first conferences I went to appeared especially exciting because of the &lt;strong&gt;content&lt;/strong&gt;. I was utterly interested in hearing about new technologies, discovering new features, and learning new practices. My goal as a young software engineer was to become the best at my craft, so I’d go there and attend as many lectures as I could.&lt;/p&gt;
&lt;p&gt;Sometimes, that would be very challenging. Take &lt;a href=&quot;https://fosdem.org&quot;&gt;FOSDEM&lt;/a&gt;, one of the largest open-source conferences in Europe, with thousands of geeks attending. I attended FOSDEM numerous times. The buildings hosting the conference have been the same all those years and are known to be way too small for the crowds it draws. There are fantastic talks given by some of the world&apos;s greatest hackers, but there are often more people waiting in the hall to access a talk than people in the room listening to it.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/images/blog/2648c606-59ca-40bd-88ab-19b0235d0518_1280x720.jpeg&quot; alt=&quot;Photo of a packed FOSDEM conference hallway with attendees waiting to enter&quot; /&gt;
&lt;em&gt;Regular FOSDEM attendance&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;You might think this would ruin the conference for most people, but I don’t think it does. It reveals the other very important aspect of conferences: social &lt;strong&gt;interaction&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;I know there are so many stereotypes about tech people not interacting with each other. The truth is, as human beings, we crave social connections, and events are a great source of them. Attending my first OpenStack Summit in 2013 allowed me to meet people I had only chatted with over IRC, and it was a real game changer later on.&lt;/p&gt;
&lt;p&gt;As my engineering career progressed, there was less and less for me to learn from the talks. Many lectures became boring or déjà-vu.&lt;/p&gt;
&lt;p&gt;I quickly switched sides and became the one giving talks and sharing knowledge. This was great. It gave me a great sense of recognition, validation, fear, and adrenaline. It’s a significant boost for anyone’s career. It was a game changer for me.&lt;/p&gt;
&lt;h2&gt;COVID&lt;/h2&gt;
&lt;p&gt;When COVID happened, everything changed.&lt;/p&gt;
&lt;p&gt;I remember receiving a phone call in early March 2020 from Sylvain Zimmer, organizer of the dotConferences. I was booked to speak at &lt;a href=&quot;https://www.dotpy.io/&quot;&gt;dotPy&lt;/a&gt;, and my talk on Python performance and profiling was ready; in a couple of days, I would be live on stage in a Parisian theater in front of hundreds of people. Sylvain explained that something was happening, that they couldn’t risk having this event run, and that they had to cancel.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/images/blog/172ab082-db07-4667-9eec-82640b8ef8c6_1200x546.jpeg&quot; alt=&quot;Photo of an empty conference venue during COVID-19 cancellations&quot; /&gt;&lt;/p&gt;
&lt;p&gt;For the next months, every event was canceled, and people shifted to online, remote work, etc. — you know it all. This broke local communities and the habit of many people going to conferences.&lt;/p&gt;
&lt;p&gt;I think this shows overall, as there are fewer people going to events today than there used to be. People got used to accessing content over the Internet, webinars bloomed, and since many conferences published their lectures online, interest in traveling to a conference dropped considerably.&lt;/p&gt;
&lt;p&gt;I relaunched the &lt;a href=&quot;https://www.meetup.com/fr-FR/python-toulouse/&quot;&gt;Python Toulouse meetup&lt;/a&gt; group 18 months ago. There were more than 800 members in that group when we announced that we were scheduling a new session in October 2022—3 years after the last one.&lt;/p&gt;
&lt;p&gt;We got only 5 attendees.&lt;/p&gt;
&lt;p&gt;Since then, we have continued pushing the event every couple of months, and the event has grown back to more than 40 attendees (and I have stepped down from the organizers).&lt;/p&gt;
&lt;p&gt;I think this shows well how badly COVID hit conferences and meetups in general.&lt;/p&gt;
&lt;h2&gt;Attending Conferences&lt;/h2&gt;
&lt;p&gt;As I started running &lt;a href=&quot;https://mergify.com&quot;&gt;Mergify&lt;/a&gt; a few years ago, my expectations of conferences shifted again. Since we are building a developer tool, developers are now a persona we want to reach to make them aware of what we’re building.&lt;/p&gt;
&lt;p&gt;There are two ways of doing that, and most developer-focused companies do one or both:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;speak at conferences;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;expose at conferences (sponsoring).&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;I’ll write something about event sponsoring some other time. (Update: &lt;a href=&quot;https://julien.danjou.info/blog/sponsoring-conferences&quot;&gt;I did&lt;/a&gt;.)&lt;/p&gt;
&lt;p&gt;Winning a slot to speak at conferences is not easy: it requires expertise (which we have) and time and focus (which we don’t have much of). In my case, we encourage folks at Mergify to respond to calls for papers, teach other developers about the problems we solve, or share our experience on various topics. This does not always work; unfortunately, we are not experts at playing the CfP game—another topic I should write about.&lt;/p&gt;
&lt;p&gt;First, I noticed that while there are more and more software engineers, many of them don’t care about going to conferences. They know most of the talks are already online or will be. As the CfP game gets professionalized, many talks you see at conferences have already been delivered, filmed, and published elsewhere. Engineers valuing their time might not go to conferences in the end.&lt;/p&gt;
&lt;p&gt;Some conferences are trying to fix that by not publishing their talks online. I think it can be a good strategy in certain cases, but as many conferences invite speakers whose talks have been touring for months, if not years, it’s likely you can already watch the content online anyway.&lt;/p&gt;
&lt;p&gt;Second, many events and conferences still overpromise the number of people actually attending. It’s likely the pre-COVID level is not back everywhere.&lt;/p&gt;
&lt;p&gt;Last, the average technical level of expertise of both speakers and attendees fluctuates a lot. However, the more I think about it, the less I see a pattern. Some community-run conferences have great and poor content simultaneously; some professional conferences have low-skilled attendees but great speakers. It’s hard to have a rule, and I think it’s really a case-by-case basis—and it might be subjective, after all.&lt;/p&gt;
&lt;p&gt;Those issues might seem anecdotal to most, though they’re not for me, as they partly explain why Mergify&apos;s sponsoring of events has mostly been a failure over the last year.&lt;/p&gt;
&lt;p&gt;But I’ll talk about that later.&lt;/p&gt;
</content:encoded></item><item><title>I used to write</title><link>https://julien.danjou.info/blog/i-used-to-write/</link><guid isPermaLink="true">https://julien.danjou.info/blog/i-used-to-write/</guid><description>Then I stopped.</description><pubDate>Tue, 21 May 2024 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;I did. I mean, I used to write a lot back in the day.&lt;/p&gt;
&lt;p&gt;To give some context, I started my blog in 2003 — 21 years ago. (I later reflected on my writing career in &lt;a href=&quot;https://julien.danjou.info/blog/a-decade-of-writing-books-and-selling&quot;&gt;A Decade of Writing Books and Selling 25,000 Copies&lt;/a&gt;.) It ran on &lt;a href=&quot;https://dotclear.org/&quot;&gt;Dotclear&lt;/a&gt;, an old blog engine. I switched to various static site generators over the years, spending hours migrating data from one format to another. I used &lt;a href=&quot;https://www.gnu.org/software/emacs-muse/&quot;&gt;Muse&lt;/a&gt; within Emacs, &lt;a href=&quot;https://orgmode.org/worg/org-blog-wiki.html&quot;&gt;Org-mode&lt;/a&gt; probably at some point, and finally, &lt;a href=&quot;https://ghost.org/&quot;&gt;Ghost&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Then I stopped.&lt;/p&gt;
&lt;p&gt;The number of publications I posted online decreased over the years as I spent more and more time coding, CEO’ing things at Mergify, and managing life. Writing slowly faded away, replaced with the mundane demands of daily life. I closed my blog, and its content vanished into the limbo of a file named &lt;code&gt;jd-dev-blog.zip&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/images/blog/3e3c41c1-0e3d-4cdd-ac9d-593b43e453db_1666x116.png&quot; alt=&quot;Graph showing declining blog post frequency over the years&quot; /&gt;&lt;/p&gt;
&lt;p&gt;Sure, I thought about writing more. Or again. Over the last couple of years, every week, my stomach would ache, and my brain would melt under the weight of my thoughts. I wanted to yell so many things at the world, correct so many wrongs, and share so many learnings.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/images/blog/68c373a9-7d09-4f9c-a0b6-3b8943bb16c2_300x330.png&quot; alt=&quot;XKCD comic about someone being wrong on the internet&quot; /&gt;
&lt;em&gt;https://xkcd.com/386/&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;But the time to write disappeared, buried under so many other, more pressing necessities. Still, the desire to return to the keyboard lingered.&lt;/p&gt;
&lt;p&gt;Until the final nail in the coffin hit.&lt;/p&gt;
&lt;p&gt;AI became mainstream.&lt;/p&gt;
&lt;p&gt;Anyone could write anything in seconds. Copywriters were being replaced by bots. The cost savings of replacing humans would revolutionize entire industries. There was no point in writing anymore. My engineer brain decided that the problem was solved.&lt;/p&gt;
&lt;p&gt;So I gave up. My mind gave up.&lt;/p&gt;
&lt;p&gt;My desire to write would be instantly killed by the concept of ChatGPT. The existence of that malicious AI would stifle any urge to draft thoughts on virtual paper. The sheer thought of such an entity lurking in the digital shadows, analyzing and predicting every word, struck a chilling fear into the heart of my creativity.&lt;/p&gt;
&lt;p&gt;Until today.&lt;/p&gt;
&lt;p&gt;Over the last year, I toyed with GPT, generating tons of content. I used it to brainstorm, change sentences, and rewrite text. The more I used it, the more I realized I was getting bored. Browsing the Internet and social networks, I realized humans were being replaced by AI.&lt;/p&gt;
&lt;p&gt;No one would &lt;em&gt;write&lt;/em&gt; anymore; everyone would just &lt;em&gt;publish&lt;/em&gt;.&lt;/p&gt;
&lt;p&gt;With its cold, calculating algorithms, AI reduced the rich tapestry of human expression to mere patterns and data points. People would throw a bare concept or even ask AI for one and demand it to produce text. Many of those publishers would not even take the time to tweak the AI, to feed it with the small amount of style and humanity they would have. Content would be farmed, from social media posts to SEO blog posts.&lt;/p&gt;
&lt;p&gt;Over the last year, everything became bland. The once vibrant landscape of ideas has been replaced with mechanical mimicry.&lt;/p&gt;
&lt;p&gt;My brain acknowledged that AI could never truly write. That revelation shifted my perspective, and I realized that writing wasn&apos;t dead. The noise would undoubtedly grow louder, but this only meant the signal would need to be stronger.&lt;/p&gt;
&lt;p&gt;We are entering a new era for writing.&lt;/p&gt;
&lt;p&gt;Well, at least, that’s what I hope.&lt;/p&gt;
&lt;/content:encoded&gt;&lt;/item&gt;&lt;item&gt;&lt;title&gt;Debugging C code on macOS&lt;/title&gt;&lt;link&gt;https://julien.danjou.info/blog/debugging-c-code-on-macoss/&lt;/link&gt;&lt;guid isPermaLink=&quot;true&quot;&gt;https://julien.danjou.info/blog/debugging-c-code-on-macoss/&lt;/guid&gt;&lt;description&gt;I started to write C 25 years ago now, with many different tools over the year. As many open source developers, I spent most of my life working with the GNU tools out there.  As I&apos;ve been using an App&lt;/description&gt;&lt;pubDate&gt;Thu, 11 Feb 2021 00:00:00 GMT&lt;/pubDate&gt;&lt;content:encoded&gt;&lt;p&gt;I started writing C 25 years ago now, with many different tools over the years. Like many open-source developers, I spent most of my life working with the GNU tools out there.&lt;/p&gt;
&lt;p&gt;As I&apos;ve been using an Apple computer for the last few years, I had to adapt to this environment and learn the tricks of the trade. Here are some of my notes, so a search engine can index them — and I&apos;ll be able to find them later.&lt;/p&gt;
&lt;h2&gt;Debugger: lldb&lt;/h2&gt;
&lt;p&gt;I was used to &lt;code&gt;gdb&lt;/code&gt; for most of my years writing C. I never managed to install gdb correctly on macOS, as it needs certificates, authorization, you name it, to work properly.&lt;/p&gt;
&lt;p&gt;macOS provides a native debugger named lldb, which really looks like gdb to me — it runs in a terminal with a prompt.&lt;/p&gt;
&lt;p&gt;I had to learn the few commands I mostly use, which are:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;lldb -- myprogram -options&lt;/code&gt; to run the program with options&lt;/li&gt;
&lt;li&gt;&lt;code&gt;r&lt;/code&gt; to run the program&lt;/li&gt;
&lt;li&gt;&lt;code&gt;bt&lt;/code&gt; or &lt;code&gt;bt N&lt;/code&gt; to get a backtrace of the latest N frames&lt;/li&gt;
&lt;li&gt;&lt;code&gt;f N&lt;/code&gt; to select frame N&lt;/li&gt;
&lt;li&gt;&lt;code&gt;p V&lt;/code&gt; to print some variable value or memory address&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Those commands cover 99% of my use case with a debugger when writing C, so once I lost my old &lt;code&gt;gdb&lt;/code&gt; habits, I was good to go.&lt;/p&gt;
&lt;h2&gt;Debugging Memory Overflows&lt;/h2&gt;
&lt;h3&gt;On GNU/Linux&lt;/h3&gt;
&lt;p&gt;One of my favorite tools when writing C has always been &lt;a href=&quot;https://en.wikipedia.org/wiki/Electric_Fence&quot;&gt;Electric Fence&lt;/a&gt; (and &lt;a href=&quot;http://duma.sourceforge.net/&quot;&gt;DUMA&lt;/a&gt; more recently). It&apos;s a library that overrides the standard memory manipulation functions (e.g., &lt;code&gt;malloc&lt;/code&gt;) and makes the program crash instantly when an out-of-bounds memory access occurs, rather than silently corrupting the heap.&lt;/p&gt;
&lt;p&gt;Heap corruption issues are hard to debug without such tools as they can happen at any time and stay unnoticed for a while, crashing your program in a totally different location later.&lt;/p&gt;
&lt;p&gt;There&apos;s no need to compile your program with those libraries. By using the dynamic loader, you can preload them and overload the standard C library functions.&lt;/p&gt;
&lt;p&gt;My &lt;code&gt;gdb&lt;/code&gt; configuration has been sprinkled with my friends &lt;em&gt;efence&lt;/em&gt; and &lt;em&gt;duma&lt;/em&gt;, and I can activate them easily from &lt;code&gt;gdb&lt;/code&gt; with this configuration in &lt;code&gt;~/.gdbinit&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;define efence
        set environment EF_PROTECT_BELOW 0
        set environment EF_ALLOW_MALLOC_0 1
        set environment LD_PRELOAD /usr/lib/libefence.so.0.0
        echo Enabled Electric Fence\n
end
document efence
Enable memory allocation debugging through Electric Fence (efence(3)).
        See also nofence and underfence.
end

define underfence
        set environment EF_PROTECT_BELOW 1
        set environment EF_ALLOW_MALLOC_0 1
        set environment LD_PRELOAD /usr/lib/libefence.so.0.0
        echo Enabled Electric Fence for underflow detection\n
end
document underfence
Enable memory allocation debugging for underflows through Electric Fence
(efence(3)).
        See also nofence and efence.
end

define nofence
        unset environment LD_PRELOAD
        echo Disabled Electric Fence\n
end
document nofence
Disable memory allocation debugging through Electric Fence (efence(3)).
end

define duma
        set environment DUMA_PROTECT_BELOW 0
        set environment DUMA_ALLOW_MALLOC_0 1
        set environment LD_PRELOAD /usr/lib/libduma.so
        echo Enabled DUMA\n
end
document duma
Enable memory allocation debugging through DUMA (duma(3)).
        See also noduma and underduma.
end
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;On macOS&lt;/h3&gt;
&lt;p&gt;I&apos;ve been looking for equivalent features in macOS, and after many hours of research, I found out that this feature is shipped natively with &lt;code&gt;libgmalloc&lt;/code&gt;. It works in the same way, and &lt;a href=&quot;https://developer.apple.com/library/archive/documentation/Performance/Conceptual/ManagingMemory/Articles/MallocDebug.html&quot;&gt;its features are documented by Apple&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;My &lt;code&gt;~/.lldbinit&lt;/code&gt; file now contains the following:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;command alias gm _regexp-env DYLD_INSERT_LIBRARIES=/usr/lib/libgmalloc.dylib
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This command alias allows enabling &lt;code&gt;gmalloc&lt;/code&gt; by just typing &lt;code&gt;gm&lt;/code&gt; at the lldb prompt, then typing &lt;code&gt;run&lt;/code&gt; to start the program again and see if it crashes with &lt;code&gt;gmalloc&lt;/code&gt; enabled.&lt;/p&gt;
&lt;h2&gt;Debugging CPython&lt;/h2&gt;
&lt;p&gt;It&apos;s not a mystery that I spend a lot of time writing Python code — that&apos;s the main reason I&apos;ve been doing C lately.&lt;/p&gt;
&lt;p&gt;When playing with CPython, it can be useful to, e.g., dump the content of &lt;code&gt;PyObject&lt;/code&gt; structs on the heap or get the Python backtrace.&lt;/p&gt;
&lt;p&gt;I&apos;ve been using &lt;a href=&quot;https://github.com/malor/cpython-lldb&quot;&gt;&lt;em&gt;cpython-lldb&lt;/em&gt;&lt;/a&gt; for this with great success. It adds a few bells and whistles when debugging CPython or extensions inside &lt;code&gt;lldb&lt;/code&gt;. For example, the alias &lt;code&gt;py-bt&lt;/code&gt; is handy to get the Python traceback of your calls rather than a bunch of cryptic C frames.&lt;/p&gt;
&lt;p&gt;Now, you should be ready to debug your nasty issues and memory problems on macOS efficiently!&lt;/p&gt;
</content:encoded></item><item><title>I am a Software Engineer and I am in Charge</title><link>https://julien.danjou.info/blog/i-am-a-software-engineer-and-i-am-in-charge/</link><guid isPermaLink="true">https://julien.danjou.info/blog/i-am-a-software-engineer-and-i-am-in-charge/</guid><description>Fifteen years have passed since I started my career in IT — which is quite some time. I&apos;ve been playing with computers for 25 years now, which makes me quite knowledgeable about the field, for sure.</description><pubDate>Tue, 22 Dec 2020 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Fifteen years have passed since I started my career in IT — which is quite some time. I&apos;ve been playing with computers for 25 years now, which makes me quite knowledgeable about the field, for sure.&lt;/p&gt;
&lt;p&gt;However, while I was fully prepared to bargain with computers, I was not prepared to do so with humans. The whole career management thing was unknown to me. I had no useful skills to navigate within the enterprise organization. I had to learn the ropes the hard way, failing along the way. It hurts.&lt;/p&gt;
&lt;p&gt;Almost ten years ago, I had the chance to meet a new colleague — Alexis Monville. Alexis was a team facilitator, and I started to work with him on many non-technical levels. He taught me a lot about agility and team organization. Working on this set of new skills changed how I envisioned my work and how I fit into the company.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/12/alexis-monville-1.png&quot; alt=&quot;Alexis Monville&quot; /&gt;&lt;/p&gt;
&lt;p&gt;I worked on those aspects of my job because I decided to be in charge of my career rather than keeping things boring. That was one of the best decisions I ever made. Growing the social aspect of my profession allowed me to develop and find aspiring jobs and missions.&lt;/p&gt;
&lt;p&gt;Getting to that point takes a lot of time and effort, and it&apos;s pretty hard to do it alone. My friend Alexis wrote an excellent book titled &lt;em&gt;&lt;a href=&quot;https://leanpub.com/iamincharge/c/jd-affiliate&quot;&gt;I am a Software Engineer and I am in Charge&lt;/a&gt;.&lt;/em&gt; I&apos;m proud to have been the &lt;a href=&quot;https://iamincharge.club/2020/04/16/the-first-review/&quot;&gt;first reviewer&lt;/a&gt; of the book before it was released a few weeks ago.&lt;/p&gt;
&lt;p&gt;Many developers out there are stuck in jobs where they are not excited by their colleagues&apos; work and where their managers do not appropriately recognize their achievements. It would be best for them if they did something about that.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/12/image.png&quot; alt=&quot;The book!&quot; /&gt;&lt;/p&gt;
&lt;p&gt;This book is an excellent piece for engineers who want to break the cycle of frustration. It covers many situations I have encountered across my professional life these last years, giving good insights into how to solve them.&lt;/p&gt;
&lt;p&gt;To paraphrase Alexis, the answers to your career management problems are not on StackOverflow — they&apos;re not technical issues. However, you can still solve them with the right tools. That&apos;s where &lt;em&gt;&lt;a href=&quot;https://leanpub.com/iamincharge/c/jd-affiliate&quot;&gt;I am a Software Engineer and I am in Charge&lt;/a&gt;&lt;/em&gt; shines. It gives you leads, solutions, and exercises to get out of this kind of situation. It helps increase your impact and satisfaction at work.&lt;/p&gt;
&lt;p&gt;I love this book, and I wish I had access to it years ago. Developing technical leadership is not easy and requires a mindset shift. Having a way to bootstrap yourself with this is a luxury.&lt;/p&gt;
&lt;p&gt;If you&apos;re a software engineer at the beginning of your career or struggling with your current professional situation, I profoundly recommend reading &lt;a href=&quot;https://leanpub.com/iamincharge/c/jd-affiliate&quot;&gt;this book&lt;/a&gt;!&lt;/p&gt;
&lt;p&gt;You&apos;ll get a fast track on your career, for sure.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/12/cropped-Logo-1-1.png&quot; alt=&quot;I Am a Software Engineer and I Am in Charge book logo&quot; /&gt;&lt;/p&gt;
&lt;/content:encoded&gt;&lt;/item&gt;&lt;item&gt;&lt;title&gt;Interview: The Performance of Python&lt;/title&gt;&lt;link&gt;https://julien.danjou.info/blog/interview-the-performance-of-python/&lt;/link&gt;&lt;guid isPermaLink=&quot;true&quot;&gt;https://julien.danjou.info/blog/interview-the-performance-of-python/&lt;/guid&gt;&lt;description&gt;Earlier this year, I was supposed to participate in dotPy, a one-day Python conference happening in Paris. This event has unfortunately been cancelled due to the COVID-19 pandemic.&lt;/description&gt;&lt;pubDate&gt;Mon, 11 May 2020 00:00:00 GMT&lt;/pubDate&gt;&lt;content:encoded&gt;&amp;lt;p&amp;gt;Earlier this year, I was supposed to participate in &amp;lt;a href=&amp;quot;https://dotpy.io&amp;quot;&amp;gt;dotPy&amp;lt;/a&amp;gt;, a one-day Python conference happening in Paris. This event has unfortunately been cancelled due to the COVID-19 pandemic.&amp;lt;/p&amp;gt;
&lt;p&gt;Both Victor Stinner and I were supposed to attend that event. Victor had prepared a presentation about Python performance, while I was planning on talking about profiling.&lt;/p&gt;
&lt;p&gt;Rather than being completely discouraged, Victor and I sat down (remotely) with Anne Laure from &lt;a href=&quot;https://www.welcometothejungle.com/en/collections/behind-the-code&quot;&gt;Behind the Code&lt;/a&gt; (a blog run by Welcome to the Jungle, the organizers of the &lt;a href=&quot;https://dotpy.io&quot;&gt;dotPy&lt;/a&gt; conference).&lt;/p&gt;
&lt;p&gt;We discussed Python performance, profiling, speed, projects, problems, analysis, optimization, and the GIL.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://www.welcometothejungle.com/en/articles/btc-performance-python&quot;&gt;You can read the interview here.&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/05/image-5.png&quot; alt=&quot;Screenshot of the Behind the Code interview about Python performance&quot; /&gt;&lt;/p&gt;
&lt;/content:encoded&gt;&lt;/item&gt;&lt;item&gt;&lt;title&gt;Being in Charge&lt;/title&gt;&lt;link&gt;https://julien.danjou.info/blog/being-in-charge-10x-engineers-le-podcast/&lt;/link&gt;&lt;guid isPermaLink=&quot;true&quot;&gt;https://julien.danjou.info/blog/being-in-charge-10x-engineers-le-podcast/&lt;/guid&gt;&lt;description&gt;If you never heard of the 10x engineer myth, it&amp;apos;s a pretty great concept. It boils down to the idea where an engineer could be 10x more efficient than a random engineer. I find this fantastically twis&lt;/description&gt;&lt;pubDate&gt;Fri, 17 Apr 2020 00:00:00 GMT&lt;/pubDate&gt;&lt;content:encoded&gt;&amp;lt;p&amp;gt;If you&amp;apos;ve never heard of the 10x engineer myth, it&amp;apos;s a pretty great concept. It boils down to the idea that an engineer could be 10x more efficient than a random engineer. I find this fantastically twisted.&amp;lt;/p&amp;gt;
&lt;p&gt;Last week, I sat down and chatted with Alexis Monville on &lt;a href=&quot;https://alexis.monville.com/en/category/podcast/&quot;&gt;Le Podcast&lt;/a&gt; — a podcast that equips you to make positive change in your organization. &lt;a href=&quot;https://alexis.monville.com/en/2020/04/10/do-you-want-10x-engineers/&quot;&gt;We talked&lt;/a&gt; about that 10x Engineer myth, and from there we digressed on how to grow your career and handle the different aspects of it.&lt;/p&gt;
&lt;p&gt;This was a very interesting exchange. Alexis is actually going to publish a new book next month (May 2020) entitled &lt;a href=&quot;https://iamincharge.club/&quot;&gt;I am a Software Engineer and I am in charge&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/04/image-1.png&quot; alt=&quot;Cover of I Am a Software Engineer and I Am in Charge&quot; /&gt;&lt;/p&gt;
&lt;p&gt;Lucky me, this week, I had the chance to read the book before everybody else — which means I actually read it &lt;em&gt;after&lt;/em&gt; our recording. I understood why Alexis said that a lot of what I was talking about during our podcast resonated with him. I sent &lt;a href=&quot;https://iamincharge.club/2020/04/16/the-first-review/&quot;&gt;a detailed review of the book to Alexis and Michael&lt;/a&gt;, if you&apos;re curious. I&apos;m definitely recommending this book if you want to stop complaining about your job and start understanding how to pull the strings.&lt;/p&gt;
&lt;p&gt;I wish I had this book available 10 years ago! 😅&lt;/p&gt;
</content:encoded></item><item><title>One year of Mergify</title><link>https://julien.danjou.info/blog/one-year-of-mergify/</link><guid isPermaLink="true">https://julien.danjou.info/blog/one-year-of-mergify/</guid><description>It has been close to a year now that I&apos;ve incorporated my new company, Mergify. I&apos;ve been busy, and I barely wrote anything about it so far.</description><pubDate>Thu, 12 Mar 2020 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;It has been close to a year now that I&apos;ve incorporated my new company, &lt;a href=&quot;https://mergify.io&quot;&gt;Mergify&lt;/a&gt;. I&apos;ve been busy, and I barely wrote anything about it so far. Now is an excellent time to take a break and reflect a bit on what happened during those last 12 months.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/export-mergify-logo-title-horizontal.png&quot; alt=&quot;Mergify logo&quot; /&gt;&lt;/p&gt;
&lt;h2&gt;What problem does Mergify solve?&lt;/h2&gt;
&lt;p&gt;Mergify is a powerful automation engine for GitHub pull requests. It allows you to automate everything — and especially merging. You write rules, and it handles the rest.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/Screenshot-2020-03-10-at-11.27.23.png&quot; alt=&quot;Example of rule matching returned in GitHub checks&quot; /&gt;&lt;/p&gt;
&lt;p&gt;For example, let&apos;s say you want your pull request to be merged once your CI passes and the pull request has been approved. You just write such a rule, and our engine merges the pull request as soon as it&apos;s ready.&lt;/p&gt;
&lt;p&gt;We also deal with more advanced use cases. For instance, we provide &lt;a href=&quot;https://doc.mergify.io/merge-action.html#strict-merge&quot;&gt;a merge queue&lt;/a&gt; so your pull requests are merged serially and tested by your CI one after another — avoiding any regression in your code.&lt;/p&gt;
&lt;p&gt;Our goal is to make pull request management and automation easy. You can use your bot to trigger a rebase of your pull requests, or a backport to a different branch, just with a single comment.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/Screenshot-2020-03-10-at-11.21.37.png&quot; alt=&quot;Some people like to make bots talk to each other.&quot; /&gt;&lt;/p&gt;
&lt;h2&gt;A New Adventure&lt;/h2&gt;
&lt;p&gt;Mergify is the first company that I ever started. I did run some personal businesses before, created non-profit organizations, built FOSS projects — but I never created a company from scratch, let alone with a co-founder.&lt;/p&gt;
&lt;p&gt;Indeed, I&apos;ve chosen to build the company with my old friend &lt;a href=&quot;https://sileht.net/&quot;&gt;Mehdi&lt;/a&gt;. We&apos;ve known each other for 7 years now, and have worked together all that time on different open-source projects. Having worked with each other for so long has probably been a critical factor in the success of our venture so far.&lt;/p&gt;
&lt;p&gt;I had little experience sharing the founding seats with someone, and tons of reading seemed to indicate that it would be a tough ride. Picking the right business partner(s) can be a hard task. Luckily, after working together for so long, Mehdi and I both know our strengths and weaknesses well enough to be able to circumvent them. 😅&lt;/p&gt;
&lt;p&gt;On the other hand, we both have similar backgrounds as software engineers. That does not help to cover all the hats you need to wear when building a company. Over time, we found arrangements to cover most of those equally between us.&lt;/p&gt;
&lt;p&gt;We don&apos;t have any magical advice to give on this. As in every relationship, communication is the key, and the #1 factor of success.&lt;/p&gt;
&lt;h2&gt;Getting Users&lt;/h2&gt;
&lt;p&gt;I don&apos;t know if we got lucky, but we got users and customers pretty early. We used a few cooperative projects as guinea pigs first, and they were brave enough to try our service and give us feedback. No repository has been harmed during this first phase!&lt;/p&gt;
&lt;p&gt;Then, as soon as we managed to get our application on the &lt;a href=&quot;https://github.com/marketplace/mergify&quot;&gt;GitHub Marketplace&lt;/a&gt;, we saw a steady number of users coming to us.&lt;/p&gt;
&lt;p&gt;This has been fantastic as it allowed us to get feedback rapidly. We set up a form asking users for feedback after they used Mergify for a couple of weeks. What we heard was that users were happy, that the documentation was confusing, and that some features were buggy or missing. We planned all of those ideas as future work on our roadmap, using &lt;a href=&quot;https://medium.com/mergify/how-we-handle-our-roadmap-for-mergify-7e813e24508e&quot;&gt;the principles we described a few months ago&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://cdn-images-1.medium.com/fit/c/152/152/1*8I-HPL0bfoIzGied-dzOvA.png&quot; alt=&quot;If you&apos;re curious, you can read this article.&quot; /&gt;&lt;/p&gt;
&lt;p&gt;We tried various strategies to get new users, but so far, organic growth has been our #1 way of onboarding new users. Like many small startups out there, we&apos;re not that good at marketing and executing strategies.&lt;/p&gt;
&lt;p&gt;We provide our service for free to open-source projects. We are now powering many organizations, such as Mozilla, Amazon Web Services, Ceph, and Fedora.&lt;/p&gt;
&lt;h2&gt;Working with GitHub&lt;/h2&gt;
&lt;p&gt;Working with GitHub has been… complicated. When you build an application for a marketplace, your business is entirely dependent on the platform you develop for — both in terms of features and quality of service.&lt;/p&gt;
&lt;p&gt;In our case, we hit quite a few bugs with GitHub. Their support has mostly been fast to answer, but some significant issues are still open months later. The truth is that the GitHub API deserves more love and care from GitHub. For example, their GraphQL API has been a work in progress for years and misses many essential features.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/Screenshot-2020-03-05-at-16.38.55.png&quot; alt=&quot;GitHub service status is not always green.&quot; /&gt;&lt;/p&gt;
&lt;p&gt;We dealt and still deal with all those issues. It obviously impacts our operations and decreases our overall velocity. We regularly have to find new ways to sidestep GitHub limitations.&lt;/p&gt;
&lt;p&gt;You have no idea how much we wished for GitHub to be open-source. Not having access to their code to understand how it works is so frustrating that we published our &lt;a href=&quot;https://github.com/mergifyio/mergify-engine&quot;&gt;engine&lt;/a&gt; as an open-source project. That allows all of our users to see how it works and even propose enhancements.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/Screenshot-2020-03-11-at-10.39.12.png&quot; alt=&quot;Screenshot of the Mergify open-source engine repository on GitHub&quot; /&gt;&lt;/p&gt;
&lt;h2&gt;Automate all the way&lt;/h2&gt;
&lt;p&gt;We&apos;re a tiny startup, and we decided to bootstrap our company. We never took any funding. From the beginning, it has been clear to us that we had to think and act like we had no resources. We&apos;re built around a scarcity mindset. Every decision we make is based on the assumption that we basically are very limited in terms of money and time.&lt;/p&gt;
&lt;p&gt;We basically act like any wrong choice we make could (virtually) kill the company. We only do what is essential, we ship fast, and we automate everything.&lt;/p&gt;
&lt;p&gt;For example, we have built our whole operation around CI/CD systems, and pushing any new fix or feature to production is done in a matter of &lt;em&gt;minutes&lt;/em&gt;. It&apos;s not uncommon for us to push a fix from our phone, just by reviewing some code or editing a file.&lt;/p&gt;
&lt;h2&gt;Growth&lt;/h2&gt;
&lt;p&gt;We&apos;re extremely happy with our steady growth and more users using our service. We now manage close to 30k repositories and merge 15k pull requests per month for our users.&lt;/p&gt;
&lt;p&gt;That&apos;s a lot of mouse clicks saved!&lt;/p&gt;
&lt;p&gt;If you want to try &lt;a href=&quot;https://mergify.io&quot;&gt;Mergify&lt;/a&gt; yourself, it&apos;s a single click log-in using your GitHub account. Check it out!&lt;/p&gt;
&lt;/content:encoded&gt;&lt;/item&gt;&lt;item&gt;&lt;title&gt;Attending FOSDEM 2020&lt;/title&gt;&lt;link&gt;https://julien.danjou.info/blog/attending-fosdem-2020/&lt;/link&gt;&lt;guid isPermaLink=&quot;true&quot;&gt;https://julien.danjou.info/blog/attending-fosdem-2020/&lt;/guid&gt;&lt;description&gt;This weekend, I&amp;apos;ve been lucky to attend the FOSDEM conference again, one of the largest open-source conferences out there.&lt;/description&gt;&lt;pubDate&gt;Thu, 06 Feb 2020 00:00:00 GMT&lt;/pubDate&gt;&lt;content:encoded&gt;&amp;lt;p&amp;gt;This weekend, I&amp;apos;ve been lucky to attend the &amp;lt;a href=&amp;quot;https://fosdem.org/2020/&amp;quot;&amp;gt;FOSDEM&amp;lt;/a&amp;gt; conference again, one of the largest open-source conferences out there.&amp;lt;/p&amp;gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/02/Screenshot-2020-02-05-at-15.54.48.png&quot; alt=&quot;Screenshot of the FOSDEM 2020 Python devroom schedule&quot; /&gt;&lt;/p&gt;
&lt;p&gt;I had a talk scheduled in the &lt;a href=&quot;https://fosdem.org/2020/schedule/track/python/&quot;&gt;Python devroom&lt;/a&gt; on Saturday about &lt;a href=&quot;https://fosdem.org/2020/schedule/event/python2020_profiling/&quot;&gt;building a production-ready profiler in Python&lt;/a&gt;. This was a good overview of the work I&apos;ve been doing at &lt;a href=&quot;https://datadoghq.com&quot;&gt;Datadog&lt;/a&gt; for the last few months.&lt;/p&gt;
&lt;p&gt;The video and slides of the talk are available online.&lt;/p&gt;
&lt;p&gt;The talk went well and was attended by a few hundred people. I had a few interesting exchanges with people who were interested in the work and had some ideas for improvements.&lt;/p&gt;
</content:encoded></item><item><title>Python Logging with Datadog</title><link>https://julien.danjou.info/blog/python-logging-with-datadog/</link><guid isPermaLink="true">https://julien.danjou.info/blog/python-logging-with-datadog/</guid><description>At Mergify, we generate a pretty large amount of logs. Every time an event is received from GitHub for a particular pull request, our engine computes a new state for it.</description><pubDate>Mon, 03 Feb 2020 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;At &lt;a href=&quot;https://mergify.io&quot;&gt;Mergify&lt;/a&gt;, we generate a pretty large amount of logs. Every time an event is received from GitHub for a particular pull request, our engine computes a new state for it. Doing so, it logs some informational statements about what it&apos;s doing — and any error that might happen.&lt;/p&gt;
&lt;p&gt;This information is precious to us. Without proper logging, it&apos;d be utterly impossible for us to debug any issue. As we needed to store and index our logs somewhere, we picked Datadog as our log storage provider.&lt;/p&gt;
&lt;p&gt;Datadog offers real-time indexing of our logs. The ability to search our records that fast is compelling, as we&apos;re able to retrieve logs about a GitHub repository or a pull request with a single click.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/01/Screenshot-2020-01-06-at-17.23.58.png&quot; alt=&quot;Our custom Datadog log facets&quot; /&gt;&lt;/p&gt;
&lt;p&gt;To achieve this result, we had to inject our Python application logs into Datadog. To set up the Python logging mechanism, we rely on &lt;a href=&quot;https://github.com/jd/daiquiri&quot;&gt;&lt;em&gt;daiquiri&lt;/em&gt;&lt;/a&gt;, a fantastic library I&apos;ve maintained for several years now. &lt;em&gt;Daiquiri&lt;/em&gt; leverages the regular Python &lt;code&gt;logging&lt;/code&gt; module, making it a no-brainer to set up while offering a few extra features.&lt;/p&gt;
&lt;p&gt;We recently added native support for the Datadog agent in &lt;em&gt;daiquiri&lt;/em&gt;, making it even more straightforward to log from your Python application.&lt;/p&gt;
&lt;h2&gt;Enabling log on the Datadog agent&lt;/h2&gt;
&lt;p&gt;Datadog has &lt;a href=&quot;https://docs.datadoghq.com/agent/logs/?tab=tailexistingfiles&quot;&gt;extensive documentation on how to configure its agent&lt;/a&gt;. This can be summarized to adding &lt;code&gt;logs_enabled: true&lt;/code&gt; in your agent configuration. Simple as that.&lt;/p&gt;
&lt;p&gt;You then need to create a new source for the agent. The easiest way to connect your application and the Datadog agent is using the TCP socket. Your application will write logs directly to the Datadog agent, which will forward the entries to Datadog backend.&lt;/p&gt;
&lt;p&gt;Create a configuration file in &lt;code&gt;conf.d/python.d/conf.yaml&lt;/code&gt; with the following content:&lt;/p&gt;
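&lt;p&gt;The exact content of that file depends on your setup; following Datadog&apos;s agent documentation, a minimal TCP log source looks something like the sketch below. The &lt;code&gt;service&lt;/code&gt; value is a placeholder, and the port needs to match the one your application logs to (&lt;em&gt;daiquiri&lt;/em&gt; defaults to 10518):&lt;/p&gt;

```yaml
logs:
  - type: tcp
    port: 10518
    service: mypythonapp  # placeholder: use your own service name
    source: python
```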
&lt;h2&gt;Setting up &lt;code&gt;daiquiri&lt;/code&gt;&lt;/h2&gt;
&lt;p&gt;Once this is done, you need to configure your Python application to log to the TCP socket configured in the agent above.&lt;/p&gt;
&lt;p&gt;The Datadog agent expects logs to be sent in JSON format, which is what &lt;em&gt;daiquiri&lt;/em&gt; does for you. Using JSON allows embedding any extra fields to leverage fast search and indexing. As &lt;em&gt;daiquiri&lt;/em&gt; provides native handling for extra fields, you&apos;ll be able to send those extra fields without trouble.&lt;/p&gt;
&lt;p&gt;First, list &lt;em&gt;daiquiri&lt;/em&gt; in your application dependencies. Then, set up logging in your application this way:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import logging

import daiquiri

daiquiri.setup(
  outputs=[
    daiquiri.output.Datadog(),
  ],
  level=logging.INFO,
)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This configuration logs to the default TCP destination &lt;code&gt;localhost:10518&lt;/code&gt; — though you can pass the &lt;code&gt;host&lt;/code&gt; and &lt;code&gt;port&lt;/code&gt; argument to change that. You can customize the outputs as you wish by checking out &lt;a href=&quot;https://daiquiri.readthedocs.io/en/latest/&quot;&gt;daiquiri documentation&lt;/a&gt;. For example, you could also include logging to &lt;code&gt;stdout&lt;/code&gt; by adding &lt;code&gt;daiquiri.output.Stream(sys.stdout)&lt;/code&gt; in the output list.&lt;/p&gt;
&lt;h2&gt;Using &lt;code&gt;extra&lt;/code&gt;&lt;/h2&gt;
&lt;p&gt;When using &lt;em&gt;daiquiri&lt;/em&gt;, you&apos;re free to use &lt;code&gt;logging.getLogger&lt;/code&gt; to get your regular logging object. However, by using the alternative &lt;code&gt;daiquiri.getLogger&lt;/code&gt; function, you&apos;re enabling the native use of extra arguments — which is quite handy. That means you can pass any arbitrary key/value to your log call, and see it end up embedded in your log data — up to Datadog.&lt;/p&gt;
&lt;p&gt;Here&apos;s an example:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import daiquiri

[…]

log = daiquiri.getLogger(__name__)
log.info(&quot;User did something important&quot;, user=user, request_id=request_id)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The extra keyword arguments passed to &lt;code&gt;log.info&lt;/code&gt; will be shown directly as attributes in Datadog logs:&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/01/Screenshot-2020-01-06-at-18.22.04.png&quot; alt=&quot;One of the log line of our Mergify engine&quot; /&gt;&lt;/p&gt;
&lt;p&gt;All those attributes can then be used to search or to display custom views. This is really powerful for monitoring and debugging any kind of service.&lt;/p&gt;
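&lt;p&gt;If you cannot depend on &lt;em&gt;daiquiri&lt;/em&gt;, a rough approximation of this extra-field mechanism can be sketched with the standard library&apos;s &lt;code&gt;logging.LoggerAdapter&lt;/code&gt;, which injects a fixed set of key/value pairs into every record it emits (&lt;em&gt;daiquiri&lt;/em&gt; does much more than this):&lt;/p&gt;

```python
import io
import logging

# Capture the log output in a string so we can inspect it.
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(
    logging.Formatter("%(message)s user=%(user)s request_id=%(request_id)s")
)

logger = logging.getLogger("myapp")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# LoggerAdapter attaches the given mapping as attributes on each record,
# so the formatter (or a JSON formatter) can render them as fields.
log = logging.LoggerAdapter(logger, {"user": "alice", "request_id": "42"})
log.info("User did something important")

print(stream.getvalue().strip())
# → User did something important user=alice request_id=42
```

&lt;p&gt;Unlike with &lt;em&gt;daiquiri&lt;/em&gt;, the adapter&apos;s mapping is fixed at creation time rather than per call, which is closer to the per-object logger pattern below.&lt;/p&gt;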
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/01/Screenshot-2020-01-06-at-18.39.05.png&quot; alt=&quot;Screenshot of Datadog log explorer showing custom attributes for search and display&quot; /&gt;&lt;/p&gt;
&lt;h2&gt;A log object per object&lt;/h2&gt;
&lt;p&gt;When passing &lt;em&gt;extra&lt;/em&gt; arguments, it is easy to make mistakes and forget some. This can especially happen when your application wants to log information about a particular object.&lt;/p&gt;
&lt;p&gt;The best pattern to avoid this is to create a custom log object per object:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import daiquiri

class MyObject:
    def __init__(self, x, y):
        self.x = x
        self.y = y
        self.log = daiquiri.getLogger(&quot;MyObject&quot;, x=self.x, y=self.y)

    def do_something(self):
        try:
            self.call_this()
        except Exception:
            self.log.error(&quot;Something bad happened&quot;)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;By using the &lt;code&gt;self.log&lt;/code&gt; object as defined above, there&apos;s no way for your application to miss some extra fields for an object. All your logs will follow the same style and will end up being indexed correctly in Datadog.&lt;/p&gt;
&lt;h2&gt;Log Design&lt;/h2&gt;
&lt;p&gt;The &lt;em&gt;extra&lt;/em&gt; arguments from the Python loggers are often dismissed, and many developers stick to logging strings with various information included inside. Having a proper explanation string, plus a few extra key/value pairs that are parsable by machines and humans, is a better way to do logging. Leveraging engines such as Datadog allows you to store and query those logs in a snap.&lt;/p&gt;
&lt;p&gt;This is way more efficient than trying to parse and grep strings yourself!&lt;/p&gt;
</content:encoded></item><item><title>Atomic lock-free counters in Python</title><link>https://julien.danjou.info/blog/atomic-lock-free-counters-in-python/</link><guid isPermaLink="true">https://julien.danjou.info/blog/atomic-lock-free-counters-in-python/</guid><description>At Datadog, we&apos;re really into metrics. We love them, we store them, but we also generate them. To do that, you need to juggle with integers that are incremented, also known as counters.</description><pubDate>Mon, 06 Jan 2020 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;At &lt;a href=&quot;https://datadog.com&quot;&gt;Datadog&lt;/a&gt;, we&apos;re really into metrics. We love them, we store them, but we also &lt;em&gt;generate&lt;/em&gt; them. To do that, you need to juggle with integers that are incremented, also known as &lt;em&gt;counters&lt;/em&gt;.&lt;/p&gt;
&lt;p&gt;While having an integer that changes its value sounds dull, it might not be without some surprises in certain circumstances. Let&apos;s dive in.&lt;/p&gt;
&lt;h2&gt;The Straightforward Implementation&lt;/h2&gt;
&lt;pre&gt;&lt;code&gt;class SingleThreadCounter(object):
    def __init__(self):
        self.value = 0

    def increment(self):
        self.value += 1
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Pretty easy, right?&lt;/p&gt;
&lt;p&gt;Well, not so fast, buddy. As the class name implies, this works fine with a single-threaded application. Let&apos;s take a look at the instructions in the &lt;code&gt;increment&lt;/code&gt; method:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&amp;gt;&amp;gt;&amp;gt; import dis
&amp;gt;&amp;gt;&amp;gt; dis.dis(&quot;self.value += 1&quot;)
  1           0 LOAD_NAME                0 (self)
              2 DUP_TOP
              4 LOAD_ATTR                1 (value)
              6 LOAD_CONST               0 (1)
              8 INPLACE_ADD
             10 ROT_TWO
             12 STORE_ATTR               1 (value)
             14 LOAD_CONST               1 (None)
             16 RETURN_VALUE
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The &lt;code&gt;self.value += 1&lt;/code&gt; line of code compiles to several bytecode operations for Python. Those operations can be interrupted at any time in their flow to switch to a different thread that could also increment the counter.&lt;/p&gt;
&lt;p&gt;Indeed, the &lt;code&gt;+=&lt;/code&gt; operation is not atomic: one needs to do a &lt;code&gt;LOAD_ATTR&lt;/code&gt; to read the current value of the counter, then an &lt;code&gt;INPLACE_ADD&lt;/code&gt; to add 1, to finally &lt;code&gt;STORE_ATTR&lt;/code&gt; to store the final result in the &lt;code&gt;value&lt;/code&gt; attribute.&lt;/p&gt;
&lt;p&gt;If another thread executes the same code at the same time, you could end up adding 1 to an old value:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;Thread-1 reads the value as 23
Thread-1 adds 1 to 23 and gets 24
Thread-2 reads the value as 23
Thread-1 stores 24 in value
Thread-2 adds 1 to 23 and gets 24
Thread-2 stores 24 in value
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Boom. Your &lt;code&gt;Counter&lt;/code&gt; class is not thread-safe. 😭&lt;/p&gt;
&lt;h2&gt;The Thread-Safe Implementation&lt;/h2&gt;
&lt;p&gt;To make this thread-safe, a &lt;em&gt;lock&lt;/em&gt; is necessary. We need a lock each time we want to increment the value, so we are sure the increments are done serially.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import threading

class FastReadCounter(object):
    def __init__(self):
        self.value = 0
        self._lock = threading.Lock()
        
    def increment(self):
        with self._lock:
            self.value += 1
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This implementation is thread-safe. There is no way for multiple threads to increment the value at the same time, so there&apos;s no way that an increment is lost.&lt;/p&gt;
&lt;p&gt;The only downside of this counter implementation is that you need to take the lock each time you increment. There might be a lot of contention around this lock if you have many threads updating the counter.&lt;/p&gt;
&lt;p&gt;On the other hand, if it&apos;s barely updated and often read, this is an excellent implementation of a thread-safe counter.&lt;/p&gt;
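&lt;p&gt;As a quick sanity check, here&apos;s a small driver (my own test harness, not part of the class itself) that hammers the locked counter from several threads and shows that no increment is lost:&lt;/p&gt;

```python
import threading

class FastReadCounter(object):
    def __init__(self):
        self.value = 0
        self._lock = threading.Lock()

    def increment(self):
        with self._lock:
            self.value += 1

counter = FastReadCounter()

def worker():
    for _ in range(10_000):
        counter.increment()

# 8 threads x 10,000 increments: the lock guarantees the exact total.
threads = [threading.Thread(target=worker) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter.value)  # 80000
```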
&lt;h2&gt;A Fast Write Implementation&lt;/h2&gt;
&lt;p&gt;There&apos;s a way to implement a thread-safe counter in Python that does not need to be locked on write. It&apos;s a trick that only works on CPython, because of the &lt;em&gt;Global Interpreter Lock&lt;/em&gt;.&lt;/p&gt;
&lt;p&gt;While everybody is unhappy with it, this time the GIL is going to help us: when a C function is executing and does not do any I/O, it cannot be interrupted by any other thread. It turns out the standard library has a counter-like class implemented in C: &lt;a href=&quot;https://docs.python.org/3/library/itertools.html#itertools.count&quot;&gt;itertools.count&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;We can use this &lt;code&gt;count&lt;/code&gt; class to our advantage, avoiding the need for a lock when incrementing the counter.&lt;/p&gt;
&lt;p&gt;If you read the documentation for &lt;code&gt;itertools.count&lt;/code&gt;, you&apos;ll notice that there&apos;s no way to read the current value of the counter. This is tricky, and this is where we&apos;ll need to use a lock to bypass this limitation. Here&apos;s the code:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import itertools
import threading

class FastWriteCounter(object):
    def __init__(self):
        self._number_of_read = 0
        self._counter = itertools.count()
        self._read_lock = threading.Lock()

    def increment(self):
        next(self._counter)

    def value(self):
        with self._read_lock:
            value = next(self._counter) - self._number_of_read
            self._number_of_read += 1
        return value
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The &lt;code&gt;increment&lt;/code&gt; code is quite simple in this case: the counter is just incremented without any lock. The GIL protects concurrent access to the internal data structure in C, so there&apos;s no need for us to lock anything.&lt;/p&gt;
&lt;p&gt;On the other hand, Python does not provide any way to read the value of an &lt;code&gt;itertools.count&lt;/code&gt; object. We need to use a small trick to get the current value. The &lt;code&gt;value&lt;/code&gt; method increments the counter and then gets the value while subtracting the number of times the counter has been read (and therefore incremented for nothing).&lt;/p&gt;
&lt;p&gt;This counter is, therefore, lock-free for writing, but not for reading. The opposite of our previous implementation.&lt;/p&gt;
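&lt;p&gt;Here&apos;s a short usage sketch (the driver lines at the bottom are mine) showing that repeated reads return a stable value, even though each read consumes one tick of the underlying counter:&lt;/p&gt;

```python
import itertools
import threading

class FastWriteCounter(object):
    def __init__(self):
        self._number_of_read = 0
        self._counter = itertools.count()
        self._read_lock = threading.Lock()

    def increment(self):
        next(self._counter)

    def value(self):
        with self._read_lock:
            value = next(self._counter) - self._number_of_read
            self._number_of_read += 1
        return value

c = FastWriteCounter()
for _ in range(5):
    c.increment()

# Each read advances the itertools.count by one, but _number_of_read
# compensates, so the observed value stays consistent.
print(c.value())  # 5
print(c.value())  # 5
```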
&lt;h2&gt;Measuring Performance&lt;/h2&gt;
&lt;p&gt;After writing all of this code, I wanted to measure how the different implementations impact speed. Using the &lt;a href=&quot;https://docs.python.org/3/library/timeit.html&quot;&gt;timeit&lt;/a&gt; module and my fancy laptop, I measured the performance of reading and writing each counter.&lt;/p&gt;
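&lt;p&gt;The measurement can be reproduced with a sketch like the following (my own harness, not the exact script used here; absolute numbers will differ from machine to machine):&lt;/p&gt;

```python
import timeit

setup = """
import threading

class FastReadCounter(object):
    def __init__(self):
        self.value = 0
        self._lock = threading.Lock()
    def increment(self):
        with self._lock:
            self.value += 1

c = FastReadCounter()
"""

# Time the per-call cost of increment() and report it in nanoseconds.
n = 100_000
total = timeit.timeit("c.increment()", setup=setup, number=n)
print(f"increment: {total / n * 1e9:.0f} ns/op")
```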
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;&lt;th&gt;Operation&lt;/th&gt;&lt;th&gt;SingleThreadCounter&lt;/th&gt;&lt;th&gt;FastReadCounter&lt;/th&gt;&lt;th&gt;FastWriteCounter&lt;/th&gt;&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;&lt;code&gt;increment&lt;/code&gt;&lt;/td&gt;&lt;td&gt;176 ns&lt;/td&gt;&lt;td&gt;390 ns&lt;/td&gt;&lt;td&gt;169 ns&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;code&gt;value&lt;/code&gt;&lt;/td&gt;&lt;td&gt;26 ns&lt;/td&gt;&lt;td&gt;26 ns&lt;/td&gt;&lt;td&gt;529 ns&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;I&apos;m glad that the measured performance matches the theory 😅. Both &lt;code&gt;SingleThreadCounter&lt;/code&gt; and &lt;code&gt;FastReadCounter&lt;/code&gt; have the same read performance: both do a simple attribute read, so that makes absolute sense.&lt;/p&gt;
&lt;p&gt;The same goes for &lt;code&gt;SingleThreadCounter&lt;/code&gt; and &lt;code&gt;FastWriteCounter&lt;/code&gt;, which have the same increment performance: both add 1 to an integer without taking a lock, which keeps them fast.&lt;/p&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;It&apos;s pretty obvious, but if you&apos;re writing a single-threaded application and don&apos;t have to care about concurrent access, you should stick to a simple incremented integer.&lt;/p&gt;
&lt;p&gt;For fun, I&apos;ve published a Python package named &lt;a href=&quot;https://pypi.org/project/fastcounter/&quot;&gt;fastcounter&lt;/a&gt; that provides those classes. The &lt;a href=&quot;https://github.com/jd/fastcounter&quot;&gt;sources are available on GitHub&lt;/a&gt;. Enjoy!&lt;/p&gt;
</content:encoded></item><item><title>Properly managing your .gitignore file</title><link>https://julien.danjou.info/blog/properly-managing-your-gitignore/</link><guid isPermaLink="true">https://julien.danjou.info/blog/properly-managing-your-gitignore/</guid><description>There&apos;s not a single month where I don&apos;t have to explain this. I thought it&apos;d be a good opportunity to write about this .gitignore file so everyone is up to date on this magic file.  The purpose of .g</description><pubDate>Mon, 02 Dec 2019 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;There&apos;s not a single month where I don&apos;t have to explain this. I thought it&apos;d be a good opportunity to write about this &lt;code&gt;.gitignore&lt;/code&gt; file so everyone is up to date on this magic file.&lt;/p&gt;
&lt;h2&gt;The purpose of &lt;code&gt;.gitignore&lt;/code&gt;&lt;/h2&gt;
&lt;p&gt;The &lt;code&gt;.gitignore&lt;/code&gt; file is meant to be a list of files that &lt;em&gt;Git&lt;/em&gt; should not track. It resides at the root directory of your repository. It can contain file paths relative to the repository, or wildcard patterns. &lt;a href=&quot;https://git-scm.com/docs/gitignore&quot;&gt;The file format and location are fully documented in the Git documentation&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;For example, this is a valid content for a &lt;code&gt;.gitignore&lt;/code&gt; file:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;foo
bar/*
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;When you&apos;re using Git commands such as &lt;code&gt;git add&lt;/code&gt;, all the files matching what&apos;s listed in &lt;code&gt;.gitignore&lt;/code&gt; are ignored. That makes sure you don&apos;t commit a file that should not be there by mistake. In the example above, any file in the &lt;code&gt;bar&lt;/code&gt; directory or any file named &lt;code&gt;foo&lt;/code&gt; will be completely ignored by all &lt;em&gt;Git&lt;/em&gt; commands.&lt;/p&gt;
&lt;p&gt;Awesome!&lt;/p&gt;
&lt;h3&gt;What&apos;s the problem with it?&lt;/h3&gt;
&lt;p&gt;Soon, developers realize that their directory is cluttered with temporary files. It might be from their build system, their editors or some test files they wrote.&lt;/p&gt;
&lt;p&gt;So what do they do? They add those files to &lt;code&gt;.gitignore&lt;/code&gt; for their project. You end up with a &lt;code&gt;.gitignore&lt;/code&gt; file that contains entries like:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;*~
.vscode
*.DS_Store
.idea
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;With that, you&apos;re sure to ignore backup files from &lt;em&gt;vim&lt;/em&gt;, metadata folders from macOS, temporary files from Visual Studio Code, etc.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/11/giphy.gif&quot; alt=&quot;Animated reaction GIF expressing frustration with cluttered gitignore files&quot; /&gt;&lt;/p&gt;
&lt;p&gt;Don&apos;t do this. Not everybody uses your editor or favorite pet tool, and nobody cares. The repository you&apos;re working in is shared with a lot of other developers. Sending pull requests just to add this kind of entry, ignoring files generated by your pet editor, is &lt;em&gt;wrong and annoying&lt;/em&gt;.&lt;/p&gt;
&lt;h3&gt;Wait, how do I ignore my editor files then?&lt;/h3&gt;
&lt;p&gt;If you read through &lt;em&gt;Git&lt;/em&gt; documentation, the answer lies there: &lt;em&gt;Git&lt;/em&gt; has a global ignore file that works for &lt;strong&gt;EVERY&lt;/strong&gt; repository on your system. No need to hack &lt;em&gt;each&lt;/em&gt; repository. By default, it&apos;s in &lt;code&gt;~/.config/git/ignore&lt;/code&gt;. Here&apos;s mine:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;.#*
*.swp
.DS_Store
.dir-locals.el
.dir-locals-2.el
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;That&apos;s enough to ignore my editor and OS files in all my repositories, so I don&apos;t &lt;code&gt;git add&lt;/code&gt; the wrong files by mistake. You can change this global file&apos;s location by setting &lt;code&gt;core.excludesFile&lt;/code&gt; in your &lt;em&gt;Git&lt;/em&gt; configuration.&lt;/p&gt;
&lt;h3&gt;So what should I put in .gitignore?&lt;/h3&gt;
&lt;p&gt;You should put in &lt;code&gt;.gitignore&lt;/code&gt; all files and patterns that are generated by the build system of your project, or any file that it might output while running.&lt;/p&gt;
&lt;p&gt;For example, for a Python project, it&apos;s common to have this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;*.pyc
__pycache__
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This makes sure that nobody commits compiled Python files.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/11/tumblr_ma0id4kYlv1rv5x9bo1_400.gif&quot; alt=&quot;Animated GIF celebrating proper gitignore usage&quot; /&gt;&lt;/p&gt;
&lt;p&gt;Thanks for reading through this. I hope you&apos;ll write better &lt;code&gt;.gitignore&lt;/code&gt; files in the future. 🤞&lt;/p&gt;
</content:encoded></item><item><title>Finding definitions from a source file and a line number in Python</title><link>https://julien.danjou.info/blog/finding-definitions-from-a-source-file-and-a-line-number-in-python/</link><guid isPermaLink="true">https://julien.danjou.info/blog/finding-definitions-from-a-source-file-and-a-line-number-in-python/</guid><description>My job at Datadog keeps me busy with new and questioning challenges. I recently stumbled upon a problem that sounded easy but was more difficult than I imagined.</description><pubDate>Mon, 04 Nov 2019 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;My job at &lt;a href=&quot;https://datadog.com&quot;&gt;Datadog&lt;/a&gt; keeps me busy with new and questioning challenges. I recently stumbled upon a problem that sounded easy but was more difficult than I imagined.&lt;/p&gt;
&lt;p&gt;Here&apos;s the thing: considering a filename and a line number, can you tell which function, method or class this line of code belongs to?&lt;/p&gt;
&lt;p&gt;I started to dig into the standard library, but I did not find anything solving this problem. It sounded like I had to write this myself.&lt;/p&gt;
&lt;p&gt;The first steps sound easy. Open a file, read it, find the line number. Right.&lt;/p&gt;
&lt;p&gt;Then, how do you know which function this line is in? You don&apos;t, except if you parse the whole file and keep track of function definitions. Could a regular expression parsing each line be a solution?&lt;/p&gt;
&lt;p&gt;Well, you&apos;d have to be careful, as function definitions can span multiple lines.&lt;/p&gt;
&lt;h2&gt;Using the AST&lt;/h2&gt;
&lt;p&gt;I decided that a good and robust strategy was not to use manual parsing or the like, but to use Python&apos;s abstract syntax tree (AST) directly. By leveraging Python&apos;s own parsing code, I was sure I was not going to fail while parsing a Python source file.&lt;/p&gt;
&lt;p&gt;This can simply be accomplished with:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import ast

def parse_file(filename):
    with open(filename) as f:
        return ast.parse(f.read(), filename=filename)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;And you&apos;re done. Are you? No, because that only works in 99.99% of cases. If your source file uses an encoding that is not ASCII or UTF-8, then the function fails. I know you think I&apos;m crazy to think about this, but I like my code to be robust.&lt;/p&gt;
&lt;p&gt;It turns out Python has a cookie to specify the encoding in the form of &lt;code&gt;# encoding: utf-8&lt;/code&gt; as defined in &lt;a href=&quot;https://www.python.org/dev/peps/pep-0263/&quot;&gt;PEP 263&lt;/a&gt;. Reading this cookie would help to find the encoding.&lt;/p&gt;
&lt;p&gt;To do that, we need to open the file in binary mode, use a regular expression to match the data, and… Well, it&apos;s dull, and somebody already implemented it for us, so let&apos;s use the fantastic &lt;a href=&quot;https://docs.python.org/3/library/tokenize.html#tokenize.open&quot;&gt;&lt;code&gt;tokenize.open&lt;/code&gt;&lt;/a&gt; function provided by Python:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import ast
import tokenize

def parse_file(filename):
    with tokenize.open(filename) as f:
        return ast.parse(f.read(), filename=filename)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;That should work 100% of the time. Until proven otherwise.&lt;/p&gt;
&lt;h2&gt;Browsing the AST&lt;/h2&gt;
&lt;p&gt;The &lt;code&gt;parse_file&lt;/code&gt; function now returns a Python AST. If you never played with Python AST, it&apos;s a gigantic tree that represents your source code just before it is compiled down to Python bytecode.&lt;/p&gt;
&lt;p&gt;In the tree, there are statements and expressions. In our case, we&apos;re interested in finding the function definition that is the closest to our line number. Here&apos;s an implementation of that function:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;def filename_and_lineno_to_def(filename, lineno):
    candidate = None
    for item in ast.walk(parse_file(filename)):
        if isinstance(item, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            if item.lineno &amp;gt; lineno:
                # Ignore whatever is after our line
                continue
            if candidate:
                distance = lineno - item.lineno
                if distance &amp;lt; (lineno - candidate.lineno):
                    candidate = item
            else:
                candidate = item

    if candidate:
        return candidate.name
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This iterates over all the nodes of the AST and returns the definition whose line number is the closest to, but not after, the requested line. If we have a file that contains:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;class A(object):
    X = 1
    def y(self):
        return 42
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;the function &lt;code&gt;filename_and_lineno_to_def&lt;/code&gt; returns, for lines 1 to 5: &lt;code&gt;A&lt;/code&gt;, &lt;code&gt;A&lt;/code&gt;, &lt;code&gt;y&lt;/code&gt;, &lt;code&gt;y&lt;/code&gt; and &lt;code&gt;y&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;It works!&lt;/p&gt;
&lt;h2&gt;Closures?&lt;/h2&gt;
&lt;p&gt;The naive approach described earlier likely works for 90% of your code, but there are some edge cases. For example, when defining function closures, the above algorithm fails. With the following code:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;class A(object):
   X = 1
   def y(self):
       def foo():
           return 42
       return foo
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;the function &lt;code&gt;filename_and_lineno_to_def&lt;/code&gt; returns, for lines 1 to 7: &lt;code&gt;A&lt;/code&gt;, &lt;code&gt;A&lt;/code&gt;, &lt;code&gt;y&lt;/code&gt;, &lt;code&gt;foo&lt;/code&gt;, &lt;code&gt;foo&lt;/code&gt;, &lt;code&gt;foo&lt;/code&gt; and &lt;code&gt;foo&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;Oops. Clearly, lines 6 and 7 do not belong to the &lt;code&gt;foo&lt;/code&gt; function. Our approach is too naive to see that starting at line 6, we&apos;re back in the &lt;code&gt;y&lt;/code&gt; method.&lt;/p&gt;
&lt;h2&gt;Interval Trees&lt;/h2&gt;
&lt;p&gt;The correct way of handling that is to consider each function definition as an interval:&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/06/interval-tree.png&quot; alt=&quot;Piece of code seen as interval.&quot; /&gt;&lt;/p&gt;
&lt;p&gt;Whatever the line number we request is, we should return the node that is responsible for the smallest interval that the line is in.&lt;/p&gt;
&lt;p&gt;What we need in this case is the right data structure to solve our problem: an &lt;a href=&quot;https://en.wikipedia.org/wiki/Interval_tree&quot;&gt;interval tree&lt;/a&gt; fits our use case perfectly. It allows rapidly finding the pieces of code that match our line number.&lt;/p&gt;
&lt;p&gt;To solve our problem we need several things:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;A way to compute the beginning and end line numbers for a function.&lt;/li&gt;
&lt;li&gt;A tree that is fed with the intervals we computed just before.&lt;/li&gt;
&lt;li&gt;A way to select the best matching intervals if a line is part of several functions (closure).&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Computing Function Intervals&lt;/h2&gt;
&lt;p&gt;The interval of a function is the first and last lines that compose its body. It&apos;s pretty easy to find those by walking through the function AST node:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;def _compute_interval(node):
    min_lineno = node.lineno
    max_lineno = node.lineno
    for child in ast.walk(node):
        if hasattr(child, &quot;lineno&quot;):
            min_lineno = min(min_lineno, child.lineno)
            max_lineno = max(max_lineno, child.lineno)
    return (min_lineno, max_lineno + 1)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Given any AST node, the function returns a tuple of the first and last line number of that node.&lt;/p&gt;
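&lt;p&gt;For instance, running &lt;code&gt;_compute_interval&lt;/code&gt; over the closure example from earlier yields one interval per definition. A small self-contained check:&lt;/p&gt;

```python
import ast

def _compute_interval(node):
    # First and last line numbers covered by this node's subtree.
    min_lineno = node.lineno
    max_lineno = node.lineno
    for child in ast.walk(node):
        if hasattr(child, "lineno"):
            min_lineno = min(min_lineno, child.lineno)
            max_lineno = max(max_lineno, child.lineno)
    return (min_lineno, max_lineno + 1)

source = """class A(object):
    X = 1
    def y(self):
        def foo():
            return 42
        return foo
"""

intervals = {
    node.name: _compute_interval(node)
    for node in ast.walk(ast.parse(source))
    if isinstance(node, (ast.FunctionDef, ast.ClassDef))
}
print(intervals)  # {'A': (1, 7), 'y': (3, 7), 'foo': (4, 6)}
```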
&lt;h2&gt;Building The Tree&lt;/h2&gt;
&lt;p&gt;Rather than implementing an interval tree, we&apos;ll use the &lt;a href=&quot;https://pypi.org/project/intervaltree/&quot;&gt;intervaltree&lt;/a&gt; library. We need to create a tree and feed it with the computed intervals:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import ast
import tokenize

import intervaltree

def file_to_tree(filename):
    with tokenize.open(filename) as f:
        parsed = ast.parse(f.read(), filename=filename)
    tree = intervaltree.IntervalTree()
    for node in ast.walk(parsed):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            start, end = _compute_interval(node)
            tree[start:end] = node
    return tree
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Here you go: the function parses the Python file passed as an argument and converts it to its AST representation. It then walks it and feeds the interval tree with every class and function definition.&lt;/p&gt;
&lt;h2&gt;Querying the Tree&lt;/h2&gt;
&lt;p&gt;Now that the tree is built, it should be queried with the line number. This is pretty simple:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;matches = file_to_tree(filename)[lineno]
if matches:
    return min(matches, key=lambda i: i.length()).data.name
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The built tree might return several matches if several intervals contain our line number. In that case, we pick the smallest interval and return the name of its node, which is our class or function name!&lt;/p&gt;
&lt;h2&gt;Mission Success&lt;/h2&gt;
&lt;p&gt;We did it! We started with a naive approach and iterated toward a final solution covering 100% of our cases. Picking the right data structure, an interval tree here, helped us solve this intelligently.&lt;/p&gt;
</content:encoded></item><item><title>Sending Emails in Python — Tutorial with Code Examples</title><link>https://julien.danjou.info/blog/sending-emails-in-python-tutorial-code-examples/</link><guid isPermaLink="true">https://julien.danjou.info/blog/sending-emails-in-python-tutorial-code-examples/</guid><description>What do you need to send an email with Python? Some basic programming and web knowledge along with the elementary Python skills. I assume you’ve already had a web app built with this language and now</description><pubDate>Tue, 15 Oct 2019 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;What do you need to send an email with Python? Some basic programming and web knowledge along with the elementary Python skills. I assume you’ve already had a web app built with this language and now you need to extend its functionality with notifications or other emails sending. This tutorial will guide you through the most essential steps of sending emails via an SMTP server:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Configuring a server for testing (do you know why it’s important?)&lt;/li&gt;
&lt;li&gt;Local SMTP server&lt;/li&gt;
&lt;li&gt;Mailtrap test SMTP server&lt;/li&gt;
&lt;li&gt;Different types of emails: HTML, with images, and attachments&lt;/li&gt;
&lt;li&gt;Sending multiple personalized emails (Python is just invaluable for email automation)&lt;/li&gt;
&lt;li&gt;Some popular email sending options like Gmail and transactional email services&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;&lt;em&gt;Served with numerous code examples written and tested on Python 3.7!&lt;/em&gt;&lt;/p&gt;
&lt;h3&gt;Sending an email using an SMTP&lt;/h3&gt;
&lt;p&gt;The first good news about Python is that it has a built-in module for sending emails via SMTP in its standard library. No extra installations or tricks are required. You can import the module using the following statement:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import smtplib
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;To make sure that the module has been imported properly and get the full description of its classes and arguments, type in an interactive Python session:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;help(smtplib)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;At our next step, we will talk a bit about servers: choosing the right option and configuring it.&lt;/p&gt;
&lt;h4&gt;An SMTP server for testing emails in Python&lt;/h4&gt;
&lt;p&gt;When creating a new app or adding any functionality, especially when doing it for the first time, it’s essential to experiment on a test server. Here is a brief list of reasons:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;You won’t hit your friends’ and customers’ inboxes. This is vital when you test bulk email sending or work with an email database.&lt;/li&gt;
&lt;li&gt;You won’t flood your own inbox with testing emails.&lt;/li&gt;
&lt;li&gt;Your domain won’t be blacklisted for spam.&lt;/li&gt;
&lt;/ol&gt;
&lt;h4&gt;Local SMTP server&lt;/h4&gt;
&lt;p&gt;If you prefer working in the local environment, the local SMTP debugging server might be an option. For this purpose, Python offers an &lt;em&gt;smtpd&lt;/em&gt; module. It has a &lt;code&gt;DebuggingServer&lt;/code&gt; feature, which will discard the messages you send and print them to &lt;code&gt;stdout&lt;/code&gt;. It is compatible with all operating systems.&lt;/p&gt;
&lt;p&gt;Set your SMTP server to &lt;em&gt;localhost:1025&lt;/em&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;python -m smtpd -n -c DebuggingServer localhost:1025
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In order to run an SMTP server on port 25, you’ll need root permissions:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;sudo python -m smtpd -n -c DebuggingServer localhost:25
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;It will help you verify whether your code is working and point out possible problems if there are any. However, it won’t give you the opportunity to check how your HTML email template is rendered.&lt;/p&gt;
&lt;h4&gt;Fake SMTP server&lt;/h4&gt;
&lt;p&gt;A fake SMTP server imitates the work of a real third-party web server. In further examples in this post, we will use &lt;a href=&quot;https://mailtrap.io&quot;&gt;Mailtrap&lt;/a&gt;. Beyond testing email sending, it will let us check how the email will be rendered and displayed, review the message’s raw data, and provide us with a spam report. Mailtrap is very easy to set up: you just need to copy the credentials generated by the app and paste them into your code.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://lh3.googleusercontent.com/xBVM7uyt4Q6mtpLLTiCBze9lNV-dpkO2rMBLSazZ9gb8LImFDgzZWVIOTCtke87LBixqrsJF-pii7usO3ezPbgjOWGRj7isa_ap2-EXK5GiHmSz4mtwenUIi-f_s05CfxQJoHGvl&quot; alt=&quot;Screenshot of Mailtrap fake SMTP server setup interface&quot; /&gt;&lt;/p&gt;
&lt;p&gt;Here is how it looks in practice:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import smtplib

port = 2525
smtp_server = &quot;smtp.mailtrap.io&quot;
login = &quot;1a2b3c4d5e6f7g&quot; # your login generated by Mailtrap
password = &quot;1a2b3c4d5e6f7g&quot; # your password generated by Mailtrap
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Mailtrap makes things even easier. Go to the &lt;em&gt;Integrations&lt;/em&gt; section in the SMTP settings tab and get the ready-to-use template of a simple message, with your Mailtrap credentials in it. The most basic way of instructing your Python script on who sends what to whom is the &lt;em&gt;sendmail()&lt;/em&gt; instance method:&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://lh5.googleusercontent.com/eKUJ__R4SYnY5jdvPiPPucHnoaOMBUxHZIu0DT2NjnTMU2FhvObBzqVN-qgCOTeSIm7yc_ifUAe5a0RofkbNdxOqNrzAw1icea4c9WIyb6NGk8KMmIvctLgUPlblmzFMSeaRnbGQ&quot; alt=&quot;Screenshot of Mailtrap integration settings with Python SMTP credentials&quot; /&gt;&lt;/p&gt;
&lt;p&gt;The code looks pretty straightforward, right? Let’s take a closer look at it and add some error handling (see the comments in between). To catch errors, we use the &lt;code&gt;try&lt;/code&gt; and &lt;code&gt;except&lt;/code&gt; blocks.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;## The first step is always the same: import all necessary components:
import smtplib
from socket import gaierror

## Now you can play with your code. Let’s define the SMTP server separately here:
port = 2525
smtp_server = &quot;smtp.mailtrap.io&quot;
login = &quot;1a2b3c4d5e6f7g&quot; # paste your login generated by Mailtrap
password = &quot;1a2b3c4d5e6f7g&quot; # paste your password generated by Mailtrap

## Specify the sender’s and receiver’s email addresses:
sender = &quot;from@example.com&quot;
receiver = &quot;mailtrap@example.com&quot;

## Type your message: a blank line separates the headers from the message body, and the f-string automatically inserts variables in the text
message = f&quot;&quot;&quot;\
Subject: Hi Mailtrap
To: {receiver}
From: {sender}

This is my first message with Python.&quot;&quot;&quot;

try:
  # Send your message with credentials specified above
  with smtplib.SMTP(smtp_server, port) as server:
    server.login(login, password)
    server.sendmail(sender, receiver, message)
except (gaierror, ConnectionRefusedError):
  # tell the script to report if your message was sent or which errors need to be fixed
  print(&apos;Failed to connect to the server. Bad connection settings?&apos;)
except smtplib.SMTPServerDisconnected:
  print(&apos;Failed to connect to the server. Wrong user/password?&apos;)
except smtplib.SMTPException as e:
  print(&apos;SMTP error occurred: &apos; + str(e))
else:
  print(&apos;Sent&apos;)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Once you get the &lt;em&gt;Sent&lt;/em&gt; result in Shell, you should see your message in your Mailtrap inbox:&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://lh3.googleusercontent.com/xCCQOuWFyqmvbiOaLa7VgYyBdCu5c2q5oXzyn2aeFFE8tkfbUvDwi_H19fSNAempeUWIoDuHVn5ETqr34lO8WkT8vZh8iJVChjnCZgoAA3TsTJF2n32sGUl1GX89WcYUdChJZ2Ux&quot; alt=&quot;Screenshot of a test email received in the Mailtrap inbox&quot; /&gt;&lt;/p&gt;
&lt;h3&gt;Sending emails with HTML content&lt;/h3&gt;
&lt;p&gt;In most cases, you need to add some formatting, links, or images to your email notifications. We can simply put all of these with the HTML content. For this purpose, Python has an &lt;em&gt;email&lt;/em&gt; package.&lt;/p&gt;
&lt;p&gt;We will deal with the MIME message type, which is able to combine HTML and plain text. In Python, it is handled by the &lt;em&gt;email.mime&lt;/em&gt; module.&lt;/p&gt;
&lt;p&gt;It is better to write the text version and the HTML version separately, and then merge them with the &lt;code&gt;MIMEMultipart(&quot;alternative&quot;)&lt;/code&gt; instance. Such a message has two rendering options: if the HTML isn’t rendered successfully for some reason, the text version will still be available.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import smtplib
from email.mime.text import MIMEText
from email.mime.multipart import MIMEMultipart

port = 2525
smtp_server = &quot;smtp.mailtrap.io&quot;
login = &quot;1a2b3c4d5e6f7g&quot; # paste your login generated by Mailtrap
password = &quot;1a2b3c4d5e6f7g&quot; # paste your password generated by Mailtrap

sender_email = &quot;mailtrap@example.com&quot;
receiver_email = &quot;new@example.com&quot;

message = MIMEMultipart(&quot;alternative&quot;)
message[&quot;Subject&quot;] = &quot;multipart test&quot;
message[&quot;From&quot;] = sender_email
message[&quot;To&quot;] = receiver_email
## Write the plain text part
text = &quot;&quot;&quot;\
Hi,
Check out the new post on the Mailtrap blog:
SMTP Server for Testing: Cloud-based or Local?
https://blog.mailtrap.io/2018/09/27/cloud-or-local-smtp-server/
Feel free to let us know what content would be useful for you!&quot;&quot;&quot;

## Write the HTML part
html = &quot;&quot;&quot;\
&amp;lt;html&amp;gt;
  &amp;lt;body&amp;gt;
    &amp;lt;p&amp;gt;Hi,&amp;lt;br&amp;gt;
    Check out the new post on the Mailtrap blog:&amp;lt;/p&amp;gt;
    &amp;lt;p&amp;gt;&amp;lt;a href=&quot;https://blog.mailtrap.io/2018/09/27/cloud-or-local-smtp-server&quot;&amp;gt;SMTP Server for Testing: Cloud-based or Local?&amp;lt;/a&amp;gt;&amp;lt;/p&amp;gt;
    &amp;lt;p&amp;gt;Feel free to &amp;lt;strong&amp;gt;let us&amp;lt;/strong&amp;gt; know what content would be useful for you!&amp;lt;/p&amp;gt;
  &amp;lt;/body&amp;gt;
&amp;lt;/html&amp;gt;
&quot;&quot;&quot;

## convert both parts to MIMEText objects and add them to the MIMEMultipart message
part1 = MIMEText(text, &quot;plain&quot;)
part2 = MIMEText(html, &quot;html&quot;)
message.attach(part1)
message.attach(part2)

## send your email
with smtplib.SMTP(&quot;smtp.mailtrap.io&quot;, 2525) as server:
  server.login(login, password)
  server.sendmail(sender_email, receiver_email, message.as_string())

print(&apos;Sent&apos;)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&quot;https://lh6.googleusercontent.com/jRh9dfieiWa1JAH1o5eb62Pv4wPUgIPGWpyz5RJkcFaflS-JnWJ7nQfdkr5hp87iOoDT-dx9WyvPwngJsvQnMoe9iKqa7jg6hDklFOxaLeftGqNp8MgtE8YDS13UmLLkBeee5cPT&quot; alt=&quot;The resulting output&quot; /&gt;&lt;/p&gt;
&lt;h3&gt;Sending Emails with Attachments in Python&lt;/h3&gt;
&lt;p&gt;The next step in mastering sending emails with Python is attaching files. Attachments are still MIME objects, but we need to encode them with the &lt;em&gt;base64&lt;/em&gt; module. A couple of important points about attachments:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Python lets you attach text files, images, audio files, and even applications. You just need to use the appropriate email class like &lt;code&gt;email.mime.audio.MIMEAudio&lt;/code&gt; or &lt;code&gt;email.mime.image.MIMEImage&lt;/code&gt;&lt;em&gt;.&lt;/em&gt; For the full information, refer to &lt;a href=&quot;https://docs.python.org/3/library/email.mime.html&quot;&gt;this section&lt;/a&gt; of the Python documentation.&lt;/li&gt;
&lt;li&gt;Keep the file size in mind: sending files over 20 MB is bad practice.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;In transactional emails, PDF files are the most frequently used: we usually get receipts, tickets, boarding passes, order confirmations, etc. So let’s review how to send a boarding pass as a PDF file.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import smtplib
from email import encoders
from email.mime.base import MIMEBase
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText

port = 2525
smtp_server = &quot;smtp.mailtrap.io&quot;
login = &quot;1a2b3c4d5e6f7g&quot; # paste your login generated by Mailtrap
password = &quot;1a2b3c4d5e6f7g&quot; # paste your password generated by Mailtrap

subject = &quot;An example of boarding pass&quot;
sender_email = &quot;mailtrap@example.com&quot;
receiver_email = &quot;new@example.com&quot;

message = MIMEMultipart()
message[&quot;From&quot;] = sender_email
message[&quot;To&quot;] = receiver_email
message[&quot;Subject&quot;] = subject

## Add body to email
body = &quot;This is an example of how you can send a boarding pass in attachment with Python&quot;
message.attach(MIMEText(body, &quot;plain&quot;))

filename = &quot;yourBP.pdf&quot;
## Open the PDF file in binary mode
## We assume that the file is in the directory where you run your Python script from
with open(filename, &quot;rb&quot;) as attachment:
    ## The content type &quot;application/octet-stream&quot; means that a MIME attachment is a binary file
    part = MIMEBase(&quot;application&quot;, &quot;octet-stream&quot;)
    part.set_payload(attachment.read())

## Encode to base64
encoders.encode_base64(part)
## Add header
part.add_header(&quot;Content-Disposition&quot;, f&quot;attachment; filename={filename}&quot;)
## Add the attachment to your message and convert it to a string
message.attach(part)

text = message.as_string()
## send your email
with smtplib.SMTP(&quot;smtp.mailtrap.io&quot;, 2525) as server:
  server.login(login, password)
  server.sendmail(sender_email, receiver_email, text)

print(&apos;Sent&apos;)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&quot;https://lh5.googleusercontent.com/xxqg_Ro8uggpJxjCKMmukQ2jJmwDeXasadM5HA0LeOUktOPYc-0iXp2xQZHkILyfdFWroJEz-UqgTr_zBEKISuydHmoqCAPrvikrC23VgCDawHBVH-9-ufmmfF556nsU-1vPJ2Ng&quot; alt=&quot;The received email with your PDF&quot; /&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;To attach several files&lt;/strong&gt;, you can call the &lt;code&gt;message.attach()&lt;/code&gt; method several times.&lt;/p&gt;
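&lt;p&gt;For example, a small helper can loop over a list of paths and attach each file in turn. This is a sketch: the file names and contents below are placeholders created on the fly so the example is self-contained.&lt;/p&gt;

```python
import os
from email import encoders
from email.mime.base import MIMEBase
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText

def attach_file(message, path):
    # Read the file in binary mode and wrap it as a generic MIME attachment
    with open(path, "rb") as f:
        part = MIMEBase("application", "octet-stream")
        part.set_payload(f.read())
    encoders.encode_base64(part)
    part.add_header(
        "Content-Disposition",
        f"attachment; filename={os.path.basename(path)}",
    )
    message.attach(part)

# Create two small demo files so the example runs as-is (placeholders)
for name in ("ticket.pdf", "receipt.pdf"):
    with open(name, "wb") as f:
        f.write(b"demo content")

message = MIMEMultipart()
message.attach(MIMEText("Both files are attached.", "plain"))
for path in ("ticket.pdf", "receipt.pdf"):
    attach_file(message, path)
```

&lt;p&gt;The resulting message carries the text body plus one attachment part per file, and is sent exactly like the single-attachment example above.&lt;/p&gt;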
&lt;h4&gt;How to send an email with image attachment&lt;/h4&gt;
&lt;p&gt;Images, even if they are a part of the message body, are attachments as well. There are three types of them: CID attachments (embedded as a MIME object), &lt;em&gt;base64&lt;/em&gt; images (inline embedding), and linked images.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;For adding a CID attachment,&lt;/strong&gt; we will create a MIME multipart message with &lt;code&gt;MIMEImage&lt;/code&gt; component:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import smtplib
from email.mime.text import MIMEText
from email.mime.image import MIMEImage
from email.mime.multipart import MIMEMultipart

port = 2525
smtp_server = &quot;smtp.mailtrap.io&quot;
login = &quot;1a2b3c4d5e6f7g&quot; # paste your login generated by Mailtrap
password = &quot;1a2b3c4d5e6f7g&quot; # paste your password generated by Mailtrap

sender_email = &quot;mailtrap@example.com&quot;
receiver_email = &quot;new@example.com&quot;

message = MIMEMultipart(&quot;alternative&quot;)
message[&quot;Subject&quot;] = &quot;CID image test&quot;
message[&quot;From&quot;] = sender_email
message[&quot;To&quot;] = receiver_email

## write the HTML part
html = &quot;&quot;&quot;\
&amp;lt;html&amp;gt;
&amp;lt;body&amp;gt;
&amp;lt;img src=&quot;cid:myimage&quot;&amp;gt;
&amp;lt;/body&amp;gt;
&amp;lt;/html&amp;gt;
&quot;&quot;&quot;
part = MIMEText(html, &quot;html&quot;)
message.attach(part)

## We assume that the image file is in the same directory that you run your Python script from
with open(&apos;mailtrap.jpg&apos;, &apos;rb&apos;) as img:
  image = MIMEImage(img.read())
## Specify the Content-ID according to the img src in the HTML part
image.add_header(&apos;Content-ID&apos;, &apos;&amp;lt;myimage&amp;gt;&apos;)
message.attach(image)

## send your email
with smtplib.SMTP(&quot;smtp.mailtrap.io&quot;, 2525) as server:
  server.login(login, password)
  server.sendmail(sender_email, receiver_email, message.as_string())

print(&apos;Sent&apos;)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&quot;https://lh4.googleusercontent.com/VzdSmA1lJli_ZX_m6KmJW7VW-am20z5Vr_RUxJP5ZHxC72fRImhDuZxEXV0o2mDr09JTEMzPykskHKWh1DuMLF_yoKl5eIsMiKpebmILpvYioDGfzU70hFjfxFIu-fPVZqWF7vc8&quot; alt=&quot;The received email with CID image&quot; /&gt;&lt;/p&gt;
&lt;p&gt;The CID image is shown both as a part of the HTML message and as an attachment. Messages with this image type are often considered spam: check the &lt;em&gt;Analytics&lt;/em&gt; tab in Mailtrap to see the spam rate and recommendations for improving it. Many email clients — Gmail in particular — don’t display CID images in most cases. So let’s review &lt;strong&gt;how to embed a base64 encoded image instead.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Here we will use the &lt;em&gt;base64&lt;/em&gt; module and experiment with the same image file:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import smtplib
from email.mime.text import MIMEText
from email.mime.multipart import MIMEMultipart
import base64

port = 2525
smtp_server = &quot;smtp.mailtrap.io&quot;
login = &quot;1a2b3c4d5e6f7g&quot; # paste your login generated by Mailtrap
password = &quot;1a2b3c4d5e6f7g&quot; # paste your password generated by Mailtrap
sender_email = &quot;mailtrap@example.com&quot;
receiver_email = &quot;new@example.com&quot;

message = MIMEMultipart(&quot;alternative&quot;)
message[&quot;Subject&quot;] = &quot;inline embedding&quot;
message[&quot;From&quot;] = sender_email
message[&quot;To&quot;] = receiver_email

## We assume that the image file is in the same directory that you run your Python script from
with open(&quot;image.jpg&quot;, &quot;rb&quot;) as image:
  encoded = base64.b64encode(image.read()).decode()

html = f&quot;&quot;&quot;\
&amp;lt;html&amp;gt;
&amp;lt;body&amp;gt;
&amp;lt;img src=&quot;data:image/jpeg;base64,{encoded}&quot;&amp;gt;
&amp;lt;/body&amp;gt;
&amp;lt;/html&amp;gt;
&quot;&quot;&quot;
part = MIMEText(html, &quot;html&quot;)
message.attach(part)

## send your email
with smtplib.SMTP(&quot;smtp.mailtrap.io&quot;, 2525) as server:
  server.login(login, password)
  server.sendmail(sender_email, receiver_email, message.as_string())

print(&apos;Sent&apos;)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&quot;https://lh5.googleusercontent.com/zMSMgzypDp3lL1o1M21RB1nr6Dcc5Tekq8ucJktZqzWHynM8-YR2I4Ze6Rp7TkHtDxmcfYMyZXe1F_5sQihWL7kwpEFmQhnCRrDhe9aPjlJ0E7FzmdNvvibUOIU2yGqqC3U3ULEl&quot; alt=&quot;A base64 encoded image&quot; /&gt;&lt;/p&gt;
&lt;p&gt;Now the image is embedded into the HTML message and is not available as an attached file. Python has encoded our JPEG image, and if we go to the &lt;em&gt;HTML Source&lt;/em&gt; tab, we will see the long image data string in the &lt;code&gt;img src&lt;/code&gt; attribute.&lt;/p&gt;
&lt;h3&gt;How to Send Multiple Emails&lt;/h3&gt;
&lt;p&gt;Sending multiple personalized emails to different recipients is where Python really shines.&lt;/p&gt;
&lt;p&gt;To add a few more recipients, you can simply list their addresses in the &lt;code&gt;To&lt;/code&gt; header separated by commas, or add &lt;code&gt;Cc&lt;/code&gt; and &lt;code&gt;Bcc&lt;/code&gt; headers. But if you are sending in bulk, loops will save you.&lt;/p&gt;
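&lt;p&gt;For a handful of recipients, a sketch could look like this (all addresses are placeholders). Note that &lt;code&gt;Bcc&lt;/code&gt; recipients belong in the envelope list passed to &lt;code&gt;sendmail()&lt;/code&gt;, not in a visible header:&lt;/p&gt;

```python
from email.mime.text import MIMEText

# Placeholder addresses for the demo
to = ["john@example.com", "jane@example.com"]
cc = ["boss@example.com"]
bcc = ["archive@example.com"]

message = MIMEText("Hello everyone!", "plain")
message["Subject"] = "Team update"
message["From"] = "mailtrap@example.com"
# Comma-separated addresses in the visible headers
message["To"] = ", ".join(to)
message["Cc"] = ", ".join(cc)
# Bcc is deliberately NOT set as a header: it would reveal the hidden copy.
# Instead, include those addresses only in the envelope recipient list:
all_recipients = to + cc + bcc  # pass this list to server.sendmail()
```

&lt;p&gt;You would then call &lt;code&gt;server.sendmail(sender, all_recipients, message.as_string())&lt;/code&gt; so every recipient gets the message while the Bcc addresses stay invisible.&lt;/p&gt;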
&lt;p&gt;One option is to store your contacts in a &lt;em&gt;CSV&lt;/em&gt; file (we assume it is saved to the same folder as your Python script).&lt;/p&gt;
&lt;p&gt;We often see our own names in transactional or even promotional emails. Here is how we can do the same with Python.&lt;/p&gt;
&lt;p&gt;Let’s organize the list in a simple table with just two columns: name and email address. It should look like the following example:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;#name,email
John Johnson,john@johnson.com
Peter Peterson,peter@peterson.com
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The code below will open the file and loop over its rows line by line, replacing the &lt;code&gt;{name}&lt;/code&gt; with the value from the “name” column.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import csv
import smtplib

port = 2525
smtp_server = &quot;smtp.mailtrap.io&quot;
login = &quot;1a2b3c4d5e6f7g&quot; # paste your login generated by Mailtrap
password = &quot;1a2b3c4d5e6f7g&quot; # paste your password generated by Mailtrap

message = &quot;&quot;&quot;Subject: Order confirmation
To: {recipient}
From: {sender}

Hi {name}, thanks for your order! We are processing it now and will contact you soon&quot;&quot;&quot;
sender = &quot;new@example.com&quot;
with smtplib.SMTP(&quot;smtp.mailtrap.io&quot;, 2525) as server:
  server.login(login, password)
  with open(&quot;contacts.csv&quot;) as file:
    reader = csv.reader(file)
    next(reader)  # skip the header row
    for name, email in reader:
      server.sendmail(
        sender,
        email,
        message.format(name=name, recipient=email, sender=sender),
      )
      print(f&apos;Sent to {name}&apos;)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In our Mailtrap inbox, we see two messages: one for John Johnson and another for Peter Peterson, delivered simultaneously:&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://lh3.googleusercontent.com/Q6fRy7tMexzLqSsfAwEDdkZVh7Onb4impsLJkqLs40HsuVo43JV0eAUjJiWvxf-L0t9vdoTgEfeiN3MYX0wBU0vUKVZCRbmstlHk2RqvQWnPqr9WJbMX7LciUO9ebj89B5UZLrLd&quot; alt=&quot;Screenshot of Mailtrap inbox showing two emails sent to different recipients&quot; /&gt;&lt;/p&gt;
&lt;h3&gt;Sending Emails with Python via Gmail&lt;/h3&gt;
&lt;p&gt;When you are ready to send emails to real recipients, you can configure a production server. The right choice depends on your needs, goals, and preferences: your localhost or an external SMTP service.&lt;/p&gt;
&lt;p&gt;One of the most popular options is Gmail, so let’s take a closer look at it.&lt;/p&gt;
&lt;p&gt;We often see titles like “How to set up a Gmail account for development”. In practice, it means creating a new Gmail account dedicated to that particular purpose.&lt;/p&gt;
&lt;p&gt;To be able to send emails via your Gmail account, you need to grant your application access to it. You can &lt;a href=&quot;https://myaccount.google.com/lesssecureapps&quot;&gt;&lt;em&gt;Allow less secure apps&lt;/em&gt;&lt;/a&gt; or take advantage of the &lt;a href=&quot;https://developers.google.com/gmail/api/quickstart/python&quot;&gt;OAuth2 authorization protocol&lt;/a&gt;. The latter is way more difficult but recommended for security reasons.&lt;/p&gt;
&lt;p&gt;Further, to use a Gmail server, you need to know:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;the server name = &lt;em&gt;smtp.gmail.com&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;port = &lt;em&gt;465 for SSL/TLS&lt;/em&gt; connection (preferred)&lt;/li&gt;
&lt;li&gt;or port = &lt;em&gt;587 for STARTTLS&lt;/em&gt; connection&lt;/li&gt;
&lt;li&gt;username = your Gmail email address&lt;/li&gt;
&lt;li&gt;password = your password&lt;/li&gt;
&lt;/ul&gt;
&lt;pre&gt;&lt;code&gt;import smtplib
import ssl

port = 465
password = input(&quot;Type your password and press Enter: &quot;)
context = ssl.create_default_context()

with smtplib.SMTP_SSL(&quot;smtp.gmail.com&quot;, port, context=context) as server:
  server.login(&quot;my@gmail.com&quot;, password)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;If you prefer simplicity, you can use &lt;a href=&quot;https://pypi.org/project/yagmail/&quot;&gt;Yagmail&lt;/a&gt;, a dedicated Gmail/SMTP client. It makes email sending really easy. Just compare the examples above with these few lines of code:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import yagmail

yag = yagmail.SMTP()
contents = [
    &quot;This is the body, and here is just text http://somedomain/image.png&quot;,
    &quot;You can find an audio file attached.&quot;,
    &apos;/local/path/to/song.mp3&apos;,
]
yag.send(&apos;to@someone.com&apos;, &apos;subject&apos;, contents)
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Next steps with Python&lt;/h3&gt;
&lt;p&gt;Those are just the basic options for sending emails with Python. To get great results, review the Python documentation and experiment with your own code!&lt;/p&gt;
&lt;p&gt;There are also various Python frameworks and libraries that make creating apps more elegant, and some of them can improve your experience of building email sending functionality. The most popular are:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Flask, which offers a simple email sending interface via the Flask-Mail extension.&lt;/li&gt;
&lt;li&gt;Django, which can be a great option for building HTML templates.&lt;/li&gt;
&lt;li&gt;Zope, which comes in handy for website development.&lt;/li&gt;
&lt;li&gt;Marrow Mailer, a dedicated mail delivery framework with various helpful configuration options.&lt;/li&gt;
&lt;li&gt;Plotly and its Dash, which can help with mailing graphs and reports.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Also, here is a &lt;a href=&quot;https://awesome-python.com/&quot;&gt;handy list&lt;/a&gt; of Python resources sorted by their functionality.&lt;/p&gt;
&lt;p&gt;Good luck and don’t forget to stay on the safe side when sending your emails!&lt;/p&gt;
&lt;p&gt;&lt;em&gt;This article was originally published at Mailtrap’s blog: &lt;a href=&quot;https://blog.mailtrap.io/sending-emails-in-python-tutorial-with-code-examples/&quot;&gt;Sending emails with Python&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;
</content:encoded></item><item><title>Python and fast HTTP clients</title><link>https://julien.danjou.info/blog/python-and-fast-http-clients/</link><guid isPermaLink="true">https://julien.danjou.info/blog/python-and-fast-http-clients/</guid><description>Nowadays, it is more than likely that you will have to write an HTTP client for your application that will have to talk to another HTTP server. The ubiquity of REST API makes HTTP a first class citize</description><pubDate>Mon, 07 Oct 2019 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Nowadays, it is more than likely that you will have to write an HTTP client for your application that will have to talk to another HTTP server. The ubiquity of REST APIs makes HTTP a first-class citizen. That&apos;s why knowing optimization patterns is a prerequisite.&lt;/p&gt;
&lt;p&gt;There are many HTTP clients in Python; the most widely used and easiest to work with is &lt;em&gt;&lt;a href=&quot;https://requests.kennethreitz.org/&quot;&gt;requests&lt;/a&gt;&lt;/em&gt;. It is the de facto standard nowadays.&lt;/p&gt;
&lt;h2&gt;Persistent Connections&lt;/h2&gt;
&lt;p&gt;The first optimization to take into account is the use of a persistent connection to the Web server. Persistent connections are a standard since HTTP 1.1 though many applications do not leverage them. This lack of optimization is simple to explain if you know that when using &lt;em&gt;requests&lt;/em&gt; in its simple mode (e.g. with the &lt;code&gt;get&lt;/code&gt; function) the connection is closed on return. To avoid that, an application needs to use a &lt;code&gt;Session&lt;/code&gt; object that allows reusing an already opened connection.&lt;/p&gt;
&lt;p&gt;Each connection is stored in a pool of connections (10 by default), the size of which is also configurable.&lt;/p&gt;
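&lt;p&gt;With &lt;em&gt;requests&lt;/em&gt;, a minimal sketch looks like this (the pool sizes below are arbitrary values for illustration):&lt;/p&gt;

```python
import requests
from requests.adapters import HTTPAdapter

session = requests.Session()
# Grow the connection pool beyond the default of 10 connections
adapter = HTTPAdapter(pool_connections=10, pool_maxsize=50)
session.mount("https://", adapter)
session.mount("http://", adapter)
# session.get("https://example.org")  # subsequent calls reuse the TCP connection
```

&lt;p&gt;Any request made through this session then reuses connections from the enlarged pool instead of opening a new one each time.&lt;/p&gt;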
&lt;p&gt;Reusing the TCP connection to send out several HTTP requests offers a number of performance advantages:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Lower CPU and memory usage (fewer connections opened simultaneously).&lt;/li&gt;
&lt;li&gt;Reduced latency in subsequent requests (no TCP handshaking).&lt;/li&gt;
&lt;li&gt;Exceptions can be raised without the penalty of closing the TCP connection.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The HTTP protocol also provides &lt;a href=&quot;https://en.wikipedia.org/wiki/HTTP_pipelining&quot;&gt;pipelining&lt;/a&gt;, which allows sending several requests on the same connection without waiting for the replies to come (think batch). Unfortunately, this is not supported by the &lt;em&gt;requests&lt;/em&gt; library. However, pipelining requests may not be as fast as sending them in parallel. Indeed, the HTTP 1.1 protocol forces the replies to be sent in the same order as the requests were sent – first-in first-out.&lt;/p&gt;
&lt;h2&gt;Parallelism&lt;/h2&gt;
&lt;p&gt;&lt;em&gt;requests&lt;/em&gt; also has one major drawback: it is synchronous. Calling &lt;code&gt;requests.get(&quot;http://example.org&quot;)&lt;/code&gt; blocks the program until the HTTP server replies completely. Having the application waiting and doing nothing can be a drawback here. It is possible that the program could do something else rather than sitting idle.&lt;/p&gt;
&lt;p&gt;A smart application can mitigate this problem by using a pool of threads like the ones provided by &lt;code&gt;concurrent.futures&lt;/code&gt;. It allows parallelizing the HTTP requests in a very rapid way.&lt;/p&gt;
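&lt;p&gt;Here is a sketch of that pattern; a &lt;code&gt;time.sleep()&lt;/code&gt; call stands in for the blocking HTTP request so the example runs without network access:&lt;/p&gt;

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fetch(url):
    # Stand-in for requests.get(url): block for a moment, return a result
    time.sleep(0.2)
    return f"response from {url}"

urls = [f"http://example.org/{i}" for i in range(8)]

start = time.monotonic()
# The thread pool runs the blocking calls concurrently
with ThreadPoolExecutor(max_workers=8) as executor:
    results = list(executor.map(fetch, urls))
elapsed = time.monotonic() - start
# Eight 0.2s calls overlap and finish in roughly 0.2s instead of 1.6s
```

&lt;p&gt;Swapping the stand-in for a real &lt;code&gt;session.get()&lt;/code&gt; call gives you parallel HTTP requests with only a few lines of code.&lt;/p&gt;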
&lt;p&gt;This pattern being quite useful, it has been packaged into a library named &lt;em&gt;&lt;a href=&quot;https://github.com/ross/requests-futures&quot;&gt;requests-futures&lt;/a&gt;&lt;/em&gt;, which makes the usage of &lt;code&gt;Session&lt;/code&gt; objects transparent to the developer.&lt;/p&gt;
&lt;p&gt;By default, a worker with two threads is created, but a program can easily customize this value by passing the &lt;code&gt;max_workers&lt;/code&gt; argument or even its own executor to the &lt;code&gt;FuturesSession&lt;/code&gt; object – for example like this: &lt;code&gt;FuturesSession(executor=ThreadPoolExecutor(max_workers=10))&lt;/code&gt;.&lt;/p&gt;
&lt;h2&gt;Asynchronicity&lt;/h2&gt;
&lt;p&gt;As explained earlier, &lt;em&gt;requests&lt;/em&gt; is entirely synchronous. That blocks the application while waiting for the server to reply, slowing down the program. Making HTTP requests in threads is one solution, but threads do have their own overhead and this implies parallelism, which is not something everyone is always glad to see in a program.&lt;/p&gt;
&lt;p&gt;Starting with version 3.5, Python offers asynchronicity at its core using &lt;em&gt;asyncio&lt;/em&gt;. The &lt;a href=&quot;https://aiohttp.readthedocs.io/&quot;&gt;aiohttp&lt;/a&gt; library provides an asynchronous HTTP client built on top of &lt;em&gt;asyncio&lt;/em&gt;. This library allows sending requests in series but without waiting for the first reply to come back before sending a new one. In contrast to HTTP pipelining, &lt;em&gt;aiohttp&lt;/em&gt; sends the requests over multiple connections in parallel, avoiding the ordering issue explained earlier.&lt;/p&gt;
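&lt;p&gt;The concurrency model can be sketched with plain &lt;em&gt;asyncio&lt;/em&gt;, using &lt;code&gt;asyncio.sleep()&lt;/code&gt; as a stand-in for the network wait of each &lt;em&gt;aiohttp&lt;/em&gt; request, so the example runs offline:&lt;/p&gt;

```python
import asyncio
import time

async def fetch(url):
    # Stand-in for "async with session.get(url)": await instead of blocking
    await asyncio.sleep(0.2)
    return f"response from {url}"

async def main():
    urls = [f"http://example.org/{i}" for i in range(8)]
    # gather() drives all coroutines concurrently on a single thread
    return await asyncio.gather(*(fetch(u) for u in urls))

start = time.monotonic()
results = asyncio.run(main())
elapsed = time.monotonic() - start
# The eight 0.2s waits overlap, so the total stays close to 0.2s
```

&lt;p&gt;With &lt;em&gt;aiohttp&lt;/em&gt;, the same structure applies: one &lt;code&gt;ClientSession&lt;/code&gt;, many awaited requests, a single thread.&lt;/p&gt;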
&lt;p&gt;All those solutions (using &lt;code&gt;Session&lt;/code&gt;, &lt;em&gt;threads&lt;/em&gt;, &lt;em&gt;futures&lt;/em&gt; or &lt;em&gt;asyncio&lt;/em&gt;) offer different approaches to making HTTP clients faster.&lt;/p&gt;
&lt;h2&gt;Performances&lt;/h2&gt;
&lt;p&gt;The benchmark program is an HTTP client sending requests to &lt;code&gt;httpbin.org&lt;/code&gt;, an HTTP API that provides (among other things) an endpoint simulating a long request (a second here). It implements all the techniques listed above and times them.&lt;/p&gt;
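&lt;p&gt;The full benchmark is not reproduced here, but the timing harness at its heart can be sketched as a small context manager (the label format mirrors the output below):&lt;/p&gt;

```python
import contextlib
import time

@contextlib.contextmanager
def report_time(label):
    # Print how long the wrapped block took, in the format shown below
    start = time.monotonic()
    yield
    print(f"Time needed for `{label}' called: {time.monotonic() - start:.2f}s")

# Stand-in workload: in the real benchmark this wraps the 10 HTTP requests
with report_time("serialized"):
    time.sleep(0.1)
```

&lt;p&gt;Each variant (serialized, &lt;code&gt;Session&lt;/code&gt;, &lt;code&gt;FuturesSession&lt;/code&gt;, &lt;em&gt;aiohttp&lt;/em&gt;) simply runs inside such a block to produce one line of the timings.&lt;/p&gt;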
&lt;p&gt;Running this program gives the following output:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;Time needed for `serialized&apos; called: 12.12s
Time needed for `Session&apos; called: 11.22s
Time needed for `FuturesSession w/ 2 workers&apos; called: 5.65s
Time needed for `FuturesSession w/ max workers&apos; called: 1.25s
Time needed for `aiohttp&apos; called: 1.19s
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/07/20190716092338_hd.png&quot; alt=&quot;Benchmark chart comparing HTTP client performance in Python&quot; /&gt;&lt;/p&gt;
&lt;p&gt;Without any surprise, the slowest result comes with the dumb serialized version, since all the requests are made one after another without reusing the connection — 12 seconds to make 10 requests.&lt;/p&gt;
&lt;p&gt;Using a &lt;code&gt;Session&lt;/code&gt; object and therefore reusing the connection means saving 8% in terms of time, which is already a big and easy win. At a minimum, you should always use a session.&lt;/p&gt;
&lt;p&gt;If your system and program allow the usage of threads, it is a good call to use them to parallelize the requests. However threads have some overhead, and they are not weight-less. They need to be created, started and then joined.&lt;/p&gt;
&lt;p&gt;Unless you are still using old versions of Python, without a doubt &lt;em&gt;aiohttp&lt;/em&gt; should be the way to go nowadays if you want to write a fast and asynchronous HTTP client. It is the fastest and the most scalable solution, as it can handle hundreds of parallel requests. The alternative, managing hundreds of threads in parallel, is not a great option.&lt;/p&gt;
&lt;h2&gt;Streaming&lt;/h2&gt;
&lt;p&gt;Another speed optimization that can be efficient is streaming the requests. When making a request, by default the body of the response is downloaded immediately. The &lt;code&gt;stream&lt;/code&gt; parameter provided by the &lt;em&gt;requests&lt;/em&gt; library or the &lt;code&gt;content&lt;/code&gt; attribute for &lt;code&gt;aiohttp&lt;/code&gt; both provide a way to not load the full content in memory as soon as the request is executed.&lt;/p&gt;
&lt;p&gt;Not loading the full content is extremely important in order to avoid allocating potentially hundreds of megabytes of memory for nothing. If your program does not need to access the entire content as a whole but can work on chunks, it is probably better to just use those methods. For example, if you&apos;re going to save and write the content to a file, reading only a chunk and writing it at the same time is going to be much more memory efficient than reading the whole HTTP body, allocating a giant pile of memory, and then writing it to disk.&lt;/p&gt;
&lt;p&gt;I hope that&apos;ll make it easier for you to write proper HTTP clients and requests. If you know any other useful technique or method, feel free to write it down in the comment section below!&lt;/p&gt;
</content:encoded></item><item><title>Dependencies Handling in Python</title><link>https://julien.danjou.info/blog/dependencies-handling-in-python-automatic-update/</link><guid isPermaLink="true">https://julien.danjou.info/blog/dependencies-handling-in-python-automatic-update/</guid><description>Dependencies are a nightmare. Here&apos;s how to handle them properly in Python with pipenv, poetry, Dependabot, and Mergify for fully automatic updates.</description><pubDate>Mon, 02 Sep 2019 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Dependencies are a nightmare for many people. &lt;a href=&quot;https://thenewstack.io/to-reduce-tech-debt-eliminate-dependencies-and-refactoring/&quot;&gt;Some even argue they are technical debt&lt;/a&gt;. Managing the list of the libraries of your software is a horrible experience. Updating them — automatically? — sounds like a delirium.&lt;/p&gt;
&lt;p&gt;Stick with me here as I am going to help you get a better grasp on something that you cannot, in practice, get rid of — unless you&apos;re incredibly rich and talented and can live without the code of others.&lt;/p&gt;
&lt;p&gt;First, we need to be clear about something regarding dependencies: there are two types of them. &lt;a href=&quot;https://caremad.io/posts/2013/07/setup-vs-requirement/&quot;&gt;Donald Stufft wrote about the subject better than I could&lt;/a&gt; years ago. To make it simple, there are two types of code packages depending on external code: applications and libraries.&lt;/p&gt;
&lt;h3&gt;Libraries Dependencies&lt;/h3&gt;
&lt;p&gt;Python libraries should specify their dependencies in a generic way. A library should not require &lt;code&gt;requests 2.1.5&lt;/code&gt;: it does not make sense. If every library out there needs a different version of &lt;code&gt;requests&lt;/code&gt;, they can&apos;t be used at the same time.&lt;/p&gt;
&lt;p&gt;Libraries need to declare dependencies based on ranges of version numbers. Requiring &lt;code&gt;requests&amp;gt;=2&lt;/code&gt; is correct. Requiring &lt;code&gt;requests&amp;gt;=1,&amp;lt;2&lt;/code&gt; is also correct if you know that &lt;code&gt;requests 2.x&lt;/code&gt; does not work with the library. The problem that your version range specification is solving is the &lt;strong&gt;API compatibility issue&lt;/strong&gt; between your code and your dependencies — &lt;em&gt;nothing else&lt;/em&gt;. That&apos;s a good reason for libraries to use &lt;a href=&quot;https://semver.org/&quot;&gt;Semantic Versioning&lt;/a&gt; whenever possible.&lt;/p&gt;
&lt;p&gt;Therefore, dependencies should be written in &lt;code&gt;setup.py&lt;/code&gt; as something like:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;from setuptools import setup

setup(
    name=&quot;MyLibrary&quot;,
    version=&quot;1.0&quot;,
    install_requires=[
        &quot;requests&quot;,
    ],
    # ...
)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This way, it is easy for any application to use the library and co-exist with others.&lt;/p&gt;
&lt;h3&gt;Applications Dependencies&lt;/h3&gt;
&lt;p&gt;An application is just a particular case of a library. It is not intended to be reused (imported) by other libraries or applications — though nothing would prevent it in practice.&lt;/p&gt;
&lt;p&gt;In the end, that means that you should specify the dependencies the same way that you would do for a library in the application&apos;s &lt;code&gt;setup.py&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;The main difference is that an application is usually &lt;em&gt;deployed&lt;/em&gt; in production to provide its service. Deployments need to be reproducible. For that, you can&apos;t solely rely on &lt;code&gt;setup.py&lt;/code&gt;: the requested ranges of the dependencies are too broad. You&apos;re at the mercy of random version changes at any time when re-deploying your application.&lt;/p&gt;
&lt;p&gt;You, therefore, need a different version management mechanism to handle deployment than just &lt;code&gt;setup.py&lt;/code&gt;.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;pipenv&lt;/em&gt; has &lt;a href=&quot;https://docs.pipenv.org/en/latest/advanced/#pipfile-vs-setuppy&quot;&gt;an excellent section recapping this&lt;/a&gt; in its documentation. It splits dependency types into &lt;em&gt;abstract&lt;/em&gt; and &lt;em&gt;concrete&lt;/em&gt; dependencies: &lt;em&gt;abstract&lt;/em&gt; dependencies are based on ranges (e.g., libraries) whereas &lt;em&gt;concrete&lt;/em&gt; dependencies are specified with precise versions (e.g., application deployments) — as we&apos;ve just seen here.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h3&gt;Handling Deployment&lt;/h3&gt;
&lt;p&gt;The &lt;code&gt;requirements.txt&lt;/code&gt; file has been used to solve application deployment reproducibility for a long time now. Its format is usually something like:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;requests==3.1.5
foobar==2.0
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Each library sees itself specified to the micro version. That makes sure each of your deployment is going to install the same version of your dependency. Using a &lt;code&gt;requirements.txt&lt;/code&gt; is a simple solution and a first step toward reproducible deployment. However, it&apos;s not &lt;em&gt;enough&lt;/em&gt;.&lt;/p&gt;
&lt;p&gt;Indeed, while you can specify which version of &lt;code&gt;requests&lt;/code&gt; you want, &lt;code&gt;requests&lt;/code&gt; itself depends on &lt;code&gt;urllib3&lt;/code&gt;, and that could make &lt;code&gt;pip&lt;/code&gt; install &lt;code&gt;urllib3 2.1&lt;/code&gt; or &lt;code&gt;urllib3 2.2&lt;/code&gt;. You can&apos;t know which one will be installed, which does not make your deployment 100% reproducible.&lt;/p&gt;
&lt;p&gt;Of course, you &lt;em&gt;could&lt;/em&gt; duplicate all &lt;code&gt;requests&lt;/code&gt; dependencies yourself in your &lt;code&gt;requirements.txt&lt;/code&gt;, but that would be &lt;strong&gt;madness&lt;/strong&gt;!&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/08/image.png&quot; alt=&quot;An application dependency tree can be quite deep and complex sometimes.&quot; /&gt;&lt;/p&gt;
&lt;p&gt;There are various hacks available to fix this limitation, but the real saviors here are &lt;a href=&quot;https://github.com/pypa/pipenv&quot;&gt;&lt;em&gt;pipenv&lt;/em&gt;&lt;/a&gt; and &lt;a href=&quot;https://poetry.eustace.io/&quot;&gt;&lt;em&gt;poetry&lt;/em&gt;&lt;/a&gt;. The way they solve it is similar to many package managers in other programming languages. They generate a &lt;em&gt;lock file&lt;/em&gt; that contains the list of all installed dependencies (and their own dependencies, etc.) with their version numbers. That makes sure the deployment is 100% reproducible.&lt;/p&gt;
&lt;p&gt;Check out their documentation on how to set up and use them!&lt;/p&gt;
&lt;h3&gt;Handling Dependencies Updates&lt;/h3&gt;
&lt;p&gt;Now that you have your &lt;em&gt;lock file&lt;/em&gt; that makes sure your deployment is reproducible in a snap, you&apos;ve another problem. How do you make sure that your dependencies are up-to-date? There is a real security concern about this, but also bug fixes and optimizations that you might miss by staying behind.&lt;/p&gt;
&lt;p&gt;If your project is hosted on &lt;a href=&quot;https://github.com&quot;&gt;GitHub&lt;/a&gt;, &lt;a href=&quot;https://dependabot.com/&quot;&gt;Dependabot&lt;/a&gt; is an excellent solution to solve this issue. Enabling this application on your repository automatically creates pull requests whenever a new version of a library listed in your lock file is available. For example, if you&apos;ve deployed your application with &lt;code&gt;redis 3.3.6&lt;/code&gt;, Dependabot will create a pull request updating to &lt;code&gt;redis 3.3.7&lt;/code&gt; as soon as it gets released. Furthermore, Dependabot supports &lt;code&gt;requirements.txt&lt;/code&gt;, &lt;em&gt;pipenv&lt;/em&gt;, and &lt;em&gt;poetry&lt;/em&gt;!&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/08/Screenshot-2019-08-14-at-17.57.47.png&quot; alt=&quot;Dependabot updating jinja2 for you&quot; /&gt;&lt;/p&gt;
&lt;h2&gt;Automatic Deployment Update&lt;/h2&gt;
&lt;p&gt;You&apos;re almost there. You have a bot that is letting you know that a new version of a library your project needs is available.&lt;/p&gt;
&lt;p&gt;Once the pull request is created, your continuous integration system is going to kick in, build your project, and run the tests. If everything works fine, your pull request is ready to be merged. But are &lt;em&gt;you&lt;/em&gt; really needed in this process?&lt;/p&gt;
&lt;p&gt;Unless you have a particular and personal aversion to specific version numbers —&quot;Gosh I hate versions that end with a 3. It&apos;s always bad luck.&quot;— or unless you have zero automated testing, you, the human, are useless here. This merge can be fully automatic.&lt;/p&gt;
&lt;p&gt;This is where &lt;a href=&quot;https://mergify.io&quot;&gt;&lt;em&gt;Mergify&lt;/em&gt;&lt;/a&gt; comes into play. Mergify is a GitHub application that lets you define precise rules about how to merge your pull requests. Here&apos;s a rule that I use in every project:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;pull_requests_rules:
  - name: automatic merge from dependabot
    conditions:
      - author~=^dependabot(|-preview)\[bot\]$
      - label!=work-in-progress
      - &quot;status-success=ci/circleci: pep8&quot;
      - &quot;status-success=ci/circleci: py37&quot;
    actions:
      merge:
        method: merge
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/08/Screenshot-2019-08-14-at-18.38.25.png&quot; alt=&quot;Mergify reports when the rule fully matches&quot; /&gt;&lt;/p&gt;
&lt;p&gt;As soon as your continuous integration system passes, Mergify merges the pull request for you.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/08/Screenshot-2019-08-14-at-18.38.37.png&quot; alt=&quot;Screenshot of Mergify automatically merging a Dependabot pull request&quot; /&gt;&lt;/p&gt;
&lt;p&gt;You can then automatically trigger your deployment hooks to update your production deployment and get the new library version installed right away. This leaves your application always up-to-date with newer libraries and not lagging behind several years of releases.&lt;/p&gt;
&lt;p&gt;If anything goes wrong, you&apos;re still able to revert the commit from Dependabot — which you can also automate if you wish with a Mergify rule.&lt;/p&gt;
&lt;h2&gt;Beyond&lt;/h2&gt;
&lt;p&gt;This is, to me, the current state of the art of the dependency management lifecycle. And while this applies exceptionally well to Python, it can be applied to many other languages that use a similar pattern — such as Node and &lt;em&gt;npm&lt;/em&gt;.&lt;/p&gt;
</content:encoded></item><item><title>The Art of PostgreSQL is out!</title><link>https://julien.danjou.info/blog/the-art-of-postgresql-is-out/</link><guid isPermaLink="true">https://julien.danjou.info/blog/the-art-of-postgresql-is-out/</guid><description>If you remember well, a couple of years ago, I wrote about Mastering PostgreSQL, a fantastic book written by my friend Dimitri Fontaine.  Dimitri is a long-time PostgreSQL core developer — for example</description><pubDate>Wed, 28 Aug 2019 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;If you remember well, a couple of years ago, I wrote about &lt;em&gt;Mastering PostgreSQL&lt;/em&gt;, a fantastic book written by my friend Dimitri Fontaine.&lt;/p&gt;
&lt;p&gt;Dimitri is a long-time PostgreSQL core developer; he wrote the extension support in PostgreSQL, no less. He is featured in my book &lt;a href=&quot;https://serious-python.com&quot;&gt;Serious Python&lt;/a&gt;, where he gives advice on using databases and ORMs in Python.&lt;/p&gt;
&lt;p&gt;Today, Dimitri comes back with the new version of this book, named &lt;em&gt;&lt;a href=&quot;https://jdanjou--theartofpostgresql.thrivecart.com/full-edition/?ref=blog&quot;&gt;The Art of PostgreSQL&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/08/IMG_20141121_164610.jpg&quot; alt=&quot;As a bonus, here&apos;s a picture of me and Dimitri having fun in a PostgreSQL meetup!&quot; /&gt;&lt;/p&gt;
&lt;p&gt;I love the motto of this book: &lt;em&gt;Turn Thousands of Lines of Code into Simple Queries&lt;/em&gt;. I have spent my whole career working with code that talks to databases, and I can&apos;t count the number of times I&apos;ve seen people write lengthy, slow code in their pet language rather than a single well-thought-out SQL query that would do a better job.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/08/image-5.png&quot; alt=&quot;Cover of The Art of PostgreSQL&quot; /&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;This is exactly what &lt;a href=&quot;https://jdanjou--theartofpostgresql.thrivecart.com/full-edition/?ref=blog&quot;&gt;this book&lt;/a&gt; is about.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;That&apos;s why it&apos;s my favorite SQL book. I learned so many things from it. In many cases, I&apos;ve been able to divide by 10 the size of the code I had to write in Python to implement a feature. All I had to do was browse the book to discover the right PostgreSQL feature and write a single SQL query. The &lt;em&gt;right&lt;/em&gt; query that does the job for me.&lt;/p&gt;
&lt;p&gt;Less code, fewer bugs, more happiness!&lt;/p&gt;
&lt;p&gt;The book also features interviews with great PostgreSQL users and developers; no mystery where Dimitri got this idea, right? ;-)&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/08/Screenshot-2019-08-28-at-15.09.21.png&quot; alt=&quot;Screenshot of interview excerpts featured in The Art of PostgreSQL&quot; /&gt;&lt;/p&gt;
&lt;p&gt;I loved those interviews. What&apos;s better than reading Kris Jenkins explaining how Clojure and PostgreSQL play nice together, or Markus Winand (from the famous &lt;a href=&quot;https://use-the-index-luke.com/&quot;&gt;use-the-index-luke.com&lt;/a&gt;) talking about the relationship developers have with their database? :-)&lt;/p&gt;
&lt;p&gt;Needless to say, you should get your hands on this &lt;strong&gt;right now.&lt;/strong&gt; Dimitri just made a launch offer where &lt;a href=&quot;https://jdanjou--theartofpostgresql.thrivecart.com/full-edition/?ref=blog&quot;&gt;he offers a &lt;strong&gt;15% discount&lt;/strong&gt; on the book&lt;/a&gt; until the end of this month! You can also &lt;a href=&quot;https://jdanjou--theartofpostgresql.thrivecart.com/full-edition/?ref=blog&quot;&gt;read the free chapter&lt;/a&gt; to get an idea of what you&apos;ll get.&lt;/p&gt;
&lt;p&gt;Last thing: it&apos;s DRM-free and money-back guaranteed. You can &lt;a href=&quot;https://jdanjou--theartofpostgresql.thrivecart.com/full-edition/?ref=blog&quot;&gt;get this book&lt;/a&gt; with your eyes closed.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/08/Screenshot-2019-08-28-at-15.25.14.png&quot; alt=&quot;Screenshot of The Art of PostgreSQL book packages and pricing&quot; /&gt;&lt;/p&gt;
&lt;/content:encoded&gt;&lt;/item&gt;&lt;item&gt;&lt;title&gt;Handling multipart/form-data natively in Python&lt;/title&gt;&lt;link&gt;https://julien.danjou.info/blog/handling-multipart-form-data-python/&lt;/link&gt;&lt;guid isPermaLink=&quot;true&quot;&gt;https://julien.danjou.info/blog/handling-multipart-form-data-python/&lt;/guid&gt;&lt;description&gt;RFC7578 (which obsoletes RFC2388) defines the multipart/form-data type that is usually transported over HTTP when users submit forms on your Web page.&lt;/description&gt;&lt;pubDate&gt;Mon, 01 Jul 2019 00:00:00 GMT&lt;/pubDate&gt;&lt;content:encoded&gt;&lt;p&gt;&lt;a href=&quot;https://tools.ietf.org/html/rfc7578&quot;&gt;RFC7578&lt;/a&gt; (which obsoletes &lt;a href=&quot;https://tools.ietf.org/html/rfc2388&quot;&gt;RFC2388&lt;/a&gt;) defines the &lt;code&gt;multipart/form-data&lt;/code&gt; type that is usually transported over HTTP when users submit forms on your Web page. Nowadays, it tends to be replaced by JSON-encoded payloads; nevertheless, it is still widely used.&lt;/p&gt;
&lt;p&gt;While you can decode an HTTP request body made with JSON natively in Python — thanks to the &lt;code&gt;json&lt;/code&gt; module — there is no such facility for &lt;code&gt;multipart/form-data&lt;/code&gt;. That is hard to understand considering how old the format is.&lt;/p&gt;
&lt;p&gt;There is a wide variety of ways available to encode and decode this format. Libraries such as &lt;em&gt;requests&lt;/em&gt; support it natively without you noticing, and the same goes for most Web server frameworks, such as &lt;em&gt;Django&lt;/em&gt; or &lt;em&gt;Flask&lt;/em&gt;.&lt;/p&gt;
&lt;p&gt;However, in certain circumstances, you might be on your own to encode or decode this format, and it might not be an option to pull (significant) dependencies.&lt;/p&gt;
&lt;h2&gt;Encoding&lt;/h2&gt;
&lt;p&gt;The &lt;code&gt;multipart/form-data&lt;/code&gt; format is quite simple to understand and can be summarized as an easy way to encode a list of keys and values, i.e., a portable way of serializing a dictionary.&lt;/p&gt;
&lt;p&gt;There&apos;s nothing in Python to generate such an encoding. The format is quite simple and consists of the key and value surrounded by a random boundary delimiter. This delimiter must be passed as part of the &lt;code&gt;Content-Type&lt;/code&gt;, so that the decoder can decode the form data.&lt;/p&gt;
&lt;p&gt;There&apos;s a simple implementation of this in &lt;em&gt;urllib3&lt;/em&gt; that does the job, and it can be boiled down to the following:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import binascii
import os

def encode_multipart_formdata(fields):
    boundary = binascii.hexlify(os.urandom(16)).decode(&apos;ascii&apos;)

    body = (
        &quot;&quot;.join(&quot;--%s\r\n&quot;
                &quot;Content-Disposition: form-data; name=\&quot;%s\&quot;\r\n&quot;
                &quot;\r\n&quot;
                &quot;%s\r\n&quot; % (boundary, field, value)
                for field, value in fields.items()) +
        &quot;--%s--\r\n&quot; % boundary
    )

    content_type = &quot;multipart/form-data; boundary=%s&quot; % boundary

    return body, content_type
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;You can use it by passing a dictionary whose keys and values are strings. For example:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;encode_multipart_formdata({&quot;foo&quot;: &quot;bar&quot;, &quot;name&quot;: &quot;jd&quot;})
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Which returns the following body:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;--00252461d3ab8ff5c25834e0bffd6f70
Content-Disposition: form-data; name=&quot;foo&quot;

bar
--00252461d3ab8ff5c25834e0bffd6f70
Content-Disposition: form-data; name=&quot;name&quot;

jd
--00252461d3ab8ff5c25834e0bffd6f70--
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;And the matching content type:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;multipart/form-data; boundary=00252461d3ab8ff5c25834e0bffd6f70
&lt;/code&gt;&lt;/pre&gt;
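Here is a hedged sketch of how those two returned values plug into an actual HTTP request with the standard library. The helper is restated for completeness, the URL is a placeholder, and the request is only built, not sent:

```python
import binascii
import os
import urllib.request

def encode_multipart_formdata(fields):
    # Same helper as above: random boundary, one part per field.
    boundary = binascii.hexlify(os.urandom(16)).decode("ascii")
    body = (
        "".join(
            '--%s\r\nContent-Disposition: form-data; name="%s"\r\n\r\n%s\r\n'
            % (boundary, field, value)
            for field, value in fields.items()
        )
        + "--%s--\r\n" % boundary
    )
    return body, "multipart/form-data; boundary=%s" % boundary

body, content_type = encode_multipart_formdata({"foo": "bar", "name": "jd"})

# The body goes in the request payload; the content type, which carries
# the boundary, goes in the Content-Type header.
req = urllib.request.Request(
    "https://example.com/form",  # placeholder URL, nothing is sent here
    data=body.encode("utf-8"),
    headers={"Content-Type": content_type},
)
```

Calling `urllib.request.urlopen(req)` would then submit the form.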
&lt;p&gt;You can use the returned content type in the &lt;code&gt;Content-Type&lt;/code&gt; header of your HTTP request. Note that this format is not only used for forms: it is also used by emails.&lt;/p&gt;
&lt;p&gt;Emails did you say?&lt;/p&gt;
&lt;h2&gt;Encoding with &lt;code&gt;email&lt;/code&gt;&lt;/h2&gt;
&lt;p&gt;Right, emails are usually encoded using MIME, which is defined by yet another RFC, &lt;a href=&quot;https://tools.ietf.org/html/rfc2046&quot;&gt;RFC2046&lt;/a&gt;. It turns out that &lt;code&gt;multipart/form-data&lt;/code&gt; is just a particular MIME format, and that if you have code that implements MIME handling, it&apos;s easy to use it to implement this format.&lt;/p&gt;
&lt;p&gt;Fortunately for us, the Python standard library comes with a module that handles exactly that: &lt;code&gt;email.mime&lt;/code&gt;. I told you MIME was heavily used by emails; I guess that&apos;s why they put that code in the &lt;code&gt;email&lt;/code&gt; subpackage.&lt;/p&gt;
&lt;p&gt;Here&apos;s a piece of code that handles &lt;code&gt;multipart/form-data&lt;/code&gt; in a few lines of code:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;from email import message
from email.mime import multipart
from email.mime import nonmultipart
from email.mime import text

class MIMEFormdata(nonmultipart.MIMENonMultipart):
    def __init__(self, keyname, *args, **kwargs):
        super(MIMEFormdata, self).__init__(*args, **kwargs)
        self.add_header(
            &quot;Content-Disposition&quot;, &quot;form-data; name=\&quot;%s\&quot;&quot; % keyname)

def encode_multipart_formdata(fields):
    m = multipart.MIMEMultipart(&quot;form-data&quot;)

    for field, value in fields.items():
        data = MIMEFormdata(field, &quot;text&quot;, &quot;plain&quot;)
        data.set_payload(value)
        m.attach(data)

    return m
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Using this piece of code returns the following:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;Content-Type: multipart/form-data; boundary=&quot;===============1107021068307284864==&quot;
MIME-Version: 1.0

--===============1107021068307284864==
Content-Type: text/plain
MIME-Version: 1.0
Content-Disposition: form-data; name=&quot;foo&quot;

bar
--===============1107021068307284864==
Content-Type: text/plain
MIME-Version: 1.0
Content-Disposition: form-data; name=&quot;name&quot;

jd
--===============1107021068307284864==--
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This method has several advantages over our first implementation:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;It handles &lt;code&gt;Content-Type&lt;/code&gt; for each of the added MIME parts. We could add data types other than &lt;code&gt;text/plain&lt;/code&gt;, which the first version implicitly assumed. We could also specify the charset (encoding) of the textual data.&lt;/li&gt;
&lt;li&gt;It&apos;s very likely more robust, as it leverages the widely tested Python standard library.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The main downside here is that the &lt;code&gt;Content-Type&lt;/code&gt; header is included with the content. When handling HTTP, this is problematic: that header needs to be sent as part of the HTTP headers and not as part of the payload.&lt;/p&gt;
&lt;p&gt;It should be possible to build a custom generator from &lt;code&gt;email.generator&lt;/code&gt; that does this. I&apos;ll leave that as an exercise for you, reader.&lt;/p&gt;
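That said, a simpler approach than a custom generator also works, as a hedged sketch: flatten the message into a buffer and cut it at the first blank line, which by construction separates the top-level headers from the multipart body. The helper name `split_headers_from_body` is mine, not part of the standard library:

```python
import io
from email import generator
from email.mime import multipart, text

def split_headers_from_body(msg):
    # Flatten the whole MIME message, then cut it at the first blank
    # line: before it are the top-level headers (which belong in the
    # HTTP header section), after it is the multipart body.
    buf = io.StringIO()
    generator.Generator(buf, mangle_from_=False).flatten(msg)
    _headers, _, body = buf.getvalue().partition("\n\n")
    # Flattening also generated the boundary, so the Content-Type
    # header is now complete and can be read back from the message.
    return msg["Content-Type"], body

# Build a one-field form-data message to demonstrate.
m = multipart.MIMEMultipart("form-data")
part = text.MIMEText("bar")
part.add_header("Content-Disposition", 'form-data; name="foo"')
m.attach(part)

content_type, body = split_headers_from_body(m)
```

The returned `content_type` then goes into the HTTP headers, while `body` goes into the payload.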
&lt;h2&gt;Decoding&lt;/h2&gt;
&lt;p&gt;We must be able to use that same &lt;code&gt;email&lt;/code&gt; package to decode our encoded data, right? It turns out that&apos;s the case, with a piece of code that looks like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import email.parser

msg = email.parser.BytesParser().parsebytes(my_multipart_data)

print({
    part.get_param(&apos;name&apos;, header=&apos;content-disposition&apos;): part.get_payload(decode=True)
    for part in msg.get_payload()
})
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;With the example data above, this returns:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;{&apos;foo&apos;: b&apos;bar&apos;, &apos;name&apos;: b&apos;jd&apos;}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Amazing, right?&lt;/p&gt;
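One practical wrinkle: in an HTTP server, the `Content-Type` header, which carries the boundary, usually arrives separately from the body. A hedged sketch of gluing the two back together before parsing; `decode_form_data` is an illustrative name, not a stdlib function:

```python
import email.parser

def decode_form_data(content_type, body):
    # The email parser expects headers and body in one blob, so
    # re-attach the Content-Type header in front of the payload.
    raw = b"Content-Type: " + content_type + b"\r\n\r\n" + body
    msg = email.parser.BytesParser().parsebytes(raw)
    return {
        part.get_param("name", header="content-disposition"):
            part.get_payload(decode=True)
        for part in msg.get_payload()
    }

body = (
    b'--boundary42\r\n'
    b'Content-Disposition: form-data; name="foo"\r\n'
    b'\r\n'
    b'bar\r\n'
    b'--boundary42--\r\n'
)
result = decode_form_data(b"multipart/form-data; boundary=boundary42", body)
print(result)  # {'foo': b'bar'}
```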
&lt;p&gt;The moral of this story is that you should never underestimate the power of the standard library. While it&apos;s easy to add a single line in your list of dependencies, it&apos;s not always required if you dig a bit into what Python provides for you!&lt;/p&gt;
&lt;/content:encoded&gt;&lt;/item&gt;&lt;item&gt;&lt;title&gt;Advanced Functional Programming in Python: lambda&lt;/title&gt;&lt;link&gt;https://julien.danjou.info/blog/python-functional-programming-lambda/&lt;/link&gt;&lt;guid isPermaLink=&quot;true&quot;&gt;https://julien.danjou.info/blog/python-functional-programming-lambda/&lt;/guid&gt;&lt;description&gt;A few weeks ago, I introduced you to functional programming in Python. Today, I&apos;d like to go further into this topic and show you some more interesting features.&lt;/description&gt;&lt;pubDate&gt;Mon, 03 Jun 2019 00:00:00 GMT&lt;/pubDate&gt;&lt;content:encoded&gt;&lt;p&gt;A few weeks ago, I introduced you to &lt;a href=&quot;https://julien.danjou.info/blog/python-and-functional-programming&quot;&gt;functional programming in Python&lt;/a&gt;. Today, I&apos;d like to go further into this topic and show you some more interesting features.&lt;/p&gt;
&lt;h2&gt;Lambda Functions&lt;/h2&gt;
&lt;p&gt;What do we call lambda functions? They are, in essence, anonymous functions. To create one, you use the &lt;code&gt;lambda&lt;/code&gt; keyword:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&amp;gt;&amp;gt;&amp;gt; lambda x: x
&amp;lt;function &amp;lt;lambda&amp;gt; at 0x102e23620&amp;gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In Python, lambda functions are quite limited. They can take any number of arguments; however, they can contain only a single expression and must be written on one line.&lt;/p&gt;
&lt;p&gt;They are mostly useful for passing to higher-order functions, such as &lt;code&gt;map()&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&amp;gt;&amp;gt;&amp;gt; list(map(lambda x: x * 2, range(10)))
[0, 2, 4, 6, 8, 10, 12, 14, 16, 18]
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This will apply the anonymous function &lt;code&gt;lambda x: x * 2&lt;/code&gt; to every item returned by &lt;code&gt;range(10)&lt;/code&gt;.&lt;/p&gt;
&lt;h2&gt;functools.partial&lt;/h2&gt;
&lt;p&gt;Since lambda functions are limited to a single line, they are often used to &lt;em&gt;specialize&lt;/em&gt; a more general existing function:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;def between(number, min=0, max=1000):
    return max &amp;gt; number &amp;gt; min

## Only returns number between 10 and 1000
filter(lambda x: between(x, min=10), range(10000))
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Our lambda is ultimately just a wrapper around the &lt;code&gt;between&lt;/code&gt; function with one of its arguments already set. What if we had a better way to write that, without the various lambda limitations? That&apos;s where &lt;code&gt;functools.partial&lt;/code&gt; comes in handy.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import functools
def between(number, min=0, max=1000):
    return max &amp;gt; number &amp;gt; min

## Specialized version of between() with min already set to 10
atleast_10_and_upto = functools.partial(between, min=10)
## Returns numbers between 10 and 1000
filter(atleast_10_and_upto, range(10000))

## Returns numbers between 10 and 20
filter(lambda x: atleast_10_and_upto(x, max=20), range(10000))
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The &lt;code&gt;functools.partial&lt;/code&gt; function returns a specialized version of the &lt;code&gt;between&lt;/code&gt; function where &lt;code&gt;min&lt;/code&gt; is already set. We can store it in a variable and reuse it as much as we want. We can still pass it a &lt;code&gt;max&lt;/code&gt; argument, as shown in the second part — using a &lt;code&gt;lambda&lt;/code&gt;! You can mix and match those two however seems clearer to you.&lt;/p&gt;
&lt;h2&gt;Common lambda&lt;/h2&gt;
&lt;p&gt;There is a type of lambda function that is pretty common: the attribute or item getter. They are typically used as the &lt;code&gt;key&lt;/code&gt; function for sorting or filtering.&lt;/p&gt;
&lt;p&gt;Here&apos;s a list of 200 tuples containing two integers &lt;code&gt;(i1, i2)&lt;/code&gt;. If you want to use only &lt;code&gt;i2&lt;/code&gt; as the sorting key, you would write:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;mylist = list(zip(range(40, 240), range(-100, 100)))

sorted(mylist, key=lambda i: i[1])
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Which works fine, but makes you use &lt;code&gt;lambda&lt;/code&gt;. You could instead use the &lt;code&gt;operator&lt;/code&gt; module:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import operator

mylist = list(zip(range(40, 240), range(-100, 100)))

sorted(mylist, key=operator.itemgetter(1))
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This does the same thing, except it avoids using &lt;code&gt;lambda&lt;/code&gt; altogether. Cherry-on-the-cake: it is actually 10% faster on my laptop.&lt;/p&gt;
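The `operator` module also covers the attribute side mentioned above: `operator.attrgetter` is the counterpart of `itemgetter` for attribute access. A small sketch with a made-up `Point` type:

```python
import operator
from collections import namedtuple

Point = namedtuple("Point", ["x", "y"])
points = [Point(3, -2), Point(1, 7), Point(2, 0)]

# Sort by the y attribute without writing `lambda p: p.y`.
by_y = sorted(points, key=operator.attrgetter("y"))
print(by_y)  # [Point(x=3, y=-2), Point(x=2, y=0), Point(x=1, y=7)]
```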
&lt;p&gt;I hope that&apos;ll make you write more functional code!&lt;/p&gt;
</content:encoded></item><item><title>An Introduction to Functional Programming with Python</title><link>https://julien.danjou.info/blog/python-and-functional-programming/</link><guid isPermaLink="true">https://julien.danjou.info/blog/python-and-functional-programming/</guid><description>Many Python developers are unaware of the extent to which you can use functional programming in Python, which is a shame: with few exceptions, functional programming allows you to write more concise a</description><pubDate>Mon, 06 May 2019 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Many Python developers are unaware of the extent to which you can use functional programming in Python, which is a shame: with few exceptions, functional programming allows you to write more concise and efficient code. Moreover, Python’s support for functional programming is extensive.&lt;/p&gt;
&lt;p&gt;Here I&apos;d like to talk a bit about how you can actually have a functional approach to programming with our favorite language.&lt;/p&gt;
&lt;h2&gt;Pure Functions&lt;/h2&gt;
&lt;p&gt;When you write code using a functional style, your functions are designed to have no side effects: instead, they take an input and produce an output without keeping state or modifying anything not reflected in the return value. Functions that follow this ideal are referred to as purely functional.&lt;/p&gt;
&lt;p&gt;Let’s start with an example of a regular, non-pure function that removes the last item in a list:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;def remove_last_item(mylist):
    &quot;&quot;&quot;Removes the last item from a list.&quot;&quot;&quot;
    mylist.pop(-1)  # This modifies mylist
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This function is not pure: it has a side effect as it modifies the argument it is given. Let&apos;s rewrite it as purely functional:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;def butlast(mylist):
    &quot;&quot;&quot;Like butlast in Lisp; returns the list without the last element.&quot;&quot;&quot;
    return mylist[:-1]  # This returns a copy of mylist
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We define a &lt;code&gt;butlast()&lt;/code&gt; function (like &lt;code&gt;butlast&lt;/code&gt; in Lisp) that returns the list without its last element, without modifying the original. Instead, it returns a modified copy of the list, allowing us to keep the original intact. The practical advantages of using functional programming include the following:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;em&gt;Modularity.&lt;/em&gt; Writing with a functional style forces a certain degree of&lt;br /&gt;
separation in solving your individual problems and makes sections of code&lt;br /&gt;
easier to reuse in other contexts. Since the function does not depend on any&lt;br /&gt;
external variable or state, calling it from a different piece of code is&lt;br /&gt;
straightforward.&lt;/li&gt;
&lt;li&gt;&lt;em&gt;Brevity.&lt;/em&gt; Functional programming is often less verbose than other paradigms.&lt;/li&gt;
&lt;li&gt;&lt;em&gt;Concurrency.&lt;/em&gt; Purely functional functions are thread-safe and can run&lt;br /&gt;
concurrently. Some functional languages do this automatically, which can be&lt;br /&gt;
a big help if you ever need to scale your application, though this is not&lt;br /&gt;
quite the case yet in Python.&lt;/li&gt;
&lt;li&gt;&lt;em&gt;Testability.&lt;/em&gt; Testing a functional program is incredibly easy: all you need&lt;br /&gt;
is a set of inputs and an expected set of outputs. They are idempotent,&lt;br /&gt;
meaning that calling the same function over and over with the same arguments&lt;br /&gt;
will always return the same result.&lt;/li&gt;
&lt;/ul&gt;
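To make the testability point concrete, here is what testing the pure `butlast()` defined above looks like: plain inputs and expected outputs, plus a check that the argument is left alone.

```python
def butlast(mylist):
    """Like butlast in Lisp; returns the list without the last element."""
    return mylist[:-1]

original = [1, 2, 3]
assert butlast(original) == [1, 2]
assert butlast([]) == []       # works on the empty list too
assert original == [1, 2, 3]   # the input was not modified
```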
&lt;p&gt;Note that concepts such as &lt;a href=&quot;https://docs.python.org/3/tutorial/datastructures.html#list-comprehensions&quot;&gt;list comprehension&lt;/a&gt; in Python are already functional in their approach, as they are designed to avoid side effects. We&apos;ll see in the following that some of the functional functions Python provides can actually be expressed as list comprehensions!&lt;/p&gt;
&lt;h2&gt;Python Functional Functions&lt;/h2&gt;
&lt;p&gt;You might repeatedly encounter the same set of problems when manipulating data with functional programming. To help you deal with them efficiently, Python includes a number of functions for functional programming. Here’s a quick overview of some of these built-in functions, which allow you to build fully functional programs. Once you have an idea of what’s available, I encourage you to research further and try out these functions where they might apply in your own code.&lt;/p&gt;
&lt;h3&gt;Applying Functions to Items with &lt;code&gt;map&lt;/code&gt;&lt;/h3&gt;
&lt;p&gt;The &lt;code&gt;map()&lt;/code&gt; function takes the form &lt;code&gt;map(function, iterable)&lt;/code&gt; and applies &lt;code&gt;function&lt;/code&gt; to each item in &lt;code&gt;iterable&lt;/code&gt; to return an iterable map object:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&amp;gt;&amp;gt;&amp;gt; map(lambda x: x + &quot;bzz!&quot;, [&quot;I think&quot;, &quot;I&apos;m good&quot;])
&amp;lt;map object at 0x7fe7101abdd0&amp;gt;
&amp;gt;&amp;gt;&amp;gt; list(map(lambda x: x + &quot;bzz!&quot;, [&quot;I think&quot;, &quot;I&apos;m good&quot;]))
[&apos;I thinkbzz!&apos;, &quot;I&apos;m goodbzz!&quot;]
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;You could also write an equivalent of &lt;code&gt;map()&lt;/code&gt; using list comprehension, which&lt;br /&gt;
would look like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&amp;gt;&amp;gt;&amp;gt; (x + &quot;bzz!&quot; for x in [&quot;I think&quot;, &quot;I&apos;m good&quot;])
&amp;lt;generator object &amp;lt;genexpr&amp;gt; at 0x7f9a0d697dc0&amp;gt;
&amp;gt;&amp;gt;&amp;gt; [x + &quot;bzz!&quot; for x in [&quot;I think&quot;, &quot;I&apos;m good&quot;]]
[&apos;I thinkbzz!&apos;, &quot;I&apos;m goodbzz!&quot;]
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Filtering Lists with &lt;code&gt;filter&lt;/code&gt;&lt;/h3&gt;
&lt;p&gt;The &lt;code&gt;filter()&lt;/code&gt; function takes the form &lt;code&gt;filter(function or None, iterable)&lt;/code&gt; and filters the items in &lt;code&gt;iterable&lt;/code&gt; based on the result returned by &lt;code&gt;function&lt;/code&gt;. It returns an iterable &lt;code&gt;filter&lt;/code&gt; object:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&amp;gt;&amp;gt;&amp;gt; filter(lambda x: x.startswith(&quot;I &quot;), [&quot;I think&quot;, &quot;I&apos;m good&quot;])
&amp;lt;filter object at 0x7f9a0d636dd0&amp;gt;
&amp;gt;&amp;gt;&amp;gt; list(filter(lambda x: x.startswith(&quot;I &quot;), [&quot;I think&quot;, &quot;I&apos;m good&quot;]))
[&apos;I think&apos;]
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;You could also write an equivalent of &lt;code&gt;filter()&lt;/code&gt; using list comprehension, like&lt;br /&gt;
so:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&amp;gt;&amp;gt;&amp;gt; (x for x in [&quot;I think&quot;, &quot;I&apos;m good&quot;] if x.startswith(&quot;I &quot;))
&amp;lt;generator object &amp;lt;genexpr&amp;gt; at 0x7f9a0d697dc0&amp;gt;
&amp;gt;&amp;gt;&amp;gt; [x for x in [&quot;I think&quot;, &quot;I&apos;m good&quot;] if x.startswith(&quot;I &quot;)]
[&apos;I think&apos;]
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Getting Indexes with &lt;code&gt;enumerate&lt;/code&gt;&lt;/h3&gt;
&lt;p&gt;The &lt;code&gt;enumerate()&lt;/code&gt; function takes the form &lt;code&gt;enumerate(iterable[, start])&lt;/code&gt; and returns an iterable object that provides a sequence of tuples, each consisting of an integer index (starting with &lt;code&gt;start&lt;/code&gt;, if provided) and the corresponding item in &lt;code&gt;iterable&lt;/code&gt;. This function is useful when you need to write code that refers to array indexes. For example, instead of writing this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;i = 0
while i &amp;lt; len(mylist):
    print(&quot;Item %d: %s&quot; % (i, mylist[i]))
    i += 1
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;You could accomplish the same thing more efficiently with &lt;code&gt;enumerate()&lt;/code&gt;, like so:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;for i, item in enumerate(mylist):
    print(&quot;Item %d: %s&quot; % (i, item))
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Sorting a List with &lt;code&gt;sorted&lt;/code&gt;&lt;/h3&gt;
&lt;p&gt;The &lt;code&gt;sorted()&lt;/code&gt; function takes the form &lt;code&gt;sorted(iterable, key=None, reverse=False)&lt;/code&gt; and returns a sorted version of &lt;code&gt;iterable&lt;/code&gt;. The key argument allows you to provide a function that returns the value to sort on:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&amp;gt;&amp;gt;&amp;gt; sorted([(&quot;a&quot;, 2), (&quot;c&quot;, 1), (&quot;d&quot;, 4)])
[(&apos;a&apos;, 2), (&apos;c&apos;, 1), (&apos;d&apos;, 4)]
&amp;gt;&amp;gt;&amp;gt; sorted([(&quot;a&quot;, 2), (&quot;c&quot;, 1), (&quot;d&quot;, 4)], key=lambda x: x[1])
[(&apos;c&apos;, 1), (&apos;a&apos;, 2), (&apos;d&apos;, 4)]
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Finding Items That Satisfy Conditions with &lt;code&gt;any&lt;/code&gt; and &lt;code&gt;all&lt;/code&gt;&lt;/h3&gt;
&lt;p&gt;The &lt;code&gt;any(iterable)&lt;/code&gt; and &lt;code&gt;all(iterable)&lt;/code&gt; functions both return a Boolean depending on the values returned by &lt;code&gt;iterable&lt;/code&gt;. These simple functions are equivalent to the following full Python code:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;def all(iterable):
    for x in iterable:
        if not x:
            return False
    return True

def any(iterable):
    for x in iterable:
        if x:
            return True
    return False
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;These functions are useful for checking whether any or all of the values in an iterable satisfy a given condition. For example, the following checks a list for two conditions:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;mylist = [0, 1, 3, -1]
if all(map(lambda x: x &amp;gt; 0, mylist)):
    print(&quot;All items are greater than 0&quot;)
if any(map(lambda x: x &amp;gt; 0, mylist)):
    print(&quot;At least one item is greater than 0&quot;)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The key difference here, as you can see, is that &lt;code&gt;any()&lt;/code&gt; returns &lt;code&gt;True&lt;/code&gt; when at least one element meets the condition, while &lt;code&gt;all()&lt;/code&gt; returns &lt;code&gt;True&lt;/code&gt; only if every element meets the condition. The &lt;code&gt;all()&lt;/code&gt; function will also return &lt;code&gt;True&lt;/code&gt; for an empty iterable, since none of the elements is &lt;code&gt;False&lt;/code&gt;.&lt;/p&gt;
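As with `map()` and `filter()` earlier, these calls can also be written with generator expressions, avoiding the lambda entirely:

```python
mylist = [0, 1, 3, -1]

# Same two checks as above, expressed as generator expressions.
all_positive = all(x > 0 for x in mylist)   # False: 0 and -1 fail the test
any_positive = any(x > 0 for x in mylist)   # True: 1 and 3 pass it
print(all_positive, any_positive)  # False True

# all() is vacuously true on an empty iterable.
print(all(x > 0 for x in []))  # True
```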
&lt;h3&gt;Combining Lists with &lt;code&gt;zip&lt;/code&gt;&lt;/h3&gt;
&lt;p&gt;The &lt;code&gt;zip()&lt;/code&gt; function takes the form &lt;code&gt;zip(iter1 [,iter2 [...]])&lt;/code&gt; and combines multiple sequences into tuples. This is useful when you need to combine a list of keys and a list of values into a &lt;code&gt;dict&lt;/code&gt;. Like the other functions described here, &lt;code&gt;zip()&lt;/code&gt; returns an iterable. Here we have a list of keys that we map to a list of values to create a dictionary:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&amp;gt;&amp;gt;&amp;gt; keys = [&quot;foobar&quot;, &quot;barzz&quot;, &quot;ba!&quot;]
&amp;gt;&amp;gt;&amp;gt; map(len, keys)
&amp;lt;map object at 0x7fc1686100d0&amp;gt;
&amp;gt;&amp;gt;&amp;gt; zip(keys, map(len, keys))
&amp;lt;zip object at 0x7fc16860d440&amp;gt;
&amp;gt;&amp;gt;&amp;gt; list(zip(keys, map(len, keys)))
[(&apos;foobar&apos;, 6), (&apos;barzz&apos;, 5), (&apos;ba!&apos;, 3)]
&amp;gt;&amp;gt;&amp;gt; dict(zip(keys, map(len, keys)))
{&apos;foobar&apos;: 6, &apos;barzz&apos;: 5, &apos;ba!&apos;: 3}
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;What&apos;s Next?&lt;/h2&gt;
&lt;p&gt;While Python is often advertised as object oriented, it can be used in a very functional manner. A lot of its built-in concepts, such as generators and list comprehension, are functionally oriented and don’t conflict with an object-oriented approach. Python provides a large set of built-in functions that can help you keep your code free of side effects. That also limits the reliance on a program’s global state, for your own good.&lt;/p&gt;
&lt;p&gt;In the next blog post, we&apos;ll see how you can leverage the Python &lt;em&gt;functools&lt;/em&gt; and &lt;em&gt;itertools&lt;/em&gt; modules to enhance your functional adventure. Stay tuned!&lt;/p&gt;
&lt;/content:encoded&gt;&lt;/item&gt;&lt;item&gt;&lt;title&gt;Writing Your Own Filtering DSL in Python&lt;/title&gt;&lt;link&gt;https://julien.danjou.info/blog/writing-your-own-filtering-dsl-in-python/&lt;/link&gt;&lt;guid isPermaLink=&quot;true&quot;&gt;https://julien.danjou.info/blog/writing-your-own-filtering-dsl-in-python/&lt;/guid&gt;&lt;description&gt;A few months ago, we&apos;ve seen how to write a filtering syntax tree in Python. The idea behind this was to create a data structure — in the form of a dictionary — that would allow to filter data based.&lt;/description&gt;&lt;pubDate&gt;Mon, 01 Apr 2019 00:00:00 GMT&lt;/pubDate&gt;&lt;content:encoded&gt;&lt;p&gt;A few months ago, &lt;a href=&quot;https://julien.danjou.info/blog/multi-value-syntax-tree-filtering-in-python&quot;&gt;we saw how to write a filtering syntax tree&lt;/a&gt; in Python. The idea behind this was to create a data structure — in the form of a dictionary — that would allow filtering data based on conditions.&lt;/p&gt;
&lt;p&gt;Our API looked like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&amp;gt;&amp;gt;&amp;gt; f = Filter(
  {&quot;and&quot;: [
    {&quot;eq&quot;: (&quot;foo&quot;, 3)},
    {&quot;gt&quot;: (&quot;bar&quot;, 4)},
   ]
  },
)
&amp;gt;&amp;gt;&amp;gt; f(foo=3, bar=5)
True
&amp;gt;&amp;gt;&amp;gt; f(foo=4, bar=5)
False
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;While such a mechanism is pretty powerful, the input data structure format might not be user friendly. It&apos;s great to use, for example, with a JSON-based REST API, but it&apos;s pretty terrible for a command-line interface.&lt;/p&gt;
&lt;p&gt;A good solution to that problem is to build our own &lt;em&gt;language&lt;/em&gt;. That&apos;s called a DSL.&lt;/p&gt;
&lt;h2&gt;Building a DSL&lt;/h2&gt;
&lt;p&gt;What&apos;s a Domain-Specific Language (DSL)? It&apos;s a computer language that is specialized to a certain domain. In our case, our domain is filtering, as we&apos;re providing a &lt;em&gt;Filter&lt;/em&gt; class that allows filtering a set of values.&lt;/p&gt;
&lt;p&gt;How do you build a data structure such as &lt;code&gt;{&quot;and&quot;: [{&quot;eq&quot;: (&quot;foo&quot;, 3)}, {&quot;gt&quot;: (&quot;bar&quot;, 4)}]}&lt;/code&gt; from a string? Well, you define a language, parse it, and then convert it to the right format.&lt;/p&gt;
&lt;p&gt;In order to parse a language, there are a lot of different solutions, from implementing manual parsers to using regular expressions. In this case, we&apos;ll use &lt;a href=&quot;https://en.wikipedia.org/wiki/Lexical_analysis&quot;&gt;lexical analysis&lt;/a&gt;.&lt;/p&gt;
&lt;h3&gt;First Iteration&lt;/h3&gt;
&lt;p&gt;Let&apos;s start small and define the base of our grammar. That should be something simple, so we&apos;ll go with &lt;code&gt;&amp;lt;identifier&amp;gt;&amp;lt;operator&amp;gt;&amp;lt;value&amp;gt;&lt;/code&gt;. For example, &lt;code&gt;&quot;foobar&quot;=&quot;baz&quot;&lt;/code&gt; is a valid sentence in our grammar and will convert to &lt;code&gt;{&quot;=&quot;: (&quot;foobar&quot;, &quot;baz&quot;)}&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;The following code snippet leverages &lt;a href=&quot;https://pypi.org/project/pyparsing/&quot;&gt;&lt;em&gt;pyparsing&lt;/em&gt;&lt;/a&gt; for parsing the string and specifying the grammar:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import pyparsing

identifier = pyparsing.QuotedString(&apos;&quot;&apos;)
operator = (
    pyparsing.Literal(&quot;=&quot;) |
    pyparsing.Literal(&quot;≠&quot;) |
    pyparsing.Literal(&quot;≥&quot;) |
    pyparsing.Literal(&quot;≤&quot;) |
    pyparsing.Literal(&quot;&amp;lt;&quot;) |
    pyparsing.Literal(&quot;&amp;gt;&quot;)
)
value = pyparsing.QuotedString(&apos;&quot;&apos;)

match_format = identifier + operator + value

print(match_format.parseString(&apos;&quot;foobar&quot;=&quot;123&quot;&apos;))

## Prints:
## [&apos;foobar&apos;, &apos;=&apos;, &apos;123&apos;]
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;With that simple grammar, we can parse the string and get a token list composed of our three items: the identifier, the operator, and the value.&lt;/p&gt;
&lt;h3&gt;Transforming the Data&lt;/h3&gt;
&lt;p&gt;The list above in the format &lt;code&gt;[identifier, operator, value]&lt;/code&gt; is not really what we need in the end. We need something like &lt;code&gt;{operator: (identifier, value)}&lt;/code&gt;. We can leverage &lt;em&gt;pyparsing&lt;/em&gt; API to help us with that.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;def list_to_dict(pos, tokens):
    return {tokens[1]: (tokens[0], tokens[2])}

match_format = (identifier + operator + value).setParseAction(list_to_dict)

print(match_format.parseString(&apos;&quot;foobar&quot;=&quot;123&quot;&apos;))

## Prints:
## [{&apos;=&apos;: (&apos;foobar&apos;, &apos;123&apos;)}]
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The &lt;code&gt;setParseAction&lt;/code&gt; method allows modifying the value returned for a grammar token. In this case, we transform the list into the dict we need.&lt;/p&gt;
&lt;h3&gt;Plugging the Parser and the Filter&lt;/h3&gt;
&lt;p&gt;In the following code, we&apos;ll reuse the &lt;code&gt;Filter&lt;/code&gt; class we wrote in &lt;a href=&quot;https://julien.danjou.info/blog/multi-value-syntax-tree-filtering-in-python&quot;&gt;our previous post&lt;/a&gt;. We&apos;ll just add the following code to our previous example:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;def parse_string(s):
    return match_format.parseString(s, parseAll=True)[0]

f = Filter(parse_string(&apos;&quot;foobar&quot;=&quot;baz&quot;&apos;))
print(f(foobar=&quot;baz&quot;))
print(f(foobar=&quot;biz&quot;))

## Prints:
## True
## False
&lt;/code&gt;&lt;/pre&gt;
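&lt;p&gt;If you don&apos;t have the previous post&apos;s &lt;code&gt;Filter&lt;/code&gt; class at hand, here is a minimal stand-in that makes the snippet above runnable. Note that it only supports the &lt;code&gt;=&lt;/code&gt; operator; the real class from the previous post also handles the other operators and nested trees:&lt;/p&gt;

```python
class Filter:
    # Minimal stand-in for the Filter class from the previous post.
    # Only the "=" operator is implemented here; the real class also
    # supports other comparison operators and nested "and"/"or" trees.
    def __init__(self, tree):
        self.tree = tree

    def __call__(self, **values):
        # The tree is a single {operator: (identifier, value)} dict.
        operator, (identifier, value) = next(iter(self.tree.items()))
        if operator == "=":
            return values.get(identifier) == value
        raise ValueError("unsupported operator: %s" % operator)


f = Filter({"=": ("foobar", "baz")})
print(f(foobar="baz"))  # True
print(f(foobar="biz"))  # False
```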
&lt;p&gt;Now, we have a pretty simple parser and a good way to build a &lt;code&gt;Filter&lt;/code&gt; object from a string.&lt;/p&gt;
&lt;p&gt;As our &lt;em&gt;Filter&lt;/em&gt; object supports complex and nested operations, such as &lt;code&gt;and&lt;/code&gt; and &lt;code&gt;or&lt;/code&gt;, we could also add those to the grammar — I&apos;ll leave that to you, reader, as an exercise!&lt;/p&gt;
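&lt;p&gt;To give you a head start on that exercise, here is one possible sketch using &lt;em&gt;pyparsing&lt;/em&gt;&apos;s &lt;code&gt;infixNotation&lt;/code&gt; helper — a reduced grammar with only the &lt;code&gt;=&lt;/code&gt; operator, to keep it short:&lt;/p&gt;

```python
import pyparsing

identifier = pyparsing.QuotedString('"')
operator = pyparsing.Literal("=")
value = pyparsing.QuotedString('"')

def list_to_dict(pos, tokens):
    return {tokens[1]: (tokens[0], tokens[2])}

criterion = (identifier + operator + value).setParseAction(list_to_dict)

def binary_op(name):
    # tokens[0] is a flat [operand, "and", operand, "and", ...] list;
    # keep only the operands and nest them under the operator name.
    def action(pos, tokens):
        return {name: [t for t in tokens[0] if t != name]}
    return action

expression = pyparsing.infixNotation(
    criterion,
    [
        (pyparsing.Keyword("and"), 2, pyparsing.opAssoc.LEFT, binary_op("and")),
        (pyparsing.Keyword("or"), 2, pyparsing.opAssoc.LEFT, binary_op("or")),
    ],
)

print(expression.parseString('"foo"="1" and "bar"="2"', parseAll=True)[0])
```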
&lt;h3&gt;Building your own Grammar&lt;/h3&gt;
&lt;p&gt;&lt;em&gt;pyparsing&lt;/em&gt; makes it easy to build one&apos;s own grammar. However, it should not be abused: building a DSL means that your users will have to discover and learn it. If it&apos;s very different from what they already know, it might be cumbersome for them.&lt;/p&gt;
&lt;p&gt;Finally, if you&apos;re curious and want to see a real world usage, &lt;a href=&quot;https://doc.mergify.io/conditions.html#grammar&quot;&gt;Mergify condition system&lt;/a&gt; leverages &lt;em&gt;pyparsing&lt;/em&gt; to &lt;a href=&quot;https://github.com/Mergifyio/mergify-engine/blob/master/mergify_engine/rules/parser.py&quot;&gt;implement its parser&lt;/a&gt;. Check it out!&lt;/p&gt;
</content:encoded></item><item><title>Python + Memcached: Efficient Caching in Distributed Applications</title><link>https://julien.danjou.info/blog/python-memcached-efficient-caching-in-distributed-applications/</link><guid isPermaLink="true">https://julien.danjou.info/blog/python-memcached-efficient-caching-in-distributed-applications/</guid><description>When writing Python applications, caching is important. Using a cache to avoid recomputing data or accessing a slow database can provide you with a great performance boost.  Python offers built-in pos</description><pubDate>Mon, 04 Mar 2019 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;When writing Python applications, caching is important. Using a cache to avoid recomputing data or accessing a slow database can provide you with a great performance boost.&lt;/p&gt;
&lt;p&gt;Python offers built-in possibilities for caching, from a simple dictionary to a more complete data structure such as &lt;a href=&quot;https://docs.python.org/3/library/functools.html#functools.lru_cache&quot;&gt;&lt;code&gt;functools.lru_cache&lt;/code&gt;&lt;/a&gt;. The latter can cache any item using a &lt;a href=&quot;https://en.wikipedia.org/wiki/Cache_replacement_policies#Least_Recently_Used_(LRU)&quot;&gt;Least-Recently Used algorithm&lt;/a&gt; to limit the cache size.&lt;/p&gt;
&lt;p&gt;Those data structures are, however, by definition &lt;em&gt;local&lt;/em&gt; to your Python process. When several copies of your application run across a large platform, using an in-memory data structure prevents sharing the cached content. This can be a problem for large-scale and distributed applications.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://files.realpython.com/media/python-memcached.97e1deb2aa17.png&quot; alt=&quot;Python + Memcached System Design Diagram&quot; /&gt;&lt;/p&gt;
&lt;p&gt;Therefore, when a system is distributed across a network, it also needs a cache that is running on the network. Nowadays, there are plenty of network servers that offer caching capability—for example, &lt;a href=&quot;https://redis.io&quot;&gt;Redis&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;As you’re going to see in this tutorial, &lt;a href=&quot;http://memcached.org/&quot;&gt;memcached&lt;/a&gt; is another great option for caching. After a quick introduction to basic memcached usage, you’ll learn about advanced patterns such as “cache and set” and using fallback caches to avoid cold cache performance issues.&lt;/p&gt;
&lt;h3&gt;&lt;strong&gt;Installing memcached&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;&lt;em&gt;Memcached&lt;/em&gt; is &lt;a href=&quot;https://github.com/memcached/memcached/wiki/Install&quot;&gt;available for many platforms&lt;/a&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;If you run &lt;strong&gt;Linux&lt;/strong&gt;, you can install it using &lt;code&gt;apt-get install memcached&lt;/code&gt; or &lt;code&gt;yum install memcached&lt;/code&gt;. This will install memcached from a pre-built package, but you can also build memcached from source, &lt;a href=&quot;https://github.com/memcached/memcached/wiki/Install&quot;&gt;as explained here&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;For &lt;strong&gt;macOS&lt;/strong&gt;, using &lt;a href=&quot;https://brew.sh/&quot;&gt;Homebrew&lt;/a&gt; is the simplest option. Just run &lt;code&gt;brew install memcached&lt;/code&gt; after you’ve installed the Homebrew package manager.&lt;/li&gt;
&lt;li&gt;On &lt;strong&gt;Windows&lt;/strong&gt;, you would have to compile memcached yourself or find &lt;a href=&quot;https://commaster.net/content/installing-memcached-windows&quot;&gt;pre-compiled binaries&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Once installed, &lt;em&gt;memcached&lt;/em&gt; can simply be launched by calling the &lt;code&gt;memcached&lt;/code&gt; command:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;$ memcached
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Before you can interact with memcached from Python-land you’ll need to install a memcached &lt;em&gt;client&lt;/em&gt; library. You’ll see how to do this in the next section, along with some basic cache access operations.&lt;/p&gt;
&lt;h3&gt;&lt;strong&gt;Storing and Retrieving Cached Values Using Python&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;If you have never used &lt;em&gt;memcached&lt;/em&gt;, it is pretty easy to understand. It basically provides a giant network-available dictionary. This dictionary has a few properties that are different from a classical Python dictionary, mainly:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Keys and values have to be bytes&lt;/li&gt;
&lt;li&gt;Keys and values are automatically deleted after an expiration time&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Therefore, the two basic operations for interacting with &lt;em&gt;memcached&lt;/em&gt; are &lt;code&gt;set&lt;/code&gt; and &lt;code&gt;get&lt;/code&gt;. As you might have guessed, they’re used to assign a value to a key or to get a value from a key, respectively.&lt;/p&gt;
&lt;p&gt;My preferred Python library for interacting with &lt;em&gt;memcached&lt;/em&gt; is &lt;a href=&quot;https://pypi.python.org/pypi/pymemcache&quot;&gt;&lt;code&gt;pymemcache&lt;/code&gt;&lt;/a&gt;—I recommend using it. You can simply &lt;a href=&quot;https://realpython.com/learn/python-first-steps/#11-pythons-power-packagesmodules&quot;&gt;install it using pip&lt;/a&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;$ pip install pymemcache
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The following code shows how you can connect to &lt;em&gt;memcached&lt;/em&gt; and use it as a network cache in your Python applications:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&amp;gt;&amp;gt;&amp;gt; from pymemcache.client import base
## Don&apos;t forget to run `memcached&apos; before running this next line:
&amp;gt;&amp;gt;&amp;gt; client = base.Client((&apos;localhost&apos;, 11211))
## Once the client is instantiated, you can access the cache:
&amp;gt;&amp;gt;&amp;gt; client.set(&apos;some_key&apos;, &apos;some value&apos;)
## Retrieve previously set data again:
&amp;gt;&amp;gt;&amp;gt; client.get(&apos;some_key&apos;)
&apos;some value&apos;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The &lt;em&gt;memcached&lt;/em&gt; network protocol is really simple and its implementation extremely fast, which makes it useful to store data that would otherwise be slow to retrieve from the canonical source of data or to compute again.&lt;/p&gt;
&lt;p&gt;While straightforward enough, this example allows storing key/value tuples across the network and accessing them through multiple, distributed, running copies of your application. This is simplistic, yet powerful. And it’s a great first step towards optimizing your application.&lt;/p&gt;
&lt;h3&gt;&lt;strong&gt;Automatically Expiring Cached Data&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;When storing data into &lt;em&gt;memcached&lt;/em&gt;, you can set an expiration time—a maximum number of seconds for &lt;em&gt;memcached&lt;/em&gt; to keep the key and value around. After that delay, &lt;em&gt;memcached&lt;/em&gt; automatically removes the key from its cache.&lt;/p&gt;
&lt;p&gt;What should you set this cache time to? There is no magic number for this delay, and it will entirely depend on the type of data and application that you are working with. It could be a few seconds, or it might be a few hours.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Cache invalidation&lt;/em&gt;, which defines when to remove cached data because it is out of sync with the current data, is also something that your application will have to handle, especially if presenting data that is too old or &lt;em&gt;stale&lt;/em&gt; is to be avoided.&lt;/p&gt;
&lt;p&gt;Here again, there is no magical recipe; it depends on the type of application you are building. However, there are several edge cases that should be handled, which we haven&apos;t yet covered in the example above.&lt;/p&gt;
&lt;p&gt;A caching server cannot grow infinitely—memory is a finite resource. Therefore, keys will be flushed out by the caching server as soon as it needs more space to store other things.&lt;/p&gt;
&lt;p&gt;Some keys might also be expired because they reached their expiration time (also sometimes called the “time-to-live” or TTL.) In those cases the data is lost, and the canonical data source must be queried again.&lt;/p&gt;
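&lt;p&gt;To illustrate what this TTL mechanism does, here is a toy, in-process sketch of an expiring cache. This is purely an illustration of the concept, not how &lt;em&gt;memcached&lt;/em&gt; or &lt;em&gt;pymemcache&lt;/em&gt; is actually implemented:&lt;/p&gt;

```python
import time

class ExpiringCache:
    # Toy illustration of a memcached-style TTL: each key stores its
    # value plus a deadline, and any read past the deadline behaves
    # like a cache miss.
    def __init__(self):
        self._data = {}

    def set(self, key, value, expire):
        self._data[key] = (value, time.monotonic() + expire)

    def get(self, key):
        entry = self._data.get(key)
        if entry is None:
            return None
        value, deadline = entry
        if time.monotonic() >= deadline:
            # The entry reached its time-to-live: drop it.
            del self._data[key]
            return None
        return value

cache = ExpiringCache()
cache.set("some_key", "some value", expire=0.05)
print(cache.get("some_key"))  # some value
time.sleep(0.1)
print(cache.get("some_key"))  # None: the TTL elapsed
```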
&lt;p&gt;This sounds more complicated than it really is. You can generally work with the following pattern when working with &lt;em&gt;memcached&lt;/em&gt; in Python:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;from pymemcache.client import base

def do_some_query():
    # Replace with actual querying code to a database,
    # a remote REST API, etc.
    return 42

## Don&apos;t forget to run `memcached&apos; before running this code
client = base.Client((&apos;localhost&apos;, 11211))
result = client.get(&apos;some_key&apos;)
if result is None:
    # The cache is empty, need to get the value
    # from the canonical source:
    result = do_some_query()
    # Cache the result for next time:
    client.set(&apos;some_key&apos;, result)

## Whether we needed to update the cache or not,
## at this point you can work with the data
## stored in the `result` variable:
print(result)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; Handling missing keys is mandatory because of normal flush-out operations. It is also obligatory to handle the cold cache scenario, i.e. when &lt;em&gt;memcached&lt;/em&gt; has just been started. In that case, the cache will be entirely empty and the cache needs to be fully repopulated, one request at a time.&lt;/p&gt;
&lt;p&gt;This means you should view any cached data as ephemeral. And you should never expect the cache to contain a value you previously wrote to it.&lt;/p&gt;
&lt;h3&gt;&lt;strong&gt;Warming Up a Cold Cache&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;Some of the cold cache scenarios cannot be prevented, for example a &lt;em&gt;memcached&lt;/em&gt; crash. But some can, for example migrating to a new &lt;em&gt;memcached&lt;/em&gt; server.&lt;/p&gt;
&lt;p&gt;When it is possible to predict that a cold cache scenario will happen, it is better to avoid it. A cache that needs to be refilled means that, all of a sudden, the canonical storage of the cached data will be massively hit by all the cache users that lack cached data (also known as the &lt;a href=&quot;https://en.wikipedia.org/wiki/Thundering_herd_problem&quot;&gt;thundering herd problem&lt;/a&gt;).&lt;/p&gt;
&lt;p&gt;&lt;em&gt;pymemcache&lt;/em&gt; provides a class named &lt;code&gt;FallbackClient&lt;/code&gt; that helps in implementing this scenario as demonstrated here:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;from pymemcache.client import base
from pymemcache import fallback

def do_some_query():
    # Replace with actual querying code to a database,
    # a remote REST API, etc.
    return 42
    
## Set `ignore_exc=True` so it is possible to shut down
## the old cache before removing its usage from
## the program, if ever necessary.
old_cache = base.Client((&apos;localhost&apos;, 11211), ignore_exc=True)
new_cache = base.Client((&apos;localhost&apos;, 11212))

client = fallback.FallbackClient((new_cache, old_cache))

result = client.get(&apos;some_key&apos;)

if result is None:
    # The cache is empty, need to get the value
    # from the canonical source:
    result = do_some_query()
    # Cache the result for next time:
    client.set(&apos;some_key&apos;, result)
    print(result)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The &lt;code&gt;FallbackClient&lt;/code&gt; queries the caches passed to its constructor in order. In this case, the new cache server will always be queried first, and in case of a cache miss, the old one will be queried — avoiding a possible round-trip to the primary source of data.&lt;/p&gt;
&lt;p&gt;If any key is set, it will only be set to the new cache. After some time, the old cache can be decommissioned and the &lt;code&gt;FallbackClient&lt;/code&gt; can be replaced directly with the &lt;code&gt;new_cache&lt;/code&gt; client.&lt;/p&gt;
&lt;h3&gt;&lt;strong&gt;Check And Set&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;When communicating with a remote cache, the usual concurrency problem comes back: there might be several clients trying to access the same key at the same time. &lt;em&gt;memcached&lt;/em&gt; provides a &lt;em&gt;check and set&lt;/em&gt; operation, shortened to &lt;em&gt;CAS&lt;/em&gt;, which helps to solve this problem.&lt;/p&gt;
&lt;p&gt;The simplest example is an application that wants to count the number of users it has. Each time a visitor connects, a counter is incremented by 1. Using &lt;em&gt;memcached&lt;/em&gt;, a simple implementation would be:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;def on_visit(client):
    result = client.get(&apos;visitors&apos;)
    if result is None:
        result = 1
    else:
        result += 1
    client.set(&apos;visitors&apos;, result)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;However, what happens if two instances of the application try to update this counter at the same time?&lt;/p&gt;
&lt;p&gt;The first call &lt;code&gt;client.get(&apos;visitors&apos;)&lt;/code&gt; will return the same number of visitors for both of them, let’s say it’s 42. Then both will add 1, compute 43, and set the number of visitors to 43. That number is wrong, and the result should be 44, i.e. 42 + 1 + 1.&lt;/p&gt;
&lt;p&gt;To solve this concurrency issue, the CAS operation of &lt;em&gt;memcached&lt;/em&gt; is handy. The following snippet implements a correct solution:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;def on_visit(client):
    while True:
        result, cas = client.gets(&apos;visitors&apos;)
        if result is None:
            result = 1
        else:
            result += 1
        if client.cas(&apos;visitors&apos;, result, cas):
             break
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The &lt;code&gt;gets&lt;/code&gt; method returns the value, just like the &lt;code&gt;get&lt;/code&gt; method, but it also returns a &lt;em&gt;CAS value&lt;/em&gt;.&lt;/p&gt;
&lt;p&gt;What is in this value is not relevant, but it is used for the next call to the &lt;code&gt;cas&lt;/code&gt; method. This method is equivalent to the &lt;code&gt;set&lt;/code&gt; operation, except that it fails if the value has changed since the &lt;code&gt;gets&lt;/code&gt; operation. In case of success, the loop is broken. Otherwise, the operation is restarted from the beginning.&lt;/p&gt;
&lt;p&gt;In the scenario where two instances of the application try to update the counter at the same time, only one succeeds in moving the counter from 42 to 43. The second instance gets a &lt;code&gt;False&lt;/code&gt; value returned by the &lt;code&gt;client.cas&lt;/code&gt; call and has to retry the loop. It will retrieve 43 as the value this time, increment it to 44, and its &lt;code&gt;cas&lt;/code&gt; call will succeed, thus solving our problem.&lt;/p&gt;
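&lt;p&gt;You can see this retry loop at work without a running &lt;em&gt;memcached&lt;/em&gt; server by using a toy in-memory client that mimics the &lt;code&gt;gets&lt;/code&gt;/&lt;code&gt;cas&lt;/code&gt; semantics. This stand-in is for illustration only; it is not part of &lt;em&gt;pymemcache&lt;/em&gt;:&lt;/p&gt;

```python
import threading

class ToyCasClient:
    # In-memory stand-in for a memcached client, just to exercise the
    # gets/cas retry loop without a running server (not pymemcache).
    def __init__(self):
        self._lock = threading.Lock()
        self._value = None
        self._version = 0

    def gets(self, key):
        with self._lock:
            return self._value, self._version

    def cas(self, key, value, cas):
        with self._lock:
            if cas != self._version:
                return False  # someone else wrote since our gets()
            self._value = value
            self._version += 1
            return True

def on_visit(client):
    while True:
        result, cas = client.gets("visitors")
        if result is None:
            result = 1
        else:
            result += 1
        if client.cas("visitors", result, cas):
            break

client = ToyCasClient()
threads = [threading.Thread(target=on_visit, args=(client,))
           for _ in range(10)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(client.gets("visitors")[0])  # 10: no increment is lost
```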
&lt;p&gt;Incrementing a counter is interesting as an example to explain how CAS works because it is simplistic. However, &lt;em&gt;memcached&lt;/em&gt; also provides the &lt;code&gt;incr&lt;/code&gt; and &lt;code&gt;decr&lt;/code&gt; methods to increment or decrement an integer in a single request, rather than doing multiple &lt;code&gt;gets&lt;/code&gt;/&lt;code&gt;cas&lt;/code&gt; calls. In real-world applications, &lt;code&gt;gets&lt;/code&gt; and &lt;code&gt;cas&lt;/code&gt; are used for more complex data types or operations.&lt;/p&gt;
&lt;p&gt;Most remote caching servers and data stores provide such a mechanism to prevent concurrency issues. It is critical to be aware of those cases to make proper use of their features.&lt;/p&gt;
&lt;h3&gt;&lt;strong&gt;Beyond Caching&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;The simple techniques illustrated in this article showed you how easy it is to leverage &lt;em&gt;memcached&lt;/em&gt; to speed up the performances of your Python application.&lt;/p&gt;
&lt;p&gt;Just by using the two basic “set” and “get” operations you can often accelerate data retrieval or avoid recomputing results over and over again. With &lt;em&gt;memcached&lt;/em&gt; you can share the cache across a large number of distributed nodes.&lt;/p&gt;
&lt;p&gt;Other, more advanced patterns you saw in this tutorial, like the &lt;em&gt;Check And Set (CAS)&lt;/em&gt; operation, allow you to update data stored in the cache concurrently across multiple Python threads or processes while avoiding data corruption.&lt;/p&gt;
&lt;p&gt;If you are interested in learning more about advanced techniques to write faster and more scalable Python applications, check out &lt;a href=&quot;https://scaling-python.com/&quot;&gt;Scaling Python&lt;/a&gt;. It covers many advanced topics such as network distribution, queuing systems, distributed hashing, and code profiling.&lt;/p&gt;
&lt;/content:encoded&gt;&lt;/item&gt;&lt;item&gt;&lt;title&gt;How to Log Properly in Python&lt;/title&gt;&lt;link&gt;https://julien.danjou.info/blog/how-to-log-properly-in-python/&lt;/link&gt;&lt;guid isPermaLink=&quot;true&quot;&gt;https://julien.danjou.info/blog/how-to-log-properly-in-python/&lt;/guid&gt;&lt;description&gt;Logging is one of the most underrated features. Often ignored by software engineers, it can save your time when your application&amp;apos;s running in production.  Most teams don&amp;apos;t think about it until it&amp;apos;s to&lt;/description&gt;&lt;pubDate&gt;Mon, 04 Feb 2019 00:00:00 GMT&lt;/pubDate&gt;&lt;content:encoded&gt;&amp;lt;p&amp;gt;Logging is one of the most underrated features. Often ignored by software engineers, it can save you time when your application&amp;apos;s running in production.&amp;lt;/p&amp;gt;
&lt;p&gt;Most teams don&apos;t think about it until it&apos;s too late in their development process. It&apos;s when things start to go wrong in deployments that somebody realizes logging is missing.&lt;/p&gt;
&lt;h2&gt;Guidelines&lt;/h2&gt;
&lt;p&gt;The &lt;a href=&quot;https://12factor.net&quot;&gt;Twelve-Factor App&lt;/a&gt; &lt;a href=&quot;https://12factor.net/logs&quot;&gt;defines logs&lt;/a&gt; as a &lt;em&gt;stream of aggregated, time-ordered events collected from the output streams of all running processes&lt;/em&gt;. It also describes how applications should handle their logging. We can summarize those guidelines as:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Logs have no fixed beginning or end.&lt;/li&gt;
&lt;li&gt;Print logs to &lt;code&gt;stdout&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Print logs unbuffered.&lt;/li&gt;
&lt;li&gt;The environment is responsible for capturing the stream.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;From my experience, this set of rules is a good trade-off. Logs have to be kept pretty simple to be efficient and reliable. Building complex logging systems might make it harder to get insight into a running application.&lt;/p&gt;
&lt;p&gt;There&apos;s also no point in duplicating effort on log management (e.g., log file rotation, archival policy, etc.) across your different applications. Having an external workflow that can be shared between programs seems more efficient.&lt;/p&gt;
&lt;h2&gt;In Python&lt;/h2&gt;
&lt;p&gt;Python provides a logging subsystem with its &lt;a href=&quot;https://docs.python.org/3/library/logging.html&quot;&gt;&lt;em&gt;logging&lt;/em&gt;&lt;/a&gt; module. This module provides a &lt;em&gt;Logger&lt;/em&gt; object that allows you to emit messages with different levels of criticality. Those messages can then be filtered and sent to different handlers.&lt;/p&gt;
&lt;p&gt;Let&apos;s have an example:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import logging

logger = logging.getLogger(&quot;myapp&quot;)
logger.error(&quot;something wrong&quot;)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Depending on the version of Python you&apos;re running you&apos;ll either see:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;No handlers could be found for logger &quot;myapp&quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;or:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;something wrong
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Python 2 used to have no logging setup by default, so it would print an error message about no handler being found. Since Python 3, a last-resort handler outputting to &lt;code&gt;stderr&lt;/code&gt; is used when nothing is configured — close to the requirements from the 12factor App, though not exactly &lt;code&gt;stdout&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;However, this default setup is far from being perfect.&lt;/p&gt;
&lt;h2&gt;Shortcomings&lt;/h2&gt;
&lt;p&gt;The default format that Python uses does not embed any contextual information. There is no way to know the name of the logger — &lt;code&gt;myapp&lt;/code&gt; in the previous example — nor the date and time of the logged message.&lt;/p&gt;
&lt;p&gt;You &lt;strong&gt;must&lt;/strong&gt; configure Python logging subsystem to enhance its output format.&lt;/p&gt;
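&lt;p&gt;With the standard library alone, that configuration can be as small as a single &lt;code&gt;basicConfig&lt;/code&gt; call — here is a minimal sketch (the exact format string is up to you):&lt;/p&gt;

```python
import logging

# A minimal stdlib-only configuration: add the timestamp, process ID,
# level, and logger name to every message. The %(...)s fields are
# standard attributes of logging.LogRecord.
logging.basicConfig(
    format="%(asctime)s [%(process)d] %(levelname)s %(name)s: %(message)s",
    level=logging.INFO,
)

logging.getLogger("myapp").error("something wrong")
```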
&lt;p&gt;To do that, I advise using the &lt;em&gt;&lt;a href=&quot;https://github.com/jd/daiquiri&quot;&gt;daiquiri&lt;/a&gt;&lt;/em&gt; module. It provides an excellent default configuration and a simple API to configure logging, plus some exciting features.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/12/markus-spiske-109588-unsplash.jpg&quot; alt=&quot;Close-up of code on a screen illustrating logging configuration&quot; /&gt;&lt;/p&gt;
&lt;h2&gt;Logging Setup&lt;/h2&gt;
&lt;p&gt;When using &lt;em&gt;daiquiri&lt;/em&gt;, the first thing to do is to set up your logging correctly. This can be done with the &lt;code&gt;daiquiri.setup&lt;/code&gt; function, like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import daiquiri

daiquiri.setup()
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;As simple as that. You can tweak the setup further by asking it to log to file, to change the default string formats, etc, but just calling &lt;code&gt;daiquiri.setup&lt;/code&gt; is enough to get a proper logging default.&lt;/p&gt;
&lt;p&gt;See:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import daiquiri

daiquiri.setup()
daiquiri.getLogger(&quot;myapp&quot;).error(&quot;something wrong&quot;)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;outputs:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;2018-12-13 10:24:04,373 [38550] ERROR    myapp: something wrong
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;If your terminal supports writing text in colors, the line will be printed in red since it&apos;s an error. The format provided by &lt;em&gt;daiquiri&lt;/em&gt; is better than Python&apos;s default: this one includes a timestamp, the process ID, the criticality level and the logger&apos;s name. Needless to say that this format can also be customized.&lt;/p&gt;
&lt;h2&gt;Passing Contextual Information&lt;/h2&gt;
&lt;p&gt;Logging strings are boring. Most of the time, engineers end up writing code such as:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;logger.error(&quot;Something wrong happened with %s when writing data at %d&quot;, myobject.myfield, myobject.mynumber)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The issue with this approach is that you have to think about each field that you want to log about your object, and to make sure that they are inserted correctly in your sentence. If you forget an essential field to describe your object and the problem, you&apos;re screwed.&lt;/p&gt;
&lt;p&gt;A reliable alternative to this manual crafting of log strings is to pass interesting objects as keyword arguments. &lt;em&gt;Daiquiri&lt;/em&gt; supports it, and it works that way:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import attr
import daiquiri
import requests

daiquiri.setup()
logger = daiquiri.getLogger(&quot;myapp&quot;)

@attr.s
class Request:
    url = attr.ib()
    status_code = attr.ib(init=False, default=None)

    def get(self):
        r = requests.get(self.url)
        self.status_code = r.status_code
        r.raise_for_status()
        return r

user = &quot;jd&quot;
req = Request(&quot;https://google.com/not-this-page&quot;)
try:
    req.get()
except Exception:
    logger.error(&quot;Something wrong happened during the request&quot;,
                 request=req, user=user)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;If anything goes wrong with the request, it will be logged with the stack trace, like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;2018-12-14 10:37:24,586 [43644] ERROR    myapp [request: Request(url=&apos;https://google.com/not-this-page&apos;, status_code=404)] [user: jd]: Something wrong happened during the request
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;As you can see, the call to &lt;code&gt;logger.error&lt;/code&gt; is pretty straightforward: a line that explains what&apos;s wrong, and then the different interesting objects are passed as keyword arguments.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Daiquiri&lt;/em&gt; logs those keyword arguments with a default format of &lt;code&gt;[key: value]&lt;/code&gt; that is included as a prefix to the log string. The value is printed using its &lt;code&gt;__format__&lt;/code&gt; method — that&apos;s why I&apos;m using the &lt;em&gt;&lt;a href=&quot;http://www.attrs.org/en/stable/&quot;&gt;attr&lt;/a&gt;&lt;/em&gt; module here: it automatically generates this method for me and includes all fields by default. You can also customize &lt;em&gt;daiquiri&lt;/em&gt; to use any other format.&lt;/p&gt;
&lt;p&gt;Following those guidelines should be a perfect start for logging correctly with Python!&lt;/p&gt;
</content:encoded></item><item><title>Serious Python released!</title><link>https://julien.danjou.info/blog/serious-python-released/</link><guid isPermaLink="true">https://julien.danjou.info/blog/serious-python-released/</guid><description>Today I&apos;m glad to announce that my new book, Serious Python, has been released.  However, you wonder… what is Serious Python?  Well, Serious Python is the the new name of The Hacker&apos;s Guide to Python</description><pubDate>Thu, 17 Jan 2019 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Today I&apos;m glad to announce that my new book, Serious Python, has been released.&lt;/p&gt;
&lt;p&gt;However, you wonder… what is &lt;em&gt;Serious Python&lt;/em&gt;?&lt;/p&gt;
&lt;p&gt;Well, Serious Python is the new name of &lt;em&gt;The Hacker&apos;s Guide to Python&lt;/em&gt; — the first book I published. Serious Python is the 4th update of that book — but with a brand new name and a new publisher!&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/01/serious-python.png&quot; alt=&quot;Cover of Serious Python&quot; /&gt;&lt;/p&gt;
&lt;p&gt;For more than a year, I&apos;ve been working with the publisher &lt;a href=&quot;https://nostarch.com&quot;&gt;No Starch Press&lt;/a&gt; to enhance this book and bring it to the next level! I&apos;m very proud of what we achieved, and working with a whole team on this book has been a fantastic experience.&lt;/p&gt;
&lt;p&gt;The content has been updated to be ready for 2019: &lt;em&gt;pytest&lt;/em&gt; is now a de-facto standard for testing, so I had to write about it. On the other hand, Python 2 support was less of a focus, and I removed many mentions of Python 2 altogether. Some chapters have been reorganized or regrouped, and others have been enhanced with new content!&lt;/p&gt;
&lt;p&gt;The good news: you can get this new edition of the book with a &lt;strong&gt;15% discount&lt;/strong&gt; for the next 24 hours using the coupon code &lt;strong&gt;SERIOUSPYTHONLAUNCH&lt;/strong&gt; on the &lt;a href=&quot;https://serious-python.com&quot;&gt;book page&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The book is also released as part of the No Starch collection. They are also in charge of distributing the paperback copy of the book. If you want a version of the book that you can touch and hold in your hands, look for it in the &lt;a href=&quot;https://nostarch.com/seriouspython&quot;&gt;No Starch shop&lt;/a&gt;, on &lt;a href=&quot;https://www.amazon.com/gp/product/B074S4G1L5/ref=as_li_tl?ie=UTF8&amp;amp;camp=1789&amp;amp;creative=9325&amp;amp;creativeASIN=B074S4G1L5&amp;amp;linkCode=as2&amp;amp;tag=juliendanjou-20&amp;amp;linkId=2d68dde537d79ba5e334d4291ad37fff&quot;&gt;Amazon&lt;/a&gt;, or in your favorite book shop!&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/01/hackerspython_cover-front_v5.png&quot; alt=&quot;No Starch version of Serious Python cover&quot; /&gt;&lt;/p&gt;
</content:encoded></item><item><title>Why You Should Care That Your SQL DDL is Transactional</title><link>https://julien.danjou.info/blog/why-you-should-care-that-your-sql-ddl-is-transactional/</link><guid isPermaLink="true">https://julien.danjou.info/blog/why-you-should-care-that-your-sql-ddl-is-transactional/</guid><description>I don&apos;t write a lot about database management. How come? I&apos;m a software engineer, and like many of my peers, I leverage databases to store data. I should talk more about this! What made me write this</description><pubDate>Mon, 07 Jan 2019 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;I don&apos;t write a lot about database management. How come? I&apos;m a software engineer, and like many of my peers, I leverage databases to store data. I should talk more about this! What made me write this today is that I&apos;ve discovered that many of my peers wouldn&apos;t be able to understand the title of this post.&lt;/p&gt;
&lt;p&gt;However, I can tell you that once you&apos;ve finished reading this, you&apos;ll thank me!&lt;/p&gt;
&lt;h2&gt;DDL?&lt;/h2&gt;
&lt;p&gt;To understand what this post is about, let&apos;s start with DDL. DDL stands for &lt;em&gt;&lt;a href=&quot;https://en.wikipedia.org/wiki/Data_definition_language&quot;&gt;Data Definition Language&lt;/a&gt;&lt;/em&gt;: a language that lets you define your data structures. A famous one is the SQL DDL — and that&apos;s the one I talk about here.&lt;/p&gt;
&lt;p&gt;I&apos;m sure you already used it if you created a relational database with &lt;code&gt;CREATE TABLE foo (id INTEGER)&lt;/code&gt;. That is a DDL statement.&lt;/p&gt;
&lt;p&gt;In SQL, there are many DDL operations you can perform, such as creating a table, renaming a table, adding or removing a column, converting a column to a new type, etc.&lt;/p&gt;
&lt;p&gt;Those DDL statements are commonly used in two cases:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;When creating your database&apos;s tables for the first time. You issue a bunch of &lt;code&gt;CREATE TABLE&lt;/code&gt; statements, and your database is ready to be used.&lt;/li&gt;
&lt;li&gt;When updating your database by adding, removing or modifying tables or columns. This is typically done when upgrading your application to a new version.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Whether our DDL is transactional often has little impact in practice for case 1. It can still be useful: for example, you could get an error because the disk is full, and having the ability to roll back in that case can be a life saver.&lt;/p&gt;
&lt;p&gt;In our case here, we&apos;ll talk about why you need a transactional DDL when upgrading your database.&lt;/p&gt;
&lt;h3&gt;Transactional You Said?&lt;/h3&gt;
&lt;p&gt;What does transactional mean here? It means that we can issue those DDL statements inside a transaction.&lt;/p&gt;
&lt;p&gt;Wait, what&apos;s a transaction? To make it simple, in a database, a transaction is a group of operations that are treated as a single coherent operation, independently of other transactions. The final operation has to be &lt;strong&gt;a&lt;/strong&gt;tomic, &lt;strong&gt;c&lt;/strong&gt;onsistent, &lt;strong&gt;i&lt;/strong&gt;solated and &lt;strong&gt;d&lt;/strong&gt;urable — hence the &lt;strong&gt;ACID&lt;/strong&gt; property you keep reading about while wondering what it means. The operations composing the transaction are either entirely executed, or not executed at all.&lt;/p&gt;
&lt;p&gt;In our case, having a transactional DDL means one simple thing: the ability to execute several operations (e.g., several &lt;code&gt;ALTER TABLE&lt;/code&gt;) in a single transaction that can be either committed or rolled back.&lt;/p&gt;
&lt;p&gt;Let&apos;s use an example. Here&apos;s a table &lt;code&gt;ingredients&lt;/code&gt; with a &lt;code&gt;name&lt;/code&gt; column created with:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;CREATE TABLE ingredients (
  name text NOT NULL
);
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In this table, there is a list of ingredients in the form of &lt;code&gt;water 20 mL&lt;/code&gt;, &lt;code&gt;flour 300 g&lt;/code&gt;, etc.&lt;/p&gt;
&lt;p&gt;Now, we&apos;re upgrading our application, and we want to handle the quantity of each ingredient in its own columns to make the data easier to query. Let&apos;s say we&apos;re going to handle quantity and quantity units for our ingredients. We need to add two new columns to our table schema, &lt;code&gt;quantity&lt;/code&gt; and &lt;code&gt;unit&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;ALTER TABLE ingredients ADD COLUMN quantity integer NOT NULL;
ALTER TABLE ingredients ADD COLUMN unit text NOT NULL;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We also need to convert the &lt;code&gt;name&lt;/code&gt; column by splitting its &lt;code&gt;&amp;lt;name&amp;gt; &amp;lt;quantity&amp;gt; &amp;lt;unit&amp;gt;&lt;/code&gt; content and inserting the pieces into the new columns. We can do it like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;UPDATE ingredients SET name=split_part(name, &apos; &apos;, 1), quantity=split_part(name, &apos; &apos;, 2)::int, unit=split_part(name, &apos; &apos;, 3);
&lt;/code&gt;&lt;/pre&gt;
&lt;blockquote&gt;
&lt;p&gt;In this example, I&apos;m using &lt;a href=&quot;https://www.postgresql.org/docs/11/static/functions-string.html&quot;&gt;the &lt;code&gt;split_part&lt;/code&gt; function from PostgreSQL&lt;/a&gt; to split the string.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;With the &lt;code&gt;UPDATE&lt;/code&gt; statement, the &lt;code&gt;name&lt;/code&gt; column containing &lt;code&gt;flour 300 g&lt;/code&gt; now contains &lt;code&gt;flour&lt;/code&gt;, and the columns &lt;code&gt;quantity&lt;/code&gt; and &lt;code&gt;unit&lt;/code&gt; respectively store &lt;code&gt;300&lt;/code&gt; and &lt;code&gt;g&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;When we run our upgrade procedure consisting of those two &lt;code&gt;ALTER TABLE&lt;/code&gt; statements and one &lt;code&gt;UPDATE&lt;/code&gt;, we get our final table:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;## SELECT * FROM ingredients;
 name  │ quantity │ unit
───────┼──────────┼──────
 flour │      300 │ g
(1 row)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Exactly what we want.&lt;/p&gt;
&lt;h2&gt;Ok, So What?&lt;/h2&gt;
&lt;p&gt;In the previous example, everything worked fine. Our 300 grams of flour string is split, converted and stored into the three different columns. However, let&apos;s think about what happens if the conversion fails because our ingredient &lt;code&gt;name&lt;/code&gt; is &lt;code&gt;foobar&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;## ALTER TABLE ingredients ADD COLUMN quantity integer;
ALTER TABLE
## ALTER TABLE ingredients ADD COLUMN unit text;
ALTER TABLE
## UPDATE ingredients SET name=split_part(name, &apos; &apos;, 1), quantity=split_part(name, &apos; &apos;, 2)::int, unit=split_part(name, &apos; &apos;, 3);
ERROR:  invalid input syntax for integer: &quot;&quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Right, so in this case our update failed because it&apos;s impossible to convert an empty string to an integer.&lt;/p&gt;
&lt;p&gt;We&apos;re going to fix this piece of data in our database (manually or automatically, whatever) to make it work, changing &lt;code&gt;foobar&lt;/code&gt; to something like &lt;code&gt;foobar 1 kg&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;Then, when we rerun the upgrade script, this is what happens:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;## ALTER TABLE ingredients ADD COLUMN quantity integer;
ERROR:  column &quot;quantity&quot; of relation &quot;ingredients&quot; already exists
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The upgrade script failed earlier — not in the &lt;code&gt;UPDATE&lt;/code&gt; statement. It has a good reason to fail: the column &lt;code&gt;quantity&lt;/code&gt; already exists.&lt;/p&gt;
&lt;p&gt;Why is that? Well, when we ran the upgrade procedure the first time, we did not run it inside a transaction. Every DDL statement was committed right after its execution. Therefore, the current state of our database is &lt;em&gt;half-migrated&lt;/em&gt;: we have the new schema installed, but the data is not migrated.&lt;/p&gt;
&lt;p&gt;This sucks. This should not happen. Ever.&lt;/p&gt;
&lt;h2&gt;Why?&lt;/h2&gt;
&lt;p&gt;Some database systems (e.g., MySQL) do not support running DDL in a transaction, so you have no choice but to run the three operations (&lt;code&gt;ALTER&lt;/code&gt;, &lt;code&gt;ALTER&lt;/code&gt; and then &lt;code&gt;UPDATE&lt;/code&gt;) as three distinct operations: if any of them fails, there&apos;s no way to recover and get back to the initial state.&lt;/p&gt;
&lt;p&gt;If you&apos;re using a database that supports running DDL statements inside a transaction (e.g., PostgreSQL), you can run your upgrade script like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;postgres=# BEGIN;
postgres=# ALTER TABLE ingredients ADD COLUMN quantity integer;
ALTER TABLE
postgres=# ALTER TABLE ingredients ADD COLUMN unit text;
ALTER TABLE
postgres=# UPDATE ingredients SET name=split_part(name, &apos; &apos;, 1), quantity=split_part(name, &apos; &apos;, 2)::int, unit=split_part(name, &apos; &apos;, 3);
ERROR:  invalid input syntax for integer: &quot;&quot;
postgres=# ROLLBACK;
ROLLBACK
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Since the transaction failed, we ended up doing a &lt;code&gt;ROLLBACK&lt;/code&gt;. When checking the state of the database, we can see the state did not change:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;## \d ingredients;
           Table &quot;public.ingredients&quot;
 Column │ Type │ Collation │ Nullable │ Default
────────┼──────┼───────────┼──────────┼─────────
 name   │ text │           │ not null │
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Therefore, it&apos;s possible to fix our database content and rerun the migration procedure without being in a &lt;em&gt;half-migrated&lt;/em&gt; state.&lt;/p&gt;
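&lt;p&gt;The same safety net is easy to demonstrate from application code. Here is a small sketch (not from the original post) using Python&apos;s &lt;code&gt;sqlite3&lt;/code&gt; module, since SQLite also supports transactional DDL; the raised exception stands in for the failing &lt;code&gt;UPDATE&lt;/code&gt; above:&lt;/p&gt;

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.isolation_level = None  # autocommit; we issue BEGIN/ROLLBACK ourselves
conn.execute("CREATE TABLE ingredients (name text NOT NULL)")
conn.execute("INSERT INTO ingredients VALUES ('foobar')")

conn.execute("BEGIN")
try:
    conn.execute("ALTER TABLE ingredients ADD COLUMN quantity integer")
    conn.execute("ALTER TABLE ingredients ADD COLUMN unit text")
    # Simulate the data migration failing on the malformed 'foobar' row
    raise ValueError("cannot split 'foobar' into name, quantity and unit")
except Exception:
    conn.execute("ROLLBACK")
else:
    conn.execute("COMMIT")

# Both ALTER TABLE statements were rolled back: the schema is unchanged.
columns = [row[1] for row in conn.execute("PRAGMA table_info(ingredients)")]
print(columns)
```

&lt;p&gt;Running it prints &lt;code&gt;[&apos;name&apos;]&lt;/code&gt;: both &lt;code&gt;ALTER TABLE&lt;/code&gt; statements were undone by the &lt;code&gt;ROLLBACK&lt;/code&gt;, so the migration can be fixed and rerun from a clean state.&lt;/p&gt;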
&lt;h2&gt;A Database That Lies&lt;/h2&gt;
&lt;p&gt;When you&apos;re giving data to a database, you&apos;re trusting it. It&apos;d be awful if it were lying to you, right? Check this out:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;mysql&amp;gt; CREATE TABLE ingredients (name text NOT NULL);
Query OK, 0 rows affected (0.03 sec)

mysql&amp;gt; BEGIN;
Query OK, 0 rows affected (0.00 sec)

mysql&amp;gt; ALTER TABLE ingredients ADD COLUMN quantity integer;
Query OK, 0 rows affected (0.05 sec)
Records: 0  Duplicates: 0  Warnings: 0

mysql&amp;gt; ALTER TABLE ingredients ADD COLUMN unit text;
Query OK, 0 rows affected (0.01 sec)
Records: 0  Duplicates: 0  Warnings: 0

mysql&amp;gt; ROLLBACK;
Query OK, 0 rows affected (0.00 sec)

mysql&amp;gt; DESC ingredients;
+----------+---------+------+-----+---------+-------+
| Field    | Type    | Null | Key | Default | Extra |
+----------+---------+------+-----+---------+-------+
| name     | text    | NO   |     | NULL    |       |
| quantity | int(11) | YES  |     | NULL    |       |
| unit     | text    | YES  |     | NULL    |       |
+----------+---------+------+-----+---------+-------+
3 rows in set (0.00 sec)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In the output above, you can see that we issued two DDL statements inside a transaction and then rolled back that transaction. MySQL did not output any error at any time, making us think that it did not alter our table. However, when checking the schema of the database, we can see that nothing has been rolled back. Not only does MySQL not support transactional DDL, it also fails you entirely and lies about what it&apos;s doing.&lt;/p&gt;
&lt;h2&gt;How Important is That?&lt;/h2&gt;
&lt;p&gt;Transactional DDL is a feature often ignored by software engineers, even though it&apos;s a key feature for managing your database life cycle.&lt;/p&gt;
&lt;p&gt;I&apos;m writing this post today because I&apos;ve been hit by this multiple times over the last few years. OpenStack chose years ago to go with MySQL, and as a consequence, every database upgrade script that fails in the middle of the procedure leaves the database in an inconsistent state. In that case, it means one of the following:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;The operator has to determine where the upgrade script stopped, roll back the upgrade manually, fix the failure, and rerun the upgrade procedure.&lt;/li&gt;
&lt;li&gt;The developer must anticipate every possible upgrade failure, write a rollback procedure for each of those cases, and test every one of them.&lt;/li&gt;
&lt;li&gt;Use a database system that handles transactional DDL.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;No need to tell you that option 3. is the best, option 2. is barely possible to implement, and option 1. is what reality looks like. In &lt;a href=&quot;https://gnocchi.xyz&quot;&gt;Gnocchi&lt;/a&gt;, we picked option 3. by recommending that operators use PostgreSQL.&lt;/p&gt;
&lt;p&gt;Next time you use a database, think carefully about what your upgrade procedure will look like!&lt;/p&gt;
</content:encoded></item><item><title>Podcast.__init__: Gnocchi, a Time Series Database for your Metrics</title><link>https://julien.danjou.info/blog/podcast-init-gnocchi/</link><guid isPermaLink="true">https://julien.danjou.info/blog/podcast-init-gnocchi/</guid><description>A few weeks ago, Tobias Macey contacted me as he wanted to talk about Gnocchi, the time series database I&apos;ve been working on for the last few years.  It was a great opportunity to talk about the proje</description><pubDate>Tue, 11 Dec 2018 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;A few weeks ago, Tobias Macey contacted me as he wanted to talk about Gnocchi, the time series database I&apos;ve been working on for the last few years.&lt;/p&gt;
&lt;p&gt;It was a great opportunity to talk about the project, so I jumped on it! We talk about how Gnocchi came to life, how we built its architecture, the challenges we met, what kind of trade-off we made, etc.&lt;/p&gt;
&lt;p&gt;You can listen to this episode &lt;a href=&quot;https://www.podcastinit.com/gnocchi-with-julien-danjou-episode-189/&quot;&gt;here&lt;/a&gt;.&lt;/p&gt;
</content:encoded></item><item><title>A multi-value syntax tree filtering in Python</title><link>https://julien.danjou.info/blog/multi-value-syntax-tree-filtering-in-python/</link><guid isPermaLink="true">https://julien.danjou.info/blog/multi-value-syntax-tree-filtering-in-python/</guid><description>A while ago, we&apos;ve seen how to write a simple filtering syntax tree with Python. The idea was to provide a small abstract syntax tree with an easy to write data structure that would be able to filter.</description><pubDate>Mon, 03 Dec 2018 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;A while ago, we&apos;ve seen &lt;a href=&quot;https://julien.danjou.info/blog/simple-filtering-syntax-tree-in-python&quot;&gt;how to write a simple filtering syntax tree with Python&lt;/a&gt;. The idea was to provide a small abstract syntax tree with an easy to write data structure that would be able to filter a value. Filtering meaning that once evaluated, our AST would return either &lt;code&gt;True&lt;/code&gt; or &lt;code&gt;False&lt;/code&gt; based on the passed value.&lt;/p&gt;
&lt;p&gt;With that, we were able to write small rules like &lt;code&gt;Filter({&quot;eq&quot;: 3})(4)&lt;/code&gt; that would return &lt;code&gt;False&lt;/code&gt; since, well, 4 is not equal to 3.&lt;/p&gt;
&lt;p&gt;In this new post, I propose we enhance our filtering ability to support multiple values. The idea is to be able to write something like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&amp;gt;&amp;gt;&amp;gt; f = Filter(
  {&quot;and&quot;: [
    {&quot;eq&quot;: (&quot;foo&quot;, 3)},
    {&quot;gt&quot;: (&quot;bar&quot;, 4)},
   ]
  },
)
&amp;gt;&amp;gt;&amp;gt; f(foo=3, bar=5)
True
&amp;gt;&amp;gt;&amp;gt; f(foo=4, bar=5)
False
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The biggest change here is that the binary operators (&lt;code&gt;eq&lt;/code&gt;, &lt;code&gt;gt&lt;/code&gt;, &lt;code&gt;le&lt;/code&gt;, etc.) now support getting two values, and not only one, and that we can pass multiple values to our filter by using keyword arguments.&lt;/p&gt;
&lt;p&gt;How should we implement that? Well, we can keep the same data structure we built previously. However, this time we&apos;re gonna make the following changes:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The left value of the binary operator will be a string that will be used as the key to access the keyword arguments passed to our &lt;code&gt;Filter.__call__&lt;/code&gt; values.&lt;/li&gt;
&lt;li&gt;The right value of the binary operator will be kept as it is (like before).&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;We therefore need to change our &lt;code&gt;Filter.build_evaluator&lt;/code&gt; to accommodate this, as follows:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;def build_evaluator(self, tree):
    try:
        operator, nodes = list(tree.items())[0]
    except Exception:
        raise InvalidQuery(&quot;Unable to parse tree %s&quot; % tree)
    try:
        op = self.multiple_operators[operator]
    except KeyError:
        try:
            op = self.binary_operators[operator]
        except KeyError:
            raise InvalidQuery(&quot;Unknown operator %s&quot; % operator)
        assert len(nodes) == 2 # binary operators take 2 values
        def _op(values):
            return op(values[nodes[0]], nodes[1])
        return _op
    # Iterate over every item in the list of the value linked
    # to the logical operator, and compile it down to its own
    # evaluator.
    elements = [self.build_evaluator(node) for node in nodes]
    return lambda values: op((e(values) for e in elements))
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The algorithm is pretty much the same, the tree being browsed recursively.&lt;/p&gt;
&lt;p&gt;First, the operator and its arguments (nodes) are extracted.&lt;/p&gt;
&lt;p&gt;Then, if the operator takes multiple arguments (such as the &lt;code&gt;and&lt;/code&gt; and &lt;code&gt;or&lt;/code&gt; operators), each node is recursively compiled and a function evaluating those nodes is returned.&lt;br /&gt;
If the operator is a binary operator (such as &lt;code&gt;eq&lt;/code&gt;, &lt;code&gt;lt&lt;/code&gt;, etc.), it checks that the passed argument list has a length of 2. Then, it returns a function that will apply the operator (e.g., &lt;code&gt;operator.eq&lt;/code&gt;) to &lt;code&gt;values[nodes[0]]&lt;/code&gt; and &lt;code&gt;nodes[1]&lt;/code&gt;: the former accesses the arguments (&lt;code&gt;values&lt;/code&gt;) passed to the filter&apos;s &lt;code&gt;__call__&lt;/code&gt; function, while the latter is used as passed.&lt;/p&gt;
&lt;p&gt;The full class looks like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import operator

class InvalidQuery(Exception):
    pass

class Filter(object):
    binary_operators = {
        u&quot;=&quot;: operator.eq,
        u&quot;==&quot;: operator.eq,
        u&quot;eq&quot;: operator.eq,

        u&quot;&amp;lt;&quot;: operator.lt,
        u&quot;lt&quot;: operator.lt,

        u&quot;&amp;gt;&quot;: operator.gt,
        u&quot;gt&quot;: operator.gt,

        u&quot;&amp;lt;=&quot;: operator.le,
        u&quot;≤&quot;: operator.le,
        u&quot;le&quot;: operator.le,

        u&quot;&amp;gt;=&quot;: operator.ge,
        u&quot;≥&quot;: operator.ge,
        u&quot;ge&quot;: operator.ge,

        u&quot;!=&quot;: operator.ne,
        u&quot;≠&quot;: operator.ne,
        u&quot;ne&quot;: operator.ne,
    }

    multiple_operators = {
        u&quot;or&quot;: any,
        u&quot;∨&quot;: any,
        u&quot;and&quot;: all,
        u&quot;∧&quot;: all,
    }

    def __init__(self, tree):
        self._eval = self.build_evaluator(tree)

    def __call__(self, **kwargs):
        return self._eval(kwargs)

    def build_evaluator(self, tree):
        try:
            operator, nodes = list(tree.items())[0]
        except Exception:
            raise InvalidQuery(&quot;Unable to parse tree %s&quot; % tree)
        try:
            op = self.multiple_operators[operator]
        except KeyError:
            try:
                op = self.binary_operators[operator]
            except KeyError:
                raise InvalidQuery(&quot;Unknown operator %s&quot; % operator)
            assert len(nodes) == 2 # binary operators take 2 values
            def _op(values):
                return op(values[nodes[0]], nodes[1])
            return _op
        # Iterate over every item in the list of the value linked
        # to the logical operator, and compile it down to its own
        # evaluator.
        elements = [self.build_evaluator(node) for node in nodes]
        return lambda values: op((e(values) for e in elements))
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We can check that it works by building some filters:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;x = Filter({&quot;eq&quot;: (&quot;foo&quot;, 1)})
assert x(foo=1)

x = Filter({&quot;eq&quot;: (&quot;foo&quot;, &quot;bar&quot;)})
assert not x(foo=1)

x = Filter({&quot;or&quot;: (
    {&quot;eq&quot;: (&quot;foo&quot;, &quot;bar&quot;)},
    {&quot;eq&quot;: (&quot;bar&quot;, 1)},
)})
assert x(foo=1, bar=1)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Supporting multiple values is handy as it allows passing complete dictionaries to the filter, rather than just one value. That enables users to filter more complex objects.&lt;/p&gt;
&lt;h2&gt;Sub-dictionary support&lt;/h2&gt;
&lt;p&gt;It&apos;s also possible to support deeper data structures, like a dictionary of dictionaries. By replacing &lt;code&gt;values[nodes[0]]&lt;/code&gt; with &lt;code&gt;self._resolve_name(values, nodes[0])&lt;/code&gt; and a &lt;code&gt;_resolve_name&lt;/code&gt; method like this one, the filter is able to traverse dictionaries:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;class Filter(object):
    # ... rest of the class as above ...

    ATTR_SEPARATOR = &quot;.&quot;

    def _resolve_name(self, values, name):
        try:
            for subname in name.split(self.ATTR_SEPARATOR):
                values = values[subname]
            return values
        except KeyError:
            raise InvalidQuery(&quot;Unknown attribute %s&quot; % name)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;It then works like that:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;x = Filter({&quot;eq&quot;: (&quot;baz.sub&quot;, 23)})
assert x(foo=1, bar=1, baz={&quot;sub&quot;: 23})

x = Filter({&quot;eq&quot;: (&quot;baz.sub&quot;, 23)})
assert not x(foo=1, bar=1, baz={&quot;sub&quot;: 3})
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;By using the &lt;code&gt;key.subkey.subsubkey&lt;/code&gt; syntax, the filter is able to access items inside dictionaries in more complex data structures.&lt;/p&gt;
&lt;p&gt;That basic filter engine can evolve quite easily into something powerful, as you can add new operators or new ways to access and manipulate the passed data structure.&lt;/p&gt;
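&lt;p&gt;For instance, adding a new operator only requires extending one of the operator tables. The following sketch is a condensed rewrite of the class above (not the original code) grown with a hypothetical &lt;code&gt;in&lt;/code&gt; binary operator:&lt;/p&gt;

```python
import operator

# A condensed rewrite of the Filter class, extended with a new "in"
# binary operator to show how the engine can grow. The "in" operator
# name is an illustrative choice, not part of the original post.
class InvalidQuery(Exception):
    pass

class Filter(object):
    binary_operators = {
        "eq": operator.eq,
        "ne": operator.ne,
        # New operator: checks membership of the left value in the right one.
        "in": lambda value, container: value in container,
    }

    multiple_operators = {"or": any, "and": all}

    def __init__(self, tree):
        self._eval = self.build_evaluator(tree)

    def __call__(self, **kwargs):
        return self._eval(kwargs)

    def build_evaluator(self, tree):
        try:
            op_name, nodes = list(tree.items())[0]
        except Exception:
            raise InvalidQuery("Unable to parse tree %s" % tree)
        if op_name in self.multiple_operators:
            op = self.multiple_operators[op_name]
            elements = [self.build_evaluator(node) for node in nodes]
            return lambda values: op(e(values) for e in elements)
        try:
            op = self.binary_operators[op_name]
        except KeyError:
            raise InvalidQuery("Unknown operator %s" % op_name)
        if len(nodes) != 2:
            raise InvalidQuery("Binary operators take 2 values")
        return lambda values: op(values[nodes[0]], nodes[1])

x = Filter({"in": ("status", ("new", "open"))})
assert x(status="open")
assert not x(status="closed")
```

&lt;p&gt;The rest of the engine is untouched: the new operator immediately works inside &lt;code&gt;and&lt;/code&gt; and &lt;code&gt;or&lt;/code&gt; trees as well.&lt;/p&gt;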
&lt;p&gt;If you have other ideas on nifty features that could be added, feel free to add a comment below!&lt;/p&gt;
</content:encoded></item><item><title>The Best flake8 Extensions for your Python Project</title><link>https://julien.danjou.info/blog/the-best-flake8-extensions/</link><guid isPermaLink="true">https://julien.danjou.info/blog/the-best-flake8-extensions/</guid><description>In the last blog post about coding style, we dissected what the state of the art was regarding coding style check in Python.</description><pubDate>Mon, 05 Nov 2018 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;In the &lt;a href=&quot;https://julien.danjou.info/blog/code-style-checks-in-python&quot;&gt;last blog post about coding style&lt;/a&gt;, we dissected what the state of the art was regarding coding style check in Python.&lt;/p&gt;
&lt;p&gt;As we&apos;ve seen, Flake8 is a wrapper around several tools and is extensible via plugins, meaning that you can add your own checks. I&apos;m a heavy user of Flake8 and rely on a few plugins to extend its coverage of common programming mistakes in Python. Here&apos;s the list of the ones I can&apos;t work without. As a bonus, you&apos;ll find at the end of this post a sample of my go-to &lt;code&gt;tox.ini&lt;/code&gt; file.&lt;/p&gt;
&lt;h2&gt;flake8-import-order&lt;/h2&gt;
&lt;p&gt;The name is quite explicit: this extension checks the order of your &lt;code&gt;import&lt;/code&gt; statements at the beginning of your files. By default, it uses a style that I enjoy, which looks like:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import os
import sys

import requests

import yaml

import myproject
from myproject.utils import somemodule
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The builtin modules are grouped first. Then comes a group for each imported third-party module. Finally, the last group contains the modules of the current project. I find this way of organizing module imports quite clear and easy to read.&lt;/p&gt;
&lt;p&gt;To make sure flake8-import-order knows your project&apos;s module name, you need to specify it in &lt;code&gt;tox.ini&lt;/code&gt; with the &lt;code&gt;application-import-names&lt;/code&gt; option.&lt;/p&gt;
&lt;p&gt;If you beg to differ, you can use &lt;a href=&quot;https://github.com/PyCQA/flake8-import-order/#styles&quot;&gt;any of the other styles that flake8-import-order offers by default&lt;/a&gt; by setting the &lt;code&gt;import-order-style&lt;/code&gt; option. You can obviously &lt;a href=&quot;https://github.com/PyCQA/flake8-import-order/#extending-styles&quot;&gt;provide your own style&lt;/a&gt;.&lt;/p&gt;
&lt;h2&gt;flake8-blind-except&lt;/h2&gt;
&lt;p&gt;The &lt;a href=&quot;https://github.com/elijahandrews/flake8-blind-except&quot;&gt;flake8-blind-except extension&lt;/a&gt; checks that no &lt;code&gt;except&lt;/code&gt; statement is used without specifying an exception type. The following excerpt is, therefore, considered invalid:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;try:
    do_something()
except:
    pass
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Using &lt;code&gt;except&lt;/code&gt; without any exception type specified is considered bad practice, as it might catch unwanted exceptions. Specifying a type forces the developer to think about what kind of errors might happen and which ones should &lt;em&gt;really&lt;/em&gt; be caught.&lt;/p&gt;
&lt;p&gt;In the rare cases where every exception should be caught, it&apos;s still possible to use &lt;code&gt;except Exception&lt;/code&gt;.&lt;/p&gt;
&lt;h2&gt;flake8-builtins&lt;/h2&gt;
&lt;p&gt;The &lt;a href=&quot;https://github.com/gforcada/flake8-builtins&quot;&gt;flake8-builtins plugin&lt;/a&gt; checks that there is no name collision between your code and the Python builtin variables.&lt;/p&gt;
&lt;p&gt;For example, this code would trigger an error:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;def first(list):
    return list[0]
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;As &lt;code&gt;list&lt;/code&gt; is a builtin in Python (to create a list!), shadowing its definition by using &lt;code&gt;list&lt;/code&gt; as the name of a parameter in a function signature would trigger a warning from flake8-builtins.&lt;/p&gt;
&lt;p&gt;While the code is valid, it&apos;s a bad habit to override Python&apos;s builtin functions. It might lead to tricky errors: in the above example, if you ever need to call &lt;code&gt;list()&lt;/code&gt; inside the function, you won&apos;t be able to.&lt;/p&gt;
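&lt;p&gt;Here is a contrived sketch (mine, not from the plugin documentation) of the tricky error described above:&lt;/p&gt;

```python
def first(list):
    # The parameter shadows the builtin: calling list() here fails.
    try:
        return list(range(3))
    except TypeError:
        # 'list' object is not callable: only the parameter is in scope.
        return list[0]

print(first([9, 8]))
```

&lt;p&gt;flake8-builtins would flag the &lt;code&gt;list&lt;/code&gt; parameter and spare you the surprise.&lt;/p&gt;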
&lt;h2&gt;flake8-logging-format&lt;/h2&gt;
&lt;p&gt;This &lt;a href=&quot;https://github.com/globality-corp/flake8-logging-format&quot;&gt;module&lt;/a&gt; is handy, as it still slaps my fingers once in a while. When using the &lt;code&gt;logging&lt;/code&gt; module, it prevents you from writing this kind of code:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;mylogger.info(&quot;Hello %s&quot; % mystring)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;While this works, it&apos;s suboptimal, as it forces the string interpolation to happen whether or not the message is emitted. If the logger is configured to print only messages with a logging level of warning or above, doing the string interpolation here is pointless.&lt;/p&gt;
&lt;p&gt;Therefore, one should instead write:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;mylogger.info(&quot;Hello %s&quot;, mystring)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Same goes if you use &lt;code&gt;format&lt;/code&gt; to do any formatting.&lt;/p&gt;
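&lt;p&gt;To see the difference, here is a small sketch (the &lt;code&gt;Expensive&lt;/code&gt; class is contrived for illustration) showing that the lazy form never renders the message when the level filters it out:&lt;/p&gt;

```python
import logging

rendered = []

class Expensive:
    # Contrived argument that records whether it was ever formatted.
    def __str__(self):
        rendered.append(True)
        return "expensive"

logger = logging.getLogger("demo")
logger.setLevel(logging.WARNING)

# INFO is below the WARNING threshold: logging drops the call before
# formatting, so str() is never invoked on the argument.
logger.info("value: %s", Expensive())

print(rendered)
```

&lt;p&gt;With the eager &lt;code&gt;%&lt;/code&gt; interpolation, the formatting would have happened up front, before logging could even filter the record.&lt;/p&gt;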
&lt;p&gt;Be aware that contrary to other flake8 modules, this one does not enable the check by default. You&apos;ll need to add &lt;code&gt;enable-extensions=G&lt;/code&gt; in your &lt;code&gt;tox.ini&lt;/code&gt; file.&lt;/p&gt;
&lt;h2&gt;flake8-docstrings&lt;/h2&gt;
&lt;p&gt;The &lt;a href=&quot;https://gitlab.com/pycqa/flake8-docstrings&quot;&gt;flake8-docstrings&lt;/a&gt; module checks that the content of your Python docstrings complies with &lt;a href=&quot;https://www.python.org/dev/peps/pep-0257/&quot;&gt;PEP 257&lt;/a&gt;. This PEP is full of small details about formatting your docstrings the right way, which is something you wouldn&apos;t be able to enforce without such a tool. A simple example would be:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;class Foobar:
    &quot;&quot;&quot;A foobar&quot;&quot;&quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;While this seems valid, the docstring is missing a period at the end.&lt;/p&gt;
&lt;p&gt;Trust me, especially if you are writing a library that is consumed by other developers, this is a must-have.&lt;/p&gt;
&lt;h2&gt;flake8-rst-docstrings&lt;/h2&gt;
&lt;p&gt;This &lt;a href=&quot;https://pypi.org/project/flake8-rst-docstrings/&quot;&gt;extension&lt;/a&gt; is a good complement to flake8-docstrings: it checks that the content of your docstrings is valid RST. It&apos;s a no-brainer, so I&apos;d install it without question. Again, if your project exports a documented API that is built with &lt;a href=&quot;https://sphinx-doc.org&quot;&gt;Sphinx&lt;/a&gt;, this is a must-have.&lt;/p&gt;
&lt;h2&gt;My standard tox.ini&lt;/h2&gt;
&lt;p&gt;Here&apos;s the standard &lt;code&gt;tox.ini&lt;/code&gt; excerpt that I use in most of my projects. You can copy-paste it and adapt it to your project.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;[testenv:pep8]
deps = flake8
       flake8-import-order
       flake8-blind-except
       flake8-builtins
       flake8-docstrings
       flake8-rst-docstrings
       flake8-logging-format
commands = flake8

[flake8]
exclude = .tox
## If you need to ignore some error codes in the whole source code
## you can write them here
## ignore = D100,D101
show-source = true
enable-extensions=G
application-import-names = &amp;lt;myprojectname&amp;gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Before disabling an error code for your entire project, remember that you can force flake8 to ignore a particular instance of the error by adding the &lt;code&gt;# noqa&lt;/code&gt; tag at the end of the line.&lt;/p&gt;
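&lt;p&gt;For example (the &lt;code&gt;risky&lt;/code&gt; function is made up for illustration):&lt;/p&gt;

```python
def risky():
    raise RuntimeError("boom")

caught = False
try:
    risky()
except:  # noqa
    # flake8-blind-except would flag this bare except, but the trailing
    # "# noqa" tag tells flake8 to skip checks on this line only.
    caught = True

print(caught)
```

&lt;p&gt;This keeps the whole-project configuration strict while allowing a deliberate, documented exception on a single line.&lt;/p&gt;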
&lt;p&gt;If you have any flake8 extension that you think is useful, please let me know in the comment section!&lt;/p&gt;
</content:encoded></item><item><title>More GitHub workflow automation</title><link>https://julien.danjou.info/blog/automating-github-workflows/</link><guid isPermaLink="true">https://julien.danjou.info/blog/automating-github-workflows/</guid><description>The more you use computers, the more you see the potential for automating everything. Who doesn&apos;t love that?</description><pubDate>Tue, 16 Oct 2018 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;The more you use computers, the more you see the potential for automating everything. Who doesn&apos;t love that? While building &lt;a href=&quot;https://mergify.io&quot;&gt;Mergify&lt;/a&gt; these last months, we&apos;ve decided it was time to bring more automation to the development workflow.&lt;/p&gt;
&lt;p&gt;Mergify&apos;s first version was a &lt;em&gt;minimum viable product&lt;/em&gt; around automating the merge of pull requests. As &lt;a href=&quot;https://julien.danjou.info/blog/stop-merging-your-pull-request-manually&quot;&gt;I wrote a few months ago&lt;/a&gt;, we wanted to automate merging pull requests when they were ready to be merged. For most projects, this is easy and consists of a simple rule: &quot;it must be approved by a developer and pass the CI&quot;.&lt;/p&gt;
&lt;h2&gt;Evolving on Feedback&lt;/h2&gt;
&lt;p&gt;For the first few months, we received a lot of feedback from our users. They were enthusiastic about the product but were frustrated by a couple of things.&lt;/p&gt;
&lt;p&gt;First, Mergify would mess with branch protections. We thought that people wanted the GitHub UI to match their rules. As I&apos;ll explain later, that turned out to be only partially true, and we found a workaround.&lt;/p&gt;
&lt;p&gt;Then, Mergify&apos;s abilities were capped by some of the limitations of the GitHub workflow and API. For example, GitHub would only allow rules per branch, whereas our users wanted to have rules applied based on a lot of different criteria.&lt;/p&gt;
&lt;h2&gt;Building the Next Engine&lt;/h2&gt;
&lt;p&gt;We rolled up our sleeves and started to build that new engine. The first thing was to get rid of the GitHub branch protection feature altogether and leverage the Checks API to render something useful in the UI. You can now have a complete overview of the rules that will be applied to your pull requests, making it easy to understand what&apos;s happening.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/10/1_6XA_yUbEHkgBs86cn31yOw.png&quot; alt=&quot;Screenshot of Mergify Checks API showing rule overview in the GitHub UI&quot; /&gt;&lt;/p&gt;
&lt;p&gt;Then, we wrote a new matching engine that would allow matching any pull requests based on any of its attributes. You can now automate your workflow with a finer-grained configuration.&lt;/p&gt;
&lt;h2&gt;What Does It Look Like?&lt;/h2&gt;
&lt;p&gt;Here&apos;s a simple rule you could write:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;pull_request_rules:
  - name: automatic merge on approval and CI pass
    conditions:
     - &quot;#approved-reviews-by&amp;gt;=1&quot;
     - status-success=continuous-integration/travis-ci/pr
     - label!=work-in-progress
    actions:
      merge:
        method: merge
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;With that, any pull request that has been approved by a collaborator, passes the Travis CI job and does not have the label &lt;code&gt;work-in-progress&lt;/code&gt; will be automatically merged by Mergify.&lt;/p&gt;
&lt;p&gt;You can also use other &lt;a href=&quot;https://docs.mergify.io/actions/&quot;&gt;actions&lt;/a&gt; to backport this pull request to another branch, close the pull request, or add/remove labels. We&apos;re starting to see users building amazing workflows with that engine!&lt;/p&gt;
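&lt;p&gt;As a sketch of what such a workflow could look like (check the exact condition and action names against the Mergify documentation), a hypothetical rule backporting merged pull requests labeled &lt;code&gt;backport-to-stable&lt;/code&gt; might read:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;pull_request_rules:
  - name: backport to stable once merged
    conditions:
      - merged
      - label=backport-to-stable
    actions:
      backport:
        branches:
          - stable
&lt;/code&gt;&lt;/pre&gt;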
&lt;p&gt;We&apos;re thrilled by this new version we launched this week and glad we&apos;re getting amazing feedback (again) from our users.&lt;/p&gt;
&lt;p&gt;When you give it a try, drop me a note and let me know what you think about it!&lt;/p&gt;
</content:encoded></item><item><title>Code Style Checks in Python</title><link>https://julien.danjou.info/blog/code-style-checks-in-python/</link><guid isPermaLink="true">https://julien.danjou.info/blog/code-style-checks-in-python/</guid><description>After starting your first Python project, you might realize that it is actually not that obvious to be consistent with the way you write Python code.</description><pubDate>Mon, 01 Oct 2018 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;After &lt;a href=&quot;https://julien.danjou.info/blog/starting-your-first-python-project&quot;&gt;starting your first Python project&lt;/a&gt;, you might realize that it is actually not that obvious to be consistent with the way you write Python code. If you collaborate with other developers, your code styles might differ, and the code can become somewhat unreadable.&lt;/p&gt;
&lt;p&gt;I hate coding style discussions as much as any engineer. Who has not seen hours of nitpicking in code reviews, a heated debate around the coffee machine, or nerf gun battles to decide where the semicolon should go?&lt;/p&gt;
&lt;p&gt;When I start a new project, the first thing I do is set up an automated style check. With that in place, no time is wasted during code reviews manually checking something a program is good at checking: coding style consistency. Since coding style is a touchy subject, that&apos;s a good reason to tackle it at the beginning of the project.&lt;/p&gt;
&lt;p&gt;Python has an amazing quality that few other languages have: it uses indentation to define blocks. While it offers a solution to the age-old question of &quot;where should I put my curly braces?&quot;, it introduces a new question in the process: &quot;how should I indent?&quot;.&lt;/p&gt;
&lt;p&gt;I imagine that it was one of the first questions raised in the community, so the Python folks, in their vast wisdom, came up with &lt;a href=&quot;http://www.python.org/dev/peps/pep-0008/&quot;&gt;PEP 8&lt;/a&gt;: &lt;em&gt;Style Guide for Python Code&lt;/em&gt;.&lt;/p&gt;
&lt;p&gt;This document defines the standard style for writing Python code. The list of guidelines boils down to:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Use 4 spaces per indentation level.&lt;/li&gt;
&lt;li&gt;Limit all lines to a maximum of 79 characters.&lt;/li&gt;
&lt;li&gt;Separate top-level function and class definitions with two blank lines.&lt;/li&gt;
&lt;li&gt;Encode files using ASCII or UTF-8.&lt;/li&gt;
&lt;li&gt;One module import per &lt;code&gt;import&lt;/code&gt; statement and per line, at the top of the file, after comments and docstrings, grouped first by standard, then third-party, and finally local library imports.&lt;/li&gt;
&lt;li&gt;No extraneous whitespaces between parentheses, brackets, or braces, or before commas.&lt;/li&gt;
&lt;li&gt;Name classes in &lt;code&gt;CamelCase&lt;/code&gt;; suffix exceptions with &lt;code&gt;Error&lt;/code&gt; (if applicable); name functions in lowercase with words &lt;code&gt;separated_by_underscores&lt;/code&gt;; and use a leading underscore for &lt;code&gt;_private&lt;/code&gt; attributes or methods.&lt;/li&gt;
&lt;/ul&gt;
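&lt;p&gt;Put together, a tiny module following these guidelines might look like this (the names are purely illustrative):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Imports go at the top, standard library first.
import os


# Two blank lines separate top-level definitions; classes use CamelCase
# and exceptions are suffixed with Error.
class ConfigurationError(Exception):
    &quot;&quot;&quot;Raised when the configuration is invalid.&quot;&quot;&quot;


def load_config_path(name):
    # Functions use lowercase_with_underscores and 4-space indentation.
    return os.path.join(&quot;/etc&quot;, name)


def _default_name():
    # A leading underscore marks this helper as private.
    return &quot;app.conf&quot;
&lt;/code&gt;&lt;/pre&gt;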
&lt;p&gt;These guidelines really aren&apos;t hard to follow and they make a lot of sense. Most Python programmers have no trouble sticking to them as they write code.&lt;/p&gt;
&lt;p&gt;However, &lt;em&gt;errare humanum est&lt;/em&gt;, and it&apos;s still a pain to look through your code to make sure it fits the PEP 8 guidelines. That&apos;s what the &lt;a href=&quot;http://pycodestyle.pycqa.org/en/latest/&quot;&gt;pycodestyle&lt;/a&gt; tool (formerly called &lt;em&gt;pep8&lt;/em&gt;) is there for: it can automatically check any Python file you send its way.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;$ pycodestyle hello.py
hello.py:4:1: E302 expected 2 blank lines, found 1
$ echo $?
1
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;em&gt;pycodestyle&lt;/em&gt; indicates which lines and columns do not conform to PEP 8 and reports each issue with a code. Violations of &lt;em&gt;MUST&lt;/em&gt; statements in the specification are reported as &lt;em&gt;errors&lt;/em&gt; — their error codes start with an &lt;em&gt;E&lt;/em&gt;. Minor issues are reported as &lt;em&gt;warnings&lt;/em&gt; — their error codes start with a &lt;em&gt;W&lt;/em&gt;. The three-digit code following the first letter indicates the exact kind of error or warning.&lt;/p&gt;
&lt;p&gt;You can tell the general category of an error code at a glance by looking at the hundreds digit: for example, errors starting with &lt;code&gt;E2&lt;/code&gt; indicate issues with whitespace; errors starting with &lt;code&gt;E3&lt;/code&gt; indicate issues with blank lines; and warnings starting with &lt;code&gt;W6&lt;/code&gt; indicate deprecated features being used.&lt;/p&gt;
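&lt;p&gt;For example, the &lt;code&gt;E302&lt;/code&gt; error reported above fires when top-level definitions are separated by fewer than two blank lines. A hypothetical &lt;code&gt;hello.py&lt;/code&gt; triggering it could look like:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;def hello():
    return &quot;Hello, world!&quot;

# Only one blank line above: pycodestyle reports E302 on the next line
# (&quot;expected 2 blank lines, found 1&quot;).
def goodbye():
    return &quot;Goodbye!&quot;
&lt;/code&gt;&lt;/pre&gt;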
&lt;p&gt;I advise you to run a PEP 8 validation tool against your source code on a regular basis. An easy way to do this is to integrate it into your continuous integration system: it&apos;s a good way to ensure that you continue to respect the PEP 8 guidelines in the long term.&lt;/p&gt;
&lt;p&gt;Most open source projects enforce PEP 8 conformance through automatic checks. Doing so from the beginning of the project might frustrate newcomers, but it also ensures that the codebase always looks the same in every part of the project. This is very important for a project of any size where there are multiple developers with differing opinions on whitespace ordering. You know what I mean.&lt;/p&gt;
&lt;p&gt;It&apos;s also possible to ignore certain kinds of errors and warnings by using the &lt;code&gt;--ignore&lt;/code&gt; option:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;$ pycodestyle --ignore=E3 hello.py
$ echo $?
0
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This allows you to effectively ignore parts of the PEP 8 specification that you don&apos;t want to follow. If you&apos;re running &lt;em&gt;pycodestyle&lt;/em&gt; on an existing code base, it also allows you to ignore certain kinds of problems so you can focus on fixing issues one category at a time.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;If you write C code for Python (e.g. modules), the &lt;a href=&quot;http://www.python.org/dev/peps/pep-0007/&quot;&gt;PEP 7&lt;/a&gt; standard describes the coding style that you should follow.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Other tools also exist that check for actual coding errors rather than style errors. Some notable examples include:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://launchpad.net/pyflakes&quot;&gt;pyflakes&lt;/a&gt;, which is also extendable via plugins.&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://pypi.python.org/pypi/pylint&quot;&gt;pylint&lt;/a&gt;, which also checks PEP 8 conformance while performing more checks by default. It also can be extended via plugins.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;These tools all make use of static analysis — that is, they parse the code and analyze it rather than running it outright.&lt;/p&gt;
&lt;p&gt;If you choose to use &lt;em&gt;pyflakes&lt;/em&gt; — which I recommend — note that it doesn&apos;t check PEP 8 conformance on its own; you would still need &lt;em&gt;pycodestyle&lt;/em&gt; to do that. That means you need two different tools for proper coverage.&lt;/p&gt;
&lt;p&gt;In order to simplify things, a project named &lt;em&gt;&lt;a href=&quot;https://pypi.python.org/pypi/flake8&quot;&gt;flake8&lt;/a&gt;&lt;/em&gt; exists and combines &lt;em&gt;pyflakes&lt;/em&gt; and &lt;em&gt;pycodestyle&lt;/em&gt; into a single command. It also adds some new fancy features: for example, it can skip checks on lines containing &lt;code&gt;# noqa&lt;/code&gt; and is extensible via plugins.&lt;/p&gt;
&lt;p&gt;There are a large number of plugins available for &lt;em&gt;flake8&lt;/em&gt; that you can just use. For example, installing &lt;em&gt;flake8-import-order&lt;/em&gt; (with &lt;code&gt;pip install flake8-import-order&lt;/code&gt;) will extend &lt;em&gt;flake8&lt;/em&gt; so it also checks that your &lt;code&gt;import&lt;/code&gt; statements are sorted alphabetically in your source code.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;flake8&lt;/em&gt; is now heavily used in most open source projects for code style verification. Some large open source projects even wrote their own plugins, adding checks for errors such as odd usage of &lt;code&gt;except&lt;/code&gt;, Python 2/3 portability issues, import style, dangerous string formatting, possible localization issues, etc.&lt;/p&gt;
&lt;p&gt;If you&apos;re starting a new project, I strongly recommend you use one of these tools and rely on it for automatic checking of your code quality and style. If you already have a codebase, a good approach is to run them with most of the warnings disabled and fix issues one category at a time.&lt;/p&gt;
&lt;p&gt;While none of these tools may be a perfect fit for your project or your preferences, using a tool such as &lt;em&gt;flake8&lt;/em&gt; is a good way to improve the quality of your code and make it more durable. If nothing else, it&apos;s a good start toward that goal.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Many text editors, including the famous &lt;a href=&quot;http://www.gnu.org/software/emacs/&quot;&gt;GNU Emacs&lt;/a&gt; and &lt;a href=&quot;http://www.vim.org/&quot;&gt;vim&lt;/a&gt;, have plugins available (such as &lt;em&gt;Flycheck&lt;/em&gt;) that can run tools such as &lt;em&gt;pep8&lt;/em&gt; or &lt;em&gt;flake8&lt;/em&gt; directly in your code buffer, interactively highlighting any part of your code that isn&apos;t PEP 8-compliant. This is a handy way to fix most style errors as you write your code.&lt;/p&gt;
&lt;/blockquote&gt;
</content:encoded></item><item><title>High-Performance in Python with Zero-Copy and the Buffer Protocol</title><link>https://julien.danjou.info/blog/high-performance-in-python-with-zero-copy-and-the-buffer-protocol/</link><guid isPermaLink="true">https://julien.danjou.info/blog/high-performance-in-python-with-zero-copy-and-the-buffer-protocol/</guid><description>Whatever your programs are doing, they often have to deal with vast amounts of data. This data is usually represented and manipulated in the form of strings. However, handling such a large quantity of</description><pubDate>Mon, 03 Sep 2018 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Whatever your programs are doing, they often have to deal with vast amounts of data. This data is usually represented and manipulated in the form of &lt;em&gt;strings&lt;/em&gt;. However, handling such a large quantity of input in strings can be very ineffective once you start manipulating them by copying, slicing, and modifying. Why?&lt;/p&gt;
&lt;p&gt;Let&apos;s consider a small program which reads a large file of binary data, and&lt;br /&gt;
copies it partially into another file. To examine the memory usage of this program, we will use &lt;a href=&quot;https://pypi.python.org/pypi/memory_profiler&quot;&gt;memory_profiler&lt;/a&gt;, an excellent Python package that allows us to see the memory usage of a program line by line.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;@profile
def read_random():
    with open(&quot;/dev/urandom&quot;, &quot;rb&quot;) as source:
        content = source.read(1024 * 10000)
        content_to_write = content[1024:]
    print(&quot;Content length: %d, content to write length %d&quot; %
          (len(content), len(content_to_write)))
    with open(&quot;/dev/null&quot;, &quot;wb&quot;) as target:
        target.write(content_to_write)

if __name__ == &apos;__main__&apos;:
    read_random()
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Running the above program using &lt;em&gt;memory_profiler&lt;/em&gt; produces the following:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;$ python -m memory_profiler memoryview/copy.py
Content length: 10240000, content to write length 10238976
Filename: memoryview/copy.py

Mem usage    Increment   Line Contents
======================================
                         @profile
 9.883 MB     0.000 MB   def read_random():
 9.887 MB     0.004 MB       with open(&quot;/dev/urandom&quot;, &quot;rb&quot;) as source:
19.656 MB     9.770 MB           content = source.read(1024 * 10000)
29.422 MB     9.766 MB           content_to_write = content[1024:]
29.422 MB     0.000 MB       print(&quot;Content length: %d, content to write length %d&quot; %
29.434 MB     0.012 MB             (len(content), len(content_to_write)))
29.434 MB     0.000 MB       with open(&quot;/dev/null&quot;, &quot;wb&quot;) as target:
29.434 MB     0.000 MB           target.write(content_to_write)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The call to &lt;code&gt;source.read&lt;/code&gt; reads 10 MB from &lt;code&gt;/dev/urandom&lt;/code&gt;. Python needs to allocate around 10 MB of memory to store this data as a string. The instruction on the line just after, &lt;code&gt;content[1024:]&lt;/code&gt;, copies the entire block of data minus the first KB — allocating 10 more megabytes.&lt;/p&gt;
&lt;p&gt;What&apos;s interesting here is that the memory usage of the program increased by about 10 MB when building the variable &lt;code&gt;content_to_write&lt;/code&gt;. The slice operator is copying the entirety of &lt;code&gt;content&lt;/code&gt;, minus the first KB, into a new string object.&lt;/p&gt;
&lt;p&gt;When dealing with extensive data, performing this kind of operation on large byte arrays is going to be a disaster. If you have ever written C code, you know that using &lt;code&gt;memcpy()&lt;/code&gt; has a significant cost, both in terms of memory usage and general performance: copying memory is slow.&lt;/p&gt;
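&lt;p&gt;You can observe this behavior directly from Python: slicing a &lt;code&gt;bytes&lt;/code&gt; object always produces a new, independent object holding its own copy of the data.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Slicing bytes allocates a new, independent object.
data = b&quot;x&quot; * (10 * 1024 * 1024)  # 10 MiB of data
sliced = data[1024:]              # copies everything but the first KiB

# The slice is a distinct object, not a view on the original buffer.
assert sliced is not data
assert len(sliced) == len(data) - 1024
&lt;/code&gt;&lt;/pre&gt;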
&lt;p&gt;However, as a C programmer, you also know that strings are arrays of characters and that nothing stops you from looking at only part of this array without copying it, through the use of basic pointer arithmetic – assuming that the entire string is in a contiguous memory area.&lt;/p&gt;
&lt;p&gt;This is possible in Python using objects which implement the &lt;em&gt;buffer protocol&lt;/em&gt;. The buffer protocol is defined in &lt;a href=&quot;http://www.python.org/dev/peps/pep-3118/&quot;&gt;PEP 3118&lt;/a&gt;, which explains the C API used to provide this protocol to various types, such as strings.&lt;/p&gt;
&lt;p&gt;When an object implements this protocol, you can use the &lt;code&gt;memoryview&lt;/code&gt; class constructor on it to build a new &lt;em&gt;memoryview&lt;/em&gt; object that references the original object memory.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&amp;gt;&amp;gt;&amp;gt; s = b&quot;abcdefgh&quot;
&amp;gt;&amp;gt;&amp;gt; view = memoryview(s)
&amp;gt;&amp;gt;&amp;gt; view[1]
98
&amp;gt;&amp;gt;&amp;gt; limited = view[1:3]
&amp;gt;&amp;gt;&amp;gt; limited
&amp;lt;memory at 0x7fca18b8d460&amp;gt;
&amp;gt;&amp;gt;&amp;gt; bytes(view[1:3])
b&apos;bc&apos;
&lt;/code&gt;&lt;/pre&gt;
&lt;blockquote&gt;
&lt;p&gt;Note: &lt;code&gt;98&lt;/code&gt; is the ASCII code for the letter &lt;code&gt;b&lt;/code&gt;.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;In the example above, we use the fact that the &lt;code&gt;memoryview&lt;/code&gt; object&apos;s slice operator itself returns a &lt;code&gt;memoryview&lt;/code&gt; object. That means it does &lt;strong&gt;not&lt;/strong&gt; copy any data but merely references a particular slice of it.&lt;/p&gt;
&lt;p&gt;The graph below illustrates what happens:&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/08/serious-python__3.png&quot; alt=&quot;serious-python__3&quot; /&gt;&lt;/p&gt;
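&lt;p&gt;You can verify that a &lt;em&gt;memoryview&lt;/em&gt; shares memory with the underlying object rather than copying it. The check below uses a &lt;code&gt;bytearray&lt;/code&gt;, since &lt;code&gt;bytes&lt;/code&gt; objects are read-only and cannot be modified through a view:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;ba = bytearray(b&quot;abcdefgh&quot;)
view = memoryview(ba)[1:3]  # no copy, just a reference into ba

# Writing through the view is visible in the original object.
view[0] = ord(&quot;X&quot;)
assert ba == bytearray(b&quot;aXcdefgh&quot;)
&lt;/code&gt;&lt;/pre&gt;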
&lt;p&gt;Therefore, it is possible to rewrite the program above in a more efficient manner. We need to reference the data that we want to write using a &lt;em&gt;memoryview&lt;/em&gt; object, rather than allocating a new string.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;@profile
def read_random():
    with open(&quot;/dev/urandom&quot;, &quot;rb&quot;) as source:
        content = source.read(1024 * 10000)
        content_to_write = memoryview(content)[1024:]
    print(&quot;Content length: %d, content to write length %d&quot; %
          (len(content), len(content_to_write)))
    with open(&quot;/dev/null&quot;, &quot;wb&quot;) as target:
        target.write(content_to_write)

if __name__ == &apos;__main__&apos;:
    read_random()
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Let&apos;s run the program above with the memory profiler:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;$ python -m memory_profiler memoryview/copy-memoryview.py
Content length: 10240000, content to write length 10238976
Filename: memoryview/copy-memoryview.py

Mem usage    Increment   Line Contents
======================================
                         @profile
 9.887 MB     0.000 MB   def read_random():
 9.891 MB     0.004 MB       with open(&quot;/dev/urandom&quot;, &quot;rb&quot;) as source:
19.660 MB     9.770 MB           content = source.read(1024 * 10000)
19.660 MB     0.000 MB           content_to_write = memoryview(content)[1024:]
19.660 MB     0.000 MB       print(&quot;Content length: %d, content to write length %d&quot; %
19.672 MB     0.012 MB             (len(content), len(content_to_write)))
19.672 MB     0.000 MB       with open(&quot;/dev/null&quot;, &quot;wb&quot;) as target:
19.672 MB     0.000 MB           target.write(content_to_write)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In that case, the &lt;code&gt;source.read&lt;/code&gt; call still allocates 10 MB of memory to read the content of the file. However, when using &lt;code&gt;memoryview&lt;/code&gt; to refer to the offset content, no more memory is allocated.&lt;/p&gt;
&lt;p&gt;This version of the program ends up allocating 50% less memory than the original version!&lt;/p&gt;
&lt;p&gt;This kind of trick is especially useful when dealing with sockets. When sending data over a socket, all the data might not be sent in a single call.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import socket
s = socket.socket(…)
s.connect(…)
# Build a bytes object with more than 100 million times the letter `a`
data = b&quot;a&quot; * (1024 * 100000)
while data:
    sent = s.send(data)
    # Remove the first `sent` bytes sent
    data = data[sent:]
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Using a mechanism as implemented above, the program copies the data over and over until the socket has sent everything. By using &lt;code&gt;memoryview&lt;/code&gt;, it is possible to achieve the same functionality with zero-copy, and therefore higher performance:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import socket
s = socket.socket(…)
s.connect(…)
# Build a bytes object with more than 100 million times the letter `a`
data = b&quot;a&quot; * (1024 * 100000)
mv = memoryview(data)
while mv:
    sent = s.send(mv)
    # Build a new memoryview object pointing to the data which remains to be sent
    mv = mv[sent:]
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;As this won&apos;t copy anything, it won&apos;t use any more memory than the 100 MB&lt;br /&gt;
initially needed for the &lt;code&gt;data&lt;/code&gt; variable.&lt;/p&gt;
&lt;p&gt;So far we&apos;ve used &lt;code&gt;memoryview&lt;/code&gt; objects to write data efficiently, but the same method can also be used to read data. Most I/O operations in Python know how to deal with objects implementing the buffer protocol. They can read from it, but also write to it. In this case, we don&apos;t need &lt;code&gt;memoryview&lt;/code&gt; objects – we can ask an I/O function to write into our pre-allocated object:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&amp;gt;&amp;gt;&amp;gt; ba = bytearray(8)
&amp;gt;&amp;gt;&amp;gt; ba
bytearray(b&apos;\x00\x00\x00\x00\x00\x00\x00\x00&apos;)
&amp;gt;&amp;gt;&amp;gt; with open(&quot;/dev/urandom&quot;, &quot;rb&quot;) as source:
...     source.readinto(ba)
... 
8
&amp;gt;&amp;gt;&amp;gt; ba
bytearray(b&apos;`m.z\x8d\x0fp\xa1&apos;)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;With such techniques, it&apos;s easy to pre-allocate a buffer (as you would do in C to mitigate the number of calls to &lt;code&gt;malloc()&lt;/code&gt;) and fill it at your convenience.&lt;/p&gt;
&lt;p&gt;Using &lt;code&gt;memoryview&lt;/code&gt;, you can even place data at any point in the memory area:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&amp;gt;&amp;gt;&amp;gt; ba = bytearray(8)
&amp;gt;&amp;gt;&amp;gt; # Reference the _bytearray_ from offset 4 to its end
&amp;gt;&amp;gt;&amp;gt; ba_at_4 = memoryview(ba)[4:]
&amp;gt;&amp;gt;&amp;gt; with open(&quot;/dev/urandom&quot;, &quot;rb&quot;) as source:
... # Write the content of /dev/urandom from offset 4 to the end of the
... # bytearray, effectively reading 4 bytes only
...     source.readinto(ba_at_4)
... 
4
&amp;gt;&amp;gt;&amp;gt; ba
bytearray(b&apos;\x00\x00\x00\x00\x0b\x19\xae\xb2&apos;)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The buffer protocol is fundamental to achieving low memory overhead and great performance. As Python hides all the memory allocations, developers tend to forget what happens under the hood, at a high cost for the speed of their programs!&lt;/p&gt;
&lt;p&gt;It&apos;s also good to know that both the objects in the &lt;code&gt;array&lt;/code&gt; module and the functions in the &lt;code&gt;struct&lt;/code&gt; module handle the buffer protocol correctly and can therefore perform efficiently when targeting zero copy.&lt;/p&gt;
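&lt;p&gt;For instance, since &lt;code&gt;array.array&lt;/code&gt; exposes the buffer protocol, &lt;code&gt;struct.unpack_from&lt;/code&gt; can read values straight out of the array&apos;s memory without an intermediate &lt;code&gt;bytes&lt;/code&gt; copy:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import array
import struct

a = array.array(&quot;i&quot;, [1, 2, 3, 4])

# unpack_from reads directly from the array&apos;s buffer: no copy.
first, second = struct.unpack_from(&quot;ii&quot;, a, 0)
assert (first, second) == (1, 2)

# A memoryview over the array costs no copy either.
mv = memoryview(a)
assert mv[2] == 3
&lt;/code&gt;&lt;/pre&gt;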
</content:encoded></item><item><title>Gnocchi 4.3.0 released</title><link>https://julien.danjou.info/blog/gnocchi-4-3-0-released/</link><guid isPermaLink="true">https://julien.danjou.info/blog/gnocchi-4-3-0-released/</guid><description>This new minor release of Gnocchi has taken a bit longer than usual, but here it is!</description><pubDate>Mon, 30 Jul 2018 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;This new minor release of Gnocchi has taken a bit longer than usual, but here it is!&lt;/p&gt;
&lt;p&gt;So what&apos;s new in this version of Gnocchi? Well, according to &lt;a href=&quot;https://gnocchi.xyz/releasenotes/4.3.html&quot;&gt;the release notes&lt;/a&gt;, not much. There are only two new features:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;em&gt;gnocchi-injector&lt;/em&gt;, which allows injecting data for &lt;em&gt;metricd&lt;/em&gt; consumption directly. This is useful to test &lt;em&gt;metricd&lt;/em&gt; performance.&lt;/li&gt;
&lt;li&gt;The ability for the &lt;code&gt;/v1/aggregation/resources&lt;/code&gt; endpoint to read a string rather than a JSON formatted payload for filtering.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Nothing exciting here… however, other changes are not user-visible and are not in those notes:&lt;/p&gt;
&lt;p&gt;Performance boost, everywhere!&lt;/p&gt;
&lt;p&gt;The storage engine has been largely improved to batch a ton of operations that used to be done on a per-metric basis. When ingesting new measures, Gnocchi was storing those new points in batches. However, the processing done by &lt;em&gt;metricd&lt;/em&gt; later was single-metric based for most of it. This did not leverage the efficiency that some backends might have and would create more I/O operations than necessary.&lt;/p&gt;
&lt;p&gt;Each incoming data sack is now processed in batch mode, making &lt;em&gt;metricd&lt;/em&gt; much faster at aggregating metrics data! In local benchmarks, some scenarios showed an 8x improvement.&lt;/p&gt;
&lt;p&gt;This new storage internal API is not used by the REST API yet, as many operations exposed by the API are oriented toward a single metric. Adopting it there might be a significant improvement for the next version of Gnocchi&apos;s API.&lt;/p&gt;
&lt;p&gt;Happy upgrade!&lt;/p&gt;
</content:encoded></item><item><title>Starting your first Python project</title><link>https://julien.danjou.info/blog/starting-your-first-python-project/</link><guid isPermaLink="true">https://julien.danjou.info/blog/starting-your-first-python-project/</guid><description>There&apos;s a gap between learning the syntax of the Python programming language and being able to build a project from scratch. When you finish reading your first tutorial or book about Python, you&apos;re go</description><pubDate>Thu, 26 Jul 2018 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;There&apos;s a gap between learning the syntax of the Python programming language and being able to build a project from scratch. When you finish reading your first tutorial or book about Python, you&apos;re good to go for writing a Fibonacci sequence calculator, but that does not help you start your &lt;em&gt;actual&lt;/em&gt; project.&lt;/p&gt;
&lt;p&gt;There are a few questions that pop up in your mind, and that&apos;s normal. Let&apos;s take a stab at those!&lt;/p&gt;
&lt;h3&gt;Which Python version should I use?&lt;/h3&gt;
&lt;p&gt;It&apos;s not a secret that Python has several versions that are supported at the same time. Each minor version of the interpreter gets bugfix support for 18 months and security support for 5 years. For example, Python 3.7, released on 27th June 2018, will be supported until Python 3.8 is released, around October 2019 (15 months later). Around December 2019, the last bugfix release of Python 3.7 will occur, and everyone is expected to switch to Python 3.8.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/07/python-release-timeline.png&quot; alt=&quot;Current Python 3.7/3.8 release schedule&quot; /&gt;&lt;/p&gt;
&lt;p&gt;That&apos;s important to be aware of, as the version of the interpreter you pick becomes part of your software&apos;s lifecycle.&lt;/p&gt;
&lt;p&gt;On top of that, we should take into consideration the Python 2 versus Python 3 question. That still might be an open question for people working with (very) old platforms.&lt;/p&gt;
&lt;p&gt;In the end, the question of which version of Python one should use is well worth asking.&lt;/p&gt;
&lt;p&gt;Here are some short answers:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Versions 2.6 and older are really obsolete by now, so you don&apos;t have to worry about supporting them at all. If you intend on supporting these older versions anyway, be warned that you&apos;ll have an even harder time ensuring that your program supports Python 3.x as well. Though you might still run into Python 2.6 on some older systems; if that&apos;s the case, sorry for you!&lt;/li&gt;
&lt;li&gt;Version 2.7 is and will remain the last version of Python 2.x. I don&apos;t think there is a system where Python 3 is not available one way or the other nowadays. So unless you&apos;re doing archeology once again, forget it. Python 2.7 will not be supported after the year 2020, so the last thing you want to do is build new software based on it.&lt;/li&gt;
&lt;li&gt;Version 3.7 is the most recent version of the Python 3 branch as of this writing, and that&apos;s the one that you should target. Most recent operating systems ship at least 3.6, so in the case where you&apos;d target those, you can make sure your application also works with 3.7.&lt;/li&gt;
&lt;/ul&gt;
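&lt;p&gt;Once you&apos;ve picked a minimum version, it&apos;s worth failing fast when the interpreter is too old. A minimal sketch (the 3.6 floor here is just an example):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import sys

# Abort at startup rather than crash later with a confusing SyntaxError.
if sys.version_info &amp;lt; (3, 6):
    raise RuntimeError(&quot;Python 3.6 or newer is required&quot;)
&lt;/code&gt;&lt;/pre&gt;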
&lt;h2&gt;Project Layout&lt;/h2&gt;
&lt;p&gt;Starting a new project is always a puzzle. You never know how to organize your files. However, once you have a proper understanding of the best practices out there, it&apos;s pretty simple.&lt;/p&gt;
&lt;p&gt;First, your project structure should be fairly basic. Use packages and hierarchy wisely: a deep hierarchy can be a nightmare to navigate, while a flat hierarchy tends to become bloated.&lt;/p&gt;
&lt;p&gt;Then, avoid making a few common mistakes. Don&apos;t leave unit tests outside the package directory. These tests should be included in a sub-package of your software so that:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;They don&apos;t get automatically installed as a &lt;em&gt;tests&lt;/em&gt; top-level module by &lt;em&gt;setuptools&lt;/em&gt; (or some other packaging library) by accident.&lt;/li&gt;
&lt;li&gt;They can be installed and eventually used by other packages to build their unit tests.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The following diagram illustrates what a standard file hierarchy should look like:&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/07/serious-python__1-3.png&quot; alt=&quot;A Python project files and directories hierarchy&quot; /&gt;&lt;/p&gt;
&lt;p&gt;&lt;code&gt;setup.py&lt;/code&gt; is the standard name for the Python installation script, along with its companion &lt;code&gt;setup.cfg&lt;/code&gt;, which should contain the installation script&apos;s configuration. When run, &lt;code&gt;setup.py&lt;/code&gt; installs your package using the Python distribution utilities.&lt;/p&gt;
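&lt;p&gt;As a rough sketch, a minimal &lt;code&gt;setup.py&lt;/code&gt; might contain no more than the following (the project name and version are placeholders):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import setuptools

setuptools.setup(
    name=&quot;myproject&quot;,
    version=&quot;1.0.0&quot;,
    packages=setuptools.find_packages(),
)
&lt;/code&gt;&lt;/pre&gt;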
&lt;p&gt;You can also provide valuable information to users in &lt;code&gt;README.rst&lt;/code&gt; (or &lt;code&gt;README.txt&lt;/code&gt;, or whatever filename suits your fancy). Finally, the &lt;code&gt;docs&lt;/code&gt; directory should contain the package&apos;s documentation in &lt;em&gt;reStructuredText&lt;/em&gt; format, that will be consumed by &lt;em&gt;Sphinx&lt;/em&gt;.&lt;/p&gt;
&lt;p&gt;Packages often have to provide extra data, such as images, shell scripts, and so forth. Unfortunately, there&apos;s no universally accepted standard for where these files should be stored. Just put them wherever makes the most sense for your project: depending on their functions, for example, Web application templates could go in a &lt;code&gt;templates&lt;/code&gt; directory in your package root directory.&lt;/p&gt;
&lt;p&gt;The following top-level directories also frequently appear:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;etc&lt;/code&gt; for sample configuration files.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;tools&lt;/code&gt; for shell scripts or related tools.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;bin&lt;/code&gt; for binary scripts you&apos;ve written that will be installed by &lt;code&gt;setup.py&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;There&apos;s another design issue that I often encounter. When creating files or modules, some developers create them based on the type of code they will store. For example, they would create &lt;code&gt;functions.py&lt;/code&gt; or &lt;code&gt;exceptions.py&lt;/code&gt; files. This is a &lt;em&gt;terrible&lt;/em&gt; approach. It doesn&apos;t help any developer when navigating the code, and it forces readers to jump between files for no good reason. There are a few exceptions, such as libraries, which in some instances do expose a complete API this way for consumers. Other than that, think twice before doing it in your application.&lt;/p&gt;
&lt;p&gt;Organize your code based on features, not based on types.&lt;/p&gt;
&lt;p&gt;Creating a module directory with just an &lt;code&gt;__init__.py&lt;/code&gt; file in it is also a bad idea. For example, don&apos;t create a directory named &lt;code&gt;hooks&lt;/code&gt; with a single file named &lt;code&gt;hooks/__init__.py&lt;/code&gt; in it where &lt;code&gt;hooks.py&lt;/code&gt; would have been enough instead. If you create a directory, it should contain several other Python files that belong to the category the directory represents.&lt;/p&gt;
&lt;p&gt;Also be very careful about the code that you put in &lt;code&gt;__init__.py&lt;/code&gt; files: it is going to be called and executed the first time that any of the modules contained in the directory is loaded. This can have unwanted side effects. Those &lt;code&gt;__init__.py&lt;/code&gt; files should be empty most of the time unless you know what you&apos;re doing.&lt;/p&gt;
&lt;h2&gt;Version Numbering&lt;/h2&gt;
&lt;p&gt;Software versions need to be stamped so that anyone can tell which release is more recent than another. Since every piece of code evolves, every project needs a way to organize its release timeline.&lt;/p&gt;
&lt;p&gt;There is an infinite number of ways to organize your version numbers, but &lt;a href=&quot;http://www.python.org/dev/peps/pep-0440/&quot;&gt;PEP 440&lt;/a&gt; introduces a version format that every Python package, and ideally every application, should follow. This way, programs and packages will be able to quickly and reliably identify which versions of your package they require.&lt;/p&gt;
&lt;p&gt;PEP 440 defines the following format for version numbering:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;N[.N]+[{a|b|c|rc}N][.postN][.devN]&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;This allows for standard numbering like &lt;code&gt;1.2&lt;/code&gt; or &lt;code&gt;1.2.3&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;However, please do note that:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;1.2&lt;/code&gt; is equivalent to &lt;code&gt;1.2.0&lt;/code&gt;; &lt;code&gt;1.3.4&lt;/code&gt; is equivalent to &lt;code&gt;1.3.4.0&lt;/code&gt;, and so forth.&lt;/li&gt;
&lt;li&gt;Versions matching &lt;code&gt;N[.N]+&lt;/code&gt; are considered &lt;em&gt;final releases&lt;/em&gt;.&lt;/li&gt;
&lt;li&gt;Date-based versions such as &lt;code&gt;2013.06.22&lt;/code&gt; are considered invalid. Automated tools designed to detect PEP 440-format version numbers will (or should) raise an error if they detect a major version number greater than or equal to &lt;code&gt;1980&lt;/code&gt;, as that is a telltale sign of a date.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Final components can also use the following format:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;N[.N]+aN&lt;/code&gt; (e.g. &lt;code&gt;1.2a1&lt;/code&gt;) denotes an &lt;em&gt;alpha&lt;/em&gt; release, a version that might be unstable and missing features.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;N[.N]+bN&lt;/code&gt; (e.g. &lt;code&gt;2.3.1b2&lt;/code&gt;) denotes a &lt;em&gt;beta&lt;/em&gt; release, a version that might be feature-complete but still buggy.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;N[.N]+cN&lt;/code&gt; or &lt;code&gt;N[.N]+rcN&lt;/code&gt; (e.g. &lt;code&gt;0.4rc1&lt;/code&gt;) denotes a &lt;em&gt;(release) candidate&lt;/em&gt;, a version that might be released as the final product unless significant bugs emerge. While the &lt;code&gt;rc&lt;/code&gt; and &lt;code&gt;c&lt;/code&gt; suffixes have the same meaning, if both are used, &lt;code&gt;rc&lt;/code&gt; releases are considered to be newer than &lt;code&gt;c&lt;/code&gt; releases.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;These suffixes can also be used:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;.postN&lt;/code&gt; (e.g. &lt;code&gt;1.4.post2&lt;/code&gt;) indicates a &lt;em&gt;post-release&lt;/em&gt;. These are typically used to address minor errors in the publication process (e.g. mistakes in release notes). You shouldn&apos;t use &lt;code&gt;.postN&lt;/code&gt; when releasing a bugfix version; instead, you should increment the minor version number.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;.devN&lt;/code&gt; (e.g. &lt;code&gt;2.3.4.dev3&lt;/code&gt;) indicates a &lt;em&gt;developmental release&lt;/em&gt;. This suffix is discouraged because it is harder for humans to parse. It indicates a prerelease of the version that it qualifies: e.g. &lt;code&gt;2.3.4.dev3&lt;/code&gt; indicates the third developmental version of the &lt;code&gt;2.3.4&lt;/code&gt; release, before any alpha, beta, candidate or final release.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This scheme should be sufficient for most common use cases.&lt;/p&gt;
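To make the scheme concrete, here is a rough validity check based on a simplified regular expression. This is an illustrative sketch only, not the full PEP 440 grammar, which also covers epochs, local versions and normalization rules:

```python
import re

# A simplified pattern for the scheme above -- just the
# N[.N]+[{a|b|c|rc}N][.postN][.devN] subset discussed here.
VERSION_RE = re.compile(
    r"^(?P<release>\d+(?:\.\d+)*)"
    r"(?:(?P<pre_tag>a|b|c|rc)(?P<pre_num>\d+))?"
    r"(?:\.post(?P<post>\d+))?"
    r"(?:\.dev(?P<dev>\d+))?$"
)

def is_valid_version(version):
    """Return True if the string matches the simplified scheme."""
    return VERSION_RE.match(version) is not None

print(is_valid_version("1.2"))          # True
print(is_valid_version("2.3.1b2"))      # True
print(is_valid_version("0.4rc1"))       # True
print(is_valid_version("2.3.4.dev3"))   # True
print(is_valid_version("1.0.0-alpha"))  # False: SemVer-style, not PEP 440
```

For real projects, prefer an existing parser such as the `packaging` library on PyPI rather than rolling your own.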
&lt;blockquote&gt;
&lt;p&gt;You might have heard of &lt;a href=&quot;http://semver.org/&quot;&gt;Semantic Versioning&lt;/a&gt;, which provides its own guidelines for version numbering. This specification partially overlaps with PEP 440, but unfortunately, they&apos;re not entirely compatible. For example, Semantic Versioning&apos;s recommendation for prerelease versioning uses a scheme such as &lt;code&gt;1.0.0-alpha+001&lt;/code&gt; that is not compliant with PEP 440.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Many DVCSes, such as Git and Mercurial, can generate version numbers based on an identifying hash (for Git, refer to &lt;code&gt;git describe&lt;/code&gt;). Unfortunately, this system isn&apos;t compatible with the scheme defined by PEP 440: for one thing, identifying hashes aren&apos;t orderable.&lt;/p&gt;
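One way around that is to keep the commit count as an orderable component and demote the hash to a PEP 440 "local version" label, which is ignored for ordering. Here is a sketch of that idea; the `describe_to_pep440` helper is my own illustration, not a standard tool:

```python
import re

def describe_to_pep440(describe):
    """Convert `git describe` output such as "1.2.3-14-g2414721"
    into a PEP 440-compatible version string.

    The commit count becomes an orderable .postN segment, and the
    (unorderable) hash goes into a "+..." local version label.
    """
    match = re.match(
        r"^(?P<tag>.+?)-(?P<count>\d+)-g(?P<hash>[0-9a-f]+)$", describe)
    if match is None:
        # Exactly on a tag: the tag is already a plain version.
        return describe
    return "{tag}.post{count}+g{hash}".format(**match.groupdict())

print(describe_to_pep440("1.2.3"))              # 1.2.3
print(describe_to_pep440("1.2.3-14-g2414721"))  # 1.2.3.post14+g2414721
```

Tools such as pbr or setuptools-scm implement more complete versions of this translation.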
&lt;p&gt;These are only some of the first questions you might have. If you have any others that you would like me to answer, feel free to write a comment below. The same goes if you have any other pieces of advice you&apos;d like to share!&lt;/p&gt;
</content:encoded></item><item><title>How I stopped merging broken code</title><link>https://julien.danjou.info/blog/how-i-stopped-merging-broken-code/</link><guid isPermaLink="true">https://julien.danjou.info/blog/how-i-stopped-merging-broken-code/</guid><description>It&apos;s been a while since I moved all my projects to GitHub. It&apos;s convenient to host Git projects, and the collaboration workflow is smooth.</description><pubDate>Tue, 03 Jul 2018 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;It&apos;s been a while since I moved all my projects to &lt;a href=&quot;https://github.com&quot;&gt;GitHub&lt;/a&gt;. It&apos;s convenient to host Git projects, and the collaboration workflow is smooth.&lt;/p&gt;
&lt;p&gt;I love pull requests to merge code. I review them, I send them, I merge them. The fact that you can plug them into a continuous integration system is great and makes sure that you don&apos;t merge code that will break your software. I usually have &lt;a href=&quot;https://travis-ci.com&quot;&gt;Travis-CI&lt;/a&gt; set up to run my unit tests and code style checks.&lt;/p&gt;
&lt;p&gt;The problem with the GitHub workflow is that it allows merging untested code.&lt;/p&gt;
&lt;p&gt;What?&lt;/p&gt;
&lt;p&gt;Yes, it does. If you think that your pull requests, all green decorated, are ready to be merged, you&apos;re wrong.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/07/Screen-Shot-2018-06-20-at-17.12.11.png&quot; alt=&quot;This might not be as good as you think&quot; /&gt;&lt;/p&gt;
&lt;p&gt;You see, pull requests on GitHub are marked as valid as soon as the continuous integration system passes and indicates that everything is valid. However, if the target branch (let&apos;s say, &lt;code&gt;master&lt;/code&gt;) is updated while the pull request is open, nothing forces that pull request to be retested against the new &lt;code&gt;master&lt;/code&gt; branch. You think that the code in this pull request works, while that might have become false.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/07/new-master-pr-ci-pass.png&quot; alt=&quot;Master moved, the pull request is not up to date though it&apos;s still marked as passing integration.&quot; /&gt;&lt;/p&gt;
&lt;p&gt;So it might be that what went into your &lt;code&gt;master&lt;/code&gt; branch now breaks this not-yet-merged pull request. You&apos;ve no clue. You&apos;ll trust GitHub, press that green merge button, and you&apos;ll break your software: the tests can fail for any number of reasons once the two branches are combined.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/07/merge-ci-fail.png&quot; alt=&quot;If the pull request has not been updated with the latest version of its target branch, it might break your integration.&quot; /&gt;&lt;/p&gt;
&lt;p&gt;The good news is that&apos;s something that&apos;s solvable with the &lt;em&gt;strict workflow&lt;/em&gt; that &lt;a href=&quot;https://mergify.io&quot;&gt;Mergify&lt;/a&gt; provides. There&apos;s a nice explanation and example in Mergify&apos;s blog post &lt;em&gt;&lt;a href=&quot;https://medium.com/@mergifyio/you-are-merging-untested-code-1b9f1a10d533&quot;&gt;You are merging untested code&lt;/a&gt;&lt;/em&gt; that I advise you to read. What Mergify provides here is a way to serialize the merge of pull requests while making sure that they are always updated with the latest version of their target branch. It makes sure that there&apos;s no way to merge broken code.&lt;/p&gt;
&lt;p&gt;That&apos;s a workflow I&apos;ve now adopted and automated on all my repositories, and we&apos;ve been using such a workflow for &lt;a href=&quot;https://gnocchi.xyz&quot;&gt;Gnocchi&lt;/a&gt; for more than a year, with great success. Once you start using it, it becomes impossible to go back!&lt;/p&gt;
</content:encoded></item><item><title>Stop merging your pull requests manually</title><link>https://julien.danjou.info/blog/stop-merging-your-pull-request-manually/</link><guid isPermaLink="true">https://julien.danjou.info/blog/stop-merging-your-pull-request-manually/</guid><description>If there&apos;s something that I hate, it&apos;s doing things manually when I know I could automate them. Am I alone in this situation? I doubt so.  Nevertheless, every day, they are thousands of developers usi</description><pubDate>Wed, 20 Jun 2018 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;If there&apos;s something that I hate, it&apos;s doing things manually when I know I could automate them. Am I alone in this situation? I doubt it.&lt;/p&gt;
&lt;p&gt;Nevertheless, every day, there are thousands of developers using &lt;a href=&quot;https://github.com&quot;&gt;GitHub&lt;/a&gt; who do the same thing over and over again: they click on this button:&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/06/Screen-Shot-2018-06-19-at-18.12.39.png&quot; alt=&quot;Screen-Shot-2018-06-19-at-18.12.39&quot; /&gt;&lt;/p&gt;
&lt;p&gt;This does not make any sense.&lt;/p&gt;
&lt;p&gt;Don&apos;t get me wrong. It makes sense to merge pull requests. It just does not make sense that someone has to push this damn button every time.&lt;/p&gt;
&lt;p&gt;It does not make any sense because every development team in the world has a known list of prerequisites before they merge a pull request. Those requirements are almost always the same, something along these lines:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Is the test suite passing?&lt;/li&gt;
&lt;li&gt;Is the documentation up to date?&lt;/li&gt;
&lt;li&gt;Does this follow our code style guideline?&lt;/li&gt;
&lt;li&gt;Have N developers reviewed this?&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;As this list gets longer, the merging process becomes more error-prone. &quot;Oops, John just clicked on the merge button before enough developers had reviewed the patch.&quot; Ring a bell?&lt;/p&gt;
&lt;p&gt;In my team, we&apos;re like every team out there. We know what our criteria for merging code into our repository are. That&apos;s why we set up a continuous integration system that runs our test suite each time somebody creates a pull request. We also require the code to be reviewed by 2 members of the team before it&apos;s approved.&lt;/p&gt;
&lt;p&gt;When those conditions are all set, I want the code to be merged.&lt;/p&gt;
&lt;p&gt;Without clicking a single button.&lt;/p&gt;
&lt;p&gt;That&apos;s exactly how &lt;a href=&quot;https://mergify.io&quot;&gt;Mergify&lt;/a&gt; started.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/06/github-branching-1.png&quot; alt=&quot;github-branching-1&quot; /&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://mergify.io&quot;&gt;Mergify&lt;/a&gt; is a service that pushes that merge button for you. You define rules in the &lt;code&gt;.mergify.yml&lt;/code&gt; file of your repository, and when the rules are satisfied, Mergify merges the pull request.&lt;/p&gt;
&lt;p&gt;No need to press any button.&lt;/p&gt;
&lt;p&gt;Take a random pull request, like this one:&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/06/Screen-Shot-2018-06-20-at-17.12.11.png&quot; alt=&quot;Screen-Shot-2018-06-20-at-17.12.11&quot; /&gt;&lt;/p&gt;
&lt;p&gt;This comes from a small project that does not have a lot of continuous integration services set up, just Travis. In this pull request, everything&apos;s green: one of the owners reviewed the code, and the tests are passing. Therefore, the code should already be merged: but it&apos;s there, hanging, chilling, waiting for someone to push that merge button. Someday.&lt;/p&gt;
&lt;p&gt;With &lt;a href=&quot;https://mergify.io&quot;&gt;Mergify&lt;/a&gt; enabled, you&apos;d just have to put this &lt;code&gt;.mergify.yml&lt;/code&gt; at the root of the repository:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;rules:
  default:
    protection:
      required_status_checks:
        contexts:
          - continuous-integration/travis-ci
      required_pull_request_reviews:
        required_approving_review_count: 1
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;With such a configuration, &lt;a href=&quot;https://mergify.io&quot;&gt;Mergify&lt;/a&gt; enforces the desired restrictions, i.e., Travis passes, and at least one project member has reviewed the code. As soon as those conditions are met, the pull request is automatically merged.&lt;/p&gt;
&lt;p&gt;We built &lt;a href=&quot;https://mergify.io&quot;&gt;Mergify&lt;/a&gt; as a &lt;strong&gt;free service for open-source projects&lt;/strong&gt;. The &lt;a href=&quot;https://github.com/mergifyio/mergify-engine&quot;&gt;engine powering the service&lt;/a&gt; is also open-source.&lt;/p&gt;
&lt;p&gt;Now go &lt;a href=&quot;https://mergify.io&quot;&gt;check it out&lt;/a&gt; and stop letting those pull requests hang out one second more. Merge them!&lt;/p&gt;
&lt;p&gt;If you have any questions, feel free to ask us or write a comment below! And stay tuned, as Mergify offers a few other features that I can&apos;t wait to talk about!&lt;/p&gt;
</content:encoded></item><item><title>A simple filtering syntax tree in Python</title><link>https://julien.danjou.info/blog/simple-filtering-syntax-tree-in-python/</link><guid isPermaLink="true">https://julien.danjou.info/blog/simple-filtering-syntax-tree-in-python/</guid><description>Working on various pieces of software those last years, I noticed that there&apos;s always a feature that requires implementing some DSL.  The problem with DSL is that it is never the road that you want to</description><pubDate>Thu, 03 May 2018 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Working on various pieces of software those last years, I noticed that there&apos;s always a feature that requires implementing some DSL.&lt;/p&gt;
&lt;p&gt;The problem with a DSL is that it&apos;s never the road you wanted to go down. I remember how creating my first DSL was fascinating: after using programming languages for years, I was finally designing my own tiny language!&lt;/p&gt;
&lt;p&gt;A new language that my users would have to learn and master. Oh, it had nothing new; it was a subset of something, inspired by my years of C, Perl or Python, who knows. And that&apos;s the terrible part about DSLs: they are a marvelous tradeoff between the power they give users, allowing them to define their needs precisely, and the cumbersomeness of learning a language that will be useful in only one specific situation.&lt;/p&gt;
&lt;p&gt;In this blog post, I would like to introduce a very unsophisticated way of implementing the syntax tree that could be used as a basis for a DSL. The goal of that syntax tree will be filtering. The problem it will solve is the following: having a piece of data, we want the user to tell us if the data matches their conditions or not.&lt;/p&gt;
&lt;p&gt;To give a concrete example: a machine wants to grant the user the ability to filter the beans that it should keep. What the machine passes to the filter is the size of the current bean, and the filter should return either &lt;code&gt;true&lt;/code&gt; or &lt;code&gt;false&lt;/code&gt;, based on the condition defined by the user: for example, only keep beans that are between 1 and 2 centimeters or between 4 and 6 centimeters.&lt;/p&gt;
&lt;p&gt;The number of conditions that the users can define could be quite considerable, and we want to provide at least a basic set of predicate operators: &lt;code&gt;equal&lt;/code&gt;, &lt;code&gt;greater than&lt;/code&gt; and &lt;code&gt;less than&lt;/code&gt;. We also want the user to be able to combine those, so we&apos;ll add the logical operators &lt;code&gt;or&lt;/code&gt; and &lt;code&gt;and&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;A set of conditions can be seen as a tree, where nodes are either predicates, which are leaves and have no children, or logical operators, which do have children. For example, the propositional logic formula &lt;code&gt;φ1 ∨ (φ2 ∨ φ3)&lt;/code&gt; can be represented as a tree like this:&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/05/Mqs5i-1.png&quot; alt=&quot;Mqs5i-1&quot; /&gt;&lt;/p&gt;
&lt;p&gt;With this in mind, it appears that the natural solution is going to be recursive: handle predicates as terminals, and if the node is a logical operator, recurse over its children.&lt;br /&gt;
Since we&apos;re working in Python, we&apos;re going to use Python itself to evaluate our syntax tree.&lt;/p&gt;
&lt;p&gt;The simplest way to write a tree in Python is going to be using dictionaries. A dictionary will represent one node and will have only one key and one value: the key will be the name of the operator (&lt;code&gt;equal&lt;/code&gt;, &lt;code&gt;greater than&lt;/code&gt;, &lt;code&gt;or&lt;/code&gt;, &lt;code&gt;and&lt;/code&gt;…) and the value will be the argument of this operator if it is a predicate, or a list of children (as dictionaries) if it is a logical operator.&lt;/p&gt;
&lt;p&gt;For example, to filter our bean, we would create a tree such as:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;{&quot;or&quot;: [
  {&quot;and&quot;: [
    {&quot;ge&quot;: 1},
    {&quot;le&quot;: 2},
  ]},
  {&quot;and&quot;: [
    {&quot;ge&quot;: 4},
    {&quot;le&quot;: 6},
  ]},
]}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The goal here is to walk the tree, evaluate each of its leaves, and return the final result: if we passed &lt;code&gt;5&lt;/code&gt; to this filter, it would return &lt;code&gt;True&lt;/code&gt;, and if we passed &lt;code&gt;10&lt;/code&gt;, it would return &lt;code&gt;False&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;Here&apos;s how we could implement a very shallow filter that only handles predicates (for now):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import operator

class InvalidQuery(Exception):
    pass

class Filter(object):
    binary_operators = {
        &quot;eq&quot;: operator.eq,
        &quot;gt&quot;: operator.gt,
        &quot;ge&quot;: operator.ge,
        &quot;lt&quot;: operator.lt,
        &quot;le&quot;: operator.le,
    }

    def __init__(self, tree):
        # Parse the tree and store the evaluator
        self._eval = self.build_evaluator(tree)

    def __call__(self, value):
        # Call the evaluator with the value
        return self._eval(value)

    def build_evaluator(self, tree):
        try:
            # Pick the first item of the dictionary.
            # If the dictionary has multiple keys/values,
            # only the first one is used (on recent Python
            # versions, the first one inserted).
            # The key is the operator name (e.g. &quot;eq&quot;)
            # and the value is the argument for it
            operator, nodes = list(tree.items())[0]
        except Exception:
            raise InvalidQuery(&quot;Unable to parse tree %s&quot; % tree)
        try:
            # Lookup the operator name
            op = self.binary_operators[operator]
        except KeyError:
            raise InvalidQuery(&quot;Unknown operator %s&quot; % operator)
        # Return a function (lambda) that takes
        # the filtered value as argument and returns
        # the result of the predicate evaluation
        return lambda value: op(value, nodes)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;You can use this &lt;code&gt;Filter&lt;/code&gt; class by passing a predicate such as &lt;code&gt;{&quot;eq&quot;: 4}&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&amp;gt;&amp;gt;&amp;gt; f = Filter({&quot;eq&quot;: 4})
&amp;gt;&amp;gt;&amp;gt; f(2)
False
&amp;gt;&amp;gt;&amp;gt; f(4)
True
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This &lt;code&gt;Filter&lt;/code&gt; class works but is quite limited as we did not provide logical operators. Here&apos;s a complete implementation that supports binary operators &lt;code&gt;and&lt;/code&gt; and &lt;code&gt;or&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import operator

class InvalidQuery(Exception):
    pass

class Filter(object):
    binary_operators = {
        u&quot;=&quot;: operator.eq,
        u&quot;==&quot;: operator.eq,
        u&quot;eq&quot;: operator.eq,

        u&quot;&amp;lt;&quot;: operator.lt,
        u&quot;lt&quot;: operator.lt,

        u&quot;&amp;gt;&quot;: operator.gt,
        u&quot;gt&quot;: operator.gt,

        u&quot;&amp;lt;=&quot;: operator.le,
        u&quot;≤&quot;: operator.le,
        u&quot;le&quot;: operator.le,

        u&quot;&amp;gt;=&quot;: operator.ge,
        u&quot;≥&quot;: operator.ge,
        u&quot;ge&quot;: operator.ge,

        u&quot;!=&quot;: operator.ne,
        u&quot;≠&quot;: operator.ne,
        u&quot;ne&quot;: operator.ne,
    }

    multiple_operators = {
        u&quot;or&quot;: any,
        u&quot;∨&quot;: any,
        u&quot;and&quot;: all,
        u&quot;∧&quot;: all,
    }

    def __init__(self, tree):
        self._eval = self.build_evaluator(tree)

    def __call__(self, value):
        return self._eval(value)

    def build_evaluator(self, tree):
        try:
            operator, nodes = list(tree.items())[0]
        except Exception:
            raise InvalidQuery(&quot;Unable to parse tree %s&quot; % tree)
        try:
            op = self.multiple_operators[operator]
        except KeyError:
            try:
                op = self.binary_operators[operator]
            except KeyError:
                raise InvalidQuery(&quot;Unknown operator %s&quot; % operator)
            return lambda value: op(value, nodes)
        # Iterate over every item in the list of the value linked
        # to the logical operator, and compile it down to its own
        # evaluator.
        elements = [self.build_evaluator(node) for node in nodes]
        return lambda value: op((e(value) for e in elements))
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;To support the &lt;code&gt;and&lt;/code&gt; and &lt;code&gt;or&lt;/code&gt; operators, we leverage the &lt;code&gt;all&lt;/code&gt; and &lt;code&gt;any&lt;/code&gt; built-in Python functions. They are called with a generator as argument that evaluates each one of the sub-evaluators, which does the trick.&lt;/p&gt;
&lt;p&gt;Unicode is the new sexy, so I&apos;ve also added Unicode symbols support.&lt;/p&gt;
&lt;p&gt;And it is now possible to implement our full example:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&amp;gt;&amp;gt;&amp;gt; f = Filter(
...     {&quot;∨&quot;: [
...         {&quot;∧&quot;: [
...             {&quot;≥&quot;: 1},
...             {&quot;≤&quot;: 2},
...         ]},
...         {&quot;∧&quot;: [
...             {&quot;≥&quot;: 4},
...             {&quot;≤&quot;: 6},
...         ]},
...     ]})
&amp;gt;&amp;gt;&amp;gt; f(5)
True
&amp;gt;&amp;gt;&amp;gt; f(8)
False
&amp;gt;&amp;gt;&amp;gt; f(1)
True
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;As an exercise, you could try to add the &lt;code&gt;not&lt;/code&gt; operator, which deserves its own category as it is a unary operator!&lt;/p&gt;
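For reference, here is one possible solution sketch, condensed into a self-contained class (the Unicode operator aliases are trimmed for brevity): `not` gets its own `unary_operators` table, and its single child subtree is compiled before being negated.

```python
import operator

class InvalidQuery(Exception):
    pass

class Filter(object):
    binary_operators = {
        "eq": operator.eq, "ne": operator.ne,
        "lt": operator.lt, "le": operator.le,
        "gt": operator.gt, "ge": operator.ge,
    }
    multiple_operators = {"or": any, "and": all}
    # Unary operators take a single child node, not a list of them.
    unary_operators = {"not": operator.not_}

    def __init__(self, tree):
        self._eval = self.build_evaluator(tree)

    def __call__(self, value):
        return self._eval(value)

    def build_evaluator(self, tree):
        try:
            op_name, nodes = list(tree.items())[0]
        except Exception:
            raise InvalidQuery("Unable to parse tree %s" % tree)
        if op_name in self.unary_operators:
            op = self.unary_operators[op_name]
            # Compile the single child subtree, then negate its result.
            element = self.build_evaluator(nodes)
            return lambda value: op(element(value))
        if op_name in self.multiple_operators:
            op = self.multiple_operators[op_name]
            elements = [self.build_evaluator(node) for node in nodes]
            return lambda value: op(e(value) for e in elements)
        try:
            op = self.binary_operators[op_name]
        except KeyError:
            raise InvalidQuery("Unknown operator %s" % op_name)
        return lambda value: op(value, nodes)

# "Not between 1 and 2" keeps everything outside that range.
f = Filter({"not": {"and": [{"ge": 1}, {"le": 2}]}})
print(f(5))   # True
print(f(1))   # False
```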
&lt;p&gt;In the next blog post, we will see how to improve that filter with more features, and how to implement a domain-specific language on top of it, to make humans happy when writing the filter!&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/05/IMG_20180427_180044--1-.jpg&quot; alt=&quot;Hole and Henni – François Charlier, 2018In this drawing, the artist represents the deepness of functional programming and how its horse power can help you escape many dark situations.&quot; /&gt;&lt;/p&gt;
</content:encoded></item><item><title>Correct HTTP scheme in WSGI with Cloudflare</title><link>https://julien.danjou.info/blog/correct-http-scheme-in-wsgi-with-cloudflare/</link><guid isPermaLink="true">https://julien.danjou.info/blog/correct-http-scheme-in-wsgi-with-cloudflare/</guid><description>I&apos;ve recently been using Cloudflare as an HTTP frontend for some applications, and getting things working correctly with WSGI was unobvious.</description><pubDate>Wed, 25 Apr 2018 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;I&apos;ve recently been using &lt;a href=&quot;https://cloudflare.com&quot;&gt;Cloudflare&lt;/a&gt; as an HTTP frontend for some applications, and getting things working correctly with WSGI was unobvious.&lt;/p&gt;
&lt;p&gt;In Python, &lt;a href=&quot;https://en.wikipedia.org/wiki/Web_Server_Gateway_Interface&quot;&gt;WSGI&lt;/a&gt; is the standard protocol for writing a Web application. All the Web frameworks that I know follow it. And many of those Web frameworks leverage some request environment variables to learn how the request has been made.&lt;/p&gt;
&lt;p&gt;One of those environment variables is &lt;code&gt;wsgi.url_scheme&lt;/code&gt;, and it contains either &lt;code&gt;http&lt;/code&gt; or &lt;code&gt;https&lt;/code&gt;, depending on the protocol that has been used to connect to your WSGI server.&lt;/p&gt;
&lt;p&gt;And that&apos;s where things can get messy. If you enable SSL at Cloudflare in &quot;Flexible&quot; mode, your visitor will connect to your Web site using HTTPS, but Cloudflare will connect to your backend using HTTP. That means that for your application, the traffic will appear to be over HTTP, and not HTTPS: &lt;code&gt;wsgi.url_scheme&lt;/code&gt; will be set to &lt;code&gt;http&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/04/Screen-Shot-2018-04-19-at-22.43.55.png&quot; alt=&quot;Cloudflare SSL setting&quot; /&gt;&lt;/p&gt;
&lt;p&gt;That can lead to several problems with some frameworks. For example, the function &lt;code&gt;url_for&lt;/code&gt; of &lt;a href=&quot;http://flask.pocoo.org/&quot;&gt;Flask&lt;/a&gt; will rely on this variable to generate the scheme part of any URL. In this case, it would, therefore, generate URLs starting with &lt;code&gt;http://&lt;/code&gt; whereas your visitors are using &lt;code&gt;https&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;The usual workaround is to leverage the &lt;code&gt;X-Forwarded-Proto&lt;/code&gt; header that is actually &lt;a href=&quot;https://support.cloudflare.com/hc/en-us/articles/200170986-How-does-Cloudflare-handle-HTTP-Request-headers-&quot;&gt;set by Cloudflare&lt;/a&gt;. In the case where Cloudflare proxies the request to your HTTP host, this will be set to &lt;code&gt;https&lt;/code&gt;. By using the &lt;a href=&quot;http://werkzeug.pocoo.org/docs/contrib/fixers/#werkzeug.contrib.fixers.ProxyFix&quot;&gt;werkzeug.contrib.fixers.ProxyFix&lt;/a&gt; middleware, the variable &lt;code&gt;wsgi.url_scheme&lt;/code&gt; will be set to whatever &lt;code&gt;X-Forwarded-Proto&lt;/code&gt; contains.&lt;/p&gt;
&lt;p&gt;That would work fine for any application that is directly behind Cloudflare, or any single HTTP reverse proxy.&lt;/p&gt;
&lt;p&gt;But that does not work as soon as you have multiple reverse proxies. If your application runs on top of &lt;a href=&quot;https://heroku.com&quot;&gt;Heroku&lt;/a&gt; for example, they already provide a reverse proxy and overwrite those headers. That gives the following: &lt;code&gt;Visitor -HTTPS-&amp;gt; Cloudflare -HTTP-&amp;gt; Heroku proxy -HTTP-&amp;gt; Heroku dyno&lt;/code&gt;. Once your dyno is reached, &lt;code&gt;X-Forwarded-Proto&lt;/code&gt; will be set to &lt;code&gt;http&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;Damn it!&lt;/p&gt;
&lt;p&gt;The proper solution is, therefore, to have all your proxies implement &lt;a href=&quot;https://tools.ietf.org/html/rfc7239&quot;&gt;RFC7239&lt;/a&gt;. This RFC defines a new &lt;code&gt;Forwarded&lt;/code&gt; header that can contain all the hops that have forwarded this request, including all the scheme and IP addresses. Unfortunately, this is not implemented by Cloudflare nor Heroku. Bummer!&lt;/p&gt;
&lt;p&gt;Finally, Cloudflare provides yet another custom header named &lt;code&gt;Cf-Visitor&lt;/code&gt;. It contains a JSON payload with the original HTTP scheme used by the visitor: we can use that to solve our issue. Here&apos;s a WSGI middleware to do that:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import json

class CloudflareProxy(object):
    &quot;&quot;&quot;This middleware sets the proto scheme based on the Cf-Visitor header.&quot;&quot;&quot;

    def __init__(self, app):
        self.app = app

    def __call__(self, environ, start_response):
        cf_visitor = environ.get(&quot;HTTP_CF_VISITOR&quot;)
        if cf_visitor:
            try:
                cf_visitor = json.loads(cf_visitor)
            except ValueError:
                pass
            else:
                proto = cf_visitor.get(&quot;scheme&quot;)
                if proto is not None:
                    environ[&apos;wsgi.url_scheme&apos;] = proto
        return self.app(environ, start_response)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;You can then use it to encapsulate your WSGI application with &lt;code&gt;app = CloudflareProxy(app)&lt;/code&gt;.&lt;/p&gt;
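To check the behavior without deploying anything, you can simulate a Cloudflare request with a hand-built WSGI environ; the middleware is repeated below so the snippet runs on its own:

```python
import json

class CloudflareProxy(object):
    """The middleware from above, repeated to keep this demo
    self-contained: it overrides wsgi.url_scheme from Cf-Visitor."""

    def __init__(self, app):
        self.app = app

    def __call__(self, environ, start_response):
        cf_visitor = environ.get("HTTP_CF_VISITOR")
        if cf_visitor:
            try:
                cf_visitor = json.loads(cf_visitor)
            except ValueError:
                pass
            else:
                proto = cf_visitor.get("scheme")
                if proto is not None:
                    environ["wsgi.url_scheme"] = proto
        return self.app(environ, start_response)

def app(environ, start_response):
    # A trivial WSGI app that just reports the scheme it saw.
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [environ["wsgi.url_scheme"].encode()]

wrapped = CloudflareProxy(app)

# Simulate a Cloudflare-proxied request: the transport to the
# backend is plain HTTP, but the visitor connected over HTTPS.
environ = {
    "wsgi.url_scheme": "http",
    "HTTP_CF_VISITOR": '{"scheme": "https"}',
}
body = wrapped(environ, lambda status, headers: None)
print(b"".join(body))  # b'https'
```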
&lt;p&gt;If you&apos;re using JavaScript, I noticed that the &lt;a href=&quot;https://github.com/jshttp/forwarded&quot;&gt;forwarded&lt;/a&gt; library provides that same support for Cloudflare alongside all the other headers – even RFC7239!&lt;/p&gt;
</content:encoded></item><item><title>Lessons from OpenStack Telemetry: Deflation</title><link>https://julien.danjou.info/blog/lessons-from-openstack-telemetry-deflation/</link><guid isPermaLink="true">https://julien.danjou.info/blog/lessons-from-openstack-telemetry-deflation/</guid><description>This post is the second and final episode of Lessons from OpenStack Telemetry. If you have missed the first post, you can read it here.</description><pubDate>Thu, 19 Apr 2018 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;This post is the second and final episode of &lt;em&gt;Lessons from OpenStack Telemetry&lt;/em&gt;. If you have missed the first post, you can read it &lt;a href=&quot;https://julien.danjou.info/blog/lessons-from-openstack-telemetry-incubation&quot;&gt;here&lt;/a&gt;.&lt;/p&gt;
&lt;h2&gt;Splitting&lt;/h2&gt;
&lt;p&gt;At some point, the rules on adding new projects were relaxed with the Big Tent initiative, allowing us to rename ourselves the OpenStack Telemetry team and to split Ceilometer into several subprojects: Aodh (alarm evaluation functionality) and Panko (events storage). Gnocchi was able to join the OpenStack Telemetry party for its first anniversary.&lt;/p&gt;
&lt;p&gt;Finally being able to split Ceilometer into several independent pieces of software allowed us to tackle technical debt more rapidly. We built autonomous teams for each project and gave them the same liberty they had in Ceilometer. The cost of migrating the code base to several projects was higher than we wanted it to be, but we managed to build a clear migration path nonetheless.&lt;/p&gt;
&lt;h2&gt;Gnocchi Shamble&lt;/h2&gt;
&lt;p&gt;With Gnocchi in town, we stopped all efforts on Ceilometer storage and API and expected people to adopt Gnocchi. What we underestimated is the unwillingness of many operators to think about telemetry. They did not want to deploy anything to have telemetry features in the first place, so adding yet another component (a timeseries database) to get proper metric features was seen as a burden – and sometimes not seen at all.&lt;br /&gt;
Indeed, we also did not communicate enough on our vision for that transition. After two years of existence, many operators were asking what Gnocchi was and what they needed it for. They deployed Ceilometer and its bogus storage and API and were confused about needing yet another piece of software.&lt;/p&gt;
&lt;p&gt;It took us more than two years to deprecate the Ceilometer storage and API, which is way too long.&lt;/p&gt;
&lt;h2&gt;Deflation&lt;/h2&gt;
&lt;p&gt;In the meantime, people were leaving the OpenStack boat. Soon enough, we started to feel the shortage of human resources. Smartly, we never followed the OpenStack trend of imposing blueprints, specs, bug reports or any other process on contributors, following my list of &lt;a href=&quot;https://julien.danjou.info/blog/foss-projects-management-bad-practice&quot;&gt;open source best practices&lt;/a&gt;. This flexibility allowed us to iterate more rapidly: compared to other OpenStack projects, we were going faster proportionally to the size of our contributor base.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://images.unsplash.com/photo-1520018319835-74bf61f79844?ixlib=rb-0.3.5&amp;amp;q=80&amp;amp;fm=jpg&amp;amp;crop=entropy&amp;amp;cs=tinysrgb&amp;amp;w=1080&amp;amp;fit=max&amp;amp;ixid=eyJhcHBfaWQiOjExNzczfQ&amp;amp;s=7b410a77641efbb205b4157f7b4c62b0&quot; alt=&quot;Capturer Le moment&quot; /&gt;&lt;/p&gt;
&lt;p&gt;Nonetheless, we felt like we were bailing out a sinking ship. Our contributors were disappearing while we were swamped with technical debt: half-baked features, unfinished migrations, legacy choices and temporary hacks. After the big party, we had to wash the dishes and sweep the floor.&lt;/p&gt;
&lt;p&gt;Being part of OpenStack started to feel like a burden in many ways. The inertia of OpenStack as a big project was beginning to surface, so we put in a lot of effort to dodge most of its implications. Consequently, the team was perceived as an outlier, which does not help, especially when you have to interact a lot with your neighbors.&lt;/p&gt;
&lt;p&gt;The OpenStack Foundation never understood the organization of our team. They would refer to us as &quot;Ceilometer&quot; even though we had formally renamed ourselves &quot;Telemetry&quot;, since we encompassed four server projects and a few libraries. For example, while Gnocchi was an OpenStack project for two years before leaving, it was never listed on the &lt;a href=&quot;https://www.openstack.org/software/project-navigator/&quot;&gt;project navigator&lt;/a&gt; maintained by the foundation.&lt;/p&gt;
&lt;p&gt;That&apos;s a funny anecdote that demonstrates the peculiarity of our team, and how it has been both a strength and a weakness.&lt;/p&gt;
&lt;h2&gt;Competition&lt;/h2&gt;
&lt;p&gt;Nobody was trying to do what we were doing when we started Ceilometer. We filled the space of metering OpenStack. However, as the number of companies involved increased, and the friction along with it, some people grew unhappy. The race to get a seat at the table and become a &lt;em&gt;Project Team Leader&lt;/em&gt; was strong, so some people preferred to create their own project rather than play the contribution game. In many areas, including ours, that divided the effort up to a ridiculous point where several teams were doing the exact same thing, or were trying to step on each other&apos;s toes to kill the competitors.&lt;/p&gt;
&lt;p&gt;We spent a significant amount of time trying to bring other teams into the Telemetry scope to unify our efforts, without much success. Some companies were not embracing open source because of their cultural differences, while others had no interest in joining a project where they would not be seen as the leader.&lt;/p&gt;
&lt;p&gt;That fragmentation did not help us, but also did not do much harm in the end. Most of those projects are now either dead or becoming irrelevant as the rest of the world caught up on what they were trying to do.&lt;/p&gt;
&lt;h2&gt;Epilogue&lt;/h2&gt;
&lt;p&gt;As of 2018, I&apos;m the PTL for Telemetry – because nobody else ran. The official list of maintainers for the telemetry projects is five people: two are inactive, and three are part-time. During the latest development cycle (Queens), 48 people committed to Ceilometer, though only three developers made impactful contributions. The code size has been divided by two since the peak: Ceilometer is now 25k lines of code long.&lt;/p&gt;
&lt;p&gt;Panko and Aodh have no active developers. A Red Hat colleague and I are keeping the projects afloat so they keep working.&lt;/p&gt;
&lt;p&gt;Gnocchi has quietly thrived since it left OpenStack. The stains from having been part of OpenStack are not all gone yet. It has a small community, but its users see its real value and enjoy using it.&lt;/p&gt;
&lt;p&gt;Those last six years have been intense, and riding the OpenStack train has been amazing. As I concluded in the first blog post of this series, most of us had a great time overall; the point of those writings is not to complain, but to reflect.&lt;/p&gt;
&lt;p&gt;I find it fascinating to see how the evolution of a piece of software and the metamorphosis of its community are entangled. The amount of politics that a corporately-backed project of this size generates is majestic and has a prominent influence on the outcome of software development.&lt;/p&gt;
&lt;p&gt;So, what&apos;s next? Well, as far as Ceilometer is concerned, we still have ideas and plans to keep shrinking its footprint to a minimum. We hope that one day Ceilometer will become irrelevant – at least that&apos;s what we&apos;re trying to achieve, so we don&apos;t have anything left to maintain. That mainly depends on how the myriad of OpenStack projects will choose to address their metering.&lt;/p&gt;
&lt;p&gt;We don&apos;t see any future for either Panko or Aodh.&lt;/p&gt;
&lt;p&gt;Gnocchi, now blooming outside of OpenStack, is still young and promising. We have plenty of ideas, and every new release brings fancy new features. Storing timeseries at large scale is an exciting problem. Users are happy, and the ecosystem is growing.&lt;/p&gt;
&lt;p&gt;We&apos;ll see how all of that concludes, but I&apos;m sure there will be new lessons to learn and write about in six years!&lt;/p&gt;
</content:encoded></item><item><title>Lessons from OpenStack Telemetry: Incubation</title><link>https://julien.danjou.info/blog/lessons-from-openstack-telemetry-incubation/</link><guid isPermaLink="true">https://julien.danjou.info/blog/lessons-from-openstack-telemetry-incubation/</guid><description>It was mostly around that time in 2012 that I and a couple of fellow open-source enthusiasts started working on Ceilometer, the first piece of software from the OpenStack Telemetry project. Six years</description><pubDate>Thu, 12 Apr 2018 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;It was mostly around that time in 2012 that I and a couple of fellow open-source enthusiasts started working on Ceilometer, the first piece of software from the OpenStack Telemetry project. Six years have passed since then. I&apos;ve been thinking about this blog post for several months (even years, maybe), but lacked the time and the hindsight needed to lay out my thoughts properly. In a series of posts, I would like to share my observations about the Ceilometer development history.&lt;/p&gt;
&lt;p&gt;To understand the full picture here, I think it is fair to start with a small retrospective on the project. I&apos;ll try to keep it short, and it will unmistakably be biased, even though I&apos;ll do my best to stay objective – bear with me.&lt;/p&gt;
&lt;h2&gt;Incubation&lt;/h2&gt;
&lt;p&gt;In early 2012, I remember discussing with the first Ceilometer developers the right strategy to solve the problem we were trying to address. The company I worked for wanted to run a public cloud, and billing resource usage was at the heart of its strategy. The fact that no OpenStack component exposed any consumption API was a problem.&lt;/p&gt;
&lt;p&gt;We debated how to implement those metering features in the cloud platform. There were two natural solutions: either adding resource accounting reports to each OpenStack project, or building a new piece of software on the side to cover for the lack of those functionalities.&lt;/p&gt;
&lt;p&gt;At that time there were fewer than a dozen OpenStack projects. Still, the burden of patching every project seemed like an infinite task. Having code reviewed and merged in the most significant projects took several weeks, which, considering our timeline, was a show-stopper. We wanted to go fast.&lt;/p&gt;
&lt;p&gt;Pragmatism won, and we started implementing Ceilometer using the features each OpenStack project was offering to help us: very little.&lt;/p&gt;
&lt;p&gt;Our first and obvious candidate for usage retrieval was Nova, where Ceilometer aimed to retrieve statistics about virtual machine instance utilization. Nova offered no API to retrieve that data – and still doesn&apos;t. Since it was out of the question to wait several months to have such an API exposed, we took the shortcut of polling libvirt, Xen or VMware directly from Ceilometer.&lt;/p&gt;
&lt;p&gt;That&apos;s precisely how temporary hacks become historical design. Implementing this design broke the basis of the abstraction layer that Nova aims to offer.&lt;/p&gt;
&lt;p&gt;As time passed, several leads were followed to mitigate those trade-offs in better ways. But with each development cycle, getting anything merged in OpenStack became harder and harder. It went from patches taking long to review, to a long list of requirements for merging anything. Soon, you&apos;d have to create a blueprint to track your work and write a full specification linked to that blueprint, with the specification itself reviewed by a bunch of the so-called core developers. The specification had to be a thorough document covering every aspect of the work, from the problem being solved to the technical details of the implementation. Once the specification was approved, which could take an entire cycle (6 months), you&apos;d have to make sure that the Nova team would make your blueprint a priority. To make sure it was, you would have to fly a few thousand kilometers from home to an OpenStack Summit, and orally argue with developers in a room filled with hundreds of other folks about the urgency of your feature compared to the other blueprints.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/04/ods_1-tripleo_design_session.jpg&quot; alt=&quot;An OpenStack design session in Hong-Kong, 2013&quot; /&gt;&lt;/p&gt;
&lt;p&gt;Even if you passed all of those ordeals, the code you&apos;d send could be rejected, and you&apos;d get back to updating your specification to shed light on some particular points that confused people. Back to square one.&lt;/p&gt;
&lt;p&gt;Nobody wanted to play that game. Not in the Telemetry team at least.&lt;/p&gt;
&lt;p&gt;So Ceilometer continued to grow, surfing the OpenStack hype curve. More developers were joining the project every cycle – each with their own list of ideas, features or requirements cooked up by their in-house product manager.&lt;/p&gt;
&lt;p&gt;But many features did not belong in Ceilometer. They should have been in different projects. Ceilometer was the first OpenStack project to pass through the OpenStack Technical Committee incubation process that existed before the rules were relaxed.&lt;/p&gt;
&lt;p&gt;This incubation process was uncertain, long, and painful. We had to justify the existence of the project and many of the technical choices that had been made. Where we expected the committee to challenge us on fundamental decisions, such as breaking abstraction layers, it was mostly nit-picking about Web frameworks or database storage.&lt;/p&gt;
&lt;h2&gt;Consequences&lt;/h2&gt;
&lt;p&gt;The rigidity of the process discouraged anyone from starting a new project for anything related to telemetry. Therefore, everyone went ahead and dumped their ideas into Ceilometer itself. With more than ten companies interested, friction was high, and the project was at some point pulled apart in all directions. This phenomenon was happening to every OpenStack project &lt;em&gt;anyway&lt;/em&gt;.&lt;/p&gt;
&lt;p&gt;On the one hand, many contributions brought marvelous pieces of technology to Ceilometer. We implemented several features you still don&apos;t find in any other metering system. Dynamically sharded, automatically horizontally scalable polling? Ceilometer has had that for years, whereas you can&apos;t have it in, e.g., Prometheus.&lt;/p&gt;
&lt;p&gt;On the other hand, there were tons of crappy features. Half-baked code merged because somebody needed to ship something. As the project grew further, some of us developers started to feel that this was getting out of control and could be disastrous. The technical debt was growing as fast as the project was.&lt;/p&gt;
&lt;p&gt;Several of the technical choices made were definitely &lt;em&gt;bad&lt;/em&gt;. The architecture was a mess: the messaging bus was easily overloaded, the storage engine was non-performant, etc. People would come to me (as I was the &lt;em&gt;Project Team Leader&lt;/em&gt; at that time) and ask why the REST API needed 20 minutes to reply to an autoscaling request. The willingness to solve everything for everyone was killing Ceilometer. It&apos;s around that time that I decided to step out of my role of PTL and started working on Gnocchi to solve at least one of our biggest challenges: efficient data storage.&lt;/p&gt;
&lt;p&gt;Ceilometer was also suffering from the poor quality of many OpenStack projects. As Ceilometer retrieves data from a dozen other projects, it has to use their interfaces for data retrieval (API calls, notifications) – or sometimes compensate for their lack of any interface. Users were complaining about Ceilometer malfunctioning while the root of the problem was actually on the other side, in the polled project. The polling agent would try to retrieve the list of virtual machines running on Nova, but just listing and retrieving this information required several HTTP requests to Nova. And those basic retrieval requests would overload the Nova API, which does not offer any genuine interface from which the data could be retrieved in a small number of calls – and it had terrible performance.&lt;br /&gt;
From the users&apos; point of view, the load was generated by Ceilometer. Therefore, Ceilometer &lt;strong&gt;was&lt;/strong&gt; the problem. We had to imagine new ways of circumventing tons of limitations in our sibling projects. That was exhausting.&lt;/p&gt;
&lt;p&gt;At its peak, during the Juno and Kilo releases (early 2015), the code size of Ceilometer reached 54k lines of code, and the number of committers reached 100 individuals (20 regulars). We had close to zero happy users, operators hated us, and everybody was wondering what the hell was going on in those developers&apos; minds.&lt;/p&gt;
&lt;p&gt;Nonetheless, despite the impediments, most of us had a great time working on Ceilometer. Nothing&apos;s ever perfect. I&apos;ve learned tons of things during that period, which were actually mostly non-technical. Community management, social interactions, human behavior and politics were at the heart of the adventure, offering a great opportunity for self-improvement.&lt;/p&gt;
&lt;p&gt;In the next blog post, I will cover what happened in the years that followed that booming period, up until today. Stay tuned!&lt;/p&gt;
</content:encoded></item><item><title>Is Python a Good Choice for Enterprise Projects?</title><link>https://julien.danjou.info/blog/is-python-a-good-choice-for-entreprise-projects/</link><guid isPermaLink="true">https://julien.danjou.info/blog/is-python-a-good-choice-for-entreprise-projects/</guid><description>A few weeks ago, one of my followers, Morteza, reached out and asked me the following:  &gt; I develop projects mostly with Python, but I am scared that Python is not a good choice for enterprise project</description><pubDate>Wed, 04 Apr 2018 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;A few weeks ago, one of my followers, Morteza, reached out and asked me the following:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;I develop projects mostly with Python, but I am scared that Python is not a good choice for enterprise projects. In many cases, I&apos;ve encountered a situation where Python performance was not sufficient, like thread spawning and so on, and as you know, the GIL supports one thread at the time.&lt;br /&gt;
Some friends told me to try to use Java, C++ or even Go for enterprise projects instead of Python. I see many job boards that require Python just for testing, QA or some small projects. I feel that Python is a small gun for showing my experiences and that I&apos;d have to choose an alternative language.&lt;br /&gt;
As you are advanced and professional in many topics especially in Python, I&apos;d need your advice. Is Python good enough for enterprise systems? Or should I choose an alternative language which fills the gaps that exist in Python?&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;If you have followed me for a while, you know I&apos;ve been doing Python for more than ten years now and have even written two books about it. So while I&apos;m obviously biased, before writing a reply I would like to take a step back and reassure you, dear reader, that I&apos;ve used plenty of other programming languages over the last 20 years: Perl, C, PHP, Lua, Lisp, Java, etc. I&apos;ve built tiny to big projects with some of them, and I consider that Lisp is the best programming language. 😅 Therefore, I like to think that I&apos;m not overly partial.&lt;/p&gt;
&lt;p&gt;To reply to Morteza, I would say that you first need to acknowledge that a language itself is not slow or fast. English is not faster than French; however, some French people speak faster than English people.&lt;/p&gt;
&lt;p&gt;So then, yes, CPython, the chief implementation of the Python programming language, has some limitations: the GIL (&lt;em&gt;Global Interpreter Lock&lt;/em&gt;), as Morteza says, is the most significant parallelism limiter. The rest of the language is optimized regularly, and you can follow the work done in each Python version to see where this is going. CPython gets faster with each minor version.&lt;/p&gt;
&lt;p&gt;On the other hand, don&apos;t think that Go or Java are miracles: they both have their limitations. For example, you can read this compelling presentation from Ben Bangert at Mozilla entitled &quot;&lt;a href=&quot;https://docs.google.com/presentation/d/1LO_WI3N-3p2Wp9PDWyv5B6EGFZ8XTOTNJ7Hd40WOUHo/edit?pli=1#slide=id.g70b0035b2_1_168&quot;&gt;From Python to Go and back again&lt;/a&gt;&quot;. Ben explains some of the limitations that he encountered while switching to Go.&lt;/p&gt;
&lt;p&gt;I&apos;m sure you can find problems and limitations with the Java Virtual Machine too.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://images.unsplash.com/photo-1475539175801-4f770d7d1a49?ixlib=rb-0.3.5&amp;amp;q=80&amp;amp;fm=jpg&amp;amp;crop=entropy&amp;amp;cs=tinysrgb&amp;amp;w=1080&amp;amp;fit=max&amp;amp;ixid=eyJhcHBfaWQiOjExNzczfQ&amp;amp;s=08d0de01765dce0d3f464715abc656ce&quot; alt=&quot;Two jockeys riding horses head-to-head during a race&quot; /&gt;&lt;/p&gt;
&lt;p&gt;In &lt;a href=&quot;https://scaling-python.com&quot;&gt;Scaling Python&lt;/a&gt;, I wrote a few chapters covering the GIL and how you can circumvent its limitations. If you write widely scalable applications, the GIL is not such a big deal, as you need to spread the load across multiple servers anyway, not only across several processors.&lt;/p&gt;
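&lt;p&gt;To make that concrete, here is a quick sketch of mine (not code from the book) showing the usual way around the GIL for CPU-bound work: use processes instead of threads, so each worker gets its own interpreter and its own GIL.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Sketch: sidestep the GIL for CPU-bound work by using processes.
# concurrent.futures exposes the same API for thread and process pools.
from concurrent.futures import ProcessPoolExecutor

def burn(n):
    # CPU-bound loop: run on threads, this would serialize on the GIL
    total = 0
    for i in range(n):
        total += i * i
    return total

if __name__ == &quot;__main__&quot;:
    with ProcessPoolExecutor() as executor:
        # each call runs in a separate process with its own GIL
        results = list(executor.map(burn, [1000000] * 4))
        print(results)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Swap in &lt;code&gt;ThreadPoolExecutor&lt;/code&gt; and the same code runs on threads, which is a handy way to measure the difference yourself.&lt;/p&gt;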
&lt;p&gt;There are tons of companies running Python applications at large scale, e.g. &lt;a href=&quot;https://thenewstack.io/instagram-makes-smooth-move-python-3/&quot;&gt;Instagram&lt;/a&gt;, &lt;a href=&quot;https://www.python.org/about/quotes/&quot;&gt;Google and YouTube&lt;/a&gt;, &lt;a href=&quot;https://blogs.dropbox.com/tech/?s=python&quot;&gt;Dropbox&lt;/a&gt; or &lt;a href=&quot;https://www.paypal-engineering.com/tag/python/&quot;&gt;PayPal&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Therefore, no, Python is not only for QA applications, any more than Java is only good for browser applets or Go only for devops tooling.&lt;/p&gt;
&lt;p&gt;They all are different languages that approach problems from different angles. Depending on your mindset and on the solution that you want to implement, some might appear better equipped than others. Their virtual machines or compilers are marvelous, but also have their limitations and shortcomings that you need to be aware of so you can avoid falling into a big trap.&lt;/p&gt;
&lt;p&gt;Of course, another approach is to remove all those issues by going down a layer and use a lower level language, e.g. C or C++. That&apos;ll remove those limitations for sure: no Python GIL, no Go resources leaking, no JVM startup slowness, etc. However, it&apos;ll add a &lt;em&gt;ton&lt;/em&gt; of extra work and problems that YOU will have to solve – puzzles that are already resolved by higher-level languages. That&apos;s a matter of trade-offs: do you want to write a blazingly fast program in 10 years or do you want to write a decently fast program in 1 year? 😏&lt;/p&gt;
&lt;p&gt;In the end, picking a language is not only a matter of performance but also a concern of support, community, and ecosystem. Picking battle-tested languages like Python and Java is the assurance of reliability and trustworthiness, while selecting a younger language like Rust might be an exciting ride. Doing some &quot;reality check&quot; is always worth considering before choosing a language. If you wanted to write an application that uses, e.g., AMQP and HTTP/2, are you sure that there are libraries providing those features and that are broadly used and supported? Or are you ready to commit time to maintain them yourself?&lt;/p&gt;
&lt;p&gt;Again, Python is pretty solid here. Considering its extensive track record, there are tons of widely used libraries for everything you could ever need. The community is large and the ecosystem is flourishing.&lt;/p&gt;
&lt;p&gt;In the end, I do think that yes, Python is a terrific choice for any enterprise project, and considering the number of existing projects built with it, I&apos;m not the only one thinking that way.&lt;/p&gt;
&lt;p&gt;Feel free to share your experience – or even projects – in the comments section below!&lt;/p&gt;
</content:encoded></item><item><title>Gnocchi engine optimization</title><link>https://julien.danjou.info/blog/gnocchi-engine-optimization/</link><guid isPermaLink="true">https://julien.danjou.info/blog/gnocchi-engine-optimization/</guid><description>Software speed is relative.  After all, it is the result of a set of trade-offs made between the ease of programming and the speed of hardware. The comfort of the developer and its use of multiple abs</description><pubDate>Tue, 27 Mar 2018 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Software speed is relative.&lt;/p&gt;
&lt;p&gt;After all, it is the result of a set of trade-offs between ease of programming and the speed of hardware. The comfort of developers and their use of multiple abstraction layers directly decreases the cost in time (and therefore in money), while on the other hand it increases the hardware expenditure, as the software is less performant. In the end, whichever of optimization or hardware is cheapest gets privileged.&lt;/p&gt;
&lt;p&gt;Of course, there are terrible exceptions, such as picking the wrong algorithm or including &lt;code&gt;sleep()&lt;/code&gt; calls, but the essence of it is here. Pick C to be fast, saving money on hardware and spending it on development, or pick Java to save money on development while making hardware manufacturers rich.&lt;/p&gt;
&lt;p&gt;Last month, a co-worker at &lt;a href=&quot;https://redhat.com&quot;&gt;Red Hat&lt;/a&gt; picked Gnocchi for a test run and was disappointed by the performance he saw for his particular usage. After correctly understanding the use case scenario, I wrote a small test case that implemented this scheme and popped out my favorite code profiler. You know how I roll.&lt;/p&gt;
&lt;p&gt;The profiling result made the performance issue obvious. Across its releases, Gnocchi evolved from a one-metric-at-a-time processing approach to a bunch-of-metrics-at-a-time approach – especially since Gnocchi 4 and the introduction of the &lt;em&gt;sacks&lt;/em&gt;. However, that batched approach is not yet complete in Gnocchi 4.2, and the processing engine still manipulates metrics one by one in parallel. The parallelization using processes and threads makes sure that the CPU usage is high and that I/O latency does not impact the processing too much.&lt;/p&gt;
&lt;p&gt;Processing incoming measures can therefore be schematized like this:&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/gnocchi-engine-4-1.png&quot; alt=&quot;gnocchi-engine-4&quot; /&gt;&lt;/p&gt;
&lt;p&gt;In the schema above, each operation in red is an I/O operation. The three branches I drew represent three metrics being processed. Obviously, if there were ten metrics, there would be ten branches, creating even more I/O operations. With the current Gnocchi 4.2 code, the number of I/O operations for processing a sack of metrics roughly computes to &lt;code&gt;2 + (5 × M × D × G)&lt;/code&gt; actions, with &lt;code&gt;M&lt;/code&gt; the number of metrics, &lt;code&gt;D&lt;/code&gt; the number of definitions in the archive policy, and &lt;code&gt;G&lt;/code&gt; the number of aggregation methods. For my test scenario, I used &lt;code&gt;D=1&lt;/code&gt; and &lt;code&gt;G=1&lt;/code&gt;, which is what the diagram above shows.&lt;/p&gt;
&lt;p&gt;The obvious solution is to merge the per-metric I/O operations into single I/O operations covering a bunch of metrics. This allows storage backends to batch the reading and writing operations, reducing latency and improving throughput.&lt;/p&gt;
&lt;p&gt;It took me a few dozen patches and a few code reviews from my peers to rework the internal storage engine of Gnocchi. The new engine is now ready to be used and merged into the &lt;em&gt;master&lt;/em&gt; branch.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/gnocchi-engine-5.png&quot; alt=&quot;gnocchi-engine-5&quot; /&gt;&lt;/p&gt;
&lt;p&gt;The new engine reduces the number of I/O operations needed to process a bunch of metrics to &lt;code&gt;5 + M&lt;/code&gt; – an (at least) fivefold reduction in the number of operations. In my case, for 1000 metrics being processed in a batch, with only one aggregation, that decreases the number of transactions from 5002 to 1005.&lt;/p&gt;
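&lt;p&gt;Those numbers are easy to verify with a back-of-the-envelope computation of the two formulas above (this little check is mine, not code from Gnocchi):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Quick check of the I/O operation counts for both engines.
def old_engine_ops(m, d, g):
    # Gnocchi 4.2: 2 + (5 x M x D x G) operations per sack
    return 2 + 5 * m * d * g

def new_engine_ops(m):
    # new batched engine: 5 + M operations per sack
    return 5 + m

# 1000 metrics, one archive policy definition, one aggregation method
print(old_engine_ops(1000, 1, 1))  # 5002
print(new_engine_ops(1000))        # 1005
&lt;/code&gt;&lt;/pre&gt;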
&lt;p&gt;A typical metric usually has 8 aggregation methods defined, so that could reduce the number of I/O operations from 40,002 to only 1005 for 1000 metrics – a forty-fold reduction of I/O operations. The benchmark code that I wrote, which implements the desired use case with a single aggregation, now performs more than four times faster.&lt;/p&gt;
&lt;p&gt;Not all drivers will benefit from this improvement, as some are better at batched operations than others; Redis is great at it, while Swift is not. And even if the number of I/O operations has been largely reduced, they still need to be fully executed, which can take time depending on the backend performance. It&apos;s a really great improvement, not a silver bullet.&lt;/p&gt;
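&lt;p&gt;To illustrate why Redis is so good at this, here is a hypothetical sketch using the redis-py client (not Gnocchi&apos;s actual driver code): a pipeline queues commands locally and ships them in a single round trip.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Hypothetical sketch: batched writes with a redis-py pipeline.
import redis

client = redis.Redis(host=&quot;localhost&quot;, port=6379)
pipe = client.pipeline()
for metric_id in range(1000):
    # commands pile up locally; nothing is sent on the network yet
    pipe.set(&quot;measure:%d&quot; % metric_id, b&quot;serialized-points&quot;)
# one round trip executes all 1000 writes on the server
pipe.execute()
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Swift, on the other hand, needs roughly one HTTP request per object written, which is why it benefits far less from this kind of batching.&lt;/p&gt;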
&lt;p&gt;&lt;a href=&quot;http://sileht.net&quot;&gt;Mehdi&lt;/a&gt; started to use that new internal driver API to implement a &lt;a href=&quot;http://rocksdb.org/&quot;&gt;RocksDB&lt;/a&gt; driver. While it has its own limitation (it has to be single-threaded) that we will need to circumvent, it could improve performance for the non-distributed use case by a large magnitude.&lt;/p&gt;
&lt;p&gt;This code will be included in the next Gnocchi major release in a few weeks, so stay tuned for further updates. And benchmarks, I hope!&lt;/p&gt;
</content:encoded></item><item><title>On blog migration</title><link>https://julien.danjou.info/blog/blog-migration-ghost/</link><guid isPermaLink="true">https://julien.danjou.info/blog/blog-migration-ghost/</guid><description>I&apos;ve started my first Web page in 1998 and one could say that it evolved quite a bit in the meantime. From a Frontpage designed Web site with frames, it evolved to plain HTML files. I&apos;ve started blogg</description><pubDate>Wed, 21 Mar 2018 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;I started my first Web page in 1998, and one could say it has evolved quite a bit in the meantime. From a FrontPage-designed Web site with frames, it evolved to plain HTML files. I started blogging in 2003, though the archives of this blog only go back to 2007. Truth is, many things I wrote in the first years were short (there was no Twitter) and not that relevant nowadays. Therefore, I never migrated them along the road of the many migrations this site had.&lt;/p&gt;
&lt;p&gt;The last time I switched this site&apos;s engine was in 2011, when I switched from &lt;a href=&quot;https://www.gnu.org/software/emacs-muse/index.html&quot;&gt;Emacs Muse&lt;/a&gt; (and my custom &lt;em&gt;muse-blog.el&lt;/em&gt; extension) to &lt;a href=&quot;https://github.com/hyde/hyde&quot;&gt;Hyde&lt;/a&gt;, a static Web site generator written in Python.&lt;/p&gt;
&lt;p&gt;That taught me a few things.&lt;/p&gt;
&lt;p&gt;First, you can&apos;t really know for sure which project will be a ghost in 5 years. I had no clue back then that the Hyde author would lose interest and struggle to pass maintainership to someone else. The community was not big, but it existed. Betting on a horse is part skill and part chance. My skills were probably lower seven years ago, and I may also have had bad luck.&lt;/p&gt;
&lt;p&gt;Secondly, maintaining a Web site is painful. I used to blog more regularly a few years ago, when the friction of using a dynamic blog engine was lower than spawning my deprecated static engine. Knowing that it takes 2 minutes to generate the static Web site really makes it difficult to compose and see the result at the same time without losing patience. It took me a few years to decide it was time to invest in a migration. I jumped from Hyde to &lt;a href=&quot;https://ghost.org/&quot;&gt;Ghost&lt;/a&gt;, hosted on their Pro platform, as I don&apos;t want to do any maintenance. Let&apos;s be honest, I&apos;ve no will to inflict on myself the maintenance of a JavaScript blogging engine.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://images.unsplash.com/photo-1486262715619-67b85e0b08d3?ixlib=rb-0.3.5&amp;amp;q=80&amp;amp;fm=jpg&amp;amp;crop=entropy&amp;amp;cs=tinysrgb&amp;amp;w=1080&amp;amp;fit=max&amp;amp;ixid=eyJhcHBfaWQiOjExNzczfQ&amp;amp;s=b279db1ffb6919c7cc32df9b5300cdc7&quot; alt=&quot;Macro of motor engine with gears and screws&quot; /&gt;&lt;/p&gt;
&lt;p&gt;The positive side is that this is still Markdown based, so the migration job was not so painful. Ghost offers a &lt;a href=&quot;https://api.ghost.org/&quot;&gt;REST API&lt;/a&gt; which allows manipulating most of the content. It works fine, and I was able to leverage the &lt;a href=&quot;https://github.com/rycus86/ghost-client&quot;&gt;Python ghost-client&lt;/a&gt; to write a tiny migration script to migrate every post.&lt;/p&gt;
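&lt;p&gt;The script boiled down to a loop of this shape. This is an illustrative sketch, not my actual code: the API base URL, authentication header and payload fields here are assumptions to double-check against the Ghost API documentation.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Illustrative sketch of a Markdown-to-Ghost migration loop.
# Endpoint, token and payload fields are placeholders, not the real API.
import pathlib
import requests

API = &quot;https://example.ghost.io/ghost/api/v0.11&quot;
HEADERS = {&quot;Authorization&quot;: &quot;Bearer TOKEN&quot;}

for path in sorted(pathlib.Path(&quot;posts&quot;).glob(&quot;*.md&quot;)):
    post = {
        # derive a rough title from the file name, e.g. my-first-post
        &quot;title&quot;: path.stem.replace(&quot;-&quot;, &quot; &quot;).title(),
        &quot;markdown&quot;: path.read_text(),
        &quot;status&quot;: &quot;draft&quot;,
    }
    response = requests.post(API + &quot;/posts/&quot;,
                             headers=HEADERS, json={&quot;posts&quot;: [post]})
    response.raise_for_status()
&lt;/code&gt;&lt;/pre&gt;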
&lt;p&gt;I am looking forward to sharing most of the things that I work on during the next months. I have really enjoyed reading the content of great hackers these last years, and I&apos;ve learned a ton of things by reading the adventures of smarter engineers.&lt;/p&gt;
&lt;p&gt;It might be my time to share.&lt;/p&gt;
</content:encoded></item><item><title>Scaling a polling Python application with tooz</title><link>https://julien.danjou.info/blog/scaling-a-python-application-tooz/</link><guid isPermaLink="true">https://julien.danjou.info/blog/scaling-a-python-application-tooz/</guid><description>This article is the final one of the series I wrote about scaling a large number of connections in a Python application. If you don&apos;t remember what the problem we&apos;re trying to solve is, here it is, co</description><pubDate>Mon, 05 Mar 2018 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;This article is the final one of the series I wrote about scaling a large number of connections in a Python application. If you don&apos;t remember what the problem we&apos;re trying to solve is, here it is, coming from one of my followers:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;It so happened that I&apos;m currently working on scaling some Python app. Specifically, now I&apos;m trying to figure out the best way to scale SSH connections - when one server has to connect to thousands (or even tens of thousands) of remote machines in a short period of time (say, several minutes).&lt;br /&gt;
How would you write an application that does that in a scalable way?&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The &lt;a href=&quot;https://julien.danjou.info/blog/scaling-python-application-threads&quot;&gt;first blog post&lt;/a&gt; was exploring a solution based on threads, while the &lt;a href=&quot;https://julien.danjou.info/blog/scaling-python-application-asyncio&quot;&gt;second blog post&lt;/a&gt; was exploring an architecture around &lt;em&gt;asyncio&lt;/em&gt;.&lt;/p&gt;
&lt;p&gt;In the first two articles, we wrote programs that could handle this problem by using multiple &lt;em&gt;threads&lt;/em&gt; or &lt;em&gt;asyncio&lt;/em&gt; – or both. While this worked pretty well, it had some limitations, such as only using one computer. So this time, we&apos;re going to take a different approach and use multiple computers!&lt;/p&gt;
&lt;h3&gt;The job&lt;/h3&gt;
&lt;p&gt;As we&apos;ve already seen, writing a Python application that connects to a host via ssh can be done using &lt;a href=&quot;http://docs.paramiko.org/en/&quot;&gt;Paramiko&lt;/a&gt; or &lt;a href=&quot;https://github.com/ronf/asyncssh&quot;&gt;asyncssh&lt;/a&gt;. Once again, that will not be the focus of this blog post, since it is pretty straightforward to do.&lt;/p&gt;
&lt;p&gt;To keep this exercise simple, we&apos;ll reuse our &lt;code&gt;ping&lt;/code&gt; function from the first article. It looked like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import subprocess

def ping(hostname):
    p = subprocess.Popen([&quot;ping&quot;, &quot;-c&quot;, &quot;3&quot;, &quot;-W&quot;, &quot;1&quot;, hostname],
                         stdout=subprocess.DEVNULL,
                         stderr=subprocess.DEVNULL)
    return p.wait() == 0
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;As a reminder, running this program alone and serially pinging 255 IP addresses takes more than 10 minutes. Let&apos;s try to make it faster by running it in parallel.&lt;/p&gt;
&lt;h3&gt;The architecture&lt;/h3&gt;
&lt;p&gt;Remember: if pinging 255 hosts takes 10 minutes, pinging the whole Internet is going to take forever – more than three centuries at this rate.&lt;/p&gt;
&lt;p&gt;With our ping experiment, we already divided our mission (i.e. &quot;who&apos;s alive on the Internet&quot;) into very small tasks (&quot;ping&quot;). If we want to ping 4 billion hosts, we need to run those tasks in parallel. But one computer is not going to be enough: we need to distribute those tasks to different hosts, so we can use some massive parallelism to go even faster!&lt;/p&gt;
&lt;p&gt;There are two ways to distribute such a set of tasks:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Use a queue. That works well for jobs that are not determined in advance, such as user-submitted tasks, or jobs that are going to be executed only once.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Use a distribution algorithm. That works only for tasks that are determined in advance and that are scheduled regularly, such as polling.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;We are going to pick the second option here, as those ping tasks (or polling tasks, in the original problem) should be run regularly. That approach will allow us to spread the jobs onto several processes, which can even be spread onto several nodes over a network. We also won&apos;t have to &quot;maintain&quot; the queue (i.e. make it work and monitor it), so that&apos;s also a bonus point.&lt;/p&gt;
&lt;p&gt;That&apos;s infinite horizontal scalability!&lt;/p&gt;
&lt;h3&gt;The distribution algorithm&lt;/h3&gt;
&lt;p&gt;The algorithm we&apos;re going to use to distribute this task is based on a &lt;a href=&quot;https://en.wikipedia.org/wiki/Consistent_hashing&quot;&gt;consistent hashring&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Here&apos;s how it works in short. Picture a circular ring. We map objects onto this ring. The ring is then split into partitions. Those partitions are distributed among all the workers. The workers take care of jobs that are in the partitions they are responsible for.&lt;/p&gt;
&lt;p&gt;When a new node joins the ring, it is inserted between 2 nodes and takes a bit of their workload. When a node leaves the ring, the partitions it was taking care of are reassigned to its adjacent nodes.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/consistent-hashing.png&quot; alt=&quot;Diagram of consistent hashing ring with partitions distributed among worker nodes&quot; /&gt;&lt;/p&gt;
&lt;p&gt;If you want more details, there are plenty of explanations of how this algorithm works. Feel free to look online!&lt;/p&gt;
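&lt;p&gt;To make the idea concrete, here is a minimal, self-contained sketch of a consistent hashring – this is &lt;em&gt;not&lt;/em&gt; Tooz&apos;s actual implementation, just an illustration of the principle: each node is hashed onto the ring several times (&quot;virtual nodes&quot;), an object is owned by the first node found clockwise from its position, and removing a node only remaps the objects that node owned.&lt;/p&gt;

```python
import bisect
import hashlib

def position(key):
    # Map any string to a spot on the ring (a 32-bit integer here)
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % (2 ** 32)

class HashRing:
    def __init__(self, nodes, vnodes=100):
        # Place each node on the ring many times ("virtual nodes")
        # so the partitions are spread more evenly
        self.ring = {}
        for node in nodes:
            for i in range(vnodes):
                self.ring[position("%s-%d" % (node, i))] = node
        self.positions = sorted(self.ring)

    def get_node(self, key):
        # The owner is the first node clockwise from the key's position
        i = bisect.bisect(self.positions, position(key)) % len(self.positions)
        return self.ring[self.positions[i]]

ring = HashRing(["client1", "client2", "client3"])
owners = {str(n): ring.get_node(str(n)) for n in range(1000)}

# Remove client3: only the objects client3 owned get reassigned
smaller = HashRing(["client1", "client2"])
moved = [k for k, v in owners.items() if v != smaller.get_node(k)]
assert all(owners[k] == "client3" for k in moved)
```

&lt;p&gt;Removing &lt;code&gt;client3&lt;/code&gt; reassigns only the objects it owned – the other objects keep their owner, which is what makes rebalancing cheap when nodes join or leave.&lt;/p&gt;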
&lt;p&gt;However, to make this work, we need to know which nodes are alive or dead. This is another problem to solve, and the best way to tackle it is to use a coordination mechanism. There are plenty of those, from &lt;a href=&quot;https://zookeeper.apache.org/&quot;&gt;Apache ZooKeeper&lt;/a&gt; to &lt;a href=&quot;https://coreos.com/etcd/&quot;&gt;etcd&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Without going too much into details, those pieces of software provide a network service that every node can connect to in order to manage its state. If a client gets disconnected or crashes, it&apos;s then easy to consider it removed. That enables the application to get the full list of nodes and split the ring accordingly. There&apos;s no need for any shared state between the nodes other than who&apos;s alive and running.&lt;/p&gt;
&lt;h3&gt;Using group membership&lt;/h3&gt;
&lt;p&gt;To get a list of nodes that are available to help us pinging the Internet, we need a service that provides this and a library to interact with it. Since the use case is pretty simple and I don&apos;t know which backends you like the most, we&apos;re going to use the &lt;a href=&quot;https://pypi.python.org/pypi/tooz&quot;&gt;Tooz&lt;/a&gt; library.&lt;/p&gt;
&lt;p&gt;Tooz provides a coordination mechanism on top of a large variety of backends: ZooKeeper or etcd, as suggested earlier, but also &lt;a href=&quot;https://redis.io&quot;&gt;Redis&lt;/a&gt; or &lt;a href=&quot;https://memcached.org&quot;&gt;memcached&lt;/a&gt; for those who want to live more dangerously. Indeed, while ZooKeeper or etcd can be set up in a synchronized cluster, memcached, on the other hand, is a &lt;a href=&quot;https://en.wikipedia.org/wiki/Single_point_of_failure&quot;&gt;SPOF&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;For the sake of the exercise, we&apos;re going to use a single instance of etcd here. Thanks to Tooz, switching to another backend would be a one-line change anyway.&lt;/p&gt;
&lt;p&gt;Tooz provides a &lt;code&gt;tooz.coordination.Coordinator&lt;/code&gt; object that represents the connection to the coordination subsystem. It then exposes an API based on groups and members. A member is a node connected through a &lt;code&gt;Coordinator&lt;/code&gt; instance. A group is a place that members can join or leave.&lt;/p&gt;
&lt;p&gt;Here&apos;s a first implementation of a member joining a group and printing the member list:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import sys
import time

from tooz import coordination

## Check that a client and group ids are passed as arguments
if len(sys.argv) != 3:
    print(&quot;Usage: %s &amp;lt;client id&amp;gt; &amp;lt;group id&amp;gt;&quot; % sys.argv[0])
    sys.exit(1)

## Get the Coordinator object
c = coordination.get_coordinator(
    &quot;etcd3://localhost&quot;,
    sys.argv[1].encode())
## Start it (initiate connection).
c.start(start_heart=True)

group = sys.argv[2].encode()

## Create the group
try:
    c.create_group(group).get()
except coordination.GroupAlreadyExist:
    pass

## Join the group
c.join_group(group).get()

try:
    while True:
        # Print the members list
        members = c.get_members(group)
        print(members.get())
        time.sleep(1)
finally:
    # Leave the group
    c.leave_group(group).get()

    # Stop when we&apos;re done
    c.stop()
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Don&apos;t forget to run etcd on your machine before running this program. Running a first instance of this program will print &lt;code&gt;set([&apos;client1&apos;])&lt;/code&gt; every second. As soon as you run a second instance, they both start to print &lt;code&gt;set([&apos;client1&apos;, &apos;client2&apos;])&lt;/code&gt;. If you shut down one of the clients, the remaining one will print the member list with only one member in it.&lt;/p&gt;
&lt;p&gt;This can work with any number of clients. If a client crashes rather than disconnecting properly, its membership will automatically expire after a few seconds – you can configure this expiration period by passing a &lt;code&gt;timeout&lt;/code&gt; value in the Tooz URL.&lt;/p&gt;
&lt;h3&gt;Using consistent hashing&lt;/h3&gt;
&lt;p&gt;Now that we have a group, which will turn out to be our &lt;em&gt;ring&lt;/em&gt;, we can implement a consistent hashring on top of it. Fortunately, Tooz also provides an implementation of this that is ready to be used. Rather than using the &lt;code&gt;join_group&lt;/code&gt; method, we&apos;re going to use the &lt;code&gt;join_partitioned_group&lt;/code&gt; method.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import sys
import time

from tooz import coordination

## Check that a client and group ids are passed as arguments
if len(sys.argv) != 3:
    print(&quot;Usage: %s &amp;lt;client id&amp;gt; &amp;lt;group id&amp;gt;&quot; % sys.argv[0])
    sys.exit(1)

## Get the Coordinator object
c = coordination.get_coordinator(
    &quot;etcd3://localhost&quot;,
    sys.argv[1].encode())
## Start it (initiate connection).
c.start(start_heart=True)

group = sys.argv[2].encode()

## Join the partitioned group
p = c.join_partitioned_group(group)

try:
    while True:
        for n in range(10):
            print(&quot;%d handled by %s&quot; % (n, p.members_for_object(n)))
        time.sleep(1)
finally:
    # Leave the group
    c.leave_group(group).get()

    # Stop when we&apos;re done
    c.stop()
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Running this program on one node (or just one terminal) will output the following every second:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;$ python distribution.py client1 foobar
0 handled by set([&apos;client1&apos;])
1 handled by set([&apos;client1&apos;])
2 handled by set([&apos;client1&apos;])
3 handled by set([&apos;client1&apos;])
4 handled by set([&apos;client1&apos;])
5 handled by set([&apos;client1&apos;])
6 handled by set([&apos;client1&apos;])
7 handled by set([&apos;client1&apos;])
8 handled by set([&apos;client1&apos;])
9 handled by set([&apos;client1&apos;])
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;As soon as a second member joins (just run another copy of the script in another terminal), the output changes and both running programs output the same thing:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;0 handled by set([&apos;client2&apos;])
1 handled by set([&apos;client1&apos;])
2 handled by set([&apos;client1&apos;])
3 handled by set([&apos;client1&apos;])
4 handled by set([&apos;client1&apos;])
5 handled by set([&apos;client2&apos;])
6 handled by set([&apos;client2&apos;])
7 handled by set([&apos;client1&apos;])
8 handled by set([&apos;client1&apos;])
9 handled by set([&apos;client2&apos;])
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;They just shared the ten objects between them. They &lt;strong&gt;did not communicate with each other&lt;/strong&gt;. They just know each other&apos;s presence, and since they use the same algorithm to compute where an object belongs, they share the same results. You can do the test with a third copy of the program:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;0 handled by set([&apos;client2&apos;])
1 handled by set([&apos;client1&apos;])
2 handled by set([&apos;client1&apos;])
3 handled by set([&apos;client1&apos;])
4 handled by set([&apos;client1&apos;])
5 handled by set([&apos;client2&apos;])
6 handled by set([&apos;client2&apos;])
7 handled by set([&apos;client3&apos;])
8 handled by set([&apos;client1&apos;])
9 handled by set([&apos;client3&apos;])
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Here we got a third client in the mix, excellent! If we stop one of the clients, the rebalancing is done automatically.&lt;/p&gt;
&lt;p&gt;While the consistent hashing approach is great, it has a few characteristics you might want to know about:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;The distribution algorithm is not made to be perfectly even. If you have a vast number of objects, it might seem pretty even statistically, but if you are trying to distribute two objects on two nodes, it&apos;s probable that one node will handle both objects and the other none.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;The distribution is not done in real time, meaning there&apos;s a small chance that an object might be owned by two nodes at the same time. This is not a problem in a scenario such as this one, since pinging a host twice is not going to be a big deal, but if your job needed to be executed once and only once, this might not be an adequate method of distribution. In that case, use a queue, which has the proper characteristics.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Distributed ping&lt;/h3&gt;
&lt;p&gt;Now that we have our hashring ready to distribute our job, we can implement our final program!&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import sys
import subprocess
import time

from tooz import coordination

## Check that a client and group ids are passed as arguments
if len(sys.argv) != 3:
    print(&quot;Usage: %s &amp;lt;client id&amp;gt; &amp;lt;group id&amp;gt;&quot; % sys.argv[0])
    sys.exit(1)

## Get the Coordinator object
c = coordination.get_coordinator(
    &quot;etcd3://localhost&quot;,
    sys.argv[1].encode())
## Start it (initiate connection).
c.start(start_heart=True)

group = sys.argv[2].encode()

## Join the partitioned group
p = c.join_partitioned_group(group)

class Host(object):
    def __init__(self, hostname):
        self.hostname = hostname

    def __tooz_hash__(self):
        &quot;&quot;&quot;Returns a unique byte identifier so Tooz can distribute this object.&quot;&quot;&quot;
        return self.hostname.encode()

    def __str__(self):
        return &quot;&amp;lt;%s: %s&amp;gt;&quot; % (self.__class__.__name__, self.hostname)

    def ping(self):
        p = subprocess.Popen([&quot;ping&quot;, &quot;-q&quot;, &quot;-c&quot;, &quot;3&quot;, &quot;-W&quot;, &quot;1&quot;,
                              self.hostname],
                             stdout=subprocess.DEVNULL,
                             stderr=subprocess.DEVNULL)
        return p.wait() == 0

hosts_to_ping = [Host(&quot;192.168.2.%d&quot; % i) for i in range(255)]

try:
    while True:
        for host in hosts_to_ping:
            c.run_watchers()
            if p.belongs_to_self(host):
                print(&quot;Pinging %s&quot; % host)
                if host.ping():
                    print(&quot;  %s is alive&quot; % host)
        time.sleep(1)
finally:
    # Leave the group
    c.leave_group(group).get()

    # Stop when we&apos;re done
    c.stop()
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;When the first client starts, it starts iterating over the hosts, and since it is alone, all hosts belong to it. So it starts pinging all of them:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;$ python3 ping.py client1 ping
Pinging &amp;lt;Host: 192.168.2.0&amp;gt;
  &amp;lt;Host: 192.168.2.0&amp;gt; is alive
Pinging &amp;lt;Host: 192.168.2.1&amp;gt;
  &amp;lt;Host: 192.168.2.1&amp;gt; is alive
Pinging &amp;lt;Host: 192.168.2.2&amp;gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Then, a second client starts pinging too, and the jobs are automatically split. The &lt;code&gt;client1&lt;/code&gt; instance starts skipping the hosts that now belong to &lt;code&gt;client2&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;## client1 output
Pinging &amp;lt;Host: 192.168.2.8&amp;gt;
  &amp;lt;Host: 192.168.2.8&amp;gt; is alive
Pinging &amp;lt;Host: 192.168.2.9&amp;gt;
Pinging &amp;lt;Host: 192.168.2.11&amp;gt;
Pinging &amp;lt;Host: 192.168.2.12&amp;gt;

## client2 output
Pinging &amp;lt;Host: 192.168.2.7&amp;gt;
Pinging &amp;lt;Host: 192.168.2.10&amp;gt;
Pinging &amp;lt;Host: 192.168.2.13&amp;gt;
  &amp;lt;Host: 192.168.2.13&amp;gt; is alive
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;On the other hand, &lt;code&gt;client2&lt;/code&gt; skips the hosts that belong to &lt;code&gt;client1&lt;/code&gt;. If we want to scale our application further, we can start new clients on other nodes of the network and expand our pinging system!&lt;/p&gt;
&lt;h3&gt;Just a first step&lt;/h3&gt;
&lt;p&gt;&lt;a href=&quot;https://scaling-python.com&quot;&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/the-hacker-guide-to-scaling-python.png&quot; alt=&quot;Cover of The Hacker&apos;s Guide to Scaling Python&quot; /&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;This &lt;code&gt;ping&lt;/code&gt; job does not use a lot of CPU time or I/O bandwidth, and neither would Alon&apos;s original ssh use case. However, if it did, this method would be even more valuable, as scaling out the resources would be key.&lt;/p&gt;
&lt;p&gt;These are just the first steps of the distribution and scalability mechanisms that you can implement using Python. There are a few other options available on top of this mechanism, such as defining different weights for different nodes or using replicas to achieve high-availability scenarios. I&apos;ve covered those in my book &lt;a href=&quot;https://scaling-python.com&quot;&gt;Scaling Python&lt;/a&gt;, if you&apos;re interested in learning more!&lt;/p&gt;
</content:encoded></item><item><title>Scaling a polling Python application with asyncio</title><link>https://julien.danjou.info/blog/scaling-python-application-asyncio/</link><guid isPermaLink="true">https://julien.danjou.info/blog/scaling-python-application-asyncio/</guid><description>This article is a follow-up of my previous blog post about scaling a large number of connections . If you don&apos;t remember, I was trying to solve one of my followers&apos; problem:  &gt; It so happened that I&apos;m</description><pubDate>Mon, 12 Feb 2018 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;This article is a follow-up of my &lt;a href=&quot;https://julien.danjou.info/blog/scaling-python-application-threads&quot;&gt;previous blog post about scaling a large number of connections&lt;/a&gt;. If you don&apos;t remember, I was trying to solve one of my followers&apos; problem:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;It so happened that I&apos;m currently working on scaling some Python app. Specifically, now I&apos;m trying to figure out the best way to scale SSH connections - when one server has to connect to thousands (or even tens of thousands) of remote machines in a short period of time (say, several minutes).&lt;br /&gt;
How would you write an application that does that in a scalable way?&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;In the first article, we wrote a program that could handle this problem at scale by using multiple &lt;em&gt;threads&lt;/em&gt;. While this worked pretty well, it had some severe limitations. This time, we&apos;re going to take a different approach.&lt;/p&gt;
&lt;h2&gt;The job&lt;/h2&gt;
&lt;p&gt;The job has not changed and is still about connecting to a remote server via ssh. This time, rather than faking it by using &lt;em&gt;ping&lt;/em&gt;, we are going to connect for real to an ssh server. Once connected to the remote server, the mission will be to run a single command. For the sake of this example, the command run here is just a simple &quot;echo hello world&quot;.&lt;/p&gt;
&lt;h2&gt;Using an event loop&lt;/h2&gt;
&lt;p&gt;This time, rather than leveraging threads, we are using &lt;a href=&quot;https://docs.python.org/3/library/asyncio.html&quot;&gt;asyncio&lt;/a&gt;. &lt;em&gt;Asyncio&lt;/em&gt; is the leading Python event loop system implementation. It allows executing multiple functions (named &lt;em&gt;coroutines&lt;/em&gt;) concurrently. The idea is that each time a coroutine performs an I/O operation, it yields back the control to the event loop. As the input or output might be blocking (e.g., the socket has no data yet to be read), the event loop will reschedule the coroutine as soon as there is work to do. In the meantime, the loop can schedule another coroutine that has something to do – or wait for that to happen.&lt;/p&gt;
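&lt;p&gt;As a minimal illustration of that idea (not taken from the ssh example, and using the modern &lt;code&gt;asyncio.run&lt;/code&gt; entry point), two coroutines sleeping concurrently finish in roughly the time of the longest delay, not the sum:&lt;/p&gt;

```python
import asyncio
import time

async def worker(name, delay):
    # Each await hands control back to the event loop, so the
    # other coroutine can run while this one waits on its "I/O"
    await asyncio.sleep(delay)
    return name

async def main():
    # Both workers wait at the same time: the total run time is
    # close to the longest delay (0.05 s), not the sum (0.07 s)
    return await asyncio.gather(worker("a", 0.05), worker("b", 0.02))

start = time.monotonic()
results = asyncio.run(main())
elapsed = time.monotonic() - start
print(results)  # gather preserves argument order
```

&lt;p&gt;Both coroutines wait concurrently, so the elapsed time is close to the longest delay rather than the sum of the two.&lt;/p&gt;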
&lt;p&gt;Not all libraries are compatible with the &lt;em&gt;asyncio&lt;/em&gt; framework. In our case, we need an ssh library that has support for &lt;em&gt;asyncio&lt;/em&gt;. It happens that &lt;a href=&quot;https://github.com/ronf/asyncssh&quot;&gt;&lt;em&gt;AsyncSSH&lt;/em&gt;&lt;/a&gt; is a Python library that provides ssh connection handling support for asyncio. It is particularly easy to use, and the &lt;a href=&quot;http://asyncssh.readthedocs.io/&quot;&gt;documentation has plenty of examples&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Here&apos;s the function that we&apos;re going to use to execute our command on a remote host:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import asyncssh

async def run_command(host, command):
    async with asyncssh.connect(host) as conn:
        result = await conn.run(command)
        return result.stdout
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The function &lt;code&gt;run_command&lt;/code&gt; runs a &lt;code&gt;command&lt;/code&gt; on a remote &lt;code&gt;host&lt;/code&gt; once connected to it via ssh. It then returns the standard output of the command. The function uses the &lt;code&gt;async&lt;/code&gt; and &lt;code&gt;await&lt;/code&gt; keywords introduced in Python 3.5 for &lt;em&gt;asyncio&lt;/em&gt;. They indicate that the called functions are coroutines that might block, and that control is yielded back to the event loop.&lt;/p&gt;
&lt;p&gt;As I don&apos;t own hundreds of servers I can connect to, I will be using a single remote server as the target – but the program will connect to it multiple times. The server is at a latency of about 6 ms, which will magnify the results a bit.&lt;/p&gt;
&lt;p&gt;The first version of this program is simple and stupid. It runs the &lt;code&gt;run_command&lt;/code&gt; function N times serially, providing the tasks one at a time to the &lt;em&gt;asyncio&lt;/em&gt; event loop:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;loop = asyncio.get_event_loop()

outputs = [
    loop.run_until_complete(
        run_command(&quot;myserver&quot;, &quot;echo hello world %d&quot; % i))
    for i in range(200)
]
print(outputs)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Once executed, the program prints the following:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;$ time python3 asyncssh-test.py
[&apos;hello world 0\n&apos;, &apos;hello world 1\n&apos;, &apos;hello world 2\n&apos;, … &apos;hello world 199\n&apos;]
python3 asyncssh-test.py  6.11s user 0.35s system 15% cpu 41.249 total
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;It took 41 seconds to connect 200 times to the remote server and execute a simple printing command.&lt;/p&gt;
&lt;p&gt;To make this faster, we&apos;re going to schedule all the coroutines at the same time. We just need to feed the event loop with the 200 coroutines at once. That will give it the ability to schedule them efficiently.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;outputs = loop.run_until_complete(asyncio.gather(
    *[run_command(&quot;myserver&quot;, &quot;echo hello world %d&quot; % i)
      for i in range(200)]))
print(outputs)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;By using &lt;code&gt;asyncio.gather&lt;/code&gt;, it is possible to pass a list of coroutines and wait for all of them to be finished. Once run, this program prints the following:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;$ time python3 asyncssh-test.py
[&apos;hello world 0\n&apos;, &apos;hello world 1\n&apos;, &apos;hello world 2\n&apos;, … &apos;hello world 199\n&apos;]
python3 asyncssh-test.py  4.90s user 0.34s system 35% cpu 14.761 total
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This version took only a third of the original execution time to finish! As a fun note, the main limitation here is that my remote server has trouble handling more than 150 connections in parallel, so this program is a bit tough on it alone.&lt;/p&gt;
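&lt;p&gt;If the remote side struggles with that many simultaneous connections, a common pattern is to keep &lt;code&gt;asyncio.gather&lt;/code&gt; but cap the concurrency with an &lt;code&gt;asyncio.Semaphore&lt;/code&gt;. The sketch below is an illustration rather than part of the original program – &lt;code&gt;asyncio.sleep&lt;/code&gt; stands in for the real ssh round-trip:&lt;/p&gt;

```python
import asyncio

async def run_command(i, semaphore):
    # At most `limit` coroutines get past this point at once;
    # the rest wait here until a slot frees up
    async with semaphore:
        await asyncio.sleep(0.001)  # stand-in for the ssh connect + exec
        return "hello world %d" % i

async def main(count, limit):
    semaphore = asyncio.Semaphore(limit)
    return await asyncio.gather(
        *[run_command(i, semaphore) for i in range(count)])

outputs = asyncio.run(main(200, 150))
print(outputs[0], "...", outputs[-1])
```

&lt;p&gt;With a limit of 150, at most 150 &quot;connections&quot; are in flight at any moment, while the remaining coroutines queue up on the semaphore.&lt;/p&gt;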
&lt;h2&gt;Scalability&lt;/h2&gt;
&lt;p&gt;To show how great this method is, I&apos;ve built a chart below that shows the difference of execution time between the two approaches, depending on the number of hosts the application has to connect to.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/chart-asyncssh.png&quot; alt=&quot;chart-asyncssh&quot; /&gt;&lt;/p&gt;
&lt;p&gt;The trend lines highlight the difference in execution time and how important concurrency is here. For 10,000 nodes, the time needed for a serial execution would be around 40 minutes, whereas it would be only 7 minutes with a cooperative approach – quite a difference. The concurrent approach allows executing one command 205 times a day rather than only 36 times!&lt;/p&gt;
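&lt;p&gt;Those last two figures come straight from dividing a day by each run time:&lt;/p&gt;

```python
minutes_per_day = 24 * 60  # 1440 minutes in a day
serial_minutes = 40        # serial execution time for 10,000 nodes
concurrent_minutes = 7     # concurrent execution time for 10,000 nodes

runs_serial = minutes_per_day // serial_minutes
runs_concurrent = minutes_per_day // concurrent_minutes
print(runs_serial, runs_concurrent)  # 36 205
```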
&lt;h2&gt;That was the second step&lt;/h2&gt;
&lt;p&gt;&lt;a href=&quot;https://scaling-python.com&quot;&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/the-hacker-guide-to-scaling-python.png&quot; alt=&quot;Cover of The Hacker&apos;s Guide to Scaling Python&quot; /&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Using an event loop for tasks that can run concurrently due to their I/O-intensive nature is really a great way to maximize the throughput of a program. This simple change made the program &lt;em&gt;6× faster&lt;/em&gt;.&lt;/p&gt;
&lt;p&gt;Anyhow, this is not the only way to scale a Python program. There are a few other options available on top of this mechanism – I&apos;ve covered those in my book &lt;a href=&quot;https://scaling-python.com&quot;&gt;Scaling Python&lt;/a&gt;, if you&apos;re interested in learning more!&lt;/p&gt;
&lt;p&gt;Until then, stay tuned for the next article of this series!&lt;/p&gt;
</content:encoded></item><item><title>Gnocchi 4.2 release</title><link>https://julien.danjou.info/blog/gnocchi-4-2-release/</link><guid isPermaLink="true">https://julien.danjou.info/blog/gnocchi-4-2-release/</guid><description>The time of the release arrived. A little more than three months have passed since the latest minor version, 4.1, has been released. There are tons of improvement and a few nice significant features i</description><pubDate>Tue, 06 Feb 2018 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;The time of the release arrived. A little more than three months have passed since the latest minor version, 4.1, was released. There are tons of improvements and a few nice significant features in this release!&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/gnocchi-logo.png&quot; alt=&quot;Gnocchi logo&quot; /&gt;&lt;/p&gt;
&lt;p&gt;Most of the principal changes are recorded in &lt;a href=&quot;http://gnocchi.xyz/releasenotes/4.2.html&quot;&gt;the 4.2 release notes&lt;/a&gt;, but here are a few that I find particularly interesting. We merged 141 commits since 4.1.0. As a comparison, that is a lot fewer than the 228 we had between 4.0.0 and 4.1.0 or the 375 we had between 3.1.0 and 4.0.0!&lt;/p&gt;
&lt;p&gt;We added two compatibility endpoints on the REST API for &lt;a href=&quot;http://influxdb.org&quot;&gt;InfluxDB&lt;/a&gt; and &lt;a href=&quot;http://prometheus.io&quot;&gt;Prometheus&lt;/a&gt;. We want users coming from those other database systems, or using tools that are compatible with them, to also be able to use Gnocchi. This is now possible, as Gnocchi offers endpoints to write data using the InfluxDB line protocol and the Prometheus HTTP API. Reading data using their APIs is not supported yet, though. This has been tested with &lt;a href=&quot;https://www.influxdata.com/time-series-platform/telegraf/&quot;&gt;Telegraf&lt;/a&gt;, for example, and works perfectly fine!&lt;/p&gt;
&lt;p&gt;Some other improvements were made, such as enhanced ACL filtering when using &lt;a href=&quot;https://docs.openstack.org/keystone/latest/&quot;&gt;Keystone&lt;/a&gt; for authentication, a new batch format that passes more information about the non-existing metrics to create, and tons of performance improvements!&lt;/p&gt;
&lt;p&gt;We already started working on the next version of Gnocchi! Come and join us on &lt;a href=&quot;http://github.com/gnocchixyz&quot;&gt;GitHub&lt;/a&gt;! Star us, and stay tuned for some more awesome news around metrics.&lt;/p&gt;
</content:encoded></item><item><title>Scaling a polling Python application with parallelism</title><link>https://julien.danjou.info/blog/scaling-python-application-threads/</link><guid isPermaLink="true">https://julien.danjou.info/blog/scaling-python-application-threads/</guid><description>A few weeks ago, Alon contacted me and asked me the following:  &gt; It so happened that I&apos;m currently working on scaling some Python app. Specifically, now I&apos;m trying to figure out the best way to scale</description><pubDate>Tue, 23 Jan 2018 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;A few weeks ago, Alon contacted me and asked me the following:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;It so happened that I&apos;m currently working on scaling some Python app. Specifically, now I&apos;m trying to figure out the best way to scale SSH connections - when one server has to connect to thousands (or even tens of thousands) of remote machines in a short period of time (say, several minutes).&lt;br /&gt;
How would you write an application that does that in a scalable way?&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Alon is using such an application to gather information on the hosts it connects to, though that&apos;s not important in this case.&lt;/p&gt;
&lt;p&gt;In a series of blog posts, I&apos;d like to help Alon solve this problem! We&apos;re going to write an application that can manage millions of hosts.&lt;/p&gt;
&lt;p&gt;Well, if you have enough hardware, obviously.&lt;/p&gt;
&lt;h2&gt;The job&lt;/h2&gt;
&lt;p&gt;Writing a Python application that connects to a host by ssh can be done using, for example, &lt;a href=&quot;http://docs.paramiko.org/en/&quot;&gt;Paramiko&lt;/a&gt;. That will not be the focus of this blog post since it is pretty straightforward to do.&lt;/p&gt;
&lt;p&gt;To keep this exercise simple, we&apos;ll just use a &lt;code&gt;ping&lt;/code&gt; function that looks like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import subprocess

def ping(hostname):
    p = subprocess.Popen([&quot;ping&quot;, &quot;-c&quot;, &quot;3&quot;, &quot;-W&quot;, &quot;1&quot;, hostname],
                         stdout=subprocess.DEVNULL,
                         stderr=subprocess.DEVNULL)
    return p.wait() == 0
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The function &lt;code&gt;ping&lt;/code&gt; returns &lt;code&gt;True&lt;/code&gt; if the host is reachable and alive, or &lt;code&gt;False&lt;/code&gt; if an error occurs (bad hostname, network unreachable, ping timeout, etc.). We&apos;re also not trying to make &lt;code&gt;ping&lt;/code&gt; fast by specifying a lower timeout or a smaller number of packets. The goal is to scale this task while knowing it &lt;em&gt;takes time&lt;/em&gt; to execute.&lt;/p&gt;
&lt;p&gt;So &lt;code&gt;ping&lt;/code&gt; is going to be the job to be executed by our application. It&apos;ll replace &lt;code&gt;ssh&lt;/code&gt; in this example, but you&apos;ll see it&apos;ll be easy to replace it with any other job &lt;em&gt;you&lt;/em&gt; might have.&lt;/p&gt;
&lt;p&gt;We&apos;re going to use this job to accomplish a bigger mission: determine which hosts in my home network are up:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;for i in range(255):
    ip = &quot;192.168.2.%d&quot; % i
    if ping(ip):
        print(&quot;%s is alive&quot; % ip)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Running this program alone and pinging all 255 IP addresses takes more than 10 minutes.&lt;/p&gt;
&lt;p&gt;It is pretty slow because each time we ping a host, we wait for the ping to succeed or time out before starting the next one. So if pinging each host takes 3 seconds on average, then pinging 255 nodes takes 3 seconds × 255 = 765 seconds – more than 12 minutes.&lt;/p&gt;
&lt;h2&gt;The solution&lt;/h2&gt;
&lt;p&gt;If 255 hosts take 12 minutes to ping, you can imagine how long it would take to test which hosts are alive on the whole IPv4 Internet – 4,294,967,296 addresses to ping!&lt;/p&gt;
&lt;p&gt;Since those ping (or ssh) jobs are not CPU intensive, we can assume that a single multi-processor host is going to be powerful enough – at least to start with.&lt;/p&gt;
&lt;p&gt;The real issue is that those tasks are I/O intensive, and executing them serially takes a &lt;em&gt;very&lt;/em&gt; long time.&lt;/p&gt;
&lt;p&gt;So let&apos;s run them in parallel!&lt;/p&gt;
&lt;p&gt;To do this, we&apos;re going to use &lt;em&gt;threads&lt;/em&gt;. Threads are not efficient in Python for CPU-intensive tasks, but for blocking I/O, they are good enough.&lt;/p&gt;
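&lt;p&gt;To see why threads help here, consider this small sketch, using &lt;code&gt;time.sleep&lt;/code&gt; as a stand-in for a blocking ping: ten blocking calls of 0.2 seconds each finish in roughly 0.2 seconds total when overlapped, instead of 2 seconds when run serially.&lt;/p&gt;

```python
import threading
import time

def fake_io(host):
    # Stand-in for a blocking, I/O-bound call such as ping
    time.sleep(0.2)

start = time.monotonic()
threads = [threading.Thread(target=fake_io, args=(i,)) for i in range(10)]
for t in threads:
    t.start()
for t in threads:
    t.join()
elapsed = time.monotonic() - start
print("elapsed: %.2f seconds" % elapsed)
```

&lt;p&gt;While one thread sleeps waiting on I/O, the others run – exactly the property we need for ping.&lt;/p&gt;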
&lt;h2&gt;Using &lt;code&gt;concurrent.futures&lt;/code&gt;&lt;/h2&gt;
&lt;p&gt;With &lt;code&gt;concurrent.futures&lt;/code&gt;, it&apos;s easy to manage a pool of threads and schedule the execution of tasks. Here&apos;s how we&apos;re going to do it:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;from concurrent import futures
import subprocess

def ping(hostname):
    p = subprocess.Popen([&quot;ping&quot;, &quot;-q&quot;, &quot;-c&quot;, &quot;3&quot;, &quot;-W&quot;, &quot;1&quot;,
                          hostname],
                         stdout=subprocess.DEVNULL,
                         stderr=subprocess.DEVNULL)
    return p.wait() == 0

with futures.ThreadPoolExecutor(max_workers=4) as executor:
    futs = [
        (host, executor.submit(ping, host))
        for host in (&quot;192.168.2.%d&quot; % i for i in range(255))
    ]

    for ip, f in futs:
        if f.result():
            print(&quot;%s is alive&quot; % ip)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The &lt;code&gt;ThreadPoolExecutor&lt;/code&gt; is an engine, called an executor, that allows us to submit tasks to it. Each task submitted via the &lt;code&gt;executor.submit&lt;/code&gt; method is put into an internal queue. This method takes the function to execute as its argument, followed by that function&apos;s arguments.&lt;/p&gt;
&lt;p&gt;Then, the executor pulls jobs out of its queue and executes them, starting a thread responsible for each execution. The maximum number of threads to start is controlled by the &lt;code&gt;max_workers&lt;/code&gt; parameter.&lt;/p&gt;
&lt;p&gt;&lt;code&gt;executor.submit&lt;/code&gt; returns a &lt;code&gt;Future&lt;/code&gt; object that holds the future result of the submitted task. &lt;code&gt;Future&lt;/code&gt; objects expose methods to check whether the task is finished; here we just use &lt;code&gt;Future.result()&lt;/code&gt; to get the result. This method blocks until the result is ready.&lt;/p&gt;
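&lt;p&gt;Here is a minimal, self-contained illustration of that &lt;code&gt;Future&lt;/code&gt; API – the &lt;code&gt;slow_double&lt;/code&gt; task is just a placeholder standing in for ping:&lt;/p&gt;

```python
from concurrent import futures
import time

def slow_double(x):
    # Placeholder task: pretend to do some blocking I/O, then return
    time.sleep(0.1)
    return x * 2

with futures.ThreadPoolExecutor(max_workers=2) as executor:
    fut = executor.submit(slow_double, 21)
    # done() reports completion without blocking...
    print(fut.done())
    # ...while result() blocks until the value is ready
    result = fut.result()
    print(result)
```

&lt;p&gt;The first &lt;code&gt;print&lt;/code&gt; will usually show &lt;code&gt;False&lt;/code&gt; since the task is still sleeping; &lt;code&gt;result()&lt;/code&gt; then waits for it and returns the computed value.&lt;/p&gt;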
&lt;p&gt;There&apos;s no magic recipe for picking the number of workers: it really depends on the nature of the tasks submitted. In this case, a value of 4 brings the execution time down to about 3 minutes – roughly 12 minutes divided by 4, which makes sense. Setting &lt;code&gt;max_workers&lt;/code&gt; to 255 (i.e. the number of tasks submitted) makes all the pings start at the same time, producing a CPU usage spike, but bringing the total execution time down to less than 5 seconds!&lt;/p&gt;
&lt;p&gt;Obviously, you wouldn&apos;t be able to start 4 billion threads in parallel, but if your system is big and fast enough, and your tasks use more I/O than CPU, you can use a pretty high value. Memory should also be taken into account – here the footprint is very low, since the ping task does not use a lot of memory.&lt;/p&gt;
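&lt;p&gt;You can measure this trade-off yourself with a quick experiment – here &lt;code&gt;fake_ping&lt;/code&gt; simulates network latency with &lt;code&gt;time.sleep&lt;/code&gt; instead of actually hitting the network:&lt;/p&gt;

```python
from concurrent import futures
import time

def fake_ping(host):
    # Simulate network latency instead of really pinging
    time.sleep(0.05)
    return True

def timed_run(n_workers, n_hosts=20):
    start = time.monotonic()
    with futures.ThreadPoolExecutor(max_workers=n_workers) as executor:
        list(executor.map(fake_ping, range(n_hosts)))
    return time.monotonic() - start

narrow = timed_run(1)   # one worker: fully serial
wide = timed_run(20)    # one worker per host: fully parallel
print("1 worker: %.2fs, 20 workers: %.2fs" % (narrow, wide))
```

&lt;p&gt;With I/O-bound tasks like this one, the elapsed time shrinks roughly in proportion to the number of workers, until you hit CPU, memory, or network limits.&lt;/p&gt;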
&lt;h2&gt;Just a first step&lt;/h2&gt;
&lt;p&gt;&lt;a href=&quot;https://scaling-python.com&quot;&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/the-hacker-guide-to-scaling-python.png&quot; alt=&quot;Cover of The Hacker&apos;s Guide to Scaling Python&quot; /&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;As already said, this &lt;code&gt;ping&lt;/code&gt; job does not use much CPU time or I/O bandwidth, and neither would Alon&apos;s original ssh case. However, if that were the case, this method would hit its limits pretty quickly. Threads are not always the best option to maximize your throughput, especially with Python.&lt;/p&gt;
&lt;p&gt;These are just the first steps of the distribution and scalability mechanism that you can implement using Python. There are a few other options available on top of this mechanism – I&apos;ve covered those in my book &lt;a href=&quot;https://scaling-python.com&quot;&gt;Scaling Python&lt;/a&gt;, if you&apos;re interested in learning more!&lt;/p&gt;
&lt;p&gt;If you&apos;re curious, go read &lt;a href=&quot;https://julien.danjou.info/blog/scaling-python-application-asyncio&quot;&gt;the next article of this series&lt;/a&gt;.&lt;/p&gt;
</content:encoded></item><item><title>A safe GitHub workflow with Pastamaker</title><link>https://julien.danjou.info/blog/pastamaker/</link><guid isPermaLink="true">https://julien.danjou.info/blog/pastamaker/</guid><description>When the Gnocchi project decided to move to GitHub, we developers had to move from a Gerrit based workflow to a GitHub pull-request one.</description><pubDate>Fri, 15 Dec 2017 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;When the &lt;a href=&quot;https://gnocchi.xyz&quot;&gt;Gnocchi&lt;/a&gt; project decided to move to &lt;a href=&quot;https://github.com&quot;&gt;GitHub&lt;/a&gt;, we developers had to move from a Gerrit based workflow to a GitHub pull-request one.&lt;/p&gt;
&lt;p&gt;This has been challenging in some ways. We were satisfied with the workflow we had using Gerrit and &lt;a href=&quot;https://docs.openstack.org/infra/zuul/&quot;&gt;Zuul&lt;/a&gt; for testing so we decided to adapt GitHub to our requirements.&lt;/p&gt;
&lt;p&gt;We know that Zuul now supports GitHub. However, that implies having your own testing infrastructure, something we can&apos;t afford. Instead, we rely on &lt;a href=&quot;http://travis-ci.org&quot;&gt;Travis&lt;/a&gt;, like most open-source projects hosted on GitHub.&lt;/p&gt;
&lt;h2&gt;The workflow&lt;/h2&gt;
&lt;p&gt;The workflow we wanted to have was the following:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;A contributor creates a pull-request on GitHub.&lt;/li&gt;
&lt;li&gt;The pull-request is tested by Travis.&lt;/li&gt;
&lt;li&gt;The pull-request is reviewed by approved projects members.&lt;/li&gt;
&lt;li&gt;If the tests pass and two reviewers have approved the pull-request, then it&lt;br /&gt;
can be merged.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This sounds simple, but it is actually not &lt;em&gt;that&lt;/em&gt; simple.&lt;/p&gt;
&lt;p&gt;First, when Travis tests the pull-request, it checks what has been sent by the contributor. If the contributor created a pull-request on top of an outdated version of the base branch, that&apos;s what will be tested by Travis during the initial pull-request creation.&lt;/p&gt;
&lt;p&gt;Even if the pull-request has been created using the tip of the base branch, as time passes, the base branch will progress. However, the pull-request created by your contributor will not get those new commits – unless rebased manually.&lt;/p&gt;
&lt;p&gt;That means the Travis test result is now outdated and possibly invalid. Still, GitHub and Travis will both show you that this pull-request passed all tests – yes, it &lt;em&gt;did&lt;/em&gt;, but against an old base branch from a while back!&lt;/p&gt;
&lt;p&gt;If you added new tests in the meantime in your base branch, it&apos;s possible that this pull-request does not work anymore. Pressing the &lt;em&gt;merge&lt;/em&gt; button might just break your project!&lt;/p&gt;
&lt;p&gt;To help with that problem, GitHub recently added a button that allows you to &lt;em&gt;merge the base branch into the pull-request&lt;/em&gt;. That allows, in one click, getting the pull-request updated with the base branch (e.g., &lt;em&gt;master&lt;/em&gt;) and retested by Travis.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/github-update-pr-button.png&quot; alt=&quot;github-update-pr-button&quot; /&gt;&lt;/p&gt;
&lt;p&gt;Still, this means that if you have ten pull-requests, you need to:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Merge base branch into PR#1&lt;/li&gt;
&lt;li&gt;Wait for Travis to pass&lt;/li&gt;
&lt;li&gt;Wait for two reviewers to approve&lt;/li&gt;
&lt;li&gt;Merge PR#1&lt;/li&gt;
&lt;li&gt;All other nine pull-requests are now out of date. You need to start back at step 1 for each pull-request.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;This is very tedious to do manually, especially when your project has tons of pull-requests.&lt;/p&gt;
&lt;p&gt;This is why &lt;a href=&quot;http://blog.sileht.net&quot;&gt;Mehdi Abaakouk&lt;/a&gt; created &lt;a href=&quot;http://github.com/sileht/pastamaker&quot;&gt;Pastamaker&lt;/a&gt;.&lt;/p&gt;
&lt;h2&gt;Pastamaker to the rescue&lt;/h2&gt;
&lt;p&gt;&lt;a href=&quot;http://github.com/sileht/pastamaker&quot;&gt;Pastamaker&lt;/a&gt; is a small Web application that implements the described workflow. Once connected to your GitHub project, it will set the proper permissions to protect it from accidental manual merges and enforce the workflow described above.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/pastamaker-pr-1.png&quot; alt=&quot;pastamaker-pr-1&quot; /&gt;&lt;/p&gt;
&lt;p&gt;Pastamaker listens for GitHub and Travis events to track the state of each pull-request. If it detects that a pull-request has been approved by two reviewers and that the initial Travis run passed, it will merge the base branch into it if needed, wait for Travis to pass again, and then finally merge it.&lt;/p&gt;
&lt;p&gt;If multiple pull-requests are approved at the same time and are candidates for a merge, it orders them, updates them one at a time, waits for the Travis results, and merges each one if its tests pass. It essentially automates the workflow described above.&lt;/p&gt;
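&lt;p&gt;That queue-draining behavior can be sketched in a few lines of Python – a simplified model, not Pastamaker&apos;s actual code; the callables are placeholders for the GitHub and Travis interactions:&lt;/p&gt;

```python
from collections import deque

def drain_merge_queue(approved_prs, update_with_base, ci_passes, merge):
    # Process approved pull-requests one at a time: refresh each with
    # the base branch, re-run CI, and merge only on a fresh green run.
    merged = []
    queue = deque(approved_prs)
    while queue:
        pr = queue.popleft()
        update_with_base(pr)
        if ci_passes(pr):
            merge(pr)
            merged.append(pr)
    return merged

# Example with stubbed-out GitHub/Travis interactions:
merged = drain_merge_queue(
    ["PR#1", "PR#2", "PR#3"],
    update_with_base=lambda pr: None,   # pretend to merge the base branch in
    ci_passes=lambda pr: pr != "PR#2",  # pretend PR#2 fails its fresh run
    merge=lambda pr: None,              # pretend to press the merge button
)
print(merged)  # PR#2 is skipped, the others are merged in order
```

&lt;p&gt;The key point is the serialization: each pull-request is retested against the base branch as it stands &lt;em&gt;after&lt;/em&gt; the previous merge, so no stale test result ever reaches the merge button.&lt;/p&gt;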
&lt;p&gt;Pastamaker exposes its data via a simple dashboard, which allows seeing all the pull-requests for your project in a snap.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/pastamaker.png&quot; alt=&quot;pastamaker&quot; /&gt;&lt;/p&gt;
&lt;p&gt;Pastamaker offers a lot of tiny other details that make the developers lives easier, such as posting the job result with direct links to the jobs logs in the pull-request – so you&apos;re informed as soon as they pass or fail and can fix them right away!&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;http://github.com/sileht/pastamaker&quot;&gt;Pastamaker&lt;/a&gt; is obviously open-source, and we would love to see you give it a try!&lt;/p&gt;
</content:encoded></item><item><title>Scaling Python released</title><link>https://julien.danjou.info/blog/scaling-python-released/</link><guid isPermaLink="true">https://julien.danjou.info/blog/scaling-python-released/</guid><description>I am proud to announce today the immediate release of Scaling Python, my second book about Python! It talks about the distribution and performance of applications written in Python, and how to build.</description><pubDate>Tue, 05 Dec 2017 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;I am proud to announce today the immediate release of &lt;em&gt;&lt;a href=&quot;https://scaling-python.com&quot;&gt;Scaling Python&lt;/a&gt;&lt;/em&gt;, my second book about Python! It talks about the distribution and performance of applications written in Python, and how to build them properly!&lt;/p&gt;
&lt;p&gt;It took me a year to build this entirely new product around Python. It&apos;s an exciting moment, and I am sure it will delight many of my dear readers who have been waiting for it for a while now!&lt;/p&gt;
&lt;p&gt;I&apos;ve been able to build this using my last three years of experience working on &lt;em&gt;&lt;a href=&quot;http://thehackerguidetopython.com&quot;&gt;The Hacker&apos;s Guide to Python&lt;/a&gt;&lt;/em&gt; – an amazing adventure.&lt;/p&gt;
&lt;p&gt;Starting now, you can enjoy reading the book and learn a bit more about building distributed and scalable applications with Python. I really hope it&apos;ll help you bring your Python-fu to a new level, and that it will help you build great projects!&lt;/p&gt;
&lt;p&gt;Since these are the first days of sale, you can enjoy &lt;strong&gt;a 15% discount&lt;/strong&gt; on all packages for the next 48 hours!&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://scaling-python.com&quot;&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/the-hacker-guide-to-scaling-python.png&quot; alt=&quot;Cover of The Hacker&apos;s Guide to Scaling Python&quot; /&gt;&lt;/a&gt;&lt;/p&gt;
</content:encoded></item><item><title>Scaling Python: the interviewees</title><link>https://julien.danjou.info/blog/scaling-python-interviews/</link><guid isPermaLink="true">https://julien.danjou.info/blog/scaling-python-interviews/</guid><description>The release date for Scaling Python is now very close! Today, I&apos;d like to talk a bit about the interviews that I&apos;ve run those last months that are featured in the book.</description><pubDate>Tue, 28 Nov 2017 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;The release date for &lt;a href=&quot;http://scaling-python.com&quot;&gt;Scaling Python&lt;/a&gt; is now very close! Today, I&apos;d like to talk a bit about the interviews that I&apos;ve run those last months that are featured in the book.&lt;/p&gt;
&lt;p&gt;I&apos;m glad that during those long weeks of work, I managed to find a Python expert for each of the major topics covered in the book. They provide insight on the different subjects covered and share their experience so you can benefit from it!&lt;/p&gt;
&lt;p&gt;Without further delay, ladies and gentlemen, here they are:&lt;/p&gt;
&lt;hr /&gt;
&lt;h2&gt;Mehdi Abaakouk&lt;/h2&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/mabaakouk.png&quot; alt=&quot;Mehdi Abaakouk portrait&quot; /&gt;&lt;/p&gt;
&lt;p&gt;Mehdi is a French free software hacker, working at Red Hat, who has been using Linux for almost twenty years now. He works daily on OpenStack, the largest open source project using Python. He also regularly builds and contributes to distributed applications and is responsible for several widely used Python libraries – &lt;em&gt;Cotyledon&lt;/em&gt;, &lt;em&gt;oslo.messaging&lt;/em&gt;, etc.&lt;/p&gt;
&lt;p&gt;In the book, Mehdi gives excellent tips on how to build distributed daemons.&lt;/p&gt;
&lt;h2&gt;Naoki Inada&lt;/h2&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/ninada.png&quot; alt=&quot;Naoki Inada portrait&quot; /&gt;&lt;/p&gt;
&lt;p&gt;Naoki is a Japanese software engineer, who happens to also be one of the CPython developers. He worked on several significant features in CPython, such as &lt;em&gt;asyncio&lt;/em&gt;.&lt;/p&gt;
&lt;p&gt;You&apos;ll be able to read Naoki&apos;s opinion on Python and other programming languages when it comes to asynchronous workflows.&lt;/p&gt;
&lt;h2&gt;Chris Dent&lt;/h2&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/cdent.png&quot; alt=&quot;Chris Dent portrait&quot; /&gt;&lt;/p&gt;
&lt;p&gt;Chris Dent has been using Python for more than 15 years now and is an expert on WSGI. He has extensive knowledge of REST APIs – he is one of the early organizers of the &lt;a href=&quot;https://specs.openstack.org/openstack/api-wg/&quot;&gt;OpenStack API working group&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Chris has, among other things, created Gabbi, a fabulous Python tool for testing HTTP APIs. In &lt;em&gt;Scaling Python&lt;/em&gt;, he provides best practices on building REST APIs.&lt;/p&gt;
&lt;h2&gt;Joshua Harlow&lt;/h2&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/harlowja.png&quot; alt=&quot;Joshua Harlow portrait&quot; /&gt;&lt;/p&gt;
&lt;p&gt;Joshua is a highly experienced engineer in distributed systems. He maintains a few Python libraries, such as &lt;em&gt;Kazoo&lt;/em&gt; (ZooKeeper client) or &lt;em&gt;TaskFlow&lt;/em&gt; (distributed tasks).&lt;/p&gt;
&lt;p&gt;In the book, Joshua lays down principles that make Python application resilient and fault tolerant.&lt;/p&gt;
&lt;h2&gt;Alexys Jacob-Monier&lt;/h2&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/ajacobmonier.png&quot; alt=&quot;Alexys Jacob-Monier portrait&quot; /&gt;&lt;/p&gt;
&lt;p&gt;Alexys is the CTO of 1000mercis and has been part of the open-source software community for a few years now. He regularly gives talks at Python conferences about how to leverage Python when distributing applications.&lt;/p&gt;
&lt;p&gt;Alexys talks about advanced techniques, e.g. using consistent hash rings, and how they should be applied.&lt;/p&gt;
&lt;h2&gt;Victor Stinner&lt;/h2&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/vstinner.png&quot; alt=&quot;Victor Stinner portrait&quot; /&gt;&lt;/p&gt;
&lt;p&gt;Victor is a long-time CPython core developer who has been working on the language itself for several years now. He is well known in the community for making CPython faster and leads several performance-oriented projects.&lt;/p&gt;
&lt;p&gt;In &lt;em&gt;Scaling Python&lt;/em&gt;, Victor talks about optimizations, profiling, and performance when using Python, and how to make the right decisions.&lt;/p&gt;
&lt;h2&gt;Jason Myers&lt;/h2&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/jmyers.png&quot; alt=&quot;Jason Myers portrait&quot; /&gt;&lt;/p&gt;
&lt;p&gt;Jason is a Python developer and an author – he wrote an entire book on SQLAlchemy, the famous Python SQL library. He worked on cloud computing platforms, as a Web developer, and as a data engineer.&lt;/p&gt;
&lt;p&gt;In the book, Jason discusses caching and RDBMS usage with us.&lt;/p&gt;
&lt;hr /&gt;
&lt;p&gt;&lt;a href=&quot;https://scaling-python.com&quot;&gt;&lt;img src=&quot;https://scaling-python.com/img/the-hacker-guide-to-scaling-python.png&quot; alt=&quot;Cover of The Hacker&apos;s Guide to Scaling Python&quot; /&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;It was marvelous to chat with all those developers and pick their brains on different subjects. Their contributions broaden the scope and expand the view of the themes covered through the chapters. I can&apos;t thank them all enough!&lt;/p&gt;
&lt;p&gt;If you want to be informed of the release of the book, subscribe in the following form! You&apos;ll be the first to be notified and to enjoy an exclusive offer. ;-)&lt;/p&gt;
</content:encoded></item><item><title>Mastering PostgreSQL</title><link>https://julien.danjou.info/blog/mastering-postgresql/</link><guid isPermaLink="true">https://julien.danjou.info/blog/mastering-postgresql/</guid><description>A few months ago, my friend Dimitri Fontaine and I discussed writing books and sharing our knowledge.</description><pubDate>Tue, 07 Nov 2017 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;A few months ago, my friend &lt;a href=&quot;http://tapoueh.org&quot;&gt;Dimitri Fontaine&lt;/a&gt; and I discussed writing books and sharing our knowledge. If you do not know Dimitri yet, he is an old-time PostgreSQL Major Contributor – meaning he writes code for the PostgreSQL software itself!&lt;/p&gt;
&lt;p&gt;I interviewed Dimitri a few years ago in &lt;a href=&quot;https://thehackerguidetopython.com&quot;&gt;The Hacker&apos;s Guide to Python&lt;/a&gt;, where he shared his insight about writing proper Python application code with a relational database management system.&lt;/p&gt;
&lt;p&gt;All of this gave Dimitri the idea of writing his own book about PostgreSQL. And he released his book this week! If like me, you can&apos;t wait to read it, just scroll down below and grab a package with a &lt;strong&gt;15% discount&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;To celebrate the event, I went ahead and decided to ask him a few questions.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/dim-rounded.png&quot; alt=&quot;Dimitri Fontaine portrait&quot; /&gt;&lt;/p&gt;
&lt;h4&gt;Hey Dimitri! So what made you start writing this book in the first place?&lt;/h4&gt;
&lt;p&gt;&lt;em&gt;Dimitri&lt;/em&gt;: As a PostgreSQL consultant, I&apos;ve met with many developers for whom SQL just didn&apos;t click. They then tend to consider SQL much as they would consider HTML: some string you need to build dynamically then send over to an external part of the system, either the browser or the database server.&lt;/p&gt;
&lt;p&gt;As soon as you start on this path, SQL is more and more of a problem in your daily life and developer workflow. It doesn&apos;t integrate well with the usual testing and continuous integration tools, not to mention it&apos;s hard to review (as in code review) and hard to maintain.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;At the end of the day, when using a relational database system, you have to&lt;br /&gt;
know your SQL.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;As a developer, you need to be fluent in SQL and master window functions, common table expressions, recursive queries, time zone handling and advanced string and regexp processing functions, transaction behaviors and also how to build a query result set in JSON. And so much more.&lt;/p&gt;
&lt;p&gt;For most developers, it&apos;s a daunting task. Just too much to learn when they have so many other things to take care of. So many lines of code to write to implement that new product idea. Well, as Dijkstra put it, lines of code are “spent” on writing a new feature. When you master SQL, you spend much less of those lines of code.&lt;/p&gt;
&lt;p&gt;So I wrote &lt;strong&gt;Mastering PostgreSQL in Application Development&lt;/strong&gt; to teach SQL to developers, focusing on real use cases and authentic data sets, so that it&apos;s easier to grasp all those advanced features. The book also addresses the tooling you need to integrate SQL as another programming language with a decent workflow, from code review to unit testing, including regression testing and production debugging.&lt;/p&gt;
&lt;h4&gt;Who should read this book? What are the prerequisites to get the most of it?&lt;/h4&gt;
&lt;p&gt;&lt;em&gt;Dimitri&lt;/em&gt;: The pre-requisites are quite easy to reach.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;If you ever deployed an application that embeds SQL queries and talks to a&lt;br /&gt;
database server, you&apos;re in the target audience, the book is for you.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;If you&apos;ve never used PostgreSQL before, reading &lt;strong&gt;Mastering PostgreSQL in Application Development&lt;/strong&gt; may convince you to have a look at that awesome piece of technology. My bet is on you switching to PostgreSQL and finding it much better at helping you in your daily work and challenges.&lt;/p&gt;
&lt;p&gt;Even if you&apos;ve been using MySQL all your life, you will learn about standard SQL features and how to use them in a way that applies to more than just PostgreSQL, so the book is going to help you in your daily life.&lt;/p&gt;
&lt;h4&gt;What&apos;s your next adventure now that this book is out?&lt;/h4&gt;
&lt;p&gt;&lt;em&gt;Dimitri&lt;/em&gt;: There are more things that I want to do than a lifetime allows, and I am in the process of choosing what is going to be my next adventure. I feel so lucky to have that problem to solve… and it still isn&apos;t the easiest one for me.&lt;/p&gt;
&lt;p&gt;What I can tell you is that I have much more PostgreSQL knowledge to share after using, promoting, and contributing to this database server technology for about 20 years now. So if that first book sells well, I will get back to filling empty pages and deliver more content to help developers make the best of SQL, on their road to Mastering PostgreSQL!&lt;/p&gt;
&lt;h4&gt;Thanks Dimitri!&lt;/h4&gt;
&lt;p&gt;I&apos;ve just read the book and found it fantastic. It contains tons of tips on how to use PostgreSQL correctly, and I discovered SQL features I had no clue about. The book uses real data that you can fetch and play with. It provides the data and a Docker container with everything included so you can edit the query yourself and try it out. There is no better way to learn things than to play with the examples that are included, in just a few clicks!&lt;/p&gt;
&lt;p&gt;The book also features a few interviews with SQL experts from the PostgreSQL community and from the development community, which gives great insight about how to use the software.&lt;/p&gt;
&lt;p&gt;Dimitri is offering &lt;strong&gt;15% off for my readers&lt;/strong&gt; during the next 48 hours for any edition of the book. Just use the &lt;strong&gt;PYTHON-LOVES-POSTGRESQL&lt;/strong&gt; coupon code with any of the following packages:&lt;/p&gt;
&lt;p&gt;.product img { max-width: 100%; margin-top: 10px; } .col-sm-1 { width: 8.33333333%; float: left; } .col-sm-3 { width: 25%; float: left; } .col-sm-9 { width: 75%; float: left; } .col-sm-11 { width: 91.66666667%; float: left; /* should be on the div not here but well.. */ padding-left: 30px; padding-bottom: 20px; } .row { clear: both; } .btn.btn-default { background-color: #e6940e; color: #FFF; line-height: 46px; height: 50px; font-size: 19px; cursor: pointer; text-align: center; border-radius: 5px; border: 3px solid #e6940e; vertical-align: middle; display: inline-block; padding: 0 30px; position: relative; outline: none !important; transition: color .3s ease,background .3s ease,border-color .3s ease,opacity .3s ease; box-shadow: none; }&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://gumroad.com/a/623817843/szoX&quot;&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/MasteringPostgreSQLinAppDev-Cover.png&quot; alt=&quot;Mastering PostgreSQL – Enterprise Edition&quot; /&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;&lt;a href=&quot;https://gumroad.com/a/623817843/szoX&quot;&gt;Mastering PostgreSQL – Enterprise Edition&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;$179 &lt;strong&gt;$152&lt;/strong&gt;&lt;br /&gt;
with coupon code &lt;strong&gt;PYTHON-LOVES-POSTGRESQL&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;The Enterprise Edition includes:&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/ebook_reader.png&quot; alt=&quot;E-book reader icon&quot; /&gt;&lt;/p&gt;
&lt;p&gt;The book in PDF, EPUB and MOBI formats.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/icon_mic.png&quot; alt=&quot;Microphone icon&quot; /&gt;&lt;/p&gt;
&lt;p&gt;Interviews with industry veterans who began building web applications in the previous century. They&apos;ve been there and have opinions to share about how to approach SQL.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/icon_database.png&quot; alt=&quot;Database icon&quot; /&gt;&lt;/p&gt;
&lt;p&gt;The PostgreSQL database dump that you need to run the queries against, with a script to restore it easily. The database includes all the 12 datasets used in the book.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/docker-logo.png&quot; alt=&quot;Docker logo&quot; /&gt;&lt;/p&gt;
&lt;p&gt;A Docker container image of an already loaded PostgreSQL database with the whole 12 datasets in 56 tables, and the 265 SQL queries each in their own .sql file and a Web-based application for easily running and editing the SQL queries.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/certificate.png&quot; alt=&quot;Certificate icon&quot; /&gt;&lt;/p&gt;
&lt;p&gt;Licence for you to share the book and the Docker set-up with up to 50 people, including you. That&apos;s everything you need for your whole team to master SQL!&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://gumroad.com/a/623817843/szoX&quot;&gt;Buy the Enterprise Edition&lt;/a&gt;&lt;/p&gt;
&lt;hr /&gt;
&lt;p&gt;&lt;a href=&quot;https://gumroad.com/a/623817843/koprL&quot;&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/MasteringPostgreSQLinAppDev-Cover.png&quot; alt=&quot;Mastering PostgreSQL – Full Edition&quot; /&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;&lt;a href=&quot;https://gumroad.com/a/623817843/koprL&quot;&gt;Mastering PostgreSQL – Full Edition&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;$89 &lt;strong&gt;$75&lt;/strong&gt;&lt;br /&gt;
with coupon code &lt;strong&gt;PYTHON-LOVES-POSTGRESQL&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;The Full Edition includes:&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/ebook_reader.png&quot; alt=&quot;E-book reader icon&quot; /&gt;&lt;/p&gt;
&lt;p&gt;The book in PDF, EPUB and MOBI formats.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/icon_mic.png&quot; alt=&quot;Microphone icon&quot; /&gt;&lt;/p&gt;
&lt;p&gt;Interviews with industry veterans who began building web applications in the previous century. They&apos;ve been there and have opinions to share about how to approach SQL.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/icon_database.png&quot; alt=&quot;Database icon&quot; /&gt;&lt;/p&gt;
&lt;p&gt;The PostgreSQL database dump that you need to run the queries against, with a script to restore it easily. The database includes all the 12 datasets used in the book.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://gumroad.com/a/623817843/koprL&quot;&gt;Buy the Full Edition&lt;/a&gt;&lt;/p&gt;
&lt;hr /&gt;
&lt;p&gt;&lt;a href=&quot;https://gumroad.com/a/623817843/WhNVv&quot;&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/MasteringPostgreSQLinAppDev-Cover.png&quot; alt=&quot;Mastering PostgreSQL&quot; /&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;&lt;a href=&quot;https://gumroad.com/a/623817843/WhNVv&quot;&gt;Mastering PostgreSQL&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;$39 &lt;strong&gt;$33&lt;/strong&gt;&lt;br /&gt;
with coupon code &lt;strong&gt;PYTHON-LOVES-POSTGRESQL&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;The Standard Edition includes:&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/ebook_reader.png&quot; alt=&quot;E-book reader icon&quot; /&gt;&lt;/p&gt;
&lt;p&gt;The book in PDF, EPUB and MOBI formats.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/icon_mic.png&quot; alt=&quot;Microphone icon&quot; /&gt;&lt;/p&gt;
&lt;p&gt;Interviews with industry veterans who began building web applications in the previous century. They&apos;ve been there and have opinions to share about how to approach SQL.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://gumroad.com/a/623817843/WhNVv&quot;&gt;Buy the book&lt;/a&gt;&lt;/p&gt;
&lt;hr /&gt;
&lt;p&gt;If you have any question, feel free to reach &lt;a href=&quot;mailto:dim@tapoueh.org&quot;&gt;Dimitri&lt;/a&gt; directly and he will be happy to reply. Or write in the comment section below!&lt;/p&gt;
&lt;p&gt;And don&apos;t worry: if the book is not what you expect it to be and has no value to you, then just say so and Dimitri will refund you, no questions asked.&lt;/p&gt;
</content:encoded></item><item><title>Gnocchi 4.1 is out</title><link>https://julien.danjou.info/blog/gnocchi-4-1-release/</link><guid isPermaLink="true">https://julien.danjou.info/blog/gnocchi-4-1-release/</guid><description>We did it again. A bit more of our usual four months were needed to do it, but Gnocchi 4.1 has been released. This is a great news and another big milestone for the project!  As usual, we enhanced Gno</description><pubDate>Fri, 27 Oct 2017 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;We did it again. A bit more of our usual four months were needed to do it, but Gnocchi 4.1 has been released. This is a great news and another big milestone for the project!&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/gnocchi-logo.png&quot; alt=&quot;Gnocchi logo&quot; /&gt;&lt;/p&gt;
&lt;p&gt;As usual, we enhanced Gnocchi and added a bunch of new things that &lt;a href=&quot;http://gnocchi.xyz/releasenotes/4.1.html&quot;&gt;can all be seen in the online changelog&lt;/a&gt;. Nevertheless, I would like to talk about a few of them here!&lt;/p&gt;
&lt;p&gt;First, we added notification support to the Redis incoming driver. This feature makes sure that, when using Redis as an incoming measure driver, the metrics are processed as fast as possible, rather than waiting for &lt;code&gt;metric_processing_delay&lt;/code&gt; to expire. This moves the incoming driver toward more of a push model than a pull model – even if it still uses both. The feature decreases the latency between the time measures are pushed and the time they are processed by &lt;em&gt;metricd&lt;/em&gt;, which is a substantial improvement.&lt;/p&gt;
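&lt;p&gt;Switching the incoming driver to Redis stays a tiny configuration change – a minimal sketch, with connection options omitted (they default to a local Redis):&lt;/p&gt;

```ini
# Minimal sketch: route incoming measures through Redis
# (connection options omitted; they default to localhost)
[incoming]
driver = redis
```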
&lt;p&gt;Secondly, the internal computing engine (measures aggregation) has been entirely ported from &lt;a href=&quot;http://pandas.pydata.org&quot;&gt;Pandas&lt;/a&gt; to &lt;a href=&quot;http://numpy.org&quot;&gt;Numpy&lt;/a&gt;. While Pandas is built on top of Numpy, it does extra work beyond what plain Numpy provides. Those features are handy when quickly writing data analysis code, but they are not needed for Gnocchi. They cost CPU time, which means less throughput for &lt;em&gt;metricd&lt;/em&gt;. Pandas is still needed for the old, deprecated dynamic aggregation feature and will be entirely removed as a dependency in the next version of Gnocchi.&lt;/p&gt;
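&lt;p&gt;As a rough illustration of why plain Numpy is enough for fixed-granularity aggregation – this is a sketch with made-up sample data, not Gnocchi&apos;s actual code – raw points can be bucketed and averaged with a few vectorized operations:&lt;/p&gt;

```python
import numpy as np

# Illustration only (not Gnocchi's actual code): aggregate raw points
# into fixed 60-second buckets using plain NumPy, no Pandas required.
timestamps = np.array([0, 10, 70, 80, 130], dtype="int64")  # seconds
values = np.array([1.0, 3.0, 5.0, 7.0, 9.0])

buckets = timestamps // 60                      # bucket index of each point
unique, index = np.unique(buckets, return_index=True)
sums = np.add.reduceat(values, index)           # per-bucket sums in one pass
counts = np.diff(np.append(index, len(values))) # per-bucket point counts
print(sums / counts)                            # [2. 6. 9.]
```

&lt;p&gt;No Pandas involved: &lt;code&gt;reduceat&lt;/code&gt; computes every per-bucket sum in a single pass over the time-sorted series.&lt;/p&gt;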
&lt;p&gt;Finally, the biggest feature that has landed is &lt;a href=&quot;http://gnocchi.xyz/rest.html#aggregates-on-the-fly-measurements-modification-and-aggregation&quot;&gt;the new &lt;code&gt;/v1/aggregates&lt;/code&gt; endpoint&lt;/a&gt;. This major addition allows retrieving aggregates, but also doing cross-metric aggregation in ways that were not possible before. For example, you can request &quot;the absolute value of the product of the means of two metrics&quot; by writing: &lt;code&gt;(absolute (* (metric 32dd0731-c423-45aa-94f6-e4069989eb57 mean) (metric 942990de-b208-4bf7-a0ee-93e4890df73a mean)))&lt;/code&gt;. This endpoint supports fetching any metric from the database (by id or by searching resources) and applying any mathematical operation to it. The syntax is inspired by Lisp, which makes it easy to write both as a string and as JSON.&lt;/p&gt;
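&lt;p&gt;To give a feel for why the Lisp-inspired string form maps naturally to JSON, here is a toy s-expression parser – a sketch for illustration only, not Gnocchi&apos;s actual implementation – that turns such a string into nested lists:&lt;/p&gt;

```python
# Toy s-expression parser (illustration only, not Gnocchi's code):
# turns the Lisp-like aggregates syntax into nested lists, which
# serialize directly to JSON.
def parse(expression):
    tokens = expression.replace("(", " ( ").replace(")", " ) ").split()

    def read(pos):
        if tokens[pos] == "(":
            node = []
            pos += 1
            while tokens[pos] != ")":
                child, pos = read(pos)
                node.append(child)
            return node, pos + 1   # skip the closing paren
        return tokens[pos], pos + 1

    tree, _ = read(0)
    return tree

expr = ("(absolute (* (metric 32dd0731-c423-45aa-94f6-e4069989eb57 mean) "
        "(metric 942990de-b208-4bf7-a0ee-93e4890df73a mean)))")
print(parse(expr))
```

&lt;p&gt;The result is a nested list starting with &lt;code&gt;absolute&lt;/code&gt;, ready to be dumped as JSON.&lt;/p&gt;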
&lt;p&gt;Come and join us on &lt;a href=&quot;http://github.com/gnocchixyz&quot;&gt;GitHub&lt;/a&gt;! Star us, and stay tuned for some more awesome news around metrics.&lt;/p&gt;
</content:encoded></item><item><title>My interview with Cool Python Codes</title><link>https://julien.danjou.info/blog/interview-coolpythoncodes/</link><guid isPermaLink="true">https://julien.danjou.info/blog/interview-coolpythoncodes/</guid><description>A few days ago, I&apos;ve recently been contacted by Godson Rapture from Cool Python codes to answer a few questions about what I work on in open source.</description><pubDate>Thu, 05 Oct 2017 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;A few days ago, I was contacted by Godson Rapture from &lt;a href=&quot;http://coolpythoncodes.com/&quot;&gt;Cool Python codes&lt;/a&gt; to answer a few questions about what I work on in open source. Godson regularly interviews developers, and I invite you to check out his website!&lt;/p&gt;
&lt;p&gt;Here&apos;s a copy of &lt;a href=&quot;http://coolpythoncodes.com/julien-danjou/&quot;&gt;my original interview&lt;/a&gt;. Enjoy!&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Good day, Julien Danjou, welcome to Cool Python Codes. Thanks for taking your precious time to be here.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;You’re welcome!&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Could you kindly tell us about yourself like your full name, hobbies, nationality, education, and experience in programming?&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Sure. I’m Julien Danjou, I’m French and live in Paris, France. I studied computer science for 5 years, around 15 years ago, and I have worked in that field ever since, specializing in open source projects.&lt;/p&gt;
&lt;p&gt;Those last years, I’ve been working as a software engineer at Red Hat. I’ve spent the last 10 years working with the Python programming language. Now I work on the Gnocchi project which is a time series database.&lt;/p&gt;
&lt;p&gt;When I’m not coding, I enjoy running half-marathons and playing FPS games.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/pyconfr-2017-jd.jpg&quot; alt=&quot;Julien Danjou at PyCon France 2017&quot; /&gt;&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Can you narrate your first programming experience and what got you to start learning to program?&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I started programming around 2001, and my first serious programs were in Perl. I was contributing to a hosting platform for free software named VHFFS. It was a free software project itself, and I enjoyed being able to learn from other, more experienced developers and to contribute back. That’s what got me hooked on the world of open source projects.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Which programming language do you know and which is your favorite?&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I know quite a few, I’ve been doing serious programming in Perl, C, Lua, Common Lisp, Emacs Lisp and Python.&lt;/p&gt;
&lt;p&gt;Obviously, my favorite is Common Lisp, but I was never able to use it for any serious project, for various reasons. So I spend most of my time hacking with Python, which I really enjoy as it is close to Lisp, in some ways. I see it as a small subset of Lisp.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;What inspired you to venture into the world of programming and drove you to learn a handful of programming languages?&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;It was mostly scratching my own itches when I started. Each time I saw something I wanted to do or a feature I wanted in an existing software, I learned what I needed to get going and get it working.&lt;/p&gt;
&lt;p&gt;I studied C and Lua while writing awesome, the window manager that I created 10 years ago and used for a while. I learned Emacs Lisp while writing extensions that I wanted to see in Emacs, etc. It’s the best way to start.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;What is your blog about?&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;My blog is a platform where I write about what I work on most of the time. Nowadays, it’s mostly about Python and the main project I contribute to, Gnocchi.&lt;/p&gt;
&lt;p&gt;When writing about Gnocchi, I usually try to explain what part of the project I worked on, what new features we achieved, etc.&lt;/p&gt;
&lt;p&gt;On Python, I try to share solutions to common problems I encountered or identified while doing e.g. code reviews. Or presenting a new library I created!&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Tell us more about your book, The Hacker’s Guide to Python.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;It’s a compilation of everything I learned those last years building large Python applications. I spent the last 6 years developing on a large code base with thousands of other developers.&lt;/p&gt;
&lt;p&gt;I’ve reviewed tons of code and identified the biggest issues, mistakes, and bad practices that developers tend to adopt. I decided to compile all of that into a guide, helping developers who have played a bit with Python learn the steps to become really productive with the language.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;OpenStack is the biggest open source project in Python, Can you tell us more about OpenStack?&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;OpenStack is a cloud computing platform, started 7 years ago now. Its goal is to provide a programmatic platform to manage your infrastructure while being open source and avoiding vendor lock-in.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Who uses OpenStack? Is it for programmers, website owners?&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;It’s used by a lot of different organizations – not really by individuals. It’s a big piece of software. You can find it in some famous public cloud providers (Dreamhost, Rackspace…), and also as a private cloud in a lot of different organizations, from Bloomberg to eBay or CERN in Switzerland, a big OpenStack user. Tons of telecom providers also leverage OpenStack for their own internal infrastructure.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Have you participated in any OpenStack conference? What did you speak on if you did?&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I’ve attended the last 9 OpenStack summits and a few other OpenStack events around the world. I’ve been engaged in the upstream community for the last 6 years now.&lt;/p&gt;
&lt;p&gt;My area of expertise is telemetry, the stack of software that is in charge of collecting and storing metrics from the various OpenStack components. This is what I regularly talk about during those events.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;How can one join the OpenStack community?&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;There’s an entire documentation about that, called the &lt;a href=&quot;https://docs.openstack.org/infra/manual/developers.html&quot;&gt;Developer’s Guide&lt;/a&gt;. It explains how to set up your environment to send patches and how to join the community using the mailing lists or IRC.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;What makes your book, &lt;a href=&quot;https://thehackerguidetopython.com&quot;&gt;The Hacker’s Guide to Python&lt;/a&gt; stand out from other Python books? Also, who exactly did you write this book for?&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I wrote the book that I always wanted to read about Python, but never found. It’s not a book for people who want to learn Python from scratch. It’s a great guide for those who know the language but don’t know the details that experienced developers know and that make the difference: the best practices, the elegant solutions to common problems, etc. That’s why it also includes interviews with prominent Python developers, so they can share their advice on different areas.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;How can someone get your book?&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I’ve decided to self-publish my book, so it does not have a publisher like you might be used to seeing. The best place to get it is online at &lt;a href=&quot;https://thehackerguidetopython.com&quot;&gt;thehackerguidetopython.com&lt;/a&gt;, where you can pick the format you want, electronic or paper.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;What do you mean when you say you hack with Python?&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Unfortunately, most people refer to hacking as the activity of bad guys trying to get access to whatever they’re not supposed to see. In the book title, I mean “hacking” as the elegant way of writing code and making things work smoothly, even in ways you were not expecting.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;You mentioned earlier that Gnocchi is a time series database. Can you please be more elaborate about Gnocchi? Is there also any documentation about Gnocchi?&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;So Gnocchi is a project I started a few years ago to store time series at large scale. A time series is basically a series of tuples, each composed of a timestamp and a value.&lt;/p&gt;
&lt;p&gt;Imagine you wanted to store the temperature of all the rooms in the world at every point in time. You’d need a dedicated database for that, with the right data structure. This is what Gnocchi does: it provides this data structure storage at very, very large scale.&lt;/p&gt;
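&lt;p&gt;In code, the data model is as simple as it sounds – a tiny illustration with made-up sample data, showing a series of (timestamp, value) tuples and one aggregate computed over it:&lt;/p&gt;

```python
from datetime import datetime, timedelta
from statistics import mean

# A time series: ordered (timestamp, value) tuples, e.g. one room's
# temperature sampled every 15 minutes (made-up illustration data).
start = datetime(2017, 10, 5, 12, 0)
series = [(start + timedelta(minutes=15 * i), 20.0 + 0.5 * i) for i in range(4)]

# A store like Gnocchi pre-computes aggregates, e.g. the hourly mean:
hourly_mean = mean(value for _, value in series)
print(hourly_mean)  # 20.75
```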
&lt;p&gt;The primary use case is infrastructure monitoring, so most people use it to store tons of metrics about their hardware, software, etc. It’s fully documented on &lt;a href=&quot;http://gnocchi.xyz&quot;&gt;its website&lt;/a&gt;.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;How can a programmer without much experience contribute to open source projects?&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The best way to start is to try to fix something that irritates you in some way. It might be a bug, it might be a missing feature. Start small. Don’t try big things first or you could be discouraged.&lt;/p&gt;
&lt;p&gt;Never stop.&lt;/p&gt;
&lt;p&gt;Also, don’t plunge right away in the community and start poking random people or spam them with questions. Do your homework, and listen to the community for a while to get a sense of how things are going. That can be joining IRC and lurking or following the mailing lists for example.&lt;/p&gt;
&lt;p&gt;Big open source communities dedicate programs to help you become engaged. It might be worth a try. Generic programs like Outreachy or Google Summer of Code are a great way to start if you don’t feel confident enough to jump by your own means in a community.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Just out of curiosity, do you write code in French?&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Never ever. I think it’s acceptable to write in your language if you are sure that your code will never be open sourced and that your whole team is talking in that language, no matter what – but it’s a ballsy assumption, clearly.&lt;/p&gt;
&lt;p&gt;Truth is that if you do open source, English is the standard, so go with it. Be sad if you want, but please be pragmatic.&lt;/p&gt;
&lt;p&gt;I’ve seen projects being open sourced by companies where all the source code comments were in Korean. It was impossible for any non-Korean person to get a sense of what the code and the project were doing, so they just failed and disappeared.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;How does a team of programmers handle bugs in a large open source project?&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I wish there was some magic recipe, but I don’t think it’s the case. What you want is to have a place where your users can feel safe reporting bugs. Include a template so they don’t forget any details: how to reproduce the bugs, what they expected, etc. The worst thing is to have users reporting “That does not work.” with no details. It’s a waste of time.&lt;/p&gt;
&lt;p&gt;What tool to use to log all of that really depends on the team size and culture.&lt;/p&gt;
&lt;p&gt;Once that works, the actual fixing of bugs doesn’t follow any rule. Most developers fix the bugs they encounter or the ones that are the most critical for users. Smaller problems might not be fixed for a long time.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Can you tell us about the new book you are working on and when do we expect to get it?&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;That new book is entitled &lt;a href=&quot;https://scaling-python.com&quot;&gt;“Scaling Python”&lt;/a&gt; and it provides insight into how to build highly scalable and distributed applications using Python.&lt;/p&gt;
&lt;p&gt;It is also based on my experience building this kind of software during the past years. The book also includes interviews with great Python hackers who work on scalable systems or know a thing or two about writing applications for performance – an important requirement for scalable applications.&lt;/p&gt;
&lt;p&gt;The book is in its final stage now, and it should be out at the beginning of 2018.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;How can someone get in contact with you?&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I’m reachable at &lt;a href=&quot;mailto:julien@danjou.info&quot;&gt;julien@danjou.info&lt;/a&gt; by email or via Twitter, &lt;a href=&quot;https://twitter.com/juldanjou&quot;&gt;@juldanjou&lt;/a&gt;.&lt;/p&gt;
</content:encoded></item><item><title>Gnocchi 4 performance</title><link>https://julien.danjou.info/blog/gnocchi-4-performance/</link><guid isPermaLink="true">https://julien.danjou.info/blog/gnocchi-4-performance/</guid><description>It has been a long time since I have tested Gnocchi performances. Last time was two years ago, on version 2. The current version for Gnocchi is 4.0, released a couple of months ago.</description><pubDate>Mon, 11 Sep 2017 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;It has been a long time since I have tested &lt;a href=&quot;http://gnocchi.xyz&quot;&gt;Gnocchi&lt;/a&gt; performance. &lt;a href=&quot;https://julien.danjou.info/blog/2015/gnocchi-benchmarks&quot;&gt;Last time was two years ago, on version 2&lt;/a&gt;. The current version of Gnocchi is 4.0, &lt;a href=&quot;https://julien.danjou.info/blog/2017/gnocchi-4-release&quot;&gt;released a couple of months ago&lt;/a&gt;. It adds a lot of new features, such as a &lt;a href=&quot;http://redis.org&quot;&gt;Redis&lt;/a&gt; incoming driver and a new job distribution method.&lt;/p&gt;
&lt;p&gt;Many of the features and improvements implemented over the last couple of years were made with performance in mind. It is time to check whether they live up to our expectations.&lt;/p&gt;
&lt;h3&gt;Test protocol&lt;/h3&gt;
&lt;p&gt;I have pulled the servers I used a couple of years ago out of the dust, updated them with the latest RHEL 7, and installed Gnocchi 4.0.1 and Redis 4.0.1 on one of them. I used the other server as the benchmark client, in charge of generating the load.&lt;/p&gt;
&lt;p&gt;The hardware configuration for each server is:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;2 × Intel(R) Xeon(R) CPU E5-2609 v3 @ 1.90GHz (6 cores each)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;32 GB RAM&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;SanDisk Extreme Pro 240GB SSD&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;I have installed Gnocchi using &lt;code&gt;pip install gnocchi[postgresql,file,redis]&lt;/code&gt;, created a PostgreSQL database and wrote the following configuration file:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;[indexer]
url = postgresql://root:@localhost/gnocchi

## Uncomment when testing with Redis
## [incoming]
## driver = redis

[storage]
file_basepath = /root/gnocchi-venv/data
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The perk of having good default values: you only need to write a couple of configuration lines to get it working.&lt;/p&gt;
&lt;p&gt;I have used uWSGI as the Web server, using the configuration file &lt;a href=&quot;http://gnocchi.xyz/operating.html#running-api-as-a-wsgi-application&quot;&gt;provided in Gnocchi&apos;s documentation&lt;/a&gt;, and configured it with 64 processes and 16 threads.&lt;/p&gt;
&lt;p&gt;Since the hardware configurations are identical, I allow myself in this article to compare the performances of Gnocchi 2 and Gnocchi 4 directly.&lt;/p&gt;
&lt;h3&gt;Benchmark tools&lt;/h3&gt;
&lt;p&gt;For generating load, I have reused the code that I wrote and merged into &lt;a href=&quot;http://pypi.python.org/gnocchiclient&quot;&gt;python-gnocchiclient&lt;/a&gt;. Generating a lot of parallel load in Python is still not that easy, but this remains the best tool I found that was not too complicated to set up for things like CRUD operations.&lt;/p&gt;
&lt;p&gt;To benchmark measures, I needed something very fast on the client side to be sure to be able to overload the server. I have leveraged &lt;a href=&quot;https://github.com/wg/wrk&quot;&gt;wrk&lt;/a&gt;, which is written in C and is fast. It is scriptable using Lua, which made it easy to generate fake batches of data.&lt;/p&gt;
&lt;h3&gt;Metric CRUD operations&lt;/h3&gt;
&lt;p&gt;The first step is to benchmark the CRUD operations for metrics. Here are the results, compared to the benchmarks I ran against Gnocchi 2.&lt;/p&gt;
&lt;p&gt;Without surprises (but with great pleasure), everything is between 13% and 26% faster. Those operations mostly consist of SQL operations for the backend and serialization on the API – nothing heavy.&lt;/p&gt;
&lt;h3&gt;Sending and getting measures&lt;/h3&gt;
&lt;p&gt;Writing measures is still the hottest topic! How fast can you push things into that time series database and how efficient it is at retrieving those?&lt;/p&gt;
&lt;p&gt;Gnocchi has been supporting various batching methods for a while; the case tested here is the simplest one, i.e., batching for one metric at a time.&lt;/p&gt;
&lt;p&gt;I think the chart speaks for itself. With Redis as a driver, I attained almost &lt;strong&gt;1 million measures per second&lt;/strong&gt;. I did not find a suitable tool to report performance with a payload bigger than 5000 points, so I stopped at that. Those results are in line with what &lt;a href=&quot;https://medium.com/@gord.chung/gnocchi-4-introspective-a83055e99776&quot;&gt;Gordon Chung measured recently on Gnocchi 4&lt;/a&gt; – though he achieved &lt;strong&gt;1.3 million measures per second&lt;/strong&gt; with his bigger hardware!&lt;/p&gt;
&lt;p&gt;These are the performances using HTTP as the protocol – with all its overhead and JSON serialization going on. Gnocchi does not implement any custom protocol so far because we never had any requirement for more performance. However, that would certainly be a good path to follow for anyone wanting to go even faster.&lt;/p&gt;
&lt;p&gt;Reading metrics is 54% faster here again. You can retrieve up to 400 000 measures per second (around 150 Mbit/s of data). That means you can retrieve a metric with a whole year of measures at a one-minute aggregate in 1.3 seconds. More realistically, you can retrieve the last 24 hours of data at a one-minute precision for 280 metrics in just one second. That is more data than you could ever fit on your graph dashboard!&lt;/p&gt;
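&lt;p&gt;Those two claims follow directly from the measured throughput – a quick back-of-the-envelope check, assuming a steady 400 000 measures per second:&lt;/p&gt;

```python
# Sanity-check the retrieval numbers, assuming a steady throughput
# of 400 000 measures per second.
throughput = 400_000

# A year of one-minute aggregates for one metric:
year_of_minutes = 365 * 24 * 60               # 525 600 measures
print(year_of_minutes / throughput)           # 1.314 seconds

# The last 24 hours at one-minute precision for 280 metrics:
day_of_minutes = 24 * 60                      # 1 440 measures per metric
print(280 * day_of_minutes / throughput)      # 1.008 seconds
```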
&lt;p&gt;Most of the time is spent serializing points in JSON – again, a different retrieving mechanism could be envisioned to achieve even higher performances.&lt;/p&gt;
&lt;h3&gt;&lt;em&gt;Metricd&lt;/em&gt; speed&lt;/h3&gt;
&lt;p&gt;I did not benchmark metricd speed myself, as &lt;a href=&quot;https://medium.com/@gord.chung/gnocchi-4-introspective-a83055e99776&quot;&gt;Gordon wrote a complete report in the meantime&lt;/a&gt;. Gnocchi 4 multiplies the processing speed of Gnocchi 2 by a factor of 2.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/gnocchi-metricd-speed.png&quot; alt=&quot;Chart showing Gnocchi metricd processing speed improvements from version 2 to 4&quot; /&gt;&lt;/p&gt;
&lt;p&gt;This speed is quite impressive and allows Gnocchi to ingest and pre-compute a considerable amount of data in a short time span. Some of the changes Gordon tested here are not yet released and will be part of the next minor release (4.1).&lt;/p&gt;
&lt;p&gt;Being that efficient means that, with only 1 CPU, Gnocchi can aggregate roughly 700 measures per second. If you have 70 servers and gather 10 metrics per server every second, Gnocchi can process them without any delay.&lt;/p&gt;
&lt;p&gt;If you scale back your polling to one minute instead of one second (the most common scenario) and use a single computer with 12 cores, that means Gnocchi can &lt;strong&gt;aggregate the metrics from 50 400 servers with only one server&lt;/strong&gt;.&lt;/p&gt;
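&lt;p&gt;The arithmetic behind that claim checks out – a sketch of the capacity math:&lt;/p&gt;

```python
# Capacity math behind the "50 400 servers" figure.
per_cpu_rate = 700        # measures aggregated per second per CPU
cpus = 12
poll_interval = 60        # seconds between two measures of a metric
metrics_per_server = 10

aggregation_rate = per_cpu_rate * cpus                 # 8 400 measures/s
metrics_supported = aggregation_rate * poll_interval   # 504 000 metrics
print(metrics_supported // metrics_per_server)         # 50400
```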
&lt;p&gt;Not that bad.&lt;/p&gt;
&lt;h3&gt;Conclusion&lt;/h3&gt;
&lt;p&gt;Our processing engine is now getting really mature. Hundreds of deployments are using it in production to gather metrics. The recent improvements made for Gnocchi 4 are a compelling argument for users to upgrade, and we are pretty proud of our work! We still have a few ideas on how to improve some corner cases, but the general use case is well covered. Add to that the native horizontal scalability that Gnocchi has provided since day one, and it is getting hard to find a time series database that offers those features with this level of performance (but of course I&apos;m biased, haha).&lt;/p&gt;
&lt;p&gt;And if you have any questions, feel free to shoot them in the comment section. 😉&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Data from the charts: CRUD operations per second (create/read/delete) went from 1300/670/524 with Gnocchi 2 to 1473/843/708 with Gnocchi 4. Write throughput peaks around 927 000 measures per second with the Redis driver and 5000-point batches (about 880 000 with the file driver, versus 122 000 for Gnocchi 2). Read throughput went from 260 000 measures per second with Gnocchi 2 to about 400 000 with Gnocchi 4.&lt;/em&gt;&lt;/p&gt;
</content:encoded></item><item><title>Attending PyCon FR 2017</title><link>https://julien.danjou.info/blog/pyconfr-announce/</link><guid isPermaLink="true">https://julien.danjou.info/blog/pyconfr-announce/</guid><description>The French edition of the annual Python conference, PyCon FR, will happen in Toulouse from 21st to 24th September.</description><pubDate>Thu, 31 Aug 2017 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;The French edition of the annual Python conference, &lt;a href=&quot;https://pycon.fr/2017/&quot;&gt;PyCon FR&lt;/a&gt;, will happen in Toulouse from 21st to 24th September.&lt;/p&gt;
&lt;p&gt;I skipped the last few PyCon FR, but this year I will be back with a one-hour long talk entitled &quot;&lt;em&gt;Scalable and distributed applications in Python&lt;/em&gt;&quot;. It will take place on Saturday afternoon. I will lay out many topics that will be covered in the book I&apos;m working on, &lt;a href=&quot;http://scaling-python.com&quot;&gt;Scaling Python&lt;/a&gt;!&lt;/p&gt;
&lt;p&gt;The Thursday and Friday will be dedicated to development sprints. I will be there with my friend &lt;a href=&quot;https://blog.sileht.net/&quot;&gt;Mehdi&lt;/a&gt; running a session for &lt;a href=&quot;http://gnocchi.xyz&quot;&gt;Gnocchi&lt;/a&gt;! We&apos;ll spend time teaching new contributors how to use it or how to send love and patches to the project. If you&apos;re into Python and want to learn about timeseries management, it&apos;s an excellent occasion to join us for some fun. 😎&lt;/p&gt;
&lt;p&gt;To join the sprint and the conference, visit the &lt;a href=&quot;http://pyconfr.org&quot;&gt;PyCon FR website&lt;/a&gt; and &lt;a href=&quot;https://www.eventbrite.fr/e/billets-pycon-fr-2017-a-toulouse-37380880219&quot;&gt;register&lt;/a&gt;.&lt;/p&gt;
</content:encoded></item><item><title>Gnocchi or Prometheus?</title><link>https://julien.danjou.info/blog/gnocchi-or-prometheus/</link><guid isPermaLink="true">https://julien.danjou.info/blog/gnocchi-or-prometheus/</guid><description>The realm of time series database keeps expanding those last years. Now and then a new contender appears from the fog. People keep asking me about the difference between Gnocchi and Prometheus.</description><pubDate>Wed, 30 Aug 2017 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;The realm of time series databases keeps expanding these last years. Now and then, a new contender appears from the fog. People keep asking me about the difference between &lt;a href=&quot;http://gnocchi.xyz&quot;&gt;Gnocchi&lt;/a&gt; and &lt;a href=&quot;http://prometheus.io&quot;&gt;Prometheus&lt;/a&gt;. It&apos;s time to compare them.&lt;/p&gt;
&lt;p&gt;Gnocchi and Prometheus are two open source projects evolving in the same area of expertise, time series handling. They are both licensed under the &lt;strong&gt;Apache 2.0 license&lt;/strong&gt; (see the &lt;a href=&quot;https://github.com/gnocchixyz/gnocchi/blob/master/LICENSE&quot;&gt;Gnocchi license file&lt;/a&gt; and the &lt;a href=&quot;https://github.com/prometheus/prometheus/blob/master/LICENSE&quot;&gt;Prometheus license file&lt;/a&gt;). And that&apos;s a good thing!&lt;/p&gt;
&lt;p&gt;Both Gnocchi and Prometheus offer a bunch of features. Here&apos;s a summary table of the features they each offer – or not.&lt;/p&gt;
&lt;table id=&quot;comparison&quot;&gt;
&lt;thead&gt;
&lt;tr&gt;&lt;th&gt;Feature&lt;/th&gt;&lt;th&gt;Prometheus&lt;/th&gt;&lt;th&gt;Gnocchi&lt;/th&gt;&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;Multi-tenant&lt;/td&gt;&lt;td&gt;❌&lt;/td&gt;&lt;td&gt;✓&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;User auth &amp;amp; ACL&lt;/td&gt;&lt;td&gt;❌&lt;/td&gt;&lt;td&gt;✓&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Resource history&lt;/td&gt;&lt;td&gt;❌&lt;/td&gt;&lt;td&gt;✓&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Metric polling&lt;/td&gt;&lt;td&gt;✓&lt;/td&gt;&lt;td&gt;❌&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Highly available&lt;/td&gt;&lt;td&gt;❌&lt;/td&gt;&lt;td&gt;✓&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Horizontal scalability&lt;/td&gt;&lt;td&gt;❌&lt;/td&gt;&lt;td&gt;✓&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Alerting engine&lt;/td&gt;&lt;td&gt;✓&lt;/td&gt;&lt;td&gt;❌&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Data compression&lt;/td&gt;&lt;td&gt;✓&lt;/td&gt;&lt;td&gt;✓&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Pre-computed aggregation&lt;/td&gt;&lt;td&gt;✓&lt;/td&gt;&lt;td&gt;✓&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Grafana support&lt;/td&gt;&lt;td&gt;✓&lt;/td&gt;&lt;td&gt;✓&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;collectd support&lt;/td&gt;&lt;td&gt;✓&lt;/td&gt;&lt;td&gt;✓&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;style&gt;#comparison th, #comparison td + td { text-align: center; }&lt;/style&gt;
&lt;p&gt;There&apos;s a lot of overlap between the two projects, but there are also some major differences.&lt;/p&gt;
&lt;p&gt;First, Gnocchi does not try to solve the metric retrieval problem. Prometheus provides a pull mechanism and takes charge of collecting the measurements. Gnocchi developers estimate that there are plenty of tools already doing that and doing it well, such as &lt;a href=&quot;http://collectd.org&quot;&gt;collectd&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/icon_siren.png&quot; alt=&quot;Alert siren icon&quot; /&gt;&lt;/p&gt;
&lt;p&gt;Secondly, Prometheus offers an &lt;a href=&quot;https://prometheus.io/docs/alerting/overview/&quot;&gt;alerting engine&lt;/a&gt;, statically configured with a YAML file. That is way better than Gnocchi, which offers nothing in comparison – for now. The Gnocchi developers &lt;a href=&quot;https://github.com/gnocchixyz/gnocchi/issues/71&quot;&gt;are discussing the feature&lt;/a&gt; and, while it is not on the roadmap yet, it will happen. It will, however, be controlled through a REST API, as we find it important to be able to define alerts programmatically.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/icon_storage.png&quot; alt=&quot;Storage icon&quot; /&gt;&lt;/p&gt;
&lt;p&gt;Then there is a bunch of features where Gnocchi shines compared to Prometheus, and it is the core of its purpose: storing metrics. Gnocchi has a great storage engine that supports many backends (plain files, &lt;a href=&quot;https://docs.openstack.org/swift/latest/&quot;&gt;OpenStack Swift&lt;/a&gt;, &lt;a href=&quot;http://ceph.org&quot;&gt;Ceph&lt;/a&gt;…). This lets Gnocchi scale horizontally and provide native high availability, whereas Prometheus remains a single point of failure.&lt;/p&gt;
&lt;p&gt;Multi-tenancy and authentication are also supported by Gnocchi, allowing a single instance to be shared by multiple accounts. System administrators do not commonly use this kind of feature, but application developers usually need it.&lt;/p&gt;
&lt;p&gt;That brings me to the usage and querying of Prometheus and Gnocchi. Prometheus has its own small DSL (referred to as &lt;a href=&quot;https://prometheus.io/docs/querying/basics/&quot;&gt;PromQL&lt;/a&gt;) whereas Gnocchi has a &lt;a href=&quot;http://gnocchi.xyz/rest.html&quot;&gt;fully featured REST API&lt;/a&gt; that tries to expose proper semantics. There do not seem to be major differences between the two in terms of features.&lt;/p&gt;
&lt;p&gt;Both Prometheus and Gnocchi support aggregating values over time ranges at query time (&quot;give me the minimum value for every 5-minute range over the last day&quot;). Gnocchi always aggregates metrics at write time, and never at query time (unless aggregating across metrics). This means Gnocchi needs a bit of CPU time at write time to pre-compute those aggregates, but it is blazingly fast at read time since it has nothing left to compute. Prometheus can achieve the same thing using &lt;a href=&quot;https://prometheus.io/docs/querying/rules/&quot;&gt;recording rules&lt;/a&gt;.&lt;/p&gt;
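&lt;p&gt;To make the trade-off concrete, here is a toy sketch (my own illustration, not Gnocchi&apos;s actual code) of write-time aggregation: the minimum per 5-minute bucket is maintained as measures arrive, so a read is a plain lookup with nothing left to compute.&lt;/p&gt;

```python
GRANULARITY = 300  # seconds, i.e. 5-minute buckets


class WriteTimeAggregator:
    """Toy illustration of write-time aggregation: the per-bucket
    minimum is updated on every write, so queries never compute."""

    def __init__(self):
        self.minimums = {}  # bucket start timestamp -> minimum value

    def write(self, timestamp, value):
        # A little CPU is spent here, at write time...
        bucket = timestamp - (timestamp % GRANULARITY)
        current = self.minimums.get(bucket)
        if current is None or value < current:
            self.minimums[bucket] = value

    def query(self, start, stop):
        # ...so the read path is just a lookup over stored aggregates.
        return {b: v for b, v in self.minimums.items() if start <= b < stop}


agg = WriteTimeAggregator()
agg.write(0, 4.0)
agg.write(120, 2.5)   # same 5-minute bucket, the lower value wins
agg.write(310, 7.1)   # next bucket
print(agg.query(0, 600))  # {0: 2.5, 300: 7.1}
```

The price is paid once per measure; every subsequent read of that bucket is free.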
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/icon_clock.png&quot; alt=&quot;Clock icon&quot; /&gt;&lt;/p&gt;
&lt;p&gt;Prometheus has some limitations inherent to time series databases designed around the notion of &quot;monitoring&quot;: they tend to compute everything relative to &lt;code&gt;$NOW&lt;/code&gt;. For example, it seems impossible to inject data from the past. The timestamp of a value is the timestamp at which Prometheus read that value. If Prometheus misses values for a few hours, don&apos;t count on importing them back.&lt;/p&gt;
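&lt;p&gt;By contrast, a Gnocchi measure carries its own timestamp, so backfilling hours-old data is just a regular write. A minimal sketch of building such a batch (the endpoint and metric id in the comment are placeholders, not a real deployment):&lt;/p&gt;

```python
import datetime
import json

# Build a batch of measures dated a few hours in the past. Gnocchi
# stores them under their own timestamps, which a $NOW-relative
# model like Prometheus' cannot do.
now = datetime.datetime(2017, 3, 15, 12, 0, 0)
measures = [
    {"timestamp": (now - datetime.timedelta(hours=h)).isoformat(),
     "value": 42.0 + h}
    for h in range(3, 0, -1)
]
payload = json.dumps(measures)
print(payload)
# With a live server, this payload would be POSTed to
# http://<gnocchi-host>:8041/v1/metric/<metric-id>/measures
```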
&lt;p&gt;I&apos;m noting this here because it makes Prometheus harder to benchmark for ingestion: you need tons of fake metrics to poll in order to build up data. I did not find any reference on Prometheus performance online, though it is advertised as ingesting &quot;millions of measures from thousands of sources&quot;.&lt;/p&gt;
&lt;p&gt;Query performance seems to vary with Prometheus, and I did not find any benchmark on that either. Gnocchi leverages a standard RDBMS (MySQL or PostgreSQL are supported) to query indexed data, and metric retrieval is always &lt;em&gt;O(1)&lt;/em&gt;, making it &lt;strong&gt;always fast&lt;/strong&gt;.&lt;/p&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;If you look at different and older areas, there has never been only one HTTP server. Many people use the Apache HTTP server, but you&apos;ll find plenty of users of nginx, Tomcat, HAProxy, Node.js or uwsgi, which are also common options nowadays. The same goes for RDBMS if you look at PostgreSQL, MySQL and other database solutions. No single project will ever win the entire market.&lt;/p&gt;
&lt;p&gt;It seems to me that time series storage and management also falls into this category. There will probably be various projects enjoying some popularity and growth. Every project addresses the time series problem space with a different view and different trade-offs. There might never be a single project solving all problems at once.&lt;/p&gt;
&lt;p&gt;Prometheus seems oriented toward the monitoring of live systems. Gnocchi is oriented toward highly available time series storage at massive scale. Performance aside (I was not able to compare them anyway), the two make different trade-offs in terms of features, philosophy, and orientation. Depending on your use case, one might be a better fit than the other.&lt;/p&gt;
</content:encoded></item><item><title>Using Gnocchi with Docker</title><link>https://julien.danjou.info/blog/using-gnocchi-with-docker/</link><guid isPermaLink="true">https://julien.danjou.info/blog/using-gnocchi-with-docker/</guid><description>I&apos;ve recently started to look into Docker to build images ready to be used with Gnocchi in it. I found it would be a great way to distribute a working instance of Gnocchi.</description><pubDate>Thu, 17 Aug 2017 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;I&apos;ve recently started to look into Docker to build images ready to be used with &lt;a href=&quot;http://gnocchi.xyz&quot;&gt;Gnocchi&lt;/a&gt; in it. I found it would be a great way to distribute a working instance of Gnocchi.&lt;/p&gt;
&lt;p&gt;To this end, we created the &lt;a href=&quot;https://github.com/gnocchixyz/gnocchi-docker&quot;&gt;gnocchi-docker&lt;/a&gt; repository on GitHub. It contains:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;an 11-line (only!) &lt;a href=&quot;https://github.com/gnocchixyz/gnocchi-docker/blob/master/gnocchi/Dockerfile&quot;&gt;Dockerfile&lt;/a&gt; to build a Linux image containing Gnocchi;&lt;/li&gt;
&lt;li&gt;a &lt;a href=&quot;https://github.com/gnocchixyz/gnocchi-docker/tree/master/grafana&quot;&gt;Dockerfile&lt;/a&gt; to create a &lt;a href=&quot;https://grafana.com/&quot;&gt;Grafana&lt;/a&gt; image that will use Gnocchi as datasource (preconfigured);&lt;/li&gt;
&lt;li&gt;a &lt;a href=&quot;https://github.com/gnocchixyz/gnocchi-docker/blob/master/collectd/Dockerfile&quot;&gt;Dockerfile&lt;/a&gt; to create a &lt;a href=&quot;http://collectd.org&quot;&gt;collectd&lt;/a&gt; image that gathers various metrics from your container, in order to feed Gnocchi and have something to display in Grafana;&lt;/li&gt;
&lt;li&gt;a &lt;a href=&quot;https://github.com/gnocchixyz/gnocchi-docker/blob/master/docker-compose.yaml&quot;&gt;docker-compose file&lt;/a&gt; that orchestrates and runs those containers.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If you don&apos;t know &lt;a href=&quot;https://docs.docker.com/compose/&quot;&gt;docker-compose&lt;/a&gt;, it&apos;s a tool to define and run applications using multiple containers. This is very handy in our case, as we need to start a few services, and therefore a few containers, to have our whole stack running.&lt;/p&gt;
&lt;p&gt;If you just want to use and run Gnocchi in a snap using this, it&apos;s easy. First clone the repository:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;$ git clone https://github.com/gnocchixyz/gnocchi-docker.git
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Then, just ask docker-compose to start your stack of containers:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;$ cd gnocchi-docker
$ docker-compose up
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;On the first run, &lt;code&gt;docker-compose&lt;/code&gt; will build the various images (this should take only a few minutes) and then will start them.&lt;/p&gt;
&lt;p&gt;Once everything is started, you can connect to Grafana by typing the URL &lt;code&gt;http://&amp;lt;ip of your docker server&amp;gt;:3000&lt;/code&gt; in your browser and using &quot;admin&quot; as username and &quot;password&quot; as password. Just click on the dashboard entitled &quot;Gnocchi&quot; and wait a few minutes: you will see the chart being drawn in real time!&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/gnocchi-docker-grafana-screenshot.png&quot; alt=&quot;gnocchi-docker-grafana-screenshot&quot; /&gt;&lt;/p&gt;
&lt;p&gt;The data fed into Gnocchi comes from the &lt;code&gt;collectd&lt;/code&gt; container, which gathers various metrics (CPU, network interface statistics, etc.).&lt;/p&gt;
&lt;p&gt;You can then edit the Docker files as you like to add new features or test your code. The files are also a good basis if you want to deploy Gnocchi in production with Docker!&lt;/p&gt;
&lt;p&gt;If you want to access and play with Gnocchi on the command line, just install &lt;a href=&quot;https://pypi.python.org/pypi/gnocchiclient&quot;&gt;gnocchiclient&lt;/a&gt; and do the following:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;$ export GNOCCHI_ENDPOINT=http://`docker-machine ip`:8041
$ gnocchi resource list
+----------+----------+------------+---------+----------------------+------------+----------+----------------+--------------+---------+
| id       | type     | project_id | user_id | original_resource_id | started_at | ended_at | revision_start | revision_end | creator |
+----------+----------+------------+---------+----------------------+------------+----------+----------------+--------------+---------+
| c31e4adc | collectd | None       | None    | collectd:fake-phy-   | 2017-08-17 | None     | 2017-08-17T12: | None         | admin   |
| -2cff-5f |          |            |         | host-719acbad336c    | T12:20:27. |          | 20:27.643790+0 |              |         |
| 78-8206- |          |            |         |                      | 643778+00: |          | 0:00           |              |         |
| f5ca66e4 |          |            |         |                      | 00         |          |                |              |         |
| 6cce     |          |            |         |                      |            |          |                |              |         |
+----------+----------+------------+---------+----------------------+------------+----------+----------------+--------------+---------+
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;You can now have fun creating new resources and metrics!&lt;/p&gt;
&lt;p&gt;Feel free to contribute patches to &lt;a href=&quot;http://github.com/gnocchixyz/gnocchi-docker&quot;&gt;the GitHub project&lt;/a&gt; too, obviously!&lt;/p&gt;
</content:encoded></item><item><title>Easy Python logging with daiquiri</title><link>https://julien.danjou.info/blog/python-logging-easy-with-daiquiri/</link><guid isPermaLink="true">https://julien.danjou.info/blog/python-logging-easy-with-daiquiri/</guid><description>After more than 10 years of writing Python, there&apos;s something I always have been annoyed with: logging.</description><pubDate>Tue, 04 Jul 2017 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;After more than 10 years of writing Python, there&apos;s something I always have&lt;br /&gt;
been annoyed with: logging.&lt;/p&gt;
&lt;p&gt;Don&apos;t get me wrong: I like the &lt;a href=&quot;https://docs.python.org/3/library/logging.html&quot;&gt;Python logging subsystem&lt;/a&gt;. It&apos;s easy to use and works like a charm in most cases. If you have never used it, logging in Python turns out to be as simple as:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import logging

logger = logging.getLogger()
logger.info(&quot;Something useful&quot;)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;It could barely be easier. What annoys me is that if you run the example above, an error happens. See for yourself:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&amp;gt;&amp;gt;&amp;gt; import logging
&amp;gt;&amp;gt;&amp;gt; logger = logging.getLogger()
&amp;gt;&amp;gt;&amp;gt; logger.error(&quot;Something useful&quot;)
No handlers could be found for logger &quot;root&quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Nothing is printed, except an error. No log file is created. Logging does not work &quot;by default&quot;. I hate it.&lt;/p&gt;
&lt;p&gt;Each time I write a new application, I need to remember how to set logging up. There&apos;s a full, documented API that explains how to set up handlers, formatters, filters, or a record factory. And each time, I need to dig into all that documentation to remember how to configure some sane defaults (e.g. logging to stderr in a format that includes a timestamp). I could use &lt;a href=&quot;https://docs.python.org/3/library/logging.html#logging.basicConfig&quot;&gt;&lt;code&gt;logging.basicConfig&lt;/code&gt;&lt;/a&gt;, but it&apos;s usually too basic (e.g. it does not print any timestamp).&lt;/p&gt;
&lt;p&gt;Every time, I end up going down the rabbit hole of tweaking logging.&lt;/p&gt;
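&lt;p&gt;For the record, here is roughly the standard-library incantation I end up rewriting every time, just to get timestamped logs on stderr (the format string is my usual choice, nothing official):&lt;/p&gt;

```python
import logging
import sys

# The boilerplate: create a handler, attach a formatter that
# includes a timestamp, and wire both to the root logger.
handler = logging.StreamHandler(sys.stderr)
handler.setFormatter(logging.Formatter(
    "%(asctime)s [%(process)d] %(levelname)s %(name)s: %(message)s"))
logger = logging.getLogger()
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("Something useful")
```

Multiply that by syslog, files, colors and JSON, and you see why a helper library is tempting.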
&lt;h2&gt;Here comes daiquiri&lt;/h2&gt;
&lt;p&gt;I finally took some time recently to bootstrap a tiny library to do this job for me. It&apos;s named &lt;em&gt;daiquiri&lt;/em&gt;, and it does only one thing: configure the Python logging subsystem for modern Python applications.&lt;/p&gt;
&lt;p&gt;It&apos;s small and the 1.0.0 version I just released contains 228 lines of code and 79 lines of tests. That&apos;s it!&lt;/p&gt;
&lt;p&gt;Its promise is to set up a complete, standard Python logging system with just one function call. Nothing more, nothing less. The interesting features are:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Logs to stderr by default.&lt;/li&gt;
&lt;li&gt;Use colors if logging to a terminal.&lt;/li&gt;
&lt;li&gt;Support file logging.&lt;/li&gt;
&lt;li&gt;Use the program name as the name of the log file, so providing just a directory for logging works.&lt;/li&gt;
&lt;li&gt;Support syslog.&lt;/li&gt;
&lt;li&gt;Support journald.&lt;/li&gt;
&lt;li&gt;JSON output support.&lt;/li&gt;
&lt;li&gt;Support for providing arbitrary key/value context information.&lt;/li&gt;
&lt;li&gt;Capture the warnings emitted by the &lt;a href=&quot;https://docs.python.org/3/library/warnings.html&quot;&gt;&lt;code&gt;warnings&lt;/code&gt;&lt;/a&gt; module.&lt;/li&gt;
&lt;li&gt;Native logging of any exception.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;And it&apos;s used by &lt;a href=&quot;http://gnocchi.xyz&quot;&gt;Gnocchi&lt;/a&gt; starting with version 4.0. That should tell you how production-ready it is, right? 😀&lt;/p&gt;
&lt;p&gt;Enough selling. Let&apos;s see how it looks by default!&lt;/p&gt;
&lt;h2&gt;Basic working&lt;/h2&gt;
&lt;p&gt;Here&apos;s the basic usage of daiquiri:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import daiquiri

daiquiri.setup()
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I told you I want it to be simple. Just doing this is already doing a better job than &lt;code&gt;logging.basicConfig&lt;/code&gt;, since it&apos;ll do something useful by default:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&amp;gt;&amp;gt;&amp;gt; import daiquiri
&amp;gt;&amp;gt;&amp;gt; daiquiri.setup()
&amp;gt;&amp;gt;&amp;gt; logger = daiquiri.getLogger()
&amp;gt;&amp;gt;&amp;gt; logger.error(&quot;something wrong happened&quot;)
2017-07-04 18:03:04,929 [16876] ERROR root: something wrong happened
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;It prints the message on &lt;code&gt;stderr&lt;/code&gt; using a useful format and a timestamp by default. Just what everybody wants, isn&apos;t it? If you run this in a terminal, the line will be printed in red, as it is an error that is logged. Other colors are used for different logging levels (green for debug, etc).&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/daiquiri.png&quot; alt=&quot;daiquiri&quot; /&gt;&lt;/p&gt;
&lt;p&gt;Better, daiquiri will log any exception in your program:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&amp;gt;&amp;gt;&amp;gt; import daiquiri
&amp;gt;&amp;gt;&amp;gt; daiquiri.setup()
&amp;gt;&amp;gt;&amp;gt; raise Exception(&quot;boom!&quot;)
2017-07-04 18:05:43,378 [16959] CRITICAL root: Traceback (most recent call last):
  File &quot;&amp;lt;stdin&amp;gt;&quot;, line 1, in &amp;lt;module&amp;gt;
Exception: boom!
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;As soon as an exception is uncaught, it&apos;ll be logged as a critical log message.&lt;/p&gt;
&lt;h2&gt;More advanced features&lt;/h2&gt;
&lt;p&gt;If you want to tweak the default output, you can pass some arguments to &lt;code&gt;daiquiri.setup&lt;/code&gt;. This function accepts an &lt;code&gt;outputs&lt;/code&gt; argument that must be an iterable of &lt;code&gt;daiquiri.Output&lt;/code&gt; objects. This is typically a list of &lt;code&gt;daiquiri.output.File&lt;/code&gt; objects to log to a file, &lt;code&gt;daiquiri.output.Syslog&lt;/code&gt; to log to &lt;em&gt;syslog&lt;/em&gt;, or &lt;code&gt;daiquiri.output.Stream&lt;/code&gt; to log to any stream (e.g. an opened file, &lt;code&gt;stdout&lt;/code&gt; or &lt;code&gt;stderr&lt;/code&gt;).&lt;/p&gt;
&lt;p&gt;If you want to log via &lt;code&gt;syslog&lt;/code&gt; but also to &lt;code&gt;stderr&lt;/code&gt;, here&apos;s what you&apos;ll have to do:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;daiquiri.setup(outputs=(
    daiquiri.output.Syslog(),
    daiquiri.output.STDERR,
))
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;If you want to log to a file, you can just specify a directory; daiquiri will guess the program name and create the appropriate file:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# If the program name is foobar-server then the logging will
# be done to /var/log/foobar-server.log
daiquiri.setup(outputs=(
     daiquiri.output.File(directory=&quot;/var/log&quot;),
))
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Those examples might be too easy. So let&apos;s log to journald and also to a network server using JSON output:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import socket
import daiquiri

# Let&apos;s connect to the server first
s = socket.socket()
# You can run a simple server in another terminal by typing `nc -l 2333`
s.connect((&quot;localhost&quot;, 2333))
f = s.makefile()

daiquiri.setup(outputs=(
     daiquiri.output.Journal(),
     daiquiri.output.Stream(f, formatter=daiquiri.formatter.JSON_FORMATTER),
))
daiquiri.getLogger().error(&quot;oops&quot;, somekey=42, anotherkey=&quot;foobar&quot;)
# Server will receive:
# {&quot;message&quot;: &quot;oops&quot;, &quot;somekey&quot;: 42, &quot;anotherkey&quot;: &quot;foobar&quot;}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;You can obviously extend it with your own formatters or outputs; the API is pretty simple. But the defaults should be usable for 99% of applications.&lt;/p&gt;
&lt;p&gt;Let me know what you think and feel free to pip install and git clone it! The library is available at &lt;a href=&quot;http://pypi.python.org/pypi/daiquiri&quot;&gt;PyPI&lt;/a&gt;, the source is on &lt;a href=&quot;https://github.com/jd/daiquiri&quot;&gt;GitHub&lt;/a&gt; and the &lt;a href=&quot;http://daiquiri.readthedocs.io/&quot;&gt;documentation is published online&lt;/a&gt;.&lt;/p&gt;
</content:encoded></item><item><title>Gnocchi 4 is out</title><link>https://julien.danjou.info/blog/gnocchi-4-release/</link><guid isPermaLink="true">https://julien.danjou.info/blog/gnocchi-4-release/</guid><description>Finally! Four months ago we pushed the Gnocchi 3.1 release and here we are now, release the 4th major version of that timeseries database.  A lot happened in the last 4 months.First, as I already wrot</description><pubDate>Tue, 13 Jun 2017 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Finally! Four months ago we pushed the Gnocchi 3.1 release, and here we are now, releasing the 4th major version of that time series database.&lt;/p&gt;
&lt;p&gt;A lot happened in the last 4 months. &lt;img src=&quot;https://julien.danjou.info/content/images/03/gnocchi-logo.png&quot; alt=&quot;Gnocchi logo&quot; /&gt;&lt;/p&gt;
&lt;p&gt;First, as I already wrote about, &lt;a href=&quot;https://julien.danjou.info/blog/gnocchi-independence&quot;&gt;we moved to GitHub for hosting our project&lt;/a&gt;. This slowed down our development pace for a couple of weeks, but we&apos;re now almost back to normal! We were a bit sad to quit the great infrastructure that we used before, but it feels great to be hosted on a platform that everyone knows about and that is more straightforward to use.&lt;/p&gt;
&lt;p&gt;Second, we implemented some major changes that should improve performance &lt;em&gt;again&lt;/em&gt;. We tend to do that in each release, I know, I know. As usual, the release notes contain most of &lt;a href=&quot;http://gnocchi.xyz/releasenotes/4.0.html&quot;&gt;the major changes we made and can be read online&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;But I&apos;d like to talk about a few of them here that I find very exciting. The work and performance tests that Alex Krzos did (and &lt;a href=&quot;https://julien.danjou.info/blog/2017/openstack-summit-pike-boston-recap&quot;&gt;we presented during the last OpenStack Summit&lt;/a&gt;) were a great help in figuring out where to improve performance.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&quot;http://redis.io&quot;&gt;Redis&lt;/a&gt;! We added a Redis driver which can store incoming measures and metric archives. Obviously, it&apos;s more meant for incoming measures. Remember, in Gnocchi 3.1 we split the storage driver into two parts: the incoming measure storage and the archive storage. Since you can use two different drivers for those different functions, with Gnocchi 4.0 you can use Redis to store your incoming measures in a very fast temporary storage service and then &lt;em&gt;metricd&lt;/em&gt; will process them and store the results in your favorite scalable storage such as &lt;a href=&quot;http://ceph.com&quot;&gt;Ceph&lt;/a&gt;, where it&apos;s mostly read.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Sacks! We rewrote the entire scheduling mechanism for &lt;em&gt;metricd&lt;/em&gt;. It now uses several &quot;sacks&quot; to store incoming measures in a distributed manner, instead of the previous one-sack-only storage for those incoming data. A hashring is then used to spread the processing workload across all the running &lt;em&gt;metricd&lt;/em&gt; daemons. Faster, simpler and more efficient scheduling should come with this version!&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;S3! We fixed the S3 driver. It was a nice proof-of-concept in 3.1 and now it should work. For real.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
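&lt;p&gt;The sack mechanism above can be sketched in a few lines (a simplification of mine, not the actual Gnocchi code): each metric hashes to a sack, and the sacks are spread over the live &lt;em&gt;metricd&lt;/em&gt; workers.&lt;/p&gt;

```python
import hashlib

NUM_SACKS = 8  # just for the demo; the real number is a deployment choice


def sack_for_metric(metric_id: str) -> int:
    # Incoming measures are bucketed into sacks by hashing the metric id,
    # so the same metric always lands in the same sack.
    digest = hashlib.md5(metric_id.encode()).hexdigest()
    return int(digest, 16) % NUM_SACKS


def worker_for_sack(sack: int, workers: list) -> str:
    # Toy stand-in for the hashring: deterministically spread sacks
    # over whatever workers are currently alive.
    return workers[sack % len(workers)]


workers = ["metricd-1", "metricd-2", "metricd-3"]
assignments = {s: worker_for_sack(s, workers) for s in range(NUM_SACKS)}
sack = sack_for_metric("cpu.util.instance-42")
print(sack, assignments[sack])
```

When a worker joins or leaves, only the sack-to-worker mapping changes; the metric-to-sack hashing stays stable.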
&lt;p&gt;That&apos;s mostly it. The rest of the changes are bug fixes here and there and some performance improvements, but this should be enough to get you excited to try it out.&lt;/p&gt;
&lt;p&gt;Come and join us on &lt;a href=&quot;http://github.com/gnocchixyz&quot;&gt;GitHub&lt;/a&gt;! Star us, and stay tuned for some more awesome news around metrics.&lt;/p&gt;
</content:encoded></item><item><title>Sending GitHub pull-request from your shell</title><link>https://julien.danjou.info/blog/git-pull-request-command-line-tool/</link><guid isPermaLink="true">https://julien.danjou.info/blog/git-pull-request-command-line-tool/</guid><description>I&apos;ve always been frustrated by the GitHub workflow. A while back I wrote how Gerrit workflow was superior to GitHub pull-request system.</description><pubDate>Wed, 24 May 2017 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;I&apos;ve always been frustrated by the GitHub workflow. A while back I&lt;br /&gt;
wrote &lt;a href=&quot;https://julien.danjou.info/blog/2013/rant-about-github-pull-request-workflow-implementation&quot;&gt;how the Gerrit workflow was superior&lt;/a&gt; to the GitHub pull-request system. But it seems that GitHub listened: over the last few years they improved the pull-request system to include reviews and different workflow implementations, e.g. requiring continuous integration tests to pass before merging a patch.&lt;/p&gt;
&lt;p&gt;All those improvements greatly helped the Gnocchi team decide to move to GitHub when leaving OpenStack. Our first days have been great, and I cannot say we miss Gerrit much for now.&lt;/p&gt;
&lt;p&gt;The only tool that I loved and miss is &lt;a href=&quot;https://docs.openstack.org/infra/git-review/&quot;&gt;git-review&lt;/a&gt;. It makes pushing an updated branch to Gerrit trivial.&lt;/p&gt;
&lt;p&gt;Unfortunately, in the GitHub world, things are different. To send a pull-request you have to execute a few steps which are:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Fork the target repository and clone it&lt;/li&gt;
&lt;li&gt;Push your local branch to your repository&lt;/li&gt;
&lt;li&gt;Create a pull-request from your pushed local branch to the target branch&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;If you later want to update your pull-request, you either push new commits to your branch or, more often, edit your patches and force-push your branch to your forked repository so you can ask for a new review of your pull-request.&lt;/p&gt;
&lt;p&gt;I&apos;m way too lazy to do all of that by hand, so for a few years I had a tool based on &lt;a href=&quot;https://hub.github.com/&quot;&gt;hub&lt;/a&gt;, a command-line tool that interacts with the &lt;a href=&quot;https://developer.github.com/&quot;&gt;GitHub API&lt;/a&gt;. Unfortunately, it was pretty simple and did not have all the features I wanted.&lt;/p&gt;
&lt;p&gt;Which pushed me to write my own tool, humbly entitled &lt;a href=&quot;https://github.com/jd/git-pull-request&quot;&gt;git-pull-request&lt;/a&gt;. It lets you send a pull-request to any GitHub project right after cloning it, so there&apos;s no need to manually fork the repository, push branches, etc.&lt;/p&gt;
&lt;p&gt;Once you have created a branch and committed to it, just run &lt;code&gt;git pull-request&lt;/code&gt; and everything will be done for you automatically.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# First pull-request creation
$ git clone https://github.com/gnocchixyz/gnocchi.git
$ cd gnocchi
$ git checkout -b somefeature
&amp;lt;edit files&amp;gt;
$ git commit -a -m &apos;I did some changes&apos;
$ git pull-request
Forked repository: https://github.com/jd/gnocchi
Force-pushing branch `somefeature&apos; to remote `github&apos;
Counting objects: 5, done.
Delta compression using up to 4 threads.
Compressing objects: 100% (4/4), done.
Writing objects: 100% (5/5), 562 bytes | 0 bytes/s, done.
Total 5 (delta 3), reused 0 (delta 0)
remote: Resolving deltas: 100% (3/3), completed with 3 local objects.
To https://github.com/jd/gnocchi.git
 + 73a733f7...1be2bf29 somefeature -&amp;gt; somefeature (forced update)
Pull-request created: https://github.com/gnocchixyz/gnocchi/pull/33
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;If you need to update your pull-request with new patches, just edit your branch and call &lt;code&gt;git pull-request&lt;/code&gt; again. It&apos;ll re-push your branch, and will not create a new pull-request if one already exists.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&amp;lt;edit some more files&amp;gt;
$ git commit --amend -a
$ git pull-request
Forked repository: https://github.com/jd/gnocchi
Force-pushing branch `somefeature&apos; to remote `github&apos;
Counting objects: 5, done.
Delta compression using up to 4 threads.
Compressing objects: 100% (4/4), done.
Writing objects: 100% (5/5), 562 bytes | 0 bytes/s, done.
Total 5 (delta 3), reused 0 (delta 0)
remote: Resolving deltas: 100% (3/3), completed with 3 local objects.
To https://github.com/jd/gnocchi.git
 + 73a733f7...1be2bf29 somefeature -&amp;gt; somefeature (forced update)
Pull-request already exists at: https://github.com/gnocchixyz/gnocchi/pull/33
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This tool was definitely the missing piece to smooth my GitHub workflow, so I&apos;m glad I took some time to write it. I hope you&apos;ll enjoy it and will send me awesome pull-requests, so go &lt;a href=&quot;https://github.com/jd/git-pull-request&quot;&gt;check it out&lt;/a&gt;. This program is written in Python and uses the GitHub API.&lt;/p&gt;
&lt;p&gt;And feel free to request new fancy features!&lt;/p&gt;
</content:encoded></item><item><title>OpenStack Summit Boston 2017 recap</title><link>https://julien.danjou.info/blog/openstack-summit-pike-boston-recap/</link><guid isPermaLink="true">https://julien.danjou.info/blog/openstack-summit-pike-boston-recap/</guid><description>The first OpenStack Summit of 2017 was last week, in Boston, MA, USA. I was able to attend as I&apos;ve been selected to give 3 talks, to help for a hands-on and to animate an on-boarding session.</description><pubDate>Mon, 15 May 2017 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;The &lt;a href=&quot;https://www.openstack.org/summit/boston-2017/&quot;&gt;first OpenStack Summit of 2017&lt;/a&gt; was last week, in Boston, MA, USA. I was able to attend as I&apos;ve been selected to give 3 talks, to help for a hands-on and to animate an on-boarding session. This made sure I was a bit busy every day, which was good.&lt;/p&gt;
&lt;p&gt;This is the first summit to happen since the new &lt;a href=&quot;https://www.openstack.org/ptg/&quot;&gt;Project Team Gathering (PTG)&lt;/a&gt; took place last February. I was unable to attend that first PTG, as there was no way to justify my presence there. The OpenStack Telemetry team that I lead is pretty small, and its members don&apos;t really need to meet face to face to discuss things, so we decided not to ask to be present at the last PTG event.&lt;/p&gt;
&lt;p&gt;The Telemetry on-boarding session that I organized with my fellow developer Gordon Chung on Tuesday had only 3 people showing up to ask a few questions about Telemetry. The session lasted 15 of the 90 minutes planned. We shared that session with &lt;a href=&quot;https://wiki.openstack.org/wiki/CloudKitty&quot;&gt;CloudKitty&lt;/a&gt;, for which nobody showed up at all. This was really disappointing, but it did not come as a surprise.&lt;/p&gt;
&lt;p&gt;First, the number of companies engaging developers in OpenStack has shrunk drastically during the last year. Secondly, since there&apos;s now another event (the PTG) twice a year, it seems pretty clear that not every developer will be able to attend all four events every year, creating dispersion in the community.&lt;/p&gt;
&lt;p&gt;I personally was glad to attend the Summit rather than the PTG, as meeting operators and users to gather feedback is more valuable than meeting only developers. However, meeting everyone at the same time would be great, especially for smaller teams. The PTG scattered some teams to the point that many of their developers will attend neither the PTG nor the Summit. As a consequence, I won&apos;t have any meeting point in the future with many of my fellow developers around OpenStack. I warned the Technical Committee about this last year when it was decided to reorganize the events. I&apos;m glad I was right, but I&apos;m a bit sad that the Foundation did not listen.&lt;/p&gt;
&lt;p&gt;Still, all the projects I work on tend to follow &lt;a href=&quot;https://julien.danjou.info/blog/foss-projects-management-bad-practice&quot;&gt;the good practices I wrote about last year&lt;/a&gt;, so I cannot say this has huge consequences for them. It&apos;s a loss in that it makes it harder for some of us to reach users and operators. It also reduces our occasions for social interaction, which were a great benefit. But it will not prevent us from building great software anyway!&lt;/p&gt;
&lt;p&gt;The few other sessions of &lt;em&gt;&lt;a href=&quot;https://wiki.openstack.org/wiki/Forum&quot;&gt;The Forum&lt;/a&gt;&lt;/em&gt; (the space dedicated to developers during the Summit) that I attended discussed various technical matters, and some were pretty empty. I wonder whether that reflects a lack of interest or whether people were simply unable to travel to discuss those items. At this stage, I am not sure it would have really mattered: this was my 9th OpenStack Summit, and many of the subjects discussed have already been debated multiple times with barely any change since. Talk is cheap. Furthermore, most of the discussions were led not by stakeholders of the various projects involved, but by people on the side or by members of the Technical Committee. There is, unfortunately, just too much wishful thinking.&lt;/p&gt;
&lt;p&gt;On the talk side, my presentation with Alex Krzos entitled &lt;em&gt;Telemetry and the 10,000 instances&lt;/em&gt; went pretty well. We demonstrated how we tested the performance of the telemetry stack.&lt;/p&gt;
&lt;p&gt;Same goes for my hands-on with the CloudKitty developers, where we managed to explain how Ceilometer, Gnocchi, and CloudKitty were able to work with each other to create nice billing reports. The last day was concluded with my talk on collectd and Gnocchi with Emma, which was short and to the point.&lt;/p&gt;
&lt;p&gt;My final talk was about the status and roadmap of the OpenStack Telemetry team, where I tried to explain how Telemetry works and what we might do (or not) in the next cycles. It was pretty short, as we barely have a roadmap: 3 developers do 80% of the work on the project.&lt;/p&gt;
&lt;p&gt;I was also able to catch up with Nubeliu about their Gnocchi usage. They &lt;a href=&quot;https://www.youtube.com/watch?v=Hlt3UwsvgjU&quot;&gt;presented a nice demo of the cloud monitoring solution&lt;/a&gt; they built on top of Gnocchi. They completely understood how to use Gnocchi to store a large number of metrics at scale and how to leverage the API to render what&apos;s happening in your infrastructure. It is pretty amazing.&lt;/p&gt;
&lt;p&gt;While I missed the energy and the drive that the design sessions used to have in the first summits, it has been a pretty good summit. I was especially happy to be able to discuss OpenStack Telemetry and Gnocchi. The feedback I gathered was terrific, and I&apos;m looking forward to the work we&apos;ll achieve in the next months!&lt;/p&gt;
</content:encoded></item><item><title>Gnocchi independence</title><link>https://julien.danjou.info/blog/gnocchi-independence/</link><guid isPermaLink="true">https://julien.danjou.info/blog/gnocchi-independence/</guid><description>Three years have passed since Gnocchi started. After being incubated inside OpenStack, the project is now going independent.</description><pubDate>Sat, 06 May 2017 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Three years have passed since I started working on &lt;a href=&quot;http://gnocchi.xyz&quot;&gt;Gnocchi&lt;/a&gt;. It&apos;s amazing to gaze at the path we wandered on.&lt;/p&gt;
&lt;p&gt;During all this time, Gnocchi has been &quot;incubated&quot; inside OpenStack. It has been created there and it grew with the rest of the ecosystem. But Gnocchi (developers) always stuck to some strange principles: autonomy and independence from the other OpenStack projects. This actually made the project a bit unpopular sometimes inside OpenStack, being stamped as some kind of &lt;em&gt;rebel&lt;/em&gt;.&lt;/p&gt;
&lt;p&gt;I&apos;ve spent the last years asserting that each project inside OpenStack should strive to live its own life. Being usable in any context, not only the one it was built for, is key to the success of any open source project. Having to use large bundles of projects together is not a good user story. I wish OpenStack would become a set of more autonomous building blocks.&lt;/p&gt;
&lt;p&gt;One of the projects most used by people who do not run an entire OpenStack installation has been &lt;a href=&quot;https://launchpad.net/swift&quot;&gt;Swift&lt;/a&gt;. That was possible because Swift always tried to be autonomous and not depend on any other service. It is able to leverage external services, but it can also work without any. And if you measure success by adoption among people with zero knowledge of OpenStack, I feel that Swift is the most successful OpenStack project.&lt;/p&gt;
&lt;p&gt;With the move toward the &lt;em&gt;Big Tent&lt;/em&gt;, it struck me that the OpenStack Foundation will end up as some sort of an Apache Foundation. And I am pretty sure nobody forces you to use the &lt;a href=&quot;https://httpd.apache.org/&quot;&gt;Apache HTTP server&lt;/a&gt; if you want to use e.g. &lt;a href=&quot;http://lucene.apache.org/&quot;&gt;Lucene&lt;/a&gt; or &lt;a href=&quot;http://hbase.apache.org/&quot;&gt;HBase&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Being part of OpenStack for Gnocchi has been a great advantage at the beginning of the project. The infrastructure provided is awesome. The support we had from the community was great. The Gerrit workflow suited us well.&lt;/p&gt;
&lt;p&gt;But unfortunately, now that the project is getting more and more mature, many of the requirements of being an OpenStack project have become a real burden. The various processes imposed by OpenStack are hurting the development pace. The contribution workflow based around Gerrit and &lt;a href=&quot;https://launchpad.net&quot;&gt;Launchpad&lt;/a&gt; is too complicated for most external contributors and therefore prevents newcomers from participating in the development. Worse, the bad image or reputation that OpenStack carries in certain situations or communities prevents Gnocchi from even being evaluated and, maybe, used.&lt;/p&gt;
&lt;p&gt;I think that many of those negative aspects are finally being taken into account by the OpenStack Technical Committee, as can be seen in the &lt;a href=&quot;https://review.openstack.org/#/c/453262/&quot;&gt;proposed vision of OpenStack 2 years from now&lt;/a&gt;. Better late than never.&lt;/p&gt;
&lt;p&gt;So after spending a lot of time weighing the pros and the cons, we, the Gnocchi contributors, &lt;a href=&quot;http://lists.openstack.org/pipermail/openstack-dev/2017-March/114300.html&quot;&gt;finally decided to move Gnocchi out of OpenStack&lt;/a&gt;. We have started moving the project to a brand new &lt;a href=&quot;https://github.com/gnocchixyz&quot;&gt;Gnocchi organization on GitHub&lt;/a&gt;. At the time of this writing, only the main gnocchi repository remains to be moved, which should happen soon after the OpenStack Summit taking place next week.&lt;/p&gt;
&lt;p&gt;We also used that opportunity to make usage of the new Gnocchi logo, courtesy of my friend Thierry Ung!&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/gnocchi-logo.png&quot; alt=&quot;Gnocchi logo&quot; /&gt;&lt;/p&gt;
&lt;p&gt;We&apos;ll see how everything turns out and whether the project gains more traction, as we hope. This will not change the consumption of Gnocchi by projects such as &lt;a href=&quot;http://launchpad.net/ceilometer&quot;&gt;Ceilometer&lt;/a&gt;, and the project aims to remain a good friend of OpenStack. 😀&lt;/p&gt;
</content:encoded></item><item><title>Python never gives up: the tenacity library</title><link>https://julien.danjou.info/blog/python-tenacity/</link><guid isPermaLink="true">https://julien.danjou.info/blog/python-tenacity/</guid><description>A couple of years ago, I wrote about the Python retrying library . This library was designed to retry the execution of a task when a failure occurred.  I started to spread usage of this library in var</description><pubDate>Thu, 02 Mar 2017 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;A couple of years ago, I &lt;a href=&quot;https://julien.danjou.info/blog/python-retrying&quot;&gt;wrote about the Python &lt;em&gt;retrying&lt;/em&gt; library&lt;/a&gt;. This library was designed to retry the execution of a task when a failure occurred.&lt;/p&gt;
&lt;p&gt;Over the last few years, I started to spread usage of this library in various projects, such as &lt;a href=&quot;http://gnocchi.xyz&quot;&gt;Gnocchi&lt;/a&gt;. Unfortunately, it became very hard to contribute and send patches to the upstream &lt;em&gt;retrying&lt;/em&gt; project. I spent several months trying to work with the original author, but after a while I had to come to the conclusion that I would be unable to fix bugs and enhance it at the pace I would like to. Therefore, I had to make a difficult decision and decided to fork the library.&lt;/p&gt;
&lt;h2&gt;Here comes &lt;em&gt;tenacity&lt;/em&gt;&lt;/h2&gt;
&lt;p&gt;I picked a new name and rewrote parts of the API of &lt;em&gt;retrying&lt;/em&gt; that were not working correctly or were too complicated. I also fixed bugs with the help of Joshua, and named this new library &lt;em&gt;tenacity&lt;/em&gt;. It works in the same manner as &lt;em&gt;retrying&lt;/em&gt; does, except that it is written in a more functional way and offers some nifty new features.&lt;/p&gt;
&lt;h2&gt;Basic usage&lt;/h2&gt;
&lt;p&gt;The basic usage is to use it as a decorator:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import tenacity

@tenacity.retry
def do_something_and_retry_on_any_exception():
    pass
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This will make the function &lt;code&gt;do_something_and_retry_on_any_exception&lt;/code&gt; be called over and over again until it stops raising an exception. It would have been hard to design anything simpler. Obviously, this is a pretty rare case, as one usually wants to e.g. wait some time between retries. For that, &lt;em&gt;tenacity&lt;/em&gt; offers a large panel of waiting methods:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import tenacity

@tenacity.retry(wait=tenacity.wait_fixed(1))
def do_something_and_retry():
    do_something()
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Or a simple exponential back-off method can be used instead:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import tenacity

@tenacity.retry(wait=tenacity.wait_exponential())
def do_something_and_retry():
    do_something()
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Combination&lt;/h2&gt;
&lt;p&gt;What is especially interesting with &lt;em&gt;tenacity&lt;/em&gt; is that you can easily combine several methods. For example, you can combine &lt;code&gt;tenacity.wait_random&lt;/code&gt; with &lt;code&gt;tenacity.wait_fixed&lt;/code&gt; to wait a number of seconds picked within an interval:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import tenacity

@tenacity.retry(wait=tenacity.wait_fixed(10) + tenacity.wait_random(0, 3))
def do_something_and_retry():
    do_something()
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This makes the decorated function wait randomly between 10 and 13 seconds before each new attempt.&lt;/p&gt;
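&lt;p&gt;To see where the 10 and 13 second bounds come from, the combined policy can be modeled with the standard library alone. This is an illustrative sketch of the waiting arithmetic only, not &lt;em&gt;tenacity&lt;/em&gt;&apos;s actual implementation:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import random

def combined_wait(base=10, jitter_max=3):
    # Mirrors wait_fixed(10) + wait_random(0, 3): a fixed base
    # plus a uniform random jitter before the next attempt.
    return base + random.uniform(0, jitter_max)

# Every sampled wait falls within the [10, 13] second window.
assert all(10 &amp;lt;= combined_wait() &amp;lt;= 13 for _ in range(1000))
&lt;/code&gt;&lt;/pre&gt;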
&lt;p&gt;&lt;em&gt;tenacity&lt;/em&gt; offers more customization, such as retrying on certain exceptions only. For example, you can retry every second, but only if the exception raised by &lt;code&gt;do_something&lt;/code&gt; is an instance of &lt;code&gt;IOError&lt;/code&gt;, e.g. a network communication error:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import tenacity

@tenacity.retry(wait=tenacity.wait_fixed(1),
                retry=tenacity.retry_if_exception_type(IOError))
def do_something_and_retry():
    do_something()
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;You can easily combine several conditions by using the &lt;code&gt;|&lt;/code&gt; or &lt;code&gt;&amp;amp;&lt;/code&gt; binary operators. Below, they are used to make the code retry if an &lt;code&gt;IOError&lt;/code&gt; exception is raised or if no result is returned. A stop condition is also added with the &lt;code&gt;stop&lt;/code&gt; keyword argument. It allows specifying a condition unrelated to the function result or exception, such as a maximum number of attempts or a delay, after which retrying stops.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import tenacity

@tenacity.retry(wait=tenacity.wait_fixed(1),
                stop=tenacity.stop_after_delay(60),
                retry=(tenacity.retry_if_exception_type(IOError) |
                       tenacity.retry_if_result(lambda result: result is None)))
def do_something_and_retry():
    do_something()
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The functional approach of &lt;em&gt;tenacity&lt;/em&gt; makes it easy and clean to combine many conditions for various use cases with simple binary operators.&lt;/p&gt;
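&lt;p&gt;The machinery behind those binary operators is simple: each retry condition is a callable object whose &lt;code&gt;__or__&lt;/code&gt; and &lt;code&gt;__and__&lt;/code&gt; methods return a new, combined condition. Here is a toy illustration of that pattern (not &lt;em&gt;tenacity&lt;/em&gt;&apos;s actual classes):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;class RetryPredicate:
    # Toy stand-in showing how | and &amp;amp; can compose retry conditions.
    def __init__(self, fn):
        self.fn = fn

    def __call__(self, outcome):
        return self.fn(outcome)

    def __or__(self, other):
        return RetryPredicate(lambda o: self(o) or other(o))

    def __and__(self, other):
        return RetryPredicate(lambda o: self(o) and other(o))

is_io_error = RetryPredicate(lambda o: isinstance(o, IOError))
is_none = RetryPredicate(lambda o: o is None)

should_retry = is_io_error | is_none
assert should_retry(IOError())
assert should_retry(None)
assert not should_retry(&quot;a result&quot;)
&lt;/code&gt;&lt;/pre&gt;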
&lt;h2&gt;Standalone usage&lt;/h2&gt;
&lt;p&gt;&lt;em&gt;tenacity&lt;/em&gt; can also be used without the decorator, through the &lt;code&gt;Retrying&lt;/code&gt; object that implements its main behaviour, and its &lt;code&gt;call&lt;/code&gt; method. This allows calling any function with different retry conditions, or retrying any piece of code that does not use the decorator at all – like code from an external library.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import tenacity

r = tenacity.Retrying(
    wait=tenacity.wait_fixed(1),
    retry=tenacity.retry_if_exception_type(IOError))
r.call(do_something)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This also allows you to reuse that object instead of creating a new one each time, saving some memory!&lt;/p&gt;
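&lt;p&gt;To illustrate what &lt;code&gt;Retrying&lt;/code&gt; does under the hood, the core loop boils down to something like the following standard-library sketch (a deliberate simplification for illustration, not &lt;em&gt;tenacity&lt;/em&gt;&apos;s real code):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import time

class SimpleRetrying:
    # Toy stand-in for tenacity.Retrying: retry on a given
    # exception type with a fixed wait between attempts.
    def __init__(self, wait=1, retry_exc=Exception, max_attempts=5):
        self.wait = wait
        self.retry_exc = retry_exc
        self.max_attempts = max_attempts

    def call(self, fn, *args, **kwargs):
        for attempt in range(1, self.max_attempts + 1):
            try:
                return fn(*args, **kwargs)
            except self.retry_exc:
                if attempt == self.max_attempts:
                    raise
                time.sleep(self.wait)

calls = []

def flaky():
    # Fails twice with IOError, then succeeds.
    calls.append(1)
    if len(calls) &amp;lt; 3:
        raise IOError(&quot;transient&quot;)
    return &quot;ok&quot;

assert SimpleRetrying(wait=0, retry_exc=IOError).call(flaky) == &quot;ok&quot;
assert len(calls) == 3
&lt;/code&gt;&lt;/pre&gt;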
&lt;p&gt;I hope you&apos;ll like it and find some use for it. Feel free to fork it, report bugs or ask for new features on &lt;a href=&quot;https://github.com/jd/tenacity&quot;&gt;its GitHub repository&lt;/a&gt;!&lt;/p&gt;
&lt;p&gt;If you want to learn more about retrying strategy and how to handle failure, there&apos;s even more in &lt;a href=&quot;https://scaling-python.com&quot;&gt;Scaling Python&lt;/a&gt;. Check it out!&lt;/p&gt;
</content:encoded></item><item><title>Scalable metrics storage: Gnocchi on Amazon Web Services</title><link>https://julien.danjou.info/blog/metrics-on-amazon-with-gnocchi-s3-driver/</link><guid isPermaLink="true">https://julien.danjou.info/blog/metrics-on-amazon-with-gnocchi-s3-driver/</guid><description>As I wrote a few weeks ago in my post about Gnocchi 3.1 being released, one of the new features available in this version is the S3 driver.</description><pubDate>Wed, 22 Feb 2017 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;As I wrote a few weeks ago in my &lt;a href=&quot;https://julien.danjou.info/blog/gnocchi-3-1-unleashed&quot;&gt;post about Gnocchi 3.1 being released&lt;/a&gt;, one of the new features available in this version is the &lt;a href=&quot;https://aws.amazon.com/s3/&quot;&gt;S3&lt;/a&gt; driver. Today I would like to show you how easy it is to use it and store millions of metrics into the simple, durable and massively scalable object storage provided by &lt;a href=&quot;https://aws.amazon.com/&quot;&gt;Amazon Web Services&lt;/a&gt;.&lt;/p&gt;
&lt;h2&gt;Installation&lt;/h2&gt;
&lt;p&gt;The installation of Gnocchi for this use case is no different from the &lt;a href=&quot;http://gnocchi.xyz/install.html&quot;&gt;standard installation procedure described in the documentation&lt;/a&gt;. Simply install Gnocchi from &lt;a href=&quot;http://pypi.python.org&quot;&gt;PyPI&lt;/a&gt; using the following command:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;$ pip install gnocchi[s3,postgresql] gnocchiclient
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This will install Gnocchi with the dependencies for the S3 and PostgreSQL drivers and the command-line interface to talk with Gnocchi.&lt;/p&gt;
&lt;h2&gt;Configuring Amazon RDS&lt;/h2&gt;
&lt;p&gt;Since you need a SQL database for the indexer, the easiest way to get started is to create a database on &lt;a href=&quot;https://console.aws.amazon.com/rds/&quot;&gt;Amazon RDS&lt;/a&gt;. You can create a managed &lt;a href=&quot;http://postgresql.org&quot;&gt;PostgreSQL&lt;/a&gt; database instance in just a few clicks.&lt;/p&gt;
&lt;p&gt;Once you&apos;re on the homepage of &lt;a href=&quot;https://console.aws.amazon.com/rds/&quot;&gt;Amazon RDS&lt;/a&gt;, pick PostgreSQL as a&lt;br /&gt;
database:&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/gnocchi-rds-postgresql.png&quot; alt=&quot;gnocchi-rds-postgresql&quot; /&gt;&lt;/p&gt;
&lt;p&gt;You can then configure your PostgreSQL instance: I&apos;ve picked a dev/test instance with the basic options available within the RDS Free Tier, but you can pick whatever you think is needed for your production use. Set a username and a password and note them for later: we&apos;ll need them to configure Gnocchi.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/gnocchi-rds-postgresql-conf.png&quot; alt=&quot;gnocchi-rds-postgresql-conf&quot; /&gt;&lt;/p&gt;
&lt;p&gt;The next step is to configure the database in detail. Just set the database name to &quot;gnocchi&quot; and leave the other options at their default values (I&apos;m lazy).&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/gnocchi-rds-postgresql-details.png&quot; alt=&quot;gnocchi-rds-postgresql-details&quot; /&gt;&lt;/p&gt;
&lt;p&gt;After a few minutes, your instance should be created and running. Note down the endpoint. In this case, my instance is &lt;code&gt;gnocchi.cywagbaxpert.us-east-1.rds.amazonaws.com&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/gnocchi-rds-postgresql-running.png&quot; alt=&quot;gnocchi-rds-postgresql-running&quot; /&gt;&lt;/p&gt;
&lt;h2&gt;Configuring Gnocchi for S3 access&lt;/h2&gt;
&lt;p&gt;In order to give Gnocchi an access to S3, you need to create access keys. The easiest way to create them is to go to &lt;a href=&quot;https://console.aws.amazon.com/iam&quot;&gt;IAM&lt;/a&gt; in your AWS console, pick a user with S3 access and click on the big gray button named &quot;Create access key&quot;.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/gnocchi-iam-create-keys.png&quot; alt=&quot;gnocchi-iam-create-keys&quot; /&gt;&lt;/p&gt;
&lt;p&gt;Once you do that, you&apos;ll get the &lt;em&gt;access key id&lt;/em&gt; and &lt;em&gt;secret access key&lt;/em&gt;. Note them down, we will need these later.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/gnocchi-iam-get-keys.png&quot; alt=&quot;gnocchi-iam-get-keys&quot; /&gt;&lt;/p&gt;
&lt;h2&gt;Creating &lt;code&gt;gnocchi.conf&lt;/code&gt;&lt;/h2&gt;
&lt;p&gt;Now it is time to create the &lt;code&gt;gnocchi.conf&lt;/code&gt; file. You can place it in &lt;code&gt;/etc/gnocchi&lt;/code&gt; if you want to deploy it system-wide, or in any other directory, adding the &lt;code&gt;--config-file&lt;/code&gt; option to each Gnocchi command.&lt;/p&gt;
&lt;p&gt;Here are the values that you should retrieve and write in the configuration file:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;indexer.url&lt;/code&gt;: the connection URL built from the PostgreSQL RDS instance endpoint and the credentials noted above.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;storage.s3_endpoint_url&lt;/code&gt;: the S3 endpoint URL – that depends on the region you want to use and &lt;a href=&quot;http://docs.aws.amazon.com/general/latest/gr/rande.html#s3_region&quot;&gt;they are listed here&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;storage.s3_region_name&lt;/code&gt;: the S3 region name matching the endpoint you picked.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;storage.s3_access_key_id&lt;/code&gt; and &lt;code&gt;storage.s3_secret_access_key&lt;/code&gt;: your AWS access key id and secret access key.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Your &lt;code&gt;gnocchi.conf&lt;/code&gt; file should then look like that:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;[indexer]
url = postgresql://gnocchi:gn0cch1rul3z@gnocchi.cywagbaxpert.us-east-1.rds.amazonaws.com:5432/gnocchi

[storage]
driver = s3
s3_endpoint_url = https://s3-eu-west-1.amazonaws.com
s3_region_name = eu-west-1
s3_access_key_id = &amp;lt;your access key id&amp;gt;
s3_secret_access_key = &amp;lt;your secret access key&amp;gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Once that&apos;s done, you can run &lt;code&gt;gnocchi-upgrade&lt;/code&gt; in order to initialize Gnocchi indexer (PostgreSQL) and storage (S3):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;$ gnocchi-upgrade --config-file gnocchi.conf
2017-02-07 15:35:52.491 3660 INFO gnocchi.cli [-] Upgrading indexer &amp;lt;gnocchi.indexer.sqlalchemy.SQLAlchemyIndexer object at 0x108221950&amp;gt;
2017-02-07 15:36:04.127 3660 INFO gnocchi.cli [-] Upgrading storage &amp;lt;gnocchi.storage.s3.S3Storage object at 0x10ca943d0&amp;gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Then you can run the API using the testing daemon &lt;code&gt;gnocchi-api&lt;/code&gt;, specifying its default port 8041:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;$ gnocchi-api --port 8041 -- --config-file gnocchi.conf
2017-02-07 15:53:06.823 6290 INFO gnocchi.rest.app [-] WSGI config used: /Users/jd/Source/gnocchi/gnocchi/rest/api-paste.ini
********************************************************************************
STARTING test server gnocchi.rest.app.build_wsgi_app
Available at http://127.0.0.1:8041/
DANGER! For testing only, do not use in production
********************************************************************************
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The best way to run Gnocchi API is to use &lt;a href=&quot;http://gnocchi.xyz/master/running.html#running-api-as-a-wsgi-application&quot;&gt;uwsgi as documented&lt;/a&gt;, but in this case, using the testing daemon &lt;code&gt;gnocchi-api&lt;/code&gt; is good enough.&lt;/p&gt;
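&lt;p&gt;For reference, serving the API with uwsgi might look like the command below. This is a hypothetical invocation: the WSGI entry point and the worker counts are illustrative, so check the linked documentation for the options recommended for your version:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;## Hypothetical example -- adjust to the documented setup.
$ uwsgi --http localhost:8041 --wsgi-file $(which gnocchi-api) \
        --master --processes 4 --threads 2
&lt;/code&gt;&lt;/pre&gt;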
&lt;p&gt;Finally, in another terminal, you can start the &lt;code&gt;gnocchi-metricd&lt;/code&gt; daemon that will process metrics in background:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;$ gnocchi-metricd --config-file gnocchi.conf
2017-02-07 15:52:41.416 6262 INFO gnocchi.cli [-] 0 measurements bundles across 0 metrics wait to be processed.
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Once everything is running, you can use Gnocchi&apos;s client to query it and check that everything is OK. The backlog should be empty at this stage, obviously.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;$ gnocchi status
+-----------------------------------------------------+-------+
| Field                                               | Value |
+-----------------------------------------------------+-------+
| storage/number of metric having measures to process | 0     |
| storage/total number of measures to process         | 0     |
+-----------------------------------------------------+-------+
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Gnocchi is ready to be used!&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;$ # Create a generic resource &quot;foobar&quot; with a metric named &quot;visitor&quot;
$ gnocchi resource create foobar -n visitor
+-----------------------+-----------------------------------------------+
| Field                 | Value                                         |
+-----------------------+-----------------------------------------------+
| created_by_project_id |                                               |
| created_by_user_id    | admin                                         |
| creator               | admin                                         |
| ended_at              | None                                          |
| id                    | b4d568e4-7af1-5aec-ac3f-9c09fa3685a9          |
| metrics               | visitor: 05f45876-1a69-4a64-8575-03eea5b79407 |
| original_resource_id  | foobar                                        |
| project_id            | None                                          |
| revision_end          | None                                          |
| revision_start        | 2017-02-07T14:54:54.417447+00:00              |
| started_at            | 2017-02-07T14:54:54.417414+00:00              |
| type                  | generic                                       |
| user_id               | None                                          |
+-----------------------+-----------------------------------------------+

## Send the number of visitor at 2 different timestamps
$ gnocchi measures add --resource-id foobar -m 2017-02-07T15:56@23 visitor
$ gnocchi measures add --resource-id foobar -m 2017-02-07T15:57@42 visitor

## Check the average number of visitor
## (the --refresh option is given to be sure the measure are processed)
$ gnocchi measures show --resource-id foobar visitor --refresh
+---------------------------+-------------+-------+
| timestamp                 | granularity | value |
+---------------------------+-------------+-------+
| 2017-02-07T15:55:00+00:00 |       300.0 |  32.5 |
+---------------------------+-------------+-------+

## Now shows the minimum number of visitor
$ gnocchi measures show --aggregation min --resource-id foobar visitor
+---------------------------+-------------+-------+
| timestamp                 | granularity | value |
+---------------------------+-------------+-------+
| 2017-02-07T15:55:00+00:00 |       300.0 |  23.0 |
+---------------------------+-------------+-------+
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;And voilà! You&apos;re ready to store millions of metrics and measures on your Amazon Web Services cloud platform. I hope you&apos;ll enjoy it and feel free to ask any question in the comment section or by reaching me directly!&lt;/p&gt;
</content:encoded></item><item><title>Sending your collectd metrics to Gnocchi</title><link>https://julien.danjou.info/blog/gnocchi-collectd-setup/</link><guid isPermaLink="true">https://julien.danjou.info/blog/gnocchi-collectd-setup/</guid><description>Knowing that collectd is a daemon that collects system and applications metrics and that Gnocchi is a scalable timeseries database, it sounds like a good idea to combine them together.</description><pubDate>Thu, 16 Feb 2017 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Knowing that &lt;a href=&quot;http://collectd.org/&quot;&gt;collectd&lt;/a&gt; is a daemon that collects system and applications metrics and that &lt;a href=&quot;http://gnocchi.xyz&quot;&gt;Gnocchi&lt;/a&gt; is a scalable timeseries database, it sounds like a good idea to combine them together. &lt;em&gt;Cherry on the cake&lt;/em&gt;: you can easily draw charts using &lt;a href=&quot;http://grafana.org&quot;&gt;Grafana&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;While it&apos;s true that Gnocchi is well integrated with &lt;a href=&quot;http://openstack.org&quot;&gt;OpenStack&lt;/a&gt;, as it originally comes from this ecosystem, it actually works standalone by default. Starting with version 3.1, it is now easy to send metrics to &lt;em&gt;Gnocchi&lt;/em&gt; using &lt;em&gt;collectd&lt;/em&gt;.&lt;/p&gt;
&lt;h2&gt;Installation&lt;/h2&gt;
&lt;p&gt;What we&apos;ll need to install to accomplish this task is:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;collectd&lt;/li&gt;
&lt;li&gt;Gnocchi&lt;/li&gt;
&lt;li&gt;collectd-gnocchi&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;How you install them does not really matter. If they are packaged by your operating system, go ahead. For Gnocchi and collectd-gnocchi, you can also use &lt;em&gt;pip&lt;/em&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;## pip install gnocchi[file,postgresql]
[…]
Successfully installed gnocchi-3.1.0
## pip install collectd-gnocchi
Collecting collectd-gnocchi
  Using cached collectd-gnocchi-1.0.1.tar.gz
[…]
Installing collected packages: collectd-gnocchi
  Running setup.py install for collectd-gnocchi ... done
Successfully installed collectd-gnocchi-1.0.1
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The detailed installation procedure for Gnocchi is &lt;a href=&quot;http://gnocchi.xyz/install.html#id1&quot;&gt;covered in the documentation&lt;/a&gt;. Among other things, it explains which flavors are available – here I picked PostgreSQL and the file driver to store the metrics.&lt;/p&gt;
&lt;h2&gt;Configuration&lt;/h2&gt;
&lt;h3&gt;Gnocchi&lt;/h3&gt;
&lt;p&gt;Gnocchi is simple to configure and is, again, &lt;a href=&quot;http://gnocchi.xyz/configuration.html&quot;&gt;documented&lt;/a&gt;. The default configuration file is &lt;code&gt;/etc/gnocchi/gnocchi.conf&lt;/code&gt; – you can generate it with &lt;code&gt;gnocchi-config-generator&lt;/code&gt; if needed. However, it is also possible to specify another configuration file by appending the &lt;code&gt;--config-file&lt;/code&gt; option to any command line.&lt;/p&gt;
&lt;p&gt;In Gnocchi&apos;s configuration file, you need to set the &lt;code&gt;indexer.url&lt;/code&gt; configuration option to point an existing PostgreSQL database and set &lt;code&gt;storage.file_basepath&lt;/code&gt; to an existing directory to store your metrics (the default is &lt;code&gt;/var/lib/gnocchi&lt;/code&gt;). That gives something like:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;[indexer]
url = postgresql://root:p4assw0rd@localhost/gnocchi

[storage]
file_basepath = /var/lib/gnocchi
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Once done, just run the &lt;code&gt;gnocchi-upgrade&lt;/code&gt; command to initialize the index and storage.&lt;/p&gt;
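&lt;p&gt;Assuming the configuration file above is in the default location, initialization boils down to a single command:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;$ gnocchi-upgrade
## Or, with a configuration file in a non-default location:
## $ gnocchi-upgrade --config-file /path/to/gnocchi.conf
&lt;/code&gt;&lt;/pre&gt;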
&lt;h3&gt;collectd&lt;/h3&gt;
&lt;p&gt;Collectd provides a default configuration file that loads a bunch of plugins by default, which will gather all sorts of metrics on your computer. You can check the &lt;a href=&quot;http://collectd.org/documentation.shtml&quot;&gt;documentation&lt;/a&gt; online to see how to disable or enable plugins.&lt;/p&gt;
&lt;p&gt;As the &lt;em&gt;collectd-gnocchi&lt;/em&gt; plugin is written in Python, you&apos;ll need to enable the Python plugin and load the &lt;em&gt;collectd-gnocchi&lt;/em&gt; module:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;LoadPlugin python

&amp;lt;Plugin python&amp;gt;
  Import &quot;collectd_gnocchi&quot;
  &amp;lt;Module collectd_gnocchi&amp;gt;
      endpoint &quot;http://localhost:8041&quot;
  &amp;lt;/Module&amp;gt;
&amp;lt;/Plugin&amp;gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;That is enough to enable the storage of metrics in Gnocchi.&lt;/p&gt;
&lt;h2&gt;Running the daemons&lt;/h2&gt;
&lt;p&gt;Once everything is configured, you can launch &lt;code&gt;gnocchi-metricd&lt;/code&gt; and the &lt;code&gt;gnocchi-api&lt;/code&gt; daemon:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;$ gnocchi-metricd
2017-01-26 15:22:49.018 15971 INFO gnocchi.cli [-] 0 measurements bundles
across 0 metrics wait to be processed.
[…]
## In another terminal
$ gnocchi-api --port 8041
[…]
STARTING test server gnocchi.rest.app.build_wsgi_app
Available at http://127.0.0.1:8041/
[…]
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;It&apos;s not recommended to run the Gnocchi API using the &lt;code&gt;gnocchi-api&lt;/code&gt; test daemon (as &lt;a href=&quot;http://gnocchi.xyz/running.html#running-as-a-wsgi-application&quot;&gt;written in the documentation&lt;/a&gt;): using &lt;a href=&quot;https://uwsgi-docs.readthedocs.io/&quot;&gt;uwsgi&lt;/a&gt; is a better option. However, for rapid testing, the &lt;code&gt;gnocchi-api&lt;/code&gt; daemon is good enough.&lt;/p&gt;
&lt;p&gt;Once that&apos;s done, you can start &lt;code&gt;collectd&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;$ collectd
## Or to run in foreground with a different configuration file:
## $ collectd -C collectd.conf -f
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;If you have any problem launching &lt;em&gt;collectd&lt;/em&gt;, check syslog for more information: there might be an issue loading a module or plugin.&lt;/p&gt;
&lt;p&gt;If no errors are printed, then everything&apos;s working fine and you should soon see &lt;em&gt;gnocchi-api&lt;/em&gt; printing some requests such as:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;127.0.0.1 - - [26/Jan/2017 15:27:03] &quot;POST /v1/resource/collectd HTTP/1.1&quot; 409 113
127.0.0.1 - - [26/Jan/2017 15:27:03] &quot;POST /v1/batch/resources/metrics/measures?create_metrics=True HTTP/1.1&quot; 400 91
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Enjoying the result&lt;/h2&gt;
&lt;p&gt;Once everything runs, you can access your newly created resources and metric by using the &lt;a href=&quot;http://pypi.python.org/pypi/gnocchiclient&quot;&gt;gnocchiclient&lt;/a&gt;. It should have been installed as a dependency of &lt;em&gt;collectd_gnocchi&lt;/em&gt;, but you can also install it manually using &lt;code&gt;pip install gnocchiclient&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;If you need to specify a different endpoint, you can use the &lt;code&gt;--endpoint&lt;/code&gt; option (which defaults to &lt;a href=&quot;http://localhost:8041&quot;&gt;http://localhost:8041&lt;/a&gt;). Do not hesitate to check the &lt;code&gt;--help&lt;/code&gt; option for more information.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;$ gnocchi resource list --details
+---------------+----------+------------+---------+----------------------+---------------+----------+----------------+--------------+---------+-----------+
| id            | type     | project_id | user_id | original_resource_id | started_at    | ended_at | revision_start | revision_end | creator | host      |
+---------------+----------+------------+---------+----------------------+---------------+----------+----------------+--------------+---------+-----------+
| dd245138-00c7 | collectd | None       | None    | dd245138-00c7-5bdc-  | 2017-01-26T14 | None     | 2017-01-26T14: | None         | admin   | localhost |
| -5bdc-94f8-26 |          |            |         | 94f8-263e236812f7    | :21:02.297466 |          | 21:02.297483+0 |              |         |           |
| 3e236812f7    |          |            |         |                      | +00:00        |          | 0:00           |              |         |           |
+---------------+----------+------------+---------+----------------------+---------------+----------+----------------+--------------+---------+-----------+
$ gnocchi resource show collectd:localhost
+-----------------------+-----------------------------------------------------------------------+
| Field                 | Value                                                                 |
+-----------------------+-----------------------------------------------------------------------+
| created_by_project_id |                                                                       |
| created_by_user_id    | admin                                                                 |
| creator               | admin                                                                 |
| ended_at              | None                                                                  |
| host                  | localhost                                                             |
| id                    | dd245138-00c7-5bdc-94f8-263e236812f7                                  |
| metrics               | interface-en0@if_errors-0: 5d60f224-2e9e-4247-b415-64d567cf5866       |
|                       | interface-en0@if_errors-1: 1df8b08b-555a-4cab-9186-f9b79a814b03       |
|                       | interface-en0@if_octets-0: 491b7517-7219-4a04-bdb6-934d3bacb482       |
|                       | interface-en0@if_octets-1: 8b5264b8-03f3-4aba-a7f8-3cd4b559e162       |
|                       | interface-en0@if_packets-0: 12efc12b-2538-45e7-aa66-f8b9960b5fa3      |
|                       | interface-en0@if_packets-1: 39377ff7-06e8-454a-a22a-942c8f2bca56      |
|                       | interface-en1@if_errors-0: c3c7e9fc-f486-4d0c-9d36-55cea855596a       |
|                       | interface-en1@if_errors-1: a90f1bec-3a60-4f58-a1d1-b3c09dce4359       |
|                       | interface-en1@if_octets-0: c1ee8c75-95bf-4096-8055-8c0c4ec8cd47       |
|                       | interface-en1@if_octets-1: cbb90a94-e133-4deb-ac10-3f37770e32f0       |
|                       | interface-en1@if_packets-0: ac93b1b9-da71-4876-96aa-76067b35c6c9      |
|                       | interface-en1@if_packets-1: 2f8528b2-12ae-4c4d-bec7-8cc987e7487b      |
|                       | interface-en2@if_errors-0: ddcf7203-4c49-400b-9320-9d3e0a63c6d5       |
|                       | interface-en2@if_errors-1: b249ea42-01ad-4742-9452-2c834010df71       |
|                       | interface-en2@if_octets-0: 8c23013a-604e-40bf-a07a-e2dc4fc5cbd7       |
|                       | interface-en2@if_octets-1: 806c1452-0607-4b56-b184-c4ffd48f52c0       |
|                       | interface-en2@if_packets-0: c5bc6103-6313-4b8b-997d-01930d1d8af4      |
|                       | interface-en2@if_packets-1: 478ae87e-e56b-44e4-83b0-ed28d99ed280      |
|                       | load@load-0: 5db2248d-2dca-401e-b2e2-bbaee23b623e                     |
|                       | load@load-1: 6f74ac93-78fd-4a74-a47e-d2add487a30f                     |
|                       | load@load-2: 1897aca1-356e-4791-907f-512e516992b5                     |
|                       | memory@memory-active-0: 83944a85-9c84-4fe4-b471-1a6cf8dce858          |
|                       | memory@memory-free-0: 0ccc7cfa-26a5-4441-a15f-9ebb2aa82c6d            |
|                       | memory@memory-inactive-0: 63736026-94c4-47c5-8d6f-a9d89d65025b        |
|                       | memory@memory-wired-0: b7217fd6-2cdc-4efd-b1a8-a1edd52eaa2e           |
| original_resource_id  | dd245138-00c7-5bdc-94f8-263e236812f7                                  |
| project_id            | None                                                                  |
| revision_end          | None                                                                  |
| revision_start        | 2017-01-26T14:21:02.297483+00:00                                      |
| started_at            | 2017-01-26T14:21:02.297466+00:00                                      |
| type                  | collectd                                                              |
| user_id               | None                                                                  |
+-----------------------+-----------------------------------------------------------------------+
% gnocchi metric show -r collectd:localhost load@load-0
+------------------------------------+-----------------------------------------------------------------------+
| Field                              | Value                                                                 |
+------------------------------------+-----------------------------------------------------------------------+
| archive_policy/aggregation_methods | min, std, sum, median, mean, 95pct, count, max                        |
| archive_policy/back_window         | 0                                                                     |
| archive_policy/definition          | - timespan: 1:00:00, granularity: 0:05:00, points: 12                 |
|                                    | - timespan: 1 day, 0:00:00, granularity: 1:00:00, points: 24          |
|                                    | - timespan: 30 days, 0:00:00, granularity: 1 day, 0:00:00, points: 30 |
| archive_policy/name                | low                                                                   |
| created_by_project_id              |                                                                       |
| created_by_user_id                 | admin                                                                 |
| creator                            | admin                                                                 |
| id                                 | 5db2248d-2dca-401e-b2e2-bbaee23b623e                                  |
| name                               | load@load-0                                                           |
| resource/created_by_project_id     |                                                                       |
| resource/created_by_user_id        | admin                                                                 |
| resource/creator                   | admin                                                                 |
| resource/ended_at                  | None                                                                  |
| resource/id                        | dd245138-00c7-5bdc-94f8-263e236812f7                                  |
| resource/original_resource_id      | dd245138-00c7-5bdc-94f8-263e236812f7                                  |
| resource/project_id                | None                                                                  |
| resource/revision_end              | None                                                                  |
| resource/revision_start            | 2017-01-26T14:21:02.297483+00:00                                      |
| resource/started_at                | 2017-01-26T14:21:02.297466+00:00                                      |
| resource/type                      | collectd                                                              |
| resource/user_id                   | None                                                                  |
| unit                               | None                                                                  |
+------------------------------------+-----------------------------------------------------------------------+
$ gnocchi measures show -r collectd:localhost load@load-0
+---------------------------+-------------+--------------------+
| timestamp                 | granularity |              value |
+---------------------------+-------------+--------------------+
| 2017-01-26T00:00:00+00:00 |     86400.0 | 3.2705004391254193 |
| 2017-01-26T15:00:00+00:00 |      3600.0 | 3.2705004391254193 |
| 2017-01-26T15:00:00+00:00 |       300.0 | 2.6022800611413044 |
| 2017-01-26T15:05:00+00:00 |       300.0 |  3.561742940080275 |
| 2017-01-26T15:10:00+00:00 |       300.0 | 2.5605337960379466 |
| 2017-01-26T15:15:00+00:00 |       300.0 |  3.837517851142473 |
| 2017-01-26T15:20:00+00:00 |       300.0 | 3.9625948392427883 |
| 2017-01-26T15:25:00+00:00 |       300.0 | 3.2690042162698414 |
+---------------------------+-------------+--------------------+
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;As you can see, the command line works smoothly and can show you any kind of metric reported by &lt;em&gt;collectd&lt;/em&gt;. In this case, it was just running on my laptop, but you can imagine it&apos;s easy enough to poll thousands of hosts with &lt;em&gt;collectd&lt;/em&gt; and &lt;em&gt;Gnocchi&lt;/em&gt;.&lt;/p&gt;
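&lt;p&gt;The same data is also reachable over Gnocchi&apos;s raw REST API, without the client. A minimal sketch in Python using only the standard library, assuming the default endpoint and reusing the &lt;code&gt;load@load-0&lt;/code&gt; metric id from the listing above (the helper name is mine):&lt;/p&gt;

```python
import json
import urllib.request

GNOCCHI_ENDPOINT = "http://localhost:8041"  # gnocchi-api default

def measures_url(metric_id, endpoint=GNOCCHI_ENDPOINT):
    # Gnocchi serves a metric's aggregated measures at /v1/metric/<id>/measures
    return "%s/v1/metric/%s/measures" % (endpoint, metric_id)

# Metric id of load@load-0 from the resource shown above:
url = measures_url("5db2248d-2dca-401e-b2e2-bbaee23b623e")

# Uncomment with a running gnocchi-api; each measure comes back as a
# [timestamp, granularity, value] triple, like the CLI output above:
# with urllib.request.urlopen(url) as resp:
#     print(json.dumps(json.load(resp), indent=2))
```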
&lt;h2&gt;Bonus: charting with Grafana&lt;/h2&gt;
&lt;p&gt;&lt;a href=&quot;http://grafana.org&quot;&gt;Grafana&lt;/a&gt;, a charting tool, has a plugin for &lt;em&gt;Gnocchi&lt;/em&gt;, as &lt;a href=&quot;http://gnocchi.xyz/grafana.html&quot;&gt;detailed in the documentation&lt;/a&gt;. Once installed, you can configure &lt;em&gt;Grafana&lt;/em&gt; to point to &lt;em&gt;Gnocchi&lt;/em&gt; this way:&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/grafana-config-screen-gnocchi.png&quot; alt=&quot;Screenshot of Grafana configuration screen pointing to Gnocchi as data source&quot; /&gt;&lt;/p&gt;
&lt;p&gt;You can then create a new dashboard by filling in the forms as you wish. See this other screenshot for a nice example:&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/grafana-gnocchi-load.png&quot; alt=&quot;Charts of my laptop&apos;s load average&quot; /&gt;&lt;/p&gt;
&lt;p&gt;I hope everything is clear and easy enough. If you have any questions, feel free to write something in the comments section!&lt;/p&gt;
</content:encoded></item><item><title>FOSDEM 2017, recap</title><link>https://julien.danjou.info/blog/fosdem-2017-recap/</link><guid isPermaLink="true">https://julien.danjou.info/blog/fosdem-2017-recap/</guid><description>Last week-end, I was in Brussels, Belgium for the 2017 edition of the FOSDEM, one of the greatest open source developer conference.</description><pubDate>Mon, 06 Feb 2017 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/fosdem.png&quot; alt=&quot;FOSDEM 2017 conference logo&quot; /&gt;&lt;/p&gt;
&lt;p&gt;Last week-end, I was in Brussels, Belgium for the 2017 edition of &lt;a href=&quot;http://fosdem.org&quot;&gt;FOSDEM&lt;/a&gt;, one of the greatest open source developer conferences.&lt;/p&gt;
&lt;p&gt;This year, I decided to propose a talk about &lt;a href=&quot;http://gnocchi.xyz&quot;&gt;Gnocchi&lt;/a&gt; which was accepted in the &lt;a href=&quot;https://fosdem.org/2017/schedule/track/python/&quot;&gt;Python devroom&lt;/a&gt;. The track was very well organized (thanks to &lt;a href=&quot;https://wirtel.be/&quot;&gt;Stéphane Wirtel&lt;/a&gt;) and I was able to present Gnocchi to a room full of Python developers!&lt;/p&gt;
&lt;p&gt;I&apos;ve explained why we created Gnocchi and how we did it, and finally briefly explained how to use it with the command-line interface or in a Python application using the &lt;a href=&quot;http://gnocchi.xyz/gnocchiclient&quot;&gt;SDK&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;You can check the slides below and the &lt;a href=&quot;https://video.fosdem.org/2017/UD2.120/storing_metrics_gnocchi.mp4&quot;&gt;video of the talk&lt;/a&gt;.&lt;/p&gt;
</content:encoded></item><item><title>Gnocchi 3.1 unleashed</title><link>https://julien.danjou.info/blog/gnocchi-3-1-unleashed/</link><guid isPermaLink="true">https://julien.danjou.info/blog/gnocchi-3-1-unleashed/</guid><description>It&apos;s always difficult to know when to release, and we really wanted to do it earlier. But it seems that each week more awesome work was being done in Gnocchi, so we kept delaying it while having no.</description><pubDate>Thu, 02 Feb 2017 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;It&apos;s always difficult to know when to release, and we really wanted to do it earlier. But it seems that each week more awesome work was being done in &lt;a href=&quot;http://gnocchi.xyz&quot;&gt;Gnocchi&lt;/a&gt;, so we kept delaying it while having no pressure to push it out.&lt;/p&gt;
&lt;p&gt;But now that the OpenStack cycle is finishing, even if Gnocchi does not strictly follow it, it seemed to be a good time to cut the leash and let this release go.&lt;/p&gt;
&lt;p&gt;There are again some major new changes coming from 3.0. The previous version 3.0 was tagged in October and had 90 changes merged from 13 authors since 2.2. This 3.1 version has 200 changes merged from 24 different authors. This is a great improvement in our contributor base and our rate of change – even though our delay to merge is very low. Once again, we pushed usage of release notes to document user-visible changes, and &lt;a href=&quot;http://gnocchi.xyz/releasenotes/3.1.html&quot;&gt;they can be read online&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Therefore, I am going to quickly summarize the major changes:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;The REST API authentication mechanism has been modularized. It&apos;s now simple to provide any authentication mechanism for Gnocchi as a plugin. The default is now an HTTP basic authentication mechanism that does not implement any kind of enforcement. The &lt;a href=&quot;http://docs.openstack.org/developer/keystone/&quot;&gt;Keystone&lt;/a&gt; authentication is still available, obviously.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Batching has been improved and can now create metrics on the fly, reducing the latency needed when pushing measures to non-existing metrics. This is leveraged by the &lt;a href=&quot;https://github.com/gnocchixyz/collectd-gnocchi&quot;&gt;collectd-gnocchi&lt;/a&gt; plugin, for example.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;The performance of the Carbonara-based backends has been largely improved. This is not really listed as a change as it&apos;s not user-visible, but an amazing amount of profiling and rewriting code from &lt;a href=&quot;http://pandas.pydata.org/&quot;&gt;Pandas&lt;/a&gt; to &lt;a href=&quot;http://www.numpy.org/&quot;&gt;NumPy&lt;/a&gt; has been done. While Pandas is very developer-friendly and generic, using NumPy directly offers far more performance and should decrease &lt;code&gt;gnocchi-metricd&lt;/code&gt; CPU usage by a large factor.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;The storage has been split into two parts: the storage of incoming new measures to be processed, and the storage and archival of aggregated metrics. This allows using, for example, the file driver to store new measures as they are sent, and, once processed, storing them in Ceph. Before this change, all the new measures had to go into Ceph. While there&apos;s no specific driver yet for incoming measures, it&apos;s easy to envision a driver for systems like &lt;a href=&quot;https://redis.io&quot;&gt;Redis&lt;/a&gt; or &lt;a href=&quot;https://memcached.org&quot;&gt;Memcached&lt;/a&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;A new &lt;a href=&quot;https://aws.amazon.com/s3/&quot;&gt;Amazon S3&lt;/a&gt; driver has been merged. It works in the same way as the file or &lt;a href=&quot;http://docs.openstack.org/developer/swift/&quot;&gt;OpenStack Swift&lt;/a&gt; drivers.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
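&lt;p&gt;To illustrate the improved batching: a single &lt;code&gt;POST /v1/batch/resources/metrics/measures?create_metrics=True&lt;/code&gt; can now create the missing metrics and push their measures in one round-trip. Here is a sketch of building such a request body in Python – the payload shape (resource id, then metric name, then a list of measures) follows my reading of the Gnocchi documentation, and the ids are illustrative:&lt;/p&gt;

```python
import json

def batch_body(measures_by_resource):
    """Build the JSON body for
    POST /v1/batch/resources/metrics/measures?create_metrics=True
    `measures_by_resource` maps resource id -> metric name -> measures."""
    return json.dumps(measures_by_resource)

body = batch_body({
    "dd245138-00c7-5bdc-94f8-263e236812f7": {    # resource id
        "load@load-0": [                         # metric, created on the fly
            {"timestamp": "2017-01-26T15:00:00", "value": 2.6},
        ],
    },
})
```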
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/gnocchi-logo-old.jpg&quot; alt=&quot;Gnocchi logo&quot; /&gt;&lt;/p&gt;
&lt;p&gt;I will write more about some of these new features in the upcoming weeks, as they are very interesting for Gnocchi&apos;s users.&lt;/p&gt;
&lt;p&gt;We are planning to run a scalability test and benchmarks using the &lt;a href=&quot;http://scalelab.redhat.com/&quot;&gt;ScaleLab&lt;/a&gt; in a few weeks if everything goes as planned. I will obviously share the results here, but we also submitted a talk for the next &lt;a href=&quot;https://www.openstack.org/summit/boston-2017/&quot;&gt;OpenStack Summit in Boston&lt;/a&gt; to present the results of our scalability and performance tests – hoping the session will be accepted.&lt;/p&gt;
&lt;p&gt;I will also be talking about Gnocchi &lt;a href=&quot;https://fosdem.org/2017/schedule/event/storing_metrics_gnocchi/&quot;&gt;this Sunday at FOSDEM&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;We don&apos;t have a very determined roadmap for Gnocchi for the next weeks. We do have a few ideas on what we want to implement, but we are also very easily influenced by the requests of our users: therefore, feel free to ask for anything!&lt;/p&gt;
</content:encoded></item><item><title>Scaling Python is on its way</title><link>https://julien.danjou.info/blog/announcing-scaling-python/</link><guid isPermaLink="true">https://julien.danjou.info/blog/announcing-scaling-python/</guid><description>My day-to-day activities are still evolving around the Python programming language, as I continue working on the OpenStack project as part of my job at Red Hat.</description><pubDate>Mon, 16 Jan 2017 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;My day-to-day activities are still evolving around the &lt;a href=&quot;http://python.org&quot;&gt;Python&lt;/a&gt; programming language, as I continue working on the &lt;a href=&quot;http://openstack.org&quot;&gt;OpenStack&lt;/a&gt; project as part of my job at &lt;a href=&quot;http://redhat.com&quot;&gt;Red Hat&lt;/a&gt;. OpenStack is still the biggest Python project out there, and attracts a lot of Python hackers.&lt;/p&gt;
&lt;p&gt;Those last few years, however, things have taken a different turn for me when I made the choice with my team to rework the telemetry stack architecture. We decided to make a point of making it scale way beyond what has been done in the project so far.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://scaling-python.com&quot;&gt;&lt;img src=&quot;https://scaling-python.com/img/the-hacker-guide-to-scaling-python.png&quot; alt=&quot;Cover of The Hacker&apos;s Guide to Scaling Python&quot; /&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;I started to dig into a lot of different fields around Python. Topics you don&apos;t often look at when writing a simple and straightforward application. It turns out that writing scalable applications in Python is neither impossible nor that difficult. There are a few hiccups to avoid, and various tools that can help, but it really is possible – without switching to a whole other language, framework, or exotic tool set.&lt;/p&gt;
&lt;p&gt;Working on those projects seemed to me like a good opportunity to share with the rest of the world what I learned. Therefore, I decided to share my most recent knowledge about distributed and scalable Python applications in a new book, entitled &lt;a href=&quot;https://scaling-python.com&quot;&gt;The Hacker&apos;s Guide to Scaling Python&lt;/a&gt; (or &lt;em&gt;Scaling Python&lt;/em&gt;, in short). The book should be released in a few months – fingers crossed.&lt;/p&gt;
&lt;p&gt;And as the book is still a work-in-progress, I&apos;ll be happy to hear any remarks, questions, or topic ideas you might have, or any particular angle you would like me to take in this book (reply in the &lt;a href=&quot;#disqus_thread&quot;&gt;comments section&lt;/a&gt; or shoot me an &lt;a href=&quot;mailto:julien@danjou.info&quot;&gt;email&lt;/a&gt;). And if you&apos;d like to be kept updated on the book&apos;s progress, you can subscribe in the following form or from the &lt;a href=&quot;https://scaling-python.com&quot;&gt;book homepage&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The adventure of working on my previous book, &lt;a href=&quot;https://thehackerguidetopython.com&quot;&gt;The Hacker&apos;s Guide to Python&lt;/a&gt;, has been so tremendous and the feedback so great that I&apos;m looking forward to releasing this new book later this year!&lt;/p&gt;
</content:encoded></item><item><title>Packaging Python software with pbr</title><link>https://julien.danjou.info/blog/packaging-python-with-pbr/</link><guid isPermaLink="true">https://julien.danjou.info/blog/packaging-python-with-pbr/</guid><description>Packaging Python has been a painful experience for long. The history of the various distribution that Python offered along the years is really bumpy, and both the user and developer experience has bee</description><pubDate>Mon, 02 Jan 2017 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Packaging Python has been a painful experience for a long time. The history of the various distribution tools that Python has offered over the years is really bumpy, and both the user and developer experience have been pretty bad.&lt;/p&gt;
&lt;p&gt;Fortunately, things have improved a lot in recent years, with the reconciliation of &lt;em&gt;&lt;a href=&quot;https://setuptools.readthedocs.io&quot;&gt;setuptools&lt;/a&gt;&lt;/em&gt; and &lt;em&gt;distribute&lt;/em&gt;.&lt;/p&gt;
&lt;p&gt;Though in the context of the &lt;a href=&quot;http://openstack.org&quot;&gt;OpenStack&lt;/a&gt; project, a solution on top of &lt;em&gt;setuptools&lt;/em&gt; was already started a while back. Its usage is now spread across a whole range of software and libraries.&lt;/p&gt;
&lt;p&gt;This project is called &lt;em&gt;&lt;a href=&quot;http://docs.openstack.org/developer/pbr/&quot;&gt;pbr&lt;/a&gt;&lt;/em&gt;, for &lt;em&gt;Python Build Reasonableness&lt;/em&gt;. Don&apos;t be put off by the OpenStack-themed documentation – it is a bad habit of OpenStack folks not to advertise their tooling in an agnostic fashion. The tool has no dependency on the cloud platform, and can be used painlessly with any package.&lt;/p&gt;
&lt;h2&gt;How it works&lt;/h2&gt;
&lt;p&gt;&lt;em&gt;pbr&lt;/em&gt; takes inspiration from &lt;em&gt;distutils2&lt;/em&gt; (a now-abandoned project) and uses a &lt;code&gt;setup.cfg&lt;/code&gt; file to describe the packager&apos;s intents. This is what a &lt;code&gt;setup.py&lt;/code&gt; using &lt;em&gt;pbr&lt;/em&gt; looks like:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import setuptools

setuptools.setup(setup_requires=[&apos;pbr&apos;], pbr=True)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Two lines of code – it&apos;s that simple. The actual metadata that the setup requires is stored in &lt;code&gt;setup.cfg&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;[metadata]
name = foobar
author = Dave Null
author-email = foobar@example.org
summary = Package doing nifty stuff
license = MIT
description-file =
    README.rst
home-page = http://pypi.python.org/pypi/foobar
requires-python = &amp;gt;=2.6
classifier = 
    Development Status :: 4 - Beta
    Environment :: Console
    Intended Audience :: Developers
    Intended Audience :: Information Technology
    License :: OSI Approved :: Apache Software License
    Operating System :: OS Independent
    Programming Language :: Python

[files]
packages =
    foobar
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This syntax is way easier to write and read than the standard &lt;code&gt;setup.py&lt;/code&gt;.&lt;/p&gt;
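&lt;p&gt;For comparison, the same metadata expressed directly as &lt;code&gt;setuptools.setup()&lt;/code&gt; keyword arguments in a classic &lt;code&gt;setup.py&lt;/code&gt; would look roughly like this – a sketch using standard &lt;em&gt;setuptools&lt;/em&gt; keyword names, not something &lt;em&gt;pbr&lt;/em&gt; generates verbatim:&lt;/p&gt;

```python
# Classic setup.py equivalent of the setup.cfg metadata above (a sketch):
SETUP_KWARGS = dict(
    name="foobar",
    author="Dave Null",
    author_email="foobar@example.org",
    description="Package doing nifty stuff",
    license="MIT",
    url="http://pypi.python.org/pypi/foobar",
    classifiers=[
        "Development Status :: 4 - Beta",
        "Environment :: Console",
        "Programming Language :: Python",
    ],
    packages=["foobar"],
)

# In setup.py you would then call:
# import setuptools
# setuptools.setup(**SETUP_KWARGS)
```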
&lt;p&gt;&lt;em&gt;pbr&lt;/em&gt; also offers other features such as:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;automatic dependency installation based on &lt;code&gt;requirements.txt&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;automatic documentation building and generation using Sphinx&lt;/li&gt;
&lt;li&gt;automatic generation of &lt;code&gt;AUTHORS&lt;/code&gt; and &lt;code&gt;ChangeLog&lt;/code&gt; files based on &lt;em&gt;git&lt;/em&gt; history&lt;/li&gt;
&lt;li&gt;automatic creation of the list of files to include using &lt;em&gt;git&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;version management based on &lt;em&gt;git&lt;/em&gt; tags&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;All of this comes with little to no effort on your part.&lt;/p&gt;
&lt;h2&gt;Using flavors&lt;/h2&gt;
&lt;p&gt;One of the features that I use a lot is the definition of flavors. It&apos;s not tied particularly to &lt;em&gt;pbr&lt;/em&gt; – it&apos;s actually provided by &lt;em&gt;setuptools&lt;/em&gt; and &lt;em&gt;pip&lt;/em&gt; themselves – but the &lt;em&gt;pbr&lt;/em&gt; &lt;code&gt;setup.cfg&lt;/code&gt; file makes it easy to use.&lt;/p&gt;
&lt;p&gt;When distributing a piece of software, it&apos;s common to have different drivers for it. For example, your project could support both PostgreSQL and MySQL – but nobody is going to use both at the same time. The usual trick to make it work is to add the needed libraries to the requirements list (e.g. &lt;code&gt;requirements.txt&lt;/code&gt;). The upside is that the software will work directly with either RDBMS, but the downside is that this will install both libraries, whereas only one is needed. Using flavors, you can specify different scenarios:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;[extras]
postgresql =
    psycopg2
mysql =
    pymysql
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;When installing your package, the user can then just pick the right flavor by using &lt;em&gt;&lt;a href=&quot;https://pip.pypa.io/&quot;&gt;pip&lt;/a&gt;&lt;/em&gt; to install the package:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;$ pip install foobar[postgresql]
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This will install &lt;em&gt;foobar&lt;/em&gt;, all its dependencies listed in &lt;code&gt;requirements.txt&lt;/code&gt;, plus whatever dependencies are listed in the &lt;code&gt;[extras]&lt;/code&gt; section of &lt;code&gt;setup.cfg&lt;/code&gt; matching the flavor. You can also combine several flavors, e.g.:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;$ pip install foobar[postgresql,mysql]
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;would install both flavors.&lt;/p&gt;
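&lt;p&gt;Under the hood, the &lt;code&gt;[extras]&lt;/code&gt; section maps to the standard &lt;code&gt;extras_require&lt;/code&gt; keyword of &lt;em&gt;setuptools&lt;/em&gt;; without &lt;em&gt;pbr&lt;/em&gt;, the same flavors would be declared like this (a sketch):&lt;/p&gt;

```python
# Plain-setuptools equivalent of the [extras] section above:
EXTRAS_REQUIRE = {
    "postgresql": ["psycopg2"],
    "mysql": ["pymysql"],
}

# In setup.py:
# import setuptools
# setuptools.setup(name="foobar", extras_require=EXTRAS_REQUIRE)
```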
&lt;p&gt;&lt;em&gt;pbr&lt;/em&gt; is well-maintained and in very active development, so if you have any plans to distribute your software, you should seriously consider including &lt;em&gt;pbr&lt;/em&gt; in those plans.&lt;/p&gt;
</content:encoded></item><item><title>Attending OpenStack Summit Ocata</title><link>https://julien.danjou.info/blog/openstack-summit-ocata-barcelona-review/</link><guid isPermaLink="true">https://julien.danjou.info/blog/openstack-summit-ocata-barcelona-review/</guid><description>For the last time in 2016, I flew out to the OpenStack Summit in Barcelona, where I had the chance to meet (again) a lot of my fellow OpenStack contributors there.</description><pubDate>Mon, 31 Oct 2016 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;For the last time in 2016, I flew out to the &lt;a href=&quot;https://www.openstack.org/summit/barcelona-2016/&quot;&gt;OpenStack Summit in Barcelona&lt;/a&gt;, where I had the chance to meet (again) a lot of my fellow OpenStack contributors there.&lt;/p&gt;
&lt;h2&gt;How To Work Upstream with OpenStack&lt;/h2&gt;
&lt;p&gt;My week started by giving a talk about &lt;em&gt;How To Work Upstream with OpenStack&lt;/em&gt; where, accompanied by Ryota and Ashiq, I explained to the audience how to contribute upstream to OpenStack. It went well and was well received – you can watch the video below or download the slides.&lt;/p&gt;
&lt;h2&gt;Python 3 in telemetry projects&lt;/h2&gt;
&lt;p&gt;I&apos;ve attended a few interesting cross-project sessions, which helped me prioritize my work for the next few months.&lt;/p&gt;
&lt;p&gt;The Python 3 porting effort has been blocked for a while in Nova and Swift for various (mostly non-technical) reasons, while almost all other projects are working correctly. On the other hand, we have committed the telemetry projects to be the first ones to drop Python 2 support as soon as possible. The next steps are to make sure downstream is ready and to enable functional testing in devstack with Python 3.&lt;/p&gt;
&lt;h2&gt;Ceilometer deprecation&lt;/h2&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/gordon-gnocchi-talk.jpg&quot; alt=&quot;gordon-gnocchi-talk&quot; /&gt;&lt;/p&gt;
&lt;p&gt;The Ceilometer sessions were really interesting, as we mainly discussed deprecating and removing old cruft that is not or should not be used anymore. The main change will be the deprecation of the Ceilometer API. It has been clear for more than a year that &lt;a href=&quot;http://gnocchi.xyz&quot;&gt;Gnocchi&lt;/a&gt; is the way to go to store and provide access to metrics, but we failed at announcing it widely. A lot of the people I talked to during the summit were not aware that the Ceilometer API was not a good pick, and that Gnocchi was now the recommended storage backend. Bad communication from our side – but we are going to fix it as of now.&lt;/p&gt;
&lt;p&gt;We also committed to simplifying the current architecture by removing the collector, which has now been made obsolete by the agent-based architecture that was implemented during the last development cycles.&lt;/p&gt;
&lt;h2&gt;Aodh alarm timeout&lt;/h2&gt;
&lt;p&gt;We had a feature proposal in Aodh for a while that we postponed for too long already: having alarms triggered by a timeout when some expected events have not been seen. This seems to be a functionality requested by NFV users – something we want Aodh to cover.&lt;/p&gt;
&lt;p&gt;We spent some time discussing this feature, and now that we all have a clear understanding of the use case, we&apos;ll work on having a clear path to the implementation.&lt;/p&gt;
&lt;p&gt;I&apos;ve also attended a session with the &lt;a href=&quot;https://wiki.openstack.org/wiki/Vitrage&quot;&gt;Vitrage&lt;/a&gt; developers in order to discuss how we could work better together, as they rely on Aodh. It seems there might be some convergence in the future, which would be very welcome. Wait and see.&lt;/p&gt;
&lt;h2&gt;Gnocchi improvement, past and future&lt;/h2&gt;
&lt;p&gt;The Gnocchi session ran smoothly, and everyone seemed happy with the work we have done so far. We&apos;ve made some impressive improvements in Gnocchi 3.0 – as &lt;a href=&quot;https://julien.danjou.info/blog/2016/gnocchi-3.0-release&quot;&gt;I already covered previously&lt;/a&gt; – and Gordon Chung presented a short talk about the performance difference measured while working on this new version of Gnocchi:&lt;/p&gt;
&lt;p&gt;The return of the InfluxDB driver is on the table, as Sam Morrison proposed a patch for it a while back. While it&apos;s not as fast and scalable as the other drivers, it offers a good alternative for people who have to use it.&lt;/p&gt;
&lt;p&gt;Leandro Reox presented how to do capacity planning using Ceilometer and Gnocchi, presenting the projects at the same time:&lt;/p&gt;
&lt;p&gt;It is pretty impressive to see what they achieved with this project, and I&apos;m looking forward to being able to check how it works inside.&lt;/p&gt;
&lt;h2&gt;PTG and beyond&lt;/h2&gt;
&lt;p&gt;The next meeting is supposed to be the new &lt;a href=&quot;https://www.openstack.org/ptg/&quot;&gt;OpenStack PTG&lt;/a&gt; in February in Atlanta, though we did not request any specific space there. While the team loves seeing each other face-to-face every few months, we managed to follow &lt;a href=&quot;https://julien.danjou.info/blog/foss-projects-management-bad-practice&quot;&gt;all of the guidelines I listed recently&lt;/a&gt; on good open source project management, meaning we are able to work very well asynchronously and remotely. There is no need to put hard requirements on people wanting to participate in our community. Nevertheless, I expect the cross-project discussions that will happen there to still concern the OpenStack Telemetry projects.&lt;/p&gt;
&lt;p&gt;In the end, we&apos;re all very happy with our past and future roadmaps and I&apos;m looking forward to achieving our next big milestones with our amazing telemetry team!&lt;/p&gt;
</content:encoded></item><item><title>Running an open source and upstream oriented team in agile mode</title><link>https://julien.danjou.info/blog/opensource-upstream-team-agile/</link><guid isPermaLink="true">https://julien.danjou.info/blog/opensource-upstream-team-agile/</guid><description>For the last 3 years, I&apos;ve been working in the OpenStack Telemetry team at eNovance, and then at Red Hat. Our mission is to maintain the OpenStack Telemetry stack, both upstream and downstream (i.e.</description><pubDate>Tue, 18 Oct 2016 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;For the last 3 years, I&apos;ve been working in the OpenStack Telemetry team at &lt;a href=&quot;http://enovance.com&quot;&gt;eNovance&lt;/a&gt;, and then at &lt;a href=&quot;http://redhat.com&quot;&gt;Red Hat&lt;/a&gt;. Our mission is to maintain the OpenStack Telemetry stack, both upstream and downstream (i.e. inside Red Hat products). Besides the technical challenges, the organization of the team has always played a major role in our accomplishments.&lt;/p&gt;
&lt;p&gt;Here, I&apos;d like to share some of the hindsight I&apos;ve gained with you, faithful readers.&lt;/p&gt;
&lt;h2&gt;Meet the team&lt;/h2&gt;
&lt;p&gt;The team I work in changed a bit during those 3 years, but the core components have always been the same: a few software engineers, a QE engineer, a product owner, and an engineering manager. That means the team size has always been between 6 and 8 people.&lt;/p&gt;
&lt;p&gt;I cannot emphasize enough how important team size is. Not having more than 8 people in a team fits with the &lt;a href=&quot;https://www.fastcompany.com/3037542/productivity-hack-of-the-week-the-two-pizza-approach-to-productive-teamwork&quot;&gt;two-pizza rule from Jeff Bezos&lt;/a&gt;, which turned out to be key in our team composition.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/pizza-rule.jpg&quot; alt=&quot;pizza-rule&quot; /&gt;&lt;/p&gt;
&lt;p&gt;The group dynamic in teams no bigger than this is excellent. It offers the possibility to know and connect with everyone – each team member has only up to 7 people to talk to on a daily basis, which means only 28 communication paths between people. A team of e.g. 16 people means 120 different links in your team. Double your team size, and you multiply your communication overhead by 4. My experience shows that the fewer communication paths you have in a team, the less overhead you will have and the swifter your team will be.&lt;/p&gt;
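&lt;p&gt;As a quick sanity check of those figures, the number of distinct pairs in a team of &lt;em&gt;n&lt;/em&gt; people is n × (n − 1) / 2. A minimal Python illustration (the function name is mine, just for the example):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;def communication_links(n):
    &quot;&quot;&quot;Number of distinct pairs in a team of n people.&quot;&quot;&quot;
    return n * (n - 1) // 2

print(communication_links(8))   # 28 links
print(communication_links(16))  # 120 links
&lt;/code&gt;&lt;/pre&gt;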
&lt;p&gt;All team members being remote workers, it is even more challenging to build relationships and bond. We had the opportunity to see each other during the OpenStack summit twice a year, and doing regular video-conferences via &lt;a href=&quot;http://hangout.google.com&quot;&gt;Google Hangout&lt;/a&gt; or &lt;a href=&quot;http://bluejeans.com&quot;&gt;BlueJeans&lt;/a&gt; really helped.&lt;/p&gt;
&lt;p&gt;The atmosphere you set up in your team will also forge the outcome of your team&apos;s work. Run your team with trust, peace and humor (remember I&apos;m on the team 🤣) and awesome things will happen. Run your team with fear, pressure, and finger-pointing, and nothing good will happen.&lt;/p&gt;
&lt;p&gt;There&apos;s little chance that when a team is built, everyone will be on the same level. We were no exception: we had more and less experienced engineers. But the most experienced engineers took the time needed to invest in and mentor the less experienced ones. That also helped to build trust and communication links between members of the team. And over the long run, everyone gets more efficient: the less experienced engineers get better and the more experienced can delegate a lot of stuff to their fellows.&lt;/p&gt;
&lt;p&gt;Then they can chill or work on bigger stuff. Win-win.&lt;/p&gt;
&lt;p&gt;It&apos;s actually no different from the way you should run an open source team, as I already claimed in a &lt;a href=&quot;https://julien.danjou.info/blog/foss-projects-management-bad-practice&quot;&gt;previous article on FOSS projects management&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/everything-is-awesome.jpg&quot; alt=&quot;everything-is-awesome&quot; /&gt;&lt;/p&gt;
&lt;h2&gt;Practicing agility&lt;/h2&gt;
&lt;p&gt;I might be bad at practicing agility: contrary to many people, I don&apos;t see agility as a set of processes. I see it as a state of mind, as a team organization based on empowerment. No more, no less.&lt;/p&gt;
&lt;p&gt;And each time I meet people and explain that our team is &quot;agile&quot;, they start shivering, explaining how they hate sprints, daily stand-ups, scrum, and &lt;a href=&quot;https://en.wikipedia.org/wiki/Planning_poker&quot;&gt;planning poker&lt;/a&gt;, and that this is all a waste of time and energy.&lt;/p&gt;
&lt;p&gt;Well, it turns out that you can be agile without all of that.&lt;/p&gt;
&lt;h3&gt;Planning poker&lt;/h3&gt;
&lt;p&gt;In our team, we tried at first to run 2-weeks sprints and used planning poker to schedule our &lt;em&gt;&lt;a href=&quot;https://en.wikipedia.org/wiki/User_story&quot;&gt;user stories&lt;/a&gt;&lt;/em&gt; from our product backlog (= todo list). It never worked as expected.&lt;/p&gt;
&lt;p&gt;First, most people felt they were wasting their time because they already knew exactly what they were supposed to do. Had they any doubt, they would just have gone and talked to the product owner or another fellow engineer.&lt;/p&gt;
&lt;p&gt;Secondly, some stories were really specialized and only one team member was able to understand them in detail and evaluate them. So most of the time, a lot of the team members playing planning poker would just vote a random number based on the length of the story teller&apos;s explanation. For example, if an engineer said &quot;I just need to change that flag in the configuration file&quot; then everyone would vote 1. If they started rambling for 5 minutes about &quot;how the configuration option is easy to switch, but that there might be other things to change at the same time, and things to check for impact bigger than expected, and code refactoring to do&quot;, then most people would just announce a score of 13 on that &lt;em&gt;story&lt;/em&gt;. Just because the person talked for 5 minutes straight and everything sounded complicated and out of their scope.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/planningpoker.jpg&quot; alt=&quot;planningpoker&quot; /&gt;&lt;/p&gt;
&lt;p&gt;That meant that the poker score had no meaning to us. We never managed to have a number of points that we knew we could accomplish during a sprint (the &lt;em&gt;team velocity&lt;/em&gt; as they call it).&lt;/p&gt;
&lt;p&gt;The only benefit that we identified from planning poker, in our case, is that it forces people to sit down and communicate about a &lt;em&gt;user story&lt;/em&gt;. Though, it turned out that making people communicate was not a problem we needed to solve in our team, so we decided to stop doing it. But it can be a pretty good tool to get people talking to each other.&lt;/p&gt;
&lt;p&gt;Therefore, the 2-week sprint never made much sense, as we were unable to schedule our work reliably. Furthermore, doing most of our daily job in open source communities, we were unable to schedule anything. When sending patches to an upstream project, you have no clue when they will be reviewed. What you know for sure is that, in order to maximize your code merge throughput with this high latency of code review, you need to parallelize your patch submissions a lot. So as soon as you receive some feedback from your reviewers, you need to (almost) drop everything, rework your code and resubmit it.&lt;/p&gt;
&lt;p&gt;There&apos;s no need to explain why this absolutely does not work with a sprint approach. Most of the &lt;a href=&quot;https://en.wikipedia.org/wiki/Scrum_(software_development)&quot;&gt;scrum framework&lt;/a&gt; relies on the fact that you own your workflow from top to bottom, which is far from being true when working in open source communities.&lt;/p&gt;
&lt;h3&gt;Daily stand-up meetings&lt;/h3&gt;
&lt;p&gt;We used to run a stand-up meeting every day, then every other day. Doing that remotely kills the stand-up part, obviously, so there is less guarantee the meeting will be short. Considering all team members are working remotely in different time zones, with some freedom to organize their schedule, it was very difficult to synchronize those meetings. With members spread from the US to Eastern Europe, the meeting was in the middle of the afternoon for me. I found it frustrating to have to stop my activities in the middle of every afternoon to chat with my team. We all know the cost of context switching for us humans.&lt;/p&gt;
&lt;p&gt;So we drifted from our 10 minutes daily meeting to a one-hour weekly meeting with the whole team. It&apos;s way easier to synchronize for a large chunk of time once a week and to have this high-throughput communication channel.&lt;/p&gt;
&lt;h2&gt;Our (own) agile framework&lt;/h2&gt;
&lt;p&gt;Drifting from the original scrum implementation, we ended up running our own agility framework. It turned out to have similarity with &lt;a href=&quot;https://en.wikipedia.org/wiki/Kanban_(development)&quot;&gt;kanban&lt;/a&gt; – you don&apos;t always have to invent new things!&lt;/p&gt;
&lt;p&gt;Our main support is a &lt;a href=&quot;http://trello.com&quot;&gt;Trello board&lt;/a&gt; that we share with the whole team. It consists of different columns, where we put cards representing small user stories or simple to-do items. Each column represents the state of a card, and we move cards from left to right:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;em&gt;Ideas&lt;/em&gt;: where we put things we&apos;d like to do or dig into, but there&apos;s no urgency. It might lead to new, smaller ideas, in the &quot;To Do&quot; column.&lt;/li&gt;
&lt;li&gt;&lt;em&gt;To Do&lt;/em&gt;: where we put real things we need to do. We might run a grooming session with our product manager if we need help prioritizing things, but it&apos;s usually not necessary.&lt;/li&gt;
&lt;li&gt;&lt;em&gt;Epic&lt;/em&gt;: here we create a few bigger cards that regroup several &lt;em&gt;To Do&lt;/em&gt; items. We don&apos;t move them around, we just archive them when they are fully implemented. There are only 5-6 big cards here at max, which are the long term goals we work on.&lt;/li&gt;
&lt;li&gt;&lt;em&gt;Doing&lt;/em&gt;: where we move cards from &lt;em&gt;To Do&lt;/em&gt; when we start doing them. At this stage, we also add the people working on the task to the card, so we see the little faces of those involved.&lt;/li&gt;
&lt;li&gt;&lt;em&gt;Under review&lt;/em&gt;: 90% of our job being done upstream, we usually move cards that are done and waiting for feedback from the community to this column. When the patches are approved and the card is complete, we move the card to &lt;em&gt;Done&lt;/em&gt;. If a patch needs further improvement, we move the card back to &lt;em&gt;Doing&lt;/em&gt;, work on it, and then move it back to &lt;em&gt;Under review&lt;/em&gt; when resubmitted.&lt;/li&gt;
&lt;li&gt;&lt;em&gt;On hold / blocked&lt;/em&gt;: some of the tasks we work on might be blocked by external factors. We move cards there to keep track of them.&lt;/li&gt;
&lt;li&gt;&lt;em&gt;Done during week #XX&lt;/em&gt;: we create a new list every Monday to stack our done cards by week. This is just easier to display, and it allows us to see the cards that we completed each week. We archive lists older than a month from time to time. It gives great visual feedback on what has been accomplished and merged every week.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/telemetry-trello.jpg&quot; alt=&quot;telemetry-trello&quot; /&gt;&lt;/p&gt;
&lt;p&gt;We started to automate some of our Trello workflow in a tool called &lt;a href=&quot;https://github.com/jd/trelloha&quot;&gt;Trelloha&lt;/a&gt;. For example, it allows us to track upstream patches sent through Gerrit or GitHub and tick the checkbox items in any card when those are merged.&lt;/p&gt;
&lt;p&gt;We actually don&apos;t put much effort into our Trello board. It&apos;s just a slightly organized chaos, as are upstream projects. We use it as a lightweight system for taking notes, organizing our thoughts and letting others know what we&apos;re doing and why we&apos;re doing it. That&apos;s where Trello is wonderful, because using it has very low friction: creating, updating and moving a card is a one-click operation.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;One bias of most engineers is to overthink and over-engineer their workflow, trying to rationalize it. Most of the time, they end up automating everything, which means building processes and bureaucracy. It just slows things down and builds frustration in everyone. Just embrace chaos and spend time on what matters.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Most of the things we do are linked to external Launchpad bugs, Gerrit reviews or GitHub issues. That means the cards in Trello carry very little information, as everything happens outside, in the wild Internet of open source communities.&lt;/p&gt;
&lt;p&gt;This is very important, as we need to avoid withholding any kind of knowledge or information from contributors outside the company. This also makes sure that our internal way of running things does not leak outside and (badly) influence outside communities.&lt;/p&gt;
&lt;h3&gt;Retrospectives&lt;/h3&gt;
&lt;p&gt;We also run a retrospective every 2 weeks, which might be the only thing we kept from the scrum practice. It&apos;s actually a good opportunity for us to share our feelings, concerns or jokes. We used to do it using the &lt;a href=&quot;http://retrospectivewiki.org/index.php?title=6_Thinking_Hats_Retrospective&quot;&gt;&lt;em&gt;six thinking hats&lt;/em&gt;&lt;/a&gt; method, but it slowly faded away. In the end, we now use a different Trello board with those columns:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Good 😄&lt;/li&gt;
&lt;li&gt;Hopes and Wishes 🎁&lt;/li&gt;
&lt;li&gt;Puzzles and Challenges 🌊&lt;/li&gt;
&lt;li&gt;To improve 😡&lt;/li&gt;
&lt;li&gt;Action Items 🤘&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;All teammates fill the board with the cards they want, and everyone is free to add themselves to any card. We then run through each card and let the people who added their name to it talk about it. The &quot;Action Items&quot; column is usually filled as we speak and discover things we should do. We can then move cards created there to our regular board, in the &lt;em&gt;To Do&lt;/em&gt; column.&lt;/p&gt;
&lt;h3&gt;Central communication&lt;/h3&gt;
&lt;p&gt;Sure, people have different roles in a team, but we dislike bottlenecks and single points of failure. Therefore, we are using an internal mailing list where we ask people to send their requests and messages. If people send things related to our team&apos;s job to one of us personally, we just forward or Cc the list when replying, so everyone is aware of what one might be discussing with people external to the team.&lt;/p&gt;
&lt;p&gt;This is very important, as it emphasizes that no team member should be considered &lt;em&gt;special&lt;/em&gt;. Nobody owns more information and knowledge than the others, and anybody can jump into a conversation if they have valuable knowledge to share.&lt;/p&gt;
&lt;p&gt;The same applies to our internal IRC channel.&lt;/p&gt;
&lt;p&gt;We also make sure that we discuss only company-specific things on this list or on our internal IRC channel. Everything that can be public and is related to upstream is discussed on external communication mediums (IRC, upstream mailing lists, etc.). This is very important to make sure that we are not blocking anybody outside Red Hat from joining us and contributing to the projects or ideas we work on. We also want to make sure that people working in our company are no more special than other contributors.&lt;/p&gt;
&lt;h2&gt;Improvement&lt;/h2&gt;
&lt;p&gt;We&apos;re pretty happy with our set-up right now, and the team has been running pretty smoothly for a few months. We&apos;re still trying to improve, and having a general sense of trust among team members makes sure we can openly speak about whatever problem we might have.&lt;/p&gt;
&lt;p&gt;Feel free to share your feedback and own experience of running your own teams in the comment section.&lt;/p&gt;
</content:encoded></item><item><title>Gnocchi 3.0 release</title><link>https://julien.danjou.info/blog/gnocchi-3-0-release/</link><guid isPermaLink="true">https://julien.danjou.info/blog/gnocchi-3-0-release/</guid><description>After a few weeks of hard work with the team, here is the new major version of Gnocchi, stamped 3.0.0. It was very challenging, as we wanted to implement a few big changes in it.</description><pubDate>Mon, 03 Oct 2016 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;After a few weeks of hard work with the team, here is the new major version of Gnocchi, stamped &lt;a href=&quot;https://launchpad.net/gnocchi/3.0/3.0.0&quot;&gt;3.0.0&lt;/a&gt;. It was very challenging, as we wanted to implement a few big changes in it.&lt;/p&gt;
&lt;p&gt;Gnocchi is now using &lt;a href=&quot;http://docs.openstack.org/developer/reno/&quot;&gt;reno&lt;/a&gt; to its maximum and you can read &lt;a href=&quot;http://gnocchi.xyz/releasenotes/3.0.html&quot;&gt;the release notes of the 3.0 branch&lt;/a&gt; online. Some notes might be missing as it is our first release with it, but we are making good progress at writing changelogs for most of our user facing and impacting changes.&lt;/p&gt;
&lt;p&gt;Therefore, I&apos;ll only write here about our big major feature that made us bump the major version number.&lt;/p&gt;
&lt;h2&gt;New storage engine&lt;/h2&gt;
&lt;p&gt;And so the most interesting thing that went into the 3.0 release is the new storage engine that Gordon Chung and I built during these last months. The original approach to writing data in Gnocchi was really naive, so we went through an iterative improvement process since version 1.0, and we&apos;re getting close to something very solid.&lt;/p&gt;
&lt;p&gt;This new version leverages several important features which increase performance by a large factor on Ceph (using &lt;code&gt;write(offset)&lt;/code&gt; rather than &lt;code&gt;read()+write()&lt;/code&gt; to append new points), our recommended back-end.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/gnocchi3_processtime_readwrite_vs_offset.png&quot; alt=&quot;gnocchi3_processtime_readwrite_vs_offset&quot; /&gt;&lt;/p&gt;
&lt;p&gt;To summarize, since most data points are sent sequentially and ordered, we enhanced the data format to take advantage of that fact, so that new points can be appended without reading anything back. That only works on Ceph though, which provides the needed features.&lt;/p&gt;
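&lt;p&gt;To illustrate the idea with a simplified sketch (this is not Gnocchi&apos;s actual storage code): if points are fixed-size and arrive in order, the write offset can be computed from the number of points already stored, so appending never requires a read()+write() cycle:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import os
import struct

POINT = struct.Struct(&quot;&lt;dd&quot;)  # fixed-size record: (timestamp, value)


def append_point(path, index, timestamp, value):
    &quot;&quot;&quot;Write the index-th point at its computed offset,
    without reading the existing data back first.&quot;&quot;&quot;
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o644)
    try:
        os.pwrite(fd, POINT.pack(timestamp, value), index * POINT.size)
    finally:
        os.close(fd)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Ceph&apos;s &lt;code&gt;write(offset)&lt;/code&gt; on an object plays the role of &lt;code&gt;pwrite()&lt;/code&gt; in this sketch.&lt;/p&gt;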
&lt;p&gt;We also enabled data compression on all storage drivers by enabling LZ4 compression (&lt;a href=&quot;https://julien.danjou.info/blog/gnocchi-carbonara-timeseries-compression&quot;&gt;see my previous article and research on the subject&lt;/a&gt;), which obviously offers its own set of challenges when using append-only write. The results are tremendous and decrease data usage by a huge factor:&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/gnocchi3_disksize.png&quot; alt=&quot;gnocchi3_disksize&quot; /&gt;&lt;/p&gt;
&lt;p&gt;The rest of the processing pipeline also has been largely improved:&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/gnocchi3_processtime_post.png&quot; alt=&quot;gnocchi3_processtime_post&quot; /&gt;&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/gnocchi3_processtime_compress_offset.png&quot; alt=&quot;gnocchi3_processtime_compress_offset&quot; /&gt;&lt;/p&gt;
&lt;p&gt;Overall, we&apos;re delighted with the performance improvement we achieved, and we&apos;re looking forward to making even more progress. Gnocchi is now one of the most performant and scalable timeseries databases out there.&lt;/p&gt;
&lt;h2&gt;Upcoming challenges&lt;/h2&gt;
&lt;p&gt;With that big change done, we&apos;re now heading toward a set of more lightweight improvements. Our &lt;a href=&quot;https://bugs.launchpad.net/gnocchi&quot;&gt;bug tracker&lt;/a&gt; is a good place to learn what might be on our mind (check for the &lt;em&gt;wishlist&lt;/em&gt; bugs).&lt;/p&gt;
&lt;p&gt;Improving our API features and offering a better experience for those coming from outside the realm of OpenStack are now at the top of my priority list.&lt;/p&gt;
&lt;p&gt;But let me know if there&apos;s anything itching you, obviously. 😎&lt;/p&gt;
</content:encoded></item><item><title>AsciiDoc book toolchain released</title><link>https://julien.danjou.info/blog/asciidoc-book-toolchain-released/</link><guid isPermaLink="true">https://julien.danjou.info/blog/asciidoc-book-toolchain-released/</guid><description>Writing a book is a big undertaking. You have to think about what you will actually write, the content, its organization, the examples you want to show, illustrations, etc.  When publishing with the h</description><pubDate>Tue, 20 Sep 2016 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Writing a book is a big undertaking. You have to think about what you will actually write, the content, its organization, the examples you want to show, illustrations, etc.&lt;/p&gt;
&lt;p&gt;When publishing with the help of a regular editor, your job stops at writing – and that&apos;s already a big and hard enough task. Your editor will handle the publishing process, leaving you free of the printing task. Though they might have their own set of requirements, such as making you work with a word processing tool (think LibreOffice Writer or Microsoft Word).&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/kindle-thgtp-real.jpg&quot; alt=&quot;kindle-thgtp-real&quot; /&gt;&lt;/p&gt;
&lt;p&gt;When you self-publish like I did with &lt;a href=&quot;https://thehackerguidetopython.com&quot;&gt;The Hacker&apos;s Guide to Python&lt;/a&gt;, none of that happens. You have to deal yourself with getting your work out there, released and available in a viable format for your readership.&lt;/p&gt;
&lt;p&gt;Most of the time, you need to render your book in different formats. You will have to make sure it works correctly on different devices and that the formatting and content disposition is correct.&lt;/p&gt;
&lt;p&gt;I knew exactly what I wanted when writing my book. I wanted to have the book published in at least PDF (for computer reading) and ePub (for e-readers). I also knew, as an Emacs user, that I did not want to spend hours writing a book in LibreOffice. It&apos;s not for me.&lt;/p&gt;
&lt;p&gt;When I wrote about the &lt;a href=&quot;https://julien.danjou.info/blog/2014/making-of-the-hacker-guide-to-python&quot;&gt;making of The Hacker&apos;s Guide to Python&lt;/a&gt;, I briefly mentioned which tools I used to build the book and that I picked &lt;a href=&quot;http://www.methods.co.nz/asciidoc/&quot;&gt;AsciiDoc&lt;/a&gt; as the input format. It makes it easy to write your book inside your favorite text editor, and AsciiDoc has plenty of output formats. Customizing these formats to my liking and requirements was another challenge.&lt;/p&gt;
&lt;p&gt;It took me hours and hours of work to have all the nitty-gritty details right. Today I am happy to announce that I can save you a few hours of work if you also want to publish a book.&lt;/p&gt;
&lt;p&gt;I&apos;ve published a new project on &lt;a href=&quot;https://github.com/jd&quot;&gt;my GitHub&lt;/a&gt; called &lt;a href=&quot;https://github.com/jd/asciidoc-book-toolchain&quot;&gt;asciidoc-book-toolchain&lt;/a&gt;. It is the actual toolchain that I use to build &lt;a href=&quot;https://thehackerguidetopython.com&quot;&gt;The Hacker&apos;s Guide to Python&lt;/a&gt;. It should be easy to use and is able to render any book in HTML, PDF, PDF (printable 6&quot;×9&quot; format), ePub and MOBI.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/asciidoc-book-toolchain.dot.png&quot; alt=&quot;asciidoc-book-toolchain.dot&quot; /&gt;&lt;/p&gt;
&lt;p&gt;So feel free to use it, hack it, pull-request it, or whatever. You don&apos;t have any good excuse to not write a book now! 😇 And if you want to self-publish a book and need some help getting started, let me know, I would be glad to give you a few hints!&lt;/p&gt;
</content:encoded></item><item><title>From decimal to timestamp with MySQL</title><link>https://julien.danjou.info/blog/python-sqlalchemy-from-decimal-to-timestamp/</link><guid isPermaLink="true">https://julien.danjou.info/blog/python-sqlalchemy-from-decimal-to-timestamp/</guid><description>When working with timestamps, one question that often arises is the precision of those timestamps. Most software is good enough with a precision up to the second, and that&apos;s easy. But in some cases, l</description><pubDate>Thu, 08 Sep 2016 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;When working with timestamps, one question that often arises is the precision of those timestamps. Most software is good enough with a precision up to the second, and that&apos;s easy. But in some cases, like working on metering, a finer precision is required.&lt;/p&gt;
&lt;p&gt;I don&apos;t know exactly why, and it makes me suffer every day, but &lt;a href=&quot;http://openstack.org&quot;&gt;OpenStack&lt;/a&gt; is really tied to &lt;a href=&quot;http://mysql.com&quot;&gt;MySQL&lt;/a&gt; (and its clones). It hurts because MySQL is a very poor solution if you want to leverage your database to actually solve problems. But that&apos;s how life is, unfair. And in the context of the projects I work on, that boils down to the fact that we can&apos;t afford not to support MySQL.&lt;/p&gt;
&lt;p&gt;So here we are, needing to work with MySQL and at the same time requiring timestamp with a finer precision than just seconds. And guess what: MySQL did not support that until 2011.&lt;/p&gt;
&lt;h2&gt;No microseconds in MySQL? No problem: DECIMAL!&lt;/h2&gt;
&lt;p&gt;MySQL 5.6.4 (released in 2011), a beta version of MySQL 5.6 (hello MySQL, ever heard of &lt;a href=&quot;http://semver.org&quot;&gt;Semantic Versioning&lt;/a&gt;?), brought microsecond precision to timestamps. But the first stable version supporting that, MySQL 5.6.10, was only released in 2013. So for a long time, there was a problem without any solution.&lt;/p&gt;
&lt;p&gt;The obvious workaround, in this case, is to reassess your choices in technologies, discover that &lt;a href=&quot;https://www.postgresql.org/docs/7.1/static/datatype-datetime.html&quot;&gt;PostgreSQL has supported microsecond precision for at least a decade&lt;/a&gt;, and problem solved.&lt;/p&gt;
&lt;p&gt;This is not what happened in our case, and in order to support MySQL, a workaround had to be found. And so it was in our &lt;a href=&quot;http://launchpad.net/ceilometer&quot;&gt;Ceilometer&lt;/a&gt; project, using a &lt;a href=&quot;https://dev.mysql.com/doc/refman/5.7/en/precision-math-decimal-characteristics.html&quot;&gt;&lt;code&gt;DECIMAL&lt;/code&gt;&lt;/a&gt; type instead of &lt;code&gt;DATETIME&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;DECIMAL&lt;/code&gt; type takes 2 arguments: the total number of digits you need to store, and how many of that total will be used for the fractional part. Knowing that the internal storage of MySQL uses 1 byte for 2 digits, 2 bytes for 4 digits, 3 bytes for 6 digits and 4 bytes for 9 digits, and that each part is stored independently, in order to maximize your storage space you want to pick numbers of digits that fit those boundaries.&lt;/p&gt;
&lt;p&gt;This is why Ceilometer picked 14 for the integer part (9 digits on 4 bytes and 5 digits on 3 bytes) and 6 for the decimal part (3 bytes).&lt;/p&gt;
&lt;p&gt;Wait. It&apos;s stupid because:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;code&gt;DECIMAL(20, 6)&lt;/code&gt; implies that you use 14 digits for the integer part, which, using epoch as a reference, makes you able to encode timestamps up to &lt;code&gt;(10^14) - 1&lt;/code&gt;, which is year 3170843. I am certain Ceilometer won&apos;t last that far.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;14 digits is 9 + 5 digits in MySQL, which is 7 bytes, the same size that is used for 9 + 6 digits. So you could have &lt;code&gt;DECIMAL(21, 6)&lt;/code&gt; for the same storage space (and go up to year 31690708, which is a nice bonus, right?)&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Well, I guess the original author of the patch did not read the documentation entirely (&lt;code&gt;DECIMAL(20, 6)&lt;/code&gt; being on the MySQL documentation page as an example, I imagine it was just copy-pasted blindly?).&lt;/p&gt;
&lt;p&gt;The best choice for this use case would have been &lt;code&gt;DECIMAL(17, 6)&lt;/code&gt;, which would allow storing 11 digits for the integer part (5 bytes), supporting timestamps up to &lt;code&gt;(10^11)-1&lt;/code&gt; (year 5138), and 6 digits for the decimal part (3 bytes), using only 8 bytes in total per timestamp.&lt;/p&gt;
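&lt;p&gt;The digits-to-bytes rule can be encoded in a small helper to double-check those numbers (a sketch based on MySQL&apos;s documented storage format; the function name is mine):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;def decimal_part_bytes(digits):
    &quot;&quot;&quot;Bytes used by one part (integer or fractional) of a MySQL DECIMAL.

    Each full group of 9 digits takes 4 bytes; leftover digits take
    1 byte per 2 digits, rounded up.
    &quot;&quot;&quot;
    leftover_bytes = [0, 1, 1, 2, 2, 3, 3, 4, 4]
    groups, leftover = divmod(digits, 9)
    return groups * 4 + leftover_bytes[leftover]

# DECIMAL(20, 6): 14 integer digits + 6 fractional digits
print(decimal_part_bytes(14) + decimal_part_bytes(6))  # 10 bytes
# DECIMAL(17, 6): 11 integer digits + 6 fractional digits
print(decimal_part_bytes(11) + decimal_part_bytes(6))  # 8 bytes
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Note that &lt;code&gt;DECIMAL(21, 6)&lt;/code&gt; gives the same 10 bytes as &lt;code&gt;DECIMAL(20, 6)&lt;/code&gt;, which is the point made above.&lt;/p&gt;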
&lt;p&gt;Nonetheless, this workaround has been implemented using a &lt;a href=&quot;http://sqlalchemy.org&quot;&gt;SQLAlchemy&lt;/a&gt; custom type and works as expected:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import sqlalchemy
import sqlalchemy.types


class PreciseTimestamp(sqlalchemy.types.TypeDecorator):
    &quot;&quot;&quot;Represents a timestamp precise to the microsecond.&quot;&quot;&quot;

    impl = sqlalchemy.DateTime

    def load_dialect_impl(self, dialect):
        if dialect.name == &apos;mysql&apos;:
            return dialect.type_descriptor(
                sqlalchemy.types.DECIMAL(precision=20,
                                         scale=6,
                                         asdecimal=True))
        return dialect.type_descriptor(self.impl)
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Microseconds in MySQL? Damn, migration!&lt;/h2&gt;
&lt;p&gt;As I said, MySQL 5.6.4 brought microseconds precision to the table (pun intended). Therefore, it&apos;s a great time to migrate away from this hackish format to the brand new one.&lt;/p&gt;
&lt;p&gt;First, be aware that the default &lt;code&gt;DATETIME&lt;/code&gt; type has no microseconds precision: &lt;a href=&quot;http://dev.mysql.com/doc/refman/5.7/en/datetime.html&quot;&gt;you have to specify how many digits you want as an argument&lt;/a&gt;.&lt;br /&gt;
To support microseconds, you should therefore use &lt;code&gt;DATETIME(6)&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;If we were using a great RDBMS, let&apos;s say, hum, PostgreSQL, we could do that&lt;br /&gt;
very easily, see:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;postgres=# CREATE TABLE foo (mytime decimal);
CREATE TABLE
postgres=# \d foo
      Table &quot;public.foo&quot;
 Column │  Type   │ Modifiers
────────┼─────────┼───────────
 mytime │ numeric │
postgres=# INSERT INTO foo (mytime) VALUES (1473254401.234);
INSERT 0 1
postgres=# ALTER TABLE foo ALTER COLUMN mytime SET DATA TYPE timestamp with time zone USING to_timestamp(mytime);
ALTER TABLE
postgres=# \d foo
              Table &quot;public.foo&quot;
 Column │           Type           │ Modifiers
────────┼──────────────────────────┼───────────
 mytime │ timestamp with time zone │

postgres=# select * from foo;
           mytime
────────────────────────────
 2016-09-07 13:20:01.234+00
(1 row)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;And since this is a pretty common use case, it&apos;s even &lt;a href=&quot;https://www.postgresql.org/docs/9.5/static/sql-altertable.html&quot;&gt;an example in the PostgreSQL documentation&lt;/a&gt;. The version from the documentation uses a calculation based on epoch, whereas my example here leverages the &lt;code&gt;to_timestamp()&lt;/code&gt; function. That&apos;s my personal touch.&lt;/p&gt;
&lt;p&gt;Obviously, doing this conversion in a single line is not possible with MySQL: it does not implement the &lt;code&gt;USING&lt;/code&gt; keyword on &lt;code&gt;ALTER TABLE … ALTER COLUMN&lt;/code&gt;. So what&apos;s the solution gonna be? Well, it&apos;s a 4-step job:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Create a new column of type &lt;code&gt;DATETIME(6)&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Copy data from the old column to the new column, converting them to the new format&lt;/li&gt;
&lt;li&gt;Delete the old column&lt;/li&gt;
&lt;li&gt;Rename the new column to the old column name.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;But I know what you&apos;re thinking: there are 4 steps, but that&apos;s not a problem, we&apos;ll just use a transaction and embed these operations inside.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;http://dev.mysql.com/doc/refman/5.7/en/cannot-roll-back.html&quot;&gt;MySQL does not support transactions on data definition language (DDL)&lt;/a&gt;.&lt;br /&gt;
So if any of those steps fails, you&apos;ll be unable to roll back steps 1, 3 and 4. Who knew that using MySQL was like living on the edge, right?&lt;/p&gt;
&lt;h2&gt;Doing this in Python with our friend Alembic&lt;/h2&gt;
&lt;p&gt;I like &lt;a href=&quot;http://alembic.zzzcomputing.com/&quot;&gt;Alembic&lt;/a&gt;. It&apos;s a Python library based on &lt;a href=&quot;http://sqlalchemy.org&quot;&gt;SQLAlchemy&lt;/a&gt; that handles schema migration for your favorite RDBMS.&lt;/p&gt;
&lt;p&gt;Once you&apos;ve created a new Alembic migration script using &lt;code&gt;alembic revision&lt;/code&gt;, it&apos;s time to edit it and write something along these lines:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;from alembic import op
import sqlalchemy as sa
from sqlalchemy.dialects import mysql
from sqlalchemy.sql import func

class Timestamp(sa.types.TypeDecorator):
    &quot;&quot;&quot;Represents a timestamp precise to the microsecond.&quot;&quot;&quot;

    impl = sa.DateTime

    def load_dialect_impl(self, dialect):
        if dialect.name == &apos;mysql&apos;:
            return dialect.type_descriptor(mysql.DATETIME(fsp=6))
        return self.impl

def upgrade():
    bind = op.get_bind()
    if bind and bind.engine.name == &quot;mysql&quot;:
        existing_type = sa.types.DECIMAL(
            precision=20, scale=6, asdecimal=True)
        existing_col = sa.Column(&quot;mytime&quot;, existing_type, nullable=False)
        temp_col = sa.Column(&quot;mytime_ts&quot;, Timestamp(), nullable=False)
        # Step 1: ALTER TABLE mytable ADD COLUMN mytime_ts DATETIME(6)
        op.add_column(&quot;mytable&quot;, temp_col)
        t = sa.sql.table(&quot;mytable&quot;, existing_col, temp_col)
        # Step 2: UPDATE mytable SET mytime_ts=from_unixtime(mytime)
        op.execute(t.update().values(mytime_ts=func.from_unixtime(existing_col)))
        # Step 3: ALTER TABLE mytable DROP COLUMN mytime
        op.drop_column(&quot;mytable&quot;, &quot;mytime&quot;)
        # Step 4: ALTER TABLE mytable CHANGE mytime_ts mytime
        # Note: MySQL needs to have all the old/new information to just rename a column…
        op.alter_column(&quot;mytable&quot;,
                        &quot;mytime_ts&quot;,
                        nullable=False,
                        type_=Timestamp(),
                        existing_nullable=False,
                        existing_type=existing_type,
                        new_column_name=&quot;mytime&quot;)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In MySQL, the function to convert a UNIX timestamp to a datetime is &lt;code&gt;from_unixtime()&lt;/code&gt;, so the script leverages it to convert the data. As said, you&apos;ll notice we don&apos;t bother with any kind of transaction, so if anything goes wrong, there&apos;s no rollback, and it won&apos;t be possible to re-run the migration without manual intervention.&lt;/p&gt;
&lt;p&gt;&lt;code&gt;Timestamp&lt;/code&gt; is a custom class that implements &lt;code&gt;sqlalchemy.DateTime&lt;/code&gt; using a &lt;code&gt;DATETIME(6)&lt;/code&gt; type for MySQL, and a regular &lt;code&gt;sqlalchemy.DateTime&lt;/code&gt; type for other back-ends. It is normally used by the rest of the code (e.g. the ORM model), but I&apos;ve pasted it into this example for a better understanding.&lt;/p&gt;
&lt;p&gt;Once written, you can easily test your migration using &lt;a href=&quot;https://github.com/jd/pifpaf&quot;&gt;&lt;em&gt;pifpaf&lt;/em&gt;&lt;/a&gt; to run a temporary database:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;$ pifpaf run mysql $SHELL
$ alembic -c alembic/alembic.ini upgrade 1c98ac614015 # upgrade to the initial revision
$ mysql -S $PIFPAF_MYSQL_SOCKET pifpaf
mysql&amp;gt; INSERT INTO mytable (mytime) VALUES (1325419200.213000);
Query OK, 1 row affected (0.00 sec)

mysql&amp;gt; SELECT * FROM mytable;
+-------------------+
| mytime            |
+-------------------+
| 1325419200.213000 |
+-------------------+
1 row in set (0.00 sec)

$ alembic -c alembic/alembic.ini upgrade head

$ mysql -S $PIFPAF_MYSQL_SOCKET pifpaf
mysql&amp;gt; SELECT * FROM mytable;
+----------------------------+
| mytime                     |
+----------------------------+
| 2012-01-01 13:00:00.213000 |
+----------------------------+
1 row in set (0.00 sec)
&lt;/code&gt;&lt;/pre&gt;
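&lt;p&gt;An aside in case you wonder why &lt;code&gt;1325419200.213000&lt;/code&gt; comes back as 13:00 rather than 12:00: &lt;code&gt;from_unixtime()&lt;/code&gt; converts using the MySQL &lt;em&gt;session&lt;/em&gt; time zone (a UTC+1 offset for this January instant in my case). A quick Python sanity check of the UTC value:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;from datetime import datetime, timezone

# The value inserted in the test session above
epoch = 1325419200.213000

# Interpreted as UTC, this epoch is 2012-01-01 12:00:00.213
utc = datetime.fromtimestamp(epoch, tz=timezone.utc)
print(utc)  # 2012-01-01 12:00:00.213000+00:00
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;If you want the stored datetimes to be interpreted as UTC, set the session time zone first, e.g. &lt;code&gt;SET time_zone = &apos;+00:00&apos;&lt;/code&gt;.&lt;/p&gt;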
&lt;p&gt;And voilà, we just unsafely migrated our data to a fancy new format. Thank you Alembic for solving a problem we would not have without MySQL. 😊&lt;/p&gt;
</content:encoded></item><item><title>A retrospective of the OpenStack Telemetry project Newton cycle</title><link>https://julien.danjou.info/blog/retrospective-openstack-telemetry-newton/</link><guid isPermaLink="true">https://julien.danjou.info/blog/retrospective-openstack-telemetry-newton/</guid><description>A few weeks ago, I recorded an interview with Krishnan Raghuram about what was discussed for this development cycle for OpenStack Telemetry at the Austin summit.  It&apos;s interesting to look back at this</description><pubDate>Mon, 05 Sep 2016 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;A few weeks ago, I recorded an interview with Krishnan Raghuram about what was discussed for this development cycle for OpenStack Telemetry at the Austin summit.&lt;/p&gt;
&lt;p&gt;It&apos;s interesting to look back at this video more than 3 months after recording it, and see what actually happened to Telemetry. It turns out that some of the things I thought were going to happen have not happened yet. As the first release candidate version is approaching, it&apos;s very unlikely they will.&lt;/p&gt;
&lt;p&gt;And on the other side, some new fancy features arrived suddenly without me having a clue about them.&lt;/p&gt;
&lt;p&gt;As far as &lt;strong&gt;Ceilometer&lt;/strong&gt; is concerned, here&apos;s the list of what really happened in terms of user features:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Added full support for SNMP v3 USM model&lt;/li&gt;
&lt;li&gt;Added support for batch measurement in Gnocchi dispatcher&lt;/li&gt;
&lt;li&gt;Set ended_at timestamp in Gnocchi dispatcher&lt;/li&gt;
&lt;li&gt;Allow Swift pollster to specify regions&lt;/li&gt;
&lt;li&gt;Add L3 cache usage and memory bandwidth meters&lt;/li&gt;
&lt;li&gt;Split out the event code (REST API and storage) to a new &lt;strong&gt;Panko&lt;/strong&gt; project&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;And a few other minor things. I planned none of them except Panko (which I was responsible for), and the ones we did plan (documentation update, pipeline rework and polling enhancement) have not happened yet.&lt;/p&gt;
&lt;p&gt;For &lt;strong&gt;Aodh&lt;/strong&gt;, we expected to rework the documentation entirely too, and that did not happen either. What we did instead:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Deprecate and disable combination alarms&lt;/li&gt;
&lt;li&gt;Add pagination support in REST API&lt;/li&gt;
&lt;li&gt;Deprecate all non-SQL database stores and provide a migration tool&lt;/li&gt;
&lt;li&gt;Support batch notification for aodh-notifier&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;It&apos;s definitely a good list of new features for Aodh: still small, but simplifying the project, removing technical debt and continuing to build momentum around it.&lt;/p&gt;
&lt;p&gt;For &lt;strong&gt;Gnocchi&lt;/strong&gt;, we really had no plan, except maybe a few small features (they&apos;re usually tracked in the Launchpad bug list). It turned out Gordon Chung and I had a fancy new idea on how to boost our storage engine, so we worked on that. It kept us busy for a few weeks in the end, though the preliminary results look tremendous – so it was definitely worth it. We also have an AWS S3 storage driver on its way.&lt;/p&gt;
&lt;p&gt;I find this exercise interesting, as it really emphasizes how you can&apos;t really control what&apos;s happening in any open source project, where your contributors come and go and work on their own agenda.&lt;/p&gt;
&lt;p&gt;That does not mean we&apos;re dropping the themes and ideas I&apos;ve laid out in that video. We&apos;re still pushing our &quot;documentation is mandatory&quot; policy and improving our &quot;work by default&quot; scenario. It&apos;s just a longer road than we expected.&lt;/p&gt;
</content:encoded></item><item><title>The definitive guide to Python exceptions</title><link>https://julien.danjou.info/blog/python-exceptions-guide/</link><guid isPermaLink="true">https://julien.danjou.info/blog/python-exceptions-guide/</guid><description>Three years after my definitive guide on Python classic, static, class and abstract methods, it seems to be time for a new one. Here, I would like to dissect and discuss Python exceptions.</description><pubDate>Thu, 11 Aug 2016 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Three years after my definitive guide on &lt;a href=&quot;https://julien.danjou.info/blog/guide-python-static-class-abstract-methods&quot;&gt;Python classic, static, class and abstract methods&lt;/a&gt;, it seems to be time for a new one. Here, I would like to dissect and discuss &lt;a href=&quot;https://docs.python.org/3/tutorial/errors.html&quot;&gt;Python exceptions&lt;/a&gt;.&lt;/p&gt;
&lt;h2&gt;Dissecting the base exceptions&lt;/h2&gt;
&lt;p&gt;In Python, the base exception class is named &lt;code&gt;BaseException&lt;/code&gt;. Being rarely used in any program or library, it ought to be considered as an &lt;em&gt;implementation detail&lt;/em&gt;. But to discover how it&apos;s implemented, you can go and read &lt;a href=&quot;https://github.com/python/cpython/blob/master/Objects/exceptions.c&quot;&gt;Objects/exceptions.c&lt;/a&gt; in the CPython source code. In that file, what is interesting to see is that the &lt;code&gt;BaseException&lt;/code&gt; class defines all the basic methods and attributes of exceptions. The basic well-known &lt;code&gt;Exception&lt;/code&gt; class is then simply defined as a subclass of &lt;code&gt;BaseException&lt;/code&gt;, nothing more:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;/*
 *    Exception extends BaseException
 */
SimpleExtendsException(PyExc_BaseException, Exception,
                       &quot;Common base class for all non-exit exceptions.&quot;);
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The only other exceptions that inherits directly from &lt;code&gt;BaseException&lt;/code&gt; are &lt;code&gt;GeneratorExit&lt;/code&gt;, &lt;code&gt;SystemExit&lt;/code&gt; and &lt;code&gt;KeyboardInterrupt&lt;/code&gt;. All the other builtin exceptions inherits from &lt;code&gt;Exception&lt;/code&gt;. The whole hierarchy can be seen by running &lt;code&gt;pydoc2 exceptions&lt;/code&gt; or &lt;code&gt;pydoc3 builtins&lt;/code&gt;.&lt;/p&gt;
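&lt;p&gt;This split matters in practice: a blanket &lt;code&gt;except Exception&lt;/code&gt; will not catch those exit-related exceptions, which is usually what you want. A minimal check (my own illustration, not from the CPython sources):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# KeyboardInterrupt, SystemExit and GeneratorExit inherit directly from
# BaseException, so a blanket except Exception lets them propagate,
# which keeps Ctrl-C and sys.exit() working as expected.
try:
    raise SystemExit(0)
except Exception:
    caught = &quot;Exception&quot;
except BaseException:
    caught = &quot;BaseException&quot;

print(caught)  # BaseException

# The hierarchy is visible on the classes themselves:
assert issubclass(SystemExit, BaseException)
assert not issubclass(SystemExit, Exception)
&lt;/code&gt;&lt;/pre&gt;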
&lt;p&gt;Here are the graphs representing the builtin exceptions inheritance in Python 2 and Python 3 (generated using &lt;a href=&quot;https://github.com/jd/julien.danjou.info/blob/master/bin/generate-python-exceptions-graph.py&quot;&gt;this script&lt;/a&gt;).&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/python2-exceptions-graph.png&quot; alt=&quot;Python 2 builtin exceptions inheritance graph&quot; /&gt;&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/python3-exceptions-graph.png&quot; alt=&quot;Python 3 builtin exceptions inheritance graph&quot; /&gt;&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;BaseException.__init__&lt;/code&gt; signature is actually &lt;code&gt;BaseException.__init__(*args)&lt;/code&gt;. This initialization method stores any arguments that are passed in the &lt;code&gt;args&lt;/code&gt; attribute of the exception. This can be seen in the &lt;code&gt;exceptions.c&lt;/code&gt; source code – and is true for both Python 2 and Python 3:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;static int
BaseException_init(PyBaseExceptionObject *self, PyObject *args, PyObject *kwds)
{
    if (!_PyArg_NoKeywords(Py_TYPE(self)-&amp;gt;tp_name, kwds))
        return -1;

    Py_INCREF(args);
    Py_XSETREF(self-&amp;gt;args, args);

    return 0;
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The only place where this &lt;code&gt;args&lt;/code&gt; attribute is used is in the &lt;code&gt;BaseException.__str__&lt;/code&gt; method. This method uses &lt;code&gt;self.args&lt;/code&gt; to convert an exception to a string:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;static PyObject *
BaseException_str(PyBaseExceptionObject *self)
{
    switch (PyTuple_GET_SIZE(self-&amp;gt;args)) {
    case 0:
        return PyUnicode_FromString(&quot;&quot;);
    case 1:
        return PyObject_Str(PyTuple_GET_ITEM(self-&amp;gt;args, 0));
    default:
        return PyObject_Str(self-&amp;gt;args);
    }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This can be translated in Python to:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;def __str__(self):
    if len(self.args) == 0:
        return &quot;&quot;
    if len(self.args) == 1:
        return str(self.args[0])
    return str(self.args)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Therefore, the message to display for an exception should be passed as the first and only argument to the &lt;code&gt;BaseException.__init__&lt;/code&gt; method.&lt;/p&gt;
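&lt;p&gt;This behavior is easy to check for yourself; a short illustration of what &lt;code&gt;str()&lt;/code&gt; produces depending on the number of arguments:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;class MyError(Exception):
    pass

# One argument: str() returns it directly
e = MyError(&quot;something went wrong&quot;)
assert str(e) == &quot;something went wrong&quot;
assert e.args == (&quot;something went wrong&quot;,)

# Several arguments: str() falls back to the repr of the args tuple,
# which is rarely what you want to display
e = MyError(&quot;something went wrong&quot;, 42)
assert str(e) == &quot;(&apos;something went wrong&apos;, 42)&quot;
&lt;/code&gt;&lt;/pre&gt;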
&lt;h2&gt;Defining your exceptions properly&lt;/h2&gt;
&lt;p&gt;As you may already know, in Python, exceptions can be raised in any part of the program. The basic exception is called &lt;code&gt;Exception&lt;/code&gt; and can be used anywhere in your program. In real life, however, no program nor library should ever raise &lt;code&gt;Exception&lt;/code&gt; directly: it&apos;s not specific enough to be helpful.&lt;/p&gt;
&lt;p&gt;Since all exceptions are expected to be derived from the base class &lt;code&gt;Exception&lt;/code&gt;, this base class can easily be used as a catch-all:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;try:
    do_something()
except Exception:
    # This will catch any exception!
    print(&quot;Something terrible happened&quot;)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;To define your own exceptions correctly, there are a few rules and best practices that you need to follow:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Always inherit from (at least) &lt;code&gt;Exception&lt;/code&gt;:&lt;/li&gt;
&lt;/ul&gt;
&lt;pre&gt;&lt;code&gt;class MyOwnError(Exception):
    pass
&lt;/code&gt;&lt;/pre&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Leverage what we saw earlier about &lt;code&gt;BaseException.__str__&lt;/code&gt;: it prints the first argument passed to &lt;code&gt;BaseException.__init__&lt;/code&gt;, so always call &lt;code&gt;BaseException.__init__&lt;/code&gt; with &lt;strong&gt;only one argument&lt;/strong&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;When building a library, define a base class inheriting from &lt;code&gt;Exception&lt;/code&gt;. It will make it easier for consumers to catch any exception from the library:&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;pre&gt;&lt;code&gt;class ShoeError(Exception):
    &quot;&quot;&quot;Basic exception for errors raised by shoes&quot;&quot;&quot;

class UntiedShoelace(ShoeError):
    &quot;&quot;&quot;You could fall&quot;&quot;&quot;

class WrongFoot(ShoeError):
    &quot;&quot;&quot;When you try to wear your left shoe on your right foot&quot;&quot;&quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;It then makes it easy to use &lt;code&gt;except ShoeError&lt;/code&gt; when doing anything with that piece of code related to shoes. For example, &lt;a href=&quot;https://docs.djangoproject.com/en/1.9/_modules/django/core/exceptions/&quot;&gt;Django does not do that&lt;/a&gt; for some of its exceptions, making it hard to catch &quot;any exception raised by Django&quot;.&lt;/p&gt;
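&lt;p&gt;With such a hierarchy in place, consumers can pick the granularity they need. A minimal sketch (the &lt;code&gt;wear_shoes()&lt;/code&gt; function is a hypothetical stand-in for real code using the library):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;class ShoeError(Exception):
    &quot;&quot;&quot;Basic exception for errors raised by shoes&quot;&quot;&quot;

class WrongFoot(ShoeError):
    &quot;&quot;&quot;When you try to wear your left shoe on your right foot&quot;&quot;&quot;

def wear_shoes():
    # Hypothetical stand-in for real shoe-handling code
    raise WrongFoot(&quot;left shoe, right foot&quot;)

try:
    wear_shoes()
except WrongFoot:
    handled = &quot;specific&quot;  # the most specific handler wins
except ShoeError:
    handled = &quot;generic&quot;   # would catch any other shoe error

print(handled)  # specific
&lt;/code&gt;&lt;/pre&gt;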
&lt;ul&gt;
&lt;li&gt;Provide details about the error. This is extremely valuable for logging errors correctly, or for taking further action and trying to recover:&lt;/li&gt;
&lt;/ul&gt;
&lt;pre&gt;&lt;code&gt;class CarError(Exception):
    &quot;&quot;&quot;Basic exception for errors raised by cars&quot;&quot;&quot;
    def __init__(self, car, msg=None):
        if msg is None:
            # Set some default useful error message
            msg = &quot;An error occurred with car %s&quot; % car
        super(CarError, self).__init__(msg)
        self.car = car

class CarCrashError(CarError):
    &quot;&quot;&quot;When you drive too fast&quot;&quot;&quot;
    def __init__(self, car, other_car, speed):
        super(CarCrashError, self).__init__(
            car, msg=&quot;Car crashed into %s at speed %d&quot; % (other_car, speed))
        self.speed = speed
        self.other_car = other_car
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Then, any code can inspect the exception to take further action:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;try:
    drive_car(car)
except CarCrashError as e:
    # If we crash at high speed, we call emergency
    if e.speed &amp;gt;= 30:
        call_911()
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;For example, this is leveraged in &lt;a href=&quot;http://gnocchi.xyz&quot;&gt;Gnocchi&lt;/a&gt; to raise specific application exceptions (&lt;code&gt;NoSuchArchivePolicy&lt;/code&gt;) on expected foreign key violations raised by SQL constraints:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;try:
    with self.facade.writer() as session:
        session.add(m)
except exception.DBReferenceError as e:
    if e.constraint == &apos;fk_metric_ap_name_ap_name&apos;:
        raise indexer.NoSuchArchivePolicy(archive_policy_name)
    raise
&lt;/code&gt;&lt;/pre&gt;
&lt;ul&gt;
&lt;li&gt;Inherit from builtin exception types when it makes sense. This makes it easier for programs to not be specific to your application or library:&lt;/li&gt;
&lt;/ul&gt;
&lt;pre&gt;&lt;code&gt;class CarError(Exception):
    &quot;&quot;&quot;Basic exception for errors raised by cars&quot;&quot;&quot;

class InvalidColor(CarError, ValueError):
    &quot;&quot;&quot;Raised when the color for a car is invalid&quot;&quot;&quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;That allows many programs to catch errors in a more generic way without knowing about your own defined type. If a program already knows how to handle a &lt;code&gt;ValueError&lt;/code&gt;, it won&apos;t need any specific code nor modification.&lt;/p&gt;
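&lt;p&gt;To make that concrete, here is a small sketch of my own (the &lt;code&gt;paint()&lt;/code&gt; function is hypothetical) where generic code handles the car-specific error through &lt;code&gt;ValueError&lt;/code&gt; alone:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;class CarError(Exception):
    &quot;&quot;&quot;Basic exception for errors raised by cars&quot;&quot;&quot;

class InvalidColor(CarError, ValueError):
    &quot;&quot;&quot;Raised when the color for a car is invalid&quot;&quot;&quot;

def paint(car, color):
    # Hypothetical stand-in for real car-handling code
    if color not in (&quot;red&quot;, &quot;blue&quot;):
        raise InvalidColor(color)

# Code that knows nothing about the car library still handles the error:
try:
    paint(&quot;deLorean&quot;, &quot;plaid&quot;)
except ValueError as e:
    print(&quot;bad value: %s&quot; % e)  # bad value: plaid
&lt;/code&gt;&lt;/pre&gt;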
&lt;h2&gt;Organization&lt;/h2&gt;
&lt;p&gt;Organizing code can be quite touchy and complicated. I cover more general rules in &lt;a href=&quot;https://thehackerguidetopython.com&quot;&gt;The Hacker&apos;s Guide to Python&lt;/a&gt;, but here&apos;s a few rules concerning exceptions in particular.&lt;/p&gt;
&lt;p&gt;There is no limitation on where and when you can define exceptions. As they are, after all, normal classes, they can be defined in any module, function or class – even as closures.&lt;/p&gt;
&lt;p&gt;Most libraries package their exceptions into a specific exception module: &lt;a href=&quot;http://sqlalchemy.org&quot;&gt;SQLAlchemy&lt;/a&gt; has them in &lt;a href=&quot;http://docs.sqlalchemy.org/en/latest/core/exceptions.html&quot;&gt;&lt;code&gt;sqlalchemy.exc&lt;/code&gt;&lt;/a&gt;, &lt;a href=&quot;http://docs.python-requests.org/&quot;&gt;requests&lt;/a&gt; has them in &lt;a href=&quot;http://docs.python-requests.org/en/master/_modules/requests/exceptions/&quot;&gt;&lt;code&gt;requests.exceptions&lt;/code&gt;&lt;/a&gt;, &lt;a href=&quot;http://werkzeug.pocoo.org/&quot;&gt;Werkzeug&lt;/a&gt; has them in &lt;a href=&quot;http://werkzeug.pocoo.org/docs/0.11/exceptions/&quot;&gt;&lt;code&gt;werkzeug.exceptions&lt;/code&gt;&lt;/a&gt;, etc.&lt;/p&gt;
&lt;p&gt;That makes sense for libraries to export exceptions that way, as it makes it very easy for consumers to import their exception module and know where the exceptions are defined when writing code to handle errors.&lt;/p&gt;
&lt;p&gt;This is not mandatory, and smaller Python modules might want to retain their exceptions in their sole module. Typically, if your module is small enough to be kept in one file, don&apos;t bother splitting your exceptions into a different file/module.&lt;/p&gt;
&lt;p&gt;While this wisely applies to libraries, applications tend to be different beasts. Usually, they are composed of different subsystems, where each one might have its own set of exceptions. This is why I generally discourage going with only one exception module in an application, and instead suggest splitting the exceptions across the different parts of the program. There might be no need for a special &lt;code&gt;myapp.exceptions&lt;/code&gt; module.&lt;/p&gt;
&lt;p&gt;For example, if your application is composed of an HTTP REST API defined into the module &lt;code&gt;myapp.http&lt;/code&gt; and of a TCP server contained into &lt;code&gt;myapp.tcp&lt;/code&gt;, it&apos;s likely they can both define different exceptions tied to their own protocol errors and cycle of life. Defining those exceptions in a &lt;code&gt;myapp.exceptions&lt;/code&gt; module would just scatter the code for the sake of some useless consistency. If the exceptions are local to a file, just define them somewhere at the top of that file. It will simplify the maintenance of the code.&lt;/p&gt;
&lt;h2&gt;Wrapping exceptions&lt;/h2&gt;
&lt;p&gt;Wrapping exceptions is the practice by which one exception is encapsulated into another:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import requests

class MylibError(Exception):
    &quot;&quot;&quot;Generic exception for mylib&quot;&quot;&quot;
    def __init__(self, msg, original_exception):
        super(MylibError, self).__init__(msg + (&quot;: %s&quot; % original_exception))
        self.original_exception = original_exception

try:
    requests.get(&quot;http://example.com&quot;)
except requests.exceptions.ConnectionError as e:
    raise MylibError(&quot;Unable to connect&quot;, e)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This makes sense when writing a library which leverages other libraries. If a library uses &lt;code&gt;requests&lt;/code&gt; and does not encapsulate &lt;code&gt;requests&lt;/code&gt; exceptions into its own defined error classes, it will be a case of layering violation. Any application using your library might receive a &lt;code&gt;requests.exceptions.ConnectionError&lt;/code&gt;, which is a problem because:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;The application has no clue that the library was using &lt;code&gt;requests&lt;/code&gt; and does not need/want to know about it.&lt;/li&gt;
&lt;li&gt;The application will have to import &lt;code&gt;requests.exceptions&lt;/code&gt; itself and therefore will depend on &lt;code&gt;requests&lt;/code&gt; – even if it does not use it directly.&lt;/li&gt;
&lt;li&gt;As soon as &lt;code&gt;mylib&lt;/code&gt; changes from &lt;code&gt;requests&lt;/code&gt; to e.g. &lt;code&gt;httplib2&lt;/code&gt;, the application code catching &lt;code&gt;requests&lt;/code&gt; exceptions will become irrelevant.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;The &lt;a href=&quot;https://github.com/openstack/tooz&quot;&gt;Tooz&lt;/a&gt; library is a good example of wrapping, as it uses a driver-based approach and depends on a lot of different Python modules to talk to different backends (ZooKeeper, PostgreSQL, etcd…). Therefore, it wraps exceptions from other modules at every occasion into its own set of error classes. Python 3 introduced the &lt;code&gt;raise from&lt;/code&gt; form to help with that, and that&apos;s what Tooz leverages to raise its own errors.&lt;/p&gt;
&lt;p&gt;It&apos;s also possible to encapsulate the original exception into a custom defined exception, as done above. That makes the original exception easily available for inspection.&lt;/p&gt;
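&lt;p&gt;The &lt;code&gt;raise from&lt;/code&gt; form records the original exception on the &lt;code&gt;__cause__&lt;/code&gt; attribute, so it stays available for inspection without any manual wrapping. A minimal sketch (using the builtin &lt;code&gt;ConnectionError&lt;/code&gt; instead of &lt;code&gt;requests&lt;/code&gt; to keep the example self-contained):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;class MylibError(Exception):
    &quot;&quot;&quot;Generic exception for mylib&quot;&quot;&quot;

def fetch():
    try:
        raise ConnectionError(&quot;connection refused&quot;)
    except ConnectionError as e:
        raise MylibError(&quot;Unable to connect&quot;) from e

try:
    fetch()
except MylibError as e:
    # The original exception is chained, not lost:
    assert isinstance(e.__cause__, ConnectionError)
    assert str(e.__cause__) == &quot;connection refused&quot;
&lt;/code&gt;&lt;/pre&gt;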
&lt;h2&gt;Catching and logging&lt;/h2&gt;
&lt;p&gt;When designing exceptions, it&apos;s important to remember that they should be targeted both at humans and computers. That&apos;s why they should include an explicit message and embed as much information as possible. That will help to debug, and to write resilient programs that can pivot their behavior depending on the attributes of the exception, as seen above.&lt;/p&gt;
&lt;p&gt;Also, silencing exceptions completely is to be considered bad practice. You should not write code like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;try:
    do_something()
except Exception:
    # Whatever
    pass
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Not having any kind of information in a program where an exception occurs is a nightmare to debug.&lt;/p&gt;
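&lt;p&gt;If ignoring one &lt;em&gt;specific&lt;/em&gt; exception really is the intent, a pattern I would suggest is &lt;code&gt;contextlib.suppress&lt;/code&gt; (available since Python 3.4), which at least states explicitly what is being swallowed:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import contextlib
import os

# Suppress only the one exception we genuinely expect and can ignore,
# instead of a blanket except/pass:
with contextlib.suppress(FileNotFoundError):
    os.remove(&quot;some-file-that-may-not-exist&quot;)
&lt;/code&gt;&lt;/pre&gt;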
&lt;p&gt;If you use (and you should) the &lt;a href=&quot;https://docs.python.org/3/library/logging.html&quot;&gt;&lt;code&gt;logging&lt;/code&gt;&lt;/a&gt; library, you can use the &lt;code&gt;exc_info&lt;/code&gt; parameter to log a complete traceback when an exception occurs, which might help debugging severe and unrecoverable failures:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;try:
    do_something()
except Exception:
    logging.getLogger().error(&quot;Something bad happened&quot;, exc_info=True)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;If you often forget how to set up the &lt;code&gt;logging&lt;/code&gt; library, you should check out &lt;a href=&quot;https://github.com/jd/daiquiri&quot;&gt;daiquiri&lt;/a&gt;.&lt;/p&gt;
&lt;h2&gt;Further reading&lt;/h2&gt;
&lt;p&gt;If you understood everything so far, congratulations, you might be ready to handle exceptions in Python! If you want a broader perspective on exceptions and what Python misses, I encourage you to read about &lt;a href=&quot;https://en.wikipedia.org/wiki/Exception_handling#Condition_systems&quot;&gt;condition systems&lt;/a&gt; and discover the generalization of exceptions – that I hope we&apos;ll see in Python one day!&lt;/p&gt;
&lt;p&gt;I hope this will help you build better libraries and applications. Feel free to shoot any question in the comments section!&lt;/p&gt;
</content:encoded></item><item><title>The bad practice in FOSS projects management</title><link>https://julien.danjou.info/blog/foss-projects-management-bad-practice/</link><guid isPermaLink="true">https://julien.danjou.info/blog/foss-projects-management-bad-practice/</guid><description>During the OpenStack summit a few weeks ago, I had the chance to talk to some people about my experience on running open source projects. It turns out that after hanging out in communities and contrib</description><pubDate>Thu, 09 Jun 2016 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;During the OpenStack summit a few weeks ago, I had the chance to talk to some people about my experience on running open source projects. It turns out that after hanging out in communities and contributing to many projects for years, I may be able to provide some hindsight and an external eye to many of those who are new to it.&lt;/p&gt;
&lt;p&gt;There are plenty of resources out there explaining how to run an open source project. Today, I would like to take a different angle and emphasize what you should not &lt;em&gt;socially&lt;/em&gt; do in your projects. This list comes from various open source projects I encountered these past years. I&apos;m going to go through some of the bad practices I&apos;ve spotted, in random order, illustrated by some concrete examples.&lt;/p&gt;
&lt;h2&gt;Seeing contributors as an annoyance&lt;/h2&gt;
&lt;p&gt;When software developers and maintainers are busy, there&apos;s one thing they don&apos;t need: more work. To many people, the instinctive reaction to an external contribution is: damn, more work. And actually, it is.&lt;/p&gt;
&lt;p&gt;Therefore, some maintainers tend to avoid that surplus of work: they state they don&apos;t want contributions, or make contributors feel unwelcome. This can take a lot of different forms, from ignoring them to being unpleasant. It indeed avoids the immediate need to deal with the work that has been added onto the maintainer&apos;s shoulders.&lt;/p&gt;
&lt;p&gt;This is one of the biggest mistakes and misconceptions in open source. If people are sending you more work, you should do whatever it takes to make them feel welcome so they continue working with you. They might pretty soon be doing the work you are doing, instead of you. Think: retirement!&lt;/p&gt;
&lt;p&gt;Let&apos;s take a look at my friend Gordon, who I saw starting as a Ceilometer contributor in 2013. He was doing great code reviews, but he was actually giving me more work by catching bugs in my patches and sending patches I had to review. Instead of being a bully so he would stop making me rework my code, &lt;a href=&quot;http://lists.openstack.org/pipermail/openstack-dev/2013-May/008975.html&quot;&gt;I requested that we trust him even more by adding him as a core reviewer&lt;/a&gt;. Many projects do the opposite, and make people feel unwelcome on their first, one-time contribution.&lt;/p&gt;
&lt;p&gt;And if contributors are rebuffed on that one-time contribution, they won&apos;t make it two. They won&apos;t make any. Those projects may have just lost their new maintainers.&lt;/p&gt;
&lt;h2&gt;Letting people only do the grunt work&lt;/h2&gt;
&lt;p&gt;When new contributors arrive and want to contribute to a particular project, they may have very different motivation. Some of them are users, but some of them are just people looking to see how it is to contribute. Getting the thrill of contribution, as an exercise, or as a willingness to learn and start contributing back to the ecosystem they use.&lt;/p&gt;
&lt;p&gt;The usual response from maintainers is to push people into doing grunt work. That means doing jobs that have no interest, little value, and probably no direct impact on the project.&lt;/p&gt;
&lt;p&gt;Some people actually have no problem with it, and some do. Some will feel offended to be given low-impact work, and some will love it as soon as you give them some sort of acknowledgment. Be aware of it, and be sure to high-five people doing it. That&apos;s the only way to keep them around.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/computer-coding.jpg&quot; alt=&quot;computer-coding&quot; /&gt;&lt;/p&gt;
&lt;h2&gt;Not valorizing small contributions&lt;/h2&gt;
&lt;p&gt;When the first patch that comes in from a new contributor is a typo fix, what do developers think? That they don&apos;t care, that you&apos;re wasting their precious time with your small contribution. And nobody cares about bad English in the documentation, do they?&lt;/p&gt;
&lt;p&gt;This is wrong. See my first contributions to &lt;a href=&quot;https://github.com/home-assistant/home-assistant/commit/36cb12cd157b22bdc1fa28b700ca0fb751cca7a4&quot;&gt;home-assistant&lt;/a&gt; and &lt;a href=&quot;https://github.com/marijnh/Postmodern/commit/ec537f72393e1032853b78e0b7b4d0ff98632a02&quot;&gt;Postmodern&lt;/a&gt;: I fixed typos in the documentation.&lt;/p&gt;
&lt;p&gt;I contributed to &lt;a href=&quot;http://orgmode.org&quot;&gt;Org-mode&lt;/a&gt; for a few years. &lt;a href=&quot;http://repo.or.cz/org-mode.git/commit/a153f5a31dffbc6b78a8c5d8d027961abe585a38&quot;&gt;My first patch to orgmode&lt;/a&gt; was about fixing a docstring. Then, I sent 56 patches, fixing bugs and adding fancy features, and also wrote a few external modules. To this day, I&apos;m still #16 in the top-committer list of Org-mode, which counts 390 contributors. Not what you would call a small contributor. I am sure the community is glad they did not despise my documentation fix.&lt;/p&gt;
&lt;h2&gt;Setting the bar too high for newcomers&lt;/h2&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/too-high.png&quot; alt=&quot;too-high&quot; /&gt;&lt;/p&gt;
&lt;p&gt;When new contributors arrive, their knowledge about the project, its context, and the technologies can vary largely. One of the mistakes people often make is to ask contributors for things too complicated for them to achieve. That scares them away (many people are shy or introverted) and they may just disappear, feeling too stupid to help.&lt;/p&gt;
&lt;p&gt;Before making any comment, you should not assume anything about a contributor&apos;s knowledge; that helps avoid such situations. You also should be very delicate when assessing their skills, as some people might feel vexed if you underestimate them too much.&lt;/p&gt;
&lt;p&gt;Once that level has been properly evaluated (a few exchanges should be enough), you need to mentor your contributors to the right degree so they can blossom. It takes time and experience to master this, and you may well lose some of them in the process, but it&apos;s a path every maintainer has to take.&lt;/p&gt;
&lt;p&gt;Mentoring is a very important aspect of welcoming new contributors to your project, whatever it is. Pretty sure that applies nicely outside free software too.&lt;/p&gt;
&lt;h2&gt;Requiring people to make sacrifices with their lives&lt;/h2&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/balance-stones.jpg&quot; alt=&quot;balance-stones&quot; /&gt;&lt;/p&gt;
&lt;p&gt;This is an aspect that varies a lot depending on the project and context, but it&apos;s really important. In a free software project, where most people contribute out of their own good will and often in their spare time, you must not require them to make big sacrifices. This won&apos;t work.&lt;/p&gt;
&lt;p&gt;One of the worst implementations of that is requiring people to fly 5,000 kilometers to meet somewhere to discuss the project. This puts contributors in an unfair position, based on their ability to leave their family for a week, take a plane/boat/car/train, book a hotel, etc. This is not good, and you should avoid &lt;em&gt;requiring&lt;/em&gt; people to do that in order to participate, feel included in the project, and blend into your community. Don&apos;t get me wrong: that does not mean social activities should be prohibited – on the contrary. Just avoid excluding people whenever you discuss the project.&lt;/p&gt;
&lt;p&gt;The same applies to any other form of discussion that makes it complicated for everyone to participate: IRC meetings (it&apos;s hard for some people to book an hour, especially depending on the timezone they live in), video-conferences (especially using non-free software), etc.&lt;/p&gt;
&lt;p&gt;Anything that requires people to interact with the project synchronously for a period of time puts constraints on them that can make them uncomfortable.&lt;/p&gt;
&lt;p&gt;The best media are still e-mail and its asynchronous derivatives (bug trackers, etc.), as they allow people to work at their own pace, on their own time.&lt;/p&gt;
&lt;h2&gt;Not having an (implicit) CoC&lt;/h2&gt;
&lt;p&gt;Codes of conduct seem to be a trendy topic (and a touchy subject), as more and more communities are opening up to a wider audience than they used to – which is great.&lt;/p&gt;
&lt;p&gt;Actually, all communities have a code of conduct, whether it is written down in black ink or carried unconsciously in everyone&apos;s mind. Its form is a matter of community size and culture.&lt;/p&gt;
&lt;p&gt;Now, depending on the size of your community and how comfortable you feel applying it, you may want to put it down in a document, e.g. like &lt;a href=&quot;https://www.debian.org/code_of_conduct&quot;&gt;Debian did&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Having a code of conduct does not magically transform your whole project community into a bunch of carebears following its guidance. But it provides a useful reference point whenever you need one. You can point to it to indicate that some behavior is not welcome in the project, and it can somehow ease the potential exclusion of offenders – even if nobody generally wants to go that far, and it&apos;s rarely that useful.&lt;/p&gt;
&lt;p&gt;I don&apos;t think it&apos;s mandatory to have such a document for smaller projects. But you have to keep in mind that the implicit code of conduct will be derived from &lt;em&gt;your&lt;/em&gt; own behavior. The way your leader(s) communicate with others will set the entire social mood of the project. Do not underestimate that.&lt;/p&gt;
&lt;p&gt;When we started the &lt;a href=&quot;http://launchpad.net/ceilometer&quot;&gt;Ceilometer&lt;/a&gt; project, we implicitly followed the &lt;a href=&quot;https://www.openstack.org/legal/community-code-of-conduct/&quot;&gt;OpenStack Code of Conduct&lt;/a&gt; before it even existed, and probably set the bar a little higher. By being nice, welcoming, and open-minded, we achieved a decent level of diversity, with up to 25% of our core team being women – way above the current ratio in OpenStack and most open source projects!&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/friends-beach.jpg&quot; alt=&quot;friends-beach&quot; /&gt;&lt;/p&gt;
&lt;h2&gt;Making non-native English speakers feel like outsiders&lt;/h2&gt;
&lt;p&gt;It&apos;s quite important to be aware that the vast majority of free software projects out there use English as their common language of communication. It makes a lot of sense: it&apos;s a commonly spoken language, and it seems to do the job correctly.&lt;/p&gt;
&lt;p&gt;But a large share of the hackers out there are not native English speakers. Many are not able to speak English fluently. That means the rate at which they can communicate and run a conversation might be very low, which can frustrate some people, especially native English speakers.&lt;/p&gt;
&lt;p&gt;The principal demonstration of this phenomenon can be seen at social events (e.g. conferences) where people are debating. It can be very hard for some to explain their thoughts in English and to communicate at a decent rate, making the conversation and the transmission of ideas slow. The worst thing one can see in this context is a native English speaker cutting people off and ignoring them, just because they are talking too slowly. I do understand that it can be frustrating, but the problem here is not the non-native English speaker: it&apos;s the medium, which, by moving the conversation to speech, does not put everyone on the same level.&lt;/p&gt;
&lt;p&gt;To a lesser extent, the same applies to IRC meetings, which are relatively synchronous. Completely asynchronous media do not have this flaw, which is why, in my opinion, they should also be preferred.&lt;/p&gt;
&lt;h2&gt;No vision, no delegation&lt;/h2&gt;
&lt;p&gt;Here are two of the most commonly encountered mistakes in open source projects, often seen together: a maintainer struggling with the growth of their project while people are trying to help.&lt;/p&gt;
&lt;p&gt;Indeed, when the flow of contributors starts coming in, adding new features, asking for feedback and directions, some maintainers choke and don&apos;t know how to respond. That ends up frustrating contributors, who may then simply vanish.&lt;/p&gt;
&lt;p&gt;It&apos;s important to have a vision for your project and to communicate it. Make it clear to contributors what you want or don&apos;t want in your project. Conveying that in a clear (and non-aggressive, please) manner is a good way of reducing friction with contributors. They&apos;ll know pretty soon whether they want to join your ship or not, and what to expect. So be a good captain.&lt;/p&gt;
&lt;p&gt;If they choose to work with you and contribute, you should start trusting them as soon as you can and delegate some of your responsibilities. This can be anything you used to do yourself: reviewing patches targeting some subsystem, fixing bugs, writing docs. Let people own an entire part of the project so they feel responsible for it and care about it as much as you do. Doing the opposite (being a control freak) is the best way to end up alone with your open source software.&lt;/p&gt;
&lt;p&gt;And no project is going to grow and be successful that way.&lt;/p&gt;
&lt;p&gt;In 2009, when Uli Schlachter sent &lt;a href=&quot;http://article.gmane.org/gmane.comp.window-managers.awesome.devel/1746/match=uli+schlachter&quot;&gt;his first patch to awesome&lt;/a&gt;, this was more work for me. I had to review this patch, and I was already pretty busy designing the new versions of awesome and doing my day job! Uli&apos;s work was not perfect, and I had to fix it myself. More work. And what did I do? A few minutes later, I &lt;a href=&quot;http://article.gmane.org/gmane.comp.window-managers.awesome.devel/1747/match=uli+schlachter&quot;&gt;replied to him&lt;/a&gt; with a clear plan of what he should do and what I thought about his work.&lt;/p&gt;
&lt;p&gt;In response, Uli sent more patches and improved the project. Do you know what Uli does today? He has been managing the awesome window manager project in my place since 2010. I managed to transmit my vision, delegate, and then retire!&lt;/p&gt;
&lt;h2&gt;Non-recognition of contributions&lt;/h2&gt;
&lt;p&gt;People contribute in different ways, and it&apos;s not always code. There are a lot of things around a free software project: documentation, bug triage, user support, user experience design, communication, translation…&lt;/p&gt;
&lt;p&gt;For example, it took &lt;a href=&quot;http://debian.org&quot;&gt;Debian&lt;/a&gt; a while to recognize that their translators could have the status of Debian Developer. &lt;a href=&quot;http://openstack.org&quot;&gt;OpenStack&lt;/a&gt; is working in the same direction by trying to &lt;a href=&quot;https://wiki.openstack.org/wiki/NonATCRecognition&quot;&gt;recognize non-technical contributions&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;As soon as your project starts awarding badges to some people and creating different classes of members in the community, you should be very careful not to forget anyone. That&apos;s the easiest way to lose contributors along the road.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/heart-sign.jpg&quot; alt=&quot;heart-sign&quot; /&gt;&lt;/p&gt;
&lt;h2&gt;Don&apos;t forget to be thankful&lt;/h2&gt;
&lt;p&gt;This whole list has been inspired by many years of open source hacking and free software contributions. Everyone&apos;s experience and feelings might differ, and these malpractices may appear under different forms. Let me know if there&apos;s any other point you encountered that blocked you from contributing to open source projects!&lt;/p&gt;
</content:encoded></item><item><title>Gnocchi talk at the Paris Monitoring Meetup #6</title><link>https://julien.danjou.info/blog/paris-monitoring-6-gnocchi/</link><guid isPermaLink="true">https://julien.danjou.info/blog/paris-monitoring-6-gnocchi/</guid><description>Last week was the sixth edition of the Paris Monitoring Meetup, where I was invited as a speaker to present and talk about Gnocchi.</description><pubDate>Fri, 27 May 2016 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Last week was the sixth edition of the &lt;a href=&quot;http://www.meetup.com/Paris-Monitoring/events/230515751/&quot;&gt;Paris Monitoring Meetup&lt;/a&gt;, where I was invited as a speaker to present and talk about &lt;a href=&quot;http://gnocchi.xyz&quot;&gt;Gnocchi&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/paris-monitoring.png&quot; alt=&quot;paris-monitoring&quot; /&gt;&lt;/p&gt;
&lt;p&gt;There were around 50 people in the room, listening to my presentation of Gnocchi.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/jd-gnocchi-paris-monitoring-meetup-6.jpg&quot; alt=&quot;jd-gnocchi-paris-monitoring-meetup-6&quot; /&gt;&lt;/p&gt;
&lt;p&gt;The talk went fine and I got a few interesting questions and some feedback. One interesting point that keeps coming up when talking about Gnocchi is its OpenStack label, which scares away a lot of people. We definitely need to keep explaining that the project works stand-alone and has no dependency on OpenStack – just great integration with it.&lt;/p&gt;
&lt;p&gt;The &lt;a href=&quot;http://www.monitoring-fr.org/&quot;&gt;Monitoring-fr&lt;/a&gt; organization also &lt;a href=&quot;http://www.monitoring-fr.org/2016/05/meetup-paris-monitoring-6-interview-de-julien-danjou-pour-gnocchi-metric-as-a-service/&quot;&gt;interviewed me&lt;/a&gt; after the meetup about Gnocchi. The interview is in French, obviously. I talk about Gnocchi, what it does, how it does it and why we started the project a couple of years ago. Enjoy, and let me know what you think!&lt;/p&gt;
</content:encoded></item><item><title>The Hacker&apos;s Guide to Python 3rd edition is out</title><link>https://julien.danjou.info/blog/the-hacker-guide-to-python-third-edition/</link><guid isPermaLink="true">https://julien.danjou.info/blog/the-hacker-guide-to-python-third-edition/</guid><description>Exactly a year ago, I released the second edition of my book The Hacker&apos;s Guide to Python . One more time, it has been a wonderful release and I received a lot of amazing feedback from my readers all</description><pubDate>Wed, 04 May 2016 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Exactly a year ago, I &lt;a href=&quot;https://julien.danjou.info/blog/the-hacker-guide-to-python-second-edition&quot;&gt;released the second edition of my book The Hacker&apos;s Guide to Python&lt;/a&gt;. One more time, it has been a wonderful release and I received a lot of amazing feedback from my readers all over this year.&lt;/p&gt;
&lt;p&gt;Since then, the book has been &lt;strong&gt;translated into 2 languages&lt;/strong&gt;: Korean and Chinese. A few thousand copies have been distributed there, and I&apos;m very glad the book has been such a success. I&apos;m looking into getting it translated into more languages – don&apos;t hesitate to get in touch with me if you have any interesting connections in your country.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/thgtp-korean.jpg&quot; alt=&quot;thgtp-korean&quot; /&gt;&lt;/p&gt;
&lt;p&gt;For those who still don&apos;t know about this guide, which I first released a couple of years ago, let me sum it up by saying it&apos;s &lt;strong&gt;the Python book that I always wanted to read&lt;/strong&gt;, never found, and finally wrote. It does not cover the basics of the language, but deals with concrete problems, best practices, and some of the language&apos;s internals.&lt;/p&gt;
&lt;p&gt;It includes content about unit testing, methods, decorators, AST, distribution, documentation, functional programming, scaling, Python 3, etc. All of that made it pretty &lt;strong&gt;successful&lt;/strong&gt;! It also comes with &lt;strong&gt;9 awesome interviews&lt;/strong&gt; that I conducted with some of my fellow experienced Python hackers and developers!&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/thgtp-v3-photo-stack.jpg&quot; alt=&quot;thgtp-v3-photo-stack&quot; /&gt;&lt;/p&gt;
&lt;p&gt;In this &lt;strong&gt;3rd edition&lt;/strong&gt;, there are, as in each new edition, a few fixes to code, typos, etc. I guess books need a lot of time to become perfect! I also updated some of the content: things have evolved a bit since I last revised it a year ago. Finally, a new chapter about timestamp and timezone handling has made its appearance too.&lt;/p&gt;
&lt;p&gt;If you didn&apos;t get the book yet, it&apos;s time to go &lt;a href=&quot;https://thehackerguidetopython.com&quot;&gt;check it out&lt;/a&gt; and use the coupon &lt;strong&gt;THGTP3LAUNCH&lt;/strong&gt; to get 20% off during the next 48 hours!&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/the-hacker-guide-to-python-darken-v2-1.png&quot; alt=&quot;the-hacker-guide-to-python-darken-v2-1&quot; /&gt;&lt;/p&gt;
</content:encoded></item><item><title>OpenStack Summit Newton from a Telemetry point of view</title><link>https://julien.danjou.info/blog/openstack-summit-newton-austin-telemetry/</link><guid isPermaLink="true">https://julien.danjou.info/blog/openstack-summit-newton-austin-telemetry/</guid><description>It&apos;s again that time of the year, where we all fly out to a different country to chat about OpenStack and what we&apos;ll do during the next 6 months.</description><pubDate>Mon, 02 May 2016 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;It&apos;s again that time of the year, where we all fly out to a different country to chat about OpenStack and what we&apos;ll do during the next 6 months. This time, it was in &lt;a href=&quot;https://en.wikipedia.org/wiki/Austin,_Texas&quot;&gt;Austin, TX&lt;/a&gt; and we chatted about the new Newton release that will be out in October.&lt;/p&gt;
&lt;p&gt;As the &lt;em&gt;Project Team Leader&lt;/em&gt; for the Telemetry project, I set up and led the week for our team. We had 9 discussion slots of 40 minutes assigned, but in the end only used 8. We also, somehow, canceled the contributor team meet-up on the last day, as only a few of us developers were there and available.&lt;/p&gt;
&lt;p&gt;We took &lt;a href=&quot;https://wiki.openstack.org/wiki/Design_Summit/Newton/Etherpads#Telemetry&quot;&gt;a few notes in our Etherpads&lt;/a&gt;, but I think most of them are pretty sparse, as nothing really new came up. Actually, many topics had already been discussed and covered 6 months ago in Tokyo during the previous summit. We just did not have time to implement everything we wanted, so talking it over again would not have been of great help.&lt;/p&gt;
&lt;h2&gt;Reference architecture&lt;/h2&gt;
&lt;p&gt;Unfortunately, neither Gordon Chung nor the &lt;a href=&quot;https://osic.org/&quot;&gt;OpenStack Innovation Center&lt;/a&gt; had time to run the tests and benchmarks they wanted to run before the summit. We still discussed their plan to run tests and benchmarks of the whole Telemetry suite (Ceilometer, Gnocchi &amp;amp; Aodh). They should run their tests in a few weeks, for 3 weeks, no more. The window to run tests being narrow, they want to be sure they are prepared, and will reach out to us for help, ideas, and validation.&lt;/p&gt;
&lt;p&gt;I&apos;ve also asked them, if possible, to provide us with some profiling (e.g. cProfile) data so we can better know which areas to optimize.&lt;/p&gt;
&lt;h2&gt;Gnocchi, next steps&lt;/h2&gt;
&lt;p&gt;This session went particularly smoothly, since most people in the room were not up-to-date with Gnocchi 2.1. Some people expressed concern about the InfluxDB driver removal, though they were not aware of the bugs it had, nor that Gnocchi was actually performing better – so they may very likely test Gnocchi directly instead.&lt;/p&gt;
&lt;p&gt;No particularly fancy feature was requested; only a few bugs and ideas noted on Launchpad were discussed.&lt;/p&gt;
&lt;h2&gt;Enhancing Ceilometer polling&lt;/h2&gt;
&lt;p&gt;This session was not particularly productive, as everything we wanted to discuss was already on the Etherpad from… Tokyo, 6 months ago. It turns out nobody had time to pursue this project, so we&apos;ll see what happens. There&apos;s definitely some work to do to pursue our goal of splitting the pipeline definition into smaller files.&lt;/p&gt;
&lt;h2&gt;Aodh roadmap &amp;amp; improvements&lt;/h2&gt;
&lt;p&gt;First, we decided to definitely kill the combination alarm in the future, in favor of the new composite alarm definition that we like better.&lt;/p&gt;
&lt;p&gt;We should switch to &lt;a href=&quot;http://docs.openstack.org/developer/python-openstackclient/&quot;&gt;OpenStackClient&lt;/a&gt; in the future for &lt;a href=&quot;http://docs.openstack.org/developer/python-aodhclient/&quot;&gt;aodhclient&lt;/a&gt;. The OSC team indicated they are willing to provide a way to keep the &quot;aodh&quot; CLI command on its own, which is what blocked us from moving to OSC.&lt;/p&gt;
&lt;p&gt;A bunch of people indicated interest in having support for alarm CRUD operations in the Horizon dashboard. They should work together with the Horizon team to complete what has recently been started in Horizon to add Aodh support.&lt;/p&gt;
&lt;h2&gt;Ceilometer splitting&lt;/h2&gt;
&lt;p&gt;A year ago, we decided to split Ceilometer and its alarm feature apart: Aodh was born. We did discuss doing it again 6 months ago, but nothing happened, as we already had so much on our plate.&lt;/p&gt;
&lt;p&gt;As far as I&apos;m concerned, I think it&apos;s now time to split some Ceilometer functionality again, so I&apos;m going to do that this time with the event part. Gordon found a name, and this new project will be named &lt;em&gt;Panko&lt;/em&gt;.&lt;/p&gt;
&lt;h2&gt;Documentation&lt;/h2&gt;
&lt;p&gt;We then discussed our documentation. Users present in the room were particularly happy with the Gnocchi policy that we have applied since the beginning: no doc = no merge of your patch. The consensus is to extend this policy to all Telemetry projects, especially since it&apos;s now clear that the documentation team is not going to help us more. Ildikó, our documentation wizard, will take care of making links between the official OpenStack documentation and our projects, avoiding content duplication.&lt;/p&gt;
&lt;p&gt;For this cycle, my personal plan is to document Aodh up to roughly 80 %, and then force that policy on newly implemented changes.&lt;/p&gt;
&lt;h2&gt;Events management&lt;/h2&gt;
&lt;p&gt;The event management part of Ceilometer and its API (soon to be split into its own project, as stated above) was discussed in this session. Nothing really exciting came out of it, as nobody is willing to enhance it for now. Which, again, makes it a great candidate for splitting out of Ceilometer.&lt;/p&gt;
&lt;h2&gt;Vitrage&lt;/h2&gt;
&lt;p&gt;The last session was dedicated to &lt;a href=&quot;https://wiki.openstack.org/wiki/Vitrage&quot;&gt;Vitrage&lt;/a&gt;, a root cause analysis tool built on OpenStack. The Vitrage team had a few features that they wanted to see in Aodh, so we discussed that at length. Notably, more support for sending notifications on events (alarm creation, deletion…) should be added in this next release.&lt;/p&gt;
&lt;p&gt;Also, a new alarm type that would be entirely managed and triggered over HTTP would be very useful for external projects such as Vitrage. We&apos;ll try to make that happen during this cycle too.&lt;/p&gt;
&lt;h2&gt;Talks&lt;/h2&gt;
&lt;p&gt;There were a few interesting talks about our telemetry projects during this summit; among others, I highly recommend watching:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://www.youtube.com/watch?v=W5KT5GJKJw8&quot;&gt;OpenStack Ceilometer with Gnocchi and Aodh Feature&lt;/a&gt;, where Amol and Paul from Ericsson explain what Gnocchi and Aodh do and how they work, and then help people deploy them in their lab.&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.youtube.com/watch?v=BdebhsBFEJs&quot;&gt;DPDK, Collectd &amp;amp; Ceilometer The Missing Link&lt;/a&gt;, where Ryota Mibu, one of the contributors to Aodh, explains why he implemented the event alarm feature.&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.youtube.com/watch?v=-K8NI38LPtU&quot;&gt;Showback &amp;amp; Chargeback!! OpenStack Gnocchi + Cloudkitty as a Whole Billing System&lt;/a&gt;, where Maximiliano Venesio (Nubeliu) and Stéphane Albert (Objectif Libre) talk about how they built an amazing scalable billing solution using &lt;a href=&quot;http://gnocchi.xyz&quot;&gt;Gnocchi&lt;/a&gt; and &lt;a href=&quot;https://wiki.openstack.org/wiki/CloudKitty&quot;&gt;CloudKitty&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.youtube.com/watch?v=0Q8pfbwxMb8&quot;&gt;Using Ceilometer Data for Effective Witch-Hunting&lt;/a&gt;, where Mike explains how Overstock.com leveraged Ceilometer to track anomalies in their cloud.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;All of this should keep me and the team busy for the next cycle. If you have any question about what has been discussed or the future of our projects, don&apos;t hesitate to leave a comment or ask us on the &lt;a href=&quot;http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev&quot;&gt;OpenStack development mailing list&lt;/a&gt;.&lt;/p&gt;
</content:encoded></item><item><title>Gnocchi 2.1 release</title><link>https://julien.danjou.info/blog/gnocchi-2-1-release/</link><guid isPermaLink="true">https://julien.danjou.info/blog/gnocchi-2-1-release/</guid><description>A little less than 2 months after our latest major release, here is the new minor version of Gnocchi, stamped 2.1.0.</description><pubDate>Wed, 13 Apr 2016 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;A little less than 2 months after our latest major release, here is the new minor version of Gnocchi, stamped &lt;a href=&quot;https://launchpad.net/gnocchi/2.1/2.1.0&quot;&gt;2.1.0&lt;/a&gt;. It was a smooth release, but with one major feature implemented by my fellow fantastic developer Mehdi Abaakouk: the ability to create resource types dynamically.&lt;/p&gt;
&lt;h2&gt;Resource types REST API&lt;/h2&gt;
&lt;p&gt;This new version of Gnocchi offers the long-awaited ability to create resource types dynamically. What does that mean? Well, until version 2.0, the resources that you were able to create in Gnocchi had a particular type that was defined in the code: instance, volume, SNMP host, Swift account, etc. All of them were tied to OpenStack, since it was our primary use case.&lt;/p&gt;
&lt;p&gt;Now, &lt;a href=&quot;http://gnocchi.xyz/rest.html#resource-types&quot;&gt;the API allows creating resource types dynamically&lt;/a&gt;. This means you can create your own custom types to describe your own architecture. You can then exploit the same features that were offered before: the history of your resources, searching through them, associating metrics, etc.!&lt;/p&gt;
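&lt;p&gt;As a sketch of what such a call looks like, here is the kind of JSON body one would POST to the resource-type endpoint. The &lt;code&gt;compute_node&lt;/code&gt; type and its &lt;code&gt;host&lt;/code&gt; attribute below are made up for illustration; the exact attribute options are described in the Gnocchi REST documentation:&lt;/p&gt;

```python
import json

def resource_type_body(name, attributes):
    """Build the JSON body used to create a Gnocchi resource type.

    `attributes` maps each attribute name to its options, e.g. its type
    and whether it is required.
    """
    return json.dumps({"name": name, "attributes": attributes})

# Hypothetical custom type describing a compute node in your architecture.
body = resource_type_body(
    "compute_node",
    {"host": {"type": "string", "required": True, "max_length": 255}},
)
```

&lt;p&gt;Once the type exists, resources of that type can be created, searched, and associated with metrics just like the built-in ones.&lt;/p&gt;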
&lt;h2&gt;Performance improvements&lt;/h2&gt;
&lt;p&gt;We did some profiling and benchmarking on Gnocchi and, with the help of my fellow developer Gordon Chung, improved the metric performance.&lt;/p&gt;
&lt;p&gt;The API speed improved a bit, and I&apos;ve measured the Gnocchi API endpoint as being able to ingest up to 190k measures/s with only one node (the same as used in my &lt;a href=&quot;https://julien.danjou.info/blog/gnocchi-benchmarks&quot;&gt;previous benchmark&lt;/a&gt;) using &lt;a href=&quot;https://uwsgi-docs.readthedocs.org/&quot;&gt;uwsgi&lt;/a&gt;, so a 50% improvement. The time required to compute aggregations on new measures is now also metered and displayed in the &lt;code&gt;gnocchi-metricd&lt;/code&gt; log in debug mode. Handy to have an idea of how fast your measures are processed.&lt;/p&gt;
&lt;h2&gt;Ceph backend optimization&lt;/h2&gt;
&lt;p&gt;The Ceph back-end has been improved again by Mehdi. We&apos;re now relying on OMAP rather than xattr for finer-grained control and better performance.&lt;/p&gt;
&lt;p&gt;We already have a few new features being prepared for our next release, so stay tuned! And if you have any suggestion, feel free to say a word.&lt;/p&gt;
</content:encoded></item><item><title>Pifpaf, or how to run any daemon briefly</title><link>https://julien.danjou.info/blog/pifpaf-a-tool-to-run-daemon-briefly/</link><guid isPermaLink="true">https://julien.danjou.info/blog/pifpaf-a-tool-to-run-daemon-briefly/</guid><description>There&apos;s a lot of situation where you end up needing a software deployed temporarily. This can happen when testing something manually, when running a script or when launching a test suite.  Indeed, man</description><pubDate>Fri, 08 Apr 2016 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;There&apos;s a lot of situation where you end up needing a software deployed temporarily. This can happen when testing something manually, when running a script or when launching a test suite.&lt;/p&gt;
&lt;p&gt;Indeed, many applications need to use and interconnect with external software: an RDBMS (&lt;a href=&quot;http://postgressql.org&quot;&gt;PostgreSQL&lt;/a&gt;, &lt;a href=&quot;http://mysql.org&quot;&gt;MySQL&lt;/a&gt;…), a cache (&lt;a href=&quot;http://memcached.org&quot;&gt;memcached&lt;/a&gt;, &lt;a href=&quot;http://redis.io&quot;&gt;Redis&lt;/a&gt;…), or any other external component. This tends to make running a piece of software (or its test suite) more difficult. If you rely on this component being installed and deployed, you end up needing a full environment, set up and properly configured, to run your tests. Which is discouraging.&lt;/p&gt;
&lt;p&gt;The different &lt;a href=&quot;http://openstack.org&quot;&gt;OpenStack&lt;/a&gt; projects I work on pretty soon ended up spawning some of their back-ends temporarily to run their tests. Some of those unit tests somehow became entirely what you would call functional or integration tests. But that&apos;s just a name. In the end, what we ended up doing is testing that the software was really working. And there&apos;s no better way of doing that than talking to a real PostgreSQL instance rather than mocking every call.&lt;/p&gt;
&lt;h2&gt;Pifpaf to the rescue&lt;/h2&gt;
&lt;p&gt;To solve that issue, I created a new tool, named &lt;em&gt;&lt;a href=&quot;https://github.com/jd/pifpaf&quot;&gt;Pifpaf&lt;/a&gt;&lt;/em&gt;. &lt;em&gt;Pifpaf&lt;/em&gt; makes it easy to run any daemon in test mode for a brief moment, before making it disappear completely. It&apos;s pretty easy to install, as &lt;a href=&quot;http://pypi.python.org/pypi/pifpaf&quot;&gt;it is available on PyPI&lt;/a&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;$ pip install pifpaf
Collecting pifpaf
[…]
Installing collected packages: pifpaf
Successfully installed pifpaf-0.0.7
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;You can then use it to run any of the listed daemons:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;$ pifpaf list
+---------------+
| Daemons       |
+---------------+
| redis         |
| postgresql    |
| mongodb       |
| zookeeper     |
| aodh          |
| influxdb      |
| ceph          |
| elasticsearch |
| etcd          |
| mysql         |
| memcached     |
| rabbitmq      |
| gnocchi       |
+---------------+
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;em&gt;Pifpaf&lt;/em&gt; accepts any shell command line to execute after its arguments:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;$ pifpaf run postgresql -- psql
Expanded display is used automatically.
Line style is unicode.
SET
psql (9.5.2)
Type &quot;help&quot; for help.

template1=# \l
                              List of databases
   Name    │ Owner │ Encoding │   Collate   │    Ctype    │ Access privileges
───────────┼───────┼──────────┼─────────────┼─────────────┼───────────────────
 postgres  │ jd    │ UTF8     │ en_US.UTF-8 │ en_US.UTF-8 │
 template0 │ jd    │ UTF8     │ en_US.UTF-8 │ en_US.UTF-8 │ =c/jd            ↵
           │       │          │             │             │ jd=CTc/jd
 template1 │ jd    │ UTF8     │ en_US.UTF-8 │ en_US.UTF-8 │ =c/jd            ↵
           │       │          │             │             │ jd=CTc/jd
(3 rows)

template1=# create database foobar;
CREATE DATABASE
template1=# \l
                              List of databases
   Name    │ Owner │ Encoding │   Collate   │    Ctype    │ Access privileges
───────────┼───────┼──────────┼─────────────┼─────────────┼───────────────────
 foobar    │ jd    │ UTF8     │ en_US.UTF-8 │ en_US.UTF-8 │
 postgres  │ jd    │ UTF8     │ en_US.UTF-8 │ en_US.UTF-8 │
 template0 │ jd    │ UTF8     │ en_US.UTF-8 │ en_US.UTF-8 │ =c/jd            ↵
           │       │          │             │             │ jd=CTc/jd
 template1 │ jd    │ UTF8     │ en_US.UTF-8 │ en_US.UTF-8 │ =c/jd            ↵
           │       │          │             │             │ jd=CTc/jd
(4 rows)

template1=# \q
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;What &lt;em&gt;pifpaf&lt;/em&gt; does is run the different commands needed to create a new PostgreSQL cluster and then start PostgreSQL on a temporary port for you. So your &lt;em&gt;psql&lt;/em&gt; session actually connects to a temporary PostgreSQL server, which is trashed as soon as you quit &lt;em&gt;psql&lt;/em&gt;. And all of that in less than 10 seconds, without any virtualization or container technology!&lt;/p&gt;
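&lt;p&gt;When driving a test suite instead of an interactive shell, the spawned program needs to know where the temporary daemon listens. A minimal sketch, assuming the &lt;code&gt;PIFPAF_URL&lt;/code&gt; environment variable that Pifpaf exports to its child command (the fallback URL below is made up for illustration):&lt;/p&gt;

```python
import os

def database_url(default="postgresql://localhost/test"):
    """Return the connection URL of the Pifpaf-spawned daemon.

    PIFPAF_URL is assumed to be set in the child command's environment
    by `pifpaf run postgresql -- COMMAND`; fall back to a local default
    when running outside Pifpaf.
    """
    return os.environ.get("PIFPAF_URL", default)
```

&lt;p&gt;A test suite launched as &lt;code&gt;pifpaf run postgresql -- python -m unittest&lt;/code&gt; can then connect to the throwaway server through that URL, and to a local default when run directly.&lt;/p&gt;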
&lt;p&gt;You can see what it does in detail using the &lt;em&gt;debug&lt;/em&gt; mode:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;$ pifpaf --debug run mysql $SHELL
DEBUG: pifpaf.drivers: executing: [&apos;mysqld&apos;, &apos;--initialize-insecure&apos;, &apos;--datadir=/var/folders/7k/pwdhb_mj2cv4zyr0kyrlzjx40000gq/T/tmpkut9bg&apos;]
DEBUG: pifpaf.drivers: executing: [&apos;mysqld&apos;, &apos;--datadir=/var/folders/7k/pwdhb_mj2cv4zyr0kyrlzjx40000gq/T/tmpkut9bg&apos;, &apos;--pid-file=/var/folders/7k/pwdhb_mj2cv4zyr0kyrlzjx40000gq/T/tmpkut9bg/mysql.pid&apos;, &apos;--socket=/var/folders/7k/pwdhb_mj2cv4zyr0kyrlzjx40000gq/T/tmpkut9bg/mysql.socket&apos;, &apos;--skip-networking&apos;, &apos;--skip-grant-tables&apos;]
DEBUG: pifpaf.drivers: executing: [&apos;mysql&apos;, &apos;--no-defaults&apos;, &apos;-S&apos;, &apos;/var/folders/7k/pwdhb_mj2cv4zyr0kyrlzjx40000gq/T/tmpkut9bg/mysql.socket&apos;, &apos;-e&apos;, &apos;CREATE DATABASE test;&apos;]
[…]
$ exit
[…]
DEBUG: pifpaf.drivers: mysqld output: 2016-04-08T08:52:04.202143Z 0 [Note] InnoDB: Starting shutdown...
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;em&gt;Pifpaf&lt;/em&gt; also supports my pet project &lt;a href=&quot;http://launchpad.net/gnocchi&quot;&gt;Gnocchi&lt;/a&gt;, so you can run and try that timeseries database in a snap:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;$ pifpaf run gnocchi $SHELL
$ gnocchi metric create
+------------------------------------+-----------------------------------------------------------------------+
| Field                              | Value                                                                 |
+------------------------------------+-----------------------------------------------------------------------+
| archive_policy/aggregation_methods | std, count, 95pct, min, max, sum, median, mean                        |
| archive_policy/back_window         | 0                                                                     |
| archive_policy/definition          | - points: 12, granularity: 0:05:00, timespan: 1:00:00                 |
|                                    | - points: 24, granularity: 1:00:00, timespan: 1 day, 0:00:00          |
|                                    | - points: 30, granularity: 1 day, 0:00:00, timespan: 30 days, 0:00:00 |
| archive_policy/name                | low                                                                   |
| created_by_project_id              | admin                                                                 |
| created_by_user_id                 | admin                                                                 |
| id                                 | ff825d33-c8c8-46d4-b696-4b1e8f84a871                                  |
| name                               | None                                                                  |
| resource/id                        | None                                                                  |
+------------------------------------+-----------------------------------------------------------------------+
$ exit
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;And it takes less than 10 seconds to launch Gnocchi on my laptop using &lt;em&gt;pifpaf&lt;/em&gt;. I&apos;m then able to play with the &lt;code&gt;gnocchi&lt;/code&gt; command line tool. It&apos;s by far faster than using OpenStack &lt;a href=&quot;http://devstack.org&quot;&gt;devstack&lt;/a&gt; to deploy the whole software stack.&lt;/p&gt;
&lt;h2&gt;Using &lt;em&gt;pifpaf&lt;/em&gt; with your test suite&lt;/h2&gt;
&lt;p&gt;We leverage &lt;em&gt;Pifpaf&lt;/em&gt; in several of our OpenStack telemetry-related projects now, and even in &lt;a href=&quot;http://launchpad.net/tooz&quot;&gt;tooz&lt;/a&gt;. For example, to run unit/functional tests with a &lt;em&gt;memcached&lt;/em&gt; server available, a &lt;code&gt;tox.ini&lt;/code&gt; file should look like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;[testenv:py27-memcached]
commands = pifpaf run memcached -- python setup.py testr
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The tests can then use the environment variable &lt;code&gt;PIFPAF_MEMCACHED_PORT&lt;/code&gt; to connect to &lt;em&gt;memcached&lt;/em&gt; and run tests using it. As soon as the tests are finished, &lt;em&gt;memcached&lt;/em&gt; is killed by &lt;em&gt;pifpaf&lt;/em&gt; and the temporary data are trashed.&lt;/p&gt;
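&lt;p&gt;As a quick sketch of what such a test setup could look like (the helper function and the fallback port below are mine and purely illustrative, not code from any of these projects), the tests only have to read the environment variable that &lt;em&gt;pifpaf&lt;/em&gt; exports:&lt;/p&gt;

```python
import os

# pifpaf exports PIFPAF_MEMCACHED_PORT for the command it wraps; fall
# back to memcached's default port (11211) so this sketch also runs
# outside of pifpaf. Both the helper and the fallback are illustrative
# assumptions, not pifpaf or tooz code.
def memcached_address():
    port = int(os.environ.get("PIFPAF_MEMCACHED_PORT", 11211))
    return ("127.0.0.1", port)

host, port = memcached_address()
print(host, port)
```

A test would then connect to that address, knowing the server is temporary and will be trashed by &lt;em&gt;pifpaf&lt;/em&gt; once the run ends.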
&lt;p&gt;We have already moved a few OpenStack projects to &lt;em&gt;Pifpaf&lt;/em&gt;, and I&apos;m planning to make use of it in a few more. My fellow developer &lt;a href=&quot;http://sileht.net&quot;&gt;Mehdi Abaakouk&lt;/a&gt; added support for &lt;a href=&quot;http://rabbitmq.com&quot;&gt;RabbitMQ&lt;/a&gt; in &lt;em&gt;Pifpaf&lt;/em&gt; and &lt;a href=&quot;https://review.openstack.org/#/c/301771&quot;&gt;added support for more advanced tests&lt;/a&gt; in &lt;a href=&quot;http://launchpad.net/oslo.messaging&quot;&gt;oslo.messaging&lt;/a&gt; (such as failure scenarios) using &lt;em&gt;Pifpaf&lt;/em&gt;.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Pifpaf&lt;/em&gt; is a very small and handy tool. Give it a try and let me know how it works for you!&lt;/p&gt;
</content:encoded></item><item><title>The OpenStack Schizophrenia</title><link>https://julien.danjou.info/blog/openstack-schizophrenia/</link><guid isPermaLink="true">https://julien.danjou.info/blog/openstack-schizophrenia/</guid><description>When I started contributing to OpenStack, almost five years ago, it was a small ecosystem. There was no foundation, only a handful of projects, and you could understand the code base in a few days.</description><pubDate>Wed, 30 Mar 2016 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;When I started contributing to &lt;a href=&quot;http://openstack.org&quot;&gt;OpenStack&lt;/a&gt;, almost five years ago, it was a small ecosystem. There was no foundation, only a handful of projects, and you could understand the code base in a few days.&lt;/p&gt;
&lt;p&gt;Fast forward to 2016, and it is a totally different beast. The project grew to &lt;a href=&quot;http://governance.openstack.org/reference/projects/index.html&quot;&gt;no less than 54 teams&lt;/a&gt;, each team providing one or more deliverables. For example, the Nova and Swift teams each produce one service and its client, whereas the Telemetry team produces 3 services and 3 different clients.&lt;/p&gt;
&lt;p&gt;In 5 years, OpenStack went from a few &lt;a href=&quot;https://en.wikipedia.org/wiki/Infrastructure_as_a_service&quot;&gt;IaaS&lt;/a&gt; projects to 54 different teams tackling different areas related to cloud computing. Once upon a time, OpenStack was all about starting some virtual machines on a network, backed by images and volumes. Nowadays, it&apos;s also about orchestrating your network deployment over containers, while managing your application life-cycle using a database service, everything being metered and billed for.&lt;/p&gt;
&lt;p&gt;This exponential growth was made possible by the decision of the &lt;a href=&quot;http://governance.openstack.org/reference/charter.html&quot;&gt;OpenStack Technical Committee&lt;/a&gt; to open the gates with &lt;a href=&quot;http://governance.openstack.org/resolutions/20141202-project-structure-reform-spec.html&quot;&gt;the project structure reform voted at the end of 2014&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;This amendment did away with the old OpenStack model of &quot;integrated projects&quot; (i.e. Nova, Glance, Swift…). The big tent, as it&apos;s called, allowed OpenStack to land new projects every month, growing from the 20 project teams of December 2014 to the 54 we have today – multiplying the number of projects by 2.7 in a little more than a year.&lt;/p&gt;
&lt;p&gt;Amazing growth, right?&lt;/p&gt;
&lt;p&gt;And this was clearly a good change. I sat at the Technical Committee in 2013, when projects were trying to apply to be &quot;integrated&quot;, after Ceilometer and Heat were. It was painful to see how the Technical Committee was trying to assess whether new projects should be brought in or not.&lt;/p&gt;
&lt;p&gt;But what I notice these days is how OpenStack is still stuck between its old and new models. On one side, it accepted a lot of new teams, but on the other side, many are considered second-class citizens. Efforts are still being made to build an OpenStack project that does not exist anymore.&lt;/p&gt;
&lt;p&gt;For example, there is a team named &lt;a href=&quot;https://github.com/openstack/defcore&quot;&gt;DefCore&lt;/a&gt; trying to define what OpenStack core is, i.e. which projects are, somehow, actually OpenStack. This leads to weird situations, &lt;a href=&quot;http://lists.openstack.org/pipermail/openstack-dev/2016-March/090214.html&quot;&gt;such as non-DefCore projects seeing their documentation rejected from installation guides&lt;/a&gt;.&lt;br /&gt;
Again, &lt;a href=&quot;http://lists.openstack.org/pipermail/openstack-dev/2016-March/090231.html&quot;&gt;I reiterated my proposal&lt;/a&gt; to publish documentation as part of each project&apos;s code, to solve that unfair situation and put everything on a level playing field.&lt;/p&gt;
&lt;p&gt;Some cross-project specs are also pushed without the involvement of all OpenStack projects. For example, the &lt;a href=&quot;https://specs.openstack.org/openstack/openstack-specs/specs/deprecate-cli.html&quot;&gt;deprecate-cli&lt;/a&gt; spec, which proposes to deprecate the command-line interface tools provided by each project, made a lot of sense in the old OpenStack model, where the goal was to build a unified and ubiquitous cloud platform. But now that there are tens of projects with largely different scopes, it starts making less sense. Still, this spec was merged by the OpenStack Technical Committee this cycle. Keystone is the first project to proudly force users to rely on&lt;br /&gt;
&lt;a href=&quot;http://docs.openstack.org/developer/python-openstackclient/&quot;&gt;openstack-client&lt;/a&gt;, removing its old &lt;code&gt;keystone&lt;/code&gt; command line tool. I find it odd to push that spec when it&apos;s pretty clear that some projects (e.g. Swift, Gnocchi…) have no intention of going down that path.&lt;/p&gt;
&lt;p&gt;Unfortunately, most specs pushed by the Technical Committee are in the realm of wishful thinking. It somehow makes sense, since only a few of the members are actively contributing to OpenStack projects, and they can&apos;t by themselves implement all of that magically. But OpenStack is no exception in the free software world and remains a do-ocracy.&lt;/p&gt;
&lt;p&gt;There is good cross-project content in OpenStack, such as &lt;a href=&quot;https://wiki.openstack.org/wiki/API_Working_Group&quot;&gt;the API working group&lt;/a&gt;. While the work done should probably not be OpenStack specific, there&apos;s a lot that teams have learned by building various HTTP REST API with different frameworks. Compiling this knowledge and offering it as a guidance to various teams is a great help.&lt;/p&gt;
&lt;p&gt;My fellow developer &lt;a href=&quot;https://anticdent.org&quot;&gt;Chris Dent&lt;/a&gt; wrote a post about &lt;a href=&quot;https://anticdent.org/if-i-were-on-the-openstack-tc.html&quot;&gt;what he would do on the Technical Committee&lt;/a&gt;.&lt;br /&gt;
In this article, he points to a lot of the shortcomings I described here, and his confusion about whether OpenStack is a product or a kit is quite understandable. Indeed, the message broadcast by OpenStack is still very confusing after the big tent opened things up. There&apos;s not enough user experience improvement being done.&lt;/p&gt;
&lt;p&gt;The OpenStack Technical Committee election is open for April 2016, and from what I have read so far, many candidates are proposing to clean up the big tent, kicking out projects that no longer match certain criteria. This is probably a good idea, as there are some inactive projects lying around. But I don&apos;t think that will be enough to solve the identity crisis that OpenStack is experiencing.&lt;/p&gt;
&lt;p&gt;So this is why, once again this cycle, I will throw my hat in the ring and submit my candidacy for OpenStack Technical Committee.&lt;/p&gt;
</content:encoded></item><item><title>Gnocchi 2.0 release</title><link>https://julien.danjou.info/blog/gnocchi-2-0-release/</link><guid isPermaLink="true">https://julien.danjou.info/blog/gnocchi-2-0-release/</guid><description>Gnocchi 2.0 is out with major new features including a Grafana datasource, Ceph storage driver, and a revamped REST API.</description><pubDate>Fri, 19 Feb 2016 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;A little more than 3 months after our latest minor release, here is the new major version of Gnocchi, stamped &lt;a href=&quot;https://launchpad.net/gnocchi/2.0/2.0.0&quot;&gt;2.0.0&lt;/a&gt;. It contains a lot of new and exciting features, and I&apos;d like to talk about some of them to celebrate!&lt;/p&gt;
&lt;p&gt;You may notice that this release happens in the middle of the OpenStack release cycle. Indeed, Gnocchi does not follow that 6-month cycle, and we release whenever our code is ready. That forces us to have a more iterative approach, less disruptive for other projects, and allows us to achieve a higher velocity, applying the good old mantra &lt;em&gt;release early, release often&lt;/em&gt;.&lt;/p&gt;
&lt;h2&gt;Documentation&lt;/h2&gt;
&lt;p&gt;This version features a large documentation update. Gnocchi is still the only OpenStack server project that implements a &quot;no doc, no merge&quot; policy, meaning any code must come with the documentation addition or change included in the patch. The full documentation is included in the source code and available online at &lt;a href=&quot;http://gnocchi.xyz/&quot;&gt;gnocchi.xyz&lt;/a&gt;.&lt;/p&gt;
&lt;h2&gt;Data split &amp;amp; compression&lt;/h2&gt;
&lt;p&gt;I&apos;ve already covered this change extensively in &lt;a href=&quot;https://julien.danjou.info/blog/gnocchi-carbonara-timeseries-compression&quot;&gt;my last blog about timeseries compression&lt;/a&gt;. Long story short, Gnocchi now splits timeseries archives in small chunks that are compressed, increasing speed and decreasing data size.&lt;/p&gt;
&lt;h2&gt;Measures batching support&lt;/h2&gt;
&lt;p&gt;Gnocchi now supports batching, which allows submitting several measures for different metrics in a single request. This is especially useful when your application tends to cache measures for a while and is able to send them in a batch. Usage is &lt;a href=&quot;http://gnocchi.xyz/rest.html#measures-batching&quot;&gt;fully documented for the REST API&lt;/a&gt;.&lt;/p&gt;
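&lt;p&gt;To give a rough idea of the shape of such a request (the metric names and payload layout below are illustrative assumptions of mine; the REST documentation linked above is the authoritative reference), a single batch can carry measures for several metrics:&lt;/p&gt;

```python
import json

# Illustrative batch payload: measures for two metrics in one request.
# Metric identifiers and field names are assumptions for this sketch;
# check the Gnocchi REST documentation for the real format.
batch = {
    "metric-a": [
        {"timestamp": "2016-02-19T10:00:00", "value": 42.0},
        {"timestamp": "2016-02-19T10:05:00", "value": 43.5},
    ],
    "metric-b": [
        {"timestamp": "2016-02-19T10:00:00", "value": 7.0},
    ],
}

# One POST of this body replaces three separate measure submissions.
body = json.dumps(batch)
print(len(batch), sum(len(v) for v in batch.values()))
```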
&lt;h2&gt;Group by support in aggregation&lt;/h2&gt;
&lt;p&gt;One of the most requested features was the ability to do measure aggregation on resources, using a group-by type of query. This is now possible using the &lt;a href=&quot;http://gnocchi.xyz/rest.html#aggregation-across-metrics&quot;&gt;new &lt;code&gt;groupby&lt;/code&gt; parameter to aggregation queries&lt;/a&gt;.&lt;/p&gt;
&lt;h2&gt;Ceph backend optimization&lt;/h2&gt;
&lt;p&gt;We improved the Ceph back-end a lot. Mehdi Abaakouk wrote a new Python binding for Ceph, called &lt;a href=&quot;https://github.com/sileht/pycradox&quot;&gt;Cradox&lt;/a&gt;, that is going to replace the current Python rados module in subsequent Ceph releases. Gnocchi makes use of this new module to speed things up, making the Ceph-based driver much, much faster than before. We also implemented asynchronous data deletion, which improves performance a bit.&lt;/p&gt;
&lt;p&gt;The next step will be to run some new benchmarks &lt;a href=&quot;https://julien.danjou.info/blog/gnocchi-benchmarks&quot;&gt;like I did a few months ago&lt;/a&gt; and compare with the Gnocchi 1.3 series. Stay tuned!&lt;/p&gt;
</content:encoded></item><item><title>Timeseries storage and data compression</title><link>https://julien.danjou.info/blog/gnocchi-carbonara-timeseries-compression/</link><guid isPermaLink="true">https://julien.danjou.info/blog/gnocchi-carbonara-timeseries-compression/</guid><description>The first major version of the scalable timeseries database I work on, Gnocchi, was released a few months ago. In this first iteration, it took a rather naive approach to data storage.</description><pubDate>Mon, 15 Feb 2016 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;The first major version of the scalable timeseries database I work on, &lt;a href=&quot;http://gnocchi.xyz&quot;&gt;Gnocchi&lt;/a&gt;, was released a few months ago. In this first iteration, it took a rather naive approach to data storage. We had little idea of whether and how our distributed back-ends were going to be heavily used, so we stuck to the code of the first proof-of-concept written a couple of years ago.&lt;/p&gt;
&lt;p&gt;Recently we got more feedback from our users and ran a few &lt;a href=&quot;https://julien.danjou.info/blog/gnocchi-benchmarks&quot;&gt;benchmarks&lt;/a&gt;. That gave us enough input to start improving our storage strategy.&lt;/p&gt;
&lt;h2&gt;Data split&lt;/h2&gt;
&lt;p&gt;Up to Gnocchi 1.3, all data for a single metric are stored in a single gigantic file per aggregation method (&lt;em&gt;min&lt;/em&gt;, &lt;em&gt;max&lt;/em&gt;, &lt;em&gt;average&lt;/em&gt;…). This means that the file can grow to several megabytes in size, which makes it slow to manipulate. For the next version of Gnocchi, our first task has been to rework that storage and split the data into smaller parts.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/gnocchi-carbonara-split.png&quot; alt=&quot;gnocchi-carbonara-split&quot; /&gt;&lt;/p&gt;
&lt;p&gt;The diagram above shows how data are organized inside Gnocchi. Until version 1.3, there would have been only one file for each aggregation method.&lt;/p&gt;
&lt;p&gt;In the upcoming 2.0 version, Gnocchi will split all these data into smaller parts, where each data split is stored in a file/object. This makes it possible to manipulate smaller pieces of data and to increase the parallelism of the CRUD operations on the back-end – leading to large speed improvements.&lt;/p&gt;
&lt;p&gt;In order to split timeseries into several chunks, Gnocchi defines a maximum number of N points to keep per chunk, to limit their maximum size. It then defines a hash function that produces a non-unique key for any timestamp, which makes it easy to find in which chunk any timestamp should be stored or retrieved.&lt;/p&gt;
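&lt;p&gt;The idea can be sketched in a few lines (a toy version with names of my own choosing, not Gnocchi&apos;s actual code): with one aggregated point every &lt;code&gt;interval&lt;/code&gt; seconds and at most N points per chunk, the key is simply the timestamp rounded down to a window of N × interval seconds.&lt;/p&gt;

```python
# Toy version of the chunk-key idea described above: each chunk covers
# points_per_chunk * interval seconds, and the key for a timestamp is
# the start of the window it falls into.
def chunk_key(timestamp, interval, points_per_chunk):
    chunk_span = interval * points_per_chunk
    return timestamp - (timestamp % chunk_span)

# With 5-minute aggregates and 12 points per chunk (1-hour windows),
# nearby timestamps share a chunk...
assert chunk_key(41230, 300, 12) == chunk_key(41235, 300, 12) == 39600
# ...and a timestamp one hour later lands in the next chunk.
assert chunk_key(41230 + 3600, 300, 12) == 39600 + 3600
```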
&lt;h2&gt;Data compression&lt;/h2&gt;
&lt;p&gt;Up to Gnocchi 1.3, the data stored for each metric is simply serialized using &lt;a href=&quot;http://msgpack.org&quot;&gt;msgpack&lt;/a&gt;, a fast and small serialization format. Though, this format does not provide any compression. That means that storing data points needs 8 bytes for a timestamp (64 bits timestamp with nanosecond precision) and 8 bytes for a value (64 bits double-precision floating-point), plus some overhead (extra information and &lt;em&gt;msgpack&lt;/em&gt; itself).&lt;/p&gt;
&lt;p&gt;After looking around at how to compress all these measures, I stumbled upon a paper from some &lt;a href=&quot;http://facebook.com&quot;&gt;Facebook&lt;/a&gt; engineers about Gorilla, their in-memory timeseries database, entitled &quot;&lt;em&gt;&lt;a href=&quot;http://www.vldb.org/pvldb/vol8/p1816-teller.pdf&quot;&gt;Gorilla: A Fast, Scalable, In-Memory Time Series Database&lt;/a&gt;&lt;/em&gt;&quot;. For reference, part of this encoding is also used by &lt;a href=&quot;https://docs.influxdata.com/influxdb/v0.9/concepts/storage_engine/&quot;&gt;InfluxDB&lt;/a&gt; in its new storage engine.&lt;/p&gt;
&lt;p&gt;The first technique I implemented is easy enough, and it&apos;s inspired by delta-of-delta encoding. Instead of storing each timestamp for each data point, and since all the data points are aggregated on a regular interval, we transpose points to be the time difference divided by the interval. For example, the series of timestamps &lt;code&gt;timestamps = [41230, 41235, 41240, 41250, 41255]&lt;/code&gt; is encoded into &lt;code&gt;timestamps = [41230, 1, 1, 2, 1], interval = 5&lt;/code&gt;. This allows regular compression algorithms to reduce the size of the integer list using &lt;a href=&quot;https://en.wikipedia.org/wiki/Run-length_encoding&quot;&gt;run-length encoding&lt;/a&gt;.&lt;/p&gt;
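&lt;p&gt;The example above can be reproduced with a toy encoder (again, an illustration of the idea, not the Carbonara code):&lt;/p&gt;

```python
# Toy version of the timestamp encoding described above: keep the
# first timestamp, then store each gap divided by the fixed interval.
def encode_timestamps(timestamps, interval):
    deltas = [(b - a) // interval
              for a, b in zip(timestamps, timestamps[1:])]
    return [timestamps[0]] + deltas

def decode_timestamps(encoded, interval):
    out = [encoded[0]]
    for delta in encoded[1:]:
        out.append(out[-1] + delta * interval)
    return out

timestamps = [41230, 41235, 41240, 41250, 41255]
encoded = encode_timestamps(timestamps, 5)
# The long, similar integers become mostly small repeated values,
# which run-length encoding then compresses well.
assert encoded == [41230, 1, 1, 2, 1]
assert decode_timestamps(encoded, 5) == timestamps
```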
&lt;p&gt;To actually compress the values, I tried two different algorithms:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&quot;https://en.wikipedia.org/wiki/LZ4_(compression_algorithm)&quot;&gt;LZ4&lt;/a&gt;, a fast compression/decompression algorithm&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;The XOR based compression scheme described in the Gorilla paper mentioned above – which &lt;a href=&quot;https://gist.github.com/jd/b0aa5cbfa42f4eb23eb9&quot;&gt;I had to implement myself&lt;/a&gt;. For reference, a &lt;a href=&quot;http://golang.org&quot;&gt;Go&lt;/a&gt; implementation also exists in &lt;a href=&quot;https://github.com/dgryski/go-tsz&quot;&gt;go-tsz&lt;/a&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;I then benchmarked these solutions:&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/gnocchi-carbonara-compression-speed.png&quot; alt=&quot;gnocchi-carbonara-compression-speed&quot; /&gt;&lt;/p&gt;
&lt;p&gt;The XOR algorithm implemented in Python is pretty slow compared to LZ4. Truth is that &lt;a href=&quot;https://github.com/steeve/python-lz4&quot;&gt;python-lz4&lt;/a&gt; is fully implemented in C, which makes it fast. I profiled my XOR implementation in Python and discovered that one operation took 20 % of the time: &lt;code&gt;count_lead_and_trail_zeroes&lt;/code&gt;, which is in charge of counting the number of leading and trailing zeroes in a binary number.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/gnocchi-carbonara-xor-profiling.png&quot; alt=&quot;gnocchi-carbonara-xor-profiling&quot; /&gt;&lt;/p&gt;
&lt;p&gt;I tried 2 Python implementations of the same algorithm (and submitted them to my friend and Python developer &lt;a href=&quot;http://haypo-notes.readthedocs.org/&quot;&gt;Victor Stinner&lt;/a&gt; by the way).&lt;/p&gt;
&lt;p&gt;The first version, using string search with &lt;code&gt;.index()&lt;/code&gt;, is 10× faster than the second one, which only does integer computation. Ah, Python… As Victor explained, each Python operation is slow and there are a lot of them in the second version, whereas &lt;code&gt;.index()&lt;/code&gt; is implemented in C, really well optimized, and only needs 2 Python operations.&lt;/p&gt;
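&lt;p&gt;To make the comparison concrete, here are toy versions of the two approaches, reduced to counting leading zeroes in a 64-bit word (the real function also counts trailing zeroes; these are my own simplified reconstructions, not the code that was benchmarked):&lt;/p&gt;

```python
# String-based version: bin() and .index() do the heavy lifting in C,
# so Python itself executes only a couple of operations.
def leading_zeroes_str(value):
    if value == 0:
        return 64
    return bin(value)[2:].zfill(64).index("1")

# Arithmetic version: one Python-level operation per examined bit,
# which is why this style ends up much slower in pure Python.
def leading_zeroes_int(value):
    if value == 0:
        return 64
    count = 0
    for shift in range(63, -1, -1):
        if (value >> shift) & 1:
            break
        count += 1
    return count

# Both agree on a few sample values.
for v in (1, 2**10, 2**63, 123456789):
    assert leading_zeroes_str(v) == leading_zeroes_int(v)
```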
&lt;p&gt;Finally, I ended up optimizing that code by leveraging &lt;a href=&quot;https://cffi.readthedocs.org/en/latest/&quot;&gt;cffi&lt;/a&gt; to call &lt;code&gt;ffsll()&lt;/code&gt; and &lt;code&gt;flsll()&lt;/code&gt; directly. That decreased the run-time of &lt;code&gt;count_lead_and_trail_zeroes&lt;/code&gt; by 45 %, increasing the speed of the entire XOR compression code by a modest 7 %. This is not enough to catch up with LZ4&apos;s speed. At this stage, the only way to achieve high speed would probably be to go with a full C implementation.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/gnocchi-carbonara-compression-size.png&quot; alt=&quot;gnocchi-carbonara-compression-size&quot; /&gt;&lt;/p&gt;
&lt;p&gt;Considering the compression ratio of the different algorithms, they are pretty much identical. The worst case scenario (random values) for LZ4 compresses down to 9 bytes per data point, whereas XOR can go down to 7.38 bytes per data point. In general, XOR encoding beats LZ4 by 15 %, except for cases where all values are 0 or 1. However, LZ4 is faster than XOR by a factor of 4×-70× depending on the case.&lt;/p&gt;
&lt;p&gt;That means that we&apos;ll use LZ4 for data compression in Gnocchi 2.0. It&apos;s possible that we could achieve an equally fast compression/decompression implementation, but I don&apos;t think it&apos;s worth the effort right now – it&apos;d represent a lot of code to write and to maintain.&lt;/p&gt;
</content:encoded></item><item><title>FOSDEM 2016, recap</title><link>https://julien.danjou.info/blog/fosdem-2016-recap/</link><guid isPermaLink="true">https://julien.danjou.info/blog/fosdem-2016-recap/</guid><description>Last week-end, I was in Brussels, Belgium for FOSDEM, one of the greatest open source developer conferences.</description><pubDate>Sat, 06 Feb 2016 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Last week-end, I was in Brussels, Belgium for &lt;a href=&quot;http://fosdem.org&quot;&gt;FOSDEM&lt;/a&gt;, one of the greatest open source developer conferences. I was not sure I would go this year (I already skipped it in 2015), but it turned out I was asked to give a talk in the shared &lt;a href=&quot;https://fosdem.org/2016/schedule/track/lua/&quot;&gt;Lua&lt;/a&gt; &amp;amp; &lt;a href=&quot;https://fosdem.org/2016/schedule/track/gnu_guile/&quot;&gt;GNU Guile&lt;/a&gt; devroom.&lt;/p&gt;
&lt;p&gt;Since I am a long-time &lt;a href=&quot;http://lua.org&quot;&gt;Lua&lt;/a&gt; user and developer, and have followed &lt;a href=&quot;http://www.gnu.org/software/guile/&quot;&gt;GNU Guile&lt;/a&gt; for several years, the organizers asked me to give a talk that would be a link between the two languages.&lt;/p&gt;
&lt;p&gt;I entitled my talk &quot;How awesome ended up with Lua and not Guile&quot; and gave it to a room full of interested users of the awesome window manager 🙂.&lt;/p&gt;
&lt;p&gt;We continued with a panel discussion entitled &quot;&lt;a href=&quot;https://fosdem.org/2016/schedule/event/future_guile_lua/&quot;&gt;The future of small languages: Experience of Lua and Guile&lt;/a&gt;&quot; composed of Andy Wingo, Christopher Webber, Ludovic Courtès, Etiene Dalcol, Hisham Muhammad and myself. It was a pretty interesting discussion, where both communities shared their views on the state of their languages.&lt;/p&gt;
&lt;p&gt;It was a bit awkward to talk about Lua &amp;amp; Guile when most of my knowledge was years old, but it turns out many things haven&apos;t changed. I hope I was able to provide interesting insight to both communities. Finally, it was a pretty interesting FOSDEM for me, and it had been a long time since I last gave a talk there, so I really enjoyed it. See you next year!&lt;/p&gt;
</content:encoded></item><item><title>Profiling Python using cProfile: a concrete case</title><link>https://julien.danjou.info/blog/guide-to-python-profiling-cprofile-concrete-case-carbonara/</link><guid isPermaLink="true">https://julien.danjou.info/blog/guide-to-python-profiling-cprofile-concrete-case-carbonara/</guid><description>Writing programs is fun, but making them fast can be a pain. Python programs are no exception to that, but the basic profiling toolchain is actually not that complicated to use. Here, I would like to</description><pubDate>Mon, 16 Nov 2015 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Writing programs is fun, but making them fast can be a pain. Python programs are no exception to that, but the basic profiling toolchain is actually not that complicated to use. Here, I would like to show you how you can quickly profile and analyze your Python code to find what part of the code you should optimize.&lt;/p&gt;
&lt;h2&gt;What&apos;s profiling?&lt;/h2&gt;
&lt;p&gt;Profiling a Python program is doing a dynamic analysis that measures the execution time of the program and everything that composes it. That means measuring the time spent in each of its functions. This will give you data about where your program is spending time, and what areas might be worth optimizing.&lt;/p&gt;
&lt;p&gt;It&apos;s a very interesting exercise. Many people focus on local optimizations, such as determining e.g. which of the Python functions &lt;code&gt;range&lt;/code&gt; or &lt;code&gt;xrange&lt;/code&gt; is going to be faster. It turns out that knowing which one is faster may never be an issue in your program, and that the time gained by one of the functions above might not be worth the time you spend researching that, or arguing about it with your colleague.&lt;/p&gt;
&lt;p&gt;Trying to blindly optimize a program without measuring where it is actually spending its time is a useless exercise. Following your guts alone is not always sufficient.&lt;/p&gt;
&lt;p&gt;There are many types of profiling, as there are many things you can measure. In this exercise, we&apos;ll focus on CPU utilization profiling, meaning the time spent by each function executing instructions. Obviously, we could do many more kinds of profiling and optimization, such as memory profiling, which would measure the memory used by each piece of code – something I talk about in &lt;a href=&quot;https://thehackerguidetopython.com&quot;&gt;The Hacker&apos;s Guide to Python&lt;/a&gt;.&lt;/p&gt;
&lt;h2&gt;cProfile&lt;/h2&gt;
&lt;p&gt;Since Python 2.5, Python provides a C module called &lt;em&gt;&lt;a href=&quot;https://docs.python.org/2/library/profile.html&quot;&gt;cProfile&lt;/a&gt;&lt;/em&gt; which has a reasonable overhead and offers a good enough feature set. The basic usage goes down to:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&amp;gt;&amp;gt;&amp;gt; import cProfile
&amp;gt;&amp;gt;&amp;gt; cProfile.run(&apos;2 + 2&apos;)
         2 function calls in 0.000 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    0.000    0.000 &amp;lt;string&amp;gt;:1(&amp;lt;module&amp;gt;)
        1    0.000    0.000    0.000    0.000 {method &apos;disable&apos; of &apos;_lsprof.Profiler&apos; objects}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Though you can also run a script with it, which turns out to be handy:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;$ python -m cProfile -s cumtime lwn2pocket.py
         72270 function calls (70640 primitive calls) in 4.481 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.004    0.004    4.481    4.481 lwn2pocket.py:2(&amp;lt;module&amp;gt;)
        1    0.001    0.001    4.296    4.296 lwn2pocket.py:51(main)
        3    0.000    0.000    4.286    1.429 api.py:17(request)
        3    0.000    0.000    4.268    1.423 sessions.py:386(request)
      4/3    0.000    0.000    3.816    1.272 sessions.py:539(send)
        4    0.000    0.000    2.965    0.741 adapters.py:323(send)
        4    0.000    0.000    2.962    0.740 connectionpool.py:421(urlopen)
        4    0.000    0.000    2.961    0.740 connectionpool.py:317(_make_request)
        2    0.000    0.000    2.675    1.338 api.py:98(post)
       30    0.000    0.000    1.621    0.054 ssl.py:727(recv)
       30    0.000    0.000    1.621    0.054 ssl.py:610(read)
       30    1.621    0.054    1.621    0.054 {method &apos;read&apos; of &apos;_ssl._SSLSocket&apos; objects}
        1    0.000    0.000    1.611    1.611 api.py:58(get)
        4    0.000    0.000    1.572    0.393 httplib.py:1095(getresponse)
        4    0.000    0.000    1.572    0.393 httplib.py:446(begin)
       60    0.000    0.000    1.571    0.026 socket.py:410(readline)
        4    0.000    0.000    1.571    0.393 httplib.py:407(_read_status)
        1    0.000    0.000    1.462    1.462 pocket.py:44(wrapped)
        1    0.000    0.000    1.462    1.462 pocket.py:152(make_request)
        1    0.000    0.000    1.462    1.462 pocket.py:139(_make_request)
        1    0.000    0.000    1.459    1.459 pocket.py:134(_post_request)
[…]
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This prints out all the functions called, with the time spent in each and the number of times they have been called.&lt;/p&gt;
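&lt;p&gt;The same data can also be explored programmatically with the standard &lt;code&gt;pstats&lt;/code&gt; module, which is handy when you want to sort or filter the report from code rather than from the command line (a small self-contained example, unrelated to the script profiled above):&lt;/p&gt;

```python
import cProfile
import io
import pstats

# A deliberately slow recursive function to have something to profile.
def fib(n):
    return n if n < 2 else fib(n - 1) + fib(n - 2)

profiler = cProfile.Profile()
profiler.enable()
fib(18)
profiler.disable()

# Sort by cumulative time and keep the 5 most expensive entries,
# like `-s cumtime` on the command line.
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(5)
report = stream.getvalue()
print(report)
```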
&lt;h3&gt;Advanced visualization with KCacheGrind&lt;/h3&gt;
&lt;p&gt;While useful, the output format is very basic and does not make it easy to grasp complete programs. For more advanced visualization, I leverage &lt;a href=&quot;https://kcachegrind.github.io/html/Home.html&quot;&gt;KCacheGrind&lt;/a&gt;. If you have done any C programming and profiling these last years, you may have used it, as it is primarily designed as a front-end for &lt;a href=&quot;http://valgrind.org/&quot;&gt;Valgrind&lt;/a&gt; generated call-graphs.&lt;/p&gt;
&lt;p&gt;In order to use it, you need to generate a &lt;em&gt;cProfile&lt;/em&gt; result file, then convert it to the KCacheGrind format. To do that, I use &lt;em&gt;&lt;a href=&quot;https://pypi.python.org/pypi/pyprof2calltree&quot;&gt;pyprof2calltree&lt;/a&gt;&lt;/em&gt;.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;$ python -m cProfile -o myscript.cprof myscript.py
$ pyprof2calltree -k -i myscript.cprof
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;And the KCacheGrind window magically appears!&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/kcachegrind.png&quot; alt=&quot;kcachegrind&quot; /&gt;&lt;/p&gt;
&lt;h2&gt;Concrete case: Carbonara optimization&lt;/h2&gt;
&lt;p&gt;I was curious about the performances of &lt;a href=&quot;https://git.openstack.org/cgit/openstack/gnocchi/tree/gnocchi/carbonara.py&quot;&gt;Carbonara&lt;/a&gt;, the small timeseries library I wrote for &lt;a href=&quot;http://launchpad.net/gnocchi&quot;&gt;Gnocchi&lt;/a&gt;. I decided to do some basic profiling to see if there was any obvious optimization to do.&lt;/p&gt;
&lt;p&gt;In order to profile a program, you need to run it. But running the whole program in profiling mode can generate &lt;em&gt;a lot&lt;/em&gt; of data that you don&apos;t care about, and adds noise to what you&apos;re trying to understand. Since Gnocchi has thousands of unit tests, including a few for Carbonara itself, I decided to profile the code used by these unit tests, as it&apos;s a good reflection of the library&apos;s basic features.&lt;/p&gt;
&lt;p&gt;Note that this is a good strategy for a curious and naive first-pass profiling. There&apos;s no way to make sure that the hotspots you see in the unit tests are the actual hotspots you will encounter in production. Therefore, profiling under conditions and with a scenario that mimic what is seen in production is often a necessity if you need to push your program optimization further and want to achieve perceivable, valuable gains.&lt;/p&gt;
&lt;p&gt;I activated &lt;em&gt;cProfile&lt;/em&gt; using the method described above, creating a &lt;code&gt;cProfile.Profile&lt;/code&gt; object around my tests (I actually &lt;a href=&quot;https://github.com/testing-cabal/testtools/pull/163&quot;&gt;started to implement that in testtools&lt;/a&gt;). I then ran &lt;em&gt;KCacheGrind&lt;/em&gt; as described above and used it to generate the following figures.&lt;/p&gt;
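&lt;p&gt;As a sketch of that idea (not the actual testtools patch), one can wrap a test in a small decorator that dumps a &lt;em&gt;cProfile&lt;/em&gt; result file ready to be fed to &lt;code&gt;pyprof2calltree&lt;/code&gt;. The file name and the dummy test body here are hypothetical:&lt;/p&gt;

```python
import cProfile
import functools

def profiled(output_file):
    # Profile the wrapped callable and dump the stats to a file
    # that pyprof2calltree can convert for KCacheGrind.
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            profiler = cProfile.Profile()
            profiler.enable()
            try:
                return func(*args, **kwargs)
            finally:
                profiler.disable()
                profiler.dump_stats(output_file)
        return wrapper
    return decorator

@profiled("test_fetch.cprof")
def test_fetch():
    # Stand-in for the real Carbonara test
    return sum(range(100000))

test_fetch()
```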
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/kcachegrind-carbonara-old-list.png&quot; alt=&quot;kcachegrind-carbonara-old-list&quot; /&gt;&lt;/p&gt;
&lt;p&gt;The test I profiled here is called &lt;code&gt;test_fetch&lt;/code&gt; and is pretty easy to understand: it puts data into a timeseries object, then fetches the aggregated result. The above list shows that 88 % of the ticks are spent in &lt;code&gt;set_values&lt;/code&gt; (44 ticks out of 50). This function is used to insert values into the timeseries, not to fetch them. That means inserting data is really slow, while retrieving it is pretty fast.&lt;/p&gt;
&lt;p&gt;Reading the rest of the list indicates that several functions share the rest of the ticks, &lt;code&gt;update&lt;/code&gt;, &lt;code&gt;_first_block_timestamp&lt;/code&gt;, &lt;code&gt;_truncate&lt;/code&gt;, &lt;code&gt;_resample&lt;/code&gt;, etc. Some of the functions in the list are not part of Carbonara, so there&apos;s no point in looking to optimize them. The only thing that can be optimized is, sometimes, the number of times they&apos;re called.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/kcachegrind-carbonara-old-graph.png&quot; alt=&quot;kcachegrind-carbonara-old-graph&quot; /&gt;&lt;/p&gt;
&lt;p&gt;The call graph gives me a bit more insight into what&apos;s going on here. Using my knowledge of how Carbonara works, I don&apos;t think the whole stack on the left for &lt;code&gt;_first_block_timestamp&lt;/code&gt; makes much sense. This function is supposed to find the first timestamp for an aggregate, e.g. with a timestamp of 13:34:45 and a period of 5 minutes, the function should return 13:30:00. The way it currently works is by calling the &lt;code&gt;resample&lt;/code&gt; function from Pandas on a timeseries with only one element, but that seems to be very slow. Indeed, this function currently represents 25 % of the time spent by &lt;code&gt;set_values&lt;/code&gt; (11 ticks out of 44).&lt;/p&gt;
&lt;p&gt;Fortunately, I recently added a small function called &lt;code&gt;_round_timestamp&lt;/code&gt; that does exactly what &lt;code&gt;_first_block_timestamp&lt;/code&gt; needs, without calling any Pandas function, so no &lt;code&gt;resample&lt;/code&gt;. I ended up rewriting that function this way:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;     def _first_block_timestamp(self):
-        ts = self.ts[-1:].resample(self.block_size)
-        return (ts.index[-1] - (self.block_size * self.back_window))
+        rounded = self._round_timestamp(self.ts.index[-1], self.block_size)
+        return rounded - (self.block_size * self.back_window)
&lt;/code&gt;&lt;/pre&gt;
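&lt;p&gt;For illustration, here is a hypothetical, plain-Python sketch of what such a rounding helper can look like. The real &lt;code&gt;_round_timestamp&lt;/code&gt; in Carbonara operates on Pandas timestamps, so treat this only as a model of the idea: flooring a timestamp to a multiple of the block size is pure arithmetic, with no need to build and resample a timeseries.&lt;/p&gt;

```python
import datetime

def round_timestamp(timestamp, freq):
    # Floor the timestamp down to the previous multiple of freq.
    # E.g. 13:34:45 with a 5-minute period yields 13:30:00.
    epoch = datetime.datetime(1970, 1, 1)
    seconds = (timestamp - epoch).total_seconds()
    rounded = (seconds // freq.total_seconds()) * freq.total_seconds()
    return epoch + datetime.timedelta(seconds=rounded)

ts = datetime.datetime(2015, 3, 6, 13, 34, 45)
print(round_timestamp(ts, datetime.timedelta(minutes=5)))
# 2015-03-06 13:30:00
```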
&lt;p&gt;And then I re-ran the exact same test to compare the &lt;em&gt;cProfile&lt;/em&gt; output.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/kcachegrind-carbonara-new-list.png&quot; alt=&quot;kcachegrind-carbonara-new-list&quot; /&gt;&lt;/p&gt;
&lt;p&gt;The list of functions looks quite different this time. The share of time spent in &lt;code&gt;set_values&lt;/code&gt; dropped from 88 % to 71 %.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/kcachegrind-carbonara-new-graph.png&quot; alt=&quot;kcachegrind-carbonara-new-graph&quot; /&gt;&lt;/p&gt;
&lt;p&gt;The call stack for &lt;code&gt;set_values&lt;/code&gt; shows that pretty well: the &lt;code&gt;_first_block_timestamp&lt;/code&gt; function is now so fast that it has totally disappeared from the display, being considered insignificant by the profiler.&lt;/p&gt;
&lt;p&gt;So we just sped up the whole process of inserting values into Carbonara by a nice 25 % in a few minutes. Not bad for a first naive pass, right?&lt;/p&gt;
&lt;p&gt;If you want to know more, I wrote a whole chapter about optimizing code in &lt;a href=&quot;https://scaling-python.com&quot;&gt;Scaling Python&lt;/a&gt;. Check it out!&lt;/p&gt;
</content:encoded></item><item><title>Gnocchi 1.3.0 release</title><link>https://julien.danjou.info/blog/gnocchi-1-3-0-released/</link><guid isPermaLink="true">https://julien.danjou.info/blog/gnocchi-1-3-0-released/</guid><description>Finally, Gnocchi 1.3.0 is out. This is our final release, more or less matching the OpenStack 6 months schedule, that concludes the Liberty development cycle.</description><pubDate>Wed, 04 Nov 2015 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Finally, &lt;a href=&quot;https://launchpad.net/gnocchi/trunk/1.3.0&quot;&gt;Gnocchi 1.3.0&lt;/a&gt; is out. This is our final release, more or less matching the OpenStack 6 months schedule, that concludes the Liberty development cycle.&lt;/p&gt;
&lt;p&gt;This release was supposed to go out a few weeks earlier, but our integration tests got completely blocked for several days just the week before the OpenStack Mitaka summit.&lt;/p&gt;
&lt;h2&gt;New website&lt;/h2&gt;
&lt;p&gt;We built a new dedicated website for Gnocchi at &lt;a href=&quot;http://gnocchi.xyz&quot;&gt;gnocchi.xyz&lt;/a&gt;. We want to promote Gnocchi outside of the &lt;a href=&quot;http://openstack.org&quot;&gt;OpenStack&lt;/a&gt; bubble, as it is a useful timeseries database on its own that can work without the rest of the stack. We&apos;ll try to improve the documentation. If you&apos;re curious, feel free to check it out and report anything you miss!&lt;/p&gt;
&lt;h2&gt;The speed bump&lt;/h2&gt;
&lt;p&gt;Obviously, if it had been a bug in Gnocchi that we hit, it would have been quick to fix. However, we found &lt;a href=&quot;https://bugs.launchpad.net/python-keystoneclient/+bug/1508424&quot;&gt;a nasty bug&lt;/a&gt; in Swift caused by the evil monkey-patching of Eventlet (once again) blended with a mixed usage of native threads and Eventlet threads in Swift. Shake all of that together, and you get some pretty nasty race conditions when using the Keystone middleware authentication.&lt;/p&gt;
&lt;p&gt;In the meantime, we disabled Swift multi-threading by using mod_wsgi instead of Eventlet in devstack.&lt;/p&gt;
&lt;h2&gt;New features&lt;/h2&gt;
&lt;p&gt;So what&apos;s new in this new shiny release? A few interesting things:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Metric deletion is now asynchronous. That&apos;s not the most used feature in the REST API – weirdly, people do not often delete metrics – but it&apos;s now way faster and more reliable for being asynchronous. &lt;em&gt;Metricd&lt;/em&gt; is now in charge of cleaning things up.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Speed improvement. We are now confident that Gnocchi is even faster than in the &lt;a href=&quot;https://julien.danjou.info/blog/gnocchi-benchmarks&quot;&gt;latest benchmarks I ran&lt;/a&gt; (around 1.5-2× faster), which makes Gnocchi &lt;em&gt;really&lt;/em&gt; fast with its native storage back-ends. We profiled and optimized Carbonara and the REST API data validation.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Improved &lt;em&gt;metricd&lt;/em&gt; status reporting. It now reports the size of the backlog of the whole cluster, both in its log and via the REST API. Easy monitoring!&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Ceph driver enhancements. We had people testing the Ceph driver in production, so we made a few changes and fixes to make it more solid.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;And that&apos;s all we did in the last couple of months. We have a lot of things on the roadmap that are pretty exciting, and I&apos;ll sure talk about them in the next weeks.&lt;/p&gt;
</content:encoded></item><item><title>OpenStack Summit Mitaka from a Telemetry point of view</title><link>https://julien.danjou.info/blog/openstack-summit-mitaka-tokyo-telemetry/</link><guid isPermaLink="true">https://julien.danjou.info/blog/openstack-summit-mitaka-tokyo-telemetry/</guid><description>Last week I was in Tokyo, Japan for the OpenStack Summit, discussing the new Mitaka version that will be released in 6 months.</description><pubDate>Mon, 02 Nov 2015 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Last week I was in Tokyo, Japan for the &lt;a href=&quot;https://www.openstack.org/summit/tokyo-2015/&quot;&gt;OpenStack Summit&lt;/a&gt;, discussing the new Mitaka version that will be released in 6 months.&lt;/p&gt;
&lt;p&gt;I&apos;ve attended the summit mainly to discuss and follow up on new developments in &lt;a href=&quot;http://launchpad.net/ceilometer&quot;&gt;Ceilometer&lt;/a&gt;, &lt;a href=&quot;http://launchpad.net/gnocchi&quot;&gt;Gnocchi&lt;/a&gt;, &lt;a href=&quot;http://launchpad.net/aodh&quot;&gt;Aodh&lt;/a&gt; and Oslo. It has been a pretty good week and we were able to discuss and plan a few interesting things. Below is what I found remarkable during this summit concerning those projects.&lt;/p&gt;
&lt;h2&gt;Distributed lock manager&lt;/h2&gt;
&lt;p&gt;I did not attend this session, but I need to write something about it.&lt;/p&gt;
&lt;p&gt;See, when working in a distributed environment like OpenStack, it&apos;s almost obvious that sooner or later you end up needing a distributed lock mechanism. It started to become pretty obvious, and a serious problem for us, 2 years ago in Ceilometer. Back then, we proposed the &lt;a href=&quot;https://wiki.openstack.org/wiki/Oslo/blueprints/service-sync&quot;&gt;service-sync&lt;/a&gt; blueprint and talked about it during the OpenStack Icehouse Design Summit in Hong-Kong. The session at that time was a success, and in 20 minutes I convinced everyone it was the right thing to do. The night following the session, we picked a name, Tooz, for this new library. That was also the first time I met Joshua Harlow, who has since become one of the biggest Tooz contributors.&lt;/p&gt;
&lt;p&gt;For the following months, we tried to move things forward in OpenStack. It was very hard to convince people that it was the solution to their problem. Most of the time, they did not seem to grasp the entirety of what was at stake.&lt;/p&gt;
&lt;p&gt;This time, it seems that we managed to convince everyone that a DLM is indeed needed. Joshua wrote an extensive specification called &lt;a href=&quot;https://review.openstack.org/#/c/209661/&quot;&gt;Chronicle of a DLM&lt;/a&gt;, which ended up being discussed and somehow adopted during that session in Tokyo.&lt;/p&gt;
&lt;p&gt;So yes, Tooz will be the weapon of choice for OpenStack. It will avoid a hard requirement on any DLM solution directly. The best driver right now is the &lt;a href=&quot;https://zookeeper.apache.org/&quot;&gt;ZooKeeper&lt;/a&gt; one, but it&apos;ll still be possible for operators to use e.g. Redis.&lt;/p&gt;
&lt;p&gt;This is a great achievement for us, after spending years trying to fix features such as the &lt;a href=&quot;https://blueprints.launchpad.net/nova/+spec/tooz-for-service-groups&quot;&gt;Nova service group subsystem&lt;/a&gt; and seeing our proposals postponed forever.&lt;/p&gt;
&lt;p&gt;(If you want to know more, &lt;a href=&quot;http://lwn.net&quot;&gt;LWN.net&lt;/a&gt; has&lt;br /&gt;
&lt;a href=&quot;https://lwn.net/Articles/662140/&quot;&gt;a great article about that session&lt;/a&gt;.)&lt;/p&gt;
&lt;h2&gt;Telemetry team name&lt;/h2&gt;
&lt;p&gt;With the new projects launched this last year, Aodh &amp;amp; Gnocchi, in parallel with the old Ceilometer, plus the change from programs to the Big Tent in OpenStack, the team is having an identity issue. Being referred to as the &quot;Ceilometer team&quot; is not really accurate, as some of us only work on Aodh or on Gnocchi. So after discussing it, I &lt;a href=&quot;https://review.openstack.org/#/c/240809/&quot;&gt;proposed to rename the team to Telemetry&lt;/a&gt; instead. We&apos;ll see how it goes.&lt;/p&gt;
&lt;h2&gt;Alarms&lt;/h2&gt;
&lt;p&gt;The first session was about alarms and the Aodh project. It turns out that the project is in pretty good shape, but it probably needs some more love, which I hope I&apos;ll be able to provide in the next months.&lt;/p&gt;
&lt;p&gt;The need for a new &lt;em&gt;aodhclient&lt;/em&gt; based on the technologies we recently used building &lt;em&gt;gnocchiclient&lt;/em&gt; has been reasserted, so we might end up working on that pretty soon. The Tempest support also needs some improvement, and we have a plan to enhance that.&lt;/p&gt;
&lt;h2&gt;Data visualisation&lt;/h2&gt;
&lt;p&gt;We got David Lyle in this session, the Project Technical Leader for &lt;a href=&quot;http://openstack/horizon&quot;&gt;Horizon&lt;/a&gt;. It was an interesting discussion. It used to be technically challenging to draw charts from the data Ceilometer collects, but it&apos;s now very easy with Gnocchi and its API.&lt;/p&gt;
&lt;p&gt;While the technical side is resolved, the more political and user-experience questions of what to draw and how were discussed at length. We don&apos;t want to make people think that Ceilometer and Gnocchi are a full monitoring solution, so there are some precautions to take. Other than that, it would be pretty cool to have views of the data in Horizon.&lt;/p&gt;
&lt;h2&gt;Rolling upgrade&lt;/h2&gt;
&lt;p&gt;It turns out that Ceilometer has an architecture that makes rolling upgrades easy. We just need to write proper documentation explaining how to do it and in which order the services should be upgraded.&lt;/p&gt;
&lt;h2&gt;Ceilometer splitting&lt;/h2&gt;
&lt;p&gt;The split of Ceilometer&apos;s alarm feature into its own project, Aodh, during the last cycle was a great success for the whole team. We want to split out other pieces of Ceilometer, as they make sense on their own and become easier to manage that way. There are also some projects that want to use them without the whole stack, so it&apos;s a good idea to make that happen.&lt;/p&gt;
&lt;h2&gt;CloudKitty &amp;amp; Gnocchi&lt;/h2&gt;
&lt;p&gt;I attended the 2 sessions that were allocated to &lt;a href=&quot;https://wiki.openstack.org/wiki/CloudKitty&quot;&gt;CloudKitty&lt;/a&gt;. It was pretty interesting, as they want to simplify their architecture and leverage what Gnocchi provides. I presented my view of the project architecture and how they could leverage more of Gnocchi to retrieve and store data. They want to go in that direction, though it&apos;s a large amount of work and refactoring on their side, so it&apos;ll take time.&lt;/p&gt;
&lt;p&gt;We also need to enhance the support of extension for new resources in Gnocchi, and that&apos;s something I hope I&apos;ll work on in the next months.&lt;/p&gt;
&lt;p&gt;Overall, this summit was pretty good and I got a tremendous amount of good feedback on Gnocchi. I again managed to get enough ideas and tasks to tackle for the next 6 months. It really looks interesting to see where the whole team will go from that. Stay tuned!&lt;/p&gt;
</content:encoded></item><item><title>Benchmarking Gnocchi for fun &amp; profit</title><link>https://julien.danjou.info/blog/gnocchi-benchmarks/</link><guid isPermaLink="true">https://julien.danjou.info/blog/gnocchi-benchmarks/</guid><description>We got pretty good feedback on Gnocchi so far, even if only a little. Recently, in order to get a better sense of where we were at, we wanted to know how fast (or slow) Gnocchi was.</description><pubDate>Tue, 13 Oct 2015 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;We got pretty good feedback on &lt;a href=&quot;http://launchpad.net/gnocchi&quot;&gt;Gnocchi&lt;/a&gt; so far, even if only a little. Recently, in order to get a better sense of where we were at, we wanted to know how fast (or slow) Gnocchi was.&lt;/p&gt;
&lt;p&gt;The &lt;a href=&quot;https://julien.danjou.info/openstack-ceilometer-the-gnocchi-experiment.html&quot;&gt;early benchmarks that some of the Mirantis engineers ran last year&lt;/a&gt; showed pretty good signs. But a year later, it was time to get real numbers and have a good understanding of Gnocchi capacity.&lt;/p&gt;
&lt;h2&gt;Benchmark tools&lt;/h2&gt;
&lt;p&gt;The first thing I realized when starting this process is that we were lacking tools to run benchmarks. Therefore I started writing some benchmark tools in &lt;a href=&quot;https://launchpad.net/python-gnocchiclient&quot;&gt;python-gnocchiclient&lt;/a&gt;, which provides a command-line tool to query Gnocchi. I added a few basic commands to measure metric performance, such as:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;$ gnocchi benchmark metric create -w 48 -n 10000 -a low
+----------------------+------------------+
| Field                | Value            |
+----------------------+------------------+
| client workers       | 48               |
| create executed      | 10000            |
| create failures      | 0                |
| create failures rate | 0.00 %           |
| create runtime       | 8.80 seconds     |
| create speed         | 1136.96 create/s |
| delete executed      | 10000            |
| delete failures      | 0                |
| delete failures rate | 0.00 %           |
| delete runtime       | 39.56 seconds    |
| delete speed         | 252.75 delete/s  |
+----------------------+------------------+
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The command-line tool supports the &lt;code&gt;--verbose&lt;/code&gt; switch to get a detailed progress report while the benchmark runs. So far it supports metric operations only, but that&apos;s the most interesting part of Gnocchi.&lt;/p&gt;
&lt;h2&gt;Spinning up some hardware&lt;/h2&gt;
&lt;p&gt;I got a couple of bare metal servers to test Gnocchi on. I dedicated the first one to Gnocchi, and used the second one as the benchmark client, plugged on the same network. Each server is made of&lt;br /&gt;
2×&lt;a href=&quot;http://ark.intel.com/products/81897/Intel-Xeon-Processor-E5-2609-v3-15M-Cache-1_90-GHz&quot;&gt;Intel Xeon E5-2609 v3&lt;/a&gt; (12 cores in total) and 32 GB of RAM. That provides a lot of CPU to handle requests in parallel.&lt;/p&gt;
&lt;p&gt;Then I simply performed a basic &lt;a href=&quot;http://www.redhat.com/en/technologies/linux-platforms/enterprise-linux&quot;&gt;RHEL 7&lt;/a&gt; installation and ran &lt;a href=&quot;http://devstack.org&quot;&gt;devstack&lt;/a&gt; to spin up an installation of Gnocchi based on the master branch, disabling all of the other OpenStack components. I then tweaked the Apache httpd configuration to use the worker MPM and increased the maximum number of clients that can send requests simultaneously.&lt;/p&gt;
&lt;p&gt;I configured Gnocchi to use the &lt;em&gt;PostgreSQL&lt;/em&gt; indexer, as it&apos;s the recommended one, and the &lt;em&gt;file&lt;/em&gt; storage driver, based on Carbonara (Gnocchi&apos;s own storage engine). That means files were stored locally rather than in Ceph or Swift.&lt;/p&gt;
&lt;p&gt;Using the &lt;em&gt;file&lt;/em&gt; driver is less scalable (you have to run it on only one node, or use a technology like NFS to share the files), but it was good enough for this benchmark, to get some numbers, and to profile the beast.&lt;/p&gt;
&lt;p&gt;The OpenStack Keystone authentication middleware was not enabled in this setup, as it would add some delay validating the authentication token.&lt;/p&gt;
&lt;h2&gt;Metric CRUD operations&lt;/h2&gt;
&lt;p&gt;Metric creation is pretty fast. I easily reached 1300 metrics created per second. Deletion is now asynchronous, which makes it faster than in Gnocchi 1.2, but it&apos;s still slower than creation: 500 metrics/s can be deleted. That does not sound like a huge issue, since metric deletion is barely used in production.&lt;/p&gt;
&lt;p&gt;Retrieving metric information is also pretty fast, going up to 800 metrics/s. It&apos;d be easy to achieve much higher throughput for this one, as it&apos;d be easy to cache, but we haven&apos;t felt the need to implement that so far.&lt;/p&gt;
&lt;p&gt;Another important thing is that all of these numbers are constant and barely depend on the number of metrics already managed by Gnocchi.&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Operation&lt;/th&gt;
&lt;th&gt;Details&lt;/th&gt;
&lt;th&gt;Rate&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Create metric&lt;/td&gt;
&lt;td&gt;Created 100k metrics in 77 seconds&lt;/td&gt;
&lt;td&gt;1300 metric/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Show metric&lt;/td&gt;
&lt;td&gt;Show a metric 100k times in 149 seconds&lt;/td&gt;
&lt;td&gt;670 metric/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Delete metric&lt;/td&gt;
&lt;td&gt;Deleted 100k metrics in 190 seconds&lt;/td&gt;
&lt;td&gt;524 metric/s&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h2&gt;Sending and getting measures&lt;/h2&gt;
&lt;p&gt;Pushing measures into metrics is one of the hottest topics. Starting with Gnocchi 1.1, pushed measures are processed asynchronously, which makes it much faster to push new measures. Getting fresh numbers on that feature was pretty interesting.&lt;/p&gt;
&lt;p&gt;The number of measures per second you can push depends on the batch size, meaning the number of actual measurements you send per call. The naive approach is to push 1 measure per call, and in that case Gnocchi is able to handle around 600 measures/s. With a batch containing 100 measures, the number of calls per second goes down to 450, but since each call pushes 100 measures, that means 45k measures per second pushed into Gnocchi!&lt;/p&gt;
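&lt;p&gt;The arithmetic behind that trade-off is simple: the effective throughput is the number of API calls per second multiplied by the batch size. A back-of-the-envelope sketch using the figures above:&lt;/p&gt;

```python
def throughput(calls_per_second, batch_size):
    # Effective measures/s: API calls per second times measures per call
    return calls_per_second * batch_size

print(throughput(600, 1))    # 1 measure per call
print(throughput(450, 100))  # 100 measures per call
```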
&lt;p&gt;I&apos;ve pushed the test further, inspired by the recent &lt;a href=&quot;https://influxdb.com/blog/2015/10/07/the_new_influxdb_storage_engine_a_time_structured_merge_tree.html&quot;&gt;blog post of InfluxDB claiming to achieve 300k points per second&lt;/a&gt; with their new engine. I ran the same benchmark on the hardware I had, which is roughly two times smaller than the one they used. I managed to push Gnocchi to a little more than 120k measurements per second. If I had the same hardware they used, interpolating the results suggests almost 250k measures/s pushed. Obviously, you can&apos;t strictly compare Gnocchi and InfluxDB since they are not doing exactly the same thing, but it still looks way better than what I expected.&lt;/p&gt;
&lt;p&gt;Batch sizes between 2k and 5k measures all sustain a similar throughput of around 120-125k measures/s.&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Operation&lt;/th&gt;
&lt;th&gt;Details&lt;/th&gt;
&lt;th&gt;Rate&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Push metric 5k&lt;/td&gt;
&lt;td&gt;Push 5M measures with batch of 5k measures in 40 seconds&lt;/td&gt;
&lt;td&gt;122k measures/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Push metric 4k&lt;/td&gt;
&lt;td&gt;Push 5M measures with batch of 4k measures in 40 seconds&lt;/td&gt;
&lt;td&gt;125k measures/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Push metric 3k&lt;/td&gt;
&lt;td&gt;Push 5M measures with batch of 3k measures in 40 seconds&lt;/td&gt;
&lt;td&gt;123k measures/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Push metric 2k&lt;/td&gt;
&lt;td&gt;Push 5M measures with batch of 2k measures in 41 seconds&lt;/td&gt;
&lt;td&gt;121k measures/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Push metric 1k&lt;/td&gt;
&lt;td&gt;Push 5M measures with batch of 1k measures in 44 seconds&lt;/td&gt;
&lt;td&gt;113k measures/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Push metric 500&lt;/td&gt;
&lt;td&gt;Push 5M measures with batch of 500 measures in 51 seconds&lt;/td&gt;
&lt;td&gt;98k measures/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Push metric 100&lt;/td&gt;
&lt;td&gt;Push 5M measures with batch of 100 measures in 112 seconds&lt;/td&gt;
&lt;td&gt;45k measures/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Push metric 10&lt;/td&gt;
&lt;td&gt;Push 5M measures with batch of 10 measures in 852 seconds&lt;/td&gt;
&lt;td&gt;6k measures/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Push metric 1&lt;/td&gt;
&lt;td&gt;Push 500k measures with batch of 1 measure in 800 seconds&lt;/td&gt;
&lt;td&gt;624 measures/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Get measures&lt;/td&gt;
&lt;td&gt;Push 43k measures of 1 metric&lt;/td&gt;
&lt;td&gt;260k measures/s&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;What about getting measures? Well, it&apos;s pretty fast too: retrieving a metric with 1 month of data at a 1-minute interval (that&apos;s 43k points) takes less than 2 seconds.&lt;/p&gt;
&lt;p&gt;Though it&apos;s actually slower than I expected. The reason seems to be that the resulting JSON is 2 MB and encoding it takes a lot of time in Python. I&apos;ll investigate that. Another point I discovered is that, by default, Gnocchi returns all the datapoints for every granularity available for the requested period, which might double the size of the returned data for nothing if you don&apos;t need them. It&apos;ll be easy to add an option to the API to retrieve only what you need, though!&lt;/p&gt;
&lt;p&gt;Once benchmarked, that meant I was able to retrieve 6 metrics per second, which translates to around 260k measures/s.&lt;/p&gt;
&lt;h2&gt;&lt;em&gt;Metricd&lt;/em&gt; speed&lt;/h2&gt;
&lt;p&gt;New measures pushed into Gnocchi are processed asynchronously by the &lt;code&gt;gnocchi-metricd&lt;/code&gt; daemon. When doing the benchmarks above, I ran into a very interesting issue: sending 10k measures on a metric would make &lt;code&gt;gnocchi-metricd&lt;/code&gt; use up to 2 GB of RAM and 120 % CPU for more than 10 minutes.&lt;/p&gt;
&lt;p&gt;After further investigation, I found that the naive approach we used to resample datapoints in Carbonara using &lt;a href=&quot;http://pandas.pydata.org/&quot;&gt;Pandas&lt;/a&gt; was causing that. I &lt;a href=&quot;https://github.com/pydata/pandas/issues/11217&quot;&gt;reported a bug on Pandas&lt;/a&gt; and the upstream author was kind enough to provide a nice workaround, that I sent as &lt;a href=&quot;https://github.com/pydata/pandas/pull/11242&quot;&gt;a pull request&lt;/a&gt; to Pandas documentation.&lt;/p&gt;
&lt;p&gt;I wrote a fix for Gnocchi based on that, and started using it. Computing the standard aggregation methods set (std, count, 95pct, min, max, sum, median, mean) for 10k batches of 1 measure (worst case scenario) for one metric with 10k measures now takes only 20 seconds and uses 100 MB of RAM – 45× faster. That means that in normal operations, where only a few new measures are processed, the operation of updating a metric only takes a few milliseconds. Awesome!&lt;/p&gt;
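&lt;p&gt;Conceptually, what &lt;em&gt;metricd&lt;/em&gt; computes boils down to bucketing measures by granularity and aggregating each bucket. Here is a simplified, pure-Python stand-in for what Carbonara does with Pandas; the function name and the sample data are made up for illustration, and only a subset of the aggregation methods is shown:&lt;/p&gt;

```python
import collections
import statistics

def aggregate(measures, granularity):
    # Group (timestamp, value) pairs into buckets of `granularity`
    # seconds, then compute a few of the standard aggregates per bucket.
    buckets = collections.defaultdict(list)
    for timestamp, value in measures:
        buckets[timestamp - (timestamp % granularity)].append(value)
    return {
        ts: {"mean": statistics.mean(values), "min": min(values),
             "max": max(values), "count": len(values)}
        for ts, values in buckets.items()
    }

measures = [(0, 1.0), (30, 3.0), (60, 5.0)]
print(aggregate(measures, 60))
```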
&lt;h2&gt;Comparison with Ceilometer&lt;/h2&gt;
&lt;p&gt;For comparison&apos;s sake, I&apos;ve quickly run some read-operation benchmarks in Ceilometer. I&apos;ve fed it with one month of samples for 100 instances polled every minute. That represents roughly 4.3M samples injected, and it took a while – almost 1 hour, whereas it would have taken less than a minute in Gnocchi. Then I tried to retrieve some statistics in the same way that we provide them in Gnocchi, which means aggregating them over a period of 60 seconds over a month.&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Operation&lt;/th&gt;
&lt;th&gt;Details&lt;/th&gt;
&lt;th&gt;Rate&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Read metric SQL&lt;/td&gt;
&lt;td&gt;Read measures for 1 metric&lt;/td&gt;
&lt;td&gt;2min 58s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Read metric MongoDB&lt;/td&gt;
&lt;td&gt;Read measures for 1 metric&lt;/td&gt;
&lt;td&gt;28s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Read metric Gnocchi&lt;/td&gt;
&lt;td&gt;Read measures for 1 metric&lt;/td&gt;
&lt;td&gt;2s&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Obviously, Ceilometer is very slow. It has to look through 4M samples to compute and return the result, which takes a lot of time, whereas Gnocchi just has to fetch a file and pass it over. That also means that the more samples you have (i.e. the longer you collect data and the more resources you have), the slower Ceilometer becomes. This is not a problem with Gnocchi, as I emphasized when I started designing it.&lt;/p&gt;
&lt;p&gt;Most Gnocchi operations are &lt;em&gt;O(log R)&lt;/em&gt; where R is the number of metrics or resources, whereas most Ceilometer operations are &lt;em&gt;O(S)&lt;/em&gt; where S is the number of samples (measures), since Ceilometer has to walk the samples themselves to aggregate them. Since R is millions of times smaller than S, Gnocchi gets to be much faster.&lt;/p&gt;
&lt;p&gt;And what&apos;s even more interesting is that Gnocchi is entirely horizontally scalable. Adding more Gnocchi servers (for the API and its background processing worker &lt;em&gt;metricd&lt;/em&gt;) will multiply Gnocchi&apos;s performance by the number of servers added.&lt;/p&gt;
&lt;h2&gt;Improvements&lt;/h2&gt;
&lt;p&gt;There are several things to improve in Gnocchi, such as splitting Carbonara archives to make them more efficient, especially for drivers such as Ceph and Swift. It&apos;s already on my plate, and I&apos;m looking forward to working on it!&lt;/p&gt;
&lt;p&gt;And if you have any questions, feel free to shoot them in the comment section. 😉&lt;/p&gt;
</content:encoded></item><item><title>Gnocchi talk at OpenStack Paris Meetup #16</title><link>https://julien.danjou.info/blog/openstack-france-paris-meetup-gnocchi-talk/</link><guid isPermaLink="true">https://julien.danjou.info/blog/openstack-france-paris-meetup-gnocchi-talk/</guid><description>Last week, I&apos;ve been invited to the OpenStack Paris meetup #16, whose subject was about metrics in OpenStack. Last time I spoke at this meetup was back in 2012, during the OpenStack Paris meetup #2.</description><pubDate>Mon, 05 Oct 2015 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Last week, I&apos;ve been invited to the &lt;a href=&quot;http://www.meetup.com/OpenStack-France/events/225227112/&quot;&gt;OpenStack Paris meetup #16&lt;/a&gt;, whose subject was about metrics in OpenStack. Last time I spoke at this meetup was back in 2012, during the &lt;a href=&quot;https://julien.danjou.info/blog/openstack-france-meetup-2&quot;&gt;OpenStack Paris meetup #2&lt;/a&gt;. A very long time ago!&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/gnocchi-talk-2.jpg&quot; alt=&quot;gnocchi-talk-2&quot; /&gt;&lt;/p&gt;
&lt;p&gt;I talked for half an hour about &lt;a href=&quot;http://launchpad.net/gnocchi&quot;&gt;Gnocchi&lt;/a&gt;, the OpenStack project I&apos;ve been running for 18 months now. I started by explaining the story behind the project and why we needed to build it. Ceilometer has an interesting history and had a curious roadmap these last years, and I summarized that briefly. Then I talked about how Gnocchi works and what it offers to users and operators. The slides were full of JSON, but I imagine it offered an interesting view of what the API looks like and how easy it is to operate. This also allowed me to emphasize how many use cases are actually covered and solved, contrary to what Ceilometer did so far. The talk was well received and I got a few interesting questions at the end.&lt;/p&gt;
</content:encoded></item><item><title>My interview in le Journal du Hacker</title><link>https://julien.danjou.info/blog/interview-journal-du-hacker/</link><guid isPermaLink="true">https://julien.danjou.info/blog/interview-journal-du-hacker/</guid><description>Le Journal du Hacker interviewed me about my work on OpenStack, my job at Red Hat, and my self-published book The Hacker&apos;s Guide to Python.</description><pubDate>Thu, 17 Sep 2015 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;A few days ago, the French equivalent of &lt;a href=&quot;https://news.ycombinator.com/&quot;&gt;Hacker News&lt;/a&gt;, called &quot;&lt;a href=&quot;https://www.journalduhacker.net/&quot;&gt;Le Journal du Hacker&lt;/a&gt;&quot;, &lt;a href=&quot;https://www.journalduhacker.net/s/l5qktw/journal_du_hacker_entretien_avec_julien_danjou_d_veloppeur_openstack&quot;&gt;interviewed me&lt;/a&gt; about my work on &lt;a href=&quot;http://openstack.org&quot;&gt;OpenStack&lt;/a&gt;, my job at &lt;a href=&quot;http://redhat.com&quot;&gt;Red Hat&lt;/a&gt; and my self-published book &lt;a href=&quot;https://thehackerguidetopython.com&quot;&gt;The Hacker&apos;s Guide to Python&lt;/a&gt;. I&apos;ve spent some time translating it into English so you can read it if you don&apos;t understand French! I hope you&apos;ll enjoy it.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Hi Julien, and thanks for participating in this interview for the Journal du Hacker. For our readers who don&apos;t know you, can you introduce yourself briefly?&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;You&apos;re welcome! My name is Julien, I&apos;m 31 years old, and I live in Paris. I have now been developing free software for around fifteen years. Over the years, I have had the pleasure of working (among other things) on &lt;a href=&quot;http://debian.org&quot;&gt;Debian&lt;/a&gt;, &lt;a href=&quot;https://www.gnu.org/software/emacs/&quot;&gt;Emacs&lt;/a&gt; and &lt;a href=&quot;http://awesome.naquadah.org&quot;&gt;awesome&lt;/a&gt;, and more recently on OpenStack. For a few months now, I have been working at Red Hat as a Principal Software Engineer on &lt;a href=&quot;http://openstack.org&quot;&gt;OpenStack&lt;/a&gt;. I am in charge of upstream development for that cloud-computing platform, mainly around the Ceilometer, Aodh and Gnocchi projects.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Being a system architect myself, I have been following your work on &lt;a href=&quot;http://openstack.org&quot;&gt;OpenStack&lt;/a&gt; for a while. It&apos;s uncommon to get the point of view of someone as involved as you are. Can you give us a summary of the state of the project, and then detail your activities in it?&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The &lt;a href=&quot;http://openstack.org&quot;&gt;OpenStack&lt;/a&gt; project has grown and changed a lot since I started 4 years ago. It began with a few projects providing the basics, like &lt;a href=&quot;https://launchpad.net/nova&quot;&gt;Nova&lt;/a&gt; (compute), &lt;a href=&quot;https://launchpad.net/swift&quot;&gt;Swift&lt;/a&gt; (object storage), &lt;a href=&quot;https://launchpad.net/cinder&quot;&gt;Cinder&lt;/a&gt; (volume), &lt;a href=&quot;https://launchpad.net/keystone&quot;&gt;Keystone&lt;/a&gt; (identity) or even &lt;a href=&quot;https://launchpad.net/neutron&quot;&gt;Neutron&lt;/a&gt; (network), which form the basis of a cloud-computing platform, and has since grown to include many more projects.&lt;/p&gt;
&lt;p&gt;For a while, the inclusion of projects was subject to a strict review by the technical committee. But a few months ago, the rules were relaxed, and we now see many more cloud-computing-related projects &lt;a href=&quot;http://governance.openstack.org/reference/projects/&quot;&gt;joining us&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;As far as I&apos;m concerned, I started the &lt;a href=&quot;http://governance.openstack.org/reference/projects/ceilometer.html&quot;&gt;Ceilometer&lt;/a&gt; project in 2012 with a few other people; it is devoted to handling metrics of OpenStack platforms. Our goal is to be able to collect all the metrics and record them so they can be analyzed later. We also have a module providing the ability to trigger actions on threshold crossing (alarms).&lt;/p&gt;
&lt;p&gt;During its first two years, the project grew in a monolithic way, with a linearly growing number of contributors. I was the PTL (Project Technical Leader) for a year. That leadership position requires a lot of time for bureaucracy and people management, so I decided to step down in order to spend more time solving the technical challenges that Ceilometer offered.&lt;/p&gt;
&lt;p&gt;I&apos;ve started the &lt;a href=&quot;https://launchpad.net/gnocchi&quot;&gt;Gnocchi&lt;/a&gt; project in 2014. The first stable version (1.0.0) was released a few months ago. It&apos;s a timeseries database offering a REST API and a strong ability to scale. It was a necessary development to solve the problems tied to the large amount of metrics created by a cloud-computing platform, where tens of thousands of virtual machines have to be metered as often as possible. This project works as a standalone deployment or with the rest of OpenStack.&lt;/p&gt;
&lt;p&gt;More recently, I&apos;ve started &lt;a href=&quot;https://launchpad.net/aodh&quot;&gt;Aodh&lt;/a&gt;, the result of moving the code and features of Ceilometer related to threshold action triggering (alarming) out into their own project. That&apos;s the logical continuation of what we started with Gnocchi. It means Ceilometer is to be split into independent modules that can work together – with or without OpenStack. It seems to me that the features provided by Ceilometer, Aodh and Gnocchi can also be interesting for operators running more classical infrastructures. That&apos;s why I&apos;ve pushed the projects in that direction, and also towards a more service-oriented architecture (&lt;a href=&quot;https://fr.wikipedia.org/wiki/Architecture_orient%C3%A9e_services&quot;&gt;SOA&lt;/a&gt;).&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;I&apos;d like to stop for a moment on Ceilometer. I think this solution was much awaited, especially by cloud-computing providers using OpenStack to bill the resources sold to their customers. I remember reading a blog post where you discussed how quickly this component was built, and the features that were never supposed to end up in it. Nowadays, with Gnocchi and Aodh, how would you rate the quality of Ceilometer and the programs it relies on?&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Indeed, one of the first use cases for Ceilometer was tied to the ability to get metrics to feed a billing tool. That goal has now been reached, since we have billing tools for OpenStack that use Ceilometer, such as &lt;a href=&quot;https://wiki.openstack.org/wiki/CloudKitty&quot;&gt;CloudKitty&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;However, other use-cases appeared rapidly, such as the ability to trigger alarms. This feature was necessary, for example, to implement the auto scaling feature that &lt;a href=&quot;http://launchpad.net/heat&quot;&gt;Heat&lt;/a&gt; needed. At the time, for technical and political reasons, it was not possible to implement this feature in a new project, and the functionality ended up in Ceilometer, since it was using the metrics collected and stored by Ceilometer itself.&lt;/p&gt;
&lt;p&gt;Though, like I said, this feature is now in its own project, Aodh. The alarm feature has been used in production for a few cycles now, and the Aodh project brings new features to the table. It makes it possible to trigger actions on threshold crossing, and it is one of the few solutions able to work at scale, with several thousands of alarms.&lt;br /&gt;
It&apos;s impossible to make Nagios fetch metrics and trigger alarms for millions of instances. Ceilometer and Aodh can do that automatically on a few tens of nodes.&lt;/p&gt;
&lt;p&gt;On the other side, Ceilometer was long painted as slow and complicated to use, because its metric storage system used &lt;a href=&quot;https://www.mongodb.org/&quot;&gt;MongoDB&lt;/a&gt; by default. Clearly, the data structure model picked was not optimal for what users were doing with the data.&lt;/p&gt;
&lt;p&gt;That&apos;s why I started Gnocchi last year, which is designed exactly for this use case. It offers constant access time to metrics (O(1) complexity) and fast access to the resource data via an index.&lt;/p&gt;
&lt;p&gt;Today, with three projects that each have a well-defined feature scope – and which can work together – Ceilometer, Aodh and Gnocchi have finally erased the biggest problems and defects of the initial project.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;To end with OpenStack, one last question. You have been a &lt;a href=&quot;http://www.python.org/&quot;&gt;Python&lt;/a&gt; developer for a long time and are a fervent user of software testing and &lt;a href=&quot;https://en.wikipedia.org/wiki/Test_driven_development&quot;&gt;test-driven development&lt;/a&gt;. Several of your blog posts point out how important they are. Can you tell us more about the use of tests in OpenStack, and the testing prerequisites for contributing to OpenStack?&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I don&apos;t know any project that is as tested on every layer as OpenStack is. At the start of the project, test coverage was vague, made of a few unit tests. Each release provided a bunch of new features, and you had to keep your fingers crossed for them to work. That&apos;s already almost unacceptable. But the big issue was that there were also a lot of regressions, and things that used to work no longer did. It was often corner cases that developers forgot about that stopped working.&lt;/p&gt;
&lt;p&gt;Then the project decided to change its policy and started to refuse all patches – new features or bug fixes – that did not come with a minimal set of unit tests proving the patch worked. Quickly, regressions became history, and the number of bugs decreased significantly month after month.&lt;/p&gt;
&lt;p&gt;Then came the functional tests, with the &lt;a href=&quot;http://launchpad.net/tempest&quot;&gt;Tempest&lt;/a&gt; project, which runs a test battery on a complete OpenStack deployment.&lt;/p&gt;
&lt;p&gt;OpenStack now possesses a &lt;a href=&quot;http://status.openstack.org/zuul/&quot;&gt;complete test infrastructure&lt;/a&gt;, with operators hired full-time to maintain it. The developers have to write the tests, and the operators maintain an architecture based on Gerrit, Zuul, and Jenkins, which runs the test battery of each project for each patch sent.&lt;/p&gt;
&lt;p&gt;Indeed, for each version of a patch sent, a full OpenStack is deployed into a virtual machine, and a battery of thousands of unit and functional tests is run to check that no regression slipped in.&lt;/p&gt;
&lt;p&gt;To contribute to OpenStack, you need to know how to write a unit test – the policy on functional tests is laxer. The tools used are standard Python tools: unittest for the framework and &lt;a href=&quot;https://pypi.python.org/pypi/tox&quot;&gt;tox&lt;/a&gt; to create a virtual environment (venv) and run them.&lt;/p&gt;
&lt;p&gt;It&apos;s also possible to use &lt;a href=&quot;http://docs.openstack.org/developer/devstack/&quot;&gt;DevStack&lt;/a&gt; to deploy an OpenStack platform on a virtual machine and run the functional tests. However, since the project infrastructure also does that when a patch is submitted, it&apos;s not mandatory to do it yourself locally.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The tools and tests you write for OpenStack are written in Python, a language which is very popular today. You seem to be especially fond of it, since you wrote a book about it, &lt;a href=&quot;https://thehackerguidetopython.com&quot;&gt;The Hacker&apos;s Guide to Python&lt;/a&gt;, which I really enjoyed. Can you explain what brought you to Python, the main strong points you attribute to this language (briefly), and how you went from developer to author?&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I stumbled upon Python by chance, around 2005. I don&apos;t remember how I heard about it, but I bought a first book to discover it and started toying with the language. At the time, I didn&apos;t find any project to contribute to or to start. My first Python project was rebuildd for Debian in 2007, a bit later.&lt;/p&gt;
&lt;p&gt;I like Python for its simplicity, its rather clean object orientation, its ease of deployment and its rich open source ecosystem. Once you get the basics, it&apos;s very easy to evolve and use it for anything, because the ecosystem makes it easy to find libraries that solve any kind of problem.&lt;/p&gt;
&lt;p&gt;I became an author by chance, writing blog posts from time to time about Python. I finally realized that after a few years studying Python internals (CPython), I had learned a lot of things. While writing a post about&lt;br /&gt;
&lt;a href=&quot;https://julien.danjou.info/blog/2013/guide-python-static-class-abstract-methods&quot;&gt;the differences between method types in Python&lt;/a&gt; – which is still one of the most read posts on my blog – I realized that a lot of things that seemed obvious to me were not for other developers.&lt;/p&gt;
&lt;p&gt;I wrote that initial post after thousands of hours spent doing code reviews on OpenStack. I therefore decided to note down all the developer pain points and write a book about them: a compilation of what years of experience taught me and the other developers I decided to interview for the book.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;I&apos;ve been very interested in the publication of your book, both for the subject itself and for the process you chose. You self-published the book, which seems very relevant nowadays. Was that your choice from the start? Did you look for a publisher? Can you tell us more about that?&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I&apos;ve been lucky to find out about other self-published authors, such as &lt;a href=&quot;http://nathanbarry.com/&quot;&gt;Nathan Barry&lt;/a&gt; – who even wrote a book on the subject, called &lt;a href=&quot;http://nathanbarry.com/authority/&quot;&gt;Authority&lt;/a&gt;. That&apos;s what convinced me it was possible and gave me hints for the project.&lt;/p&gt;
&lt;p&gt;I started writing in August 2013, and I ran the first interviews with other developers at that time. I started with the table of contents and then filled the pages with what I knew and what I wanted to share. I managed to finish the book around January 2014. The proofreading took more time than I expected, so the book was only released in March 2014. I wrote a &lt;a href=&quot;https://julien.danjou.info/blog/making-of-the-hacker-guide-to-python&quot;&gt;complete report&lt;/a&gt; about it on my blog, where I explain the full process in detail, from writing to launching.&lt;/p&gt;
&lt;p&gt;I did not look for publishers, though some approached me. The idea of self-publishing really convinced me, so I decided to go on my own, and I have no regrets. It&apos;s true that you have to wear two hats at the same time and handle a lot more things, but with a minimal audience and some help from the Internet, anything&apos;s possible!&lt;/p&gt;
&lt;p&gt;I&apos;ve been approached by two publishers since then, a &lt;a href=&quot;http://item.jd.com/11685556.html&quot;&gt;Chinese&lt;/a&gt; and a &lt;a href=&quot;https://twitter.com/juldanjou/status/552056642322583552&quot;&gt;Korean&lt;/a&gt; one. I gave them the rights to translate and publish the book in their countries, so you can buy the Chinese and Korean versions of the first edition there.&lt;/p&gt;
&lt;p&gt;Seeing how successful it was, I decided to launch a second edition in May 2015, and it&apos;s likely that a third edition will be released in 2016.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Nowadays, you work for &lt;a href=&quot;http://www.redhat.com&quot;&gt;Red Hat&lt;/a&gt;, a company that represents the success of using Free Software as a commercial business model. This company fascinates many people in our community. What can you tell us about your employer from your point of view?&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;It has only been a year since I joined Red Hat (when they bought &lt;a href=&quot;http://www.enovance.com/&quot;&gt;eNovance&lt;/a&gt;), so my experience is quite recent.&lt;/p&gt;
&lt;p&gt;Still, Red Hat is really a special company on every level. It&apos;s hard to see from the outside how open it is and how it works. It&apos;s really close to, and really looks like, an open source project. For more details, you should read &lt;a href=&quot;https://www.redhat.com/en/explore/the-open-organization-book&quot;&gt;The Open Organization&lt;/a&gt;, a book written by Jim Whitehurst (CEO of Red Hat), which he just published. It describes perfectly how Red Hat works. To summarize, meritocracy and the absence of silos are what make Red Hat a strong organization and&lt;br /&gt;
&lt;a href=&quot;http://www.forbes.com/innovative-companies/list/&quot;&gt;one of the most innovative companies&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;In the end, I&apos;m lucky enough to be autonomous in the projects I work on with my team around OpenStack, and I can spend 100% of my time working upstream and enhancing the Python ecosystem.&lt;/p&gt;
</content:encoded></item><item><title>Visualize your OpenStack cloud: Gnocchi &amp; Grafana</title><link>https://julien.danjou.info/blog/openstack-gnocchi-grafana/</link><guid isPermaLink="true">https://julien.danjou.info/blog/openstack-gnocchi-grafana/</guid><description>We&apos;ve been working hard with the Gnocchi team these last few months to store your metrics, and I guess it&apos;s time to show off a bit.  So far Gnocchi offers scalable metric storage and resource indexation,</description><pubDate>Mon, 14 Sep 2015 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;We&apos;ve been working hard with the Gnocchi team these last few months to store your metrics, and I guess it&apos;s time to show off a bit.&lt;/p&gt;
&lt;p&gt;So far Gnocchi offers scalable metric storage and resource indexation, especially for OpenStack clouds – but not only, since it is generic. It&apos;s cool to store metrics, but it&apos;s even better to have a way to visualize them!&lt;/p&gt;
&lt;h2&gt;Prototyping&lt;/h2&gt;
&lt;p&gt;We very soon started to build a little HTML interface. Being REST-friendly guys, we enabled it on the same endpoints that were used to retrieve information and measures about metrics, sending back &lt;code&gt;text/html&lt;/code&gt; instead of &lt;code&gt;application/json&lt;/code&gt; if you requested those pages from a Web browser.&lt;/p&gt;
&lt;p&gt;But let&apos;s face it: we are back-end developers, and we suck at any kind of front-end development. CSS, HTML, JavaScript? Bwah! So what we built was a starting point, hoping some magical Web developer would jump in and finish the job.&lt;/p&gt;
&lt;p&gt;Obviously it never happened.&lt;/p&gt;
&lt;h2&gt;Ok, so what&apos;s out there?&lt;/h2&gt;
&lt;p&gt;It turns out there are back-end agnostic solutions out there, and we decided to pick &lt;a href=&quot;http://grafana.org&quot;&gt;Grafana&lt;/a&gt;. Grafana is a complete graphing dashboard solution that can be plugged on top of any back-end. It already supports timeseries databases such as Graphite, InfluxDB and OpenTSDB.&lt;/p&gt;
&lt;p&gt;That was more than enough for my fellow developer &lt;a href=&quot;https://blog.sileht.net/&quot;&gt;Mehdi Abaakouk&lt;/a&gt; to jump in and start writing a Gnocchi plugin for Grafana! Consequently, there is now a basic but solid and working back-end for Grafana that lives in the &lt;em&gt;&lt;a href=&quot;https://github.com/grafana/grafana-plugins/tree/master/datasources/gnocchi&quot;&gt;grafana-plugins&lt;/a&gt;&lt;/em&gt;&lt;br /&gt;
repository.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/gnocchi-grafana.png&quot; alt=&quot;gnocchi-grafana&quot; /&gt;&lt;/p&gt;
&lt;p&gt;With that plugin, you can graph anything that is stored in Gnocchi, from raw metrics to metrics tied to resources. You can use templating, but no annotation yet.&lt;/p&gt;
&lt;p&gt;The back-end supports Gnocchi with or without Keystone involved, and any type of authentication (basic auth or Keystone token). So yes, it even works if you&apos;re not running Gnocchi with the rest of OpenStack.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/gnocchi-grafana-group.png&quot; alt=&quot;gnocchi-grafana-group&quot; /&gt;&lt;/p&gt;
&lt;p&gt;It also supports advanced queries, so you can search for resources based on some criteria and graph their metrics.&lt;/p&gt;
&lt;h2&gt;I want to try it!&lt;/h2&gt;
&lt;p&gt;If you want to deploy it, all you need to do is to install Grafana and its plugins, and create a new datasource pointing to Gnocchi. It is that simple. There&apos;s some CORS middleware configuration involved if you&apos;re planning on using Keystone authentication, but it&apos;s pretty straightforward – just set the &lt;code&gt;cors.allowed_origin&lt;/code&gt; option to the URL of your Grafana dashboard.&lt;/p&gt;
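&lt;p&gt;Assuming the dotted option name maps to the usual section/option layout of &lt;code&gt;gnocchi.conf&lt;/code&gt; – worth double-checking against your deployment&apos;s documentation, as the Grafana URL here is just a placeholder – the setting would look something like this:&lt;/p&gt;

```ini
# Hypothetical gnocchi.conf fragment: allow the Grafana dashboard's
# origin to make cross-origin requests to the Gnocchi API.
[cors]
allowed_origin = http://grafana.example.com:3000
```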
&lt;p&gt;We added support for Grafana directly in the Gnocchi DevStack plugin. If you&apos;re running &lt;a href=&quot;http://devstack.org&quot;&gt;DevStack&lt;/a&gt; you can follow &lt;a href=&quot;http://docs.openstack.org/developer/gnocchi/devstack.html&quot;&gt;the instructions&lt;/a&gt; – which basically amount to adding the line &lt;code&gt;enable_service gnocchi-grafana&lt;/code&gt;.&lt;/p&gt;
&lt;h2&gt;Moving to Grafana core&lt;/h2&gt;
&lt;p&gt;&lt;a href=&quot;https://github.com/grafana/grafana/pull/2716&quot;&gt;Mehdi just opened a pull request&lt;/a&gt; a few days ago to merge the plugin into Grafana core. It&apos;s actually one of the most unit-tested plugins in Grafana so far, so it should be on a good path to be merged, bringing support for Gnocchi directly into Grafana without any plugin involved.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/grafana-gnocchi-unittests.png&quot; alt=&quot;grafana-gnocchi-unittests&quot; /&gt;&lt;/p&gt;
</content:encoded></item><item><title>Data validation in Python with voluptuous</title><link>https://julien.danjou.info/blog/python-schema-validation-voluptuous/</link><guid isPermaLink="true">https://julien.danjou.info/blog/python-schema-validation-voluptuous/</guid><description>Continuing my post series on the tools I use these days in Python, this time I would like to talk about a library I really like, named voluptuous.</description><pubDate>Fri, 04 Sep 2015 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Continuing my post series on the tools I use these days in Python, this time I would like to talk about a library I really like, named &lt;em&gt;&lt;a href=&quot;https://pypi.python.org/pypi/voluptuous&quot;&gt;voluptuous&lt;/a&gt;&lt;/em&gt;.&lt;/p&gt;
&lt;p&gt;It&apos;s no secret that when a program receives data from the outside, handling it properly is a big deal. Most of the time, your program has no guarantee that the stream is valid and that it contains what is expected.&lt;/p&gt;
&lt;p&gt;The &lt;a href=&quot;https://en.wikipedia.org/wiki/Robustness_principle&quot;&gt;robustness principle&lt;/a&gt; says you should be liberal in what you accept, though &lt;a href=&quot;http://cacm.acm.org/magazines/2011/8/114933-the-robustness-principle-reconsidered/fulltext&quot;&gt;that is not always a good idea&lt;/a&gt; either. Whatever policy you choose, you need to process that data and apply your policy consistently – lax or not.&lt;/p&gt;
&lt;p&gt;That means the program needs to inspect the data it receives, check that it finds everything it needs, complete what might be missing (e.g. set some defaults), transform some of the data, and maybe reject it in the end.&lt;/p&gt;
&lt;h2&gt;Data validation&lt;/h2&gt;
&lt;p&gt;The first step is to validate the data, which means checking all the fields are there and all the types are right or understandable (parseable). &lt;em&gt;Voluptuous&lt;/em&gt; provides a single interface for all that called a &lt;code&gt;Schema&lt;/code&gt;.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&amp;gt;&amp;gt;&amp;gt; from voluptuous import Schema
&amp;gt;&amp;gt;&amp;gt; s = Schema({
...   &apos;q&apos;: str,
...   &apos;per_page&apos;: int,
...   &apos;page&apos;: int,
... })
&amp;gt;&amp;gt;&amp;gt; s({&quot;q&quot;: &quot;hello&quot;})
{&apos;q&apos;: &apos;hello&apos;}
&amp;gt;&amp;gt;&amp;gt; s({&quot;q&quot;: &quot;hello&quot;, &quot;page&quot;: &quot;world&quot;})
voluptuous.MultipleInvalid: expected int for dictionary value @ data[&apos;page&apos;]
&amp;gt;&amp;gt;&amp;gt; s({&quot;q&quot;: &quot;hello&quot;, &quot;unknown&quot;: &quot;key&quot;})
voluptuous.MultipleInvalid: extra keys not allowed @ data[&apos;unknown&apos;]
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The argument to &lt;code&gt;voluptuous.Schema&lt;/code&gt; should be the data structure that you expect. &lt;em&gt;Voluptuous&lt;/em&gt; accepts any kind of data structure, so it could also be a simple string or an array of dicts of arrays of integers. You get it. Here it&apos;s a &lt;code&gt;dict&lt;/code&gt; with a few keys that, if present, should be validated as certain types. By default, &lt;em&gt;Voluptuous&lt;/em&gt; does not raise an error if some keys are missing. However, extra keys in a dict are invalid by default. If you want to allow extra keys, it is possible to specify it.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&amp;gt;&amp;gt;&amp;gt; from voluptuous import Schema
&amp;gt;&amp;gt;&amp;gt; s = Schema({&quot;foo&quot;: str}, extra=True)
&amp;gt;&amp;gt;&amp;gt; s({&quot;bar&quot;: 2})
{&apos;bar&apos;: 2}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;It is also possible to make some keys mandatory.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&amp;gt;&amp;gt;&amp;gt; from voluptuous import Schema, Required
&amp;gt;&amp;gt;&amp;gt; s = Schema({Required(&quot;foo&quot;): str})
&amp;gt;&amp;gt;&amp;gt; s({})
voluptuous.MultipleInvalid: required key not provided @ data[&apos;foo&apos;]
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;You can create custom data types very easily. &lt;em&gt;Voluptuous&lt;/em&gt; data types are actually just functions that are called with one argument, the value, and that should either return the value or raise an &lt;code&gt;Invalid&lt;/code&gt; or &lt;code&gt;ValueError&lt;/code&gt; exception.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&amp;gt;&amp;gt;&amp;gt; from voluptuous import Schema, Invalid
&amp;gt;&amp;gt;&amp;gt; def StringWithLength5(value):
...     if isinstance(value, str) and len(value) == 5:
...             return value
...     raise Invalid(&quot;Not a string with 5 chars&quot;)
...
&amp;gt;&amp;gt;&amp;gt; s = Schema(StringWithLength5)
&amp;gt;&amp;gt;&amp;gt; s(&quot;hello&quot;)
&apos;hello&apos;
&amp;gt;&amp;gt;&amp;gt; s(&quot;hello world&quot;)
voluptuous.MultipleInvalid: Not a string with 5 chars
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Most of the time though, there is no need to create your own data types. &lt;em&gt;Voluptuous&lt;/em&gt; provides logical operators that can, combined with a few other provided primitives such as &lt;code&gt;voluptuous.Length&lt;/code&gt; or &lt;code&gt;voluptuous.Range&lt;/code&gt;, create a large range of validation schemes.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&amp;gt;&amp;gt;&amp;gt; from voluptuous import Schema, Length, All
&amp;gt;&amp;gt;&amp;gt; s = Schema(All(str, Length(min=3, max=5)))
&amp;gt;&amp;gt;&amp;gt; s(&quot;hello&quot;)
&apos;hello&apos;
&amp;gt;&amp;gt;&amp;gt; s(&quot;hello world&quot;)
voluptuous.MultipleInvalid: length of value must be at most 5
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The &lt;a href=&quot;https://pypi.python.org/pypi/voluptuous&quot;&gt;voluptuous documentation&lt;/a&gt; has a good set of examples that you can check to have a good overview of what you can do.&lt;/p&gt;
&lt;h2&gt;Data transformation&lt;/h2&gt;
&lt;p&gt;What&apos;s important to remember is that each data type you use is a function: it is called with the value and, if the value is considered valid, returns it. The returned value is what is actually used and returned after schema validation:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&amp;gt;&amp;gt;&amp;gt; import uuid
&amp;gt;&amp;gt;&amp;gt; from voluptuous import Schema
&amp;gt;&amp;gt;&amp;gt; def UUID(value):
...     return uuid.UUID(value)
...
&amp;gt;&amp;gt;&amp;gt; s = Schema({&quot;foo&quot;: UUID})
&amp;gt;&amp;gt;&amp;gt; data_converted = s({&quot;foo&quot;: &quot;uuid?&quot;})
voluptuous.MultipleInvalid: not a valid value for dictionary value @ data[&apos;foo&apos;]
&amp;gt;&amp;gt;&amp;gt; data_converted = s({&quot;foo&quot;: &quot;8B7BA51C-DFF5-45DD-B28C-6911A2317D1D&quot;})
&amp;gt;&amp;gt;&amp;gt; data_converted
{&apos;foo&apos;: UUID(&apos;8b7ba51c-dff5-45dd-b28c-6911a2317d1d&apos;)}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;By defining a custom &lt;code&gt;UUID&lt;/code&gt; function that converts a value to a UUID, the schema converts the string passed in the data to a Python UUID object – validating the format at the same time.&lt;/p&gt;
&lt;p&gt;Note a little trick here: it&apos;s not possible to use &lt;code&gt;uuid.UUID&lt;/code&gt; directly in the schema, otherwise &lt;em&gt;Voluptuous&lt;/em&gt; would check that the data is actually an instance of &lt;code&gt;uuid.UUID&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&amp;gt;&amp;gt;&amp;gt; from voluptuous import Schema
&amp;gt;&amp;gt;&amp;gt; s = Schema({&quot;foo&quot;: uuid.UUID})
&amp;gt;&amp;gt;&amp;gt; s({&quot;foo&quot;: &quot;8B7BA51C-DFF5-45DD-B28C-6911A2317D1D&quot;})
voluptuous.MultipleInvalid: expected UUID for dictionary value @ data[&apos;foo&apos;]
&amp;gt;&amp;gt;&amp;gt; s({&quot;foo&quot;: uuid.uuid4()})
{&apos;foo&apos;: UUID(&apos;60b6d6c4-e719-47a7-8e2e-b4a4a30631ed&apos;)}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;And that&apos;s not what is wanted here.&lt;/p&gt;
&lt;p&gt;That mechanism is really neat to transform, for example, strings to timestamps.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&amp;gt;&amp;gt;&amp;gt; import datetime
&amp;gt;&amp;gt;&amp;gt; from voluptuous import Schema
&amp;gt;&amp;gt;&amp;gt; def Timestamp(value):
...     return datetime.datetime.strptime(value, &quot;%Y-%m-%dT%H:%M:%S&quot;)
...
&amp;gt;&amp;gt;&amp;gt; s = Schema({&quot;foo&quot;: Timestamp})
&amp;gt;&amp;gt;&amp;gt; s({&quot;foo&quot;: &apos;2015-03-03T12:12:12&apos;})
{&apos;foo&apos;: datetime.datetime(2015, 3, 3, 12, 12, 12)}
&amp;gt;&amp;gt;&amp;gt; s({&quot;foo&quot;: &apos;2015-03-03T12:12&apos;})
voluptuous.MultipleInvalid: not a valid value for dictionary value @ data[&apos;foo&apos;]
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Recursive schemas&lt;/h2&gt;
&lt;p&gt;So far, &lt;em&gt;Voluptuous&lt;/em&gt; has one limitation: it cannot handle recursive schemas directly. The simplest way to circumvent this is to use another function as an indirection.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&amp;gt;&amp;gt;&amp;gt; from voluptuous import Schema, Any
&amp;gt;&amp;gt;&amp;gt; def _MySchema(value):
...     return MySchema(value)
...
&amp;gt;&amp;gt;&amp;gt; MySchema = Schema({&quot;foo&quot;: Any(&quot;bar&quot;, _MySchema)})
&amp;gt;&amp;gt;&amp;gt; MySchema({&quot;foo&quot;: {&quot;foo&quot;: &quot;bar&quot;}})
{&apos;foo&apos;: {&apos;foo&apos;: &apos;bar&apos;}}
&amp;gt;&amp;gt;&amp;gt; MySchema({&quot;foo&quot;: {&quot;foo&quot;: &quot;baz&quot;}})
voluptuous.MultipleInvalid: not a valid value for dictionary value @ data[&apos;foo&apos;][&apos;foo&apos;]
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Usage in REST API&lt;/h2&gt;
&lt;p&gt;I started to use &lt;em&gt;Voluptuous&lt;/em&gt; to validate data in the REST API provided by &lt;a href=&quot;http://launchpad.net/gnocchi&quot;&gt;Gnocchi&lt;/a&gt;. So far it has been a really good tool, and we&apos;ve been able to &lt;a href=&quot;http://docs.openstack.org/developer/gnocchi/rest.html&quot;&gt;create a complete REST API&lt;/a&gt; that is very easy to validate on the server side. I would definitely recommend it for that. It blends easily with any Web framework.&lt;/p&gt;
&lt;p&gt;One of the upsides compared to solutions like &lt;a href=&quot;http://json-schema.org/&quot;&gt;JSON Schema&lt;/a&gt; is the ability to create or re-use your own custom data types while converting values at validation time. It is also very Pythonic and extensible, and it is not tied to any serialization format.&lt;/p&gt;
&lt;p&gt;On the other hand, JSON Schema is language agnostic and is serializable itself as JSON. That makes it easy to be exported and provided to a consumer so it can understand the API and validate the data potentially on its side.&lt;/p&gt;
</content:encoded></item><item><title>Reading LWN.net with Pocket</title><link>https://julien.danjou.info/blog/announcing-lwn2pocket/</link><guid isPermaLink="true">https://julien.danjou.info/blog/announcing-lwn2pocket/</guid><description>I&apos;ve started to use Pocket a few months ago to store my backlog of things to read. It&apos;s especially useful as I can use it to read content offline since we still don&apos;t have any Internet access in.</description><pubDate>Thu, 13 Aug 2015 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;I started using &lt;a href=&quot;https://pocket.co&quot;&gt;Pocket&lt;/a&gt; a few months ago to store my backlog of things to read. It&apos;s especially useful as I can use it to read content offline, since we still don&apos;t have any Internet access in places such as airplanes or the Paris metro. It&apos;s only 2015 after all.&lt;/p&gt;
&lt;p&gt;I have also been an &lt;a href=&quot;http://lwn.net&quot;&gt;LWN.net&lt;/a&gt; subscriber for years now, and I really like their articles from the weekly edition. Unfortunately, as access is restricted to subscribers, you need to log in: that makes it impossible to add these articles to Pocket directly. Sad.&lt;/p&gt;
&lt;p&gt;Yesterday, I thought about that and decided to start hacking on it. LWN provides a feature called &quot;Subscriber Link&quot; that allows you to share an article with a friend. I managed to use that feature to share the articles with my friend… Pocket!&lt;/p&gt;
&lt;p&gt;As doing that every week is tedious, I wrote a small Python program called &lt;a href=&quot;https://github.com/jd/lwn2pocket&quot;&gt;lwn2pocket&lt;/a&gt; that I published on &lt;a href=&quot;http://github.com&quot;&gt;GitHub&lt;/a&gt;. Feel free to use it, hack it and send pull requests.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/lwn2pocket.png&quot; alt=&quot;lwn2pocket&quot; /&gt;&lt;/p&gt;
</content:encoded></item><item><title>Ceilometer, Gnocchi &amp; Aodh: Liberty progress</title><link>https://julien.danjou.info/blog/ceilometer-gnocchi-aodh-liberty-progress/</link><guid isPermaLink="true">https://julien.danjou.info/blog/ceilometer-gnocchi-aodh-liberty-progress/</guid><description>It&apos;s been a while since I talked about Ceilometer and its companions, so I thought I&apos;d go ahead and write a bit about what&apos;s going on this side of OpenStack. I&apos;m not going to cover new features and fa</description><pubDate>Tue, 04 Aug 2015 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;It&apos;s been a while since I talked about Ceilometer and its companions, so I thought I&apos;d write a bit about what&apos;s going on on this side of OpenStack. I&apos;m not going to cover new features and fancy stuff today, but rather give a high-level overview of the new project processes we initiated.&lt;/p&gt;
&lt;h2&gt;Ceilometer growing&lt;/h2&gt;
&lt;p&gt;&lt;a href=&quot;http://launchpad.net/ceilometer&quot;&gt;Ceilometer&lt;/a&gt; has grown a lot since we started it three years ago. It has evolved from a system designed to fetch and store measurements into a more complex system, with agents, alarms, events, databases, APIs, etc.&lt;/p&gt;
&lt;p&gt;All those features were needed and asked for by users and operators, but let&apos;s be honest, some of them should never have ended up in the Ceilometer code repository, especially not all at the same time.&lt;/p&gt;
&lt;p&gt;The reality is that we picked a pragmatic approach due to the rigidity of the OpenStack Technical Committee regarding how new projects become OpenStack integrated – and, therefore, blessed – projects. Ceilometer was actually the first project to be incubated and then integrated, so we had to go through the very first issues of that process.&lt;/p&gt;
&lt;p&gt;Fortunately, time has passed and all those constraints have been relaxed. To me, the &lt;a href=&quot;https://www.openstack.org/foundation&quot;&gt;OpenStack Foundation&lt;/a&gt; is turning into something that looks like the &lt;a href=&quot;http://www.apache.org/foundation/&quot;&gt;Apache Foundation&lt;/a&gt;, and there is, therefore, no need to tie technical solutions to political issues.&lt;/p&gt;
&lt;p&gt;Indeed, the &lt;a href=&quot;https://www.openstack.org/summit/vancouver-2015/summit-videos/presentation/the-big-tent-a-look-at-the-new-openstack-projects-governance&quot;&gt;Big Tent&lt;/a&gt; now allows much more flexibility in all of that. Back a year ago, we were afraid to bring Gnocchi into Ceilometer. Was the Technical Committee going to review the project? Was the project going to be in the scope of Ceilometer for the Technical Committee? Now we don&apos;t have to ask ourselves those questions. That freedom empowers us to do what we think is good in terms of technical design without worrying too much about political issues.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/ceilometer-activity.png&quot; alt=&quot;ceilometer-activity&quot; /&gt;&lt;/p&gt;
&lt;h2&gt;Acknowledging Gnocchi&lt;/h2&gt;
&lt;p&gt;The first step in this new process was to continue working on &lt;a href=&quot;https://launchpad.net/gnocchi&quot;&gt;Gnocchi&lt;/a&gt; (a time series database and resource indexer designed to overcome Ceilometer&apos;s historical storage issues) and to decide that merging it into Ceilometer as some REST API v3 was not the right call: it was better to keep it standalone.&lt;/p&gt;
&lt;p&gt;We managed to get traction for Gnocchi, getting a few contributors and users. We&apos;re even seeing talks proposed for the next Tokyo Summit where people leverage Gnocchi, such as &quot;Service of predictive analytics on cost and performance in OpenStack&quot;, &quot;&lt;a href=&quot;https://wiki.openstack.org/wiki/Surveil&quot;&gt;Surveil&lt;/a&gt;&quot; and &quot;Cutting Edge NFV On OpenStack: Healing and Scaling Distributed Applications&quot;.&lt;/p&gt;
&lt;p&gt;We are also making progress on pushing Gnocchi outside of the OpenStack community, as it can be a self-sufficient time series and resource database usable without any OpenStack interaction.&lt;/p&gt;
&lt;h2&gt;Branching Aodh&lt;/h2&gt;
&lt;p&gt;Rather than continuing to grow Ceilometer, during the last summit we all decided that it was time to reorganize and split Ceilometer into the different components it is made of, leveraging a more &lt;a href=&quot;https://en.wikipedia.org/wiki/Service-oriented_architecture&quot;&gt;service-oriented architecture&lt;/a&gt;. Since the alarm subsystem of Ceilometer is mostly untied from the rest of Ceilometer, we decided it was the first and perfect candidate for that. I personally took on the work and created a new repository with only the alarm code from Ceilometer, named &lt;a href=&quot;https://launchpad.net/aodh&quot;&gt;Aodh&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/woman-fire.jpg&quot; alt=&quot;woman-fire&quot; /&gt;&lt;/p&gt;
&lt;p&gt;This made sense for a lot of reasons. First, because Aodh can now work completely standalone, using either Ceilometer or Gnocchi as a backend – or any new plugin you&apos;d write. I love the idea that OpenStack projects can work standalone – like Swift does, for example – without requiring any other OpenStack component. I think it&apos;s a proof of good design. Secondly, because it allows us to reason about a smaller chunk of software – a benefit that is really under-estimated in OpenStack today. I believe that the size of your software should match a certain ratio to the size of your team.&lt;/p&gt;
&lt;p&gt;Aodh is, therefore, a new project under the OpenStack Telemetry program (or what remains of OpenStack programs now), alongside Ceilometer and Gnocchi, forked from the original Ceilometer alarm feature. We&apos;ll deprecate the latter with the Liberty release, and we&apos;ll remove it in the Mitaka release.&lt;/p&gt;
&lt;h2&gt;Lessons learned&lt;/h2&gt;
&lt;p&gt;Actually, moving that code out of Ceilometer (in the case of Aodh), or not merging it in (in the case of Gnocchi), had a few side effects that I admit we probably under-estimated back then.&lt;/p&gt;
&lt;p&gt;Indeed, the code size of Gnocchi or Aodh ended up being much smaller than the entire Ceilometer project – Gnocchi is 7× smaller and Aodh 5× smaller than Ceilometer – and therefore much easier to manipulate and hack on. That allowed us to merge dozens of patches in a few weeks, cleaning up and enhancing a lot of small things in the code. Those tasks are much harder in Ceilometer, due to the bigger size of the code base and the small size of our team. Having our small team work on smaller chunks of changes – even when it meant actually doing more reviews – greatly improved our general velocity and the number of bugs fixed and features implemented.&lt;/p&gt;
&lt;p&gt;On the more sociological side, I think it gave the team the feeling of finally owning the project. Ceilometer was huge, and it was impossible for people to know every side of it. Now, it is possible for people inside a team to cover a much larger portion of those smaller projects, which gives them a greater sense of ownership and care. That ends up being good for the project quality overall.&lt;/p&gt;
&lt;p&gt;That also means that we decided to have different core teams per project (Ceilometer, Gnocchi, and Aodh), as they all serve different purposes and can all be used standalone or with each other – meaning we could have contributors completely ignoring the other projects.&lt;/p&gt;
&lt;p&gt;All of that reminds me of discussions I heard about projects such as Glance trying to fit in new features – some of which are really orthogonal to the original purpose. It&apos;s now clear to me that having different small components interacting together, each completely owned and taken care of by a (small) team of contributors, is the way to go. People who can trust each other and easily bring new people in make a project incredibly more powerful. Having a project cover too wide a set of features makes things more difficult if you don&apos;t have enough manpower. This is clearly an issue that big projects inside OpenStack, such as Neutron or Nova, are facing now.&lt;/p&gt;
</content:encoded></item><item><title>Timezones and Python</title><link>https://julien.danjou.info/blog/python-and-timezones/</link><guid isPermaLink="true">https://julien.danjou.info/blog/python-and-timezones/</guid><description>Recently, I&apos;ve been fighting with the never ending issue of timezones. I never thought I would have plunged into this rabbit hole, but hacking on OpenStack and Gnocchi I felt into that trap easily is,</description><pubDate>Tue, 16 Jun 2015 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Recently, I&apos;ve been fighting with the never-ending issue of timezones. I never thought I would plunge into this rabbit hole, but hacking on OpenStack and Gnocchi, I fell into that trap easily, thanks to Python.&lt;/p&gt;
&lt;h2&gt;“Why you really, really, should never ever deal with timezones”&lt;/h2&gt;
&lt;p&gt;To get a glimpse of the complexity of timezones, I recommend that you watch &lt;a href=&quot;http://www.tomscott.com/&quot;&gt;Tom Scott&lt;/a&gt;&apos;s video on the subject. It&apos;s fun and it summarizes remarkably well the nightmare that timezones are and why you should stop thinking that you&apos;re smart.&lt;/p&gt;
&lt;h2&gt;The importance of timezones in applications&lt;/h2&gt;
&lt;p&gt;Once you&apos;ve heard what Tom says, I think it gets pretty clear that a timestamp without any timezone attached does not give any useful information. It should be considered irrelevant and useless. Without the necessary context given by the timezone, you cannot infer what point in time your application is really referring to.&lt;/p&gt;
&lt;p&gt;That means your application should never handle timestamps with no timezone information. It should either try to guess the timezone, or raise an error if none is provided in any input.&lt;/p&gt;
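&lt;p&gt;A minimal guard sketching that rule, based on the standard library&apos;s definition of an &quot;aware&quot; object (&lt;code&gt;tzinfo&lt;/code&gt; set and returning a UTC offset):&lt;/p&gt;

```python
import datetime

def ensure_aware(dt):
    """Reject naive datetime objects, per the stdlib definition of "aware"."""
    if dt.tzinfo is None or dt.tzinfo.utcoffset(dt) is None:
        raise ValueError("naive datetime rejected: %r" % dt)
    return dt
```

&lt;p&gt;Calling this at every input boundary turns &quot;mystery timestamps&quot; into immediate, debuggable errors.&lt;/p&gt;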
&lt;p&gt;Of course, you can infer that having no timezone information means UTC. This sounds very handy, but can also be dangerous in certain applications or languages – such as Python, as we&apos;ll see.&lt;/p&gt;
&lt;p&gt;Indeed, in certain applications, converting timestamps to UTC and losing the timezone information is a terrible idea. Imagine that a user creates a recurring event every Wednesday at 10:00 in their local timezone, say CET. If you convert that to UTC, the event will end up being stored as every Wednesday at 09:00.&lt;/p&gt;
&lt;p&gt;Now imagine that the CET timezone switches from UTC+01:00 to UTC+02:00: your application will compute that the event starts at 11:00 CET every Wednesday. Which is wrong, because as the user told you, the event starts at 10:00 CET, whatever the definition of CET is. Not at 11:00 CET. So CET means CET, not necessarily UTC+1.&lt;/p&gt;
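&lt;p&gt;You can observe the trap directly with &lt;em&gt;pytz&lt;/em&gt;: the same local wall-clock time maps to different UTC offsets depending on the date, so storing only the UTC conversion loses information. A quick sketch:&lt;/p&gt;

```python
import datetime
import pytz

paris = pytz.timezone("Europe/Paris")  # CET in winter, CEST in summer

winter = paris.localize(datetime.datetime(2015, 1, 7, 10, 0))
summer = paris.localize(datetime.datetime(2015, 7, 8, 10, 0))

# The same "Wednesday at 10:00 local" maps to different UTC instants.
print(winter.utcoffset())  # 1:00:00
print(summer.utcoffset())  # 2:00:00
```

&lt;p&gt;If only the UTC instant is stored, one of the two occurrences will drift by an hour after the DST switch.&lt;/p&gt;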
&lt;p&gt;As for endpoints like REST APIs, something I deal with daily, all timestamps should include timezone information. Otherwise it&apos;s nearly impossible to know what timezone the timestamps are in: UTC? Server local? User local? No way to know.&lt;/p&gt;
&lt;h2&gt;Python design &amp;amp; defect&lt;/h2&gt;
&lt;p&gt;Python comes with a timestamp object named &lt;code&gt;datetime.datetime&lt;/code&gt;. It can store date and time precise to the microsecond, and is qualified as timezone &quot;aware&quot; or &quot;unaware&quot;, depending on whether it embeds timezone information.&lt;/p&gt;
&lt;p&gt;To build such an object based on the current time, one can use &lt;code&gt;datetime.datetime.utcnow()&lt;/code&gt; to retrieve the date and time for the UTC timezone, and &lt;code&gt;datetime.datetime.now()&lt;/code&gt; to retrieve the date and time for the current timezone, whatever it is.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&amp;gt;&amp;gt;&amp;gt; import datetime
&amp;gt;&amp;gt;&amp;gt; datetime.datetime.utcnow()
datetime.datetime(2015, 6, 15, 13, 24, 48, 27631)
&amp;gt;&amp;gt;&amp;gt; datetime.datetime.now()
datetime.datetime(2015, 6, 15, 15, 24, 52, 276161)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;As you can notice, none of these results contains timezone information. Indeed, the Python &lt;code&gt;datetime&lt;/code&gt; API always returns unaware &lt;code&gt;datetime&lt;/code&gt; objects, which is very unfortunate: as soon as you get one of these objects, there is no way to know what its timezone is, so these objects are pretty &quot;useless&quot; on their own.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;http://lucumr.pocoo.org/2011/7/15/eppur-si-muove/&quot;&gt;Armin Ronacher proposes that an application always treat the unaware &lt;code&gt;datetime&lt;/code&gt; objects from Python as UTC&lt;/a&gt;. As we just saw, that assumption cannot hold for objects returned by &lt;code&gt;datetime.datetime.now()&lt;/code&gt;, so I would not advise doing so. &lt;code&gt;datetime&lt;/code&gt; objects with no timezone should be considered a &quot;bug&quot; in the application.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/timezone-map.jpg&quot; alt=&quot;timezone-map&quot; /&gt;&lt;/p&gt;
&lt;h2&gt;Recommendations&lt;/h2&gt;
&lt;p&gt;My recommendation list comes down to:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Always use aware &lt;code&gt;datetime&lt;/code&gt; objects, i.e. with timezone information. That makes sure you can compare them directly (aware and unaware &lt;code&gt;datetime&lt;/code&gt; objects are not comparable) and will return them correctly to users. Leverage &lt;a href=&quot;http://pytz.sourceforge.net/&quot;&gt;pytz&lt;/a&gt; to have timezone objects.&lt;/li&gt;
&lt;li&gt;Use &lt;a href=&quot;https://en.wikipedia.org/wiki/ISO_8601&quot;&gt;ISO 8601&lt;/a&gt; as input and output string format. Use &lt;code&gt;datetime.datetime.isoformat()&lt;/code&gt; to return timestamps as string formatted using that format, which includes the timezone information.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;In Python, that&apos;s equivalent to having:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&amp;gt;&amp;gt;&amp;gt; import datetime
&amp;gt;&amp;gt;&amp;gt; import pytz
&amp;gt;&amp;gt;&amp;gt; def utcnow():
...     return datetime.datetime.now(tz=pytz.utc)
...
&amp;gt;&amp;gt;&amp;gt; utcnow()
datetime.datetime(2015, 6, 15, 14, 45, 19, 182703, tzinfo=&amp;lt;UTC&amp;gt;)
&amp;gt;&amp;gt;&amp;gt; utcnow().isoformat()
&apos;2015-06-15T14:45:21.982600+00:00&apos;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;If you need to parse strings containing ISO 8601 formatted timestamps, you can rely on the &lt;em&gt;&lt;a href=&quot;https://pypi.python.org/pypi/iso8601&quot;&gt;iso8601&lt;/a&gt;&lt;/em&gt; module, which returns timestamps with correct timezone information. This makes timestamps directly comparable:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&amp;gt;&amp;gt;&amp;gt; import iso8601
&amp;gt;&amp;gt;&amp;gt; iso8601.parse_date(utcnow().isoformat())
datetime.datetime(2015, 6, 15, 14, 46, 43, 945813, tzinfo=&amp;lt;FixedOffset &apos;+00:00&apos; datetime.timedelta(0)&amp;gt;)
&amp;gt;&amp;gt;&amp;gt; iso8601.parse_date(utcnow().isoformat()) &amp;lt; utcnow()
True
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;If you need to store those timestamps, the same rule should apply. If you rely on &lt;a href=&quot;http://mongodb.org&quot;&gt;MongoDB&lt;/a&gt;, it assumes that all the timestamps are in UTC, so be careful when storing them – you will have to normalize the timestamps to UTC.&lt;/p&gt;
&lt;p&gt;For &lt;a href=&quot;http://mysql.org&quot;&gt;MySQL&lt;/a&gt;, nothing is assumed, it&apos;s up to the application to insert them in a timezone that makes sense to it. Obviously, if you have multiple applications accessing the same database with different data sources, this can end up being a nightmare.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;http://postgresql.org&quot;&gt;PostgreSQL&lt;/a&gt; has a &lt;a href=&quot;http://www.postgresql.org/docs/9.4/static/datatype-datetime.html&quot;&gt;special data type that is recommended&lt;/a&gt; called &lt;code&gt;timestamp with time zone&lt;/code&gt;. That does not mean you should stop using UTC in most cases; it just means you can be sure that the timestamps are stored in UTC once written to the database, without having to check whether another application inserted timestamps with a different timezone.&lt;/p&gt;
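&lt;p&gt;Normalizing an aware timestamp to UTC before storing it – as needed for MongoDB, for example – is a one-liner with &lt;code&gt;astimezone&lt;/code&gt;; a small sketch using &lt;em&gt;pytz&lt;/em&gt;:&lt;/p&gt;

```python
import datetime
import pytz

def to_utc(dt):
    """Convert an aware datetime to UTC before storing it."""
    if dt.tzinfo is None:
        raise ValueError("refusing to store a naive datetime")
    return dt.astimezone(pytz.utc)

paris = pytz.timezone("Europe/Paris")
local = paris.localize(datetime.datetime(2015, 6, 16, 15, 0))  # CEST, +02:00
print(to_utc(local))  # 2015-06-16 13:00:00+00:00
```

&lt;p&gt;Since both values denote the same instant, they still compare equal; only the representation is normalized.&lt;/p&gt;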
&lt;h2&gt;OpenStack status&lt;/h2&gt;
&lt;p&gt;As a side note, I&apos;ve improved the OpenStack situation recently by changing the &lt;a href=&quot;http://docs.openstack.org/developer/oslo.utils/api/timeutils.html&quot;&gt;oslo.utils.timeutils&lt;/a&gt; module to deprecate some useless and dangerous functions. I&apos;ve also added support for returning timezone-aware objects when using the &lt;code&gt;oslo_utils.timeutils.utcnow()&lt;/code&gt; function. Unfortunately, it&apos;s not possible to make that the default for backward compatibility reasons, but it&apos;s there nevertheless, and using it is advised. Thanks to my colleague &lt;a href=&quot;http://haypo-notes.readthedocs.org/&quot;&gt;Victor&lt;/a&gt; for the help!&lt;/p&gt;
&lt;p&gt;Have a nice day, whatever your timezone is!&lt;/p&gt;
</content:encoded></item><item><title>Get back up and try again: retrying in Python</title><link>https://julien.danjou.info/blog/python-retrying/</link><guid isPermaLink="true">https://julien.danjou.info/blog/python-retrying/</guid><description>The library presented in this article is becoming obsolete and un-maintained. I recommend you to read this post about tenacity   instead.  I don&apos;t often write about tools I use when for my daily softw</description><pubDate>Tue, 02 Jun 2015 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;strong&gt;The library presented in this article is becoming obsolete and unmaintained. I recommend you read this post about &lt;a href=&quot;https://julien.danjou.info/blog/python-tenacity&quot;&gt;tenacity&lt;/a&gt; instead.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;I don&apos;t often write about the tools I use for my daily software development tasks. I recently realized that I really should start sharing my workflows and weapons of choice more often.&lt;/p&gt;
&lt;p&gt;One thing that I have a hard time enduring while doing Python code reviews is people writing utility code that is not directly tied to the core of their business. This looks to me like wasted time spent maintaining code that should be reused from elsewhere.&lt;/p&gt;
&lt;p&gt;So today I&apos;d like to start with &lt;a href=&quot;https://pypi.python.org/pypi/retrying&quot;&gt;retrying&lt;/a&gt;, a Python package that you can use to… retry anything.&lt;/p&gt;
&lt;h3&gt;It&apos;s OK to fail&lt;/h3&gt;
&lt;p&gt;Often in computing, you have to deal with external resources. That means accessing resources you don&apos;t control. Resources that can fail, flap, or become unreachable or unavailable.&lt;/p&gt;
&lt;p&gt;Most applications don&apos;t deal with that at all, and explode in flight, leaving a skeptical user in front of the computer. A lot of software engineers refuse to deal with failure, and don&apos;t bother handling this kind of scenario in their code.&lt;/p&gt;
&lt;p&gt;In the best case, applications usually handle only the simple case where the external system is out of order: they log something and inform the user that they should try again later.&lt;/p&gt;
&lt;p&gt;In this cloud computing era, we tend to design software components with a &lt;a href=&quot;https://en.wikipedia.org/wiki/Service-oriented_architecture&quot;&gt;service-oriented architecture&lt;/a&gt; in mind. That means having a lot of different services talking to each other over the network. And we all know that networks tend to fail, and distributed systems too. Writing software with failure being part of normal operation is a great idea.&lt;/p&gt;
&lt;h3&gt;Retrying&lt;/h3&gt;
&lt;p&gt;In order to help applications handle these potential failures, you need a plan. Leaving the burden of &quot;trying again later&quot; to the user is rarely a good choice. Therefore, most of the time you want your application to &lt;em&gt;retry&lt;/em&gt;.&lt;/p&gt;
&lt;p&gt;Retrying an action is a full strategy on its own, with a lot of options. You can retry only under certain conditions, with attempts based on time (e.g. every second), based on a number of attempts (e.g. retry 3 times and abort), based on the problem encountered, or even on all of those.&lt;/p&gt;
&lt;p&gt;For all of that, I use the &lt;a href=&quot;https://github.com/rholder/retrying&quot;&gt;retrying&lt;/a&gt; library that you can retrieve easily on &lt;a href=&quot;https://pypi.python.org/pypi/retrying&quot;&gt;PyPI&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;retrying&lt;/em&gt; provides a decorator called &lt;code&gt;retry&lt;/code&gt; that you can use on top of any function or method in Python to make it retry in case of failure. By default, &lt;code&gt;retry&lt;/code&gt; calls your function endlessly until it returns rather than raising an error.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import random
from retrying import retry

@retry
def pick_one():
    if random.randint(0, 10) != 1:
        raise Exception(&quot;1 was not picked&quot;)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This will execute the function &lt;code&gt;pick_one&lt;/code&gt; until &lt;code&gt;1&lt;/code&gt; is returned by &lt;code&gt;random.randint&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;&lt;code&gt;retry&lt;/code&gt; accepts a few arguments, such as the minimum and maximum delays to use, which also can be randomized. Randomizing the delay is a good strategy to avoid detectable patterns or congestion. Moreover, it supports exponential delay, which can be used to implement &lt;a href=&quot;https://en.wikipedia.org/wiki/Exponential_backoff&quot;&gt;exponential backoff&lt;/a&gt;, a good solution for retrying tasks while really avoiding congestion. It&apos;s especially handy for background tasks.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;@retry(wait_exponential_multiplier=1000, wait_exponential_max=10000)
def wait_exponential_1000():
    print &quot;Wait 2^x * 1000 milliseconds between each retry, up to 10 seconds, then 10 seconds afterwards&quot;
    raise Exception(&quot;Retry!&quot;)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;You can mix that with a maximum delay, which can give you a good strategy to retry for a while, and then fail anyway:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;## Stop retrying after 30 seconds anyway
&amp;gt;&amp;gt;&amp;gt; @retry(wait_exponential_multiplier=1000, wait_exponential_max=10000, stop_max_delay=30000)
... def wait_exponential_1000():
...     print &quot;Wait 2^x * 1000 milliseconds between each retry, up to 10 seconds, then 10 seconds afterwards&quot;
...     raise Exception(&quot;Retry!&quot;)
...
&amp;gt;&amp;gt;&amp;gt; wait_exponential_1000()
Wait 2^x * 1000 milliseconds between each retry, up to 10 seconds, then 10 seconds afterwards
Wait 2^x * 1000 milliseconds between each retry, up to 10 seconds, then 10 seconds afterwards
Wait 2^x * 1000 milliseconds between each retry, up to 10 seconds, then 10 seconds afterwards
Wait 2^x * 1000 milliseconds between each retry, up to 10 seconds, then 10 seconds afterwards
Wait 2^x * 1000 milliseconds between each retry, up to 10 seconds, then 10 seconds afterwards
Wait 2^x * 1000 milliseconds between each retry, up to 10 seconds, then 10 seconds afterwards
Traceback (most recent call last):
  File &quot;&amp;lt;stdin&amp;gt;&quot;, line 1, in &amp;lt;module&amp;gt;
  File &quot;/usr/local/lib/python2.7/site-packages/retrying.py&quot;, line 49, in wrapped_f
    return Retrying(*dargs, **dkw).call(f, *args, **kw)
  File &quot;/usr/local/lib/python2.7/site-packages/retrying.py&quot;, line 212, in call
    raise attempt.get()
  File &quot;/usr/local/lib/python2.7/site-packages/retrying.py&quot;, line 247, in get
    six.reraise(self.value[0], self.value[1], self.value[2])
  File &quot;/usr/local/lib/python2.7/site-packages/retrying.py&quot;, line 200, in call
    attempt = Attempt(fn(*args, **kwargs), attempt_number, False)
  File &quot;&amp;lt;stdin&amp;gt;&quot;, line 4, in wait_exponential_1000
  Exception: Retry!
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;A pattern I use very often is the ability to retry only based on some exception type. You can specify a function that filters out the exceptions you want to ignore from the ones that should trigger a retry.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;def retry_on_ioerror(exc):
    return isinstance(exc, IOError)

@retry(retry_on_exception=retry_on_ioerror)
def read_file():
    with open(&quot;myfile&quot;, &quot;r&quot;) as f:
        return f.read()
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;code&gt;retry&lt;/code&gt; will call the function passed as &lt;code&gt;retry_on_exception&lt;/code&gt; with the exception raised as first argument. It&apos;s up to the function to then return a boolean indicating if a retry should be performed or not. In the example above, this will only retry to read the file if an &lt;code&gt;IOError&lt;/code&gt; occurs; if any other exception type is raised, no retry will be performed.&lt;/p&gt;
&lt;p&gt;The same pattern can be implemented using the keyword argument &lt;code&gt;retry_on_result&lt;/code&gt;, where you can provide a function that analyses the result and retry based on it.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;def retry_if_file_empty(result):
    return len(result) &amp;lt;= 0

@retry(retry_on_result=retry_if_file_empty)
def read_file():
    with open(&quot;myfile&quot;, &quot;r&quot;) as f:
        return f.read()
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This example will read the file until it stops being empty. If the file does not exist, an &lt;code&gt;IOError&lt;/code&gt; is raised, and the default behavior – which triggers a retry on all exceptions – kicks in: the retry is therefore performed.&lt;/p&gt;
&lt;p&gt;That&apos;s it! &lt;em&gt;retrying&lt;/em&gt; is really a good and small library that you should leverage rather than implementing your own half-baked solution!&lt;/p&gt;
</content:encoded></item><item><title>OpenStack Summit Liberty from a Ceilometer &amp; Gnocchi point of view</title><link>https://julien.danjou.info/blog/openstack-summit-liberty-vancouver-ceilometer-gnocchi/</link><guid isPermaLink="true">https://julien.danjou.info/blog/openstack-summit-liberty-vancouver-ceilometer-gnocchi/</guid><description>Last week I was in Vancouver, BC for the OpenStack Summit, discussing the new Liberty version that will be released in 6 months.</description><pubDate>Tue, 26 May 2015 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Last week I was in &lt;a href=&quot;http://vancouver.ca/&quot;&gt;Vancouver, BC&lt;/a&gt; for the &lt;a href=&quot;https://www.openstack.org/summit/vancouver-2015/&quot;&gt;OpenStack Summit&lt;/a&gt;, discussing the new Liberty version that will be released in 6 months.&lt;/p&gt;
&lt;p&gt;I attended the summit mainly to discuss and follow up on new developments in Ceilometer, Gnocchi and Oslo. It has been a pretty good week, and we were able to discuss and plan a few interesting things.&lt;/p&gt;
&lt;h2&gt;Ops feedback&lt;/h2&gt;
&lt;p&gt;We had half a dozen Ceilometer sessions, and the first one was dedicated to getting feedback from operators using Ceilometer. We had a few operators present, and a few of the Ceilometer team. We had a constructive discussion, and my feeling is that operators struggle with 2 things so far: scaling Ceilometer storage, and keeping Ceilometer from hurting the rest of OpenStack.&lt;/p&gt;
&lt;p&gt;We discussed the first point as being addressed by &lt;a href=&quot;http://launchpad.net/gnocchi&quot;&gt;Gnocchi&lt;/a&gt;, and I presented Gnocchi itself a bit, as well as how and why it will fix the storage scalability issues operators have encountered so far.&lt;/p&gt;
&lt;p&gt;Ceilometer slowing down the OpenStack installation is a more interesting problem. Ceilometer pollsters request information from Nova, Glance… to gather statistics. Until Kilo, Ceilometer used to do that regularly and at fixed intervals, causing high load spikes in OpenStack. With the &lt;a href=&quot;http://docs.openstack.org/developer/ceilometer/architecture.html#polling-agents-asking-for-data&quot;&gt;introduction of jitter&lt;/a&gt; in Kilo, this should be less of a problem. However, Ceilometer hits various endpoints in OpenStack that are poorly designed, and hitting those endpoints of Nova or other components triggers a lot of load on the platform. Unfortunately, this makes operators blame Ceilometer rather than the components guilty of poor design. We&apos;d like to push forward improving these components, but it&apos;s probably going to take a long time.&lt;/p&gt;
&lt;h2&gt;Componentisation&lt;/h2&gt;
&lt;p&gt;When I started the Gnocchi project last year, I realized pretty soon that we would be able to split Ceilometer itself into different smaller components that could work independently, while being able to leverage each other. For example, Gnocchi can run standalone and store your metrics even if you don&apos;t use Ceilometer – nor even OpenStack itself.&lt;/p&gt;
&lt;p&gt;My fellow developer &lt;a href=&quot;http://burningchrome.com/&quot;&gt;Chris Dent&lt;/a&gt; had the same idea about splitting Ceilometer a few months ago and drafted a proposal. The idea is to have Ceilometer split into different parts that people could assemble together or run on their own.&lt;/p&gt;
&lt;p&gt;Interestingly enough, we had three 40-minute sessions planned to talk and debate about this division of Ceilometer, though we all agreed within 5 minutes that it was the right thing to do. Five minutes later, we agreed on which parts to split out. The rest of the time was allocated to discussing various details of that split, and I committed to start the work with the Ceilometer alarming subsystem.&lt;/p&gt;
&lt;p&gt;I wrote a &lt;a href=&quot;https://review.openstack.org/#/c/184307/&quot;&gt;specification&lt;/a&gt; on the plane bringing me to Vancouver, which should be approved pretty soon now. I have already started the implementation work. So fingers crossed, Ceilometer should have a new component in Liberty handling alarming on its own.&lt;/p&gt;
&lt;p&gt;This would allow users, for example, to deploy only Gnocchi and Ceilometer alarming. They would be able to feed data to Gnocchi using their own system and build alarms using the Ceilometer alarm subsystem relying on Gnocchi&apos;s data.&lt;/p&gt;
&lt;h2&gt;Gnocchi&lt;/h2&gt;
&lt;p&gt;We didn&apos;t have a slot dedicated to Gnocchi – mainly because I indicated I didn&apos;t feel we needed one. We discussed a few points over coffee anyway, and I was able to draw up a few new ideas and changes I&apos;d like to see in Gnocchi. Mainly, changing the API contract to be more asynchronous so we can support &lt;a href=&quot;http://influxdb.com/&quot;&gt;InfluxDB&lt;/a&gt; more correctly, and improving the drivers based on Carbonara (the library we created to manipulate timeseries) to be faster.&lt;/p&gt;
&lt;p&gt;All of that – plus a few Oslo tasks I&apos;d like to tackle – should keep me busy for the next cycle!&lt;/p&gt;
&lt;/content:encoded&gt;&lt;/item&gt;&lt;item&gt;&lt;title&gt;My interview about software tests and Python&lt;/title&gt;&lt;link&gt;https://julien.danjou.info/blog/interview-software-tests-in-python/&lt;/link&gt;&lt;guid isPermaLink=&quot;true&quot;&gt;https://julien.danjou.info/blog/interview-software-tests-in-python/&lt;/guid&gt;&lt;description&gt;Johannes Hubertz interviewed me for his upcoming German book about Python software testing, covering my work on OpenStack and testing best practices.&lt;/description&gt;&lt;pubDate&gt;Mon, 11 May 2015 00:00:00 GMT&lt;/pubDate&gt;&lt;content:encoded&gt;&amp;lt;p&amp;gt;I&amp;apos;ve recently been contacted by &amp;lt;a href=&amp;quot;http://hubertz.de/blog/&amp;quot;&amp;gt;Johannes Hubertz&amp;lt;/a&amp;gt;, who is writing a new book about Python in German called &amp;lt;em&amp;gt;&amp;quot;Softwaretests mit Python&amp;quot;&amp;lt;/em&amp;gt; which will be published by &amp;lt;em&amp;gt;Open Source Press, Munich&amp;lt;/em&amp;gt; this summer. His book will feature some interviews, and he was kind enough to let me write a bit about software testing. This is the interview that I gave for his book. Johannes translated it to German for inclusion in the book, and I decided to publish it on my blog today. What follows is the original version.&amp;lt;/p&amp;gt;
&lt;h2&gt;How did you come to Python?&lt;/h2&gt;
&lt;p&gt;I don&apos;t recall exactly, but around ten years ago, I saw more and more people using it and decided to take a look. Back then, I was more used to Perl. I didn&apos;t really like Perl and was not getting a good grip on its object system.&lt;/p&gt;
&lt;p&gt;As soon as I found an idea to work on – if I remember correctly that was rebuildd – I started to code in Python, learning the language at the same time.&lt;/p&gt;
&lt;p&gt;I liked how Python worked, and how fast I was able to develop with it while learning it, so I decided to keep using it for my next projects. I ended up diving into Python&apos;s core for various reasons, even briefly hacking on projects like Cython at some point, and finally ended up working on OpenStack.&lt;/p&gt;
&lt;p&gt;OpenStack is a cloud computing platform entirely written in Python. So I&apos;ve been writing Python every day since working on it.&lt;/p&gt;
&lt;p&gt;That&apos;s what pushed me to write &lt;a href=&quot;https://thehackerguidetopython.com&quot;&gt;The Hacker&apos;s Guide to Python&lt;/a&gt; in 2013 and then self-publish it a year later in 2014, a book where I talk about doing smart and efficient Python.&lt;/p&gt;
&lt;p&gt;It has been a great success, and has even been translated into Chinese and Korean, so I&apos;m currently working on a second edition of the book. It has been an amazing adventure!&lt;/p&gt;
&lt;h2&gt;Zen of Python: Which line is the most important for you and why?&lt;/h2&gt;
&lt;p&gt;I like &quot;There should be one – and preferably only one – obvious way to do it&quot;. The opposite is probably something that scared me in languages like Perl. Having one obvious way to do it is something I tend to like in functional languages like Lisp, which are, in my humble opinion, even better at that.&lt;/p&gt;
&lt;h2&gt;For a python newbie, what are the most difficult subjects in Python?&lt;/h2&gt;
&lt;p&gt;I haven&apos;t been a newbie for a while, so it&apos;s hard for me to say. I don&apos;t think the language is hard to learn. There are some subtleties in the language itself when you dive deeply into the internals, but for beginners most of the concepts are pretty straightforward. If I had to pick, among the language basics, the most difficult thing would be generator objects (yield).&lt;/p&gt;
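&lt;p&gt;A tiny example of what trips people up with generators: the function body does not run when the function is called, only when the generator is iterated, and it pauses at each &lt;code&gt;yield&lt;/code&gt;:&lt;/p&gt;

```python
def countdown(n):
    # This body does not execute until the generator is iterated.
    while n > 0:
        yield n
        n -= 1

gen = countdown(3)   # no code in the body has run yet
print(next(gen))     # 3: runs up to the first yield, then pauses
print(list(gen))     # [2, 1]: resumes where it left off
```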
&lt;p&gt;Nowadays I think the most difficult subjects for newcomers are which version of Python to use, which libraries to rely on, and how to package and distribute projects. Fortunately, things are getting better.&lt;/p&gt;
&lt;h2&gt;When did you start using Test Driven Development and why?&lt;/h2&gt;
&lt;p&gt;I learned unit testing and TDD at school, where teachers forced me to learn Java, and I hated it. The frameworks looked complicated, and I had the impression I was wasting my time. Which I actually was, since I was writing disposable programs – that&apos;s the only thing you do at school.&lt;/p&gt;
&lt;p&gt;Years later, when I started to write real and bigger programs (e.g. rebuildd), I quickly ended up fixing bugs… that I had already fixed. That reminded me of unit tests, and that it might be a good idea to start using them to stop fixing the same things over and over again.&lt;/p&gt;
&lt;p&gt;For a few years, I wrote less Python and more C and Lua code (for the &lt;a href=&quot;http://awesome.naquadah.org&quot;&gt;awesome window manager&lt;/a&gt;), and I didn&apos;t do any testing. I probably lost hundreds of hours testing manually and fixing regressions – that was a good lesson. Though I had a good excuse at the time – it is/was way harder to do testing in C/Lua than in Python.&lt;/p&gt;
&lt;p&gt;Since that period, I have never stopped writing &quot;tests&quot;. When I started to hack on OpenStack, the project was adopting a &quot;no test? no merge!&quot; policy due to the high number of regressions it had during the first releases.&lt;/p&gt;
&lt;p&gt;I honestly don&apos;t think I could work on any project that does not have – at least minimal – test coverage. It&apos;s impossible to hack efficiently on a code base that you&apos;re not able to test with just a simple command. It&apos;s also a real problem for newcomers in the open source world. When there are no tests, you can hack something, send a patch, and get a &quot;you broke this&quot; in response.&lt;/p&gt;
&lt;p&gt;Nowadays, this kind of response sounds unacceptable to me: if there is no test, then I didn&apos;t break anything!&lt;/p&gt;
&lt;p&gt;In the end, it&apos;s just too much frustration to work on non tested projects as I demonstrated in &lt;a href=&quot;https://julien.danjou.info/blog/python-bad-practice-concrete-case&quot;&gt;my study of whisper source code&lt;/a&gt;.&lt;/p&gt;
&lt;h2&gt;What do you think to be the most often seen pitfalls of TDD and how to avoid them best?&lt;/h2&gt;
&lt;p&gt;The biggest problems are when to write tests and at what rate.&lt;/p&gt;
&lt;p&gt;On one hand, some people start to write overly precise tests way too soon. Doing that slows you down, especially when you are prototyping some idea or concept you just had. That does not mean that you should not write tests at all, but you should probably start with light coverage until you are pretty sure that you&apos;re not going to rip everything out and start over. On the other hand, some people postpone writing tests forever, and end up with no tests at all or a layer of tests that is too thin, leaving the project with pretty low coverage.&lt;/p&gt;
&lt;p&gt;Basically, your test coverage should reflect the state of your project. If it&apos;s just starting, you should build a thin layer of tests so you can hack on it easily and remodel it if needed. The more your project grows, the more you should make it solid and lay down more tests.&lt;/p&gt;
&lt;p&gt;Having too detailed tests makes it painful to evolve the project at the start. Not having enough tests in a big project makes it painful to maintain.&lt;/p&gt;
&lt;h2&gt;Do you think, TDD fits and scales well for the big projects like OpenStack?&lt;/h2&gt;
&lt;p&gt;Not only do I think it fits and scales well, I also think it&apos;s just impossible not to use TDD in such big projects.&lt;/p&gt;
&lt;p&gt;When unit and functional test coverage was weak in OpenStack – at its beginning – it was just impossible to fix a bug or write a new feature without breaking a lot of things without even noticing. We would release version N, and a ton of old bugs present in N-2 – but fixed in N-1 – were reopened.&lt;/p&gt;
&lt;p&gt;For big projects, with a lot of different use cases, configuration options, etc., you need belt and braces. You cannot throw code into a repository thinking it&apos;s ever going to work, and you can&apos;t afford to test everything manually at each commit. That&apos;s just insane.&lt;/p&gt;
&lt;/content:encoded&gt;&lt;/item&gt;&lt;item&gt;&lt;title&gt;The Hacker&amp;apos;s Guide to Python, 2nd edition!&lt;/title&gt;&lt;link&gt;https://julien.danjou.info/blog/the-hacker-guide-to-python-second-edition/&lt;/link&gt;&lt;guid isPermaLink=&quot;true&quot;&gt;https://julien.danjou.info/blog/the-hacker-guide-to-python-second-edition/&lt;/guid&gt;&lt;description&gt;A year passed since the first release of The Hacker&amp;apos;s Guide to Python in March 2014. A few hundred copies have been distributed so far, and the feedback is wonderful!&lt;/description&gt;&lt;pubDate&gt;Mon, 04 May 2015 00:00:00 GMT&lt;/pubDate&gt;&lt;content:encoded&gt;&amp;lt;p&amp;gt;A year passed since the &amp;lt;a href=&amp;quot;https://julien.danjou.info/blog/2014/the-hacker-guide-to-python-has-been-released&amp;quot;&amp;gt;first release of The Hacker&amp;apos;s Guide to Python&amp;lt;/a&amp;gt; in March 2014. A few hundred copies have been distributed so far, and the feedback is wonderful!&amp;lt;/p&amp;gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/the-hacker-guide-to-python-darken-v2.png&quot; alt=&quot;the-hacker-guide-to-python-darken-v2&quot; /&gt;&lt;/p&gt;
&lt;p&gt;I already wrote extensively about the &lt;a href=&quot;https://julien.danjou.info/blog/making-of-the-hacker-guide-to-python&quot;&gt;making of that book&lt;/a&gt; last year, and I cannot emphasize enough how this adventure has been amazing so far. That&apos;s why I decided a few months ago to update the guide and add some new content.&lt;/p&gt;
&lt;p&gt;So let&apos;s talk about what&apos;s new in this second edition of the book!&lt;/p&gt;
&lt;p&gt;First, I obviously fixed a few things. I had some reports of small mistakes and typos, which I fixed as I received them. Not a lot, fortunately, but it&apos;s still better to have fewer errors in a book, right?&lt;/p&gt;
&lt;p&gt;Then, I updated some of the content. Things changed since I wrote the first chapters of that guide 18 months ago. Therefore I had to rewrite some of the sections and take into account new software or libraries that were released.&lt;/p&gt;
&lt;p&gt;At last, I decided to enhance the book with one more interview. I asked my fellow OpenStack developer &lt;a href=&quot;https://github.com/harlowja&quot;&gt;Joshua Harlow&lt;/a&gt;, who is leading a few interesting Python projects, to join the long list of interviewees in the book. I hope you&apos;ll enjoy it!&lt;/p&gt;
&lt;p&gt;If you didn&apos;t get the book yet, go &lt;a href=&quot;https://thehackerguidetopython.com&quot;&gt;check it out&lt;/a&gt; and use the coupon &lt;strong&gt;THGTP2LAUNCH&lt;/strong&gt; to get 20% off during the next 48 hours!&lt;/p&gt;
</content:encoded></item><item><title>Gnocchi 1.0: storing metrics and resources at scale</title><link>https://julien.danjou.info/blog/openstack-gnocchi-first-release/</link><guid isPermaLink="true">https://julien.danjou.info/blog/openstack-gnocchi-first-release/</guid><description>A few months ago, I wrote a long post about what I called back then the &quot; Gnocchi experiment &quot;. Time passed and we – me and the rest of the Gnocchi team – continued to work on that project, finalizing</description><pubDate>Tue, 21 Apr 2015 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;A few months ago, I wrote a long post about what I called back then the &quot;&lt;a href=&quot;https://julien.danjou.info/blog/openstack-ceilometer-the-gnocchi-experiment&quot;&gt;Gnocchi experiment&lt;/a&gt;&quot;. Time passed and we – me and the rest of the Gnocchi team – continued to work on that project, finalizing it.&lt;/p&gt;
&lt;p&gt;It is with great pleasure that we are going to release our first &lt;em&gt;1.0&lt;/em&gt; version this month, roughly at the same time that the integrated &lt;a href=&quot;http://openstack.org&quot;&gt;OpenStack&lt;/a&gt; projects release their Kilo milestone. The &lt;a href=&quot;https://pypi.python.org/pypi/gnocchi&quot;&gt;first release candidate, numbered 1.0.0rc1&lt;/a&gt;, was released this morning!&lt;/p&gt;
&lt;h2&gt;The problem to solve&lt;/h2&gt;
&lt;p&gt;Before I dive into Gnocchi details, it&apos;s important to have a good view of what problems Gnocchi is trying to solve.&lt;/p&gt;
&lt;p&gt;Most IT infrastructures out there consist of a set of resources. These resources have properties: some of them are simple attributes, whereas others might be measurable quantities (also known as metrics).&lt;/p&gt;
&lt;p&gt;In this context, cloud infrastructures are no exception. We talk about instances, volumes, networks… which are all different kinds of resources. The problem arising with the cloud trend is the scalability of storing all this data and being able to query it later, for whatever usage.&lt;/p&gt;
&lt;p&gt;What Gnocchi provides is a REST API that allows the user to manipulate resources (CRUD) and their attributes, while preserving the history of those resources and their attributes.&lt;/p&gt;
&lt;p&gt;Gnocchi is fully documented and the &lt;a href=&quot;http://gnocchi.xyz&quot;&gt;documentation is available online&lt;/a&gt;. We are the first OpenStack project to require patches to &lt;em&gt;include documentation&lt;/em&gt;. We want to raise the bar, so we took a stand on that. That&apos;s part of our policy, the same way it&apos;s part of the OpenStack policy to require unit tests.&lt;/p&gt;
&lt;p&gt;I&apos;m not going to paraphrase the whole Gnocchi documentation, which covers things like installation (super easy), but I&apos;ll guide you through some basics of the features provided by the REST API. I will show you some examples so you can have a better understanding of what you could leverage using Gnocchi!&lt;/p&gt;
&lt;h2&gt;Handling metrics&lt;/h2&gt;
&lt;p&gt;Gnocchi provides a full REST API to manipulate time-series that are called &lt;em&gt;metrics&lt;/em&gt;. You can easily create a metric using a simple HTTP request:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;POST /v1/metric HTTP/1.1
Content-Type: application/json

{
  &quot;archive_policy_name&quot;: &quot;low&quot;
}

HTTP/1.1 201 Created
Location: http://localhost/v1/metric/387101dc-e4b1-4602-8f40-e7be9f0ed46a
Content-Type: application/json; charset=UTF-8

{
  &quot;archive_policy&quot;: {
    &quot;aggregation_methods&quot;: [
      &quot;std&quot;,
      &quot;sum&quot;,
      &quot;mean&quot;,
      &quot;count&quot;,
      &quot;max&quot;,
      &quot;median&quot;,
      &quot;min&quot;,
      &quot;95pct&quot;
    ],
    &quot;back_window&quot;: 0,
    &quot;definition&quot;: [
      {
        &quot;granularity&quot;: &quot;0:00:01&quot;,
        &quot;points&quot;: 3600,
        &quot;timespan&quot;: &quot;1:00:00&quot;
      },
      {
        &quot;granularity&quot;: &quot;0:30:00&quot;,
        &quot;points&quot;: 48,
        &quot;timespan&quot;: &quot;1 day, 0:00:00&quot;
      }
    ],
    &quot;name&quot;: &quot;low&quot;
  },
  &quot;created_by_project_id&quot;: &quot;e8afeeb3-4ae6-4888-96f8-2fae69d24c01&quot;,
  &quot;created_by_user_id&quot;: &quot;c10829c6-48e2-4d14-ac2b-bfba3b17216a&quot;,
  &quot;id&quot;: &quot;387101dc-e4b1-4602-8f40-e7be9f0ed46a&quot;,
  &quot;name&quot;: null,
  &quot;resource_id&quot;: null
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The &lt;code&gt;archive_policy_name&lt;/code&gt; parameter defines how the measures that are sent will be aggregated. You can also define archive policies using the API and specify what kind of aggregation period and granularity you want. In this case, the &lt;em&gt;low&lt;/em&gt; archive policy keeps 1 hour of data aggregated over 1 second and 1 day of data aggregated over 30 minutes. The functions used for aggregation are standard mathematical functions such as standard deviation, minimum, maximum… and even the 95th percentile. All of that is obviously customizable, and you can create your own archive policies.&lt;/p&gt;
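&lt;p&gt;To make the aggregation concrete, here is a rough Python sketch – not Gnocchi&apos;s actual code – of what bucketing measures by granularity and aggregating each bucket with mean looks like. Fed the three measures from the example below (expressed as seconds after 14:30:00) with the 1800-second granularity, it produces the same 30-minute mean the API returns:&lt;/p&gt;

```python
from collections import defaultdict

def aggregate(measures, granularity):
    """Group (epoch_timestamp, value) pairs into buckets of
    `granularity` seconds and compute the mean of each bucket."""
    buckets = defaultdict(list)
    for ts, value in measures:
        # Align each timestamp to the start of its bucket.
        buckets[ts - ts % granularity].append(value)
    return sorted((ts, sum(vs) / len(vs)) for ts, vs in buckets.items())

# 14:33:57, 14:34:12, 14:34:20 as seconds after 14:30:00:
measures = [(237, 43.1), (252, 12.0), (260, 2.0)]
print(aggregate(measures, 1800))  # one 30-minute bucket, mean ≈ 19.03
```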
&lt;p&gt;If you don&apos;t want to specify the archive policy manually for each metric, you can also create &lt;em&gt;archive policy rules&lt;/em&gt; that apply a specific archive policy based on the metric name, e.g. metrics matching &lt;code&gt;disk.*&lt;/code&gt; will be high-resolution metrics, so they will use the &lt;code&gt;high&lt;/code&gt; archive policy.&lt;/p&gt;
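&lt;p&gt;Conceptually, such a rule is just a first-match lookup of the metric name against an ordered list of patterns. A hypothetical sketch of the matching idea – Gnocchi stores and evaluates the rules server-side, so none of these names are its real API:&lt;/p&gt;

```python
import fnmatch

# Hypothetical rules: most specific pattern first, catch-all last.
RULES = [("disk.*", "high"), ("*", "low")]

def policy_for(metric_name, rules=RULES):
    """Return the archive policy name of the first matching rule."""
    for pattern, policy in rules:
        if fnmatch.fnmatch(metric_name, pattern):
            return policy

print(policy_for("disk.read.bytes"))  # high
print(policy_for("cpu_util"))         # low
```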
&lt;p&gt;It&apos;s also worth noting Gnocchi is precise up to the nanosecond and is not tied to the current time. You can manipulate and inject measures that are years old and precise to the nanosecond. You can also inject points with old timestamps (i.e. old compared to the most recent one in the timeseries) with an archive policy allowing it (see &lt;code&gt;back_window&lt;/code&gt; parameter).&lt;/p&gt;
&lt;p&gt;It&apos;s then possible to send measures to this metric:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;POST /v1/metric/387101dc-e4b1-4602-8f40-e7be9f0ed46a/measures HTTP/1.1
Content-Type: application/json

[
  {
    &quot;timestamp&quot;: &quot;2014-10-06T14:33:57&quot;,
    &quot;value&quot;: 43.1
  },
  {
    &quot;timestamp&quot;: &quot;2014-10-06T14:34:12&quot;,
    &quot;value&quot;: 12
  },
  {
    &quot;timestamp&quot;: &quot;2014-10-06T14:34:20&quot;,
    &quot;value&quot;: 2
  }
]

HTTP/1.1 204 No Content
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;These measures are synchronously aggregated and stored in the configured storage backend. Our most scalable storage drivers for now are based on either &lt;a href=&quot;http://launchpad.net/swift&quot;&gt;Swift&lt;/a&gt; or &lt;a href=&quot;http://ceph.com&quot;&gt;Ceph&lt;/a&gt;, which are both scalable object storage systems.&lt;/p&gt;
&lt;p&gt;It&apos;s then possible to retrieve these values:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;GET /v1/metric/387101dc-e4b1-4602-8f40-e7be9f0ed46a/measures HTTP/1.1

HTTP/1.1 200 OK
Content-Type: application/json; charset=UTF-8

[
  [
    &quot;2014-10-06T14:30:00.000000Z&quot;,
    1800.0,
    19.033333333333335
  ],
  [
    &quot;2014-10-06T14:33:57.000000Z&quot;,
    1.0,
    43.1
  ],
  [
    &quot;2014-10-06T14:34:12.000000Z&quot;,
    1.0,
    12.0
  ],
  [
    &quot;2014-10-06T14:34:20.000000Z&quot;,
    1.0,
    2.0
  ]
]
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;As older Ceilometer users might notice here, metrics only store timestamps and values; nothing fancy such as metadata anymore.&lt;/p&gt;
&lt;p&gt;By default, values eagerly aggregated using mean are returned for all supported granularities. You can obviously specify a time range or a different aggregation function using the &lt;code&gt;aggregation&lt;/code&gt;, &lt;code&gt;start&lt;/code&gt; and &lt;code&gt;stop&lt;/code&gt; query parameters.&lt;/p&gt;
&lt;p&gt;Gnocchi also supports doing aggregation across aggregated metrics:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;GET /v1/aggregation/metric?metric=65071775-52a8-4d2e-abb3-1377c2fe5c55&amp;amp;metric=9ccdd0d6-f56a-4bba-93dc-154980b6e69a&amp;amp;start=2014-10-06T14:34&amp;amp;aggregation=mean HTTP/1.1

HTTP/1.1 200 OK
Content-Type: application/json; charset=UTF-8

[
  [
    &quot;2014-10-06T14:34:12.000000Z&quot;,
    1.0,
    12.25
  ],
  [
    &quot;2014-10-06T14:34:20.000000Z&quot;,
    1.0,
    11.6
  ]
]
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This computes the mean of means for the metrics &lt;code&gt;65071775-52a8-4d2e-abb3-1377c2fe5c55&lt;/code&gt; and &lt;code&gt;9ccdd0d6-f56a-4bba-93dc-154980b6e69a&lt;/code&gt;, starting on 6th October 2014 at 14:34 UTC.&lt;/p&gt;
&lt;h2&gt;Indexing your resources&lt;/h2&gt;
&lt;p&gt;Another object and concept that Gnocchi provides is the ability to manipulate resources. There is a basic resource type, called &lt;em&gt;generic&lt;/em&gt;, which has very few attributes. You can extend this type to specialize it, and that&apos;s what Gnocchi does by default by providing resource types for well-known OpenStack objects such as &lt;em&gt;instance&lt;/em&gt;, &lt;em&gt;volume&lt;/em&gt;, &lt;em&gt;network&lt;/em&gt; or even &lt;em&gt;image&lt;/em&gt;.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;POST /v1/resource/generic HTTP/1.1

Content-Type: application/json

{
  &quot;id&quot;: &quot;75C44741-CC60-4033-804E-2D3098C7D2E9&quot;,
  &quot;project_id&quot;: &quot;BD3A1E52-1C62-44CB-BF04-660BD88CD74D&quot;,
  &quot;user_id&quot;: &quot;BD3A1E52-1C62-44CB-BF04-660BD88CD74D&quot;
}

HTTP/1.1 201 Created
Location: http://localhost/v1/resource/generic/75c44741-cc60-4033-804e-2d3098c7d2e9
ETag: &quot;e3acd0681d73d85bfb8d180a7ecac75fce45a0dd&quot;
Last-Modified: Fri, 17 Apr 2015 11:18:48 GMT
Content-Type: application/json; charset=UTF-8

{
  &quot;created_by_project_id&quot;: &quot;ec181da1-25dd-4a55-aa18-109b19e7df3a&quot;,
  &quot;created_by_user_id&quot;: &quot;4543aa2a-6ebf-4edd-9ee0-f81abe6bb742&quot;,
  &quot;ended_at&quot;: null,
  &quot;id&quot;: &quot;75c44741-cc60-4033-804e-2d3098c7d2e9&quot;,
  &quot;metrics&quot;: {},
  &quot;project_id&quot;: &quot;bd3a1e52-1c62-44cb-bf04-660bd88cd74d&quot;,
  &quot;revision_end&quot;: null,
  &quot;revision_start&quot;: &quot;2015-04-17T11:18:48.696288Z&quot;,
  &quot;started_at&quot;: &quot;2015-04-17T11:18:48.696275Z&quot;,
  &quot;type&quot;: &quot;generic&quot;,
  &quot;user_id&quot;: &quot;bd3a1e52-1c62-44cb-bf04-660bd88cd74d&quot;
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The resource is created with the UUID provided by the user. Gnocchi handles the history of the resource, and that&apos;s what the &lt;code&gt;revision_start&lt;/code&gt; and &lt;code&gt;revision_end&lt;/code&gt; fields are for. They indicate the lifetime of this revision of the resource. The &lt;code&gt;ETag&lt;/code&gt; and &lt;code&gt;Last-Modified&lt;/code&gt; headers are also unique to this resource revision and can be used in a subsequent request using the &lt;code&gt;If-Match&lt;/code&gt; or &lt;code&gt;If-None-Match&lt;/code&gt; header, for example:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;GET /v1/resource/generic/75c44741-cc60-4033-804e-2d3098c7d2e9 HTTP/1.1
If-None-Match: &quot;e3acd0681d73d85bfb8d180a7ecac75fce45a0dd&quot;

HTTP/1.1 304 Not Modified
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This is useful for synchronizing and updating any view of the resources you might have in your application.&lt;/p&gt;
&lt;p&gt;You can use the &lt;code&gt;PATCH&lt;/code&gt; HTTP method to modify properties of the resource, which will create a new revision of the resource. The history of the resources is obviously available via the REST API.&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;metrics&lt;/code&gt; property of the resource allows you to link metrics to a resource. You can link existing metrics or create new ones dynamically:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;POST /v1/resource/generic HTTP/1.1
Content-Type: application/json

{
  &quot;id&quot;: &quot;AB68DA77-FA82-4E67-ABA9-270C5A98CBCB&quot;,
  &quot;metrics&quot;: {
    &quot;temperature&quot;: {
      &quot;archive_policy_name&quot;: &quot;low&quot;
    }
  },
  &quot;project_id&quot;: &quot;BD3A1E52-1C62-44CB-BF04-660BD88CD74D&quot;,
  &quot;user_id&quot;: &quot;BD3A1E52-1C62-44CB-BF04-660BD88CD74D&quot;
}

HTTP/1.1 201 Created
Location: http://localhost/v1/resource/generic/ab68da77-fa82-4e67-aba9-270c5a98cbcb
ETag: &quot;9f64c8890989565514eb50c5517ff01816d12ff6&quot;
Last-Modified: Fri, 17 Apr 2015 14:39:22 GMT
Content-Type: application/json; charset=UTF-8

{
  &quot;created_by_project_id&quot;: &quot;cfa2ebb5-bbf9-448f-8b65-2087fbecf6ad&quot;,
  &quot;created_by_user_id&quot;: &quot;6aadfc0a-da22-4e69-b614-4e1699d9e8eb&quot;,
  &quot;ended_at&quot;: null,
  &quot;id&quot;: &quot;ab68da77-fa82-4e67-aba9-270c5a98cbcb&quot;,
  &quot;metrics&quot;: {
    &quot;temperature&quot;: &quot;ad53cf29-6d23-48c5-87c1-f3bf5e8bb4a0&quot;
  },
  &quot;project_id&quot;: &quot;bd3a1e52-1c62-44cb-bf04-660bd88cd74d&quot;,
  &quot;revision_end&quot;: null,
  &quot;revision_start&quot;: &quot;2015-04-17T14:39:22.181615Z&quot;,
  &quot;started_at&quot;: &quot;2015-04-17T14:39:22.181601Z&quot;,
  &quot;type&quot;: &quot;generic&quot;,
  &quot;user_id&quot;: &quot;bd3a1e52-1c62-44cb-bf04-660bd88cd74d&quot;
}
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Haystack, needle? Find!&lt;/h2&gt;
&lt;p&gt;With such a system, it becomes very easy to index all your resources, meter them and retrieve this data. What&apos;s even more interesting is to query the system to find and list the resources you are interested in!&lt;/p&gt;
&lt;p&gt;You can search for a resource based on any field, for example:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;POST /v1/search/resource/instance HTTP/1.1
Content-Type: application/json

{
  &quot;=&quot;: {
    &quot;user_id&quot;: &quot;bd3a1e52-1c62-44cb-bf04-660bd88cd74d&quot;
  }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;That query will return a list of all resources owned by the &lt;code&gt;user_id&lt;/code&gt; &lt;code&gt;bd3a1e52-1c62-44cb-bf04-660bd88cd74d&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;You can do fancier queries such as retrieving all the instances started by a user this month:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;POST /v1/search/resource/instance HTTP/1.1
Content-Type: application/json
Content-Length: 113

{
  &quot;and&quot;: [
    {
      &quot;=&quot;: {
        &quot;user_id&quot;: &quot;bd3a1e52-1c62-44cb-bf04-660bd88cd74d&quot;
      }
    },
    {
      &quot;&amp;gt;=&quot;: {
        &quot;started_at&quot;: &quot;2015-04-01&quot;
      }
    }
  ]
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;And you can do queries even fancier than the fancy ones (still following?). What if we wanted to retrieve all the instances that were on host &lt;code&gt;foobar&lt;/code&gt; on April 15th and already had at least one hour of uptime? Let&apos;s ask Gnocchi to look in the history!&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;POST /v1/search/resource/instance?history=true HTTP/1.1
Content-Type: application/json
Content-Length: 113

{
  &quot;and&quot;: [
    {
      &quot;=&quot;: {
        &quot;host&quot;: &quot;foobar&quot;
      }
    },
    {
      &quot;&amp;gt;=&quot;: {
        &quot;lifespan&quot;: &quot;1 hour&quot;
      }
    },
    {
      &quot;&amp;lt;=&quot;: {
        &quot;revision_start&quot;: &quot;2015-04-15&quot;
      }
    }
  ]
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I could also mention the fact that you can &lt;a href=&quot;http://docs.openstack.org/developer/gnocchi/rest.html#searching-for-values-in-metrics&quot;&gt;search for values in metrics&lt;/a&gt;.&lt;br /&gt;
One feature that I will very likely include in Gnocchi 1.1 is the ability to search for resources whose specific metrics match some value. For example, the ability to search for instances whose CPU consumption was over 80% during a month.&lt;/p&gt;
&lt;h2&gt;Cherries on the cake&lt;/h2&gt;
&lt;p&gt;While Gnocchi is well integrated with and based on common OpenStack technology, please do note that it is completely able to function without any other OpenStack component and is pretty straightforward to deploy.&lt;/p&gt;
&lt;p&gt;Gnocchi also implements a full RBAC system based on the &lt;a href=&quot;http://docs.openstack.org/developer/oslo.policy/&quot;&gt;OpenStack standard oslo.policy&lt;/a&gt;, which allows pretty fine-grained control of permissions.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/gnocchi-resource-html.png&quot; alt=&quot;gnocchi-resource-html&quot; /&gt;&lt;/p&gt;
&lt;p&gt;There is also some ongoing work on HTML rendering when browsing the API with a Web browser. While still simple, we&apos;d like to have a minimal Web interface served on top of the API for the same price!&lt;/p&gt;
&lt;p&gt;The Ceilometer alarm subsystem supports Gnocchi as of the Kilo release, meaning you can use it to trigger actions when a metric value crosses a threshold. And OpenStack &lt;a href=&quot;http://launchpad.net/heat&quot;&gt;Heat&lt;/a&gt; also supports auto-scaling your instances based on Ceilometer+Gnocchi alarms.&lt;/p&gt;
&lt;p&gt;And there are a few more API calls that I didn&apos;t talk about here, so don&apos;t hesitate to take a peek at the &lt;a href=&quot;http://gnocchi.xyz&quot;&gt;full documentation&lt;/a&gt;!&lt;/p&gt;
&lt;h2&gt;Towards Gnocchi 1.1!&lt;/h2&gt;
&lt;p&gt;Gnocchi is a different beast in the OpenStack community. It is under the umbrella of the Ceilometer program, but it&apos;s one of the first projects that is not part of the (old) integrated release. Therefore we decided on a release schedule not directly tied to OpenStack&apos;s, and we&apos;ll release more often than the rest of the old OpenStack components – probably once every two months or so.&lt;/p&gt;
&lt;p&gt;What&apos;s coming next is a close integration with Ceilometer (e.g. moving the dispatcher code from Gnocchi to Ceilometer) and probably more features as we have more requests from our users. We are also exploring different backends such as InfluxDB (storage) or MongoDB (indexer).&lt;/p&gt;
&lt;p&gt;Stay tuned, and happy hacking!&lt;/p&gt;
&lt;/content:encoded&gt;&lt;/item&gt;&lt;item&gt;&lt;title&gt;Hacking Python AST: checking methods declaration&lt;/title&gt;&lt;link&gt;https://julien.danjou.info/blog/python-ast-checking-method-declaration/&lt;/link&gt;&lt;guid isPermaLink=&quot;true&quot;&gt;https://julien.danjou.info/blog/python-ast-checking-method-declaration/&lt;/guid&gt;&lt;description&gt;A few months ago, I wrote the definitive guide about Python method declaration, which was quite a success.&lt;/description&gt;&lt;pubDate&gt;Mon, 16 Feb 2015 00:00:00 GMT&lt;/pubDate&gt;&lt;content:encoded&gt;&amp;lt;p&amp;gt;A few months ago, I wrote &amp;lt;a href=&amp;quot;https://julien.danjou.info/blog/guide-python-static-class-abstract-methods&amp;quot;&amp;gt;the definitive guide about Python method declaration&amp;lt;/a&amp;gt;, which was quite a success. I still fight every day in &amp;lt;a href=&amp;quot;http://openstack.org&amp;quot;&amp;gt;OpenStack&amp;lt;/a&amp;gt; to have developers declare their methods correctly in the patches they submit.&amp;lt;/p&amp;gt;
&lt;h2&gt;Automation plan&lt;/h2&gt;
&lt;p&gt;The thing is, I really dislike doing the same things over and over again. Furthermore, I&apos;m not perfect either, and I miss a lot of these kinds of problems in the reviews I do. So I decided to replace myself with a program – a more scalable and less error-prone version of my brain.&lt;/p&gt;
&lt;p&gt;In OpenStack, we rely on &lt;em&gt;&lt;a href=&quot;http://flake8.readthedocs.org/en/2.2.3/&quot;&gt;flake8&lt;/a&gt;&lt;/em&gt; to do static analysis of our Python code in order to spot common programming mistakes.&lt;/p&gt;
&lt;p&gt;But we are really pedantic, so we wrote some extra hacking rules that we enforce on our code. To that end, we wrote a &lt;em&gt;flake8&lt;/em&gt; extension called &lt;a href=&quot;https://pypi.python.org/pypi/hacking&quot;&gt;hacking&lt;/a&gt;. I really like these rules; I even recommend applying them in your own projects. Though I might be biased, or a victim of Stockholm syndrome. Your call.&lt;/p&gt;
&lt;p&gt;Anyway, it&apos;s pretty clear that I need to add a check for method declaration in &lt;em&gt;hacking&lt;/em&gt;. Let&apos;s write a &lt;em&gt;flake8&lt;/em&gt; extension!&lt;/p&gt;
&lt;h2&gt;Typical error&lt;/h2&gt;
&lt;p&gt;The typical error I spot is the following:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;class Foo(object):
    # self is not used, the method does not need
    # to be bound, it should be declared static
    def bar(self, a, b, c):
        return a + b - c
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;That would be the correct version:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;class Foo(object):
    @staticmethod
    def bar(a, b, c):
        return a + b - c
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This kind of mistake is not a show-stopper; it&apos;s just not optimal. The fact that you have to declare static or class methods manually might be a language issue, but I don&apos;t want to debate Python&apos;s misfeatures or design flaws here.&lt;/p&gt;
&lt;h2&gt;Strategy&lt;/h2&gt;
&lt;p&gt;We could probably use some big magical regular expression to catch this problem. &lt;em&gt;flake8&lt;/em&gt; is based on the &lt;em&gt;&lt;a href=&quot;https://pypi.python.org/pypi/pep8&quot;&gt;pep8&lt;/a&gt;&lt;/em&gt; tool, which can do a line-by-line analysis of the code. But detecting this pattern line by line would be very hard and error-prone.&lt;/p&gt;
&lt;p&gt;However, it&apos;s also possible to do an AST-based analysis on a per-file basis with &lt;em&gt;pep8&lt;/em&gt;, so that&apos;s the method I picked, as it&apos;s the most robust.&lt;/p&gt;
&lt;h2&gt;AST analysis&lt;/h2&gt;
&lt;p&gt;I won&apos;t dive deeply into Python AST and how it works. You can find plenty of sources on the Internet, and I even talk about it a bit in my book &lt;em&gt;&lt;a href=&quot;https://thehackerguidetopython.com&quot;&gt;The Hacker&apos;s Guide to Python&lt;/a&gt;&lt;/em&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/python-ast.png&quot; alt=&quot;python-ast&quot; /&gt;&lt;/p&gt;
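To get a feel for what the tree looks like, here is a quick standalone example (not from the original post) that parses a tiny class and prints the node types the checker will be matching against:

```python
import ast

# Parse a small snippet; ast.parse returns a Module node
tree = ast.parse("class Foo(object):\n    def bar(self, a):\n        return a")

cls = tree.body[0]
print(type(cls).__name__)          # ClassDef
print(type(cls.body[0]).__name__)  # FunctionDef
```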
&lt;p&gt;To check correctly if all the methods in a Python file are correctly declared, we need to do the following:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Iterate over all the statement nodes of the AST&lt;/li&gt;
&lt;li&gt;Check that the statement is a class definition (&lt;code&gt;ast.ClassDef&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Iterate over all the function definitions (&lt;code&gt;ast.FunctionDef&lt;/code&gt;) of that class statement to check if it is already declared with &lt;code&gt;@staticmethod&lt;/code&gt; or not&lt;/li&gt;
&lt;li&gt;If the method is not declared static, we need to check if the first argument (&lt;code&gt;self&lt;/code&gt;) is used somewhere in the method&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;&lt;em&gt;Flake8&lt;/em&gt; plugin&lt;/h2&gt;
&lt;p&gt;In order to register a new plugin in &lt;em&gt;flake8&lt;/em&gt; via &lt;em&gt;hacking&lt;/em&gt;, we just need to add an entry in &lt;code&gt;setup.cfg&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;[entry_points]
flake8.extension =
    […]
    H904 = hacking.checks.other:StaticmethodChecker
    H905 = hacking.checks.other:StaticmethodChecker
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We register two &lt;em&gt;hacking&lt;/em&gt; codes here. As you will notice later, we are actually going to add an extra check in our code for the same price. Stay tuned.&lt;/p&gt;
&lt;p&gt;The next step is to write the actual plugin. Since we are using an AST based check, the plugin needs to be a class following a certain signature:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;@core.flake8ext
class StaticmethodChecker(object):
    def __init__(self, tree, filename):
        self.tree = tree

    def run(self):
        pass
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;So far, so good, and pretty easy. We store the tree locally, then we just need to use it in &lt;code&gt;run()&lt;/code&gt; and &lt;code&gt;yield&lt;/code&gt; the problems we discover, following the signature &lt;code&gt;pep8&lt;/code&gt; expects: a tuple of &lt;code&gt;(lineno, col_offset, error_string, code)&lt;/code&gt;.&lt;/p&gt;
&lt;h2&gt;This AST is made for walking ♪ ♬ ♩&lt;/h2&gt;
&lt;p&gt;The &lt;code&gt;ast&lt;/code&gt; module provides the &lt;code&gt;walk&lt;/code&gt; function, which makes it easy to iterate over a tree. We&apos;ll use that to run through the AST. First, let&apos;s write a loop that ignores statements that are not class definitions.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;@core.flake8ext
class StaticmethodChecker(object):
    def __init__(self, tree, filename):
        self.tree = tree

    def run(self):
        for stmt in ast.walk(self.tree):
            # Ignore non-class
            if not isinstance(stmt, ast.ClassDef):
                continue
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We still don&apos;t check for anything, but we now know how to skip statements that are not class definitions. The next step is to ignore whatever is not a function definition. We just iterate over the body of the class definition.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;for stmt in ast.walk(self.tree):
    # Ignore non-class
    if not isinstance(stmt, ast.ClassDef):
        continue
    # If it&apos;s a class, iterate over its body member to find methods
    for body_item in stmt.body:
        # Not a method, skip
        if not isinstance(body_item, ast.FunctionDef):
            continue
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We&apos;re all set for checking the method, which is &lt;code&gt;body_item&lt;/code&gt;. First, we need to check if it&apos;s already declared as static. If so, we don&apos;t have to do any further check and we can bail out.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;for stmt in ast.walk(self.tree):
    # Ignore non-class
    if not isinstance(stmt, ast.ClassDef):
        continue
    # If it&apos;s a class, iterate over its body member to find methods
    for body_item in stmt.body:
        # Not a method, skip
        if not isinstance(body_item, ast.FunctionDef):
            continue
        # Check that it has a decorator
        for decorator in body_item.decorator_list:
            if (isinstance(decorator, ast.Name)
               and decorator.id == &apos;staticmethod&apos;):
                # It&apos;s a static function, it&apos;s OK
                break
        else:
            # Function is not static, we do nothing for now
            pass
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Note that we use the special &lt;code&gt;for/else&lt;/code&gt; form of Python, where the &lt;code&gt;else&lt;/code&gt; is evaluated unless we used &lt;code&gt;break&lt;/code&gt; to exit the &lt;code&gt;for&lt;/code&gt; loop.&lt;/p&gt;
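As a standalone aside (not part of the plugin), here is the for/else pattern in miniature: the else clause runs only when the loop finishes without hitting break, which is exactly how we detect "no staticmethod decorator found":

```python
def is_static(decorator_names):
    # Mirrors the for/else pattern used in the checker above
    for name in decorator_names:
        if name == "staticmethod":
            break
    else:
        # The else clause runs only if we never hit break
        return False
    return True

print(is_static(["staticmethod"]))  # True
print(is_static(["classmethod"]))   # False
```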
&lt;pre&gt;&lt;code&gt;for stmt in ast.walk(self.tree):
    # Ignore non-class
    if not isinstance(stmt, ast.ClassDef):
        continue
    # If it&apos;s a class, iterate over its body member to find methods
    for body_item in stmt.body:
        # Not a method, skip
        if not isinstance(body_item, ast.FunctionDef):
            continue
        # Check that it has a decorator
        for decorator in body_item.decorator_list:
            if (isinstance(decorator, ast.Name)
               and decorator.id == &apos;staticmethod&apos;):
                # It&apos;s a static function, it&apos;s OK
                break
        else:
            try:
                first_arg = body_item.args.args[0]
            except IndexError:
                yield (
                    body_item.lineno,
                    body_item.col_offset,
                    &quot;H905: method misses first argument&quot;,
                    &quot;H905&quot;,
                )
                # Check next method
                continue
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We finally added an actual check! We grab the first argument from the method signature. If that fails, we know there&apos;s a problem: you can&apos;t have a bound method without the &lt;code&gt;self&lt;/code&gt; argument, so we yield the &lt;code&gt;H905&lt;/code&gt; code to signal a method that is missing its first argument.&lt;/p&gt;
&lt;p&gt;Now you know why we registered this second &lt;code&gt;pep8&lt;/code&gt; code along with &lt;code&gt;H904&lt;/code&gt; in &lt;code&gt;setup.cfg&lt;/code&gt;: it&apos;s a good opportunity to kill two birds with one stone.&lt;/p&gt;
&lt;p&gt;The next step is to check if that first argument is used in the code of the method.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;for stmt in ast.walk(self.tree):
    # Ignore non-class
    if not isinstance(stmt, ast.ClassDef):
        continue
    # If it&apos;s a class, iterate over its body member to find methods
    for body_item in stmt.body:
        # Not a method, skip
        if not isinstance(body_item, ast.FunctionDef):
            continue
        # Check that it has a decorator
        for decorator in body_item.decorator_list:
            if (isinstance(decorator, ast.Name)
               and decorator.id == &apos;staticmethod&apos;):
                # It&apos;s a static function, it&apos;s OK
                break
        else:
            try:
                first_arg = body_item.args.args[0]
            except IndexError:
                yield (
                    body_item.lineno,
                    body_item.col_offset,
                    &quot;H905: method misses first argument&quot;,
                    &quot;H905&quot;,
                )
                # Check next method
                continue
            for func_stmt in ast.walk(body_item):
                if six.PY3:
                    if (isinstance(func_stmt, ast.Name)
                       and first_arg.arg == func_stmt.id):
                        # The first argument is used, it&apos;s OK
                        break
                else:
                    if (func_stmt != first_arg
                       and isinstance(func_stmt, ast.Name)
                       and func_stmt.id == first_arg.id):
                        # The first argument is used, it&apos;s OK
                        break
            else:
                yield (
                    body_item.lineno,
                    body_item.col_offset,
                    &quot;H904: method should be declared static&quot;,
                    &quot;H904&quot;,
                )
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;To that end, we iterate using &lt;code&gt;ast.walk&lt;/code&gt; again, looking for a use of the variable with the same name (usually &lt;code&gt;self&lt;/code&gt;, but it could be anything, like &lt;code&gt;cls&lt;/code&gt; for &lt;code&gt;@classmethod&lt;/code&gt;) in the body of the function. If it&apos;s not found, we finally yield the &lt;code&gt;H904&lt;/code&gt; error code. Otherwise, we&apos;re good.&lt;/p&gt;
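Putting it all together, here is a condensed, Python 3-only version of the walk described above (the error strings are simplified placeholders, and the plugin scaffolding is stripped) that you can run directly against a snippet:

```python
import ast

SOURCE = """
class Foo:
    def bound(self):
        return self.value

    def unbound(self, a, b):
        return a + b

    def broken():
        return 42
"""

def check(tree):
    # Condensed version of the checks built step by step above
    for stmt in ast.walk(tree):
        if not isinstance(stmt, ast.ClassDef):
            continue
        for body_item in stmt.body:
            if not isinstance(body_item, ast.FunctionDef):
                continue
            if any(isinstance(d, ast.Name) and d.id == "staticmethod"
                   for d in body_item.decorator_list):
                continue  # already static, nothing to do
            if not body_item.args.args:
                yield (body_item.lineno, "H905 method misses first argument")
                continue
            first_arg = body_item.args.args[0].arg
            if not any(isinstance(n, ast.Name) and n.id == first_arg
                       for n in ast.walk(body_item)):
                yield (body_item.lineno,
                       "H904 method should be declared static")

# Only unbound (H904) and broken (H905) are reported; bound uses self
print(list(check(ast.parse(SOURCE))))
```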
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;I&apos;ve &lt;a href=&quot;https://review.openstack.org/#/c/151952/&quot;&gt;submitted this patch to &lt;em&gt;hacking&lt;/em&gt;&lt;/a&gt;, and, fingers crossed, it might be merged one day. If it&apos;s not, I&apos;ll create a new Python package with that check for flake8. The actual submitted code is a bit more complex, to take into account the use of the &lt;a href=&quot;https://docs.python.org/2/library/abc.html&quot;&gt;&lt;code&gt;abc&lt;/code&gt;&lt;/a&gt; module, and it includes some tests.&lt;/p&gt;
&lt;p&gt;As you may have noticed, the code walks over the module&apos;s AST several times. There might be a couple of optimizations to browse the AST in only one pass, but I&apos;m not sure it&apos;s worth it considering the actual usage of the tool. I&apos;ll leave that as an exercise for the reader interested in contributing to OpenStack. 😉&lt;/p&gt;
&lt;p&gt;Happy hacking!&lt;/p&gt;
</content:encoded></item><item><title>Distributed group management and locking in Python with tooz</title><link>https://julien.danjou.info/blog/python-distributed-membership-lock-with-tooz/</link><guid isPermaLink="true">https://julien.danjou.info/blog/python-distributed-membership-lock-with-tooz/</guid><description>With OpenStack embracing the Tooz library more and more over the past year, I think it&apos;s a good time to write a bit about it.</description><pubDate>Fri, 21 Nov 2014 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;With &lt;a href=&quot;http://openstack.org&quot;&gt;OpenStack&lt;/a&gt; embracing the &lt;a href=&quot;http://launchpad.net/python-tooz&quot;&gt;Tooz&lt;/a&gt; library more and more over the past year, I think it&apos;s a good time to write a bit about it.&lt;/p&gt;
&lt;h2&gt;A bit of history&lt;/h2&gt;
&lt;p&gt;A little more than a year ago, with my colleague Yassine Lamgarchal and others at &lt;a href=&quot;http://enovance.com&quot;&gt;eNovance&lt;/a&gt;, we investigated how to solve a problem often encountered inside OpenStack: synchronization of multiple distributed workers. And while many people in our ecosystem continue to drive development by adding new bells and whistles, we made a point of solving new problems with a generic solution able to address the technical debt at the same time.&lt;/p&gt;
&lt;p&gt;Yassine wrote the first ideas of what should be the &lt;a href=&quot;https://wiki.openstack.org/wiki/Oslo/blueprints/service-sync&quot;&gt;group membership service&lt;/a&gt; that was needed for OpenStack, identifying several projects that could make use of this. I&apos;ve presented this concept during the &lt;a href=&quot;https://www.openstack.org/summit/openstack-summit-hong-kong-2013/&quot;&gt;OpenStack Summit in Hong-Kong&lt;/a&gt; during an Oslo session. It turned out that the idea was well-received, and the week following the summit we started the &lt;a href=&quot;http://launchpad.net/python-tooz&quot;&gt;tooz&lt;/a&gt; project on &lt;a href=&quot;http://ci.openstack.org/stackforge.html&quot;&gt;StackForge&lt;/a&gt;.&lt;/p&gt;
&lt;h2&gt;Goals&lt;/h2&gt;
&lt;p&gt;Tooz is a Python library that provides a coordination API. Its primary goal is to handle groups and membership of these groups in distributed systems.&lt;/p&gt;
&lt;p&gt;Tooz also provides another useful feature which is distributed locking. This allows distributed nodes to acquire and release locks in order to synchronize themselves (for example to access a shared resource).&lt;/p&gt;
&lt;h2&gt;The architecture&lt;/h2&gt;
&lt;p&gt;If you are familiar with distributed systems, you might be thinking that there are a lot of solutions already available to solve these issues: &lt;a href=&quot;http://zookeeper.apache.org/&quot;&gt;ZooKeeper&lt;/a&gt;, the &lt;a href=&quot;http://raftconsensus.github.io/&quot;&gt;Raft consensus algorithm&lt;/a&gt; or even &lt;a href=&quot;http://redis.io/&quot;&gt;Redis&lt;/a&gt; for example.&lt;/p&gt;
&lt;p&gt;You&apos;ll be thrilled to learn that Tooz is not the result of the &lt;a href=&quot;http://en.wikipedia.org/wiki/Not_invented_here&quot;&gt;NIH&lt;/a&gt; syndrome, but is an abstraction layer on top of all these solutions. It uses drivers to provide the real functionalities behind, and does not try to do anything fancy.&lt;/p&gt;
&lt;p&gt;Not all drivers offer the same functionality or robustness, but depending on your environment, any available driver might suffice. Like most of OpenStack, we let deployers/operators/developers choose whichever backend they want to use, informing them of the potential trade-offs they will make.&lt;/p&gt;
&lt;p&gt;So far, Tooz provides drivers based on:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://pypi.python.org/pypi/kazoo&quot;&gt;Kazoo&lt;/a&gt; (ZooKeeper)&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://pypi.python.org/pypi/zake&quot;&gt;Zake&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;http://memcached.org&quot;&gt;memcached&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;http://redis.io&quot;&gt;redis&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;http://www.tldp.org/LDP/lpg/node21.html&quot;&gt;SysV IPC&lt;/a&gt; (only for distributed locks for now)&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;http://postgresql.org&quot;&gt;PostgreSQL&lt;/a&gt; (only for distributed locks for now)&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;http://mysql.org&quot;&gt;MySQL&lt;/a&gt; (only for distributed locks for now)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;All drivers are distributed across processes. Some can be distributed across the network (ZooKeeper, memcached, redis…) and some are only available on the same host (IPC).&lt;/p&gt;
&lt;p&gt;Also note that the Tooz API is completely asynchronous, allowing it to be more efficient, and potentially included in an event loop.&lt;/p&gt;
&lt;h2&gt;Features&lt;/h2&gt;
&lt;h3&gt;Group membership&lt;/h3&gt;
&lt;p&gt;Tooz provides an API to manage group membership. The basic operations provided are: the creation of a group, the ability to join it, leave it and list its members. It&apos;s also possible to be notified as soon as a member joins or leaves a group.&lt;/p&gt;
&lt;h3&gt;Leader election&lt;/h3&gt;
&lt;p&gt;Each group can have a leader elected. Each member can decide if it wants to run for the election. If the leader disappears, another one is elected from the list of current candidates. It&apos;s possible to be notified of the election result and to retrieve the leader of a group at any moment.&lt;/p&gt;
&lt;h3&gt;Distributed locking&lt;/h3&gt;
&lt;p&gt;When trying to synchronize several workers in a distributed environment, you may need a way to lock access to some resources. That&apos;s what a distributed lock can help you with.&lt;/p&gt;
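As a sketch of what the API looks like (based on my reading of the Tooz documentation; the driver URL, member, and group names are illustrative, and running this requires the tooz package plus a reachable backend):

```python
from tooz import coordination

# The URL picks the driver; memcached here, but any backend listed above works
coordinator = coordination.get_coordinator("memcached://localhost:11211",
                                           b"worker-1")
coordinator.start()

# Group membership: the API is asynchronous, so .get() waits for the result
coordinator.create_group(b"my-group").get()
coordinator.join_group(b"my-group").get()
members = coordinator.get_members(b"my-group").get()

# Distributed locking around a shared resource
lock = coordinator.get_lock(b"shared-resource")
with lock:
    pass  # only one worker across the cluster runs this at a time

coordinator.leave_group(b"my-group").get()
coordinator.stop()
```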
&lt;h2&gt;Adoption in OpenStack&lt;/h2&gt;
&lt;p&gt;&lt;a href=&quot;http://launchpad.net/ceilometer&quot;&gt;Ceilometer&lt;/a&gt; is the first project in OpenStack to use Tooz. It has replaced part of the old alarm distribution system, where RPC was used to detect active alarm evaluator workers. The group membership feature of Tooz was leveraged by Ceilometer to coordinate between alarm evaluator workers.&lt;/p&gt;
&lt;p&gt;Another new feature, part of the Juno release of Ceilometer, is the distribution of the central agent&apos;s polling tasks among multiple workers. There&apos;s again a group membership problem: knowing which nodes are online and available to receive polling tasks. So Tooz is also being used here.&lt;/p&gt;
&lt;p&gt;The &lt;a href=&quot;http://wiki.openstack.org/Oslo&quot;&gt;Oslo&lt;/a&gt; team &lt;a href=&quot;https://review.openstack.org/#/c/122439/&quot;&gt;has accepted the adoption of Tooz&lt;/a&gt; during this release cycle. That means that it will be maintained by more developers, and will be part of the OpenStack release process.&lt;/p&gt;
&lt;p&gt;This opens the door to pushing Tooz further into OpenStack. Our next candidate would be to write a service group driver for &lt;a href=&quot;http://launchpad.net/nova&quot;&gt;Nova&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The &lt;a href=&quot;http://tooz.rtfd.org/&quot;&gt;complete documentation for Tooz is available online&lt;/a&gt; and has examples for the various features described here, go read it if you&apos;re curious and adventurous!&lt;/p&gt;
</content:encoded></item><item><title>Python bad practice, a concrete case</title><link>https://julien.danjou.info/blog/python-bad-practice-concrete-case/</link><guid isPermaLink="true">https://julien.danjou.info/blog/python-bad-practice-concrete-case/</guid><description>A lot of people read up on good Python practice, and there&apos;s plenty of information about that on the Internet. Many tips are included in the book I wrote this year, The Hacker&apos;s Guide to Python.</description><pubDate>Mon, 15 Sep 2014 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;A lot of people read up on good Python practice, and there&apos;s plenty of information about that on the Internet. Many tips are included in the book I wrote this year, &lt;a href=&quot;https://thehackerguidetopython.com&quot;&gt;The Hacker&apos;s Guide to Python&lt;/a&gt;. Today I&apos;d like to show a concrete case of code that I don&apos;t consider being the state of the art.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/python-thumb-down.png&quot; alt=&quot;python-thumb-down&quot; /&gt;&lt;/p&gt;
&lt;p&gt;In my &lt;a href=&quot;https://julien.danjou.info/blog/openstack-ceilometer-the-gnocchi-experiment&quot;&gt;last article&lt;/a&gt;, where I talked about my new project Gnocchi, I wrote about how I tested, hacked on, and then ditched &lt;em&gt;&lt;a href=&quot;http://graphite.wikidot.com/whisper&quot;&gt;whisper&lt;/a&gt;&lt;/em&gt;. Here I&apos;m going to explain part of my thought process and a few things that raised my eyebrows when hacking on this code.&lt;/p&gt;
&lt;p&gt;Before I start, please don&apos;t get the spirit of this article wrong. It&apos;s in no way a personal attack on the authors and contributors (whom I don&apos;t know). Furthermore, &lt;em&gt;whisper&lt;/em&gt; is a piece of code that has been in production in thousands of installations, storing metrics for years. While I argue that the code doesn&apos;t follow best practice, it definitely works well enough and is valuable to a lot of people.&lt;/p&gt;
&lt;h2&gt;Tests&lt;/h2&gt;
&lt;p&gt;The first thing that I noticed when trying to hack on &lt;em&gt;whisper&lt;/em&gt; is the lack of tests. There&apos;s only one file containing tests, named &lt;code&gt;test_whisper.py&lt;/code&gt;, and the coverage it provides is pretty low. One can check that using the &lt;em&gt;coverage&lt;/em&gt; tool.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;$ coverage run test_whisper.py
...........
----------------------------------------------------------------------
Ran 11 tests in 0.014s

OK
$ coverage report
Name           Stmts   Miss  Cover
----------------------------------
test_whisper     134      4    97%
whisper          584    227    61%
----------------------------------
TOTAL            718    231    67%
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;While one might think that 61% is &quot;not so bad&quot;, taking a quick peek at the actual test code shows that the tests are incomplete. What I mean by incomplete is that, for example, they use the library to store values into a database, but they never check whether the results can be fetched back and whether the fetched results are accurate. This is a good reason never to blindly trust the test coverage percentage as a quality metric.&lt;/p&gt;
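To make the point concrete, here is the shape of the round-trip test that was missing, using a stand-in in-memory store (all names here are hypothetical; a real test would call whisper's own create/update/fetch functions):

```python
import unittest

class FakeStore:
    """Stand-in for a whisper database, just to show the round-trip idea."""
    def __init__(self):
        self._points = {}

    def update(self, timestamp, value):
        self._points[timestamp] = value

    def fetch(self, timestamp):
        return self._points.get(timestamp)

class TestRoundTrip(unittest.TestCase):
    def test_stored_values_can_be_fetched_back(self):
        store = FakeStore()
        store.update(1000, 42.0)
        # The crucial assertion the incomplete tests never make:
        # what we read back must match what we wrote.
        self.assertEqual(store.fetch(1000), 42.0)

case = TestRoundTrip("test_stored_values_can_be_fetched_back")
result = case.run()
print(result.wasSuccessful())  # True
```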
&lt;p&gt;When I tried to modify &lt;em&gt;whisper&lt;/em&gt;, since the tests do not check the entire cycle of the values fed into the database, I ended up making wrong changes while the tests still passed.&lt;/p&gt;
&lt;h2&gt;No PEP 8, no Python 3&lt;/h2&gt;
&lt;p&gt;The code doesn&apos;t respect PEP 8. A run of &lt;a href=&quot;https://flake8.readthedocs.org/&quot;&gt;flake8&lt;/a&gt; + &lt;a href=&quot;https://pypi.python.org/pypi/hacking&quot;&gt;hacking&lt;/a&gt; shows 732 errors… While it does not impact the code itself, it&apos;s more painful to hack on than most Python projects.&lt;/p&gt;
&lt;p&gt;The &lt;em&gt;hacking&lt;/em&gt; tool also shows that the code is not Python 3 ready as there is usage of Python 2 only syntax.&lt;/p&gt;
&lt;p&gt;A good way to fix that would be to set up &lt;a href=&quot;https://testrun.org/tox/latest/&quot;&gt;tox&lt;/a&gt; and add a few targets for PEP 8 checks and Python 3 tests. Even if the test suite is not complete, starting by having flake8 run without errors and the few unit tests working with Python 3 would put the project in a better light.&lt;/p&gt;
&lt;h2&gt;Not using idiomatic Python&lt;/h2&gt;
&lt;p&gt;A lot of the code could be simplified by using idiomatic Python. Let&apos;s take a simple example:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;def fetch(path,fromTime,untilTime=None,now=None):
  fh = None
  try:
    fh = open(path,&apos;rb&apos;)
    return file_fetch(fh, fromTime, untilTime, now)
  finally:
    if fh:
      fh.close()
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;That piece of code could be easily rewritten as:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;def fetch(path,fromTime,untilTime=None,now=None):
  with open(path, &apos;rb&apos;) as fh:
    return file_fetch(fh, fromTime, untilTime, now)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This way, the function looks so simple that one can even wonder why it should exist – but why not.&lt;/p&gt;
&lt;p&gt;Usage of loops could also be made more Pythonic:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;for i,archive in enumerate(archiveList):
  if i == len(archiveList) - 1:
    break
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;could be actually:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;for archive in itertools.islice(archiveList, len(archiveList) - 1):
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;That reduces the code size and makes the code easier to read through.&lt;/p&gt;
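A quick standalone check (not from the post) that the two loops visit exactly the same elements, i.e. everything except the last archive:

```python
import itertools

archive_list = ["60s", "5min", "1h", "1d"]

# Original style: manual index bookkeeping to skip the last element
seen_enumerate = []
for i, archive in enumerate(archive_list):
    if i == len(archive_list) - 1:
        break
    seen_enumerate.append(archive)

# Idiomatic style: islice stops one element before the end
seen_islice = list(itertools.islice(archive_list, len(archive_list) - 1))
print(seen_enumerate == seen_islice)  # True
```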
&lt;h2&gt;Wrong abstraction level&lt;/h2&gt;
&lt;p&gt;Also, one thing that I noticed in &lt;em&gt;whisper&lt;/em&gt;, is that it abstracts its features at the wrong level.&lt;/p&gt;
&lt;p&gt;Take the &lt;code&gt;create()&lt;/code&gt; function, it&apos;s pretty obvious:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;def create(path,archiveList,xFilesFactor=None,aggregationMethod=None,sparse=False,useFallocate=False):
  # Set default params
  if xFilesFactor is None:
    xFilesFactor = 0.5
  if aggregationMethod is None:
    aggregationMethod = &apos;average&apos;

  #Validate archive configurations...
  validateArchiveList(archiveList)

  #Looks good, now we create the file and write the header
  if os.path.exists(path):
    raise InvalidConfiguration(&quot;File %s already exists!&quot; % path)
  fh = None
  try:
    fh = open(path,&apos;wb&apos;)
    if LOCK:
      fcntl.flock( fh.fileno(), fcntl.LOCK_EX )

    aggregationType = struct.pack( longFormat, aggregationMethodToType.get(aggregationMethod, 1) )
    oldest = max([secondsPerPoint * points for secondsPerPoint,points in archiveList])
    maxRetention = struct.pack( longFormat, oldest )
    xFilesFactor = struct.pack( floatFormat, float(xFilesFactor) )
    archiveCount = struct.pack(longFormat, len(archiveList))
    packedMetadata = aggregationType + maxRetention + xFilesFactor + archiveCount
    fh.write(packedMetadata)
    headerSize = metadataSize + (archiveInfoSize * len(archiveList))
    archiveOffsetPointer = headerSize

    for secondsPerPoint,points in archiveList:
      archiveInfo = struct.pack(archiveInfoFormat, archiveOffsetPointer, secondsPerPoint, points)
      fh.write(archiveInfo)
      archiveOffsetPointer += (points * pointSize)

    #If configured to use fallocate and capable of fallocate use that, else
    #attempt sparse if configure or zero pre-allocate if sparse isn&apos;t configured.
    if CAN_FALLOCATE and useFallocate:
      remaining = archiveOffsetPointer - headerSize
      fallocate(fh, headerSize, remaining)
    elif sparse:
      fh.seek(archiveOffsetPointer - 1)
      fh.write(&apos;\x00&apos;)
    else:
      remaining = archiveOffsetPointer - headerSize
      chunksize = 16384
      zeroes = &apos;\x00&apos; * chunksize
      while remaining &amp;gt; chunksize:
        fh.write(zeroes)
        remaining -= chunksize
      fh.write(zeroes[:remaining])

    if AUTOFLUSH:
      fh.flush()
      os.fsync(fh.fileno())
  finally:
    if fh:
      fh.close()
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The function is doing &lt;strong&gt;everything&lt;/strong&gt;: checking if the file doesn&apos;t exist already, opening it, building the structured data, writing this, building more structure, then writing that, etc.&lt;/p&gt;
&lt;p&gt;That means that the caller has to give a file path, even if it just wants a &lt;em&gt;whisper&lt;/em&gt; data structure to store itself elsewhere. &lt;code&gt;StringIO()&lt;/code&gt; could be used to fake a file handler, but it will fail if the call to &lt;code&gt;fcntl.flock()&lt;/code&gt; is not disabled – and it is inefficient anyway.&lt;/p&gt;
&lt;p&gt;There are a lot of other functions in the code, such as &lt;code&gt;setAggregationMethod()&lt;/code&gt;, that mix file handling – even doing things like &lt;code&gt;os.fsync()&lt;/code&gt; – with the manipulation of structured data. This is definitely not a good design, especially for a library, as it turns out reusing these functions in a different context is nearly impossible.&lt;/p&gt;
&lt;h2&gt;Race conditions&lt;/h2&gt;
&lt;p&gt;There are race conditions, for example in &lt;code&gt;create()&lt;/code&gt; (see added comment):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;if os.path.exists(path):
  raise InvalidConfiguration(&quot;File %s already exists!&quot; % path)
fh = None
try:
  # TOO LATE I ALREADY CREATED THE FILE IN ANOTHER PROCESS YOU ARE GOING TO
  # FAIL WITHOUT GIVING ANY USEFUL INFORMATION TO THE CALLER :-(
  fh = open(path,&apos;wb&apos;)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;That code should be:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import errno
import os

try:
  fh = os.fdopen(os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL), &apos;wb&apos;)
except OSError as e:
  if e.errno == errno.EEXIST:
    raise InvalidConfiguration(&quot;File %s already exists!&quot; % path)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;to avoid any race condition.&lt;/p&gt;
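The difference matters in practice: with os.O_EXCL, the kernel performs the existence check and the creation as a single atomic step. A small standalone demonstration (temporary path, illustrative names):

```python
import errno
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "metrics.wsp")

def create(path):
    # Atomic create-or-fail: no window between the check and the open
    return os.fdopen(os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL),
                     "wb")

fh = create(path)
fh.close()

try:
    create(path)
except OSError as e:
    # A second creation fails cleanly with EEXIST instead of racing
    assert e.errno == errno.EEXIST
    print("second create failed cleanly")
```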
&lt;h2&gt;Unwanted optimization&lt;/h2&gt;
&lt;p&gt;We saw earlier the &lt;code&gt;fetch()&lt;/code&gt; function that is barely useful, so let&apos;s take a look at the &lt;code&gt;file_fetch()&lt;/code&gt; function that it&apos;s calling.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;def file_fetch(fh, fromTime, untilTime, now = None):
  header = __readHeader(fh)
[...]
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The first thing the function does is to read the header from the file handler.&lt;/p&gt;
&lt;p&gt;Let&apos;s take a look at that function:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;def __readHeader(fh):
  info = __headerCache.get(fh.name)
  if info:
    return info

  originalOffset = fh.tell()
  fh.seek(0)
  packedMetadata = fh.read(metadataSize)

  try:
    (aggregationType,maxRetention,xff,archiveCount) = struct.unpack(metadataFormat,packedMetadata)
  except:
    raise CorruptWhisperFile(&quot;Unable to read header&quot;, fh.name)
[...]
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The first thing the function does is to look into a cache. Why is there a cache?&lt;/p&gt;
&lt;p&gt;It caches the header, indexed by the file path (&lt;code&gt;fh.name&lt;/code&gt;). Except that if one decides, for example, not to use a file and cheats using &lt;code&gt;StringIO&lt;/code&gt;, then the handler does not have any name attribute, and this code path will raise an &lt;code&gt;AttributeError&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;One has to set a fake name manually on the &lt;code&gt;StringIO&lt;/code&gt; instance, and it must be unique so nobody messes with the cache:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import StringIO

packedMetadata = &amp;lt;some source&amp;gt;
fh = StringIO.StringIO(packedMetadata)
fh.name = &quot;myfakename&quot;
header = __readHeader(fh)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The cache may actually be useful when accessing files, but it&apos;s definitely useless when not using files. And it&apos;s not clear that the complexity (even if small) the cache adds is worth it. I doubt most &lt;em&gt;whisper&lt;/em&gt;-based tools are long-running processes, so the cache that really matters when accessing the files is the one handled by the operating system kernel, which is going to be much more efficient anyway, and shared between processes. There&apos;s also no expiry of that cache, which could end up using and wasting tons of memory.&lt;/p&gt;
&lt;h2&gt;Docstrings&lt;/h2&gt;
&lt;p&gt;None of the docstrings are written in a parsable syntax like &lt;a href=&quot;http://sphinx-doc.org/&quot;&gt;Sphinx&lt;/a&gt;. This means you cannot generate documentation in a nice format that a developer using the library could read easily.&lt;/p&gt;
&lt;p&gt;The documentation is also not up to date:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;def fetch(path,fromTime,untilTime=None,now=None):
  &quot;&quot;&quot;fetch(path,fromTime,untilTime=None)
[...]
&quot;&quot;&quot;

def create(path,archiveList,xFilesFactor=None,aggregationMethod=None,sparse=False,useFallocate=False):
  &quot;&quot;&quot;create(path,archiveList,xFilesFactor=0.5,aggregationMethod=&apos;average&apos;)
[...]
&quot;&quot;&quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This is something that could be avoided if a proper format was picked for the docstrings. A tool could then be used to warn when the actual function signature diverges from the documented one, such as a missing argument.&lt;/p&gt;
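&lt;p&gt;To illustrate, here is what the &lt;code&gt;fetch&lt;/code&gt; signature above could look like with a Sphinx-style docstring. The parameter descriptions are my own guesses, a minimal sketch of the format rather than &lt;em&gt;whisper&lt;/em&gt;&apos;s actual documentation:&lt;/p&gt;

```python
def fetch(path, fromTime, untilTime=None, now=None):
    """Fetch data points from a whisper file.

    :param path: path of the whisper file to read
    :param fromTime: epoch timestamp marking the start of the range
    :param untilTime: epoch timestamp marking the end of the range,
        defaulting to the current time
    :param now: epoch timestamp to use as the current time
        (useful for testing)
    :return: the fetched data points
    """
```

With such a format, tools can compare the `:param:` fields against the real signature and flag the kind of divergence shown above automatically.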
&lt;h2&gt;Duplicated code&lt;/h2&gt;
&lt;p&gt;Last but not least, there&apos;s a lot of code duplicated around the scripts provided by &lt;em&gt;whisper&lt;/em&gt; in its &lt;code&gt;bin&lt;/code&gt; directory. These scripts should be very lightweight and use the &lt;code&gt;console_scripts&lt;/code&gt; facility of &lt;em&gt;setuptools&lt;/em&gt;, but they actually contain a lot of (untested) code. Furthermore, some of that code is partially duplicated from the &lt;code&gt;whisper.py&lt;/code&gt; library, which is against &lt;a href=&quot;http://en.wikipedia.org/wiki/Don&apos;t_repeat_yourself&quot;&gt;DRY&lt;/a&gt;.&lt;/p&gt;
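&lt;p&gt;For reference, the &lt;code&gt;console_scripts&lt;/code&gt; mechanism lets each command in &lt;code&gt;bin&lt;/code&gt; shrink to an entry point calling a function in the library itself. The module paths below are hypothetical examples, not &lt;em&gt;whisper&lt;/em&gt;&apos;s actual layout:&lt;/p&gt;

```python
# setup.py -- a minimal sketch; whisper_fetch/whisper_create are
# hypothetical module names, not whisper's real file structure
from setuptools import setup

setup(
    name="whisper",
    py_modules=["whisper"],
    entry_points={
        "console_scripts": [
            # setuptools generates thin wrapper scripts that
            # import the module and call its main() function
            "whisper-fetch = whisper_fetch:main",
            "whisper-create = whisper_create:main",
        ],
    },
)
```

With this in place, the scripts carry no logic of their own, so there is nothing left to duplicate or to leave untested.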
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;There are a few more things that made me stop considering &lt;em&gt;whisper&lt;/em&gt;, but these are part of &lt;em&gt;whisper&lt;/em&gt;&apos;s feature set, not necessarily its code quality. One can also point out that the code is very condensed and hard to read, which is a more general problem with how it is organized and abstracted.&lt;/p&gt;
&lt;p&gt;A lot of these defects are actually points that made me start writing &lt;a href=&quot;https://thehackerguidetopython.com&quot;&gt;The Hacker&apos;s Guide to Python&lt;/a&gt; a year ago.&lt;br /&gt;
Running into this kind of code makes me think it was a really good idea to write a book of advice on writing better Python code!&lt;/p&gt;
</content:encoded></item><item><title>Tracking OpenStack contributions in GitHub</title><link>https://julien.danjou.info/blog/tracking-openstack-contributions-in-github/</link><guid isPermaLink="true">https://julien.danjou.info/blog/tracking-openstack-contributions-in-github/</guid><description>I&apos;ve switched my Git repositories to GitHub recently, and started to watch my contributions statistics, which were very low considering I spend my days hacking on open source software, especially.</description><pubDate>Tue, 19 Aug 2014 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;I&apos;ve switched my Git repositories to &lt;a href=&quot;https://github.com&quot;&gt;GitHub&lt;/a&gt; recently, and started to watch my contributions statistics, which were very low considering I spend my days hacking on open source software, especially &lt;a href=&quot;https://openstack.org&quot;&gt;OpenStack&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/octocat-on-openstack-2.png&quot; alt=&quot;octocat-on-openstack-2&quot; /&gt;&lt;/p&gt;
&lt;p&gt;OpenStack hosts its Git repositories on its own infrastructure at &lt;a href=&quot;http://git.openstack.org&quot;&gt;git.openstack.org&lt;/a&gt;, but also mirrors them on GitHub. Logically, I was expecting GitHub to track my commits there too, as I&apos;m using the same email address everywhere.&lt;/p&gt;
&lt;p&gt;It turns out that it was not the case, and the &lt;a href=&quot;https://help.github.com/articles/why-are-my-contributions-not-showing-up-on-my-profile&quot;&gt;help page about that&lt;/a&gt; on GitHub describes the rules in place to compute statistics. Indeed, according to GitHub, I had no relationship with the OpenStack repositories, as I never forked them nor opened a pull request on them (OpenStack uses &lt;a href=&quot;http://review.openstack.org&quot;&gt;Gerrit&lt;/a&gt;).&lt;/p&gt;
&lt;p&gt;Starring a repository is enough to build a relationship between a user and a repository, so this was the only thing needed to inform GitHub that I have contributed to those repositories. Considering OpenStack has hundreds of repositories, I decided to star them all using a small Python script based on &lt;a href=&quot;https://pypi.python.org/pypi/pygithub&quot;&gt;pygithub&lt;/a&gt;.&lt;/p&gt;
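&lt;p&gt;I haven&apos;t reproduced my exact script here, but the idea fits in a few lines. This sketch uses GitHub&apos;s REST API directly with the standard library rather than pygithub; the organization name and token are placeholders you would fill in yourself:&lt;/p&gt;

```python
import json
import urllib.request

API = "https://api.github.com"


def star_url(owner, repo):
    # A PUT on this endpoint stars the repository for the
    # authenticated user (GitHub API v3)
    return "%s/user/starred/%s/%s" % (API, owner, repo)


def star_all(org, token):
    """Star every repository of an organization, page by page."""
    headers = {"Authorization": "token " + token}
    page = 1
    while True:
        req = urllib.request.Request(
            "%s/orgs/%s/repos?per_page=100&page=%d" % (API, org, page),
            headers=headers)
        repos = json.load(urllib.request.urlopen(req))
        if not repos:  # empty page: no more repositories
            break
        for repo in repos:
            urllib.request.urlopen(urllib.request.Request(
                star_url(org, repo["name"]),
                headers=headers, method="PUT"))
        page += 1

# Usage (needs a personal access token):
# star_all("openstack", token)
```

Running it once over the organization is enough; GitHub recomputes the contribution statistics afterwards.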
&lt;p&gt;And voilà, &lt;a href=&quot;https://github.com/jd&quot;&gt;my statistics&lt;/a&gt; are now including all my contributions to OpenStack!&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/github-openstack-stats.png&quot; alt=&quot;github-openstack-stats&quot; /&gt;&lt;/p&gt;
</content:encoded></item><item><title>OpenStack Ceilometer and the Gnocchi experiment</title><link>https://julien.danjou.info/blog/openstack-ceilometer-the-gnocchi-experiment/</link><guid isPermaLink="true">https://julien.danjou.info/blog/openstack-ceilometer-the-gnocchi-experiment/</guid><description>A little more than 2 years ago, the Ceilometer project was launched inside the OpenStack ecosystem. Its main objective was to measure OpenStack cloud platforms in order to provide data and mechanisms.</description><pubDate>Mon, 18 Aug 2014 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;A little more than 2 years ago, the &lt;a href=&quot;http://launchpad.net/ceilometer&quot;&gt;Ceilometer&lt;/a&gt; project was launched inside the OpenStack ecosystem. Its main objective was to measure OpenStack cloud platforms in order to provide data and mechanisms for functionalities such as billing, alarming or capacity planning.&lt;/p&gt;
&lt;p&gt;In this article, I would like to relate what I&apos;ve been doing with other Ceilometer developers over the last 5 months. I&apos;ve lowered my direct involvement in Ceilometer itself to concentrate on solving one of its biggest issues at the source, and I think it&apos;s high time to take a break and talk about it.&lt;/p&gt;
&lt;h2&gt;Ceilometer early design&lt;/h2&gt;
&lt;p&gt;For the last years, Ceilometer didn&apos;t change in its core architecture. Without diving too much into all its parts, one of the early design decisions was to build the metering around a data structure we called &lt;strong&gt;samples&lt;/strong&gt;. A sample is generated each time Ceilometer measures something. It is composed of a few fields, such as the id of the resource that is metered, the user and project ids owning that resource, the meter name, the measured value, a timestamp and a few free-form metadata. Each time Ceilometer measures something, one of its components (an agent, a pollster…) constructs and emits a sample headed for the storage component that we call the &lt;strong&gt;collector&lt;/strong&gt;.&lt;/p&gt;
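&lt;p&gt;In Python terms, a sample boils down to something like the following; the field names here are illustrative, not Ceilometer&apos;s exact schema:&lt;/p&gt;

```python
import time
import uuid


def make_sample(resource_id, user_id, project_id, meter, value,
                metadata=None):
    # One record per measurement, emitted towards the collector
    return {
        "resource_id": resource_id,
        "user_id": user_id,
        "project_id": project_id,
        "meter": meter,              # e.g. "cpu_util"
        "value": value,              # the measured value
        "timestamp": time.time(),
        "metadata": metadata or {},  # free-form key/value pairs
    }


s = make_sample(str(uuid.uuid4()), "user1", "proj1", "cpu_util", 42.0)
```

The free-form `metadata` field is the flexible part, and, as described below, also the part that makes querying expensive.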
&lt;p&gt;This collector is responsible for storing the samples into a database. The Ceilometer collector uses a pluggable storage system, meaning that you can pick any database system you prefer. Our original implementation has been based on MongoDB from the beginning, but we then added a SQL driver, and people contributed things such as HBase or DB2 support.&lt;/p&gt;
&lt;p&gt;The REST API exposed by Ceilometer allows executing various read requests on this data store. It can return the list of resources that have been measured for a particular project, or compute statistics on metrics. Allowing such a wide range of possibilities with such a flexible data structure lets you do a lot of different things with Ceilometer, as you can query the data in almost any way you want.&lt;/p&gt;
&lt;h2&gt;The scalability issue&lt;/h2&gt;
&lt;p&gt;We soon started to encounter scalability issues in many of the read requests made via the REST API. A lot of the requests require the data storage to do full scans of all the stored samples. Indeed, the fact that the API allows you to filter on any field and also on the free-form metadata (meaning non-indexed key/value tuples) has a terrible cost in terms of performance (as pointed out before, the metadata are attached to each &lt;em&gt;sample&lt;/em&gt; generated by Ceilometer and stored as-is). That basically means that the &lt;em&gt;sample&lt;/em&gt; data structure is stored in most drivers in just one table or collection, in order to be able to scan them at once, and there&apos;s no good &quot;perfect&quot; sharding solution, making data storage scalability painful.&lt;/p&gt;
&lt;p&gt;It turns out that the Ceilometer REST API is unable to handle most of the requests in a timely manner, as most operations are &lt;em&gt;O(n)&lt;/em&gt; where &lt;em&gt;n&lt;/em&gt; is the number of samples recorded (see &lt;a href=&quot;http://en.wikipedia.org/wiki/Big_O_notation&quot;&gt;big O notation&lt;/a&gt; if you&apos;re unfamiliar with it). That number of samples can grow very rapidly in an environment of thousands of metered nodes and with a data retention of several weeks. Fortunately, there are a few optimizations that make things smoother in general cases, but as soon as you run specific queries, the API becomes barely usable.&lt;/p&gt;
&lt;p&gt;During this last year, as the Ceilometer PTL, I discovered these issues first hand, since a lot of people were giving me exactly this kind of feedback. We started several blueprints to improve the situation, but it was soon clear to me that this was not going to be enough anyway.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/unacceptable.jpg&quot; alt=&quot;unacceptable&quot; /&gt;&lt;/p&gt;
&lt;h2&gt;Thinking outside the box&lt;/h2&gt;
&lt;p&gt;Unfortunately, the PTL job didn&apos;t leave me enough time to work on the actual code nor to play with anything new. I was coping with most of the project bureaucracy, and I wasn&apos;t able to work on any good solution to tackle the issue at its root. Still, I had a few ideas that I wanted to try, and as soon as I stepped down from the PTL role, I stopped working on Ceilometer itself to try something new and think a bit outside the box.&lt;/p&gt;
&lt;p&gt;When one takes a look at what has been brought into Ceilometer recently, one can see that Ceilometer actually needs to handle 2 types of data: events and metrics.&lt;/p&gt;
&lt;p&gt;Events are data generated when something happens: an instance starts, a volume is attached, or an HTTP request is sent to a REST API server. These are events that Ceilometer needs to collect and store. Most OpenStack components are able to send such events using the notification system built into &lt;em&gt;&lt;a href=&quot;https://wiki.openstack.org/wiki/Oslo/Messaging&quot;&gt;oslo.messaging&lt;/a&gt;&lt;/em&gt;.&lt;/p&gt;
&lt;p&gt;Metrics are data that Ceilometer needs to store but that are not necessarily tied to an event. Think about an instance&apos;s CPU usage, a router&apos;s network bandwidth usage, the number of images that Glance is storing for you, etc… These are not events, since nothing is happening. These are facts, states we need to meter.&lt;/p&gt;
&lt;p&gt;Computing statistics for billing or capacity planning requires both of these data sources, but they should be distinct. Based on that assumption, and the fact that Ceilometer was getting support for storing events, I started to focus on getting the metric part right.&lt;/p&gt;
&lt;p&gt;I had been a system administrator for a decade before jumping into OpenStack development, so I know a thing or two on how monitoring is done in this area, and what kind of technology operators rely on. I also know that there&apos;s still no silver bullet – this made it a good challenge.&lt;/p&gt;
&lt;p&gt;The first thing that came to my mind was to use some kind of time-series database, and export its access via a REST API – as we do in all OpenStack services. This should cover the metric storage pretty well.&lt;/p&gt;
&lt;h2&gt;Cooking Gnocchi&lt;/h2&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/gnocchi-logo-old-2.jpg&quot; alt=&quot;gnocchi-logo-old-2&quot; /&gt;&lt;/p&gt;
&lt;p&gt;At the end of April 2014, this led me to start a new project code-named Gnocchi. For the record, the name was picked after having misread the OpenStack Marconi project as &quot;OpenStack Macaroni&quot; so many times. At least one OpenStack project should have a &quot;pasta&quot; name, right?&lt;/p&gt;
&lt;p&gt;The point of starting a new project rather than sending patches to Ceilometer was that, first, I had no clue whether it was going to turn into anything better, and second, it allowed me to iterate more rapidly without being strongly coupled to the release process.&lt;/p&gt;
&lt;p&gt;The first prototype started around the following idea: what you want is to meter things. That means storing a list of (timestamp, value) tuples for them. I&apos;ve named these things &quot;entities&quot;, as no assumptions are made about what they are. An entity can represent the temperature in a room or the CPU usage of an instance. The service shouldn&apos;t care and should be agnostic in this regard.&lt;/p&gt;
&lt;p&gt;One feature that we discussed across several OpenStack summits in the Ceilometer sessions was the idea of doing aggregation: aggregating samples over a period of time in order to store only a smaller number of them. This is something that time-series formats such as &lt;a href=&quot;http://oss.oetiker.ch/rrdtool/&quot;&gt;RRDtool&lt;/a&gt; have been doing on the fly for a long time, and I decided it was a good trail to follow.&lt;/p&gt;
&lt;p&gt;I assumed that this was going to be a requirement when storing metrics into Gnocchi. The user would need to provide what kind of archiving it would need: 1 second precision over a day, 1 hour precision over a year, or even both.&lt;/p&gt;
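&lt;p&gt;The aggregation idea can be sketched in a few lines of Python: bucket the (timestamp, value) points by period and keep only one aggregated value per bucket. This is an illustration of the principle, not Gnocchi&apos;s actual code:&lt;/p&gt;

```python
def aggregate(points, granularity,
              aggregation=lambda vs: sum(vs) / len(vs)):
    """Roll up (timestamp, value) points into buckets of
    `granularity` seconds, one aggregated value per bucket."""
    buckets = {}
    for timestamp, value in points:
        # Align each timestamp on the start of its period
        start = timestamp - timestamp % granularity
        buckets.setdefault(start, []).append(value)
    return sorted((ts, aggregation(vs)) for ts, vs in buckets.items())


# Per-second-ish measures rolled up to 1-minute precision
points = [(0, 1.0), (10, 3.0), (60, 5.0), (119, 7.0)]
# aggregate(points, 60) → [(0, 2.0), (60, 6.0)]
```

Storing only the aggregated points per requested precision (1 second over a day, 1 hour over a year…) is what bounds the archive size regardless of how many raw measures come in.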
&lt;p&gt;The first driver written to achieve that and store those metrics inside Gnocchi was based on &lt;a href=&quot;http://graphite.wikidot.com/whisper&quot;&gt;whisper&lt;/a&gt;. Whisper is the file format used to store metrics for the &lt;a href=&quot;http://graphite.wikidot.com/&quot;&gt;Graphite&lt;/a&gt; project. For the actual storage, the driver uses Swift, which has the advantages to be part of OpenStack and scalable.&lt;/p&gt;
&lt;p&gt;Storing the metrics for each entity in a different &lt;em&gt;whisper&lt;/em&gt; file and putting them in Swift turned out to have a fantastic algorithmic complexity: it was &lt;em&gt;O(1)&lt;/em&gt;. Indeed, the complexity needed to store and retrieve metrics doesn&apos;t depend on the number of metrics you have nor on the number of things you are metering. Which is already a huge win compared to the current Ceilometer collector design.&lt;/p&gt;
&lt;p&gt;However, it turned out that &lt;em&gt;whisper&lt;/em&gt; has a few limitations that I was unable to circumvent in any manner. I needed to patch it to remove a lot of its assumptions about manipulating files, or that everything is relative to now (&lt;code&gt;time.time()&lt;/code&gt;). I started to hack on that in my own fork, but… then everything broke. The &lt;em&gt;whisper&lt;/em&gt; code base is, well, not the state of the art, and has zero unit tests. I was staring at a huge effort to transform &lt;em&gt;whisper&lt;/em&gt; into the time-series format I wanted, without being sure I wasn&apos;t going to break everything (remember, no test coverage).&lt;/p&gt;
&lt;p&gt;I decided to take a break and look into alternatives, and stumbled upon &lt;a href=&quot;http://pandas.pydata.org/&quot;&gt;Pandas&lt;/a&gt;, a data manipulation and statistics library for Python. It turns out that Pandas supports time series natively, and that it could do a lot of the smart computation needed in Gnocchi. I built a new file format leveraging Pandas for computing the time series and named it &lt;strong&gt;carbonara&lt;/strong&gt; (a wink to both the &lt;a href=&quot;https://github.com/graphite-project/carbon&quot;&gt;Carbon&lt;/a&gt; project and pasta, how clever!). The code is quite small (a third of &lt;em&gt;whisper&lt;/em&gt;&apos;s, 200 SLOC vs 600 SLOC), does not have many of the &lt;em&gt;whisper&lt;/em&gt; limitations and… it has test coverage. These Carbonara files are then, in the same fashion, stored into Swift containers.&lt;/p&gt;
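&lt;p&gt;To give an idea of what &quot;natively&quot; means here: the kind of roll-up that &lt;em&gt;whisper&lt;/em&gt; implements by hand is a one-liner on a Pandas time series. This is a generic illustration of &lt;code&gt;resample&lt;/code&gt;, not Carbonara&apos;s actual implementation:&lt;/p&gt;

```python
import pandas as pd

# Four measures taken every 30 minutes...
series = pd.Series(
    [1.0, 2.0, 3.0, 4.0],
    index=pd.date_range("2014-08-18 00:00", periods=4, freq="30min"),
)

# ...aggregated down to 1-hour precision by averaging each bucket
hourly = series.resample("1h").mean()
# 00:00 bucket -> mean(1.0, 2.0), 01:00 bucket -> mean(3.0, 4.0)
```

Pandas handles the timestamp alignment, the bucketing and a whole family of aggregation methods (mean, min, max, sum…) out of the box.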
&lt;p&gt;Anyway, Gnocchi storage driver system is designed in the same spirit that the rest of OpenStack and Ceilometer storage driver system. It&apos;s a plug-in system with an API, so anyone can write their own driver. Eoghan Glynn has already started to write a &lt;a href=&quot;http://influxdb.com/&quot;&gt;InfluxDB&lt;/a&gt; driver, working closely with the upstream developer of that database. Dina Belova started to write an &lt;a href=&quot;http://opentsdb.net/&quot;&gt;OpenTSDB&lt;/a&gt; driver. This helps to make sure the API is designed directly in the right way.&lt;/p&gt;
&lt;h2&gt;Handling resources&lt;/h2&gt;
&lt;p&gt;Measuring individual entities is great and needed, but you also need to link them with resources. When measuring the temperature and the number of people in a room, it is useful to link these 2 separate entities to a resource, in that case the room, and give a name to these relations, so one is able to identify what attribute of the resource is actually measured. It is also important to provide the possibility to store attributes on these resources, such as their owners, the time they started and ended their existence, etc.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/gnocchi-relationship.png&quot; alt=&quot;gnocchi-relationship&quot; /&gt;&lt;/p&gt;
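&lt;p&gt;In code, the relationship pictured above amounts to a named mapping from a resource to its entities. The structure below is a toy illustration of that model, not Gnocchi&apos;s indexer:&lt;/p&gt;

```python
# A resource carries its own attributes plus named links to entities;
# each entity id points to its own list of (timestamp, value) points.
room = {
    "id": "room-42",
    "attributes": {"owner": "facilities", "started_at": "2014-08-18"},
    "entities": {
        # relation name -> entity id
        "temperature": "entity-temp-1",
        "occupancy": "entity-occ-1",
    },
}


def measured_attributes(resource):
    """Return the names of the measured attributes of a resource."""
    return sorted(resource["entities"])
```

The relation names ("temperature", "occupancy") are what let you ask which attribute of the room a given series of measures describes.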
&lt;p&gt;Once this list of resources is collected, the next step is to list and filter them, based on any criteria. One might want to retrieve the list of resources created last week, or the list of instances hosted on a particular node right now.&lt;/p&gt;
&lt;p&gt;Resources also need to be specialized. Some resources have attributes that must be stored in order for filtering to be useful. Think about an instance name or a router network.&lt;/p&gt;
&lt;p&gt;All of these requirements led to the design of what&apos;s called the &lt;em&gt;indexer&lt;/em&gt;. The indexer is responsible for indexing entities and resources, and linking them together. The initial implementation is based on &lt;a href=&quot;http://sqlalchemy.org&quot;&gt;SQLAlchemy&lt;/a&gt; and should be pretty efficient. It&apos;s easy enough to index the most requested attributes (columns), and they are also correctly typed.&lt;/p&gt;
&lt;p&gt;We plan to establish a model for all known OpenStack resources (instances, volumes, networks, …) to store and index them into the Gnocchi indexer in order to request them in an efficient way from one place. The generic resource class can be used to handle generic resources that are not tied to OpenStack. It&apos;d be up to the users to store extra attributes.&lt;/p&gt;
&lt;p&gt;Dropping the free form metadata we used to have in Ceilometer makes sure that querying the indexer is going to be efficient and scalable.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/gnocchi-classes.png&quot; alt=&quot;gnocchi-classes&quot; /&gt;&lt;/p&gt;
&lt;h2&gt;REST API&lt;/h2&gt;
&lt;p&gt;All of this is exported via a REST API that was partially designed and documented in the &lt;a href=&quot;http://git.openstack.org/cgit/openstack/ceilometer-specs/tree/specs/juno/gnocchi.rst&quot;&gt;Gnocchi specification in the Ceilometer repository&lt;/a&gt;; though the spec is not up-to-date yet. We plan to auto-generate the documentation from the code as we are currently doing in Ceilometer.&lt;/p&gt;
&lt;p&gt;The REST API is pretty easy to use, and you can use it to manipulate entities and resources, and request the information back.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/gnocchi-architecture.png&quot; alt=&quot;gnocchi-architecture&quot; /&gt;&lt;/p&gt;
&lt;h2&gt;Roadmap &amp;amp; Ceilometer integration&lt;/h2&gt;
&lt;p&gt;This entire plan was presented to and discussed with the Ceilometer team&lt;br /&gt;
during the last &lt;a href=&quot;https://www.openstack.org/summit/openstack-summit-atlanta-2014/&quot;&gt;OpenStack summit in Atlanta&lt;/a&gt; in May 2014, for the Juno release. I led a session about this entire concept, and convinced the team that using Gnocchi for our metric storage would be a good approach to solve the Ceilometer collector scalability issue.&lt;/p&gt;
&lt;p&gt;It was decided to conduct this project experiment in parallel with the current Ceilometer collector for the time being, and see where that would lead the project.&lt;/p&gt;
&lt;h2&gt;Early benchmarks&lt;/h2&gt;
&lt;p&gt;Some engineers from Mirantis did a few benchmarks around Ceilometer and also against an early version of Gnocchi, and Dina Belova presented them to us during the mid-cycle sprint we organized in Paris in early July.&lt;/p&gt;
&lt;p&gt;The following graph sums up the current Ceilometer performance issue pretty well. The more metrics you feed it, the slower it becomes.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/image03.png&quot; alt=&quot;image03&quot; /&gt;&lt;/p&gt;
&lt;p&gt;For Gnocchi, while the numbers themselves are not fantastic, what is interesting is that all the graphs below show that performance is stable, with no correlation to the number of resources, entities or measures. This proves that, indeed, most of the code is built around a complexity of &lt;em&gt;O(1)&lt;/em&gt;, and not &lt;em&gt;O(n)&lt;/em&gt; anymore.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/image00.png&quot; alt=&quot;image00&quot; /&gt;&lt;br /&gt;
&lt;img src=&quot;https://julien.danjou.info/content/images/03/image01.png&quot; alt=&quot;image01&quot; /&gt;&lt;br /&gt;
&lt;img src=&quot;https://julien.danjou.info/content/images/03/image04.png&quot; alt=&quot;image04&quot; /&gt;&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/image05.png&quot; alt=&quot;image05&quot; /&gt;&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/image06.png&quot; alt=&quot;image06&quot; /&gt;&lt;/p&gt;
&lt;h2&gt;Next steps&lt;/h2&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/clement-drawing-gnocchi.jpg&quot; alt=&quot;clement-drawing-gnocchi&quot; /&gt;&lt;/p&gt;
&lt;p&gt;While the Juno cycle is being wrapped up for most projects, including Ceilometer, Gnocchi development is still ongoing. Fortunately, the composite architecture of Ceilometer allows a lot of its features to be replaced by some other code dynamically. That, for example, enables Gnocchi to provide a Ceilometer dispatcher plugin for its collector, without having to ship the actual code in Ceilometer itself. That should help Gnocchi&apos;s development not be slowed down by the release process for now.&lt;/p&gt;
&lt;p&gt;The Ceilometer team aims to provide Gnocchi as a sort of technology preview with the Juno release, allowing it to be deployed alongside Ceilometer and plugged into it. We&apos;ll probably discuss how to integrate it into the project in a more permanent and robust manner during the &lt;a href=&quot;https://www.openstack.org/summit/openstack-paris-summit-2014/&quot;&gt;OpenStack Summit for Kilo&lt;/a&gt; that will take place next November in Paris.&lt;/p&gt;
</content:encoded></item><item><title>OpenStack Design Summit Juno, from a Ceilometer point of view</title><link>https://julien.danjou.info/blog/openstack-summit-juno-ceilometer/</link><guid isPermaLink="true">https://julien.danjou.info/blog/openstack-summit-juno-ceilometer/</guid><description>Last week was the OpenStack Design Summit in Atlanta, GA where we, developers, discussed and designed the new OpenStack release (Juno) coming up.</description><pubDate>Fri, 30 May 2014 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Last week was the &lt;a href=&quot;https://www.openstack.org/summit/openstack-summit-atlanta-2014/&quot;&gt;OpenStack Design Summit&lt;/a&gt; in Atlanta, GA where we, developers, discussed and designed the new OpenStack release (Juno) coming up. I&apos;ve been there mainly to discuss Ceilometer upcoming developments.&lt;/p&gt;
&lt;p&gt;The summit has been great. It was my third OpenStack design summit, and the first one not being a PTL, meaning it was a largely more relaxed summit for me!&lt;/p&gt;
&lt;p&gt;On Monday, we started with a 2.5-hour meeting between Ceilometer core developers and contributors about the Gnocchi experimental project that I started a few weeks ago. It was a great and productive afternoon, and allowed me to introduce and cover this topic extensively, something that would not have been possible in the allocated session we had later in the week.&lt;/p&gt;
&lt;p&gt;Ceilometer had its design sessions running mainly during Wednesday. We took a lot of notes and comments during the sessions in our &lt;a href=&quot;https://wiki.openstack.org/wiki/Summit/Juno/Etherpads#Ceilometer&quot;&gt;Etherpad instances&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Here is a short summary of the sessions I&apos;ve attended.&lt;/p&gt;
&lt;h2&gt;Scaling the central agent&lt;/h2&gt;
&lt;p&gt;I was in charge of the first session, and introduced the work that has been done so far on scaling the central agent. Six months ago, during the Havana summit, I proposed to scale the central agent by distributing the tasks among several nodes, using a library to handle the group membership aspect of it. That led to the creation of the &lt;a href=&quot;https://pypi.python.org/pypi/tooz&quot;&gt;tooz&lt;/a&gt; library that we worked on at eNovance during the last 6 months.&lt;/p&gt;
&lt;p&gt;Now that we have this foundation available, Cyril Roelandt started to replace the Ceilometer alarming job repartition code with Taskflow and Tooz. Starting with the alarm evaluators is simpler, and will serve as a first proof of concept to be reused by the central agent afterwards. We plan to get this merged for Juno.&lt;/p&gt;
&lt;p&gt;For the central agent, the same work needs to be done, but since it&apos;s a bit more complicated, it will be done after the alarming evaluators are converted.&lt;/p&gt;
&lt;h2&gt;Test strategy&lt;/h2&gt;
&lt;p&gt;The next session discussed the test strategy and how we could improve Ceilometer unit and functional testing. There is a lot to be done in this area, and this is going to be one of the main focuses of the team in the upcoming weeks. Having Tempest tests run was a goal for Havana, and even if we made a lot of progress, we&apos;re still not there yet.&lt;/p&gt;
&lt;h2&gt;Complex queries and per-user/project data collection&lt;/h2&gt;
&lt;p&gt;This session, led by Ildikó Váncsa, was about adding finer-grained settings into the pipeline configuration to allow per-user and per-project data retrieval. This was not really controversial, though how to implement it exactly is still to be discussed, but the idea was well received. The other part of the session was about adding more to the complex queries feature provided by the v2 API.&lt;/p&gt;
&lt;h2&gt;Rethinking Ceilometer as a Time-Series-as-a-Service&lt;/h2&gt;
&lt;p&gt;This was my main session, the reason we met on Monday for a few hours, and one of the most promising session – I hope – of the week.&lt;/p&gt;
&lt;p&gt;It appears that the way Ceilometer designed its API and storage backends a long time ago is now a problem for scaling the data storage. Also, the events API we introduced in the last release partially overlaps some of the functionality provided by the samples API, which causes us scaling troubles.&lt;/p&gt;
&lt;p&gt;Therefore, I&apos;ve started to rethink the Ceilometer API by building it as a time series read/write service, leaving the audit part of our previous samples API to the event subsystem. After some research and experiments, I&apos;ve designed a new project called &lt;a href=&quot;https://wiki.openstack.org/Gnocchi&quot;&gt;Gnocchi&lt;/a&gt;, which provides exactly that functionality in a hopefully scalable way.&lt;/p&gt;
&lt;p&gt;Gnocchi is split in two parts: a time series API and its driver, and a resource indexing API with its own driver. Having two distinct driver sets allows it to use different technologies to store each data type in the best storage engine possible. The canonical driver for time series handling is based on &lt;a href=&quot;http://pandas.pydata.org/&quot;&gt;Pandas&lt;/a&gt; and &lt;a href=&quot;https://launchpad.net/swift&quot;&gt;Swift&lt;/a&gt;. The canonical resource indexer driver&lt;br /&gt;
is based on &lt;a href=&quot;http://sqlalchemy.org&quot;&gt;SQLAlchemy&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The idea and project was well received and looked pretty exciting to most people. Our hope is to design a version 3 of the Ceilometer API around Gnocchi at some point during the Juno cycle, and have it ready as some sort of preview for the final release.&lt;/p&gt;
&lt;h2&gt;Revisiting the Ceilometer data model&lt;/h2&gt;
&lt;p&gt;This session led by Alexei Kornienko, kind of echoed the previous session, as it clearly also tried to address the Ceilometer scalability issue, but in a different way.&lt;/p&gt;
&lt;p&gt;Anyway, the SQL driver limitations have been discussed, and Mehdi Abaakouk implemented some of the suggestions during the week, so we should very soon see better performance in Ceilometer with the current default storage driver.&lt;/p&gt;
&lt;h2&gt;Ceilometer devops session&lt;/h2&gt;
&lt;p&gt;We organized this session to get feedback from the devops community about deploying Ceilometer. It was very interesting; the list of things we could improve is long, and I think it will help us drive our future efforts.&lt;/p&gt;
&lt;h2&gt;SNMP inspectors&lt;/h2&gt;
&lt;p&gt;This session, led by Lianhao Lu, discussed various details of the future of SNMP support in Ceilometer.&lt;/p&gt;
&lt;h2&gt;Alarm and logs improvements&lt;/h2&gt;
&lt;p&gt;This mixed session, led by Nejc Saje and Gordon Chung, was about possible improvements on the alarm evaluation system provided by Ceilometer, and making logging in Ceilometer more effective. Both half-sessions were interesting and led to several ideas on how to improve both systems.&lt;/p&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;Considering the current QA problems with Ceilometer, Eoghan Glynn, the new &lt;em&gt;Project Technical Leader&lt;/em&gt; for Ceilometer, clearly indicated that this will be the main focus of the release cycle.&lt;/p&gt;
&lt;p&gt;Personally, I will be focused on working on Gnocchi, and will likely be joined by others in the next weeks. Our idea is to develop a complete solution with high velocity, and then work on its integration with Ceilometer itself.&lt;/p&gt;
</content:encoded></item><item><title>Making of The Hacker&apos;s Guide to Python</title><link>https://julien.danjou.info/blog/making-of-the-hacker-guide-to-python/</link><guid isPermaLink="true">https://julien.danjou.info/blog/making-of-the-hacker-guide-to-python/</guid><description>As promised, today I would like to write a bit about the making of The Hacker&apos;s Guide to Python. It has been a very interesting experimentation, and I think it is worth sharing it with you.</description><pubDate>Wed, 07 May 2014 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;As promised, today I would like to write a bit about the making of &lt;a href=&quot;https://thehackerguidetopython.com&quot;&gt;The Hacker&apos;s Guide to Python&lt;/a&gt;. It has been a very interesting experimentation, and I think it is worth sharing it with you.&lt;/p&gt;
&lt;h2&gt;The inspiration&lt;/h2&gt;
&lt;p&gt;All started out at the beginning of August 2013. I was spending my summer, as the rest of the year, hacking on OpenStack.&lt;/p&gt;
&lt;p&gt;As years passed, I got more and more deeply involved in the various tools that we either built or contributed to within the OpenStack community. And I somehow got the feeling that my experience with Python, the way we used it inside OpenStack and other applications during these last years was worth sharing. Worth writing something bigger than a few blog posts.&lt;/p&gt;
&lt;p&gt;The OpenStack project does code reviews, and therefore so did I for almost two years. That inspired a lot of topics, like &lt;a href=&quot;https://julien.danjou.info/blog/guide-python-static-class-abstract-methods&quot;&gt;the definitive guide to method decorators&lt;/a&gt; that I wrote at the time I started the hacker&apos;s guide. Stumbling upon the same mistakes or misunderstandings over and over is, somehow, inspiring.&lt;/p&gt;
&lt;p&gt;I also stumbled upon &lt;a href=&quot;http://nathanbarry.com&quot;&gt;Nathan Barry&lt;/a&gt;&apos;s blog and his book &lt;a href=&quot;http://nathanbarry.com/authority/&quot;&gt;Authority&lt;/a&gt;, which were very helpful for getting started and served as some sort of guideline.&lt;/p&gt;
&lt;p&gt;All of that brought me enough ideas to start writing a book about Python software development for people already familiar with the language.&lt;/p&gt;
&lt;h2&gt;The writing&lt;/h2&gt;
&lt;p&gt;The first thing I did was to list all the topics I wanted to write about. The list turned out to include subjects that had no direct interest for a practical guide. For example, on one hand, very few developers know in detail how metaclasses work, but on the other hand, I never had to write a metaclass during these last years. So I dropped all the subjects that I felt were not going to help my readers be more productive, even if they could be technically interesting.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/the-hacker-guide-to-python-opened.png&quot; alt=&quot;the-hacker-guide-to-python-opened&quot; /&gt;&lt;/p&gt;
&lt;p&gt;Then I gathered all the problems I had seen during the code reviews of the last two years. Some of them I only recalled in the days following the start of the project, but I kept adding them to the table of contents, reorganizing things as needed.&lt;/p&gt;
&lt;p&gt;After a couple of weeks, I had a pretty good overview of the content I was going to write about. All I had to do was fill in the blanks (that sounds so simple now).&lt;/p&gt;
&lt;p&gt;The entire writing of the book took a hundred hours, spread from August to November, during my spare time. I had to stop all my other side projects for that.&lt;/p&gt;
&lt;h2&gt;The interviews&lt;/h2&gt;
&lt;p&gt;While writing the book, I tried to parallelize everything I could. That included asking people for the interviews to be included in the book. I already had a pretty good list of the people I wanted to feature, so I took some time as soon as possible to ask them and send them detailed questions.&lt;/p&gt;
&lt;p&gt;I discovered two categories of interviewees. Some of them were very fast to answer (≤ 1 week), and others were much, much slower. A couple of them even set up Git repositories to answer the questions, because that probably looked like an entire project to them. :-) So I had to keep an eye on things, kindly asking from time to time if everything was all right, and at some point I started to gently set deadlines.&lt;/p&gt;
&lt;p&gt;In the end, the quality of the answers was awesome, and I like to think that was because I picked the right people!&lt;/p&gt;
&lt;h2&gt;The proof-reading&lt;/h2&gt;
&lt;p&gt;Once the book was finished, I needed people to proofread it. This was probably the hardest part of this experiment. I needed two different types of review: technical reviews, to check that the content was correct and interesting, and a language review. The latter is even more important since English is not my native language.&lt;/p&gt;
&lt;p&gt;Finding technical reviewers seemed easy at first, as I had a ton of contacts that I identified as being able to review the book. I started by asking a few people if they would be comfortable reading a single chapter and giving me feedback. I started doing that in September: having the writing and the reviews done in parallel was important to me in order to minimize latency and the book&apos;s release delay.&lt;/p&gt;
&lt;p&gt;Everyone I contacted answered positively that they would be interested in doing a technical review of a chapter. So I started to send chapters to them. But in the end, only 20% ever replied. And even then, a large portion stopped reviewing after a couple of chapters.&lt;/p&gt;
&lt;p&gt;Don&apos;t get me wrong: you can&apos;t be mad at people for not wanting to spend their spare time on book editing the way you do.&lt;/p&gt;
&lt;p&gt;However, from the few people who gave their time to review a few chapters, I got tremendous feedback, at all levels. That was very important and helped a lot in building confidence. Writing a book alone for months, without anyone looking over your shoulder, can make you doubt that you are creating something worthwhile.&lt;/p&gt;
&lt;p&gt;As for the English proofreading, I went ahead and used &lt;a href=&quot;http://odesk.com&quot;&gt;oDesk&lt;/a&gt; to recruit a professional proofreader. I looked for people with the right skills: a good level of English (being a native speaker at the very least), the ability to understand what the book was about, and the ability to work within reasonable deadlines. I had mixed results with the people I hired, but I guess that&apos;s normal. The only mistake I made was not parallelizing those reviews enough, so I probably lost a couple of months on that.&lt;/p&gt;
&lt;h2&gt;The toolchain&lt;/h2&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/the-hacker-guide-to-python-2.png&quot; alt=&quot;the-hacker-guide-to-python-2&quot; /&gt;&lt;/p&gt;
&lt;p&gt;While writing the book, I took a few breaks to build a toolchain. What I call a toolchain is the set of tools used to render the final PDF, EPUB and MOBI files of the guide.&lt;/p&gt;
&lt;p&gt;After some research, I decided to settle on &lt;a href=&quot;http://www.methods.co.nz/asciidoc/&quot;&gt;AsciiDoc&lt;/a&gt;, using the &lt;a href=&quot;http://www.docbook.org&quot;&gt;DocBook&lt;/a&gt; output, which is then transformed to LaTeX and then to PDF, or to EPUB directly. I rely on &lt;a href=&quot;http://calibre-ebook.com/&quot;&gt;Calibre&lt;/a&gt; to convert the EPUB file to MOBI. It took me a few hours to get what I wanted, using some magic LaTeX tricks for proper rendering, but it was worth it and I&apos;m particularly happy with the result.&lt;/p&gt;
&lt;p&gt;For the cover design, I asked my talented friend &lt;a href=&quot;http://nicolas-veyret.com/&quot;&gt;Nicolas&lt;/a&gt; to do something for me, and he designed the wonderful cover and its little snake!&lt;/p&gt;
&lt;h2&gt;The publishing&lt;/h2&gt;
&lt;p&gt;Publishing is an interesting topic people kept asking me about. This is the exchange I had a few dozen times:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&quot;Who is your publisher?&quot;&lt;/li&gt;
&lt;li&gt;&quot;Me.&quot;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;I never had any plan to ask a publisher to release this book. Nowadays, asking a publisher to release a book feels to me like asking a major label to publish a CD. It feels awkward.&lt;/p&gt;
&lt;p&gt;However, don&apos;t get me wrong: there can be a few upsides to having a publisher. They will find reviewers and have your book reviewed for you. Having the reviews handled for you is probably a very good thing, considering how hard it was for me to get that in place. It can be especially important for a technical book.&lt;/p&gt;
&lt;p&gt;Also, your book may end up in brick-and-mortar stores and be part of a collection, both improving visibility. That may improve your book&apos;s sales, though the publisher and all the intermediaries are going to keep the largest share of the money anyway.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&quot;Oh, you will publish it yourself, great. So you will print it and sell it to people?&quot;&lt;/li&gt;
&lt;li&gt;&quot;Not really.&quot;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;I&apos;ve heard good stories about people using &lt;a href=&quot;http://gumroad.com&quot;&gt;Gumroad&lt;/a&gt; to sell electronic content, so after looking at the competitors in that market, I picked them. I also had the idea of selling the book for Bitcoin, so I settled on &lt;a href=&quot;http://coinbase.com&quot;&gt;Coinbase&lt;/a&gt;, because they have a nice API for that.&lt;/p&gt;
&lt;p&gt;Setting everything up was quite straightforward, especially with Gumroad. It only took me a few hours. Writing the Coinbase application took a few hours too.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&quot;Oh, you will sell it only as an ebook? That&apos;s too bad. You need a paper version. Many people will want a paper version.&quot;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;My initial plan was to sell only an electronic version online. On the other hand, since I kept hearing that a printed version should exist, I decided to give it a try. I chose to work with &lt;a href=&quot;http://lulu.com&quot;&gt;Lulu&lt;/a&gt; because I knew people using it, and it was pretty simple to set up.&lt;/p&gt;
&lt;h2&gt;The launch&lt;/h2&gt;
&lt;p&gt;Once I had everything ready, I built the selling page and connected everything between Mailchimp, Gumroad, Coinbase, Google Analytics, etc.&lt;/p&gt;
&lt;p&gt;Writing the launch email was really exciting. I used a Mailchimp feature to send the launch mail in several batches, just to have some margin in case of a sudden last-minute problem. But everything went fine. Hurrah!&lt;/p&gt;
&lt;p&gt;I distributed around 200 copies of the ebook in the first 48 hours, for about $5000. That covered all the costs I had from writing the book, and then some, so I was already pretty happy with the launch.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/thgtp-sell-graph.png&quot; alt=&quot;thgtp-sell-graph&quot; /&gt;&lt;/p&gt;
&lt;h2&gt;Retrospective&lt;/h2&gt;
&lt;p&gt;In retrospect, something that I didn&apos;t do as well as I could have was building a solid mailing list of interested people, and creating strong anticipation and an incentive to buy the book on launch day. My mailing list counted around 1500 people who subscribed because they were interested in the launch of the book; in the end, probably only 10-15% of them bought the book during the launch, which is a bit lower than what I could have expected.&lt;/p&gt;
&lt;p&gt;But more than a month later, I have distributed in total almost 500 copies of the book (including physical units) for more than $10000, so I tend to think that this was a success. I still sell a few copies of the book each week, but the numbers are small compared to the launch.&lt;/p&gt;
&lt;p&gt;I sold fewer than 10 copies of the ebook for Bitcoin, and I admit I&apos;m a bit disappointed and surprised by that.&lt;/p&gt;
&lt;p&gt;Physical copies represent 10% of the book&apos;s distribution. That&apos;s probably a lot lower than what most of the people who pushed me to do it thought it would be, but it is still higher than what I expected. So I would still advise having a paperback version of your book, if only because it&apos;s nice to have it in your library.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/thgtp-paperback.jpg&quot; alt=&quot;thgtp-paperback&quot; /&gt;&lt;/p&gt;
&lt;p&gt;I only got positive feedback, a few typo notices, and absolutely no refund requests, which I find amazing.&lt;/p&gt;
&lt;p&gt;The good news is also that I&apos;ve been contacted by a couple of Korean and Chinese publishers to get the book translated and published in those countries. If everything goes well, the book should be translated in the upcoming months and be available in those markets in 2015!&lt;/p&gt;
&lt;p&gt;If you didn&apos;t get a copy, &lt;a href=&quot;https://thehackerguidetopython.com&quot;&gt;there&apos;s still time to do so&lt;/a&gt;!&lt;/p&gt;
</content:encoded></item><item><title>Doing A/B testing with Apache httpd</title><link>https://julien.danjou.info/blog/a-b-testing-with-apache/</link><guid isPermaLink="true">https://julien.danjou.info/blog/a-b-testing-with-apache/</guid><description>When I started writing the landing page for The Hacker&apos;s Guide to Python, I wanted to try new things at the same time.</description><pubDate>Sun, 06 Apr 2014 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;When I started writing the landing page for &lt;a href=&quot;https://thehackerguidetopython.com&quot;&gt;The Hacker&apos;s Guide to Python&lt;/a&gt;, I wanted to try new things at the same time. I read about A/B testing a while ago, and I figured it was a good opportunity to test it out.&lt;/p&gt;
&lt;h2&gt;A/B testing&lt;/h2&gt;
&lt;p&gt;If you do not know what A/B testing is about, take a quick look at the &lt;a href=&quot;http://en.wikipedia.org/wiki/A/B_testing&quot;&gt;Wikipedia page on that subject&lt;/a&gt;. Long story short, the idea is to serve two different versions of a page to your visitors and check which one gets the most success. Once you know which version works better, you can switch to it permanently.&lt;/p&gt;
&lt;p&gt;In the case of my book, I used that technique on the pre-launch page where people were able to subscribe to the newsletter. I didn&apos;t have a lot of things I wanted to test out on that page, so I just used that approach on the subtitle, being either &quot;Learn everything you need to build a successful Python project&quot; or &quot;It&apos;s time you make the most out of Python&quot;.&lt;/p&gt;
&lt;p&gt;Statistically, each version would be served half of the time, so both would get the same number of views. I would then build statistics about which page attracted the most subscribers. With the results, I would be able to switch definitively to the winning version of the landing page.&lt;/p&gt;
&lt;h2&gt;Technical design&lt;/h2&gt;
&lt;p&gt;My Web site, this Web site, is entirely static and served by &lt;a href=&quot;http://httpd.apache.org/&quot;&gt;Apache httpd&lt;/a&gt;. I didn&apos;t want to use any dynamic page, language or whatever. Mainly because I didn&apos;t want to have something else to install and maintain just for that on my server.&lt;/p&gt;
&lt;p&gt;It turns out that Apache httpd is powerful enough to implement such a feature. There are different ways to build it, and I&apos;m going to describe my choices here.&lt;/p&gt;
&lt;p&gt;The first thing to pick is a way to balance the display of the page. You need to find a way so that if you get 100 visitors, around 50 will see version A of your page, and around 50 will see version B.&lt;/p&gt;
&lt;p&gt;You could use a random number generator, pick a random number for each visitor, and decide which page they are going to see. But it turns out that I didn&apos;t find a way to do that with Apache httpd at first glance.&lt;/p&gt;
&lt;p&gt;My second thought was to use the client IP address. But that&apos;s not such a good idea, because if some of your visitors are behind a company firewall, for example, they will all be served the same page, which skews the statistics.&lt;/p&gt;
&lt;p&gt;Finally, I picked time based balancing: if you visit the page on a second that is even, you get version A of the page, and if you visit the page on a second that is odd, you get version B. Simple, and so far nothing proves there are more visitors on even than odd seconds, or vice-versa.&lt;/p&gt;
&lt;p&gt;The next thing is to always serve the same page to a returning visitor: if a visitor comes back later and gets a different version, that&apos;s cheating. I decided the system should always serve the same page once a visitor &quot;picked&quot; a version. Doing that is simple enough: use a cookie to store which version the visitor has been assigned, and then honor that cookie if they come back.&lt;/p&gt;
&lt;h2&gt;Implementation&lt;/h2&gt;
&lt;p&gt;To do that in Apache httpd, I used the powerful &lt;a href=&quot;http://httpd.apache.org/docs/current/mod/mod_rewrite.html&quot;&gt;mod_rewrite&lt;/a&gt; module that ships with it. I put two files in the books directory, named &quot;the-hacker-guide-to-python-a.html&quot; and &quot;the-hacker-guide-to-python-b.html&quot;, one of which got served when you requested &quot;&lt;a href=&quot;https://thehackerguidetopython.com&quot;&gt;https://thehackerguidetopython.com&lt;/a&gt;&quot;.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;RewriteEngine On
RewriteBase /books

## If there&apos;s a cookie called thgtp-pre-version set,
## use its value and serve the page
RewriteCond %{HTTP_COOKIE} thgtp-pre-version=([^;]+)
RewriteRule ^the-hacker-guide-to-python$ %{REQUEST_FILENAME}-%1.html [L]

## No cookie yet and…
RewriteCond %{HTTP_COOKIE} !thgtp-pre-version=([^;]+)
## … the number of seconds of the time right now is even
RewriteCond %{TIME_SEC} [02468]$
## Then serve the page A and store &quot;a&quot; in a cookie
RewriteRule ^the-hacker-guide-to-python$ %{REQUEST_FILENAME}-a.html [cookie=thgtp-pre-version:a:julien.danjou.info,L]

## No cookie yet and…
RewriteCond %{HTTP_COOKIE} !thgtp-pre-version=([^;]+)
## … the number of seconds of the time right now is odd
RewriteCond %{TIME_SEC} [13579]$
## Then serve the page B and store &quot;b&quot; in a cookie
RewriteRule ^the-hacker-guide-to-python$ %{REQUEST_FILENAME}-b.html [cookie=thgtp-pre-version:b:julien.danjou.info,L]
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;With those few lines, it worked flawlessly.&lt;/p&gt;
&lt;h2&gt;Results&lt;/h2&gt;
&lt;p&gt;The results were very good, as the system worked perfectly. Combined with Google Analytics, I was able to follow the score of each page. It turns out that testing this particular little piece of content was, as expected, rather useless: the final scores didn&apos;t allow me to pick a winner. Which also kind of proves that the traffic split worked as intended.&lt;/p&gt;
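&lt;p&gt;For the curious, here is how you could decide whether a winner exists at all. This is only a sketch, not something I ran at the time, and the visitor and subscriber counts below are made up: a 2×2 chi-squared test on subscribers versus non-subscribers tells you whether the difference between the two variants is statistically significant (with one degree of freedom, the critical value for p = 0.05 is 3.841).&lt;/p&gt;

```python
# Chi-squared test of independence on a 2x2 table.
# With one degree of freedom, a statistic above 3.841 means p is below 0.05.

def chi_squared(a_views, a_subs, b_views, b_subs):
    observed = [[a_subs, a_views - a_subs],
                [b_subs, b_views - b_subs]]
    total = a_views + b_views
    col_totals = [a_subs + b_subs, total - a_subs - b_subs]
    row_totals = [a_views, b_views]
    stat = 0.0
    for i in range(2):
        for j in range(2):
            # Expected count under the "no difference" hypothesis
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed[i][j] - expected) ** 2 / expected
    return stat

# Hypothetical numbers: 750 views per variant, 90 vs 84 subscribers.
stat = chi_squared(750, 90, 750, 84)
print("chi2 = %.3f, significant: %s" % (stat, stat > 3.841))
```

&lt;p&gt;With numbers that close, the statistic stays well under the threshold: no winner, which matches what I saw in Google Analytics.&lt;/p&gt;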
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/google-analytics-ab-testing-thgtp.png&quot; alt=&quot;google-analytics-ab-testing-thgtp&quot; /&gt;&lt;/p&gt;
&lt;p&gt;But it still was an interesting challenge!&lt;/p&gt;
</content:encoded></item><item><title>The Hacker&apos;s Guide to Python released!</title><link>https://julien.danjou.info/blog/the-hacker-guide-to-python-has-been-released/</link><guid isPermaLink="true">https://julien.danjou.info/blog/the-hacker-guide-to-python-has-been-released/</guid><description>And done! It took me just 8 months to do this entire book project around Python. From the first day I started writing to today, when I finally publish and sell this book, almost entirely by myself.</description><pubDate>Tue, 25 Mar 2014 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;And done! It took me just 8 months to do this entire book project around Python. From the first day I started writing to today, when I finally publish and sell this book, almost entirely by myself. I&apos;m really proud of what I&apos;ve achieved so far, as this was something totally new to me.&lt;/p&gt;
&lt;p&gt;Doing all of that has been a great adventure, and I promise I&apos;ll write something about it later on: a making-of.&lt;/p&gt;
&lt;p&gt;For now, you can enjoy reading the book and learn a bit more about Python. I really hope it&apos;ll help you bring your Python-fu to a new level, and help you build great projects!&lt;/p&gt;
&lt;p&gt;Go check it out, and since this is the first day of sale, enjoy 20% off by using the offer code &lt;strong&gt;THGTP20&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/the-hacker-guide-to-python-4.png&quot; alt=&quot;Cover of The Hacker&apos;s Guide to Python&quot; /&gt;&lt;/p&gt;
</content:encoded></item><item><title>OpenStack Ceilometer Icehouse-2 milestone released</title><link>https://julien.danjou.info/blog/openstack-ceilometer-icehouse-2-milestone-released/</link><guid isPermaLink="true">https://julien.danjou.info/blog/openstack-ceilometer-icehouse-2-milestone-released/</guid><description>Yesterday, the second milestone of the Icehouse development branch of Ceilometer has been released and is now available for testing and download. This means the first half of the OpenStack Icehouse de</description><pubDate>Fri, 24 Jan 2014 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Yesterday, the second milestone of the Icehouse development branch of Ceilometer has been released and is now available for testing and download. This means the first half of the OpenStack &lt;em&gt;Icehouse&lt;/em&gt; development cycle has passed!&lt;/p&gt;
&lt;h2&gt;New features&lt;/h2&gt;
&lt;p&gt;For the &lt;a href=&quot;https://launchpad.net/ceilometer/+milestone/icehouse-1&quot;&gt;Icehouse-1 milestone&lt;/a&gt;, we barely had enough time to implement 2 blueprints. We almost did better this time, but in the end only &lt;a href=&quot;https://launchpad.net/ceilometer/+milestone/icehouse-2&quot;&gt;2 blueprints were implemented&lt;/a&gt; again. This is really far from what we planned initially. The infrastructure slowdown issues and the lower number of reviews are probably the root cause here.&lt;/p&gt;
&lt;p&gt;Anyway, Ceilometer now offers a &lt;a href=&quot;https://blueprints.launchpad.net/ceilometer/+spec/specify-event-api&quot;&gt;REST API to access the stored events&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The initial work to replace the &lt;code&gt;/v2/meters&lt;/code&gt; endpoint with something more RESTy has started with the &lt;a href=&quot;https://blueprints.launchpad.net/ceilometer/+spec/sample-api&quot;&gt;implementation of &lt;code&gt;/v2/samples&lt;/code&gt;&lt;/a&gt;.&lt;/p&gt;
&lt;h2&gt;Bug fixes&lt;/h2&gt;
&lt;p&gt;Thirty-one bugs were fixed, though most of them might not interest you, so I won&apos;t elaborate too much on that. Go read &lt;a href=&quot;https://launchpad.net/ceilometer/+milestone/icehouse-2&quot;&gt;the list&lt;/a&gt; if you are curious.&lt;/p&gt;
&lt;h2&gt;Toward Icehouse 3&lt;/h2&gt;
&lt;p&gt;We now have 29 blueprints targeting &lt;a href=&quot;https://launchpad.net/ceilometer/+milestone/icehouse-3&quot;&gt;Ceilometer&apos;s third Icehouse milestone&lt;/a&gt;, some of which are already started and ready to merge. However, it&apos;s likely that we won&apos;t make all of them. As usual, the priority should indicate how confident we are that we want and need a feature. Still, it&apos;s likely the roadmap will be adjusted in the upcoming weeks. I&apos;ll try to make sure we get there without too much trouble for the 6th of March 2014. Stay tuned!&lt;/p&gt;
</content:encoded></item><item><title>Databases integration testing strategies with Python</title><link>https://julien.danjou.info/blog/db-integration-testing-strategies-python/</link><guid isPermaLink="true">https://julien.danjou.info/blog/db-integration-testing-strategies-python/</guid><description>The Ceilometer project supports various database backends that can be used as storage. Among them are MongoDB, SQLite, MySQL, PostgreSQL, HBase, DB2… All Ceilometer&apos;s code is unit tested, but when.</description><pubDate>Mon, 06 Jan 2014 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;The &lt;a href=&quot;http://launchpad.net/ceilometer&quot;&gt;Ceilometer&lt;/a&gt; project supports various database backends that can be used as storage. Among them are &lt;a href=&quot;http://www.mongodb.org/&quot;&gt;MongoDB&lt;/a&gt;, &lt;a href=&quot;http://sqlite.org&quot;&gt;SQLite&lt;/a&gt;, &lt;a href=&quot;http://mysql.com&quot;&gt;MySQL&lt;/a&gt;, &lt;a href=&quot;http://postgresql.org&quot;&gt;PostgreSQL&lt;/a&gt;, &lt;a href=&quot;http://hbase.apache.org/&quot;&gt;HBase&lt;/a&gt;, DB2… All Ceilometer&apos;s code is unit tested, but when dealing with external storage services, one cannot be sure that the code is really working. You could be inserting data with an incorrect SQL statement, or in the wrong table. Only running the code against the real database can tell you that.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/python_db_tests.png&quot; alt=&quot;python_db_tests&quot; /&gt;&lt;/p&gt;
&lt;p&gt;Over the months, we developed integration testing on top of our unit testing to validate that our storage drivers are able to deal with real world databases. That is not really different from generic &lt;a href=&quot;http://en.wikipedia.org/wiki/Integration_testing&quot;&gt;integration testing&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Integration testing is about plugging all the pieces of your software together and running them. In what I call &quot;database integration testing&quot;, the pieces will be both your software and the database system that you rely on.&lt;/p&gt;
&lt;p&gt;The only difference here is that one of the modules does not come from the application itself but is an external project. The type of database that you use (RDBMS, NoSQL…) does not matter. Taking a step back, what I describe here could also apply to a lot of other software modules, even something that is not a database system at all.&lt;/p&gt;
&lt;h3&gt;Writing tests for integration&lt;/h3&gt;
&lt;p&gt;Presumably, your Python application has unit tests. In order to test against a database back-end, you need to write a few specific classes of tests that will use the database subsystem for real. For example:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import unittest
import os
import sqlalchemy

class TestDB(unittest.TestCase):
    def setUp(self):
        url = os.getenv(&quot;DB_TEST_URL&quot;)
        if not url:
            self.skipTest(&quot;No database URL set&quot;)
        self.engine = sqlalchemy.create_engine(url)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This code will try to fetch the database URL to use from an environment variable, and then will rely on &lt;a href=&quot;http://sqlalchemy.org&quot;&gt;SQLAlchemy&lt;/a&gt; to create a database connection.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import unittest
import os
import sqlalchemy

import myapp

class TestDB(unittest.TestCase):
    def setUp(self):
        url = os.getenv(&quot;DB_TEST_URL&quot;)
        if not url:
            self.skipTest(&quot;No database URL set&quot;)
        self.engine = sqlalchemy.create_engine(url)

    def test_foobar(self):
        self.assertTrue(myapp.store_integer(self.engine, 42))
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;You can then add as many tests as you want, using the connection stored in &lt;code&gt;self.engine&lt;/code&gt;. If no test database URL is set, the tests will be skipped; however, that decision is up to you. You may want to have these tests always run, and fail if they can&apos;t be.&lt;/p&gt;
&lt;p&gt;In the &lt;code&gt;setUp()&lt;/code&gt; method, you may also need to do more work, like creating a test database and deleting it afterwards.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import unittest
import os
import sqlalchemy

class TestDB(unittest.TestCase):
    def setUp(self):
        url = os.getenv(&quot;DB_TEST_URL&quot;)
        if not url:
            self.skipTest(&quot;No database URL set&quot;)
        self.engine = sqlalchemy.create_engine(url)
        self.connection = self.engine.connect()
        self.connection.execute(&quot;CREATE DATABASE testdb&quot;)

    def tearDown(self):
        self.connection.execute(&quot;DROP DATABASE testdb&quot;)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This will make sure that the database you need is clean and ready to be used for testing.&lt;/p&gt;
&lt;h3&gt;Launching modules, a.k.a. databases&lt;/h3&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/postgresql.png&quot; alt=&quot;postgresql&quot; /&gt;&lt;/p&gt;
&lt;p&gt;The main problem we encountered when building integration testing with databases is finding a way to start them. Most users are used to starting them system-wide with some sort of init script, but when running sandboxed tests, that is not really a good option. Browsing the documentation of each database allowed us to find a way to start them in the foreground and control them &quot;interactively&quot; via a shell script.&lt;/p&gt;
&lt;p&gt;The following is a script that you can use to run Python tests with &lt;a href=&quot;http://nose.readthedocs.org/&quot;&gt;nose&lt;/a&gt;; it is heavily inspired by the one we wrote for Ceilometer.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;#!/bin/bash
set -e

clean_exit() {
    local error_code=&quot;$?&quot;
    kill -9 $(jobs -p) &amp;gt;/dev/null 2&amp;gt;&amp;amp;1 || true
    rm -rf &quot;$PGSQL_DATA&quot;
    return $error_code
}

check_for_cmd () {
    if ! which &quot;$1&quot; &amp;gt;/dev/null 2&amp;gt;&amp;amp;1
    then
        echo &quot;Could not find $1 command&quot; 1&amp;gt;&amp;amp;2
        exit 1
    fi
}

wait_for_line () {
    while read line
    do
        echo &quot;$line&quot; | grep -q &quot;$1&quot; &amp;amp;&amp;amp; break
    done &amp;lt; &quot;$2&quot;
    # Read the fifo forever, otherwise the process writing to it would block
    cat &quot;$2&quot; &amp;gt;/dev/null &amp;amp;
}

check_for_cmd postgres

trap &quot;clean_exit&quot; EXIT

## Start PostgreSQL process for tests
PGSQL_DATA=`mktemp -d /tmp/PGSQL-XXXXX`
PGSQL_PATH=`pg_config --bindir`
${PGSQL_PATH}/initdb ${PGSQL_DATA}
mkfifo ${PGSQL_DATA}/out
${PGSQL_PATH}/postgres -F -k ${PGSQL_DATA} -D ${PGSQL_DATA} &amp;amp;&amp;gt; ${PGSQL_DATA}/out &amp;amp;
## Wait for PostgreSQL to start listening to connections
wait_for_line &quot;database system is ready to accept connections&quot; ${PGSQL_DATA}/out
export DB_TEST_URL=&quot;postgresql:///?host=${PGSQL_DATA}&amp;amp;dbname=template1&quot;

## Run the tests
nosetests
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;If you use &lt;a href=&quot;http://tox.readthedocs.org&quot;&gt;tox&lt;/a&gt; to automate your test runs, you can use this script (I call it &lt;code&gt;run-tests.sh&lt;/code&gt;) in your &lt;code&gt;tox.ini&lt;/code&gt; file.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;[testenv]
commands = {toxinidir}/run-tests.sh {posargs}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/mysql.png&quot; alt=&quot;mysql&quot; /&gt;&lt;/p&gt;
&lt;p&gt;Most databases can be run in some sort of standalone mode where you can connect to them using either a Unix domain socket or a fixed port. Here are the snippets used in Ceilometer to run with MongoDB and MySQL:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;## Start MongoDB process for tests
MONGO_DATA=$(mktemp -d /tmp/MONGODB-XXXXX)
MONGO_PORT=29000
mkfifo ${MONGO_DATA}/out
mongod --maxConns 32 --nojournal --noprealloc --smallfiles --quiet --noauth --port ${MONGO_PORT} --dbpath &quot;${MONGO_DATA}&quot; --bind_ip localhost &amp;amp;&amp;gt;${MONGO_DATA}/out &amp;amp;
## Wait for Mongo to start listening to connections
wait_for_line &quot;waiting for connections on port ${MONGO_PORT}&quot; ${MONGO_DATA}/out
export DB_TEST_URL=&quot;mongodb://localhost:${MONGO_PORT}/testdb&quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/mongodb.png&quot; alt=&quot;mongodb&quot; /&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;## Start MySQL process for tests
MYSQL_DATA=$(mktemp -d /tmp/MYSQL-XXXXX)
mkfifo ${MYSQL_DATA}/out
mysqld --datadir=${MYSQL_DATA} --pid-file=${MYSQL_DATA}/mysql.pid --socket=${MYSQL_DATA}/mysql.socket --skip-networking --skip-grant-tables &amp;amp;&amp;gt; ${MYSQL_DATA}/out &amp;amp;
## Wait for MySQL to start listening to connections
wait_for_line &quot;mysqld: ready for connections.&quot; ${MYSQL_DATA}/out
export DB_TEST_URL=&quot;mysql://root@localhost/testdb?unix_socket=${MYSQL_DATA}/mysql.socket&amp;amp;charset=utf8&quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The mechanism is always the same. We create a &lt;em&gt;fifo&lt;/em&gt; with &lt;code&gt;mkfifo&lt;/code&gt;, and then run the database daemon with its output redirected to that fifo. We then read from it until we find a line stating that the database is ready to be used. At that point, we can continue and start running the tests. You have to keep reading from the fifo, otherwise the process writing to it will block. We redirect the output to &lt;code&gt;/dev/null&lt;/code&gt;, but you could also redirect it to a log file, or not at all.&lt;/p&gt;
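&lt;p&gt;The same pattern can be expressed in a few lines of Python, if you prefer driving the daemon from your test fixtures rather than from a shell script. This is only a sketch of the idea, not code from Ceilometer; the command and the &quot;ready&quot; marker would be whatever your database prints:&lt;/p&gt;

```python
import subprocess
import sys
import threading

def spawn_and_wait_for_line(cmd, ready_marker):
    """Start a daemon and block until it prints its 'ready' line."""
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE,
                            stderr=subprocess.STDOUT, text=True)
    for line in proc.stdout:
        if ready_marker in line:
            break
    else:
        raise RuntimeError("process exited before becoming ready")
    # Keep draining stdout in the background, otherwise the daemon blocks
    # as soon as the pipe buffer fills up -- the same reason the shell
    # script keeps reading the fifo forever with cat.
    threading.Thread(target=proc.stdout.read, daemon=True).start()
    return proc

# Demo with a stand-in "daemon" that just prints the marker and exits.
proc = spawn_and_wait_for_line(
    [sys.executable, "-c", "print('ready to accept connections')"],
    "ready to accept connections")
```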
&lt;blockquote&gt;
&lt;p&gt;Note: &lt;a href=&quot;http://www.die-welt.net/&quot;&gt;Evgeni Golov&lt;/a&gt; pointed it exists a &lt;a href=&quot;https://alioth.debian.org/scm/loggerhead/pkg-postgresql/postgresql-common/trunk/view/head:/pg_virtualenv&quot;&gt;pg_virtualenv&lt;/a&gt; for PostgreSQL and &lt;a href=&quot;https://github.com/evgeni/my_virtualenv&quot;&gt;my_virtualenv&lt;/a&gt; for MySQL that does the same kind of thing, but with more bells and whistles.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h3&gt;One step further: using parallelism and scenarios&lt;/h3&gt;
&lt;p&gt;The described approach is quite simple, as it only supports one database type. When using an abstraction layer such as SQLAlchemy, it would be a good idea to run all these tests against different RDBMS, such as MySQL and PostgreSQL.&lt;/p&gt;
&lt;p&gt;The snippet above allows running both RDBMS in parallel, but the classic unit test approach does not take advantage of that. Using one scenario for each database backend is a better idea. To that end, you can use the &lt;a href=&quot;https://launchpad.net/testscenarios&quot;&gt;testscenarios&lt;/a&gt; library.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import unittest
import os
import sqlalchemy
import testscenarios

load_tests = testscenarios.load_tests_apply_scenarios

class TestDB(unittest.TestCase):
    scenarios = [
        (&apos;mysql&apos;, dict(database_connection=os.getenv(&quot;MYSQL_TEST_URL&quot;))),
        (&apos;postgresql&apos;, dict(database_connection=os.getenv(&quot;PGSQL_TEST_URL&quot;))),
    ]

    def setUp(self):
        if not self.database_connection:
            self.skipTest(&quot;No database URL set&quot;)
        self.engine = sqlalchemy.create_engine(self.database_connection)
        self.connection = self.engine.connect()
        self.connection.execute(&quot;CREATE DATABASE testdb&quot;)

    def tearDown(self):
        self.connection.execute(&quot;DROP DATABASE testdb&quot;)
&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;$ python -m subunit.run test_scenario | subunit2pyunit
test_scenario.TestDB.test_foobar(mysql)
test_scenario.TestDB.test_foobar(mysql) ... ok
test_scenario.TestDB.test_foobar(postgresql)
test_scenario.TestDB.test_foobar(postgresql) ... ok

---------------------------------------------------------
Ran 2 tests in 0.061s

OK
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;To speed up test runs, you could also run the tests in parallel. This can be interesting, as you&apos;ll be able to spread the workload among many different CPUs. However, note that it can require a different database for each test, or a locking mechanism: it&apos;s likely that your tests won&apos;t all be able to work at the same time on a single database.&lt;/p&gt;
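&lt;p&gt;A simple way to get that isolation is to give every test its own database name instead of sharing &lt;code&gt;testdb&lt;/code&gt;. A minimal sketch; the helper name is purely illustrative:&lt;/p&gt;

```python
import uuid


def unique_database_name(prefix="testdb"):
    # A name such as testdb_3f2a... that several test processes can
    # CREATE and DROP concurrently without stepping on each other.
    return "%s_%s" % (prefix, uuid.uuid4().hex)
```

&lt;p&gt;In the &lt;code&gt;setUp&lt;/code&gt;/&lt;code&gt;tearDown&lt;/code&gt; pair above, you would then issue &lt;code&gt;CREATE DATABASE&lt;/code&gt; and &lt;code&gt;DROP DATABASE&lt;/code&gt; on this per-test name rather than on a fixed one.&lt;/p&gt;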
&lt;p&gt;(Both usage of scenarios and parallelism in testing will be covered in &lt;a href=&quot;https://thehackerguidetopython.com&quot;&gt;The Hacker&apos;s Guide to Python&lt;/a&gt;,&lt;br /&gt;
in case you wonder.)&lt;/p&gt;
</content:encoded></item><item><title>OpenStack Design Summit Icehouse, from a Ceilometer point of view</title><link>https://julien.danjou.info/blog/openstack-summit-icehouse-ceilometer/</link><guid isPermaLink="true">https://julien.danjou.info/blog/openstack-summit-icehouse-ceilometer/</guid><description>Last week was the OpenStack Design Summit Icehouse in Hong-Kong where we, OpenStack developers, discussed and designed the new OpenStack release (Icehouse) that is coming up.</description><pubDate>Wed, 13 Nov 2013 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Last week was the &lt;a href=&quot;http://www.openstack.org/summit/openstack-summit-hong-kong-2013/&quot;&gt;OpenStack Design Summit Icehouse&lt;/a&gt; in Hong-Kong where we, OpenStack developers, discussed and designed the new OpenStack release (Icehouse) that is coming up.&lt;/p&gt;
&lt;p&gt;The week has been wonderful. It was my second OpenStack design summit, and I loved it. Meeting people I had so far only worked with online was a real pleasure, as was seeing fellow OpenStack developers again! The event organisation was great, as were the parties. :-)&lt;/p&gt;
&lt;p&gt;On the last day, I had the chance to present a talk with Eoghan Glynn and Nick Barcet about how we built the auto-scaling feature in Heat, implementing the &quot;alarming&quot; feature needed in Ceilometer.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/ods_icehouse_ceilometer_heat_nijaba_eglynn_jd.jpg&quot; alt=&quot;ods_icehouse_ceilometer_heat_nijaba_eglynn_jd&quot; /&gt;&lt;/p&gt;
&lt;h2&gt;Design sessions&lt;/h2&gt;
&lt;p&gt;This time, the Ceilometer design sessions were spread over 3 days. Everything we talked about has its own &lt;a href=&quot;https://wiki.openstack.org/wiki/Summit/Icehouse/Etherpads#Ceilometer&quot;&gt;Etherpad instance&lt;/a&gt;. The discussions were interesting, and the large amount of feedback gathered is going to be very useful.&lt;/p&gt;
&lt;p&gt;There are a lot of people and companies using Ceilometer now, and the project is getting more and more traction in general. There are a lot of different ways to use it and to bend it to one&apos;s needs. Considering the number of features and options provided, building functionality with a generic approach is making Ceilometer useful for a lot of different and interesting use cases.&lt;/p&gt;
&lt;h2&gt;Icehouse roadmap&lt;/h2&gt;
&lt;p&gt;The &lt;a href=&quot;https://blueprints.launchpad.net/ceilometer/icehouse&quot;&gt;list of blueprints targeting Icehouse is available&lt;/a&gt;, but not yet complete. I expect people to start filling this list in the next days. If you want to propose blueprints, you&apos;re free to do so and inform us about it so we can validate it. The same applies if you wish to implement one of them!&lt;/p&gt;
&lt;p&gt;Below, I try to guess what the Ceilometer roadmap will look like in the upcoming weeks, based on the discussions we had during the summit.&lt;/p&gt;
&lt;h3&gt;Events management&lt;/h3&gt;
&lt;p&gt;A lot of work is going to be put into event management. Ceilometer plans to store notifications sent via &lt;em&gt;oslo.messaging&lt;/em&gt; by OpenStack projects. Some work already got merged for Havana, but the API part and further improvements and ideas will continue to flow into the Icehouse release.&lt;/p&gt;
&lt;h3&gt;Agents and group management&lt;/h3&gt;
&lt;p&gt;A lot has been discussed around the polling agents and around the alarm evaluator agent.&lt;/p&gt;
&lt;p&gt;The current state of the &lt;em&gt;ceilometer-central-agent&lt;/em&gt; disallows any kind of high availability and load balancing, as the polling tasks are kept and scheduled on only one node.&lt;/p&gt;
&lt;p&gt;The high-availability part is already covered by a custom mechanism built into &lt;em&gt;ceilometer-alarm-evaluator&lt;/em&gt;, but it became clear to us that a more generic approach is needed. A lot of other projects need this kind of functionality, and a common pattern has been identified. A &lt;a href=&quot;https://wiki.openstack.org/wiki/Oslo/blueprints/service-sync&quot;&gt;blueprint about group membership&lt;/a&gt; has been discussed in an Oslo session, and will result in a new Python library written to solve this in Ceilometer and in other projects.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://wiki.openstack.org/wiki/Taskflow&quot;&gt;TaskFlow&lt;/a&gt; will also probably be leveraged to solve the task distribution issue.&lt;/p&gt;
&lt;h3&gt;Documentation&lt;/h3&gt;
&lt;p&gt;Since a few weeks, Ceilometer auto-generates its &lt;a href=&quot;http://api.openstack.org/api-ref-metering.html&quot;&gt;API reference documentation&lt;/a&gt; using &lt;a href=&quot;https://git.openstack.org/cgit/stackforge/sphinxcontrib-docbookrestapi/&quot;&gt;sphinxcontrib-docbookrestapi&lt;/a&gt; that parses our API code that uses &lt;a href=&quot;https://pypi.python.org/pypi/WSME&quot;&gt;WSME&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;We also want to start writing a user guide, and we&apos;ll do that inside our own repository. That way, I hope that we will be the first project in OpenStack to require documentation to be incorporated into every patch that&apos;s being sent to Ceilometer. This is the best way to assure that nothing can be changed nor added without being accompanied with a documentation update.&lt;/p&gt;
&lt;h3&gt;Tempest testing&lt;/h3&gt;
&lt;p&gt;Testing of Ceilometer was already a topic during the previous design summit. We put a large effort into Tempest testing in this last cycle, but we encountered a lot of small issues that we had to tackle to achieve something. Some basic Ceilometer tests are already on their way into Tempest, so this is something that is going to be achieved very soon.&lt;/p&gt;
&lt;p&gt;Ultimately, I would also want Ceilometer moving towards providing its own set of Tempest tests as part of the code base. That way, it&apos;d be as easy for core reviewers to refuse a patch if it doesn&apos;t provide functional tests as it is to refuse it if it doesn&apos;t provide unit tests. As we&apos;ll do for the documentation.&lt;/p&gt;
&lt;h3&gt;API improvements&lt;/h3&gt;
&lt;p&gt;Once again, a few API improvements will probably be implemented, like aggregation or the ability to specify multiple queries with &lt;em&gt;OR&lt;/em&gt; and &lt;em&gt;AND&lt;/em&gt; operators.&lt;/p&gt;
&lt;h3&gt;Roll-up, archiving of data&lt;/h3&gt;
&lt;p&gt;There seems to be interest in archiving and rolling up the data stored by Ceilometer, so work in this area is to be expected. Supporting multiple data storage drivers in parallel seems to be something that needs to be done for this and other aspects of Ceilometer&apos;s feature set.&lt;/p&gt;
&lt;h3&gt;Alarming&lt;/h3&gt;
&lt;p&gt;The alarming feature set is already big, and the work that has been accomplished is pretty amazing. A few improvements will be made, such as retrieving better metrics and building better statistics (excluding low-quality data points).&lt;/p&gt;
</content:encoded></item><item><title>Python 3.4 single dispatch, a step into generic functions</title><link>https://julien.danjou.info/blog/python-3-4-single-dispatch-generic-function/</link><guid isPermaLink="true">https://julien.danjou.info/blog/python-3-4-single-dispatch-generic-function/</guid><description>I love to say that Python is a nice subset of Lisp, and I discover that it&apos;s getting even more true as time passes.</description><pubDate>Tue, 17 Sep 2013 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;I love to say that Python is a nice subset of Lisp, and I discover that it&apos;s getting even more true as time passes. Recently, I&apos;ve stumbled upon the &lt;a href=&quot;http://python.org/dev/peps/pep-0443/&quot;&gt;PEP 443&lt;/a&gt; that describes a way to dispatch generic functions, in a way that looks like what CLOS, the Common Lisp Object System, provides.&lt;/p&gt;
&lt;h2&gt;What are generic functions&lt;/h2&gt;
&lt;p&gt;If you come from the Lisp world, this won&apos;t be something new to you. The Lisp object system provides a really good way to define and handle method dispatching. It&apos;s a base of the Common Lisp object system. For my own pleasure to see Lisp code in a Python post, I&apos;ll show you how generic methods work in Lisp first.&lt;/p&gt;
&lt;p&gt;To begin, let&apos;s define a few very simple classes.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;(defclass snare-drum ()
  ())

(defclass cymbal ()
  ())

(defclass stick ()
  ())

(defclass brushes ()
  ())
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This defines a few classes: &lt;code&gt;snare-drum&lt;/code&gt;, &lt;code&gt;cymbal&lt;/code&gt;, &lt;code&gt;stick&lt;/code&gt; and &lt;code&gt;brushes&lt;/code&gt;, without any parent class or attribute. These classes compose a drum kit, and we can combine them to play sounds. So we define a &lt;code&gt;play&lt;/code&gt; method that takes two arguments and returns a sound (as a string).&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;(defgeneric play (instrument accessory)
  (:documentation &quot;Play sound with instrument and accessory.&quot;))
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This only defines a generic method: it has no body, and cannot be called with any instance yet. At this stage, we only inform the object system that the method is generic and can then be implemented for various types of arguments. We&apos;ll start by implementing versions of this method that know how to play with the snare-drum.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;(defmethod play ((instrument snare-drum) (accessory stick))
  &quot;POC!&quot;)

(defmethod play ((instrument snare-drum) (accessory brushes))
  &quot;SHHHH!&quot;)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now we have defined concrete methods with code. They also take two arguments: &lt;code&gt;instrument&lt;/code&gt;, which is an instance of &lt;code&gt;snare-drum&lt;/code&gt;, and &lt;code&gt;accessory&lt;/code&gt;, which is an instance of &lt;code&gt;stick&lt;/code&gt; or &lt;code&gt;brushes&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;At this stage, you should note the first difference with object systems built into languages like Python: the method isn&apos;t tied to any particular class. The methods are &lt;em&gt;generic&lt;/em&gt;, and any class can implement them, or not.&lt;/p&gt;
&lt;p&gt;Let&apos;s try it.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;* (play (make-instance &apos;snare-drum) (make-instance &apos;stick))
&quot;POC!&quot;

* (play (make-instance &apos;snare-drum) (make-instance &apos;brushes))
&quot;SHHHH!&quot;

* (play (make-instance &apos;cymbal) (make-instance &apos;stick))
debugger invoked on a SIMPLE-ERROR in thread
#&amp;lt;THREAD &quot;main thread&quot; RUNNING {1002ADAF23}&amp;gt;:
  There is no applicable method for the generic function
    #&amp;lt;STANDARD-GENERIC-FUNCTION PLAY (2)&amp;gt;
  when called with arguments
    (#&amp;lt;CYMBAL {1002B801D3}&amp;gt; #&amp;lt;STICK {1002B82763}&amp;gt;).

Type HELP for debugger help, or (SB-EXT:EXIT) to exit from SBCL.

restarts (invokable by number or by possibly-abbreviated name):
  0: [RETRY] Retry calling the generic function.
  1: [ABORT] Exit debugger, returning to top level.

((:METHOD NO-APPLICABLE-METHOD (T)) #&amp;lt;STANDARD-GENERIC-FUNCTION PLAY (2)&amp;gt; #&amp;lt;CYMBAL {1002B801D3}&amp;gt; #&amp;lt;STICK {1002B82763}&amp;gt;) [fast-method]
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;As you can see, the function called depends on the classes of the arguments. The object system &lt;strong&gt;dispatches&lt;/strong&gt; the function calls to the right function for us, depending on the argument classes. If we call &lt;code&gt;play&lt;/code&gt; with instances that are not known to the object system, an error is thrown.&lt;/p&gt;
&lt;p&gt;Inheritance is also supported, and the (more powerful and less error-prone) equivalent of Python&apos;s &lt;code&gt;super()&lt;/code&gt; is available via &lt;code&gt;(call-next-method)&lt;/code&gt;.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;(defclass snare-drum () ())
(defclass cymbal () ())

(defclass accessory () ())
(defclass stick (accessory) ())
(defclass brushes (accessory) ())

(defmethod play ((c cymbal) (a accessory))
  &quot;BIIING!&quot;)

(defmethod play ((c cymbal) (b brushes))
  (concatenate &apos;string &quot;SSHHHH!&quot; (call-next-method)))
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In this example, we define the &lt;code&gt;stick&lt;/code&gt; and &lt;code&gt;brushes&lt;/code&gt; classes as subclasses of the &lt;code&gt;accessory&lt;/code&gt; class. The &lt;code&gt;play&lt;/code&gt; method defined here returns the sound &lt;em&gt;BIIING!&lt;/em&gt; regardless of the accessory instance used to play the cymbal, except when it&apos;s a &lt;code&gt;brushes&lt;/code&gt; instance: the most specific method is always called. The &lt;code&gt;(call-next-method)&lt;/code&gt; function is used to call the closest parent method, in this case the method returning &lt;em&gt;BIIING!&lt;/em&gt;.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;* (play (make-instance &apos;cymbal) (make-instance &apos;stick))
&quot;BIIING!&quot;

* (play (make-instance &apos;cymbal) (make-instance &apos;brushes))
&quot;SSHHHH!BIIING!&quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Note that CLOS is also able to dispatch on object instances themselves by using the &lt;code&gt;eql&lt;/code&gt; specializer.&lt;/p&gt;
&lt;p&gt;But if you&apos;re really curious about all features CLOS provides, I suggest you read the &lt;a href=&quot;http://www.aiai.ed.ac.uk/~jeff/clos-guide.html&quot;&gt;brief guide to CLOS by Jeff Dalton&lt;/a&gt; as a starter.&lt;/p&gt;
&lt;h2&gt;Python implementation&lt;/h2&gt;
&lt;p&gt;Python implements a simpler version of this workflow with the &lt;code&gt;singledispatch&lt;/code&gt; function, provided with Python 3.4 as part of the &lt;code&gt;functools&lt;/code&gt; module. Here&apos;s a rough equivalent of the above Lisp program.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import functools

class SnareDrum(object): pass
class Cymbal(object): pass
class Stick(object): pass
class Brushes(object): pass

@functools.singledispatch
def play(instrument, accessory):
    raise NotImplementedError(&quot;Cannot play these&quot;)

@play.register(SnareDrum)
def _(instrument, accessory):
    if isinstance(accessory, Stick):
        return &quot;POC!&quot;
    if isinstance(accessory, Brushes):
        return &quot;SHHHH!&quot;
    raise NotImplementedError(&quot;Cannot play these&quot;)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We define our four classes, and a base &lt;code&gt;play&lt;/code&gt; function that raises &lt;code&gt;NotImplementedError&lt;/code&gt;, indicating that by default we don&apos;t know what to do. We can then write a specialized version of this function for a first instrument, the &lt;code&gt;SnareDrum&lt;/code&gt;. We then check the type of the accessory we got, and return the appropriate sound, or raise &lt;code&gt;NotImplementedError&lt;/code&gt; again if we don&apos;t know what to do with it.&lt;/p&gt;
&lt;p&gt;If we run it, it works as expected:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&amp;gt;&amp;gt;&amp;gt; play(SnareDrum(), Stick())
&apos;POC!&apos;
&amp;gt;&amp;gt;&amp;gt; play(SnareDrum(), Brushes())
&apos;SHHHH!&apos;
&amp;gt;&amp;gt;&amp;gt; play(Cymbal(), Brushes())
Traceback (most recent call last):
  File &quot;&amp;lt;stdin&amp;gt;&quot;, line 1, in &amp;lt;module&amp;gt;
  File &quot;/home/jd/Source/cpython/Lib/functools.py&quot;, line 562, in wrapper
    return dispatch(args[0].__class__)(*args, **kw)
  File &quot;/home/jd/sd.py&quot;, line 10, in play
    raise NotImplementedError(&quot;Cannot play these&quot;)
NotImplementedError: Cannot play these
&amp;gt;&amp;gt;&amp;gt; play(SnareDrum(), Cymbal())
Traceback (most recent call last):
  File &quot;&amp;lt;stdin&amp;gt;&quot;, line 1, in &amp;lt;module&amp;gt;
  File &quot;/home/jd/Source/cpython/Lib/functools.py&quot;, line 562, in wrapper
    return dispatch(args[0].__class__)(*args, **kw)
  File &quot;/home/jd/sd.py&quot;, line 18, in _
    raise NotImplementedError(&quot;Cannot play these&quot;)
NotImplementedError: Cannot play these
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The &lt;code&gt;singledispatch&lt;/code&gt; machinery looks at the class of the first argument passed to the &lt;code&gt;play&lt;/code&gt; function, and calls the right version of it. The first version of &lt;code&gt;play&lt;/code&gt; we defined is registered for the &lt;code&gt;object&lt;/code&gt; class, so if our instrument is an instance of a class that we did not register, this base function is called.&lt;/p&gt;
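&lt;p&gt;Note that registration follows inheritance on the dispatched argument: register an implementation for a base class, and it is used for every subclass that has no more specific registration. The &lt;code&gt;dispatch()&lt;/code&gt; method even lets you inspect which implementation would be chosen. A small self-contained sketch, separate from the snare-drum example above:&lt;/p&gt;

```python
import functools


class Instrument(object):
    pass


class SnareDrum(Instrument):
    pass


@functools.singledispatch
def play(instrument):
    return "..."


@play.register(Instrument)
def _(instrument):
    # Used for Instrument and any subclass lacking a more
    # specific registration, SnareDrum included.
    return "SOME SOUND"


print(play(SnareDrum()))  # SOME SOUND
```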
&lt;p&gt;For those eager to try and use it, the &lt;code&gt;singledispatch&lt;/code&gt; function is &lt;a href=&quot;https://pypi.python.org/pypi/singledispatch/&quot;&gt;provided for Python 2.6 to 3.3 through the Python Package Index&lt;/a&gt;.&lt;/p&gt;
&lt;h2&gt;Limitations&lt;/h2&gt;
&lt;p&gt;First, as you noticed in the Lisp version, CLOS provides a multiple dispatcher that can dispatch on the type of &lt;strong&gt;any of the arguments&lt;/strong&gt; defined in the method prototype, not only the first one. Unfortunately, the Python dispatcher is named &lt;em&gt;singledispatch&lt;/em&gt; for a good reason: it only knows how to dispatch on the first argument. Guido van Rossum wrote a short article on the subject, which he called &lt;a href=&quot;http://www.artima.com/weblogs/viewpost.jsp?thread=101605&quot;&gt;multimethods&lt;/a&gt;, a few years ago.&lt;/p&gt;
&lt;p&gt;Then, there&apos;s no way to call the parent function directly. There&apos;s no equivalent of Lisp&apos;s &lt;code&gt;(call-next-method)&lt;/code&gt;, nor of the &lt;code&gt;super()&lt;/code&gt; function from Python&apos;s class system. This means you will have to use various tricks to bypass this limitation.&lt;/p&gt;
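&lt;p&gt;One such trick: a generic function built by &lt;code&gt;singledispatch&lt;/code&gt; exposes its resolution machinery through &lt;code&gt;dispatch()&lt;/code&gt;, so you can explicitly fetch and call the implementation a parent class would get. A sketch, with classes made up for the example:&lt;/p&gt;

```python
import functools


class Accessory(object):
    pass


class Brushes(Accessory):
    pass


@functools.singledispatch
def play(accessory):
    return "BIIING!"


@play.register(Brushes)
def _(accessory):
    # There is no (call-next-method): dispatch explicitly on the
    # parent class to reach the less specific implementation.
    return "SSHHHH!" + play.dispatch(Accessory)(accessory)


print(play(Brushes()))  # SSHHHH!BIIING!
```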
&lt;p&gt;So while I am really glad that Python is going toward that direction, as it&apos;s a really powerful way to enhance an object system, it really lacks a lot of more advanced features that CLOS provides out of the box.&lt;/p&gt;
&lt;p&gt;Though, improving this could be an interesting challenge. Especially to bring more CLOS power to &lt;a href=&quot;http://hylang.org&quot;&gt;Hy&lt;/a&gt;. :-)&lt;/p&gt;
</content:encoded></item><item><title>OpenStack Ceilometer Havana-3 milestone released</title><link>https://julien.danjou.info/blog/openstack-ceilometer-havana-3-milestone-released/</link><guid isPermaLink="true">https://julien.danjou.info/blog/openstack-ceilometer-havana-3-milestone-released/</guid><description>Last week, the third and last milestone of the Havana development branch of Ceilometer has been released and is now available for testing and download. This means the end of the OpenStack Havana devel</description><pubDate>Tue, 10 Sep 2013 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Last week, the third and last milestone of the Havana development branch of Ceilometer has been released and is now available for testing and download. This means the end of the OpenStack &lt;em&gt;Havana&lt;/em&gt; development time is coming, and that the features are now frozen.&lt;/p&gt;
&lt;h2&gt;New features&lt;/h2&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/blueprint-1.jpg&quot; alt=&quot;blueprint-1&quot; /&gt;&lt;/p&gt;
&lt;p&gt;Eleven blueprints have been implemented, as you can see on the &lt;a href=&quot;https://launchpad.net/ceilometer/+milestone/havana-3&quot;&gt;release page&lt;/a&gt;. That&apos;s one more than during Havana-2, though less than initially planned; still a pretty high score considering the size of our contributor team. I&apos;m going to walk through the ones that are the most interesting for users.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Our favorite &lt;a href=&quot;https://wiki.openstack.org/wiki/OutreachProgramForWomen&quot;&gt;OPW&lt;/a&gt; intern Terri Yu implemented the long-awaited &lt;a href=&quot;https://blueprints.launchpad.net/ceilometer/+spec/api-group-by&quot;&gt;GROUP BY API feature&lt;/a&gt;, which allows grouping samples by fields before returning statistics.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Eoghan Glynn (Red Hat) continued his implementation of alarming features, and the &lt;a href=&quot;https://blueprints.launchpad.net/ceilometer/+spec/alarm-audit-api&quot;&gt;audit API&lt;/a&gt; has been merged. A few blueprints related to alarming slipped and will be delayed for RC1, as they have been granted feature freeze exceptions:&lt;br /&gt;
&lt;a href=&quot;https://blueprints.launchpad.net/ceilometer/+spec/alarming-logical-combination&quot;&gt;logical combinations of alarms&lt;/a&gt; and &lt;a href=&quot;https://blueprints.launchpad.net/ceilometer/+spec/alarm-service-partitioner&quot;&gt;alarm service partitioner&lt;/a&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;With the help of Gordon Chung (IBM), I&apos;ve worked on creating a &lt;a href=&quot;https://blueprints.launchpad.net/ceilometer/+spec/count-api-requests&quot;&gt;middleware to meter API requests&lt;/a&gt;. This has been merged into Oslo and is handled by Ceilometer. Gordon added another middleware on top of it to add CADF support for audit.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;The Ceilometer compute agent gained its second inspector for polling virtual machines, thanks to Alessandro Pilotti (Cloudbase), who implemented &lt;a href=&quot;https://blueprints.launchpad.net/ceilometer/+spec/hyper-v-agent&quot;&gt;the Hyper-V inspector&lt;/a&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Ceilometer will be able to meter Neutron bandwidth thanks to the eNovance folks who worked on the &lt;a href=&quot;https://blueprints.launchpad.net/ceilometer/+spec/ceilometer-quantum-bw-metering&quot;&gt;bandwidth metering blueprint&lt;/a&gt;, on both the Ceilometer and Neutron sides. This is also a long-awaited feature.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Finally, Ceilometer will be shipped with yet another storage back-end, as Tong Li (IBM) implemented a &lt;a href=&quot;https://blueprints.launchpad.net/ceilometer/+spec/ibm-db2-support&quot;&gt;DB2 driver&lt;/a&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Bug fixes&lt;/h2&gt;
&lt;p&gt;Fifty-six bugs were fixed, though most of them might not interest you so I won&apos;t elaborate too much on that. Go read &lt;a href=&quot;https://launchpad.net/ceilometer/+milestone/havana-3&quot;&gt;the list&lt;/a&gt; if you are curious.&lt;/p&gt;
&lt;h2&gt;Toward our final Havana release&lt;/h2&gt;
&lt;p&gt;With the feature freeze in place, we&apos;re now focusing on fixing bugs and improving documentation. I&apos;ll try to make sure we&apos;ll get there without too much trouble for the 17th October 2013. Stay tuned!&lt;/p&gt;
</content:encoded></item><item><title>Announcing The Hacker&apos;s Guide to Python</title><link>https://julien.danjou.info/blog/announcing-the-hacker-guide-to-python/</link><guid isPermaLink="true">https://julien.danjou.info/blog/announcing-the-hacker-guide-to-python/</guid><description>I&apos;ve been hacking on Python for a lot of years now, on various project. For the last two years, I&apos;ve been heavily involved in OpenStack, which makes an heavy usage of Python.</description><pubDate>Tue, 03 Sep 2013 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;I&apos;ve been hacking on Python for a lot of years now, on various project. For the last two years, I&apos;ve been heavily involved in &lt;a href=&quot;http://openstack.org&quot;&gt;OpenStack&lt;/a&gt;, which makes an heavy usage of Python.&lt;/p&gt;
&lt;p&gt;Once you start working with a hundred hackers on several software projects and libraries representing more than half a million lines of Python, things change. The scalability, testing and deployment problems inherent to a cloud platform influence the design of every component.&lt;/p&gt;
&lt;p&gt;During these two years working on OpenStack development, I&apos;ve learned a lot on Python from astounding Python hackers. From general architecture and design principles to various tips and tricks of the language.&lt;/p&gt;
&lt;p&gt;It seemed like a good opportunity to share what I learned, so you can benefit from it in other projects. I&apos;ve started working on a book, entitled &quot;The Hacker&apos;s Guide to Python&quot;, where I will try to share what I learned while working with Python.&lt;/p&gt;
&lt;p&gt;The book is still a work in progress at this stage, but if you&apos;d like to get in touch and keep updated on its advancement, you can subscribe in the following form or from the &lt;a href=&quot;https://thehackerguidetopython.com&quot;&gt;book homepage&lt;/a&gt;.&lt;/p&gt;
</content:encoded></item><item><title>The definitive guide on how to use static, class or abstract methods in Python</title><link>https://julien.danjou.info/blog/guide-python-static-class-abstract-methods/</link><guid isPermaLink="true">https://julien.danjou.info/blog/guide-python-static-class-abstract-methods/</guid><description>Doing code reviews is a great way to discover things that people might struggle to comprehend. While proof-reading OpenStack patches recently, I spotted that people were not using correctly the.</description><pubDate>Thu, 01 Aug 2013 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Doing code reviews is a great way to discover things that people might struggle to comprehend. While proof-reading &lt;a href=&quot;http://review.openstack.org&quot;&gt;OpenStack patches&lt;/a&gt; recently, I spotted that people were not using correctly the various decorators Python provides for methods. So here&apos;s my attempt at providing me a link to send them to in my next code reviews. :-)&lt;/p&gt;
&lt;h2&gt;How methods work in Python&lt;/h2&gt;
&lt;p&gt;A method is a function that is stored as a class attribute. You can declare and access such a function this way:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&amp;gt;&amp;gt;&amp;gt; class Pizza(object):
...     def __init__(self, size):
...         self.size = size
...     def get_size(self):
...         return self.size
...
&amp;gt;&amp;gt;&amp;gt; Pizza.get_size
&amp;lt;unbound method Pizza.get_size&amp;gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;What Python tells you here is that the attribute &lt;em&gt;get_size&lt;/em&gt; of the class &lt;em&gt;Pizza&lt;/em&gt; is a method that is &lt;strong&gt;unbound&lt;/strong&gt;. What does this mean? We&apos;ll know as soon as we try to call it:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&amp;gt;&amp;gt;&amp;gt; Pizza.get_size()
Traceback (most recent call last):
  File &quot;&amp;lt;stdin&amp;gt;&quot;, line 1, in &amp;lt;module&amp;gt;
TypeError: unbound method get_size() must be called with Pizza instance as first argument (got nothing instead)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We can&apos;t call it because it&apos;s not bound to any instance of &lt;em&gt;Pizza&lt;/em&gt;. And a method wants an instance as its first argument (in Python 2 it &lt;strong&gt;must&lt;/strong&gt; be an instance of that class; in Python 3 it could be anything). Let&apos;s try to do that then:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&amp;gt;&amp;gt;&amp;gt; Pizza.get_size(Pizza(42))
42
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;It worked! We called the method with an instance as its first argument, so everything&apos;s fine. But you will agree with me that this is not a very handy way to call methods; we have to refer to the class each time we want to call one. And if we don&apos;t know what class our object is, this is not going to work for very long.&lt;/p&gt;
&lt;p&gt;So what Python does for us, is that it binds all the methods from the class &lt;code&gt;Pizza&lt;/code&gt; to any instance of this class. This means that the attribute &lt;code&gt;get_size&lt;/code&gt; of an instance of &lt;code&gt;Pizza&lt;/code&gt; is a bound method: a method for which the first argument will be the instance itself.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&amp;gt;&amp;gt;&amp;gt; Pizza(42).get_size
&amp;lt;bound method Pizza.get_size of &amp;lt;__main__.Pizza object at 0x7f3138827910&amp;gt;&amp;gt;
&amp;gt;&amp;gt;&amp;gt; Pizza(42).get_size()
42
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;As expected, we don&apos;t have to provide any argument to &lt;code&gt;get_size&lt;/code&gt;: since it&apos;s bound, its &lt;code&gt;self&lt;/code&gt; argument is automatically set to our &lt;code&gt;Pizza&lt;/code&gt; instance. Here&apos;s an even better proof of that:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&amp;gt;&amp;gt;&amp;gt; m = Pizza(42).get_size
&amp;gt;&amp;gt;&amp;gt; m()
42
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Indeed, you don&apos;t even have to keep a reference to your &lt;code&gt;Pizza&lt;/code&gt; object. Its method is bound to the object, so the method is sufficient by itself.&lt;/p&gt;
&lt;p&gt;But what if you wanted to know which object this bound method is bound to? Here&apos;s a little trick:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&amp;gt;&amp;gt;&amp;gt; m = Pizza(42).get_size
&amp;gt;&amp;gt;&amp;gt; m.__self__
&amp;lt;__main__.Pizza object at 0x7f3138827910&amp;gt;
&amp;gt;&amp;gt;&amp;gt; # You could guess, look at this:
...
&amp;gt;&amp;gt;&amp;gt; m == m.__self__.get_size
True
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Obviously, we still have a reference to our object, and we can find it back if we want.&lt;/p&gt;
&lt;p&gt;In Python 3, the functions attached to a class are not considered &lt;em&gt;unbound methods&lt;/em&gt; anymore, but simple functions, which are bound to an object if required. The principle stays the same; the model is just simplified.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&amp;gt;&amp;gt;&amp;gt; class Pizza(object):
...     def __init__(self, size):
...         self.size = size
...     def get_size(self):
...         return self.size
...
&amp;gt;&amp;gt;&amp;gt; Pizza.get_size
&amp;lt;function Pizza.get_size at 0x7f307f984dd0&amp;gt;
&lt;/code&gt;&lt;/pre&gt;
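&lt;p&gt;To make that concrete, here&apos;s a small sketch (not from the original snippet above) showing that in Python 3 the class attribute is a plain function you can call yourself, passing the instance explicitly:&lt;/p&gt;

```python
class Pizza(object):
    def __init__(self, size):
        self.size = size

    def get_size(self):
        return self.size

# Calling the plain function directly: we pass the instance as `self`
print(Pizza.get_size(Pizza(42)))  # 42

# Accessing it through an instance still produces a bound method
print(Pizza(42).get_size())       # 42
```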
&lt;h2&gt;Static methods&lt;/h2&gt;
&lt;p&gt;Static methods are a special case of methods. Sometimes, you&apos;ll write code that belongs to a class but doesn&apos;t use the object itself at all. For example:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;class Pizza(object):
    @staticmethod
    def mix_ingredients(x, y):
        return x + y

    def cook(self):
        return self.mix_ingredients(self.cheese, self.vegetables)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In such a case, writing &lt;code&gt;mix_ingredients&lt;/code&gt; as a non-static method would work too, but it would provide it with a &lt;code&gt;self&lt;/code&gt; argument that would not be used. Here, the decorator &lt;code&gt;@staticmethod&lt;/code&gt; buys us several things:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Python doesn&apos;t have to instantiate a bound method for each &lt;code&gt;Pizza&lt;/code&gt; object we instantiate. Bound methods are objects too, and creating them has a cost. Having a static method avoids that:&lt;/li&gt;
&lt;/ul&gt;
&lt;pre&gt;&lt;code&gt;&amp;gt;&amp;gt;&amp;gt; Pizza().cook is Pizza().cook
False
&amp;gt;&amp;gt;&amp;gt; Pizza().mix_ingredients is Pizza.mix_ingredients
True
&amp;gt;&amp;gt;&amp;gt; Pizza().mix_ingredients is Pizza().mix_ingredients
True
&lt;/code&gt;&lt;/pre&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;It eases the readability of the code: seeing &lt;code&gt;@staticmethod&lt;/code&gt;, we know that the method does not depend on the state of the object itself;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;It allows us to override the &lt;code&gt;mix_ingredients&lt;/code&gt; method in a subclass. If we used a function &lt;code&gt;mix_ingredients&lt;/code&gt; defined at the top-level of our module, a class inheriting from &lt;code&gt;Pizza&lt;/code&gt; wouldn&apos;t be able to change the way we mix ingredients for our pizza without overriding &lt;code&gt;cook&lt;/code&gt; itself.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
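&lt;p&gt;A quick sketch of that last point (the &lt;code&gt;FancyPizza&lt;/code&gt; subclass and the ingredient strings are invented for the example): because &lt;code&gt;cook&lt;/code&gt; looks up &lt;code&gt;mix_ingredients&lt;/code&gt; through &lt;code&gt;self&lt;/code&gt;, a subclass can change the mixing without touching &lt;code&gt;cook&lt;/code&gt;:&lt;/p&gt;

```python
class Pizza(object):
    cheese = 'mozzarella'
    vegetables = 'peppers'

    @staticmethod
    def mix_ingredients(x, y):
        return x + ' and ' + y

    def cook(self):
        # Looked up through self, so subclasses can override it
        return self.mix_ingredients(self.cheese, self.vegetables)

class FancyPizza(Pizza):
    @staticmethod
    def mix_ingredients(x, y):
        return y + ' under ' + x

print(Pizza().cook())       # mozzarella and peppers
print(FancyPizza().cook())  # peppers under mozzarella
```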
&lt;h2&gt;Class methods&lt;/h2&gt;
&lt;p&gt;Having said that, what are class methods? Class methods are methods that are not bound to an object, but to… a class!&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&amp;gt;&amp;gt;&amp;gt; class Pizza(object):
...     radius = 42
...     @classmethod
...     def get_radius(cls):
...         return cls.radius
... 
&amp;gt;&amp;gt;&amp;gt; 
&amp;gt;&amp;gt;&amp;gt; Pizza.get_radius
&amp;lt;bound method type.get_radius of &amp;lt;class &apos;__main__.Pizza&apos;&amp;gt;&amp;gt;
&amp;gt;&amp;gt;&amp;gt; Pizza().get_radius
&amp;lt;bound method type.get_radius of &amp;lt;class &apos;__main__.Pizza&apos;&amp;gt;&amp;gt;
&amp;gt;&amp;gt;&amp;gt; Pizza.get_radius == Pizza().get_radius
True
&amp;gt;&amp;gt;&amp;gt; Pizza.get_radius()
42
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;However you access this method, it will always be bound to the class it is attached to, and its first argument will be the class itself (remember that classes are objects too).&lt;/p&gt;
&lt;p&gt;When should you use this kind of method? Class methods are mostly useful for two types of methods:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Factory methods, which create an instance of a class using, for example, some sort of pre-processing. If we used a &lt;code&gt;@staticmethod&lt;/code&gt; instead, we would have to hard-code the &lt;code&gt;Pizza&lt;/code&gt; class name in our function, making any class inheriting from &lt;code&gt;Pizza&lt;/code&gt; unable to use our factory for its own purposes.&lt;/li&gt;
&lt;/ul&gt;
&lt;pre&gt;&lt;code&gt;class Pizza(object):
    def __init__(self, ingredients):
        self.ingredients = ingredients

    @classmethod
    def from_fridge(cls, fridge):
        return cls(fridge.get_cheese() + fridge.get_vegetables())
&lt;/code&gt;&lt;/pre&gt;
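&lt;p&gt;Here&apos;s a sketch of why this matters for inheritance (the &lt;code&gt;Fridge&lt;/code&gt; stub and the &lt;code&gt;DoublePizza&lt;/code&gt; subclass are invented for the example): since the factory goes through &lt;code&gt;cls&lt;/code&gt;, calling it on a subclass builds an instance of that subclass:&lt;/p&gt;

```python
class Fridge(object):
    # Minimal stand-in for the fridge object used above
    def get_cheese(self):
        return ['cheese']

    def get_vegetables(self):
        return ['peppers']

class Pizza(object):
    def __init__(self, ingredients):
        self.ingredients = ingredients

    @classmethod
    def from_fridge(cls, fridge):
        # cls is whatever class the method was called on
        return cls(fridge.get_cheese() + fridge.get_vegetables())

class DoublePizza(Pizza):
    def __init__(self, ingredients):
        super(DoublePizza, self).__init__(ingredients * 2)

p = DoublePizza.from_fridge(Fridge())
print(type(p).__name__)  # DoublePizza
print(p.ingredients)     # ['cheese', 'peppers', 'cheese', 'peppers']
```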
&lt;ul&gt;
&lt;li&gt;Static methods calling static methods: if you split a static method into several static methods, you shouldn&apos;t hard-code the class name but use class methods. Declared this way, the &lt;code&gt;Pizza&lt;/code&gt; name is never directly referenced, and inheritance and method overriding will work flawlessly.&lt;/li&gt;
&lt;/ul&gt;
&lt;pre&gt;&lt;code&gt;class Pizza(object):
    def __init__(self, radius, height):
        self.radius = radius
        self.height = height

    @staticmethod
    def compute_area(radius):
         return math.pi * (radius ** 2)

    @classmethod
    def compute_volume(cls, height, radius):
         return height * cls.compute_area(radius)

    def get_volume(self):
        return self.compute_volume(self.height, self.radius)
&lt;/code&gt;&lt;/pre&gt;
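&lt;p&gt;Because &lt;code&gt;compute_volume&lt;/code&gt; goes through &lt;code&gt;cls&lt;/code&gt;, overriding &lt;code&gt;compute_area&lt;/code&gt; in a subclass is enough to change the result everywhere. A sketch (the &lt;code&gt;SquarePizza&lt;/code&gt; subclass is invented for the example):&lt;/p&gt;

```python
import math

class Pizza(object):
    def __init__(self, radius, height):
        self.radius = radius
        self.height = height

    @staticmethod
    def compute_area(radius):
        return math.pi * (radius ** 2)

    @classmethod
    def compute_volume(cls, height, radius):
        # cls is the class of the instance, so an overridden
        # compute_area is picked up automatically
        return height * cls.compute_area(radius)

    def get_volume(self):
        return self.compute_volume(self.height, self.radius)

class SquarePizza(Pizza):
    # A square "pizza" of side 2 * radius
    @staticmethod
    def compute_area(radius):
        return (2 * radius) ** 2

print(round(Pizza(1, 1).get_volume(), 2))  # 3.14
print(SquarePizza(1, 1).get_volume())      # 4
```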
&lt;h2&gt;Abstract methods&lt;/h2&gt;
&lt;p&gt;An abstract method is a method defined in a base class that may not provide any implementation. In Java, it would describe the methods of an interface.&lt;/p&gt;
&lt;p&gt;So the simplest way to write an abstract method in Python is:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;class Pizza(object):
    def get_radius(self):
        raise NotImplementedError
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Any class inheriting from &lt;code&gt;Pizza&lt;/code&gt; must override the &lt;code&gt;get_radius&lt;/code&gt; method; otherwise, an exception is raised when the method is called.&lt;/p&gt;
&lt;p&gt;This particular way of implementing abstract methods has a drawback: if you write a class that inherits from &lt;code&gt;Pizza&lt;/code&gt; and forget to implement &lt;code&gt;get_radius&lt;/code&gt;, the error will only be raised when you try to call that method.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&amp;gt;&amp;gt;&amp;gt; Pizza()
&amp;lt;__main__.Pizza object at 0x7fb747353d90&amp;gt;
&amp;gt;&amp;gt;&amp;gt; Pizza().get_radius()
Traceback (most recent call last):
  File &quot;&amp;lt;stdin&amp;gt;&quot;, line 1, in &amp;lt;module&amp;gt;
  File &quot;&amp;lt;stdin&amp;gt;&quot;, line 3, in get_radius
NotImplementedError
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;There&apos;s a way to trigger this error much earlier, when the object is being instantiated, using the &lt;a href=&quot;http://docs.python.org/2/library/abc.html&quot;&gt;abc&lt;/a&gt; module that&apos;s provided with Python.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import abc

class BasePizza(object):
    __metaclass__  = abc.ABCMeta

    @abc.abstractmethod
    def get_radius(self):
         &quot;&quot;&quot;Method that should do something.&quot;&quot;&quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Using &lt;code&gt;abc&lt;/code&gt; and its special metaclass, as soon as you try to instantiate &lt;code&gt;BasePizza&lt;/code&gt;, or any class inheriting from it that doesn&apos;t override its abstract methods, you&apos;ll get a &lt;code&gt;TypeError&lt;/code&gt;.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&amp;gt;&amp;gt;&amp;gt; BasePizza()
Traceback (most recent call last):
  File &quot;&amp;lt;stdin&amp;gt;&quot;, line 1, in &amp;lt;module&amp;gt;
TypeError: Can&apos;t instantiate abstract class BasePizza with abstract methods get_radius
&lt;/code&gt;&lt;/pre&gt;
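&lt;p&gt;Note that the &lt;code&gt;__metaclass__&lt;/code&gt; attribute shown above is Python 2 syntax; Python 3 ignores it. There, the metaclass is passed as a keyword argument in the class declaration, and the behaviour is the same:&lt;/p&gt;

```python
import abc

class BasePizza(metaclass=abc.ABCMeta):

    @abc.abstractmethod
    def get_radius(self):
        """Method that should do something."""

# Instantiating the abstract class raises TypeError right away
try:
    BasePizza()
except TypeError as exc:
    print(exc)
```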
&lt;h2&gt;Mixing static, class and abstract methods&lt;/h2&gt;
&lt;p&gt;When building classes and inheritance hierarchies, the time will come when you have to mix all these method decorators. So here are some tips about it.&lt;/p&gt;
&lt;p&gt;Keep in mind that declaring a method as abstract doesn&apos;t freeze the prototype of that method. It must be implemented, but it can be implemented with any argument list.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import abc

class BasePizza(object):
    __metaclass__  = abc.ABCMeta

    @abc.abstractmethod
    def get_ingredients(self):
         &quot;&quot;&quot;Returns the ingredient list.&quot;&quot;&quot;

class Calzone(BasePizza):
    def get_ingredients(self, with_egg=False):
        egg = [Egg()] if with_egg else []
        return self.ingredients + egg
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This is valid, since &lt;code&gt;Calzone&lt;/code&gt; fulfills the interface requirement we defined for &lt;code&gt;BasePizza&lt;/code&gt; objects. That means that we could also implement it as being a class or a static method, for example:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import abc

class BasePizza(object):
    __metaclass__  = abc.ABCMeta

    @abc.abstractmethod
    def get_ingredients(self):
         &quot;&quot;&quot;Returns the ingredient list.&quot;&quot;&quot;

class DietPizza(BasePizza):
    @staticmethod
    def get_ingredients():
        return None
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This is also correct and fulfills the contract we have with our abstract &lt;code&gt;BasePizza&lt;/code&gt; class. The fact that the &lt;code&gt;get_ingredients&lt;/code&gt; method doesn&apos;t need to know about the object to return a result is an implementation detail, not a criterion for fulfilling our contract.&lt;/p&gt;
&lt;p&gt;Therefore, you can&apos;t force an implementation of your abstract method to be a regular, class or static method, and arguably you shouldn&apos;t. Starting with Python 3 (this won&apos;t work as you would expect in Python 2, see &lt;a href=&quot;http://bugs.python.org/issue5867&quot;&gt;issue5867&lt;/a&gt;), it&apos;s now possible to use the &lt;code&gt;@staticmethod&lt;/code&gt; and &lt;code&gt;@classmethod&lt;/code&gt; decorators on top of &lt;code&gt;@abstractmethod&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import abc

class BasePizza(object):
    __metaclass__  = abc.ABCMeta

    ingredients = [&apos;cheese&apos;]

    @classmethod
    @abc.abstractmethod
    def get_ingredients(cls):
         &quot;&quot;&quot;Returns the ingredient list.&quot;&quot;&quot;
         return cls.ingredients
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Don&apos;t misread this: if you think this is going to force your subclasses to implement &lt;code&gt;get_ingredients&lt;/code&gt; as a class method, you are wrong. This simply implies that your implementation of &lt;code&gt;get_ingredients&lt;/code&gt; in the &lt;code&gt;BasePizza&lt;/code&gt; class is a class method.&lt;/p&gt;
&lt;p&gt;An implementation in an abstract method? Yes! In Python, contrary to methods in Java interfaces, you can have code in your abstract methods and call it via &lt;code&gt;super()&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import abc

class BasePizza(object):
    __metaclass__  = abc.ABCMeta

    default_ingredients = [&apos;cheese&apos;]

    @classmethod
    @abc.abstractmethod
    def get_ingredients(cls):
         &quot;&quot;&quot;Returns the ingredient list.&quot;&quot;&quot;
         return cls.default_ingredients

class DietPizza(BasePizza):
    def get_ingredients(self):
        return [&apos;egg&apos;] + super(DietPizza, self).get_ingredients()
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In such a case, every pizza you build by inheriting from &lt;code&gt;BasePizza&lt;/code&gt; will have to override the &lt;code&gt;get_ingredients&lt;/code&gt; method, but will be able to use the default mechanism to get the ingredient list via &lt;code&gt;super()&lt;/code&gt;.&lt;/p&gt;
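&lt;p&gt;Run end-to-end, the example above gives the following (shown here with the Python 3 &lt;code&gt;metaclass&lt;/code&gt; syntax; the behaviour is otherwise the same):&lt;/p&gt;

```python
import abc

class BasePizza(metaclass=abc.ABCMeta):

    default_ingredients = ['cheese']

    @classmethod
    @abc.abstractmethod
    def get_ingredients(cls):
        """Returns the ingredient list."""
        return cls.default_ingredients

class DietPizza(BasePizza):
    def get_ingredients(self):
        # The override can reuse the default implementation
        return ['egg'] + super(DietPizza, self).get_ingredients()

print(DietPizza().get_ingredients())  # ['egg', 'cheese']
```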
&lt;p&gt;If you&apos;re interested in knowing more, I&apos;ve covered this topic extensively in &lt;a href=&quot;https://thehackerguidetopython.com&quot;&gt;The Hacker&apos;s Guide to Python&lt;/a&gt;. Check it out!&lt;/p&gt;
</content:encoded></item><item><title>OpenStack Ceilometer Havana-2 milestone released</title><link>https://julien.danjou.info/blog/openstack-ceilometer-havana-2-milestone-released/</link><guid isPermaLink="true">https://julien.danjou.info/blog/openstack-ceilometer-havana-2-milestone-released/</guid><description>Last week, the second milestone of the Havana development branch of Ceilometer has been released and is now available for testing and download. This means the first half of the OpenStack Havana develo</description><pubDate>Sat, 27 Jul 2013 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Last week, the second milestone of the Havana development branch of Ceilometer has been released and is now available for testing and download. This means the first half of the OpenStack &lt;em&gt;Havana&lt;/em&gt; development has passed!&lt;/p&gt;
&lt;h2&gt;New features&lt;/h2&gt;
&lt;p&gt;Ten blueprints have been implemented, as you can see on the &lt;a href=&quot;https://launchpad.net/ceilometer/+milestone/havana-2&quot;&gt;release page&lt;/a&gt;. I&apos;m going to talk through the ones that are the most interesting for users.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/blueprint.jpg&quot; alt=&quot;blueprint&quot; /&gt;&lt;/p&gt;
&lt;p&gt;The Ceilometer API now returns &lt;a href=&quot;https://blueprints.launchpad.net/ceilometer/+spec/api-sample-sorted&quot;&gt;all the samples sorted by timestamp&lt;/a&gt;. This blueprint is the first one implemented by Terri Yu, our &lt;a href=&quot;https://wiki.openstack.org/wiki/OutreachProgramForWomen&quot;&gt;OPW&lt;/a&gt; intern! In the same spirit, I&apos;ve added the ability to &lt;a href=&quot;https://blueprints.launchpad.net/ceilometer/+spec/api-limit&quot;&gt;limit the number of samples returned&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;On the alarming front, things evolved a lot. I&apos;ve implemented the &lt;a href=&quot;https://blueprints.launchpad.net/ceilometer/+spec/alarm-notifier&quot;&gt;notifier system&lt;/a&gt; that will be used to run actions when alarms are triggered. To trigger these alarms, Eoghan Glynn (Red Hat) worked on the &lt;a href=&quot;https://blueprints.launchpad.net/ceilometer/+spec/alarm-distributed-threshold-evaluation&quot;&gt;alarm evaluation system&lt;/a&gt; that will use the Ceilometer API to check for alarm states.&lt;/p&gt;
&lt;p&gt;I&apos;ve reworked the publisher system so it now uses &lt;a href=&quot;https://blueprints.launchpad.net/ceilometer/+spec/pipeline-publisher-url&quot;&gt;URL-formatted targets&lt;/a&gt; for publication. This allows publishing different meters to different targets using the same publishing protocol (e.g. via UDP toward different hosts).&lt;/p&gt;
&lt;p&gt;Sandy Walsh (Rackspace) has been working on the StackTach-like functionality and added the ability for the collector to optionally &lt;a href=&quot;https://blueprints.launchpad.net/ceilometer/+spec/collector-stores-events&quot;&gt;store the notification events received&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Finally, Mehdi Abaakouk (eNovance) implemented a &lt;a href=&quot;https://blueprints.launchpad.net/ceilometer/+spec/db-ttl&quot;&gt;TTL system for the database&lt;/a&gt;, so you&apos;re now able to expire your data whenever you like.&lt;/p&gt;
&lt;h2&gt;Bug fixes&lt;/h2&gt;
&lt;p&gt;Thirty-five bugs were fixed, though most of them might not interest you so I won&apos;t elaborate too much on that. Go read &lt;a href=&quot;https://launchpad.net/ceilometer/+milestone/havana-2&quot;&gt;the list&lt;/a&gt; if you are curious.&lt;/p&gt;
&lt;h2&gt;Toward Havana 3&lt;/h2&gt;
&lt;p&gt;We now have 30 blueprints targeting &lt;a href=&quot;https://launchpad.net/ceilometer/+milestone/havana-3&quot;&gt;Ceilometer&apos;s third Havana milestone&lt;/a&gt;, some of which are already started. I&apos;ll try to make sure we get there without too much trouble for the 6th September 2013. Stay tuned!&lt;/p&gt;
</content:encoded></item><item><title>OpenStack meets Lisp: cl-openstack-client</title><link>https://julien.danjou.info/blog/lisp-and-openstack-with-cl-openstack-client/</link><guid isPermaLink="true">https://julien.danjou.info/blog/lisp-and-openstack-with-cl-openstack-client/</guid><description>Building an OpenStack client library in Common Lisp, exploring what it takes to bring the OpenStack community beyond Python.</description><pubDate>Thu, 04 Jul 2013 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;A month ago, a mail hit the &lt;a href=&quot;http://openstack.org&quot;&gt;OpenStack&lt;/a&gt; mailing list entitled &quot;&lt;a href=&quot;https://lists.launchpad.net/openstack/msg24349.html&quot;&gt;The OpenStack Community Welcomes Developers in All Programming Languages&lt;/a&gt;&quot;. You may know that OpenStack is essentially built using Python, and therefore it is the reference language for the client libraries implementations. As a Lisp and OpenStack practitioner, I used this excuse to build a challenge for myself: let&apos;s prove this point by bringing Lisp into OpenStack!&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/cl-openstack-client-1.png&quot; alt=&quot;cl-openstack-client-1&quot; /&gt;&lt;/p&gt;
&lt;p&gt;Welcome &lt;a href=&quot;https://github.com/stackforge/cl-openstack-client&quot;&gt;cl-openstack-client&lt;/a&gt;, the OpenStack client library for &lt;a href=&quot;http://common-lisp.net/&quot;&gt;Common Lisp&lt;/a&gt;!&lt;/p&gt;
&lt;p&gt;The project is hosted on the classic OpenStack infrastructure for third-party projects, &lt;a href=&quot;http://ci.openstack.org/stackforge.html&quot;&gt;StackForge&lt;/a&gt;. It provides the &lt;a href=&quot;https://jenkins.openstack.org/job/gate-cl-openstack-client-run-tests/&quot;&gt;continuous integration system based on Jenkins&lt;/a&gt; and the Gerrit infrastructure used to review contributions.&lt;/p&gt;
&lt;h2&gt;How the tests work&lt;/h2&gt;
&lt;p&gt;OpenStack projects run a fabulous contribution workflow, &lt;a href=&quot;https://julien.danjou.info/blog/2013/rant-about-github-pull-request-workflow-implementation&quot;&gt;which I already talked about&lt;/a&gt;, based on tools like &lt;a href=&quot;http://gerrit.googlecode.com/&quot;&gt;Gerrit&lt;/a&gt; and &lt;a href=&quot;http://jenkins-ci.org/&quot;&gt;Jenkins&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;OpenStack Python projects usually run &lt;a href=&quot;https://pypi.python.org/pypi/tox&quot;&gt;tox&lt;/a&gt; to build a virtual environment and run the tests inside it. We don&apos;t have such a thing in Common Lisp as far as I know, so I had to build it myself.&lt;/p&gt;
&lt;p&gt;Fortunately, using &lt;a href=&quot;http://www.quicklisp.org/&quot;&gt;Quicklisp&lt;/a&gt;, the fabulous equivalent of Python&apos;s PyPI, it has been a breeze to set this up. &lt;em&gt;cl-openstack-client&lt;/em&gt; just includes a &lt;a href=&quot;https://github.com/stackforge/cl-openstack-client/blob/master/run-tests.sh&quot;&gt;basic shell script&lt;/a&gt; that does the following:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Download quicklisp.lisp&lt;/li&gt;
&lt;li&gt;Run a &lt;a href=&quot;https://github.com/stackforge/cl-openstack-client/blob/master/update-deps.lisp&quot;&gt;Lisp program to install the dependencies using Quicklisp&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Run a &lt;a href=&quot;https://github.com/stackforge/cl-openstack-client/blob/master/run-tests.lisp&quot;&gt;Lisp program running the test suite&lt;/a&gt; using &lt;a href=&quot;http://common-lisp.net/project/fiveam/&quot;&gt;FiveAM&lt;/a&gt;, which exits with 0 or 1 based on the test results.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;I just run the tests using &lt;a href=&quot;http://www.sbcl.org&quot;&gt;SBCL&lt;/a&gt;, though supporting more compilers would be a really good plan for the future, and should be straightforward. You can &lt;a href=&quot;https://jenkins.openstack.org/job/gate-cl-openstack-client-run-tests/4/console&quot;&gt;admire a log from a successful test run&lt;/a&gt;, done when I proposed a patch via Gerrit, to see what it looks like.&lt;/p&gt;
&lt;h2&gt;Implementation status&lt;/h2&gt;
&lt;p&gt;For the curious, here&apos;s an example of how it works:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;* (require &apos;cl-openstack-client)
* (use-package &apos;cl-keystone-client)
* (defvar k (make-instance &apos;connection-v2 :username &quot;demo&quot; :password &quot;somepassword&quot; :tenant-name &quot;demo&quot; :url &quot;http://devstack:5000&quot;))

K

* (authenticate k)

((:ISSUED--AT . &quot;2013-07-04T05:59:55.454226&quot;)
 (:EXPIRES . &quot;2013-07-05T05:59:55Z&quot;)
 (:ID
  . &quot;wNFQwNzo1OTo1NS40NTQyMthisisaverylongtokenwNFQwNzo1OTo1NS40NTQyM&quot;)
 (:TENANT (:DESCRIPTION) (:ENABLED . T)
  (:ID . &quot;1774fd545df4400380eb2b4f4985b3be&quot;) (:NAME . &quot;demo&quot;)))

* (connection-token-id k)

&quot;wNFQwNzo1OTo1NS40NTQyMthisisaverylongtokenwNFQwNzo1OTo1NS40NTQyM&quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Unfortunately, the implementation is far from complete. For now, it only implements Keystone token retrieval.&lt;/p&gt;
&lt;p&gt;I actually started this project to provide a working starting point. With it, future potential contributors will be able to spend their efforts on writing code, not on setting up the basic continuous integration system or module infrastructure.&lt;/p&gt;
&lt;p&gt;If you wish to help me and contribute, just follow the &lt;a href=&quot;https://wiki.openstack.org/wiki/GerritWorkflow&quot;&gt;OpenStack Gerrit workflow howto&lt;/a&gt;, or feel free to reach out to me with any questions (I&apos;m hanging out on #lisp on Freenode too).&lt;/p&gt;
&lt;p&gt;See you soon, hoping to bring more Lisp into OpenStack!&lt;/p&gt;
</content:encoded></item><item><title>OpenStack Ceilometer Havana-1 milestone released</title><link>https://julien.danjou.info/blog/openstack-ceilometer-havana-1-milestone-released/</link><guid isPermaLink="true">https://julien.danjou.info/blog/openstack-ceilometer-havana-1-milestone-released/</guid><description>Yesterday, the first milestone of the Havana development branch of Ceilometer has been released and is now available for testing and download. This means the first quarter of the OpenStack Havana deve</description><pubDate>Fri, 31 May 2013 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Yesterday, the first milestone of the Havana development branch of Ceilometer has been released and is now available for testing and download. This means the first quarter of the OpenStack &lt;em&gt;Havana&lt;/em&gt; development has passed!&lt;/p&gt;
&lt;h2&gt;New features&lt;/h2&gt;
&lt;p&gt;Ten blueprints have been implemented, as you can see on the &lt;a href=&quot;https://launchpad.net/ceilometer/+milestone/havana-1&quot;&gt;release page&lt;/a&gt;. I&apos;m going to talk through the ones that are the most interesting for users.&lt;/p&gt;
&lt;p&gt;Ceilometer can now &lt;a href=&quot;https://blueprints.launchpad.net/ceilometer/+spec/scheduler-counter&quot;&gt;count the scheduling attempts&lt;/a&gt; of instances done by &lt;em&gt;nova-scheduler&lt;/em&gt;. This can be useful to eventually bill such information, or for audit (implemented by me for eNovance).&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/hbase.png&quot; alt=&quot;hbase&quot; /&gt;&lt;/p&gt;
&lt;p&gt;People using the &lt;a href=&quot;http://hbase.apache.org/&quot;&gt;HBase&lt;/a&gt; backend can now filter requests on any of the counter fields, something we call &lt;a href=&quot;https://blueprints.launchpad.net/ceilometer/+spec/hbase-metadata-query&quot;&gt;metadata queries&lt;/a&gt;, which was missing for this backend driver. Thanks to Shengjie Min (Dell) for the implementation.&lt;/p&gt;
&lt;p&gt;Counters can now be &lt;a href=&quot;https://blueprints.launchpad.net/ceilometer/+spec/udp-publishing&quot;&gt;sent over UDP&lt;/a&gt; instead of the Oslo RPC mechanism (AMQP-based by default). This allows counter transmission to be done in a much faster, though less reliable, way. The primary use case is not audit or billing, but the alarming features that we are working on (implemented by me for eNovance).&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/siren.png&quot; alt=&quot;siren&quot; /&gt;&lt;/p&gt;
&lt;p&gt;The &lt;a href=&quot;https://blueprints.launchpad.net/ceilometer/+spec/alarm-api&quot;&gt;initial alarm API&lt;/a&gt; has been designed and implemented, thanks to Mehdi Abaakouk (eNovance) and Angus Salkeld (Red Hat) who tackled this. We&apos;re now able to do &lt;em&gt;CRUD&lt;/em&gt; actions on these.&lt;/p&gt;
&lt;p&gt;Posting meters via the HTTP API is now possible. This is another conduit that can be used to publish and collect meters. Thanks to Angus Salkeld (Red Hat) for implementing this.&lt;/p&gt;
&lt;p&gt;I&apos;ve been working on a somewhat experimental &lt;a href=&quot;https://blueprints.launchpad.net/ceilometer/+spec/oslo-multi-publisher&quot;&gt;notifier driver for Oslo&lt;/a&gt; notifications that publishes Ceilometer counters instead of the standard notifications, using the Ceilometer pipeline setup.&lt;/p&gt;
&lt;p&gt;Sandy Walsh (Rackspace) has put in place the base needed to &lt;a href=&quot;https://blueprints.launchpad.net/ceilometer/+spec/add-event-table&quot;&gt;store raw notifications (events)&lt;/a&gt;, with the final goal of bringing more functionality around these into Ceilometer.&lt;/p&gt;
&lt;p&gt;Obviously, none of these blueprints and bug fixes would have been implemented without the sharp eyes of our entire team, reviewing code and tirelessly advising the developers. Thanks to them!&lt;/p&gt;
&lt;h2&gt;Bug fixes&lt;/h2&gt;
&lt;p&gt;Thirty-one bugs were fixed, though most of them might not interest you so I won&apos;t elaborate too much on that. Go read &lt;a href=&quot;https://launchpad.net/ceilometer/+milestone/havana-1&quot;&gt;the list&lt;/a&gt; if you are curious.&lt;/p&gt;
&lt;h2&gt;Toward Havana 2&lt;/h2&gt;
&lt;p&gt;We now have 21 blueprints targeting &lt;a href=&quot;https://launchpad.net/ceilometer/+milestone/havana-2&quot;&gt;Ceilometer&apos;s second Havana milestone&lt;/a&gt;, some of which are already started. I&apos;ll try to make sure we get there without too much trouble for the 18th July 2013. Stay tuned!&lt;/p&gt;
</content:encoded></item><item><title>Rant about Github pull-request workflow implementation</title><link>https://julien.danjou.info/blog/rant-about-github-pull-request-workflow-implementation/</link><guid isPermaLink="true">https://julien.danjou.info/blog/rant-about-github-pull-request-workflow-implementation/</guid><description>One of my recent innocent tweet about Gerrit vs Github triggered much more reponses and debate that I expected it to. I realize that it might be worth explaining a bit what I meant, in a text longer t</description><pubDate>Fri, 10 May 2013 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;One of my recent innocent tweet about &lt;em&gt;Gerrit vs Github&lt;/em&gt; triggered much more reponses and debate that I expected it to. I realize that it might be worth explaining a bit what I meant, in a text longer than 140 characters.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;I&apos;m having a hard time now contributing to projects not using Gerrit. Github isn&apos;t that good.&lt;/p&gt;
&lt;p&gt;— Julien Danjou (@juldanjou) &lt;a href=&quot;https://twitter.com/juldanjou/status/332076595521146881&quot;&gt;May 8, 2013&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2&gt;The problems with Github pull-requests&lt;/h2&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/github-1.svg&quot; alt=&quot;github-1&quot; /&gt;&lt;/p&gt;
&lt;p&gt;I&apos;ve always looked at Github from a distance, mainly because I always disliked their pull-request handling and saw no value in the social hype it brings. Why?&lt;/p&gt;
&lt;h3&gt;One click away isn&apos;t one click effort&lt;/h3&gt;
&lt;p&gt;The pull-request system looks like an incredibly easy way to contribute to any project hosted on Github. You&apos;re one click away from sending your contribution to any piece of software. But the problem is that any worthy contribution isn&apos;t the effort of a single click.&lt;/p&gt;
&lt;p&gt;Any proper and useful contribution to a piece of software is never done right the first time. There&apos;s a dance you will have to play. A slowly-rhythmed back and forth between you and the software maintainer or team. You&apos;ll have to dance it until your contribution is correct and can be merged.&lt;/p&gt;
&lt;p&gt;But as a software maintainer, not everybody is going to follow you in this choreography, and you&apos;ll end up with pull requests that will never get finished unless you wrap things up yourself. So in most cases, the gain from a pull request isn&apos;t really bigger than that from a good bug report.&lt;/p&gt;
&lt;p&gt;This is where the social argument for Github falls apart. As soon as you&apos;re talking about projects bigger than a color theme for your favorite text editor, this feature is overrated.&lt;/p&gt;
&lt;h3&gt;Contribution rework&lt;/h3&gt;
&lt;p&gt;If you&apos;re lucky enough, your contributor will play along and follow you on this pull-request review process. You&apos;ll make suggestions, he will listen and will modify his pull-request to follow your advice.&lt;/p&gt;
&lt;p&gt;At this point, there are two techniques he can use to please you.&lt;/p&gt;
&lt;h4&gt;Technique #1: the Topping&lt;/h4&gt;
&lt;p&gt;Github&apos;s pull requests invite you to send an entire branch, eclipsing the fact that it is composed of several commits. The problem is that a lot of one-click-away contributors have not mastered Git and/or do not make the effort to build a logical patchset, and nothing warns them that their branch history is wrong. So they tend to change stuff around, commit, make a mistake, commit, fix this mistake, commit, etc. Such a branch captures the whole construction process inside your contributor&apos;s brain, and is a real pain to review. To the point that I quite often give up.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/github-pull-request-iterative.png&quot; alt=&quot;github-pull-request-iterative&quot; /&gt;&lt;/p&gt;
&lt;p&gt;Without Github, the old method that all software projects used, and that many still use (e.g. Linux), is to send a patch set over e-mail (or any other medium, like Gerrit). This method has one positive effect: it forces the contributor to acknowledge the list of commits he is going to publish. So if the contributor has fixup commits in his history, they are going to be seen as first-class citizens. And nobody is going to want to see that, neither your contributor, nor the software maintainers. Therefore, such a system tends to push contributors to write atomic, logical and self-contained patchsets that can be reviewed more easily.&lt;/p&gt;
&lt;h4&gt;Technique #2: the History Rewriter&lt;/h4&gt;
&lt;p&gt;This is actually the good way to build a working and logical patchset using Git. Rewriting history and amending problematic patches using the famous &lt;code&gt;git rebase --interactive&lt;/code&gt; trick.&lt;/p&gt;
&lt;p&gt;The problem is that if your contributor does this and then re-pushes the branch composing your pull request to Github, you will both lose the previous review, each time. There&apos;s no history of the different versions of the branch that have been pushed.&lt;/p&gt;
&lt;p&gt;In the old alternative system like e-mail, no information is lost when reworked patches are resent, obviously. This is far better because it eases the following of the iterative discussions that the patch triggered.&lt;/p&gt;
&lt;p&gt;Of course, it would be possible for Github to enhance this and fix it, but currently it doesn&apos;t handle this use case correctly.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/hylang-pull-request-157.png&quot; alt=&quot;Exercise for the doubtful readers: good luck finding all revisions of my patch in the pull-request #157 of Hy.&quot; /&gt;&lt;/p&gt;
&lt;h2&gt;A quick look at OpenStack workflow&lt;/h2&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/openstack-5.png&quot; alt=&quot;openstack-5&quot; /&gt;&lt;/p&gt;
&lt;p&gt;It&apos;s not a secret for anyone that I&apos;ve been contributing to OpenStack as a daily routine for the last 18 months. The more I contribute, the more I like the contribution workflow and process. It&apos;s already &lt;a href=&quot;https://wiki.openstack.org/wiki/Gerrit_Workflow&quot;&gt;described at length on the wiki&lt;/a&gt;, so I&apos;ll just summarize my view and what I like about it.&lt;/p&gt;
&lt;h3&gt;Gerrit&lt;/h3&gt;
&lt;p&gt;To send a contribution to any OpenStack project, you need to go through Gerrit. This is actually way simpler than opening a pull-request on Github: all you have to do is make your commit(s) and type &lt;a href=&quot;https://pypi.python.org/pypi/git-review&quot;&gt;&lt;code&gt;git review&lt;/code&gt;&lt;/a&gt;. That&apos;s it. Your patch is pushed to Gerrit and available for review.&lt;/p&gt;
&lt;p&gt;Gerrit allows other developers to review your patch, add comments anywhere on it, and score your patch up or down. You can define any rule you want for the score needed before a patch can be merged; OpenStack requires a positive score from two core developers.&lt;/p&gt;
&lt;p&gt;Until a patch is validated, it can be reworked and amended locally using Git, and then resent using &lt;code&gt;git review&lt;/code&gt; again. It&apos;s that simple. The history and the different versions of each patch remain available, along with all the comments. Gerrit doesn&apos;t lose any historical information about your workflow.&lt;/p&gt;
&lt;p&gt;Finally, you&apos;ll notice that this is actually the same kind of workflow used by projects that exchange patches over e-mail. Gerrit simply provides a single place to gather and keep track of patch sets, which is really handy. It&apos;s also much easier to send patches with a command-line tool than with a MUA or &lt;em&gt;git send-email&lt;/em&gt;.&lt;/p&gt;
&lt;h3&gt;Gate testing&lt;/h3&gt;
&lt;p&gt;Testing is mandatory for any patch sent to OpenStack. Unit and functional tests are run for &lt;em&gt;each version of each patch of the patchset&lt;/em&gt; sent. And until your patch passes all tests, it is &lt;em&gt;impossible&lt;/em&gt; to merge it.&lt;/p&gt;
&lt;p&gt;Yes, this implies that every patch in a patchset must be a working commit that can be merged on its own, without the entire patchset going in! With such a restriction, it&apos;s impossible for &quot;fixup commits&quot; to be merged into your project and pollute its history and testability.&lt;/p&gt;
&lt;p&gt;Once your patch is validated by core developers, the system checks that there are no merge conflicts. If there are none, the tests are re-run, since the branch you are pushing to might have changed, and if everything&apos;s fine, the patch is merged.&lt;/p&gt;
&lt;p&gt;This is an incredible force for the quality of the project. It implies that no broken patchset can ever sneak in, and that the project always passes all its tests.&lt;/p&gt;
&lt;h2&gt;Conclusion: accessibility vs code review&lt;/h2&gt;
&lt;p&gt;In the end, I think that one of the keys to any development process, code review, is not well covered by Github&apos;s pull-request system. Along with history integrity, it is damaged by the goal of making contributions easier.&lt;/p&gt;
&lt;p&gt;Choosing between these features is a trade-off that each project should make carefully, considering its core goals and the code quality it wants to reach.&lt;/p&gt;
&lt;p&gt;I tend to think that OpenStack struck one of the best trade-offs available by using Gerrit and plugging testing automation into it via Jenkins, and I would recommend that setup to any project that takes code review and testing seriously.&lt;/p&gt;
</content:encoded></item><item><title>OpenStack Design Summit Havana, from a Ceilometer point of view</title><link>https://julien.danjou.info/blog/openstack-summit-havana-ceilometer/</link><guid isPermaLink="true">https://julien.danjou.info/blog/openstack-summit-havana-ceilometer/</guid><description>Last week was the OpenStack Design Summit in Portland, OR where we, developers, discussed and designed the new OpenStack release (Havana) coming up.</description><pubDate>Thu, 25 Apr 2013 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Last week was the &lt;a href=&quot;https://www.openstack.org/summit/portland-2013/&quot;&gt;OpenStack Design Summit&lt;/a&gt; in Portland, OR where we, developers, discussed and designed the new OpenStack release (Havana) coming up.&lt;/p&gt;
&lt;p&gt;The summit was wonderful. It was my first OpenStack design summit -- even more so as a PTL -- and bumping into the various people I had so far only worked with online was a real pleasure!&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/ods_havana_ceilometer_nijaba_jd_talk.jpg&quot; alt=&quot;ods_havana_ceilometer_nijaba_jd_talk&quot; /&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;http://nicolas.barcet.com/&quot;&gt;Nick Barcet&lt;/a&gt; from &lt;a href=&quot;http://www.enovance.com&quot;&gt;eNovance&lt;/a&gt;, our dear previous Ceilometer PTL, and I talked about Ceilometer and presented the work that was done for Grizzly, with some previews of what we&apos;d like to see done for the Havana release.&lt;/p&gt;
&lt;h2&gt;Design sessions&lt;/h2&gt;
&lt;p&gt;Ceilometer had its design sessions during the last days of the summit. We took a lot of notes and comments during the sessions in our &lt;a href=&quot;https://wiki.openstack.org/wiki/Summit/Havana/Etherpads#Ceilometer&quot;&gt;Etherpad instances&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The first session was a description of Ceilometer&apos;s core architecture for interested people, and was a wonderful success considering that the room was packed. Our &lt;a href=&quot;http://doughellmann.com/&quot;&gt;Doug Hellmann&lt;/a&gt; did a wonderful job introducing people to Ceilometer and answering questions.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/ods_havana_ceilometer_dhellmann.jpg&quot; alt=&quot;ods_havana_ceilometer_dhellmann&quot; /&gt;&lt;/p&gt;
&lt;p&gt;The next session was about getting feedback from our users. We were quite surprised to discover wonderful real use cases and deployments, like CERN using Ceilometer and generating 2 GB of data per day!&lt;/p&gt;
&lt;p&gt;The following sessions ran on Thursday and were much more focused on discussing new features. A lot of already existing blueprints were discussed and quickly validated during the first morning session. Then, &lt;a href=&quot;http://www.sandywalsh.com/&quot;&gt;Sandy Walsh&lt;/a&gt; introduced the architecture they use inside &lt;a href=&quot;https://github.com/rackerlabs/stacktach&quot;&gt;StackTach&lt;/a&gt;, so we can start thinking about bringing some of it into Ceilometer.&lt;/p&gt;
&lt;p&gt;API improvements were discussed without surprises and with a good consensus on what needs to be done. The four following sessions, which occupied most of the day, were related to alarming. All were led by Eoghan Glynn, from &lt;a href=&quot;http://redhat.com&quot;&gt;Red Hat&lt;/a&gt;, who did an amazing job presenting the possible architectures with their pros and cons. Actually, all we had to do was nod at his designs and acknowledge the plan on how to build this.&lt;/p&gt;
&lt;p&gt;The last two sessions discussed advanced models for billing, where we got some interesting feedback from Daniel Dyer from HP, and then held a quick follow-up to the StackTach presentation from the morning session.&lt;/p&gt;
&lt;h2&gt;Havana roadmap&lt;/h2&gt;
&lt;p&gt;The &lt;a href=&quot;https://blueprints.launchpad.net/ceilometer/havana&quot;&gt;list of blueprints targeting Havana is available&lt;/a&gt; and should be finalized by next week. If you want to propose blueprints, you&apos;re free to do so; just let us know so we can validate them. The same applies if you wish to implement one of them!&lt;/p&gt;
&lt;h3&gt;API extension&lt;/h3&gt;
&lt;p&gt;I do think the API version 2 is going to be heavily extended during this release cycle. We need more features, like the &lt;a href=&quot;https://blueprints.launchpad.net/ceilometer/+spec/api-group-by&quot;&gt;group-by&lt;/a&gt; functionality.&lt;/p&gt;
&lt;h3&gt;Healthnmon&lt;/h3&gt;
&lt;p&gt;In parallel with the design sessions, discussions took place in the unconference room with the Healthnmon developers to figure out a plan to merge some of their efforts into Ceilometer. They should provide a component to help Ceilometer support more hypervisors than it currently does.&lt;/p&gt;
&lt;h3&gt;Alarming&lt;/h3&gt;
&lt;p&gt;Alarming is definitely going to be the next big project for Ceilometer. Today, Eoghan and I started building blueprints on alarming, &lt;a href=&quot;https://blueprints.launchpad.net/ceilometer/+spec/alarming&quot;&gt;centralised in a general blueprint&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;We know this is going to happen for real and very soon, thanks to the commitment of &lt;a href=&quot;http://enovance.com&quot;&gt;eNovance&lt;/a&gt; and &lt;a href=&quot;http://redhat.com&quot;&gt;Red Hat&lt;/a&gt;, who are dedicating resources to this amazing project!&lt;/p&gt;
</content:encoded></item><item><title>Hy, Lisp in Python</title><link>https://julien.danjou.info/blog/lisp-python-hy/</link><guid isPermaLink="true">https://julien.danjou.info/blog/lisp-python-hy/</guid><description>I&apos;d been meaning to look at Hy since Paul Tagliamonte started talking to me about it, but never got around to it until now.</description><pubDate>Wed, 03 Apr 2013 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;I&apos;d been meaning to look at &lt;a href=&quot;http://github.com/paultag/hy&quot;&gt;Hy&lt;/a&gt; since &lt;a href=&quot;http://blog.pault.ag/&quot;&gt;Paul Tagliamonte&lt;/a&gt; started talking to me about it, but never got around to it until now. Yesterday, Paul indicated it was a good time for me to start looking at it, so I spent a few hours playing with it.&lt;/p&gt;
&lt;h2&gt;But what&apos;s Hy?&lt;/h2&gt;
&lt;p&gt;Python is very nice: it has a great community and a wide range of useful libraries. But let&apos;s face it: it&apos;s missing a great language.&lt;/p&gt;
&lt;p&gt;Hy is an implementation of a &lt;a href=&quot;http://en.wikipedia.org/wiki/Lisp_(programming_language)&quot;&gt;Lisp&lt;/a&gt; on top of Python.&lt;/p&gt;
&lt;p&gt;Technically, Hy is built directly on a custom-made parser (for now) which translates expressions using the &lt;a href=&quot;http://docs.python.org/2/library/ast.html&quot;&gt;Python AST&lt;/a&gt; module to generate code, which is then run by Python. Therefore, it shares the same properties as Python, and is a Lisp-1 (i.e. with a single namespace for symbols and functions).&lt;/p&gt;
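&lt;p&gt;To make that translation step concrete, here is a small hand-rolled sketch (not Hy&apos;s actual compiler code) that builds, by hand, the AST a Lisp form like &lt;code&gt;(+ 1 (* 2 3))&lt;/code&gt; could compile down to, then compiles and evaluates it with Python:&lt;/p&gt;

```python
import ast

# Hand-built AST for the expression 1 + (2 * 3), roughly what a
# Lisp form like (+ 1 (* 2 3)) would be translated into.
tree = ast.Expression(
    body=ast.BinOp(
        left=ast.Constant(1),
        op=ast.Add(),
        right=ast.BinOp(
            left=ast.Constant(2),
            op=ast.Mult(),
            right=ast.Constant(3),
        ),
    ),
)
# Fill in the line/column information the compiler requires.
ast.fix_missing_locations(tree)
result = eval(compile(tree, 'hy-form', 'eval'))
print(result)  # 7
```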
&lt;p&gt;If you&apos;re interested in hearing Paul talk about Hy at the last PyCon US, I recommend watching his lightning talk. As the name implies, it&apos;s only a few minutes long.&lt;/p&gt;
&lt;h2&gt;Does it work?&lt;/h2&gt;
&lt;p&gt;I cloned the code and played around a bit with Hy. To my great surprise and pleasure, it works quite well. You can easily imagine writing Python with it. Part of the syntax resembles &lt;a href=&quot;http://clojure.org&quot;&gt;Clojure&lt;/a&gt;&apos;s, which looks like a good thing since they&apos;re playing in the same area.&lt;/p&gt;
&lt;p&gt;You can try a &lt;a href=&quot;http://hy.pault.ag/&quot;&gt;Hy REPL&lt;/a&gt; in your Web browser if you want.&lt;/p&gt;
&lt;p&gt;Here&apos;s what some code looks like:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;(import requests)

(setv req (requests.get &quot;http://hy.pault.ag&quot;))
(if (= req.status_code 200)
  (for (kv (.iteritems req.headers))
    (print kv))
  (throw (Exception &quot;Wrong status code&quot;)))
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This code would output:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;(&apos;date&apos;, &apos;Wed, 03 Apr 2013 12:09:23 GMT&apos;)
(&apos;connection&apos;, &apos;keep-alive&apos;)
(&apos;content-encoding&apos;, &apos;gzip&apos;)
(&apos;transfer-encoding&apos;, &apos;chunked&apos;)
(&apos;content-type&apos;, &apos;text/html; charset=utf-8&apos;)
(&apos;server&apos;, &apos;nginx/1.2.6&apos;)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;As you can see, it&apos;s really simple to write Lispy code that uses Python idioms.&lt;/p&gt;
&lt;p&gt;There are obviously still a lot of missing features in Hy. The language is far from complete and many parts are moving, but it&apos;s really promising, and Paul&apos;s doing a great job implementing every idea.&lt;/p&gt;
&lt;p&gt;I actually started to hack a bit on Hy, and will try to continue doing so, since I&apos;m really eager to learn a bit more about both Lisp and Python internals in the process. I&apos;ve already sent a few patches for small bugs I encountered, and proposed a few ideas. It&apos;s really exciting to be able to influence, this early, the design of a language I&apos;ll love to use! Being a recent fan of Common Lisp, I tend to grab the good stuff from it and add it to Hy.&lt;/p&gt;
</content:encoded></item><item><title>Announcing Climate, the OpenStack capacity leasing project</title><link>https://julien.danjou.info/blog/openstack-climate-capacity-leasing/</link><guid isPermaLink="true">https://julien.danjou.info/blog/openstack-climate-capacity-leasing/</guid><description>While working on the XLcloud project (HPC on cloud) it became clear to us that OpenStack was missing a critical component for resource reservation.</description><pubDate>Mon, 25 Mar 2013 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;While working on the &lt;a href=&quot;http://xlcloud.org/bin/view/Main/&quot;&gt;XLcloud project&lt;/a&gt; (HPC on cloud) it became clear to us that OpenStack was missing a critical component for resource reservation.&lt;/p&gt;
&lt;p&gt;A capacity leasing service is something service providers really need, especially in the context of cloud platforms dedicated to HPC-style workloads. Instead of building something really specific, the decision was made to build a new standalone OpenStack component aiming to provide this kind of functionality to OpenStack. In the spirit of other OpenStack components, it will be extensible to fulfill a wide range of needs around this problem.&lt;/p&gt;
&lt;p&gt;The project is named &lt;a href=&quot;http://launchpad.net/climate&quot;&gt;Climate&lt;/a&gt;, and is hosted on &lt;a href=&quot;http://ci.openstack.org/stackforge.html&quot;&gt;StackForge&lt;/a&gt;. It will follow the standard OpenStack development model. This service will be able to handle a calendar of reservations for various resources, based on various criteria.&lt;/p&gt;
&lt;p&gt;The project is still at its early design stage, and we plan to have an unconference session during &lt;a href=&quot;http://www.openstack.org/summit/portland-2013/&quot;&gt;the next OpenStack summit in Portland&lt;/a&gt; to present our plans and ideas for the future!&lt;/p&gt;
</content:encoded></item><item><title>Ceilometer bug squash day #2</title><link>https://julien.danjou.info/blog/ceilometer-bug-squash-day-2/</link><guid isPermaLink="true">https://julien.danjou.info/blog/ceilometer-bug-squash-day-2/</guid><description>The Ceilometer team is pleased to announce that tomorrow Tuesday 5th March 2013 will be the second bug squash day for Ceilometer.</description><pubDate>Mon, 04 Mar 2013 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;The Ceilometer team is pleased &lt;a href=&quot;http://lists.openstack.org/pipermail/openstack-dev/2013-March/006188.html&quot;&gt;to announce&lt;/a&gt; that tomorrow &lt;a href=&quot;http://wiki.openstack.org/Ceilometer/BugSquashingDay/20130304&quot;&gt;Tuesday 5th March 2013 will be the second bug squash day for Ceilometer&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;We wrote an extensive page about &lt;a href=&quot;http://wiki.openstack.org/Ceilometer/Contributing&quot;&gt;how you can contribute to Ceilometer&lt;/a&gt;, from updating the documentation to fixing bugs. There&apos;s a lot you can do. We have good support for Ceilometer built into &lt;a href=&quot;http://devstack.org&quot;&gt;Devstack&lt;/a&gt;, so installing a development platform is really easy.&lt;/p&gt;
&lt;p&gt;The main goal for this bug day will be to put Ceilometer in the best possible shape before the &lt;em&gt;grizzly-rc1&lt;/em&gt; release arrives (14th March 2013). This version of Ceilometer &lt;em&gt;should&lt;/em&gt; be the last one before the final Grizzly release, so it&apos;s a pretty important one.&lt;/p&gt;
&lt;p&gt;We&apos;ll be hanging out on the &lt;em&gt;#openstack-metering&lt;/em&gt; IRC channel on &lt;a href=&quot;http://freenode.net&quot;&gt;Freenode&lt;/a&gt;, as usual, so feel free to come by and join us!&lt;/p&gt;
</content:encoded></item><item><title>OpenStack Ceilometer and Heat projects graduated</title><link>https://julien.danjou.info/blog/openstack-ceilometer-heat-graduated/</link><guid isPermaLink="true">https://julien.danjou.info/blog/openstack-ceilometer-heat-graduated/</guid><description>The OpenStack Technical Committee has voted these last weeks about graduation of Heat and Ceilometer, to change their status from incubation to integrated.</description><pubDate>Wed, 27 Feb 2013 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/openstack-tech-committee.jpg&quot; alt=&quot;openstack-tech-committee&quot; /&gt;&lt;/p&gt;
&lt;p&gt;The &lt;a href=&quot;http://www.openstack.org/foundation/technical-committee/&quot;&gt;OpenStack Technical Committee&lt;/a&gt; has voted these last weeks about graduation of &lt;a href=&quot;https://launchpad.net/heat&quot;&gt;Heat&lt;/a&gt; and &lt;a href=&quot;http://launchpad.net/ceilometer&quot;&gt;Ceilometer&lt;/a&gt;, to change their status from &lt;strong&gt;incubation&lt;/strong&gt; to &lt;strong&gt;integrated&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;The details of the discussion can be found in the &lt;a href=&quot;http://eavesdrop.openstack.org/meetings/tc/2013/&quot;&gt;TC IRC meetings logs&lt;/a&gt; for the brave. The results are:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Approve graduation of Heat (to be integrated in common Havana release)? yes: 10, abstain: 1, no: 1&lt;/li&gt;
&lt;li&gt;Approve graduation of Ceilometer (to be integrated in common Havana release)? yes: 11, abstain: 1&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Therefore both projects have been graduated from &lt;em&gt;Incubation&lt;/em&gt; to &lt;em&gt;Integrated&lt;/em&gt; status. That means that Heat and Ceilometer will be released as part of OpenStack for the next release cycle, &lt;em&gt;Havana&lt;/em&gt;, due in Autumn 2013.&lt;/p&gt;
&lt;p&gt;For the curious, the Ceilometer team put up a &lt;a href=&quot;https://wiki.openstack.org/wiki/Ceilometer/Graduation&quot;&gt;nice wiki page about our status and why we thought we were ready to graduate&lt;/a&gt;. The &lt;a href=&quot;https://wiki.openstack.org/wiki/Governance/Foundation/TechnicalCommittee&quot;&gt;OpenStack Technical Committee charter&lt;/a&gt; also has some explanations about the incubation and integration process.&lt;/p&gt;
&lt;h2&gt;What about Grizzly?&lt;/h2&gt;
&lt;p&gt;Both projects will be released with Grizzly too, obviously, since they already follow the release process of OpenStack.&lt;/p&gt;
&lt;h2&gt;What about core?&lt;/h2&gt;
&lt;p&gt;The question I&apos;ve been asked several times is whether this means the projects are becoming &lt;em&gt;Core&lt;/em&gt; projects. The answer is no, because how a project becomes a &lt;em&gt;Core&lt;/em&gt; project is still under discussion and is more a matter for the &lt;em&gt;Board of Directors&lt;/em&gt; than for the &lt;em&gt;Technical Committee&lt;/em&gt;. But this is definitely a step in that direction.&lt;/p&gt;
&lt;p&gt;Anyway, from a technical point of view, this means both projects are now on board with the other OpenStack components, so you can enjoy them!&lt;/p&gt;
</content:encoded></item><item><title>Cloud tools for Debian</title><link>https://julien.danjou.info/blog/cloud-init-utils-debian/</link><guid isPermaLink="true">https://julien.danjou.info/blog/cloud-init-utils-debian/</guid><description>Recently, I&apos;ve worked on the cloud utilities that are provided as standard in Ubuntu, and I ported them to Debian. Let&apos;s see how that brings Debian to the cloud!  Basics of a cloud image When starting</description><pubDate>Wed, 13 Feb 2013 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Recently, I&apos;ve worked on the cloud utilities that are provided as standard in Ubuntu, and I ported them to Debian. Let&apos;s see how that brings Debian to the cloud!&lt;/p&gt;
&lt;h2&gt;Basics of a cloud image&lt;/h2&gt;
&lt;p&gt;When starting an instance on an IaaS platform, your instance image is raw and unconfigured. Therefore, you need a way to configure it automagically at boot time, based on what you want to do with it. For this, IaaS platforms usually provide a metadata server, like &lt;a href=&quot;http://aws.amazon.com/ec2&quot;&gt;Amazon EC2&lt;/a&gt; does. It&apos;s a special HTTP server listening on a well-known, hard-coded IP address that your instance can query to learn basic information about itself, like its hostname, and to retrieve basic user metadata to auto-configure itself. You can check the &lt;a href=&quot;http://docs.openstack.org/trunk/openstack-compute/admin/content/metadata-service.html&quot;&gt;documentation about the OpenStack metadata service&lt;/a&gt; for more information.&lt;/p&gt;
&lt;p&gt;Also, images have a predefined size at upload time. When you run one on a platform, the disk size you request is usually bigger than the size of your image disk: you may need to resize and grow your image to use the full disk space that is allocated to your instance.&lt;/p&gt;
&lt;h2&gt;Needed tools&lt;/h2&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/debian-cloud-1.jpg&quot; alt=&quot;debian-cloud-1&quot; /&gt;&lt;/p&gt;
&lt;p&gt;To run a cloud platform, and especially &lt;a href=&quot;http://aws.amazon.com/ec2&quot;&gt;Amazon EC2&lt;/a&gt; or &lt;a href=&quot;http://openstack.org&quot;&gt;OpenStack&lt;/a&gt;, you need to configure and update your image based on the context you&apos;re started in. This also includes extending your template image disk to use the full available disk size provided to the running instance.&lt;/p&gt;
&lt;p&gt;Ubuntu provides a set of cloud utilities, which is actually composed of different source packages (&lt;em&gt;cloud-init&lt;/em&gt;, &lt;em&gt;cloud-utils&lt;/em&gt; and &lt;em&gt;cloud-initramfs-tools&lt;/em&gt;).&lt;/p&gt;
&lt;p&gt;Combined, these 3 packages allow you to run a number of steps, from disk resizing at boot time to Puppet configuration handling.&lt;/p&gt;
&lt;p&gt;So &lt;em&gt;Ubuntu&lt;/em&gt; got this working right a long time ago, but unfortunately, Debian was really late on that.&lt;/p&gt;
&lt;p&gt;Until now.&lt;/p&gt;
&lt;p&gt;I&apos;ve worked on getting these into Debian, and you can now find these 3 packages adapted and uploaded to Debian sid.&lt;/p&gt;
&lt;p&gt;All you need to do, is to build a Debian image and then run:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;apt-get install cloud-init cloud-tools cloud-initiramfs-growroot
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;And voilà: at the next reboot, your instance will extend its root partition size to the full available disk size, and ask the metadata server to configure things like its hostname.&lt;/p&gt;
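&lt;p&gt;What actually gets configured at boot is driven by the user data you pass to the instance. A minimal, illustrative &lt;em&gt;#cloud-config&lt;/em&gt; sketch (the keys are standard cloud-init modules; the values are examples only):&lt;/p&gt;

```yaml
#cloud-config
# Illustrative user data read by cloud-init at first boot.
hostname: debian-cloud-test
# Grow the root partition to fill the available disk.
growpart:
  mode: auto
  devices: ['/']
# Install extra packages on first boot.
packages:
  - htop
```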
&lt;p&gt;The packages sources are available on Debian&apos;s git server for &lt;a href=&quot;http://anonscm.debian.org/gitweb/?p=collab-maint/cloud-utils.git;a=summary&quot;&gt;cloud-utils&lt;/a&gt;&lt;br /&gt;
and &lt;a href=&quot;http://anonscm.debian.org/gitweb/?p=collab-maint/cloud-initramfs-tools.git;a=summary&quot;&gt;cloud-initramfs-tools&lt;/a&gt; and you can build them yourself until the packages are processed by ftp-master and get out of the &lt;a href=&quot;http://ftp-master.debian.org/new.html&quot;&gt;NEW queue&lt;/a&gt;. cloud-init on the other hand is directly &lt;a href=&quot;http://packages.debian.org/search?keywords=cloud-init&quot;&gt;available in sid&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;One of the next steps would probably be to build or enhance a tool like &lt;a href=&quot;https://launchpad.net/vmbuilder&quot;&gt;vmbuilder&lt;/a&gt; to be able to build cloud-compatible Debian images with a simple command line.&lt;/p&gt;
</content:encoded></item><item><title>Extending Swift with middleware: example with ClamAV</title><link>https://julien.danjou.info/blog/extending-swift-with-a-middleware-clamav/</link><guid isPermaLink="true">https://julien.danjou.info/blog/extending-swift-with-a-middleware-clamav/</guid><description>In this article, I&apos;m going to explain how you can extend Swift, the OpenStack Object Storage project, so it performs extra actions on files at upload or download time.</description><pubDate>Tue, 22 Jan 2013 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;In this article, I&apos;m going to explain how you can extend &lt;a href=&quot;http://launchpad.net/swift&quot;&gt;Swift&lt;/a&gt;, the &lt;a href=&quot;http://openstack.org&quot;&gt;OpenStack&lt;/a&gt; Object Storage project, so it performs extra actions on files at upload or download time.&lt;/p&gt;
&lt;p&gt;We&apos;re going to build an anti-virus filter inside Swift. The goal is to refuse uploaded data if it contains a virus. To help us with virus analysis, we&apos;ll use &lt;a href=&quot;http://www.clamav.net&quot;&gt;ClamAV&lt;/a&gt;.&lt;/p&gt;
&lt;h2&gt;WSGI, paste and middleware&lt;/h2&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/lolcat-tube.jpg&quot; alt=&quot;lolcat-tube&quot; /&gt;&lt;/p&gt;
&lt;p&gt;To do our content analysis, the best place to hook into the Swift architecture is at the beginning of every request, on &lt;strong&gt;swift-proxy&lt;/strong&gt;, before the file is actually stored on the cluster. Like many other OpenStack projects, the Swift proxy uses &lt;a href=&quot;https://pypi.python.org/pypi/Paste&quot;&gt;paste&lt;/a&gt; to build its HTTP architecture.&lt;/p&gt;
&lt;p&gt;Paste uses WSGI and provides an architecture based on a pipeline. The pipeline is composed of a succession of middleware, ending with one application. Each middleware gets the chance to look at the request or the response, can modify it, and then passes it on to the following middleware. The last component of the pipeline is the real application, in this case the Swift proxy server.&lt;/p&gt;
&lt;p&gt;If you&apos;ve already deployed Swift, you encountered a default pipeline in the &lt;em&gt;swift-proxy.conf&lt;/em&gt; configuration file:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;[pipeline:main]
pipeline = catch_errors healthcheck cache ratelimit tempauth proxy-logging proxy-server
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This is a really basic pipeline with a few middleware. The first one catches errors, the second one is in charge of returning a &lt;em&gt;200 OK&lt;/em&gt; response if you send a &lt;code&gt;GET /healthcheck&lt;/code&gt; request to your proxy server. The third one is in charge of caching, the fourth one is used for rate limiting, the fifth for authentication, the sixth for logging, and the final one is the actual proxy server, in charge of proxying the request to the account, container, or object servers (the other components of Swift). Of course, we could remove or add any middleware here at our convenience.&lt;/p&gt;
&lt;p&gt;Be aware that the order matters: for example, if you put &lt;em&gt;healthcheck&lt;/em&gt; after &lt;em&gt;tempauth&lt;/em&gt;, you won&apos;t be able to access the &lt;em&gt;/healthcheck&lt;/em&gt; URL without being authenticated!&lt;/p&gt;
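&lt;p&gt;To make the ordering concrete, here is a minimal, self-contained sketch of such a pipeline in plain WSGI (no paste; all names are illustrative): a &lt;em&gt;healthcheck&lt;/em&gt; filter placed before an &lt;em&gt;auth&lt;/em&gt; filter short-circuits the request before authentication can reject it.&lt;/p&gt;

```python
def proxy_app(environ, start_response):
    # Stand-in for the final application (the Swift proxy server).
    start_response('200 OK', [('Content-Type', 'text/plain')])
    return [b'proxied']

def healthcheck_filter(app):
    def middleware(environ, start_response):
        if environ.get('PATH_INFO') == '/healthcheck':
            start_response('200 OK', [('Content-Type', 'text/plain')])
            return [b'OK']  # short-circuit: never reaches auth
        return app(environ, start_response)
    return middleware

def auth_filter(app):
    def middleware(environ, start_response):
        if environ.get('HTTP_X_AUTH_TOKEN') != 'secret':
            start_response('401 Unauthorized', [])
            return [b'']
        return app(environ, start_response)
    return middleware

# Pipeline order: healthcheck runs first, then auth, then the app.
pipeline = healthcheck_filter(auth_filter(proxy_app))

def call(path, token=None):
    # Tiny test client: invoke the pipeline and capture the status.
    status = []
    environ = {'PATH_INFO': path}
    if token:
        environ['HTTP_X_AUTH_TOKEN'] = token
    body = pipeline(environ, lambda s, h: status.append(s))
    return status[0], b''.join(body)
```

&lt;p&gt;Swap the two filters in the composition and &lt;code&gt;/healthcheck&lt;/code&gt; would return a 401 instead.&lt;/p&gt;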
&lt;h2&gt;ClamAV&lt;/h2&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/clamav.png&quot; alt=&quot;clamav&quot; /&gt;&lt;/p&gt;
&lt;p&gt;If you don&apos;t know &lt;a href=&quot;http://clamav.org&quot;&gt;ClamAV&lt;/a&gt;, it&apos;s an antivirus engine designed to detect trojans, viruses, malware and other malicious threats. We&apos;re going to use it to scan every incoming file. To build the middleware, we&apos;ll use the Python binding &lt;a href=&quot;http://pypi.python.org/pypi/clamd&quot;&gt;pyclamd&lt;/a&gt;. The API is quite simple, see:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&amp;gt;&amp;gt;&amp;gt; import pyclamd
&amp;gt;&amp;gt;&amp;gt; pyclamd.init_unix_socket(&apos;/var/run/clamav/clamd.ctl&apos;)
&amp;gt;&amp;gt;&amp;gt; print pyclamd.scan_stream(pyclamd.EICAR)
{&apos;stream&apos;: &apos;Eicar-Test-Signature(44d88612fea8a8f36de82e1278abb02f:68)&apos;}
&amp;gt;&amp;gt;&amp;gt; print pyclamd.scan_stream(&quot;safe!&quot;)
None
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Anatomy of a WSGI middleware&lt;/h2&gt;
&lt;p&gt;Your WSGI middleware should consist of a callable object. Usually this is done with a class implementing the &lt;em&gt;__call__&lt;/em&gt; method. Here&apos;s a basic boilerplate:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;class SwiftClamavMiddleware(object):
    &quot;&quot;&quot;Middleware doing virus scan for Swift.&quot;&quot;&quot;

    def __init__(self, app, conf):
        # app is the final application
        self.app = app

    def __call__(self, env, start_response):
        return self.app(env, start_response)

def filter_factory(global_conf, **local_conf):
    conf = global_conf.copy()
    conf.update(local_conf)

    def clamav_filter(app):
        return SwiftClamavMiddleware(app, conf)
    return clamav_filter
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I&apos;m not going to expand more on why this is built this way, but if you want to have more info on this kind of filter middleware, you can read &lt;a href=&quot;http://pythonpaste.org/deploy/#paste-filter-factory&quot;&gt;their documentation on Paste&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;As it is, this middleware does nothing: it simply passes every request it receives to the final application and returns the result.&lt;/p&gt;
&lt;h2&gt;Testing our basic middleware&lt;/h2&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/lolcat-testing.jpg&quot; alt=&quot;lolcat-testing&quot; /&gt;&lt;/p&gt;
&lt;p&gt;Now is a really good time to add unit tests. I hope you didn&apos;t think we were going to write code without some tests, right? It&apos;s really easy to test a middleware, as we&apos;re going to use &lt;a href=&quot;http://webob.org/&quot;&gt;WebOb&lt;/a&gt; for that.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import unittest
from webob import Request, Response

class FakeApp(object):
    def __call__(self, env, start_response):
        return Response(body=&quot;FAKE APP&quot;)(env, start_response)

class TestSwiftClamavMiddleware(unittest.TestCase):

    def setUp(self):
        self.app = SwiftClamavMiddleware(FakeApp(), {})

    def test_simple_request(self):
        resp = Request.blank(&apos;/&apos;,
                             environ={
                                 &apos;REQUEST_METHOD&apos;: &apos;GET&apos;,
                             }).get_response(self.app)
        self.assertEqual(resp.body, &quot;FAKE APP&quot;)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We create a FakeApp class that represents a fake WSGI application. You could also use a real application, or write a fake application that looks like the one you want to test. It&apos;ll require more time, but your tests will be closer to reality.&lt;/p&gt;
&lt;p&gt;Here we write the simplest test we can for our middleware. We&apos;re just sending a &lt;em&gt;GET /&lt;/em&gt; request to it, so it passes the request to the final application and returns the result. It is transparent, it does nothing.&lt;/p&gt;
&lt;p&gt;Now, with that solid base, we&apos;ll be able to add more features and test them incrementally.&lt;/p&gt;
&lt;h2&gt;Plugging ClamAV in&lt;/h2&gt;
&lt;p&gt;With our base ready, we can start thinking about how to plug ClamAV in. What we want to check here is the content of the file when it&apos;s uploaded. If we refer to the &lt;a href=&quot;http://docs.openstack.org/api/openstack-object-storage/1.0/content/&quot;&gt;OpenStack object storage API&lt;/a&gt;, a file upload is done via a &lt;em&gt;PUT&lt;/em&gt; request, so we&apos;re going to limit the check to that kind of request. Obviously, more checks could be added, but we&apos;ll keep things simple here for the sake of clarity.&lt;/p&gt;
&lt;p&gt;With WSGI, the content of the request is available in &lt;code&gt;env[&apos;wsgi.input&apos;]&lt;/code&gt; as an object implementing a file interface. We&apos;ll scan that stream with ClamAV to check for viruses.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import pyclamd
from cStringIO import StringIO
from webob import Response

class SwiftClamavMiddleware(object):
    &quot;&quot;&quot;Middleware doing virus scan for Swift.&quot;&quot;&quot;

    def __init__(self, app, conf):
        pyclamd.init_unix_socket(&apos;/var/run/clamav/clamd.ctl&apos;)
        # app is the final application
        self.app = app

    def __call__(self, env, start_response):
        if env[&apos;REQUEST_METHOD&apos;] == &quot;PUT&quot;:
            # We have to read the whole content in memory because pyclamd
            # forces us to, but this is a bad idea if the file is huge.
            body = env[&apos;wsgi.input&apos;].read()
            scan = pyclamd.scan_stream(body)
            if scan:
                return Response(status=403,
                                body=&quot;Virus %s detected&quot; % scan[&apos;stream&apos;],
                                content_type=&quot;text/plain&quot;)(env, start_response)
            # The scan consumed the input stream, so hand the application
            # a fresh one containing the same data.
            env[&apos;wsgi.input&apos;] = StringIO(body)
        return self.app(env, start_response)

def filter_factory(global_conf, **local_conf):
    conf = global_conf.copy()
    conf.update(local_conf)

    def clamav_filter(app):
        return SwiftClamavMiddleware(app, conf)
    return clamav_filter
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;That&apos;s it. We only check &lt;em&gt;PUT&lt;/em&gt; requests, and if there&apos;s a virus in the file, we return a &lt;em&gt;403 Forbidden&lt;/em&gt; error with the name of the detected virus, entirely bypassing the rest of the middleware chain and the application handling.&lt;/p&gt;
&lt;p&gt;Then, we can simply test it.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import unittest
from cStringIO import StringIO
import pyclamd
from webob import Request, Response

class FakeApp(object):
    def __call__(self, env, start_response):
        return Response(body=&quot;FAKE APP&quot;)(env, start_response)

class TestSwiftClamavMiddleware(unittest.TestCase):
    def setUp(self):
        self.app = SwiftClamavMiddleware(FakeApp(), {})

    def test_put_empty(self):
        resp = Request.blank(&apos;/v1/account/container/object&apos;,
                             environ={
                                 &apos;REQUEST_METHOD&apos;: &apos;PUT&apos;,
                             }).get_response(self.app)
        self.assertEqual(resp.body, &quot;FAKE APP&quot;)

    def test_put_no_virus(self):
        resp = Request.blank(&apos;/v1/account/container/object&apos;,
                             environ={
                                 &apos;REQUEST_METHOD&apos;: &apos;PUT&apos;,
                                 &apos;wsgi.input&apos;: StringIO(&apos;foobar&apos;)
                             }).get_response(self.app)
        self.assertEqual(resp.body, &quot;FAKE APP&quot;)

    def test_put_virus(self):
        resp = Request.blank(&apos;/v1/account/container/object&apos;,
                             environ={
                                 &apos;REQUEST_METHOD&apos;: &apos;PUT&apos;,
                                 &apos;wsgi.input&apos;: StringIO(pyclamd.EICAR)
                             }).get_response(self.app)
        self.assertEqual(resp.status_code, 403)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The first test, &lt;em&gt;test_put_empty&lt;/em&gt;, simulates an empty &lt;em&gt;PUT&lt;/em&gt; request. The second one, &lt;em&gt;test_put_no_virus&lt;/em&gt;, simulates a regular &lt;em&gt;PUT&lt;/em&gt; request with a simple file containing no virus.&lt;/p&gt;
&lt;p&gt;Finally, the third and last test simulates the upload of a virus using the &lt;a href=&quot;http://www.eicar.org/&quot;&gt;EICAR&lt;/a&gt; test file. This is a special file that is recognized as a virus by scanners, even though it isn&apos;t a real one. Very handy for testing virus detection software!&lt;/p&gt;
&lt;h2&gt;Configuring Swift proxy&lt;/h2&gt;
&lt;p&gt;Our middleware is ready! We can configure Swift&apos;s proxy server to use it. We need to add the following lines to our &lt;em&gt;swift-proxy.conf&lt;/em&gt; to teach it how to load the filter:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;[filter:clamav]
paste.filter_factory = swiftclamav:filter_factory
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We&apos;ll assume that our Python module is named &lt;em&gt;swiftclamav&lt;/em&gt; here. Now that we&apos;ve defined our filter and how to load it, we can use it in our pipeline:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;[pipeline:main]
pipeline = catch_errors healthcheck cache ratelimit tempauth clamav proxy-logging proxy-server
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Just before reaching the &lt;em&gt;proxy-server&lt;/em&gt;, and after the user has been authenticated, the content will be scanned for viruses. It&apos;s important to put this filter after authentication, because otherwise we might scan content that will then get rejected by the &lt;em&gt;tempauth&lt;/em&gt; module, thus scanning for nothing!&lt;/p&gt;
&lt;h2&gt;Beyond scanning&lt;/h2&gt;
&lt;p&gt;And voilà, we now have a simple middleware checking uploaded content and refusing infected files. We could enhance it with various other things, like configuration handling, but I&apos;ll leave that as an exercise for the interested reader.&lt;/p&gt;
&lt;p&gt;We didn&apos;t exploit it here, but note that you can also manipulate the request headers and modify them if needed. For example, we could have added a header &lt;em&gt;X-Object-Meta-Scanned-By: ClamAV&lt;/em&gt; to indicate that the file has been scanned by ClamAV.&lt;/p&gt;
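&lt;p&gt;As a quick sketch of that idea (not part of the middleware above; the class and application names here are illustrative), setting such a header from a WSGI middleware only means adding an &lt;em&gt;HTTP_*&lt;/em&gt; key to the environ before calling the wrapped application:&lt;/p&gt;

```python
class ScannedByMiddleware(object):
    """Hypothetical sketch: tag PUT requests as scanned by ClamAV.

    WSGI exposes request headers as HTTP_* keys in the environ dict,
    and Swift turns X-Object-Meta-* request headers into metadata
    stored alongside the object.
    """

    def __init__(self, app):
        self.app = app

    def __call__(self, env, start_response):
        if env.get('REQUEST_METHOD') == 'PUT':
            # Equivalent to sending "X-Object-Meta-Scanned-By: ClamAV".
            env['HTTP_X_OBJECT_META_SCANNED_BY'] = 'ClamAV'
        return self.app(env, start_response)


def echo_app(env, start_response):
    # A fake final application echoing the header back in the body.
    start_response('200 OK', [('Content-Type', 'text/plain')])
    return [env.get('HTTP_X_OBJECT_META_SCANNED_BY', 'none').encode('ascii')]


app = ScannedByMiddleware(echo_app)
print(b''.join(app({'REQUEST_METHOD': 'PUT'}, lambda s, h: None)))  # prints b'ClamAV'
```

&lt;p&gt;A real version would presumably set the header only after a successful scan, but the mechanism is the same.&lt;/p&gt;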
&lt;p&gt;You should now be able to build your own middleware doing whatever you want with uploaded data. Happy hacking!&lt;/p&gt;
</content:encoded></item><item><title>Overriding cl-json object encoding</title><link>https://julien.danjou.info/blog/cl-postmodern-dao-json/</link><guid isPermaLink="true">https://julien.danjou.info/blog/cl-postmodern-dao-json/</guid><description>CL-JSON provides an encoder for Lisp data structures and objects to JSON format. Unfortunately, in some case, its default encoding mechanism for CLOS objects isn&apos;t exactly doing the right thing.</description><pubDate>Fri, 11 Jan 2013 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;a href=&quot;http://common-lisp.net/project/cl-json/&quot;&gt;CL-JSON&lt;/a&gt; provides an encoder for Lisp data structures and objects to JSON format. Unfortunately, in some case, its default encoding mechanism for CLOS objects isn&apos;t exactly doing the right thing. I&apos;ll show you how Common Lisp makes it easy to change that.&lt;/p&gt;
&lt;h2&gt;Identifying the problem&lt;/h2&gt;
&lt;h3&gt;CL-JSON &amp;amp; CLOS&lt;/h3&gt;
&lt;p&gt;&lt;em&gt;CL-JSON&lt;/em&gt;&apos;s mechanism for encoding CLOS objects is really neat. Let&apos;s see how it works for a simple case:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;(defclass kitten ()
  ((tail :initarg :tail)))

(json:encode-json-to-string (make-instance &apos;kitten :tail &apos;black))
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;will produce:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;{&quot;tail&quot;:&quot;black&quot;}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Still using CL-JSON, we can also decode the JSON object to a CLOS object:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;(slot-value
 (json:with-decoder-simple-clos-semantics
   (json:decode-json-from-string &quot;{\&quot;tail\&quot;:\&quot;black\&quot;}&quot;))
 :tail)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;That code will return &lt;em&gt;&quot;black&quot;&lt;/em&gt;. Note that it&apos;s also possible to specify which class should be used when decoding objects, but that&apos;s beyond the purpose of this article.&lt;/p&gt;
&lt;h3&gt;Postmodern&lt;/h3&gt;
&lt;p&gt;Now, let&apos;s introduce &lt;a href=&quot;http://marijnhaverbeke.nl/postmodern/&quot;&gt;Postmodern&lt;/a&gt;, a wonderful Common Lisp system providing access to the &lt;a href=&quot;http://postgresql.org&quot;&gt;PostgreSQL&lt;/a&gt; database. It also provides a simple system to map rows in a database to CLOS classes, called DAO, for &lt;em&gt;database access objects&lt;/em&gt;.&lt;/p&gt;
&lt;p&gt;With this, we can easily store our &lt;em&gt;kitten&lt;/em&gt; into a table.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;(defclass kitten ()
  ((tail :initarg :tail))
  (:metaclass postmodern:dao-class))
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;If we try to encode this to JSON, it will produce the exact same result as seen previously.&lt;/p&gt;
&lt;p&gt;The problem is what happens when one of our columns has a &lt;em&gt;NULL&lt;/em&gt; value. Postmodern encodes this using the &lt;em&gt;:null&lt;/em&gt; symbol.&lt;/p&gt;
&lt;p&gt;So this code:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;(defclass kitten ()
  ((tail :initarg :tail :col-type (or s-sql:db-null text)))
  (:metaclass postmodern:dao-class))

(postmodern:deftable kitten
  (postmodern:!dao-def))

(postmodern:connect-toplevel …)

(postmodern:create-table &apos;kitten)

(json:encode-json-to-string
  (postmodern:make-dao &apos;kitten))
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;will return:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&quot;{&quot;tail&quot;:&quot;null&quot;}&quot;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Fail! The fact that the column is &lt;em&gt;NULL&lt;/em&gt; is represented by the &lt;em&gt;:null&lt;/em&gt; symbol, and CL-JSON encodes all symbols as strings.&lt;/p&gt;
&lt;p&gt;This is not at all what we want here!&lt;/p&gt;
&lt;h2&gt;Overriding encode-json&lt;/h2&gt;
&lt;p&gt;CL-JSON provides and uses the &lt;em&gt;encode-json&lt;/em&gt; method to encode all kinds of objects. It is defined as a &lt;em&gt;generic function&lt;/em&gt;, and a lot of different methods are implemented to handle the different standard Common Lisp types. The one used for &lt;em&gt;standard-object&lt;/em&gt; is defined like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;(defmethod encode-json ((o standard-object)
                        &amp;amp;optional (stream *json-output*))
  &quot;Write the JSON representation (Object) of the CLOS object O to
STREAM (or to *JSON-OUTPUT*).&quot;
  (with-object (stream)
    (map-slots (stream-object-member-encoder stream) o)))
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;All we need to do here is to create a new method for our &lt;em&gt;kitten&lt;/em&gt; objects that correctly handles the &lt;em&gt;:null&lt;/em&gt; case.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;(defclass kitten ()
  ((tail :initarg :tail :col-type (or s-sql:db-null text)))
  (:metaclass postmodern:dao-class))

(export &apos;kitten)

;; Switch package just to define the new method
(in-package :json)
(defmethod encode-json ((o cl-user:kitten)
                        &amp;amp;optional (stream json:*json-output*))
  &quot;Write the JSON representation (Object) of the postmodern DAO CLOS object
O to STREAM (or to *JSON-OUTPUT*).&quot;
  (with-object (stream)
    (map-slots (lambda (key value)
                 (as-object-member (key stream)
                   (encode-json (if (eq value :null) nil value) stream)))
               o)))

;; Go back into our package
(in-package :cl-user)

(postmodern:deftable kitten
  (postmodern:!dao-def))

(postmodern:connect-toplevel …)

(postmodern:create-table &apos;kitten)

(json:encode-json-to-string
  (postmodern:make-dao &apos;kitten))
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;With that new method, as soon as we encounter a &lt;em&gt;:null&lt;/em&gt; symbol as the value of an object&apos;s slot, we replace it with &lt;em&gt;nil&lt;/em&gt;.&lt;/p&gt;
&lt;p&gt;Now if we try to encode another &lt;em&gt;kitten&lt;/em&gt;, we&apos;ll get:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;{&quot;tail&quot;:null}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;which is far better for our JavaScript data consumers!&lt;/p&gt;
&lt;p&gt;In the end, I think this kind of trick is so easy because of the way CLOS implements generic functions.&lt;br /&gt;
The fact that methods don&apos;t belong to any class makes extending every program, library and class so much easier. Doing this in another language like Java would likely be impossible, and in Python it would hardly be as clean as it is in Common Lisp.&lt;/p&gt;
&lt;p&gt;The ability to teach &lt;em&gt;any&lt;/em&gt; library about how it should handle your class just by defining a new method is really handy!&lt;/p&gt;
</content:encoded></item><item><title>Integrating cl-irc and cl-async</title><link>https://julien.danjou.info/blog/cl-irc-async/</link><guid isPermaLink="true">https://julien.danjou.info/blog/cl-irc-async/</guid><description>Recently, I&apos;ve started programming in Common Lisp.</description><pubDate>Fri, 04 Jan 2013 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Recently, I&apos;ve started programming in &lt;a href=&quot;http://common-lisp.net&quot;&gt;Common Lisp&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;My idea here is to use &lt;a href=&quot;http://www.cliki.net/cl-irc&quot;&gt;cl-irc&lt;/a&gt;, an IRC library, inside an &lt;a href=&quot;http://en.wikipedia.org/wiki/Event_loop&quot;&gt;event loop&lt;/a&gt;. This can be really useful, for example to trigger actions based on time, using timers.&lt;/p&gt;
&lt;h2&gt;Creating a connection&lt;/h2&gt;
&lt;p&gt;The first step is to create a basic &lt;em&gt;cl-irc:connection&lt;/em&gt; object on our own. This can be achieved easily with this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;(require :cl-irc)

(defun connect (server)
  (cl-irc:make-connection :connection-type &apos;cl-irc:connection
                          :client-stream t
                          :network-stream ?
                          :server-name server))
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This will return a &lt;em&gt;cl-irc:connection&lt;/em&gt; object, logging to stdout (&lt;em&gt;:client-stream t&lt;/em&gt;) and using &lt;em&gt;server&lt;/em&gt; as the server name. Note that the server name could be any string.&lt;/p&gt;
&lt;p&gt;You probably noticed the &lt;em&gt;?&lt;/em&gt; I used as the :network-stream value. This is not a real and working value: it should be a stream established to the IRC server you want to chat with. This is where we&apos;ll need &lt;a href=&quot;http://orthecreedence.github.com/cl-async/tcp#tcp-connect&quot;&gt;&lt;code&gt;cl-async:tcp-connect&lt;/code&gt;&lt;/a&gt; to establish a TCP connection.&lt;/p&gt;
&lt;p&gt;As you can read in this function&apos;s documentation, all we need to pass is the server address, two callbacks for read and general events, and the &lt;em&gt;:stream&lt;/em&gt; option to get a stream rather than a socket.&lt;/p&gt;
&lt;p&gt;So you would do something like:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;(require :cl-irc)
(require :cl-async)

(defun connection-socket-read (socket stream)
  (format t &quot;We should read the IRC message from ~a ~%&quot; stream))

(defun connection-socket-event (ev)
  (format t &quot;Socket event: ~a~%&quot; ev))

(defun connect (server &amp;amp;optional (port 6667))
  (cl-irc:make-connection :connection-type &apos;cl-irc:connection
                          :client-stream t
                          :network-stream (as:tcp-connect server port
                                                          #&apos;connection-socket-read
                                                          #&apos;connection-socket-event
                                                          :stream t)
                          :server-name server))

(as:start-event-loop (lambda () (connect &quot;irc.oftc.net&quot;)))
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;If you run this program, it will connect to the OFTC IRC server, and then notify you each time the server sends you a message.&lt;/p&gt;
&lt;p&gt;The problem now is: how do we treat the messages read from the stream in &lt;code&gt;connection-socket-read&lt;/code&gt; and handle them on behalf of the connection object we created? At this point, we can&apos;t link the two together.&lt;/p&gt;
&lt;p&gt;We can&apos;t build a closure, because at the time we call &lt;em&gt;as:tcp-connect&lt;/em&gt; we don&apos;t have the &lt;em&gt;cl-irc:connection&lt;/em&gt; instance yet. Nor can we easily change the &lt;em&gt;read-cb&lt;/em&gt; parameter of the &lt;em&gt;network-stream&lt;/em&gt; established by &lt;em&gt;as:tcp-connect&lt;/em&gt; afterwards, simply because &lt;em&gt;cl-async&lt;/em&gt; doesn&apos;t allow that.&lt;/p&gt;
&lt;h2&gt;Building a closure&lt;/h2&gt;
&lt;p&gt;So one solution here is to hack around &lt;em&gt;cl-irc:make-connection&lt;/em&gt; so we can build a &lt;em&gt;cl-irc:connection&lt;/em&gt; instance without providing the &lt;em&gt;network-stream&lt;/em&gt; in advance, allowing us to build a closure that includes the &lt;em&gt;cl-irc:connection&lt;/em&gt; to read events for. This is what we&apos;re going to do in the &lt;code&gt;connect&lt;/code&gt; function.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;(require :cl-irc)
(require :cl-async)
(require :flexi-streams)

(defun connection-socket-read (connection)
  (loop for message = (cl-irc::read-irc-message connection)
        while message
        do (cl-irc:irc-message-event connection message)))

(defun connection-socket-event (ev)
  (format t &quot;Socket event: ~a~%&quot; ev))

(defun connect (server port nickname
                &amp;amp;key
                  (username nil)
                  (realname nil)
                  (password nil))
  ;; Build an instance of cl-irc:connection, without any network/output stream
  (let* ((connection (make-instance &apos;cl-irc:connection
                                    :user username
                                    :password password
                                    :server-name server
                                    :server-port port
                                    :client-stream t))
         ;; Use as:tcp-connect to build our network stream, and build a
         ;; closure calling `connection-socket-read&apos; with our `connection&apos;
         ;; as arguments
         (network-stream (as:tcp-connect server port
                                         (lambda (socket stream)
                                           (declare (ignore socket stream))
                                           (connection-socket-read connection))
                                         #&apos;connection-socket-event
                                         :stream t)))
    ;; Set the network stream on the connection
    (setf (cl-irc:network-stream connection) network-stream)
    ;; Set the output stream on the connection
    (setf (cl-irc:output-stream connection)
         ;; This is grabbed from cl-irc:make-connection
          (flexi-streams:make-flexi-stream
           network-stream
           :element-type &apos;character
           :external-format &apos;(:utf8 :eol-style :crlf)))

    ;; Now handle the IRC protocol authentication pass
    (unless (null password)
      (cl-irc:pass connection password))
    (cl-irc:nick connection nickname)
    (cl-irc:user- connection (or username nickname) 0 (or realname nickname))
    connection))

(as:start-event-loop (lambda () (connect &quot;irc.oftc.net&quot; 6667 &quot;jd-blog&quot;)))
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;And here we are! If we run this, we&apos;re now using an event loop to run &lt;code&gt;cl-irc&lt;/code&gt;. Each time the socket has something to read, the function &lt;code&gt;connection-socket-read&lt;/code&gt; will be called on the non-blocking socket. If there&apos;s no message to be read, the function will exit and the loop will continue to run.&lt;/p&gt;
&lt;h2&gt;Using timers&lt;/h2&gt;
&lt;p&gt;You can now modify the last line with this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;(defun say-hello (connection)
  (cl-irc:privmsg connection &quot;#jd-blog&quot; &quot;Hey I read your blog!&quot;)
  (as:delay (lambda () (say-hello connection)) :time 60))

(as:start-event-loop (lambda ()
                       (let ((connection (connect &quot;irc.oftc.net&quot; 6667 &quot;jd-blog&quot;)))
                         (cl-irc:join connection &quot;#jd-blog&quot;)
                         (say-hello connection))))
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This will connect to the IRC server, join a channel and then say the same sentence every minute.&lt;/p&gt;
&lt;p&gt;Challenge accomplished!&lt;/p&gt;
&lt;p&gt;And I&apos;d like to thank &lt;a href=&quot;http://blog.killtheradio.net/&quot;&gt;Andrew Lyon&lt;/a&gt;, the author of &lt;a href=&quot;https://github.com/orthecreedence/cl-async&quot;&gt;cl-async&lt;/a&gt;, who has been incredibly helpful with my recent experimentations in this area.&lt;/p&gt;
</content:encoded></item><item><title>Ceilometer bug squash day #1</title><link>https://julien.danjou.info/blog/ceilometer-bug-squash-day-1/</link><guid isPermaLink="true">https://julien.danjou.info/blog/ceilometer-bug-squash-day-1/</guid><description>In order to start the year in a good mood, what&apos;s better than squashing some bugs on OpenStack?</description><pubDate>Mon, 24 Dec 2012 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;In order to start the year in a good mood, what&apos;s better than squashing some bugs on OpenStack?&lt;/p&gt;
&lt;p&gt;Therefore, the Ceilometer team is pleased &lt;a href=&quot;http://lists.openstack.org/pipermail/openstack-dev/2012-December/004161.html&quot;&gt;to announce&lt;/a&gt; that it is organizing a &lt;a href=&quot;http://wiki.openstack.org/Ceilometer/BugSquashingDay/20130104&quot;&gt;bug squashing day on Friday, 4th January 2013&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;We wrote an extensive page about &lt;a href=&quot;http://wiki.openstack.org/Ceilometer/Contributing&quot;&gt;how you can contribute to Ceilometer&lt;/a&gt;, from updating the documentation to fixing bugs. There&apos;s a lot you can do. We have good support for Ceilometer built into &lt;a href=&quot;http://devstack.org&quot;&gt;Devstack&lt;/a&gt;, so installing a development platform is really easy.&lt;/p&gt;
&lt;p&gt;The main goal of this bug day will be to put Ceilometer in the best possible shape before the &lt;em&gt;grizzly-2&lt;/em&gt; milestone arrives (10th January 2013). This version of Ceilometer will aim to keep compatibility with &lt;em&gt;Folsom&lt;/em&gt;, so early deployers can enjoy some of our new features before upgrading to &lt;em&gt;Grizzly&lt;/em&gt;. After that date, we&apos;ll start merging more extensive changes.&lt;/p&gt;
&lt;p&gt;We&apos;ll be hanging out on the &lt;em&gt;#openstack-metering&lt;/em&gt; IRC channel on &lt;a href=&quot;http://freenode.net&quot;&gt;Freenode&lt;/a&gt;, as usual, so feel free to come by and join us!&lt;/p&gt;
</content:encoded></item><item><title>Logitech Unifying devices support in UPower</title><link>https://julien.danjou.info/blog/logitech-unifying-upower/</link><guid isPermaLink="true">https://julien.danjou.info/blog/logitech-unifying-upower/</guid><description>A few months ago, I wrote about my attempt at reverse engineering Logitech Unifying devices. Back then, I concluded my post with big hopes for the future after receiving a document with some part of t</description><pubDate>Fri, 16 Nov 2012 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;A few months ago, &lt;a href=&quot;https://julien.danjou.info/blog/logitech-k750-linux-support&quot;&gt;I wrote about my attempt at reverse engineering Logitech Unifying devices&lt;/a&gt;. Back then, I concluded my post with big hopes for the future after receiving a document with part of the HID++ 2.0 specification from Logitech.&lt;/p&gt;
&lt;p&gt;A couple of weeks ago, some of my summer work has been merged to &lt;a href=&quot;http://upower.freedesktop.org/&quot;&gt;UPower&lt;/a&gt;, adding battery support for some Logitech devices.&lt;/p&gt;
&lt;h2&gt;HID++&lt;/h2&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/m705.jpg&quot; alt=&quot;m705&quot; /&gt;&lt;/p&gt;
&lt;p&gt;As I discovered late in my first reverse engineering attempt, Logitech developed a custom HID protocol named HID++. This protocol exists in two versions, 1.0 and 2.0. Some devices talk with version 1 of the protocol (like my M705 mouse) and some others talk with version 2 of the protocol (like my K750 keyboard).&lt;/p&gt;
&lt;p&gt;Recently, I&apos;ve been in touch with a Logitech engineer who worked on the Linux support for the Unifying receiver, and he has been really helpful, sharing some details about this protocol with me.&lt;/p&gt;
&lt;p&gt;Logitech made the decision to publish their HID++ specification publicly about a year ago, but still didn&apos;t do it. The internal review needed to publish such documents hasn&apos;t been done yet. The &lt;a href=&quot;http://6xq.net/git/lars/lshidpp.git/plain/doc/logitech_hidpp_2.0_specification_draft_2012-06-04.pdf&quot;&gt;only published draft&lt;/a&gt; is just an extract of the specification, even with some typos in it, as I discovered.&lt;/p&gt;
&lt;p&gt;Some &lt;a href=&quot;https://drive.google.com/?tab=mo&amp;amp;pli=1&amp;amp;authuser=0#folders/0BxbRzx7vEV7eWmgwazJ3NUFfQ28&quot;&gt;other documents&lt;/a&gt; have been published recently, but I didn&apos;t have the time to review them. They contain the HID++ 1.0 specification and some details I asked for about the K750 keyboard.&lt;/p&gt;
&lt;h2&gt;UPower support&lt;/h2&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/upower-1.png&quot; alt=&quot;upower-1&quot; /&gt;&lt;/p&gt;
&lt;p&gt;It took me some time to get a full understanding of the protocol, its different versions, etc. After reverse engineering my K750 keyboard, I&apos;ve also reverse engineered the data stream used to get my M705 mouse&apos;s battery status. I&apos;ve also received some information about the HID++ 1.0 protocol, so I&apos;ve been able to discover a bit more about what the packets mean. Most of my discoveries are now used to do proper &lt;code&gt;#define&lt;/code&gt; in &lt;a href=&quot;http://cgit.freedesktop.org/upower/tree/src/linux/up-device-lg-unifying.c&quot;&gt;&lt;code&gt;up-lg-unifying.c&lt;/code&gt;&lt;/a&gt; so the code makes more sense.&lt;/p&gt;
&lt;p&gt;My &lt;a href=&quot;http://cgit.freedesktop.org/upower/commit/?id=2f03ad62b520fc5c02e3ff9eb5bffc4275eb88dc&quot;&gt;first patch&lt;/a&gt; implements a new property for UPower devices, named &lt;em&gt;luminosity&lt;/em&gt;, used with the K750 keyboard to report the light level received. The &lt;a href=&quot;http://cgit.freedesktop.org/upower/commit/?id=95184593504bca5240ecd296db98954decd2c5a5&quot;&gt;second patch&lt;/a&gt; adds support for Logitech Unifying devices (over USB only) and should work with at least the Logitech M705 and K750 devices. This should be available with the next version of UPower, which should be 0.9.19.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/gnome-power-statistics-k750.png&quot; alt=&quot;gnome-power-statistics-k750&quot; /&gt;&lt;/p&gt;
&lt;p&gt;So far, Logitech has been kind enough to help me understand part of the protocol and even sent me a few devices so I can play with them and test my work. Unfortunately, this will probably require some work and time, and so far Logitech has not been able to help with that.&lt;/p&gt;
&lt;p&gt;There should be enough information out there to at least add battery support for HID++ 2.0 devices, and probably a few other things too. I hope I&apos;ll get the time to do this at some point, but feel free to beat me in this race!&lt;/p&gt;
</content:encoded></item><item><title>OpenStack France meetup #2</title><link>https://julien.danjou.info/blog/openstack-france-meetup-2/</link><guid isPermaLink="true">https://julien.danjou.info/blog/openstack-france-meetup-2/</guid><description>I was at the OpenStack France meetup 2 yesterday evening.</description><pubDate>Tue, 06 Nov 2012 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;I was at the &lt;a href=&quot;http://www.meetup.com/OpenStack-France/events/84177022/&quot;&gt;OpenStack France meetup 2&lt;/a&gt; yesterday evening.&lt;/p&gt;
&lt;p&gt;This has been a wonderful evening, talking about OpenStack and more with around 30-40 people. &lt;a href=&quot;http://nicolas.barcet.com/&quot;&gt;Nick Barcet&lt;/a&gt; and I presented &lt;a href=&quot;http://launchpad.net/ceilometer&quot;&gt;Ceilometer&lt;/a&gt; and received some good feedback about it.&lt;/p&gt;
&lt;p&gt;We should also thank &lt;a href=&quot;http://www.nebula.com/&quot;&gt;Nebula&lt;/a&gt;, who sponsored the evening, and &lt;a href=&quot;http://erwan.com/&quot;&gt;Erwan Gallen&lt;/a&gt; for the nice organization; free beers are always enjoyable.&lt;/p&gt;
</content:encoded></item><item><title>Inside Synaps, a CloudWatch-like implementation for OpenStack</title><link>https://julien.danjou.info/blog/openstack-synaps-exploration/</link><guid isPermaLink="true">https://julien.danjou.info/blog/openstack-synaps-exploration/</guid><description>A few days ago, Samsung released the source code of Synaps, an implementation of the Amazon Web Service CloudWatch API for OpenStack.</description><pubDate>Mon, 22 Oct 2012 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;A few days ago, &lt;a href=&quot;http://www.samsung.com/&quot;&gt;Samsung&lt;/a&gt; released the source code of &lt;a href=&quot;https://github.com/spcs/synaps&quot;&gt;Synaps&lt;/a&gt;, an implementation of the &lt;a href=&quot;http://aws.amazon.com/cloudwatch/&quot;&gt;Amazon Web Service CloudWatch API&lt;/a&gt; for &lt;a href=&quot;http://openstack.org&quot;&gt;OpenStack&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Being a developer on the &lt;a href=&quot;http://launchpad.net/ceilometer&quot;&gt;Ceilometer&lt;/a&gt; project, I&apos;ve been curious to look on this project and how it could overlap with Ceilometer or other projects like &lt;a href=&quot;http://www.heat-api.org&quot;&gt;Heat&lt;/a&gt;.&lt;/p&gt;
&lt;h2&gt;What is CloudWatch?&lt;/h2&gt;
&lt;p&gt;CloudWatch is a monitoring system provided by Amazon on its Web Services platform to monitor services. It allows you to get notifications and trigger actions when certain thresholds are reached.&lt;/p&gt;
&lt;p&gt;For example, it can be used to scale your architecture: by monitoring the number of requests it receives and its general load, you can start new servers when needed.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/cloudwatch.jpg&quot; alt=&quot;cloudwatch&quot; /&gt;&lt;/p&gt;
&lt;h2&gt;Synaps&lt;/h2&gt;
&lt;p&gt;Synaps is written in around 7k lines of Python (28 % of which are comments), reuses at least one common module of OpenStack (&lt;em&gt;openstack.common.cfg&lt;/em&gt;) and copies some modules from Nova. One thing that strikes me is that there seem to be only a few unit tests compared to most OpenStack projects. Also, many parts of the code and documentation contain text written in Korean, which won&apos;t be very helpful for most people! :-) It uses some external technologies: &lt;a href=&quot;http://storm-project.net/&quot;&gt;Storm&lt;/a&gt;, &lt;a href=&quot;http://cassandra.apache.org/&quot;&gt;Cassandra&lt;/a&gt; to store its persistent data, and &lt;a href=&quot;http://pandas.pydata.org/&quot;&gt;Pandas&lt;/a&gt; to do data analysis.&lt;/p&gt;
&lt;p&gt;The API server provides an EC2-compatible API only: no OpenStack-specific API. This is probably not a bad thing for now, since I am not aware of any work in this direction. The API accesses the Cassandra back-end directly for read operations, but relies on RPC for writes. This way, a set of daemons handles the writes using the Storm part of Synaps and does data aggregation. The authentication only supports LDAP, but it should still be possible to add a driver for Keystone.&lt;/p&gt;
&lt;p&gt;A Java and a Python SDK are provided to record metrics into Synaps, but&lt;br /&gt;
there&apos;s not enough documentation for them to be useful.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/SynapsDeployment-1.jpg&quot; alt=&quot;SynapsDeployment-1&quot; /&gt;&lt;/p&gt;
&lt;h2&gt;Overlap with Heat&lt;/h2&gt;
&lt;p&gt;For now, there&apos;s not a lot of overlap with Heat, because Heat does not completely implement the CloudWatch API; it actually still misses a lot of the CloudWatch functions. But as soon as it implements the CloudWatch API completely, the overlap with Synaps will be complete in this regard.&lt;/p&gt;
&lt;p&gt;One point of divergence, however, is that Heat uses RPC to access data from the storage back-end via its engine (the central daemon), whereas Synaps connects directly to Cassandra. Also, Heat relies on SQLAlchemy, like most OpenStack projects that need a database.&lt;/p&gt;
&lt;h2&gt;Overlap with Ceilometer&lt;/h2&gt;
&lt;p&gt;One of the goals of Ceilometer is to provide data probes and pollsters for all OpenStack components (Nova, Swift, Quantum…), whereas Synaps lets OpenStack users push any kind of metric into it, and therefore doesn&apos;t provide any probes for now.&lt;/p&gt;
&lt;p&gt;But metric storage is the main common point between Synaps and Ceilometer. Synaps chose a single technology, Cassandra, to store its metrics, whereas Ceilometer built an abstraction layer over the storage engine. Ceilometer currently allows an operator to use SQL or MongoDB, but Cassandra could likely be added.&lt;/p&gt;
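&lt;p&gt;As a rough illustration of what such a storage abstraction layer looks like, here is a minimal sketch in Python. The class and method names are hypothetical, not Ceilometer&apos;s actual driver API.&lt;/p&gt;

```python
import abc


class StorageDriver(abc.ABC):
    """Hypothetical storage abstraction: each back-end (SQL, MongoDB,
    and potentially Cassandra) implements the same interface."""

    @abc.abstractmethod
    def record_metric(self, resource_id, name, value, timestamp):
        """Persist one metric sample."""

    @abc.abstractmethod
    def get_metrics(self, resource_id, name):
        """Return the recorded samples for one resource and meter."""


class InMemoryDriver(StorageDriver):
    """Trivial driver, useful for tests."""

    def __init__(self):
        self.samples = []

    def record_metric(self, resource_id, name, value, timestamp):
        self.samples.append((resource_id, name, value, timestamp))

    def get_metrics(self, resource_id, name):
        return [s for s in self.samples
                if s[0] == resource_id and s[1] == name]
```

&lt;p&gt;A Cassandra driver would then only have to implement the same two methods, leaving the rest of the project untouched.&lt;/p&gt;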
&lt;p&gt;Synaps consolidates metric data. This makes sense, since Synaps doesn&apos;t need the full data history to trigger alarms. On the contrary, Ceilometer needs a full history to allow things like billing, and doesn&apos;t do any aggregation on the data.&lt;/p&gt;
&lt;p&gt;Also, in Synaps, data analysis is done using Pandas. This means the data are retrieved from the Cassandra back-end and then transformed by Pandas inside Synaps into something else. It&apos;s likely that in such a case, Synaps should use CQL instead. Ceilometer manipulates data close to its storage: computations are done by the back-end to be efficient (SQL, map-reduce…).&lt;/p&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;Considering Samsung open-sourced Synaps late in the development process, I don&apos;t feel like they aimed to have it become a core component. This is always sad, because the effort put into this implementation is big, and it would probably have cost little to add some abstraction layers to follow what other OpenStack projects do. But this takes time and energy, and it&apos;s understandable that Samsung didn&apos;t want to do it in a short time frame.&lt;/p&gt;
&lt;p&gt;Part of the code and architecture overlaps with Ceilometer and Heat. Ceilometer is becoming a specialized place to store metric data from any source, so it&apos;s sad, but understandable, that Synaps did not try to reuse it. Fortunately, Heat is working with Ceilometer to achieve exactly that. This means OpenStack would have only one metric storage point, used for billing, monitoring, and alarming.&lt;/p&gt;
&lt;p&gt;Therefore, I think Synaps is an implementation of CloudWatch that should be looked at as an inspiration for Heat and Ceilometer to build a better and more integrated solution!&lt;/p&gt;
</content:encoded></item><item><title>Ceilometer 0.1 released</title><link>https://julien.danjou.info/blog/ceilometer-0-1-released/</link><guid isPermaLink="true">https://julien.danjou.info/blog/ceilometer-0-1-released/</guid><description>After six months of development, we are proud to announce the first release of Ceilometer, the OpenStack Metering project.</description><pubDate>Fri, 12 Oct 2012 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;After six months of development, we are proud to announce the first release of &lt;a href=&quot;http://launchpad.net/ceilometer&quot;&gt;Ceilometer&lt;/a&gt;, the &lt;a href=&quot;http://openstack.org&quot;&gt;OpenStack&lt;/a&gt; Metering project. This is a first and amazing milestone for us: we join all the other projects in releasing a version for Folsom!&lt;/p&gt;
&lt;p&gt;Using Ceilometer, you should now be able to meter your OpenStack cloud and retrieve its usage to build statistics or bill your customers!&lt;/p&gt;
&lt;p&gt;You can read &lt;a href=&quot;https://lists.launchpad.net/openstack/msg17410.html&quot;&gt;our announcement on the OpenStack mailing list&lt;/a&gt;.&lt;/p&gt;
&lt;h2&gt;Architecture&lt;/h2&gt;
&lt;p&gt;We spent a good amount of time defining and refining &lt;a href=&quot;http://ceilometer.readthedocs.org/en/latest/architecture.html#high-level-description&quot;&gt;our architecture&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/Ceilometer_Architecture-1.png&quot; alt=&quot;Ceilometer_Architecture-1&quot; /&gt;&lt;/p&gt;
&lt;p&gt;One of its important points is that it has been designed to work without modifying any of the existing core components. Patching OpenStack components in an intrusive way to meter them was not an option for now, simply because we had no legitimacy to do so. This may change in the future, and it will likely be discussed next week during the &lt;a href=&quot;http://www.openstack.org/summit/san-diego-2012/&quot;&gt;OpenStack Summit&lt;/a&gt;.&lt;/p&gt;
&lt;h2&gt;Meters&lt;/h2&gt;
&lt;p&gt;Initially, we defined a bunch of meters we wanted for a first release, and in the end, most of them are available. Some are still missing, like the OpenStack Object Storage (Swift) ones, mainly due to a lack of interest from the involved parties so far.&lt;/p&gt;
&lt;p&gt;Anyhow, with this first release, you should be able to meter your instances and their network usage, memory, and CPU. Images, networks, and volumes, along with their CRUD operations, are metered too. For more details, you can read the &lt;a href=&quot;http://ceilometer.readthedocs.org/en/latest/measurements.html&quot;&gt;complete list of implemented meters&lt;/a&gt;.&lt;/p&gt;
&lt;h2&gt;REST API&lt;/h2&gt;
&lt;p&gt;The HTTP REST API has been partially implemented. The provided methods should allow basic integration with a billing system.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;http://dreamhost.com/&quot;&gt;DreamHost&lt;/a&gt; is using Ceilometer in their deployment architecture and coupling it with their billing system!&lt;/p&gt;
&lt;h2&gt;Towards Grizzly&lt;/h2&gt;
&lt;p&gt;We don&apos;t have a clear and established roadmap for Grizzly yet.&lt;/p&gt;
&lt;p&gt;We already have a couple of patches waiting in the queue to be merged, like the use of &lt;a href=&quot;https://review.openstack.org/#/c/13989/&quot;&gt;Keystone to authenticate API requests&lt;/a&gt; and the &lt;a href=&quot;https://review.openstack.org/#/c/14185/&quot;&gt;removal of Nova DB access&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;On my side, these last days I&apos;ve been working on a small debug user interface for the API. The Ceilometer API server will return this interface if you do an API request from a browser (i.e. requesting &lt;code&gt;text/html&lt;/code&gt; instead of &lt;code&gt;application/json&lt;/code&gt;).&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/ceilometer-debug-interface.png&quot; alt=&quot;ceilometer-debug-interface&quot; /&gt;&lt;/p&gt;
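&lt;p&gt;The dispatch logic behind this debug interface can be sketched as follows. This is a simplified illustration of content negotiation on the &lt;code&gt;Accept&lt;/code&gt; header, not the actual Ceilometer code.&lt;/p&gt;

```python
def pick_renderer(accept_header):
    """Pick a response renderer from an HTTP Accept header.

    Browsers send text/html first and get the debug interface;
    API clients asking for application/json get plain JSON.
    """
    for media_type in accept_header.split(","):
        # Drop quality parameters like ";q=0.9" before comparing.
        media_type = media_type.split(";")[0].strip()
        if media_type == "text/html":
            return "html-debug-interface"
        if media_type == "application/json":
            return "json"
    # Default to JSON for anything else, including */*.
    return "json"


print(pick_renderer("text/html,application/xhtml+xml;q=0.9"))
print(pick_renderer("application/json"))
```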
&lt;p&gt;I hope this will help newcomers discover the Ceilometer API more easily and leverage it to build powerful tools!&lt;/p&gt;
&lt;p&gt;Anyhow, we have tons of ideas and work to do, and I&apos;m sure the upcoming weeks will be very interesting. We also hope to become an OpenStack incubated project soon. So stay tuned!&lt;/p&gt;
</content:encoded></item><item><title>Gnus notifications</title><link>https://julien.danjou.info/blog/gnus-notifications/</link><guid isPermaLink="true">https://julien.danjou.info/blog/gnus-notifications/</guid><description>Today, I&apos;ve merged my Gnus notifications module inside Gnus git repository. This way, it will be available for everybody in Emacs 24.2.</description><pubDate>Wed, 29 Aug 2012 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Today, &lt;a href=&quot;http://git.gnus.org/cgit/gnus.git/commit/?id=7b7db76666fae115c7ec0cc78ca96ea4e177ba4e&quot;&gt;I&apos;ve merged my Gnus notifications module&lt;/a&gt; inside Gnus git repository. This way, it will be available for everybody in Emacs 24.2.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/gnus-notifications-1.png&quot; alt=&quot;gnus-notifications-1&quot; /&gt;&lt;/p&gt;
&lt;p&gt;This module allows you to be notified via &lt;code&gt;notifications-notify&lt;/code&gt; (the Emacs implementation of the &lt;a href=&quot;http://www.galago-project.org/specs/notification/&quot;&gt;Freedesktop desktop notifications&lt;/a&gt;) on new messages received in Gnus. It can also retrieve contact photos via &lt;em&gt;gravatar.el&lt;/em&gt; and &lt;em&gt;google-contacts.el&lt;/em&gt; to include them in the notification.&lt;/p&gt;
&lt;p&gt;To enable it in Emacs &amp;gt; 24.1, you just have to add the following line to your Gnus configuration file:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;(add-hook &apos;gnus-after-getting-new-news-hook &apos;gnus-notifications)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;If you want to download it and use it stand-alone for a previous Emacs version, you can &lt;a href=&quot;http://git.gnus.org/cgit/gnus.git/plain/lisp/gnus-notifications.el&quot;&gt;fetch the latest file revision&lt;/a&gt; and load it before adding the previously given line.&lt;/p&gt;
</content:encoded></item><item><title>Sony Vaio Z Debian Linux support</title><link>https://julien.danjou.info/blog/sony-vaio-svz13-linux/</link><guid isPermaLink="true">https://julien.danjou.info/blog/sony-vaio-svz13-linux/</guid><description>I had to install Debian Wheezy on a brand new Sony Vaio Z laptop with the new Ivy Bridge architecture (SVZ1311C5E). I&apos;ll talk about this here, because it&apos;s always nice to know that new hardware works</description><pubDate>Sat, 11 Aug 2012 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;I had to install Debian Wheezy on a brand new Sony Vaio Z laptop with the new Ivy Bridge architecture (SVZ1311C5E). I&apos;ll talk about this here, because it&apos;s always nice to know that new hardware works quite fine (or not) under Debian.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/sony-vaio-z-2012-1.jpg&quot; alt=&quot;sony-vaio-z-2012-1&quot; /&gt;&lt;/p&gt;
&lt;p&gt;The laptop is delivered with Windows 7, which I decided to remove entirely and replace with Debian. I installed it with Linux 3.2 and then ran Linux 3.4, 3.5, and 3.6-rc1.&lt;/p&gt;
&lt;h2&gt;USB booting&lt;/h2&gt;
&lt;p&gt;Don&apos;t ask me why, but neither an Ubuntu nor a Debian USB installation booted: it blocked at SYSLINUX at best, or at a black screen. This simply does not work. I had to use PXE to install Debian.&lt;/p&gt;
&lt;h2&gt;Storage&lt;/h2&gt;
&lt;p&gt;The only surprising thing is that the 128 GB SSD storage is actually made of two 64 GB Samsung SSDs aggregated in a RAID 0 using &lt;em&gt;&lt;a href=&quot;http://www.intel.com/p/en_US/support/highlights/chpsts/imsm/&quot;&gt;Intel Rapid Storage Technology&lt;/a&gt;&lt;/em&gt;, previously known as &lt;em&gt;Intel Matrix&lt;/em&gt;. This is supported by Linux via the &lt;em&gt;dm-raid&lt;/em&gt; module. So this is a fake RAID, and you can see both drives as &lt;em&gt;sda&lt;/em&gt; and &lt;em&gt;sdb&lt;/em&gt; under Linux anyway.&lt;/p&gt;
&lt;p&gt;Unfortunately, this kind of RAID is not supported correctly by GRUB, and I was unable to install it this way. Therefore, I decided to remove this fake RAID entirely (which is possible via the BIOS) and use a Linux software &lt;em&gt;md&lt;/em&gt; RAID 0 instead, plus crypto on top of it, a setup that I know well and trust. :)&lt;/p&gt;
&lt;h2&gt;Graphics&lt;/h2&gt;
&lt;p&gt;The Intel HD Graphics 4000 works fine. I&apos;m also using the HDMI output, which works fine too. There were some GPU hangs (as seen on screen and in the kernel logs) with Linux up to 3.4, but with versions 3.5 and above, I haven&apos;t seen any problem so far.&lt;/p&gt;
&lt;h2&gt;Sound&lt;/h2&gt;
&lt;p&gt;The Intel HDA sound card works pretty well, both for playing and recording. The main problem is that I hear a constant noise on the speakers, but tweaking the ALSA mixers stops it at some point. There&apos;s probably still a bug, not yet resolved as of Linux 3.6-rc1.&lt;/p&gt;
&lt;h2&gt;Keyboard&lt;/h2&gt;
&lt;p&gt;The keyboard works fine, and the back-light too, via the &lt;em&gt;sony-laptop&lt;/em&gt; kernel module. Wonderful.&lt;/p&gt;
&lt;h2&gt;Touchpad&lt;/h2&gt;
&lt;p&gt;Touchpad works fine.&lt;/p&gt;
&lt;h2&gt;Fingerprint&lt;/h2&gt;
&lt;p&gt;It does not work, and is not supported according to my research. Not that I care about it, but don&apos;t count on it. It&apos;s an AuthenTec AES1660.&lt;/p&gt;
&lt;h2&gt;Webcam&lt;/h2&gt;
&lt;p&gt;It works perfectly.&lt;/p&gt;
&lt;h2&gt;USB&lt;/h2&gt;
&lt;p&gt;Well, USB 3.0 does not work. I had to disable XHCI in the BIOS and use the two ports as standard USB 2.0; otherwise I would just get errors from the kernel. It&apos;s still not working with Linux 3.6-rc1, I have no clue how to debug it, and I don&apos;t use USB 3.0 yet, so…&lt;/p&gt;
&lt;h2&gt;WiFi&lt;/h2&gt;
&lt;p&gt;The WiFi module (based on iwlwifi) works fine. The only problem with NetworkManager is that the &lt;em&gt;sony-laptop&lt;/em&gt; module offers a second rfkill switch and NM does not know how to handle it correctly. &lt;a href=&quot;https://bugzilla.gnome.org/show_bug.cgi?id=680632&quot;&gt;A bug is open about this&lt;/a&gt; and I hope to be able to write a patch or something at some point. Also, there seems to be some quality issue with the &lt;em&gt;iwlwifi&lt;/em&gt; driver and 802.11n at this point: I&apos;m losing the connection quite often when the signal drops below 40 %. Loading the module with &lt;em&gt;11n_disable=1&lt;/em&gt; helps a lot.&lt;/p&gt;
&lt;h2&gt;Ethernet&lt;/h2&gt;
&lt;p&gt;The gigabit Realtek Ethernet controller works perfectly.&lt;/p&gt;
&lt;h2&gt;Card reader&lt;/h2&gt;
&lt;p&gt;Works perfectly.&lt;/p&gt;
</content:encoded></item><item><title>Ceilometer, the OpenStack metering project</title><link>https://julien.danjou.info/blog/openstack-metering-ceilometer/</link><guid isPermaLink="true">https://julien.danjou.info/blog/openstack-metering-ceilometer/</guid><description>For the last few months, I&apos;ve been working on a metering project for OpenStack, so it&apos;s time to talk a bit about it.</description><pubDate>Fri, 27 Jul 2012 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;For the last few months, I&apos;ve been working on a metering project for &lt;a href=&quot;http://openstack.org&quot;&gt;OpenStack&lt;/a&gt;, so it&apos;s time to talk a bit about it.&lt;/p&gt;
&lt;p&gt;OpenStack is a growing cloud platform providing IaaS. A problem easily identified by everyone building a public cloud platform is that nothing is provided to retrieve the platform&apos;s usage data. Some data are available in some places, but not everything is, and you have to do a lot of processing across the various components to get something useful in the end. But in order to bill the customers using your public cloud platform, you need to do this.&lt;/p&gt;
&lt;p&gt;In this regard, a lot of companies running public OpenStack-based infrastructures wrote their own solution to cover this functional area and be able to bill their customers.&lt;/p&gt;
&lt;p&gt;To avoid everybody building and maintaining such a stack in their own corner, the &lt;a href=&quot;http://launchpad.net/ceilometer&quot;&gt;Ceilometer&lt;/a&gt; project has been created.&lt;/p&gt;
&lt;p&gt;The project aims to cover the metering aspect of the OpenStack components, pulling usage data from every component and storing it in a single place. It then offers a retrieval point for this data via a REST API.&lt;/p&gt;
&lt;p&gt;The &lt;a href=&quot;http://wiki.openstack.org/EfficientMetering&quot;&gt;initial specifications&lt;/a&gt; were written in April this year, and the actual implementation started in May. The project is currently worked on by me, DreamHost, and Canonical.&lt;/p&gt;
&lt;p&gt;We already have designed &lt;a href=&quot;http://wiki.openstack.org/EfficientMetering/ArchitectureProposalV1&quot;&gt;an architecture&lt;/a&gt; that we are implementing, and we hope to release a first usable version with Folsom.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/ceilometer-architecture-1.png&quot; alt=&quot;ceilometer-architecture-1&quot; /&gt;&lt;/p&gt;
&lt;p&gt;I did a presentation of this project yesterday at &lt;a href=&quot;http://xlcloud.org/&quot;&gt;XLCloud&lt;/a&gt;, which has been very well received.&lt;/p&gt;
&lt;p&gt;If you are interested in helping us and contributing, feel free to join us during one of our &lt;a href=&quot;http://wiki.openstack.org/Meetings/MeteringAgenda&quot;&gt;weekly IRC meeting&lt;/a&gt; or fix &lt;a href=&quot;https://bugs.launchpad.net/ceilometer&quot;&gt;some bugs&lt;/a&gt;. :-)&lt;/p&gt;
</content:encoded></item><item><title>Emacs configuration published</title><link>https://julien.danjou.info/blog/emacs-configuration-published/</link><guid isPermaLink="true">https://julien.danjou.info/blog/emacs-configuration-published/</guid><description>I&apos;ve finally published my Emacs configuration.  This took me a while, since I had personal information inside (like passwords). Recently, I&apos;ve been able to move them away and can now publish everythin</description><pubDate>Tue, 24 Jul 2012 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;I&apos;ve finally published my Emacs configuration.&lt;/p&gt;
&lt;p&gt;This took me a while, since I had personal information inside (like passwords). Recently, I&apos;ve been able to move them away and can now publish everything in my &lt;a href=&quot;https://github.com/jd/emacs.d.git&quot;&gt;Git repository&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;It&apos;s probably not yet usable from scratch, since I didn&apos;t include the bootstrap code for &lt;a href=&quot;http://github.com/dimitri/el-get&quot;&gt;el-get&lt;/a&gt;. But you can at least lurk and grab some ideas or lines of code. And do not hesitate to ask me anything about it!&lt;/p&gt;
&lt;p&gt;Note that I&apos;m using Emacs development version (trunk), so it&apos;s possible that some things do not work with (old) released Emacs versions.&lt;/p&gt;
</content:encoded></item><item><title>ERC notifications</title><link>https://julien.danjou.info/blog/erc-notifications/</link><guid isPermaLink="true">https://julien.danjou.info/blog/erc-notifications/</guid><description>Today, I&apos;ve merged my erc notifications module inside Emacs trunk. This way, it will be available for everybody in Emacs 24.2.</description><pubDate>Sat, 21 Jul 2012 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Today, &lt;a href=&quot;http://bzr.savannah.gnu.org/lh/emacs/trunk/revision/109176&quot;&gt;I&apos;ve merged my erc notifications module&lt;/a&gt; inside Emacs trunk. This way, it will be available for everybody in Emacs 24.2.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/erc-notifications-1.png&quot; alt=&quot;erc-notifications-1&quot; /&gt;&lt;/p&gt;
&lt;p&gt;This module allows you to be notified via &lt;code&gt;notifications-notify&lt;/code&gt; (the Emacs implementation of the &lt;a href=&quot;http://www.galago-project.org/specs/notification/&quot;&gt;Freedesktop desktop notifications&lt;/a&gt;) on private messages received on IRC, or when your nickname is mentioned on a channel.&lt;/p&gt;
&lt;p&gt;To enable it in Emacs &amp;gt; 24.1, you just have to add the following line to your configuration file:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;(add-to-list &apos;erc-modules &apos;notifications)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;If you want to download it and use it stand-alone for a previous Emacs version, you can &lt;a href=&quot;http://bzr.savannah.gnu.org/lh/emacs/trunk/annotate/head:/lisp/erc/erc-notifications.el&quot;&gt;fetch the latest file revision&lt;/a&gt; and load it before adding the previously given line.&lt;/p&gt;
</content:encoded></item><item><title>Logitech K750 keyboard and Unifying Receiver Linux support</title><link>https://julien.danjou.info/blog/logitech-k750-linux-support/</link><guid isPermaLink="true">https://julien.danjou.info/blog/logitech-k750-linux-support/</guid><description>A year ago, I bought a Logitech Wireless Solar Keyboard K750. I&apos;m particularly picky about keyboards, but this one is good.</description><pubDate>Mon, 09 Jul 2012 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;A year ago, I bought a &lt;a href=&quot;http://www.logitech.com/keyboards/keyboards/k750-keyboard&quot;&gt;Logitech Wireless Solar Keyboard K750&lt;/a&gt;. I&apos;m particularly picky about keyboards, but this one is good. It has an incredibly useful feature: while being wireless, it needs no disposable or rechargeable batteries: it uses solar power!&lt;/p&gt;
&lt;p&gt;My problem is that there&apos;s obviously no way to know the battery status from Linux, since the provided application only works on Windows.&lt;/p&gt;
&lt;p&gt;And one dark night, while fragging on QuakeLive, my keyboard stopped working: it had no battery left. This activity being quite energy consuming, it emptied the whole battery.&lt;/p&gt;
&lt;p&gt;Someone should write code to get the battery status and light meter from Linux: challenge accepted!&lt;/p&gt;
&lt;h2&gt;How the keyboard works&lt;/h2&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/logitech-unifying.jpg&quot; alt=&quot;logitech-unifying&quot; /&gt;&lt;/p&gt;
&lt;p&gt;This keyboard, like many of the new wireless devices from Logitech, uses the &lt;em&gt;Unifying&lt;/em&gt; interface. It&apos;s a USB receiver that can be paired with up to six different devices (mice, keyboards…). On old Linux kernels, the &lt;em&gt;Unifying&lt;/em&gt; receiver is recognized as only one keyboard and/or one mouse device.&lt;/p&gt;
&lt;p&gt;Recently, a driver called &lt;em&gt;hid-logitech-dj&lt;/em&gt; has been added to the Linux kernel. With this driver, each device attached to the receiver is recognized as a separate device.&lt;/p&gt;
&lt;h2&gt;What the Logitech application does&lt;/h2&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/logitech-solar-app.png&quot; alt=&quot;logitech-solar-app&quot; /&gt;&lt;/p&gt;
&lt;p&gt;The Logitech application under Windows works this way: you launch it, and it displays the battery charge level. On the keyboard, there&apos;s a special &quot;light&quot; button (top right). When pressed, a LED lights up on the keyboard: green if the keyboard is receiving enough light and is charging, red if the keyboard does not receive enough light and is therefore discharging. Pushing this same button while the application is running activates the light meter: the application will tell you how many &lt;a href=&quot;http://en.wikipedia.org/wiki/Lux&quot;&gt;lux&lt;/a&gt; your keyboard is receiving.&lt;/p&gt;
&lt;h2&gt;Let&apos;s reverse engineer this&lt;/h2&gt;
&lt;p&gt;As far as I know, there&apos;s nothing in the USB HID protocol that handles this kind of functionality (battery status, light meter…) in a standard way. So the first task to accomplish is, unfortunately, to reverse engineer the program.&lt;/p&gt;
&lt;p&gt;I discovered a bit too late that &lt;a href=&quot;http://www.youtube.com/watch?v=jMf55KVDPaE&quot;&gt;Drew Fisher did a good presentation on USB reverse engineering at 28c3&lt;/a&gt;. You might want to take a look at it if you want to reverse engineer USB devices. I did not need it, but I learned a few things.&lt;/p&gt;
&lt;p&gt;Anyway, my plan was the following: run the Logitech application inside a virtual machine running Windows, give it direct access to the USB keyboard, and sniff what happens on the USB wire.&lt;/p&gt;
&lt;p&gt;To achieve that, you need a virtual machine emulator that can do USB pass-through. Both &lt;a href=&quot;http://www.linux-kvm.org/page/Main_Page&quot;&gt;KVM&lt;/a&gt; and &lt;a href=&quot;https://www.virtualbox.org/&quot;&gt;VirtualBox&lt;/a&gt; can do that, but VirtualBox works much better with USB and allows hot-(un)plugging of devices, so I used it.&lt;/p&gt;
&lt;p&gt;To sniff what happens on the USB bus, you need to load the &lt;em&gt;usbmon&lt;/em&gt; Linux kernel module. Simply doing &lt;code&gt;modprobe usbmon&lt;/code&gt; will work. You can then use &lt;a href=&quot;http://www.wireshark.org/&quot;&gt;Wireshark&lt;/a&gt;, which knows how to use &lt;em&gt;usbmon&lt;/em&gt; devices and understands the USB protocol.&lt;/p&gt;
&lt;h3&gt;USB stuff you need to know&lt;/h3&gt;
&lt;p&gt;You don&apos;t need to know much about USB to understand what I&apos;ll write about below, but for the sake of comprehensibility I&apos;ll write a couple of things here before jumping in.&lt;/p&gt;
&lt;p&gt;To communicate with a USB device, we communicate with one of its &lt;em&gt;endpoints&lt;/em&gt;. Endpoints are grouped into an &lt;em&gt;interface&lt;/em&gt;. Interfaces are grouped into a &lt;em&gt;configuration&lt;/em&gt;. A device might contain one or several configurations.&lt;/p&gt;
&lt;p&gt;There are also several types of packets in the USB wire protocol, and at least two of them interest us here:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Interrupt packets, sent spontaneously;&lt;/li&gt;
&lt;li&gt;Control packets, used for command and status operations.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;All of this and more is well (and better) explained in the &lt;a href=&quot;http://lwn.net/images/pdf/LDD3/ch13.pdf&quot;&gt;chapter 13&lt;/a&gt; of &lt;a href=&quot;http://lwn.net/Kernel/LDD3/&quot;&gt;Linux Device Drivers, Third Edition&lt;/a&gt;.&lt;/p&gt;
&lt;h3&gt;Sniffed data&lt;/h3&gt;
&lt;p&gt;Once everything was set up, I ran my beloved Wireshark. There&apos;s an URB of type &lt;em&gt;interrupt&lt;/em&gt;, with some data in it, sent each time you press any key. Therefore I advise you to plug in another keyboard (or use the laptop keyboard if you&apos;re doing this on a laptop); otherwise you&apos;ll go crazy trying to sniff the keyboard you&apos;re typing on.&lt;/p&gt;
&lt;p&gt;At this point, just launching the application generates a bunch of USB traffic. Pressing the &quot;light&quot; button on the keyboard makes even more USB packets come in and out. Here are the interesting packets that I noticed once I excluded the noise:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;When pressing the &quot;light&quot; button, an URB of type &lt;em&gt;interrupt&lt;/em&gt; is sent by the keyboard to the computer;&lt;/li&gt;
&lt;li&gt;An URB &lt;em&gt;control&lt;/em&gt; packet is sent by the computer to the keyboard in response;&lt;/li&gt;
&lt;li&gt;URB &lt;em&gt;interrupt&lt;/em&gt; packets are then sent regularly.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;With all this, the next step was clear: understand the packets and reproduce that exchange under Linux.&lt;/p&gt;
&lt;h3&gt;What the packets mean&lt;/h3&gt;
&lt;h4&gt;The &quot;go for the light meter&quot; packet&lt;/h4&gt;
&lt;p&gt;The packet sent from the computer to the keyboard is the following.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;Frame 17: 71 bytes on wire (568 bits), 71 bytes captured (568 bits)
    Frame Length: 71 bytes (568 bits)
    Capture Length: 71 bytes (568 bits)
USB URB
    URB id: 0xffff880161997240
    URB type: URB_SUBMIT (&apos;S&apos;)
    URB transfer type: URB_CONTROL (0x02)
    Endpoint: 0x00, Direction: OUT
        0... .... = Direction: OUT (0)
        .000 0000 = Endpoint value: 0
    Device: 6
    URB bus id: 1
    Device setup request: relevant (0)
    Data: present (0)
    URB sec: 1340124450
    URB usec: 495643
    URB status: Operation now in progress (-EINPROGRESS) (-115)
    URB length [bytes]: 7
    Data length [bytes]: 7
    [Response in: 18]
    [bInterfaceClass: HID (0x03)]
    URB setup
        bmRequestType: 0x21
            0... .... = Direction: Host-to-device
            .01. .... = Type: Class (0x01)
            ...0 0001 = Recipient: Interface (0x01)
    bRequest: SET_REPORT (0x09)
    wValue: 0x0210
        ReportID: 16
        ReportType: Output (2)
    wIndex: 2
    wLength: 7
0000  40 72 99 61 01 88 ff ff 53 02 00 06 01 00 00 00   @r.a....S.......
0010  22 ad e0 4f 00 00 00 00 1b 90 07 00 8d ff ff ff   &quot;..O............
0020  07 00 00 00 07 00 00 00 21 09 10 02 02 00 07 00   ........!.......
0030  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00   ................
0040  10 01 09 03 78 01 00                              ....x..
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;What&apos;s interesting here is the last part, representing the data. &lt;em&gt;wLength&lt;/em&gt; says that the length of the data is 7 bytes, so let&apos;s take a look at those 7 bytes: &lt;code&gt;10 01 09 03 78 01 00&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;Well, actually, you can&apos;t decode them like that, unless you&apos;re a freak or a Logitech engineer. And I actually have no idea what they mean. But sending this to the keyboard triggers an interesting thing: the keyboard will start sending URB interrupts with some data without you pressing any more keys.&lt;/p&gt;
&lt;h4&gt;The &quot;light meter and battery values&quot; packet&lt;/h4&gt;
&lt;p&gt;This is the most interesting packet: the one sent by the keyboard to the host, containing the data we want to retrieve.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;Frame 1467: 84 bytes on wire (672 bits), 84 bytes captured (672 bits)
    Frame Length: 84 bytes (672 bits)
    Capture Length: 84 bytes (672 bits)
USB URB
    URB id: 0xffff88010c43c380
    URB type: URB_COMPLETE (&apos;C&apos;)
    URB transfer type: URB_INTERRUPT (0x01)
    Endpoint: 0x83, Direction: IN
        1... .... = Direction: IN (1)
        .000 0011 = Endpoint value: 3
    Device: 2
    URB bus id: 6
    Device setup request: not relevant (&apos;-&apos;)
    Data: present (0)
    URB sec: 1334953309
    URB usec: 728740
    URB status: Success (0)
    URB length [bytes]: 20
    Data length [bytes]: 20
    [Request in: 1466]
    [Time from request: 0.992374000 seconds]
    [bInterfaceClass: Unknown (0xffff)]
Leftover Capture Data: 1102091039000c061d474f4f4400000000000000

0000  80 c3 43 0c 01 88 ff ff 43 01 83 02 06 00 2d 00   ..C.....C.....-.
0010  5d c5 91 4f 00 00 00 00 a4 1e 0b 00 00 00 00 00   ]..O............
0020  14 00 00 00 14 00 00 00 00 00 00 00 00 00 00 00   ................
0030  02 00 00 00 00 00 00 00 00 02 00 00 00 00 00 00   ................
0040  11 02 09 10 39 00 0c 06 1d 47 4f 4f 44 00 00 00   ....9....GOOD...
0050  00 00 00 00                                       ....
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;These packets come in regularly (one per second) on the wire for some time once you&apos;ve sent the &quot;go for the light meter&quot; packet. At some point they are emitted less often and do not contain the light meter value anymore, suggesting that the control packet sent earlier activates the light meter for a defined period.&lt;/p&gt;
&lt;p&gt;Now you probably wonder where the data are in this. They&apos;re in the 20 leftover bytes of the capture data, indicated by Wireshark at the end of the packet: &lt;code&gt;11 02 09 10 39 00 0c 06 1d 47 4f 4f 44 00 00 00 00 00 00 00&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;Fortunately, it was easy to decode. Knowing we&apos;re looking for two values (battery charge and light meter), we just need to observe and compare the packets emitted on the wire with the values displayed by the Logitech Solar App.&lt;/p&gt;
&lt;p&gt;To achieve this, I looked at both the &lt;em&gt;Logitech Solar App&lt;/em&gt; and &lt;em&gt;Wireshark&lt;/em&gt; while bringing more and more light near the keyboard, increasing the lux value received by the meter in the Solar App, and saw that certain fields (see below) were changing in Wireshark. Since two bytes were changing, I guessed the value was coded on 16 bits, and it was therefore easy to correlate it with the Solar App.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;[ ....9....GOOD....... ]
11 02 09 10 39 00 0c 06 1d 47 4f 4f 44 00 00 00 00 00 00 00
4 bytes - 1 byte for battery charge - 2 bytes for light meter - 2 bytes - 4 bytes for GOOD - 7 bytes
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In this example, the battery has a charge of &lt;code&gt;0x39 = 57 %&lt;/code&gt; and the light meter receives &lt;code&gt;0x0c = 12 lux&lt;/code&gt; of light. It&apos;s basically dark, and that makes sense: it was night and the light was off in my office, the only light being the one coming from my screen.&lt;/p&gt;
&lt;p&gt;I&apos;ve no idea what the &lt;code&gt;GOOD&lt;/code&gt; part of the packet is about, but it&apos;s present in every packet and is actually very handy for recognizing such packets. Therefore I&apos;m treating it as a useful marker for now.&lt;/p&gt;
&lt;p&gt;The other bytes were always the same (&lt;code&gt;0x11 0x02 0x09 0x10&lt;/code&gt; at the beginning, seven &lt;code&gt;0x00&lt;/code&gt; at the end). The 2 bytes between the light meter and &lt;code&gt;GOOD&lt;/code&gt; probably mean something, but I&apos;ve no idea what yet.&lt;/p&gt;
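&lt;p&gt;As a sanity check, here is a minimal decoding sketch (in Python, my own illustration rather than part of the original tooling) that applies the byte layout above: it verifies the &lt;code&gt;GOOD&lt;/code&gt; marker and extracts the battery charge and the 16-bit big-endian light value.&lt;/p&gt;

```python
def decode_solar_packet(payload):
    """Decode the 20-byte solar status payload described above."""
    if len(payload) != 20 or payload[9:13] != b"GOOD":
        raise ValueError("not a solar status packet")
    battery = payload[4]                       # charge, in percent
    lux = int.from_bytes(payload[5:7], "big")  # 16-bit light level
    return battery, lux

# The capture from the article: battery 0x39 = 57 %, light 0x000c = 12 lux
packet = bytes.fromhex("1102091039000c061d474f4f4400000000000000")
print(decode_solar_packet(packet))  # (57, 12)
```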
&lt;h2&gt;Building our solar app&lt;/h2&gt;
&lt;p&gt;Now we&apos;ve got enough information to build our own very basic solar application: we know how to trigger the light meter, and we know how to decode the packets.&lt;/p&gt;
&lt;p&gt;We&apos;re going to write a small application using &lt;a href=&quot;http://www.libusb.org/&quot;&gt;libusb&lt;/a&gt;. Here&apos;s a quick example. It&apos;s not perfect and does not check error codes, so be careful.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;/* Written by Julien Danjou &amp;lt;julien@danjou.info&amp;gt; in 2012 */

#include &amp;lt;linux/hid.h&amp;gt;

#include &amp;lt;libusb.h&amp;gt;
#include &amp;lt;stdio.h&amp;gt;
#include &amp;lt;stdlib.h&amp;gt;
#include &amp;lt;string.h&amp;gt;

int main(void)
{
    libusb_context *ctx;
    libusb_init(&amp;amp;ctx);
    libusb_set_debug(ctx, 3);

    /* Look at the keyboard based on vendor and device id */
    libusb_device_handle *device_handle = libusb_open_device_with_vid_pid(ctx, 0x046d, 0xc52b);

    fprintf(stderr, &quot;Found keyboard 0x%p\n&quot;, device_handle);

    libusb_device *device = libusb_get_device(device_handle);

    struct libusb_device_descriptor desc;

    libusb_get_device_descriptor(device, &amp;amp;desc);

    for(uint8_t config_index = 0; config_index &amp;lt; desc.bNumConfigurations; config_index++)
    {
        struct libusb_config_descriptor *config;

        libusb_get_config_descriptor(device, config_index, &amp;amp;config);

        /* We know we want interface 2 */
        int iface_index = 2;
        const struct libusb_interface *iface = &amp;amp;config-&amp;gt;interface[iface_index];

        for (int altsetting_index = 0; altsetting_index &amp;lt; iface-&amp;gt;num_altsetting; altsetting_index++)
        {
            const struct libusb_interface_descriptor *iface_desc = &amp;amp;iface-&amp;gt;altsetting[altsetting_index];

            if (iface_desc-&amp;gt;bInterfaceClass == LIBUSB_CLASS_HID)
            {
                libusb_detach_kernel_driver(device_handle, iface_index);
                libusb_claim_interface(device_handle, iface_index);

                unsigned char ret[65535];

                unsigned char payload[] = &quot;\x10\x01\x09\x03\x78\x01\x00&quot;;

                if(libusb_control_transfer(device_handle,
                                           LIBUSB_REQUEST_TYPE_CLASS | LIBUSB_RECIPIENT_INTERFACE,
                                           HID_REQ_SET_REPORT,
                                           0x0210, iface_index, payload, sizeof(payload) - 1, 10000))
                {
                    int actual_length = 0;

                    while(actual_length != 20 || strncmp((const char *) &amp;amp;ret[9], &quot;GOOD&quot;, 4))
                        libusb_interrupt_transfer(device_handle,
                                                  iface_desc-&amp;gt;endpoint[0].bEndpointAddress,
                                                  ret, sizeof(ret), &amp;amp;actual_length, 100000);

                    uint16_t lux = ret[5] &amp;lt;&amp;lt; 8 | ret[6];

                    fprintf(stderr, &quot;Charge: %d %%\nLight: %d lux\n&quot;, ret[4], lux);
                }

                libusb_release_interface(device_handle, iface_index);
                libusb_attach_kernel_driver(device_handle, iface_index);
            }
        }
    }

    libusb_close(device_handle);
    libusb_exit(ctx);
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The program does the following:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Open the Unifying Receiver device based on vendor and product ID&lt;/li&gt;
&lt;li&gt;Get the HID interface&lt;/li&gt;
&lt;li&gt;Detach the HID interface from the kernel driver&lt;/li&gt;
&lt;li&gt;Claim the interface&lt;/li&gt;
&lt;li&gt;Send a control packet whose parameters are the same data we captured earlier&lt;/li&gt;
&lt;li&gt;Read interrupt packets coming in until we receive one we recognize (length 20 containing the &quot;GOOD&quot; string)&lt;/li&gt;
&lt;li&gt;Decode the content (battery charge &amp;amp; light meter)&lt;/li&gt;
&lt;li&gt;Release the interface&lt;/li&gt;
&lt;li&gt;Reattach the kernel driver to the interface&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This gives the following:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;Found keyboard 0x0x24ec8e0
Charge: 64 %
Light: 21 lux
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Challenge accomplished!&lt;/p&gt;
&lt;h2&gt;To be continued&lt;/h2&gt;
&lt;p&gt;Unfortunately, this approach has at least one major drawback. We have to disconnect the &lt;em&gt;Logitech Unifying Receiver&lt;/em&gt; from the kernel. That means that while we&apos;re waiting for the packet, we&apos;re dropping packets corresponding to other events from every connected device (key presses, pointer motions…).&lt;/p&gt;
&lt;p&gt;In order to solve that, I sent a request for help on the &lt;a href=&quot;http://vger.kernel.org/vger-lists.html#linux-input&quot;&gt;linux-input&lt;/a&gt; mailing list. That way, I learned that Logitech is actually using the HID++ protocol to communicate with the devices using the Unifying Receiver. &lt;a href=&quot;http://6xq.net&quot;&gt;Lars-Dominik Braun&lt;/a&gt; managed to get the HID++ specifications from Logitech and &lt;a href=&quot;http://6xq.net/git/lars/lshidpp.git/plain/doc/logitech_hidpp_2.0_specification_draft_2012-06-04.pdf&quot;&gt;published them&lt;/a&gt; with their authorization.&lt;/p&gt;
&lt;p&gt;This opens a whole new world. With that document, I may be able to understand the part I reverse engineered and convert this to a more useful and generic library using the hidraw interface (so we don&apos;t have to disconnect the devices from the kernel driver).&lt;/p&gt;
</content:encoded></item><item><title>Making the jump: working freelance</title><link>https://julien.danjou.info/blog/making-the-jump/</link><guid isPermaLink="true">https://julien.danjou.info/blog/making-the-jump/</guid><description>For the last 10 years, I&apos;ve been working on many Free Software projects. From Debian to OpenStack, through awesome, Emacs, XCB and many more.</description><pubDate>Mon, 02 Jul 2012 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;For the last 10 years, I&apos;ve been working on many Free Software projects. From &lt;a href=&quot;http://www.debian.org&quot;&gt;Debian&lt;/a&gt; to &lt;a href=&quot;http://openstack.org&quot;&gt;OpenStack&lt;/a&gt;, through &lt;a href=&quot;http://awesome.naquadah.org&quot;&gt;awesome&lt;/a&gt;, &lt;a href=&quot;http://www.gnu.org/software/emacs/&quot;&gt;Emacs&lt;/a&gt;, &lt;a href=&quot;http://xcb.freedesktop.org&quot;&gt;XCB&lt;/a&gt; and many more. This obviously allowed me to enhance my technical skills, but it also taught me about Free Software and Open Source development processes, and how to work with and close to the community.&lt;/p&gt;
&lt;p&gt;Working for almost 6 years at &lt;a href=&quot;http://easter-eggs.com&quot;&gt;Easter-eggs&lt;/a&gt; taught me how to work in an autonomous manner, how to lead and manage a project. And how to run a company, thanks to the cooperative status of this great one.&lt;/p&gt;
&lt;p&gt;These are the reasons why I decided to leave my latest job and run my own company to work as a freelance consultant &amp;amp; developer specialized in Free Software, starting today.&lt;/p&gt;
&lt;p&gt;Therefore, I am now able and available to provide expertise and development on Free Software, including upstream contributions, especially on projects I have worked on recently, like &lt;a href=&quot;http://www.openstack.org&quot;&gt;OpenStack&lt;/a&gt;.&lt;/p&gt;
</content:encoded></item><item><title>How to make Twitter&apos;s Bootstrap tabs bookmarkable</title><link>https://julien.danjou.info/blog/twitter-bootstrap-tabs-bookmark/</link><guid isPermaLink="true">https://julien.danjou.info/blog/twitter-bootstrap-tabs-bookmark/</guid><description>I&apos;ve been using Twitter&apos;s bootstrap library recently to build this Web site, and wondered how to be able to use the bootstrap-tab Javascript plugin in a bookmark friendly manner.</description><pubDate>Fri, 29 Jun 2012 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;I&apos;ve been using &lt;a href=&quot;http://twitter.github.com/bootstrap/&quot;&gt;Twitter&apos;s bootstrap&lt;/a&gt; library recently to build this Web site, and wondered how to be able to use &lt;a href=&quot;http://twitter.github.com/bootstrap/javascript.html#tabs&quot;&gt;the bootstrap-tab&lt;/a&gt; Javascript plugin in a bookmark friendly manner.&lt;/p&gt;
&lt;p&gt;I ended up with a simple solution. These are my first steps in Javascript and front-end manipulation, and it&apos;s really not my area of expertise, so don&apos;t be harsh.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;function bootstrap_tab_bookmark (selector) {
    if (selector == undefined) {
        selector = &quot;&quot;;
    }

    /* Automagically jump on good tab based on anchor */
    $(document).ready(function() {
        url = document.location.href.split(&apos;#&apos;);
        if(url[1] != undefined) {
            $(selector + &apos;[href=#&apos;+url[1]+&apos;]&apos;).tab(&apos;show&apos;);
        }
    });

    var update_location = function (event) {
        document.location.hash = this.getAttribute(&quot;href&quot;);
    }

    /* Update hash based on tab */
    $(selector + &quot;[data-toggle=pill]&quot;).click(update_location);
    $(selector + &quot;[data-toggle=tab]&quot;).click(update_location);
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;All you need is to call this function with a selector (only useful if you have several tab/pill divisions) when the document is ready.&lt;/p&gt;
&lt;p&gt;The first part takes care of showing the right tab based on the hash contained in the URL. The second part updates the document location to include the current tab when the user clicks.&lt;/p&gt;
</content:encoded></item><item><title>OpenStack Swift eventual consistency analysis &amp; bottlenecks</title><link>https://julien.danjou.info/blog/openstack-swift-consistency-analysis/</link><guid isPermaLink="true">https://julien.danjou.info/blog/openstack-swift-consistency-analysis/</guid><description>Swift is the software behind the OpenStack Object Storage service.</description><pubDate>Mon, 23 Apr 2012 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;a href=&quot;https://launchpad.net/swift&quot;&gt;Swift&lt;/a&gt; is the software behind the &lt;a href=&quot;http://openstack.org/projects/storage/&quot;&gt;OpenStack Object Storage&lt;/a&gt; service.&lt;/p&gt;
&lt;p&gt;This service provides a simple storage service for applications using &lt;a href=&quot;http://docs.openstack.org/api/openstack-object-storage/1.0/content/&quot;&gt;RESTful interfaces&lt;/a&gt;, providing maximum data availability and storage capacity.&lt;/p&gt;
&lt;p&gt;I explain here how some parts of the storage and replication in Swift works, and show some of its current limitations.&lt;/p&gt;
&lt;p&gt;If you don&apos;t know Swift and want to read a more &quot;shallow&quot; overview first, you can read John Dickinson&apos;s &lt;a href=&quot;http://programmerthoughts.com/openstack/swift-tech-overview/&quot;&gt;Swift Tech Overview&lt;/a&gt;.&lt;/p&gt;
&lt;h2&gt;How Swift storage works&lt;/h2&gt;
&lt;p&gt;If we refer to the &lt;a href=&quot;http://en.wikipedia.org/wiki/CAP_theorem&quot;&gt;CAP theorem&lt;/a&gt;, Swift chose &lt;strong&gt;availability&lt;/strong&gt; and &lt;strong&gt;partition tolerance&lt;/strong&gt; and dropped &lt;strong&gt;consistency&lt;/strong&gt;. That means that you&apos;ll always get your data, and it will be dispersed across many places, but you could get an old version of it (or no data at all) in some odd cases (like a server overload or failure). This compromise is made to allow maximum availability and scalability of the storage platform.&lt;/p&gt;
&lt;p&gt;But there are mechanisms built into Swift to minimize the potential data inconsistency window: they are responsible for data replication and consistency.&lt;/p&gt;
&lt;p&gt;The &lt;a href=&quot;http://swift.openstack.org/&quot;&gt;official Swift documentation&lt;/a&gt; explains the internal storage in its own way, but I&apos;m going to give my own explanation here.&lt;/p&gt;
&lt;h3&gt;Consistent hashing&lt;/h3&gt;
&lt;p&gt;Swift uses the principle of &lt;a href=&quot;http://en.wikipedia.org/wiki/Consistent_hashing&quot;&gt;consistent hashing&lt;/a&gt;. It builds what it calls a &lt;em&gt;ring&lt;/em&gt;. A ring represents the space of all possible computed hash values divided in equivalent parts. Each part of this space is called a &lt;em&gt;partition&lt;/em&gt;.&lt;/p&gt;
&lt;p&gt;The following schema (stolen from the &lt;a href=&quot;http://wiki.basho.com/&quot;&gt;Riak&lt;/a&gt; project) shows the principle nicely:&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/riak-ring.png&quot; alt=&quot;riak-ring&quot; /&gt;&lt;/p&gt;
&lt;p&gt;In a simple world, if you wanted to store some objects and distribute them on 4 nodes, you would split your hash space in 4. You would have 4 partitions, and computing &lt;em&gt;hash(object) modulo 4&lt;/em&gt; would tell you where to store your object: on node 0, 1, 2 or 3.&lt;/p&gt;
&lt;p&gt;But since you want to be able to extend your storage cluster to more nodes without breaking the whole hash mapping and moving everything around, you need to build a lot more partitions. Let&apos;s say we&apos;re going to build 2&lt;sup&gt;10&lt;/sup&gt; partitions. Since we have 4 nodes, each node will have &lt;code&gt;2^10 ÷ 4 = 256&lt;/code&gt; partitions. If we ever want to add a 5th node, it&apos;s easy: we just have to re-balance the partitions and move 1⁄5 of the partitions from each node to this 5th node. That means all our nodes will end up with &lt;code&gt;2^10 ÷ 5 ≈ 204&lt;/code&gt; partitions. We can also define a &lt;em&gt;weight&lt;/em&gt; for each node, in order for some nodes to get more partitions than others.&lt;/p&gt;
&lt;p&gt;With 2&lt;sup&gt;10&lt;/sup&gt; partitions, we can have up to 2&lt;sup&gt;10&lt;/sup&gt; nodes in our cluster. Yeepee.&lt;/p&gt;
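&lt;p&gt;The partition arithmetic above is easy to check with a few lines of Python (plain integer math, not Swift&apos;s actual ring builder):&lt;/p&gt;

```python
PARTITIONS = 2 ** 10  # the ring split into 1024 partitions

print(PARTITIONS // 4)  # 256 partitions per node with 4 nodes
print(PARTITIONS // 5)  # 204 partitions per node once a 5th node is added

# Rebalancing only moves what the new node takes over: 1/5 of the ring.
print(round((PARTITIONS // 5) / PARTITIONS, 2))  # 0.2
```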
&lt;p&gt;For reference, Gregory Holt, one of the Swift authors, also wrote &lt;a href=&quot;http://greg.brim.net/page/building_a_consistent_hashing_ring.html&quot;&gt;an explanation post about the ring&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Concretely, when building a Swift ring, you&apos;ll have to say how many partitions you want, and this is what that value is really about.&lt;/p&gt;
&lt;h3&gt;Data duplication&lt;/h3&gt;
&lt;p&gt;Now, to assure availability and partition tolerance (as seen in the &lt;em&gt;CAP theorem&lt;/em&gt;) we also want to store replicas of our objects. By default, Swift stores 3 copies of every object, but that&apos;s configurable.&lt;/p&gt;
&lt;p&gt;In that case, we need to store each partition defined above not only on 1 node, but on 2 others as well. So Swift adds another concept: zones. A zone is an isolated space that does not depend on the other zones, so in case of an outage in one zone, the other zones are still available. Concretely, a zone is likely to be a disk, a server, or a whole cabinet, depending on the size of your cluster. It&apos;s up to you to choose anyway.&lt;/p&gt;
&lt;p&gt;Consequently, each partition is no longer mapped to 1 node only, but to N nodes. Each node will therefore store this number of partitions:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;number of partitions stored on one node = number of replicas × total number of partitions ÷ number of nodes
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Examples:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;We split the ring in 2&lt;sup&gt;10&lt;/sup&gt; = 1024 partitions. We have 3 nodes. We want 3 replicas of data.&lt;br /&gt;
→ Each node will store a copy of the full partition space: &lt;code&gt;3 × 2^10 ÷ 3 = 2^10 = 1024 partitions&lt;/code&gt;.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;blockquote&gt;
&lt;p&gt;We split the ring in 2&lt;sup&gt;11&lt;/sup&gt; = 2048 partitions. We have 5 nodes. We want 3 replicas of data.&lt;br /&gt;
→ Each node will store &lt;code&gt;2^11 × 3 ÷ 5 ≈ 1229 partitions&lt;/code&gt;.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;blockquote&gt;
&lt;p&gt;We split the ring in 2&lt;sup&gt;11&lt;/sup&gt; = 2048 partitions. We have 6 nodes. We want 3 replicas of data.&lt;br /&gt;
→ Each node will store &lt;code&gt;2^11 × 3 ÷ 6 = 1024 partitions&lt;/code&gt;.&lt;/p&gt;
&lt;/blockquote&gt;
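&lt;p&gt;The formula and the three examples above can be reproduced with a short Python sketch (simple arithmetic, ignoring the per-node weights Swift also supports):&lt;/p&gt;

```python
def partitions_on_node(replicas, total_partitions, nodes):
    # number of partitions stored on one node =
    # number of replicas x total number of partitions / number of nodes
    return replicas * total_partitions // nodes

print(partitions_on_node(3, 2 ** 10, 3))  # 1024: a full copy of the ring
print(partitions_on_node(3, 2 ** 11, 5))  # 1228 (6144 divided by 5, truncated)
print(partitions_on_node(3, 2 ** 11, 6))  # 1024
```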
&lt;h3&gt;Three rings to rule them all&lt;/h3&gt;
&lt;p&gt;In Swift, there are 3 categories of things to store: &lt;em&gt;accounts&lt;/em&gt;, &lt;em&gt;containers&lt;/em&gt; and &lt;em&gt;objects&lt;/em&gt;.&lt;/p&gt;
&lt;p&gt;An &lt;strong&gt;account&lt;/strong&gt; is what you&apos;d expect it to be, a user account. An account contains &lt;strong&gt;containers&lt;/strong&gt; (the equivalent of Amazon S3&apos;s buckets). Each container can contain user-defined keys and values (just like a hash table or a dictionary): the values are what Swift calls &lt;strong&gt;objects&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;Swift wants you to build 3 different and independent rings to store its 3 kinds of things (&lt;em&gt;accounts&lt;/em&gt;, &lt;em&gt;containers&lt;/em&gt; and &lt;em&gt;objects&lt;/em&gt;).&lt;/p&gt;
&lt;p&gt;Internally, the first two categories are stored as &lt;a href=&quot;http://www.sqlite.org/&quot;&gt;SQLite&lt;/a&gt; databases, whereas the last one is stored using regular files.&lt;/p&gt;
&lt;p&gt;Note that these 3 rings can be stored and managed on 3 completely different sets of servers.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/openstack-swift-storage-1.png&quot; alt=&quot;openstack-swift-storage-1&quot; /&gt;&lt;/p&gt;
&lt;h2&gt;Data replication&lt;/h2&gt;
&lt;p&gt;Now that we have our storage theory in place (accounts, containers and objects distributed into partitions, themselves stored in multiple zones), let&apos;s move on to the replication practice.&lt;/p&gt;
&lt;p&gt;When you put something in one of the 3 rings (be it an account, a container or an object), it is uploaded to all the zones responsible for the ring partition the object belongs to. This upload to the different zones is the responsibility of the &lt;em&gt;swift-proxy&lt;/em&gt; daemon.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/openstack-swift-replication.png&quot; alt=&quot;openstack-swift-replication&quot; /&gt;&lt;/p&gt;
&lt;p&gt;But if one of the zones is failing, you can&apos;t upload all your copies to all zones at upload time. So you need a mechanism to be sure the failing zone will catch up to a correct state at some point.&lt;/p&gt;
&lt;p&gt;That&apos;s the role of the &lt;em&gt;swift-{container,account,object}-replicator&lt;/em&gt; processes. These processes &lt;strong&gt;run on each node that is part of a zone&lt;/strong&gt; and replicate their contents to the nodes of the other zones.&lt;/p&gt;
&lt;p&gt;When they run, they walk through all the contents of all the partitions on the whole file system and, for each partition, issue a special &lt;em&gt;REPLICATE&lt;/em&gt; HTTP request to all the other zones responsible for that same partition. The other zone responds with information about the local state of the partition. That allows the replicator process to decide whether the remote zone has an up-to-date version of the partition.&lt;/p&gt;
&lt;p&gt;In the case of accounts and containers, it doesn&apos;t check at the partition level, but checks each account/container contained inside each partition.&lt;/p&gt;
&lt;p&gt;If something is not up-to-date, it will be pushed using &lt;em&gt;rsync&lt;/em&gt; by the replicator process. This is why the Swift documentation describes the replication updates as &lt;em&gt;&quot;push based&quot;&lt;/em&gt;.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;## Pseudo code describing replication process for accounts
## The principle is exactly the same for containers
for account in accounts:
    # Determine the partition used to store this account
    partition = hash(account) % number_of_partitions
    # The number of zone is the number of replicas configured
    for zone in partition.get_zones_storing_this_partition():
        # Send a HTTP REPLICATE command to the remote swift-account-server process
        version_of_account = zone.send_HTTP_REPLICATE_for(account)
        if version_of_account &amp;lt; account.version():
            account.sync_to(zone)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This replication process is &lt;em&gt;O(number of accounts × number of replicas)&lt;/em&gt;. The more accounts you have and the more replicas you want for your data, the longer the replication of your accounts will take. The same rule applies to containers.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;## Pseudo code describing replication process for objects
for partition in partitions_storing_objects:
    # The number of zone is the number of replicas configured
    for zone in partition.get_zones_storing_this_partition():
        # Send a HTTP REPLICATE command to the remote swift-object-server process
        version_of_partition = zone.send_HTTP_REPLICATE_for(partition)
        if version_of_partition &amp;lt; partition.version():
            # Use rsync to synchronize the whole partition
            # and all its objects
            partition.rsync_to(zone)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This replication process is &lt;em&gt;O(number of object partitions × number of replicas)&lt;/em&gt;. The more object partitions you have and the more replicas you want for your data, the longer the replication of your objects will take.&lt;/p&gt;
&lt;p&gt;I think this is important to know when deciding how to build your Swift architecture: choose the right number of replicas, partitions and nodes.&lt;/p&gt;
&lt;h2&gt;Replication process bottlenecks&lt;/h2&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/copy-cat.jpg&quot; alt=&quot;copy-cat&quot; /&gt;&lt;/p&gt;
&lt;h3&gt;File accesses&lt;/h3&gt;
&lt;p&gt;The problem, as you might have guessed, is that to replicate, &lt;strong&gt;it walks through every damn thing&lt;/strong&gt;, things being accounts, containers, or object partition hash files. This means it needs to open and read (part of) every file your node stores just to check whether data needs to be replicated!&lt;/p&gt;
&lt;p&gt;For account &amp;amp; container replication, this is done every 30 seconds by default, but it will likely take more than 30 seconds as soon as you reach around 12 000 containers on a node (see the measurements below). Therefore you&apos;ll end up checking the consistency of accounts &amp;amp; containers on every node &lt;strong&gt;all the time&lt;/strong&gt;, obviously using a lot of CPU time.&lt;/p&gt;
&lt;p&gt;For reference, &lt;a href=&quot;http://web.archive.org/web/20120903043209/http://alexyang.sinaapp.com/?p=115&quot;&gt;Alex Yang also did an analysis&lt;/a&gt; of that same problem.&lt;/p&gt;
&lt;h3&gt;TCP connections&lt;/h3&gt;
&lt;p&gt;Worse, the HTTP connections used to send the &lt;em&gt;REPLICATE&lt;/em&gt; commands are not pooled: a new TCP connection is established each time something has to be checked against the same thing stored in a remote zone.&lt;/p&gt;
&lt;p&gt;This is why you&apos;ll see in &lt;a href=&quot;http://swift.openstack.org/deployment_guide.html&quot;&gt;Swift&apos;s Deployment Guide&lt;/a&gt; these lines listed&lt;br /&gt;
under &lt;a href=&quot;http://swift.openstack.org/deployment_guide.html#general-system-tuning&quot;&gt;&quot;general system tuning&quot;&lt;/a&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;## disable TIME_WAIT.. wait..
net.ipv4.tcp_tw_recycle=1
net.ipv4.tcp_tw_reuse=1

## double amount of allowed conntrack
net.ipv4.netfilter.ip_conntrack_max = 262144
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In my humble opinion, this is more an ugly hack than tuning. If you don&apos;t activate it and you have a lot of containers on your node, you&apos;ll soon end up with thousands of connections in the &lt;em&gt;TIME_WAIT&lt;/em&gt; state, and you indeed risk overloading the IP conntrack module.&lt;/p&gt;
&lt;h3&gt;Container deletion&lt;/h3&gt;
&lt;p&gt;We should also talk about container deletion. When a user deletes a container from their account, the container is &lt;strong&gt;marked as deleted&lt;/strong&gt;. And that&apos;s it: it&apos;s not actually deleted. Therefore the SQLite database file representing the container will continue to be checked for synchronization, over and over.&lt;/p&gt;
&lt;p&gt;The only way to have a container permanently deleted is to &lt;strong&gt;mark an account as deleted&lt;/strong&gt;. This way the &lt;em&gt;swift-account-reaper&lt;/em&gt; will delete all its containers and, finally, the account.&lt;/p&gt;
&lt;h2&gt;Measurement&lt;/h2&gt;
&lt;p&gt;On a pretty big server, I measured replication to run at a speed of around 350 {account,container,object-partition} checks per second, which can be a real problem if you choose to build a lot of partitions and you have a low &lt;em&gt;number_of_nodes ⁄ number_of_replicas&lt;/em&gt; ratio.&lt;/p&gt;
&lt;p&gt;For example, the default parameters run the container replication every 30 seconds. To check the replication status of 12 000 containers stored on one node at a speed of 350 containers/second, you&apos;ll need around 34 seconds. In the end, you&apos;ll never stop checking the replication of your containers, and the more containers you have, the more your &lt;strong&gt;inconsistency window will increase&lt;/strong&gt;.&lt;/p&gt;
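&lt;p&gt;A quick back-of-the-envelope check of that inconsistency window, using the rate measured above:&lt;/p&gt;

```python
RATE = 350      # replication checks per second, as measured above
INTERVAL = 30   # default container replication run interval, in seconds

def check_duration(containers):
    # time needed to walk every container on the node once
    return containers / RATE

duration = check_duration(12000)
print(round(duration))      # about 34 seconds
print(duration > INTERVAL)  # True: a run never finishes before the next is due
```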
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;Until some of the code is fixed (HTTP connection pooling probably being the &quot;easiest&quot; fix), I warmly recommend choosing the various Swift parameters carefully for your setup. Optimizing the replication process consists in having the minimum number of partitions per node, which can be done by:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;decreasing the number of partitions&lt;/li&gt;
&lt;li&gt;decreasing the number of replicas&lt;/li&gt;
&lt;li&gt;increasing the number of nodes&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;For very large setups, some code to speed up account and container synchronization, and to remove deleted containers, will be required; but as far as I know, this does not exist yet.&lt;/p&gt;
</content:encoded></item><item><title>First release of PyMuninCli</title><link>https://julien.danjou.info/blog/pymunincli-0-1/</link><guid isPermaLink="true">https://julien.danjou.info/blog/pymunincli-0-1/</guid><description>Today I release a Python client library to query Munin servers.</description><pubDate>Tue, 17 Apr 2012 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Today I release a &lt;a href=&quot;http://python.org&quot;&gt;Python&lt;/a&gt; client library to query &lt;a href=&quot;http://munin-monitoring.org/&quot;&gt;Munin&lt;/a&gt; servers.&lt;/p&gt;
&lt;p&gt;I wrote it as part of some experiments I did a few weeks ago. I discovered there was no client library to query a Munin server. There&apos;s &lt;a href=&quot;http://aouyar.github.com/PyMunin/&quot;&gt;PyMunin&lt;/a&gt; or &lt;a href=&quot;http://samuelks.com/python-munin/&quot;&gt;python-munin&lt;/a&gt; which help developing Munin plugins, but nothing to access the &lt;em&gt;munin-node&lt;/em&gt; and retrieve its data.&lt;/p&gt;
&lt;p&gt;So I decided to write a quick and simple one, and it&apos;s released under the name of &lt;a href=&quot;https://github.com/jd/pymunincli&quot;&gt;PyMuninCli&lt;/a&gt;, providing the &lt;em&gt;munin.client&lt;/em&gt; Python module.&lt;/p&gt;
</content:encoded></item><item><title>mod_defensible 1.5 released</title><link>https://julien.danjou.info/blog/mod_defensible-1-5/</link><guid isPermaLink="true">https://julien.danjou.info/blog/mod_defensible-1-5/</guid><description>Apache 2.4 being out, I noticed that my good old mod defensible did not compile anymore.</description><pubDate>Tue, 03 Apr 2012 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Apache 2.4 being out, I noticed that my good old &lt;a href=&quot;http://github.com/jd/mod_defensible&quot;&gt;mod_defensible&lt;/a&gt; did not compile anymore.&lt;/p&gt;
&lt;p&gt;The &lt;a href=&quot;http://httpd.apache.org/docs/2.4/developer/new_api_2_4.html&quot;&gt;changes in the new Apache 2.4 API&lt;/a&gt; were small for its concern, so it was pretty easy to update this software to make it compile again.&lt;/p&gt;
&lt;p&gt;Honestly, I&apos;m not sure that this module is really used in the wild, but I still think it can serve as a good prototype for doing other things, so I like keeping it around. :-)&lt;/p&gt;
&lt;p&gt;All this has been triggered by the Apache 2.4 arrival into Debian experimental. Therefore I&apos;ve updated the mod_defensible package to use the new dh_apache2, and imported it into Git at the same time.&lt;/p&gt;
</content:encoded></item><item><title>xpyb 1.3 released</title><link>https://julien.danjou.info/blog/xpyb-1-3/</link><guid isPermaLink="true">https://julien.danjou.info/blog/xpyb-1-3/</guid><description>It took a while to get it out, but finally, 3 years after the latest release (1.2), version 1.3 of xpyb (the XCB Python bindings) is out.</description><pubDate>Thu, 22 Mar 2012 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;It took a while to get it out, but finally, 3 years after the latest release (1.2), version 1.3 of &lt;a href=&quot;http://cgit.freedesktop.org/xcb/xpyb/&quot;&gt;xpyb&lt;/a&gt; (the &lt;a href=&quot;http://xcb.freedesktop.org&quot;&gt;XCB&lt;/a&gt; Python bindings) is out.&lt;/p&gt;
&lt;p&gt;This version has a lot of improvement, and major bug fixes (memory corruption and memory leak were tracked down and fixed).&lt;/p&gt;
&lt;p&gt;One amazing feature that is now shipped with that release, is &lt;a href=&quot;https://julien.danjou.info/blog/python-cairo-and-xcb-support&quot;&gt;my code to export the xpyb API to other Python modules&lt;/a&gt;, allowing to draw with &lt;a href=&quot;http://www.cairographics.org/pycairo/&quot;&gt;Pycairo&lt;/a&gt; in Python using XCB.&lt;/p&gt;
&lt;p&gt;Here is an example of a Python program that draws a spiral in a window using xpyb and Pycairo. You need xpyb &amp;gt;= 1.3 and Pycairo &amp;gt;= 1.10 to make this work.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import cairo
import xcb
from xcb.xproto import *

WIDTH, HEIGHT = 600, 600

def draw_spiral(ctx, width, height):
    &quot;&quot;&quot;Draw a spiral with lines!&quot;&quot;&quot;
    wd = .02 * width
    hd = .02 * height

    width -= 2
    height -= 2

    ctx.move_to (width + 1, 1-hd)
    for i in range(9):
        ctx.rel_line_to (0, height - hd * (2 * i - 1))
        ctx.rel_line_to (- (width - wd * (2 *i)), 0)
        ctx.rel_line_to (0, - (height - hd * (2*i)))
        ctx.rel_line_to (width - wd * (2 * i + 1), 0)

    ctx.set_source_rgb (0, 0, 1)
    ctx.stroke()

## Connect to the X server
conn = xcb.connect()
## Get the X server setup
setup = conn.get_setup()
## Generate X ID for our X &quot;objects&quot;
window = conn.generate_id()
pixmap = conn.generate_id()
gc = conn.generate_id()
## Create a new window
conn.core.CreateWindow(setup.roots[0].root_depth, window,
                       # Parent is the root window
                       setup.roots[0].root,
                       0, 0, WIDTH, HEIGHT, 0, WindowClass.InputOutput,
                       setup.roots[0].root_visual,
                       CW.BackPixel | CW.EventMask,
                       [ setup.roots[0].white_pixel, EventMask.ButtonPress | EventMask.EnterWindow | EventMask.LeaveWindow | EventMask.Exposure ])

## Create a pixmap: it will be used to draw with cairo
conn.core.CreatePixmap(setup.roots[0].root_depth, pixmap, setup.roots[0].root,
                       WIDTH, HEIGHT)

## We just need a GC to copy later the pixmap on the window, so create one
## very simple
conn.core.CreateGC(gc, setup.roots[0].root, GC.Foreground | GC.Background,
                   [ setup.roots[0].black_pixel, setup.roots[0].white_pixel ])

## Create a cairo surface
surface = cairo.XCBSurface (conn, pixmap,
                            setup.roots[0].allowed_depths[0].visuals[0], WIDTH, HEIGHT)
## Create a cairo context with that surface
ctx = cairo.Context(surface)

## Paint everything in white
ctx.set_source_rgb (1, 1, 1)
ctx.set_operator (cairo.OPERATOR_SOURCE)
ctx.paint()

## Draw our spiral
draw_spiral (ctx, WIDTH, HEIGHT)

## Map the window on the screen so it gets visible
conn.core.MapWindow(window)

## Flush all X requests to the X server
conn.flush()

while True:
    try:
        event = conn.wait_for_event()
    except xcb.ProtocolException, error:
        print &quot;Protocol error %s received!&quot; % error.__class__.__name__
        break
    except:
        break

    # ExposeEvent are received when we need to refresh the content of the
    # window, so we copy the content of the pixmap (where cairo drew) in the
    # window
    if isinstance(event, ExposeEvent):
        conn.core.CopyArea(pixmap, window, gc, 0, 0, 0, 0, WIDTH, HEIGHT)
    # You click, I quit.
    elif isinstance(event, ButtonPressEvent):
        break
    conn.flush()
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Seeing how complex it is to draw something this simple with this technology, I somehow understand why nobody bothered to release or use this code during the last 3 years.&lt;/p&gt;
&lt;p&gt;But hey, now that it&apos;s out, you can build the next Python-based desktop environment with bleeding-edge technologies. :-)&lt;/p&gt;
</content:encoded></item><item><title>Ten years as a Debian developer</title><link>https://julien.danjou.info/blog/ten-years-as-a-debian-developer/</link><guid isPermaLink="true">https://julien.danjou.info/blog/ten-years-as-a-debian-developer/</guid><description>Ten years ago, I joined the Debian project as a developer.</description><pubDate>Fri, 24 Feb 2012 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Ten years ago, I joined the &lt;a href=&quot;http://www.debian.org&quot;&gt;Debian&lt;/a&gt; project as a developer.&lt;/p&gt;
&lt;p&gt;At that time, I was 18 and in my first year at university, hanging out with the &lt;a href=&quot;http://tuxfamily.org&quot;&gt;TuxFamily&lt;/a&gt; system administrators, who included three French Debian developers (sjg, igenibel and creis).&lt;/p&gt;
&lt;p&gt;I was learning Debian packaging while working on &lt;a href=&quot;http://vhffs.org&quot;&gt;VHFFS&lt;/a&gt;, and decided to package one or two not-yet-packaged pieces of software for Debian. My friends pushed me into the &lt;a href=&quot;http://nm.debian.org&quot;&gt;NM process&lt;/a&gt;, and &lt;a href=&quot;https://nm.debian.org/nmstatus.php?email=acid@hno3.org&quot;&gt;less than 2 months later&lt;/a&gt; I was a Debian developer. One has to admit that back in the day, the NM process was really fast if you were able to reply to the questions quickly. :-) I think I became Debian&apos;s youngest developer at the time.&lt;/p&gt;
&lt;p&gt;Those were my first steps in a Free Software project, and it was really exciting.&lt;/p&gt;
&lt;p&gt;In 10 years, I&apos;ve done a lot of different things for Debian. Sure, I&apos;ve been using it all these years, but let me recap a bit of what I did, from what I recall.&lt;/p&gt;
&lt;p&gt;My first Debian-only project was &lt;a href=&quot;http://packages.debian.org/apt-build&quot;&gt;apt-build&lt;/a&gt; around 2003, followed by &lt;a href=&quot;http://packages.debian.org/rebuildd&quot;&gt;rebuildd&lt;/a&gt; in 2007.&lt;/p&gt;
&lt;p&gt;I built the &lt;a href=&quot;https://alioth.debian.org/projects/pkg-xen/&quot;&gt;Xen packaging team&lt;/a&gt; in 2005, was a Stable Release Manager for a year in 2006, and did heavy bug squashing to release Etch that same year.&lt;/p&gt;
&lt;p&gt;I also was an &lt;a href=&quot;https://nm.debian.org/whoisam.php&quot;&gt;Application Manager in 2006&lt;/a&gt; and managed the application of two Debian developers (&lt;a href=&quot;https://nm.debian.org/nmstatus.php?email=joseparrella%40cantv.net&quot;&gt;Jose Parrella&lt;/a&gt; and &lt;a href=&quot;https://nm.debian.org/nmstatus.php?email=debian%40damianv.com.ar&quot;&gt;Damián Viano&lt;/a&gt;).&lt;/p&gt;
&lt;p&gt;I admit I&apos;ve been less active in Debian after 2007, mainly because I was busy working on &lt;a href=&quot;http://awesome.naquadah.org&quot;&gt;awesome&lt;/a&gt;, &lt;a href=&quot;http://www.gnu.org/software/emacs/&quot;&gt;GNU Emacs&lt;/a&gt; and other software.&lt;/p&gt;
&lt;p&gt;In 2011, I joined the &lt;a href=&quot;http://alioth.debian.org/projects/openstack/&quot;&gt;OpenStack packaging team&lt;/a&gt;, and I&apos;ve been working on OpenStack on an (almost) daily basis since.&lt;/p&gt;
&lt;p&gt;I don&apos;t know how many packages I touched, managed or updated, but it should be one or two hundred. I still maintain &lt;a href=&quot;http://qa.debian.org/developer.php?login=acid&quot;&gt;53 of them&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;All in all, the adventure has been really pleasant, and I had the chance to work with and meet fabulous, smart people. I&apos;ve always liked this project and what it&apos;s trying to do.&lt;/p&gt;
&lt;p&gt;After all these years, I&apos;m definitely staying! See you in another 10 years, folks! :)&lt;/p&gt;
</content:encoded></item><item><title>Google Calendar notifications using pynotify</title><link>https://julien.danjou.info/blog/google-calendar-pynotify/</link><guid isPermaLink="true">https://julien.danjou.info/blog/google-calendar-pynotify/</guid><description>I use Google Calendar to manage my calendars, and I really missed something to warn me whenever I have an appointment with an alert set.</description><pubDate>Tue, 03 Jan 2012 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;I use &lt;a href=&quot;http://google.com/calendar&quot;&gt;Google Calendar&lt;/a&gt; to manage my calendars, and I really missed something to warn me whenever I have an appointment with an alert set.&lt;/p&gt;
&lt;p&gt;So here is an example of a Python program to do such a thing. It is written using the &lt;a href=&quot;http://code.google.com/p/gdata-python-client/&quot;&gt;Google Data APIs Python client library&lt;/a&gt; and pynotify.&lt;/p&gt;
&lt;p&gt;I&apos;ll detail the code here, so you can build your own and adapt it to your needs.&lt;/p&gt;
&lt;p&gt;First, we need to import GTK+, pynotify and &lt;em&gt;sys&lt;/em&gt;, and initialize the notification system.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import sys
import gtk
import pynotify
pynotify.init(sys.argv[0])
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Then, we need to import the gdata Calendar API and connect to the calendar. I&apos;ll use the simple email/password login, which is clearly not the best, but it&apos;s also the simplest. Feel free to use OAuth 2.0. :-)&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import gdata.calendar.service

calendar_service = gdata.calendar.service.CalendarService()
calendar_service.email = &apos;mygooglelogin&apos;
calendar_service.password = &apos;mygooglepassword&apos;
calendar_service.ProgrammaticLogin()
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now we&apos;re ready to request stuff and notify! First, request the events from the default calendar.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;feed = calendar_service.GetCalendarEventFeed()
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now we can iterate over &lt;em&gt;feed&lt;/em&gt; and do various checks.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import datetime

for event in feed.entry:
    # If the event status is not confirmed, go to the next event.
    if event.event_status.value != &quot;CONFIRMED&quot;:
        continue
    # Now iterate over all the event dates (usually it has one)
    for when in event.when:
        # Parse start and end time
        try:
            start_time = datetime.datetime.strptime(when.start_time.split(&quot;.&quot;)[0], &quot;%Y-%m-%dT%H:%M:%S&quot;)
            end_time = datetime.datetime.strptime(when.end_time.split(&quot;.&quot;)[0], &quot;%Y-%m-%dT%H:%M:%S&quot;)
        except ValueError:
            # ValueError happens on parsing error. Parsing errors
            # usually happen for &quot;all day&quot; events since they have
            # no time, but we do not care about these events.
            continue
        now = datetime.datetime.now()
        # Check that the event hasn&apos;t already ended
        if end_time &amp;gt; now:
            # Check each alert
            for reminder in when.reminder:
                # We handle only reminders with method &quot;alert&quot;
                # and whose start time minus the reminder delay has passed
                if reminder.method == &quot;alert&quot; \
                        and start_time - datetime.timedelta(0, 60 * int(reminder.minutes)) &amp;lt; now:
                    # Build the notification
                    notification = pynotify.Notification(summary=event.title.text,
                                                         message=event.content.text)
                    # Set an icon from the GTK+ stock icons
                    notification.set_icon_from_pixbuf(gtk.Label().render_icon(gtk.STOCK_DIALOG_INFO,
                                                                              gtk.ICON_SIZE_LARGE_TOOLBAR))
                    notification.set_timeout(0)
                    # Show the notification
                    notification.show()
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;When running this program, you should see a notification for any appointment whose alert is due at that time.&lt;/p&gt;
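&lt;p&gt;The reminder check above boils down to plain datetime arithmetic. Here is that test isolated as a minimal, standalone sketch; the &lt;em&gt;reminder_due&lt;/em&gt; helper is illustrative, not part of the gdata API:&lt;/p&gt;

```python
import datetime

def reminder_due(start_time, reminder_minutes, now):
    # The alert is due once "now" has reached the event start time
    # minus the reminder delay.
    alert_time = start_time - datetime.timedelta(minutes=reminder_minutes)
    return now >= alert_time

start = datetime.datetime(2012, 1, 3, 15, 0)
# Ten minutes before a 15:00 event, a 30-minute reminder is due:
print(reminder_due(start, 30, datetime.datetime(2012, 1, 3, 14, 50)))
# One hour before, it is not due yet:
print(reminder_due(start, 30, datetime.datetime(2012, 1, 3, 14, 0)))
```

&lt;p&gt;In the full program, you would additionally check that the event has not already ended, as the loop above does.&lt;/p&gt;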
&lt;p&gt;This should be enough to start building something.&lt;/p&gt;
&lt;p&gt;If you don&apos;t want to program this into Python, you might want to take a look at &lt;a href=&quot;http://code.google.com/p/gcalcli/wiki/HowTo&quot;&gt;gcalcli&lt;/a&gt;.&lt;/p&gt;
</content:encoded></item><item><title>Using GTK+ stock icons with pynotify</title><link>https://julien.danjou.info/blog/python-notify-with-gtk-stock-icon/</link><guid isPermaLink="true">https://julien.danjou.info/blog/python-notify-with-gtk-stock-icon/</guid><description>It took me a while to find this, so I&apos;m just blogging it so other people will be able to find it.</description><pubDate>Tue, 27 Dec 2011 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;It took me a while to find this, so I&apos;m just blogging it so other people will be able to find it.&lt;/p&gt;
&lt;p&gt;I wanted to send a &lt;a href=&quot;http://www.galago-project.org/specs/notification/&quot;&gt;desktop notification&lt;/a&gt; using pynotify, but using a &lt;a href=&quot;http://developer.gnome.org/gtk/2.24/gtk-Stock-Items.html&quot;&gt;GTK+ stock icons&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;With the following snippet, I managed to do it.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import pynotify
pynotify.init(&quot;myapp&quot;)
import gtk
n = pynotify.Notification(summary=&quot;Summary&quot;, message=&quot;Message!&quot;)
n.set_icon_from_pixbuf(gtk.Label().render_icon(gtk.STOCK_HARDDISK, gtk.ICON_SIZE_LARGE_TOOLBAR))
n.show()
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Note that the &lt;em&gt;Label&lt;/em&gt; is only used to have a widget instantiated on which to call the &lt;em&gt;render_icon()&lt;/em&gt; method. As far as I understand, it could be any widget type.&lt;/p&gt;
&lt;/content:encoded&gt;</item><item><title>My OpenStack work</title><link>https://julien.danjou.info/blog/my-openstack-work/</link><guid isPermaLink="true">https://julien.danjou.info/blog/my-openstack-work/</guid><description>Like I already wrote here last week, I&apos;ve been heavily working on OpenStack for the last few weeks.</description><pubDate>Fri, 16 Dec 2011 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Like I already wrote here last week, I&apos;ve been heavily working on &lt;a href=&quot;http://openstack.org&quot;&gt;OpenStack&lt;/a&gt; for the last few weeks.&lt;/p&gt;
&lt;p&gt;My first assignment was to package OpenStack for Debian. The packages already present in unstable were mainly done by &lt;a href=&quot;http://thomas.goirand.fr/&quot;&gt;Thomas Goirand&lt;/a&gt;, who based his work on the one done in &lt;a href=&quot;http://ubuntu.com&quot;&gt;Ubuntu&lt;/a&gt;. As a result, the packages were not in very good shape for Debian.&lt;/p&gt;
&lt;p&gt;Today Ghe Rivero and I (members of the &lt;a href=&quot;https://alioth.debian.org/projects/openstack&quot;&gt;OpenStack Debian packaging team&lt;/a&gt;) managed to push the &lt;a href=&quot;https://launchpad.net/openstack/+milestone/essex-2&quot;&gt;OpenStack Essex 2 milestone&lt;/a&gt; into unstable with great success. You can now test and deploy OpenStack Essex 2 very easily!&lt;/p&gt;
&lt;p&gt;Packaging OpenStack &lt;a href=&quot;https://review.openstack.org/#dashboard,1669&quot;&gt;made me write several patches&lt;/a&gt;, mainly packaging-related, all of which were accepted and merged by upstream. This is nice because most of the OpenStack Debian packages have lost their &lt;em&gt;debian/patches&lt;/em&gt; directories now!&lt;/p&gt;
&lt;p&gt;Finally, I&apos;ve finished implementing one blueprint I really missed: the &lt;a href=&quot;https://blueprints.launchpad.net/nova/+spec/support-kvm-boot-from-iso&quot;&gt;ability to boot from an ISO image&lt;/a&gt; using &lt;a href=&quot;http://libvirt.org&quot;&gt;libvirt&lt;/a&gt;. The code still needs a review, but it should be included in the Essex 3 milestone if everything goes right.&lt;/p&gt;
&lt;/content:encoded&gt;</item><item><title>New job, new blog</title><link>https://julien.danjou.info/blog/new-job-new-blog/</link><guid isPermaLink="true">https://julien.danjou.info/blog/new-job-new-blog/</guid><description>It has been a while since I blogged but I&apos;ve been very busy, with my new job and this new blog!  New job! I quit my job last September, and found another one that I started in October. I&apos;m now the</description><pubDate>Wed, 07 Dec 2011 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;It has been a while since I blogged, but I&apos;ve been very busy with my new job and this new blog!&lt;/p&gt;
&lt;h2&gt;New job!&lt;/h2&gt;
&lt;p&gt;I quit my job last September and found another one, which I started in October. I&apos;m now the lead developer of &lt;a href=&quot;http://www.enovance.com/fr/produits-solutions/opencloud-opensource/enovance-labs&quot;&gt;eNovance Labs&lt;/a&gt;, where I work on the &lt;a href=&quot;http://openstack.org/&quot;&gt;OpenStack&lt;/a&gt; project. So far, this has allowed me to contribute heavily to the &lt;a href=&quot;https://alioth.debian.org/projects/openstack&quot;&gt;Debian packaging of OpenStack&lt;/a&gt;.&lt;/p&gt;
&lt;h2&gt;New blog!&lt;/h2&gt;
&lt;p&gt;In the meantime, I took some time to redesign my personal homepage and this blog, which is now generated with &lt;a href=&quot;https://github.com/hyde/hyde&quot;&gt;Hyde&lt;/a&gt;, the &lt;a href=&quot;http://python.org&quot;&gt;Python&lt;/a&gt; equivalent of &lt;a href=&quot;http://jekyllrb.com/&quot;&gt;Jekyll&lt;/a&gt;, which is written in &lt;a href=&quot;http://www.ruby-lang.org/&quot;&gt;Ruby&lt;/a&gt;. Since I dislike Ruby (sorry), I preferred to use a Python-based generator, and I admit Hyde is really cool.&lt;/p&gt;
&lt;p&gt;Since I really suck at Web design, this one is obviously based on &lt;a href=&quot;http://twitter.github.com/bootstrap/&quot;&gt;Twitter&apos;s Bootstrap&lt;/a&gt;.&lt;/p&gt;
</content:encoded></item><item><title>Google Contacts for Emacs</title><link>https://julien.danjou.info/blog/google-contacts-for-emacs/</link><guid isPermaLink="true">https://julien.danjou.info/blog/google-contacts-for-emacs/</guid><description>I finally finished a thing I was really missing: accessing my Google Contacts from Emacs.</description><pubDate>Mon, 26 Sep 2011 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;I finally finished a thing I was really missing: accessing my Google Contacts from Emacs.&lt;/p&gt;
&lt;p&gt;That&apos;s now possible, thanks to my new &lt;a href=&quot;https://github.com/jd/google-contacts.el&quot;&gt;google-contacts.el&lt;/a&gt; package.&lt;/p&gt;
&lt;p&gt;It includes searching for any contact and displaying the result in a window. You can also jump to a contact from &lt;a href=&quot;http://gnus.org&quot;&gt;Gnus&lt;/a&gt; by pressing a key, and complete e-mail addresses while composing a mail.&lt;/p&gt;
</content:encoded></item><item><title>OAuth 2.0 for Emacs</title><link>https://julien.danjou.info/blog/oauth-2-0-for-emacs/</link><guid isPermaLink="true">https://julien.danjou.info/blog/oauth-2-0-for-emacs/</guid><description>This week, I&apos;ve finished my OAuth 2.0 client implementation for GNU Emacs.</description><pubDate>Fri, 23 Sep 2011 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;This week, I&apos;ve finished my &lt;a href=&quot;http://oauth.net/2/&quot;&gt;OAuth 2.0&lt;/a&gt; client implementation for &lt;a href=&quot;http://www.gnu.org/software/emacs/&quot;&gt;GNU Emacs&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;I have &lt;a href=&quot;http://bzr.savannah.gnu.org/lh/emacs/elpa/revision/126?start_revid=126&quot;&gt;imported it&lt;/a&gt; into &lt;a href=&quot;http://elpa.gnu.org/&quot;&gt;GNU ELPA&lt;/a&gt; so Emacs 24 users will be soon able to install it using the new Emacs packaging system.&lt;/p&gt;
&lt;p&gt;OAuth 2.0 can be used to access, among others, &lt;a href=&quot;http://code.google.com/apis/accounts/docs/OAuth2.html&quot;&gt;Google APIs&lt;/a&gt; or the &lt;a href=&quot;http://developers.facebook.com/docs/authentication/&quot;&gt;Facebook Graph API&lt;/a&gt;.&lt;/p&gt;
</content:encoded></item><item><title>Quitting my job</title><link>https://julien.danjou.info/blog/quitting-my-job/</link><guid isPermaLink="true">https://julien.danjou.info/blog/quitting-my-job/</guid><description>After more than 5 years at Easter-eggs as a system engineer, I&apos;ll be leaving my job to join a new adventure.</description><pubDate>Mon, 29 Aug 2011 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;After more than 5 years at &lt;a href=&quot;http://www.easter-eggs.com&quot;&gt;Easter-eggs&lt;/a&gt; as a system engineer, I&apos;ll be leaving my job soon.&lt;/p&gt;
&lt;p&gt;It has been a fabulous adventure, partly thanks to the &quot;cooperative&quot; nature of the company. I&apos;ve enjoyed working here, with great people. I do wish them luck for the future. Looking at the numerous things I did over the past years, it has been quite productive!&lt;/p&gt;
&lt;p&gt;Therefore, I&apos;ll be looking for a new job in the next weeks, which will probably keep me busy a bit. :-)&lt;/p&gt;
</content:encoded></item><item><title>Python sets comparisons</title><link>https://julien.danjou.info/blog/python-sets-comparisons/</link><guid isPermaLink="true">https://julien.danjou.info/blog/python-sets-comparisons/</guid><description>This week I lost some time playing with Python&apos;s sets.</description><pubDate>Tue, 17 May 2011 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;This week I lost some time playing with &lt;a href=&quot;http://python.org&quot;&gt;Python&lt;/a&gt;&apos;s &lt;a href=&quot;http://docs.python.org/library/stdtypes.html#set&quot;&gt;sets&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;After digging into the Python source code, I finally discovered what seems to be a little bug. Anyway, it has fortunately been &quot;fixed&quot; in Python 3. I did not find whether it was reported somewhere, but since it&apos;s fixed, it&apos;s not a big deal.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;Python 2.7.1+ (default, Apr 20 2011, 10:53:33) 
[GCC 4.5.2] on linux2
Type &quot;help&quot;, &quot;copyright&quot;, &quot;credits&quot; or &quot;license&quot; for more information.
&amp;gt;&amp;gt;&amp;gt; class A(object):
...     def __eq__(self, other):
...             return True
... 
&amp;gt;&amp;gt;&amp;gt; A() == A()
True
&amp;gt;&amp;gt;&amp;gt; [A()] == [A()]
True
&amp;gt;&amp;gt;&amp;gt; set([A()]) == set([A()])
False
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This clearly did not make any sense to me. I then tested under Python 3.2:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;Python 3.2.1a0 (default, May  4 2011, 19:59:25) 
[GCC 4.6.1 20110428 (prerelease)] on linux2
Type &quot;help&quot;, &quot;copyright&quot;, &quot;credits&quot; or &quot;license&quot; for more information.
&amp;gt;&amp;gt;&amp;gt; class A(object):
...     def __eq__(self, other):
...             return True
... 
&amp;gt;&amp;gt;&amp;gt; set([A()]) == set([A()])
Traceback (most recent call last):
  File &quot;&amp;lt;stdin&amp;gt;&quot;, line 1, in &amp;lt;module&amp;gt;
TypeError: unhashable type: &apos;A&apos;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;At least, raising an error is saner. It actually helped me understand what I needed to do to get my sets working correctly with Python 2:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;Python 2.7.1+ (default, Apr 20 2011, 10:53:33) 
[GCC 4.5.2] on linux2
Type &quot;help&quot;, &quot;copyright&quot;, &quot;credits&quot; or &quot;license&quot; for more information.
&amp;gt;&amp;gt;&amp;gt; class A(object):
...     def __eq__(self, other):
...             return True
...     def __hash__(self):
...             return 123456789
... 
&amp;gt;&amp;gt;&amp;gt; set([A()]) == set([A()])
True
&lt;/code&gt;&lt;/pre&gt;
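&lt;p&gt;The underlying rule is the hash/equality contract: objects that compare equal must also hash equal, because sets locate candidates by hash before ever calling &lt;em&gt;__eq__&lt;/em&gt;. Here is a minimal sketch of the same fix in Python 3 syntax, where &lt;em&gt;__hash__&lt;/em&gt; must be defined explicitly as soon as you define &lt;em&gt;__eq__&lt;/em&gt;:&lt;/p&gt;

```python
class A:
    def __eq__(self, other):
        return True

    def __hash__(self):
        # Objects that compare equal must return the same hash;
        # otherwise they end up in different hash buckets and the
        # set comparison never even calls __eq__.
        return 123456789

# With a consistent hash, set comparison and deduplication both work:
print(set([A()]) == set([A()]))
print(len(set([A(), A(), A()])))
```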
</content:encoded></item><item><title>Why not Lua</title><link>https://julien.danjou.info/blog/why-not-lua/</link><guid isPermaLink="true">https://julien.danjou.info/blog/why-not-lua/</guid><description>Since my latest announcement of the Lua workshop, I received a couple of emails asking why I discourage the use of Lua.</description><pubDate>Tue, 26 Apr 2011 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Since my latest announcement of the &lt;a href=&quot;https://julien.danjou.info/blog/lua-workshop-at-fabelier-tmplab&quot;&gt;Lua workshop&lt;/a&gt;, I received a couple of emails asking why I discourage the use of &lt;a href=&quot;http://lua.org&quot;&gt;Lua&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Actually, I already wrote out many of &lt;a href=&quot;https://julien.danjou.info/blog/rants-about-lua&quot;&gt;the things I dislike about Lua&lt;/a&gt;. I won&apos;t come back to these technical issues here, but since Lua 5.2 is not yet released (it&apos;s still in alpha), they are still relevant today.&lt;/p&gt;
&lt;h2&gt;A stack-based API is harder&lt;/h2&gt;
&lt;p&gt;The ease of integrating Lua into a C program is one of Lua&apos;s selling points. They claim it&apos;s very easy to integrate Lua into your C application, because it does not use pointers, reference counting, or anything that requires a minimum amount of skill to use.&lt;/p&gt;
&lt;p&gt;It uses a virtual stack-based approach: you push or pop things on a stack, and refer to them using a relative or absolute index.&lt;/p&gt;
&lt;p&gt;For people who have never written Lua code, here&apos;s a quick example of how this works. The &lt;em&gt;L&lt;/em&gt; pointer is a Lua environment.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;/* Create a table on the stack: index 1 */
lua_newtable(L);
/* Push a string on the stack: index 2 */
lua_pushstring(L, &quot;hello&quot;);
/* Push a number on the stack: index 3 */
lua_pushnumber(L, 123);
/* Set newtable[&quot;hello&quot;] = 123 */
lua_settable(L, -3);
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;You first push a table (in Lua, a table is almost equivalent to what you&apos;d call a hash table in other languages), then push the key and the value, and do the assignment operation. In the settable call, we use -3 as the index, meaning the third item on the stack counting from the top. We could also have written &lt;em&gt;lua_settable(L, 1)&lt;/em&gt;, since the table is also the first item on the stack from the bottom.&lt;/p&gt;
&lt;p&gt;So far, so good.&lt;/p&gt;
&lt;p&gt;Problems arise when you do more complicated stuff. My previous example is what you would typically find in a tutorial, but of course, real life is different, and usually more complex. Once you split things into different parts, it starts to get more complicated.&lt;/p&gt;
&lt;p&gt;Let&apos;s take a look at the following:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;/* Create a table on the stack: index 1 */
lua_newtable(L);
/* Push a string on the stack: index 2 */
lua_pushstring(L, &quot;hello&quot;);
/* Push a number on the stack: index 3 */
lua_pushnumber(L, mycomputingfunction());
/* Set newtable[&quot;hello&quot;] = the computed value */
lua_settable(L, -3);
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Here, we do exactly the same thing, but we do not push &lt;em&gt;123&lt;/em&gt; directly: we compute it.&lt;/p&gt;
&lt;p&gt;And here&apos;s the trick: if your computing function is also using the Lua stack, things can become &lt;em&gt;very&lt;/em&gt; messy. As long as your computing function uses the stack cleanly, pushing and popping all its items and returning the stack &lt;strong&gt;in the same state it was in before&lt;/strong&gt;, you&apos;re safe. The problem is that in a complex program, you also write bugs. You do not choose to, but you do. And sometimes, you forget to pop one of the items you fetched from a table.&lt;/p&gt;
&lt;p&gt;Imagine that &lt;em&gt;mycomputingfunction&lt;/em&gt; is:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;int
mycomputingfunction(void)
{
  /* Just push the table we want to fetch
     the number from on the stack */
  pushatableonstack(L);
  lua_pushstring(L, &quot;mykey&quot;);
  lua_gettable(L, -2);
  return lua_tonumber(L, -1);
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This function works perfectly. It pushes a table, then a key (&lt;em&gt;&quot;mykey&quot;&lt;/em&gt;), then fetches mytable[&quot;mykey&quot;] (lua_gettable itself pops the key and pushes the value), and finally returns the numeric value of the last item on the stack (the fetched one).&lt;/p&gt;
&lt;p&gt;However, this function has a bug: it does not pop the table! This does not prevent the function from working. It does not raise a segmentation fault. It does not show any problem under &lt;a href=&quot;http://www.gnu.org/software/gdb/&quot;&gt;gdb&lt;/a&gt;. It does not show any leak under &lt;a href=&quot;http://valgrind.org/&quot;&gt;Valgrind&lt;/a&gt;. It does not show any problem under &lt;strong&gt;any standard C debugging tool&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;But when you start using it, your program will start doing weird things, and you&apos;ll have to spend a huge amount of time debugging it manually, dumping the stack content at each step of your program to find out what&apos;s wrong.&lt;/p&gt;
&lt;p&gt;Another bad thing that can happen is some code accidentally popping an item from the stack, or worse, popping from an empty stack. This does not raise any error on the Lua side, but it will break your program in very unfunny ways.&lt;/p&gt;
&lt;p&gt;I&apos;ve been very meticulous writing &lt;a href=&quot;http://awesome.naquadah.org&quot;&gt;awesome&lt;/a&gt;, but we still hit that problem regularly.&lt;/p&gt;
&lt;p&gt;The easiest workaround is to use &lt;em&gt;lua_settop(L, 0)&lt;/em&gt; to reset the stack to 0 elements. Doing this regularly (like after each program event or treatment) removes left-over items and avoids the never-ending stack growth you may experience if left-over items keep piling up. Did I tell you I dislike workarounds?&lt;/p&gt;
&lt;p&gt;You could also use &lt;em&gt;lua_call()&lt;/em&gt;, which would avoid such errors, but this would require a huge amount of indirection, and would make you write more (useless) code.&lt;/p&gt;
&lt;p&gt;This kind of problem does not exist with a pointer-based API. If you screw things up there, you get a segmentation fault or a memory leak: things you can (easily) debug with standard tools like gdb or Valgrind.&lt;/p&gt;
&lt;h2&gt;No reference counting is a pain&lt;/h2&gt;
&lt;p&gt;Userdata objects are variable-size Lua objects embedding a C struct you define. They are the equivalent of objects in an object-oriented language.&lt;/p&gt;
&lt;p&gt;Lua does not provide any reference counting for userdata objects. That means you can push these objects on the stack and use them, but they cannot directly reference each other. If you have a &quot;car&quot; userdata and a &quot;wheel&quot; one, the car cannot directly hold a reference to the wheel. This is not possible because userdata are allocated and garbage collected by Lua, and there&apos;s no way to increase the reference count yourself.&lt;/p&gt;
&lt;p&gt;So the common hack is to store the wheel into a table as a value, and store the table index as an integer into the car data structure.&lt;/p&gt;
&lt;p&gt;This obviously makes memory leak tracking harder, adds a huge level of reference indirection (still more code), and does not make the whole process any less error-prone (at least in my opinion).&lt;/p&gt;
&lt;h2&gt;No paradigm makes you lose time&lt;/h2&gt;
&lt;p&gt;Lua is proud to come with no paradigm and to provide metatables. &lt;a href=&quot;https://julien.danjou.info/blog/rants-about-lua&quot;&gt;I already showed 3 years ago that it has big flaws&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;To me, this ain&apos;t no good. Lua is not functional, nor is it object-oriented.&lt;/p&gt;
&lt;p&gt;Most people, including me, want one of these paradigms, or another one. Plain old imperative is not enough.&lt;/p&gt;
&lt;p&gt;So you&apos;ll start building more yourself, or use something like &lt;a href=&quot;http://loop.luaforge.net/&quot;&gt;LOOP&lt;/a&gt;, which implements an object model. You&apos;ll end up implementing your own paradigm. I say life is too short to (re)write a paradigm.&lt;/p&gt;
&lt;p&gt;In &lt;a href=&quot;http://awesome.naquadah.org&quot;&gt;awesome&lt;/a&gt; we wanted an object-oriented approach (this is fairly typical in such a graphical application context), so we tried to build one. To me, this started to be a show stopper when I realized that I had ended up writing the Python object model in Lua while developing &lt;em&gt;awesome&lt;/em&gt; (which aims to be a window manager, not a language). This is one of the reasons I stopped hacking on Lua things.&lt;/p&gt;
&lt;p&gt;I liked the Python object model and wanted to have it in Lua, but spending time rewriting Python is just not worth it. I probably should have chosen Python, not Lua. YMMV.&lt;/p&gt;
&lt;h2&gt;Embedding may not be a good choice&lt;/h2&gt;
&lt;p&gt;This is not Lua related, but I want to mention it. Googling for &quot;&lt;a href=&quot;http://www.google.fr/search?q=embedding+vs+extending&quot;&gt;embedding vs extending&lt;/a&gt;&quot; will probably tell you more about why you should double-check that you really need to embed Lua rather than extend it.&lt;/p&gt;
&lt;h2&gt;Being small is not an excuse&lt;/h2&gt;
&lt;p&gt;One common argument for choosing Lua is that it has a small footprint. Yeah, that&apos;s true, but that&apos;s useless. Bummer! When I program, I don&apos;t have any resource usage pressure. People who have such pressure are either paranoid or playing in the world of embedded computers. And that notion is fading too, now that phones equipped with quad-core processors are coming onto the market. I&apos;m rather confident that what we used to call embedded devices are dead: they are now plain computers. But as usual, YMMV.&lt;/p&gt;
&lt;p&gt;So forget about it, run around in your underpants yelling &quot;yay, we killed that shit!&quot;, and then use real computer stuff. :-)&lt;/p&gt;
&lt;p&gt;Even &lt;a href=&quot;http://shootout.alioth.debian.org/u32/lua.php&quot;&gt;if benchmarks show how Lua is damn fast&lt;/a&gt;, remember what a benchmark proves: that you can do useless things very fast.&lt;/p&gt;
&lt;h2&gt;Too few extension modules&lt;/h2&gt;
&lt;p&gt;This is not directly Lua&apos;s fault, but there are too few extension modules for Lua. The community is quite small compared to those of other big languages.&lt;/p&gt;
&lt;h2&gt;So think twice&lt;/h2&gt;
&lt;p&gt;before you choose Lua (or any other language). My recommendation these days would be not to embed, but to extend. If you really have no choice and need to embed a language into your application, &lt;a href=&quot;http://www.gnu.org/s/guile/&quot;&gt;GNU Guile&lt;/a&gt; is probably worth considering, because it&apos;s a Scheme and therefore a functional language :-), and because it can also provide different languages.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;http://git.savannah.gnu.org/gitweb/?p=guile.git;a=shortlog;h=refs/heads/lua&quot;&gt;Including Lua&lt;/a&gt;.&lt;/p&gt;
</content:encoded></item><item><title>Lua workshop at Fabelier/tmplab</title><link>https://julien.danjou.info/blog/lua-workshop-at-fabelier-tmplab/</link><guid isPermaLink="true">https://julien.danjou.info/blog/lua-workshop-at-fabelier-tmplab/</guid><description>It seems I&apos;ll be at the Lua workshop at Fabelier/tmplab on April 28th 2011, where I&apos;ll try to present and talk about Lua, how to use it, and why you should probably not use it. ;-)</description><pubDate>Thu, 14 Apr 2011 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;It seems I&apos;ll be at the &lt;a href=&quot;http://fabelier.org/lua-programming-language-by-julien-danjou/&quot;&gt;Lua workshop at Fabelier/tmplab&lt;/a&gt; on April 28th 2011, where I&apos;ll try to present and talk about &lt;a href=&quot;http://lua.org&quot;&gt;Lua&lt;/a&gt;, how to use it, and why you should probably not use it. ;-)&lt;/p&gt;
</content:encoded></item><item><title>Using advanced filter with mod_authnz_ldap</title><link>https://julien.danjou.info/blog/using-advanced-filter-with-mod_authnz_ldap/</link><guid isPermaLink="true">https://julien.danjou.info/blog/using-advanced-filter-with-mod_authnz_ldap/</guid><description>How to work around mod_authnz_ldap&apos;s limited filtering by using a custom LDAP filter for Apache authentication.</description><pubDate>Mon, 04 Apr 2011 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;As you may know, Apache&apos;s &lt;a href=&quot;http://httpd.apache.org/docs/2.2/mod/mod_authnz_ldap.html&quot;&gt;mod_authnz_ldap&lt;/a&gt; lets the Apache HTTP server authenticate users against an LDAP server. Unfortunately, it has a little implementation flaw.&lt;/p&gt;
&lt;p&gt;The filter used to authenticate the user is built by abusing &lt;a href=&quot;http://www.ietf.org/rfc/rfc2255.txt&quot;&gt;RFC 2255&lt;/a&gt;, which specifies the LDAP URL format. This format has an &quot;attribute&quot; field which is normally used to specify which attributes should be returned. But &lt;em&gt;mod_authnz_ldap&lt;/em&gt; instead compares this attribute with the username given by the client. That means you need an attribute in your LDAP entries that matches the username, and you have to use it in the &quot;attribute&quot; part of the URL to get things working.&lt;/p&gt;
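&lt;p&gt;For illustration, here is what a stock configuration looks like (the host, base DN, and filter are placeholders). The &lt;code&gt;uid&lt;/code&gt; sitting in the &quot;attribute&quot; slot of the URL is what the module compares with the username typed by the client:&lt;/p&gt;

```apacheconf
# Hypothetical example with placeholder host and base DN: "uid", the
# attribute field of the RFC 2255 URL, is compared with the supplied
# username instead of listing attributes to return.
AuthType Basic
AuthName "Restricted area"
AuthBasicProvider ldap
AuthLDAPURL "ldap://ldap.example.com/ou=people,dc=example,dc=com?uid?sub?(objectClass=person)"
Require valid-user
```

&lt;p&gt;With such a setup, only entries whose &lt;code&gt;uid&lt;/code&gt; equals the supplied username can authenticate; there is no way to reference the username inside the filter part itself, which is the limitation the patch below addresses.&lt;/p&gt;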
&lt;p&gt;Therefore, I wrote a patch that adds a format string to the LDAP URL in order to use the provided username in the filter, ignoring the attribute part of the URL, which has no use in such a context anyway.&lt;/p&gt;
&lt;p&gt;The bug has been opened in the ASF Bugzilla as &lt;a href=&quot;https://issues.apache.org/bugzilla/show_bug.cgi?id=51005&quot;&gt;#51005&lt;/a&gt;, with the patch attached. The patch is backward compatible with the current configuration format, which is not the best choice in theory, but probably the most pragmatic one.&lt;/p&gt;
&lt;p&gt;I&apos;ve no clue about the typical delay for patch inclusion in the Apache HTTP server, so let&apos;s just wait and see.&lt;/p&gt;
</content:encoded></item><item><title>Org contacts now part of org-contrib</title><link>https://julien.danjou.info/blog/org-contacts-now-part-of-org-contrib/</link><guid isPermaLink="true">https://julien.danjou.info/blog/org-contacts-now-part-of-org-contrib/</guid><description>Thanks to my recent promotion allowing me to commit directly in Org-mode, I&apos;ve moved Org-contacts into the contrib directory of the Orgmode distribution.</description><pubDate>Fri, 18 Mar 2011 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Thanks to my recent promotion allowing me to commit directly in Org-mode, I&apos;ve moved Org-contacts into the contrib directory of the &lt;a href=&quot;http://www.orgmode.org&quot;&gt;Orgmode&lt;/a&gt; distribution.&lt;/p&gt;
</content:encoded></item><item><title>My latest contributions to the Emacs&apos; world</title><link>https://julien.danjou.info/blog/my-latest-contributions-to-the-emacs-world/</link><guid isPermaLink="true">https://julien.danjou.info/blog/my-latest-contributions-to-the-emacs-world/</guid><description>I spend too much time writing Emacs Lisp code these days. Unfortunately, the more I do the more I find new useful tools to improve my work-flow and save time for doing more Lisp. D&apos;oh.  I did not work</description><pubDate>Tue, 01 Mar 2011 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;I spend too much time writing Emacs Lisp code these days. Unfortunately, the more I do the more I find new useful tools to improve my work-flow and save time for doing more Lisp. D&apos;oh.&lt;/p&gt;
&lt;p&gt;I did not work on any big thing these last weeks, so I&apos;m thinking it&apos;s a good time to talk about the various code and patches I sent to multiple Emacs packages.&lt;/p&gt;
&lt;h2&gt;el-get&lt;/h2&gt;
&lt;p&gt;&lt;a href=&quot;https://github.com/dimitri/el-get&quot;&gt;el-get&lt;/a&gt; is a fabulous tool that installs and handles all the external Emacs packages I use. A friendly war started on the &lt;a href=&quot;http://blog.gmane.org/gmane.emacs.el-get.devel&quot;&gt;development list&lt;/a&gt; about autoload handling. The discussion was largely unproductive, since we had a very hard time communicating our ideas and misunderstood each other several times.&lt;/p&gt;
&lt;p&gt;In the end, &lt;em&gt;el-get&lt;/em&gt; now supports autoloads correctly and does not automatically load all your packages, improving startup time and doing things the Emacs way. Which is always better, obviously.&lt;/p&gt;
&lt;h2&gt;git-commit-mode&lt;/h2&gt;
&lt;p&gt;I started using &lt;a href=&quot;https://github.com/rafl/git-commit-mode&quot;&gt;git-commit-mode&lt;/a&gt; some time ago. I usually run &lt;em&gt;git commit&lt;/em&gt; with the &lt;em&gt;-v&lt;/em&gt; option to see what I&apos;m committing. I thought it would be useful to color the diff with &lt;em&gt;diff-mode&lt;/em&gt;, so I &lt;a href=&quot;https://github.com/rafl/git-commit-mode/commit/3e2d1047fff31358c39486cd890d1eb87a464404&quot;&gt;wrote a patch&lt;/a&gt; to do just that, which was merged today by Florian.&lt;/p&gt;
&lt;h2&gt;magit&lt;/h2&gt;
&lt;p&gt;Some weeks ago, I decided to give &lt;a href=&quot;http://philjackson.github.com/magit/&quot;&gt;magit&lt;/a&gt; a try, and loved it. I do not always use it, but for basic operations it is very useful. Still, I soon found some things I did not like, and therefore sent patches to enhance it.&lt;/p&gt;
&lt;p&gt;First, I added &lt;a href=&quot;https://github.com/philjackson/magit/commit/0314e7fd1df2b37b3cd1699afdf2dc3b98aee2d1&quot;&gt;a patch to honor status.showUntrackedFiles&lt;/a&gt;, which I use in my home directory. In the meantime, I also added &lt;a href=&quot;https://github.com/philjackson/magit/commit/43cd05081b7e60d3f2dcce696f3a07c135f4e306&quot;&gt;a patch to allow adding an arbitrary file&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Yesterday, I sent another &lt;a href=&quot;https://github.com/philjackson/magit/pull/128&quot;&gt;pull request&lt;/a&gt;, still open for now, which adds the &lt;a href=&quot;https://github.com/jd/magit/commit/73afce9f0220146a55c6c63735ce48561a277632&quot;&gt;possibility to visit files in another window&lt;/a&gt; from a diff, and &lt;a href=&quot;https://github.com/jd/magit/commit/82d43edb123f493d639ef0835734e58fca1b8c0a&quot;&gt;support for add-change-log-entry&lt;/a&gt; directly from the displayed diff. Useful for those old projects still using &lt;em&gt;ChangeLog&lt;/em&gt; files but accessible through git (hi Emacs &amp;amp; Gnus!).&lt;/p&gt;
&lt;h2&gt;Gnus&lt;/h2&gt;
&lt;p&gt;Nothing remarkable, but I wrote a couple of &lt;a href=&quot;http://git.gnus.org/cgit/gnus.git/commit/?id=3ccee76adca8a830cf781e697119b980cd9fcbe1&quot;&gt;fixes&lt;/a&gt; and &lt;a href=&quot;http://git.gnus.org/cgit/gnus.git/commit/?id=01c211faea248b5d9e35f3662670bb8d12b9b137&quot;&gt;enhancements&lt;/a&gt; to the Sieve manage mode and to the &lt;a href=&quot;http://git.gnus.org/cgit/gnus.git/commit/?id=d715adda2809176649227153d9e97564e755efb6&quot;&gt;Gravatar code&lt;/a&gt;, and cleaned up some very, very old code. I also added the possibility to &lt;a href=&quot;http://git.gnus.org/cgit/gnus.git/commit/?id=2bd6537597f51762a4b04f81c70d8f2be5dcb690&quot;&gt;set list-identifier as a group parameter&lt;/a&gt;.&lt;/p&gt;
&lt;h2&gt;Org-mode&lt;/h2&gt;
&lt;p&gt;I spent most of my time working on my &lt;a href=&quot;http://git.naquadah.org/?p=~jd/org-mode.git;a=shortlog;h=refs/heads/jd/agenda-format&quot;&gt;jd/agenda-format&lt;/a&gt; branch, which is soon to be merged. I&apos;ve also just got developer access to the Org-mode patch work and repository, so I&apos;ll be able to break things even more! ;-)&lt;/p&gt;
&lt;h2&gt;ERC&lt;/h2&gt;
&lt;p&gt;I &lt;a href=&quot;http://git.savannah.gnu.org/cgit/emacs.git/commit/?id=391de97a758c44e5d38e0c8f0bd50fe5eae09d5f&quot;&gt;fixed&lt;/a&gt; &lt;strong&gt;the&lt;/strong&gt; bug that annoyed me for a long time. Now &lt;em&gt;erc-track&lt;/em&gt; does not reset the last channel status on window visibility changes not made by the user.&lt;/p&gt;
</content:encoded></item><item><title>Announcing Org-contacts</title><link>https://julien.danjou.info/blog/announcing-org-contacts/</link><guid isPermaLink="true">https://julien.danjou.info/blog/announcing-org-contacts/</guid><description>When I started to use Emacs, I got hooked by many things like Gnus and Org-mode. One thing I quickly started to hate is how the Lisp code can be old and unmaintained.</description><pubDate>Tue, 08 Feb 2011 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;When I started to use Emacs, I got hooked by many things, like &lt;a href=&quot;http://gnus.org&quot;&gt;Gnus&lt;/a&gt; and &lt;a href=&quot;http://orgmode.org&quot;&gt;Org-mode&lt;/a&gt;. One thing I quickly started to hate is how old and unmaintained the Lisp code can be. That especially applies to &lt;a href=&quot;http://bbdb.sourceforge.net&quot;&gt;BBDB&lt;/a&gt;, which has been unmaintained for years and has very, very old and obsolete code.&lt;/p&gt;
&lt;p&gt;Therefore I decided to develop my own BBDB replacement based on my lovely Org-mode. It&apos;s called &lt;code&gt;org-contacts&lt;/code&gt;, and it allows you to handle your contacts like anything else you handle in Org. This way you can manage them the way you want, without any preset fields or assumptions like BBDB has.&lt;/p&gt;
&lt;p&gt;I had the chance to present it at the Paris OrgCamp a couple of weeks ago, and thanks to the enthusiastic audience I had there, I&apos;m now releasing it to the wide Internet.&lt;/p&gt;
</content:encoded></item><item><title>Naquadah theme for Emacs</title><link>https://julien.danjou.info/blog/naquadah-theme-for-emacs/</link><guid isPermaLink="true">https://julien.danjou.info/blog/naquadah-theme-for-emacs/</guid><description>I often post Emacs screenshots on this blog, and consequently receive a bunch of requests for my Emacs theme. Therefore I decided to publish it.</description><pubDate>Mon, 31 Jan 2011 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;I often post Emacs screenshots on this blog, and consequently receive a bunch of requests for my Emacs theme. Therefore I decided to &lt;a href=&quot;https://github.com/jd/naquadah-theme&quot;&gt;publish it&lt;/a&gt;.&lt;/p&gt;
</content:encoded></item><item><title>OrgCamp Paris 2011 review</title><link>https://julien.danjou.info/blog/orgcamp-paris-2011-review/</link><guid isPermaLink="true">https://julien.danjou.info/blog/orgcamp-paris-2011-review/</guid><description>Yesterday afternoon, I was at the first OrgCamp in Paris.</description><pubDate>Sun, 23 Jan 2011 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Yesterday afternoon, I was at the first &lt;a href=&quot;http://www.lifehacking.fr/mediawiki/index.php/OrgModeCampJanvier2011&quot;&gt;OrgCamp in Paris&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;It was my first time attending a &lt;a href=&quot;http://en.wikipedia.org/wiki/BarCamp&quot;&gt;BarCamp&lt;/a&gt;, and I really liked it. It&apos;s basically the first geek event I have not found boring or useless.&lt;/p&gt;
&lt;p&gt;There were about 18-20 people participating, which was quite a lot, since we all initially thought we would be only 5.&lt;/p&gt;
&lt;p&gt;We had several presentations of various features and personal usages of &lt;a href=&quot;http://www.orgmode.org&quot;&gt;Org-mode&lt;/a&gt;. For my part, I quickly presented the agenda and my &lt;a href=&quot;http://bbdb.sourceforge.net/&quot;&gt;BBDB&lt;/a&gt; replacement named &lt;strong&gt;org-contacts&lt;/strong&gt; (I&apos;ll probably talk about it in another post on this blog later).&lt;/p&gt;
&lt;p&gt;The only downside was that Bastien (the new Org-mode maintainer) was not able to come and join us. On the other hand, there was so much to talk about for a first time that I did not have much time to code. I was only able to &lt;a href=&quot;http://lists.gnu.org/archive/html/emacs-orgmode/2011-01/msg01002.html&quot;&gt;fix one bug&lt;/a&gt; reported during my agenda presentation.&lt;/p&gt;
&lt;p&gt;In the end, the overall atmosphere was very enthusiastic and friendly, which was extremely pleasant. The #org-mode-fr IRC channel has been created on &lt;a href=&quot;http://freenode.net&quot;&gt;Freenode&lt;/a&gt;, following this event. Feel free to join us.&lt;/p&gt;
&lt;p&gt;Since people liked it so much, it seems there should be another BarCamp in the coming months. Stay tuned.&lt;/p&gt;
</content:encoded></item><item><title>Code fontification with Gnus and Org-mode</title><link>https://julien.danjou.info/blog/code-fontification-with-gnus-and-orgmode/</link><guid isPermaLink="true">https://julien.danjou.info/blog/code-fontification-with-gnus-and-orgmode/</guid><description>I&apos;ve added code fontification using Org src blocks inside Gnus.</description><pubDate>Thu, 20 Jan 2011 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;I&apos;ve added code fontification using &lt;a href=&quot;http://orgmode.org/manual/Working-With-Source-Code.html#Working-With-Source-Code&quot;&gt;Org src blocks&lt;/a&gt; inside &lt;a href=&quot;http://gnus.org&quot;&gt;Gnus&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/gnus-org-buffer-fontification-1.png&quot; alt=&quot;gnus-org-buffer-fontification-1&quot; /&gt;&lt;/p&gt;
&lt;p&gt;This interprets the block as an Org buffer and fontifies it accordingly if &lt;code&gt;org-src-fontify-natively&lt;/code&gt; is set to &lt;code&gt;t&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;Thanks to &lt;a href=&quot;http://news.gmane.org/find-root.php?message_id=%3c80k4lj78ui.fsf%40mundaneum.com%3e&quot;&gt;Sébastien Vauban for the original idea&lt;/a&gt; and implementation. Now it works out of the box without any customization.&lt;/p&gt;
</content:encoded></item><item><title>Color contrast correction</title><link>https://julien.danjou.info/blog/color-contrast-correction/</link><guid isPermaLink="true">https://julien.danjou.info/blog/color-contrast-correction/</guid><description>I finally took some time to finish my color contrast corrector.  It&apos;s now able to compare two colors and to tell if they are readable when used as foreground and background color for text rendering. I</description><pubDate>Tue, 23 Nov 2010 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;I finally took some time to finish my color contrast corrector.&lt;/p&gt;
&lt;p&gt;It&apos;s now able to compare two colors and tell whether they are readable when used as foreground and background colors for text rendering. If they are too close, the code corrects both colors so that they become distant enough to be readable.&lt;/p&gt;
&lt;p&gt;To do that, it uses color coordinates in the &lt;a href=&quot;http://en.wikipedia.org/wiki/Lab_color_space&quot;&gt;CIE L*a*b* color space&lt;/a&gt;. This makes it very easy to determine the luminance difference between two colors, by comparing the &lt;em&gt;L&lt;/em&gt; component of the coordinates. The default threshold used to determine readability based on luminance difference is 40 (out of 100), which seems to give pretty good results so far.&lt;/p&gt;
&lt;p&gt;Then it uses the &lt;a href=&quot;http://en.wikipedia.org/wiki/Color_difference#CIEDE2000&quot;&gt;CIE Delta E 2000&lt;/a&gt; formula to obtain the distance between the colors. A distance of 6 is considered enough for the colors to be distinguishable in our case, but that can be adjusted; it depends on the reader&apos;s eyes.&lt;/p&gt;
&lt;p&gt;If both the color and luminance distances are big enough, the color pair is considered readable when the colors are used on top of each other.&lt;/p&gt;
&lt;p&gt;If these criteria are not satisfied, the code simply tries to correct the colors by adjusting their &lt;em&gt;L&lt;/em&gt; (luminance) components so that their difference reaches 40. Optionally, the background color can be fixed so that only the foreground color is adjusted; this is especially handy when the background color is not provided by any external style but is the screen&apos;s one (like the Emacs frame background in my case).&lt;/p&gt;
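&lt;p&gt;The logic above can be sketched roughly as follows. This is a minimal Python sketch of the idea, not the actual implementation (which is Emacs Lisp); the function names are made up, and the plain Euclidean Lab distance (CIE76) stands in for the CIE Delta E 2000 formula the real code uses:&lt;/p&gt;

```python
import math

# Hypothetical sketch of the correction logic described above, in
# Python rather than the actual Emacs Lisp. Thresholds mirror the
# post: 40 (out of 100) for the luminance difference, 6 for the
# color distance. Plain Euclidean Lab distance (CIE76) stands in
# for the CIE Delta E 2000 formula used by the real code.

def readable_p(fg, bg, min_dl=40.0, min_de=6.0):
    """Return True when the Lab pair (fg over bg) should be legible."""
    dl = abs(fg[0] - bg[0])   # luminance difference
    de = math.dist(fg, bg)    # CIE76 stand-in for Delta E 2000
    return dl >= min_dl and de >= min_de

def correct_luminance(fg, bg, min_dl=40.0, fix_bg=False):
    """Push the L components apart until they differ by min_dl."""
    if readable_p(fg, bg, min_dl):
        return fg, bg
    lf, lb = fg[0], bg[0]
    if fix_bg:
        # Only the foreground may move, e.g. over the frame background.
        lf = min(100.0, lb + min_dl) if lf >= lb else max(0.0, lb - min_dl)
        return (lf,) + fg[1:], bg
    # Otherwise, push both L values away from their common midpoint.
    mid, half = (lf + lb) / 2.0, min_dl / 2.0
    if lf >= lb:
        lf, lb = min(100.0, mid + half), max(0.0, mid - half)
    else:
        lf, lb = max(0.0, mid - half), min(100.0, mid + half)
    return (lf,) + fg[1:], (lb,) + bg[1:]
```

&lt;p&gt;When both colors may move, each is pushed away from their midpoint; with the fixed-background option, only the foreground moves, clamped to the [0, 100] range of &lt;em&gt;L&lt;/em&gt;.&lt;/p&gt;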
&lt;p&gt;Here is an example result generated over 10 pairs of random colors. The left colors are randomly generated, and the right colors are the corrected ones.&lt;/p&gt;
&lt;table style=&quot;border-collapse: collapse; width: 100%; font-family: monospace; font-size: 0.85em;&quot;&gt;
&lt;thead&gt;
&lt;tr&gt;&lt;th style=&quot;text-align: left; padding: 6px 10px;&quot;&gt;Original&lt;/th&gt;&lt;th&gt;&lt;/th&gt;&lt;th style=&quot;text-align: left; padding: 6px 10px;&quot;&gt;Corrected&lt;/th&gt;&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td style=&quot;padding: 6px 10px; background-color: #698b69; color: #ababab;&quot;&gt;DarkSeaGreen4 / gray67&lt;/td&gt;
&lt;td style=&quot;padding: 6px 10px;&quot;&gt;→&lt;/td&gt;
&lt;td style=&quot;padding: 6px 10px; background-color: #4a6b4b; color: #cccccc;&quot;&gt;#4a6b4b / #cccccc&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&quot;padding: 6px 10px; background-color: #6c7b8b; color: #228b22;&quot;&gt;SlateGray4 / forest green&lt;/td&gt;
&lt;td style=&quot;padding: 6px 10px;&quot;&gt;→&lt;/td&gt;
&lt;td style=&quot;padding: 6px 10px; background-color: #9faec0; color: #005700;&quot;&gt;#9faec0 / #005700&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&quot;padding: 6px 10px; background-color: #212121; color: #5c5c5c;&quot;&gt;grey13 / grey36&lt;/td&gt;
&lt;td style=&quot;padding: 6px 10px;&quot;&gt;→&lt;/td&gt;
&lt;td style=&quot;padding: 6px 10px; background-color: #131313; color: #6c6c6c;&quot;&gt;#131313 / #6c6c6c&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&quot;padding: 6px 10px; background-color: #9f79ee; color: #f0fff0;&quot;&gt;MediumPurple2 / honeydew&lt;/td&gt;
&lt;td style=&quot;padding: 6px 10px;&quot;&gt;→&lt;/td&gt;
&lt;td style=&quot;padding: 6px 10px; background-color: #9e78ed; color: #f0fff0;&quot;&gt;#9e78ed / #f0fff0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&quot;padding: 6px 10px; background-color: #6e6e6e; color: #66cd00;&quot;&gt;grey43 / chartreuse3&lt;/td&gt;
&lt;td style=&quot;padding: 6px 10px;&quot;&gt;→&lt;/td&gt;
&lt;td style=&quot;padding: 6px 10px; background-color: #5e5e5e; color: #79de25;&quot;&gt;#5e5e5e / #79de25&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&quot;padding: 6px 10px; background-color: #faf0e6; color: #ee1289;&quot;&gt;linen / DeepPink2&lt;/td&gt;
&lt;td style=&quot;padding: 6px 10px;&quot;&gt;→&lt;/td&gt;
&lt;td style=&quot;padding: 6px 10px; background-color: #faf0e6; color: #ee1289;&quot;&gt;linen / DeepPink2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&quot;padding: 6px 10px; background-color: #53868b; color: #0000ff;&quot;&gt;CadetBlue4 / blue1&lt;/td&gt;
&lt;td style=&quot;padding: 6px 10px;&quot;&gt;→&lt;/td&gt;
&lt;td style=&quot;padding: 6px 10px; background-color: #6c9fa4; color: #0000e1;&quot;&gt;#6c9fa4 / #0000e1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&quot;padding: 6px 10px; background-color: #545454; color: #cdb38b;&quot;&gt;gray33 / NavajoWhite3&lt;/td&gt;
&lt;td style=&quot;padding: 6px 10px;&quot;&gt;→&lt;/td&gt;
&lt;td style=&quot;padding: 6px 10px; background-color: #525252; color: #cfb58c;&quot;&gt;#525252 / #cfb58c&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&quot;padding: 6px 10px; background-color: #7fff00; color: #cd9b9b;&quot;&gt;chartreuse1 / RosyBrown3&lt;/td&gt;
&lt;td style=&quot;padding: 6px 10px;&quot;&gt;→&lt;/td&gt;
&lt;td style=&quot;padding: 6px 10px; background-color: #9cff38; color: #b28282;&quot;&gt;#9cff38 / #b28282&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&quot;padding: 6px 10px; background-color: #c71585; color: #ff1493;&quot;&gt;medium violet red / DeepPink1&lt;/td&gt;
&lt;td style=&quot;padding: 6px 10px;&quot;&gt;→&lt;/td&gt;
&lt;td style=&quot;padding: 6px 10px; background-color: #9c0060; color: #ff55b9;&quot;&gt;#9c0060 / #ff55b9&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;All this has been written in Emacs Lisp. The code is now available in &lt;a href=&quot;http://www.gnus.org&quot;&gt;Gnus&lt;/a&gt; (and therefore in Emacs 24) in the packages &lt;a href=&quot;http://git.gnus.org/cgit/gnus.git/tree/lisp/color-lab.el&quot;&gt;color-lab&lt;/a&gt; and &lt;a href=&quot;http://git.gnus.org/cgit/gnus.git/tree/lisp/shr-color.el&quot;&gt;shr-color&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Future work would be to add support for color blindness.&lt;/p&gt;
&lt;p&gt;As a side note, several people pointed me at the &lt;a href=&quot;http://www.w3.org/TR/WCAG10/&quot;&gt;WCAG&lt;/a&gt; formulas to determine luminance and contrast ratio. These are probably good criteria for choosing your colors when designing a user interface. However, they are not enough to determine whether a displayed color pair will be readable. This means you can use them if you are a designer, but IMHO they are pretty weak for detecting and correcting colors you did not choose.&lt;/p&gt;
</content:encoded></item><item><title>Elisp color manipulation routines</title><link>https://julien.danjou.info/blog/elisp-color-manipulation-routines/</link><guid isPermaLink="true">https://julien.danjou.info/blog/elisp-color-manipulation-routines/</guid><description>Last week, I spent some time implementing various color manipulation routines. The ultimate goal was to find a way to determine if a text in a certain color was readable on a background with a differe</description><pubDate>Sat, 20 Nov 2010 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Last week, I spent some time implementing various color manipulation routines. The ultimate goal was to find a way to determine if a text in a certain color was readable on a background with a different color.&lt;/p&gt;
&lt;p&gt;Something I failed to do so far, despite my research in the area.&lt;/p&gt;
&lt;p&gt;However, since I think my code could be useful for other people, I&apos;ve set up a tiny &lt;a href=&quot;http://git.naquadah.org/?p=~jd/color-el.git;a=summary&quot;&gt;git repository&lt;/a&gt; with the routines I wrote.&lt;/p&gt;
&lt;p&gt;The funniest one to implement was &lt;a href=&quot;http://en.wikipedia.org/wiki/Color_difference#CIEDE2000&quot;&gt;CIEDE2000&lt;/a&gt;. I verified my code with the data given in &lt;a href=&quot;http://www.ece.rochester.edu/~gsharma/ciede2000/ciede2000noteCRNA.pdf&quot;&gt;the specifications&lt;/a&gt; and can assure it&apos;s correct. :-)&lt;/p&gt;
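&lt;p&gt;For the curious, here is what the formula looks like, transcribed into Python rather than my Emacs Lisp (the function name and structure are mine, using the standard kL = kC = kH = 1 weights). It can be checked against the first pair from the test data:&lt;/p&gt;

```python
import math

# A Python transcription of the CIEDE2000 color difference formula
# (kL = kC = kH = 1), written from the published specification; this
# is not the Emacs Lisp implementation discussed in the post.
def delta_e_2000(lab1, lab2):
    L1, a1, b1 = lab1
    L2, a2, b2 = lab2
    C1, C2 = math.hypot(a1, b1), math.hypot(a2, b2)
    Cbar = (C1 + C2) / 2.0
    G = 0.5 * (1.0 - math.sqrt(Cbar**7 / (Cbar**7 + 25.0**7)))
    a1p, a2p = (1.0 + G) * a1, (1.0 + G) * a2
    C1p, C2p = math.hypot(a1p, b1), math.hypot(a2p, b2)
    # Hue angles in degrees, defined as 0 when the chroma is null.
    h1p = math.degrees(math.atan2(b1, a1p)) % 360.0 if C1p else 0.0
    h2p = math.degrees(math.atan2(b2, a2p)) % 360.0 if C2p else 0.0
    dLp, dCp = L2 - L1, C2p - C1p
    if C1p * C2p == 0.0:
        dhp = 0.0
    else:
        dhp = h2p - h1p
        if dhp > 180.0:
            dhp -= 360.0
        elif -180.0 > dhp:
            dhp += 360.0
    dHp = 2.0 * math.sqrt(C1p * C2p) * math.sin(math.radians(dhp) / 2.0)
    Lbarp, Cbarp = (L1 + L2) / 2.0, (C1p + C2p) / 2.0
    if C1p * C2p == 0.0:
        hbarp = h1p + h2p
    elif abs(h1p - h2p) > 180.0:
        hbarp = ((h1p + h2p + 360.0) / 2.0 if 360.0 > h1p + h2p
                 else (h1p + h2p - 360.0) / 2.0)
    else:
        hbarp = (h1p + h2p) / 2.0
    T = (1.0 - 0.17 * math.cos(math.radians(hbarp - 30.0))
         + 0.24 * math.cos(math.radians(2.0 * hbarp))
         + 0.32 * math.cos(math.radians(3.0 * hbarp + 6.0))
         - 0.20 * math.cos(math.radians(4.0 * hbarp - 63.0)))
    dtheta = 30.0 * math.exp(-(((hbarp - 275.0) / 25.0) ** 2))
    RC = 2.0 * math.sqrt(Cbarp**7 / (Cbarp**7 + 25.0**7))
    SL = 1.0 + 0.015 * (Lbarp - 50.0) ** 2 / math.sqrt(20.0 + (Lbarp - 50.0) ** 2)
    SC = 1.0 + 0.045 * Cbarp
    SH = 1.0 + 0.015 * Cbarp * T
    RT = -math.sin(math.radians(2.0 * dtheta)) * RC
    return math.sqrt((dLp / SL) ** 2 + (dCp / SC) ** 2 + (dHp / SH) ** 2
                     + RT * (dCp / SC) * (dHp / SH))
```

&lt;p&gt;For instance, the first pair of the reference test data, (50, 2.6772, -79.7751) against (50, 0, -82.7485), should yield a difference of about 2.0425.&lt;/p&gt;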
</content:encoded></item><item><title>Org-mode and holidays</title><link>https://julien.danjou.info/blog/org-mode-and-holidays/</link><guid isPermaLink="true">https://julien.danjou.info/blog/org-mode-and-holidays/</guid><description>Org-mode has a nice option which allows you to show week-end days in a different color in your agenda.</description><pubDate>Mon, 15 Nov 2010 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;a href=&quot;http://orgmode.org&quot;&gt;Org-mode&lt;/a&gt; has a nice option which allows you to show week-end days in a different color in your agenda. That means that Saturday and Sunday (when I do not work) are fontified with &lt;code&gt;org-agenda-date-weekend&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;But there are other days I do not work, like my vacations or holidays.&lt;/p&gt;
&lt;p&gt;Therefore, I wrote a patch adding &lt;code&gt;org-agenda-day-face-function&lt;/code&gt;, which is optionally called to determine the face used to fontify a day. &lt;a href=&quot;http://lists.gnu.org/archive/html/emacs-orgmode/2010-11/msg00542.html&quot;&gt;This&lt;/a&gt; allows me to use the same face for holidays as for weekend days, like last Thursday, which was a holiday in France.&lt;/p&gt;
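&lt;p&gt;A minimal sketch of how one could use it (my actual configuration differs; this relies on the standard &lt;code&gt;calendar-check-holidays&lt;/code&gt; function):&lt;/p&gt;

```elisp
;; Sketch: fontify holidays with the week-end face in the agenda.
;; The function receives DATE as a (month day year) list and must
;; return a face, or nil to keep the default one.
(setq org-agenda-day-face-function
      (lambda (date)
        (when (calendar-check-holidays date)
          'org-agenda-date-weekend)))
```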
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/emacs-org-mode-holidays-1.png&quot; alt=&quot;emacs-org-mode-holidays-1&quot; /&gt;&lt;/p&gt;
&lt;p&gt;That patch has been merged in Org last week.&lt;/p&gt;
</content:encoded></item><item><title>Google Maps for Emacs: moving, caching and home</title><link>https://julien.danjou.info/blog/google-maps-for-emacs-moving-caching-home/</link><guid isPermaLink="true">https://julien.danjou.info/blog/google-maps-for-emacs-moving-caching-home/</guid><description>Last week, I worked on my Google Maps for Emacs extension. I&apos;ve introduced a new format handling for locations which include the longitude and latitude.</description><pubDate>Mon, 08 Nov 2010 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Last week, I worked on my Google Maps for Emacs extension. I&apos;ve introduced a new format handling for locations which include the longitude and latitude. The initial format was just a string describing the location, which was obviously too limited.&lt;/p&gt;
&lt;p&gt;It now prints coordinates of the different elements when the mouse is over the map, with other information.&lt;/p&gt;
&lt;p&gt;It also centers the map on &lt;em&gt;M-x google-maps&lt;/em&gt; and sets a default zoom level. Previously the center was not set, because setting center coordinates prevents seeing all points on the map automatically; you can still remove the centering by pressing &lt;em&gt;&quot;C&quot;&lt;/em&gt;. On the other hand, setting it automatically makes it easy to move the map, which I think is what most users want to do.&lt;/p&gt;
&lt;p&gt;I&apos;ve also added a &quot;place my home on the map&quot; feature, accessible by pressing &lt;code&gt;h&lt;/code&gt; on any map. That adds a marker according to the location set in Emacs using the &lt;code&gt;calendar-&lt;/code&gt; variables.&lt;/p&gt;
&lt;p&gt;This feature is also available under &lt;a href=&quot;http://orgmode.org&quot;&gt;Org&lt;/a&gt; by pressing &lt;em&gt;C-u C-c M-l&lt;/em&gt;, which shows the location of your appointment with your home on the map too.&lt;/p&gt;
&lt;p&gt;Finally, you also get caching, so it does not re-request images you have already seen, which makes moving around nicer and faster, plus prompt history.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/emacs-google-maps-move-1.png&quot; alt=&quot;emacs-google-maps-move-1&quot; /&gt;&lt;/p&gt;
</content:encoded></item><item><title>Icon category support in Org-mode</title><link>https://julien.danjou.info/blog/icon-category-support-in-org-mode/</link><guid isPermaLink="true">https://julien.danjou.info/blog/icon-category-support-in-org-mode/</guid><description>My latest patch for Org mode has been accepted by Carsten today. It adds support for custom category icons in all views, like agenda or todo.</description><pubDate>Thu, 04 Nov 2010 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;My latest patch for &lt;a href=&quot;http://orgmode.org&quot;&gt;Org mode&lt;/a&gt; has been accepted by Carsten today. It adds support for custom category icons in all views, like agenda or todo.&lt;/p&gt;
&lt;p&gt;You just need to configure &lt;em&gt;org-agenda-category-icon-alist&lt;/em&gt; and it will work out of the box.&lt;/p&gt;
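&lt;p&gt;For example (the icon paths here are made up; each entry is a regexp matching the category, followed by an image specification):&lt;/p&gt;

```elisp
;; Hypothetical configuration: show a custom icon next to the
;; "Emacs" and "Music" categories in agenda views.
(setq org-agenda-category-icon-alist
      '(("Emacs" "~/.emacs.d/icons/emacs.png" nil nil :ascent center)
        ("Music" "~/.emacs.d/icons/music.png" nil nil :ascent center)))
```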
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/emacs-org-category-icons-1.png&quot; alt=&quot;emacs-org-category-icons-1&quot; /&gt;&lt;/p&gt;
</content:encoded></item><item><title>Transparent GIF support in Emacs 24</title><link>https://julien.danjou.info/blog/transparent-gif-support-in-emacs24/</link><guid isPermaLink="true">https://julien.danjou.info/blog/transparent-gif-support-in-emacs24/</guid><description>Last week, I wrote a couple of patches to add support for transparency when Emacs is displaying GIF images.</description><pubDate>Tue, 02 Nov 2010 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Last week, I wrote a &lt;a href=&quot;http://lists.gnu.org/archive/html/emacs-devel/2010-10/msg01009.html&quot;&gt;couple of patches&lt;/a&gt; to add support for transparency when Emacs is displaying &lt;a href=&quot;http://en.wikipedia.org/wiki/Graphics_Interchange_Format&quot;&gt;GIF&lt;/a&gt; images.&lt;/p&gt;
&lt;p&gt;Until now, it displayed the color used to mark transparency in the file data. Now it displays the image correctly by using the frame color as the transparency color, as is done for other image formats.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/emacs-gif-transparent-1.png&quot; alt=&quot;emacs-gif-transparent-1&quot; /&gt;&lt;/p&gt;
&lt;p&gt;The patches have not been merged yet, but will probably be soon.&lt;/p&gt;
</content:encoded></item><item><title>No more dashes in Emacs 24 mode-line</title><link>https://julien.danjou.info/blog/no-more-dashes-in-emacs24-mode-line/</link><guid isPermaLink="true">https://julien.danjou.info/blog/no-more-dashes-in-emacs24-mode-line/</guid><description>We all know the good old Emacs mode-line you got under every window. Since the beginning (a long time ago), it starts and ends with dashes. I&apos;ve proposed a patch to remove them.</description><pubDate>Wed, 20 Oct 2010 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;We all know the good old Emacs mode-line you got under every window. Since the beginning (a long time ago), it starts and ends with dashes. I&apos;ve proposed &lt;a href=&quot;http://lists.gnu.org/archive/html/emacs-devel/2010-10/msg00675.html&quot;&gt;a patch&lt;/a&gt; to remove them.&lt;/p&gt;
&lt;p&gt;Before:&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/emacs-dashes.png&quot; alt=&quot;Screenshot of Emacs with dashes in the mode line&quot; /&gt;&lt;/p&gt;
&lt;p&gt;After:&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/emacs-no-dashes-2.png&quot; alt=&quot;Screenshot of Emacs without dashes in the mode line&quot; /&gt;&lt;/p&gt;
&lt;p&gt;This has been merged in Emacs 24. You won&apos;t see any more ugly dashes in graphical mode.&lt;/p&gt;
</content:encoded></item><item><title>Enhancing Emacs mouse avoidance</title><link>https://julien.danjou.info/blog/enhancing-emacs-mouse-avoidance/</link><guid isPermaLink="true">https://julien.danjou.info/blog/enhancing-emacs-mouse-avoidance/</guid><description>Enhancing Emacs mouse-avoidance-mode to respect the invisible mouse pointer, merged in Emacs 24.</description><pubDate>Tue, 19 Oct 2010 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Recent &lt;em&gt;Emacs&lt;/em&gt; versions have a wonderful capacity to hide the mouse pointer as soon as you type and insert characters in a buffer. This is controlled by the &lt;code&gt;make-pointer-invisible&lt;/code&gt; variable, which is set to t by default.&lt;/p&gt;
&lt;p&gt;However, that does not hide the pointer when you are simply moving the cursor on screen. Therefore, I&apos;ve started to use &lt;code&gt;mouse-avoidance-mode&lt;/code&gt;, which makes the mouse pointer jump if your cursor hits it.&lt;/p&gt;
&lt;p&gt;Unfortunately, if your cursor hits the invisible mouse pointer, &lt;code&gt;mouse-avoidance-mode&lt;/code&gt; makes it jump too, because it does not know it is invisible.&lt;/p&gt;
&lt;p&gt;Well, it &lt;em&gt;did&lt;/em&gt; not know. Now it does, &lt;a href=&quot;http://lists.gnu.org/archive/html/emacs-devel/2010-10/msg00574.html&quot;&gt;thanks to my patches&lt;/a&gt;, which have been merged in Emacs 24. Using the new function &lt;code&gt;frame-pointer-invisible-p&lt;/code&gt;, one can tell whether the mouse pointer has been hidden by Emacs. I therefore enhanced &lt;code&gt;mouse-avoidance-mode&lt;/code&gt; to use it, and everything is all right now. :-)&lt;/p&gt;
</content:encoded></item><item><title>Why notmuch is not much good</title><link>https://julien.danjou.info/blog/why-notmuch-is-not-much-good/</link><guid isPermaLink="true">https://julien.danjou.info/blog/why-notmuch-is-not-much-good/</guid><description>I recently got an email from one of my faithful readers, asking why I was not considering notmuch.</description><pubDate>Thu, 07 Oct 2010 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;I recently got an email from one of my faithful readers, asking why I was not considering &lt;a href=&quot;http://notmuchmail.org/&quot;&gt;notmuch&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Actually, I think notmuch already exists in a better form, and it&apos;s called &lt;a href=&quot;http://en.wikipedia.org/wiki/Internet_Message_Access_Protocol&quot;&gt;IMAP&lt;/a&gt; and &lt;a href=&quot;http://en.wikipedia.org/wiki/Sieve_(mail_filtering_language)&quot;&gt;Sieve&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;What &lt;em&gt;notmuch&lt;/em&gt; does is tag your mail (obviously), based on filtering rules you write. The big downside is that all the tagging happens locally, on your computer.&lt;/p&gt;
&lt;p&gt;And if you use several computers, you&apos;ll have to tag your mails several times, and you&apos;ll have to find a way to keep your rules identical on all of them. That does not scale.&lt;/p&gt;
&lt;p&gt;Using Sieve for mail filtering, one can already do all of that, and much more.&lt;/p&gt;
&lt;p&gt;A &lt;em&gt;notmuch&lt;/em&gt; rule like:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;notmuch tag +intel from:intel.com and not tag:intel
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;can be written as a Sieve rule:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;require [&quot;imap4flags&quot;];

if address :all :contains &quot;From&quot; &quot;intel.com&quot; {
	addflag &quot;intel&quot;;
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The flags extension for Sieve is explained in &lt;a href=&quot;http://tools.ietf.org/html/rfc5232&quot;&gt;RFC5232&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The Sieve-based solution has the advantage of being handled server side, and is therefore not affected by using multiple or different MUAs. It&apos;s also fast, if you use a good IMAP server like &lt;a href=&quot;http://www.dovecot.org&quot;&gt;Dovecot&lt;/a&gt;, which has indexing, etc.&lt;/p&gt;
&lt;p&gt;Furthermore, Sieve can obviously do a lot more than tagging: splitting into different mailboxes, filtering with regexps, vacation auto-replies, etc.&lt;/p&gt;
&lt;p&gt;And if you want to fetch your mail locally, you can synchronize the IMAP box entirely with any software able to do so (like &lt;a href=&quot;http://github.com/jgoerzen/offlineimap/wiki&quot;&gt;OfflineIMAP&lt;/a&gt;).&lt;/p&gt;
&lt;p&gt;Now, what&apos;s probably missing is proper support for IMAP flags in the various MUAs around. But that&apos;s not something &lt;em&gt;notmuch&lt;/em&gt; helps to solve either. :-)&lt;/p&gt;
</content:encoded></item><item><title>Gnus and Gravatar support</title><link>https://julien.danjou.info/blog/gnus-gravatar-support/</link><guid isPermaLink="true">https://julien.danjou.info/blog/gnus-gravatar-support/</guid><description>These last couple of days, I&apos;ve been dedicated to making Gnus… fresher.</description><pubDate>Sat, 25 Sep 2010 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;These last couple of days, I&apos;ve been dedicated to making &lt;a href=&quot;http://www.gnus.org&quot;&gt;Gnus&lt;/a&gt;… fresher.&lt;/p&gt;
&lt;p&gt;I decided to give &lt;a href=&quot;http://www.gravatar.com&quot;&gt;Gravatar&lt;/a&gt; support a whirl. I had already tried the &lt;code&gt;gravatar.el&lt;/code&gt; lying around on the Interweb, but, well, it was crap: it used &lt;code&gt;wget&lt;/code&gt; to fetch pictures and was therefore totally synchronous, so reading each mail was slower. The cache did not even have a TTL, as far as I recall.&lt;/p&gt;
&lt;p&gt;So I&apos;ve now written &lt;a href=&quot;http://git.gnus.org/cgit/gnus.git/tree/lisp/gravatar.el&quot;&gt;&lt;code&gt;gravatar.el&lt;/code&gt;&lt;/a&gt;, implementing the Gravatar API. Asynchronously, of course. With caching, TTL, etc. Perfect. :-)&lt;/p&gt;
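&lt;p&gt;For the curious, here is a minimal sketch of how the asynchronous API can be used. This is an illustration from memory, so double-check the callback signature against the source:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;;; Fetch the Gravatar for an address; the callback is invoked with the
;; image (or the symbol `error&apos;) once the HTTP request completes, plus
;; any extra arguments given as the third parameter.
(gravatar-retrieve
 &quot;julien@danjou.info&quot;
 (lambda (image buffer)
   (unless (eq image &apos;error)
     (with-current-buffer buffer
       (insert-image image))))
 (list (current-buffer)))
&lt;/code&gt;&lt;/pre&gt;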
&lt;p&gt;Then I composed &lt;a href=&quot;http://git.gnus.org/cgit/gnus.git/tree/lisp/gnus-gravatar.el&quot;&gt;&lt;code&gt;gnus-gravatar.el&lt;/code&gt;&lt;/a&gt;, implementing a washing function that adds Gravatars to the &lt;code&gt;From&lt;/code&gt; and/or &lt;code&gt;Cc&lt;/code&gt;/&lt;code&gt;To&lt;/code&gt; fields, as is done for &lt;a href=&quot;https://www.cs.indiana.edu/picons/&quot;&gt;picons&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;As I was expecting, the patch was badly received by the GNU guys, who started talking about things like external resources, privacy, non-free software, etc. Boring.&lt;/p&gt;
&lt;p&gt;Fortunately, Lars allowed me to push the patch to git so everybody can give it a try. I&apos;m now waiting for feedback in order to know whether I will have to maintain this patch outside Gnus or not.&lt;/p&gt;
&lt;p&gt;Here&apos;s the mandatory screenshot.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/gnus-gravatar-1.png&quot; alt=&quot;gnus-gravatar-1&quot; /&gt;&lt;/p&gt;
</content:encoded></item><item><title>Gnus news is good news!</title><link>https://julien.danjou.info/blog/gnus-news-is-good-news/</link><guid isPermaLink="true">https://julien.danjou.info/blog/gnus-news-is-good-news/</guid><description>As I&apos;ve already written too many times, I started using Gnus six months ago and never looked back.</description><pubDate>Thu, 23 Sep 2010 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;As I&apos;ve already written too many times, I started using &lt;a href=&quot;http://gnus.org&quot;&gt;Gnus&lt;/a&gt; six months ago and never looked back.&lt;/p&gt;
&lt;p&gt;At that time, I joined the &lt;a href=&quot;http://gnus.org/resources.html&quot;&gt;ding mailing list&lt;/a&gt; in order to ask some dumb questions and, once, send a patch. There was very little activity on that list.&lt;/p&gt;
&lt;p&gt;Until Lars, the original Gnus author, came back.&lt;/p&gt;
&lt;p&gt;Three weeks ago, he started to write a new wash function to render &lt;a href=&quot;http://en.wikipedia.org/wiki/HTML&quot;&gt;HTML&lt;/a&gt; mails properly, with pictures. It&apos;s named &lt;code&gt;gnus-html&lt;/code&gt;, and is (for now) based on &lt;a href=&quot;http://w3m.sourceforge.net&quot;&gt;w3m&lt;/a&gt; (but not on &lt;a href=&quot;http://emacs-w3m.namazu.org/&quot;&gt;emacs-w3m&lt;/a&gt;, which is not part of Emacs).&lt;/p&gt;
&lt;p&gt;Last week, I sent a set of patches replacing the use of &lt;a href=&quot;http://curl.haxx.se&quot;&gt;curl&lt;/a&gt; with the standard &lt;code&gt;url-retrieve&lt;/code&gt; function to fetch images, plus various enhancements. It seems my work was good enough that Lars offered me write access to the git repository. I can therefore mess up Gnus entirely. Hurrah!&lt;/p&gt;
&lt;p&gt;I&apos;ve continued to work on &lt;code&gt;gnus-html&lt;/code&gt; and recently merged a set of patches improving image retrieval (which is now done in parallel) and starting to use &lt;code&gt;url-cache&lt;/code&gt; to cache images for a defined period of time. Of course, I found a bunch of tiny bugs and special cases while reading RSS feeds and various HTML mails, and fixed them all along the way.&lt;/p&gt;
&lt;p&gt;Lars added a &lt;a href=&quot;http://xmlsoft.org&quot;&gt;libxml&lt;/a&gt; binding for Emacs 24, providing the &lt;code&gt;html-parse-string&lt;/code&gt; function. His plan seems to be to abandon w3m in favor of native parsing via libxml to render HTML, and therefore HTML mails.&lt;/p&gt;
&lt;p&gt;I should also mention the new &lt;code&gt;nnimap&lt;/code&gt; back-end; Gnus was designed to read &lt;a href=&quot;http://en.wikipedia.org/wiki/Network_News_Transfer_Protocol&quot;&gt;NNTP&lt;/a&gt; newsgroups, not mail. Consequently, it behaved very poorly when used with a back-end such as &lt;a href=&quot;http://en.wikipedia.org/wiki/Internet_Message_Access_Protocol&quot;&gt;IMAP&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Lars took a week to entirely rewrite our dear &lt;code&gt;nnimap&lt;/code&gt; back-end and make it behave in a more expected way. There are still a bunch of bugs to fix and code to write, but it is at least usable and seems faster than the old code.&lt;/p&gt;
&lt;p&gt;The last thing I did was rewrite the icon support in the group buffer. When I started to use Gnus, I was curious and tried to configure this. I never managed to make it work, and I now know and understand why it was broken. So I ended up rewriting it entirely, and now it works. I never thought I would understand, fix, and commit this code when reading the Gnus documentation this winter, but hell yeah, I did.&lt;/p&gt;
&lt;p&gt;I still have several little projects to improve things in all sorts of areas. We&apos;ll see what I&apos;ll do next. :-)&lt;/p&gt;
</content:encoded></item><item><title>Emacs, Org, whatever the weather!</title><link>https://julien.danjou.info/blog/emacs-org-whatever/</link><guid isPermaLink="true">https://julien.danjou.info/blog/emacs-org-whatever/</guid><description>Another week, another Emacs extension!</description><pubDate>Wed, 08 Sep 2010 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Another week, another &lt;a href=&quot;http://www.gnu.org/software/emacs/&quot;&gt;Emacs&lt;/a&gt; extension!&lt;/p&gt;
&lt;p&gt;I had (once again) a wonderful idea: what if I could have the weather forecasts in my &lt;a href=&quot;http://orgmode.org&quot;&gt;Org&lt;/a&gt; agenda? Wouldn&apos;t that be wonderful?&lt;/p&gt;
&lt;p&gt;My quest started by looking for a service offering a good weather forecast API. I found nothing as simple as the hidden Google Weather API, which is nice, but… not documented. Not at all. Not a single line. Nah.&lt;/p&gt;
&lt;p&gt;Then, I wrote a &lt;strong&gt;google-weather&lt;/strong&gt; extension, implementing a basic Emacs Lisp API to retrieve data from the Google service:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;ELISP&amp;gt; (google-weather-data-&amp;gt;forecast (google-weather-get-data &quot;Paris&quot;))
(((9 8 2010)
  (low &quot;53&quot;)
  (high &quot;63&quot;)
  (icon &quot;http://www.google.com/ig/images/weather/rain.gif&quot;)
  (condition &quot;Rain&quot;))
 ((9 9 2010)
  (low &quot;53&quot;)
  (high &quot;69&quot;)
  (icon &quot;http://www.google.com/ig/images/weather/chance_of_rain.gif&quot;)
  (condition &quot;Scattered Showers&quot;))
 ((9 10 2010)
  (low &quot;54&quot;)
  (high &quot;72&quot;)
  (icon &quot;http://www.google.com/ig/images/weather/partly_cloudy.gif&quot;)
  (condition &quot;Partly Cloudy&quot;))
 ((9 11 2010)
  (low &quot;55&quot;)
  (high &quot;75&quot;)
  (icon &quot;http://www.google.com/ig/images/weather/partly_cloudy.gif&quot;)
  (condition &quot;Partly Cloudy&quot;)))
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;My API even implements data caching, which is nice to speed up the agenda display.&lt;/p&gt;
&lt;p&gt;By the way, I think my next job will be to hack on the &lt;em&gt;url-cache&lt;/em&gt; feature of Emacs, which is utterly buggy and has probably never been used. But that&apos;s another topic.&lt;/p&gt;
&lt;p&gt;Finally, I just had to write another module on top of that to export the forecasts to Org. A screenshot is probably better than a long and boring explanation, so here&apos;s the result.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/org-google-weather-1.png&quot; alt=&quot;org-google-weather-1&quot; /&gt;&lt;/p&gt;
&lt;p&gt;My only regret is that the icons provided by Google are ugly squares, so I did not want to use them. On the other hand, I did not find any icon set that has all the icons Google provides (around 20). So I fell back on the &lt;a href=&quot;http://standards.freedesktop.org/icon-naming-spec/icon-naming-spec-latest.html&quot;&gt;icon naming specification&lt;/a&gt; to map the Google images to standard images. Any better idea would be welcome, of course.&lt;/p&gt;
&lt;p&gt;All the information can be found on the &lt;a href=&quot;https://github.com/jd/google-weather.el&quot;&gt;Google Weather for Emacs extension homepage&lt;/a&gt;.&lt;/p&gt;
</content:encoded></item><item><title>Emacs and OfflineIMAP</title><link>https://julien.danjou.info/blog/emacs-and-offlineimap/</link><guid isPermaLink="true">https://julien.danjou.info/blog/emacs-and-offlineimap/</guid><description>I recently decided to use OfflineIMAP to synchronize my mails on my laptop. It&apos;s a great piece of software, and allows me to read my mail while I&apos;m offline.</description><pubDate>Fri, 03 Sep 2010 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;I recently decided to use &lt;a href=&quot;http://wiki.github.com/jgoerzen/offlineimap/&quot;&gt;OfflineIMAP&lt;/a&gt; to synchronize my mails on my laptop. It&apos;s a great piece of software, and allows me to read my mail while I&apos;m offline.&lt;/p&gt;
&lt;p&gt;I use it with &lt;a href=&quot;http://www.gnus.org&quot;&gt;Gnus&lt;/a&gt;, of course. But I lacked a proper way to integrate OfflineIMAP with it, so I decided to write a little Emacs extension to run and monitor OfflineIMAP directly from Emacs.&lt;/p&gt;
&lt;p&gt;Here comes &lt;a href=&quot;https://github.com/jd/offlineimap.el&quot;&gt;offlineimap.el&lt;/a&gt;, an Emacs extension to run OfflineIMAP directly within Emacs. It will display OfflineIMAP output in a buffer, and optionally shows the current OfflineIMAP operation in the mode line.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/offlineimap-el-1.png&quot; alt=&quot;offlineimap-el-1&quot; /&gt;&lt;/p&gt;
&lt;p&gt;By default the status is in the mode line only if you are in the Gnus group buffer. But that&apos;s customizable, of course, since this is Emacs!&lt;/p&gt;
&lt;p&gt;If you are using &lt;a href=&quot;http://github.com/dimitri/el-get&quot;&gt;el-get&lt;/a&gt;, there&apos;s already a recipe to install it!&lt;/p&gt;
</content:encoded></item><item><title>Emacs, Google Maps and BBDB</title><link>https://julien.danjou.info/blog/emacs-google-maps-bbdb/</link><guid isPermaLink="true">https://julien.danjou.info/blog/emacs-google-maps-bbdb/</guid><description>Today&apos;s fun idea was to put all my contacts stored in BBDB on a Google Maps map, using my Google Maps extension for Emacs.</description><pubDate>Wed, 18 Aug 2010 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Today&apos;s fun idea was to put all my contacts stored in &lt;a href=&quot;http://bbdb.sourceforge.net/&quot;&gt;BBDB&lt;/a&gt; on a Google Maps map, using my Google Maps extension for Emacs.&lt;/p&gt;
&lt;p&gt;With the help of a few lines of Lisp glue:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;(google-maps-static-show
 :markers
 (mapcar
  (lambda (address-entry)
    `((,(concat
         (mapconcat
          &apos;identity
          (elt address-entry 1) &quot;, &quot;) &quot;, &quot;
          (elt address-entry 2) &quot;, &quot;
          (elt address-entry 3) &quot;, &quot;
          (elt address-entry 4) &quot;, &quot;
          (elt address-entry 5)))))
  (mapcan
   (lambda (record)
     ;; We need to copy the returned list, because mapcan will modify it later
     (copy-list (bbdb-record-addresses record)))
   (bbdb-records))))
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/emacs-google-maps-bbdb-1.png&quot; alt=&quot;Screenshot of Google Maps with BBDB contacts&quot; /&gt;&lt;/p&gt;
&lt;p&gt;It&apos;s really simplistic, but I did not need more to have fun. :-) This could be extended to set a specific marker and/or color for each contact, with a legend. I&apos;ll leave that as an exercise for my readers.&lt;/p&gt;
</content:encoded></item><item><title>Update on rainbow-mode</title><link>https://julien.danjou.info/blog/update-on-rainbow-mode/</link><guid isPermaLink="true">https://julien.danjou.info/blog/update-on-rainbow-mode/</guid><description>rainbow-mode was a big success and got good feedback when I released it for the first time a couple of months ago.</description><pubDate>Tue, 10 Aug 2010 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;rainbow-mode was a big success and got good feedback when I released it for the first time a couple of months ago.&lt;/p&gt;
&lt;p&gt;Several users asked me to request its inclusion in &lt;a href=&quot;http://www.gnu.org/software/emacs/&quot;&gt;Emacs&lt;/a&gt;. Therefore, some days ago, &lt;a href=&quot;http://lists.gnu.org/archive/html/emacs-devel/2010-07/msg01290.html&quot;&gt;I proposed to merge it into the Emacs trunk&lt;/a&gt;. My request was denied, but the mode has been added to the &lt;a href=&quot;http://elpa.gnu.org&quot;&gt;Emacs 24 package repository&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;In the meantime, I&apos;ve added support for &lt;a href=&quot;http://www.w3.org/TR/css3-color/#hsl-color&quot;&gt;hsl() and hsla()&lt;/a&gt;, and added &lt;a href=&quot;http://www.w3.org/TR/css3-color/#svg-color&quot;&gt;CSS 3/SVG color names&lt;/a&gt;.&lt;/p&gt;
</content:encoded></item><item><title>Porting D-Bus to XCB: story of a failure</title><link>https://julien.danjou.info/blog/porting-dbus-on-xcb/</link><guid isPermaLink="true">https://julien.danjou.info/blog/porting-dbus-on-xcb/</guid><description>Even if I recently stated I lost some of my faith in XCB, I still sometimes hack things to add support for it.</description><pubDate>Thu, 29 Jul 2010 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Even if &lt;a href=&quot;https://julien.danjou.info/blog/thoughts-and-rambling-on-the-x-protocol&quot;&gt;I recently stated I lost some of my faith&lt;/a&gt; in &lt;a href=&quot;http://xcb.freedesktop.org&quot;&gt;XCB&lt;/a&gt;, I still sometimes hack things to add support for it.&lt;/p&gt;
&lt;p&gt;These last days, I&apos;ve worked on a &lt;a href=&quot;http://dbus.freedesktop.org&quot;&gt;D-Bus&lt;/a&gt; port from Xlib to XCB. The port was quite straightforward, since there&apos;s only a little piece of D-Bus using X, which is &lt;code&gt;dbus-launch&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;I thought D-Bus was a good candidate, since it&apos;s part of the &lt;a href=&quot;http://www.freedesktop.org&quot;&gt;Freedesktop&lt;/a&gt; initiative. Therefore, I was expecting a warm welcome and some enthusiasm from a fellow project.&lt;/p&gt;
&lt;p&gt;My contribution got one useful review, and a &lt;a href=&quot;http://lists.freedesktop.org/archives/dbus/2010-July/013185.html&quot;&gt;cold reply from Thiago Macieira&lt;/a&gt; (a &lt;a href=&quot;http://www.kde.org&quot;&gt;KDE&lt;/a&gt;/&lt;a href=&quot;http://qt.nokia.com&quot;&gt;Qt&lt;/a&gt;/&lt;a href=&quot;http://www.nokia.com&quot;&gt;Nokia&lt;/a&gt; developer):&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;No, sorry, I don&apos;t agree..&lt;br /&gt;
I&apos;ve just checked and my Solaris machine doesn&apos;t have XCB.&lt;br /&gt;
Please do not remove the X11 code. You may &lt;em&gt;add&lt;/em&gt; the XCB code, but you cannot remove the X11 code.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;This is not really the kind of answer I expected, actually. I then reworked the code to &lt;a href=&quot;http://lists.freedesktop.org/archives/dbus/2010-July/013192.html&quot;&gt;please Thiago&lt;/a&gt;, and added some &lt;em&gt;#ifdef&lt;/em&gt; to add XCB support to D-Bus, with a fallback to libx11 where XCB would not be available.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;http://lists.freedesktop.org/archives/dbus/2010-July/013196.html&quot;&gt;Havoc Pennington replied&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Given that libX11 now uses xcb as backend, I don&apos;t understand the&lt;br /&gt;
value of porting to use libxcb directly when there isn&apos;t an issue of&lt;br /&gt;
round trips or other stuff. It will just make #ifdef hell, while the&lt;br /&gt;
X11 API is an API that works on both xcb and non-xcb platforms.&lt;br /&gt;
Maybe people should be thinking about porting xcb to non-Linux&lt;br /&gt;
platforms? The X protocol should be the same on other UNiX, so xcb in&lt;br /&gt;
theory ought to work fine if you just compiled it on Solaris/BSD, same&lt;br /&gt;
as GTK or dbus or Qt would work fine.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The last part &quot;Maybe people should be thinking about porting xcb to non-Linux platforms?&quot; is still unclear to me, even though &lt;a href=&quot;http://lists.freedesktop.org/archives/dbus/2010-July/013197.html&quot;&gt;I asked Havoc to explain what he meant&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Finally, &lt;a href=&quot;http://lists.freedesktop.org/archives/dbus/2010-July/013198.html&quot;&gt;Thiago refused to merge the patch&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;[…] thanks for the patch, but like Havoc I am unsure of the value. We can&apos;t&lt;br /&gt;
drop the X11 codepaths now because too many systems exist without&lt;br /&gt;
XCB. Adding the XCB codepaths only made it more complex, even though you did a good job.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I can&apos;t disagree with that conclusion: using both XCB and X11 makes the code unreadable for little gain. That&apos;s why I replaced libx11 with XCB directly in the first version of the patch. On the other hand, the D-Bus people do not seem to really care about making their software evolve in the right direction, even if that requires users to upgrade their systems.&lt;/p&gt;
&lt;p&gt;I think D-Bus using and depending on XCB would have been a good way to push adoption of XCB. Unfortunately, it seems you can&apos;t even rely on projects from the same initiative (i.e. Freedesktop) to work together to make things a little bit better.&lt;/p&gt;
&lt;p&gt;After 5 years of existence, XCB is still not so obvious to people, and getting it adopted is going to be a challenge for the coming years. The upside is that &lt;a href=&quot;http://www.x.org/wiki/Releases/7.6&quot;&gt;the new X.org 7.6 will bring XCB with it&lt;/a&gt;, as part of the katamari.&lt;/p&gt;
</content:encoded></item><item><title>M-x google-maps</title><link>https://julien.danjou.info/blog/google-maps-el/</link><guid isPermaLink="true">https://julien.danjou.info/blog/google-maps-el/</guid><description>Since I started to use Org-mode, I thought it was missing a way to show appointment locations on a map.</description><pubDate>Mon, 28 Jun 2010 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Since I started to use &lt;a href=&quot;http://www.orgmode.org&quot;&gt;Org-mode&lt;/a&gt;, I thought it was missing a way to show appointment locations on a map. Of course, it&apos;s easy to get a &lt;code&gt;LOCATION&lt;/code&gt; property from an entry, and then &lt;code&gt;browse-url&lt;/code&gt; on &lt;a href=&quot;http://maps.google.com&quot;&gt;Google Maps&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/emacs-google-maps-1.png&quot; alt=&quot;emacs-google-maps-1&quot; /&gt;&lt;/p&gt;
&lt;p&gt;But it is &lt;strong&gt;too&lt;/strong&gt; easy for me, so once again I said: challenge accepted! I will bring Google Maps into Emacs!&lt;/p&gt;
&lt;p&gt;After several hours of work, the &lt;a href=&quot;https://github.com/jd/google-maps.el&quot;&gt;google-maps.el project&lt;/a&gt; shows a map!&lt;/p&gt;
&lt;p&gt;It fully implements the &lt;a href=&quot;http://code.google.com/apis/maps/documentation/staticmaps/&quot;&gt;Google Static Maps API&lt;/a&gt; and the &lt;a href=&quot;http://code.google.com/apis/maps/documentation/geocoding/&quot;&gt;Google Maps Geocoding API&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;You can type &lt;code&gt;M-x google-maps&lt;/code&gt; and enter some place to see it marked on the map. Of course you can do much more, as seen in the screenshot above.&lt;/p&gt;
&lt;p&gt;I&apos;ve also completed all of this with a small &lt;code&gt;org-location-google-maps.el&lt;/code&gt;, which simply shows a Google Maps map for the location of an event in &lt;em&gt;Org mode&lt;/em&gt; when you press &lt;code&gt;C-c M-l&lt;/code&gt; in an Org buffer or in the Org agenda.&lt;/p&gt;
</content:encoded></item><item><title>Announcing rainbow-mode</title><link>https://julien.danjou.info/blog/announcing-rainbow-mode/</link><guid isPermaLink="true">https://julien.danjou.info/blog/announcing-rainbow-mode/</guid><description>While customizing Emacs these last weeks, I also needed to customize the color theme.</description><pubDate>Wed, 16 Jun 2010 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;While customizing &lt;a href=&quot;http://www.gnu.org/software/emacs/&quot;&gt;Emacs&lt;/a&gt; these last weeks, I also needed to customize the color theme.&lt;/p&gt;
&lt;p&gt;Color themes are always a pain in the ass to edit, because you&apos;re supposed to read color strings like &lt;em&gt;#aabbcc&lt;/em&gt; and guess what colors they represent.&lt;/p&gt;
&lt;p&gt;This is why I wrote &lt;em&gt;rainbow-mode&lt;/em&gt;, a minor mode for Emacs that highlights strings that represent colors, using the color they represent.&lt;/p&gt;
&lt;p&gt;It supports hexadecimal syntax, HTML color names, X color names, and the CSS rgb() syntax.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://julien.danjou.info/content/images/03/rainbow-mode-1.png&quot; alt=&quot;rainbow-mode-1&quot; /&gt;&lt;/p&gt;
</content:encoded></item><item><title>Desktop notification support for Emacs</title><link>https://julien.danjou.info/blog/desktop-notification-support-for-emacs/</link><guid isPermaLink="true">https://julien.danjou.info/blog/desktop-notification-support-for-emacs/</guid><description>These last weeks, I&apos;ve worked on implementing the Desktop Notification Specification in Emacs.</description><pubDate>Wed, 09 Jun 2010 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;These last weeks, I&apos;ve worked on implementing the &lt;a href=&quot;http://www.galago-project.org/specs/notification/&quot;&gt;Desktop Notification Specification&lt;/a&gt; in &lt;a href=&quot;http://www.gnu.org/software/emacs/&quot;&gt;Emacs&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;It allows sending desktop notifications in a very simple way.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;(notifications-notify
    :title &quot;You&apos;ve got mail!&quot;
    :body &quot;There&apos;s 34 mails unread&quot;
    :app-icon &quot;~/.emacs.d/icons/mail.png&quot;
    :urgency &apos;low)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;It supports the protocol signals (&lt;code&gt;NotificationClosed&lt;/code&gt; and &lt;code&gt;ActionInvoked&lt;/code&gt;) and the two main methods (&lt;code&gt;Notify&lt;/code&gt; and &lt;code&gt;CloseNotification&lt;/code&gt;).&lt;/p&gt;
&lt;p&gt;The method specifications are implemented entirely (hints, replaces, actions, icons, etc.).&lt;/p&gt;
&lt;p&gt;The signals are supported via callback functions provided at notification creation.&lt;/p&gt;
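&lt;p&gt;For example, hooking into both signals looks like this (a sketch; double-check the exact callback arguments against the source):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;(notifications-notify
 :title &quot;You&apos;ve got mail!&quot;
 :body &quot;There are 34 unread mails&quot;
 :actions &apos;(&quot;read&quot; &quot;Read now&quot;)
 ;; Called with the notification id and the action key when the user
 ;; clicks an action button (ActionInvoked signal).
 :on-action (lambda (id action) (message &quot;Action %s on #%d&quot; action id))
 ;; Called with the notification id and the close reason
 ;; (NotificationClosed signal).
 :on-close (lambda (id reason) (message &quot;Notification #%d closed: %s&quot; id reason)))
&lt;/code&gt;&lt;/pre&gt;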
&lt;p&gt;It has been merged into the Emacs trunk today.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;2010-06-09  Julien Danjou  &amp;lt;julien@danjou.info&amp;gt;

	* net/notifications.el: New file.

&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This also allowed me to discover, raise, and fix a &lt;a href=&quot;http://lists.gnu.org/archive/html/emacs-devel/2010-06/msg00228.html&quot;&gt;bug&lt;/a&gt; in the D-Bus binding of Emacs, which will probably be fixed in trunk soon.&lt;/p&gt;
</content:encoded></item><item><title>Announcing erc-track-score</title><link>https://julien.danjou.info/blog/announcing-erc-track-score/</link><guid isPermaLink="true">https://julien.danjou.info/blog/announcing-erc-track-score/</guid><description>A couple of months ago, I started using ERC to hang out on IRC.</description><pubDate>Mon, 07 Jun 2010 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;A couple of months ago, I started using &lt;a href=&quot;http://www.emacswiki.org/emacs/ERC&quot;&gt;ERC&lt;/a&gt; to hang out on &lt;a href=&quot;http://en.wikipedia.org/wiki/Internet_Relay_Chat&quot;&gt;IRC&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;I&apos;ve read all the pages on &lt;a href=&quot;http://www.emacswiki.org/&quot;&gt;EmacsWiki&lt;/a&gt; about it, just to see how far I could customize it.&lt;/p&gt;
&lt;p&gt;I must admit that I was not disappointed, even if I expected to be. It&apos;s quite a nice piece of software, and once well configured it&apos;s more convenient than my old &lt;a href=&quot;http://www.irssi.org&quot;&gt;irssi&lt;/a&gt; setup.&lt;/p&gt;
&lt;p&gt;While browsing EmacsWiki, I read an interesting idea about channel scoring/temperature on the &lt;a href=&quot;http://www.emacswiki.org/emacs/ErcChannelTracking#toc9&quot;&gt;erc-track&lt;/a&gt; page. The idea is to see if it&apos;s worth jumping to an IRC channel to see what people are talking about.&lt;/p&gt;
&lt;p&gt;Challenge accepted!&lt;/p&gt;
&lt;p&gt;I sat down and started to dig through the ERC source code to find the information I needed about variables and functions.&lt;/p&gt;
&lt;p&gt;I finally wrote something nice, which I called erc-track-score. Yet another piece of software I wrote for my lovely &lt;a href=&quot;http://www.gnu.org/software/emacs/&quot;&gt;Emacs&lt;/a&gt;!&lt;/p&gt;
&lt;p&gt;How does it work? Ha-ha, I was sure you would ask. You&apos;re so predictable, dude! Read the following, and you&apos;ll know everything you ever wanted to know about it since the moment you read the title of that blog entry.&lt;/p&gt;
&lt;p&gt;Which probably turned you on.&lt;/p&gt;
&lt;p&gt;Nasty you.&lt;/p&gt;
&lt;p&gt;First of all, the score of a channel starts at zero. Zero means &quot;seriously, don&apos;t bother, nothing is happening here&quot;.&lt;/p&gt;
&lt;p&gt;Upon each new message arrival, the score is incremented by 1. If a new message contains a keyword or your nickname, or is sent by a pal, the score is increased by configurable values, by default between 2 and 20 points, depending on the match type. On the other hand, when a message is sent by some fool, the score is decreased (by 1 by default).&lt;/p&gt;
&lt;p&gt;Obviously, if the score goes negative, you really should not jump to the channel.&lt;/p&gt;
&lt;p&gt;Finally, the score is permanently and slowly brought back to 0. By default, the score is decreased by 1 point every 10 seconds.&lt;/p&gt;
&lt;p&gt;Overall, reading the score should give you a good idea of the channel&apos;s temperature.&lt;/p&gt;
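&lt;p&gt;To make the mechanics concrete, the scoring described above boils down to something like this illustrative sketch (not the actual erc-track-score code; the function names and values here are made up to match the defaults):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;(defun my/score-message (score match-type)
  &quot;Return the new SCORE after a message of MATCH-TYPE arrives.&quot;
  (+ score
     (cond ((eq match-type &apos;current-nick) 20) ; someone said your name
           ((eq match-type &apos;keyword) 10)      ; configurable, 2 to 20 by default
           ((eq match-type &apos;pal) 5)           ; a friend is talking
           ((eq match-type &apos;fool) -1)         ; fools drag the score down
           (t 1))))                           ; any other message counts for 1

;; Decay: every 10 seconds, bring the score back toward zero by one point.
(defun my/score-decay (score)
  (cond ((&amp;gt; score 0) (1- score))
        ((&amp;lt; score 0) (1+ score))
        (t 0)))
&lt;/code&gt;&lt;/pre&gt;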
&lt;p&gt;I&apos;m still not sure what the best formula to compute the score is, but so far the default values seem quite good. We&apos;ll see.&lt;/p&gt;
</content:encoded></item><item><title>Thoughts and rambling on the X protocol</title><link>https://julien.danjou.info/blog/thoughts-and-rambling-on-the-x-protocol/</link><guid isPermaLink="true">https://julien.danjou.info/blog/thoughts-and-rambling-on-the-x-protocol/</guid><description>Two years ago, while working on awesome, I joined the Freedesktop initiative to work on XCB. I had to learn the arcana of the X11 protocol and all the mysterious old world that goes with it.</description><pubDate>Tue, 01 Jun 2010 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Two years ago, while working on &lt;a href=&quot;http://awesome.naquadah.org&quot;&gt;awesome&lt;/a&gt;, I joined the &lt;a href=&quot;http://www.freedesktop.org&quot;&gt;Freedesktop&lt;/a&gt; initiative to work on &lt;a href=&quot;http://xcb.freedesktop.org&quot;&gt;XCB&lt;/a&gt;. I had to learn the arcana of the X11 protocol and all the mysterious old world that goes with it.&lt;/p&gt;
&lt;p&gt;Now that I&apos;ve swum in this mud for all these months, I feel like I need to share my thoughts about what has become a mess over the decades.&lt;/p&gt;
&lt;h2&gt;Before I was born…&lt;/h2&gt;
&lt;p&gt;…the band &lt;a href=&quot;http://en.wikipedia.org/wiki/Toto_(band)&quot;&gt;Toto&lt;/a&gt; was releasing its &lt;a href=&quot;http://en.wikipedia.org/wiki/Africa_(Toto_song)&quot;&gt;song &quot;Africa&quot;&lt;/a&gt; and some smart guys were working on a windowing system: the X Window System (that is its full name), which therefore has a (too) long history. The latest version of its protocol, the 11th, was designed in the 80s. You can learn more about the history in the &lt;a href=&quot;http://en.wikipedia.org/wiki/X_Window_System&quot;&gt;Wikipedia article about X&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;In 2010, we still listen to disco music and we still use various protocols designed in the 80s, some even older than X. Music has evolved, protocols have evolved, and so has X11.&lt;/p&gt;
&lt;p&gt;The problem is that X11 did not evolve that well. The guys at MIT-and-some-other-places-with-very-smart-people-in-them created X version 1 in 1984 and updated it until X version 11 (the one we&apos;re still using) in 1987. Eleven versions in 3 years: that was the &quot;release early, release often&quot; model. But for some reason, it just stopped happening for the last 23 years (that&apos;s not totally true: they added, and then deprecated, many extensions).&lt;/p&gt;
&lt;p&gt;I don&apos;t know what changes were made in the first 11 major versions of the X protocol, but I&apos;m rather sure we deserved a couple of major version updates over the last 2 decades.&lt;/p&gt;
&lt;p&gt;In my humble opinion, X11 was not designed to live for 23 years. But hey, I&apos;m not blaming anyone here: I was 4 years old and playing with Lego® when they released this latest version of the X protocol, so there is little chance I&apos;d have done any better.&lt;/p&gt;
&lt;h2&gt;We won&apos;t fix. We&apos;ll work around.&lt;/h2&gt;
&lt;p&gt;That has probably been one of the guidelines of the X protocol for the last few years. And don&apos;t misread me: I&apos;m not bashing anyone in what follows.&lt;/p&gt;
&lt;p&gt;Since the X11 protocol was aging, the X guys started to add &lt;a href=&quot;http://en.wikipedia.org/wiki/X_Window_System_protocols_and_architecture#Extensions&quot;&gt;extensions&lt;/a&gt;. They added tons of them over the years, in application of one of &lt;a href=&quot;http://en.wikipedia.org/wiki/X_Window_System_protocols_and_architecture#Design_principles&quot;&gt;the early principles of X&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;It is as important to decide what a system is not as to decide what it&lt;br /&gt;
is. Do not serve all the world&apos;s needs; rather, make the system extensible&lt;br /&gt;
so that additional needs can be met in an upwardly compatible fashion.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;All of them, without exception, were added because, bad luck, the X11 protocol did not anticipate the things that happened over the last 23 years, like video, OpenGL, multiple monitors, or the pleasure of drawing oval windows. Some of these extensions are still in use, while some have been dropped.&lt;/p&gt;
&lt;p&gt;While extending the protocol is not a bad thing in itself, trying to fix the protocol with, for example, the &lt;a href=&quot;http://en.wikipedia.org/wiki/XFixes&quot;&gt;XFixes extension&lt;/a&gt; seems like one, even with all the good intentions Keith Packard might have in his greatness.&lt;/p&gt;
&lt;h2&gt;Actually it&apos;s even worse than you think&lt;/h2&gt;
&lt;p&gt;The X11 protocol (without extensions) defines about 120 types of requests: create a window, move a window, etc.&lt;/p&gt;
&lt;p&gt;Nowadays, at least 25 % of them are useless: server-side fonts, or the drawing of squares and polygons, are unused by any modern application or toolkit. All of this has been superseded by requests from extensions, like &lt;a href=&quot;http://en.wikipedia.org/wiki/XRender&quot;&gt;XRender&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The handling of multi-monitor displays has been totally screwed up. X11 was designed to work in &lt;a href=&quot;http://en.wikipedia.org/wiki/Zaphod_Beeblebrox#Cultural_references&quot;&gt;Zaphod&lt;/a&gt; mode (independent monitors). But &lt;a href=&quot;http://en.wikipedia.org/wiki/Xinerama&quot;&gt;Xinerama&lt;/a&gt;, and nowadays &lt;a href=&quot;http://en.wikipedia.org/wiki/XRandR&quot;&gt;XRandR&lt;/a&gt;, have replaced it: recent X servers (released after ~2007) do not support Zaphod mode anymore, even though it&apos;s a core piece of the X11 protocol.&lt;/p&gt;
&lt;p&gt;Worse: many requests have limitations or design flaws, as described in this document by DEC researchers: &lt;a href=&quot;http://www.std.org/~msm/common/protocol.pdf&quot;&gt;Why X Is Not Our Ideal Window System&lt;/a&gt;.&lt;/p&gt;
&lt;h2&gt;We&apos;ll add more broken standards on top of that&lt;/h2&gt;
&lt;p&gt;Following &lt;a href=&quot;http://en.wikipedia.org/wiki/X_Window_System_protocols_and_architecture#Design_principles&quot;&gt;its early principles&lt;/a&gt;, X does not define policies but only mechanisms, which seems like a good thing.&lt;/p&gt;
&lt;p&gt;Consequently, people started writing specifications to settle a number of conventions and dogmas: &lt;a href=&quot;http://en.wikipedia.org/wiki/ICCCM&quot;&gt;ICCCM&lt;/a&gt;. That was 22 years ago, in 1988. Needless to say, many things in this specification are now obsolete or useless, and it misses many modern features.&lt;/p&gt;
&lt;p&gt;I was not the only one to think that. The people from what would become the major desktop environments, &lt;a href=&quot;http://www.kde.org&quot;&gt;KDE&lt;/a&gt; and &lt;a href=&quot;http://www.gnome.org&quot;&gt;GNOME&lt;/a&gt;, saw that too in the 90s, while I was learning to count. So they wrote &lt;a href=&quot;http://en.wikipedia.org/wiki/Extended_Window_Manager_Hints&quot;&gt;EWMH&lt;/a&gt;, another standard that sits on top of ICCCM and extends it with nifty features like maximization, full-screen mode, etc.&lt;/p&gt;
&lt;p&gt;The problem is that this standard was also written by narrow-minded people who, at that time, were working on GNOME or KDE (and maybe others). These desktop environments had, and still have, strong opinions about how a desktop should work: &quot;it should have workspaces&quot;, &quot;a window is only on one workspace&quot;, &quot;we only see one workspace at a time&quot;, &quot;you do not have multiple screens&quot;, etc.&lt;/p&gt;
&lt;h2&gt;Dude, we don&apos;t care: we have toolkits!&lt;/h2&gt;
&lt;p&gt;This vision of how the desktop should work has now been carved in marble in all applications and libraries implementing EWMH, like &lt;a href=&quot;http://www.gtk.org&quot;&gt;GTK+&lt;/a&gt; or &lt;a href=&quot;http://qt.nokia.com&quot;&gt;Qt&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Nowadays, everybody has forgotten about all of these standards. Toolkits have implemented them, circumvented the X11 protocol&apos;s limitations and flaws, and nobody wants to look back.&lt;/p&gt;
&lt;p&gt;Like all standards, some people obviously implemented them badly. This had some side effects, like &lt;a href=&quot;https://julien.danjou.info/blog/openoffice-better-as-a-pager&quot;&gt;OpenOffice acting like a pager&lt;/a&gt;.&lt;/p&gt;
&lt;h2&gt;We don&apos;t look back? Worse, we forgot where we came from!&lt;/h2&gt;
&lt;p&gt;With all these layers of badly designed standards, the desktop continued to evolve for more than a decade. More standards were added, the most recent ones being based on D-Bus, like the &lt;a href=&quot;http://www.galago-project.org/specs/notification/&quot;&gt;Desktop Notification Specification&lt;/a&gt; or the latest &lt;a href=&quot;http://www.notmart.org/misc/statusnotifieritem/index.html&quot;&gt;Status Notifier Specification&lt;/a&gt; developed by KDE.&lt;/p&gt;
&lt;p&gt;The Status Notifier is a new implementation of the good old system tray, based on &lt;a href=&quot;http://en.wikipedia.org/wiki/D-Bus&quot;&gt;D-Bus&lt;/a&gt; instead of the X11 mechanisms and &lt;a href=&quot;http://en.wikipedia.org/wiki/XEmbed&quot;&gt;XEmbed&lt;/a&gt;, and adding the possibility of showing the system tray with something other than icons.&lt;/p&gt;
&lt;p&gt;An important issue and design flaw in this specification draft was raised by Wolfgang Draxinger in &lt;a href=&quot;http://lists.freedesktop.org/archives/xdg/2010-May/011516.html&quot;&gt;this thread on the XDG mailing-list&lt;/a&gt;. What Wolfgang points out is that X is network-oriented, and D-Bus is not. Therefore, making the Status Notifier specification use D-Bus to pass system tray messages around is a bad idea, since running an X application from host A on host B will draw the system tray on the wrong host!&lt;/p&gt;
&lt;p&gt;Apparently, reading the thread, this &lt;a href=&quot;http://lists.freedesktop.org/archives/xdg/2010-May/011531.html&quot;&gt;does not worry some of the KDE people&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;of course this is a bizarre corner case not worth much thought. at least&lt;br /&gt;
that&apos;s what you&apos;ll think until you actually run into it yourself (be it&lt;br /&gt;
because you are testing something or because you are setting up some&lt;br /&gt;
weird kiosk environment).&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;What Oswald describes as a corner case is actually a common use case for many of us. Of course, YMMV.&lt;/p&gt;
&lt;p&gt;From my point of view, this is a step in the wrong direction. But we can conclude that the network part of X is now worthless, at least to KDE.&lt;/p&gt;
&lt;h2&gt;I used to believe in XCB&lt;/h2&gt;
&lt;p&gt;When I joined Freedesktop, it was to work on XCB, the X C Binding. XCB is a nice, clean, 21st-century API for the X11 protocol. Its code is auto-generated from XML files describing the protocol.&lt;/p&gt;
&lt;p&gt;In comparison, Xlib is 80s obfuscated code with almost no comments and hard-coded things. Only a few people understand some of its corners, like its i18n or XKB implementations. And all of its code is &lt;em&gt;synchronous&lt;/em&gt;.&lt;/p&gt;
&lt;p&gt;For people who don&apos;t know it yet, X is a network protocol where you send a request (like a GET in HTTP) and then get a reply. Xlib forces the application to wait for the reply to each request, so the application is blocked until the X server sends the reply. XCB, on the other hand, does not block: it allows the application to send a batch of requests, do some other work in the meantime, and then collect the replies.&lt;/p&gt;
&lt;p&gt;It&apos;s as if your Web browser sent one request at a time to a Web server, and waited to download each image one by one before displaying the page.&lt;/p&gt;
&lt;p&gt;In cases where X and all its clients are on the same host, the latency is small and not really visible, so the gain from XCB being asynchronous is small. On a slow network, however, the gain can be huge, as proved by the &lt;a href=&quot;http://bugs.freedesktop.org/show_bug.cgi?id=4232&quot;&gt;rewrite of xlsclients with XCB by Peter Harris&lt;/a&gt;.&lt;/p&gt;
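&lt;p&gt;To make the difference concrete, here is a toy latency model. This is not real X code: the function names, the per-request cost, and the round-trip times are made up for illustration. With a synchronous library, each request pays a full round trip before the next one can be sent; with a pipelined one, you pay the round trip roughly once for the whole batch.&lt;/p&gt;

```python
# Toy model of request/reply latency: not real X code, just arithmetic
# illustrating why pipelining requests (XCB style) beats one-at-a-time
# round trips (Xlib style) on a slow link.

def sync_cost(n_requests, rtt):
    # Xlib style: send a request, block until the reply arrives, repeat.
    return n_requests * rtt

def pipelined_cost(n_requests, rtt, per_request=0.0001):
    # XCB style: fire all requests, then collect the replies; the
    # round-trip latency is paid once, plus a small per-request cost.
    return rtt + n_requests * per_request

# 100 requests, local server (0.2 ms RTT) vs slow network (100 ms RTT):
for rtt in (0.0002, 0.1):
    print(f"rtt={rtt}s  sync={sync_cost(100, rtt):.3f}s  "
          f"pipelined={pipelined_cost(100, rtt):.3f}s")
```

&lt;p&gt;On the local server both models finish in a few hundredths of a second, but over the slow link the synchronous version pays 100 round trips (10 seconds) where the pipelined one pays roughly one: the same shape as the xlsclients result above.&lt;/p&gt;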
&lt;p&gt;One of the long-standing goals of the XCB folks is to kick out Xlib, to increase speed and hide latency in X11 applications. That requires porting many libraries, because almost none of them (&lt;a href=&quot;http://www.cairographics.org&quot;&gt;Cairo&lt;/a&gt; being an exception) support XCB.&lt;/p&gt;
&lt;p&gt;From where I stand, I don&apos;t really see whether the work is worth it now. The desktop world is monopolized by GNOME and KDE, meaning GTK+ and Qt. It seems neither of these toolkits is interested in working on XCB, nor on the X protocol. They have put hard effort into bypassing X&apos;s limitations and flaws, and they now sit on top of a pile of workarounds and broken-by-design standard implementations. It seems to me they don&apos;t want to go back down the layers and improve things.&lt;/p&gt;
&lt;p&gt;They&apos;re too high up to go back down, and they don&apos;t see what the gain would be.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;http://www.enlightenment.org&quot;&gt;Enlightenment&lt;/a&gt;, with its EFL, was the first toolkit to have an XCB back-end, thanks to the work of Vincent Torri. Unfortunately, the back-end is not maintained and nobody cares about it. The last time I tried it, it did not compile at all.&lt;/p&gt;
&lt;h2&gt;X12?&lt;/h2&gt;
&lt;p&gt;There&apos;s a page called &lt;a href=&quot;http://www.x.org/wiki/Development/X12&quot;&gt;X12&lt;/a&gt; on the Freedesktop wiki, listing all the things that should be fixed some day. Unfortunately, the list keeps growing and no one talks about working on X12.&lt;/p&gt;
&lt;p&gt;On the other hand, there&apos;s a handful of people trying to work, when they have time, on &lt;a href=&quot;http://www.freedesktop.org/wiki/Software/XKeyboardConfig/XKB2Dreams&quot;&gt;XKB2&lt;/a&gt;, the second version of the &quot;let&apos;s-try-to-fix-up-the-keyboard-part-of-the-protocol-we-wrote-23-years-ago-a-second-time&quot; extension.&lt;/p&gt;
&lt;p&gt;To me, it does not seem X12 will happen in the next decade either.&lt;/p&gt;
&lt;h2&gt;Alternative?&lt;/h2&gt;
&lt;p&gt;Do we have alternatives to X? There&apos;s &lt;a href=&quot;http://en.wikipedia.org/wiki/Wayland_(display_server)&quot;&gt;Wayland&lt;/a&gt;, but it&apos;s far from usable. There&apos;s &lt;a href=&quot;http://www.directfb.org/&quot;&gt;DirectFB&lt;/a&gt;, but that&apos;s not very portable. Neither seems to me a candidate to replace X some day.&lt;/p&gt;
&lt;p&gt;Anyhow, none of the main toolkits around support these alternatives. GTK+ once supported DirectFB, but as far as I know, it is neither supported nor working nowadays, as stated by &lt;a href=&quot;http://np237.livejournal.com/27459.html&quot;&gt;Josselin Mouette&lt;/a&gt;. This is why recent versions of the Debian installer have migrated to X for the graphical part, thanks to Cyril Brulebois&apos; work.&lt;/p&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;XCB has been around for more than half a decade, and very few people have shown interest in it. As far as I can see, nobody is interested in using the X protocol directly, and everybody tries to encapsulate it in some higher-level API as soon as possible to stop seeing it. This leads to poorly written applications and toolkits, with a lot of ugly hacks.&lt;/p&gt;
&lt;p&gt;All of that also means that writing applications and graphical toolkits based on XCB would be a very interesting project, but it would mean spending too much time learning to circumvent the X protocol&apos;s flaws, something predecessors like Qt and GTK+ have spent years doing.&lt;/p&gt;
&lt;p&gt;Major toolkit implementations have almost nothing to gain by going back into the dark waters of X. I guess most of their folks prefer to work on shiny 3D effects based on your GPS location, rather than building better foundations for everyone.&lt;/p&gt;
&lt;p&gt;The manpower available in the X world is very small. Debian&apos;s lack of X maintainers is just the tip of the iceberg. There are very smart, competent and skilled guys in the X world, as you can see by simply reading blog posts on &lt;a href=&quot;http://planet.freedesktop.org&quot;&gt;Planet Freedesktop&lt;/a&gt;, for example (me excluded). Unfortunately, there are not enough of them to cover everything involved in X: input devices, graphics devices, new protocol extension specifications, and so on. The X server is really behind, and it seems most developers prefer to work on the server itself rather than on the protocol. Which is understandable.&lt;/p&gt;
&lt;p&gt;I&apos;m curious to see where all of that will lead in the upcoming years. I&apos;ve been walking the hallways of the X world for about 3 years now, and I feel desktop alternatives to KDE and GNOME will all die sooner or later. The time when you could choose between a dozen &quot;modern&quot; window managers has passed.&lt;/p&gt;
&lt;p&gt;After all, maybe that is simply Darwinism applied to computer software.&lt;/p&gt;
</content:encoded></item><item><title>Making startup-notification XCB native</title><link>https://julien.danjou.info/blog/making-startup-notification-xcb-native/</link><guid isPermaLink="true">https://julien.danjou.info/blog/making-startup-notification-xcb-native/</guid><description>I&apos;m trying to work on XCB this week. And today I&apos;ve started to accomplish the second step of a long term goal: making an X11 only library using XCB as its primary interface instead of Xlib.</description><pubDate>Mon, 24 May 2010 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;I&apos;m trying to work on XCB this week. And today I&apos;ve started to accomplish the second step of a long term goal: making an X11 only library using &lt;a href=&quot;http://xcb.freedesktop.org&quot;&gt;XCB&lt;/a&gt; as its primary interface instead of Xlib.&lt;/p&gt;
&lt;p&gt;Last year, I extended the API of &lt;a href=&quot;http://www.freedesktop.org/wiki/Software/startup-notification&quot;&gt;startup-notification&lt;/a&gt; to support XCB as a back-end. This was made possible by factoring out some code, duplicating the X11 code and translating it into its XCB equivalent.&lt;/p&gt;
&lt;p&gt;Today, I&apos;ve accomplished the second step: dropping the Xlib code inside startup-notification to keep only the XCB one.&lt;/p&gt;
&lt;p&gt;For this, I used the x11-xcb library, which is available when Xlib is compiled with XCB as its transport, as is nowadays standard.&lt;/p&gt;
&lt;p&gt;This library provides the function &lt;code&gt;XGetXCBConnection&lt;/code&gt;, which can convert a &lt;code&gt;Display&lt;/code&gt; pointer into an &lt;code&gt;xcb_connection_t&lt;/code&gt; pointer. Consequently, it&apos;s now possible to write and execute XCB-based code while remaining compatible with Xlib.&lt;/p&gt;
&lt;p&gt;I&apos;ve run some benchmarks on my work for the occasion, in order to measure what the gain is.&lt;/p&gt;
&lt;p&gt;The first table describes 1000 launches of a fake application (a modified version of the startup-notification test suite, actually). The X server is local (so the latency is minimal). The gain is computed between the same back-end types, on the total time. &lt;strong&gt;Full XCB&lt;/strong&gt; is the &quot;version&quot; I&apos;m working on.&lt;/p&gt;
&lt;table&gt;
&lt;tr&gt;&lt;th&gt;Version - Back-end&lt;/th&gt;&lt;th&gt;User time (seconds)&lt;/th&gt;&lt;th&gt;Kernel time (seconds)&lt;/th&gt;&lt;th&gt;Total time (seconds)&lt;/th&gt;&lt;th&gt;Gain&lt;/th&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;0.10 - libx11&lt;/td&gt;&lt;td&gt;3.20&lt;/td&gt;&lt;td&gt;7.42&lt;/td&gt;&lt;td&gt;12.989&lt;/td&gt;&lt;td&gt;-&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;0.10 - libxcb&lt;/td&gt;&lt;td&gt;2.76&lt;/td&gt;&lt;td&gt;7.36&lt;/td&gt;&lt;td&gt;12.414&lt;/td&gt;&lt;td&gt;-&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Full XCB - libx11&lt;/td&gt;&lt;td&gt;2.74&lt;/td&gt;&lt;td&gt;7.50&lt;/td&gt;&lt;td&gt;12.380&lt;/td&gt;&lt;td&gt;4.6 %&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Full XCB - libxcb&lt;/td&gt;&lt;td&gt;2.72&lt;/td&gt;&lt;td&gt;7.16&lt;/td&gt;&lt;td&gt;12.037&lt;/td&gt;&lt;td&gt;3.0 %&lt;/td&gt;&lt;/tr&gt;
&lt;/table&gt;
&lt;p&gt;The user and kernel times are provided but are not really interesting. XCB does not offer a big gain in CPU execution time; it is more about latency. Anyhow, there&apos;s always a gain with XCB.&lt;/p&gt;
&lt;p&gt;This second table describes the same test, but run only 100 times over a slow network.&lt;/p&gt;
&lt;table&gt;
&lt;tr&gt;&lt;th&gt;Version - Back-end&lt;/th&gt;&lt;th&gt;Total time (seconds)&lt;/th&gt;&lt;th&gt;Gain&lt;/th&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;0.10 - libx11&lt;/td&gt;&lt;td&gt;76&lt;/td&gt;&lt;td&gt;-&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;0.10 - libxcb&lt;/td&gt;&lt;td&gt;35&lt;/td&gt;&lt;td&gt;-&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Full XCB - libx11&lt;/td&gt;&lt;td&gt;72&lt;/td&gt;&lt;td&gt;5.2 %&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Full XCB - libxcb&lt;/td&gt;&lt;td&gt;33&lt;/td&gt;&lt;td&gt;5.7 %&lt;/td&gt;&lt;/tr&gt;
&lt;/table&gt;
&lt;p&gt;The gain is relatively small, about 5 %. But anyhow, there&apos;s still a gain. Note that the difference between the execution times of the same test written with XCB and with Xlib is just huge. I&apos;ve tried to optimize the Xlib test, but I did not manage to win more seconds.&lt;/p&gt;
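&lt;p&gt;For the record, the Gain figures quoted here are simply the relative reduction in total time between the 0.10 and Full XCB runs for the same back-end. Truncating (rather than rounding) to one decimal place reproduces the published numbers; that truncation is my guess at how they were computed:&lt;/p&gt;

```python
import math

def gain_percent(before, after):
    # Relative reduction in total run time, in percent, truncated to one
    # decimal place (truncation, not rounding, matches the figures above).
    return math.floor((before - after) / before * 1000) / 10

# Local server, total times in seconds (libx11 then libxcb back-end):
print(gain_percent(12.989, 12.380))
print(gain_percent(12.414, 12.037))
# Slow network:
print(gain_percent(76, 72))
print(gain_percent(35, 33))
```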
&lt;p&gt;In conclusion, considering that startup-notification is only used when an application launches another application, the perceivable gain might be even smaller. But anyhow, I think it&apos;s worth it.&lt;/p&gt;
</content:encoded></item><item><title>Announcing muse-blog</title><link>https://julien.danjou.info/blog/announcing-muse-blog/</link><guid isPermaLink="true">https://julien.danjou.info/blog/announcing-muse-blog/</guid><description>Digging into the fabulous world of Emacs and Lisp, I wanted to use it to build my personal Web site and my blog.</description><pubDate>Wed, 19 May 2010 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Digging into the fabulous world of Emacs and Lisp, I wanted to use it to build my personal Web site and my blog.&lt;/p&gt;
&lt;p&gt;I already moved from &lt;a href=&quot;http://ikiwiki.info/&quot;&gt;ikiwiki&lt;/a&gt; to &lt;a href=&quot;http://mwolson.org/projects/EmacsMuse.html&quot;&gt;Emacs Muse&lt;/a&gt; for my HTML pages some weeks ago.&lt;/p&gt;
&lt;p&gt;Muse provides an extension to maintain a journal, called &lt;em&gt;muse-journal&lt;/em&gt;. Unfortunately, it was far from fulfilling all my needs, and I decided that it would be a good exercise to write a better extension.&lt;/p&gt;
&lt;p&gt;Consequently, I started to write my own extension, which I named muse-blog.&lt;/p&gt;
&lt;p&gt;And this is now what is used to build this blog. :-)&lt;/p&gt;
</content:encoded></item><item><title>Entering the Emacs world</title><link>https://julien.danjou.info/blog/entering-the-emacs-world/</link><guid isPermaLink="true">https://julien.danjou.info/blog/entering-the-emacs-world/</guid><description>In February 2009, my friend dim tried to force me to use Emacs. I knew a couple of people using it, and Gnus for reading their mail, and it always made me curious.</description><pubDate>Mon, 17 May 2010 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;In February 2009, my friend &lt;a href=&quot;http://tapoueh.org&quot;&gt;dim&lt;/a&gt; tried to force me to use Emacs. I knew a couple of people using it, and &lt;a href=&quot;http://www.gnus.org&quot;&gt;Gnus&lt;/a&gt; for reading their mail, and it always made me curious.&lt;/p&gt;
&lt;p&gt;At that time, more than a year ago, Emacs 22 and Gnus did not seem usable from my point of view.&lt;/p&gt;
&lt;p&gt;But around mid February, with the help of dim, I tried again to start using Emacs.&lt;/p&gt;
&lt;p&gt;Actually, this was not something new for me. I (very basically) used Emacs between 2000 and 2006. In 2006, when I finished university and started working at &lt;a href=&quot;http://www.easter-eggs.com&quot;&gt;Easter-eggs&lt;/a&gt;, I met a couple of &lt;a href=&quot;http://www.vim.org&quot;&gt;vim&lt;/a&gt; enthusiasts. They taught me how to use it in various ways, and I ended up knowing more about vim than Emacs, so I switched.&lt;/p&gt;
&lt;p&gt;This time, I started by configuring it, but also by reading the manual and learning a bit of Lisp. It took me several weeks, but step by step I learned many, many things. And I must admit, I liked it.&lt;/p&gt;
&lt;p&gt;I&apos;ve configured and started to use some very important modes, like Gnus, &lt;a href=&quot;http://orgmode.org&quot;&gt;Org mode&lt;/a&gt;, &lt;a href=&quot;http://mwolson.org/projects/EmacsMuse.html&quot;&gt;Muse&lt;/a&gt;, or even ERC.&lt;/p&gt;
&lt;p&gt;I&apos;ll probably talk about various Emacs-related things in the near future, since I&apos;ve already written more than a thousand lines of Lisp in the last 2 months.&lt;/p&gt;
&lt;p&gt;Anyhow, I&apos;d just conclude by asserting that my new Emacs/Gnus/Org/ERC setup beats my old vim/mutt/nothing/irssi to the death with a baseball bat. :-)&lt;/p&gt;
</content:encoded></item><item><title>Python cairo and XCB support</title><link>https://julien.danjou.info/blog/python-cairo-and-xcb-support/</link><guid isPermaLink="true">https://julien.danjou.info/blog/python-cairo-and-xcb-support/</guid><description>cairo has had a Python binding (pycairo) for a long time, and some months ago a Python binding for XCB (xpyb) was released.</description><pubDate>Tue, 22 Dec 2009 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;a href=&quot;http://www.cairographics.org&quot;&gt;cairo&lt;/a&gt; has had a &lt;a href=&quot;http://www.cairographics.org/pycairo/&quot;&gt;Python binding (pycairo)&lt;/a&gt; for a long time, and some months ago a &lt;a href=&quot;http://cgit.freedesktop.org/xcb/xpyb/&quot;&gt;Python binding for XCB (xpyb)&lt;/a&gt; was released.&lt;/p&gt;
&lt;p&gt;Pycairo has no support for creating Xlib surfaces. You can get an Xlib surface from PyGTK and then use Pycairo to draw on it, but there&apos;s no way to create one directly.&lt;/p&gt;
&lt;p&gt;What I&apos;ve done is make Pycairo aware of xpyb so that it can create an XCB surface directly from an XCB connection and a drawable.&lt;/p&gt;
&lt;p&gt;As said in &lt;a href=&quot;http://lists.freedesktop.org/archives/xcb/2009-December/005438.html&quot;&gt;my mail to the XCB list&lt;/a&gt;, I&apos;m now waiting for a review before pushing this upstream. :-)&lt;/p&gt;
&lt;p&gt;For the first time, I guess, XCB has beaten Xlib support! ;-)&lt;/p&gt;
</content:encoded></item><item><title>Teething troubles</title><link>https://julien.danjou.info/blog/teething-troubles/</link><guid isPermaLink="true">https://julien.danjou.info/blog/teething-troubles/</guid><description>It&apos;s not that often that I start something from scratch. It&apos;s an amazing feeling to start a new project, to start writing something new. I like that. It&apos;s creation, it&apos;s an artistic part of our comput</description><pubDate>Sun, 20 Dec 2009 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;It&apos;s not that often that I start something from scratch. It&apos;s an amazing feeling to start a new project, to start writing something new. I like that. It&apos;s creation, it&apos;s an artistic part of our computing stuff. I feel like a code artist.&lt;/p&gt;
&lt;p&gt;And what I like even more is that little feeling that you are going in an unknown land. Some area in this tech world where nobody ever came before you, or only a few pioneers.&lt;/p&gt;
&lt;p&gt;That&apos;s the sensation I got when starting to use &lt;a href=&quot;http://www.cython.org&quot;&gt;Cython&lt;/a&gt;, &lt;a href=&quot;http://www.python.org&quot;&gt;Python 3&lt;/a&gt; and various other tools. I just spent half of my time trying to fix problems rather than working on &lt;em&gt;my&lt;/em&gt; code. Problems in autoconf macros not knowing about Python 2.6 or Python 3.1. Problems and limitations in Cython. And problems in Python itself.&lt;/p&gt;
&lt;p&gt;That last one was a hard one. I&apos;m still a beginner in the Python world: I barely know anything. And I was trying to do something nobody had ever done: building an embedded Python with a set of built-in modules.&lt;/p&gt;
&lt;p&gt;I spent hours trying to find out why one type of module import was failing badly. I finally found the answer thanks to a guy who had the same problem. A guy? No. A pioneer. What am I saying? A hero. He&apos;s been my hero of the week! Thank you, Miguel Lobo, because you found the bug I chased for hours, and because you even reported it as &lt;a href=&quot;http://bugs.python.org/issue1644818&quot;&gt;issue 1644818&lt;/a&gt;, including a patch! How damn wonderful is that?&lt;/p&gt;
&lt;p&gt;I will not bore you with the technical details of that bug, since nobody cares. Nobody cares, not even the Python guys, since that bug has been open for 3 years and nobody has even reviewed it in that time. I found an old thread about that bug where some guys were wanking about how they should do the review, because Miguel pushed for several weeks to get one, back in 2007.&lt;/p&gt;
&lt;p&gt;But that bug was in my way. I had to do something. So I prepared my mail reader, mounted my web browser, and there I was, on a unique quest: getting a Python bug fixed.&lt;/p&gt;
&lt;p&gt;At that point, if you did not stop reading earlier, you might be getting very excited. Don&apos;t be. Spoiler: it&apos;s still not fixed. You&apos;ll have to wait for the end of the season and watch all the episodes I&apos;ll have to write to get the end of the story!&lt;/p&gt;
&lt;p&gt;Let&apos;s continue.&lt;/p&gt;
&lt;p&gt;I had to create an account on the Python bug tracking system. That was a trivial task for a man like me (you bet). Then I launched a verbal attack, something you rarely see in a bug tracking system. Something I knew would wake up any developer caring about their software.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Julien Danjou:&lt;br /&gt;
Is there any chance to see this &lt;em&gt;bug&lt;/em&gt; fixed someday?&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I had the deep feeling that my quest was starting here. How many days would I have to wait for an answer? Time was passing. Minutes were ticking by while I waited, sitting on a comfortable sofa in a softly lit room. It seemed like my whole life was shorter than the delay I would have to wait to get an answer.&lt;/p&gt;
&lt;p&gt;After waiting for hours, suddenly, and only 15 minutes later, I got an answer:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Martin v. Löwis:&lt;br /&gt;
Please ask on python-dev. I may be willing to revive my five-for-one offer.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Martin? Don&apos;t know that guy. Who is he? What is he like? Will he fix that bug? What is this offer? So many questions without answers. But he asked me to ask on python-dev, and I said: challenge accepted! I would write a mail to python-dev to get that bug fixed.&lt;/p&gt;
&lt;p&gt;Which I did. I sent a short (but well written, you know, I made efforts) &quot;WTF?&quot; to python-dev.&lt;/p&gt;
&lt;p&gt;And then the guy asked me to review 5 bugs so that he would review and fix this one. And this is how I came to tell him he was pissing me off by blackmailing me into fixing a bug that was his &quot;duty&quot;.&lt;/p&gt;
&lt;p&gt;So this is the end of the story so far. Will that bug be fixed some day? There&apos;s hope, because another guy jumped in and took over the bug assignment.&lt;/p&gt;
&lt;p&gt;To be continued.&lt;/p&gt;
&lt;p&gt;My conclusion about this whole story: it is a little rough to start something new, with new tools, and quickly run into teething troubles. It&apos;s even harsher to enter a community because you just found bugs, and to be not very well received when you ask for a 10-line fix, written by somebody 3 years ago, to be applied.&lt;/p&gt;
&lt;p&gt;I&apos;ll probably still use Python :-), but I now have a darker image of its community.&lt;/p&gt;
</content:encoded></item><item><title>Courier to Dovecot migration</title><link>https://julien.danjou.info/blog/courier-to-dovecot-migration/</link><guid isPermaLink="true">https://julien.danjou.info/blog/courier-to-dovecot-migration/</guid><description>This week, I&apos;ve managed to migrate from courier-imap to dovecot at work. I always had a good experience with dovecot, and I still have one.</description><pubDate>Fri, 02 Oct 2009 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;This week, I&apos;ve managed to migrate from &lt;a href=&quot;http://www.courier-mta.org/imap/&quot;&gt;courier-imap&lt;/a&gt; to &lt;a href=&quot;http://www.dovecot.org&quot;&gt;dovecot&lt;/a&gt; at work. I always had a good experience with dovecot, and I still have one.&lt;/p&gt;
&lt;p&gt;Dovecot&apos;s performance is very good in comparison with Courier. With the switch, we dropped the server&apos;s CPU usage from 25 % to 10 %, and it&apos;s damn faster now. I have no idea exactly why, but looking at the code I think it&apos;s better written, and its usage of index files helps a lot.&lt;/p&gt;
&lt;p&gt;We had no problem getting things to work with public folders either, so the switch was almost painless.&lt;/p&gt;
&lt;p&gt;The only problem we had is that Dovecot is too smart for some MUAs. Consequently, we hit an &lt;a href=&quot;http://dev.mutt.org/trac/ticket/969&quot;&gt;8-year-old Mutt bug, #969&lt;/a&gt;, which I also reported to the Debian BTS as &lt;a href=&quot;http://bugs.debian.org/549204&quot;&gt;#549204&lt;/a&gt; with a not-well-tested-but-seems-to-work patch.&lt;/p&gt;
&lt;p&gt;Thanks to &lt;a href=&quot;http://www.claws-mail.org/&quot;&gt;Claws mail&lt;/a&gt;, we also found a &lt;a href=&quot;http://dovecot.org/pipermail/dovecot/2009-October/043236.html&quot;&gt;bug in dovecot 1.2.5&lt;/a&gt;, which should be fixed soon. Dovecot upstream is very responsive, and that&apos;s always a nice thing to know when you use free software.&lt;/p&gt;
&lt;/content:encoded&gt;&lt;/item&gt;&lt;item&gt;&lt;title&gt;Various news: what happened during summer&lt;/title&gt;&lt;link&gt;https://julien.danjou.info/blog/various-news/&lt;/link&gt;&lt;guid isPermaLink=&quot;true&quot;&gt;https://julien.danjou.info/blog/various-news/&lt;/guid&gt;&lt;description&gt;It&amp;apos;s been a while since I blogged about something. So here&amp;apos;s a bunch of things I&amp;apos;ve done the last month.  Holidays Well, I&amp;apos;ve been in holidays one week. :-P  awesome There have been a huge number of c&lt;/description&gt;&lt;pubDate&gt;Tue, 22 Sep 2009 00:00:00 GMT&lt;/pubDate&gt;&lt;content:encoded&gt;&amp;lt;p&amp;gt;It&amp;apos;s been a while since I blogged about anything, so here&amp;apos;s a bunch of things I&amp;apos;ve done over the last month.&amp;lt;/p&amp;gt;
&lt;h2&gt;Holidays&lt;/h2&gt;
&lt;p&gt;Well, I&apos;ve been on holiday for a week. :-P&lt;/p&gt;
&lt;h2&gt;awesome&lt;/h2&gt;
&lt;p&gt;There have been a huge number of changes between 3.3 (released in June) and 3.4 (almost released). I wrote a small but very useful object layer on top of Lua, which adds a class/object system a bit like &lt;a href=&quot;http://www.gtk.org&quot;&gt;gobject&lt;/a&gt;. I&apos;ve also replaced all the hooks with per-class/object signals. Finally, the awesome Lua foundations are cleaner than they were before, and extensibility is improved. How nice.&lt;/p&gt;
&lt;p&gt;We&apos;re trying to release 3.4 (rc2 should be out soon), but the development pace is a bit slower than a year ago. We&apos;re basically almost 2 months behind our previous release rate. Not a big deal, however.&lt;/p&gt;
&lt;p&gt;I&apos;ve slowly started working on 3.5. It&apos;s going to get amazing new features too. :-)&lt;/p&gt;
&lt;h2&gt;Google Summer Of Code 2009&lt;/h2&gt;
&lt;p&gt;I mentored Mariusz Ceier on the &lt;a href=&quot;http://xcb.freedesktop.org&quot;&gt;XCB&lt;/a&gt; GSoC project. He worked on adding the Xinput2 and XKB extensions, and he managed to do it. His work should be imported ASAP; the discussion started on the XCB mailing list last week.&lt;/p&gt;
&lt;p&gt;In exchange, Google offered me (and every mentor) an awful blue t-shirt! Thanks Google! :-P&lt;/p&gt;
</content:encoded></item><item><title>TODO list management</title><link>https://julien.danjou.info/blog/todo-list-management/</link><guid isPermaLink="true">https://julien.danjou.info/blog/todo-list-management/</guid><description>My fellow Debian developer Steve Kemp told us about his TODO list management.</description><pubDate>Fri, 10 Jul 2009 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;My fellow Debian developer &lt;a href=&quot;http://blog.steve.org.uk&quot;&gt;Steve Kemp&lt;/a&gt; told us about his &lt;a href=&quot;http://blog.steve.org.uk/why_do_you_keep_torturing_yourself_.html&quot;&gt;TODO list management&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;While reading his post, I was constantly thinking &quot;been there, been there buddy&quot;. Yeah, I&apos;ve been.&lt;/p&gt;
&lt;p&gt;I had had the same problem for months: no way to track the things I had to do, whether computer-related stuff or real-life things. The bad part is that until you write them down, you keep them in mind, and that&apos;s exhausting. You know you have, let&apos;s say, 5 things to do, but unless you write those 5 items down in a TODO list, you will keep thinking about them every once in a while. And that&apos;s real lost time.&lt;/p&gt;
&lt;p&gt;And that&apos;s totally inefficient: imagine you thought &quot;it&apos;d be nice to buy a USB stick next time I buy some hardware&quot;. Well, unless you actually write this down somewhere and are in the habit of checking the &quot;To Buy&quot; category of your TODO list, you&apos;re going to buy a replacement hard drive in a hurry some day, and forget about your USB stick.&lt;/p&gt;
&lt;p&gt;I think the good practice, which I really recommend to everyone, is to write down what you think you have to do as soon as possible. Don&apos;t write it on a small piece of paper you will lose; write it in a TODO list, paper or electronic, whatever, but write it, and stop thinking about it. When you have time, you&apos;ll get your TODO list out of your pocket, take a look at it, and do what you can at that moment. Every once in a while, you check that list.&lt;/p&gt;
&lt;p&gt;Personally, the tool I chose to handle my TODO list is a Palm Centro phone, which I got for only a hundred euros. It runs good old PalmOS, which basically handles TODO lists and schedules better than any phone I&apos;ve seen so far (and yes, probably better than your iPhone).&lt;/p&gt;
&lt;p&gt;My choice was based on the fact that I have random ideas almost everywhere: while hacking, but also while walking down the street, riding the train, or sleeping (yeah, that has already happened). And the only thing I always carry with me is my phone, in my pocket.&lt;/p&gt;
&lt;p&gt;However, Steve&apos;s choice may be nice if you have Internet access on your phone, which I haven&apos;t, since it&apos;s too expensive for what it is, in my opinion. :-)&lt;/p&gt;
&lt;/content:encoded&gt;&lt;/item&gt;&lt;item&gt;&lt;title&gt;Upgrading to dovecot 1.2: hello Sieve!&lt;/title&gt;&lt;link&gt;https://julien.danjou.info/blog/upgrading-to-dovecot-1-2-hello-sieve/&lt;/link&gt;&lt;guid isPermaLink=&quot;true&quot;&gt;https://julien.danjou.info/blog/upgrading-to-dovecot-1-2-hello-sieve/&lt;/guid&gt;&lt;description&gt;Last year, I told you I wanted to use Sieve to filter my mail. I did not switch, because of the lacking implementation of some Sieve features inside Dovecot, my preferred IMAP server.&lt;/description&gt;&lt;pubDate&gt;Thu, 09 Jul 2009 00:00:00 GMT&lt;/pubDate&gt;&lt;content:encoded&gt;&amp;lt;p&amp;gt;Last year, I told you I wanted to use &amp;lt;a href=&amp;quot;http://en.wikipedia.org/wiki/Sieve_(mail_filtering_language)&amp;quot;&amp;gt;Sieve&amp;lt;/a&amp;gt; to filter my mail. I did not switch, because some Sieve features were missing from the implementation in &amp;lt;a href=&amp;quot;http://www.dovecot.org&amp;quot;&amp;gt;Dovecot&amp;lt;/a&amp;gt;, my preferred IMAP server.&amp;lt;/p&amp;gt;
&lt;p&gt;After that disappointment, I kept my 8-year-old mail setup: &lt;em&gt;fetchmail&lt;/em&gt; running on my workstation, handing the mail over to &lt;em&gt;procmail&lt;/em&gt;, and &lt;em&gt;mutt&lt;/em&gt; reading the maildirs locally. But that&apos;s over.&lt;/p&gt;
&lt;p&gt;I got a laptop to replace my workstation. It was not possible to keep such a mail setup, since my laptop can be offline, and my mail would be offline with it.&lt;/p&gt;
&lt;p&gt;So I decided to upgrade Dovecot to 1.2. I used the &lt;em&gt;dovecot-1.2-work&lt;/em&gt; Subversion branch from our lovely Debian maintainers, and built a Debian package for Lenny. The upgrade from 1.1 was almost painless, since the configuration file did not change much.&lt;/p&gt;
&lt;p&gt;Then I started to write my little Sieve script. Sieve is a very nice language, almost user-friendly. In 20 lines I rewrote all my procmail stuff, matching headers like &lt;em&gt;List-Id&lt;/em&gt; with regexes to file mail automagically into the right folder. I reconfigured &lt;em&gt;mutt&lt;/em&gt; to use IMAP, and it works fine. I even reimported my old Maildir over IMAP, using &lt;em&gt;mutt&lt;/em&gt; too.&lt;/p&gt;
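&lt;p&gt;For illustration only (the list address and folder name below are made-up examples, not my actual script), such a rule looks like this:&lt;/p&gt;

```sieve
require ["fileinto", "regex"];

# File the Dovecot mailing-list traffic into its own folder,
# matching on the List-Id header with a regex.
if header :regex "List-Id" "dovecot\\.dovecot\\.org" {
    fileinto "lists.dovecot";
    stop;
}

# Anything unmatched falls through to the INBOX.
```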
&lt;p&gt;I am now a happy IMAP user.&lt;/p&gt;
&lt;p&gt;For people wondering why I wanted to switch away from &lt;em&gt;procmail&lt;/em&gt; to &lt;em&gt;Sieve&lt;/em&gt;: the reason is that a Sieve script can be uploaded remotely via &lt;em&gt;managesieve&lt;/em&gt;. This means you do not need FTP/SSH/whatever access to install your script. You can, for example, use &lt;em&gt;connect-sieve&lt;/em&gt; or the Sieve plugin for Thunderbird/Icedove.&lt;/p&gt;
&lt;/content:encoded&gt;&lt;/item&gt;&lt;item&gt;&lt;title&gt;Taking the other direction&lt;/title&gt;&lt;link&gt;https://julien.danjou.info/blog/taking-the-other-direction/&lt;/link&gt;&lt;guid isPermaLink=&quot;true&quot;&gt;https://julien.danjou.info/blog/taking-the-other-direction/&lt;/guid&gt;&lt;description&gt;I&amp;apos;ve started to develop awesome more than 18 months ago, and somehow I feel it&amp;apos;s time to stop a bit and think where we come from and where we are going to.&lt;/description&gt;&lt;pubDate&gt;Wed, 15 Apr 2009 00:00:00 GMT&lt;/pubDate&gt;&lt;content:encoded&gt;&amp;lt;p&amp;gt;I started developing &amp;lt;a href=&amp;quot;http://awesome.naquadah.org&amp;quot;&amp;gt;awesome&amp;lt;/a&amp;gt; more than 18 months ago, and somehow I feel it&amp;apos;s time to pause a bit and think about where we come from and where we are going.&amp;lt;/p&amp;gt;
&lt;h2&gt;The motivation&lt;/h2&gt;
&lt;p&gt;I never thought I&apos;d be writing a window manager one day. That seems kinda stupid when you see how many window managers there are around.&lt;/p&gt;
&lt;p&gt;Like many people, I&apos;ve tested and used tons of window managers: &lt;a href=&quot;http://www.windowmaker.info&quot;&gt;Window Maker&lt;/a&gt;, &lt;a href=&quot;http://www.fluxbox.org&quot;&gt;Fluxbox&lt;/a&gt;, etc.&lt;/p&gt;
&lt;p&gt;In August 2007, I had been using &lt;a href=&quot;http://www.fvwm.org&quot;&gt;fvwm&lt;/a&gt; since 2004 and was quite happy with it. I used the famous &lt;a href=&quot;http://www.fvwm-crystal.org&quot;&gt;fvwm crystal&lt;/a&gt; as a configuration starter and then rewrote lots of stuff. Digging into &lt;em&gt;fvwm&lt;/em&gt; configuration files was boring, and since I&apos;m lazy, I never really configured it to entirely fit my needs.&lt;/p&gt;
&lt;p&gt;The thing is that, in July 2007, my workstation died. I bought a new one based on the &lt;em&gt;amd64&lt;/em&gt; architecture. Too bad: with this new box, &lt;em&gt;fvwm&lt;/em&gt; decided it would no longer run, segfaulting almost every time I logged in.&lt;/p&gt;
&lt;p&gt;I was &lt;strong&gt;really&lt;/strong&gt; upset. Another failure in the window manager world. So I decided to take my yearly ride of testing many window managers. I went through the no-longer-developed stuff like the *boxes, ion3, etc… but well, I did not like them: they were not powerful enough, too buggy, or upstream was insane.&lt;/p&gt;
&lt;p&gt;Then I found &lt;a href=&quot;http://www.xmonad.org&quot;&gt;xmonad&lt;/a&gt;. The Haskell configuration file format made me cry. I did not want to learn Haskell; it seemed too obfuscated to me. At that time it was not even packaged for &lt;a href=&quot;http://www.debian.org&quot;&gt;Debian&lt;/a&gt;, so I gave up. But then I found &lt;em&gt;dwm&lt;/em&gt;.&lt;/p&gt;
&lt;h2&gt;The jdwm&lt;/h2&gt;
&lt;p&gt;I just added a &apos;j&apos; in front of &lt;em&gt;dwm&lt;/em&gt; and started to hack on it day and night to add the many features I missed, like multi-head support, etc… On 5th September 2007, I created a git repository to host my code.&lt;/p&gt;
&lt;h2&gt;That&apos;s gonna be… awesome.&lt;/h2&gt;
&lt;p&gt;Five days later, on 10th September, I finally found a name for my new pet: &lt;strong&gt;awesome&lt;/strong&gt;, borrowed from &lt;a href=&quot;http://en.wikipedia.org/wiki/Barney_Stinson&quot;&gt;Barney Stinson&lt;/a&gt; who heavily uses and abuses this word.&lt;/p&gt;
&lt;h3&gt;The 1.x branch&lt;/h3&gt;
&lt;p&gt;The first releases, until December, were numbered 1.x. It was just a better &lt;em&gt;dwm&lt;/em&gt; with a simple flat configuration file. The configuration file used &lt;a href=&quot;http://www.hyperrealm.com/libconfig/&quot;&gt;libconfig&lt;/a&gt;, but that was a very poor choice. And I was not able to put it into Debian because of a &lt;a href=&quot;http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=441200&quot;&gt;name clash&lt;/a&gt;.&lt;/p&gt;
&lt;h3&gt;The 2.x branch&lt;/h3&gt;
&lt;p&gt;The 2.x branch came in January 2008 with a brand new configuration file format based on &lt;a href=&quot;http://www.nongnu.org/confuse/&quot;&gt;libconfuse&lt;/a&gt;, which was a bit more powerful. Many concepts and features that have been added in this branch are still used in the current 3.x branch.&lt;/p&gt;
&lt;p&gt;At this time, between December 2007 and April 2008, the community was growing smoothly.&lt;/p&gt;
&lt;p&gt;But as I said, awesome 2 was based on a flat configuration file. That raised a problem very soon: users&apos; expectations were growing, and the development team (me and a couple of regular contributors) was unable to cope with them.&lt;/p&gt;
&lt;p&gt;One of the events that started to change my mind was titlebar support.&lt;/p&gt;
&lt;p&gt;When I added titlebar support, it was minimal: a bar on top of the window, showing the window title. Period. Then I started to add a lot of options, like drawing the application icon, choosing the position (left, right, bottom), etc.&lt;/p&gt;
&lt;p&gt;And then users started to ask for more, like: &quot;add titlebar on windows only when the window is floating&quot;.&lt;/p&gt;
&lt;p&gt;That&apos;s OK, but it&apos;s complicated: that&apos;s yet &lt;strong&gt;another&lt;/strong&gt; option to do some stuff conditionally. And then, why not add titlebars on windows only when some other arbitrary condition holds?&lt;/p&gt;
&lt;h3&gt;The 3.x branch&lt;/h3&gt;
&lt;h4&gt;Why&lt;/h4&gt;
&lt;p&gt;At that time, around April 2008, I&apos;d totally stopped development. I was trying to find a solution that was both simple and powerful. But after 2 weeks of thinking, I could not come up with anything better than: use a real language for configuration.&lt;/p&gt;
&lt;p&gt;So I started prototyping awesome 3 using &lt;a href=&quot;http://www.lua.org&quot;&gt;Lua&lt;/a&gt;. The choice was not obvious, but despite the problems &lt;em&gt;Lua&lt;/em&gt; might suffer from, it&apos;s one of the easiest languages to integrate into an existing application.&lt;/p&gt;
&lt;p&gt;But let&apos;s step back a little: in January 2008, Arnaud Fontaine contacted me because he was interested in working on &lt;em&gt;awesome&lt;/em&gt; as one of his school projects. He decided to port &lt;em&gt;awesome&lt;/em&gt; from &lt;em&gt;Xlib&lt;/em&gt; to &lt;em&gt;&lt;a href=&quot;http://xcb.freedesktop.org&quot;&gt;XCB&lt;/a&gt;&lt;/em&gt;, a modern asynchronous X library.&lt;/p&gt;
&lt;p&gt;His work took some time, but in May 2008, Arnaud finished porting the git master version of &lt;em&gt;awesome&lt;/em&gt; to &lt;em&gt;XCB&lt;/em&gt;.&lt;/p&gt;
&lt;p&gt;Consequently, I decided to start a new major branch, using &lt;em&gt;XCB&lt;/em&gt; instead of &lt;em&gt;Xlib&lt;/em&gt; (no change for users in this regard) and &lt;em&gt;Lua&lt;/em&gt; instead of our previous flat configuration file format.&lt;/p&gt;
&lt;h4&gt;Development&lt;/h4&gt;
&lt;p&gt;It took me a while to get from here to there, but in September 2008, it was ready. We had a simple Lua API, and the XCB port was working perfectly.&lt;/p&gt;
&lt;p&gt;It took us some time to release something totally working, because we had to work on &lt;em&gt;XCB&lt;/em&gt; itself and contribute back to that project. It was really not ready to be used by an application, but we did great work in this area and it&apos;s now really fine.&lt;/p&gt;
&lt;h4&gt;We&apos;re still here&lt;/h4&gt;
&lt;p&gt;Releases continue to happen: 3.1 around December 2008, and 3.2 around March 2009. 3.3 should be here in June.&lt;/p&gt;
&lt;p&gt;One of the big changes we made is that we moved a lot of code from C to Lua. Why? Because writing things in Lua is quicker, the result is easier to maintain than C, and it makes things more configurable for the user.&lt;/p&gt;
&lt;p&gt;For example, the layout algorithms used to organize windows were written in C until 3.2 came out. Until then, users had no choice but to use a set of predefined layouts to organize their windows.&lt;/p&gt;
&lt;p&gt;Starting with 3.2, anyone with minimal knowledge of geometry can write a layout function organizing windows on the screen.&lt;/p&gt;
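&lt;p&gt;To give an idea of what that means, here is a hypothetical sketch of the concept in plain Lua (&lt;strong&gt;not&lt;/strong&gt; the actual awesome API; the function name and argument shapes are made up): a layout is essentially a function that maps a work area and a number of clients to a list of window geometries.&lt;/p&gt;

```lua
-- Hypothetical sketch, NOT the real awesome API: a "layout" is just a
-- function turning a work area and a client count into window geometries.
local function fair_vertical(workarea, nclients)
  local geometries = {}
  local h = math.floor(workarea.height / nclients)
  for i = 1, nclients do
    geometries[i] = {
      x = workarea.x,
      y = workarea.y + (i - 1) * h,   -- stack clients top to bottom
      width = workarea.width,
      height = h,
    }
  end
  return geometries
end

-- Stack 4 clients vertically on a 1280x800 screen:
g = fair_vertical({ x = 0, y = 0, width = 1280, height = 800 }, 4)
print(g[2].y, g[2].height)  -- 200   200
```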
&lt;p&gt;But this kind of API change was a bit rough on users, since they had to port parts of their configuration file to the new API. The thing is that the project was still a &lt;em&gt;teenager&lt;/em&gt; at that time, not really knowing where it was going. But I&apos;m happy to announce that API breakages are more and more rare (so far only one minor one between 3.2 and 3.3), and anyway always for the Good.&lt;/p&gt;
&lt;p&gt;But I admit that it built a bad reputation around &lt;em&gt;awesome 3.x&lt;/em&gt; during its first month of existence.&lt;/p&gt;
&lt;h2&gt;Future direction&lt;/h2&gt;
&lt;p&gt;I am currently working on 3.3 development. We still have many things to do. As time passes, we get more ideas, and more users. And more users bring more ideas. We also have many more contributors, and some people are even taking maintainership of some code areas.&lt;/p&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;My post title is &quot;Taking the other direction&quot; because I feel this way.&lt;/p&gt;
&lt;p&gt;I have the feeling that some of the approaches taken by projects like GNOME are sometimes bad. Please don&apos;t misread me: I know we are not playing in the same yard.&lt;/p&gt;
&lt;p&gt;When adding a key shortcut for starting an application makes you dig into &lt;em&gt;gconf&lt;/em&gt;, I wonder how this is a win for the user.&lt;/p&gt;
&lt;p&gt;Well, it&apos;s probably a win for the end-user, but I surely am not one of them. And I don&apos;t intend to target them with my software, anyway.&lt;/p&gt;
&lt;p&gt;And now, when I hear things like GNOME 3.0 and the &quot;&lt;a href=&quot;http://live.gnome.org/GnomeShell&quot;&gt;desktop shell&lt;/a&gt;&quot; approach, it makes me smile. Guys, it was about time; good luck. What I see from here is that any desktop control interface is wrong somehow, and that no single approach can fulfill all users&apos; wishes.&lt;/p&gt;
&lt;p&gt;I think that we, the awesome development team (no pun intended), took the direction of building a framework window manager rather than a solution carved in marble.&lt;/p&gt;
&lt;p&gt;We (partially) solved the issue of UI ergonomics by not writing a UI at all and letting users write their own. I don&apos;t say that&apos;s easy for most users, but it&apos;s doable.&lt;/p&gt;
&lt;p&gt;And I think it&apos;s worth it: I have been using window managers since I started using Linux, around 1998. If something like &lt;em&gt;awesome&lt;/em&gt; had come along 5 years ago, I&apos;d still be using it, because you can rewrite &lt;em&gt;Fluxbox&lt;/em&gt; or &lt;em&gt;Window Maker&lt;/em&gt; on top of &lt;em&gt;awesome&lt;/em&gt; in a hundred lines of Lua. And you can write your own version of it. And it starts in less than 3 seconds, supporting almost all the standard desktop specifications (ICCCM, EWMH, XDG, system tray, message notifications, D-Bus, etc.), whereas many window managers do not.&lt;/p&gt;
&lt;p&gt;You can even write and play &lt;a href=&quot;http://awesome.naquadah.org/apidoc/modules/invaders.html&quot;&gt;space invaders&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Finally, I&apos;m happy about the road we&apos;ve taken so far, and I hope we will continue in that direction. The rants I read about our project are not that big, compared to the kudos we receive.&lt;/p&gt;
&lt;/content:encoded&gt;&lt;/item&gt;&lt;item&gt;&lt;title&gt;OpenOffice is better as a pager than as a text processor&lt;/title&gt;&lt;link&gt;https://julien.danjou.info/blog/openoffice-better-as-a-pager/&lt;/link&gt;&lt;guid isPermaLink=&quot;true&quot;&gt;https://julien.danjou.info/blog/openoffice-better-as-a-pager/&lt;/guid&gt;&lt;description&gt;Since several month, awesome users have reported a bug with OpenOffice.org. When using OOo and clicking on a menu, or using the mouse wheel to read a document, the currently selected tag (desktop).&lt;/description&gt;&lt;pubDate&gt;Wed, 11 Feb 2009 00:00:00 GMT&lt;/pubDate&gt;&lt;content:encoded&gt;&amp;lt;p&amp;gt;For several months, &amp;lt;a href=&amp;quot;http://awesome.naquadah.org&amp;quot;&amp;gt;awesome&amp;lt;/a&amp;gt; users have been reporting a bug with &amp;lt;a href=&amp;quot;http://www.openoffice.org&amp;quot;&amp;gt;OpenOffice.org&amp;lt;/a&amp;gt;: when using OOo and clicking on a menu, or using the mouse wheel to read a document, the currently selected tag (desktop) automagically changes to another one.&amp;lt;/p&amp;gt;
&lt;p&gt;I dug into awesome and found that it was receiving a _NET_CURRENT_DESKTOP request. As defined by &lt;a href=&quot;http://standards.freedesktop.org/wm-spec/wm-spec-latest.html#id2550663&quot;&gt;EWMH&lt;/a&gt;, this kind of request is sent by a pager to change the active desktop.&lt;/p&gt;
&lt;p&gt;That was weird: nobody is using a pager here. So I pulled out gdb, attached it to OOo, and set a breakpoint on the &lt;em&gt;XSendEvent&lt;/em&gt; call. And I got it:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;Breakpoint 1, XSendEvent (dpy=0x1a00080, w=483, propagate=0, event_mask=1572864, event=0x7fff1fd70d70)
   at ../../src/SendEvent.c:46
(gdb) bt
#0  XSendEvent (dpy=0x1a00080, w=483, propagate=0, event_mask=1572864, event=0x7fff1fd70d70)
   at ../../src/SendEvent.c:46
#1  0x00007f8c0ab4193f in vcl_sal::WMAdaptor::switchToWorkArea ()
  from /usr/lib/openoffice/basis3.0/program/libvclplug_genlx.so
#2  0x00007f8c0aafdbd8 in X11SalFrame::Show ()
  from /usr/lib/openoffice/basis3.0/program/libvclplug_genlx.so
#3  0x00007f8c1378623c in Window::Show ()
  from /usr/lib/openoffice/program/../basis-link/program/libvcllx.so
#4  0x00007f8c13785f40 in Window::Show ()
  from /usr/lib/openoffice/program/../basis-link/program/libvcllx.so
#5  0x00007f8c1372cb54 in FloatingWindow::StartPopupMode ()
  from /usr/lib/openoffice/program/../basis-link/program/libvcllx.so
#6  0x00007f8c1373c877 in ?? () from /usr/lib/openoffice/program/../basis-link/program/libvcllx.so
#7  0x00007f8c1373ccf2 in ?? () from /usr/lib/openoffice/program/../basis-link/program/libvcllx.so
#8  0x00007f8c1373ce84 in ?? () from /usr/lib/openoffice/program/../basis-link/program/libvcllx.so
#9  0x00007f8c13795e7f in ?? () from /usr/lib/openoffice/program/../basis-link/program/libvcllx.so
#10 0x00007f8c13797e74 in ?? () from /usr/lib/openoffice/program/../basis-link/program/libvcllx.so
#11 0x00007f8c13796748 in ?? () from /usr/lib/openoffice/program/../basis-link/program/libvcllx.so
#12 0x00007f8c0aafe6f8 in X11SalFrame::HandleMouseEvent ()
  from /usr/lib/openoffice/basis3.0/program/libvclplug_genlx.so
#13 0x00007f8c0ab040c2 in X11SalFrame::Dispatch ()
  from /usr/lib/openoffice/basis3.0/program/libvclplug_genlx.so
#14 0x00007f8c0ab31625 in SalX11Display::Yield ()
  from /usr/lib/openoffice/basis3.0/program/libvclplug_genlx.so
#15 0x00007f8c0ab356f3 in ?? () from /usr/lib/openoffice/basis3.0/program/libvclplug_genlx.so
#16 0x00007f8c0ab2df1f in SalXLib::Yield () from /usr/lib/openoffice/basis3.0/program/libvclplug_genlx.so
#17 0x00007f8c135b050e in Application::Yield ()
  from /usr/lib/openoffice/program/../basis-link/program/libvcllx.so
#18 0x00007f8c135b0587 in Application::Execute ()
  from /usr/lib/openoffice/program/../basis-link/program/libvcllx.so
#19 0x00007f8c17517e80 in ?? () from /usr/lib/openoffice/program/../basis-link/program/libsofficeapp.so
#20 0x00007f8c135b4b24 in ?? () from /usr/lib/openoffice/program/../basis-link/program/libvcllx.so
#21 0x00007f8c135b4bc5 in SVMain () from /usr/lib/openoffice/program/../basis-link/program/libvcllx.so
#22 0x00007f8c1754ca6c in soffice_main ()
  from /usr/lib/openoffice/program/../basis-link/program/libsofficeapp.so
#23 0x000000000040105b in main ()
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I started digging more into the code, and this is what I finally found in &lt;em&gt;salframe.cxx&lt;/em&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;        // #i45160# switch to desktop where a dialog with parent will appear
        if( mpParent &amp;amp;&amp;amp; mpParent-&amp;gt;m_nWorkArea != m_nWorkArea )
            GetDisplay()-&amp;gt;getWMAdaptor()-&amp;gt;switchToWorkArea(mpParent-&amp;gt;m_nWorkArea );
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Beautiful! It even has a comment with an IssueZilla bug number. Let&apos;s go and see where it comes from.&lt;/p&gt;
&lt;p&gt;After 10 minutes of research to find that fucking IZ, I finally found the link to &lt;a href=&quot;http://www.openoffice.org/issues/show_bug.cgi?id=45160&quot;&gt;issue #45160&lt;/a&gt;. The bug is IMHO not related to OOo but to a window manager doing a poor job.&lt;/p&gt;
&lt;p&gt;I found that an awesome user had already reported a bug… err, wait, I mean an issue, as &lt;a href=&quot;http://www.openoffice.org/issues/show_bug.cgi?id=96684&quot;&gt;issue #96684&lt;/a&gt; (remember, there are no bugs in OOo, only issues), and I commented on it.&lt;/p&gt;
&lt;p&gt;It seems OOo developers have agreed to fix that bug eventually.&lt;/p&gt;
&lt;/content:encoded&gt;&lt;/item&gt;&lt;item&gt;&lt;title&gt;startup-notification ported to XCB&lt;/title&gt;&lt;link&gt;https://julien.danjou.info/blog/startup-notification-ported-to-xcb/&lt;/link&gt;&lt;guid isPermaLink=&quot;true&quot;&gt;https://julien.danjou.info/blog/startup-notification-ported-to-xcb/&lt;/guid&gt;&lt;description&gt;Since Tuesday, I&amp;apos;ve been working on the XCB port of the startup-notification library.&lt;/description&gt;&lt;pubDate&gt;Thu, 29 Jan 2009 00:00:00 GMT&lt;/pubDate&gt;&lt;content:encoded&gt;&amp;lt;p&amp;gt;Since Tuesday, I&amp;apos;ve been working on the &amp;lt;a href=&amp;quot;http://xcb.freedesktop.org&amp;quot;&amp;gt;XCB&amp;lt;/a&amp;gt; port of the &amp;lt;a href=&amp;quot;http://www.freedesktop.org/software/startup-notification/&amp;quot;&amp;gt;startup-notification&amp;lt;/a&amp;gt; library.&amp;lt;/p&amp;gt;
&lt;p&gt;I&apos;ve just completed the job, and &lt;a href=&quot;http://lists.freedesktop.org/archives/xdg/2009-January/010176.html&quot;&gt;sent a bunch of patches to the XDG mailing list&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;If the patches are merged, which I don&apos;t doubt they will be, I&apos;ll be able to use this lib in &lt;a href=&quot;http://awesome.naquadah.org&quot;&gt;awesome&lt;/a&gt;, which would be a nice step toward the Freedesktop standards compliance I care about.&lt;/p&gt;
</content:encoded></item><item><title>Rants about Lua</title><link>https://julien.danjou.info/blog/rants-about-lua/</link><guid isPermaLink="true">https://julien.danjou.info/blog/rants-about-lua/</guid><description>I&apos;ve started using Lua some months ago, while looking for a more powerful way to configure awesome. At this time, around March 2008, Lua seemed to be the best language to integrate inside the core.</description><pubDate>Tue, 30 Dec 2008 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;I&apos;ve started using &lt;a href=&quot;http://www.lua.org&quot;&gt;Lua&lt;/a&gt; some months ago, while looking for a more powerful way to configure &lt;a href=&quot;http://awesome.naquadah.org&quot;&gt;awesome&lt;/a&gt;. At this time, around March 2008, Lua seemed to be the best language to integrate inside the core system of &lt;em&gt;awesome&lt;/em&gt;.&lt;/p&gt;
&lt;p&gt;I still think that Lua was a good choice, but after 8 months, it shows some important drawbacks.&lt;/p&gt;
&lt;p&gt;I&apos;ll try to keep my explanation simple and to make you understand everything, even if you do not know Lua.&lt;/p&gt;
&lt;p&gt;I refer here to Lua version 5.1.&lt;/p&gt;
&lt;h2&gt;Design flaws&lt;/h2&gt;
&lt;h3&gt;Length operator&lt;/h3&gt;
&lt;p&gt;Lua has a length operator on its objects, known as &lt;code&gt;#&lt;/code&gt;. It can be used to get the size of various objects.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&amp;gt; return #&quot;lol&quot;
3
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This operator works on tables, strings, etc… It is possible to define this operator yourself by setting a &lt;code&gt;__len&lt;/code&gt; meta-method on a userdata value.&lt;/p&gt;
&lt;p&gt;The problem is that you cannot redefine it on string or table objects, see:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&amp;gt; a = { &quot;hello&quot;, &quot;world&quot; }
&amp;gt; return #a
2
&amp;gt; setmetatable(a, { __len = function () return 18 end })
&amp;gt; return #a
2
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Indeed, looking at the Lua core code:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;      case OP_LEN: {
        const TValue *rb = RB(i);
        switch (ttype(rb)) {
          case LUA_TTABLE: {
            setnvalue(ra, cast_num(luaH_getn(hvalue(rb))));
            break;
          }
          case LUA_TSTRING: {
            setnvalue(ra, cast_num(tsvalue(rb)-&amp;gt;len));
            break;
          }
          default: {  /* try metamethod */
            Protect(
              if (!call_binTM(L, rb, luaO_nilobject, ra, TM_LEN))
                luaG_typeerror(L, rb, &quot;get length of&quot;);
            )
          }
        }
        continue;
      }
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;You clearly see that tables and strings always use the internal length operator, never the &lt;code&gt;__len&lt;/code&gt; meta-method.&lt;/p&gt;
&lt;p&gt;That is, to me, a design problem, and it causes more trouble, as we&apos;ll see below.&lt;/p&gt;
&lt;h3&gt;index and newindex metamethods&lt;/h3&gt;
&lt;p&gt;Lua defines two useful meta-methods: &lt;code&gt;__index&lt;/code&gt; and &lt;code&gt;__newindex&lt;/code&gt;. Both can be set on a table or any other object. &lt;code&gt;__index&lt;/code&gt; is called upon each read access to an undefined key of an object, and &lt;code&gt;__newindex&lt;/code&gt; upon each write access to an undefined key.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&amp;gt; a = {}
&amp;gt; -- the functions are not defined here, this is just an example;
&amp;gt; -- assume mynewindexfunction stores the value with rawset(t, key, value)
&amp;gt; setmetatable(a, { __index = myindexfunction, __newindex = mynewindexfunction })
&amp;gt; a[1] = &quot;hello&quot; -- this calls the __newindex meta-method
&amp;gt; return a[2] -- this calls the __index meta-method
&amp;gt; return a[1] -- this does NOT call __index
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The last line does not call the &lt;code&gt;__index&lt;/code&gt; meta-method because &lt;code&gt;a[1]&lt;/code&gt; now exists in the table (which is why the example assumes &lt;code&gt;__newindex&lt;/code&gt; stored it there with &lt;code&gt;rawset&lt;/code&gt;). This is a problem when you want to use a table as an object, because sometimes you want to monitor every access to the table&apos;s elements.&lt;/p&gt;
&lt;p&gt;This can be easily worked around using a proxy system: you never store anything in the table you manipulate, but in another, hidden table. The meta-methods are something like:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;function myindexfunction(t, key)
   return realtable[key]
end

function mynewindexfunction(t, key, value)
   realtable[key] = value
end
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;They must be defined &lt;em&gt;before&lt;/em&gt; they are installed, otherwise &lt;code&gt;setmetatable&lt;/code&gt; would receive &lt;code&gt;nil&lt;/code&gt; values:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&amp;gt; realtable = {}
&amp;gt; a = {}
&amp;gt; setmetatable(a, { __index = myindexfunction, __newindex = mynewindexfunction })
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This way, our &lt;code&gt;a&lt;/code&gt; table will always stay empty, and &lt;code&gt;realtable&lt;/code&gt; will hold the data. On every read or write access to &lt;code&gt;a&lt;/code&gt;, the meta-methods will be called. This is a very convenient and widely used hack.&lt;/p&gt;
&lt;p&gt;But this has serious drawbacks: as we saw before, the length operator (&lt;code&gt;#&lt;/code&gt;) cannot be redefined on a table. That means &lt;code&gt;#a&lt;/code&gt; will always be &lt;code&gt;0&lt;/code&gt;, and you cannot get the table length anymore, except by defining another method or a special attribute.&lt;/p&gt;
&lt;p&gt;Also, Lua has several functions in the &lt;em&gt;table&lt;/em&gt; library that are used to manipulate tables in an easy way. The problem is that standard functions like &lt;code&gt;table.insert&lt;/code&gt; or &lt;code&gt;table.remove&lt;/code&gt; do raw accesses to the table. That means &lt;code&gt;table.insert(a, 1, 1)&lt;/code&gt; will insert the value 1 at key 1 in &lt;em&gt;a&lt;/em&gt;, &lt;strong&gt;without&lt;/strong&gt; calling the &lt;code&gt;__newindex&lt;/code&gt; meta-method, breaking your beautiful object-oriented model.&lt;/p&gt;
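&lt;p&gt;Here is a minimal, self-contained sketch of the whole problem (plain Lua, 5.1 semantics; the variable names are mine, not from any real code base):&lt;/p&gt;

```lua
-- The proxy hack described above: all accesses should be routed
-- through the meta-methods into a hidden storage table.
local real = {}                 -- the hidden storage table
local proxy = setmetatable({}, {
  __index    = function (t, k) return real[k] end,
  __newindex = function (t, k, v) real[k] = v end,
})

proxy.x = 42                    -- routed through __newindex into `real`
print(proxy.x)                  -- 42, fetched through __index
print(rawget(proxy, "x"))       -- nil: the proxy table itself stays empty

-- The two drawbacks from the article:
print(#proxy)                   -- 0: the raw length of the (empty) proxy,
                                -- since __len is ignored on tables
table.insert(proxy, "oops")     -- in Lua 5.1 this is a raw write into
                                -- `proxy`, silently bypassing __newindex
```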
&lt;p&gt;Another solution is to use a userdata object, like done in the Lua &lt;code&gt;newproxy&lt;/code&gt; function (which is under-documented).&lt;/p&gt;
&lt;p&gt;The problem is that this breaks all the other functions that expect a table as an argument, because they see a userdata, not a table. So this time &lt;code&gt;table.insert&lt;/code&gt; is no longer usable at all, which somehow fixes the problem, but not in the right way IMHO. However, this approach does allow the &lt;code&gt;__len&lt;/code&gt; meta-method to work.&lt;/p&gt;
&lt;h2&gt;Development model&lt;/h2&gt;
&lt;p&gt;The development model of Lua is, from my point of view, non-existent.&lt;/p&gt;
&lt;p&gt;There is no public version control system repository available, so there&apos;s no way to really contribute to Lua. It seems that only a small, fixed set of people work on it; to my eyes, the development process is very closed compared to usual open source projects.&lt;/p&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;I still think Lua is a good choice, because it is very easy to integrate into any C program and to extend to fulfill your needs. However, some bad design choices were made, and the poor, closed development model does not give a good overview of the future of Lua.&lt;/p&gt;
&lt;p&gt;This has been &lt;a href=&quot;http://lua-users.org/lists/lua-l/2008-06/msg00407.html&quot;&gt;well stated by the authors themselves&lt;/a&gt;.&lt;/p&gt;
</content:encoded></item><item><title>Security bug found in Imlib2</title><link>https://julien.danjou.info/blog/security-bug-found-in-imlib2/</link><guid isPermaLink="true">https://julien.danjou.info/blog/security-bug-found-in-imlib2/</guid><description>Yeah, I&apos;m the proud discoverer of CVE-2008-5187.</description><pubDate>Sat, 22 Nov 2008 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Yeah, I&apos;m the proud discoverer of &lt;a href=&quot;http://www.securityfocus.com/bid/32371&quot;&gt;CVE-2008-5187&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;It&apos;s my first time, and it does mean something to me. ;-)&lt;/p&gt;
</content:encoded></item><item><title>The eggtray problem</title><link>https://julien.danjou.info/blog/the-eggtray-problem/</link><guid isPermaLink="true">https://julien.danjou.info/blog/the-eggtray-problem/</guid><description>I still don&apos;t know why but many GTK+ applications use something called eggtrayicon. As far as I know, eggtrayicon.c is a file written in 2002 by Anders Carlsson which implements the Freedesktop.org sy</description><pubDate>Fri, 03 Oct 2008 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;I still don&apos;t know why, but many GTK+ applications use something called eggtrayicon. As far as I know, &lt;em&gt;eggtrayicon.c&lt;/em&gt; is a file written in 2002 by Anders Carlsson which implements the &lt;a href=&quot;http://standards.freedesktop.org/systemtray-spec/latest/&quot;&gt;Freedesktop.org system tray&lt;/a&gt; protocol for GTK+ applications.&lt;/p&gt;
&lt;p&gt;The problem is that this C file is used in dozens of programs, maybe more, and is a bit buggy. I&apos;ve already sent patches for &lt;a href=&quot;http://www.nongnu.org/mailnotify/&quot;&gt;mail-notification&lt;/a&gt; and &lt;a href=&quot;http://audacious-media-player.org&quot;&gt;Audacious&lt;/a&gt;. &lt;a href=&quot;http://www.pidgin.im/&quot;&gt;Pidgin&lt;/a&gt; has the first fixed implementation I found, and it works quite well. Many other applications are probably affected.&lt;/p&gt;
&lt;p&gt;That seems like a real problem to me: multiple copies of bad code instead of using the &lt;a href=&quot;http://library.gnome.org/devel/gtk/2.14/GtkStatusIcon.html&quot;&gt;native GTK+ system tray implementation&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;So please stop using this bad implementation…&lt;/p&gt;
</content:encoded></item><item><title>Unexpected VARMon new release</title><link>https://julien.danjou.info/blog/unexpected-varmon-new-release/</link><guid isPermaLink="true">https://julien.danjou.info/blog/unexpected-varmon-new-release/</guid><description>It has been 4 years since I released a new upstream release of VARMon, the DAC960 administration tool.</description><pubDate>Mon, 18 Aug 2008 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;It has been 4 years since I released a new upstream release of &lt;a href=&quot;http://github.com/jd/varmon&quot;&gt;VARMon&lt;/a&gt;, the DAC960 administration tool.&lt;/p&gt;
&lt;p&gt;There was a bug first discovered in &lt;a href=&quot;http://bugs.debian.org/401236&quot;&gt;#401236&lt;/a&gt;. It was fixed in Debian with an ugly workaround, which turned out not to work in the long run. Recently &lt;a href=&quot;http://bugs.debian.org/491505&quot;&gt;#491505&lt;/a&gt; was opened too, which was the same issue as the previous one. But this time I got access to hardware, thanks to Christoph! And I finally fixed the bug. I&apos;ve even been able to test the fixes I wrote years ago for all of the compilation warnings.&lt;/p&gt;
&lt;p&gt;It&apos;s a shame that the problem was caused by dead code from the previous upstream, and that I did not realize it sooner. Kids, do not leave dead debug code in your programs at home.&lt;/p&gt;
&lt;p&gt;So I&apos;ve finally been able to release a new 1.2.1 version, which may be the last release for the next decade! ;-)&lt;/p&gt;
</content:encoded></item><item><title>ATL1E support in Linux 2.6.26-1</title><link>https://julien.danjou.info/blog/atl1e-support-in-2-6-26-1/</link><guid isPermaLink="true">https://julien.danjou.info/blog/atl1e-support-in-2-6-26-1/</guid><description>Ben Armstrong opened an ITP for the ATL1E NIC driver, which is found on some Asus EeePC laptops. So, as suggested by Maximilian Attems, I provided a clean patch for this driver, made from a.</description><pubDate>Thu, 31 Jul 2008 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Ben Armstrong opened an &lt;a href=&quot;http://bugs.debian.org/492029&quot;&gt;ITP for the ATL1E NIC driver&lt;/a&gt;, which is found on some Asus EeePC laptops. So, as suggested by Maximilian Attems, &lt;a href=&quot;http://lists.debian.org/debian-kernel/2008/07/msg00638.html&quot;&gt;I provided a clean patch for this driver&lt;/a&gt;, made from a cherry-pick from the linux-netdev 2.6.27 tree. It has been committed into the 2.6.26-1 Debian kernel, which will ship with Lenny.&lt;/p&gt;
&lt;p&gt;What&apos;s funny is that, in the meantime, I got a new computer at work. Wait, it&apos;s not funny yet. Because what I did not know is that it&apos;s built around an &lt;a href=&quot;http://asus.com/products.aspxl1=3&amp;amp;l2=11&amp;amp;l3=709&amp;amp;l4=0&amp;amp;model=2164&amp;amp;modelmenu=1&quot;&gt;Asus P5Q motherboard&lt;/a&gt; whose NIC needs the ATL1E driver (and now you see why it&apos;s funny).&lt;/p&gt;
&lt;p&gt;So I&apos;ve just upgraded to 2.6.26-1-amd64, and I&apos;m glad that my own work is useful to me (and will probably be to others as well). :-)&lt;/p&gt;
</content:encoded></item><item><title>EWMH and XRandR</title><link>https://julien.danjou.info/blog/ewmh-and-xrandr/</link><guid isPermaLink="true">https://julien.danjou.info/blog/ewmh-and-xrandr/</guid><description>Today I decided to add some EWMH support to awesome. It now supports a bunch of these extensions quite nicely.</description><pubDate>Thu, 27 Dec 2007 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Today I decided to add some &lt;a href=&quot;http://en.wikipedia.org/wiki/EWMH&quot;&gt;EWMH&lt;/a&gt; support to &lt;a href=&quot;http://awesome.naquadah.org&quot;&gt;awesome&lt;/a&gt;. It now supports a bunch of these extensions quite nicely.&lt;/p&gt;
&lt;p&gt;However, while reading the &lt;a href=&quot;http://standards.freedesktop.org/wm-spec/wm-spec-1.4.html&quot;&gt;spec&lt;/a&gt; and writing the code, it appeared that the spec forces a window manager to behave in only one way: poor desktop support, and no multi-head/XRandR/Xinerama support at all.&lt;/p&gt;
&lt;p&gt;The main caveat is that in Xinerama/XRandR mode, you have only one root window, and the root window is where you must store the NET_WM X properties… So you cannot handle screens in an independent way like &lt;em&gt;awesome&lt;/em&gt; does. That&apos;s really a shame.&lt;/p&gt;
&lt;p&gt;There&apos;s also a big problem for window managers like &lt;em&gt;awesome&lt;/em&gt; that are happy to draw several desktops at the same time: the spec has no support for anything like that.&lt;/p&gt;
&lt;p&gt;So far, I think EWMH is nice, but it is really too narrow-minded for software and people who want to think about window management in a different way.&lt;/p&gt;
</content:encoded></item><item><title>Kicking out Web spammers with DNSBL</title><link>https://julien.danjou.info/blog/kicking-out-web-spammers-with-dnsbl/</link><guid isPermaLink="true">https://julien.danjou.info/blog/kicking-out-web-spammers-with-dnsbl/</guid><description>Every project has its story. Every war has its winner, and its casualties. There were 20 million men, fighting for their freedom.  And you&apos;ll never know their story.  Because during last week, I was l</description><pubDate>Mon, 15 Jan 2007 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Every project has its story. Every war has its winner, and its casualties. There were 20 million men, fighting for their freedom.&lt;/p&gt;
&lt;p&gt;And you&apos;ll never know their story.&lt;/p&gt;
&lt;p&gt;Because during the last week, I was looking into why my Web server was so heavily loaded. And I discovered that my blog was being attacked by spammers trying to post comments. They were stopped by a great plug-in named &lt;em&gt;spamplemousse&lt;/em&gt;, which uses spam keywords and DNSBL to drop spam comments. However, this plug-in is written in PHP, like the rest of my blog, so it loads Apache and MySQL in a way that is no longer acceptable: the page still has to be rendered for these !@#$ spammers.&lt;/p&gt;
&lt;p&gt;Consequently, I decided to write an Apache 2.x module that just throws a &lt;em&gt;403 Forbidden&lt;/em&gt; error page at the spammers&apos; heads, using DNSBL servers. Here it is, and it is called &lt;a href=&quot;https://github.com/jd/mod_defensible&quot;&gt;mod_defensible&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;I&apos;ve been using it for 3 days now, and I&apos;ve got some pretty interesting results and less load on my Web server, so &lt;em&gt;it&apos;s all good&lt;/em&gt;.&lt;/p&gt;
</content:encoded></item></channel></rss>