Production and Reliability Series

What software engineering looks like under production pressure: debugging hard problems, handling incidents, managing overload, building for reliability.

5 min read

Buggy Code on Production, Survived

Areca is the name of the billing engine I am working on for Turk Telekom. Funnily enough, it is also the name of the flowers we bought to freshen the office. We wanted the office...

5 min read

Local vs Production Debugging

Lately I have been debugging a data workflow tool we built in-house. It has an Angular UI and a Java backend, and it moves data between different systems like Postgres to Hiv...

6 min read

Update Statements on Production

Executing update statements on a production database is always a big challenge. It’s one of those tasks that looks deceptively simple until something breaks in ways you didn’t i...

9 min read

Message Brokers Are Modern Grids

While working on my book on , I keep noticing the same pattern. Some systems look simple while they belong to one team and become something else after everybody starts using the...

14 min read

Service Overload Strategies

Service overload happens a lot. If you haven't seen one, count yourself lucky. The first time I watched it take a system down, I realized how important it is to get the basics righ...

11 min read

Balancing Act of Reliability

Once something is in production, you are no longer just building software. You are also keeping it alive. That sounds obvious, but teams forget it all the time. We get excited a...

6 min read

Silent Guardians of Quality

In the realm of software development, testers are the silent guardians. Their role is often misunderstood and underappreciated, especially when they do their job so well that no...

4 min read

Why Metrics Don’t Equal Quality

In 1902, Hanoi was drowning in rats. The government was getting nervous about plague. So the city put a bounty on each rat tail. Suddenly, the system had a scoreboard, something...

17 min read

Promoting Learnings in Incidents

An incident is the negative consequence of an action: the action fails to produce the expected outcome. For instance, deploying code to...

6 min read

Operational Skills Needed

Over the years, I've interviewed many candidates. One crucial skill that often gets overlooked is operational reflexes during on-call shifts. Surprisingly, few companies test for this,...

8 min read

The Mirror Is Part of the Machine

The worst telemetry problems I have seen did not start with waste. They started when an incident happened. We could not see enough, and the missing field became the villain of t...