Buggy Code on Production, Survived
Areca is the name of the billing engine I am working on for Turk Telekom. Funny enough, it is also the name of the flowers we bought to freshen the office. We wanted the office...
What software engineering looks like under production pressure: debugging hard problems, handling incidents, managing overload, building for reliability.
Lately I have been debugging a data workflow tool we built in-house. It has an Angular UI and a Java backend, and it moves data between different systems like Postgres to Hiv...
Executing update statements on a production database is always a big challenge. It’s one of those tasks that looks deceptively simple until something breaks in ways you didn’t i...
While working on my book, I keep noticing the same pattern. Some systems look simple while they belong to one team and become something else after everybody starts using the...
Service overload happens a lot. If you haven't seen one, count yourself lucky. The first time I watched it take a system down, I realized how serious it is to get the basics righ...
Once something is in production, you are no longer just building software. You are also keeping it alive. That sounds obvious, but teams forget it all the time. We get excited a...
In the realm of software development, testers are the silent guardians. Their role is often misunderstood and underappreciated, especially when they do their job so well that no...
In 1902, Hanoi was drowning in rats. The government was getting nervous about plague. Hence, the city put a bounty per rat tail. Suddenly, the system had a scoreboard, something...
An incident is the negative consequence of an action: the action fails to produce the expected outcome. For instance, deploying code to...
Over the years, I've interviewed many candidates. One crucial skill that often gets overlooked is operational reflexes during on-call shifts. Surprisingly, few companies test for this,...
The worst telemetry problems I have seen did not start with waste. They started when an incident happened. We could not see enough, and the missing field became the villain of t...