Articles tagged with Operational Excellence

Writing on repeatability, disciplined execution, and improving how work actually runs.

8 min read

The Mirror Is Part of the Machine

The worst telemetry problems I have seen did not start with waste. They started when an incident happened. We could not see enough, and the missing field became the villain of t...

7 min read

Why Headcount Math Lies

In 1911, Frederick Winslow Taylor published The Principles of Scientific Management and helped cement one of management’s oldest instincts. In simple terms, break work into measurable units, optimize for efficiency, a...

14 min read

Incentives Drive Everything

In early modern France, the monarchy kept running into the same problem. Wars were expensive, revenue was not steady, and every obvious solution came with a political price. New...

10 min read

Scaling Culture Without Dilution

As organizations grow across geographies, one thing becomes disproportionately important: culture. We engineers often dismiss culture as soft and cushy. That is, until you see...

12 min read

What Good Looks Like

A few companies back, my manager and I inherited a group of teams after layoffs. Confidence was already low. People didn't believe in the systems we maintained. Stakeholders los...

14 min read

Why Airport Security Feels Random

I’m about to take yet another flight, this time to India. I’m excited, but I can’t seem to get past the thought of why the heck security checks are so random. I had to c...

10 min read

The Janus Protocol

In the Roman Forum, there was a small shrine with double doors. When Rome was at war, the doors were left open. When Rome was at peace, the doors were closed. It is an oddly mod...

11 min read

What Good Execution Looks Like

The other day I was talking with one of my directs. We ended up discussing something we’ve both learned over the years. When execution works, the environment is quiet. Not slow....

12 min read

Building Remote Teams

You've probably heard stories of big tech companies laying off staff in the US and hiring double that number in India, blaming AI for the shift. Everyone's first thought is likely cheap labor. While...

6 min read

Subteam Tenets

Over the years, working across multiple organizations, I developed the concept of subteam tenets. I’ve tweaked it along the way to fit each company's unique quirks, but I still...

11 min read

Balancing Act of Reliability

Once something is in production, you are no longer just building software. You are also keeping it alive. That sounds obvious, but teams forget it all the time. We get excited a...

6 min read

Operational Skills Needed

Over the years, I've interviewed many candidates. One crucial skill that often gets overlooked is operational reflexes during on-call shifts. Surprisingly, few companies test for this,...

10 min read

Engineering Health Essentials

Engineering health is a term that deserves far more attention than it receives. Sustainable software development is not only about the features we ship or the speed at which we...

6 min read

Update Statements on Production

Executing update statements on a production database is always a big challenge. It’s one of those tasks that looks deceptively simple until something breaks in ways you didn’t i...

8 min read

Engineering Roles and Responsibilities

Engineering roles exist whether you define them or not. In some teams, ownership is explicit. People know who drives incident management, who keeps an eye on risk, who pushes on...

4 min read

Manager as a Service

What would a manager as a service look like? What kind of systems would a manager resemble? How can you describe a manager’s responsibility through various systems? Here’s my ta...

14 min read

Service Overload Strategies

Service overload happens a lot. If you haven't seen one, count yourself lucky. The first time I watched it take a system down, I realized how serious it is to get the basics righ...

17 min read

Promoting Learnings in Incidents

The word "incident" is used for the negative consequences of an action. An incident comes from an action that fails to produce the expected outcome. For instance, deploying code to...