Operational Skills Needed
Over the years, I’ve interviewed many candidates. One crucial skill that often gets overlooked is operational reflexes during oncalls. Surprisingly, few companies test for this, yet it’s a capability that greatly distinguishes engineers.
There is a gap in interviewing. Some candidates excel in coding and system design interviews but struggle on the operational side. They can build elegant architectures on the whiteboard but freeze when the pager goes off. Sometimes, they effectively game the system without real operational experience. They secure positions but lack the practical skills to handle real-world scenarios.
As software engineers, we are expected to run what we build, either individually or as a team. If you lack operational experience, it can spell trouble. The operational muscle is something you build with time. It requires a combination of interpreting graphs, understanding system behavior, and staying calm. It’s the difference between reacting to noise and responding to signals.

The Importance of Operational Reflexes
Operational reflexes are not just about reacting to incidents but also about anticipating and preventing them. They’re built the hard way, by breaking things, owning them, and cleaning up after yourself. Engineers with strong operational skills can diagnose problems from seemingly minor alerts, recognize patterns that suggest larger issues, and apply fixes before problems escalate. You don’t develop this by reading post-mortems; you develop it by being in one.
The best lessons usually come from pain. Everyone who’s done oncall long enough remembers the first real incident. That script that looked harmless until it locked half the database, or the small config change that quietly took down an entire region. In the middle of that chaos, you learn more in two hours than in two years of normal operation. You learn how systems really behave under stress, how communication fails under pressure, and how your tools behave when you actually need them.
Having a gut feeling for operations helps improve and maintain system reliability and performance. Most engineers get that gut feeling by operating systems in production and seeing different failure cases. That’s why I believe operational skill can only be truly learned on the job. You can add process, automation, and documentation to improve, but nothing replaces the raw, lived experience of handling real failures.
That’s why operational reflexes aren’t theoretical knowledge; they’re earned scars. Each incident teaches timing, restraint, and humility the hard way. These skills can’t be measured by any coding interview. Look around your company and you’ll find people who are remarkably good at this. Why? Because they have been through many incidents before.
Interviewing for Operational Experience
Alright, we understand that we should test someone’s operational experience, but how do we do it? Traditional interviews focus on coding skills and theoretical knowledge, which are easier to evaluate through standardized questions. Operational skills require a different approach, as they depend on judgment, timing, and how people behave under real pressure.
One potential method is to incorporate scenario-based questions in interviews. For example, you can present candidates with a hypothetical system outage and ask them to walk through their troubleshooting process. This can reveal their thought process, decision-making skills, and familiarity with operational tools and practices. But even that has limits. Real incidents never follow the script. I wish they did. In real life, graphs lag, alerts flood in, and you have ten minutes to make a call you’ll have to explain in a post-mortem later.
That’s why I believe operations can only be partially simulated. The best proxy is asking for real stories. Ask candidates what their worst oncall night looked like. What went wrong? What did they do first? What did they learn? You’ll quickly see who has been through it and who hasn’t. People who’ve truly run systems don’t talk about perfect playbooks. They talk about tradeoffs, damage control, and lessons learned.
Interviewing for operations is mostly about testing composure and reflection. You’re not looking for someone who’s never broken production; you’re looking for someone who knows what to do next time.
Operational Scenario Interviews
To better evaluate candidates, interview processes should include:
- Realistic Incident Scenarios: Present candidates with past incidents your team has faced. Ask them how they would respond and what steps they would take to identify and resolve the issue. Ask how they would communicate with stakeholders during the incident.
- System Monitoring and Alerting: Assess candidates’ familiarity with monitoring tools like Prometheus, Grafana, or Datadog. Ask them how they would set up alerts and what metrics they would monitor to ensure system health. Don’t settle for textbook answers like “CPU, memory, and latency.” Ask why those metrics matter, what thresholds they’d pick, and how they’d avoid alert fatigue. A good candidate ties observability to decision-making, not dashboards.
- Post-Mortem Analysis: Discuss the importance of post-mortem analysis after incidents. Ask candidates to describe how they would conduct a post-mortem, identify root causes, and implement measures to prevent recurrence. Listen to how they handle accountability: do they focus on learning or blame? Engineers with strong operational maturity turn every incident into institutional knowledge, not personal defense.
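To make the threshold and alert fatigue question from the monitoring bullet concrete, here is a minimal Python sketch of one answer a candidate might give: only page when a metric stays above its threshold for several consecutive samples, so a single spike doesn’t wake anyone up. The class name, the 500 ms threshold, and the sample values are all hypothetical and chosen for illustration; real monitoring tools express this idea in their own rule syntax.

```python
from dataclasses import dataclass


@dataclass
class SustainedThresholdAlert:
    """Fires only after `required_breaches` consecutive samples exceed `threshold`.

    Hypothetical helper for illustration, not the API of any monitoring tool.
    """
    threshold: float
    required_breaches: int
    _streak: int = 0

    def observe(self, value: float) -> bool:
        """Record one sample and return True while the breach is sustained."""
        if value > self.threshold:
            self._streak += 1
        else:
            self._streak = 0  # any healthy sample resets the streak
        return self._streak >= self.required_breaches


def firing_points(alert: SustainedThresholdAlert, samples: list[float]) -> list[int]:
    """Return the indexes of the samples at which the alert is firing."""
    return [i for i, value in enumerate(samples) if alert.observe(value)]


# Made-up p99 latency samples in milliseconds: one isolated spike, then a sustained breach.
latency_ms = [120, 900, 130, 850, 870, 910, 140]
alert = SustainedThresholdAlert(threshold=500.0, required_breaches=3)
print(firing_points(alert, latency_ms))
```

With these made-up numbers, the isolated spike at the second sample never pages; the alert fires only once the breach has lasted three samples in a row. The follow-up questions are the interesting part: why three samples and not five, and what does the delay cost you during a real outage?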
Operational experience is a crucial yet often overlooked aspect of interviewing. By introducing realistic operational scenarios into the interview process, we can better evaluate an engineer’s ability to handle real-world challenges. This ensures that new hires are not only technically proficient but also operationally capable. Ultimately, it’s not about hiring people who never break production. It’s about recruiting engineers you can trust to bring it back gracefully.
All in All
You can’t fake operational maturity. It only comes from being there when things go wrong and staying long enough to fix them. The first time you take production down, you panic. The second time, you breathe. The third time, you already have a plan. That’s how operational reflexes form.
The best engineers are the ones who recover fast, share what they learned, and make sure no one else has to repeat the same mistake. Every outage hurts, but it’s also a form of apprenticeship. Operations humbles you. It teaches you that reliability isn’t a feature you build, it’s a culture you practice.