Local vs Production Debugging

Lately I have been debugging a data workflow tool we built in-house. It has an Angular UI and a Java backend, and it moves data between systems: Postgres to Hive, Hive back to Postgres, Redshift, and a few other places. Debugging it locally is usually straightforward: I run the job, watch the logs, and see where something goes wrong. In production, the same workflow behaves differently. The data is different, the timing is different, and sometimes a job fails in a way I cannot reproduce at all.

In this post, I want to talk about the differences I keep noticing between debugging locally and debugging in production: how the environment changes the behaviour, why timing and load matter more than I expected, how logs become the main source of truth, and why “it worked on my machine” does not help when a workflow touches multiple systems. I am still learning, but these are the things that stand out right now.

Local Debugging Is Straightforward

When I debug this thing locally, everything feels simple and honestly a bit fake. The UI sends a tiny request, the backend runs through the steps, and the whole workflow finishes before I take a sip of my coffee. Hive responds instantly because I am using sample data. Postgres behaves. Actually, it behaves in Prod, too. Love you, Postgres. Nothing is under load. Nothing is slow. Nothing is fucking weird. If something breaks, it breaks the same way every time, and I can fix it in a couple of minutes.

Nonetheless, local debugging also hides most of the real shit. Queries look fast because the data is small. Network calls always magically work because everything is basically running next to everything else. Even stupid mistakes do not look serious because the environment is so forgiving. It makes me think the workflow is stable, but it is not. It is just local, and local lies.

Production Behaves Differently

The moment this workflow hits production, it stops behaving like the thing I tested locally. Hive suddenly takes thirty seconds to answer a tiny query that returns ten rows. Sometimes the Metastore is out of sync and the workflow thinks a partition is missing even though the files are there. YARN kills containers at random for memory reasons, which makes no sense because the same step runs fine the next time. None of this shit shows up on my machine.

Redshift is even worse. It gets stuck behind some giant query sitting in a WLM queue and my simple job just waits forever. The COPY commands fail because of weird encoding problems or half-written S3 files. Sometimes a node gets skewed and everything slows down, and sometimes vacuuming takes so long it basically kills the whole pipeline. I do not see any of this locally. On my machine, Redshift looks fast and polite. In production, it is a fucking menace.

The workflow only breaks when all these systems behave slightly differently at the same moment. A five second delay in Hive causes a timeout in the backend. A Redshift queue causes a retry that then conflicts with a different step. That’s why production is a completely different world. All the real problems only live there.
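That timeout-then-retry cascade is the kind of thing I now try to make explicit in the backend. Below is a minimal sketch of the idea, not the real tool's code: give each external call its own deadline and a bounded number of retries, so a slow Hive answer fails loudly instead of hanging a step. The class and method names (`BoundedRetry`, `callWithRetry`) are made up for illustration.

```java
import java.time.Duration;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.Supplier;

public class BoundedRetry {
    // Run a call with a per-attempt deadline and a bounded number of retries.
    public static <T> T callWithRetry(Supplier<T> call, int maxAttempts, Duration perAttempt)
            throws Exception {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            Exception last = null;
            for (int attempt = 1; attempt <= maxAttempts; attempt++) {
                Future<T> future = pool.submit(() -> call.get());
                try {
                    // Each attempt gets its own deadline instead of waiting forever.
                    return future.get(perAttempt.toMillis(), TimeUnit.MILLISECONDS);
                } catch (TimeoutException | ExecutionException e) {
                    future.cancel(true);
                    last = e; // in a real workflow, log the attempt number and cause here
                }
            }
            throw last; // surface the last failure instead of hanging silently
        } finally {
            pool.shutdownNow();
        }
    }
}
```

The point is not this exact helper. It is that the deadline and the retry count are decisions, written down in one place, instead of accidents that only show up when Hive is five seconds slower than usual.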

Logging Becomes the Only Real Tool

In production I cannot pause anything or step through the workflow. I don’t have my debugger. The system just runs, fails, and leaves me with whatever logs it decided to write. If the logs are bad, I am screwed. If the logs are missing, I am even more screwed. It becomes obvious very fast that most of the debugging pain comes from not logging enough, or logging the wrong things, or logging them in a way that makes no sense later on.

With Hive and Redshift in the mix, logging becomes even more important. When Hive falls over because YARN killed a container, the error message is usually buried somewhere in a giant stack trace. When Redshift stalls in a queue or fails a COPY, the only way to understand it is from the logs. I cannot replicate any of this locally. I just have to hope that past me wrote enough information to make sense of what the hell happened. Most of the time, past me did not. So I add more logs, redeploy, and wait for the next failure.
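What “enough information” means, for me, is mostly context: which run, which step, which system. Here is a minimal sketch of the shape I aim for, assuming nothing about the real tool's logging setup: every line carries the workflow context as key=value pairs, so a single failure can be traced across systems later. The names (`StepLog`, `line`) are hypothetical.

```java
import java.time.Instant;
import java.util.Map;

public class StepLog {
    // Build one context-rich log line: timestamp, run id, step, target system,
    // event, plus any extra key=value pairs for that specific failure.
    public static String line(String runId, String step, String system,
                              String event, Map<String, String> extra) {
        StringBuilder sb = new StringBuilder();
        sb.append("ts=").append(Instant.now())
          .append(" run=").append(runId)
          .append(" step=").append(step)
          .append(" system=").append(system)
          .append(" event=").append(event);
        for (Map.Entry<String, String> e : extra.entrySet()) {
            sb.append(' ').append(e.getKey()).append('=').append(e.getValue());
        }
        return sb.toString();
    }
}
```

With lines like `run=... step=... system=redshift event=copy_failed`, grepping one run id across all systems is enough to rebuild the timeline, which is exactly what past me keeps failing to leave behind.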

Logging is the only thing that survives the distance between local and production. It is the only thing that tells me what actually happened, not what I think happened. And when the workflow touches multiple systems, logging is the only glue that gives me a chance of understanding the full picture. Without it, production debugging doesn’t work at all.

All in All

Debugging this workflow locally and debugging it in production feel like two completely different jobs. Locally everything is clean and fast and predictable. In production everything depends on data size, timing, load, Hive being slow, Redshift being moody, and a hundred things I do not control. Most of the problems are not even about the code. They are about how these systems behave when they all run together.

I am still figuring out how to deal with it. I am adding better logs, testing with bigger datasets, and trying to think more about what can go wrong outside my machine. It is not perfect and it still breaks in ways that make me swear at the screen, but at least I am learning. Debugging production is teaching me things that local debugging never will.
