From the outside, these problems are largely invisible. So despite the hazardous work, we are usually expected to work without making mistakes. We never have time to fix the hazards because there's always something more important to do. Building the tools we need for safe development, like failure recovery, diagnostic support, adequate logging, and reliable deployment, is often deferred in favor of more features.
Then when something explodes, as it inevitably will, high-risk heroics are required to save the day. We work late nights and weekends repairing complex problems by hand and hope that nothing else goes wrong.
Instead of being recognized as symptoms of a serious problem, the long hours and heroics are often rewarded. Fire-fighting, overtime, and last-minute hacks start to be expected. Constant stress and exhaustion become the norm.
More people just add fuel to the fire.
Those who don't want to put in the long hours anymore are seen as not pulling their weight. Frustration builds, the team burns out, and the best developers start to leave.
The new guys just make things worse. They don't know the software or the hazards to watch out for, and they keep messing things up in the code. We try to hold things together, but it's hard to get anything else done. It becomes a full-time job just to keep the system from falling apart.
Management doesn't understand why productivity is so poor and tries to add more people to the work. This just adds fuel to the fire.
Once this cycle gets started, it's hard to turn things around. We get sucked into the problems, operating in a mode of constant urgency, and we don't want to see our project fail. So we push ourselves to the limit of stress and exhaustion, doing the best we can. However, we're so busy reacting to all the things going wrong that there's no time to stop and fix the problems. One more late night and a few hacks to get things working, but the cycle just doesn't end.
We knew better, but we did it anyway
The worst part about this is that even when we know better, we do it anyway.
I remember one night in particular after working 60+ hour weeks for several months. I checked in some code without running it at all and deployed my changes so I could test them in production. I was so used to working under constant urgency that I had eventually thrown all my sense of principle out the window.
We had built out the delivery infrastructure and automated our release process from the beginning. For a while, we were releasing every week; there were challenges, but for the most part things were going fairly well. We had a major deadline coming up to support a new customer on our platform and investors had been promised it would happen by the end of the year.
The requirements meant drastically changing parts of the architecture and conquering some extremely difficult problems. How long was it going to take? We had no idea, but we did know we had better get to work!
We broke down the work and started chipping away at it, trying to do just enough unit testing to get by. We paired on the more challenging parts and tried to parallelize the work to get it done as fast as we could. We tried to integrate early, but there were so many problems. The software produced weird results. We just had to work through it.
We were caught up in the cycle
Some of us worked on testing and fixing, while others kept pushing along with the remaining features. We knew we were headed down the path of a monstrous release, but we didn't seem to have any choice. We worked an insane number of hours troubleshooting problems just trying to get it stable.
The end of the year was rolling around and we finally got the software in production. We thought the pain was finally over, but that was just the beginning. We had no time to build out the infrastructure we needed to make changes safely, and our new users had a long list of complaints. The pressure just never let up.
With every release, it seemed like something would go wrong. We'd work all weekend and be up late Sunday night trying to fix deployments that had failed. The data would be messed up. Reports wouldn't be right. We didn't really have a viable plan B. The system was down and it took too long to restore from backup, so we just had to fix it in production.
Something had to give...
We were so exhausted, but the urgency didn't end. We were yelled at and threatened whenever things went wrong, but expected to continue the high-risk work. How could they possibly give us bandwidth for work that wasn't part of the deliverables, when the project was already several months behind schedule?
We had poured so much of our time into the software, and the people on the team were my friends. We had great developers who had always been disciplined engineers, yet we all got sucked into the same trap.
Sometimes you just have to leave. Working under threat and constant urgency makes great people do really stupid things.