Sunday, April 8, 2012

A Humbling Experience

About 7 years ago, I was working on a custom SPC (statistical process control) system project.  Our software ran in a semiconductor fab, and was basically responsible for reading in all the measurement data off of the tools and detecting processing errors.  Our users would write thousands of little mini programs that would gather data across the process, do some analysis, and then, if they found a problem, shut down the tool responsible or stop the lot from further processing.

It was my first release on the project. We had just finished up a three-month development cycle, and worked through all of our regression and performance tests.  Everything looked good to go, so we tied a bow on it and shipped it to production.

That night at about three in the morning, I got a phone call from my team lead. And I could hear a guy just screaming in the background.  Apparently, we had shut down every tool in the fab.  Our system ground to a screeching halt, and everyone was in a panic.  

Fortunately, we were able to roll back to the prior release and get things running again.  But we still had to figure out what happened.  We spent weeks verifying configuration, profiling performance, and testing with different data.  Then finally, we found a bad slowdown that we hadn't seen before.  Relieved to find the issue, we fixed it quickly, and assured our customers that everything would be ok this time.

Fifteen minutes after installing the new release... the same thing happened.

At this point, our customers were just pissed at us.   They didn't trust us.   And what can you say to that? Oops?

We went back to our performance test, but couldn't reproduce the problem.  After we had spent weeks trying to figure it out, with about 15 people on the team sitting pretty much idle, management decided to move ahead with the next release.  But we couldn't ship...

There's an overwhelming feeling that hits you when something like this happens.  A feeling that most of us will instinctively do anything to avoid.  The feeling of failure.

We cope with it and avoid the feeling with blame and anger.   I didn't go and point fingers or yell at anyone, but on the inside, I told myself that I wasn't the one who introduced the defect, that it was someone else who had messed it up for our team.

We did eventually figure it out and get it fixed, but by that time it was already time for the next release, so we just rolled in the patch.  We were extra careful and disciplined about our regression and performance testing, because we didn't want the same thing to happen.

At first everything looked ok, but we had a different kind of problem.  It was a latent failure that didn't manifest until the DBA ran a stats job on a table, and then it crashed our system... again.   But this time, it was my code, my changes, and my fault.

There was nobody else I could blame but myself...  I felt completely crushed.

I remember sitting in a dark meeting room with my boss, trying to hold it in.  I didn't want to cry at work, but that only lasted so long.  I sat there sniffling, while he gave me some of the best advice of my life.

"I know it sucks... but it's what you do now that matters.  You can put it behind you, and try to let it go... or face the failure with courage, and learn everything that it has to teach you."

Our tests didn't catch our bugs.  Our code took forever to change.  When we'd try to fix something, sometimes we'd break five other things in the process.  Our customers were scared to install our software.  And nothing says failure more than bringing down production the last 3 times that we tried to ship!

That's where we started...

After 3 years, we went from chaos, brittleness and fear to predictable, quality releases.  We did it.  The key to making it all happen wasn't the process, the tools, or the automation.  It was about facing our failures.  Understanding our mistakes.  Understanding ourselves.  We spent those 3 years learning and working to prevent the causes of our mistakes.
