Saturday, February 11, 2012

Fighting my way to agility - Part 4

Shrinking your Iterations when you have a Painful Test Burden

The obvious thing, which we quickly jumped on, was automating our existing manual test cases.  They were mostly end-to-end, click-through-the-UI-and-verify tests.  We automated most of them with QTP and tried to run them in our CI loop.  This turned out to be a terrible idea… or rather a horrible nightmare.  Not only were the tests terribly painful to maintain and constantly breaking, we had two other problems: our developers started totally ignoring the CI build since it was always broken, and the tests weren't even catching most of our bugs.  It was a massive cost with very little benefit.  We ended up throwing them all away and going back to manual.  Don't get me wrong, this sucked too.  We instead focused our efforts on creating better tests.

We started looking more deeply at where our bugs were coming from and why we were making mistakes.  Our old manual test suite had been built by writing tests for each new feature as it was implemented.  Shifting our attention to finding ways to stop our actual bugs, instead of just trying to cover our features, was key to really turning quality around.  We ultimately created three new test frameworks.

Performance simulation framework -  The SPC engine had some pretty tight performance constraints that seemed to unexpectedly go haywire.  Trying to track down the cause of the haywire performance was a hell of a challenge.  The more code that had changed, the harder it was to track down.  We also discovered that our existing performance test, which used simulated data, didn't find our problems.  The system had very different performance characteristics with highly variable data.  So we got a copy of each production site and built a framework that would reverse-engineer outputs as inputs and replay them through the system.  We ran a test nightly that did this and sent us an email if it detected 'haywire' behavior.  We actually used our own SPC product and created an SPC chart against our performance data to do it ;)  Catching performance issues immediately was a complete game-changer.  It also gave us a way to tune performance and, as a bonus, some handy libraries for testing our app.
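The core of the 'haywire detector' is just a control chart applied to nightly timings.  A minimal sketch in Python of that idea (the 3-sigma limits and the timing numbers are my own illustration, not the product's actual chart logic):

```python
import statistics

def spc_limits(history):
    """Compute 3-sigma control limits from past nightly timings (seconds)."""
    mean = statistics.mean(history)
    sigma = statistics.stdev(history)
    return mean - 3 * sigma, mean + 3 * sigma

def check_nightly_run(history, latest):
    """Return True if the latest timing falls outside the control limits,
    i.e. the run has gone 'haywire' and should trigger the alert email."""
    low, high = spc_limits(history)
    return not (low <= latest <= high)

# Stable timings around 2 seconds, then a sudden spike
past = [2.0, 2.1, 1.9, 2.05, 2.0, 1.95, 2.1, 2.0]
assert check_nightly_run(past, 5.0)       # spike -> alert
assert not check_nightly_run(past, 2.02)  # within limits -> quiet
```

The nice property of using control limits rather than a fixed threshold is that the alert adapts to each site's normal variability, which matters when production data is highly variable.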

SPC 'fingerprinting' Tests -  Remember all the bandaids and fragility I talked about?  It was insanely hard not to break.  We used the performance test tooling to replay the outputs as inputs to the system, but then we recorded the new output as a fingerprint file.  For every single chart in every production environment, we generated one of these fingerprinted scenarios.  Then, as the system changed, we compared the old fingerprints with the new ones and failed if there were differences.  Even with 8000 generated tests, it was easy to flip the bar green for expected changes by copying all the files in one folder to another.  The challenge was telling expected changes apart from accidental ones.  Again, this grew in complexity with the amount of change.  If the change was relatively small, you could look at a sample of files and see whether the behavior was as expected or not.  We not only caught a TON of bugs this way, but also found cases where changes users asked us to implement would break other use cases, and we were able to alert them.
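The comparison step is essentially a directory diff over fingerprint files.  Here's a hedged sketch of what that could look like (the `.fp` extension and file layout are my own assumptions for illustration):

```python
from pathlib import Path

def compare_fingerprints(baseline_dir, current_dir):
    """Return the names of fingerprint files whose contents differ,
    or which exist on only one side.  An empty list means the bar is green."""
    baseline = {p.name: p.read_text() for p in Path(baseline_dir).glob("*.fp")}
    current = {p.name: p.read_text() for p in Path(current_dir).glob("*.fp")}
    changed = []
    for name in sorted(set(baseline) | set(current)):
        if baseline.get(name) != current.get(name):
            changed.append(name)
    return changed
```

Under this scheme, accepting an expected behavior change really is just copying the current folder over the baseline folder; the hard, human part is reviewing the changed-file list first to make sure nothing accidental slipped in.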

Integration scenario tests - The other area where we had a lot of bugs was scenarios that crossed system boundaries.  For example, SPC would put a lot (a container of material) on hold, a user would investigate, and then release the hold.  These scenarios got quite complicated when the lot was split up, combined with other material, and processed in different places.  We had to track down all the material that might be affected and put all of it on hold.  With remote calls failing on occasion and other activities happening concurrently, there were a lot of ways for things to go wrong.  Anyway, we worked with the testers for the other system to create a framework that let us orchestrate and verify state across systems.  For our apps, rather than going through the UI, we had an internal test controller that we used to drive the app from the inside.  To verify state, we internally collected and dumped all the critical state information to an XML file that we diffed to detect failures.  We had about 150 end-to-end integration tests like this, and they had a much lower maintenance cost.
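The verify step boils down to serializing each system's critical state in a canonical form and diffing the results.  A minimal Python sketch of that pattern (the flat key/value state model and element names are my own illustration, not the actual dump format):

```python
import xml.etree.ElementTree as ET

def dump_state(state):
    """Serialize critical system state (modeled here as a flat dict)
    into a canonical XML string."""
    root = ET.Element("state")
    for key in sorted(state):  # sort keys so the dump is stable across runs
        ET.SubElement(root, "item", name=key).text = str(state[key])
    return ET.tostring(root, encoding="unicode")

def states_match(expected, actual):
    """Diff two state dumps; equal strings mean the scenario verified cleanly."""
    return dump_state(expected) == dump_state(actual)

# A hold released in one system but still active in the other fails the diff
assert states_match({"lot42.hold": "released"}, {"lot42.hold": "released"})
assert not states_match({"lot42.hold": "released"}, {"lot42.hold": "active"})
```

Canonicalizing the dump (sorted keys, stable formatting) is what makes a plain text diff reliable; without it, harmless ordering differences would show up as false failures.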

The Fate of Our Manual Tests

Once we had much better coverage of the error-prone parts of the system, running all of these tests became a lot less important.  A few we converted to scenario or unit tests, but for the most part they stayed manual.  It was still expensive to run them all, but we used another strategy to reduce that burden.  We categorized them all in a SharePoint list; then, at release time, we would run only the tests we thought had a chance of finding a bug based on what we had changed that release.  With all the other testing in place, this was good enough.  We found a couple of bugs with them, but at this point they were almost always green.

But the Work was Still TOO Big!

We still suffered some pretty massive productivity problems, but at this point we were back in control.  When we changed stuff, it wasn't so risky.  But it was still too time-consuming.  So we focused our team on how we could get the most productivity gain for our effort.

We analyzed both our past work, for where we were spending time, and where future code changes needed to be.  The UI layer was frightful and always time-consuming to change, but we rarely had to change it.  Effort there wouldn't actually buy us more productivity.  Most of our changes were in the core SPC pipeline, and it was a major bear to understand and change.  Although the effort level was high, we all knew we would see the payoff.  The impact we had on productivity from rewriting the SPC engine was HUGE.

We were able to build all kinds of awesome new features, including supporting SPC for a completely different type of facility.  Our productivity exploded.  We could never have done these features on the old system - the effort would probably have exceeded the cost of the rewrite.  Our users were also thrilled because the behavior was so predictable, even for complex scenarios.  Their productivity in designing and maintaining SPC charts improved too!

As we got better and better control of our productivity, we kept shrinking our iteration sizes, and were finally able to achieve consistent quality.  It wasn't perfect, but it was good enough to earn back the trust of our customers.  Though we never did end up using timeboxes, at this point our sprints were 'boxable'. :)
