Thursday, May 24, 2012
I think about every software process diagram I've ever seen, and every one seems to focus on the work items and how they flow - through requirements, design, implementation, testing, and deployment. Whether the cycles are short or long, whether there are discrete handoffs or a collapsed 'do the work' stage, the work item is the centerpiece of the flow.
But then over time, something happens. The work items take longer, defects become more common, and the system deteriorates. We have a nebulous term to bucket these deterioration effects - technical debt. The design is 'ugly', and making it 'pretty' is something of a mystic art. And likewise, keeping a software system on the rails depends on this mystic art - which seems quite unfortunate. So why aren't the humans part of our process diagram? If we recognized the underlying system at work, could we learn to better keep it in check?
What effect does this 'ugly' code really have on us? How does it change our interactions with the system? What is really happening?
If we start focusing our attention on thinking processes instead of work item processes, on how ideas flow instead of how work items flow... the real impact of these problems may actually become visible. Ideas flow between humans. Ideas flow from humans to software. Ideas flow from software to humans. What are these ideas? What does this interaction look like?
Mapping this out even for one work item is enlightening. It highlights our thinking process. It highlights our cognitive missteps that lead us to make mistakes. It highlights the effects of technical debt. And it opens a whole new world of learning.
Thursday, April 12, 2012
Addressing the 90% Problem
If I were to try to measure the time I spend thinking, analyzing, and communicating versus actually typing in code, how much would it be? If I had to guess, I'd say at least 90%. I wonder what other people would say? Especially without being biased by the other opinions in the room?
We spend so much of our time trying to understand... even putting a dent in improvement would mean -huge- gains in productivity. So...
How can we improve our efficiency at understanding?
How can we avoid misunderstanding, forgetting, or lack of understanding?
How can we improve our ability and efficiency at communicating understanding?
How might we reduce the amount of stuff that we need to understand?
These are the questions that I want to focus on... it's where the answers and solutions will make all the difference.
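To put numbers on why even a dent matters, here's a quick back-of-envelope sketch, assuming the purely hypothetical 90/10 split guessed at above:

```python
# Hypothetical split: 90% of development time goes to understanding and
# communicating, 10% goes to actually typing in code.
understanding, typing = 0.90, 0.10

# Overall time saved by a 20% efficiency gain in each activity:
save_understanding = understanding * 0.20   # 0.18 -> 18% of total time
save_typing = typing * 0.20                 # 0.02 -> only 2% of total time

print(f"20% faster understanding: {save_understanding:.0%} overall savings")
print(f"20% faster typing:        {save_typing:.0%} overall savings")
```

If the split is anywhere near 90/10, even modest improvements in understanding dwarf anything we could gain by typing faster.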
Mistakes in a World of Gradients
I've been working on material for avoiding software mistakes, and have been searching for clarity on how to actually define "mistake."
I've been struggling with common definitions that are very black and white about what is wrong and considered "an error" or "incorrect". Reality often seems more of a gradient than that, and likewise avoiding mistakes should maybe be more a matter of avoiding poor decisions in favor of better ones?
I like this definition better, because it accounts for the gradient in outcome without missing the point.
"A human action that produces an incorrect or inadequate result. Note: The fault tolerance discipline distinguishes between the human action (a mistake), its manifestation (a hardware or software fault), the result of the fault (a failure), and the amount by which the result is incorrect (the error)."
GE Russell & Associates Glossary - http://www.ge-russell.com/ref_swe_glossary_m.htm
Tuesday, April 10, 2012
What is a Mistake?
I've been trying to come up with a good definition for "mistake" in the context of developing software.
It's easy to see defects as caused by mistakes, but what about other kinds of poor choices? Choices that led to massive work inefficiencies? And what if you did everything you were "supposed to" do, but still missed something, is that caused by a mistake? What if the problem is caused by the system and no person is responsible, is that a mistake?
All of these, I think, should be considered mistakes. If we look at the system and the cause, we can work to prevent them. The problem with the word "mistake" is that it's quickly associated with blaming whoever was responsible for whatever went wrong. "Mistake" triggers fear, avoidance, and guilt - the exact opposite of the kind of response that can lead somewhere positive.
Here's the best definition I found from dictionary.com:
"an error in action, calculation, opinion, or judgment caused by poor reasoning, carelessness, insufficient knowledge, etc."
From this definition, even if you failed to do something that you didn't know you were supposed to do (having insufficient knowledge), it's still a mistake. Even if it was an action triggered by interacting parts of the system rather than any one person, it's still a mistake.
But choices that cause inefficiencies? Those seem to fall under the gradient of an "error in action or judgment". If we could have made a better choice, was the choice we made an error? Hmm.
Sunday, April 8, 2012
A Humbling Experience
About 7 years ago, I was working on a custom SPC (statistical process control) system project. Our software ran in a semiconductor fab and was basically responsible for reading in all the measurement data off of the tools and detecting processing errors. Our users would write thousands of little mini-programs that would gather data across the process, do some analysis, and then, if they found a problem, could shut down the tool responsible or stop the lot from further processing.
It was my first release on the project. We had just finished up a 3-month development cycle and worked through all of our regression and performance tests. Everything looked good to go, so we tied a bow on it and shipped it to production.
That night at about three in the morning, I got a phone call from my team lead. And I could hear a guy just screaming in the background. Apparently, we had shut down every tool in the fab. Our system ground to a screeching halt, and everyone was in a panic.
Fortunately, we were able to roll back to the prior release and get things running again. But we still had to figure out what happened. We spent weeks verifying configuration, profiling performance, and testing with different data. Then finally, we found a bad slowdown that we hadn't seen before. Relieved to find the issue, we fixed it quickly and assured our customers that everything would be ok this time.
Fifteen minutes after installing the new release... the same thing happened.
At this point, our customers were just pissed at us. They didn't trust us. And what can you say to that? Oops?
We went back to our performance tests, but couldn't reproduce the problem. After spending weeks trying to figure it out, with about 15 people on the team sitting pretty much idle, management decided to move ahead with the next release. But we couldn't ship...
There's an overwhelming feeling that hits you when something like this happens. A feeling that most of us will instinctively do anything to avoid. The feeling of failure.
We cope with it and avoid the feeling with blame and anger. I didn't go point fingers or yell at anyone, but on the inside, I told myself that I wasn't the one who introduced the defect, that it was someone else who had messed it up for our team.
We did eventually figure it out and get it fixed, but by that time it was already time for the next release, so we just rolled the patch in. We were extra careful and disciplined about our testing and performance testing; we didn't want the same thing to happen again.
At first everything looked ok, but then we had a different kind of problem. It was a latent failure that didn't manifest until the DBA ran a stats job on a table, which crashed our system... again. But this time, it was my code, my changes, and my fault.
There was nobody else I could blame but myself... I felt completely crushed.
I remember sitting in a dark meeting room with my boss, trying to hold it in. I didn't want to cry at work, but that only lasted so long. I sat there sniffling, while he gave me some of the best advice of my life.
"I know it sucks... but it's what you do now that matters. You can put it behind you, and try to let it go... or face the failure with courage, and learn everything that it has to teach you."
Our tests didn't catch our bugs. Our code took forever to change. When we'd try to fix something, sometimes we'd break five other things in the process. Our customers were scared to install our software. And nothing says failure more than bringing down production the last 3 times that we tried to ship!
That's where we started...
After 3 years, we went from chaos, brittleness, and fear to predictable, quality releases. We did it. The key to making it all happen wasn't the process, the tools, or the automation. It was about facing our failures. Understanding our mistakes. Understanding ourselves. We spent those 3 years learning and working to prevent the causes of our mistakes.
Wednesday, March 28, 2012
What we REALLY Value is the Cost...
Today, someone in the community mentioned the idea of measuring "value points". And the light went on... could this finally highlight our productivity problems? It could be a totally dead-end idea, but it's a hypothesis that needs testing.
When I thought "value points", I imagined a bunch of product folks sitting around playing planning poker, judging the relative value of features - using stable reference stories and choosing whether one story was more or less valuable than the others. Might seem goofy, but it's an interesting idea. My initial thought was that this would be way more stable over time than cost, since cost varies dramatically over the lifetime of a project. And if it is truly stable, it might provide the missing link when trying to understand changes in cost over time.
For this to make sense, you gotta think about long-term trends. Suppose our team can deliver 20 points of cost per sprint. But our codebase gets more complex, bigger, uglier, and more costly to change. Early on, we can do 10 stories at 2 points each. But 2 years later, very similar features on the more complex codebase require more effort to implement, so maybe these similar stories now take 5 points each and we can do 4 of them. Our capacity is still 20 story points, but our ability to deliver value has REALLY decreased.
We often use story points as a proxy for value delivered per sprint, but think about that... We get "credit" for the -cost- of the story as opposed to the -value- of the story. If our costs go up, we get MORE credit for the same work!
How can we ever hope to improve productivity if we measure our value in terms of our costs? How can we tell if a story that cost 5 could have cost 1? Looking at story points as value delivered makes the productivity problems INVISIBLE. It's no wonder that it's so hard to get buy-in for tackling technical debt...
What if we aimed for value point delivery? If you improved productivity, or your productivity tanked, would it actually be visible then? On that same project, with 20 cost points per sprint, suppose that equates to 10 value points early on, and 4 value points later. Clearly something is different. Maybe we should talk about how to improve? Productivity, anyone? Innovation?
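To make the arithmetic concrete, here's a minimal sketch using the hypothetical numbers from the example above (similar features, each worth roughly 1 value point, getting more expensive as the code degrades):

```python
# Hypothetical sprint snapshots: early vs. 2 years into the project.
sprints = [
    {"when": "early", "stories": 10, "cost_per_story": 2, "value_per_story": 1},
    {"when": "later", "stories": 4,  "cost_per_story": 5, "value_per_story": 1},
]

for s in sprints:
    velocity = s["stories"] * s["cost_per_story"]    # what we usually report
    value = s["stories"] * s["value_per_story"]      # what was actually delivered
    print(f"{s['when']}: velocity = {velocity} cost points, "
          f"value delivered = {value} value points")

# Output:
# early: velocity = 20 cost points, value delivered = 10 value points
# later: velocity = 20 cost points, value delivered = 4 value points
```

Velocity reports 20 both times; only the value measure shows the decline.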
At least it would seem to encourage the right conversations...
Thursday, March 8, 2012
Does Agile process actually discourage collaboration and innovation?
Before everyone freaks out at that assertion, give me a sec to explain. :)
In the Dev SIG today, we were discussing our challenges with integrating UX into development, and had an awesome discussion. I think Kerry will be posting some notes. Most of the discussion, though, went to the ideas and challenges of creating and understanding requirements, and how the processes we use to scale destroy a lot of our effectiveness. The question we all left with, via Greg Symons, was: how do we scale our efforts while preserving this close connection in understanding between the actual customer and those that aim to serve them?
In thinking about this, our recent discussions about backlog, and recalling past projects, I realized there are some crucial skills that we seem to have largely lost. In the days of waterfall, we were actually much more effective at this.
My first agile project was with XP, living in Oregon, and we were fortunate enough to have Kent Beck provide a little guidance on our implementation. Sitting face to face with me, on the other side of a half-wall cube, was an actual customer of our system, who had used it and things like it for more than 20 years. I could sit and watch how they used it, ask them questions, find out exactly what they were trying to accomplish, and exchange ideas. From this experience I came away with a great appreciation for the power of a direct collaborative exchange between developers and real customers.
My next project was waterfall. One of the guys on my team, wickedly smart, had a background mainly in RUP, and he just -loved- requirements process. What he taught me were techniques for understanding: figuring out the core purpose, figuring out the context of that purpose, and exploring alternatives to build a deeper understanding of what a user really needs. Some of these were documentation techniques, and others were just how you ask questions and respond. I learned a ton. On our team, the customers would make a request, and the developers were responsible for working with the customers to discover the requirements.
With Scrum-esque Agile process, this understanding process is outsourced to the product owner. As we try to scale, we use a product owner to act as a communication proxy, and with it create a barrier of understanding between developers and actual customers. Developers seldom really understand their customers, and when they're given the opportunity to connect with them, the number of discoveries of all the things we've been doing that could have been so much better is astounding.
I've done agile before on a new project - sitting in the same room with our real users, understanding their problems, taking what they asked for and figuring out what they needed, and also having control of the architecture, design, and interface, and running the team to build it - and the innovation of the project and what we built was incredible. Industry cutting-edge stuff was just spilling out of everything we did. And it all came out of sitting in a room together and building a deep understanding of both the goals and the possibilities. This was agile with no PO proxy. The developers managed the backlog, but really wrote very little down... we did 1-week releases.
Developers seldom have much skill in requirements these days, and are often handed a specification or a problem statement that is usually still quite far from the root problem.
In building in this understanding disconnect, and losing these skills, are we really just building walls that prevent collaboration and tearing down our opportunities to innovate?