Friday, June 8, 2012
An observation... there are order-of-magnitude differences in developer productivity, and a big gap in between - a place where some people get stuck while others make a huge leap.
Of the people I've observed, it seems there's a substantial difference in their idea of what makes a better design, as well as in their ability to create more options. Those who don't make the leap tend to be bound to a set of operating rules and practices that heavily constrain their thinking. Think about how software practices are taught... I see the focus on behavior-focused "best practices", taught without thinking tools, as something that has stunted the learning and development of our industry.
Is it possible to learn a mental model for evaluating "better" that doesn't rely on heuristics and best-practice tricks? If we had such a model, would it allow us to see more options and connect more ideas?
This has been my focus with mentoring - to see if I could teach this "model," or more specifically, a definition of "better" that means optimizing for cognitive flow. But since it's not anything static, I've focused on tools of observation. By building awareness of how a design affects that flow, we can learn to optimize it for the humans.
A "better" software design is one that allows ideas to flow out of the software, and into the software more easily.
Monday, June 4, 2012
Effects of Measuring
As long as measurements are used responsibly, not for performance reviews or the like, they don't affect anything, right?
It's not just the measurements being used irresponsibly - the act of measuring affects the system, our understanding, and our actions. Like a metaphor, metrics highlight certain aspects of the system, but likewise hide others. We are less likely to see and understand the influences on the system that we don't measure... and in software, the most important things, the stuff we need to understand better, are the things we can't really put a number on.
Rather than trying to come up with a measurement, I think we should try to come up with a mental model for understanding software productivity. Once we have an understanding of the system, maybe there is hope for a measurement. Until then, sustaining productivity is left to an invisible mystic art - and with the effects of productivity problems being so latent, by the time we make the discovery, it's usually way too late and expensive to do much about it.
Understanding productivity, I believe, is WAY more worth the investment than measuring it. A good starting point is looking at idea flow.
Thursday, May 24, 2012
Humans as Part of the System
I think about every software process diagram that I've ever seen, and every one seems to focus on the work items and how they flow - through requirements, design, implementation, testing, and deployment. Whether short cycles or long, discrete handoffs or a collapsed 'do the work' stage, the work item is the centerpiece of the flow.
But then over time, something happens. The work items take longer, defects become more common and the system deteriorates. We have a nebulous term to bucket these deterioration effects - technical debt. The design is 'ugly', and making it 'pretty' is sort of a mystic art. And likewise keeping a software system on the rails is dependent on this mystic art - that seems quite unfortunate. So why aren't the humans part of our process diagram - if we recognized the underlying system at work, could we learn how to better keep it in check?
What effect does this 'ugly' code really have on us? How does it change the interactions with the human? What is really happening?
If we start focusing our attention on thinking processes instead of work item processes, how ideas flow instead of how work items flow... the real impact of these problems may actually be visible. Ideas flow between humans. Ideas flow from humans to software. Ideas flow from software to humans. What are these ideas? What does this interaction look like?
Mapping this out even for one work item is enlightening. It highlights our thinking process. It highlights our cognitive missteps that lead us to make mistakes. It highlights the effects of technical debt. And it opens a whole new world of learning.
Thursday, April 12, 2012
Addressing the 90% Problem
If I were to try to measure the time that I spend thinking, analyzing, and communicating versus actually typing in code, how much of my time would it be? If I were to guess, I'd say at least 90%. I wonder what other people would say? Especially without being biased by the other opinions in the room?
We spend so much of our time trying to understand... even putting a dent in that would mean -huge- gains in productivity. So...
How can we improve our efficiency at understanding?
How can we avoid misunderstanding, forgetting, or lack of understanding?
How can we improve our ability and efficiency at communicating understanding?
How might we reduce the amount of stuff that we need to understand?
These are the questions that I want to focus on... it's where the answers and solutions will make all the difference.
Mistakes in a World of Gradients
I've been working on material for avoiding software mistakes, and have been searching for clarity on how to actually define "mistake."
I've been struggling with common definitions that are very black and white about what is wrong and considered "an error" or "incorrect". Reality often seems more of a gradient than that, and likewise maybe avoiding mistakes should be more a matter of avoiding poor decisions in favor of better ones?
I like this definition better, because it accounts for the gradient in outcome, without missing the point.
"A human action that produces an incorrect or inadequate result. Note: The fault tolerance discipline distinguishes between the human action (a mistake), its manifestation (a hardware or software fault), the result of the fault (a failure), and the amount by which the result is incorrect (the error)."
GE Russell & Associates Glossary - http://www.ge-russell.com/ref_swe_glossary_m.htm
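To make that distinction concrete, here's a toy sketch in Python (the code and numbers are purely illustrative, not from any real system): the programmer's slip is the mistake, its manifestation in code is the fault, the wrong output is the failure, and the amount the output is off by is the error.

    # A toy illustration of the taxonomy above - purely hypothetical code.
    def average(values):
        # The programmer's slip is the mistake; its manifestation in code is
        # the fault: dividing by len(values) + 1 instead of len(values).
        return sum(values) / (len(values) + 1)

    result = average([2, 4, 6])   # the failure: returns 3.0 instead of 4.0
    error = abs(result - 4.0)     # the error: the result is off by 1.0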
Tuesday, April 10, 2012
What is a Mistake?
I've been trying to come up with a good definition for "mistake" in the context of developing software.
It's easy to see defects as caused by mistakes, but what about other kinds of poor choices? Choices that led to massive work inefficiencies? And what if you did everything you were "supposed to" do, but still missed something - is that caused by a mistake? What if the problem is caused by the system and no one person is responsible, is that a mistake?
All of these, I think, should be considered mistakes. If we look at the system and the cause, we can work to prevent them. The problem with the word "mistake" is that it's quickly associated with blaming whoever is responsible for whatever went wrong. "Mistake" triggers fear, avoidance, and guilt - the exact opposite of the kind of response that can lead somewhere positive.
Here's the best definition I found from dictionary.com:
"an error in action, calculation, opinion, or judgment caused by poor reasoning, carelessness, insufficient knowledge,etc. "
From this definition, even if you failed to do something that you didn't know you were supposed to do (having insufficient knowledge), it's still a mistake. Even if it was an action triggered by interacting parts of the system, and not any one thing, it's still a mistake.
But choices that cause inefficiencies? Those seem to fall under the gradient of an "error in action or judgment". If we could have made a better choice, was the choice we made an error? Hmm.
Sunday, April 8, 2012
A Humbling Experience
About 7 years ago, I was working on a custom SPC (statistical process control) system project. Our software ran in a semiconductor fab and was basically responsible for reading in all the measurement data off the tools and detecting processing errors. Our users would write thousands of little mini programs that would gather data across the process, do some analysis, and then if they found a problem, could shut down the tool responsible or stop the lot from further processing.
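To give a rough flavor of what one of those mini programs did, here's a minimal sketch of a classic 3-sigma control check in Python - the names, thresholds, and shutdown hook are illustrative assumptions on my part, not our actual system:

    # Minimal SPC-style check - the names and details here are hypothetical,
    # not the actual fab system described above.
    from statistics import mean, stdev

    def out_of_control(baseline, new_readings, n_sigma=3.0):
        """Return the readings that fall outside n_sigma of the baseline."""
        center = mean(baseline)
        sigma = stdev(baseline)
        return [x for x in new_readings if abs(x - center) > n_sigma * sigma]

    def run_check(tool_id, baseline, new_readings):
        violations = out_of_control(baseline, new_readings)
        if violations:
            # In the real system, a check like this could shut down the
            # offending tool or hold the lot from further processing.
            print(f"ALERT: tool {tool_id} out of control: {violations}")
        return violations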
It was my first release on the project. We had just finished up a 3 month development cycle, and worked through all of our regression and performance tests. Everything looked good to go, so we tied a bow on it and shipped it to production.
That night at about three in the morning, I got a phone call from my team lead. And I could hear a guy just screaming in the background. Apparently, we had shut down every tool in the fab. Our system ground to a screeching halt, and everyone was in a panic.
Fortunately, we were able to roll back to the prior release and get things running again. But we still had to figure out what happened. We spent weeks verifying configuration, profiling performance, and testing with different data. Then finally, we found a bad slowdown that we hadn't seen before. Relieved to have found the issue, we fixed it quickly and assured our customers that everything would be ok this time.
Fifteen minutes after installing the new release... the same thing happened.
At this point, our customers were just pissed at us. They didn't trust us. And what can you say to that? Oops?
We went back to our performance tests, but couldn't reproduce the problem. After spending weeks trying to figure it out, with about 15 people on the team sitting pretty much idle, management decided to move ahead with the next release. But we couldn't ship...
There's an overwhelming feeling that hits you when something like this happens. A feeling that most of us will instinctively do anything to avoid. The feeling of failure.
We cope with it and avoid the feeling with blame and anger. I didn't go and point fingers or yell at anyone, but on the inside, I told myself that I wasn't the one who introduced the defect, that it was someone else who had messed it up for our team.
We did eventually figure it out and get it fixed, but by that time it was already time for the next release, so we just rolled in the patch. We were extra careful and disciplined about our testing and performance testing - we didn't want the same thing to happen again.
At first everything looked ok, but then we hit a different kind of problem: a latent failure that didn't manifest until the DBA ran a stats job on a table, which crashed our system... again. But this time, it was my code, my changes, and my fault.
There was nobody else I could blame but myself... I felt completely crushed.
I remember sitting in a dark meeting room with my boss, trying to hold it in. I didn't want to cry at work, but that only lasted so long. I sat there sniffling, while he gave me some of the best advice of my life.
"I know it sucks... but it's what you do now that matters. You can put it behind you, and try to let it go... or face the failure with courage, and learn everything that it has to teach you."
Our tests didn't catch our bugs. Our code took forever to change. When we'd try to fix something, sometimes we'd break five other things in the process. Our customers were scared to install our software. And nothing says failure more than bringing down production the last 3 times that we tried to ship!
That's where we started...
After 3 years, we went from chaos, brittleness, and fear to predictable, quality releases. We did it. The key to making it all happen wasn't the process, the tools, or the automation. It was about facing our failures. Understanding our mistakes. Understanding ourselves. We spent those 3 years learning and working to prevent the causes of our mistakes.