Sunday, February 24, 2013

The third half, continued: Meta-thinking


In a previous post I wrote about the realization that as of this year I'll have been programming for half my life, and wondering where I'm going to take my career from here. I left a few details for a later post: meta-thinking, and kindness in software. This post is about the former.

Craftsmanship has a subtopic worth its own discussion, namely, thinking about how we think. This is something that has always excited me.

It excites me not only because of its usefulness, but in large part because of its strongly counterintuitive nature. We've all had those moments when the door is labeled, plain as day, Push. And yet we pull it. This doesn't work. In our industry, though, it turns out that stepping back from the guns and smoke and details of the impending deadlines and ringing phones to iron out some kinks in the process can result in quicker progress toward those very deadlines. It's as though pulling the push-door actually worked.

This isn't new to software. One of my favorite quotes, often attributed to Abraham Lincoln, is: "Give me six hours to chop down a tree and I will spend the first four sharpening the axe."

I should also admit that as much as I enjoy meta-thinking, I'm not always good at it, especially when I most need to be: those times when all I can think is I have this pile of logs to chop and tick tick tick!

Topics in this post:
* How do we get there from here?
* Exponential expense
* Technical debt
* Technical surplus
* Multiplicativity
* Finding root problems
* Disruption

How do we get there from here?

Back in 1992 or so, shortly after I'd first taught myself C, I read an article in Scientific American about the Mandelbrot set and I was struck by how such a simple algorithm (literally, just a few dozen lines of code) could lead to such amazing complexity. My first version, on a 386, painted a few square inches of screen in a few minutes. The results were pretty. Then one day at the campus computer lab I happened to sit down at a machine with a math co-processor. The same pretty results --- in seconds. The hardware innovation made all the difference.
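Just to show how little code it takes, here's a minimal escape-time sketch --- in Python rather than my original C, with an arbitrary grid size and iteration cap chosen purely for illustration:

    # A minimal escape-time sketch of the Mandelbrot set: for each point c in
    # a grid, iterate z = z*z + c and count how many steps it takes |z| to
    # exceed 2. Points that never escape are treated as being in the set.

    MAX_ITER = 50

    def escape_count(c):
        z = 0
        for i in range(MAX_ITER):
            z = z * z + c
            if abs(z) > 2:
                return i
        return MAX_ITER

    # Render a coarse ASCII view of the region [-2, 1] x [-1.25, 1.25].
    shades = " .:-=+*#%"
    for row in range(24):
        y = 1.25 - row * (2.5 / 23)
        line = ""
        for col in range(72):
            x = -2.0 + col * (3.0 / 71)
            n = escape_count(complex(x, y))
            line += "@" if n == MAX_ITER else shades[n * len(shades) // MAX_ITER]
        print(line)

The whole thing really is just a few dozen lines; the hardware is what decides whether it takes minutes or seconds.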

What are the barriers now? We have gigascale networking, CPUs, and RAM; terascale disks; places like Google run programs on thousands of machines, and you can rent the use of similar resources on Amazon's AWS. The biggest barrier isn't hardware anymore. I have access to ten thousand times the computing resources I did twenty years ago. Am I ten thousand times more productive? Why not, and is there a future in which I could be?

I often wonder, How do we get there from here? Or: Why aren't we there already?

That's a big question, and it's too much to answer in full. But I can say that as a developer, on a day-to-day basis, I know that we all have demands from multiple customers, who in turn have changing needs. Meanwhile the platforms we're building on top of are also changing. Most importantly, we're all busy, we don't have enough time in our days, and emergencies keep happening --- emergencies that we just can't ignore. Something is always in the way.

And so I ask myself about getting some of those things out of the way:

* How can we find more time in our day?
* How can we have more fun doing our work?
* How can we bring about the change we want to see in the world?
* In short, what stands between us and everything we dream of?

My current best partial answers follow.  They all involve stepping away for a moment from the too-few-hours-in-a-day kinds of too-hard thinking, and instead thinking about how I think --- treating thought itself as a scarce commodity to be carefully planned for. They involve technical surplus, multiplicativity, finding root problems, and disruption.

Exponential expense

In my first post-college job, at a small software company in Arizona, I learned firsthand about exponential bug cost and what I started calling the needle-in-a-haystack problem. Namely: we make mistakes (bugs) all the time, and the cost of finding and fixing a mistake goes up exponentially with the time that passes between the error and its discovery --- equivalently, with the number of people involved between the oopser and the oucher.

For example:
* If I make a mistake and catch it right there at my desk, no one even needs to know.
* Suppose I don't catch it, and suppose I say there's a good reason: We have an entire department devoted to software testing, and it's a good division of labor to let them do the testing.
* If, after I check in my code along with the other half-dozen people on my team, the integrator sees a symptom, it'll take a few minutes, or maybe an hour, for them to comb through the merge report and find out I'm the one with today's mistake.
* If the bug survives to system test, there we might have a domain-knowledgeable user: for example, a medical-software company employing nurses and respiratory techs to use the software in a (hopefully) real-world way. They'll know a lot about their work but not much about mine --- they'll see a symptom at the user-interface level (if they see it at all), know something's wrong somewhere, and write up a bug report --- and after that someone will dig into it, and eventually my phone will ring.
* If the bug makes it past system test and isn't triggered until the real real world, then a display crashes on a nurse doing bedside rounds at 3 a.m. in some time zone somewhere, the sysadmin's pager goes off, our nighttime software-support person looks up the symptom in our database of known bugs (finding nothing since this one is new) ... and so on. Eventually my phone rings and I get to fix my bug.

Finding bugs in software is like finding needles in a haystack. At my desk, I've got one bale of hay. If I comb through it --- and that takes time I might not think I have, since tick tick tick! --- I'll find it in minutes. If I don't comb through it, I save those minutes --- and I'm dumping my bale of hay into a cartful, from there into a barnful, and so on. Debugging in a large-scale deployed software system is like looking for a needle in a barnful of hay and it is no fun.

Technical debt

The needle-in-a-haystack metaphor I learned early in my career is just one example of a more general concept known as technical debt. Here's a very nice write-up: http://en.wikipedia.org/wiki/Technical_debt. If you didn't read that, the basic idea is that the minutes we save today can end up costing us more tomorrow --- and worse, borrowing time from the future comes at a cost of compound interest. A team can get to the point where all they're doing is servicing interest --- receiving piled-up bug reports, debugging, fixing, patching, working very, very hard --- and making more mistakes along the way.

These are difficult problems, and we borrow from the future for good reason: someone is in need today. And just like taking out a small-business loan to start a new business, technical debt does have its place. We borrow time from the future when we need to meet a deadline today, and this isn't always reckless: customers want their deliverables on time. The trick is knowing what to do early on, when the debt is just beginning to accrue, before it gets out of hand --- doing a few key extra things at the point when they don't even seem to be necessary.

Technical surplus

Technical debt has its flip side: technical surplus. This is when we do just a little more today than we need to, and it makes life a little better and gives us time to do one more little thing --- and so on. One example is a code comment around a particularly clever algorithm implementation which takes a minute to write, saving others down the road an hour of head-scratching. Another example is a few pages of big-picture documentation helping to organize in the reader's head what would otherwise be two hundred source files in a directory tree. A third is automated unit-test cases which help future maintainers of the code know when they've broken something.
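To make that last example concrete, here's the kind of small, cheap test I mean --- a sketch in Python, with a made-up parse_retry_count function standing in for whatever piece of code is worth protecting:

    import unittest

    def parse_retry_count(config_text):
        """Return the integer from a 'retries = N' line, or 0 if absent."""
        for line in config_text.splitlines():
            key, _, value = line.partition("=")
            if key.strip() == "retries":
                return int(value.strip())
        return 0

    class ParseRetryCountTest(unittest.TestCase):
        def test_value_present(self):
            self.assertEqual(parse_retry_count("timeout = 30\nretries = 3"), 3)

        def test_value_absent_defaults_to_zero(self):
            self.assertEqual(parse_retry_count("timeout = 30"), 0)

    if __name__ == "__main__":
        unittest.main()

A minute or two to write, and the next person who "simplifies" the parsing finds out at their desk instead of in system test.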

When we produce needle-free bales of hay, we can scale to systems of arbitrary complexity. 

Again, the art is in doing a bit more than needed, before it appears to be necessary --- that is, when it appears gratuitous --- and in knowing what to pick. Eternally polishing one rock to the neglect of others is unconstructive perfectionism; polishing each of them is good craftsmanship.

Getting this right is an ongoing art. People tell me I'm good at writing software for the long haul --- but (unless my axe is sharp) I'm not fast. For me, there are a few rules of thumb:

* I try to avoid the not-today-because-having-a-bad-day excuse. I like the Dr. Seuss quote: "Today I shall behave as if this is the day I will be remembered."

* Do it now. That means either writing that comment or that wiki page or that test scenario while it's fresh in my head, or making an entry in the team's problem-tracking system so that catching up on it becomes an accountable task for me.

* Too much documentation can be as bad as too little. The rule of thumb I use is, I try to write the simple reference I wished I had had. I do this by thinking about the things that confused me most when I started. (Since time is money, anything that caused me an hour of head-scratching to figure out is --- at the corporate bottom-line level --- worth writing down for the next person.) I write down the things I worry I'll otherwise forget in six months. I do it now, before I forget that understanding I have in the present. Then I put that where other people can find it (with hyperlinks in from parent pages, and out to other pages), and I move on.

There's a lot more good news --- our industry has come a long way since I first got started. Unit-testing is now widely recognized as a best practice. Agile development encodes technical surplus into a clearly articulated lifestyle. It's a good time to be a software developer.

Multiplicativity

One of the things I wish I'd learned about sooner is multiplication of effort. This enables crowdsourcing, or concurrent programming, to use current buzzphrases.

There are at least five levels of ways to change software, each providing a productivity multiplier over the one before:

* Level 0: change nothing.

* Level 1: Fix a specific bug, add a specific feature. There is a one-to-one correspondence between question and answer.

* Level 2: While fixing a bug, find the missing document, misunderstanding, missing unit-test case, etc. and fix that. For feature addition, create a solution which other people (not just the original requestor) can use. Here the answer-to-question ratio is greater than one. It's necessary to work at least at this level if we're ever to dig out of the tar pit of too-many-bugs-to-fix-too-little-time-to-fix-them.

* Level 3: Create tools which people can glue together to build their own features (a toy sketch follows this list). They're no longer waiting on a programmer to create features --- the power is in their own hands. It's important that those tools be documented, visible, and debuggable enough that users can find their own mistakes. This requires some level of customizability, whether in a GUI, a configuration language, an embedded language, or what have you. I've seen a significant implementation of this paradigm at my current employer, and it's absolutely jaw-dropping to see what a hundred smart, motivated people can do when someone gives them a rich, powerful, extensible tool and gets out of their way.

* Level 4: Create tools which allow people to develop their own level-3 tools. The design of programming languages is an example of this. I have a programmer-centric point of view, but I think language designers engage in one of the highest forms of human thought.
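Here's a toy sketch of the level-3 idea in Python --- every name in it is invented for illustration --- showing the shape: a handful of documented, composable steps, a tiny one-line spec that users write themselves, and errors loud enough that they can debug their own mistakes.

    # A toy "level 3" tool: composable steps plus a one-line spec language
    # that users write themselves. Every name here is invented.

    STEPS = {
        "strip":    lambda rows: [r.strip() for r in rows],
        "nonempty": lambda rows: [r for r in rows if r],
        "upper":    lambda rows: [r.upper() for r in rows],
        "sort":     lambda rows: sorted(rows),
    }

    def run_pipeline(spec, rows):
        """Run a user-written spec like 'strip | nonempty | sort' over rows,
        failing loudly (and debuggably) on a step name we don't recognize."""
        for name in (s.strip() for s in spec.split("|")):
            if name not in STEPS:
                raise ValueError(
                    "unknown step %r; available steps: %s" % (name, sorted(STEPS)))
            rows = STEPS[name](rows)
        return rows

    # A user composes their own feature without waiting on a programmer:
    print(run_pipeline("strip | nonempty | sort", ["  beta ", "", "alpha  "]))
    # ['alpha', 'beta']

The point isn't the pipeline; it's that the next small feature gets built by the person who wants it, without waiting on me.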

Finding root problems

The difference between levels 1 and 2 above is a big one. Too many times over the years I've seen a user file a bug report, with a developer replying You had an error at line 52 of your config file or That method doesn't accept null pointers. The user fixes their config file, the developer adds a null-pointer check --- problem solved? Absolutely not.

The challenge here is to keep asking why?

* What was the user trying to do? What did they believe the config-file syntax did?
* Did we ever give them examples of valid config-file syntax?
* Why was the pointer null in the first place?
* If the report was missing because the report-generator didn't run, why was that? Is the job scheduler supposed to retry failed jobs? Does the user think it is?
* And so on.
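To put the level-1-versus-level-2 contrast into code --- a hedged sketch, with all the names invented --- the first fix below guards against the symptom, while the second answers at least one why and makes the answer visible to whoever hits it next:

    # Level-1 fix: guard against the symptom and move on (names invented).
    def render_report(report):
        if report is None:        # "that method doesn't accept null pointers"...
            return ""             # ...so quietly render nothing. Symptom hidden.
        return report.to_html()

    # Level-2 fix: ask why the report was missing in the first place, and make
    # the answer visible to whoever hits this next.
    def render_report_v2(report, generator_job_status):
        if report is None:
            raise RuntimeError(
                "report generator did not run (job status: %s); "
                "check whether the scheduler retries failed jobs"
                % generator_job_status)
        return report.to_html()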

Just as technical surplus requires doing a little more than seems necessary --- but not so much as to miss deadlines --- here too there's a challenge. We've all been the child who kept asking a follow-on Why? to our parent's previous Because ... . From the parent's end, it's cute at first but it gets maddening after a while. And there are always other rocks to polish, too many why-chains to push down all at once.

My current rules of thumb are:

* Keep a list --- in my head, or preferably written down --- of frequently encountered themes. If after a few weeks or months the same kinds of things keep happening to different people in the same particular area, then that's the rock to polish, the chain of whys that it will pay off to descend into.

* At the very least, leave things better than I found them. I might get just a level or two into the recursive why chain, but as long as I keep my multiplicative ratio greater than one (fixing more than one problem instance for each reported problem) I feel like I've come out ahead.

Disruption

One of the challenges in my career --- and one I've often handled poorly --- is where in the complexity spectrum to make a change. Fixing a one-time bug is quick. Fixing the erroneous documentation which led the programmer to have the mistaken understanding which led him or her to create that bug in the first place is also quick. Taking a subsystem, preserving its API, and doing a gut-level rebuild of its internals may not be quick, but as long as the API is preserved it won't break anybody else, and no one else needs to be involved. I can do all those things, and have done so time and again. I've fixed a lot of problems along the way ...

... but what if the root problems lie deeper? What about making changes across systems, causing other people's code or workflow to need to be changed? This is called disruption, and being disruptive can be a virtue.

This takes courage and it takes communication. It requires taking time out of other people's day, sitting down with them, and listening to them explain why my idea is a bad one. But it's the only way to make meaningful progress in large-scale legacy systems. A single developer can fix a problem here and there, and add significant value, but it takes motivating an entire team to really move a culture and fix root problems.
