“Something is rotten in the state of … ” well, at least in Sweden, and at least when it comes to software systems. A large number of failures have added up over the years. I can understand the problems we had in the 1960s and 1970s, and I can even understand the Y2K phenomenon (which actually seems to be fading even from computer scientists’ collective memory). Before the year 2000, we didn’t realise that the software was going to be so well designed (!) that it would survive into a year where “00” was not an error message but an actual number.
However, if we go back through the very short history of computing, we find a large number of failures right from the beginning. Not only with the computers themselves, but with the machines connected to them. Just take the example of the first ATMs that were introduced. For some reason, people seemed to become very absent-minded all of a sudden. They started to leave their banking cards in the machine over and over again. When this issue was investigated more thoroughly, it became clear that the order of two actions had to be changed to make the system better in this respect. Instead of “get the money first” and “retrieve the card after that”, the trick was to “retrieve the card first” and “then get the money”. As simple as that, right?
Now this is no longer a problem, since we have “touch” authorisation for everything you do. Just touch your card, or even your phone, to a small sensor and everything is fine. But just before this became common, there was a short period when the ticket machines for car parking were replaced with a new, safer model. Apparently, it was now an EU requirement that the card should remain inserted during the whole operation (it was in fact held there by a special mechanism). Guess what happened? A large number of people became amnesiac again, leaving their cards in the slot. But this was no surprise to some of us. No, we knew this was going to happen. And of course there was a good reason for the problem to reoccur: THE GOAL of using THE MACHINE!
When you go to an ATM or to a ticket machine for a car park, you have a goal: to get money, or to pay the parking fee. The card we use to pay is not a central part of the task (unless the account is empty or the machine is broken); it is not as important as getting the money or getting the parking paid. We have KNOWN this since the 1970s. But today, it seems, it is not the users who suffer from bad memory; it is the software developers’ turn.
Today we know quite a lot about human factors. Human factors are not bad; on the contrary. We know when they can cause a problem, but we also have quite good knowledge of how to use them constructively, so that they help people do the right thing more or less by default. This does not mean that there is a natural or obvious way to do something, but if we take human factors into consideration, we can in fact predict whether something is going to work or not.
This sounds simple, of course, but the problem is rarely just a matter of changing the order of two actions. It means understanding what we are good at, and also when human factors can lead us the wrong way.
But how can we know all this? By looking back!
It is by looking back on all the bad examples that we should be able to start figuring out what can go wrong. And the list of bad (failed) examples has not grown shorter over the years, even after the Y2K chaos (which in fact should be called the “centennium bug” rather than the “millennium bug”). In the late winter last year we had a new version of the millennium bug (or a similar event, at least). Two of the major grocery chains in Sweden (and one in New Zealand) could not get their checkout systems to work. The system was not working at all. Since the Swedish government has pushed “the cashless society” so hard, a large number of customers were unable to do their shopping that day.

So, what was the problem? Well, from the millennium bug we know that years can be very tricky entities. They should have four digits, otherwise there can be problems. So far so good! However, the developers of this financial system didn’t make a proper estimation of the importance of years in the system. It turned out, in the end, that the day the system stopped working was a day that didn’t exist. At least it did not exist in the minds of the developers when they designed the system. But now and then, February 29 does indeed exist. Almost every fourth year is a leap year, when February has an extra day attached.
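The rule the developers appear to have missed is short enough to write down. A minimal sketch in Python (my own illustration of the Gregorian calendar rule, not the actual code from the failed system):

```python
def is_leap_year(year: int) -> bool:
    # Gregorian rule: every year divisible by 4 is a leap year,
    # EXCEPT century years, which must also be divisible by 400
    # (so 1900 was not a leap year, but 2000 was).
    return year % 4 == 0 and (year % 100 != 0 or year % 400 == 0)

# February 29 exists only in leap years:
print(is_leap_year(2024))  # True
print(is_leap_year(2023))  # False
print(is_leap_year(1900))  # False -- the "almost" in "almost every fourth year"
```

In practice, of course, the safest route is not to write this rule by hand at all, but to lean on a well-tested date library that already knows which days exist.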
The question is how this kind of bug can enter the development process. The thing is that we can study a large number of single anecdotal accounts without drawing any wider conclusions from the examples. But if we instead look at the failures from a human factors perspective, there are many conclusions we can draw. Most of them are very easy to understand, but oh, how difficult they seem to be to actually act on. In part 2 of this post, I will dive deeper into some of the anecdotal examples and make an attempt to generalise the reasoning for future reference. (To be continued…)