Writings on civilization's systems and their limits

Thursday, 7 April 2011

Japan observations, and some ideas on complex systems

I've been quiet for a little bit - a bit of pressure at work, a weekend away and some wheelbuilding for a cycling holiday I have coming up later this month have been taking up a lot of my spare time! I also spent some time working on a post on Victorian water supply security and the desalination plant, but I wasn't particularly happy with the result, so I set that aside, and had a crack at some contemporary affairs commentary.

The effects of the Japanese earthquake on efficient systems

Over the past few weeks, the major item of news has been the Great East Japan Earthquake on 11 March, and the event's aftermath. The most serious effects appear to have been the tragic loss of life resulting from the earthquake and tsunami, the severe damage or total destruction of infrastructure, the disruption of the lives of the survivors, and the ongoing struggle to bring the severely damaged Fukushima Daiichi nuclear power plant under control. These four subjects have been attracting most of the media attention.

There have also been a small number of articles discussing supply chain problems resulting from reductions in Japanese manufacturing capacity, and these real world examples have illustrated some of my ideas from my earlier posts in a startling fashion. It is one thing to start with pre-existing data and develop a theory to explain that data, which is always open to charges of theory tweaking to fit the data - but quite another thing when new, independent data emerges that backs up the theory.

To recap, my theory was that efficient systems are brittle, and prone to failure when the environment changes. In the case of Japanese manufacturing, significant damage occurred to manufacturing capacity - which is hardly surprising, given the scale of the devastation. What was not so obvious was the rapidity with which this capacity destruction spread to affect manufacturers in Europe, North America and elsewhere.

The first article about the supply chain problems resulting from the Japanese events is this one from the Age.

The opening quote speaks for itself: "The disaster in Japan has exposed a problem with how multinational companies do business: The system they use to keep supplies rolling in is lean and cost-effective - yet vulnerable to sudden shocks." The article continues on in a similar vein, but this first line sums it up perfectly.

A large and diverse number of companies have been affected. According to this article from the New York Times, the production of General Motors' Chevy Volt in the US may be affected, Nissan's engine production plant in the affected area is out of action, Texas Instruments (a US based company) might not resume full capacity production at their Japanese plant until September, Toshiba has closed some NAND flash production lines, SanDisk had concerns about transportation and power supply reliability, while Sony, Canon, Pioneer (home entertainment) and Kirin (beer) may also be affected as they ship their products from ports which have been severely damaged or destroyed by the tsunami.

This article from Der Spiegel discusses the effect of the Japanese devastation on two competing fan manufacturers in Germany, EBM Papst and Ziehl-Abegg. Although competitors, they both obtain essential chips from Toshiba in Japan, and the Toshiba factory that manufactures their chips has been damaged. Consequently they both expected that they would need to shut down their production lines, perhaps for one to two weeks, if delivery of the chips was delayed. The article does not speculate on the consequences if they needed to find an alternative supplier and obtain new stock from them prior to restarting production. In addition, the Japanese manufacturer of the transmission for Porsche's Cayenne SUV is experiencing disruptions to production, a chip manufactured by Toshiba is used in Apple's iPad, and the German carmaker Opel has announced the cancellation of some manufacturing plant production shifts due to a shortage of components from Japan.

The most illuminating paragraph in the whole article is probably this: "The assembly lines at EBM Papst and Ziehl-Abegg now depend on a handful of electronic components from Japan, often costing little more than a few cents. But the transformers, resistors and memory chips are vital components in products ranging from fans for laptops and car engines to the air-conditioning systems in New York skyscrapers and hotels in Mecca."

What this demonstrates is that these highly specialised parts are manufactured incredibly efficiently and at a low per-component cost, thanks to the economies of scale that come from high volume production of a single component, which is then sold to huge numbers of customers worldwide. Yet the specialisation and complexity of that part, combined with the lack of other manufacturers making an equivalent part, means that any interruption to supply rapidly propagates around the globe, with no alternative suppliers immediately available. There has been recent media coverage of limited stock of iPad 2s after their launch - I suspect this was due to the damage to Toshiba's chip manufacturing plant.

This illustrates that in modern manufacturing, manufacturers generally buy components from a single supplier in high volumes, which may also be shipped long distances due to the low cost of air freight. Further, manufacturers keep little stock on hand as a buffer against supply disruptions, in order to maximise financial efficiency by reducing warehousing costs as much as possible.
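To make the buffer stock point concrete, here is a minimal Python sketch of an assembly line fed by a single supplier. The function name, the part counts and the outage length are all invented for illustration and are not taken from any of the articles; the sketch simply assumes that the supplier normally delivers exactly one day's worth of parts each day, and delivers nothing during the outage.

```python
def idle_days(stock, daily_use, outage_start, outage_days, horizon):
    """Count the days an assembly line stands idle over `horizon` days.

    Assumption (hypothetical): a single supplier delivers exactly
    `daily_use` parts per day, except during the outage, when it
    delivers nothing. `stock` is the buffer held in the warehouse.
    """
    idle = 0
    for day in range(horizon):
        in_outage = outage_start <= day < outage_start + outage_days
        if not in_outage:
            stock += daily_use      # normal just-in-time delivery arrives
        if stock >= daily_use:
            stock -= daily_use      # enough parts on hand: the line runs
        else:
            idle += 1               # not enough parts: the line stops
    return idle


# Hypothetical scenario: 100 parts used per day, 10 day supplier outage.
no_buffer = idle_days(0, 100, outage_start=5, outage_days=10, horizon=30)
three_day_buffer = idle_days(300, 100, outage_start=5, outage_days=10, horizon=30)
ten_day_buffer = idle_days(1000, 100, outage_start=5, outage_days=10, horizon=30)
```

In this toy scenario the line with no buffer loses all 10 days, three days of stock only reduces the loss to 7 days, and the line rides out the outage untouched only when the warehouse holds as many days of stock as the outage lasts. That stock is exactly the warehousing cost that lean manufacturing is designed to remove.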

A computer network example

Another example of an efficient, brittle system is the network discussed in this 2008 article from the Oil Drum. The entire article is well worth reading. I read it at the time, and then forgot about it for a few years - but during a search for some unrelated material a few weeks ago, I came across it again. On rereading, I was startled by just how well it illustrated my ideas on brittleness - with a disturbing twist. The author (aeldric), in discussing a failure of a computer network due to a faulty software driver on a single machine, focuses on the concept of the "frequency" of a system, and couches his (her?) discussion in slightly different terms - but the ideas expressed are directly analogous to mine on stress transmission through systems, and overall brittleness of systems.

The disturbing twist in aeldric's case study is that computer networks, and the Internet, were originally designed to be robust - so that these networks could continue to function, even in the event of failure of any given component. What the case study shows is that financial imperatives can take over in network management, with the network being made more efficient in order to reduce financial cost. This decrease in cost comes at the expense of system redundancy in specific components, whose failure can then quickly cause overall system failure. The message to be drawn from this is to avoid assuming that our internet-based services are robust - they may not be, and if they fail, they can fail almost instantly.
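The trade-off between redundancy and cost can be sketched in a few lines of Python. The two toy networks below are invented for illustration and are not the network from aeldric's article: a ring, where every node has two routes to the others, and a cheaper hub layout, where everything is connected through one central node. The function `single_points_of_failure` is a hypothetical helper that removes each node in turn and checks whether the remaining nodes can still reach each other.

```python
from collections import defaultdict, deque


def is_connected(nodes, links):
    """Return True if every node in `nodes` can reach every other node."""
    nodes = set(nodes)
    if not nodes:
        return True
    neighbours = defaultdict(set)
    for a, b in links:
        if a in nodes and b in nodes:   # ignore links to removed nodes
            neighbours[a].add(b)
            neighbours[b].add(a)
    start = next(iter(nodes))
    seen = {start}
    queue = deque([start])
    while queue:                        # breadth-first search from `start`
        current = queue.popleft()
        for nxt in neighbours[current]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen == nodes


def single_points_of_failure(links):
    """List the nodes whose failure alone disconnects the rest of the network."""
    nodes = {n for link in links for n in link}
    return sorted(n for n in nodes if not is_connected(nodes - {n}, links))


# Hypothetical networks: a redundant ring and a cheaper hub layout.
ring = [("A", "B"), ("B", "C"), ("C", "D"), ("D", "A")]
hub = [("Hub", "A"), ("Hub", "B"), ("Hub", "C")]

ring_weak_points = single_points_of_failure(ring)
hub_weak_points = single_points_of_failure(hub)
```

The ring survives the loss of any one node, so its list of weak points is empty, while the hub layout has exactly one: the hub itself. The hub layout needs fewer links per node connected, which is the kind of saving that looks attractive on a budget spreadsheet.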

Complex Systems

Since my first few posts, I've been thinking more about the issue of complex systems, triggered by a few problems in the banking system earlier this year. The defining characteristic of these problems was complexity - and my systems theory (so far) says little about complexity! So, some extension is required. There is a correlation between efficiency and complexity - efficient systems will often be complex. So, what are the implications of complex systems? Following are some of the ideas I've come up with.

I have a mental picture of what "complex" means, but I need to define it if I'm going to discuss it meaningfully! So, I will define complex systems as tending to be large, rigid, hard to understand, prone to incorrect implementation, efficient, and brittle.

A "large" system may have significant geographical scope, large financial cost, involve a large quantity of components or infrastructure, employ a large number of people, or interact with a large number of other parties or things. The trend of complex systems to become large might be a direct consequence of their efficiency - if they are competing against other, less efficient, systems, then they have advantages which are likely to result in users of the less efficient system transferring to the more efficient system. This then becomes a mechanism for system growth.

By "rigid", I mean that the system is designed to work in a specific way - for instance, it may only take a particular type of input, it may only provide a fixed set of features, or it may be dependent on the continued validity of a design assumption. If the input form changes, a new feature is desired, or the design assumptions are rendered invalid, then the system needs to be modified so that it will continue to function.

"Hard to understand" is a direct consequence of complexity and self explanatory, while by "prone to incorrect implementation", I mean that it is easy to make an error in system design or construction so that under some scenarios it will not generate the correct response (In software design, these are called "bugs"!) Efficiency and brittleness have already been discussed in earlier posts, so I will not rehash them here.

These properties of complex systems lead to several consequences.

One is that large systems are often expensive and time consuming to create - so they cannot be easily replaced if they fail.

The rigidity and expense of complex systems combined with a desire for new features will often trigger the need or desire to modify the system to incorporate the new feature, as opposed to creating a replacement from scratch. The person or persons attempting to modify the system then need to develop a full understanding of the components of the system that they are intending to modify, so as to add in the new feature without breaking existing features. For on-line systems such as modern banking, an additional requirement is the need to maintain correct system operation while the changes are being introduced.

Because complex systems have the property of being hard to understand, an insufficiently carefully planned modification can break the system. For my day job, I write software for hearing aids - I often need to modify code in order to introduce new sound processing algorithms to a device build. Hearing aid software is highly complex, since many different algorithms need to be run on the audio input samples, while simultaneously maintaining low delay sound processing, with no breaks in the audio output. One of the guiding principles I follow when modifying code is that I need to understand exactly what a piece of software does before I modify it - otherwise I might break some undetected functionality of the code.

If changes to a complex on-line system are not done correctly, then not only can the system fail, but an additional problem - of needing to somehow restore the system to a valid state - emerges. A major characteristic of the NAB batch processing file failure (discussed in my first post) was that as a result of the corrupted batch processing file, the bank ended up in a state where their database was processing new transactions correctly, but existing bank balances were wrong - that is, the system state was incorrect. This appears to have been the major cause of the ongoing problems - the need to restore the customer bank balances in their database to a correct state, by means of manual checking and processing, although the system was by then processing new transactions correctly.
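The distinction between correct processing and correct state can be shown with a toy ledger in Python. Everything here is hypothetical: the amounts are invented, the "corruption" is simply one debit applied twice, and none of it describes NAB's actual systems. The point is only that an error introduced into the state is carried forward unchanged by every correctly processed transaction that follows.

```python
def final_balance(opening, *batches):
    """Apply each batch of signed amounts, in order, to an opening balance."""
    balance = opening
    for batch in batches:
        for amount in batch:
            balance += amount
    return balance


# All figures below are invented for illustration.
opening = 1000
overnight_batch = [-200, 50]
corrupted_batch = [-200, -200, 50]   # hypothetical corruption: one debit applied twice
new_transactions = [-30, 500]        # processed correctly after the failure

expected = final_balance(opening, overnight_batch, new_transactions)
actual = final_balance(opening, corrupted_batch, new_transactions)

# The error in the balance immediately after the bad batch...
error_after_batch = (final_balance(opening, corrupted_batch)
                     - final_balance(opening, overnight_batch))
# ...is exactly the error that remains after the new transactions.
error_after_new = actual - expected
```

In this sketch the balance is out by the same 200 before and after the new transactions are applied. Fixing the processing does nothing to fix the state; the only remedy is to go back to a known good balance and reapply the correct transactions, which is presumably what the manual checking amounted to.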

The following two articles are about the ASX failure on 1 March this year. One of them attributes the failure to a problem with a Nasdaq OMX system, which was introduced in November 2010.

ASX trading resumes after tech woes
"Trading on the Australian Securities Exchange resumed at the normal time this morning, but the problem behind the disruption to trade on Monday remains unresolved."

Computer breakdown paralyses trading on ASX
"ABOUT $1.5 billion in turnover was reportedly wiped from the Australian Securities Exchange yesterday after a computer problem forced the sharemarket to close abruptly at 2.48pm. A problem with the new trading system left the exchange with 149,513 fewer trades than the 2010 daily average. It is reported to have about $1 billion worth of trades an hour."

It appears that the problem may have been due to a bug in the implementation of the Nasdaq OMX system, although if it was caused by a hardware failure, then perhaps the problem could be attributed to the brittleness property, which is a consequence of an efficient system.

Going by my stated attributes of complex systems, Nasdaq OMX is clearly a large system, it is hard to understand since the cause of the failure was not quickly determined, it may have been prone to incorrect implementation (if the failure was due to a bug rather than a hardware failure), it was likely to be highly efficient as it was replacing an existing system, and it was brittle as it failed quickly. This brittleness is another possible indication of efficiency (efficiency implies brittleness, but brittleness may not always imply efficiency).

The following CBA failure, allowing people to overdraw cash from their accounts at ATMs, was startling. It is highly disturbing that CBA chose to allow their ATMs to go into stand-in mode, rather than shutting down the network until the problem could be rectified. According to the articles, CBA understood the likely consequence of this action - that ATM users would be able to withdraw more cash than was available in their accounts. In choosing to go to stand-in mode, the bank then turned what should have been an in-house problem into one that triggered police involvement, which was a significant waste of public resources. I suggest that this was also a means for the bank to shift the costs of their internal problem onto external parties - something that any taxpayer should be strenuously objecting to!
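The difference between the two options can be sketched as a simple authorisation rule in Python. This is a guess at the general shape of a stand-in mode, not a description of CBA's systems: the function name and the fixed per-withdrawal cap `STAND_IN_LIMIT` are assumptions made purely for illustration.

```python
from typing import Optional

STAND_IN_LIMIT = 500  # hypothetical per-withdrawal cap, in dollars


def authorise(amount: int, balance: Optional[int], stand_in: bool) -> bool:
    """Decide whether an ATM should dispense `amount`.

    `balance` is None when the core banking system cannot be reached.
    """
    if balance is not None:
        return amount <= balance         # normal mode: check the real balance
    if stand_in:
        return amount <= STAND_IN_LIMIT  # stand-in mode: no balance check is possible
    return False                         # network shut down: refuse all withdrawals


# A customer with only $50 in the account asks for $200.
normal_mode = authorise(200, balance=50, stand_in=False)
stand_in_mode = authorise(200, balance=None, stand_in=True)
shut_down = authorise(200, balance=None, stand_in=False)
```

In normal mode and with the network shut down, the $200 withdrawal is refused. In stand-in mode it is approved, because the ATM has no balance to check against. Under these assumptions the overdrawn accounts are not a malfunction of stand-in mode but its expected behaviour.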

CBA's Netbank hit by tech gremlins
"Update: Police have issued a warning after reports that more than 40 Commonwealth Bank ATMs have been dispensing large amounts of cash. Police are unsure at this stage what has caused the fault and are liaising with the Commonwealth Bank, which has been hit all day by a technical glitch that has disrupted its online banking, ATMs and EFTPOS services."

Faulty ATMs spitting cash after technical glitch
"The Commonwealth Bank took a calculated risk and placed its ATMs into "stand-in" mode yesterday knowing that it would mean customers could overdraw their accounts. The bank confirmed it encountered an issue "when conducting routine database maintenance" but rather than shutting down its network of ATMs while the problem was being fixed, it placed them into stand-in mode to allow people to continue to have access to funds."

According to this Age article, a security consultant who had previously worked for CBA stated that the problems related to CBA's "core banking modernisation" project. The article helpfully provides a link to the CBA media release, titled "Commonwealth Bank Core Banking Modernisation". According to the media release, the purpose of the project is to replace internal legacy banking systems with a new, more efficient banking system. This is a very high risk project - not only is the CBA attempting to replace an entire system with a new system, but they are attempting to do so while maintaining system functionality! It would be interesting to interview some of the technical staff working on this porting project.

Blog observations

I've chatted to a few people who have read my early posts. My fear was that I was trying to write about subjects too technical for a well educated (but not technically trained) audience, and that I wasn't giving enough examples - but that doesn't appear to be the case - thanks EJ! The blog viewing stats have been surprising - there has been an ongoing level of views, and a few new followers, despite few recent posts. I'm drawing the conclusion from this that, as long as I'm not just rehashing recent news (which can date quickly) then good content remains relevant and people are still interested, even if it's a week or two old. So I think my decision to can the water/desalination article (despite quite a few hours of work) was the right one, and I'll keep the emphasis on turning out interesting ideas which are well written up, rather than going for volume.

I've found blogging quite challenging - sometimes the ideas just flow and something comes together, sometimes it takes a few goes and a few fresh starts before a set of ideas is represented clearly and in a way which makes logical sense.

The next post

For my next post, I'm intending to investigate a gedankenexperiment - a thought experiment. The proposed topic of the gedankenexperiment is this: If you borrow money from the bank to buy a house, this parcel of money is then paid to the seller of the house. The seller may then turn around and use the same parcel of money in a similar fashion - to buy a different house elsewhere, and thus pass the parcel of money on to the seller of this house. The recursivity of the situation is apparent - but it poses the question, what is the ultimate fate of that parcel of money? Does it travel down an endless chain of house transactions, or does it dissipate out in some other way? Have a think about it while I compose the next post.

Thanks to everyone for reading - I'm really enjoying this!