I recently read a book called ‘Meltdown’ by Chris Clearfield and András Tilcsik (Link: Meltdown).
The book provides a framework to reason about complex systems that can be found all around us (from the cars we drive to processes in a factory). The word ‘system’ is used in the generic sense where it means a set of components interacting with each other. Each component expects some sort of input and provides some sort of output.
The decomposition of a system into components can be done at different levels of detail. The closer we get to the ‘real’ representation, the more complex the interactions between components (or sub-systems) become. Imagine the most detailed representation of a computer chip, which incorporates within it a quantum model of the transistor!
Let us look at some important points to consider when trying to understand a complex system. These allow us to classify and select appropriate lines of attack to unravel the complexity.
1. Complexity of Interaction
Complexity arises when we have non-linear interactions between systems. Linear interactions are always easier to reason about and therefore to fix in case of issues. With non-linear interactions (e.g. feedback loops) it becomes difficult to predict the effect of changing inputs on the output. Feedback loops, if unbounded (i.e. not convergent), can lead to catastrophic system failures (e.g. incorrect sensor data leading to a wrong automated response, which worsens the situation).
Solution: Replace feedback loops with linear interactions where possible. Where a loop cannot be broken, add circuit breakers or a delay in the reaction.
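To make the circuit-breaker idea concrete, here is a minimal sketch (my own illustration, not from the book): an automated response is allowed only a bounded number of consecutive failures before the loop is forcibly broken and escalation to a human is required.

```python
class CircuitBreaker:
    """Breaks a feedback loop after too many consecutive failures."""

    def __init__(self, max_failures=3):
        self.max_failures = max_failures
        self.failures = 0
        self.open = False  # an open breaker means the loop is broken

    def call(self, action):
        if self.open:
            # Stop the automated reaction entirely; a human must intervene.
            raise RuntimeError("circuit open: automated response suspended")
        try:
            result = action()
            self.failures = 0  # success resets the failure count
            return result
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.open = True  # bound reached: break the loop
            raise
```

The key design choice is that the breaker fails ‘safe’ by doing nothing, rather than letting a possibly wrong automated response keep amplifying itself.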
2. Tight Coupling
When two or more systems are tightly coupled, it is quite easy to bring them all down by taking down just one. Introducing slack into the interaction means each system must be able to deal with imprecise, inaccurate and missing inputs while preserving some sort of functional state.
Solution: Clearly state inputs, outputs and acceptable ranges. Provide internal checks so that errors do not cross component boundaries. Provide a clear indication of the health of a component.
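A small sketch of what this could look like in code (names and ranges are illustrative, not from the book): the component declares its acceptable input range, keeps errors inside its own boundary by degrading rather than emitting bad output, and exposes its health.

```python
class TemperatureSensorComponent:
    """A component that validates inputs and reports its own health."""

    VALID_RANGE = (-40.0, 125.0)  # declared acceptable input range, Celsius

    def __init__(self):
        self.healthy = True

    def process(self, reading):
        low, high = self.VALID_RANGE
        if reading is None or not (low <= reading <= high):
            # Keep the error inside the component boundary: mark ourselves
            # degraded instead of passing a bad value downstream.
            self.healthy = False
            return None
        return round(reading, 1)

    def health(self):
        # A clear, externally visible indication of component health.
        return "OK" if self.healthy else "DEGRADED"
```

Returning `None` plus a `DEGRADED` health status gives downstream components the slack to cope with a missing input, instead of silently consuming a wrong one.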
Any system (or group of systems) requires monitoring to support control decisions. For example, when driving a car we monitor speed, fuel and the dashboard (for warning lights). Any system made up of multiple components/sub-systems should ideally have a monitoring feed from each of the components. But often we cannot get a feed directly from a component, or doing so would lead to information overload, so we rely on observer components (i.e. sensors) to help us. This adds a layer of obfuscation around the component. If the sensor fails, the operator/controller has no idea what is going on, or worse, has the wrong idea without knowing it and therefore takes the wrong steps. This is a common theme with complex systems such as nuclear reactors, aeroplanes and stock markets, where indirect measurements are all that is available.
The other issue is that when a system is made up of components from different providers, each component may not have a standard way of reporting its status. For example, in modern ‘cloud enabled’ software we have no way of knowing if a cloud component that is part of our system has failed and restarted. This may or may not impact us, depending on how tightly our components are coupled to the cloud component and whether we need to respond to restarts (e.g. by flushing cached information).
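One way to cope with both problems is sketched below (the heartbeat protocol and names are my own assumptions, not a real API): the watcher flags a dependency whose feed has gone stale, and detects silent restarts by tracking the instance id the dependency reports, so we know when to react (e.g. flush caches).

```python
import time

class DependencyWatcher:
    """Tracks an external component's heartbeats to spot stale feeds
    and silent restarts (illustrative protocol)."""

    def __init__(self, stale_after=5.0):
        self.stale_after = stale_after  # seconds without news before stale
        self.last_instance = None
        self.last_seen = None

    def heartbeat(self, instance_id, now=None):
        """Call on every status message. Returns 'restarted' if the
        component came back as a new instance, else 'ok'."""
        now = time.monotonic() if now is None else now
        restarted = (self.last_instance is not None
                     and instance_id != self.last_instance)
        self.last_instance = instance_id
        self.last_seen = now
        return "restarted" if restarted else "ok"

    def is_stale(self, now=None):
        """True if no heartbeat recently: the operator may have no idea,
        or the wrong idea, of the component's state."""
        now = time.monotonic() if now is None else now
        return self.last_seen is None or (now - self.last_seen) > self.stale_after
```

Distinguishing ‘stale’ from ‘restarted’ matters: the first means we are flying blind, the second means we may need to flush state that assumed the old instance.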
While it is difficult to map any system approaching day-to-day complexity in order to figure out where it can fail or degrade, we can use techniques such as Anomalising to make sure failures are recorded and action is taken to prevent future occurrences. The process is straightforward:
- Gather data – collect information about the failure from the different monitoring feeds (this is why monitoring is critical)
- Fix raised issues – replace failing/failed components, change processes, re-train operators
- Address Root Cause – monitor replaced components, new procedures while making sure root cause is identified (e.g. was the component at fault or is it a deeper design issue? Are we just treating the symptom and not the cause?)
- Ensure solution is publicised so that it becomes part of ‘best practice’
- Audit – follow up to measure the effectiveness of the solution
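The steps above can be captured as a simple record that refuses to count an incident as closed until every step has been done. This is a minimal sketch with field names of my own choosing, not the book's terminology.

```python
from dataclasses import dataclass, field

@dataclass
class Incident:
    """Tracks one failure through gather/fix/root-cause/publicise/audit."""
    description: str
    data: list = field(default_factory=list)   # monitoring feeds gathered
    fixes: list = field(default_factory=list)  # components replaced, processes changed
    root_cause: str = ""                       # cause, not just the symptom
    publicised: bool = False                   # shared as best practice
    audited: bool = False                      # effectiveness measured

    def is_closed(self):
        # Closed only when every step of the process has been completed.
        return bool(self.data and self.fixes and self.root_cause
                    and self.publicised and self.audited)
```

The point of the `is_closed` check is organisational, not technical: it stops the common failure mode of fixing the symptom and skipping the root-cause, publicise and audit steps.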
Most interesting systems involve a human as an:
- operator (e.g. pilot)
- controller (e.g. traffic controller)
- supervisor (e.g. in a factory)
- beneficiary (e.g. patient wearing a medical device)
- dependent (e.g. passenger in a car)
So the big question is: how can we humans improve how we work with complex systems? Or, the other way around: how can complex systems be improved to allow humans to work with them more effectively?
There is a deceptively simple process that can be used to peel back some of the complexity. We can describe this as a ‘check-plan-and-proceed’ mechanism.
- Review how the interaction with a given system went in the previous time frame (week/month/quarter) [Check]
- Figure out what can be improved and create a list of changes to try in the next time frame [Plan]
- Try those changes in the next time frame and observe the results [Proceed]
This allows the human component of a complex system to learn in bite-sized chunks.
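The review cycle above can be sketched as a simple loop (an illustration of the idea, not an algorithm from the book): each period we check the most recent outcomes, plan a small list of changes, and proceed by applying them in bite-sized chunks.

```python
def check_plan_proceed(history, propose_changes, apply_change, periods=4):
    """Run the check-plan-and-proceed cycle for a number of periods.

    history         -- list of past outcomes (one entry per period)
    propose_changes -- given the most recent outcome, returns changes to try
    apply_change    -- applies one change and returns its outcome
    """
    log = list(history)
    for _ in range(periods):
        recent = log[-1] if log else None               # Check
        changes = propose_changes(recent)               # Plan
        outcome = [apply_change(c) for c in changes]    # Proceed
        log.append(outcome)                             # feeds the next Check
    return log
```

Keeping each period's change list short is the point: the human learns from a handful of deliberate changes rather than drowning in the system's full complexity.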
This also helps in dealing with dynamic systems (such as stock markets), where (as per the book) the task is the weather-prediction equivalent of ‘predicting a tornado rather than simply rainfall’. When the check-plan-and-proceed mechanism is abandoned, we get systems running amok towards a ‘meltdown’ – be it a nuclear meltdown, a stock market crash, a plane crash or the collapse of a company.