Focus on Recovery and Resilience in 2021

We, the ICS security community and asset owners, need to grow up in 2021. There is near hysteria whenever there is an actual, potential or mythical breach of an ICS, regardless of the impact. And this is in an environment where the number and consequences of successful attacks on ICS are very small compared to other causes of production outages or degradation.

The outgrowth of this hysteria is a stream of recommendations to deploy more and more security controls, without regard to the actual risk reduction achieved for the resources required. 2021 is the year where we learn to live with the reality that cyber attacks on ICS will sometimes succeed, and focus on recovery and resilience after the breach.

An Attacker Only Needs To Succeed Once, A Defender …

This trope is depressing to the defender. It is also almost always used loosely or wrongly in ICS because “success” is undefined. Is success taking control of a server in the ICS? Is success causing a manufacturing plant to have a two-day unplanned outage? Is success causing a six-hour power outage? Is success causing a city to not have water coming out of the tap for four months?

As an asset owner, I would prefer, and aim for, my operations to be running safely and efficiently producing high quality products and services 100% of the time without unplanned outages. Almost all asset owners fall short of this goal every year for reasons unrelated to cyber security.

Asset owners have learned to live with weather disturbances, supply chain issues, labor issues, supporting system issues and failures in the cyber and physical systems that produce the product or service. We need to learn to live with the fact that there will be cyber or cyber/physical attack events that will affect the ICS’s ability to produce products and services. Planning for and expecting a world with zero cyber events is a fool’s errand.

As I wrote to close 2020, for the past two decades the impact of cyber attacks on ICS has been incredibly small compared to the impact due to other causes. In the risk equation, the likelihood has been tiny. Much of this is due to lack of motivation and effort on the attacker side, and this could change at any time, but from a purely statistical standpoint it would be foolhardy for asset owners to put significantly more resources into reducing likelihood.
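To make the arithmetic concrete, here is a minimal sketch of expected annual outage hours by cause, taking risk as frequency times consequence. Every number and cause name in it is a hypothetical placeholder chosen for illustration, not data from any asset owner, and the function name is my own.

```python
# Hypothetical figures only: (events per year, outage hours per event).
causes = {
    "weather": (0.5, 24.0),
    "equipment failure": (2.0, 8.0),
    "supply chain": (1.0, 12.0),
    "cyber attack": (0.01, 96.0),  # rare, but long when it happens
}


def expected_outage_hours(causes, duration_factor=1.0):
    """Expected annual outage hours per cause: frequency * hours per event.

    duration_factor scales the outage length of every event, standing in
    for a recovery improvement that shortens outages whatever the cause.
    """
    return {
        name: frequency * hours * duration_factor
        for name, (frequency, hours) in causes.items()
    }


baseline = expected_outage_hours(causes)
cyber_share = baseline["cyber attack"] / sum(baseline.values())

# Halving recovery time is modeled as halving the duration of every outage.
faster = expected_outage_hours(causes, duration_factor=0.5)

print(f"Cyber share of expected outage hours: {cyber_share:.1%}")
print(f"Total expected hours, baseline: {sum(baseline.values()):.2f}")
print(f"Total expected hours, faster recovery: {sum(faster.values()):.2f}")
```

With these made-up inputs the cyber term is a small slice of the total, so even driving its likelihood to zero would barely move the annual figure, while a recovery improvement is modeled here as shortening outages from every cause. Real frequencies and durations would have to come from your own outage history.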

High Impact, Low Frequency Events

High impact, low frequency (HILF) events are also discussed as long tail or black swan events. Richard Clarke calls this the “it never happened before problem”. Sometimes it truly has never happened before, but often it is an event that hasn’t happened in our lifetime, such as a pandemic prior to 2020.

I’m not going to try to channel or restate Nassim Taleb in this article. You are better off reading Antifragile, his best and most applicable book. The HILF risk is difficult for people to deal with due to our makeup. First, there is a hesitancy to believe it is real in our times. Second, if it is believed to be real, the common reaction is “we can never let that happen”, so the effort is placed on likelihood reduction even though the likelihood is tiny to begin with and will never reach zero.

The Path Forward

I’m not arguing that we shouldn’t improve the Protection component of an ICS security program. Many organizations are still lacking the basics of effective segmentation, application whitelisting endpoint protection, and two-factor authentication for remote access. And we need to get past the insecure-by-design nature of PLCs and other Level 1 devices and related protocols so that other Protection security controls can achieve more than trivial risk reduction. There needs to be a plan to replace that insecure-by-design crap, as we highlighted as far back as S4x12 with Project Basecamp.

For those asset owners that have implemented basic ICS Protection security controls, and those advising them, 2021 should be the year to focus on reducing the consequence if a cyber attack breaches the ICS. If you want a specific goal, reduce the business consequences of an incident where an attack team with domain engineering and automation skills has full access to and control of an engineering workstation. The business consequences are the consequence categories in your risk matrix such as Health & Safety, Financial Impact, Customer Impact, Reputation, etc.

Three broad categories of consequence reduction measures to look at are:

  • Non-hackable, non-cyber measures that reduce the worst case situation
  • Recovery or alternate supply without recovering the ICS
  • Faster recovery of the ICS
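One way to see how the three categories differ is that each acts on a different term of the consequence. The sketch below is a toy model under my own assumptions: the `Scenario` class, its field names, the additive formula and all dollar and hour figures are hypothetical and only illustrate the idea.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Scenario:
    """Hypothetical consequence model for one breach scenario."""
    worst_case_loss: float       # loss from the worst physical outcome
    outage_hours: float          # time until the ICS is recovered
    loss_per_outage_hour: float  # business loss per hour without the ICS

    def consequence(self) -> float:
        # Assumed model: worst-case damage plus lost production over the outage.
        return self.worst_case_loss + self.outage_hours * self.loss_per_outage_hour


# Placeholder baseline: attackers control an engineering workstation.
baseline = Scenario(
    worst_case_loss=5_000_000.0,
    outage_hours=240.0,
    loss_per_outage_hour=20_000.0,
)

measures = {
    # 1. Non-hackable, non-cyber measure: caps the worst-case outcome.
    "non-cyber safeguard": replace(baseline, worst_case_loss=500_000.0),
    # 2. Alternate supply: customers are served while the ICS is still down.
    "alternate supply": replace(baseline, loss_per_outage_hour=5_000.0),
    # 3. Faster recovery: the ICS itself comes back sooner.
    "faster ICS recovery": replace(baseline, outage_hours=48.0),
}

print(f"baseline: {baseline.consequence():,.0f}")
for name, scenario in measures.items():
    print(f"{name}: {scenario.consequence():,.0f}")
```

Plugging in your own risk matrix values for a scenario like this shows which of the three categories buys the most consequence reduction for a given site.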

Safety and Protection Heritage

This resilience and recovery approach should come naturally to asset owners. The safety and protection systems were put in place for consequence reduction when things go wrong. The difference is that, for safety and protection, things did go wrong, very wrong. Significant deaths and damage spurred these systems and measures. The question is whether we will be able to put resilience and recovery in place for potential cyber or cyber/physical events without first suffering these personal, business, regional or national catastrophes.

The radical approach is to actually cause some of these events through chaos engineering / the chaos monkey as we saw in two S4x20 sessions.