This is a fascinating real-world case study and an example of why protocol stack security and reliability are so important. From an NRC report dated April 17, 2007:
On August 19, 2006, operators at Browns Ferry, Unit 3, manually scrammed the unit following a loss of both the 3A and 3B reactor recirculation pumps. …
The licensee determined that the root cause of the event was the malfunction of the VFD controller because of excessive traffic on the plant ICS network. … The licensee could not conclusively establish whether the failure of the PLC caused the VFD controllers to become nonresponsive, or the excessive network traffic, originating from a different source, caused the PLC and the VFD controllers to fail. However, information received from the PLC vendor indicated that the PLC failure was a likely symptom of the excessive network traffic. …
A key point is that all network devices must allocate time and resources to read and interpret each broadcasted data packet, even if the packet is not intended for that particular device. Excessive data packet traffic on the network may cause connected devices to have a delayed response to new commands or even to lockup, thereby, disrupting normal network operations. This excessive network traffic is sometimes called a broadcast (or data) storm. …
The reason the licensee at Browns Ferry investigated whether the failure of one device, the condensate demineralizer PLC, may have been a factor in causing the malfunction of the VFD controllers is that there is documentation of such failures in commercial process control. For instance, a memory malfunction of one device has been shown to cause a data storm by continually transmitting data that disrupts normal network operations resulting in other network devices becoming ‘locked up’ or nonresponsive.
I believe “scram” is the term for an emergency shutdown, so this was serious.
The write-up is unclear as to whether the 10Mbps LAN pipe was full and preventing legitimate communication, or whether the protocol stack in the VFD simply couldn't process the traffic on a 10Mbps LAN. The preponderance of the report points to the VFD stack not processing the traffic.
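To see how that failure mode plays out, here is a toy sketch in Python of a device whose stack has to look at every broadcast frame. The processing rate, queue depth, and traffic rates are made-up numbers for illustration, not figures from the report: once the arrival rate exceeds what the stack can parse, the backlog grows, the receive buffer overflows, and legitimate commands start getting delayed or lost.

```python
# Illustrative sketch only: a toy model of why a broadcast storm can make an
# embedded device unresponsive. The per-packet cost and arrival rates below
# are assumptions for illustration, not figures from the NRC report.

PROCESS_RATE = 2_000        # packets/sec the device's stack can parse (assumed)
QUEUE_LIMIT = 500           # receive buffer depth before packets are dropped (assumed)

def simulate(arrival_rate, seconds=10):
    """Track the receive queue when every broadcast frame must be inspected."""
    backlog = 0
    dropped = 0
    for _ in range(seconds):
        backlog += arrival_rate           # every broadcast packet is queued...
        processed = min(backlog, PROCESS_RATE)
        backlog -= processed              # ...but only PROCESS_RATE can be handled
        if backlog > QUEUE_LIMIT:
            dropped += backlog - QUEUE_LIMIT
            backlog = QUEUE_LIMIT         # overflow: new frames (and real commands) are lost
    return backlog, dropped

for rate in (500, 1_500, 5_000, 15_000):  # packets/sec of broadcast traffic
    backlog, dropped = simulate(rate)
    print(f"{rate:>6} pkt/s broadcast -> backlog {backlog:>5}, dropped {dropped}")
```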
It is truly pathetic if an Ethernet-connected device can't handle 10Mbps of traffic. (It may be equally pathetic that they have not upgraded to 100Mbps switches, although that would only have exacerbated this problem; then again, if network utilization is low there is no need to upgrade, which shows how little traffic travels over the average control system network.) Any IT network device would be considered a complete failure if it had such unreliable and insecure performance. Shouldn't we have higher standards of reliability in control systems for the critical infrastructure?
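For perspective on what a saturated 10Mbps segment actually means to a protocol stack, the standard back-of-the-envelope numbers look like this (a worst case with minimum-size frames, not a figure from the report):

```python
# Back-of-the-envelope check of what "10Mbps of traffic" means in packets per
# second. Standard Ethernet figures: 64-byte minimum frame plus 8-byte
# preamble and 12-byte interframe gap on the wire.

LINK_BPS = 10_000_000                 # 10 Mbps link
MIN_FRAME_BYTES = 64 + 8 + 12         # frame + preamble + interframe gap
bits_per_frame = MIN_FRAME_BYTES * 8

max_frames_per_sec = LINK_BPS // bits_per_frame
print(f"Worst case at 10 Mbps: ~{max_frames_per_sec:,} minimum-size frames/sec")
# -> roughly 14,880 frames/sec that a device's stack may have to inspect
```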
Unfortunately, this is all too common for controllers and other Ethernet-enabled devices in control system networks. Asset owners have enough security issues to address without having to worry about whether their controllers can process packets correctly. This is why we are such strong proponents of Achilles Certification. The storm test cases in the Achilles Certification suite would have identified the failures described in the NRC report. Vendors whose protocol stacks can continue to operate correctly during the rigorous Achilles testing should be commended.
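For the curious, a storm-style robustness test boils down to blasting broadcast traffic at a device while periodically checking that it still answers. The sketch below shows the idea only; it is not the Achilles test suite, and the device address, port, payload, and rates are assumptions. Something like this belongs on lab equipment, never on a production network.

```python
# A minimal sketch of a storm-style robustness test, in the spirit of the
# storm test cases described above -- not the actual Achilles test suite.
# The device address, health-check port, and traffic volume are assumptions.

import socket
import time

DEVICE_IP = "192.0.2.10"                  # assumed lab device under test
HEALTH_PORT = 502                         # assumed TCP port to poll (e.g. Modbus/TCP)
BROADCAST = ("255.255.255.255", 40000)    # arbitrary UDP broadcast target

def device_responds(timeout=2.0):
    """Health check: can we still open a TCP connection to the device?"""
    try:
        with socket.create_connection((DEVICE_IP, HEALTH_PORT), timeout=timeout):
            return True
    except OSError:
        return False

def broadcast_storm(duration=10.0, payload=b"\x00" * 512):
    """Blast UDP broadcast frames for `duration` seconds, checking health as we go."""
    tx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    tx.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
    deadline = time.time() + duration
    sent = 0
    while time.time() < deadline:
        tx.sendto(payload, BROADCAST)
        sent += 1
        if sent % 10_000 == 0 and not device_responds():
            print(f"Device stopped responding after ~{sent} broadcast packets")
            return
    print(f"Device survived {sent} broadcast packets")

if device_responds():
    broadcast_storm()
else:
    print("Device not reachable before the test started")
```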
Congratulations to Robert Lemos of Security Focus for identifying this story. One small note on his article: my quote, “if you were to test any control systems that have any more than three or four different network-connected devices, they could be knocked over very easily,” refers to three or four different types of controllers or devices found in the field. Conservatively, we are seeing protocol stack problems in 25% to 33% of the controllers and other Ethernet-enabled devices in our assessments. Many of our clients know this going in and tell us not to bother scanning field site equipment because they know it causes devices to crash.
This is why you need to be very careful about how you do assessments of control systems, and it accounts for all the horror stories of an IT professional running a Nessus scan on control system subnets and causing outages. See our Scanning Control Systems whitepaper for our methodology.
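For illustration only, this is the kind of low-and-slow probing we mean, as opposed to a full-speed Nessus scan: one TCP connect at a time, with long pauses between probes. The subnet and port list are assumptions, and this sketch is not the methodology from the whitepaper.

```python
# One conservative alternative to pointing a full Nessus scan at a control
# system subnet: a single, rate-limited TCP connect per host and port with
# long pauses in between. Illustrative sketch with assumed addresses and
# ports, not the whitepaper's methodology.

import socket
import time

SUBNET = "192.0.2."                 # assumed control system subnet
PORTS = [102, 502, 44818]           # assumed ICS ports (S7, Modbus/TCP, EtherNet/IP)
DELAY_SECONDS = 5                   # long pause so devices are never flooded

for host in range(1, 255):
    ip = f"{SUBNET}{host}"
    for port in PORTS:
        try:
            with socket.create_connection((ip, port), timeout=1.0):
                print(f"{ip}:{port} open")
        except OSError:
            pass                    # closed, filtered, or host not present
        time.sleep(DELAY_SECONDS)   # one probe at a time, slowly
```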