[ISN] Tracking the Blackout bug

InfoSec News isn at c4i.org
Fri Apr 9 04:09:05 EDT 2004


http://www.theregister.co.uk/2004/04/08/blackout_bug_report/

By Kevin Poulsen
SecurityFocus
8th April 2004

A number of factors and failings came together to make the August 14th 
northeastern blackout the worst outage in North American history. One 
of them was buried in a massive piece of software compiled from four 
million lines of C code and running on an energy management computer 
in Ohio.

To nobody's surprise, the final report on the blackout released by a 
US-Canadian task force Monday puts most of the blame for the outage on 
Ohio-based FirstEnergy Corp., faulting poor communications, inadequate 
training, and the company's failure to trim back trees encroaching on 
high-voltage power lines. But over a dozen of the task force's 46 
recommendations for preventing future outages across North America are 
focused squarely on cyberspace.

That may have something to do with the timing of the blackout, which 
came three days after the relentless Blaster worm began wreaking havoc 
around the Internet - a coincidence that prompted speculation at the 
time that the worm, or the traffic it was generating in its efforts to 
spread, might have triggered or exacerbated the event. When US and 
Canadian authorities assembled their investigative teams, they 
included a computer security contingent tasked with looking 
specifically at any cybersecurity angle on the outage.

In the end, it turned out that a computer snafu actually played a 
significant role in the cascading blackout - though it had nothing to 
do with viruses or cyber terrorists. A silent failure of the alarm 
function in FirstEnergy's computerized Energy Management System (EMS) 
is listed in the final report as one of the direct causes of a 
blackout that eventually cut off electricity to 50 million people in 
eight states and Canada.

The alarm system failed at the worst possible time: in the early 
afternoon of August 14th, at the critical moment of the blackout's 
earliest events. The glitch kept FirstEnergy's control room operators 
in the dark while three of the company's high voltage lines sagged 
into unkempt trees and "tripped" off. Because the computerized alarm 
failed silently, control room operators didn't know they were relying 
on outdated information; trusting their systems, they even discounted 
phone calls warning them about worsening conditions on their grid, 
according to the blackout report.

"Without a functioning alarm system, the [FirstEnergy] control area 
operators failed to detect the tripping of electrical facilities 
essential to maintain the security of their control area," reads the 
report. "Unaware of the loss of alarms and a limited EMS, they made no 
alternate arrangements to monitor the system."

With the FirstEnergy control room blind to events, operators failed to 
take actions that could have prevented the blackout from cascading out 
of control.

In the aftermath, investigators quickly zeroed in on the Ohio 
line-tripping as a root cause. But the reason for the alarm failure 
remained a mystery. Solving that mystery fell squarely on the 
corporate shoulders of GE Energy, makers of the XA/21 EMS in use at 
FirstEnergy's control center. According to interviews, half a dozen 
workers at GE Energy began working feverishly with the utility and 
with energy consultants from KEMA Inc. to figure out what went wrong.

The XA/21 isn't based on Windows, so it couldn't have been infected by 
Blaster, but the company didn't immediately rule out the possibility 
that the worm somehow played a role in the alarm failure. "In the 
initial stages, nobody really knew what the root cause was," says Mike 
Unum, manager of commercial solutions at GE Energy. "We spent a 
considerable amount of time analyzing that, trying to understand if it 
was a software problem, or if - like some had speculated - something 
different had happened."

Sometimes working late into the night and the early hours of the 
morning, the team pored over the approximately one million lines of 
code that make up the XA/21's Alarm and Event Processing Routine, 
written in the C and C++ programming languages. Eventually they were 
able to reproduce the Ohio alarm crash in GE Energy's Florida 
laboratory, says Unum. "It took us a considerable amount of time to go 
in and reconstruct the events." In the end, they had to slow down the 
system, injecting deliberate delays in the code while feeding alarm 
inputs to the program. About eight weeks after the blackout, the bug 
was unmasked as a particularly subtle incarnation of a common 
programming error called a "race condition," triggered on August 14th 
by a perfect storm of events and alarm conditions on the equipment 
being monitored. The bug had a window of opportunity measured in 
milliseconds.

"There was a couple of processes that were in contention for a common 
data structure, and through a software coding error in one of the 
application processes, they were both able to get write access to a 
data structure at the same time," says Unum. "And that corruption led 
to the alarm event application getting into an infinite loop and 
spinning."
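
The kind of failure Unum describes can be illustrated with a minimal 
Python sketch. It is not GE Energy's code: the AlarmLog class, its 
method names and the 0.2-second delay are assumptions made purely for 
illustration. Two threads update a shared counter, and a deliberate 
delay between the read and the write, similar in spirit to the delays 
the GE team injected, widens the race window so that the lost update 
becomes easy to reproduce. The locked version serializes the writers.

```python
import threading
import time


class AlarmLog:
    """Hypothetical shared structure; not the XA/21's actual design."""

    def __init__(self):
        self.count = 0
        self.lock = threading.Lock()

    def record_unsafe(self, delay):
        current = self.count       # read the shared value
        time.sleep(delay)          # injected delay widens the race window
        self.count = current + 1   # write back a possibly stale value

    def record_safe(self, delay):
        # Holding the lock makes the read-modify-write a single step
        # as far as other writers are concerned.
        with self.lock:
            current = self.count
            time.sleep(delay)
            self.count = current + 1


def run(method_name, writers=2, delay=0.2):
    """Start several writer threads and return the final count."""
    log = AlarmLog()
    threads = [
        threading.Thread(target=getattr(log, method_name), args=(delay,))
        for _ in range(writers)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return log.count
```

Under these assumptions, run("record_unsafe") will normally return 1 
rather than 2, because both writers read the same starting value, 
while run("record_safe") returns 2.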


Testing for Flaws

"This fault was so deeply embedded, it took them weeks of poring 
through millions of lines of code and data to find it," FirstEnergy 
spokesman Ralph DiNicola said in February.

After the alarm function crashed in FirstEnergy's control center, 
unprocessed events began to queue up, and within half an hour the EMS 
server hosting the alarm process folded under the burden, according to 
the blackout report. A backup server kicked in, but it also failed. By 
the time FirstEnergy operators figured out what was going on and 
restarted the necessary systems, hours had passed, and it was too 
late.
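
One common defense against a silent failure of this kind is a watchdog 
that checks, independently of the alarm process, whether that process 
is still making progress. The sketch below is a hypothetical 
illustration in Python, not a description of the XA/21 or of anything 
in the task force report: the class name, the 60-second silence limit 
and the 100-event backlog limit are all assumed values. Time is passed 
in explicitly so the logic is easy to test.

```python
from collections import deque


class AlarmWatchdog:
    """Hypothetical monitor for a stalled alarm processor (illustrative only)."""

    def __init__(self, max_silence_s=60.0, max_backlog=100):
        self.max_silence_s = max_silence_s  # assumed threshold
        self.max_backlog = max_backlog      # assumed threshold
        self.last_heartbeat = None
        self.pending = deque()

    def enqueue(self, event):
        """An incoming event waits here until the alarm process handles it."""
        self.pending.append(event)

    def process_all(self, now):
        """What a healthy alarm process would do: drain the queue, then check in."""
        self.pending.clear()
        self.last_heartbeat = now

    def problems(self, now):
        """Return a list of warnings; an empty list means no problem was seen."""
        found = []
        if self.last_heartbeat is None or now - self.last_heartbeat > self.max_silence_s:
            found.append("alarm processor silent")
        if len(self.pending) > self.max_backlog:
            found.append("event backlog growing")
        return found
```

A check like this only helps if its warnings reach operators through a 
path that does not depend on the failed process itself.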

This week's blackout report recommends that the U.S. and Canadian 
governments require all utilities using the XA/21 to check in with GE 
Energy to ensure "that appropriate actions have been taken to avert 
any recurrence of the malfunction." GE Energy says that's a moot 
point: though the flaw has not manifested itself elsewhere, last fall 
the company gave its customers a patch against the bug, along with 
installation instructions and a utility to repair any alarm log data 
corrupted by the glitch. According to Unum, the company sent the 
package to every XA/21 customer - more than 100 utilities around the 
world - and offered to help install it, "irrespective of their current 
support status," he says.

The company did everything it could, says Unum. "We test exhaustively, 
we test with third parties, and we had in excess of three million 
online operational hours in which nothing had ever exercised that 
bug," he says. "I'm not sure that more testing would have revealed 
that. Unfortunately, that's kind of the nature of software... you may 
never find the problem. I don't think that's unique to control systems 
or any particular vendor software."

Tom Kropp, manager of the enterprise information security program at 
the Electric Power Research Institute, an industry think tank, agrees. 
He says faulty software may always be a part of the electric grid's 
DNA. "Code is so complex, that there are always going to be some 
things that, no matter how hard you test, you're not going to catch," 
he says. "If we see a system that's behaving abnormally well, we 
should probably be suspicious, rather than assuming that it's behaving 
abnormally well."

But Peter Neumann, principal scientist at SRI International and 
moderator of the Risks Digest, says that the root problem is that 
makers of critical systems aren't availing themselves of a large body 
of academic research into how to make software bulletproof.

"We keep having these things happen again and again, and we're not 
learning from our mistakes," says Neumann. "There are many possible 
problems that can cause massive failures, but they require a certain 
discipline in the development of software, and in its operation and 
administration, that we don't seem to find. ... If you go way back to 
the AT&T collapse of 1990, that was a little software flaw that 
propagated across the AT&T network. If you go ten years before that 
you have the ARPAnet collapse.

"Whether it's a race condition, or a bug in a recovery process as in 
the AT&T case, there's this idea that you can build things that need 
to be totally robust without really thinking through the design and 
implementation and all of the things that might go wrong," Neumann 
says.

Despite the absence of cyber terrorism in the blackout's genesis, the 
final report includes 13 recommendations focused squarely on 
protecting critical power-grid systems from intruders. The computer 
security prescriptions came after task force investigators discovered 
that the practices of some of the utility companies involved in the 
blackout created "potential opportunities for cyber system compromise" 
of EMS computers.

"Indications of procedural and technical IT management vulnerabilities 
were observed in some facilities, such as unnecessary software 
services not denied by default, loosely controlled system access and 
perimeter control, poor patch and configuration management, and poor 
system security documentation," reads the report.

Among the recommendations, the task force says cyber security 
standards established by the North American Electric Reliability 
Council, the industry group responsible for keeping electricity 
flowing, should be vigorously enforced. Joe Weiss, a control system 
cyber security consultant at KEMA, and one of the authors of the NERC 
standards, says that's a good start. "The NERC cyber security 
standards are very basic standards," says Weiss. "They provide a 
minimum basis for due diligence."

But so far, it seems software failure has had more of an effect on the 
power grid than computer intrusion. Nevertheless, both Weiss and 
EPRI's Kropp believe that the final report is right to place more 
emphasis on cybersecurity than software reliability. "You don't try to 
look for something that's going to occur very, very, very 
infrequently," says Weiss. "Essentially, a blackout like this was 
something like that. There are other issues that are higher 
probability that need to be addressed."




