[ISN] Tracking the Blackout bug
InfoSec News
isn at c4i.org
Fri Apr 9 04:09:05 EDT 2004
http://www.theregister.co.uk/2004/04/08/blackout_bug_report/
By Kevin Poulsen
SecurityFocus
8th April 2004
A number of factors and failings came together to make the August 14th
northeastern blackout the worst outage in North American history. One
of them was buried in a massive piece of software compiled from four
million lines of C code and running on an energy management computer
in Ohio.
To nobody's surprise, the final report on the blackout released by a
US-Canadian task force Monday puts most of the blame for the outage on
Ohio-based FirstEnergy Corp., faulting poor communications, inadequate
training, and the company's failure to trim back trees encroaching on
high-voltage power lines. But over a dozen of the task force's 46
recommendations for preventing future outages across North America are
focused squarely on cyberspace.
That may have something to do with the timing of the blackout, which
came three days after the relentless Blaster worm began wreaking havoc
around the Internet - a coincidence that prompted speculation at the
time that the worm, or the traffic it was generating in its efforts to
spread, might have triggered or exacerbated the event. When US and
Canadian authorities assembled their investigative teams, they
included a computer security contingent tasked with looking
specifically at any cybersecurity angle on the outage.
In the end, it turned out that a computer snafu actually played a
significant role in the cascading blackout - though it had nothing to
do with viruses or cyber terrorists. A silent failure of the alarm
function in FirstEnergy's computerized Energy Management System (EMS)
is listed in the final report as one of the direct causes of a
blackout that eventually cut off electricity to 50 million people in
eight states and Canada.
The alarm system failed at the worst possible time: in the early
afternoon of August 14th, at the critical moment of the blackout's
earliest events. The glitch kept FirstEnergy's control room operators
in the dark while three of the company's high voltage lines sagged
into unkempt trees and "tripped" off. Because the computerized alarm
failed silently, control room operators didn't know they were relying
on outdated information; trusting their systems, they even discounted
phone calls warning them about worsening conditions on their grid,
according to the blackout report.
"Without a functioning alarm system, the [FirstEnergy] control area
operators failed to detect the tripping of electrical facilities
essential to maintain the security of their control area," reads the
report. "Unaware of the loss of alarms and a limited EMS, they made no
alternate arrangements to monitor the system."
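One common guard against this kind of silent failure is a heartbeat check: the monitored process reports in periodically, and a separate watcher flags the data as stale when the reports stop. The sketch below is only an illustration of that general idea. The class name Heartbeat, its methods, and the five-second threshold are hypothetical, and nothing in the report says the XA/21 or FirstEnergy used or lacked this particular mechanism.

```python
import time


class Heartbeat:
    """Tracks when a monitored process last reported in (hypothetical helper)."""

    def __init__(self, max_silence_s, clock=time.monotonic):
        self.max_silence_s = max_silence_s  # assumed threshold, in seconds
        self._clock = clock                 # injectable so the check can be tested
        self._last_beat = clock()

    def beat(self):
        """Called by the monitored process each time it completes a cycle."""
        self._last_beat = self._clock()

    def is_stale(self):
        """True when the process has been silent longer than the threshold."""
        return self._clock() - self._last_beat > self.max_silence_s


if __name__ == "__main__":
    alarm_heartbeat = Heartbeat(max_silence_s=5.0)
    alarm_heartbeat.beat()
    print("alarm data stale?", alarm_heartbeat.is_stale())
```

The watcher has to run independently of the process it watches; a check that lives inside the stalled alarm process would stall along with it.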
With the FirstEnergy control room blind to events, operators failed to
take actions that could have prevented the blackout from cascading out
of control.
In the aftermath, investigators quickly zeroed in on the Ohio
line-tripping as a root cause. But the reason for the alarm failure
remained a mystery. Solving that mystery fell squarely on the
corporate shoulders of GE Energy, makers of the XA/21 EMS in use at
FirstEnergy's control center. According to interviews, half a dozen
workers at GE Energy began working feverishly with the utility and
with energy consultants from KEMA Inc. to figure out what went wrong.
The XA/21 isn't based on Windows, so it couldn't have been infected by
Blaster, but the company didn't immediately rule out the possibility
that the worm somehow played a role in the alarm failure. "In the
initial stages, nobody really knew what the root cause was," says Mike
Unum, manager of commercial solutions at GE Energy. "We spent a
considerable amount of time analyzing that, trying to understand if it
was a software problem, or if - like some had speculated - something
different had happened."
Sometimes working late into the night and the early hours of the
morning, the team pored over the approximately one million lines of
code that comprise the XA/21's Alarm and Event Processing Routine,
written in the C and C++ programming languages. Eventually they were
able to reproduce the Ohio alarm crash in GE Energy's Florida
laboratory, says Unum. "It took us a considerable amount of time to go
in and reconstruct the events." In the end, they had to slow down the
system, injecting deliberate delays in the code while feeding alarm
inputs to the program. About eight weeks after the blackout, the bug
was unmasked as a particularly subtle incarnation of a common
programming error called a "race condition," triggered on August 14th
by a perfect storm of events and alarm conditions on the equipment
being monitored. The bug had a window of opportunity measured in
milliseconds.
"There was a couple of processes that were in contention for a common
data structure, and through a software coding error in one of the
application processes, they were both able to get write access to a
data structure at the same time," says Unum. "And that corruption led
to the alarm event application getting into an infinite loop and
spinning."
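The general shape of such a bug, and of the delay-injection trick used to reproduce it, can be sketched in a few lines. This is a minimal illustration under stated assumptions, not GE Energy's code: it uses Python threads rather than the separate processes Unum describes, the function run_updates and the shared counter are hypothetical stand-ins for the alarm data structure, and it shows lost updates rather than the corruption and infinite loop seen in the XA/21.

```python
import threading
import time


def run_updates(workers, use_lock, delay=0.01):
    """Have `workers` threads each add 1 to a shared counter; return the total."""
    state = {"count": 0}  # stand-in for the shared data structure
    lock = threading.Lock()

    def update():
        current = state["count"]      # read the shared value
        time.sleep(delay)             # injected delay: widens the race window
        state["count"] = current + 1  # write back, possibly clobbering a peer

    def worker():
        if use_lock:
            with lock:                # only one writer at a time
                update()
        else:
            update()                  # writers may interleave between read and write

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return state["count"]


if __name__ == "__main__":
    print("unsynchronized:", run_updates(4, use_lock=False))
    print("with lock:     ", run_updates(4, use_lock=True))
```

With the lock, all four updates always land. Without it, the injected delay makes it very likely that several threads read the same starting value and overwrite one another, so the total usually comes up short. Without the delay the window is tiny and the unsynchronized version can run correctly for a long time, which is consistent with the bug going unnoticed through years of operation.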
Testing for Flaws
"This fault was so deeply embedded, it took them weeks of poring
through millions of lines of code and data to find it," FirstEnergy
spokesman Ralph DiNicola said in February.
After the alarm function crashed in FirstEnergy's control center,
unprocessed events began to queue up, and within half an hour the EMS
server hosting the alarm process folded under the burden, according to
the blackout report. A backup server kicked in, but it also failed. By
the time FirstEnergy operators figured out what was going on and
restarted the necessary systems, hours had passed, and it was too
late.
This week's blackout report recommends that the U.S. and Canadian
governments require all utilities using the XA/21 to check in with GE
Energy to ensure "that appropriate actions have been taken to avert
any recurrence of the malfunction." GE Energy says that's a moot
point: though the flaw has not manifested itself elsewhere, last fall
the company gave its customers a patch against the bug, along with
installation instructions and a utility to repair any alarm log data
corrupted by the glitch. According to Unum, the company sent the
package to every XA/21 customer - more than 100 utilities around the
world - and offered to help install it, "irrespective of their current
support status," he says.
The company did everything it could, says Unum. "We test exhaustively,
we test with third parties, and we had in excess of three million
online operational hours in which nothing had ever exercised that
bug," he says. "I'm not sure that more testing would have revealed
that. Unfortunately, that's kind of the nature of software... you may
never find the problem. I don't think that's unique to control systems
or any particular vendor software."
Tom Kropp, manager of the enterprise information security program at
the Electric Power Research Institute, an industry think tank, agrees.
He says faulty software may always be a part of the electric grid's
DNA. "Code is so complex, that there are always going to be some
things that, no matter how hard you test, you're not going to catch,"
he says. "If we see a system that's behaving abnormally well, we
should probably be suspicious, rather than assuming that it's behaving
abnormally well."
But Peter Neumann, principal scientist at SRI International and
moderator of the Risks Digest, says that the root problem is that
makers of critical systems aren't availing themselves of a large body
of academic research into how to make software bulletproof.
"We keep having these things happen again and again, and we're not
learning from our mistakes," says Neumann. "There are many possible
problems that can cause massive failures, but they require a certain
discipline in the development of software, and in its operation and
administration, that we don't seem to find. ... If you go way back to
the AT&T collapse of 1990, that was a little software flaw that
propagated across the AT&T network. If you go ten years before that
you have the ARPAnet collapse.
"Whether it's a race condition, or a bug in a recovery process as in
the AT&T case, there's this idea that you can build things that need
to be totally robust without really thinking through the design and
implementation and all of the things that might go wrong," Neumann
says.
Despite the absence of cyber terrorism in the blackout's genesis, the
final report includes 13 recommendations focused squarely on
protecting critical power-grid systems from intruders. The computer
security prescriptions came after task force investigators discovered
that the practices of some of the utility companies involved in the
blackout created "potential opportunities for cyber system compromise"
of EMS computers.
"Indications of procedural and technical IT management vulnerabilities
were observed in some facilities, such as unnecessary software
services not denied by default, loosely controlled system access and
perimeter control, poor patch and configuration management, and poor
system security documentation," reads the report.
Among the recommendations, the task force says cyber security
standards established by the North American Electric Reliability
Council, the industry group responsible for keeping electricity
flowing, should be vigorously enforced. Joe Weiss, a control system
cyber security consultant at KEMA, and one of the authors of the NERC
standards, says that's a good start. "The NERC cyber security
standards are very basic standards," says Weiss. "They provide a
minimum basis for due diligence."
But so far, it seems software failure has had more of an effect on the
power grid than computer intrusion. Nevertheless, both Weiss and
EPRI's Kropp believe that the final report is right to place more
emphasis on cybersecurity than software reliability. "You don't try to
look for something that's going to occur very, very, very
infrequently," says Weiss. "Essentially, a blackout like this was
something like that. There are other issues that are higher
probability that need to be addressed."