[Infowarrior] - The Digital Ice Age

Richard Forno rforno at infowarrior.org
Mon Nov 20 20:51:46 EST 2006


The Digital Ice Age

The documents of our time are being recorded as bits and bytes with no
guarantee of future readability. As technologies change, we may find our
files frozen in forgotten formats. Will an entire era of human history be
lost?

BY Brad Reagan
Published in the December 2006 issue.

http://www.popularmechanics.com/technology/industry/4201645.html

When the aircraft carrier USS Nimitz takes to sea, it carries more than a
half-million files with diagrams of the propulsion, electrical and other
systems critical to operation. Because this is the 21st century, these are
not unwieldy paper scrolls of engineering drawings, but digital files on the
ship's computers. The shift to digital technology, which enables Navy
engineers anywhere in the world to access the diagrams, makes maintenance
and repair more efficient. In theory. Several years ago, the Navy noticed a
problem when older files were opened on newer versions of computer-aided
design (CAD) software.

"We would open up these drawings and be like, 'Wow, this doesn't look
exactly like the drawing did before,'" says Brad Cumming, head of the
aircraft carrier planning yard division at Norfolk Naval Shipyard.

The changes were subtle -- a dotted line instead of dashes or minor dimension
changes -- but significant enough to worry the Navy's engineers. Even the
tiniest discrepancy might be mission critical on a ship powered by two
nuclear reactors and carrying up to 85 aircraft.

The challenge of retrieving digital files isn't an issue just for the U.S.
Navy. In fact, the threat of lost or corrupted data faces anyone who relies
on digital media to store documents -- and these days, that's practically
everyone. Digital information is so simple to create and store, we naturally
think it will be easily and accurately preserved for the future. Nothing
could be further from the truth. In fact, our digital information --
everything from photos of loved ones to diagrams of Navy ships -- is at risk
of degrading, becoming unreadable or disappearing altogether.

The problem is both immediately apparent and invisible to the average
citizen. It crops up when our hard drive crashes, or our new computer lacks
a floppy disk drive, or our online e-mail service goes out of business and
takes our correspondence with it. We treat these kinds of data loss as
personal catastrophes. Writ large, they are symptomatic of a
growing crisis. If the software and hardware we use to create and store
information are not inherently trustworthy over time, then everything we
build using that information is at risk.

Large government and academic institutions began grappling with the problem
of data loss years ago, with little substantive progress to date. Experts in
the field agree that if a solution isn't worked out soon, we could end up
leaving behind a blank spot in history. "Quite a bit of this period could
conceivably be lost," says Jeff Rothenberg, a computer scientist with the
Rand Corp. who has studied digital preservation.

Throughout most of our past, preserving information for posterity was mostly
a matter of stashing photographs, letters and other documents in a safe
place. Personal accounts from the Civil War can still be read today because
people took pains to save letters, but how many of the millions of e-mails
sent home by U.S. servicemen and servicewomen from the front lines in Iraq
will be accessible a century from now?

One irony of the Digital Age is that archiving has become a more complex
process than it was in the past. You not only have to save the physical
discs, tapes and drives that hold your data, but you also need to make sure
those media are compatible with the hardware and software of the future.
"Most people haven't recognized that digital stuff is encoded in some format
that requires software to render it in a form that humans can perceive,"
Rothenberg says. "Software that knows how to render those bits becomes
obsolete. And it runs on computers that become obsolete."

In 1986, for example, the British Broadcasting Corp. compiled a modern,
interactive version of William the Conqueror's Domesday Book, a survey of
life in medieval England. More than a million people submitted photographs,
written descriptions and video clips for this new "book." It was stored on
laser discs -- considered indestructible at the time -- so future generations
of students and scholars could learn about life in the 20th century.

But 15 years later, British officials found the information on the discs was
practically inaccessible -- not because the discs were corrupted, but because
they were no longer compatible with modern computer systems. By contrast,
the original Domesday Book, written on parchment in 1086, is still in
readable condition in England's National Archives in Kew. (The multimedia
version was ultimately salvaged.)

Changing computer standards aren't the only threat to digital data. In 2004,
Miami-Dade County announced it had lost almost all the electronic voting
records from a 2002 election because of a series of computer crashes --
reminding us that many of the failures of digital record-keeping are
attributable to everyday equipment failure (see "Preserving Your Data" at
right). Additionally, software companies can go out of business, taking
their proprietary codes with them. In 2001, the online photo storage site
PhotoPoint shut down and hundreds of people lost the digital photos they
stored on the site.

But data loss is not always as apparent as a fried hard drive or a disc with
no machine to play it. A digital file is just a long string of binary code.
Unlike a letter or a photograph, its content is not immediately apparent to
the end user. In order to see a photograph that has been saved as a JPEG
file or to read a letter composed in a word processing program, we need
software that can translate that code for us.
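
A small illustration of this point, not taken from the article: the Python
sketch below reads the same four bytes three different ways. The byte values
and variable names are arbitrary choices made for the example. Nothing in the
bits themselves says which reading is the intended one; that knowledge lives
in the software.

```python
import struct

# Four arbitrary bytes; the bits alone do not say what they mean.
raw = bytes([0x42, 0x48, 0x49, 0x21])

# Interpretation 1: four ASCII characters.
as_text = raw.decode("ascii")

# Interpretation 2: one big-endian unsigned 32-bit integer.
as_int = struct.unpack(">I", raw)[0]

# Interpretation 3: one big-endian 32-bit floating-point number.
as_float = struct.unpack(">f", raw)[0]

print(repr(as_text), as_int, as_float)
```

Each result is a legitimate reading of the data. Without the program (or at
least a description of the format), a future reader has no way to tell which
one the author meant.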

Software applications are updated on average every 18 months to two years,
according to the Software and Information Industry Association, and newer
versions are not always backward compatible with the previous ones. That
could be a problem on the USS Nimitz, just as it could make trouble for you
if the file in question held your medical records.

Likewise, law firms find that metadata -- data about the data, such as the date
when a file was created -- are often not transferred accurately when files are
copied. For example, magnetic storage media, such as hard drives, allow for
a three-part date storage system (created/accessed/modified), whereas the
file architecture of optical media, such as CD-Rs, allows for only one date.
This presents a difficulty in litigation, when attorneys must build
chronologies of key events in a case. "I see this in almost every single
case," says Craig Ball, a computer forensics expert who advises law firms.
"It's a complex problem at so many levels. We are losing so much."

As Richard Pearce-Moses, past president of the Society of American
Archivists, puts it, "We can keep the 0s and 1s alive forever, but can we
make sense of them?"
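
The date-stamp problem Ball describes can be reproduced on an ordinary
computer. The sketch below is an illustration under stated assumptions, not a
forensic tool: the file names and the backdated timestamp are invented for the
example. It uses two standard-library copy functions in Python. shutil.copy
copies the contents but not the timestamps, while shutil.copy2 also attempts
to carry the timestamps across.

```python
import os
import shutil
import tempfile

# Hypothetical "original" date: 2000-01-01 00:00:00 UTC, in seconds.
OLD_TIME = 946684800

with tempfile.TemporaryDirectory() as workdir:
    original = os.path.join(workdir, "memo.txt")
    with open(original, "w") as handle:
        handle.write("Contract terms agreed.\n")

    # Backdate the file so it appears to have been last modified in 2000.
    os.utime(original, (OLD_TIME, OLD_TIME))

    # Copy the contents only; the copy gets a fresh modification time.
    plain = shutil.copy(original, os.path.join(workdir, "plain_copy.txt"))

    # Copy the contents and try to preserve the timestamps as well.
    careful = shutil.copy2(original, os.path.join(workdir, "careful_copy.txt"))

    original_mtime = os.stat(original).st_mtime
    plain_mtime = os.stat(plain).st_mtime
    careful_mtime = os.stat(careful).st_mtime

print("original:", original_mtime)
print("plain copy:", plain_mtime)
print("careful copy:", careful_mtime)
```

The plain copy looks as if it was written today, even though its contents are
identical to a file dated 2000. Anyone building a chronology from that copy
would get the wrong date.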

I TRAVELED RECENTLY TO Washington, D.C., to meet with Ken Thibodeau, head of
the National Archives' Electronic Records Archive (ERA). The National
Archives is charged with the daunting task of preserving all historically
relevant documents and materials generated by the federal
government -- everything from White House e-mails to the storage locations of
nuclear waste. Ten years ago, Thibodeau's biggest concern was how to handle
the 32 million e-mails sent to the archives by the Clinton administration.
And that was just the beginning. The Bush White House is expected to produce
100 million e-mails by 2008. Thibodeau long ago realized that simply copying
the data to magnetic tapes -- the archives' previous means of storing
electronic records -- was not going to work in the Digital Age. It would take
years to copy those e-mails to tape, and that was just a trickle compared to
the avalanche of more complex digital files that were coming his way.

"The problem is that everything we build, whether it is a highway, tunnel,
ship or airplane, is designed using computers," Thibodeau says. "Electronic
records are being sent to the archives at 100 times the rate of paper
records. We don't know how to prevent the loss of most digital information
that's being created today."

The National Archives must not only sort through the tremendous volume of
data, it must also find a way to make sense of it. Thibodeau hopes to
develop a system that preserves any type of document -- created on any
application and any computing platform, and delivered on any digital
media -- for as long as the United States remains a republic. Complicating
matters further, the archive needs to be searchable. When Thibodeau told the
head of a government research lab about his mission, the man replied, "Your
problem is so big, it's probably stupid to try and solve it."

Last year, the National Archives awarded Lockheed Martin a $308 million
contract to develop the system. "We think this is a groundbreaking effort of
the Information Age," says Clyde Relick, the project's program director.


To date, the ERA has identified more than 4500 file types that need to be
accounted for. Each file type essentially requires an independent solution.
What type of information needs to be preserved? How does that information
need to be presented?

As a relatively simple example, let's take an e-mail from the head of a
regulatory agency. If the correspondence is pure text, it's a
straightforward solution. But what if there is an attachment? What type of
file is the attachment? If the attachment is a spreadsheet, does the
behavior of the spreadsheet need to be retained? In other words, will it be
important for future generations to be able to execute the formulas and play
with the data?

"That is unlike a challenge we would have with a paper document," Relick
says. More complex file formats, such as NASA virtual reality training
programs, require more complex solutions. The ERA is working with a number
of research partners, including the San Diego Supercomputer Center and the
National Science Foundation, on some of those more intricate challenges.

Lockheed is building what is primarily a "migration" system, in which files
are translated into flexible formats such as XML (Extensible Markup
Language), so the files can be accessed by technologies of the future. The
idea is to make copies without losing essential characteristics of the data.
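
To make the idea of migration concrete, here is a minimal sketch in Python. It
is not Lockheed's system, and the record, field names and the function
migrate_to_xml are all hypothetical. It translates a small comma-separated
record into XML, where every value is labeled by name, and then reads the XML
back to check that nothing was lost.

```python
import csv
import io
import xml.etree.ElementTree as ET

# A hypothetical legacy record stored as comma-separated text.
legacy = "id,author,subject\n17,J. Smith,Budget review\n"


def migrate_to_xml(csv_text):
    """Translate CSV rows into an XML string with one element per field."""
    root = ET.Element("records")
    for row in csv.DictReader(io.StringIO(csv_text)):
        record = ET.SubElement(root, "record")
        for field, value in row.items():
            # Each column name becomes a tag, so the meaning travels
            # with the value instead of depending on column position.
            ET.SubElement(record, field).text = value
    return ET.tostring(root, encoding="unicode")


xml_text = migrate_to_xml(legacy)
print(xml_text)

# Read the migrated copy back to confirm the values survived.
restored = ET.fromstring(xml_text)
author = restored.find("record/author").text
```

The sketch also hints at Rothenberg's objection, raised below: the migrated
copy keeps only what the person writing the converter chose to carry over.
Here that is the field names and values, and nothing else about the original
file.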

Not everyone agrees with Lockheed's approach. Rothenberg, of the Rand Corp.,
for example, believes an "emulation" strategy would be more appropriate.
Emulation allows a modern computer to mimic an older computer so it can run
a certain program. Popular emulation programs in use today are those that
allow people to take video games made for Sony PlayStation 2 or Microsoft
Xbox and play them on PCs.

"It seems to me that migration throws away the original," Rothenberg says.
"It doesn't even try to save the original. What you end up with is
somebody's idea about what was important about the original."

Relick says the cost and technical effort involved in emulation are not
feasible for a project the size of the ERA. In addition, he notes that the
archives in their entirety will need to be accessible to anyone with a
browser, and emulation becomes more difficult when you have to account for
users with an infinite variety of hardware and software.

The goal for the Lockheed team is to have initial operating capability for
the ERA in September 2007, but budget cuts may delay the program's search
functionality.

The data crisis is by no means limited to the National Archives, or to
branches of the military. The Library of Congress is in the midst of its own
preservation project, and many universities are scrambling to build systems
that capture and retain valuable academic research.

But the programs in development for government and academia won't help find
the lost e-mail of an individual computer user. Some experts believe that
this is the result of simple market forces: Consumers have shown little
interest in digital preservation, and corporations are in the business of
meeting consumer demand. Others say corporations are only concerned with
selling more new products.


"Their interest, it seems to me, is creating incompatibilities over time,
not compatibilities," Rothenberg says. "Looking at it cynically, they have
very little motivation to burden themselves with compatibility because doing
so only allows their customers to avoid upgrading."

Nevertheless, there have been encouraging developments. In late 2005,
Microsoft announced it was opening the file formats of its Office suite,
including Word and Excel, to competitors in order to get Office certified as
an international standard. By ceding proprietary control of the formats to
third-party developers, Microsoft greatly increases the odds that those
formats will be accessible for future generations.

Meanwhile, the International Organization for Standardization recently
certified a modified version of Adobe Systems' popular Portable Document
Format (PDF) specifically for long-term archiving. It's called PDF/A. In
essence, PDF/A preserves everything contained in a document that can be
printed while excluding features that may be useful in the short term but
problematic in the long term. For example, the new format does not allow
embedded links to external applications, which could become obsolete, and it
doesn't allow for passwords, which can be lost or forgotten. "It is all
about creating a reliable presentation down the road," says Melonie Warfel,
director of worldwide standards for Adobe, who worked on the project. Adobe
is also working on archiving standards for engineering documents and digital
images.

IF HISTORY IS A GUIDE -- and that, after all, is the point of preserving
history -- we know the future will offer the means to manipulate digital
information in ways we cannot yet imagine. The trick is to keep moving
forward without leaving too much behind.

"It goes beyond this notion of 'important records' -- it goes to the things
that are important to us," says Warfel, the mother of two children. "My mom
had shoeboxes full of photographs, but we don't do that anymore. I have hard
drives full of photographs."



