[Infowarrior] - Mine Data not Details

Richard Forno rforno at infowarrior.org
Mon Jun 19 07:57:04 EDT 2006


Mine Data not Details

http://www.wired.com/news/wireservice/1,71184-0.html
Associated Press 10:45 AM Jun, 17, 2006

CAMBRIDGE, Massachusetts -- As new disclosures mount about government
surveillance programs, computer science researchers hope to wade into the
fray by enabling data mining that also protects individual privacy.

Largely by employing the head-spinning principles of cryptography, the
researchers say they can ensure that law enforcement, intelligence agencies
and private companies can sift through huge databases without seeing names
and identifying details in the records.

For example, manifests of airplane passengers could be compared with
terrorist watch lists -- without airline staff or government agents seeing
the actual names on the other side's list. Only if a match were made would a
computer alert each side to uncloak the record and probe further.

"If it's possible to anonymize data and produce ... the same results as
clear text, why not?" John Bliss, a privacy lawyer in IBM's "entity
analytics" unit, told a recent workshop on the subject at Harvard
University.

The concept of encrypting or hiding identifying details in sensitive
databases is not new. Exploration has gone on for years, and researchers say
some government agencies already deploy such technologies -- though
protecting classified information rather than individual privacy is a main
goal.

Even the data-mining project that perhaps drew more scorn than any other in
recent years, the Pentagon's Total Information Awareness research program,
funded at least two efforts to anonymize database scans. Those anonymizing
systems were dropped when Congress shuttered TIA, even while the data-mining
aspects of the project lived on in intelligence agencies.

Still, anonymizing technologies have been endorsed repeatedly by panels
appointed to examine the implications of data mining. And intriguing
progress appears to have been made at designing information-retrieval
systems with record anonymization, user audit logs -- which can confirm that
no one looked at records beyond the approved scope of an investigation --
and other privacy mechanisms "baked in."

The trick is to do more than simply strip names from records. Latanya
Sweeney of Carnegie Mellon University -- a leading privacy technologist who
once had a project funded under TIA -- has shown that 87 percent of
Americans could be identified by records listing solely their birth date,
gender and ZIP code.

Sweeney had this challenge in mind as she developed a way for the U.S.
Department of Housing and Urban Development to anonymously track the
homeless.

The system became necessary to meet the conflicting demands of two laws --
one that requires homeless shelters to tally the people they take in, and
another that prohibits victims of domestic violence from being identified by
agencies that help them.

Sweeney's solution deploys a "hash function," which cryptographically
converts information to a random-appearing code of numbers and letters. The
function can't be reversed to reveal the original data.

When homeless shelters had to submit their records to regional HUD offices
for counting how many people used the facilities, each shelter would send
only hashed data.

A key detail here is that each homeless shelter would have its own
computational process, known as an algorithm, for hashing data. That way,
one person's name wouldn't always translate into the same code -- a method
that could be abused by a corrupt insider or savvy stalker who gained access
to the records.

However, if the same name generated different codes at different shelters,
it would be impossible to tell whether one person had been to two centers
and was being double-counted. So Sweeney's system adds a second step: Each
shelter's hashed records are sent to all other facilities covered by the HUD
regional office, then hashed again and sent back to HUD as a new code.

It might be hard to wrap your mind around this, but it's a fact of the
cryptography involved: If one person had been to two different shelters --
and so their anonymized data got hashed twice, once by each of the shelters
applying its own formula -- then the codes HUD received in this second phase
would indicate as much. That would aid an accurate count.

Even if HUD decides not to adopt the system, Sweeney hopes it finds use in
other settings, such as letting private companies and law enforcement
anonymously compare whether customer records and watch lists have names in
common.

A University of California, Los Angeles professor, Rafail Ostrovsky, said
the CIA and the National Security Agency are evaluating a program of his
that would let intelligence analysts search huge batches of intercepted
communications for keywords and other criteria, while discarding messages
that don't apply.

Ostrovsky and co-creator William Skeith believe the system would keep
innocent files away from snoops' eyes while also extending their reach:
Because the program would encrypt its search terms and the results, it could
be placed on machines all over the internet, not just computers in
classified settings.

"Technologically it is possible" to bolster security and privacy, Ostrovsky
said. "You can kind of have your cake and eat it too."

That may be the case, but creating such technologies is just part of the
battle. One problem is getting potential users to change how they deal with
information.

Rebecca Wright, a Stevens Institute of Technology professor who is part of a
five-year National Science Foundation-funded effort to build privacy
protections into data-mining systems, illustrates that issue with the
following example.

The Computing Research Association annually analyzes the pay earned by
university computer faculty. Some schools provide anonymous lists of
salaries; more protective ones send just their minimum, maximum and average
pay.

Researchers affiliated with Wright's project, known as Portia, offered a way
to calculate the figures with better accuracy and privacy. Instead of having
universities send their salary figures for the computer association to
crunch, Portia's system can perform calculations on data without ever
storing it in unencrypted fashion. With such secrecy, the researchers
argued, every school could safely send full salary lists.

But the software remains unadopted. One large reason, Wright said, was that
universities questioned whether encryption gave them legal standing to
provide full salary lists when they previously could not -- even though the
new lists never would leave the university in unencrypted form.

Even if data-miners were eager to adopt privacy enhancements, Wright and
other researchers worry that the programs' obscure details might be
difficult for the public to trust.

Steven Aftergood, who heads the Federation of American Scientists' project
on government secrecy, suggested that public confidence could be raised by
subjecting government data-mining projects to external privacy reviews.

But that seems somewhat unrealistic, he said, given that intelligence
agencies have been slow to share surveillance details with Congress even on
a classified basis.

"That part of the problem may be harder to solve than the technical part,"
Aftergood said. "And in turn, that may mean that the problem may not have a
solution."




More information about the Infowarrior mailing list