[Infowarrior] - Studying the Frequency of Redaction Failures
Richard Forno
rforno at infowarrior.org
Mon Jun 6 07:49:25 CDT 2011
Studying the Frequency of Redaction Failures in PACER
By Timothy B. Lee - Posted on May 25th, 2011 at 1:52 pm
http://freedom-to-tinker.com/blog/tblee/studying-frequency-redaction-failures-pacer
Since we launched RECAP a couple of years ago, one of our top concerns has been privacy. The federal judiciary's PACER system offers the public online access to hundreds of millions of court records. The judiciary's rules require each party in a case to redact certain types of information from documents they submit, but unfortunately litigants and their counsel don't always comply with these rules. Three years ago, Carl Malamud did a groundbreaking audit of PACER documents and found more than 1600 cases in which litigants submitted documents with unredacted Social Security numbers. My recent research has focused on a different problem: cases where parties tried to redact sensitive information but the redactions failed for technical reasons. This problem occasionally pops up in news stories, but as far as I know, no one has conducted a systematic study.
To understand the problem, it helps to know a little bit about how computers represent graphics. The simplest image formats are bitmap or raster formats. These represent an image as an array of pixels, with each pixel having a color represented by a numeric value. The PDF format uses a different approach, known as vector graphics, which represents an image as a series of drawing commands: lines, rectangles, lines of text, and so forth.
Vector graphics have important advantages. Vector-based formats "scale up" gracefully, in contrast to the raster images that look "blocky" at high resolutions. Vector graphics also do a better job of preserving a document's structure. For example, text in a PDF is represented by a sequence of explicit text-drawing commands, which is why you can cut and paste text from a PDF document, but not from a raster format like PNG.
But vector-based formats also have an important disadvantage: they may contain more information than is visible to the naked eye. Raster images have a "what you see is what you get" quality—changing all the pixels in a particular region to black destroys the information that was previously in that part of the image. But a vector-based image can have multiple "layers." There might be a command to draw some text followed by a command to draw a black rectangle over the text. The image might look like it's been redacted, but the text is still "under" the box. And often extracting that information is a simple matter of cutting and pasting.
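The difference can be illustrated with a toy model. This is a minimal sketch, not real PDF handling: the DrawText and FillRect classes, the extract_text and redact_raster helpers, and the sample values are all hypothetical names invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class DrawText:
    """Hypothetical stand-in for a vector text-drawing command."""
    x: int
    y: int
    text: str


@dataclass
class FillRect:
    """Hypothetical stand-in for a command that paints a filled rectangle."""
    x: int
    y: int
    width: int
    height: int


def extract_text(commands):
    """Collect text from every text command, ignoring anything painted on top."""
    return [c.text for c in commands if isinstance(c, DrawText)]


def redact_raster(pixels, x, y, width, height):
    """Overwrite a region of a pixel grid with black (0); the old values are gone."""
    for row in range(y, y + height):
        for col in range(x, x + width):
            pixels[row][col] = 0
    return pixels


# Vector page: the text command is still present after a box is drawn over it.
page = [DrawText(10, 10, "SSN: 000-00-0000"), FillRect(8, 8, 120, 14)]
leaked = extract_text(page)

# Raster page: blacking out the pixels leaves nothing to recover.
bitmap = [[255] * 4 for _ in range(2)]
redact_raster(bitmap, 0, 0, 4, 2)
```

In the vector case the covering rectangle is just one more command in the list, so anything that reads the text commands directly still sees the "redacted" string.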
So how many PACER documents have this problem? We're in a good position to study this question because we have a large collection of PACER documents: 1.8 million of them when I started my research last year. I wrote software to detect redaction rectangles; it turns out these are relatively easy to recognize based on their color, shape, and the specific commands used to draw them. Out of 1.8 million PACER documents, there were approximately 2000 documents with redaction rectangles. (There were also about 3500 documents that were redacted by replacing text with strings of Xes. I also excluded documents that were redacted by Carl Malamud before he donated them to our archive.)
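To give a flavor of what recognizing a rectangle by its color and drawing commands might look like, here is a rough sketch in Python (the released code is Perl and is not reproduced here). It assumes a heavily simplified page content stream and looks only at the standard PDF operators rg and g (set fill color), re (add a rectangle to the path), and f (fill the path). The function name find_black_rectangles and the sample stream are hypothetical, and the actual criteria the research software used are not described in detail, so treat this as an assumption-laden illustration.

```python
def find_black_rectangles(content_stream):
    """Return (x, y, width, height) for rectangles filled while the fill color is black.

    Simplifications (assumptions of this sketch): tokens are split on
    whitespace, string literals are not parsed properly, and the graphics
    state stack (q/Q) and coordinate transforms (cm) are ignored.
    """
    operands = []          # numeric operands seen since the last operator
    fill_is_black = True   # PDF's initial fill color is black
    pending = []           # rectangles added to the current path
    found = []
    for token in content_stream.split():
        try:
            operands.append(float(token))
            continue
        except ValueError:
            pass  # not a number, so treat it as an operator or other token
        if token == "rg" and len(operands) >= 3:
            fill_is_black = all(v == 0 for v in operands[-3:])
        elif token == "g" and len(operands) >= 1:
            fill_is_black = operands[-1] == 0
        elif token == "re" and len(operands) >= 4:
            pending.append(tuple(operands[-4:]))
        elif token in ("f", "F", "f*"):
            if fill_is_black:
                found.extend(pending)
            pending = []
        elif token in ("S", "n"):
            pending = []  # path stroked or discarded, nothing was filled
        operands = []
    return found


sample_stream = """
BT /F1 12 Tf 72 700 Td (Account 12345) Tj ET
1 1 1 rg 0 0 612 20 re f
0 0 0 rg 70 695 120 16 re f
"""
rectangles = find_black_rectangles(sample_stream)
```

A real detector would also need to consider shape (for example, ignoring thin rules and page-sized backgrounds) and would rely on a proper PDF parser rather than whitespace splitting.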
Next, my software checked to see if these redaction rectangles overlapped with text. My software identified a few hundred documents that appeared to have text under redaction rectangles, and examining them by hand revealed 194 documents with failed redactions. The majority of the documents (about 130) appear to be from commercial litigation, in which parties have unsuccessfully attempted to redact trade secrets such as sales figures and confidential product information. Other improperly redacted documents contain sensitive medical information, addresses, and dates of birth. Still others contain the names of witnesses, jurors, plaintiffs, and one minor.
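The overlap check itself can be sketched as a standard axis-aligned bounding-box test. The Box type, the overlaps and covered_text helpers, and the coordinates below are hypothetical; how the research code actually computes text positions is not described here, so this only illustrates the general idea.

```python
from typing import NamedTuple


class Box(NamedTuple):
    """Axis-aligned box: lower-left corner plus width and height."""
    x: float
    y: float
    width: float
    height: float


def overlaps(a, b):
    """True when the two boxes share some area (touching edges do not count)."""
    return (a.x < b.x + b.width and b.x < a.x + a.width and
            a.y < b.y + b.height and b.y < a.y + a.height)


def covered_text(rectangles, text_runs):
    """Return the text of every run whose box overlaps at least one rectangle.

    text_runs is a list of (Box, str) pairs.
    """
    return [text for box, text in text_runs
            if any(overlaps(box, rect) for rect in rectangles)]


redaction_boxes = [Box(70, 695, 120, 16)]
runs = [
    (Box(72, 700, 90, 12), "Account 12345"),   # sits under the rectangle
    (Box(72, 650, 200, 12), "Exhibit A"),      # elsewhere on the page
]
suspect = covered_text(redaction_boxes, runs)
```

A hit from a test like this only flags a candidate. As noted above, the flagged documents still had to be examined by hand to confirm that the redaction had actually failed.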
Implications
PACER reportedly contains about 500 million documents. We don't have a random sample of PACER documents, so we should be careful about trying to extrapolate to the entire PACER corpus. Still, it's safe to say there are thousands, and probably tens of thousands, of documents in PACER whose authors made unsuccessful attempts to conceal information.
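For a sense of scale, the naive arithmetic behind that statement looks like this. It assumes, purely for illustration, that the failure rate in our sample carries over to all of PACER, which is exactly the assumption the caveat above warns against, so the result is an order-of-magnitude figure and not an estimate.

```python
failed_in_sample = 194          # hand-confirmed failed redactions
sample_size = 1_800_000         # documents examined
corpus_size = 500_000_000       # reported size of PACER

# Observed failure rate in the (non-random) sample.
rate = failed_in_sample / sample_size

# Naive scale-up to the whole corpus, assuming the same rate everywhere.
naive_estimate = rate * corpus_size

print(f"rate: about 1 in {round(1 / rate):,} documents")
print(f"naive estimate: about {round(naive_estimate, -3):,.0f} documents")
```

The scale-up comes out to roughly 54,000 documents, which falls within the "tens of thousands" range.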
It's also important to note that my software may not be detecting every instance of redaction failures. If a PDF was created by scanning in a paper document (as opposed to generated directly from a word processor), then it probably won't have a "text layer." My software doesn't detect redaction failures in this type of document. This means that there may be more than 194 failed redactions among the 1.8 million documents I studied.
A few weeks ago I wrote a letter to Judge Lee Rosenthal, chair of the federal judiciary's Committee on Rules of Practice and Procedure, explaining this problem. In that letter I recommended that the courts themselves use software like mine to automatically scan PACER documents for this type of problem. In addition to scanning the documents they already have, the courts should make such scanning a standard part of the process for filing new documents. This would allow the courts to catch these problems before the documents are made available to the public on the PACER website.
My code is available here. It's experimental research code, not a finished product. We're releasing it into the public domain using the CC0 license; this should make it easy for federal and state officials to adapt it for their own use. Court administrators who are interested in adapting the code for their own use are especially encouraged to contact me for advice and assistance. The code relies heavily on the CAM::PDF Perl library, and I'm indebted to Chris Dolan for his patient answers to my many dumb questions.
Getting Redaction Right
So what should litigants do to avoid this problem? The National Security Agency has a good primer on secure redaction. The approach they recommend is the safest: completely delete the sensitive information in the original word processing document, replace it with innocuous filler (such as strings of XXes) as needed, and then convert the result to a PDF document. The NSA primer also explains how to check for other potentially sensitive information that might be hidden in a document's metadata.
Of course, there may be cases where this approach isn't feasible because a litigant doesn't have the original word processing document or doesn't want the document's layout to be changed by the redaction process. Adobe Acrobat's redaction tool has worked correctly when we've used it, and Adobe probably has the expertise to do it correctly. There may be other tools that work correctly, but we haven't had an opportunity to experiment with them so we can't say which ones they might be.
Regardless of the tool used, it's a good idea to take the redacted document and double-check that the information was removed. An easy way to do this is to simply cut and paste the "redacted" content into another document. If the redaction succeeded, no text should be transferred. This method will catch most, but not all, redaction failures. A more rigorous check is to remove the redaction rectangles from the document and manually observe what's underneath them. One of the scripts I'm releasing today, called remove_rectangles.pl, does just that. In its current form, it's probably not user-friendly enough for non-programmers to use, but it would be relatively straightforward for someone (perhaps Adobe or the courts) to build a user-friendly version that ordinary users could use to verify that the document they just attempted to redact actually got redacted.
One approach we don't endorse is printing the document out, redacting it with a black marker, and then re-scanning it to PDF format. Although this may succeed in removing the sensitive information, we don't recommend this approach because it effectively converts the document into a raster-based image, destroying useful information in the process. For example, it will no longer be possible to cut and paste (non-redacted) text from a document that has been redacted in this way.
Bad redactions are not a new problem, but they are taking on a new urgency as PACER documents become increasingly available on the web. Correct redaction is not difficult, but it does require both knowledge and care by those who are submitting the documents. The courts have several important roles they should play: educating attorneys about their redaction responsibilities, providing them with software tools that make it easy for them to comply, and monitoring submitted documents to verify that the rules are being followed.
This research was made possible with the financial support of Carl Malamud's organization, Public.Resource.Org.