As previously discussed, there are a number of inherent flaws in the practice of web content classification and in how the fruits of that classification are sold in the form of content filtering products. The impact of these flaws, however, is not limited to the consumers of content filtering products. Mis-classification of web content affects websites themselves, often as harshly as it affects the users of content filtering who need access to the resources those sites provide. Before discussing the impact of mis-classification on legitimate websites and users of content filtering, it is important to understand the process by which a website is classified (or mis-classified) by the content classification industry.
In the mid-1990s, as the World Wide Web rose to popularity, stories in the mass media prompted parents and religious leaders to worry that children might access indecent material. In 1996, the United States Congress reacted with the Communications Decency Act (CDA), banning indecency on the Internet. Civil liberties groups challenged the law on First Amendment grounds, and in 1997 the U.S. Supreme Court struck down the CDA's anti-indecency provisions in Reno v. ACLU. While the CDA fell on free-speech grounds, another long-standing objection to content filtering has been the correctness of the filtering decisions made by content filtering products. Overly broad filters against key words, whether in the content of a website or in its URL, led to sites carrying information about breast cancer, clothing, and poultry recipes being filtered because of their use of the word "breast". In a particularly famous example, Beaver College in Glenside, Pennsylvania changed its name to Arcadia University in part because content-filtering software was blocking access to the school's web site, treating the college's use of "beaver" as indecent.
These kinds of mis-classifications are very common due to the way most content filtering technology works. Content classification companies maintain huge databases of web sites classified into various categories, and users of content filtering software may then choose which categories to allow or block. In addition to URLs, so-called key words are used to help block sites which may not be specifically classified, but which may contain material related to a filtered category, as well as web searches for restricted material. Key words are generally regular expressions which can be broadly applied to either the name of a site or to the content of the site itself. URL-based key word filters are generally faster but less reliable: the entire site need not be fetched and checked before being displayed to the end user, but the URL often bears little relation to the content of the page itself. Content-based key word filters are considered more reliable but slower, because the entire page must be processed. The accuracy of key word-based filtering comes into question when one looks more deeply at regular expressions and how they are used.
A regular expression is a string that is used to describe or match a set of strings according to a certain set of syntactical rules. There are a number of different flavors of regular expression syntax, including POSIX (basic and extended), Perl-compatible (PCRE), and the traditional UNIX tools. While the distinctions between the syntaxes are numerous, the goal is the same: to match every string containing a specific, described pattern, and only those strings. When the set of strings is every URL on the Web, writing even a single pattern which will match one particular objectionable word, and nothing else, is a monumental task. Combine that with simple inattention to detail or laziness and, suddenly, content filters block every instance of the string "sex" or "orgy", which leads to the blacklisting of sites relating to Sussex or Essex, or the opera Porgy and Bess, as well as legitimate sites containing sex educational material.
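The fragility of such patterns is easy to demonstrate. Below is a minimal sketch using Python's re module; the ".example" domains are hypothetical stand-ins, not any vendor's actual filter rules. A bare substring pattern for "sex" flags Sussex and Middlesex, while even a word-boundary version still cannot tell a sex-education site from pornography:

```python
import re

# Naive keyword filter: any URL containing the substring "sex" is flagged.
naive = re.compile(r"sex", re.IGNORECASE)

# Slightly more careful: require word boundaries around the keyword,
# so "sex" buried inside "Sussex" no longer matches.
careful = re.compile(r"\bsex\b", re.IGNORECASE)

urls = [
    "http://www.sussex.example/",       # hypothetical county site
    "http://www.middlesex.example/",    # hypothetical college site
    "http://sex-education.example/",    # hypothetical sex-ed resource
]

for url in urls:
    print(f"{url}  naive={bool(naive.search(url))}  "
          f"careful={bool(careful.search(url))}")

# The naive pattern flags all three. The careful pattern spares Sussex
# and Middlesex, but still blocks the sex-education site: no regex can
# judge whether a page has "serious literary, artistic, political or
# scientific value".
```

Word boundaries fix the Sussex problem but do nothing for the deeper one: the pattern matches characters, not meaning.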
To better illustrate this problem, consider the following web sites and their URLs. Each one is a legitimate business that has absolutely nothing to do with indecent material, but runs the risk of being filtered for 'inappropriate content' simply because of the site's name. Some filtering products act on the assumption that if a 'bad' word is in the URL or site name, it must reflect the content the site is offering.
http://www.dickssportinggoods.com/ - Dick's Sporting Goods
http://www.msexchange.org/ - Microsoft Exchange Server Resource Site
http://www.cummingfirst.com/ - First Cumming Methodist Church
http://www.penisland.net/ - Pen Island
http://www.molestationnursery.com/ - Mole Station Native Nursery
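As a quick sanity check, every one of the five URLs above contains a substring that a careless blocklist would flag. The following sketch uses a hypothetical word list of my own construction, not any vendor's actual database:

```python
import re

# Hypothetical substring blocklist of the sort a careless URL filter
# might apply. This list is illustrative only.
BAD_WORDS = ["dick", "sex", "cum", "penis", "molestation"]
pattern = re.compile("|".join(BAD_WORDS), re.IGNORECASE)

sites = [
    "http://www.dickssportinggoods.com/",   # Dick's Sporting Goods
    "http://www.msexchange.org/",           # MS Exchange resource site
    "http://www.cummingfirst.com/",         # First Cumming Methodist Church
    "http://www.penisland.net/",            # Pen Island
    "http://www.molestationnursery.com/",   # Mole Station Native Nursery
]

for url in sites:
    hit = pattern.search(url)
    # Every one of these legitimate sites trips the filter.
    print(f"{url}  blocked on: {hit.group()}")
```

All five businesses get blacklisted on the strength of a substring match, which is precisely the failure mode described above.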
In a whitepaper published by the Electronic Frontier Foundation, authors Seth Finkelstein and Lee Tien had the following to say regarding the use of key words:
"From a theoretical point of view, the claimed abilities would require a computer-science breakthrough of Nobel-Prize magnitude. Consider the legal test for 'obscenity,' which requires that an obscene work 'taken as a whole lack serious literary, artistic, political or scientific value.' It is hard to see how anyone can seriously assert that computer programs could make such a judgment when humans endlessly debate these concepts."
It is important to note that, despite the marketing jargon surrounding the functionality of content filtering software, key words really are the only intelligence used to make automatic filtering decisions. When an administrator enables such functionality, which is not turned on by default in some content filtering products, they may inadvertently block thousands of harmless sites. The remaining sites are filtered based on a statically maintained list of URLs and IP addresses. These massive databases are maintained by the filtering companies themselves, regarded as the keys to their intellectual-property kingdom, kept completely opaque to the end user, and sold as an update service, much like antivirus vendors sell virus definition updates. When a site like attrition.org gets categorized as "Racism and Hate", that is the content filtering company forcing its views, or its mistake, on its customers. Considering that only a small fraction of the sites contained in these databases are ever viewed by human eyes, mis-classified sites constantly enter the databases and are rarely (if ever) re-classified correctly, unless or until a site realizes that it's been blacklisted and complains.
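To make the mechanics concrete, here is a toy model; this is an assumption on my part about the general shape of such systems, since real products use large proprietary databases rather than a Python dict. The point it illustrates: one database entry assigns one category to a hostname, and every page under that host inherits the verdict.

```python
# Toy model of category-based blocking from a static URL database.
# Category names and the example hostname mirror the story told in
# this article; everything else is hypothetical.

BLOCKED_CATEGORIES = {"Racism and Hate", "Adult Content"}

# One entry per hostname: the whole site gets a single category.
CATEGORY_DB = {
    "attrition.org": "Racism and Hate",       # the mis-classification at issue
    "example-news.example": "News and Media", # hypothetical entry
}

def is_blocked(host: str) -> bool:
    # Hosts absent from the database would fall through to key word
    # heuristics (not modeled here).
    return CATEGORY_DB.get(host) in BLOCKED_CATEGORIES

print(is_blocked("attrition.org"))         # True: every page under the host is denied
print(is_blocked("example-news.example"))  # False
```

Note that there is no per-page granularity in this model: a single human (or automated) judgment about one page condemns the entire domain, which is exactly what happened in the incident described next.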
Before you casually dismiss this issue thinking "it doesn't affect me", consider that these content filtering products are frequently used for public Internet access. This means that if you use WiFi at your local coffee shop, public kiosk terminals at an airport or even the computers at your local library, you may be browsing the Web under the guidance and supervision of these products...
... which leads us to a time in the not-so-distant past (as you will read below) when attrition.org was again directly affected by the content filtering industry. Shortly after the release of an email to the "General Attrition Mayhem Mail List", we received a curious email stating the following:
From: A.A. (email@example.com)
To: lyger (firstname.lastname@example.org)
Date: Sat, 16 Jun 2007 13:01:50 -0700 (PDT)
Subject: Bloody Bobbing Bollocks, You've Been Blocked!

lyger-

Wow... just wow. Imagine my happiness when I arrived at work after a few days off and saw a bunch of new dataloss entries and a new going postal. "Good times," I thought to myself, as I typed attrition.org into my address bar only to be greeted by this:

You have attempted to access a site that is not consistent with [Company]'s Internet Usage Policy. Your request for http://attrition.org/postal/p0014.html was denied because of its content categorization: "Racism and Hate"

--------------------------------------------------------------------------------

Use of the Internet by [Company] employees is permitted and encouraged where such use is suitable for business purposes and supports the goals and objectives of [Company] and its business units. The Internet is to be used in a manner that is consistent with [Company]'s standards of business conduct as defined in the company's Ethics in the Workplace policy, is a part of the normal execution of an employee's job responsibilities, and does not compromise the security or the integrity of [Company]'s information systems. This policy covers all connections via intranet, extranet, Internet, and any remote methods that allow physical or logical connectivity to internal [Company] information systems using [Company] resources. Violations of this policy may be subject to disciplinary action up to and including termination of employment.

Racism and hate? I am Jack's complete shock. Okay, maybe hate of stupid people but c'mon? Racism? Attrition.org? What the fucking fuck? I can only think that one of the myriad images from the image gallery or some mirrored page defacement is what did this. How did my corporate overlords even find out about Attrition? I demand answers, dammit! Okay, I know you don't have any answers for me but this sucks.
No more Going Postal. No more defacement mirror. No more reviews. No more charlatans. Damn, what am I going to do when I'm bored at work... oh, now I see. They blocked all the fun websites so we'll review company news and policies if we aren't busy. (Like we're "supposed" to.) Those bastards!

It's been fun checking Attrition.org out for the all-too-brief period we've had together. I'll see you when I get internet at home again... now, to check and see if http://www.racismandhate.org is blocked.

a.

P.S.- As per usual, I'd like to humbly request that, should this e-mail be featured in the Going Postal section, you don't use my name or e-mail address. (Especially since this involves my job.) I also have a new request, being that you tell me it's going to be used since I won't be able to actually see it from work and can't see it from home for at least another month and a half. I'd prefer it not be used but that's just my preference.
The most heinous "[Company] disclaimer" shown above appears to be their actual message displayed upon blocking a web site on their network. Why not just say "access denied, go read our AUP" and be done with it? I responded back, asking for more information about the filtering software in question. Before I received a reply, yet another email hit my inbox:
From: A.G. (email@example.com)
To: firstname.lastname@example.org
Date: Sat, 16 Jun 2007 17:44:45 -0400
Subject: Interesting with regards to Websense.

Looks like Attrition is a hotbed of racists, extremists, and hatemongers. At least according to Websense. Attrition has now been blocked under the dubiously humorous category "Racism and Hate". It seems Websense Inc. or whoever rules their 'block list' has a hard on for you guys. It also seems that I'll be getting my dose of sarcasm and infosec news elsewhere. Well, however long it takes me to find another working proxy. Good luck getting Websense to unblock Attrition.org.
For a touch of historical flavor, we will say that attrition.org was once categorized as "hacking", which is another category generally blocked by many companies that use content-filtering products. Attrition.org was later placed under the "Computer Security" category after a long e-mail campaign and many customer complaints, so the recategorization to "Racism and Hate" concerned us for two reasons:
We searched Websense's public web site in an attempt to find a link, address or form we could use to request a site recategorization. Nothing appeared to be available. We then obtained an email address for Websense's support division and contacted them to request that the site be reviewed and appropriately recategorized. After 48 hours, no response was received. After a few minutes of research and strategic e-mails, we found a friend who is an existing Websense customer. On his own time, he contacted Websense by phone and spoke with tech support personnel who directed him to a web link that would allow a customer recategorization request. The recategorization request was made, and within three hours, the following message was delivered via email to his account:
Thank you for writing to Websense. The sites you submitted have been reviewed and categorized accordingly:

http://attrition.org/ - Information Technology
So customers who pay for the product can request support for a recategorization, but sites which may be wrongfully categorized sometimes have no readily available means to request a manual peer review? That seems wrong, and it really does smack of "big business": we can affect the way you are perceived by millions, but if you're not paying us, go screw yourselves and find someone who is. In other words, if you're not our customer, you get no service, even if we screwed you... and worse, potentially libeled you.
So why was attrition.org recategorized as "Racism and Hate"? We don't really have an honest or truthful answer, but we do have a few ideas:
If one of these were the reason, why not apply that designation to /postal instead of the entire site? If it was for the mirror content, why not a custom message explaining that it is a security site mirroring content from criminal activity and is useful to law enforcement and security personnel, but may be offensive to some? Websense could use this type of custom message as a value-add, giving customers more reason to continue using their product. If you think this takes too much time and effort, do you really want to browse the web and get denied access to content based on three-second snap judgements of web sites and their material?
As described by Submicron, the impact of a "negative recategorization" affects more than just a company using a content-filtering product. It also affects the person who visited the site (should their visit be logged and flagged), and the site itself, which may have to jump through countless hoops in order to be fairly categorized. In some cases, reputations may be at stake, and it hardly seems right that companies with arbitrary control over viewable content should be able to unilaterally make decisions such as these and subject millions of users to them.