See No Evil: A Background On Content Filters

Fri Jun 22 00:16:53 EDT 2007

Submicron


In 1999, Senator John McCain introduced a bill in the United States Senate attempting to limit children's exposure to pornography and other controversial material online. The final version of this bill passed as part of an omnibus spending bill on December 15, 2000 and was signed into law on December 21, 2000. The law, now known as the Children's Internet Protection Act (CIPA), requires schools and libraries to operate a technology protection measure on any of their computers with Internet access, one that protects against access to visual depictions that are obscene, child pornography, or harmful to minors, and further requires that such a measure be in use whenever minors use those computers. The law also provides that the protection may be disabled for adults engaged in bona fide research or other lawful purposes.

Although previous attempts to restrict indecent or objectionable Internet content had failed Supreme Court challenges on First Amendment grounds, CIPA took a completely different approach. The federal government lacked any direct means of controlling local school or library boards. Many schools and libraries, however, used universal service fund discounts, derived from the universal service tax paid by telecommunications users, to purchase Internet access, computers and networking equipment. CIPA requires that schools and libraries using these so-called E-Rate discounts purchase and use a technology protection measure on every computer connected to the Internet. Interestingly, CIPA does not provide funding for the purchase of the required technology protection measures.

With the advent of CIPA, a huge market suddenly opened up for content-control software. These censorware products sit between a client and the Internet, filtering requests to sites containing material that matches categories the site administrator has defined as objectionable or inappropriate. The technical means by which this is accomplished vary slightly from product to product, but the result is the same: Web and other Internet service requests are intercepted, a filtering decision is made, and the request either proceeds or is rejected and logged.
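The decision logic itself is simple enough to sketch. Everything below is a hypothetical illustration in Python: the hostnames, the categories and the flat lookup table stand in for a vendor's proprietary database, and real products intercept traffic at the proxy or network layer rather than through a single function call.

    # Minimal sketch of the allow/deny decision described above.
    import logging
    from urllib.parse import urlparse

    logging.basicConfig(level=logging.INFO, format="%(message)s")

    # Hypothetical excerpt of a vendor-style database: hostname -> category.
    CATEGORY_DB = {
        "example-adult.com": "pornography",
        "example-casino.com": "gambling",
        "example-news.com": "news",
    }

    # Categories the site administrator has marked as objectionable.
    BLOCKED_CATEGORIES = {"pornography", "gambling"}

    def filter_request(url):
        """Return True if the request may proceed, False if it is blocked."""
        host = urlparse(url).hostname or ""
        category = CATEGORY_DB.get(host, "uncategorized")
        allowed = category not in BLOCKED_CATEGORIES
        logging.info("%s %s (category: %s)",
                     "ALLOW" if allowed else "DENY", url, category)
        return allowed

    filter_request("http://example-adult.com/pics.html")  # DENY, logged
    filter_request("http://example-news.com/story")       # ALLOW, logged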

With the market came the players. SurfControl and Websense hold the largest share of the filtering market, alongside smaller vendors such as 8e6, Lightspeed, S4F and Barracuda Networks. Open source solutions such as SquidGuard and DansGuardian evolved as well. Because the technological mechanism for content control does not differ much across the major vendors, their databases of categorized sites become the primary differentiator, and each vendor's marketing and sales effort revolves around the size, category count and quality of its database. Websense, for instance, claims that its database contains more than 6 million site postings, representing more than 1.1 billion Web pages, and that it adds an average of 25,000 sites weekly. Meanwhile, Lightspeed Systems claims an industry-leading content filtering database of more than 15 million domains, IP addresses and URLs, growing by 500,000 sites per month.

Smaller vendors such as S4F tout the quality of their databases rather than their size, claiming that a human reviews and classifies each website before it enters the database. This, they say, minimizes the classification errors common in larger databases populated by automated systems. While an admirable goal, manually classifying even 1,000 websites is a major task, much less keeping pace with the 25,000-per-week average claimed by Websense or the truly incredible 500,000 sites per month claimed by Lightspeed. Assuming these database statistics are even remotely accurate, how do they stack up against the Internet as a whole?

According to the June 2007 Web Server Survey conducted by Netcraft, 122,000,635 web sites responded, an increase of almost 4 million sites over the May survey. This number is staggeringly large when compared against the database sizes claimed by the content filtering industry. Worse, the 122 million figure counts host names, not the pages within each site, so the possible number of URLs displaying objectionable material is far larger still. Assuming that just 1% of the 122 million sites contains porn or other objectionable content, an estimate that is likely naively conservative, that is over 1.2 million sites, an amount of material that is almost obscene. Further, the growth of new sites on the Internet is accelerating rapidly: the 4 million sites added this month is up from just over 1.5 million sites a month in 2005. Using the entirely fictitious 1%, that is still 40,000 new porn sites a month, leaving little room for the other 89 filtering categories most content filtering companies claim to track.
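To make the arithmetic explicit, the numbers work out as follows; the 1% rate is the article's admittedly fictitious assumption, not a measured figure.

    # Back-of-the-envelope estimate using the Netcraft June 2007 figures.
    total_sites = 122_000_635        # sites responding, June 2007
    new_sites_per_month = 4_000_000  # approximate monthly growth
    objectionable_rate = 0.01        # assumed, purely for illustration

    print(f"Existing objectionable sites: ~{total_sites * objectionable_rate:,.0f}")
    # -> ~1,220,006
    print(f"New objectionable sites per month: ~{new_sites_per_month * objectionable_rate:,.0f}")
    # -> ~40,000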

With growth like this, the idea that all new sites are human reviewed is simply not plausible. The headcount and training necessary to accurately and efficiently classify the wave of new websites would be enormous, and maintaining an existing database, purging sites that go dark or change content, would alone be well nigh impossible for even a large company. Thus the larger players use automated engines to classify websites, and, of course, misclassification is extremely common.
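For a sense of scale, assume, generously and purely for illustration, that a trained reviewer can accurately classify 200 sites in an eight-hour day; that throughput figure is an assumption, not a vendor number.

    # Rough staffing estimate for human review at Lightspeed's claimed rate.
    sites_per_month = 500_000        # Lightspeed's claimed monthly growth
    sites_per_reviewer_day = 200     # assumed reviewer throughput
    workdays_per_month = 21

    reviewers = sites_per_month / (sites_per_reviewer_day * workdays_per_month)
    print(f"Full-time reviewers needed: ~{reviewers:.0f}")  # -> ~119

And that is for new sites alone, before anyone re-reviews the existing database.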

Since the tools used to classify websites are generally proprietary to the filtering company, no real statistics exist on their accuracy. A rough analog can be drawn from spam filtering engines, whose accuracy typically ranges between 85% and 95%. Splitting the difference and assuming an automated content classification engine runs at 90% accuracy, the databases still accumulate enormous amounts of corruption every month. Compounding the problem, sites change ownership, content or direction regularly, leaving previous classifications inaccurate and largely unrevised.
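Applied to the vendors' own claimed growth rates, a 90% accurate engine corrupts a database quickly. The sketch below reuses the figures quoted earlier, with the accuracy borrowed from the spam filtering analog and the weekly-to-monthly conversion being this author's approximation.

    # Misclassification volume implied by a 90% accurate engine.
    accuracy = 0.90  # rough analog from spam filtering, per the text
    claimed_growth = {
        "Websense": 25_000 * 4.33,   # claimed weekly rate, ~monthly
        "Lightspeed": 500_000,       # claimed monthly growth
    }
    for vendor, monthly_sites in claimed_growth.items():
        errors = monthly_sites * (1 - accuracy)
        print(f"{vendor}: ~{errors:,.0f} misclassified entries per month")
    # -> Websense: ~10,825; Lightspeed: ~50,000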

These corrupt databases are then sold to cash-strapped schools and libraries, which must buy them simply to qualify for E-Rate funding. Although the content filter is required by CIPA, the cost of the filter itself cannot be recouped through E-Rate. The end result is that schools and libraries, not known for being resource-rich environments to begin with, are forced to adopt broken or shoddy products merely to qualify for funding to purchase the technology needed to keep pace with the educational needs of each new generation of students.

The real beauty of this situation is that E-Rate funding has dropped dramatically since the Iraq war began, and is now available only to the very poorest schools. Yet in order to even be considered for funding, a school or library must already have a content filtering solution in place. This means that schools and libraries that can barely afford books or teachers' salaries must shell out money to buy and support a content filter before they can even apply for funding for new computers and Internet access. It's a classic chicken-and-egg conundrum, with a huge dose of government-mandated snake oil on the side.

