[Infowarrior] - Publishers seek to block Internet search engines from additional content
Richard Forno
rforno at infowarrior.org
Sat Dec 1 02:15:38 UTC 2007
Publishers seek to block Internet search engines from additional content
The Associated Press
Thursday, November 29, 2007
http://www.iht.com/bin/printfriendly.php?id=8532179
NEW YORK: Seeking greater control of their content, leading news
organizations and other publishers said Thursday they would push for a
revision to the technology that controls how search engines access that
content.
Google, Yahoo and other top search companies now voluntarily respect a Web
site's wishes as declared in a text file known as robots.txt. The file
allows a site to block indexing of individual Web pages, specific
directories or the entire site.
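For illustration, a minimal robots.txt using the standard User-agent and
Disallow directives might look like the following; the paths and crawler
name are made up, but the directives themselves are the ones search
engines already recognize.

    # Rules for all crawlers
    User-agent: *
    Disallow: /private/page.html    # block a single page
    Disallow: /archive/             # block an entire directory

    # Rules for one named crawler (hypothetical name)
    User-agent: ExampleBot
    Disallow: /                     # block this crawler from the whole site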
The proposal, presented by a consortium of publishers at the headquarters of
The Associated Press, would add to those commands, further restricting
access.
The current system does not give sites "enough flexibility to express our
terms and conditions on access and use of content," said Angela Mills Wade,
executive director of the European Publishers Council, one of the
organizations behind the proposal. "That is not surprising. It was invented
in the 1990s, and things move on."
Robots.txt was developed in 1994 in part because of concerns that some
crawlers were straining Web sites by visiting them repeatedly or rapidly. As
search engines expanded to offer services for displaying news and scanning
printed books, news organizations and book publishers began to complain.
The proposed extensions, known as Automated Content Access Protocol, partly
grew out of those disputes. Leading the drive for the extensions were groups
representing publishers of newspapers, magazines, online databases, books
and journals.
News publishers complained that Google was posting their news summaries,
headlines and photos without permission. Google asserted that "fair use"
provisions of copyright laws applied, though it eventually settled a lawsuit
with Agence France-Presse and agreed to pay The Associated Press without a
lawsuit being filed. Financial terms have not been disclosed.
The new automated commands will use the same robots.txt file that search
engines now recognize. Web sites could start using them Thursday alongside
the existing commands.
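The article does not spell out the new commands, but as a rough sketch of
the idea, extended permission lines would sit in the same file next to the
familiar ones. The ACAP-prefixed directive names and paths below are
illustrative assumptions drawn from the project's early drafts, not a
definitive rendering of the specification.

    # Existing commands, still honored by current crawlers
    User-agent: *
    Disallow: /private/

    # ACAP-style extensions (directive names and paths are illustrative)
    ACAP-crawler: *
    ACAP-disallow-crawl: /archive/
    ACAP-allow-crawl: /news/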
Like the current robots.txt, the new protocol would be voluntary, so search
engines ultimately would have to agree to recognize the commands. Search
engines could ignore them and leave it to courts to rule on any disputes
over fair use.
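To make the voluntary-compliance model concrete, here is a small Python
sketch, using the standard library's urllib.robotparser, of how a
well-behaved crawler consults robots.txt before fetching a page. The URLs
and crawler name are placeholders, and the parser knows nothing of the
proposed ACAP additions.

    from urllib.robotparser import RobotFileParser

    # Fetch and parse the site's robots.txt (example.com is a placeholder).
    rp = RobotFileParser()
    rp.set_url("https://example.com/robots.txt")
    rp.read()

    # A compliant crawler asks permission before requesting a page;
    # nothing forces it to honor the answer, since compliance is voluntary.
    url = "https://example.com/archive/story.html"
    if rp.can_fetch("ExampleBot", url):
        print("Allowed to fetch", url)
    else:
        print("robots.txt asks crawlers to skip", url)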
A Google spokeswoman, Jessica Powell, said the company supported all efforts
to bring Web sites and search engines together but needed to evaluate the
new protocol to ensure it could meet the needs of millions of Web sites, not
just those of a single community.
"Before you go and take something entirely on board, you need to make sure
it works for everyone," Powell said.
Organizers of the new protocol tested their system with the French search
engine Exalead but had only informal discussions with others. Google, Yahoo
and Microsoft sent representatives to the announcement, and Gavin O'Reilly,
president of the World Association of Newspapers, said their "lack of public
endorsement has not meant any lack of involvement by them."
Danny Sullivan, editor in chief of the industry Web site Search Engine Land,
said robots.txt "certainly is long overdue for some improvements." But he
questioned whether the new protocol would do much to prevent legal battles.
And because it is an initiative of news publishers, he said, it might lack
attributes that blogs, online retailers and other Web sites would need in an
updated robots.txt.