
Sitemaps, Meta Data, and robots.txt Forum

Proposal for robots.txt To Have Greater Flexibility

 2:40 pm on Nov 29, 2007 (gmt 0)

The desire for greater control over how search engines index and display Web sites is driving an effort by leading news organizations and other publishers to revise a 13-year-old technology for restricting access.

The new proposal, to be unveiled Thursday by a consortium of publishers at the global headquarters of The Associated Press, seeks to have those extra commands and more apply across the board. Sites, for instance, could try to limit how long search engines may retain copies in their indexes, or tell the crawler not to follow any of the links that appear within a Web page.

Proposal for robots.txt To Have Greater Flexibility [news.yahoo.com]



 4:34 pm on Nov 29, 2007 (gmt 0)

Most of the existing commands - even the basic ones - seem to fail way too often, so I'm not sure more will be better than the sitemaps protocol, which seems to be stealing a lead over robots.txt.

The formal rules allow a site to block indexing of individual Web pages, specific directories or the entire site, though some search engines have added their own commands

They were never "formal" rules. They were pretty informal from the off. As I recall it was some long since gone engine that started the protocol and others informally adopted it.

Let's see if anything gets beyond a press release.
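For anyone who hasn't looked at the existing protocol lately, here is a minimal sketch of the three kinds of blocking the article mentions (a single page, a directory, the whole site), checked with Python's standard `urllib.robotparser`. The paths and the `BadBot` name are made up for illustration, and this only shows what a well-behaved crawler would conclude, not what any particular engine actually does.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt; the paths and bot name are illustrative only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Disallow: /drafts/page.html

User-agent: BadBot
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(agent, path):
    """Return True if a compliant crawler named `agent` may fetch `path`."""
    return parser.can_fetch(agent, path)


if __name__ == "__main__":
    for agent, path in [
        ("AnyBot", "/index.html"),
        ("AnyBot", "/private/secret.html"),
        ("AnyBot", "/drafts/page.html"),
        ("BadBot", "/index.html"),
    ]:
        print(agent, path, allowed(agent, path))
```

Nothing here is enforced, of course: the file is a request, and a crawler that ignores it will fetch whatever it likes.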


 5:02 pm on Nov 29, 2007 (gmt 0)

Here's the tech framework [the-acap.org] (30 page PDF). Looks like I'll have more than enough opportunities to shoot myself in the foot if this is ever adopted.


 5:04 pm on Nov 29, 2007 (gmt 0)

The article is so vague on what exactly they're proposing [edit: thanks Jim.]. It would be nice if there was some regex capability (like the Google extensions, but a bit more).
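To make the wildcard idea concrete, here is a rough sketch of how the Google-style extensions could be matched, assuming the usual reading that `*` stands for any run of characters and a trailing `$` anchors the end of the URL. The function names are my own and this is not any engine's actual matcher, just one way to express the same rules with Python's `re` module.

```python
import re


def pattern_to_regex(pattern):
    """Translate a robots.txt path pattern with '*' and a trailing '$' into a regex."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape each literal piece, then join the pieces with '.*' where '*' appeared.
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))


def matches(pattern, path):
    """Rules match from the start of the path; without '$' they are prefix matches."""
    return pattern_to_regex(pattern).match(path) is not None


if __name__ == "__main__":
    print(matches("/*.pdf$", "/docs/report.pdf"))
    print(matches("/*.pdf$", "/docs/report.pdf?x=1"))
    print(matches("/private*/", "/private-area/x"))
```

Full regex support would go well beyond this, which is probably why the engines stopped at two metacharacters.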

Sitemaps vs robots.txt. As I see it:

Sitemaps: "Crawl here. Index this. And if you want to follow other links, that's fine with me".

robots.txt: "Don't crawl here. Don't index this. And if you do or don't find stuff in other places, that's none of my business."

I don't think given that difference, one can ever "steal the lead" over the other really.


 6:29 pm on Nov 29, 2007 (gmt 0)

This seems aimed at large online newspapers etc. There is nothing here for the average webmaster.

From my quick overview of the spec, I could only see tags to tell search engines how much of your content they can show in snippets and images. It looks like they are trying to limit how much of their content can be shown in the search results. Maybe one day they intend for this to be legally binding and some sort of override of copyright fair-use.

robots.txt + sitemap is fine for 99% of websites I think. robots.txt could be tweaked, but I think the big search engines will dictate that through slow evolution (like the Sitemap tag).
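For reference, the Sitemap tag is just one more line in robots.txt, so picking it up takes very little code. A small sketch, assuming the common `Sitemap: <url>` form; the example.com URLs and the helper name are placeholders of mine.

```python
def sitemap_urls(robots_txt):
    """Collect the URLs named on 'Sitemap:' lines of a robots.txt file."""
    urls = []
    for line in robots_txt.splitlines():
        # Split on the first colon only, so the colon inside 'https://' survives.
        key, sep, value = line.strip().partition(":")
        if sep and key.strip().lower() == "sitemap":
            urls.append(value.strip())
    return urls


if __name__ == "__main__":
    # Hypothetical file contents for illustration.
    example = """\
User-agent: *
Disallow: /private/

Sitemap: https://www.example.com/sitemap.xml
sitemap: https://www.example.com/news-sitemap.xml
"""
    print(sitemap_urls(example))
```

The field name is matched case-insensitively here, which is an assumption on my part about how lenient a parser should be.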


 11:34 pm on Nov 29, 2007 (gmt 0)

Tepid response from the search engines, not surprisingly.


 12:16 am on Nov 30, 2007 (gmt 0)

Ack! The first thing they can do is redesign their website - [the-acap.org...] - with a readable colour scheme. I gave up after about 15 seconds, not a good start.



 11:47 am on Nov 30, 2007 (gmt 0)

Yes, I agree, that site is not easy on the eyes.

So, ACAP 1 will help the genuine sites, and the big search engines monetize things further. I have no problem with that, but it won't do anything to stop rogue bots, which I would have thought are the biggest problem.

Jessica Powell, a spokesman for Google, said the company supported all efforts to bring Web sites and search engines together but needed to evaluate ACAP to ensure it can meet the needs of millions of Web sites, not just those of a single community.

"Before you go and take something entirely on board, you need to make sure it works for everyone," Powell said.



 3:06 pm on Nov 30, 2007 (gmt 0)

but it won't do anything to stop rogue bots, which I would have thought are the biggest problem

Nah, I think this is the first shot across the bow by the publishers to stop rogue SEs. Maybe give it about a year or so when the publishers will be able to say something along the lines that the SEs have had enough time to comply with the publishers' new "terms of use."


 3:46 pm on Nov 30, 2007 (gmt 0)

The new ACAP commands will use the same robots.txt file that search engines now recognize.

Shouldn't they use a different one, say acap.txt? Only I can see this possibly breaking a lot of robots.txt files because once you start to add complexity, errors always creep in.

Other than that, I think it's a positive move for webmasters who want more control, provided there are enough options.


 4:45 pm on Nov 30, 2007 (gmt 0)

I think the search engines have the upper hand by far so I'll be surprised to see this go anywhere fast. The original robots.txt was about bandwidth and that was a concern for everyone, so the SEs had to listen. This is more about finer points of copyright - basically giving the publisher the ability to say "We give you a limited license to republish this content in your cache until 12 December 2007". Most site owners on the web don't care about that, so the SEs don't have to listen.

I don't really see how the publishers can force the SEs to index and display their content only in a manner and time frame that the publishers approve of. I think the SEs have a pretty good argument: "Okay, you don't like the way we use your material? Well then just disallow us in your robots.txt and we'll eventually remove all your content from our index. Either that or let us display it as we want. We're fine either way."

On the other hand, the publishers are in a tough spot. Revenues are plummeting for dailies, so they absolutely need that revenue off the net. But the NYT is the only one that I "subscribe" to (daily email), so any other daily I end up reading is because I get there via a search engine. I bet most people don't subscribe to any online version of a traditional newspaper.

So either they say, "Okay Google, you can't have our content and we'll prove our point as we go out of business" or they can say "Okay Google, we think you're in violation of fair use and are bleeding us dry over the long term, but for now we're just going to complain." Either way, the publisher is in the weak position. The search engines will do fine without them, but the newspapers are going to struggle without the engines.

Am I missing something? What's in it for the search engines? The only thing I can see is that the publishers agree not to sue, but like I say, I still think even if the publishers sue, the SEs win (i.e. they remove all publisher content and stop indexing it until the publisher cries uncle).


 4:58 pm on Nov 30, 2007 (gmt 0)

From the ACAP FAQ

>>A number of publishers are already involved in the project along with major search engines.

The list then includes not a single SE, major or otherwise.

By the way, in the previous comment, I was thinking specifically of the directive:

ACAP-allow-index: resource-specification time-limit=value
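To show what a crawler would have to do with a line like that, here is a rough sketch of splitting it into its parts. It assumes only the shape quoted above (field name, a resource, then `key=value` qualifiers); I haven't checked the spec for what a time-limit value actually looks like, so the value is kept as an opaque string and `SOME-VALUE` below is a placeholder, not real ACAP syntax.

```python
def parse_acap_line(line):
    """Split an 'ACAP-...: resource key=value ...' line into its parts.

    Returns None for lines that are not ACAP fields. Qualifier values are
    kept as raw strings because their real format is not assumed here.
    """
    field, sep, rest = line.partition(":")
    field = field.strip()
    if not sep or not field.lower().startswith("acap-"):
        return None
    tokens = rest.split()
    if not tokens:
        return None
    qualifiers = {}
    for token in tokens[1:]:
        key, eq, value = token.partition("=")
        if eq:
            qualifiers[key] = value
    return {"field": field, "resource": tokens[0], "qualifiers": qualifiers}


if __name__ == "__main__":
    # '/news/' and 'SOME-VALUE' are placeholders for illustration.
    print(parse_acap_line("ACAP-allow-index: /news/ time-limit=SOME-VALUE"))
    print(parse_acap_line("Disallow: /private/"))
```

Parsing is the easy part. Whether an engine would then honour the time limit is the whole argument.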


All trademarks and copyrights held by respective owners. Member comments are owned by the poster.
WebmasterWorld is a Developer Shed Community owned by Jim Boykin.
© Webmaster World 1996-2014 all rights reserved