Most of the existing commands - even the basic ones - seem to fail way too often, so I'm not sure more commands will be better than the Sitemaps protocol, which seems to be stealing a lead over robots.txt.
|The formal rules allow a site to block indexing of individual Web pages, specific directories or the entire site, though some search engines have added their own commands |
They were never "formal" rules. They were pretty informal from the off. As I recall it was some long since gone engine that started the protocol and others informally adopted it.
Let's see if anything gets beyond a press release.
Here's the tech framework [the-acap.org] (30 page PDF). Looks like I'll have more than enough opportunities to shoot myself in the foot if this is ever adopted.
The article is so vague on what exactly they're proposing [edit: thanks Jim]. It would be nice if there were some regex capability (like the Google extensions, but a bit more).
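For anyone who hasn't seen them, the Google extensions mentioned above amount to simple wildcards rather than full regex - a `*` that matches any run of characters and a `$` that anchors the end of the URL. The paths below are made up for illustration:

```
User-agent: Googlebot
# '*' matches any sequence of characters in the path
Disallow: /*?sessionid=
# '$' anchors the pattern to the end of the URL
Disallow: /*.pdf$
```

Useful, but a long way short of what you'd want for fine-grained licensing rules.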
Sitemaps vs robots.txt. As I see it:
Sitemaps: "Crawl here. Index this. And if you want to follow other links, that's fine with me".
robots.txt: "Don't crawl here. Don't index this. And if you do or don't find stuff in other places, that's none of my business."
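The contrast above in (made-up, minimal) file terms - a sitemap is an invitation, robots.txt a restriction:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- sitemap.xml: "Crawl here. Index this." -->
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://www.example.com/page.html</loc>
    <lastmod>2007-11-30</lastmod>
  </url>
</urlset>
```

```
# robots.txt: "Don't crawl here."
User-agent: *
Disallow: /private/
```

Everything a sitemap doesn't mention is still fair game; everything robots.txt doesn't disallow is too.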
Given that difference, I don't think one can ever really "steal the lead" over the other.
This seems aimed at large online newspapers etc. There is nothing here for the average webmaster.
From my quick overview of the spec, I could only see tags to tell search engines how much of your content they can show in snippets and images. It looks like they are trying to limit how much of their content can be shown in the search results. Maybe one day they intend for this to be legally binding, some sort of override of copyright fair use.
robots.txt + sitemap is fine for 99% of websites, I think. robots.txt could be tweaked, but I think the big search engines will dictate that through slow evolution (like the Sitemap tag).
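As a quick sketch of why the current setup works for most sites: Python's standard library ships a robots.txt parser that implements exactly these crawl/no-crawl semantics. The rules and URLs below are made up for illustration:

```python
# Sketch: how a well-behaved crawler reads robots.txt, using the
# standard-library parser. Rules and URLs here are illustrative only.
from urllib.robotparser import RobotFileParser

rules = [
    "User-agent: *",
    "Disallow: /private/",
]

rp = RobotFileParser()
rp.parse(rules)

# robots.txt only says "don't crawl here"; anything not disallowed is allowed.
print(rp.can_fetch("*", "https://example.com/private/page.html"))  # False
print(rp.can_fetch("*", "https://example.com/public/page.html"))   # True
```

Note there's nothing in that model about snippet length, caching periods, or licensing - which is the gap ACAP is apparently trying to fill.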
Tepid response from the search engines, not surprisingly.
Ack! The first thing they can do is redesign their website - [the-acap.org...] - with a readable colour scheme. I gave up after about 15 seconds, not a good start.
Yes, I agree, that site is not easy on the eyes.
So, ACAP 1 will help the genuine sites, and the big search engines monetize things further. I have no problem with that, but it won't do anything to stop rogue bots, which I would have thought are the biggest problem.
|Jessica Powell, a spokesman for Google, said the company supported all efforts to bring Web sites and search engines together but needed to evaluate ACAP to ensure it can meet the needs of millions of Web sites, not just those of a single community. |
"Before you go and take something entirely on board, you need to make sure it works for everyone," Powell said.
|but it won't do anything to stop rogue bots, which I would have thought are the biggest problem |
|The new ACAP commands will use the same robots.txt file that search engines now recognize. |
Shouldn't they use a different one, say acap.txt? Only I can see this possibly breaking a lot of robots.txt files, because once you start to add complexity, errors always creep in.
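For what it's worth, at least one common parser seems to tolerate unfamiliar directives. A quick check with Python's standard-library parser, feeding it a hypothetical ACAP line (the directive and values here are made up) mixed in with a standard rule:

```python
# Quick check: does a common robots.txt parser choke on an unknown
# directive? The ACAP line below is hypothetical, for illustration only.
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "ACAP-allow-index: / time-limit=20071212",  # unknown to this parser
    "Disallow: /archive/",
])

# The unknown line is silently ignored; the standard rule still applies.
print(rp.can_fetch("*", "https://example.com/archive/2007.html"))  # False
```

So well-written parsers should cope, but that's exactly the problem: the ACAP rules get silently dropped rather than honoured, and sloppier parsers may do worse.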
Other than that, I think it's a positive move for webmasters who want more control, provided there are enough options.
I think the search engines have the upper hand by far, so I'll be surprised to see this go anywhere fast. The original robots.txt was about bandwidth, and that was a concern for everyone, so the SEs had to listen. This is more about the finer points of copyright - basically giving the publisher the ability to say "We give you a limited license to republish this content in your cache until 12 December 2007". Most site owners on the web don't care about that, so the SEs don't have to listen.
I don't really see how the publishers can force the SEs to index and display their content only in a manner and time frame that the publishers approve of. I think the SEs have a pretty good argument: "Okay, you don't like the way we use your material? Well then just disallow us in your robots.txt and we'll eventually remove all your content from our index. Either that or let us display it as we want. We're fine either way."
On the other hand, the publishers are in a tough spot. Revenues are plummeting for dailies, so they absolutely need that revenue off the net. But the NYT is the only one that I "subscribe" to (daily email), so any other daily I end up reading is because I get there via a search engine. I bet most people don't subscribe to any online version of a traditional newspaper.
So either they say, "Okay Google, you can't have our content and we'll prove our point as we go out of business" or they can say "Okay Google, we think you're in violation of fair use and are bleeding us dry over the long term, but for now we're just going to complain." Either way, the publisher is in the weak position. The search engines will do fine without them, but the newspapers are going to struggle without the engines.
Am I missing something? What's in it for the search engines? The only thing I can see is that the publishers agree not to sue, but like I say, I still think even if the publishers sue, the SEs win (i.e. they remove all publisher content and stop indexing it until the publisher cries uncle).
From the ACAP FAQ
>>A number of publishers are already involved in the project along with major search engines.
The list then includes not a single SE, major or otherwise.
By the way, in the previous comment, I was thinking specifically of the directive:
ACAP-allow-index: resource-specification time-limit=value
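In a robots.txt-style file, that directive might look something like this - a hypothetical sketch based only on the syntax above; the actual grammar is defined in the ACAP technical framework PDF:

```
# Hypothetical ACAP usage; resource paths and date value are illustrative.
User-agent: *
ACAP-allow-index: /articles/ time-limit=2007-12-12
Disallow: /private/
```

Which would be exactly the "limited license until a given date" arrangement discussed earlier - if the engines ever agree to honour it.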