The desire for greater control over how search engines index and display Web sites is driving an effort by leading news organizations and other publishers to revise a 13-year-old technology for restricting access.
The new proposal, to be unveiled Thursday by a consortium of publishers at the global headquarters of The Associated Press, seeks to have those extra commands — and more — apply across the board. Sites, for instance, could try to limit how long search engines may retain copies in their indexes, or tell the crawler not to follow any of the links that appear within a Web page.
The formal rules allow a site to block indexing of individual Web pages, specific directories or the entire site, though some search engines have added their own commands
They were never "formal" rules. They were pretty informal from the off. As I recall it was some long since gone engine that started the protocol and others informally adopted it.
Let's see if anything gets beyond a press release.
Sitemaps vs robots.txt. As I see it:
Sitemaps: "Crawl here. Index this. And if you want to follow other links, that's fine with me".
robots.txt: "Don't crawl here. Don't index this. And if you do or don't find stuff in other places, that's none of my business."
I don't think given that difference, one can ever "steal the lead" over the other really.
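To make the "don't crawl here" versus "crawl here" split concrete, here is a minimal sketch using Python's standard `urllib.robotparser` (`site_maps()` needs Python 3.8 or later). The robots.txt content, the `example.com` URLs and the `ExampleBot` user agent are all made up for illustration. It only shows how one parser reads the two kinds of lines, not how any particular search engine behaves.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: two exclusion rules plus a Sitemap pointer.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Disallow: /drafts/page.html
Sitemap: https://example.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Exclusion side: "Don't crawl here."
blocked = parser.can_fetch("ExampleBot", "https://example.com/private/report.html")
allowed = parser.can_fetch("ExampleBot", "https://example.com/news/today.html")

# Inclusion side: the Sitemap line is only a pointer to a list of URLs.
sitemaps = parser.site_maps()

print("private page fetchable:", blocked)
print("news page fetchable:", allowed)
print("sitemaps listed:", sitemaps)
```

The two mechanisms answer different questions, which is why neither really replaces the other: `can_fetch` says where a polite crawler should stay out, while the Sitemap entry just advertises where the site would like it to look.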
From my quick overview of the spec, I could only see tags to tell search engines how much of your content they can show in snippets and images. It looks like they are trying to limit how much of their content can be shown in the search results. Maybe one day they intend for this to be legally binding and some sort of override of copyright fair-use.
robots.txt + sitemap is fine for 99% of websites I think. robots.txt could be tweaked, but I think the big search engines will dictate that through slow evolution (like the Sitemap tag)
So, ACAP 1 will help the genuine sites, and the big search engines monetize things further. I have no problem with that, but it won't do anything to stop rogue bots, which I would have thought are the biggest problem.
Jessica Powell, a spokesman for Google, said the company supported all efforts to bring Web sites and search engines together but needed to evaluate ACAP to ensure it can meet the needs of millions of Web sites, not just those of a single community.
"Before you go and take something entirely on board, you need to make sure it works for everyone," Powell said.
but it won't do anything to stop rogue bots, which I would have thought are the biggest problem
The new ACAP commands will use the same robots.txt file that search engines now recognize.
Shouldn't they use a different one, say acap.txt? Only I can see this possibly breaking a lot of robots.txt files because once you start to add complexity, errors always creep in.
Other than that, I think it's a positive move for webmasters who want more control, provided there are enough options.
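On the worry about breaking existing robots.txt files, one quick check is to see what an existing parser does when it meets lines it does not know. The sketch below uses Python's standard `urllib.robotparser`. The ACAP line follows the directive form quoted in this thread, but the path and the `time-limit` value are placeholders I made up, and `ExampleBot` is a hypothetical user agent. It says nothing about how real crawlers or hand-rolled parsers cope.

```python
from urllib.robotparser import RobotFileParser

# Ordinary rules that crawlers already understand.
LEGACY_RULES = """\
User-agent: *
Disallow: /archive/
"""

# Illustrative ACAP-style line; the path and time-limit value are invented.
ACAP_LINES = """\
ACAP-allow-index: /news/ time-limit=7-days
"""


def build_parser(text):
    """Parse robots.txt text and return the resulting parser."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


plain = build_parser(LEGACY_RULES)
mixed = build_parser(LEGACY_RULES + ACAP_LINES)

TEST_URLS = [
    "https://example.com/archive/2007/story.html",
    "https://example.com/news/today.html",
]

# Compare the verdicts with and without the extra lines present.
results = [
    (url, plain.can_fetch("ExampleBot", url), mixed.can_fetch("ExampleBot", url))
    for url in TEST_URLS
]
for url, before, after in results:
    print(url, "before:", before, "after:", after)
```

This particular parser skips field names it does not recognise, so the legacy verdicts come out the same either way. That does not settle the human-error side of the concern: a longer, more complex file is still easier to get wrong by hand.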
I don't really see how the publishers can force the SEs to index and display their content only in a manner and time frame that the publishers approve of. I think the SEs have a pretty good argument: "Okay, you don't like the way we use your material? Well then just disallow us in your robots.txt and we'll eventually remove all your content from our index. Either that or let us display it as we want. We're fine either way."
On the other hand, the publishers are in a tough spot. Revenues are plummeting for dailies, so they absolutely need that revenue off the net. But the NYT is the only one that I "subscribe" to (daily email), so any other daily I end up reading is because I get there via a search engine. I bet most people don't subscribe to any online version of a traditional newspaper.
So either they say, "Okay Google, you can't have our content and we'll prove our point as we go out of business" or they can say "Okay Google, we think you're in violation of fair use and are bleeding us dry over the long term, but for now we're just going to complain." Either way, the publisher is in the weak position. The search engines will do fine without them, but the newspapers are going to struggle without the engines.
Am I missing something? What's in it for the search engines? The only thing I can see is that the publishers agree not to sue, but like I say, I still think even if the publishers sue, the SEs win (i.e. they remove all publisher content and stop indexing it until the publisher cries uncle).
>>A number of publishers are already involved in the project along with major search engines.
The list then includes not a single SE, major or otherwise.
By the way, in the previous comment, I was thinking specifically of the directive:
ACAP-allow-index: resource-specification time-limit=value
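For anyone who wants to poke at lines of that shape, here is a rough sketch of splitting one into its parts. It assumes only what the line above shows: a field name, a resource specification, and optional `name=value` qualifiers separated by whitespace. The `parse_acap_line` helper, the `AcapDirective` type and the `7-days` value are my own inventions for illustration and are not taken from the ACAP spec.

```python
from typing import Dict, NamedTuple, Optional


class AcapDirective(NamedTuple):
    field: str                  # e.g. "ACAP-allow-index"
    resource: str               # the resource specification
    qualifiers: Dict[str, str]  # e.g. {"time-limit": "..."}


def parse_acap_line(line: str) -> Optional[AcapDirective]:
    """Split an ACAP-style line; return None for anything else."""
    field, sep, rest = line.partition(":")
    field = field.strip()
    if not sep or not field.lower().startswith("acap-"):
        return None
    tokens = rest.split()
    if not tokens:
        return None
    resource, *extras = tokens
    qualifiers = {}
    for token in extras:
        name, eq, value = token.partition("=")
        if eq:  # keep only well-formed name=value pairs
            qualifiers[name] = value
    return AcapDirective(field, resource, qualifiers)


# Hypothetical example line; the time-limit value is treated as opaque text.
directive = parse_acap_line("ACAP-allow-index: /news/ time-limit=7-days")
print(directive)
print(parse_acap_line("Disallow: /private/"))
```

The qualifier value is deliberately left as an uninterpreted string, since the thread does not say what values the spec permits.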