Also, does it take most people two cycles to get fully indexed? Or new pages get perfectly indexed from crawl to update in the same month?
To be honest, it's not something I check often so if Google's changed I may not have noticed.
Robots Exclusion Protocol asks robots not to fetch the resource. Listing the URL is another matter. I remember GoogleGuy mentioning something about the META robots tag removing the whole listing if set to "noindex". Has anyone here tested that?
Paully:
> ...does it take most people two cycles to get fully indexed? Or new pages get perfectly indexed from crawl to update in the same month?
I prefer to think in terms of two updates. If your main links have high PageRank, then you are unlikely to get them credited from a late crawl.
For example, if you put up a new page last week, and linked it from a PR7 page, then Google will probably have included the page by now, but I would expect it to get its PR6 (and better rankings) at the next update.
Note that as Google gets fresher, this may change.
Anyone else have any more info on the results with no description?
I should have made that the title of this thread; it would have brought more traffic, lol. Hey, I'm new to this SEO thing. :)
> Robots Exclusion Protocol asks robots not to fetch the resource. Listing the URL is another matter. I remember GoogleGuy mentioning something about the META robots tag removing the whole listing if set to "noindex". Has anyone here tested that?
Yes, and it doesn't work in a practical application. At least it doesn't work if you disallow a page in robots.txt and put the "noindex,nofollow" robots metatag on the page.
I would like to try removing the disallow in robots.txt, leaving only the noindex metatag, but then that creates a catch-22: What about other search engines that don't recognize the on-page noindex robots metatag?
And it's really a double catch-22: How can Googlebot know about the on-page metatag if the page is disallowed in robots.txt? It can't.
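To make the double catch-22 concrete, here is a minimal sketch using Python's standard library. The bot name, the example.com URL, the sample robots.txt, and the sample page are all made up for illustration; this is not how Googlebot is actually implemented, only how any robots.txt-compliant crawler has to behave.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt and page, invented for this illustration.
ROBOTS_TXT = """\
User-agent: *
Disallow: /contact/
"""
PAGE_URL = "http://www.example.com/contact/index.html"
PAGE_HTML = (
    '<html><head><meta name="robots" content="noindex,nofollow"></head>'
    "<body>Contact us</body></html>"
)


class RobotsMetaParser(HTMLParser):
    """Collects the directives from any <meta name="robots"> tag."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr_map = dict(attrs)
        if (attr_map.get("name") or "").lower() == "robots":
            content = attr_map.get("content") or ""
            self.directives.update(
                d.strip().lower() for d in content.split(",") if d.strip()
            )


def directives_seen_by_crawler(robots_txt, url, html, agent="ExampleBot"):
    """Return the meta robots directives a compliant crawler can read,
    or None if robots.txt forbids fetching the page at all."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    if not rules.can_fetch(agent, url):
        # The page is never downloaded, so its meta tag is never seen.
        return None
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


# With the Disallow in place the noindex tag is invisible to the crawler.
print(directives_seen_by_crawler(ROBOTS_TXT, PAGE_URL, PAGE_HTML))
# With an empty robots.txt the crawler fetches the page and sees the tag.
print(directives_seen_by_crawler("", PAGE_URL, PAGE_HTML))
```

In other words, the on-page tag can only take effect for engines that are allowed to fetch the page, which is exactly the trade-off described above.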
So, regardless of what the letter of the RFC says, I think Google should not list links to robots.txt-disallowed pages. Otherwise, it just doesn't work practically. I'd prefer they not mention disallowed pages in their index at all, since this is part of my spam-avoidance strategy. There are other engines that do, and other engines that don't. I'd prefer that they don't, but because of the double catch-22 and the RFC, I've just learned to live with it.
Jim
My (unpopular) view is that what we publish to the Web is public. /robots.txt is a nice convention for asking search engines not to bother indexing spider traps (session IDs, etc.), but it is not useful from a privacy or security point of view. Please keep in mind that my view derives from the days when the 'webmaster' was the server admin, not the document author.
Spam avoidance is a problem, but the spam bots will wilfully ignore your /robots.txt and META robots.
<added>
WebGuerrilla, this used to be a major feature; I think "The Anatomy of a Large-Scale Hypertextual Web Search Engine" discusses it. It's probably good that they stopped listing email addresses though :).
> robots.txt is ... not useful from a privacy or security point of view.
> Spam avoidance is a problem, but the spam bots will wilfully ignore your /robots.txt and META robots.
There are the "Contact WidgetUsers.org!" front pages, which lead to e-mail contacts and phone numbers - I'd like those URLs kept out of the index, even though they are now forms-based.
There are also some pages on my sites which are not very useful outside the context of the pages which link to them. I know this from confused feedback from searchers who have entered those pages directly, and I am trying to use a totally objective assessment of their out-of-context worthlessness here. I'd like them kept out of search results in order to improve the user experience (where've I heard that before?). But this is a secondary concern.
I've been on the 'net long enough to remember the academic goal of a totally open information space. Most of my sites fit that model: most are non-commercial, information-source-type sites (which never have any problems with placement in Google, so I never have to watch the update threads). On the other hand, the UCE interests now on the web couldn't give a rat's behind about making information freely available for the good of all mankind. All they want is 5 million e-mail addresses to sell on a CD to all buyers.
So, in my opinion, Google should bend the robots RFC just a little, and do what some other SEs do: not list URLs which go to pages disallowed in robots.txt.
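As a rough sketch of what that policy would mean on the indexing side: URLs discovered through links would be checked against the site's robots.txt before being listed, not just before being fetched. The function name, bot name, and sample URLs below are hypothetical, and this says nothing about how any real engine builds its index.

```python
from urllib.robotparser import RobotFileParser


def listable_urls(discovered_urls, robots_txt, agent="ExampleBot"):
    """Keep only link-discovered URLs that robots.txt does not disallow."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    # A disallowed URL is dropped from the listing entirely,
    # even though other pages link to it.
    return [url for url in discovered_urls if rules.can_fetch(agent, url)]
```

Under this approach a URL under a disallowed path, such as a contact page, would never show up as a bare, description-less result.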
It's just a wish. I hope they will re-consider the trade-offs between strict compliance with the robots standard, and the real world concerns of spam and abuse of resources. But I have learned to live with their policy as it is.
Jim