Forum Moderators: open
<meta name="robots" content="noindex,nofollow">
on each of the pages in that directory, and yet Google has still added them to its index. There is no page title, no description and no cache; just the URL linked to the page.
Also, I've just placed this on a heavily bot-hit page of my site:
<meta name="robots" content="noindex,follow">
Should that stop Googlebot indexing the page yet allow it to continue through? Or is my growing knowledge substandard on this one?
robots.txt is:
User-agent: *
Disallow: /cgi-bin/
Disallow: /admin/
Disallow: /images/
Disallow: /ex/
Disallow: /signup/
Disallow: /redirection/
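For what it's worth, a well-behaved robot is supposed to check those rules before fetching anything. A quick sketch using Python's stdlib parser shows how the rules above should be interpreted (the hostname is just an example):

```python
# Sketch: how a well-behaved crawler consults robots.txt before fetching
# a URL. The rules mirror the ones quoted above; example.com is a
# placeholder hostname.
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /admin/
Disallow: /images/
Disallow: /ex/
Disallow: /signup/
Disallow: /redirection/
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

# A compliant bot must not fetch anything under the disallowed paths:
print(rp.can_fetch("Googlebot", "http://www.example.com/admin/login.html"))  # False
# ...but everything else is fair game:
print(rp.can_fetch("Googlebot", "http://www.example.com/index.html"))        # True
```

Note that "must not fetch" is all the protocol promises; as becomes clear later in this thread, it says nothing about whether a URL can still be *listed*.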
but I've had my suspicions that Google has been developing a completely new crawler that has been tested over the past couple of months which is why crawling and indexing has been so slow and stale.
on each of the pages in that directory, and yet Google has still added them to its index. There is no page title, no description and no cache; just the URL linked to the page.
It sounds like these are what Google calls "partially indexed" pages. It's a dumb name, because they aren't really indexed by Google at all.
Instead, Google sees that someone else (or perhaps some of your other pages) are linking to these pages. So even though it won't index these pages because of your restrictions, it still knows about them based on what it sees from links to them.
Partially-indexed pages typically display exactly as you describe -- URLs only.
I also don't think there's a way that you can tell Google that you don't want partially-indexed listings. Since they don't come from having indexed your page, the robots.txt and meta robots commands don't really cover them.
Google provides a bit more about partial indexing here:
[google.com...]
I don't get the feeling that it was talking about pages that are included in the robots.txt file. I still think that this is an aberration from their policies and was a mistake by the crawler.
I've noticed GoogleGuy hasn't been around here as much since Tim Mayer came on the scene. Pity, could have done with a heads up on whether this was a mistake by Googlebot or by me, or if the terms of the robots.txt file have changed for Googlebot.
I'm now very confused
You ain't the only one, Google just wants to keep us on our toes.
Google insists it is adhering to the letter of the robots.txt law; it sure doesn't adhere to the spirit.
Google believes that it is allowed to list a page disallowed by robots.txt because it isn't actually retrieving and indexing the page. All it is doing is listing the URL for a page that it knows is there.
JD Morgan has pointed out a few times that the only way you can keep a page out of Google's index is completely non-intuitive -- you have to allow Google to spider the page and find the meta robots noindex tag. In time the page will drop from G's index. So take the disallow out of robots.txt and use the appropriate meta robots tags on each page you don't want in the index.
It's clunky, it's stupid, but it does work.
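In concrete terms, the non-intuitive approach described above means two changes (the page path below is just a made-up example): first remove the Disallow line covering that page from robots.txt, then add the meta tag to the page itself so Googlebot can fetch it and actually see the instruction:

```html
<!-- /signup/thanks.html (hypothetical page): Googlebot must be allowed
     to fetch this page for the tag below to have any effect -->
<head>
  <title>Thanks for signing up</title>
  <meta name="robots" content="noindex,follow">
</head>
```

If the page stays disallowed in robots.txt, the bot never fetches it, never sees the tag, and the URL-only listing can persist.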
With Google it's the robots.txt mess; Yahoo/Ink has its very own way of handling 301 redirects. Wonder what hoops MSN is going to make us jump through when it launches? Aren't the standards there to make it easy for us to manage a website?
Didn't you hear GoogleGuy's busy whipping himself into shape for his Celebrity Death Match with Tim?
Googlebot's "indexing" pages of mine that have the noindex,nofollow meta tag AND are disallowed in the robots.txt. However, I've never seen the pages in the SERPS; I only catch them in a "site:" search.
Oddly, I only recall seeing this occur with dynamic pages.
The Robots Exclusion Protocol [robotstxt.org] is not, and never will be, a way of protecting URLs or making them secret in any way. It's a way of reducing the server load of well-behaved robots fetching unnecessary URLs; Googlebot is a well-behaved robot.
If security is an issue, basic authentication over SSL is cheap, efficient and rather easy.
I keep getting drafted to help on lots of projects lately, so I've had less free time to post, but I'm still around.
jimbeetle had it: if a page is forbidden by robots.txt, we won't crawl it, but if someone else links to that page, we can return that link without ever crawling the page. Back in ancient times, for example, nytimes.com had a robots.txt that wouldn't let Googlebot fetch any pages. So we wouldn't fetch any pages from them. But if the user did the query New York Times, we would return a link to www.nytimes.com without ever crawling their pages, because we had a reasonably high confidence that the url was relevant, even though we didn't actually fetch it.
We do obey noindex/nofollow, but only if we're able to fetch the page in the first place in order to see those tags. :) Bear in mind that noindex will be respected, but if page A links to page B with nofollow *but* page C links to page B, we may still find page B by following the link from page C.
If you want to be safe, I'd recommend using a password via .htaccess.
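In case it helps anyone, the .htaccess approach GoogleGuy mentions looks roughly like this on Apache (the file path and realm name below are placeholders, not anything GoogleGuy specified):

```apache
# .htaccess placed in the directory you want to protect
# (AuthUserFile path and AuthName are placeholders)
AuthType Basic
AuthName "Members only"
AuthUserFile /full/path/to/.htpasswd
Require valid-user
```

The password file itself is created with Apache's htpasswd utility, e.g. `htpasswd -c /full/path/to/.htpasswd someuser`. No bot can index what it can't fetch.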
Ooooooh, you are a tease aren't you! 'Lots of projects', eh? Sounds like someone is trying to revive a little faith in the old GooglyWooglySearchyWearchy.
Just messin'. I must say though, if there is one thing I can't stand about this job, it's the secrecy. Wars are secretive; that's why there's so much damage. In the 'real world', it pays to be upfront about everything, and we only go with businesses we trust. If Yahoo had been an estate agent who had said:
'Your windows will be included in your house for free'
but then, when you bought the house, the Yahoo salesman had turned around and said:
'Enjoy your six-week free trial of your windows! It'll be $300 per window (subject to inspection) and $0.30 every time you want to look through it'
We would have had our lawyers on the phone within milliseconds .....
So, just having the robots excluded from a page is not enough to keep it completely out of the index. It could still show as a URL-only entry if someone somewhere links to it.
It therefore follows that you should be able to get those pages removed completely from the index simply by making sure that any page linking to the page you don't want listed also carries a nofollow meta tag.
If the link is from an external site, out of your control, then this will be easier said than done.
I have a subdirectory that holds all my site's common graphic elements - navigation icons, CSS graphics, etc. There's no HTML in this directory, and you're forbidden from viewing an (Apache-generated) directory listing. Naturally, every page on my site links multiple times to graphics in this directory (but the only links into it come from IMG tags and stylesheets). The directory is disallowed in robots.txt - been that way for years and has been well-respected.
Until just recently... Googlebot now lists the directory with a partial listing when I search on "site:www.example.com". Click on the results link and see my 403 page...
Not knowing when Google "found" this directory means I haven't been able to trace back to find the bot-visit log entry that started it.
I'll let others argue about whether or not the letter or spirit of the protocol has been broken; I will say that I think it may be going too far to add forbidden URLs to the SERPs.
So these turn up as "partial listings" using an allinurl search.
Problem: the original pages have recently been showing as greyed out on the toolbar (they are indexed and turn up in searches) while the robots.txt-protected landing pages (partially indexed) rank 1/10!
Am I assuming incorrectly that, since they are "partial listings", Google cannot see them as duplicates/near-duplicates? Then why do they rank while the original pages are greyed out (and probably losing ranking and position)!? Hmmm.
Make a folder called /foo and disallow it in the robots.txt file.
Put an index.html file which is completely blank in that folder.
Also in that folder make a sub-directory bar and put all the stuff that you want hidden inside that sub-folder.
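The three steps above can be sketched from the web root like this (folder names are the ones from the post; everything else is illustrative):

```shell
# Decoy-folder trick: a disallowed /foo with a blank index page,
# and the material to hide one level deeper in /foo/bar.
mkdir -p foo/bar                  # /foo/bar holds the hidden material
: > foo/index.html                # completely blank index page in /foo
printf 'User-agent: *\nDisallow: /foo/\n' >> robots.txt
```

The idea being that even if a URL-only listing for /foo appears, a visitor clicking it sees only a blank page, and nothing links to /foo/bar for a bot to discover.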
Another alternative is to put sensitive material in a folder above the web root.
I accept all your theories that Google still lists files that are disallowed in the robots.txt file but doesn't give them a title or description - but why!?
Surely the whole point of the robots.txt file is to point out which pages should not be indexed. If Google ignores the robots.txt file but obeys the robots meta tag, then what can I do? Someone suggested removing the robots.txt file and just using a noindex,nofollow robots meta tag, but then I'd be dropping the robots.txt instructions that all the other search engines do check (and obey).
Lately it seems like Google are just sitting on their cash and taking time out to p**s webmasters off. (Sounds like fun, but I think three months is long enough!)
Foo: "OK Googlebot, you discover the page by an outside link. Why don't you crawl that page and index it?"
Gbot: "Oh, because it's robots.txt protected."
Foo: "But if you know it's protected, why is there a partial listing - or any listing at all?"
Gbot: "We would return a link without ever crawling ... because we had a reasonably high confidence that the url was relevant, even though we didn't actually fetch it."
Foo: "Sure sounds like you just decided to ignore the robots.txt."