Forum Moderators: open
<meta name="robots" content="noindex,nofollow">
on each of the pages in that directory, and yet Google has still added them to its index. There is no page title, no description and no cache; just the URL linked to the page.
Also, I've just placed this on a heavily bot-hit page of my site:
<meta name="robots" content="noindex,follow">
Should that stop Googlebot indexing the page yet allow it to continue through? Or is my growing knowledge substandard on this one?
robots.txt is:
User-agent: *
Disallow: /cgi-bin/
Disallow: /admin/
Disallow: /images/
Disallow: /ex/
Disallow: /signup/
Disallow: /redirection/
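For what it's worth, a well-behaved robot is supposed to check those rules before fetching anything. A quick sketch using Python's stdlib parser shows how the rules above should be interpreted (the hostname is just an example):

```python
# Sketch: how a well-behaved crawler consults robots.txt before fetching
# a URL. The rules mirror the ones quoted above; example.com is a
# placeholder hostname.
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /admin/
Disallow: /images/
Disallow: /ex/
Disallow: /signup/
Disallow: /redirection/
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

# A compliant bot must not fetch anything under the disallowed paths:
print(rp.can_fetch("Googlebot", "http://www.example.com/admin/login.html"))  # False
# ...but everything else is fair game:
print(rp.can_fetch("Googlebot", "http://www.example.com/index.html"))        # True
```

Note that "must not fetch" is all the protocol promises; as becomes clear later in this thread, it says nothing about whether a URL can still be *listed*.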
but I've had my suspicions that Google has been developing a completely new crawler that has been tested over the past couple of months which is why crawling and indexing has been so slow and stale.
on each of the pages in that directory, and yet Google has still added them to its index. There is no page title, no description and no cache; just the URL linked to the page.
It sounds like these are what Google calls "partially indexed" pages. It's a dumb name, because they aren't really indexed by Google at all.
Instead, Google sees that someone else (or perhaps some of your other pages) are linking to these pages. So even though it won't index these pages because of your restrictions, it still knows about them based on what it sees from links to them.
Partially-indexed pages typically display exactly as you describe -- URLs only.
I also don't think there's a way that you can tell Google that you don't want partially-indexed listings. Since they don't come from having indexed your page, the robots.txt and meta robots commands don't really cover them.
Google provides a bit more about partial indexing here:
[google.com...]
I don't get the feeling that it was talking about pages that are included in the robots.txt file. I still think that this is an aberration from their policies and was a mistake by the crawler.
I've noticed GoogleGuy hasn't been around here as much since Tim Mayer came on the scene. Pity, could have done with a heads up on whether this was a mistake by Googlebot or by me, or if the terms of the robots.txt file have changed for Googlebot.
I'm now very confused
You ain't the only one, Google just wants to keep us on our toes.
Google insists it is adhering to the letter of the robots.txt law; it sure doesn't adhere to the spirit.
Google believes that it is allowed to list a page disallowed by robots.txt because it isn't actually retrieving and indexing the page. All it is doing is listing the URL for a page that it knows is there.
JD Morgan has pointed out a few times that the only way you can keep a page out of Google's index is completely non-intuitive -- you have to allow Google to spider the page and find the meta robots noindex tag. In time the page will drop from G's index. So take the disallow out of robots.txt and use the appropriate meta robots tags on each page you don't want in the index.
It's clunky, it's stupid, but it does work.
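In concrete terms, the non-intuitive approach described above means two changes (the page path below is just a made-up example): first remove the Disallow line covering that page from robots.txt, then add the meta tag to the page itself so Googlebot can fetch it and actually see the instruction:

```html
<!-- /signup/thanks.html (hypothetical page): Googlebot must be allowed
     to fetch this page for the tag below to have any effect -->
<head>
  <title>Thanks for signing up</title>
  <meta name="robots" content="noindex,follow">
</head>
```

If the page stays disallowed in robots.txt, the bot never fetches it, never sees the tag, and the URL-only listing can persist.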
With Google it's the robots.txt mess; Yahoo/Ink has its very own way of handling 301 redirects. Wonder what hoops MSN is going to make us jump through when it launches? Aren't the standards there to make it easy for us to manage a website?
Didn't you hear GoogleGuy's busy whipping himself into shape for his Celebrity Death Match with Tim?
Googlebot's "indexing" pages of mine that have the noindex,nofollow meta tag AND are disallowed in the robots.txt. However, I've never seen the pages in the SERPS; I only catch them in a "site:" search.
Oddly, I only recall seeing this occur with dynamic pages.
The Robots Exclusion Protocol [robotstxt.org] is not, and never will be, a way of protecting URLs or making them secret in any way. It's a way of reducing the server load of well-behaved robots fetching unnecessary URLs; Googlebot is a well-behaved robot.
If security is an issue, basic authentication over SSL is cheap, efficient and rather easy.
I keep getting drafted to help on lots of projects lately, so I've had less free time to post, but I'm still around.
jimbeetle had it: if a page is forbidden by robots.txt, we won't crawl it, but if someone else links to that page, we can return that link without ever crawling the page. Back in ancient times, for example, nytimes.com had a robots.txt that wouldn't let Googlebot fetch any pages. So we wouldn't fetch any pages from them. But if the user did the query New York Times, we would return a link to www.nytimes.com without ever crawling their pages, because we had a reasonably high confidence that the url was relevant, even though we didn't actually fetch it.
We do obey noindex/nofollow, but only if we're able to fetch the page in the first place in order to see those tags. :) Bear in mind that noindex will be respected, but if page A links to page B with nofollow *but* page C links to page B, we may still find page B by following the link from page C.
If you want to be safe, I'd recommend using a password via .htaccess.
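In case it helps anyone, the .htaccess approach GoogleGuy mentions looks roughly like this on Apache (the file path and realm name below are placeholders, not anything GoogleGuy specified):

```apache
# .htaccess placed in the directory you want to protect
# (AuthUserFile path and AuthName are placeholders)
AuthType Basic
AuthName "Members only"
AuthUserFile /full/path/to/.htpasswd
Require valid-user
```

The password file itself is created with Apache's htpasswd utility, e.g. `htpasswd -c /full/path/to/.htpasswd someuser`. No bot can index what it can't fetch.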
Ooooooh, you are a tease aren't you! 'Lots of projects', eh? Sounds like someone is trying to revive a little faith in the old GooglyWooglySearchyWearchy.
Just messin'. I must say though, if there is one thing I can't stand about this job, it's the secrecy. Wars are secretive; that's why there's so much damage. In the 'real world', it pays to be upfront about everything, and we only go with businesses we trust. If Yahoo had been an estate agent who had said:
'Your windows will be included in your house for free'
but then, when you bought the house, the Yahoo salesman had turned around and said:
'Enjoy your six-week free trial of your windows! It'll be $300 per window (subject to inspection) and $0.30 every time you want to look through it'
We would have had our lawyers on the phone within milliseconds .....
So, just having the robots excluded from a page is not enough to keep it completely out of the index. It could still show as a URL-only entry if someone somewhere links to it.
It therefore follows that you should be able to get those pages removed completely from the index simply by making sure that any page linking to the page you don't want listed also carries a nofollow meta tag.
If the link is from an external site, out of your control, then this will be easier said than done.
I have a subdirectory that holds all my site's common graphic elements - navigation icons, CSS graphics, etc. There's no HTML in this directory, and you're forbidden from viewing an (Apache-generated) directory listing. Naturally, every page on my site links multiple times to graphics in this directory (but the only links into it come from IMG tags and stylesheets). The directory is disallowed in robots.txt - been that way for years and has been well-respected.
Until just recently... Googlebot now lists the directory with a partial listing when I search on "site:www.example.com". Click on the results link and see my 403 page...
Not knowing when Google "found" this directory means I haven't been able to trace back to find the bot-visit log entry that started it.
I'll let others argue about whether or not the letter or spirit of the protocol has been broken; I will say that I think it may be going too far to add forbidden URLs to the SERPs.
So these turn up as "partial listings" using an allinurl search.
Problem: the original pages have recently been showing as greyed out on the toolbar (they are indexed and turn up in searches) while the robots.txt-protected landing pages (partially indexed) rank 1/10!
Am I assuming incorrectly that, since they are "partial listings", Google cannot see them as duplicates/near-duplicates? Then why do they rank while the original pages are greyed out (and probably losing ranking and position)!? Hmmm.
Make a folder called /foo and disallow it in the robots.txt file.
Put an index.html file which is completely blank in that folder.
Also in that folder make a sub-directory bar and put all the stuff that you want hidden inside that sub-folder.
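The three steps above can be sketched from the web root like this (folder names are the ones from the post; everything else is illustrative):

```shell
# Decoy-folder trick: a disallowed /foo with a blank index page,
# and the material to hide one level deeper in /foo/bar.
mkdir -p foo/bar                  # /foo/bar holds the hidden material
: > foo/index.html                # completely blank index page in /foo
printf 'User-agent: *\nDisallow: /foo/\n' >> robots.txt
```

The idea being that even if a URL-only listing for /foo appears, a visitor clicking it sees only a blank page, and nothing links to /foo/bar for a bot to discover.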
Another alternative is to put sensitive material in a folder above the web root.
I accept all your theories that Google still lists files that are disallowed in the robots.txt file but doesn't give them a title or description - but why!?
Surely the whole point of the robots.txt file is to point out which pages should not be indexed. If Google ignores the robots.txt file but obeys the robots meta tag, then what can I do? Someone suggested removing the robots.txt file and just using a noindex,nofollow robots meta tag, but then I'd be dropping the robots.txt instructions that all the other search engines do check (and obey).
Lately it seems like Google are just sitting on their cash and taking time out to p**s webmasters off. (Sounds like fun, but I think three months is long enough!)
Foo: "OK Googlebot, you discover the page by an outside link. Why don't you crawl that page and index it?"
Gbot: "Oh, because it's robots.txt protected."
Foo: "But if you know it's protected, why is there a partial listing - or any listing at all?"
Gbot: "We would return a link without ever crawling ... because we had a reasonably high confidence that the url was relevant, even though we didn't actually fetch it."
Foo: "Sure sounds like you just decided to ignore the robots.txt."