
Googlebot reading disallowed file?

file disallowed in robots.txt


bull

3:13 pm on Jan 16, 2004 (gmt 0)

10+ Year Member



Googlebot had fetched robots.txt from another IP beforehand. The file is disallowed for all user agents in robots.txt, and robots.txt validates according to the searchengineworld tool.
Additionally, Googlebot produced a 404 error. The link to the disallowed HTML file is an image-only link; the requested pattern was disallowed.gif/disallowed.html.

The HTML validates as XHTML 1.0 Transitional.

dmorison

2:31 pm on Jan 17, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



disallowed.gif/disallowed.html

Can you post the actual section of your robots.txt? The above line on its own (if that is exactly what is in robots.txt) would validate, and would disallow the file disallowed.html within the directory disallowed.gif - which may not be what you want to do...

bull

3:02 pm on Jan 17, 2004 (gmt 0)

10+ Year Member



robots.txt section:
User-agent: *
Disallow: /dir1/dir2/donotspider.html

From the HTML file, let's say dir1/dir2/indexd.html:

<a href="donotspider.html"><img src="donotspider.gif" /></a>

and she requested

/dir1/dir2/donotspider.gif/donotspider.html

I am pretty sure it is an XHTML parsing issue. No, there is surely no external link to that disallowed file.
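For what it's worth, standard relative-URL resolution only ever uses the page's own URL as the base, never the src of an enclosing or adjacent tag. A small Python sketch (the example.com host is hypothetical) of the correct resolution, plus one purely speculative way the faulty path could have been manufactured:

```python
from urllib.parse import urljoin

page = "http://example.com/dir1/dir2/indexd.html"  # hypothetical host

# Correct behavior: the href resolves against the page URL.
good = urljoin(page, "donotspider.html")
# -> http://example.com/dir1/dir2/donotspider.html

# Speculative bug: treating the <img src> as if it were a base
# *directory* and appending the href to it would yield exactly the
# path bull saw in his logs.
img = urljoin(page, "donotspider.gif")
bad = img + "/" + "donotspider.html"
# -> http://example.com/dir1/dir2/donotspider.gif/donotspider.html
```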

----------
I e-mailed googlebot at google dot com and am waiting for a response. There is also a 301 issue currently making me extremely nervous: Googlebot is getting proper and wanted 301s on some dozen files but not following them - Slurp does.

dmorison

3:12 pm on Jan 17, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Yep - looks like a Googlebug.

nileshkurhade

3:47 pm on Jan 17, 2004 (gmt 0)

10+ Year Member



Make sure it was Googlebot's IP.

bull

5:58 pm on Jan 17, 2004 (gmt 0)

10+ Year Member



Yes, 64.68.82.44.

GoogleGuy

6:35 pm on Jan 17, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I wouldn't worry about the 301s yet; we see the redirect, but it goes back on the queue to be checked out instead of following it that second, I think. Can you post your site in your profile for the robots.txt issue? I'm curious and want to check whether we're doing anything wrong. Thanks for mentioning this.

bull

6:45 pm on Jan 17, 2004 (gmt 0)

10+ Year Member



Done - thanks in advance! A normal browser UA is recommended.
The subdirectory affected by the 301 issue is the Italian one.

GoogleGuy

8:03 am on Jan 18, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Hmm. I'm a little rusty on robots.txt, but it looks like yours disallows a specific page in a directory. It sounds like we're trying to fetch a different (faulty) page. So technically it wouldn't be a robots.txt problem, but if we're trying to fetch
/dir1/dir2/something.gif/somethingelse.html
then that would be a bot issue. Does that make sense so far? I'm happy to ask someone to check it out, but it doesn't sound like a robots.txt issue so much as a bad link, or us following a good link badly.
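GoogleGuy's reading matches how robots.txt Disallow lines work: simple prefix matching on the URL path. A quick check with Python's urllib.robotparser (a tool from a later era, used here only to illustrate) shows the faulty path is not covered by bull's Disallow line:

```python
from urllib import robotparser

# Rebuild bull's robots.txt section in memory.
rp = robotparser.RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /dir1/dir2/donotspider.html",
])

# The real page is disallowed...
print(rp.can_fetch("*", "/dir1/dir2/donotspider.html"))
# ...but the faulty path Googlebot requested is not, because the
# Disallow rule is a prefix match and this path doesn't start with it.
print(rp.can_fetch("*", "/dir1/dir2/donotspider.gif/donotspider.html"))
```

So fetching the faulty path violates no rule in the file; the bug, if any, is in how the link was constructed, not in robots.txt handling.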

bull

8:08 am on Jan 18, 2004 (gmt 0)

10+ Year Member



GoogleGuy, you are right with your assumptions.
following a good link badly
is the only thing I can currently imagine, where the good link is the one pointing to the disallowed file. Nothing really severe, but I think it is due to the XHTML code. Is there any problem with "cleaned up" (here: auto-generated) HTML, i.e. without any unnecessary line feeds, tabs, or spaces?

dmorison

10:47 am on Jan 18, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Is there any problem with "cleaned up" (here: auto-generated) html, i.e. w/o any unnecessary line feeds, tabs, spaces?

This is something I have worried about in the past, as my pages generally (unless there is inserted advertiser code) don't have any whitespace or new lines at all.

I am confident that Googlebot doesn't have a problem with non-stop markup, as I have had some very large pages indexed in Google without any problem.

However, it is easy to imagine an amateur spider that fetches a page into a local file and then processes that file line by line having trouble with a 100% auto-generated page.

It would be comforting to hear GG state that Googlebot is a professional when it comes to non-stop "pure" markup...
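Any tokenizing parser is insensitive to where line breaks fall; only a parser that assumes one tag or record per line would choke on whitespace-free markup. A minimal sketch with Python's html.parser (the LinkCollector class is hypothetical) extracting the link from a page written as a single unbroken line:

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect href values from <a> tags, ignoring layout entirely."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links.extend(value for name, value in attrs if name == "href")

# An entire page on one line, with no whitespace between tags.
page = '<html><body><a href="donotspider.html"><img src="donotspider.gif"/></a></body></html>'

collector = LinkCollector()
collector.feed(page)
collector.close()
print(collector.links)  # ['donotspider.html']
```

The parser tokenizes on tag boundaries, not on newlines, so the absence of line feeds, tabs, and spaces makes no difference to what it extracts.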

bull

1:46 pm on Jan 19, 2004 (gmt 0)

10+ Year Member



I wouldn't worry about the 301's yet; we see the redirect, but it goes back on the queue to be checked out instead of following it that second, I think.

She was here today, picking over all the newly redirected files with 200s, so the 301s really are stacked and queued - awesome...

GoogleGuy

7:16 am on Jan 20, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Glad to hear that, bull. In general, we should be able to follow any valid link. I'm puzzled how we could have followed the link on that page though.