Google crawls robots.txt only but shows 403 code - (deprecated) Google News Archive forum at WebmasterWorld - WebmasterWorld

Google crawls robots.txt only but shows 403 code

why does it generate a 403 and not a 404

gujgifts

5:51 pm on Dec 13, 2002 (gmt 0)

10+ Year Member

Hi...

Googlebot has been coming to my site at least thrice every day for the last 10 days, but the only trace it leaves in my logs is

216.239.46.166 - - [06/Dec/2002:05:41:33 +0530] "GET /robots.txt HTTP/1.0" 403 - "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"

with a variation in the ip...it is of the form 64.68.82.*

I don't have a robots.txt on my server... Searching the logs for Google's IP, I see that it only requests robots.txt and then vamooses...

Off the cuff, it looks like Google is not deep crawling my site... *BUT* what stumps me is why the status code in the log entry is 403 (which means Forbidden, I think) instead of the familiar 404 (Not Found)...

Also it is not even looking at the index.html file, let alone the other files...
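The log check described above can be sketched in Python. This is a minimal sketch, assuming the server writes the Apache "combined" log format seen in the entry quoted above; the pattern and the function name `tally_bot_requests` are illustrative and do not come from any particular log tool.

```python
import re
from collections import Counter

# Apache "combined" log format, matching the entry quoted in the post.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)


def tally_bot_requests(lines, agent_substring="Googlebot"):
    """Count (path, status) pairs for requests whose User-Agent contains agent_substring."""
    counts = Counter()
    for line in lines:
        match = LOG_PATTERN.match(line)
        if match and agent_substring in match.group("agent"):
            counts[(match.group("path"), int(match.group("status")))] += 1
    return counts


# The single log entry from the post, split across two string literals.
sample = [
    '216.239.46.166 - - [06/Dec/2002:05:41:33 +0530] "GET /robots.txt HTTP/1.0" '
    '403 - "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"',
]

for (path, status), hits in tally_bot_requests(sample).items():
    print(path, status, hits)  # prints: /robots.txt 403 1
```

Run over a full access log, a tally like this shows at a glance whether the bot ever asked for anything besides robots.txt, and which status code each request received.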

Is this normal... I mean, coming to my site thrice a day and not even looking at the index file? Could it be that my site administrator has done some mischief to ban access to the robots.txt file...? I am able to open this file in the browser though...
thanks

[edited by: gujgifts at 6:38 pm (utc) on Dec. 13, 2002]

Key_Master

6:01 pm on Dec 13, 2002 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

I don't think it's mischief, just a DNS problem.

Put a blank robots.txt on your site so robots don't get a 403 error and let your administrator know about the problem.
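Why a 403 on robots.txt can matter more than a 404: some robots.txt clients read "forbidden" as "stay out of the whole site" and "not found" as "no restrictions". Python's standard `urllib.robotparser` follows that convention (401 or 403 means disallow everything, any other 4xx means allow everything). The sketch below mirrors it. The function name is hypothetical, and whether Googlebot applied the same rule here is an assumption, not something this thread confirms.

```python
def policy_for_robots_status(status):
    """Map the HTTP status of a robots.txt request to a crawl policy.

    Mirrors the convention used by Python's urllib.robotparser; other
    crawlers, Googlebot included, may behave differently.
    """
    if status in (401, 403):
        return "disallow-all"   # access denied: assume the whole site is off limits
    if 400 <= status < 500:
        return "allow-all"      # e.g. 404: no robots.txt, so no restrictions
    if 200 <= status < 300:
        return "parse-rules"    # a file was served: read its rules
    return "undetermined"       # redirects and server errors are outside this sketch


for code in (200, 403, 404):
    print(code, policy_for_robots_status(code))
```

Under this convention a blank robots.txt served with a 200 is the safe option, since an empty rule set restricts nothing.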

jdMorgan

6:10 pm on Dec 13, 2002 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Key_Master,

What is the connection between DNS and Gbot getting a 403 instead of a 404 as it should? - This looks really strange!

Thanx,
Jim

gujgifts

6:11 pm on Dec 13, 2002 (gmt 0)

10+ Year Member

btw, some of my normal files (that had been accidentally deleted from the server..) are showing 404.....

jdMorgan

6:48 pm on Dec 13, 2002 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

gujgifts,

If this is the site in your profile, then something has changed. Using the WebmasterWorld server header checker, the Search Engine World Robots.txt validator, and WannaBrowser, I see the following:

You have a valid robots.txt on your site, with a single subdirectory disallowed.
Requests for robots.txt return 200-OK responses to Googlebot's User-agent.
Requests to [yourdomain.com...] are redirected to [yourdomain.com...] correctly.

Unless you have forbidden Googlebot by IP address (which I can't test), it looks like it works to me...
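A header check of this kind can be approximated with Python's standard library. This is a rough sketch, not how the tools named above work internally: the function name is made up, the URL in the comment is a placeholder, and the User-Agent string is copied from the log entry earlier in the thread.

```python
import urllib.error
import urllib.request

# User-Agent string copied from the log entry earlier in the thread.
GOOGLEBOT_UA = "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"


def status_for_user_agent(url, user_agent, timeout=10):
    """Return the HTTP status code the server sends for url to the given User-Agent."""
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    try:
        # urlopen follows redirects, so this is the status of the final response.
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # urlopen raises for 4xx/5xx responses; the code is still the answer we want.
        return err.code


# Example call (placeholder URL, not the poster's site):
# status_for_user_agent("http://www.example.com/robots.txt", GOOGLEBOT_UA)
```

Like the tools above, this only spoofs the User-Agent header. The request still comes from your own IP address, so it cannot reveal a block that is keyed on Googlebot's IP range.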

Jim

Key_Master

6:55 pm on Dec 13, 2002 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

gujgifts, I see you put a robots.txt file up. That will help until the problem is solved.

gujgifts

8:06 pm on Dec 13, 2002 (gmt 0)

10+ Year Member

thanks guys

I never imagined that you guys would go to so much trouble to help me out...WebmasterWorld rocks!

I did put up a robots.txt and disallowed one redundant directory, as some SEOs said that disallowing nothing may be interpreted as disallowing everything by some bots...
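For comparison, here is how one standards-following parser, Python's `urllib.robotparser`, reads the cases in question. It is a sketch only: the `/old/` directory is made up for illustration, the helper name `allowed` is hypothetical, and the result says nothing about bots that parse robots.txt differently, which is exactly the concern raised above.

```python
from urllib.robotparser import RobotFileParser


def allowed(robots_txt, path, user_agent="Googlebot"):
    """Return True if robots_txt permits user_agent to fetch path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, path)


EMPTY_FILE = ""
DISALLOW_NOTHING = "User-agent: *\nDisallow:\n"
DISALLOW_ONE_DIR = "User-agent: *\nDisallow: /old/\n"  # /old/ is a made-up directory
DISALLOW_ALL = "User-agent: *\nDisallow: /\n"

cases = [
    ("empty file", EMPTY_FILE),
    ("disallow nothing", DISALLOW_NOTHING),
    ("disallow one directory", DISALLOW_ONE_DIR),
    ("disallow everything", DISALLOW_ALL),
]
for name, rules in cases:
    # Columns: case name, may fetch /index.html, may fetch /old/page.html
    print(name, allowed(rules, "/index.html"), allowed(rules, "/old/page.html"))
```

With this parser an empty `Disallow:` line blocks nothing, and only `Disallow: /` shuts the whole site.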

btw, if google can ask for robots.txt, surely it cannot be banned by IP right? I am asking as i want to make sure that my site administrator is not doing any mischief...

btw, I saw your request for my root using WannaBrowser and was majorly happy for about 15 mins, thinking that Googlebot was here to cache my homepage! Hope the real one turns up soon....

Another question: even if it only skims my site, shouldn't it at least ask for my root or index page apart from the robots.txt file...?

ciao

jdMorgan

10:36 pm on Dec 13, 2002 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

gujgifts,

Your site appears to be indexed by Google, and I see a PR5 for your home page. I can't say why your pages are not cached, or why Google might have received a 403 on robots.txt.

We are not supposed to do site reviews here, but since I was already in there I'll offer a couple of suggestions:
1) Dump the revisit-after meta tag. Historically, only one search engine in Canada ever used it, and that one died years ago. All it does is take up space and dilute the value of the text on your page.
2) Add more text to the top of your pages, like maybe an introductory paragraph. You've sort of got a little of that now, but the text is "hidden" inside images - and spiders don't read images.
3) Add a DOCTYPE statement to the top of your pages to tell browsers what flavor of html you are using, and run your pages through the validators at w3c.org to make sure they validate.

Jim