

Did I inadvertently ban Google?

If so, how can I correct?


pendanticist

3:13 pm on Nov 19, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Greetings,

Does this mean I've inadvertently banned Googlebot from checking my site?

64.68.82.74 - - [19/Nov/2002:05:00:58 -0800] "GET /robots.txt HTTP/1.0" 403 208 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"

I've been tweaking my .htaccess file, but have no idea what I could have done so wrong.

I should point out that Inktomi and the others are doing fine from what I can see.

I appreciate any help on this. Don't want to lose that PR, 'ya know...

Thank You.

Pendanticist.

lazerzubb

3:18 pm on Nov 19, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



It's very common for search engine spiders to collect robots.txt first and then come back for the pages later.

Even if you ban GoogleBot, it will still come back to see whether the ban is still in place.

But I would say wait up to 2.5 weeks, and if it hasn't spidered any pages of your site by then, something is probably wrong.

WebSempster

3:45 pm on Nov 19, 2002 (gmt 0)

10+ Year Member



I don't like the look of that 403 response. These are the delay times I'm seeing after a bot checks the robots.txt file:

FAST WebCrawler: 30 seconds
GoogleBot: 5 seconds
Slurp (Inktomi): 1 second

jomaxx

4:43 pm on Nov 19, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I have no idea whether Google would attempt to spider any pages if it is 403'ed from accessing the robots.txt file. But if you have incoming links, I'm sure Google will return to try again.

Is it the site in your profile? I don't see anything wrong with it offhand.

pendanticist

5:38 pm on Nov 19, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Thanks everyone.

Could this have been the culprit?

RewriteCond %{HTTP_USER_AGENT} bot [NC,OR]

It was contained in my .htaccess file until I saw those 403s.
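That condition would indeed catch Googlebot: the pattern is matched case-insensitively against the whole User-Agent header, and `bot` appears inside `Googlebot/2.1`. A sketch of a tighter rule set, for illustration only (the good-bot and bad-bot names here are placeholders drawn from this thread, not a recommended list):

```apache
RewriteEngine On
# Let known-good crawlers through regardless of the rules below
RewriteCond %{HTTP_USER_AGENT} !(Googlebot|Slurp|FAST-WebCrawler) [NC]
# Name the specific bad agents instead of matching the substring "bot"
RewriteCond %{HTTP_USER_AGENT} (EasyDL|grub-client|URL_Spider_Pro) [NC]
# Both conditions must hold (they AND together); [F] returns a 403
RewriteRule .* - [F]
```

Consecutive RewriteCond lines AND together by default, so the forbidden response is only sent when the agent matches a named bad bot and is not on the whitelist.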

My robots.txt hasn't been modified for a long, long time, whereas .htaccess has been an ongoing tweak for the last few days.

GoogleBot visits my pages fairly frequently as a result of my maintenance of links. (In time I want to do that "Recent" thing I read of in other posts. You know, the one that saves GB some time by re-spidering only those files which have recently been modified.)

Yes jomaxx, it is the one in my profile.

Any suggestions/solutions will be appreciated.

Pendanticist.

Mohamed_E

5:40 pm on Nov 19, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



> I have no idea whether Google would attempt to spider any pages if it is 403'ed from accessing the robots.txt file.

I am pretty sure (I cannot lay my hands on the relevant reference) that according to the standard a robot should (not must) treat a 403 as denying it permission to spider the site. That was Google's policy until recently.

They observed that most 403 codes were due to configuration errors, and that as a result they often failed to spider sites whose owners wanted to be spidered. So recently they started ignoring 403 codes.

pendanticist

10:53 pm on Nov 19, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Thanks Mohamed_E,

I certainly hope that's the case. <phew!> Suddenly that would make my day.

My .htaccess file is finally doing what it was supposed to do and that is block those nasty bots. (The GoogleBot thing was a mistake on my part.)

EasyDL/3.02 - Shut this one right down.
grub-client-0.3.0 - served 8 distinct IPs worldwide.
/_vti_bin/owssvr.dll - Renders 403 now too.
URL_Spider_Pro/3.0 - requested robots and accounting_forensic.html both 403'd.

These preliminary results look good. Uh, with the exception of the good bots, that is :-)

FAST-WebCrawler/3.6/FirstPage and ZyBorg/1.0 are good spiders, but I'll just have to wait for them to return. The remaining spiders should return in a day or two as well.

Inktomi (I think it is) can't seem to grasp the concept of following the redirect to the new destination page and storing it. Instead, it keeps asking for the same old pages, getting redirected, and coming back to do the same thing every time. <shrug>
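For what it's worth, spiders generally only replace a stored URL when the redirect is permanent (301); a temporary (302) redirect tells them to keep requesting the old address, which would produce exactly this loop. A minimal .htaccess sketch with placeholder filenames:

```apache
RewriteEngine On
# R=301 marks the redirect as permanent, so a spider can store the new URL;
# without an explicit code, mod_rewrite issues a 302 (temporary) redirect
RewriteRule ^old-page\.html$ /new-page.html [R=301,L]
```

If the existing redirect is a 302 (or a meta refresh), switching to a 301 is the usual fix.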

So, if they all share GoogleBot's policy, then I may not have lost anything at all. SEs bring me half of my traffic and I don't want that to just 'go away' because of one ignorant mistake. Here we go crossing fingers until morning...

Thanks Again Mohamed_E.

Pendanticist.

WebSempster

10:15 am on Nov 20, 2002 (gmt 0)

10+ Year Member



> I am pretty sure (cannot lay my hands on the relevant stuff)
> that according to the standard the robot should (not must)
> treat a 403 as denying it permission to spider the site. That
> was Google's policy until recently.

Agreed with Mohamed_E on 403s (I'd forgotten); that change was at least a year ago, maybe longer. I think SSL/HTTPS ceasing to stop spiders came in about the same time.