Forum Moderators: open


GoogleBot went berserk...

requested same page 20,000 times in quick succession

         

7_Driver

7:06 pm on May 3, 2003 (gmt 0)

10+ Year Member



Hi!

Had a problem with Googlebot on the 27th/28th of April. It tried to index my site's Snitz forums (which use dynamic URLs).

It started out OK, but then it dropped one of the parameters from the querystring. That caused the ASP page to crash with an ODBC error, which is recorded as a 500 (Internal Server Error) in my logs.
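For anyone patching around this in the meantime: a defensive check in the ASP would stop the crash at the source. This is only a sketch — the parameter name "TOPIC_ID" is hypothetical and Snitz's actual code differs — but the idea is to validate the querystring before it ever reaches the database:

```asp
<%
' Hypothetical sketch only - "TOPIC_ID" and the message text are made up;
' the real Snitz code differs. The point: validate the querystring before
' building the SQL, so a missing parameter returns a clean 404 instead of
' crashing into an ODBC error (logged as a 500).
Dim topicId
topicId = Request.QueryString("TOPIC_ID")

If topicId = "" Or Not IsNumeric(topicId) Then
    Response.Status = "404 Not Found"
    Response.Write "Topic not found."
    Response.End
End If

' topicId is now safe to use when querying the database.
%>
```

A clean 404 also gives the bot something sensible to act on, instead of a 500 it may treat as transient and retry.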

Only problem is, Googlebot kept trying, requesting several topic pages from my forums more than 20,000 times each, and getting an internal server error every time.

I have 275,000 Internal Server Errors in one day alone because of this (which hit my web server pretty hard).

GoogleGuy - can you help? I assume this is a problem with Google's latest algorithm and Snitz Forums (which are quite popular).

I don't want to exclude Google (or for Google to exclude me!) - I'd like it to be able to index the forums. Happy to supply any details, logs, etc. if required to diagnose the problem.
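As a stopgap while the bug is open, the affected script could be fenced off in robots.txt. The path below is an assumption (Snitz topic pages are typically served by a script like topic.asp; adjust to match your install):

```
User-agent: Googlebot
Disallow: /forum/topic.asp
```

That keeps the server load down without excluding Googlebot from the rest of the site, and the line can be removed once the querystring bug is fixed.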

Thanks!

uksports

7:25 pm on May 3, 2003 (gmt 0)

10+ Year Member



I have the same problem. Googlebot is trying to crawl a URL format that doesn't exist on our shopping cart and has logged over 3,000 attempts in the last three days, creating an HTTP 500 error each time. I have mailed googlebot@google.com with the log file entries.

It would certainly appear to be an algorithm change, as Googlebot also tried over 30,000 times in one day to access 5 SQL-generated pages that were missing from the site by mistake, simply creating a 404 each time; normally such a URL would be crawled once and ignored. As soon as I noticed and rectified the error, Googlebot carried on with the rest of the site as normal. The worrying thing is that this latest error cannot be rectified easily at all without rewriting the whole cart, and Googlebot is 'stuck' trying this one URL and is not attempting anything else.

If you can help, GoogleGuy, it would be much appreciated.

JayC

8:22 pm on May 3, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Rather than waiting for GoogleGuy to pass by, you might want to email googlebot@google.com, as recommended here [google.com].

GoogleGuy

9:44 pm on May 3, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Hey, my hunch is that this is because the server is doing more in-depth analysis about the best way to crawl your site. If you can live with the load for a little while, it should slack off after a bit, and Googlebot will be much better at crawling your website after that.

bcc1234

11:35 pm on May 3, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



It's really cool. I'm getting about 40 hits per minute.
I hope it all gets into the index.

Jesse_Smith

11:43 pm on May 3, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Wouldn't 275,000 internal server error requests use up A LOT of bandwidth?! If that's all it's doing I would block it until it's fixed.

jdMorgan

12:08 am on May 4, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



This would probably be a good time to mention again that anyone who uses custom error documents on Apache and is seeing problems with 404s being retried multiple times or otherwise mishandled should use the WebmasterWorld server headers checker [webmasterworld.com] to make darn sure that a request for a non-existent page actually returns a 404-Not Found server code to the requestor.

One of the most common causes of 404 problems on Apache servers is incorrect syntax in the ErrorDocument directive [httpd.apache.org]. If you point 404s or 500s to a full URL instead of a local path, the status code returned will be a 302 (Moved Temporarily), not a 404 (Not Found). See the warning (concerning 401s, but applicable to all error documents) at the bottom of the cited ErrorDocument documentation.
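In other words, the difference looks like this (paths here are illustrative):

```
# Wrong: pointing at a full URL makes Apache send a 302 redirect
# to the error page, so the requestor never sees the real 404.
ErrorDocument 404 http://www.example.com/errors/notfound.html

# Right: a local path lets Apache serve the error page
# while still returning the 404 status code.
ErrorDocument 404 /errors/notfound.html
```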

Jim

deanril

12:14 am on May 4, 2003 (gmt 0)

10+ Year Member



I thought you were a Computer Hardware Designer? Hmmmmm.. : )

Jesse_Smith

1:45 am on May 4, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



On that form a dead link gives a

HTTP/1.1 302 Found

message! Yes, I redirect to a full URL for error pages using .htaccess:

ErrorDocument 302 FULL-URL
ErrorDocument 401 FULL-URL
ErrorDocument 404 FULL-URL
ErrorDocument 500 FULL-URL
ErrorDocument 509 FULL-URL

I've never had any trouble with the Googlebot because of it.

jdMorgan

2:02 am on May 4, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



deanril,

Well, I was one for about 25 years, now I'm doin' whatever it takes to pay the bills! :)

See, in the old days, I had to design the hardware, write the boot code (machine code, and usually on paper), toggle it into the front panel in binary, and hit the run switch. The division between "hardware" and "software" was not so sharp.

Jesse,

I think you've been really lucky! Of course, Google probably has a few "extra" routines in their 'bot code to handle simple and common problems with 404s, and that's why you haven't had any trouble. I would recommend fixing it, though - they'll spend less time doing fix-up, and more time spidering.
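For anyone in the same boat, a cleaned-up version of an .htaccess like the one quoted above would use local paths and only genuine error codes (file names here are illustrative; 302 is a redirect status, not an error, so it gets no ErrorDocument at all):

```
ErrorDocument 401 /errors/401.html
ErrorDocument 404 /errors/404.html
ErrorDocument 500 /errors/500.html
```

With local paths, the browser or bot sees the real error status plus a friendly page, instead of a 302 redirect that hides it.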

Jim

SEO practioner

2:16 am on May 4, 2003 (gmt 0)

10+ Year Member



7_Driver,

It reminds me a lot of the big server crash we had a while back while testing a new e-commerce application before deploying it.

Our Tomcat server caused our Apache server to go into a wild loop... and Catalina (a log file inside Tomcat) went totally berserk and caused our whole server to crash with a big thump. We had to remove that file and manually reboot our server at the data center... We couldn't reach it remotely anymore...

When we deleted that file, it was over 22 Gigs! Just a log file, imagine...

What a day! I'll never forget that one...

hitchhiker

12:31 am on May 5, 2003 (gmt 0)

10+ Year Member



..this is because the server is doing more in-depth analysis about the best way to crawl your site.. ..Googlebot will be much better at crawling your website after that..

It's heuristic? Will all our phones ring at once one day?

Adaptive bot tech, using our own sites to continually learn! Given a few years, any type of SEO would be impossible.

Jolly good.