Forum Moderators: Robert Charlton & goodroi


Forcing Googlebot out of Trap


Butes

5:52 pm on Nov 11, 2015 (gmt 0)

10+ Year Member



A human error in a relative path went out in last week's release, in a link that is supposed to simply fire an interstitial info box:

<...href="foo/bar">


Our favorite robot found this error and got caught in a trap, generating 10+ million matching requests per hour:

foo/barfoo/barfoo/barfoo/barfoo/barfoo/barfoo/barfoo/barfoo/barfoo/barfoo/barfoo/bar............
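The compounding can be sketched with Python's standard URL resolver. This is a guess at the mechanics, since the original post doesn't spell them out: assume the same page, carrying the same bad relative link, gets served at each deeper URL, and that each resolved URL ends up treated as a directory (trailing slash). The hostname and path here are purely illustrative:

```python
from urllib.parse import urljoin

# Hypothetical starting page; "foo/bar" is the bad relative href.
url = "https://example.com/page/"
for _ in range(3):
    # A relative href resolves against the current URL's directory,
    # so each fetch appends another foo/bar segment to the path.
    url = urljoin(url, "foo/bar") + "/"
print(url)  # https://example.com/page/foo/bar/foo/bar/foo/bar/
```

Each hop produces a brand-new URL that still returns a valid page with the same broken link, which is exactly the kind of infinite URL space a crawler can disappear into.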

Since we track bot activity, this was quickly discovered and the following steps were taken:
1. relative path was fixed
2. 410 any matching error string
3. disallow matching string
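Steps 2 and 3 might look something like the following, assuming an Apache front end (the poster never says which server is in use, and the pattern is illustrative, not their actual error string):

```apache
# Step 2 -- return 410 Gone for any URL containing the runaway pattern.
# (Hypothetical rule; match it to your real error string.)
RedirectMatch 410 (foo/bar){2,}
```

```
# Step 3 -- robots.txt: block crawling of the matching strings.
User-agent: *
Disallow: /foo/bar/foo/bar
```

Note that these two rules act at different stages: robots.txt governs whether a compliant crawler requests the URL at all, while the 410 is only seen if a request actually arrives.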

Here we are 5 days later, and gBot has slowed only marginally on these strings, to ~3 million/hour, even though there is no longer any source for it to pick the signal up from.

Question:
Are 410 and disallow conflicting with each other? My gut is saying to drop the 410, but engineers are hesitant to do so for fear of opening a firehose of improper 200s.

aristotle

1:51 am on Nov 12, 2015 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



What do you mean by "disallow matching string"? Are you returning a 403 or a 410 or something else?

lucy24

5:33 am on Nov 12, 2015 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Are 410 and disallow conflicting with each other?

Well, yes, if I've understood the setup correctly. A 410-- or any other numerical code-- is a response to a request. If a given URL is disallowed in robots.txt, no request will be made, so no response can be received. But obviously the request is being made, or you wouldn't be getting those "3million/hour" (?!). So what, exactly, is getting disallowed?
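The conflict can be seen with Python's stdlib robots.txt parser: once the pattern is disallowed, a compliant crawler never requests the URL, so the server's 410 is never delivered. A minimal sketch, with hypothetical paths:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt implementing step 3 ("disallow matching string")
rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /foo/bar/foo/bar",
])

# A compliant crawler checks robots.txt BEFORE requesting a URL.
# Disallowed -> no request is made -> the 410 response is never seen.
blocked = not rp.can_fetch("Googlebot", "https://example.com/foo/bar/foo/bar/foo/bar")
print(blocked)
```

So the two measures are mutually exclusive per URL: if you want Google to learn these URLs are Gone, the 410 has to be reachable, which means the pattern can't also be disallowed.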

3 million of anything per hour seems excessive. How many googlebot visits does the site get in a typical day?

The good news is that, #1, the Googlebot stops crawling a lot faster if it gets a 410 instead of a 404, and #2, requests will stop almost immediately if a given URL has never received anything but a 404/410 response. (Google is not stupid, no matter how it may look sometimes ;)) Disclaimer: I haven't seen either of #1 or #2 expressed as formal policy; I'm speaking from casual personal observation.