/3 and /5 requests, strange
Has anyone come across strange web requests like this that don't seem to bear any resemblance to what's on the site?
Requests for <url>/3 and <url>/5
They come from Princeton and Google. Obviously they 404 because I've never had those URLs, but does it ring any bells with anyone as to why this might happen?
* Wasn't sure if this is the right forum, if not please move?
Hm. Computer-science class with "Create a robot" assignment?
The request from Google is a little bit worrying because it implies they've actually seen the link somewhere. Google does ask for nonexistent files, but always in the form "23lk4jdf9o8tu5.html", where they deliberately ask for a garbage name. Those garbage requests seem to be triggered by an unusual rate of redirects on your site.
Option B is that somebody's robot has got their shopping list mixed up with a different hostname. But Princeton, hm. I assume that's based on the IP, not just some claim in the UA string.
SSID forum maybe?
Can you show an example of what this looks like in your logs?
220.127.116.11 - - [08/Apr/2014:03:39:31 +0100] "GET /(normalpage)/3/ HTTP/1.0" 301 207 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
18.104.22.168 - - [08/Apr/2014:03:39:32 +0100] "GET /(normalpage) HTTP/1.0" 301 208 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
22.214.171.124 - - [08/Apr/2014:04:30:58 +0100] "GET /(differentnormalpage)/3/ HTTP/1.0" 301 203 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
126.96.36.199 - - [08/Apr/2014:04:30:58 +0100] "GET /(differentnormalpage) HTTP/1.0" 301 204 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
I googled the IP:
IP Address: 188.8.131.52
Location: Mountain View, United States - 184.108.40.206 is a static assigned Corporate IP address allocated to Googlebot.
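For anyone who wants something firmer than a web lookup of the IP: the usual check is a reverse DNS lookup followed by a forward lookup that must come back to the same address. Below is a minimal Python sketch of that idea. The function and parameter names are made up for illustration, it only handles IPv4, and the two hostname suffixes are the ones I understand Google to document for its crawlers, so treat the list as an assumption to verify.

```python
import socket

# Assumed suffixes for Google crawler hostnames (verify against Google's docs).
GOOGLE_SUFFIXES = (".googlebot.com", ".google.com")


def _reverse(ip):
    # PTR lookup: IP address -> primary hostname
    return socket.gethostbyaddr(ip)[0]


def _forward(host):
    # A lookup: hostname -> list of IPv4 addresses
    return socket.gethostbyname_ex(host)[2]


def is_verified_googlebot(ip, reverse=_reverse, forward=_forward):
    """True only if the PTR name is a Google host AND it resolves back to ip.

    `reverse` and `forward` are injectable so the logic can be tested
    without touching the network.
    """
    try:
        host = reverse(ip).rstrip(".").lower()
        if not host.endswith(GOOGLE_SUFFIXES):
            return False
        return ip in forward(host)
    except OSError:
        # socket.herror / socket.gaierror are OSError subclasses:
        # no PTR record or no forward record means "not verified".
        return False
```

The forward step matters because anyone who controls the reverse zone for their own IP range can set a PTR record claiming to be a Google host; only Google can make that hostname resolve back to the IP.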
I think the /5/ might be something to do with this:
220.127.116.11 - - [08/Apr/2014:06:06:37 +0100] "GET /5/que-es-el-cine-de-genero/?utm_source=dlvr.it&utm_medium=twitterrce%3Dother_multiline&action_object_map=%257B%2522752450914769280%2522%253A547777391969572%257D&action_type_map=%257B%2522752450914769280%2522%253A%2522og.recommends%2522%257D&action_ref_map=%255B%255Dber_level_req%3D1&fb_locale=de_DE HTTP/1.0" 404 5753 "-" "facebookexternalhit/1.1 (+http://www.facebook.com/externalhit_uatext.php)"
Fwiw, I've been seeing a weird (looks like hacked) WordPress cricket site post REALLY weird links to my site, pointing at pages that don't exist; maybe Google got it from there. I disavowed it as it's nothing to do with me... but weird stuff.
Lucy, you say the garbage requests are triggered on sites with a lot of redirects; maybe it's that? I redirect to add a trailing slash, to make directory/index.php just be directory/, and to canonicalize www to non-www. I took them all off just in case, except the last one. Kind of annoying, as it's my personal site, which is slightly in the "webby" spotlight as of today.
The google requests are definitely legitimate. What about the Princeton ones?
A routine redirect like trailing-slash or "index.html" shouldn't trigger any unusual googlebot requests, because everyone has those. I notice the garbage requests when I've made changes resulting in an unusually high proportion of redirects. They deliberately ask for something that's extremely unlikely to exist, just to verify that the site is still returning 404s. I have to assume that this is fully automated, and that it's triggered by proportions rather than absolute numbers.
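If you want to run the same kind of sanity check on your own site, here is a rough Python sketch: ask for a name that can't plausibly exist and confirm the first response really is a 404 rather than a 200 or a redirect. This only imitates the behavior described above; it is not a claim about how Googlebot implements it, and all function names are hypothetical.

```python
import secrets
import urllib.error
import urllib.request


def garbage_path(n_bytes=8):
    """A path that is extremely unlikely to exist on any real site."""
    return "/" + secrets.token_hex(n_bytes) + ".html"


class _NoRedirect(urllib.request.HTTPRedirectHandler):
    def redirect_request(self, req, fp, code, msg, headers, newurl):
        # Returning None stops urllib from following the redirect,
        # so the 3xx surfaces as an HTTPError with that status code.
        return None


def fetch_status(url):
    """Status code of the first response, without following redirects."""
    opener = urllib.request.build_opener(_NoRedirect)
    try:
        with opener.open(url, timeout=10) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code


def returns_real_404(base_url, fetch=fetch_status):
    """True if a garbage URL under base_url answers with a plain 404.

    `fetch` is injectable so the check can be tested offline.
    """
    return fetch(base_url.rstrip("/") + garbage_path()) == 404
```

A result of False would mean the server answered the garbage name with something else (a 200, a 301/302, a 403), which is the situation worth investigating.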
Unlike some problems, this one is probably safe to dump in the "ignore them and they'll go away" bin. If you're getting a whole lot of bum requests from the same IP, you might look them up and see if they deserve a general block on grounds of underlying robotitude. But really, a 404 is as effective as anything. Some robots get all excited over 403s because they think it means you're hiding something from them.
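To see whether any one IP is responsible for "a whole lot" of bad requests, a small script over the access log is usually enough. The sketch below assumes the Apache "combined" log format seen in the lines quoted in this thread; the regex and function name are my own, not anything standard.

```python
import re
from collections import Counter

# Apache "combined" format:
# ip ident user [time] "request" status bytes "referer" "user-agent"
LOG_RE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)


def count_404s_by_ip(lines):
    """Count 404 responses per client IP; lines that don't parse are skipped."""
    counts = Counter()
    for line in lines:
        m = LOG_RE.match(line.strip())
        if m and m.group("status") == "404":
            counts[m.group("ip")] += 1
    return counts
```

Since an open file is an iterable of lines, you can pass one straight in and then call `.most_common(10)` on the result to get the ten noisiest addresses.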
Thanks Lucy... again!
Maybe it was 403 related - I had some .htaccess sending "badPeople" to 403 land. Took that off a couple of days ago. argh.
404? The paired examples were 301s... what are they redirecting to? (The file sizes of the redirects are 207, 208, 203, 204, all very small.)
|what are they redirecting to? (The file sizes of the redirects are 207, 208, 203, 204, all very small.) |
But that's just the size of the 301 response itself (the short "Moved Permanently" stub the server sends), not the page it points to. Exact size apparently depends on your server; the "Location" value itself will only vary by a few bytes.
If your ErrorDocument directive is incorrectly worded, everything turns into a 302. But not a 301; those only happen on purpose.
Yes, they were 301s because I redirect everything that doesn't have a page ending (a file extension) to have a trailing slash.
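For reference, a rule like "anything without a page ending gets a trailing slash" boils down to something like the Python sketch below. This is a guess at the logic, not the actual .htaccess rules in use, and the function name is hypothetical; it looks at the path only and ignores query strings.

```python
def trailing_slash_target(path):
    """Return the 301 target for `path`, or None if no redirect is needed."""
    if path.endswith("/"):
        return None                    # already has the trailing slash
    last_segment = path.rsplit("/", 1)[-1]
    if "." in last_segment:
        return None                    # has a "page ending" such as .php or .html
    return path + "/"
```

Worth noting: a rule of this shape would explain the 301 on `/(normalpage)`, but not by itself the 301 on `/(normalpage)/3/`, which already ends in a slash. So some other rule is presumably catching that one.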