
Apache Web Server Forum

    
/3 and /5 requests, strange
roshaoar




msg:4661351
 10:22 am on Apr 8, 2014 (gmt 0)


Has anyone come across strange web requests like this that don't seem to bear any resemblance to what's on the site?

Requests for <url>/3 and <url>/5

The requests come from Princeton and Google. Obviously they 404 because I've never had those URLs, but does it ring any bells with anyone why this might happen?

Thx


* Wasn't sure if this is the right forum, if not please move?

 

lucy24




msg:4661485
 6:41 pm on Apr 8, 2014 (gmt 0)

Hm. Computer-science class with "Create a robot" assignment?

The request from Google is a little bit worrying because it implies they've actually seen the link somewhere. Google does ask for nonexistent files, but always in a form like "23lk4jdf9o8tu5.html", where they deliberately ask for a garbage name. It seems to be triggered by an unusual rate of redirects on your site.

Option B is that somebody's robot has got their shopping list mixed up with a different hostname. But Princeton, hm. I assume that's based on the IP, not just some claim in the UA string.

SSID forum maybe?

aristotle




msg:4661486
 6:42 pm on Apr 8, 2014 (gmt 0)

Can you show an example of what this looks like in your logs?

roshaoar




msg:4661489
 6:59 pm on Apr 8, 2014 (gmt 0)

66.249.68.59 - - [08/Apr/2014:03:39:31 +0100] "GET /(normalpage)/3/ HTTP/1.0" 301 207 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
66.249.68.59 - - [08/Apr/2014:03:39:32 +0100] "GET /(normalpage) HTTP/1.0" 301 208 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"


66.249.68.91 - - [08/Apr/2014:04:30:58 +0100] "GET /(differentnormalpage)/3/ HTTP/1.0" 301 203 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
66.249.68.59 - - [08/Apr/2014:04:30:58 +0100] "GET /(differentnormalpage) HTTP/1.0" 301 204 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"


I googled the IP:

IP Address: 66.249.68.59
whatismyipaddress.com/ip/66.249.68.59?
Location: Mountain View, United States - 66.249.68.59 is a static assigned Corporate IP address allocated to Googlebot. Learn more.

I think the /5/ might be something to do with this:

173.252.112.117 - - [08/Apr/2014:06:06:37 +0100] "GET /5/que-es-el-cine-de-genero/?utm_source=dlvr.it&utm_medium=twitterrce%3Dother_multiline&action_object_map=%257B%2522752450914769280%2522%253A547777391969572%257D&action_type_map=%257B%2522752450914769280%2522%253A%2522og.recommends%2522%257D&action_ref_map=%255B%255Dber_level_req%3D1&fb_locale=de_DE HTTP/1.0" 404 5753 "-" "facebookexternalhit/1.1 (+http://www.facebook.com/externalhit_uatext.php)"


Fwiw, I've been seeing a weird (looks hacked) WordPress cricket site post REALLY weird links to my site, pointing at pages that don't exist. Maybe Google got it from there. I disavowed it as it's nothing to do with me... but weird stuff.

Lucy, you say this is commonly triggered on sites with redirects, so maybe it's that? I redirect to add a trailing slash, to make directory/index.php just directory/, and to canonicalize www to non-www. I took them all off just in case, except the last one. Kind of annoying, as it's my personal site, which is slightly in the "webby" spotlight as of today.

lucy24




msg:4661570
 8:56 pm on Apr 8, 2014 (gmt 0)

The google requests are definitely legitimate. What about the Princeton ones?
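
For anyone who wants to check a claimed Googlebot IP themselves rather than rely on a lookup site, the usual approach is a reverse DNS lookup followed by a forward lookup that must return the same IP. A minimal sketch, assuming Python 3 and live DNS; the function name and the two lookup parameters are made up for illustration (they only exist so the check can be run offline with stand-in resolvers):

```python
import socket

# Hostname suffixes Google documents for its crawlers.
GOOGLE_SUFFIXES = (".googlebot.com", ".google.com")

def is_verified_googlebot(ip, reverse_lookup=None, forward_lookup=None):
    """Return True if ip reverse-resolves to a Google host that resolves back to ip."""
    # Default to real DNS; callers may pass stand-in resolvers instead.
    reverse_lookup = reverse_lookup or (lambda addr: socket.gethostbyaddr(addr)[0])
    forward_lookup = forward_lookup or (lambda host: socket.gethostbyname_ex(host)[2])
    try:
        host = reverse_lookup(ip)
        # The leading dot in each suffix keeps "evilgooglebot.com" from matching.
        if not host.lower().rstrip(".").endswith(GOOGLE_SUFFIXES):
            return False
        # Forward-confirm: the hostname must map back to the original IP.
        return ip in forward_lookup(host)
    except OSError:
        # Covers failed lookups (socket.herror and socket.gaierror are OSError subclasses).
        return False
```

Calling `is_verified_googlebot("66.249.68.59")` would run both lookups against real DNS; a UA string alone proves nothing, which is why the IP is what gets checked.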

A routine redirect like trailing-slash or "index.html" shouldn't trigger any unusual googlebot requests, because everyone has those. I notice the garbage requests when I've made changes resulting in an unusually high proportion of redirects. They deliberately ask for something that's extremely unlikely to exist, just to verify that the site is still returning 404s. I have to assume that this is fully automated, and that it's triggered by proportions rather than absolute numbers.

Unlike some problems, this one is probably safe to dump in the "ignore them and they'll go away" bin. If you're getting a whole lot of bum requests from the same IP, you might look them up and see if they deserve a general block on grounds of underlying robotitude. But really, a 404 is as effective as anything. Some robots get all excited over 403s because they think it means you're hiding something from them.
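
To see whether one IP really is responsible for "a whole lot of bum requests", the 404s can be tallied per address straight from the access log. A rough sketch, assuming Python 3 and Apache's combined log format as in the lines quoted earlier in the thread; the function name is illustrative, and requests containing escaped quotes are not handled:

```python
import re
from collections import Counter

# Combined log format: ip ident user [time] "request" status bytes "referer" "agent"
LOG_RE = re.compile(r'^(\S+) \S+ \S+ \[[^\]]+\] "[^"]*" (\d{3}) (\S+)')

def count_404s_by_ip(lines):
    """Count 404 responses per client IP; lines that do not parse are skipped."""
    counts = Counter()
    for line in lines:
        m = LOG_RE.match(line)
        if m and m.group(2) == "404":
            counts[m.group(1)] += 1
    return counts
```

Feeding it an open log file (for example `count_404s_by_ip(open("access.log")).most_common(10)`, where the path is a placeholder) lists the ten addresses with the most 404s, which is a reasonable starting point for deciding who to look up.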

roshaoar




msg:4661650
 10:43 pm on Apr 8, 2014 (gmt 0)

Thanks Lucy... again!

Maybe it was 403 related - I had some .htaccess sending "badPeople" to 403 land. Took that off a couple of days ago. argh.

tangor




msg:4661665
 1:09 am on Apr 9, 2014 (gmt 0)

404? The paired examples were 301s... what are they redirecting to? (The file sizes of the redirects are 207, 208, 203, 204, all very small.)

lucy24




msg:4661689
 2:20 am on Apr 9, 2014 (gmt 0)

what are they redirecting to? (the filesize of the redirects are 207, 208, 203, 204, all very small sizes.

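But that's just the size of the 301 response itself, not the page it redirects to. Strictly speaking, the logged number is the response body (the short "Moved Permanently" notice Apache sends with a redirect) rather than the headers. The exact size apparently depends on your server; the location part will only vary by a few bytes.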

If your ErrorDocument directive is incorrectly worded (pointing at a full http:// URL instead of a local path), every error turns into a 302. But not a 301; those only happen on purpose.
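
The status and size columns being discussed can be read out of a log line mechanically. A minimal sketch, assuming Python 3 and Apache's combined log format, where the number after the status code is the `%b` value (bytes sent, not counting headers, logged as `-` when nothing was sent); `LogEntry` and `parse_entry` are illustrative names:

```python
import re
from typing import NamedTuple, Optional

class LogEntry(NamedTuple):
    ip: str
    request: str
    status: int
    body_bytes: int

ENTRY_RE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]+\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)

def parse_entry(line: str) -> Optional[LogEntry]:
    """Parse one combined-format log line; return None if it does not match."""
    m = ENTRY_RE.match(line)
    if m is None:
        return None
    # Apache writes "-" instead of 0 when no body bytes were sent.
    size = 0 if m["size"] == "-" else int(m["size"])
    return LogEntry(m["ip"], m["request"], int(m["status"]), size)

# First Googlebot line from earlier in the thread.
line = ('66.249.68.59 - - [08/Apr/2014:03:39:31 +0100] '
        '"GET /(normalpage)/3/ HTTP/1.0" 301 207 "-" '
        '"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"')
entry = parse_entry(line)
print(entry.status, entry.body_bytes)  # 301 207
```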

roshaoar




msg:4661738
 6:33 am on Apr 9, 2014 (gmt 0)

Yes, they were 301s because I redirect every URL that doesn't have a page ending (a file extension) to the same URL with a trailing slash.


All trademarks and copyrights held by respective owners. Member comments are owned by the poster.
WebmasterWorld is a Developer Shed Community owned by Jim Boykin.
© Webmaster World 1996-2014 all rights reserved