
Forum Moderators: Ocean10000 & incrediBILL & phranque

Deny users with blank REMOTE HOST field?

   
1:36 am on Apr 26, 2012 (gmt 0)



I researched mod_security to see if this were possible - if it is, I can't locate the corresponding documentation.

There is a bot scraping images from my site using a blank IP and blank host field and I cannot figure out how to make scraping more difficult for this guy.

I reckon using:
SecRule REMOTE_HOST "" deny,status:403
would also boot a large portion of users.

Any tips? I'd prefer a server-wide ban if possible, not just .htaccess options.
2:56 am on Apr 26, 2012 (gmt 0)

WebmasterWorld Senior Member wilderness is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month



The entire internet functions on IPs. How is it that your site functions differently?

Are you viewing the bot's activity via some stats software, rather than your "raw visitor logs"?
3:55 am on Apr 26, 2012 (gmt 0)



The user appears to be spoofing their IP, is what I am trying to say.
All other users have legit remote host / remote addr fields, except this particular user.
All they've been looking at for the past few hours is images.
4:15 am on Apr 26, 2012 (gmt 0)

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time Top Contributors Of The Month



If they are genuinely sending a forged IP, and it's landing in your raw logs that way, be afraid. Be very afraid. I'd always understood that this is the one thing robots-- or, for that matter, humans-- can't do. Much as they'd like to.

But nobody can be sure unless you post a sample from your raw logs. You imply in your first post that it's your own server, so it's not a question of access to the logs.
5:33 am on Apr 26, 2012 (gmt 0)



No, I am not looking at my raw logs actually.

It's a php script that monitors current activity (what users are looking at, referrers, etc) - I catch a majority of the weirdest activity from bots through this thing.

Now, without this particular user sending the remote host - how would I go about narrowing it down in my raw access logs?

I can limit a search to the image filename viewed, but I'd suspect he wasn't the only one viewing the image - how can I be sure I found the right guy?
7:05 am on Apr 26, 2012 (gmt 0)

WebmasterWorld Senior Member g1smd is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month



Hopefully the timestamp should nail it.
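
To make the timestamp idea concrete, here is a minimal sketch that pulls out every access-log entry for one file within a given minute. It assumes the log uses Apache's "combined" format; the function name find_hits, the sample lines, and the addresses in them are made up for illustration, so adjust the pattern if your LogFormat differs.

```python
import re

# Assumed layout (Apache "combined"):
# host ident user [time] "request" status bytes "referer" "user-agent"
LOG_LINE = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

# Made-up sample entries; the IPs are documentation addresses, not real visitors.
SAMPLE_LOG = [
    '192.0.2.10 - - [28/Apr/2012:23:03:33 -0400] "GET /image/example.jpg HTTP/1.1" '
    '200 10749 "http://www.example.com" "ExampleAgent/1.0"',
    '198.51.100.7 - - [28/Apr/2012:22:15:02 -0400] "GET /image/example.jpg HTTP/1.1" '
    '200 10749 "-" "OtherAgent/2.0"',
    '192.0.2.10 - - [28/Apr/2012:23:03:34 -0400] "GET /example.php HTTP/1.1" '
    '200 14797 "http://www.example.com" "ExampleAgent/1.0"',
]


def find_hits(lines, path, time_prefix):
    """Return (host, time, agent) for entries matching the path and timestamp prefix."""
    hits = []
    for line in lines:
        match = LOG_LINE.match(line)
        if not match:
            continue  # skip lines that don't fit the assumed format
        if match.group("path") == path and match.group("time").startswith(time_prefix):
            hits.append((match.group("host"), match.group("time"), match.group("agent")))
    return hits


if __name__ == "__main__":
    # Narrow to the minute the monitoring script reported the "blank" visitor.
    for hit in find_hits(SAMPLE_LOG, "/image/example.jpg", "28/Apr/2012:23:03"):
        print(hit)
```

With a real log you would pass the open file instead of SAMPLE_LOG, e.g. find_hits(open(logfile), ...), where logfile is wherever your server writes its access log. Matching on a prefix of the timestamp (down to the minute or second) is usually enough to separate one visitor from others who fetched the same image.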
3:13 pm on Apr 26, 2012 (gmt 0)

WebmasterWorld Senior Member 5+ Year Member



The php script is not the Apache access log. The access log is the one you need to check for the problem, and it will include the IP (unless you have a server configuration problem).

If you want to get the IP via PHP, make sure your script works and reads the correct server variable. This can depend on the hosting type: on cloud hosting the visitor's IP may not be in $_SERVER['REMOTE_ADDR'], or the script may be reading a variable that can be set from the client end instead of the right one. So cross-reference what you see with the Apache log.
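
A rough sketch of that distinction, written in Python rather than PHP only to keep it short and testable. The dict stands in for PHP's $_SERVER; the function names and the list of header variables are my own assumptions, not anything taken from the original poster's script.

```python
# Variables built from request headers, which the client can set to anything.
# This list is illustrative, not exhaustive.
CLIENT_SUPPLIED = ("HTTP_X_FORWARDED_FOR", "HTTP_CLIENT_IP")


def peer_address(server_vars):
    """Return REMOTE_ADDR (the address of the connecting peer), or None if blank."""
    value = (server_vars.get("REMOTE_ADDR") or "").strip()
    return value or None


def client_claims(server_vars):
    """Return any client-supplied address headers, for logging only."""
    return {key: server_vars[key] for key in CLIENT_SUPPLIED if server_vars.get(key)}


if __name__ == "__main__":
    # Made-up request data using documentation addresses.
    request = {"REMOTE_ADDR": "192.0.2.10", "HTTP_X_FORWARDED_FOR": "203.0.113.7"}
    print(peer_address(request), client_claims(request))
```

If peer_address comes back as None for a visit that the Apache log shows with a normal IP, the problem is more likely in the monitoring script than in the visitor. Note that behind a proxy or load balancer REMOTE_ADDR is the proxy's address, which is one reason the hosting type matters.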
3:35 am on Apr 29, 2012 (gmt 0)



After some analyzation this appears to be from Google Web Preview. I'm not sure why this isn't picking up in my script (every other bot/visitor does):


[28/Apr/2012:23:03:33 -0400] "GET /image/example.jpg HTTP/1.1" 200 10749 "http://www.example.com" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.1 (KHTML, like Gecko; Google Web Preview) Chrome/12.0.742 Safari/535.1"
74.125.158.83 - -

[28/Apr/2012:23:03:33 -0400] "GET /example.php HTTP/1.1" 200 14797 "http://www.example.com" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.1 (KHTML, like Gecko; Google Web Preview) Chrome/12.0.742 Safari/535.1"
74.125.158.84 - -

This 1e100.net domain visits my site to grab a feed periodically, and it sends the IP/hostname without issue.
3:57 am on Apr 29, 2012 (gmt 0)

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time Top Contributors Of The Month



After some analyzation this appears to be

Shouldn't take much analysis, since it's a g### IP and it identifies itself upfront.

Google Web Preview is neither a human nor a robot. It's adept at slipping through cracks. But so far it doesn't seem to lie about its name. (Or vice versa. Would there be any way to tell if the googlebot wore the Preview's clothes so it could sneak in where it isn't wanted?)

:: shuffling papers ::

74.125.64.91 - - [27/Apr/2012:10:20:50 -0700] "GET /paintings/refrats/joan_blinds.html HTTP/1.1" 200 883 "http://www.google.com/search" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.1 (KHTML, like Gecko; Google Web Preview) Chrome/12.0.742 Safari/535.1"


Yup. That's Preview. (Hm. Just noticed. Is 12 the most recent version of Chrome for Linux? It's otherwise 18.) Always the complete page including all images, css and-- unless you physically block it-- any and all js.
4:04 am on Apr 29, 2012 (gmt 0)



Trying to narrow down a pattern requires analyzation :) This appears to be the only type of 'visit' that my script doesn't account for. I'm curious what they're doing that keeps the address out of $_SERVER['REMOTE_ADDR']. I will have to keep an eye on the blank visitors to see if I can figure out what's up. Though as with any Google product, I'm not getting my hopes up about figuring out what they're doing.

Initially, I'd expected this to be nefarious visitors or scrapers.
 
