Forum Library, Charter, Moderators: Ocean10000 & incrediBILL & phranque

Apache Web Server Forum

    
Deny users with blank REMOTE HOST field?
brokaddr




msg:4445742
 1:36 am on Apr 26, 2012 (gmt 0)

I researched mod_security to see whether this is possible. If it is, I can't locate the corresponding documentation.

There is a bot scraping images from my site using a blank IP and blank host field and I cannot figure out how to make scraping more difficult for this guy.

I reckon using:
SecRule REMOTE_HOST "" deny,status:403
would also boot a large portion of users.

Any tips? I'm open to server-wide bans (preferable, if possible), not just htaccess options.

 

wilderness




msg:4445764
 2:56 am on Apr 26, 2012 (gmt 0)

The entire internet functions on IPs; how is it that your site functions differently?

Are you viewing the bot's activity via some stats software, rather than your "raw visitor logs"?

brokaddr




msg:4445780
 3:55 am on Apr 26, 2012 (gmt 0)

The user appears to be spoofing their IP, is what I am trying to say.
All other users have legit remote host / remote addr fields, except this particular user.
All they've been looking at for the past few hours is images.

lucy24




msg:4445790
 4:15 am on Apr 26, 2012 (gmt 0)

If they are genuinely sending a forged IP, and it's landing in your raw logs that way, be afraid. Be very afraid. I'd always understood that this is the one thing robots-- or, for that matter, humans-- can't do. Much as they'd like to.

But nobody can be sure unless you post a sample from your raw logs. You imply in your first post that it's your own server, so it's not a question of access to the logs.

brokaddr




msg:4445810
 5:33 am on Apr 26, 2012 (gmt 0)

No, I am not looking at my raw logs actually.

It's a php script that monitors current activity (what users are looking at, referrers, etc) - I catch a majority of the weirdest activity from bots through this thing.

Now, without this particular user sending the remote host, how would I go about narrowing it down in my raw access logs?

I can limit a search to the image filename viewed, but I'd suspect he wasn't the only one viewing the image - how can I be sure I found the right guy?

g1smd




msg:4445833
 7:05 am on Apr 26, 2012 (gmt 0)

Hopefully the timestamp should nail it.
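
Matching on the timestamp can be sketched roughly as follows. This is only an illustration in Python, assuming the access log uses Apache's combined format with the client address first. The function name `find_hits`, the five-second window, and the second sample line (a documentation-range address, 192.0.2.10) are hypothetical. The first sample line reuses the request quoted later in this thread.

```python
import re
from datetime import datetime, timedelta

# Start of a combined-format line: client, identity, user, [time], "METHOD path ..."
LINE_RE = re.compile(
    r'^(?P<client>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] "(?P<method>\S+) (?P<path>\S+)[^"]*"'
)
TIME_FORMAT = "%d/%b/%Y:%H:%M:%S %z"

def find_hits(lines, path, seen_at, window_seconds=5):
    """Return (client, timestamp) pairs for requests to `path` near `seen_at`."""
    window = timedelta(seconds=window_seconds)
    hits = []
    for line in lines:
        m = LINE_RE.match(line)
        if not m or m.group("path") != path:
            continue
        ts = datetime.strptime(m.group("time"), TIME_FORMAT)
        if abs(ts - seen_at) <= window:
            hits.append((m.group("client"), ts))
    return hits

# Sample data: the second line is made up for the sake of the example.
sample = [
    '74.125.158.83 - - [28/Apr/2012:23:03:33 -0400] "GET /image/example.jpg HTTP/1.1" 200 10749 "http://www.example.com" "Mozilla/5.0"',
    '192.0.2.10 - - [28/Apr/2012:22:10:00 -0400] "GET /image/example.jpg HTTP/1.1" 200 10749 "-" "Mozilla/5.0"',
]

# The time the monitoring script reported the blank visitor (hypothetical value).
seen_at = datetime.strptime("28/Apr/2012:23:03:31 -0400", TIME_FORMAT)

for client, ts in find_hits(sample, "/image/example.jpg", seen_at):
    print(client, ts.isoformat())
```

Both sample lines request the same image, but only the one within a few seconds of the reported time is returned, which is how the timestamp separates this visitor from others viewing the same file.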

enigma1




msg:4446029
 3:13 pm on Apr 26, 2012 (gmt 0)

The PHP script is not the Apache access log. The access log is the one you need to check for the problem, and it will include the IP (unless you have a server configuration problem).

If you want to get the IP via PHP, make sure your PHP script works and points to the correct server variable. It may depend on the hosting type: on cloud hosting the IP may not be in $_SERVER['REMOTE_ADDR'], or the script may be reading the IP from a variable that can be set from the client end instead of the right one. So cross-reference what you see with the Apache log.
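
The distinction between the connection-level variable and client-settable ones can be sketched like this. The example is in Python rather than PHP, using a plain dict keyed the way $_SERVER is. The function name `describe_client` and the sample values are hypothetical, and `HTTP_X_FORWARDED_FOR` is used only as one common example of a header the client can set.

```python
def describe_client(server_vars):
    """Summarise where a request's address information came from.

    `server_vars` is a dict keyed like PHP's $_SERVER. REMOTE_ADDR is set by
    the server from the TCP connection; HTTP_* entries are copied from request
    headers, so the client can put anything in them.
    """
    remote_addr = (server_vars.get("REMOTE_ADDR") or "").strip()
    forwarded = (server_vars.get("HTTP_X_FORWARDED_FOR") or "").strip()
    return {
        "remote_addr": remote_addr or None,   # connection-level address
        "client_supplied": forwarded or None, # header value, not trustworthy on its own
        "blank": not remote_addr,             # True when a monitor would show an empty IP
    }

# Hypothetical requests: one normal, one where the script sees no REMOTE_ADDR.
normal = {"REMOTE_ADDR": "192.0.2.10"}
odd = {"REMOTE_ADDR": "", "HTTP_X_FORWARDED_FOR": "198.51.100.7"}

print(describe_client(normal))
print(describe_client(odd))
```

If a monitoring script records the header value instead of REMOTE_ADDR, a missing or forged header shows up as a blank or bogus address even though the access log still has the real connecting IP. Behind a proxy or load balancer, REMOTE_ADDR is the proxy's address, which is one way hosting type changes what the script sees.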

brokaddr




msg:4447108
 3:35 am on Apr 29, 2012 (gmt 0)

After some analyzation this appears to be from Google Web Preview. I'm not sure why my script isn't picking this up (it does for every other bot/visitor):


74.125.158.83 - - [28/Apr/2012:23:03:33 -0400] "GET /image/example.jpg HTTP/1.1" 200 10749 "http://www.example.com" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.1 (KHTML, like Gecko; Google Web Preview) Chrome/12.0.742 Safari/535.1"

74.125.158.84 - - [28/Apr/2012:23:03:33 -0400] "GET /example.php HTTP/1.1" 200 14797 "http://www.example.com" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.1 (KHTML, like Gecko; Google Web Preview) Chrome/12.0.742 Safari/535.1"

This 1e100.net domain visits my site to grab a feed periodically, and it sends the IP/hostname without issue.

lucy24




msg:4447119
 3:57 am on Apr 29, 2012 (gmt 0)

"After some analyzation this appears to be"

Shouldn't take much analysis, since it's a g### IP and it identifies itself upfront.

Google Web Preview is neither a human nor a robot. It's adept at slipping through cracks. But so far it doesn't seem to lie about its name. (Or vice versa. Would there be any way to tell if the googlebot wore the Preview's clothes so it could sneak in where it isn't wanted?)

:: shuffling papers ::

74.125.64.91 - - [27/Apr/2012:10:20:50 -0700] "GET /paintings/refrats/joan_blinds.html HTTP/1.1" 200 883 "http://www.google.com/search" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.1 (KHTML, like Gecko; Google Web Preview) Chrome/12.0.742 Safari/535.1"

Yup. That's Preview. (Hm. Just noticed. Is 12 the most recent version of Chrome for Linux? It's otherwise 18.) Always the complete page including all images, css and-- unless you physically block it-- any and all js.
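
The two things checked by eye above, a self-identifying user agent and an address that belongs to Google, can be sketched in Python as below. This is a rough illustration, not a definitive test: the function names are hypothetical, the regular expression assumes Apache's combined log format, and the hostname suffixes passed to `forward_confirmed` are the caller's assumption (the thread only mentions 1e100.net). It confirms who owns the address. It does not distinguish Googlebot from Preview.

```python
import re
import socket

# Apache combined log format, one request per line.
COMBINED_RE = re.compile(
    r'^(?P<client>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\S+) "(?P<referer>[^"]*)" "(?P<agent>[^"]*)"$'
)

def parse_line(line):
    """Split a combined-format line into named fields, or return None."""
    m = COMBINED_RE.match(line.strip())
    return m.groupdict() if m else None

def claims_web_preview(agent):
    """True if the user agent announces itself as Google Web Preview."""
    return "Google Web Preview" in agent

def forward_confirmed(ip, suffixes,
                      reverse=socket.gethostbyaddr,
                      forward=socket.gethostbyname_ex):
    """Reverse-resolve `ip`, check the hostname suffix, then resolve the
    hostname forward again and confirm it maps back to the same address."""
    try:
        host = reverse(ip)[0]
        if not host.lower().endswith(tuple(suffixes)):
            return False
        return ip in forward(host)[2]
    except OSError:  # lookup failures (socket.herror, socket.gaierror)
        return False

line = ('74.125.64.91 - - [27/Apr/2012:10:20:50 -0700] '
        '"GET /paintings/refrats/joan_blinds.html HTTP/1.1" 200 883 '
        '"http://www.google.com/search" '
        '"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.1 '
        '(KHTML, like Gecko; Google Web Preview) Chrome/12.0.742 Safari/535.1"')

entry = parse_line(line)
print(entry["client"], claims_web_preview(entry["agent"]))
```

The user-agent check alone proves nothing, since any client can send that string. Calling `forward_confirmed(entry["client"], (".1e100.net",))` performs live DNS lookups, so it is left out of the demo. The `reverse` and `forward` parameters exist so the lookups can be replaced with stubs when testing.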

brokaddr




msg:4447122
 4:04 am on Apr 29, 2012 (gmt 0)

Trying to narrow a pattern requires analyzation :) This appears to be the only type of 'visit' that my script doesn't account for. I'm curious to know what they're doing that leaves $_SERVER['REMOTE_ADDR'] empty in my script. I will have to keep an eye on the blank visitors to see if I can figure out what's up. Though as with any Google product, I'm not getting my hopes up about figuring out what they're doing.

Initially, I'd expected this to be nefarious visitors or scrapers.


All trademarks and copyrights held by respective owners. Member comments are owned by the poster.
WebmasterWorld is a Developer Shed Community owned by Jim Boykin.
© Webmaster World 1996-2014 all rights reserved