Forum Moderators: phranque

Constant stream of 404s with Google as the referrer

         

csdude55

6:46 am on Apr 20, 2020 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



After changing my LogLevel settings, I'm seeing a ton of 404 errors that are referred from Google. I started saving the REQUEST_URI about 45 minutes ago (1:45am), and so far I have:

/Murder
/2012
/Some
/Man
/Liquor
/Graduation
/Burn
/Voter
/Saving
/Preschool
/Environmental
/Firefighters

The IPs change a bit, but the most recent IP traced back to Amazon, so it's not a Google bot.

My server load is already pretty stressed, and I suspect that this is at least part of the problem.

Would you suggest saving and blocking the words manually via htaccess; using a regex in htaccess to block any REQUEST_URI that begins with a capital letter followed by a lowercase letter (which should never actually exist on my end); modifying 404.php to store the IP for any such request in a database and then blacklist it; or just letting it go? Or something else?
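
For reference, the regex option could look roughly like the following. This is a minimal sketch, assuming mod_rewrite is enabled and that no legitimate URL on the site starts with a capital letter followed by a lowercase one. It would not catch the /2012 request, which would still need its own rule.

```apache
# Sketch only: refuse any path that starts with an uppercase letter
# followed by a lowercase letter (e.g. /Murder, /Voter).
# Deliberately no [NC] flag, because letter case is what is being tested.
RewriteEngine On
# In .htaccess the leading slash is stripped before matching, so it is optional here.
RewriteRule ^/?[A-Z][a-z] - [F]
```

The [F] flag answers with 403 Forbidden without looking for a file on disk.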

phranque

10:44 am on Apr 20, 2020 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



you can often find more success by looking for patterns other than the IP used or URI requested.

have you studied this thread?
Blocking Methods [webmasterworld.com]

lucy24

4:31 pm on Apr 20, 2020 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



"I'm seeing a ton of 404 errors that are referred from Google"

Ah, good ol’ google_ref botnet. (My name for it.) Not as common as it once was, but it’s simply a humanoid robot--i.e. nothing obviously blockable in headers or IP--that claims google.com as referer instead of the more usual referer-less request. Rarely I even see idiocies like /wp-admin claiming google as referer, which is a no-brainer all around :)

I didn’t see any for several months, but have seen a few recently. In fact it led me to wonder if google is pre-loading pages: if something is #1 in SERPs, they fetch the page on behalf of the user, and then if the user happens not to click on the #1 result, it comes out looking like a robot. But I think if this were happening there would be extensive discussion of it.

If you’re getting inundated with bogus requests for the same URL over and over again, you could certainly block them--or, equivalently, you could return a manual 404. The appeal of the latter approach is that it saves your server the work of physically looking for the file, while returning no usable information to the visitor.
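
A manual 404 of that kind might be sketched as below, assuming mod_rewrite is enabled. The three paths are just samples taken from the list earlier in the thread, not a complete set.

```apache
# Sketch only: answer a few known-bogus paths with 404 directly.
RewriteEngine On
# With a status outside the 3xx range, the substitution is dropped and
# rewriting stops, so "-" is enough as the target.
RewriteRule ^/?(?:Murder|Voter|Liquor)$ - [R=404]
```

If an ErrorDocument 404 script is configured, expect it to still run for these responses, so the saving applies mainly to the filesystem lookup.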

Tangent: Why did it require a change in LogLevel to make this visible? 404s should be showing up in ordinary access logs anyway, unless you’ve configured your server not to include 404s in logs--which is definitely not the default.

csdude55

7:28 pm on Apr 20, 2020 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I had not seen that, @phranque, thanks for the link :-) Some of those scare me, though... would they potentially be triggered by ad blockers and "security" software that tries to mask users' identities? Especially blocking by user agent or IP?

@lucy24, these are pretty new to me, too. It could just be a coincidence that they started showing up around the same time that I modified the LogLevel, but it's also possible that I had set it too low when I set up WHM (WebHost Manager). I've had this server for several years, so I honestly don't remember everything I did in the initial setup.

Right now I have my 404s going to 404.php, which I use to simply save the request to a text file:

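// Log the requested path only when the Referer header is exactly Google's home page.
if (
    // isset() avoids an "undefined array key" warning when no Referer is sent
    isset($_SERVER['HTTP_REFERER']) &&
    $_SERVER['HTTP_REFERER'] === 'https://www.google.com/'
) {
    // FILE_APPEND adds to the end of the file; LOCK_EX keeps concurrent requests from interleaving writes
    file_put_contents('/home/example/www/404.dat', $_SERVER['REQUEST_URI'] . "\n", FILE_APPEND | LOCK_EX);
}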

For the last 24 hours it's been the same 12 requests over and over, so I blocked them in .htaccess:

RewriteCond %{HTTP_REFERER} service\.dropdowndeals\.com [NC,OR]
# "+" is a regex quantifier, so it is escaped here to match a literal "+" in the query string
RewriteCond %{QUERY_STRING} (?:(?:information|table)_schema|my_db_name|union\+all\+select) [NC,OR]

# I already had most of this, so I just added the list of requests
RewriteCond %{REQUEST_URI} ^/(?:crossdomain|wp-|administrator|phpmyadmin|p2|impl\.|2012|Burn|Environmental|Firefighter|Graduation|Liquor|Man|Murder|Preschool|Saving|Some|Voter)|/(?:a|b|shell|tiki-register|who|wp-login|xmlrpc)\.php|\.sql|license\.txt$ [NC]
RewriteRule ^ - [F]

I used to use blacklist.txt for CSF (ConfigServer Firewall) but found it to be kind of slow, and for now the list is short enough that I can plug it into .htaccess. I'm going to keep 404.dat going, though, and just remember to check on the list periodically.

I'm also going to have to figure out a way to limit requests to once every 30 seconds or something...

tangor

8:00 pm on Apr 20, 2020 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I see several thousand of these each month. A 404 is one way; I use a 403, based on a little more than 30 pattern matches against the REQUEST, to kill off over 95% of the noise and tell the server not to bother looking ... and a 403 eventually makes some go away, where a 404 just keeps getting hammered.

The one pattern that catches the most is ".php" ... but that's because I don't use php in the first place. :)
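
As an illustration of that kind of pattern match, here is a minimal sketch assuming mod_rewrite is enabled. It only makes sense on a site that serves no PHP at all, so it would not suit a setup that relies on something like 404.php.

```apache
# Sketch only: return 403 Forbidden for any request path ending in .php.
RewriteEngine On
RewriteRule \.php$ - [F]
```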