|Determine source of a 403/Forbidden|
A while back, I explicitly blocked two bots, but I don't recall how I did it, so I can't work out how to remove the ban.
I checked the site's .htaccess, nothing there to block them (no user agent/partial UA, no IP or IP range present). I checked mod_security (the way I block at server level), nothing there.
Is there an apache log that can tell me what triggered the 403?
Is it your own server or isn't it? htaccess implies no; "mod_security at server level" implies yes. mod_security seems to add pretty detailed comments to the error log-- but I think they all come through as 500-class errors, so don't spend too much time there.
I took a quick detour to MAMP and tried locking myself out. Even at LogLevel "debug" it still says nothing more than "client denied by server configuration". Grrr.
There are lots of ways to lock people out, but most of them wouldn't apply to a potentially desirable robot. A referer block, for example-- but surely your robots don't come with their own referer?
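For reference, here is a rough sketch of the most common forms a block can take in Apache 2.2-style config, which may help when deciding what to search for. Everything specific in it is a placeholder: "ExampleBot", 192.0.2.0/24 and example.invalid are made up, not the poster's actual bots or addresses. The same directives can sit in httpd.conf or any included file as well as in .htaccess.

```apache
# Hypothetical examples only: ExampleBot, 192.0.2.0/24 and example.invalid
# are placeholders, not values taken from this thread.

# 1. IP block (mod_authz_host, Apache 2.2 syntax)
Order Allow,Deny
Allow from all
Deny from 192.0.2.0/24

# 2. User-agent block via an environment variable (mod_setenvif)
SetEnvIfNoCase User-Agent "ExampleBot" block_bot
Deny from env=block_bot

# 3. User-agent block via mod_rewrite
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} ExampleBot [NC]
RewriteRule .* - [F]

# 4. Referer block via mod_rewrite
RewriteCond %{HTTP_REFERER} example\.invalid [NC]
RewriteRule .* - [F]
```

So the strings worth searching for are probably not just the IP or UA text but also "Deny from", "SetEnvIf" and "RewriteCond", in the server config and its includes as well as under public_html.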
If there's any chance you blocked them via mod_rewrite, you could try running a RewriteLog and see what turns up. Just to confuse you, there's no on/off setting or LogLevel, you just have to specify a file. But it can't be done in htaccess, so you're looking at restarting your server :(
Yes it is a dedicated server. Neither bot has a referrer field. I've checked high and low for references to their IPs/User agents and even ran grep on public_html.
|If there's any chance you blocked them via mod_rewrite |
No, I do the rewrites via .htaccess (checked there too), and I checked the server's root folder (the page that displays if you type the IP in the address bar) and there wasn't an .htaccess present.
|you could try running a RewriteLog and see what turns up. Just to confuse you, there's no on/off setting or LogLevel, you just have to specify a file. But it can't be done in htaccess, so you're looking at restarting your server |
Officially confused. How do I run a RewriteLog?
Make up a file name and tell your server about it :)
You can test the UA question pretty easily by using any browser that will let you fake a user-agent. Give the exact text that your robots use, and see if you can get into your site. If no, there is a UA block somewhere. If yes, keep looking for IP.
I really doubt you want to take the RewriteLog approach. If you do, the format is
RewriteLog file-path
where-- says Apache--
|If the name does not begin with a slash ('/') then it is assumed to be relative to the Server Root. The directive should occur only once per server config. |
And once you've done that, you then have to specify a log level with RewriteLogLevel. (If you set a log level without naming a file, logs simply vanish into the ether. If you name a file without setting a non-zero log level, no logging gets done. mod_rewrite always has to do things differently from all other mods.)
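Put together, the two settings might look something like this in an Apache 2.2 server config. This is only a sketch: the log file name is made up, and the level of 2 is just a cautious starting point taken from the warning in the docs.

```apache
# Apache 2.2 only. Goes in the server config or a <VirtualHost>,
# not in .htaccess, so a restart is needed.

# Made-up file name; a path without a leading slash is relative to ServerRoot.
RewriteLog "logs/rewrite_debug.log"

# 0 (the default) means no logging at all. Keep the level low,
# and set it back to 0 once you have what you need.
RewriteLogLevel 2
```

Setting the level back to 0 afterwards is what keeps the file from growing without limit.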
Apache also says-- with exclamation marks--
|Using a high value for Level will slow down your Apache server dramatically! Use the rewriting logfile at a Level greater than 2 only for debugging! |
All of this strikes me as a last-resort solution if all you're trying to find out is how the ### you blocked those robots. In fact, this whole section of the docs gives the impression that Apache just isn't all that happy about the RewriteLog idea at all :)
Look in the CP and Security Section for Deny IP.
I used the Firefox User Agent Switcher and copied one of the UAs from my 403 log.
I tried to access one of the pages the bots were looking for.
In the log, my visit showed up as broken images and I saw only the text/links of the 403 page. (The error document is /403.php, for example.)
When the actual bots visit, it's just 1 click and only the page they tried to reach shows up in the logs. The css/images, etc don't show up in the 403 logs.
The 403 log comes from a script I've written myself that logs every hit to "403.php".
That makes me wonder if this is IP based, then.
|I really doubt you want to take the RewriteLog approach. |
This seems a little scary. If I make a mistake and fill the logs up & cause a crash to the server, it'll take 3+ hrs for the data center to reboot, if my last accidental crash from filling up logs is any indication.
|Look in the CP and Security Section for Deny IP. |
I've checked the firewall and didn't see their IPs. Also, when an IP is firewalled, it doesn't show up in my 403 logs.
|When the actual bots visit, it's just 1 click and only the page they tried to reach shows up in the logs. The css/images, etc don't show up in the 403 logs. |
Uhm.... They're robots. Except in the rarest, most exceptional cases, robots never take anything but the html itself.
|I tried to access one of the pages the bots were looking for. In the log, my visit showed up as broken images and I saw only the text/links of the 403 page. |
This is a little obscure. Do you mean that you, yourself, saw broken-image icons onscreen? And there are supposed to be images on the 403 page? If so, you have learned something very useful and potentially embarrassing in a "been there, done that" kind of way.
You have to make sure that everything needed by your 403 page is accessible to those who have been locked out. Numerically most 403s go to robots, who don't even look at your 403 page. But the page exists for the benefit of humans who took a wrong turn-- most often, by asking for a directory that doesn't have an index. So you have to poke a hole for them.
If the 403 comes from mod_rewrite, make a preliminary RewriteRule that says "if the request is for anything used by the 403 page, let them through". If the 403 comes from mod_authz-whatsit, make a <FilesMatch> envelope that says Allow from all.
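A minimal sketch of both kinds of hole, assuming Apache 2.2 syntax in the site's .htaccess. The /403.php name comes from earlier in the thread; the 403-assets/ folder, the 403.css and 403-logo.png file names, and "ExampleBot" are hypothetical stand-ins for whatever the error page really uses.

```apache
# --- If the block is done with mod_rewrite ---
RewriteEngine On

# Let everyone fetch the error page and its assets.
# "-" means no substitution; [L] stops further rule processing for this pass.
# (In .htaccess the pattern is matched without the leading slash.)
RewriteRule ^(403\.php$|403-assets/) - [L]

# Blocking rules come after the exemption, for example:
RewriteCond %{HTTP_USER_AGENT} ExampleBot [NC]
RewriteRule .* - [F]

# --- If the block is done with Allow/Deny (mod_authz_host) ---
# <FilesMatch> tests the file name only, so list the files by name.
<FilesMatch "^(403\.php|403\.css|403-logo\.png)$">
    Order Allow,Deny
    Allow from all
</FilesMatch>
```

The exemption has to cover the error page itself, not just its images: the request for /403.php passes through the same rules, so a blocked visitor would otherwise be refused the error page too.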
And so on.
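If you are using Apache 2.4, the directives for logging mod_rewrite changed: RewriteLog and RewriteLogLevel are gone, and rewrite logging is now set per module on the regular LogLevel directive.
In 2.4 you can turn up logging for any individual module, so that could be useful.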
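As a rough illustration of the 2.4 style, assuming the line goes in the server config or a virtual host (not .htaccess). The choice of modules and levels here is only a suggestion.

```apache
# Apache 2.4: overall level stays at "warn"; mod_rewrite alone is raised
# to trace3, and its messages go to the ordinary error log.
LogLevel warn rewrite:trace3

# The same per-module syntax works for other modules, for example the
# authorization core, if the 403 is thought to come from Require/Allow/Deny:
# LogLevel warn authz_core:debug
```

As with the old RewriteLog, the higher trace levels are verbose, so it is probably wise to drop back to the normal level once the culprit is found.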