| 10:13 am on Mar 30, 2012 (gmt 0)|
robots.txt has no effect on bad robots, because they probably don't read it and definitely don't obey it.
Blocking robots by htaccess will not prevent them from trying to get in. You will still see them in your logs. But all they take is a few hundred bytes for a 403, instead of the multiple Ks or MBs they would get if they reached the real page.
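A minimal sketch of that kind of .htaccess block, assuming Apache 2.2-era mod_setenvif and Order/Allow/Deny syntax; "BadBot" is a placeholder fragment, not a real robot name:

```apache
# BrowserMatchNoCase is shorthand for SetEnvIfNoCase User-Agent:
# tag any request whose UA contains the (placeholder) fragment "BadBot".
BrowserMatchNoCase "BadBot" bad_bot

# Deny tagged requests: they get a cheap ~few-hundred-byte 403
# instead of the full page.
Order Allow,Deny
Allow from all
Deny from env=bad_bot
```

With Order Allow,Deny, a request matching both Allow and Deny is denied, so the tagged bots lose.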
|I disallowed all bad robots by .htaccess and robots.txt |
All of them?! How?
| 12:38 am on Apr 5, 2012 (gmt 0)|
Reading your robots.txt
You allow ia_archiver and MSNPTC full access to your site.
All the others you only tell not to request the paths you've disallowed.

As lucy24 says, bad bots don't read/obey it anyway.
There have been various discussions regarding bot traps in the Search Engine Spider and User Agent Identification forum; in particular, start reading the thread "Quick primer on identifying bot activity: And a how to guide to slow and stop scraping" [webmasterworld.com...]
| 5:02 am on Apr 5, 2012 (gmt 0)|
I went through the Search Engine Spider and User Agent Identification forum. That is a very informative post from a theoretical perspective, but there is still no solution for controlling bad bots such as:
Some unidentified bots with these names are relentlessly consuming my bandwidth. Is there any way to block these bots?
| 8:16 am on Apr 5, 2012 (gmt 0)|
You could use either mod_rewrite or mod_setenvif to lock out anyone whose user-agent string contains any of those terms. Then make a sub-rule to exempt permitted robots like (I assume) the googlebot. Do that part by IP rather than by UA, because the Big Names have plenty of spoofers.
Remember, again, that robots have no brains. A 403 does not make them go away for good. It only prevents them from getting in right then and there. If they have a shopping list of 30 requests and the first 20 have been 403'd, that will not stop them from asking for the remaining 10 items.
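A hedged mod_rewrite sketch of the approach described above. The UA fragments are invented placeholders, and the 66.249 prefix is only an assumed Googlebot range; verify the real ranges yourself before relying on them:

```apache
RewriteEngine On
# Exempt (assumed) Googlebot addresses by IP rather than by UA,
# since the big-name UA strings are widely spoofed.
RewriteCond %{REMOTE_ADDR} !^66\.249\.
# Block any UA containing one of these placeholder fragments.
RewriteCond %{HTTP_USER_AGENT} (EvilScraper|SiteSucker) [NC]
# [F] returns 403 Forbidden for the matching request.
RewriteRule .* - [F]
```

As noted above, the 403 only stops that one request; the bot will usually keep working through the rest of its list.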
| 8:51 am on Apr 5, 2012 (gmt 0)|
@lucy, how can I identify these terms? Is there any way to check the IPs of bad robots? I already used mod_rewrite to return 403s for some known bad bots, but I don't know how to identify the others. Do you know some code lines for these?
| 2:59 pm on Apr 5, 2012 (gmt 0)|
If there were a Recognized List of bad IPs, everyone hereabouts would be very, very happy :)
If a robot is thoughtful enough to identify itself as -bot, -crawler, -spider and so on, you can always block it. There are lots of posted lists of elements that never occur in a human UA: Java, Jakarta, Nutch, etc. It doesn't have to be a complete word; just match the fragment.
And then un-block things like known google ranges. 66.249, 74.125... (Don't quote me, I'm just making this up off the top of my head and it's too early in the morning.) There's a thread over in SSID called At Home With the Robots that gives a pretty representative sampling of IP ranges for the most active robots.
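Putting those two steps together, here is a mod_setenvif sketch: the UA fragments are the ones mentioned just above, and the 66.249 prefix is the off-the-top-of-the-head Google range quoted in this post, so check it against a current source before trusting it:

```apache
# Tag UAs containing fragments that never occur in a human browser UA
# (partial matches are enough; the list here is from this thread).
SetEnvIfNoCase User-Agent "(Java|Jakarta|Nutch)" bad_bot

# Tag requests from an (assumed, unverified) Google IP range.
SetEnvIf Remote_Addr "^66\.249\." good_bot

# Order Deny,Allow: Deny rules are evaluated first, then Allow rules
# override them, so a verified good_bot gets in even with a bot-like UA.
Order Deny,Allow
Deny from env=bad_bot
Allow from env=good_bot
```

The Order Deny,Allow direction matters: with Allow,Deny the deny would win and the un-blocking would have no effect.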