Moderators: Ocean10000 & incrediBILL

Search Engine Spider and User Agent Identification Forum

    
How can I control unidentified bots?
How to control unidentified bots that are consuming bandwidth
skseofleet




msg:4435160
 9:38 am on Mar 30, 2012 (gmt 0)

Hello All,

How can I control unidentified bots that are relentlessly consuming my website's bandwidth? I disallowed all bad robots with .htaccess and robots.txt, but bad bots are still using up bandwidth. I don't know any other method to stop them, so please help me control them. My robots.txt file is at batsgap.com/robots.txt.

 

lucy24




msg:4435170
 10:13 am on Mar 30, 2012 (gmt 0)

robots.txt has no effect on bad robots, because they probably don't read it and definitely don't obey it.

Blocking robots by htaccess will not prevent them from trying to get in. You will still see them in your logs. But all they take is a few hundred bytes for a 403, instead of the multiple Ks or MBs they would get if they reached the real page.

I disallowed all bad robots by .htaccess and robots.txt

All of them?! How?

Dijkgraaf




msg:4437291
 12:38 am on Apr 5, 2012 (gmt 0)

Reading your robots.txt, I see that you allow ia_archiver and MSNPTC full access to your site.

You tell all other robots not to request the following:
Disallow: /note/
Disallow: /search.php
Disallow: /click.php
Disallow: /t.php
Disallow: /exitpage/
Disallow: /popup/
Disallow: /r.php

As lucy24 says, bad bots don't read/obey it anyway.

There have been various discussions of bot traps in the Search Engine Spider and User Agent Identification forum. In particular, start with the thread "Quick primer on identifying bot activity: And a how to guide to slow and stop scraping" [webmasterworld.com...]

skseofleet




msg:4437337
 5:02 am on Apr 5, 2012 (gmt 0)

@Dijkgraaf,

I went through the thread on identifying search engine spiders and bots. It is a very informative post from a theoretical perspective, but it offers no solution for controlling bad bots such as:

# "Crawler"
# "Bot"
# "Spider"
# "user-agent"

Some unidentified bots with these names are relentlessly consuming my bandwidth. Is there any way to block these bots?

lucy24




msg:4437365
 8:16 am on Apr 5, 2012 (gmt 0)

You could use either mod_rewrite or mod_setenvif to lock out anyone whose user-agent string contained any of those terms. Then make a sub-rule to exempt permitted robots like (I assume) the googlebot. Do that part by IP rather than UA because the Big Names have plenty of spoofers.

Remember, again, that robots have no brains. A 403 does not make them go away for good. It only prevents them from getting in right then and there. If they have a shopping list of 30 requests and the first 20 have been 403'd, that will not stop them from asking for the remaining 10 items.
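
For illustration only, here is a minimal sketch of the mod_setenvif approach described above. It assumes Apache 2.2-style access control in .htaccess (with AllowOverride permitting FileInfo and Limit); the variable name bad_bot is made up for the example, and the exempted address prefix is a placeholder taken from later in this thread, not a verified range.

```apache
# Flag any request whose User-Agent contains one of these fragments.
# SetEnvIfNoCase makes the match case-insensitive.
# "bad_bot" is a hypothetical variable name chosen for this example.
SetEnvIfNoCase User-Agent "(bot|crawler|spider)" bad_bot

# Sub-rule: clear the flag for an address range you trust.
# The prefix 66.249. is an unverified placeholder; confirm real
# crawler ranges yourself before relying on it.
SetEnvIf Remote_Addr "^66\.249\." !bad_bot

# Apache 2.2 access control: allow everyone, then deny flagged requests.
Order Allow,Deny
Allow from all
Deny from env=bad_bot
```

Flagged requests get a 403; as noted above, that keeps the response small but does not stop the robot from asking again. Note that the "bot" fragment also matches legitimate crawlers, so each one you want to keep needs its own exemption by IP.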

skseofleet




msg:4437378
 8:51 am on Apr 5, 2012 (gmt 0)

@lucy, how can I identify these terms? Is there any way to check the IPs of bad robots? I have already used mod_rewrite to send a 403 to some known bad bots, but I don't know how to identify these ones. Do you know some lines of code for this?

lucy24




msg:4437512
 2:59 pm on Apr 5, 2012 (gmt 0)

If there were a Recognized List of bad IPs, everyone hereabouts would be very, very happy :)

If a robot is thoughtful enough to identify itself as -bot, -crawler, -spider and so on, you can always block it. There are lots of posted lists of elements that never occur in a human UA. Java, Jakarta, Nutch etc.... Doesn't have to be a complete word. Just match the fragment.

And then un-block things like known google ranges. 66.249, 74.125... (Don't quote me, I'm just making this up off the top of my head and it's too early in the morning.) There's a thread over in SSID called At Home With the Robots that gives a pretty representative sampling of IP ranges for the most active robots.
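
As a rough sketch of the block-then-unblock idea in mod_rewrite form, assuming mod_rewrite is enabled and usable from .htaccess: the fragment list comes from the post above, and the two address prefixes are the off-the-cuff ones quoted there, so treat them as unverified placeholders rather than real Google ranges.

```apache
RewriteEngine On

# Match user-agent fragments that should not occur in a human browser.
# No need for whole words; [NC] makes the match case-insensitive.
RewriteCond %{HTTP_USER_AGENT} (bot|crawler|spider|java|jakarta|nutch) [NC]

# Exempt permitted robots by IP rather than by user-agent.
# Both prefixes are unverified placeholders from the post above.
RewriteCond %{REMOTE_ADDR} !^66\.249\.
RewriteCond %{REMOTE_ADDR} !^74\.125\.

# All conditions are ANDed: matching requests get 403 Forbidden.
RewriteRule ^ - [F]
```

Check the fragments against your own logs before using a list like this, since a short fragment can also catch a visitor you want to keep.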
