Forum Moderators: open
aipbot
BecomeBot
Cerberian Drtrs
COMBINE
ConveraCrawler
Custom Bot/Robot #20
e-collector
Faxobot/1.0
Faxobot
Fetch API Request
FAST Enterprise Crawler
Html Link Validator (www.lithopssoft.com)
iaea
ichiro
INDEXU Spider Link Checker
IRLbot
linkwalker
LinksManager
Microsoft URL Control
microsoft.url
NaverBot
Nutch
NutchOrg
OmniExplorer_Bot
Spam Bot
Test.Com
T-H-U-N-D-E-R-S-T-O-N-E
Twiceler
Xenu Link Sleuth
Silver, you've listed 29 lines and called it a short list.
This portion of my htaccess is 297 lines. I'm not about to post that in its entirety in this forum or any other forum, in or out of WebmasterWorld.
The "Close to Perfect htaccess"
( [webmasterworld.com...] ) thread is a prime example of how ugly and long these types of threads can get when each participant begins adding their lines on top of what has previously been added. (You were aware of that [having participated in that thread], and you both inquired about and started a similar thread.)
Perhaps I'm narrow-minded; however, it doesn't make sense to me to re-invent the wheel.
ahmed writes:
"So how can we get the IPs for these bots (to block that IP)? Otherwise, how do we block them? and what do they do?"
Ahmed,
You have multiple questions:
1) IPs?
a) You can either view the forum tools
[webmasterworld.com...]
or use the links to other tools given in previous threads.
b) In most instances the preference is to deny access by UA rather than by IP, to lessen the chance of denying innocents.
c) In the case of the 29 provided by silver, you can either dig through the htaccess examples or seek an alternative source and learn how to write your own lines.
[webmasterworld.com...]
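To illustrate point (b), here is a minimal .htaccess sketch (Apache 2.2 mod_setenvif / mod_authz_host syntax; the UA fragments are taken from silver's list above, and 192.0.2.0/24 is a reserved documentation range used purely as a placeholder, not any real bot's network):

```apache
# Preferred: match on User-Agent fragments (case-insensitive),
# which avoids denying innocent visitors who merely share an IP range.
SetEnvIfNoCase User-Agent "Twiceler"   bad_bot
SetEnvIfNoCase User-Agent "IRLbot"     bad_bot
SetEnvIfNoCase User-Agent "linkwalker" bad_bot

Order Allow,Deny
Allow from all
Deny from env=bad_bot
# Fallback: deny by IP range only when the UA is forged.
# 192.0.2.0/24 is an example range, not a real bot's.
Deny from 192.0.2.0/24
```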
2) What do they do?
Personally, I don't have any desire to understand what they do, what they intend to do, or what they say they are going to do.
If the crawling UA contains a link to a URL which offers an explanation of its goals, and those goals ARE parallel to my website's goals, then I allow it. If NOT the same goals, it is denied.
In short: is there any benefit to the spidering of my data, and is that spidering beneficial or detrimental to my websites?
Spending time going through page after page of Google searches on some asinine name, because the software maker intends to deceive webmasters, is hardly a productive or cooperative use of my time.
The majority of my insights come from ARIN, RIPE and APNIC inquiries and direct lookups of IPs.
All other insights into making a determination, webmasters either learn over time (by monitoring their visitor logs) or from reading materials provided by various methods (one of which is this forum).
For ANY person to expect somebody to provide a functioning htaccess file which is compatible with two websites of entirely different content is very short-sighted.
Don
What you really mean is, "Hey, lets rehash the hundreds of threads that already cover the few spiders I mentioned, and probably contain the answers I'm looking for, but am too lazy to look for myself."
Ok, to be fair, I didn't find any quick results for "Test.Com" (a fake domain often used as an example around here) or for "Spam Bot" (often appearing as "Is this a spam bot?" around here), but for ALL the other listed spiders I found plenty of threads going back years with info, speculation, IPs, responsible companies & more, on said spiders.
Since some folk need others to make the determination as to whether a given spider is bad or not, as a public service I'd like to point out that "Googlebot", "Yahoo! Slurp" & "msnbot" are the three worst spiders you could have visiting your site. Ban them now, before it's too late and the damage is done!
Having trouble finding what you're looking for? You should check out this thread: FAQ: Additional Search Tools for WebmasterWorld [webmasterworld.com]
While humorous, the linked-to Flash presentation in the first message of this thread really should be required viewing: Posting and You... [webmasterworld.com]
A search of the Webmaster World pages at google with forum11 (this forum) returns the following:
[google.com...]
Should you desire to add a particular bot, UA or IP, just add a "+" and the name after forum11 in the search box, e.g. forum11 +Twiceler.
Some more Forum tools:
Valid Search Engine?
[webmasterworld.com...]
IIS and Global.asa
[w3schools.com...]
dbm Maps
[webmasterworld.com...]
Reduce harvests
[webmasterworld.com...] Msg#16
Throttle runaways
[webmasterworld.com...]
Block Methods (Scroll past opening Advertisements)
[diveintomark.org...]
Regular Expressions
[etext.lib.virginia.edu...]
[gnosis.cx...]
Close To Perfect I
[webmasterworld.com...]
Close To Perfect II
[webmasterworld.com...]
Close To Perfect III
[webmasterworld.com...]
Concise htaccess
[webmasterworld.com...]
robots.text on a diet
[webmasterworld.com...]
Search Tools
[webmasterworld.com...]
balam: you waste your time judging others' posts, and BTW your post is not only boring but also useless to the purpose of the thread.
Actually!
I thought this part (below) was rather funny, and on a couple of passing thoughts I almost submitted a reply to remind others that it WAS tongue-in-cheek, even without the emoticon.
I'd like to point out that "Googlebot", "Yahoo! Slurp" & "msnbot" are the three worst spiders you could have visiting your site. Ban them now, before it's too late and the damage is done!
BTW silver,
ALL those links came from the Close to Perfect htaccess thread :)
This forum has been primarily htaccess (in addition to SESID) at least since I've been here (My profile says 2001, however I was previously registered under another screen name.)
I even used balam's google link to do a search on "forum11+IIS" and there wasn't much. There was an IIS inquiry in the "Close to Perfect" thread that went unanswered.
Prior to this forum going down, it was a bundle of activity, and many of the one-time participants here have not returned. (balam was a regular at one time.)
The moderated forum (no pun intended, as the forum does exist) has a lesser-reaching claw than the old forum. In the old forum, submissions were not delayed, and a spider could be stopped in its tracks.
Nor was separating private from commercial IP ranges an issue in the old forum. Today, even though a private IP may be doing massive crawls, we are apparently limited by Charter, moderation and delay.
In all fairness though, the old forum was shut down because some posters were submitting their competitors as crawlers.
That Bret decided to bring this forum back is commendable.
My question is:
If one of the 29 UAs you submitted came from a private IP range, how could the poster provide the IP range without violating forum rules while still sharing accurate information?
Don
If one of the 29 UAs you submitted came from a private IP range, how could the poster provide the IP range without violating forum rules while still sharing accurate information?
You can't. If an IP address appears to be that of a private individual, please don't post it. If you need to post it, please do not post the last group of numbers in the IP address.
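A hypothetical one-liner along those lines (the helper name and log line are made up for illustration) for masking that last group of numbers before posting a log excerpt:

```shell
# Hypothetical helper: mask the last octet of the leading IP in a
# log line, so a private individual's full address is never posted.
mask_ip() {
  printf '%s\n' "$1" | sed -E 's/^([0-9]+\.[0-9]+\.[0-9]+)\.[0-9]+/\1.xxx/'
}

mask_ip '192.0.2.57 - - "GET /robots.txt HTTP/1.1" 200'
# prints: 192.0.2.xxx - - "GET /robots.txt HTTP/1.1" 200
```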
This forum was closed for liability reasons and was only reopened when I promised to enforce this strict guideline.
Remember, the primary purpose of the forum is to identify search engine spiders. Identifying other types of spiders (including building ban-lists) is a secondary function of this forum.
That being said, I have no problem with starting new threads listing "bannable bots", assuming those threads don't turn into flame-fests or some such. There are pre-existing threads with this information, but it takes a lot of digging to get to the meat of the info in them because they are so long. It would be nice to have one comprehensive thread of bad-bots with no extraneous posts in it, but I am probably dreaming.
Of course, everybody's definitions of bad and good will be different. In my opinion, any bot attached to a public search engine (one which doesn't cache, or which offers opt-out caching) is good. Anything else is bad.