Forum Moderators: open
I hope you are aware that pursuing this solution locks out legitimate bots that aren't on the list, such as bots from new search engines.
Things a little slow over in PPC, Greg? ;)
Could you possibly provide the name of a bot that has appeared and gathered any significant market share from the major SE's and/or reputable existing companies expanding into SE (aka MSN) in the past five years?
Hopefully this would have come from some university or open source project?
Could you possibly provide the name of a bot that has appeared and gathered any significant market share from the major SE's and/or reputable existing companies expanding into SE (aka MSN) in the past five years?
How about Baidu?
Baiduspider+(+http://www.baidu.com/search/spider.htm)
I'll defer to someone like @incrediBILL who has detailed bot (spider) lists.
I guess you're free to allow whoever you want to crawl your site, but then you can't complain if at some future point, a popular search engine emerges that has "sandboxed" you because you refused it entry into your site one too many times. What would it have been like about eight years ago if Google was left off of your opt-in list? Are the only spiders that should be "trusted" the ones that come from popular (high revenue) SEs?
Hopefully this would have come from some university or open source project?
This illustrates a point that sometimes, the next big thing comes from an unlikely place. Another thing to consider is that an existing engine can acquire new infrastructure and use it for spidering purposes. So one of the major players might start spidering from someplace that isn't (yet) registered to it, and in fact may have been used by spammers, botnets, or other illdoers. If they don't get into your opt-in list right away, you run the risk of losing your ranking, etc.
I see dozens of new bots every week on my browser project website. Most of them do not respect robots.txt and take files that are disallowed.
I'm willing to risk a bot not being recognized for a week until I have a chance to analyze my log files, research the bots, and decide if they should be part of my opt-in list.
In the last few years the only bots I've added to my opt-in list have been Baidu and Yandex.
Yahoo is no longer on my opt-in list because of continued abuse, especially by their Japanese division.
Just for the OP's benefit: before using someone else's opt-in list, get enough of a handle on your traffic (and how it affects ranking) so you understand the tradeoffs involved in blocking spiders.
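The weekly log-review step described above can be sketched as a short script. This is a minimal, hypothetical example that assumes Apache combined log format; the opt-in list, log lines, and agent names are placeholders, not anyone's actual configuration:

```python
# Sketch: report user agents in an access log that match nothing on an
# opt-in list. WHITELIST and the sample lines are illustrative only.
import re
from collections import Counter

WHITELIST = ("Googlebot", "bingbot", "Baiduspider", "YandexBot")  # example opt-in list

def unknown_agents(log_lines):
    """Count user-agent strings that match no whitelisted substring."""
    counts = Counter()
    for line in log_lines:
        # In combined log format the UA is the last quoted field.
        m = re.search(r'"[^"]*" "([^"]*)"$', line)
        if not m:
            continue
        ua = m.group(1)
        if not any(good in ua for good in WHITELIST):
            counts[ua] += 1
    return counts

sample = [
    '1.2.3.4 - - [01/Jan/2010:00:00:00 +0000] "GET / HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; Googlebot/2.1)"',
    '5.6.7.8 - - [01/Jan/2010:00:00:01 +0000] "GET / HTTP/1.1" 200 512 "-" "EvilScraper/1.0"',
]
print(unknown_agents(sample))
```

In practice you'd feed this the real access log, then research each reported agent before deciding whether it earns a spot on the opt-in list.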
How about Baidu?
Baiduspider+(+http://www.baidu.com/search/spider.htm)
Thanks anyway. Baidu crawls from APNIC-allocated IP ranges in China.
I have the bot denied, as well as the IP range.
Not too many folks in China are interested in my widgets.
This illustrates a point that sometimes, the next big thing comes from an unlikely place. Another thing to consider is that an existing engine can acquire new infrastructure and use it for spidering purposes. So one of the major players might start spidering from someplace that isn't (yet) registered to it, and in fact may have been used by spammers, botnets, or other illdoers. If they don't get into your opt-in list right away, you run the risk of losing your ranking, etc.
This forum has had participants in the past taking the same stand as yourself; however, the bad-guy/PITA bots far outnumber the viable search engines.
You're certainly entitled to express your view of where the next bot may appear from, and why we might leave open the doors of reason!
However. . .in the end?
It's up to each webmaster to determine what is beneficial or detrimental to his/her own site(s).
Personally, my sites benefit more if I don't waste time dealing with PITA bots — time better spent either archiving materials for new pages or creating new pages.
'Course, most everybody realizes that my expectations of visitors (bots or otherwise) are unrealistic ;)
Don
I hope you are aware that pursuing this solution locks out legitimate bots that aren't on the list, such as bots from new search engines.
If you have access, you can look at the error log, see everything that was kicked out by the server, and adjust the whitelist to include things that need to be included.
People that have tried this method have come back to me days later and said things like:
"WOW! I never knew so much JUNK was hitting my web server! It's amazing!"
Additionally, using something like Google Analytics you can see where all your legitimate traffic is coming from and make sure all of those sources are whitelisted.
I recommend something like Google Analytics because it's JavaScript-based, and typically only legitimate human browsers execute JavaScript, not bots, so it filters the noise you'd see in a raw access log down to just the useful referrers.
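Mechanically, an opt-in list is just pattern matching against the User-Agent header before the request is served. A minimal sketch of that gate (the approved patterns are illustrative only, not a recommended list):

```python
# Sketch of an opt-in (whitelist) gate: deny any request whose User-Agent
# matches no approved pattern. These patterns are illustrative placeholders.
import re

APPROVED = [
    re.compile(r"Googlebot", re.I),
    re.compile(r"bingbot", re.I),
    re.compile(r"Mozilla/\d"),  # ordinary browsers (very loose; real rules need care)
]

def allow(user_agent: str) -> bool:
    """True if the UA matches at least one approved pattern."""
    return any(p.search(user_agent) for p in APPROVED)

print(allow("Mozilla/5.0 (Windows NT 6.1)"))  # True  (matches browser pattern)
print(allow("SomeRandomScraper/0.1"))         # False (matches nothing)
```

The same logic is usually implemented at the server level (e.g. rewrite rules), but the matching idea is identical: everything not explicitly approved is denied.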
This forum has had participants in the past taking the same stand as yourself; however, the bad-guy/PITA bots far outnumber the viable search engines.
Gregbo is correct that most people could shoot themselves in the foot using traditional blocking methods, as they're kind of blind: you don't have a clue when something new comes knocking.
That's why my control panel tells me daily when it sees a new user agent, so I can review it and approve or deny — an interactive choice, rather than letting the technology blindly block everything.
I'm all about the whitelist and stealth crawler detection: so far today I've blocked 178 unique IPs attempting to crawl 1237 pages, and it's not even noon. Yesterday was fun, with 289 bots trying to abscond with 3189 pages.
FWIW, they used to ask for many more thousands of pages a day BEFORE I clamped down, so these are just the stupid bots that don't take NO for an answer.
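Stealth crawlers that forge a browser user agent can still be caught on volume alone. A minimal sketch of that idea — the threshold and request data here are made up for illustration:

```python
# Sketch of volume-based crawler detection: flag any IP that requests more
# pages than a threshold in one day. Threshold and data are illustrative.
from collections import Counter

THRESHOLD = 100  # pages per day before an IP is treated as a bulk crawler

def flag_bulk_crawlers(requests, threshold=THRESHOLD):
    """requests: iterable of (ip, path) tuples for one day."""
    per_ip = Counter(ip for ip, _ in requests)
    return {ip for ip, n in per_ip.items() if n > threshold}

reqs = [("10.0.0.1", f"/page{i}") for i in range(150)] + [("10.0.0.2", "/")]
print(flag_bulk_crawlers(reqs))  # the 150-request IP is flagged
```

Real detectors also weigh request timing, honeypot links, and JavaScript execution, but raw page volume is the simplest signal and catches the "stupid bots" that don't take no for an answer.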