
Forum Moderators: Ocean10000 & incrediBILL & keyplyr


Is there an opt in script for bots?

What bots should be allowed and how can it be done?

     
7:31 am on Oct 9, 2006 (gmt 0)

Junior Member

10+ Year Member

joined:Nov 7, 2005
posts:137
votes: 0


For those of us who are not programmers, is there a script we can use to allow only certain bots onto our sites?

If there is, what are the best bots to allow in?

Thank you for the help.
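(For reference, the kind of opt-in check being asked about can be sketched in a few lines. This is a hypothetical, minimal example, not anyone's production script: the bot names in the allow list are placeholders, and a real deployment would also need to verify crawler IP ranges, since User-Agent strings are trivially forged.)

```python
# Minimal sketch of an opt-in (whitelist) User-Agent check.
# The entries below are examples only; build your own list, and remember
# that User-Agent strings can be forged, so IP verification is still needed.

ALLOWED_BOTS = ("googlebot", "msnbot", "slurp")  # hypothetical allow list

def is_allowed(user_agent: str) -> bool:
    """Return True if the request looks like a browser or a whitelisted bot."""
    ua = user_agent.lower()
    # Anything that does not identify itself as a crawler is treated here
    # as an ordinary browser and passed through unchallenged.
    looks_like_bot = any(word in ua for word in ("bot", "crawler", "spider", "slurp"))
    if not looks_like_bot:
        return True
    # Self-identified crawlers get in only if they match the allow list.
    return any(name in ua for name in ALLOWED_BOTS)

print(is_allowed("Mozilla/5.0 (Windows; U)"))                         # True (browser)
print(is_allowed("Googlebot/2.1 (+http://www.google.com/bot.html)"))  # True (whitelisted)
print(is_allowed("UnknownBot/0.1"))                                   # False (opt-in denied)
```

In practice this logic would sit in front of page delivery (e.g. in a CGI wrapper or server module) and return a 403 instead of `False`.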

1:09 am on Oct 10, 2006 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Mar 22, 2001
posts:2450
votes: 0


I think IncrediBILL may have something in the works that does that...
12:52 am on Oct 11, 2006 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:July 18, 2001
posts: 889
votes: 0


Oh, yes! When available, please may I have it also?
1:49 am on Oct 11, 2006 (gmt 0)

Senior Member

WebmasterWorld Senior Member wilderness is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Nov 11, 2001
posts:5437
votes: 2


Bill provided a brief example of white listing in a previous thread.
Perhaps somebody marked the thread?
6:12 pm on Oct 11, 2006 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Sept 17, 2002
posts:2251
votes: 0

4:48 am on Oct 31, 2006 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Apr 18, 2005
posts:817
votes: 0


I hope you are aware that pursuing this solution locks out legitimate bots that aren't on the list, such as bots from new search engines.
5:25 am on Oct 31, 2006 (gmt 0)

Senior Member

WebmasterWorld Senior Member wilderness is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Nov 11, 2001
posts:5437
votes: 2


I hope you are aware that pursuing this solution locks out legitimate bots that aren't on the list, such as bots from new search engines.

Things a little slow over in PPC Greg ;)

Could you possibly provide the name of a bot that has appeared and gathered any significant market share from the major SE's and/or reputable existing companies expanding into SE (aka MSN) in the past five years?

Hopefully this would have come from some university or open source project?

9:56 pm on Oct 31, 2006 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Apr 18, 2005
posts:817
votes: 0


Could you possibly provide the name of a bot that has appeared and gathered any significant market share from the major SE's and/or reputable existing companies expanding into SE (aka MSN) in the past five years?

How about Baidu?
Baiduspider+(+http://www.baidu.com/search/spider.htm)

I'll defer to someone like @incrediBILL who has detailed bot (spider) lists.

I guess you're free to allow whoever you want to crawl your site, but then you can't complain if at some future point, a popular search engine emerges that has "sandboxed" you because you refused it entry into your site one too many times. What would it have been like about eight years ago if Google was left off of your opt-in list? Are the only spiders that should be "trusted" the ones that come from popular (high revenue) SEs?

Hopefully this would have come from some university or open source project?

This illustrates a point that sometimes, the next big thing comes from an unlikely place. Another thing to consider is that an existing engine can acquire new infrastructure and use it for spidering purposes. So one of the major players might start spidering from someplace that isn't (yet) registered to it, and in fact may have been used by spammers, botnets, or other illdoers. If they don't get into your opt-in list right away, you run the risk of losing your ranking, etc.

10:07 pm on Oct 31, 2006 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Sept 17, 2002
posts:2251
votes: 0


Greg, if I didn't use an opt-in list then my servers would be overrun by new bots.

I see dozens of new bots every week on my browser project website. Most of them do not respect robots.txt and take files that are disallowed.

I'm willing to risk a bot not being recognized for a week until I have a chance to analyze my log files, research the bots, and decide if they should be part of my opt-in list.

In the last few years the only bots I've added to my opt-in list have been Baidu and Yandex.

Yahoo is no longer on my opt-in list because of continued abuse, especially by their Japanese division.
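(The weekly log review described above can be partly automated. A rough sketch, assuming Apache combined-format access logs and a hypothetical known-agents list; it only surfaces unfamiliar User-Agent strings for hand research, it doesn't decide anything:)

```python
import re

# Hypothetical review helper: pull the distinct User-Agent strings out of
# an Apache combined-format access log so new bots can be researched by hand.
KNOWN_AGENTS = ("googlebot", "msnbot", "yahoo! slurp")  # example entries only

# The combined log format ends with: "referer" "user-agent"
UA_PATTERN = re.compile(r'"[^"]*" "([^"]*)"$')

def unknown_agents(log_lines):
    """Return the set of User-Agent strings matching no known entry."""
    seen = set()
    for line in log_lines:
        m = UA_PATTERN.search(line.strip())
        if not m:
            continue
        ua = m.group(1)
        if not any(k in ua.lower() for k in KNOWN_AGENTS):
            seen.add(ua)
    return seen

sample = [
    '1.2.3.4 - - [31/Oct/2006:10:00:00 +0000] "GET / HTTP/1.0" 200 512 "-" "Googlebot/2.1"',
    '5.6.7.8 - - [31/Oct/2006:10:01:00 +0000] "GET / HTTP/1.0" 200 512 "-" "NewBot/0.1"',
]
print(unknown_agents(sample))  # {'NewBot/0.1'}
```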

10:36 pm on Oct 31, 2006 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Apr 18, 2005
posts:817
votes: 0


OK, fair enough. You are experienced enough to understand the ramifications of disallowing bots from popular sites, and weighing that against them overrunning your site.

Just for the OP's benefit: before using someone else's opt-in list, get enough of a handle on your traffic (and how it affects ranking) so you understand the tradeoffs involved in blocking spiders.

11:34 pm on Oct 31, 2006 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Sept 17, 2002
posts:2251
votes: 0


I agree completely with your second paragraph Greg. Never blindly accept what someone else thinks should be blocked. I'm a perfect example of that. I block Yahoo. I don't think most people would want to do that. But I have seen enough of their shenanigans to know I'm not getting enough traffic from them to justify how badly they try to abuse my websites. ;)
1:08 am on Nov 1, 2006 (gmt 0)

Senior Member

WebmasterWorld Senior Member wilderness is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Nov 11, 2001
posts:5437
votes: 2


How about Baidu?
Baiduspider+(+http://www.baidu.com/search/spider.htm)

Thanks anyway. Baidu is crawling from APNIC and a China IP range.
I have the bot denied as well as the IP range.
Not too many folks are interested in my widgets in China.

This illustrates a point that sometimes, the next big thing comes from an unlikely place. Another thing to consider is that an existing engine can acquire new infrastructure and use it for spidering purposes. So one of the major players might start spidering from someplace that isn't (yet) registered to it, and in fact may have been used by spammers, botnets, or other illdoers. If they don't get into your opt-in list right away, you run the risk of losing your ranking, etc.

This forum has had participants in the past taking the same stand as yourself; however, the bad guys/PITAs far outnumber the viable search engines.

You're certainly entitled to express your view of where the next bot may appear from and why we might leave open the doors of reason!

However. . .in the end?

It's up to each webmaster to determine what is beneficial or detrimental to his/her own site(s).

Personally, my sites will benefit more from my not wasting time dealing with PITA bots, when I could be spending that time either archiving materials for new pages or creating new pages.
Of course, most everybody realizes that my expectations of visitors (bots or otherwise) are unrealistic ;)

Don

7:46 pm on Nov 9, 2006 (gmt 0)

Administrator from US 

WebmasterWorld Administrator incredibill is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Jan 25, 2005
posts:14648
votes: 94


I hope you are aware that pursuing this solution locks out legitimate bots that aren't on the list, such as bots from new search engines.

If you have access, you can look at the error log, see everything that was kicked out by the server, and adjust the whitelist to include things that need to be included.

People that have tried this method have come back to me days later and said things like:
"WOW! I never knew so much JUNK was hitting my web server! It's amazing!"

Additionally, using something like Google Analytics you can see where all your legitimate traffic is coming from and make sure all of those sources are whitelisted.

I recommend something like Google Analytics because it's JavaScript-based, and typically only legitimate human browsers run JavaScript, not bots, so it filters the noise you would see in a raw access log down to just the useful referrers.
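(The error-log review described above can be sketched as a small tally script. This assumes the classic Apache-style "client denied by server configuration" error-log lines; the format varies by server and version, so treat the pattern as a hypothetical starting point:)

```python
import re
from collections import Counter

# Sketch of the error-log review: count "client denied" entries in an
# Apache-style error log by client IP, to see what the whitelist rejected.
DENIED = re.compile(r"\[client ([\d.]+)\] client denied by server configuration")

def denied_counts(log_lines):
    """Return a Counter mapping client IP -> number of denied requests."""
    counts = Counter()
    for line in log_lines:
        m = DENIED.search(line)
        if m:
            counts[m.group(1)] += 1
    return counts

sample = [
    "[Wed Oct 11 14:32:52 2006] [error] [client 1.2.3.4] "
    "client denied by server configuration: /var/www/page.html",
    "[Wed Oct 11 14:33:01 2006] [error] [client 1.2.3.4] "
    "client denied by server configuration: /var/www/other.html",
]
print(denied_counts(sample))  # Counter({'1.2.3.4': 2})
```

Anything legitimate that shows up here repeatedly is a candidate for adding back to the whitelist.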

7:59 pm on Nov 9, 2006 (gmt 0)

Administrator from US 

WebmasterWorld Administrator incredibill is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Jan 25, 2005
posts:14648
votes: 94


This forum has had participants in the past taking the same stand as yourself; however, the bad guys/PITAs far outnumber the viable search engines.

Gregbo is correct that most people could shoot themselves in the foot using traditional blocking methods as it's kind of blind and you don't have a clue when something new comes knocking.

That's why my control panel tells me daily if it sees a new user agent for my review, to approve or deny, so I make an interactive choice and not let the technology blindly block everything.

I'm all about the whitelist and stealth crawler detection as so far today I've blocked 178 unique IPs attempting to crawl 1237 pages and it's not even noon. Yesterday was fun with 289 bots trying to abscond with 3189 pages.

FWIW, they used to ask for many more thousands of pages a day BEFORE I clamped down, so these are just the stupid bots that don't take NO for an answer.

 
