Forum Library, Charter, Moderators: Ocean10000 & incrediBILL

Search Engine Spider and User Agent Identification Forum

    
Is there an opt in script for bots?
What bots should be allowed and how can it be done?
Gimp (5+ Year Member)
Msg#: 3113746 posted 7:31 am on Oct 9, 2006 (gmt 0)

For those of us who are not programmers, is there a script that we can use to allow only certain bots to our site?

If there is, what are the best bots to allow in?

Thank you for the help.

 

volatilegx (WebmasterWorld Senior Member, 10+ Year Member)
Msg#: 3113746 posted 1:09 am on Oct 10, 2006 (gmt 0)

I think IncrediBILL may have something in the works that does that...

nancyb (WebmasterWorld Senior Member, 10+ Year Member)
Msg#: 3113746 posted 12:52 am on Oct 11, 2006 (gmt 0)

Oh, yes! When available, please may I have it also?

wilderness (WebmasterWorld Senior Member, Top Contributor of All Time, 10+ Year Member)
Msg#: 3113746 posted 1:49 am on Oct 11, 2006 (gmt 0)

Bill provided a brief example of whitelisting in a previous thread.
Perhaps somebody bookmarked the thread?
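For readers who can't find the earlier thread, the whitelisting idea can be sketched roughly as follows. This is a minimal illustration, not Bill's actual script; the bot names in the list and the helper function are hypothetical examples.

```python
# Minimal sketch of bot whitelisting by user agent.
# Only bots on the allow list get in; other self-identified
# crawlers are refused; everything else is presumed a browser.

ALLOWED_BOTS = ("googlebot", "slurp", "msnbot")  # example entries only

def is_allowed(user_agent: str) -> bool:
    """Return True if this user agent may crawl the site."""
    ua = user_agent.lower()
    # Whitelisted crawlers are always allowed.
    if any(bot in ua for bot in ALLOWED_BOTS):
        return True
    # Reject any other self-identified bot outright.
    if "bot" in ua or "crawler" in ua or "spider" in ua:
        return False
    # Everything else is presumed to be a normal browser.
    return True

print(is_allowed("Googlebot/2.1 (+http://www.google.com/bot.html)"))  # True
print(is_allowed("SomeScraper-bot/0.1"))                              # False
```

A real deployment would also verify the claimed identity (e.g. by reverse DNS), since user agent strings are trivially forged.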

GaryK (WebmasterWorld Senior Member, 10+ Year Member)
Msg#: 3113746 posted 6:12 pm on Oct 11, 2006 (gmt 0)

[webmasterworld.com...]

gregbo (WebmasterWorld Senior Member, 5+ Year Member)
Msg#: 3113746 posted 4:48 am on Oct 31, 2006 (gmt 0)

I hope you are aware that pursuing this solution locks out legitimate bots that aren't on the list, such as bots from new search engines.

wilderness (WebmasterWorld Senior Member, Top Contributor of All Time, 10+ Year Member)
Msg#: 3113746 posted 5:25 am on Oct 31, 2006 (gmt 0)

I hope you are aware that pursuing this solution locks out legitimate bots that aren't on the list, such as bots from new search engines.

Things a little slow over in PPC, Greg? ;)

Could you possibly provide the name of a bot that has appeared and gathered any significant market share from the major SE's and/or reputable existing companies expanding into SE (aka MSN) in the past five years?

Hopefully this would have come from some university or open source project?

gregbo (WebmasterWorld Senior Member, 5+ Year Member)
Msg#: 3113746 posted 9:56 pm on Oct 31, 2006 (gmt 0)

Could you possibly provide the name of a bot that has appeared and gathered any significant market share from the major SE's and/or reputable existing companies expanding into SE (aka MSN) in the past five years?

How about Baidu?
Baiduspider+(+http://www.baidu.com/search/spider.htm)

I'll defer to someone like @incrediBILL who has detailed bot (spider) lists.

I guess you're free to allow whoever you want to crawl your site, but then you can't complain if at some future point, a popular search engine emerges that has "sandboxed" you because you refused it entry into your site one too many times. What would it have been like about eight years ago if Google was left off of your opt-in list? Are the only spiders that should be "trusted" the ones that come from popular (high revenue) SEs?

Hopefully this would have come from some university or open source project?

This illustrates a point that sometimes, the next big thing comes from an unlikely place. Another thing to consider is that an existing engine can acquire new infrastructure and use it for spidering purposes. So one of the major players might start spidering from someplace that isn't (yet) registered to it, and in fact may have been used by spammers, botnets, or other illdoers. If they don't get into your opt-in list right away, you run the risk of losing your ranking, etc.

GaryK (WebmasterWorld Senior Member, 10+ Year Member)
Msg#: 3113746 posted 10:07 pm on Oct 31, 2006 (gmt 0)

Greg, if I didn't use an opt-in list then my servers would be overrun by new bots.

I see dozens of new bots every week on my browser project website. Most of them do not respect robots.txt and take files that are disallowed.

I'm willing to risk a bot not being recognized for a week until I have a chance to analyze my log files, research the bots, and decide if they should be part of my opt-in list.

In the last few years the only bots I've added to my opt-in list have been Baidu and Yandex.

Yahoo is no longer on my opt-in list because of continued abuse, especially by their Japanese division.
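The weekly log review GaryK describes — spotting bots that aren't yet on the opt-in list — can be sketched like this. The opt-in entries and the log lines are hypothetical examples; this assumes Apache's combined log format, where the user agent is the last quoted field.

```python
import re
from collections import Counter

# Example opt-in list; substitute your own approved crawlers.
OPT_IN = {"googlebot", "slurp", "msnbot", "baiduspider", "yandex"}

# In combined log format the user agent is the final quoted field.
UA_RE = re.compile(r'"([^"]*)"\s*$')

def unknown_bots(log_lines):
    """Count user agents that look like bots but aren't opted in."""
    hits = Counter()
    for line in log_lines:
        m = UA_RE.search(line)
        if not m:
            continue
        ua = m.group(1).lower()
        if any(k in ua for k in ("bot", "crawler", "spider")) \
           and not any(k in ua for k in OPT_IN):
            hits[m.group(1)] += 1
    return hits
```

Running this over a week's access log yields a ranked list of unrecognized crawlers to research before deciding whether to add them to the opt-in list.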

gregbo (WebmasterWorld Senior Member, 5+ Year Member)
Msg#: 3113746 posted 10:36 pm on Oct 31, 2006 (gmt 0)

OK, fair enough. You are experienced enough to understand the ramifications of disallowing bots from popular sites, and weighing that against them overrunning your site.

Just for the OP's benefit: before using someone else's opt-in list, get enough of a handle on your traffic (and how it affects ranking) so you understand the tradeoffs involved in blocking spiders.

GaryK (WebmasterWorld Senior Member, 10+ Year Member)
Msg#: 3113746 posted 11:34 pm on Oct 31, 2006 (gmt 0)

I agree completely with your second paragraph Greg. Never blindly accept what someone else thinks should be blocked. I'm a perfect example of that. I block Yahoo. I don't think most people would want to do that. But I have seen enough of their shenanigans to know I'm not getting enough traffic from them to justify how badly they try to abuse my websites. ;)

wilderness (WebmasterWorld Senior Member, Top Contributor of All Time, 10+ Year Member)
Msg#: 3113746 posted 1:08 am on Nov 1, 2006 (gmt 0)

How about Baidu?
Baiduspider+(+http://www.baidu.com/search/spider.htm)

Thanks anyway. Baidu crawls from APNIC and a China IP range.
I have the bot denied, as well as the IP range.
Not too many folks are interested in my widgets in China.

This illustrates a point that sometimes, the next big thing comes from an unlikely place. Another thing to consider is that an existing engine can acquire new infrastructure and use it for spidering purposes. So one of the major players might start spidering from someplace that isn't (yet) registered to it, and in fact may have been used by spammers, botnets, or other illdoers. If they don't get into your opt-in list right away, you run the risk of losing your ranking, etc.

This forum has had participants in the past taking the same stand as yourself; however, the bad-guy/PITA bots far outnumber the viable search engines.

You're certainly entitled to express your appreciation of where the next bot may appear from and why we might leave open the doors of reason!

However. . .in the end?

It's up to each webmaster to determine what is beneficial or detrimental to his/her own site(s).

Personally, my sites will benefit more from my not wasting time dealing with PITA bots, when I could be spending that time either archiving materials for new pages or creating new pages.
'Course, most everybody realizes that my expectations of visitors (bots or otherwise) are unrealistic ;)

Don

incrediBILL (WebmasterWorld Administrator, Top Contributor of All Time, 5+ Year Member)
Msg#: 3113746 posted 7:46 pm on Nov 9, 2006 (gmt 0)

I hope you are aware that pursuing this solution locks out legitimate bots that aren't on the list, such as bots from new search engines.

If you have access, you can look at the error log, see everything that was kicked out by the server, and adjust the whitelist to include anything that needs to be included.

People that have tried this method have come back to me days later and said things like:
"WOW! I never knew so much JUNK was hitting my web server! It's amazing!"

Additionally, using something like Google Analytics you can see where all your legitimate traffic is coming from and make sure all of those sources are whitelisted.

I recommend something like Google Analytics because it's JavaScript-based, and typically only legitimate human browsers run JavaScript, not bots, so it filters the noise you would see in an access log down to just the useful referrers.
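The "review what the server kicked out" step can be sketched as a small log pass that tallies user agents served 403 Forbidden, so a legitimate bot wrongly caught by the whitelist stands out. The regex assumes Apache's combined log format, and the sample lines are hypothetical.

```python
import re
from collections import Counter

# Combined log format: request in quotes, then status, size,
# referrer, and finally the user agent in quotes.
LINE_RE = re.compile(r'"[^"]*" (\d{3}) \S+ "[^"]*" "([^"]*)"')

def denied_agents(log_lines):
    """Tally user agents that received 403 Forbidden, for manual review."""
    tally = Counter()
    for line in log_lines:
        m = LINE_RE.search(line)
        if m and m.group(1) == "403":
            tally[m.group(2)] += 1
    return tally

sample = [
    '1.2.3.4 - - [09/Nov/2006:07:46:00 +0000] "GET / HTTP/1.1" 403 199 "-" "NewBot/1.0"',
    '5.6.7.8 - - [09/Nov/2006:07:46:02 +0000] "GET / HTTP/1.1" 200 5120 "-" "Mozilla/4.0"',
]
print(denied_agents(sample))
```

Anything showing up here that you actually want crawling the site gets promoted onto the whitelist.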

incrediBILL (WebmasterWorld Administrator, Top Contributor of All Time, 5+ Year Member)
Msg#: 3113746 posted 7:59 pm on Nov 9, 2006 (gmt 0)

This forum has had participants in the past taking the same stand as yourself; however, the bad-guy/PITA bots far outnumber the viable search engines.

Gregbo is correct that most people could shoot themselves in the foot using traditional blocking methods as it's kind of blind and you don't have a clue when something new comes knocking.

That's why my control panel tells me daily if it sees a new user agent for my review, to approve or deny, so I make an interactive choice and not let the technology blindly block everything.

I'm all about the whitelist and stealth crawler detection as so far today I've blocked 178 unique IPs attempting to crawl 1237 pages and it's not even noon. Yesterday was fun with 289 bots trying to abscond with 3189 pages.

FWIW, they used to ask for many more thousands of pages a day BEFORE I clamped down, so these are just the stupid bots that don't take NO for an answer.
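The daily "new user agent for review" loop Bill describes can be sketched as a simple set difference between today's agents and the ones already seen; how the seen list is stored (file, database, control panel) is left out, and all names here are hypothetical.

```python
def new_agents(todays_agents, seen):
    """Return agents never seen before, queued for manual approve/deny."""
    return sorted(set(todays_agents) - set(seen))

# Example: the seen list would normally be loaded from persistent storage.
seen = ["Googlebot/2.1", "Mozilla/4.0"]
today = ["Mozilla/4.0", "MysteryBot/0.9", "Googlebot/2.1"]
print(new_agents(today, seen))  # ['MysteryBot/0.9']
```

Approved agents get appended to the seen list (and, if they're bots, to the whitelist), so the review queue stays short after the first few days.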

All trademarks and copyrights held by respective owners. Member comments are owned by the poster.
WebmasterWorld is a Developer Shed Community owned by Jim Boykin.
© Webmaster World 1996-2014 all rights reserved