Forum Moderators: open
I hope you are aware that pursuing this solution locks out legitimate bots that aren't on the list, such as bots from new search engines.
Things a little slow over in PPC, Greg? ;)
Could you possibly provide the name of a bot that has appeared and gathered any significant market share from the major SE's and/or reputable existing companies expanding into SE (aka MSN) in the past five years?
Hopefully this would have come from some university or open source project?
Could you possibly provide the name of a bot that has appeared and gathered any significant market share from the major SE's and/or reputable existing companies expanding into SE (aka MSN) in the past five years?
How about Baidu?
Baiduspider+(+http://www.baidu.com/search/spider.htm)
I'll defer to someone like @incrediBILL who has detailed bot (spider) lists.
I guess you're free to allow whoever you want to crawl your site, but then you can't complain if at some future point, a popular search engine emerges that has "sandboxed" you because you refused it entry into your site one too many times. What would it have been like about eight years ago if Google was left off of your opt-in list? Are the only spiders that should be "trusted" the ones that come from popular (high revenue) SEs?
Hopefully this would have come from some university or open source project?
This illustrates a point that sometimes, the next big thing comes from an unlikely place. Another thing to consider is that an existing engine can acquire new infrastructure and use it for spidering purposes. So one of the major players might start spidering from someplace that isn't (yet) registered to it, and in fact may have been used by spammers, botnets, or other illdoers. If they don't get into your opt-in list right away, you run the risk of losing your ranking, etc.
I see dozens of new bots every week on my browser project website. Most of them do not respect robots.txt and take files that are disallowed.
I'm willing to risk a bot not being recognized for a week until I have a chance to analyze my log files, research the bots, and decide if they should be part of my opt-in list.
In the last few years the only bots I've added to my opt-in list have been Baidu and Yandex.
Yahoo is no longer on my opt-in list because of continued abuse, especially by their Japanese division.
Just for the OP's benefit: before using someone else's opt-in list, get enough of a handle on your traffic (and how it affects ranking) so you understand the tradeoffs involved in blocking spiders.
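The weekly log-review step described above can be sketched as a short script. This is a minimal, hypothetical example that assumes Apache combined log format; the opt-in list, log lines, and agent names are placeholders, not anyone's actual configuration:

```python
# Sketch: report user agents in an access log that match nothing on an
# opt-in list. WHITELIST and the sample lines are illustrative only.
import re
from collections import Counter

WHITELIST = ("Googlebot", "bingbot", "Baiduspider", "YandexBot")  # example opt-in list

def unknown_agents(log_lines):
    """Count user-agent strings that match no whitelisted substring."""
    counts = Counter()
    for line in log_lines:
        # In combined log format the UA is the last quoted field.
        m = re.search(r'"[^"]*" "([^"]*)"$', line)
        if not m:
            continue
        ua = m.group(1)
        if not any(good in ua for good in WHITELIST):
            counts[ua] += 1
    return counts

sample = [
    '1.2.3.4 - - [01/Jan/2010:00:00:00 +0000] "GET / HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; Googlebot/2.1)"',
    '5.6.7.8 - - [01/Jan/2010:00:00:01 +0000] "GET / HTTP/1.1" 200 512 "-" "EvilScraper/1.0"',
]
print(unknown_agents(sample))
```

In practice you'd feed this the real access log, then research each reported agent before deciding whether it earns a spot on the opt-in list.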
How about Baidu?
Baiduspider+(+http://www.baidu.com/search/spider.htm)
Thanks anyway. Baidu crawls from APNIC-allocated IP ranges in China.
I have the bot denied, as well as the IP range.
Not too many folks in China are interested in my widgets.
This illustrates a point that sometimes, the next big thing comes from an unlikely place. Another thing to consider is that an existing engine can acquire new infrastructure and use it for spidering purposes. So one of the major players might start spidering from someplace that isn't (yet) registered to it, and in fact may have been used by spammers, botnets, or other illdoers. If they don't get into your opt-in list right away, you run the risk of losing your ranking, etc.
This forum has had participants in the past taking the same stand as yourself; however, the bad-guy/PITA bots far outnumber the viable search engines.
You're certainly entitled to express your view of where the next bot may appear from, and why we might leave open the doors of reason!
However. . .in the end?
It's up to each webmaster to determine what is beneficial or detrimental to his/her own site(s).
Personally, my sites benefit more if I don't waste time dealing with PITA bots — time better spent either archiving materials for new pages or creating new pages.
'Course, most everybody realizes that my expectations of visitors (bots or otherwise) are unrealistic ;)
Don
I hope you are aware that pursuing this solution locks out legitimate bots that aren't on the list, such as bots from new search engines.
If you have access, you can look at the error log, see everything that was kicked out by the server, and adjust the whitelist to include things that need to be included.
People that have tried this method have come back to me days later and said things like:
"WOW! I never knew so much JUNK was hitting my web server! It's amazing!"
Additionally, using something like Google Analytics you can see where all your legitimate traffic is coming from and make sure all of those sources are whitelisted.
I recommend something like Google Analytics because it's JavaScript-based, and typically only legitimate human browsers execute JavaScript, not bots, so it filters the noise you'd see in a raw access log down to just the useful referrers.
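Mechanically, an opt-in list is just pattern matching against the User-Agent header before the request is served. A minimal sketch of that gate (the approved patterns are illustrative only, not a recommended list):

```python
# Sketch of an opt-in (whitelist) gate: deny any request whose User-Agent
# matches no approved pattern. These patterns are illustrative placeholders.
import re

APPROVED = [
    re.compile(r"Googlebot", re.I),
    re.compile(r"bingbot", re.I),
    re.compile(r"Mozilla/\d"),  # ordinary browsers (very loose; real rules need care)
]

def allow(user_agent: str) -> bool:
    """True if the UA matches at least one approved pattern."""
    return any(p.search(user_agent) for p in APPROVED)

print(allow("Mozilla/5.0 (Windows NT 6.1)"))  # True  (matches browser pattern)
print(allow("SomeRandomScraper/0.1"))         # False (matches nothing)
```

The same logic is usually implemented at the server level (e.g. rewrite rules), but the matching idea is identical: everything not explicitly approved is denied.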
This forum has had participants in the past taking the same stand as yourself; however, the bad-guy/PITA bots far outnumber the viable search engines.
Gregbo is correct that most people could shoot themselves in the foot using traditional blocking methods, as they're kind of blind: you don't have a clue when something new comes knocking.
That's why my control panel tells me daily when it sees a new user agent, so I can review it and approve or deny — an interactive choice, rather than letting the technology blindly block everything.
I'm all about the whitelist and stealth crawler detection: so far today I've blocked 178 unique IPs attempting to crawl 1237 pages, and it's not even noon. Yesterday was fun, with 289 bots trying to abscond with 3189 pages.
FWIW, they used to ask for many more thousands of pages a day BEFORE I clamped down, so these are just the stupid bots that don't take NO for an answer.
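Stealth crawlers that forge a browser user agent can still be caught on volume alone. A minimal sketch of that idea — the threshold and request data here are made up for illustration:

```python
# Sketch of volume-based crawler detection: flag any IP that requests more
# pages than a threshold in one day. Threshold and data are illustrative.
from collections import Counter

THRESHOLD = 100  # pages per day before an IP is treated as a bulk crawler

def flag_bulk_crawlers(requests, threshold=THRESHOLD):
    """requests: iterable of (ip, path) tuples for one day."""
    per_ip = Counter(ip for ip, _ in requests)
    return {ip for ip, n in per_ip.items() if n > threshold}

reqs = [("10.0.0.1", f"/page{i}") for i in range(150)] + [("10.0.0.2", "/")]
print(flag_bulk_crawlers(reqs))  # the 150-request IP is flagged
```

Real detectors also weigh request timing, honeypot links, and JavaScript execution, but raw page volume is the simplest signal and catches the "stupid bots" that don't take no for an answer.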