Why don't bots send a special header?!

         

mikomido

3:32 pm on Aug 20, 2007 (gmt 0)



Why can't legitimate bots send "Is-bot: 1" or something similar as a header? I hate trying to figure out whether a specific user-agent is a bot or not. They don't even all have "bot" in the name.

wilderness

6:07 pm on Aug 20, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



They don't share much of anything in common, with the exception of the honorable ones complying with robots.txt ;)

A name is nothing more than a flavor or brand!

The worst choices are names that include "spider" or "crawler", which are common in harvester names.

volatilegx

1:26 am on Aug 21, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Why can't legitimate bots send "Is-bot: 1" or something as a header?

What would stop rogue bots (harvesters, scrapers, etc.) from sending the "Is-bot: 1" header?

mikomido

2:26 am on Aug 21, 2007 (gmt 0)



What would stop rogue bots (harvesters, scrapers, etc.) from sending the "Is-bot: 1" header?

Nothing. That's not the point... the point is to at least catch all "good" bots.

wilderness

3:26 am on Aug 21, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Nothing. That's not the point... the point is to at least catch all "good" bots.

That's easily done with white-listing.
Apply to Incredibill ;)

mikomido

11:00 am on Aug 21, 2007 (gmt 0)



Incredibill? Hmm?

wilderness

2:17 pm on Aug 21, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Incredibill? Hmm?

[webmasterworld.com...]

incrediBILL

11:39 pm on Aug 30, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Even if the bots sent such a header, it would be stripped whenever they crawl via a proxy, and your site would still fall victim to proxy hacking.

However, tracking legit bots is much easier now that the big 4 have all implemented round trip DNS.

If the reverse DNS has the following:

.googlebot.com
.crawl.yahoo.net
.search.live.com
.ask.com

You know it's a legit spider from the big 4 if the forward DNS check confirms the IP address matches.

No spider spoofing, no proxy hacking, no worrying about the exact UA, nada.
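The round trip check described above can be sketched in a few lines of Python. This is only a minimal sketch, not anyone's official method: the function name `is_verified_crawler` and the injectable `reverse_lookup` / `forward_lookup` parameters are my own assumptions, added so the logic can be exercised without touching the network. The suffix list is copied from the post.

```python
import socket

# Reverse DNS suffixes listed in the post for the big 4.
TRUSTED_SUFFIXES = (
    ".googlebot.com",
    ".crawl.yahoo.net",
    ".search.live.com",
    ".ask.com",
)


def is_verified_crawler(
    ip,
    suffixes=TRUSTED_SUFFIXES,
    reverse_lookup=socket.gethostbyaddr,      # hypothetical hook, defaults to real DNS
    forward_lookup=socket.gethostbyname_ex,   # hypothetical hook, defaults to real DNS
):
    """Return True only if the IP passes a round trip DNS check."""
    # Step 1: reverse DNS (PTR) lookup on the visiting IP.
    try:
        hostname, _aliases, _addrs = reverse_lookup(ip)
    except OSError:  # socket.herror and socket.gaierror are OSError subclasses
        return False

    # Step 2: the hostname must END with a trusted suffix, so that
    # "googlebot.com.evil.example" does not slip through.
    hostname = hostname.rstrip(".").lower()
    if not hostname.endswith(tuple(suffixes)):
        return False

    # Step 3: forward DNS lookup on that hostname must include the same IP.
    try:
        _name, _aliases, addresses = forward_lookup(hostname)
    except OSError:
        return False
    return ip in addresses
```

In real use you would call `is_verified_crawler(remote_ip)` with the default lookups and probably cache the answer per IP, since each check costs two DNS queries.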

Others that I know off the top of my head that have implemented round trip DNS include Exabot, Furlbot, Twiceler, VoilaBot, BecomeBot and tailrank.com.

Even Gigabot is implementing this and we finally have confirmation that it's really Gigabot crawling from this range:

Performance Systems International Inc. COGENT-NB-0002 (NET-38-112-0-0-1)
38.112.0.0 - 38.119.255.255

host 38.114.104.36
36.104.114.38.in-addr.arpa domain name pointer ss26.dal0.gigablast.com
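For anyone who wants to sanity-check figures like these offline, here is a small sketch using Python's standard `ipaddress` module. It only confirms that the quoted IP falls inside the quoted range and shows the PTR name that `host` queried; the helper names `range_to_networks` and `ip_in_range` are made up for illustration.

```python
import ipaddress


def range_to_networks(first, last):
    """Collapse a first-last address range into CIDR networks."""
    return list(
        ipaddress.summarize_address_range(
            ipaddress.ip_address(first), ipaddress.ip_address(last)
        )
    )


def ip_in_range(ip, first, last):
    """True if ip lies between first and last (inclusive)."""
    address = ipaddress.ip_address(ip)
    return any(address in network for network in range_to_networks(first, last))


# Values copied from the post above.
networks = range_to_networks("38.112.0.0", "38.119.255.255")
inside = ip_in_range("38.114.104.36", "38.112.0.0", "38.119.255.255")

# The name a reverse lookup asks for: octets reversed under in-addr.arpa.
ptr_name = ipaddress.ip_address("38.114.104.36").reverse_pointer
```

A range match alone proves nothing about who is crawling; it is the matching reverse and forward DNS that makes the identification trustworthy.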

Wonder who else will get with the round trip DNS plan?