6 bots left...

What's the safest identification method?

         

menyak

1:25 am on Mar 7, 2003 (gmt 0)

10+ Year Member



From what I've read in other threads, there are six core-algo search engines left right now:

google
inktomi
AV/FAST
wisenut
teoma
openfind

What's the safest and most efficient way to identify their spiders when searching for a substring of $HTTP_SERVER_VARS['HTTP_USER_AGENT'] in PHP?

"googlebot" and "slurp@inktomi.com" will obviously work well for the first two. What about the other four?

Jocelyn

1:38 am on Mar 7, 2003 (gmt 0)

10+ Year Member



I advise you to use the ranges of well-known IP addresses of those spiders. It is a much safer way to identify them, as anyone can easily fake a user agent.
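One way to sketch this IP-range approach in PHP is with ip2long() comparisons. The helper name and the range below are made up for illustration, not an official spider allocation; you would need to collect real ranges from your own logs:

```php
<?php
// Returns true if $ip falls inside any of the given [low, high] ranges.
// Note: ip2long() is signed on 32-bit systems, so addresses above
// 127.255.255.255 would need sprintf('%u', ...) normalization.
function ip_in_ranges($ip, $ranges) {
    $n = ip2long($ip);
    foreach ($ranges as $range) {
        if ($n >= ip2long($range[0]) && $n <= ip2long($range[1])) {
            return true;
        }
    }
    return false;
}

// Illustrative range only -- NOT a verified spider allocation.
$spider_ranges = array(
    array('64.68.80.0', '64.68.87.255'),
);
// ip_in_ranges($HTTP_SERVER_VARS['REMOTE_ADDR'], $spider_ranges)
```

The downside is maintenance: the engines can add or move crawler ranges without notice, so the list has to be re-checked against your logs from time to time.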

menyak

1:49 am on Mar 7, 2003 (gmt 0)

10+ Year Member



I'd basically just need it to omit session IDs to make crawling easier. The only benefit of faking the user agent would be getting logged out all the time, so I don't see a security concern. :)

menyak

8:26 pm on Mar 7, 2003 (gmt 0)

10+ Year Member



How about using...

"teomaagent"
"crawler@fast.no"
"WISEnutbot.com"
"robot-response@openfind.com"

Would that work? My concerns are that maybe one of the strings isn't up-to-date or is bound to change in due time. Would ALL the bots of the respective search engines contain these strings? Also, I wouldn't want to use anything that could also be part of a regular UA string, hence leaving a "normal" user without a proper session.

Thanks!
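A minimal PHP sketch of the substring approach, using the strings suggested so far in this thread (the `is_spider` helper name is made up, and the list is only what has been mentioned here, so verify each string against your own logs):

```php
<?php
// Case-insensitive check of the user agent against a list of known
// spider substrings. The list is only the strings discussed in this
// thread -- they may be incomplete or go stale when the bots change.
function is_spider($ua) {
    $bot_strings = array(
        'googlebot',                      // Google
        'slurp@inktomi.com',              // Inktomi
        'crawler@fast.no',                // AV/FAST
        'WISEnutbot.com',                 // WiseNut
        'teomaagent',                     // Teoma
        'robot-response@openfind.com',    // Openfind
    );
    foreach ($bot_strings as $needle) {
        if (stristr($ua, $needle) !== false) {
            return true;
        }
    }
    return false;
}

// Example: only start a session (and emit session IDs) for humans.
// if (!is_spider($HTTP_SERVER_VARS['HTTP_USER_AGENT'])) { session_start(); }
```

Since the strings are fairly distinctive (email addresses, bot domains), the risk of accidentally matching a regular browser's UA string seems low, but that is exactly the property to double-check before relying on it.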

Craig_F

8:29 pm on Mar 7, 2003 (gmt 0)

10+ Year Member



I need info on this too. What is the best way to identify all major spiders? I need to be as accurate as possible.

menyak

1:04 pm on Mar 10, 2003 (gmt 0)

10+ Year Member



Is anyone able to help with this? It would really be MUCH appreciated! Thanks again.

StopSpam

10:58 am on Mar 11, 2003 (gmt 0)

10+ Year Member



I need info on this too. What is the best way to identify all major spiders? I need to be as accurate as possible.

===

The best way is indeed to check their IP address...
I've got a tool for the Macintosh; I'm sure there
is something similar for the PC...

I type in the IP address of a robot...

See sample: 64.68.82.46

My log says it's from Google, but to be 100% sure
I run it through this program, and it tells me this:

IP is from: crawler11.googlebot.com.

Now I know for sure it's Google.
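The same reverse lookup can be scripted in PHP with gethostbyaddr(), and a forward lookup on the result guards against a spoofed reverse-DNS entry. The helper names are made up; the DNS-based function needs network access, so only the pure suffix check is shown with example values:

```php
<?php
// Case-insensitive "host ends with .domain" check, so that
// "crawler11.googlebot.com" matches but "googlebot.com.evil.net" does not.
function ends_with_domain($host, $domain) {
    $suffix = '.' . strtolower($domain);
    $host = strtolower($host);
    return strlen($host) > strlen($suffix)
        && substr($host, -strlen($suffix)) === $suffix;
}

// Reverse-resolve the IP, then forward-resolve the hostname to make
// sure it points back at the same IP (a faked PTR record won't).
function verified_hostname($ip) {
    $host = gethostbyaddr($ip);   // returns $ip unchanged on failure
    if ($host === $ip) {
        return false;
    }
    if (gethostbyname($host) !== $ip) {
        return false;
    }
    return $host;
}

// $host = verified_hostname($HTTP_SERVER_VARS['REMOTE_ADDR']);
// if ($host !== false && ends_with_domain($host, 'googlebot.com')) {
//     // confirmed Google crawler
// }
```

DNS lookups on every request are slow, so in practice you would cache the result per IP rather than resolving each hit.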

fiestagirl

8:36 pm on Mar 11, 2003 (gmt 0)

10+ Year Member



How about using [robotstxt.org?...]
The HTTP User-Agent is listed for each robot, as is the exclusion tag.