Spider User Agents

ID via user agents


phrequency

7:18 pm on Oct 31, 2001 (gmt 0)



I was thinking about these crazy spiders and how to detect them for cloaked/doorway pages.

I know there is fairly reliable JavaScript that will tell me which browser people are using to reach my site (IE, NS, Opera, etc.), regardless of version.

So my question is:
Do spiders mimic any of these browsers in their identification?
Are the spiders using user agents which say they are an MS browser or NS browser?

So basically, what I am getting at is this: is it possible to use a JavaScript detect which says, if you are using IE, NS, Opera, etc., go here (the real dynamic site), and all others go there, where "there" would be my SEO-rich pages? Because chances are, if you are not using IE, NS, Opera, AOL, etc., then you must be a spider, no?

Thanks in advance.

littleman

7:37 pm on Oct 31, 2001 (gmt 0)



>Do spiders mimic any of these browsers in their identification?
>Are the spiders using user agents which say they are an MS browser or NS browser?
Well, sort of. Some spiders come in with UAs like these:
Mozilla/2.0 (compatible; Ask Jeeves)
Mozilla/3.0 (Slurp/cat; slurp@inktomi.com; [inktomi.com...]
Mozilla/4.0 (compatible; FastCrawler3, support-fastcrawler3@fast.no)
They don't ID themselves as Netscape, MSIE, or Opera, but they do try to look like a browser of some sort. Many rogue spiders mimic MSIE.

>So basically, what I am getting at is this: is it possible to use a JavaScript detect which says, if you are using IE, NS, Opera, etc., go here (the real dynamic site), and all others go there, where "there" would be my SEO-rich pages? Because chances are, if you are not using IE, NS, Opera, AOL, etc., then you must be a spider, no?

JavaScript won't do it. You need scripts/programs working at the server level, and have them determine who is knocking at the door before you let them in.
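For what it's worth, here is a minimal sketch of that server-level check in Python (CGI-style). The page filenames are made-up placeholders, and the spider patterns come from the UAs quoted above; the same logic could just as well be Perl or PHP:

#!/usr/bin/env python
# Minimal sketch of a server-side user agent check (CGI-style).
import os
import re

# Patterns based on the spider UAs quoted above; a real list needs
# constant upkeep as new spiders appear.
SPIDER_PATTERNS = [r'Ask Jeeves', r'Slurp', r'FastCrawler']

def is_spider(user_agent):
    """True if the UA matches any known spider pattern."""
    return any(re.search(p, user_agent, re.I) for p in SPIDER_PATTERNS)

ua = os.environ.get('HTTP_USER_AGENT', '')
# Placeholder filenames: serve the lighter page to spiders.
page = 'spider_page.html' if is_spider(ua) else 'human_page.html'

print('Content-Type: text/html')
print()
print(open(page).read())

The point is that the decision happens on the server, before any HTML goes out the door.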

phrequency

8:39 pm on Oct 31, 2001 (gmt 0)



You say:
"They don't ID them selves as Netscape, MSIE or Opera, but they do try to look like a browser of some sort. Many rogue spider mimic MSIE."

Is a "rogue" spider something to be concerned about when attempting to SEO sites for the major SEs?

------------------

Regardless of how detection is done, do you feel it is possible to detect spiders in the way I describe, versus trying to get and maintain a list of spider IPs and such?

I have been trying to get my hands on a good list of spiders and am just not happy with the amount of maintenance it requires.

I was just thinking that if I can see whether they are IE, Netscape, etc. before they enter the site, I can direct them better than by using a huge list of IPs and such.

Thanks for the reply.

littleman

10:43 pm on Oct 31, 2001 (gmt 0)



> "rogue"
Sometimes. Some SEs are starting to play with nonstandard bots here and there.

>do you feel it is possible to detect spiders in the way
Yes, but not reliably enough for solid detection.

It wouldn't be maintenance-free either. Take a look at the UAs above: spiders are crawling with "Mozilla" in their UA strings. The UAs are not exactly the same as a real browser's, but they are very similar, and a lot of browser-logging scripts would actually see them as Netscape. So if you do go this route, you will have to pay close attention to your regex work and also watch your logs for new spider UAs.
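To see why the regex work matters, here is a quick Python sketch using the UAs quoted above (the MSIE string is a typical example added for contrast, not one from this thread). A naive test for "Mozilla" lumps all four together, so the spider patterns have to be checked first:

import re

uas = [
    'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0)',  # real browser
    'Mozilla/2.0 (compatible; Ask Jeeves)',                # spider
    'Mozilla/3.0 (Slurp/cat; slurp@inktomi.com; ...)',     # spider
    'Mozilla/4.0 (compatible; FastCrawler3, support-fastcrawler3@fast.no)',
]

naive = re.compile(r'^Mozilla/')   # matches all four: useless on its own
spider = re.compile(r'Ask Jeeves|Slurp|FastCrawler', re.I)

for ua in uas:
    # Check the spider patterns before falling back to 'browser'.
    label = 'spider' if spider.search(ua) else 'browser'
    print('%-7s %s' % (label, ua))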

You probably want to feed the spiders the same text as humans, but put it on a cleaner, simpler page. Make your 'spider pages' lighter versions of your 'human pages'. Your wall between bots and humans will have a lot of holes in it, so you want your cloaked pages to be able to pass human review if it comes down to it.

Will

10:45 am on Nov 1, 2001 (gmt 0)



The amount of maintenance required depends on what type of spider list you use.

You could maintain your own, but it probably makes more sense to use a third party list/service of some sort that keeps the list updated for you.

Fantomaster [fantomaster.com] has one of the bigger databases, which is subscription-based. They also offer a variety of helpful advice, scripts, and so on, many of which are available for download.

XAgent [schtoom.co.uk] is a package containing detection scripts and a frequently updated spider list.

As littleman points out, you will need a server-side script of some sort, and it will need to check more than just the user agent if the results are to be anything like reliable.
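As a rough illustration of such a combined check, here is a Python sketch. The IP prefixes below are invented placeholders, not real crawler ranges; a real list (hand-maintained or from a subscription service) would be far larger and kept current:

import os
import re

# Hypothetical spider list pairing a UA pattern with the IP prefixes
# that engine is believed to crawl from. The prefixes are invented
# for illustration only.
KNOWN_SPIDERS = [
    (r'Slurp',      ('216.32.', '216.35.')),
    (r'Ask Jeeves', ('65.214.',)),
]

def is_known_spider(user_agent, remote_addr):
    """Require both the UA and the source IP to match before cloaking."""
    for pattern, prefixes in KNOWN_SPIDERS:
        if re.search(pattern, user_agent, re.I):
            return remote_addr.startswith(prefixes)
    return False

ua = os.environ.get('HTTP_USER_AGENT', '')
ip = os.environ.get('REMOTE_ADDR', '')
serve_spider_page = is_known_spider(ua, ip)

This way a rogue spider faking Slurp's UA from an unlisted address still gets the human page.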

Client side script has three main problems:

1) By definition, it cannot be used effectively for cloaking, since the page has already been loaded. Therefore you will not be able to alter the title, keywords, description, or content in such a way that a spider would read them.

2) What happens if a (human) visitor comes to your page using a browser that does not support script, or has it switched off?

3) JavaScript is by far the best-supported client-side scripting language, but different browsers support different versions (sometimes, even browsers that claim to support the same version will differ slightly in their syntax!). In short, it is impossible to guess which environments your script will cause errors in. By contrast, use a server-side solution and you know that the code will run in the same environment each and every time.