Forum Moderators: open


Spider footprint

What does this term mean?


startup

9:59 pm on Dec 21, 2000 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



We know the User Agent can be changed. Someone can change their User Agent in an attempt to decloak a page. Is there a way of identifying the fake, other than comparing IP and Agent?

han solo

10:25 pm on Dec 21, 2000 (gmt 0)



I had one person fake a user agent and get into my system... it was pretty funny. I did the same thing back, only I, well, I'm not going to go into details, but I think the chap was just trying to learn a few things. I did, too. I learned he wasn't very good.

The way to do it is to flag any user agent strings that:

a) don't <u>completely</u> match the usual user agent, or

b) don't match by IP, and/or don't look up by whois at arin.net, where you can see who the IP block is allocated to.

Usually I find that if it doesn't have an arin.net whois entry, it isn't real. All of the engines (or Exodus) end up being registered for the real spiders that should come crawling.
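The whois-style check above can be sketched in code. A minimal Python sketch follows, using forward-confirmed reverse DNS in place of a manual arin.net lookup: resolve the visiting IP to a hostname, check it belongs to a domain you trust for that spider, then resolve the hostname back to confirm it really maps to that IP. The domain `example-crawler.com` and the injectable lookup functions are illustrative assumptions, not anything from the thread.

```python
import socket

def verify_spider_ip(ip, expected_suffixes, reverse_lookup=None, forward_lookup=None):
    """Forward-confirmed reverse DNS check for a claimed spider IP.

    reverse_lookup/forward_lookup default to real DNS calls but can be
    injected for testing or offline use.
    """
    reverse_lookup = reverse_lookup or (lambda addr: socket.gethostbyaddr(addr)[0])
    forward_lookup = forward_lookup or socket.gethostbyname
    try:
        host = reverse_lookup(ip)
    except OSError:
        return False  # no reverse entry at all: treat as fake, per the post above
    # Hostname must end with a domain we expect for this spider.
    if not any(host.endswith(suffix) for suffix in expected_suffixes):
        return False
    try:
        # Forward-confirm: the hostname must resolve back to the same IP.
        return forward_lookup(host) == ip
    except OSError:
        return False
```

With fake resolvers plugged in, a genuine-looking hostname that round-trips passes, while a dial-up host claiming a spider UA fails.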

For a great list to get you started, check out Brett's list over at [searchengineworld.com...] It is one of my favorite resources.

Cheers,
Han Solo

littleman

10:55 pm on Dec 21, 2000 (gmt 0)



You know, back a couple of years ago I used to switch my UA to Scooter/1.0 and then hit pages using AltaVista's Babelfish translator. It was amazing how many cloaks I was able to break - fun stuff!

Startup, I guess it depends on how good the person is who is trying to crack your cloak.

startup

11:01 pm on Dec 21, 2000 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I want to be able to do this on the fly. Let's say a page has a large .gif on it. Now I want to use an external file to specify that the .gif must load first, and also the time it will take to load. A spider should download the page before the .gif loads. If this works, I know I have a spider.

Is there a way of making the spider do something, or of querying the spider, before you feed it another page? New SE spiders appear very often. In the above example, IP and Agent would not match, but the behavior would flag it as a spider.

Or is it possible to have a page load in segments and query the spider or browser after each segment?

I hope this makes sense.

littleman

11:23 pm on Dec 21, 2000 (gmt 0)



None of the major bots will pull the .gif at all.

>Or is it possible to have a page load in segments and query the spider or browser after each segment?

In a way, yes. You could have a script, or an SSI call to a script, that will print out spider food and keep the rest of the page the same.

But I do not think this will be enough to do what you want. I'd take a close look at ENV variables such as HTTP_REFERER and the like for secondary screening.
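The secondary screening littleman suggests can be sketched against a CGI-style environment dict. This is an illustrative assumption about which variables are worth flagging, not a rule from the thread: browsers following a link normally send a Referer and an Accept-Language header, while many early spiders sent neither, and some announced a contact address in a From header.

```python
def secondary_screen(environ):
    """Return a list of spider-like flags for a CGI environment dict.

    An empty list means the request looks like an ordinary surfer;
    the more flags, the more spider-like the visitor.
    """
    flags = []
    if not environ.get("HTTP_REFERER"):
        flags.append("no-referer")          # surfers following links send one
    if not environ.get("HTTP_ACCEPT_LANGUAGE"):
        flags.append("no-accept-language")  # browsers almost always send this
    if "HTTP_FROM" in environ:
        flags.append("from-header")         # some polite spiders announce a contact address
    return flags
```

A request with both a Referer and an Accept-Language header comes back clean; a bare request trips both missing-header flags.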

startup

11:42 pm on Dec 21, 2000 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Thanks. Time to hit the books again. Are there any you would recommend?

PeteU

7:41 am on Dec 22, 2000 (gmt 0)

10+ Year Member



startup, I don't think you can do anything of value without doing an IP check. The cloaking logic is simple; it's keeping up with IPs and maintenance that's time consuming. If you can check IPs as ranges, it simplifies things a whole lot.
very basic algorithm:

check referer
if yes -> surfer
if no
check ip
check ua
if ip and ua match -> spider
if ip match only -> ua alert (possible spider)
if ua match only -> ip alert (new ip or a snooper)
if none match -> surfer (type-in and such) but could also
be a spider with both ip and ua new

of course in practice there is a whole bunch of exceptions
that need to be considered, like known spiders using a plain Mozilla ua, translators, etc.
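The decision table above translates almost line for line into code. A minimal sketch follows; the simple `startswith` prefix match stands in for a real IP-range check, and the sample IPs and UAs in the usage below are hypothetical.

```python
def classify(referer, ip, ua, known_ips, known_uas):
    """Classify a visitor using PeteU's decision table.

    known_ips holds IP prefixes (a crude stand-in for range matching);
    known_uas holds exact spider user-agent strings.
    """
    if referer:
        return "surfer"                          # spiders don't send a referer
    ip_match = any(ip.startswith(prefix) for prefix in known_ips)
    ua_match = ua in known_uas
    if ip_match and ua_match:
        return "spider"
    if ip_match:
        return "ua alert (possible spider)"      # known IP, unexpected UA
    if ua_match:
        return "ip alert (new ip or a snooper)"  # known UA, unknown IP
    return "surfer (or an unknown spider)"       # type-in, or a brand-new spider
```

As the post notes, a production version would need exception handling for known spiders that send a plain Mozilla UA, translators, and the like.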

have fun ;)

msgraph

1:03 pm on Dec 22, 2000 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Also, you should set up a way to ban bad user agents. There are still a lot of people who don't know how to set their own user agent when pulling pages, so they use 3rd-party page-sucking programs like Black Widow, Teleport Pro, etc.
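A banned-agent check of the kind msgraph describes can be sketched as a case-insensitive substring match against a blocklist. The list below is illustrative, seeded from the programs named in the post plus one assumed entry:

```python
# Illustrative blocklist: the first two names come from the post above,
# "webcopier" is an assumed extra entry of the same kind.
BAD_AGENT_SUBSTRINGS = ("black widow", "teleport pro", "webcopier")

def is_banned_agent(user_agent):
    """True if the User-Agent string matches a known page-sucker."""
    ua = user_agent.lower()
    return any(bad in ua for bad in BAD_AGENT_SUBSTRINGS)
```

A server script would call this before serving the page and return a 403 (or empty content) on a match; substring matching is deliberately loose, since these programs append version numbers to their UA strings.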

Brett_Tabke

10:16 am on Feb 5, 2001 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



Someone asked in email about spider-identifying logic (just bringing the thread back to the top for them).

The only thing you may have overlooked, Pete, is studying HTTP header values. Compare spiders to browsers sometime.