Three spiders at same time with unusual behavior

innocbystr

3:38 pm on Oct 30, 2005 (gmt 0)

I got hits on virtually all of my pages from these three IPs almost simultaneously. All three IPs apparently belong to AT&T WorldNet Services. They didn't appear to request the robots.txt file but otherwise behaved like spiders:

12.44.181.*** - - [27/Oct/2005:09:51:40 -0700] "GET /acme/?N=A HTTP/1.0" 200 1052 "-" "Mozilla/4.0 (compatible; MSIE 4.0; Windows NT; ....../1.0 )"
12.44.172.** - - [27/Oct/2005:09:51:47 -0700] "GET /acme/?M=D HTTP/1.0" 200 1052 "-" "Mozilla/4.0 (compatible; MSIE 4.0; Windows NT; ....../1.0 )"
63.160.77.*** - - [27/Oct/2005:09:51:54 -0700] "GET /acme/?S=D HTTP/1.0" 200 1052 "-" "Mozilla/4.0 (compatible; MSIE 4.0; Windows NT; ....../1.0 )"

Has anyone else seen this? Also, what in the world do "?N=A" and similar entries mean?

Thanks,
Blair

[edited by: volatilegx at 4:10 pm (utc) on Oct. 31, 2005]
[edit reason] obscured IP addresses [/edit]

Dijkgraaf

2:59 am on Nov 1, 2005 (gmt 0)

?N=A is what is called a query string. Server-side scripts or JavaScript can parse these: the part before the = is the variable name and the part after it is the value, so in this case the variable N has the value "A".
You can pass multiple values by chaining them, e.g. ?N=A&M=B&... etc.
I take it you aren't using these on your site, so why this spider is adding them is a bit of a mystery (unless someone is linking to your site with them). If you aren't using query strings, the server will serve up the same content regardless of what is in the query string.
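
To make the name/value split concrete, here is a minimal Python sketch using the standard library's urllib.parse. The helper name query_params is made up for illustration; the URLs are the ones discussed in this thread.

```python
from urllib.parse import urlsplit, parse_qs

def query_params(url):
    """Return the query-string parameters of a URL as a dict of lists.

    Illustrative helper only; the name is not from the thread.
    """
    return parse_qs(urlsplit(url).query)

# One variable: N has the value "A".
print(query_params("/acme/?N=A"))      # {'N': ['A']}

# Several variables chained with "&".
print(query_params("/acme/?N=A&M=B"))  # {'N': ['A'], 'M': ['B']}
```

Each value comes back as a list because the same variable name may legally appear more than once in a query string.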

innocbystr

3:28 am on Nov 1, 2005 (gmt 0)

Thanks Dijkgraaf. The robot(s) only added these query strings to a few pages out of the many crawled. Think I'll go back and check which pages to see if there's a pattern of some kind and go from there.

Have a Good One,
Blair
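
One way to look for such a pattern is to group the query strings by the path they were attached to. The sketch below is only a rough illustration: it assumes the log is in the common/combined format shown in the first post, and the function name and the last sample line are made up.

```python
import re
from collections import defaultdict

# Pulls the request target out of a log line: "GET /path?query HTTP/1.0"
REQUEST_RE = re.compile(r'"(?:GET|HEAD|POST) (\S+) HTTP/[\d.]+"')

def query_strings_by_path(log_lines):
    """Map each requested path to the set of query strings seen with it."""
    seen = defaultdict(set)
    for line in log_lines:
        match = REQUEST_RE.search(line)
        if not match:
            continue  # not a request line we recognise
        path, _, query = match.group(1).partition("?")
        if query:
            seen[path].add(query)
    return dict(seen)

sample_lines = [
    '12.44.181.*** - - [27/Oct/2005:09:51:40 -0700] "GET /acme/?N=A HTTP/1.0" 200 1052 "-" "Mozilla/4.0 (compatible; MSIE 4.0; Windows NT; ....../1.0 )"',
    '12.44.172.** - - [27/Oct/2005:09:51:47 -0700] "GET /acme/?M=D HTTP/1.0" 200 1052 "-" "Mozilla/4.0 (compatible; MSIE 4.0; Windows NT; ....../1.0 )"',
    '63.160.77.*** - - [27/Oct/2005:09:51:54 -0700] "GET /acme/?S=D HTTP/1.0" 200 1052 "-" "Mozilla/4.0 (compatible; MSIE 4.0; Windows NT; ....../1.0 )"',
    # Hypothetical line without a query string, added for contrast.
    '12.44.181.*** - - [27/Oct/2005:09:52:01 -0700] "GET /acme/page.html HTTP/1.0" 200 2048 "-" "-"',
]

for path, queries in query_strings_by_path(sample_lines).items():
    print(path, sorted(queries))
```

If the query strings turn up only on paths ending in "/", that would suggest they are tied to directories rather than to individual pages.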

innocbystr

4:19 am on Nov 1, 2005 (gmt 0)

I re-checked my logs and evidently these three spiders crawled everything on my site: all html files, css files, Thumbs.db files, every image (whether I was using it or not), and ran the query strings on all subdirectories and all image directories. They never requested the robots.txt file. Should I be concerned?

Thanks,
Blair
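
A quick way to confirm from the logs which clients never asked for robots.txt is sketched below. It assumes common/combined log format; the function name and the sample lines (which use documentation-only IP addresses) are hypothetical.

```python
import re

# Captures the client address and the request target from a log line.
LINE_RE = re.compile(r'^(\S+) \S+ \S+ \[[^\]]+\] "\S+ (\S+) [^"]*"')

def clients_skipping_robots(log_lines):
    """Return the set of clients that made requests but never fetched /robots.txt."""
    all_clients = set()
    fetched_robots = set()
    for line in log_lines:
        match = LINE_RE.match(line)
        if not match:
            continue
        client, target = match.groups()
        all_clients.add(client)
        # Ignore any query string when comparing the path.
        if target.split("?", 1)[0] == "/robots.txt":
            fetched_robots.add(client)
    return all_clients - fetched_robots

# Hypothetical log lines for illustration only.
example_log = [
    '192.0.2.1 - - [27/Oct/2005:09:51:40 -0700] "GET /acme/ HTTP/1.0" 200 1052',
    '198.51.100.7 - - [27/Oct/2005:09:52:10 -0700] "GET /robots.txt HTTP/1.0" 200 64',
    '198.51.100.7 - - [27/Oct/2005:09:52:11 -0700] "GET /index.html HTTP/1.0" 200 4096',
]

print(clients_skipping_robots(example_log))  # {'192.0.2.1'}
```

Ordinary browsers never fetch robots.txt either, so this only narrows the list of suspects; it does not prove a client is a robot.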

Dijkgraaf

5:18 am on Nov 1, 2005 (gmt 0)

Well, they are certainly badly behaved spiders, but whether you should be concerned depends on what damage you think they can do to you. It is a bit of a case of closing the barn door after the horse has bolted, but you may want to make some changes to your site to stop future occurrences.

It is possible for a spider to find everything in a directory if there is no default page for that folder and directory browsing is enabled. This can be fixed either by putting a default page in that folder or by changing a configuration setting for that web site/server to disallow directory browsing.
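
Query strings such as ?N=A, ?M=D and ?S=D match the column-sort links that Apache 1.3 puts on its auto-generated directory listings, so if this server is Apache (an assumption; the thread does not say), the spiders may simply have been following links on such listings. On Apache, the listings are governed by the Indexes option. To find the folders that are exposed, a rough Python sketch like the following could list every directory under the document root that has no default page. The default page names and the function name are assumptions; adjust them to whatever the server is configured to use.

```python
import os

# Assumed default page names; the real list depends on server configuration.
DEFAULT_PAGES = ("index.html", "index.htm")

def dirs_without_default_page(doc_root, default_pages=DEFAULT_PAGES):
    """Return directories under doc_root (relative paths) lacking a default page."""
    missing = []
    for dirpath, _dirnames, filenames in os.walk(doc_root):
        names = {name.lower() for name in filenames}
        if not any(page in names for page in default_pages):
            missing.append(os.path.relpath(dirpath, doc_root))
    return sorted(missing)

if __name__ == "__main__":
    # Run from the document root; "." stands for the root directory itself.
    for directory in dirs_without_default_page("."):
        print(directory)
```

Any directory this prints would show a file listing to visitors if directory browsing is enabled.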

If you want to stop bad bots from spidering your web site, you can implement a bot trap with automatic banning. There are various web sites, and also threads on this bulletin board, that tell you how to create one.
Currently I haven't bothered with this on my site, although I do have one page that is part of Project Honeypot, which looks to trap e-mail harvesting bots.
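
The core idea of a bot trap with automatic banning can be sketched in a few lines of Python. This is only an outline of the logic, not any particular published implementation: the trap path, the class name and the robots.txt content are all made up, and a real setup would hook this into the web server and store the ban list persistently.

```python
# Hypothetical trap URL. It is disallowed in robots.txt and linked only from
# a place humans won't click, so only robots that ignore robots.txt reach it.
TRAP_PATH = "/trap/"

ROBOTS_TXT = "User-agent: *\nDisallow: /trap/\n"

class BotTrap:
    """Ban any client that requests the trap path (in-memory sketch)."""

    def __init__(self, trap_path=TRAP_PATH):
        self.trap_path = trap_path
        self.banned = set()

    def status_for(self, client_ip, path):
        """Return the HTTP status code to answer this request with."""
        if client_ip in self.banned:
            return 403  # already banned: refuse everything
        if path.startswith(self.trap_path):
            self.banned.add(client_ip)  # walked into the trap: ban from now on
            return 403
        return 200

trap = BotTrap()
print(trap.status_for("192.0.2.1", "/index.html"))  # 200
print(trap.status_for("192.0.2.1", "/trap/"))       # 403
print(trap.status_for("192.0.2.1", "/index.html"))  # 403
```

Well-behaved spiders read robots.txt and never request the trap path, so they are not affected.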