Rogue bots or legitimate spiders?

dan_popescu

8:41 pm on Nov 23, 2002 (gmt 0)

How do you know if it's a spam bot spidering your pages or a legitimate SE bot? I get spidered (not sure spidered is the word) all the time by many different IPs. For instance, today I had these IPs going through most of my pages: 68.3.115.137, 64.140.49.68, 66.77.73.145, 165.166.181.156, 216.167.97.169. This seems to be getting bigger every day. What do you think these really are? I hate to see them eating up my bandwidth, especially if it's not useful.

Thank you.
Dan

bull

9:13 pm on Nov 23, 2002 (gmt 0)

we could give better answers if you gave us the UA strings too, or, best, lines from your log files.

regards,
bull

mack

9:15 pm on Nov 23, 2002 (gmt 0)

I have had these problems in the past. Now I have resorted to using a robots.txt that only allows known and trusted bots. If something still proceeds to spider the site, I would consider using .htaccess.

My rule of thumb is: if I haven't heard of it before, it is of no use and not going to give me anything in return.
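
A minimal sketch of the whitelist idea above, assuming a robots.txt that admits one trusted crawler and asks every other robot to stay out. The file contents and the "SomeUnknownBot" name are illustrative assumptions, not mack's actual file. Python's standard robots.txt parser shows how a well-behaved robot would read it:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical whitelist-style robots.txt: an empty Disallow means
# "nothing is off limits" for that robot; everyone else is asked to leave.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /
"""


def is_allowed(user_agent: str, path: str = "/index.htm") -> bool:
    """Return True if the robots.txt above permits user_agent to fetch path."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, path)


if __name__ == "__main__":
    for agent in ("Googlebot", "SomeUnknownBot"):
        print(agent, "allowed" if is_allowed(agent) else "disallowed")
```

Keep in mind that robots.txt is only a request. A rogue bot can simply ignore it, which is why a server-side block is the fallback.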

dan_popescu

9:20 pm on Nov 23, 2002 (gmt 0)

Bull
What's a UA string? This is one line from my access_log file:

68.3.115.137 - - [23/Nov/2002:20:58:39 +0200] "GET /index.htm HTTP/1.1" 200 16839

bull

9:35 pm on Nov 23, 2002 (gmt 0)

UA=user agent.
Is this the whole line of your logfile? It does not contain the UA, which is a big help in identifying what program spidered your site.
Mine looks like this:

1.2.3.4 - - [23/Nov/2002:19:24:44 +0100] "GET /x.gif HTTP/1.1" 200 963 www.domain.net "referer here" "Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0)" "-"

where "Mozilla..." is the UA.

You apparently have the so-called "common log format", which does not record the UA. If you have access to the server's httpd.conf file (it normally lives in Apache's configuration directory, not in your site's document root), it most likely contains these lines somewhere:
LogFormat "%h %l %u %t \"%r\" %>s %b" common
LogFormat "%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\"" combined

If there is also a line beginning with "CustomLog .... common", you might want to change it to "combined".
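
Once the log records the UA, pulling it out of each line is straightforward. Here is a rough sketch, assuming lines in the standard "common" or "combined" layouts shown above; the regular expression and the `parse_line` helper are illustrative, not part of Apache:

```python
import re

# Fields of the "common" format, with the two extra quoted fields of the
# "combined" format (referer and user agent) made optional.
LOG_LINE = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+)'
    r'(?: "(?P<referer>[^"]*)" "(?P<agent>[^"]*)")?'
)


def parse_line(line: str):
    """Return a dict of log fields, or None if the line is not recognised.

    For a "common" format line, the referer and agent entries are None.
    """
    match = LOG_LINE.match(line)
    return match.groupdict() if match else None


if __name__ == "__main__":
    sample = ('1.2.3.4 - - [23/Nov/2002:19:24:44 +0100] "GET /x.gif HTTP/1.1" '
              '200 963 "-" "Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0)"')
    fields = parse_line(sample)
    print(fields["host"], fields["agent"])
```

A custom LogFormat with extra fields (such as a host name between the size and the referer) would need its own pattern.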

[edited by: bull at 9:42 pm (utc) on Nov. 23, 2002]

jdMorgan

9:38 pm on Nov 23, 2002 (gmt 0)

Dan,

64.140.49.68 ICG NetAhead, Englewood Co.
68.3.115.137 Customer of Cox Communications
66.77.73.145 Fast Search
165.166.181.156 Customer of Rock Hill Telephone Co, Rock Hill, SC
216.167.97.169 Customer of Verio, Inc.

With the exception of the Fast Search engine spider, you've got a mixed bag, here.
You have to look these guys up - in ARIN [arin.net], for example - and then take into account what you can find out about:
1) Who they are
2) What User-agent they use
3) What they "do" on your site

1a) For the first IP, look these guys up and decide whether their stated business purposes justify deep visits to your site. Are they offering anything to you or your customers/visitors? Sometimes, you'll find "web information miners" that offer specialized web-gathered information for sale. They use your bandwidth to sell information to their customers. You decide if this is useful to you, or just theft of bandwidth.
1b) For the ones marked "Customer of xyz", you can't track them any further based on just the IP address. They may be dial-up users who are assigned a "random" IP address out of their ISP's pool of addresses each time they connect, for example. So, you go on to:

2) What user-agent do they use? Is it "Mozilla/x.xx (compatible; )" or "Opera/xx.xx"? This would be a normal IE, Netscape, or Opera browser. If not, is it a known-abusive web site harvester like "Indy Library"? Does it appear in the Close to perfect htaccess ban list [webmasterworld.com] posted here on WebmasterWorld? If so, it's bad news. If not, a search through this forum may turn up something. For total unknowns, we go on to:

3) What do they do on your site? Download your whole site in 15 seconds, bringing your server to its knees? Poke around looking for formmail.pl? Dig into your robots.txt file and then go for disallowed files? Do they load just your html pages, and ignore images and scripts?
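
The "whole site in 15 seconds" pattern is the easiest of these to spot mechanically. A rough sketch, assuming you have already extracted (IP, timestamp) pairs from your log; the function names and the threshold of 30 hits are arbitrary illustrative choices, not recommended values:

```python
from collections import defaultdict
from datetime import datetime, timedelta


def parse_time(stamp: str) -> datetime:
    """Parse an Apache log timestamp such as '23/Nov/2002:20:58:39 +0200'."""
    return datetime.strptime(stamp, "%d/%b/%Y:%H:%M:%S %z")


def find_fast_crawlers(hits, max_hits=30, window_seconds=15):
    """Return the set of IPs with more than max_hits requests inside any
    span of window_seconds. hits is an iterable of (ip, datetime) pairs."""
    by_ip = defaultdict(list)
    for ip, when in hits:
        by_ip[ip].append(when)

    window = timedelta(seconds=window_seconds)
    flagged = set()
    for ip, times in by_ip.items():
        times.sort()
        start = 0
        for end, current in enumerate(times):
            # Slide the window start forward until it spans <= window.
            while current - times[start] > window:
                start += 1
            if end - start + 1 > max_hits:
                flagged.add(ip)
                break
    return flagged
```

This only catches brute-force downloads. A slow, patient bot stays under any such threshold, so the other questions above still have to be answered by reading the log.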

You have to analyze the big picture to figure this out, and the only really helpful thing is experience. Also, everyone's focus and attitude are a little different, ranging from "ignorance is bliss" and "anything goes" to "ban everything that does not benefit my site or my visitors or customers."

Having decided, you can block by IP address or IP address range for known-static intruders. You can block by User-agent for those who use well-known e-mail address harvesters, site downloaders, etc. You can block by referer for sites which "borrow" your images. There are also several pesky 'bots that use legitimate User-agents and legitimate-but-inappropriate referers (e.g. iaea.org) to "make it look good" while they download your whole site.
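
On a live Apache site these blocks would normally be written as .htaccess rules. The sketch below only illustrates the three kinds of test in Python. The network range is a reserved documentation range standing in for a real offender, the referer host is made up, and `should_block` is a hypothetical helper:

```python
import ipaddress
from urllib.parse import urlsplit

# Placeholder ban lists (assumptions for illustration only).
BANNED_NETWORKS = [ipaddress.ip_network("192.0.2.0/24")]
BANNED_AGENT_SUBSTRINGS = ["indy library"]
BANNED_REFERER_HOSTS = {"image-borrower.example"}


def should_block(ip: str, user_agent: str = "", referer: str = ""):
    """Return the reason for blocking ('ip', 'user-agent' or 'referer'),
    or None if the request passes all three checks."""
    address = ipaddress.ip_address(ip)
    if any(address in network for network in BANNED_NETWORKS):
        return "ip"

    agent = user_agent.lower()
    if any(fragment in agent for fragment in BANNED_AGENT_SUBSTRINGS):
        return "user-agent"

    # urlsplit gives hostname None for an empty or "-" referer.
    host = urlsplit(referer).hostname
    if host in BANNED_REFERER_HOSTS:
        return "referer"

    return None
```

Note that the last case in the paragraph above, a bot with a legitimate-looking User-agent and referer, passes every one of these checks. That is the gap spider traps try to cover.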

Sometimes, you just can't get a handle on a fixed IP address, User-agent, etc. In that case, you can set up some spider traps on your pages, and have them call a cgi script to dynamically add the offending IP address to your ban list. I am working on this at this time, using a modified version of this spider-trap script [webmasterworld.com] posted here on WebmasterWorld by member Key_Master. I can tell you it's effective, but potentially dangerous if you make a mistake: if you're not very careful, you could easily block Googlebot, for example. Once I get more experience with it, I may post again with what I've learned about doing this safely.
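
To make the idea concrete, here is a minimal sketch of the bookkeeping such a trap might do. It is not Key_Master's script. The function names, the protected-agent list, and the use of Apache's "deny from" lines as the ban-list format are all assumptions for illustration:

```python
import ipaddress

# Hypothetical safelist: user-agent tokens that must never be banned.
# A User-agent can be forged, so this is a weak safeguard on its own.
PROTECTED_AGENTS = ("googlebot",)


def ban_if_trapped(ip: str, user_agent: str, banned: set) -> bool:
    """Call when a trap URL is requested. Adds ip to the banned set unless
    the user agent is protected. Returns True only if ip was newly banned."""
    if any(token in user_agent.lower() for token in PROTECTED_AGENTS):
        return False
    # Normalise the address; raises ValueError if ip is malformed.
    address = str(ipaddress.ip_address(ip))
    if address in banned:
        return False
    banned.add(address)
    return True


def render_deny_lines(banned: set) -> str:
    """Render the ban list as Apache 'deny from' lines, one per address."""
    return "\n".join(f"deny from {address}" for address in sorted(banned))
```

A real trap script would also have to write the result to the server configuration safely (file locking, backups), which this sketch leaves out.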

The key issues are:
1) How to name the spider trap files to make them attractive to bad 'bots.
2) How to name them so that Google doesn't list links to them in SERPs (without cloaking).
3) How to protect them so that Google doesn't fetch them (there is a conflict here with #2).
4) How to hide them so that normal users can't find them, click on them, and get banned.

Right now, I know just a little about these four points, and would be uncomfortable advising anyone.

HTH,
Jim

bull

9:43 pm on Nov 23, 2002 (gmt 0)

hey jim,

great text!

-jan

dan_popescu

9:49 pm on Nov 23, 2002 (gmt 0)

Jim
Thanks so much for taking the time to make this clear to me. As you said, it all comes with experience, and since I don't have much, all these things are quite overwhelming sometimes. Anyway, I'll print out your text and read it over a few times. Then I'll be back with more questions :)

Thanks very much

Dan

jdMorgan

10:14 pm on Nov 23, 2002 (gmt 0)

Shucks, print it out? I hope I didn't type it too fast, and mess it up, then!

Actually, I had an ulterior motive:

How many members here use a spider trap script approach, rather than having to spend hours each day looking at raw log files and adding bans line-by-line? As I said, I've begun playing with this approach, and I see that it can be quite effective against "brute force" site attacks. It's not a "final answer", it's just another tool in the toolbox. But I'm just an apprentice in the craft, and hoping to learn to use it better...

Thanks,
Jim

bull

10:37 pm on Nov 23, 2002 (gmt 0)

>spending hours each day looking at raw log files

this one. yep :)

-jan

wilderness

1:02 am on Nov 24, 2002 (gmt 0)

Hey Jim,
Since you're looking for feedback here . . .
I'm not using a trap.

Most folks think I'm quite overbearing. However, I tend to let most visitors enter at will, and respond only once they begin acting maliciously.
In most instances there is some type of initial warning.
A lone inquiry into robots.txt or a HEAD-only request: these are the most common.

In some instances a bot will just begin hitting hard.

In the past week I've had two hits from AT&T users. This, added to an AT&T intrusion back in January, prompted me to expand my AT&T denies. I notified AT&T of the expanded denials; the only response was automated.
I've had something similar with MSN and Verizon recently.
Many folks are having Verio problems as well.

I think these unidentified private intrusions are going to increase as backbones are moved around in the failing economy.
The end result may well require extensive use of traps.

It's too bad that most service providers don't accept these intrusions as intrusions even when they are provided with TOS links from their own websites. :-(