
Edgy from Columbia?

getting hit by new spider

         

Megaclinium

12:58 am on Jun 19, 2008 (gmt 0)

10+ Year Member



I'm getting whacked by a spider with no ID in its UA; it resolves to Columbia U in NY.

"Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.1.11) Gecko/20060601 Firefox/2.0.0.11 (Ubuntu-edgy)"

Any idea what or why?

I've 403'd them.
They haven't fetched robots.txt and are (were) simply bypassing my web pages and going directly to media files.

Kind of clever: they're spreading requests over a wide period, in random order, from randomly selected directories. But a direct GET to the content isn't particularly clever.

128.59.2x.xx is the address. Three different addresses in this Columbia block have bots.

wilderness

1:23 pm on Jun 19, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I'm getting whacked by a spider with no ID in its UA; it resolves to Columbia U in NY.

Not exactly ;)

The UA contains several distinctive strings that give you something to focus on: either the names themselves, or a combination of name and IP range.

An example of the latter:

# IF the UA contains Linux or Ubuntu and the visitor comes from 128.59.0.0-128.59.255.255, then deny access
RewriteCond %{HTTP_USER_AGENT} (Linux|Ubuntu)
RewriteCond %{REMOTE_ADDR} ^128\.59\.
RewriteRule .* - [F]

Please note: the forum software mangles the pipe character, so make sure the alternation in the first condition uses a literal | before implementation.
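For anyone who wants to sanity-check that matching logic before touching .htaccess, the two conditions can be approximated in Python. This is only an illustrative sketch: the function name and sample IPs are made up, and the patterns are copied from the rule above.

```python
import re

# Rough Python equivalent of the two RewriteCond lines: UA must contain
# "Linux" or "Ubuntu", and the client IP must start with 128.59.
UA_PATTERN = re.compile(r"Linux|Ubuntu")
IP_PATTERN = re.compile(r"^128\.59\.")

def should_block(user_agent: str, remote_addr: str) -> bool:
    """True when both conditions (UA substring and IP prefix) match."""
    return bool(UA_PATTERN.search(user_agent)) and bool(IP_PATTERN.match(remote_addr))

ua = ("Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.1.11) "
      "Gecko/20060601 Firefox/2.0.0.11 (Ubuntu-edgy)")
print(should_block(ua, "128.59.21.40"))  # True: matching UA from the Columbia range
print(should_block(ua, "66.249.66.1"))   # False: same UA, different network
```

Both conditions must match, just as two stacked RewriteCond lines are ANDed together by mod_rewrite.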

Megaclinium

10:23 pm on Jun 19, 2008 (gmt 0)

10+ Year Member



Thanks!

Boy this spider is a bit of a dork.

When I wrote a VB program to check some shipping numbers against the FedEx web site for a job, I actually checked the return status.

It continues to immolate itself against my site, creating an endless stream of 403 errors at what appear to be random intervals of one to 15 minutes.
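That kind of return-status checking can be sketched as a small decision function. The set of statuses treated as permanent failures here is an assumption for illustration, not a standard list:

```python
# Sketch of the status handling a well-behaved bot should do: stop on
# permanent errors like 403 instead of retrying forever.
def next_action(status: int, attempt: int, max_retries: int = 3) -> str:
    """Decide what a polite crawler does after seeing a response status."""
    if 200 <= status < 300:
        return "done"
    if status in (401, 403, 404, 410):
        return "give_up"             # permanent: retrying only hammers the server
    if attempt + 1 < max_retries:
        return "retry_with_backoff"  # transient (e.g. 503): wait, then try again
    return "give_up"

print(next_action(403, 0))  # give_up (exactly what the Columbia bot fails to do)
print(next_action(503, 0))  # retry_with_backoff
```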

(Kind of satisfying to see it stopped by the IP deny :)

So far it hasn't come from anything other than the three IPs in the Columbia U range.

wilderness

2:14 am on Jun 21, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Maybe they just like your sites ;)

The Class B is NOT included in my 128.* denies; however, I wouldn't hesitate for a New York second to add them in ;)

Don

Megaclinium

9:20 pm on Jul 4, 2008 (gmt 0)

10+ Year Member



I think they must like my site. I have endless media files, which it was going after.

Update: I notified Columbia's abuse email. A nice person said they would notify the machine's owner that the misbehaving robot was immolating itself against endless 403s.

End result: machine still hitting my site. Oh well.
Does anyone else have any stories about asking sysops to stop?

jdMorgan

9:27 pm on Jul 4, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I'd expect no action until after this holiday weekend... Things on the U.S. Web are a bit slow today.

Jim

Megaclinium

9:58 pm on Jul 4, 2008 (gmt 0)

10+ Year Member



This was a week or so ago that I asked Columbia net ops to 'stop the madness'.
Apparently they don't care that an unidentified bot is running on their network.
Maybe it's some kind of comp-sci project. I'd think a comp-sci major would be smart enough to check the status of returned pages and stop hammering sites.

I have the same thing happening, to a smaller degree, from other addresses. I saw it scraping my pages months ago and 403'd it. Now, twice a day, it tries for two media root pages, which just reinforces my idea that it's a bot. It would probably start scraping again if I removed the block.

jdMorgan

10:17 pm on Jul 4, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



If it ignores 403 responses, and your 403 page is in the least bit meaningful (to a human), then you can be sure it's a bot.
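One way to confirm from the logs that a bot ignores 403s is to count repeated 403 responses per client IP. A rough sketch, assuming Apache "combined" log format; the sample lines and IPs below are invented for illustration:

```python
import re
from collections import Counter

# Matches the client IP (first field) and the 3-digit status code that
# follows the closing quote of the request line in a combined-format log.
LINE = re.compile(r'^(\S+) .*?" (\d{3}) ')

def repeat_403s(log: str) -> Counter:
    """Return a Counter of 403 hits keyed by client IP."""
    hits = Counter()
    for line in log.splitlines():
        m = LINE.match(line)
        if m and m.group(2) == "403":
            hits[m.group(1)] += 1
    return hits

LOG = (
    '128.59.21.40 - - [04/Jul/2008:21:00:01 +0000] "GET /media/a.jpg HTTP/1.1" 403 214 "-" "Mozilla/5.0 (Ubuntu-edgy)"\n'
    '128.59.21.40 - - [04/Jul/2008:21:07:44 +0000] "GET /media/b.jpg HTTP/1.1" 403 214 "-" "Mozilla/5.0 (Ubuntu-edgy)"\n'
    '66.249.66.1 - - [04/Jul/2008:21:08:02 +0000] "GET / HTTP/1.1" 200 5120 "-" "Googlebot/2.1"\n'
)
print(repeat_403s(LOG)["128.59.21.40"])  # 2
```

A human who hits a meaningful 403 page stops; a counter that keeps climbing for one IP is a bot.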

BTW, "Ubuntu" is the Linux distribution, and "Edgy" is the release name. It's a "package" you can get on a CD with the Linux OS, Firefox, and other stuff. The following release was "Feisty Fawn," IIRC. So, this 'bot is spoofing a Linux distro's browser, and it ignores 403 responses.

Don't be surprised that it does not follow HTTP protocols -- These things are written (often hurriedly) by students (i.e. young and inexperienced) based only on "book learning" (again, not experience), and rarely reviewed in detail by anyone we might call an "expert." Often, the spider itself is not the "main goal," but rather a means to gather data for the project. Therefore, the focus is only on making the spider "good enough to get the job done."

Jim

Megaclinium

11:17 pm on Jul 4, 2008 (gmt 0)

10+ Year Member



Wow! Thanks!

I knew a bit about Linux, but not what 'Edgy' was.
I was already sure it was a bot, but I don't think spam scrapers would grab all my media rather than just the HTML, so I figured it's a project of some type. Still not enough to let it through, given its rudeness.

I get all kinds of bots hitting my site. I've finally set things up to write a separate file from the raw logs that skips all known bots, so I can see info about the actual humans visiting my site. Like incredibill's rant says, it seems more and more like the web is a whole ecosystem of bots spidering each other, with few people actually visiting the websites.
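That log-splitting idea can be sketched simply: keep a hand-maintained list of known bot UA substrings and route each raw log line to a "humans" or "bots" file. The signature list and sample lines here are illustrative examples, not a definitive list:

```python
# Hypothetical bot signatures; a real list would be much longer and
# maintained by hand as new bots show up in the logs.
KNOWN_BOT_UAS = ("Googlebot", "Slurp", "msnbot", "Ubuntu-edgy", "email harvester")

def is_bot(line: str) -> bool:
    """True when the log line mentions any known bot signature."""
    return any(sig in line for sig in KNOWN_BOT_UAS)

def split_log(lines):
    """Separate raw log lines into (humans, bots)."""
    humans, bots = [], []
    for line in lines:
        (bots if is_bot(line) else humans).append(line)
    return humans, bots

sample = [
    '10.0.0.1 - - [05/Jul/2008:10:00:00 +0000] "GET / HTTP/1.1" 200 512 "-" "Mozilla/5.0 (Windows; U) Firefox/2.0.0.11"',
    '128.59.21.40 - - [05/Jul/2008:10:01:00 +0000] "GET /media/a.jpg HTTP/1.1" 403 214 "-" "Mozilla/5.0 (Ubuntu-edgy)"',
]
humans, bots = split_log(sample)
print(len(humans), len(bots))  # 1 1
```

Substring matching on the whole line is crude but cheap; matching only the UA field would avoid false positives from referrer strings.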

I just recently got screen scrapers that pulled thumbnails, HTML scrapers (probably looking for emails), and one that was honest enough to call itself an 'email harvester'; honesty wasn't enough to stop me from 403'ing it, though.

jdMorgan

3:49 am on Jul 5, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



> I get all kinds of bots hitting my site.

As do we all... :)

Welcome to The 403 Club -- The banquet starts at 8:00 PM, open bar.

Jim