"Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.1.11) Gecko/20060601 Firefox/2.0.0.11 (Ubuntu-edgy)"
Any idea what or why?
I've 403'd them.
They haven't hit robots and are (were) simply bypassing my web pages and going direct to media files.
Kind of clever: they're doing it over a wide period, in random order, from randomly selected directories. But a direct GET to the content isn't particularly clever.
128.59.2x.xx is the address. Three different addresses in this Columbia block have bots.
I'm getting whacked by a spider with no ID in its UA; it resolves to Columbia U in NY.
Not exactly ;)
The UA contains multiple unique names which allow you to target it, either on the names themselves or on a combination of name and IP range.
An example of the latter:
# IF the UA contains Linux or Ubuntu and the visitor comes from 128.59.0-255.0-255, then deny access
RewriteCond %{HTTP_USER_AGENT} (Linux|Ubuntu)
RewriteCond %{REMOTE_ADDR} ^128\.59\.
RewriteRule .* - [F]
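The same two-condition check can be sketched outside Apache. Here's a minimal Python equivalent, with the UA pattern and the netblock taken from the rule above (both conditions must match before access is denied):

```python
import ipaddress
import re

# Netblock and UA pattern taken from the rewrite rule above.
COLUMBIA_NET = ipaddress.ip_network("128.59.0.0/16")
UA_PATTERN = re.compile(r"Linux|Ubuntu")

def should_deny(user_agent: str, remote_addr: str) -> bool:
    """True only when the UA mentions Linux/Ubuntu AND the IP falls in 128.59.0.0/16."""
    return bool(UA_PATTERN.search(user_agent)) and \
           ipaddress.ip_address(remote_addr) in COLUMBIA_NET
```

Combining the two conditions keeps ordinary Linux/Firefox visitors from elsewhere unaffected; only the offending netblock with the matching UA gets the 403.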
Boy this spider is a bit of a dork.
When I wrote a VB program to check some shipping numbers against the FedEx web site for a job, I actually checked the return status.
It continues to immolate itself against my site, creating an endless stream of 403 errors with apparently random timing, anywhere from a minute to 15 minutes apart.
(Kind of satisfying to see it stopped by the IP deny :)
So far it hasn't tried anything other than the three IPs in the Columbia U range.
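A well-behaved client checks the return status and gives up on a 403 instead of retrying forever. A minimal sketch of that decision logic (the status groupings and retry cap here are illustrative assumptions, not any particular bot's policy):

```python
def next_action(status: int, retries: int, max_retries: int = 3) -> str:
    """Decide what a polite client should do with an HTTP status code.

    2xx: proceed; 403/404/410: give up for good (don't hammer the server);
    5xx: retry a few times, then give up.
    """
    if 200 <= status < 300:
        return "proceed"
    if status in (403, 404, 410):
        return "give-up"
    if status >= 500 and retries < max_retries:
        return "retry"
    return "give-up"
```

A spider built this way would have stopped after its first 403 instead of immolating itself for weeks.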
Update: I notified Columbia's abuse email address. A nice person said they would notify the machine's owner that the broken robot was immolating itself with endless 403s.
End result: machine still hitting my site. Oh well.
Does anyone else have any stories about asking sysops to stop?
I have the same thing happening to a smaller degree from other addresses. I saw it scraping my pages months ago and 403'd it. Well, twice a day it still tries for two media root pages, just reinforcing my idea that it is a bot. It would probably start scraping again if I removed the block.
BTW, "Ubuntu" is the Linux distribution, and "Edgy" is the release name. This is a "package" that you can get on a CD with the Linux OS, Firefox, and other stuff. The following release was "Feisty Fawn" IIRC. So, this 'bot is spoofing a Linux distro, and it ignores 403 responses.
Don't be surprised that it does not follow HTTP protocols -- These things are written (often hurriedly) by students (i.e. young and inexperienced) based only on "book learning" (again, not experience), and rarely reviewed in detail by anyone we might call an "expert." Often, the spider itself is not the "main goal," but rather a means to gather data for the project. Therefore, the focus is only on making the spider "good enough to get the job done."
Jim
I knew a bit about Linux but not what 'Edgy' was.
I was already sure it was a bot, but I don't think spam scrapers would grab all my media rather than just the HTML, so I figured it's a project of some type. Still not enough to let it through, given its rudeness.
I get all kinds of bots hitting my site. I've finally set up a script to write a separate file from the raw logs that skips all known bots, so I can see info about actual humans visiting my site. Like incredibill's rant says, it's seeming more and more like bots spidering the web are a whole ecosystem of their own, and few actual people visit the websites.
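Splitting humans out of a raw combined-format access log can be sketched in a few lines. The marker list here is a hypothetical example, not a complete bot catalog, and the regex assumes the standard combined log format where the User-Agent is the last quoted field:

```python
import re

# Hypothetical substrings seen in known-bot User-Agents (extend from your own logs).
BOT_MARKERS = ("bot", "spider", "crawler", "harvester")

def is_bot(log_line: str) -> bool:
    """Crude check: does the UA field of a combined-format log line look like a bot?"""
    # Combined format ends with "referer" "user-agent"; grab the last quoted field.
    m = re.search(r'"[^"]*"\s+"(?P<ua>[^"]*)"\s*$', log_line)
    ua = (m.group("ua") if m else "").lower()
    return any(marker in ua for marker in BOT_MARKERS)

def human_lines(raw_log_lines):
    """Keep only lines whose UA doesn't match a known-bot marker."""
    return [line for line in raw_log_lines if not is_bot(line)]
```

Spiders spoofing browser UAs (like the Columbia one) slip through a filter like this, of course; those you have to catch by IP or behavior instead.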
I just recently got screen scrapers that pulled thumbnails, HTML scrapers (probably looking for emails), and one that was honest enough to identify itself as an 'email harvester'; not that that stopped me from 403'ing it.