literate robots?

lucy24

1:56 am on Apr 13, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Am I wrong to assume that any and all weird behavior can be blamed on a robot that's up to no good? If I am wrong, someone will kick this question to the right place.

Question: What kind of robot would attempt to crawl a mailto: link? The answers that come to mind are "a very stupid one" and "one that doesn't speak English" (the unknown visitor is from Romania), but possibly I have overlooked something. I don't mean reading an html page and trawling for addresses; I mean issuing a "GET" for the mailto: itself, just as you would for a page or image.

My site is 95% personal and 5% very, very, very specialized, so these are genuinely consecutive lines from the log, at five-second intervals. (The first visited page is one of the few that gets bona fide visits from real humans, though there's usually a referrer.) The parts I've shown as {one} and {two} are two different pairs of numbers.

79.112.{one} - - [09/Apr/2011:05:44:30 -0700] "GET /games/ HTTP/1.0" 200 11167 "-" "Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9b5) Gecko/2008032620 Firefox/3.0b5"

79.112.{two} - - [09/Apr/2011:05:44:35 -0700] "GET /games/../index.html HTTP/1.0" 200 3833 "-" "{same}"

79.112.{one} - - [09/Apr/2011:05:44:40 -0700] "GET /games/../mailto:webmaster@{mydomain}.com HTTP/1.0" 404 1496 "-" "{same}"

The /../ elements are exactly as in the raw logs. I had to try it in a browser to verify that a human would get the same responses: item 2 gets the main Index page (but how did they know it's called index.html? did they deduce it from the name of the actual page in item 1?), item 3 gets the 404 message.

The user-agent is "not a known robot", but robots can hide behind anything, can't they? I tend to associate HTTP/1.0 (as opposed to 1.1) with elderly robots, but then, Romania isn't really at the technological cutting edge.
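For what it's worth, the pattern is consistent with a crawler that glues every href onto the current page's directory as a plain string, never checking the scheme and never collapsing ".." segments. A minimal Python sketch of that guess (the paths and address here are illustrative, not from the actual logs):

```python
from urllib.parse import urljoin

def naive_join(page_path, href):
    # A badly written crawler: take the directory of the current page
    # and concatenate the href onto it, with no scheme check and no
    # normalization of ".." segments.
    directory = page_path.rsplit("/", 1)[0] + "/"
    return directory + href

# Page fetched as /games/../index.html, containing
# <a href="mailto:webmaster@example.com">
print(naive_join("/games/../index.html", "mailto:webmaster@example.com"))
# -> /games/../mailto:webmaster@example.com

# A correct client resolves URLs properly and never turns a mailto:
# link into an HTTP path at all.
print(urljoin("http://example.com/games/../index.html",
              "mailto:webmaster@example.com"))
# -> mailto:webmaster@example.com
```

That would explain both the un-normalized `/games/../index.html` request and the mailto: GET that followed it.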

SteveWh

11:17 am on Apr 13, 2011 (gmt 0)




any and all weird behavior can be blamed on a robot that's up to no good?

No, that's too sweeping a generalization, but this sounds like just a stupid robot (i.e. a badly written crawling program) that extracts everything that looks like a link from a page and then sends a request for it.

dstiles

10:14 pm on Apr 13, 2011 (gmt 0)




Quite likely the bot is looking for email addresses to spam. A lot of this still goes on, some bots declaring who they are and others hiding behind "real" UAs.

If you can, obfuscate the email address. The simplest technique is to write "me [at] example.com" (without the quotes) and let the visitor work out the formula.

Probably a bot in any case, as Firefox is now up to version 3.6 and anything earlier is either fake or hackable.
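That substitution is easy to script. A minimal Python sketch (the address and the [at]/[dot] tokens are illustrative; this only defeats harvesters that match literal addresses):

```python
def obfuscate(address):
    # Swap the characters a harvester's regex keys on for tokens a
    # human can reverse at a glance.
    user, _, domain = address.partition("@")
    return "{} [at] {}".format(user, domain.replace(".", " [dot] "))

print(obfuscate("me@example.com"))
# -> me [at] example [dot] com
```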

lucy24

5:10 am on Apr 14, 2011 (gmt 0)




Here's the odd part. (Sorry, didn't think of this until after I posted.) The first page in the series includes several mailto: links, all leading to my regular address. The second page, the overall index page, has two mailto: links. One is the same address; the other is webmaster@. Only the webmaster one got the GET business.

So now it looks like a bizarrely convoluted way of finding out whether a "webmaster@" actually exists. I can't imagine why this is preferable to the normal method, which is to assume it's a valid address unless you're given hard evidence to the contrary.* Nor do I entirely understand why they had to waste time in a subdirectory; did they expect a "webmaster@subdirectory/maindirectory/domain.com"?


* Like, f'rinstance, sending in a bad-link report and having the e-mail bounced back as "no such address", which to me suggests that the webmaster is either incompetent or suffers from acute paranoia. In my book, "webmaster" is a mandatory address, like "abuse" for e-mail hosts. But that's a different thread.

keyplyr

7:47 am on Apr 14, 2011 (gmt 0)




IMO you're reading too much into this. In my experience it's just a low-level bot following everything, probably indiscriminately.

dstiles

10:09 pm on Apr 14, 2011 (gmt 0)




The webmaster@ address is not "mandatory". If you have an email account AT THE DOMAIN then abuse@ and postmaster@ are mandatory.

lucy24

1:28 am on Apr 15, 2011 (gmt 0)




dstiles, I meant de facto mandatory, like remembering your spouse's birthday even though there's no law that says you have to. If I find a typo or bad link in a site, I'm not going to hunt for a "contact us" address; I just write to "webmaster".

keyplyr, I know there will come a time when I react to everything new with
"(yawn) Oh, look, my old pal Stinkybot has been to the theatrical costumers again. Lessee which door he tries this time"
or
"(ho-hum) gwt Crawl Errors has achieved quadruple recursion by attempting to visit a no-longer-existent page linked only from a second nonexistent page et cetera back to a page that has not existed since August 1997"

but for now I'm averaging about one head-scratcher a day, like "What is it about this specific page that caught the (probable) robot's attention?" or "Why on earth would a human want to look at my (wholly nonexistent) crossdomain.xml file?"* or "Why do they think I have a file named 'c99.php'** of all things?"

And that's just the new and unfamiliar visitors. Once they've been 403'd at the gate, I don't care what they wanted to do.


* I say this with a straight face even though I have been known to look at other sites' robots.txt. Good source of information about robots I never knew existed.
** Nobody panic. I didn't recognize the name but my host did, and stopped them at the door with a quick 503.

dstiles

7:52 pm on Apr 15, 2011 (gmt 0)




I NEVER write to webmaster: it probably does not exist in most cases. If a site does not give an email address I enter the details into a form. If there is also no form I never go back to the site again.

As to spouses' birthdays: I often forget; it's the same day as our wedding anniversary, which I also forget. Bad memory. :)

What you're getting with the php hits is attempts at gaming or scraping well-known site templates.

I really recommend blocking as many server farms as you can find IP ranges for. They run into thousands and are often supplemented by botnets.

On top of that, find a list of bad UAs and discover other ways of deciding whether a good-looking UA is actually bad, then block what you don't like.
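Both halves of that advice can be done in .htaccess. A sketch in Apache 2.2 syntax (the same "Deny from" style used elsewhere in this thread); the IP ranges are the ones mentioned in these posts, and the UA patterns are illustrative examples, not a vetted list:

```apache
# Flag suspect user-agents (patterns are examples only).
SetEnvIfNoCase User-Agent "Python-urllib" bad_bot
SetEnvIfNoCase User-Agent "^Mozilla/4\.0$" bad_bot

# Blacklist: allow everyone, then deny the flagged UAs and the
# server-farm ranges discussed in this thread.
Order Allow,Deny
Allow from all
Deny from env=bad_bot
Deny from 65.208.151.0/24
Deny from 79.112.0.0/16
```

With `Order Allow,Deny`, any matching Deny wins, so a request from a listed range is refused even though `Allow from all` is present.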

enigma1

5:00 pm on Apr 22, 2011 (gmt 0)




If you have an email account AT THE DOMAIN then abuse@ and postmaster@ are mandatory.

Where did you read this? Or is it just some organization that collects info about it? The whois record is supposed to have a valid contact address for a domain. Otherwise, blindly sending emails to non-existent accounts can be treated as spam.

And if a host has the bright idea to enforce what email accounts I should create and maintain, I change hosts.

dstiles

7:06 pm on Apr 22, 2011 (gmt 0)




Look at the RFCs for mail. Postmaster is mandatory IF you have an email account (eg mail server) at the domain. Abuse is highly recommended and its absence (assuming no other is publicly declared) results in a black mark at rfc-ignorant. Mail server setup testing sites also look for abuse.

The email address for domains is completely separate from the mail requirements. I think those were originally intended on COM/NET/ORG as scrapable/spammable addresses. :)

lucy24

2:48 am on Apr 26, 2011 (gmt 0)




Postscript: They're ba-ack!

Musta been in a hurry, because they didn't have time to redial. Or maybe the Romanian phone service was having a good day and didn't disconnect them in mid-session. Same UA. The first file is linked from the page they visited back on the 9th.

79.112.{three} - - [25/Apr/2011:15:14:15 -0700] "GET /games/{my}Downloads.html HTTP/1.0" 200 11797 "-" "Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9b5) Gecko/2008032620 Firefox/3.0b5"

79.112.{three} - - [25/Apr/2011:15:15:07 -0700] "GET /games/downloads/{nameofgame}Patch.sit HTTP/1.0" 200 86241 "-" "Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9b5) Gecko/2008032620 Firefox/3.0b5"

I love these people. I wonder what they're planning to do with a StuffIt file containing a patch whose sole function is to update a game that was made for, and only runs in, the Classic Mac OS?

* * *

Anyway, it was a nice break from dealing with my first known visit from Kintiskton-- I looked them up here, so nothing new to add-- during which they scooped up a total of 606 files including 79 html pages, but couldn't find time to have a look at robots.txt. It would have saved them the trouble of downloading one image.

I'm lying about the 79 pages. It was really 57, of which 22 were judged worthy of a second visit after a ten-minute break to change UAs and switch from HTTP 1.1 to 1.0. (In that order.) Not just the pages, either; they grabbed all associated files again. You never know when a graphic last updated in 2007 might change. One stylesheet was picked up fourteen separate times-- enough to send me scurrying to the raw logs to make sure 304'ing hadn't somehow been turned off.

Did anyone ever figure out what they're up to? In the meantime, I think it calls for an "I don't like your face" lockout. Thoughtful of them to cling so firmly to 65.208.151.112/29.
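Spotting that kind of re-fetching in the raw logs is easy to automate. A minimal Python sketch, assuming the standard combined log format shown in the excerpts above (the sample lines are made up):

```python
import re
from collections import Counter

# Combined-log-format pattern, matching the excerpts quoted in this thread.
LOG_RE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) \S+ "[^"]*" "(?P<ua>[^"]*)"'
)

def repeat_full_fetches(lines):
    """Count (ip, path) pairs that were served a full 200 more than once,
    i.e. cases where a conditional request and a 304 would be expected."""
    hits = Counter()
    for line in lines:
        m = LOG_RE.match(line)
        if m and m.group("status") == "200":
            hits[(m.group("ip"), m.group("path"))] += 1
    return {k: n for k, n in hits.items() if n > 1}

sample = [
    '10.0.0.1 - - [25/Apr/2011:15:14:15 -0700] "GET /style.css HTTP/1.1" 200 900 "-" "bot"',
    '10.0.0.1 - - [25/Apr/2011:15:24:15 -0700] "GET /style.css HTTP/1.0" 200 900 "-" "bot"',
    '10.0.0.1 - - [25/Apr/2011:15:25:15 -0700] "GET /style.css HTTP/1.0" 304 - "-" "bot"',
]
print(repeat_full_fetches(sample))
# -> {('10.0.0.1', '/style.css'): 2}
```

A well-behaved client would send If-Modified-Since on the repeat visits and collect 304s instead of full 200s.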

dstiles

9:16 pm on Apr 26, 2011 (gmt 0)




No idea what Kintiskton are doing. I've had 65.208.151/24 blocked for years.

What the Romanians are doing is anyone's guess but if it's a problem and you don't expect trade from there then block 79.112/16 - or even 79.112/13.

lucy24

1:08 am on Apr 27, 2011 (gmt 0)




Chapter Three.

:: sigh ::

About 12 hours later, someone completely unrelated downloaded the same .sit file. More accurately: they downloaded it, switched UAs,* and then did a partial download again. Quick check tells me that, #1, the identical IP was locked out a month ago for the identical behavior with a couple of other downloads that live in the same directory and are reached from the same page, and, #2, the download was not preceded by visits to either robots.txt or the originating html page.

OK, so "Deny from" in .htaccess does not work with direct downloads. This is bad. It would be much worse if I had anything sensitive among my downloads; luckily I don't.

After poring over Apache documentation to see if there's some way to redirect the sucker to 127.0.0.1, I cave in and look them up. It's bitdefender, who are apparently With The Good Guys. They just happen to behave exactly like Very Bad Guys.

Does this mean that every ### time someone does something that looks hinky, you have to investigate them individually? Is there a special category of IPs That Are Above The Law? Has someone put together a convenient list somewhere?

:: mutter, grumble ::


* If anyone is curious, the relevant UAs each time were Python-urllib/2.5 followed by "Mozilla/4.0" (with \escaped\ quotes). I didn't look them up, but I believe both translate as "I, Robot".

caribguy

1:40 am on Apr 27, 2011 (gmt 0)




Welcome to the :: mutter, grumble :: stage. Some of us have moved on to the 'shoot first, ask questions later' stage and whitelist only what can be confirmed to be a human visitor or legitimate bot.

You may find that to be more productive in the long run...
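A whitelist of that sort might look like this in .htaccess, again in Apache 2.2 syntax. Every range and UA pattern below is a placeholder, and a UA match alone proves nothing, since any bot can claim to be Googlebot; real verification needs an out-of-band check such as reverse DNS:

```apache
# Default-deny: with Order Deny,Allow, an Allow overrides the Deny.
Order Deny,Allow
Deny from all

# Ranges you have confirmed as human or legitimate (placeholder range).
Allow from 192.0.2.0/24

# Bots you have verified by reverse DNS, flagged via their UA string.
SetEnvIfNoCase User-Agent "Googlebot" good_bot
Allow from env=good_bot
```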