lucy24 - 8:30 am on Feb 11, 2013 (gmt 0)
At Home with the Robots: 2013 Edition
Yup, it's that time again [webmasterworld.com]. It wasn't a perfect time for bot tracking: a mid-month server move means that I ended up losing about 24 hours' worth of logs (in bits and pieces over two or three days). But we do what we can.
General disclaimer: Some of the User-Agent strings may not be exactly right. Some extra spaces and single quotes are an artifact of my log-wrangling routines.
Cut to the Chase:
If you don't feel like reading the whole thing, these are some of the new IP blocks I added after looking over January's logs. Many are associated with robots that were already blocked for other reasons, like UA or behavior, but no harm in blocking both ways.
22.214.171.124 blocked as 126.96.36.199/18 (hosting)
188.8.131.52/18 (yet another Fairpoint-within-verizon range)
184.108.40.206 (blocked as 220.127.116.11/19) One of the nasties; the UAs included "start.exe".
18.104.22.168/28 (Russian-- I hate blocking this small, but there are humans in the neighborhood)
22.214.171.124/19 hosting (malign robot at 126.96.36.199)
1. There are some things money can't buy. For everything else, there's Google.
Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
Mozilla/5.0 (iPhone; U; CPU iPhone OS 4_1 like Mac OS X; en-us) AppleWebKit/532.9 (KHTML, like Gecko) Version/4.0.5 Mobile/8B117 Safari/6531.22.7 (compatible; Googlebot-Mobile/2.1; +http://www.google.com/ bot.html)
SAMSUNG-SGH-E250/1.0 Profile/ MIDP-2.0 Configuration/ CLDC-1.1 UP.Browser/188.8.131.52.c.1.101 (GUI) MMP/2.0 (compatible; Googlebot-Mobile/2.1; +http://www.google.com/ bot.html)
DoCoMo/2.0 N905i(c100;TB;W24H16) (compatible; Googlebot-Mobile/2.1; +http://www.google.com/ bot.html)
The last two are new to me, though apparently not to google. Last year at this time I saw only the iPhone version. They are still using OS 4_1 for their robot, though the real iPhone is now up to 6_1. The mobile UAs never ask for anything but pages; the vanilla googlebot gets everything; the imagebot gets images-- and, on one isolated occasion, the favicon.
Speaking of which: the google family remains somewhat random about robots.txt. Generally they pick it up every day or two. Sometimes it can be well over 24 hours. Can't say I especially care for this. I'll bet someone has a ready-made php script that will unconditionally block any robot that has not asked for robots.txt within the past 24 hours. But they do seem to obey what they read-- apart from claiming not to understand the Crawl-delay directive, so you have to go to gwt if you don't want them wolfing down multiple files in a single second. If you are in the habit of serving custom robots.txt files based on the asker's current UA, note that all requests for robots.txt come from the googlebot by that name.
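The "block anything that hasn't asked for robots.txt in the past 24 hours" idea can be sketched in a few lines. This is Python rather than php, and the class and method names (`RobotsTxtTracker`, `record`, `should_block`) are made up for illustration; how you key a client (IP, UA, or both) is an assumption left to the caller.

```python
from datetime import datetime, timedelta


class RobotsTxtTracker:
    """Remember when each client last fetched robots.txt (hypothetical helper)."""

    def __init__(self, max_age=timedelta(hours=24)):
        self.max_age = max_age
        # client key (e.g. IP or UA string) -> time of last robots.txt request
        self.last_fetch = {}

    def record(self, client, path, when):
        # Only a robots.txt request resets the clock for this client.
        if path == "/robots.txt":
            self.last_fetch[client] = when

    def should_block(self, client, now):
        # Block if the client never asked, or asked longer ago than max_age.
        seen = self.last_fetch.get(client)
        return seen is None or now - seen > self.max_age
```

Feed `record` every logged request; call `should_block` before serving a page. Applied literally, a rule like this would also lock out any robot that goes more than 24 hours between robots.txt pickups, so the threshold is a judgment call.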
Numbers: Slightly more than twice as many visits as last year. My record-keeping is admittedly slapdash, but I don't think I have twice as many files. About a quarter of the change is due to increased Googlebot-Mobile activity.
Referer: Conversely, there are not as many Googlebots arriving with a referer-- not that the numbers were ever especially huge. Here's a quirk I didn't notice last year, though it happened then too. Sometimes a request with a referer comes soon after a visit to the page it gives as referer. Other times there's no connection. And the original visit may not have been from the googlebot at all; sometimes it's one of the mobiles.
If I didn't have a <files> exemption for favicon.ico, the faviconbot would never get in at all. The exemption wasn't made for google's benefit; it's to help flag humans who may have been locked out by mistake. (Some robots will pick up the stylesheet that goes with the 403 page. I know of only one that also gets the favicon.)
Google Preview and Google Translate:
I didn't count them with the robots this year, except for some special cases. Preview persists in trying to get at piwik.js, even though they cannot possibly not know it's an analytics program.
Further Unanswerable Question: Since Preview does not play sounds, either upfront or on request, and since the .midi extension can't be anything but a sound file ... why does Google Preview persist in asking for .midi files? I don't know whether this is a new habit or an ongoing one. Last year I didn't have any sound files worth mentioning-- and the ones I did have were formatted as <a href> downloads. Now I've got a slew of them embedded in one subdirectory.
2. We Try Harder.
Once again, Bing/msn is the single most common robot if you count all requests. But this year it remains in the #1 spot even if you don't count robots.txt. There have been recent posts commenting on increased bingbot activity. It's definitely evident in my case.
Key change from last year: After looking at the behavior of the assorted bing/msn entities, I blocked the plainclothes bingbot. So you won't find it here.
IP ranges: 65.52, 157.55, 207.46
plus 184.108.40.206 for BingSiteAuth (BWT) only
UA: Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)
The bingbot retains a hearty appetite for robots.txt, though it has dropped back a little from last year's morbid extremes. In the course of the month, about 2/3 of all robots.txt requests came from the bingbot-- down from 80% last year. One time it went seven hours without asking for robots.txt. Conversely, I don't see any place where the bingbot-- by that name-- asked more than two times in a row. But overall, about 1/3 of their requests were still for robots.txt.
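The two fractions above measure different things: one is the bingbot's share of all robots.txt requests, the other is the share of the bingbot's own requests that were for robots.txt. A rough sketch of the arithmetic, assuming the logs have already been boiled down to (robot label, path) pairs; the function name and the sample numbers are invented for illustration.

```python
from collections import Counter


def robots_txt_shares(requests, bot):
    """requests: iterable of (robot_label, path) pairs parsed from logs.

    Returns (bot's share of all robots.txt requests,
             share of bot's own requests that were for robots.txt).
    """
    robots_by_ua = Counter()
    total_by_ua = Counter()
    for ua, path in requests:
        total_by_ua[ua] += 1
        if path == "/robots.txt":
            robots_by_ua[ua] += 1
    all_robots = sum(robots_by_ua.values())
    share_of_robots = robots_by_ua[bot] / all_robots if all_robots else 0.0
    share_of_own = robots_by_ua[bot] / total_by_ua[bot] if total_by_ua[bot] else 0.0
    return share_of_robots, share_of_own


# Made-up sample: 2 of the 3 robots.txt requests are bingbot's,
# and 2 of bingbot's 6 requests are for robots.txt.
sample = (
    [("bingbot", "/robots.txt")] * 2
    + [("bingbot", "/page.html")] * 4
    + [("googlebot", "/robots.txt")]
)
of_all, of_own = robots_txt_shares(sample, "bingbot")
```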
IP ranges: 65.66, 131.253, 207.46
UA: msnbot/2.0b (+http://search.msn.com/msnbot.htm)
You can tell how busy bing was: they had to bring the vanilla msnbot out of retirement to pick up some of the overflow. Most of the time it just asked for robots.txt followed by the sitemap, but a few times around midmonth it even collected a page or two.
IP: 65.66, 131.253, 207.46
UA: msnbot-media/1.1 (+http://search.msn.com/msnbot.htm)
This robot went through a striking sea change early last fall. Up to then it had an absolutely consistent pattern: robots.txt, one image, and then the page the image belongs to. Things started to go haywire on 21 August (I went back and checked logs). For the next week or so it ran at about four robots.txt requests to every one image. Its last old-style visit-- robots.txt, image, page-- came on the 29th. Then a few more robots.txt, one image ... and finally, from the 2nd to the 14th of September, came 64 (sixty-four) consecutive requests for robots.txt. After that it settled into its new pattern: robots.txt, image, two by two in strict alternation. No pages.
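Spotting a streak like those 64 in a row is easy to automate once you have one robot's requested paths in chronological order. A minimal sketch; the function name and sample paths are hypothetical.

```python
from itertools import groupby


def longest_robots_run(paths):
    """Length of the longest streak of consecutive /robots.txt requests."""
    longest = 0
    # groupby collapses neighbouring requests into runs of robots.txt / not robots.txt
    for is_robots, group in groupby(paths, key=lambda p: p == "/robots.txt"):
        if is_robots:
            longest = max(longest, sum(1 for _ in group))
    return longest


# Made-up request sequence: one old-style visit, then a long streak, then an image.
paths = (
    ["/robots.txt", "/img/a.gif", "/page.html"]
    + ["/robots.txt"] * 64
    + ["/img/b.gif"]
)
```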
Everything I said about Google Preview applies equally to bing. They didn't happen to visit any gallery pages this month, so I don't know if they would have gone for the .midi files.
3. Can you say 'rat' in Russian?
IP (Russia): 220.127.116.11, 18.104.22.168, 22.214.171.124
IP (US): 126.96.36.199
IP (favicon only): 188.8.131.52
Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)
Mozilla/5.0 (compatible; YandexImages/3.0; +http://yandex.com/bots)
and two new ones:
Mozilla/5.0 (compatible; YandexFavicons/1.0; +http://yandex.com/bots)
Mozilla/5.0 (compatible; YandexBot/3.0; MirrorDetector; +http://yandex.com/bots)
Obviously those are not Yandex's complete IP ranges, but they like to pick an exact address and stick with it. So these are the only ones I've seen this month. Current visits are divided about half-and-half between the Russian and US IPs. In a complete flipflop from last year, the 95. range is used only by the YandexBot, while 178. is only for images. The US range gets both.
I don't know what they do with the favicon. No skin off my nose. Their information page [help.yandex.com] sheds no light on the MirrorDetector. Pretty droll in my case, since my only mirrored pages are in a language they don't support. (Their WMT says so; there's a section that lists crawled-but-not-indexed pages and says why for each individual one. Don't you wish G and B had the same feature?)
Yandex's imagebot seems to have been on vacation-- possibly it runs on the Orthodox church calendar, which currently lags 13 days behind the Gregorian-- and didn't show up until the 18th. But it quickly made up for lost time, mainly by scrambling around and confirming that most of my pictures hadn't changed since last time.
IP: 184.108.40.206, .17
UA: SeznamBot/3.0 (+http://fulltext.sblog.cz/)
Czechs must spend a lot of time online. I can't think of another country this size whose search engine is so visible. Half of the time they just pick up robots.txt and the sitemap, but sometimes they'll get a few pages as well. Like Google and Bing, they've got a preview, in their case called the screenshot-generator. And just like the big boys, it has to be told to stay the bleep out of piwik.
UA: DoCoMo/2.0 P900i(c100;TB;W24H11) (compatible; ichiro/mobile goo; +http://search.goo.ne.jp/option/use/sub4/sub4-1/)
With a name like that, could it come from anywhere but Japan? Here's the odd part: I was going to say that I first set eyes on them in March of 2012, though they didn't really become active until this past month. But processed logs say it ain't so. They paid a couple of long visits way back in April 2011, picking up several hundred HEADs of images. Only. Some months later they came back and behaved so innocuously that I've ignored them ever since. So right now I can't tell if they're pushing up their activity level, or they're one of those robots that operates by fits and starts.
IP: 220.127.116.11 and 18.104.22.168
UA: Mozilla/5.0 (compatible; Blekkobot; ScoutJet; +http://blekko.com/ about/ blekkobot)
As far as I know, these folks never set foot on my site until April 2012. It's still not a very heavy foot: robots.txt, front page, favicon, one specific inner page. (Always the same one.) The person who runs this robot reads, or used to read, the WebmasterWorld forums. That's all I know about it.
UA: TosCrawler/ Nutch-1.6 (http://www.toshiba.co.jp/rdc/about/crawl_info.htm; 'Rdc-crawler at ml dot toshiba dot co dot jp')
changed just a few days ago to
TosCrawler/Nutch-1.6 (http://www.toshiba.co.jp/rdc/about/crawl_info_en.htm; 'Rdc-crawler at ml dot toshiba dot co dot jp')
Another newcomer, as far as I can tell. First known sighting, September 2012; they didn't really become active until November. The info page says
The main goal of developing the crawler is to collect web pages for R&D related to natural language processing. Using the collected web pages, we extract new or unknown words, and we analyze statistical information such as word frequency. Utilizing this information, we develop highly accurate statistical machine translation systems, text-to-speech systems and so on.
I hope this is true, because it's something I would definitely support. Insert nasty cracks ad lib about caliber of linguistic information to be gleaned from study of any site with my name on it. They are also one of the very, very few places that answer "Is this your robot?" e-mails. Currently they seem to be absolutely in love with my site; it's especially noticeable because I haven't got around to ignoring them yet.
Yeti (Korea) and Baidu (Japan)
IP: 61.247.204 (Yeti), 119.63.19x (Baidu)
Yeti/1.0 (NHN Corp.; http://help.naver.com/robots/)
Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)
Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)
Not much activity this month, but I note them for appearances' sake. I wouldn't swear to the MSIE 7 one. It comes from 119.63.193 which does belong to Baidu (119.63.192-199) but their regular spider activity is from ..196.
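Taking the 119.63.192-199 range at face value, those eight consecutive third octets amount to a single /21, so a membership check is one line with Python's standard `ipaddress` module. The host octets in the usage lines are invented for illustration.

```python
import ipaddress

# 119.63.192 through 119.63.199 is eight consecutive /24s, i.e. one /21.
BAIDU_RANGE = ipaddress.ip_network("119.63.192.0/21")


def in_range(addr, network=BAIDU_RANGE):
    """True if the dotted-quad address falls inside the network."""
    return ipaddress.ip_address(addr) in network


# Hypothetical addresses: .193 and .196 are inside the range, .200 is just past it.
inside_193 = in_range("119.63.193.10")
inside_196 = in_range("119.63.196.1")
outside_200 = in_range("119.63.200.1")
```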
User-Agent: Mozilla/5.0 (compatible; MJ12bot/v1.4.1; http://www.majestic12.co.uk/bot.php?+)
No change from last year, when I said:
They refuse to have their own IP, so you can't tell if it's the real thing or a spoofer. They also seem to have a lot of trouble getting names right: constant directory-slash redirects alternating with top-level www redirects.
One time this month they ate a steady stream of what would have been /index.html redirects, except that they were operating from a blocked IP range.
UA: Mozilla/5.0 (compatible; Exabot/3.0; +http://www.exabot.com/go/robot)
Mozilla/5.0 (compatible; Exabot/3.0 (BiggerBetter); +http://www.exabot.com/go/robot)
The only reason I tolerate these guys is that they're simply too boring to block. At some point when I wasn't paying attention-- last December, it turns out-- they added the "BiggerBetter" UA. It must be working off an ancient shopping list, because its only requests to date have been for pages that ceased to exist some time in 2011. This sends it scurrying for the sitemap-- but it might have done this anyway.
IP: 22.214.171.124, ..78
UA: Mozilla/5.0 (compatible; YioopBot; +http://126.96.36.199/bot.php)
Like the BlekkoBot, this one is run by a WebmasterWorld reader. Last year it appeared only as a walk-on, with a single occurrence of
188.8.131.52 - - [15/Jan/2012:05:14:54 -0800] "GET /robots.txt HTTP/1.1" 206 517 "-" "Mozilla/5.0 (compatible; YioopBot +http://www.yioop.com/bot.php)"
Notice the 206? I don't think I have ever met any other robot that bothered with a 206 on robots.txt. The YioopBot's record for January 2013 is sixteen of them in a single calendar day. My preliminary guess was that it's running one of those scripts that says "If the file has changed, give it to me; otherwise just toss me the header." Further investigation suggests that this robot simply doesn't have a very big appetite. January was a narrowly focused month: apart from two 206's on the sitemap-- and one on the Panda Page for appearances' sake-- it wasn't interested in anything but robots.txt. But on a few visits in November and December it really extended itself, scooping up 206s right and left. If anyone knows more about this robot I would like to hear it; I took a closer look at those November-December hits and simply couldn't make head or tail of it.
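For reference, 206 is Partial Content, the status a server returns when the client sent a Range header. Counting these per robot is straightforward if the logs are in the usual Apache combined format, as the line quoted above appears to be. A sketch under that assumption; the regex and function name are mine, not anything the server provides.

```python
import re

# Assumed layout: Apache "combined" log format.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<proto>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<ua>[^"]*)"'
)


def count_partial_robots(lines, ua_token="YioopBot"):
    """Count 206 responses to /robots.txt for user agents containing ua_token."""
    hits = 0
    for line in lines:
        m = LOG_PATTERN.match(line)
        if not m:
            continue  # skip lines that are not in the assumed format
        if m["status"] == "206" and m["path"] == "/robots.txt" and ua_token in m["ua"]:
            hits += 1
    return hits


# The log line quoted in the post, used as-is.
sample_line = (
    '188.8.131.52 - - [15/Jan/2012:05:14:54 -0800] '
    '"GET /robots.txt HTTP/1.1" 206 517 "-" '
    '"Mozilla/5.0 (compatible; YioopBot +http://www.yioop.com/bot.php)"'
)
```

Run over a day's worth of lines, `count_partial_robots` would give the per-day figure directly.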
User-Agent: Mozilla/5.0 (compatible; CareerBot/1.1; +http://www.career-x.de/bot.html)
I think this one first showed up in July. Information page says:
Der CareerBot ist der Webcrawler von Career-X. Der CareerBot crawlt durch das Internet, um aktuelle Stellenangebote von Unternehmen zu finden.
Evidently my front page is enough to tell it that I am not an Unternehmen and it won't find any aktuelle Stellenangebote here, because that's all it has ever asked for. Yawn.
User-Agents: Mozilla/5.0 (X11; Ubuntu; Linux i686; rv:14.0; ips-agent) Gecko/20100101 Firefox/14.0.1
BlackBerry9000/184.108.40.206 Profile/ MIDP-2.0 Configuration/ CLDC-1.1 VendorID/102 ips-agent
These folks swing by once a month like clockwork. Generally they just pick up the top-level directory pages. I don't know what they're doing, but it's no skin off my nose.
UA: TurnitinBot/2.1 (http://www.turnitin.com/robot/crawlerinfo.html)
One of those plagiarism checkers, I think. I've only seen them about six times in the past year, generally picking up a dozen or so pages. Shrug.
User-Agent: ia_archiver (+http://www.alexa.com/site/help/webmasters; firstname.lastname@example.org)
Some people hate this robot on principle. Some like it on the same principle. I'm not looking to start a fight --not here and now, anyway-- so I just note its continued existence.
User-Agent: Mozilla/4.0 (compatible;)
Requests: images only
Someone once explained to me what these critters do. They're somewhere on the cusp between robot and human. The one I see most often comes from one of the Keewaytinook Okimakanak (chiefs council) ranges, meaning satellite internet at the far end of Ontario. As a further oddity, most of their requests are for an image file that hasn't been used since October-- that is, the file itself exists, but it's no longer used by its formerly requesting page. One of these days I will ask them what they plan to do with all those copies of my administrative gif. There is no shadow of a doubt that the original visitor was human.
The rest of the rest:
Finally there are the visitors who must have some pretensions to robotitude, because they asked for robots.txt. Most of them went on to ask for the front page and nothing more. If that's all they're going to ask for, I don't even particularly care if they got robots.txt or not. The group includes some that were more visible last year: picsearch, findlinks, and 80legs among others. This year they just popped their heads in for a moment. Or possibly-- like the YioopBot-- they did their major pickups in some other month, so I simply didn't notice them.
To be continued...