How do you translate "preview"?


lucy24 - 1:26 am on Feb 2, 2012 (gmt 0)


I wouldn't normally post about a one-off, but this one is weird enough to catch my attention.

74.125.18.18 - - [31/Jan/2012:02:30:18 -0800] "GET / HTTP/1.1" 200 1997 "-" "Mozilla/5.0 (X11; Linux x86_64; rv:2.0.1) Gecko/20100101 Firefox/4.0.1,gzip(gfe) (via translate.google.com)"

And that was all she wrote.

For comparison purposes, here's a normal Translate request:

74.125.42.83 - - [23/Jan/2012:03:14:46 -0800] "GET /fonts/custom_greek.html HTTP/1.1" 200 4813 "-" "Mozilla/5.0 (Windows; U; Windows NT 5.1; uk; rv:1.9.2.25) Gecko/20111212 Firefox/3.6.25,gzip(gfe) (via translate.google.com)"
aaa.bbb.ccc.ddd - - [23/Jan/2012:03:17:00 -0800] "GET /fonts/fontstyles.css HTTP/1.0" 200 4165 "http://translate.googleusercontent.com/translate_c?hl=uk&langpair=en%7Cuk&rurl=translate.google.com.ua&u=http://www.example.com/fonts/custom_greek.html&usg=ALkJrhhqlHR6C778FoOD_JaxNedqjrjr6g" "Mozilla/5.0 (Windows; U; Windows NT 5.1; uk; rv:1.9.2.25) Gecko/20111212 Firefox/3.6.25"

(et cetera)

That is: the page itself is requested by a Google IP, giving the human user's UA with ",gzip(gfe) (via translate.google.com)" appended (no leading space before "gzip"). All subsidiary files are requested by the human user's own IP and UA, without the "via translate" part. The referer for these files repeats all the translation information. The main page will have either no referer, if it came in via "translate", or a long search-type referer, if it came in via a SERP and "translate this page".
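For anyone who wants to pick these out of their own logs, here is a rough Python sketch of the pattern just described. It assumes Apache "combined" log format like the lines quoted above. The names LOG_LINE, TRANSLATE_SUFFIX and parse_line are made up for this illustration; this is not the regular expression I actually use.

```python
import re

# Apache "combined" log format, as in the sample lines quoted in this post.
LOG_LINE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<proto>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<ua>[^"]*)"'
)

# Suffix appended to the human user's UA, with no space before "gzip".
TRANSLATE_SUFFIX = ",gzip(gfe) (via translate.google.com)"


def parse_line(line):
    """Return a dict of log fields, or None if the line is not in combined format."""
    m = LOG_LINE.match(line)
    if m is None:
        return None
    rec = m.groupdict()
    # A page fetched through Translate carries the suffix on the end of the UA.
    rec["via_translate"] = rec["ua"].endswith(TRANSLATE_SUFFIX)
    # base_ua is the human user's UA with the suffix stripped, if present.
    if rec["via_translate"]:
        rec["base_ua"] = rec["ua"][: -len(TRANSLATE_SUFFIX)]
    else:
        rec["base_ua"] = rec["ua"]
    return rec
```

Feeding it the first "normal" line above should give a record with via_translate set and base_ua equal to the plain Firefox 3.6.25 string, which is the same UA the subsidiary-file requests carry.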

I see enough of these that I've got a log-wrangling Regular Expression to show me the human user's IP. This time, there wasn't one. The log lines that follow are unrelated. The requesting IP seems legit; I've never met that specific one before, but it's in the general Preview-and-Translate range. And speaking of Preview, have another look at that UA:

Mozilla/5.0 (X11; Linux x86_64; rv:2.0.1) Gecko/20100101 Firefox/4.0.1

If it looks vaguely familiar, it should:

Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/534.51 (KHTML, like Gecko; Google Web Preview) Chrome/12.0.742 Safari/534.51

Explanation 1 is the boring one: Some robot is vaguely spoofing the Google Preview UA and, for reasons best known to itself, picking up a page via Google Translate rather than grabbing it directly. Since there are no subsidiary files, I don't know what languages are involved, or, of course, the real IP. (You can translate from English to English. I checked.)
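If you want to hunt for lone fetches like this one in your own logs, the pairing described earlier (Google IP fetches the page, human IP fetches the subsidiary files with a translate referer) can be sketched roughly as follows. Again this assumes combined log format, and human_ips and translated_page are hypothetical names for illustration only. It matches on page path and UA and ignores timestamps, so treat it as a starting point rather than anything rigorous.

```python
import re
from urllib.parse import urlparse, parse_qs

# Only the fields needed here: IP, request path, referer, UA.
LOG_LINE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[[^\]]+\] "\S+ (?P<path>\S+) [^"]*" '
    r'\d{3} \S+ "(?P<referer>[^"]*)" "(?P<ua>[^"]*)"'
)
SUFFIX = ",gzip(gfe) (via translate.google.com)"


def translated_page(referer):
    """Path of the page named in a translate referer's u= parameter, else None."""
    parts = urlparse(referer)
    if not parts.netloc.endswith("translate.googleusercontent.com"):
        return None
    target = parse_qs(parts.query).get("u")
    return urlparse(target[0]).path if target else None


def human_ips(lines):
    """Map each Translate page fetch (path, base UA) to the IPs that followed up.

    An empty set means no subsidiary-file request was matched to that fetch.
    """
    pages = {}
    for line in lines:
        m = LOG_LINE.match(line)
        if m is None:
            continue
        ua, path = m["ua"], m["path"]
        if ua.endswith(SUFFIX):
            # Page fetched by the Google IP on the user's behalf.
            pages.setdefault((path, ua[: -len(SUFFIX)]), set())
            continue
        # Subsidiary file: same UA, referer names the translated page.
        page = translated_page(m["referer"])
        if page is not None and (page, ua) in pages:
            pages[(page, ua)].add(m["ip"])
    return pages
```

Any key that comes back with an empty set is a Translate fetch with no visible human behind it, which is exactly what the 31 Jan request looks like.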

Explanation 2 is the interesting one. Speculation welcome.


Thread source: http://www.webmasterworld.com/search_engine_spiders/4413345.htm
Brought to you by WebmasterWorld: http://www.webmasterworld.com