Forum Library, Charter, Moderators: Ocean10000 & incrediBILL

Search Engine Spider and User Agent Identification Forum

How do you translate "preview"?

 1:26 am on Feb 2, 2012 (gmt 0)

I wouldn't normally post about a one-off, but this one is weird enough to catch my attention.

- - [31/Jan/2012:02:30:18 -0800] "GET / HTTP/1.1" 200 1997 "-" "Mozilla/5.0 (X11; Linux x86_64; rv:2.0.1) Gecko/20100101 Firefox/4.0.1,gzip(gfe) (via translate.google.com)"

And that was all she wrote.

For comparison purposes, here's a normal Translate request:

- - [23/Jan/2012:03:14:46 -0800] "GET /fonts/custom_greek.html HTTP/1.1" 200 4813 "-" "Mozilla/5.0 (Windows; U; Windows NT 5.1; uk; rv: Gecko/20111212 Firefox/3.6.25,gzip(gfe) (via translate.google.com)"
aaa.bbb.ccc.ddd - - [23/Jan/2012:03:17:00 -0800] "GET /fonts/fontstyles.css HTTP/1.0" 200 4165 "http://translate.googleusercontent.com/translate_c?hl=uk&langpair=en%7Cuk&rurl=translate.google.com.ua&u=http://www.example.com/fonts/custom_greek.html&usg=ALkJrhhqlHR6C778FoOD_JaxNedqjrjr6g" "Mozilla/5.0 (Windows; U; Windows NT 5.1; uk; rv: Gecko/20111212 Firefox/3.6.25"

(et cetera)

That is: The page itself is requested by a google IP, giving the human user's UA with appended ",gzip(gfe) (via translate.google.com)" --no leading space before "gzip". All subsidiary files are requested by the human user's IP and UA, without the "via translate" part. The referer for these files repeats all translation information. The main page will have either no referer if it came in via "translate", or a long search-type referer if it came in via a SERP and "translate this page".
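A rough sketch of how that pattern could be picked out of an access log, assuming Apache "combined" format with the IP as the first field. The function name `classify`, the labels it returns, and the regex are illustrative only, not the log-wrangling expression mentioned in this thread. Lines with the IP stripped, like the excerpts above, will not match and return None.

```python
import re

# Apache "combined" format: ip ident user [time] "request" status bytes "referer" "ua"
LOG_RE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<ua>[^"]*)"$'
)

# Suffix Google appends to the human's UA on the page request (no space before "gzip").
TRANSLATE_UA_SUFFIX = ",gzip(gfe) (via translate.google.com)"
# Host seen in the referer of the subsidiary-file requests.
TRANSLATE_REFERER_HOST = "translate.googleusercontent.com"


def classify(line):
    """Return (kind, ip, ua) for one log line, or None if it does not parse.

    kind is "proxy_page"  : page fetched by the Translate proxy (ip is Google's),
            "human_asset" : subsidiary file fetched by the human's own IP,
            "other"       : anything else.
    """
    m = LOG_RE.match(line.strip())
    if m is None:
        return None
    ua = m["ua"]
    if ua.endswith(TRANSLATE_UA_SUFFIX):
        # Strip the suffix to recover the human's original UA string.
        return ("proxy_page", m["ip"], ua[: -len(TRANSLATE_UA_SUFFIX)])
    if TRANSLATE_REFERER_HOST in m["referer"]:
        return ("human_asset", m["ip"], ua)
    return ("other", m["ip"], ua)
```

A "proxy_page" hit with no "human_asset" lines after it carrying the same UA is the odd case described at the top of the thread.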

I see enough of these that I've got a log-wrangling Regular Expression to show me the human user's IP. This time, there wasn't one. The following lines in logs are unrelated other stuff. The original IP seems legit; I've never met that specific one before, but it's in the general Preview-and-Translate range. And speaking of Preview, have another look at that UA:

Mozilla/5.0 (X11; Linux x86_64; rv:2.0.1) Gecko/20100101 Firefox/4.0.1

If it looks vaguely familiar, it should:

Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/534.51 (KHTML, like Gecko; Google Web Preview) Chrome/12.0.742 Safari/534.51

Explanation 1 is the boring one: Some robot is vaguely spoofing the Google Preview UA and, for reasons best known to itself, picking up a page via Google Translate rather than grabbing it directly. Since there are no subsidiary files, I don't know what languages are involved-- or, of course, the real IP. (You can translate from English to English. I checked.)

Explanation 2 is the interesting one. Speculation welcome.



 9:12 pm on Feb 2, 2012 (gmt 0)

Don't think that's a preview UA. Firefox and Safari are quite different, the latter generally having more exploit holes, which is probably why google uses it. :)

Seriously though, web preview uses applewebkit, which firefox does not. Firefox 4 was a very fleeting update which may still be around on an unpatched linux box (see comments in another thread re: firefox versions). The UA itself varies under translate, being the real user's UA.
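To make the engine difference concrete, here is a small sketch that sorts a UA string by its engine token. The function names are made up for illustration, and substring checks like these only describe what the UA claims, not what the client really is.

```python
def rendering_engine(ua):
    """Guess the claimed engine from the UA string alone."""
    # WebKit UAs say "like Gecko" but carry an "AppleWebKit/" token.
    if "AppleWebKit/" in ua:
        return "webkit"
    # Real Gecko browsers carry "Gecko/<build date>" plus "Firefox/<version>".
    if "Gecko/" in ua and "Firefox/" in ua:
        return "gecko"
    return "unknown"


def looks_like_web_preview(ua):
    """True only for UAs that name Google Web Preview and claim WebKit."""
    return "Google Web Preview" in ua and rendering_engine(ua) == "webkit"
```

Run against the two UAs quoted in the first post, the Firefox 4.0.1 string comes out as "gecko" and only the second one passes `looks_like_web_preview`.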

Since translate is a proxy, I would expect the FWD_FOR and VIA to be valid, the former showing the user's IP.

Checking through this month's security logs, I see most instances of translate have an inclusion in the VIA field of "translate.google.com TWSFE/0.9" (sans quotes). It has a FWD_FOR of the original user's IP. Only one I can find without is via a WebSense proxy, which has probably hidden the initial IP.
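For anyone who can see request headers, a minimal sketch of that check. It assumes "FWD_FOR" and "VIA" here mean the X-Forwarded-For and Via headers as exposed in a CGI/WSGI-style environment (`HTTP_X_FORWARDED_FOR`, `HTTP_VIA`); the function name is hypothetical. Both headers are supplied by the client side and can be forged, so treat the result as a hint, not proof.

```python
def translate_client_ip(environ):
    """Return the forwarded client IP if the request claims to come via
    Google Translate, otherwise None.

    environ is a CGI/WSGI-style dict of request variables.
    """
    via = environ.get("HTTP_VIA", "")
    if "translate.google.com" not in via:
        return None
    forwarded = environ.get("HTTP_X_FORWARDED_FOR", "")
    # X-Forwarded-For may hold a comma-separated chain; the first entry is
    # the address the first proxy reported for the original client.
    first = forwarded.split(",")[0].strip()
    return first or None
```

A request with the Translate Via token but an empty forwarded field returns None, which is the case worth a second look.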


 10:05 pm on Feb 2, 2012 (gmt 0)

Ooh, interesting to know. But slightly infuriating because I can't see security logs. Just access and error, and of course I can't change the detail level.

I hope sneaking in via translate doesn't become the next robotic fad, because I get a lot of legitimate translation requests.* (That is, ahem, "a lot" proportionally, not in absolute numbers ;)) But it might be easier to block than the current fad for auto-referers.

* Recently I even "met" a-- hold on to your hats --human Ukrainian! Well, I assume that's what "uk" is. Didn't look it up.


 10:53 pm on Feb 2, 2012 (gmt 0)

Lucy - by "security logs" I mean my "gotcha" logs that record baddies - home-grown. Sorry if I misled you.

I block a lot of translates more or less accidentally. Something comes in on a G IP that's bad, it gets blocked for x hours where x increments if it happens again before the timeout. Apart from a few IPs where I get most of the translates, I have now blocked most of G apart from the genuine googlebot; unfortunately my clients want to be seen in G's SERPS. Sad but true. :)
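The "blocked for x hours, where x increments" idea might look something like this. It is only a sketch: the class name, the one-hour base, and the reset to one hour after a block has expired are all assumptions, since the post does not say how the real setup works.

```python
class EscalatingBlocklist:
    """Block an IP for x hours; x grows if it offends again before timeout."""

    def __init__(self, base_hours=1):
        self.base_hours = base_hours
        self._entries = {}  # ip -> (strikes, blocked_until in seconds)

    def report(self, ip, now):
        """Record a bad request at time `now` (seconds); return block expiry."""
        strikes, until = self._entries.get(ip, (0, 0.0))
        # Still inside the previous block: escalate. Otherwise start over.
        strikes = strikes + 1 if now < until else 1
        until = now + strikes * self.base_hours * 3600
        self._entries[ip] = (strikes, until)
        return until

    def is_blocked(self, ip, now):
        """True while the IP's current block has not yet expired."""
        _, until = self._entries.get(ip, (0, 0.0))
        return now < until
```

Passing `now` in explicitly, rather than reading the clock inside the class, keeps the behaviour easy to check by hand.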

UK is, er, UK, where I live. UA is Ukraine. Although sometimes it's difficult to tell the difference: there are a LOT of idiot Brits who are careless about what they click on and seem to take ages to get rid of viruses, even if they find them in the first place.

I met one such idiot whilst buying a mobile broadband dongle in a local shop. While we were waiting to be served he told me about this laptop he'd just bought secondhand "in a pub". Used up a whole 30G of mobile bandwidth overnight, said he'd found a lot of viruses, how could he get rid of them? When I said he should get professional help he said he'd just buy another 30G of bandwidth - couldn't afford the de-tox.


 11:30 pm on Feb 2, 2012 (gmt 0)

I block a lot of translates more or less accidentally.

I block all translate services including G (by UA not IP) and have for a couple years. They have proven to be a window for scraping and other nefarious purposes, not to mention that they filter out my ads.


 12:16 am on Feb 3, 2012 (gmt 0)

I block all translate services including G (by UA not IP) and have for a couple years.

Ditto. No: (transla|transcod)
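For illustration, the same pattern applied to a UA string in Python. The case-insensitive flag and the function name are assumptions; the post gives only the pattern, not where or how it is applied.

```python
import re

# Pattern quoted above; IGNORECASE is an assumption, not stated in the post.
BLOCKED_UA = re.compile(r"(transla|transcod)", re.IGNORECASE)


def is_translator_ua(ua):
    """True if the UA string mentions a translation or transcoding service."""
    return BLOCKED_UA.search(ua) is not None
```

This matches the "(via translate.google.com)" suffix on the proxied page request, but not the subsidiary-file requests, which carry the human's plain UA.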


 12:51 am on Feb 3, 2012 (gmt 0)

I blocked them forever.

Jim was quite adamant about taking the time to sort out the genuine translate users from the harvesters. Don't recall if he used a headers check or something else (sure there's hordes of old threads on this topic).

His logic being that if a visitor took the time to use translate that he deserved access for his/her effort.


 3:51 am on Feb 3, 2012 (gmt 0)

I looked it up. "uk" the language is Ukrainian. Not to be confused with "ua" the tld.* Well, I'd be hard-pressed to think what else it could have been. Though an American-to-British translation might be quite entertaining for some sites.

Most of my translates are unquestionably legitimate: they're from assorted Spanish-speaking countries asking for an e-book that happens to be an English translation of a Spanish original. You can get the original online-- I have a link to it-- and I suspect the Spanish-to-English translator took liberties. But the translation has much prettier pictures. (Had to fine-tune my hotlink routine to let people see them.) Even if the visitors are scrapers it's no skin off my nose (disagreeable mental picture there) because the book is in the public domain. It's not as if they are taking my own deathless prose and claiming it's their own.

I did once get someone asking for a translation of one of my Inuktitut pages into Russian. That's one time I really wished for more than "google search" as the referer :)

* In a related "hold on to your hats!" moment, I also recently met something I never would have thought existed: a well-behaved Ukrainian robot. Asked for robots.txt, picked up a reasonable group of pages, left.


 9:20 pm on Feb 3, 2012 (gmt 0)

Unfortunately one of my customers (at least) needs translated traffic but the site is too small to warrant page translation for enough countries.

Lucy - sorry, I thought you meant domain/DNS, not language.

Only UA bot I've seen is sitebot, which has a Kill tag on it. :)


 1:44 am on Feb 4, 2012 (gmt 0)

Speaking of translating/translators/translations... Just got a / hit with a Ref from a .mx visitor:

(example did not include .com)

No clue if the hit was a real person, or if babylon.com's a real alternative to allowing Google trans -- or if they are Google trans... [globes.co.il...]


 3:11 am on Feb 4, 2012 (gmt 0)

I thought "babylon" sounded familiar. Turns out I had one just a few days ago. Raw logs turn up random others. Definitely look human.* A few of the searches were region-specific so I looked those up; the IP matches the region. It's all the same IP, not like google translate where the page itself is Google and the supporting files are the real person.

* This is one of those things that are much easier to tell when you're small. The whole visit is one consecutive batch of log lines, not tangled up with a bunch of other concurrent hits from unrelated people.


 4:23 am on Feb 8, 2012 (gmt 0)

Returning to the theme of general wtf-ness:

- - [07/Feb/2012:07:29:53 -0800] "GET /{directory}/{directory}/{roboted-out directory}/{file.jpg} HTTP/1.1" 200 2714 "{hotlinking page}" "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; .NET CLR 1.0.3705; .NET CLR 1.1.4322; Media Center PC 4.0; .NET CLR 2.0.50727; .NET CLR 3.0.4506.2152; .NET CLR 3.5.30729)"

The referring page is a bona fide-- I use the term loosely-- hotlinker. The kind that would take several minutes to load up if you waited for every single hotlinked image to come through.

The jpg lives on a page that hasn't been edited in years and hardly ever gets visited. Its two most recent visits weren't even full-blown humans, just Preview. Leading to the secondary question of how this new hotlinker even found the picture.

The primary question is: Where does it say "Google"? Have they started wearing bing's castoff UAs?


All trademarks and copyrights held by respective owners. Member comments are owned by the poster.
WebmasterWorld is a Developer Shed Community owned by Jim Boykin.
© Webmaster World 1996-2014 all rights reserved