
Forum Library, Charter, Moderators: Ocean10000 & incrediBILL

Search Engine Spider and User Agent Identification Forum

This 45 message thread spans 2 pages.
UNTRUSTED in Nokia User Agent

 6:35 pm on Apr 14, 2012 (gmt 0)

Nokia6300/2.0 (06.01) Profile/MIDP-2.0 Configuration/CLDC-1.1 nokia6300/UC Browser8.0.3.107/69/444 UNTRUSTED/1.0

Took the words right out of my mouth.

The "MIDP-2.0" element has apparently been around for a while-- it goes with mobiles-- but you have to give ChinaCache (65.255.37.nn) points for honesty. Can we look forward to a long line of UNTRUSTED versions?



 7:37 pm on Apr 14, 2012 (gmt 0)

that is hilarious!


 9:57 pm on Apr 28, 2012 (gmt 0)

I take it back.

Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Alcohol Search; IEMB3; IEMB3)

Wonder if they would have found any? The search came from 208.80.nn.nn, so we'll never know. Maybe they already found some-- hence the hiccup at the end of the UA.


 1:38 pm on Apr 29, 2012 (gmt 0)


hiccup at the end of the UA.

FWIW, this used to be a very common part of the MS UAs, similar to all the current .Net updates.

That it shows up with MSIE 7.0 is a joke.


 10:43 pm on May 2, 2012 (gmt 0)

Complete UA string:

mahonie, neofonie search:robot/search:robot/0.0.1 (This is the MIA Bot - crawling for mia research project. If you feel unhappy and do not want to be visited by our crawler send an email to spider@neofonie.de; http://spider.neofonie.de; spider@neofonie.de)

Thanks, but I heard you the first time.

Investigating the link leads to a FAQ list:

All sorts of unplanned things are happening on your website?

You are being flooded with empty comments in your guestbook? Large shopping carts are being filled, or empty orders triggered?
Forms are being submitted blank?

Please bear in mind that the spider only follows links. No real interaction with your web application takes place.

If simply following links causes the above or something similar, that is a sign of weaknesses in your web application. An empty form, for example, should not be submittable just like that, triggering actions such as the sending of emails. We are very sorry for any inconvenience this may have caused you; it is in no way intended on our part.

(Italics mine.) All of that is inarguably true, but let's talk belt-plus-suspenders.


 1:14 am on Jun 12, 2012 (gmt 0)

In the category of Mindless Repetition:

- - [10/Jun/2012:16:05:54 -0700] "GET / HTTP/1.1" 403 1479 "-" "-"
- - [10/Jun/2012:16:05:54 -0700] "GET / HTTP/1.1" 403 1479 "-" "-"
- - [10/Jun/2012:16:05:57 -0700] "GET / HTTP/1.1" 403 1479 "-" "-"
- - [10/Jun/2012:16:06:11 -0700] "GET / HTTP/1.1" 403 1479 "-" "-"

:: yawn ::
:: file nails ::
:: feed cat ::
:: check e-mail ::

- - [10/Jun/2012:16:06:17 -0700] "GET / HTTP/1.1" 403 1479 "-" "-"
- - [10/Jun/2012:16:06:17 -0700] "GET / HTTP/1.1" 403 1423 "-" "-"
- - [10/Jun/2012:16:06:18 -0700] "GET / HTTP/1.1" 403 1479 "-" "-"
- - [10/Jun/2012:16:06:24 -0700] "GET / HTTP/1.1" 403 1479 "-" "-"

:: are you still around? can I help you with something? ::
:: twiddle thumbs ::
:: wash dishes ::
:: change water in 20-gallon aquarium ::

- - [10/Jun/2012:16:06:30 -0700] "GET / HTTP/1.1" 403 1423 "-" "-"
- - [10/Jun/2012:16:06:30 -0700] "GET / HTTP/1.1" 403 1423 "-" "-"
- - [10/Jun/2012:16:06:35 -0700] "GET / HTTP/1.1" 403 1479 "-" "-"
- - [10/Jun/2012:16:06:37 -0700] "GET / HTTP/1.1" 403 1479 "-" "-"

:: isn't there some place you have to be? ::
:: does your mother know you're out? I make it after midnight in your time zone ::
:: talk to rats ::
:: turn on Judge Judy ::

- - [10/Jun/2012:16:07:19 -0700] "GET / HTTP/1.1" 403 1479 "-" "-"
- - [10/Jun/2012:16:07:19 -0700] "GET / HTTP/1.1" 403 1479 "-" "-"
- - [10/Jun/2012:16:07:19 -0700] "GET / HTTP/1.1" 403 1423 "-" "-"
- - [10/Jun/2012:16:07:23 -0700] "GET / HTTP/1.1" 403 1479 "-" "-"

:: wait expectantly ::
:: check time ::
:: think about dinner ::
:: huh. Guess that's the last of 'em ::

Time elapsed: just under a minute and a half.
Number of requests for index page: 49.
Number of 403s served: 49.

Perfunctory research suggests that
#1 I have had visitors from this neighborhood before, most of them dressed inadequately ("Mozilla/5.0 (compatible; news bot /2.1)", "Spider", "'artviper(tm)") -- or oddly ("Mozilla/4.0 (compatible; MSIE 6.0; MSIE 5.5; Windows NT 5.1) Opera 7.01 [en]") -- or not at all ("-");
#2 It appears to be a German server farm;
#3 The entire 85.25 block can be safely locked out at no cost to anyone.
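
For anyone who would rather not count 403s by hand, a tally like the one above can be sketched in Python. This is a minimal sketch, assuming the log is in Apache's combined format; the sample host 192.0.2.7 is a documentation address standing in for the real visitor, and `tally` is just a name made up for the illustration.

```python
import re
from collections import Counter

# Combined Log Format:
# host ident user [time] "request" status bytes "referer" "user-agent"
LOG_LINE = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"$'
)

def tally(lines):
    """Count (host, status) pairs in an iterable of access-log lines."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.match(line.strip())
        if match:  # lines that do not parse are skipped
            counts[(match["host"], match["status"])] += 1
    return counts

# Hypothetical sample lines; the host is a placeholder, not the real IP.
sample = [
    '192.0.2.7 - - [10/Jun/2012:16:05:54 -0700] "GET / HTTP/1.1" 403 1479 "-" "-"',
    '192.0.2.7 - - [10/Jun/2012:16:05:57 -0700] "GET / HTTP/1.1" 403 1479 "-" "-"',
    '192.0.2.7 - - [10/Jun/2012:16:06:17 -0700] "GET / HTTP/1.1" 403 1423 "-" "-"',
]
counts = tally(sample)
print(counts[("192.0.2.7", "403")])  # prints 3
```

Feeding it an open log file instead of `sample` gives the per-host 403 count directly.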



 1:25 am on Jun 12, 2012 (gmt 0)

More than a decade ago, when I first began my website, I had a DE visitor that was denied.

They apparently didn't like the door being closed and spent the next 4-5 hours (while I was snoozing) eating consecutive 403s.
I never counted how many, but it had to be in the thousands.
Not sure if they believed they could crash the server or were just throwing a fit ;)
BTW, they occasionally caught a 200, but they were going so fast with the requests that they never saw it.


 12:51 am on Aug 20, 2012 (gmt 0)

1.202.aaa.bbb - - [18/Aug/2012:03:06:22 -0700] "GET /robots.txt HTTP/1.1" 200 874 "-" "Mozilla/5.0 ()"

Heck, why bother to put anything at all inside the parentheses? We both know you're a robot.

P.S. It followed this request by putting on :: yawn :: a simple jikespider costume, eating two 301's in quick succession, and then quietly leaving. Shrug.
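
User-agents this bare can be caught mechanically. The following is a rough sketch, not a real bot detector: `ua_red_flags` is a hypothetical helper, the three checks are only the crude ones suggested by strings like the above, and the truncated third sample is made up for illustration.

```python
def ua_red_flags(ua):
    """Return a list of crude warning signs for a User-Agent string."""
    flags = []
    stripped = ua.strip()
    if stripped in ("", "-"):
        flags.append("blank")
    if "()" in stripped:
        flags.append("empty parentheses")
    if stripped.count("(") != stripped.count(")"):
        flags.append("unbalanced parentheses")
    return flags

print(ua_red_flags("-"))                        # ['blank']
print(ua_red_flags("Mozilla/5.0 ()"))           # ['empty parentheses']
print(ua_red_flags("Mozilla/5.0 (Windows; U"))  # ['unbalanced parentheses']
```

A clean result proves nothing, of course; plenty of robots wear a perfectly tailored costume.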


 2:56 am on Sep 15, 2012 (gmt 0)

178.238.235.ddd - - [13/Sep/2012:22:22:36 -0700] "GET /ebooks/alida/Alida.html HTTP/1.1" 200 714070 "-" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv: Gecko/20090824 Firefox/3.5.3 GTB5"

And your point is...?

- - [13/Sep/2012:22:22:39 -0700] "GET /ebooks/alida/Alida.html HTTP/1.1" 200 714070 "-" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv: Gecko/2009021910 Firefox/3.0.7"
- - [13/Sep/2012:22:22:42 -0700] "GET /ebooks/alida/Alida.html HTTP/1.1" 200 714070 "-" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv: Gecko/20090824 Firefox/3.5.3 GTB5"

... et cetera, for a total of 57 (fifty-seven) requests over the course of about two and a half minutes.

I wish I could say they used 57 varieties of user-agent. But aside from that single anomalous 3.0.7, it was all FF 3.5.3.

German server farm "Giga-hosting". Never met them before, will probably never meet them again, but let's slap on a Deny from just for the heck of it.

Fifty-seven. Now, honestly.


 7:23 pm on Sep 15, 2012 (gmt 0)

Gigahosting, DE (so far)... - - - - - - - -

Some very short ranges in there!

These date mostly from 2010 - I haven't checked them recently.


 8:21 pm on Sep 15, 2012 (gmt 0)

Thanks dstiles. I only had one of those.


 9:32 pm on Sep 15, 2012 (gmt 0)

91? 193? Oh, ###. Those are the ranges where you bend down and kiss the earth if you find a /21 because they don't get any bigger :( 193.34. has a stretch of /25s. Not by host. By country.


 8:33 pm on Sep 16, 2012 (gmt 0)


I don't usually give advice to Chinese robots, but I'll make an exception here. Take it from me: If you're doing this for a class, it's not quite ready to be handed in :)

- - [16/Sep/2012:08:16:16 -0700] "GET /fun/lions.html ++++++++++++++++++++++++++++++++++++++++ Result: +forum+not+found+/+could+not+find+IP HTTP/1.0" 403 1389 "http://www.example.com/fun/lions.html ++++++++++++++++++++++++++++++++++++++++ Result: +forum+not+found+/+could+not+find+IP" "Opera/9.80 (Windows NT 6.1; WOW64; U; en) Presto/2.10.289 Version/12.00"
- - [16/Sep/2012:08:16:16 -0700] "GET / HTTP/1.0" 403 1389 "http://www.example.com/fun/lions.html ++++++++++++++++++++++++++++++++++++++++ Result: +forum+not+found+/+could+not+find+IP" "Opera/9.80 (Windows NT 6.1; WOW64; U; en) Presto/2.10.289 Version/12.00"

I inserted a space before and after each "Result:" and before each long string of plusses.


 8:44 pm on Sep 16, 2012 (gmt 0)

A variety of those requests are also coming from other IPs. At least one of them makes that pronouncement after arriving under a number of other UAs and being fed a string of random rejects over the preceding days.


 9:26 pm on Sep 16, 2012 (gmt 0)

Lucy - the IP ranges may be rather short but my experience is: if you delve deeper they ultimately belong to a single organisation that sub-lets by country and in such circumstances the larger range can usually be entirely blocked.

In my ranges above I confined them, in some cases, to /24 and /23 because it would have taken too long to track down the larger ranges and I'd had no bad hits to make me reappraise them.
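
The arithmetic of short ranges versus the larger block they add up to can be checked with Python's standard `ipaddress` module. A minimal sketch, using made-up ranges from the 198.18.0.0/15 benchmarking space rather than anything listed in this thread; `is_blocked` is a hypothetical helper.

```python
import ipaddress

# Hypothetical short ranges, NOT the ranges discussed in the thread.
short_ranges = ["198.18.4.0/24", "198.18.5.0/24", "198.18.6.0/23"]
networks = [ipaddress.ip_network(r) for r in short_ranges]

# Merge adjacent ranges into the fewest equivalent blocks.
merged = list(ipaddress.collapse_addresses(networks))
print(merged)  # [IPv4Network('198.18.4.0/22')]

# How many addresses each prefix length covers.
for prefix in (21, 23, 24, 25):
    size = ipaddress.ip_network(f"198.18.0.0/{prefix}").num_addresses
    print(f"/{prefix}: {size} addresses")

def is_blocked(ip, blocks):
    """True if ip falls inside any network in blocks."""
    address = ipaddress.ip_address(ip)
    return any(address in net for net in blocks)

print(is_blocked("198.18.7.200", merged))  # True
```

Collapsing only merges ranges that are truly adjacent, so it never blocks more than the listed ranges cover; widening to the owning organisation's full allocation is still a judgment call.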


 3:13 am on Dec 18, 2012 (gmt 0)

Auto-referring rises to new heights. Or do I mean plumbs new depths? Two of 'em, about half an hour apart. Guess they were afraid they wouldn't be allowed to see the page if they didn't offer a respectable referer.

- - [17/Dec/2012:05:10:32 -0800] "GET /robots.txt HTTP/1.1" 200 657 "http://www.example.com/robots.txt" "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv: Gecko/20070312 Firefox/; 360Spider"
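
Spotting this kind of auto-referral in a log is easy to automate. A minimal sketch in Python: `is_self_referral` is a hypothetical helper, and `www.example.com` is the same placeholder host used in the log line.

```python
from urllib.parse import urlsplit

def is_self_referral(request_path, referer, own_hosts=("www.example.com",)):
    """True when the referer is the very page being requested on our own host."""
    parts = urlsplit(referer)
    # A blank or "-" referer has no hostname, so it never counts.
    return parts.hostname in own_hosts and parts.path == request_path

print(is_self_referral("/robots.txt", "http://www.example.com/robots.txt"))  # True
print(is_self_referral("/robots.txt", "-"))                                  # False
```

A human reloading a page can in principle produce the same pattern, so this is better treated as one signal among several than as grounds for a block on its own.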

 5:42 am on Dec 18, 2012 (gmt 0)

Note to self... if I ever write a scraper, name it after something universally popular, like a record-breaking Ferrari.


 11:32 am on Dec 20, 2012 (gmt 0)

Mozilla/5.0 (compatible; Yahoo! Slurp/3.0; http ://help.yahoo.com/help/us/ysearch/slurp) NOT Firefox/3.5

There you go: NOT Firefox/3.5? No sh.., Sherlock. It came from what seems to be a Ya-hu IP.

I almost dropped off the.._


 7:25 am on Jan 20, 2013 (gmt 0)

Awright, translation time again.

Most people's attention will of course go straight to the UA. Mine was caught by the search string, which made me distinctly nervous. But the more you look, the more you'll find. Plenty of time to find things, too, because this was to all appearances a human visitor. Got all associated files within the appropriate time frame. The only oddity is the truncated search info.

70.39.184.nnn - - [18/Jan/2013:08:54:21 -0800] "GET /fonts/hamlet.html HTTP/1.1" 200 7548 "http://www.google.co.in/search?&q=mmmmmmmmmmlil" "NokiaX2-00/5.0 (08.35) Profile/MIDP-2.1 Configuration/CLDC-1.1 UCWEB/2.0(Java; U; MIDP-2.0; en-us; nokiax2-00) U2/1.0.0 UCBrowser/ U2/1.0.0 Mobile UNTRUSTED/1.0"
is apparently some sort of proxy, physically located in LA. (I did say I'd need a translation.) They've even got an invisible www site. I first thought they didn't like Camino, but they were equally invisible to an up-to-the-minute Safari.

I've only met them once before. Can't say I understand the point of using a proxy when both visits came in via google India, which kinda blows the disguise. That earlier visit had a somewhat similar UA and-- glory, glory-- came in with a search whose result took them straight to a redirect reserved for South Asian visitors. File under: I guess you had to be there.

I was hoping I could shift one digit and block the whole 38-39 range-- I've already excluded half of 70.38-- but there seem to be humans in another part of 70.39, darn it.

GET /fonts/hamlet.html
People who know me will deduce right away that this page has nothing to do with Shakespeare, and very little to do with small villages. The word may have a technical meaning in Canadian, but if we start on Canadianisms we'll be here all day. Like several other pages in its directory, this one calls a javascript function, which calls another, which leads us to...

To some of youse this will look like gibberish. It's the text string used by one of the simplest font-checking routines. Lots of ems for width; a couple of ascenders for height. (Using both ascenders and descenders would actually reduce the accuracy of the function.) "One of the simplest" = the one I use. Duh.

Here's where I get uneasy, because the page doesn't visually display this text at any point. At least I hope it doesn't. Turns out I am one of several hundred people using this exact function; they all come up in Search, with the text duly displayed. But only in search results-- whew!-- not in Preview.

One random page I visited apparently checks to see if I've got a particular font that I haven't got, because it showed lots of placeholder-characters. This would seem to be impossible, but it became understandable when I investigated and found we're in a Private Use Area. So the author's eight closest friends see one version of the page, while the rest of us see another. I do not perfectly understand why he bothers to check for a font if he's going to display the same text either way, but never mind that. At least I'm better off than the Google-Preview-Not-A-Robot, because all you see there is a couple of lines of text. And it obviously isn't because g### didn't read-- and execute-- the script ;)

NokiaX2-00/5.0 (08.35) Profile/MIDP-2.1 Configuration/CLDC-1.1 UCWEB/2.0(Java; U; MIDP-2.0; en-us; nokiax2-00) U2/1.0.0 UCBrowser/ U2/1.0.0 Mobile UNTRUSTED/1.0
Hey, I remember UNTRUSTED. It's what started this thread. Can it be that it's just Nokia-speak for "unstable release"? I've met UCweb a handful of times before too. First reaction: the University of California has its own version of the internet now? Well, maybe not. All earlier sightings have come from confidence-inspiring neighborhoods like ChinaCache or Yahoo Cache. It seems to have something to do with cell phones. (See above about translation.)

But I still can't begin to guess why our Indian proxy-user was searching for this particular string. It's not something you'd make up, or type in from memory-- and why search for something if it's already right in front of you?

The final unsolved mystery is one I almost didn't notice. I've got four or five pages that call the function that uses this string. But /hamlet.html is the only one that comes up in search. At all, I mean. And I can't for the life of me find any difference among the pages, or in their respective preliminary functions. Except that ::cough-cough:: in /hamlet.html the first function is entirely enclosed in a "try/catch" framework, and the others aren't. That can't possibly make a difference to indexing. Uhm. Can it?

:: returning to state of habitual puzzlement ::


 5:36 pm on Jan 20, 2013 (gmt 0)

I show the IP as blocked:
OrgName: PacketExchange, Inc
NetRange: -

and also block .in as referer. Untrusted might be added as a UA if I were to see it doing naughty things.

As to why the search query, my guess would be that he secretly admires the technique you are using and would like to emulate it or is using it as an example somewhere to tutor others(?) Since the result is not visually showing anything 'useful' maybe the idea is to visit the source.(?)


 7:28 pm on Jan 20, 2013 (gmt 0)

Blend27 - I have the range - blocked as a "bot" with the note: "several bot-UA hits but no valid rDNS". This still pertains so any bot from that IP range remains blocked until yahoo gets its act together.

Lucy - I block the complete range under the name PacketExchange.


 7:55 pm on Jan 20, 2013 (gmt 0)

...I block the complete range under the name PacketExchange



 8:27 pm on Jan 20, 2013 (gmt 0)

Blend27 - I have the range - blocked as a "bot" with the note: "several bot-UA hits but no valid rDNS". This still pertains so any bot from that IP range remains blocked until yahoo gets its act together.

I have about 3 dozen IPs with that "NOT Firefox" appended to the standard Slurp UA from 98.137.20[6-7].nnn; they all point to Hnnn.hlfs.bf1.yahoo(.)com [bgp.he.net...]. At the same time, starting on the 20th of Dec 2012, there are hits from 72.30.198.nnn with the same user agent (as well as the normal Slurp UA). They also point to Hnnn.hlfs.bf1.yahoo(.)com [bgp.he.net...].

I've been allowing for ages.

98.137.20[6-7]... gets the boot for now. IPs from this range do request the images from the pages they try to crawl (one dynamically generated image URI, logged and matched with a cookie that is generated from the original page request).
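
The "valid rDNS" test mentioned above is usually done as a forward-confirmed reverse lookup: resolve the IP to a host name, check the name's domain, then resolve the name back and make sure it returns the same IP. A rough sketch in Python under stated assumptions: `forward_confirmed` is a hypothetical helper, the `.yahoo.com` suffix is taken from the host names quoted in this thread rather than from any official list, and the fake resolvers exist only so the example runs offline.

```python
import socket

def forward_confirmed(ip, allowed_suffixes,
                      reverse=socket.gethostbyaddr,
                      forward=socket.gethostbyname_ex):
    """Reverse-resolve ip, check the name's suffix, then confirm it resolves back."""
    try:
        hostname = reverse(ip)[0]
    except OSError:  # no PTR record or lookup failure
        return False
    if not hostname.lower().endswith(tuple(allowed_suffixes)):
        return False
    try:
        addresses = forward(hostname)[2]
    except OSError:
        return False
    return ip in addresses

# Hypothetical stand-ins for DNS; 203.0.113.9 is a documentation address.
def fake_reverse(ip):
    return ("h001.hlfs.bf1.yahoo.com", [], [ip])

def fake_forward(name):
    return (name, [], ["203.0.113.9"])

print(forward_confirmed("203.0.113.9", [".yahoo.com"], fake_reverse, fake_forward))  # True
```

Called without the last two arguments it uses the real resolver, which is slow enough that results are worth caching per IP.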


 9:02 pm on Jan 20, 2013 (gmt 0)

FWIW since November:

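# Refuse image and PDF requests from 98.136.0.0 through 98.139.255.255
# (leading ^ so that addresses such as 198.136.x.x are not caught as well)
RewriteCond %{REMOTE_ADDR} ^98\.13[6-9]\.
RewriteRule \.(jpg|gif|pdf)$ - [NC,F]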
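
A condition of this kind is a regular expression tested against the address as text, so without a leading `^` it can match in the middle of an address. The effect is easy to check in Python, whose regex syntax agrees with Apache's for a pattern this simple. A small sketch; the three sample addresses are made up.

```python
import re

unanchored = re.compile(r"98\.13[6-9]\.")
anchored = re.compile(r"^98\.13[6-9]\.")

# Made-up addresses: one intended target and two bystanders.
for ip in ("98.137.206.10", "198.136.5.1", "10.98.139.7"):
    print(ip, bool(unanchored.search(ip)), bool(anchored.search(ip)))
```

Only the first address matches the anchored pattern, while all three match the unanchored one.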


 12:41 am on Jan 21, 2013 (gmt 0)

and also block .in as referer

I only do that for .ru and .ua. Oh, and .su which by now exists only in bad referers :) Partial exemption for google dot country -- hypothetical anyway, because in real life they all use yandex. (Further rule here to block robots with boringly predictable yandex referer. But I now recognize the cyrillic for "rat" on sight.)
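
The logic of that referer rule, with its partial exemption for google dot country, might be sketched like this in Python. It is an illustration of the idea only, not the actual rule on anyone's server: `referer_blocked` and the `google` label test are assumptions made for the example.

```python
from urllib.parse import urlsplit

BLOCKED_TLDS = (".ru", ".ua", ".su")

def referer_blocked(referer):
    """True if the referer's host ends in a blocked TLD and is not a Google country domain."""
    host = urlsplit(referer).hostname or ""
    if not host.endswith(BLOCKED_TLDS):
        return False
    # Crude exemption: any host with a "google" label, e.g. www.google.ru
    # or www.google.com.ua. A stricter check would pin the label's position.
    return "google" not in host.split(".")

print(referer_blocked("http://spam.example.ru/page"))      # True
print(referer_blocked("http://www.google.ru/search?q=x"))  # False
print(referer_blocked("-"))                                # False
```

The exemption as written would also wave through a host like google.something.ru, which is why the label position deserves a tighter test in real use.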

Matter of fact I've never met .in except in the specific case of google.co.in. These are legitimate-- but the ones whose searches point them to {one specific page} are looking for something very different. So different, in fact, that I can't imagine why they even clicked on the search result in the first place. I happily embarrass them by redirecting to a page with a lurid pink background and faint suggestion of bare skin in the image. Nobody has ever followed the link back to the real page ::snrk::

But if anyone can figure out why only one page comes up in search, while four pages call the identical code (different preliminaries, but all the same mmm etc. function) ... Can't help but think it would cast some light on g###'s inner workings w/r/t javascript files.

In any case I've cleaned up the js a bit. I have to get it into my head that code intended to run on other people's computers has to come with more escape clauses. Sites that require you to force-quit your browser are not sites that get bookmarked and recommended to all your friends ;)

:: detour here to find out what the ### happened to old logs after recent server move ::


 1:23 am on Jan 21, 2013 (gmt 0)

:: detour here to find out what the ### happened to old logs after recent server move ::

Good luck with that lucy.

My el-cheapo moved to some cloud hosting and things have been a nightmare since.
They say another week before everything is cleaned up.


 1:42 am on Jan 21, 2013 (gmt 0)

Good luck with that lucy.

<begin topic drift>
On the plus side: While searching host's discussion boards to see if anyone else had already asked the same question, I found a mod_rewrite question that I could answer standing on my head. Heh heh.

They also goofed and told us exactly which Apache version they're using, after years of being deliberately close-mouthed on "security" grounds. Double heh heh.
</end topic drift>


 8:33 pm on Jan 21, 2013 (gmt 0)

Blend - I used to allow 72.30/16 for bots but now block the complete range as of May 2010.

Lucy - I see a fair few IN IPs both on web sites and mail - they are incorrigible "targeted spam" fiends and several of my older IN blocks were because of form spam. I treat them the same as RU, UE, CN, ID, VN, BR and a few more: block depending on web site "trading" area - if it's UK only they aren't wanted.

Oh, and remember that JS is turned off on a lot of browsers!


 3:17 am on Jan 22, 2013 (gmt 0)

remember that JS is turned off on a lot of browsers

Especially ::cough-cough:: browsers belonging to WebmasterWorld members ;)

The js in this directory is Added Value: if you've got certain specific fonts installed, you get a fancier version of the headers, or the text of a few footnotes changes. On the one page where js is required, the page says so. And even there I've got a simplified backup function that provides some of the same information.*

I've got another site that really does rely on javascript, but that's an art-gallery site so users would kinda expect it anyway. Turning off scripting doesn't break the page, it just cuts back on the things it can do.

* It also tells me-- recent accidental discovery-- that Camino handles font names in a different way from all my other browsers. One of these days I will investigate further. Fortunately the difference only manifests itself when applied to an extremely obscure UCAS legacy font. And it's much less annoying than Opera, which also handles fonts in a manner all its own.


 3:28 pm on Jan 26, 2013 (gmt 0)

And now, returning to this thread's original theme:

- - [25/Jan/2013:23:33:26 -0800] "GET /robots.txt HTTP/1.1" 200 657 "-" "LWNutch/Nutch-1.4 (another scientific bot - we accept your robots.txt! )"
- - [25/Jan/2013:23:33:26 -0800] "GET /robots.txt HTTP/1.1" 200 657 "-" "LWNutch/Nutch-1.4 (another scientific bot - we accept your robots.txt! )"
- - [25/Jan/2013:23:33:26 -0800] "GET /fun/lions.html HTTP/1.1" 200 2466 "-" "LWNutch/Nutch-1.4 (another scientific bot - we accept your robots.txt! )"

For a given definition of "scientific", anyway.

(For those who care: 206.117 is an outfit called Los Nettos. I don't know them, but when contact addresses begin with 'hostmaster@' you can safely draw conclusions ;) I do know there is some reason why I don't globally block UAs containing 'Nutch'; I just don't remember what that reason is.)

All trademarks and copyrights held by respective owners. Member comments are owned by the poster.
WebmasterWorld is a Developer Shed Community owned by Jim Boykin.
© Webmaster World 1996-2014 all rights reserved