MSN with new bot?

New one on me

         

tangor

7:07 pm on Jul 25, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Just this last week started getting the following:

MSNBOT_Mobile MSMOBOT Mozilla/2.0 (compatible; MSIE 4.02; Windows CE; Default)/1.1 (+http://search.msn.com/msnbot.htm)

From IPs...

202.96.51.*
219.142.53.*

Bot or person?

janharders

7:23 pm on Jul 25, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Since they're both from China ("China Network Communications Group Corporation" and "CHINANET beijing province network"), I somehow doubt they're official MS. Plus: why would they run a "mobile bot" on Windows CE? I can understand building the search for CE, but actually running the bots on it?

jdMorgan

8:07 pm on Jul 25, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



A lookup at APNIC clearly shows the 202.96.51.* IP address range as being assigned to Microsoft (China) Co. LTD, and the user-agent string is completely valid for Microsoft's Mobile Web crawler.

Don't block this user-agent and the 202.96.51.* IP address range unless you do not care about mobile (cell phone, PDA, iPod, etc.) users.

The 219.142.53.* IP address range is *not* listed as assigned to MS, and has no reverse DNS, so blocking that one is your choice.

As mobile devices get more and more capable, and as users become aware of Web-to-mobile transcoders provided by the search engines, the number of people surfing the Web with mobile devices is going up and up. So take care when deciding whether to block mobile crawlers.

Jim
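The reverse-DNS test mentioned above can be scripted. What follows is a minimal Python sketch that assumes the standard-library resolver is good enough for the job; `reverse_dns` and its `lookup` parameter are hypothetical names introduced here, not anything from the thread. It only reports whether a PTR record exists, and leaves the judgment about which hostnames count as "official" to the reader.

```python
import socket


def reverse_dns(addr, lookup=socket.gethostbyaddr):
    """Return the PTR hostname for addr, or None when no rDNS exists.

    `lookup` is injectable (an assumption of this sketch) so the function
    can be exercised without touching the network.
    """
    try:
        # gethostbyaddr returns (hostname, aliaslist, ipaddrlist)
        hostname, _aliases, _addresses = lookup(addr)
    except (socket.herror, socket.gaierror):
        # No PTR record, or the resolver could not answer
        return None
    return hostname
```

A fuller check would also resolve the returned hostname forward and confirm that it maps back to the same address, since PTR records alone are controlled by whoever owns the IP block.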

janharders

8:26 pm on Jul 25, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



>> A lookup at APNIC clearly shows the 202.96.51.* IP address range as being assigned to Microsoft (China) Co. LTD

What did you search for, and where? I still get:

inetnum: 202.96.0.* - 202.96.63.*
netname: CNCGROUP-BJ
descr: CNCGROUP Beijing province network
descr: China Network Communications Group Corporation
descr: No.156,Fu-Xing-Men-Nei Street,
descr: Beijing 100031
country: CN
admin-c: CH455-AP
tech-c: SY21-AP
mnt-by: APNIC-HM
mnt-lower: MAINT-CNCGROUP-BJ
mnt-routes: MAINT-CNCGROUP-RR
changed: 20000101

Maybe I'm missing some big point and don't know how to search their database correctly -- help me out here.

Anyhow, why would the user-agent be "Mozilla/2.0 (compatible; MSIE 4.02; Windows CE; Default)/1.1 (+http://search.msn.com/msnbot.htm)" ... I mean ... MSIE 4.02 ... Windows CE ...?

jdMorgan

9:02 pm on Jul 25, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



You must search for an address in the correct subnet, in this case, 202.96.51.128 - 255.

The 0 - 127 subnet does not resolve to MS, so that is the likely cause of the difference between what you saw and what I saw (I got the actual IP address from my logs, not from this thread).

The actual MSMOBOT IP addresses I have in the range including 219.142.53.1 - 31 do not resolve -- There is no rDNS for them, which is why I withheld judgment on that range.

The user-agent string is exactly as it appears in the initial post of this thread, and is correct for MSMOBOT - even with that funky/strange "Default)/1.1" sequence in it. All they are saying is that their UA acts (more or less) like MSIE 4.02 running on a Windows CE platform.

I'll readily admit to being disgusted at the very sloppy use of user-agent strings in mobile devices -- I doubt that most of the people who define these UA strings in the phones' software have ever read the original Netscape standards. Some seem almost random; I've seen one recently where the characters that should be semicolons (;) are colons (:). It seems that the mobile robot designers are following suit.

Jim
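The upper-half/lower-half distinction described above (202.96.51.128 - 255 versus 0 - 127) is easy to get wrong by eye. Here is a minimal Python sketch of the check, assuming the range is exactly the /25 named in this thread; `MS_CHINA_SUBNET` and `in_ms_china_subnet` are hypothetical names, and the assignment itself is only as reported here in 2008, so it should be re-verified against a current whois lookup.

```python
import ipaddress

# 202.96.51.128 - 202.96.51.255, the subnet reported in this thread
# as assigned to Microsoft (China). Assumption: the range is a clean /25.
MS_CHINA_SUBNET = ipaddress.ip_network("202.96.51.128/25")


def in_ms_china_subnet(addr: str) -> bool:
    """True when addr falls inside the upper half of 202.96.51.*."""
    return ipaddress.ip_address(addr) in MS_CHINA_SUBNET


if __name__ == "__main__":
    for candidate in ("202.96.51.200", "202.96.51.10", "219.142.53.5"):
        print(candidate, in_ms_china_subnet(candidate))
```

Looking up an address below .128 lands in the other half of the block, which is why two people querying "202.96.51.*" can get different whois answers.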

janharders

9:06 pm on Jul 25, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



ahh, yeah, that was the story, I checked for stuff below 128 ... thanks for clearing that up.

tangor

10:17 pm on Jul 25, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Thanks, guys... I'll watch this for a few weeks and see what happens. My site is in English, but it is a minor interest for literature buffs, and those buffs I have found worldwide... which is why I have not routinely blocked countries, though rude bots and scrapers get the boot.

tangor

11:14 pm on Jul 25, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Whoops! Looks like the thread got moved to Spiders... So, is the original string I posted a spider?

jdMorgan

11:52 pm on Jul 25, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Sorry, I thought I said so in my first post above.

This is MSN's mobile Web crawler, MSMOBOT. It crawls pages written for mobile devices using XHTML+XML/Mobile Profile, WML, perhaps iMode, and possibly also HTML -- which is then transcoded to either of the first two or three markup languages, depending on what device requests the page and what markup language that device supports.

I qualified that statement, because I'm not sure how much processing of HTML pages is done by MSMOBOT, and how much is done using their regular MSNbot Web crawler -- I have no "inside" information, and both crawl HTML pages. Also, I have no visibility into the Japan/Asia Market, so I don't know for certain if it handles iMode, which is dominant in those markets.

Jim
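For log analysis, the crawler can be picked out by the tokens at the front of its user-agent string. This is a rough Python sketch that assumes the string always carries the `MSMOBOT` token as a separate whitespace-delimited word, as in the sample at the top of this thread; `looks_like_msmobot` is a hypothetical helper name. A user-agent match proves nothing on its own, because the header is trivially forged, so it is best combined with an IP-range or reverse-DNS check.

```python
# Sample user-agent string exactly as posted at the top of this thread
SAMPLE_UA = (
    "MSNBOT_Mobile MSMOBOT Mozilla/2.0 (compatible; MSIE 4.02; "
    "Windows CE; Default)/1.1 (+http://search.msn.com/msnbot.htm)"
)


def looks_like_msmobot(user_agent: str) -> bool:
    """True when the UA carries MSMOBOT as a standalone token.

    Assumption: the token is whitespace-delimited, as in SAMPLE_UA.
    """
    return "MSMOBOT" in user_agent.split()


if __name__ == "__main__":
    print(looks_like_msmobot(SAMPLE_UA))
```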

wilderness

12:04 am on Jul 26, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



> This is MSN's mobile Web crawler, MSMOBOT. It crawls pages written for mobile devices using XHTML+XML/Mobile Profile, WML, perhaps iMode, and possibly also HTML -- which is then transcoded to either of the first two or three markup languages, depending on what device requests the page and what markup language that device supports.

Jim,
Are you aware of any rendered examples of pages larger than 1k words?

TIA

Don

jdMorgan

12:24 am on Jul 26, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



No, not if you mean from MSN. I don't pay a lot of attention to them right now (long story).

However, Google Wireless Transcoder will handle some very large pages -- breaking them into multiple sub-pages if needed to fit the smaller memory capacity of cell phones.

The best way to see this is to try it, but if you don't have mobile Web access, then look at a "regular" Web page in the Google cache, and just imagine that it only shows half of that cached page, but adds links to navigate back and forth between the first cached half-page and the second. Then imagine that on a very small screen... :)

Jim

tangor

12:34 am on Jul 26, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



The page size suggested now makes this more interesting. Each page that was accessed is 150k to 350k. I have LARGE TEXT FILES: THINK BOOKS, NOVELS, SCHOLARLY REPORTS. I do not break these up into itty bitty pages, nor is my site monetized in any fashion. Now I'm wondering why a mobile phone is hitting some of the LARGEST pages on my website.

I did a little googling and discovered the string is a valid UA, but it does not make sense, particularly in light of these other observations. (And yes, jdMorgan, I understood you the first time; I just wondered why the thread got moved, because I originally posted the question in Log Analytics.)

jdMorgan

12:43 am on Jul 26, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Why do you think this is a mobile phone? It is MSN Live's mobile web page crawler.

If it detects a large non-mobile page, it will 'tag' it internally, so that MSN/Live will pass it through a transcoder and break it up into smaller pieces if it is requested by a mobile device by clicking on a link in the m.live.com search engine results.

The thread was moved here because the staff felt that the question would get more attention and better answers here, and because it fit better with the charter of this forum than it did with the Log Analytics forum charter.

Jim

wilderness

12:48 am on Jul 26, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



> No, not if you mean from MSN. I don't pay a lot of attention to them right now (long story).

I'm aware.

> However, Google Wireless Transcoder will handle some very large pages -- breaking them into multiple sub-pages if needed to fit the smaller memory capacity of cell phones.

Any idea if Google keeps refreshing the page in the process, similar to what Acrobat does for multi-page PDFs when the request(s) are made?

> The page size suggested now makes this more interesting. Each page that was accessed is 150k to 350k. I have LARGE TEXT FILES: THINK BOOKS, NOVELS, SCHOLARLY REPORTS.

I actually have some pages that exceed 3k in word counts and are enhanced with images as well.

Jim's been prodding me to jump on the bandwagon; however, there are issues (such as cache) which I may never overcome.

jdMorgan

12:49 am on Jul 26, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Correction: Where I used the term "iMode" above, please replace with cHTML. iMode was the DoCoMo service offering which used the cHTML (compact HTML) markup language.

Jim

jdMorgan

12:56 am on Jul 26, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



> Any idea if Google keeps refreshing the page in the process, similar to what Acrobat does for multi-page PDFs when the request(s) are made?

No, a transcoder grabs the page, transcodes it, possibly breaking it into smaller pieces, and then the visitor navigates inside the transcoded copy if it has been broken into smaller pieces. What you see is one page fetch, complete with all images, CSS, etc. The only difference is that you may see it come from a Google, Yahoo, or MSN IP address, or from servers at companies such as OpenWave (which provide transcoding services for ISPs, among other things).

Jim
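To make the "one fetch, then navigate inside the transcoded copy" idea concrete, here is a toy Python sketch of splitting an already-fetched page into sub-pages. It is an illustration only and not how any search engine's transcoder actually works: it assumes plain text with blank-line paragraph breaks and a character budget per sub-page, and `split_into_subpages` and `max_chars` are hypothetical names.

```python
def split_into_subpages(text: str, max_chars: int = 2000) -> list[str]:
    """Split text at paragraph boundaries into chunks of about max_chars.

    The origin server sees a single fetch; the reader then pages through
    the chunks. A single paragraph longer than max_chars is kept whole
    rather than cut mid-sentence.
    """
    pages: list[str] = []
    current = ""
    for paragraph in text.split("\n\n"):
        candidate = f"{current}\n\n{paragraph}" if current else paragraph
        if current and len(candidate) > max_chars:
            # Adding this paragraph would exceed the budget: close the page
            pages.append(current)
            current = paragraph
        else:
            current = candidate
    if current:
        pages.append(current)
    return pages
```

Because the split happens only at paragraph breaks, joining the sub-pages with blank lines reproduces the original text, so nothing is lost in the process.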

wilderness

1:08 am on Jul 26, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



> The only difference is that you may see it come from a Google, Yahoo, or MSN IP address, or from servers at companies such as OpenWave (which provide transcoding services for ISPs, among other things).

Many thanks.

Do you have an organized list of UAs and IPs?

I've been denying these tools for what seems like an eternity, and without accumulating a categorical reference.

jdMorgan

1:37 am on Jul 26, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



No, I have a disorganized list in my head...

Just stuff I found chasing strange-looking accesses on several servers.

Jim

wilderness

1:59 am on Jul 26, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I saw a link somewhere (not sure which forum or even if it was Webmaster World), which provided UA's for mobile devices.

I don't believe I saved the link. However, even with the UAs, without the capability to compare them against an IP list the conversion would take an eternity of monitoring and updates.

Many thanks.

Don

tangor

4:06 am on Jul 26, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



"Why do you think this is a mobile phone? It is MSN Live's mobile web page crawler. "

That's why I asked. Forgive my ignorance...mobile these days does seem to indicate mobile phones. So, by interpretation, it is not a mobile phone, right, thus not a user?

As to the breaking up of pages into itty little bits, that also sounds like mobile phones, which does not thrill me, since computers are notorious for breaking things up and LOSING things in the process. This I do not want (nor do my authors).

All I asked, from the get go, was if the string was a valid User-Agent or a robot (I think I said "person or bot").

So this is a mobile robot that is not a user and, because my authors do not want their docs shared piecemeal, I might have to consider banning it, especially considering the size of our content (nothing smaller than 50kb). In any event, the UA which opened this thread must be fairly new (I presume), since it did not show up until last week on my website and has not appeared in the last three years of log files, and just about every Tom, Dick, and Harry robot and UA has been encountered.

Then again, not every Tom, Dick or Harry has been to my site.

jdMorgan

9:17 pm on Jul 31, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



MSN's mobile crawlers have been active for a long time. I guess they just now showed up on your site, though.

If I were running your site, and saw a mobile access using a transcoder, I'd think, "Oh, someone's stuck in an airport, reading one of my novels. Sure hope they have good eyesight!"

So, accesses from mobile 'bots (or transcoders) to large pages should not be looked on with too much suspicion -- It is the kind of content I'd love to find if stuck waiting on a long trip.

Jim

Samizdata

10:07 pm on Jul 31, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



> accesses from mobile 'bots (or transcoders) to large pages

For an example of how they handle it you can try the Google Wireless Transcoder:

[google.com...]

(I could never find an equivalent for MSN but the effect should be similar).

The Google version uses a GooglePlex IP and this user-agent:

Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0; Google Wireless Transcoder;)

On my sites this will generally be intercepted and fed special mobile content.

...

jdMorgan

10:35 pm on Jul 31, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



There's really no use feeding the transcoder mobile content, except perhaps to insert a link that *offers* mobile content. The transcoder expects "normal-Web-sized" HTML pages, and converts them to xhtml+xml or cHTML on the fly, breaking them into several smaller framed pages if needed.

I would think you might confuse them by providing mobile markup as input to the transcoder. Your mobile pages should normally appear as separately listed in the mobile SERPs, and should be directly available from those SERPs without invoking the transcoder. Be aware that G changed their Mobile SiteMap format recently -- Be sure you're marking your mobile URLs as such using the <mobile:mobile /> xml sitemap tag.

In normal operation, if you hover over a link to an XHTML+XML or cHTML mobile page in G's mobile SERPs, you should see a straight link to that mobile page. If you hover over an HTML "big Web" page link, you should see a link to the "google.com/gwt" transcoder URL, with your page's URL passed as a parameter.

Jim
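For anyone generating sitemaps by script, the <mobile:mobile /> tag mentioned above can be emitted like this. It is a hedged Python sketch: the two namespace URIs are what I understand the sitemaps.org protocol and Google's mobile sitemap extension to use, and they are not stated anywhere in this thread, so check them against current documentation. `build_mobile_sitemap` and the example.com URLs are hypothetical.

```python
import xml.etree.ElementTree as ET

# Assumed namespace URIs (not given in the thread; verify before use)
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
MOBILE_NS = "http://www.google.com/schemas/sitemap-mobile/1.0"

# Serialize the sitemap namespace as the default, and the mobile
# extension with the "mobile" prefix
ET.register_namespace("", SITEMAP_NS)
ET.register_namespace("mobile", MOBILE_NS)


def build_mobile_sitemap(urls):
    """Return sitemap XML in which every URL is flagged as mobile."""
    urlset = ET.Element(f"{{{SITEMAP_NS}}}urlset")
    for url in urls:
        entry = ET.SubElement(urlset, f"{{{SITEMAP_NS}}}url")
        ET.SubElement(entry, f"{{{SITEMAP_NS}}}loc").text = url
        # Empty marker element identifying the URL as mobile content
        ET.SubElement(entry, f"{{{MOBILE_NS}}}mobile")
    return ET.tostring(urlset, encoding="unicode")


if __name__ == "__main__":
    print(build_mobile_sitemap(["http://m.example.com/page1.html"]))
```

Only the mobile-markup URLs belong in such a file; the "big Web" HTML pages stay in the regular sitemap and reach handsets through the transcoder.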

Samizdata

11:28 pm on Jul 31, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



> I would think you might confuse them by providing mobile markup as input to the transcoder.

Apologies, I didn't mean it as general advice.

I have long catered for mobiles and handhelds - one of my sites deals with a lot of them, and has special sections for a wide range of devices (anything from WAP phones to the Nintendo Wii).

I do a lot of device and capability sniffing, have alternate stylesheets, use XHTML Transitional (which works with cHTML phones if kept simple) and offer appropriate rich media to almost anything.

I am not suggesting that this is necessary for text-heavy sites, which transcoders can cope with.

...

tangor

11:37 am on Aug 5, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



> So, accesses from mobile 'bots (or transcoders) to large pages should not be looked on with too much suspicion -- It is the kind of content I'd love to find if stuck waiting on a long trip.

Dang it, Jim!... Back a bit late on this topic, and yes, you'd like to read this content in an airport or wherever, and I'll keep that mobile phone aspect in mind in future log reads.

I'm VERY NEW to expanded services and really freakin' ignorant. Thanks to all, jdMorgan and Samizdata in particular.