Bing Search Engine News Forum

    
MSNBOT-MEDIA Crawls Thru Javascript
incrediBILL

Msg#: 3817612 posted 3:16 am on Jan 2, 2009 (gmt 0)

Today I caught MSNBOT-MEDIA crawling thousands of links that were only accessible thru javascript.

65.55.235.202 "GET /feedback.html?id=1010101234 HTTP/1.0"
"msnbot-media/1.0 (+http://search.msn.com/msnbot.htm)"

I have a couple of pages used for site feedback for various page elements and each link on the page has an OnClick command like this:

<a href="#" OnClick="OpenFeedback(1010101234)">

Elsewhere in the code is the actual function:

function OpenFeedback(id) {
  window.open('feedback.html?id=' + id, ...)
}

MSNBOT appears to have assembled the function call and the URL template in the script, and was crawling thousands of links such as "/feedback.html?id=1010101234" and so on, page after page.

I checked, and these pages don't appear to be referenced anywhere else on the web, nor should they be, since they're only accessible via javascript. The only conclusion I can draw is that MSNBOT-MEDIA now has a rudimentary understanding of javascript.
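
Purely speculating about the mechanics, something as crude as this would be enough to build those URLs - a hypothetical sketch with made-up variable names, obviously not MSN's actual code:

// Hypothetical sketch only - not MSN's actual method. Given page source like
// the snippets above, pair the ids found in onclick handlers with the URL
// template found in window.open() to produce crawlable URLs.
var source =
  '<a href="#" OnClick="OpenFeedback(1010101234)"> ... ' +
  "function OpenFeedback(id) { window.open('feedback.html?id=' + id, '_blank') }";

var ids = [], match, idPattern = /OpenFeedback\((\d+)\)/g;
while ((match = idPattern.exec(source)) !== null) { ids.push(match[1]); }

var template = source.match(/window\.open\('([^']+)'\s*\+\s*id/)[1];

var urls = [];
for (var i = 0; i < ids.length; i++) { urls.push(template + ids[i]); }
// urls -> [ 'feedback.html?id=1010101234' ]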

 

incrediBILL

Msg#: 3817612 posted 3:48 am on Jan 2, 2009 (gmt 0)

Now that I've had time to do a little more research I'm assuming that MSNBOT-MEDIA is looking for Ajax data. With many image galleries using javascript and flash these days they're probably trying to harvest that data.

youfoundjake

Msg#: 3817612 posted 3:55 am on Jan 2, 2009 (gmt 0)

I was gonna nominate this as a featured home page topic, but someone beat me to it.
I think the ramifications of msnbot's ability to traverse javascript are very significant. Over the past few months we've seen Google start to work on Flash and go more in depth with PDFs, and now it looks like Microsoft has raised the bar with the new bot.
How were you able to determine that they were looking for ajax data?

incrediBILL

Msg#: 3817612 posted 4:54 am on Jan 2, 2009 (gmt 0)

How were you able to determine that they were looking for ajax data?

a) based on the type of site, visual niche
b) most image galleries are in javascript or flash
c) it's the media bot

Just speculation obviously, but all the javascript and flash used for galleries is seriously curtailing the media bot's ability to access images, so it makes sense that they would figure out how to traverse the code to get at the image locations.

httpwebwitch

Msg#: 3817612 posted 4:55 am on Jan 2, 2009 (gmt 0)

bots starting to understand javascript, it's very Turingesque... next they'll start rewriting their own algos, developing intelligence, self-awareness, filling out forms, and signing up for their own Gmail addresses. Watch out Flickr, for images of electric sheep!

But seriously, it's good to know that we can no longer hide our data behind a layer of JavaScript + AJAX, and good that the bots can finally start seeing all that hidden content.

incrediBILL

Msg#: 3817612 posted 5:03 am on Jan 2, 2009 (gmt 0)

I'm not sure how deep they can go into javascript as my example is very thin, but it's obviously the harbinger of things to come.

vincevincevince

Msg#: 3817612 posted 6:52 am on Jan 2, 2009 (gmt 0)

Are we sure this isn't linked to the 'phishing protector' data Microsoft are getting now?

hlang

Msg#: 3817612 posted 8:34 am on Jan 2, 2009 (gmt 0)

I was pained when obfuscation stopped working around August of '06. Then images got decoded for text. Next PDFs. Now javascript. How do you give the real human visitor a contact number or email without giving it away to the bots? If you ask them to send their email address to get contact information in return, you're putting one more barrier between you and the customer. Leaving contact info in the open is asking for spam and unsolicited phone calls, and changing your email and phone number periodically is not consistent with a customer-oriented business plan. /rant

incrediBILL

Msg#: 3817612 posted 9:27 am on Jan 2, 2009 (gmt 0)

hlang, you can cloak [webmasterworld.com] the information so that good bots never see it, and leave it obfuscated for all other page views, hopefully defeating the spambots looking for that data.

Alternatively, put your obfuscated contact details on a separate page, perhaps a javascript-activated pop-up, then block that page from being crawled in robots.txt.

The msnbot-media actually checked robots.txt several times today and promptly quit crawling my pages when I added them to the disallowed list.
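
Something along these lines is what I mean - the paths here are just placeholders, not my actual file:

User-agent: *
Disallow: /feedback.html
Disallow: /contact-popup.html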

JS_Harris

Msg#: 3817612 posted 9:46 am on Jan 2, 2009 (gmt 0)

Another possible (probable?) reason...

You're using Firefox and have visited those pages before.

If you have various toolbars from major search engines, you've already agreed to send your browsing information along. Click on "Tools / Page Info" in FF, for example, and you'll see how many times you've visited a page. It stands to reason, in my opinion, that your browser is passing those links along with your browsing history.

edit: Expanding the possibility further - having your browser report the web pages you've seen is in line with Google's recent discussions on improving the net by using visitors' computers instead of just a central database, isn't it?

[edited by: JS_Harris at 9:53 am (utc) on Jan. 2, 2009]

incrediBILL

Msg#: 3817612 posted 9:57 am on Jan 2, 2009 (gmt 0)

Not *MY* browser as I have zero toolbars, they're against my anti-adware religion, but it's possible others have passed this information along.

I didn't consider that before but why only msnbot-media?

Also, one of the pages they are crawling is rarely used, so the ability to pick up that much information about all the thousands of combinations of pages, assuming an MSN toolbar, seems a little far-fetched, albeit in the realm of possibilities.

Anyone got a contact at Live Search? MSNDUDE? anyone that could shed some light on this?

maximillianos

Msg#: 3817612 posted 12:56 pm on Jan 2, 2009 (gmt 0)

I thought Gbot has been able to interpret javascript for years? Maybe I was just hearing rumors?

httpwebwitch

Msg#: 3817612 posted 1:26 pm on Jan 2, 2009 (gmt 0)

M11s, you're right - I heard the same rumour: Gbot is capable of parsing+executing JS, but the same rumour was that they don't actually do it.

If JS-executing bots hitting your REST-ful API is a concern, it isn't hard to write a little spider trap. Something like "for(;;)" ought to do it.

bwnbwn

Msg#: 3817612 posted 1:57 pm on Jan 2, 2009 (gmt 0)

httpwebwitch
filling out forms
I already see Yahoo trying to do this on our sites and it is shutting down our DB. I have contacted Yahoo about this and spoke to a Yahoo rep at PubCon in Vegas on this issue as well. I have spoken to our IT department and they said it is still happening, less frequently, but it still continues.

Google has been following javascript for some time now, so I am not surprised MSN is doing it as well, and I am sure Yahoo is following javascript too.

JonW

Msg#: 3817612 posted 4:25 pm on Jan 2, 2009 (gmt 0)

httpwebwitch & bwnbwn

Google has been filling out forms for a while. Google Research published a paper last spring on the topic: identifying presentation (sorting) versus search-criteria input elements, maximizing coverage of the database while minimizing form submits, and finding keywords for text inputs.

kaled

Msg#: 3817612 posted 4:32 pm on Jan 2, 2009 (gmt 0)

If you don't want an html page to be indexed, then use a noindex meta tag. Relying on bots to not find pages has never been a good idea.
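
For reference, that's the standard tag in the page head:

<meta name="robots" content="noindex">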

Kaled.

incrediBILL

Msg#: 3817612 posted 4:55 pm on Jan 2, 2009 (gmt 0)

If you don't want an html page to be indexed, then use a noindex meta tag.

It was never a problem if the bot found the page as it was just ONE page that would be indexed and they'd go away.

What was previously impossible, and therefore not an issue, was a bot combining the database record IDs in the javascript with the page URL, also in the javascript. That's 30K+ combinations, and with more than one of these types of pages we're talking almost 100K new pages to crawl.

NOINDEX isn't the solution because it only tells the bot not to add the page to its index, not to stop crawling it. I didn't want them there in the first place, didn't want them crawling 100K new pages, and some bots will even follow links marked NOFOLLOW, so a robots.txt disallow was the only way I could see to shut them down.

...or obfuscate the javascript and see what happens! :)

[edited by: incrediBILL at 4:58 pm (utc) on Jan. 2, 2009]

hlang

Msg#: 3817612 posted 5:20 pm on Jan 2, 2009 (gmt 0)

>hlang, you can cloak the information so good bots never see it

Well, it's not the good bots I'm trying to block of course. I already have nofollow on the links to contact pages and noindex in their heads. Some have said you just have to live with it (spam to a company email address) and use client-side filtering.

>...or obfuscate the javascript and see what happens! :)

Now that would probably work.

For a while.

madmatt69

Msg#: 3817612 posted 6:45 pm on Jan 2, 2009 (gmt 0)

Maybe they'll figure out 301's next.

httpwebwitch

Msg#: 3817612 posted 7:31 pm on Jan 2, 2009 (gmt 0)

obfuscate the javascript and see what happens

iBill, hlang,
This would definitely NOT work.

Javascript isn't executed by a dude sitting and looking at the source code. It's run by an engine, probably something like Rhino [mozilla.org] or SpiderMonkey [mozilla.org]. If your browser can execute it, theirs will too, no matter if your code is all scrambled onto one line using bizarre unreadable variable names.

Say you want to protect your AJAX content from being indexed.

1) using "onClick" instead of "href" won't work, because the bots are executing onclicks now, and they've been sniffing out simple ones for quite a while.

2) encrypting/obfuscating the URL with client-side ciphering won't work - a JS-aware bot will find that URL no matter how convoluted or obfuscated or encrypted or enciphered it is; if your users are able to decrypt/deobfuscate/decipher the URL, they will too, because they'll be using the same JS engines as a browser does.

3) cloaking based on the referer won't work, since it's so easy to spoof in an HTTP request, whether it's XHR or not. Besides, a bot visitor will make the same kind of request a real human user would anyway, by executing JS found on the parent page.

4) cloaking based on User-Agent may work, as long as you're dealing with a compliant bot who broadcasts their useragent like Slurp or Googlebot. Blackhat Spiders that do aggressive data scraping often identify themselves as Mozilla (to blend in with normal traffic) or Googlebot (since webmasters tend to welcome Googlebot with open arms, praying for PageRank). So... useragent cloaking is dodgy at best.

5) cloaking based on IP is reliable, but you have to maintain a blacklist (see the rough sketch just below).
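
To make 4) and 5) concrete, a rough server-side sketch - the UA pattern and IP prefix here are placeholders, not a vetted list:

// Rough sketch of UA + IP based cloaking decisions - placeholder values only.
var KNOWN_BOT_UA = /(googlebot|msnbot|slurp)/i;   // compliant bots that announce themselves
var BLACKLISTED_PREFIXES = ['192.0.2.'];          // example prefix - maintain your own list

function shouldServeCloakedPage(userAgent, remoteIp) {
  var claimsToBeBot = KNOWN_BOT_UA.test(userAgent || '');   // 4) only as honest as the UA string
  var onBlacklist = false;                                   // 5) reliable, but needs constant upkeep
  for (var i = 0; i < BLACKLISTED_PREFIXES.length; i++) {
    if ((remoteIp || '').indexOf(BLACKLISTED_PREFIXES[i]) === 0) { onBlacklist = true; }
  }
  return claimsToBeBot || onBlacklist;
}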

I'm concerned for all the applications out there that use JS for client-server interaction. I don't want some stinky bot clicking "delete" buttons, voting for posts, or executing code that acts through an API to add new records to my database. Imagine if a JS-aware bot started crawling facebook? Think how many of those damned "pokes" you'd get.

Now it may be possible to protect content from the bots, IF the content is loaded using 2 layers of AJAX in dynamically-delivered Javascript, AND if the bootstrapping script is loaded asynchronously using XHR. Layer #1 is a bootstrap which loads and executes on the browser, generates a ciphered nonce [en.wikipedia.org], and requests script #2 using that. #2 is the script that loads the content.

You protect #2 by "fanging" - putting "for(;;)" at the beginning of the bootstrap script. The technique is called "fanging" because of the two semicolons that look like fangs :)

The reason this works is that scripts loaded with XHR don't execute as soon as they arrive - you need to eval() them. Well, with fanged output, you just take a substring of the response and strip off the fangs before sending it into eval(). Anyone who isn't defanging (such as a browser or agent who requests the JS directly, not via XHR) will encounter the fangs and get stuck in a wicked little unending loop.
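
The defanging side is only a few lines - a minimal sketch, where the URL and variable names are just placeholders:

// Minimal sketch: request a "fanged" script via XHR and defang it before eval().
// '/scripts/content.js' is a placeholder URL; the server prepends "for(;;)" to it.
var FANG = 'for(;;)';
var xhr = new XMLHttpRequest();
xhr.open('GET', '/scripts/content.js', true);
xhr.onreadystatechange = function () {
  if (xhr.readyState === 4 && xhr.status === 200) {
    var body = xhr.responseText;
    if (body.indexOf(FANG) === 0) {
      body = body.substring(FANG.length);  // strip the fangs
    }
    eval(body);  // anyone loading the script directly just hits the infinite loop
  }
};
xhr.send();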

If you are linking scripts into your <head> the normal way, like <script src="myscript.js">, then fanging doesn't work. But if you employ a bootstrapping technique to load scripts asynchronously, then fanging is a real botbuster. It prevents JSON hijacking, among other things.

Because of the fangs, if the bot requests the bootstrap directly, it'll crash.
But if the bot executes the same script that a real human user would, using XHR and defanging, then the whole technique is moot. They came in the front door.

Of course you could toss some HTTP or SESSION authentication onto your scripts, but that would just plain suck.

Maybe Flash is an option. Are bots executing SWFs? I know they sniff them for text content, but are they executing the ActionScript too?

blend27

Msg#: 3817612 posted 7:38 pm on Jan 2, 2009 (gmt 0)

-- Maybe they'll figure out 301's next. --

Or maybe they should figure out the Robots.txt first.

User-agent: *
Disallow: /availability/
User-agent: msnbot
Disallow: /availability/

is not an open invitation for msnbot-media to fetch files in the /availability/ directory.

maximillianos

Msg#: 3817612 posted 8:51 pm on Jan 2, 2009 (gmt 0)

We have been seeing some of our javascript executed that submits forms for comments. We are getting a lot of "blank" comments lately since the bots are submitting the form with no data.

We fixed the problem with a simple validation, but it's interesting nonetheless that some bots are now starting to venture deeper into our applications.

Folks should start reviewing their scripts and make sure they have plenty of validations in place.
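
Something as simple as this is what I mean - just a sketch, the field names are placeholders and your form handler will differ:

// Sketch only: reject empty or whitespace-only comment submissions before they hit the DB.
function trimText(s) { return (s || '').replace(/^\s+|\s+$/g, ''); }

function isValidComment(fields) {
  return trimText(fields.author).length > 0 && trimText(fields.comment).length > 0;
}

// in the handler, something like:
// if (!isValidComment(submittedFields)) { rejectSubmission(); return; }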

phranque

Msg#: 3817612 posted 5:12 am on Jan 3, 2009 (gmt 0)

User-agent: *
Disallow: /availability/
User-agent: msnbot
Disallow: /availability/

That is incorrect syntax. This might work better:
User-agent: msnbot
Disallow: /availability/

User-agent: *
Disallow: /availability/


You need the blank line after each group, and it is suggested to put the wild-card user-agent exclusion group last.

Although this should do the same thing:
User-agent: *
Disallow: /availability/


blend27

Msg#: 3817612 posted 3:31 pm on Jan 3, 2009 (gmt 0)

Phranque,

Big thanks for the suggestion.

Every other respectable bot respects the robots.txt we have in place, including GoogleBot, Slurp, Gigablast and Ask. Even Cuil digs it.

I will make the changes and see what happens. If I sum up the importance of it, it looks like this:

msnbot:

ignores Robots.txt = yes
contributes to unnecessary server load = yes
contributes to skewed statistics = yes
contributes to scraped content = yes
sends traffic = 2% (let's see if the Microsoft and Dell deal helps, if at all)
Ranking in SERP top 3 vs. visitor conversion to orders = 100% to almost 0

and if it does not stop, I am afraid the next step will be to put

User-agent: msnbot
Disallow: /

there.

Lord Majestic

Msg#: 3817612 posted 3:34 pm on Jan 3, 2009 (gmt 0)

and if it does not stop i am afraid that next step to be is to put

User-agent: msnbot
Disallow: /

Please don't! :o

It will cause Microsoft to issue a profit warning!

Maybe not. ;)

jason1989

Msg#: 3817612 posted 5:16 am on Jan 10, 2009 (gmt 0)

I plan to insert AJAX data into an html page (when the page loads) and would like this data indexed.

Has anyone had their AJAX data indexed by Google?

Thanks!

wilderness

Msg#: 3817612 posted 3:55 pm on Jan 24, 2009 (gmt 0)

Just a heads up.
No Java on my sites.

219.142.53.27 - - [24/Jan/2009:09:18:19 +0000] "GET /robots.txt HTTP/1.0" 403 998 "-" "librabot/1.0 (+http://search.msn.com/msnbot.htm)"
219.142.53.29 - - [24/Jan/2009:09:18:19 +0000] "GET /MyFolder/MyPage.html HTTP/1.0" 403 998 "-" "librabot/1.0 (+http://search.msn.com/msnbot.htm)"

Lain_se

Msg#: 3817612 posted 3:34 am on Jan 27, 2009 (gmt 0)

I have also seen this bot, but unlike all the posts above, has nobody taken the time to trace the IP?

The one that hit my website was also from China and did not even ask for a robots.txt file... rude!

219.142.53.17 librabot/1.0 (+http://search.msn.com/msnbot.htm)
traces to: Data Communication Division / Beijing Telecom

Now I don't know about all you guys, but I have a long history of... shall we say "undesirable traffic" that emanates from China. Until I see a consistent user agent and a traceable IP range for this bot, it's getting blocked!
