I was gonna nominate this as a featured home page topic, but someone beat me to it.
How were you able to determine that they were looking for ajax data?
|How were you able to determine that they were looking for ajax data? |
a) based on the type of site, visual niche
c) it's the media bot
Are we sure this isn't linked to the 'phishing protector' data Microsoft are getting now?
hlang, you can cloak [webmasterworld.com] the information so good bots never see it and leave it obfuscated for all other page views hopefully defeating the spambots looking for that data.
The msnbot-media actually checked robots.txt several times today and promptly quit crawling my pages when I added them to the disallowed list.
Another possible (probable?) reason...
You're using Firefox and have visited those pages before.
If you have various toolbars from major search engines you've already agreed to send your browsing information along. Click on "tools/page info" for example in FF and you'll see how many times you've visited a page. It stands to reason that your browser is passing along links with your browsing history in my opinion.
edit: Expanding the possibility further - having your browser report the web pages you've seen is in line with Google's recent discussions on improving the net by using visitors' computers instead of just a central database, isn't it?
[edited by: JS_Harris at 9:53 am (utc) on Jan. 2, 2009]
Not *MY* browser as I have zero toolbars, they're against my anti-adware religion, but it's possible others have passed this information along.
I didn't consider that before but why only msnbot-media?
Also, one of the pages they are crawling is rarely used, so picking up that much information about all the thousands of combinations of pages, assuming an MSN toolbar, seems a little far-fetched, albeit within the realm of possibility.
Anyone got a contact at Live Search? MSNDUDE? anyone that could shed some light on this?
M11s, you're right - I heard the same rumour: Gbot is capable of parsing+executing JS, but the same rumour was that they don't actually do it.
If JS-executing bots hitting your REST-ful API is a concern, it isn't hard to write a little spider trap. Something like "for(;;)" ought to do it.
httpwebwitch, I already see Yahoo trying to do this on our sites, and it is shutting down our DB. I have contacted Yahoo about this and also spoke to a Yahoo rep at PubCon in Vegas about the issue. Our IT department says it is still happening, though not as frequently.
httpwebwitch & bwnbwn
Google has been filling in our forms for a while. GResearch published a paper last spring on the topic: identifying presentation (sorting) versus search-criteria input elements, maximizing coverage of the database while minimizing form submits, and finding keywords for text inputs.
If you don't want an html page to be indexed, then use a noindex meta tag. Relying on bots to not find pages has never been a good idea.
|If you don't want an html page to be indexed, then use a noindex meta tag. |
It was never a problem if the bot found the page as it was just ONE page that would be indexed and they'd go away.
NOINDEX isn't the solution because that only tells the bot not to add the page to their index, not to stop crawling it. I didn't want them there in the first place, didn't want them crawling 100K new pages, and some will even follow a NOFOLLOW tag so robots.txt disallow was the only way I could see to shut them down.
[edited by: incrediBILL at 4:58 pm (utc) on Jan. 2, 2009]
>hlang, you can cloak the information so good bots never see it
Well, it's not the good bots I'm trying to block of course. I already have nofollow on the links to contact pages and noindex in their heads. Some have said you just have to live with it (spam to a company email address) and use client-side filtering.
Now that would probably work.
For a while.
Maybe they'll figure out 301's next.
This would definitely NOT work.
Say you want to protect your AJAX content from being indexed.
1) using "onClick" instead of "href" won't work, because the bots are executing onclicks now, and they've been sniffing out simple ones for quite a while.
2) encrypting/obfuscating the URL with client-side ciphering won't work - a JS-aware bot will find that URL no matter how convoluted or obfuscated or encrypted or enciphered it is; if your users are able to decrypt/deobfuscate/decipher the URL, they will too, because they'll be using the same JS engines as a browser does.
3) cloaking based on the referer won't work, since it's so easy to spoof in an HTTP request, whether it's XHR or not. Besides, a bot visitor will make the same kind of request a real human user would anyways, by executing JS found on the parent page
4) cloaking based on User-Agent may work, as long as you're dealing with a compliant bot who broadcasts their useragent like Slurp or Googlebot. Blackhat Spiders that do aggressive data scraping often identify themselves as Mozilla (to blend in with normal traffic) or Googlebot (since webmasters tend to welcome Googlebot with open arms, praying for PageRank). So... useragent cloaking is dodgy at best.
5) cloaking based on IP is reliable, but you have to maintain a blacklist
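For point 5, here is a minimal sketch of what checking a visitor against an IP blacklist could look like. The function names (`ipv4ToInt`, `inCidr`, `isBlocked`) and the address ranges are made up for illustration (the ranges are reserved documentation blocks, not real bot networks), and it only handles IPv4.

```javascript
// Convert a dotted-quad IPv4 string to an unsigned 32-bit integer.
function ipv4ToInt(ip) {
  const parts = ip.split(".");
  if (parts.length !== 4 || parts.some((p) => !/^\d{1,3}$/.test(p) || Number(p) > 255)) {
    throw new Error("not an IPv4 address: " + ip);
  }
  const [a, b, c, d] = parts.map(Number);
  return ((a << 24) | (b << 16) | (c << 8) | d) >>> 0;
}

// True if `ip` falls inside `cidr` (e.g. "192.0.2.0/24").
// An entry without a "/" is treated as a single address (/32).
function inCidr(ip, cidr) {
  const [base, bitsText] = cidr.split("/");
  const bits = bitsText === undefined ? 32 : Number(bitsText);
  if (!Number.isInteger(bits) || bits < 0 || bits > 32) {
    throw new Error("bad prefix length in: " + cidr);
  }
  const mask = bits === 0 ? 0 : (0xffffffff << (32 - bits)) >>> 0;
  return ((ipv4ToInt(ip) & mask) >>> 0) === ((ipv4ToInt(base) & mask) >>> 0);
}

// Hypothetical blacklist; a real one has to be maintained by hand or from a feed.
const BLACKLIST = ["192.0.2.0/24", "198.51.100.0/24"];

function isBlocked(ip, blacklist = BLACKLIST) {
  return blacklist.some((cidr) => inCidr(ip, cidr));
}
```

The upkeep is the real cost here: the check is only as good as the list behind it.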
I'm concerned for all the applications out there that use JS for client-server interaction. I don't want some stinky bot clicking "delete" buttons, voting for posts, or executing code that acts through an API to add new records to my database. Imagine if a JS-aware bot started crawling facebook? Think how many of those damned "pokes" you'd get.
You protect #2 by "fanging" - putting "for(;;)" at the beginning of the bootstrap script. The technique is called "fanging" because of the two semicolons that look like fangs :)
The reason this works is that scripts loaded with XHR don't execute as soon as they arrive - you need to eval() them. Well, with fanged output, you just take a substring of the response and strip off the fangs before sending it into eval(). Anyone who isn't defanging (such as a browser or agent who requests the JS directly, not via XHR) will encounter the fangs and get stuck in a wicked little unending loop.
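A rough sketch of the defanging step, assuming the server prefixes every response with exactly `for(;;);`. The names `FANGS`, `defang` and `parseFangedJson` are hypothetical, and for a JSON payload I've swapped in `JSON.parse` where the description above uses eval().

```javascript
// Assumed prefix; the exact bytes must match whatever the server prepends.
const FANGS = "for(;;);";

// Strip the fangs from a response body fetched via XHR.
// Throws if the prefix is missing, so an unexpected response is never executed.
function defang(responseText) {
  if (!responseText.startsWith(FANGS)) {
    throw new Error("response is not fanged");
  }
  return responseText.slice(FANGS.length);
}

// For JSON payloads, parse the defanged text instead of eval()-ing it.
function parseFangedJson(responseText) {
  return JSON.parse(defang(responseText));
}
```

Usage would be along the lines of calling `parseFangedJson(xhr.responseText)` in the XHR completion handler. Anything that loads the same URL as a plain script never gets past the loop.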
If you are linking scripts into your <head> the normal way, like <script src="myscript.js">, then fanging doesn't work. But if you employ a bootstrapping technique to load scripts asynchronously, then fanging is a real botbuster. It prevents JSON hijacking, among other things.
Because of the fangs, if the bot requests the bootstrap directly, it'll crash.
But if the bot executes the same script that a real human user would, using XHR and defanging, then the whole technique is moot. They came in the front door.
Of course you could toss some HTTP or SESSION authentication onto your scripts, but that would just plain suck.
Maybe Flash is an option. Are bots executing SWF's? I know they sniff them for text content, but are they executing the Actionscript too?
-- Maybe they'll figure out 301's next. --
Or maybe they should figure out the Robots.txt first.
is not an open invitation to the file in directory /availability/ by msnbot-media
We fixed the problem with a simple validation, but it is interesting nonetheless that some bots are now starting to venture deeper into our applications.
Folks should start reviewing their scripts and make sure they have plenty of validations in place.
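The post doesn't say what the validation was, so the following is only one guess at the kind of check meant: refuse any state-changing action that arrives by GET or without the visitor's session token. The function name and the `request` shape (`method`, `token`) are assumptions for illustration, not any framework's API.

```javascript
// Methods that should never change anything on the server.
const SAFE_METHODS = new Set(["GET", "HEAD", "OPTIONS"]);

// Decide whether a request is allowed to delete, vote, insert, etc.
// `request` is a plain object: { method: "POST", token: "..." }.
// `sessionToken` is the secret the server issued to this visitor's session.
function mayChangeState(request, sessionToken) {
  if (SAFE_METHODS.has(String(request.method).toUpperCase())) {
    return false; // a crawler following links or script URLs ends up here
  }
  if (typeof sessionToken !== "string" || sessionToken.length === 0) {
    return false; // no session, nothing to validate against
  }
  return request.token === sessionToken;
}
```

A bot that runs the page's own JS inside a live session would still get through, so this only stops the ones fetching URLs blind.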
|User-agent: * |
that is incorrect syntax.
this might work better:
|User-agent: msnbot |
you need the blank line after each group and it is suggested to put the wild card user agent exclusion group last.
although this should do the same thing:
|User-agent: * |
Big thanks for the suggestion.
Every other respectable BOT respects the Robots.txt we have in place including GoogleBot, Slurp, Gigablast and Ask. Even Cuil digs it.
I will make the changes and see what happens. If I sum up the importance of it, it looks like this:
ignores Robots.txt = yes
contributes to unnecessary server load = yes
contributes to skewed statistics = yes
contributes to scraped content = yes
sends traffic = 2% (let's see whether the Microsoft and Dell deal helps, if at all)
Ranking in SERP Top 3 VS. visitor Conversion to Orders = 100% to almost 0
and if it does not stop i am afraid that next step to be is to put
|and if it does not stop i am afraid that next step to be is to put |
Please don't! :o
It will cause Microsoft to issue profit warning!
Maybe not. ;)
I plan to insert AJAX data into an html page (when the page loads) and would like this data indexed.
Has anyone had their AJAX data indexed by Google?
Just a heads up.
No Java on my sites.
18.104.22.168 - - [24/Jan/2009:09:18:19 +0000] "GET /robots.txt HTTP/1.0" 403 998 "-" "librabot/1.0 (+http://search.msn.com/msnbot.htm)"
22.214.171.124 - - [24/Jan/2009:09:18:19 +0000] "GET /MyFolder/MyPage.html HTTP/1.0" 403 998 "-" "librabot/1.0 (+http://search.msn.com/msnbot.htm)"
I have also seen this bot. Has nobody in the posts above taken the time to trace the IP?
The one that hit my website was also from China and did not even ask for a robots.txt file... rude!
126.96.36.199 librabot/1.0 (+http://search.msn.com/msnbot.htm)
traces to: Data Communication Division / Beijing Telecom
Now I don't know about all you guys, but I have a long history of... shall we say "undesirable traffic" that emanates from China. Until I see a consistent "user agent" and a traceable IP range for this bot, it's getting blocked!
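One way to automate that kind of trace is a forward-confirmed reverse DNS check: look up the hostname for the IP, make sure it sits under a domain the bot's operator controls, then resolve that hostname and confirm it points back at the same IP. The sketch below takes the resolver as a parameter (Node's `require("dns").promises` has `reverse` and `resolve4` methods of this shape). The function name and the trusted suffix you pass in are assumptions; check the search engine's own documentation for its real crawler hostnames.

```javascript
// Forward-confirmed reverse DNS check for a claimed crawler.
// `resolver` must provide reverse(ip) and resolve4(hostname), both returning promises.
// `trustedSuffixes` is a list such as [".crawler.example"] (hypothetical).
async function isVerifiedCrawler(ip, trustedSuffixes, resolver) {
  let hostnames;
  try {
    hostnames = await resolver.reverse(ip);
  } catch (err) {
    return false; // no PTR record, or the lookup failed
  }
  for (const host of hostnames) {
    const name = host.toLowerCase();
    if (!trustedSuffixes.some((suffix) => name.endsWith(suffix))) {
      continue; // hostname is not under a trusted domain
    }
    try {
      const addresses = await resolver.resolve4(name);
      if (addresses.includes(ip)) {
        return true; // forward lookup confirms the reverse lookup
      }
    } catch (err) {
      // forward lookup failed; try the next hostname
    }
  }
  return false;
}
```

A user agent string alone proves nothing, since anyone can send one. The DNS round trip is much harder to fake.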