Forum Moderators: Robert Charlton & goodroi
Our members here also note that many of these domains use an odd character instead of a dot - possibly part of the scheme. Also, these spam domains may be from anywhere, and not necessarily China at all.
Warning for those doing detective work - many of these spam domains will attempt to install malware on your computer.
...nine of the top ten results are these weird Chinese sites....the more specific and detailed the search request, the more likely Google is to list these Chinese sites. The issue has apparently been reported to Google, but if the basic algorithms allow this sort of result, even banning the specific sites will not stop this sort of abuse.[news.yahoo.com...]
There are more people talking about Google and the spam sites now.
Did a search for it in yahoo and other forums and papers are talking about it, also referring to MC and the infrastructure change...
Glad we could get this going :) Might help all of us! Or not....
Chinese domains are more prominent, but the SERPs are affected by pages in Chinese, Japanese, and Korean characters, as well as in ANY other language.
The plague is more general as Google currently pulls up sites that have some of the lowest quality profiles (IMO) in some industries.
Low quality websites replace all sites that have not yet gained the status of authority.
Basically I think that the .cn domains are just a consequence, not the source of the problem.
I remember short periods of time during which Google showed similar results (never been so widespread tho), used to stick for a few hours/days, but now it's been almost 2 weeks so they must have a serious problem at the plex.
Maybe a solar powered office is not so efficient after all :)
But I'll do what I always do when something unusual happens. Be creative and speculate on the possibilities of who, why, and how. So, in a blink of an eye, I suddenly find myself with other shoes on:
If you were to launch a full scale attack on google and/or other serps - you would not just do it without having some knowledge on what does what and how to do it with the resources you have available. You would run different test scenarios on some random keyword combinations in different overlooked areas where people wouldn't notice it or think it was so unusual... just another bad result, you know the deal.
You would probably also have bought an exploit for it on the black market, one classified as "serious", and you would run the aforementioned tests and scenarios beforehand so that the attack would have the most possible impact on the target before the exploit was fixed and the scenario uncovered and blocked from that very second onward.
An infestation like the one currently unfolding would require a lot of resources to set up - think of the number of visitors and the traffic. If it was private persons, I would suspect they had a lot of zombie computers (trojan or virus infested machines that can be controlled from their main one) which were used for the purpose and took some of the traffic load. There could be a multitude of reasons for it: more zombies, money, ideology, not enough medication, you name it - you never know what private people are up to.
If it was a commercial attack, I would suspect somebody who has a grudge against Google for whatever reason you can think of that involves money and/or market share and/or ideology. I wouldn't suspect any usual competitor of pulling a trick like that: too high a risk for a brand if it could be traced back to them in any way - and with something as large scale as this there would be a way. It would just take some time, but it would be done.
If it was a message from somebody somewhere to the search engines, then I would suspect they would not use .cn domains, but rather use the results to display some arbitrary message about whatever the great purpose was. That pretty much rules that out on the surface anyways.
This pretty much rules out the China option (unless it's yet another one of those "we don't like America" statements), though the .cn domain extension could be a decoy for political reasons, to make people react against China for instance. (Worse things have been done throughout history from all sides of anywhere... propaganda, you know.)
I'll give up just around here - because it gets more and more unpleasant with the options and ideas I get for what and who did and how and why - so i'll leave it there. All of the above is pure speculation, there are no facts or references to support any of it - just got curious as usual and my thoughts wandered.
Google, and other serps which are affected have a serious problem and some cleaning to do - that means overtime, and perhaps they can't even use a robot for it. Let's hope they fix it anyways - and if they don't - the internet will survive. I'll go consider writing a novel about this and make a conspiracy theory - The Google Code or something...;)
Sincerely, and have fun,
My money's on 2.
Incidentally, I have noticed that a couple of parked domains that I track (because I want them ;-) ) have suddenly become parked domains that try to install malware. Coincidence? Or is it part of a large-scale malware push? Is something even bigger brewing?
I agree, but I think you misunderstand my point. :-)
If a bank leaves its safe open and unattended, and they have some of the money stolen from it, the absolute first thing to do is to shut the safe door, lock it and guard it. Only then would you actually go after the criminals.
All I was saying was that if these particular attacks are the result of some kind of loophole in the way Google indexes sites, closing that loophole ought to be the top priority. If it isn't closed, then it will continue to be exploited, even if the original exploiters are caught.
What I still don't understand, and this might be down to the Universal Search changes, is how newly registered domains can get into the index so fast. 3-4 weeks and they're in, when it takes months for a clean site to get indexed? How can anybody exploit a thing like that?
It seems that when they turned on the Universal Search, which in my opinion is just another term for Universal garbage, that they skipped all filters and everything else. They just let everything slide, new sites, spam sites, authority sites... everything....
Maybe it is just a Google test to gather visitor data and they didn't know that they let the Ugly beast show its head when they implemented an inverse algo.....
The .[space]cn thing has got to be the easiest one to fix, now that people here have told them how to read the [space]. I'm glad WebmasterWorld exists and I'm hoping that GOOG continues to read our solutions to the problems :)
< using the query that Dvorak wrote about >
Google result #4
----------------
Title: <a href="http://subdomain.spamdomain.cn/" class=l onmousedown="return clk(this.href,'','','res','4','')">ASCII CHARACTERS</a>
Translate this page: <a href="http://translate.google.com/translate?hl=es&sl=en&u=http://subdomain.spamdomain.cn/&..." class=fl>Traduzca esta página</a>
Display URL: <span class=a>subdomain.spamdomain[UNICODE CHARACTER]cn/ - </span>
Similar pages: <a class=fl href="/search?hl=es&client=firefox-a&rls=org.mozilla:en-US:official&hs=WOp&q=related:subdomain.spamdomain[UNICODE CHARACTER]cn/">Páginas similares</a>
The Title link and the "translate this page" link point to a normal dot cn domain (.cn); the display URL and the similar pages link have the weird Unicode character.
If someone placed a link to "subdomain.spamdomain.cn" on some website, I think it's nearly impossible that Google added the weird Unicode character by itself. So one possible explanation is that someone placed a link to "subdomain.spamdomain[UNICODE CHARACTER]cn" on some website, or that he pointed the href at the correct domain without the Unicode character and put the domain with the Unicode character in the anchor text.
One fact is that somehow the domain with the unicode character must have been fed to Google.
So when Google spiders the link, it gets one of:
A. The Unicode link: it stores this as the display URL, detects that it is Unicode, and translates it back to a normal .cn for the title link in the SERPs.
B. The href link: it stores this as the title link in the SERPs and uses the anchor text domain name with the Unicode character for the display URL (weird, but who knows ...)
More facts:
If you run a "site:spamdomain.cn" search in Google you get 12 subdomains indexed.
If you run a "site:spamdomain[UNICODE CHARACTER]cn" search in Google you get 29 subdomains indexed. Some are also included in the normal .cn site search and some are not.
So now the main mystery is how they inserted this data into the Google index. A few lines above I suggested one possibility: place links around the web and wait for Google to spider the .cn websites. But I haven't found any such link. Maybe they used some other technique, like submitting the URL with the UNICODE character through "Add your URL to Google"?
[edited by: tedster at 6:27 pm (utc) on Sep. 26, 2007]
[edit reason] remove some specifics [/edit]
"[URL]" +".cn"
6 MILLION PAGES.... lotsa' links :) Not all of them go to these spam sites but I bet a lot of them do.
Maybe this is the result of GOOG wanting to be able to show all sites that have been updated in the last 24 hours? A thing MC was so proud of. Sure seems to be a thing coming from blog spam (blogs being another thing they like!)....
But I did find some links last week to other, similar sites, whose traces eventually led to '.cn' URLs (as those who tested the method in early to mid summer simply returned to post more links!). Do the same query on Yahoo!, and find a domain that doesn't say .cn but, for instance, .info. Do a query on the URL, and see for yourself.
For me, the first example was < a domain >, which was mentioned on a retail site, in a customer review.
Along with about 500 .cn domains.
Only the cache remains though, the webmaster removed the spam.
The rest didn't surprise me either.
Trackbacks, blog comments, forum posts, reviews, anything and everything that could be spammed has been spammed with these fake URLs.
The date is from late July, as with everything else regarding the .info batch of this stuff. The .cn batch mostly dates to August.
...
...
OK I know this isn't fair, might even be against the Forum Charter but... look at this page < link removed -tedster >.
There's your PR and TrustRank propagated. Nanotechnology sites... wow, are there a lot of puns in there. Btw. this too was... no, IS a host of both the first ( .info ) and the second ( almostdotcn ) wave.
I'd put the blame at least partially on any webmaster who lets things like this happen, leaves forums, comments and other user generated areas unattended and unmoderated, and can't even delete a page properly.
And it's not a website I'd expect to be so easily spammed either.
They seem to have delinked the entire page, but since it's still online, and there's a blog still linking directly to it, Google will continue to cache it, and include it in link calculations.
[edited by: tedster at 5:56 pm (utc) on Sep. 26, 2007]
The key here is the weird Unicode character. I haven't found any link with this character yet, but I think this is the main trick for reaching the top of the SERPs.
Google is coded (if I remember well) in C, so they must have functions like:
int domain_in_sandbox(char *domain);
int other_type_of_filter(char *domain);
Note: This is just an example; the following theory applies to any programming language.
So when they apply the filters to the domains with Unicode, maybe the routines don't handle the domain names well and the following happens:
domain_in_sandbox("spam-domain.cn") returns TRUE
but
domain_in_sandbox("spam-domain[UNICODE]cn") returns FALSE, so Google doesn't apply any spam (or who-knows-what) filters, thinks all the backlinks are valid, thinks the keywords in the website are valid as well, and so positions this website at the top of the SERPs.
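To make the theory concrete, here is a minimal sketch of that failure mode. It is only an illustration, not anything known about Google's code: SANDBOXED, canonical_host, domain_in_sandbox_naive and domain_in_sandbox are hypothetical names, and U+3002 (ideographic full stop) is assumed as a stand-in for [UNICODE]. The three dot-like characters are the ones RFC 3490 says must be treated as label separators.

```python
# Hypothetical sketch: a filter keyed on raw strings misses an alternate spelling.

# Dot-like characters that RFC 3490 treats as label separators (besides ".").
IDNA_DOTS = ("\u3002", "\uff0e", "\uff61")

SANDBOXED = {"spam-domain.cn"}  # hypothetical filter list, stored in ASCII form


def canonical_host(host: str) -> str:
    """Replace the IDNA dot variants with an ASCII dot and lowercase the name."""
    for dot in IDNA_DOTS:
        host = host.replace(dot, ".")
    return host.lower()


def domain_in_sandbox_naive(host: str) -> bool:
    # Compares the raw string, so any alternate spelling slips through.
    return host in SANDBOXED


def domain_in_sandbox(host: str) -> bool:
    # Canonicalises first, so both spellings hit the same entry.
    return canonical_host(host) in SANDBOXED


odd = "spam-domain\u3002cn"  # assumed stand-in for spam-domain[UNICODE]cn
print(domain_in_sandbox_naive("spam-domain.cn"))  # True
print(domain_in_sandbox_naive(odd))               # False
print(domain_in_sandbox(odd))                     # True
```

If something like this is what happens, the fix is cheap: canonicalise the host name once, before any filter sees it.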
PS: I am still hoping to see a link to a [UNICODE]cn domain.
In Yahoo Site Explorer, if you do a search for < a certain unicode cn domain >, the websites that are returned contain a link without the UNICODE char, so both Yahoo and Google may be doing the translation of the domain names somewhere.
[edited by: tedster at 5:58 pm (utc) on Sep. 26, 2007]
[edit reason] remove specifics [/edit]
I think you are right! I think this is a result of several things and one of them being the new Universal Search (Universal Garbage as I call it) and that they had to skip some filters to make it work.
With the filters not in place they let all types of stuff into the SERPs, without doing an "ocular review" of the results. They kept filters in place so that "authority sites" didn't get slammed (couldn't afford amazon getting angry....), but other sites got hammered with penalties, while yet others got to the top after just a few weeks.
The .cn sites are a result of this Universal Search and as if these problems were not enough, then you have all the other things that GOOG doesn't know what to do with right now, hence an infrastructure change.... They have a much bigger problem than just the .cn sites right now if you ask me and I can't imagine what the Plex looks like now... some heads might be rolling....
crobb305,
A lot of them are legit .cn domains, which is OK, but hundreds of thousands, if not millions, are spam sites.
I have reverse-engineered the kind of redirect the Chinese spam sites do, and they do the typical JS redirect:
<!--
var l1lll1l1l="HkEBrEAfbeXezzLuQpxwirxxHQrJSCZsSckCLPvPTsCTLdwOkhXJMRQCcFUdsBbMN";
var lll1l1l1lll1l="%2C%04%267%1F%20%2F%12L%097%06%1B%0E%25%1A%3FM%5F%1F%1D%06%08Bg%7E
%06%22653%03%20%06%0A1%2F8X3%3B%1El%279%10%05%2ED%016d%2E58%7C%07%233%05%06%2E%16ju";
eval(unescape("function%20ew%28l1ll1lll1l%2C%20l11l1l1l1l1llll%29%20%7Bvar%20result
%20%3D%20%27%27%3Bfor%28i%20%3D%200%3B%20i%20%3C%20l1ll1lll1l%2Elength%3B%20i%2B%2B
%29result%20%2B%3D%20String%2EfromCharCode%28l11l1l1l1l1llll%2EcharCodeAt%28i%20%25
%20l11l1l1l1l1llll%2Elength%29%20%5E%20l1ll1lll1l%2EcharCodeAt%28i%29%29%3Breturn
%20result%3B%7Deval%28ew%28l1lll1l1l%2Cunescape%28lll1l1l1lll1l%29%29%29%3B"));
// -->
This is a simple eval() and unescape() trick plus some XOR encoded strings, so all this translates to:
document.location='http://example.com/sutra/in.cgi?default';
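For anyone who wants to check this kind of script without running it in a browser, the decoder hidden inside the eval() can be mirrored offline. This is a rough sketch: xor_decode and make_pair are hypothetical helper names, and the round-trip demo uses the already-sanitised example.com line rather than the original strings. Nothing is ever passed to eval; the result is only printed.

```python
from urllib.parse import unquote_to_bytes


def xor_decode(data: str, escaped_key: str) -> str:
    """Mirror of the script's inner function: XOR data with the repeating key."""
    key = unquote_to_bytes(escaped_key)  # stands in for JavaScript unescape() on %XX
    return "".join(chr(ord(ch) ^ key[i % len(key)]) for i, ch in enumerate(data))


def make_pair(plaintext: str, mask: str) -> tuple[str, str]:
    """Build a (data, escaped_key) pair for the demo; the mask is arbitrary."""
    data = (mask * (len(plaintext) // len(mask) + 1))[: len(plaintext)]
    escaped_key = "".join(f"%{ord(p) ^ ord(d):02X}" for p, d in zip(plaintext, data))
    return data, escaped_key


# Round trip with the sanitised redirect line from the post.
payload = "document.location='http://example.com/sutra/in.cgi?default';"
data, escaped_key = make_pair(payload, "HkEBrEAf")
print(xor_decode(data, escaped_key) == payload)  # True

# First four characters of the two strings posted above.
print(xor_decode("HkEB", "%2C%04%267"))  # docu
```

The last line is a quick sanity check on the posted strings: their first four characters decode to "docu", the start of "document.location".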
It's just a simple document.location JS redirect, so where is this human-like bot that follows JS, which Google is supposed to have?
Or did this [UNICODE]cn thing also skip the JS-following bot filters?
[edited by: tedster at 6:47 pm (utc) on Sep. 26, 2007]
I hope that Google finds a fix that does not tighten up on the sandbox effect, now that it's become a bit friendlier. It does give me some extra caution when a site wants to launch a subdomain - the future may be a bit cloudy.
Also, these spammers certainly took the idea of the long tail search seriously - so seriously that they automated their approach.
The algorithm to convert IDN domain names to ASCII is standard and public. To make a DNS query or HTTP request you need the ASCII domain name, but for the end user it is better to display [CHINESE-LETTERS].cn than "xn--ub1a.cn".
So, if you want to check the IDN to ASCII conversion, get the GNU libidn installed into your Linux box, so you can see this:
hal9000:~# CHARSET='UTF-8' idn --quiet --idna-to-ascii 'example-domain[UNICODE DOT]cn'
Result of translation: example-domain.cn
So this is the reason why these IDN domains are showing up in the index: as I noted before, for the title link Google uses the ASCII translation of the domain, and for the display URL Google uses the IDN domain name.
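If you don't have libidn at hand, the same conversion can be tried with the "idna" codec built into Python, which implements the RFC 3490 ToASCII step. A small sketch, assuming [UNICODE DOT] is U+3002 (ideographic full stop); the second domain is just an illustrative name of my own, not one from this thread.

```python
# Assumption: [UNICODE DOT] is U+3002 (ideographic full stop).
idn_name = "example-domain\u3002cn"

# The built-in "idna" codec splits on the IDNA dot variants and converts each label.
ascii_name = idn_name.encode("idna").decode("ascii")
print(ascii_name)  # example-domain.cn

# A label with non-ASCII letters becomes a Punycode "xn--" label instead.
print("b\u00fccher.cn".encode("idna").decode("ascii"))  # xn--bcher-kva.cn
```

So a name whose only odd character is the dot collapses to a perfectly ordinary ASCII domain, which fits the title link in the SERPs pointing to the plain .cn site.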
I am sure they spam their websites this way:
<a href="http://example-domain[UNICODE DOT]cn">mr spock, we have a looooong tail to index</a>
So when Google spiders it, it gets the IDN domain in the href, and stores both the IDN and the ASCII translation somewhere inside Google.
I think the main flaw here is that somewhere inside the Google core, some function checks either the IDN or the ASCII name of the domain and is not able to relate the two, so these IDN websites are not triggering the spam filters.
Imaginary example:
We have some system that checks whois data; this system uses just ASCII names.
Another system (the spam filters) asks for whois data but feeds in an IDN domain name. The whois system returns "domain does not exist" or something like that, so the spam filter doesn't raise its spam flag.
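The imaginary example can be written down in a few lines. Everything here is made up for illustration: WHOIS_DB, its single record, the 90-day threshold, and the function names are all hypothetical, and U+3002 is again assumed for the odd dot.

```python
from typing import Optional

# Invented record for illustration; a real whois backend would sit here.
WHOIS_DB = {"example-domain.cn": {"age_days": 20}}


def whois_lookup(name: str) -> Optional[dict]:
    """ASCII-only backend: returns None for any key it does not hold."""
    return WHOIS_DB.get(name)


def looks_like_new_domain(name: str, to_ascii: bool) -> bool:
    """Hypothetical spam check: flag domains younger than 90 days."""
    key = name.encode("idna").decode("ascii") if to_ascii else name
    record = whois_lookup(key)
    if record is None:
        return False  # "domain not found" is silently treated as "nothing to flag"
    return record["age_days"] < 90


idn_name = "example-domain\u3002cn"
print(looks_like_new_domain(idn_name, to_ascii=False))  # False
print(looks_like_new_domain(idn_name, to_ascii=True))   # True
```

The weak spot in this sketch is the "return False" on a missing record: a lookup failure gets treated the same as a clean result, so the spam flag is never raised.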
If someone has some other theories please post here.