Google's CN Domain Spam Plague - now noted by John Dvorak - Google Search and SEO forum at WebmasterWorld

Forum Moderators: Robert Charlton & goodroi

tedster

5:59 pm on Sep 25, 2007 (gmt 0)

In our September Google SERP Changes [webmasterworld.com] discussions, members have been discussing the plague of spam sites using a cn domain extension. This has gone on long enough that John Dvorak of PC Magazine has now published an article about it. Pretty high profile exposure for an embarrassing problem that is apparently quite challenging for Google's infrastructure!

Our members here also note that many of these domains use an odd character instead of a dot - possibly part of the scheme. Also, these spam domains may be from anywhere, and not necessarily China at all.

Warning for those doing detective work - many of these spam domains will attempt to install malware on your computer.

...nine of the top ten results are these weird Chinese sites....the more specific and detailed the search request, the more likely Google is to list these Chinese sites. The issue has apparently been reported to Google, but if the basic algorithms allow this sort of result, even banning the specific sites will not stop this sort of abuse.
[news.yahoo.com...]

gehrlekrona

2:53 am on Sep 26, 2007 (gmt 0)

Yeah, I know browsers and crawlers are different, but they should be able to detect a javascript redirect.

There are more people talking about Google and the spam sites now.
Did a search for it in yahoo and other forums and papers are talking about it, also referring to MC and the infrastructure change...

Glad we could get this going :) Might help all of us! Or not....

followgreg

5:16 am on Sep 26, 2007 (gmt 0)

I don't think that the issue is specifically .cn related.
From where I stand it's a much wider issue for a couple of weeks.

Chinese domains are more prominent, but SERPs are affected by anything with Chinese, Japanese or Korean characters, as well as ANY language.

The plague is more general as Google currently pulls up sites that have some of the lowest quality profiles (IMO) in some industries.
Low quality websites replace all sites that have not yet gained the status of authority.

Basically I think that the .cn domains are just a consequence, not the source of the problem.

I remember short periods of time during which Google showed similar results (never been so widespread tho), used to stick for a few hours/days, but now it's been almost 2 weeks so they must have a serious problem at the plex.

Maybe a solar powered office is not so efficient finally :)

zafile

6:28 am on Sep 26, 2007 (gmt 0)

IMHO, someone bought all of Dvorak's Christmas gifts in order to get the story placed in PC Magazine. I wonder who?

RandomDot

6:37 am on Sep 26, 2007 (gmt 0)

Google is in a lot of trouble if it hits the news the wrong way and people begin to associate their brand with more negativity... it could probably be spun, but that would require the usual Judas character of every myth: the betrayer, the sinner, the devious, the guilty. They'd have some trouble making that happen.

But I'll do what I always do when something unusual happens. Be creative and speculate on the possibilities of who, why, and how. So, in a blink of an eye, I suddenly find myself with other shoes on:

If you were to launch a full scale attack on google and/or other serps - you would not just do it without having some knowledge on what does what and how to do it with the resources you have available. You would run different test scenarios on some random keyword combinations in different overlooked areas where people wouldn't notice it or think it was so unusual... just another bad result, you know the deal.

You would probably also have bought an exploit on it on the black market which was classified as "serious" and you would do the before mentioned tests and different scenarios beforehand so that it would have the most impact possible on the target before the exploit was fixed, and the scenario uncovered and blocked from that very second and every day forward.

An infestation like the one currently unfolding would require a lot of resources to set up - think of the number of visitors and traffic. If it was private persons, I would suspect they had a lot of zombie computers (trojan or virus infested machines which could be called from their main one) which were used for the purpose and took some of the traffic load, and there could be a multitude of reasons for it. Could be more zombies, money, ideology, not enough medication, you name it - never know what private people are up to.

If it was a commercial attack, I would suspect somebody who has a grudge against Google for whatever reason you can think of which involves money and/or market share and/or ideology - but I wouldn't suspect any usual competitor of pulling a trick like that. Too high a risk for a brand if it could be tracked back to them in any way - and there would be a way with something as large scale as this - it would just take some time, but it would be done.

If it was a message from somebody somewhere to the search engines, then I would suspect they would not use .cn domains, but rather use the results to display some arbitrary message about whatever the great purpose was. That pretty much rules that out on the surface anyways.

This pretty much rules out the China option (unless it's yet another one of those "we don't like America" statements), though the .cn domain extension could be a decoy for political reasons, to make people react against China for instance. (Worse things have been done throughout history from all sides of anywhere... propaganda, you know.)

I'll give up just around here - because it gets more and more unpleasant with the options and ideas I get for what and who did it and how and why - so I'll leave it there. All of the above is pure speculation, there are no facts or references to support any of it - just got curious as usual and my thoughts wandered.

Google, and other serps which are affected have a serious problem and some cleaning to do - that means overtime, and perhaps they can't even use a robot for it. Let's hope they fix it anyways - and if they don't - the internet will survive. I'll go consider writing a novel about this and make a conspiracy theory - The Google Code or something...;)

Sincerely, and have fun,

callivert

8:00 am on Sep 26, 2007 (gmt 0)

There are two possible motives for this. Both are plausible.
1. The people behind this wanted to damage Google and possibly Yahoo.
2. They don't care about Google or Yahoo, they just wanted all that traffic for a couple of weeks.

My money's on 2.

Incidentally, I have noticed that a couple of parked domains that I track (because I want them ;-) ) have suddenly become parked domains that try to install malware. Coincidence? Or is it part of a large-scale malware push? Is something even bigger brewing?

gibbergibber

9:32 am on Sep 26, 2007 (gmt 0)

--The conditions always exist for criminals to make money. The day that banks don't exist, bank robbers will be out of business. That doesn't mean that we have to be fatalistic and just give up all hope. --

I agree, but I think you misunderstand my point. :-)

If a bank leaves its safe open and unattended, and they have some of the money stolen from it, the absolute first thing to do is to shut the safe door, lock it and guard it. Only then would you actually go after the criminals.

All I was saying was that if these particular attacks are the result of some kind of loophole in the way Google indexes sites, closing that loophole ought to be the top priority. If it isn't closed, then it will continue to be exploited, even if the original exploiters are caught.

maygle

10:24 am on Sep 26, 2007 (gmt 0)

Lord Majestic, you are right, the problem is that it is too cheap: about $1/7.5 for the first year to get a .cn domain.

tedster, I found a related/useful link on the current topic that I want to share with others.

callivert

10:35 am on Sep 26, 2007 (gmt 0)

if these particular attacks are the result of some kind of loophole in the way Google indexes sites, closing that loophole ought to be the top priority. If it isn't closed, then it will continue to be exploited, even if the original exploiters are caught.

True. The scary thing with this is both the sophistication and the scale of the operation. I wonder how many people would have both the knowledge and the resources to do this.

Bones

12:11 pm on Sep 26, 2007 (gmt 0)

Looks like quite a lot less .cn spam at the moment. Can anyone confirm?

The site search mentioned earlier in this thread isn't returning any results for me now.

gehrlekrona

1:01 pm on Sep 26, 2007 (gmt 0)

The .[space]cn problem might have been taken care of but all the other cn domains are still there.
I have a search I do and the only thing that has happened is that there are more of them. GOOG is probably working overtime to get this fixed since it is bad PR for them, especially now when it has hit the public.

What I still don't understand, and this might be Universal Search changes that did it, is that how can newly registered domains get into the index so fast? 3-4 weeks and they're in, when it takes months for a clean site to get indexed? How can anybody exploit a thing like that?
It seems that when they turned on the Universal Search, which in my opinion is just another term for Universal garbage, that they skipped all filters and everything else. They just let everything slide, new sites, spam sites, authority sites... everything....
Maybe it is just a Google test to gather visitor data and they didn't know that they let the Ugly beast show its head when they implemented an inverse algo.....
The .[space]cn thing has got to be the easiest one to fix now that people here have told them how to read the [space]. I'm glad WebmasterWorld exists and I'm hoping that GOOG continues to read our solutions to the problems :)

Alvaro

1:03 pm on Sep 26, 2007 (gmt 0)

This morning I found some spare time and analyzed some Chinese spam results. I found the following:

< using the query that Dvorak wrote about >

Google result #4
----------------
Title: <a href="http://subdomain.spamdomain.cn/" class=l onmousedown="return clk(this.href,'','','res','4','')">ASCII CHARACTERS</a>
Translate this page: <a href="http://translate.google.com/translate?hl=es&sl=en&u=http://subdomain.spamdomain.cn/&..." class=fl>Traduzca esta página</a>
Display URL: <span class=a>subdomain.spamdomain[UNICODE CHARACTER]cn/ - </span>
Similar pages: <a class=fl href="/search?hl=es&client=firefox-a&rls=org.mozilla:en-US:official&hs=WOp&q=related:subdomain.spamdomain[UNICODE CHARACTER]cn/">Páginas similares</a>

The Title link and the "translate this page" link point to a normal dot cn domain (.cn); the display URL and the "similar pages" link have the weird unicode character.

If someone placed a link to "subdomain.spamdomain.cn" on some website, I think it's nearly impossible that Google adds the weird unicode character by itself. So one possible explanation is that someone placed a link to "subdomain.spamdomain[UNICODE CHARACTER]cn" on some website, or that he pointed the href at the correct domain without unicode and put the domain with the unicode character in the anchor text.

One fact is that somehow the domain with the unicode character must have been fed to Google.

So when Google spiders the link, it gets one of:

A. The unicode link. It stores this as the display URL, detects that it is unicode and translates it back to normal .cn for the title link in the SERPs.

B. The href link. It stores this as the title link in the SERPs and uses the anchor text domain name with unicode for the display URL (weird, but who knows ...).

More facts:

If you do a "site:spamdomain.cn" search in Google you get 12 subdomains indexed.

If you do a "site:spamdomain[UNICODE CHARACTER]cn" search in Google you get 29 subdomains indexed. Some are also included in the normal .cn site search and some are not.

So now the main mystery is how they inserted this data into the Google index. A few lines above I suggested one possibility: place links on the web and wait for Google to spider the .cn websites. But I haven't found any such link. Maybe they used some other technique, like submitting the URL with the UNICODE character through "Add your URL to Google"?

[edited by: tedster at 6:27 pm (utc) on Sep. 26, 2007]
[edit reason] remove some specifics [/edit]

gehrlekrona

1:29 pm on Sep 26, 2007 (gmt 0)

Do a search for:

"[URL]" +".cn"

6 MILLION PAGES.... lotsa' links :) Not all of them go to these spam sites but I bet a lot of them do.
Maybe this is the result of GOOG wanting to be able to show all sites that have been updated in the last 24 hours? A thing MC was so proud of. Sure seems to be a thing coming from blog (another thing they like!) spam....

Miamacs

1:44 pm on Sep 26, 2007 (gmt 0)

You can't really track those, for the characters are screwed up cross and sideways.

But I did find some links last week to other, similar sites, whose traces eventually led to '.cn' URLs (those who tested the method in early to mid summer simply returned to post more links!). Do the same query on Yahoo!, and find a domain that doesn't say .cn but, for instance, .info. Do a query on the URL, and see for yourself.

For me, the first example was < a domain >, which had been mentioned on a retail site, as a customer review.

Along with about 500 .cn domains.
Only the cache remains though, the webmaster removed the spam.

The rest didn't surprise me either.
Trackbacks, blog comments, forum posts, reviews, anything and everything that could be spammed has been spammed with these fake URLs.

Date is from late July, as with all else regarding the .info batch of this stuff. .cn mostly dates to August.

...

OK I know this isn't fair, might even be against the Forum Charter but... look at this page < link removed -tedster >.

There's your PR and TrustRank propagated. Nanotechnology sites... wow, are there a lot of puns in there. Btw. this too was... no, IS a host of both the first ( .info ) and the second ( almostdotcn ) wave.

I'd put the blame at least partially on any webmaster who lets things like this happen, leaves forums, comments and other user generated areas unattended and unmoderated, and can't even delete a page properly.

And it's not a website I'd expect to be so easily spammed either.

They seem to have delinked the entire page, but since it's still online, and there's a blog still linking directly to it, Google will continue to cache it, and include it in link calculations.

[edited by: tedster at 5:56 pm (utc) on Sep. 26, 2007]

thetrasher

1:44 pm on Sep 26, 2007 (gmt 0)

Warning for those doing detective work - many of these spam domains will attempt to install malware on your computer.

KLIK gang's VideoAccessCodecInstall.exe trojan - that's what they try to install.

gehrlekrona

2:16 pm on Sep 26, 2007 (gmt 0)

They have tried to spam my site as well, but my code checks it before it saves to the database and doesn't save it, and if anything slips by, I delete everything spammy from the database before anything shows up.
I am thinking that they do some SQL injection but I haven't found the place yet :(

Alvaro

3:10 pm on Sep 26, 2007 (gmt 0)

With the info Miamacs posted here I followed some traces that lead to the origin of the spam, and it seems to be the typical "post my URL everywhere" plus auto-generated websites.

The key here is the weird unicode character. I haven't found any link with this character so far, but I am thinking this is the main trick to reach the top of the SERPs.

Google is coded (if I remember well) in C, so they must have functions like:

int domain_in_sandbox(char *domain);
int other_type_of_filter(char *domain);

Note: This is just an example; the following theory is valid for any programming language.

So when they apply the filters to the domains with unicode, maybe the routines don't handle the domain names well and the following happens:

domain_in_sandbox("spam-domain.cn") returns TRUE

but

domain_in_sandbox("spam-domain[UNICODE]cn") returns FALSE, so Google doesn't apply any spam or who-knows-what filters, thinks all the backlinks are valid, and thinks the keywords in the website are valid as well, so it positions this website at the top of the SERPs.

PS: I would still like to see a link to a [UNICODE]cn domain
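
To make the theory above concrete, here is a minimal Python sketch of a filter that is keyed on the exact domain string. Everything in it is hypothetical: nobody in this thread knows how Google's filters are written, the SANDBOXED set is made-up data, and the odd character is assumed (purely for illustration) to be U+3002 IDEOGRAPHIC FULL STOP, since the thread never identifies it.

```python
# Hypothetical list of domains a spam/sandbox filter already knows about.
SANDBOXED = {"spam-domain.cn"}


def domain_in_sandbox(domain: str) -> bool:
    """Naive filter: exact string match, no normalisation of the name."""
    return domain in SANDBOXED


# Assumption for illustration only: the "odd character" is U+3002.
ODD_DOT = "\u3002"

ascii_form = "spam-domain.cn"
unicode_form = "spam-domain" + ODD_DOT + "cn"

print(domain_in_sandbox(ascii_form))    # the plain .cn name is caught
print(domain_in_sandbox(unicode_form))  # the look-alike name slips through
```

If a filter really compared raw strings like this, the two spellings would be treated as unrelated domains even though both resolve to the same host.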

netmeg

3:27 pm on Sep 26, 2007 (gmt 0)

The .[space]cn problem might have been taken care of but all the other cn domains are still there.

Taken care of how? As of 30 seconds ago, I still see it in the same search query I used when I originally ran across it, ten days ago or so.

Alvaro

3:40 pm on Sep 26, 2007 (gmt 0)

I have found this:

in Yahoo site explorer if you do a search for < a certain unicode cn domain >

The websites that are returned contain a link without the UNICODE char, so both Yahoo and Google may be doing the translation of the domain names somewhere.

[edited by: tedster at 5:58 pm (utc) on Sep. 26, 2007]
[edit reason] remove specifics [/edit]

crobb305

4:06 pm on Sep 26, 2007 (gmt 0)

all the other cn domains are still there.

55,000,000 shown when searching site:.cn

pageoneresults

4:10 pm on Sep 26, 2007 (gmt 0)

Here comes the Tin Hat...

Did someone/something manage to unleash some sort of virus in the indices? I mean, these things surely exhibit the behavior of a virus. Did something get inside The Gorg and also The Hoo! and is now wreaking havoc?

Jean-Luc and his crew did it.

gehrlekrona

4:14 pm on Sep 26, 2007 (gmt 0)

followgreg,
"Basically I think that the .cn domains are just a consequence, not the source of the problem. "

I think you are right! I think this is a result of several things and one of them being the new Universal Search (Universal Garbage as I call it) and that they had to skip some filters to make it work.
With the filters not in place they let all types of stuff into the SERPs, without doing an "ocular review" of the results. They kept filters in place where "authority sites" didn't get slammed (couldn't afford amazon getting angry....) but other sites got hammered with penalties, while other sites got up to the top after just a few weeks.
The .cn sites are a result of this Universal Search and as if these problems were not enough, then you have all the other things that GOOG doesn't know what to do with right now, hence an infrastructure change.... They have a much bigger problem than just the .cn sites right now if you ask me and I can't imagine what the Plex looks like now... some heads might be rolling....

crobb305,
A lot of them are legit .cn domains which is OK but hundreds of thousands, if not millions are spam sites

potentialgeek

4:47 pm on Sep 26, 2007 (gmt 0)

Has any big internet security site issued a warning yet?

p/g

P.S. What better way to damage Google's brand than for people to think using it will get their computers infected or hard drive wiped out?! It better move quickly.

Alvaro

6:39 pm on Sep 26, 2007 (gmt 0)

UPDATE:

I have reverse-engineered the kind of redirect the Chinese spam sites do, and they do the typical JS redirect:

This is a simple eval() and unescape() trick plus some XOR encoded strings, so all this translates to:

document.location='http://example.com/sutra/in.cgi?default';

It is just a simple document.location JS redirect, so where is this human-like, JS-following bot that Google has?

Or did this [UNICODE]cn thing also skip the JS-following bot filters?
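
For anyone who wants to see how little work the unescape() plus XOR layer is, here is a rough Python sketch of that kind of unwrapping. The original obfuscated script was removed from this thread, so the key value and the helper names (xor_text, obfuscate, deobfuscate) are assumptions for illustration; percent-encoding via urllib stands in for JavaScript's escape()/unescape().

```python
from urllib.parse import quote, unquote

KEY = 0x05  # hypothetical single-byte XOR key; the real one is not in the thread


def xor_text(text: str, key: int) -> str:
    """XOR every character code with the key (the operation is its own inverse)."""
    return "".join(chr(ord(ch) ^ key) for ch in text)


def obfuscate(script: str, key: int) -> str:
    """What a spammer might do: XOR the script, then percent-escape it."""
    return quote(xor_text(script, key), safe="")


def deobfuscate(blob: str, key: int) -> str:
    """Reverse the steps: unescape first, then XOR with the same key."""
    return xor_text(unquote(blob), key)


redirect = "document.location='http://example.com/sutra/in.cgi?default';"
blob = obfuscate(redirect, KEY)
recovered = deobfuscate(blob, KEY)

print(blob)       # unreadable at a glance
print(recovered)  # the plain redirect statement again
```

A crawler that only reads the page source sees the blob; anything that actually executes the script, or undoes the two steps as above, sees the redirect.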

[edited by: tedster at 6:47 pm (utc) on Sep. 26, 2007]

jimbeetle

7:02 pm on Sep 26, 2007 (gmt 0)

The .cn sites are a result of this Universal Search

Okay, what about Yahoo?

tedster

7:14 pm on Sep 26, 2007 (gmt 0)

I'm also thinking that, in addition to exploiting IDNs, this spam again exploits Google's efforts to give the new site an initial break, especially for the domain root. A year ago we saw subdomain spam that took such advantage, and in the present case, it's always the subdomain home page that is getting ranked.

I hope that Google finds a fix that does not tighten up on the sandbox effect, now that it's become a bit friendlier. It does give me some extra caution when a site wants to launch a subdomain - the future may be a bit cloudy.

Also, these spammers certainly took the idea of the long tail search seriously - so seriously that they automated their approach.

Alvaro

7:45 pm on Sep 26, 2007 (gmt 0)

I am starting to find some more interesting things, and this could lead to understanding why the exploit is working in Google and Yahoo.

The algorithm to convert IDN domain names to ASCII is standard and public. To make a DNS query or HTTP request you need the ASCII domain name, but for the end user, it is better to display [CHINESE-LETTERS].cn than "xn--ub1a.cn"

So, if you want to check the IDN to ASCII conversion, install GNU libidn on your Linux box and you can see this:

hal9000:~# CHARSET='UTF-8' idn --quiet --idna-to-ascii 'example-domain[UNICODE DOT]cn'

Result of translation: example-domain.cn
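
If you don't have libidn at hand, roughly the same check can be done with Python's built-in "idna" codec, which implements the older IDNA 2003 rules. This is only a sketch: the helper name idn_to_ascii is made up, and U+3002 IDEOGRAPHIC FULL STOP is an assumed stand-in for the [UNICODE DOT], since the actual character is not named in this thread.

```python
def idn_to_ascii(name: str) -> str:
    """Convert an internationalised domain name to its ASCII form
    using Python's standard-library 'idna' codec."""
    return name.encode("idna").decode("ascii")


# Assumption: the dot look-alike is U+3002, which IDNA treats as a label separator.
UNICODE_DOT = "\u3002"

idn_name = "example-domain" + UNICODE_DOT + "cn"
print(idn_to_ascii(idn_name))  # example-domain.cn
```

The point is that the conversion is cheap and deterministic, so both spellings can always be reduced to the same ASCII name.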

So this is the reason why these IDN domains are showing up in the index: as I noted before, for the title link Google uses the ASCII translation of the domain, and for the display URL Google uses the IDN domain name.

I am sure they spam their websites in this way:

<a href="http://example-domain[UNICODE DOT]cn">mr spock, we have a looooong tail to index</a>

So when Google spiders the page it gets the IDN domain in the href, and stores the IDN and the ASCII translation somewhere inside Google.

I think that the main flaw here is that somewhere inside Google's core, some function is checking either the IDN or the ASCII name of the domain and is not able to relate the two, so these IDN websites are not triggering the spam filters.

Imaginary example:

We have some system that checks whois data; this system uses just ASCII names.

Another system (the spam filters) asks for whois data but feeds in an IDN domain name. The whois system will return "domain nonexistent" or something like that, so the spam filter doesn't raise its spam flag.
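
The imaginary example above, and the obvious fix, can be sketched in a few lines of Python. All of it is hypothetical: the WHOIS table, its contents and the function names are invented for illustration, and U+3002 is again only an assumed stand-in for the unicode dot.

```python
# Hypothetical whois store, keyed on ASCII domain names only.
WHOIS = {"example-domain.cn": {"status": "placeholder record"}}


def canonical(name: str) -> str:
    """Reduce any spelling of a domain to one ASCII key (stdlib 'idna' codec)."""
    return name.encode("idna").decode("ascii").lower()


def whois_raw(name: str):
    """The flawed lookup: uses the name exactly as it was handed over."""
    return WHOIS.get(name)


def whois_normalised(name: str):
    """The repaired lookup: canonicalise first, then look up."""
    return WHOIS.get(canonical(name))


idn_name = "example-domain\u3002cn"  # assumed unicode-dot spelling

print(whois_raw(idn_name))         # None, so the filter thinks the domain is unknown
print(whois_normalised(idn_name))  # the record for example-domain.cn
```

If the flaw is anything like this, the fix is to canonicalise once at the point where URLs enter the system, rather than in each filter separately.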

If someone has some other theories please post here.

Lord Majestic

8:16 pm on Sep 26, 2007 (gmt 0)

document.location='http://example.com/sutra/in.cgi?default';

Note the word "sutra" - this is a sure sign that this spam is of Russian origin: they use Chinese domains because they are cheap and can't be easily shut down.

tedster

8:45 pm on Sep 26, 2007 (gmt 0)

I think I've tracked some of these domains to other countries in Eastern Europe, but so far no Russia. However, they are pretty good at hiding their tracks, so no guarantees there. The word "sutra" originates from Sanskrit - so programmers from India would also be a possibility. But as I said earlier, this is not necessarily the work of a country - more likely that of individuals, whatever their agenda.

acemi

8:58 pm on Sep 26, 2007 (gmt 0)

Had a look at 15-20 .cn results for an obscure search returning 68 results - 60 of which are this type of URL.
* The titles of the pages are all titles of recent blog posts.
* None of the domains appears to be on a shared ip
* Most of the sites are hosted in Latvia (Riga)
* Most of the domains were registered between 14 and 19 September
* Most seem to use the same nameserver (cnmsn)

jimbeetle

9:05 pm on Sep 26, 2007 (gmt 0)

Just reran Dvorak's search and it looks like G's been able to nuke most of them, just 4 remaining at the moment -- though it's now showing a boatload of dupe copies of Dvorak's piece -- proves subdomain spam still works on G ;-). Yahoo's still returning a bunch of them.
