Forum Moderators: bill

Chinese Simplified and Traditional dupe content issue

Getting ignored by Baidu

6:35 am on Aug 29, 2008 (gmt 0)

New User

I have two sites in Chinese: one in Simplified Chinese on a .cn domain, and one in Traditional Chinese on a .hk domain.

Right now we're getting penalized by Baidu for dupe content. I was thinking about blocking Baidu from crawling one of the sites, most likely the .hk one, in the hope that this will boost our rankings for the .cn site, which is obviously the larger market. However, Yahoo has always sent decent traffic to the .hk domain, so if there were any danger of also blocking Yahoo, I'd need a serious rethink of this strategy. On top of which, I've heard Baidu doesn't necessarily pay any attention to 'ignore' requests anyway.

I'd welcome any previous experience with the Simplified vs. Traditional Chinese issue, as I'm really stuck on this one after spending a lot of time and effort getting the sites set up and translated in the first place.

Has anyone been here before?

7:04 am on Aug 29, 2008 (gmt 0)

Administrator from JP (bill)

Are you certain that it's a dupe content penalty?

I've got similar content sites in Simplified and Traditional, but they're not close enough to be called dupes, so that's not going to help you much, I'm afraid. Yahoo and Google have been very good at ranking the sites for their respective content. Baidu has always favored the Simplified site, but that's the older, more established site for me anyway.

7:58 am on Aug 29, 2008 (gmt 0)

New User

Thanks Bill, but we're fairly confident it's a dupe content penalty, as the sites are pretty much the same.

However, since the original post it has transpired that we don't know how to block Baidu from one of these sites even if we wanted to.

A more pertinent question might actually be: is there an equivalent of the robots.txt exclusion standard that Baidu recognizes?

8:29 am on Aug 29, 2008 (gmt 0)

Administrator from JP (bill)

Have you tried Baidu's help page?
[baidu.com...]

If baiduspider is still not checking or obeying robots.txt as has been reported [webmasterworld.com], then you might want to use .htaccess to ban them.
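
If it comes to that, something along these lines in the .hk site's .htaccess should do it. Just a rough sketch, assuming an Apache server with mod_rewrite enabled; adjust to your own setup:

# Return 403 Forbidden to any request whose User-Agent contains "Baiduspider"
# ([NC] makes the match case-insensitive)
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} Baiduspider [NC]
RewriteRule .* - [F]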

8:51 am on Aug 29, 2008 (gmt 0)

New User

Thanks again for your advice.

We had previously included this in our robots.txt:

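# Block Baidu's crawler from the whole site. Baidu's published user agent
# token is "Baiduspider"; robots.txt matching is case-insensitive, so the
# lowercase form below should still match.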
User-agent: baiduspider
Disallow: /

But this didn't seem to make any difference. If they've since developed their bots to recognize this directive, then fantastic!

I'll give it another whirl and report back...

Cheers!

9:11 am on Aug 29, 2008 (gmt 0)

Administrator from JP (bill)

We'd appreciate that. Baiduspider has a history of problems, and it would be good to get confirmation one way or another of their compliance with robots.txt.

4:12 pm on Sept 1, 2008 (gmt 0)

Full Member

Hi, I can confirm that Baidu complies with robots.txt.

11:00 am on Oct 31, 2008 (gmt 0)

New User

Well, it looks like Baidu does comply with robots.txt. Since we re-implemented the disallow rule, Baiduspider visits to our .hk site have dropped right off. We've re-submitted the .cn site now, so we'll be interested to see whether dupe content was indeed the issue here.

I'll report back the results in a few weeks. In the meantime, if anyone knows of a "Webmaster Tools" type area within Baidu, I'd be keen to hear about it; it would hopefully give us a little more insight into what's going on.

Cheers

11:03 am on Oct 31, 2008 (gmt 0)

New User

Sorry, to make that clearer: crawling activity from Baiduspider on our .hk site has dropped right off, confirming that Baiduspider is robots.txt compliant.

10:54 pm on Nov 3, 2008 (gmt 0)

Junior Member

Strange. I have two sites with identical content: example.com in Simplified and a subdomain, tw.example.com, in Traditional, and I haven't been penalized at all... unless I'm not checking correctly.

I've had these sites for several years (4+), though, so maybe that has helped?

I also wonder if the encoding might affect it. I don't know if I'm completely off base here, but I use GB2312 and Big5 rather than UTF-8 throughout.

2:57 am on Nov 5, 2008 (gmt 0)

New User

Interesting point.

We're using UTF-8 across both sites, though looking at popular sites, the majority do seem to specify either the GB2312 or Big5 charset. I wonder if this could be part of the issue.
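
If we do switch, I'm assuming the declarations would look something like this (guessing that the charset is set via a meta tag; our pages currently declare UTF-8 the same way):

<!-- Simplified Chinese site (.cn) -->
<meta http-equiv="Content-Type" content="text/html; charset=gb2312">

<!-- Traditional Chinese site (.hk) -->
<meta http-equiv="Content-Type" content="text/html; charset=big5">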

Thanks! I'll take a look...

4:28 am on Nov 5, 2008 (gmt 0)

Administrator from JP (bill)

The GB2312 and Big5 charsets are more likely to be displayed correctly on a wider array of devices. There aren't any modern PC browsers that have issues with encoding these days, but some of the older ones did have problems with UTF-8 support. Also, not all of your content will be viewed by a PC browser. There are mobile browsers and other services out there that may not display UTF-8 correctly.

If you do UTF-8 right, then in most cases there won't be an issue for the spiders or for the majority of your viewers. In a market like China there are still occasional visitors using very ancient software, though. For some, that is an issue they would prefer to avoid. Using the GB2312 or Big5 charsets is a safe bet.
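
For what it's worth, "doing UTF-8 right" largely comes down to making sure the HTTP Content-Type header and the in-page declaration agree. A minimal sketch for Apache (assuming you can edit httpd.conf or the site's .htaccess):

# Append "charset=UTF-8" to the Content-Type header of text responses
# so browsers and spiders don't have to guess the encoding
AddDefaultCharset UTF-8

with a matching meta tag in the pages themselves:

<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">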