Forum Moderators: Robert Charlton & goodroi

Google and Duplicate Content

What about robots.txt?

moftary

10:58 pm on Aug 25, 2005 (gmt 0)

10+ Year Member



Hello WWers,

I was banned by Google last month and have failed to be re-included since. The first suspected reason for the ban was duplicate content, so as advised by several webmasters here, I have modified robots.txt so that Googlebot should no longer index that duplicate content.

Now the $32,000 question is: should Google ban you for this even though you are telling their crawlers not to spider this content?

Cheers,
mOftary

MarkHutch

6:55 am on Aug 27, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Please explain how you are using robots.txt to remove duplicate content. I've heard talk of this before, but I haven't yet seen any examples of how it's done.

moftary

8:12 am on Aug 27, 2005 (gmt 0)

10+ Year Member




User-Agent: Googlebot
Disallow: /duplicated_content

in robots.txt

reseller

8:47 am on Aug 27, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



moftary

>>Now the $32,000 question is: should Google ban you for this even though you are telling their crawlers not to spider this content?<<

Of course not. It's Google's fault, not yours, if they keep indexing pages you have asked them not to index.

However, I would have also added the following meta tag:

<META NAME="GOOGLEBOT" CONTENT="NOINDEX, FOLLOW">

Just to be sure ;-)

moftary

9:21 am on Aug 27, 2005 (gmt 0)

10+ Year Member



However, I would have also added the following meta tag:

<META NAME="GOOGLEBOT" CONTENT="NOINDEX, FOLLOW">

Hard to implement on thousands of static pages. :)

Anyhow, Googlebot finally seems to have understood robots.txt. It isn't indexing those pages any more, but the ban remains active. Thoughts?

moftary

9:25 am on Aug 27, 2005 (gmt 0)

10+ Year Member



I think there should be a way to tell Google to delete the suspected duplicate content from its index, not just block it with robots.txt. Oh, the delete-your-site-and-get-back-in-180-days route, you say? ;)

HarryM

10:14 am on Aug 27, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Are you sure it is just duplicate content that has resulted in your site being banned? Also, what do you mean by "banned"? Is the site penalized, or has it simply lost traffic?

My suspicion is that Google looks for anomalies where a site has SEO features outside the norm: e.g. a higher than average level of duplicate content, a disproportionate amount of hidden text in alt attributes or keywords in unnecessary meta tags, folders banned to search engines, the overuse of Hx tags compared to the total amount of text on the page, etc. One or two anomalies can be accidental or white hat, but with too many anomalies a site may be considered spammy or over-SEO'd.

Having duplicate content on your site and then taking measures to stop Google indexing it may in itself raise Google's suspicions.

moftary

11:43 am on Aug 27, 2005 (gmt 0)

10+ Year Member



HarryM, you cannot tell why you got banned (by "banned" I mean [site:site.com] returns nothing) when Google insists on the "we cannot provide any individual assistance" response.

Anyway, I use absolutely no SEO tricks, white hat or black. Nothing but some suspected duplicate content and some site-wide links. I took the site-wide links down, although I don't know why Google would ban a site for that when every major company uses site-wide links to market their websites (check devshed or internet dot com).

Having duplicate content on your site and then taking measures to stop Google indexing it may in itself raise Google's suspicions.

Please elaborate! Should I drop this content just for the sake of Google? What about my site visitors?
AFAIK, Google should only be concerned with its own index, but that is the heart of this thread anyway.

Isn't it enough to tell Googlebot not to spider the suspected duplicate content?

BTW, these robots.txt modifications were made in order to get rid of the ban. They are certainly not causing it, as I had an allow-all robots.txt before.

g1smd

5:53 pm on Aug 27, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Check also that non-www requests are redirected (301) to www for all pages; otherwise that too is duplicate content.
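A minimal sketch of that redirect for Apache, assuming mod_rewrite is available and with example.com standing in for your own domain (this form is for .htaccess):

RewriteEngine On
# send any request on the bare domain to the www hostname with a permanent (301) redirect
RewriteCond %{HTTP_HOST} ^example\.com$ [NC]
RewriteRule ^(.*)$ http://www.example.com/$1 [R=301,L]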

zeus

6:59 pm on Aug 27, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



The problem is that sometimes, when a site is filtered for duplicate content, the 301 trick can take 6-12 months before it takes effect. I'm still waiting (5 months), and Googlebot is not visiting so often either; old caches don't get updated, pages that no longer exist are still in the index, and I could go on.

HarryM

7:20 pm on Aug 27, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Please elaborate! Should I drop this content just for the sake of Google?

That decision is completely up to you. Nobody can tell you with any certainty what Google likes or dislikes. I just prefer to play it safe. When I lost all my traffic (twice this year) I removed anything that might be considered black hat, including all my print pages. They might have been seen as duplicate content, and if I had banned Google from them that might also have aroused suspicion. The traffic is now back, although there is no guarantee that what I did had anything to do with it.

But I prefer to stay squeaky clean and on side with Google - there is no point in considering your visitors if there are no search engines to drive traffic to your site.

g1smd

7:27 pm on Aug 27, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



The 301 redirect from non-www to www can be helped along by making a fake sitemap that lists all the URLs that you want removed from the index. You then host that sitemap on another site so that Google revisits the URLs and "sees" the redirect. Things will be fixed within a couple of months if you do that.
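As a rough illustration of such a sitemap (example.com is a placeholder, and the namespace shown is the Google Sitemaps 0.84 schema in use at the time), it is just a flat list of the non-www URLs you want Google to revisit:

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.google.com/schemas/sitemap/0.84">
  <!-- each loc is a non-www URL that now 301-redirects to its www equivalent -->
  <url><loc>http://example.com/</loc></url>
  <url><loc>http://example.com/widgets.html</loc></url>
  <url><loc>http://example.com/blue-widgets.html</loc></url>
</urlset>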

Your print pages could be kept out of the index by putting <meta name="robots" content="noindex"> on each one. It is very disconcerting to click on a search result and have your printer start up before the page has finished loading on screen.

HarryM

8:52 pm on Aug 27, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Your print pages could be kept out of the index by putting <meta name="robots" content="noindex"> on each one

Yes and no. Google knows the URL of these pages because it will have followed a link to them. It should follow the noindex rule, but the URL will still lurk somewhere in the depths of Googleland. And from time to time, especially when Google does a rollback, some of the print pages may appear in the SERPs - URL only, without a snippet. I have had some print pages pop up in the SERPs after a year has elapsed.

Also, just because Google follows the noindex rule, it doesn't necessarily follow that Google hasn't crawled the page.

The way to go is not to have separate print pages but to use a CSS solution.
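A rough sketch of that CSS approach (the element IDs here are only examples): link a print stylesheet from the page head and hide the screen-only furniture, so one URL serves both screen and printer and there is no duplicate page to worry about.

In the page <head>:

<link rel="stylesheet" type="text/css" href="/print.css" media="print">

And in print.css:

/* hide navigation, sidebar and ads when the page is printed */
#navigation, #sidebar, #ads { display: none; }
body { background: #fff; color: #000; }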