Forum Library, Charter, Moderators: Robert Charlton & aakk9999 & brotherhood of lan & goodroi

Google SEO News and Discussion Forum

    
How to Locate Internal Duplicate Content?
synthese




msg:4466980
 1:29 am on Jun 19, 2012 (gmt 0)

My site has been languishing since Panda 1 (Feb 2011) in spite of an incredible amount of change (rewriting content, removing/redirecting thin content, nailing down technical issues, redesign, improving engagement).

In fact it took a further blow in the Panda update of June 8. Some folks have suggested it might be suffering a dupe content penalty due to this:

A site: command on Google shows around 900 pages. When you page thru the results you get to about 540 and then get the message "...we have omitted some results very similar..." When you click "repeat the search", the actual listed results are the same (about 540).

So what's with the phantom 360 odd pages?

 

louieramos




msg:4466996
 2:20 am on Jun 19, 2012 (gmt 0)

Is your site a blog, and does it have tags and multiple categories?

mslina2002




msg:4467004
 2:51 am on Jun 19, 2012 (gmt 0)

"...we have omitted some results very similar..."


Usually means dupe content.

When you click on that you will be thrown back to page 1 again. Click through again until you get to the last page that G will show you. On the last few pages you can usually see why you have dupes.

g1smd




msg:4467099
 5:51 am on Jun 19, 2012 (gmt 0)

Change the listings to 100 URLs per page.

Save all pages of both results sets. Compare the two listings to find out which pages are dropped.

I recently did that by cutting off the page header and footer from each page and then joining all the results together in order in two files and then running DIFF against the two files.
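The comparison described above can also be done without DIFF once the URLs are in two lists. This is only a rough sketch: it assumes you have already extracted one URL per line from each saved result set, and the function name and sample URLs are made up for illustration.

```python
def dropped_urls(unfiltered, filtered):
    """Return URLs that appear in the unfiltered listing but not in the
    filtered one, keeping the order in which they were first seen."""
    kept = {line.strip() for line in filtered if line.strip()}
    seen = set()
    dropped = []
    for line in unfiltered:
        url = line.strip()
        if url and url not in kept and url not in seen:
            seen.add(url)
            dropped.append(url)
    return dropped


# Hypothetical sample data standing in for the two saved listings.
with_omitted = [
    "http://www.example.com/",
    "http://www.example.com/page-a",
    "http://www.example.com/page-a?print=1",
]
default_listing = [
    "http://www.example.com/",
    "http://www.example.com/page-a",
]

for url in dropped_urls(with_omitted, default_listing):
    print(url)
```

In practice the two lists would come from the saved pages, for example by reading each joined file line by line.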

phranque




msg:4467113
 6:31 am on Jun 19, 2012 (gmt 0)

a search like this will be helpful:
http://www.google.com/search?q=site%3Aexample.com&safe=off&filter=0&num=100

i think you will have to turn off Instant for the &num=100 to work.
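For anyone building that URL for several domains, here is a small sketch that reproduces the query string from the post. The parameter names (safe, filter, num) are taken from the URL above; the function name is made up, and whether Google honours these parameters at any given time is not something the code can promise.

```python
from urllib.parse import urlencode


def site_query_url(domain, per_page=100, unfiltered=True):
    """Build a site: search URL with the parameters used in the post."""
    params = [("q", f"site:{domain}"), ("safe", "off")]
    if unfiltered:
        # filter=0 is the parameter the post uses to include omitted results.
        params.append(("filter", 0))
    params.append(("num", per_page))
    return "http://www.google.com/search?" + urlencode(params)


print(site_query_url("example.com"))
```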

"...we have omitted some results very similar..."

this can also mean links were followed to robots.txt-excluded urls and the "snippet-less"/url-only results start to look "very similar".

how many pages do you expect to have indexed?
have you specified and/or submitted a sitemap?
have you crawled the site?

g1smd




msg:4467114
 6:39 am on Jun 19, 2012 (gmt 0)

Another thing to check.

site:example.com -inurl:www
site:www.example.com


One of those should return zero results.

phranque




msg:4467117
 6:50 am on Jun 19, 2012 (gmt 0)


site:example.com -inurl:www
site:www.example.com

One of those should return zero results.


...unless you are also using other subdomains such as blog.example.com, secure.example.com, etc...
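If you have already saved the listed URLs, the same www / non-www / other-subdomain check can be run locally by grouping them by scheme and hostname. A minimal sketch follows; it assumes http://www.example.com is the canonical form, and the function names and sample URLs are hypothetical.

```python
from collections import Counter
from urllib.parse import urlsplit


def scheme_host_counts(urls):
    """Count listed URLs per (scheme, hostname) pair."""
    return Counter((urlsplit(u).scheme, urlsplit(u).hostname) for u in urls)


def non_canonical(urls, canonical=("http", "www.example.com")):
    """Return URLs whose scheme or hostname differs from the canonical pair."""
    return [u for u in urls
            if (urlsplit(u).scheme, urlsplit(u).hostname) != canonical]


# Hypothetical listing copied from a site: search.
listing = [
    "http://www.example.com/a",
    "http://www.example.com/b",
    "https://example.com/",
    "http://blog.example.com/post-1",
]

for pair, count in scheme_host_counts(listing).items():
    print(pair, count)
print(non_canonical(listing))
```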

driller41




msg:4467195
 2:34 pm on Jun 19, 2012 (gmt 0)

Feb 2011, that is a year and a half ago - perhaps it is time to move on and build a new site if this one is still not performing - just a thought.

realmaverick




msg:4467197
 2:36 pm on Jun 19, 2012 (gmt 0)

One of the best ways is to take a sentence of text from your page and do

site:www.example.com "insert sentence here"

Ensure you click to view omitted results too.
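To run that exact-phrase check over many pages, the query can be generated from the page text. This is just a sketch: picking the longest sentence and trimming it to 12 words are arbitrary assumptions of mine, not part of the advice above, and the function name is made up.

```python
import re


def exact_phrase_query(host, page_text, max_words=12):
    """Build a site: query that searches for one quoted sentence of the page."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", page_text) if s.strip()]
    if not sentences:
        raise ValueError("page_text contains no sentences")
    # Assumption: a longer sentence is more likely to be unique to the page.
    longest = max(sentences, key=lambda s: len(s.split()))
    # Drop double quotes so the phrase cannot break the quoted query.
    phrase = " ".join(longest.replace('"', "").split()[:max_words])
    return f'site:{host} "{phrase}"'


print(exact_phrase_query(
    "www.example.com",
    "Short one. This is the much longer sentence to search for.",
))
```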

synthese




msg:4467343
 8:25 pm on Jun 19, 2012 (gmt 0)

Thanks for some awesome responses.

@louieramos - Yes it is a blog - all tag and category pages have been noindexed for a long time.

@mslina - I've done that and cannot see any difference between these results and the ones before you click the "show omitted" link.

synthese




msg:4467344
 8:30 pm on Jun 19, 2012 (gmt 0)

@g1smd @phranque This has shown an https://example.com in the result set. Which is bizarre, as I have SSL turned off at the hosting, but it's serving a default Apache page. Not sure how to get rid of this - something in DNS settings?

synthese




msg:4467348
 8:39 pm on Jun 19, 2012 (gmt 0)

@phranque - Expecting about 550 pages to be indexed. This is what is in the sitemap. I haven't crawled the site -- what would you use to do that (maybe a sitemap generator?).
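One way to put the sitemap count next to what the site: listing shows is to parse the sitemap and compare the two URL sets. A rough sketch, assuming a standard sitemaps.org XML sitemap and a list of indexed URLs you have collected yourself; the function names and sample data are hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespace used by the sitemaps.org 0.9 schema.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def sitemap_urls(xml_text):
    """Return every <loc> value found in a sitemap document."""
    root = ET.fromstring(xml_text)
    return [loc.text.strip() for loc in root.iter(SITEMAP_NS + "loc") if loc.text]


def compare_with_index(sitemap, indexed):
    """Split the difference between sitemap URLs and indexed URLs both ways."""
    sitemap, indexed = set(sitemap), set(indexed)
    return {
        "indexed_not_in_sitemap": sorted(indexed - sitemap),
        "in_sitemap_not_indexed": sorted(sitemap - indexed),
    }


# Hypothetical two-URL sitemap and a hand-collected list of indexed URLs.
sample_sitemap = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://www.example.com/</loc></url>
  <url><loc>http://www.example.com/page-a</loc></url>
</urlset>"""
indexed = ["http://www.example.com/", "http://forum.example.com/old-thread"]

print(compare_with_index(sitemap_urls(sample_sitemap), indexed))
```

Anything in the first bucket is a candidate for the "phantom" pages; a crawler would then tell you whether the site itself still links to them.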

synthese




msg:4467352
 8:55 pm on Jun 19, 2012 (gmt 0)

Okay this is weird. G is showing about 3 forum.example.com urls -- despite the forum subdomain being deleted in 2006, and all non www. urls 301'd to www. (apache rewrite rule).

I've also noticed that there's a good 50 or so URLs in the dupe index that have been returning 404s for over 12 months... this is frustrating.

synthese




msg:4467353
 9:03 pm on Jun 19, 2012 (gmt 0)

@driller41 - Big call, and I've certainly thought about it. Prepanda: 918k visits in the month before panda hit. Now: 170k visits this last month.

I've had another domain sitting there for some time wondering whether to shift the whole site and start over -- but the risk is that things will get even worse.

g1smd




msg:4467357
 9:25 pm on Jun 19, 2012 (gmt 0)

I'd not particularly worry about a single root https domain holding page being listed. If it takes only minutes to fix then I would try to get your branding there or a redirect to http in place.

phranque




msg:4467369
 10:31 pm on Jun 19, 2012 (gmt 0)

https://example.com

is your server returning that content?
e.g. is it your IP address?
which port?

i typically use xenu and/or screaming frog to crawl sites.

re: unexpected urls in the index -
are you excluding any urls from being crawled in robots.txt?
e.g. what does [forum.example.com...] say?

netmeg




msg:4467371
 10:33 pm on Jun 19, 2012 (gmt 0)

I'd sic screaming frog on it. I know we're not supposed to mention specific tools, but to my mind this is such an essential SEO tool and probably so necessary to what you're trying to figure out, I'm gonna risk it.

phranque




msg:4467409
 12:08 am on Jun 20, 2012 (gmt 0)

forgot about the example subdomain/forum linking problem...

what does http://forum.example.com/robots.txt say?

synthese




msg:4467714
 7:52 pm on Jun 20, 2012 (gmt 0)

http://forum.example.com does not exist, and has not for 5 years. I have a redirect setup: anything.example.com -> www.example.com

[edited by: tedster at 10:07 pm (utc) on Jun 20, 2012]

g1smd




msg:4467719
 8:00 pm on Jun 20, 2012 (gmt 0)

Make sure that it is a 301 redirect.

Make sure that from every non-canonical URL the redirect to the matching canonical URL happens in a single step.
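Both points can be checked by walking the redirect chain one hop at a time instead of letting the client follow it silently. The sketch below is an assumption-laden illustration: the function names are mine, fetch_head uses only the standard library, and the demo at the bottom uses a stub with made-up responses so that nothing touches the network.

```python
import http.client
from urllib.parse import urljoin, urlsplit

REDIRECT_CODES = (301, 302, 303, 307, 308)


def fetch_head(url):
    """Send one HEAD request without following redirects.
    Returns (status code, Location header or None)."""
    parts = urlsplit(url)
    conn_cls = (http.client.HTTPSConnection if parts.scheme == "https"
                else http.client.HTTPConnection)
    conn = conn_cls(parts.netloc, timeout=10)
    try:
        path = parts.path or "/"
        if parts.query:
            path += "?" + parts.query
        conn.request("HEAD", path)
        resp = conn.getresponse()
        return resp.status, resp.getheader("Location")
    finally:
        conn.close()


def redirect_chain(url, fetch=fetch_head, max_hops=5):
    """Return [(url, status), ...] for each hop, stopping at the first
    non-redirect response or after max_hops requests."""
    hops = []
    for _ in range(max_hops):
        status, location = fetch(url)
        hops.append((url, status))
        if status not in REDIRECT_CODES or not location:
            break
        url = urljoin(url, location)
    return hops


def is_single_301(hops, canonical):
    """True if the chain is exactly one 301 landing on the canonical URL with a 200."""
    return len(hops) == 2 and hops[0][1] == 301 and hops[1] == (canonical, 200)


# Stubbed responses (hypothetical) so the example runs offline.
fake = {
    "http://example.com/page": (301, "http://www.example.com/page"),
    "http://www.example.com/page": (200, None),
}
chain = redirect_chain("http://example.com/page", fetch=lambda u: fake[u])
print(chain, is_single_301(chain, "http://www.example.com/page"))
```

Calling redirect_chain(url) without the fetch argument would issue real HEAD requests, so try it on a handful of non-canonical URLs first.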

tedster




msg:4467766
 10:11 pm on Jun 20, 2012 (gmt 0)

I have a redirect setup: anything.example.com -> www.example.com

Warning - that kind of "wildcard" subdomain set-up has sometimes been used by competitors to wreak havoc. Following g1smd's advice about using a 301 redirect is pretty good insurance, but only "pretty good."

This is a case where I would highly prefer a 404 status - if it ain't there, then don't resolve the request.
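The difference between the wildcard set-up and the 404 preference comes down to what the server does with a hostname it does not know. Here is a sketch of that decision only, not of any real server configuration: the host lists and the function name are assumptions chosen to match the example.com names used in this thread.

```python
# Hypothetical host lists for illustration only.
CANONICAL_HOST = "www.example.com"
REDIRECT_HOSTS = {"example.com"}  # known aliases that should 301 to the canonical host


def response_for_host(host_header, path="/"):
    """Return (status, redirect target or None) for a request's Host header."""
    # Drop any port, lowercase, and strip a trailing dot before comparing.
    host = host_header.split(":")[0].lower().rstrip(".")
    if host == CANONICAL_HOST:
        return 200, None
    if host in REDIRECT_HOSTS:
        return 301, f"http://{CANONICAL_HOST}{path}"
    # Unknown subdomains are not resolved to content or redirected.
    return 404, None


print(response_for_host("example.com", "/page-a"))
print(response_for_host("anything.example.com"))
```

With an explicit list like this, a made-up hostname such as anything.example.com gets a 404 instead of a redirect to the real site.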


All trademarks and copyrights held by respective owners. Member comments are owned by the poster.
WebmasterWorld is a Developer Shed Community owned by Jim Boykin.
© Webmaster World 1996-2014 all rights reserved