How to Locate Internal Duplicate Content?

     
1:29 am on Jun 19, 2012 (gmt 0)

5+ Year Member



My site has been languishing since Panda 1 (Feb 2011) in spite of an incredible amount of change (rewriting content, removing/redirecting thin content, nailing down technical issues, redesign, improving engagement).

In fact it took a further blow in the Panda update of June 8. Some folks have suggested it might be suffering a dupe content penalty due to this:

A site: command on Google shows around 900 pages. When you page thru the results you get to about 540 and then get the message "...we have omitted some results very similar..." When you click "repeat the search", the actual listed results are the same (about 540).

So what's with the phantom 360 odd pages?
2:20 am on Jun 19, 2012 (gmt 0)

5+ Year Member



Is your site a blog and have tags and multiple categories?
2:51 am on Jun 19, 2012 (gmt 0)

10+ Year Member



"...we have omitted some results very similar..."


Usually means dupe content.

When you click on that you will be thrown back to page 1 again. Click through again until you get to the last page that G will show you. On the last few pages you can usually see why you have dupes.
5:51 am on Jun 19, 2012 (gmt 0)

WebmasterWorld Senior Member g1smd is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month



Change the listings to 100 URLs per page.

Save all pages of both results sets. Compare the two listings to find out which pages are dropped.

I recently did that by cutting off the page header and footer from each page and then joining all the results together in order in two files and then running DIFF against the two files.
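
The DIFF step can also be scripted. Here is a minimal sketch in Python, assuming each results set has already been trimmed down to a plain list of URLs (header and footer removed). The name `dropped_urls` is hypothetical and only for illustration.

```python
def dropped_urls(unfiltered, filtered):
    """Return URLs that appear in the unfiltered listing but not in the
    filtered one, keeping the order in which they were first seen."""
    seen = {u.strip() for u in filtered if u.strip()}
    result = []
    for u in unfiltered:
        u = u.strip()
        if u and u not in seen:
            result.append(u)
            seen.add(u)  # report each dropped URL only once
    return result
```

To use it, read the two saved listings into lists (for example with `open(path).read().splitlines()`) and pass the "omitted results included" list first. Whatever comes back is the set of pages that only show up once the filter is lifted.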
6:31 am on Jun 19, 2012 (gmt 0)

WebmasterWorld Administrator phranque is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month



a search like this will be helpful:
http://www.google.com/search?q=site%3Aexample.com&safe=off&filter=0&num=100

i think you will have to turn off Instant for the &num=100 to work.
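
If you want to run the same query for several hostnames, the URL can be assembled rather than typed. This is a rough sketch that only reuses the parameters from the URL above; the function name `site_query_url` is made up for illustration, and Google may treat these parameters differently over time.

```python
from urllib.parse import urlencode

def site_query_url(domain, num=100):
    """Build a site: search URL with the duplicate filter switched off."""
    params = {
        "q": f"site:{domain}",  # the site: operator for the given hostname
        "safe": "off",
        "filter": "0",          # include the "omitted" results
        "num": str(num),        # results per page
    }
    return "http://www.google.com/search?" + urlencode(params)
```

Calling `site_query_url("example.com")` reproduces the URL shown above.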

"...we have omitted some results very similar..."

this can also mean links were followed to robots.txt-excluded urls and the "snippet-less"/url-only results start to look "very similar".

how many pages do you expect to have indexed?
have you specified and/or submitted a sitemap?
have you crawled the site?
6:39 am on Jun 19, 2012 (gmt 0)

WebmasterWorld Senior Member g1smd is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month



Another thing to check.

site:example.com -inurl:www
site:www.example.com


One of those should return zero results.
6:50 am on Jun 19, 2012 (gmt 0)

WebmasterWorld Administrator phranque is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month




site:example.com -inurl:www
site:www.example.com

One of those should return zero results.


...unless you are also using other subdomains such as blog.example.com, secure.example.com, etc...
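
If you have already saved the listed URLs to a file, a quick tally by scheme and hostname shows the same thing at a glance. This is a sketch only, assuming one full URL per entry; the name `count_by_origin` is hypothetical.

```python
from collections import Counter
from urllib.parse import urlsplit

def count_by_origin(urls):
    """Count listed URLs per scheme://hostname so stray hosts stand out."""
    counts = Counter()
    for url in urls:
        parts = urlsplit(url.strip())
        # Skip blank lines and anything that is not a full URL.
        if parts.scheme and parts.hostname:
            counts[f"{parts.scheme}://{parts.hostname}"] += 1
    return counts
```

Anything other than the one canonical host in the output (a bare domain, an https variant, an old subdomain) is worth a closer look.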
2:34 pm on Jun 19, 2012 (gmt 0)

5+ Year Member



Feb 2011, that is a year and a half ago - perhaps it is time to move on and build a new site if this one is still not performing - just a thought.
2:36 pm on Jun 19, 2012 (gmt 0)

WebmasterWorld Senior Member 5+ Year Member



One of the best ways is to take a sentence of text from your page and do

site:www.example.com "insert sentence here"

Ensure you click to view omitted results too.
8:25 pm on Jun 19, 2012 (gmt 0)

5+ Year Member



Thanks for some awesome responses.

@louieramos - Yes it is a blog - all tags categories have been noindexed for a long time.

@mslina - I've done that and cannot see any difference between these results and the ones before you click the "show omitted" link.
8:30 pm on Jun 19, 2012 (gmt 0)

5+ Year Member



@g1smd @phranque This has shown an [example.com...] in the result set. Which is bizarre as I have SSL turned off at the hosting but it's serving a default apache page. Not sure how to get rid of this - something in DNS settings?
8:39 pm on Jun 19, 2012 (gmt 0)

5+ Year Member



@phranque - Expecting about 550 pages to be indexed. This is what is in the sitemap. I haven't crawled the site -- what would you use to do that (maybe a sitemap generator?).
8:55 pm on Jun 19, 2012 (gmt 0)

5+ Year Member



Okay this is weird. G is showing about 3 forum.example.com urls -- despite the forum subdomain being deleted in 2006, and all non www. urls 301'd to www. (apache rewrite rule).

I've also noticed that there's a good 50 or so URLs in the dupe index that have been returning 404s for over 12 months... this is frustrating.
9:03 pm on Jun 19, 2012 (gmt 0)

5+ Year Member



@driller41 - Big call, and I've certainly thought about it. Prepanda: 918k visits in the month before panda hit. Now: 170k visits this last month.

I've had another domain sitting there for some time wondering whether to shift the whole site and start over -- but the risk is that things will get even worse.
9:25 pm on Jun 19, 2012 (gmt 0)

WebmasterWorld Senior Member g1smd is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month



I'd not particularly worry about a single root https domain holding page being listed. If it takes only minutes to fix then I would try to get your branding there or a redirect to http in place.
10:31 pm on Jun 19, 2012 (gmt 0)

WebmasterWorld Administrator phranque is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month



https://example.com

is your server returning that content?
e.g. is it your IP address?
which port?

i typically use xenu and/or screaming frog to crawl sites.

re: unexpected urls in the index -
are you excluding any urls from being crawled in robots.txt?
e.g. what does [forum.example.com...] say?
10:33 pm on Jun 19, 2012 (gmt 0)

WebmasterWorld Senior Member netmeg is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month



I'd sic screaming frog on it. I know we're not supposed to mention specific tools, but to my mind this is such an essential SEO tool and probably so necessary to what you're trying to figure out, I'm gonna risk it.
12:08 am on Jun 20, 2012 (gmt 0)

WebmasterWorld Administrator phranque is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month



forgot about the example subdomain/forum linking problem...

what does http://forum.example.com/robots.txt say?
7:52 pm on Jun 20, 2012 (gmt 0)

5+ Year Member



http://forum.example.com does not exist, and has not for 5 years. I have a redirect setup: anything.example.com -> www.example.com

[edited by: tedster at 10:07 pm (utc) on Jun 20, 2012]

8:00 pm on Jun 20, 2012 (gmt 0)

WebmasterWorld Senior Member g1smd is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month



Make sure that it is a 301 redirect.

Make sure that from every non-canonical URL the redirect to the matching canonical URL happens in a single step.
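
Both checks can be scripted. The following is a minimal sketch using only the Python standard library, assuming the server answers HEAD requests the same way it answers GET, and that the Location header is an absolute URL. The names `fetch_head` and `single_step_301` are made up for illustration.

```python
import http.client
from urllib.parse import urlsplit

def fetch_head(url):
    """Send one HEAD request without following redirects.
    Returns (status code, Location header or None)."""
    parts = urlsplit(url)
    conn_cls = (http.client.HTTPSConnection if parts.scheme == "https"
                else http.client.HTTPConnection)
    conn = conn_cls(parts.netloc, timeout=10)
    try:
        path = parts.path or "/"
        if parts.query:
            path += "?" + parts.query
        conn.request("HEAD", path)
        resp = conn.getresponse()
        return resp.status, resp.getheader("Location")
    finally:
        conn.close()

def single_step_301(url, canonical, fetch=fetch_head):
    """True only if the first response is a 301 that points straight
    at the canonical URL, i.e. no 302 and no intermediate hop."""
    status, location = fetch(url)
    return status == 301 and location == canonical
```

For example, `single_step_301("http://example.com/page", "http://www.example.com/page")` should come back True if the non-www redirect is set up as described. A False result means the first response was either not a 301 or pointed somewhere other than the final URL.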
10:11 pm on Jun 20, 2012 (gmt 0)

WebmasterWorld Senior Member tedster is a WebmasterWorld Top Contributor of All Time 10+ Year Member



I have a redirect setup: anything.example.com -> www.example.com

Warning - that kind of "wildcard" subdomain set-up has sometimes been used by competitors to wreak havoc. Following g1smd's advice about using a 301 redirect is pretty good insurance, but only "pretty good."

This is a case where I would highly prefer a 404 status - if it ain't there, then don't resolve the request.
 
