Forum Moderators: Robert Charlton & aakk9999 & andy langton & goodroi

How to Locate Internal Duplicate Content?

     
1:29 am on Jun 19, 2012 (gmt 0)

Junior Member

5+ Year Member

joined:Feb 8, 2007
posts:61
votes: 0


My site has been languishing since Panda 1 (Feb 2011) in spite of an incredible amount of change (rewriting content, removing/redirecting thin content, nailing down technical issues, redesign, improving engagement).

In fact it took a further blow in the Panda update of June 8. Some folks have suggested it might be suffering a dupe content penalty due to this:

A site: command on Google shows around 900 pages. When you page thru the results you get to about 540 and then get the message "...we have omitted some results very similar..." When you click "repeat the search", the actual listed results are the same (about 540).

So what's with the phantom 360 odd pages?
2:20 am on June 19, 2012 (gmt 0)

Junior Member

5+ Year Member

joined:Feb 22, 2010
posts:139
votes: 0


Is your site a blog, and does it have tags and multiple categories?
2:51 am on June 19, 2012 (gmt 0)

Preferred Member

10+ Year Member

joined:Dec 7, 2003
posts:358
votes: 0


"...we have omitted some results very similar..."


Usually means dupe content.

When you click on that you will be thrown back to page 1 again. Click through again until you get to the last page that G will show you. On the last few pages you can usually see why you have dupes.
5:51 am on June 19, 2012 (gmt 0)

Senior Member

WebmasterWorld Senior Member g1smd is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:July 3, 2002
posts:18903
votes: 0


Change the listings to 100 URLs per page.

Save all pages of both results sets. Compare the two listings to find out which pages are dropped.

I recently did that by cutting off the page header and footer from each page and then joining all the results together in order in two files and then running DIFF against the two files.
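For anyone who would rather script that comparison than run DIFF by hand, here is a rough Python sketch of the same idea. It assumes each saved result set has already been reduced to one URL per line. The function name dropped_urls is a placeholder of my own, not part of any tool mentioned in this thread.

```python
def dropped_urls(unfiltered, filtered):
    """Return URLs that appear in the unfiltered listing but not in the
    filtered one, in their original order and without repeats.

    Both arguments are iterables of lines (lists or open files),
    one URL per line. Blank lines are ignored.
    """
    kept = {line.strip() for line in filtered if line.strip()}
    seen = set()
    dropped = []
    for line in unfiltered:
        url = line.strip()
        if url and url not in kept and url not in seen:
            seen.add(url)
            dropped.append(url)
    return dropped
```

Unlike a plain DIFF, this does not care about the order of the two listings, which may help if the two result sets are sorted differently.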
6:31 am on June 19, 2012 (gmt 0)

Administrator

WebmasterWorld Administrator phranque is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Aug 10, 2004
posts:10551
votes: 10


a search like this will be helpful:
http://www.google.com/search?q=site%3Aexample.com&safe=off&filter=0&num=100

i think you will have to turn off Instant for the &num=100 to work.
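If you need that search URL for several hostnames, it can be assembled with the Python standard library. This is only a sketch: the parameter names (safe, filter, num) are copied from the URL above, and the function name and its arguments are hypothetical. Treating filter=0 as the "include omitted results" switch is my reading of that URL, not something confirmed here.

```python
from urllib.parse import urlencode


def site_search_url(domain, extra_terms="", num=100, show_omitted=True):
    """Build a Google site: search URL shaped like the one quoted above."""
    query = f"site:{domain} {extra_terms}".strip()
    params = {"q": query, "safe": "off"}
    if show_omitted:
        # Assumption: filter=0 asks for the "omitted" similar results too.
        params["filter"] = "0"
    params["num"] = str(num)
    return "http://www.google.com/search?" + urlencode(params)


print(site_search_url("example.com"))
```

Anything passed as extra_terms is appended to the site: operator before URL-encoding, so spaces and quotes are escaped for you.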

"...we have omitted some results very similar..."

this can also mean links were followed to robots.txt-excluded urls and the "snippet-less"/url-only results start to look "very similar".

how many pages do you expect to have indexed?
have you specified and/or submitted a sitemap?
have you crawled the site?
6:39 am on June 19, 2012 (gmt 0)

Senior Member

WebmasterWorld Senior Member g1smd is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:July 3, 2002
posts:18903
votes: 0


Another thing to check.

site:example.com -inurl:www
site:www.example.com


One of those should return zero results.
6:50 am on June 19, 2012 (gmt 0)

Administrator

WebmasterWorld Administrator phranque is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Aug 10, 2004
posts:10551
votes: 10



site:example.com -inurl:www
site:www.example.com

One of those should return zero results.


...unless you are also using other subdomains such as blog.example.com, secure.example.com, etc...
2:34 pm on June 19, 2012 (gmt 0)

Preferred Member

5+ Year Member

joined:Nov 29, 2007
posts:385
votes: 0


Feb 2011, that is a year and a half ago - perhaps it is time to move on and build a new site if this one is still not performing - just a thought.
2:36 pm on June 19, 2012 (gmt 0)

Senior Member

WebmasterWorld Senior Member 5+ Year Member

joined:May 9, 2007
posts:876
votes: 0


One of the best ways is to take a sentence of text from your page and do

site:www.example.com "insert sentence here"

Ensure you click to view omitted results too.
8:25 pm on June 19, 2012 (gmt 0)

Junior Member

5+ Year Member

joined:Feb 8, 2007
posts:61
votes: 0


Thanks for some awesome responses.

@louieramos - Yes it is a blog - all tags and categories have been noindexed for a long time.

@mslina - I've done that and cannot see any difference between these results and the ones before you click the "show omitted" link.
8:30 pm on June 19, 2012 (gmt 0)

Junior Member

5+ Year Member

joined:Feb 8, 2007
posts:61
votes: 0


@g1smd @phranque This has shown an [example.com...] in the result set. Which is bizarre as I have SSL turned off at the hosting, but it's serving a default Apache page. Not sure how to get rid of this - something in DNS settings?
8:39 pm on June 19, 2012 (gmt 0)

Junior Member

5+ Year Member

joined:Feb 8, 2007
posts:61
votes: 0


@phranque - Expecting about 550 pages to be indexed. This is what is in the sitemap. I haven't crawled the site -- what would you use to do that (maybe a sitemap generator?).
8:55 pm on June 19, 2012 (gmt 0)

Junior Member

5+ Year Member

joined:Feb 8, 2007
posts:61
votes: 0


Okay this is weird. G is showing about 3 forum.example.com urls -- despite the forum subdomain being deleted in 2006, and all non www. urls 301'd to www. (apache rewrite rule).

I've also noticed that there's a good 50 or so URLs in the dupe index that have been returning 404s for over 12 months... this is frustrating.
9:03 pm on June 19, 2012 (gmt 0)

Junior Member

5+ Year Member

joined:Feb 8, 2007
posts:61
votes: 0


@driller41 - Big call, and I've certainly thought about it. Prepanda: 918k visits in the month before panda hit. Now: 170k visits this last month.

I've had another domain sitting there for some time wondering whether to shift the whole site and start over -- but the risk is that things will get even worse.
9:25 pm on June 19, 2012 (gmt 0)

Senior Member

WebmasterWorld Senior Member g1smd is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:July 3, 2002
posts:18903
votes: 0


I'd not particularly worry about a single root https domain holding page being listed. If it takes only minutes to fix then I would try to get your branding there or a redirect to http in place.
10:31 pm on June 19, 2012 (gmt 0)

Administrator

WebmasterWorld Administrator phranque is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Aug 10, 2004
posts:10551
votes: 10


https://example.com

is your server returning that content?
e.g. is it your IP address?
which port?

i typically use xenu and/or screaming frog to crawl sites.

re: unexpected urls in the index -
are you excluding any urls from being crawled in robots.txt?
e.g. what does [forum.example.com...] say?
10:33 pm on June 19, 2012 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member netmeg is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Mar 30, 2005
posts:12784
votes: 164


I'd sic screaming frog on it. I know we're not supposed to mention specific tools, but to my mind this is such an essential SEO tool and probably so necessary to what you're trying to figure out, I'm gonna risk it.
12:08 am on June 20, 2012 (gmt 0)

Administrator

WebmasterWorld Administrator phranque is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Aug 10, 2004
posts:10551
votes: 10


forgot about the example subdomain/forum linking problem...

what does http://forum.example.com/robots.txt say?
7:52 pm on June 20, 2012 (gmt 0)

Junior Member

5+ Year Member

joined:Feb 8, 2007
posts:61
votes: 0


http://forum.example.com does not exist, and has not for 5 years. I have a redirect setup: anything.example.com -> www.example.com

[edited by: tedster at 10:07 pm (utc) on Jun 20, 2012]

8:00 pm on June 20, 2012 (gmt 0)

Senior Member

WebmasterWorld Senior Member g1smd is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:July 3, 2002
posts:18903
votes: 0


Make sure that it is a 301 redirect.

Make sure that from every non-canonical URL the redirect to the matching canonical URL happens in a single step.
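Both checks can be scripted. The sketch below uses only the Python standard library and sends HEAD requests without following redirects, so you see exactly what the first hop returns. The function names are placeholders of my own, and it assumes your server answers HEAD the same way it answers GET, which is worth confirming before trusting the result.

```python
import http.client
from urllib.parse import urljoin, urlsplit


def fetch_head(url):
    """Send one HEAD request, following nothing; return (status, Location)."""
    parts = urlsplit(url)
    if parts.scheme == "https":
        conn = http.client.HTTPSConnection(parts.netloc, timeout=10)
    else:
        conn = http.client.HTTPConnection(parts.netloc, timeout=10)
    try:
        path = parts.path or "/"
        if parts.query:
            path += "?" + parts.query
        conn.request("HEAD", path)
        response = conn.getresponse()
        return response.status, response.getheader("Location")
    finally:
        conn.close()


def is_single_step_301(url, canonical, fetch=fetch_head):
    """True if url answers 301 straight to canonical, and canonical answers 200."""
    status, location = fetch(url)
    if status != 301 or not location:
        return False
    # Resolve a relative Location header against the requested URL.
    if urljoin(url, location) != canonical:
        return False
    final_status, _ = fetch(canonical)
    return final_status == 200
```

For example, is_single_step_301("http://example.com/page", "http://www.example.com/page") should come back True on a correctly configured site. A 302, or a hop through an intermediate URL first, makes it return False.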
10:11 pm on June 20, 2012 (gmt 0)

Senior Member

WebmasterWorld Senior Member tedster is a WebmasterWorld Top Contributor of All Time 10+ Year Member

joined:May 26, 2000
posts:37301
votes: 0


I have a redirect setup: anything.example.com -> www.example.com

Warning - that kind of "wildcard" subdomain set-up has sometimes been used by competitors to wreak havoc. Following g1smd's advice about using a 301 redirect is pretty good insurance, but only "pretty good."

This is a case where I would highly prefer a 404 status - if it ain't there, then don't resolve the request.