


Duplicate content in pagination

How big of a concern?

     
4:15 pm on Mar 20, 2018 (gmt 0)

Senior Member

WebmasterWorld Senior Member Top Contributors Of The Month

joined:Apr 1, 2016
posts:2709
votes: 822


I'm currently working on a pagination algorithm for a website with a very varied array of data that needs to be paginated. Any data entity can have anywhere from no pages to paginate to as many as thousands. To allow users to quickly skip ahead, I plan on adding buttons that jump to the next "logical block", e.g. letter of the alphabet.

For example:
There will be a unique page for each entity, "Entity one", "Entity two", "Entity three", etc...
Each "Entity" page will show 10 sub entities.
Sub entities are structured "Entity one Aa", "Entity one Ab", "Entity one Ac", "Entity one Ba", "Entity one Bb", etc...

Example Page 1 for Entity One would be:

URL: /entity-one-A.html
"Entity one Aa",
"Entity one Ab",
"Entity one Ac",
"Entity one Ad",
"Entity one Ae",
"Entity one Af",
"Entity one Ba",
"Entity one Bb",
"Entity one Bc",
"Entity one Bd",


Example Page 2 based on a "Next" click would be:

URL: /entity-one-B.html?various=params
"Entity one Be",
"Entity one Bf",
"Entity one Bg",
"Entity one Bh",
"Entity one Ca",
"Entity one Cb",
"Entity one Cc",
"Entity one Cd",
"Entity one Ce",
"Entity one Da"


But as I mentioned in my intro, I would have buttons that allow the user to skip ahead to "Entity one B.." or "Entity one C.."
A click on "B" would display the page:

URL: /entity-one-B.html
"Entity one Ba",
"Entity one Bb",
"Entity one Bc",
"Entity one Bd",
"Entity one Be",
"Entity one Bf",
"Entity one Bg",
"Entity one Bh",
"Entity one Ca",
"Entity one Cb",


A click on "A" would display page 1 shown above.

The astute observer will notice that the element "Entity one Ba" appears on both the "A" page and the "B" page.

Is this duplication going to be an issue for Google?

Note that the resulting URLs from the "Next" clicks will be canonicalized back to the nearest letter page (simply trimming off any params).
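In implementation terms the trimming is nothing more than dropping the query string and keeping the path. A minimal sketch with Python's standard urllib.parse (the URL is just the example above):

from urllib.parse import urlsplit, urlunsplit

def canonical_url(url):
    # Keep only the path of the letter page; drop params and fragment.
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))

print(canonical_url("/entity-one-B.html?various=params"))  # /entity-one-B.html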

Also note that "Entity one B" can, and in some instances does, span several pages, while in other instances it has one result or none. It is for this reason that I cannot make the "B" page show everything greater than or equal to B and less than C: in some instances there would be pages with no content, and in many cases there would be so little content that the data would be spread over 26 pages when it could easily be displayed on a single page.

Is the duplication a problem? If so, can this issue be mitigated?

Or have I managed to confuse everybody?
6:51 pm on Mar 20, 2018 (gmt 0)

Senior Member

WebmasterWorld Senior Member ergophobe is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Apr 25, 2002
posts:8639
votes: 283


Well, what does Google say about duplicate content?

In the Search Quality Evaluator Guidelines[1] they say (p. 158):
Please mark two results as dupes if they have essentially the same content on the main landing page AND you would not want a search engine to return both results for the query.


And on page 40

7.4.5 Copied Main Content
Every page needs MC. One way to create MC with no time, effort, or expertise is to copy it from another source.

Important : We do not consider legitimately licensed or syndicated content to be “copied” (see here for more on web syndication). Examples of syndicated content in the U.S. include news articles by AP or Reuters.

The word “copied” refers to the practice of “scraping” content, or copying content from other nonaffiliated websites without adding any original content or value to users (see here for more information on copied or scraped content).


So you don't fit those definitions.

And then their Duplicate Content help page [2] says
Duplicate content generally refers to substantive blocks of content within or across domains that either completely match other content or are appreciably similar. Mostly, this is not deceptive in origin.


More appropriate to your situation, they add this example
If you have many pages that are similar, consider expanding each page or consolidating the pages into one. For instance, if you have a travel site with separate pages for two cities, but the same information on both pages, you could either merge the pages into one page about both cities or you could expand each page to contain unique content about each city.


To me it sounds like in general your pages will be substantially different. However, what about the edge case where A-F all have no data? In your example, those pages would all be identical (that is, they would all start with "Entity one Ga").

So do you have many such edge cases? If so, it seems like you might want to do the filtering on your end with your queries rather than letting Google do it on their end.


1. [google.com...]

2. [support.google.com...]
7:09 pm on Mar 20, 2018 (gmt 0)

Senior Member

WebmasterWorld Senior Member Top Contributors Of The Month

joined:Apr 1, 2016
posts:2709
votes: 822


Interesting input

I am not worried about duplicate content as described by your first few quotes.

I'm more concerned with the SEO implications, that is: how will Google choose to index one page over the other, which one, why, and is that selection acceptable to me?

As to your last point, what you describe is actually more the norm than the edge case, but I will only provide previous/next pagination on entities where the number of pages is low, e.g. fewer than half a dozen or a dozen pages, thus eliminating this situation.
7:56 pm on Mar 20, 2018 (gmt 0)

Senior Member

WebmasterWorld Senior Member ergophobe is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Apr 25, 2002
posts:8639
votes: 283


Right, but my point is this:

Q. Is it "copied" content as per the rater guidelines?
A. No, therefore there is no worry about a penalty (that part was obvious to both of us at the outset of course).

Q. Is it "duplicate content" in the sense of content that is just repeated because of poor canonicalization?
A: No, so we don't worry about that EXCEPT in the edge case I mentioned, in which case, yes, it is and we do worry about it.

Now we get into the realm of opinion and my last point. If you have content that is substantially similar, at a certain point, someone is going to filter it. My feeling is that it is always better if you filter it yourself than if Google filters it.

Basically, my rules of thumb are:
- Do not expect Google to be smart.
- Do not expect Google to do your work for you.

The fact is that Google *has* gotten really smart in this respect and it *will* do a lot of your work for you, so good solid content will outweigh the occasional "mistake", and my gut feeling is that if the overlap is not egregious, you will be fine. If you want to control what Google indexes, you need to tell it what to index.

But at the end of the day, design it for usability. If it repeats enough that you are worried about Google filtering the results, then it will be annoying for the user. If it's not repetitive enough to be annoying to the user, why would Google filter the results?
2:42 pm on Mar 21, 2018 (gmt 0)

Senior Member

WebmasterWorld Senior Member Top Contributors Of The Month

joined:Apr 1, 2016
posts:2709
votes: 822


Sorry for the delay getting back; I've been working on this and thinking about this. The thinking impedes the work.

Q. Is it "duplicate content" in the sense of content that is just repeated because of poor canonicalization?

Yes, it is duplicated. The question I have been pondering is how best to canonicalize this. On the one hand this must be done to address the duplication, but more importantly it must be done for optimal crawling and indexing.

What I described in my initial post will not work. The reason is that if "Entity one" has more sub-entities (e.g. "Entity one B's") than can be shown on a single page, then none of those additional sub-entities will ever be seen by Google, due to the canonicalization scheme described.

What I have concluded is that the only way to do this is to use rel="next" / rel="prev". This will solve the duplication problem.

My question now is does this impact indexing and ranking?

If a user is searching for a sub-entity that appears four or five "Next" clicks down (page 5), will Google show the user that specific page in search, or will Google point them back to the first page (this is what is described in Google's docs: [webmasters.googleblog.com...])? Does this de-value, in terms of search ranking, the content that appears further down the list?

On the surface rel="next" / rel="prev" seems like the answer, because no matter how the user/Googlebot gets to the page, the next/prev tags will be there. So the user benefits from highly usable pagination, and we benefit from Google being able to index and rank appropriately. But does it really work?
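For concreteness, the markup I mean is the pair of <link> elements in the head of each page in the series. A minimal sketch of emitting them (the ?page= parameter and page numbering are purely illustrative, not my actual URL scheme):

def pagination_link_tags(base_url, page, last_page):
    # Emit rel="prev"/rel="next" <link> tags for one page of a series.
    tags = []
    if page > 1:
        tags.append('<link rel="prev" href="%s?page=%d">' % (base_url, page - 1))
    if page < last_page:
        tags.append('<link rel="next" href="%s?page=%d">' % (base_url, page + 1))
    return "\n".join(tags)

print(pagination_link_tags("/entity-one-B.html", 2, 5))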

[edited by: ergophobe at 3:20 pm (utc) on Mar 21, 2018]
[edit reason] fixed broken link [/edit]

3:17 pm on Mar 21, 2018 (gmt 0)

Senior Member

WebmasterWorld Senior Member ergophobe is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Apr 25, 2002
posts:8639
votes: 283


I should have an answer to this because I've dealt with a couple of situations with lots of pagination, but I'm really hazy on the details.

So while you should be skeptical of what I say above when I say we're in the "realm of opinion"... now I would need to go into faulty memories of opinions based on an older Google.

But I'll throw out one thought - huge numbers of sites on the web are blogs where content gets pushed off the front page and it takes many, many "nexts" to get to older content. If everything else is done well, Google seems to get there eventually and index it. Then you really have to get canonicalization right so that Google has time for a deep crawl.

Can you create more entry points into the deeper data? This might be good for the user too. Like, could you create pagination like

1 2 3 4 5 10 20 30 40 50 60 70 80 90 97 98 99 End
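Something like this would generate that kind of entry-point list (a sketch in Python; the breakpoints are arbitrary):

def entry_points(last_page):
    # First five pages, every tenth page, and the last three pages.
    pages = (set(range(1, 6)) | set(range(10, last_page, 10))
             | set(range(last_page - 2, last_page + 1)))
    return sorted(p for p in pages if 1 <= p <= last_page)

print(entry_points(99))
# [1, 2, 3, 4, 5, 10, 20, 30, 40, 50, 60, 70, 80, 90, 97, 98, 99]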

As a user, I find myself often doing that with the URL if the pagination is based on a GET parameter.
3:24 pm on Mar 21, 2018 (gmt 0)

Senior Member

WebmasterWorld Senior Member ergophobe is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Apr 25, 2002
posts:8639
votes: 283


I fixed the broken link in your post and read the blog article, and I think it matches pretty well with what I said:

Paginated content exists throughout the web and we’ll continue to strive to give searchers the best result, regardless of the page’s rel=”next”/rel=”prev” HTML markup—or lack thereof.


How many pages are we talking about here by the way? 100? 1000? I've certainly had no problem with product catalogs that cover 10 pages. 1000 would make me pretty nervous with a prev/next nav as the only way to get there, not just because of Google, but because of the user experience.

In fact, I think anything over 5-6 pages is a frustrating UI. All of those clickbait "slideshows" that are trying to increase pageviews? I hit the back button the second I see unnecessarily paginated content. I know that's not your situation. I'm just pointing out that my tolerance for pagination has decreased as the prevalence of paginated content has increased.
3:42 pm on Mar 21, 2018 (gmt 0)

Senior Member

WebmasterWorld Senior Member Top Contributors Of The Month

joined:Apr 1, 2016
posts:2709
votes: 822


I'm dealing with big data, hundreds of millions of records. Using absolute offsets for pagination is not computationally feasible. But I do provide the ability to skip ahead lexicographically. In other words, clicking on the letter P gets you to the P's or greater, so you can easily jump through the data. Pages vary depending on the entity, from 1 to thousands. The entity-to-sub-entity relationship (number of pages) is exponentially distributed, so few entities have many pages and many entities have few pages.
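In query terms the letter jump is just a range scan on an indexed column, so the cost doesn't depend on how deep into the data you land. A toy sketch with sqlite3 (the table and column names are made up):

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE subs (name TEXT)")
conn.executemany("INSERT INTO subs VALUES (?)",
                 [("Oz",), ("Pa",), ("Pb",), ("Qa",)])

# Jump to the P's or greater: a range scan, no OFFSET over millions of rows.
rows = conn.execute(
    "SELECT name FROM subs WHERE name >= ? ORDER BY name LIMIT 10",
    ("P",)).fetchall()
print(rows)  # [('Pa',), ('Pb',), ('Qa',)]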

As you can see, this is a challenging problem, something I totally underestimated when I took this on.

I think anything over 5-6 pages is a frustrating UI.

There is no doubt that for the entities with the most pages users will get frustrated, but those pages are few. For the mid to low page count entities I think the UI is actually quite intuitive and usable. One reason is that if you are looking for sub-entity "Sz", you can skip ahead to "T" and then click the previous button, and you will be taken to "Sz", not the last page you were on.
6:56 pm on Mar 21, 2018 (gmt 0)

Senior Member

WebmasterWorld Senior Member ergophobe is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Apr 25, 2002
posts:8639
votes: 283


Wow. That sounds challenging and fascinating.

Is this the sort of thing where a user would use site search a lot? Or is it too hard to know in advance what you're searching for?

Because if you have a lot of users doing a lot of searches, you'll have a type of metadata that would let you refine the UI a lot over time, I would think.

Hundreds of millions of records makes it hard to do progressive search (i.e. search suggestions based on the data). I can't imagine you could do that in real time.

Have you ever heard of "sporadic dictionaries"? Not sure that's a real term - in French it's "dictionnaires sporadiques". Basically, it's when they have a large corpus (say a 10-volume dictionary, like Godefroy's dictionary of Middle French), but the whole thing is not searchable. What they do is create an index of just the page headings (the first and last word on each page), and then any search that falls between those words takes you to the non-searchable PDF, which you scan with your own eyes.

I wonder if you couldn't have something vaguely inspired by that. In other words, you would have your massive data set, but also a small data set that would come up in progressive search, narrow in on the large data set, and not have massively slow queries. You would rebuild that index maybe every day (or month) and then have something people could query super fast.
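In code, the lookup side of such an index is just a binary search over the stored page headings, e.g. (a sketch with made-up headwords):

import bisect

# First headword on each page of the non-searchable corpus.
page_firsts = ["aardvark", "badger", "cormorant", "dugong"]

def page_for(word):
    # The page whose heading range should contain the word.
    return max(bisect.bisect_right(page_firsts, word) - 1, 0)

print(page_for("bee"))  # 1, i.e. the page starting at "badger"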

Of course, that assumes that you have sequential data (which it sounds like you do).

I'm not sure how you then make that work for the indexing problem, but my thought is the "sporadic index" would potentially provide many gateways into the data at important intermediate points.

Just kind of stream of consciousness typing there... not sure it's any help, but just trying to generate some ideas.
7:35 pm on Mar 21, 2018 (gmt 0)

Moderator from US 

WebmasterWorld Administrator martinibuster is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Apr 13, 2002
posts:14935
votes: 494


Is there a way to surface the most popular content within any given section?

For example, let's say we're talking about shoes and there are three brands that are runaway best sellers. Or to slip a different wrapper around this, consider it from the perspective of popularity: which entities are most popular?

Which leads to the question, does it make sense to force users to keep clicking through an arbitrary pagination scheme? Even the choice to make it alphabetical is arbitrary. Or does it make more sense to create the pagination according to your choice of pagination scheme but surface the most popular content?

I've been thinking about duplicate content, and the thinking makes more sense when I consider it within the context of the user intent behind the queries.

For example, a page with substantially similar content to four other pages may have reason to exist if the differentiating factor is COLOR and if the SERPs seem to indicate that user intent is best satisfied by web pages that include the specific color.

So if someone is specifically querying Google for Firetiger Rapala Floating Lure, does it make sense to show a page with a purple lure from which the user will have to click around on a drop down to find the Firetiger lure? Or does it make even more sense to show the Firetiger page?

In other words, if the query is specific enough, sometimes Google tends to choose the duplicate content page that is spot on with that one query element, like color or size or price or whatever. But you have to go to the SERPs or have a good understanding of what will satisfy user intent for those queries the pages are intended to rank for.
7:51 pm on Mar 21, 2018 (gmt 0)

Senior Member

WebmasterWorld Senior Member ergophobe is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Apr 25, 2002
posts:8639
votes: 283


>>when I consider it within the context of user intent of the queries

Yeah, that's why I was thinking that once up and running, site search data would be invaluable.
8:00 pm on Mar 21, 2018 (gmt 0)

Moderator from US 

WebmasterWorld Administrator martinibuster is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Apr 13, 2002
posts:14935
votes: 494


You can kind of get an idea of where the user intent is by looking at the SERPs. So I wouldn't even wait for site search data. I'd draw some conclusions first from the SERPs then wait for the site search data to tell me more.
2:01 am on Mar 22, 2018 (gmt 0)

Senior Member

WebmasterWorld Senior Member Top Contributors Of The Month

joined:Apr 1, 2016
posts:2709
votes: 822


Because if you have a lot of users doing a lot of searches, you'll have a type of metadata that would let you refine the UI a lot over time, I would think.

This is true to some extent, but given that there is a lot of data, there are a lot of pages, so any one page alone doesn't generally get much traffic, though the aggregate can add up. So it may be possible in the aggregate.

Have you ever heard of "sporadic dictionaries"?

No, not specifically, but the process you describe is similar to the pattern I am using. I keep track of the first and last entries on each page. If a user goes back, I search from the first entry backwards, and if the user goes forward I search from the last entry forward. (This is oversimplified, but it's the general idea.) This way I only ever fetch the data required for that one page, and I don't need to fetch all the sub-entities of an entity. For a small entity this provides little benefit, but when an entity has thousands of subs it saves a lot of computing power and time.
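In query terms this is keyset pagination off the stored page boundaries. An oversimplified sketch (made-up schema, sqlite3 again):

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE subs (name TEXT)")
conn.executemany("INSERT INTO subs VALUES (?)", [(c,) for c in "ABCDEFGH"])

def next_page(last_seen, size=3):
    # Forward: everything after the last entry on the current page.
    return conn.execute(
        "SELECT name FROM subs WHERE name > ? ORDER BY name LIMIT ?",
        (last_seen, size)).fetchall()

def prev_page(first_seen, size=3):
    # Backward: scan down from the first entry, then restore display order.
    rows = conn.execute(
        "SELECT name FROM subs WHERE name < ? ORDER BY name DESC LIMIT ?",
        (first_seen, size)).fetchall()
    return rows[::-1]

print(next_page("C"))  # [('D',), ('E',), ('F',)]
print(prev_page("D"))  # [('A',), ('B',), ('C',)]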

Is there are a way to surface the most popular content within any given section?

This doesn't really apply in this case due to the nature of the data, but I would definitely approach the problem differently in an e-commerce type scenario, as you describe.

There is another consideration: if Google is able to accurately answer the query and point the user to the correct page, then the pagination, or any other method of searching, is of little importance to the user, as they will land right on the correct page. But if you omit providing a means of easily navigating the data, then Google will be unable to find the content, and then it will be of utmost importance.
7:19 am on Mar 22, 2018 (gmt 0)

Moderator from US 

WebmasterWorld Administrator martinibuster is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Apr 13, 2002
posts:14935
votes: 494


...if you omit providing a means of easily navigating the data, then Google will be unable to find the content, and then it will be of utmost importance.


If I understand you correctly, this is the point I was trying to make.

Google is trying to show the most satisfactory result.

Your job is to feed the most satisfactory response page to Google.

Satisfaction is measured by popularity. The most popular result ranks at the top because it's what most users want to see.

Surfacing the most "popular" page is important because this is what Google wants to show.

Marketers have to stop thinking in terms of showing "relevant" content because that's not really what Google shows, not in the way SEOs conceive of relevance.

SEOs think of relevance in terms of words and how they relate to each other.

Search engines think of relevance in terms of identifying what a user means when they search for a certain thing with a certain phrase.

There's a huge difference between knowing user intent and knowing semantic relatedness.
12:23 pm on Mar 22, 2018 (gmt 0)

Senior Member

WebmasterWorld Senior Member Top Contributors Of The Month

joined:Apr 1, 2016
posts:2709
votes: 822


This is not quite what I meant. Nonetheless, this makes sense in most cases. But as I mentioned, due to the nature of the data it doesn't really apply here. In this case no entity is really any more likely to be searched for than another. There may be some that are more popular than others, but it is unlikely that users would be more interested in a more popular entity than in the entity they are searching for. Basically, the entities are unique and cannot be substituted for each other.

In contrast, a user shopping for shoes may want blue shoes, then see that red shoes are more popular and so switch to red shoes. So showing red shoes to users that want blue shoes may be desirable. But not in this case.

My point was simply that if your site has good discoverability for search engines, then navigability is less important, and if discoverability is poor, navigability becomes important. But discoverability and navigability are linked and almost the same. So you need to build in great navigability so that the user never needs to use it. (This is my zen thought for the day!)