Forum Moderators: phranque

Is canonical tag needed when index.html pages are used

     
3:53 pm on Jan 16, 2018 (gmt 0)

Senior Member

WebmasterWorld Senior Member Top Contributors Of The Month

joined:Apr 1, 2016
posts: 1575
votes: 422


My site uses index.html pages extensively. The "main" content page for each entity is placed in its own subfolder, and additional pages about the entity are added to the same folder, like so:
example.com/top/mid/bottom/entity1/index.html and example.com/top/mid/bottom/entity1/entity1-moreinfo.html
example.com/top/mid/bottom/entity2/index.html and example.com/top/mid/bottom/entity2/entity2-moreinfo.html
...
and so on, for millions of entities.

Now, default browser behavior is such that example.com/top/mid/bottom/entity1/ and example.com/top/mid/bottom/entity1/index.html lead to the same content. Since this is default behavior, is a canonical tag required, or do Googlebot and other bots (Mediapartners-Google) simply understand that these requests are the same?
5:29 pm on Jan 16, 2018 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:Apr 9, 2011
posts:14426
votes: 576


Er, that's not "default browser behavior". It's default server behavior, determined by the DirectoryIndex directive in Apache, or equivalent in other servers. Your job as webmaster is to ensure that all requests for "index.xtn" (html or whatever you use) get redirected to /directory/ alone.
5:42 pm on Jan 16, 2018 (gmt 0)

Senior Member

WebmasterWorld Senior Member Top Contributors Of The Month

joined:Apr 1, 2016
posts: 1575
votes: 422


Yes, server not browser, sorry!

I realize that this implementation is far from ideal (to put it politely). It was implemented when I first began and knew nothing, and was based on someone else's recommendation. Even at that time it seemed odd. Later I had a chance to change it, but getting millions of pages de-indexed and re-indexed seemed like a not-so-great idea. In the near term I am stuck with it and I need to find a workaround. In the mid-term I am preparing to change the site structure from A to Z, so it will get dealt with.

My question now is, would adding a rel=canonical to each affected page be a good idea or a waste of time?
6:24 pm on Jan 16, 2018 (gmt 0)

Moderator from US 

WebmasterWorld Administrator keyplyr is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Sept 26, 2001
posts:10624
votes: 630


This article may be worth reviewing:
5 common mistakes with rel=canonical [webmasters.googleblog.com]
6:36 pm on Jan 16, 2018 (gmt 0)

Senior Member

WebmasterWorld Senior Member Top Contributors Of The Month

joined:Apr 1, 2016
posts: 1575
votes: 422


Thanks Keyplyr, that was a good article with some things to consider. Unfortunately it doesn't address my question, which is specific to /index.html pages.
6:59 pm on Jan 16, 2018 (gmt 0)

Moderator from US 

WebmasterWorld Administrator keyplyr is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Sept 26, 2001
posts:10624
votes: 630


Well it explains the proper usage of the canonical tag. Then you apply that to your needs.

All of my sites are set up like yours. I use canonical tags only when some of my sub pages may end up getting indexed instead of the index page in that category (directory). This could happen because of popularity, traffic, a Social Media post going viral, etc.

I also use the canonical tag when subsequent pages may contain parameters or hash tag anchors.

Other than that, I do not overuse canonical tags like I read about others doing.
7:18 pm on Jan 16, 2018 (gmt 0)

Senior Member

WebmasterWorld Senior Member Top Contributors Of The Month

joined:Apr 1, 2016
posts: 1575
votes: 422


@keyplyr
Can I infer from your answer that a canonical tag is not required for differentiating between

example.com/top/mid/bottom/entity1/
and
example.com/top/mid/bottom/entity1/index.html
7:25 pm on Jan 16, 2018 (gmt 0)

Moderator from US 

WebmasterWorld Administrator keyplyr is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Sept 26, 2001
posts:10624
votes: 630


Do you use any other indicator to tell SEs to treat multiple pages as a single page? Does your server use the index.html default? If so, then I see no need.
7:40 pm on Jan 16, 2018 (gmt 0)

Senior Member

WebmasterWorld Senior Member Top Contributors Of The Month

joined:Apr 1, 2016
posts: 1575
votes: 422


Do you use any other indicator to tell SEs to treat multiple pages as a single page?

N-y-es
example.com/top/mid/bottom/entity1
re-writes to:
example.com/top/mid/bottom/entity1/

and
example.com/top/mid/bottom/entity1/index.html
does not re-write to anything else.

So both ".../index" and ".../" will show the page.
8:42 pm on Jan 16, 2018 (gmt 0)

Moderator from US 

WebmasterWorld Administrator keyplyr is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Sept 26, 2001
posts:10624
votes: 630


Right, so IMO Google, Yandex, Bing et al. will index the default path (without the index.html).

As I said above, I personally only use canonical tags where parameters, hash tag anchors or other such ambiguous paths may lead the SE to index the wrong page.
11:07 pm on Jan 16, 2018 (gmt 0)

Senior Member

WebmasterWorld Senior Member Top Contributors Of The Month

joined:Apr 1, 2016
posts:1575
votes: 422


I just did a "site:" search for my domain. The first URLs I see are all ".../", but if I search site:example.com index.html then it returns another large chunk of results. This suggests to me that Google is in fact indexing both versions but most likely showing only one.
1:00 am on Jan 17, 2018 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:Apr 9, 2011
posts:14426
votes: 576


example.com/top/mid/bottom/entity1
re-writes to:
example.com/top/mid/bottom/entity1/

and
example.com/top/mid/bottom/entity1/index.html
does not re-write to anything else.
example.com/directory redirects (not rewrites) to example.com/directory/ because that is mod_dir doing one of its two jobs. It is called the directory-slash redirect.
example.com/directory/ serves the content of example.com/directory/index.html because that is mod_dir doing its other job. This one really is a rewrite, though it isn't handled by mod_rewrite and people don't usually think of it in those terms.

So both ".../index" and ".../" will show the page.
Yes, but they shouldn't. Why waste your server resources, and the search engines' crawl budget, by providing duplicate URLs for every single page? What happens in some future year when you decide that php or asp or jsp would work better for your index pages? Under normal circumstances, each page (= a particular content) should be accessible by one and only one URL--and in the case of directory-index pages, that one URL should be the form with final slash.
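To make the "one and only one URL" rule concrete, here is a minimal sketch in Python, used purely for illustration. It assumes index.html is the only directory-index filename in use; the function name canonical_directory_url and the INDEX_NAMES tuple are made up for this example. On Apache the same effect would normally come from a redirect in the server configuration rather than from application code.

```python
from urllib.parse import urlsplit, urlunsplit

# Assumption: index.html is the only DirectoryIndex filename on this site.
INDEX_NAMES = ("index.html",)


def canonical_directory_url(url):
    """Return the trailing-slash form of a directory-index URL.

    URLs that do not end in a directory-index filename are returned unchanged.
    """
    parts = urlsplit(url)
    path = parts.path
    for name in INDEX_NAMES:
        if path.endswith("/" + name):
            # Drop the filename but keep the slash in front of it.
            path = path[: -len(name)]
            break
    return urlunsplit((parts.scheme, parts.netloc, path, parts.query, parts.fragment))
```

Any request whose URL differs from the value this function returns is a candidate for a 301 redirect to that value; pages such as entity1-moreinfo.html pass through untouched.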
1:01 am on Jan 17, 2018 (gmt 0)

Administrator

WebmasterWorld Administrator phranque is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Aug 10, 2004
posts:11156
votes: 116


here's what will most likely happen:
- googlebot will eventually crawl all versions of the urls serving the duplicate content.
- as google is indexing the content for each url, it will eventually notice that some of this content duplicates other content it has indexed.
- when it sees that one url version is the directory (the trailing slash url) and the other is the directory index document (a classic canonicalization issue), it will eventually start showing just one of the urls in the index, usually the shortest one - especially in this case.
- when google sees conflicting signals (xml sitemaps, internal & external links, 200 OK instead of 301, link rel canonical, etc) it will eventually make its best guess for which url is canonical.

how much time it takes for google to take these steps, how much crawl and rendering budget is wasted, how much link equity is lost (if only temporarily), google getting it all wrong in the end - these are all questions to consider:
how interested is google in your site?
how lucky do you feel?

the other option is to redirect all noncanonical requests to the canonical url.
this makes the canonical tag mostly irrelevant.
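if the site ends up being served by a python wsgi application, the redirect option could look roughly like the sketch below. this is an illustration under assumptions, not a drop-in solution: the class name IndexRedirectMiddleware is hypothetical, it assumes index.html is the only directory index filename, and on a static apache site the redirect would normally be configured in the server instead.

```python
class IndexRedirectMiddleware:
    """WSGI middleware that 301-redirects ".../index.html" to ".../".

    Hypothetical example class; assumes a single directory-index filename.
    """

    def __init__(self, app, index_name="index.html"):
        self.app = app
        self.index_name = index_name

    def __call__(self, environ, start_response):
        path = environ.get("PATH_INFO", "")
        if path.endswith("/" + self.index_name):
            # Drop the filename but keep the trailing slash.
            target = path[: -len(self.index_name)]
            query = environ.get("QUERY_STRING", "")
            if query:
                # Preserve any query string on the redirect target.
                target += "?" + query
            start_response(
                "301 Moved Permanently",
                [("Location", target), ("Content-Length", "0")],
            )
            return [b""]
        # Every other request goes to the wrapped application unchanged.
        return self.app(environ, start_response)
```

usage would be to wrap the existing application object, for example app = IndexRedirectMiddleware(app), so that only the trailing-slash url ever answers with 200 OK.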
2:07 am on Jan 17, 2018 (gmt 0)

Senior Member

WebmasterWorld Senior Member Top Contributors Of The Month

joined:Apr 1, 2016
posts:1575
votes: 422


Where to start...
Everything I described so far occurred well in the past, any impact in terms of random selection of pages to include, or wasted crawl budget etc... is all done, history, I am now living with the current state. Things are mostly going pretty well (except for AdSense). I have been hitting new traffic records lately, so the general trend is positive, although I can't help but think that this may have caused things to progress more slowly than they could have. Water under the bridge...

What happens in some future year when you decide that php or asp or jsp would work better for your index pages?

That time is basically now. I have begun work on a redesign of the site, making the content dynamic. No php or asp; I will be using Python and JS. Given the current issues, though, this will only happen in a few months. It is a definite consideration in how I will handle this specific issue. My plan is to split the content that is currently shown on /index.html and instead show it under 3 or 4 URLs, but using ajax so that it appears to the user as a single page. (This is kind of beside the point.)

Really, my big and urgent concern right now is not Googlebot but rather the Mediapartners-Google bot. Assuming that there are 3 possible URLs pointing to the exact same content, there is a significantly increased probability that when a user requests a page, it will be "uncrawled" from Mediapartners-Google's perspective. The increase is not quite 3x, as it will largely be determined by what Google has indexed and how my pages are linked internally.

Digging into the stats (since the beginning of November), only 1% of traffic landed on URLs not ending in ".html" and 93% of my traffic landed specifically on /index.html pages. So this might not be such a big issue. This would also suggest that, to avoid disrupting the current state of affairs, any rel=canonical I add should point to the /index.html URL, and that it may not even be required?

For the future, Python allows me to direct multiple routes to the same page. But even so, maintaining the current structure is probably not so bad: entity1/index.html would be the basic page, then entity1/feature-a.html would show the additional content for feature a, entity1/feature-b.html for feature b, etc.
3:19 am on Jan 17, 2018 (gmt 0)

Administrator from US 

WebmasterWorld Administrator not2easy is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Dec 27, 2006
posts:3550
votes: 196


This is an old thread but not much has changed in this context: [webmasterworld.com...]

It is never a good thing to serve the same page from multiple URLs whether it is www vs. non-www, http vs. https, or example.com/directory/ vs. example.com/directory/index.html and a canonical metatag doesn't really help because the same page is served in both requests.
5:21 am on Jan 17, 2018 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:Apr 9, 2011
posts:14426
votes: 576


any impact in terms of random selection of pages to include, or wasted crawl budget etc... is all done, history
Why? Have search engines stopped crawling your site?
10:58 am on Jan 17, 2018 (gmt 0)

Administrator

WebmasterWorld Administrator phranque is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Aug 10, 2004
posts:11156
votes: 116


Everything I described so far occurred well in the past, any impact in terms of random selection of pages to include, or wasted crawl budget etc... is all done, history, I am now living with the current state.

have you checked a representative sample of your server access logs to determine which urls googlebot/mediapartners is actually crawling and how often it is requesting various urls?

you might also consider analyzing how recently the supporting resources (css, javascript, images) have been requested as referred from each of a set of noncanonical urls.
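a rough sketch of the first check in python, in case it helps. it assumes the logs are in the apache/ncsa "combined" format and that matching the user-agent substrings "Googlebot" and "Mediapartners-Google" is good enough for a first pass (user agents can be spoofed, so treat the counts as approximate). the names LOG_LINE, classify_path and crawl_counts are invented for this example.

```python
import re
from collections import Counter

# Assumption: Apache/NCSA combined log format, e.g.
# IP - - [time] "GET /path HTTP/1.1" 200 5120 "referer" "user agent"
LOG_LINE = re.compile(
    r'"(?:GET|HEAD) (?P<path>\S+) [^"]*" (?P<status>\d{3}) \S+ '
    r'"[^"]*" "(?P<agent>[^"]*)"'
)


def classify_path(path):
    """Label a request path by URL form, ignoring any query string."""
    path = path.split("?", 1)[0]
    if path.endswith("/index.html"):
        return "index.html"
    if path.endswith("/"):
        return "trailing-slash"
    return "other"


def crawl_counts(lines):
    """Count requests per (bot, URL form) for the two Google crawlers."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.search(line)
        if not match:
            continue  # not a GET/HEAD line in the assumed format
        agent = match.group("agent")
        if "Mediapartners-Google" in agent:
            bot = "Mediapartners-Google"
        elif "Googlebot" in agent:
            bot = "Googlebot"
        else:
            continue  # ignore browsers and other bots
        counts[(bot, classify_path(match.group("path")))] += 1
    return counts
```

calling crawl_counts(open("access.log")) on a log file (the filename is a placeholder) returns a Counter keyed by (bot, url form), which shows at a glance whether either bot is still spending requests on the index.html form.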