Wordpress URL Hashtags and Canonical mess

         

JS_Harris

6:11 pm on May 17, 2015 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Hi, I haven't posted in a while, so I hope this is the right forum. I'm seeking an .htaccess solution, or any solution, to a convoluted mess which I'll do my best to explain.

#1 - WordPress still does not handle canonical tags properly on paginated tag archive results (it points every page at the first page). I solved that problem with a rather simple theme functions file entry from here - [wordpress.stackexchange.com...]. I don't want to debate the issue of canonical pointing to page one on multi-page posts, because Google has stated you should not and, on this particular site, each page is different. Think of a list of products, with each jump showing a different product from the group.

#2 - The browser state functions in history.js still do not degrade well on HTML4 browsers (IE9 etc.) because those browsers do not implement the session history API. The result is that history.js turns example.com/some-content-here/4 into example.com/some-content-here/#some-content-here/4?_suid1234567, which unfortunately breaks my solution #1: I end up with a canonical tag on all paginated content pointing to page one. More on history.js from github here - [github.com...]

INTERESTING
#3 - I wanted to view the problem through Google's eyes, so I fetched as Googlebot. I fetched example.com/some-content-here/#some-content-here/4?_suid1234567 which, as I said, has a rel-canonical pointing to page 1 instead of page 4, and Googlebot fetches page 1. Googlebot ignores everything after the hashtag and returns page 1 even though following my link leads a person to page 4. Of course this concerns me: it's unintentionally showing a visitor a different page than a search engine, but that's not my fault... I can't make Googlebot ignore my URL and load my canonical suggestion; they do that on their own, and visitors' browsers do not.

As you can tell, this isn't your everyday problem, but it does affect anyone using WordPress with a plugin that uses the history.js file, such as theia post slider for example. MY CONCERN IS TO ENSURE PROPER INDEXING - I will eventually fix WordPress or fix the history.js degradation bug on HTML4 browsers, but I need to make sure I am getting indexed properly, so....

Since Googlebot is ignoring paginated content with hashtags and _suid appended and jumping to page 1, I need to make sure that anyone who follows a link to example.com/some-content-here/#some-content-here/4?_suid1234567 also ends up on page 1 of the multi-page post or archive. Thus I need a redirect to send people to the non-hashtag version of URLs.

But it's not that simple. As you can see from my example URL, page 4 is actually in the URL (/4?_), but after the hashtag portion. Oddly enough, that's enough for WordPress to send a visitor to page 4 even if canonical reports being on page 1. This is causing some analytics nightmares and various other problems. If anyone has a suggestion to fix that I'm all ears, BUT what I really need first is a redirect to strip the hashtag portion of the URL without removing the page number that comes after it; I don't want visitors to be unable to switch pages. As you might guess, if I remove the hashtag portion of the URL then WordPress gets canonical right, so I no longer need to redirect to page one, and in fact I shouldn't.

Round and round we go, someone coming in on a hashtag link needs to land on a non-hashtag page but as soon as they go to the next page the hashtag appears and since that's not server side they don't get redirected.... and they create more links to hashtag pages... bleh.

I don't yet have a best effort for this; I was presented with the problem this morning and am still trying to wrap my head around the implications. The site also has htaccess code to strip a .html ending from pages as well as your typical non-www to www redirect. A hashtag removal solution would need to work with both of those, and I am only looking for a hashtag removal htaccess solution (for now). I hope you found this set of issues interesting; I suspect it affects a lot more sites than people realize.

What are your thoughts?

mod note: I linked to the two reference sites because they offer a non-promotional plugin free solution to a common problem and insight into a bigger problem when the two issues collide. They are not my sites but are reputable, I hope they satisfy the linking policy on WW.

whitespace

12:08 pm on May 18, 2015 (gmt 0)

10+ Year Member Top Contributors Of The Month



Interesting... Just throwing in a cent or two, and to follow this thread... :)

I fetched example.com/some-content-here/#some-content-here/4?_suid1234567 which, as I said has a rel-canonical pointing to page 1 instead of page 4, and googlebot fetches page 1. Googlebot ignores everything after the hashtag...


Right, "Fetch as Google" seems to ignore the fragment identifier when making the request. That is why you are seeing "page 1", not because you have a canonical link set (as you seem to imply in the first bit?). Google does nothing with the canonical link at this stage, as far as I can tell.

Oddly enough that's enough for wordpress to send a visitor to page 4 even if canonical reports being on page 1.


Well, you say "wordpress" - I kind of think of "wordpress" as the server-side back end. The fragment identifier is never passed to the server; only client-side JavaScript is able to read and process it (which could involve an AJAX request back to the server), so ultimately it's the client-side JS that makes this happen. The canonical link plays no part here (it's just advisory).
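To illustrate the point about the fragment never reaching the server, here is a minimal TypeScript sketch that splits the example URL from this thread into the part a browser would send in the request and the part that stays client-side. The function name `splitAtFragment` and the `http://` scheme are assumptions added for illustration only.

```typescript
// Split a URL into the part that goes into the HTTP request and the
// fragment, which the browser keeps to itself.
interface SplitUrl {
  sentToServer: string; // scheme, host, path (and query, if any before '#')
  fragment: string;     // everything after the first '#', without the '#'
}

function splitAtFragment(url: string): SplitUrl {
  const hashIndex = url.indexOf("#");
  if (hashIndex === -1) {
    // No fragment at all: the whole URL is requested.
    return { sentToServer: url, fragment: "" };
  }
  return {
    sentToServer: url.slice(0, hashIndex),
    fragment: url.slice(hashIndex + 1),
  };
}

const parts = splitAtFragment(
  "http://example.com/some-content-here/#some-content-here/4?_suid1234567"
);
console.log(parts.sentToServer); // http://example.com/some-content-here/
console.log(parts.fragment);     // some-content-here/4?_suid1234567
```

Note that the `?_suid1234567` part comes after the `#`, so it is part of the fragment rather than a query string, and neither the server nor an .htaccess rule would see it.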

...and since that's not server side they don't get redirected.... I am only looking for a hashtag removal htaccess solution (for now).


Right, it's not server-side. So, you can't create a "hashtag removal" solution in .htaccess (i.e. server-side). Any hashtag removal would need to be client-side (in JavaScript). But also, simply removing the hashtag (dang, you've got me calling it that now) should not trigger an "external redirect", if that is what you are implying. Changing the fragment identifier should not result in any direct network traffic (unless you have script on the page that does something in the background).
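A rough TypeScript sketch of what such a client-side cleanup might look like. It assumes the fragment written by the history.js fallback holds a site-root-relative path followed by a `?_suid...` state id, as in the example URL in this thread; that layout, the function name `cleanUrlFromHash`, and the `http://www.` prefix are assumptions, not something confirmed against history.js itself.

```typescript
// Rebuild the "clean" URL from a hash-fallback URL (assumed layout:
// origin + page path + "#" + root-relative path + "?_suid...").
function cleanUrlFromHash(url: string): string {
  const hashIndex = url.indexOf("#");
  if (hashIndex === -1) return url; // no fragment, nothing to clean

  let fragment = url.slice(hashIndex + 1);
  // Leave ordinary in-page anchors (e.g. "#comments") alone: only act on
  // fragments that carry the _suid marker.
  if (fragment.indexOf("_suid") === -1) return url;

  // Scheme and host, e.g. "http://www.example.com".
  const originMatch = /^[a-z][a-z0-9+.-]*:\/\/[^\/?#]+/i.exec(url);
  if (originMatch === null) return url; // not an absolute URL; leave it alone

  // Drop the "?_suid..." state id, keeping the path and page number.
  const queryIndex = fragment.indexOf("?");
  if (queryIndex !== -1) fragment = fragment.slice(0, queryIndex);

  // Remove any leading "./" or "/" so the path can be re-rooted.
  fragment = fragment.replace(/^[.\/]+/, "");
  if (fragment === "") return url.slice(0, hashIndex);

  return originMatch[0] + "/" + fragment;
}

console.log(
  cleanUrlFromHash(
    "http://www.example.com/some-content-here/#some-content-here/4?_suid1234567"
  )
); // http://www.example.com/some-content-here/4
```

In a browser, a script could compare `cleanUrlFromHash(window.location.href)` with the current address and call `window.location.replace()` when they differ. That loads the clean URL as a new request, so it only helps visitors arriving on a hashed link; on browsers without the history API the slider would presumably add the hash again on the next slide.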

JS_Harris

3:36 pm on May 19, 2015 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Sorry for the confusion, it had me doing triple takes too.

- There is a script that gathers content from all pages of a multi-page post, it's a slider.
- The slider changes the URL, it appends the page number as people "slide".
- It also adds the # on some browsers, such as IE9, that lack the history API that history.js relies on

The page is this: example.com/some-content#some-content/4?_suid1234567
Googlebot sees this: example.com/some-content and crawls the first part of the multi-page post
Visitors land on page 4 of the same multi-page post

So it must be the slider (ajax based) and not WordPress that is still picking up the /4 even though it's after the #. It's still a mess, and only on some browsers. Firefox and Safari both handle history.js well, so there the slider appends only the /4 and never the # or anything after it.
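As a sketch of how a client-side script could pick the page number out of the fragment, here is a small TypeScript function. It is not the slider's actual code; the name `pageFromFragment` and the rule "a trailing number in the path is the page, otherwise page 1" are assumptions based on the example URLs in this thread.

```typescript
// Read the page number from a fragment such as "some-content/4?_suid1234567".
// Falls back to page 1 when the fragment does not end in a number.
function pageFromFragment(fragment: string): number {
  // Ignore the "?_suid..." state id, if present.
  const queryIndex = fragment.indexOf("?");
  const path = queryIndex === -1 ? fragment : fragment.slice(0, queryIndex);

  // A trailing "/<digits>" (optionally followed by "/") is taken as the page.
  const match = /\/(\d+)\/?$/.exec(path);
  return match === null ? 1 : Number(match[1]);
}

console.log(pageFromFragment("some-content/4?_suid1234567")); // 4
console.log(pageFromFragment("some-content"));                // 1
```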

Do I even want to fix this purely for backwards compatibility on one older browser? No, I really don't. I do, however, have a problem with the inbound links. Every _suid has different numbers, so each backlink with the # etc. creates yet another copy of the same page in Google's eyes. Having had a night to think about it, I realized that I need to see if it's even possible to resolve the issue between history.js and some older browsers; any redirects would be pointless and wouldn't work anyway.

Thanks for the feedback.