Forum Moderators: Robert Charlton & goodroi

Some help with wrong URL structure

         

shaunm

7:57 am on Feb 25, 2014 (gmt 0)

10+ Year Member Top Contributors Of The Month



Hi All,

The CMS generates pages like this: http://subdomain.example.com/folder/quest/002/This is a question

When copied to the browser it appears (of course it should) with a '+' symbol between the words.

Google reports these pages in Webmaster Tools as they are. There are no crawl errors reported for this subdomain either.

My question is, should I be concerned about these pages at all, since the URLs are not user/search engine friendly?

Thanks for your help.

aakk9999

12:29 pm on Feb 25, 2014 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



When copied to the browser it appears (of course it should) with '+' symbol in between the words.

I find this strange. I believe this URL would appear as:
http://subdomain.example.com/folder/quest/002/This%20is%20a%20question

because %20 represents space.

Does http://subdomain.example.com/folder/quest/002/ return the same page content? If so, Google may index it instead of the longer URL, in which case you would have duplicate content.

Also, I would personally try to avoid capitals in URLs.

shaunm

1:01 pm on Feb 25, 2014 (gmt 0)

10+ Year Member Top Contributors Of The Month



Thanks @aakk9999

I find this strange. I believe this URL would appear as:
http://subdomain.example.com/folder/quest/002/This%20is%20a%20question

because %20 represents space.


It looks like + represents the space in my case though. Is that OK? I've searched for the topic, and most sources suggest either '%20' or '+'.

Also, how do I know whether the URL encoding is done in my dev back-end, rather than the browser replacing the spaces with + or %20?

Does the http://subdomain.example.com/folder/quest/002/ return the same page content? If so, Google may index these instead of the longer URL, in which case you would have a duplicate content.


No, it gets redirected to the longer version. Sometimes Google Webmaster Tools reports traffic for both http://subdomain.example.com/folder/quest/002/ and http://subdomain.example.com/folder/quest/002/This is a question. Strange, isn't it? I don't see the shorter version appear in search, and it redirects to the longer one as well. So how is the shorter one getting clicks?!?

Also, there are component pages such as
http://subdomain.example.com/folder/quest/002/This is a question
http://subdomain.example.com/folder/quest/002/This is a question=page=1
http://subdomain.example.com/folder/quest/002/This is a question=page=2

Is rel='next' and 'prev' the only solution for this, where the content actually changes? Or can I just leave them as they are? All these pages are appearing in the search results.

Thanks again!

lucy24

9:44 pm on Feb 25, 2014 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



The CMS generates pages like this:
http://subdomain.example.com/folder/quest/002/This is a question

When copied to the browser it appears (of course it should) with '+' symbol in between the words.

Well, no, it shouldn't. When you say "copied to browser" do you mean actual copying-and-pasting into your local address bar? Or something else? Handling of spaces at the local level is browser-specific. For example, Camino simply omits spaces. (This is ideal for me, because my log-wrangling routine inserts formulaic spaces that have to be deleted if I'm checking out the URL.) Safari leaves them as-is.

Which form of the URL appears in your logs? Space, plus or percent?

:: detour to test site ::

Thought so, but had to check. If you request something containing literal spaces, the request arrives with %20.

What worries me is that something is getting routed via a query string, which in turn would mean-- or could mean-- that an internal rewrite is getting sent back "out there".

phranque

11:25 pm on Feb 25, 2014 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



http://www.ietf.org/rfc/rfc1738.txt [ietf.org]:
The space character is unsafe because significant spaces may disappear and insignificant spaces may be introduced when URLs are transcribed or typeset or subjected to the treatment of word-processing programs.
...
All unsafe characters must always be encoded within a URL.

this means spaces in the path part of a url should be encoded as %20.


http://www.ietf.org/rfc/rfc1630.txt [ietf.org]:
Within the query string, the plus sign is reserved as shorthand notation for a space.
[my emphasis]
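
To make the two rules concrete, here is a minimal sketch in Python (not the poster's CMS, whose language isn't stated in the thread). It uses the standard urllib.parse module; the parameter name "q" is made up for illustration.

```python
from urllib.parse import quote, urlencode

title = "This is a question"

# Path segment: quote() percent-encodes the space as %20.
path = "/folder/quest/002/" + quote(title, safe="")

# Query string: urlencode() applies quote_plus() by default,
# so the space becomes the + shorthand.
query = urlencode({"q": title})  # "q" is a hypothetical parameter name

print(path)   # /folder/quest/002/This%20is%20a%20question
print(query)  # q=This+is+a+question
```

If a CMS runs the query-string style of encoder over a path segment, the result would look like the + URLs described in this thread.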

aakk9999

1:11 am on Feb 26, 2014 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Thanks phranque :) I did not know that + is reserved for a space in query string. So what it appears is happening here is that during the URL rewrite, the query string that contained + has been appended as the last part of the rewritten URL.

@shaunm,
Your internal URLs (e.g. when you do View Source), are you sure that they have spaces? To me it seems that they already contain +. Although + is a shorthand for space, something has to put it in there, and from what I can see it does not happen automatically when pasting a URL that had a space in the query string (and anyway, in this particular URL this is not a query string, as the query string has been converted into the main part of the URL).

http://subdomain.example.com/folder/quest/002/This is a question=page=1

This looks strange. Are you sure this is what your URL looks like and that there isn't a question mark in the URL, e.g. does it in fact end with .... question?page=1

With regard to rel='next' and 'prev', they are supposed to consolidate the linking properties of paginated pages and therefore make a landing page stronger, so they would be a good idea.

No, it gets redirected to this longer version. Sometimes Google webmaster tool reports traffic for both...

Are you sure that there is no error in the redirect? I have seen cases where redirects occasionally do not work, especially in .NET (IIS). So normally the URLs would be redirecting, then for a few requests they would not, for various reasons. So when you see this in analytics, I would check the logs for that day to ensure that all requests for that URL have in fact responded with 301.
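
A log check along those lines could be sketched like this in Python. It assumes the common/combined access log layout (IIS writes a different format, so the pattern would need adjusting), and the two sample lines are invented purely to show the idea.

```python
import re

# Request line and status code in a common/combined log format line.
# The format is an assumption; adjust the pattern for IIS (W3C) logs.
LOG_RE = re.compile(r'"(?P<method>[A-Z]+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) ')

def non_redirected(lines, path):
    """Return (path, status) for requests to `path` that did not answer 301."""
    hits = []
    for line in lines:
        m = LOG_RE.search(line)
        if m and m.group("path") == path and m.group("status") != "301":
            hits.append((m.group("path"), int(m.group("status"))))
    return hits

# Hypothetical sample lines, not taken from any real log.
sample = [
    '203.0.113.5 - - [25/Feb/2014:07:57:00 +0000] "GET /folder/quest/002/ HTTP/1.1" 301 0 "-" "ExampleBot"',
    '203.0.113.6 - - [25/Feb/2014:08:01:00 +0000] "GET /folder/quest/002/ HTTP/1.1" 200 5120 "-" "ExampleBot"',
]

print(non_redirected(sample, "/folder/quest/002/"))
```

Any entry this returns is a request for the short URL that was served directly instead of being redirected, which is the kind of slip that could explain clicks being reported for both URLs.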

shaunm

7:32 am on Feb 26, 2014 (gmt 0)

10+ Year Member Top Contributors Of The Month



Thanks @lucy24 :-)

Well, no, it shouldn't. When you say "copied to browser" do you mean actual copying-and-pasting into your local address bar? Or something else?

Sorry, I made a mistake in explaining/understanding it. Thanks @aakk9999.

The internal URLs in the source code appear as: example.com/folder/002/This+is+a+question
When I decode them using online encoding/decoding tools, they become: example.com/folder/002/This is a question
Now when I copy this decoded one and paste it into the address bar in Chrome/Firefox, the spaces are automatically replaced with %20: example.com/folder/002/This%20is%20a%20question

Am I in for big trouble? This is not a new sub-domain but it's been there for a very long time.

Which form of the URL appears in your logs? Space, plus or percent?
I don't have access to the logs yet, but I have requested it. So is the form that appears in the logs considered the ideal one of these 3 variants (space, + or %20) in my case?

Thanks @phranque
Within the query string, the plus sign is reserved as shorthand notation for a space.

Does that mean the + symbol can be only used within a query string, replacing the spaces?

Thanks @aakk9999
So what it appears is happening here is that during the URL rewrite, the query string that contained + has been appended as the last part of the rewritten URL.
Can you please explain it further for me?

This looks strange. Are you sure this is how your URL looks like and that there isn't a question mark in URL, e.g. does it in fact end with .... question?page=1
I'm sorry there is a question mark in the URL for the paginated content URLs.

Are you sure that there is no error in redirect? I have seen a cases where redirects on occasions do not work - especially in .net (IIS). So normally URLs would be redirecting, then for a few requests they would not for various reasons. So when you see this in analytics, I would check the logs for that day to ensure that all request for that URL have in fact responded with 301.


Crawlers like ScreamingFrog usually crawl the URLs as they are, like: example.com/folder/002/This+is+a+question
But Google only reports versions such as: example.com/folder/002/This is a question. Is that normally how the search engines report?

Also, yes, I have checked manually that the URL at example.com/folder/002/ redirects to example.com/folder/002/This+is+a+question, even though Google reports click data for both separately. I've never worked with log files; it looks like I should have been, and should be.

Thank you all once again for the kind help :-)

Robert Charlton

8:49 am on Feb 26, 2014 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



shaunm - Thinking out loud here... I'm wondering whether these query strings might have been generated at some point as user-generated questions, which then got saved in a Q&A section on your site... and that Google might be seeing internal server files, along with rewritten files.

shaunm

9:49 am on Feb 26, 2014 (gmt 0)

10+ Year Member Top Contributors Of The Month



Thanks @Bob :-) After a very long time!

I'm wondering whether these query strings might have been generated at some point as user-generated questions, which then got saved in a Q&A section on your site

First of all, could you please tell me what we actually mean by query strings here? What I've been thinking all this time is that query strings are the extra characters added at the end of the URL for tracking purposes, such as CMP=, ?=Tracking, something like that.

Secondly, you are absolutely correct the pages are user-generated in the question and answer section where there will be categories and logged in users can ask questions and it will be answered by the admin.

and that Google might be seeing internal server files, along with rewritten files.

Could you please help me to understand it?

Thanks!

phranque

5:23 pm on Feb 26, 2014 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



the query string is anything after the first question mark in the url up to the end of the url or the first hash mark, whichever occurs first.

for example:
http://example.com/foo.php?parameter=some+text+string#fragid
the query string is "parameter=some+text+string"
and when the query string is parsed the single parameter name in the query string is "parameter" and the parameter's value is "some text string".
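
As a quick check of that example, here is a small Python sketch using the standard urllib.parse module. It only illustrates the parsing; nothing in it is specific to the poster's site.

```python
from urllib.parse import urlsplit, parse_qs

url = "http://example.com/foo.php?parameter=some+text+string#fragid"
parts = urlsplit(url)

# Everything after the first "?" up to the "#" is the query string.
print(parts.query)            # parameter=some+text+string

# Parsing the query string turns each + into a space.
print(parse_qs(parts.query))  # {'parameter': ['some text string']}

# The fragment is split off separately and is not part of the query string.
print(parts.fragment)         # fragid
```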

in this example:
http://example.com/some+path+name/some+file+name.php
the pluses in this example are simply characters in the file or path name and don't represent blanks.

blanks are also allowed in filenames in some filesystems but they are impractical as urls, so i would suggest avoiding them.
i would also avoid pluses in file and path names.
otherwise you will constantly be fighting noncanonical and truncated url problems with requests for urls containing any of the 3 possibilities (blank, plus or %20) due to incorrect encoding/decoding/translating/escaping of these urls.
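
The difference between a plus in the path and a plus in the query string can be seen with a short Python sketch. unquote() is the path-style decoder and unquote_plus() the query-style one; the three variant paths are shortened, made-up versions of the URLs in this thread.

```python
from urllib.parse import unquote, unquote_plus

path = "/some+path+name/some+file+name.php"

# Path-style decoding leaves the pluses alone: they are literal characters.
print(unquote(path))       # /some+path+name/some+file+name.php

# Query-style decoding would wrongly turn them into spaces.
print(unquote_plus(path))  # /some path name/some file name.php

# Hypothetical spellings of one page: blank, plus and %20.
variants = [
    "/quest/This is a question",
    "/quest/This+is+a+question",
    "/quest/This%20is%20a%20question",
]
decoded = {unquote(v) for v in variants}
print(len(decoded))        # 2: the + form stays a different path
```

So a server that decodes paths by the book treats the + spelling as a different resource from the blank/%20 spelling, which is where the noncanonical requests come from.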

JD_Toims

5:34 pm on Feb 26, 2014 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



A note on the encoding:

It's actually very easy to end up with + rather than %20, especially in PHP, since if you use urlencode() rather than rawurlencode() you get spaces converted to + rather than %20.

I agree it's better to not use spaces in URLs, but I would guess Google, Bing and other major search engines are fairly good at handling them since it's something that's been easy to get wrong for years -- I'd be more concerned about linking not being correct than I would about major search engines not being able to figure out the correct location due to a + rather than %20 in the directory/file names.

phranque

6:03 pm on Feb 26, 2014 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



it's not a matter of google figuring out "the correct location" it's a matter of having to accept the collateral damage from malformed urls (PR loss from 404s or PR dilution from 200 OKs) or manage the 301 redirection thereof to the canonical urls.
they are different urls and i wouldn't count on google to do that url translation and canonicalization without some technical help.

lucy24

7:37 pm on Feb 26, 2014 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Did you at some point say which URL actually reaches the page? It's either + or %20 but not both. If it's + then use that in links. If it's %20 you should probably use " " (literal spaces) because it looks better and will be auto-converted, so no Duplicate Content issues.

shaunm

7:54 am on Feb 27, 2014 (gmt 0)

10+ Year Member Top Contributors Of The Month



Thanks @Everyone for the kind responses.

So, again I need your help and suggestions with how I should tackle the situation.

Google reports traffic data for this URL: example.com/category/quest/02/Data Error: This value generates error (ABC Pro)
But the actual URL (as internally linked) is: example.com/category/quest/02/Data+Error%3A+This+value+generates+error+%28ABC+Pro%29

If the pages were encoded properly, as you have all mentioned, spaces should be %20, but in my case it looks like the spaces are intentionally/wrongly replaced by '+'. The other characters are encoded properly, though, as you can see from the above example, where ':' is replaced by %3A, '(' by %28 and ')' by %29. And I can see proper encoding for the rest of the characters in the URLs as well.

So, is this what I should be doing, in the following order?
1. Rewrite rules to change uppercase URLs into lowercase.
2. Replace + and %20 with hyphens/dashes (-) and redirect the + and %20 requests to the - version.
3. Add self-referencing canonical tags to the clean pages (+ and %20 replaced with -).
4. Add them to the Sitemap, submit it for crawling, and expect a deep drop in traffic :-)

@Lucy
Did you at some point say which URL actually reaches the page? It's either + or %20 but not both. If it's + then use that in links. If it's %20 you should probably use " " (literal spaces) because it looks better and will be auto-converted, so no Duplicate Content issues.


To continue with the above:
5. The PAGE is actually reachable with both + and %20, and with upper/lowercase versions of them as well.
i. example.com/category/quest/02/Data+Error%3A+This+value+generates+error+%28ABC+Pro%29 (Uppercase)
ii. example.com/category/quest/02/data+error%3A+this+value+generates+error+%28abc+pro%29 (Lowercase)
iii. example.com/category/quest/02/Data%20Error%3A%20This%20value%20generates%20error%20%28ABC%20Pro%29 (Uppercase)
iv. example.com/category/quest/02/data%20error%3A%20this%20value%20generates%20error%20%28abc%20pro%29 (Lowercase)

Should I consider adding rel=canonical on the duplicate versions? Or is that already addressed in points 2 and 3?
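
For what it's worth, the mapping from the four variants to a single lowercase, hyphenated form could be sketched as below in Python. The function name canonical_slug is hypothetical, and dropping punctuation such as ':' and '()' is an assumption that goes beyond the steps listed above. The 301 redirect itself would still have to be done by the server or CMS.

```python
import re
from urllib.parse import unquote_plus

def canonical_slug(segment):
    """Decode +, %20 and other escapes, lowercase, and hyphenate (hypothetical helper)."""
    text = unquote_plus(segment).lower()
    # Assumption: every run of non-alphanumeric characters becomes one hyphen.
    return re.sub(r"[^a-z0-9]+", "-", text).strip("-")

# Last path segment of four variants modeled on the ones listed above.
variants = [
    "Data+Error%3A+This+value+generates+error+%28ABC+Pro%29",
    "data+error%3A+this+value+generates+error+%28abc+pro%29",
    "Data%20Error%3A%20This%20value%20generates%20error%20%28ABC%20Pro%29",
    "data%20error%3A%20this%20value%20generates%20error%20%28abc%20pro%29",
]

print({canonical_slug(v) for v in variants})
# {'data-error-this-value-generates-error-abc-pro'}
```

If every variant redirects with a 301 to that one form, and the clean page carries a self-referencing canonical tag, the duplicates should not need their own rel=canonical.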


Thanks again for your valuable inputs.

lucy24

11:07 am on Feb 27, 2014 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Is it too late to demand a refund on the CMS?