
Google SEO News and Discussion Forum

    
Upper case lower case removal urls, different to natural spidering?
flanok
msg:4391258
 11:54 am on Nov 26, 2011 (gmt 0)

Hi.

Not sure if this is the right category to post in, as it relates both to Webmaster Tools and to Google's natural spidering. Please move it if you think it is wrong.

I have a large site that, when it was created 5 years ago, made the mistake of mixing upper case and lower case letters within all URLs, which are separated into directories.

e.g. site/FIRSTDIRECTORY/SECOND-DirecTory/Third-directory/

This mish-mash of upper and lower case letters in the directories and URLs meant that as soon as a link used the wrong sequence of upper and lower case letters, a new, duplicate page was formed.

This occurred over many pages in our site, which was bearable until Panda came along and took the whole site's rankings down, rather than just the pages concerned.

Process
So I got a programmer to add some code that forces all URLs to upper case (a 301, no matter what sequence of letters is requested) and then changed the linking structure (menus) throughout the site to reflect the new upper case version.
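In case it helps anyone, the code is along these lines (a simplified sketch, not the exact code; query strings are passed through untouched):

    <?php
    // Sketch: 301 any request whose path is not already all upper case.
    $path = parse_url($_SERVER['REQUEST_URI'], PHP_URL_PATH);
    if ($path !== strtoupper($path)) {
        $query = isset($_SERVER['QUERY_STRING']) && $_SERVER['QUERY_STRING'] !== ''
            ? '?' . $_SERVER['QUERY_STRING']
            : '';
        header('Location: ' . strtoupper($path) . $query, true, 301);
        exit;
    }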

The problem with this new menu was that Google was no longer finding the old pages, and so never learned that they redirect to the new upper case pages. This actually made things worse: Google was finding the new upper case URLs whilst still keeping the older mish-mash in the index.

Now the question
So I started to remove all non-upper-case directories via Webmaster Tools (which cleaned the site up); with the site: command you will now only see upper case URLs.

I then added as many pages as I could to sitemaps, to speed up the process of finding the new upper case pages.

But rather than finding new pages, Google is in fact REMOVING the new upper case pages instead, at approximately a thousand a day.

This strongly suggests that Google can remove directories with a mish-mash of upper and lower case letters via Webmaster Tools (and leave the upper case versions in the index), but when it comes to NATURAL spidering of the site, it treats a URL the same whatever its sequence of letters.

It sees that a sequence of letters has been removed in Webmaster Tools, but even though the URL it has visited (natural spidering) is now upper case, it does not distinguish between the upper case version and the mish-mash version, and removes the upper case version as well.


So I am now losing pages through Google just spidering my site, even though the only pages being blocked/redirected are those with a mish-mash of letters.

No directory with upper case only letters is being blocked or redirected in any way, either in the robots.txt file or in the .htaccess file. I have checked many times.
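For anyone checking the same thing: robots.txt path matching is case-sensitive, so a Disallow line only blocks the exact casing it lists. A hypothetical illustration:

    # robots.txt: path matching is case-sensitive
    User-agent: *
    Disallow: /Second-DirecTory/
    # blocks /Second-DirecTory/... but not /SECOND-DIRECTORY/ or /second-directory/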

No upper case page within the index redirects anywhere either.

I can see no reason why Google is losing upper case directory pages rather than adding to them.

Any ideas?

Thanks for your time.

 

lucy24
msg:4391377
 12:24 am on Nov 27, 2011 (gmt 0)

I am horribly tempted to say: rename everything again, this time using all lower-case letters, which is what you should have done in the first place. Remove nothing. Rewrite everything containing even a single upper-case letter to a PHP script that makes everything lower-case and then 301-redirects to the now-correct place.
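Something along these lines (a minimal sketch, assuming Apache mod_rewrite; fix-case.php is a made-up name):

    # .htaccess: send any path containing an upper-case letter to the fix-up script
    RewriteEngine On
    RewriteCond %{REQUEST_URI} [A-Z]
    RewriteRule ^ /fix-case.php [L]

    <?php
    // fix-case.php: lower-case the requested path and issue a single 301.
    // This lower-cases the query string too; handle it separately if its
    // case matters.
    header('Location: ' . strtolower($_SERVER['REQUEST_URI']), true, 301);
    exit;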

Sit back and wait a few years for g### to figure it out.

Option B is to take early retirement :(

tedster
msg:4391396
 3:23 am on Nov 27, 2011 (gmt 0)

I also question the wisdom of going to all upper-case. However, you've already done that and it's now part of the terrain. Adding another redirect and changing all your canonical URLs one more time would increase the tangle even more.

Are you sure you've changed ALL your internal linking to upper-case, as well as ALL your Sitemap URLs and any other .htaccess rules (such as no-www to with-www, or index.html to directory root)?
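For reference, rules of that kind usually look something like this (a generic Apache sketch; example.com stands in for the real domain). If any of them still target an old-case URL, requests will chain through more than one redirect:

    RewriteEngine On
    # no-www to with-www
    RewriteCond %{HTTP_HOST} ^example\.com$ [NC]
    RewriteRule ^(.*)$ http://www.example.com/$1 [R=301,L]
    # index.html to directory root
    RewriteRule ^(.*)index\.html$ /$1 [R=301,L]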

-----

Another question - can you give us some idea of the time scale involved here?

flanok
msg:4391429
 10:35 am on Nov 27, 2011 (gmt 0)

Hi Tedster

Yes, I am sure all are redirected to upper case. It happens even before the .htaccess, as all the forums state it is difficult to do at that level.

Non-www to www has always been handled.

The issue is that Google can remove non-upper-case pages and leave upper case pages in place using the removal tool in Webmaster Tools (there is no need to use the robots.txt file, as it recognises from the redirects that the page/directory no longer exists).

But when it comes to natural spidering, G sees the URL within the removal tool but does not distinguish between upper and lower case, and so continues to remove any page with the same URL (at this point, the versions I want to keep).

The annoying thing is, when I cancel a directory removal in Webmaster Tools, G throws the mish-mash versions back into the index, even though it should by now know these are redirects and that the page/directory no longer exists. I just don't understand why it would do this.

If G did not throw back the old versions, this would have been OK. I would simply cancel all removals, freeing up the natural G spider to do its work without re-introducing the old pages.

flanok
msg:4391432
 11:03 am on Nov 27, 2011 (gmt 0)

Sorry Tedster

I did not answer all your questions.

You are right about the wisdom of going upper case. At the time I did not expect the issues I have now; it was to do with H1 tags within the site. They are generated from the URL within the code, and I could not force the first letter of the H tag to be a capital while the URLs were lower case.

Knowing the work I am having to put into the site now, I should have found a better solution. But I am confident that as long as all URLs follow one set format, and all other versions 301 to that format, it should be OK.

Not sure what you mean by timescales. We were hit by Panda with a 10% drop in July, then an 80% drop in August.

In July an employee added a complex menu system to the site without acknowledging the complex nature of the URL set-up (not his fault, mine for not being close enough to the work), which meant that by the August Panda update pretty much every page had a double, and so Panda destroyed the site in late August (whilst I was on holiday).

In early September I decided that even if we did correct the menu system, it would catch us out again if there was not one standard URL format throughout, so the decision was made to go upper case (as explained above).

I then removed several directories using the robots.txt file but, as now, we started losing good pages also. I decided it must be the robots.txt process, as G was not being allowed to access the old mish-mash URLs to acknowledge they were redirects and so remove them altogether.

Ironically, on September's Panda update the filters were removed and we gained back rankings, even though we were still losing good pages at an alarming rate. Rankings were up, but traffic was not really returning.

So we cancelled all the directory removals in the Webmaster Tools removal tool in late September (gradually, over a month through to October) and waited for nearly 2 months in the hope that G would naturally remove all the mish-mash URLs.

Of course, on the next Panda update we lost rankings again, with too many duplicate pages having been reintroduced.

So in desperation I tried to remove pages without the robots.txt file, relying on the fact that all the old pages were redirects and no longer existed. The removal process worked; G knows these pages do not exist.

This is where we are today: still under Panda, with pages being removed from the index every day, even though there is no redirect in place, or anything in the robots.txt file, for the upper case versions that are being removed.

HuskyPup
msg:4391440
 12:08 pm on Nov 27, 2011 (gmt 0)

I am also tempted to agree with lucy24.

The entire site needs restructuring correctly; you've a mish-mash of all sorts going on and you've well and truly confused the bots.

it was to do with H1 tags within the site. They are generated from the URL within the code, and I could not force the first letter of the H tag to be a capital while the URLs were lower case.


This sounds like a serious CMS flaw. What are you using, or is it hand-made (doubtful)?

flanok
msg:4391443
 12:52 pm on Nov 27, 2011 (gmt 0)

It is hand-made, coded by a previous owner (I am not a coder).

Built in 2005, then I inherited it in 2007.

I believe the site is restructured correctly now, but still has the old mixed with the new.

This post is about removing the old pages without removing the new.

I could just remove all pages and rename the new.

But this site is a business with several clients, and I would go under in the time it would take G to find all the new pages.

I think the solution is going to be a mix of the two:

Allow the least visited pages to disappear and re-introduce them under new names.
Try to force G to recognise the redirect on the popular pages.

g1smd
msg:4391480
 5:02 pm on Nov 27, 2011 (gmt 0)

Make sure all internal linking uses the correct case.

Make sure that all incorrect URLs return 404 or redirect to the correct URL. Use Xenu Linksleuth or similar to be sure.

Google will eventually figure it all out. Make sure that no incorrect URLs invoke any sort of multiple-step redirection chain.
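One quick way to spot-check a single URL (a PHP sketch; the URL is just a placeholder). PHP's get_headers() follows redirects by default and returns one status line per hop:

    <?php
    // A clean old URL should show exactly two status lines: a 301, then a 200.
    // Three or more status lines means a multiple-step redirect chain.
    $headers = get_headers('http://www.example.com/SECOND-DirecTory/');
    foreach (preg_grep('#^HTTP/#', $headers) as $status) {
        echo $status . "\n";
    }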

tedster
msg:4391488
 6:05 pm on Nov 27, 2011 (gmt 0)

This may be a side issue now, but you can use a CSS text-transform rule to change an all lower-case H tag (in the source code) to display in all upper-case, or with capitalized words. There's no need to have the URLs tied to the capitalization schema of the element in the source code, even if the CMS does seem to link the two.
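For example (plain CSS; the class name is illustrative):

    /* display an all lower-case h1 in upper case... */
    h1 { text-transform: uppercase; }

    /* ...or capitalise the first letter of each word */
    h1.capitalised { text-transform: capitalize; }

The source code, and therefore the URL scheme, can stay lower case while visitors see capitals.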

g1smd
msg:4391489
 6:09 pm on Nov 27, 2011 (gmt 0)

If you're using PHP, functions such as strtolower and ucwords are also useful.

Other programming languages will likely have similar functions.
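For instance, applied to the heading-from-URL problem above (a sketch; the slug-to-heading mapping is assumed, not taken from the site's actual code):

    <?php
    // Derive a capitalised heading from a lower-case URL segment.
    $slug = 'second-directory';                       // hypothetical segment
    $heading = ucwords(str_replace('-', ' ', $slug)); // "Second Directory"
    echo $heading;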

flanok
msg:4391490
 6:37 pm on Nov 27, 2011 (gmt 0)

Hi
Thanks for the reply


Make sure all internal linking uses the correct case.

Make sure that all incorrect URLs return 404 or redirect to the correct URL. Use Xenu Linksleuth or similar to be sure.

Google will eventually figure it all out. Make sure that no incorrect URLs invoke any sort of multiple-step redirection chain.




This is what I did right from the start of the repair.

But in hindsight it was probably the wrong strategy, as G was finding the new upper case pages but no longer had a linking structure to the old URLs, which needed to be found in order to be removed from the index.

If I had left the old linking structure in place after adding the redirects, there is a real chance that G would have spidered the old URLs and removed most of them by now.

Then, when most of the URLs had been removed, I could have changed the menu to what it is now.

I am sure G will sort it all out, and if this wasn't a commercial venture with clients I would have just left it for 6 months. But it is my livelihood, so it is more urgent than the other sites I have.

flanok
msg:4391493
 6:47 pm on Nov 27, 2011 (gmt 0)

Thanks Tedster

This may be a side issue now, but you can use a CSS text-transform rule to change an all lower-case H tag (in the source code) to display in all upper-case, or with capitalized words. There's no need to have the URLs tied to the capitalization schema of the element in the source code, even if the CMS does seem to link the two.


I didn't know that, but I do know people I could have rung to sort that out for me at the time (and should have).

I accept I made the wrong decision, but I am not sure I would be in a different boat from the one I am in now.

I have decided to let some pages just drop off and re-introduce the content under different URL names (an opportunity to make these better). It may take time for G to find these new pages, but at least it is clean. Not adding the old URLs back into the index keeps my duplicate count down in the fight against Panda.

For other, normally high-traffic pages (those with 2 versions), I will use sitemaps and Fetch as Googlebot to push Google to the old pages, which now exist only as redirects, so they are removed ASAP.

Thanks again

g1smd
msg:4391497
 7:30 pm on Nov 27, 2011 (gmt 0)

G was finding the new upper case pages but no longer had a linking structure to the old URLs, which needed to be found in order to be removed from the index.

They don't need a link to the old URLs. If they have "seen" a URL previously (in a link somewhere), whether or not they have ever actually requested it, they will have made a note to request that URL from time to time in the future and see what is returned. If the old URL now returns a 301 or a 404, they will act on that. It can take 3 months for the SERPs to sort themselves out, and 6 months for WMT to be in agreement, but they eventually get there.

You should not continue to link to old URLs. Link to the new URLs so they can be found. The old URLs will be dropped based on the server response and the fact that nothing links to them any more.

Google specifically warns that you should list only valid URLs in your sitemaps. You should not list URLs that redirect.
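In other words, the sitemap should contain only the canonical upper case URLs that return 200, along the lines of this skeleton (placeholder URLs):

    <?xml version="1.0" encoding="UTF-8"?>
    <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
      <!-- canonical upper case URLs only; nothing that 301s or 404s -->
      <url><loc>http://www.example.com/FIRSTDIRECTORY/</loc></url>
      <url><loc>http://www.example.com/FIRSTDIRECTORY/SECOND-DIRECTORY/</loc></url>
    </urlset>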

flanok
msg:4391499
 7:45 pm on Nov 27, 2011 (gmt 0)

Thanks g1smd

I now have no intention of changing my internal structure back, and I do know that G would eventually find the old URLs.

I will take your advice on sitemaps, though. I did see them as a way to speed up the process, bearing in mind I am under Panda anyway.

But I will refrain from using them with old urls.

At the moment it is about speeding up the process and helping G find the changes.

If it is all just about waiting for it to happen in G's own time, then I am pretty much stuffed; I am already lucky to be keeping some clients as it is.

enigma1
msg:4392409
 11:07 am on Nov 30, 2011 (gmt 0)

A couple of things I would recommend in your case.

1. Fix the web application so it only generates one link format; lower case links should be fine. Make sure valid links are exposed everywhere in your site's content, and double-check the way the application is programmed.

2. Once you have completed 1, stop using redirects or 404s on old mixed case link requests. Use either the server or application scripts to convert requests for the old mixed case links to the new link format, then output 200 OK in these cases.
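In practical terms this means an internal rewrite rather than a redirect, something like this Apache sketch (assuming a hypothetical front controller at index.php):

    RewriteEngine On
    # internally map any path containing upper-case letters onto the handler;
    # the browser gets a 200 and the address bar keeps the requested casing
    RewriteCond %{REQUEST_URI} [A-Z]
    RewriteRule ^(.*)$ index.php?path=$1 [L,QSA]

Only the response status differs from the redirect approach; the same content then answers under every casing of the URL.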

lucy24
msg:4392679
 9:29 pm on Nov 30, 2011 (gmt 0)

Once you have completed 1, stop using redirects or 404s on old mixed case link requests. Use either the server or application scripts to convert requests for the old mixed case links to the new link format, then output 200 OK in these cases.

If you are changing the request, you are redirecting. If you handle it as a rewrite, you have not helped the problem.

enigma1
msg:4392686
 9:52 pm on Nov 30, 2011 (gmt 0)

If you are changing the request, you are redirecting. If you handle it as a rewrite, you have not helped the problem.

I think I have, and it's better than early retirement. Instead of giving out a 404, which is a total waste, you serve such requests with a 200 as if nothing had happened. Google will eventually pick up the new link from the site's main navigation and after a while will drop the old one.
