joined:Aug 14, 2008
It's been a while since I've started my own thread, but this concern has been an ongoing issue for the last two years and we could really use some outside perspective. I'd appreciate any brainstorming, suggestions or general ideas, especially from the old-timers like myself ;-) (this is my 2nd account; I lost my first one from waaaaay long ago). Here are the basics:
*Note: Numbers are fictitious and are used just for simplicity's sake.
Brief Overview of Website
- This website has a high volume of pages, many backlinks, is made up of User Generated Material & augmented with information from verified sources, and has been online since around 2000.
- The website has three primary sections, but only two are in question. Let's call them directory A and directory B.
- Directory A is the 'main' directory. It contains, say, 10,000 pages that act as hubs for all the pages contained in directory B. Directory B contains, let's say, 500,000 pages.
- Directory A URL = /directoryA/UniqueID
- Directory B URL = /directoryB/UniqueID
- Directory B honestly contains the most important information to our users. In other words, people searching can be looking for Directory A information, but 90% of the time are looking for information on the pages in Directory B.
- Directory A has a 99% index rate and monopolizes about 90% of our crawl rate.
- Directory B is not being crawled nearly as much, and it recently went from a 98% index rate to bleeding indexed pages until we were down to 0.5%.
- The weird thing is that all of the pages contained in Directory B are unique text (there are very near duplicates based on the type of information these pages contain), while Directory A pages (as they are a hub) are mainly filled with generic text. The value to our users is clearly in Directory B pages, but Google continues to value Directory A over Directory B.
- We fouled ourselves over by changing the URL structure of Directory A and Directory B prior to seeing a decline in indexing, but it has now been 2 years, so the big G should have switched over to our new URL structure (and we set up 301 redirects to the new pages). Plus, Directory A is fully indexed and it also went through the URL change.
- We fouled ourselves over again by incorrectly handling canonicals and ?tid tracking in URLs, which caused duplicate content, but these pages have now all been removed from Google's index. Again, enough time has passed for G to catch up.
- We've verified that robots.txt is accurate and that we do not block any of the pages we want indexed.
- We've done a great job (if I do say so myself) on updating sitemaps; with a site this size, we use .xml.gz for child sitemaps.
- We even engaged SearchBros to get search quality ex-Googlers' insights and followed all the basic clean-up items they found.
- Internal linking is obviously easier for Directory A (as there are 100x more pages in Directory B), but we have gone through and used multiple solutions to make sure that there are at least a few internal links to every page in Directory B.
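To make the ?tid and robots.txt points above concrete, here is a minimal Python sketch of how those two checks could be scripted. It is an illustration only, not our production code: the example.com domain, the helper names canonical_url and blocked_urls, the sample robots.txt rules, and the assumption that tid is the only tracking parameter are all made up for this post.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit
from urllib.robotparser import RobotFileParser

# Assumption: "tid" is the only tracking parameter that needs stripping.
TRACKING_PARAMS = {"tid"}


def canonical_url(url):
    """Return the URL with tracking parameters and any fragment removed."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in TRACKING_PARAMS
    ]
    return urlunsplit(
        (parts.scheme, parts.netloc, parts.path, urlencode(kept), "")
    )


def blocked_urls(robots_txt, urls, agent="Googlebot"):
    """Return the subset of urls that the given robots.txt text disallows."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if not parser.can_fetch(agent, url)]


if __name__ == "__main__":
    # Hypothetical robots.txt content, not our real file.
    sample_robots = "User-agent: *\nDisallow: /private/\n"
    pages = [
        "https://example.com/directoryB/123",
        "https://example.com/private/report",
    ]
    print(canonical_url("https://example.com/directoryB/123?tid=abc"))
    print(blocked_urls(sample_robots, pages))
```

Running something like canonical_url over a crawl export would show which tracked URLs collapse to the same page, and running blocked_urls over the sitemap URLs would flag anything robots.txt blocks by accident.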
I could go on for days with additional information, but I know you don't have the time to read every nuance. If you've made it this far, thank you for reading! =-) Questions? Clarifications? Random insights or suggestions? Anything is welcome and appreciated at this point.
Thanks all! If we are able to resolve this issue I'll make sure to update this thread with our findings.