Please stick with me... this is probably going to get long. I have been having some very odd problems in Webmaster Tools and would like to get some advice from some of you with more knowledge than me (probably most of you out there). I have simplified the examples as radically as I can to get my points across.
Two and a half years ago, I decided to take individual forum files and group them under one filename. Back then, each forum was named separately... i.e...
Each of these files would pass arguments to pull up a list of posts... i.e...
Very simple. I decided to combine all of the forums' posts under a single filename. I knew there would be fallout from this, but at the time, I didn't care. The forum was relatively busy and I was changing it for easier administration. So now, the new filename was just forum.ext, and I broke each forum out like this...
Still very simple (pay attention to the case and the order of the arguments).
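Since the actual filenames and arguments are snipped above, here is a hypothetical sketch of what I mean by one canonical argument form: one exact key casing (ForumNum, from later in this post, plus an invented "Topic" argument) in one exact order, with everything else rejected. This is an illustration only, not my actual code:

```python
from urllib.parse import parse_qsl, urlencode

# Canonical argument names in canonical order. "ForumNum" appears later in
# this post; "Topic" is an invented stand-in for the second argument.
CANONICAL_KEYS = ["ForumNum", "Topic"]

def canonical_query(raw_query):
    """Rebuild a query string with canonical key casing and ordering.

    Returns the one canonical query string, or None if the query has
    missing or unknown keys (those requests should get a 404).
    """
    by_lower = {k.lower(): v for k, v in parse_qsl(raw_query)}
    if set(by_lower) != {k.lower() for k in CANONICAL_KEYS}:
        return None
    return urlencode([(k, by_lower[k.lower()]) for k in CANONICAL_KEYS])

# A case- and order-mangled variant maps back to exactly one canonical URL:
print(canonical_query("topic=42&FORUMNUM=3"))  # ForumNum=3&Topic=42
```

Any request that doesn't already match its own canonical form can then be 301-redirected to it (or 404ed outright).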
Okay, things went great after that. The old forum pages were dropped and now showed 404s. The new forum pages were crawled and everything was great. The forum in the SERPs dropped for about 4 months and then started to come back. I had arranged titles and meta-descriptions to avoid any duplicate content like this...
title - <topic> <forumName>
description - <snip of first 100 characters of post> <forumName>
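In code terms, that title/description scheme amounts to something like this (a sketch; the function and variable names are invented):

```python
def page_meta(topic, post_body, forum_name):
    """Build a title and meta description per the scheme above:
    title = <topic> <forumName>;
    description = <first 100 characters of post> <forumName>.
    """
    title = f"{topic} {forum_name}"
    description = f"{post_body[:100]} {forum_name}"
    return title, description
```

Because every topic and post snippet differs, each page gets a unique title and description, which is what kept duplicate-content flags away at first.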
Everything went well, and at the end of 2007 there were 13,500 forum pages in the index, which was about 50% of the site's pages. All of this was done with no robots.txt exclusions and no URL removals through Google Webmaster Tools.
In April 2008, I started seeing a drop in the number of pages indexed. I really didn't know why. I read through other posts on WebmasterWorld about how the site: operator was flawed anyway, so I didn't pay close attention to it. GWT still looked good, rankings were good, and everything looked okay to me.
In June 2008, I was shocked to see in GWT that the site had over 6500 pages with duplicate titles and meta descriptions. As far as I could tell, URLs were getting muxed somehow. I posted about the dilemma here on WebmasterWorld, in the Googlebot Now Crawls via HTML Forms thread located here...
What has happened is that Google has muxed the URLs like this...
All of these URLs would point to the same page. Case was randomly changed, and the order of the arguments was changed. Also, I started getting a bunch of pages that didn't exist, like so...
There is no ForumNum=18
Needless to say, I was flabbergasted as to what was going on. I searched as much as I could to see if these URLs existed anywhere on the Web. I cannot find them. I then thought this was a programming error, but I checked, double-checked, triple-checked and quadruple-checked my code, and there is no way these URLs are being generated on my end. All log files on the server showed that only Googlebot was requesting these pages.
So I started to take some preventive measures to see if I could fix the problem. All muxed URLs now return a 404. I added every combination of muxed URLs to the robots.txt file. I deleted as many of the muxed URLs as I could in GWT, and life went on. GWT shows that duplicate titles are now down to 14 and duplicate descriptions down to 4. These are actual posts that have the same topic and very similar content, so I am willing to let those go.
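For anyone wanting to do the same, the 404 part amounts to accepting only the one exact canonical query and rejecting everything else. A sketch (Python; the VALID_FORUMS set is a made-up stand-in for the real forum table):

```python
from urllib.parse import parse_qsl

# Invented stand-in for the real forum table. Note that 18 (the
# nonexistent ForumNum being requested) is deliberately not in it.
VALID_FORUMS = {1, 2, 3, 17}

def status_for(raw_query):
    """Return 200 only for the exact canonical query; 404 for any muxed variant."""
    pairs = parse_qsl(raw_query)
    # Exact key name (and, with more arguments, exact order) is required,
    # so case-mangled or reordered variants fall through to 404.
    if [k for k, _ in pairs] != ["ForumNum"]:
        return 404
    value = pairs[0][1]
    return 200 if value.isdigit() and int(value) in VALID_FORUMS else 404
```

With this in place, a crawler hammering muxed variants gets a consistent 404 instead of duplicate copies of the same page.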
Yesterday, I logged into GWT and started going through internal links, and guess what! It shows old forum filenames in there... huh? Where did those come from? I clicked the links to see if I could find them, and they are not on my pages. So now I have...
back in the internal links, even though these pages have not existed for two years. Not only that, the pages that these links are supposedly on are from a redesigned site with a single forum file AND from pages that did not exist back when the original forum was busted up into multiple filenames.
So now I have added these filenames to robots.txt, but I am at my wits' end trying to figure out what the heck is going on. It seems every time I try to fix something, it just gets worse. Either there is something broken on Google's end, or someone is trying really hard to bring the site down. And I cannot find any reason these pages even exist on the Web.
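One caveat worth passing along about the robots.txt approach, since these muxed URLs differ only in case and argument order: robots.txt path matching is case-sensitive, so every case variant needs its own Disallow line. A hypothetical fragment (the filenames here are invented; mine are different):

```
User-agent: *
Disallow: /forum.ext?forumnum=
Disallow: /forum.ext?FORUMNUM=
Disallow: /oldsportsforum.ext
Disallow: /OldSportsForum.ext
```

This is why the server-side 404 is the real safety net; robots.txt alone can't enumerate every randomized variant.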
Now the site is down to 8050 pages indexed (though I know the site: operator can be flawed). It seems Googlebot is only doing deep crawls about once a month (this has been going on for the past 2 months; it used to get deep-crawled at least twice a week), and everything I look at in GWT seems to be bogus. The funny thing, though, is that the ranking of the site has never been better. #1s are still #1, and a few other terms actually went up a bit. Another odd thing: even though deep crawls are so infrequent, new topics in the forum are getting added to the index within 24 hours in some cases. I am totally baffled as to what is going on...
Is anyone else seeing this type of behavior? Am I going nuts? Today I actually found an archived version of the site (not even in the root of the server) and deleted it... that's how nuts I am getting. And then I started thinking about sites like the Wayback Machine... could G have picked up on something like that?
Anyway... thanks for reading...