Forum Moderators: Robert Charlton & goodroi
The rewrite code was added 3-5 months ago. The rewrite makes the URLs, which happen to be for pages with domain/website information, friendly, like www.oursite.com/page/domain.com/ ; the non-rewritten pages appear as www.website.com/page.php?id=12345 . The only reason some pages were not rewritten is that some of the domain names, from older entries, are actually full paths to subdirectories and contain forward slashes, such as "domain.com/sub/", and we simply decided not to rewrite these URLs because of the handling of the forward slashes.
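To make the rule above concrete, here is a minimal Python sketch of how the link for a page might be chosen. The function name `public_url` and the exact path formats are assumptions for illustration; the site's real code is not shown in this thread.

```python
def public_url(page_id, domain):
    """Pick the link format for a page (hypothetical helper).

    Entries whose "domain" contains a forward slash (older entries that
    are really paths, such as "domain.com/sub/") keep the query-string
    form; everything else gets the friendly form.
    """
    if "/" in domain:
        return "/page.php?id=%d" % page_id
    return "/page/%s/" % domain
```

With this rule, `public_url(12345, "domain.com")` gives the friendly form and `public_url(12345, "domain.com/sub/")` falls back to `page.php?id=12345`.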
Within the website, on the rewritten pages, there are links to the non-rewritten pages when we have a querystring parameter to add (so, page.php?id=12345&something=else). These versions of the page, for the most part (other than maybe 5), have disappeared as well. Under some circumstances the non-rewritten page with extra parameters and the rewritten page could return exactly the same HTML.
Thoughts on why the rewritten URLs have disappeared? Are we being penalized for having both rewritten and non-rewritten URLs that have some of the same content?
If you have not locked down acceptable URLs, you may have been a victim of an attack.
I'm not 100% clear on what you have actually done. Please can you clarify if I have it right:
OLD:
www.example.com/page.php?id=12345
NEW:
www.example.com/12345/domain.com/
Or is the original id hidden on the returned page:
www.example.com/page/domain.com/
Finally, is the page REQUESTED (i.e. linked to) as
"/12345/domain.com/"
and internally rewritten to fetch id=12345?
Or is the page requested as "id=12345", which is then 301'd to "/12345/domain.com/"?
Duplicate pages normally result in ALL BUT ONE version of the page disappearing, so that shouldn't be the problem either.
Have you visited the pages (both the PHP page and the tidy form) using Live HTTP Headers? Are all the server responses as you expect (200 for the tidy, one single 301 -> 200 for the PHP)?
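If you would rather script that check than click through a browser add-on, something along these lines could work. It is only a sketch: `fetch_head` and `trace` are names made up for this post, and it assumes the server answers HEAD requests the same way it answers GET.

```python
import http.client
from urllib.parse import urlsplit, urljoin

def fetch_head(url):
    """Send one HEAD request without following redirects.

    Returns (status, Location header or None).
    """
    parts = urlsplit(url)
    if parts.scheme == "https":
        conn = http.client.HTTPSConnection(parts.netloc, timeout=10)
    else:
        conn = http.client.HTTPConnection(parts.netloc, timeout=10)
    try:
        path = parts.path or "/"
        if parts.query:
            path += "?" + parts.query
        conn.request("HEAD", path)
        resp = conn.getresponse()
        return resp.status, resp.getheader("Location")
    finally:
        conn.close()

def trace(url, fetch=fetch_head, max_hops=5):
    """Follow redirects by hand and return the list of (status, url) hops."""
    hops = []
    for _ in range(max_hops):
        status, location = fetch(url)
        hops.append((status, url))
        if status not in (301, 302, 303, 307, 308) or not location:
            break
        url = urljoin(url, location)
    return hops
```

For the PHP URL you would hope to see exactly two hops, a 301 followed by a 200; for the tidy URL, a single 200. Anything longer (a chain of redirects, a 302, or a 200 on both forms) is worth investigating.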
If the appended parameters make no substantial changes to the HTML, you might help the issue by using the canonical tag (rel="canonical"). Also, you should make sure you disallow all unintended appended parameters via .htaccess.
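As a rough illustration of the canonical tag idea, here is a Python sketch that strips unknown parameters and builds the tag. The whitelist `ALLOWED_PARAMS` is an assumption (only `id` and `upgrade` are treated as meaningful), and both function names are hypothetical; adjust them to whatever parameters really change the page.

```python
from html import escape
from urllib.parse import urlsplit, parse_qsl, urlencode, urlunsplit

# Assumption: only these query parameters change what the page shows.
ALLOWED_PARAMS = ("id", "upgrade")

def canonical_url(url):
    """Drop unknown query parameters and any fragment, in a fixed order."""
    parts = urlsplit(url)
    kept = [(k, v) for k, v in parse_qsl(parts.query) if k in ALLOWED_PARAMS]
    kept.sort(key=lambda kv: ALLOWED_PARAMS.index(kv[0]))
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))

def canonical_link_tag(url):
    """Build the <link> element to place in the page's <head>."""
    return '<link rel="canonical" href="%s">' % escape(canonical_url(url))
```

Every variant of a page that differs only in junk parameters would then point at the same canonical address.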
Other than that, I'm at a loss.
When Shadows said
"If you have not locked down acceptable URLs, you may have been a victim of an attack."
I agree, I've seen that happen a few times. There are some competitors who send bots to what I call URILATED URLS (URLs with keyword strings and page IDs which return a 200 OK response no matter what you have in the URL as long as the page ID exists, hence causing havoc and duplicate content). Malicious spammers also do that nowadays; their motives are free ads and stats backlinks, and they don't realize the havoc they cause as well.
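"Locking down" here just means the server refuses anything that is not a URL you intended to publish. The logic can be sketched in Python as below; this is not .htaccess code, and `ALLOWED`, `KNOWN_IDS` and `respond` are hypothetical names with made-up data. Answering 404 for junk is one choice; a 301 to the clean URL is another.

```python
from urllib.parse import urlsplit, parse_qsl

# Assumptions for illustration: the parameters the site really uses,
# and the page IDs that exist.
ALLOWED = {"id", "upgrade"}
KNOWN_IDS = {"12345"}

def respond(request_uri):
    """Return the status code a locked-down page.php would send."""
    parts = urlsplit(request_uri)
    params = parse_qsl(parts.query, keep_blank_values=True)
    names = [name for name, _ in params]
    if parts.path != "/page.php":
        return 404
    # Reject unknown or repeated parameters instead of answering 200.
    if any(n not in ALLOWED for n in names) or len(names) != len(set(names)):
        return 404
    if dict(params).get("id") not in KNOWN_IDS:
        return 404
    return 200
```

The point is that `/page.php?id=12345&cheap-pills=1` stops returning 200 just because the ID exists.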
I wouldn't reverse the situation unless I were sure you are in breach of the G* guidelines. It looks to me like the situation needs time for the 301ed pages to transfer rank (6 months is the average I've seen); that is assuming all is well with your method and no more than one URL is leading to the same page (splitting PR and causing duplication).
Changing thousands of URLs at once when they are already established and have backlinks can take a long time: 6-9 months for the 301s to take hold.
If those pages did have 200k unique daily visitors in that section, they must have been well established with solid backlinks and trustrank. You should have taken SEO advice first from our community here; not everyone is an SEO expert, but some of us do this for a living. You could've PMed someone here, tedster and other guys here could help (I don't have time, I'm afraid, and am too busy with similar requests for help), or gone directly to a well-known SEO company or site.
From what you explained, I didn't get the exact problem you're having, same as Shadows; you need to explain in detail. If the situation remains as is, I'd PM someone here and give them all your site's details, logs, stats, credentials etc.; in other words, you need someone on the case.
If I were you I wouldn't do the same for other sections; you could contaminate the whole site. If you have 5 sections drawing the same number of daily visitors, that's a million daily visitors disappearing, whether for a while, for years, or forever.
The problem could also be temporary in the SERPs and indeed could be good news, as that sometimes can indicate a few days of "blankness" followed by a new change which may swap the old pages with the newly indexed pages.
The original ID is hidden, yes, so the new url is www.example.com/page/domain.com/
If the URLs appear similar but each leads to a different page with different content, then even if they have different titles, search engine bots are likely to treat them as one single page duplicated hundreds or thousands of times, and probably either only one page gets picked up OR the WHOLE lot is deliberately avoided. That may not be the case here, but just in case (if you forgive the unintended pun here).
The duplicated pages with a querystring are basically this: you have the main page, www.mysite.com/page/domain.com/ , which shows the most relevant information for the user, but if you want to see a bunch more information, you click a link to see more, so it loads the "same page" just with more content at the bottom. This content can either be much, much more content or, in some cases, no further content. Of course, this could be done via other methods that do not create two versions of the file (such as DHTML), but changing the code to work this way reduced our bandwidth significantly, so it has its pluses.
In searching through what IS in G* index, I did see a handful of the spammy extra querystring parameters that you guys are both referring to. I've never had an issue with this, so I would have to do some research on how to remove them via htaccess. [b]If anyone has the code, please share.[/b]
The website as a whole gets 200K-300K visits (fluctuates) and this section of the website makes up about half of that. I have reverted back to the standard, ugly, PHP querystrings (and updated my G* sitemaps) in hopes that G* will find those to be OK, as it appears that the 17K indexed pages, which are all in this format, are still in the index. There should be about 90K pages in the index for this "page".
There are other querystring versions of this page linked on the page, such as www.example.com/page.php?id=12345&upgrade=true. This page would have much different content, though. These links have been there for over a year.
These URLs do not get rewritten, only the /page/domain.com/ format is rewritten.
There is also the chance of a URL looking like page.php?id=12345&domain=domain.com#domain.com which is used to allow linking/jumping to a certain section of the page from external sites. Few websites have linked to us this way, but there is at least one that has done it site-wide. These links have been available for 6 months and have probably been on that site-wide linker for 3+ months. We encourage users to link this way (which makes sense when you know what site we are). It's hard to explain without going into detail, which I'm sure everyone understands.
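One thing worth spelling out about that link format: the part after the # is never sent to the server, so what gets requested is page.php with an extra `domain` parameter, which is yet another URL for the same page. A small Python sketch shows the split; `server_side_view` is a name invented for this example.

```python
from urllib.parse import urldefrag, urlsplit, parse_qsl

def server_side_view(url):
    """Return (URL as requested from the server, its query parameters).

    The fragment (everything after '#') stays in the browser and is
    not part of the HTTP request.
    """
    requested, _fragment = urldefrag(url)
    return requested, parse_qsl(urlsplit(requested).query)
```

For the link above, the server would see `page.php?id=12345&domain=domain.com` with the parameters `id` and `domain`, and nothing of `#domain.com`.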
[edited by: tedster at 2:05 am (utc) on Aug. 11, 2009]
[edit reason] switch to example.com - it cannot be owned [/edit]