Forum Moderators: Robert Charlton & goodroi

Rewritten URLs removed from Google

         

wesmaster

9:15 pm on Aug 9, 2009 (gmt 0)

10+ Year Member



We noticed a big drop in traffic to one section of our website that normally gets over 200K visitors a day. Over 4-5 days the Google traffic to this section went to almost zero, while the rest of the site maintained its Google traffic. After doing a bunch of Google "site:" searches I finally discovered that pages in this section that did not use mod_rewrite were still in the Google index, but all of the rewritten pages had disappeared.

The rewrite code was added 3-5 months ago. The rewrite makes the URLs, which happen to be for pages with domain/website information, friendly, like www.oursite.com/page/domain.com/ , while the non-rewritten pages appear as www.website.com/page.php?id=12345 . The only reason some pages were not rewritten is that some of the domain names, from older entries, are actually full paths to subdirectories and contain forward slashes, such as "domain.com/sub/". We simply decided not to rewrite these URLs because of the handling of the forward slashes.
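
To illustrate why entries with forward slashes fall outside the friendly format, here is a minimal Python sketch of the matching logic. The pattern and the function name `domain_from_path` are assumptions made for illustration, not the actual rewrite rule used on the site.

```python
import re
from typing import Optional

# Assumed pattern: /page/<domain>/ where <domain> contains no forward slash.
FRIENDLY = re.compile(r"^/page/([^/]+)/$")


def domain_from_path(path: str) -> Optional[str]:
    """Return the domain part of a friendly URL path, or None if it does not match."""
    match = FRIENDLY.match(path)
    return match.group(1) if match else None


print(domain_from_path("/page/domain.com/"))      # matches the friendly format
print(domain_from_path("/page/domain.com/sub/"))  # extra slash, so no match
```

With a pattern like this, an entry such as "domain.com/sub/" cannot be told apart from a deeper path, which is one plausible reason to leave those pages on the querystring form.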

Within the website, on the rewritten pages, there are links to the non-rewritten pages when we have a querystring parameter to add (so, page.php?id=12345&something=else). These versions of the page, for the most part (other than maybe 5), have disappeared as well. Under some circumstances the non-rewritten page with extra parameters and the rewritten page could look (HTML) exactly the same.

Thoughts on why the rewritten URLs have disappeared? Are we being penalized for having both rewritten and non-rewritten URLs that have some of the same content?

Shaddows

8:22 am on Aug 10, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Hi, welcome to WebmasterWorld.

If you have not locked down acceptable URLs, you may have been a victim of an attack.

I'm not 100% clear on what you have actually done. Please can you clarify if I have it right:

OLD:
www.example.com/page.php?id=12345
NEW:
www.example.com/12345/domain.com/

Or is the original id hidden on the returned page:
www.example.com/page/domain.com/

Finally, is the page REQUESTED (i.e. linked to)
"/12345/domain.com/"
and internally rewritten to fetch id=12345

Or is the page requested "id=12345", which is then 301'd to "/12345/domain.com/"

wesmaster

3:07 pm on Aug 10, 2009 (gmt 0)

10+ Year Member



The original ID is hidden, yes, so the new url is www.example.com/page/domain.com/

The request is mod-rewritten when it is for the new URL, but also requests for "page.php?id=12345" are 301'd via the PHP page.

Shaddows

9:52 pm on Aug 10, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Ok, the problem is not what I originally thought (no single canonical version of page).

Duplicate pages normally result in ALL BUT ONE version of the page disappearing, so that shouldn't be the problem either.

Have you visited the pages (both the PHP page and the tidy form) using Live HTTP Headers? Are all the server responses as you expect (200 for the tidy, one single 301 -> 200 for the PHP)?
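
If checking by hand in a browser is impractical for many URLs, the same check can be scripted. The following is a rough Python sketch, not a definitive tool: `trace`, `is_expected` and `http_fetch` are hypothetical names, and `http_fetch` sends a HEAD request, which a PHP page might answer differently from a GET.

```python
import http.client
from typing import Callable, List, Optional, Tuple
from urllib.parse import urljoin, urlsplit

# A fetcher takes a URL and returns (status code, Location header or None).
Fetch = Callable[[str], Tuple[int, Optional[str]]]


def http_fetch(url: str) -> Tuple[int, Optional[str]]:
    """Request one URL without following redirects (http.client never follows them)."""
    parts = urlsplit(url)
    if parts.scheme == "https":
        conn = http.client.HTTPSConnection(parts.netloc, timeout=10)
    else:
        conn = http.client.HTTPConnection(parts.netloc, timeout=10)
    try:
        path = parts.path or "/"
        if parts.query:
            path += "?" + parts.query
        conn.request("HEAD", path)
        resp = conn.getresponse()
        return resp.status, resp.getheader("Location")
    finally:
        conn.close()


def trace(url: str, fetch: Fetch, max_hops: int = 5) -> List[Tuple[str, int]]:
    """Follow redirects one hop at a time and record (url, status) for each hop."""
    chain = []
    for _ in range(max_hops):
        status, location = fetch(url)
        chain.append((url, status))
        if status in (301, 302, 303, 307, 308) and location:
            url = urljoin(url, location)  # Location may be relative
        else:
            break
    return chain


def is_expected(chain: List[Tuple[str, int]]) -> bool:
    """True for a plain 200, or exactly one 301 followed by a 200."""
    statuses = [status for _, status in chain]
    return statuses in ([200], [301, 200])
```

Calling `trace(url, http_fetch)` on both the PHP form and the tidy form and passing the result to `is_expected` would flag redirect chains, loops, or 302s that should be 301s.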

If the appended parameters make no substantial changes to the HTML, you might help the issue by using the canonical tag. Also, you should make sure you disallow all unintended appended parameters via .htaccess.
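
The logic behind locking down parameters can be sketched as follows. This is Python for illustration only, not .htaccess code; the whitelist `ALLOWED_PARAMS` and the function names are assumptions, and any parameter the site uses intentionally would have to be added to the whitelist.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed whitelist: only "id" is treated as an intentional parameter here.
ALLOWED_PARAMS = ("id",)


def canonical_url(url: str, allowed=ALLOWED_PARAMS) -> str:
    """Drop every query parameter not on the whitelist; keep the first value of each."""
    parts = urlsplit(url)
    seen = set()
    kept = []
    for key, value in parse_qsl(parts.query, keep_blank_values=True):
        if key in allowed and key not in seen:
            seen.add(key)
            kept.append((key, value))
    # The fragment is dropped as well; it is never sent to the server anyway.
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))


def needs_redirect(url: str) -> bool:
    """True when the requested URL differs from its canonical form."""
    return canonical_url(url) != url
```

A request for which `needs_redirect` is true would then be 301'd to `canonical_url(url)` (or refused), so that junk parameters never return a 200 on their own.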

Other than that, I'm at a loss.

dusky

10:19 pm on Aug 10, 2009 (gmt 0)

10+ Year Member



I've had similar situations and have been dealing with mod_rewrite problems for years.

When Shaddows said

If you have not locked down acceptable URLs, you may have been a victim of an attack.
I agree, I've seen that happen a few times. There are some competitors who send bots to what I call URILATED URLS (URLs with keyword strings and page IDs which return a 200 OK response no matter what you have in the URL, as long as the page ID exists, hence causing havoc and duplicate content). Malicious spammers also do that nowadays. Their motives are free ads and stats backlinks, and they don't realize the havoc it causes as well.

I wouldn't reverse the situation unless I were sure you are in breach of the G* guidelines. It looks to me like the situation needs time for the 301ed pages to transfer rank (6 months is the average I've seen), assuming all is well with your method and no more than one URL leads to the same page (splitting PR and causing duplication).
Changing thousands of URLs at once when they are already established and have backlinks can take a long time, 6-9 months for the 301 to take hold.

If those pages did have 200K unique daily visitors in that section, they must have been well established, with solid backlinks and TrustRank. You should have taken SEO advice first from our community here. Not everyone is an SEO expert, but some of us do this for a living. You could have PMed someone here (tedster and other guys here could help; I'm afraid I don't have time, as I'm too busy with similar requests for help) or gone directly to a well-known SEO company or site.

From your explanation, I didn't get the exact problem you're having, same as Shaddows; you need to explain in detail. If the situation remains as is, I'd PM someone here and give them all your site's details, logs, stats, credentials etc. In other words, you need someone on the case.

If I were you I wouldn't do the same for other sections; you could contaminate the whole site. If you have 5 sections drawing the same number of daily visitors, that's a million daily visitors disappearing for a while, for years, or forever.

The problem could also be temporary in the SERPs and could indeed be good news, as it sometimes indicates a few days of "blankness" followed by a change which swaps the old pages for the newly indexed pages.

dusky

10:28 pm on Aug 10, 2009 (gmt 0)

10+ Year Member



Again, not too sure what you mean by
The original ID is hidden, yes, so the new url is www.example.com/page/domain.com/

If the URLs appear similar but each leads to a different page with different content, even with different titles, search engine bots are likely to treat them as one single page duplicated hundreds or thousands of times, and probably either pick up only one page OR deliberately avoid the WHOLE lot. This may not be the case here, but just in case (if you forgive the unintended pun).

wesmaster

12:44 am on Aug 11, 2009 (gmt 0)

10+ Year Member



I already have the canonical tag on the pages.

The duplicated pages with querystring are basically this: you have the main page, www.mysite.com/page/domain.com/ , which shows the most relevant information for the user, but if you want to see a bunch more information, you click a link to see more, so it loads the "same page" just with more content at the bottom. This content can be much, much more content or, in some cases, no further content at all. Of course, this could be done via other methods that do not create two versions of the file (such as DHTML), but changing the code to work this way reduced our bandwidth significantly, so it has its pluses.

In searching through what IS in G*'s index, I did see a handful of the spammy extra querystring parameters that you guys are both referring to. I've never had an issue with this, so I would have to do some research on how to remove them via .htaccess. [b]If anyone has the code, please share.[/b]

The website as a whole gets 200K-300K visits (it fluctuates) and this section of the website makes up about half of that. I have reverted to the standard, ugly PHP querystrings (and updated my G* sitemaps) in hopes that G* will find those to be OK, as it appears that the 17K indexed pages, which are all in this format, are still in the index. There should be about 90K pages in the index for this "page".

tedster

12:55 am on Aug 11, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Does the click for more content also generate a new url - even just the old URL but now with a query string added? Or maybe the extra part of the url follows an added hash tag [#]?

And assuming there is a modification made to the URL, how does that interact with your mod_rewrite?

wesmaster

1:40 am on Aug 11, 2009 (gmt 0)

10+ Year Member



The extra querystring would be like this: www.example.com/page.php?id=12345&showall=true (these links were introduced 3-5 months ago).

There are other querystring versions of this page linked on the page, such as www.example.com/page.php?id=12345&upgrade=true. This page would have much different content, though. These links have been there for over a year.

These URLs do not get rewritten, only the /page/domain.com/ format is rewritten.

There is also the chance of a URL looking like page.php?id=12345&domain=domain.com#domain.com , which is used to allow linking/jumping to a certain section of the page from external sites. Few websites have linked to us this way, but there is at least one that has done it site-wide. These links have been available for 6 months and have probably been on that site-wide linking site for 3+ months. We encourage users to link this way (which makes sense when you know what site we are). It's hard to explain without going into detail, which I'm sure everyone understands.

[edited by: tedster at 2:05 am (utc) on Aug. 11, 2009]
[edit reason] switch to example.com - it cannot be owned [/edit]

wesmaster

6:57 pm on Aug 19, 2009 (gmt 0)

10+ Year Member



Just a quick update. After changing the site to only use querystring URLs, reflecting this change in the G sitemaps, and submitting a reinclusion via the sitemaps page, the URLs are starting to reappear in G (or, the querystring versions are). Also, I'm seeing position restored for the pages that have been reincluded. It appears to be a slow process, but it's happening. I also see my G crawl rate increasing (it had dropped significantly). 10K URLs indexed today for the page in question, where it was 5K when the rewritten URLs disappeared. 80K more to go.