Google SEO News and Discussion Forum

Google Won't Index My Rewritten URLs
wesmaster
msg:4102595
8:16 pm on Mar 22, 2010 (gmt 0)

Update to previous post: [webmasterworld.com ]

One of my websites has a section that lists websites that are similar to other websites. In early 2009 I changed this section to use mod_rewrite rewritten URLs, like www.example.com/view-sites/example.com; previously they were www.example.com/view-sites.php?id=1000. All URLs were correctly 301 redirected, I updated my XML sitemap code, etc. Everything went fine for 2-3 months, then suddenly ALL of the rewritten URLs disappeared from G, virtually on the same day. Some (5%) non-rewritten URLs (ones that had slashes, etc., in them) remained in G. Once I realized it was only the rewritten URLs that had disappeared, I rolled the site back in a panic, submitted a reinclusion request to G (just to be safe), and the website listing pages started to show back up under the querystring URL versions. It took MONTHS for the pages to show back up; I kept track of them in G sitemaps every day.
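A minimal sketch of the 301 setup described above, as it might look at the top of view-sites.php (the id-to-domain mapping here is hypothetical; the real site would presumably query its database):

<?php
// Redirect the old querystring URL to its rewritten equivalent with a 301.
// The $sites lookup table is a stand-in for a database query.
$sites = array(1000 => 'example.com');

if (isset($_GET['id'])) {
    $id = (int) $_GET['id'];
    if (isset($sites[$id])) {
        header('Location: http://www.example.com/view-sites/' . $sites[$id], true, 301);
        exit;
    }
}
// ...otherwise fall through and render the page normally.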

In February of 2010 I decided to slowly switch to the rewritten URLs again and hope for the best. I changed the code to only rewrite URLs for records added to the database in 2010 or later: rewritten URLs for every website added from then on (including the ones already added earlier in 2010), but querystring URLs for anything prior to 2010. So far G has not indexed a SINGLE rewritten, newly added website, and has kept the querystring version for websites added between Jan 1 2010 and the day I made the code change. So page links that it already knew about, which now 301 redirect to rewritten URLs, have not been updated to their new location in G's index.
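The dated cut-over could look something like this (the column names are assumptions for the sketch):

<?php
// Rows added on or after Jan 1 2010 get the rewritten URL; older rows
// keep the querystring form.
function site_url(array $row)
{
    if (strtotime($row['date_added']) >= strtotime('2010-01-01')) {
        return 'http://www.example.com/view-sites/' . $row['domain'];
    }
    return 'http://www.example.com/view-sites.php?id=' . (int) $row['id'];
}

echo site_url(array('id' => 1000, 'domain' => 'example.com',
                    'date_added' => '2010-02-15'));
// prints: http://www.example.com/view-sites/example.com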

So essentially, G will not index rewritten URLs for this ONE website that I own. I have 2 other websites that rewrite URLs in the same fashion (except the format is www.example.com/page/variable/ - with the directory style slash) that G indexes just fine.

Any thoughts?

 

g1smd
msg:4102729
12:00 am on Mar 23, 2010 (gmt 0)

Do the links ON your pages point directly to the new URLs, or do they still link to the old URLs?

When a request for an old URL is received by the server, does it issue a 301 or a 302 redirect?

Does your site directly resolve content at www or at non-www or at both?

Is there a non-www to www redirect in place?

From clicking a link to seeing content, are there any redirects involved? Use Live HTTP Headers to check the HTTP status codes.

Does accessing any new URL return anything other than a 200 OK HTTP status code?
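Live HTTP Headers will show these status codes in the browser; a small script along these lines (the URLs are placeholders) can run the same checks from the command line:

<?php
// Fetch only the headers for each URL variant and print the status line.
$urls = array(
    'http://example.com/view-sites/example.com',      // non-www: expect 301
    'http://www.example.com/view-sites/example.com',  // www: expect 200
    'http://www.example.com/view-sites.php?id=1000',  // old URL: expect 301
);

foreach ($urls as $url) {
    $headers = get_headers($url);  // element 0 is the first status line
    echo $url . "\n  " . $headers[0] . "\n";
}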

wesmaster
msg:4102799
3:47 am on Mar 23, 2010 (gmt 0)

Do the links ON your pages point directly to the new URLs, or do they still link to the old URLs?

All links point to the new URLs. I also have the canonical tag set to the new URL. The HTML sitemap pages use the new URLs, as do the XML sitemaps.
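For reference, the canonical tag described here might be emitted in the page <head> along these lines ($domain coming from the routing layer is an assumption):

<?php
// Point every URL variant of this page at the rewritten version.
$domain = 'example.com';
printf('<link rel="canonical" href="http://www.example.com/view-sites/%s" />',
       htmlspecialchars($domain, ENT_QUOTES));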

When a request for an old URL is received by the server, does it issue a 301 or a 302 redirect?

301

Does your site directly resolve content at www or at non-www or at both?

Non-www is redirected to www for the whole domain, so only www should exist as far as G knows.

Is there a non-www to www redirect in place?

Yes
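That redirect is usually a mod_rewrite rule; an equivalent PHP-level guard (assuming a shared include on every page, shown purely for illustration) would be:

<?php
// Force the www hostname with a single sitewide 301.
if ($_SERVER['HTTP_HOST'] === 'example.com') {
    header('Location: http://www.example.com' . $_SERVER['REQUEST_URI'], true, 301);
    exit;
}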

From clicking a link to seeing content, are there any redirects involved? Use Live HTTP Headers to check the HTTP status codes.

When on the website and clicking a link to a rewritten page, there is no redirect; it responds with 200.

Does accessing any new URL return anything other than a 200 OK HTTP status code?

No, just 200 OK is given. For example:

HTTP/1.1 200 OK
Date: Tue, 23 Mar 2010 03:45:11 GMT
Server: Apache/2.0.52 (Red Hat)
X-Powered-By: PHP/5.2.12
Expires: Thu, 19 Nov 1981 08:52:00 GMT
Cache-Control: no-store, no-cache, must-revalidate, post-check=0, pre-check=0
Pragma: no-cache
Connection: close
Transfer-Encoding: chunked
Content-Type: text/html

pageoneresults
msg:4102802
3:57 am on Mar 23, 2010 (gmt 0)

I'm following along and have an interest in the outcome of this topic. I have a suggestion, but before I offer it: you've got yourself into somewhat of a pickle. I count 4 sets of instructions that you've sent to the crawlers since you started this process in 2009. That first "panic" reaction is probably what set the stage for a less than optimal crawl moving forward; you're confusing the bots, and that makes for a less than optimal environment for a crawler.

It sounds like Google is relying on its cache for a solid reference point; I'm just guessing at this stage. What I'd like to recommend is that you implement NoArchive on a global basis. My theory is that this forces Googlebot and the others to crawl the latest content, since they can't fall back on an archive copy; they still have one, it just isn't visible to the public.

I just like to think that having that NoArchive is a way to keep stuff like this from getting cached and stuck in the index. Am I making sense? :)

This would be the perfect time to experiment too. I'd document crawl activity before and after the implementation of NoArchive. That should give you a strong indication of what is taking place. My guess is that you'll see a spike in activity as the bots come to get the latest versions of your documents. Then it will simmer down while it recalculates the new instructions.

During this time you can't make any major changes, which makes for a somewhat challenging situation, and I'd understand not being able to do it. But I'd still recommend the NoArchive method to send a fresh crawl signal. :)
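Assuming a header include shared by every page, the global NoArchive suggested above could be a one-line addition in either (or both) of these forms:

<?php
// The X-Robots-Tag header and the robots meta tag are equivalent here;
// either one alone is enough to suppress the cached copy.
header('X-Robots-Tag: noarchive');
echo '<meta name="robots" content="noarchive" />' . "\n";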

wesmaster
msg:4104212
4:39 am on Mar 25, 2010 (gmt 0)

I did implement the NOARCHIVE to see what happens. I've since found pages in Google that don't have "Cached" links but are still in the querystring format instead of the rewritten version. So the cache was removed, and the page was crawled, but the URL was not updated. In one case it's a longer querystring version and not the main page: we only show the top listed sites on the page, and you have to click "show all" to see everything, which adds &showall=true to the querystring on that link. The G-indexed version of the page is the one with the &showall=true, which I've never seen G index with higher priority than the main page with only the ID querystring parameter. There is a canonical tag with the rewritten URL, but obviously this is not being read.

tedster
msg:4104238
5:45 am on Mar 25, 2010 (gmt 0)

Canonical tags are taken only as a suggestion by Google - they have some kind of internal logic to decide whether they will apply a "virtual 301" or not.

g1smd
msg:4104900
1:04 am on Mar 26, 2010 (gmt 0)

Even though the old URLs still appear in the SERPs, Google has already been at work fixing things up.

It is likely that the old URLs now reside in the Supplemental Index. Google wants to hold on to old URLs for a while, so that searchers can still find that content even after you have moved or deleted it.

Change an address or phone number on a page and that page will continue to rank for the old data long after it also ranks for the new data. Likewise when you move a page to a new URL, the old URL will continue to appear for some searches long after the new one is indexed and ranking, and long after the old URL no longer serves content, and instead serves a redirect.

Your measure of success is NOT how quickly the old URLs disappear, but how quickly the new URLs are indexed, ranking, and bringing traffic. Once the redirect is in place, let Google remove the old URLs at its own pace.

claus
msg:4104932
2:43 am on Mar 26, 2010 (gmt 0)

So essentially, G will not index rewritten URLs


IMHO the case is not "will not", rather it is a case of the process being extremely slow. The fact that you have not seen results yet does not imply that you will never see results.

I'd like to offer a small piece of advice, even though it seems a little too late now:

A 301 status code means that the resource has been moved permanently. Search engines tend to take web protocols and standards very literally. So, if you permanently move a piece of content from one address to another, do just that. That is, make the move permanent: don't change the URL back.

Say you make a change of URL, and for some odd reason it takes more than two months for a search engine to digest that move. If you then panic and change the URL back, the search engine will most likely find it even harder to digest this second move, meaning you risk an even longer second waiting period before things get straightened out.

This may not help you now, but it may do so in the future :)

wesmaster
msg:4105434
9:18 pm on Mar 26, 2010 (gmt 0)


IMHO the case is not "will not", rather it is a case of the process being extremely slow.


I don't disagree, exactly, but what would be your reasoning for Google being extremely slow in indexing this website's rewritten URLs (even ones that it never knew about as querystrings, so they are not 301'ed from a previous address) while indexing querystring URLs quickly? That makes no sense.

That is, make the move permanent: don't change the URL back.


In a perfect world where you don't lose 100K visitors overnight because of G dropping 1/4 of your website out of the index, and all of the financial loss that comes with that, sure. I mean, we're not talking about a brand new website trying to get off the ground losing a few hundred visitors for a month or two. I agree in theory with your statement, for future scenarios.

g1smd
msg:4105495
11:34 pm on Mar 26, 2010 (gmt 0)

There's likely something botched in the move to the new URL format - some Duplicate Content issue that was missed, or some non-canonical linking still happening within the site. It's easy for that to happen. I've just spent the last week pulling apart a ZenCart site with all manner of designed-in screw-ups. Close inspection of Analytics and WebmasterTools data as well as a site crawl using Xenu LinkSleuth can sometimes be a revelation.

wesmaster
msg:4106503
5:49 pm on Mar 29, 2010 (gmt 0)

There's likely something botched in the move to the new URL format - some Duplicate Content issue that was missed, or some non-canonical linking still happening within the site.


What do you guys think about there still being links to extended URLs with querystrings? For example, http://www.example.com/view-sites.php?id=100&showall=true

There are no links to JUST the ID-only querystring format, but when there is a parameter other than the ID, the links do use querystring parameters.

TheMadScientist
msg:4106528
6:29 pm on Mar 29, 2010 (gmt 0)

I would guess it has more to do with the id= than the showall... It has been stated here previously (often, but quite a while ago) that Google tries to avoid indexing URLs with session IDs in them, and recommends using query_string variables without sid, id, etc. - anything that could easily be a session ID - if you want the best chance of having the URL indexed, even when it's really not a session ID or anything like that.

Anyway, I would guess it's your query_string convention more than anything else, and I would probably try switching to something 'keywordish' or 'descriptive', e.g. book=NUM, or even a generic item=NUM or product=NUM - something that is not simply 'id' - since they try to keep from indexing those query_strings and it might be causing the issue.
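The rename could be paired with a 301 from the old parameter so existing links keep working; a sketch, using 'item' (one of the names floated above):

<?php
// Redirect the old parameter name to the new one before rendering.
if (isset($_GET['id']) && !isset($_GET['item'])) {
    header('Location: /view-sites.php?item=' . (int) $_GET['id'], true, 301);
    exit;
}
$id = isset($_GET['item']) ? (int) $_GET['item'] : 0;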

wesmaster
msg:4106549
7:02 pm on Mar 29, 2010 (gmt 0)

...something else that is not simply 'id'...


On this site it's actually "sitesid", I just used "id" in this thread for simplicity.

TheMadScientist
msg:4106552
7:15 pm on Mar 29, 2010 (gmt 0)

sitesid

Not much else for me to say...
You've got sid in your query_string and my recommendation is to change it.

wesmaster
msg:4106670
1:05 am on Mar 30, 2010 (gmt 0)

But your logic is backwards. G is indexing (or indexing quickly, depending on who in this thread you ask) my URLs with "...sid" in them, and not indexing (again, depends on who you ask) the ones without (rewritten). So I'm missing how this applies.

The rewritten URLs do not include the questionable parameter ("...sid") in the URL, they are ".../view-sites/example.com".

TheMadScientist
msg:4106679
1:53 am on Mar 30, 2010 (gmt 0)

Ahhhhh, sooooo... Reading for comprehension escapes me sometimes. I read your last post before I posted, and it sounded like those were the URLs not being indexed... You were talking about links to them, and I thought, wow, that seems easy to see given what the query_strings are...

For the sake of clarity in terminology:
Technically, 'indexed' means 'shown in the results', while 'spidered' or 'crawled' or something to that effect means they are in the 'underlying data' (for lack of a better phrase) but are not shown in the results.

Basically, G shows visitors its 'index' when they search, but it doesn't show all the data it has compiled the index from. So: indexed = in the results; spidered = they have the info but don't show the pages in the results.

Just think 'database' and you'll probably understand why they call the results they show you their index... It sort of makes sense that the data they show people from their database would be 'indexed'.

I'll look closer at the thread and let you know if anything else jumps out at me. Sorry for misunderstanding your post / predicament.

TheMadScientist
msg:4106687
2:05 am on Mar 30, 2010 (gmt 0)

K, reading through a bit more thoroughly (I actually remember reading your initial post now, but hadn't re-read all the way down to the one before mine): the first thing I remember thinking about it previously, and again now, is the second domain name being in the URL... It might look 'spammy', or like you are trying to rank for www.example.com by having the two domain names in your URLs, so the first thing I would probably try is removing the domain name from the actual path and trying some different text to see what happens.

Maybe a set without the subdomain or the .com, so www.example.com would be:
www.example.com/view-sites/example

And I would probably try another set where you switch back to a number and not the domain name but keep the friendly URL:
www.example.com/view-sites/10000

Anyway, that's where I would start and see if there's any luck... The biggest 'could be questionable' thing I see in the URLs is the second domain name, so that's where I would start looking for a fix.
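A hypothetical helper for trying the variants proposed above (the format labels are invented for this sketch):

<?php
// Build the /view-sites/ URL in one of the candidate formats.
function view_sites_url(array $row, $format)
{
    switch ($format) {
        case 'name':    return '/view-sites/' . strtok($row['domain'], '.'); // drop the TLD
        case 'numeric': return '/view-sites/' . (int) $row['id'];            // keyword + number
        default:        return '/view-sites/' . $row['domain'];              // current format
    }
}

echo view_sites_url(array('id' => 10000, 'domain' => 'example.com'), 'numeric');
// prints: /view-sites/10000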

wesmaster
msg:4106689
2:16 am on Mar 30, 2010 (gmt 0)

The biggest 'could be questionable' thing I see in the URLs is the second domain name, so that's where I would start looking for a fix.


Due to forum rules I cannot post the actual URLs. Can you confirm that you realize it's actually www.example.com/view-sites/domain.com? Just making sure. There is never (except for one single record) a time when the actual domain name is in the URL twice.

wesmaster
msg:4106694
2:32 am on Mar 30, 2010 (gmt 0)

BTW, one of the reasons I'm so concerned about this issue is that at least one competitor has made a huge jump in the rankings in the past 6-12 months and replaced my website in the SERPs for domain searches we used to be at the top for (usually right after the actual domain itself). This competitor has URLs formatted like www.example.com/similar/domain.com. So I think this is an important SEO point, since this website came out of nowhere to jump over me.

TheMadScientist
msg:4106700
2:52 am on Mar 30, 2010 (gmt 0)

Can you confirm that you realize that it's actually www.example.com/view-sites/domain.com?

Yes, that was what I was thinking, and now that you mention your competitor doing it and you following, I have other thoughts, like: how unique (in actual content) is your site from theirs, and could the URL similarities have made them look like 'mirror sites' in some ways? Are there similarities in the titles? Would the average user find much difference in the actual information presented by the two? IOW, is it possible your URL change tripped an 'essentially the same' filter of some kind, rather than the domain name in the URL being the issue itself?

wesmaster
msg:4106726
4:30 am on Mar 30, 2010 (gmt 0)

Similar maybe, but duplicate, no.

wesmaster
msg:4106752
5:03 am on Mar 30, 2010 (gmt 0)

I did a G "site:" search just now for the beginning of the URL of the websites section of the site, with the date set to 01/01/2010+. Every URL (which, remember, should be rewritten) is now indexed with &showall=true added to the querystring instead of as the main version. I've never seen G index the &showall=true version of our URLs until I did the site-wide NOARCHIVE mentioned earlier in the thread. Now, I do want G to index the showall version of the URL, so I'm not going to add it to the G sitemaps "parameter handling"; actually, I have it set to "don't ignore" in G sitemaps. But it shouldn't be the main URL, IMO. This just tells me that for some reason G doesn't like the rewritten version, but that makes no sense!
