re-write help please, i am new

Problem with Google appliance

         

dcguy

3:06 pm on Jul 20, 2005 (gmt 0)

10+ Year Member



Hello. I am new to this board, and relatively new to Apache. The previous webmaster set up a re-write and I think it is causing a problem with the way our Google search appliance is indexing our site.

The re-write rule looks like this:

RewriteEngine on
RewriteLogLevel 0
RewriteRule ^/([0-9]+)/$ /$1 [R]
RewriteRule ^/([0-9]+)$ /page.cfm?pageID=$1

It lets a friendly URL like /10000001 stand in for an unfriendly one like /page.cfm?pageID=10000001.
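To make the mapping concrete, here is a minimal Python sketch of the two patterns. It only models the regular expressions and the substitution string, not Apache itself, and the helper name internal_target is made up for illustration.

```python
import re

# Patterns copied from the two RewriteRule lines above.
TRAILING_SLASH = re.compile(r"^/([0-9]+)/$")  # rule 1: e.g. /10000001/
BARE_ID = re.compile(r"^/([0-9]+)$")          # rule 2: e.g. /10000001


def internal_target(path):
    """Return the page.cfm URL that rule 2 maps `path` to, or None if no match.

    Hypothetical helper: it mimics only the pattern and substitution,
    not how mod_rewrite processes a real request.
    """
    m = BARE_ID.match(path)
    return f"/page.cfm?pageID={m.group(1)}" if m else None


if __name__ == "__main__":
    for path in ["/10000001", "/10000001/", "/page.cfm"]:
        print(path, "->", internal_target(path),
              "| trailing-slash rule matches:", bool(TRAILING_SLASH.match(path)))
```

Only the bare numeric path reaches page.cfm; the trailing-slash form is caught by the first pattern instead.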

Pretty simple stuff. This is on a SiteMinder protected intranet. We have the same rule running on our public site, but we don't have the following Google problem there.

Here's an example of some redirection taking place that is messing up the Google crawl from our appliance:

10000000 Info: Redirected URL. 18 Jul 10:54 AM
10000001 Info: Redirected URL. 19 Jul 3:35 AM
10000002 Info: Redirected URL. 19 Jul 3:35 AM
10000004 Info: Redirected URL. 19 Jul 3:35 AM
10000005 Info: Redirected URL. 19 Jul 3:35 AM
10000015 Info: Redirected URL. 18 Jul 10:54 AM
10000016 Info: Redirected URL. 18 Jul 10:54 AM
10000016?pageID=10000016 Excluded: In "Do Not Crawl" URLs. 18 Jul 8:17 AM
10000050 Info: Redirected URL. 19 Jul 3:35 AM
10000054 Info: Redirected URL. 18 Jul 10:54 AM
10000054?pageID=10000054 Excluded: In "Do Not Crawl" URLs. 18 Jul 8:17 AM
10000056 Info: Redirected URL. 18 Jul 10:54 AM
10000056?pageID=10000056 Excluded: In "Do Not Crawl" URLs. 18 Jul 8:17 AM
10000061 Info: Redirected URL. 18 Jul 10:54 AM
10000061?pageID=10000061 Excluded: In "Do Not Crawl" URLs. 18 Jul 8:17 AM
10000064 Info: Redirected URL. 19 Jul 3:35 AM
10000064?pageID=10000064 Excluded: In "Do Not Crawl" URLs. 18 Jul 7:59 AM
10000081 Info: Redirected URL. 18 Jul 10:54 AM
10000081?pageID=10000081 Excluded: In "Do Not Crawl" URLs. 18 Jul 8:17 AM
10000086 Info: Redirected URL. 18 Jul 10:54 AM
10000086?pageID=10000086 Excluded: In "Do Not Crawl" URLs. 18 Jul 8:17 AM
10000089 Info: Redirected URL. 18 Jul 10:54 AM
10000089?pageID=10000089 Excluded: In "Do Not Crawl" URLs. 18 Jul 8:17 AM
10000091 Info: Redirected URL. 18 Jul 10:54 AM

I can't explain why a page like "10000086" is being turned into "10000086?pageID=10000086" as Google reads it. The re-write rule doesn't look like it would do that. As far as Google is concerned, /10000086 should just be /10000086, not 10000086?pageID=10000086. Maybe the way the Google appliance "hits" the site is interpreted differently? Anyway, I don't have control over the Google appliance and I have no idea how it's configured. I am at the mercy of support, but they don't know anything about Apache or redirects, so the whole project is standing still.

Also, the tech support guy told me Google IGNORES redirected URLs, so after it gets done indexing the site it backs OUT the redirected pages, leaving us with nothing. Does that make sense? It seems to be true anyway.

Can anyone help me figure this out? My log file looks pretty much the same as the Google error log above; it shows the same pattern of URLs. Thanks.

jd01

7:22 pm on Jul 20, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Hi Dcguy,

Welcome to WebmasterWorld.

Also, the tech support guy told me Google IGNORES redirected URLs, so after it gets done indexing the site it backs OUT the redirected pages, leaving us with nothing. Does that make sense? It seems to be true anyway.

Rule 1: Never listen to a tech on SEO, unless they are an SEO who understands tech. I have 20K pages that are silently redirected, and G does just fine with them. Google also does fine with all 20K of the 301 redirects from domain.com to www.domain.com. I have friends who redirect *many* more pages than this (like 600K), and G does just fine with them.

That said, let's see if we can make this work.

1. The R flag on your first rule defaults to a 302 (temporary) redirect. With all the 302 issues, G may be choking on this, so let's make it permanent: [R=301]

2. Let's only do one rewrite at a time. There is no L flag on either rule, so if a URL qualifies for the first rule, the second rule can also act on it in the same pass, instead of the redirect being sent back as a new request and only then being acted on.

So, your first rule will look like this:
RewriteRule ^/([0-9]+)/$ /$1 [R=301,L]

3. Let's put a last flag on the second rule also, just for good measure.

So, your second rule will look like this:
RewriteRule ^/([0-9]+)$ /page.cfm?pageID=$1 [L]

The final file will then look like this:
RewriteEngine on
RewriteLogLevel 0
RewriteRule ^/([0-9]+)/$ /$1 [R=301,L]
RewriteRule ^/([0-9]+)$ /page.cfm?pageID=$1 [L]
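
As a rough illustration of what the [L] flags are meant to achieve, here is a small Python sketch. It is a simplified model, not Apache's actual processing: the function name apply_rules and the tuple it returns are made up for this example. The point is that each request is handled by at most one rule.

```python
import re


def apply_rules(path):
    """Simplified model of the two rules with [R=301,L] and [L].

    Returns (action, status, target). Because each rule carries L,
    processing stops at the first rule that matches.
    """
    # Rule 1: trailing slash -> external 301 redirect to the bare ID.
    m = re.match(r"^/([0-9]+)/$", path)
    if m:
        return ("redirect", 301, f"/{m.group(1)}")

    # Rule 2: bare ID -> internal rewrite to page.cfm (no redirect sent).
    m = re.match(r"^/([0-9]+)$", path)
    if m:
        return ("rewrite", None, f"/page.cfm?pageID={m.group(1)}")

    # Anything else is left untouched.
    return ("pass", None, path)


if __name__ == "__main__":
    for path in ["/10000086/", "/10000086", "/index.cfm"]:
        print(path, "->", apply_rules(path))
```

In this model a request for /10000086/ gets a 301 to /10000086, and the client's follow-up request for /10000086 is then rewritten internally, so the crawler should only ever see the short URL.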

My guess is there was some conflict from having the URLs acted on by both rules at the same time. I'm not sure why, because it doesn't seem that G-bot would get a different result from a redirect than a browser would, unless there is some type of browser check, etc. But anything is possible.

Hope this helps.

Please, let us know if this does not work and maybe we can find another reason, but those are the only adjustments I see right now.

Justin

dcguy

9:36 pm on Jul 21, 2005 (gmt 0)

10+ Year Member



Okay, I will try this and let you know (I have to give my tech support time to make the change).

Fyi, this is the exact same rule that works on our public site, and that site doesn't seem to cause any problem with Google crawling by the appliance.

Do you think it matters that we have our own appliance doing the spidering vs. Google-at-large when it comes to this redirect stuff?

The only diff is our intranet is protected by SiteMinder, but we are using the Google security module for single sign on.

dcguy

3:34 pm on Jul 25, 2005 (gmt 0)

10+ Year Member



Hi, jd01. Can you tell me if the following new rule is going to do the same thing as your suggested rule? Somebody gave this to me as a possible alternative but I need a step-by-step interpretation of what it's doing differently. I am trying to troubleshoot a Google appliance indexing problem.

NEW RULE:
RewriteRule /(\d+)/? /page.cfm\?pageID=$1

YOUR SUGGESTED RULE:
RewriteRule ^/([0-9]+)/$ /$1 [R=301,L]
RewriteRule ^/([0-9]+)$ /page.cfm?pageID=$1 [L]

Thank you!
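
One way to compare the patterns in the two versions is to try them against sample paths. The Python sketch below does only that: it tests the regular expressions, assumes the server's regex engine accepts \d, and says nothing about how the substitution or the escaped \? behaves. The sample paths and the function name matches are made up for illustration.

```python
import re

# NEW RULE pattern: no ^ or $ anchors, optional trailing slash.
NEW = re.compile(r"/(\d+)/?")

# SUGGESTED RULE patterns: anchored at both ends.
OLD_SLASH = re.compile(r"^/([0-9]+)/$")
OLD_BARE = re.compile(r"^/([0-9]+)$")


def matches(path):
    """Report which rule set would match `path`.

    re.search is used because an unanchored pattern may match
    anywhere in the path, not just at the start.
    """
    return {
        "new": bool(NEW.search(path)),
        "suggested": bool(OLD_SLASH.search(path) or OLD_BARE.search(path)),
    }


if __name__ == "__main__":
    for path in ["/10000086", "/10000086/", "/docs/2005/report.html", "/page.cfm"]:
        print(path, matches(path))
```

Two differences stand out. Without anchors, the new pattern also matches any path that merely contains a slash followed by digits, such as /docs/2005/report.html. And since the new rule has no R flag, the trailing-slash form is rewritten internally rather than redirected, so /10000086 and /10000086/ would both serve the same page under two URLs.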