Forum Moderators: phranque


mod rewrite and http status codes

is this a problem

         

incywincy

9:55 am on May 16, 2004 (gmt 0)

10+ Year Member



hi,

I recently decided to move my site to PHP. To cater for the search engines I have inserted the following in my .htaccess file:

RewriteCond %{REQUEST_URI} html
RewriteCond %{REQUEST_URI} !logs
RewriteCond %{REQUEST_URI} !spiderlog
RewriteRule ^(.*)html$ /$1php [R=permanent,L]

Lines 2 and 3 allow me to keep my log analysis reports as html files.

I noticed that when Google requests an old .html file, an HTTP 301 is returned; the URL is rewritten to .php and a 301 is returned again.

Is this correct? I would have expected a status of 200 to be returned. I have found that half of my pages have disappeared from Google's index, but I put this down to the changeover period and expected things to return to normal after the next deep crawl.

extract from access_log:
64.68.82.199 - - [15/May/2004:02:03:48 +0100] "GET /widgets/blue.html HTTP/1.0" 301 342 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"

64.68.82.37 - - [15/May/2004:04:51:55 +0100] "GET /widgets/blue.php HTTP/1.0" 301 346 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"

gergoe

2:07 pm on May 16, 2004 (gmt 0)

10+ Year Member



The second redirect could happen because the number of leading slashes is doubled in the result: you've placed everything in the pattern (leading slash included) into the back-reference, and in the substitution you've added another slash in front of it. By the way, I suggest you change the pattern and the substitution to include the dot before the extensions, to ensure that no directory or file with, say, an .xhtml extension is rewritten accidentally.

The reason a 301 is sent back rather than a 200 is that you defined the rule to act as a redirection instead of a rewrite; just leave the redirect flag out of the RewriteRule and it will rewrite internally, without changing the URL the client sees.

Forgot to mention that you can't use permanent, temp and the other symbolic names with the R= RewriteRule flag, only the numerical status codes; so if you want the redirection to be permanent, use the R=301 flag. To read more about the status codes, see ftp://ftp.rfc-editor.org/in-notes/rfc2616.txt
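Putting those two suggestions together, the rule might look something like this (a sketch only; the log exclusions are assumed to match incywincy's layout):

```apache
# External redirect: the client (and Googlebot) sees a 301 plus the
# new .php URL. The escaped dot ensures only real .html URLs match.
RewriteCond %{REQUEST_URI} !logs
RewriteCond %{REQUEST_URI} !spiderlog
RewriteRule ^(.*)\.html$ /$1.php [R=301,L]

# Or, to answer with a 200 directly, rewrite internally instead:
# RewriteRule ^(.*)\.html$ /$1.php [L]
```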

jdMorgan

8:23 pm on May 16, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Leading slashes are stripped from the URL seen by RewriteRule in the .htaccess context, so duplicate slashes are not the problem. Also [R=permanent] is acceptable according to the documentation, although as gergoe says, it's probably not what you want.

So, looking for the reason for the second redirect, do you have any other rules that may be being invoked?

To shorten it up and eliminate the 301 redirect, you could write it like this:


RewriteCond %{REQUEST_URI} !/(logs|spiderlog)/
RewriteRule ^(.*)\.html$ /$1.php [L]

This will make all of your php files appear to be html files, so no links need to be updated, and exempts /logs/ and /spiderlog/ from being redirected.

Jim

gmiller

7:15 pm on May 17, 2004 (gmt 0)

10+ Year Member



Even if duplicate slashes were occurring, it's hard to see how that could result in a .php URL matching a regex that ends in "html$".

As for the [R] option, the Apache docs say "[R=permanent]" is fine, and I've used it without problems for some time.

gergoe

12:08 am on May 18, 2004 (gmt 0)

10+ Year Member



Sorry about the symbolic names and the redirection; I tend to skip parts of the documentation, since it's all too long to read everything. How about a beer to forget about this one? ;-)

jimpoo

7:40 pm on May 18, 2004 (gmt 0)

10+ Year Member



Hi jdMorgan,
After doing mod_rewrite, the server sends out different headers for a real static .html file and a rewritten dynamic .html file.

For example, there is a real file named index.html on the server, and there is a mod_rewrite URL like this:
[domain...]

When you check the server headers, for the real 'index.html' it sends out real 'Last-Modified' and 'Content-Length' information, but for '1234.html' it sends out empty 'Last-Modified' and 'Content-Length' information.

By checking 'Content-Length' or some other headers, I wonder whether Googlebot can detect that '1234.html' is not a real file and give it some kind of penalty.

So, after doing mod_rewrite, can you figure out how to provide Googlebot with the real 'Content-Length' and other headers like 'Last-Modified' and 'ETag'?

Thank you very much.

jdMorgan

8:05 pm on May 18, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



The reason this happens is that the server knows php-generated files are dynamic and created on demand; therefore there is no Last-Modified date for them.

You can modify your php script to output any appropriate headers you desire. Because I don't know what 'part' of the php-generated page you would consider to be a 'real' page -- what part actually might have a unique (and fixed) creation date -- I can't recommend what you should choose to base your Last-Modified date on. If your site is database-driven, perhaps you can store and retrieve a content-modified date from your database, so that (for example) articles are tagged with the date you first entered them into your database, while the surrounding php-generated 'page framework' is freshly-generated.

If your php script is simply 'passing' html documents to the browser, then the script itself should examine the creation date of the document it is passing, and output a last-modified header based on the modification date of that document on the server.
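That last case can be sketched in a few lines of PHP. This is only an illustration of the idea, assuming the script does nothing more than pass an existing HTML document through; the path in $source is hypothetical.

```php
<?php
// Hypothetical passthrough script: serve an existing HTML document and
// base Last-Modified and Content-Length on that file, not on the script.
$source = '/path/to/docroot/widgets/blue.html';  // assumed location

$mtime = filemtime($source);  // modification time of the real document
header('Last-Modified: ' . gmdate('D, d M Y H:i:s', $mtime) . ' GMT');
header('Content-Length: ' . filesize($source));

readfile($source);  // send the document body unchanged
?>
```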

Jim

incywincy

9:20 pm on May 18, 2004 (gmt 0)

10+ Year Member



Thanks for the replies.
To add a bit more detail.
I have moved from html to php but have kept the file name prefixes, so widgets/big-blue.html becomes widgets/big-blue.php. I have written this rewrite so that spiders will eventually request the correct URL.

I thought this was common practice. Is this an incorrect way of telling Googlebot that the URL has moved permanently?

I also noticed that Googlebot didn't actually request the entire page. When I made the same request in a browser, the normal 301, 200 status code sequence was returned. Was Googlebot's request different from a standard browser request?

gergoe

10:04 pm on May 18, 2004 (gmt 0)

10+ Year Member



No, it is a usual HTTP/1.1-compliant request, with all the information required to process it. The difference between a browser and a spider is that a spider does not handle sessions well (it does not fetch the pages one after the other; crawling your whole site might take two days, so if you create sessions in your php pages, those might be empty by then), and spiders ignore cookies, which is more or less connected to the first problem.

If you are using "static" php files (I think so, since you just moved from html), with no session handling, no cookies (at least no decisions based on cookies), and no query string (yet), then Google should see the same content you see in your browser.

Have you already checked what's in Google's cache?

<add>
If you want to set particular header items in the response from php, like Last-Modified or the ETag, there's a part of the php manual which might be very useful for you:
[php.net...]

About the Content-Length header, I'm pretty sure php should set this by itself; if it doesn't, you might need to check your php config, or ask on the appropriate forum whether that is normal.
</add>
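As a rough sketch of what such header handling might look like in PHP (an illustration only; the file path is hypothetical, and real code should validate the client's date header more carefully):

```php
<?php
// Conditional-GET sketch: answer 304 Not Modified when the client's
// If-Modified-Since date is not older than the source file, so a
// crawler can skip re-fetching the body.
$source = '/path/to/docroot/widgets/blue.html';  // hypothetical path
$mtime  = filemtime($source);

if (isset($_SERVER['HTTP_IF_MODIFIED_SINCE'])
        && strtotime($_SERVER['HTTP_IF_MODIFIED_SINCE']) >= $mtime) {
    header('HTTP/1.1 304 Not Modified');
    exit;
}

header('Last-Modified: ' . gmdate('D, d M Y H:i:s', $mtime) . ' GMT');
readfile($source);
?>
```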

incywincy

10:32 pm on May 18, 2004 (gmt 0)

10+ Year Member



Thanks for all your answers. I now think I understand the problem.