I recently decided to move my site to PHP. To cater for the search engines I have inserted the following in my .htaccess file:
RewriteCond %{REQUEST_URI} html
RewriteCond %{REQUEST_URI} !logs
RewriteCond %{REQUEST_URI} !spiderlog
RewriteRule ^(.*)html$ /$1php [R=permanent,L]
Lines 2 and 3 allow me to keep my log analysis reports as html files.
I noticed that when Google requests an old .html file an HTTP 301 is returned, the URL is rewritten to .php, and a 301 is returned again for the .php request.
Is this correct? I would have expected a status of 200 for the second request. I have found that half of my pages have disappeared from Google's index, but I put this down to the changeover period and expected things to return to normal after the next deep crawl.
extract from access_log:
64.68.82.199 - - [15/May/2004:02:03:48 +0100] "GET /widgets/blue.html HTTP/1.0" 301 342 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"
64.68.82.37 - - [15/May/2004:04:51:55 +0100] "GET /widgets/blue.php HTTP/1.0" 301 346 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"
The reason a 301 is sent back instead of a 200 is that you defined the rule as a redirection rather than a rewrite. Just leave the R (redirect) flag out of the RewriteRule, and the server will rewrite the request internally without changing the URL the client sees.
Forgot to mention: the R= flag of RewriteRule accepts either a numeric status code or one of the symbolic names temp, permanent and seeother, so R=permanent and R=301 are equivalent. To read more about the status codes see ftp://ftp.rfc-editor.org/in-notes/rfc2616.txt
So, looking for the reason for the second redirect, do you have any other rules that may be being invoked?
To shorten it up and eliminate the 301 redirect, you could write it like this:
RewriteCond %{REQUEST_URI} !/(logs|spiderlog)/
RewriteRule ^(.*)\.html$ /$1.php [L]
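The matching logic of the shortened rule above can be checked offline with a small sketch. This is only an approximation in Python: the function name `rewrite` is made up for illustration, Python's `re` module is assumed to behave like mod_rewrite's regex engine for these simple patterns, and the leading slash is stripped because that is how patterns are matched in .htaccess context.

```python
import re

# Mirrors: RewriteCond %{REQUEST_URI} !/(logs|spiderlog)/
EXCLUDE = re.compile(r"/(logs|spiderlog)/")
# Mirrors: RewriteRule ^(.*)\.html$ /$1.php [L]
HTML_PATTERN = re.compile(r"^(.*)\.html$")


def rewrite(request_uri: str) -> str:
    """Return the path the server would serve internally (hypothetical helper)."""
    # The condition fails for log directories, so the rule is skipped.
    if EXCLUDE.search(request_uri):
        return request_uri
    # In per-directory (.htaccess) context the pattern sees no leading slash.
    local_path = request_uri.lstrip("/")
    match = HTML_PATTERN.match(local_path)
    if match is None:
        return request_uri
    return "/" + match.group(1) + ".php"


print(rewrite("/widgets/blue.html"))  # /widgets/blue.php
print(rewrite("/logs/report.html"))   # /logs/report.html
```

Because there is no R flag, the substitution happens inside the server and the client still sees the .html URL with a 200 status. The escaped dot also matters: without it, a URL that merely ends in the letters "html" would match as well.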
Jim
For example, there is a real file named index.html on the server, and there is a mod_rewrite URL like this:
[domain...]
When you check the server header, for the real one 'index.html', it sends out the real 'Last-Modified' and 'Content-Length' information.
But for '1234.html', it sends out empty 'Last-Modified' and 'Content-Length' information.
I wonder whether Googlebot, by checking 'Content-Length' or some other headers, can still detect that '1234.html' is not a real file and give it some kind of penalty.
So, with mod_rewrite in place, how can I give Googlebot real values for 'Content-Length' and other headers such as 'Last-Modified' and 'ETag'?
Thank you very much.
You can modify your php script to output any appropriate headers you desire. Because I don't know what 'part' of the php-generated page you would consider to be a 'real' page -- what part actually might have a unique (and fixed) creation date -- I can't recommend what you should choose to base your Last-Modified date on. If your site is database-driven, perhaps you can store and retrieve a content-modified date from your database, so that (for example) articles are tagged with the date you first entered them into your database, while the surrounding php-generated 'page framework' is freshly-generated.
If your php script is simply 'passing' html documents to the browser, then the script itself should examine the modification date of the document it is passing, and output a Last-Modified header based on that date.
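As a rough sketch of the pass-through case, the headers can be derived from the file on disk. Python is used here only to show the logic; in a PHP script the equivalent steps would use filemtime() and header(). The helper names `validator_headers` and `is_not_modified` and the ETag format are assumptions made for this example, not anything the server or PHP prescribes.

```python
import os
from email.utils import formatdate, parsedate_to_datetime


def validator_headers(path):
    """Build caching-related headers from the file's size and modification time."""
    stat = os.stat(path)
    mtime = int(stat.st_mtime)
    return {
        # HTTP dates are always expressed in GMT.
        "Last-Modified": formatdate(mtime, usegmt=True),
        # Assumed ETag scheme: hex size and hex mtime, quoted.
        "ETag": '"%x-%x"' % (stat.st_size, mtime),
        "Content-Length": str(stat.st_size),
    }


def is_not_modified(headers, if_modified_since=None, if_none_match=None):
    """Return True when the script could answer 304 instead of resending the body."""
    if if_none_match is not None:
        return if_none_match == headers["ETag"]
    if if_modified_since is not None:
        try:
            since = parsedate_to_datetime(if_modified_since)
            last = parsedate_to_datetime(headers["Last-Modified"])
            return last <= since
        except (TypeError, ValueError):
            # Unparseable or incomparable date: send the full response.
            return False
    return False
```

A script would call validator_headers() for the document it is about to pass through, send those headers, and skip the body with a 304 status whenever is_not_modified() returns True for the request's If-Modified-Since or If-None-Match value.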
Jim
I thought this was common practise. Is this an incorrect way of telling googlebot that the url has moved permanently?
I also noticed that googlebot didn't actually request the entire page. When I made the same request in the browser, the normal sequence of HTTP status codes (301, then 200) was returned. Was googlebot's request different from a standard browser request?
If you are using "static" php files (I think so, since you just moved from html), meaning no session handling, no cookies (or at least no decisions based on cookies), and no use of the query string (yet), then Google should get the same content as you see in your browser.
Have you already checked what's in Google's cache?
<add>
If you want to set different header items in the response from php, like the Last-Modified or the ETag, then there's a part in the php manual which might be very useful for you:
[php.net...]
About the Content-Length header: I'm pretty sure PHP should set this by itself. If it doesn't, you might need to check your php config, or ask on the appropriate forum whether that is normal.
</add>