My URL structure is all directory based such as www.domain.com/foo/ and in this folder there is index.php which is not shown. I do this so I can, if I decide to, change to html, asp, whatever without affecting backlinks.
Yahoo is starting to show my 404 error message under my listings for nearly ALL pages because of this problem. My guess is that after spidering the original page, when it checks the URL again it requests it WITHOUT the trailing slash and replaces the actual content with the 404 message.
I'm currently using something like:
RewriteRule (.*)/$ /foo/index\.php?page=$1 [L]
but if I remove the trailing slash to something like
RewriteRule (.*)$ /foo/index\.php?page=$1 [L]
I get an internal server error. The htaccess file resides in a subfolder for each category and the rule changes depending if there are subcategories such as
RewriteRule (.*)/(.*)/$ /foo/index\.php?sub=$1&page=$2 [L]
It works well as long as the trailing slash is included.. and doesn't work without the trailing slash when typed in the browser or clicked from yahoo.
Can anyone tell me how I can add a rewrite rule to accommodate the exact same URL minus the trailing slash, so Yahoo doesn't index all my pages as my 404 error message?
RewriteRule (.*)/(.*)/?$ /foo/index\.php?sub=$1&page=$2 [L]
RewriteRule ([^/]+)/(.*)/?$ /foo/index\.php?sub=$1&page=$2 [L]
RewriteRule (.*)/?$ /foo/index\.php?page=$1 [L]
RewriteRule ([^/]+)/([^/]+)/?$ /foo/index\.php?sub=$1&page=$2 [L]
RewriteRule ([^/]+)/?$ /foo/index\.php?page=$1 [L]
Jim
I tried redirecting from the non slash version to the slash version which I thought would then do the rewrite normally.. but didn't work either.
Where exactly does RewriteBase fit in? When I was experimenting with mod_rewrite before implementing it, I couldn't get things to work properly unless I did things a certain way. My RewriteBase always ended up being just /, and I had to remove any ^ (caret) from the start of the rules or they wouldn't work, or didn't work right with relative URLs.
Now this, yahoo not using trailing slash, google does, don't know about new MSN yet. If I can just get the non trailing slash version of the URL to rewrite just as the trailing slash rule does then I'd be set.
I only noticed this problem because Yahoo is the only one actually doing any indexing. Here I thought things were working great but any clicks to those urls ends up at a 404.
A couple of things that often cause problems are if you have UseCanonicalName on, Options MultiViews, and for Apache 2 only, AcceptPathInfo, particularly if you don't need them. Another thing that can cause major problems is if php is loaded *after* mod_rewrite in the server load list.
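As a rough sketch of what checking those settings might look like: only the MultiViews line can go in .htaccess, and only if the host allows Options overrides (that is an assumption about your account). The other two are shown commented out, since UseCanonicalName is set in the server or virtual host configuration, not in .htaccess, and AcceptPathInfo does not exist on Apache 1.3.

```apache
# .htaccess sketch; assumes the host permits "AllowOverride Options"
# Turn off content negotiation so a request like /foo/1 is not
# quietly matched against files such as 1.php or 1.html
Options -MultiViews

# Server config / virtual host / <Directory> only (ask the host):
# UseCanonicalName Off

# Apache 2 only; not available on 1.3.x:
# AcceptPathInfo Off
```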
Jim
Apache version is 1.3.29
I think this is only a problem because, when a URL has no trailing slash and no such file exists, another module (mod_dir?) normally turns it into a directory request, which is then handled in the standard way (displaying index.* or default.* of that directory). Since my directories (sometimes a couple of subdirectories deep) are all generated by PHP and don't exist on the file system, that module doesn't kick in.
I'm thinking of just changing the extension to html instead of a slash despite the rigorous spidering last month. That way it's not open to what the search engine thinks is proper form and server configuration variables.
Can you suggest a catchall type mod_rewrite rule that would 301 all pages that end in slash to the corresponding page but as .html instead.
ex: domain.com/foo/1/ to domain.com/foo/1.html
but would work for any sub directory level so I don't have to do every page by hand with
redirect permanent /foo/1/ [domain.com...]
I'm not sure how to do a 301 with mod_rewrite.. although I got the pages working by studying as many examples as I can I still don't fully understand mod_rewrite because I haven't learned anything about regular expressions yet other than what it does.
I feel like I'm working with a puzzle but can't see the edges of the pieces clearly, self taught has many holes. :(
# If requested resource does not exist as a file
RewriteCond %{REQUEST_FILENAME} !-f
# and does not end with a period followed by a filetype
RewriteCond %{REQUEST_URI} !\..+$
# and does not end with a slash
RewriteCond %{REQUEST_URI} !/$
# then add a trailing slash and redirect
RewriteRule (.*) http://example.com/$1/ [R=301,L]
#
# Redirect dynamic pages to script
RewriteRule ([^/]+)/([^/]+)/$ /foo/index\.php?sub=$1&page=$2 [L]
RewriteRule ([^/]+)/$ /foo/index\.php?page=$1 [L]
RewriteRule (.*) /$1/
Jim
Although I'm also wondering what kind of overhead the RewriteCond lines will add and if changing (.*) to ([^/]+) would offset it? Or is it all too small to worry about, since I am on a shared type hosting package.
I didn't quite understand what you meant by changing the first rule to an internal rewrite, is that because of how Yahoo handles 301s?
Here's what I get for response headers with the changes you suggested:
[domain.com...]
GET /foo/1 HTTP/1.1
Host: www.domain.com
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.7) Gecko/20040626 Firefox/0.9.1
Accept: text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5
Accept-Language: en-us,en;q=0.5
Accept-Encoding: gzip,deflate
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
Keep-Alive: 300
Connection: keep-alive
HTTP/1.x 301 Moved Permanently
Date: Fri, 24 Sep 2004 03:57:49 GMT
Server: Apache
Location: [domain.com...]
Connection: close
Transfer-Encoding: chunked
Content-Type: text/html; charset=iso-8859-1
----------------------------------------------------------
[domain.com...]
GET /foo/1/ HTTP/1.1
Host: www.domain.com
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.7) Gecko/20040626 Firefox/0.9.1
Accept: text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5
Accept-Language: en-us,en;q=0.5
Accept-Encoding: gzip,deflate
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
Keep-Alive: 300
Connection: keep-alive
HTTP/1.x 200 OK
Date: Fri, 24 Sep 2004 03:57:49 GMT
Server: Apache
X-Powered-By: PHP/4.3.4
Last-Modified: Fri, 24 Sep 2004 03:46:55 GMT
Cache-Control: max-age=14400
Content-Encoding: gzip
Vary: Accept-Encoding
Etag: 153d690a9df65c8f496d0eceb7e11cbb
Content-Length: 1390
Connection: close
Content-Type: text/html
Use the site search of any major search engine to find threads on WebmasterWorld pertaining to Yahoo's 301 redirect problem. There are at least three threads about it, one of which was active within the past couple of days. Yahoo apparently does not update their URL database to show the new URL when a 301 is used to tell them that the resource has moved. Yahoo is aware of the problem and their rep here says they are working on it. No word yet on whether this has been fixed, but even when it is fixed, it may take a while to sort out the new index.
My advice is to implement 301 redirects according to the HTTP/1.1 specification, but to avoid redirecting URLs unless it is truly necessary. For example, I would not recommend renaming a bunch of pages at this time (which is how we got into this exchange), but if requested URLs are incorrect and the problem is outside your site, then a 301 is a reasonable and correct response. Basically, I recommend "going by the book" on protocol issues, and letting those who have errors in their handling of the protocol fix their errors, rather than having every Webmaster in the world implement some clunky work-around (and I don't think even a clunky work-around is available to Webmasters that would compensate for the 301 problem at Yahoo).
> I'm also wondering what kind of overhead the RewriteCond lines will add and if changing (.*) to ([^/]+) would offset it?
I believe in making the code as efficient as possible, while still letting the computer do the work for you. In other words, you need to fix this problem, and it requires some RewriteConds to do it, so let the computer do the work. However, there is no use wasting computer resources on ambiguous regular-expressions patterns, when a more-efficient unambiguous pattern can be used.
When matching the hostname "abc-def" to the pattern "(.*)-(.*)", for example, mod_rewrite will stuff all of abc-def into the first (.*), because ".*" is a "greedy" expression. It will then continue looking at the pattern, and realize that it needs at least one "-" in the requested hostname string to satisfy the rest of the pattern. So, mod_rewrite then has to "back up" into the characters already matched into the first ".*" until it finds a hyphen. So basically, this means that mod_rewrite must scan the requested hostname both from left-to-right, and then again from right-to-left.
However, if you use "[^-]+", then mod_rewrite scans the requested hostname from left to right, matching characters until it finds the first character that is a hyphen. It then knows that it is finished with the first pattern. It also already knows that it's got a hyphen, and so continues with the next pattern, which ends with a dot. So, the main advantage in using the "[NOT some-character]" construct is that it allows pure forward parsing, with no need to "backtrack."
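The same reasoning carries over from hostnames to URL paths. Here is a sketch using the two-level rule already posted in this thread, with the ambiguous form commented out for comparison. The /foo/index.php target and the sub/page parameters are the ones used throughout the thread; nothing else is assumed.

```apache
# Ambiguous: the first (.*) swallows the whole path, then has to
# back up character by character until the slashes line up
# RewriteRule (.*)/(.*)/$ /foo/index\.php?sub=$1&page=$2 [L]
#
# Unambiguous: each [^/]+ can never run past a slash
RewriteRule ([^/]+)/([^/]+)/$ /foo/index\.php?sub=$1&page=$2 [L]
```

Note that the two forms are not exact equivalents on deeper paths: for a request like a/b/c/ the (.*) version would put "a/b" into $1, while the [^/]+ version (with no leading ^ anchor) would pick up only the last two segments.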
Don't worry about the overhead of processing a few RewriteConds. You can add several hundred on an average server, and see no practical difference. This does not mean, however, that you should not try to write efficient code.
Jim
So if they fix the redirect issue, how will Yahoo handle redirects of non trailing slash to trailing slash. Would they index the trailing slash version properly and then cut off the trailing slash and have to reindex it over and over again? :P It's because they cut off the trailing slash that I even noticed this problem.
With an html or any extension it wouldn't be an issue, I'm torn heh. Right now ranking is a non issue and has been poor for a long time because of a robots.txt snafu.
Another question with regard to efficient regexps: depending on the site or category, the URL may have dashes, such as /blue-widgets.html (dynamically generated). Would ([^/]+) work for this? Sometimes there may be more than one dash in the category name.
ex: ([^/]+).html$ /foo/index\.php?cat=$1
The way I see it is I can "fix" it now and lose any potential indexing in the semi long term because of recent spidering or keep it as is with the 301 redirect and wonder if Yahoo will ever fix themselves and how MSN would handle it. Google hasn't spidered yet either which is another incentive to add extensions before it does start deep crawling.
Basically the redirect you have helped me with is the workaround you mention because of Yahoo cutting trailing slashes off URLs.. but then it opens another can of worms with Yahoo screwing up 301 redirects. Which has me thinking to cut any losses or future problems now and just make it so the SE's can't screw it up. 404 a lot of pages now and hope they drop from the index while they spider them again with .html extensions.
I've checked further and already I see some URLs with my 404 message rather than the content on the page... clicking on the URL of course brings me to the non trailing slash version of the page which is now a 301 redirect. Which brings me back to the part where Yahoo cuts off those trailing slashes! *curses yahoo*
The [^/]+ pattern does not care about anything but slashes; it will match as many characters as it can find, up to but not including the first slash. It does not care if those characters are alpha, numeric, hyphens, or whatever. It is simply looking for the first slash, and stopping there.
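To make that concrete, here is a sketch of the .html rule asked about above, assuming the .htaccess file sits in /foo/ and that "cat" is the parameter the script expects (both taken from the question, not verified). The only change from the version in the question is that the dot before "html" is escaped so it matches a literal period.

```apache
# /foo/blue-widgets.html      -> /foo/index.php?cat=blue-widgets
# /foo/big-blue-widgets.html  -> /foo/index.php?cat=big-blue-widgets
RewriteRule ([^/]+)\.html$ /foo/index\.php?cat=$1 [L]
```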
I can't really advise you of what to do about your site. It is entirely up to you, as we are all wondering how long it will be until Yahoo fixes their code. At the very least however, the code patch above should correct the listings in Yahoo showing your 404 error page content. If it were me, I'd leave my URLs alone and use the patch to fix the 404 listings.
Yahoo will fix their problem, as they are obviously non-compliant with the spirit and purpose of HTTP/1.1 server response codes. Too much longer will mean bad press.
There are many sites on-line with all or most URLs ending in "/". They simply have a separate directory for each "page" and the content that is displayed for each directory comes from the default index.html document in that directory. There is no practical difference between this and the handling of the root directory default document at http://example.com/ -- You can imagine what havoc it would cause if Yahoo failed to index those URLs correctly!
I believe the problems with redirects and meta-refreshes we are seeing with several major search engines come from attempts to compensate for common Webmaster errors and the limitations of free hosting plans, and possibly from some "anti-SEO" games. The trouble is that they are fooling with some basic protocol issues here, and it is causing more trouble than what they are trying to fix... The cure is worse than the disease. The search engines will either correct these problems or cease to be relevant -- basic market forces at work.
Jim
I think I'll add the redirect rules instead of extensions and hope Yahoo gets their act together. Then cross fingers that MSN isn't as bad with URLs and redirects when it launches.
Thanks for the code tips and help.
RewriteConds must come before the (single) RewriteRule that they control. They are not evaluated unless the pattern in the Rule matches. See API processing phases in the mod_rewrite documentation.
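A small sketch of that scoping, reusing rules from earlier in this thread (the example.com redirect target is the placeholder from the earlier post):

```apache
# These two conditions belong only to the RewriteRule directly
# below them. mod_rewrite tests that rule's pattern first and
# looks at the conditions only if the pattern matched.
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_URI} !/$
RewriteRule (.*) http://example.com/$1/ [R=301,L]
#
# This rule has no RewriteCond of its own, so it runs
# unconditionally whenever its pattern matches.
RewriteRule ([^/]+)/$ /foo/index\.php?page=$1 [L]
```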
Don't sweat your tiny .htaccess file. When it gets to 30kB in size, then you might worry about it on a busy server. While re-arranging a site and moving (redirecting) some pages, mine has peaked out at 33kB -- Effect on server: None discernible.
Remember that these directives are processed by "native" Apache code, and therefore there is very little overhead to it. It's not like the server has to load and initialize a PERL or PHP interpreter thread or something. :) I advise writing structured and efficient code so that you *can* have a 33kB .htaccess file and not drag down the server. It's not normally required, but it is part of "Best practices."
Jim
It all works fine except for the "home" directory the scripts are located in.
for example:
# If requested resource does not exist as a file
RewriteCond %{REQUEST_FILENAME} !-f
# and does not end with a period followed by a filetype
RewriteCond %{REQUEST_URI} !\..+$
# and does not end with a slash
RewriteCond %{REQUEST_URI} !/$
# then add a trailing slash and redirect
RewriteRule (.*) [example.com...] [R=301,L]
#
# Redirect dynamic pages to script
RewriteRule ([^/]+)/$ /foo/index\.php?page=$1 [L]
This works great WHEN I have variables to redirect.. but the plain old directory page that really exists, "/foo/" in this case, should automatically display index.php without any variables to rewrite(which happens to be the main page).
The problem is when I type in www.domain.com/foo without the slash I get redirected to "www.domain.com/foo//home/username/public_html/foo/" with a 301. This can't be good if an engine visits this page without the slash(shows blank).
The thing is /foo/ is an actual physical directory so apache SHOULD automatically add a slash and then display index.* or default.* (like it does for directories or pages that do not require mod_rewrite)
I haven't quite wrapped my head around mod_rewrite and can't figure out what the conditions are seeing to shove the home user directory in there. I've tried adding another rewrite condition that checks for trailing slash on the directory in question but it didn't change.
I might figure it out with time but these pages are live :P Any help would be appreciated.
You should not have to do *anything at all* for these Yahoo links to work, and your host should help you fix this. Ask them to turn off UseCanonicalName for your account's directory.
Or, you can try adding yet another RewriteCond to avoid redirecting existing directories. Add this right above or below the check for file-exists:
RewriteCond %{REQUEST_FILENAME} !-d
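Put together with the earlier conditions, the redirect block would look roughly like this. It is a sketch only: the example.com target is the placeholder from the earlier post, and whether the directory check is enough on its own depends on how the host has the server configured.

```apache
# If requested resource does not exist as a file
RewriteCond %{REQUEST_FILENAME} !-f
# and does not exist as a directory (so /foo is left to the server's
# normal directory handling)
RewriteCond %{REQUEST_FILENAME} !-d
# and does not end with a period followed by a filetype
RewriteCond %{REQUEST_URI} !\..+$
# and does not end with a slash
RewriteCond %{REQUEST_URI} !/$
# then add a trailing slash and redirect
RewriteRule (.*) http://example.com/$1/ [R=301,L]
```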
www.example.com/foo///home/userdir/public_html/foo/?page=foo
I can only assume one of the conditionals or rules is matching the actual dir name.. perhaps I have the whole rewritebase wrong(the mod_rewrite documentation is as clear as mud with regards to this one).
----------------
Before finishing this post I decided to comment out some lines and see which was matching.. which proved inconclusive since firefox was caching something it shouldn't have :P (nothing worked until I cleared everything)
I think it may be working now since I've added your suggestion about checking the actual directory.
The server seems good.. the service sucks.. moving would've been easier than getting them to answer email since they were bought out by another company.