Yahoo and mod_rewrite

         

Code Sentinel

11:29 pm on Sep 23, 2004 (gmt 0)

10+ Year Member



I decided to check how many pages I had in Yahoo and noticed that Yahoo doesn't use trailing slashes on any of the URLs. When I was creating the site with mod_rewrite, I was only able to get things going WITH the trailing slash. The problem now is that Yahoo can seemingly spider the pages fine, but when it indexes and ranks a page it strips the trailing slash from the URL. When you click on this URL without the trailing slash you end up at a 404 page, since it's all dynamic with mod_rewrite and doesn't physically exist as a page.

My URL structure is all directory based such as www.domain.com/foo/ and in this folder there is index.php which is not shown. I do this so I can, if I decide to, change to html, asp, whatever without affecting backlinks.

Yahoo is starting to put the 404 error message under my listings for nearly ALL pages because of this problem. My guess is that after spidering the original page, when it feels like checking the URL again, it checks it WITHOUT the trailing slash and replaces the actual content with the 404 message.

I'm currently using something like:
RewriteRule (.*)/$ /foo/index\.php?page=$1 [L]

but if I remove the trailing slash to something like

RewriteRule (.*)$ /foo/index\.php?page=$1 [L]

I get an internal server error. The htaccess file resides in a subfolder for each category and the rule changes depending if there are subcategories such as

RewriteRule (.*)/(.*)/$ /foo/index\.php?sub=$1&page=$2 [L]

It works well as long as the trailing slash is included, and doesn't work without the trailing slash when typed in the browser or clicked from Yahoo.

Can anyone tell me how I can add a rewrite rule to accommodate the exact same URL minus the trailing slash, so Yahoo doesn't index all my pages as my 404 error message?

jdMorgan

12:05 am on Sep 24, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Not sure if the problem is this simple, but you can make the trailing slash optional by following it with a question mark:

RewriteRule (.*)/(.*)/?$ /foo/index\.php?sub=$1&page=$2 [L]

I'd also suggest you change the patterns in the rules to speed up processing a bit:

RewriteRule ([^/]+)/(.*)/?$ /foo/index\.php?sub=$1&page=$2 [L]
RewriteRule (.*)/?$ /foo/index\.php?page=$1 [L]

and if you only wish to match one subdirectory level per pattern, you can use:

RewriteRule ([^/]+)/([^/]+)/?$ /foo/index\.php?sub=$1&page=$2 [L]
RewriteRule ([^/]+)/?$ /foo/index\.php?page=$1 [L]
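
As a quick way to see what these patterns capture, here is a minimal sketch in Python. It assumes Python's `re.search` is a fair stand-in for mod_rewrite's unanchored matching, which may not hold in every detail for the regex library in a given Apache build. The helper name `captures` and the sample paths are made up for illustration.

```python
import re

# Patterns copied from the rules above. re.search is only a stand-in for
# mod_rewrite's matching; Apache's regex engine may differ in detail.
PATTERNS = {
    "greedy one-level": r"(.*)/?$",
    "negated one-level": r"([^/]+)/?$",
    "negated two-level": r"([^/]+)/([^/]+)/?$",
}

def captures(pattern, path):
    """Return the captured groups for path, or None if nothing matches."""
    m = re.search(pattern, path)
    return m.groups() if m else None

# Print what each pattern captures, with and without the trailing slash.
for path in ("widgets", "widgets/", "blue/widgets", "blue/widgets/"):
    for name, pattern in PATTERNS.items():
        print(f"{path!r:16} {name:18} {captures(pattern, path)}")
```

In this sketch the `(.*)` version keeps the trailing slash inside the captured value when the slash is present, while the `[^/]+` versions capture the same value with or without it. The one-level pattern also matches the last segment of a two-level path, so the order in which the two rules are listed matters.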

"[^/]+" means "accept one or more characters not equal to slash" -- in other words, grab all characters up to the next slash. Since this is an unambiguous pattern, it is processed much more quickly, because mod_rewrite does not have to figure out how much of the requested URI "fits into" each ".*" pattern while still allowing enough for the next "/" and ".*" to match.

Jim

Code Sentinel

12:50 am on Sep 24, 2004 (gmt 0)

10+ Year Member



putting in the question mark before the $ just gave a server error.

I tried redirecting from the non slash version to the slash version which I thought would then do the rewrite normally.. but didn't work either.

Where exactly does RewriteBase fit in? When I was experimenting with mod_rewrite before implementing it, I couldn't get things to work properly unless I did things a certain way. My RewriteBase always ended up being just /, and I had to remove the ^ (caret) from the start of the rules or they wouldn't work, or didn't work right with relative URLs.

Now there's this: Yahoo doesn't use the trailing slash, Google does, and I don't know about the new MSN yet. If I can just get the non-trailing-slash version of the URL to rewrite just as the trailing-slash rule does, then I'd be set.

I only noticed this problem because Yahoo is the only one actually doing any indexing. Here I thought things were working great, but any click on those URLs ends up at a 404.

jdMorgan

1:01 am on Sep 24, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



What version of Apache are you on, and do you have access to httpd.conf?

A few things that often cause problems are UseCanonicalName On, Options MultiViews, and (for Apache 2 only) AcceptPathInfo, particularly if they are enabled and you don't need them. Another thing that can cause major problems is if PHP is loaded *after* mod_rewrite in the server load list.

Jim

Code Sentinel

1:34 am on Sep 24, 2004 (gmt 0)

10+ Year Member



I don't have access to httpd.conf on my host server, it's a simple shared/reseller type account.

Apache version is 1.3.29

I think this is only a problem because, when the trailing slash is missing and the file doesn't exist, another module (mod_dir?) normally changes the request into a directory request, which is then handled in the standard way (displaying index.* or default.* of that directory). Since my directories (sometimes a couple of subdirectories deep) are all generated by PHP and don't exist on the file system, that module doesn't kick in.

I'm thinking of just changing the extension to html instead of a slash, despite the rigorous spidering last month. That way it's not open to what the search engine thinks is proper form and server configuration variables.

Can you suggest a catchall type mod_rewrite rule that would 301 all pages that end in slash to the corresponding page but as .html instead?

ex: domain.com/foo/1/ to domain.com/foo/1.html

but would work for any sub directory level so I don't have to do every page by hand with

redirect permanent /foo/1/ [domain.com...]

I'm not sure how to do a 301 with mod_rewrite. Although I got the pages working by studying as many examples as I could, I still don't fully understand mod_rewrite, because I haven't learned anything about regular expressions yet other than what they do.

I feel like I'm working with a puzzle but can't see the edges of the pieces clearly, self taught has many holes. :(

jdMorgan

2:33 am on Sep 24, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



It sounds like your problem is a tough one. Rather than changing all of your URLs, let's try this fixup code first. It may not be quite right, as it's freshly-composed and untested, but maybe it will give you some ideas:

# If requested resource does not exist as a file
RewriteCond %{REQUEST_FILENAME} !-f
# and does not end with a period followed by a filetype
RewriteCond %{REQUEST_URI} !\..+$
# and does not end with a slash
RewriteCond %{REQUEST_URI} !/$
# then add a trailing slash and redirect
RewriteRule (.*) http://example.com/$1/ [R=301,L]
#
# Redirect dynamic pages to script
RewriteRule ([^/]+)/([^/]+)/$ /foo/index\.php?sub=$1&page=$2 [L]
RewriteRule ([^/]+)/$ /foo/index\.php?page=$1 [L]

Yahoo currently has a known problem with handling 301 redirects, so you might also want to try this as an internal rewrite if you notice Yahoo doing anything strange while spidering. That would involve changing the first rule to

RewriteRule (.*) /$1/

leaving the RewriteConds in place as they are. Note that the [L] flag should not be used in this case.
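
To make the logic of the three RewriteConds easier to follow, here is a rough Python sketch of the same decision. It is not how Apache evaluates them internally: the function name `needs_trailing_slash` and the `exists_as_file` flag are hypothetical, and the flag simply stands in for the `-f` file test.

```python
import re

HAS_EXTENSION = re.compile(r"\..+$")   # same regex as the second RewriteCond
ENDS_WITH_SLASH = re.compile(r"/$")    # same regex as the third RewriteCond

def needs_trailing_slash(uri, exists_as_file=False):
    """Mimic the three negated RewriteConds: True means the rule would fire."""
    if exists_as_file:                 # RewriteCond %{REQUEST_FILENAME} !-f
        return False
    if HAS_EXTENSION.search(uri):      # RewriteCond %{REQUEST_URI} !\..+$
        return False
    if ENDS_WITH_SLASH.search(uri):    # RewriteCond %{REQUEST_URI} !/$
        return False
    return True

# Sample URIs (made up): slashless page, slashed page, static file, dotted directory.
for uri in ("/foo/1", "/foo/1/", "/foo/style.css", "/foo/v1.2/page"):
    print(uri, needs_trailing_slash(uri))
```

Note that the second pattern is not tied to the last path segment, so in this sketch a URI with a dot anywhere in it, such as the made-up `/foo/v1.2/page`, is treated as already having a filetype and is left alone.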

Jim

Code Sentinel

4:08 am on Sep 24, 2004 (gmt 0)

10+ Year Member



It works! Thanks a lot, I'm gonna look up this 301 problem Yahoo has.

Although I'm also wondering what kind of overhead the RewriteCond lines will add and if changing (.*) to ([^/]+) would offset it? Or is it all too small to worry about, since I am on a shared type hosting package.

I didn't quite understand what you meant by changing the first rule to an internal rewrite, is that because of how Yahoo handles 301s?

Here's what I get for response headers with the changes you suggested:

[domain.com...]

GET /foo/1 HTTP/1.1
Host: www.domain.com
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.7) Gecko/20040626 Firefox/0.9.1
Accept: text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5
Accept-Language: en-us,en;q=0.5
Accept-Encoding: gzip,deflate
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
Keep-Alive: 300
Connection: keep-alive

HTTP/1.x 301 Moved Permanently
Date: Fri, 24 Sep 2004 03:57:49 GMT
Server: Apache
Location: [domain.com...]
Connection: close
Transfer-Encoding: chunked
Content-Type: text/html; charset=iso-8859-1
----------------------------------------------------------
[domain.com...]

GET /foo/1/ HTTP/1.1
Host: www.domain.com
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.7) Gecko/20040626 Firefox/0.9.1
Accept: text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5
Accept-Language: en-us,en;q=0.5
Accept-Encoding: gzip,deflate
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
Keep-Alive: 300
Connection: keep-alive

HTTP/1.x 200 OK
Date: Fri, 24 Sep 2004 03:57:49 GMT
Server: Apache
X-Powered-By: PHP/4.3.4
Last-Modified: Fri, 24 Sep 2004 03:46:55 GMT
Cache-Control: max-age=14400
Content-Encoding: gzip
Vary: Accept-Encoding
Etag: 153d690a9df65c8f496d0eceb7e11cbb
Content-Length: 1390
Connection: close
Content-Type: text/html

jdMorgan

4:41 am on Sep 24, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Your server responses look correct for the two request variants.

Use the site search of any major search engine to find threads on WebmasterWorld pertaining to Yahoo's 301 redirect problem. There are at least three threads about it, one of which was active within the past couple of days. Yahoo apparently does not update their URL database to show the new URL when a 301 is used to tell them that the resource has moved. Yahoo is aware of the problem and their rep here says they are working on it. No word yet on whether this has been fixed, but even when it is fixed, it may take a while to sort out the new index.

My advice is to implement 301 redirects according to the HTTP/1.1 specification, but to avoid redirecting URLs unless it is truly necessary. For example, I would not recommend renaming a bunch of pages at this time (which is how we got into this exchange), but if requested URLs are incorrect and the problem is outside your site, then a 301 is a reasonable and correct response. Basically, I recommend "going by the book" on protocol issues, and letting those who have errors in their handling of the protocol fix their errors, rather than having every Webmaster in the world implement some clunky work-around (and I don't think even a clunky work-around is available to Webmasters that would compensate for the 301 problem at Yahoo).

> I'm also wondering what kind of overhead the RewriteCond lines will add and if changing (.*) to ([^/]+) would offset it?

I believe in making the code as efficient as possible, while still letting the computer do the work for you. In other words, you need to fix this problem, and it requires some RewriteConds to do it, so let the computer do the work. However, there is no use wasting computer resources on ambiguous regular-expressions patterns, when a more-efficient unambiguous pattern can be used.

When matching the hostname "abc-def" to the pattern "(.*)-(.*)", for example, mod_rewrite will stuff all of abc-def into the first (.*), because ".*" is a "greedy" expression. It will then continue looking at the pattern, and realize that it needs at least one "-" in the requested hostname string to satisfy the rest of the pattern. So, mod_rewrite then has to "back up" into the characters already matched into the first ".*" until it finds a hyphen. So basically, this means that mod_rewrite must scan the requested hostname both from left-to-right, and then again from right-to-left.

However, if you use "[^-]+", then mod_rewrite scans the requested hostname from left to right, matching characters until it finds the first character that is a hyphen. It then knows that it is finished with the first pattern. It also already knows that it's got a hyphen, and so continues with the rest of the pattern. So, the main advantage in using the "[NOT some-character]" construct is that it allows pure forward parsing, with no need to "backtrack."
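
The difference between the two patterns can be seen in a small Python sketch. Python's `re` module is a backtracking engine, and it is an assumption that it behaves like the regex library in any particular Apache build. The helper name `split_host` and the three-part hostname are made up for illustration.

```python
import re

GREEDY = re.compile(r"(.*)-(.*)")       # ".*" grabs everything, then backs up
NEGATED = re.compile(r"([^-]+)-(.*)")   # stops at the first hyphen, no backing up

def split_host(pattern, host):
    """Return the two captured pieces of host, or None if there is no match."""
    m = pattern.match(host)
    return m.groups() if m else None

print(split_host(GREEDY, "abc-def"))        # ('abc', 'def')
print(split_host(NEGATED, "abc-def"))       # ('abc', 'def')
print(split_host(GREEDY, "abc-def-ghi"))    # ('abc-def', 'ghi')
print(split_host(NEGATED, "abc-def-ghi"))   # ('abc', 'def-ghi')
```

With a single hyphen both patterns capture the same pieces. With two hyphens the greedy version only gives back enough to reach the last hyphen, while the negated class stops at the first one, so the choice affects what is captured as well as how much work the matcher does.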

Don't worry about the overhead of processing a few RewriteConds. You can add several hundred on an average server, and see no practical difference. This does not mean, however, that you should not try to write efficient code.

Jim

Code Sentinel

5:13 am on Sep 24, 2004 (gmt 0)

10+ Year Member



As it is right now only MSN and Yahoo have spidered and partly indexed. Even if I keep the 301's in place Yahoo still cuts off that trailing slash which is required for my site.

So if they fix the redirect issue, how will Yahoo handle redirects from the non-trailing-slash URL to the trailing-slash URL? Would they index the trailing slash version properly and then cut off the trailing slash and have to reindex it over and over again? :P It's because they cut off the trailing slash that I even noticed this problem.

With an html or any extension it wouldn't be an issue, I'm torn heh. Right now ranking is a non issue and has been poor for a long time because of a robots.txt snafu.

Another question with regards to efficient regexp: depending on the site or category, the URL may have dashes, such as /blue-widgets.html, dynamically generated. Would ([^/]+) work for this? Sometimes there may be more than one dash in the category name.

ex: ([^/]+).html$ /foo/index\.php?cat=$1

The way I see it is I can "fix" it now and lose any potential indexing in the semi long term because of recent spidering or keep it as is with the 301 redirect and wonder if Yahoo will ever fix themselves and how MSN would handle it. Google hasn't spidered yet either which is another incentive to add extensions before it does start deep crawling.

Basically the redirect you have helped me with is the workaround you mention because of Yahoo cutting trailing slashes off URLs.. but then it opens another can of worms with Yahoo screwing up 301 redirects. Which has me thinking to cut any losses or future problems now and just make it so the SE's can't screw it up. 404 a lot of pages now and hope they drop from the index while they spider them again with .html extensions.

I've checked further and already I see some URLs with my 404 message rather than the content on the page... clicking on the URL of course brings me to the non trailing slash version of the page which is now a 301 redirect. Which brings me back to the part where Yahoo cuts off those trailing slashes! *curses yahoo*

jdMorgan

5:48 am on Sep 24, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



> Another question with regards to efficient regexp, depending on the site or category the URL may have dashes such as /blue-widgets.html, dynamically generated, would ([^/]+) work for this? somtimes there may be more than one dash in the category name.

The [^/]+ pattern does not care about anything but slashes; it will match as many characters as it can find, up to but not including the first slash. It does not care if those characters are alpha, numeric, hyphens, or whatever. It is simply looking for the first slash, and stopping there.
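
A short Python sketch shows the same thing for hyphenated names. It uses an escaped-dot version of the pattern from the question, and `re.search` is only assumed to behave like mod_rewrite's matching. The function name `category_of` and the sample paths are made up for illustration.

```python
import re

# Escaped-dot version of the pattern from the question; re.search stands in
# for mod_rewrite's matching (an assumption, not Apache itself).
CATEGORY = re.compile(r"([^/]+)\.html$")

def category_of(path):
    """Return the captured category name, or None if the path does not match."""
    m = CATEGORY.search(path)
    return m.group(1) if m else None

for path in ("blue-widgets.html", "big-blue-widgets.html", "blue-widgets/"):
    print(path, "->", category_of(path))
```

Any number of hyphens ends up inside the captured name, since only a slash ends the run. The last sample path has no .html ending, so it does not match at all.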

I can't really advise you of what to do about your site. It is entirely up to you, as we are all wondering how long it will be until Yahoo fixes their code. At the very least however, the code patch above should correct the listings in Yahoo showing your 404 error page content. If it were me, I'd leave my URLs alone and use the patch to fix the 404 listings.

Yahoo will fix their problem, as they are obviously non-compliant with the spirit and purpose of HTTP/1.1 server response codes. If it takes much longer, it will mean bad press.

There are many sites on-line with all or most URLs ending in "/". They simply have a separate directory for each "page" and the content that is displayed for each directory comes from the default index.html document in that directory. There is no practical difference between this and the handling of the root directory default document at http://example.com/ -- You can imagine what havoc it would cause if Yahoo failed to index those URLs correctly!

I believe the problems with redirects and meta-refreshes we are seeing with several major search engines come from attempts to compensate for common Webmaster errors and the limitations of free hosting plans, and possibly from some "anti-SEO" games. The trouble is that they are fooling with some basic protocol issues here, and it is causing more trouble than what they are trying to fix... The cure is worse than the disease. The search engines will either correct these problems or cease to be relevant -- basic market forces at work.

Jim

Code Sentinel

6:26 am on Sep 24, 2004 (gmt 0)

10+ Year Member



What you say makes sense with regards to the redirects, and I only consider it an issue at the moment because for some reason Yahoo is my biggest(read: only) referrer.

I think I'll add the redirect rules instead of extensions and hope Yahoo gets their act together. Then cross fingers that MSN isn't as bad with URLs and redirects when it launches.

Thanks for the code tips and help.

Code Sentinel

7:09 pm on Sep 24, 2004 (gmt 0)

10+ Year Member



Another question about the RewriteCond lines: do they have to come before the redirect rules? Are they checked every time, even if the trailing slash is there?

jdMorgan

7:36 pm on Sep 24, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Please check out the mod_rewrite documentation cited in our charter -- It'll answer these questions and many others.

RewriteConds must come before the (single) RewriteRule that they control. They are not evaluated unless the pattern in the Rule matches. See API processing phases in the mod_rewrite documentation.

Don't sweat your tiny .htaccess file. When it gets to 30kB in size, then you might worry about it on a busy server. While re-arranging a site and moving (redirecting) some pages, mine has peaked out at 33kB -- Effect on server: None discernible.

Remember that these directives are processed by "native" Apache code, and therefore there is very little overhead to it. It's not like the server has to load and initialize a Perl or PHP interpreter thread or something. :) I advise writing structured and efficient code so that you *can* have a 33kB .htaccess file and not drag down the server. It's not normally required, but it is part of "Best practices."

Jim

Code Sentinel

12:37 am on Oct 20, 2004 (gmt 0)

10+ Year Member



New issue I just noticed with this code that I didn't catch previously. I've tried fiddling with it and can't seem to get it to change its behavior.

It all works fine except for the "home" directory the scripts are located in.

for example:

# If requested resource does not exist as a file
RewriteCond %{REQUEST_FILENAME} !-f
# and does not end with a period followed by a filetype
RewriteCond %{REQUEST_URI} !\..+$
# and does not end with a slash
RewriteCond %{REQUEST_URI} !/$
# then add a trailing slash and redirect
RewriteRule (.*) [example.com...] [R=301,L]
#
# Redirect dynamic pages to script
RewriteRule ([^/]+)/$ /foo/index\.php?page=$1 [L]

This works great WHEN I have variables to redirect.. but the plain old directory page that really exists, "/foo/" in this case, should automatically display index.php without any variables to rewrite(which happens to be the main page).

The problem is when I type in www.domain.com/foo without the slash I get redirected to "www.domain.com/foo//home/username/public_html/foo/" with a 301. This can't be good if an engine visits this page without the slash(shows blank).

The thing is /foo/ is an actual physical directory so apache SHOULD automatically add a slash and then display index.* or default.* (like it does for directories or pages that do not require mod_rewrite)

I haven't quite wrapped my head around mod_rewrite and can't figure out what the conditions are seeing to shove the home user directory in there. I've tried adding another rewrite condition that checks for trailing slash on the directory in question but it didn't change.

I might figure it out with time but these pages are live :P Any help would be appreciated.

jdMorgan

1:04 am on Oct 20, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Man, your server configuration is messed up! Have you considered changing hosts?

You should not have to do *anything at all* for these Yahoo links to work, and your host should help you fix this. Ask them to turn off UseCanonicalName for your account's directory.

Or, you can try adding yet another RewriteCond to avoid redirecting existing directories. Add this right above or below the check for file-exists:


RewriteCond %{REQUEST_FILENAME} !-d

Jim

Code Sentinel

2:18 am on Oct 20, 2004 (gmt 0)

10+ Year Member



After adding that, the URL changed to

www.example.com/foo///home/userdir/public_html/foo/?page=foo

I can only assume one of the conditionals or rules is matching the actual dir name.. perhaps I have the whole rewritebase wrong(the mod_rewrite documentation is as clear as mud with regards to this one).

----------------

Before finishing this post I decided to comment out some lines and see which was matching.. which proved inconclusive since firefox was caching something it shouldn't have :P (nothing worked until I cleared everything)

I think it may be working now since I've added your suggestion about checking the actual directory.

The server seems good.. the service sucks.. moving would've been easier than getting them to answer email since they were bought out by another company.