Here's what Googlebot asked for:
http://example.com/examplepage.shtml/
There should not be a slash at the end, but the server loaded the page with a 200 OK. The page looked a bit messed up when viewed, but I know Google is going to think I've got duplicates, one with and one without the "/".
I've already done the www. to non-www, I've also added code to remove extra slashes from directories, extra "." from URLs, etc. Now apparently I need another one. A search on the web has turned up nothing. Is there one setting that makes the server just say NO if the URL isn't exact?
http://example.com/examplepage1.shtml/examplepage2.shtml
The server is returning a 200 OK, and the page is messed up when viewed.
How can I prevent the server from allowing this? I'd rather have it issue a Not Found instead, because the URL Googlebot is asking for doesn't exist! Why is the server loading the page with a 200 OK?
Added it, and it didn't help. It is loading the first page requested in the URL twice, almost as if one were on top of and slightly below and to the right of the other. Very, very strange.
There is an I-Frame on the page, and the page in the I-Frame has an include as well. Both of the I-Frames seem to be loading properly.
I have no idea how these strange things can happen. I don't understand how the server can load something that doesn't exist. Any other ideas?
I will never get my site straightened out as long as Googlebot continues to add to the problems. I know there are no links on my site like this.
Is there a way to mod rewrite any requests for
http://example.com/*.html/*? and return a 400 or 404?
I have both .html and .shtml pages, but would never have anything after the .html, so there should be no requests for anything like that. And if there are, I would want something other than a 200 OK returned.
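One way this could be done, as a rough sketch only: it assumes a reasonably recent Apache 2.x with mod_rewrite, rules in the root .htaccess, and that no real directory name on the site contains ".html" or ".shtml". The AcceptPathInfo line is an alternative that asks the server itself to reject trailing path information; it needs Apache 2.0.30 or later and AllowOverride FileInfo if used in .htaccess.

```apache
RewriteEngine On
# Anything that continues with a slash after ".html" or ".shtml" is not a real
# page on this site, so answer 404 instead of serving the first file named.
# With a non-3xx status the substitution is ignored, hence the "-".
RewriteRule ^[^.]+\.s?html/ - [R=404,L]

# Alternative (not mod_rewrite): refuse trailing path info altogether,
# so /examplepage.shtml/anything is treated as Not Found.
AcceptPathInfo Off
```

Either line should be enough on its own; which one fits depends on what else in the configuration is accepting the extra path.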
Redirects can also cause trouble. An HTTP redirect from www.example.com to www.newsite.com/page1 may result in hits on non-existent pages such as www.newsite.com/page1/robots.txt. This is especially likely if you create (simplified?) links on other sites that depend on the redirect, like www.example.com/another.html. In that case, relative links on www.newsite.com/page1/another.html may not get mapped correctly, so a relative link on another.html such as "example2.html" may be seen as www.example.com/another.html/example2.html. I've seen this in cases where Apache httpd is also doing some aliasing.
Even if you use a full http:// prefix in a URL, pay close attention to trailing slashes in HREF attributes with a naked domain name. "www.example.com" may not be treated the same as "www.example.com/" when redirects, rewrites and aliasing are in play. The implicit "/", when and if it is added, can break a redirected URL.
Your httpd returning 200 on obviously bad URLs indicates that you are getting an exact match on a rule that causes the rest of the URL to be either ignored or discarded.
If you don't have content-negotiation enabled, then there is no way Apache will show a page when a directory is requested, unless you have some code in your config files or .htaccess that is making it happen. So as spinnercee says, you need to look through your configuration code carefully, and find out where the problem is actually caused. You could add more code as a band-aid fix, true, but it would be better to bullet-proof your existing code instead.
Jim
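For reference, ruling out content negotiation might look like the sketch below. It assumes the site is configured through a root .htaccess and that AllowOverride permits Options there.

```apache
# With MultiViews on, a request for a name that does not exist as written
# (e.g. /examplepage) can be answered by a file such as examplepage.shtml.
# Switching it off removes one source of "pages" that were never linked.
Options -MultiViews
```

If the bad URLs still return 200 after this, the cause is somewhere else in the configuration.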
RewriteRule ^([^.]+\.[^/]+)/ http://example.com/$1 [R=301,L]
That has allowed a 301 redirect to the URL originally specified, dropping the second page from the URL. This will let Googlebot know it's asking for stuff that doesn't exist.
Thanks everybody - one less thing to cause problems.
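For anyone reading along, here is how that pattern breaks down, using the URL from earlier in the thread. It is a commented copy of the same rule, assuming it lives in the root .htaccess (so the path is tested without its leading slash).

```apache
# ^([^.]+\.[^/]+)/
#   [^.]+   one or more characters up to the first dot   -> "examplepage1"
#   \.      a literal dot
#   [^/]+   the extension, up to the next slash          -> "shtml"
#   /       the stray slash; it and whatever follows are dropped
#
# examplepage1.shtml/examplepage2.shtml
#   -> 301 to http://example.com/examplepage1.shtml
RewriteRule ^([^.]+\.[^/]+)/ http://example.com/$1 [R=301,L]
```

One caveat: a real directory with a dot in its name (say /v1.2/page.html) would also match and be redirected to /v1.2, so the rule is only safe on sites where directory names never contain a period.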
I have one that deals with the /index.html issue, redirecting to /directory/ instead.
Another gets rid of the www and redirects to non-www. (I read the index issue needs to be done before the www redirect, and I put it first.)
Another deals with double slashes in the URL, http://example.com/forum//page1.php
Something in there also deals with putting a "." at the end of a URL, as in http://example.com./page9.html
I think that's it. I have a few denys. I do have the RewriteEngine On, symlinks, I added the -MultiViews to the code as well.
I suppose it's possible one of the rewrites is causing the problem, but it appears to be OK for now. I'm watching Google closely, though, to see what else it tries to pull. How many more ways can it screw up a URL? Enough already!
I've seen five common ways so far:
# Fix additional directory paths appended to filenames (e.g. /logo.jpg/<directory_path>)
RewriteRule ^([^.]+\.[^/]+)/ http://www.example.com/$1 [R=301,L]
#
# Remove trailing punctuation for printed-text links (Remove any trailing
# hex-encoded characters or any characters not one of a-z, A-Z, or 0-9)
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /(.+)(\%(25)*[0-9a-f]{2}|[^a-z0-9])\ HTTP/[0-9]+\.[0-9]$ [NC]
RewriteRule . http://www.example.com/%1 [R=301,L]
#
# Fix malformed query strings ( &<something> appended to filenames )
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /([^.]+\.[^?&]+)?&
RewriteRule .* http://www.example.com/%1 [R=301,L]
#
# Fix extra leading slashes in URL (handled by redirect rule)
RewriteCond %{REQUEST_URI} ^(.*)//+(.*)$
RewriteRule . http://www.example.com/%1/%2 [R=301,L]
#
# Redirect direct client requests for "/index.html" to "/"
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /index\.html[^\ ]*\ HTTP/
RewriteRule ^index\.html$ http://www.example.com/ [R=301,L]
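A note on why several of these rules test THE_REQUEST rather than the rewritten path. The sketch below repeats the index.html rule with comments; it assumes a root .htaccess and uses the thread's placeholder host.

```apache
RewriteEngine On

# THE_REQUEST is the raw request line exactly as the client sent it, e.g.
#   GET /index.html HTTP/1.1
# It does not change when the server internally maps "/" to index.html,
# so this condition is true only if the client literally asked for /index.html.
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /index\.html[^\ ]*\ HTTP/
RewriteRule ^index\.html$ http://www.example.com/ [R=301,L]

# Without the condition, the internal DirectoryIndex lookup for "/" can also
# match ^index\.html$, and the redirect would send the client round in a loop.
```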
<Files 403.shtml>
order allow,deny
allow from all
</Files>
deny from #*$!.#*$!.#*$!.#*$!
Redirect permanent /examplepage.html http://example.com/examplepage2.html
Options +FollowSymlinks
Options -MultiViews
RewriteEngine on
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /([^/]*/)*index\.html?
RewriteRule ^(([^/]*/)*)index\.html?$ http://example.com/$1 [R=301,L]
RewriteCond %{HTTP_HOST} ^www\.example\.com$ [NC]
RewriteRule ^(.*)$ http://example.com/$1 [R=301,L]
RewriteCond %{HTTP_HOST} .
RewriteCond %{HTTP_HOST} !^example\.com$
RewriteRule (.*) http://example.com/$1 [R=301,L]
RewriteCond %{REQUEST_URI} ^/(.*)//+(.*)
RewriteRule .* http://example.com/%1/%2 [R=301,L]
RewriteCond %{HTTP_USER_AGENT} ^(Python[-.]?urllib|Java/?[1-9]\.[0-9]) [NC]
RewriteCond %{REMOTE_ADDR} !^207\.126\.2(2[4-9]|3[0-9])\.
RewriteCond %{REMOTE_ADDR} !^216\.239\.(3[2-9]|[45][0-9]|6[0-3])\.
RewriteRule .* - [F,L]
RewriteRule ^([^.]+\.[^/]+)/ http://example.com/$1 [R=301,L]
# Redirect client requests for "/index.html" to "/"
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /index\.html[^\ ]*\ HTTP/
RewriteRule ^index\.html$ http://www.example.com/ [R=301,L]
Here's my statement inside httpd.conf:
DirectoryIndex index.jsp home.html
<Directory /opt/local/apache2/htdocs>
<IfModule mod_rewrite.c>
RewriteEngine On
RewriteBase /
# Redirect client requests for "/index.jsp" to "/"
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /index.jsp[^\ ]*\ HTTP/
RewriteRule ^index\.jsp$ http://www.example.com/ [R=301,L]
</IfModule>
</Directory>
The rules seem to apply only to direct requests for
http://www.example.com, not to http://www.example.com/index.jsp
What's wrong here? Any suggestions appreciated!
Here's my rewrite.log:
xx.x.x.x - - [13/Oct/2006:17:34:25 --0400] [www.example.com/sid#1089f8][rid#17f218/initial] (3) [per-dir /opt/local/apache2/htdocs/] strip per-dir prefix: /opt/local/apache2/htdocs/ ->
xx.x.x.x - - [13/Oct/2006:17:34:25 --0400] [www.example.com/sid#1089f8][rid#17f218/initial] (3) [per-dir /opt/local/apache2/htdocs/] applying pattern '^index\.jsp$' to uri ''
xx.x.x.x - - [13/Oct/2006:17:34:25 --0400] [www.example.com/sid#1089f8][rid#17f218/initial] (1) [per-dir /apache2/htdocs/] pass through /opt/local/apache2/htdocs/
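Use the RewriteCond exactly as I showed it, changing only the page name, and add a leading slash to the "index\.jsp" in the RewriteRule pattern. You need "^/index\.jsp$" when the code runs at server or virtual-host level in httpd.conf or conf.d; only per-directory rules (.htaccess or a <Directory> block) see the path without the leading slash.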
Note also that if you want to match a literal period/dot, then that character must be escaped as shown in all patterns, including the RewriteCond.
Based on your RewriteLog, you may have another module running before mod_rewrite that is already trying to make this URL change. Therefore, the RewriteRule may not be seeing "/index.jsp" as the REQUEST_URI, but only "/".
Make sure that you don't have any Redirect, RedirectMatch, or Alias directives that are trying to do this same URL rewrite/redirect.
Jim
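To make the leading-slash point concrete, here is a sketch of the same redirect in the two contexts. The directory path is the one from the httpd.conf excerpt above; use one form or the other, not both.

```apache
# Server or <VirtualHost> context: the pattern is tested against the full
# URL-path, so it starts with a slash.
RewriteEngine On
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /index\.jsp[^\ ]*\ HTTP/
RewriteRule ^/index\.jsp$ http://www.example.com/ [R=301,L]

# Per-directory context (<Directory> or .htaccess): the directory prefix is
# stripped before matching, so the pattern has no leading slash.
<Directory /opt/local/apache2/htdocs>
    RewriteEngine On
    RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /index\.jsp[^\ ]*\ HTTP/
    RewriteRule ^index\.jsp$ http://www.example.com/ [R=301,L]
</Directory>
```

The "strip per-dir prefix" lines in the RewriteLog are the sign that the rule is being evaluated in the second, per-directory form.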
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /index\.jsp[^\ ]*\ HTTP/
RewriteRule ^/index\.jsp$ http://www.example.com/ [R=301,L]
.jsp extensions here are handled by Tomcat, and all the config for that is in workers2.properties.
To rule out any Redirect, RedirectMatch, or Alias, I renamed the file to something uncommon, and it still didn't work. Any hints appreciated.
However, the fact that non-jsp files do get rewritten indicates that you're on the right track... Dig into that module execution order issue and see if you can't get the rewrites processed first.
Jim
One approach you might be able to use is to alias the index.jsp URL to a different URL, so that the Tomcat proxy/alias won't grab it before mod_rewrite sees it. Then use REQUEST_FILENAME in mod_rewrite to detect it.
A last-ditch solution would be to change the on-page URL of the index.jsp URL -- not a very good option, but it would avoid the whole problem.
Jim
I'm applying Alias and rewrite inside <VirtualHost> which is supposed to override global rules where jk mapping happens
[httpd.apache.org...]
I did not understand the second suggested option, though! Thank you!
I could not find an option to grab a .jsp request before it is passed to Tomcat, so that I could apply mod_rewrite rules to it.
Jim I did try your suggestion:
Alias /index.jsp "/opt/local/apache2/htdocs/index.html"
and then rewrite
RewriteCond %{REQUEST_FILENAME} ^/opt/local/apache2/htdocs/index\.html?
RewriteRule ^(([^/]*/)*)index\.html?$ http://www.example.com/$1 [R=301,L]
It worked with any files other than .jsp; this alias did not take precedence over the jk aliasing/mapping.
To my own surprise, Tomcat itself does not have config rules to address a 301 redirect.
Finally I solved the issue in a less elegant way, by inserting a 301 redirect at the JSP level.
<c:if test='${pageContext.request.requestURI == "/tech/index.jsp"}'>
<%-- again jstl does not handle 301 redirect --%>
<%-- <c:redirect url="/tech/" /> will do 302 --%>
<%-- i had to explicitly set it this way --%>
<%
response.setStatus(301);
response.setHeader( "Location", "/tech/" );
response.setHeader( "Connection", "close" );
%>
</c:if>