More Google issues

Looking for modified URLs that never existed!

         

AndyA

3:05 pm on Oct 10, 2006 (gmt 0)

10+ Year Member



Once again, Googlebot is looking for URLs on my site that never existed, and my server is returning 200 OK codes. I am really getting sick of this.

Here's what Googlebot asked for:
http://example.com/examplepage.shtml/

There should not be a slash at the end, but the server loaded the page with a 200 OK. The page looked a bit messed up when viewed, but I know Google is going to think I've got duplicates, one with and one without the "/".

I've already done the www to non-www redirect, and I've also added code to remove extra slashes from directories, extra "." from URLs, etc. Now apparently I need another rule. A search on the web has turned up nothing. Is there one setting that makes the server just say NO if the URL isn't exact?

AndyA

3:13 pm on Oct 10, 2006 (gmt 0)

10+ Year Member



At it again. This time it's recursive include errors. It's going to URLs that look like this:

http://example.com/examplepage1.shtml/examplepage2.shtml

The server is returning a 200 OK, and the page is messed up when viewed.

How can I prevent the server from allowing this? I'd rather have it issue a Not Found instead, because the URL Googlebot is asking for doesn't exist! Why is the server loading the page with a 200 OK?
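
For reference, on Apache 2.0.30 and later the extra path after a real file name is treated as PATH_INFO, and the AcceptPathInfo directive controls whether such requests are accepted. A minimal sketch for .htaccess, assuming that server version and that AllowOverride FileInfo is in effect (neither is confirmed in this thread):

```apache
# Assumption: Apache 2.0.30 or later, with AllowOverride FileInfo enabled.
# Reject any request that carries extra path information after a real
# file name, e.g. /examplepage.shtml/ or
# /examplepage1.shtml/examplepage2.shtml, so that it gets a
# 404 Not Found instead of a 200 OK.
AcceptPathInfo Off
```

Scripts that rely on PATH_INFO would stop working under this setting, so it may be safer to wrap it in a `<FilesMatch "\.s?html$">` section so that it applies only to static and server-parsed pages.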

jdMorgan

3:13 pm on Oct 10, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Do you use content-negotiation? If not, try adding

Options -MultiViews

to your top-level .htaccess.

Jim

AndyA

3:51 pm on Oct 10, 2006 (gmt 0)

10+ Year Member



Jim,

Added it, and it didn't help. It is loading the first page requested in the URL twice, almost as if one were on top of and slightly below and to the right of the other. Very, very strange.

There is an I-Frame on the page, and the page in the I-Frame has an include as well. Both of the I-Frames seem to be loading properly.

I have no idea how these strange things can happen. I don't understand how the server can load something that doesn't exist. Any other ideas?

I will never get my site straightened out as long as Googlebot continues to add to the problems. I know there are no links on my site like this.

AndyA

3:56 pm on Oct 10, 2006 (gmt 0)

10+ Year Member



Jim,

Is there a way to use mod_rewrite to catch any requests for
http://example.com/*.html/* and return a 400 or 404?

I have both .html and .shtml pages, but would never have anything after the .html, so there should be no requests for anything like that. And if there are, I would want something other than a 200 OK returned.
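
One possible answer, as a sketch only: mod_rewrite can refuse these requests outright instead of redirecting them. The example assumes a per-directory .htaccess and uses the [G] flag, which returns 410 Gone rather than the 400 or 404 asked about. The pattern is an illustration, not a rule taken from this thread.

```apache
RewriteEngine On
# Per-directory (.htaccess) context: the path is matched without its
# leading slash.
# Match any .html or .shtml file name that is followed by a slash,
# e.g. examplepage.shtml/ or examplepage1.shtml/examplepage2.shtml
# "-" means "no substitution"; [G] makes the server answer 410 Gone.
RewriteRule ^[^.]+\.s?html/ - [G]
```

Because `[^.]+` cannot cross a dot, this pattern does not catch files that sit below a directory whose name contains a dot. If a 404 is specifically wanted, newer Apache releases are documented to accept non-redirect status codes in the R flag (for example [R=404]); whether that works on the server discussed here is not known.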

spinnercee

4:29 pm on Oct 10, 2006 (gmt 0)

10+ Year Member



Search engines usually find (and try to crawl) goofy URLs like that when the HREF attributes on your site aren't properly formed. For example, if you don't preface relative links (those that don't start with http) with the path (example.shtml vs. /example.shtml), the crawler may not always know exactly what you mean.

Redirects may also mess things up. An HTTP redirect from www.example.com to www.newsite.com/page1 may result in hits on non-existent pages like www.newsite.com/page1/robots.txt. This is especially true if you create (simplified?) links on other sites that depend on the redirect, like www.example.com/another.html. In that case, relative links on www.newsite.com/page1/another.html may not get mapped correctly, so that a relative link on another.html like "example2.html" may be seen as www.example.com/another.html/example2.html. I've seen this in cases where Apache httpd is also doing some aliasing.

Even if you use a full http preface in a URL, pay close attention to trailing slashes in HREF tags with a naked domain name -- "www.example.com" may not look the same as "www.example.com/" when redirects, rewrites and aliasing are in play. The implicit "/", when/if added, can break a redirected URL.

Your HTTPd returning 200 on obviously bad URLs is an indication that you are getting an exact match on a rule that is causing the rest of the URL to be either ignored or discarded.

jdMorgan

4:58 pm on Oct 10, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



All of the weirdness in the page appearance is likely due to broken relative links, because the requested URL, which the browser uses as the base URL for relative link resolution, is incorrect. Don't worry about that, as it will likely cure itself when the root problem is fixed.

If you don't have content-negotiation enabled, then there is no way Apache will show a page when a directory is requested, unless you have some code in your config files or .htaccess that is making it happen. So as spinnercee says, you need to look through your configuration code carefully, and find out where the problem is actually caused. You could add more code as a band-aid fix, true, but it would be better to bullet-proof your existing code instead.

Jim

AndyA

5:45 pm on Oct 10, 2006 (gmt 0)

10+ Year Member



I found the fix in another thread. I just didn't know what to search for.

RewriteRule ^([^.]+\.[^/]+)/ http://example.com/$1 [R=301,L]

That issues a 301 redirect to the first page named in the URL, dropping the second page. This will let Googlebot know it's asking for stuff that doesn't exist.

Thanks everybody - one less thing to cause problems.
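
How the rule above works, piece by piece, as a sketch for a per-directory .htaccess (where the path is matched without its leading slash). The example requests in the comments are illustrative:

```apache
RewriteEngine On
# ^([^.]+\.[^/]+)/ is applied to the URL-path without its leading slash:
#   [^.]+   everything up to the first "."       e.g. "examplepage1"
#   \.      the literal dot
#   [^/]+   the extension, up to the next "/"    e.g. "shtml"
#   /       a slash after the file name, which a normal file URL never has
# $1 holds the file part, so both of these requests are redirected
# to http://example.com/examplepage1.shtml:
#   /examplepage1.shtml/
#   /examplepage1.shtml/examplepage2.shtml
RewriteRule ^([^.]+\.[^/]+)/ http://example.com/$1 [R=301,L]
```

One caveat worth checking before relying on it: a directory whose name contains a dot (a hypothetical /v1.2/page.html, say) also matches the pattern and would be redirected to /v1.2.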

AndyA

6:19 pm on Oct 10, 2006 (gmt 0)

10+ Year Member



I have several rewrites in my htaccess. I suppose one of them could be causing the problem.

I have one that deals with the /index.html issue, redirecting to /directory/ instead.

Another gets rid of the www and redirects to non-www. (I read the index issue needs to be done before the www redirect, and I put it first.)

Another deals with double slashes in the URL, http://example.com/forum//page1.php

Something in there also deals with a "." at the end of the domain name, as in http://example.com./page9.html

I think that's it. I have a few denys. I do have RewriteEngine On and FollowSymlinks, and I added -MultiViews as well.

I suppose it's possible one of the rewrites is causing the problem, but it appears to be OK for now. I'm watching Google closely, though, to see what else it tries to pull. How many more ways can it screw up a URL? Enough already!

jdMorgan

6:34 pm on Oct 10, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



> How many more ways can it screw up a URL?

I've seen five common ways so far:


# Fix additional directory paths appended to filenames (e.g. /logo.jpg/<directory_path>)
RewriteRule ^([^.]+\.[^/]+)/ http://www.example.com/$1 [R=301,L]
#
# Remove trailing punctuation for printed-text links (Remove any trailing
# hex-encoded characters or any characters not one of a-z, A-Z, or 0-9)
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /(.+)(\%(25)*[0-9a-f]{2}|[^a-z0-9])\ HTTP/[0-9]+\.[0-9]$ [NC]
RewriteRule . http://www.example.com/%1 [R=301,L]
#
# Fix malformed query strings ( &<something> appended to filenames )
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /([^.]+\.[^?&]+)?&
RewriteRule .* http://www.example.com/%1 [R=301,L]
#
# Fix extra leading slashes in URL (handled by redirect rule)
RewriteCond %{REQUEST_URI} ^(.*)//+(.*)$
RewriteRule . http://www.example.com/%1/%2 [R=301,L]
#
# Redirect direct client requests for "/index.html" to "/"
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /index\.html[^\ ]*\ HTTP/
RewriteRule ^index\.html$ http://www.example.com/ [R=301,L]

Jim

AndyA

6:37 pm on Oct 10, 2006 (gmt 0)

10+ Year Member



Are there any problems with this? Here is my .htaccess file (specifics removed):

<Files 403.shtml>
order allow,deny
allow from all
</Files>

deny from #*$!.#*$!.#*$!.#*$!

Redirect permanent /examplepage.html http://example.com/examplepage2.html

Options +FollowSymlinks
Options -MultiViews
RewriteEngine on
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /([^/]*/)*index\.html?
RewriteRule ^(([^/]*/)*)index\.html?$ http://example.com/$1 [R=301,L]
RewriteCond %{HTTP_HOST} ^www\.example\.com$ [NC]
RewriteRule ^(.*)$ http://example.com/$1 [R=301,L]
RewriteCond %{HTTP_HOST} .
RewriteCond %{HTTP_HOST} !^example\.com$
RewriteRule (.*) http://example.com/$1 [R=301,L]
RewriteCond %{REQUEST_URI} ^/(.*)//+(.*)
RewriteRule .* http://example.com/%1/%2 [R=301,L]
RewriteCond %{HTTP_USER_AGENT} ^(Python[-.]?urllib|Java/?[1-9]\.[0-9]) [NC]
RewriteCond %{REMOTE_ADDR} !^207\.126\.2(2[4-9]|3[0-9])\.
RewriteCond %{REMOTE_ADDR} !^216\.239\.(3[2-9]|[45][0-9]|6[0-3])\.
RewriteRule .* - [F,L]
RewriteRule ^([^.]+\.[^/]+)/ http://example.com/$1 [R=301,L]

cvas

9:55 pm on Oct 13, 2006 (gmt 0)

10+ Year Member



Jim, I'm trying to implement the last set of rules from your examples
with no luck so far.

# Redirect client requests for "/index.html" to "/"
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /index\.html[^\ ]*\ HTTP/
RewriteRule ^index\.html$ http://www.example.com/ [R=301,L]

here's my statement inside httpd.conf

DirectoryIndex index.jsp home.html
<Directory /opt/local/apache2/htdocs>

<IfModule mod_rewrite.c>
RewriteEngine On
RewriteBase /
# Redirect client requests for "/index.jsp" to "/" HTTP/
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /index.jsp[^\ ]*\
RewriteRule ^index\.jsp$ http://www.example.com/ [R=301,L]
</IfModule>

</Directory>

The rule seems to be applied only on a direct request to
http://www.example.com, not to http://www.example.com/index.jsp.
What's wrong here? Any suggestions appreciated!

here's my rewrite.log
xx.x.x.x - - [13/Oct/2006:17:34:25 --0400] [www.example.com/sid#1089f8][rid#17f218/initial] (3) [per-dir /opt/local/apache2/htdocs/] strip per-dir prefix: /opt/local/apache2/htdocs/ ->
xx.x.x.x - - [13/Oct/2006:17:34:25 --0400] [www.example.com/sid#1089f8][rid#17f218/initial] (3) [per-dir /opt/local/apache2/htdocs/] applying pattern '^index\.jsp$' to uri ''
xx.x.x.x - - [13/Oct/2006:17:34:25 --0400] [www.example.com/sid#1089f8][rid#17f218/initial] (1) [per-dir /apache2/htdocs/] pass through /opt/local/apache2/htdocs/

jdMorgan

10:26 pm on Oct 13, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



You are missing the last part of the RewriteCond pattern, and the RewriteRule pattern in your code has not been modified for httpd.conf use.

Use the RewriteCond exactly as I showed it, changing only the pagename, and add a leading slash to the "index\.jsp" in the RewriteRule pattern - you need "^/index\.jsp$" if you're using the code in httpd.conf or conf.d, etc.

Note also that if you want to match a literal period/dot, then that character must be escaped as shown in all patterns, including the RewriteCond.

Based on your RewriteLog, you may have another module running before mod_rewrite that is already trying to make this URL change. Therefore, the RewriteRule may not be seeing "/index.jsp" as the REQUEST_URI, but only "/"

Make sure that you don't have any Redirect, RedirectMatch, or Alias directives that are trying to do this same URL rewrite/redirect.

Jim
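
As a sketch of the two cases: mod_rewrite matches a RewriteRule pattern against the full URL-path in server or virtual-host context, but strips the directory prefix first in per-directory context. Rules placed inside a <Directory> section, like the configuration quoted earlier in this thread, are matched per-directory, the same as .htaccess, which is consistent with the "strip per-dir prefix" lines in the rewrite log. The host name and directory path are the ones used in the thread; only one of the two forms would be used.

```apache
# Form 1: server or <VirtualHost> context (httpd.conf, outside any
# <Directory> section). The pattern sees the full URL-path, so it
# starts with a slash.
RewriteEngine On
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /index\.jsp[^\ ]*\ HTTP/
RewriteRule ^/index\.jsp$ http://www.example.com/ [R=301,L]

# Form 2: per-directory context (.htaccess, or inside <Directory>).
# The directory prefix is stripped before matching, so the pattern
# has no leading slash.
<Directory /opt/local/apache2/htdocs>
    RewriteEngine On
    RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /index\.jsp[^\ ]*\ HTTP/
    RewriteRule ^index\.jsp$ http://www.example.com/ [R=301,L]
</Directory>
```

In both forms the RewriteCond tests the original request line, which always contains the leading slash, so only the RewriteRule pattern changes between contexts.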

netchicken1

10:42 pm on Oct 13, 2006 (gmt 0)

10+ Year Member



Slight question.

Can't you just put

disallow: //example/

in your robots.txt and leave it at that? Then even if the bots use the wrong URL, they won't index it.

cvas

5:11 am on Oct 14, 2006 (gmt 0)

10+ Year Member



Jim, no luck with suggested changes

RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /index\.jsp[^\ ]*\ HTTP/
RewriteRule ^/index\.jsp$ http://www.example.com/ [R=301,L]

.jsp extensions here are handled by Tomcat, and all the config for that is in workers2.properties.
To rule out any possible Redirect, RedirectMatch, or Alias, I changed the file name to an uncommon one, and it still didn't work. Any hints here appreciated.

jdMorgan

2:01 am on Oct 18, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Are you aliasing or proxying .jsp requests over to the Tomcat server? That's a very common approach, but will bypass any rewrites on the front-end server.

Jim

cvas

3:46 am on Oct 18, 2006 (gmt 0)

10+ Year Member



All .jsp and /webapps/* requests are tunneled to Tomcat through mod_jk2.
mod_jk2 is normally added last in the LoadModule list. I moved it up and put mod_rewrite last on the list, so that rewrite statements are processed before the mod_jk ones, and it still does not work. If I change the rewrite rule to a path/file that belongs to Apache, the rewrite works as intended; if I apply the rewrite rule to a path mapped to Tomcat, there is no rewrite. Am I missing something here?

jdMorgan

5:07 am on Oct 18, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



If you're on Apache 2, the LoadModule order does not affect the module execution order as it did in Apache 1.x -- Apache 2 uses an internal priority scheme to determine module execution order. Alas, I don't know exactly how this works in Apache 2, so I can't advise.

However, the fact that non-jsp files do get rewritten indicates that you're on the right track... Dig into that module execution order issue and see if you can't get the rewrites processed first.

Jim

cvas

3:02 pm on Oct 18, 2006 (gmt 0)

10+ Year Member



Thanks Jim,
I'm running Apache 2 with DSO support and Tomcat 5.
I did try to find some info on changing the module execution order in Apache 2, but there's not much out there.
Am I running out of options here?

jdMorgan

5:25 pm on Oct 18, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I don't know, as I don't use DSO or Tomcat myself.

One approach you might be able to use is to alias the index.jsp URL to a different URL, so that the Tomcat proxy/alias won't grab it before mod_rewrite sees it. Then use REQUEST_FILENAME in mod_rewrite to detect it.

A last-ditch solution would be to change the on-page URL of the index.jsp URL -- Not a very good option, but it would avoid the whole problem.

Jim

cvas

8:59 pm on Oct 18, 2006 (gmt 0)

10+ Year Member



I did try the first suggestion:
Alias /index.jsp "/opt/local/apache2/htdocs/index.html"
and then applied the rewrite rules, but it won't rewrite. It looks like Redirect rules are applied before any Aliases:
[httpd.apache.org...]

I'm applying the Alias and the rewrite inside <VirtualHost>, which is supposed to override the global rules where the jk mapping happens:
[httpd.apache.org...]

I did not understand the second suggested option, though. Thank you!

cvas

7:19 pm on Oct 20, 2006 (gmt 0)

10+ Year Member



Well, I gave up on 301 redirects with the Apache/Tomcat mod_rewrite combination.

I could not find a way to grab a .jsp request before it is passed to Tomcat, so that I could apply mod_rewrite rules to it.

Jim, I did try your suggestion:
Alias /index.jsp "/opt/local/apache2/htdocs/index.html"
and then the rewrite:
RewriteCond %{REQUEST_FILENAME} ^/opt/local/apache2/htdocs/index\.html?
RewriteRule ^(([^/]*/)*)index\.html?$ http://www.example.com/$1 [R=301,L]
It worked with any file other than .jsp; this alias did not take precedence over the jk aliasing/mapping.

To my own surprise, Tomcat itself does not have config rules for 301 redirects.

Finally, I solved the issue in a less elegant way, by inserting a 301 redirect at the JSP level.

<c:if test='${pageContext.request.requestURI == "/tech/index.jsp"}'>
<%-- again jstl does not handle 301 redirect --%>
<%-- <c:redirect url="/tech/" /> will do 302 --%>
<%-- i had to explicitly set it this way --%>
<%
response.setStatus(301);
response.setHeader( "Location", "/tech/" );
response.setHeader( "Connection", "close" );
%>
</c:if>