Forum Moderators: phranque

Apache 301-Redirect Mystery

No, it is not simple!


wildbest

8:51 am on Jul 31, 2010 (gmt 0)

10+ Year Member



It's a fairly simple 5-page website. A week or so ago, Google switched the details for my Home page to the DMOZ description and excluded one page from the index. Yahoo did the same. Bing hasn't yet, but may follow soon. I've rechecked everything I can think of, but alas, not a single clue. Today, however, I noticed something very strange in my server logs.

Request GET www.example.com/page1.html status 301
Request GET www.example.com/page1.html status 200

It is my IP. Obviously, it was generated by me clicking the link www.example.com/page1.html on my Home page. No, the link is exactly as written, with no query strings. I couldn't duplicate this using various server header check tools.

.htaccess is

RewriteEngine On

### Require 'www' for both http and https pages, except for 'spec' folder
RewriteCond %{HTTP_HOST} !^(www\.example\.com)?$
RewriteCond %{SERVER_PORT}>s ^(443>(s)|[0-9]+>s)$
RewriteCond $1 !^spec/ [NC]
RewriteRule ^(.*)$ http%2://www.example.com/$1 [R=301,L]

### Remove query strings on all requested URLs
RewriteCond %{THE_REQUEST} [?]
RewriteRule ^(.*)$ %{REQUEST_URI}? [R=301,L]

Is it possible some 'invisible' query string is attached by a script on the server to the original link in the HTML source? This would be the easiest explanation, but how could that be achieved? Have you seen something like this before? Any idea where I should look?

jdMorgan

12:01 pm on Jul 31, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Try this first. Change your second rule to:

### Remove query strings on all requested URLs
RewriteCond %{THE_REQUEST} [?]
RewriteRule ^(.*)$ http://www.example.com/$1? [R=301,L]

This ensures that the rule uses the specified hostname instead of having to look up the configured ServerName, which might indeed be "example.com" with no "www." If that were the case, then whenever this rule was invoked, the first rule would also be invoked, resulting in two "chained" redirects in a row.
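To illustrate the chaining (a hypothetical header trace, assuming ServerName were configured as plain "example.com" and the original link carried a query string; "foo=bar" is only a placeholder):

```
GET http://www.example.com/page1.html?foo=bar
 -> 301, Location: http://example.com/page1.html       (query-strip rule; URL built from ServerName)
GET http://example.com/page1.html
 -> 301, Location: http://www.example.com/page1.html   (add-www rule)
GET http://www.example.com/page1.html
 -> 200 OK
```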

You should also reverse the order of these two rules, putting the domain redirection rule second.

If you actually use HTTPS, then the 'remove query string' rule should also be modified to make it protocol-aware, using the same technique as the domain redirect rule.
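A sketch of that protocol-aware variant, reusing the port-to-protocol capture from the domain rule (an untested sketch: it assumes HTTPS runs on port 443, and the SERVER_PORT condition is placed last so that %2 refers to its capture):

```apache
### Remove query strings on all requested URLs, preserving http/https
RewriteCond %{THE_REQUEST} [?]
RewriteCond %{SERVER_PORT}>s ^(443>(s)|[0-9]+>s)$
RewriteRule ^(.*)$ http%2://www.example.com/$1? [R=301,L]
```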

If you have not already done so, test your link using the "Live HTTP Headers" add-on for Firefox/Mozilla-based browsers, and look at the hostname, requested URL-path, and other details of the request(s) that get redirected. That will likely lead to a faster solution than trying to guess based on the sparse info in the standard server log files.

Jim

wildbest

1:43 pm on Jul 31, 2010 (gmt 0)

10+ Year Member



Thank you, Jim. Although I still don't understand how this might happen, I shall follow your recommendation and let you know the result. Hopefully G and Y will react positively to such a change soon.

Regarding the 'Live HTTP Headers' add-on: if I try to install it, I get a warning - 'Author not verified - Malicious software can damage your computer...'. Can you stand surety for the authors? :)

PS - Upon checking, _SERVER["SERVER_NAME"] shows with www.

jdMorgan

2:24 pm on Jul 31, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I think you'll find that all add-ons throw that warning if there is no SSL security certificate for the site. IIRC, that add-on is from SourceForge, a reputable open-source development site, but they do not use HTTPS, so there is no certificate.

I use this add-on daily. It --or something like it-- is basic Webmaster kit. It will show you the details of all HTTP transactions between your browser and your server, and likely answer your questions about this 'mystery redirect' almost instantly.

As standing surety is a legal thing, the best I can do is to vouch for the add-on itself as a useful program. You should rely on an up-to-date and reputable anti-virus/anti-malware program for 'surety.'

Jim

wildbest

4:04 pm on Jul 31, 2010 (gmt 0)

10+ Year Member



From what I see in Live HTTP Headers, the Google Analytics code is adding some tracking query strings that are not recorded in the server logs but which might well trigger the 301 redirect. I'll start testing by disabling it and will keep you posted.

wildbest

8:23 am on Aug 6, 2010 (gmt 0)

10+ Year Member



Probably my troubles with the Google index have nothing to do with the Apache 301 redirect. After testing extensively, I have discovered by chance that Googlebot is not reading our "Content-Type" server headers correctly. I used Google WMT > Labs > Fetch as Googlebot.

This is how Googlebot fetched the page:

URL: http://www.example.com/
Date: Thu Aug 05 23:40:00 PDT 2010
Googlebot Type: Web

HTTP/1.1 200 OK
Date: Fri, 06 Aug 2010 06:40:01 GMT
Server: Apache
Content-Encoding: gzip
Vary: Accept-Encoding
Keep-Alive: timeout=15, max=100
Connection: Keep-Alive
Transfer-Encoding: chunked
Content-Type:

Please note: Googlebot fetched only the server headers, without the HTML page source!

This is how the headers look in the Firefox "Live HTTP Headers" add-on:

URL: http://www.example.com/

HTTP/1.1 200 OK
Date: Fri, 06 Aug 2010 07:37:56 GMT
Server: Apache
Content-Encoding: gzip
Vary: Accept-Encoding
Keep-Alive: timeout=15, max=100
Connection: Keep-Alive
Transfer-Encoding: chunked
Content-Type: text/html; charset=utf-8

Any idea why Googlebot doesn't read the standard Apache "Content-Type" header correctly?

g1smd

9:44 am on Aug 6, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



It's probably not a case of Googlebot not reading them, but of the server not sending them for a HEAD request.

Is that data really in the HTTP header, or are you relying on specifying it using meta data (which isn't sent for a HEAD request)?
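For reference: if the charset is only declared in a <meta http-equiv> tag, it can be promoted into the real HTTP Content-Type header, which is sent for HEAD requests too. A sketch, assuming UTF-8 pages (directive availability varies by Apache version and loaded modules):

```apache
### Send the charset in the Content-Type HTTP header itself
AddDefaultCharset UTF-8
### Or tie it to the file extension via mod_mime
AddType "text/html; charset=UTF-8" .html .htm
```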

wildbest

11:16 am on Aug 6, 2010 (gmt 0)

10+ Year Member



Thank you all.
Problem identified.
Server hacked.
Google and Yahoo bots blocked.
Cleaning...

wildbest

12:28 pm on Aug 8, 2010 (gmt 0)

10+ Year Member



Have to revive this thread. If I have this .htaccess:

RewriteEngine On

### Remove query strings on all requested URLs
RewriteCond %{THE_REQUEST} [?]
RewriteRule ^(.*)$ http://www.example.com/$1? [R=301,L]

### Require 'www' for both http and https pages, except for 'spec' folder
RewriteCond %{HTTP_HOST} !^(www\.example\.com)?$
RewriteCond %{SERVER_PORT}>s ^(443>(s)|[0-9]+>s)$
RewriteCond $1 !^spec/ [NC]
RewriteRule ^(.*)$ http%2://www.example.com/$1 [R=301,L]

Requesting http://www.example.com/this.html/whatever gets a 200 OK response from the server instead of a 404 Not Found.

jdMorgan

11:18 pm on Aug 8, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Try adding two more directives at the top, ahead of the "RewriteEngine" directive:

Options +FollowSymLinks -Indexes -MultiViews
AcceptPathInfo Off

to turn off AcceptPathInfo and content-negotiation, and see if that helps.

Note that AcceptPathInfo can only be used on Apache 2.0 and later servers.

Jim

wildbest

3:32 pm on Aug 9, 2010 (gmt 0)

10+ Year Member



Hi Jim, and thank you. That generates an internal server error, because the server is Apache/1.3.33.

It is part of the same problem:
[webmasterworld.com...]

I've changed rewrite rule to:

RewriteRule ^/?(.*)$ http://www.example.com/$1? [R=301,L]

and the issue with "http://www.example.com//?q=12345" is resolved. However, "http://www.example.com/this.html/?q=12345" still remains! Actually, with the "Remove query strings" rule now in place, it is redirected to "http://www.example.com/this.html/".

jdMorgan

5:19 pm on Aug 9, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Replace your one rule with these two, in this order:

### Remove trailing slash(es) and query strings on all requested html and htm "page" URLs
RewriteCond %{THE_REQUEST} [?]
RewriteRule ^/?(.+\.html?)/+$ http://www.example.com/$1? [R=301,L]
#
### Remove query strings on all remaining requested URLs
RewriteCond %{THE_REQUEST} [?]
RewriteRule ^/?(.*)$ http://www.example.com/$1? [R=301,L]

Jim

wildbest

6:04 pm on Aug 9, 2010 (gmt 0)

10+ Year Member



Hmmm... that fixes all trailing-slash issues if there is a "?" in the request, but it doesn't resolve "http://www.example.com/this.html/" or "http://www.example.com/this.html/whatever_no_exist" serving the content of "http://www.example.com/this.html" with 200 OK.
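For what it's worth, a hedged, untested sketch of a rule aimed at exactly that remaining case, matching the bogus path-info in %{THE_REQUEST} rather than in the rule pattern (where per-directory rewriting may not see it):

```apache
### Redirect /somepage.html/anything (including a bare trailing slash)
### back to /somepage.html; the trailing "?" strips any query string as well
RewriteCond %{THE_REQUEST} ^[A-Z]+\ /([^\ ?]+\.html?)/[^\ ]*
RewriteRule . http://www.example.com/%1? [R=301,L]
```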

I've noticed in our server logs that the Google site-verification bot is testing exactly like this - "http://www.example.com/this.html/whatever_no_exist". And no surprise, this page is now excluded from their index.

jdMorgan

3:19 am on Aug 10, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Well, you need to make a list of all "wrong-URL variables" and what you want to do with the URLs if they contain those "errors." Then modify the code above to do what you want.

First write the specifications, then code the solution. Otherwise, you're wasting time and effort, yours as well as others'.

Jim

wildbest

2:00 pm on Aug 10, 2010 (gmt 0)

10+ Year Member



Otherwise, you're wasting time and effort, yours as well as others.

This is not a fair reproach, jdMorgan. :)
I can't be more specific than that.

Do you think this will work:

RewriteCond %{THE_REQUEST} \?|\.html?!\s [NC]
RewriteRule ^/?([^\.]+\.html?).*$ http://www.example.com/$1? [R=301,L]

BTW, looking at our error log files, 5-6 seconds after my first test of "AcceptPathInfo Off", the same thing was tested for the first time from 2 other IPs as well. Something like this:

[Mon Aug ...........] [alert] [client XX.XX.XX.XX] /var/chroot/home/............/.htaccess: Invalid command 'AcceptPathInfo', perhaps mis-spelled or defined by a module not included in the server configuration

Those client IPs have nothing in common with our data center. Any idea how that can happen?

jdMorgan

4:15 pm on Aug 10, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Well, remove the AcceptPathInfo line, then...

...defined by a module not included in the server configuration.

If AcceptPathInfo isn't supported on your server, then there is no need to try to turn it off.

If you review the threads above, you will see a pattern. You post a requirement, either explicitly or by providing an example URL.

Others here respond with a correct solution to meet your requirement.

Then you add further requirements by saying, "It works with this kind of URL, but not with this (different) kind of URL." That is to be expected, as the correct coded solution depends entirely on the requirements.

As I said,
...make a list of all "wrong-URL variables" and what you want to do with the URLs if they contain those "errors." Then modify the code above to do what you want.


This will prevent everyone from losing interest in this thread, because it is already too long and complicated for most people to bother reading...

Jim

wildbest

4:31 pm on Aug 10, 2010 (gmt 0)

10+ Year Member



Others here respond with a correct solution to meet your requirement.

Then you add further requirements by saying...

Dear Jim, have to respectfully disagree.

The last sentence on my post on 12:28 pm on Aug 8, 2010 is "Requesting http://www.example.com/this.html/whatever is responded by server with 200 OK instead of 404 Not Found."

Then you suggest a solution on your post on 5:19 pm on Aug 9, 2010.

On 6:04 pm on Aug 9, I just reminded you that the suggested solution "is fixing all trailing slash(es) issues if there is "?" in the request, but doesn't resolve "http://www.example.com/this.html/" or "http://www.example.com/this.html/whatever_no_exist"

I know it can be irritating, but I am at a loss to account for what you say.

jdMorgan

5:30 pm on Aug 10, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



The point here is that if you need to redirect or rewrite URLs with or without query strings appended to them, then you need to say so.

Again, the correct solution cannot be found if the requirements are not clear and utterly complete.

So I am asking you to carefully consider the entire URL-space of your site -- all URLs that should be affected, and all that should not, to include protocol, domain, FQDN, port numbers, URL-path, query strings and URL fragments (also known as "named anchors" on HTML pages), and to thoroughly describe your requirements in these terms. Otherwise, this thread may continue on into next year before the complete solution is identified.
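As an editorial illustration of that specification-first approach, suppose the spec were: "all pages live under http://www.example.com/; no query strings anywhere; no path-info or trailing slashes after .html/.htm pages." A sketch implementing exactly that spec, in order (assumed requirements, not necessarily the poster's; untested on Apache 1.3):

```apache
RewriteEngine On

### Spec 1: no path-info or trailing slashes after .html/.htm pages
RewriteCond %{THE_REQUEST} ^[A-Z]+\ /([^\ ?]+\.html?)/[^\ ]*
RewriteRule . http://www.example.com/%1? [R=301,L]

### Spec 2: no query strings anywhere
RewriteCond %{THE_REQUEST} [?]
RewriteRule ^/?(.*)$ http://www.example.com/$1? [R=301,L]

### Spec 3: canonical hostname is www.example.com
RewriteCond %{HTTP_HOST} !^(www\.example\.com)?$
RewriteRule ^(.*)$ http://www.example.com/$1 [R=301,L]
```

Each rule encodes one line of the spec, so a change in requirements maps to a change in exactly one rule.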

Jim