Google and /

Google Webmaster Tools is showing errors

         

scraulb

12:17 am on Mar 31, 2009 (gmt 0)

10+ Year Member



In Google Webmaster Tools we are seeing under:
Content Analysis -> Duplicate Title Tags the following issue:

/directory/page.html/ -- Don't Want
/directory/page.html -- Want

This is the first time we have seen a / show up after .html, and we are concerned that more will show up. We have 301s correcting other potential canonical issues, but we aren't sure how to stop this particular problem.

Are we missing a setting somewhere in Apache?

[edited by: scraulb at 1:00 am (utc) on Mar. 31, 2009]

encyclo

12:45 am on Mar 31, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Have you got MultiViews enabled, and if so, do you really need it? If you don't need MultiViews (content negotiation), then you can disable it in httpd.conf or via a document-root-level .htaccess file:

Options -MultiViews

The first URL given in the example above should return a 404 Not Found response if content negotiation is disabled.

[httpd.apache.org...]
[httpd.apache.org...]
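
For reference, a minimal sketch of the httpd.conf form, assuming the site lives under a hypothetical /var/www/html document root (substitute your own path):

```apache
# Hypothetical document root; adjust to your own layout.
<Directory "/var/www/html">
    # Turn off content negotiation (MultiViews) for this tree.
    Options -MultiViews
</Directory>
```

In a .htaccess file the Options line stands on its own, and the server has to permit it through AllowOverride Options.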

scraulb

12:55 am on Mar 31, 2009 (gmt 0)

10+ Year Member



Just checked my httpd.conf and we don't have MultiViews set anywhere.

scraulb

12:57 am on Mar 31, 2009 (gmt 0)

10+ Year Member



Sorry just read my first post and some weird characters appeared.

It should have been:

/directory/page.html/ -- Don't Want
/directory/page.html -- Want

encyclo

1:09 am on Mar 31, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Does /directory/page.html/ give a 200 OK server response? What response does /directory/page (no extension) give? If MultiViews are disabled, what rewrite rules do you currently have in place?

scraulb

1:27 am on Mar 31, 2009 (gmt 0)

10+ Year Member



Hi encyclo,

/directory/page.html/ returns 200 on GET
/directory/page.html returns 200 on GET
/directory/page (no extension) returns 404

We have disabled all the rewrite rules.

Scraulb

Caterham

10:44 am on Mar 31, 2009 (gmt 0)

10+ Year Member



If page.html is a static file served by the default handler, you must have changed AcceptPathInfo to on, because the default for that handler is to reject path info (or rewrite rules would do it, but as you said, they're disabled).

scraulb

3:51 pm on Mar 31, 2009 (gmt 0)

10+ Year Member



After a lot of messing around I finally stumbled on php.conf. In there, we have a line that says:
AddHandler php5-script .php .html

If I remove the .html and restart, the problem goes away. So somehow is php causing this problem?

AcceptPathInfo is off

Thanks

encyclo

4:00 pm on Mar 31, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



> So somehow is php causing this problem?

Good catch, Scraulb - I hadn't thought of that possibility. If the .html page was being parsed as PHP, that means the trailing slash was, I assume, seen as a variable.

As long as your .html files contain only plain HTML and no PHP, then it's better that they are not handled by the PHP parser - it will only cause extra load for nothing, and also will stop Apache providing proper caching for the static files.
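
As a rough sketch of that change, assuming the php.conf line quoted above is the only place the handler is mapped, the directive would be trimmed to:

```apache
# Hand only .php files to the PHP handler; .html files fall back to
# Apache's default static-file handler.
# "php5-script" is the handler name quoted earlier in this thread.
AddHandler php5-script .php
```

This only applies if the .html files really contain no PHP. Apache needs a restart (or graceful reload) to pick it up.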

scraulb

4:05 pm on Mar 31, 2009 (gmt 0)

10+ Year Member



We use PHP on just about all our pages. Is there a way to stop the trailing slash from being seen as a variable? We do not use variables at the end of any of our URLs.

Caterham

4:15 pm on Mar 31, 2009 (gmt 0)

10+ Year Member



Yes, by default a CGI content handler accepts requests with path_info, hence no 404 but 200 OK.

> AcceptPathInfo is off

Explicitly set? (The default is "default", i.e., it's up to the content handler whether to accept or 404 the request.)

It's also up to the content handler whether to respect your AcceptPathInfo setting or not. If the content handler ignores the setting, you can set AcceptPathInfo to off as many times as you want; it will have no effect.

Although AcceptPathInfo is a directive provided by the core, the core itself never acts on it in ap_process_request_internal, where path-info is generated (during the directory walk). Enforcement is always left to the content handler, which is not part of the ap_process_request_internal logic where most modules run in their registered hooks.

I don't know which handler you use (FastCGI?), but if you set AcceptPathInfo to off, restarted your server, cleared your browser cache and still get a 200 OK, your handler doesn't honor the setting.
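
A minimal sketch of an explicit setting, assuming a hypothetical /var/www/html document root (the directive is also allowed at server or virtual-host level, and in .htaccess where AllowOverride FileInfo permits it):

```apache
# Hypothetical document root; adjust to your own layout.
<Directory "/var/www/html">
    # Ask the content handler to reject requests carrying trailing
    # path info, e.g. /directory/page.html/
    AcceptPathInfo Off
</Directory>
```

After a restart, /directory/page.html/ should return 404 if the handler honors the setting. If it still returns 200, the handler is ignoring it, as described above.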

scraulb

4:51 pm on Mar 31, 2009 (gmt 0)

10+ Year Member



Oops! After going round and round in circles, we discovered you were dead right. We added
AcceptPathInfo off explicitly and now it works fine.

What is interesting is that we were trying it on 3 servers and the 2 latest Apache servers failed but the older server was fine! None of them had AcceptPathInfo explicitly set on or off.

Another lesson learned, thanks for the help!

John