Forum Moderators: open
I can practically hear folks asking "But isn't www.foo.com/path the same as www.foo.com/path/?" In practice, they almost always are the same, but technically, according to the HTTP standards, I don't think they have to be.
I've got a few minutes free, so let's go into detective mode for a bit. Most webservers are configured to append the "/" automatically via a 301 redirect. For example, if you try to fetch www.google.com/webmasters, our web server will do a permanent 301 redirect to the canonical page, which is www.google.com/webmasters/ (note the trailing slash).
Just to illustrate the point, let's use the same imitate-the-browser-using-telnet technique that I posted about in
[webmasterworld.com...]
It's a really good debugging technique. What actually happens when you request a directory without the trailing slash looks like this:
telnet www.google.com 80
Trying 216.239.33.99...
Connected to www.google.com (216.239.33.99).
Escape character is '^]'.
GET /webmasters HTTP/1.0

HTTP/1.0 301 Moved Permanently
Connection: Keep-Alive
Date: Sun, 03 Aug 2003 22:11:43 GMT
...
Location: [google.com...]
Content-Type: text/html
Server: GWS/2.1
Content-length: 163

<HTML><HEAD><TITLE>301 Moved</TITLE></HEAD><BODY>
<H1>301 Moved</H1>
The document has moved
<A HREF="http://www.google.com/webmasters/">here</A>.
</BODY></HTML>
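For anyone who'd rather script this than type it into telnet, here's a rough Python sketch of the same technique: open a raw socket, send a bare HTTP/1.0 request without the trailing slash, and pull the status line and Location header out of the response. The host and path are just the ones from the example above.

```python
import socket

def parse_status_and_location(raw_response):
    """Extract (status line, Location header value) from a raw HTTP response."""
    headers = raw_response.split(b"\r\n\r\n", 1)[0].decode("iso-8859-1")
    lines = headers.split("\r\n")
    location = next((line.split(":", 1)[1].strip()
                     for line in lines
                     if line.lower().startswith("location:")), None)
    return lines[0], location

def fetch(host, path):
    """Send a minimal HTTP/1.0 GET, exactly like typing it into telnet,
    and return the raw response bytes."""
    with socket.create_connection((host, 80)) as sock:
        sock.sendall(f"GET {path} HTTP/1.0\r\nHost: {host}\r\n\r\n".encode("ascii"))
        chunks = []
        while chunk := sock.recv(4096):
            chunks.append(chunk)
    return b"".join(chunks)

# Live call (results depend on the server, of course):
# status, location = parse_status_and_location(fetch("www.google.com", "/webmasters"))
```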
So the server basically said, "Instead of fetching this page, try it again with a trailing slash." That's why it's ever-so-slightly faster if you go to "www.webmasterworld.com/forum3/" instead of "www.webmasterworld.com/forum3"--because your browser doesn't have to get the redirect and do another fetch of the new url.
So to make a long story not quite as long, I noticed that the webserver for this domain returns a 301, but it looks like it doesn't add the trailing slash correctly in either the "Location:" field in the HTTP headers or in the text of the page. So that's the main thing I'd check on your web server.
On the other hand, even if we get duplicate content for two nearly identical urls, we have heuristics that normally detect that sort of thing. That's why the search collapses those two urls together unless you do "&filter=0". So the duplicate content filter was cleaning things up in this case. I think if you switch the webserver to do the 301 to the trailing-slash url, you should be in good shape in the future too.
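If you want to check what a *correct* response looks like on your own server, it's easy to describe: a 301 with the slash appended to the path in the Location header. A tiny sketch (the host name is made up for the example):

```python
def trailing_slash_redirect(host, path):
    """Return (status, Location) for a directory requested without a slash,
    or None if the URL is already canonical."""
    if path.endswith("/"):
        return None  # already has the slash, no redirect needed
    return ("301 Moved Permanently", f"http://{host}{path}/")
```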
Let me shed some light on my setup.
A long time ago (before I had access to my apache configs) I realised that it was much easier to throw the URL at a script and let the script figure out what page to load. No problems with any URL scheme. Anything I could do in my programming language of choice (first Delphi, later HTAG) I could do to the URLs. I could make 'em jump through hoops and all.
From then on, ALL accesses to a domain went right into one root script, the index.htag. Path_info was simply passed through. This root script would look at the URL and figure out what to load. So most of my URLs from then on were like this:
domain.com/pagetoload/moreinfo/pagenumber
or
domain.com/pagetoload/subpagetoload/subsubpagetoload/moreinfo/pagenumber
No trailing slashes.
Now none of these are strictly folders, but simply virtual files without extensions (valid in most modern OSes).
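My scripts were Delphi and HTAG, but the front-controller idea translates to any language. A rough Python sketch (the "last segment may be a page number" convention is my own reading of the example URLs above):

```python
def route(path_info):
    """Split a slash-separated virtual path into its page parts plus an
    optional trailing page number. No trailing slashes expected or needed."""
    parts = [p for p in path_info.split("/") if p]  # drop empty segments
    page_number = None
    if parts and parts[-1].isdigit():
        page_number = int(parts.pop())  # peel off a numeric last segment
    return parts, page_number

# route("/pagetoload/moreinfo/3") -> (["pagetoload", "moreinfo"], 3)
```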
Webservers are set up to load default files when no file is given, so loading "folder/" will bring up "folder/index.html".
But what if the root contains the following:
/
/wordword [file]
/wordword [folder]
/wordword/index.html [file]
Now the URLs "/wordword" and "/wordword/" are distinct. And since the client cannot foresee such a case, it is up to the server to issue the 301 redirect with a trailing slash. It's really just a nicety to clean up behind folks who forget their trailing slashes, and it depends ENTIRELY on no extensionless file existing with the same name as the folder.
Regarding my own addressing scheme: I know it puts a great responsibility on me, as suddenly virtually ANY url is valid on that server. It is up to my script to determine which urls are invalid and return the appropriate error codes.
The server which caused this mixup is a local production machine and the site is not yet published, so the 404-generating code is not yet in place.
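When that code does go in, the shape of it is simple: since every URL reaches the root script, the script itself has to decide what counts as a 404. A hedged sketch in Python, with a made-up page registry standing in for whatever the real script consults:

```python
VALID_PAGES = {"news", "about", "products"}  # hypothetical registry of real pages

def status_for(path_info):
    """Decide which HTTP status the root script should emit for a virtual URL."""
    parts = [p for p in path_info.split("/") if p]
    if not parts or parts[0] in VALID_PAGES:
        return "200 OK"          # the root page, or a known virtual file
    return "404 Not Found"       # unknown virtual file: proper error, not a blank page
```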
URL addressing schemes are fun, but they can get as complex as you like ;)
Thanks again GoogleGuy.
SN
So internal absolute links pointing to an index should use the trailing slash to avoid the 301:
right: http://www.mydomain.com/
wrong: http://www.mydomain.com
right?
additionally, i'm not aware of an OS that allows a filename and a directory to be identical... granted, the html environment isn't an OS, and i suppose you could force a server to have a virtual directory with the same *exact* name as a file, but it would tend to border on the "very unusual" side of things...
that's my 2cents and HO...
You can have extensionless URLs if the site uses content negotiation. You call www.domain.com/somename and that could be a folder or a page. The actual uploaded page MUST contain whatever extension is relevant for it, but the agent calling it does not have to specify any extension, so the site could transparently change from .html to .php without any referrers having to edit any of their pages.
There is an article on this over at evolt somewhere.
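A rough Python sketch of that lookup: the server takes the extensionless name and tries the known extensions in order until a real file matches. (The extension list and its preference order here are assumptions for the example; Apache's MultiViews does something along these lines.)

```python
import os

EXTENSIONS = [".html", ".php"]  # assumed preference order

def negotiate(docroot, name):
    """Map an extensionless URL name to an actual file on disk, or None."""
    for ext in EXTENSIONS:
        candidate = os.path.join(docroot, name + ext)
        if os.path.isfile(candidate):
            return candidate  # first match wins
    return None
```

Because callers never see the extension, the site can switch a page from .html to .php without breaking any referrers.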
These results were repeatable over several days but have gone away (and my own entry has dropped even further) :(
So, it does happen.
- Ash