Forum Moderators: open
I can practically hear folks asking "But isn't www.foo.com/path the same as www.foo.com/path/?" In practice, they almost always are the same, but technically, according to the HTTP standards, I don't think they have to be.
I've got a few minutes free, so let's go into detective mode for a bit. Most webservers are configured to append the "/" automatically via a 301 redirect. For example, if you try to fetch www.google.com/webmasters, our web server will do a permanent 301 redirect to the canonical page, which is www.google.com/webmasters/ (note the trailing slash).
Just to illustrate the point, let's use the same imitate-the-browser-using-telnet technique that I posted about in
[webmasterworld.com...]
It's a really good debugging technique. What actually happens when you request a directory without the trailing slash looks like this:
telnet www.google.com 80
Trying 216.239.33.99...
Connected to www.google.com (216.239.33.99).
Escape character is '^]'.
GET /webmasters HTTP/1.0

HTTP/1.0 301 Moved Permanently
Connection: Keep-Alive
Date: Sun, 03 Aug 2003 22:11:43 GMT
...
Location: [google.com...]
Content-Type: text/html
Server: GWS/2.1
Content-length: 163

<HTML><HEAD><TITLE>301 Moved</TITLE></HEAD><BODY>
<H1>301 Moved</H1>
The document has moved
<A HREF="http://www.google.com/webmasters/">here</A>.
</BODY></HTML>
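For anyone who'd rather script this than type it into telnet, here's a rough Python sketch of the same technique: open a raw socket, send a bare HTTP/1.0 request without the trailing slash, and pull the status line and Location header out of the response. The host and path are just the ones from the example above.

```python
import socket

def parse_status_and_location(raw_response):
    """Extract (status line, Location header value) from a raw HTTP response."""
    headers = raw_response.split(b"\r\n\r\n", 1)[0].decode("iso-8859-1")
    lines = headers.split("\r\n")
    location = next((line.split(":", 1)[1].strip()
                     for line in lines
                     if line.lower().startswith("location:")), None)
    return lines[0], location

def fetch(host, path):
    """Send a minimal HTTP/1.0 GET, exactly like typing it into telnet,
    and return the raw response bytes."""
    with socket.create_connection((host, 80)) as sock:
        sock.sendall(f"GET {path} HTTP/1.0\r\nHost: {host}\r\n\r\n".encode("ascii"))
        chunks = []
        while chunk := sock.recv(4096):
            chunks.append(chunk)
    return b"".join(chunks)

# Live call (results depend on the server, of course):
# status, location = parse_status_and_location(fetch("www.google.com", "/webmasters"))
```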
So the server basically said, "Instead of fetching this page, try it again with a trailing slash." That's why it's ever-so-slightly faster if you go to "www.webmasterworld.com/forum3/" instead of "www.webmasterworld.com/forum3"--because your browser doesn't have to get the redirect and do another fetch of the new url.
So to make a long story not quite as long, I noticed that the webserver for this domain returns a 301, but it looks like it doesn't add the trailing slash correctly in either the "Location:" field in the HTTP headers or in the text of the page. So that's the main thing I'd check on your web server.
On the other hand, even if we get duplicate content for two nearly identical urls, we have heuristics that normally detect that sort of thing. That's why the search collapses those two urls together unless you do "&filter=0". So the duplicate content filter was cleaning things up in this case. I think if you switch the webserver to do the 301 to the trailing-slash url, you should be in good shape in the future too.
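If you want to check what a *correct* response looks like on your own server, it's easy to describe: a 301 with the slash appended to the path in the Location header. A tiny sketch (the host name is made up for the example):

```python
def trailing_slash_redirect(host, path):
    """Return (status, Location) for a directory requested without a slash,
    or None if the URL is already canonical."""
    if path.endswith("/"):
        return None  # already has the slash, no redirect needed
    return ("301 Moved Permanently", f"http://{host}{path}/")
```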
Let me shed some light on my setup.
A long time ago (before I had access to my apache configs) I realised that it was much easier to throw the URL at a script and let the script figure out what page to load. No problems with any URL scheme. Anything I could do in my programming language of choice (first Delphi, later HTAG) I could do to the URLs. I could make 'em jump through hoops and all.
From then on, ALL accesses to a domain went right into one root script, the index.htag. Path_info was simply passed through. This root script would look at the URL and figure out what to load. So most of my URLs from then on were like this:
domain.com/pagetoload/moreinfo/pagenumber
or
domain.com/pagetoload/subpagetoload/subsubpagetoload/moreinfo/pagenumber
No trailing slashes.
Now none of these are strictly folders, but simply virtual files without extensions (valid in most modern OSes).
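My scripts were Delphi and HTAG, but the front-controller idea translates to any language. A rough Python sketch (the "last segment may be a page number" convention is my own reading of the example URLs above):

```python
def route(path_info):
    """Split a slash-separated virtual path into its page parts plus an
    optional trailing page number. No trailing slashes expected or needed."""
    parts = [p for p in path_info.split("/") if p]  # drop empty segments
    page_number = None
    if parts and parts[-1].isdigit():
        page_number = int(parts.pop())  # peel off a numeric last segment
    return parts, page_number

# route("/pagetoload/moreinfo/3") -> (["pagetoload", "moreinfo"], 3)
```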
Webservers are set up to load default files when no file is given, so loading "folder/" will bring up "folder/index.html".
But what if the root contains the following:
/
/wordword [file]
/wordword [folder]
/wordword/index.html [file]
Now the URLs "/wordword" and "/wordword/" are distinct. And since the client cannot foresee such a case, it is up to the server to issue the 301 redirect with a trailing slash. It's really just a nicety to clean up behind folks who forget their trailing slashes, and it depends ENTIRELY on no extensionless file existing with the same name as the folder.
Regarding my own addressing scheme: I know it puts a great responsibility on me, as suddenly virtually ANY url is valid on that server. It is up to my script to determine which urls are invalid and return the appropriate error codes.
The server which caused this mixup is a local production machine and the site is not yet published, so the 404-generating code is not yet in place.
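When that code does go in, the shape of it is simple: since every URL reaches the root script, the script itself has to decide what counts as a 404. A hedged sketch in Python, with a made-up page registry standing in for whatever the real script consults:

```python
VALID_PAGES = {"news", "about", "products"}  # hypothetical registry of real pages

def status_for(path_info):
    """Decide which HTTP status the root script should emit for a virtual URL."""
    parts = [p for p in path_info.split("/") if p]
    if not parts or parts[0] in VALID_PAGES:
        return "200 OK"          # the root page, or a known virtual file
    return "404 Not Found"       # unknown virtual file: proper error, not a blank page
```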
URL addressing schemes are fun, but they can get as complex as you like ;)
Thanks again GoogleGuy.
SN
So internal absolute links pointing to an index should use the trailing slash to avoid the 301:
right: http://www.mydomain.com/
wrong: http://www.mydomain.com
right?
additionally, i'm not aware of an OS that allows a filename and a directory to be identical... granted, the html environment isn't an OS, and i suppose you could force a server to have a virtual directory with the same *exact* name as a file, but it would tend to border on the "very unusual" side of things...
that's my 2cents and HO...
You can have extensionless URLs if the site uses content negotiation. You call www.domain.com/somename and that could be a folder or a page. The actual uploaded page MUST contain whatever extension is relevant for it, but the agent calling it does not have to specify any extension, so the site could transparently change from .html to .php without any referrers having to edit any of their pages.
There is an article on this over at evolt somewhere.
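A rough Python sketch of that lookup: the server takes the extensionless name and tries the known extensions in order until a real file matches. (The extension list and its preference order here are assumptions for the example; Apache's MultiViews does something along these lines.)

```python
import os

EXTENSIONS = [".html", ".php"]  # assumed preference order

def negotiate(docroot, name):
    """Map an extensionless URL name to an actual file on disk, or None."""
    for ext in EXTENSIONS:
        candidate = os.path.join(docroot, name + ext)
        if os.path.isfile(candidate):
            return candidate  # first match wins
    return None
```

Because callers never see the extension, the site can switch a page from .html to .php without breaking any referrers.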
These results were repeatable over several days but have gone away (and my own entry has dropped even further) :(
So, it does happen.
- Ash