
Google News Archive Forum

    
Google Lists Exact Same URL Twice
On an inurl search.
killroy




msg:171315
 7:52 pm on Aug 2, 2003 (gmt 0)

Hmm, I just found a page of mine listed in Google twice with exactly the same URL. How can that happen?

It's even a page that doesn't exist and isn't linked from anywhere, very strange.

I didn't know the same URL could appear twice in a SERP.

SN

 

g1smd




msg:171316
 7:12 pm on Aug 3, 2003 (gmt 0)

I have had that, but looking closer, one had been indexed as domain.com/ and the other as www.domain.com/.

I have seen /index.html and /index.php from a site once though, but only for a couple of days.

Shak




msg:171317
 7:15 pm on Aug 3, 2003 (gmt 0)

Are you sure it's not http:// and https://?

Many a person has been caught out on that one before.

Shak

Yidaki




msg:171318
 7:20 pm on Aug 3, 2003 (gmt 0)

check also: uppercase / lowercase differences ...

killroy




msg:171319
 10:52 pm on Aug 3, 2003 (gmt 0)

Yeah, just checked it closely and it's

domain.com/word
VS
domain.com/word/

sorry, didn't wanna cause confusion.

The strange thing, though, is that it's a completely invalid URL that I could not possibly have linked to.
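The near-duplicate checks suggested in this thread (www vs. non-www, http vs. https, letter case, trailing slash) can be sketched in a few lines of Python. This is only a rough illustration: the function name and the labels it returns are made up here, and it needs Python 3.9 or later.

```python
from urllib.parse import urlsplit


def url_differences(a: str, b: str) -> list[str]:
    """Name the ways two URLs differ, using the checks mentioned in this thread."""
    ua, ub = urlsplit(a), urlsplit(b)
    found = []
    if ua.scheme != ub.scheme:
        found.append("scheme")  # e.g. http:// vs https://
    host_a, host_b = ua.hostname or "", ub.hostname or ""
    if host_a != host_b:
        if host_a.removeprefix("www.") == host_b.removeprefix("www."):
            found.append("www prefix")  # domain.com vs www.domain.com
        else:
            found.append("host")
    if ua.path != ub.path:
        if ua.path.rstrip("/") == ub.path.rstrip("/"):
            found.append("trailing slash")  # /word vs /word/
        elif ua.path.lower() == ub.path.lower():
            found.append("letter case")
        else:
            found.append("path")
    return found


print(url_differences("http://domain.com/word", "http://domain.com/word/"))
```

An empty list means the two strings really are the same URL as far as these checks go.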

SN

GoogleGuy




msg:171320
 4:20 am on Aug 4, 2003 (gmt 0)

Interesting case, killroy. Thanks for passing it back to me via stickymail. I think the difference is that one url has a trailing slash and one url doesn't.

I can practically hear folks asking, "But isn't www.foo.com/path the same as www.foo.com/path/?" In practice, they almost always are the same, but technically, according to the HTTP standards, I don't think that they have to be the same.

I've got a few minutes free, so let's go into detective mode for a bit. Most webservers are configured to append the "/" automatically via a 301 redirect. For example, if you try to fetch www.google.com/webmasters, our web server will do a permanent 301 redirect to the canonical page, which is www.google.com/webmasters/ (note the trailing slash).

Just to illustrate the point, let's use the same imitate-the-browser-using-telnet technique that I posted about in
[webmasterworld.com...]
It's a really good debugging technique. What actually happens when you request a directory without the trailing slash looks like this:


telnet www.google.com 80
Trying 216.239.33.99...
Connected to www.google.com (216.239.33.99).
Escape character is '^]'.
GET /webmasters HTTP/1.0

HTTP/1.0 301 Moved Permanently
Connection: Keep-Alive
Date: Sun, 03 Aug 2003 22:11:43 GMT
...
Location: [google.com...]
Content-Type: text/html
Server: GWS/2.1
Content-length: 163

<HTML><HEAD><TITLE>301 Moved</TITLE></HEAD><BODY>
<H1>301 Moved</H1>
The document has moved
<A HREF="http://www.google.com/webmasters/">here</A>.
</BODY></HTML>

So the server basically said "Instead of fetching this page, try it again with a trailing slash." That's why it's ever-so-slightly faster if you go to "www.webmasterworld.com/forum3/" instead of "www.webmasterworld.com/forum3"--because your browser doesn't have to get the redirect and do another fetch of the new url.
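For anyone who would rather script this than type it into telnet, here is a minimal sketch in Python that reproduces the same exchange against a throwaway local server. Everything in it is an assumption for illustration: the handler class, the 127.0.0.1 test server and its page body are toys, not how any real web server is implemented. The point is only that the client issues one GET, does not follow the redirect, and reads the status and the Location header.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class SlashRedirectHandler(BaseHTTPRequestHandler):
    """Toy handler: /webmasters redirects to /webmasters/, which serves a page."""

    def do_GET(self):
        if self.path == "/webmasters":
            # Same idea as the transcript above: 301 to the trailing-slash URL.
            self.send_response(301)
            self.send_header("Location", self.path + "/")
            self.send_header("Content-Length", "0")
            self.end_headers()
        elif self.path == "/webmasters/":
            body = b"directory index"
            self.send_response(200)
            self.send_header("Content-Type", "text/plain")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_error(404)

    def log_message(self, format, *args):
        pass  # keep the demo quiet


def fetch(port: int, path: str):
    """Issue one GET without following redirects; return (status, Location)."""
    conn = http.client.HTTPConnection("127.0.0.1", port, timeout=5)
    try:
        conn.request("GET", path)
        resp = conn.getresponse()
        resp.read()
        return resp.status, resp.getheader("Location")
    finally:
        conn.close()


def run_demo():
    """Start the toy server on a free port and fetch both URL forms."""
    server = HTTPServer(("127.0.0.1", 0), SlashRedirectHandler)
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    try:
        port = server.server_address[1]
        return fetch(port, "/webmasters"), fetch(port, "/webmasters/")
    finally:
        server.shutdown()
        server.server_close()


if __name__ == "__main__":
    print(run_demo())
```

The first fetch costs an extra round trip before the page can be requested, which is the small slowdown described above.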

So to make a long story not quite as long, I noticed that the webserver for this domain returns a 301, but it looks like it doesn't add the trailing slash correctly in either the "Location:" field in the HTTP headers or in the text of the page. So that's the main thing I'd check on your web server.
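A quick way to test for that kind of misconfiguration, as a sketch only: compare the requested URL with the value of the "Location:" header and see whether the only change is an added slash. The function name is made up for this example, and it assumes you have already captured the header value (for instance with the telnet technique above).

```python
from urllib.parse import urljoin, urlsplit


def redirect_adds_trailing_slash(requested_url: str, location: str) -> bool:
    """True if `location` is the requested URL with only a '/' appended to the path."""
    source = urlsplit(requested_url)
    # Location may be relative, so resolve it against the requested URL first.
    target = urlsplit(urljoin(requested_url, location))
    return (
        target.scheme == source.scheme
        and target.netloc == source.netloc
        and target.path == source.path + "/"
    )


print(redirect_adds_trailing_slash(
    "http://www.google.com/webmasters",
    "http://www.google.com/webmasters/",
))
```

A 301 whose Location points back at the slash-less URL would fail this check.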

On the other hand, even if we get duplicate content for two nearly identical urls, we have heuristics that normally detect that sort of thing. That's why the search collapses those two urls together unless you do "&filter=0". So the duplicate content filter was cleaning things up in this case. I think if you switch the webserver to do the 301 to the trailing-slash url, you should be in good shape in the future too.

killroy




msg:171321
 10:21 am on Aug 4, 2003 (gmt 0)

Thank you for looking into it GoogleGuy.

Let me shed some light on my setup.

A long time ago (before I had access to my Apache configs) I realised that it was much easier to throw the URL at a script and let the script figure out what page to load. No problems with any URL scheme. Anything I could do in my programming language of choice (first Delphi, later HTAG) I could do to the URLs. I could make 'em jump through hoops and all.

From then on, ALL accesses to a domain went right into one root script, the index.htag. Path_info was simply passed through. This root script would look at the URL and figure out what to load. So most of my URLs from then on were like this:

domain.com/pagetoload/moreinfo/pagenumber
or
domain.com/pagetoload/subpagetoload/subsubpagetoload/moreinfo/pagenumber

No trailing slashes.

Now none of these are strictly folders, but simply virtual files without extensions (valid in most modern OSes).

Web servers are set up to load default files when no file is given, so loading "folder/" will bring up "folder/index.html".

But what if the root contains the following:

/
/wordword [file]
/wordword [folder]
/wordword/index.html [file]

Now the URLs "/wordword" and "/wordword/" are distinct. And since the client cannot foresee such a case, it is up to the server to send the 301 redirect to the trailing-slash URL. It's really just a nicety to clean up behind folks who forget their trailing slashes, and it is ENTIRELY dependent on no extensionless file existing with the same name as the folder.

Regarding my own addressing scheme: I know it puts a great responsibility on me, as suddenly virtually ANY URL is valid on that server. It is up to my script to determine which URLs are invalid and return the appropriate error codes.
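To make the idea concrete, here is a rough Python sketch of such a root script's decision logic. It is not the actual HTAG code: the page names in KNOWN_PAGES are hypothetical, and redirecting the trailing-slash form to the slash-less form is just one possible choice that fits a "no trailing slashes" scheme.

```python
# Hypothetical page table; the real site's page names are not given in this thread.
KNOWN_PAGES = {"articles", "gallery"}


def route(path_info: str):
    """Decide what to do with a PATH_INFO value.

    Returns (200, segments) for a valid canonical URL,
    (301, canonical_path) for a valid URL in non-canonical form,
    and (404, None) for anything the script does not recognise.
    """
    segments = [part for part in path_info.split("/") if part]
    if segments and segments[0] not in KNOWN_PAGES:
        return 404, None
    canonical = "/" + "/".join(segments)  # no trailing slash, no doubled slashes
    if path_info != canonical:
        return 301, canonical
    return 200, segments


print(route("/articles/intro/2"))
print(route("/articles/"))
print(route("/nosuchpage"))
```

With a rule like this in place, "/word" and "/word/" can no longer both answer with a 200 and the same content.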

The server which caused this mixup is a local production machine and the site is pre-publishing, and therefore the 404 generating code is not yet in place.

URL addressing schemes are fun, but they can get as complex as you like ;)

Thanks again GoogleGuy.

SN

GoogleGuy




msg:171322
 6:41 pm on Aug 4, 2003 (gmt 0)

"URL addressing schemes are fun, but they can get as complex as you like."

Well said, killroy. Personally, I find this sort of thing fun. It's cool that you wrote your own processor to handle URL requests. HTTP lets you do lots of fun things. :)

skipfactor




msg:171323
 9:33 pm on Aug 5, 2003 (gmt 0)

So the server basically said "Instead of fetching this page, try it again with a trailing slash." That's why it's ever-so-slightly faster if you go to "www.webmasterworld.com/forum3/" instead of "www.webmasterworld.com/forum3"--because your browser doesn't have to get the redirect and do another fetch of the new url.

So internal absolute links pointing to an index should use the trailing slash to avoid the 301:

right: http://www.mydomain.com/

wrong: http://www.mydomain.com

right?

wkitty42




msg:171324
 10:08 pm on Aug 5, 2003 (gmt 0)

from what i see, that is right...

additionally, i'm not aware of an OS that allows a filename and a directory to be identical... granted, the html environment isn't an OS and i suppose you could force a server to have a virtual directory that has the same *exact* name as a file but it would tend to border on the "very unusual" side of things..
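a quick way to try this on a local filesystem, as a small python sketch (the function name and the "wordword" file name are just for illustration, and it only tells you about the filesystem it runs on):

```python
import os
import tempfile


def can_file_and_dir_share_name() -> bool:
    """Try to create a directory with the same name as an existing file."""
    with tempfile.TemporaryDirectory() as root:
        path = os.path.join(root, "wordword")
        with open(path, "w") as handle:
            handle.write("extensionless file")
        try:
            os.mkdir(path)  # same name as the file created above
        except FileExistsError:
            return False
        return True


print(can_file_and_dir_share_name())
```

a web server that maps URLs through a script instead of the filesystem isn't bound by this, which is how "/wordword" and "/wordword/" can both exist as URLs.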

that's my 2cents and HO...

GoogleGuy




msg:171325
 10:58 pm on Aug 5, 2003 (gmt 0)

Right, skipfactor. I would always recommend the trailing slash. If you know the exact right url, it's often best to give it directly and save everyone that extra redirect.

g1smd




msg:171326
 11:44 pm on Aug 5, 2003 (gmt 0)

I read that the extra slash is only required on the end of folder names, not when the URL is just a domain name. However, I always add the trailing slash to all URLs either at domain or folder level.
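The reason the bare domain is a special case: an HTTP client cannot send an empty path, so it puts "/" on the request line whether or not the link had the slash. A small Python sketch of that rule (the function name is made up for illustration):

```python
from urllib.parse import urlsplit


def request_target(url: str) -> str:
    """Path (plus query) that an HTTP client puts on the request line for `url`."""
    parts = urlsplit(url)
    target = parts.path or "/"  # an empty path is sent as "/"
    if parts.query:
        target += "?" + parts.query
    return target


# Both forms of the domain-only URL produce the same request.
print(request_target("http://www.mydomain.com"))
print(request_target("http://www.mydomain.com/"))
# The folder forms produce different requests, so the server may redirect.
print(request_target("http://www.mydomain.com/forum3"))
print(request_target("http://www.mydomain.com/forum3/"))
```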

You can have extensionless URLs if the site uses content negotiation. You call www.domain.com/somename and that could be a folder or a page. The actual uploaded page MUST contain whatever extension is relevant for it, but the agent calling it does not have to specify any extension, so the site could transparently change from .html to .php without any referrers having to edit any of their pages.
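As a very simplified illustration of the idea (not how any particular server implements content negotiation), the lookup amounts to matching the requested name against the stored files and picking one by preference. The function name and the preference order below are assumptions; a real server derives its choice from its configuration and the request headers.

```python
import os

# Hypothetical preference order, used only for this sketch.
PREFERRED_EXTENSIONS = [".php", ".html"]


def resolve_extensionless(name: str, files: list[str]):
    """Pick the stored file whose base name matches the extensionless URL name."""
    candidates = {}
    for filename in files:
        base, ext = os.path.splitext(filename)
        if base == name:
            candidates[ext] = filename
    for ext in PREFERRED_EXTENSIONS:
        if ext in candidates:
            return candidates[ext]
    return None  # nothing matches; the server would answer 404


print(resolve_extensionless("somename", ["somename.html", "other.html"]))
```

Swapping somename.html for somename.php on the server changes which file is returned, while the URL that referrers link to stays the same.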

There is an article on this over at evolt somewhere.

anallawalla




msg:171327
 12:09 am on Aug 6, 2003 (gmt 0)

I started a thread in the Supporters forum [webmasterworld.com] last week that showed not one but 3-4 such dupes, but they were EXACT dupes, not the slight variations mentioned above and they were on successive pages (100 per page), not all on the same one.

These results were repeatable over several days but have gone away (and my own entry has dropped even further) :(

So, it does happen.

- Ash

skipfactor




msg:171328
 12:59 am on Aug 6, 2003 (gmt 0)

Great tip GoogleGuy, thanks. The simple things in WebmasterWorld are the best.

