File extensions

File extensions and Google indexing

         

webquest

5:59 pm on Aug 9, 2004 (gmt 0)

10+ Year Member



Hello,

I currently run a Cobalt RAQ4 and have a question about html file extensions and Google:

Say I create an html file in the webroot (for example, www.domain.com/test.html). When I go to a browser and type in "www.domain.com/test" the page comes up. OK, so for some reason the server checks and decides that since there is no directory called "test," I must be looking for "test.html". Great.

Now, here is my question...

Google comes along, indexes my site, and follows (and indexes) test.html. Then someone else comes along and links to my site via "www.domain.com/test". Does Google follow THEIR link and then index this, thinking it's a subdirectory, so that I now (potentially) have a duplicate content problem? Does Google know that it's *really* test.html even though my server is serving the page when "www.domain.com/test/" is requested?

I hope I made sense... I'm not a technical person!

Thanks,

Matt

DerekH

8:19 pm on Aug 12, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Welcome to WebMasterWorld, webquest
I'm sorry no-one seems interested in answering your query, and I don't think you should be made to feel it's a daft question - it's not. My half-baked reply, to follow, will at least get you onto the first page of the forum again, where hopefully someone can give you a better answer.

Obviously, things would be so much better if your server didn't convert test to test.html automatically, and since this is a slightly unusual thing to do, there must be a way to stop it.
As regards your question about links, though, well, no-one SHOULD link to you in the way you describe if they have followed your own internal linking structure, since the URL on display at the top of their browser should be syntactically correct, with the .html extension too.

As regards the duplicate content penalty, well it depends who you ask as to how severe the problem is. My biggest UK news website has hi-graphics and text-only versions of each story, and both are in the Google index, and the cache to each (being devoid of the CSS) looks identical to me.
And I ended up with a duplicate content page - the home page of one of my sites - when a meta-refresh from elsewhere ended up with Google getting the same content from two apparently different URLs. One page simply got removed from the index and nothing else was affected. When the redirection was removed, the page reappeared. It was quite harmless and well-contained, and actually did what I expected.

OK - that's got you back onto the front page of the forum - now let's hope someone else will puff up my reply a bit!

DerekH

jchance

9:12 pm on Aug 12, 2004 (gmt 0)

10+ Year Member



Just a thought here, but might it be the case that your web browser is first trying the /test and when it gets back 404 it goes and tries test.html?

You could verify using the server header checker at: [searchengineworld.com...]
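If you'd rather run that check from a script than from a web tool, here is a rough Python sketch. It sends one HEAD request per URL and does not follow redirects, so you see exactly what the server answers. The function names fetch_status and describe are made up for this example, and www.domain.com is just the placeholder from the first post. The Content-Location check is an assumption on my part about what Apache's content negotiation usually sends back; your server may not include that header.

```python
import http.client
from urllib.parse import urlsplit


def fetch_status(url, timeout=10):
    """Send a single HEAD request, without following redirects.

    Returns (status_code, headers) with header names lowercased.
    """
    parts = urlsplit(url)
    if parts.scheme == "https":
        conn = http.client.HTTPSConnection(parts.netloc, timeout=timeout)
    else:
        conn = http.client.HTTPConnection(parts.netloc, timeout=timeout)
    path = parts.path or "/"
    if parts.query:
        path += "?" + parts.query
    try:
        conn.request("HEAD", path)
        resp = conn.getresponse()
        return resp.status, {k.lower(): v for k, v in resp.getheaders()}
    finally:
        conn.close()


def describe(status, headers):
    """Turn a status code and headers into a one-line explanation."""
    if status in (301, 302, 303, 307, 308):
        return "redirect to " + headers.get("location", "(no Location header)")
    if status == 404:
        return "not found on the server"
    if status == 200:
        # Assumption: a negotiated response often names the real file here.
        real_file = headers.get("content-location")
        if real_file:
            return "served by the server itself; Content-Location: " + real_file
        return "served by the server itself"
    return "status %d" % status


if __name__ == "__main__":
    # Placeholder URLs from the thread; replace with your own site.
    for url in ("http://www.domain.com/test", "http://www.domain.com/test.html"):
        status, headers = fetch_status(url)
        print(url, "->", status, describe(status, headers))
```

If /test comes back as 200 rather than 404, the server is the one filling in the extension, not the browser.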

funkytastic

7:33 pm on Aug 19, 2004 (gmt 0)

10+ Year Member



It's not the browser, it's the server. If you're using Apache, you have Options MultiViews turned on. You can fix this by creating a text file called .htaccess in your web root (the directory that holds test.html) containing the line:

Options -MultiViews

I know that Mac OS X Server has this on by default.

webquest

7:49 pm on Aug 19, 2004 (gmt 0)

10+ Year Member



Thank you! I appreciate it!