
The importance of getting page suffix right

The wrong page suffix can mean no indexing


Pimpernel

1:08 pm on May 27, 2004 (gmt 0)

10+ Year Member



Here is a very strange thing we discovered recently with Google. We were creating additional pages on a client's web site. We created links on the home page to exact URLs as follows:
www.example.com/folder1/index.htm
www.example.com/folder2/index.htm

But the default page names on that client's server are index.html.

In other words, if you typed into your browser www.example.com/folder1 you would get a 404 error as there was no index.html created.

But it should not have mattered as we were not linking to the folder, but to a specific page within the folder called index.htm.

Anyway, the point is Google did not follow the link for one month. Then we changed it to index.html and the next day it was in Google.

We then applied the same treatment to two other clients whose pages we had been waiting to see indexed and, hey presto, up they appeared.

So Google is doing some form of check on the existence of the folder before visiting any pages within that folder, or it is checking what the default page name is for the server, or something like that.

I have no idea why or how this is happening, but I thought it might solve a problem for some people who are not getting indexed. I would also like to know if anyone can explain what is going on here.

Herenvardo

5:03 pm on May 27, 2004 (gmt 0)

10+ Year Member



As a programmer, here's my viewpoint:
"index.html" = "index.htm" is a false statement. The strings are not equal. Try it on Windows, for example: create both files in the same folder and edit one of them. You will never be editing the other file.
Linking to the wrong one is like trying to reach www.domain.com by typing www.domai.com. Of course that gives an error, and the .htm(l) issue is exactly the same case, so it gives the same error.
You may still be convinced that .htm and .html are the same, but they aren't. Some history helps explain the confusion:
In the days of FAT16 (DOS, Win95), filenames were limited to the 8.3 format. This means a file name could only be 8 characters long, plus 3 characters for the extension. That is why the .htm extension appeared. Most web browsers were then adapted to it, trying .htm when .html failed and vice versa. Some servers even delivered the .htm file automatically when the .html was requested, and so on. But these are still different filenames.
The wrong page suffix can mean no indexing

I suppose that by suffix you mean file extension. In that case, it would be more accurate to say that a misspelled URL gives a 404, which won't get indexed. Going further, if you work on a non-Windows server (which is, after all, a good choice), it is very likely that index.htm and index.html will be treated as different filenames, for the simple reason that they ARE different filenames.
To avoid such problems, always use .html, always check the URL before linking, and use full addresses only when they are truly needed. Examples:
For an external link to http:// www.domain.com/index.html:
GOOD: href="www.domain.com", href="http://www.domain.com"
UGLY: href="www.domain.com/index.html" will work... until the webmaster changes the homepage's URI
BAD: href="www.domain.com/index.htm" you cannot even be sure that it works!
If the link is internal, use href="/". That way you reach the home page without caring which file the home page is ;)

If you want further info, visit this: [w3.org...]
Will the system allow a w3.org link? ;)

Hoping to be useful,
Herenvardö

PS: I edited the post to break the URL. It's confusing when a URI example links to a real, unrelated page!

kaled

10:50 pm on May 27, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Seems simple to me.

When Google sees folder/index.html or folder/index.htm (or other variations), they are mapped in the index (rightly or wrongly) to folder/

It therefore follows that if the server does not deliver the required page when folder/ is requested, a problem will follow.

You could argue that Google should check that the server performs the expected mapping, but this would take up valuable resources.

Kaled.

jdMorgan

11:13 pm on May 27, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



You could also argue that the server configuration should be correct for your purposes.

On Apache, this problem could easily have been avoided by adding or modifying the DirectoryIndex directive, which tells the server what file to serve when a directory index is requested. This can be done in httpd.conf or in .htaccess.

The directive


DirectoryIndex index.html index.htm

would tell the server to look for index.html first, and then for index.htm if that failed.
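A broader version covering several page types might look like the following sketch (a hypothetical .htaccess fragment; whether it takes effect depends on your host allowing the Indexes override, and the exact list should match the file types you actually use):

```apache
# Try each candidate in order until one exists in the directory.
# Hypothetical fragment -- adjust the list to your own file types.
DirectoryIndex index.html index.htm index.shtml index.php
```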

Blaming this on Google is wrong.

Jim

ALbino

5:53 am on May 28, 2004 (gmt 0)

10+ Year Member



We even go so far as to add .shtml, .php, etc. for our clients. This is definitely a server issue and should be resolved, even if Google were indexing the pages.

Pimpernel

9:10 am on May 28, 2004 (gmt 0)

10+ Year Member



I appreciate all the posts, but maybe I have not explained myself well. The issue is not how to create correct files. The issue I was raising is that if you create a page and link to its full URL from the home page, Google will not follow the link and index that page if it is the wrong format, EVEN THOUGH the page is perfectly viewable in a browser. It is almost as though Google goes looking for www.example.com/folder1 and, if it gets a 404 error, will not visit the link you have created to www.example.com/folder1/index.html

I think this is important because I am sure there are instances where people create pages within folders without putting up an index page for that folder, and I believe, based on our experience and my interpretation of it above, that those pages will not get indexed even though they are linked to.

Does anyone understand what I am saying? :)

tafkar

9:22 am on May 28, 2004 (gmt 0)

10+ Year Member



Does anyone understand what I am saying? :)

I'll give it a shot :)

As I understand it, you are saying that Google does not index files in a folder if there is no default page for that folder, even if the files in the folder are properly linked and reachable.

How good is my guess? ;-)

Nova Reticulis

12:33 pm on May 28, 2004 (gmt 0)

10+ Year Member



Again.

(A) www.foo.com/bar
(B) www.foo.com/bar/
(C) www.foo.com/bar/index.html

In case A, the request might be for a file, a directory, or the output of a script. If a correctly configured webserver finds that no script or file named ``bar'' exists, it checks whether a directory of that name exists. If it does, the server responds with an HTTP redirect to B.

In case B, any webserver clearly knows that it's a directory and, since nothing below ``bar'' is specified, serves the directory index. If your webserver is configured to treat ``index.html'' as the directory index (see the DirectoryIndex post above), B will be equivalent to C because the same result is returned.

If you don't get C when you go to A, then (your case):

1) Your webserver is misconfigured and doesn't complete directory names automagically. You do not want A and B to return the same thing; you want a redirect from A to B.
2) You should never use links of type A.
3) You did not create the index pages correctly.

Your situation has nothing to do with google and everything to do with wrong linking on your website and webserver configuration.
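The resolution order described for cases A, B, and C can be sketched in a few lines of Python. This is a simplified model to make the logic concrete, not real server code; the filesystem layout and the index-file list are assumptions:

```python
# Simplified model of how a correctly configured server resolves
# the three URL forms (A: /bar, B: /bar/, C: /bar/index.html).
# Illustration only -- not how any real webserver is implemented.

# Hypothetical filesystem: a set of files and a set of directories.
FILES = {"/bar/index.html"}
DIRS = {"/bar"}
DIRECTORY_INDEX = ["index.html", "index.htm"]  # like Apache's DirectoryIndex

def resolve(path):
    """Return (status, served_path) for a request to `path`."""
    if path.endswith("/"):
        # Case B: directory request -- try each configured index file.
        for name in DIRECTORY_INDEX:
            if path + name in FILES:
                return ("200", path + name)
        return ("404", None)
    if path in FILES:
        return ("200", path)          # Case C: exact file match
    if path in DIRS:
        return ("301", path + "/")    # Case A: redirect to the B form
    return ("404", None)

print(resolve("/bar"))             # ('301', '/bar/')
print(resolve("/bar/"))            # ('200', '/bar/index.html')
print(resolve("/bar/index.htm"))   # ('404', None) -- the .htm file is absent
```

Note how the last request 404s even though the directory and its index.html exist: .htm and .html really are different filenames, which is exactly the trap described in this thread.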

Pimpernel

12:55 pm on May 28, 2004 (gmt 0)

10+ Year Member



tafkar

Spot on.

Nova Recticulis

You keep telling me how to configure properly etc etc. That is not my point. I promise I will never ever again create an index.htm file when I should have created an index.html file, or I'll reconfigure my server etc etc.

I did not post this message to get a lecture on how to configure my server. I posted it to help anyone else who may be experiencing a similar problem of pages not being indexed, and in the hope that someone could explain why Google chooses not to follow a perfectly valid link simply because there is no valid index page within that directory (or because the file type is not the default type for that server).

OK, I've calmed down now - sorry for the aggression, but I think I am not being understood properly except by tafkar!

kaled

1:21 pm on May 28, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Pimpernel said
I posted this message ..... in the hope that someone could explain why google chooses not to follow a perfectly valid link.......

You may disagree with Google's logic, but I think I explained what is (probably) going on... essentially, this is a side-effect of a fix for a much bigger problem.

You were correct to post in the hope that others may learn from your mistake, and you are also correct, technically, in saying that Google is at fault. However, I do not believe Google would consider this to be a bug.

Kaled.

Pimpernel

2:14 pm on May 28, 2004 (gmt 0)

10+ Year Member



Kaled

Sorry, I did not read your post carefully enough. You have certainly explained what you think the reason might be.

It is still strange to me, though, because Google follows links, whereas what we are seeing here is that Google goes a step further and checks for the folder even though the folder itself, without the filename, is not linked to anywhere.

watercrazed

9:28 am on Jun 2, 2004 (gmt 0)

10+ Year Member



I am not sure I am following this correctly, but I do not have any index.htm or index.html pages in any of my subfolders, and they are all being spidered and ranked.

kaled

1:45 pm on Jun 2, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Index pages are not required for spidering. HOWEVER, if you have a page that Google believes is an index page, it is essential that your server is configured to deliver that page if the directory-only url is requested.

Google probably treats any page index.* as a default page. They ought to explain this in their help notes for webmasters.
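If that guess is right, the mapping would look something like the sketch below. This is pure speculation about Google's internals, written out only to make the hypothesis concrete; the list of "default" name stems is an assumption:

```python
# Speculative sketch: a URL whose last path segment looks like a
# default page (index.*, default.*) gets stored under its directory
# URL instead. Not a description of Google's actual behaviour.
DEFAULT_STEMS = ("index", "default")

def canonicalize(url):
    base, _, last = url.rpartition("/")
    stem, _, _ext = last.partition(".")
    if stem.lower() in DEFAULT_STEMS:
        return base + "/"
    return url

print(canonicalize("http://www.example.com/folder1/index.htm"))
# -> http://www.example.com/folder1/
print(canonicalize("http://www.example.com/folder1/page.htm"))
# -> http://www.example.com/folder1/page.htm (unchanged)
```

Under this model, the original poster's index.htm link would be stored as the folder URL, and if the server 404s on that folder URL (because its real default is index.html), the page would appear to vanish from the index.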

Kaled.

Herenvardo

3:27 pm on Jun 9, 2004 (gmt 0)

10+ Year Member



To avoid such problems, always use .html, always check the URL before linking, and use full addresses only when they are truly needed. Examples:
For an external link to http:// www.domain.com/index.html:
GOOD: href="www.domain.com", href="http://www.domain.com"
UGLY: href="www.domain.com/index.html" will work... until the webmaster changes the homepage's URI
BAD: href="www.domain.com/index.htm" you cannot even be sure that it works!
If the link is internal, use href="/".

I'm quoting myself because I believe my earlier post was too long.
What I was trying to say is that a logical linking strategy would completely avoid the issue you describe.
But I have also seen some comments that add details worth noting. Kaled said:
Index pages are not required for spidering. HOWEVER, if you have a page that Google believes is an index page, it is essential that your server is configured to deliver that page if the directory-only url is requested.

Good point! It seems that Google stores a URL ending in index.html under its directory's path, assuming it is an index page. So be sensible: name your files index, default, etc. only when they really are the index/default file for their directory.

In summary, you would avoid this problem by using a logical and coherent directory/filename structure and following the linking advice I posted before.

Trying to be useful,
Herenvardö