But the default page name on that client's server is index.html.
In other words, if you typed www.example.com/folder1 into your browser, you would get a 404 error, as no index.html had been created.
But it should not have mattered as we were not linking to the folder, but to a specific page within the folder called index.htm.
Anyway, the point is that Google did not follow the link for one month. Then we changed it to index.html and the next day it was in Google.
We then applied the same treatment to two other clients where we were waiting for indexing and, hey presto, up they appeared.
So, Google is doing some form of checking for the existence of folders before visiting any pages within that folder, or it is checking what the default suffix is for the server, or something like this.
I have no idea why or how this is happening, but I thought posting it might solve a problem for some people who are not getting indexed. I would also like to know if anyone knows what is going on here.
The wrong page suffix can mean no indexing
If you want further info, visit this: [w3.org...]
Will the system allow a w3.org link? ;)
Hoping to be useful,
Herenvardö
PS: I edited to break a URL. A URI example that links to a real, unrelated page is confusing!
When Google sees folder/index.html, folder/index.htm or other variations, they are mapped in the index (rightly or wrongly) to folder/.
It therefore follows that if the server does not deliver the required page when folder/ is requested, a problem will follow.
You could argue that Google should check that the server performs the expected mapping, but this would take up valuable resources.
Kaled.
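The mapping described in the post above can be sketched in a few lines. This is only an illustration of the guess made in this thread, not documented Google behaviour; the function name canonical_url and the index.* pattern are assumptions made up for the example.

```python
import re
from urllib.parse import urlsplit, urlunsplit

# Assumption from this thread: any file named index.<suffix> is treated
# as the default page of its folder. This is a guess, not a published rule.
INDEX_PATTERN = re.compile(r"/index\.[A-Za-z0-9]+$")

def canonical_url(url: str) -> str:
    """Map .../folder/index.htm(l) to .../folder/ and leave other URLs alone."""
    parts = urlsplit(url)
    path = INDEX_PATTERN.sub("/", parts.path)
    return urlunsplit((parts.scheme, parts.netloc, path, parts.query, parts.fragment))

print(canonical_url("http://www.example.com/folder1/index.htm"))
# http://www.example.com/folder1/
print(canonical_url("http://www.example.com/folder1/about.html"))
# http://www.example.com/folder1/about.html
```

If an indexer worked this way, a link to folder1/index.htm would end up as a request for folder1/, which is exactly the URL that returned a 404 on the server in the original post.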
On Apache, this problem could easily have been avoided by adding or modifying the DirectoryIndex directive, which tells the server what file to serve when a directory index is requested. This can be done in httpd.conf or in .htaccess.
The directive
DirectoryIndex index.html index.htm
tells the server to look for index.html first and then for index.htm.
Blaming this on Google is wrong.
Jim
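For anyone unsure what a DirectoryIndex list does, here is a rough sketch of the lookup: the server walks the configured names in order and serves the first one that exists in the requested directory. The function name pick_directory_index and the sample file lists are made up for illustration; this is not Apache's code.

```python
from typing import Iterable, Optional

def pick_directory_index(files_in_dir: Iterable[str],
                         directory_index: Iterable[str]) -> Optional[str]:
    """Return the first configured index name that exists in the directory."""
    existing = set(files_in_dir)
    for name in directory_index:
        if name in existing:
            return name
    # No index file found: what the server answers then (an error page or a
    # generated listing) depends on the rest of its configuration.
    return None

folder = ["index.htm", "about.html"]  # hypothetical folder contents

print(pick_directory_index(folder, ["index.html"]))               # None
print(pick_directory_index(folder, ["index.html", "index.htm"]))  # index.htm
```

With only index.html configured, a folder that contains index.htm has no directory index at all, which matches the situation in the first post.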
I think this is important because I am sure there are instances where people create pages within folders without putting up an index page for that folder. Based on our experience, and my interpretation of it above, I believe those pages will not get indexed even though they are linked to.
Does anyone understand what I am saying? :)
(A) www.foo.com/bar
(B) www.foo.com/bar/
(C) www.foo.com/bar/index.html
In case A, the request might be for a file, a directory or the output of a script. If a correctly configured webserver finds that a script or a file named "bar" doesn't exist, it checks for the existence of a directory. If it is a directory, it responds with an HTTP redirect to B.
In case B, any webserver clearly knows that it's a directory and, as anything below "bar" is omitted, serves the directory index. If your webserver is configured to consider "index.html" as the directory index (see the DirectoryIndex post above), B will be equivalent to C because the same result is returned.
If you don't get C when you go to A (your case), then:
1) Your webserver is misconfigured and doesn't complete the directory names automagically. You do not want A and B to return the same thing; you want a redirect from A to B.
2) You should never use links of type A.
3) You didn't correctly create index pages.
Your situation has nothing to do with Google and everything to do with wrong linking on your website and your webserver configuration.
Button on.
Nova Recticulis
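The three cases in the post above can be put into a toy model. This is a simplified sketch, not how any particular webserver is implemented; the SITE layout, the DIRECTORY_INDEX setting and the resolve function are all hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical site layout: directory path -> names of the files in it.
SITE: Dict[str, List[str]] = {
    "/": ["index.html"],
    "/bar/": ["index.html", "page.html"],
}
DIRECTORY_INDEX = ["index.html", "index.htm"]  # assumed server setting

def resolve(path: str) -> Tuple[int, str]:
    """Return (status, target) for a request path, following cases A, B and C."""
    if path.endswith("/"):                    # case B: explicit directory request
        files = SITE.get(path)
        if files is not None:
            for name in DIRECTORY_INDEX:
                if name in files:
                    return 200, path + name   # same content as case C
        return 404, ""                        # simplified: real servers may differ
    directory, _, name = path.rpartition("/")
    if name in SITE.get(directory + "/", []): # case C: an existing file
        return 200, path
    if path + "/" in SITE:                    # case A: directory without the slash
        return 301, path + "/"                # redirect to case B
    return 404, ""

print(resolve("/bar"))             # (301, '/bar/')
print(resolve("/bar/"))            # (200, '/bar/index.html')
print(resolve("/bar/index.html"))  # (200, '/bar/index.html')
print(resolve("/bar/index.htm"))   # (404, '')
```

In this model A never returns content directly; it only redirects to B, and B and C end up at the same file.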
You keep telling me how to configure properly etc etc. That is not my point. I promise I will never ever again create an index.htm file when I should have created an index.html file, or I'll reconfigure my server etc etc.
I did not post this message to get a lecture on how to configure my server. I posted this message to help anyone else who may be experiencing a similar problem of pages not being indexed, and in the hope that someone could explain why Google chooses not to follow a perfectly valid link simply because there is no valid index page within that directory (or because the file type is not the default type for that server).
OK, I've calmed down now - sorry for the aggression, but I think I am not being understood properly except by tafkar!
I posted this message ..... in the hope that someone could explain why google chooses not to follow a perfectly valid link.......
You may disagree with Google's logic, but I think I explained what is (probably) going on. Essentially, this is a side-effect of a fix for a much bigger problem.
You were correct to post in the hope that others may learn from your mistake, and you are also correct, technically, in saying that Google is at fault. However, I do not believe Google would consider this to be a bug.
Kaled.
Sorry, I did not read your post carefully enough. You have certainly explained what you think the reason might be.
It is still strange to me though, because Google follows links, whereas what we are seeing here is that Google goes a step further and checks for folders even though the folder itself, without the file name, is not linked to anywhere.
Google probably treats any page index.* as a default page. They ought to explain this in their help notes for webmasters.
Kaled.
To avoid such a problem, always try to use .html, always check the URL before linking, and use full addresses only when they are truly needed. Examples:
An external link to http:// www.domain.com/index.html
GOOD: href="http://www.domain.com/"
UGLY: href="http://www.domain.com/index.html" will work... until the webmaster changes the homepage's URI
BAD: href="http://www.domain.com/index.htm" you cannot even be sure that it works!
If the link is internal, use href="/".
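If you are unsure what an href will actually point to, Python's urllib.parse.urljoin resolves it against the page it appears on, much as a browser or crawler would. This is just a quick check under assumed names: the page URL and the resolve_href helper are made up for the example. Note that an href with no scheme is treated as a relative path, not as a host name.

```python
from urllib.parse import urljoin

# Hypothetical page on which the links appear.
PAGE = "http://www.example.com/folder1/page.html"

def resolve_href(page_url: str, href: str) -> str:
    """Resolve an href against the URL of the page that contains it."""
    return urljoin(page_url, href)

print(resolve_href(PAGE, "http://www.domain.com/"))
# http://www.domain.com/
print(resolve_href(PAGE, "/"))
# http://www.example.com/
print(resolve_href(PAGE, "www.domain.com"))
# http://www.example.com/folder1/www.domain.com
```

The last line is the one to watch: without http:// in front, the link stays on your own site and points to a file that almost certainly does not exist.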
Index pages are not required for spidering. HOWEVER, if you have a page that Google believes is an index page, it is essential that your server is configured to deliver that page if the directory-only url is requested.
In summary, you will avoid this problem if you use a logical and coherent dir/filename structure and follow the linking advice I posted before.
Trying to be useful,
Herenvardö