Maybe I am missing something here, guys/girls, but Googlebot has started visiting our new 'indexable' site section. We have about 15,000 pages that can be indexed now, but I noticed that Googlebot visited yesterday and requested a directory URL rather than a file.
All links point to [blah.com...]
And this is what I see in our logs...
"/cms/<ITEM>/ HTTP/1.0¦404¦-¦-¦Googlebot/2.1"
It seems for some reason it didn't try to get the 'index.html' page.
Any ideas?
I checked the referring pages, and ALL of our URLs link to 'index.html' or another file, never just to the directory.
Thanks.
Most servers are set up to associate a reference to "/" with a list of several "standard" default pages, for example, index.html, index.htm, main.html, main.htm, etc. The server will serve whichever page in the list it finds first in response to a request for "/".
Also, in most servers, requests for a directory name, i.e. a URL without a filetype extension and missing the trailing "/", are redirected to the same URL with a "/" appended (in Apache, mod_dir does this with an external redirect sent back to the client). So "GET /<something>" gets redirected to "GET /<something>/", while "GET /<something>.html" would not be redirected.
In Apache, you can declare which filenames to serve for requests ending in "/" using the DirectoryIndex directive of mod_dir. Your server should be set up to serve index.html internally for such requests if the file exists, but evidently it isn't.
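As a minimal sketch of that setup, assuming Apache with mod_dir loaded, and assuming the line goes in the server config, a virtual host, or an .htaccess file where the Indexes override is allowed (the filename list is only an example, not your actual configuration):

```apache
# Files to try, in order, when a request ends in "/".
# The list is an assumption for illustration; use the default pages your site really has.
DirectoryIndex index.html index.htm
```

With something like this in place, a request for /cms/<ITEM>/ should be answered with /cms/<ITEM>/index.html rather than a 404, provided that file actually exists on disk at that path.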
See the Apache mod_dir and mod_autoindex modules, and Apache core Options directive documentation [httpd.apache.org] for more info.
Even if you are using a different server, these documents will help explain the problem. Then you can ask for help with whatever server you do use and get a better answer.
Most Web sites use just "www.domain.com" or "domain.com" as their "published" URL for print and TV ads, so the trend is definitely away from including the "/index.html" part.
Jim
<edited>Post in "Website Technology Issues" for further discussion</edit>
My theory is that Google has now set it so that anything with index.htm or index.html gets stripped back to "/". It might be for a few reasons, such as saving bandwidth on requests and storage of the filenames on their servers. BUT my best theory is that Google does this so that URLs look more attractive to the surfer on the SERPs, and it's a good idea I think.
If you are having trouble, you can just use mod_rewrite!
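One way that might look, as a rough sketch only: it assumes Apache with mod_rewrite enabled, that the rule lives in an .htaccess file in the document root, and that every /cms/<ITEM>/ directory really has an index.html to serve. The "cms" path is taken from the log line above; the pattern itself is a hypothetical illustration.

```apache
RewriteEngine On

# Internally map /cms/<item>/ to /cms/<item>/index.html.
# No [R] flag, so the client (Googlebot included) still sees the directory URL.
# In .htaccess context the pattern is matched without the leading slash.
RewriteRule ^cms/([^/]+)/$ /cms/$1/index.html [L]
```

If DirectoryIndex can be made to work, that is probably the simpler fix; a rewrite like this is more of a fallback when the directory URLs are not backed by a plain index file lookup.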