Googlebot accessing /index.html, /index.htm and no / at all

What is googlebot playing at?

stinkfoot

7:51 am on Dec 23, 2005 (gmt 0)

Hi Guys,

OK, I posted something similar when I first spotted this a few months ago, and no one was able to help. I have now spotted something additional and similar happening, and wondered if anyone has seen this problem before and knows of a possible solution.

My logs tell me googlebot in one pass of my site is accessing
a) /islands/hawaii
b) /islands/hawaii/index.htm
c) /islands/hawaii/index.html

Having looked at the Google sitemap, /islands/hawaii/ is the URL that Googlebot is told exists.

Wget results for each:
a) 301 to /islands/hawaii/
b) 404 as this extension does not exist
c) 200 this IS the correct file that exists there
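
A minimal sketch of how the same three checks could be scripted in Python instead of wget. The function names are made up for illustration and the host in the usage note is a placeholder. It sends HEAD requests over plain HTTP and does not follow redirects, so a 301 shows up as a 301 together with its Location header. A server may answer HEAD differently from the GET that wget or Googlebot sends, so treat the output as a rough check only.

```python
import http.client
from urllib.parse import urlsplit


def head_status(url, timeout=10):
    """Send a single HEAD request and return (status, Location header or None).

    http.client does not follow redirects, so a 301 is reported as a 301.
    """
    parts = urlsplit(url)
    path = parts.path or "/"
    if parts.query:
        path += "?" + parts.query
    conn = http.client.HTTPConnection(parts.netloc, timeout=timeout)
    try:
        conn.request("HEAD", path)
        response = conn.getresponse()
        return response.status, response.getheader("Location")
    finally:
        conn.close()


def describe(url, status, location):
    """Format one result line: URL, status and, for redirects, the target."""
    if location:
        return f"{url} = {status} to {location}"
    return f"{url} = {status}"


def check_all(base, paths, fetch=head_status):
    """Check each path under base and return one formatted line per URL.

    `fetch` can be swapped out, for example to replay results from a log.
    """
    return [describe(base + path, *fetch(base + path)) for path in paths]
```

For the three URLs above, one would call check_all("http://www.agreatsite.com", ["/islands/hawaii", "/islands/hawaii/index.htm", "/islands/hawaii/index.html"]) and print the returned lines.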

Has anyone seen anything like this from googlebot before?

g1smd

11:19 pm on Dec 24, 2005 (gmt 0)

Must be a part of their "canonical, supplemental, duplicate content" {hereinafter known as "The CSDC Fixes"} remedies that might happen in the next month or two.

If that is all they are sampling, then problems will still continue: index.php, home.htm, default.asp, among others, are commonly used too.

The 301 from /folder to /folder/ is correct.
The 200 for /folder/index.html is also correct.

What do you get for /folder/ now? I would assume 200 again.

g1smd

11:45 pm on Dec 24, 2005 (gmt 0)

This seems a good time to mention another problem that webmasters inadvertently cause themselves from time to time:

Beware of a 301 redirect from non-www to www where the defaultsitename is domain.com and where you are linking to a folder, and where you forget to add the trailing / to the URL in the link.

If you forget the trailing / then your link to www.domain.com/folder will first be redirected to domain.com/folder/ {without www!} before arriving at the required www.domain.com/folder/ page.

The intermediate step, at domain.com/folder/, will kill your listings. Luckily, this effect is very easy to see if you use Xenu's Link Sleuth to check your site: when you generate the sitemap, it reports double the number of pages that you actually have, with half of the pages having a title of "301 Moved".

Now re-read the above, swapping www and non-www in each part of the explanation.

The remedy is to always add the trailing / to any URL that points to a folder: /folder/ or http://www.domain.com/folder/ etc.
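
The redirect trail described above can be sketched as a toy model in Python. This is only an illustration under stated assumptions: the default site name is domain.com, the 301 goes from non-www to www, /folder is a hypothetical directory, and the trailing-slash redirect is built from the default site name rather than from the host that was requested. It does not talk to a real server.

```python
from urllib.parse import urlsplit

DEFAULT_SERVER_NAME = "domain.com"   # assumed default site name (no www)
CANONICAL_HOST = "www.domain.com"    # target of the non-www -> www 301
FOLDERS = {"/folder"}                # hypothetical directories on the server


def next_hop(url):
    """Return the URL this toy server would 301 to, or None if it answers directly."""
    parts = urlsplit(url)
    host, path = parts.netloc, parts.path
    # Folder requested without its trailing slash: the slash redirect is
    # built from the default site name, not from the requested host.
    if path in FOLDERS:
        return f"http://{DEFAULT_SERVER_NAME}{path}/"
    # Host canonicalisation: anything that is not the www host goes to www.
    if host != CANONICAL_HOST:
        return f"http://{CANONICAL_HOST}{path}"
    return None


def redirect_chain(url):
    """Follow the toy redirects and return every URL visited, in order."""
    chain = [url]
    while (target := next_hop(chain[-1])) is not None:
        chain.append(target)
    return chain
```

In this model a link to http://www.domain.com/folder passes through http://domain.com/folder/ before it reaches http://www.domain.com/folder/, while a link to http://www.domain.com/folder/ is answered directly with no redirect at all.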

stinkfoot

1:53 pm on Dec 26, 2005 (gmt 0)

Hi, and thanks very much for the info. Having checked, it would seem that the 301 has the correct / on the end, I believe. Not being an Apache wizard, though, I thought I had best post it here for other / better eyes to check out:

RewriteEngine on
RewriteCond %{HTTP_HOST} ^agreatsite.com$
RewriteRule ^/(.*) [agreatsite.com...] [R=301]

The wget result for accessing a folder name on the server, e.g.:

wget www.agreatsite.com/foldername/ = 200

g1smd

2:26 pm on Dec 26, 2005 (gmt 0)

Err, no. You misunderstood what I meant by adding a trailing / - this needs to be added to links ON your HTML page when you call a folder.

My last post, above, was warning that if you link to <a href="http://www.domain.com/folder.name">some page</a> on your page, when your defaultservername is actually domain.com {without www} and you do have a 301 redirect set up from non-www to www in your .htaccess file, that you will still cause yourself a problem.

Your call for www.domain.com/folder.name will be redirected to domain.com/folder.name/ {without www} before being redirected to the www.domain.com/folder.name/ URL that you really wanted.

Replace www with non-www and vice versa in the above, for sites that have www.domain.com as the defaultservername to see that they too will have a similar problem.

BillyS

2:46 pm on Dec 26, 2005 (gmt 0)

What about this situation? A user requests a URL which does not exist on the website:

foo.com/xyz

First is 301'd to the www version:

www.foo.com/xyz

Then 301'd to add the trailing slash:

www.foo.com/xyz/

Then 404'd because the page does not exist in the first place. Is there any way around this?

g1smd

3:20 pm on Dec 26, 2005 (gmt 0)

If the defaultservername is www.domain.com then the middle step wouldn't usually exist. Any access to domain.com/folder would be straightaway redirected to www.domain.com/folder/, in one step, and then 404'd.

That would be completely correct and bots should handle that situation just fine. Your situation is different to the one that I described in the previous post. Even if it were a two-step process I still think that bots should handle it fine.

In any case you can help yourself simply by always adding the trailing / to any folder URL that you link to, every time.
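
A rough Python sketch of how a page could be checked for folder links that are missing the trailing /. The class and function names are hypothetical, and the list of folder paths has to be supplied by hand, because a URL on its own does not say whether its last segment is a file or a folder.

```python
from html.parser import HTMLParser
from urllib.parse import urlsplit


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def links_missing_slash(html, folders):
    """Return hrefs that point at a known folder but omit the trailing /.

    `folders` is a set of folder paths written without the trailing slash,
    e.g. {"/folder"}.
    """
    collector = LinkCollector()
    collector.feed(html)
    return [href for href in collector.hrefs
            if urlsplit(href).path in folders]
```

Every href it returns is a candidate for having the trailing / added; links that already end in / or that point at files are left alone.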

stinkfoot

12:10 am on Dec 27, 2005 (gmt 0)

Ah, I think I understand you now.
wget www.agreatsite.com/index.html/ = 404
Been using Xenu for years, no dead internal links.

I can still see this as a unique case. There are what, 3000 webmasters who have probably had a look at this post in the last few days, and 0 have the same problem. I think I may need to change servers / hosts.

g1smd

12:42 am on Dec 27, 2005 (gmt 0)

>> wget www.agreatsite.com/index.html/ = 404 <<

No that isn't it either. I never mentioned index.html at all.

It is always incorrect to add a trailing / after a filename.

It is always correct to add one after a folder name.

Please re-read my posts carefully. I can't explain it any better than what I already wrote. Every word is important, and the examples are exactly correct. Look carefully at each URL to see if www is there or not. Look carefully at each URL to see if a trailing / is there or not. Follow the trail of what you ask for, and where you are redirected. Observe that for a site supposedly using www in the redirect and www in the links that there is an intermediate step WITHOUT THE WWW if you leave off the trailing / in the link on your site.

2by4

1:08 am on Dec 27, 2005 (gmt 0)

we've been seeing this issue, if I understand the original poster correctly. This is what we're seeing:

links go to /folder/index.htm

but googlebot decided, just for the heck of it, to also see if /folder/ exists. In other words, googlebot is actually MAKING UP a url, all by itself, then requesting it, just to see what happens.

So what happens is that you get a duplicate content issue if you have the default / index page set to index.htm or index.html or index.php and so on, because /folder/ and /folder/index.htm then both return the same page.

The solution is to change all your in-site URLs so they never show /folder/index.htm and instead always point to /folder/

This is an apache mod_rewrite rule which has been noted here several times before:

=========================
#this is what handles the default folder index page, including /
DirectoryIndex index.htm index.html index.php

#start rewrite stuff
# Options.. is not always needed, try it without it to see if it works

Options +FollowSymLinks
RewriteEngine On

# this is the rewrite to /folder/ - note, this is only two lines of code if it wraps, both lines start with Rewrite...

RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /([a-z-]+)/index\.htm\ HTTP/
RewriteRule ^([a-z-]+)/index\.htm$ /$1/ [R=301,L]

======================
the above rewrite rule handles any lower case folder name, including -, for example, /my-folder/index.htm is handled, but /My-folder/index.htm is not handled.

To add upper case support change ([a-z-]+) to ([A-Za-z-]+)

and then you have to change the internal navigation urls to get rid of the /folder/index.htm and change them all to /folder/
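
To see which URLs a pattern like this catches, here is a small Python sketch. It is an approximation, not Apache itself: it translates only the RewriteRule pattern (with the dot escaped) into Python's re module, does not model the THE_REQUEST condition, and the function name is made up for illustration.

```python
import re

# Python translation of the RewriteRule pattern, with the dot escaped.
# In .htaccess context the rule sees the path without its leading slash.
LOWER_ONLY = re.compile(r"^([a-z-]+)/index\.htm$")
WITH_UPPER = re.compile(r"^([A-Za-z-]+)/index\.htm$")


def redirect_target(path, pattern=LOWER_ONLY):
    """Return the /folder/ URL the rule would 301 to, or None if it does not match."""
    match = pattern.match(path.lstrip("/"))
    return f"/{match.group(1)}/" if match else None
```

With the lower-case pattern, /my-folder/index.htm maps to /my-folder/ and /My-folder/index.htm is not matched. Passing WITH_UPPER handles the mixed-case name. Note that in this translation neither pattern matches /my-folder/index.html or a nested path such as /islands/hawaii/index.htm, so those would need patterns of their own.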

I think this is what is being asked, although I'm not certain.

Whatever you do, do not link to /folder/ as /folder, that is, without the trailing slash.

All shared IIS site holders are out of luck, you have no practical solution except to move to Apache hosting.

======
Note: if this is not what the original poster is asking, it will almost certainly be what other people reading this thread are having problems with, since googlebot started doing this a while ago. It's very strange behavior. Why googlebot feels the need to make up urls and see if they exist is beyond me; it's just creating more problems for itself.

I give google engineers a big fat 0 for this boneheaded decision, which can only create problems, both internally in google and externally for people who will be seeing duplicate content issues that are not their fault. Bad google, bad.

g1smd

6:39 pm on Dec 27, 2005 (gmt 0)

I think that by asking for www.domain.com/ or www.domain.com/folder/ (rather than a URL including an index filename) they can still access your site even if you should change your index.html to be index.php at any time.

If they had to ask for an exact filename then they would have to ask for index.html, index.htm, index.php, index.php4, default.asp, home.htm, index.cfm, index.nsf, and so on.

There would have to be a very complex routine if more than one was discovered to be valid.

What would happen if they parsed the list, asking your server for each one in turn, and discovered a file but it was not the same one that the server was serving as its default index file when the server was asked for www.domain.com/ or for /folder/? How would that be resolved?

There is no way to discover what the actual index file filename really is for a domain or for a folder request: that is why you should NEVER state the name of that index file in any link.

Since site != domain, and domain != site it is quite possible for the index file to be different in different folders; think geocities, yahoo, or any place where each folder is another site, for example.
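
The ambiguity can be shown with a small Python sketch. It is a toy model under assumptions: the folder contents, the index order and the function names are all invented for illustration, and the "first listed name that exists wins" rule stands in for a DirectoryIndex-style setting.

```python
def served_index(files, directory_index):
    """Return the file the server would serve for a /folder/ request.

    Models a DirectoryIndex-style list: the first listed name that exists wins.
    """
    for name in directory_index:
        if name in files:
            return name
    return None


def guessed_indexes(files, guesses):
    """Return every guessed index filename that exists in the folder."""
    return [name for name in guesses if name in files]


# Hypothetical folder: an old index.html was left behind after a move to PHP.
files = {"index.html", "index.php"}
directory_index = ["index.php", "index.html"]
guesses = ["index.html", "index.htm", "index.php", "default.asp"]

served = served_index(files, directory_index)
found = guessed_indexes(files, guesses)
```

Here the server answers /folder/ with index.php, but a bot working through the guess list finds index.html first and index.php second, and has no way to tell from the outside which of the two is the real default.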

[edited by: g1smd at 6:42 pm (utc) on Dec. 27, 2005]

stinkfoot

12:35 am on Dec 31, 2005 (gmt 0)

2by4 ah so it is not just me .. phew

Your theory seems to fit in with my experiences, and yes, my pages do point to some URL/folder/index.html files as well as URL/folder/ files. The sites have been up for 5 years or so, so it may be worth doing some housekeeping to make all the links into a uniform format.

I really wondered how googlebot got hold of the files URL/folder and URL/folder/index.htm

Someone a few months ago mentioned to me that they may just be testing the server configuration but as to why ... who knows exactly?

Thanks for letting me know I was not alone in this weirdness happening to my server and URLs.

2by4

11:10 pm on Dec 31, 2005 (gmt 0)

stinkfoot, I wasn't sure this is what you were seeing, glad I guessed right.

Starting especially with Bourbon, I saw an increasing number of errors catching sites out. This led to cleaning up sites and moving them to Apache hosting so we could take care of these issues easily.

I started seeing googlebot make requests for /folder/ a while ago, and I'm now going through sites and cleaning up the range of sloppy webmastering stuff. In general it's a good idea to tighten up on this stuff no matter what. As long as you have access to rewrite rules, the corrections are not that difficult once you have the right code.

I'm just going to make all new sites like this in the future, not giving the bots anything they can choke on, that means less work over time, less stress. If it's all done technically well to begin with, updates will cause less worry over time. Less temptation too to tweak stuff during an update, which I think in general isn't a good idea.

Again, this particular issue was not technically sloppy webmastering, it's googlebot asking for something that you shouldn't need to protect against since no such links exist internally or externally. One more case of catering to the weakness and error of the google spidering systems. It's getting old. And shouldn't be necessary, with g's resources this is really kind of inexcusable, but I guess we just have to deal with it.

happy western new year, can't remember when the chinese one is this year. Happy year of the dog when it comes.