Forum Moderators: Robert Charlton & goodroi
OK, I posted something similar when I first spotted this a few months ago, and no one was able to help. I have now spotted something additional and similar happening, and wondered if anyone has seen this problem before, and thus knows of a possible solution.
My logs tell me that googlebot, in one pass of my site, is accessing:
a) /islands/hawaii
b) /islands/hawaii/index.htm
c) /islands/hawaii/index.html
Now, having looked at the Google sitemap, /islands/hawaii/ is the URL that googlebot is told exists.
wget results for each:
a) 301 to /islands/hawaii/
b) 404, as no file with this extension exists
c) 200: this IS the correct file that exists there
Has anyone seen anything like this from googlebot before?
If that is all they are sampling, then problems will still continue: index.php, home.htm, and default.asp, among others, are commonly used too.
The 301 from /folder to /folder/ is correct.
The 200 for /folder/index.html is also correct.
What do you get for /folder/ now? I would assume 200 again.
Beware of a 301 redirect from non-www to www where the default server name is domain.com, where you are linking to a folder, and where you forget to add the trailing / to the URL in the link.
If you forget the trailing / then your link to www.domain.com/folder will first be redirected to domain.com/folder/ {without www!} before arriving at the required www.domain.com/folder/ page.
The intermediate step, at domain.com/folder/, will kill your listings. Luckily, this effect is very easy to see if you use Xenu's Link Sleuth to check your site: when you generate the sitemap it reports double the number of pages that you actually have, with half of the pages having a title of "301 Moved".
Now re-read the above, and invert www for non-www for each part of the explanation.
The remedy is to always add the trailing / to any URL that points to a folder: /folder/ or http://www.domain.com/folder/ etc.
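The two-hop chain can be sketched as a toy simulation. This is purely a hypothetical model of the behavior described above, not Apache's actual logic; domain.com stands in for the default server name used in these posts:

```python
# Purely a hypothetical model of the two-step redirect described above.
# Assumptions: the default server name is domain.com (non-www), the server
# builds the trailing-slash redirect from that name (losing the www), and
# a separate non-www -> www 301 rule is in place.
SERVER_NAME = "domain.com"

def resolve(url):
    """Return the chain of URLs a request passes through."""
    chain = [url]
    host, _, path = url.partition("/")
    # Step 1: a folder URL without the trailing slash gets redirected.
    # The redirect target is built from SERVER_NAME, so www is lost here.
    if path and not path.endswith("/") and "." not in path.rsplit("/", 1)[-1]:
        url = SERVER_NAME + "/" + path + "/"
        chain.append(url)
        host, _, path = url.partition("/")
    # Step 2: the non-www -> www canonical redirect.
    if not host.startswith("www."):
        url = "www." + SERVER_NAME + "/" + path
        chain.append(url)
    return chain
```

With this model, resolve("www.domain.com/folder") passes through domain.com/folder/ {without www!} before arriving at www.domain.com/folder/, while resolve("www.domain.com/folder/") goes straight through with no redirects at all.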
RewriteEngine On
RewriteCond %{HTTP_HOST} ^agreatsite\.com$
RewriteRule ^(.*)$ http://www.agreatsite.com/$1 [R=301,L]
The wget info for accessing a folder name on the server, e.g.:
wget www.agreatsite.com/foldername/ = 200
My last post, above, was warning that if you link to <a href="http://www.domain.com/folder.name">some page</a> on your page, when your default server name is actually domain.com {without www}, and you do have a 301 redirect set up from non-www to www in your .htaccess file, you will still cause yourself a problem.
Your call for www.domain.com/folder.name will be redirected to domain.com/folder.name/ {without www} before being redirected to the www.domain.com/folder.name/ URL that you really wanted.
Replace www with non-www, and vice versa, in the above for sites that have www.domain.com as the default server name, to see that they too will have a similar problem.
That would be completely correct and bots should handle that situation just fine. Your situation is different to the one that I described in the previous post. Even if it were a two-step process I still think that bots should handle it fine.
In any case you can help yourself simply by always adding the trailing / to any folder URL that you link to, every time.
I can still see this as a unique case. There are, what, 3000 webmasters that have probably had a look at this post in the last few days, and 0 have the same problem... I think I may need to change servers/hosts.
No that isn't it either. I never mentioned index.html at all.
It is always incorrect to add a trailing / after a filename.
It is always correct to add one after a folder name.
Please re-read my posts carefully. I can't explain it any better than what I already wrote. Every word is important, and the examples are exactly correct. Look carefully at each URL to see if www is there or not. Look carefully at each URL to see if a trailing / is there or not. Follow the trail of what you ask for, and where you are redirected. Observe that for a site supposedly using www in the redirect and www in the links that there is an intermediate step WITHOUT THE WWW if you leave off the trailing / in the link on your site.
links go to /folder/index.htm
but googlebot decided, just for the heck of it, to also see if /folder/ exists. In other words, googlebot is actually MAKING UP a url, all by itself, then requesting it, just to see what happens.
So what happens is that you get a duplicate content issue if you have the default index page set to index.htm, index.html, index.php, and so on.
The solution is to change all your in-site URLs to never show /folder/index.htm, and instead to always point to /folder/.
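As a rough sketch of that link cleanup, the index filename can be stripped from internal links with a simple regex pass over the HTML. This is a hypothetical helper, assuming plain href="..." attributes and the index filenames mentioned in this thread:

```python
import re

# Hypothetical helper: strip the default index filename from internal links
# so <a href="/folder/index.htm"> becomes <a href="/folder/">.
# Assumes simple double-quoted href attributes; index name list is an
# assumption based on the extensions discussed above.
INDEX_RE = re.compile(r'(href="[^"]*/)index\.(?:htm|html|php)(")')

def canonicalize_links(html):
    """Rewrite links to /folder/index.htm (etc.) as links to /folder/."""
    return INDEX_RE.sub(r"\1\2", html)
```

For example, canonicalize_links('<a href="/folder/index.htm">home</a>') returns the same anchor pointing at /folder/, while links to ordinary pages such as /folder/page.htm are left untouched.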
This is an apache mod_rewrite rule which has been noted here several times before:
=========================
#this is what handles the default folder index page, including /
DirectoryIndex index.htm index.html index.php
#start rewrite stuff
# Options.. is not always needed, try it without it to see if it works
Options +FollowSymLinks
RewriteEngine On
# this is the rewrite to /folder/ - note, this is only two lines of code if it wraps, both lines start with Rewrite...
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /([a-z-]+)/index\.htm\ HTTP/
RewriteRule ^([a-z-]+)/index\.htm$ /$1/ [R=301,L]
======================
The above rewrite rule handles any lower-case folder name (one level deep), including the - character; for example, /my-folder/index.htm is handled, but /My-folder/index.htm is not.
To add upper case support change ([a-z-]+) to ([A-Za-z-]+)
And then you have to change the internal navigation URLs to get rid of the /folder/index.htm and change them all to /folder/.
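The folder-matching pattern, with and without the upper-case change above, can be sanity-checked against sample request paths using an equivalent regex, for example in Python:

```python
import re

# Equivalents of the two RewriteRule patterns above, for checking which
# request paths each one matches: lower-case only vs. upper-case added.
lower_only = re.compile(r"^([a-z-]+)/index\.htm$")
with_upper = re.compile(r"^([A-Za-z-]+)/index\.htm$")
```

Note that neither pattern matches nested folders (the character class cannot contain /) or index.html (the rule text names index.htm only), so anyone relying on those cases would need to widen the pattern.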
I think this is what is being asked, although I'm not certain.
Whatever you do, do not name /folder/ as /folder (that is, with no trailing slash).
All shared IIS site holders are out of luck, you have no practical solution except to move to Apache hosting.
======
Note: if this is not what the original poster is asking, it will almost certainly be what other people reading this thread are having problems with, since googlebot started doing this a while ago. It's very strange behavior; why googlebot feels the need to make up URLs and see if they exist is beyond me. It's just creating more problems for itself.
I give google engineers a big fat 0 for this boneheaded decision, which can only create problems, both internally in google and externally for people who will be seeing duplicate content issues that are not their fault. Bad google, bad.
If they had to ask for an exact filename then they would have to ask for index.html, index.htm, index.php, index.php4, default.asp, home.htm, index.cfm, index.nsf, and so on.
There would have to be a very complex routine if more than one was discovered to be valid.
What would happen if they parsed the list, asking your server for each one in turn, and discovered a file but it was not the same one that the server was serving as its default index file when the server was asked for www.domain.com/ or for /folder/? How would that be resolved?
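A hypothetical sketch of that probing routine, with the server modeled as a plain set of paths, shows the ambiguity:

```python
# Hypothetical sketch of the probing routine described above: ask for each
# candidate default-index filename in turn and collect those that exist.
# Here the server is modeled as a set of paths; in reality each probe would
# be an HTTP request, and the candidate list is an assumption drawn from
# the filenames mentioned in this thread.
CANDIDATES = ["index.html", "index.htm", "index.php", "default.asp", "home.htm"]

def probe_folder(existing_paths, folder="/folder/"):
    """Return every candidate index filename present in the folder."""
    return [name for name in CANDIDATES if folder + name in existing_paths]
```

If this returns more than one name, there is no way to tell from outside which of them the server actually serves for /folder/, which is exactly the resolution problem described above.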
There is no way to discover what the actual index file's filename really is for a domain or for a folder request: that is why you should NEVER state the name of that index file in any link.
Since site != domain, and domain != site it is quite possible for the index file to be different in different folders; think geocities, yahoo, or any place where each folder is another site, for example.
[edited by: g1smd at 6:42 pm (utc) on Dec. 27, 2005]
Your theory seems to fit in with my experiences, and yes, my pages do point to some URL/folder/index.html files as well as URL/folder/ files. The sites have been up for 5 years or so, so it's maybe worth doing some housekeeping to bring all the links into a uniform format.
I was really wondering how googlebot got hold of the URL/folder and URL/folder/index.htm URLs in the first place.
Someone a few months ago mentioned to me that they may just be testing the server configuration but as to why ... who knows exactly?
Thanks for letting me know I was not alone in this weirdness happening to my server and URLs.
Starting especially with Bourbon, I saw an increasing number of errors catching sites out. This led to cleaning up sites, and moving sites to Apache hosting, so we could take care of these issues easily.
I started seeing googlebot make requests for /folder/ a while ago. I'm now going through sites and cleaning up a range of sloppy webmastering stuff. It's in general a good idea to tighten up on this stuff no matter what; it's not very hard, and as long as you have access to rewrite rules, the corrections are not that difficult once you have the right code.
I'm just going to make all new sites like this in the future, not giving the bots anything they can choke on; that means less work over time and less stress. If it's all done technically well to begin with, updates will cause less worry over time. Less temptation, too, to tweak stuff during an update, which I think in general isn't a good idea.
Again, this particular issue was not technically sloppy webmastering; it's googlebot asking for something that you shouldn't need to protect against, since no such links exist internally or externally. One more case of catering to the weaknesses and errors of the Google spidering systems. It's getting old, and it shouldn't be necessary; with Google's resources this is really kind of inexcusable, but I guess we just have to deal with it.
Happy western new year; can't remember when the Chinese one is this year. Happy Year of the Dog when it comes.