While checking my Google site maps for two different sites that are set up the same way, I discovered that Google gets a 403 error when it tries to access that directory. That presumably means none of these articles are being indexed.
I really don't want to dump those articles in the main directory, because it is a pain to manage them in there with all the other pages. However, not having them indexed is also an issue. If Google's bots get this error when they try to access that directory and index the articles, I have to assume other search engines do as well.
I am not sure why this would happen in the first place. The directory is not write protected and contains only HTML and PDF docs, and other folders I put things in don't seem to have the same problem. I am looking for some advice on how to handle this.
One thing does occur to me: Are these articles accessible via links, or are you just telling Google to index them via the site map? If they are in a directory and NOT linked to from anywhere else, Google bot may not have permission to browse the directory. Hence, the 403.
The articles themselves are on HTML pages inside the folder, but there is also an HTML page outside the folder called Articles.htm. It lists the article names, and each name is a link to the first page of that article in the folder. As far as I can tell, there is no reason why the Google bots cannot simply follow the links into the folder and index all the pages. They are all connected, and there is nothing special about the folder that should prevent access.
A 403 error code means the user agent cannot access the requested resource. It may mean the wrong username and/or password were sent in the request, or the permission settings forbid access to the resource, or perhaps even that no default directory index page is present. The Apache directive DirectoryIndex defines the default index page name(s).
If you are not requiring authentication or if the page is not a cgi script or something that requires special permission settings for access to the resource, then maybe you are looking at an index error. Are you certain Google isn't telling you that it is receiving that error for a missing index page? Perhaps it can see the other resources, but for some reason it is also attempting to find an index page in a directory where an index page does not exist?
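To make the missing-index case concrete, here is a rough Python sketch of how a server might decide what to send for a request to a bare directory URL. It is a simplified model, not Apache's actual logic: the function name `directory_response`, the default index names, and the `listing_enabled` flag are all assumptions for illustration.

```python
def directory_response(files, index_names=("index.html", "index.htm"),
                       listing_enabled=False):
    """Return the status a simplified server sends for a directory URL.

    files: set of file names present in the directory (hypothetical input).
    index_names: names the server treats as a default index page.
    listing_enabled: whether an auto-generated file listing is allowed.
    """
    # If any configured index file exists, the server serves it.
    if any(name in files for name in index_names):
        return 200
    # No index file: a generated listing is served only if it is allowed.
    if listing_enabled:
        return 200
    # No index file and no listing allowed: the request is refused.
    return 403


# A folder holding only article pages, requested by its directory URL.
articles = {"article1.htm", "article2.htm"}
print(directory_response(articles))
```

In this model the individual article pages are still reachable by their own URLs; only the request for the directory itself is refused.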
but for some reason it is also attempting to find an index page in a directory where an index page does not exist?
I suppose this is possible, but my question would be: why look for an index page in the directory at all? The articles page on the outside points to the exact page in the folder where each file is, so I can't see any reason why it should even need one.
The directory is not password protected and does not have a CGI script or any other type of script associated with it.
http://www.example.com/articles/
If so, Google is going to try and follow that link and since there is no directory index file in there you get the 403 forbidden.
Ah, we might be getting somewhere. Yes, the directory is in my site map. If I understand your comment, then as long as the directory is on my XML site map I have to have an index page in that directory for Google to crawl to. Is that correct?
Here are two possible solutions. Which would you recommend?
1. I move the articles page that is currently outside the directory into it and rename it index.htm. Since it will be in the actual articles directory, it won't cause a problem.
2. I put another page in the directory called index.htm and have the articles page point to it. That page would just be a listing of the article titles, the same as the page on the outside.
If I do number two, will Google penalize me for having the exact same page both inside and outside the directory, but named differently?
It never dawned on me that having it on the site map with no index page in the directory would cause this issue.
http://www.example.com/articles/article1.htm
http://www.example.com/articles/article2.htm
http://www.example.com/articles/article3.htm
http://www.example.com/articles/article4.htm
http://www.example.com/articles/article5.htm
http://www.example.com/articles/
but if you do not have an index in the directory then you should not have this in the sitemap XML: http://www.example.com/articles/
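As a rough illustration of that rule, the Python sketch below builds sitemap XML from a list of URLs and leaves out any directory URL (one ending in "/") unless you tell it that directory has an index page. The function name `build_sitemap` and the `dirs_with_index` parameter are made up for this example; the namespace is the standard sitemaps.org one.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(urls, dirs_with_index=()):
    """Build sitemap XML, skipping directory URLs that have no index page."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url in urls:
        # A URL ending in "/" is a request for the directory itself; keep it
        # only if that directory is known to contain an index page.
        if url.endswith("/") and url not in dirs_with_index:
            continue
        ET.SubElement(ET.SubElement(urlset, "url"), "loc").text = url
    return ET.tostring(urlset, encoding="unicode")


urls = [
    "http://www.example.com/articles/article1.htm",
    "http://www.example.com/articles/article2.htm",
    "http://www.example.com/articles/",
]
sitemap_xml = build_sitemap(urls)
print(sitemap_xml)
```

Once an index.htm exists in the folder, passing `dirs_with_index=["http://www.example.com/articles/"]` keeps the directory URL in the output.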
This is exactly how it was set up, including not having the index.htm inside the folder, which is probably why I was getting the error. It never dawned on me this would cause the problem, because I thought Google would simply find the Articles.htm page outside the directory and follow the links on it into the folder. It did not do that; it used the XML site map instead, which was set up like you said.
I changed it by putting the Articles.htm inside the directory and changed its name to index.htm. I updated the site map and re-submitted it to Google. It appears it *might* have solved the problem. I am cautiously optimistic as it seems the errors have disappeared but I want to give it a few days to make sure it solved the issue and it doesn't come back.
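If you would rather not wait on the crawl reports, something like the following Python sketch can request every URL in a sitemap and list the ones that do not come back with a 200. The function names `fetch_status` and `failing_urls` are invented for this example, and a server may answer a script differently than it answers Googlebot, so treat the result as a rough check only.

```python
import urllib.error
import urllib.request
import xml.etree.ElementTree as ET

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def fetch_status(url):
    """Return the HTTP status code for a HEAD request to url."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=10) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # urllib raises HTTPError for 4xx/5xx answers; the code is the status.
        return err.code


def failing_urls(sitemap_xml, fetch=fetch_status):
    """Return (url, status) pairs for sitemap URLs that do not answer 200."""
    root = ET.fromstring(sitemap_xml)
    results = []
    for loc in root.findall("sm:url/sm:loc", NS):
        url = loc.text.strip()
        status = fetch(url)
        if status != 200:
            results.append((url, status))
    return results


# Example use (makes real network requests; the file name is an assumption):
# with open("sitemap.xml", encoding="utf-8") as handle:
#     print(failing_urls(handle.read()))
```

The `fetch` parameter is there so the check can be tried with a stand-in function before pointing it at a live site.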