But to this day, the PR is still the same, and Google has been crawling my pages every day. Plus, all the backlinks are still listed too. Is this normal for a site that is banned?
I am sure the cause of the site being de-indexed was that my .htaccess was blocking Googlebot. Does anyone have any personal experience with getting dropped because of robots.txt or .htaccess errors, and getting back in?
But don't look at the PR - it won't tell you very much that helps in the present time. Look at your logs for the traffic you are getting, and see where you rank on those search terms. Look at earlier, "normal" traffic days in the log and see what search terms you are no longer appearing for. This information may lead you somewhere.
You also mentioned that when you enter site:example.com you get 118,000 results - but they are not from your site. If that is still the case, I would write to Google - it sounds like a bug may be involved.
If you don't have a Webmaster Tools account, I would suggest you set one up. It will provide you with a good vehicle to communicate with Google and authenticate that you are the legitimate owner.
And yes, sites that have made a technical error with robots.txt or htaccess can recover. So if that's the case, just fix it and expect the best.
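For reference, the classic accidental block is a robots.txt like this (a hypothetical example, not your actual file):

# hypothetical robots.txt -- this blocks every compliant crawler from the entire site
User-agent: *
Disallow: /

The intended version is usually "Disallow:" with nothing after it, which permits everything, or a Disallow line for each directory that really should be kept out.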
When I use site:example.com, I get no results, but when I enter example.com itself, I get hundreds of pages. But when I use link:example.com, all the backlinks still show, which makes me think that the site isn't being penalized.
I see on the diagnostic tab of the Webmaster Tools page that I was last spidered last week, and there were 38 HTTP header errors, all 403s, likely caused by the .htaccess errors.
It will be interesting to see if I ever actually am able to find out what happened and how long the resolution process takes.
The background on my issues with Google de-indexing my site began back on November 4, when they stripped all the pages of my site from the index completely.
The diagnostic tab on Webmaster tools said that I had HTTP header errors for all my directories -- 403 errors. First I thought it might have been something in my htaccess file that was the cause, so I removed almost everything in it.
Then, this past December 5 -- 30 days later -- the site came back into the index, with a few pages returning.
Today I see again that there are 403 HTTP header errors for every directory on my site, and there is nothing blocking them on my end. When I try any of the directories they have had trouble reaching these past few days, I have no problem, and according to my logs, nobody else does either. Does anyone have any ideas what could be wrong here? The frustrating thing is that there is no way to communicate with these people either.
Is there something wrong with the way Google handles 403s lately? Could it be that the best thing to do is not to have a sitemap?
there is nothing blocking them on my end.
I would definitely want to be sure on that. Google reports to you, consistently, that your server tells it "this URL is forbidden to you". This is not happening in a widespread way on many websites.
So I would still strongly suspect some server configuration gone wrong, rather than a Google bug. For example, do you run a "bad bot trap" or "hot-link protection" or in any way deny response to certain IP addresses?
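For reference, a typical bad-bot trap works by automatically appending deny lines to .htaccess, something like this (the IP is a hypothetical documentation address, standing in for whatever the trap caught):

order allow,deny
allow from all
# appended by the trap -- if a Google crawl IP ever lands here, Googlebot gets 403s
deny from 192.0.2.55

If that has happened, you would see exactly your pattern: consistent 403s reported to Google while human visitors sail through.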
Is there something wrong with the way Google handles 403s lately?
Here's the Status Code Definition [w3.org]
403 Forbidden
The server understood the request, but is refusing to fulfill it. Authorization will not help and the request SHOULD NOT be repeated. If the request method was not HEAD and the server wishes to make public why the request has not been fulfilled, it SHOULD describe the reason for the refusal in the entity. If the server does not wish to make this information available to the client, the status code 404 (Not Found) can be used instead.
Here is everything in my .htaccess now:

# serve index.php for bare directory requests
DirectoryIndex index.php
Options +FollowSymLinks

# redirect the bare domain to the www hostname
RewriteEngine on
RewriteCond %{HTTP_HOST} ^example\.net
RewriteRule ^(.*)$ http://www.example.net/$1 [L,R=301]

# allow all GET requests from everyone
<Limit GET>
order allow,deny
allow from all
</Limit>

# block anyone from fetching the .htaccess file itself
<Files .htaccess>
order allow,deny
deny from all
</Files>
How could I test if there is something misconfigured on my host? When I check the headers for all my directories, they come up OK. There are 24 other domains on the server, and they are all in G's index. The viewing of directories is blocked at the host level, but then, most sites do this too. Plus, this all begs the question of why the other SEs (Yahoo/MSN) aren't having this issue with me.
Would this be a good idea -- is there some way to emulate the googlebot agent string to see if there are any requests that are a problem?
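For instance, would something like this rough Python sketch do it? The URL is just a stand-in for one of my directories, and the agent string is the one Googlebot leaves in my logs:

import urllib.request
import urllib.error

# stand-in URL for one of the directories Google reports 403s on
URL = "http://www.example.net/widgets/"

AGENTS = {
    "browser": "Mozilla/5.0",
    "googlebot": "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
}

for label, ua in AGENTS.items():
    req = urllib.request.Request(URL, headers={"User-Agent": ua})
    try:
        status = urllib.request.urlopen(req).getcode()
    except urllib.error.HTTPError as e:
        status = e.code
    print(label, status)

One catch I can see: this would only expose a block keyed to the agent string. If the ban is by IP, the request still comes from my address, not Google's, so everything would look fine from here.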
emulate the googlebot agent string
I hate to sound like I am looking to find fault with G, but maybe the Sitemaps program has some issues, or these problems are caused by G crawling its own cache. Another thing I see in my Webmaster Tools summary tab is that there are dozens of "page not found" errors for pages that have not existed in over three years!?
There is another thread that contains many of these same issues:
[webmasterworld.com...]
Worst thing about issues with G: there is no way to communicate with these people.
403 HTTP header errors for every directory on my site
Did you, or do you, have a "bot trap"? These things automatically ban by IP, with deny lines inserted into .htaccess for ANY bot not obeying robots.txt.
A search here at WW will reveal that others have had Googlebot ignore robots.txt on occasion; it's a temporary glitch or something.
If the ban is by IP, you might not be able to recognize it as Googlebot.
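If you want to check whether a banned IP really is googlebot, the test Google recommends is a reverse DNS lookup plus a forward confirmation. A rough Python sketch (the IP at the bottom is hypothetical):

import socket

def is_googlebot(ip):
    # reverse lookup: genuine crawl IPs resolve under googlebot.com or google.com
    try:
        host = socket.gethostbyaddr(ip)[0]
    except socket.herror:
        return False
    if not host.endswith((".googlebot.com", ".google.com")):
        return False
    # forward-confirm the name resolves back to the same IP, to rule out spoofed PTR records
    return ip in socket.gethostbyname_ex(host)[2]

print(is_googlebot("66.249.66.1"))  # hypothetical crawler IP from a ban list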
When a 403 Forbidden comes up, it means the server thinks the incoming "viewer", bot or human, should be forbidden. Two common causes are a ban in .htaccess (or in the virtual domain section of httpd.conf), OR a request for a directory that has no index file in it.
Most servers by default won't let you surf into a directory with a ./ and NO index.whatever in the folder. It will give you a 403
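In .htaccess terms, that case is just this (a sketch):

# with listings disabled, a request for /somedir/ that contains no index file returns 403
Options -Indexes
DirectoryIndex index.html index.php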
Why not google "header checker", find one, point it at a page that is supposedly 403, and see if it is? If it returns 200, then all is cool for the "surfer" but not the bot. That would be a sign of a bot trap or a mis-configured .htaccess.
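Or, if you'd rather roll your own than trust a random one, a bare-bones checker is only a few lines of Python (a sketch; the host and path are stand-ins). It deliberately does not follow redirects, so you see the raw status of each URL:

import http.client

def check(host, path):
    # one raw request; print the status line and any redirect target
    conn = http.client.HTTPConnection(host)
    conn.request("HEAD", path)
    resp = conn.getresponse()
    print(resp.status, resp.reason, resp.getheader("Location", ""))
    conn.close()

check("www.example.net", "/somedir/")  # stand-in host and directory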
I Googled "header checker" and tried the first one on one of the directories that G gets 403s for. There were two subsequent results. The first one came up HTTP/1.1 301 Moved Permanently, and there was a second one below returning a HTTP/1.1 403 Forbidden. I dont get why the first one comes up with a 301, I never ever changed the URL or moved anything.
I know that my host blocks directory access at the server level, and my htaccess file states at the top "DirectoryIndex index.php", since I dont have an index.html file. Is it even necessary for me to use "DirectoryIndex index.php"?
I know when I try that header checker without "DirectoryIndex index.php" in the htaccess, the directories also come back with 403s, b/c empty dir listings are denied at the host level.
What is odd is that they arent getting errors for all my directories.
I Googled "header checker" and tried the first one on one of the directories that G gets 403s for. There were two subsequent results. The first one came up HTTP/1.1 301 Moved Permanently, and there was a second one below returning a HTTP/1.1 403 Forbidden. I dont get why the first one comes up with a 301, I never ever changed the URL or moved anything.I know that my host blocks directory access at the server level, and my htaccess file states at the top "DirectoryIndex index.php", since I dont have an index.html file. Is it even necessary for me to use "DirectoryIndex index.php"?
The 301 is a redirect that you must have initiated, either in httpd.conf, in .htaccess, or on the page itself. URLs don't redirect themselves; that is something you need to look into. (In fact, the RewriteRule in the .htaccess you posted above issues exactly that kind of 301 -- if the header checker requested the non-www hostname, that rule alone would account for it.)
The DirectoryIndex directive tells the server what to serve when a visitor requests a bare directory with no page specified. The list is space-separated, like this:
DirectoryIndex index.html index.htm index.shtml index.php
Apache tries these files in order for any bare directory request. If none of them exists in the directory (and directory listings are disabled), it issues a 403 Forbidden (Apache, Unix).
Most servers by default won't let you surf into a directory with a ./ and NO index.whatever in the folder. It will give you a 403
But wouldn't this then cause most sites to also see 403 HTTP header errors in the Google Sitemaps panel for directories it requests that way?
Maxgoldie: this error is generated ONLY when the request is expressly forbidden in httpd.conf or .htaccess (a meta tag can't produce it; it's a response from the server itself). If your DirectoryIndex names "your" version of index.whatever, it won't show a 403 to anyone.
A 403 is a server-generated error meaning you "forbid" the requester (which requesters, depending on the deployment) from seeing that directory or page.
EDIT: I also would like to add that in the old days Apache let you surf undefined site directories by default, which gave you a list of the files in the directory (clickable too, lol). That was a major security problem, and it's why DirectoryIndex was implemented to start with.