But to this day, the PR is still the same, and Google has been crawling my pages every day. Plus, all the backlinks are still listed too. Is this normal for a site that is banned?
I am sure the cause of the site being de-indexed was that my .htaccess was blocking Googlebot. Does anyone have any personal experience with getting dropped because of robots.txt or .htaccess errors, and getting back in?
But don't look at the PR - it won't tell you very much that helps in the present time. Look at your logs for the traffic you are getting, and see where you rank on those search terms. Look at earlier, "normal" traffic days in the log and see what search terms you are no longer appearing for. This information may lead you somewhere.
You also mentioned that when you enter site:example.com you get 118,000 results - but they are not from your site. If that is still the case, I would write to Google - it sounds like a bug may be involved.
If you don't have a Webmaster Tools account, I would suggest you set one up. It will provide you with a good vehicle to communicate with Google and authenticate that you are the legitimate owner.
And yes, sites that have made a technical error with robots.txt or htaccess can recover. So if that's the case, just fix it and expect the best.
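For reference, the classic accidental block is a robots.txt like this (a hypothetical example, not your actual file):

# hypothetical robots.txt -- this blocks every compliant crawler from the entire site
User-agent: *
Disallow: /

The intended version is usually "Disallow:" with nothing after it, which permits everything, or a Disallow line for each directory that really should be kept out.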
When I use site:example.com, I get no results, but when I enter example.com itself, I get hundreds of pages. But when I use link:example.com, all the backlinks still show, which makes me think that the site isn't being penalized.
I see on the diagnostic tab of the Webmaster Tools page that I was last spidered last week, and there were 38 HTTP header errors, all 403s, likely caused by the .htaccess errors.
It will be interesting to see if I ever actually am able to find out what happened and how long the resolution process takes.
The background on my issues with Google de-indexing my site began back on November 4, when they stripped all the pages of my site from the index completely.
The diagnostic tab on Webmaster tools said that I had HTTP header errors for all my directories -- 403 errors. First I thought it might have been something in my htaccess file that was the cause, so I removed almost everything in it.
Then, this past December 5 -- 30 days later -- the site came back into the index, with a few pages returning.
Today I see again that there are 403 HTTP header errors for every directory on my site, and there is nothing blocking them on my end. When I try any of the directories they have had trouble reaching these past few days, I have no problem, and according to my logs, nobody else does either. Does anyone have any ideas what could be wrong here? The frustrating thing is that there is no way to communicate with these people either.
Is there something wrong with the way Google handles 403s lately? Could it be that the best thing to do is not to have a sitemap?
there is nothing blocking them on my end.
I would definitely want to be sure on that. Google reports to you, consistently, that your server tells it "this URL is forbidden to you". This is not happening in a widespread way on many websites.
So I would still strongly suspect some server configuration gone wrong, rather than a Google bug. For example, do you run a "bad bot trap" or "hot-link protection" or in any way deny response to certain IP addresses?
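For reference, a typical bad-bot trap works by automatically appending deny lines to .htaccess, something like this (the IP is a hypothetical documentation address, standing in for whatever the trap caught):

order allow,deny
allow from all
# appended by the trap -- if a Google crawl IP ever lands here, Googlebot gets 403s
deny from 192.0.2.55

If that has happened, you would see exactly your pattern: consistent 403s reported to Google while human visitors sail through.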
Is there something wrong with the way Google handles 403s lately?
Here's the Status Code Definition [w3.org]
403 Forbidden
The server understood the request, but is refusing to fulfill it. Authorization will not help and the request SHOULD NOT be repeated. If the request method was not HEAD and the server wishes to make public why the request has not been fulfilled, it SHOULD describe the reason for the refusal in the entity. If the server does not wish to make this information available to the client, the status code 404 (Not Found) can be used instead.
Here is everything in my .htaccess now:

# serve index.php for bare directory requests
DirectoryIndex index.php
Options +FollowSymLinks

# redirect the bare domain to the www hostname
RewriteEngine on
RewriteCond %{HTTP_HOST} ^example\.net
RewriteRule ^(.*)$ http://www.example.net/$1 [L,R=301]

# allow all GET requests from everyone
<Limit GET>
order allow,deny
allow from all
</Limit>

# block anyone from fetching the .htaccess file itself
<Files .htaccess>
order allow,deny
deny from all
</Files>
How could I test if there is something misconfigured on my host? When I check the headers for all my directories, they come up OK. There are 24 other domains on the server, and they are all in G's index. The viewing of directories is blocked at the host level, but then, most sites do this too. Plus, this all begs the question of why the other SEs (Yahoo/MSN) aren't having this issue with me.
Would this be a good idea -- is there some way to emulate the googlebot agent string to see if there are any requests that are a problem?
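For instance, would something like this rough Python sketch do it? The URL is just a stand-in for one of my directories, and the agent string is the one Googlebot leaves in my logs:

import urllib.request
import urllib.error

# stand-in URL for one of the directories Google reports 403s on
URL = "http://www.example.net/widgets/"

AGENTS = {
    "browser": "Mozilla/5.0",
    "googlebot": "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
}

for label, ua in AGENTS.items():
    req = urllib.request.Request(URL, headers={"User-Agent": ua})
    try:
        status = urllib.request.urlopen(req).getcode()
    except urllib.error.HTTPError as e:
        status = e.code
    print(label, status)

One catch I can see: this would only expose a block keyed to the agent string. If the ban is by IP, the request still comes from my address, not Google's, so everything would look fine from here.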
emulate the googlebot agent string
I hate to sound like I am looking to find fault with G, but maybe the Sitemaps program has some issues, or these problems are caused by G crawling its own cache. Another thing I see in my Webmaster Tools summary tab is that there are dozens of "page not found" errors for pages that have not existed in over three years!?
There is another thread that contains many of these same issues:
[webmasterworld.com...]
Worst thing about issues with G: there is no way to communicate with these people.
403 HTTP header errors for every directory on my site
Did you, or do you, have a "bot trap"? These things automatically ban by IP, with deny lines inserted into .htaccess for ANY bot not obeying robots.txt.
A search here at WW will reveal that others have had Googlebot ignore robots.txt on occasion; it's a temporary glitch or something.
If the ban is by IP, you might not be able to recognize it as Googlebot.
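If you want to check whether a banned IP really is googlebot, the test Google recommends is a reverse DNS lookup plus a forward confirmation. A rough Python sketch (the IP at the bottom is hypothetical):

import socket

def is_googlebot(ip):
    # reverse lookup: genuine crawl IPs resolve under googlebot.com or google.com
    try:
        host = socket.gethostbyaddr(ip)[0]
    except socket.herror:
        return False
    if not host.endswith((".googlebot.com", ".google.com")):
        return False
    # forward-confirm the name resolves back to the same IP, to rule out spoofed PTR records
    return ip in socket.gethostbyname_ex(host)[2]

print(is_googlebot("66.249.66.1"))  # hypothetical crawler IP from a ban list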
When a 403 Forbidden comes up, it means the server thinks the incoming "viewer", bot or human, should be forbidden. Two common causes are a ban in .htaccess (or in the virtual domain section of httpd.conf), OR a request for a directory that has no index file in it.
Most servers by default won't let you surf into a directory with a ./ and NO index.whatever in the folder. It will give you a 403
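In .htaccess terms, that case is just this (a sketch):

# with listings disabled, a request for /somedir/ that contains no index file returns 403
Options -Indexes
DirectoryIndex index.html index.php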
Why not google "header checker", find one, point it at a page that is supposedly 403, and see if it is? If it returns 200, then all is cool for the "surfer" but not the bot. That would be a sign of a bot trap or a mis-configured .htaccess.
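Or, if you'd rather roll your own than trust a random one, a bare-bones checker is only a few lines of Python (a sketch; the host and path are stand-ins). It deliberately does not follow redirects, so you see the raw status of each URL:

import http.client

def check(host, path):
    # one raw request; print the status line and any redirect target
    conn = http.client.HTTPConnection(host)
    conn.request("HEAD", path)
    resp = conn.getresponse()
    print(resp.status, resp.reason, resp.getheader("Location", ""))
    conn.close()

check("www.example.net", "/somedir/")  # stand-in host and directory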
I Googled "header checker" and tried the first one on one of the directories that G gets 403s for. There were two subsequent results. The first one came up HTTP/1.1 301 Moved Permanently, and there was a second one below returning a HTTP/1.1 403 Forbidden. I dont get why the first one comes up with a 301, I never ever changed the URL or moved anything.
I know that my host blocks directory access at the server level, and my htaccess file states at the top "DirectoryIndex index.php", since I dont have an index.html file. Is it even necessary for me to use "DirectoryIndex index.php"?
I know when I try that header checker without "DirectoryIndex index.php" in the htaccess, the directories also come back with 403s, b/c empty dir listings are denied at the host level.
What is odd is that they arent getting errors for all my directories.
I Googled "header checker" and tried the first one on one of the directories that G gets 403s for. There were two subsequent results. The first one came up HTTP/1.1 301 Moved Permanently, and there was a second one below returning a HTTP/1.1 403 Forbidden. I dont get why the first one comes up with a 301, I never ever changed the URL or moved anything.I know that my host blocks directory access at the server level, and my htaccess file states at the top "DirectoryIndex index.php", since I dont have an index.html file. Is it even necessary for me to use "DirectoryIndex index.php"?
The 301 is a redirect that you must have initiated, either in httpd.conf, in .htaccess, or on the page itself. URLs don't redirect themselves; that is something you need to look into. (In fact, the RewriteRule in the .htaccess you posted above issues exactly that kind of 301 -- if the header checker requested the non-www hostname, that rule alone would account for it.)
The DirectoryIndex directive tells the server what to serve when a visitor requests a bare directory with no page specified. The list is space-separated, like this:
DirectoryIndex index.html index.htm index.shtml index.php
Apache tries these files in order for any bare directory request. If none of them exists in the directory (and directory listings are disabled), it issues a 403 Forbidden (Apache, Unix).
Most servers by default won't let you surf into a directory with a ./ and NO index.whatever in the folder. It will give you a 403
But wouldn't this then cause most sites to also see 403 HTTP header errors in the Google Sitemaps panel for directories it requests that way?
Maxgoldie: this error is generated ONLY when the request is expressly forbidden in httpd.conf or .htaccess (a meta tag can't produce it; it's a response from the server itself). If your DirectoryIndex names "your" version of index.whatever, it won't show a 403 to anyone.
A 403 is a server-generated error meaning you "forbid" the requester (which requesters, depending on the deployment) from seeing that directory or page.
EDIT: I also would like to add that in the old days Apache let you surf undefined site directories by default, which gave you a list of the files in the directory (clickable too, lol). That was a major security problem, and it's why DirectoryIndex was implemented to start with.