Robots.txt and Google Ban

markh8624

8:11 pm on Dec 27, 2002 (gmt 0)

Since I cannot find the old thread on my site's Google ban, I thought I would post this new thread (and see how long it stays in place?).

Mardi Gras - I tried to find the robots.txt thread you mentioned in the last post before the thread I started was closed (and removed?). The link is 404.

I am still not sure why this would be the most important reason for a ban - if there is a file in our directory that is problematic it might be one file out of hundreds. And other search engines have no problem indexing our site.

Why would Google ban an educational site for an old page on a huge directory?

Thanks,

Mark

BigDave

8:35 pm on Dec 27, 2002 (gmt 0)

You are probably not banned, so stop thinking that you are. The problem is that YOU have banned google from your site.

You do not have a robots.txt file, and your 404 page does not actually return a status of 404. Your fancy 404 page is the problem. When it sends out the 404 page it sends a status of 200 instead of 404 since it found the page. Google is getting that fancy page when it expects that it is finding a simple robots.txt file. Somewhere on that page is some code that is convincing Google that you do not want them on your site!
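For anyone who wants to check this kind of thing from the outside, here is a rough Python sketch of the two symptoms described above: a missing page that answers 200, and a /robots.txt that comes back as an HTML page. The function names and the HTML-sniffing heuristic are assumptions made for illustration, not anything Google is known to run.

```python
import urllib.error
import urllib.request


def fetch(url):
    """Return (status, body_text) for a GET request, including 4xx/5xx replies."""
    try:
        with urllib.request.urlopen(url, timeout=10) as response:
            return response.status, response.read().decode("utf-8", "replace")
    except urllib.error.HTTPError as err:
        # urllib raises on error statuses; the code and body are still available.
        return err.code, err.read().decode("utf-8", "replace")


def looks_like_html(body):
    """Heuristic (an assumption of this sketch): HTML pages start with a tag."""
    start = body.lstrip().lower()
    return start.startswith("<!doctype") or start.startswith("<html")


def soft_404_symptoms(missing_status, robots_status, robots_body):
    """List the signs that an error page is being served with the wrong status."""
    symptoms = []
    if missing_status == 200:
        symptoms.append("a page that does not exist returned 200 instead of 404")
    if robots_status == 200 and looks_like_html(robots_body):
        symptoms.append("/robots.txt returned an HTML page instead of plain text")
    return symptoms
```

To use it, call `fetch()` once on a URL that should not exist and once on the site's `/robots.txt`, then pass the results to `soft_404_symptoms()`. An empty list means neither symptom was seen.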

The quickest and easiest thing that you can do is to put a robots.txt file in your top level directory that allows Google to crawl your site. If the file is there, the server will return it instead of sending the request to the 404 page that is returning the wrong status.

Do this now. The January Google crawl will start in a few days, and you will then have to wait till the end of January to see the results. Buy some adwords till then to get some traffic so you can remain employed.

I would also recommend that you drop your custom 404 page till you can be sure that it is returning the right status codes.

Also, drop your argument about the other search engines working with your broken site. Google and the other SEs will have different code for parsing robots.txt, and different policies for what to do if there are failures. Google might be doing the right thing by having a policy of not spidering a site whose robots.txt it cannot figure out. From the crawler's side, you are returning a huge file, so you are obviously trying to block someone, and it is probably safer not to spider the site.

pageoneresults

8:53 pm on Dec 27, 2002 (gmt 0)

A little off topic, but related to your site, when did you add the popup? Right now it is returning an error in IE and I'm wondering if you added it at about the same time you found yourself without Google traffic.

I've been keeping up with this one from inception and I'll have to agree with all the advice that has been shared up to this point. NFFC was kind enough to point you to a thread where GG discussed the problems with 404's and the robots.txt file. That is surely one of the issues you need to address ASAP!

I'd also run your home page through the W3C validator and see if any of those 56 errors on the page are fatal. I did not look closely at all of them.

markh8624

9:51 pm on Dec 27, 2002 (gmt 0)

>You are probably not banned, so stop thinking that you are. The problem is that YOU have banned google from your site.

Big Dave - interesting perspective. You are the first person (out of 30 or 40 who critiqued the code in a session at SES Dallas this month and checked its Google placement) to tell me that the site is not banned. I'll take a look at the tags with our engineering guys.

Mark

markh8624

10:01 pm on Dec 27, 2002 (gmt 0)

We added the custom 404 page about 7 days ago (after SES Dallas). We've had problems with Google since September. I'll look into the robots.txt issue (and try to find the cited article on this site, which returned a 404) but I am really skeptical that this is the reason for our 'broken' site.

Where are you seeing a pop-up? Not coming from our site if you are seeing one.

Mark

BigDave

10:14 pm on Dec 27, 2002 (gmt 0)

Mark,

It is not your tags, it is the status that is returned in your header when you return your 404 page. It is a 404 page so it should return a 404, not a 200.

Tell your engineering guys to do two things right now.

1. upload a blank robots.txt file. Fill it in correctly once you have the file format figured out.

2. get rid of the custom 404 page now. Let your server return a standard 404 till you get it figured out. It may not be as pretty, but layoffs are much uglier than a standard 404 page.
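On step 1, a small sketch of what a blank or allow-everything robots.txt means to a parser. It uses Python's standard `urllib.robotparser` as a stand-in; Googlebot's own parser may behave differently. The path `/index.asp` comes from this thread, while the helper name `can_crawl` and the sample files are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

BLANK = ""

ALLOW_ALL = """\
User-agent: *
Disallow:
"""

BLOCK_ALL = """\
User-agent: *
Disallow: /
"""


def can_crawl(robots_txt, user_agent, path):
    """Return True if this robots.txt text lets user_agent fetch path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, path)


for name, text in [("blank", BLANK), ("allow all", ALLOW_ALL), ("block all", BLOCK_ALL)]:
    print(name, can_crawl(text, "Googlebot", "/index.asp"))
```

With this parser, the blank file and the empty `Disallow:` line both leave the site open, and only `Disallow: /` shuts the crawler out.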

Do this NOW! Today is the 27th. There is the weekend and the new years holidays. I have been hit by googlebot for the main crawl as early as the 1st of the month. If you do not fix it by the time that the main crawl starts, you will not make it into the index till the end of February.

You should also make sure that one of your engineers knows how to read your server logs (not a stats program) and can see what googlebot is doing and where it is going. This will allow you to see what is going on during the crawl and to make sure that you are being found.

>You are the first person (out of 30 or 40) to tell me that the site is not banned.

But I doubt that all those people told you that you were banned either. In the previous thread, they were basically telling you that you were probably not banned, without using those words.

If you ban google from your site, it is the same thing as google banning you. But saying that google banned you gives you the wrong perspective.

You should not expect to get fully indexed this month if you make the changes, but you should at least start to make it back into the index.

pageoneresults

10:17 pm on Dec 27, 2002 (gmt 0)

Oops, my mistake on the popup, it's actually an OpenNewWindow script that is returning the error in IE.

When in September did this happen? Was it during the update? If so, I'd go back and check July and August stats and pinpoint exactly what may have caused it. Usually with a site that large, there is constant movement.

It sure sounds like NFFC and BigDave nailed it with the 404 issues. I checked the domain and sure enough, your 404 returns a 200 status code. And, your robots.txt returns the fancy 404 page that GG refers to in the thread referenced earlier. Not good...

[edited by: pageoneresults at 10:20 pm (utc) on Dec. 27, 2002]

gujgifts

10:19 pm on Dec 27, 2002 (gmt 0)

I think this explanation is on the money...I experienced something very similar...

My site suddenly dropped out of google..after a lot of advice from the lot of nice people here at WebmasterWorld (thanks Soapystar and the rest of the guys!), I started checking my server logs....

lo and behold!..I had gone without a robots.txt for about 2 years, but since last month my server had been erroneously returning a 403 instead of a 404 for all missing pages, including the robots.txt

Thus googlebot kept coming to my site, accessed only the robots.txt and, confronted with a 403, simply went away..

I put in a robots.txt (not a blank one though; some guys advised that at least one dir should be disallowed, otherwise some bots might interpret it as disallowing everything)...and my site reappeared in about 12 days....
I saw googlebot coming and once it got a 200 on my new robots.txt, it started pulling in a few pages...
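The behavior described here (a 403 on robots.txt sends the bot away, a 200 lets it in) can be written down as a small decision table. This is only a sketch of one plausible crawler policy that matches what posters in this thread report seeing; it is an assumption, not Google's documented logic, and the function name is made up.

```python
def robots_txt_policy(status):
    """Map the HTTP status of /robots.txt to a crawl decision (assumed policy)."""
    if 200 <= status < 300:
        return "parse the file and obey its rules"
    if status in (401, 403):
        # Access denied on robots.txt itself: assume the owner wants no crawling.
        return "treat the whole site as off limits"
    if 400 <= status < 500:
        # Typically 404: there is no robots.txt, so there are no restrictions.
        return "crawl without restrictions"
    return "try again later"


for code in (200, 403, 404, 500):
    print(code, "->", robots_txt_policy(code))
```

Seen this way, a server that answers 403 where it should answer 404 flips the outcome from "crawl freely" to "stay out", even though nothing on the site itself changed.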

so, perhaps it would be a nice idea to put in a robots.txt...I am pretty convinced it would work....and perhaps u can check out your raw logs

best of luck!

happy holidays to everyone

markh8624

10:30 pm on Dec 27, 2002 (gmt 0)

Big Dave -

I'll pass your message on to the engineering guys pronto.

I asked for information on the last visit by the Googlebot to our site. Evidently the last visit was 12/20/02. There was a visit on 12/18 with similar text made to our logging data. Here are the notations from the visit:

>21:32:48 216-239-45-4.google.com SEASOFT_WEB GET /robots.txt 404 HTTP/1.0 Googlebot/2.1+(+http://www.googlebot.com/bot.html)
>21:32:49 216-239-45-4.google.com SEASOFT_WEB GET /index.asp 200 HTTP/1.0 Googlebot/2.1+(+http://www.googlebot.com/bot.html)

How do I interpret this? Or, put another way, what should we be seeing if Google was indexing us properly?
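As a reading aid, here is a small Python sketch that splits log lines of this shape into named fields. The field order (time, client host, site name, method, path, status, protocol, user agent) is inferred from the two lines quoted above and is an assumption about this particular log layout; the names `Hit`, `parse_hit` and `googlebot_hits` are made up.

```python
from typing import NamedTuple, Optional


class Hit(NamedTuple):
    time: str
    host: str
    site: str
    method: str
    path: str
    status: int
    protocol: str
    user_agent: str


def parse_hit(line: str) -> Optional[Hit]:
    """Parse one log line; return None if it does not have the expected shape."""
    fields = line.lstrip("> ").split(None, 7)
    if len(fields) != 8 or not fields[5].isdigit():
        return None
    time, host, site, method, path, status, protocol, agent = fields
    return Hit(time, host, site, method, path, int(status), protocol, agent)


def googlebot_hits(lines):
    """Keep only the requests whose user agent mentions Googlebot."""
    hits = (parse_hit(line) for line in lines)
    return [hit for hit in hits if hit and "googlebot" in hit.user_agent.lower()]


SAMPLE = [
    "21:32:48 216-239-45-4.google.com SEASOFT_WEB GET /robots.txt 404 HTTP/1.0 "
    "Googlebot/2.1+(+http://www.googlebot.com/bot.html)",
    "21:32:49 216-239-45-4.google.com SEASOFT_WEB GET /index.asp 200 HTTP/1.0 "
    "Googlebot/2.1+(+http://www.googlebot.com/bot.html)",
]

for hit in googlebot_hits(SAMPLE):
    print(hit.time, hit.path, hit.status)
```

Read field by field, the two lines say that Googlebot asked for /robots.txt and got a 404 (no such file), then asked for /index.asp one second later and got a 200 (page delivered).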

Does this help confirm your suspicions about the problems with the robot.txt tags? Let me know. Really appreciate the input. Thanks,

Mark

[edited by: markh8624 at 10:43 pm (utc) on Dec. 27, 2002]

bobmark

10:41 pm on Dec 27, 2002 (gmt 0)

Beyond your immediate problem - which has been exhaustively and correctly diagnosed - one thing I would add is when you do write a proper robots.txt file, be very careful of using asterisks.
I had copied portions of the WebmasterWorld robots.txt file posted on a thread here to ban Zeus and a few others and discovered that some spiders - I think not Google but some majors - seemed to interpret something like Zeus* as applying to them (apparently not parsing the text before the asterisk correctly).
Just a thought

arjan

10:44 pm on Dec 27, 2002 (gmt 0)

I'm not an expert on this, but it seems that on the 20th googlebot did not find your (or any) robots.txt and DID find your index page!

webwhiz

11:07 pm on Dec 27, 2002 (gmt 0)

Mark, I think you need to give these guys some idea of the email you received from Google that said you WERE banned.

Don't post it word for word, as they will delete it. However, it is clear from that email directly from Google that the site is banned, and it has to do with the content, not a robots.txt file.

(No way for anyone here to know that though, given the little bit of info they have to go on.)

markh8624

11:33 pm on Dec 27, 2002 (gmt 0)

WebWhiz:

The email that I received from Google stated that our site had been "blocked from our index because it does not meet the quality standards necessary to assign accurate PageRank." The message went on to state that they could not tell us exactly why this occurred.

As an 'FYI,' we have several hundred decent (World Bank, Govt of Thailand, etc) incoming links....

Mark

markh8624

11:36 pm on Dec 27, 2002 (gmt 0)

Big Dave:

Have a question from the engineer working today (most are on holiday this week):

"can you ask him if there is a way to modify our "fancy 404 page" to return 404 status instead of 200, please?"

Would you mind explaining this and I will email him your comments. Appreciate your help. Thanks

Mark

pageoneresults

11:45 pm on Dec 27, 2002 (gmt 0)

Here are instructions for how to set it up if on a Windows Server...

IIS Config for 404 Handler [webmasterworld.com]

BigDave

12:00 am on Dec 28, 2002 (gmt 0)

It looks like at that time it did return a proper 404 when looking for your robots.txt file. And since you stated that the custom 404 page is new, that was not your problem, but it could be in the next crawl, so you should still follow my earlier advice.

When in September did you change ISPs?
When did you notice the actual drop in traffic?
Was your old site still up on the old ISP for a while after you changed ISPs?
Do you have access to your logs from October and November?

As you stated that you redesigned your site, I doubt that you have a problem with duplicate content.

Googlebot is visiting your site, so you are not banned.

What I have to conclude is that you are a victim of the slow updating of the Google DNS cache. Sometimes it takes google far too long to update its cache.

For example, if you brought up your new site on SEPT. 25 and immediately shut down your old site, for the October crawl googlebot would have gone to your old IP address to try and find your content and it was gone. You would now have a grey bar in the update at the end of October.

If Google had still not updated its cache by the first week in November, you would still have a grey bar through the end of December.

As it is showing up now, you should be good to go for the deep crawl at the beginning of January. I can find you now. But there is still the open question of whether they found you during the December main crawl.

If you have the logs from that far back, was there any Googlebot activity during the first week of December? That will tell you if you should be good to go in January.

I personally do not have the complaints about Google that many others do, but I do think that they need to revisit how they cache DNS, and they should respect the expiration dates that are set. At the very least they should update the entire cache before each monthly crawl.

BigDave

12:03 am on Dec 28, 2002 (gmt 0)

I can't tell you anything about ASP, as I am a microsoft hater and have no interest in using it. In PHP you would use a header() call to set the status, and I would assume that there is a similar method in ASP.
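To make the idea concrete without touching ASP or PHP, here is a minimal Python sketch using the standard `http.server` module: the custom error page is still sent as the body, but the status line says 404. The page contents, `PAGES`, `NOT_FOUND_BODY` and `make_server` are all made-up names for illustration, and the site in this thread would need the equivalent setting in its own server and language.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical site content: only the home page exists.
PAGES = {"/": b"<h1>Home</h1>"}
NOT_FOUND_BODY = b"<h1>Sorry, we could not find that page.</h1>"


class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        body = PAGES.get(self.path)
        status = 200
        if body is None:
            # Serve the friendly error page, but keep the 404 status.
            status, body = 404, NOT_FOUND_BODY
        self.send_response(status)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the sketch quiet


def make_server(port=0):
    """Create a server on localhost; port 0 lets the OS pick a free port."""
    return HTTPServer(("127.0.0.1", port), Handler)
```

Calling `make_server(8000).serve_forever()` and requesting `/robots.txt` should show the friendly page in a browser while a crawler sees a plain 404 in the response header.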

markh8624

4:00 am on Dec 28, 2002 (gmt 0)

BigDave:

I passed your suggestions on to someone in engineering who is working on making the changes that you suggested.

> When in September did you change ISPs?

MSH: We made the switch in early September. Once the new site was fully propagated throughout the web's DNS' we shut down the old site. There may have been an overlap of a few days (4-5 days perhaps) where the old site and new site were still up and running, but the new site would have been up on our domain name.

> When did you notice the actual drop in traffic?

MSH: End of September - I started emailing Google in early October.

> Was your old site still up on the old ISP for a while after you changed ISPs?

MSH: Perhaps, see above.

> Do you have access to your logs from October and November?

MSH: We are pulling those over the weekend for review. I'll see what they say....

Thanks,

Mark

jamesyap

4:35 am on Dec 28, 2002 (gmt 0)

Is your old site (old domain) still in Google's database? If yes, then it might be a duplicate content ban. I found out from Google's remove page that you can now immediately remove pages from the index. However, you will need access to your old files to add a meta tag.

[google.com...]

There is one thing I read before in this forum: if Google has banned a site by hand, meaning an editor manually pulled your site out of the database, it won't come back by itself no matter what you do to correct the problem that caused the ban. Only after an editor manually lifts the penalty will googlebot come again. But I know they usually do a manual delete only for malicious spammers' sites. I don't see why an education site can be banned.

BigDave

5:54 am on Dec 28, 2002 (gmt 0)

>I don't see why an education site can be banned.

It wasn't banned, and the content was not duplicated. They redesigned the site, so there is no duplicate content.

Do a site search on google dns to get some idea of what I am talking about. I am not certain that it is a DNS issue, but I am almost positive that it is not a ban if googlebot is making it to his site.

jamesyap

6:03 am on Dec 28, 2002 (gmt 0)

How do you guys visit Mark's site when it is not displayed in his profile?

markh8624

3:19 pm on Dec 28, 2002 (gmt 0)

Still curious about what the logging data that I posted earlier shows....can someone tell me if the Googlebot's remarks indicate some reason for the ban?

Mark

markh8624

9:29 pm on Dec 29, 2002 (gmt 0)

BigDave -

Did a search for "Google DNS" and came up with a few threads from this forum on the subject.

Apparently the consensus is that DNS updates are notoriously slow for Google - but seems like the longest at most was 3 months.

We've had major problems with Google not indexing our site since the new DNS update for our domain in the first week of September - almost 4 months now...

There did not really seem to be much else on the subject out there when I did my search - is there a page on the Google site itself discussing issues with DNS?

Thanks

Mark

BigDave

9:44 pm on Dec 29, 2002 (gmt 0)

Mark,

If you were not crawled for 3 months, starting in September, you would not be in the index yet. The sites that are currently showing up are from the November crawl, from the first week in November.

That's why I was wondering if you had the log files from September through November. If Googlebot was hitting your site before November 5 and you are still not showing up in the index, then it is another problem. If it didn't start showing up till after November 5, then it was probably a DNS issue.

As far as I know, when Google is checking on banned sites, they get /robots.txt and / . In your case, last week, it got /robots.txt and /index.asp. While / is actually the same file as /index.asp, I would expect that google would always go with / while checking for a site to be present.

jamesyap

12:25 am on Dec 30, 2002 (gmt 0)

BigDave,

Then do you mean my new site must be crawled before 5th December to be included in the January database?

Most of the time, the bot only gets robots.txt and /. Does this also mean my site is blacklisted? But that's a new site!

BigDave

12:45 am on Dec 30, 2002 (gmt 0)

The crawl for the update at the end of the month happens at the beginning of the month.

If the bot is only fetching robots.txt and / it does not mean that you are banned. That is normal for the first month that your site is crawled. If it is a new site, do not even start worrying for 2 months. Consider yourself lucky if you pop up for a couple of days with the freshbot.

Give yourself 2 months to get in and 3 months to get in solid. Work on adding a page of good content a day till then.

Bans are incredibly rare. Do not do anything stupid and you should not have a problem.

jamesyap

3:43 am on Dec 30, 2002 (gmt 0)

BigDave,

My site was up (via the freshbot) for around 20 days but has disappeared from the 23rd until today. So I don't think it will appear in the database after the December Google dance. I am a bit too impatient to wait another month. What happened to your site when it was new? How long did Google take to put it in the index? Please share more of your experience, thank you.

markh8624

4:08 pm on Dec 30, 2002 (gmt 0)

BigDave:

Thanks for your reply. I will pull our logging files for September and October to see if the Googlebot came by at that time.

It will be interesting to see what stupid thing we did (if we did so) that got us in this mess!

Mark

BigDave

6:57 pm on Dec 30, 2002 (gmt 0)

jamesyap,

You can get yourself all wound up tight, but it doesn't speed things up. Google works the same way it always has, and most likely always will. 2 months to get in, 3 to get fully spidered. The freshbot is a recent bonus, and that is how you should view it. It is a bonus that you would not have gotten a year ago.

And before you complain about Google being slow, unless you pay for inclusion the other engines are much, much slower.

markh8624

7:01 pm on Dec 30, 2002 (gmt 0)

OK - just talked with the guy who follows our logging data. We did not have any Googlebot visits in Sept, Oct, Nov. The first time Google came by the new site was 12/18 followed by 12/20 (as mentioned in an earlier post to this thread).

This seems strange to me as we are not a new site but have significant inbound links and have had the site working at the same URL since 1998. Why would we drop off the map with Google after the new site's release?

So, looks like a Google DNS updating issue? Any comments?

Mark
