Mardi Gras - I tried to find the robots.txt thread you mentioned in your last post before the thread I started was closed (and removed?). The link returns a 404.
I am still not sure why this would be the most important reason for a ban - if there is a file in our directory that is problematic it might be one file out of hundreds. And other search engines have no problem indexing our site.
Why would Google ban an educational site for an old page on a huge directory?
You do not have a robots.txt file, and your 404 page does not actually return a status of 404. Your fancy 404 page is the problem. When the server sends out the 404 page, it sends a status of 200 instead of 404, since it did find a page to send (the error page itself). Google is getting that fancy page when it expects to find a simple robots.txt file. Somewhere on that page is some code that is convincing Google that you do not want them on your site!
The quickest and easiest thing that you can do is to put a robots.txt file in your top level directory that will allow Google to crawl your site. If the file is there, the request will not be sent to the 404 page that is returning the wrong status.
Do this now. The January Google crawl will start in a few days, and you will then have to wait till the end of January to see the results. Buy some AdWords till then to get some traffic so you can remain employed.
I would also recommend that you drop your custom 404 page till you can be sure that it is returning the right status codes.
Also, drop your argument about the other search engines working with your broken site. Google and the other SEs will have different code for parsing robots.txt, and different policies for what to do if there are failures. Google might be doing the right thing by having a policy of not spidering a site that has a robots.txt that they cannot figure out. You are returning a huge file, so as far as the bot can tell you are trying to block someone, and it is probably safer for it not to spider the site.
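For anyone who wants to check this themselves, here is a rough sketch of the test described above: request /robots.txt and a path that cannot exist, and compare the status codes. The function names, the User-Agent string and the wording of the messages are made up for illustration; this only reports what the server says, it does not claim to reproduce what Googlebot does with the answer.

```python
import urllib.error
import urllib.request


def fetch_status(url, timeout=10):
    """Return the HTTP status code for url, including 4xx/5xx statuses."""
    # The User-Agent value is an arbitrary placeholder.
    req = urllib.request.Request(url, headers={"User-Agent": "status-check/0.1"})
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # urllib raises on 4xx and 5xx; the code is still the answer we want.
        return err.code


def diagnose(robots_status, missing_status):
    """Describe what the two status codes suggest about the error page."""
    if missing_status == 200:
        return ("soft 404: missing pages return 200, so a missing "
                "robots.txt looks like a real file")
    if robots_status == 200:
        return "robots.txt is served with 200; check that its contents are valid"
    if robots_status == 404:
        return "no robots.txt, but the 404 status is correct"
    return f"unexpected robots.txt status {robots_status}; check the server setup"
```

To use it, call fetch_status once on your site's /robots.txt and once on a made-up path such as /no-such-page-12345.html, then pass both results to diagnose. If the made-up path comes back as 200, the custom error page is hiding the real status.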
I've been keeping up with this one from inception and I'll have to agree with all the advice that has been shared up to this point. NFFC was kind enough to point you to a thread where GG discussed the problems with 404's and the robots.txt file. That is surely one of the issues you need to address ASAP!
I'd also run your home page through the W3C validator and see if any of those 56 errors on the page are fatal. I did not look closely at all of them.
Big Dave - interesting perspective. You are the first person (out of 30 or 40 who critiqued the code in a session at SES Dallas this month and checked its Google placement) to tell me that the site is not banned. I'll take a look at the tags with our engineering guys.
Where are you seeing a pop-up? Not coming from our site if you are seeing one.
It is not your tags, it is the status that is returned in your header when you return your 404 page. It is a 404 page so it should return a 404, not a 200.
Tell your engineering guys to do two things right now.
1. upload a blank robots.txt file. Fill it in correctly once you have the file format figured out.
2. get rid of the custom 404 page now. Let your server return a standard 404 till you get it figured out. It may not be as pretty, but layoffs are much uglier than a standard 404 page.
Do this NOW! Today is the 27th. There is the weekend and then the New Year's holiday. I have been hit by Googlebot for the main crawl as early as the 1st of the month. If you do not fix it by the time that the main crawl starts, you will not make it into the index till the end of February.
You should also make sure that one of your engineers knows how to read your raw server logs (not a stats program) and can see what Googlebot is doing and where it is going. This will allow you to see what is going on during the crawl and to make sure that you are being found.
> You are the first person (out of 30 or 40) to tell me that the site is not banned.
But I doubt that all those people told you that you were banned either. In the previous thread, they were basically telling you that you were probably not banned, without using those words.
If you ban Google from your site, the result is the same as Google banning you. But saying that Google banned you gives you the wrong perspective.
You should not expect to get fully indexed this month if you make the changes, but you should at least start to make it back into the index.
When in September did this happen? Was it during the update? If so, I'd go back and check July and August stats and pinpoint exactly what may have caused it. Usually with a site that large, there is constant movement.
It sure sounds like NFFC and BigDave nailed it with the 404 issues. I checked the domain and sure enough, your 404 returns a 200 status code. And, your robots.txt returns the fancy 404 page that GG refers to in the thread referenced earlier. Not good...
[edited by: pageoneresults at 10:20 pm (utc) on Dec. 27, 2002]
I think this explanation is on the money...I experienced something very similar...
My site suddenly dropped out of Google. After a lot of advice from the nice people here at WebmasterWorld (thanks Soapystar and the rest of the guys!), I started checking my server logs...
Lo and behold! I had not had a robots.txt for about 2 years, but since last month my server had been erroneously returning a 403 instead of a 404 for all missing pages, including robots.txt.
So Googlebot kept coming to my site, requested only robots.txt and, confronted with a 403, simply went away.
I put in a robots.txt (not a blank one, though; some guys advised that at least one directory should be disallowed, otherwise some bots might interpret it as disallowing everything), and my site reappeared in about 12 days.
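If you want to see how a given robots.txt reads before uploading it, Python's standard urllib.robotparser can be used as a quick sanity check. This is only a sketch: the file contents and the /cgi-bin/ directory below are examples I made up, and it shows how Python's parser interprets them, which is not guaranteed to match Googlebot or any other bot.

```python
from urllib.robotparser import RobotFileParser


def allows(robots_txt, agent, path):
    """Return True if robots_txt lets agent fetch path, per Python's parser."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, path)


# An empty Disallow line means "nothing is disallowed".
ALLOW_ALL = "User-agent: *\nDisallow:\n"

# One directory blocked, everything else open (example directory only).
ONE_DIR_BLOCKED = "User-agent: *\nDisallow: /cgi-bin/\n"

for name, text in [("allow all", ALLOW_ALL), ("one dir blocked", ONE_DIR_BLOCKED)]:
    print(name, allows(text, "Googlebot", "/index.asp"))
```

Both example files let the home page through. Python's parser also treats a completely blank file as allowing everything, but as noted above, other bots may not agree, so an explicit file is the safer choice.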
I saw Googlebot come back, and once it got a 200 on my new robots.txt, it started pulling in a few pages.
So perhaps it would be a good idea to put in a robots.txt. I am pretty convinced it would work. And perhaps you can check your raw logs.
best of luck!
happy holidays to everyone
I'll pass your message on to the engineering guys pronto.
I asked for information on the last visit by Googlebot to our site. Evidently the last visit was 12/20/02. A visit on 12/18 left similar entries in our logs. Here are the entries from the visit:
>21:32:48 216-239-45-4.google.com SEASOFT_WEB GET /robots.txt 404 HTTP/1.0 Googlebot/2.1+(+http://www.googlebot.com/bot.html)
>21:32:49 216-239-45-4.google.com SEASOFT_WEB GET /index.asp 200 HTTP/1.0 Googlebot/2.1+(+http://www.googlebot.com/bot.html)
How do I interpret this? Or, put another way, what should we be seeing if Google was indexing us properly?
Does this help confirm your suspicions about the problems with the robots.txt tags? Let me know. Really appreciate the input. Thanks,
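To make entries like the two above easier to read, here is a small sketch that splits each line into named fields and keeps only the Googlebot requests. The field names are my own reading of the format (in particular, I am assuming SEASOFT_WEB is the site or server name and that the number after the path is the status code); a different log configuration would need different fields.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    time: str
    client: str
    site: str      # assumed meaning of the third field
    method: str
    path: str
    status: int    # assumed to be the HTTP status sent back
    protocol: str
    agent: str


def parse_line(line):
    """Split one log line into named fields; return None if it does not fit."""
    parts = line.lstrip("> ").split()
    if len(parts) != 8 or not parts[5].isdigit():
        return None
    time, client, site, method, path, status, protocol, agent = parts
    return Hit(time, client, site, method, path, int(status), protocol, agent)


def googlebot_hits(lines):
    """Return the parsed hits whose user agent mentions Googlebot."""
    hits = [parse_line(line) for line in lines]
    return [h for h in hits if h is not None and "googlebot" in h.agent.lower()]


LOG = [
    ">21:32:48 216-239-45-4.google.com SEASOFT_WEB GET /robots.txt 404 "
    "HTTP/1.0 Googlebot/2.1+(+http://www.googlebot.com/bot.html)",
    ">21:32:49 216-239-45-4.google.com SEASOFT_WEB GET /index.asp 200 "
    "HTTP/1.0 Googlebot/2.1+(+http://www.googlebot.com/bot.html)",
]

for hit in googlebot_hits(LOG):
    print(hit.time, hit.path, hit.status)
```

Run over a whole month of raw logs, the same filter shows which pages the bot asked for and what status each one got, which is the information the questions in this thread keep coming back to.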
[edited by: markh8624 at 10:43 pm (utc) on Dec. 27, 2002]
Don't post it word for word, as they will delete it. However, it is clear from that email directly from Google that the site is banned, and it has to do with the content, not a robots.txt file.
(No way for anyone here to know that though, given the little bit of info they have to go on.)
The email that I received from Google stated that our site had been "blocked from our index because it does not meet the quality standards necessary to assign accurate PageRank." The message went on to state that they could not tell us exactly why this occurred.
As an 'FYI,' we have several hundred decent (World Bank, Govt of Thailand, etc) incoming links....
Have a question from the engineer working today (most are on holiday this week):
"can you ask him if there is a way to modify our "fancy 404 page" to return 404 status instead of 200, please?"
Would you mind explaining this and I will email him your comments. Appreciate your help. Thanks
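The general idea is that the error page's body and its status code are set separately, so a custom page can still go out with a 404. Below is a minimal sketch using Python's built-in http.server, purely to illustrate the principle; the page contents and names are made up, and it is not a drop-in fix for an ASP site. On classic ASP the equivalent is, as far as I know, setting Response.Status in the error page, but check that against your server's documentation.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical content, for illustration only.
PAGES = {"/index.html": b"<h1>Home</h1>"}
NOT_FOUND_PAGE = b"<h1>Sorry, we could not find that page.</h1>"


def respond(path):
    """Return (status, body): the custom error page goes out with 404, not 200."""
    if path in PAGES:
        return 200, PAGES[path]
    return 404, NOT_FOUND_PAGE


class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        status, body = respond(self.path)
        self.send_response(status)  # the status line is set here, not by the body
        self.send_header("Content-Type", "text/html")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8000):
    """Run the demo server on localhost until interrupted."""
    HTTPServer(("127.0.0.1", port), Handler).serve_forever()
```

With this in place a request for /robots.txt gets the friendly page and a 404 status, so a crawler knows the file does not exist.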
When in September did you change ISPs?
When did you notice the actual drop in traffic?
Was your old site still up on the old ISP for a while after you changed ISPs?
Do you have access to your logs from October and November?
As you stated that you redesigned your site, I doubt that you have a problem with duplicate content.
Googlebot is visiting your site, so you are not banned.
What I have to conclude is that you are a victim of the slow updating of the Google DNS cache. Sometimes it takes Google far too long to update its cache.
For example, if you brought up your new site on SEPT. 25 and immediately shut down your old site, for the October crawl googlebot would have gone to your old IP address to try and find your content and it was gone. You would now have a grey bar in the update at the end of October.
If Google had still not updated its cache by the first week in November, you would still have a grey bar through the end of December.
As it is showing up now, you should be good to go for the deep crawl at the beginning of January. I can find you now. But there is still the open question of whether they found you during the December main crawl.
If you have the logs from that far back, was there any Googlebot activity during the first week of December? That will tell you if you should be good to go in January.
I personally do not have the complaints about Google that many others do, but I do think that they need to revisit how they cache DNS, and they should respect the expiration dates that are set. At the very least they should update the entire cache before each monthly crawl.
I passed your suggestions on to someone in engineering who is working on making the changes that you suggested.
> When in September did you change ISPs?
MSH: We made the switch in early September. Once the new site had fully propagated through DNS, we shut down the old site. There may have been an overlap of a few days (4-5 days perhaps) where the old site and new site were both up and running, but the new site would have been up on our domain name.
> When did you notice the actual drop in traffic?
MSH: End of September - I started emailing Google in early October.
> Was your old site still up on the old ISP for a while after you changed ISPs?
MSH: Perhaps, see above.
> Do you have access to your logs from October and November?
MSH: We are pulling those over the weekend for review. I'll see what they say....
There is one thing I read before in this forum: if Google has banned a site by hand (meaning an editor manually pulled the site from the database), it won't come back by itself no matter what you do to correct the problem that caused the ban. Googlebot will only come again once an editor manually lifts the penalty. But as far as I know, they usually do a manual delete only for malicious spammer sites. I don't see why an education site can be banned.
> I don't see why an education site can be banned.
It wasn't banned, and the content was not duplicated. They redesigned the site, so there is no duplicate content.
Do a site search on "Google DNS" to get some idea of what I am talking about. I am not certain that it is a DNS issue, but I am almost positive that it is not a ban if Googlebot is making it to his site.
Did a search for "Google DNS" and came up with a few threads from this forum on the subject.
Apparently the consensus is that DNS updates are notoriously slow for Google - but it seems like the longest was 3 months at most.
We've had major problems with Google not indexing our site since the new DNS update for our domain in the first week of September - almost 4 months now...
There did not seem to be much else on the subject out there when I did my search - is there a page on the Google site itself discussing issues with DNS?
If you were not crawled for 3 months, starting in September, you would not be in the index yet. The sites that are currently showing up are from the November crawl, from the first week in November.
That's why I was wondering if you had the log files from September through November. If Googlebot was hitting your site before November 5 and you are still not showing up in the index, then it is another problem. If it didn't start showing up till after November 5, then it was probably a DNS issue.
As far as I know, when Google is checking on banned sites, they get /robots.txt and / . In your case, last week, it got /robots.txt and /index.asp. While / is actually the same file as /index.asp, I would expect that Google would always go with / while checking for a site to be present.
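The rule of thumb in that paragraph can be written down as a tiny check once you have pulled the date of the first Googlebot visit out of the logs. This is just the reasoning above restated, assuming 2002 for the year; the function name and messages are made up, and a visit exactly on November 5 is treated as "after", which is my own arbitrary choice.

```python
from datetime import date

# Cutoff date from the post above; the year is assumed to be 2002.
CUTOFF = date(2002, 11, 5)


def likely_cause(first_googlebot_hit, cutoff=CUTOFF):
    """Apply the rule of thumb to the date of the first Googlebot visit."""
    if first_googlebot_hit is None:
        return "no Googlebot visits found in the logs"
    if first_googlebot_hit < cutoff:
        return "crawled before the cutoff but not indexed: look for another problem"
    return "first crawled on or after the cutoff: probably a DNS issue"


print(likely_cause(date(2002, 12, 18)))
```

It only applies to a site that is still missing from the index, and it is a guess based on the crawl schedule described here, not a diagnosis.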
If the bot is only fetching robots.txt and / it does not mean that you are banned. That is normal for the first month that your site is crawled. If it is a new site, do not even start worrying for 2 months. Consider yourself lucky if you pop up for a couple of days with the freshbot.
Give yourself 2 months to get in and 3 months to get in solid. Work on adding a page of good content a day till then.
Bans are incredibly rare. Do not do anything stupid and you should not have a problem.
My site was up (via the freshbot) for around 20 days but disappeared on the 23rd and has been gone until today. So I don't think it will appear in the database after the December Google dance. I am a bit impatient about waiting another month. What happened to your site when it was new? How long did it take Google to put it in the index? Please share more of your experience, thank you.
You can get yourself all wound up tight, but it doesn't speed things up. Google works the same way it always has, and most likely always will. 2 months to get in, 3 to get fully spidered. The freshbot is a recent bonus, and that is how you should view it. It is a bonus that you would not have gotten a year ago.
And before you complain about Google being slow, unless you pay for inclusion the other engines are much, much slower.
This seems strange to me as we are not a new site but have significant inbound links and have had the site working at the same URL since 1998. Why would we drop off the map with Google after the new site's release?
So, looks like a Google DNS updating issue? Any comments?