Thanks for reporting this kind of "edge case". From my recent experience, it is interesting that people often assume they are penalized or banned before they check the technical basics.
I have just found something very interesting. Today I got the client's logs, all 3GB of them in one big file (the logs have been appending to the same file since Oct 2011).
I saw that robots.txt had been returning HTTP 500 to Googlebot for as far back as the logs go (18+ months), yet Google kept crawling the site.
Then suddenly, on the 7th of May 2013, it stopped requesting other pages! Since that date Googlebot ONLY requested robots.txt, and after getting HTTP 500 it did not crawl anything any more - the logs show nothing but requests for robots.txt.
Ten days later (17th of May) the home page and most of the other pages got deindexed.
So either something changed on Google's side in how they handle the robots.txt response code - or Google decided that 18 months (and possibly longer) of getting HTTP 500 for robots.txt requests is enough.
How infuriating that you didn't notice this while it was ongoing! 18 months seems an unbelievably long time to go without seeing robots.txt. Normally they make a fuss if they have to go longer than 24 hours.
I would have tried a Fetch As Googlebot to see whether robots really are receiving a 500 response, or whether it's just your logging gone haywire. (Admittedly it is far more common in the other direction: logs record a 200 while the visitor receives something else.) Why would only this one file generate a 500?
|How infuriating that you didn't notice this while it was ongoing! |
This was a new client - I got the client on 4th of June, as the first sentence of my opening post says:
|Today we got a new client who pretty much disappeared from Google on 17th of May |
With regards to this, yes, absolutely...
|I would have tried a Fetch As Googlebot to see whether robots really are receiving a 500 response, |
After entering www.example.com/robots.txt into the address bar and getting the text "Page not found", I went to check the response code and saw it was HTTP 500. Then I logged on to WMT and did exactly that: Fetch as Googlebot, first the home page and then robots.txt, and both fetches resulted in "Unreachable robots.txt".
I then checked the crawl stats in WMT and saw that crawling stopped on the 7th of May. I assumed some coding changes had been made that day, but only when I got the server access logs (yesterday) - a huge file going 18 months back - did I see that HTTP 500 had been returned for robots.txt (and any other file not found!) for over 18 months.
|Why would only this one file generate a 500? |
Any request for a non-existing file returned HTTP 500 - this was the next thing I tried after seeing that robots.txt returned HTTP 500. Based on this I asked for an emergency upload of robots.txt, which resulted in crawling restarting. The developers are now fixing the response code for non-existing pages to return 404 - but they still haven't put the fix live.
Now it is "wait and see" whether the rankings will return - and I am wondering whether Google will refuse to rank the site until all not-found pages return a proper 404 response code.
But at least I got Google crawling again.
Why the developer did not spot this is a mystery. The new client spent 3 weeks with the developer wondering what was happening, and in the end he said he could not help and that they should find someone else to see why they got deindexed.
I wonder if Google somehow knew that the 500 was just another variation on the "soft 404"? I simply don't see it crawling a site for years while the status of robots.txt is up in the air. But it will happily keep crawling if it knows that there is no robots.txt.
And then something hiccuped, and it suddenly noticed that a 500 is not a 404. Or, conversely, it forgot that it used to know that on this site, 500 means 404.
I am as puzzled as you are. That is why in the opening post I just assumed that something changed on the 7th, when the WMT charts showed Google stop crawling - hence the title of this post, "..after only 10 days of robots.txt returning HTTP 500" (which is clearly an incorrect title now).
Looking at the Apache logs (grepping for Googlebot lines only) was a big surprise for me.
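For illustration, filtering the Googlebot lines out of an Apache access log can be sketched roughly like this - the sample log lines below are invented, and assume the standard combined log format:

```shell
# Invented sample of Apache combined-log lines, for illustration only
cat > /tmp/access_sample.log <<'EOF'
66.249.66.1 - - [07/May/2013:10:15:00 +0000] "GET /robots.txt HTTP/1.1" 500 312 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
10.0.0.5 - - [07/May/2013:10:15:30 +0000] "GET /index.html HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (Windows NT 6.1)"
66.249.66.1 - - [07/May/2013:10:16:30 +0000] "GET /robots.txt HTTP/1.1" 500 312 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
EOF

# Keep only the lines whose user agent mentions Googlebot
# (case-insensitive, so "googlebot" variants match too)
grep -i googlebot /tmp/access_sample.log
```

On a real 3GB log the same one-liner works unchanged, and redirecting its output to a smaller file makes the result far easier to open in an ordinary editor.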
Maybe there is a threshold reached or maybe Google *used* to treat 500 as 404, but then stopped at some point - who knows. In the period when it was requesting only robots.txt, Googlebot made something in the range of 1000 requests for robots.txt per day - a request roughly every minute and a half.
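A rough per-day tally like that ~1000/day figure can be pulled straight out of the log with standard tools; the sample lines below are invented for illustration and assume the combined log format:

```shell
# Invented sample of Googlebot robots.txt requests across two days
cat > /tmp/gbot_sample.log <<'EOF'
66.249.66.1 - - [08/May/2013:00:01:00 +0000] "GET /robots.txt HTTP/1.1" 500 312 "-" "Googlebot/2.1"
66.249.66.1 - - [08/May/2013:12:30:00 +0000] "GET /robots.txt HTTP/1.1" 500 312 "-" "Googlebot/2.1"
66.249.66.1 - - [09/May/2013:03:45:00 +0000] "GET /robots.txt HTTP/1.1" 500 312 "-" "Googlebot/2.1"
EOF

# Count robots.txt requests per day: field 4 is "[dd/Mon/yyyy:hh:mm:ss",
# so the date is the 11 characters after the opening bracket
grep '"GET /robots.txt' /tmp/gbot_sample.log \
  | awk '{print substr($4, 2, 11)}' \
  | sort | uniq -c
```

At ~1000 requests a day the interval works out to roughly one request every 86 seconds, which matches the "every minute and a half" above.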
I guess in hindsight I am just as happy to have had 18+ months of logs in ONE large 3GB file as I was initially annoyed at needing 40 minutes to download it plus a special editor/file splitter to view it.
ANYWAY - the news is that the developer fixed the site, so that as of this morning a request for a resource that is not found returns 404.
And as of today lunchtime the site returned to the SERPs, regaining ALL of their main keyword positions. We will have to wait for Analytics to see whether it regained the long tail too.
|Maybe there is a threshold reached or maybe Google *used* to treat 500 as 404 |
I don't know this for sure, but I think Google may stop crawling if robots.txt returns a 500 error because it could be an indication of a dynamic robots.txt file. These robots.txt files react to the user agent and often contain information regarding spider traps and other techniques to prevent scrapers. Going ahead and crawling whilst receiving a 500 error may well result in triggering a spider trap and Googlebot being banned from the site, so far better not to crawl IMO.
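To make the "dynamic robots.txt" idea concrete, here is a hypothetical sketch - the function name, paths, and rules are all invented, not taken from the site discussed in this thread - of a handler that serves different robots.txt content depending on the user agent:

```shell
# Hypothetical dynamic robots.txt handler; everything here is invented
# for illustration of the technique described above
dynamic_robots() {
  case "$1" in
    *Googlebot*)
      # Recognised crawler: serve the normal rules
      printf 'User-agent: *\nDisallow: /admin/\n'
      ;;
    *)
      # Anyone else: advertise a spider-trap path that scrapers
      # ignoring robots.txt conventions might then request
      printf 'User-agent: *\nDisallow: /trap/\n'
      ;;
  esac
}

dynamic_robots "Mozilla/5.0 (compatible; Googlebot/2.1)"
```

A bug in a handler like this would surface as exactly the kind of HTTP 500 on robots.txt described earlier in the thread, which is what makes the theory plausible.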
This is what I thought, but robots.txt was returning HTTP 500 for 18+ months and Googlebot was still crawling the site all that time - until the 7th of May this year, when it suddenly stopped crawling and 10 days later dropped the site from the index.
Interesting. Google have said things in the past about some instances of crawling when they believe robots.txt is not correct; however, the answer may have to do with webspam.
Some black hatters were spamming sites to the top of Google and then returning a 500 error on robots.txt to prevent the site being re-crawled and penalised (for keyword stuffing). This resulted in the sites hanging around for a long time (sometimes a year) in top spots. Perhaps Google was forced to re-crawl where robots.txt returned 500 errors to eliminate the spam out of the index. Now it seems to have reverted back to normal and stopped crawling 500 robots.txt.
|Now it seems to have reverted back to normal and stopped crawling 500 robots.txt. |
And probably for that reason it dropped the site out of the index after only 10 days of not crawling - so as to avoid black hats using the tactic you described!
This would make sense, yes. It is interesting that the switch to not crawling the site when robots.txt returns HTTP 500 happened on the 7th of May.
Before that the site was being crawled even though robots.txt was returning HTTP 500 for over 18 months!