Google drops site after only 10 days of robots.txt returning HTTP 500
aakk9999
msg:4580857 - 2:21 am on Jun 4, 2013 (gmt 0)

Today we got a new client who pretty much disappeared from Google on 17th of May, having previously ranked on page 1 for their main keywords. The home page is nowhere to be found and their traffic dropped by 90%.

A quick check of the site showed that all requests for non-existing pages return HTTP 500 instead of 404. This in turn meant that a request for the non-existent robots.txt also returned HTTP 500, which stopped Google crawling the site. Crawl history shows that Google did crawl the site until the 7th of May, hence the error with the incorrect status code must have appeared on the 6th or 7th of May.

I have found an old thread on this here [webmasterworld.com ]

However, what is interesting now is that it took Google only 10 days to drop the home page from its index - whilst the previous thread linked above (5 years old) said Google still kept the old cache of pages in its index 4 months after the robots.txt HTTP 500 error started, and the indexed pages did not disappear from the index.

Posting this just as information for others who may be wondering about an unexplained drop in their traffic.

The best way to check whether robots.txt is the problem is to use "Fetch as Googlebot" in WMT and fetch the home page and the robots.txt file. If you get the message "unreachable robots.txt" then this could be the problem, even if robots.txt does not exist or never existed on the site - in which case go and check your response codes!

Also note that the "Blocked URLs" option in WMT that "tests" the robots.txt is not a good way to test this particular case, as it still reports the home page as "Allowed".
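
If you would rather script that check than eyeball it, here is a minimal sketch (Python 3, standard library only) that simply prints the status codes the server actually sends. www.example.com is a stand-in for your own domain and the second URL is deliberately made up:

import urllib.request
import urllib.error

def status_of(url):
    """Return the HTTP status code the server sends for a URL."""
    req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    try:
        with urllib.request.urlopen(req) as resp:
            return resp.status
    except urllib.error.HTTPError as e:
        return e.code  # 4xx/5xx responses land here

for url in ("http://www.example.com/robots.txt",
            "http://www.example.com/this-page-does-not-exist"):
    print(url, "->", status_of(url))

A healthy setup returns 200 (or a plain 404) for robots.txt and 404 for the made-up URL; if both come back as 500, Googlebot is seeing the same thing.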

Clarified the date of disappearing - added month

 

tedster
msg:4581028 - 12:37 pm on Jun 4, 2013 (gmt 0)

Thanks for reporting this kind of "edge case". From my recent experience, it is interesting how often people assume they are penalized or banned before they check the technical basics.

aakk9999
msg:4581521 - 1:40 am on Jun 6, 2013 (gmt 0)

I have just found something very interesting. Today I got the client's logs, all 3GB of them in one big file (the logs have kept appending to the same file since Oct 2011).

I saw that robots.txt was returning HTTP 500 to Googlebot for as far back as the logs go (18+ months), but Google kept crawling the site.

Then suddenly, on the 7th of May 2013, it stopped requesting other pages! Since that date Googlebot has ONLY requested robots.txt, and after getting HTTP 500 it did not crawl anything any more - I can only see the requests for robots.txt.

Ten days later (17th of May) the home page and most of the other pages got deindexed.

So either something changed on Google's side in how they handle the robots.txt response code - or Google decided that 18 months (and possibly longer) of getting HTTP 500 for robots.txt requests is enough.
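
For anyone who wants to repeat this kind of scan on their own logs, here is a rough sketch of what I mean (Python 3; it assumes an Apache combined-format log in a file called access.log, so adjust the filename and the regexes for your own setup). It counts Googlebot hits per day, robots.txt versus everything else, and tallies the status codes served for robots.txt:

import re
from collections import Counter
from datetime import datetime

DAY = re.compile(r'\[(\d{2}/\w{3}/\d{4})')               # e.g. [07/May/2013
REQ = re.compile(r'"(?:GET|HEAD) (\S+)[^"]*" (\d{3})')   # request path + status code

robots_by_day = Counter()   # Googlebot hits on /robots.txt per day
other_by_day = Counter()    # all other Googlebot hits per day
robots_status = Counter()   # status codes served for /robots.txt

with open("access.log", encoding="utf-8", errors="replace") as log:
    for line in log:
        if "Googlebot" not in line:
            continue
        day, req = DAY.search(line), REQ.search(line)
        if not (day and req):
            continue
        path, status = req.groups()
        if path.endswith("/robots.txt"):
            robots_by_day[day.group(1)] += 1
            robots_status[status] += 1
        else:
            other_by_day[day.group(1)] += 1

days = set(robots_by_day) | set(other_by_day)
for d in sorted(days, key=lambda d: datetime.strptime(d, "%d/%b/%Y")):
    print(d, "robots.txt:", robots_by_day[d], "other:", other_by_day[d])
print("robots.txt status codes:", dict(robots_status))

On a log like the one described above you would see the "other" column drop to zero from the 7th of May onwards, while the robots.txt requests (all 500s) keep coming.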

lucy24
msg:4581536 - 3:43 am on Jun 6, 2013 (gmt 0)

How infuriating that you didn't notice this while it was ongoing! 18 months seems an unbelievably long time to go without seeing robots.txt. Normally they make a fuss if they have to go longer than 24 hours.

I would have tried a Fetch as Googlebot to see whether robots.txt really is receiving a 500 response, or whether it's just your logging gone haywire. (Admittedly it's far more common in the other direction: the logs record a 200 while the visitor receives something else.) Why would only this one file generate a 500?

aakk9999
msg:4581580 - 7:48 am on Jun 6, 2013 (gmt 0)

How infuriating that you didn't notice this while it was ongoing!

This was a new client - I got the client on 4th of June, as the first sentence of my opening post says:

Today we got a new client who pretty much disappeared from Google on 17th of May


With regards to this, yes, absolutely...
I would have tried a Fetch as Googlebot to see whether robots.txt really is receiving a 500 response,

After entering www.example.com/robots.txt into the address bar and getting the text "Page not found", I went to check the response code and saw it was HTTP 500. Then I logged on to WMT and did exactly that: Fetch as Googlebot, first for the home page and then for robots.txt, and both fetches resulted in "Unreachable robots.txt".

I then checked the crawl stats in WMT and saw that crawling stopped on the 7th of May. I assumed some coding changes had been made that day, but only when I got the server access logs (yesterday), a huge file going 18 months back, did I see that HTTP 500 for robots.txt (and any other file not found!) had been returned for over 18 months.

Why would only this one file generate a 500?


Any request for a non-existing file returned HTTP 500 - this was the next thing I tried after seeing that robots.txt returns HTTP 500. Based on this I asked for an emergency upload of robots.txt, which resulted in crawling restarting. The developers are now fixing the response code for non-existing pages so that they return 404 - but they still haven't put the fix live.
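
Once the developers' fix goes live, a quick scripted check can confirm it - the same kind of status check as the sketch in my opening post, a minimal Python 3 sketch with www.example.com standing in for the real domain and deliberately made-up paths. Every one of them should come back as 404, not 500:

import urllib.request
import urllib.error

def status_of(url):
    """Return the HTTP status code the server sends for a URL."""
    try:
        with urllib.request.urlopen(url) as resp:
            return resp.status
    except urllib.error.HTTPError as e:
        return e.code

bogus_paths = ["/no-such-page", "/definitely/not/here.html", "/missing.txt"]
for path in bogus_paths:
    code = status_of("http://www.example.com" + path)
    flag = "OK" if code == 404 else "STILL WRONG"
    print(path, "->", code, flag)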

Now it is "wait and see" whether the rankings will return - and I am wondering whether Google will refuse to rank the site until all not-found pages return a proper 404 response code.

But at least I got Google crawling again.

Why the developer did not spot this is a mystery. The new client spent 3 weeks with the developer wondering what was happening; in the end the developer said he could not help and told them to find someone else to work out why they got deindexed.

lucy24
msg:4581595 - 8:38 am on Jun 6, 2013 (gmt 0)

I wonder if Google somehow knew that the 500 was just another variation on the "soft 404"? I simply don't see it crawling a site for years while the status of robots.txt is up in the air. But it will happily keep crawling if it knows that there is no robots.txt.

And then something hiccuped, and it suddenly noticed that a 500 is not a 404. Or, conversely, it forgot that it used to know that on this site, 500 means 404.

aakk9999
msg:4581685 - 2:56 pm on Jun 6, 2013 (gmt 0)

@Lucy,

I am as puzzled as you are. This is why in the opening post I just assumed that something changed on the 7th, when the WMT charts showed Google stopping crawling, hence the title of this post "..after only 10 days of robots.txt returning HTTP 500" (which is clearly an incorrect title now).

Looking at the Apache logs (grepping for Googlebot lines only) was a big surprise for me.

Maybe a threshold was reached, or maybe Google *used* to treat 500 as 404 but then stopped at some point - who knows. In the period whilst it was requesting only robots.txt, Googlebot made something in the range of 1000 requests for robots.txt per day - a request roughly every minute and a half.

I guess in hindsight I am just as happy to have had 18+ months of logs in ONE large 3GB file as I was initially annoyed at needing 40 minutes to download it, plus a special editor/file splitter to view it.

ANYWAY - the news is that the developer fixed the site, so as of this morning a request for a resource that is not found returns 404.

And as of lunchtime today the site has returned to the SERPs, regaining ALL of its main keyword positions. We have to wait for Analytics to see if it regained the long tail too.

seoskunk
msg:4581688 - 3:19 pm on Jun 6, 2013 (gmt 0)

Maybe a threshold was reached, or maybe Google *used* to treat 500 as 404


I don't know this for sure, but I think Google may stop crawling if robots.txt returns a 500 error because it could be an indication of a dynamic robots.txt file. These robots.txt files react to the user agent and often contain information regarding spider traps and other techniques to prevent scrapers. To go ahead and crawl whilst receiving a 500 error may well result in triggering a spider trap and Googlebot being banned from the site, so it is far better not to crawl, IMO.
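
One quick, if rough, way to test that theory on a given site is to request robots.txt with two different User-Agent strings and compare what comes back. A minimal Python 3 sketch (www.example.com is a placeholder, and spoofing the UA only tests UA-based behaviour, nothing more):

import urllib.request
import urllib.error

AGENTS = {
    "browser": "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36",
    "googlebot": "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
}

for name, ua in AGENTS.items():
    req = urllib.request.Request("http://www.example.com/robots.txt",
                                 headers={"User-Agent": ua})
    try:
        with urllib.request.urlopen(req) as resp:
            print(name, "->", resp.status, len(resp.read()), "bytes")
    except urllib.error.HTTPError as e:
        print(name, "->", e.code)

If the status code or the body differs between the two, robots.txt is being generated per user agent rather than served as a static file.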

aakk9999
msg:4581693 - 3:39 pm on Jun 6, 2013 (gmt 0)

@seoskunk

This is what I thought, but robots.txt was returning HTTP 500 for 18+ months and Googlebot was still crawling the site all that time - until the 7th of May this year, when it suddenly stopped crawling and 10 days later dropped the site from the index.

seoskunk
msg:4581701 - 4:04 pm on Jun 6, 2013 (gmt 0)

Interesting. Google have spoken in the past about some instances where they crawl even when they believe robots.txt is not correct; however, the answer may have to do with webspam.

Some black hatters were spamming sites to the top of Google and then returning a 500 error on robots.txt to prevent the site being re-crawled and penalised (for keyword stuffing). This resulted in the sites hanging around in top spots for a long time (sometimes a year). Perhaps Google was forced to re-crawl where robots.txt returned 500 errors in order to eliminate the spam from the index. Now it seems to have reverted to normal and stopped crawling when robots.txt returns 500.

aakk9999
msg:4581826 - 4:55 pm on Jun 6, 2013 (gmt 0)

Now it seems to have reverted to normal and stopped crawling when robots.txt returns 500.


And probably for that reason it dropped the site out of the index after only 10 days of not crawling - so as to stop black hats using the tactic you described!

This would make sense, yes. It is interesting that the switch to not crawling the site when robots.txt returns HTTP 500 happened on the 7th of May.

Before that the site was being crawled even though robots.txt had been returning HTTP 500 for over 18 months!
