| 10:04 am on Aug 9, 2013 (gmt 0)|
I have come across times when Google did not follow robots.txt, but that happens maybe 0.001% of the time. The other 99.999% of the time, it is because someone else didn't do something right.
I know you say that robots.txt hasn't changed. Double-check it anyway.
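One way to double-check is to feed the file's actual contents to Python's `urllib.robotparser` and ask it about the exact URL Google fetched. The rules and URLs below are placeholders, not your real file:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents -- paste in your real file to test it.
robots_txt = """\
User-agent: *
Disallow: /private/
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# Googlebot falls under the catch-all group here, so /private/ is blocked:
print(rp.can_fetch("Googlebot", "https://example.com/private/page.html"))  # False
print(rp.can_fetch("Googlebot", "https://example.com/public/page.html"))   # True
```

If the parser says the URL is allowed when you expect it to be blocked, the problem is in the file, not in Google.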
Did you add Google+? Google+ can create a backdoor that lets Google reach pages regardless of robots.txt.
Is your hosting company reliable? It is rare, but sometimes a cheap hosting company will not upload files correctly.
If you really don't want Google to access certain pages, consider blocking them with .htaccess.
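For example, a minimal .htaccess sketch using Apache 2.2 syntax (matching the server header quoted later in this thread); the filename and user-agent pattern are illustrative:

```apache
# Unlike robots.txt, which crawlers obey voluntarily, .htaccess rules
# are enforced by the server itself.
SetEnvIfNoCase User-Agent "Googlebot" blocked_bot

<FilesMatch "^secret-page\.html$">
    Order allow,deny
    Allow from all
    Deny from env=blocked_bot
</FilesMatch>
```

Note that a blocked crawler will see a 403, not the page, so only use this where you truly never want the bot through.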
| 10:56 am on Aug 9, 2013 (gmt 0)|
And don't forget "fetch as googlebot". Ordinarily you want it to succeed; here you want to see "Denied by robots.txt". (I detoured to verify that this happens if you request a page that is roboted-out.)
| 4:03 pm on Aug 9, 2013 (gmt 0)|
I tried fetching one of the robots-excluded pages as Googlebot. It happily ignored robots.txt, but hit a 500 error because the page is not intended for visitors.
Here is the result:
Fetch as Google
This is how Googlebot fetched the page.
Date: Friday, August 9, 2013 at 8:56:57 AM PDT
Googlebot Type: Web
Download Time (in milliseconds): 96
HTTP/1.1 500 Internal Server Error
Date: Fri, 09 Aug 2013 15:56:57 GMT
Server: Apache/2.2.24 (Unix) mod_ssl/2.2.24 OpenSSL/1.0.0-fips mod_bwlimited/1.4
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"><HTML><HEAD><TITLE>500 Internal Server Error</TITLE>...
Google says the referring page is the sitemap, but in fact none of the error pages are in the sitemap.
What can I do in this case?
[edited by: phranque at 1:17 pm (utc) on Aug 10, 2013]
[edit reason] exemplified hostname [/edit]
| 8:26 pm on Aug 9, 2013 (gmt 0)|
|500 error because the page is not intended for visitors. |
I don't see how you get from A to B. A non-public page should be getting a 400-class error-- generally 401 or 403-- unless you've especially configured it so requests get a 500. What do error logs say? For that matter, what happens when you yourself request the page in an ordinary browser?
Does your robots.txt ever mention googlebot by name?
| 4:44 am on Aug 18, 2013 (gmt 0)|
Sorry for the confusion. It is a 500; it should be a 403.
My robots.txt has these lines at the beginning:
| 6:18 am on Aug 18, 2013 (gmt 0)|
A 500 where you intended a 403 may be a completely unrelated issue. Been there, done that. Most likely explanation: you forgot to code an exemption for requests for the error page itself (this includes internal requests, such as the one triggered by a 403). The result is a vicious circle that ends with the server throwing in the towel and delivering a 500 error.
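A sketch of the kind of exemption meant here, assuming Apache 2.2 (as in the server header quoted above) and a hypothetical /errors/403.html error document; this would go in the main server config:

```apache
# Custom error page for 403s.
ErrorDocument 403 /errors/403.html

# Deny the protected area...
<Directory "/var/www/html/private">
    Order allow,deny
    Deny from all
</Directory>

# ...but exempt the error page itself. Without this, Apache's internal
# request for /errors/403.html is also denied, the 403 recurses, and
# the server eventually gives up and returns a 500.
<Directory "/var/www/html/errors">
    Order allow,deny
    Allow from all
</Directory>
```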
| 5:38 pm on Aug 18, 2013 (gmt 0)|
You are right. The 500 error was caused by incorrectly setting a folder's permissions to 777. I have corrected the mistake and will see if Google still has the problem. Thank you.
| 7:15 am on Sep 12, 2013 (gmt 0)|
Sorry to say, not only Google but most search engines make their first visit to the site's robots.txt, and if a page is not excluded there they go on to the inner pages and look for the robots meta tag. Many times we want to tell the search engine not to index particular pages of the site (my experience here). If a page returns a 403 Forbidden error, the search engine is not able to crawl it. First change it from a 403 to a 404 page, then add the URL to a Disallow line in the robots.txt file, and then see the results.
You are always welcome to ask questions.
Thanks, and have a happy day.
| 10:17 am on Sep 12, 2013 (gmt 0)|
Does your robots.txt have a separate set of exclusion directives for User-agent: Googlebot or a similar specification?
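This matters because a Googlebot-specific group replaces the catch-all group entirely; Googlebot then ignores the `User-agent: *` rules. A quick illustration with Python's `urllib.robotparser` (the rules and URLs are made up):

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt with both a catch-all and a Googlebot group.
robots_txt = """\
User-agent: *
Disallow: /private/

User-agent: Googlebot
Disallow: /drafts/
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# Googlebot obeys only its own group, so /private/ is NOT blocked for it:
print(rp.can_fetch("Googlebot", "https://example.com/private/x.html"))     # True
print(rp.can_fetch("Googlebot", "https://example.com/drafts/x.html"))      # False
# ...while the catch-all group still blocks other bots from /private/:
print(rp.can_fetch("SomeOtherBot", "https://example.com/private/x.html"))  # False
```

So a stray Googlebot group can unintentionally open up pages you thought the `*` rules were blocking.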
| 3:40 pm on Sep 12, 2013 (gmt 0)|
|does your robots.txt have a separate set of exclusion directives for User-agent: Googlebot or similar specification? |