Msg#: 4600575 posted 12:54 am on Aug 9, 2013 (gmt 0)
I recently received several emails from Google Webmaster Tools notifying me of an "increase in server errors". Most of the errors I found in Webmaster Tools are links to internal forum pages such as the following.
I have been running the forum since the start of my site, and robots.txt has included the following since the forum was installed:

User-agent: *
Disallow: /forums/admin/
Disallow: /forums/cache/
Disallow: /forums/converge_local/
Disallow: /forums/hooks/
Disallow: /forums/ips_kernel/
Disallow: /forums/user/

Clearly, the URLs Google tries to access in vain are all excluded in robots.txt, so why does Google still want to crawl them?
BTW, I didn't have this problem a while ago, and no changes have been made to the forum or robots.txt since then.
Msg#: 4600575 posted 10:56 am on Aug 9, 2013 (gmt 0)
And don't forget "Fetch as Googlebot". Ordinarily you want it to succeed; here you want to see "Denied by robots.txt". (I detoured to verify that this is what happens when you request a page that is roboted-out.)
Msg#: 4600575 posted 8:26 pm on Aug 9, 2013 (gmt 0)
500 error because the page is not intended for visitors.
I don't see how you get from A to B. A non-public page should be getting a 400-class error, generally 401 or 403, unless you've specifically configured it so those requests get a 500. What do the error logs say? For that matter, what happens when you yourself request the page in an ordinary browser?
Double-check: does your robots.txt ever mention Googlebot by name?
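Why that matters: crawlers obey only the most specific User-agent group that applies to them. If robots.txt contains a group for Googlebot anywhere, Googlebot ignores everything under "User-agent: *". A hypothetical example:

User-agent: Googlebot
Disallow: /something-else/

User-agent: *
Disallow: /forums/admin/

With a file like that, Googlebot would treat /forums/admin/ as crawlable, because the wildcard group no longer applies to it.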
Msg#: 4600575 posted 6:18 am on Aug 18, 2013 (gmt 0)
A 500 for an intended 403 may be a completely unrelated issue. Been there. Done that. Most likely explanation: you forgot to code an exemption for requests for the error page itself. (This includes internal requests, such as the one triggered by a 403.) Result: a vicious circle that winds up with the server throwing in the towel and delivering a 500 error.
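A minimal sketch of the trap, assuming Apache with mod_rewrite; the /errors/403.html path and the BadBot pattern are invented for illustration:

ErrorDocument 403 /errors/403.html

RewriteEngine On
# Forbid requests from an unwanted bot
RewriteCond %{HTTP_USER_AGENT} BadBot
# The exemption in question: without this line Apache's
# internal request for /errors/403.html is itself
# forbidden, the error page can never be delivered, and
# the vicious circle ends in a 500
RewriteCond %{REQUEST_URI} !^/errors/403\.html$
RewriteRule .* - [F]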
Msg#: 4600575 posted 7:15 am on Sep 12, 2013 (gmt 0)
Sorry to say, but for Google and most other search engines the first visit should be to the site's robots.txt; then, if a page is not excluded there, they go on to the inner pages and look for the robots meta tag. Many times we confuse the search engine about which particular pages of the site not to index (my experience here). If the page goes to a 403 Forbidden error, the search engine is not able to crawl it. First change the 403 Forbidden to a 404 page, then add the URL to the Disallow rules in the robots.txt file, and after that watch the results.
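For reference, the robots meta tag mentioned above, in the standard form that tells an engine to leave a page out of its index:

<meta name="robots" content="noindex">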