Google does not honor robots.txt

Forum Moderators: goodroi

Message Too Old, No Replies

Google does not honor robots.txt

newbies

12:54 am on Aug 9, 2013 (gmt 0)

I recently received several email from google webmaster tool notifying me "increase in server errors". Most of the errors I found in Webmaster tool are links to forum internal pages such as the following

forums/ips_kernel/HTMLPurifier/HTMLPurifier/AttrDef/URI/Email.php
forums/user/32928-nxqx/

I have been running the forum since the start of my site and have included in robots.txt the following when the forum was installed:
User-agent: *
Disallow: /forums/admin/
Disallow: /forums/cache/
Disallow: /forums/converge_local/
Disallow: /forums/hooks/
Disallow: /forums/ips_kernel/
Disallow: /forums/user/

clearly, the urls google tries to access in vain are all excluded in robots.txt, why does google still want to crawl them?

BTW, I didn't have such problem a while ago, and no changes have been made to the forum and robots.txt ever since.

goodroi

10:04 am on Aug 9, 2013 (gmt 0)

I have come across times that Google did not follow robots.txt. It happens .001%. 99.999% of the time it is because someone else didn't do something right.

I know you say that robots.txt hasn't changed. Double-check it anyway.

Did you add Google+? Google+ will create a backdoor for Google to access pages regardless of robots.txt.

Is your hosting company reliable? It is rare but sometimes a cheap hosting company will not load up files correctly.

If you really don't want Google to access certain pages consider using htaccess.

lucy24

10:56 am on Aug 9, 2013 (gmt 0)

And don't forget "fetch as googlebot". Ordinarily you want it to succeed; here you want to see "Denied by robots.txt". (I detoured to verify that this happens if you request a page that is roboted-out.)

newbies

4:03 pm on Aug 9, 2013 (gmt 0)

I tried fetching one of the robots excluded pages as googlebot. It just did happily totally ignored robots.txt but hit a 500 error because the page is not intended for visitors.

here is the result:

=============
Fetch as Google

This is how Googlebot fetched the page.

URL: http://www.example.com/forums/ips_kernel/HTMLPurifier/HTMLPurifier/PercentEncoder.php

Date: Friday, August 9, 2013 at 8:56:57 AM PDT

Googlebot Type: Web

Download Time (in milliseconds): 96

HTTP/1.1 500 Internal Server Error
Date: Fri, 09 Aug 2013 15:56:57 GMT
Server: Apache/2.2.24 (Unix) mod_ssl/2.2.24 OpenSSL/1.0.0-fips mod_bwlimited/1.4
Accept-Ranges: bytes
Content-Length: 2716
Connection: close
Content-Type: text/html

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"><HTML><HEAD><TITLE>500 Internal Server Error</TITLE>...
=====================

google says the referring page is sitemap, actually all the error pages are not in sitemap.
What can I do in this case?

[edited by: phranque at 1:17 pm (utc) on Aug 10, 2013]
[edit reason] exemplified hostname [/edit]

lucy24

8:26 pm on Aug 9, 2013 (gmt 0)

500 error because the page is not intended for visitors.

I don't see how you get from A to B. A non-public page should be getting a 400-class error-- generally 401 or 403-- unless you've especially configured it so requests get a 500. What do error logs say? For that matter, what happens when you yourself request the page in an ordinary browser?

Double-check:
Does your robots.txt ever mention googlebot by name?

newbies

4:44 am on Aug 18, 2013 (gmt 0)

sorry for the confusion. it is 500. They should give a 403.

My robots.txt has these lines at the beginning:

User-agent: *
Disallow: /forums/admin/
Disallow: /forums/cache/
Disallow: /forums/converge_local/
Disallow: /forums/hooks/
Disallow: /forums/ips_kernel/

lucy24

6:18 am on Aug 18, 2013 (gmt 0)

500 for intended 403 may be a completely unrelated issue. Been there. Done that. Most likely explanation: You forgot to code an exemption for requests for the error page itself. (This includes internal requests such as would be triggered by a 403.) Result: vicious circle winding up with server throwing in the towel and delivering a 500 error.

newbies

5:38 pm on Aug 18, 2013 (gmt 0)

You are right. The 500 error was caused by setting a folder incorrectly to 777. I have corrected the mistake and will see if google still has the problem. Thank you

rvkumarweb

7:15 am on Sep 12, 2013 (gmt 0)

Hi Goodroi,

Sorry to say that not only Google most of the search engines first visit should be robots.txt of the website, and then if not goes to inner pages and looking for <Meta= robots function. Many times we confused the search engine not to index the particular pages of the site (my experience here). If the page is goes to 403 forbidden errors, search engine not able to crawl. First we need to make it as 404 page from 403 forbidden and then use the URL in disallow sources on robots.txt file after that see the results

Always welcome for your question.

Thanks and happy day.

phranque

10:17 am on Sep 12, 2013 (gmt 0)

does your robots.txt have a separate set of exclusion directives for User-agent: Googlebot or similar specification?

newbies

3:40 pm on Sep 12, 2013 (gmt 0)

does your robots.txt have a separate set of exclusion directives for User-agent: Googlebot or similar specification?

No.