Sitemaps, Meta Data, and robots.txt Forum

    
Google does not honor robots.txt
newbies
Msg#: 4600575 posted 12:54 am on Aug 9, 2013 (gmt 0)

I recently received several emails from Google Webmaster Tools notifying me of an "increase in server errors". Most of the errors I found in Webmaster Tools are links to internal forum pages such as the following:

forums/ips_kernel/HTMLPurifier/HTMLPurifier/AttrDef/URI/Email.php
forums/user/32928-nxqx/

I have been running the forum since the start of my site and have included in robots.txt the following when the forum was installed:
User-agent: *
Disallow: /forums/admin/
Disallow: /forums/cache/
Disallow: /forums/converge_local/
Disallow: /forums/hooks/
Disallow: /forums/ips_kernel/
Disallow: /forums/user/

Clearly, the URLs Google tries (and fails) to access are all excluded in robots.txt, so why does Google still want to crawl them?

BTW, I didn't have this problem until recently, and no changes have been made to the forum or to robots.txt since then.
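One way to sanity-check that these Disallow lines really cover the failing URLs is Python's standard urllib.robotparser, which applies the same ordinary prefix matching mainstream crawlers use. A minimal sketch only, using the placeholder hostname that appears later in this thread:

from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("http://www.example.com/robots.txt")
rp.read()  # fetch and parse the live robots.txt

for url in (
    "http://www.example.com/forums/ips_kernel/HTMLPurifier/HTMLPurifier/AttrDef/URI/Email.php",
    "http://www.example.com/forums/user/32928-nxqx/",
):
    # can_fetch() returns False when a Disallow rule covers the URL
    print(url, "->", "allowed" if rp.can_fetch("Googlebot", url) else "blocked")

If both URLs print "blocked", the rules themselves are fine and the problem lies elsewhere.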

 

goodroi
Msg#: 4600575 posted 10:04 am on Aug 9, 2013 (gmt 0)

I have come across times when Google did not follow robots.txt, but that happens maybe 0.001% of the time. The other 99.999% of the time it is because someone on the site's end didn't do something right.

I know you say that robots.txt hasn't changed. Double-check it anyway.

Did you add Google+? Google+ will create a backdoor for Google to access pages regardless of robots.txt.

Is your hosting company reliable? It is rare but sometimes a cheap hosting company will not load up files correctly.

If you really don't want Google to access certain pages, consider using .htaccess.
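For example, on Apache a single mod_alias line in the site root's .htaccess returns a hard 403 for those paths no matter what any crawler does with robots.txt. A sketch only, using 2.2-era syntax to match the Apache/2.2.24 server shown later in this thread:

# Deny the same directories robots.txt tries to exclude; RedirectMatch
# with a bare status code sends a 403 instead of issuing a redirect.
RedirectMatch 403 ^/forums/(admin|cache|converge_local|hooks|ips_kernel|user)/

Unlike robots.txt, this blocks every visitor, not just well-behaved bots.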

lucy24
Msg#: 4600575 posted 10:56 am on Aug 9, 2013 (gmt 0)

And don't forget "fetch as googlebot". Ordinarily you want it to succeed; here you want to see "Denied by robots.txt". (I detoured to verify that this happens if you request a page that is roboted-out.)

newbies
Msg#: 4600575 posted 4:03 pm on Aug 9, 2013 (gmt 0)

I tried fetching one of the robots.txt-excluded pages as Googlebot. It happily ignored robots.txt altogether, but hit a 500 error because the page is not intended for visitors.

Here is the result:

=============
Fetch as Google

This is how Googlebot fetched the page.

URL: http://www.example.com/forums/ips_kernel/HTMLPurifier/HTMLPurifier/PercentEncoder.php

Date: Friday, August 9, 2013 at 8:56:57 AM PDT

Googlebot Type: Web

Download Time (in milliseconds): 96

HTTP/1.1 500 Internal Server Error
Date: Fri, 09 Aug 2013 15:56:57 GMT
Server: Apache/2.2.24 (Unix) mod_ssl/2.2.24 OpenSSL/1.0.0-fips mod_bwlimited/1.4
Accept-Ranges: bytes
Content-Length: 2716
Connection: close
Content-Type: text/html

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"><HTML><HEAD><TITLE>500 Internal Server Error</TITLE>...
=====================

Google says the referring page is the sitemap, but in fact none of the error pages are in the sitemap.
What can I do in this case?

[edited by: phranque at 1:17 pm (utc) on Aug 10, 2013]
[edit reason] exemplified hostname [/edit]

lucy24
Msg#: 4600575 posted 8:26 pm on Aug 9, 2013 (gmt 0)

500 error because the page is not intended for visitors.

I don't see how you get from A to B. A non-public page should be getting a 400-class error -- generally 401 or 403 -- unless you've specially configured it so requests get a 500. What do the error logs say? For that matter, what happens when you yourself request the page in an ordinary browser?
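A quick way to see the raw status line without a browser (not from the original post; plain curl against the placeholder hostname):

# -I sends a HEAD request and prints only the status line and response headers
curl -I http://www.example.com/forums/ips_kernel/HTMLPurifier/HTMLPurifier/PercentEncoder.php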

Double-check:
Does your robots.txt ever mention googlebot by name?

newbies
Msg#: 4600575 posted 4:44 am on Aug 18, 2013 (gmt 0)

Sorry for the confusion. It is a 500; the pages should give a 403.

My robots.txt has these lines at the beginning:

User-agent: *
Disallow: /forums/admin/
Disallow: /forums/cache/
Disallow: /forums/converge_local/
Disallow: /forums/hooks/
Disallow: /forums/ips_kernel/

lucy24
Msg#: 4600575 posted 6:18 am on Aug 18, 2013 (gmt 0)

500 for an intended 403 may be a completely unrelated issue. Been there. Done that. Most likely explanation: you forgot to code an exemption for requests for the error page itself. (This includes internal requests such as those triggered by a 403.) Result: a vicious circle that ends with the server throwing in the towel and delivering a 500 error.
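In Apache terms the trap looks something like this (a sketch with hypothetical paths, not the poster's actual configuration):

# BROKEN: the custom 403 page itself lives inside a denied directory, so the
# internal subrequest for the error page is denied as well; the recursion can
# surface as a 500 instead of the intended 403.
RedirectMatch 403 ^/forums/(admin|cache|converge_local|hooks|ips_kernel)/
ErrorDocument 403 /forums/admin/errors/403.html

# FIXED: keep the error page somewhere requests are always allowed.
ErrorDocument 403 /errors/403.html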

newbies
Msg#: 4600575 posted 5:38 pm on Aug 18, 2013 (gmt 0)

You are right. The 500 error was caused by a folder incorrectly set to 777. I have corrected the mistake and will see if Google still has the problem. Thank you.
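For reference: suEXEC/suPHP-style hosts (the response headers above show a cPanel-flavored Apache) commonly refuse to execute scripts in world-writable directories, and that refusal surfaces as exactly this kind of 500. Something like the following restores sane permissions; the path is just the one from this thread:

# Directories should be 755, not 777; world-writable dirs trip the handler's safety check
find forums/ips_kernel -type d -exec chmod 755 {} \;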

rvkumarweb
Msg#: 4600575 posted 7:15 am on Sep 12, 2013 (gmt 0)

Hi Goodroi,

Sorry to say that not only Google: most search engines' first visit should be to the site's robots.txt, and if that turns up nothing they go on to the inner pages looking for the robots meta tag. Many times we ourselves confuse the search engine about which particular pages of the site not to index (my experience here). If a page returns a 403 Forbidden error, the search engine is not able to crawl it. First turn the 403 Forbidden into a 404 page, then add the URL to the Disallow rules in the robots.txt file, and after that watch the results.

You are always welcome to ask questions.

Thanks, and have a happy day.

phranque
Msg#: 4600575 posted 10:17 am on Sep 12, 2013 (gmt 0)

Does your robots.txt have a separate set of exclusion directives for User-agent: Googlebot, or a similar specification?
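Context for the question: a crawler obeys only the most specific User-agent group that matches it, so a Googlebot-specific group would make Googlebot skip the User-agent: * rules entirely. A hypothetical example:

User-agent: Googlebot
Disallow: /some-other-path/

User-agent: *
Disallow: /forums/ips_kernel/

With a file like this, Googlebot reads only its own group and would crawl /forums/ips_kernel/ despite the catch-all rule.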

newbies
Msg#: 4600575 posted 3:40 pm on Sep 12, 2013 (gmt 0)

Does your robots.txt have a separate set of exclusion directives for User-agent: Googlebot, or a similar specification?

No.
