| 12:30 pm on Aug 1, 2011 (gmt 0)|
You might want to double-check your claim. You may be correct, but in my experience almost everyone who has made this claim to me was mistaken.
Here are common issues I have come across:
1) Robots.txt was not set up correctly
2) Robots.txt is correct but was uploaded while Google was already crawling the content
3) Robots.txt is correct but placed in the wrong location
4) Someone is trying to scrape content and is faking the Googlebot user agent (check the IP)
On rare occasions I have seen Google make a mistake. Considering they crawl billions of pages the occasional glitch is to be expected. If you have confidential information don't put it online. If you need to put sensitive information online use htaccess to further secure it.
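On point 4, one way to confirm whether a hit really came from Googlebot is the reverse-then-forward DNS check. A minimal sketch (the function names are mine, not from this thread):

```python
import socket

def hostname_is_google(hostname: str) -> bool:
    """True if a reverse-DNS hostname belongs to Google's crawler domains."""
    return hostname.rstrip(".").endswith((".googlebot.com", ".google.com"))

def verify_googlebot(ip: str) -> bool:
    """Reverse-resolve the IP, check the domain, then forward-confirm it."""
    try:
        host, _, _ = socket.gethostbyaddr(ip)  # PTR (reverse) lookup
    except socket.herror:
        return False
    if not hostname_is_google(host):
        return False
    try:
        # Forward lookup must resolve back to the same IP, or the PTR is spoofed
        return ip in socket.gethostbyname_ex(host)[2]
    except socket.gaierror:
        return False
```

A scraper can fake the user-agent string, but it cannot make Google's DNS answer for its IP, so this check catches the fakers described above.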
| 12:58 pm on Aug 1, 2011 (gmt 0)|
No, it has been in place for many months. No changes. All good. Then I got a notification that it was blocked (via the spider trap notifier).
Sure enough, it was. On double-checking, Google Webmaster Tools reported a 403 Forbidden error. The IP was Google's.
I whitelisted it, and Google webmaster tools then gave a success.
| 1:01 pm on Aug 1, 2011 (gmt 0)|
BTW, I use the spider trap to stop crawlers from eating bandwidth. The site is pretty big, with hundreds of thousands of pages.
It isn't there for security reasons.
| 1:03 pm on Aug 1, 2011 (gmt 0)|
I am backing his claim. I also got hit by it, some 4 months after setting up a spider trap which has until now been working fine.
The link to the spider trap is rel=nofollowed, and the folder is banned in robots.txt.
The spider trap bans by IP address, not user agent, so this isn't caused by a faker. And of course robots.txt was set up correctly and in advance: it was in place days before the spider trap was turned on, and it has run with no problems for months.
My logs show it was the real Google, from a real Google IP address, that ignored my robots.txt, ignored rel=nofollow and basically killed my site.
| 7:19 pm on Aug 1, 2011 (gmt 0)|
Is the folder that Google is banned from spidering listed in the User-agent: * section of the robots.txt file, or in a section especially for Googlebot?
| 1:33 pm on Aug 2, 2011 (gmt 0)|
Can you post the IP of the particular Google bot?
| 11:42 am on Aug 3, 2011 (gmt 0)|
Hi, the IP is 126.96.36.199.
Folder is listed under both:
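Presumably "listed under both" means something along these lines (the folder name is a placeholder, not the actual path):

```
User-agent: *
Disallow: /spider-trap/

User-agent: Googlebot
Disallow: /spider-trap/
```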
| 3:26 pm on Aug 3, 2011 (gmt 0)|
Ditto from 188.8.131.52, which stepped in my honeypot.
Folders have been listed for several years under both:
| 6:49 pm on Aug 3, 2011 (gmt 0)|
Would some brave person like to experiment by changing it to Googlebot, capitalized? I wouldn't put it past them...
| 7:08 pm on Aug 3, 2011 (gmt 0)|
I have always assumed that it is case sensitive.
| 9:05 pm on Aug 3, 2011 (gmt 0)|
If it is case sensitive then * should apply as staffa and starchild say their folders have been listed under both. So Googlebot should not have gone in there.
| 9:35 pm on Aug 3, 2011 (gmt 0)|
Actually, in my robots.txt it's: Googlebot
as copied from the UA, which I always do to be sure to have the spelling right.
(I just copied those lines from the post above)
| 9:25 am on Aug 30, 2011 (gmt 0)|
The same IP just got banned again. I have whitelisted it once more, but it seems this googlebot is simply not adhering to the robots.txt file.
Confirmed at Webmaster Tools, and indeed it was caught.
It makes things very hard when Google doesn't follow its own directives.
| 3:22 pm on Sep 1, 2011 (gmt 0)|
Perhaps this is happening:
At the exact time Google fetches your robots.txt, your server is unresponsive for a short period. Google then marks your site as having no robots.txt, proceeds as if there is none, and starts crawling everything it can find.
It probably has a fail-safe, where pages have to be crawled a few times over a given period, with robots.txt checked each time, to ensure it has actually seen the robots.txt.
The problem is, if tripping the honeypot makes your website serve a 403 or 404 error to the rogue crawler forever more, then when Google comes back to re-read your robots.txt it can't, and it never learns that it has done the wrong thing.
If this is the case, then honeypots that ban after a single crawl of a robots.txt-banned folder will never work, and you should remove that rule and try something different to stop rogue crawlers.
| 5:14 pm on Sep 1, 2011 (gmt 0)|
|The same IP just got banned again. Have whitelisted once more, but it seems this googlebot is simply not adhering to the robots.txt file |
I have seen Googlebot ignore noindex and robots.txt rules if content is heavily linked from other websites. Google picks up indicators that the content is important and it seems that if enough sites "vote" for the disallowed content Google will ignore your attempts to keep them out. There was a discussion on here a while back about this very topic but I can't find it.
| 12:32 pm on Sep 3, 2011 (gmt 0)|
Part of my daily surf is to check that no pages in my various blocked sections have been indexed by Google (it's a site: search).
Every few months (I'd go so far as to say once or twice a year) there is a result there, and it's a quick trip to WMT to get it removed.
I had been assuming it was Google's fault, but nippi's explanation of a glitch in robots.txt reading would cover it, although really I would have expected a bigger tranche than the single page I see...
But - it definitely does happen, on old established sites where I haven't touched the robots file in years.
| 1:30 pm on Sep 3, 2011 (gmt 0)|
If .htaccess serves robots.txt to ALL, regardless of any deny rules, then no bot can fail to see it.
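In Apache 2.2 terms, a sketch of that idea might be (assuming the spider trap bans via deny rules in .htaccess):

```apache
# Always serve robots.txt, even to IPs the spider trap has banned,
# so a blocked Googlebot can still re-read the rules.
<Files "robots.txt">
    Order Allow,Deny
    Allow from all
</Files>
```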
| 8:38 pm on Sep 4, 2011 (gmt 0)|
But what if robots.txt returns a 500 error because of an overloaded server, or some other error stops robots.txt from being served at all?
| 9:00 pm on Sep 4, 2011 (gmt 0)|
Google also ignores robots.txt if the page has a +1 button on it.
See the parallel WebmasterWorld thread.
| 10:41 pm on Sep 4, 2011 (gmt 0)|
Gbot doesn't always read robots.txt before each crawl of one or more pages.
In the occurrences I have seen on one of my sites, the bot went straight for the disallowed folder without first reading the robots.txt (which has remained unchanged for several years), and that was before +1 buttons existed.
| 11:43 pm on Sep 4, 2011 (gmt 0)|
Confirmed, googlebot has requested a page in my 3 day old honeypot for a page that has never existed and, obviously, has never been linked to. The ONLY place it is even mentioned online is in the .htaccess file to control the honeypot as well as in the robots.txt file itself.
Parsing my logs shows that googlebot also likes to check whether my site is WordPress-based by attempting to load example.com/xmlrpc.php, a standard WordPress file. My site isn't WordPress-based, so the request returns an error instead of WordPress's default "XML-RPC server accepts POST requests only." message. I would guess that Google does this check to see if the site runs WordPress, or perhaps to see if the site is secure? Whichever the case, googlebot can and does ignore robots.txt on occasion.
Perhaps we need a list from Google explaining EXACTLY under which circumstances they will ignore robots.txt... since they are essentially acting like a spam bot or scraper at that point. It's no longer a matter of IF they ignore it, but WHY and under what conditions.
P.S. For WordPress-based sites: if you have remote publishing turned off, you can set up .htaccess to immediately ban any IP address attempting to load that file.
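A hedged sketch of that .htaccess idea (the honeypot script name is hypothetical, not from this thread):

```apache
# Non-WordPress site: any request for xmlrpc.php is hostile probing,
# so hand it to the same honeypot script that logs and bans the IP.
RewriteEngine On
RewriteRule ^xmlrpc\.php$ /honeypot.php [L]
```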
| 1:26 am on Sep 5, 2011 (gmt 0)|
It's a common misconception that robots.txt completely blocks out Googlebot. Googlebot can still discover, and even index, locations blocked by robots.txt. It just doesn't read the page, nor does it index its actual content.
If you really want to physically block Googlebot, use an IP filter.
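An IP filter in .htaccess could look like this, in Apache 2.2 syntax (the address is a documentation placeholder, not a real Google IP):

```apache
# Physically block one crawler IP at the server level
Order Allow,Deny
Allow from all
Deny from 192.0.2.1
```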
| 1:40 am on Sep 5, 2011 (gmt 0)|
Ah, but that's the problem. Under some conditions which are not yet clear, googlebot does request and read disallowed URLs, and that's contrary to what they state they will do. And it seems to be too common to be only a one-off technical error.
It wouldn't be the first time a Google engineer programmed something that was contrary to someone else's policy at Google. They use a kind of rapid, agile development that seems to allow a lot of autonomy and the QA often comes later. An 80% OK is about what it takes for code to get pushed live, from what I've been reading in those new books that were published this year. And no company can thrive if they try for 100% on QA before going live.
| 2:16 am on Sep 5, 2011 (gmt 0)|
Tedster, here's a video by Matt Cutts on the issue: [youtube.com...]
| 2:30 am on Sep 5, 2011 (gmt 0)|
|It's a common misconception that robots.txt completely blocks out Googlebot. |
Not in this forum it isn't ;) The question is really: Are we stupid or is google crooked?
| 4:22 am on Sep 5, 2011 (gmt 0)|
Thanks johnnie. That 2009 video does explain why an uncrawled URL can still show up in the search results. The reports in this thread are not about showing up in the SERPs, they're taken from entries showing up in the actual server logs.
I notice near the end of the video, Matt says "probably 90% of the time when someone says you violated my robots.txt" ... that does leave 10% of the time for something different.
I know that once in a while googlebot does make a burble in this area. However, given reports in this thread, it seems like it might be happening a bit more often right now.
| 4:28 am on Sep 5, 2011 (gmt 0)|
Do your robots.txt files utilize a database for any reason? If so, do you record MySQL errors?
If your robots.txt file relies on a service that temporarily becomes unavailable, do you know how your site will react? For example, by using the base element my entire site works seamlessly both locally and live with exactly the same code; I have tested my site by disabling the database and then loading pages (since my site is now database-driven). My site is actually aware of the failure and falls back into a "safe mode" of sorts if the database becomes unavailable for that request.
If that is the situation, then I would recommend serving an HTTP 503 header, on the presumption that Google will wait a few moments or minutes before trying again, instead of simply guessing at what you're seeing.
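One way to serve that 503 with mod_rewrite, as a sketch (the flag file name is hypothetical): while the backend is down, answer 503 instead of 500, and keep robots.txt readable so Google never treats it as missing:

```apache
# Return 503 Service Unavailable while a maintenance flag file exists,
# but exempt robots.txt so crawlers can always read it.
RewriteEngine On
RewriteCond %{DOCUMENT_ROOT}/maintenance.flag -f
RewriteCond %{REQUEST_URI} !^/robots\.txt$
RewriteRule ^ - [R=503,L]
```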
| 11:03 am on Sep 5, 2011 (gmt 0)|
Serving the right error codes for both planned and unplanned outages is something that few sites get completely right. It's probably another factor in some cases.
| 1:10 am on Sep 6, 2011 (gmt 0)|
If the page has Google Plus button on it, the bot will not follow the robots.txt directive: [seroundtable.com...]
Mod's note: adding quote for context
|Googler, Jenny Murphy, provides a quick answer: |
The +1 Button interacts with robots.txt and other crawler directives in an interesting way. Since +1's can only be applied to public pages, we may visit your page at the time the +1 Button is clicked to verify that it is indeed public. This check ignores crawler directives. This does not, however, impact the behavior of Google web search crawlers and how they interact with your robots.txt file.
[edited by: Robert_Charlton at 4:00 am (utc) on Sep 6, 2011]
[edit reason] Added quote [/edit]