
Google SEO News and Discussion Forum

Googlebot getting caught in robots.txt spider trap
starchild - msg:4346140 - 11:14 am on Aug 1, 2011 (gmt 0)

Hi,

I saw today that Googlebot got caught in a spider trap it shouldn't have reached, as that directory is blocked via robots.txt.

I know of at least one other person this has happened to recently.

Why is Googlebot ignoring robots.txt?

 

goodroi - msg:4346164 - 12:30 pm on Aug 1, 2011 (gmt 0)

You might want to double-check your claim. You may be correct, but in my experience almost all of the people who have made this claim to me were mistaken.

Here are common issues I have come across:

1) Robots.txt was not set up correctly
2) Robots.txt is correct but was uploaded while Google was already crawling the content
3) Robots.txt is correct but placed in the wrong location
4) Someone is trying to scrape content and is faking the Googlebot user agent (check the IP - see the sketch at the end of this post)

On rare occasions I have seen Google make a mistake. Considering they crawl billions of pages, the occasional glitch is to be expected. If you have confidential information, don't put it online. If you need to put sensitive information online, use .htaccess to further secure it.
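For point 4, one quick way to check whether a hit claiming to be Googlebot is genuine is a reverse DNS lookup followed by a forward lookup. A minimal Python sketch of that check (the IP shown is only an example):

import socket

def is_real_googlebot(ip):
    # Reverse-resolve the IP, check the hostname, then resolve it forward again.
    try:
        host = socket.gethostbyaddr(ip)[0]   # e.g. crawl-66-249-xx-xx.googlebot.com
    except socket.herror:
        return False
    if not host.endswith((".googlebot.com", ".google.com")):
        return False
    try:
        # The forward lookup must map back to the original IP.
        return ip in socket.gethostbyname_ex(host)[2]
    except socket.gaierror:
        return False

print(is_real_googlebot("66.249.71.108"))   # example IP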

starchild - msg:4346168 - 12:58 pm on Aug 1, 2011 (gmt 0)

No. It has been in place for many months with no changes; all good. Then I got a notification that it had been blocked (via the spider trap notifier).

Sure enough, it was. On double-checking, Google Webmaster Tools reported a 403 Forbidden error. The IP was Google's.

I whitelisted it, and Google Webmaster Tools then reported success.

starchild - msg:4346169 - 1:01 pm on Aug 1, 2011 (gmt 0)

BTW, the spider trap is there to stop crawlers from eating bandwidth. The site is pretty big, with hundreds of thousands of pages.

It isn't there for security reasons.

nippi - msg:4346170 - 1:03 pm on Aug 1, 2011 (gmt 0)

I am backing his claim. I also got hit by it, some 4 months after setting up a spider trap which had been working fine until now.

The link to the spider trap is rel=nofollow'ed, and the folder is banned in robots.txt.

The spider trap bans by IP address, not user agent, so it isn't caused by a faker - and of course robots.txt was set up correctly and in advance; it was in place days before the spider trap was turned on, and it has run with no problems for months.

My logs show it was the real Google, from a real Google IP address, that ignored my robots.txt, ignored rel=nofollow, and basically killed my site.

g1smd - msg:4346327 - 7:19 pm on Aug 1, 2011 (gmt 0)

Is the folder that Google is banned from spidering listed in the
User-agent: * section of the robots.txt file or in a section especially for Googlebot?

mslina2002 - msg:4346662 - 1:33 pm on Aug 2, 2011 (gmt 0)

Can you post the IP of the particular Google bot?

starchild - msg:4347104 - 11:42 am on Aug 3, 2011 (gmt 0)

Hi, the IP is 66.249.71.108.

Folder is listed under both:
User-agent: googlebot
and
User-agent: *
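Spelled out with a made-up folder name, that robots.txt looks something like this (/spider-trap/ is just a placeholder):

User-agent: googlebot
Disallow: /spider-trap/

User-agent: *
Disallow: /spider-trap/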

Staffa - msg:4347216 - 3:26 pm on Aug 3, 2011 (gmt 0)

Ditto from 66.249.66.79; it stepped in my honeypot.

Folders have been listed for several years under both:
User-agent: googlebot
and
User-agent: *

lucy24 - msg:4347328 - 6:49 pm on Aug 3, 2011 (gmt 0)

Would some brave person like to experiment by changing it to Googlebot, capitalized? I wouldn't put it past them...

g1smd - msg:4347338 - 7:08 pm on Aug 3, 2011 (gmt 0)

I have always assumed that it is case sensitive.

I use:
User-agent: Googlebot

aakk9999 - msg:4347387 - 9:05 pm on Aug 3, 2011 (gmt 0)

If it is case sensitive, then * should still apply, since Staffa and starchild say their folders have been listed under both. So Googlebot should not have gone in there.

Staffa - msg:4347403 - 9:35 pm on Aug 3, 2011 (gmt 0)

Actually, in my robots.txt it's: Googlebot - copied from the UA, which I always do to be sure I have the spelling right.

(I just copied those lines from the post above)

starchild - msg:4356614 - 9:25 am on Aug 30, 2011 (gmt 0)

Well,

The same IP just got banned again. I have whitelisted it once more, but it seems this Googlebot is simply not adhering to the robots.txt file.

Confirmed in Webmaster Tools, and indeed it was caught.

It makes things very hard when Google doesn't follow its own directives.

nippi - msg:4357632 - 3:22 pm on Sep 1, 2011 (gmt 0)

Perhaps this is what's happening:

At the exact time Google requests your robots.txt, your server is unresponsive for a short period. Google then marks your site as having no robots.txt, proceeds as if there is none, and starts crawling everything it can find.

It probably has a fail-safe: the pages have to be crawled a few times over a given period, with robots.txt checked each time, to confirm it has actually seen the robots.txt.

The problem is, if your honey trap makes your website serve a 403 or 404 error to a rogue crawler forever more after it trips the trap, then when Google comes back to re-read your robots.txt it can't read it again and never learns it has done the wrong thing.

If this is the case, then honey traps that ban permanently after a robots.txt-disallowed folder is crawled a single time will never work, and you should remove that rule and try something different to stop rogue crawlers.
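If that theory is right, one fix is to make the ban exempt robots.txt itself, so a crawler that trips the trap can still re-read the rules later. A rough, self-contained Python sketch of that logic (the folder name and responses are made up):

# Spider trap sketch: bans an IP that enters the trap folder, but never
# blocks robots.txt itself, so a banned crawler can still re-read the rules.
banned_ips = set()
ROBOTS_TXT = "User-agent: *\nDisallow: /spider-trap/\n"

def handle_request(ip, path):
    if path == "/robots.txt":
        return 200, ROBOTS_TXT          # always served, even to banned IPs
    if ip in banned_ips:
        return 403, "Forbidden"
    if path.startswith("/spider-trap/"):
        banned_ips.add(ip)              # trip the trap: ban this IP from now on
        return 403, "Forbidden"
    return 200, "normal page content"

# A bot hits the trap and gets banned, yet can still fetch robots.txt:
print(handle_request("203.0.113.5", "/spider-trap/x.html"))   # (403, ...)
print(handle_request("203.0.113.5", "/robots.txt"))           # (200, ...)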

SEOMike - msg:4357674 - 5:14 pm on Sep 1, 2011 (gmt 0)

The same IP just got banned again. I have whitelisted it once more, but it seems this Googlebot is simply not adhering to the robots.txt file.


I have seen Googlebot ignore noindex and robots.txt rules if content is heavily linked from other websites. Google picks up indicators that the content is important and it seems that if enough sites "vote" for the disallowed content Google will ignore your attempts to keep them out. There was a discussion on here a while back about this very topic but I can't find it.

leadegroot - msg:4358264 - 12:32 pm on Sep 3, 2011 (gmt 0)

Part of my daily surf is to check that no pages in my various blocked sections have been indexed by Google (it's a site: search).
Every few months (I'd go so far as to say once or twice a year) there is a result there, and it's a quick trip to WMT to get it removed.
I had been assuming it was Google's mistake, but nippi's explanation of a glitch in robots.txt reading would cover it, although I would have expected a bigger tranche than the one page I see, really...
But - it definitely does happen, on old established sites where I haven't touched the robots file in years.

tangor - msg:4358289 - 1:30 pm on Sep 3, 2011 (gmt 0)

If .htaccess is set up so that robots.txt is served to ALL visitors regardless of any deny rules, then no bot can fail to see it.

nippi - msg:4358589 - 8:38 pm on Sep 4, 2011 (gmt 0)

But what if robots.txt returns a 500 error because of an overloaded server, or some other error stops robots.txt from being served at all?

g1smd - msg:4358595 - 9:00 pm on Sep 4, 2011 (gmt 0)

Google also ignores robots.txt if the page has a +1 button on it.

See the parallel WebmasterWorld thread.

Staffa - msg:4358619 - 10:41 pm on Sep 4, 2011 (gmt 0)

Googlebot doesn't always read robots.txt before each crawl of one or more pages.
In the occurrences I have seen on one of my sites, the bot went straight for the disallowed folder without first reading the robots.txt (which has remained unchanged for several years), and that was before +1 buttons existed.

Sgt_Kickaxe - msg:4358633 - 11:43 pm on Sep 4, 2011 (gmt 0)

Confirmed: Googlebot has requested a page in my 3-day-old honeypot, a page that has never existed and, obviously, has never been linked to. The ONLY place it is even mentioned online is in the .htaccess file that controls the honeypot and in the robots.txt file itself.

Parsing logs shows me that Googlebot also likes to check whether my site is WordPress-based by attempting to load example.com/xmlrpc.php, a standard WordPress file. My site isn't WordPress-based, so the request returns an error instead of the WordPress default "XML-RPC server accepts POST requests only." message. I would guess that Google does this check to see if the site is WordPress, or perhaps they want to see if the site is secure? Whichever the case, Googlebot can and does ignore robots.txt on occasion.

Perhaps we need a list from Google explaining EXACTLY under which circumstances they would/will ignore robots.txt... since they are essentially acting like a spam bot or scraper at that point. It's no longer a matter of IF they ignore it but WHY and under what conditions.

P.S. For WordPress-based sites: if you have remote publishing turned off, you can set up .htaccess to immediately ban any IP address attempting to load that file.
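For what it's worth, the kind of log parsing described above takes only a few lines of Python. This sketch assumes an Apache-style access log at a made-up path and just lists the IPs that have requested /xmlrpc.php:

import re

LOG_FILE = "/var/log/apache2/access.log"    # hypothetical path - adjust to your server
pattern = re.compile(r'^(\S+) .* "(?:GET|POST) /xmlrpc\.php')

probing_ips = set()
with open(LOG_FILE) as log:
    for line in log:
        match = pattern.match(line)
        if match:
            probing_ips.add(match.group(1))   # first field is the client IP

for ip in sorted(probing_ips):
    print(ip)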

johnnie - msg:4358647 - 1:26 am on Sep 5, 2011 (gmt 0)

It's a common misconception that robots.txt completely blocks out Googlebot. Googlebot still discovers and even indexes locations blocked by robots.txt. It just doesn't read, nor index, the actual content of the page.

If you really want to physically block Googlebot, use an IP filter.
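For illustration, a bare-bones Python sketch of what such an IP filter means (the 66.249. prefix matches the Googlebot IPs quoted earlier in this thread; maintain your own list rather than relying on this one):

BLOCKED_PREFIXES = ("66.249.",)   # example prefix only

def allow_request(ip):
    # Refuse requests whose source IP falls in a blocked prefix.
    return not ip.startswith(BLOCKED_PREFIXES)

print(allow_request("66.249.71.108"))   # False - blocked
print(allow_request("203.0.113.7"))     # True  - allowed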

tedster - msg:4358649 - 1:40 am on Sep 5, 2011 (gmt 0)

It just doesn't read,

Ah, but that's the problem. Under some conditions that are not yet clear, Googlebot does request and read disallowed URLs, which is contrary to what they state they will do. And it seems too common to be only a one-off technical error.

It wouldn't be the first time a Google engineer programmed something that was contrary to someone else's policy at Google. They use a kind of rapid, agile development that seems to allow a lot of autonomy and the QA often comes later. An 80% OK is about what it takes for code to get pushed live, from what I've been reading in those new books that were published this year. And no company can thrive if they try for 100% on QA before going live.

johnnie - msg:4358657 - 2:16 am on Sep 5, 2011 (gmt 0)

Tedster, here's a video by Matt Cutts on the issue: [youtube.com...]

lucy24 - msg:4358658 - 2:30 am on Sep 5, 2011 (gmt 0)

It's a common misconception that robots.txt completely blocks out Googlebot.

Not in this forum it isn't ;) The question is really: are we stupid, or is Google crooked?

tedster - msg:4358667 - 4:22 am on Sep 5, 2011 (gmt 0)

Thanks johnnie. That 2009 video does explain why an uncrawled URL can still show up in the search results. But the reports in this thread are not about URLs showing up in the SERPs; they're taken from entries showing up in the actual server logs.

I notice near the end of the video, Matt says "probably 90% of the time when someone says you violated my robots.txt"... that does leave 10% of the time for something different.

I know that once in a while Googlebot does make a burble in this area. However, given the reports in this thread, it seems like it might be happening a bit more often right now.

JAB Creations - msg:4358670 - 4:28 am on Sep 5, 2011 (gmt 0)

Do your robots.txt files utilize a database for any reason? If so, do you record MySQL errors?

I started recording JavaScript, PHP and MySQL errors to logs months ago, and I've noticed (and confirmed by talking with my current host) that they reset MySQL every so often. The error message generated is "MySQL server has gone away". This may occur several times during the day, smack dab in the middle of the day in fact, and the database may be unavailable for several seconds - long enough to cause someone somewhere some sort of trouble. I highly doubt the host I use is the only one who does this.

If your robots.txt file uses a service that temporarily becomes unavailable, do you know how your site will react? For example, by using the base element my entire site works seamlessly both locally and live with the exact same code; I have tested my site by disabling the database and then loading pages (since my site is now database-driven). My site is aware of the problem and able to fall back into a "safe mode" of sorts if the database becomes unavailable for that request.

If that is what's happening, then I would recommend serving an HTTP 503 status, on the presumption that Google will wait a few moments or minutes before trying again, rather than simply guessing at what you're seeing.
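For example, a bare WSGI sketch of that 503 idea for a database-driven robots.txt (fetch_rules_from_db() is a hypothetical stand-in for the real lookup):

def fetch_rules_from_db():
    # Stand-in for the real database call; raise to simulate an outage.
    raise RuntimeError("MySQL server has gone away")

def robots_txt_app(environ, start_response):
    try:
        body = fetch_rules_from_db()
    except Exception:
        # Database unavailable: tell crawlers to come back later instead of
        # letting them conclude there is no robots.txt.
        start_response("503 Service Unavailable",
                       [("Content-Type", "text/plain"), ("Retry-After", "120")])
        return [b"robots.txt temporarily unavailable\n"]
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [body.encode("utf-8")]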

- John

g1smd - msg:4358758 - 11:03 am on Sep 5, 2011 (gmt 0)

Serving the right error codes for both planned and unplanned outages is something that few sites get completely right. It's probably another factor in some cases.

draftzero - msg:4358931 - 1:10 am on Sep 6, 2011 (gmt 0)

If the page has a Google Plus button on it, the bot will not follow the robots.txt directive: [seroundtable.com...]

Mod's note: adding quote for context

Googler, Jenny Murphy, provides a quick answer:
The +1 Button interacts with robots.txt and other crawler directives in an interesting way. Since +1's can only be applied to public pages, we may visit your page at the time the +1 Button is clicked to verify that it is indeed public. This check ignores crawler directives. This does not, however, impact the behavior of Google web search crawlers and how they interact with your robots.txt file.

[edited by: Robert_Charlton at 4:00 am (utc) on Sep 6, 2011]
[edit reason] Added quote [/edit]
