Forum Moderators: Robert Charlton & goodroi

Message Too Old, No Replies

Googlebot is ignoring robots.txt and following redirects

         

Sgt_Kickaxe

9:46 am on May 4, 2012 (gmt 0)



I have a handful of affiliate links on a new site that link to the affiliate site via a redirect page. All links to the redirect page share a common URL path of example.com/affiliate- followed by parameters for individual products. robots.txt has contained Disallow: /affiliate- since BEFORE the links ever existed, so in theory Googlebot should never have followed these through to the affiliate site, but it has.

Note: when bots follow these links through to the affiliate site, I lose money for sending non-targeted traffic.

Recently GWT started reporting 404 errors for affiliate jump page URLs containing the parameters, despite the fact that the links no longer appear on my site (the items expired). The 404, ironically enough, occurs on the affiliate site, but since I redirect to the affiliate site via a 301 permanent redirect, the 404 error on their site is attributed to me in GWT. Visiting any of the URLs will still send a visitor to the affiliate site with a 301 code.

Keep in mind that none of these links are getting indexed. It just makes me really mad that robots.txt is being ignored, and it's costing me money with this affiliate. What should I do?

g1smd

10:55 am on May 4, 2012 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



This sounds like a case for detecting when Google|Bing|Yahoo request those things and then slamming the door in their face with a swift 403 response.

deadsea

11:07 am on May 4, 2012 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Are you sure that it is really Googlebot (verified the IP address using dns then reverse dns)?
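The verification deadsea is describing is forward-confirmed reverse DNS: do a reverse (PTR) lookup on the requesting IP, check the hostname ends in googlebot.com or google.com, then resolve that hostname forward and confirm it gives back the same IP. A minimal Python sketch of the idea (`is_real_googlebot` is a hypothetical helper; the lookup functions are parameters only so the check can be exercised without live DNS):

```python
import socket

def is_real_googlebot(ip,
                      reverse_lookup=lambda ip: socket.gethostbyaddr(ip)[0],
                      forward_lookup=socket.gethostbyname):
    """Forward-confirmed reverse DNS check: the PTR hostname must
    belong to googlebot.com or google.com, and resolving that
    hostname must return the original IP address."""
    try:
        host = reverse_lookup(ip)
    except OSError:
        return False  # no PTR record at all
    if not host.endswith((".googlebot.com", ".google.com")):
        return False  # hostname is not Google's
    try:
        return forward_lookup(host) == ip  # confirm round trip
    except OSError:
        return False
```

A spoofed user agent fails this check because the faker's IP either has no Google PTR record or the forward lookup won't round-trip.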

I used to work on a site that had a similar setup. We would keep close tabs on bots that tried to use our affiliate links. Googlebot was always very well behaved and never requested anything disallowed in robots.txt. But I haven't monitored the situation (or worked with that client) for two years.

I'd suggest doing some user agent sniffing and returning a 403 to bot requests in the script that handles the redirects.
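The sniff-and-403 idea might look like this. The thread doesn't say what language the redirect script is written in (PHP's curl extension comes up later), so this is a language-neutral Python sketch with hypothetical names (`handle_affiliate_redirect`, `BOT_UA`):

```python
import re

# Crude user-agent match for the major crawlers; user agents can be
# spoofed, so combine this with an IP/reverse-DNS check for anything
# that matters financially.
BOT_UA = re.compile(r"googlebot|bingbot|slurp", re.IGNORECASE)

def handle_affiliate_redirect(user_agent, destination):
    """Return (status, headers) for the redirect script: known
    crawlers get a 403, everyone else gets the normal 301."""
    if user_agent and BOT_UA.search(user_agent):
        return 403, {}
    return 301, {"Location": destination}
```

The point is that the decision lives in the one script every /affiliate- URL passes through, so no individual product URL needs its own rule.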

Sgt_Kickaxe

11:44 am on May 4, 2012 (gmt 0)



Are you sure that it is really Googlebot (verified the IP address using dns then reverse dns)?

Yes, and as I said, GWT is reporting redirect URLs as 404 when the final destination becomes 404. That means the real Googlebot took a peek.

re:403, is there a proven method of shutting the door on all googlebot activity since robots.txt is apparently not enough? They do occasionally visit with other referrers, such as when a member of the ratings team takes a look from somewhere like India.

Sand

12:59 pm on May 4, 2012 (gmt 0)

10+ Year Member



How do you know that they're ignoring the directive? You can check to see if a page exists without crawling the content of the page.

Edit: One way I'm assuming they can do this is through browser data. I have a test site completely blocked by robots.txt, but URLs still show up in the search results (sans metadata) if you search for the domain name.

Andem

4:14 pm on May 4, 2012 (gmt 0)

10+ Year Member Top Contributors Of The Month



I posted something about this a few days ago here: [webmasterworld.com ]

The problem is that my robots.txt *is* valid and, as of today, I have 15,900 results in the index. I'm actually getting a fair amount of unwanted traffic, though the traffic is very targeted.

I think the best way to deal with this is to deliver a 403 to Googlebot or, in my case, to set up authentication as phranque suggested.

Sgt_Kickaxe

4:57 pm on May 4, 2012 (gmt 0)



How do you know that they're ignoring the directive?

Because they are assigning me 404 errors in GWT for each URL when the final destination pages (which are not even on my site) turn 404. The redirection page inherently cannot result in a 404 error code; it just processes parameters. The URLs reported as 404 are not 404; the redirect page is still sending the visitor to the destination page.

In short, the URLs showing up in GWT should not even be crawled, due to an explicit robots.txt directive, but Google has gathered the data anyway and knows the final destination. Didn't Matt Cutts ask whether webmasters would be OK with Google using discretion in ignoring robots.txt? Have they gone ahead and done so?

Using browser-gathered data to know where a visitor went? Very possible too, but with huge implications when I'm telling Googlebot not to via robots.txt.

Sand

6:08 pm on May 4, 2012 (gmt 0)

10+ Year Member



Using browser-gathered data to know where a visitor went? Very possible too, but with huge implications when I'm telling Googlebot not to via robots.txt.


A robots.txt Disallow tells Google not to crawl the pages -- not to ignore their existence. It's also not the same thing as telling Google to noindex a piece of content.

If you want to run a simple test, create a new piece of content and block it via robots.txt. Then link to it from other pages on your site. When I've done this in the past, the URL of the disallowed page will still show in their index, but Google displays no other information about the page. They know it exists, but haven't looked at it. And because they know it exists, they'll index it.

Similarly, you could get a page's response code without ever 'visiting' the page. You can build a simple response code checker using the CURL PHP extension with just a few lines of code.
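As a sketch of that idea (the post mentions PHP's curl extension; this is the same technique in Python's standard library, with `response_code` as a hypothetical helper name): a HEAD request fetches only the status line and headers, never the body, and because redirects are not followed you see the 301 and its Location header directly. Query strings are ignored here for brevity.

```python
import http.client
from urllib.parse import urlsplit

def response_code(url, timeout=10):
    """Fetch only the response status of a URL with a HEAD request.
    Redirects are not followed, so a 301 is reported as 301 along
    with its Location header."""
    parts = urlsplit(url)
    conn_cls = (http.client.HTTPSConnection if parts.scheme == "https"
                else http.client.HTTPConnection)
    conn = conn_cls(parts.netloc, timeout=timeout)
    try:
        conn.request("HEAD", parts.path or "/")
        resp = conn.getresponse()
        return resp.status, resp.getheader("Location")
    finally:
        conn.close()
```

Run against one of the /affiliate- jump URLs, this would report the 301 and reveal the destination without ever downloading the page -- which is all a crawler needs in order to then check the destination itself.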

Edit: They're probably checking your page's response header (without crawling the page), which tells them the URL that it redirects to, and then the destination URL gives them a 404.

lucy24

6:45 pm on May 4, 2012 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



re:403, is there a proven method of shutting the door on all googlebot activity since robots.txt is apparently not enough? They do occasionally visit with other referrers, such as when a member of the ratings team takes a look from somewhere like India.

Unlike robots.txt, config files or htaccess cannot be ignored. Physically impossible. Block 'em with a RewriteRule listing the forbidden directories, accompanied by a RewriteCond looking at the IP.
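A minimal htaccess sketch of that rule, matching either Google's usual crawler IP range or a crawler user agent (the 66.249.64.0-66.249.95.255 range is an assumption that changes over time -- verify against Google's published Googlebot IP list before relying on it):

```apache
RewriteEngine On
# 66.249.64.x through 66.249.95.x has historically covered Googlebot
RewriteCond %{REMOTE_ADDR} ^66\.249\.(6[4-9]|[78][0-9]|9[0-5])\. [OR]
RewriteCond %{HTTP_USER_AGENT} (googlebot|bingbot|slurp) [NC]
# Forbid (403) any request under the affiliate jump path
RewriteRule ^affiliate- - [F]
```

The [F] flag returns 403 immediately, so the redirect script never even runs for matching requests.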

deadsea

6:58 pm on May 4, 2012 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Are you sure that Google is getting this information from Googlebot, as opposed to Chrome/browser toolbar? Have you seen evidence of Googlebot fetching these in your logs? What happens when you use WMT to fetch as Googlebot?

Sgt_Kickaxe

9:15 pm on May 4, 2012 (gmt 0)



What happens when you use WMT to fetch as Googlebot?


When I fetch a redirect link that leads (via 301) to a now-404 page on the affiliate site, I receive a 404 error code for MY URL, despite the fact that the 404 belongs to the affiliate site and mine returns a 301 as instructed. It's a problem.

I'm looking into an htaccess and/or config solution now. Again, my concern is with Google using different referrers, or not providing one at all. What then? A cloaking penalty?

aakk9999

9:36 pm on May 4, 2012 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



If you want to run a simple test, create a new piece of content and block it via robots.txt. Then link to it from other pages on your site. When I've done this in the past, the URL of the disallowed page will still show in their index, but Google displays no other information about the page. They know it exists, but haven't looked at it. And because they know it exists, they'll index it.


Yes, and in this case you would expect that the page *may* be indexed by Google with no details shown. But this is not Sgt_Kickaxe's case. His page was requested by Google in order to see the 301->404 being returned. So, even though it was in robots.txt, Google executed a GET for this URL.

I think Google is purposely ambiguous here. We all assume that when they say "we will not crawl" it means "we will not request this URL", but obviously this is not the case, as Sgt's case proves they still request the URL.

The only plausibly "legit" reason for this would be if Google wants to check the response code before it includes this URL in its index (with no title and without snippet, as it does for pages blocked by robots.txt but linked from elsewhere).

But if the bot is looking for more than the response code (e.g. URL discovery) then it is not behaving!

A simple test would be: create two pages, A & B. Block A via robots.txt, but do not block B. From A, link to B in an obscure way, or in a way visitors would not click on it. Do not use a browser or anything else to access page B. Link to A from somewhere and wait to see if B ever appears in Google's index (make it worth indexing, though).

If it ever does, then Google IS crawling pages blocked by robots.txt, but indexing them (no snippet, no title) in a way that makes you think they are not crawling them.

Sand

10:14 pm on May 4, 2012 (gmt 0)

10+ Year Member



The only plausibly "legit" reason for this would be if Google wants to check the response code before it includes this URL in its index (with no title and without snippet, as it does for pages blocked by robots.txt but linked from elsewhere).


If you read the rest of my post, that's exactly what I'm suggesting. We're on the same page. They're checking the response codes and following the redirect provided, not crawling the page itself.

g1smd

10:30 pm on May 4, 2012 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



If they requested HEAD, they would have seen only the HTTP header.

A GET request will have returned the whole page.

aakk9999

10:44 pm on May 4, 2012 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



In which case, to really understand what Googlebot is doing, a test as described in my post above would be required.

deadsea

1:28 am on May 5, 2012 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



There are plenty of good reasons that bots should be obeying robots.txt. Everything from stats counters to urls that create, delete, or modify data.

I can't imagine you could get a cloaking penalty for a URL that you don't allow Googlebot to access to begin with because of robots.txt. But then again, Google writes the rules, not me.

phranque

2:27 am on May 5, 2012 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



have you checked your server access log to verify the precise behavior of googlebot regarding those urls?

Sgt_Kickaxe

3:40 am on Aug 7, 2012 (gmt 0)



I've been gathering data and can verify that Googlebot does not access the redirect page directly. However, they are aware of the landing page that a user ends up on, and when that page returns a 404 error it is being assigned to my redirect URL and showing up in GWT as such.

Googlebot isn't seeing the redirect but is learning where the jump leads via some other method (Chrome browser, beacons, etc.?). It makes sense that if a visitor landed on site B from a page on my site, the page on my site must exist, and so it's being treated as a regular URL by GWT despite the robots.txt exclusion.

This brings up a problem for me: how do I continue to redirect visitors without appearing to create countless new URLs, one for each redirect? Again, GWT won't report this unless the other site's page becomes 404, and it takes time to show up, probably because it requires gathering data from non-Googlebot sources.