Forum Moderators: open


Google disregarded robots.txt

         

Scooter24

9:01 am on Jan 30, 2003 (gmt 0)

10+ Year Member Top Contributors Of The Month



Google disregarded the robots.txt file and crawled a directory that was off-limits. Not the Googlebot, but this:

IP address: 216.239.33.5
User agent: UP.Browser/6.1.0.1.140 (Google CHTML Proxy/1.0)

Does anybody have an explanation for this?

DaveN

9:17 am on Jan 30, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



read this

[webmasterworld.com...]

DaveN

Dreamquick

9:17 am on Jan 30, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



That's supposed to be part of the Google-for-mobiles service, which converts HTML to WML (Wireless Markup Language) on the fly. So although it appears to be a Google bot, it's actually a real user (on a mobile device) routing through Google's proxy.
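A minimal sketch (mine, not from the thread) of how a site might recognize these mobile proxies from the user-agent string, assuming the UA always contains a "Google ... Proxy" token as in the log line above:

```python
import re

# Matches the proxy marker in UAs like
# "UP.Browser/6.1.0.1.140 (Google CHTML Proxy/1.0)"
# or "Google WAP Proxy/1.0". Pattern is an assumption based on
# the examples discussed in this thread.
GOOGLE_PROXY_RE = re.compile(r"Google\s.*\sProxy")

def is_google_mobile_proxy(user_agent: str) -> bool:
    """True if the request appears to come through a Google
    WAP/CHTML proxy rather than from Googlebot itself."""
    return bool(GOOGLE_PROXY_RE.search(user_agent))
```

This only inspects the user agent; a stricter check would also verify the source IP belongs to Google.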

A site search will turn up a number of similar discussions.

- Tony

Scooter24

9:24 am on Jan 30, 2003 (gmt 0)

10+ Year Member Top Contributors Of The Month



Well, in my case it's not python.

Anyway, the fact is that I've implemented a download-protection mechanism against download agents. Browse the wrong directory and you get banned automatically. It has worked perfectly so far, but now this Google proxy has banned itself. Of course I removed the "deny from 216.239.33.5" line, and I hope this won't affect my Google ranking, but what if it happens again in the future?

Why the hell does this Google proxy thing need to browse disallowed directories?

jdMorgan

7:08 pm on Jan 30, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Scooter24,

The point is that you banned a user who was using a mobile device to view your site. If you want to ban such users, then do nothing. If you don't want to ban users, you'll have to add some logic to the code which calls your bad-bot script to allow that user agent or to allow that IP range.

Since Google has both a WAP proxy and a CHTML proxy, allowing "Google\ .*\ Proxy" works for me. You may also wish to allow the AvantGo WAP proxy, by allowing its IP address, "^64\.157\.224\."
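As a rough illustration of Jim's suggestion (function and parameter names are mine, not from the thread), the allow check could run before the bad-bot script decides to ban:

```python
import re

# Patterns from the post above: Google's WAP/CHTML proxies matched
# by user agent, the AvantGo proxy matched by its IP prefix.
ALLOWED_UA = re.compile(r"Google\s.*\sProxy")
ALLOWED_IP = re.compile(r"^64\.157\.224\.")

def should_ban(remote_ip: str, user_agent: str, hit_trap: bool) -> bool:
    """Hypothetical wrapper around a bad-bot script: never ban a
    known mobile proxy, even if it tripped the spider trap."""
    if ALLOWED_UA.search(user_agent) or ALLOWED_IP.match(remote_ip):
        return False
    return hit_trap
```

The trade-off is that anyone spoofing these user agents also slips through, so some sites pair the UA check with an IP-range check for Google's netblocks.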

Jim

jomaxx

8:36 pm on Jan 30, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



The same basic thing happens when a user has Google translate a page via its "Language Tools". I just did a quick test and in that case, the request comes from a Google IP but the Google proxy announces itself as whatever browser the surfer is using.

jdMorgan

9:01 pm on Jan 30, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



jomaxx,

True, but in that case - a normal browser going through the translator - the user agent does not seem to trip spider traps. The problem here is that the typical methods which keep browsers from tripping spider traps do not work with the WAP and CHTML proxies.

Jim

Scooter24

8:26 am on Jan 31, 2003 (gmt 0)

10+ Year Member Top Contributors Of The Month



The point is that you banned a user who was using a mobile device to view your site. If you want to ban such users, then do nothing. If you don't want to ban users, you'll have to add some logic to the code which calls your bad-bot script to allow that user agent or to allow that IP range.

Since Google has both a WAP proxy and a CHTML proxy, allowing "Google\ .*\ Proxy" works for me. You may also wish to allow the AvantGo WAP proxy, by allowing its IP address, "^64\.157\.224\."

There is a simple filter which blocks certain agents, but the actual ban is triggered by the agent's behaviour. In this specific case the user requested a page in a disallowed directory, via a hidden link which only a robot or download agent would follow. Since no normal user would ever request the forbidden page, this protection is very effective.

On the other hand, why would a proxy retrieve a page in a disallowed directory - why should this be allowed?
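A stripped-down sketch of the trap described above (all names hypothetical): a hidden link points into a directory disallowed in robots.txt, and any client that follows it is banned - which is exactly how a proxy fetching on behalf of a mobile user gets caught:

```python
# IPs that have tripped the trap; a real site would persist this
# (e.g. write "deny from" lines to an .htaccess file).
banned_ips = set()

# The trap target lives in a robots.txt-disallowed directory, so
# only agents that ignore robots.txt and follow hidden links
# ever request it. Path is illustrative.
TRAP_PATH = "/private/trap.html"

def handle_request(remote_ip: str, path: str) -> int:
    """Return an HTTP status code; ban any client hitting the trap."""
    if remote_ip in banned_ips:
        return 403
    if path == TRAP_PATH:
        banned_ips.add(remote_ip)
        return 403
    return 200
```

With no allow-list in front of it, this logic bans Google's proxy IP the first time a mobile user's fetch touches the trap - the behaviour Scooter24 observed.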