
Google WAP Proxy and robots.txt

     

carfac

3:37 pm on Sep 7, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Looking over last night's logs, I found a Google bot (a WAP bot, it looks like) that ignored my robots.txt.

I have a "spider trap" on my site, a file that is excluded in the robots.txt, but crawlers that ignore the robots.txt will run... and if run, it logs their IP, and bans them...

So, look at this:

216.239.33.5 - - [07/Sep/2002:06:18:24 -0600] "GET / HTTP/1.0" 200 12614 "-" "SIE-C3I/3.0 UP/4.1.16m (Google WAP Proxy/1.0)"
216.239.33.5 - - [07/Sep/2002:06:19:12 -0600] "GET /secret_spider-trap.cgi HTTP/1.0" 200 152 "-" "SIE-C3I/3.0 UP/4.1.16m (Google WAP Proxy/1.0)"

That is all it got, but it was enough to ban him!

Should I unban this IP, contact Google, anything like that?

dave

ciml

3:42 pm on Sep 7, 2002 (gmt 0)

WebmasterWorld Senior Member ciml is a WebmasterWorld Top Contributor of All Time 10+ Year Member



dave, that's not a spider; it's highly likely to be a person.

Google's WAP proxy is a bridge for individual people to access Web pages via the WAP protocol used in mobile telephones.

Key_Master

3:44 pm on Sep 7, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Congratulations! You just banned a cell phone visitor.

[google.com...]

bill

3:58 pm on Sep 7, 2002 (gmt 0)

WebmasterWorld Administrator bill is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month



We had a related thread [webmasterworld.com] a few days ago.

carfac

4:17 pm on Sep 7, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Key_Master:

But isn't it weird that it only requested two documents: my main page and the spider trap (which is a hidden URL)?

You would have to look at the page's source code to even know that link existed; the file name is pretty obscure...

Should I un-ban the IP?

dave

Key_Master

4:32 pm on Sep 7, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



carfac,

Yes, you should un-ban the IP and modify your trap to exclude these types of visits.

The best way to check your script out is to use the Google WAP proxy to check out your site. I'd bet that cell phone visitor was shown your spider trap link and clicked on it purely out of curiosity.

carfac

4:42 pm on Sep 7, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Key_Master:

Very good... I will edit and edit!

Thanks for the advice!

Dave

carfac

4:52 pm on Sep 7, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Hi:

Probably not the place for Perl questions, so sorry if this is inappropriate...

I wrote this:

# If the visitor looks like the WAP proxy, show a plain forwarding
# page instead of running the trap.
$visitor_ua = $ENV{'HTTP_USER_AGENT'};
if ($visitor_ua =~ /WAP/) {

print "Content-type: text/html\n\n";
print "<html>\n";
print "<head>\n";
print "<title>Forward On</title>\n";
print "</head>\n";
print "<body>\n";
print "<p><b>Please <A HREF=\"http://www.mydomain.com/\">Click Here</A> to continue!</b></p>\n";
print "</body>\n";
print "</html>\n";
exit;
}
else {
CODE
}

and inserted above the logging part of the trap... look good?

dave

[edited by: carfac at 5:56 pm (utc) on Sep. 7, 2002]

Key_Master

5:28 pm on Sep 7, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I wouldn't do it by user agent. Too easy to spoof. Use IP addresses instead. Here is a list of Google WAP proxies and translator IP addresses.

216.239.33.5
216.239.35.4
216.239.37.5
216.239.39.5
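(An alternative sketch, assuming the same CGI environment: keep the addresses above in a hash and match them exactly, which avoids regex escaping altogether.)

# Exact-match lookup of the proxy addresses (illustrative only).
%google_wap_proxy = (
  '216.239.33.5' => 1,
  '216.239.35.4' => 1,
  '216.239.37.5' => 1,
  '216.239.39.5' => 1,
);

$visitor_ip = $ENV{'REMOTE_ADDR'};
if ($google_wap_proxy{$visitor_ip}) {
  # Known Google WAP proxy/translator: skip the trap logic here.
}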

carfac

5:41 pm on Sep 7, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Key_Master:

Perfect- thanks!

I am a bit shaky on regular expressions; can you tell me if this is correct to match all those addresses:

if ($visitor_ua =~ '(216.239.33.5¦216.239.35.4¦216.239.37.5¦216.239.39.5)' { code blah blah

Sorry- I get mixed up sometimes whether to use single or double quotes, or the ^ anchor...

Thank you!

Dave

Key_Master

5:51 pm on Sep 7, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



This should do it:

$visitor_ip = $ENV{'REMOTE_ADDR'};
if ($visitor_ip =~ /^216\.239\.3[379]\.5$¦^216\.239\.35\.4$/) {

Remember to change the broken pipe (¦) to a real pipe character.
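A quick standalone check of the pattern (with the pipe already converted back to the real character) might look like this; 216.239.36.5 is included only as a non-matching control:

foreach $ip ('216.239.33.5', '216.239.35.4', '216.239.37.5',
             '216.239.39.5', '216.239.36.5') {
  if ($ip =~ /^216\.239\.3[379]\.5$|^216\.239\.35\.4$/) {
    print "$ip matches\n";
  }
  else {
    print "$ip does not match\n";
  }
}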

carfac

5:57 pm on Sep 7, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Key_Master:

Thank you! I am a mere Gate_Keeper... :)

Dave

carfac

6:35 pm on Sep 7, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Couple of minor edits, and it works....

Tested on the WAP emulator and got to the spider trap, but it did not log the IP...

Should I post the final, fixed code?

dave

mbauser2

8:10 pm on Sep 7, 2002 (gmt 0)

10+ Year Member



But isn't it weird that it only requested two documents: my main page and the spider trap (which is a hidden URL)?

Not that weird. The HTML-to-WML proxy normally strips out graphics and inserts placeholders. It's probably "un-hiding" your hidden link.