
Google WAP Proxy and robots.txt

     
3:37 pm on Sep 7, 2002 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Sept 1, 2002
posts:774
votes: 0


Looking over last night's logs, I found a Google bot (a WAP bot, it looks like) that ignored my robots.txt.

I have a "spider trap" on my site, a file that is excluded in the robots.txt, but crawlers that ignore the robots.txt will run... and if run, it logs their IP, and bans them...

So, look at this:

216.239.33.5 - - [07/Sep/2002:06:18:24 -0600] "GET / HTTP/1.0" 200 12614 "-" "SIE-C3I/3.0 UP/4.1.16m (Google WAP Proxy/1.0)"
216.239.33.5 - - [07/Sep/2002:06:19:12 -0600] "GET /secret_spider-trap.cgi HTTP/1.0" 200 152 "-" "SIE-C3I/3.0 UP/4.1.16m (Google WAP Proxy/1.0)"

That is all it got, but it was enough to ban him!

Should I unban this IP, contact Google, anything like that?

dave

3:42 pm on Sept 7, 2002 (gmt 0)

Senior Member

WebmasterWorld Senior Member ciml is a WebmasterWorld Top Contributor of All Time 10+ Year Member

joined:June 22, 2001
posts:3805
votes: 2


dave, that's not a spider; it's highly likely to be a person.

Google's WAP proxy is a bridge for individual people to access Web pages via the WAP protocol used in mobile telephones.

3:44 pm on Sept 7, 2002 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:July 27, 2001
posts:1472
votes: 0


Congratulations! You just banned a cell phone visitor.

[google.com...]

3:58 pm on Sept 7, 2002 (gmt 0)

Administrator from JP 

WebmasterWorld Administrator bill is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Oct 12, 2000
posts:14948
votes: 122


We had a related thread [webmasterworld.com] a few days ago.

4:17 pm on Sept 7, 2002 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Sept 1, 2002
posts:774
votes: 0


Key_Master:

But isn't it weird that it only requested two documents: my main page and the spider trap (which is a hidden URL)?

You would have to look at the source of the page to even know that link existed; the file name is pretty obscure...

Should I un-ban the IP?

dave

4:32 pm on Sept 7, 2002 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:July 27, 2001
posts:1472
votes: 0


carfac,

Yes, you should un-ban the IP and modify your trap to exclude these types of visits.

The best way to test your script is to use the Google WAP proxy to visit your own site. I'd bet that cell phone visitor was shown your spider trap link and clicked on it purely out of curiosity.

4:42 pm on Sept 7, 2002 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Sept 1, 2002
posts:774
votes: 0


Key_Master:

Very good... I will edit and edit!

Thanks for the advice!

Dave

4:52 pm on Sept 7, 2002 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Sept 1, 2002
posts:774
votes: 0


Hi:

Probably not the place for Perl questions, so sorry if this is inappropriate...

I wrote this:

$visitor_ua = $ENV{'HTTP_USER_AGENT'} || '';
if ($visitor_ua =~ /WAP/) {

    # WAP proxy visit: show a forwarding page instead of running the trap.
    print "Content-type: text/html\n\n";
    print "<html>\n";
    print "<head>\n";
    print "<title>Forward On</title>\n";
    print "</head>\n";
    print "<body>\n";
    print "<p><b>Please <A HREF=\"http://www.mydomain.com/\">Click Here</A> to continue!</b></p>\n";
    print "</body>\n";
    print "</html>\n";
    exit;
}
else {
    # ... existing logging/banning code ...
}

and inserted it above the logging part of the trap... look good?

dave

[edited by: carfac at 5:56 pm (utc) on Sep. 7, 2002]

5:28 pm on Sept 7, 2002 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:July 27, 2001
posts:1472
votes: 0


I wouldn't do it by user agent. Too easy to spoof. Use IP addresses instead. Here is a list of Google WAP proxy and translator IP addresses.

216.239.33.5
216.239.35.4
216.239.37.5
216.239.39.5

5:41 pm on Sept 7, 2002 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Sept 1, 2002
posts:774
votes: 0


Key_Master:

Perfect, thanks!

I am a bit shaky on regular expressions; can you tell me if this is correct to match all those numbers:

if ($visitor_ua =~ '(216.239.33.5|216.239.35.4|216.239.37.5|216.239.39.5)' { code blah blah

Sorry, I get mixed up sometimes whether to use single or double quotes, or the ^ anchor...

Thank you!

Dave

5:51 pm on Sept 7, 2002 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:July 27, 2001
posts:1472
votes: 0


This should do it:

$visitor_ip = $ENV{'REMOTE_ADDR'};
if ($visitor_ip =~ /^216\.239\.3[379]\.5$|^216\.239\.35\.4$/) {

(If the forum software mangles the pipe character, change it back to a real | before you use this.)
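
If you'd rather not fuss with the regex at all, an exact-match hash lookup does the same job and is easier to extend if Google adds addresses; just a sketch, using the same IPs as above:

# Exact-match lookup instead of a regex (same Google WAP IPs as above).
my %google_wap = map { $_ => 1 } qw(
    216.239.33.5
    216.239.35.4
    216.239.37.5
    216.239.39.5
);

$visitor_ip = $ENV{'REMOTE_ADDR'} || '';
if ($google_wap{$visitor_ip}) {
    # WAP proxy/translator visit: show the forwarding page, skip the trap
}

An exact match also sidesteps the escaping questions entirely; there are no dots or anchors to get wrong.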

5:57 pm on Sept 7, 2002 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Sept 1, 2002
posts:774
votes: 0


Key_Master:

Thank you! I am a mere Gate_Keeper... :)

Dave

6:35 pm on Sept 7, 2002 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Sept 1, 2002
posts:774
votes: 0


A couple of minor edits, and it works...

Tested it on the WAP emulator, and hit the spider trap, but it did not log the IP...

Should I post the final, fixed code?

dave

8:10 pm on Sept 7, 2002 (gmt 0)

Preferred Member

10+ Year Member

joined:Jan 25, 2002
posts:378
votes: 0


But isn't it weird that it only requested two documents: my main page and the spider trap (which is a hidden URL)?

Not that weird. The HTML-to-WML proxy normally strips out graphics and inserts placeholders. It's probably "un-hiding" your hidden link.
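
For example, if the link is "hidden" behind a transparent 1x1 image, something like this (hypothetical markup, printed the way carfac's script prints its HTML):

# Hypothetical "hidden" trap link: in a graphical browser the only content
# is a transparent 1x1 image, so there is nothing visible to click on.
print '<a href="/secret_spider-trap.cgi"><img src="/1x1.gif" alt="" width="1" height="1"></a>';

# After HTML-to-WML conversion the image becomes a text placeholder, so a
# phone user sees an ordinary link and can follow it out of curiosity.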

 
