
Sitemaps, Meta Data, and robots.txt Forum

    
Google WAP Proxy and robots.txt
carfac




msg:1527726
 3:37 pm on Sep 7, 2002 (gmt 0)

Looking over last night's logs, I found a Google bot (a WAP bot, it looks like) that ignored my robots.txt.

I have a "spider trap" on my site: a script that is excluded in robots.txt, so only crawlers that ignore robots.txt will request it. When it runs, it logs the requester's IP and bans it...

So, look at this:

216.239.33.5 - - [07/Sep/2002:06:18:24 -0600] "GET / HTTP/1.0" 200 12614 "-" "SIE-C3I/3.0 UP/4.1.16m (Google WAP Proxy/1.0)"
216.239.33.5 - - [07/Sep/2002:06:19:12 -0600] "GET /secret_spider-trap.cgi HTTP/1.0" 200 152 "-" "SIE-C3I/3.0 UP/4.1.16m (Google WAP Proxy/1.0)"

That is all it got, but it was enough to ban him!

Should I unban this IP, contact Google, anything like that?

dave
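
The trap described above can be sketched roughly as follows. This is an illustration, not the poster's actual script: the log path, the TRAP_BAN_LOG environment variable, and the record_offender helper are all assumptions, and the banning step is left out because the thread does not say how it is done.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: append one line per offender to the given log file.
sub record_offender {
    my ($log_path, $ip, $ua) = @_;
    open(my $fh, '>>', $log_path) or die "Cannot open $log_path: $!";
    print {$fh} join("\t", scalar(gmtime()), $ip, $ua), "\n";
    close($fh) or die "Cannot close $log_path: $!";
    return;
}

# Only act when invoked by a web server as a CGI (REMOTE_ADDR is set).
# robots.txt disallows this URL, so well-behaved crawlers never get here.
if (defined $ENV{'REMOTE_ADDR'}) {
    # Assumed log location; TRAP_BAN_LOG is a made-up override variable.
    my $log_path = $ENV{'TRAP_BAN_LOG'} || '/tmp/spider-trap.log';
    record_offender($log_path, $ENV{'REMOTE_ADDR'}, $ENV{'HTTP_USER_AGENT'} || '-');
    print "Content-type: text/plain\n\n";
    print "Nothing to see here.\n";
}
```

Note that anything able to follow the hidden link ends up in the log, whether it is a crawler or a person.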

 

ciml




msg:1527727
 3:42 pm on Sep 7, 2002 (gmt 0)

dave, that's not a spider; it's highly likely to be a person.

Google's WAP proxy is a bridge for individual people to access Web pages via the WAP protocol used in mobile telephones.

Key_Master




msg:1527728
 3:44 pm on Sep 7, 2002 (gmt 0)

Congratulations! You just banned a cell phone visitor.

[google.com...]

bill




msg:1527729
 3:58 pm on Sep 7, 2002 (gmt 0)

We had a related thread [webmasterworld.com] a few days ago.

carfac




msg:1527730
 4:17 pm on Sep 7, 2002 (gmt 0)

Key_Master:

But isn't it weird, it only requested two documents- my main page and the spider trap (which is a hidden URL)?

You would have to look at the code of the page to even know that link existed, and the file name is pretty obscure...

Should I un-ban the IP?

dave

Key_Master




msg:1527731
 4:32 pm on Sep 7, 2002 (gmt 0)

carfac,

Yes, you should unban the IP and modify your trap to exclude these types of visits.

The best way to check your script out is to use the Google WAP proxy to check out your site. I'd bet that cell phone visitor was shown your spider trap link and clicked on it purely out of curiosity.

carfac




msg:1527732
 4:42 pm on Sep 7, 2002 (gmt 0)

Key_Master:

Very good... I will edit and edit!

Thanks for the advice!

Dave

carfac




msg:1527733
 4:52 pm on Sep 7, 2002 (gmt 0)

Hi:

Probably not the place for Perl questions, so sorry if this is inappropriate...

I wrote this:

$visitor_ua = $ENV{'HTTP_USER_AGENT'};
# WAP proxy visitors get a plain link back to the home page and are not logged
if ($visitor_ua =~ /WAP/) {

print "Content-type: text/html\n\n";
print "<html>\n";
print "<head>\n";
print "<title>Forward On</title>\n";
print "</head>\n";
print "<body>\n";
print "<p><b>Please <A HREF=\"http://www.mydomain.com/\">Click Here</A> to continue!</b></p>\n";
print "</body>\n";
print "</html>\n";
exit;
}
else {
# CODE (the logging part of the trap)
}

and inserted it above the logging part of the trap... look good?

dave

[edited by: carfac at 5:56 pm (utc) on Sep. 7, 2002]

Key_Master




msg:1527734
 5:28 pm on Sep 7, 2002 (gmt 0)

I wouldn't do it by user agent. It's too easy to spoof. Use IP addresses instead. Here is a list of Google WAP proxy and translator IP addresses.

216.239.33.5
216.239.35.4
216.239.37.5
216.239.39.5
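
Since the list above is a handful of fixed addresses, an exact-match lookup is one way to do the check without any regular expression. A minimal sketch, assuming the four IPs from this post are still the right ones; the names %wap_proxy_ips, is_wap_proxy, and $skip_ban are made up for illustration:

```perl
use strict;
use warnings;

# The four addresses listed in this thread; they may have changed since.
my %wap_proxy_ips = map { $_ => 1 } qw(
    216.239.33.5
    216.239.35.4
    216.239.37.5
    216.239.39.5
);

# Hypothetical helper: true only when the address is exactly one of the above.
sub is_wap_proxy {
    my ($ip) = @_;
    return (defined $ip && exists $wap_proxy_ips{$ip}) ? 1 : 0;
}

# In the trap, this flag would be checked before the logging/banning step.
my $visitor_ip = $ENV{'REMOTE_ADDR'};
my $skip_ban   = is_wap_proxy($visitor_ip);
```

A hash lookup compares the whole string, so a near miss such as 216.239.33.50 is not treated as a proxy address.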

carfac




msg:1527735
 5:41 pm on Sep 7, 2002 (gmt 0)

Key_Master:

Perfect- thanks!

I am a bit shaky on regular expressions. Can you tell me if this is correct to match all those numbers:

if ($visitor_ua =~ '(216.239.33.5|216.239.35.4|216.239.37.5|216.239.39.5)') { code blah blah

Sorry, I sometimes get mixed up about whether to use single or double quotes, or the ^ anchor...

Thank you!

Dave

Key_Master




msg:1527736
 5:51 pm on Sep 7, 2002 (gmt 0)

This should do it:

$visitor_ip = $ENV{'REMOTE_ADDR'};
if ($visitor_ip =~ /^216\.239\.3([379]\.5)$|^216\.239\.35\.4$/) {

Remember to use the real pipe character (|), not the broken bar (¦) that the forum sometimes displays in its place.
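
To see why the escaped dots and the ^ and $ anchors in the pattern above matter, here is a small comparison against the unanchored, unescaped form from the earlier attempt. It is only a sketch: the accepted helper is hypothetical, and the last two sample addresses are invented test values, not known Google IPs.

```perl
use strict;
use warnings;

# Anchored pattern with escaped dots: matches only the four listed addresses.
my $strict = qr/^216\.239\.3(?:[379]\.5|5\.4)$/;

# Unanchored pattern with bare dots: "." matches any character, and the
# match may start or end anywhere inside the string.
my $loose = qr/(216.239.33.5|216.239.35.4|216.239.37.5|216.239.39.5)/;

# Hypothetical helper: return the addresses that a pattern accepts.
sub accepted {
    my ($pattern, @ips) = @_;
    return grep { $_ =~ $pattern } @ips;
}

# The last two are invented addresses that should not count as the proxy.
my @samples = qw(216.239.33.5 216.239.35.4 216.239.35.5 216.239.33.50);

print "strict: @{[ accepted($strict, @samples) ]}\n";
print "loose:  @{[ accepted($loose, @samples) ]}\n";
```

The loose pattern also accepts 216.239.33.50, because nothing stops the match from ending before the final digit.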

carfac




msg:1527737
 5:57 pm on Sep 7, 2002 (gmt 0)

Key_Master:

Thank you! I am a mere Gate_Keeper... :)

Dave

carfac




msg:1527738
 6:35 pm on Sep 7, 2002 (gmt 0)

Couple of minor edits, and it works....

Tested on the WAP emulator, and got the spider, but it did not log the IP...

Should I post the final, fixed code?

dave

mbauser2




msg:1527739
 8:10 pm on Sep 7, 2002 (gmt 0)

But isn't it weird, it only requested two documents- my main page and the spider trap (which is a hidden URL)?

Not that weird. The HTML-to-WML proxy normally strips out graphics and inserts placeholders. It's probably "un-hiding" your hidden link.

All trademarks and copyrights held by respective owners. Member comments are owned by the poster.
WebmasterWorld is a Developer Shed Community owned by Jim Boykin.
© Webmaster World 1996-2014 all rights reserved