Forum Moderators: phranque

GeoIP RewriteCond but exclude googlebot and other spiders

         

gsmforum

9:04 pm on Mar 4, 2011 (gmt 0)

10+ Year Member



Hello,

I just installed GeoIP and I'm redirecting visitors from certain countries from site1 to site2, but I want to exclude Googlebot and other spiders from the redirection; otherwise they will stop crawling site1.

What should I have in my .htaccess to allow these spiders to keep crawling my website?


Here are my site1 .htaccess settings:

# Redirect multiple countries to a single page
RewriteEngine on
RewriteCond %{ENV:GEOIP_COUNTRY_CODE} ^(SE|MX)$
RewriteRule ^(.*)$ http://www.site2.com/$1 [L]

wilderness

9:11 pm on Mar 4, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



This [webmasterworld.com], placed above your ENV line, will suffice.

Of course, you'll need to determine the IP ranges for the bots you want to exclude.

gsmforum

9:14 pm on Mar 4, 2011 (gmt 0)

10+ Year Member



I just checked Google's recommendations here: [google.com...]

They recommend using the user-agent rather than IP ranges:

Google doesn't post a public list of IP addresses for webmasters to whitelist. This is because these IP address ranges can change, causing problems for any webmasters who have hard coded them. The best way to identify accesses by Googlebot is to use the user-agent (Googlebot).


If so, how can I do it in my .htaccess to let them crawl everything without being redirected?

wilderness

9:37 pm on Mar 4, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



If so, how can I do it in my .htaccess to let them crawl everything without being redirected?


You cannot, at least not with any assurance that you're not also allowing the many fake Googlebots from non-Google IP ranges.

The Google IP ranges have been provided hundreds (perhaps thousands) of times in the Search Engine Spider forum.

66.249.64.0 - 66.249.95.255

There are other Google IP ranges; however, they are used for most of the Google tools provided to users. These tools are easily abused.

gsmforum

1:20 am on Mar 5, 2011 (gmt 0)

10+ Year Member



OK, I will follow your recommendation.

According to this page, these are the IPs usually used by robots: [chceme.info...]


I'm trying to understand how to write a RewriteCond %{REMOTE_ADDR} for IP address ranges, but it's really crazy!

What is the correct RewriteCond format to exclude these IP address ranges from the redirection (at least those used by Googlebot)?

64.233.160.0 to 64.233.191.255
66.102.0.0 to 66.102.15.255
66.249.64.0 to 66.249.95.255
72.14.192.0 to 72.14.255.255
74.125.0.0 to 74.125.255.255
209.85.128.0 to 209.85.255.255
216.239.32.0 to 216.239.63.255


Thanks

PS: I'm using Apache 2.2

wilderness

2:19 am on Mar 5, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



[webmasterworld.com...]

FWIW, Googlebot does not crawl from all of those ranges, only from the one I previously provided.

gsmforum

9:44 am on Mar 5, 2011 (gmt 0)

10+ Year Member



Thank you again. I've already read that page carefully, but I really can't understand those numbers:

RewriteCond %{REMOTE_ADDR} !^123\.455\.([0-9]|[1-9][0-9]|1[01][0-9])\.

To me it would be as simple as:
RewriteCond %{REMOTE_ADDR} !^66.249.64.0-66.249.95.255

g1smd

10:55 am on Mar 5, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Regular Expressions allow parsing of a large amount of data in a very quick and efficient manner.

The
! ^ . ? + * ( ) [^ ] { } \. \? $1 $
syntax has a very specific meaning.

Once you have learned what each one means, you can write some very complex pattern matching rules using a very small number of characters.

wilderness

1:01 pm on Mar 5, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



# Except (leading exclamation) IP range
RewriteCond %{REMOTE_ADDR} !^66\.249\.(6[4-9]|[78][0-9]|9[0-5])\.

This old explanation [webmasterworld.com] may or may not help.
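
For anyone else puzzling over the numbers, here is one way to read that pattern piece by piece, plus a sketch of where the line could sit relative to the GeoIP rule from the first post. The country codes and the site2 URL come from the original question; the way the lines are combined is an assumption about the intended setup, not a tested configuration.

```apache
# 66.249.64.0 - 66.249.95.255 means: first two octets fixed,
# third octet anywhere from 64 to 95, fourth octet anything.
#
#   ^66\.249\.   address starts with "66.249." (dots escaped to match literally)
#   6[4-9]       third octet 64-69
#   [78][0-9]    third octet 70-89
#   9[0-5]       third octet 90-95
#   \.           literal dot closing the third octet; the fourth is not checked
#
# The leading "!" negates the match: the condition is true only when
# the visitor's address is NOT inside that range.

RewriteEngine on
RewriteCond %{REMOTE_ADDR} !^66\.249\.(6[4-9]|[78][0-9]|9[0-5])\.
RewriteCond %{ENV:GEOIP_COUNTRY_CODE} ^(SE|MX)$
RewriteRule ^(.*)$ http://www.site2.com/$1 [L]
```

Consecutive RewriteCond lines are ANDed by default, so in this sketch the redirect fires only for SE or MX visitors whose address falls outside the range.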

gsmforum

1:34 pm on Mar 5, 2011 (gmt 0)

10+ Year Member



Thank you !
Just added to my .htaccess file.

Is it possible to add another similar line just below that, also based on the user agent?
Something like: RewriteCond %{HTTP_USER_AGENT} !googlebot|Msnbot|Slurp|Teoma|^$ [NC]

wilderness

2:28 pm on Mar 5, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Is it possible to add another similar line just below that, also based on the user agent?
Something like: RewriteCond %{HTTP_USER_AGENT} !googlebot|Msnbot|Slurp|Teoma|^$ [NC]


Yes; however, once again you're leaving the door open for abuse by any visitor who is capable of editing their UA (a simple task) and proclaiming themselves to be one of these otherwise legitimate bots.

Corrected line:

RewriteCond %{HTTP_USER_AGENT} !(googlebot|Msnbot|Slurp|Teoma) [NC]

Anchor explanation as follows:

"begins with" ^
"ends with" $
"contains" (no anchor)
"begins with and ends with zilch" ^zilch$ (that is, exactly "zilch" and nothing else)

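As a rough illustration of those anchor notes applied to the user-agent test, the lines below show each form on its own. They are isolated examples, not a set to paste together (stacked conditions would be ANDed), and "ExampleBot" is a made-up name used only for illustration.

```apache
# "contains" (no anchor): true if "googlebot" appears anywhere in the UA
RewriteCond %{HTTP_USER_AGENT} googlebot [NC]

# "begins with": true only if the UA starts with "Mozilla"
RewriteCond %{HTTP_USER_AGENT} ^Mozilla

# "ends with": true only if the UA ends with a closing parenthesis
RewriteCond %{HTTP_USER_AGENT} \)$

# Both anchors: true only if the UA is exactly "ExampleBot" (hypothetical name)
RewriteCond %{HTTP_USER_AGENT} ^ExampleBot$

# Both anchors with nothing between them: true only for a blank UA
RewriteCond %{HTTP_USER_AGENT} ^$
```

Read this way, the ^$ alternative in the line as first posted matched a blank user-agent; the corrected line drops it and keeps only the unanchored "contains" tests.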
g1smd

2:42 pm on Mar 5, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



It is very educational to go about your daily business on the web with the browser declaring the Googlebot user agent.

You see some interesting tricks that some well known sites employ!

wilderness

3:19 pm on Mar 5, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



It is very educational to go about your daily business on the web with the browser declaring the Googlebot user agent.


Ah! You're the one ;)

jdMorgan

5:23 pm on Mar 9, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



For the geo-ip application discussed here, it really isn't necessary to worry about IP address ranges. The country-code redirection can just look for a reasonably-valid search engine spider identification string.

However, validating those spider user-agent strings and checking their IP addresses *before* allowing access or wasting any time redirecting unwelcome spoofer-bots is a good idea.

In short, I'll argue that user-agent validation and geoip-related functions should be separate.

Jim
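
One possible reading of that separation, sketched as two independent rule sets. This is an assumption about how the advice could be laid out, not Jim's own code: the single IP range is the one quoted earlier in the thread (it may well have changed since), the 403 response for spoofers is an assumed policy, and the spider list is the short one from the corrected line above.

```apache
RewriteEngine on

# Step 1 (validation, assumed policy): refuse anything that claims to be
# Googlebot but does not come from the 66.249.64.0 - 66.249.95.255 range.
RewriteCond %{HTTP_USER_AGENT} googlebot [NC]
RewriteCond %{REMOTE_ADDR} !^66\.249\.(6[4-9]|[78][0-9]|9[0-5])\.
RewriteRule .* - [F]

# Step 2 (geo redirect): send SE/MX visitors to site2 unless the
# user-agent looks like a search engine spider. No IP check here.
RewriteCond %{HTTP_USER_AGENT} !(googlebot|msnbot|slurp|teoma) [NC]
RewriteCond %{ENV:GEOIP_COUNTRY_CODE} ^(SE|MX)$
RewriteRule ^(.*)$ http://www.site2.com/$1 [L]
```

Keeping the two steps apart means the geo rule only needs the simple user-agent test, while the spoof check can be tightened or dropped without touching the redirect.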

Sgt_Kickaxe

10:54 am on Mar 29, 2011 (gmt 0)



I agree with that, Jim. Even though I'm a relative rookie with .htaccess, I can see they should indeed be tackled separately. The purpose of the geo-ip rule is to redirect people based on location, NOT to hinder a bot's search efforts. The location of the bot should be irrelevant imo.

There are also a few more bots to consider beyond those in the line of code above; an example for a US-based site might be...

RewriteCond %{HTTP_USER_AGENT} !(Googlebot|Googlebot-Mobile|Googlebot-Image|Mediapartners-Google|Adsbot-Google|Msnbot|bingbot|Slurp|Teoma|YandexBot|YandexImages) [NC]

If someone wants to spoof a bot user agent, let them land in the wrong country :-)