Search Engine Spider and User Agent Identification Forum

Did I accidentally block a real googlebot visit?
aristotle
Msg#: 4680801 posted 1:20 pm on Jun 18, 2014 (gmt 0)

Yesterday I blocked some new IPs in one of my .htaccess files. But today it looks like a googlebot visit was blocked. Here is the Latest Visitors entry:
Host: 23.20.22.2
/
Http Code: 403 Date: Jun 18 08:47:55 Http Version: HTTP/1.0 Size in Bytes: 13
Referer: -
Agent: Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)

And here is the IP information:
IP: 23.20.22.2
Hostname: ec2-23-20-22-2.compute-1.amazonaws.com
ISP: Amazon.com
Organization: Amazon.com
Services: None detected
Type: Corporate
Assignment: Static IP
Country: United States
State/Region: Virginia
City: Ashburn

I'm really getting tired of having to deal with .htaccess matters. It takes time that could be spent on research and writing.

 

incrediBILL
Msg#: 4680801 posted 11:49 pm on Jun 18, 2014 (gmt 0)

I'm pretty sure Google doesn't use Amazon AWS, a competing service.

not2easy
Msg#: 4680801 posted 2:25 am on Jun 19, 2014 (gmt 0)

Pretty sure you won't see the real Googlebot using HTTP/1.0 either.
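
That observation translates directly into a rule, if you want a belt-and-braces check on top of the IP test. A minimal sketch (the THE_REQUEST pattern and the 403 response are my choices, not something posted in this thread):

# anything claiming to be Googlebot but speaking HTTP/1.0 is treated as a fake;
# the real crawler has requested with HTTP/1.1 for years
RewriteCond %{HTTP_USER_AGENT} Googlebot [NC]
RewriteCond %{THE_REQUEST} HTTP/1\.0$
RewriteRule ^ - [F]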

tangor
Msg#: 4680801 posted 3:02 am on Jun 19, 2014 (gmt 0)

It's even easier to plug in the known (and desired) googlebot IPs and reject anything else (whitelisting).

lucy24
Msg#: 4680801 posted 4:04 am on Jun 19, 2014 (gmt 0)

googlebot has got to be the single most common spoofed robotic UA-- to the point where many people have a rule along the lines of

:: shuffling papers ::

RewriteCond %{HTTP_USER_AGENT} Googlebot
RewriteCond %{REMOTE_ADDR} !^(66\.249|74\.125)\.
RewriteRule (^|\.html|/|\.pdf)$ - [F]

Details to taste, of course.

aristotle
Msg#: 4680801 posted 11:54 am on Jun 19, 2014 (gmt 0)

Thanks for the replies. I thought it was probably a fake.

I'm really tired of spending time looking at logs and working on .htaccess files. There are other things I could be doing. And I still don't understand why people are expending so much time and effort to create all of these rogue bots. Why can't they find something better to do with their lives?

incrediBILL
Msg#: 4680801 posted 4:46 pm on Jun 19, 2014 (gmt 0)

Why can't they find something better to do with their lives?


Because there's big money in scraping.

I have a long list of how that data could be used, and it's a thread topic all its own.

aristotle
Msg#: 4680801 posted 9:12 pm on Jun 19, 2014 (gmt 0)

Lucy suggested:
RewriteCond %{HTTP_USER_AGENT} Googlebot
RewriteCond %{REMOTE_ADDR} !^(66\.249|74\.125)\.
RewriteRule (^|\.html|/|\.pdf)$ - [F]

Thanks Lucy. I'm going to add that to my .htaccess files. But I'm not sure where to put it among the other sections of my files. Here is an abbreviated example of the order I'm currently using:

Options +FollowSymLinks
RewriteEngine On

# BLOCK IPs
order allow,deny
deny from 128.204.195.249
deny from 173.213.
. . . . . . . . .
allow from all

# BLOCK USER AGENTS:
SetEnvIfNoCase User-Agent (\<|\>|\'|\$x0|\%0A|\%0D|\%27|\%3C|\&lt) ban
SetEnvIfNoCase User-Agent (a6corp|MJ12bot|YisouS|NerdyBot|nutch|spbot) ban
. . . . . . . . .
Order Allow,Deny
Allow from all
Deny from env=ban

# REDIRECT
Redirect 301 /example1.html /example2.html

# BLOCK COUNTRY DOMAINS
RewriteCond %{HTTP_REFERER} \.(ru|su|ua|cn|md|kz|pl|lv|ro)(/|$) [NC,OR]
RewriteCond %{HTTP_REFERER} \.(by|bg|hr|cz|al|rs|kp|hu|jp)(/|$) [NC]
RewriteRule (^|\.html|/)$ - [F]

# BLOCK REFERERS
RewriteCond %{HTTP_REFERER} (formatn|kochanelli|chimiver|poker|thepostemail) [NC,OR]
RewriteCond %{HTTP_REFERER} (sugarkun|trustcombat|escort|letseks|tipkiller) [NC,OR]
RewriteCond %{HTTP_REFERER} (semalt|#*$!ogorod\.com|prostitutki) [NC]
RewriteRule (^|\.html|/)$ - [F]

# BLOCK HOME PAGE FROM SELF-REFERERS
RewriteCond %{HTTP_REFERER} ^http://(www\.)?example\.com(/(index\.html)?)?$ [NC]
RewriteRule ^(index\.html)?$ - [F]

# REWRITE TO WWW
RewriteCond %{HTTP_HOST} ^example\.com$ [NC]
RewriteRule ^(.*)$ http://www.example.com/$1 [R=301,L]

# PREVENT INDEXING OF IMAGES
<Files ~ "\.(gif|jpe?g|png)$">
Header append x-robots-tag "noindex"
</Files>

ErrorDocument 403 "Access Denied"
ErrorDocument 404 /custom404.html
# END

That's the order I'm using now -- I'm not sure if it's correct, but it seems to work. So where in there should I stick your code for blocking spoof googlebots?

lucy24
Msg#: 4680801 posted 11:15 pm on Jun 19, 2014 (gmt 0)

Redirect 301 /example1.html /example2.html
<snip>
RewriteCond

Aaack, don't do that! Don't combine mod_alias (Redirect by that name) with mod_rewrite; things won't execute in the desired order. Anything currently using mod_alias should be converted to mod_rewrite syntax and put with the other redirects (R=301 flag).
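
Applied to the one-liner in aristotle's file, the conversion would look something like this (a sketch reusing his example filenames and the www host from his # REWRITE TO WWW section):

# was: Redirect 301 /example1.html /example2.html
RewriteRule ^example1\.html$ http://www.example.com/example2.html [R=301,L]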

When ordering your RewriteRules, start by grouping them in order of severity: first access-control rules, then 410s, then redirects, then L-alone (rewrites, generally). There will be individual exceptions but that's the general principle.
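
A skeleton of that ordering, with placeholder patterns and filenames (illustration only, not drop-in rules):

# 1. access control: outright denials
RewriteCond %{HTTP_USER_AGENT} badbot [NC]
RewriteRule ^ - [F]
# 2. gone: pages deliberately removed
RewriteRule ^old-section/ - [G]
# 3. external redirects
RewriteRule ^example1\.html$ http://www.example.com/example2.html [R=301,L]
# 4. internal rewrites (L alone)
RewriteRule ^pretty-name$ /real-file.php [L]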

I put mod_rewrite after everything else in htaccess, simply because it takes up more room. I guess technically mod_authzzz uses more lines, but I keep those in a separate htaccess file covering all domains. (You can only do this if you're on a "Userspace" setup as opposed to a "Primary/Addon" setup.)

Each module is an island: Apache collects each module's directives separately, so the physical order of directives from different modules makes absolutely no difference. (That's exactly why the mod_alias/mod_rewrite mix bites you: putting a Redirect earlier in the file doesn't make it execute earlier.) You could even have directives from different modules all garbled together and Apache wouldn't care. Arrange things in the most sanity-saving way.

aristotle
Msg#: 4680801 posted 11:48 am on Jun 20, 2014 (gmt 0)

Lucy wrote:
Aaack, don't do that! Don't combine mod_alias (Redirect by that name) with mod_rewrite; things won't execute in the desired order. Anything currently using mod_alias should be converted to mod_rewrite syntax and put with the other redirects (R=301 flag).

Thanks Lucy. So a one-line 301 redirect (mod_alias) should never be included in an .htaccess file that also includes mod_rewrite instructions?

One reason I don't like working with .htaccess files is that, unless you're an expert, it's so easy to get tripped up by this type of mistake. And when I search for information on the web, I seem to find a lot of self-proclaimed experts who turn out not to know much more than the little I know. Once I copied some recommended code from some article somewhere, and it caused an internal server error!

lucy24
Msg#: 4680801 posted 8:10 pm on Jun 20, 2014 (gmt 0)

So a one-line 301 redirect (mod_alias) should never be included in an .htaccess file that also includes mod_rewrite instructions?

Exactly. There may or may not be visible consequences, depending on what else is going on at the site, but it's better not to take chances.

self-proclaimed experts

Well, don't look at me: I don't speak a word of Apache and never claimed to ;) I do have a fairly solid grip on Regular Expressions. This takes you surprisingly far.

Once I copied some recommended code from some article somewhere, and it caused an internal server error!

I did this once-- with information from my own host's wiki, no less. On closer inspection it turned out the wiki article was several years old and pertained to an older version of a third-party mod. But that's the worst case. More often, a cut-and-paste remedy either won't work, or will work very inefficiently.

aristotle
Msg#: 4680801 posted 8:50 pm on Jun 20, 2014 (gmt 0)

self-proclaimed experts -- Well, don't look at me:

Of course I didn't mean you, Lucy. You're a real expert, along with some others here. It's just that I've never had much success finding answers to .htaccess questions by searching on the web. That's why I keep coming back here.

aristotle
Msg#: 4680801 posted 1:02 pm on Jun 21, 2014 (gmt 0)

I would like to use this thread to ask another question. I keep seeing requests that attach //RK=0 to a file name. For example:
Host: 183.217.178.17
/
Http Code: 200 Date: Jun 20 19:34:16 Http Version: HTTP/1.1 Size in Bytes: 43733
Referer: http://www.example.com//RK=0
Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/28.0.1500.71 Safari/537.36

Can someone please explain where this attached RK=0 comes from, and also why the server appears to ignore it?

Edit: Oops, I posted a misleading example. Sometimes the RK=0 is attached to the requested file name, and sometimes to the "referrer" (the site's home page URL).
Okay, here's an example where it's attached to the requested file name:
Host: 86.51.26.22
//RK=0
Http Code: 404 Date: Jun 20 19:33:41 Http Version: HTTP/1.0 Size in Bytes: 472
Referer: http://www.example.com/
Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/28.0.1500.71 Safari/537.36


not2easy
Msg#: 4680801 posted 3:58 pm on Jun 21, 2014 (gmt 0)

I've seen them showing up in GWT as 404 pages on one site; not very happy about that. If you do a Google search for
allinurl: RS "//RK=0"
it looks like someone figured out what these come from: scraped pages of malformed URLs from results of Yahoo searches. If we want to discuss a problem that is a different topic, it helps others find the information if we keep it all together with the rest of that discussion: [webmasterworld.com...]
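
If the goal is simply to keep those requests out of the logs and out of GWT, something along these lines would turn them away (a sketch; matching on THE_REQUEST and answering 410 Gone are my choices, not part of the thread linked above):

# requests carrying the Yahoo-scrape artifacts (RK=0, RS=...) get a 410
RewriteCond %{THE_REQUEST} /RK=0 [OR]
RewriteCond %{THE_REQUEST} /RS=
RewriteRule ^ - [G]
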
aristotle
Msg#: 4680801 posted 4:33 pm on Jun 21, 2014 (gmt 0)

Thanks not2easy
I'm sorry for disregarding the forum guidelines, but I didn't think the question was worth a thread of its own.

lucy24
Msg#: 4680801 posted 7:16 pm on Jun 21, 2014 (gmt 0)

A while back-- maybe last year?-- there was a fairly long thread in Apache. Someone wanted to build an htaccess file for the ages by including rules to block every malformed URL known to humanity. That includes things like // duplicate slashes, or extra material after ".html", or, well, all the things people generally don't have to think about until they have to think about them. It's a pretty long list. He did get it hammered into shape in the end. That applies specifically to bogus filenames. There's a separate and more recent thread-- I think started by wilderness-- laying out all the weird punctuation glitches you'll see in a phony or robotic UA.
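
A couple of the patterns that thread ended up covering, sketched here (my formulations, not quotes from it):

# doubled slash anywhere in the raw request line
RewriteCond %{THE_REQUEST} ^[A-Z]+\s[^?\s]*//
RewriteRule ^ - [F]
# extra material after ".html", e.g. /page.html/RK=0
RewriteRule \.html. - [F]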

tangor
Msg#: 4680801 posted 12:24 am on Jun 22, 2014 (gmt 0)

Still curious. What is a "real" googlebot? Or what do others think is the real googlebot?

I work with the known IP ranges. Anything else gets booted. YMMV. Just asking, "How do you roll?"

lucy24
Msg#: 4680801 posted 2:12 am on Jun 22, 2014 (gmt 0)

I've been assuming "real googlebot" = one from the expected 66.249 range. As opposed to some random Ukrainian spoofer, or silly blunders like "GoogleBot 1.0" -- which is why incrediBill has a Sticky on why casing is important.

Angonasec
Msg#: 4680801 posted 12:50 pm on Jun 25, 2014 (gmt 0)

Aaack, don't do that!

Oh dear, better not let Our Lucy see my htacc file...

To help me repent, kindly advise how to switch this to mod_rewrite format:

redirect 301 /subdomain/ http://subdomain.example.com/

It's a line I use to stop requests for

example.com/subdomain/anyfile

from being perceived as anything other than the genuine subdomain, i.e.

subdomain.example.com/anyfile

You need to do this if your host uses DSIRMs for subdomains, or you end up with duplicate URL listings in the SEs.

lucy24
Msg#: 4680801 posted 7:14 pm on Jun 25, 2014 (gmt 0)

Open your htaccess file in a text editor that does Regular Expressions. Run these global replaces:
# change . to \.
# ^(Redirect \d\d\d \S+?[^\\])\. TO \1\\.
# now change Redirect to Rewrite
# ^Redirect(?:Match)? 301 /(.+) TO RewriteRule \1 [R=301,L]

replacing \1 with $1 depending on your RegEx engine.
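
One caveat to add to that recipe (my note, not Lucy's): mod_alias Redirect is a prefix match that automatically appends the rest of the requested path to the target, while RewriteRule appends nothing unless you capture it. So a faithful conversion of Angonasec's line needs an explicit capture:

# was: redirect 301 /subdomain/ http://subdomain.example.com/
RewriteRule ^subdomain/(.*)$ http://subdomain.example.com/$1 [R=301,L]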

Angonasec
Msg#: 4680801 posted 11:59 am on Jun 26, 2014 (gmt 0)

See, it takes a woman to show us how easy it is, folks.
