
Search Engine Spider and User Agent Identification Forum

    
deny all
wilderness

Msg#: 4422323 posted 5:20 pm on Feb 27, 2012 (gmt 0)

After 22 days (and nights) of intensive eye strain (HTML and htaccess), my reactivated domain is OK.

I've posted an inquiry similar to the following previously.

I've registered a new domain (a second one, larger than the reactivated domain), which is intended to function as pay-based (despite the well-known failure of such methods).

1) Deny all.
2) Except key and/or major SE's, which will NOT cache.
3) Exceptions to the deny-all for the main page and perhaps a few other pages. (Some kind of shopping cart may be required for payments; however, I don't believe so, based upon previous support.)

Is it possible to make page exceptions in mod_authz_host or will I have to use mod_rewrite?

Perhaps my thoughts are not clear; however, I'm not aware of a DENY ALL function in mod_rewrite. Is there one?

Anybody have any pointers or references?

TIA.

 

wilderness

Msg#: 4422323 posted 7:42 pm on Feb 27, 2012 (gmt 0)

Could the following be a beginning?

RewriteCond %{HTTP_USER_AGENT} .
RewriteRule .* - [F]

What about the blank UA's?
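(The "." pattern above only matches a non-empty User-Agent, so a blank UA would slip past it; a minimal sketch of one way to catch the blanks as well:)

# Sketch only: forbid requests with an empty or missing User-Agent
RewriteCond %{HTTP_USER_AGENT} ^$
RewriteRule .* - [F]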

Key_Master

Msg#: 4422323 posted 10:10 pm on Feb 27, 2012 (gmt 0)

Is this what you are looking for?

# Deny all
SetEnvIf Remote_Addr .* dont_allow

# Allow some IP ranges
SetEnvIf Remote_Addr ^(?:66\.249\.(?:6[4-9]|[78][0-9]|9[0-5])|74\.125)\. !dont_allow

# Allow global access to certain pages
SetEnvIf Request_URI ^/robots\.txt$ !dont_allow
SetEnvIf Request_URI ^/main-page\.php$ !dont_allow

<Files *>
Order Allow,Deny
Allow from all
Deny from env=dont_allow
</Files>

wilderness

Msg#: 4422323 posted 11:23 pm on Feb 27, 2012 (gmt 0)

Many thanks, Key_Master.

A few questions:
1) Will there be any adverse effect if I change the Order to Deny,Allow?

2) Is it OK to use the Limit container rather than the Files container?

3) My inclination towards mod_rewrite was because of its capability of coordinating IP's and UA's for the SE's. (We have some who make fake SE UA's these days, as well as the SE's themselves (from the same IP's) using standard or malformed browser UA's.)

BTW, saw your thread in the other forum.
You're to be commended for another superb and time-intensive effort.
Since Jim's absence, that forum seems to be comprised of a few devotees attempting to help Jim, while most of the inquiries are from newly registered users. I'd almost wager that forum attracts more noobs than any other forum at WebmasterWorld.

Key_Master

Msg#: 4422323 posted 12:07 am on Feb 28, 2012 (gmt 0)

Thanks wilderness. You can use a Limit container. If you want to use mod_rewrite instead, you could use:

RewriteCond %{ENV:dont_allow} ^1$
RewriteRule .* - [F]


With a little tweaking, deny,allow will work:

SetEnvIf Remote_Addr ^(?:66\.249\.(?:6[4-9]|[78][0-9]|9[0-5])|74\.125)\. allow

<Limit GET POST>
Order Deny,Allow
Deny from all
Allow from env=allow
</Limit>


Then make a whitelist of SetEnvIf conditions to allow what you want (using "allow" instead of "dont_allow"). You wouldn't need a "Deny all" list if that is the way you want to go.
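A minimal sketch of what such a whitelist could look like, reusing the placeholder paths from the earlier example (robots.txt and main-page.php are illustrations only):

# Sketch: whitelist approach -- only set "allow" for what should be public
SetEnvIf Remote_Addr ^(?:66\.249\.(?:6[4-9]|[78][0-9]|9[0-5])|74\.125)\. allow
SetEnvIf Request_URI ^/robots\.txt$ allow
SetEnvIf Request_URI ^/main-page\.php$ allow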

wilderness

Msg#: 4422323 posted 2:51 am on May 12, 2012 (gmt 0)

Thanks wilderness. You can use a Limit container. If you want to use mod_rewrite instead, you could use:

RewriteCond %{ENV:dont_allow} ^1$
RewriteRule .* - [F]


Key_Master,

How would I add the following lines into the ENV?

RewriteCond %{REMOTE_ADDR} !^66\.249\.(6[4-9]|[78][0-9]|9[0-5])\.
RewriteCond %{HTTP_USER_AGENT} Googlebot
RewriteRule .* - [F]
RewriteCond %{REMOTE_ADDR} ^65\.5[2-5]\. [OR]
RewriteCond %{REMOTE_ADDR} ^70\.37\. [OR]
RewriteCond %{REMOTE_ADDR} ^157\.[45][0-9]\. [OR]
RewriteCond %{REMOTE_ADDR} ^207\.46\. [OR]
RewriteCond %{REMOTE_ADDR} ^207\.[67][0-9]\.
RewriteCond %{HTTP_USER_AGENT} !(Bingbot|msnbot) [NC]
RewriteRule !^robots\.txt$ - [F]

Key_Master

Msg#: 4422323 posted 8:13 am on May 13, 2012 (gmt 0)

Have you ever thought about using the HTTP_FROM header? I've never seen it spoofed, or sent from a specified IP range by a non-crawler. It also simplifies the solution in your case.

SetEnvIfNoCase User-Agent (google|msn|bing)bot dont_allow
SetEnvIf From ^googlebot\(at\)googlebot\.com$ !dont_allow
SetEnvIf From ^bingbot\(at\)microsoft\.com$ !dont_allow
SetEnvIf Request_URI ^/robots\.txt$ !dont_allow

RewriteCond %{ENV:dont_allow} ^1$
RewriteRule .* - [F]
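One way to audit what From headers visitors actually send before relying on them is a custom log; a sketch assuming access to the server or vhost configuration (CustomLog is not valid in .htaccess, and the log path is a placeholder):

# Sketch: log remote IP, User-Agent, and From header for inspection
LogFormat "%h \"%{User-Agent}i\" \"%{From}i\"" from_audit
CustomLog logs/from_audit.log from_audit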

lucy24

Msg#: 4422323 posted 11:57 am on May 13, 2012 (gmt 0)

RewriteCond %{ENV:dont_allow} ^1$
RewriteRule .* - [F]

But it will only work if mod_setenvif executes before mod_rewrite. On shared hosting, you can never be sure about these things. (It probably does; I experimented once on my own site, but it's not ironclad.) If instead you say

Deny from env=dont_allow

you can be absolutely sure that sucker isn't getting in.

You can save yourself a little bit of typing, and save the server a few bytes, by using the shortcut BrowserMatch.
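For reference, a sketch of that shortcut: BrowserMatchNoCase is shorthand for SetEnvIfNoCase User-Agent, so the UA test above could be shortened to:

# Equivalent to: SetEnvIfNoCase User-Agent (google|msn|bing)bot dont_allow
BrowserMatchNoCase (google|msn|bing)bot dont_allow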

wilderness

Msg#: 4422323 posted 3:30 pm on May 13, 2012 (gmt 0)

Have you ever thought about using the HTTP_FROM header? I've never seen it spoofed or used from a specified IP range from a non-crawler. It also simplifies the solution in your case.

SetEnvIfNoCase User-Agent (google|msn|bing)bot dont_allow
SetEnvIf From ^googlebot\(at\)googlebot\.com$ !dont_allow
SetEnvIf From ^bingbot\(at\)microsoft\.com$ !dont_allow
SetEnvIf Request_URI ^/robots\.txt$ !dont_allow

RewriteCond %{ENV:dont_allow} ^1$
RewriteRule .* - [F]


Key_master,
Despite my longevity in this forum, my methods are quite simple.
If in the past decade somebody provided an example of headers that I comprehended, then I implemented it (if effective); if I didn't comprehend the use (at least enough that I could expand my simplistic capabilities), then I didn't use it.

The only active example of a header check that I have was for the AVG thing.

My apologies; however, I don't understand how the lines you provided compare the IP's to the UA's.
1) I don't wish to allow access based upon IP's alone,
2) or UA's alone.

(Note: just using the Google UA would also let in the numerous fakers that appear fairly often, and in bunches.)

See this thread [webmasterworld.com] for a prime example.

Key_Master

Msg#: 4422323 posted 3:31 pm on May 13, 2012 (gmt 0)

Lucy, that would indicate a server that is seriously misconfigured. Fortunately, I've never seen that happen before, and the problems it would cause would go far beyond simple access control. I think maybe you are thinking of the SetEnv directive, which does get processed last (e.g., SetEnv dont_allow 1).
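A small sketch of the distinction (variable names are examples only): mod_setenvif sets its variables early enough for mod_rewrite to read them, while mod_env's SetEnv is applied too late in request processing for a RewriteCond to see.

# Visible to RewriteCond %{ENV:dont_allow} -- set early by mod_setenvif
SetEnvIf Remote_Addr ^192\.0\.2\. dont_allow
# NOT visible to RewriteCond -- SetEnv (mod_env) is processed after mod_rewrite
SetEnv late_flag 1
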
wilderness

Msg#: 4422323 posted 3:35 pm on May 13, 2012 (gmt 0)

lucy,
I realize that you're trying to help; however, unless you're able to show how to implement the combined IP & UA checks into the ENV,

How would I add the following lines into the ENV?

then you're only confusing the matter.

If it's not possible to utilize these dual conditions in the ENV, then I won't be able to use the ENV.

See my reply to Key_master.

wilderness

Msg#: 4422323 posted 3:46 pm on May 13, 2012 (gmt 0)

Am I able to add IP ranges?

RewriteCond %{ENV:dont_allow} ^1$
RewriteCond %{REMOTE_ADDR} ^65\.5[2-5]\. [OR]
RewriteCond %{REMOTE_ADDR} ^70\.37\. [OR]
RewriteCond %{REMOTE_ADDR} ^157\.[45][0-9]\. [OR]
RewriteCond %{REMOTE_ADDR} ^207\.46\. [OR]
RewriteCond %{REMOTE_ADDR} ^207\.[67][0-9]\.
RewriteRule .* - [F]

Key_Master

Msg#: 4422323 posted 4:16 pm on May 13, 2012 (gmt 0)

Wilderness, the following code should do the trick, but there are more efficient ways of achieving the same effect.


SetEnvIfNoCase User-Agent googlebot google_ua
SetEnvIfNoCase User-Agent (bingbot|msnbot) msn_ua
SetEnvIf Remote_Addr ^66\.249\.(6[4-9]|[78][0-9]|9[0-5])\. google_ip
SetEnvIf Remote_Addr ^65\.5[2-5]\. msn_ip
SetEnvIf Remote_Addr ^70\.37\. msn_ip
SetEnvIf Remote_Addr ^157\.[45][0-9]\. msn_ip
SetEnvIf Remote_Addr ^207\.46\. msn_ip
SetEnvIf Remote_Addr ^207\.[67][0-9]\. msn_ip

RewriteCond %{ENV:google_ua} ^1$
RewriteCond %{ENV:google_ip} !^1$
RewriteRule !^robots\.txt$ - [F]

RewriteCond %{ENV:msn_ua} ^1$
RewriteCond %{ENV:msn_ip} !^1$
RewriteRule !^robots\.txt$ - [F]
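As for the "more efficient ways": one possible consolidation is to fold each UA/IP pair into a single test by concatenating the variables, a sketch that assumes an unset variable expands to an empty string in a RewriteCond test string (standard mod_rewrite behaviour):

# Sketch: "1/" means the UA flag is set but the matching IP flag is not
RewriteCond %{ENV:google_ua}/%{ENV:google_ip} ^1/$ [OR]
RewriteCond %{ENV:msn_ua}/%{ENV:msn_ip} ^1/$
RewriteRule !^robots\.txt$ - [F]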

wilderness

Msg#: 4422323 posted 4:36 pm on May 13, 2012 (gmt 0)

Many thanks, Key_Master.

At this point I'm looking for implementation.
I've a domain that has been sitting empty for three months, while materials are awaiting addition.

The root will allow access to ALL. Even IP's and regions that I've denied for more than a decade.

The sub-root and its sub-directories will have these restrictions.

My sites don't get the volume of traffic that many other sites get, as a result of the small market share of widgets.
The variety of visitors, however, is extensive.
In addition, my sites are simplistic in design: no PHP, no scripts, no Java.
Simple HTML and CSS.

If there's a known method of having the major SE's crawl/index a website, while offering a deny/PAY option to every other visitor, then I'm certainly open to suggestions.

Many thanks again.

Don

wilderness

Msg#: 4422323 posted 3:50 pm on May 14, 2012 (gmt 0)

Key_master,
Many thanks again.
I put this in place last night.
I tested it, added a page, and will begin adding a few preliminary pages to see how the SE's deal with them.

It seems to be functioning as intended.
