Search Engine Spider and User Agent Identification Forum

Would anyone be so kind as to check my htaccess file?
check my htaccess file, help, professional, experienced opinion needed
iwillgetu
msg:4503525, 4:03 pm on Oct 3, 2012 (gmt 0)

I just want professional/experienced opinions and/or suggestions, and so on.

Here is my current htaccess:
Options -Indexes
Options All -Indexes
IndexIgnore *
Options +FollowSymLinks
DirectoryIndex index.php
RewriteEngine on
ErrorDocument 404 404.html
ErrorDocument 401 401.html
ErrorDocument 403 403.html
ErrorDocument 500 500.html
ErrorDocument 503 503.html
RewriteCond %{HTTP_HOST} ^MYWEBSITE.com [NC]
RewriteRule ^(.*)$ http://www.MYWEBSITE.com/$1 [L,R=301]
AddEncoding gzip .gz
RewriteCond %{HTTP:Accept-encoding} gzip
RewriteCond %{HTTP_USER_AGENT} !Safari
RewriteCond %{REQUEST_FILENAME}.gz -f
RewriteRule ^(.*)$ $1.gz [QSA,L]
RewriteRule semuel.php semuel.php$1

order allow,deny
#amazonaws
deny from 46.51.128.0/18
deny from 46.51.192.0/20
deny from 46.51.216.0/21
deny from 46.51.224.0/19
deny from 46.137.0.0/17
deny from 46.137.128.0/18
deny from 46.137.224.0/19
deny from 50.16.0.0/15
deny from 50.18.0.0/16
deny from 50.19.0.0/16
deny from 67.202.0.0/18
deny from 72.44.32.0/19
deny from 75.101.128.0/17
deny from 79.125.0.0/17
deny from 103.4.8.0/21
deny from 107.20.0.0/15
deny from 122.248.192.0/18
deny from 174.129.0.0/16
deny from 175.41.128.0/18
deny from 175.41.192.0/18
deny from 176.32.64.0/19
deny from 176.34.128.0/17
deny from 184.72.0.0/18
deny from 184.72.64.0/18
deny from 184.72.128.0/17
deny from 184.73.0.0/16
deny from 204.236.128.0/18
deny from 204.236.192.0/18
deny from 216.182.224.0/20
#dotnetdotcom.org wowrack
deny from 208.115.96.
deny from 208.115.97.
deny from 208.115.98.
deny from 208.115.99.
deny from 208.115.100.
deny from 208.115.101.
deny from 208.115.102.
deny from 208.115.103.
deny from 208.115.104.
deny from 208.115.105.
deny from 208.115.106.
deny from 208.115.107.
deny from 208.115.108.
deny from 208.115.109.
deny from 208.115.110.
deny from 208.115.111.
deny from 208.115.112.
deny from 208.115.113.
deny from 208.115.114.
deny from 208.115.115.
deny from 208.115.116.
deny from 208.115.117.
deny from 208.115.118.
deny from 208.115.119.
deny from 208.115.120.
deny from 208.115.121.
deny from 208.115.122.
deny from 208.115.123.
deny from 208.115.124.
deny from 208.115.125.
deny from 208.115.126.
deny from 216.176.176.
deny from 216.176.177.
deny from 216.176.178.
deny from 216.176.179.
deny from 216.176.180.
deny from 216.176.181.
deny from 216.176.182.
deny from 216.176.183.
deny from 216.176.184.
deny from 216.176.185.
deny from 216.176.186.
deny from 216.176.187.
deny from 216.176.188.
deny from 216.176.189.
deny from 216.176.190.
deny from 216.176.191.
#linode user
deny from 109.74.197.228
#linode
#deny from 109.74.192.
#deny from 109.74.193.
#deny from 109.74.194.
#deny from 109.74.195.
#deny from 109.74.196.
#deny from 109.74.197.
#deny from 109.74.198.
#deny from 109.74.199.
#deny from 109.74.200.
#deny from 109.74.201.
#deny from 109.74.202.
#deny from 109.74.203.
#deny from 109.74.204.
#deny from 109.74.205.
#deny from 109.74.206.
#deny from 109.74.207.
#fake googlebot
deny from 174.37.39.114
#some bot also
deny from 184.172.187.229
#china bot
deny from 123.126.68.31
deny from 180.76.5.
deny from 180.76.6.
#navada bot
deny from 66.116.122.5
#songu
deny from 123.126.68.19
#soso
deny from 124.115.0.
deny from 124.115.4.
deny from 113.142.10.
#sogou
deny from 220.181.94.232
deny from 220.181.94.222
#some crap
deny from 208.88.226.73
#baidu
deny from 180.76.5.
#other
deny from 110.77.139.157
allow from all

RewriteBase /
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /index\.php\ HTTP/
RewriteRule ^index\.php$ http://www.MYWEBSITE.com/ [R=301,L]

RewriteCond %{HTTP_HOST} ^(ns1\.)?MYWEBSITE\.com [NC]
RewriteRule ^(.*)$ http://www.MYWEBSITE.com/$1 [L,R=301]
RewriteCond %{HTTP_HOST} ^(ns2\.)?MYWEBSITE\.com [NC]
RewriteRule ^(.*)$ http://www.MYWEBSITE.com/$1 [L,R=301]
RewriteCond %{HTTP_HOST} ^(ns3\.)?MYWEBSITE\.com [NC]
RewriteRule ^(.*)$ http://www.MYWEBSITE.com/$1 [L,R=301]
RewriteCond %{HTTP_HOST} ^(ns4\.)?MYWEBSITE\.com [NC]
RewriteRule ^(.*)$ http://www.MYWEBSITE.com/$1 [L,R=301]

RewriteCond %{HTTP_USER_AGENT} ^Anarchie [OR]
RewriteCond %{HTTP_USER_AGENT} ^ASPSeek [OR]
RewriteCond %{HTTP_USER_AGENT} ^attach [OR]
RewriteCond %{HTTP_USER_AGENT} ^autoemailspider [OR]
RewriteCond %{HTTP_USER_AGENT} ^Xaldon\ WebSpider [OR]
RewriteCond %{HTTP_USER_AGENT} ^Xenu [OR]
RewriteCond %{HTTP_USER_AGENT} ^Zeus.*Webster [OR]
RewriteCond %{HTTP_USER_AGENT} ^Zeus
RewriteRule ^.* - [F,L]

RewriteCond %{HTTP_USER_AGENT} (Crawler|spider) [NC]
RewriteRule .* - [L]
RewriteCond %{QUERY_STRING} ^[^=]*$
RewriteCond %{QUERY_STRING} %2d|\- [NC]
RewriteRule .? - [F,L]

SetEnvIf User-Agent ^libww keep_out=1
SetEnvIf User-Agent ^Morf keep_out2=1
SetEnvIf User-Agent ^TurnitinBot keep_out3=1
Deny from env=keep_out env=keep_out2 env=keep_out3

SetEnvIfNoCase User-Agent "^Wget" bad_bot
SetEnvIfNoCase User-Agent "^EmailSiphon" bad_bot
SetEnvIfNoCase User-Agent "^EmailWolf" bad_bot
SetEnvIfNoCase User-Agent "^libwww-perl" bad_bot
#SetEnvIfNoCase User-Agent "libwww" bad_bot
SetEnvIfNoCase User-Agent "TurnitinBot" bad_bot
SetEnvIfNoCase User-Agent "tencenttraveler" bad_bot
SetEnvIfNoCase User-Agent "Yandex" bad_bot
SetEnvIfNoCase User-Agent "baidu" bad_bot
SetEnvIfNoCase User-Agent "zeus" bad_bot
SetEnvIfNoCase User-Agent "getright" bad_bot
SetEnvIfNoCase User-Agent "flipboard" bad_bot
SetEnvIfNoCase User-Agent "mj12" bad_bot
SetEnvIfNoCase User-Agent "majestic" bad_bot
Order allow,deny
Allow from all
Deny from env=bad_bot

AddOutputFilterByType DEFLATE text/css text/html application/x-javascript application/javascript
BrowserMatch ^Mozilla/4 gzip-only-text/html
BrowserMatch ^Mozilla/4\.0[678] no-gzip
BrowserMatch \bMSIE !no-gzip !gzip-only-text/html

# Turn on Expires and set default to 0
ExpiresActive On
ExpiresDefault A0

ExpiresByType text/css "access plus 10 days"
ExpiresByType text/javascript "access plus 10 days"
ExpiresByType application/javascript "access plus 10 days"
ExpiresByType image/gif "access plus 30 days"
ExpiresByType image/png "access plus 30 days"
ExpiresByType image/jpg "access plus 30 days"
ExpiresByType image/jpeg "access plus 30 days"

<filesMatch "\.(flv|ico|pdf|avi|mov|ppt|doc|mp3|wmv|wav)$">
ExpiresDefault A432000
Header append Cache-Control "public"
</filesMatch>

<filesMatch "\.(gif|jpg|jpeg|png|swf)$">
ExpiresDefault A432000
Header append Cache-Control "public"
</filesMatch>

<filesMatch "\.(xml|txt|html|js|css)$">
ExpiresDefault A7200
Header append Cache-Control "proxy-revalidate"
</filesMatch>

<filesMatch "\.(php|cgi|pl|htm)$">
ExpiresActive Off
Header set Cache-Control "private, no-cache, no-store, proxy-revalidate, no-transform"
Header set Pragma "no-cache"
</filesMatch>

<filesMatch "\.(ico|pdf|flv)$">
Header set Cache-Control "max-age=432000, public"
</filesMatch>

<filesMatch "\.(jpg|jpeg|png|gif|swf)$">
Header set Cache-Control "max-age=432000, public"
</filesMatch>

<filesMatch "\.(xml|txt|css|js)$">
Header set Cache-Control "max-age=172800, proxy-revalidate"
</filesMatch>

<filesMatch "\.(html|htm|php)$">
Header set Cache-Control "max-age=60, private, proxy-revalidate"
</filesMatch>

<Files ~ "^.*\.([Bb][Aa][Cc][Kk][Uu][Pp])">
Order allow,deny
Deny from all
Satisfy All
</Files>

<Files .htaccess>
order allow,deny
deny from all
</Files>

###Start Kloxo PHP config Area
###Please Don't edit these comments or the content in between. kloxo uses this to recognize the lines it writes to the the file. If the above line is corrupted, it may fail to recognize them, leading to multiple lines.

<Ifmodule mod_php4.c>
php_value error_log "/home/admin/__processed_stats/MYWEBSITE.com.phplog"
php_value upload_max_filesize 2M
php_value max_execution_time 30
php_value max_input_time 60
php_value memory_limit 32M
php_value post_max_size 8M
php_flag register_globals off
php_flag display_errors off
php_flag file_uploads on
php_flag log_errors off
php_flag output_buffering off
php_flag register_argc_argv on
php_flag magic_quotes_gpc off
php_flag magic_quotes_runtime off
php_flag magic_quotes_sybase off
php_flag mysql.allow_persistent off
php_flag register_long_arrays on
php_flag allow_url_fopen on
php_flag cgi.force_redirect on
php_flag enable_dl on
</Ifmodule>

<Ifmodule mod_php5.c>
php_value error_log "/home/admin/__processed_stats/MYWEBSITE.com.phplog"
php_value upload_max_filesize 2M
php_value max_execution_time 30
php_value max_input_time 60
php_value memory_limit 32M
php_value post_max_size 8M
php_flag register_globals off
php_flag display_errors off
php_flag file_uploads on
php_flag log_errors off
php_flag output_buffering off
php_flag register_argc_argv on
php_flag magic_quotes_gpc off
php_flag magic_quotes_runtime off
php_flag magic_quotes_sybase off
php_flag mysql.allow_persistent off
php_flag register_long_arrays on
php_flag allow_url_fopen on
php_flag cgi.force_redirect on
php_flag enable_dl on
</Ifmodule>

###End Kloxo PHP config Area



Thank you! :)

 

g1smd
msg:4503604, 6:19 pm on Oct 3, 2012 (gmt 0)

RewriteRules which block access should be listed before RewriteRules that redirect. No point redirecting a request only to then block it.

Escape literal periods in RegEx patterns.

Every RewriteRule needs the [L] flag.
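
A rough sketch of that ordering, with both fixes applied (illustrative patterns only, using example.com as a placeholder):

# block first, so a blocked request is never redirected
RewriteCond %{HTTP_USER_AGENT} ^BadBot [NC]
RewriteRule .* - [F,L]

# then redirect, with the literal period escaped and the [L] flag present
RewriteCond %{HTTP_HOST} ^example\.com$ [NC]
RewriteRule ^(.*)$ http://www.example.com/$1 [R=301,L]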

iwillgetu
msg:4503617, 6:54 pm on Oct 3, 2012 (gmt 0)

So I should put all of the deny-IP and bot rules right after the line where I define the error pages?

Escape literal periods in RegEx patterns.

I don't understand exactly what you mean.

So I add [L] at the end of every rewrite rule? Will that be OK?

Thanks

wilderness
msg:4503673, 8:58 pm on Oct 3, 2012 (gmt 0)

deny from 208.115.96.
deny from 208.115.97.
deny from 208.115.98.
<snip>
deny from 208.115.125.
deny from 208.115.126.


You just don't get it!
Replace ALL of the above lines with a single line:

deny from 208.115.96.0/19

There are many other corrections that your file requires; however, they are far too many to go through here.

Even the User-Agent strings require correction and merging; however, that job (the UAs) is better handled with mod_rewrite.
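
For example, a merged rule along these lines (a sketch, reusing a few agent names from your own list) replaces a long run of [OR] conditions:

# one alternation instead of many single-agent conditions
RewriteCond %{HTTP_USER_AGENT} (EmailSiphon|EmailWolf|Zeus) [NC]
RewriteRule .* - [F,L]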

iwillgetu
msg:4503683, 9:11 pm on Oct 3, 2012 (gmt 0)

Yes, I am making my new htaccess file; I corrected all of this with 208.115.96.0/19.

:) What I posted is my old file.

I want to make a new, clean one, and that's why I'm asking for help.

Why do you say too many? I thought it was OK? Should I be worried?

wilderness
msg:4503718, 10:32 pm on Oct 3, 2012 (gmt 0)

Why do you say too many? I thought it was OK? Should I be worried?


When any user submits their entire file, they are looking for a copy-and-paste solution, as opposed to learning the process of creating their own solutions, both now and in the future.

Were I to submit my own file for corrections, it would fill this entire page and another, then another. It's not fair to expect somebody else to do all your work.

lucy24
msg:4503730, 10:54 pm on Oct 3, 2012 (gmt 0)

Overlapping with wilderness, but we're all used to that.


Why do you have three separate Options lines?

First step should be to organize the file conceptually and by module. For example: <Files> and <FilesMatch> are core-level settings, so you should put them before all the Allow/Deny lines-- and these, in turn, should be grouped together.

Everything concerning mod_rewrite should be grouped together, and then organized as recommended in several thousand earlier threads in-- Oops, what's this thread even doing here? I thought we were in Apache.

Put the small miscellaneous stuff like DirectoryIndex and ErrorDocument at the beginning to get it out of the way.
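
In outline form, something like this (a suggested skeleton, not a drop-in file):

# 1. small miscellaneous settings
Options -Indexes +FollowSymLinks
DirectoryIndex index.php
ErrorDocument 404 /404.html

# 2. core-level <Files> and <FilesMatch> containers

# 3. Allow/Deny lines, grouped together

# 4. mod_rewrite: engine on, then blocking rules, then redirects
RewriteEngine on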


When you ask people to look at something, whether it's CSS* or htaccess or what-have-you, trim out the duplicates. As in:

Deny from 1.2.3.4
Deny from 2.3.4.5
<snip>
Deny from 3.4.5.6

Nobody is going to pore over every single line to ask "Why are you blocking them in particular?" We're looking at the overall structure.


* When someone dumps an entire 300-line stylesheet into a post, it's a sure sign that they haven't done any preliminary screening of their own. Honestly, I don't think the color of the footer text will turn out to have anything to do with the positioning of the top navigation bar.

iwillgetu
msg:4503741, 11:25 pm on Oct 3, 2012 (gmt 0)

OK, never mind, it's good as it is.

Leosghost
msg:4503742, 11:30 pm on Oct 3, 2012 (gmt 0)

* When someone dumps an entire 300-line stylesheet into a post, it's a sure sign that they haven't done any preliminary screening of their own. Honestly, I don't think the color of the footer text will turn out to have anything to do with the positioning of the top navigation bar.

::snrk:: encore..

g1smd
msg:4503767, 12:15 am on Oct 4, 2012 (gmt 0)

OK, never mind, it's good as it is.

Unfortunately, it isn't. There are many things that can be improved.

The problem is this: the file will be specific to your site, so it's not possible for someone else to write it for you; they won't know exactly what you want. Additionally, you need to know exactly what every line does, so that you can maintain the code over the months and years the site is active. You can't do that if someone here writes a load of code you don't understand, and that person isn't around when you come back to ask about it.

Take the list of bots that you block as one example. Do all of those bots really visit your site, or is that a list you found on the web somewhere? Have you looked in your site logs to see what other bots are scraping your content, trying to hack your site, or are just using loads of your bandwidth?

The answers you are going to get here are mostly going to be general in nature: code syntax, rule ordering, and so on.

incrediBILL
msg:4503799, 2:41 am on Oct 4, 2012 (gmt 0)

Not to mention the fact that you have several good bots flagged as bad_bot; those should be addressed in robots.txt, not in .htaccess.

You're just needlessly adding more overhead to every page access, and to every other file for that matter.

iwillgetu
msg:4504017, 1:48 pm on Oct 4, 2012 (gmt 0)

Well, incrediBILL, can you tell me which ones they are?

Thank you!

wilderness
msg:4504050, 2:57 pm on Oct 4, 2012 (gmt 0)

Well, incrediBILL, can you tell me which ones they are?


This search link [webmasterworld.com] is at the top of every page at WebmasterWorld.

iwillgetu
msg:4504062, 3:24 pm on Oct 4, 2012 (gmt 0)

Wow, so informative and supportive. What is the forum for, then?

I can use Google for search too, and I'll always get the same information and copy-paste. No real conversation, no working together to clear all the points up, which is what a forum is for.

Well, never mind. Forget it all.

wilderness
msg:4504079, 3:51 pm on Oct 4, 2012 (gmt 0)

Wow, so informative and supportive. What is the forum for, then?


Your idea of the forum is precisely copy-and-paste solutions.

You never bothered to read either the Forum Charter [webmasterworld.com] or the Forum Library [webmasterworld.com], else you would have found solutions to nearly everything you've repeatedly asked about. Multiple longtime forum participants have already given you answers similar to mine (even the Forum Moderator).

g1smd
msg:4504124, 5:28 pm on Oct 4, 2012 (gmt 0)

so informative

Very. With three million posts in a little over 12 years, every question that can be asked has already been asked and answered multiple times.

Some questions have been asked and answered several thousand times; and that's no exaggeration. It's your job to search this huge body of work and extract the most you can from it, rather than asking us to copy and paste a previous answer into this thread.

incrediBILL
msg:4504192, 8:17 pm on Oct 4, 2012 (gmt 0)

Now, now, we were all new once; play nice and be helpful.

However, while we don't mind being helpful, we do kind of draw the line at doing someone's homework for them. A little RTM (Read The Manual) wouldn't hurt here because, as the others have politely pointed out, the information exists here in volumes, posted over and over.

The problem with bot blocking and denying IP ranges is that if you don't understand the technology, you can do more harm than good: you can inadvertently lose beneficial bots, or block major ISPs and their traffic with a single line.

This is not a technology where just doing what others do will work. A quick for-instance is wilderness, who blocks everyone except the US; cutting and pasting his .htaccess file could be devastating if you rely on international visitors. By not understanding the technology, you might not notice for a few days that none of your non-US visitors can access your site. If you're not US-based, say a UK-only or Aussie-only site, it would be completely devastating.

That's why you need to read a little and understand the basics. One such example is how to block an IP range using CIDR notation, as suggested above, instead of listing multiple sequential IP numbers.
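
For example, the two long runs in the posted file each collapse to a single CIDR line (assuming the whole contiguous block is really what you want to deny):

# covers 208.115.96.0 through 208.115.127.255
deny from 208.115.96.0/19
# covers 216.176.176.0 through 216.176.191.255
deny from 216.176.176.0/20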

Well, incrediBILL, can you tell me which ones they are?

Yes, I can.

For starters, you can assume that any real search engine with real credibility will honor robots.txt, such as Yandex or Baidu. Those are best blocked in your robots.txt file instead, as they will go away when denied there, keeping your .htaccess file shorter and faster. It's easy to find out who honors robots.txt by searching for each spider name, as we typically report in this forum whether a given bot honors robots.txt or not.
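
As a sketch, the robots.txt equivalent for the two engines named above would be something like this (check each engine's documentation for its exact user-agent token):

User-agent: Yandex
Disallow: /

User-agent: Baiduspider
Disallow: /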

I don't know them all off the top of my head, so someone will have to look them up.

Wow, so informative and supportive. What is the forum for, then?


We don't give away the fish; we teach people how to fish and share the secrets of the bait. Some of the best minds on this topic are in this very forum, willing to help those who are willing to help themselves. You asked for help, and they gave suggestions and actionable information. Once you've attempted to implement it yourself (there are tons of examples in this forum), the members will gladly help guide you through fixing any problems you encounter.

Alternatively, I'd suggest managed hosting.

[edited by: incrediBILL at 8:49 pm (utc) on Oct 4, 2012]

wilderness
msg:4504200, 8:32 pm on Oct 4, 2012 (gmt 0)

wilderness, who blocks everyone except the US


FWIW, and Canada.
