
Apache Web Server Forum

    
Coordinating robots.txt and .htaccess
Should the lists of banned robots be the same in both?
Busynut




msg:1505094
 4:30 pm on Oct 30, 2002 (gmt 0)

I've been following the recent .htaccess thread with great interest:
[webmasterworld.com ]

...as well as several related threads concerning robots, spam-proofing, etc. in several forums here at WebmasterWorld. I've been using robots.txt files and .htaccess on my sites, but it appears I haven't used them as effectively as I could. I'm confused about whether the two files need to be coordinated - in terms of having the same list of "bad robots."

At present, my robots files basically just list the directories that are off-limits plus disallow a few specific image search engines.

So, should my robots.txt file include the same list of banned bots that I include in my mod_rewrite conditions within the .htaccess file? Or perhaps that doesn't make sense because the .htaccess file is being used to handle bots which are ignoring the robots file in the first place?

 

GaryK




msg:1505095
 4:54 pm on Oct 30, 2002 (gmt 0)

This is strictly my opinion, but I think a given robot belongs in one file or the other, not both. If a user agent has shown a lack of respect for robots.txt then it should be elevated to the next level of defense, which for you is .htaccess. At that point there is no useful purpose served by having it remain in robots.txt.

andreasfriedrich




msg:1505096
 5:00 pm on Oct 30, 2002 (gmt 0)

I like that opinion. It is a sound one. :)

jdMorgan




msg:1505097
 5:07 pm on Oct 30, 2002 (gmt 0)

BusyNut,

I agree with GaryK - robots.txt is used to control the behaviour of "good robots" that will read and respect the Disallow statements.

.htaccess - or a similar mechanism on servers other than Apache - is used to block access by robots which do not respect robots.txt, as well as other user-agents which you wish to exclude, such as site downloaders and e-mail address harvesters.
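For illustration, a minimal pairing might look something like this (the directory and user-agent names are only placeholders here, not a recommended block list):

# robots.txt - polite robots read this and obey the Disallow lines
User-agent: *
Disallow: /private/

# .htaccess - mod_rewrite catches the agents that ignore robots.txt
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} EmailSiphon [NC,OR]
RewriteCond %{HTTP_USER_AGENT} WebCopier [NC]
RewriteRule .* - [F,L]

The two lists don't need to match; the .htaccess conditions only need to name the agents that have already shown they ignore robots.txt.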

Sometimes it is useful to include a suspicious user-agent in robots.txt as a test. In some cases, malicious 'bots will read robots.txt in order to appear innocuous, but then ignore what they've read. If you observe this behaviour, then it is likely that you are seeing a malicious user-agent.

In a very few cases, even good 'bots will make a mistake due to a coding bug, and in this case you should report the problem to the owner of the 'bot.

GaryK's use of the concept of "promoting" a user agent from robots.txt exclusion to a .htaccess block is a good way of thinking about it.

Jim

Busynut




msg:1505098
 7:28 pm on Oct 30, 2002 (gmt 0)

Many thanks for your responses. What you've advised now makes a lot of sense to me.

Now, if only I could solve my log analysis program obstacles so I don't have to pick through the logs line by line manually. :)

Busynut




msg:1505099
 1:55 am on Nov 3, 2002 (gmt 0)

Uh Oh. I did something wrong.

The final line of my htaccess file reads as follows:

RewriteRule ^.*$ /403bots.htm [F,L]

I *thought* I was setting up a special rule just for the bad bots - a special 403 page, if you will. However, when testing it from www.wannabrowser.com (thank you!) I got the following result:

You don't have permission to access /
on this server.
Additionally, a 403 Forbidden
error was encountered while trying to use an ErrorDocument to handle the request.

Can someone educate me as to what I've done wrong?
Many thanks!

jdMorgan




msg:1505100
 2:26 am on Nov 3, 2002 (gmt 0)

Busynut,

Without seeing your RewriteConds which precede this RewriteRule, it's hard to tell what's wrong. But you must allow everyone - good and bad - to fetch 403bots.htm, or they will get a second 403 when the server tries to serve 403bots.htm as the error page.

The easiest work-around (if you haven't already handled this case in a preceding RewriteCond) would be:
RewriteRule !^403bots.htm$ - [F,L]
In other words, if the requested file is not /403bots.htm, then return a status code of 403-Forbidden. This will automatically invoke your custom ErrorDocument if you have added
ErrorDocument 403 /403bots.htm
to your .htaccess file anywhere above this RewriteRule. Therefore, there is no need to supply a redirect pathname, and you just use "-".
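Putting those pieces together, a minimal sketch of the relevant part of the .htaccess file might look like this (the user-agent patterns are only placeholders for your own bad-bot list):

ErrorDocument 403 /403bots.htm
RewriteEngine On
# Conditions naming the bad bots - substitute your own list here
RewriteCond %{HTTP_USER_AGENT} ^BadBot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^EvilScraper [NC]
# Forbid everything except the custom 403 page itself
RewriteRule !^403bots\.htm$ - [F,L]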

When you used wannabrowser, what URL did you request? It sounds like you just entered your domain name.

HTH,
Jim

[edited by: jatar_k at 4:05 am (utc) on Nov. 3, 2002]
[edit] jdm - reinsert space preceding "!^403bots..." deleted by above edit (code tag is not foolproof)[/edit]

Busynut




msg:1505101
 5:11 am on Nov 3, 2002 (gmt 0)

Thank you for explaining. Actually, I thought I would be able to have a "regular" 403 error document, and then a special one just for the 'bad' bots I've designated in my htaccess rewrite conditions. In my htaccess file above the bad bot list I've designated ErrorDocument 403 /403.htm. I thought by specifying the new 403bots.htm file that only those on the list would end up there. I guess having just the one 403 file will be okay... I had set the second one up with an additional explanation just in case I caught some 'innocent' surfers.

When I tested with wannabrowser I did just insert my domain - I wanted to find out what would happen if I pretended to be one of my bad bots. I've since tried testing several specific pages within various directories of my site and I've been able to get several of the pages without invoking a 403 error. Here's another question: if I use another htaccess file in a lower directory (for blocking image hotlinking primarily)... does it completely cancel out the directives in the one in the higher directory or does it merely 'add' to the information from the first? I apologize for asking such basic questions - honestly, I did do homework on this, but it's apparent I've got a great deal more to learn.

Thank you for your help!

jdMorgan




msg:1505102
 5:14 pm on Nov 3, 2002 (gmt 0)

Busynut,

...if I use another htaccess file in a lower directory (for blocking image hotlinking primarily)... does it completely cancel out the directives in the one in the higher directory...?

No, .htaccess files are applied in order, from the top of your directory hierarchy on down. If this issue is worrisome, simply do your image hot-link block in your top-level directory - it's more efficient doing it there anyway.

403 issue:

One approach you might consider is to serve up a very small generic 403 ErrorDocument regardless of whether the User-agent is known-bad or just possibly-bad. On this page, put a link and a meta-refresh redirect to an "explanatory" page for innocent visitors who get caught in your 403 trap. Generally, bad-bots will not follow the link or the meta-refresh redirect, and the small initial 403 page will save you bandwidth on the bad-bots that are too stupid to quit trying to get in.
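As a rough sketch, such a minimal 403 page might look like this (403info.htm is just a placeholder name for the explanatory page):

<html>
<head>
<meta http-equiv="refresh" content="5;url=/403info.htm">
<title>403 Forbidden</title>
</head>
<body>
<p>Access to this page is restricted. If you are a regular visitor and think you
were blocked by mistake, please continue to <a href="/403info.htm">this
explanation</a>.</p>
</body>
</html>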

With this approach, I believe you can accomplish what you want to do using just:

ErrorDocument 403 /403.htm
RewriteCond %{HTTP_USER_AGENT} <list of bad bots>
RewriteRule !^(403.*\.htm|robots\.txt)$ - [F,L]

Note that robots.txt and any document which starts with "403" and ends with ".htm" can now be served to any User-agent, so name your 403-explanatory-page-for-innocent-victims 403info.htm, or something like that. This is how I've done it, and it works well.

You could also do this 403 stuff in two steps to more closely approximate what you originally intended: Use the ErrorDocument 403 /403.htm to start. Then add another layer of mod-rewrite internal redirection below that to discriminate between known-bad and possibly-bad user-Agents. In other words, internally rewrite 403.htm to a different URL depending on the user-Agent. The above approach is simpler, easier on your server in case the 'bot won't give up, and avoids any kind of User-agent cloaking.
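A rough sketch of that two-step variant, if you did want to keep two pages (the <...> parts stand for your own RewriteCond lists, as above):

ErrorDocument 403 /403.htm
# Step 1: forbid everything except the 403 pages and robots.txt
RewriteCond %{HTTP_USER_AGENT} <known-bad and possibly-bad bots>
RewriteRule !^(403.*\.htm|robots\.txt)$ - [F,L]
# Step 2: for known-bad agents only, internally rewrite the generic error page
# to the harsher one when it is fetched
RewriteCond %{HTTP_USER_AGENT} <known-bad bots only>
RewriteRule ^403\.htm$ /403bots.htm [L]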

HTH,
Jim

Busynut




msg:1505103
 5:59 pm on Nov 3, 2002 (gmt 0)

Many thanks again. I'd been doing a search on this topic when you replied and have a good deal of reading to do (you've been such a great help to so many on this topic! bet you probably get tired of answering the same questions again and again!). I'll be testing your suggestions - and I do believe some of this is beginning to sink in. :)

Also, the reason I don't want to put the image hot link directive in the top directory is because I'm reserving a certain folder there so I can hotlink to my own images when signing guestbooks, etc.

jdMorgan




msg:1505104
 7:10 pm on Nov 3, 2002 (gmt 0)

Busynut,

No problem - Answering these q's helps me "earn my keep" around here - I read a lot more than I write!

I understand about blocking image hot-linking only in certain subdirectories, and having a subdirectory-level .htaccess is one way to do it. The other way is to put the subdirectory name in the RewriteRule in your site's root directory. Six of one, half dozen of the other - both work.

Have you "allowed" the Google cache, and the others to hot-link? It's another consideration for you. Here's a snippet that demonstrates both the subdirectory-only-image-blocking and also allowing well-known caches and language translators to hot-link while controlling the rest:


# Block image inclusion outside our domain except Google, AltaVista, Gigablast,
# Comet Systems, and SearchHippo translators and caches
RewriteCond %{HTTP_REFERER} !^$
RewriteCond %{HTTP_REFERER} !^http://www\.mydomain\.org
RewriteCond %{HTTP_REFERER} !^http://216\.239\.(3[2-9]|[45][0-9]|6[0-3])\..*www\.mydomain\.org
RewriteCond %{HTTP_REFERER} !^http://babel.altavista.com/.*www\.mydomain\.org
RewriteCond %{HTTP_REFERER} !^http://216\.243\.113\.1/cgi/
RewriteCond %{HTTP_REFERER} !^http://search.*\.cometsystems\.com/search.*www\.mydomain\.org
RewriteCond %{HTTP_REFERER} !^http://.*searchhippo\.com.*www\.mydomain\.org
RewriteRule ^image_dir/.*\.(jpg|jpeg?|gif)$ - [F,L]

Jim

Busynut




msg:1505105
 1:06 am on Nov 4, 2002 (gmt 0)

[sigh]

I need to study regular expressions because every time I think I'm getting it I look at all the characters/symbols and get confused all over again.

[begin childish rant]I do allow Google and everyone else to visit/spider/cache any of my pages in the "top" section of my site. This section basically just contains explanatory info for the rest of the site. However, there are certain directories I don't want anyone to cache, save, etc. I just want them to be viewed by legitimate visitors. See, certain sections of my site have become moderately popular due to a good deal of positive attention from an About.com guide - it's a humorous graphic section (all family friendly, I assure you!). Although I'm enjoying the popularity, I've begun to see my images crop up in all kinds of places... and they didn't get there through legitimate visitors to MY site. I don't want any of the image search engines caching these pages... they're welcome to cache the page that 'explains' what my site contains... but if people are going to use my graphics I'd rather they get them from my site legitimately, if you know what I mean. It seems a losing battle at times, and clearly I'm way behind in implementing effective techniques. [end of rant]

Back to the htaccess. I implemented your suggestion above (msg 9)
ErrorDocument 403 /403.htm
RewriteCond %{HTTP_USER_AGENT} <list of bad bots>
RewriteRule !^(403.*\.htm|robots\.txt)$ - [F,L]

And it seems to work fine EXCEPT while pretending to be bad bots I was still able to view pages in the lower directory (which contained the hotlinking rewrite rule). So... I seem to have fixed that for the time being by just putting the same list of bad bots in both htaccess files - the only difference in the files is the lower one contains the hotlinking rule.

Also, I got rid of the extra 403 page (the explanatory one) and am just using a 403.html page (I included the explanation there). But I'm still getting the error:
Additionally, a 403 Forbidden
error was encountered while trying to use an ErrorDocument to handle the request.

So I'm still doing something wrong and I'm not going to give up on this because I'm very stubborn :) even if it means reading htaccess, mod-rewrite, and regex rules until I'm blurry eyed. I'll study your last suggestion (msg 11), but it's definitely okay with me that no search engine have access to certain directories on my site.

I'm very grateful for your help! (can I please adopt you?)

jdMorgan




msg:1505106
 3:06 am on Nov 4, 2002 (gmt 0)

Busynut,

I also block image hot-linkers, but... I allow those accesses listed above so that people who use those search engine caches and various language-translation resources will see an attractive-looking result. My concern with images being hot-linked is purely one of bandwidth - at one point I had a ton of people using our images on their "profile" pages on Yahoo, various sports portals, etc. I just got tired of the bandwidth leak and of chasing down all the extraneous and irrelevant referrals in my logs. All I can suggest is to try these services out, and see if they merit your re-appraisal.

For images which have some intrinsic value, other methods such as watermarking should be employed.

Some good news about regular expressions and mod_rewrite rules - they are far easier to write than to read (it's true). :)

Additionally, a 403 Forbidden error was encountered while trying to use an ErrorDocument to handle the request.

This means that one or more of your custom ErrorDocuments is still being blocked in .htaccess or by the server configuration.

It sounds like something is fishy with the server config... The code I posted is exactly what I use on my sites, and if I put an access block on any file using my top-level-directory .htaccess file, then that file gets blocked, no matter what directory it is in. I don't know why it's not working for you. The only possible work-around I know of (so that you won't have to include the bad-bots list in each of the subdirectory .htaccess files) is to add the directive "RewriteOptions inherit" to your .htaccess files in subdirectories only.
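A sketch of what such a subdirectory .htaccess might look like, assuming the bad-bot conditions live in the root-level file:

# subdirectory .htaccess - local rules run first, then the parent's are inherited
RewriteEngine On
RewriteOptions inherit
# subdirectory-specific rules (e.g. the image hot-link block) go below here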

As far as adoption goes, sure - but you'd be better off with a puppy rather than a salty old dog like me! ;)

Jim
