Home / Forums Index / Code, Content, and Presentation / Apache Web Server

Apache Web Server Forum

    
Rewrite robots.txt to php script
Problem - target script name being revealed
phred
msg:3696910
5:33 am on Jul 12, 2008 (gmt 0)

Ok, I think I must have an .htaccess rewrite problem or something similar. I posted over in Spider ID and no one seems to be having the same problem I am with msnbot.

Basically, I have been serving GET requests for my robots.txt file from a PHP script for a while now. Here are the .htaccess entries -- all the names have been changed to protect, well, you know...

Options +FollowSymLinks
RewriteEngine on
RewriteRule robots.txt$ robots.zzzz.php [L,NC]
RewriteRule \.htaccess - [F]
RewriteRule \.(jpe?g)$ - [L]
RewriteCond %{HTTP_HOST} !^www\.widgets\.org$ [NC]
RewriteRule .? http://www.widgets.org%{REQUEST_URI} [R=301,L]
RewriteCond %{REQUEST_URI} !(/$|\.)
RewriteRule (.+) http://www.widgets.org/$1/ [R=301,L]
RewriteRule ^/?(Contact|About|Products|History)/$ ?zq=$1&zr= [L,QSA]
RewriteRule ^/?(Contact|About|Products|History)/([0-9a-f]+)/$ ?zq=$1&zr=$2 [L,QSA]

Now here's what happened. Msnbot did a direct GET for
>>---> robots.zzzz.php <--<<
followed a second later by a GET of robots.txt. Now the zzzz (not the real characters) isn't crypto-strong or anything, but it would be pretty darn hard to guess. What's more, I don't see pages of not-founds in the logs.

Please help! What's wrong with the .htaccess entries? How could I be revealing that robots.txt is being rewritten to robots.zzzz.php?

Thanks very much for the help!

Phred

[edited by: jdMorgan at 2:21 pm (utc) on July 12, 2008]

 

jdMorgan
msg:3697099
2:08 pm on Jul 12, 2008 (gmt 0)

> How could I be revealing that robots.txt is being rewritten to robots.zzz.php?

Your rules are in the wrong order, and any request for "widgets.org/robots.txt" (no "www.") will result in a 301 redirect to "www.widgets.org/robots.zzz.php" -- Try it yourself using a server headers checker, such as the "Live HTTP Headers" add-on for Mozilla/Firefox browsers. Since search engines often 'try' the no-www domain, this is the most likely mechanism for the exposure of your php file.

Put your rules in order, external redirects first, followed by internal rewrites. Within these two groups, put the rules in order from most-specific to least-specific. To clarify that a little, the most-specific kind of rule will accept only a single URL-path as a match; it affects only one "page." Then you may have rules that affect groups of pages, based on part of the URL-path. Finally, you may have rules that affect the entire domain.
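A minimal sketch of that ordering against the rules posted above (the comments and arrangement are mine, not a drop-in config -- adapt to your own site):

```apache
Options +FollowSymLinks
RewriteEngine on

# --- External redirects first, most-specific before least-specific ---

# Add a trailing slash to extensionless URL-paths
RewriteCond %{REQUEST_URI} !(/$|\.)
RewriteRule (.+) http://www.widgets.org/$1/ [R=301,L]

# Canonicalize the hostname; this affects the whole domain,
# so it goes last among the external redirects
RewriteCond %{HTTP_HOST} !^www\.widgets\.org$
RewriteRule .? http://www.widgets.org%{REQUEST_URI} [R=301,L]

# --- Internal rewrites only after all external redirects ---
RewriteRule ^robots\.txt$ robots.zzzz.php [L]
RewriteRule ^(Contact|About|Products|History)/$ ?zq=$1&zr= [L,QSA]
```

With this ordering, no request can be internally rewritten first and then externally redirected afterward, so the rewrite target never leaks into a Location: header.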

I also notice that your "robots.txt$" pattern is missing an escape on the literal period -- it should be "^robots\.txt$".

Also, all but the first two rules will be bypassed (because of the third rule) if the request is for a .jpg or .jpeg image -- be sure that's what you want, and take care to check for interaction with the "no trailing slash, no period in path" RewriteCond on the fifth rule (which would also exclude .jpe?g files).

The bottom line is that rule order is important, and you must avoid having external redirects 'expose' previously-executed internal rewrites. This gets especially tricky if you use multiple .htaccess files at different subdirectory levels...

To 'repair' this situation with MSNbot, you can use another external redirect. The rule is a bit tricky, in order to avoid creating an 'infinite' redirect/rewrite loop:

RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /robots\.[^.]+\.php
RewriteRule ^robots\.[^.]+\.php$ http://www.widgets.org/robots.txt [R=301,L]

The RewriteCond requires that "robots.zzzz.php" be sent in the original request from the client, and not be present only because of a previously-executed internal rewrite. This prevents looping.
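To see why THE_REQUEST breaks the loop, here is a small illustration of the condition's pattern in Python's regex syntax (the request lines are examples of what Apache puts in %{THE_REQUEST}; the variable names are mine):

```python
import re

# Apache's %{THE_REQUEST} holds the raw request line exactly as the
# client sent it, e.g. "GET /robots.zzzz.php HTTP/1.1". An internal
# rewrite changes the URL-path that later rules see, but it never
# changes THE_REQUEST -- which is what prevents the loop.
the_request_pattern = re.compile(r'^[A-Z]{3,9}\ /robots\.[^.]+\.php')

# Client asked for the PHP file directly: condition matches, redirect fires.
assert the_request_pattern.match('GET /robots.zzzz.php HTTP/1.1')

# Client asked for robots.txt; the rewrite to the PHP file happened only
# internally, so THE_REQUEST still shows robots.txt: no match, no loop.
assert not the_request_pattern.match('GET /robots.txt HTTP/1.1')
```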

I also used a generic sub-pattern for 'zzzz' so that client requests for robots.<anything>.php will be redirected back to robots.txt, just in case you've used several values for 'zzzz' in the past, and they have also been exposed.

Jim

phred
msg:3697394
7:00 am on Jul 13, 2008 (gmt 0)

>> How could I be revealing that robots.txt is being rewritten to robots.zzz.php?
>Your rules are in the wrong order, and any request for "widgets.org/robots.txt"
> (no "www.") will result in a 301 redirect to "www.widgets.org/robots.zzz.php"
>'try' the no-www domain, this is the most likely mechanism for the exposure of your php file.

Well there ya go! Exactly the cause of the "exposure" I'm sure. Jim, you are a champion! Thank you!

>Try it yourself using a server headers checker, such as the "Live HTTP Headers"

I use Live HTTP Headers all the time; my problem was I didn't know what request to feed it to show the problem. I was off in fairyland trying all sorts of convoluted / // \ \\ (.)* file names and wildcards and couldn't get the .php revealed. Omitting the www I hadn't tried in this case; I had tried it when I first got that part of the rewrites working, but not after adding the robots.txt handling...

>Put your rules in order, external redirects first, followed by internal rewrites.
>Within these two groups, put the rules in order from most-specific to least-specific.

This now seems to work.

Options +FollowSymLinks
RewriteEngine on
#
RewriteCond %{HTTP_HOST} !^www\.widgets\.org$ [NC]
RewriteRule .? http://www.widgets.org%{REQUEST_URI} [R=301,L]
#
RewriteCond %{REQUEST_URI} !(/$|\.)
RewriteRule (.+) http://www.widgets.org/$1/ [R=301,L]
#
RewriteCond %{HTTP_REFERER} !^http://www.widgets.org/ [NC]
RewriteRule [^/]+.(jpg|JPG|jpeg|JPEG)$ Rx.gif [L]
#
RewriteRule ^robots\.txt$ robots.zzzz.php [L,NC]
RewriteRule \.htaccess - [F]
#
RewriteRule ^/?(Contact|About|Products|History)/$ ?zq=$1&zr= [L,QSA]
RewriteRule ^/?(Contact|About|Products|History)/([0-9a-f]+)/$ ?zq=$1&zr=$2 [L,QSA]

Fixed the non-escaped .txt and re-ordered. Does it matter where the \.htaccess - [F] rule is?

>Also, all but the first two rules will be by-passed (because the third rule)
>if the request is for a .jpg or .jpeg image -- be sure that's what you want,

It's not, of course. That was part of an attempt at an anti-leech rule; where the rest of the rule went, God only knows. I've added what seems to work as an anti-leech rule where I think it should go.

>The bottom line is that rule order is important, and you must avoid having external redirects
>'expose' previously-executed internal rewrites.

You have made this much clearer. I have a better understanding and appreciation because of the problem.

>Jim

Thank you very much Jim. As I said, you are a champion!

Phred

jdMorgan
msg:3697473
1:09 pm on Jul 13, 2008 (gmt 0)

Most-specific first! -- Your first two rules are ordered incorrectly, and any request for an extensionless URL will now result in two back-to-back redirects -- the first to correct the domain, the next to add a slash. Not desirable, so swap them.

Also, let's clean this up:

RewriteCond %{HTTP_REFERER} !^http://www\.widgets\.org
RewriteRule \.jpe?g$ Rx.gif [NC,L]

Same function, much more efficient.
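The [NC] flag is what lets one short pattern replace the four-way alternation. A quick check of the equivalence, using Python's regex engine as a stand-in (re.IGNORECASE plays the role of [NC]; the file names are made up):

```python
import re

# One pattern with case-insensitive matching covers all four spellings
# from the original (jpg|JPG|jpeg|JPEG) alternation.
pattern = re.compile(r'\.jpe?g$', re.IGNORECASE)

for name in ('photo.jpg', 'photo.JPG', 'photo.jpeg', 'photo.JPEG'):
    assert pattern.search(name)

# Non-image extensions are still excluded.
assert not pattern.search('photo.gif')
```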

Oh, and take that [NC] off your domain redirect rule, too -- you want exactly the correct domain (including correct case); otherwise, redirect.

Where specific rules are so specific as to be mutually-exclusive, the order isn't functionally important. So the "\.htaccess - [F]" rule is fine where it is (there's no danger of confusing \.htaccess with a .jpeg file or with robots.txt). I said "functionally important" because you *might* get better performance by re-ordering mutually-exclusive rules, but that's usually splitting hairs.

Your last two rules don't really need the leading "/?" in the patterns. This prefix is often used in online postings to make the rule "portable" between httpd.conf and .htaccess; Patterns in httpd.conf start with "/", but patterns in .htaccess don't, and you don't need "/?" in there if your code is in .htaccess.
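A minimal illustration of that context difference, using the thread's zq/zr rewrites (the exact targets are illustrative):

```apache
# In httpd.conf (server or virtual-host context) the pattern is matched
# against the full URL-path, which begins with a slash:
RewriteRule ^/(Contact|About|Products|History)/$ /?zq=$1&zr= [L,QSA]

# In a per-directory .htaccess the directory prefix is stripped before
# matching, so the same rule is written without the leading slash:
RewriteRule ^(Contact|About|Products|History)/$ ?zq=$1&zr= [L,QSA]
```

The "^/?" form merely makes one pattern tolerate both contexts; inside .htaccess the "/?" can never match anything, so it only costs a little clarity.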

Not a champion... But I once exposed my robots.cgi file... ;)

Jim

g1smd
msg:3697871
10:59 am on Jul 14, 2008 (gmt 0)

I was going to say that

(jpg|JPG|jpeg|JPEG)

is best done with

(jpe?g) and [NC]

but jd got there first.
