Robots.txt and Mod Rewrite

Which URL does the robot read?

eflouret

11:11 pm on Jun 15, 2005 (gmt 0)

Hello,

I have a site that uses mod_rewrite for all of its URLs.

I have a very large set of URLs that I no longer want bots to visit.

These urls are something like this:

/dir1/blah/blah/blah
/dir2/example/blah/
/dir3/etc/etc/etc

and so on.

In my robots.txt I disallowed robots from visiting

/dir1/
/dir2/
/dir3/

But robots keep visiting all of these URLs, so it seems that they don't treat the rewritten directories as real. Of course, they don't exist on disk. The real URLs look something like this:

index.php?var1=blah&var2=etc&var3=whatever

I checked the robots.txt with a validator and everything was ok.

How should I proceed with URLs rewritten by mod_rewrite?

Thanks in advance,

Enrique

jdMorgan

4:52 am on Jun 18, 2005 (gmt 0)

Robots see the URLs in the links on your pages (and other sites' pages) and put those on a list to fetch. But before fetching each URL, the robot will check the URL against the URL-prefixes listed in your robots.txt file. If the syntax of the robots.txt file is correct, and the robot is complying with the Standard for Robot Exclusion, then any URL which matches the prefixes listed in your robots.txt file will not be fetched.
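As a rough illustration of that prefix check, here is a minimal Python sketch using the standard library's urllib.robotparser. The robots.txt content is an assumption based on the three directories named in the question (the poster's actual file is not shown), and the "ExampleBot" user-agent and the is_allowed helper are hypothetical names used only for this example.

```python
from urllib.robotparser import RobotFileParser

# Assumed robots.txt content, reconstructed from the directories in the question.
ROBOTS_TXT = """\
User-agent: *
Disallow: /dir1/
Disallow: /dir2/
Disallow: /dir3/
"""


def is_allowed(url, user_agent="ExampleBot"):
    """Return True if a compliant robot may fetch `url` under ROBOTS_TXT."""
    parser = RobotFileParser()
    # parse() takes the file's lines; no network request is made here.
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


# The check is made against the URL as it appears in links, not the file behind it.
print(is_allowed("http://www.example.com/dir1/blah/blah/blah"))
print(is_allowed("http://www.example.com/index.php?var1=blah&var2=etc&var3=whatever"))
```

The rewritten URL under /dir1/ is refused because it starts with a disallowed prefix, while the index.php URL is allowed simply because no listed prefix matches it. Real crawlers may implement matching with their own extensions, so treat this only as a sketch of the basic rule.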

Nowhere above have I mentioned filenames. Robots don't see or know about filenames. They only use URLs. So URL-to-filename translations done in mod_rewrite have nothing to do with robots or with robots.txt. The only time mod_rewrite will affect robots is when you use it to do external redirects instead of internal rewrites.

If you are seeing well-known robots such as Googlebot, Slurp, and msnbot fetching your disallowed URLs, then it is likely that you have errors in your robots.txt file or that you have redirected the disallowed URLs instead of just rewriting them.

Use the robots.txt validator [webmasterworld.com] to check your robots.txt file, and make sure that you are using the correct syntax for internal rewrites and not redirects in mod_rewrite.

This is a rewrite (as it might appear in httpd.conf):


RewriteRule ^/dir1/([^/]+)/([^/]+)/([^/]+)/?$ /index.php?var1=$1&var2=$2&var3=$3 [L]

and this is a redirect:

RewriteRule ^/dir1/([^/]+)/([^/]+)/([^/]+)/?$ http://www.example.com/index.php?var1=$1&var2=$2&var3=$3 [R=301,L]

But even this will default to a 302 redirect:

RewriteRule ^/dir1/([^/]+)/([^/]+)/([^/]+)/?$ http://www.example.com/index.php?var1=$1&var2=$2&var3=$3

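because a full URL with scheme and hostname has been specified. (If that hostname is the server's own, mod_rewrite strips it and rewrites internally instead.)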
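To sanity-check which URL paths a pattern like this catches, the regular expression can be tried outside Apache. The Python sketch below assumes the third capture group is meant to be closed, i.e. ([^/]+)/?$, and uses a hypothetical helper named internal_target. It only imitates the pattern match and substitution; it does not reproduce mod_rewrite's flags or redirect handling.

```python
import re

# Pattern in per-server (httpd.conf) form, with all three capture groups closed.
PATTERN = re.compile(r"^/dir1/([^/]+)/([^/]+)/([^/]+)/?$")


def internal_target(url_path):
    """Return the internal path the rule would map to, or None if it does not match."""
    match = PATTERN.match(url_path)
    if match is None:
        return None
    # $1, $2, $3 in the RewriteRule correspond to the three captured groups.
    return "/index.php?var1={}&var2={}&var3={}".format(*match.groups())


print(internal_target("/dir1/blah/blah/blah"))
print(internal_target("/dir1/only/two"))
```

A robot never sees the value this function returns. It only sees the /dir1/... URL it requested, unless the rule is written as an external redirect.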

Jim