
Sitemaps, Meta Data, and robots.txt Forum

robots.txt and mod_rewrite, meta robots, etc.

10+ Year Member

Msg#: 22 posted 9:31 pm on Nov 26, 2002 (gmt 0)

I'm working on a robots.txt for a new site and have a few questions. I use mod_rewrite to make the URLs look nicer. How, if at all, will that affect the robots.txt? Here's an example:

I have urls of the type:


["to" being text only and "lg" low graphics]

that I would like to keep robots out of because they are printer-friendly versions. section1/2, to, lg, and article1/2 are not real directories; they are the search-engine-friendly names produced by mod_rewrite.

How would most bots interpret something like:

Disallow: /section1/to/
Disallow: /section1/lg/
Disallow: /section2/to/
Disallow: /section2/lg/

Would they also not spider anything with a URL containing "section1"? The real articles, which I want spidered, are named like:


Maybe it would be best to disallow just the fake /to/ and /lg/ directories, but how do I write that? Would either of these work:

Disallow: /../to/
Disallow: */to/


On the actual files I also have:

<meta name="robots" content="noindex, nofollow">

Is this sufficient, or should I also exclude them in robots.txt and keep the meta tag for robots that might ignore robots.txt?



WebmasterWorld Senior Member 10+ Year Member

Msg#: 22 posted 10:20 am on Nov 27, 2002 (gmt 0)

<meta name="robots" content="noindex, nofollow"> on each page should be fine.

As for mod_rewrite, I'm not too sure; I'm an NT guy. But I'm pretty sure some of the Unix/Linux guys can help with this.



jdmorgan, WebmasterWorld Senior Member, WebmasterWorld Top Contributor of All Time, 10+ Year Member

Msg#: 22 posted 6:13 pm on Nov 27, 2002 (gmt 0)


Disallow: /../to/
Disallow: */to/

No, robots.txt is interpreted using prefix-matching. The match starts at the left, proceeds to the right, and anything you leave off at the right end means "don't care". Wild-cards are not recognized, per se.

As a result,

Disallow: /section1/to/

means, "don't spider anything that starts with /section1/to/" - the contents of subdirectory /section1/to/.
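To make the prefix-matching rule concrete, here is a minimal sketch using Python's standard urllib.robotparser, which applies the same left-to-right prefix match. The paths /section1/to/article1 and /section1/article1 and the agent name ExampleBot are assumptions for illustration only, since the real URLs are not shown in this thread. The last two lines use a plain string prefix test to show why the two wildcard attempts would not match anything under this rule.

```python
from urllib.robotparser import RobotFileParser

# The rule from the thread, wrapped in a catch-all User-agent record.
robots_txt = """\
User-agent: *
Disallow: /section1/to/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# Hypothetical paths; the thread's actual URLs are not shown.
printer_friendly = "/section1/to/article1"
regular_article = "/section1/article1"

# "ExampleBot" is a made-up agent name; it falls under "User-agent: *".
blocked = not parser.can_fetch("ExampleBot", printer_friendly)
allowed = parser.can_fetch("ExampleBot", regular_article)
print(blocked, allowed)

# Under plain prefix matching, the wildcard attempts are literal strings,
# and no request path starts with either of them.
wildcard_hit = printer_friendly.startswith("*/to/")
dotdot_hit = printer_friendly.startswith("/../to/")
print(wildcard_hit, dotdot_hit)
```

Only the path that begins with /section1/to/ is refused; sharing the "section1" part alone is not enough to trigger the rule.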

Would they also not spider anything with a URL containing "section1"?

No, the path would have to start with and contain everything you included in the Disallow.

You can get only a "sort of" wild-card effect, and only if the URLs you wish to disallow all start with the same prefix; e.g., Disallow: /private/ will keep the spiders out of subdirectory /private/. If you cannot arrange your directory structure to use this approach, then you'll have to Disallow them individually.
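If the list of individual Disallow lines gets long, it can be generated rather than typed by hand. This is only a sketch: the helper name build_disallow_lines is made up for illustration, and the section and variant names are the ones from the original question.

```python
def build_disallow_lines(sections, variants):
    """Return one Disallow line per section/variant pair."""
    return [
        f"Disallow: /{section}/{variant}/"
        for section in sections
        for variant in variants
    ]


# Names taken from the question: two sections, two printer-friendly variants.
lines = build_disallow_lines(["section1", "section2"], ["to", "lg"])

# Prepend a catch-all record header so the output is a usable robots.txt body.
robots_txt = "\n".join(["User-agent: *", *lines])
print(robots_txt)
```

Each generated line is still an ordinary prefix rule, so nothing here relies on wild-card support in the robot.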

It does not matter that your search-engine-friendly URLs do not actually exist as files. If robots are allowed to request those URLs, they will do so when they find a link, and those requests will then be subject to your RewriteRules. So the robots will "land" on the page you expect them to, just like a human visitor. Clear?

Almost all of the search engine spiders now recognize the <meta robots> tag, so as DaveN states, that may be the easiest solution to your problem. The downside is that the pages will be fetched, even though they will not be listed in the search engine index. That just means some wasted server bandwidth.
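The bandwidth point follows from where the tag lives: a robot can only see it after downloading the page. As a rough sketch of what a robot has to do, the following uses Python's standard html.parser to pull the directives out of an already-fetched page. The class name RobotsMetaParser and the sample page are assumptions for illustration, not any search engine's actual code.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() != "robots":
            return
        content = attr.get("content") or ""
        # Directives are comma-separated; normalise case and whitespace.
        for part in content.split(","):
            part = part.strip().lower()
            if part:
                self.directives.add(part)


# Sample page body; by this point the page has already been fetched.
page = (
    "<html><head>"
    '<meta name="robots" content="noindex, nofollow">'
    "</head><body>printer-friendly text</body></html>"
)

meta = RobotsMetaParser()
meta.feed(page)
skip_indexing = "noindex" in meta.directives
print(sorted(meta.directives), skip_indexing)
```

A robots.txt rule, by contrast, is checked before the request is made, which is why it saves the fetch.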



10+ Year Member

Msg#: 22 posted 4:17 am on Nov 30, 2002 (gmt 0)

Thanks! I'll go ahead and use the robots.txt with
Disallow: /section1/to/

It is kind of tempting to use only <meta robots>, though. That is a good point about the bandwidth; I hadn't thought of that before.

