I have urls of the type:
[to being the text-only version and lg the low-graphics version]
that I would like to keep robots out of, because they are printer-friendly versions. section1/2, to, lg, and article1/2 are not real directories; they are the search-engine-friendly ones created with mod_rewrite.
How would most bots interpret something like:
Would they also not spider anything with a URL containing "section1"? The real articles, which I do want spidered, are named like:
Maybe it would be best just to specify the /to/ and /lg/ fake directories as disallowed, but how do I write that? Will
On the actual files I also have:
<meta name="robots" content="noindex, nofollow">
Is this sufficient, or should I also exclude them with robots.txt and keep the meta tag for any robots that might ignore robots.txt?
No, robots.txt is interpreted using prefix-matching. The match starts at the left, proceeds to the right, and anything you leave off at the right end means "don't care". Wild-cards are not recognized, per se.
As a result,
Would they also not spider anything with a URL containing "section1"?
No, the path would have to start with and contain everything you included in the Disallow.
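You can see this prefix behavior for yourself with Python's standard-library robots.txt parser. The /to/ path and the example URLs below are just borrowed from the question for illustration:

```python
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt with a single Disallow rule
rules = """\
User-agent: *
Disallow: /to/
""".splitlines()

rp = RobotFileParser()
rp.parse(rules)

# Blocked: the path starts with the disallowed prefix /to/
print(rp.can_fetch("*", "http://example.com/to/article1"))       # False

# Allowed: /to/ appears inside the path but is not a prefix of it
print(rp.can_fetch("*", "http://example.com/section1/to/page"))  # True
```

Note that the second URL is fetchable even though it contains "to", because the match is anchored at the start of the path.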
You can get only "sort of" a wild-card effect if the URLs you wish to disallow all start with the same prefix, e.g. Disallow: /private/ will keep the spiders out of subdirectory /private. If you cannot arrange your directory structure to use this approach, then you'll have to Disallow them individually.
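So if you can arrange the rewrite so that all the printer-friendly URLs begin with /to/ or /lg/ (the paths here are assumed from your question, not known), a minimal robots.txt would be:

```
User-agent: *
Disallow: /to/
Disallow: /lg/
```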
It does not matter that your search-engine-friendly URLs do not actually exist as files. If you tell the robot that it's OK to request those URLs, they will do so when they find a link, and will then be subject to your RewriteRules. So they'll "land" on the page you expect them to, just like a human visitor. Clear?
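For what it's worth, the rewrite side might look something like this. The pattern and the target script name are made up for illustration; the point is only that a robot's request for the non-existent friendly URL gets answered by a real file:

```apache
RewriteEngine On

# A robot (or human) requests the friendly URL /to/article1, which
# does not exist on disk; this rule hands the request to a real
# printer-friendly script instead.
RewriteRule ^to/(article[0-9]+)$ /printer.php?article=$1 [L]
```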
Almost all of the search engine spiders now recognize the <meta robots> tag, so as DaveN states, that may be the easiest solution to your problem. The downside is that the pages will be fetched, even though they will not be listed in the search engine index. That just means some wasted server bandwidth.