I have urls of the type:
[to being the text-only version and lg the low-graphics version]
that I would like to keep robots out of, because they are printer-friendly versions. section1/2, to, lg, and article1/2 are not real directories; they are the search-engine-friendly ones created with mod_rewrite.
How would most bots interpret something like:
Would they also not spider anything with a URL containing "section1"? The real articles, which I do want spidered, are named like:
Maybe it would be best just to specify the /to/ and /lg/ fake directories as disallowed, but how do I write that? Will
On the actual files I also have:
<meta name="robots" content="noindex, nofollow">
Is this sufficient, or should I also exclude them with robots.txt and keep the meta tag for any robots that might ignore robots.txt?
No, robots.txt is interpreted using prefix-matching. The match starts at the left, proceeds to the right, and anything you leave off at the right end means "don't care". Wild-cards are not recognized, per se.
As a result,
Would they also not spider anything with a URL containing "section1"?
No, the path would have to start with and contain everything you included in the Disallow.
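You can see this prefix behavior for yourself with Python's standard-library robots.txt parser. The /to/ path and the example URLs below are just borrowed from the question for illustration:

```python
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt with a single Disallow rule
rules = """\
User-agent: *
Disallow: /to/
""".splitlines()

rp = RobotFileParser()
rp.parse(rules)

# Blocked: the path starts with the disallowed prefix /to/
print(rp.can_fetch("*", "http://example.com/to/article1"))       # False

# Allowed: /to/ appears inside the path but is not a prefix of it
print(rp.can_fetch("*", "http://example.com/section1/to/page"))  # True
```

Note that the second URL is fetchable even though it contains "to", because the match is anchored at the start of the path.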
You can get only "sort of" a wild-card effect if the URLs you wish to disallow all start with the same prefix, e.g. Disallow: /private/ will keep the spiders out of subdirectory /private. If you cannot arrange your directory structure to use this approach, then you'll have to Disallow them individually.
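So if you can arrange the rewrite so that all the printer-friendly URLs begin with /to/ or /lg/ (the paths here are assumed from your question, not known), a minimal robots.txt would be:

```
User-agent: *
Disallow: /to/
Disallow: /lg/
```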
It does not matter that your search-engine-friendly URLs do not actually exist as files. If you tell the robot that it's OK to request those URLs, they will do so when they find a link, and will then be subject to your RewriteRules. So they'll "land" on the page you expect them to, just like a human visitor. Clear?
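For what it's worth, the rewrite side might look something like this. The pattern and the target script name are made up for illustration; the point is only that a robot's request for the non-existent friendly URL gets answered by a real file:

```apache
RewriteEngine On

# A robot (or human) requests the friendly URL /to/article1, which
# does not exist on disk; this rule hands the request to a real
# printer-friendly script instead.
RewriteRule ^to/(article[0-9]+)$ /printer.php?article=$1 [L]
```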
Almost all of the search engine spiders now recognize the <meta robots> tag, so as DaveN states, that may be the easiest solution to your problem. The downside is that the pages will be fetched, even though they will not be listed in the search engine index. That just means some wasted server bandwidth.