Forum Moderators: goodroi
I'm performing searches at all three right now, regular searches, not advanced, and sure enough, they all show URI-only listings for stuff that has been Disallowed.
This is why I don't rely on the robots.txt file to keep things out of the SEs' indexes.
Historically, search engines will list the URL with no snippet in the SERPs when they find links pointing to a page but are unable to actually crawl it. If you have a page that is blocked with robots.txt and the search engine finds links from other pages pointing to it, it knows that the page exists, so it will show the only information it has: the URL.
It is also a good time to remind people that robots.txt is a way for you to ASK bots nicely to behave. htaccess is a way for you to better FORCE bots to behave.
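As a sketch of the "FORCE" side, here's what blocking a misbehaving bot by User-Agent can look like in .htaccess (Apache 2.2-style access directives; "BadBot" is a placeholder, substitute the actual UA string you want to refuse):

```apache
# Deny requests whose User-Agent contains "BadBot" (placeholder name)
SetEnvIfNoCase User-Agent "BadBot" block_bot
Order Allow,Deny
Allow from all
Deny from env=block_bot
```

Unlike a robots.txt Disallow, this returns a 403 whether or not the bot chooses to "behave."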
<end of history lesson>
It is also a good time to remind people that robots.txt is a way for you to ASK bots nicely to behave.
It's also a way to let the world know what you don't want indexed. Is it possible that I could review your robots.txt file and then build a mini-site of links all leading to content that you've Disallowed? Do you think that would have a slight impact on the performance of your site? Remember, my regurgitated links do not have a Disallow in the robots.txt file, so they are there for all to index. Even though the final destination is Disallowed, do you think there might be some merit to the idea that you could be sabotaged via robots.txt? Or is it Tin Hat Thursday? ;)
htaccess is a way for you to better FORCE bots to behave.
Tell that to the Windows folks. :)
htaccess is a way for you to better FORCE bots to behave.
Tell that to the Windows folks. :)
OK, I'll translate:
On Windows servers, ISAPI Rewrite is a way for you to better FORCE bots to behave. ;)
Let's not forget the middle approach: Allow robots to fetch the page, then use the on-page meta-robots tag to tell them not to include the page in their index. The page must be Allowed in robots.txt in order for the robots to fetch the page and read the meta-robots tag. But this on-page tag is more suited to keeping pages out of SE indexes because unlike robots.txt, which says "don't fetch this page", the on-page meta-robots tag can say, "Don't include this page (or its URL) in the index."
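To make the "middle approach" concrete, this is the tag in question, placed in the page's <head>. The page must not be Disallowed in robots.txt, or the robot will never fetch it and read the tag:

```html
<!-- Robot may fetch this page (so robots.txt must Allow it),
     but is told to keep the page and its URL out of the index
     while still following its links. -->
<meta name="robots" content="noindex,follow">
```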
It's important to recognize that robots.txt and the on-page meta-robots tag have different purposes, and their semantics differ: Robots.txt says "Do not fetch URLs beginning with this URL-prefix." It was originally intended as a bandwidth control mechanism, while the meta-robots tag (among other things) says, "Now that you've fetched this page, do not include it in your index." The on-page meta-robots tag is more specifically targeted at search engine spider control.
Having implemented either the robots.txt Disallow or the on-page <meta name="robots" content="noindex"> tag, you can then proceed to use mod_rewrite, mod_access, or ISAPI Rewrite to positively block access to pages or other resources you don't want indexed. Or alternatively, you can rewrite known robot requests for off-limits pages to a low-byte-count page containing, say, only a text link to your home page, along with a <meta name="robots" content="noindex,follow"> tag. Doing so can save you quite a bit of bandwidth as robots spider these "Allowed-but-noindexed" pages.
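A minimal mod_rewrite sketch of that bandwidth-saving trick, assuming a hypothetical off-limits area under /private/ and a hypothetical stub page /noindex-stub.html (both names are made up for illustration):

```apache
# In .htaccess: serve known SE robots a tiny noindex stub instead of
# the full off-limits pages under /private/
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} (Googlebot|Slurp|msnbot) [NC]
RewriteRule ^private/ /noindex-stub.html [L]
```

The stub carries the <meta name="robots" content="noindex,follow"> tag and little else, so each robot fetch costs a few hundred bytes rather than a full page.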
As to robots.txt providing a "shopping" list of URLs for malicious use, remember that robots.txt uses prefix-matching, so there is no guarantee that a URL-prefix (partial URL) found in robots.txt will resolve to an actual resource on the site. In fact, one might detect client requests for some or all of these URL-prefixes and call a script to block the client's IP address if the user-agent is not a recognized SE robot. Recognized robots can be fed a 301 to a valid URL. Some sites even salt a few fake URL-prefixes into robots.txt to trap robots.txt harvesters... :)
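Here's what a "salted" robots.txt might look like. The trap prefix below is invented for illustration; it matches no real resource, so any client requesting a URL under it has almost certainly harvested it from robots.txt and can be handed off to a blocking script:

```
# robots.txt -- Disallow entries are URL-prefixes, not exact paths
User-agent: *
Disallow: /members
Disallow: /qz-trap
```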
Jim
<meta name="robots" content="none">
The above has been my preference for years. It works perfectly and does what it is supposed to do: keep pages (including URI-only listings) out of the index. I don't use robots.txt to its fullest extent, not even close. I prefer to control it at the page level and through ISAPI_Rewrite. Keep in mind, though, that there are few hosts out there that have ISAPI_Rewrite or something similar installed for us Windows folks. So BRUTE FORCE control is out of reach for many.
Tell that to the Windows folks
You still talk with people hosting on Windows? Just kidding :)
Yes, you do need to be careful of what you list in robots.txt because it could attract attention. When I visit my competition, I always check out their robots.txt, and then I head over to their web design firm's site and check out their robots.txt. This has helped me discover many of my competitors' redesigns while they are still being built.
Ah, but try that on my sites, using either a browser UA or a spider UA which does not resolve to a real search engine, and you'll find yourself getting 403s for all subsequent requests -- Maybe not forever, but at least for a longer time than a "competitive analyst" would want to wait on a research project. :)
The same thing will happen if you go "directory-index-fishing" unless you're verifiably Yahoo! Slurp -- Annoyingly, this 'bot has recently been making a lot of blind/unlinked directory-index requests, apparently oblivious to the fact that "Options -Indexes" is a fairly common setting on Apache servers, and that it's causing an awful lot of 403-Forbidden responses to be logged across the Web...
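For reference, the setting mentioned above is a one-liner in httpd.conf or .htaccess:

```apache
# With directory indexes disabled, a request for a directory that has
# no index file (index.html etc.) draws a 403-Forbidden response
Options -Indexes
```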
Not sure who was being addressed by the "still talk to people hosting on Windows?" comment, but sure, I still talk to them. Partly out of sympathy for the facts that they have to pay for "add-on capabilities" such as URL-rewriting that come free with Apache, and that they're locking themselves in with additional MS-proprietary technologies, thereby subjecting their business's viability to the whims of MS's future pricing policies. Also, as Vista demonstrates, MS is not much concerned with long-term applications compatibility... Something to consider before locking yourself in with them. Apache, IIS, no matter -- it's all good, as long as you're making well-informed decisions.
I tend to use the more-verbose <meta name="robots" content="noindex,nofollow"> simply because it's not clear how content="none" might be treated with respect to future extensions to the meta-robots content attributes. We're still OK now, because "noCache", "noSnippet", "noODP", and "noYdir" would logically be subordinate to "noindex", but you never know what's coming next with all of these recent additional proprietary and semi-proprietary extensions to the robots meta tag... An additional 12 bytes of future-proofing per page is how I see it.
Jim
as for my windows comment, i was trying to be playfully snarky but not to anyone in particular.