Forum Moderators: goodroi
and this in robots.txt:
User-agent: *
Disallow: /about/
Disallow: /press/
Disallow: /products/
Would pages in /about/, /press/ and /products/ get crawled and indexed?
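To see what a rule-abiding crawler should do with that file, here is a minimal sketch using Python's standard-library robots.txt parser. The example.com URLs and file names are hypothetical, used only to exercise the three Disallow rules quoted above.

```python
# Sketch: feeding the robots.txt quoted above to Python's stdlib parser.
# The URLs checked below are hypothetical examples.
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: *
Disallow: /about/
Disallow: /press/
Disallow: /products/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# Any compliant crawler is blocked from the three directories...
print(parser.can_fetch("Googlebot", "http://example.com/about/team.html"))  # False
print(parser.can_fetch("Googlebot", "http://example.com/press/index.html"))  # False
# ...but the rest of the site, including the home page, stays crawlable.
print(parser.can_fetch("Googlebot", "http://example.com/"))  # True
```

So a compliant bot may not fetch those pages at all, which is a separate question from whether their URLs end up listed, as discussed below.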
I've got a site that I'm going to work on and they set things up this way. The site has a PR7 on the home page and 0 on any of the pages that are in these 3 directories.
I also just read a post today on the Google Webmaster Blog from Vanessa Fox, dated 3/5/07, titled "Using the robots meta tag", and it turns out that ALL is not even an accepted value for the robots meta tag, from Google's perspective. The post indicated that the following were valid values (case-insensitive):
NOINDEX
NOFOLLOW
NOARCHIVE
NOSNIPPET
NOODP
NONE
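For reference, those values go in a robots meta tag in the page head and can be combined with commas. A hypothetical example using two of the values listed above:

```html
<!-- Hypothetical example: values are comma-separated and case-insensitive -->
<meta name="robots" content="noindex,noarchive">
```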
You sure about that encyclo? AFAIK, Google blatantly crawls all over those directories, reads and stores information from there, but just doesn't include it in the SERPs.
That would mean that gbot is not obeying robots.txt, and that would be news all over the webmaster/SEM forums. I haven't seen that as yet. While there have been isolated cases reported where Googlebot appeared not to obey robots.txt, these usually involved a very recently changed robots.txt file, with the bot apparently working from its most recently cached copy.
Pages blocked by robots.txt can, and will, be indexed if G finds links pointing to them. As Encyclo says, the page must be exposed to the bot so it can read the NOINDEX.
So, if you have 100 links on an indexed page to 100 pages residing in a blocked directory, then those 100 pages can still be listed in the SERPs.
Occasionally, a URL which is blocked by robots.txt but is listed in the DMOZ directory will display with the DMOZ title and data rather than URL only. I am not aware of Google ever using any other data (eg. backlink text) to add a title or description.
In my experience, Googlebot (or any other major search engine bot) has never violated the robots.txt directives. If you find that it does on a particular site, then the most likely explanation is that there is a problem with your robots.txt syntax.
On the other hand, Slurp is notorious for getting itself banned. It has crawled non-linked directories that were simply mentioned in robots.txt as a test of the bot trap.
MSNBot was caught a couple of times, and later even had a cached copy of the page in the index after getting a real 403 on a page that had been disallowed by robots.txt for a couple of days. That page is still linked/displayed when site:mydomain.tld is used. If I set it to show 50 results per page, it shows links to 2 bot-trap pages in the top 20 at this time. Those pages have
<meta name="robots" content="noindex,nofollow"> in the head of the document, but are linked to from the footer on almost every page.
:)
Google has publicly stated (I think) that not all their bots respect robots.txt. The AdWords bot is an example. Claims exist in threads like this one [webmasterworld.com]. But this does not relate to the main Googlebot itself.
Various confirmed reports - by members like g1smd (whose posts I greatly respect) - of Google violations involved the robots meta tag [webmasterworld.com], not robots.txt.
I suppose I was misled by the many anecdotal accounts here about Google ignoring robots.txt itself.