The first example is rather a bad one
<meta name="robots" content="all,index,follow">
"all" by itself means "index, follow", so it should have been either
<meta name="robots" content="all">
or
<meta name="robots" content="index,follow">
The index tells the bots that it is OK to index this page (i.e. have it in their search index), and the follow tells the bots that they are allowed to follow the links they find on that page. Both of these are assumed defaults by bots/crawlers unless they find a contrary instruction.
The second example
<meta name="robots" content="noindex,follow,noarchive">
noindex says: I don't want this page in the search index.
follow is the same as mentioned previously; its opposite is nofollow, which says don't follow any links you find on this page.
noarchive means you don't want the search engine to keep a cached copy of the page, as Google does.
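To make the defaults concrete, here is a minimal Python sketch of how a crawler might read the tag. The names RobotsMetaParser and robots_policy are made up for illustration, and treating "noarchive" as a recognised value is an assumption that holds only for engines that support it.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects the directives found in <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr_map = {name: (value or "") for name, value in attrs}
        if attr_map.get("name", "").lower() != "robots":
            return
        for token in attr_map.get("content", "").split(","):
            token = token.strip().lower()
            if token:
                self.directives.add(token)


def robots_policy(html):
    """Return what the page allows; everything is allowed unless stated otherwise."""
    parser = RobotsMetaParser()
    parser.feed(html)
    found = parser.directives
    return {
        "index": "noindex" not in found and "none" not in found,
        "follow": "nofollow" not in found and "none" not in found,
        # Assumption: the crawler honours "noarchive"; not every bot does.
        "archive": "noarchive" not in found,
    }


second_example = '<meta name="robots" content="noindex,follow,noarchive">'
print(robots_policy(second_example))
```

A page with no robots meta tag at all comes back with every flag set to True, which is the "assumed default" behaviour described above.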
From the research I have done since my post, this is my understanding:
1. Robots.txt is preferred over <meta name="robots" content="index,follow">
2. <meta name="robots" content="index,follow"> is supported by a few bots/crawlers.
3. Control of what a bot crawls is done by Disallow: /DIRECTORY/ or Disallow: /FILENAME.HTML in the robots.txt
4. A bot will crawl and cache everything not disallowed in robots.txt
Am I on the right path here?
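For point 3, a quick way to check what a set of Disallow rules actually blocks is Python's standard urllib.robotparser module. This is only a sketch: the robots.txt content, the bot name "ExampleBot" and the example.com URLs are all made up, and it shows only what a well-behaved bot would be permitted to fetch, not what any particular bot really does.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt using the two rule forms from point 3.
ROBOTS_TXT = """\
User-agent: *
Disallow: /directory/
Disallow: /filename.html
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# "ExampleBot" and the URLs are placeholders for illustration.
blocked_dir = parser.can_fetch("ExampleBot", "http://example.com/directory/page.html")
blocked_file = parser.can_fetch("ExampleBot", "http://example.com/filename.html")
allowed_page = parser.can_fetch("ExampleBot", "http://example.com/other.html")

print(blocked_dir, blocked_file, allowed_page)
```

Note that robots.txt only governs crawling; whether a page is indexed or cached is what the meta tag values are for.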
Neither robots.txt nor meta tags are foolproof. If you really want to stop certain pages from getting indexed, you should password-protect them.
I am searching this forum for ideas, hints and problems, as I am currently on a team that is coding a web crawler.
Notes on your examples:
Generally we obey the proper HTML specifications. The specification for the robots meta tag can be found here:
[robotstxt.org...]
The first example is not a proper robots meta tag, but we load all parts of the content expression.
Proper would be:
content="ALL"
content="NONE"
content="(NO)FOLLOW,(NO)INDEX" (or vice versa)
As soon as both the index and follow flags are set, we stop the analysis. In the case of the first example we would stop after reading "all" (our bot is not case-sensitive).
The second example is not proper HTML either, as there is no "noarchive" value in the specification. We would ignore that, but obey "noindex" and "follow".
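The parsing behaviour described in this post could look roughly like the Python sketch below. It is a guess at the logic, not the poster's actual crawler code: parse_robots_content is a hypothetical name, and letting the first value seen for each flag win is an assumption.

```python
def parse_robots_content(content):
    """Read a robots meta content value left to right; return (index, follow)."""
    index = None   # None means "not set yet"
    follow = None

    for raw in content.split(","):
        token = raw.strip().lower()  # the bot is not case-sensitive

        if token in ("all", "index") and index is None:
            index = True
        if token in ("none", "noindex") and index is None:
            index = False
        if token in ("all", "follow") and follow is None:
            follow = True
        if token in ("none", "nofollow") and follow is None:
            follow = False
        # Values outside the specification (e.g. "noarchive") are ignored.

        if index is not None and follow is not None:
            break  # both flags are set, stop the analysis

    # Flags that were never set fall back to the permissive defaults.
    return (True if index is None else index,
            True if follow is None else follow)


print(parse_robots_content("all,index,follow"))          # stops after "all"
print(parse_robots_content("noindex,follow,noarchive"))  # "noarchive" is never read
```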