Meta tags and Robots

Forum Moderators: goodroi

Buster42

3:57 am on Feb 1, 2006 (gmt 0)

I'm not sure if this is in the right forum.

Is a meta tag for robots important?

I've found 2 examples. Could someone please explain each for me? What do the content values mean?

TIA

Dijkgraaf

6:11 am on Feb 1, 2006 (gmt 0)

A good resource is
[robotstxt.org...]

The first example is rather a bad one
<meta name="robots" content="all,index,follow">
all by itself means "index, follow" so it either should have been
<meta name="robots" content="all">
or
<meta name="robots" content="index,follow">
The index tells the bots that it is OK to index this page (i.e. have it in their search index), and the follow tells the bots that they are allowed to follow the links they find in that page. Both of these are assumed defaults by bots/crawlers unless they find a contrary instruction.

The second example
<meta name="robots" content="noindex,follow,noarchive">
noindex says: I don't want this page in the search index.
follow is the same as mentioned previously; its opposite, nofollow, says don't follow any links you find on this page.
noarchive means you don't want the search engine to keep a cached copy of the page, as Google does.
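
To make the content values above concrete, here is a minimal Python sketch of how a well-behaved crawler might read the tag. The names RobotsMetaParser and robots_policy are made up for this illustration, and resolving conflicting values toward the more restrictive one is an assumption, not something the robots meta tag documentation prescribes.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects the content values of every <meta name="robots"> tag."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        # Attribute values can be None (e.g. <meta name>), so guard with "or".
        if (attr.get("name") or "").lower() != "robots":
            return
        for token in (attr.get("content") or "").split(","):
            token = token.strip().lower()
            if token:
                self.directives.add(token)


def robots_policy(html):
    """Return what the page allows; everything defaults to allowed."""
    parser = RobotsMetaParser()
    parser.feed(html)
    found = parser.directives
    return {
        # Assumption: a restrictive value wins over a permissive one.
        "index": not ({"noindex", "none"} & found),
        "follow": not ({"nofollow", "none"} & found),
        "archive": "noarchive" not in found,
    }
```

With this sketch, the first example (all,index,follow) and a page with no robots meta tag at all give the same result: index, follow and archive are all allowed. The second example (noindex,follow,noarchive) gives index and archive disallowed, follow allowed.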

Buster42

8:26 am on Feb 1, 2006 (gmt 0)

Thanks for the reply and URL.

From the research I have done since my post, I understand this.

1. Robots.txt is preferred over <meta name="robots" content="index,follow">
2. <meta name="robots" content="index,follow"> is supported by a few bots/crawlers.
3. Control of what a bot crawls is done by
DISALLOW: /DIRECTORY/ or DISALLOW: /FILENAME.HTML in the robots.txt
4. A bot will crawl and cache everything not disallowed in robots.txt

Am I on the right path here?

Dijkgraaf

10:26 am on Feb 1, 2006 (gmt 0)

1. You can do things with meta tags that you can't do with robots.txt, e.g. noarchive. So I wouldn't say preferred, but rather to be used in conjunction with it.
2. True enough, it probably isn't as widely supported as robots.txt yet.
3. The Disallow: directive tells a bot not to request anything beginning with the given path, so you don't have to give the full directory name or filename, just the first part (take care not to disallow something you do want indexed). In some cases you don't want the trailing slash when disallowing a directory (there are threads in this forum discussing why).
4. Essentially correct, but not all bots will crawl everything; some only index the main or entry pages. The other way to keep a page out is to use the meta tags (but, as noted, some bots may not support them).

Neither robots.txt nor meta tags are foolproof. If you really want to stop certain pages getting indexed, you should password protect them.
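
The "beginning with" behaviour in point 3 can be tried out with Python's standard urllib.robotparser module. This is only a sketch: the /private path, the ExampleBot user agent and the example.com host are placeholders invented for the illustration, and real crawlers may match paths somewhat differently from this module.

```python
from urllib.robotparser import RobotFileParser


def build_parser(robots_txt):
    """Parse robots.txt text held in a string (no network access needed)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def allowed(parser, path):
    """True if the hypothetical ExampleBot may request the given path."""
    return parser.can_fetch("ExampleBot", "http://example.com" + path)


# The same rule written with and without the trailing slash.
with_slash = build_parser("User-agent: *\nDisallow: /private/\n")
without_slash = build_parser("User-agent: *\nDisallow: /private\n")
```

With the trailing slash, /private/report.html is blocked, but /private and /private.html are still allowed because neither begins with /private/. Without the trailing slash, all three are blocked, which is the kind of accidental over-blocking the warning in point 3 is about.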

solandre

4:48 pm on Feb 9, 2006 (gmt 0)

hi buster,

i am searching this forum for ideas, hints and problems, as i am currently on a team that is coding a web crawler.

notes on your examples:

generally we obey the proper html specifications. the specification for the robots meta tag can be found here:

[robotstxt.org...]

example 1 is not a proper robots meta tag, but we load all parts of the content expression.
proper would be:

content="ALL"
content="NONE"
content="(NO)FOLLOW,(NO)INDEX" (or vice versa)

as soon as both the index flag and the follow flag are set, we stop the analysis. in the case of the first example we would stop after reading "all" (our bot is not case sensitive).

example 2 is not proper html either, as there is no "noarchive" value in the specification. we would ignore that value, but obey "noindex" and "follow".
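
The scanning behaviour described in this post could look roughly like the Python sketch below. It is a guess at the logic, not the team's actual code: the function name scan_robots_content is made up, and letting the first value seen for each flag win is an assumption, since the post does not say how conflicting values are handled.

```python
# Each known value maps to (index flag, follow flag); None means "not affected".
KNOWN_VALUES = {
    "all": (True, True),
    "none": (False, False),
    "index": (True, None),
    "noindex": (False, None),
    "follow": (None, True),
    "nofollow": (None, False),
}


def scan_robots_content(content):
    """Return (index, follow, values_read) for a robots meta content string."""
    index = None
    follow = None
    values_read = 0
    for raw in content.split(","):
        values_read += 1
        flags = KNOWN_VALUES.get(raw.strip().lower())  # case-insensitive lookup
        if flags is None:
            continue  # unknown value such as "noarchive": ignore it
        # Assumption: the first value seen for a flag wins.
        if index is None and flags[0] is not None:
            index = flags[0]
        if follow is None and flags[1] is not None:
            follow = flags[1]
        if index is not None and follow is not None:
            break  # both flags are set, so stop reading
    # Flags that were never set fall back to the permissive default.
    return (
        True if index is None else index,
        True if follow is None else follow,
        values_read,
    )
```

For the first example, scan_robots_content("all,index,follow") stops after one value and returns (True, True, 1). For the second, scan_robots_content("noindex,follow,noarchive") returns (False, True, 2), so noarchive is never even read.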