Question re: Robots Meta Tag

The default with Inktomi is index,nofollow

         

Neilski

4:15 pm on Jul 17, 2002 (gmt 0)

10+ Year Member



Hello - I'm new to WebmasterWorld
What a fantastic site.

My question is - I discovered:
[searchengineworld.com...]
When I did a Meta Tag search on WW's site search.

I discovered this paragraph:

"Some search engine articles on Robots Meta tag say the predefined defaults are INDEX and FOLLOW, not true with Inktomi. The default with Inktomi is index,nofollow."

So - I am confused.

If one puts index,nofollow to satisfy Inktomi - what happens with the other Engines when they would be looking for index,follow ?

and/or will Inktomi still follow - if index,follow is there?

Thanks for any help and/or suggestions.

gcross

5:46 pm on Jul 17, 2002 (gmt 0)

10+ Year Member



I think when they say default they mean one of two things. Either that is Inktomi's default which can be overruled by the correct meta tag in your pages, or that is Inktomi's default in the event you fail to have any meta tag in your pages addressing that requirement. (Could mean both too)

I've simply added a meta tag with index,follow in all my pages and then uploaded a robots.txt file in the root directory to exclude those directories I didn't want indexed, such as the graphics/images directories.

Neilski

12:26 am on Jul 18, 2002 (gmt 0)

10+ Year Member


Thanks gcross - for your input

If I read you correctly - I am safe - with all of my concerns - by simply having "index,follow"
Taking out of the equation - what a certain Engine's Default might be. As long as I have something - it's better than nothing? correct? Or do they spider thru-out anyhow? Even if you didn't have an index,follow?

***** Any other input is appreciated.

Which leads to a Part Two ? - all of a sudden.

I've never used a robots.txt file - on certain sites - where there is no linking info to a dir or page. For example - if on my site - I just put up informational stuff for friends at:
mysite.com/friends/your_new_pics.html

And if there are no links from the root - and/or anywhere else to /friends/ or pages - Engines can't find this stuff right? unless of course there was a link on that dir or page - to somewhere else?

Thanks for your input.

Neilski

12:34 am on Jul 18, 2002 (gmt 0)

10+ Year Member


Oops Sorry - Follow-up

I have forgotten for years to do "index,follow" - yet my sites have been spidered deep.

Just what does the "index,follow" do
or not do - if it isn't there?

Thanks

pageoneresults

12:42 am on Jul 18, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Hello Neilski, here is a little more info...

This tag is meant for users who cannot control the /robots.txt file at their sites. It provides a last chance to keep their content out of search services. It was decided not to add syntax to allow robot-specific permissions within the robots meta tag (as is possible in the /robots.txt file).

<meta name="robots" content="robots-terms">

The content robots-terms is a comma-separated list that may contain the following keywords (without regard to case): all, none, index, noindex, follow and nofollow.

none
Tells all robots to ignore this page (equivalent to: noindex, nofollow).

all
There are no restrictions on indexing this page, or following links from this page to determine pages to index (equivalent to: index, follow).

index
All robots are welcome to include this page in search services.

noindex
This page may not be indexed by a search service.

follow
Robots are welcome to follow links from this page to find other pages.

nofollow
Robots are not to follow links from this page.

If this meta tag is missing, or if there is no content, or the robots terms are not specified, then the robots terms will be assumed to be "index, follow" (i.e. "all"). If the keyword all is found in the robots terms list it overrides all other values. That is, a robots terms list of "nofollow, all, noindex, nofollow" would effectively be "all". If the robots terms contain contradictory information (e.g. "follow, nofollow, follow") then the robot is free to do whatever it wishes with regard to the behavior being addressed (in this case the follow behavior).

A robots terms list consisting only of noindex allows the subsidiary links to be explored, even though the page is not to be indexed. A robots terms list consisting only of nofollow allows the page to be indexed, but no links from the page are explored (this may be useful if the page is a free entry point into pay per view content).
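As a rough illustration of the rules above, here is a minimal Python sketch of how a robot might resolve a robots-terms list. The function name resolve_robots_terms and the use of None for "contradictory, robot may do as it wishes" are my own assumptions for illustration; they do not come from any specification or real crawler.

```python
from typing import Optional, Tuple


def resolve_robots_terms(content: Optional[str]) -> Tuple[Optional[bool], Optional[bool]]:
    """Return (index, follow) for a robots meta content value.

    None in either slot means the terms contradict each other, so the
    robot is free to choose (hypothetical convention for this sketch).
    """
    # Keywords are comma separated and matched without regard to case.
    terms = {t.strip().lower() for t in (content or "").split(",") if t.strip()}

    # A missing or empty list means "all", and "all" overrides every other value.
    if not terms or "all" in terms:
        return (True, True)

    # "none" is shorthand for "noindex, nofollow".
    if "none" in terms:
        terms |= {"noindex", "nofollow"}

    def decide(yes: str, no: str) -> Optional[bool]:
        if yes in terms and no in terms:
            return None           # contradictory: behavior left to the robot
        return no not in terms    # permissive unless explicitly denied

    return (decide("index", "noindex"), decide("follow", "nofollow"))
```

With this sketch, a content value of "noindex" resolves to (False, True): the page is not indexed but its links may still be followed, which matches the description above.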

In regards to the robots meta tag, the index and follow are pretty much useless. If your site is void of these tags, it means the same thing. Don't clutter your metas with information that does not need to be there.

Neilski

1:29 am on Jul 18, 2002 (gmt 0)

10+ Year Member


pageoneresults

Thanks so very much.

That was a very comprehensive explanation.

pageoneresults

1:33 am on Jul 18, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



You are quite welcome although I cannot take credit for the above. I put that together a couple of years ago and I believe most of it is from the W3C and some might have come from Brett's. Not sure. I do know that it is accurate and researched.

gcross

7:43 pm on Jul 19, 2002 (gmt 0)

10+ Year Member



Thank you, PageOne, that was very informative. I appreciate it immensely.

But it does beg some new questions.

First of all, you said that ALL (or its equivalents) would override all other selections. Does this also then override the robots.txt file's exclusions? For instance, I have my own domains (2) and so can control the robots.txt exclusions. I want the SE to spider my site but not all the directories, hence the exclusions in the robots.txt file. I also want to override Inktomi's nofollow default. So my meta tag should simply read ALL and I am ok?

Secondly, of course, you mentioned not to clutter up one's meta tags with unnecessary information. I am trying to bone up on usability issues currently while drafting revisions to my sites. I have not yet come across anything on meta tag standards, namely limitations, other than leaving them out is foolish and leads to messy entries in the SEs. You mentioned w3c, Brett's and researched. Can you give us some leads on where to find more specific information on what is acceptable and what is not? Also, is there any kind of validation service for meta tags?

pageoneresults

8:54 pm on Jul 19, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



> Does this also then override the robots.txt file's exclusions?

Unfortunately I cannot answer that one. I've not used both methods together; it's either one or the other. I've been using a robots.txt file for at least the past 3 years, so I cannot give much more instruction on the Robots META tag than what is outlined above.

I would think that you can do everything within your robots.txt file, as that is the first place the spider is going to look. When the spider hits your site, it requests the robots.txt file; if the file is not present, it assumes that everything may be indexed.

As it begins indexing, if it finds the Robots META Tag, then it is supposed to take instruction from that. If you don't have one, the default is "all". If you do have one, the only value I recommend is "none", since the other cases can be covered without a Robots META Tag, in which case the default "all" applies.
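The order described above (robots.txt first, then the Robots META Tag on each fetched page) can be sketched in Python with the standard library's urllib.robotparser. This is only an illustration under one assumption: a well-behaved robot does not request a page that robots.txt disallows, so it never gets to read that page's META tag. The function name crawl_decision, the agent name "ExampleBot" and the example.com URLs are made up for the sketch.

```python
from urllib.robotparser import RobotFileParser


def crawl_decision(robots_txt: str, url: str, meta_content: str = "") -> str:
    """Describe what a compliant robot would do with one URL (sketch only)."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())

    # Step 1: robots.txt is consulted before the page itself is requested.
    if not rules.can_fetch("ExampleBot", url):
        return "not fetched (robots.txt); meta tag never read"

    # Step 2: only a fetched page can deliver its robots meta tag.
    terms = {t.strip().lower() for t in meta_content.split(",")}
    if "all" not in terms and ("noindex" in terms or "none" in terms):
        return "fetched, not indexed (meta tag)"
    return "fetched and indexed"


robots = "User-agent: *\nDisallow: /private/\n"
print(crawl_decision(robots, "http://www.example.com/private/a.html", "all"))
print(crawl_decision(robots, "http://www.example.com/a.html", "noindex"))
```

Under that assumption, a META tag of "all" on a page inside a disallowed directory would have no effect, because the robot never sees it.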

> I have not yet come across anything on meta tag standards, namely limitations, other than leaving them out is foolish and leads to messy entries in the SEs.

The best place to review standards on META's is at the W3C. When it comes to what a website should have as META's from a standard point of view, I believe it's going to be relative to the country you are in, your audience, and how critical having those META's is in the overall scheme of things.

Do a search here at Webmaster World for META Tags and you'll get quite a few results. You'll find that there are over 20-30 META's that you can use and that includes the Dublin Core Set. Which ones you need will be relative to your audience requirements.

gcross

12:31 am on Jul 20, 2002 (gmt 0)

10+ Year Member



Dublic Core Set?

gcross

12:31 am on Jul 20, 2002 (gmt 0)

10+ Year Member



Apologies, I meant DUBLIN Core Set?

Lundy

12:39 am on Jul 20, 2002 (gmt 0)

10+ Year Member



I have a robots text question. I have several new sites that I am most eager to get deeply spidered. I have links to them from sites that got into Ink & others for free long ago. They are fairly small and nothing on them needs to be excluded from indexing, so I have not bothered to make a robots text file. I also have not put in any robots meta tags, as I stopped doing that a while ago on sites like this.
I see in the logs that various spiders request the robots text file over and over and over, of course getting a standard 404 in these cases.

Now, is it very beneficial to spiders to have a custom 404 page with a link into the site? And why do these spiders often just request the robots text file and nothing else? Often they do not request anything other than that according to what I see. Thanks to all.

pageoneresults

5:39 am on Jul 20, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



> Apologies, I meant DUBLIN Core Set?

Okay gcross, you asked for it...

Dublin Core Metadata Element Set [dublincore.org]

pageoneresults

5:54 am on Jul 20, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Hello Lundy and Welcome to Webmaster World.

A couple of years ago I went on a mission to learn as much as I could about the robots.txt file mainly to eliminate the 404 errors and also to prevent just what you describe...

> And why do these spiders often just request the robots text file and nothing else?

I believe it was in 1999 when I was involved in another discussion concerning this very same issue. There were many webmasters stating the same thing, robots request, not found, off it went. In some instances the robot did not return.

I can't say that is the case today. My basic understanding is that there are multiple robots, each serving a purpose. One comes out to search for robots.txt files. It returns home and drops its data into the next robot's hands, which indexes all sites that had a robots text file with a set of instructions. Many of those robots.txt files might have been blank, but the file was there, and it's like giving the robot a key and quick access to your data.

Then there are sites without a robots.txt file. The robot collects the 404 data and dumps it into a generic robot's hands. Not having a robots.txt file is the same thing as having one that is empty. Brett recently posted a reply to another topic stating that an empty robots.txt file may cause some issues with certain robots, that they may see it as a Disallow.

I can only share with you my experiences and research in this area. I've got a robots.txt file on all sites that I manage. I typically Disallow /css/, /javascript/, /working/ and other temp directories that I don't want indexed. I use it mainly for Google because Googlebot will crawl a link in a heartbeat and I hate seeing working drafts in the SERPs (Search Engine Results Pages).

If you want to stop the 404 errors, drop a robots.txt file into your root directory. You should definitely Disallow at least one directory, typically external css or javascript directories are a good start. Your file might look like this...

robots.txt

User-agent: *
Disallow: /css/
Disallow: /javascript/
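If you want to check a file like this before uploading it, one option is Python's standard urllib.robotparser module. The sketch below is just an illustration: the helper name allowed, the agent name "ExampleBot" and the www.example.com host are placeholders I made up, and the result only shows how this one parser reads the file, not how any particular search engine's robot behaves.

```python
from urllib.robotparser import RobotFileParser

# The example robots.txt from above.
ROBOTS_TXT = """\
User-agent: *
Disallow: /css/
Disallow: /javascript/
"""


def allowed(robots_txt: str, path: str, agent: str = "ExampleBot") -> bool:
    """Return True if this parser would let the agent fetch the path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, "http://www.example.com" + path)


print(allowed(ROBOTS_TXT, "/css/site.css"))   # disallowed directory
print(allowed(ROBOTS_TXT, "/index.html"))     # everything else
print(allowed("", "/css/site.css"))           # empty file: no rules at all
```

This parser treats an empty file as "no restrictions", but as noted above some robots may read an empty file differently, which is one reason to include at least one Disallow line.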

pageoneresults

6:04 am on Jul 20, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Neilski, my apologies for not Welcoming you too!

In response to your original question. If you cannot upload a robots.txt to your root directory, then the META Robots Tag is the alternative...

Slurp -- Inktomi's Web Robot [inktomi.com]

> robots.txt: Slurp obeys the Robot Exclusion Standard. Specifically, Slurp adheres to the 1994 Robots Exclusion Standard (RES). Where the 1996 proposed standard disambiguates the 1994 standard, the proposed standard is followed.

Slurp is actually somewhat more forgiving than the RES requires.

Slurp will obey the first record in the robots.txt file with a User-Agent containing "Slurp". If there is no such record, it will obey the first entry with a User-Agent of "*".

Disallowed documents, including slash (the home page of the site), are not indexed, nor are links in those documents followed. Slurp does read slash at each site and uses it internally, but if it is disallowed it is neither indexed nor followed.

noindex meta-tag: Slurp obeys the noindex meta-tag. If you place

<META NAME="robots" CONTENT="noindex">

in the head of your web document, Slurp will retrieve the document, but it will not index the document or place it in the search engine's database.
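The record-selection rule quoted above (first record whose User-Agent contains "Slurp", otherwise the first record for "*") can be sketched in Python as follows. This is a simplified illustration, not Inktomi's code: the function names parse_records and select_record are hypothetical, records are assumed to be separated by blank lines, only User-agent and Disallow lines are read, and the case-insensitive match on "slurp" is my assumption.

```python
from typing import List, Optional, Tuple

Record = Tuple[List[str], List[str]]  # (user-agent values, disallowed paths)


def parse_records(text: str) -> List[Record]:
    """Split robots.txt text into records separated by blank lines."""
    records: List[Record] = []
    agents: List[str] = []
    disallows: List[str] = []
    for raw in text.splitlines() + [""]:      # trailing "" flushes the last record
        if not raw.strip():                   # a blank line ends the current record
            if agents:
                records.append((agents, disallows))
            agents, disallows = [], []
            continue
        line = raw.split("#", 1)[0].strip()   # drop comments
        if not line:
            continue                          # comment-only line
        field, _, value = line.partition(":")
        field, value = field.strip().lower(), value.strip()
        if field == "user-agent":
            agents.append(value)
        elif field == "disallow":
            disallows.append(value)
    return records


def select_record(records: List[Record], robot: str = "slurp") -> Optional[List[str]]:
    """Return the Disallow list the robot would obey, or None if no record applies."""
    for agents, disallows in records:
        if any(robot in agent.lower() for agent in agents):
            return disallows
    for agents, disallows in records:
        if "*" in agents:
            return disallows
    return None
```

Note that in this sketch a Slurp-specific record wins even when a "*" record appears earlier in the file, which is how I read the quoted rule.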

You'll find this to be a very interesting read...

Notes on helping search engines index your Web site [w3.org]

Neilski

2:06 pm on Jul 20, 2002 (gmt 0)

10+ Year Member



Re: Neilski, my apologies for not Welcoming you too!
;-)

I'm glad I had posed the original question(s)
Great info from all and a lively discussion.

Lundy

4:34 pm on Jul 20, 2002 (gmt 0)

10+ Year Member



Thank you for the great info on the spider question.
Also, for the folks wondering about FTP'ing files into FP sites, I recall vaguely having to Telnet a robots text file to the server which hosted my FP site some years back. The server ran Apache and this was about 5 yrs ago & I didn't know anything about Telnet, but my hosting co. (great people, I love 'em) gave me good instructions & I did it.
Thanks again

Lundy

4:36 pm on Jul 20, 2002 (gmt 0)

10+ Year Member



No, you know what--that may have been an .htaccess file I had to Telnet--sorry.