My question: when I did a Meta Tag search using WW's site search, I found this page:
[searchengineworld.com...]
On it I found this paragraph:
"Some search engine articles on Robots Meta tag say the predefined defaults are INDEX and FOLLOW, not true with Inktomi. The default with Inktomi is index,nofollow."
So - I am confused.
If one puts index,nofollow to satisfy Inktomi, what happens with the other engines, which would be looking for index,follow?
And will Inktomi still follow links if index,follow is there?
Thanks for any help and/or suggestions.
I've simply added a meta tag with index,follow in all my pages and then uploaded a robots.txt file in the root directory to exclude those directories I didn't want indexed, such as the graphics/images directories.
If I read you correctly, I am safe on all of my concerns by simply having "index,follow".
That takes out of the equation what a certain engine's default might be. As long as I have something, it's better than nothing, correct? Or do they spider throughout anyhow, even if you didn't have an index,follow?
Any other input is appreciated.
Which leads, all of a sudden, to a Part Two.
On certain sites I've never used a robots.txt file, because nothing links to the directory or page in question. For example, suppose I just put up informational stuff for friends on my site at:
mysite.com/friends/your_new_pics.html
If there are no links from the root or anywhere else to /friends/ or its pages, the engines can't find this stuff, right? Unless, of course, there was a link on that directory or page to somewhere else?
Thanks for your input.
This tag is meant for users who cannot control the /robots.txt file at their sites. It provides a last chance to keep their content out of search services. It was decided not to add syntax to allow robot-specific permissions within the robots meta tag (as is possible in the /robots.txt file).
<meta name="robots" content="robots-terms">
The content robots-terms is a comma-separated list that may contain the following keywords (without regard to case): all, none, index, noindex, follow and nofollow.
none: Tells all robots to ignore this page (equivalent to: noindex, nofollow).
all: There are no restrictions on indexing this page, or following links from this page to determine pages to index (equivalent to: index, follow).
index: All robots are welcome to include this page in search services.
noindex: This page may not be indexed by a search service.
follow: Robots are welcome to follow links from this page to find other pages.
nofollow: Robots are not to follow links from this page.
If this meta tag is missing, or if there is no content, or the robots terms are not specified, then the robots terms will be assumed to be "index, follow" (i.e. "all"). If the keyword all is found in the robots terms list, it overrides all other values. That is, a robots terms list of "nofollow, all, noindex, nofollow" would effectively be "all". If the robots terms contain contradictory information (e.g. "follow, nofollow, follow"), then the robot is free to do whatever it wishes with regard to the behavior being addressed (in this case the follow behavior).
A robots terms list consisting only of noindex allows the subsidiary links to be explored, even though the page is not to be indexed. A robots terms list consisting only of nofollow allows the page to be indexed, but no links from the page are explored (this may be useful if the page is a free entry point into pay-per-view content).
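The rules quoted above can be sketched in a few lines of Python. This is only an illustration of the text as I read it, not how any particular engine works: the function name resolve_robots_terms is made up, and treating "none" as shorthand for "noindex, nofollow" when it appears alongside other terms is my own assumption.

```python
def resolve_robots_terms(content):
    """Return (index, follow) for a robots meta content string.

    Each value is True (allowed), False (not allowed), or None when the
    terms contradict each other and the robot may do whatever it wishes.
    """
    terms = {t.strip().lower() for t in (content or "").split(",") if t.strip()}

    # A missing tag, empty content, or the keyword "all" means "index, follow".
    if not terms or "all" in terms:
        return (True, True)

    def decide(allow, deny):
        yes = allow in terms
        # Assumption: "none" counts as both noindex and nofollow.
        no = deny in terms or "none" in terms
        if yes and no:
            return None   # contradictory terms
        if no:
            return False
        return True       # nothing said about this behavior: default applies

    return (decide("index", "noindex"), decide("follow", "nofollow"))


print(resolve_robots_terms("nofollow, all, noindex, nofollow"))
print(resolve_robots_terms("follow, nofollow, follow"))
print(resolve_robots_terms("noindex"))
```

The first call shows "all" overriding everything else, the second shows the contradictory follow case, and the third shows a noindex-only page whose links may still be explored.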
With regard to the robots meta tag, index and follow are pretty much useless. If your pages don't have these tags, it means the same thing. Don't clutter your metas with information that does not need to be there.
But it does raise some new questions.
First of all, you said that ALL (or its equivalents) would override all other selections. Does this also override the robots.txt file's exclusions? For instance, I have my own domains (2) and so can control the robots.txt exclusions. I want the SEs to spider my site but not all the directories, hence the exclusions in the robots.txt file. I also want to override Inktomi's nofollow default. So my meta tag should simply read ALL and I am OK?
Secondly, of course, you mentioned not to clutter up one's meta tags with unnecessary information. I am trying to bone up on usability issues currently while drafting revisions to my sites. I have not yet come across anything on meta tag standards, namely limitations, other than leaving them out is foolish and leads to messy entries in the SEs. You mentioned the W3C, Brett's, and your own research. Can you give us some leads on where to find more specific information on what is acceptable and what is not? Also, is there any kind of validation service for meta tags?
Unfortunately I cannot answer that one. I've not used both methods together; it's either one or the other. I've been using a robots.txt file for the past 3 years at least, so I cannot give much more instruction on the Robots META tag than what is outlined above.
I would think that you can do everything within your robots.txt file, as that is the first place the spider is going to look. When the spider hits your site, it requests the robots.txt file; if the file is not present, it assumes that indexing everything is allowed.
As it begins indexing, if it finds the Robots META Tag, then it is supposed to take instruction from that. If you don't have one, the default is "all". If you do have one, the only value I recommend is "none", as the other ones can be covered without using the Robots META Tag, in which case the default "all" applies.
> I have not yet come across anything on meta tag standards, namely limitations, other than leaving them out is foolish and leads to messy entries in the SEs.
The best place to review standards on METAs is the W3C. When it comes to what a website should have as METAs from a standards point of view, I believe it's going to be relative to the country you are in, your audience, and how critical having those METAs is in the overall scheme of things.
Do a search here at Webmaster World for META Tags and you'll get quite a few results. You'll find that there are 20-30 or more METAs that you can use, and that includes the Dublin Core Set. Which ones you need will be relative to your audience requirements.
Now, is it very beneficial to spiders to have a custom 404 page with a link into the site? And why do these spiders often just request the robots text file and nothing else? According to what I see, they often request nothing other than that. Thanks to all.
Okay gcross, you asked for it...
Dublin Core Metadata Element Set [dublincore.org]
A couple of years ago I went on a mission to learn as much as I could about the robots.txt file mainly to eliminate the 404 errors and also to prevent just what you describe...
> And why do these spiders often just request the robots text file and nothing else?
I believe it was in 1999 when I was involved in another discussion concerning this very same issue. Many webmasters were reporting the same thing: the robot requested robots.txt, got a not-found, and off it went. In some instances the robot did not return.
I can't say that is the case today. My basic understanding is that there are multiple robots, and each one serves a purpose. One comes out to search for robots.txt files. It returns home and drops its data into the next robot's hands, which indexes all sites that had a robots.txt file with a set of instructions. Many of those robots.txt files might have been blank, but the file was there, and it's like giving the robot a key and quick access to your data.
Then there are sites without a robots.txt file. The robot collects the 404 data and dumps it into a generic robot's hands. Not having a robots.txt file is the same thing as having one that is empty. Brett recently posted a reply to another topic stating that an empty robots.txt file may cause some issues with certain robots, which may see it as a Disallow.
I can only share with you my experiences and research in this area. I've got a robots.txt file on all sites that I manage. I typically Disallow /css/, /javascript/, /working/ and other temp directories that I don't want indexed. I use it mainly for Google, because Googlebot will crawl a link in a heartbeat and I hate seeing working drafts in the SERPs (Search Engine Results Pages).
If you want to stop the 404 errors, drop a robots.txt file into your root directory. You should definitely Disallow at least one directory; external css or javascript directories are typically a good start. Your file might look like this...
robots.txt
User-agent: *
Disallow: /css/
Disallow: /javascript/
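If you want to check what a file like this actually blocks before uploading it, Python's standard urllib.robotparser module can read the rules. A minimal sketch: the helper name allowed and the "ExampleBot" user agent are placeholders of mine, and real spiders may interpret edge cases differently from this parser.

```python
from urllib.robotparser import RobotFileParser

# The same rules as in the robots.txt shown above.
rules = """\
User-agent: *
Disallow: /css/
Disallow: /javascript/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())


def allowed(path, agent="ExampleBot"):
    """Return True if the given robot may fetch the path under these rules."""
    return parser.can_fetch(agent, path)


for path in ("/index.html", "/css/site.css", "/javascript/menu.js"):
    print(path, allowed(path))
```

Anything under /css/ or /javascript/ should come back as not allowed, while the rest of the site stays open.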
In response to your original question: if you cannot upload a robots.txt to your root directory, then the META Robots Tag is the alternative...
Slurp -- Inktomi's Web Robot [inktomi.com]
> robots.txt: Slurp obeys the Robot Exclusion Standard. Specifically, Slurp adheres to the 1994 Robots Exclusion Standard (RES). Where the 1996 proposed standard disambiguates the 1994 standard, the proposed standard is followed.
Slurp is actually somewhat more forgiving than the RES requires.
Slurp will obey the first record in the robots.txt file with a User-Agent containing "Slurp". If there is no such record, it will obey the first entry with a User-Agent of "*".
Disallowed documents, including slash (the home page of the site), are not indexed, nor are links in those documents followed. Slurp does read slash at each site and uses it internally, but if it is disallowed it is neither indexed nor followed.
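The record selection described above (first record naming Slurp, otherwise the first "*" record) could look roughly like this in Python. This is a simplified sketch of my own, not Inktomi's code: pick_record is a hypothetical name, records are assumed to be separated by blank lines, and only User-agent and Disallow lines are handled.

```python
def pick_record(robots_txt, robot_name="slurp"):
    """Return the Disallow paths from the record a robot would obey."""
    records, agents, rules = [], [], []
    for raw in robots_txt.splitlines():
        if raw.strip().startswith("#"):
            continue                      # skip comment-only lines
        line = raw.split("#", 1)[0].strip()
        if not line:                      # a blank line ends the current record
            if agents:
                records.append((agents, rules))
            agents, rules = [], []
            continue
        field, _, value = line.partition(":")
        field, value = field.strip().lower(), value.strip()
        if field == "user-agent":
            agents.append(value.lower())
        elif field == "disallow":
            rules.append(value)
    if agents:
        records.append((agents, rules))

    # First record whose User-agent contains the robot's name...
    for rec_agents, rec_rules in records:
        if any(robot_name.lower() in a for a in rec_agents):
            return rec_rules
    # ...otherwise the first record for "*".
    for rec_agents, rec_rules in records:
        if "*" in rec_agents:
            return rec_rules
    return []


sample = "User-agent: *\nDisallow: /css/\n\nUser-agent: Slurp\nDisallow: /working/\n"
print(pick_record(sample))
print(pick_record(sample, robot_name="otherbot"))
```

With the sample file, Slurp gets the /working/ record even though the "*" record comes first, and any other robot falls back to the "*" record.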
noindex meta-tag: Slurp obeys the noindex meta-tag. If you place
<META NAME="robots" CONTENT="noindex">
in the head of your web document, Slurp will retrieve the document, but it will not index the document or place it in the search engine's database.
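To see which robots terms a page really sends, you can pull the tag out with Python's standard html.parser module. A rough sketch: RobotsMetaFinder and robots_meta_terms are names I made up for illustration, and if a page has more than one robots meta tag this simply keeps the last one.

```python
from html.parser import HTMLParser


class RobotsMetaFinder(HTMLParser):
    """Collects the terms of a <meta name="robots"> tag, if present."""

    def __init__(self):
        super().__init__()
        self.terms = None   # stays None when the page has no robots meta tag

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        attrs = dict(attrs)
        if tag == "meta" and (attrs.get("name") or "").lower() == "robots":
            content = attrs.get("content") or ""
            self.terms = [t.strip().lower() for t in content.split(",") if t.strip()]


def robots_meta_terms(html):
    finder = RobotsMetaFinder()
    finder.feed(html)
    return finder.terms


page = '<html><head><META NAME="robots" CONTENT="noindex"></head><body></body></html>'
print(robots_meta_terms(page))
```

A result containing "noindex" means a robot that honors the tag, as Slurp says it does, would retrieve the page but keep it out of its index.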
You'll find this to be a very interesting read...
Notes on helping search engines index your Web site [w3.org]