|X-Robots-Tag - controlling Googlebot via HTTP headers|
From the recent Google Blog posting Robots Exclusion Protocol: now with even more flexibility [googleblog.blogspot.com], Google have announced the availability of the
unavailable_after meta element, which enables you to give an expiry date to your pages. (See the thread Google Plans a New Meta Tag - "unavailable_after" [webmasterworld.com] for more information.)
However, there is a second, more interesting announcement in the same entry: the ability to control Googlebot behavior via HTTP headers rather than on-page meta elements:
|We've extended our support for META tags so they can now be associated with any file. Simply add any supported META tag to a new X-Robots-Tag directive in the HTTP header used to serve the file.|
As mentioned in the post, this is very useful for non-HTML content such as PDF, Word or plain text [webmasterworld.com] files, where you cannot insert
meta elements. You can also reduce clutter in the document itself, as well as control indexing via the server configuration rather than editing the files.
One caveat not mentioned by Google is that only Googlebot supports this syntax - unless the other search engines decide to follow suit - so you will still need
meta elements for Yahoo or MSN. Also, how long do you reckon we'll have to wait until the first case of a hacked server being modified to send a
noindex HTTP header with every request?
Yes, I would definitely be concerned about the ease with which a website on a compromised server could be destroyed.
This follows on from our earlier post on the matter.
|wait until the first case of a hacked server being modified to send a noindex HTTP header with every request |
If your server is hacked, search engine placement is the least of your worries.
Am I right in thinking that this is basically a "Is-robot: true" or "Is-robot: 1" HTTP header? So we no longer have to sniff the User-agent string and guess whether it's a bot or a human using a Web browser?
|If your server is hacked, search engine placement is the least of your worries. |
LOL, no doubt! Last time my main unix server was hacked I didn't have any search engine placement worries; in fact, I didn't have any websites left on it at all. Thank goodness for my backup dedicated hosting; the downtime was minimal.
If someone has unauthorized access to your website, there are already plenty of ways they can damage your business without any need for a new meta tag: someone could already add a nofollow tag and get you out of the SERPs if they have access to your server.
|someone can already put a nofollow tag and get you out of the serps if they have access to your server |
The comment about hackers was merely an aside, not the main point of my post, but I'll reply to this: the HTTP header is much more unobtrusive, and therefore much harder to detect, than actually modifying the pages themselves or changing the robots.txt (something which has been reported as occurring in the past in order to remove a site from the index).
|Am I right in thinking that this is basically a "Is-robot: true" or "Is-robot: 1" HTTP header? So we no longer have to sniff the User-agent string and guess whether it's a bot or a human using a Web browser? |
This is not anything sent by the bot itself, so it doesn't help in identifying Googlebot. It is an HTTP header that you can add to your server's response to a GET request, offering similar functionality to the robots meta elements more commonly seen. You can add the HTTP header via a server-side scripting language (PHP, etc.) or via the server configuration (Apache httpd.conf, IIS...).
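To make the idea concrete, here's a minimal sketch of the server-side logic in Python. The function name and the extension list are illustrative only, not a real API; the point is simply that the header is attached to the response for file types that can't carry a robots meta element:

```python
import os

# Illustrative: file types that cannot hold a robots meta element
NOINDEX_EXTENSIONS = {".pdf", ".doc", ".txt"}

def response_headers(filename, content_type):
    """Build response headers for a file, adding X-Robots-Tag for
    non-HTML file types (hypothetical helper, for illustration)."""
    headers = {"Content-Type": content_type}
    _, ext = os.path.splitext(filename)
    if ext.lower() in NOINDEX_EXTENSIONS:
        # Equivalent to <meta name="robots" content="noindex"> in HTML
        headers["X-Robots-Tag"] = "noindex"
    return headers
```

In PHP the same effect is a one-line header() call before any output is sent; in Apache you can do it purely in configuration, as shown further down the thread.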
This is great, but nobody is saying how you would do such a thing. How do you modify your HTTP headers on IIS and Apache?
|How do you modify your http header on IIS and Apache? |
That's something you'd usually handle at application level, e.g. in PHP / ASP / whatever.
Here's a simple example for Apache that you can include in your .htaccess file to keep Googlebot (and hopefully others in time) from indexing image files. With some modification it can be used to control robot access to other files or file types (note that it requires mod_headers to be enabled):
<Files ~ "\.(gif|jpe?g|png)$">
Header append X-Robots-Tag "noindex"
</Files>
The X-Robots-Tag directive is a small step towards making robots.txt obsolete.
I don't really see the need for this. Why not just stick those files in their own directory and disallow it?
If your only intention is to disallow access to a file or files then a robots.txt would work just fine.
However, you can't use noarchive, nofollow, nosnippet, or unavailable_after in a robots.txt file. The X-Robots-Tag header is a much more powerful tool: it allows us to use these directives without needing to edit the files themselves, and it works for media files, PDF files, etc., that can't have meta tag directives inserted in them. It can also be used for user-agent/IP delivery.
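Because the header is assembled server-side, per-user-agent delivery is straightforward. A rough Python sketch (the function, the directive table, and the sample date are all hypothetical, just to show the shape of the logic):

```python
# Hypothetical mapping of crawler name fragments to directive strings.
# unavailable_after uses the RFC 850 date format from Google's announcement.
DIRECTIVES_BY_BOT = {
    "googlebot": "noarchive, nosnippet, unavailable_after: 25-Aug-2007 15:00:00 EST",
}

def robots_header(user_agent):
    """Return the X-Robots-Tag value for this request, or None if the
    user-agent matches no known crawler (illustrative helper)."""
    ua = user_agent.lower()
    for bot, directives in DIRECTIVES_BY_BOT.items():
        if bot in ua:
            return directives
    return None
```

Normal browsers would get no X-Robots-Tag at all, while each listed crawler can receive its own directive string, which is something neither robots.txt nor an on-page meta element can vary per request.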