
Apache Web Server Forum

Help! How to set x-robot-tags in apache/vhost
cyber09ron
msg:4552822
7:47 am on Mar 9, 2013 (gmt 0)

Hi,

Good day. I'm a new member here; my name is Jerome. I hope you can help me with a problem.

I need to set an X-Robots-Tag header on pages that return a 410 status code, and I'd like to do it at the Apache level rather than by creating an ErrorDocument 410 handler.

Thanks for your help.

lucy24
msg:4552837
9:41 am on Mar 9, 2013 (gmt 0)

Uhm... What exactly are you trying to do? Ordinarily when you talk about x-robots you're talking about indexing. But if the page isn't there, there's nothing to index. And robots don't spend a lot of time reading error documents.

phranque
msg:4552845
11:25 am on Mar 9, 2013 (gmt 0)

welcome to WebmasterWorld, Jerome!

you can set an arbitrary header in a FilesMatch or similar container, if that would be sufficient to identify the requested resource(s) that are generating 410 responses.
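
For illustration, a minimal sketch of that approach, assuming mod_headers is loaded; the filename pattern here is hypothetical. Note that "always" is needed so the header is attached to error responses such as a 410, not just to 2xx responses:

# vhost / server config - requires mod_headers
<FilesMatch "^old-seo-.*\.html$">
    # "always" puts the header on error responses (4xx/5xx) too
    Header always set X-Robots-Tag "noindex, noarchive, nosnippet"
</FilesMatch>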

while technically possible, you should answer lucy24's question before pursuing this.
a 410 will remove a url from the index.
(I'm assuming that's the purpose of the X-Robots-Tag)
how is the 410 being generated?

cyber09ron
msg:4553674
11:50 pm on Mar 11, 2013 (gmt 0)

Thanks lucy24 and phranque.
What I'm trying to do is set "X-Robots-Tag: none, nosnippet, noarchive" for all the 410 URLs we have on our site.

We bought this site with a lot of SEO pages and decided not to maintain them. Unfortunately, most of those URLs are non-standard, so I can't use FilesMatch as phranque suggested.
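
For reference, one way to cover arbitrary URL patterns that FilesMatch can't express is to flag the matching requests with an environment variable in mod_rewrite and condition the header on that flag. A minimal sketch, assuming mod_rewrite and mod_headers are loaded; the /old-seo/ path is a hypothetical stand-in for the real URLs:

# vhost / server config - the URL pattern is hypothetical
RewriteEngine On
# [G] returns 410 Gone; E=GONE:1 flags the request
RewriteRule ^/old-seo/ - [G,E=GONE:1]

# "always" is needed because the response is an error (410).
# If an ErrorDocument is served via internal redirect, Apache
# renames the variable to REDIRECT_GONE, so cover both.
Header always set X-Robots-Tag "noindex, noarchive, nosnippet" env=GONE
Header always set X-Robots-Tag "noindex, noarchive, nosnippet" env=REDIRECT_GONE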

Thanks

lucy24
msg:4553681
12:14 am on Mar 12, 2013 (gmt 0)

If the pages are already gone, nobody will ever see the header. It's like the server saying "The page you requested isn't here, but if it were here, this is the header you would get." You might be able to rig some kind of jiggery-pokery using a php script, but not in Apache alone.

I think what you really want to do is remove them via GWT (Google Webmaster Tools). I guess the FilesMatch problem applies here too, unless all the defunct files live in the same directories.

Come to think of it, how are you identifying the pages so they can return that 410 in the first place? Unlike 404, a 410 doesn't happen on its own; you have to take some intentional action.
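
For context, the usual intentional actions are mod_alias's "Redirect gone" or mod_rewrite's [G] flag; a minimal sketch with hypothetical paths, shown in virtual-host context, where the pattern includes the leading slash:

# mod_alias: mark one specific URL-path as gone
Redirect gone /old-page.html

# mod_rewrite: mark a whole section as gone
RewriteEngine On
RewriteRule ^/retired-section/ - [G]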

Can't help suspecting that when you say "410 pages" you really mean "404 pages that were intentionally removed". And that's a whole nother question.

Andy Langton
msg:4553690
12:25 am on Mar 12, 2013 (gmt 0)

As implied above, using a robots HTTP header in addition to a 410 status code is redundant at best and open to confusion at worst. Your server response is going to look like this:


HTTP/1.1 410 Gone
Date: Tue, 12 Mar 2013 21:42:43 GMT
...
X-Robots-Tag: none


So after you say the URL is gone, you will also suggest not indexing it. But no search engine indexes pages with 4xx status codes, so there is no point in also including a robots HTTP header.

Perhaps if you explain why you think this is a good idea we can offer alternate suggestions.

cyber09ron
msg:4553691
12:27 am on Mar 12, 2013 (gmt 0)

Thanks for the quick reply, lucy24. This is an SEO initiative, and I was asking whether it is possible to do it in Apache instead of touching the application level.

But as you said, I may also need to use a PHP script, which I think can solve my problem.

Thanks for the big help, lucy24. :)

lucy24
msg:4553703
1:02 am on Mar 12, 2013 (gmt 0)

Come to think of it... Would this part of the header come from the document that is requested, or the document that is actually served? Quick detour to Live Headers confirms that the "content-length" element belongs to the error document; there's no difference between asking for the doc by name and getting it as a response to a 40x request. And caching/expiration definitely belong to the file served, not the file requested. (I've taken advantage of this detail when logging.)

cyber09ron
msg:4553791
7:41 am on Mar 12, 2013 (gmt 0)

Hi lucy24,

What we really want is to remove them from GWT. And this tag will now be part of the document that is being served.

Anyway, thanks a lot for the big help; really appreciate it. :)

phranque
msg:4553793
8:15 am on Mar 12, 2013 (gmt 0)

how is the 410 being generated?

if googlebot is requesting one of these 410 urls, G Search will remove the url from the index and GWT will essentially report the error (410) and ignore the header.

lucy24
msg:4554093
9:42 pm on Mar 12, 2013 (gmt 0)

Come to that: It's not a bad idea to attach the no-index tag to all your error documents, just to protect yourself against awkward mistakes. That's the error document itself, not the requested document that generated the error.
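
A minimal sketch of that idea, assuming the custom error documents live under a hypothetical /errors/ path (and note the caution in the next reply):

# vhost / server config - paths are hypothetical
ErrorDocument 404 /errors/404.html
ErrorDocument 410 /errors/410.html

<Directory "/var/www/example/errors">
    # Plain "set" (without "always") only applies to successful
    # responses, i.e. someone fetching /errors/410.html directly
    # and getting a 200. When the same file is served as the body
    # of a real 404/410, the status code already keeps it out of
    # the index.
    Header set X-Robots-Tag "noindex"
</Directory>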

Andy Langton
msg:4554111
10:31 pm on Mar 12, 2013 (gmt 0)

Come to that: It's not a bad idea to attach the no-index tag to all your error documents

I'd be careful with this. Google already indexes URLs that are excluded by x-robots-tag headers and robots.txt - that's why they appear in search results (because third-party references to a URL are enough to get a URL indexed). You run the risk of creating URLs for evaluation should a search engine prioritise robots directives over HTTP status codes.

lucy24
msg:4554132
11:42 pm on Mar 12, 2013 (gmt 0)

excluded by x-robots-tag headers and robots.txt

Now, wait, that's something entirely different. Something it took me years to wrap my head around, in fact ;)

robots.txt prevents crawling. It doesn't prevent indexing.

I recently had to add a line of text to my test site's front page saying something like "sorry, folks, there's nothing here". The entire domain is roboted-out. So when a search query points straight to the domain name, it will show up in search results as a listing without a text snippet.
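
To make the distinction concrete: a site-wide robots.txt like the hypothetical one below stops compliant crawlers from fetching pages, but a URL that is linked from elsewhere can still end up in the index as a bare, snippet-less listing, because the crawler is never allowed to fetch the page and see any noindex signal:

# robots.txt - blocks crawling of the whole site,
# but does not by itself keep the URLs out of the index
User-agent: *
Disallow: /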
