How do I block TagIDs?

Forum Moderators: goodroi

shaunm

7:30 am on Apr 9, 2013 (gmt 0)

Hi all,

Could you please help me block pages from being indexed, using robots.txt?

This is for a blog where tags and their values are assigned automatically and listed on the home page. Clicking a tag leads to a page like

example.com/abcprompt.aspx?tagid=22
example.com/abcprompt.aspx?tagid=23
example.com/abcprompt.aspx?tagid=24

So I want to stop all of those tag URLs from being crawled and indexed by Google. I just want to exclude everything that contains '?tagid='. How do I do that? I see that I could block every URL that has a '?' in it through robots.txt, but I am concerned that it might also block other important pages.

Could you please help with this?

Thank you for all and any help :-)

phranque

11:22 am on Apr 9, 2013 (gmt 0)

This document details how Google handles the robots.txt file:
http://developers.google.com/webmasters/control-crawl-index/docs/robots_txt

you'll want to use the Disallow: directive.

The [path] value, if specified, is to be seen relative from the root of the website for which the robots.txt file was fetched (using the same protocol, port number, host and domain names). The path value must start with "/" to designate the root.

this means the crawler compares each rule against the url to be requested from left to right, starting at the leading /, which represents the document root directory.
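A minimal sketch of that left-to-right matching, assuming a rule with no wildcards and assuming the rule is compared against the path plus query string. The function name rule_matches and the sample rule are made up for illustration; they are not part of any crawler's API.

```python
def rule_matches(rule_path: str, url: str) -> bool:
    """Plain prefix match: the rule must match the URL starting at its leading '/'."""
    return url.startswith(rule_path)


# Hypothetical rule: block only /abcprompt.aspx when tagid is the first parameter.
rule = "/abcprompt.aspx?tagid="

for url in [
    "/abcprompt.aspx?tagid=22",   # starts with the rule
    "/abcprompt.aspx",            # bare page, shorter than the rule
    "/something.aspx?tagid=22",   # different path, so the prefix differs
]:
    print(url, rule_matches(rule, url))
```

With a plain prefix rule like this, only the first URL is matched, which is why the two questions about paths and parameters matter before writing the file.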

you'll need to answer these questions before you write a robots.txt file:
do you want to exclude exactly /abcprompt.aspx or all paths?
do you want to exclude only urls with exactly one parameter that is tagid or any query string with the tagid parameter?

and here's the other problem - you can use robots.txt to exclude googlebot from crawling, but you can't use it to prevent google from indexing any urls it discovers.
if you want to control indexing you will have to allow crawling of the url and provide either a meta robots noindex element in the document head or an X-Robots-Tag HTTP Response header with a noindex value.
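As a rough illustration of the header approach, here is a sketch in Python. The site in question is ASP.NET, so this only shows the idea, not the actual implementation; the function names robots_headers_for and app are hypothetical. It assumes any request carrying a tagid parameter should be served with noindex.

```python
from urllib.parse import parse_qs


def robots_headers_for(query_string: str) -> list:
    """Return extra response headers for a request with this query string."""
    params = parse_qs(query_string, keep_blank_values=True)
    # Assumption: any URL carrying a tagid parameter should stay out of the index.
    if any(name.lower() == "tagid" for name in params):
        return [("X-Robots-Tag", "noindex")]
    return []


def app(environ, start_response):
    """Tiny WSGI app that adds the noindex header on tag pages only."""
    headers = [("Content-Type", "text/html; charset=utf-8")]
    headers += robots_headers_for(environ.get("QUERY_STRING", ""))
    start_response("200 OK", headers)
    return [b"<html><head><title>Tag page</title></head><body>...</body></html>"]
```

For this to have any effect the tag urls must not be disallowed in robots.txt, otherwise the crawler never requests them and never sees the header.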

Robots meta tag and X-Robots-Tag HTTP header specifications - Webmasters - Google Developers:
http://developers.google.com/webmasters/control-crawl-index/docs/robots_meta_tag

g1smd

11:34 am on Apr 9, 2013 (gmt 0)

Is there a link to the bare

example.com/abcprompt.aspx

URL from within the site?

If not, the page content will never be spidered and indexed.

shaunm

1:27 pm on Apr 10, 2013 (gmt 0)

@phranque
Thank you so much for that link :-)

To keep it simple, I used this in my robots.txt file. Would that work?

Disallow: /*tagid

@g1smd
Thanks :-)
Yes, the tags are available on the home page, as on any other blog with tags, and the URL values are assigned to the tags automatically.

The original page is example.com/something.aspx. The tags on this page lead to example.com/something.aspx?tagid=22 or similar.

Thanks again

tedster

6:39 pm on Apr 10, 2013 (gmt 0)

Disallow: /*tagid

That looks technically correct as long as the character string "tagid" doesn't appear in any other URLs except the ones you want to disallow crawling for. If you say "Disallow: /*?tagid" then including the "?" would limit the rule to URLs where "tagid" comes directly after the "?" at the start of the query string - that might be even safer.
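To compare the two patterns, here is a small sketch that mimics Google-style wildcard matching, where "*" matches any run of characters and a trailing "$" anchors the end of the URL. The helper names are made up for illustration, and this is an approximation rather than Googlebot's actual matcher. (Python's built-in urllib.robotparser does not handle these wildcards, hence the hand-rolled version.)

```python
import re


def robots_pattern_to_regex(pattern: str) -> "re.Pattern[str]":
    """Translate a wildcard Disallow pattern into an anchored-at-start regex."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape each literal piece, then join the pieces with '.*' for every '*'.
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile("^" + body + ("$" if anchored else ""))


def is_disallowed(pattern: str, url: str) -> bool:
    """True if the path-plus-query string matches the Disallow pattern."""
    return robots_pattern_to_regex(pattern).match(url) is not None


# The last two sample URLs are hypothetical, added to show the edge cases.
for url in [
    "/abcprompt.aspx?tagid=22",
    "/tagid-overview.aspx",
    "/abcprompt.aspx?page=2&tagid=22",
]:
    print(url, is_disallowed("/*tagid", url), is_disallowed("/*?tagid", url))
```

Under these assumptions "/*tagid" catches all three sample URLs, including the page that merely has "tagid" in its file name, while "/*?tagid" catches only the first. The trade-off is that "/*?tagid" misses a URL where tagid is not the first parameter.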

Another step you can take is to use your Webmaster Tools account to tell Google to ignore the "tagid" parameter. Look under the Configuration > URL Parameters section.

shaunm

6:40 am on Apr 11, 2013 (gmt 0)

@Ted

Thank you so much for your help. I will replace 'tagid' with '?tagid' then, and I will also set this up in Google Webmaster Tools :-)

Cheers!