Forum Library, Charter, Moderators: goodroi

Sitemaps, Meta Data, and robots.txt Forum

    
How do I block TagIDs?
shaunm




msg:4563020
 7:30 am on Apr 9, 2013 (gmt 0)

Hi all,

Could you please help me block pages from being indexed using robots.txt?

This is for a blog where the tags and their values are assigned automatically and shown on the home page. Clicking a tag leads to a page like

example.com/abcprompt.aspx?tagid=22
example.com/abcprompt.aspx?tagid=23
example.com/abcprompt.aspx?tagid=24

So, I want to stop all those tag URLs from being crawled and indexed by Google. I just want to exclude everything that contains '?tagid='. How do I do that? I see that I could block all URLs that have '?' in them through robots.txt, but I am concerned that it might also block other important pages.

Could you please help with this?


Thank you for any and all help :-)

 

phranque




msg:4563050
 11:22 am on Apr 9, 2013 (gmt 0)

This document details how Google handles the robots.txt file:
http://developers.google.com/webmasters/control-crawl-index/docs/robots_txt [developers.google.com]

you'll want to use the Disallow: directive.

The [path] value, if specified, is to be seen relative from the root of the website for which the robots.txt file was fetched (using the same protocol, port number, host and domain names). The path value must start with "/" to designate the root.


this means the crawler matches the path of the url to be requested from left to right, starting at the leading /, which represents the document root directory.
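To make the left-to-right matching concrete, here is a minimal Python sketch of plain prefix matching with no wildcards. The function name `is_disallowed` and the sample rule are made up for illustration; this is not Google's implementation, only an approximation of the behaviour the documentation describes.

```python
def is_disallowed(path_and_query, disallow_values):
    """Return True if the URL path (plus query string) starts with any Disallow value.

    An empty Disallow value blocks nothing, so empty strings are skipped.
    """
    return any(path_and_query.startswith(value) for value in disallow_values if value)


# Hypothetical rule: Disallow: /abcprompt.aspx?tagid=
rules = ["/abcprompt.aspx?tagid="]

for candidate in ("/abcprompt.aspx?tagid=22", "/abcprompt.aspx", "/other.aspx?tagid=22"):
    print(candidate, is_disallowed(candidate, rules))
```

Because the comparison is anchored at the leading /, a rule written for /abcprompt.aspx does nothing for tag URLs on a different page, which is why the questions below matter.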

you'll need to answer these questions before you write a robots.txt file:
do you want to exclude exactly /abcprompt.aspx or all paths?
do you want to exclude only urls with exactly one parameter that is tagid or any query string with the tagid parameter?

and here's the other problem - you can use robots.txt to exclude googlebot from crawling but you can't use it to prevent google from indexing any urls it discovers.
if you want to control indexing you will have to allow crawling of the url (googlebot can only see the noindex if it is allowed to fetch the page) and provide either a meta robots noindex element in the document head or an X-Robots-Tag HTTP Response header with a noindex value.
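As a rough illustration of the header approach, the Python sketch below decides whether a response should carry `X-Robots-Tag: noindex` based on the presence of a tagid parameter. The function name `extra_headers` is hypothetical, matching the parameter name case-insensitively is an assumption, and the site in question runs ASP.NET, so this only shows the logic, not how to wire it into that server. The in-page equivalent is a `<meta name="robots" content="noindex">` element in the document head.

```python
from urllib.parse import urlsplit, parse_qs


def extra_headers(url):
    """Return extra response headers for the given URL.

    Any URL whose query string contains a 'tagid' parameter gets a noindex header.
    Case-insensitive matching of the parameter name is an assumption.
    """
    query = parse_qs(urlsplit(url).query, keep_blank_values=True)
    if any(name.lower() == "tagid" for name in query):
        return {"X-Robots-Tag": "noindex"}
    return {}


print(extra_headers("/abcprompt.aspx?tagid=22"))
print(extra_headers("/abcprompt.aspx"))
```

For this to have any effect, the same URLs must not be disallowed in robots.txt.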

Robots meta tag and X-Robots-Tag HTTP header specifications - Webmasters - Google Developers:
http://developers.google.com/webmasters/control-crawl-index/docs/robots_meta_tag [developers.google.com]

g1smd




msg:4563057
 11:34 am on Apr 9, 2013 (gmt 0)

Is there a link to the bare
example.com/abcprompt.aspx URL from within the site?

If not, the page content will never be spidered and indexed.

shaunm




msg:4563496
 1:27 pm on Apr 10, 2013 (gmt 0)

@phranque
Thank you so much for that link :-)

To put it simply, I used this in my robots.txt file. Would that work?

Disallow: /*tagid


@g1smd
Thanks :-)
Yes, the tags are available on the home page, as on any other blog with tags, and these URL values are automatically assigned to the tags.

The original page is example.com/something.aspx. The tags on this page lead to URLs like example.com/something.aspx?tagid=22.

Thanks again

tedster




msg:4563580
 6:39 pm on Apr 10, 2013 (gmt 0)

Disallow: /*tagid

That looks technically correct as long as the character string "tagid" doesn't appear in any other URLs except the ones you want to disallow crawling for. If you say "Disallow: /*?tagid" then including the "?" would limit the rule to just query string parameters - that might be even safer.
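To compare the two rules, here is a small Python sketch that approximates Google-style wildcard matching, where `*` matches any run of characters and a trailing `$` anchors the end of the URL. The helper names and the sample paths (including `/tagid-guide.aspx`) are invented for illustration, and this is not Google's actual matcher. Python's standard `urllib.robotparser` is not used because it does not handle these wildcards.

```python
import re


def rule_to_regex(rule):
    """Translate a Disallow value with '*' and optional trailing '$' into a regex."""
    anchored = rule.endswith("$")
    if anchored:
        rule = rule[:-1]
    # Escape the literal pieces and join them with '.*' where the rule had '*'.
    body = ".*".join(re.escape(part) for part in rule.split("*"))
    return re.compile(body + ("$" if anchored else ""))


def is_blocked(rule, path):
    """Return True if the rule matches the path, anchored at the leading '/'."""
    return rule_to_regex(rule).match(path) is not None


paths = [
    "/abcprompt.aspx?tagid=22",         # the URLs to be blocked
    "/tagid-guide.aspx",                # hypothetical page with 'tagid' in its name
    "/abcprompt.aspx?page=2&tagid=22",  # hypothetical URL where tagid is not first
]

for rule in ("/*tagid", "/*?tagid"):
    for path in paths:
        print(rule, path, is_blocked(rule, path))
```

Under these assumptions, `/*?tagid` spares the page that merely has "tagid" in its name, but it also misses any URL where tagid is not the first parameter, since that URL contains `&tagid` rather than `?tagid`. If such URLs exist on the site, a second rule for `&tagid` would be needed.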

Another step you can take is to use your Webmaster Tools account to tell Google to ignore the "tagid" parameter. Look under the Configuration > URL Parameters section.

shaunm




msg:4563720
 6:40 am on Apr 11, 2013 (gmt 0)

@Ted

Thank you so much for your help. I will replace 'tagid' with '?tagid' then, and I will also set the parameter up in Google Webmaster Tools :-)

Cheers!


All trademarks and copyrights held by respective owners. Member comments are owned by the poster.
WebmasterWorld is a Developer Shed Community owned by Jim Boykin.
© Webmaster World 1996-2014 all rights reserved