Forum Library, Charter, Moderators: goodroi

Sitemaps, Meta Data, and robots.txt Forum

robots.txt, duplicate content and Google

 6:52 am on Sep 23, 2008 (gmt 0)


I have found that Google has crawled some pages of my site that I do not want indexed.

Someone told me to block those pages with robots.txt, and I did, but two weeks later all of the blocked pages were still appearing in Google, although their cached copies were no longer updating.

I want to remove them completely. Note that I can't use a noindex tag, because the site is dynamic and every page shares one header.

My first question: will robots.txt help remove pages even though Google has already crawled them?

My second question: say I have these two URLs in Google:


The second one is still in Google but is blocked by robots.txt. Will Google still treat this as duplicate content, or will it just ignore the second URL?

Please give me an exact answer.



 6:21 pm on Sep 23, 2008 (gmt 0)

This code would be needed to block Google from accessing it:

User-agent: Googlebot
Disallow: /*jtype

but that doesn't stop them listing the resource as a URL-only entry in the SERPs.

If they are no longer accessing it, then they will not consider the content in any way.
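The wildcard rule in that pattern is easy to reason about if you sketch it as code. This is not any official parser, just a minimal Python illustration of Google's `*` and trailing `$` extensions to robots.txt matching; the function name and example URLs are illustrative only:

```python
import re

def is_blocked(path: str, pattern: str) -> bool:
    """Return True if a URL path matches a Google-style Disallow
    pattern: '*' matches any run of characters, and a trailing '$'
    anchors the pattern to the end of the URL. Without '$', the
    pattern only needs to match starting at the beginning of the
    path."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape regex metacharacters, then restore '*' as a wildcard.
    regex = "^" + re.escape(pattern).replace(r"\*", ".*")
    if anchored:
        regex += "$"
    return re.search(regex, path) is not None

print(is_blocked("/demo.php?cat_id=ahhf&jtype=1", "/*jtype"))  # True
print(is_blocked("/demo.php?cat_id=ahhf", "/*jtype"))          # False
```

So `Disallow: /*jtype` blocks every URL whose path or query string contains "jtype", which is exactly the behaviour the OP wants.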


 6:23 pm on Sep 23, 2008 (gmt 0)

*** I cant no index on site as it is dynamic site and all pages follow one header ***

Of course you can. You simply detect what the requested URL was, and then output the extra headers only for the specific variations.
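That suggestion can be sketched in a few lines. The thread never names the site's platform, so this Python sketch is an assumption, and `jtype` is taken from the OP's example URLs as the parameter marking the duplicate variants:

```python
# Sketch only: the site's real platform is unknown, and 'jtype' is
# assumed (from the OP's URLs) to mark the duplicate page variants.

def robots_meta(query_string: str) -> str:
    """Return the robots meta element for the shared page header.

    Variants carrying the 'jtype' parameter get a noindex directive;
    every other page keeps the normal index,follow directive."""
    # A simple substring test is enough for this sketch; a real site
    # would parse the query string (urllib.parse.parse_qs) instead.
    if "jtype" in query_string:
        return '<meta name="robots" content="noindex">'
    return '<meta name="robots" content="index,follow">'

print(robots_meta("cat_id=ahhf&jtype=1"))  # noindex variant
print(robots_meta("cat_id=ahhf"))          # normal variant
```

The same conditional logic works for an `X-Robots-Tag: noindex` HTTP response header, which avoids touching the HTML at all.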


 6:37 pm on Sep 23, 2008 (gmt 0)

OK, will you please clear this up for me? I have asked many other savvy experts, but no one can give me an exact answer.

First, let me mention that I have already blocked them using:

User-Agent: *
Disallow: /*jtype*

When I check these URLs in Google Webmaster Tools, it shows them as blocked.

Now my question: even though they are blocked by robots.txt, they are still appearing in Google with two URLs:


If the first one is blocked, is there any risk of a future penalty for duplicate content or duplicate URLs?

Thanks for your help.


 7:18 pm on Sep 23, 2008 (gmt 0)

If they don't access the page, then they won't see the content. The robots.txt directive keeps them out.

If they can't see the content, how can they know it is a copy of some content on another page?

They can't and don't. You are safe from Duplicate Content issues in that case.


 7:21 pm on Sep 23, 2008 (gmt 0)

About your code:

User-Agent: *
Disallow: /*jtype*

Not all User-agents are wildcard aware, and the trailing * is not required.

This might be better, but you must also copy all of your other directives that you want Google to see into this rule block:

User-Agent: Googlebot
Disallow: /*jtype
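For instance, if the robots.txt also contained other directives (the `/images/` line here is purely hypothetical), the Googlebot block would need its own copy of them, because a robot obeys only the single most specific User-agent block that matches it:

```
User-agent: Googlebot
Disallow: /images/
Disallow: /*jtype

User-agent: *
Disallow: /images/
```

Once Googlebot finds a block addressed to it by name, it ignores the `User-agent: *` block entirely.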


 7:42 pm on Sep 23, 2008 (gmt 0)

Thanks a lot, g1smd.

You have eased my mind; many experts on the net were not sure, since there are two URLs in Google, both still appearing, and the one I mentioned above was blocked two weeks ago.

Please confirm one more time, as both of them are still in Google:


Are there any chances that they will be removed in the future?

OK, now about the code: how can I use

User-Agent: Googlebot

when I also want Yahoo and MSN to follow the same rule? What should I do about that?

And you mean I should remove the trailing *, which is at the end of jtype in my directive? I got this directive from this page:


where it is written:

To block access to all URLs containing the word "private", you could use:

User-agent: *
Disallow: /*private*

And last but not least: is there any directive I can use to block all the characters that come after this URL?


I also want to mention that this is the first time I have used this forum, and it has really helped me a lot.

My best regards to you.


 7:57 pm on Sep 23, 2008 (gmt 0)

Google will continue to show the other one as a URL-only entry for quite a while. That is normal and not an issue.

I don't know if Yahoo and MSN are wildcard aware. If they are not, then showing them a directive with a * in it would not be a good idea.


 8:07 pm on Sep 23, 2008 (gmt 0)

Well, I have searched in Yahoo.

The cached pages of the robots.txt-blocked URLs are now just turning into:

We're sorry, but we could not process your request for the cache of http://example.com/demo.php?cat_id=ahhf&jtype=1. Please click here to check the current page or check for previous versions at the Internet Archive.

So would you say it is properly blocked by Yahoo as well?

And yes, most of the pages were crawled with the following:


An example is below:


Since I have blocked them, I now want all search engines to crawl the URLs without the extra parameter. One example is:


So can I submit a new sitemap with the new required URLs, or should I just wait a few weeks or months? I don't know whether the robots.txt-blocked URLs will ever be removed from the search engines.

All trademarks and copyrights held by respective owners. Member comments are owned by the poster.
WebmasterWorld is a Developer Shed Community owned by Jim Boykin.
© Webmaster World 1996-2014 all rights reserved