I am trying to remove some URLs from my WordPress blog.
These URLs were created as a result of installing the "paged comments" plugin.
The URLs look like this:
-------------------------------------
http://www.example.com/?p=400&cp=1
-------------------------------------
I blocked this type of URL with robots.txt:
-----------------
Disallow: /*cp=
-----------------
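For anyone who wants to sanity-check a pattern like this offline, here is a minimal sketch. It assumes Google-style wildcard semantics (a star matches any run of characters, a trailing $ anchors the end of the URL, everything else is a prefix match). The function name rule_matches is made up for illustration; this is not Google's own matcher.

```python
import re

def rule_matches(pattern: str, url_path: str) -> bool:
    """Return True if a robots.txt path pattern matches url_path.

    Assumption: Google-style wildcards, where '*' matches any run of
    characters and a trailing '$' anchors the end of the URL.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape the literal pieces, then join them with '.*' for each '*'.
    regex = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    if anchored:
        regex += "$"
    # re.match anchors at the start, which gives the prefix behaviour.
    return re.match(regex, url_path) is not None

print(rule_matches("/*cp=", "/?p=400&cp=1"))  # True
print(rule_matches("/*cp=", "/?p=400"))       # False
```

Under those assumptions the rule does match the paged-comment URL, and it leaves the plain post URL alone.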
When I try to remove these URLs with Google Webmaster Tools, I get "Denied".
Any explanations?
Thanking you in advance.
[edited by: tedster at 7:34 pm (utc) on Sep. 5, 2007]
[edit reason] switch to example.com - it can never be owned [/edit]
[google.com...]
I really don't know what the problem is. Here is my robots.txt:
=======================================
User-agent: googlebot
Disallow: /wp-
Disallow: /search
Disallow: /feed
Disallow: /comments/feed
Disallow: /feed/$
Disallow: /*/feed/$
Disallow: /*/feed/rss/$
Disallow: /*/trackback/$
Disallow: /*/*/feed/$
Disallow: /*/*/feed/rss/$
Disallow: /*/*/trackback/$
Disallow: /*/*/*/feed/$
Disallow: /*/*/*/feed/rss/$
Disallow: /*/*/*/trackback/$
Disallow: /cgi-bin/
Disallow: /wp-admin/
Disallow: /wp-includes/
Disallow: /wp-content/plugins/
Disallow: /wp-content/cache/
Disallow: /wp-content/themes/
Disallow: /trackback/
Disallow: /feed/
Disallow: /comments/
Disallow: /private/
Disallow: /*tag=
Disallow: /*m=
Disallow: /*cat=
Disallow: /*&cp=
Disallow: /*page_id=
Disallow: /*?paged=
Disallow: /*comments_popup=
Disallow: /*paged=
Allow: /*cat=6
Allow: /*cat=1
Allow: /wp-content/uploads/
User-agent: Googlebot
Disallow: /*.js$
Disallow: /*.inc$
Disallow: /*.css$
Disallow: /*.cgi$
# Google Image
User-agent: Googlebot-Image
Allow: /
# digg mirror
User-agent: duggmirror
Disallow: /
# Sitemap
Sitemap: http://www.example.com/sitemap.xml
=======================================
As mentioned before, I use WordPress's default permalinks (the "ugly" links).
=======================================
Analysis of cached robots.txt:
Last downloaded - September 8, 2007 8:49:03 PM PDT
Status - 200 (Success)
=======================================
I am currently unable to stop the flood of duplicate content:
http://www.example.com/?p=123&cp=1
http://www.example.com/?p=123&cp=2
http://www.example.com/?p=123&cp=3
http://www.example.com/?p=123&cp=4
http://www.example.com/?p=123&cp=5
.
.
.
[edited by: rashe18 at 2:30 pm (utc) on Sep. 9, 2007]
But I've found that Google will not remove a page from its index unless you mark the page "robots noindex" and allow Google to crawl the page by not blocking it with robots.txt. Then wait until the page is removed from the index, and then, and only then, block the path or page with robots.txt.
You may be better off with temporary dynamic pages that actually have "robots noindex" in them until the requests go away.
The correct syntax is: <meta name="robots" content="noindex">
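The "noindex only on the unwanted URLs" idea can be sketched roughly as below. This is an illustration of the logic in Python, not WordPress code: the function name robots_meta_for is hypothetical, and it assumes the paged-comment URLs are exactly those carrying a cp query parameter. In a real theme the same test would live in the PHP template that prints the page head.

```python
from urllib.parse import urlsplit, parse_qs

NOINDEX_TAG = '<meta name="robots" content="noindex">'

def robots_meta_for(url: str) -> str:
    """Return the noindex tag for paged-comment URLs, else an empty string.

    Assumption: a paged-comment URL is any URL with a 'cp' query parameter.
    """
    query = parse_qs(urlsplit(url).query, keep_blank_values=True)
    return NOINDEX_TAG if "cp" in query else ""

print(robots_meta_for("http://www.example.com/?p=123&cp=2"))
print(repr(robots_meta_for("http://www.example.com/?p=123")))
```

The first call prints the meta tag and the second prints an empty string, so the main post URL stays indexable while the cp variants ask to be dropped.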
Quoted from google:
--------------------------------------
To remove content from the Google index, do one of the following:
1. Ensure requests for the page return an HTTP status code of either 404 or 410.
2. Block the page using a robots.txt file.
3. Block the page using a meta noindex tag.
--------------------------------------
I have blocked these dynamic URLs with robots.txt.
Do you think I should block these URLs using a meta noindex tag instead?
These aren't HTML pages. Is it possible to block them (they are PHP pages)?
--------------------------------------
Quoted from google:
Block or remove pages using meta tags
Rather than use a robots.txt file to block crawler access to pages, you can add a <META> tag to an HTML page to tell robots not to index the page. This standard is described at [robotstxt.org...]
To prevent all robots from indexing a page on your site, you'd place the following meta tag into the <HEAD> section of your page:
<META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">
To allow other robots to index the page on your site, preventing only Google's robots from indexing the page, you'd use the following tag:
<META NAME="GOOGLEBOT" CONTENT="NOINDEX, NOFOLLOW">
To allow robots to index the page on your site but instruct them not to follow outgoing links, you'd use the following tag:
<META NAME="ROBOTS" CONTENT="NOFOLLOW">
--------------------------------------
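To check which of the tags quoted above a page actually sends, something like the following could be run against the fetched HTML. It is only a sketch using Python's standard html.parser; the class and function names are invented for illustration, and it looks only at meta tags named robots or googlebot.

```python
from html.parser import HTMLParser

class RobotsMetaParser(HTMLParser):
    """Collect directives from <meta name="robots"> and <meta name="googlebot">."""

    def __init__(self):
        super().__init__()
        self.directives = {}

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names, but not values.
        if tag != "meta":
            return
        attr = {key: (value or "") for key, value in attrs}
        name = attr.get("name", "").lower()
        if name in ("robots", "googlebot"):
            values = {d.strip().lower() for d in attr.get("content", "").split(",")}
            self.directives.setdefault(name, set()).update(values - {""})

def blocks_google_indexing(html: str) -> bool:
    """True if a robots or googlebot meta tag on the page contains noindex."""
    parser = RobotsMetaParser()
    parser.feed(html)
    combined = parser.directives.get("robots", set()) | parser.directives.get("googlebot", set())
    return "noindex" in combined

page = '<html><head><META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW"></head></html>'
print(blocks_google_indexing(page))  # True
```

A page carrying only CONTENT="NOFOLLOW" returns False here, since that tag still allows indexing.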
Disallow: /feed/$
Disallow: /*/feed/$
Disallow: /*/feed/rss/$
Disallow: /*/trackback/$
Disallow: /*/*/feed/$
Disallow: /*/*/feed/rss/$
Disallow: /*/*/trackback/$
Disallow: /*/*/*/feed/$
Disallow: /*/*/*/feed/rss/$
Disallow: /*/*/*/trackback/$
The star means "anything", including slashes, so you shouldn't need multiple stars in the syntax.
Try this:
Disallow: /*feed/
Disallow: /*rss/
Disallow: /*trackback/
Or does that block other things that you need to be indexed?
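One way to answer that question before deploying is to test the patterns against sample paths. The sketch below assumes Google-style wildcards (a star matches any run of characters including slashes, a trailing $ anchors the end). The helper name and the sample paths, such as /my-newsfeed/about.html, are hypothetical.

```python
import re

def rule_matches(pattern, url_path):
    """Assumed Google-style matching: '*' is a wildcard, trailing '$' anchors the end."""
    anchored = pattern.endswith("$")
    body = pattern[:-1] if anchored else pattern
    regex = ".*".join(map(re.escape, body.split("*"))) + ("$" if anchored else "")
    return re.match(regex, url_path) is not None

deep_feed = "/2007/09/some-post/feed/"        # hypothetical nested feed URL
innocent = "/my-newsfeed/about.html"          # hypothetical page that should stay indexed

# A single star already spans several directory levels.
print(rule_matches("/*/feed/$", deep_feed))   # True

# The shorter rule covers the feed too...
print(rule_matches("/*feed/", deep_feed))     # True

# ...but it also catches any path containing "feed/".
print(rule_matches("/*feed/", innocent))      # True
print(rule_matches("/*/feed/$", innocent))    # False
```

So under these assumptions, Disallow: /*/feed/$ alone replaces the two-star and three-star variants, while Disallow: /*feed/ is broader and could block unrelated pages whose path happens to contain "feed/".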