Forum Moderators: Robert Charlton & goodroi

Url removal denied although blocked by robots.txt

rashe18

7:25 pm on Sep 5, 2007 (gmt 0)

10+ Year Member



Hello ...

I am trying to remove some URLs from my WordPress blog.

These URLs were created by installing the "Paged Comments" plugin.

The URLs look like this:

-------------------------------------
http://www.example.com/?p=400&cp=1
-------------------------------------

I blocked this type of URL in robots.txt:

-----------------
Disallow: /*cp=
-----------------

When I try to remove these URLs with Google Webmaster Tools, I get "Denied".

Any explanations?

Thanks in advance.

[edited by: tedster at 7:34 pm (utc) on Sep. 5, 2007]
[edit reason] switch to example.com - it can never be owned [/edit]
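For background, Googlebot's wildcard extension to robots.txt can be sketched as a simple pattern match. The `rule_matches` helper below is illustrative only (it assumes `*` matches any run of characters and a trailing `$` anchors the end of the path, per Google's documentation), not how Google actually implements matching:

```python
import re

def rule_matches(rule: str, path: str) -> bool:
    # Googlebot extension: '*' matches any run of characters; a trailing
    # '$' anchors the rule to the end of the path.  Matching is anchored
    # at the start of the path, as in standard robots.txt.
    anchored = rule.endswith("$")
    if anchored:
        rule = rule[:-1]
    pattern = re.escape(rule).replace(r"\*", ".*") + ("$" if anchored else "")
    return re.match(pattern, path) is not None

# The rule from the post, tested against a paged-comments URL
# (path plus query string, as Googlebot sees it):
print(rule_matches("/*cp=", "/?p=400&cp=1"))  # True: '*' spans "?p=400&"
print(rule_matches("/*cp=", "/about/"))       # False
```

Under these semantics the posted rule should block the paged-comment URLs, which makes the "Denied" response from Webmaster Tools worth investigating further.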

tedster

6:58 pm on Sep 6, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Have you checked the syntax of your complete robots.txt file?

rashe18

6:59 pm on Sep 6, 2007 (gmt 0)

10+ Year Member



What do you mean by "syntax", please?

tedster

7:16 pm on Sep 6, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I mean checking that your file fully conforms to the robots.txt standard, as explained at robotstxt.org [robotstxt.org] and extended by Google. Google has a lot of information about this:

[google.com...]

g1smd

8:05 pm on Sep 6, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



You'll need the disallow with the * syntax to be in the User-agent: Googlebot section of the file, not in the User-agent: * section.

rashe18

2:04 pm on Sep 9, 2007 (gmt 0)

10+ Year Member



This is my robots.txt file.

I really don't know what the problem is.

=======================================

User-agent: googlebot
Disallow: /wp-
Disallow: /search
Disallow: /feed
Disallow: /comments/feed
Disallow: /feed/$
Disallow: /*/feed/$
Disallow: /*/feed/rss/$
Disallow: /*/trackback/$
Disallow: /*/*/feed/$
Disallow: /*/*/feed/rss/$
Disallow: /*/*/trackback/$
Disallow: /*/*/*/feed/$
Disallow: /*/*/*/feed/rss/$
Disallow: /*/*/*/trackback/$
Disallow: /cgi-bin/
Disallow: /wp-admin/
Disallow: /wp-includes/
Disallow: /wp-content/plugins/
Disallow: /wp-content/cache/
Disallow: /wp-content/themes/
Disallow: /trackback/
Disallow: /feed/
Disallow: /comments/
Disallow: /private/
Disallow: /*tag=
Disallow: /*m=
Disallow: /*cat=
Disallow: /*&cp=
Disallow: /*page_id=
Disallow: /*?paged=
Disallow: /*comments_popup=
Disallow: /*paged=
Allow: /*cat=6
Allow: /*cat=1
Allow: /wp-content/uploads/

User-agent: Googlebot
Disallow: /*.js$
Disallow: /*.inc$
Disallow: /*.css$
Disallow: /*.cgi$

# Google Image
User-agent: Googlebot-Image
Allow: /

# digg mirror
User-agent: duggmirror
Disallow: /

# Sitemap
Sitemap: http://www.example.com/sitemap.xml

=======================================

As mentioned before, I use the default permalinks, i.e. the "ugly" links (WordPress).

=======================================

Analysis of cached robots.txt:

Last downloaded - September 8, 2007 8:49:03 PM PDT
Status - 200 (Success)

=======================================

Currently I'm unable to stop the flood of duplicate content:

http://www.example.com/?p=123&cp=1
http://www.example.com/?p=123&cp=2
http://www.example.com/?p=123&cp=3
http://www.example.com/?p=123&cp=4
http://www.example.com/?p=123&cp=5
.
.
.

[edited by: rashe18 at 2:30 pm (utc) on Sep. 9, 2007]

jdMorgan

2:08 pm on Sep 9, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Although I doubt that it's the cause of your problem, you have two "User-agent: Googlebot" sections, and that is not valid robots.txt syntax. Google is fairly smart about figuring out errors like this, but I would not count on it.

Jim

rashe18

2:29 pm on Sep 9, 2007 (gmt 0)

10+ Year Member



jdMorgan,

That has been fixed, though, as you (and I) believe, it is not the cause of this problem.

bumpski

2:30 pm on Sep 9, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I'm not sure I fully understand the problem.

BUT I've found Google will not remove a page from its index unless you mark the page "robots noindex" and allow Google to crawl the page by not blocking it with robots.txt. Then wait until it is removed from the index, and only then block the path or page with robots.txt.

You may be better off serving temporary dynamic pages that actually have "robots noindex" in them until the requests go away.

<meta name="robots" content="noindex"> is the correct syntax.

rashe18

2:54 pm on Sep 9, 2007 (gmt 0)

10+ Year Member



Hello bumpski,

Quoted from google:

--------------------------------------

To remove content from the Google index, do one of the following:

1. Ensure requests for the page return an HTTP status code of either 404 or 410.

2. Block the page using a robots.txt file.

3. Block the page using a meta noindex tag.

--------------------------------------

I have blocked these dynamic URLs with robots.txt.

Do you think I should block these URLs using a meta noindex tag instead?

These aren't static HTML pages. Is it possible to block them (PHP pages)?

--------------------------------------
Quoted from google:

Block or remove pages using meta tags

Rather than use a robots.txt file to block crawler access to pages, you can add a <META> tag to an HTML page to tell robots not to index the page. This standard is described at [robotstxt.org...]

To prevent all robots from indexing a page on your site, you'd place the following meta tag into the <HEAD> section of your page:

<META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">

To allow other robots to index the page on your site, preventing only Google's robots from indexing the page, you'd use the following tag:

<META NAME="GOOGLEBOT" CONTENT="NOINDEX, NOFOLLOW">

To allow robots to index the page on your site but instruct them not to follow outgoing links, you'd use the following tag:

<META NAME="ROBOTS" CONTENT="NOFOLLOW">
--------------------------------------

jdMorgan

4:16 pm on Sep 9, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Based on the Google Webmaster Help article cited above, I'd try this:

Disallow: /*?*&cp=

This might help if Google won't look past the "?" unless it sees one in the "Disallow:" string.

I wish I could be more sure, but I use SE-friendly URLs exclusively.

Jim
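Under the documented wildcard semantics, both the rule from the posted file and jdMorgan's variant should match the paged-comment URLs. A quick illustrative check (the `disallowed` helper is a sketch of the documented behavior, not Google's implementation):

```python
import re

def disallowed(rule: str, path: str) -> bool:
    # Sketch of Googlebot-style matching: '*' is a wildcard and a
    # trailing '$' anchors the end of the path.
    anchored = rule.endswith("$")
    body = rule[:-1] if anchored else rule
    pattern = re.escape(body).replace(r"\*", ".*") + ("$" if anchored else "")
    return re.match(pattern, path) is not None

url = "/?p=123&cp=1"
print(disallowed("/*&cp=", url))    # True: the rule from the posted file
print(disallowed("/*?*&cp=", url))  # True: jdMorgan's variant with explicit '?'
```

If Google's matcher treats the "?" specially and won't match across it implicitly, only the second rule would apply, which is the point of the suggestion.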

g1smd

4:21 pm on Sep 9, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



The robots.txt method, as you describe it, should work.

g1smd

5:35 pm on Sep 9, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



This is over-complicated:

Disallow: /feed/$
Disallow: /*/feed/$
Disallow: /*/feed/rss/$
Disallow: /*/trackback/$
Disallow: /*/*/feed/$
Disallow: /*/*/feed/rss/$
Disallow: /*/*/trackback/$
Disallow: /*/*/*/feed/$
Disallow: /*/*/*/feed/rss/$
Disallow: /*/*/*/trackback/$

The star means "anything", so you don't need multiple stars in one pattern.

Try this:

Disallow: /*feed/
Disallow: /*rss/
Disallow: /*trackback/

or does that block other things that you need to be indexed?
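The question above can be checked with a small sketch of the wildcard semantics (the `matches` helper is illustrative, assuming `*` matches any run of characters):

```python
import re

def matches(rule: str, path: str) -> bool:
    # Rough sketch of the Googlebot wildcard extension: '*' matches any
    # run of characters; matching is anchored at the start of the path.
    return re.match(re.escape(rule).replace(r"\*", ".*"), path) is not None

simplified = ["/*feed/", "/*rss/", "/*trackback/"]

# A single wildcard covers any directory depth:
print(any(matches(r, "/feed/") for r in simplified))                          # True
print(any(matches(r, "/2007/09/feed/") for r in simplified))                  # True
print(any(matches(r, "/2007/09/post-title/trackback/") for r in simplified))  # True

# ...but it is broader than the original '$'-anchored rules: any path
# that merely contains "feed/" somewhere would also be blocked.
print(matches("/*feed/", "/newsfeed/"))  # True
```

So the shorter rules do cover all the depth-specific variants, at the cost of possibly catching unrelated paths that happen to contain those substrings.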

rashe18

1:06 am on Sep 11, 2007 (gmt 0)

10+ Year Member



I think g1smd and jdMorgan are right.

It works fine now.

Thanks to all!