| 11:45 pm on Nov 8, 2005 (gmt 0)|
What is described at [google.com...] uses no trailing wildcard. You have an extra * after the ? that you need to take off.
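For reference, a minimal sketch of the two variants, assuming the goal is to block every URL containing a query string (this uses Googlebot's wildcard extension, not the base standard):

  User-agent: Googlebot
  # what the file had (extra * after the ?):
  # Disallow: /*?*
  # what Google's example uses:
  Disallow: /*?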
| 2:13 am on Nov 9, 2005 (gmt 0)|
We made that change; it didn't help.
I asked google to review it via the automatic URL removal system ([services.google.com...]), and it rejected the file with:
"URLs cannot have wild cards in them (e.g. "*"). The following line contains a wild card:"
How insane is that?
Oh, and while the original file's syntax wasn't per their example, it was legal, per their syntax!
(PS thanks for URLifying my previous post; I had assumed the posting system would do that for me.)
| 3:49 am on Nov 9, 2005 (gmt 0)|
Oh dear, well it looks like their removal tool can't handle wild cards then.
How many pages do you want it to remove?
That is actual pages, e.g. /page.php, and not /page.php?a=b and all the different URLs.
If it is a small number, or you can make a high-level disallow rule that will match them without wild cards, I would try that.
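For example, a plain prefix rule, which is standard robots.txt and needs no wildcards, already covers every query-string variant of a page, since Disallow matches on URL prefixes (/page.php is just the placeholder from above):

  User-agent: *
  # prefix match: blocks /page.php, /page.php?a=b, /page.php?x=y, ...
  Disallow: /page.php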
| 6:06 am on Nov 9, 2005 (gmt 0)|
Oh, around 35,000. :(
| 6:07 am on Nov 9, 2005 (gmt 0)|
I contacted google directly via [google.com...]
and directed them to this thread. We'll see...
| 8:31 pm on Nov 9, 2005 (gmt 0)|
And those 35,000 aren't in a separate directory from the pages you do want crawled?
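If they were, one standard prefix rule would cover the lot (/removed/ is a hypothetical directory name):

  User-agent: *
  # everything under this directory is disallowed by one prefix rule
  Disallow: /removed/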
| 11:22 am on Nov 10, 2005 (gmt 0)|
No word from google.
| 11:34 am on Nov 10, 2005 (gmt 0)|
This is why we should all stick to the standard robots.txt protocol (www.robotstxt.org) instead of each SE having its own little rules! ([webmasterworld.com...]).
| 12:14 pm on Nov 10, 2005 (gmt 0)|
Google appears to have ignored my robots.txt as well, but only once. Once, however, was enough to get all the pages I wanted left alone listed. Google doesn't appear to be spidering the disallowed pages anymore, and they are slowly disappearing from the serps. I would hope my robots file works, cos I borrowed some of it from WebmasterWorld. ;oP
| 6:37 pm on Nov 10, 2005 (gmt 0)|
Anyone have successful experiences with huge (35,000-entry) robots.txt files?
In any case, google needs to fix their bug, as others are being tripped up by it.
| 11:35 am on Dec 1, 2005 (gmt 0)|
You may try to solve it at the IT level: set up a 301 redirect from every page you want removed to some anchor you can then treat with standard robots.txt tools.
As an example: 1abc.html -> 301 -> 1abc.xxx, and later in robots.txt: Disallow: /*.xxx
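A minimal .htaccess sketch of that idea, assuming mod_rewrite is available and keeping the placeholder names 1abc.html and .xxx from the example:

  RewriteEngine On
  # 301-redirect a page slated for removal to its .xxx twin
  RewriteRule ^1abc\.html$ /1abc.xxx [R=301,L]

and then in robots.txt:

  User-agent: Googlebot
  Disallow: /*.xxx

Note that Disallow: /*.xxx still relies on Google's wildcard extension, so the removal tool may balk at it for the same reason; redirecting everything into one common directory and disallowing that prefix would avoid wildcards entirely.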
| 5:50 am on Dec 27, 2005 (gmt 0)|
| 7:25 am on Dec 27, 2005 (gmt 0)|
I wonder what twebdonny said...
Spiral: thanks, but:
1) a) Since Google specifically says ([google.com...]) that Disallow: /*? is supported, it should be supported by their bot and their other tools.
b) Lots of folks will be tripped up by the same bug.
2) I don't have access to the full Apache config.