Dijkgraaf

msg:1528455 | 11:45 pm on Nov 8, 2005 (gmt 0) |
What is described at [google.com...] is Disallow: /*? You have an extra * after the ? that you need to take off.
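
To see what the two spellings would match, here is a minimal Python sketch of wildcard matching as Google describes it (`*` stands for any run of characters, a trailing `$` anchors the end, and rules otherwise match as prefixes). The function names `rule_to_regex` and `is_disallowed` are made up for illustration; this is not Googlebot's actual code, and it says nothing about how the URL removal tool parses the file.

```python
import re

def rule_to_regex(rule: str) -> re.Pattern:
    """Translate a wildcard Disallow value into a compiled regex (illustrative only)."""
    anchored = rule.endswith("$")
    if anchored:
        rule = rule[:-1]
    # Escape each literal piece, then join the pieces with '.*' in place of '*'.
    body = ".*".join(re.escape(part) for part in rule.split("*"))
    return re.compile(body + ("$" if anchored else ""))

def is_disallowed(rule: str, url_path: str) -> bool:
    # re.match anchors at the start of the string, giving prefix semantics.
    return rule_to_regex(rule).match(url_path) is not None

paths = ["/page.php", "/page.php?a=b", "/dir/item?id=7"]
for rule in ("/*?", "/*?*"):
    print(rule, [is_disallowed(rule, p) for p in paths])
# Both rules give [False, True, True]
```

Under these assumptions the trailing * is redundant rather than wrong: both forms block every URL containing a ? and leave plain /page.php alone.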
|
elvey

msg:1528456 | 2:13 am on Nov 9, 2005 (gmt 0) |
We made that change; it didn't help. I asked Google to review it via the automatic URL removal system ([services.google.com ]). Result: URLs cannot have wild cards in them (e.g. "*"). The following line contains a wild card: DISALLOW /*? How insane is that? Oh, and while the original file's syntax wasn't per their example, it was legal per their syntax! (PS: thanks for URLifying my previous post; I had assumed the posting system would do that for me.)
|
Dijkgraaf

msg:1528457 | 3:49 am on Nov 9, 2005 (gmt 0) |
Oh dear, well, it looks like their removal tool can't handle wild cards then. How many pages do you want it to remove? That is, actual pages, e.g. /page.php, not /page.php?a=b and all the different URLs. If it is a small number, or you can make a high-level Disallow rule that will match them without wild cards, I would try that.
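
One way to build such wildcard-free rules is sketched below in Python: collapse the query-string URLs down to one prefix rule per script, such as Disallow: /page.php? for every /page.php?... URL. The helper name `prefix_rules` and the example.com URLs are invented for illustration. It also assumes the crawler (and the removal tool) compares the rule against the path plus the query string as a plain prefix, which is worth testing on a few URLs first.

```python
from urllib.parse import urlsplit

def prefix_rules(urls):
    """Return one wildcard-free Disallow line per script that has query-string URLs."""
    prefixes = set()
    for url in urls:
        parts = urlsplit(url)
        if parts.query:                      # only URLs that carry a query string
            prefixes.add(parts.path + "?")   # e.g. "/page.php?"
    return [f"Disallow: {p}" for p in sorted(prefixes)]

# Hypothetical sample; a real run would read the full list of unwanted URLs.
sample = [
    "http://example.com/page.php?a=b",
    "http://example.com/page.php?a=c",
    "http://example.com/list.php?sort=date",
    "http://example.com/page.php",
]
print("\n".join(prefix_rules(sample)))
# Disallow: /list.php?
# Disallow: /page.php?
```

If the unwanted URLs come from only a handful of scripts, the resulting file stays short no matter how many parameter combinations exist.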
|
elvey

msg:1528458 | 6:06 am on Nov 9, 2005 (gmt 0) |
Oh, around 35,000. :(
|
elvey

msg:1528459 | 6:07 am on Nov 9, 2005 (gmt 0) |
I contacted google directly via [google.com...] and directed them to this thread. We'll see...
|
Dijkgraaf

msg:1528460 | 8:31 pm on Nov 9, 2005 (gmt 0) |
And those 35,000 aren't in a separate directory from the pages you do want crawled?
|
elvey

msg:1528461 | 11:22 am on Nov 10, 2005 (gmt 0) |
No. [google.com...] No word from google.
|
Sanenet

msg:1528462 | 11:34 am on Nov 10, 2005 (gmt 0) |
This is why we should all stick to a single robots.txt standard (www.robotstxt.org) instead of each SE having its own little rules! ([webmasterworld.com ]).
|
ska_demon

msg:1528463 | 12:14 pm on Nov 10, 2005 (gmt 0) |
Google appears to have ignored my robots.txt as well, but only once. Once, however, is enough to have all the pages listed that I wanted left alone. Google doesn't appear to be spidering the disallowed pages anymore and they are slowly disappearing from the SERPs. I would hope my robots file works, cos I borrowed some of it from WebmasterWorld. ;oP Ska
|
elvey

msg:1528464 | 6:37 pm on Nov 10, 2005 (gmt 0) |
Anyone have successful experiences with huge (35,000-entry) robots.txt files? In any case, google needs to fix their bug, as others are being tripped up by it.
|
spiral

msg:1528465 | 11:35 am on Dec 1, 2005 (gmt 0) |
elvey, you may try to solve it on the IT level: make a 301 redirect from all the pages you want removed to some anchor you can treat later with standard robots.txt tools. As an example, 1abc.html -> 301 to -> 1abc.#*$! and later in robots.txt Disallow: /*.xxx
|
twebdonny

msg:1528466 | 5:50 am on Dec 27, 2005 (gmt 0) |
deleted
|
elvey

msg:1528467 | 7:25 am on Dec 27, 2005 (gmt 0) |
I wonder what twebdonny said... Spiral: thanks, but: 1a) Since Google specifically says ( [google.com...] ) that Disallow: /*? is supported, it should be supported by their bot and their other tools. 1b) Lots of folks will be tripped up by the same bug. 2) I don't have access to the full Apache config.
|
|