

paging google! robots.txt being ignored!

   
6:15 pm on Nov 8, 2005 (gmt 0)

5+ Year Member



Hi. My robots.txt was put in place in August!
But Google still has tons of results that violate the file.

The validator at [searchengineworld.com...]
doesn't complain (other than about the use of Google's nonstandard extensions described at
[google.com...]).

The above page says that it's OK for

#per [[AdminRequests]]
User-agent: Googlebot
Disallow: /*?*

to come last (after the User-agent: * block), and it seems to suggest that the syntax is OK.
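
To make the layout concrete, the file is arranged roughly like this (a minimal sketch; the Disallow line under User-agent: * is a placeholder, not the actual contents):

User-agent: *
Disallow: /cgi-bin/

#per [[AdminRequests]]
User-agent: Googlebot
Disallow: /*?*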

[edited by: jatar_k at 12:16 am (utc) on Nov. 9, 2005]
[edit reason] no urls thanks [/edit]

11:45 pm on Nov 8, 2005 (gmt 0)

WebmasterWorld Senior Member 5+ Year Member



What is described at [google.com...] is
Disallow: /*?

You have an extra * after the ? that you need to take off.
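
With that change, the Googlebot block would read (a minimal sketch, assuming the rest of the file stays as it is):

User-agent: Googlebot
Disallow: /*?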

2:13 am on Nov 9, 2005 (gmt 0)

5+ Year Member



We made that change, but it didn't help:
I asked Google to review it via the automatic URL removal system ([services.google.com ]).
Result:
URLs cannot have wild cards in them (e.g. "*"). The following line contains a wild card:
DISALLOW /*?
How insane is that?

Oh, and while the original file's syntax didn't follow their example, it was still legal according to their documented syntax!

(PS thanks for URLifying my previous post; I had assumed the posting system would do that for me.)

3:49 am on Nov 9, 2005 (gmt 0)

WebmasterWorld Senior Member 5+ Year Member



Oh dear, well it looks like their removal tool can't handle wild cards then.

How many pages do you want it to remove?
That is actual pages, e.g. /page.php, not /page.php?a=b and all the different URL variations.

If it is a small number, or you can write a high-level disallow rule that will match them without wildcards, I would try that.
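
For example, if all of the disallowed pages happened to sit under a single directory (a hypothetical /dynamic/ path, just for illustration), a plain, wildcard-free rule would cover them:

User-agent: Googlebot
Disallow: /dynamic/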

6:06 am on Nov 9, 2005 (gmt 0)

5+ Year Member



Oh, around 35,000. :(
6:07 am on Nov 9, 2005 (gmt 0)

5+ Year Member



I contacted google directly via [google.com...]
and directed them to this thread. We'll see...
8:31 pm on Nov 9, 2005 (gmt 0)

WebmasterWorld Senior Member 5+ Year Member



And those 35,000 aren't in a separate directory from the pages you do want crawled?
11:22 am on Nov 10, 2005 (gmt 0)

5+ Year Member



No.
[google.com...]
No word from Google.
11:34 am on Nov 10, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



This is why we should all stick to the standard robots.txt specification (www.robotstxt.org) instead of each SE having its own little rules! ([webmasterworld.com ]).
12:14 pm on Nov 10, 2005 (gmt 0)

10+ Year Member



Google appears to have ignored my robots.txt as well, but only once. Once, however, is enough to have all the pages listed that I wanted left alone. Google doesn't appear to be spidering the disallowed pages anymore, and they are slowly disappearing from the SERPs. I would hope my robots file works, cos I borrowed some of it from WebmasterWorld. ;oP

Ska

6:37 pm on Nov 10, 2005 (gmt 0)

5+ Year Member



Anyone have successful experiences with huge (35,000-entry) robots.txt files?

In any case, Google needs to fix their bug, as others are being tripped up by it.
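
For what it's worth, a wildcard-free file at that scale is just one Disallow line per URL, with each query string spelled out literally (the /page.php?a=... paths below are only the example from earlier in the thread, not real URLs):

User-agent: Googlebot
Disallow: /page.php?a=1
Disallow: /page.php?a=2
Disallow: /page.php?a=3
# ...one line for each of the ~35,000 URLs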

11:35 am on Dec 1, 2005 (gmt 0)

10+ Year Member



elvey,

You may try to solve it at the IT level: set up a 301 redirect from every page you want removed to some pattern you can then handle with standard robots.txt tools.
As an example, 1abc.html -> 301 -> 1abc.xxx, and later in robots.txt: Disallow: /*.xxx
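
A rough sketch of that idea, assuming Apache with mod_rewrite enabled in .htaccess (the .html-to-.xxx pattern is only the example above, not a tested rule):

# redirect each .html page to a .xxx twin that can then be disallowed
RewriteEngine On
RewriteRule ^(.+)\.html$ /$1.xxx [R=301,L]

and then, in robots.txt:

User-agent: Googlebot
Disallow: /*.xxx

(Note that Disallow: /*.xxx itself still relies on Google's wildcard extension.)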

5:50 am on Dec 27, 2005 (gmt 0)



deleted
7:25 am on Dec 27, 2005 (gmt 0)

5+ Year Member



I wonder what twebdonny said...

Spiral: thanks, but:
1)
..a) Since Google specifically says ( [google.com...] ) that Disallow: /*? is supported, it should be supported by their bot and their other tools.
..b) Lots of folks will be tripped up by the same bug.

2) I don't have access to the full Apache config.

 
