
Forum Library, Charter, Moderators: goodroi

Sitemaps, Meta Data, and robots.txt Forum

    
paging google! robots.txt being ignored!
elvey




msg:1528454
 6:15 pm on Nov 8, 2005 (gmt 0)

Hi. My robots.txt was put in place in August!
But Google still has tons of results that violate the file.

[searchengineworld.com...]
doesn't complain (other than about the use of Google's nonstandard extensions described at
[google.com...]).

The above page says that it's OK that

#per [[AdminRequests]]
User-agent: Googlebot
Disallow: /*?*

is last (after User-agent: *)

and seems to suggest that the syntax is OK.

[edited by: jatar_k at 12:16 am (utc) on Nov. 9, 2005]
[edit reason] no urls thanks [/edit]

 

Dijkgraaf




msg:1528455
 11:45 pm on Nov 8, 2005 (gmt 0)

What is described at [google.com...] is
Disallow: /*?

You have an extra * after the ? that you need to take off.

elvey




msg:1528456
 2:13 am on Nov 9, 2005 (gmt 0)

We made that change; it didn't help.
I asked Google to review it via the automatic URL removal system ([services.google.com]).
Result:
URLs cannot have wild cards in them (e.g. "*"). The following line contains a wild card:
DISALLOW /*?
How insane is that?

Oh, and while the original file's syntax wasn't per their example, it was legal, per their syntax!
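
To make the "legal per their syntax" point concrete, here is a minimal Python sketch of wildcard matching. It assumes the extension works the way Google describes it: * matches any run of characters, a trailing $ anchors the end of the URL, and a rule is otherwise a prefix match. The helper names and sample paths are made up for illustration; this is not Googlebot's actual code.

```python
import re

def rule_to_regex(rule: str) -> re.Pattern:
    """Translate a robots.txt path rule with wildcards into a regex.

    Assumed semantics: '*' matches any run of characters, a trailing '$'
    anchors the end of the URL, everything else is a literal prefix.
    """
    anchored = rule.endswith("$")
    if anchored:
        rule = rule[:-1]
    # Escape the literal pieces and join them with '.*' for each '*'.
    body = ".*".join(re.escape(part) for part in rule.split("*"))
    return re.compile(body + ("$" if anchored else ""))

def is_blocked(rule: str, url_path: str) -> bool:
    """True if the rule matches from the start of the path (prefix match)."""
    return rule_to_regex(rule).match(url_path) is not None

# Hypothetical sample paths: one with a query string, one without.
for rule in ("/*?", "/*?*"):
    for path in ("/page.php?a=b", "/page.php"):
        print(rule, path, is_blocked(rule, path))
```

Under those assumptions both rules block exactly the same URLs, because a trailing * adds nothing to a prefix match.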

(PS thanks for URLifying my previous post; I had assumed the posting system would do that for me.)

Dijkgraaf




msg:1528457
 3:49 am on Nov 9, 2005 (gmt 0)

Oh dear, well it looks like their removal tool can't handle wild cards then.

How many pages do you want it to remove?
That is, actual pages, e.g. /page.php, and not /page.php?a=b and all the different URLs.

If it is a small number, or you can make a high-level disallow rule that will match them without wild cards, I would try that.
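
A rough sketch of that wildcard-free approach in Python, assuming the tool compares each Disallow value against the URL path plus query string as a plain prefix (how the removal tool really matches is not confirmed here). The function names and sample URLs are hypothetical.

```python
from urllib.parse import urlsplit

def prefix_rules(urls):
    """Build one 'Disallow: <path>?' line per distinct path seen with a query string."""
    paths = {urlsplit(u).path for u in urls if urlsplit(u).query}
    return [f"Disallow: {p}?" for p in sorted(paths)]

def blocked_by_prefix(rule_value, url):
    """Plain prefix match of a Disallow value against path + '?' + query."""
    parts = urlsplit(url)
    target = parts.path + ("?" + parts.query if parts.query else "")
    return target.startswith(rule_value)

# Hypothetical URLs standing in for the real list of indexed pages.
sample = ["/page.php?a=b", "/page.php?c=d", "/page.php", "/other.php?x=1"]
for line in prefix_rules(sample):
    print(line)
```

With one rule per script, /page.php?a=b is blocked while plain /page.php stays crawlable, so the number of lines depends on the number of distinct scripts rather than the number of URLs.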

elvey




msg:1528458
 6:06 am on Nov 9, 2005 (gmt 0)

Oh, around 35,000. :(

elvey




msg:1528459
 6:07 am on Nov 9, 2005 (gmt 0)

I contacted Google directly via [google.com...]
and directed them to this thread. We'll see...

Dijkgraaf




msg:1528460
 8:31 pm on Nov 9, 2005 (gmt 0)

And those 35,000 aren't in a separate directory from the pages you do want crawled?

elvey




msg:1528461
 11:22 am on Nov 10, 2005 (gmt 0)

No.
[google.com...]
No word from Google.

Sanenet




msg:1528462
 11:34 am on Nov 10, 2005 (gmt 0)

This is why we should all stick to one robots.txt standard (www.robotstxt.org) instead of each SE having its own little rules! ([webmasterworld.com]).

ska_demon




msg:1528463
 12:14 pm on Nov 10, 2005 (gmt 0)

Google appears to have ignored my robots.txt as well, but only once. Once, however, is enough to have all the pages listed that I wanted left alone. Google doesn't appear to be spidering the disallowed pages anymore and they are slowly disappearing from the SERPs. I would hope my robots file works, 'cos I borrowed some of it from WebmasterWorld. ;oP

Ska

elvey




msg:1528464
 6:37 pm on Nov 10, 2005 (gmt 0)

Anyone have successful experiences with huge (35,000-entry) robots.txt files?

In any case, Google needs to fix their bug, as others are being tripped up by it.

spiral




msg:1528465
 11:35 am on Dec 1, 2005 (gmt 0)

elvey,

You may try to solve it at the server level: 301-redirect all the pages you want removed to URLs with a distinctive extension, which you can then handle in robots.txt.
As an example, 1abc.html -> 301 to -> 1abc.xxx, and later in robots.txt: Disallow: /*.xxx

twebdonny




msg:1528466
 5:50 am on Dec 27, 2005 (gmt 0)

deleted

elvey




msg:1528467
 7:25 am on Dec 27, 2005 (gmt 0)

I wonder what twebdonny said...

Spiral: thanks but:
1)
..a) Since Google specifically says ( [google.com...] ) that Disallow: /*? is supported, it should be supported by their bot and their other tools.
..b) Lots of folks will be tripped up by the same bug.

2) I don't have access to the full Apache config.


All trademarks and copyrights held by respective owners. Member comments are owned by the poster.
© Webmaster World 1996-2014 all rights reserved