
Sitemaps, Meta Data, and robots.txt Forum

    
paging google! robots.txt being ignored!
elvey (5+ Year Member)
Msg#: 782 posted 6:15 pm on Nov 8, 2005 (gmt 0)

Hi. My robots.txt was put in place in August!
But google still has tons of results that violate the file.

[searchengineworld.com...] doesn't complain (other than about the use of google's nonstandard extensions described at [google.com...]).

The above page says that it's OK that

#per [[AdminRequests]]
User-agent: Googlebot
Disallow: /*?*

is last (after the User-agent: * section), and seems to suggest that the syntax is OK.
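For context, the overall layout being described would look roughly like this (a sketch only; the real file isn't quoted in this thread, and /cgi-bin/ is just a placeholder path):

User-agent: *
Disallow: /cgi-bin/

# Googlebot-specific section using google's nonstandard wildcard extension,
# placed after the general section as their page describes
User-agent: Googlebot
Disallow: /*?*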

[edited by: jatar_k at 12:16 am (utc) on Nov. 9, 2005]
[edit reason] no urls thanks [/edit]

 

Dijkgraaf (WebmasterWorld Senior Member, 5+ Year Member)
Msg#: 782 posted 11:45 pm on Nov 8, 2005 (gmt 0)

What is described at [google.com...] is
Disallow: /*?

You have an extra * after the ? that you need to take off.
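So the corrected Googlebot section would presumably read (a sketch of just that directive):

User-agent: Googlebot
Disallow: /*?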

elvey (5+ Year Member)
Msg#: 782 posted 2:13 am on Nov 9, 2005 (gmt 0)

We made that change; it didn't help:
I asked google to review it via the automatic URL removal system ([services.google.com ]).
Result:
URLs cannot have wild cards in them (e.g. "*"). The following line contains a wild card:
DISALLOW /*?
How insane is that?

Oh, and while the original file's syntax wasn't per their example, it was legal, per their syntax!

(PS: thanks for URLifying my previous post; I had assumed the posting system would do that for me.)

Dijkgraaf (WebmasterWorld Senior Member, 5+ Year Member)
Msg#: 782 posted 3:49 am on Nov 9, 2005 (gmt 0)

Oh dear, well it looks like their removal tool can't handle wild cards then.

How many pages do you want it to remove?
That is actual pages, e.g. /page.php, and not /page.php?a=b and all the different URL variants.

If it is a small number, or if you can write a higher-level disallow rule that matches them without wild cards, I would try that.
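For example, if the affected pages all sat under one path, a plain prefix rule would do the job without wild cards (a sketch; /catalog/ is only a placeholder, not a directory mentioned in this thread):

User-agent: *
Disallow: /catalog/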

elvey (5+ Year Member)
Msg#: 782 posted 6:06 am on Nov 9, 2005 (gmt 0)

Oh, around 35,000. :(

elvey (5+ Year Member)
Msg#: 782 posted 6:07 am on Nov 9, 2005 (gmt 0)

I contacted google directly via [google.com...]
and directed them to this thread. We'll see...

Dijkgraaf (WebmasterWorld Senior Member, 5+ Year Member)
Msg#: 782 posted 8:31 pm on Nov 9, 2005 (gmt 0)

And those 35,000 aren't in a separate directory from the pages you do want crawled?

elvey (5+ Year Member)
Msg#: 782 posted 11:22 am on Nov 10, 2005 (gmt 0)

No.
[google.com...]
No word from google.

Sanenet (WebmasterWorld Senior Member, 10+ Year Member)
Msg#: 782 posted 11:34 am on Nov 10, 2005 (gmt 0)

This is why we should all stick to the standard robots.txt specification (www.robotstxt.org) instead of each SE having its own little rules! ([webmasterworld.com ]).
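For contrast, the distinction being made is roughly this (a sketch; the paths are placeholders): the original robotstxt.org standard treats Disallow values as plain path prefixes, while the * and ? patterns discussed in this thread only mean anything to crawlers that implement google's extension.

# Original standard (robotstxt.org): prefix match only
User-agent: *
Disallow: /cgi-bin/

# Google's nonstandard extension: wildcard patterns, Googlebot only
User-agent: Googlebot
Disallow: /*?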

ska_demon (10+ Year Member)
Msg#: 782 posted 12:14 pm on Nov 10, 2005 (gmt 0)

Google appears to have ignored my robots.txt as well, but only once. Once, however, is enough to have all the pages listed that I wanted left alone. Google doesn't appear to be spidering the disallowed pages anymore, and they are slowly disappearing from the SERPs. I would hope my robots file works, because I borrowed some of it from WebmasterWorld. ;oP

Ska

elvey (5+ Year Member)
Msg#: 782 posted 6:37 pm on Nov 10, 2005 (gmt 0)

Anyone have successful experiences with huge (35,000-entry) robots.txt files?

In any case, google needs to fix their bug, as others are being tripped up by it.

spiral (10+ Year Member)
Msg#: 782 posted 11:35 am on Dec 1, 2005 (gmt 0)

elvey,

You may try to solve it at the server level: 301-redirect all the pages you want removed to URLs carrying some marker that you can then handle with standard robots.txt rules.
As an example, 1abc.html -> 301 to -> 1abc.xxx, and later in robots.txt: Disallow: /*.xxx
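A rough sketch of that redirect step, assuming Apache with mod_rewrite usable from an .htaccess file (the .xxx marker and the pattern are illustrative only, and as elvey notes below he doesn't have access to the Apache config, so treat this purely as an illustration):

# Illustrative only: 301-redirect query-string URLs (the ones matching /*?)
# to a distinguishable .xxx path that robots.txt rules can then target.
RewriteEngine On
# only touch requests that actually carry a query string
RewriteCond %{QUERY_STRING} .
# append the .xxx marker; the trailing "?" drops the original query string
RewriteRule ^(.*)$ /$1.xxx? [R=301,L]

Note that the matching robots.txt line spiral gives (Disallow: /*.xxx) still relies on Googlebot's wildcard extension; the 301s themselves are what should cause the old query-string URLs to drop out over time.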

twebdonny
Msg#: 782 posted 5:50 am on Dec 27, 2005 (gmt 0)

deleted

elvey (5+ Year Member)
Msg#: 782 posted 7:25 am on Dec 27, 2005 (gmt 0)

I wonder what twebdonny said...

Spiral: thanks, but:

1) a) Since Google specifically says ( [google.com...] ) that Disallow: /*? is supported, it should be supported by their bot and their other tools.
   b) Lots of folks will be tripped up by the same bug.

2) I don't have access to the full Apache config.
