
Forum Moderators: goodroi


Google ignores one line of robots.txt

     
8:44 am on June 22, 2008 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:May 6, 2005
posts: 863
votes: 0


Hi
I have a MediaWiki install and can't find anything on the web that explains how to set up a spider-friendly MediaWiki.

My robots.txt file contains:
User-agent: *
Disallow: /bin/
Disallow: /cgi-bin/
Disallow: /config/
Disallow: /docs/
Disallow: /extensions/
Disallow: /includes/
Disallow: /languages/
Disallow: /local/
Disallow: /maintenance/
Disallow: /math/
Disallow: /serialized/
Disallow: /skins/
Disallow: /t/
Disallow: /tests/

When I run a site: command, the skins folder IS indexed, and so are all the subdirectories in it. Out of all the Disallow lines above, only skins has a problem. Any ideas where to look? Thanks.

8:51 am on June 22, 2008 (gmt 0)

Senior Member

joined:Jan 27, 2003
posts:2534
votes: 0


Was the line added recently? It's possible that Google discovered the content a long time ago and has not revisited the pages since they were disallowed (you can check the cache date on the files to get an idea of whether this is the case).

This is often true of content that has few or no external links to it (which is likely the case with your /skins/ folder). Such content can hang around for months and months since Googlebot never revisits it.

If it's important to get the files removed, you can use the URL removal tool in Webmaster Tools; otherwise it's just a case of waiting. Certainly, there doesn't appear to be any problem with your robots directives.
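For anyone who wants to double-check directives like these offline, here is a minimal sketch using Python's standard urllib.robotparser. The rules are copied from the first post; the test paths (/skins/monobook/main.css, /index.php) and the helper name is_allowed are only illustrative assumptions, and this parser is not Googlebot's own, so treat the result as a sanity check rather than proof of how Google reads the file.

```python
from urllib.robotparser import RobotFileParser

# Rules copied from the robots.txt quoted in the first post.
ROBOTS_TXT = """\
User-agent: *
Disallow: /bin/
Disallow: /cgi-bin/
Disallow: /config/
Disallow: /docs/
Disallow: /extensions/
Disallow: /includes/
Disallow: /languages/
Disallow: /local/
Disallow: /maintenance/
Disallow: /math/
Disallow: /serialized/
Disallow: /skins/
Disallow: /t/
Disallow: /tests/
"""


def is_allowed(path, user_agent="Googlebot"):
    """Return True if the rules above permit user_agent to fetch path.

    is_allowed is a hypothetical helper name used for this sketch.
    """
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, path)


if __name__ == "__main__":
    # Hypothetical paths, chosen only to exercise the rules.
    for path in ("/skins/monobook/main.css", "/index.php"):
        print(path, "allowed" if is_allowed(path) else "disallowed")
```

If the /skins/ path comes back as disallowed, the file itself is fine and the listings are more likely leftovers than a parsing problem.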

8:55 am on June 22, 2008 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:May 6, 2005
posts:863
votes: 0


Hi Andy

No, the whole robots.txt file was created in Feb 2008. I have just started to work on it again, and to see how it was doing in the SERPs I did a site: command and found the skins folder to be indexed, along with every folder within it. I thought it was strange.

edit: there's no cache date, only the "Similar pages" and "Note this" links

9:02 am on June 22, 2008 (gmt 0)

Senior Member

joined:Jan 27, 2003
posts:2534
votes: 0


there's no cache date, only the "Similar pages" and "Note this" links

Is there a snippet underneath the listing, or do you just see the URLs? If it's just a URL, then this is quite common: files excluded in robots.txt often appear in Google listings in that way.

9:05 am on June 22, 2008 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:May 6, 2005
posts:863
votes: 0


No snippet, just:

URL
Similar pages - Note this

I use the short URL directives in .htaccess, but I can't see that affecting anything.

9:08 am on June 22, 2008 (gmt 0)

Senior Member

joined:Jan 27, 2003
posts:2534
votes: 0


In that case they are excluded, and your robots directives are being obeyed: Google is aware of the content because of links to it, but it is 'prevented' from retrieving it and so there is no cache or snippet.

Excluded files can hang around in this way for a long time (forever?) and while they make a mess of site: search results, in my experience there isn't usually any impact on performance.
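To make the distinction concrete, here is a toy model of the behaviour described above: a URL that is linked to still gets a listing, but its content is only retrieved when robots.txt allows it. This is not how Google's pipeline is actually built; Listing, build_listing and fake_fetch are hypothetical names made up for this sketch, and the paths are illustrative.

```python
from dataclasses import dataclass
from typing import Callable, Optional
from urllib.robotparser import RobotFileParser


@dataclass
class Listing:
    url: str
    snippet: Optional[str]  # None models a URL-only listing (no cache, no snippet)


def build_listing(url: str, robots: RobotFileParser,
                  fetch: Callable[[str], str],
                  user_agent: str = "Googlebot") -> Listing:
    """Toy model: a known URL is always listed, but fetched only if allowed."""
    if robots.can_fetch(user_agent, url):
        # Allowed: the page is retrieved, so a snippet can be shown.
        return Listing(url, snippet=fetch(url)[:80])
    # Disallowed: the URL is known from links, but the content is never fetched.
    return Listing(url, snippet=None)


def fake_fetch(url: str) -> str:
    """Stand-in for an HTTP request, so the sketch runs offline."""
    return "Page text for " + url


robots = RobotFileParser()
robots.parse(["User-agent: *", "Disallow: /skins/"])

if __name__ == "__main__":
    for url in ("/skins/common/", "/index.php"):
        print(build_listing(url, robots, fake_fetch))
```

In this model the /skins/ entry ends up with no snippet, which matches the URL-only results seen in the site: search.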

9:11 am on June 22, 2008 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:May 6, 2005
posts:863
votes: 0


OK, thanks Andy, I will leave it as is.

9:16 am on June 22, 2008 (gmt 0)

Senior Member

joined:Jan 27, 2003
posts:2534
votes: 0


I tracked down a (very!) old thread on the same subject, which has a bit more detail:

Indexed pages that are disallowed by robots.txt [webmasterworld.com]