Can't seem to Disallow a Directory - Sitemaps, Meta Data, and robots.txt forum at WebmasterWorld

Can't seem to Disallow a Directory

Holygamer

11:07 pm on Mar 19, 2012 (gmt 0)

10+ Year Member

I couldn't find an answer about wildcards when searching on Google.

Is this the correct way to disallow crawling of pages which have "Template" at the start of the page name?

Disallow: /Template*

So would the above also prevent pages which have the following at the start of their names from being crawled?

Template:
Template_talk:

Also, I have a directory in the root of my site called "Game Music" which has subfolders containing MP3s (there are no web pages inside it). Google is showing nonexistent pages from that directory in search results when I search for an album name. For example, try searching for the following on Google: Baroque (Saturn) Original Soundtrack

On the 1st results page you'll see 2 links to my site. The 1st link takes you to the actual page I have made. The 2nd link takes you to a nonexistent page called:
"Game Music/B/Baroque/001. Baroque (Saturn) Original soundtrack"

The 1st page has download links to MP3s in the following location:
"Game Music/B/Baroque/001. Baroque (Saturn) Original soundtrack"

Forgetting about the robots file for a second, why is Google showing nonexistent pages in search results? I use MediaWiki, which is the software that Wikipedia uses to build its website. I don't know if this is the correct term or not, but the pages are virtual paths (the pages are stored in a MySQL database). Does this have something to do with the problem?

I tried the following a month ago but the nonexistent pages are still showing in search results:

Disallow: /Game Music/

Is there a problem with there being a space in the directory name?

Would this work to block everything in that directory from being indexed?

Disallow: /Game*

g1smd

11:38 pm on Mar 19, 2012 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Top Contributors Of The Month

Patterns are used as prefix matches, matching from the left.

Disallow: /Template

is all you need.

The * is never needed or used at the end of a pattern.

It belongs only in the middle of a pattern.

Disallow: /*_this:

would disallow any URL request like

example.com/<anything>_this:<anything-or-nothing>
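
As a rough illustration of the prefix-match rule described above, here is a minimal Python sketch. It assumes the Google-style semantics discussed in this thread (match from the left, * stands for any run of characters, a trailing $ anchors the end of the URL). The function name rule_matches is made up for this example, and other crawlers may interpret wildcards differently.

```python
import re


def rule_matches(pattern: str, path: str) -> bool:
    """Return True if a robots.txt pattern matches a URL path.

    Assumed semantics: prefix match from the left, '*' matches any run
    of characters, and a trailing '$' anchors the end of the path.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape each literal piece, then join the pieces with '.*' for each '*'.
    regex = ".*".join(re.escape(part) for part in pattern.split("*"))
    if anchored:
        regex += r"\Z"
    # re.match only anchors at the start, which gives the prefix behaviour.
    return re.match(regex, path) is not None


# A trailing '*' changes nothing: both forms match the same paths.
print(rule_matches("/Template", "/Template:Infobox"))        # True
print(rule_matches("/Template*", "/Template_talk:Infobox"))  # True
print(rule_matches("/*_this:", "/anything_this:page"))       # True
print(rule_matches("/Template", "/Main_Page"))               # False
```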

Holygamer

12:07 am on Mar 20, 2012 (gmt 0)

10+ Year Member

OK thanks, but any idea why Google is showing nonexistent pages, especially when I have this?: Disallow: /Game Music/
Does it matter if I have a space in the name? Or would I be better off doing this?: Disallow: /Game

g1smd

12:19 am on Mar 20, 2012 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Top Contributors Of The Month

The space in the robots.txt file is likely an issue.

Use

Disallow: /Game

or

Disallow: /Game*Music

However, when a URL is disallowed Google can still show that URL as a URL-only entry in the SERPs as it cannot access the URL to see the real status code.

If you want something removed from the SERPs, you should let crawlers in and then serve the 404 or 410 status code for that request.

lucy24

12:29 am on Mar 20, 2012 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Top Contributors Of The Month

Does it matter if I have a space in the name?

Heck, yeah ;) By the time the request reaches your server, it almost certainly won't be a space. Probably it will come through as %20. As noted above, "Disallow: /Game" by itself will have the same effect.

But-- ahem!-- the real fix is to get rid of any and all spaces in any and all names everywhere. Replace with hyphen or lowline or nothing. (Different argument.)
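
To illustrate the point about spaces, here is a small Python sketch that uses the standard library's urllib.parse.quote to show how a path containing a space looks once percent-encoded. It assumes a crawler compares the rule text with the encoded path character by character. Real crawlers may normalise both sides before comparing, so treat this as a simplification.

```python
from urllib.parse import quote

# The directory path as typed, with a literal space in it.
path = "/Game Music/B/Baroque/"

# quote() percent-encodes the space and leaves "/" untouched by default.
encoded = quote(path)
print(encoded)  # /Game%20Music/B/Baroque/

# A rule written with a literal space no longer lines up with the encoded path...
print(encoded.startswith("/Game Music/"))  # False

# ...while the shorter prefix avoids the space altogether.
print(encoded.startswith("/Game"))  # True
```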

The * is never needed or used at the end of a pattern.

It belongs only in the middle of a pattern.

And that's not something you hear every day :) This pronouncement applies ONLY to robots.txt!

Holygamer

12:52 am on Mar 20, 2012 (gmt 0)

10+ Year Member

g1smd, you said "If you want something removed from the SERPs, you should let crawlers in and then serve the 404 or 410 status code for that request". How do I do that? Doesn't that defeat the purpose of the Robots.txt file?

Or will the results eventually be removed if I just use the correct line in my robots file and wait a month or so?

I do know about using hyphens instead of spaces and all that, it's just that I didn't want to do that with my music folder as it makes the folders hard to read in file manager in cPanel if I use hyphens instead of spaces. And you can still link to files with spaces in without problem, but I didn't think about the robots.txt file.

Holygamer

11:44 am on Mar 20, 2012 (gmt 0)

10+ Year Member

/Game*Music

Also if I do the above to block the Game Music directory it will also block pages beginning with "Game Music" which I don't want. So could I do this to just block the Game Music directory? Note the slash on the end.

/Game*Music/

topr8

12:20 pm on Mar 20, 2012 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Top Contributors Of The Month

But-- ahem!-- the real fix is to get rid of any and all spaces in any and all names everywhere.

i know this is OT but i totally concur with lucy24, if your site isn't really old you might be better off renaming all the files and folders without spaces. also IMHO using a mix of capital and lower case letters is not ideal either.

phranque

1:15 pm on Mar 20, 2012 (gmt 0)

WebmasterWorld Administrator

10+ Year Member

Top Contributors Of The Month

Doesn't that defeat the purpose of the Robots.txt file?

the purpose of the robots.txt file (NOTE THE LOWER CASE FILE NAME!) is to exclude the crawler from making a request for that url, not for keeping a url space out of the index.
to keep a url out of the index you need a 404/410 response if the url is Not Found or Gone, otherwise a meta robots noindex tag or "X-Robots-Tag: noindex" HTTP Response header.
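
The distinction above can be sketched as a small Python helper that looks at a response the crawler was allowed to fetch and reports which de-indexing signal, if any, it carries. The function name removal_signal and its return labels are invented for this example. It only inspects the status code and the X-Robots-Tag header, not a meta robots tag inside the HTML.

```python
def removal_signal(status: int, headers: dict) -> str:
    """Classify a fetched response by the de-indexing signal it carries.

    Returns "gone" for 404/410, "noindex" if an X-Robots-Tag header
    contains noindex, and "indexable" otherwise.
    """
    if status in (404, 410):
        return "gone"
    # Header names are case-insensitive, so compare them in lower case.
    tag = next(
        (value for name, value in headers.items() if name.lower() == "x-robots-tag"),
        "",
    )
    if "noindex" in tag.lower():
        return "noindex"
    return "indexable"


print(removal_signal(404, {}))                           # gone
print(removal_signal(200, {"X-Robots-Tag": "noindex"}))  # noindex
print(removal_signal(200, {}))                           # indexable
```

None of these signals can be seen by a crawler that robots.txt has excluded from requesting the URL in the first place.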

Holygamer

2:48 pm on Mar 20, 2012 (gmt 0)

10+ Year Member

I'm using MediaWiki. The pages never existed in the first place so how do I do what you suggested?

This is one of the nonexistent pages:

http://example.com/Game_Music/B/Baroque/001._Baroque_%28Saturn%29_Original_soundtrack/

phranque

3:07 pm on Mar 20, 2012 (gmt 0)

WebmasterWorld Administrator

10+ Year Member

Top Contributors Of The Month

MediaWiki should provide a "404 Not Found" response to requests for non-existent urls - as long as you don't exclude the crawler from making the request.

Holygamer

3:16 pm on Mar 20, 2012 (gmt 0)

10+ Year Member

This is my robots.txt file. Could you tell me if I am preventing MediaWiki from providing a "404 Not Found"?

User-agent: *
Disallow: /index.php
Disallow: /w/
Disallow: /Category:*
Disallow: /Category_talk:*
Disallow: /Extension:*
Disallow: /Extension_talk:*
Disallow: /File:*
Disallow: /File_talk:*
Disallow: /Game*/
Disallow: /Image:*
Disallow: /Image_talk:*
Disallow: /Help:*
Disallow: /Help_talk:*
Disallow: /Manual:*
Disallow: /Manual_talk:*
Disallow: /Media:*
Disallow: /MediaWiki:*
Disallow: /Media Wiki_talk:*
Disallow: /Project:*
Disallow: /Project_talk:*
Disallow: /Special
Disallow: /Special:*
Disallow: /Talk:*
Disallow: /Template:*
Disallow: /Template_talk:*
Disallow: /User:*
Disallow: /User_talk:*
User-agent: ia_archiver
Disallow: /
Allow: /Special:Contact

sitemap: http://example.com/sitemap.xml

This is my .htaccess file:
RewriteEngine on
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule ^(.+)$ index.php?title=$1 [L,QSA]

Options +FollowSymlinks
RewriteEngine on

# Link for the Sitemap
RewriteRule ^sitemap(.*)\.xml$ sitemap.php?page=$1 [L,NC]

RewriteCond %{HTTP_REFERER} !^http://example.com/.*$ [NC]
RewriteCond %{HTTP_REFERER} !^http://example.com$ [NC]
RewriteRule .*\.(jpg|jpeg|gif|png|bmp|mp3)$ http://example.com/Hotlink_Protection [R,NC]

If I'm not preventing it, then MediaWiki mustn't be working properly, because if it were giving 404 Not Found then I wouldn't be able to find the nonexistent pages via Google?

g1smd

8:59 pm on Mar 20, 2012 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Top Contributors Of The Month

Remove the trailing * from all of the Disallow directives.

Try this code:

Disallow: /Game*Music/$

It will disallow the single URL, not the subpages in that folder.

Your htaccess file is broken. The rules are in the wrong order.

The hotlink rules should be first. Both of the .* patterns in that ruleset are errors in some way or other.

The sitemap rule should be next.

The general rewrite should be last.

You should have

RewriteEngine On

only ONCE at the start of the file.

Holygamer

10:17 pm on Mar 20, 2012 (gmt 0)

10+ Year Member

OK thanks. I think it would be better if I moved the .htaccess question to a different thread as we've gone off topic a bit. Here's the new topic: [webmasterworld.com...] Could you please reformat my htaccess to show me what it should look like as I'm still not sure.

Anyway, I'm confused with what you guys said about pattern matching compared to what Google says here: [support.google.com...]

Regarding disallowing a directory, I found this on Google via the above link (click on "Manually create a robots.txt file"). Under "Pattern Matching" it says this:

To block access to all subdirectories that begin with private:

Disallow: /private*/

So can't I just do this to disallow my Game Music directory?: Disallow: /Game*/

So won't the following also be OK to block page names beginning with Category: ?

Disallow: /Category:*

g1smd

10:28 pm on Mar 20, 2012 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Top Contributors Of The Month

The * wildcard belongs only in the middle of a pattern:

/*something

/some*thing

/some*thing/

/something*/

The pattern is a prefix match. It matches anything that BEGINS with this pattern.

Use a trailing $ to match ONLY this exact URL.

Never use * on the end of a rule. It is redundant.
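
Since a trailing * is redundant under these rules, an existing file can be tidied mechanically. The following Python sketch strips trailing asterisks from Allow and Disallow lines and leaves everything else alone. The helper name tidy_rule is made up for this example, and it is only a sketch: it does not handle comments or unusual whitespace.

```python
def tidy_rule(line: str) -> str:
    """Strip a redundant trailing '*' from an Allow/Disallow line."""
    field, sep, value = line.partition(":")
    if not sep or field.strip().lower() not in ("allow", "disallow"):
        return line  # e.g. "User-agent: *" must keep its asterisk
    stripped = value.strip().rstrip("*")
    if not stripped:
        return line  # an empty Disallow means something different, so leave it
    return f"{field.strip()}: {stripped}"


rules = [
    "User-agent: *",
    "Disallow: /Category:*",
    "Disallow: /Template_talk:*",
    "Disallow: /Game*/",
]
for rule in rules:
    print(tidy_rule(rule))
# User-agent: *
# Disallow: /Category:
# Disallow: /Template_talk:
# Disallow: /Game*/
```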

Holygamer

10:40 pm on Mar 20, 2012 (gmt 0)

10+ Year Member

OK, just checking: you said /something*/ is OK, so is this OK to disallow my Game Music directory?:

/Game*/

g1smd

10:41 pm on Mar 20, 2012 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Top Contributors Of The Month

Disallow: /Game*/ will block /Game<something>/<anything-or-nothing>

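Disallow: /Game*/$ will block /Game<something>/ and not pages or files below it, as long as their URLs do not end in a slash. Because * can also match slashes, a deeper URL that ends in a slash, such as /Game<something>/<subfolder>/, is still blocked.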
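
One caveat is worth checking when comparing these two rules. Under Google's documented behaviour, * matches any characters, including slashes, so the $ form can also match deeper URLs that happen to end in a slash. The Python sketch below hand-translates both rules into regular expressions (an assumption about how a crawler evaluates them) and tries them on a few paths. The file name track01.mp3 and the page name Gameplay_tips are hypothetical.

```python
import re

# Hand-translated equivalents of the two rules
# (assumption: '*' becomes '.*' and '$' becomes end-of-URL).
subtree_rule = re.compile(r"/Game.*/")      # Disallow: /Game*/
anchored_rule = re.compile(r"/Game.*/\Z")   # Disallow: /Game*/$

paths = [
    "/Game_Music/",
    "/Game_Music/B/Baroque/",
    "/Game_Music/B/Baroque/track01.mp3",
    "/Gameplay_tips",
]

# For each path, record whether each rule blocks it (matching from the left).
results = {
    p: (bool(subtree_rule.match(p)), bool(anchored_rule.match(p))) for p in paths
}
for p, (subtree, anchored) in results.items():
    print(f"{p}: /Game*/ -> {subtree}, /Game*/$ -> {anchored}")
```

In this sketch the anchored rule still matches /Game_Music/B/Baroque/ because that URL ends in a slash, while the MP3 path is matched only by the unanchored rule.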

Holygamer

10:51 pm on Mar 20, 2012 (gmt 0)

10+ Year Member

OK got it.

Also, if I don't want to block a directory but I want to block any page starting "Category:", then do I do this without putting an asterisk at the end?

Disallow: /Category:

g1smd

10:52 pm on Mar 20, 2012 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Top Contributors Of The Month

Disallow: /Category: will block the URLs /Category:<anything-or-nothing>