Can't seem to Disallow a Directory

11:07 pm on Mar 19, 2012 (gmt 0)



I couldn't find an answer about wildcards when searching on Google.

Is this the correct way to disallow crawling of pages which have "Template" at the start of the page name?

Disallow: /Template*


So would the above also prevent pages which have the following at the start of their names from being crawled?

Template:
Template_talk:


Also I have a directory in the root of my site called "Game Music" which has subfolders with MP3s in them (there are no webpages inside it). Google is showing non-existent pages from that directory in search results when I search for an album name. For example, try searching for the following on Google: Baroque (Saturn) Original Soundtrack

On the 1st results page you'll see 2 links to my site. The 1st link takes you to the actual page I have made. The 2nd link takes you to a non-existent page called:
"Game Music/B/Baroque/001. Baroque (Saturn) Original soundtrack"

The 1st page has download links to MP3s in the following location:
"Game Music/B/Baroque/001. Baroque (Saturn) Original soundtrack"

Forgetting about the robots file for a second, why is Google showing non-existent pages in search results? I use MediaWiki, which is the software that Wikipedia uses to build its website. I don't know if this is the correct term or not, but the pages are virtual paths - the pages are stored in a MySQL database - does this have something to do with the problem?

I tried the following a month ago but the non-existent pages are still showing in search results:

Disallow: /Game Music/


Is there a problem with there being a space in the directory name?

Would this work to block everything in that directory from being indexed?

Disallow: /Game*
11:38 pm on Mar 19, 2012 (gmt 0)

g1smd (WebmasterWorld Senior Member)



Patterns are used as prefix matches, matching from the left.

Disallow: /Template
is all you need.
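
A minimal sketch of that prefix-match behaviour, using Python's standard urllib.robotparser. The thread itself contains no Python, and the page names below are made up for illustration. This parser is not Google's, and as far as I know it does not expand * wildcards, so the sketch only demonstrates plain prefix matching.

```python
from urllib.robotparser import RobotFileParser

# Parse a two-line robots.txt held in memory (no network access needed).
rules = RobotFileParser()
rules.parse([
    "User-agent: *",
    "Disallow: /Template",
])

def blocked(path: str) -> bool:
    """True if the rules above disallow crawling of this path."""
    return not rules.can_fetch("*", "http://example.com" + path)

# Hypothetical page names, chosen to mirror the question.
for path in ("/Template:Infobox", "/Template_talk:Infobox", "/Main_Page"):
    print(path, "blocked" if blocked(path) else "allowed")
```

Both the Template: and Template_talk: pages are caught by the single rule, with no trailing * required.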

The * is never needed or used at the end of a pattern.

It belongs only in the middle of a pattern.

Disallow: /*_this:
would disallow any URL request like
example.com/<anything>_this:
12:07 am on Mar 20, 2012 (gmt 0)



OK thanks, but any idea why Google is showing non-existent pages, especially when I have this: Disallow: /Game Music/
Does it matter if I have a space in the name? Or would I be better off doing this: Disallow: /Game
12:19 am on Mar 20, 2012 (gmt 0)

g1smd (WebmasterWorld Senior Member)



The space in the robots.txt file is likely an issue.

Use
Disallow: /Game

or
Disallow: /Game*Music



However, when a URL is disallowed Google can still show that URL as a URL-only entry in the SERPs as it cannot access the URL to see the real status code.

If you want something removed from the SERPs, you should let crawlers in and then serve the 404 or 410 status code for that request.
12:29 am on Mar 20, 2012 (gmt 0)

lucy24 (WebmasterWorld Senior Member)



Does it matter if I have a space in the name?

Heck, yeah ;) By the time the request reaches your server, it almost certainly won't be a space. Probably it will come through as %20. As noted above, "Disallow: /Game" by itself will have the same effect.
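
To make the %20 point concrete, here is a small sketch using Python's urllib.parse. It assumes the client percent-encodes the path in the usual way; the path is the directory from the question.

```python
from urllib.parse import quote, unquote

# The directory as it appears in the file manager, with a literal space.
raw_path = "/Game Music/B/Baroque/"

# What a client would normally put on the wire: the space becomes %20.
encoded = quote(raw_path)
print(encoded)

# A rule written with a literal space is not a prefix of the encoded path,
# while the shorter rule "/Game" still is.
print(encoded.startswith("/Game Music/"))
print(encoded.startswith("/Game"))
```

How a given crawler normalises a space inside robots.txt itself is not something this sketch can show, which is another reason to avoid the space in the rule.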

But-- ahem!-- the real fix is to get rid of any and all spaces in any and all names everywhere. Replace with hyphen or lowline or nothing. (Different argument.)

The * is never needed or used at the end of a pattern.

It belongs only in the middle of a pattern.

And that's not something you hear every day :) This pronouncement applies ONLY to robots.txt!
12:52 am on Mar 20, 2012 (gmt 0)



g1smd, you said "If you want something removed from the SERPs, you should let crawlers in and then serve the 404 or 410 status code for that request". How do I do that? Doesn't that defeat the purpose of the Robots.txt file?

Or will the results eventually be removed if I just use the correct line in my robots file and wait a month or so?

I do know about using hyphens instead of spaces and all that, it's just that I didn't want to do that with my music folder as it makes the folders hard to read in file manager in cPanel if I use hyphens instead of spaces. And you can still link to files with spaces in without problem, but I didn't think about the robots.txt file.
11:44 am on Mar 20, 2012 (gmt 0)



/Game*Music

Also if I do the above to block the Game Music directory it will also block pages beginning with "Game Music" which I don't want. So could I do this to just block the Game Music directory? Note the slash on the end.

/Game*Music/
12:20 pm on Mar 20, 2012 (gmt 0)

topr8 (WebmasterWorld Senior Member)



But-- ahem!-- the real fix is to get rid of any and all spaces in any and all names everywhere.


i know this is OT but i totally concur with lucy24, if your site isn't really old you might be better off renaming all the files and folders without spaces. also IMHO using a mix of capital and lower case letters is not ideal either.
1:15 pm on Mar 20, 2012 (gmt 0)

phranque (WebmasterWorld Administrator)



Doesn't that defeat the purpose of the Robots.txt file?

the purpose of the robots.txt file (NOTE THE LOWER CASE FILE NAME!) is to exclude the crawler from making a request for that url, not for keeping a url space out of the index.
to keep a url out of the index you need a 404/410 response if the url is Not Found or Gone, otherwise a meta robots noindex tag or "X-Robots-Tag: noindex" HTTP Response header.
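
A rough sketch of the distinction above, written as a small Python helper. The function name and the three labels are made up for illustration, and it only looks at the status code and the X-Robots-Tag header; it does not parse the HTML for a meta robots tag.

```python
def crawler_signal(status: int, headers: dict[str, str]) -> str:
    """Return a rough label for what a response tells a search engine crawler."""
    if status in (404, 410):
        # Not Found or Gone: the URL should eventually fall out of the index.
        return "drop"
    # Header names are case-insensitive, so normalise before looking one up.
    lowered = {name.lower(): value.lower() for name, value in headers.items()}
    if "noindex" in lowered.get("x-robots-tag", ""):
        return "noindex"
    return "indexable"

print(crawler_signal(410, {}))
print(crawler_signal(200, {"X-Robots-Tag": "noindex"}))
print(crawler_signal(200, {"Content-Type": "text/html"}))
```

None of these signals can be seen by a crawler that robots.txt has told not to request the URL in the first place.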
2:48 pm on Mar 20, 2012 (gmt 0)



I'm using MediaWiki. The pages never existed in the first place so how do I do what you suggested?

This is one of the non-existent pages:

http://example.com/Game_Music/B/Baroque/001._Baroque_%28Saturn%29_Original_soundtrack/

[edited by: tedster at 6:12 am (utc) on Mar 24, 2012]
[edit reason] switch to example.com [/edit]

3:07 pm on Mar 20, 2012 (gmt 0)

phranque (WebmasterWorld Administrator)



MediaWiki should provide a "404 Not Found" response to requests for non-existent urls - as long as you don't exclude the crawler from making the request.
3:16 pm on Mar 20, 2012 (gmt 0)



This is my robots.txt file. Could you tell me if I am preventing MediaWiki from providing a "404 Not Found"?

User-agent: *
Disallow: /index.php
Disallow: /w/
Disallow: /Category:*
Disallow: /Category_talk:*
Disallow: /Extension:*
Disallow: /Extension_talk:*
Disallow: /File:*
Disallow: /File_talk:*
Disallow: /Game*/
Disallow: /Image:*
Disallow: /Image_talk:*
Disallow: /Help:*
Disallow: /Help_talk:*
Disallow: /Manual:*
Disallow: /Manual_talk:*
Disallow: /Media:*
Disallow: /MediaWiki:*
Disallow: /Media Wiki_talk:*
Disallow: /Project:*
Disallow: /Project_talk:*
Disallow: /Special
Disallow: /Special:*
Disallow: /Talk:*
Disallow: /Template:*
Disallow: /Template_talk:*
Disallow: /User:*
Disallow: /User_talk:*
User-agent: ia_archiver
Disallow: /
Allow: /Special:Contact

sitemap: http://example.com/sitemap.xml

This is my .htaccess file:
RewriteEngine on
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule ^(.+)$ index.php?title=$1 [L,QSA]

Options +FollowSymlinks
RewriteEngine on

# Link for the Sitemap
RewriteRule ^sitemap(.*)\.xml$ sitemap.php?page=$1 [L,NC]

RewriteCond %{HTTP_REFERER} !^http://example.com/.*$ [NC]
RewriteCond %{HTTP_REFERER} !^http://example.com$ [NC]
RewriteRule .*\.(jpg|jpeg|gif|png|bmp|mp3)$ http://example.com/Hotlink_Protection [R,NC]




If I'm not preventing it, then MediaWiki mustn't be working properly, because if it was giving 404 Not Found then I wouldn't be able to find the non-existent pages via Google?

[edited by: tedster at 6:14 am (utc) on Mar 24, 2012]
[edit reason] switch to example.com [/edit]

8:59 pm on Mar 20, 2012 (gmt 0)

g1smd (WebmasterWorld Senior Member)



Remove the trailing * from all of the Disallow directives.

Try this code:
Disallow: /Game*Music/$

It will disallow only URLs that end in Music/, which in practice means the folder URL itself, not the files and subpages in that folder.


Your htaccess file is broken. The rules are in the wrong order.

The hotlink rules should be first. Both of the .* patterns in that ruleset are errors in some way or other.

The sitemap rule should be next.

The general rewrite should be last.

You should have
RewriteEngine On
only ONCE at the start of the file.
10:17 pm on Mar 20, 2012 (gmt 0)



OK thanks. I think it would be better if I moved the .htaccess question to a different thread as we've gone off topic a bit. Here's the new topic: [webmasterworld.com...] Could you please reformat my htaccess to show me what it should look like as I'm still not sure.

Anyway, I'm confused by what you guys said about pattern matching compared to what Google says here: [support.google.com...]

With disallowing a directory I found this on Google via the above link (click on "Manually create a robots.txt file"). Under "Pattern Matching" it says this:

To block access to all subdirectories that begin with private:

Disallow: /private*/

So can't I just do this to disallow my Game Music directory?: Disallow: /Game*/

So won't the following also be OK to block page names beginning with Category: ?

Disallow: /Category:*
10:28 pm on Mar 20, 2012 (gmt 0)

g1smd (WebmasterWorld Senior Member)



The * wildcard belongs only in the middle of a pattern:

/*something

/some*thing

/some*thing/

/something*/

The pattern is a prefix match. It matches anything that BEGINS with this pattern.

Use a trailing $ to match ONLY this exact URL.

Never use * on the end of a rule. It is redundant.
10:40 pm on Mar 20, 2012 (gmt 0)



OK, just checking: you said /something*/ is OK, so is this OK to disallow my Game Music directory?

/Game*/
10:41 pm on Mar 20, 2012 (gmt 0)

g1smd (WebmasterWorld Senior Member)



Disallow: /Game*/ will block /Game<something>/<anything-or-nothing>

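Disallow: /Game*/$ will block only URLs that start with /Game and end in a slash, such as /Game<something>/, and not the files inside. Because * can also match slashes, a deeper URL that ends in / still matches.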
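
For anyone who wants to try patterns before editing the live file, here is a minimal sketch of wildcard-style matching in Python. It assumes the semantics described in this thread (prefix match from the left, * for any run of characters, a trailing $ to anchor the end) and is not Google's actual implementation; the function names and the sample paths are made up for illustration.

```python
import re

def rule_to_regex(pattern: str) -> re.Pattern:
    """Translate a wildcard-style robots.txt path pattern into a compiled regex."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape each literal chunk and let every * stand for any run of characters.
    body = ".*".join(re.escape(chunk) for chunk in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))

def is_blocked(pattern: str, path: str) -> bool:
    """True if the path matches the pattern under the assumed semantics."""
    # re.match anchors at the start of the path, which gives the prefix match.
    return rule_to_regex(pattern).match(path) is not None

samples = [
    ("/Game*/", "/Game_Music/B/Baroque/track.mp3"),
    ("/Game*/$", "/Game_Music/"),
    ("/Game*/$", "/Game_Music/B/Baroque/track.mp3"),
    ("/Game*Music/", "/Game%20Music/B/"),
    ("/Category:", "/Category:Albums"),
]
for pattern, path in samples:
    print(pattern, path, is_blocked(pattern, path))
```

One subtlety the sketch brings out: under these semantics * can span slashes, so /Game*/$ also matches a deeper URL such as /Game_Music/B/ as long as it ends in a slash. It also shows that /Game*Music/ covers both the %20 form and the underscore form of the directory name.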
10:51 pm on Mar 20, 2012 (gmt 0)



OK got it.

Also, if I don't want to block a directory but I want to block any page starting with "Category:", do I do this without putting an asterisk at the end?

Disallow: /Category:
10:52 pm on Mar 20, 2012 (gmt 0)

g1smd (WebmasterWorld Senior Member)



Disallow: /Category: will block the URLs /Category:<anything-or-nothing>