g1smd

msg:4431077 | 11:38 pm on Mar 19, 2012 (gmt 0) |
Patterns are used as prefix matches, matching from the left.
Disallow: /Template is all you need. The * is never needed or used at the end of a pattern. It belongs only in the middle of a pattern.
Disallow: /*_this: would disallow any URL request like example.com/<anything>_this:
|
Holygamer

msg:4431087 | 12:07 am on Mar 20, 2012 (gmt 0) |
OK thanks, but any idea why Google is showing non-existent pages, especially when I have this?: Disallow: /Game Music/ Does it matter if I have a space in the name? Or would I be better off doing this?: Disallow: /Game
|
g1smd

msg:4431091 | 12:19 am on Mar 20, 2012 (gmt 0) |
The space in the robots.txt file is likely an issue. Use Disallow: /Game or Disallow: /Game*Music instead. However, when a URL is disallowed, Google can still show it as a URL-only entry in the SERPs, because it cannot access the URL to see the real status code. If you want something removed from the SERPs, you should let crawlers in and then serve the 404 or 410 status code for that request.
|
lucy24

msg:4431098 | 12:29 am on Mar 20, 2012 (gmt 0) |
| Does it matter if I have a space in the name? |
| Heck, yeah ;) By the time the request reaches your server, it almost certainly won't be a space. Probably it will come through as %20. As noted above, "Disallow: /Game" by itself will have the same effect. But-- ahem!-- the real fix is to get rid of any and all spaces in any and all names everywhere. Replace with hyphen or lowline or nothing. (Different argument.)
| The * is never needed or used at the end of a pattern. It belongs only in the middle of a pattern. |
| And that's not something you hear every day :) This pronouncement applies ONLY to robots.txt!
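On the space turning into %20: a minimal sketch in Python, assuming the client percent-encodes the path the way the standard library's urllib.parse.quote does. The path is the one from the question; nothing here is specific to any one crawler.

```python
from urllib.parse import quote

# The directory name as it appears on disk, with a literal space.
path = "/Game Music/"

# quote() percent-encodes the space and leaves "/" alone by default.
encoded = quote(path)
print(encoded)

# "Disallow: /Game" is a prefix, so it covers both spellings of the path.
print(path.startswith("/Game"), encoded.startswith("/Game"))
```

This is why a rule written with a literal space may never line up with the URL that is actually requested, while the shorter /Game prefix does.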
|
Holygamer

msg:4431102 | 12:52 am on Mar 20, 2012 (gmt 0) |
g1smd, you said "If you want something removed from the SERPs, you should let crawlers in and then serve the 404 or 410 status code for that request". How do I do that? Doesn't that defeat the purpose of the Robots.txt file? Or will the results eventually be removed if I just use the correct line in my robots file and wait a month or so? I do know about using hyphens instead of spaces and all that, it's just that I didn't want to do that with my music folder as it makes the folders hard to read in file manager in cPanel if I use hyphens instead of spaces. And you can still link to files with spaces in without problem, but I didn't think about the robots.txt file.
|
Holygamer

msg:4431250 | 11:44 am on Mar 20, 2012 (gmt 0) |
/Game*Music
Also, if I do the above to block the Game Music directory, it will also block pages beginning with "Game Music", which I don't want. So could I do this to block just the Game Music directory? Note the slash on the end.
/Game*Music/
|
topr8

msg:4431261 | 12:20 pm on Mar 20, 2012 (gmt 0) |
| But-- ahem!-- the real fix is to get rid of any and all spaces in any and all names everywhere. |
| i know this is OT but i totally concur with lucy24, if your site isn't really old you might be better off renaming all the files and folders without spaces. also IMHO using a mix of capital and lower case letters is not ideal either.
|
phranque

msg:4431285 | 1:15 pm on Mar 20, 2012 (gmt 0) |
| Doesn't that defeat the purpose of the Robots.txt file? |
| the purpose of the robots.txt file (NOTE THE LOWER CASE FILE NAME!) is to exclude the crawler from making a request for that url, not for keeping a url space out of the index. to keep a url out of the index you need a 404/410 response if the url is Not Found or Gone, otherwise a meta robots noindex tag or "X-Robots-Tag: noindex" HTTP Response header.
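A rough sketch of those two options in Python, using only the standard library's http.server. The paths in GONE and NOINDEX are hypothetical placeholders, and this is not how MediaWiki or Apache does it; it only shows which status code or header the crawler needs to receive.

```python
from http.server import BaseHTTPRequestHandler

# Hypothetical example paths, not taken from any real site.
GONE = {"/old-page"}          # removed for good: answer 410 Gone
NOINDEX = {"/private-page"}   # exists, but should stay out of the index


def response_for(path):
    """Return (status, extra_headers) for a requested path."""
    if path in GONE:
        return 410, {}
    if path in NOINDEX:
        return 200, {"X-Robots-Tag": "noindex"}
    return 200, {}


class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        status, headers = response_for(self.path)
        self.send_response(status)
        for name, value in headers.items():
            self.send_header(name, value)
        self.end_headers()
```

Either way, the crawler only sees the status code or header if robots.txt lets it make the request in the first place.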
|
Holygamer

msg:4431327 | 2:48 pm on Mar 20, 2012 (gmt 0) |
I'm using MediaWiki. The pages never existed in the first place, so how do I do what you suggested? This is one of the non-existent pages: http://example.com/Game_Music/B/Baroque/001._Baroque_%28Saturn%29_Original_soundtrack/ [edited by: tedster at 6:12 am (utc) on Mar 24, 2012] [edit reason] switch to example.com [/edit]
|
phranque

msg:4431342 | 3:07 pm on Mar 20, 2012 (gmt 0) |
MediaWiki should provide a "404 Not Found" response to requests for non-existent urls - as long as you don't exclude the crawler from making the request.
|
Holygamer

msg:4431348 | 3:16 pm on Mar 20, 2012 (gmt 0) |
This is my robots.txt file. Could you tell me if I am preventing MediaWiki from providing a "404 Not Found"?

User-agent: *
Disallow: /index.php
Disallow: /w/
Disallow: /Category:*
Disallow: /Category_talk:*
Disallow: /Extension:*
Disallow: /Extension_talk:*
Disallow: /File:*
Disallow: /File_talk:*
Disallow: /Game*/
Disallow: /Image:*
Disallow: /Image_talk:*
Disallow: /Help:*
Disallow: /Help_talk:*
Disallow: /Manual:*
Disallow: /Manual_talk:*
Disallow: /Media:*
Disallow: /MediaWiki:*
Disallow: /Media Wiki_talk:*
Disallow: /Project:*
Disallow: /Project_talk:*
Disallow: /Special
Disallow: /Special:*
Disallow: /Talk:*
Disallow: /Template:*
Disallow: /Template_talk:*
Disallow: /User:*
Disallow: /User_talk:*

User-agent: ia_archiver
Disallow: /
Allow: /Special:Contact

sitemap: http://example.com/sitemap.xml

This is my .htaccess file:

RewriteEngine on
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule ^(.+)$ index.php?title=$1 [L,QSA]
Options +FollowSymlinks
RewriteEngine on
# Link for the Sitemap
RewriteRule ^sitemap(.*)\.xml$ sitemap.php?page=$1 [L,NC]
RewriteCond %{HTTP_REFERER} !^http://example.com/.*$ [NC]
RewriteCond %{HTTP_REFERER} !^http://example.com$ [NC]
RewriteRule .*\.(jpg|jpeg|gif|png|bmp|mp3)$ http://example.com/Hotlink_Protection [R,NC]

If I'm not preventing it, then MediaWiki mustn't be working properly, because if it were giving 404 Not Found then I wouldn't be able to find the non-existent pages via Google? [edited by: tedster at 6:14 am (utc) on Mar 24, 2012] [edit reason] switch to example.com [/edit]
|
g1smd

msg:4431500 | 8:59 pm on Mar 20, 2012 (gmt 0) |
Remove the trailing * from all of the Disallow directives. Try this code:
Disallow: /Game*Music/$
It will disallow the single URL, not the subpages in that folder.
Your htaccess file is broken. The rules are in the wrong order. The hotlink rules should be first. Both of the .* patterns in that ruleset are errors in some way or other. The sitemap rule should be next. The general rewrite should be last. You should have RewriteEngine On only ONCE at the start of the file.
|
Holygamer

msg:4431531 | 10:17 pm on Mar 20, 2012 (gmt 0) |
OK thanks. I think it would be better if I moved the .htaccess question to a different thread, as we've gone off topic a bit. Here's the new topic: [webmasterworld.com...] Could you please reformat my htaccess to show me what it should look like, as I'm still not sure.
Anyway, I'm confused by what you guys said about pattern matching compared to what Google says here: [support.google.com...] On disallowing a directory, I found this on Google via the above link (click on "Manually create a robots.txt file"). Under "Pattern Matching" it says this:
To block access to all subdirectories that begin with private: Disallow: /private*/
So can't I just do this to disallow my Game Music directory?: Disallow: /Game*/
And won't the following also be OK to block page names beginning with Category: ? Disallow: /Category:*
|
g1smd

msg:4431535 | 10:28 pm on Mar 20, 2012 (gmt 0) |
The * wildcard belongs only in the middle of a pattern:
/*something
/some*thing
/some*thing/
/something*/
The pattern is a prefix match. It matches anything that BEGINS with this pattern. Use a trailing $ to match ONLY this exact URL. Never use * on the end of a rule. It is redundant.
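To make those rules concrete, here is a small sketch in Python that turns a robots.txt path pattern into a regular expression. The names rule_to_regex and is_blocked are made up for illustration, and the sketch assumes the commonly documented wildcard semantics (prefix match, * for any run of characters, a trailing $ to anchor the end). It is not any search engine's actual matcher.

```python
import re


def rule_to_regex(pattern):
    """Translate a robots.txt path pattern into a compiled regex."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape each literal piece, then join the pieces with ".*" for every "*".
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))


def is_blocked(pattern, path):
    # re.match anchors at the start of the path, which gives the prefix match.
    return rule_to_regex(pattern).match(path) is not None


# A trailing * changes nothing: both rules block the same URLs.
print(is_blocked("/Category:", "/Category:Foo"))
print(is_blocked("/Category:*", "/Category:Foo"))

# Without $, everything under the folder is blocked; with $, this deeper page is not.
print(is_blocked("/Game*/", "/Game_Music/B/page"))
print(is_blocked("/Game*/$", "/Game_Music/B/page"))
```

One thing to watch in this sketch: * also matches slashes, so /Game*/$ would still match a deeper URL that itself ends in a slash, such as /Game_Music/B/.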
|
Holygamer

msg:4431543 | 10:40 pm on Mar 20, 2012 (gmt 0) |
OK, just checking then. You said /something*/ is OK, so is this OK to disallow my Game Music directory?: /Game*/
|
g1smd

msg:4431544 | 10:41 pm on Mar 20, 2012 (gmt 0) |
Disallow: /Game*/ will block /Game<something>/<anything-or-nothing>
Disallow: /Game*/$ will block /Game<something>/ and not subpages.
|
Holygamer

msg:4431551 | 10:51 pm on Mar 20, 2012 (gmt 0) |
OK got it. Also, if I don't want to block a directory but I want to block any page starting with "Category:", do I do this without putting an asterisk at the end? Disallow: /Category:
|
g1smd

msg:4431553 | 10:52 pm on Mar 20, 2012 (gmt 0) |
Disallow: /Category: will block the URLs /Category:<anything-or-nothing>
|
|