
Sitemaps, Meta Data, and robots.txt Forum

    
Can't seem to Disallow a Directory
Disallow robots.txt robots
Holygamer



 
Msg#: 4431065 posted 11:07 pm on Mar 19, 2012 (gmt 0)

I couldn't find an answer about wildcards when searching on Google.

Is this the correct way to disallow crawling of pages which have "Template" at the start of the page name?

Disallow: /Template*

So would the above also prevent pages which have the following at the start of their names from being crawled?

Template:
Template_talk:


Also I have a directory in the root of my site called "Game Music" which has subfolders with MP3s in (there are no webpages inside it). Google is showing non-existent pages from that directory in search results when I search for an album name. For example, try searching for the following on Google: Baroque (Saturn) Original Soundtrack

On the 1st results page you'll see 2 links to my site. The 1st link takes you to the actual page I have made. The 2nd link takes you to a non-existent page called:
"Game Music/B/Baroque/001. Baroque (Saturn) Original soundtrack"

The 1st page has download links to MP3s in the following location:
"Game Music/B/Baroque/001. Baroque (Saturn) Original soundtrack"

Forgetting about the robots file for a second, why is Google showing non-existent pages in search results? I use MediaWiki, which is the software that Wikipedia uses to build its website. I don't know if this is the correct term or not, but the pages are virtual paths - the pages are stored in a MySQL database - does this have something to do with the problem?

I tried the following a month ago but the non-existent pages are still showing in search results:

Disallow: /Game Music/

Is there a problem with there being a space in the directory name?

Would this work to block everything in that directory from being indexed?

Disallow: /Game*

 

g1smd

WebmasterWorld Senior Member; WebmasterWorld Top Contributor of All Time; 10+ Year Member



 
Msg#: 4431065 posted 11:38 pm on Mar 19, 2012 (gmt 0)

Patterns are used as prefix matches, matching from the left.

Disallow: /Template is all you need.

The * is never needed or used at the end of a pattern.

It belongs only in the middle of a pattern.

Disallow: /*_this: would disallow any URL request like example.com/<anything>_this:

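
A rough Python sketch of the prefix matching described above may help. It assumes that `*` is the only special character and that it stands for any run of characters; the function name `robots_prefix_match` and the sample paths are made up for illustration, and real crawlers may differ in details.

```python
import re

def robots_prefix_match(pattern: str, path: str) -> bool:
    """Return True if `path` starts with something that matches `pattern`.

    Assumption: only `*` is treated as a wildcard; all other characters
    in the pattern are taken literally.
    """
    # Escape each literal piece, then put ".*" wherever the pattern had "*".
    regex = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    # re.match anchors at the start only, which gives prefix semantics.
    return re.match(regex, path) is not None

print(robots_prefix_match("/Template", "/Template:Infobox"))       # True
print(robots_prefix_match("/Template", "/Template_talk:Infobox"))  # True
print(robots_prefix_match("/Template*", "/Template:Infobox"))      # True, same as without *
print(robots_prefix_match("/*_this:", "/anything_this:more"))      # True
print(robots_prefix_match("/Template", "/Main_Page"))              # False
```

Because the match is already a prefix match, a trailing `*` adds nothing: `/Template` and `/Template*` give the same answer for every path.
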
Holygamer



 
Msg#: 4431065 posted 12:07 am on Mar 20, 2012 (gmt 0)

OK thanks, but any idea why Google is showing non-existent pages, especially when I have this?: Disallow: /Game Music/
Does it matter if I have a space in the name? Or would I be better off doing this?: Disallow: /Game

g1smd




 
Msg#: 4431065 posted 12:19 am on Mar 20, 2012 (gmt 0)

The space in the robots.txt file is likely an issue.

Use
Disallow: /Game
or
Disallow: /Game*Music


However, when a URL is disallowed Google can still show that URL as a URL-only entry in the SERPs as it cannot access the URL to see the real status code.

If you want something removed from the SERPs, you should let crawlers in and then serve the 404 or 410 status code for that request.

lucy24

WebmasterWorld Senior Member; WebmasterWorld Top Contributor of All Time; Top Contributors Of The Month



 
Msg#: 4431065 posted 12:29 am on Mar 20, 2012 (gmt 0)

Does it matter if I have a space in the name?

Heck, yeah ;) By the time the request reaches your server, it almost certainly won't be a space. Probably it will come through as %20. As noted above, "Disallow: /Game" by itself will have the same effect.
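
To see what the space turns into, here is a small Python sketch using the standard library's percent-encoding. The path is the directory from this thread; whether a given crawler compares rules against the encoded or the decoded form is an assumption here, since crawlers differ.

```python
from urllib.parse import quote, unquote

raw_path = "/Game Music/B/Baroque/"
encoded_path = quote(raw_path)  # "/" is left unencoded by default

print(encoded_path)                               # /Game%20Music/B/Baroque/
print(unquote(encoded_path) == raw_path)          # True

# A rule containing a literal space does not prefix-match the encoded form,
# while the shorter "/Game" prefix matches either way.
print(encoded_path.startswith("/Game Music/"))    # False
print(encoded_path.startswith("/Game"))           # True
```
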

But-- ahem!-- the real fix is to get rid of any and all spaces in any and all names everywhere. Replace with hyphen or lowline or nothing. (Different argument.)

The * is never needed or used at the end of a pattern.

It belongs only in the middle of a pattern.

And that's not something you hear every day :) This pronouncement applies ONLY to robots.txt!

Holygamer



 
Msg#: 4431065 posted 12:52 am on Mar 20, 2012 (gmt 0)

g1smd, you said "If you want something removed from the SERPs, you should let crawlers in and then serve the 404 or 410 status code for that request". How do I do that? Doesn't that defeat the purpose of the Robots.txt file?

Or will the results eventually be removed if I just use the correct line in my robots file and wait a month or so?

I do know about using hyphens instead of spaces and all that, it's just that I didn't want to do that with my music folder as it makes the folders hard to read in file manager in cPanel if I use hyphens instead of spaces. And you can still link to files with spaces in without problem, but I didn't think about the robots.txt file.

Holygamer



 
Msg#: 4431065 posted 11:44 am on Mar 20, 2012 (gmt 0)

/Game*Music

Also if I do the above to block the Game Music directory it will also block pages beginning with "Game Music" which I don't want. So could I do this to just block the Game Music directory? Note the slash on the end.

/Game*Music/

topr8

WebmasterWorld Senior Member; WebmasterWorld Top Contributor of All Time; 10+ Year Member



 
Msg#: 4431065 posted 12:20 pm on Mar 20, 2012 (gmt 0)

But-- ahem!-- the real fix is to get rid of any and all spaces in any and all names everywhere.


i know this is OT but i totally concur with lucy24, if your site isn't really old you might be better off renaming all the files and folders without spaces. also IMHO using a mix of capital and lower case letters is not ideal either.

phranque

WebmasterWorld Administrator; WebmasterWorld Top Contributor of All Time; 10+ Year Member; Top Contributors Of The Month



 
Msg#: 4431065 posted 1:15 pm on Mar 20, 2012 (gmt 0)

Doesn't that defeat the purpose of the Robots.txt file?

the purpose of the robots.txt file (NOTE THE LOWER CASE FILE NAME!) is to exclude the crawler from making a request for that url, not for keeping a url space out of the index.
to keep a url out of the index you need a 404/410 response if the url is Not Found or Gone, otherwise a meta robots noindex tag or "X-Robots-Tag: noindex" HTTP Response header.
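
A minimal sketch of the two options just described, written as a tiny in-process WSGI app in Python. The `/retired/` and `/private/` prefixes and the `call` helper are hypothetical and chosen only for illustration; this is not how MediaWiki itself responds, just a picture of what a crawler would need to receive.

```python
from wsgiref.util import setup_testing_defaults

GONE_PREFIX = "/retired/"     # hypothetical: URLs that should drop out of the index
NOINDEX_PREFIX = "/private/"  # hypothetical: pages that exist but should not be indexed

def app(environ, start_response):
    path = environ.get("PATH_INFO", "/")
    if path.startswith(GONE_PREFIX):
        # The crawler must be allowed to request the URL to see this status.
        start_response("410 Gone", [("Content-Type", "text/plain")])
        return [b"Gone\n"]
    headers = [("Content-Type", "text/plain")]
    if path.startswith(NOINDEX_PREFIX):
        # Header equivalent of a meta robots noindex tag.
        headers.append(("X-Robots-Tag", "noindex"))
    start_response("200 OK", headers)
    return [b"OK\n"]

def call(path):
    """Invoke the app without a network and return (status, headers dict)."""
    environ = {}
    setup_testing_defaults(environ)
    environ["PATH_INFO"] = path
    captured = {}

    def start_response(status, headers):
        captured["status"] = status
        captured["headers"] = dict(headers)

    app(environ, start_response)
    return captured["status"], captured["headers"]

print(call("/retired/old-page")[0])                    # 410 Gone
print(call("/private/notes")[1].get("X-Robots-Tag"))   # noindex
```

Either response only helps if robots.txt does not disallow the URL, since a crawler that never makes the request never sees the status code or the header.
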

Holygamer



 
Msg#: 4431065 posted 2:48 pm on Mar 20, 2012 (gmt 0)

I'm using MediaWiki. The pages never existed in the first place so how do I do what you suggested?

This is one of the non-existent pages:

http://example.com/Game_Music/B/Baroque/001._Baroque_%28Saturn%29_Original_soundtrack/

[edited by: tedster at 6:12 am (utc) on Mar 24, 2012]
[edit reason] switch to example.com [/edit]

phranque




 
Msg#: 4431065 posted 3:07 pm on Mar 20, 2012 (gmt 0)

MediaWiki should provide a "404 Not Found" response to requests for non-existent urls - as long as you don't exclude the crawler from making the request.

Holygamer



 
Msg#: 4431065 posted 3:16 pm on Mar 20, 2012 (gmt 0)

This is my robots.txt file. Could you tell me if I am preventing MediaWiki from providing a "404 Not Found"?

User-agent: *
Disallow: /index.php
Disallow: /w/
Disallow: /Category:*
Disallow: /Category_talk:*
Disallow: /Extension:*
Disallow: /Extension_talk:*
Disallow: /File:*
Disallow: /File_talk:*
Disallow: /Game*/
Disallow: /Image:*
Disallow: /Image_talk:*
Disallow: /Help:*
Disallow: /Help_talk:*
Disallow: /Manual:*
Disallow: /Manual_talk:*
Disallow: /Media:*
Disallow: /MediaWiki:*
Disallow: /Media Wiki_talk:*
Disallow: /Project:*
Disallow: /Project_talk:*
Disallow: /Special
Disallow: /Special:*
Disallow: /Talk:*
Disallow: /Template:*
Disallow: /Template_talk:*
Disallow: /User:*
Disallow: /User_talk:*
User-agent: ia_archiver
Disallow: /
Allow: /Special:Contact

sitemap: http://example.com/sitemap.xml

This is my .htaccess file:
RewriteEngine on
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule ^(.+)$ index.php?title=$1 [L,QSA]

Options +FollowSymlinks
RewriteEngine on

# Link for the Sitemap
RewriteRule ^sitemap(.*)\.xml$ sitemap.php?page=$1 [L,NC]

RewriteCond %{HTTP_REFERER} !^http://example.com/.*$ [NC]
RewriteCond %{HTTP_REFERER} !^http://example.com$ [NC]
RewriteRule .*\.(jpg|jpeg|gif|png|bmp|mp3)$ http://example.com/Hotlink_Protection [R,NC]




If I'm not preventing it then MediaWiki mustn't be working properly, because if it was giving 404 Not Found then I wouldn't be able to find the non-existent pages via Google?

[edited by: tedster at 6:14 am (utc) on Mar 24, 2012]
[edit reason] switch to example.com [/edit]

g1smd




 
Msg#: 4431065 posted 8:59 pm on Mar 20, 2012 (gmt 0)

Remove the trailing * from all of the Disallow directives.

Try this code:
Disallow: /Game*Music/$
It will disallow the single URL, not the subpages in that folder.


Your htaccess file is broken. The rules are in the wrong order.

The hotlink rules should be first. Both of the .* patterns in that ruleset are errors in some way or other.

The sitemap rule should be next.

The general rewrite should be last.

You should have RewriteEngine On only ONCE at the start of the file.

Holygamer



 
Msg#: 4431065 posted 10:17 pm on Mar 20, 2012 (gmt 0)

OK thanks. I think it would be better if I moved the .htaccess question to a different thread as we've gone off topic a bit. Here's the new topic: [webmasterworld.com...] Could you please reformat my htaccess to show me what it should look like as I'm still not sure.

Anyway, I'm confused with what you guys said about pattern matching compared to what Google says here: [support.google.com...]

With disallowing a directory I found this on Google via the above link (click on "Manually create a robots.txt file"). Under "Pattern Matching" it says this:

To block access to all subdirectories that begin with private:

Disallow: /private*/

So can't I just do this to disallow my Game Music directory?: Disallow: /Game*/

So won't the following also be OK to block page names beginning with Category: ?

Disallow: /Category:*

g1smd




 
Msg#: 4431065 posted 10:28 pm on Mar 20, 2012 (gmt 0)

The * wildcard belongs only in the middle of a pattern:

/*something

/some*thing

/some*thing/

/something*/

The pattern is a prefix match. It matches anything that BEGINS with this pattern.

Use a trailing $ to match ONLY this exact URL.

Never use * on the end of a rule. It is redundant.

Holygamer



 
Msg#: 4431065 posted 10:40 pm on Mar 20, 2012 (gmt 0)

OK, just checking then you said /something*/
is OK so is this OK to disallow my Game Music directory?:

/Game*/

g1smd




 
Msg#: 4431065 posted 10:41 pm on Mar 20, 2012 (gmt 0)

Disallow: /Game*/ will block /Game<something>/<anything-or-nothing>

Disallow: /Game*/$ will block /Game<something>/ and not subpages.
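
The difference between the two rules can be sketched in Python. This assumes `*` matches any run of characters (including slashes) and that a trailing `$` means "the URL must end here"; the function name `robots_match` and the sample paths are invented for illustration.

```python
import re

def robots_match(pattern: str, path: str) -> bool:
    """Prefix-match `path` against `pattern`, honouring `*` and a trailing `$`."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]  # drop the end anchor before building the regex
    regex = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    # With `$` the whole path must match; without it, only the start must.
    matcher = re.fullmatch if anchored else re.match
    return matcher(regex, path) is not None

print(robots_match("/Game*/", "/Game_Music/"))              # True
print(robots_match("/Game*/", "/Game_Music/B/track.mp3"))   # True
print(robots_match("/Game*/$", "/Game_Music/"))             # True
print(robots_match("/Game*/$", "/Game_Music/B/track.mp3"))  # False
print(robots_match("/Game*/$", "/Game_Music/B/"))           # True
```

One caveat under these assumptions: since `*` can span slashes, `/Game*/$` also matches deeper URLs that happen to end in a slash, as the last line shows. It excludes files and pages below the folder, but not sub-folder URLs with a trailing slash.
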

Holygamer



 
Msg#: 4431065 posted 10:51 pm on Mar 20, 2012 (gmt 0)

OK got it.

Also if I don't want to block a directory but I want to block any page starting "Category:" then do I do this without putting an asterisk at the end?

Disallow: /Category:

g1smd




 
Msg#: 4431065 posted 10:52 pm on Mar 20, 2012 (gmt 0)

Disallow: /Category: will block the URLs /Category:<anything-or-nothing>


All trademarks and copyrights held by respective owners. Member comments are owned by the poster.
WebmasterWorld is a Developer Shed Community owned by Jim Boykin.
© Webmaster World 1996-2014 all rights reserved