Forum Moderators: goodroi

Can't seem to Disallow a Directory

Disallow robots.txt robots

     
11:07 pm on Mar 19, 2012 (gmt 0)

New User

joined:Mar 19, 2012
posts: 19
votes: 0


I couldn't find an answer about wildcards when searching on Google.

Is this the correct way to disallow crawling of pages which have "Template" at the start of the page name?

Disallow: /Template*


So would the above also prevent pages which have the following at the start of their names from being crawled?

Template:
Template_talk:


Also, I have a directory in the root of my site called "Game Music" which has subfolders containing MP3s (there are no web pages inside it). Google is showing non-existent pages from that directory in search results when I search for an album name. For example, try searching for the following on Google: Baroque (Saturn) Original Soundtrack

On the 1st results page you'll see 2 links to my site. The 1st link takes you to the actual page I have made. The 2nd link takes you to a non-existent page called:
"Game Music/B/Baroque/001. Baroque (Saturn) Original soundtrack"

The 1st page has download links to MP3s in the following location:
"Game Music/B/Baroque/001. Baroque (Saturn) Original soundtrack"

Forgetting about the robots file for a second, why is Google showing non-existent pages in search results? I use MediaWiki, which is the software that Wikipedia uses to build its website. I don't know if this is the correct term or not, but the pages are virtual paths (the pages are stored in a MySQL database). Does this have something to do with the problem?

I tried the following a month ago but the non-existent pages are still showing in search results:

Disallow: /Game Music/


Is there a problem with there being a space in the directory name?

Would this work to block everything in that directory from being indexed?

Disallow: /Game*
11:38 pm on Mar 19, 2012 (gmt 0)

Senior Member

WebmasterWorld Senior Member g1smd is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:July 3, 2002
posts:18903
votes: 0


Patterns are used as prefix matches, matching from the left.

Disallow: /Template
is all you need.

The * is never needed or used at the end of a pattern.

It belongs only in the middle of a pattern.

Disallow: /*_this:
would disallow any URL request like
example.com/<anything>_this:
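
To make the prefix rule concrete, here is a minimal Python sketch of how a crawler that supports the * wildcard might test a URL path against a Disallow pattern. The function name rule_matches is made up for illustration, and this only approximates the matching described above; it is not any crawler's actual code.

```python
import re

def rule_matches(pattern: str, path: str) -> bool:
    """Return True if `path` begins with something matching `pattern`.

    `*` stands for any run of characters (including none); everything
    else is literal. Hypothetical helper, not a real crawler API.
    """
    # Escape each literal piece, then join the pieces with ".*".
    regex = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    # re.match anchors at the start only, which gives the prefix behaviour.
    return re.match(regex, path) is not None

# A bare prefix already covers both page-name families.
print(rule_matches("/Template", "/Template:Infobox"))       # True
print(rule_matches("/Template", "/Template_talk:Infobox"))  # True
# A trailing * adds nothing: same result as without it.
print(rule_matches("/Template*", "/Template:Infobox"))      # True
# A * in the middle is where it does real work.
print(rule_matches("/*_this:", "/anything_this:page"))      # True
print(rule_matches("/Template", "/Main_Page"))              # False
```

The page names used in the calls (Infobox, Main_Page) are placeholders, not pages from the site being discussed.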
12:07 am on Mar 20, 2012 (gmt 0)

New User

OK thanks, but any idea why Google is showing non-existent pages, especially when I have this?: Disallow: /Game Music/
Does it matter if I have a space in the name? Or would I be better off doing this?: Disallow: /Game
12:19 am on Mar 20, 2012 (gmt 0)

Senior Member

WebmasterWorld Senior Member g1smd is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

The space in the robots.txt file is likely an issue.

Use
Disallow: /Game

or
Disallow: /Game*Music



However, when a URL is disallowed Google can still show that URL as a URL-only entry in the SERPs as it cannot access the URL to see the real status code.

If you want something removed from the SERPs, you should let crawlers in and then serve the 404 or 410 status code for that request.
12:29 am on Mar 20, 2012 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:Apr 9, 2011
posts:13204
votes: 346


Does it matter if I have a space in the name?

Heck, yeah ;) By the time the request reaches your server, it almost certainly won't be a space. Probably it will come through as %20. As noted above, "Disallow: /Game" by itself will have the same effect.
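
As a rough illustration of the encoding point, this Python sketch uses the standard library's urllib.parse.quote to show what a path containing a space looks like once percent-encoded, and why a rule written with a literal space would not prefix-match it. The plain startswith comparison is a simplification that ignores wildcards, and how a given crawler treats a raw space inside a rule may differ from this.

```python
from urllib.parse import quote

rule_with_space = "/Game Music/"
requested_path = quote("/Game Music/B/Baroque/")  # space becomes %20

print(requested_path)                              # /Game%20Music/B/Baroque/
# The literal-space rule is not a prefix of the encoded path.
print(requested_path.startswith(rule_with_space))  # False
# The shorter prefix still matches.
print(requested_path.startswith("/Game"))          # True
```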

But-- ahem!-- the real fix is to get rid of any and all spaces in any and all names everywhere. Replace with hyphen or lowline or nothing. (Different argument.)

The * is never needed or used at the end of a pattern.

It belongs only in the middle of a pattern.

And that's not something you hear every day :) This pronouncement applies ONLY to robots.txt!
12:52 am on Mar 20, 2012 (gmt 0)

New User

g1smd, you said "If you want something removed from the SERPs, you should let crawlers in and then serve the 404 or 410 status code for that request". How do I do that? Doesn't that defeat the purpose of the Robots.txt file?

Or will the results eventually be removed if I just use the correct line in my robots file and wait a month or so?

I do know about using hyphens instead of spaces and all that. It's just that I didn't want to do that with my music folder, as hyphens make the folders hard to read in the cPanel file manager. You can still link to files with spaces in their names without a problem, but I didn't think about the robots.txt file.
11:44 am on Mar 20, 2012 (gmt 0)

New User

/Game*Music

Also if I do the above to block the Game Music directory it will also block pages beginning with "Game Music" which I don't want. So could I do this to just block the Game Music directory? Note the slash on the end.

/Game*Music/
12:20 pm on Mar 20, 2012 (gmt 0)

Senior Member

WebmasterWorld Senior Member topr8 is a WebmasterWorld Top Contributor of All Time 10+ Year Member

joined:Apr 19, 2002
posts:3195
votes: 12


But-- ahem!-- the real fix is to get rid of any and all spaces in any and all names everywhere.


i know this is OT but i totally concur with lucy24, if your site isn't really old you might be better off renaming all the files and folders without spaces. also IMHO using a mix of capital and lower case letters is not ideal either.
1:15 pm on Mar 20, 2012 (gmt 0)

Administrator

WebmasterWorld Administrator phranque is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Aug 10, 2004
posts:10552
votes: 10


Doesn't that defeat the purpose of the Robots.txt file?

the purpose of the robots.txt file (NOTE THE LOWER CASE FILE NAME!) is to exclude the crawler from making a request for that url, not for keeping a url space out of the index.
to keep a url out of the index you need a 404/410 response if the url is Not Found or Gone, otherwise a meta robots noindex tag or "X-Robots-Tag: noindex" HTTP Response header.
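
The distinction between blocking a crawl and keeping a URL out of the index can be sketched as a small decision function in Python. Everything here (the function name index_outcome, its parameters, the returned strings) is hypothetical and only mirrors the rules of thumb stated above; it leaves out the meta robots tag for brevity, and real search engines are more nuanced.

```python
def index_outcome(blocked_by_robots: bool, status: int, headers: dict) -> str:
    """Roughly classify what a search engine can learn about a URL.

    All names here are hypothetical; real crawlers are more nuanced.
    """
    if blocked_by_robots:
        # The crawler never requests the URL, so it never sees the
        # status code or any noindex signal.
        return "not crawled: URL may still be listed"
    if status in (404, 410):
        return "dropped: Not Found or Gone"
    if "noindex" in headers.get("X-Robots-Tag", "").lower():
        return "dropped: noindex"
    return "eligible for indexing"

print(index_outcome(True, 404, {}))
print(index_outcome(False, 410, {}))
print(index_outcome(False, 200, {"X-Robots-Tag": "noindex"}))
print(index_outcome(False, 200, {}))
```

The first call is the situation in this thread: the 404 exists, but a disallowed crawler never gets to see it.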
2:48 pm on Mar 20, 2012 (gmt 0)

New User

I'm using MediaWiki. The pages never existed in the first place so how do I do what you suggested?

This is one of the non-existent pages:

http://example.com/Game_Music/B/Baroque/001._Baroque_%28Saturn%29_Original_soundtrack/

[edited by: tedster at 6:12 am (utc) on Mar 24, 2012]
[edit reason] switch to example.com [/edit]

3:07 pm on Mar 20, 2012 (gmt 0)

Administrator

WebmasterWorld Administrator phranque is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

MediaWiki should provide a "404 Not Found" response to requests for non-existent urls - as long as you don't exclude the crawler from making the request.
3:16 pm on Mar 20, 2012 (gmt 0)

New User

This is my robots.txt file. Could you tell me if I am preventing MediaWiki from providing a "404 Not Found"?

User-agent: *
Disallow: /index.php
Disallow: /w/
Disallow: /Category:*
Disallow: /Category_talk:*
Disallow: /Extension:*
Disallow: /Extension_talk:*
Disallow: /File:*
Disallow: /File_talk:*
Disallow: /Game*/
Disallow: /Image:*
Disallow: /Image_talk:*
Disallow: /Help:*
Disallow: /Help_talk:*
Disallow: /Manual:*
Disallow: /Manual_talk:*
Disallow: /Media:*
Disallow: /MediaWiki:*
Disallow: /Media Wiki_talk:*
Disallow: /Project:*
Disallow: /Project_talk:*
Disallow: /Special
Disallow: /Special:*
Disallow: /Talk:*
Disallow: /Template:*
Disallow: /Template_talk:*
Disallow: /User:*
Disallow: /User_talk:*
User-agent: ia_archiver
Disallow: /
Allow: /Special:Contact

sitemap: http://example.com/sitemap.xml

This is my .htaccess file:
RewriteEngine on
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule ^(.+)$ index.php?title=$1 [L,QSA]

Options +FollowSymlinks
RewriteEngine on

# Link for the Sitemap
RewriteRule ^sitemap(.*)\.xml$ sitemap.php?page=$1 [L,NC]

RewriteCond %{HTTP_REFERER} !^http://example.com/.*$ [NC]
RewriteCond %{HTTP_REFERER} !^http://example.com$ [NC]
RewriteRule .*\.(jpg|jpeg|gif|png|bmp|mp3)$ http://example.com/Hotlink_Protection [R,NC]




If I'm not preventing it, then MediaWiki mustn't be working properly, because if it were giving 404 Not Found then I wouldn't be able to find the non-existent pages via Google?

[edited by: tedster at 6:14 am (utc) on Mar 24, 2012]
[edit reason] switch to example.com [/edit]

8:59 pm on Mar 20, 2012 (gmt 0)

Senior Member

WebmasterWorld Senior Member g1smd is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

Remove the trailing * from all of the Disallow directives.

Try this code:
Disallow: /Game*Music/$

It will disallow the single URL, not the subpages in that folder.


Your htaccess file is broken. The rules are in the wrong order.

The hotlink rules should be first. Both of the .* patterns in that ruleset are errors in some way or other.

The sitemap rule should be next.

The general rewrite should be last.

You should have
RewriteEngine On
only ONCE at the start of the file.
10:17 pm on Mar 20, 2012 (gmt 0)

New User

OK thanks. I think it would be better if I moved the .htaccess question to a different thread as we've gone off topic a bit. Here's the new topic: [webmasterworld.com...] Could you please reformat my htaccess to show me what it should look like as I'm still not sure.

Anyway, I'm confused with what you guys said about pattern matching compared to what Google says here: [support.google.com...]

With disallowing a directory, I found this on Google via the above link (click on "Manually create a robots.txt file"). Under "Pattern Matching" it says this:

To block access to all subdirectories that begin with private:

Disallow: /private*/

So can't I just do this to disallow my Game Music directory?: Disallow: /Game*/

So won't the following also be OK to block page names beginning with Category: ?

Disallow: /Category:*
10:28 pm on Mar 20, 2012 (gmt 0)

Senior Member

WebmasterWorld Senior Member g1smd is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

The * wildcard belongs only in the middle of a pattern:

/*something

/some*thing

/some*thing/

/something*/

The pattern is a prefix match. It matches anything that BEGINS with this pattern.

Use a trailing $ to match ONLY this exact URL.

Never use * on the end of a rule. It is redundant.
10:40 pm on Mar 20, 2012 (gmt 0)

New User

OK, just checking: you said /something*/ is OK, so is this OK to disallow my Game Music directory?:

/Game*/
10:41 pm on Mar 20, 2012 (gmt 0)

Senior Member

WebmasterWorld Senior Member g1smd is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

Disallow: /Game*/ will block /Game<something>/<anything-or-nothing>

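Disallow: /Game*/$ will block only URLs that begin with /Game and end with a slash, such as /Game<something>/, and not subpages such as /Game<something>/<file>. Because * can also match slashes, a deeper URL that itself ends in / still matches.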
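
The difference between the two forms can be checked with a short Python sketch. It assumes * matches any run of characters, slashes included, and that a trailing $ pins the end of the URL. The helper name compile_rule and the file name track01.mp3 are invented for illustration, and this is an approximation, not a crawler's real implementation.

```python
import re

def compile_rule(pattern: str):
    """Compile a robots.txt-style pattern into a regex (hypothetical helper).

    `*` matches any run of characters; a trailing `$` pins the end of the URL.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    body = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    # Without `$` the rule is a prefix match, so anything may follow.
    return re.compile(body + (r"\Z" if anchored else ""))

directory = "/Game_Music/"
file_url = "/Game_Music/B/Baroque/track01.mp3"  # made-up file name
deeper_dir = "/Game_Music/B/Baroque/"

open_rule = compile_rule("/Game*/")
exact_rule = compile_rule("/Game*/$")

# Prefix form: the directory and everything below it match.
print([bool(open_rule.match(u)) for u in (directory, file_url, deeper_dir)])
# [True, True, True]

# Anchored form: only URLs ending in "/" match, which includes deeper
# directory-style URLs because * can span slashes.
print([bool(exact_rule.match(u)) for u in (directory, file_url, deeper_dir)])
# [True, False, True]
```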
10:51 pm on Mar 20, 2012 (gmt 0)

New User

OK got it.

Also, if I don't want to block a directory but I want to block any page starting with "Category:", do I do this without putting an asterisk at the end?

Disallow: /Category:
10:52 pm on Mar 20, 2012 (gmt 0)

Senior Member

WebmasterWorld Senior Member g1smd is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

Disallow: /Category: will block the URLs /Category:<anything-or-nothing>