
Forum Moderators: goodroi


robots.txt block all deeper URLs in subfolder but not root

     
2:44 pm on Nov 14, 2014 (gmt 0)

Full Member

10+ Year Member

joined:Aug 31, 2002
posts: 284
votes: 0


Does anyone know the syntax for, and viability of, the following?

I'd like to allow indexation of example.com/sub/ and block everything deeper than example.com/sub/


i.e.

example.com/sub/ = good to crawl


whilst the following - and anything else at these levels - should all be blocked

example.com/sub/123
example.com/sub/xyz/
example.com/sub/1sdlkashdkshad.html
8:11 pm on Nov 14, 2014 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:Apr 9, 2011
posts:15801
votes: 845


If you're asking specifically about Google, they do follow the "Allow:" directive. But if there's a small, finite number of subdirectories, just attach the "Disallow:" to those instead, and don't say anything about the higher directory.
7:20 am on Nov 26, 2014 (gmt 0)

New User from BD 

joined:Oct 11, 2014
posts:16
votes: 0


If you want to block specific URLs, you can use a robots meta tag on those pages or posts. Put this tag in the header of each one: <META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">
10:56 am on Nov 26, 2014 (gmt 0)

Senior Member

WebmasterWorld Senior Member topr8 is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Apr 19, 2002
posts:3499
votes: 82


as lucy says, google do use the 'Allow:' directive,
their robots.txt is ...

[google.com...]

there are many examples in it of using Allow and Disallow, including strings in the same path
9:02 pm on Nov 26, 2014 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:Apr 9, 2011
posts:15801
votes: 845


drinstech* pointed out something that I overlooked the first time around: Although the question was about robots.txt, the OP used the word "indexation". Crawling and indexing are entirely different things. In practice it rarely matters. But for some pages on some sites, you may want to distinguish between
-- page shows up in SERPs with text saying something like "the site's robots.txt prevents us from saying what's on the page"
-- page never shows up in search-engine results, because robot has crawled the page and seen "noindex" meta.

Incidentally "noindex" and "nofollow" are different and unrelated functions. The second essentially means "When you follow the links on this page, don't tell them I sent you".


* Whoops! My fingers went on autopilot and treacherously inserted a "k" into the middle of the name.
6:57 am on Dec 8, 2014 (gmt 0)

New User

joined:Nov 24, 2014
posts:15
votes: 0


allow: /directory/$

disallow: /directory/*

You can use this syntax. You can also verify it in Google Webmaster Tools. [support.google.com...]

Cheers,
Regina
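
For anyone who wants to sanity-check the two rules above before uploading them, here is a rough Python sketch of the matching behaviour Google documents: "*" matches any run of characters, a trailing "$" anchors the end of the path, the longest matching pattern wins, and Allow wins a tie. The names pattern_to_regex, is_allowed and RULES are made up for this example, and /sub/ stands in for the OP's folder. It is not Google's actual parser, and other crawlers may resolve the rules differently.

```python
import re

# Hypothetical rule set mirroring the post above, using the OP's /sub/ folder.
RULES = [
    ("allow", "/sub/$"),
    ("disallow", "/sub/*"),
]

def pattern_to_regex(pattern):
    """Translate a robots.txt path pattern into a compiled regex.

    '*' matches any run of characters; a trailing '$' anchors the end of the path.
    Everything else is matched literally from the start of the path.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile("^" + body + ("$" if anchored else ""))

def is_allowed(path, rules):
    """Return True if the path may be crawled under the given rules.

    Assumes the longest matching pattern wins and Allow wins a tie.
    A path that matches no rule is allowed.
    """
    best = None  # (pattern length, is_allow)
    for directive, pattern in rules:
        if pattern_to_regex(pattern).match(path):
            candidate = (len(pattern), directive.lower() == "allow")
            # Tuple comparison: longer pattern first, then Allow (True) over Disallow.
            if best is None or candidate > best:
                best = candidate
    return True if best is None else best[1]

if __name__ == "__main__":
    for path in ["/sub/", "/sub/123", "/sub/xyz/", "/sub/1sdlkashdkshad.html"]:
        print(path, "allowed" if is_allowed(path, RULES) else "blocked")
```

Note that both patterns here are the same length, so the result for /sub/ itself depends on the tie going to Allow. That is one more reason to confirm with the search engine's own robots.txt tester.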
11:31 am on Dec 9, 2014 (gmt 0)

Full Member

10+ Year Member

joined:Aug 31, 2002
posts:284
votes: 0


Well done and thanks to reginafashionist. That is what I was looking for, despite my misleading OP about "indexation" as opposed to "crawling".
 
