Forum Moderators: goodroi

robots.txt block all deeper URLs in subfolder but not root

         

Adam_C

2:44 pm on Nov 14, 2014 (gmt 0)

10+ Year Member



Does anyone know the syntax / viability of the following?

I'd like to allow indexation of example.com/sub/ and block everything deeper than example.com/sub/


i.e.

example.com/sub/ = good to crawl


whilst the following - and anything else at these levels - should all be blocked

example.com/sub/123
example.com/sub/xyz/
example.com/sub/1sdlkashdkshad.html

lucy24

8:11 pm on Nov 14, 2014 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



If you're asking specifically about Google, they do follow the "Allow:" directive. But if there's a small finite number of subdirectories, just attach the "Disallow:" to those instead, and don't say anything about the higher directory.
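A minimal sketch of the "Disallow: only the known subpaths" approach, checked with Python's standard urllib.robotparser. The rule set below is an illustration built from the example URLs in the question, not a tested production file. Note that this parser does plain prefix matching only, so anything under /sub/ that is not listed stays crawlable.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: one Disallow per known deeper path,
# and nothing said about /sub/ itself.
ROBOTS_TXT = """\
User-agent: *
Disallow: /sub/123
Disallow: /sub/xyz/
Disallow: /sub/1sdlkashdkshad.html
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

for url in (
    "http://example.com/sub/",
    "http://example.com/sub/123",
    "http://example.com/sub/xyz/",
    "http://example.com/sub/1sdlkashdkshad.html",
    "http://example.com/sub/not-listed.html",  # not covered by any rule
):
    print(url, parser.can_fetch("*", url))
```

This only scales when the set of deeper paths is small and known in advance, which is the condition stated above.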

drinstech

7:20 am on Nov 26, 2014 (gmt 0)

10+ Year Member



When you want to block a specific URL, you can use a robots meta tag on that page or post. Put the tag in the head section of each such page: <META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">

topr8

10:56 am on Nov 26, 2014 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



as lucy says, Google do use the 'Allow:' directive,
their robots.txt is ...

[google.com...]

there are many examples in it of using Allow and Disallow, including strings in the same path

lucy24

9:02 pm on Nov 26, 2014 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



drinstech* pointed out something that I overlooked the first time around: Although the question was about robots.txt, the OP used the word "indexation". Crawling and indexing are entirely different things. In practice it rarely matters. But for some pages on some sites, you may want to distinguish between
-- page shows up in SERPs with text saying something like "the site's robots.txt prevents us from saying what's on the page"
-- page never shows up in search-engine results, because robot has crawled the page and seen "noindex" meta.

Incidentally "noindex" and "nofollow" are different and unrelated functions. The second essentially means "When you follow the links on this page, don't tell them I sent you".


* Whoops! My fingers went on autopilot and treacherously inserted a "k" into the middle of the name.

reginafashionist

6:57 am on Dec 8, 2014 (gmt 0)



allow: /directory/$

disallow: /directory/*

You can use this syntax. You can also verify it in Google Webmaster Tools. [support.google.com...]

Cheers,
Regina
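A rough way to sanity-check the two rules above offline is sketched here. It assumes Google-style matching as I understand it: "*" matches any run of characters, a trailing "$" anchors the end of the path, the longest matching rule wins, and Allow wins a tie. The function names and the rule list format are made up for this illustration; other crawlers may treat wildcards differently or ignore them.

```python
import re

def pattern_to_regex(pattern):
    """Translate a robots.txt path pattern into a compiled regex."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape literal text and turn each "*" into ".*"
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + (r"\Z" if anchored else ""))

def is_allowed(rules, path):
    """rules: list of (allow, pattern) pairs. Longest match wins;
    on equal length the Allow rule wins. No match means allowed."""
    best_len, best_allow = -1, True
    for allow, pattern in rules:
        if pattern_to_regex(pattern).match(path):
            length = len(pattern)
            if length > best_len or (length == best_len and allow):
                best_len, best_allow = length, allow
    return best_allow

# The two rules from the post above, as (allow, pattern) pairs.
RULES = [
    (True, "/directory/$"),
    (False, "/directory/*"),
]

for path in ("/directory/", "/directory/123", "/directory/xyz/",
             "/directory/1sdlkashdkshad.html", "/elsewhere/page.html"):
    print(path, is_allowed(RULES, path))
```

Under these assumptions only /directory/ itself and paths outside the directory come back as allowed; everything deeper is blocked. The live tester in Google's tools remains the authoritative check.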

Adam_C

11:31 am on Dec 9, 2014 (gmt 0)

10+ Year Member



Well done and thanks to reginafashionist. That is what I was looking for - despite my misleading OP about "indexation" as opposed to "crawling".