
Forum Moderators: goodroi


robots.txt - Disallow/Allow

     
7:58 pm on Mar 10, 2008 (gmt 0)

New User

5+ Year Member

joined:Mar 10, 2008
posts: 4
votes: 0


Hello,

I'm relatively new to robots.txt and need some ideas to solve a challenge I have.

I would like to block my entire site except the index.html on the root directory.

Unfortunately, there are some files that must be in the root directory as well.

I tried to follow Google's guidelines, which suggest that you can do something like this:

User-agent: Googlebot
Disallow: /
Allow: /sitemap.xml
Allow: /index.html

However, when I test it with Google's own Webmaster Tools, it tells me that access is denied by robots.txt.

Any ideas what I'm doing wrong, or how I can work around this?

Thanks
Eyal
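For anyone wanting to reproduce this locally: Python's standard-library `urllib.robotparser` applies rules in file order (first match wins), so under that parser the blanket `Disallow: /` shadows the later `Allow` lines — the same "access denied" result the Webmaster Tools test reported. Google's current documentation says its crawler instead uses the most specific (longest) matching rule, so `Allow: /index.html` should win there; treat this sketch as an illustration of order-sensitive parsers, not of Googlebot itself.

```python
from urllib import robotparser

# The robots.txt from the question, as a list of lines.
rules = [
    "User-agent: Googlebot",
    "Disallow: /",
    "Allow: /sitemap.xml",
    "Allow: /index.html",
]

rp = robotparser.RobotFileParser()
rp.parse(rules)

# Python's parser checks rules in file order, so "Disallow: /"
# matches /index.html before either Allow line is reached.
print(rp.can_fetch("Googlebot", "/index.html"))   # False with this order

# Moving the Allow lines above "Disallow: /" changes the answer
# for order-sensitive parsers.
reordered = [
    "User-agent: Googlebot",
    "Allow: /index.html",
    "Allow: /sitemap.xml",
    "Disallow: /",
]
rp2 = robotparser.RobotFileParser()
rp2.parse(reordered)
print(rp2.can_fetch("Googlebot", "/index.html"))  # True: Allow matches first
```

Putting the `Allow` lines first satisfies order-sensitive parsers while leaving the intent unchanged for longest-match parsers.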

9:24 am on Mar 12, 2008 (gmt 0)

Full Member

10+ Year Member

joined:May 24, 2005
posts:211
votes: 0


Hi eyalkattan,
By using 'Disallow: /' you have told Googlebot NOT to crawl any of your site.

To permit Googlebot to crawl your site but disallow it from crawling all of your sub-directories, you should perhaps list them individually. Something like:

User-agent: Googlebot
Disallow: /sub-directoryA
Disallow: /sub-directoryB
Disallow: /sub-directoryC

Hope this helps.
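A quick way to sanity-check a rule set like this before uploading it is Python's standard-library `urllib.robotparser` (the directory names below are the placeholders from the post, not real paths):

```python
from urllib import robotparser

rules = [
    "User-agent: Googlebot",
    "Disallow: /sub-directoryA",
    "Disallow: /sub-directoryB",
    "Disallow: /sub-directoryC",
]
rp = robotparser.RobotFileParser()
rp.parse(rules)

# The root and anything not under a listed prefix stays crawlable;
# everything under a listed directory is blocked.
print(rp.can_fetch("Googlebot", "/index.html"))                # True
print(rp.can_fetch("Googlebot", "/sub-directoryA/page.html"))  # False
```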

12:51 pm on Mar 12, 2008 (gmt 0)

New User

5+ Year Member

joined:Mar 10, 2008
posts: 4
votes: 0


Yeah, this does the trick; however, I was hoping to avoid listing my site's structure for security reasons.
I wonder why the spec is missing the "Allow" directive - it seems logical to be able to block the entire site and then allow individual files or folders...
1:11 pm on Mar 12, 2008 (gmt 0)

Junior Member

10+ Year Member

joined:May 17, 2006
posts:48
votes: 0


If you don't want to reveal your site structure, remember that robots.txt matches partial names. You don't need to put the full directory name, just the first letter (except for 'i' and 's', since you're allowing files that start with those).

User-agent: Googlebot
Disallow: /a
Disallow: /b
Disallow: /c
...etc.

If you have other files or folders that start with 'i' or 's', add additional rules for those, using just enough of the name so that the two files you want crawled are the only things that don't match.
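As a sketch of this single-letter technique (the paths `/admin` and `/images` below are hypothetical examples, and `urllib.robotparser` is just one prefix-matching parser to test against):

```python
from urllib import robotparser

# Block every first letter except 'i' and 's', then use two-letter
# prefixes to narrow those, leaving only /index.html and /sitemap.xml
# crawlable.
rules = ["User-agent: Googlebot"]
rules += [f"Disallow: /{c}" for c in "abcdefgh"]   # a-h
rules += [f"Disallow: /{c}" for c in "jklmnopqr"]  # j-r (skip i)
rules += [f"Disallow: /{c}" for c in "tuvwxyz"]    # t-z (skip s)
rules += ["Disallow: /im", "Disallow: /st"]        # e.g. /images, /static

rp = robotparser.RobotFileParser()
rp.parse(rules)

print(rp.can_fetch("Googlebot", "/index.html"))    # True: no rule prefixes it
print(rp.can_fetch("Googlebot", "/sitemap.xml"))   # True
print(rp.can_fetch("Googlebot", "/admin/"))        # False: matches "Disallow: /a"
print(rp.can_fetch("Googlebot", "/images/x.gif"))  # False: matches "Disallow: /im"
```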

1:34 pm on Mar 12, 2008 (gmt 0)

Senior Member from MY 

WebmasterWorld Senior Member vincevincevince is a WebmasterWorld Top Contributor of All Time 10+ Year Member

joined:Apr 1, 2003
posts:4847
votes: 0


You could use .htaccess to 'physically' block access to anything but index.html
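For reference, a minimal `.htaccess` along those lines might look like the sketch below (mod_rewrite syntax; it assumes mod_rewrite is enabled, and note it returns 403 to every client - browsers included - not just crawlers):

```apache
RewriteEngine On
# Let the bare root, index.html and robots.txt through untouched
RewriteRule ^(index\.html|robots\.txt)?$ - [L]
# Everything else gets a 403 Forbidden
RewriteRule ^ - [F]
```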
1:09 am on Mar 13, 2008 (gmt 0)

New User

5+ Year Member

joined:Mar 10, 2008
posts: 4
votes: 0



User-agent: Googlebot
Disallow: /a
Disallow: /b
Disallow: /c
...etc.

This sounds like a nice workaround actually. I'll definitely give it a try.

Does this also apply to other bots, or just Google?

1:14 am on Mar 13, 2008 (gmt 0)

New User

5+ Year Member

joined:Mar 10, 2008
posts: 4
votes: 0


You could use .htaccess to 'physically' block access to anything but index.html

The files I'm trying to block from the bot need to be accessible by index.html.
The way I architected the site, index.html loads dynamic pages from the Joomla CMS into a dynamic DIV. This way I can control what Google and other bots index on my site, as I don't wish every page to be indexed.
Another reason is that I am able to load pages into the DIV without having to refresh the entire page.

I think setting .htaccess to block access to these files and folders may cause the site to malfunction.

 
