
How to block robots spidering certain pages?

zoltan

7:45 am on Dec 22, 2005 (gmt 0)

10+ Year Member



OK, here is the question.
We have thousands of user-generated pages, and some sections require login. The URLs in these sections contain a specific word, like "youhavetologin". I would like to stop robots from crawling these pages because they just eat my bandwidth. URLs that contain "youhavetologin" are all simply login forms (the same login form for every page).

How do I "ask" robots not to waste their time and my bandwidth crawling URLs like this?

www.mysite.com/dir1/dir2/dir3/youhavetologin/

I tried it this way:
User-agent: *
Disallow: youhavetologin

but Googlebot does not seem to follow this. How do I do it properly?
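For what it's worth, the rule as written never matches anything: a Disallow value is treated as a URL-path prefix starting at the site root, so a value without a leading slash cannot match a real path. This can be checked with Python's standard urllib.robotparser (the host and path below are just the examples from this post):

```python
import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: youhavetologin",  # no leading slash: matches no URL path
])

# The login URL from the post is still considered crawlable,
# because "/dir1/dir2/dir3/youhavetologin/" does not start
# with the prefix "youhavetologin".
print(rp.can_fetch("*", "http://www.mysite.com/dir1/dir2/dir3/youhavetologin/"))
# True
```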

tedster

2:32 am on Dec 23, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



There's a great tutorial for robots.txt on our sister site, SearchEngineWorld:

[searchengineworld.com...]

zoltan

7:32 am on Dec 23, 2005 (gmt 0)

10+ Year Member



I have read it. My question: can I block a directory that is not at root level? E.g. www.mysite.com/dir1/dir2/youhavetologin/ and www.mysite.com/dir1/youhavetologin/.
I want anything under "youhavetologin" to not be crawled by search engines.

davelms

6:42 pm on Dec 23, 2005 (gmt 0)

10+ Year Member



Yes, a directory does not have to be at root level to be blocked. The following will stop your example URLs from being crawled by search engines that honour robots.txt.

User-agent: *
Disallow: /dir1/dir2/youhavetologin/
Disallow: /dir1/youhavetologin/
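These rules can be sanity-checked with Python's standard urllib.robotparser (the host and paths below are just the examples from this thread): both login URLs come back disallowed, while other paths remain crawlable.

```python
import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /dir1/dir2/youhavetologin/",
    "Disallow: /dir1/youhavetologin/",
])

# Both login URLs from the thread are now blocked for every
# crawler that honours robots.txt:
print(rp.can_fetch("*", "http://www.mysite.com/dir1/dir2/youhavetologin/"))  # False
print(rp.can_fetch("*", "http://www.mysite.com/dir1/youhavetologin/"))       # False

# Paths outside the blocked directories stay crawlable:
print(rp.can_fetch("*", "http://www.mysite.com/dir1/dir2/page.html"))        # True
```

Note that with plain prefix rules you have to list one Disallow line per directory depth. Googlebot (and some other major crawlers) additionally understands wildcard patterns such as `Disallow: /*youhavetologin/`, though wildcards are not part of the original robots.txt standard, so crawlers that only implement the basic protocol will ignore them.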