robots.txt confusion

Forum Moderators: goodroi

question about where to put robots.txt

SplitPersona

8:42 pm on Mar 9, 2005 (gmt 0)

My website directory is set up as:

-masterdirectory
+subdir1
+subdir2
...

All of my websites and virtual domains (test.domain.com/somepage.asp or www.domain.com/loging.asp) map to some file within this directory structure. There are no actual files at the /masterdirectory/ level. I want to block spiders from having access to everything without having to add meta tags to every asp page. So do I place robots.txt at

/masterdirectory/robots.txt

or at /masterdirectory/subdir1/robots.txt (and then again for each subdir)?

Sorry if this doesn't make sense, but I really need help with this. Thanks a lot!

DanA

8:58 pm on Mar 9, 2005 (gmt 0)

Normally spiders will look for robots.txt in the root:
/robots.txt
and nowhere else.

ThomasB

9:41 pm on Mar 9, 2005 (gmt 0)

SplitPersona, first of all welcome to WebmasterWorld!

DanA is right: search engines only request the robots.txt in the root of the domain, so a file placed anywhere else won't work and you would have to fall back on meta tags.
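As a rough illustration of the point above (not something posted in the thread): a root-level robots.txt that asks all compliant spiders to stay out of everything is only two lines. The sketch below holds those two lines in a string and uses Python's standard urllib.robotparser to check how they are read. The user-agent name "AnyBot" is made up, and the page URLs are simply the sample ones from the original question.

```python
from urllib.robotparser import RobotFileParser

# Assumed file contents: ask every compliant robot to stay out of the whole site.
BLOCK_ALL = """\
User-agent: *
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # Sample URLs taken from the question; "AnyBot" is a hypothetical crawler name.
    for page in (
        "http://www.domain.com/loging.asp",
        "http://test.domain.com/somepage.asp",
    ):
        print(page, "allowed:", is_allowed(BLOCK_ALL, "AnyBot", page))
```

Keep in mind that robots.txt is only a request to well-behaved crawlers; it does not stop anything from actually fetching the pages.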

SplitPersona

1:15 pm on Mar 10, 2005 (gmt 0)

That is what I was confused about.

My domain is www.domain.com, but the file structure it maps to is masterdirectory/. So for www.domain.com/robots.txt, I would have to map it to masterdirectory/robots.txt?

ThomasB

1:24 pm on Mar 10, 2005 (gmt 0)

I don't think robots follow redirects for the robots.txt file. You have to place it in the root like this:
[example.com...]

Lord Majestic

1:39 pm on Mar 10, 2005 (gmt 0)

I don't think robots follow redirects for the robots.txt file. You have to place it in the root like this:

robots.txt has to be in the root. However, robots should understand redirects: when a redirect goes to a different domain than the original one (say domain.com -> www.domain.com), robots.txt has to be re-requested for the new domain. This can happen in the course of requesting a normal, non-robots URL. I can't say how many robots follow that, but I suppose the most correct ones should, even though it is a PITA to program that logic. I know that at least some robots don't support it. I can't speak for the top-tier engines, but I would imagine they have it sorted.

ThomasB

1:58 pm on Mar 10, 2005 (gmt 0)

LM, I think the question is whether they follow something like this:
www.example.com/robots.txt > 301 > www.example.com/directory/robots.txt

I doubt it, never tried it though.

SplitPersona

3:22 pm on Mar 10, 2005 (gmt 0)

ThomasB, that is exactly what I was asking :)
Is there some tool I can use to test if a robot will find my robots.txt file? The validation tool on this site seems to just check syntax. Are there any tools that I can feed a domain name and have it return the robots.txt? That would be ideal...

Thanks for all the responses so far!
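One way to check this by hand, sketched here under some assumptions rather than offered as a finished tool: request /robots.txt from the root of the host, the way a spider would, and report the status code and the final URL so that any redirect shows up. The function names and the User-Agent string are invented for this example. Note that urllib follows redirects on its own, which says nothing about whether a particular search engine robot would do the same.

```python
import urllib.error
import urllib.request
from urllib.parse import urlsplit


def robots_url(site: str) -> str:
    """Build the root robots.txt URL from a bare host name or a full URL."""
    if "://" not in site:
        site = "http://" + site  # assumption: default to plain http
    parts = urlsplit(site)
    return f"{parts.scheme}://{parts.netloc}/robots.txt"


def fetch_robots(site: str, timeout: float = 10.0):
    """Return (requested_url, final_url, status, body); body is None on an HTTP error."""
    url = robots_url(site)
    request = urllib.request.Request(
        url, headers={"User-Agent": "robots-check-sketch"}  # hypothetical UA string
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            body = response.read().decode("utf-8", errors="replace")
            # geturl() differs from url if the server redirected the request.
            return url, response.geturl(), response.status, body
    except urllib.error.HTTPError as err:
        # e.g. 404 when no robots.txt is reachable at the root
        return url, err.url, err.code, None
```

Calling fetch_robots("www.domain.com") and comparing the first two values of the result shows whether the root request was answered directly or redirected somewhere else. Network failures such as an unknown host raise urllib.error.URLError and are left uncaught here.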

Lord Majestic

3:30 pm on Mar 10, 2005 (gmt 0)

I doubt it, never tried it though.

I am inclined to agree, even though technically a request for robots.txt is a normal web request that can be subject to redirection: all the standard requires is that it be requested in the root, so a robot does not have to start looking for it elsewhere (like in directories). My bot won't be too happy, since it first checks for the existence of robots.txt using a HEAD request and only makes the full request if it gets a 200 response code :(

All of my websites and virtual domains (test.domain.com/somepage.asp or www.domain.com/loging.asp) map to some file within this directory structure

Subdomains will be treated like separate domains for robots.txt purposes, so a correct crawler will have to request robots.txt from each of them, i.e. test.domain.com/robots.txt. Since you can point a subdomain to its own directory, you can place a unique robots.txt file in each of those directories without having to redirect anything, as the redirection will be done implicitly by the web server. This still leaves the issue of having lots of robots.txt files, but can't you use something like a symbolic link to point to a single real robots.txt somewhere else?
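To make the per-host point concrete, here is a small sketch (the function name is invented for illustration) that groups page URLs by scheme and host and lists the robots.txt URL a well-behaved crawler would be expected to request for each group. The sample URLs are the ones from the original question plus one made-up page.

```python
from urllib.parse import urlsplit


def robots_files_needed(urls):
    """Map each robots.txt URL to the pages it governs, one entry per scheme+host."""
    by_origin = {}
    for url in urls:
        parts = urlsplit(url)
        # Host names are case-insensitive, so normalise before grouping.
        origin = f"{parts.scheme}://{parts.netloc.lower()}"
        by_origin.setdefault(origin, []).append(url)
    return {origin + "/robots.txt": pages for origin, pages in by_origin.items()}


if __name__ == "__main__":
    pages = [
        "http://test.domain.com/somepage.asp",
        "http://www.domain.com/loging.asp",
        "http://www.domain.com/other.asp",  # hypothetical extra page
    ]
    for robots, governed in robots_files_needed(pages).items():
        print(robots, "covers", len(governed), "page(s)")
```

Each key in the result is a separate file the server has to answer for, even if all of the hosts map into the same directory tree on disk.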

clockstopper

3:13 pm on Mar 17, 2005 (gmt 0)

Maybe my question is appropriate here.

I recently dumped the meta robots for a robots.txt thanks to this dedicated section, and it's sitting comfortably in my /mehere/robots.txt spot. It's valid.

My question now is what to do with that meta info I had:
meta content="FOLLOW,INDEX" etc etc ... should I just remove it totally now? Or is there some special meta, or do I leave it in...?

I'm not as clever as most here yet, and I'm oblivious to the obvious most times, so any feedback would be great.

Thanks!