I'm developing a website. It is on a publicly accessible URL but not linked to or indexed. I have a robots.txt with the following content:
User-Agent: *
Disallow: /
I added the site to Webmaster Tools under my Google account because I wanted to submit the XML sitemap (which I generate dynamically through my CMS) and test it there.
While I was in the dashboard, I ran a diagnostic on my robots.txt file. Googlebot has found it, and here is how it has interpreted it:
User-agent: *
Disallow: /dir1/
Disallow: /dir2/
Disallow: /dir3/
Where dirX above is the name of a directory on the web server.
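To make the mismatch concrete, here is a minimal sketch that compares what the two rule sets would block, using Python's standard urllib.robotparser. The sample paths and the helper names (build_parser, blocked_paths) are made up for illustration, and this parser is only a stand-in: it is not how Googlebot itself evaluates the file.

```python
from urllib.robotparser import RobotFileParser

# The file I actually published.
INTENDED = """\
User-Agent: *
Disallow: /
"""

# The file as reported back by the diagnostic (dirX are placeholders).
REPORTED = """\
User-agent: *
Disallow: /dir1/
Disallow: /dir2/
Disallow: /dir3/
"""

# Hypothetical paths, chosen only to exercise the rules.
SAMPLE_PATHS = ["/", "/index.html", "/dir1/page.html", "/dir3/"]


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text directly, without fetching anything."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def blocked_paths(robots_txt: str, paths, agent: str = "Googlebot"):
    """Return the subset of paths that the given rules disallow for agent."""
    parser = build_parser(robots_txt)
    return [p for p in paths if not parser.can_fetch(agent, p)]


if __name__ == "__main__":
    print("intended blocks:", blocked_paths(INTENDED, SAMPLE_PATHS))
    print("reported blocks:", blocked_paths(REPORTED, SAMPLE_PATHS))
```

Under the file I published, every sample path is blocked; under the reported version, only the paths inside the listed directories are, so the site root would be open to crawling.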
Unfortunately, I don't have access to the access log for this client's hosting account, so I can't see exactly what Googlebot has been up to. My question is this: how did Googlebot discover the directory structure on the web server? I am quite certain that not all of those directories can be discovered by a simple deep crawl of the site.
Anyway, this has freaked me out a little. Does someone have an explanation?
Also, I decided to change my robots.txt file to:
User-Agent: *
Disallow: *
Does this change make any practical difference?
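For context on why I ask: as I understand it, the original robots.txt convention treats a Disallow value as a literal path prefix, while some crawlers (Googlebot among them, as far as I know) also accept "*" as a wildcard and a trailing "$" as an end anchor. The sketch below contrasts the two interpretations. The function names are hypothetical and the matching is a simplification of my own, not any crawler's actual code.

```python
import re


def prefix_match(rule: str, path: str) -> bool:
    """Literal prefix matching: the rule applies if the path starts with it."""
    return rule != "" and path.startswith(rule)


def wildcard_match(rule: str, path: str) -> bool:
    """Wildcard matching: '*' matches any run of characters and a
    trailing '$' anchors the rule to the end of the path."""
    if rule == "":
        return False
    anchored = rule.endswith("$")
    if anchored:
        rule = rule[:-1]
    # Escape the literal pieces and join them with '.*' for each '*'.
    pattern = ".*".join(re.escape(part) for part in rule.split("*"))
    if anchored:
        pattern += "$"
    return re.match(pattern, path) is not None


if __name__ == "__main__":
    path = "/dir1/page.html"  # hypothetical path
    for rule in ("/", "*"):
        print(repr(rule),
              "prefix:", prefix_match(rule, path),
              "wildcard:", wildcard_match(rule, path))
```

If that reading is right, "Disallow: /" blocks everything under both interpretations, whereas "Disallow: *" blocks everything only for crawlers that understand wildcards and blocks nothing for a strict prefix matcher.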
Thanks for your consideration. This is my first post to WebmasterWorld! :)
Mark