I have a robots.txt with the following content:
User-Agent: *
Disallow: /
I added the site to Webmaster Tools under my Google account because I wanted to submit the XML sitemap (which I generate dynamically through my CMS) and test it via Webmaster Tools.
So while I was in my dashboard, I ran a diagnostic on my robots.txt file. Googlebot has found it, and here is how it has interpreted it:
User-agent: *
Disallow: /dir1/
Disallow: /dir2/
Disallow: /dir3/
Where dirX above is the name of a directory on the web server.
Unfortunately, I don't have access to the access logs for this client's hosting account, so I can't see exactly what Googlebot has been up to. My question, though, is this: how did Googlebot discover the directory structure on the web server? I am quite certain that not all of those directories could be discovered by a simple deep crawl of the site.
Anyway, this has freaked me out a little. Does anyone have an explanation?
Also, I decided to change my robots.txt file to:
User-Agent: *
Disallow: *
Is there any practical difference between the two?
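For anyone who wants to see the difference concretely, here is a minimal sketch using Python's standard-library robots.txt parser (the domain and sample path below are placeholders, and real crawlers may interpret the wildcard differently than this parser does):

# Minimal sketch: feed each ruleset to Python's stdlib robots.txt parser
# and check whether a sample URL would be crawlable under it.
# (www.example.com and /dir1/page.html are placeholders, not a real site.)
from urllib.robotparser import RobotFileParser

def allowed(rules, url):
    parser = RobotFileParser()
    parser.parse(rules)  # parse() accepts the robots.txt content as a list of lines
    return parser.can_fetch("Googlebot", url)

url = "http://www.example.com/dir1/page.html"
print(allowed(["User-agent: *", "Disallow: /"], url))
print(allowed(["User-agent: *", "Disallow: *"], url))

My understanding is that the original robots.txt spec treats the Disallow value as a literal path prefix rather than a pattern, so different crawlers may not treat the two files the same way, which is partly why I'm asking.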
Thanks for your consideration. This is my first post to WebmasterWorld! :)
Mark
Google can find information about your site from many different sources. The Google Toolbar can help Google discover pages. Other sites linking to you can reveal URLs to Google. You can also submit a sitemap to Google. The simplest and most common way is for Google to just crawl your site.
Has your robots.txt always been live? Maybe Google crawled your site before you uploaded a robots.txt? Do you have a line in your robots.txt for Google?
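If not, you could add a group that names Googlebot explicitly; as I understand it, Googlebot follows the most specific user-agent group that matches it. A sketch (adjust the rules to whatever you actually want Google to do):

User-agent: Googlebot
Disallow: /

User-agent: *
Disallow: /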
cheers,
greg