Moderators: goodroi

Sitemaps, Meta Data, and robots.txt Forum

Disallow: /
How did googlebot index my web server dir structure?

thinkweb, 5+ Year Member

Msg#: 3317537 posted 9:51 am on Apr 21, 2007 (gmt 0)

I'm developing a website. It is on a publicly accessible URL but not linked to or indexed.

I have a robots.txt with the following content:

User-Agent: *
Disallow: /

I added the site to Webmaster Tools under my Google account because I wanted to submit the XML sitemap (which I generate dynamically through my CMS) and test it via Webmaster Tools.
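For reference, a dynamically generated sitemap of this kind boils down to a `urlset` document in the sitemaps.org namespace. A minimal sketch using Python's standard library (the URL and date are placeholders, not the poster's actual pages):

```python
import xml.etree.ElementTree as ET

# Namespace required by the sitemaps.org protocol
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

def build_sitemap(pages):
    """Build a minimal XML sitemap from (loc, lastmod) pairs."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for loc, lastmod in pages:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = loc
        ET.SubElement(url, "lastmod").text = lastmod
    return ET.tostring(urlset, encoding="unicode")

# Placeholder page, not the real site
print(build_sitemap([("http://www.example.com/", "2007-04-21")]))
```

A CMS would typically feed this from its page table and serve the result at a fixed URL such as `/sitemap.xml`, which is what gets submitted in Webmaster Tools.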

So while I am in my dashboard, I run a diagnostic on my robots.txt file. Googlebot has found it, and here is how it has interpreted it:

User-agent: *
Disallow: /dir1/
Disallow: /dir2/
Disallow: /dir3/

Where dirX above is the name of a directory on the web server.

Unfortunately, I don't have access to the access log for this client's hosting account, so I can't see exactly what Googlebot has been up to. My question, though, is this: how did Googlebot discover the directory structure on the web server? I am quite certain that not all of those directories can be discovered by a simple deep crawl of the site.

Anyway, this has freaked me out a little. Does someone have an explanation?

Also, I decided to change my robots.txt file to:

User-Agent: *
Disallow: *

Is there any practical difference achieved by doing so?
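On that last question: under the original robots.txt standard, the Disallow value is a literal path prefix, not a wildcard pattern, so `Disallow: /` is the canonical block-everything rule. Google's crawler does treat `*` as a wildcard, but a strict standard-conformant parser need not. A small sketch using Python's standard-library parser (the URL is a placeholder):

```python
from urllib.robotparser import RobotFileParser

def allowed(robots_txt, url, agent="*"):
    """Parse a robots.txt body and ask whether `agent` may fetch `url`."""
    rp = RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return rp.can_fetch(agent, url)

url = "http://www.example.com/dir1/page.html"  # placeholder URL

# "Disallow: /" matches every path by prefix: the whole site is blocked.
print(allowed("User-agent: *\nDisallow: /", url))   # False

# "Disallow: *" is not a prefix of any ordinary path, so a strict parser
# may block nothing at all; only wildcard-aware crawlers like Googlebot
# read it as "everything".
print(allowed("User-agent: *\nDisallow: *", url))
```

So the practical answer is that `Disallow: /` is the safe, universally understood form, while `Disallow: *` depends on the crawler's wildcard support.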

Thanks for your consideration. This is my first post to WebmasterWorld! :)




WebmasterWorld Administrator goodroi is a WebmasterWorld Top Contributor of All Time, 10+ Year Member, Top Contributor of the Month

Msg#: 3317537 posted 1:14 pm on Apr 25, 2007 (gmt 0)

Welcome to WebmasterWorld thinkweb!

Google can find information about your site from many different sources. The Google Toolbar can help Google discover pages. Other sites linking to you can reveal URLs to Google. You can also submit a sitemap to Google. The simplest and most common way is for Google to just crawl your site.

Has your robots.txt always been live? Maybe Google crawled your site before you uploaded a robots.txt? Do you have a line in your robots.txt for Google?
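On that last point, a group aimed specifically at Google would look like the following. When a group names Googlebot explicitly, Googlebot obeys that group and ignores the generic `*` group, so spelling out both keeps the intent unambiguous:

```
User-agent: Googlebot
Disallow: /

User-agent: *
Disallow: /
```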



thinkweb, 5+ Year Member

Msg#: 3317537 posted 5:24 am on Apr 26, 2007 (gmt 0)

Thanks for the reply. Yes, I added the site to Google Webmaster Tools to test the sitemap, so Google knows about it from my toolbar activity and from Webmaster Tools. Anyway, the site is not in the indexed results yet, so for now I'm just harassing the client to hurry up and finish their content so I can flick the switch to "live".


All trademarks and copyrights held by respective owners. Member comments are owned by the poster.
WebmasterWorld is a Developer Shed Community owned by Jim Boykin.
© Webmaster World 1996-2014 all rights reserved