It does, in some places, have Lorem ipsum and debugging information exposed. Most of it is banal, but some could give malicious visitors ammunition for hacking the site: database table names, PHP error messages, SQL error messages, class names, column names, cookie values, session values, and things like that.
Well, guess who came to visit! Yes, Googlebot. I didn't expect it to find the site, but it did. It crawled the whole work-in-progress, and it's now visible in Google's index.
How did Googlebot wander in? Hmmmm, well here are some facts:
1) I can say WITH 100% CERTAINTY that no one has ever publicly linked to this site. I am the only person in the world who knows that the site exists. Well OK, just me and my registrar.
2) Looking in their index at "site:example.com", I see some interesting URLs. One of them surprised me:
www.example.com/search/?q=scubamonkey
"scubamonkey" is a word I use sometimes when I'm testing things. It's just a more colourful variant of "foo" or "bar".
Did I leave a link to that in the site somewhere as I was working on it? I don't think I did... so how could that URL have gotten indexed? Googlebot would have had to type "scubamonkey" in the search box and submit the form. How likely is that?
3) Are they fuzzing, too? I found index entries for pages that don't exist, at invalid URLs that return a 200 OK status (gimme a break, the site is in dev after all). They follow the pattern of a real URL on my site, but the data in the querystring is totally whack.
I suspect that Google was using my toolbar to mine for new URLs. They were shoulder surfing while I worked on the site, and they came in later and crawled it.
Now that it's in the Google index, anyone can find it. Some of the cached pages show PHP errors that I really wish were not exposed in public.
I enabled password protection on the site a few minutes ago. Too little, too late... I should have known better.
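For anyone wondering what "password protection" can look like at the application level, here is a minimal sketch of HTTP Basic authentication as a WSGI wrapper in Python. It is not what was actually used on this site (that is not stated, and the site itself is PHP); the names `require_basic_auth` and `dev_site` and the credentials are made up for illustration, and in practice this would sit behind HTTPS.

```python
import base64
import hmac


def require_basic_auth(app, username, password, realm="dev site"):
    """Wrap a WSGI app so every request must carry HTTP Basic credentials."""
    # The token a client sends is base64("username:password").
    expected = base64.b64encode(f"{username}:{password}".encode("utf-8"))

    def guarded(environ, start_response):
        header = environ.get("HTTP_AUTHORIZATION", "")
        scheme, _, token = header.partition(" ")
        supplied = token.strip().encode("utf-8")
        # compare_digest does a constant-time comparison of the two tokens.
        if scheme.lower() == "basic" and hmac.compare_digest(supplied, expected):
            return app(environ, start_response)
        # No credentials or wrong credentials: challenge instead of serving.
        start_response(
            "401 Unauthorized",
            [
                ("WWW-Authenticate", f'Basic realm="{realm}"'),
                ("Content-Type", "text/plain; charset=utf-8"),
            ],
        )
        return [b"Authentication required.\n"]

    return guarded


def dev_site(environ, start_response):
    # Hypothetical stand-in for the real work-in-progress application.
    start_response("200 OK", [("Content-Type", "text/plain; charset=utf-8")])
    return [b"work in progress\n"]


# Hypothetical credentials, for illustration only.
protected = require_basic_auth(dev_site, "me", "s3cret")
```

A crawler that requests any URL without credentials gets a 401 and never sees the page body, so debug output stays out of reach. The same effect is usually easier to get from the web server's own access controls.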
Did I leave a link to that in the site somewhere as I was working on it? I don't think I did... so how could that URL have gotten indexed? Googlebot would have had to type "scubamonkey" in the search box and submit the form. How likely is that?
Are your server logs or stats pages spiderable?
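One quick way to answer that for yourself is to run your robots.txt rules through Python's standard `urllib.robotparser` and see which paths a crawler identifying as Googlebot would be allowed to fetch. This is only a sketch: the robots.txt content and the `/stats/` and `/logs/` paths below are assumptions for illustration, not taken from the site in question.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; substitute your own file's text.
ROBOTS_TXT = """\
User-agent: *
Disallow: /stats/
Disallow: /logs/
"""


def crawlable_paths(robots_txt, paths, user_agent="Googlebot"):
    """Return the subset of paths that the given robots.txt lets user_agent fetch."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [path for path in paths if parser.can_fetch(user_agent, path)]


# Hypothetical paths to check.
open_paths = crawlable_paths(
    ROBOTS_TXT,
    ["/", "/stats/index.html", "/logs/access.log", "/search/?q=scubamonkey"],
)
print(open_paths)
```

Anything that shows up in `open_paths` is fair game for a well-behaved crawler. Keep in mind that robots.txt is only a request to crawlers, not access control, so logs and stats pages that list your URLs are better kept behind a password.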
Here's a discussion about how Googlebot might find you...
Why is Google indexing my entire web server?
[webmasterworld.com...]
Note that Google's "secret web server" FAQ article I referenced in that thread has been moved a few times and ultimately removed, but fortunately I quoted the relevant paragraph in the thread.
They follow the pattern of a real URL on my site, but the data in the querystring is totally whack.
Google has been doing this for a while now: they guess query string variations and form inputs in many ways. And if they get a "hit", it seems to intensify the practice. Something to remember when you do go into full production and open up the domain to spidering again.
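One way to avoid rewarding guessed URLs with a "hit" is to answer 404 for any query string the application does not recognize, instead of 200. Here is a rough Python sketch of that idea; the parameter names `page` and `sort` and their allowed formats are hypothetical, and a real site would plug this kind of check into its own request handling.

```python
import re
from urllib.parse import parse_qsl

# Hypothetical whitelist: parameter name -> pattern the whole value must match.
ALLOWED_PARAMS = {
    "page": re.compile(r"[1-9][0-9]{0,2}"),  # 1 to 999
    "sort": re.compile(r"date|title"),
}


def status_for_query(query_string):
    """Return 200 only if every parameter is known, unrepeated, and well-formed."""
    seen = set()
    for name, value in parse_qsl(query_string, keep_blank_values=True):
        pattern = ALLOWED_PARAMS.get(name)
        # Unknown name, repeated name, or malformed value: treat as not found.
        if pattern is None or name in seen or not pattern.fullmatch(value):
            return 404
        seen.add(name)
    return 200


for qs in ["", "page=2&sort=date", "page=abc", "id=99999", "page=2&page=3"]:
    print(repr(qs), status_for_query(qs))
```

With this in place, a fuzzed URL such as `?id=99999` gets a 404 rather than a page that looks real enough to index.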