How Did Google Find this Hidden File?

5:10 am on Feb 19, 2012 (gmt 0)

New User

5+ Year Member

joined:Feb 19, 2012
posts: 6
votes: 0


Hi, I have a development server and I recently created a new directory and put a test script in it. I just discovered tonight that Google had found this script and indexed it within a matter of days.

The domain is not advertised or linked anywhere, and there are no index files on the site; i.e., unless a user knows the exact URL, they won't stumble upon anything other than blank pages.

So, how did Google find this hidden page? Now I'm wondering if they are spying on my email or browser and indexing links that I visit or that are contained in emails I send/receive. I cannot think of any other possible way they could have stumbled upon this URL.

Any ideas?
5:22 am on Feb 19, 2012 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month

joined:Sept 4, 2001
posts:2218
votes: 36


I assume your site is a registered domain. Search engines will actually go through domain registration records, given that they are public records.

I suggest putting up a robots.txt file with a disallow.
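
For reference, a minimal robots.txt at the root of the domain that asks all compliant crawlers to stay out of everything looks like this:

User-agent: *
Disallow: /

Just keep in mind (as comes up later in this thread) that robots.txt only stops compliant crawling; it doesn't stop a URL from being discovered, and a disallowed URL can still show up if Google learns of it from other sources.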

Marshall
7:07 am on Feb 19, 2012 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:Apr 9, 2011
posts:13923
votes: 496


But wait. The domain name will only get them as far as the front page, plus anything that links from the front page. If they are nasty robots they will also look at robots.txt and make a predictable try for
blockeddirectory/index.html
and any reasonable variants.

But robots don't have x-ray vision. They can't see the entire content of a domain. Is your test script at the front of the domain, where anyone entering the domain name will get to it? Or is it hiding?

Does it have any outward links? Did you send someone else its exact URL using gmail?
10:53 am on Feb 19, 2012 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:June 17, 2002
posts:1187
votes: 6


I once made the mistake of suggesting Google was reading hidden files, but then realised that the sitemap.py file was logging those files and thus feeding G via the sitemap.xml file.
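
To illustrate that gotcha (a hypothetical sketch, not the actual sitemap.py): a sitemap generator that blindly walks the docroot will advertise every file it finds, linked or not. Docroot and hostname below are made-up placeholders.

import os
from xml.sax.saxutils import escape

DOCROOT = "/var/www/html"         # assumption: adjust to your server
BASE = "http://test.example.com"  # hypothetical test host

print('<?xml version="1.0" encoding="UTF-8"?>')
print('<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">')
for dirpath, dirnames, filenames in os.walk(DOCROOT):
    for name in filenames:
        # every file under the docroot gets emitted -- including
        # "hidden" test scripts that nothing ever linked to
        url = BASE + os.path.join(dirpath, name)[len(DOCROOT):]
        print("  <url><loc>%s</loc></url>" % escape(url))
print("</urlset>")

Submit that sitemap.xml (or let Webmaster Tools find it), and Google knows every "secret" URL on the site.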

Do you use the Google Toolbar? That's another gotcha.

Look at the access log and see which IPs accessed the file before Google did. That could show whether it is linked from, or was visited by, somewhere or someone else.
12:32 pm on Feb 19, 2012 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:May 24, 2002
posts:894
votes: 0


Did you send someone else its exact URL using gmail?

And not just Gmail. Today I had Google following a link in an email sent through one of those free email providers, a personalized link that G could only have known about if it crawled the email I sent it in.
3:09 pm on Feb 19, 2012 (gmt 0)

New User

5+ Year Member

joined:Feb 19, 2012
posts: 6
votes: 0


Hi, thanks for all the replies. The domain is a subdomain, e.g. test.mydomain.com. It is not linked anywhere, and no one would possibly know about it unless they could dump all the DNS records for the domain.

The script it found was not easy to guess and there was no default index file in the directory. They had to know exactly what they were looking for.

The only possibility I see is that I used Google Chrome to run the script....which if true means that Google is spying on people.

Pretty sickening if true...but I am at a loss for any other explanations. Add to this the fact that it was in the index within days of it being created.

I added a robots.txt file to the site and removed the site from google search via the webmaster tools. But, if they're unethical enough to index this in the first place....
3:16 pm on Feb 19, 2012 (gmt 0)

New User

5+ Year Member

joined:Feb 19, 2012
posts: 6
votes: 0


Follow-up reply: I shared the URL with no one, nor do I believe I included the link in any emails (I did use my personal Gmail for testing).

The script basically reads an email account and converts the unread messages to a webpage. I did send a test email from my gmail account to the test email account, but the email did not include a link to the script itself. It was simply addressed to the test email account, which could in no way lead to the script itself.

This is how I found out the test script had been indexed. I was googling my Gmail address for privacy reasons, and the test script popped up as the #4 search result, showing the test email I had sent.
3:29 pm on Feb 19, 2012 (gmt 0)

New User

5+ Year Member

joined:Feb 19, 2012
posts: 6
votes: 0


Testing the phenomenon. If anyone is interested, I think we could do some testing to see whether Chrome and/or Gmail is spying on people for the purpose of gathering URLs to crawl and index.

1) Create a buried page with a unique name and place what appears to be pertinent info on it so Google assumes it is worth indexing. Include some unique piece of information that when searched should easily appear in the top 10 results...maybe a unique fictitious email address.

Be sure the page is not blocked by a robots.txt file.

Then, visit this page a few times using only the Google Chrome browser. Do not send any emails or other electronic communications that include the URL. Check google a few days later to see if it shows up.

2) Repeat the above, except create another unique page name with a different unique piece of information. Send the link from and to a Gmail address. Do not visit the page using Chrome, or any browser if you can prevent it. Wait a few days and see if it shows up in the index.
5:10 pm on Feb 19, 2012 (gmt 0)

Senior Member

WebmasterWorld Senior Member 5+ Year Member

joined:Mar 9, 2010
posts:1806
votes: 9


There are some interesting theories floating around...

[ipullrank.com...]
[news.ycombinator.com...]
8:19 pm on Feb 19, 2012 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:Apr 9, 2011
posts:13923
votes: 496


The only possibility I see is that I used Google Chrome to run the script....which if true means that Google is spying on people.

Welcome to WebmasterWorld :(
8:33 pm on Feb 19, 2012 (gmt 0)

Junior Member

5+ Year Member

joined:Dec 1, 2011
posts: 192
votes: 0


I have seen similar issues with Google visits on sites that should NEVER see a visit, such as sites I use only for testing website setups and my own software: links that should never be seen by anyone other than me.

That said, I don't really believe stuff like the ipullrank article, that "Chrome is GoogleBot".

But stuff like this is one reason that I decided not to use Chrome, after my initial testing of it. There simply is a limit to how much information I want Google to have.

Here is my theory:

Rather than Google directly "spying" on people, I think we have here another of Google's VERY liberal interpretations of what privacy is, combined with a VERY liberal use of information in Google's various databases.

When you use Chrome, the typical setup is to have its "Under the Hood" features enabled, such as "Use a web service to resolve navigation errors", "Use a prediction service to complete searches", and "Enable phishing and malware protection".

All these services make a very liberal use of various services at Google.

Try loading up a sniffer, such as Network Monitor, while using Chrome. Ignore the initial start-up at first (update checks and other junk connections), but then start tracking the connections and data. Chrome has a CONSTANT barrage of connections to various Google IPs, even when you do something simple, like typing in and opening a web page.

Some connections are obvious, such as content checks, Google ad calls if the site has AdSense, Analytics calls, and such. A large portion are encrypted connections (obviously explained as "protecting your privacy"), which also makes it hard to know exactly what they are doing.

But rather than Google Chrome spying on the individual user in an obvious, planned way, I think they do what they have explained many times on the Google support groups when people ask how stuff is found: they get links from many sources, and robots.txt only prevents links from being followed by GoogleBot, not from being found if they are available in other sources.

When Chrome does all its various checks on malware, completions, and much else, all these actions are obviously tracked and logged in Google's databases. If a Google ad is there, the URL is tracked. If you have Google Analytics code loading on the page, that URL is loaded into the Google databases.

You load a URI, and Chrome + Analytics + AdSense track information back to Mamma Google, doing all sorts of checks and logging.

It is my firm belief that all the links they "catch" through these sources are seen by Google as a way to include the "whole Internet" in their databases; merely Google using all available sources.

They extract the links from the databases, they know each one loaded for you (or for the AdSense/Analytics/malware checker/predictor) as a valid link, and they can then pass it into GoogleBot's queues for further follow-up.

If Gmail does these kinds of content checks, the link is suddenly logged in the Google databases as well. (This could happen both when sending an email with a link, and on checks at the receiver's side, if they use Gmail.)

Not overt spying, provided that they do not log your identity with the links, but still a somewhat grey way of getting information in my book.

The linked articles questioning why else Google would invest in getting into the browser market are likely incorrect that "Chrome is GoogleBot", but I do believe that Google got into that market because our browsers would otherwise be a HUGE untapped source of information for them.

Also, have y'all disabled your Google Web History yet in your Google accounts? That is another potential source for Google database merges to find a lot of links that would otherwise not be found anywhere else.
8:48 pm on Feb 19, 2012 (gmt 0)

New User

5+ Year Member

joined:Feb 19, 2012
posts: 6
votes: 0


DeeCee, lacking any actual evidence of what Google is doing, I believe you have hit the nail on the head!

I always assumed they were scanning my Gmail, but I never dreamed they were capturing and acting on URLs I have visited via Chrome.

Conclusion: I am removing all instances of Google Chrome and I am going to spread the word!
9:18 pm on Feb 19, 2012 (gmt 0)

Junior Member

5+ Year Member

joined:Dec 1, 2011
posts: 192
votes: 0


When I had the initial problem with Google suddenly showing up in my test site logs, I changed my setup.
I test sites by running them off a couple of test servers I have here in-house, normally accessible only to me.

For my own usability, I used to have them set up in my global DNS with an (external) IP that translates in through my firewall.

Since I still have to load Chrome once in a while for test purposes, when Google suddenly magically started visiting my internal test URLs, which would never have been found externally, I changed my DNS setup to point all test sites to IPs on the 10.* network (you can also use the usual firewall-default 192.168.* range). Since 10.* cannot be routed on the global Internet but works perfectly between my internal machines and my own browsers, the URLs suddenly turn worthless for anything but internal use.

If Chrome (or other funny browser add-on services) logs my test URLs (using my test hosts), they can no longer use them, since they all resolve to externally unreachable IPs. :) So GoogleBot is welcome to hack away at them.
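
As a concrete illustration (hostname and address made up), the same effect for a single workstation can be had with an /etc/hosts entry instead of touching the public zone:

# test hostname resolves to a private, non-routable address,
# so a leaked URL is useless to anyone outside the LAN
10.0.0.5    test.mydomain.com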

BTW, notice that a lot of the database info I mentioned is really unrelated to Chrome.
AdSense, Analytics, and other URI logging will still happen even if you run things from another browser. Plus, if you have an add-on that calls on Google services, their databases will be filled as well.
9:42 pm on Feb 20, 2012 (gmt 0)

Senior Member from GB 

WebmasterWorld Senior Member dstiles is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:May 14, 2008
posts:3153
votes: 4


I have a permanent block on ALL google javascript (Firefox + NoScript). Won't let them anywhere near me.
9:54 pm on Feb 20, 2012 (gmt 0)

Junior Member

5+ Year Member

joined:Dec 1, 2011
posts: 192
votes: 0


Me too.. Firefox add-ons Ghostery and NoScript are good friends of mine. :)
11:13 pm on Feb 20, 2012 (gmt 0)

Senior Member

WebmasterWorld Senior Member tedster is a WebmasterWorld Top Contributor of All Time 10+ Year Member

joined:May 26, 2000
posts:37301
votes: 0


For years Google has been able to find "development" files that aren't linked in. There could be many avenues in use - probably are - but for me the bottom line has been that any dev files that go online also get password protection. 403 Forbidden responses definitely stop indexing ;)
10:30 am on Feb 21, 2012 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Apr 30, 2007
posts:1394
votes: 0


Yes, usually I also set up an authorization password for new site testing until it's ready.

But in production mode there are other methods which can guarantee no robot accesses the content even if it knows the exact URL. You can always use a short-term session token in a custom manner (some languages already have support built in), so unless it's present and valid, the page won't show its content, and you can emit whatever error headers you want instead.
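
A minimal sketch of that idea in Python (all names are placeholders; just one way to do it, not a definitive implementation): the page is served only when the request carries a short-lived signed token issued earlier in the same session; anything else gets whatever error you like.

import hashlib
import hmac
import time

SECRET = b"change-me"   # server-side secret, never sent to clients
TOKEN_TTL = 600         # token lifetime in seconds

def issue_token():
    """Create a timestamped, HMAC-signed token to stash in the session."""
    ts = str(int(time.time()))
    sig = hmac.new(SECRET, ts.encode(), hashlib.sha256).hexdigest()
    return ts + ":" + sig

def token_is_valid(token):
    """Accept only unexpired tokens whose signature verifies."""
    try:
        ts, sig = token.split(":", 1)
    except ValueError:
        return False
    expected = hmac.new(SECRET, ts.encode(), hashlib.sha256).hexdigest()
    return hmac.compare_digest(sig, expected) and time.time() - int(ts) < TOKEN_TTL

# in the page handler: serve the content if token_is_valid(request_token),
# otherwise emit a 404 so a robot learns nothing about the page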

So even if we accept that Chrome passes all access data to the Google datacenters, it will make no difference. They would have to figure out the entire sequence of events to get the content of the page.
9:14 am on Feb 22, 2012 (gmt 0)

Administrator

WebmasterWorld Administrator phranque is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Aug 10, 2004
posts:10932
votes: 80


the correct technical solution for blocking a test server is to set up HTTP Basic Access Authentication, which challenges the crawler's request with a 401 Unauthorized status code response:
http://www.w3.org/Protocols/rfc2616/rfc2616-sec10.html#sec10.4.2
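
on apache, for example, that's a few lines of .htaccess plus a password file (paths and names here are just examples, adjust to your own setup):

# create the credentials file once, e.g.:
#   htpasswd -c /home/user/.htpasswd devuser
AuthType Basic
AuthName "Dev server - keep out"
AuthUserFile /home/user/.htpasswd
Require valid-user

any request without valid credentials then gets the 401 response automatically.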
9:20 am on Feb 22, 2012 (gmt 0)

Senior Member

WebmasterWorld Senior Member g1smd is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:July 3, 2002
posts:18903
votes: 0


.htpasswd is your friend.
4:02 pm on Feb 22, 2012 (gmt 0)

New User

5+ Year Member

joined:Feb 19, 2012
posts: 6
votes: 0


yeah, well, if companies weren't spying on what people surfed and adding those pages to a public index, you wouldn't have to lock down a test server that does not advertise itself to the public. What they're doing may or may not be buried deep in their terms of use and privacy statements, but it sure isn't common knowledge and clearly understood.

Just like we wouldn't need locks on doors if everyone was honest :lol:
7:50 pm on Feb 22, 2012 (gmt 0)

Administrator

WebmasterWorld Administrator phranque is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Aug 10, 2004
posts:10932
votes: 80


i'm pretty sure after getting your test site urls "discovered" by various means a few times you'll get in the habit of password protecting your test content.
it couldn't hurt to also meta robots or X-Robots-Tag noindex that content.
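
on apache with mod_headers, for example, a sketch that sends the header for everything on the test host:

Header set X-Robots-Tag "noindex, nofollow"

or the per-page equivalent in the html:

<meta name="robots" content="noindex">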
3:22 am on Feb 23, 2012 (gmt 0)

Senior Member

WebmasterWorld Senior Member tedster is a WebmasterWorld Top Contributor of All Time 10+ Year Member

joined:May 26, 2000
posts:37301
votes: 0


Phranque is correct - 401 is the correct HTTP status code for the server response.