

Google Desktop Tools and Google Labs Projects Forum

This 54 message thread spans 2 pages.
GoogleBot visits what you visit if you have the toolbar

 10:46 pm on May 7, 2002 (gmt 0)

Ok, we have had this discussion before and the result was a denial by GG that Googlebot did this. But I have proof that Google crawls locations that you visit with the browser/toolbar. Make a new document that no one knows about. Visit that location a couple of times with your toolbar on. Set logging on this document... wait a month and there comes Google.

- - [06/May/2002:23:51:06 -0700] "GET /location_that_I_made_to_test_googlebot/ HTTP/1.0" 401 46 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"

Please set traps and prove me wrong, but I am positive that they do this! The theory to discount this used to be that the location was crawled because it was in a referral log somewhere. Well, that is not the case, because I only refer to myself and my logs are private.



 10:59 pm on May 7, 2002 (gmt 0)

Lisa, I always figured that it happened when you did the advanced install. So I always go to web addresses from the toolbar rather than the address bar ;)


 11:30 pm on May 7, 2002 (gmt 0)

I remember GoogleGuy denying this. But I tested it, and it tracks you and reports your URLs to be crawled.

Maybe GoogleGuy should double check with the Engineers.

brotherhood of LAN

 11:40 pm on May 7, 2002 (gmt 0)

Hi Lisa,

I think I was around at the time you mention. What I want to know is, what are we angling at here: is it privacy, the spider, or what? I know where you are coming from; I would just like further opinions on why it matters :)


 11:40 pm on May 7, 2002 (gmt 0)

Great stuff Lisa. I figured Google would do something like this. It is a perfect way to find a page that might be of interest.

I hope they stick with their "needs a few links to show up in listings" motto because that could create some problems.

Maybe Google is compartmentalized and each section only knows about certain things so that they don't get leaked out.

Either way, when a SE rep talks, I only listen to what they have to say about something new or bug-fixing.

When they say what they do and don't do, I just let it go out the other ear.


 11:42 pm on May 7, 2002 (gmt 0)

Same here, brand new domain, just delegated and uploaded, not linked, visited via toolbar - next day - Hello Mr GoogleBot!



 11:55 pm on May 7, 2002 (gmt 0)

Yes, I had the advanced install. I was under the impression that they would not report URLs back to Google for crawling. It says they only report back for figuring out what PR to report to you and if it is in the directory.


 1:22 am on May 8, 2002 (gmt 0)

It says they only report back for figuring out what PR to report to you and if it is in the directory.

What do they say in their toolbar FAQ, their privacy policy, and in the toolbar itself? I might have to reinstall it just to investigate. Time to proxy with Junkbuster and see just exactly when they call home with the toolbar.

If their explanation is misleading, we might have a case of creeping privacy-violation at Google, or a case of the right hand not knowing what the left hand is doing. It's been 17 months since they chose their privacy wording so carefully with respect to the toolbar, and something may have changed.

I can forgive the CIA for setting illegal cookies after they apologize and stop doing it -- the CIA is pretty big and their document search site was outsourced. (Still, you'd think the spooks would double-check their contractor's work.)

But Google with 300 or so employees? It would be hard for them to say "Oops!"

Alexa got into trouble over a poorly-worded privacy policy. They lost a class-action suit and were liable for damages, over the very issue of phoning home. If you phone home, you must explain, clearly, exactly under what conditions it occurs.


 2:10 am on May 8, 2002 (gmt 0)

I have no problem with them crawling what I show them. But they should clearly say they are doing this.

Here is a quote from GoogleGuy
Hey, our privacy policy says that we won't give personally identifiable information outside of google. We don't use toolbar data in our crawl/indexing, but that would be allowed by our privacy policy. The toolbar does go to great lengths to avoid returning personal info. Right now, we strip out username/password from [user:password@host...] We also truncate dynamic urls at the ? mark. Finally, we try to avoid any intranet urls (that's hard to do exactly, but we do our best).
Sorry, chris_f, your url leaked out some other way.

From [webmasterworld.com...]
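
The three steps described in the quote (strip user:password, truncate at the ? mark, skip intranet URLs) can be sketched roughly as follows. This is only an illustration of the idea, not the toolbar's actual code: the function name sanitize_url and the intranet heuristic (dotless hostnames and private IP ranges) are assumptions made for this example.

```python
import ipaddress
from urllib.parse import urlsplit, urlunsplit


def sanitize_url(url):
    """Return a reduced URL, or None if it looks like an intranet address.

    Hypothetical helper: the rules below are a guess at what the quote
    describes, not a known implementation.
    """
    parts = urlsplit(url)
    host = parts.hostname or ""

    # Assumed intranet heuristic 1: hostnames without a dot (e.g. "intranet").
    if "." not in host:
        return None

    # Assumed intranet heuristic 2: IP literals in private ranges.
    try:
        if ipaddress.ip_address(host).is_private:
            return None
    except ValueError:
        pass  # not an IP literal, so treat it as an ordinary hostname

    # Rebuild the network location without any user:password prefix.
    netloc = host if parts.port is None else f"{host}:{parts.port}"

    # Drop the query string and fragment, i.e. truncate at the "?" mark.
    return urlunsplit((parts.scheme, netloc, parts.path, "", ""))
```

For example, sanitize_url("http://user:secret@www.example.com/page.cgi?id=7") gives "http://www.example.com/page.cgi", while a URL such as "http://intranet/wiki" is dropped entirely. As the quote itself admits, detecting intranet URLs exactly is hard, so any heuristic like this will miss some cases.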


 2:54 am on May 8, 2002 (gmt 0)

I'll say it again. :) I don't think our privacy policy prevents Google from doing this, because we are allowed to use anonymous user data to improve our search, but installing the toolbar didn't make googlebot crawl your page. See
for some of the typical ways that urls leak. Other ways include people guessing urls, network/DNS setups, etc.


 3:14 am on May 8, 2002 (gmt 0)

I have always trusted you. But I am 100% sure the only thing that knows about this URL is me and my Google toolbar. There is no way for someone to guess it. It was designed to test this exact thing. That log file entry above is taken exactly from my log file. The only thing changed is the URL so it remains private. There is a 401 authentication on that page, because I suspected the toolbar caused crawling. I could see the domain leaking out because it is in the global zonefile. But this URL is two levels down and has no incoming links from anywhere, not even the root of the domain. I only access the page by typing it in the address location.

... installing the toolbar didn't make googlebot crawl your page

My test proves something completely different. Please check with your toolbar engineers. You may want to modify your privacy policy to tell people that the url WILL be visited by the search engine indexing bot.


 3:15 am on May 8, 2002 (gmt 0)

It looks like GoogleGuy's interpretation of public responsibility with respect to privacy is that they don't give out personally identifiable information.


Scenario: If you check out your best friend's private collection of revealing photos of you, that he put up in a private, unlinked, special, new directory on his website for a photography-class assignment, Google won't tell anyone that you went to this site.

But after the next update, all the photos might be available in Google's image search, since this isn't forbidden by their privacy policy! Google is not required to inform the public that this might happen, because Google would never tell who the person was that spilled the URL through the toolbar.


What sort of interpretation of privacy is this? Leaving aside the question of whether Lisa's experiment is definitive (I'm sure we can get to the bottom of this with further experiments), I say that GoogleGuy's narrow interpretation of what is permissible within the "personally identifiable" phrasing of any privacy policy is much too narrow, and that such narrowness turns the entire policy into a joke.

GoogleGuy, please get an opinion on this from someone responsible for privacy at Google -- someone willing to go on the public record with a real name. And while you're at it, do something about those ridiculous 36-year cookies! I still haven't heard from your Director of Corporate Communications, even though he said on March 22 that someone would get back to me "shortly."


 3:37 am on May 8, 2002 (gmt 0)

I don't know who is right (Lisa seems to know what she is doing though), but it simply does not make sense for google to crawl documents in this manner for the purpose of listing them on google.


Google doesn't care about "Interesting Pages". They don't work that way. If you are interested in a page - chances are you found it from a website. If that is correct - google will find it on its own.

They want pages with high PR. A page with no links to it has 0 PR.

It is just a waste of resources for them to do this for anything but testing.

It is no different than people submitting their pages to google. It is useless. Google works by PR+IR. If you have no PR - then you have nothing as far as they are concerned.

I have always loved google, but their privacy policy basically says "we collect everything we can about your searching history, but won't give it to anyone else without a court order."

Therefore - there is no privacy with google itself - just third parties.

Google probably has one of the top 10 most valuable databases of information in the world. This is one of the reasons I valued them so high for their IPO.

I am not doubting what Lisa says, just doubting that it is being used to produce a listing for SERP. It defies logic.


 3:43 am on May 8, 2002 (gmt 0)

Lisa is right - there is no doubt he works for Google. Even if I didn't know Brett checks this stuff - GG has given us info that would only be known to an employee (like they were going to put a message about popups on their main screen - which they did).

I would think there is enough info for them to prove where this came from if Lisa is wrong. Just a guess on that though.

Interesting nonetheless.


 4:21 am on May 8, 2002 (gmt 0)

It takes many pages to see a pattern and to calculate PR. That is why they would download it. They ARE interested in every part. The sum of all parts makes their database.

I am not concerned with it showing up in SERPs. With a natural PR of 1 and no links to it, it would not show up in regular searches. But inside my page I list lots of valuable information (not that they could get past my 401). Well, if you search for item 3 and item 489 listed on my secret page then maybe it is the only URL on the Internet that has a cross pollination of both those two or three keywords. So in less common searches it would show up (had it not been 401 protected).

I am concerned since this action is not mentioned in the privacy policy. Other newbie webmasters or employees with access to websites may expose valuable information. Google should protect itself by mentioning that pages will be visited by a GoogleBot.

I encourage you other webmasters to test this out. Make a webpage, turn the advanced toolbar on, and only visit the page by typing the address in your address bar. Place a log on that one page. Wait until the next crawl and see what happens. Make sure not to link to this page, make sure the page is not a default document, make sure not to tell anyone the location, and make sure not to link to outside URLs.
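
For anyone running this test, checking the log by hand gets tedious, so here is a minimal sketch of a script that scans an access log for requests to the trap page that did not come from your own IP. It assumes the common "combined" log format, the same layout as the log entry quoted at the top of this thread. The path /trap/, the IP addresses, and the sample lines are all made up for illustration.

```python
import re

# Combined log format: host ident user [time] "request" status bytes "referer" "agent"
LOG_RE = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) \S+ "(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

# Hypothetical sample data; 192.0.2.10 stands in for the webmaster's own IP.
SAMPLE_LOG = [
    '192.0.2.10 - - [05/May/2002:10:00:00 -0700] "GET /trap/ HTTP/1.1" 401 46 '
    '"-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)"',
    '203.0.113.5 - - [06/May/2002:23:51:06 -0700] "GET /trap/ HTTP/1.0" 401 46 '
    '"-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"',
    '203.0.113.5 - - [06/May/2002:23:52:00 -0700] "GET /index.html HTTP/1.0" 200 512 '
    '"-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"',
]


def find_trap_hits(lines, trap_path, own_ips):
    """Return requests for trap_path that did not come from one of own_ips."""
    hits = []
    for line in lines:
        match = LOG_RE.match(line)
        if match is None:
            continue  # skip lines that are not in combined log format
        if match.group("path") != trap_path:
            continue  # some other page, not the trap
        if match.group("host") in own_ips:
            continue  # our own visit, not interesting
        hits.append(match.groupdict())
    return hits
```

Calling find_trap_hits(SAMPLE_LOG, "/trap/", {"192.0.2.10"}) returns only the second sample line, with its time, status and user agent available as dictionary keys. A user agent string can be faked, so a hit like this shows that something claiming to be Googlebot requested the page, and the source IP should be checked as well.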


 4:40 am on May 8, 2002 (gmt 0)

I am working on a new site. It has a "coming soon" page which another site has linked to. I have been working on internal pages for several weeks now with the Google toolbar (ver. 1.1.54-deleon) installed and active. I get to internal pages through a Favorites link and by typing page addresses into the address bar. During that time, Googlebot has been to the site but has never requested any pages other than the home page. I tend to believe that Googlebot only gets to pages that are linked to from somewhere.


 4:41 am on May 8, 2002 (gmt 0)

Lisa: I agree with you that GoogleGuy is from google.com. However, tofu has a point in that I have no idea who GoogleGuy is. Apparently he writes code.

Are we supposed to believe that in a corporation of 300 employees, the managers who set policy also write code? If the answer is no, then are we supposed to believe that management tells their software engineers everything? Does a software engineer writing for one application know what's happening with the software engineers writing for a different application?

The fact that GoogleGuy is indeed a Googler, and he writes code, is not completely satisfying. The whole issue of anonymity bothers me. I fail to understand why Google doesn't have an official ombudsman on staff who is not anonymous, and who has the authority to go around to anyone at Google and dig out answers, and publish what he knows on Webmasterworld or anywhere else, without fear of immediate retribution, according to the terms of his contract. The Washington Post has one. Who can claim that Google is so much less important than the Washington Post these days, in terms of international impact? That's the bottom-line problem with Webmasterworld and an unidentified GoogleGuy. (I don't like the Washington Post; I'm just saying that an ombudsman is not a bad idea in principle, and would be a lot better than an unidentifiable GoogleGuy.)

This issue that Lisa brought up is a perfect example of how a Google ombudsman could function in a very helpful way without revealing trade secrets. The issue I addressed to Google's Director of Corporate Communications is another example.


Chris_R: I agree that it doesn't make sense. But it only doesn't make sense within the narrow confines of what we presume Google is all about, and within our understanding of the role of PageRank. What if Google has ulterior motives?

Privacy is a more fundamental ethic than PageRank. (Actually, PageRank is not an ethic at all -- it's an excuse for not having anything better.) If I'm making a claim based on notions of privacy, then yes, you can say that it makes no sense for Google to be doing this in terms of PageRank. But the first question must be, "Is Google indeed doing this?" The question that follows must be, "If so, and if not for reasons of PageRank, then why are they doing this?"

In other words, the facts about what's happening are most important. Next, the justification from GoogleGuy is important. The least important thing is why this is smart or stupid in terms of PageRank.

For all we know, PageRank could be a CIA cover story! (Hey, I'm just kidding, tofu!)

The problem is this: Librarians have been around for over 100 years. Journalists have been around for over 100 years. Over the course of decades, a consensus of public accountability evolves in professions such as these. Lobbying organizations emerge to do things like protect First Amendment rights. Chinese Walls are erected between news departments and op-ed departments, and advertising is clearly marked as such. Conflicts of interest are fair game for exposure. By now, in these 100-year old professions, everyone has a sense of where the line is drawn in terms of the public interest.

Then along comes the Internet, at warp speed. In four years, Google owns the most important database on earth. And here we're all sitting at our keyboards, begging for scraps of information from some anonymous "GoogleGuy."

It shouldn't be that way; the public interest deserves more respect than that. It's no one's fault, and I'm not accusing Google of doing anything bad. It's just that it all happened so fast that the public sector hasn't had the time it needs to assert itself.

Another perfect example of this is Microsoft. They aren't self-consciously "bad," it could be argued. It just evolved that way because it happened so fast.

Google became important much faster than Microsoft. That means we have to be more alert about what Google is doing. Everyone can see by now where we failed with Microsoft.


 4:49 am on May 8, 2002 (gmt 0)

"Google doesn't care about "Interesting Pages". They don't work that way. If you are interested in a page - chances are you found it from a website. If that is correct - google will find it on its own."

Perhaps Google will follow the toolbar to pages with the intent of counting links rather than actually indexing the page. It does make sense to an extent to gather this data from the point of view of giving the user what they want. One question... did the page location_that_I_made_to_test_googlebot actually end up in the index, or was it just the recent crawl? Search results...
Your search - location_that_I_made_to_test_googlebot - did not match any documents.
No pages were found containing "location_that_i_made_to_test_googlebot".

very interesting find Lisa :)


 5:00 am on May 8, 2002 (gmt 0)

The URL www.mydomain.com/location_that_I_made_to_test_googlebot/
was crawled May 6th at 11:51 PM

That was not the actual page name or path. I have omitted it for security reasons. Every time someone visits that page I receive an email if it is not coming from my IP. The page would not be indexed because it is password protected. And if it was indexed (impossible because it is protected) it would not show up in the SERPs for like another month. Remember they still have to index and calculate all those pages.


 5:01 am on May 8, 2002 (gmt 0)

I agree with Chris_R.

I've been using the advanced toolbar to navigate through private webpages since it was made available for download and have never had a Google IP hit any one of them. Even links that contain the word google in the URL.

Maybe your ISP or perhaps some spyware already installed on your computer is maintaining logs on your Internet activity and saving them offsite on some server somewhere (cia.gov?) and Google just happened to run across one of them.


 5:19 am on May 8, 2002 (gmt 0)

:) I am very sure it is them!

For one, I am my own ISP.
Two, I monitor very closely. No Spyware makes it past me. But I allow the GG-toolbar because I value it. :)
Three, my server farm with that website is located in the same building as me.
Four, My personal traffic to and from this website doesn't leave my building.


 5:39 am on May 8, 2002 (gmt 0)

heheh you have awoken the beast. Truth or Dare? Let me know.

Advanced install = google gets to sn00py on you via the toolbar to some degree was my impression.



 7:11 am on May 8, 2002 (gmt 0)

Lisa, if you let me know the url, I'll check it out. I sent you a stickymail where to send it if you're willing..

Best wishes,


 7:23 am on May 8, 2002 (gmt 0)

Just a quick note about "security through obscurity." If you can type a URL into a browser and get back a page, it's not wise to assume that page is secret or private from anybody. That info can leak in many, many ways. Besides the typical cases above, other people can use your computer, you could have viruses, other people can sniff your LAN connection, you might be connecting through a school or ISP computer that's been hacked, someone could watch over your shoulder as you type a url, etc. It's pretty much agreed in computer science that "security through obscurity" is better than nothing, but it's not perfect.

I think most people know all this, but I just wanted to emphasize it. If you want to keep something private on the web, .htaccess and passwords are your friends. If you want to keep something out of Google (or any other search engine), robots.txt and meta tags are your friends. If someone can type a url into a browser and find your page, don't count on a secret url remaining secret. Use passwords or robots.txt to protect data.

That's my public service announcement for the day. :)
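
To illustrate the robots.txt half of that advice, here is a small sketch using Python's standard urllib.robotparser to check whether a well-behaved crawler would be allowed to fetch a given URL. The robots.txt content, the /private/ directory and the example.com URLs are made up for this example. Note that robots.txt is only a request that polite crawlers honor; it does not stop anyone from fetching the page, which is why passwords are still needed for anything truly private.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt asking all crawlers to stay out of /private/.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

With the rules above, is_allowed(ROBOTS_TXT, "Googlebot", "http://www.example.com/private/page.html") is False, while the same call for "http://www.example.com/index.html" is True.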


 7:40 am on May 8, 2002 (gmt 0)

I think there are more questions to be asked about GoogleGuy's so-called identity. We are told he writes code but just look back through all his postings - there are few mistakes in his grammar and, this is the clincher - HE CAN SPELL!

I have worked as a coder for about 18 years and, believe me, I have never come across a fellow programmer who can spell even easy words, let alone some of the complicated ones GG has used.

He is an obvious charlatan and exhibits an education which disbars him from the noble art of coding. ;)


 7:44 am on May 8, 2002 (gmt 0)

you visit let's say www.test.de.
once your browser (you use msie with the google toolbar installed on it,
advanced features on) sent the request to www.test.de, it also sends
a request to google, including your google toolbar user id and the visited
page's address:

GET /search?client=navclient-auto&ch=52387311422&q=info:http%3A%2F%2Fwww%2Etest%2Ede%2F HTTP/1.1
Accept: */*
Accept-Encoding: gzip, deflate
User-Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)
Host: www.google.com
Cookie: PREF=ID=...MyGoogleToolbarUserDataGoesHere...
Connection: keep-alive

google gets the data and responds, then closes the connection.

HTTP/1.0 200 OK
Server: GWS/2.0
Date: Wed, 08 May 2002 07:30:40 GMT
Content-Encoding: gzip
Content-Type: text/xml
Cache-Control: private
Age: 0
Connection: close
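
To make it easier to see what the request above actually reports, here is a short sketch that pulls the visited address back out of the request line. The q parameter is simply the visited URL, percent-encoded and prefixed with info:. The function name reported_url is made up for this example; the request line is the one captured above.

```python
from urllib.parse import parse_qs, urlsplit

# Request line copied from the capture above.
REQUEST_LINE = (
    "GET /search?client=navclient-auto&ch=52387311422"
    "&q=info:http%3A%2F%2Fwww%2Etest%2Ede%2F HTTP/1.1"
)


def reported_url(request_line):
    """Return the page address carried in the q=info:... parameter, or None."""
    _method, target, _version = request_line.split(" ", 2)
    query = parse_qs(urlsplit(target).query)  # parse_qs also percent-decodes
    q_value = query.get("q", [""])[0]
    prefix = "info:"
    if not q_value.startswith(prefix):
        return None
    return q_value[len(prefix):]
```

For the captured request, reported_url(REQUEST_LINE) gives "http://www.test.de/", i.e. the full address of the page being viewed. This only shows what is sent to google; it says nothing about what is done with the data afterwards.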

what would google need that data for, if not to become
aware of brand-new sites as fast as possible?
this of course only happens if one loads documents that are not located
on the computer itself; documents on your hd, i suppose, would be
considered private *G*

i don't like the msie that much anymore anyway, i rather use opera now,
much more secure.
about the privacy stuff... i don't care either, if there is a docu that
i don't want to be spidered i set up a robots.txt file
btw, with m$, spyware and other apps, is there any privacy left on the net? ;)


 7:46 am on May 8, 2002 (gmt 0)

GoogleGuy is the real deal, no doubt about that. His magic touch released one of my sites from 0 rank hell. The others are still in jail pending appeal.

I can understand people's fear about private areas being indexed - so why don't people password protect them and make sure in the robots file Google doesn't crawl the directory?

In some ways this WOULD be a nifty feature. I am about to launch a new site of mine that's taken bloody weeks to finish and I would love it if I can get the bugger into the next Google update... So I will visit the URL a few times with my advanced toolbar activated before Friday ;)


 8:27 am on May 8, 2002 (gmt 0)

Thanks GoogleGuy,
My email is off with my private info; I will patiently await your answer. I know it could take a few weeks to dig through all that data and confirm my findings. :) You got a big mountain there.

To the others,
Personally I find robots.txt a big red flag for evil bots. It says, Hey Bot! Come look at what I don't want you to look at. It is right here. I hate spamcrawlers. But for real search engines it can be useful for sales prices and other data you don't want cached in an engine. Things you want to have change based on the user or the special. Never hide things with robots.txt. Hide with a 401 status code and prompt for username and password.
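
As a rough illustration of the 401 approach, here is a minimal sketch of a page protected by HTTP Basic authentication, using only Python's standard library. On a real site this would normally be done in the web server configuration (for example with .htaccess) and over an encrypted connection; the username, password, realm and port below are placeholders invented for this example.

```python
import base64
import hmac
from http.server import BaseHTTPRequestHandler, HTTPServer

# Placeholder credentials for illustration only.
USERNAME = "lisa"
PASSWORD = "change-me"


def is_authorized(header_value):
    """Check an Authorization header value against the placeholder credentials."""
    if not header_value or not header_value.startswith("Basic "):
        return False
    try:
        decoded = base64.b64decode(header_value[6:], validate=True).decode("utf-8")
    except ValueError:  # covers bad base64 and bad UTF-8
        return False
    user, sep, password = decoded.partition(":")
    if not sep:
        return False
    # compare_digest avoids leaking information through timing differences.
    user_ok = hmac.compare_digest(user.encode(), USERNAME.encode())
    pass_ok = hmac.compare_digest(password.encode(), PASSWORD.encode())
    return user_ok and pass_ok


class ProtectedHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if is_authorized(self.headers.get("Authorization")):
            body = b"secret page\n"
            self.send_response(200)
        else:
            # Anyone without credentials, crawlers included, gets a 401.
            body = b"authentication required\n"
            self.send_response(401)
            self.send_header("WWW-Authenticate", 'Basic realm="private"')
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8000):
    """Run the demo server on localhost until interrupted."""
    HTTPServer(("127.0.0.1", port), ProtectedHandler).serve_forever()
```

Calling serve() and then requesting the page without credentials produces the same kind of 401 response that shows up in the log entry at the top of this thread: the request is logged, but the content is never sent.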


 8:47 am on May 8, 2002 (gmt 0)

The fact that GoogleGuy is indeed a Googler, and he writes code, is not completely satisfying. The whole issue of anonymity bothers me. I fail to understand why Google doesn't have an official ombudsman on staff who is not anonymous

if i remember correctly when GoogleGuy first showed up he announced clearly that although he worked for google, he was visiting wmw in an "unofficial" capacity.

it cannot be another way - even though he is clearly sanctioned by Google - unless he was to only post text that had been carefully thought through and given official clearance at the plex, because posting in a forum is too open to misinterpretation and if it was official it could be quoted all over the net as google policy.

i think we should be grateful that google even bothers at all, check out the regular posts from the other search engines (not)

and like all information - even company policy statements - it's up to the reader to make a judgement on the truth/agenda attached to it.

re. anonymity. i'm happy to be completely anonymous here, as are others, it doesn't devalue my/their contributions, you must judge them for yourself as to their worth, others are not anonymous (possibly) and equally their posts are not extra valid per se and one should judge their posts equally.


 11:03 am on May 8, 2002 (gmt 0)

>Never hide things with robots.txt. Hide with a 401 status code and prompt for username and password.

Absolutely right, Lisa.

I've been waiting for this thread for some time now; plenty of people here have visited addresses with the Toolbar advanced features on to check for when Google starts using the information. GoogleGuy's direct response makes me suspect strongly that there was some other way of finding the address, but surely they're going to use the data some day.

I take it that your Web stats are password protected and that you haven't given the secret address to other people?

<added>Thanks, Iguana, you brightened up a rainy morning in Scotland:)</added>


All trademarks and copyrights held by respective owners. Member comments are owned by the poster.
WebmasterWorld is a Developer Shed Community owned by Jim Boykin.
© Webmaster World 1996-2014 all rights reserved