Msg#: 4165503 posted 1:27 am on Jul 7, 2010 (gmt 0)
Quick experiment I just did to see when Googlebot crawls a new page on an established domain. The page has no other inlinks and lives in its own new directory.
Post to Twitter (using the whole URL, not bit.ly) from a 600-follower account: immediate hit from Googlebot as well as a Google App Engine bot.
Post to an established but fairly inactive Blogger account with the link in the title: immediate hit from Googlebot.
Post to Facebook from a large account (1000 friends): nothing from Googlebot (not surprising, privacy settings are tight on this account).
Maybe this is obvious stuff to most, but it's pretty fascinating to me (especially the Twitter indexing, but it makes sense as immediate indexing is basically the only way they can add any value to Twitter search).
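For anyone repeating this, it may be worth confirming that a hit really came from Googlebot, since the user-agent string is easy to fake. Below is a minimal sketch of the reverse-then-forward DNS check that Google describes for verifying its crawler. The function name, the injectable lookup parameters, and the exact suffix list are my own assumptions for illustration, not anything from the experiment above.

```python
import socket

# Assumption: hostnames of genuine Googlebot hits end in one of these suffixes.
GOOGLE_SUFFIXES = (".googlebot.com", ".google.com")


def is_verified_googlebot(ip, reverse_lookup=None, forward_lookup=None):
    """Return True if `ip` reverse-resolves to a Google crawler hostname
    and that hostname resolves back to the same IP.

    The two lookup parameters are hypothetical hooks so the check can be
    exercised without network access; by default they use the socket module.
    """
    if reverse_lookup is None:
        # gethostbyaddr returns (hostname, aliaslist, ipaddrlist)
        reverse_lookup = lambda addr: socket.gethostbyaddr(addr)[0]
    if forward_lookup is None:
        # gethostbyname_ex returns (hostname, aliaslist, ipaddrlist); IPv4 only
        forward_lookup = lambda host: socket.gethostbyname_ex(host)[2]

    try:
        host = reverse_lookup(ip).lower().rstrip(".")
        if not host.endswith(GOOGLE_SUFFIXES):
            return False
        # Forward-confirm: the hostname must map back to the original IP.
        return ip in forward_lookup(host)
    except OSError:
        # socket.herror and socket.gaierror are both OSError subclasses.
        return False
```

A hit from something like the App Engine bot would presumably fail this check, which is a handy way to keep the two kinds of visitor apart when reading the logs.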
Msg#: 4165503 posted 3:29 am on Jul 7, 2010 (gmt 0)
You likely told Google about the page long before you posted a link to it from any of those places.
Do you use the Google Toolbar? Does the page/site have AdSense? Did you visit a 3rd party site that has AdSense? Did you visit a 3rd party page that has other Google tools on it? Did you check your email? Do you have any other toolbars, like Alexa, that retrieve history? Does the site have an RSS feed that is being copied by a spammer? Is the RSS being picked up by Google Reader? Does the site have a sitemap generator that added a link and pinged search engines? Does the site ping search engines when changes are made to it? Did you see an ad on any page you visited after posting that was from DoubleClick or used the Google DART cookie? Does your site or any site you visited use tracking? Analytics? Site Meter? Do you use Google Chrome? Many of the Google products don't even need to be "turned on" to gather data, they just store it for future sending. Did you visit YouTube? etc..etc..etc. Do you have a PageRank browser add-on? You get the idea.
The list of ways Google gathers data is MONUMENTAL, making the type of test you're describing near impossible.
Cookies: Does your computer have any of the following cookies right now? google.com? youtube.com? doubleclick.net? blogger.com? picasa.google.com? finance.google.com? base.googlehosted.com? etc...etc.
Log files: Any server request made of a Google property (including on other people's websites) results in data about you being stored in log files and added to their database(s), some of which are permanent backups that will never be deleted. When data is added about you, it's even possible that a crawler is assigned to come look for other changes about all things "you" online.
Google gathers information about you and your browsing habits from ALL of its products and from 3rd party sites that use any of them, even if you don't own them and just visited. They didn't become the biggest search engine without making use of all its data either.
edit: I'd even wager that Google assigns affiliations between pieces of content that don't even mention you. My webmasterworld account, for example, may already be added to my data history. It's very safe to assume Google is that advanced with its data gathering given how many redundant layers of data gathering it uses.
No really, it wouldn't be that hard. The browser tools and such mentioned above are working even now. User:Sgt_Kickaxe was the first to access this reply and also visits Google AdSense account #123456789. Therefore Sgt_Kickaxe IS LIKELY #123456789. Sgt_Kickaxe is ALWAYS first to see replies by Sgt_Kickaxe and the ONLY person to also access #123456789. Therefore Sgt_Kickaxe IS #123456789. Rinse and repeat with other sites/Google products and, well, it's the ultimate gotcha even when you think you're anonymous. Some data like this can't be posted for privacy reasons but Google has it. URLs are child's play in comparison.
Msg#: 4165503 posted 6:36 am on Jul 7, 2010 (gmt 0)
Your post is awesome... and I venture to guess absolutely right.
My only point is that I saw an "if A, then B" correlation with this experiment. Literally watching the logs in real-time, I saw a post to Blogger was *immediately* followed by a GBot access, likewise with the post to Twitter just a few minutes later.
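For anyone who wants to repeat the log watching without staring at a terminal, here is a rough sketch that pulls Googlebot requests for one URL out of an access log and measures the delay after a post. It assumes the server writes Apache/Nginx "combined" log format; the path, timestamps, and function names are hypothetical, and matching on the user-agent string alone does not prove the visitor is really Googlebot.

```python
import re
from datetime import datetime, timezone

# Assumption: "combined" log format, e.g.
# 66.249.66.1 - - [07/Jul/2010:01:27:45 +0000] "GET /new-dir/page.html HTTP/1.1" 200 5120 "-" "...Googlebot/2.1..."
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" (?P<status>\d{3}) \S+ '
    r'"[^"]*" "(?P<agent>[^"]*)"'
)


def googlebot_hits(lines, path):
    """Return (timestamp, ip) for each request to `path` whose
    user agent mentions Googlebot. Unparseable lines are skipped."""
    hits = []
    for line in lines:
        m = LOG_LINE.match(line)
        if not m or m.group("path") != path:
            continue
        if "googlebot" not in m.group("agent").lower():
            continue
        ts = datetime.strptime(m.group("ts"), "%d/%b/%Y:%H:%M:%S %z")
        hits.append((ts, m.group("ip")))
    return hits


def seconds_until_first_hit(hits, posted_at):
    """Seconds between `posted_at` and the first hit at or after it,
    or None if there was no such hit."""
    later = [ts for ts, _ in hits if ts >= posted_at]
    if not later:
        return None
    return (min(later) - posted_at).total_seconds()


if __name__ == "__main__":
    # Hypothetical log lines, for illustration only.
    sample = [
        '66.249.66.1 - - [07/Jul/2010:01:27:45 +0000] "GET /new-dir/page.html HTTP/1.1" '
        '200 5120 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    ]
    posted = datetime(2010, 7, 7, 1, 27, 0, tzinfo=timezone.utc)
    print(seconds_until_first_hit(googlebot_hits(sample, "/new-dir/page.html"), posted))
```

A short delay after each post is still only a correlation, as discussed above, but having the numbers makes it easier to compare Twitter, Blogger, and Facebook side by side.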
Could be helpful to people who are struggling to get their sites indexed (seen a lot of that lately).
Msg#: 4165503 posted 6:59 am on Jul 7, 2010 (gmt 0)
It was a pretty good rant, wasn't it? Not enough people know how massive the data collecting effort is from G.
You're right too: add a few Google products to your browser or site and you'll get ranked much more quickly when you post to other sites that are also providing data to Google. But it's impossible to say which data collection path got Google there first. It is possible to say that Google is SO fast because they have SO MANY ways of gathering data from people like you and me.