|Geolocation, IP delivery and Cloaking - Google clarifies|
The official Google blog has an interesting post on How Google defines IP delivery, geolocation, and cloaking [googlewebmastercentral.blogspot.com]. Nothing groundbreaking ("Googlebot should see the same content a typical user...would see" is the unsurprising quote du jour).
'First click free' is a bit vague if you ask me. Is that the first click from the visitor, or the first page served after a Google referral? I imagine Google would prefer the latter. Tease-cloaking, I call it ;)
And of course, everyone will have their favourite "but example.com is cheating!" story. The reality is, these sites may be taking a risk and either happen to be undetected, or are significant enough to bend the rules a little. That's life! :)
Here's my favorite quote:
|don't treat Googlebot as if it came from its own separate country—that's cloaking |
It's a sound-bite summary that communicates the point clearly.
Another tidbit shows up in the video from Webmaster Central's Maile Ohye:
|Serve largely the same content per URL |
She goes on to clarify that you can insert some dynamic portions into the page via IP detection, but you should "contain them or limit them to just small areas" of the total page. So Google does not require that the entire page be exactly the same, no matter where the IP address for that request comes from.
Yeah, in contrast to the quote in the post under the cloaking section:
|A program such as md5sum or diff can compute a hash to verify that two different files are identical. |
Of course, the two aren't mutually exclusive, although I think there is an implication there.
--- Cloaking: Serving different content to users than to Googlebot. This is a violation of our webmaster guidelines. If the file that Googlebot sees is not identical to the file that a typical user sees, then ...---
So can I decide who my users are? Some sites have their user base in the US only. Should "users" in the US coming from known colo/hosting IP ranges get the exact same treatment as a user coming from West Africa (by IP) who searched (referral) for a given site's contact info? The 'First Click Free' suggestion does not apply here at all, sorry. Is there a term that applies to white hat cloaking? Oh, I know: site security!
That party-line pile of nonsense has more holes than Swiss cheese.
|Cloaking: Serving different content to users than to Googlebot |
Does that mean you're cloaking whenever Googlebot sees something different than the user sees?
Are sites that do this now at risk?
Also, I've seen those big domain parks show Googlebot ads targeting their own properties, purely for ranking purposes, as opposed to the ads they were showing humans to make money.
Trying to prove they're cloaking those ads just for Google is the trick.
[edited by: incrediBILL at 10:50 pm (utc) on June 3, 2008]
Can't you use something like the <noscript> tag instead of cloaking? I know it's an example, but I'm sure most situations can be dealt with without checking someone's IP address or browser user agent.
|Can't you use something like the <noscript> tag instead of cloaking? |
Then you double the size of your page load, as my navigation is kind of large.
The point is it's Google trying to tell people how to run their websites. They don't own the web, we do, all our sites ARE the web, but they're trying to tell US how to do business online.
That simply rubs me the wrong way.
It has always been my understanding that while the rules are strict, Google does not enforce them uniformly, since there are quite a few legitimate reasons to "cloak", for example non-text content. I don't remember which major site it was, but they were serving audio recordings of shows to their users while serving transcripts of said shows to Google. Makes perfect sense from every angle.
Plus, in my personal opinion, as long as Google is not willing to support something similar to Yahoo's robots-nocontent attempt, it might even be in Google's interest to have the webmaster decide what to show: you don't need your navigation indexed on every single page, especially if you're using Google's Sitemaps program.
Always wondered about this scenario... particularly since there are allegedly no Google Data Centres in Oz (http://www.datacenterknowledge.com/archives/2008/Mar/27/google_data_center_faq.html [datacenterknowledge.com])
We have one customer who has both a .com.au and .com domain name.
They asked us about the following:
•301ing IPs originating in Australia from the .com to the .com.au site, and
•301ing IPs originating outside of Oz from the .com.au to the .com site
If all of Google’s DCs are outside Oz, none of the Googlebots would have an Australian IP address according to our IP geo-targeting database.
So…. Assuming GB never has an Aussie IP, we would never have the .com.au site crawled.
If we don’t serve a 301 to a user agent of ‘bot’, then we could be misinterpreted as cloaking….
Has anyone else run into this scenario?
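For what it's worth, the scenario above can be sketched in a few lines. Everything here is hypothetical: the domain names, the sample IPs, and the `country_for_ip()` lookup, which stands in for a real geo-IP database query (e.g. MaxMind).

```python
from typing import Optional

def country_for_ip(ip: str) -> str:
    # Placeholder lookup table standing in for a real geo-IP database.
    geo_db = {"203.0.113.5": "AU", "198.51.100.7": "US"}
    return geo_db.get(ip, "US")

def redirect_target(host: str, ip: str) -> Optional[str]:
    """Return a 301 Location for this request, or None if no redirect applies."""
    country = country_for_ip(ip)
    if host == "example.com" and country == "AU":
        return "https://example.com.au/"
    if host == "example.com.au" and country != "AU":
        return "https://example.com/"
    return None

print(redirect_target("example.com", "203.0.113.5"))     # Aussie visitor on .com
print(redirect_target("example.com.au", "198.51.100.7")) # US visitor on .com.au
```

And there's the trap: if Googlebot only ever crawls from US IPs, every request it makes to the .com.au gets 301'd to the .com, so the .com.au is never crawled. Special-casing the bot's user agent to skip the redirect is exactly what could be read as cloaking.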
[edited by: tedster at 9:57 pm (utc) on June 10, 2008]
[edit reason] make link clickable [/edit]
|they're trying to tell US how to do business online. |
I do understand the feeling here, Bill, especially these days with Google such a dominant presence. But really, they're only telling us how to do business IF we want their free help. We are always free to build an online business that has nothing to do with Google. We just need a business model that can work.
I *have* to trim output for all bots because they are more than 90% of my traffic load. So I generally give all bots including G a 'lite' version of the page (very similar to the 'lite' mode a user can explicitly request for their session).
Also, the i18n (internationalisation) has to cope with the fact that many bots (including G's when I last checked) don't supply an Accept-Language header. So in that case I try to default to a suitable language variant for where *my server* is, that would be good for geographically-local users.
Thus, when G's (US-based) bots visit my Sydney server they get a slight Aussie accented en-au language variant, an en-us when they visit my US server, en-gb for my UK servers, en-in for my Mumbai server, and zh when visiting the Beijing server. All servers are capable of producing all variants and are normally driven by i18n content negotiation.
But falling back to a language variant local to the *server*, not the visitor/bot, should, IMHO, help geotargeted/local Web SERPs.
So I don't cloak for any kind of deception: I trim page weight for survival, and I have a sensible fallback when i18n content negotiation is not possible.
The upshot is: if you find (say) a page from my AU mirror in (say) Google's AU local search, it will have a Strine accent and be quick to download and view.
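The fallback logic described above can be sketched roughly like this. The supported-variant list and the server default are illustrative values for a hypothetical Sydney server, and the header parsing ignores q-values for brevity (a production implementation should not):

```python
from typing import Optional

SUPPORTED = ["en-au", "en-us", "en-gb", "en-in", "zh"]
SERVER_DEFAULT = "en-au"  # the variant local to *this* server

def pick_variant(accept_language: Optional[str]) -> str:
    if not accept_language:
        # Many bots send no Accept-Language header at all:
        # fall back to the server-local variant.
        return SERVER_DEFAULT
    # Simplified parsing: take language tags in order, ignore q-values.
    for part in accept_language.split(","):
        tag = part.split(";")[0].strip().lower()
        if tag in SUPPORTED:
            return tag
    return SERVER_DEFAULT

print(pick_variant(None))              # a bot with no header gets en-au
print(pick_variant("en-gb,en;q=0.8"))  # a visitor's stated preference wins
```

The design point is simply that the *absence* of the header is itself information: rather than guessing about the visitor, default to where the server sits.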
|But really, they're only telling us how to do business IF we want their free help. |
Not really. The post was full of holes like Swiss cheese, open to interpretation, with nothing concrete in the way of DO THIS and DON'T DO THAT. Make it less vague and I'll have less issue with it.
Besides, if there were an internet consortium of search engines making these rules and Google adopted them, instead of dictating them, I'd have a much easier time accepting it as a consensus of the web as opposed to a big internet bully dictatorship.
I think that G *has* to be a bit vague to give it wiggle room to do what is right for individual sites (and to change its mind), *and* as a defence against black and grey hats pushing things to and beyond reasonable limits.
If it were less vague, then all you would get is 1,000,000 sites all conforming exactly to that spec, which would not help you or them. Webmasters would claim they did everything exactly as Google said but are still not number/page one.
I think they need to encourage webmasters to create the site for their users, but they want to alert us as to the sorts of things that can get us accidentally penalized because people really complain when that happens.
Mattieo's question is interesting. If Google does not have a datacentre in your country, then your country's pages will never be indexed. Maybe they do not have datacentres there, but they do use proxies in those countries (crawling via a normal user agent)?
When I have looked I have seen all G bot requests to all of my servers worldwide coming from the US.
It clearly does not prevent my (non-US) sites being indexed.
|A program such as md5sum or diff can compute a hash to verify that two different files are identical. |
Yes, it can. Theoretically.
But with this, not even **one single bit** may change without a difference being flagged.
How ridiculous is that?
If you merely timestamp your pages ("Current date/time is (date) (time)"), every page yields a different MD5 hash every second.
MD5 and diff are *absolutely* worthless for this and would give tons of false positives if used to find cloaking.
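The false-positive problem is trivial to demonstrate. In this minimal sketch, `render_page()` is just a stand-in for any dynamic template that embeds the current time:

```python
import hashlib
import time

def render_page(body: str) -> str:
    # A page that embeds the current time, as many sites do.
    return f"<html><body>{body}<p>Current date/time is {time.time()}</p></body></html>"

# Two fetches of the "same" page, a moment apart.
page_a = render_page("Hello, world")
time.sleep(0.01)
page_b = render_page("Hello, world")

hash_a = hashlib.md5(page_a.encode()).hexdigest()
hash_b = hashlib.md5(page_b.encode()).hexdigest()

# The substantive content is identical, yet the hashes differ.
print(hash_a == hash_b)  # False
```

A naive hash comparison would flag this page as "cloaked" on every single request pair.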
In theory, there is no difference between theory and practice. But in practice, there is.
What are they thinking over there?
Well, that reminds me of a dialogue between Mandrell and Dr. Who ... but I digress.
[edited by: Romeo at 12:13 pm (utc) on June 4, 2008]
OK, time to test if Google really means it.
Got the heads-up for this just now. Google should know about them, for they're one of the challengers of YouTube.
Try accessing their service from any of the blocked countries to get what I mean.
158. Vatican City <*hahaaa* ...WHY?>
165. South Africa <...end of list.>
Trying to access the www subdomain from an IP associated with the above locations, the site all but denies service, showing only a single, plain white page:
|Example is no longer available in YOURCOUNTRY. |
If you are not in YOURCOUNTRY or you think you have received this message in error, please report the issue below.
(Please enter your email address)
How's THAT for different content to users than to Googlebot? Seeing the recent news [news.yahoo.com] about them, one has to wonder if this is some kind of business model (?), but anyway...
|they're only telling us how to do business IF we want their free help. We are always free to build an online business that has nothing to do with Google. We just need a business model that can work. |
I wonder if a site that's inaccessible from half of the planet deserves their help. I'll be standing by, watching whether Google enforces its policy for good.
Honestly, after reading the forum posts about what's happening, I feel like joining the revolution.
[edited by: engine at 5:02 pm (utc) on June 4, 2008]
[edit reason] No specific sites, thanks [/edit]
In the few spare moments I have, I've been working to develop code for my sites that attempts to optimize the user experience based on IP address, language, country, user agent, and device, as well as through minimal use of optional cookies, in a seamless and transparent way for real users who have a variety of device access choices. It is not easy to do that and make everything work quickly.
It seems to me that search engines with concerns about how the content gets delivered are not being unreasonable. Clearly they would want the page to look just like how users will see things for their own credibility reasons.
Search engine operators: I think that what you want is a fine goal, so start acting like the variety of browsers and devices that users work with. Send out language and country data in headers to test such behavior, and accept cookies and the like, so that your crawler will see what users see.
Site development was once challenging enough when it just meant keeping up with support for multiple browsers (which even today interpret standards in different ways). It is getting far more complex with so many different device display formats, which also have to be considered in good site design these days to maximise the user experience.
While I appreciate at some level the junk that search engines must contend with, it is also important for them to stay up to date on reasonable approaches to site design. It is wrong to immediately conclude that features added to improve the site user experience (and usability) are "really" being done to "game the engine".
Very well said, Commerce - excellent points. I would add, however, that web developers can become focused on user-experience issues to the exclusion of findability concerns. When that happens, their websites confuse search engines and make their job harder.
That's a lot of what these recent communications are about. If we want the help that Google can give our site, then we also need to understand what they wrestle with, what their current focus and limitations are, and so on.
It's all a work in progress, both on the developers' side and on Google's. The more they can tell us about the state of their art, the better things can be.
What about services like Gravity Stream, which serve static HTML pages to Googlebot for large dynamic ecommerce sites?
|What about services like Gravity Stream, which serve static HTML pages to Googlebot for large dynamic ecommerce sites? |
If the content that's delivered to Googlebot is the same as the content you deliver to the users, you should be ok. If the user is able to click on one of the indexed static URLs and see the same content as they would from navigating through dynamic URLs, you should be ok. As long as what you're delivering is in the "spirit" of the actual page, you should pass a quality check. At least, that's what I've been told.... by Matt.
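If you want to sanity-check that "spirit of the page" parity yourself, one rough approach is to normalize away the bits that legitimately vary per request and compare what remains. This is only a sketch; the regex patterns are illustrative, not a definitive test:

```python
import re

def normalize(html: str) -> str:
    # Illustrative only: strip fragments that legitimately vary per request.
    html = re.sub(r"Current date/time is [^<]*", "", html)  # timestamps
    html = re.sub(r">\s+<", "><", html)                     # inter-tag whitespace
    return html.strip()

def largely_the_same(user_html: str, bot_html: str) -> bool:
    """Crude parity check between the user-facing and bot-facing versions."""
    return normalize(user_html) == normalize(bot_html)

user_page = "<html><p>Widgets for sale.</p><p>Current date/time is 2008-06-04 12:13</p></html>"
bot_page  = "<html> <p>Widgets for sale.</p> <p>Current date/time is 2008-06-04 12:14</p> </html>"
print(largely_the_same(user_page, bot_page))  # True
```

Note this is the opposite of the naive hash-the-whole-file check discussed earlier: you decide up front which parts are allowed to differ, then compare the rest.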
Heck, Amazon does conditional IP delivery. They're fine... for now. ;)
The site that I mentioned in my last post ( and which Adobe and Intel invested in just recently ) keeps serving its 'go away' message to the 165+ countries listed above ( go to Wikipedia for full list ). It's been about two weeks since this began.
Regional versions of Google ( e.g. google.com.mx, google.co.in etc. ) still list content that users are not allowed to see. The blocking seems IP-based. Nonetheless, the number of pages listed in regional indexes has increased to some 6 million; that's 2 million+ since the content became completely unavailable.
Clicking the Google results will bring up nothing that the SERPs promise.
Will there be any enforcement of Google policies in this case?
When you look at Google's attempt to clarify IP delivery, after a while the edges do get very fuzzy. The point mattieo made above about Australian websites is just one of several patterns where their simplified statements don't quite fit with what we see, or hope to see, in the real world.
Interestingly enough, I am working with one client who has both a .com and a .com.au and they do some IP based redirecting between the two. And their indexing is a total mess.