|Google IP address simulating browser requests?|
Is cloaking sites for Google dead?
Since 7/10/06 I'm seeing requests from a Google IP address and one or two other large blocks of IP addresses that I won't identify. These queries in my logs look like "played back" or "repeated" Google search query "clicks" from many of the typically available browsers.
The referrer string looks just like one from a browser that has just clicked on a Google search result, BUT, the request is from a Google IP!
Ah you say someone at Google is manually doing searches and looking at pages, well ..
Even more curious is the identical request, the exact same query, referrer string and all, then comes from another IP address block, not associated with Google, BUT, this is within seconds of the original request. Sometimes multiple copies of the query from other IP address blocks appear within seconds.
This query then proceeds to simulate a browser requesting all the pertinent page content (images, frames, etc.), making it look just like Firefox or Internet Explorer requesting the full page.
These "simulated" browser queries coming from a Google IP address and at the same time a non-Google IP address block(s) have continued in my logs almost every day since 7/10 to this day. Of course Google's normal crawling is proceeding on a daily basis.
These queries have keywords in the referrer string that are pertinent to many pages on this particular website. This is why I believe Google is actually repeating (simulating) previously recorded search queries from past Google visitors, who did searches, and then found this site. Many keywords pertinent to this site show up in the referrer strings.
For clarity these (many) queries do not have a typical Googlebot or Mediabot referrer string, the referrer string is typical of an internet surfer clicking on a Google search result. This type of query should not typically come from a Google IP, and magically be followed by a second and even third or fourth identical request from different IPs.
This type of automation, and willingness to cloak referrer strings, also using IP addresses not affiliated with Google, would definitely defeat all typical cloaking schemes.
Frankly I have no problem with these types of queries to this site, there is no cloaking done here, but many of the pages in question do rank well in the SERPS.
Any thoughts? Have I misunderstood my logs?
Just curious, but do you use Google Analytics? I have seen pages that were hit through my analytics tool get hit by Googlebot either the same day or very shortly after. I think they may be using Analytics data to fuel some of the bot activity for some sites.
Please remember the log entries I describe are not from anything that identifies itself as the "Googlebot", nor MediaPartners bot, nor GoogleBot (Adwords bot) etc. These requests are however from a Google IP address, requests that are then duplicated within seconds from one or two other non-Google IP addresses. Then followed by the remaining typical requests for pictures, frames, etc, that a browser would normally make. These requests all look like they were from various web browsers, in one case a very outdated version of Firefox!
Are you sure that these are not hits coming through Google's "Web Accelerator" (i.e. a proxy)?
Yes, that is a good question that I thought about a lot, but the evidence doesn't make sense for an accelerator, but more for a cloaking check.
Why would the same request come through 2, 3, or even 4 sources? Redundancy? Perhaps a path latency determination, but that's a pretty aggressive use of a website's resources if it were applied to the many web users out there.
Would Google go to the trouble to use multiple proxies to achieve this goal? Maybe Google does own the IP's indirectly, but they were major Internet players with large address blocks.
Here's sort of a sample:
IP Google "GET /sample-page.htm HTTP/1.1" 200 7945 "http://www.google.com/search?q=term1+term2&hl=en&sourceid=gd" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)"
IP Player1 "GET /sample-page.htm HTTP/1.1" 200 7945 "http://www.google.com/search?q=term1+term2&hl=en&sourceid=gd" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)"
IP Player2 "GET /sample-page.htm HTTP/1.1" 200 7945 "http://www.google.com/search?q=term1+term2&hl=en&sourceid=gd" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)"
IP Google "GET /sample-page1.jpg HTTP/1.1" 200 7945 "http://www.example.com/sample-page.htm" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)"
IP Google "GET /sample-page2.jpg HTTP/1.1" 200 7945 "http://www.example.com/sample-page.htm" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)"
etc, etc. Requests all within seconds of each other. Looks just like a browser query except a GET is done to the original content multiple times through multiple source IP's with an identical referrer string. That's the part that is very strange.
It's possible I'm seeing someone at Google using or supporting Tor, but I'd call this a buggy Tor producing the multiple redundant requests.
But this is exactly what one would do to look for cloaking using a fairly thorough technique.
So if a site is cloaking and it disappears from the Google index, the site owner could look back through the logs for this pattern. One method would be to find a very unusual referrer string reflecting a complex Google query and then look for multiple copies of the same referrer string, and check out the IP source addresses.
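For anyone who wants to try that log search, here is a rough Python sketch of it. It rests on assumptions that are not from this thread: the log is in the Apache "combined" format with timestamps, and the function name `find_repeated_requests` and the 5-second window are made up for illustration. It only flags requests that share the same path, Google search referrer and user agent but arrive from more than one IP within the window; deciding who owns those IPs is still a manual whois job.

```python
import re
from collections import defaultdict
from datetime import datetime, timedelta

# Assumed layout: Apache "combined" log format (not confirmed by the thread).
LOG_RE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) \S+ "(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)


def find_repeated_requests(lines, window_seconds=5):
    """Return (path, referrer, ips) for Google-referred requests that were
    repeated from more than one IP within window_seconds."""
    groups = defaultdict(list)
    for line in lines:
        m = LOG_RE.match(line)
        if not m:
            continue  # skip lines that do not fit the assumed format
        referrer = m["referrer"]
        # Keep only hits whose referrer looks like a Google search click.
        if "google." not in referrer or "q=" not in referrer:
            continue
        ts = datetime.strptime(m["ts"], "%d/%b/%Y:%H:%M:%S %z")
        groups[(m["path"], referrer, m["agent"])].append((ts, m["ip"]))

    suspicious = []
    window = timedelta(seconds=window_seconds)
    for (path, referrer, _agent), hits in groups.items():
        hits.sort()
        for i, (ts, ip) in enumerate(hits):
            ips = {ip}
            for later_ts, later_ip in hits[i + 1:]:
                if later_ts - ts > window:
                    break
                ips.add(later_ip)
            if len(ips) > 1:
                suspicious.append((path, referrer, sorted(ips)))
                break  # one report per path/referrer/agent group
    return suspicious
```

Feeding it `open("access.log")` (a placeholder file name) would list each duplicated request once, with the set of source IPs to look up.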
I just saw a lot of Google search strings from the same IP address and then was very surprised when the "who is" search indicated it was from Google itself. Then I searched for a complex referrer string associated with one of these queries and found other non-Google IP's making the identical request within seconds!
So yes it could be an accelerator, but if it is, it's a wasteful one! Multiple requests through multiple paths and service providers. I do have to look further, I know there is another post here identifying one Google address range used for acceleration. Even more detailed log information would help, but that's not likely in this case.
>>But this is exactly what one would do to look for cloaking using a fairly thorough technique. << 100% agree.
Google accelerator ip is 72.14.192.x
Cloaking has a bad reputation for no real reason and Google shouldn't worry about it too much. Cloaking is mainly used to hide a few footer links that the real user doesn't even want to see because they're off-topic. Some people cloak their sitemap pages just because they don't look nice. This has nothing to do with ranking high.
One sample Google IP generating these requests is 22.214.171.124 (not in the documented accelerator block), then a duplicate request comes from IP's associated with other major communications service providers, which I won't identify.
Regarding cloaking, HMMM, I don't know how many times I've looked at the Google cache of a site to find it has no correlation whatsoever with the actual site content. Cloaking is abused far more than it is used for useful purposes.
If I were a webmaster that was cloaking, I'd be reviewing my logs very carefully looking for evidence of these redundant, unidentified, requests from Google, and then investigate another means of content stuffing, etc.
I think the party may be over soon! Maybe Google will let some cloaking slide, who knows?
Or it's a bad bug in Google's accelerator and they are using more IP's that have not yet been identified.
Or Google is using the accelerator for double duty, accelerate and also find cloaking. A simple file diff will find out how severe the cloaking is.
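To make the "simple file diff" idea concrete, here is a minimal sketch using Python's standard `difflib`. It assumes you already have two copies of the same page as strings, one served to a crawler-like request and one served to a browser-like request; how they were fetched is out of scope, and `cloaking_diff` is just an illustrative name. Nothing here is known to be what Google actually does.

```python
import difflib


def cloaking_diff(body_as_crawler, body_as_browser):
    """Compare two versions of a page line by line.

    Returns (ratio, diff_lines): ratio is 1.0 for identical bodies and
    drops toward 0.0 as they diverge; diff_lines is a unified diff.
    """
    a = body_as_crawler.splitlines()
    b = body_as_browser.splitlines()
    ratio = difflib.SequenceMatcher(None, a, b).ratio()
    diff_lines = list(
        difflib.unified_diff(a, b, fromfile="crawler", tofile="browser", lineterm="")
    )
    return ratio, diff_lines
```

A ratio near 1.0 with a short diff would suggest harmless differences (a hidden footer link, say), while a low ratio would mean the two audiences are seeing largely different pages.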
Interesting... Google might well be testing some kind of cloaking-checking mechanism... I wouldn't be surprised...
I have not noticed anything like that on any of my sites - it would be hard to extract that kind of info out of my logs... :c(
BTW, not sure why analytics didn't work for you. It's very fast for me, never had any "slowdowns"... and running really big sites with it (100k uniques a day)...
|Jordo needs a drink|
It's probably Google prefetching. They prefetch from the 64.233 block also. They also prefetch on search results. [google.com ]
Last night, 126.96.36.199 was prefetching from my site via search results. I know it was prefetching because of 2 things...
1. It also started grabbing the links on the target page, including my bot trap link.
2. Because it hit my bot trap, I could see it had a "x-moz: prefetch" in the header. [webaccelerator.google.com ]
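The header check in point 2 can be sketched like this in Python. It assumes your server or bot trap hands you the request headers as a plain dict of name to value; the function name `is_prefetch` is hypothetical, and the only header it looks for is the "x-moz: prefetch" one mentioned above.

```python
def is_prefetch(headers):
    """Return True if the request carries an 'X-moz: prefetch' header.

    Header names are compared case-insensitively, since HTTP header
    names are not case-sensitive.
    """
    for name, value in headers.items():
        if name.lower() == "x-moz" and value.strip().lower() == "prefetch":
            return True
    return False
```

A bot trap could call this before banning an IP, so that prefetched hits are logged or refused rather than treated as a scraper.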
I let users build their own page. It could look like cloaking to Google, but really it is just a rich-type menu.
Hoping this gives us no problems in the future.
I've seen too many spam sites using cloaking rank well not to be overjoyed with the possibility Google might do something serious against it (yes, it's a serious matter). I just hope they navigate like a regular user because I have defenses against people trying to download the whole site, where I exclude search crawlers using user-agents. If they navigate at the same speed they index, that could block them unintentionally.
I agree 188.8.131.52 may well be an accelerator IP, but then other IP's that are not Google IP's make the same request, sometimes in the same second. This pattern is repeated over and over again: a Google IP makes a request, then a non-Google IP makes the same request!
I've noted a bug in Google's accelerator. It misparses IFrame tags and tries to "GET /iframe..." (the entire IFrame tag content is in the request string). This seems to be a random failure on random IFrame tags across several of my sites. Sometimes it parses the same IFrame tag on the same page just fine, and sometimes it does "GET /iframe ...." and gets a 404, while the actual browser goes ahead, parses the IFrame tag correctly, and fetches the page in the "src=" field.
I wrote to the "Accelerator team" about this several days ago but haven't received a reply.
I installed the Google accelerator to debug this, but two problems occurred, so I'm going to uninstall soon!
1. It's mostly slower! I think the overhead of communicating through a proxy defeats the acceleration. (on a busy PC). They do proxy GET's before they do content GETs etc. I've noted many other bugs mentioned in Webmaster World posts.
2. I CAN'T ACCESS WEBMASTER WORLD! Ohh Nooo, Mr. Billlll!