
Google SEO News and Discussion Forum

    
Googlebot sending a referer and requesting image, css, js files
lucy24

Msg#: 4405414 posted 12:18 am on Jan 10, 2012 (gmt 0)

Background: There are a couple of earlier threads touching on this subject:

Googlebot getting images with html ? [webmasterworld.com]
Googlebot with Referer [webmasterworld.com]

but neither of them got to anything decisive, so I wanted to bring it up again. Besides, both threads ran about five minutes before I started reading the Forum (early April 2011); I had to search to make sure it wasn't another of those things everyone but me has always known.

* * *
I discovered this phenomenon while-- stop me if you've heard this one-- looking up something else. The Googlebot has got a sideline in image-harvesting... and it's doing it with the html page as referer, exactly like a human. The regular Googlebot, by that name, from a regular Google IP-- so far, always 66.249. The detour can come smack in the middle of a string of regular Googlebot hits. And then it carries on as if nothing out of the ordinary had happened.

Images are the most obvious, but it also picks up the occasional css file and even js. In fact, looking back, the very earliest Google javascript pickups I can find are by the Googlebot in referer mode.

There's no absolute pattern, but two behaviors I see pretty often. One is when a brand-new image or stylesheet has been added to a pre-existing page; it's as if the Googlebot decides on its own initiative to grab it quick before it gets roboted out. The other is when I've got a cluster of thumbnails on a single page (a pattern that one of those earlier posters also hinted at). Googlebot will then scoop up every last one of them at a rate of up to 2-3 files per second. (This is faster than their usual pace on my site.) But there's no rhyme or reason to which image pickups send a referer; sometimes the "referer" switch is turned off or on halfway through the visit.

I can't help feeling this is for some particular purpose, especially when stylesheets and javascript are thrown into the mix. The obvious thought is Preview-- but that's got its own UA, currently not blocked for anything but javascript. (G### has no business in my piwik files, and the only other javascript of note involves a font that Google doesn't have.)

* * *
Postscript: While re-checking, I found this masked bandit trying to blend in with the scenery.

67.221.235.nn - - [26/Dec/2011:15:55:56 -0800] "GET /games/LucysDownloads.html HTTP/1.1" 200 11941 "http://www.example.com/games/LucysDownloads.html" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"

Far as I can tell, it was a one-off. Never seen the IP before, but the auto-referer is a dead giveaway.
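For anyone who wants to hunt for the same giveaway, here is a minimal sketch in Python. It assumes "auto-referer" means the referer names the very page being requested, as in the log line above. The function name `is_self_referer` is made up for illustration, and the comparison looks at the path only, ignoring host and query string.

```python
from urllib.parse import urlsplit


def is_self_referer(request_line, referer):
    """Return True when the referer points at the same path that is being requested.

    request_line is the quoted request field from the log, e.g.
    'GET /page.html HTTP/1.1'. referer is the quoted referer field.
    Assumption: only the path is compared; host and query string are ignored.
    """
    parts = request_line.split()
    if len(parts) < 2 or referer in ("", "-"):
        return False  # malformed request line, or no referer sent at all
    requested_path = urlsplit(parts[1]).path
    referer_path = urlsplit(referer).path
    return requested_path == referer_path
```

An ordinary image request made with the html page as referer does not trip this check, because the two paths differ.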

[edited by: tedster at 12:24 am (utc) on Jan 10, 2012]
[edit reason] Added the titles for the two threads [/edit]

 

Samizdata

Msg#: 4405414 posted 1:19 am on Jan 10, 2012 (gmt 0)

"Googlebot decides on its own initiative to grab it quick before it gets roboted out"

If it is not "roboted out" then it is up for grabs.

"67.221.235.nn"

That is not a Google IP address, it is a server farm.

Contegix 67.221.235.0 - 67.221.235.255

Either it's a fake Googlebot or the server has an open proxy - block it.

...

lucy24

Msg#: 4405414 posted 4:45 am on Jan 10, 2012 (gmt 0)

"That is not a Google IP address, it is a server farm."

Uhm, yeah, that was my point. Hence the reference to masked bandits trying to blend in.

But that's peripheral. It's the real googlebot-- and its odd behavior-- that I'm interested in.

tedster

Msg#: 4405414 posted 5:14 am on Jan 10, 2012 (gmt 0)

"It's the real googlebot..."

Please explain more about that. Anyone can send any user agent they want, so if the IP address is wrong, then I'd think it is NOT the "real" googlebot, right? I mean, it just doesn't pass the Googlebot verification test [webmasterworld.com].
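For reference, a minimal Python sketch of the usual two-step check: a reverse DNS lookup on the IP, then a forward lookup on the returned hostname to confirm it maps back to the same IP. This is not taken from the linked thread. The function name, the injectable resolver parameters, and the accepted hostname suffixes (`googlebot.com` and `google.com`) are assumptions made for illustration.

```python
import socket

# Assumption: genuine Googlebot hosts reverse-resolve under these domains.
GOOGLE_SUFFIXES = (".googlebot.com", ".google.com")


def is_verified_googlebot(ip, reverse_lookup=None, forward_lookup=None):
    """Reverse-resolve the IP, check the domain, then forward-resolve to confirm.

    The two lookup parameters exist only so the check can be exercised
    without network access; by default the system resolver is used.
    """
    if reverse_lookup is None:
        reverse_lookup = lambda addr: socket.gethostbyaddr(addr)[0]
    if forward_lookup is None:
        forward_lookup = lambda host: socket.gethostbyname_ex(host)[2]

    try:
        host = reverse_lookup(ip)
    except OSError:
        return False  # no PTR record, or the lookup failed

    if not host.lower().rstrip(".").endswith(GOOGLE_SUFFIXES):
        return False  # hostname is not under a Google domain

    try:
        # The forward lookup must return the original IP; otherwise the
        # PTR record could have been set by anyone who controls the IP block.
        return ip in forward_lookup(host)
    except OSError:
        return False
```

The user-agent string plays no part in the result, which is the point: anyone can send any user agent they want.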

lucy24

Msg#: 4405414 posted 7:43 am on Jan 10, 2012 (gmt 0)

:: sigh ::

66.249.67.193 - - [15/Dec/2011:17:07:40 -0800] "GET /paintings/catsrats/thumbs/smallsafe.jpg HTTP/1.1" 200 5118 "http://www.example.com/paintings/catsrats.html" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
66.249.67.193 - - [15/Dec/2011:17:07:41 -0800] "GET /paintings/catsrats/thumbs/smallfates.jpg HTTP/1.1" 200 5761 "http://www.example.com/paintings/catsrats.html" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
66.249.67.193 - - [15/Dec/2011:17:07:41 -0800] "GET /paintings/catsrats/thumbs/smallsurvival.jpg HTTP/1.1" 200 6327 "http://www.example.com/paintings/catsrats.html" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
66.249.67.193 - - [15/Dec/2011:17:07:42 -0800] "GET /paintings/catsrats/thumbs/smallblacksnow.jpg HTTP/1.1" 200 3009 "http://www.example.com/paintings/catsrats.html" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
66.249.67.193 - - [15/Dec/2011:17:07:42 -0800] "GET /paintings/catsrats/thumbs/smallwindow.jpg HTTP/1.1" 200 4418 "http://www.example.com/paintings/catsrats.html" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
66.249.67.193 - - [15/Dec/2011:17:07:43 -0800] "GET /paintings/catsrats/thumbs/smallinteract.jpg HTTP/1.1" 200 5373 "http://www.example.com/paintings/catsrats.html" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
66.249.67.193 - - [15/Dec/2011:17:07:43 -0800] "GET /paintings/catsrats/thumbs/smallrules.jpg HTTP/1.1" 200 4228 "http://www.example.com/paintings/catsrats.html" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
66.249.67.193 - - [15/Dec/2011:17:07:44 -0800] "GET /paintings/catsrats/thumbs/smallcateyes.jpg HTTP/1.1" 200 3942 "http://www.example.com/paintings/catsrats.html" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
66.249.67.193 - - [15/Dec/2011:17:07:44 -0800] "GET /paintings/catsrats/thumbs/smallatticview.jpg HTTP/1.1" 200 4133 "http://www.example.com/paintings/catsrats.html" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
66.249.67.193 - - [15/Dec/2011:17:07:44 -0800] "GET /paintings/catsrats/thumbs/smallgreenwindow.jpg HTTP/1.1" 200 4419 "http://www.example.com/paintings/catsrats.html" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
66.249.67.193 - - [15/Dec/2011:17:07:45 -0800] "GET /paintings/catsrats/thumbs/smalldixierat.jpg HTTP/1.1" 200 3731 "http://www.example.com/paintings/catsrats.html" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
66.249.67.193 - - [15/Dec/2011:17:07:45 -0800] "GET /paintings/catsrats/thumbs/smallformerly.jpg HTTP/1.1" 200 4140 "http://www.example.com/paintings/catsrats.html" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
66.249.67.193 - - [15/Dec/2011:17:07:46 -0800] "GET /paintings/catsrats/thumbs/smalleducationaltv.jpg HTTP/1.1" 200 3556 "http://www.example.com/paintings/catsrats.html" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
66.249.67.193 - - [15/Dec/2011:17:07:46 -0800] "GET /paintings/catsrats/thumbs/smallbridge.jpg HTTP/1.1" 200 5062 "http://www.example.com/paintings/catsrats.html" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
66.249.67.193 - - [15/Dec/2011:17:07:47 -0800] "GET /paintings/catsrats/thumbs/smalldixietank.jpg HTTP/1.1" 200 4648 "http://www.example.com/paintings/catsrats.html" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
66.249.67.193 - - [15/Dec/2011:17:07:47 -0800] "GET /paintings/catsrats/thumbs/smalldixiecage.jpg HTTP/1.1" 200 4986 "http://www.example.com/paintings/catsrats.html" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
66.249.67.193 - - [15/Dec/2011:17:07:48 -0800] "GET /paintings/catsrats/thumbs/smalldixienoses.jpg HTTP/1.1" 200 4487 "http://www.example.com/paintings/catsrats.html" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
66.249.67.193 - - [15/Dec/2011:17:07:48 -0800] "GET /paintings/catsrats/thumbs/smallpusiruluk.jpg HTTP/1.1" 200 5290 "http://www.example.com/paintings/catsrats.html" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
66.249.67.193 - - [15/Dec/2011:17:07:49 -0800] "GET /paintings/catsrats/thumbs/smallsnowwindow.jpg HTTP/1.1" 200 4061 "http://www.example.com/paintings/catsrats.html" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

I changed, ahem, one word. Nineteen times.

Now, if Whois is participating in some kind of a scam and 66.249.64-95 isn't a bona fide g### IP, I am not the only one who would like to hear about it.
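A quick way to test an address against that range, sketched in Python. It assumes "66.249.64-95" means every address from 66.249.64.0 through 66.249.95.255, which is exactly the CIDR block 66.249.64.0/19. The constant and function names are made up for illustration, and nothing here says who owns the block; that still takes a whois or DNS lookup.

```python
import ipaddress

# Assumption: "66.249.64-95" is read as 66.249.64.0 - 66.249.95.255,
# which is the same set of addresses as 66.249.64.0/19.
GOOGLEBOT_BLOCK = ipaddress.ip_network("66.249.64.0/19")


def in_googlebot_block(ip):
    """Return True when the address falls inside the range quoted above."""
    return ipaddress.ip_address(ip) in GOOGLEBOT_BLOCK
```

The address in the nineteen log lines above falls inside the block; anything in 67.221.235.x falls outside it.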

g1smd

Msg#: 4405414 posted 8:32 am on Jan 10, 2012 (gmt 0)

Most people scan rather than read threads, so missed the 66.249 reference early in your first post.

Off to take a closer look at some log files...

enigma1

Msg#: 4405414 posted 2:48 pm on Jan 10, 2012 (gmt 0)

They're testing many more things, as far as I understand from my server logs. The information they gather could be used to determine whether the site cloaks content, how fast the server responds on side resources, also to feed instant preview, etc.

What is your concern if they access images setting up referrers? If it's image stealing you care about, you could watermark your images and let it have them.

tedster

Msg#: 4405414 posted 4:47 pm on Jan 10, 2012 (gmt 0)

Yesterday the whois lookup I used did not report this IP block as Google - I got the same results that Samizdata reported. However, today I am seeing Google ownership being reported.

lucy24

Msg#: 4405414 posted 11:59 pm on Jan 10, 2012 (gmt 0)

"Yesterday the whois lookup I used did not report this IP block as Google - I got the same results that Samizdata reported. However, today I am seeing Google ownership being reported."

Were you looking up the same IP? The initial confusion was because of the non-Google 67.221.235.nn alongside the Google 66.249.67.nnn. I block by IP but ignore by UA, so I would never have caught that particular spoofer if I hadn't been specifically looking for the pattern:

\w" ".*?Google\w
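That pattern matches a referer field ending in a word character (so not the bare "-" of a request without a referer), immediately followed by a user-agent field that contains "Google" plus a letter. The same search can be sketched in Python by splitting each line into fields first. This assumes the combined log format seen in the log lines quoted in this thread; the names `LOG_LINE` and `google_hits_with_referer` are made up for illustration. Unlike the one-line regex, this version counts any referer other than "-" or an empty string, including one that ends in a slash.

```python
import re

# Combined log format: ip, identd, user, [time], "request", status, size,
# "referer", "user-agent". Assumption: no field contains an escaped quote.
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"$'
)


def google_hits_with_referer(lines):
    """Return the parsed fields of every request that sent a referer
    and whose user agent contains 'Google' followed by a word character."""
    hits = []
    for line in lines:
        match = LOG_LINE.match(line.strip())
        if match is None:
            continue  # not a combined-format line; skip it
        if match.group("referer") in ("", "-"):
            continue  # no referer sent
        if re.search(r"Google\w", match.group("agent")):
            hits.append(match.groupdict())
    return hits
```

Because the filter looks only at the user agent, it catches spoofers as well as the real thing; the `ip` field of each hit still has to be checked separately.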

I don't think "concern" is the right word. It's more of a "wtf?" Cloaking is one very reasonable explanation, since you could easily have your site set up to handle image requests differently depending on whether the request comes with the appropriate referer (most humans) or without (most robots). I don't-- but I do treat them differently in log wrangling.

And I honestly never knew they were doing the referer thing until, as noted above, I found it while looking for something else. If I deliberately look for it, I find it as far back as last April-- the oldest raw logs I've got access to.

In a way the css request is the most interesting, because afaik there's no "css search" function in google. At least not yet ;) So it would be used in Preview-- and, again, in questions about cloaking. Maybe "div class = 'header'" comes with {display: none}. Or the page has a background image that shows additional text.

There is no way they can not know that piwik is analytics. They're google. They know everything. So it would be trivial to tell all their robots-- including non-robots like Preview and Translate-- to keep the ### out. Then again, I could put all my cloaking into javascript and save it in a file that uses the name of a known analytics program.

Idle thought: When they get a new idea, do they first try it out on small sites to work out the bugs and see if it's worth following up, before cutting loose on the big sites where it might really make a difference?

tedster

Msg#: 4405414 posted 12:27 am on Jan 11, 2012 (gmt 0)

I agree that seeing a referer is a surprise - although I haven't specifically looked for it in quite a while. Given the time/date stamps this looks like an automated version of googlebot that crawls very much like a human user would.

Does this user-agent/IP request a new HTML page? If so, does it include a referer in that request?

lucy24

Msg#: 4405414 posted 2:39 am on Jan 11, 2012 (gmt 0)

Do you mean a new page that didn't previously exist, or just a new request? When the googlebot picks up images or css-- whether or not there's a referer-- it always starts with the html. No referer for the page, ever.

Incidentally I have just-- don't everyone yawn at once-- come upon an MSNbot doing the same thing. From the ordinary bing/msn 207.46 range, using one of those MSNbots that wear street clothes with too many spaces. I've seen it before trying to get at piwik files but this is the first time I've observed it going for a css file. Quick detour to raw logs tells me that it, too, has been doing it since at least April, though only with css files. Again, that's with a referer, always coming immediately after a request for the html itself.

lucy24

Msg#: 4405414 posted 5:43 am on Jan 12, 2012 (gmt 0)

Continuing on the theme of google and referers, with a side of Things You Only Notice If You're Small:

Midway through the 5th (my time, which happens to correspond to google's time zone) Preview suddenly started sending a referer. For the initial html, that is. Previously it was - blank; now it's http://www.google.com/search without parameters.

Unlike the image-harvesting googlebot, which called for Regular Expression searches through my raw logs, this one is visible to the naked eye-- especially if, like me, you color-code your processed logs ;)

Two specific pages happen to show up both "before" and "after". Neither of them was changed during that time period, which suggests google doesn't bother to cache Previews for my site, but just runs them up on the fly. Or, alternatively, they're only cached for a few hours. I know this is the case with Translate, which often behaves similarly to Preview.

So now I officially know what I already knew: that a Preview is the result of a search. But not what I'd really like to know: what the person was searching for.

Pfui

Msg#: 4405414 posted 11:20 pm on Feb 3, 2012 (gmt 0)

See also: Google Web Preview hits show stock referrer [webmasterworld.com]

leadegroot

Msg#: 4405414 posted 12:44 pm on Feb 4, 2012 (gmt 0)

I have seen extremely occasional referrers from Google over, oh, perhaps the last 12 months, perhaps a little more.
100,000 pageviews a day, so I just happened to grab the odd occurrence.
Haven't seen (or at least noticed) the fast-grab-thumbnail thing though.
