Google favicon


keyplyr

10:18 am on Jul 24, 2015 (gmt 0)

First I've seen of this so apologies if this UA has been around awhile. Coming from various valid Google IPs:

UA: Mozilla/5.0 (Windows NT 6.1; rv:6.0) Gecko/20110814 Firefox/6.0 Google favicon

Requests the page, then the favicon.ico. It was very busy on my server.

keyplyr

7:57 pm on Jul 26, 2015 (gmt 0)

Per the title of this thread, that's the only UA I've seen.

We had this discussion a year or two ago, and this was the only UA I saw then.


lucy24

7:32 pm on Aug 9, 2015 (gmt 0)

:: bump ::

Follow-up query: Has anyone else seen this new faviconbot requesting pages other than the front page?

On my personal site, its visits are divided among the front page and two interior pages, mostly /games/online.html but sometimes /games/palace/ This makes me suspect that it has something to do with links from other sites, or maybe some type of app?

Eventually I will start Ignoring it, but currently I'm a little bit curious.

keyplyr

8:44 pm on Aug 9, 2015 (gmt 0)

Don't all bot requests have "something to do with links from other sites?"

Pfui

9:30 pm on Aug 9, 2015 (gmt 0)

lucy: I've only seen hits to root and the custom error page "Google favicon" is routed to (in case the results end up somewhere a real person might see them). What surprises me is how often it comes by -- maybe eight to 10 times every four hours; always from google-proxy addresses:

google-proxy-66-249-84-155.google.com
google-proxy-66-249-84-141.google.com
google-proxy-66-249-83-132.google.com
Etc.
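A minimal sketch of how those hostnames could be sanity-checked against the connecting IP. It assumes the naming pattern inferred from the three names listed above (google-proxy-, the four octets joined by hyphens, then .google.com); the function name is hypothetical. It only compares strings, so a real check would first obtain the hostname by reverse DNS and then confirm it with a forward lookup.

```python
import ipaddress
import re

# Pattern inferred from the hostnames quoted in this post,
# e.g. google-proxy-66-249-84-155.google.com (an assumption, not a documented format).
PROXY_HOST = re.compile(
    r"^google-proxy-(\d{1,3})-(\d{1,3})-(\d{1,3})-(\d{1,3})\.google\.com$"
)

def proxy_host_matches_ip(hostname: str, ip: str) -> bool:
    """Return True if hostname is a google-proxy name embedding the given IPv4 address."""
    m = PROXY_HOST.match(hostname.lower().rstrip("."))
    if not m:
        return False
    embedded = ".".join(m.groups())
    try:
        return ipaddress.ip_address(embedded) == ipaddress.ip_address(ip)
    except ValueError:
        # Not a valid address (e.g. an octet above 255).
        return False
```

A name that ends in anything other than .google.com, or whose embedded octets differ from the connecting IP, fails the check.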

Curiously, exasperatingly, "Google favicon" NEVER, EVER asks for favicons of any kind (.ico or .gif).

keyplyr: No, they don't. E.g., every bot/exploit that hits IPs in sequence. I see those on our server all too often, numerically running the web-active IPs in our CIDR.

keyplyr

10:00 pm on Aug 9, 2015 (gmt 0)

Pfui - That's a crawl. IMO much different. Crawls originate from a DC index. Bots that follow links on other sites come from a linear algo. However I don't think we were discussing exploits or vulnerability probes.

I see Google favicon requesting favicon ico, gif, png, etc... but then I have all these, however only the ico can be found in my mark-ups. I have them all just in case something wants them, no matter where they want them from (scripted.) Mobile apps don't seem to follow standards much, but since many will use the favicon in various ways, including a home screen icon, I try to make these all available.

Google favicon requests dozens of pages on my personal site. Each page is followed by a favicon request, or two, or three. Then sometimes I see it request several pages with no favicon request.

Pfui

2:15 am on Aug 10, 2015 (gmt 0)

keyplyr: Sigh. You said "bots", and that's what I reported seeing -- automated, robotic hits, based on IPs, not links. I used the "and/or" slash to include bots and/or exploits because they're all automated but not all IP-sequenced bot hits I see are exploits per se. You subsequently subdivided "bots" into crawlers, and yes, those are obviously different animals because presumably they're crawling a link from somewhere.

(Aside: I've no clue what a "DC index" is, probably not DC comics... Funny, although I think we're saying similar things, you seem to be arguing yet again. Not interested, sorry.)

Okay. Back to --

lucy: The three of us report a range of behaviors from "Google favicon" -- including it not specifically asking for robots.txt on its own. Does anyone know precisely what it does, and why? Something official? I've yet to find anything that explains it. But seeing as how it hails from google-proxy, not googlebot, and is basically a mystery bot, it doesn't get free rein.

keyplyr

3:16 am on Aug 10, 2015 (gmt 0)

DC is internet speak for Data Center

I didn't "subdivide" anything. I said "crawl" as when a bot crawls the web pages on your site. Different than a script sending hits to see where a vulnerability is. Not arguing, just responding. Every time someone posts in my thread, I get a notification.

I guess some terms will always mean different things to different people :)

lucy24

7:21 am on Aug 10, 2015 (gmt 0)

including it not specifically asking for robots.txt on its own

afaik, the faviconbot in its various incarnations has never asked for robots.txt.

:: detour to raw logs ::

The old 74.125 blank-UA never did, at least not within the period I've got logs for.
Neither did the later "Firefox/6.0" version.
Neither has the current one, to date.

I don't think Google is very UA-specific when it comes to requesting robots.txt. In fact the only UA-specific behavior I can remember noticing in them involves the If-Modified-Since header, whose timestamp is always based on the last visit from that exact UA (such as the various mobiles).

Logs also confirm that "Google favicon" showed up abruptly on 23 July (two weeks ago), while the FF6 version last showed its face on ... drumroll ... 22 July.

But in the course of looking this up, I discovered that the FF6 version also asked for /games/online.html. (Asked for but didn't get, since that UA got a generic redirect to an old-browsers page.) But it never did that before its last two days of activity.

Huh. So that's two changes.

Edit: Isn't the faviconbot used for the dropdown of your sites in WMT (Webmaster Tools, now Search Console)? That wouldn't explain the mysterious recent requests for game-related pages, though. Oh, and I almost forgot: one of the two non-front pages it's occasionally been asking for is noindexed. Not roboted-out, just meta noindex. Now, that's interesting.

keyplyr

8:52 am on Aug 10, 2015 (gmt 0)

I think it was mentioned above that the Google favicon bot had not ever asked for favicons. On my site it does, almost always, every other request:

66.249.84.169 - - [09/Aug/2015:02:17:49 -0700] "GET /example.html HTTP/1.1" 200 9351 "-" "Google favicon"
66.249.84.176 - - [09/Aug/2015:02:17:49 -0700] "GET /images/favicon.gif HTTP/1.1" 200 741 "-" "Google favicon"

66.249.84.169 - - [09/Aug/2015:03:24:17 -0700] "GET / HTTP/1.1" 200 8261 "-" "Google favicon"
66.249.84.162 - - [09/Aug/2015:03:24:17 -0700] "GET /favicon.ico HTTP/1.1" 200 741 "-" "Google favicon"

66.249.84.169 - - [09/Aug/2015:03:58:33 -0700] "GET /example.html HTTP/1.1" 200 9351 "-" "Google favicon"
66.249.84.169 - - [09/Aug/2015:03:58:33 -0700] "GET /images4/favicon.ico HTTP/1.1" 200 740 "-" "Google favicon"

66.249.83.178 - - [09/Aug/2015:04:14:12 -0700] "GET / HTTP/1.1" 301 494 "-" "Google favicon"
66.249.83.144 - - [09/Aug/2015:04:14:13 -0700] "GET / HTTP/1.1" 200 8261 "-" "Google favicon"

Probably hits about 30 times daily.
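For anyone wanting to check the "page, then favicon" pattern in their own logs, here is a rough sketch. It assumes the Apache combined log format shown in the lines above, and it treats any file whose name starts with "favicon." as a favicon, which is a heuristic. The function names are made up for illustration.

```python
import re

# Apache "combined" log format, as in the lines quoted above.
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<proto>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\S+) "(?P<referer>[^"]*)" "(?P<ua>[^"]*)"\s*$'
)

def parse(line):
    """Return the log fields as a dict, or None if the line does not match."""
    m = LOG_LINE.match(line)
    return m.groupdict() if m else None

def is_favicon(path):
    # Heuristic: any path whose file name starts with "favicon."
    return path.rsplit("/", 1)[-1].lower().startswith("favicon.")

def pair_requests(lines, ua="Google favicon"):
    """Return a list of (page_path, [favicon_paths]) for hits from the given UA.

    Each favicon request is attached to the most recent page request,
    regardless of which proxy IP made it.
    """
    pairs = []
    for line in lines:
        rec = parse(line)
        if rec is None or rec["ua"] != ua:
            continue
        if is_favicon(rec["path"]):
            if pairs:
                pairs[-1][1].append(rec["path"])
        else:
            pairs.append((rec["path"], []))
    return pairs
```

Pages that end up with an empty favicon list are the "page with no favicon request" cases mentioned earlier in the thread.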

lucy24

8:53 pm on Aug 10, 2015 (gmt 0)

Google favicon bot had not ever asked for favicons

Really? One of its distinguishing features for me is that even when it was blocked (the no-UA version) or redirected (the FF 6 version) it always requested the favicon along with the page. The one entity I can think of that didn't request the favicon, in spite of its name and purpose, was DuckDuckGo's faviconbot back when I was blocking it. (Not by name but because it (a) crawled from AWS and (b) used an auto-referer.)

almost always, every other request

Yes, that's the formula I see too. Page, immediately followed by favicon. Can you think of anything special about the page you've called "/example.html"?

keyplyr

10:10 pm on Aug 10, 2015 (gmt 0)

Nothing special, I may have had most pages hit by now.

lucy24

7:06 pm on Aug 16, 2015 (gmt 0)

Final note. I've just run the logs for my test site, which I only do every 2 weeks.

66.249.84.191 - - [03/Aug/2015:09:26:45 -0700] "GET / HTTP/1.1" 301 557 "-" "Google favicon" 
66.249.84.155 - - [03/Aug/2015:09:26:46 -0700] "GET / HTTP/1.1" 200 904 "-" "Google favicon" 
66.249.84.169 - - [03/Aug/2015:09:26:46 -0700] "GET /favicon.ico HTTP/1.1" 200 590 "-" "Google favicon"

And your point is...?

This is, as I said, my test site. robots.txt says comprehensively

User-Agent: *
Disallow: /

and has done so since Day 1. So it isn't that they're piggybacking on some other UA's request for robots.txt (Googlebot, for example, asks about once a day, and makes no other requests on this site). It's that, like Preview, they don't consider themselves a robot. In fact they have been doing this all along, going back to the blank and FF6 versions. Obviously the WMT dropdown isn't the reason on this specific site, so the favicon must be used for some other purpose as well. But, equally obviously, not for SERPs the way some search engines do.
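To make the point concrete, here is a small sketch using Python's standard urllib.robotparser with exactly the two rules quoted above. It only shows what a robots.txt-respecting client would conclude; it says nothing about how Google's fetcher is actually implemented. The helper name is an assumption.

```python
from urllib.robotparser import RobotFileParser

# The two-line robots.txt from the test site described above.
rules = RobotFileParser()
rules.parse(["User-Agent: *", "Disallow: /"])

def allowed(user_agent, path):
    """True if a client honoring these rules would be permitted to fetch path."""
    return rules.can_fetch(user_agent, path)

print(allowed("Google favicon", "/"))             # a compliant client would not fetch this
print(allowed("Google favicon", "/favicon.ico"))  # nor this
```

Under these rules every path is off limits to every compliant agent, so the logged 200 responses mean the fetcher is not consulting robots.txt at all.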

Funny the things you notice when you're looking from a different angle...

keyplyr

10:28 pm on Aug 16, 2015 (gmt 0)

Funny the things you notice when you're looking from a different angle...

I'd rather not think about it.

Pfui

5:35 pm on Aug 17, 2015 (gmt 0)

So let's see -- its purposes are unknown, it ignores robots.txt, and it hails from google-proxy, etc. Why give this thing free rein? Because it's got Google in its name?

thetrasher

4:20 pm on Aug 20, 2015 (gmt 0)

66.249.93.211 - - [19/Aug/2015:21:*:* +0200] "GET / HTTP/1.1" 404 * "-" "Google favicon"

The request was forwarded for an unknown DSL user (or bot) from Berlin:

X-Forwarded-For: 84.129.61.*

Site is not in the index (always 404, "site:example.com" doesn't match any documents) and has no human visitors.

lucy24

5:56 pm on Aug 20, 2015 (gmt 0)

X-Forwarded-For: 84.129.61.*

Holy ### it would never have occurred to me to look for this. But yes indeed, Google favicon-- both the current one and the former FF6 version-- always includes an X-Forwarded-For header. In one case the X-Forwarded-For was my own IP. Cross-checking browser history and two other sites' logged headers confirms that this was a WMT ("Search Console") visit. So my earlier supposition wasn't wrong, just incomplete.

The header package is basically humanoid, except that there's never an Accept-Language. I don't suppose favicons come in languages, though theoretically they could.

During the short period I looked at, there were an inordinate number of requests from-- that is, forwarded for-- one particular human IP. Again cross-checking raw logs, this is someone who visited my site repeatedly, though never at the precise times of the Google favicon request. (It looks as if they were reading a multi-chapter ebook, hence the repeated visits.) No, they weren't using Chrome; it was Firefox, possibly with the Favicon Reloader extension. (I found some stray favicon requests from their own IP+UA.)

Another was from an IP outside North America that I've got flagged as "WebmasterWorld member" (aa.34.69.242 if you recognize yourself and can shed light). Again, the Google favicon timestamp didn't precisely correspond to any ...

:: lightbulb ::

It's Google Plus, isn't it? There's a section that lists your websites, including the favicon.

dstiles

6:16 pm on Aug 20, 2015 (gmt 0)

There are legitimate reasons for a proxy (even a google proxy!) to ask for favicon. A proxy (in these terms) fetches whatever files, including images, that a proxy-user asks for. The proxy-user is hopefully a web browser of some kind but could easily be a bot, scraper or other nasty. That latter category is what we have to detect and reject, even when coming at us through a proxy. :(

Remember, also, that (eg) Firefox web browsers ASSUME the location of favicon is in the root directory (which was always a bad move) even when a link tag in the page head shows it to be somewhere else.

I think what we're considering in this thread is favicon-only grabbing. The forwarded-for IP in your example isn't entirely unknown and could be legitimate IF accompanied by other file requests (the IP is part of 84.128.0.0/10 Deutsche Telekom AG).
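The range claim can be checked with Python's ipaddress module. Since the last octet of the forwarded-for address is masked in the post, this sketch treats 84.129.61.* as the block 84.129.61.0/24 (that reading is an assumption) and tests whether the block sits inside 84.128.0.0/10. The function name is hypothetical.

```python
import ipaddress

def block_within(block: str, allocation: str) -> bool:
    """True if every address in block falls inside allocation (same IP version)."""
    return ipaddress.ip_network(block).subnet_of(ipaddress.ip_network(allocation))

# 84.129.61.* read as a /24; the final octet is not known from the post.
print(block_within("84.129.61.0/24", "84.128.0.0/10"))
```

This only confirms the arithmetic of the range; who the range is allocated to still has to come from a whois lookup.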

Not being in google's SE does not mean the site isn't on another SE or directory or linked from another web site. The URL could even have been extracted from an email you or someone else sent in one of several ways: G's mail server is known to extract URLs for nefarious purposes (avoid sending to or via G's mail servers); a person who had that URL on a computer was rash enough to get a virus (which toasts virtually anything on that computer); someone came across the URL in an office email sent to someone else (never send to office addresses if you can help it!) and tried it for themselves; someone at NSA intercepted it - you are on a US suspects list; you name it.

Or G or virtually any one technically minded simply picked up the domain example.com from DNS, stuck http in front of it and gave it a whirl. We've all done that. :)

Somewhere, somehow, someone found the URL.

lucy24

11:01 pm on Aug 20, 2015 (gmt 0)

I think what we're considering in this thread is favicon-only grabbing.

I thought it was specifically about the "Google favicon" user-agent. There are plenty of perfectly legitimate favicon-only requests, like search engines using it in SERPs, or Firefox's Favicon Reloader.* Both of those have the end result of making your site look better, so that's probably reason enough to admit them. (I've got two holes poked for DuckDuckGo's version, for example.)

But it's always good to know what they do with your favicon, and that's where the questions arise about, specifically, Google favicon. WMT, definitely. Google Plus, maybe. But what about sites that don't fit under either category, like the various non-public domains and pages mentioned in this thread? If you've got a gmail address and someone sends you a link to a website, does the clicklink in your email mysteriously include the site's favicon? If yes, that would account for the interior pages.

* I don't personally use it, but it's a behind-the-scenes utility that periodically grabs the favicon for everything you've got bookmarked, using the browser's own current UA and IP. So the user's bookmarks menu looks prettier, the links are more memorable, and-- I believe-- users are more likely to visit your site again. Since it doesn't start by requesting a page, it probably does just assume /favicon.ico

keyplyr

2:43 am on Aug 21, 2015 (gmt 0)

But reasons to request a page first may be:
- to get the page header which has different/more info than the favicon header.
- to find out where that favicon resides (if customized)
- to include a referrer with the favicon request.
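The second reason in the list above (finding where a customized favicon lives) can be sketched as follows. This is a guess at the general technique, not a description of what the Google fetcher does: parse the page for link elements whose rel includes "icon", and fall back to /favicon.ico at the site root when none is declared. The class and function names, and the example.com URLs, are illustrative only.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class IconFinder(HTMLParser):
    """Collect href values from <link> tags whose rel attribute includes 'icon'."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        a = dict(attrs)
        rel = (a.get("rel") or "").lower().split()
        if "icon" in rel and a.get("href"):
            self.hrefs.append(a["href"])

def favicon_urls(page_url, html):
    """Return absolute favicon URLs declared in html, else the root /favicon.ico."""
    finder = IconFinder()
    finder.feed(html)
    declared = [urljoin(page_url, h) for h in finder.hrefs]
    return declared or [urljoin(page_url, "/favicon.ico")]
```

A client that skips the page and goes straight for the icon has nothing to parse, so it can only use the fallback branch.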

Having said that... I also have no idea what this bot really does :)

lucy24

8:06 pm on Aug 30, 2015 (gmt 0)

Follow-up:

It's Google Plus, isn't it?

I've just run the logs for one of my every-two-weeks sites. In the course of checking headers in search of ammunition against robot infestation, I found:

2015-08-12:00:50:48
IP: 66.249.93.202
User-Agent: Google favicon
X-Forwarded-For: 86.8.aa.bb

And your point is...? A minute or two later, that selfsame 86.8.aa.bb visited the site in its own right, giving plus.url.google.com etcetera as referer. So yup, Google Plus is definitely one source. (Unlike some things, I couldn't test this out personally, because the only way I ever reach Google Plus is from WMT, meaning that they've already got my favicon. They must cache it for a little while.)

blend27

10:11 pm on Oct 5, 2015 (gmt 0)

ip: 66.249.88.98 (google-proxy-66-249-88-98.google.com)
time: {ts '2015-10-04 22:59:07'}
http_content:
method: GET
protocol: HTTP/1.1

User-Agent: Google favicon
X-Original-URL: /
Accept-Encoding: gzip,deflate
Host: www.example.com
X-Forwarded-For: 2601:482:4301:54d:f5cf:f0d5:76ff:3162
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Content-Length: 0
Connection: Keep-alive

Passing IPV6 as X-Forwarded-For. That is new to me.

lucy24

3:57 am on Oct 6, 2015 (gmt 0)

Passing IPV6 as X-Forwarded-For. That is new to me.

:: scurrying to saved headers to search for "X-Forwarded-For: \w+:" (an easy search, because only IPv6 uses colons) ::
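The colon search works for eyeballing saved headers. A slightly stricter alternative, sketched here with Python's ipaddress module, classifies the X-Forwarded-For value as IPv4 or IPv6 and returns None for anything that is not a complete address (including the masked addresses quoted in this thread). The function name is hypothetical, and taking the first comma-separated entry as the original client is a common convention, not something the logged headers guarantee.

```python
import ipaddress

def forwarded_for_version(header_line: str):
    """Return 4 or 6 for an 'X-Forwarded-For: ...' line, or None if unusable."""
    name, sep, value = header_line.partition(":")
    if not sep or name.strip().lower() != "x-forwarded-for":
        return None
    # By convention the first entry is the original client.
    first = value.split(",")[0].strip()
    try:
        return ipaddress.ip_address(first).version
    except ValueError:
        # Masked or malformed address, e.g. "86.8.aa.bb".
        return None
```

Run over a file of saved headers, counting the lines that return 6 gives the same hits as the colon search without matching stray text.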

I'll be darned. The very first* hit, coincidentally, is indeed the Google favicon:

2015-08-19:15:25:25
IP: 66.249.85.189
User-Agent: Google favicon

So they've been doing it for a while. And it wasn't a very big coincidence, because most hits are the faviconbot, including an early one (this is relative, because I don't keep headers long):

2015-07-15:08:57:14
IP: 66.249.83.152
User-Agent: Mozilla/5.0 (Windows NT 6.1; rv:6.0) Gecko/20110814 Firefox/6.0 Google favicon

In fact the only non-favicon/IPv6 I can find are ... drumroll ...

2015-09-01:19:28:48
IP: 66.249.83.152
User-Agent: Mozilla/5.0 (Linux; Android 5.1.1; SM-N920P Build/LMY47X) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.133 Mobile Safari/537.36

and

2015-08-24:15:57:24
IP: 64.233.172.171
User-Agent: Mozilla/5.0 (Linux; Android 5.1.1; Nexus 5 Build/LMY48I) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.133 Mobile Safari/537.36

My gosh, those IPs look familiar.

:: further detour for manual checking of processed logs ::

A couple of those favicon fetches were followed by a human visit, with either google or blank referer, from the same google IP that made the favicon request. I think they were all mobiles.

(Unrelated query: What happens to human visitors from IPv6 ranges? I'm sure my host sent out slews of email many years ago blathering about IPv6 readiness, so why don't I see them?)

* I think TextWrangler does this stuff alphabetically, so that was A-for-August.

lucy24

3:47 am on Oct 7, 2015 (gmt 0)

<continuing topic drift>
I remembered that I know someone with an Android, so I gave her an URL that I knew would lead to a 403.

#1 the request was not preceded by a Google faviconbot visit (she's on gmail, so I took the opportunity to check a hunch involving icons in email)

#2 the URL request did lead to not one but six favicon requests:

The initial 403 generated two, back-to-back: one with the device's ordinary UA, and a second one with a "Dalvik" UA.
Then she clicked on a link from the 403 page, leading to two consecutive favicon requests.
And a second link from the 403 page, leading to a fifth favicon request. And finally, about ten seconds after all other requests, a final favicon.

All but the very first were with the "Dalvik" UA, which has got something to do with mobiles and images, but more than that I cannot remember.

This is in no way dispositive, but it does illustrate the Android's appetite for favicons with an unquestionably human user.
</td>

keyplyr

8:50 am on Oct 7, 2015 (gmt 0)

Dalvik is the Android virtual machine processor (VM). Yes, anytime someone is on your page using an Android native browser (as opposed to Mobile Chrome or Mobile Firefox) the image files will be requested with some version of Dalvik.

Similar to CFNetwork for iPhone & iPad, some browser add-ons will also use the system's VM (Dalvik.) App developers who write for Android will know this VM very well.

I often notice my own Android hits in my logs and see multiple favicon hits... but then I also see this with Firefox, which will ask for 3 or 4 favicons at the end of every session.
