Forum Library, Charter, Moderators: Ocean10000 & incrediBILL

Search Engine Spider and User Agent Identification Forum

This 43 message thread spans 2 pages.
Facebook's Bots

 3:57 pm on Oct 3, 2011 (gmt 0)

Facebook bot-running from named and bare (no-rDNS) IPs isn't new --
facebookexternalhit/1.0 (+http://www.facebook.com/externalhit_uatext.php)
robots.txt? NO

facebookexternalhit/1.1 (+http://www.facebook.com/externalhit_uatext.php)
robots.txt? NO
facebookplatform/1.0 (+http://developers.facebook.com)
robots.txt? NO

-- but totally cloaked bot-running is:
AsyncHttpClient 1.0
10/0n 08:14:47
AsyncHttpClient 1.0
10/0n 08:14:46

robots.txt? NO

Got more?



 5:10 pm on Oct 3, 2011 (gmt 0)

I've got those as facebook bots. I generally find the ranges are nnn.nnn.nnn.244-254

If ANYTHING comes up with httpclient ANYWHERE it gets blocked.
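
A minimal sketch of the "httpclient anywhere" rule above, written in Python rather than server config. The function name is hypothetical, and the match is assumed to be case-insensitive so that it also catches the AsyncHttpClient UA from the first post.

```python
def is_blocked_user_agent(user_agent: str) -> bool:
    """Return True when 'httpclient' appears anywhere in the UA, ignoring case."""
    return "httpclient" in user_agent.lower()


# UAs quoted in the first post of this thread.
print(is_blocked_user_agent("AsyncHttpClient 1.0"))
print(is_blocked_user_agent(
    "facebookexternalhit/1.1 (+http://www.facebook.com/externalhit_uatext.php)"
))
```

The first call reports a block and the second does not, since only the first UA contains the substring.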


 2:05 am on Oct 4, 2011 (gmt 0)

This isn't a bot but it is a mess. Courtesy of a Fb referrer, broken at a space after Mobile for readability:

Mozilla/5.0 (iPhone; U; CPU iPhone OS 4_3_5 like Mac OS X; en_US) AppleWebKit (KHTML, like Gecko) Mobile
[FBAN/FBForIPhone;FBAV/3.5a;FBBV/3500;FBDV/iPhone3,1;FBMD/iPhone;FBSN/iPhone OS;FBSV/4.3.5;FBSS/2; FBCR/Carrier;FBID/phone;FBLC/en_US]

[Say it ain't so that's anywhere near close to a 'final' version of Fb's iPhone app!]
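
For anyone who wants to pick that UA apart in a log script, here is a rough Python sketch that pulls the bracketed key/value pairs into a dictionary. The function name is hypothetical, and the only thing assumed about the format is what the sample shows: pairs separated by semicolons, with key and value split at the first slash. What each key means is not documented in this thread.

```python
import re


def parse_fb_app_fields(user_agent: str) -> dict:
    """Extract the [FBAN/...;FBAV/...] block of a Facebook app UA as a dict."""
    match = re.search(r"\[(FBAN/[^\]]*)\]", user_agent)
    if match is None:
        return {}
    fields = {}
    for pair in match.group(1).split(";"):
        # Split at the first slash only; values may contain spaces or commas.
        key, sep, value = pair.strip().partition("/")
        if sep:
            fields[key] = value
    return fields


sample = (
    "Mozilla/5.0 (iPhone; U; CPU iPhone OS 4_3_5 like Mac OS X; en_US) "
    "AppleWebKit (KHTML, like Gecko) Mobile "
    "[FBAN/FBForIPhone;FBAV/3.5a;FBBV/3500;FBDV/iPhone3,1;FBMD/iPhone;"
    "FBSN/iPhone OS;FBSV/4.3.5;FBSS/2; FBCR/Carrier;FBID/phone;FBLC/en_US]"
)
for key, value in parse_fb_app_fields(sample).items():
    print(key, "=", value)
```

A UA without the bracketed block simply yields an empty dictionary.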


 7:41 pm on Oct 4, 2011 (gmt 0)

Sorry, no idea what an iphone UA looks like - I have more than enough trouble working out mobiles and phones in general - but that certainly looks like a nightmare UA. :(


 12:13 am on Oct 5, 2011 (gmt 0)

no idea what an iphone UA looks like

Mozilla/5.0 (iPhone; U; CPU iPhone OS 4_3_4 like Mac OS X; en-us) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8K2 Safari/6533.18.5

Research assistance provided by my son ;)


 3:29 am on Oct 5, 2011 (gmt 0)

no idea what an iphone UA looks like

It's the one that keeps requesting:


And if it doesn't find them, it takes a snapshot of your page to use as the icon for your site.


 4:43 pm on Oct 5, 2011 (gmt 0)

Thanks Lucy. Nice to know it's using the insecure Safari. :)

Keyplr - that seems harsh on the users who have to pay bandwidth charges, which I understand are quite high. Another reason for not using one. :)


 9:31 pm on Nov 3, 2011 (gmt 0)

A new Facebook-related plague began this Fall [webmasterworld.com...] and now it's really kicking in:

September (all): 19 hits
October (all): 44 hits
November (2.5 days): 16 hits

This month, the plague's on a very fast track to outpace October. What is it? This:

facebookplatform/1.0 (+http://developers.facebook.com)
robots.txt? NO
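
The "fast track" claim can be checked with quick arithmetic on the monthly counts quoted above. This is only a back-of-envelope sketch: it assumes the November pace stays constant for the full 30 days, which the post does not claim.

```python
# Hit counts quoted in the post above.
october_hits, october_days = 44, 31
november_hits, november_days_elapsed = 16, 2.5
november_days = 30

daily_rate = november_hits / november_days_elapsed
projected_november = daily_rate * november_days

print(f"October rate: {october_hits / october_days:.1f} hits/day")
print(f"November rate so far: {daily_rate:.1f} hits/day")
print(f"Projected November total: {projected_november:.0f} hits")
```

That works out to roughly 6.4 hits a day against October's 1.4, or about 192 hits for the month if the pace holds.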

The bot hits independently of the OP's UAs (facebookexternalhit and AsyncHttpClient). It also hails from just a few Fb Hosts but scores of Fb IPs. It often hits once or twice, but also runs amok as it did in this 30-minute span:

- - [2n/Oct/2011:23:29:58]
out-sf244.tfbnw.net - - [2n/Oct/2011:23:30:07]
out-sf247.tfbnw.net - - [2n/Oct/2011:23:30:10]
- - [2n/Oct/2011:23:30:29]
- - [2n/Oct/2011:23:30:31]
- - [2n/Oct/2011:23:31:09]
- - [2n/Oct/2011:23:32:01]
- - [2n/Oct/2011:23:32:20]
- - [2n/Oct/2011:23:32:59]
- - [2n/Oct/2011:23:33:52]
- - [2n/Oct/2011:23:34:34]
- - [2n/Oct/2011:23:34:37]
- - [2n/Oct/2011:23:35:38]
- - [2n/Oct/2011:23:39:02]
- - [2n/Oct/2011:23:39:05]
- - [2n/Oct/2011:23:39:45]
- - [2n/Oct/2011:23:41:19]
- - [2n/Oct/2011:23:43:10]
- - [2n/Oct/2011:23:43:40]
- - [2n/Oct/2011:23:44:09]
- - [2n/Oct/2011:23:47:11]
- - [2n/Oct/2011:23:47:31]
- - [2n/Oct/2011:23:48:19]
- - [2n/Oct/2011:23:48:33]
- - [2n/Oct/2011:23:50:43]
- - [2n/Oct/2011:23:51:16]
- - [2n/Oct/2011:23:52:26]
- - [2n/Oct/2011:23:52:55]
- - [2n/Oct/2011:23:55:02] <=Atypical; most IPs begin: 69.171.22n.
- - [2n/Oct/2011:23:56:01]
- - [2n/Oct/2011:23:56:50]
- - [2n/Oct/2011:23:58:42]
- - [2n/Oct/2011:23:59:43]
- - [2n/Oct/2011:23:59:49]

It keeps going for a 1x1 botbait graphic -- minus the botbait page the graphic's on -- so at first I 403'd all hits. No difference. Then I renamed the graphic to send a 404. No difference. Now I'm officially beginning to hate it.

Fb mavens (et al), any idea why facebookplatform's got sites in its sights? And if you're seeing it, are you stopping it? And if so, how?


 12:24 am on Nov 4, 2011 (gmt 0)

- - [2n/Oct/2011:23:55:02] <=Atypical; most IPs begin: 69.171.22n

But this too is next door to a facebookexternalhotlink ( so far).

Can't they take up whole ranges like other robots? Most of 69.171 belongs to legitimate humans; I've personally met 69.171.128-159.


 1:24 am on Nov 4, 2011 (gmt 0) -
OrgName: Facebook, Inc.
OrgId: THEFA-3

FACEBOOK-IPV6-BLOCK-1 (NET6-2620-1C00-1) 2620:0:1C00:: - 2620:0:1CFF:FFFF:FFFF:FFFF:FFFF:FFFF
FACEBOOK-INC (NET-173-252-64-0-1) -
TFBNET1 (NET-204-15-20-0-1) -
TFBNET3 (NET-66-220-144-0-1) -
TFBNET3 (NET-69-171-224-0-1) -
TFBNET2 (NET-69-63-176-0-1) -
TFBNET4 (NET-74-119-76-0-1) -
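
Of the blocks listed, only the IPv6 one still shows both endpoints here, so that is the one used in this sketch. It shows how a first/last address pair can be collapsed into CIDR form with Python's standard `ipaddress` module, which is usually the handier shape for a firewall or config rule. The helper name and the test addresses are made up for illustration.

```python
import ipaddress

# Endpoints of FACEBOOK-IPV6-BLOCK-1 exactly as listed above.
first = ipaddress.ip_address("2620:0:1C00::")
last = ipaddress.ip_address("2620:0:1CFF:FFFF:FFFF:FFFF:FFFF:FFFF")

# summarize_address_range yields the smallest set of networks covering the range.
blocks = list(ipaddress.summarize_address_range(first, last))
print(blocks)  # [IPv6Network('2620:0:1c00::/40')]


def in_listed_block(address: str) -> bool:
    """Hypothetical helper: True if the address falls inside the block above."""
    ip = ipaddress.ip_address(address)
    return any(ip in block for block in blocks)


print(in_listed_block("2620:0:1c08::1"))  # True (made-up address inside the range)
print(in_listed_block("2001:db8::1"))     # False (documentation-only address)
```

The same two calls work for IPv4 ranges once both endpoints of a range are known.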


 1:57 am on Nov 4, 2011 (gmt 0)

All those blocks make it tricky blocking fake Fb UAs (...or, in my case, allowing legit FB UAs). FWIW, these are the most common I see, good or bad:

RewriteCond %{REMOTE_HOST} !\.facebook\.com$
RewriteCond %{REMOTE_HOST} !\.tfbnw\.net$

RewriteCond %{REMOTE_ADDR} !^66\.220\.146\.
RewriteCond %{REMOTE_ADDR} !^69\.63\.181\.
RewriteCond %{REMOTE_ADDR} !^69\.171\.2

RewriteCond %{HTTP_USER_AGENT} !^facebookexternalhit

(Aside: The developers.facebook.com-spawned plague gets a 403.)
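
For testing that logic against old logs outside of Apache, here is a rough Python equivalent. It rests on an assumption the post does not spell out: that the conditions are meant to be combined so that a request claiming the facebookexternalhit UA counts as fake unless it comes from one of the listed hosts or address prefixes. The function name and the sample addresses are hypothetical.

```python
# Host suffixes and address prefixes taken from the RewriteCond lines above.
TRUSTED_HOST_SUFFIXES = (".facebook.com", ".tfbnw.net")
TRUSTED_ADDR_PREFIXES = ("66.220.146.", "69.63.181.", "69.171.2")


def looks_like_fake_facebook_hit(user_agent: str, remote_host: str, remote_addr: str) -> bool:
    """True when the UA claims facebookexternalhit but the origin is not a listed one.

    remote_host is the reverse-DNS name, or an empty string for a bare IP.
    """
    if not user_agent.startswith("facebookexternalhit"):
        return False  # not claiming to be the Fb bot, so nothing to verify
    if remote_host.endswith(TRUSTED_HOST_SUFFIXES):
        return False
    if remote_addr.startswith(TRUSTED_ADDR_PREFIXES):
        return False
    return True


UA = "facebookexternalhit/1.1 (+http://www.facebook.com/externalhit_uatext.php)"
# Made-up addresses: one inside a listed prefix, one from a documentation range.
print(looks_like_fake_facebook_hit(UA, "out-sf244.tfbnw.net", "69.171.224.10"))
print(looks_like_fake_facebook_hit(UA, "", "192.0.2.10"))
```

Note that the `69.171.2` prefix is deliberately loose, mirroring the regex in the post, so it matches 69.171.2, 69.171.2n and 69.171.2nn alike.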

Regarding the current ranges provided by wilderness (thanks!), do any of you see much activity from other than the preceding RewriteCond ADDRs?


 12:32 am on Jan 8, 2012 (gmt 0)

Apparently this next one's not new at all, but it is to me:

Mozilla/5.0 (compatible; FriendFeedBot/0.1; +Http://friendfeed.com/about/bot)

11:18:08 /
11:27:32 /

robots.txt? NO

PHP info for that Host's IP [projecthoneypot.org...] shows only the 'usual' Fb UA:

facebookexternalhit/1.1 (+http://www.facebook.com/externalhit_uatext.php)


- The FriendFeedBot's About page is helpful but there's nary an obvious word on-site about the company being a Fb subsidiary: "Facebook Acquires Friendfeed" (2009) [webmasterworld.com...]

- The "We crawl from the following IP addresses..." list on the About page doesn't include any Fb IPs -- they're all PSINet/Cogentco.

- The UA that hit me differed from that indicated on the site as "user agent of our crawler" ever so slightly: +Http (above) vs. +http
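
That one-letter difference matters if a filter compares UA strings exactly. A small Python sketch follows; the "documented" string here is an assumption, reconstructed from the description above as the observed UA with "+http" in lower case, not copied from the FriendFeed site.

```python
SEEN_UA = "Mozilla/5.0 (compatible; FriendFeedBot/0.1; +Http://friendfeed.com/about/bot)"
# Assumption: the documented UA differs from the observed one only in "+http".
DOCUMENTED_UA = SEEN_UA.replace("+Http", "+http")


def matches_exactly(a: str, b: str) -> bool:
    """Byte-for-byte comparison, as a strict whitelist would do it."""
    return a == b


def matches_ignoring_case(a: str, b: str) -> bool:
    """Case-insensitive comparison, tolerant of the +Http/+http variation."""
    return a.casefold() == b.casefold()


print(matches_exactly(SEEN_UA, DOCUMENTED_UA))        # False
print(matches_ignoring_case(SEEN_UA, DOCUMENTED_UA))  # True
```

A strict whitelist built from the About page would therefore miss the bot as it actually arrived.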


 1:50 am on Jan 8, 2012 (gmt 0)

After much speculation about these bots, what they're doing with my content, etc... I opened a FB account, put 2 different types of pages (personal and a "small business") and did some testing.

I only see my images & page thumbnails when linked to my sites either by me, someone reposting my post (status) or another FB user talking about my sites. All this is fine with me.

There is an app developer program (similar to Google) that I block on a case-by-case basis, but so far I have blocked all.


 8:42 am on Jan 8, 2012 (gmt 0)

On one of my sites voting for the annual competition is now on and some contestants have obviously posted about it on FB.

Each time one of their friends clicks on the link, within the same second - but always first - facebookexternalhit arrives then the visitor.

While FB is blocked it does not prevent the visitor from viewing the page and voting.


 6:56 pm on Jan 8, 2012 (gmt 0)

While FB is blocked it does not prevent the visitor from viewing the page and voting.

During my tests, I blocked FB. I still got human visitors from FB, but far fewer (maybe 80% less) than after I removed the FB block. I attribute this to several things:

If FB blocked - when someone posts a direct link to my site, the link looks like a url. Plain and not very inviting.

If FB not blocked - when someone posts a direct link to my site, FB grabs an image from my site (the FB user is given several image choices and may choose which one) and a META tag snippet, making the link look very attractive. The FB user can even "feature" this post giving it larger presentation. This also increases the chances that other FB users will "like" the link and repost on other areas of FB.

There are several other ways in which a web site may get traffic from FB, but all are much more attractive presentations if FB has access to the web page.


 7:34 pm on Jan 8, 2012 (gmt 0)

Akin to keyplyr, I've found that when facebookexternalhit is blocked, people trying to include a link to my site e.g., in a Wall post, get my custom redirect page. (Facebook.com-referred links are fine.)

When facebookexternalhit is white-listed and people try to include a link, Fb (via .tfbnw.net and/or bare IPs; never .facebook.com) pops up images from the target page that people can click through and select to accompany the link.

My problem is that facebookexternalhit traverses directories via /../ to display the images:
facebookexternalhit/1.0 (+http://www.facebook.com/externalhit_uatext.php)

19:24:41 /dir/
19:24:42 /dir/../graphics/example.jpg

That /../ pattern has long been a tell of scrapers on my sites. And not wanting to code Yet Another Workaround for a major, the pattern remains blocked. Luckily there's no Fb prob: People simply don't see any graphic when they make a link. And they still make 'em.
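
A minimal Python sketch of spotting that pattern in logged request paths. The function name is hypothetical; the sample path is the one from the log excerpt above. The second line just shows what the path resolves to once the `..` segment is collapsed.

```python
import posixpath


def has_parent_traversal(request_path: str) -> bool:
    """True when any path segment is exactly '..', as in /dir/../graphics/x.jpg."""
    return ".." in request_path.split("/")


path = "/dir/../graphics/example.jpg"
print(has_parent_traversal(path))     # True
print(posixpath.normpath(path))       # /graphics/example.jpg
print(has_parent_traversal("/dir/"))  # False
```

Checking whole segments rather than the raw substring avoids tripping on file names that merely contain two dots.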


 11:49 pm on Jan 8, 2012 (gmt 0)

Thank you both for the additional info, interesting but FB remains blocked.

So far, none of the visitors who came to vote (some could not even be bothered to click the vote button, which is staring them in the face) have shown any interest in exploring the site further, and that is perfectly all right. They are doing their friend a favor and that's what friends are for.

However, I refuse to 'enhance' FB's page(s) with content scraped from my site while I cannot even see their referral page without having to register which I won't do. So we're even and the friends still come to vote ;o)


 1:05 am on Jan 9, 2012 (gmt 0)

FYI - the FB bot that grabs images comes without a UA or referrer.


 1:49 am on Jan 9, 2012 (gmt 0)

Thanks keyplyr, I'll double check my log files :o)


 2:10 am on Jan 9, 2012 (gmt 0)

From my experience, the FB image grabber comes without UA/referrer from these 3 FB ranges: - - -

But other FB utilities also use these (and other) FB ranges.


 9:53 am on Jan 9, 2012 (gmt 0)

I went through my logs since voting started and had visits from all three ranges with UA facebookexternalhit/....

No visits from these (or other) ranges without UA and no images taken

I will certainly keep an extra eye out (my cyclops one) until voting is over ;o)


 6:40 pm on Jan 9, 2012 (gmt 0)

@Staffa - This covert FB bot does not scrape content like other thieves; it only gets an image if someone is promoting your web site and posts a link (as I described above in #4404990). So you won't see it unless someone likes your web site and wants to send traffic to you.


 8:22 pm on Jan 9, 2012 (gmt 0)

Thanks keyplyr, I understood that and it's exactly what is happening for the moment.
Because of the competition, some participants posted a link on FB to my web site. I guess they probably are urging their friends to go and vote for their entry and the friends come with a FB referral URL.

facebookexternalhit precedes these friends and arrives at the same time but always before the visitor. It comes from the three ranges that you posted.

But, as yet, no FB visit without UA to grab one or more images.


 12:00 am on Jan 10, 2012 (gmt 0)

But, as yet, no FB visit without UA...

Ditto. Always with a UA, typically --

facebookexternalhit/1.0 (+http://www.facebook.com/externalhit_uatext.php)

-- or rarely, and as previously reported (see OP):

facebookplatform/1.0 (+http://developers.facebook.com)
AsyncHttpClient 1.0

Oh, also...

facebookexternalhit is NOT just visiting in response to a user trying to embed a link, because it comes around far, far too regularly to the same page(s). Rather, it acts like a blog-host of a mother hen obsessively checking its chicks' links twice an hour.

For example, note the exact same '27 minutes after' hits to the exact same file:
08:27:00 /dir4/file08.html
09:27:00 /dir4/file08.html
10:27:00 /dir4/file08.html
11:27:00 /dir4/file08.html
12:27:00 /dir4/file08.html
14:27:00 /dir4/file08.html

Wait. There's more! Hourly hits to the exact same file at '57 minutes after' -- including one from "out-":
07:57:00 /dir4/file08.html
08:57:02 /dir4/file08.html
09:57:01 /dir4/file08.html
10:57:03 /dir4/file08.html

11:57:04 /dir4/file08.html
12:57:00 /dir4/file08.html
13:57:00 /dir4/file08.html

Those weren't the only FB-related hits that day, or even in that time frame, just the repetitively-timed ones. Coincidence? That 13 individuals tried to embed links to one of a half-million files every hour at 27-after and 57-after, to the second?

And to think I thought that page was wildly popular.

Anyway. Anyone else seeing similar same-time hits?
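
For anyone wanting to check their own logs for the same thing, here is a rough Python sketch that groups hit times by minute-of-the-hour. The timestamps are the ones listed above; the function name is hypothetical, and a real script would first filter by UA and file.

```python
from collections import Counter

# Hit times for /dir4/file08.html copied from the two lists above.
HIT_TIMES = [
    "08:27:00", "09:27:00", "10:27:00", "11:27:00", "12:27:00", "14:27:00",
    "07:57:00", "08:57:02", "09:57:01", "10:57:03", "11:57:04", "12:57:00",
    "13:57:00",
]


def hits_per_minute_of_hour(times):
    """Count hits by the minute field of HH:MM:SS, ignoring hour and second."""
    return Counter(t.split(":")[1] for t in times)


counts = hits_per_minute_of_hour(HIT_TIMES)
for minute, hits in sorted(counts.items()):
    print(f":{minute} past the hour -> {hits} hits")
```

Human-driven traffic would normally spread across many minute values; a tally that piles up on one or two of them points to a scheduler.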


 4:20 pm on Feb 3, 2012 (gmt 0)

FWIW: Here's an update re Facebook's UAs and fondness for traversing directories via the attack-typical /../ route. [en.wikipedia.org...]

1.) A legit visitor first used this torturous, Fb-related UA to hit a single .html page and its nine graphics:

Mozilla/5.0 (iPhone; U; CPU iPhone OS 5_0_1 like Mac OS X; en_US) AppleWebKit (KHTML, like Gecko) Mobile [FBAN/FBForIPhone;FBAV/4.1;FBBV/4100.0;FBDV/iPhone4,1;FBMD/iPhone;FBSN/iPhone OS;FBSV/5.0.1;FBSS/2; FBCR/AT&T;FBID/phone;FBLC/en_US;FBSF/2.0]

2.) Then they/Fb switched to a SECOND UA to re-hit the same page:

Facebook 4100.0 (iPhone; iPhone OS 5.0.1; en_US)

3.) Then they/Fb used a THIRD UA to re-hit the same graphics, this time using /../:

Facebook/4100.0 CFNetwork/548.0.4 Darwin/11.0.0

The latter generated 400s (Bad Request) for every re-hit graphic. Not 403s per my /../ blocks, but 400s:

"GET /../dir/file.gif HTTP/1.1" 400 293 "-" "Facebook/4100.0 CFNetwork/548.0.4 Darwin/11.0.0"

The person was probably looking to include a link to the page in a post or message. But nine 400s in eight seconds is not someone clicking through images to pick a corresponding graphic.

It's Fb messing up.
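
To tally responses like these by status and UA, the quoted part of the log line can be split with a regular expression. This Python sketch only handles the fragment shown above (request, status, one numeric field, referrer, UA); the number after the status is assumed to be the response size, as in Apache's combined log format, and the names are hypothetical.

```python
import re

LOG_TAIL = re.compile(
    r'"(?P<method>[A-Z]+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)


def parse_log_tail(fragment: str):
    """Return the named fields of the log fragment, or None if it does not match."""
    match = LOG_TAIL.search(fragment)
    return match.groupdict() if match else None


sample = (
    '"GET /../dir/file.gif HTTP/1.1" 400 293 "-" '
    '"Facebook/4100.0 CFNetwork/548.0.4 Darwin/11.0.0"'
)
fields = parse_log_tail(sample)
print(fields["status"], fields["path"], fields["agent"])
```

Counting entries where the status is 400 and the path starts with `/../` would reproduce the nine-in-eight-seconds figure from a full log.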


 8:24 pm on Feb 3, 2012 (gmt 0)

But nine 400s in eight seconds is not someone clicking through images to pick a corresponding graphic.

But it doesn't work that way does it? My impression was that the Facebookexternalhotlink robot presents the user with all the images at once, and they make their selection from this already-downloaded batch. Later on you start getting the recurring hotlinks, but only for the (un)lucky one image.

Would a 400 normally be served before or after a 403 if a request qualifies for both? I think mine just come through as 404.

Incidentally I hope there are not too many humans with "Darwin" in their UA because I recently got exasperated and blocked it.


 9:24 pm on Feb 3, 2012 (gmt 0)

I have Darwin as a mobile UA that's reported but not blocked.


 11:54 pm on Feb 3, 2012 (gmt 0)

1.) "Darwin" is also used on/by multiple Mac platforms to retrieve favicons. E.g.:

Safari/6534.51.22 CFNetwork/454.12.4 Darwin/10.8.0 (i386) (iMac10%2C1)

Blocking it just adds to log bloat with the additional steps.

2.) Lucy, I don't know why the Fb app sent 400s, just that it did.


 12:51 am on Feb 4, 2012 (gmt 0)

Darwin is the Get tool of Mac Safari. By default it gets the favicon and apple-touch-icon.png (of which there are currently 10 versions). However, it can easily be directed to get any image file.

I currently allow it to get the favicon and apple icons, but block it from anything else.
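
A rough Python sketch of that allow-list policy, for illustration only. Several details are assumptions not stated in the post: the favicon is taken to be named `favicon.ico`, the apple icon variants are taken to share the `apple-touch-icon` prefix and a `.png` extension, and the function name is hypothetical.

```python
import posixpath


def allow_darwin_request(user_agent: str, request_path: str) -> bool:
    """Let Darwin UAs fetch only the favicon and apple-touch-icon files."""
    if "Darwin" not in user_agent:
        return True  # this policy only restricts UAs that mention Darwin
    name = posixpath.basename(request_path)
    if name == "favicon.ico":  # assumed favicon file name
        return True
    # Assumed naming scheme for the apple icon variants.
    return name.startswith("apple-touch-icon") and name.endswith(".png")


# Darwin UA quoted earlier in this thread.
UA = "Safari/6534.51.22 CFNetwork/454.12.4 Darwin/10.8.0 (i386) (iMac10%2C1)"
print(allow_darwin_request(UA, "/favicon.ico"))
print(allow_darwin_request(UA, "/apple-touch-icon.png"))
print(allow_darwin_request(UA, "/graphics/example.jpg"))
```

The first two requests pass and the third is refused, which matches the "icons yes, anything else no" rule described above.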


 2:42 am on Feb 4, 2012 (gmt 0)

We're OK then, because I've already got a <Files> directive allowing everyone to get the favicon. All I know about the apple-touch-icon is that I haven't got one ;) In fact the first time I ever saw a request for one was just a few days ago. (It always puzzles me when humans ask for things I haven't got. Like data:image or crossdomain.xml.) It came from an equally new-to-me UA:

Mozilla/5.0 (X11; Linux i686) AppleWebKit/534.24 (KHTML, like Gecko) Chrome/11.0.696.77 Large Screen Safari/534.24 GoogleTV/b61925

I have no idea what that is, apart from being to all appearances human.

All trademarks and copyrights held by respective owners. Member comments are owned by the poster.
WebmasterWorld is a Developer Shed Community owned by Jim Boykin.
© Webmaster World 1996-2014 all rights reserved