Search Engine Spider and User Agent Identification Forum

This 43 message thread spans 2 pages.
Facebook's Bots
Pfui




msg:4370126
 3:57 pm on Oct 3, 2011 (gmt 0)

Facebook bot-running from named and bare (no-rDNS) IPs isn't new --

69.171.229.246
facebookexternalhit/1.0 (+http://www.facebook.com/externalhit_uatext.php)
robots.txt? NO

out-sw248.tfbnw.net
facebookexternalhit/1.1 (+http://www.facebook.com/externalhit_uatext.php)
robots.txt? NO

69.171.228.245
facebookplatform/1.0 (+http://developers.facebook.com)
robots.txt? NO

-- but totally cloaked bot-running is:

69.171.240.249
AsyncHttpClient 1.0
10/0n 08:14:47

69.171.240.245
AsyncHttpClient 1.0
10/0n 08:14:46

robots.txt? NO

Got more?

 

dstiles




msg:4370172
 5:10 pm on Oct 3, 2011 (gmt 0)

I've got those as facebook bots. I generally find the ranges are nnn.nnn.nnn.244-254

If ANYTHING comes up with httpclient ANYWHERE it gets blocked.
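
A minimal sketch of that rule in Python, assuming the check is a case-insensitive substring match on the User-Agent header. Both function names are hypothetical, and the last-octet test only encodes the 244-254 pattern mentioned above, not any published Facebook range.

```python
def ua_mentions_httpclient(user_agent: str) -> bool:
    """True if 'httpclient' appears anywhere in the UA, ignoring case."""
    return "httpclient" in user_agent.lower()


def last_octet_in_bot_band(ip: str, low: int = 244, high: int = 254) -> bool:
    """True if the final IPv4 octet falls in the observed 244-254 band."""
    last = int(ip.rsplit(".", 1)[1])
    return low <= last <= high


print(ua_mentions_httpclient("AsyncHttpClient 1.0"))  # True
print(last_octet_in_bot_band("69.171.240.249"))       # True
```

The octet test on its own is a weak signal, since any network can have hosts ending in .244-.254; it only makes sense combined with a range or UA check.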

Pfui




msg:4370403
 2:05 am on Oct 4, 2011 (gmt 0)

This isn't a bot but it is a mess. Courtesy of a Fb referrer, broken at a space after Mobile for readability:

Mozilla/5.0 (iPhone; U; CPU iPhone OS 4_3_5 like Mac OS X; en_US) AppleWebKit (KHTML, like Gecko) Mobile
[FBAN/FBForIPhone;FBAV/3.5a;FBBV/3500;FBDV/iPhone3,1;FBMD/iPhone;FBSN/iPhone OS;FBSV/4.3.5;FBSS/2; FBCR/Carrier;FBID/phone;FBLC/en_US]

[Say it ain't so that's anywhere near close to a 'final' version of Fb's iPhone app!]
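
For anyone who wants to pick that UA apart in a log script, here is a rough Python sketch. It assumes the bracketed tail is a semicolon-separated list of KEY/value pairs, which is only inferred from the sample above; the function name is hypothetical.

```python
import re


def parse_fb_app_fields(user_agent: str) -> dict:
    """Return the KEY/value pairs found inside the trailing [FB...] block."""
    match = re.search(r"\[(FB[^\]]*)\]", user_agent)
    if not match:
        return {}
    fields = {}
    for part in match.group(1).split(";"):
        # Split on the first slash only; some pairs carry a leading space.
        key, sep, value = part.strip().partition("/")
        if sep:
            fields[key] = value
    return fields


ua = (
    "Mozilla/5.0 (iPhone; U; CPU iPhone OS 4_3_5 like Mac OS X; en_US) "
    "AppleWebKit (KHTML, like Gecko) Mobile "
    "[FBAN/FBForIPhone;FBAV/3.5a;FBBV/3500;FBDV/iPhone3,1;FBMD/iPhone;"
    "FBSN/iPhone OS;FBSV/4.3.5;FBSS/2; FBCR/Carrier;FBID/phone;FBLC/en_US]"
)
fields = parse_fb_app_fields(ua)
print(fields["FBAV"], fields["FBDV"])  # 3.5a iPhone3,1
```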

dstiles




msg:4370714
 7:41 pm on Oct 4, 2011 (gmt 0)

Sorry, no idea what an iphone UA looks like - I have more than enough trouble working out mobiles and phones in general - but that certainly looks like a nightmare UA. :(

lucy24




msg:4370840
 12:13 am on Oct 5, 2011 (gmt 0)

no idea what an iphone UA looks like


Mozilla/5.0 (iPhone; U; CPU iPhone OS 4_3_4 like Mac OS X; en-us) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8K2 Safari/6533.18.5

Research assistance provided by my son ;)

keyplyr




msg:4370889
 3:29 am on Oct 5, 2011 (gmt 0)

no idea what an iphone UA looks like

It's the one that keeps requesting:

apple-touch-icon-precomposed.png
apple-touch-icon.png
16x16-apple-touch-icon-precomposed.png
16x16-apple-touch-icon.png

And if it doesn't find them, it takes a snapshot of your page to use as the icon for your site.

dstiles




msg:4371135
 4:43 pm on Oct 5, 2011 (gmt 0)

Thanks Lucy. Nice to know it's using the insecure Safari. :)

Keyplyr - that seems harsh on the users who have to pay bandwidth charges, which I understand are quite high. Another reason for not using one. :)

Pfui




msg:4383238
 9:31 pm on Nov 3, 2011 (gmt 0)

A new Facebook-related plague began this Fall [webmasterworld.com...] and now it's really kicking in:

September (all): 19 hits
October (all): 44 hits
November (2.5 days): 16 hits

This month, the plague's on a very fast track to outpace October. What is it? This:

facebookplatform/1.0 (+http://developers.facebook.com)
robots.txt? NO

The bot hits independently of the OP's UAs (facebookexternalhit and AsyncHttpClient). It also hails from just a few Fb Hosts but scores of Fb IPs. It often hits once or twice, but also runs amok as it did in this 30-minute span:

69.171.228.247 - - [2n/Oct/2011:23:29:58]
out-sf244.tfbnw.net - - [2n/Oct/2011:23:30:07]
out-sf247.tfbnw.net - - [2n/Oct/2011:23:30:10]
69.171.228.250 - - [2n/Oct/2011:23:30:29]
69.171.224.251 - - [2n/Oct/2011:23:30:31]
69.171.224.245 - - [2n/Oct/2011:23:31:09]
69.171.224.247 - - [2n/Oct/2011:23:32:01]
69.171.229.249 - - [2n/Oct/2011:23:32:20]
69.171.228.246 - - [2n/Oct/2011:23:32:59]
69.171.229.251 - - [2n/Oct/2011:23:33:52]
69.171.228.245 - - [2n/Oct/2011:23:34:34]
69.171.224.248 - - [2n/Oct/2011:23:34:37]
69.171.224.246 - - [2n/Oct/2011:23:35:38]
69.171.229.245 - - [2n/Oct/2011:23:39:02]
69.171.224.248 - - [2n/Oct/2011:23:39:05]
69.171.224.247 - - [2n/Oct/2011:23:39:45]
69.171.229.248 - - [2n/Oct/2011:23:41:19]
69.171.224.248 - - [2n/Oct/2011:23:43:10]
69.171.228.244 - - [2n/Oct/2011:23:43:40]
69.171.242.250 - - [2n/Oct/2011:23:44:09]
69.171.228.250 - - [2n/Oct/2011:23:47:11]
69.171.224.250 - - [2n/Oct/2011:23:47:31]
69.171.224.250 - - [2n/Oct/2011:23:48:19]
69.171.229.244 - - [2n/Oct/2011:23:48:33]
69.171.228.245 - - [2n/Oct/2011:23:50:43]
69.171.229.245 - - [2n/Oct/2011:23:51:16]
69.171.224.247 - - [2n/Oct/2011:23:52:26]
69.171.224.248 - - [2n/Oct/2011:23:52:55]
69.63.189.248 - - [2n/Oct/2011:23:55:02] <=Atypical; most IPs begin: 69.171.22n.
69.171.224.250 - - [2n/Oct/2011:23:56:01]
69.171.224.244 - - [2n/Oct/2011:23:56:50]
69.171.228.245 - - [2n/Oct/2011:23:58:42]
69.171.224.251 - - [2n/Oct/2011:23:59:43]
69.171.229.250 - - [2n/Oct/2011:23:59:49]

It keeps going for a 1x1 botbait graphic -- minus the botbait page the graphic's on -- so at first I 403'd all hits. No difference. Then I renamed the graphic to send a 404. No difference. Now I'm officially beginning to hate it.

Fb mavens (et al), any idea why facebookplatform's got sites in its sights? And if you're seeing it, are you stopping it? And if so, how?
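
To see how a burst like that spreads across sources, a small Python tally can help. This is only a sketch: it assumes each log line starts with the remote host or IP followed by whitespace, and it buckets bare IPv4 addresses by their first three octets. The function names are hypothetical and the sample is a handful of the lines above.

```python
from collections import Counter


def source_bucket(source: str) -> str:
    """Bare IPv4 -> first three octets; hostnames are returned unchanged."""
    parts = source.split(".")
    if len(parts) == 4 and all(p.isdigit() for p in parts):
        return ".".join(parts[:3]) + ".x"
    return source


def hits_per_bucket(log_lines):
    """Count hits per bucket, skipping blank lines."""
    counts = Counter()
    for line in log_lines:
        line = line.strip()
        if line:
            counts[source_bucket(line.split()[0])] += 1
    return counts


sample = """\
69.171.228.247 - - [2n/Oct/2011:23:29:58]
out-sf244.tfbnw.net - - [2n/Oct/2011:23:30:07]
69.171.228.250 - - [2n/Oct/2011:23:30:29]
69.171.224.251 - - [2n/Oct/2011:23:30:31]
69.63.189.248 - - [2n/Oct/2011:23:55:02]
""".splitlines()

for bucket, count in hits_per_bucket(sample).most_common():
    print(bucket, count)
```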

lucy24




msg:4383292
 12:24 am on Nov 4, 2011 (gmt 0)

69.63.189.248 - - [2n/Oct/2011:23:55:02] <=Atypical; most IPs begin: 69.171.22n

But this too is next door to a facebookexternalhotlink (69.63.181.144-151 so far).

Can't they take up whole ranges like other robots? Most of 69.171 belongs to legitimate humans; I've personally met 69.171.128-159.

wilderness




msg:4383310
 1:24 am on Nov 4, 2011 (gmt 0)

69.171.224.0 - 69.171.255.255
CIDR: 69.171.224.0/19
OrgName: Facebook, Inc.
OrgId: THEFA-3

FACEBOOK-IPV6-BLOCK-1 (NET6-2620-1C00-1) 2620:0:1C00:: - 2620:0:1CFF:FFFF:FFFF:FFFF:FFFF:FFFF
FACEBOOK-INC (NET-173-252-64-0-1) 173.252.64.0 - 173.252.127.255
TFBNET1 (NET-204-15-20-0-1) 204.15.20.0 - 204.15.23.255
TFBNET3 (NET-66-220-144-0-1) 66.220.144.0 - 66.220.159.255
TFBNET3 (NET-69-171-224-0-1) 69.171.224.0 - 69.171.255.255
TFBNET2 (NET-69-63-176-0-1) 69.63.176.0 - 69.63.191.255
TFBNET4 (NET-74-119-76-0-1) 74.119.76.0 - 74.119.79.255
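
If you need those start-end ranges as CIDR blocks, Python's standard `ipaddress` module can do the conversion. A short sketch covering the IPv4 ranges listed above; `range_to_cidrs` is a hypothetical helper name.

```python
import ipaddress

# IPv4 ranges as listed in the whois output above.
RANGES = [
    ("69.171.224.0", "69.171.255.255"),
    ("173.252.64.0", "173.252.127.255"),
    ("204.15.20.0", "204.15.23.255"),
    ("66.220.144.0", "66.220.159.255"),
    ("69.63.176.0", "69.63.191.255"),
    ("74.119.76.0", "74.119.79.255"),
]


def range_to_cidrs(first: str, last: str) -> list:
    """Return the minimal list of CIDR blocks covering first..last."""
    networks = ipaddress.summarize_address_range(
        ipaddress.ip_address(first), ipaddress.ip_address(last)
    )
    return [str(net) for net in networks]


for first, last in RANGES:
    print(first, "-", last, "=>", ", ".join(range_to_cidrs(first, last)))
```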

Pfui




msg:4383320
 1:57 am on Nov 4, 2011 (gmt 0)

All those blocks make it tricky blocking fake Fb UAs (...or, in my case, allowing legit FB UAs). FWIW, these are the most common I see, good or bad:

RewriteCond %{REMOTE_HOST} !\.facebook\.com$
RewriteCond %{REMOTE_HOST} !\.tfbnw\.net$

RewriteCond %{REMOTE_ADDR} !^66\.220\.146\.
RewriteCond %{REMOTE_ADDR} !^69\.63\.181\.
RewriteCond %{REMOTE_ADDR} !^69\.171\.2

RewriteCond %{HTTP_USER_AGENT} !^facebookexternalhit

(Aside: The developers.facebook.com-spawned plague gets a 403.)

Regarding the current ranges provided by wilderness (thanks!), do any of you see much activity from other than the preceding RewriteCond ADDRs?
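
For testing those patterns against old logs outside Apache, here is a Python sketch that reuses the same regexes. It assumes the intent is "the UA claims facebookexternalhit but the source matches none of the host or address patterns", which is one reading of the conditions above, not a confirmed ruleset. The function names are hypothetical.

```python
import re

# Same patterns as the RewriteCond lines, without the negation.
HOST_PATTERNS = [re.compile(r"\.facebook\.com$"), re.compile(r"\.tfbnw\.net$")]
ADDR_PATTERNS = [
    re.compile(r"^66\.220\.146\."),
    re.compile(r"^69\.63\.181\."),
    re.compile(r"^69\.171\.2"),
]
UA_PATTERN = re.compile(r"^facebookexternalhit")


def source_looks_like_fb(remote_host: str, remote_addr: str) -> bool:
    """True if the host or the address matches any listed pattern."""
    return any(p.search(remote_host) for p in HOST_PATTERNS) or any(
        p.search(remote_addr) for p in ADDR_PATTERNS
    )


def is_suspect_fb_claim(remote_host: str, remote_addr: str, user_agent: str) -> bool:
    """UA says facebookexternalhit, but the source matches no pattern."""
    claims_fb = bool(UA_PATTERN.search(user_agent))
    return claims_fb and not source_looks_like_fb(remote_host, remote_addr)


ua = "facebookexternalhit/1.1 (+http://www.facebook.com/externalhit_uatext.php)"
print(is_suspect_fb_claim("out-sw248.tfbnw.net", "69.171.229.246", ua))  # False
print(is_suspect_fb_claim("", "203.0.113.9", ua))                        # True
```

Note that the prefix `^69\.171\.2` also matches addresses such as 69.171.20.x, which fall outside 69.171.224.0/19, so a CIDR comparison is stricter than the prefix match.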

Pfui




msg:4404877
 12:32 am on Jan 8, 2012 (gmt 0)

Apparently this next one's not new at all, but it is to me:

out.snc1.facebook.com
Mozilla/5.0 (compatible; FriendFeedBot/0.1; +Http://friendfeed.com/about/bot)

11:18:08 /
11:27:32 /

robots.txt? NO

PHP info for that Host's IP [projecthoneypot.org...] shows only the 'usual' Fb UA:

facebookexternalhit/1.1 (+http://www.facebook.com/externalhit_uatext.php)

FWIW:

- The FriendFeedBot's About page is helpful but there's nary an obvious word on-site about the company being a Fb subsidiary: "Facebook Acquires Friendfeed" (2009) [webmasterworld.com...]

- The "We crawl from the following IP addresses..." list on the About page doesn't include any Fb IPs -- they're all PSINet/Cogentco.

- The UA that hit me differed from that indicated on the site as "user agent of our crawler" ever so slightly: +Http (above) vs. +http

keyplyr




msg:4404886
 1:50 am on Jan 8, 2012 (gmt 0)

After much speculation about these bots, what they're doing with my content, etc... I opened a FB account, put up two different types of pages (personal and a "small business") and did some testing.

I only see my images & page thumbnails when linked to my sites either by me, someone reposting my post (status) or another FB user talking about my sites. All this is fine with me.

There is an app developer program (similar to Google) that I block on a case-by-case basis, but so far I have blocked all.

Staffa




msg:4404915
 8:42 am on Jan 8, 2012 (gmt 0)

On one of my sites voting for the annual competition is now on and some contestants have obviously posted about it on FB.

Each time one of their friends clicks on the link, within the same second - but always first - facebookexternalhit arrives then the visitor.

While FB is blocked it does not prevent the visitor from viewing the page and voting.

keyplyr




msg:4404990
 6:56 pm on Jan 8, 2012 (gmt 0)

While FB is blocked it does not prevent the visitor from viewing the page and voting.


During my tests, I blocked FB. I still got human visitors from FB, but far fewer (maybe 80% fewer) than after I removed the FB block. I attribute this to several things:

If FB blocked - when someone posts a direct link to my site, the link looks like a bare URL. Plain and not very inviting.

If FB not blocked - when someone posts a direct link to my site, FB grabs an image from my site (the FB user is given several image choices and may choose which one) and a META tag snippet, making the link look very attractive. The FB user can even "feature" this post giving it larger presentation. This also increases the chances that other FB users will "like" the link and repost on other areas of FB.

There are several other ways in which a web site may get traffic from FB, but all are much more attractive presentations if FB has access to the web page.

Pfui




msg:4404998
 7:34 pm on Jan 8, 2012 (gmt 0)

Akin to keyplyr, I've found that when facebookexternalhit is blocked, people trying to include a link to my site e.g., in a Wall post, get my custom redirect page. (Facebook.com-referred links are fine.)

When facebookexternalhit is white-listed and people try to include a link, Fb (via .tfbnw.net and/or bare IPs; never .facebook.com) pops up images from the target page that people can click through and select to accompany the link.

My problem is that facebookexternalhit traverses directories via /../ to display the images:

69.171.229.249
facebookexternalhit/1.0 (+http://www.facebook.com/externalhit_uatext.php)

19:24:41 /dir/
19:24:42 /dir/../graphics/example.jpg

That /../ pattern has long been a tell of scrapers on my sites. And not wanting to code Yet Another Workaround for a major, the pattern remains blocked. Luckily there's no Fb prob: People simply don't see any graphic when they make a link. And they still make 'em.
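
A quick way to flag that pattern when scanning logs, sketched in Python. It assumes you are handed the request path only (no query string) and that percent-encoded dots should count too; both function names are hypothetical.

```python
import posixpath
from urllib.parse import unquote


def has_parent_segment(request_path: str) -> bool:
    """True if any path segment is '..' after percent-decoding."""
    return ".." in unquote(request_path).split("/")


def normalized(request_path: str) -> str:
    """Collapse '.' and '..' segments the way a path normalizer would."""
    return posixpath.normpath(unquote(request_path))


print(has_parent_segment("/dir/../graphics/example.jpg"))  # True
print(normalized("/dir/../graphics/example.jpg"))          # /graphics/example.jpg
```

The normalized form shows which file the bot is really after, which is handy if you ever decide to serve it rather than block it.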

Staffa




msg:4405056
 11:49 pm on Jan 8, 2012 (gmt 0)

Thank you both for the additional info, interesting but FB remains blocked.

So far, none of the visitors who came to vote (some could not even be bothered to click the vote button, which is staring them in the face) have shown any interest in exploring the site further, and that is perfectly all right. They are doing their friend a favor and that's what friends are for.

However, I refuse to 'enhance' FB's page(s) with content scraped from my site while I cannot even see their referral page without having to register which I won't do. So we're even and the friends still come to vote ;o)

keyplyr




msg:4405078
 1:05 am on Jan 9, 2012 (gmt 0)

FYI - the FB bot that grabs images comes without a UA or referrer.

Staffa




msg:4405089
 1:49 am on Jan 9, 2012 (gmt 0)

Thanks keyplyr, I'll double check my log files :o)

keyplyr




msg:4405092
 2:10 am on Jan 9, 2012 (gmt 0)

From my experience, the FB image grabber comes without UA/referrer from these 3 FB ranges:

66.220.144.0 - 66.220.159.255
66.220.144.0/20

69.63.176.0 - 69.63.191.255
69.63.176.0/20

69.171.224.0 - 69.171.255.255
69.171.224.0/19

But other FB utilities also use these (and other) FB ranges.
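
A sketch of how to search logs for that combination in Python, assuming blank fields are logged as an empty string or "-" and using only the three CIDR blocks listed above. The names are hypothetical.

```python
import ipaddress

# The three ranges listed above, as networks.
IMAGE_GRABBER_NETS = [
    ipaddress.ip_network(cidr)
    for cidr in ("66.220.144.0/20", "69.63.176.0/20", "69.171.224.0/19")
]


def is_blank(value) -> bool:
    """Treat None, empty strings and a lone '-' as a blank log field."""
    return value is None or value.strip() in ("", "-")


def matches_blank_fb_fetch(remote_addr: str, user_agent, referrer) -> bool:
    """True for a hit from the listed ranges with no UA and no referrer."""
    addr = ipaddress.ip_address(remote_addr)
    in_range = any(addr in net for net in IMAGE_GRABBER_NETS)
    return in_range and is_blank(user_agent) and is_blank(referrer)


print(matches_blank_fb_fetch("69.171.229.249", "-", "-"))  # True
```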

Staffa




msg:4405152
 9:53 am on Jan 9, 2012 (gmt 0)

I went through my logs since voting started and had visits from all three ranges with UA facebookexternalhit/....

No visits from these (or other) ranges without UA and no images taken

I will certainly keep an extra eye out (my cyclops one) until voting is over ;o)

keyplyr




msg:4405307
 6:40 pm on Jan 9, 2012 (gmt 0)

@Staffa - This covert FB bot does not scrape content like other thieves, it only gets an image if someone is promoting your web site and posts a link (as I described above in #4404990.) So you won't see it unless someone likes your web site and wants to send traffic to you.

Staffa




msg:4405329
 8:22 pm on Jan 9, 2012 (gmt 0)

Thanks keyplyr, I understood that and it's exactly what is happening for the moment.
Because of the competition, some participants posted a link on FB to my web site. I guess they probably are urging their friends to go and vote for their entry and the friends come with a FB referral URL.

facebookexternalhit precedes these friends and arrives at the same time but always before the visitor. It comes from the three ranges that you posted.

But, as yet, no FB visit without UA to grab one or more images.

Pfui




msg:4405405
 12:00 am on Jan 10, 2012 (gmt 0)

But, as yet, no FB visit without UA...

Ditto. Always with a UA, typically --

facebookexternalhit/1.0 (+http://www.facebook.com/externalhit_uatext.php)

-- or rarely, and as previously reported (see OP):

facebookplatform/1.0 (+http://developers.facebook.com)
AsyncHttpClient 1.0

Oh, also...

facebookexternalhit is NOT just visiting in response to a user trying to embed a link, because it comes around far, far too regularly to the same page(s). Rather, it acts like a mother hen of a blog host, obsessively checking its chicks' links twice an hour.

For example, note the exact same '27 minutes after' hits to the exact same file:

69.171.229.248
08:27:00 /dir4/file08.html

69.171.228.245
09:27:00 /dir4/file08.html
10:27:00 /dir4/file08.html

69.171.229.244
11:27:00 /dir4/file08.html
12:27:00 /dir4/file08.html

69.171.229.245
14:27:00 /dir4/file08.html

Wait. There's more! Hourly hits to the exact same file at '57 minutes after' -- including one from "out-":

69.171.229.247
07:57:00 /dir4/file08.html

69.171.228.247
08:57:02 /dir4/file08.html

69.171.228.246
09:57:01 /dir4/file08.html

69.171.228.250
10:57:03 /dir4/file08.html

out-sf251.tfbnw.net
11:57:04 /dir4/file08.html

69.171.224.251
12:57:00 /dir4/file08.html

69.171.228.244
13:57:00 /dir4/file08.html

Those weren't the only FB-related hits that day, or even in that time frame, just the repetitively-timed ones. Coincidence? That 13 individuals tried to embed links to one of a half-million files every hour at 27-after and 57-after, to the second?

And to think I thought that page was wildly popular.

Anyway. Anyone else seeing similar same-time hits?
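
If you want to check your own logs for the same thing, here is a Python sketch that groups hits by file and minute-of-the-hour and reports any slot seen in several different hours. The data is the set of hits listed above; the function names and the three-hour threshold are arbitrary assumptions.

```python
from collections import defaultdict

# (source, HH:MM:SS, path) for the hits listed above.
HITS = [
    ("69.171.229.248", "08:27:00", "/dir4/file08.html"),
    ("69.171.228.245", "09:27:00", "/dir4/file08.html"),
    ("69.171.228.245", "10:27:00", "/dir4/file08.html"),
    ("69.171.229.244", "11:27:00", "/dir4/file08.html"),
    ("69.171.229.244", "12:27:00", "/dir4/file08.html"),
    ("69.171.229.245", "14:27:00", "/dir4/file08.html"),
    ("69.171.229.247", "07:57:00", "/dir4/file08.html"),
    ("69.171.228.247", "08:57:02", "/dir4/file08.html"),
    ("69.171.228.246", "09:57:01", "/dir4/file08.html"),
    ("69.171.228.250", "10:57:03", "/dir4/file08.html"),
    ("out-sf251.tfbnw.net", "11:57:04", "/dir4/file08.html"),
    ("69.171.224.251", "12:57:00", "/dir4/file08.html"),
    ("69.171.228.244", "13:57:00", "/dir4/file08.html"),
]


def hours_by_slot(hits):
    """Map (path, minute) -> set of hours in which that slot was hit."""
    slots = defaultdict(set)
    for _source, hhmmss, path in hits:
        hour, minute, _second = hhmmss.split(":")
        slots[(path, minute)].add(int(hour))
    return slots


def recurring_slots(hits, min_hours: int = 3):
    """Keep only slots seen in at least min_hours distinct hours."""
    return {
        slot: sorted(hours)
        for slot, hours in hours_by_slot(hits).items()
        if len(hours) >= min_hours
    }


for (path, minute), hours in sorted(recurring_slots(HITS).items()):
    print(path, "at :" + minute, "in hours", hours)
```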

Pfui




msg:4413939
 4:20 pm on Feb 3, 2012 (gmt 0)

FWIW: Here's an update re Facebook's UAs and fondness for traversing directories via the attack-typical /../ route. [en.wikipedia.org...]

1.) A legit visitor first used this torturous, Fb-related UA to hit a single .html page and its nine graphics:

Mozilla/5.0 (iPhone; U; CPU iPhone OS 5_0_1 like Mac OS X; en_US) AppleWebKit (KHTML, like Gecko) Mobile [FBAN/FBForIPhone;FBAV/4.1;FBBV/4100.0;FBDV/iPhone4,1;FBMD/iPhone;FBSN/iPhone OS;FBSV/5.0.1;FBSS/2; FBCR/AT&T;FBID/phone;FBLC/en_US;FBSF/2.0]

2.) Then they/Fb switched to a SECOND UA to re-hit the same page:

Facebook 4100.0 (iPhone; iPhone OS 5.0.1; en_US)

3.) Then they/Fb used a THIRD UA to re-hit the same graphics, this time using /../:

Facebook/4100.0 CFNetwork/548.0.4 Darwin/11.0.0

The latter generated 400s (Bad Request) for every re-hit graphic. Not 403s per my /../ blocks, but 400s:

"GET /../dir/file.gif HTTP/1.1" 400 293 "-" "Facebook/4100.0 CFNetwork/548.0.4 Darwin/11.0.0"

The person was probably looking to include a link to the page in a post or message. But nine 400s in eight seconds is not someone clicking through images to pick a corresponding graphic.

It's Fb messing up.
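
One visible difference between this request and the earlier /dir/../graphics/example.jpg hits is that the /../ here comes first, so the path climbs above the site root. Whether that is what made the server answer 400 instead of 403 is not established here. A Python sketch to tell the two cases apart, with a hypothetical function name:

```python
def climbs_above_root(request_path: str) -> bool:
    """True if '..' segments ever step above the starting directory."""
    depth = 0
    for segment in request_path.split("/"):
        if segment in ("", "."):
            continue  # empty and '.' segments do not change depth
        if segment == "..":
            depth -= 1
            if depth < 0:
                return True
        else:
            depth += 1
    return False


print(climbs_above_root("/../dir/file.gif"))              # True
print(climbs_above_root("/dir/../graphics/example.jpg"))  # False
```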

lucy24




msg:4414036
 8:24 pm on Feb 3, 2012 (gmt 0)

But nine 400s in eight seconds is not someone clicking through images to pick a corresponding graphic.

But it doesn't work that way, does it? My impression was that the Facebookexternalhotlink robot presents the user with all the images at once, and they make their selection from this already-downloaded batch. Later on you start getting the recurring hotlinks, but only for the (un)lucky one image.

Would a 400 normally be served before or after a 403 if a request qualifies for both? I think mine just come through as 404.

Incidentally I hope there are not too many humans with "Darwin" in their UA because I recently got exasperated and blocked it.

dstiles




msg:4414063
 9:24 pm on Feb 3, 2012 (gmt 0)

I have Darwin as a mobile UA that's reported but not blocked.

Pfui




msg:4414101
 11:54 pm on Feb 3, 2012 (gmt 0)

1.) "Darwin" is also used on/by multiple Mac platforms to retrieve favicons. E.g.:

Safari/6534.51.22 CFNetwork/454.12.4 Darwin/10.8.0 (i386) (iMac10%2C1)

Blocking it just adds to log bloat with the additional steps.

2.) Lucy, I don't know why the Fb app sent 400s, just that it did.

keyplyr




msg:4414108
 12:51 am on Feb 4, 2012 (gmt 0)



Darwin is the GET tool of Mac Safari. By default it gets the favicon and apple-touch-icon.png (of which there are currently 10 versions). However, it can easily be directed to get any image file.

I currently allow it to get the favicon and apple icons, but block it from anything else.

lucy24




msg:4414125
 2:42 am on Feb 4, 2012 (gmt 0)

We're OK then, because I've already got a <Files> directive allowing everyone to get the favicon. All I know about the apple-touch-icon is that I haven't got one ;) In fact the first time I ever saw a request for one was just a few days ago. (It always puzzles me when humans ask for things I haven't got. Like data:image or crossdomain.xml.) It came from an equally new-to-me UA:

Mozilla/5.0 (X11; Linux i686) AppleWebKit/534.24 (KHTML, like Gecko) Chrome/11.0.696.77 Large Screen Safari/534.24 GoogleTV/b61925

I have no idea what that is, apart from being to all appearances human.
