
Search Engine Spider and User Agent Identification Forum

This 40 message thread spans 2 pages.
Facebot
from Facebook
Pfui




msg:4679860
 11:49 am on Jun 14, 2014 (gmt 0)

Hi, all. This just in --

66.220.159.115
Facebot/1.0

robots.txt? NO

IP resolves to: rx115.tfbnw.net
Range: 66.220.144.0 - 66.220.159.255
CIDR: 66.220.144.0/20
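
The range and CIDR above can be cross-checked with Python's standard `ipaddress` module. This is a minimal sketch; the variable names are illustrative only and nothing here queries DNS or whois.

```python
import ipaddress

# CIDR block quoted in the post
block = ipaddress.ip_network("66.220.144.0/20")

# First and last addresses covered by the block
first, last = block[0], block[-1]
print(first, "-", last)

# The address that sent the Facebot/1.0 request
visitor = ipaddress.ip_address("66.220.159.115")
print(visitor in block)
```

The same membership test works for any of the other 66.220.159.x addresses reported later in the thread.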

Single GET to bare html only. Also not clustered with Fb regulars:

facebookexternalhit/1.1 (+http://www.facebook.com/externalhit_uatext.php)
facebookplatform/1.0 (+http://developers.facebook.com)

No clue if related to this same-name bot (last updated 2013-04-18):

[sourceforge.net...]

 

keyplyr




msg:4680079
 7:44 pm on Jun 15, 2014 (gmt 0)

no info page?

dstiles




msg:4680080
 7:51 pm on Jun 15, 2014 (gmt 0)

That IP is in the usual facebookexternalhit range so acceptable here. I'll add in the new UA. Thanks.

Is that the full UA?

lucy24




msg:4680086
 8:39 pm on Jun 15, 2014 (gmt 0)

Facebot/1.0

robots.txt? NO

Isn't that the essence of the complaint? If you call yourself a robot, you should at least look at robots.txt.

Earlier thread here [webmasterworld.com]. Not sure how it came up in search, since the term "facebot" doesn't seem to occur, but it's worth linking because there's some interesting content.

Pfui




msg:4680092
 10:26 pm on Jun 15, 2014 (gmt 0)

What little you see is all I got, gang, UA-wise. Before I double-checked the IP, I figured it was a spoof. Surprise, surprise.

(That's my old thread, lucy, which may explain its appearance. Also, alas, way too many bot-runners claim that only crawlers need ask for robots.txt, surely not their X,Y,Z creations.)

keyplyr




msg:4680093
 11:44 pm on Jun 15, 2014 (gmt 0)

IMO FB has stolen so much traffic away from web sites, the only way to get some of it back is to exploit the traffic potential of FB itself. After a lot of testing, I've been able to create significant traffic through FB and its many niche groups.

So I allow FB's main ranges and block most of the dev & app FB ranges. There are a couple of FB apps that send me traffic. These are companies that were sending me traffic before, then they created apps for FB, both desktop and mobile. And the FB mobile apps have been the real winners here, with traffic going up and up.

This new UA is interesting and I'll try to determine what it does exactly.

Pfui




msg:4680565
 3:21 pm on Jun 17, 2014 (gmt 0)

Okay, so "Facebot/1.0" arrived again and now it's reading, and heeding -- and only hitting -- robots.txt. Too bad it's also looking like my new best friend/pest:

66.220.159.113 - 04:06:49
66.220.159.118 - 04:16:55
66.220.159.114 - 05:06:17
66.220.159.117 - 05:30:06
66.220.159.115 - 07:01:25

During that same time period I had only one other hit from Fb (via the usual facebookexternalhit), just a single probe-like hit to a special Fb-only jpg:

173.252.74.112 - 04:22:48

Here's hoping their bots don't start shape-shifting as often as their privacy settings...

Samizdata




msg:4680571
 4:13 pm on Jun 17, 2014 (gmt 0)

Seen here for the first time today, eight hits on robots.txt in two hours.

Seems to have obeyed instruction to go no further.

No information about it available on the web as far as I can see.

...

lucy24




msg:4680578
 4:50 pm on Jun 17, 2014 (gmt 0)

a special Fb-only jpg

I had to laugh at this. The noscript version of Piwik includes an administrative gif, which facebook and similar entities request right along with other images on the page. So I've taken to serving select user-agents a specially made small image-- the same kind you'd see in a "useful links" sidebar. Especially useful if the only other image on the page is a huge jpg that I'm not about to let them have (on account of it would obviate any need to visit the page at all).

dstiles




msg:4680606
 8:14 pm on Jun 17, 2014 (gmt 0)

I wonder if this new bot has anything to do with today's announcement at www[.]zdnet[.]com/blog/security/ ...

"Facebook announced changes to its privacy and advertising policies on its company blog, extending Facebook's ability to track users outside of Facebook. This counters 2011's position that [we] "do not track users across the web."

Could it be connected with a deeper tracking of visited sites?

Pfui




msg:4680658
 12:44 am on Jun 18, 2014 (gmt 0)

Heh. I thought Fb was already tracking via everybody's Like Me beacons and such. Anyway, I just wish I knew what they're up to with Facebot because it's currently getting my default 'go away' robots.txt --

User-agent: *
Disallow: /

-- whereas each of the majors get its own specialized robots.txt file of Allows served up by CGI.
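
How a blanket Disallow reads to a compliant crawler can be checked with Python's standard `urllib.robotparser`. This is only a sketch: the `robots_for` helper, the `MAJORS` list, and the allow-everything file are hypothetical stand-ins for the CGI script and per-crawler files mentioned above, which are not shown in the thread.

```python
from urllib.robotparser import RobotFileParser

# Default 'go away' file quoted in the post
DEFAULT_ROBOTS = "User-agent: *\nDisallow: /\n"

# Hypothetical per-crawler file; the real Allow rules are not given in the thread
ALLOW_ALL_ROBOTS = "User-agent: *\nDisallow:\n"

# Hypothetical list of user-agent substrings treated as 'majors'
MAJORS = ("googlebot", "bingbot")


def robots_for(user_agent: str) -> str:
    """Pick the robots.txt body to serve for a given User-Agent header."""
    ua = user_agent.lower()
    if any(name in ua for name in MAJORS):
        return ALLOW_ALL_ROBOTS
    return DEFAULT_ROBOTS


def may_fetch(user_agent: str, path: str) -> bool:
    """Parse the served file and ask whether this agent may fetch the path."""
    parser = RobotFileParser()
    parser.parse(robots_for(user_agent).splitlines())
    return parser.can_fetch(user_agent, path)


print(may_fetch("Facebot/1.0", "/"))
print(may_fetch("Googlebot/2.1", "/"))
```

Matching on the User-Agent string alone is easy to spoof, so a real script would likely combine it with an IP range check.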

Perhaps we'll learn more once Google picks up this thread.

keyplyr




msg:4680690
 3:43 am on Jun 18, 2014 (gmt 0)



I'm currently getting more traffic from Facebook than Yahoo :)

Pfui




msg:4680796
 12:50 pm on Jun 18, 2014 (gmt 0)

(I get maybe 20 times more traffic from Fb than Y! It always surprises me when I do get traffic from Y.)

FWIW/Speaking of Fb and/or its oddities, to quote lucy here: [webmasterworld.com...]

"Final caution: sometimes they'll pull a different user-agent. Lately I've found a few "visionutils/0.2"-- so far, always from the 173.252. range-- mixed in with the two versions of facebookexternalhit."

Ding-ding-ding. Earlier this morning. That's a new one to me.

Samizdata




msg:4680799
 1:10 pm on Jun 18, 2014 (gmt 0)

looking like my new best friend/pest

A total of 128 hits on robots.txt overnight on one site here (average 6.5 per hour approx).

I wonder what it wants from me.

...

wilderness




msg:4680814
 2:15 pm on Jun 18, 2014 (gmt 0)

They're unable to comprehend?

Don't go away mad, just go away ;)

Samizdata




msg:4680822
 2:33 pm on Jun 18, 2014 (gmt 0)

unable to comprehend?

It certainly seems to be a poorly-coded bot.

But what interests me more is its unwillingness to reveal its purpose.

After all, it comes from a multi-billion dollar web corporation.

Does it want to crawl my sites?

And if so, why?

...

Pfui




msg:4680837
 3:29 pm on Jun 18, 2014 (gmt 0)

Same could be said for Twitterbot. I've yet to allow it, in part because I don't know its purpose, in part because it ALWAYS ignores the blanket Disallow it does get:

199.59.148.210
Twitterbot/1.0

08:09:03 /robots.txt
08:11:37 /robots.txt
08:11:37 /
08:11:38 /robots.txt

Apparently, as with G, Y, msn, and amazon/amazonaws, the bigger you get, the less others' rules apply.

keyplyr




msg:4680870
 5:33 pm on Jun 18, 2014 (gmt 0)

Twitterbot is necessary for your links to be verified and shortened by Twitter's own URL shortener utility.

visionutils is FB's image retriever for their caching and db storage AFAIK. If they come enough times for images on your page (when someone or you post a link) visionutils will eventually show up.

Facebot's purpose is still unknown to me, but I wouldn't be surprised if FB will "unveil" something new in the near future.

lucy24




msg:4680888
 6:20 pm on Jun 18, 2014 (gmt 0)

08:09:03 /robots.txt
08:11:37 /robots.txt
08:11:37 /
08:11:38 /robots.txt

I generally ignore robotic requests for the front page alone, unless there's some aggravating circumstance such as a bogus referer. Most one-off robots never go further, so it isn't worth the time to block them.

Samizdata




msg:4680943
 10:26 pm on Jun 18, 2014 (gmt 0)

Well here is the official explanation:

The Facebot crawler may crawl your entire website instead of just a single page.

Source: [developers.facebook.com...]

Facebook is apparently now a search engine.

A "members only" one.

...

Pfui




msg:4681011
 11:36 pm on Jun 18, 2014 (gmt 0)

1.) Samizdata: Good find, and oh, man. The intro to that page, a.k.a. "Sharing Best Practices", is so much treacly doublespeak:

"We want news sites, magazines, blogs, and other media sites to easily reach their existing fans and grow their fan base. This way, people can get the most engaging Facebook experience."

Ugh.

2.) keyplyr: Thanks for the Twitterbot details. FWIW, blocking it has no apparent effect. Links-in, re-tweets, shortened URLs, etc., continue to function A-OK.

Samizdata




msg:4681013
 11:42 pm on Jun 18, 2014 (gmt 0)

I'm currently getting more traffic from Facebook than Yahoo

This critter is not going to send you traffic.

It is going to fetch your content and use it on Facebook instead.

And your entire site is targeted for the land grab.

Ugh.

Succinct and accurate.

...

keyplyr




msg:4681045
 12:44 am on Jun 19, 2014 (gmt 0)

keyplyr: Thanks for the Twitterbot details. FWIW, blocking it has no apparent effect. Links-in, re-tweets, shortened URLs, etc., continue to function A-OK - Pfui

It's only the Twitter brand URL shortener. Bit.ly, Owl, TinyURL and the rest do their own bidding.


This critter is not going to send you traffic. It is going to fetch your content and use it on Facebook instead. - Samizdata

Where is the documentation for that statement? Where does Facebook officially say that?

Samizdata




msg:4681049
 1:31 am on Jun 19, 2014 (gmt 0)

Where is the documentation for that statement? Where does Facebook officially say that?

I posted the link above.

The Facebook Crawler fetches content from your site and generates a preview for people on Facebook.

The Facebot crawler may crawl your entire website instead of just a single page.

Seems pretty clear.

...

tangor




msg:4681063
 3:19 am on Jun 19, 2014 (gmt 0)

Nah! You guys are missing the point.

Zuckerberg is into politics now. And wants to know the politics of all the sites out there.

Yeah, that's the ticket!

(No tinfoil hat... just writing on the wall)

MEANWHILE, there's only a few bots I allow and FB is not one of them.

keyplyr




msg:4681066
 3:51 am on Jun 19, 2014 (gmt 0)

The Facebook Crawler fetches content from your site and generates a preview for people on Facebook.

Seems pretty clear.

No... that's what they've always done. FB will retrieve a snippet & a couple images from the page that you or someone else posts the link to. That's great. It makes the link more attractive, bringing a lot of traffic.


The Facebot crawler may crawl your entire website instead of just a single page.

This is the only dynamic that is new, and I still see no evidence of what exactly FB is up to with this.

However, I'll say again... instead of bashing FB, learn to make it work for you. I get triple digit targeted human traffic each and every day from FB.

keyplyr




msg:4681079
 9:19 am on Jun 19, 2014 (gmt 0)

Oh and BTW...

66.220.159.118 - - [18/Jun/2014:01:11:35 -0700] "GET /robots.txt HTTP/1.1" 206 4966 "-" "Facebot/1.0"

Header was conditional (included an If-Range field) YMMV
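
For anyone tallying these hits, the log line above can be split into fields with a regular expression. This is a sketch that assumes the Apache combined log format; the group names are illustrative, not part of any standard.

```python
import re

# Log line quoted in the post
LINE = ('66.220.159.118 - - [18/Jun/2014:01:11:35 -0700] '
        '"GET /robots.txt HTTP/1.1" 206 4966 "-" "Facebot/1.0"')

# Pattern for the combined log format; group names are illustrative
LOG_RE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<proto>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

match = LOG_RE.match(LINE)
entry = match.groupdict()
print(entry["ip"], entry["path"], entry["status"], entry["agent"])
```

Note that the If-Range request header itself does not appear in this log format; only the resulting 206 status does.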

Samizdata




msg:4681184
 3:36 pm on Jun 19, 2014 (gmt 0)

FB will retrieve a snippet & a couple images from the page

As I said, fetching your content for use on their site.

And as they said, this may apply to your entire website.

In order that "people can get the most engaging Facebook experience".

I have no interest in people's "Facebook experience".

I still see no evidence of what exactly FB is up to with this.

I see triple digit attempts at robotic crawling each and every day.

The stated reason is to improve people's "Facebook experience".

About which I care not.

FB has stolen so much traffic away from web sites

Your words, not mine.

As it happens, Facebook has never stolen any traffic from me.

And I am not going to let them steal my content, either.

instead of bashing FB, learn to make it work for you

I have no interest in doing either.

I understand your pragmatic approach, and doubtless some will feel the same way, but I can assure you that pandering to the "Facebook experience" would not benefit me in any way.

Whatever happened to Google Web Preview?

...

lucy24




msg:4681270
 7:06 pm on Jun 19, 2014 (gmt 0)

Header was conditional (included an If-Range field)

Hence the 206 response: "If the content has changed, let me see it, otherwise just show the header." Seems a bit excessive for robots.txt, where the header alone is bigger than most people's complete file, but whatever makes them happy.
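
As a cross-check on the 206: the HTTP range-request rules (RFC 7233) define If-Range roughly the other way around. When the validator still matches, the server returns only the requested byte range with 206 Partial Content; when the resource has changed, it ignores the Range header and returns the whole thing with 200. Below is a simplified sketch of that decision, assuming strong ETag validators only and ignoring date validators and unsatisfiable ranges; the function and parameter names are hypothetical.

```python
from typing import Optional


def range_response_status(current_etag: str,
                          if_range: Optional[str],
                          range_header: Optional[str]) -> int:
    """Return the status a server would send for a GET, per If-Range rules."""
    if range_header is None:
        # No Range header: If-Range is ignored and the whole body is sent
        return 200
    if if_range is None or if_range == current_etag:
        # Plain range request, or validator still matches: send only the range
        return 206
    # Validator no longer matches: send the complete, current body
    return 200


# Unchanged file, range requested: partial content
print(range_response_status('"abc"', '"abc"', "bytes=0-99"))
# Changed file, range requested: full body
print(range_response_status('"xyz"', '"abc"', "bytes=0-99"))
```

Under this reading, a 206 on robots.txt suggests the bot already held a copy and asked for a byte range of it.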

Whatever happened to Google Web Preview?

What did happen to it? I see it periodically in logs, but I'm ### if I can find how to invoke it.

keyplyr




msg:4681280
 7:26 pm on Jun 19, 2014 (gmt 0)

As it happens, Facebook has never stolen any traffic from me. - Samizdata

Ohhh yes they have. Social media sites, of which Facebook is the poster child, have detoured a huge amount of traffic away from your (and everyone else's) web site at an increasing rate.

And I am not going to let them steal my content, either. - Samizdata

And as long as you block the idea of changing with internet trends, you will not recover the lost traffic, nor gain all the new traffic, which can be significant.

Like it or not, the internet continues to evolve. In order to remain relevant, all webmasters must be flexible to these changes. A web site that does not render well on various mobile devices, nor interact with social media, is fast becoming archaic.


All trademarks and copyrights held by respective owners. Member comments are owned by the poster.
WebmasterWorld is a Developer Shed Community owned by Jim Boykin.
© Webmaster World 1996-2014 all rights reserved