facebook wider crawls

         

lucy24

7:10 pm on Jul 21, 2022 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Can anyone shed any light on this behavior, noticed sporadically over the last few months and especially the last week or two?

IP: the 173.252 Facebook range (specifically)
User-Agent: the ordinary facebookexternalhit
Behavior: a robots.txt request every couple of days, followed by a whole slew of images from all over the site, seemingly at random, or occasionally a request for root with all supporting files. The requested images are all different, and include some that would definitely never be selected by a human to represent the page they live on.
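For anyone who wants to pull the same pattern out of their own logs, here is a minimal sketch. It assumes the Apache/nginx "combined" log format, and the 173.252.0.0/16 prefix is only a broad stand-in for "the 173.252 range" (the real allocation may be narrower). The function name and the sample lines are made up for illustration.

```python
import ipaddress
import re

# Assumption: a deliberately broad stand-in for "the 173.252 range".
FB_NET = ipaddress.ip_network("173.252.0.0/16")

# Assumption: logs are in the Apache/nginx "combined" format.
LOG_RE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) \S+ "(?P<referer>[^"]*)" "(?P<ua>[^"]*)"'
)


def facebook_hits(lines):
    """Yield (time, path, status) for facebookexternalhit requests from FB_NET."""
    for line in lines:
        m = LOG_RE.match(line)
        if not m:
            continue  # not a combined-format line
        try:
            ip = ipaddress.ip_address(m["ip"])
        except ValueError:
            continue  # a hostname was logged instead of an address
        if ip in FB_NET and "facebookexternalhit" in m["ua"]:
            yield m["time"], m["path"], int(m["status"])


# Made-up sample lines, not real log data.
FB_UA = "facebookexternalhit/1.1 (+http://www.facebook.com/externalhit_uatext.php)"
sample = [
    f'173.252.87.1 - - [21/Jul/2022:19:00:01 +0000] "GET /robots.txt HTTP/1.1" 200 120 "-" "{FB_UA}"',
    f'173.252.87.1 - - [21/Jul/2022:19:00:05 +0000] "GET /images/a.png HTTP/1.1" 200 5120 "-" "{FB_UA}"',
    '203.0.113.9 - - [21/Jul/2022:19:01:00 +0000] "GET / HTTP/1.1" 200 900 "-" "Mozilla/5.0"',
]
hits = list(facebook_hits(sample))
for hit in hits:
    print(hit)
```

Matching on both the address and the User-Agent keeps spoofed facebookexternalhit strings from other ranges out of the picture.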

So far I don't know whether they actually honor robots.txt or it's just for appearances' sake, as it has only just occurred to me to try a Disallow, and they don't ask on every visit. But closer inspection tends to suggest they will not comply. In all their robots.txt requests on the site that also houses my piwik files, they have never noticed that the /piwik directory is Disallowed.
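One way to test compliance instead of eyeballing it: feed the robots.txt rules to Python's standard `urllib.robotparser` and see which logged requests a compliant crawler would have skipped. This is only a sketch. The robots.txt content, the helper name, and the requested paths below are hypothetical stand-ins, not taken from any real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt, modelled on the /piwik Disallow mentioned above.
ROBOTS_TXT = """\
User-agent: *
Disallow: /piwik
"""


def disallowed_requests(robots_txt, user_agent, requested_paths):
    """Return the requested paths that robots.txt forbids for user_agent."""
    rp = RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return [p for p in requested_paths if not rp.can_fetch(user_agent, p)]


# Made-up list of paths a crawler asked for.
requested = ["/robots.txt", "/images/a.png", "/piwik/piwik.js"]
violations = disallowed_requests(ROBOTS_TXT, "facebookexternalhit", requested)
print(violations)
```

Any path that shows up in `violations` was fetched in spite of a Disallow, which is a stronger signal than noticing that robots.txt was merely requested.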

jmccormac

7:32 pm on Jul 21, 2022 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Not sure if it is a preview thing, but I've seen the FB externalhit string in the logs quite often. I don't have many images on the site, and the requests are almost always for HTML queries (DNS history for domain names) rather than the stats pages.

Regards...jmcc

tangor

8:49 am on Jul 23, 2022 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



The few times I have tried to deal with this in .htaccess, folks on FB would contact me to ask why my stuff no longer appeared on FB.

FB, however, does NOT honor robots.txt. Any surprise there?

dstiles

8:31 am on Jul 24, 2022 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I allow Facebook hits, but they are a bit cavalier in what they look for. I have an image I created 25 years ago - nopic.gif - and have, until recently, used it on all my sites. It is a single-pixel image that I once used as a resizable spacer. It's not used on new sites, but I got a hit on it this morning. I'm not sure how many FB users would need to see that... :)

Martin Potter

3:52 pm on Jul 24, 2022 (gmt 0)

5+ Year Member Top Contributors Of The Month



My experience has been a little different. Some years ago the now-familiar FB externalhits began, always looking for one key image that I had created to illustrate a new article on my site. After a while I got tired of this and simply changed the file name of the image. The externalhits for that original image continued but of course all they got was a "404". They continue today, at least weekly, sometimes as often as daily. I don't think I have ever noticed that they look for anything else.

Dimitri

6:16 pm on Jul 24, 2022 (gmt 0)

WebmasterWorld Senior Member 5+ Year Member Top Contributors Of The Month



Maybe the FB crawler is being used to feed an AI.

lucy24

9:26 pm on Jul 24, 2022 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



"My experience has been a little different."

What you describe sounds like the standard Facebook experience, originating with some human referencing the page and selecting one of its images. These are pretty easy to identify, since the initial FB page request is accompanied by all of its images, and then future requests focus on one image. (Unless, I guess, two people have cited the same page but then chosen different images. That can be confusing.)

Here it's the randomness that puzzles me. I'm half thinking of blocking that specific IP range, since it's the only FB range currently engaging in this behavior.

Martin Potter

11:50 pm on Jul 25, 2022 (gmt 0)

5+ Year Member Top Contributors Of The Month



Oh, oh. I must correct my earlier post. After doing a proper search of my logs, I find that the familiar FB externalhit has been targeted not only at the original, now-nonexistent image but also at many other images and pages. However, these hits are usually in the same subject area as that old image, and they match the pages most visited on my site. I guess I was noticing only the ones that got a "404" error and not noticing all the "200" hits. Shame on me. (he says, looking at the floor)

But, Lucy, these are not random, unlike what you are experiencing.
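The "only noticing the 404s" trap is easy to avoid by tallying every response code rather than scanning for errors. A rough sketch follows, assuming the externalhit requests have already been extracted from the logs as (path, status) pairs. The function name and the sample data are invented for illustration.

```python
from collections import Counter


def status_summary(hits):
    """Count externalhit requests by status code and by path.

    hits: iterable of (path, status) pairs already filtered from the logs.
    """
    hits = list(hits)  # allow generators, since we iterate twice
    by_status = Counter(status for _path, status in hits)
    by_path = Counter(path for path, _status in hits)
    return by_status, by_path


# Made-up sample data: one renamed image that now 404s, plus ordinary 200 hits.
sample_hits = [
    ("/img/old-key-image.jpg", 404),
    ("/articles/a.html", 200),
    ("/img/a1.jpg", 200),
    ("/img/old-key-image.jpg", 404),
    ("/articles/a.html", 200),
]
by_status, by_path = status_summary(sample_hits)
print(dict(by_status))
print(by_path.most_common(3))
```

Seeing the per-path counts next to the per-status counts makes it obvious whether the crawler is fixated on one file or ranging across the site.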