
Search Engine Spider and User Agent Identification Forum

    
Facebook scraper? I can't figure this out.
kahuna




msg:4666256
 2:14 pm on Apr 27, 2014 (gmt 0)

Facebook scraper? I can't figure this out...

Did a FB crawler/bot come by, or did someone post links to my pages?

I have a "folder" with an index page.. and on that index page are links to 10 pages (subtopics). This had been up for a couple of months.

then one morning..
At 7:30 am I added 20 more subtopic pages...
and then later that evening...
All 20 new subtopic pages were hit, multiple times each, by the agent that identifies itself with facebook.com/externalhit_uatext.php.

I can't find in my logs a "human" accessing the pages.

There were the typical media-type bots (cyberalert and trendiction), but they didn't target those pages uniquely; they crawled other pages too.

Searching my logs... no human had hit my index page or the subsequent new 20 pages I had posted.

I didn't think Facebook had a search engine type bot... just the link verification tool they use.

So did somebody use a media scraper such as trendiction.de to post to Facebook somewhere I can't find?
Is this malicious?

I never saw this before.

 

lucy24




msg:4666320
 7:55 pm on Apr 27, 2014 (gmt 0)

You forgot to give the IP. Although Googlebot is by far the most commonly spoofed, it doesn't hold a monopoly.
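
A user-agent string alone can be spoofed, which is why the IP matters. One common way to check a crawler that publishes reverse DNS is a forward-confirmed reverse lookup: resolve the IP to a hostname, check the hostname's domain, then resolve the hostname back and confirm it returns the same IP. The sketch below is a minimal illustration, not an official verification routine. The function name `is_verified_crawler` and the injectable lookup parameters are assumptions made here so the logic can be exercised without network access.

```python
import socket
from typing import Callable, Iterable, List


def is_verified_crawler(
    ip: str,
    allowed_suffixes: Iterable[str],
    reverse_lookup: Callable[[str], str] = lambda ip: socket.gethostbyaddr(ip)[0],
    forward_lookup: Callable[[str], List[str]] = lambda host: socket.gethostbyname_ex(host)[2],
) -> bool:
    """Forward-confirmed reverse DNS check (hypothetical helper).

    1. IP -> hostname (PTR lookup)
    2. hostname must end in one of the allowed domains
    3. hostname -> IPs must include the original IP
    """
    try:
        host = reverse_lookup(ip).rstrip(".").lower()
    except OSError:  # covers socket.herror / socket.gaierror
        return False

    suffixes = [s.strip(".").lower() for s in allowed_suffixes]
    if not any(host == s or host.endswith("." + s) for s in suffixes):
        return False

    try:
        return ip in forward_lookup(host)
    except OSError:
        return False
```

With the default lookups, a call such as `is_verified_crawler("66.249.79.153", ["googlebot.com"])` performs live DNS queries, so the result depends on the DNS records at the time it is run. Crawlers that show up in logs only as bare IP addresses need a different check, such as comparing against known address ranges.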

kahuna




msg:4666367
 12:38 am on Apr 28, 2014 (gmt 0)

Thanks for your message...

I don't think this will tell you much... but here are the bots that came by... and then the flurry of Facebook "external" bots..
Many, many FB "bot" hits..
----------------------------------
crawl-66-249-79-153.googlebot.com - - [24/Apr/2014:07:20:41 -0500]
crawl-66-249-79-185.googlebot.com - - [24/Apr/2014:07:21:30 -0500]
crawl-66-249-79-121.googlebot.com - - [24/Apr/2014:07:21:48 -0500]

75.98.9.249 - - [24/Apr/2014:07:37:06 -0500] (compatible; NetSeer crawler/2.0; +http://www.netseer.com/crawler.html; crawler@netseer.com)
msnbot-131-253-24-80.search.msn.com - - [24/Apr/2014:10:04:01 -0500]
msnbot-131-253-24-94.search.msn.com - - [24/Apr/2014:10:13:01 -0500]
msnbot-65-55-213-42.search.msn.com - - [24/Apr/2014:10:39:20 -0500]
msnbot-131-253-24-47.search.msn.com - - [24/Apr/2014:11:20:49 -0500]

netdisk.cyberalert.com - - [24/Apr/2014:20:48:53 -0500] "GET /xxxxxxxxx/index.shtml HTTP/1.1" 200 10935 "-" "Mozilla/4.0+(compatible;+MSIE+6.0;+Windows+NT+5.1;+SV1;+.NET+CLR+1.1.4322)"
-----------------------
    then it goes down through the 20 new pages I uploaded this day...

netdisk.cyberalert.com - - [24/Apr/2014:20:49:27 -0500] "GET /xxxxxxxx/yyyyyyyyyyyy.htm HTTP/1.1" 200 17491 "-" "Mozilla/4.0+(compatible;+MSIE+6.0;+Windows+NT+5.1;+SV1;+.NET+CLR+1.1.4322)"
----------------------
    and the same for this... through the 20 new pages I uploaded this day...

p16n11.trendiction.de - - [24/Apr/2014:23:31:26 -0500] "GET /xxxxxxxxx/yyyyyyy.htm/ HTTP/1.1" 200 10935 "-" "Mozilla/5.0 (Windows; U; Windows NT 6.0; en-GB; rv:1.0; trendictionbot0.5.0; trendiction search; [trendiction.de...] please let us know of any problems; web at trendiction.com) Gecko/20071127 Firefox/3.0.0.11"

    AND then I get the massive flurry of Facebook hits...

173.252.100.113 - - [24/Apr/2014:23:32:08 -0500] "GET /oxxxxx/iyyyy.htm HTTP/1.1" 200 17746 "-" "facebookexternalhit/1.1 (+http://www.facebook.com/externalhit_uatext.php)"
173.252.100.116 and the numerous Facebook hits with the --- HTTP/1.1" 206 76383 "-" "facebookexternalhit/1.1 (+http://www.facebook.com/externalhit_uatext.php)"
173.252.100.112
173.252.100.119 and more of the same... for each of the individual new "20" pages
69.171.248.0
69.171.248.1
69.171.248.3 and more of the same.... for each of the individual new "20" pages

===========================
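
Entries like the ones above are easier to compare once they are tallied per agent and per page. The following is a rough sketch that assumes the log uses the usual Apache "combined" layout seen in the complete lines quoted in this post. The names `LOG_LINE` and `tally_hits` and the list of agent tokens are illustrative choices, not part of any tool mentioned in the thread.

```python
import re
from collections import Counter, defaultdict

# Matches lines of the form:
# host - - [time] "METHOD path PROTO" status size "referer" "agent"
LOG_LINE = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"$'
)


def tally_hits(lines, tokens=("facebookexternalhit", "trendictionbot")):
    """Count hits per page, grouped by the first agent token found in the UA.

    Lines that do not parse (for example, abbreviated entries) are skipped.
    """
    hits = defaultdict(Counter)
    for line in lines:
        match = LOG_LINE.match(line.strip())
        if match is None:
            continue
        agent = match.group("agent").lower()
        label = next((t for t in tokens if t in agent), "other")
        hits[label][match.group("path")] += 1
    return hits


# Two complete entries copied from the post above (paths already masked).
sample = [
    'netdisk.cyberalert.com - - [24/Apr/2014:20:48:53 -0500] '
    '"GET /xxxxxxxxx/index.shtml HTTP/1.1" 200 10935 "-" '
    '"Mozilla/4.0+(compatible;+MSIE+6.0;+Windows+NT+5.1;+SV1;+.NET+CLR+1.1.4322)"',
    '173.252.100.113 - - [24/Apr/2014:23:32:08 -0500] '
    '"GET /oxxxxx/iyyyy.htm HTTP/1.1" 200 17746 "-" '
    '"facebookexternalhit/1.1 (+http://www.facebook.com/externalhit_uatext.php)"',
]
hits = tally_hits(sample)
```

Run over the full log, a per-page count for `facebookexternalhit` next to the counts for the other bots would show whether the Facebook agent touched only the 20 new pages or a wider set.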

not2easy




msg:4666412
 5:22 am on Apr 28, 2014 (gmt 0)

The Facebook crawler seems to hit sites as it pleases. With all the other bots crawling, there is no telling where the FB bot found those links. Those IPs are definitely the Facebook crawler. It does not mean anyone has posted the links on Facebook. Since msn and google crawled earlier the same day and found your new content, the others may have noticed new links in their rounds.
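
Checking whether an address belongs to a given network can be done with Python's standard `ipaddress` module. The sketch below is only an illustration: the two CIDR blocks are assumptions chosen to cover the addresses quoted in this thread, and should be replaced with the ranges currently announced for Facebook's network (AS32934) before relying on the result. The names `ASSUMED_FACEBOOK_NETS` and `in_assumed_facebook_range` are hypothetical.

```python
import ipaddress

# ASSUMPTION: illustrative blocks covering the IPs quoted in the thread.
# Verify against current whois / ASN data before using in practice.
ASSUMED_FACEBOOK_NETS = [
    ipaddress.ip_network(cidr)
    for cidr in ("173.252.64.0/18", "69.171.224.0/19")
]


def in_assumed_facebook_range(ip, networks=ASSUMED_FACEBOOK_NETS):
    """Return True if `ip` falls inside any of the given networks."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in networks)


# Addresses taken from the log excerpt earlier in the thread.
for ip in ("173.252.100.113", "69.171.248.0", "75.98.9.249"):
    print(ip, in_assumed_facebook_range(ip))
```

Under the assumed ranges, the two addresses logged with the facebookexternalhit agent fall inside the blocks and the NetSeer address does not.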

kahuna




msg:4666597
 7:26 pm on Apr 28, 2014 (gmt 0)

Thanks group.

I still don't "get" this situation.

I was under the impression that Facebook really didn't have its own crawler/bot...
Except the Link Checker that we see as facebookexternalhit/1.1 (+http://www.facebook.com/externalhit_uatext.php)

And that only occurs when someone posts a link on their Facebook page.

I've tested it... and I'm sure you guys/gals already know this.

I've never seen Facebook do a crawl of my site, like other bots. Only the traffic when somebody posts a link to my site.

Thanks for your comments and taking the time to post.
K.

keyplyr




msg:4667384
 11:18 pm on Apr 30, 2014 (gmt 0)



"I was under the impression that Facebook really didn't have its own crawler/bot...
Except the Link Checker that we see as facebookexternalhit/1.1 (+http://www.facebook.com/externalhit_uatext.php)"

That's correct.

However, since it uses an image from your page in the link, it will periodically check to see that the image files still exist. It will usually grab several from each linked page, then offer them to the poster to choose from. This may amount to lots of hits bunched together in a short amount of time. Then the FB bot will come back in the future and do the same thing as FB users continue to follow the link to your site.
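
To see whether hits really arrive "bunched together", the timestamps from the access log can be grouped into bursts. This is a small sketch under stated assumptions: the timestamp format is the one shown in the log excerpt earlier in the thread, the 60-second gap is an arbitrary threshold, and `group_bursts` is a hypothetical helper name. Only the first timestamp in the example comes from the thread; the others are made up for illustration.

```python
from datetime import datetime, timedelta

LOG_TIME_FORMAT = "%d/%b/%Y:%H:%M:%S %z"  # e.g. 24/Apr/2014:23:32:08 -0500


def group_bursts(timestamps, max_gap=timedelta(seconds=60)):
    """Group log timestamps into bursts.

    A new burst starts whenever the gap since the previous hit
    exceeds `max_gap`. Returns a list of lists of datetimes.
    """
    times = sorted(datetime.strptime(t, LOG_TIME_FORMAT) for t in timestamps)
    bursts = []
    for t in times:
        if bursts and t - bursts[-1][-1] <= max_gap:
            bursts[-1].append(t)
        else:
            bursts.append([t])
    return bursts


example = [
    "24/Apr/2014:23:32:08 -0500",  # from the thread
    "24/Apr/2014:23:32:10 -0500",  # hypothetical
    "24/Apr/2014:23:32:40 -0500",  # hypothetical
    "25/Apr/2014:09:00:00 -0500",  # hypothetical later revisit
]
bursts = group_bursts(example)
print([len(b) for b in bursts])
```

A few large bursts followed by occasional smaller ones would be consistent with the behaviour described above: an initial fetch of the page and its images, then periodic rechecks.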

All trademarks and copyrights held by respective owners. Member comments are owned by the poster.
WebmasterWorld is a Developer Shed Community owned by Jim Boykin.
© Webmaster World 1996-2014 all rights reserved