Forum Library, Charter, Moderators: Ocean10000 & incrediBILL

Search Engine Spider and User Agent Identification Forum

This 32-message thread spans 2 pages; this is page 2.
?fb_locale= from Facebook
Pfui

WebmasterWorld Senior Member 5+ Year Member



 
Msg#: 4683973 posted 7:50 pm on Jun 30, 2014 (gmt 0)

For the last few months, Fb has routinely appended locale/language designations to URIs, all too often referencing countries where I neither have nor want traffic, or where traffic is strictly limited because of longtime problems (Brazil, Indonesia, Russia, Turkey).

To date I've not redirected or rewritten the "?fb_locale="-renamed files, but technically, they don't exist. And they're starting to bug me. Here's today's rash: three hours of hits to the exact same plain HTML file (and only one or two to any graphics, unlike 'regular' Fb traffic):

UA: facebookexternalhit/1.1 (+http://www.facebook.com/externalhit_uatext.php)

69.171.237.112
09:21:56 /dir/filename.html?fb_locale=fr_FR

173.252.120.119
10:22:27 /dir/filename.html?fb_locale=th_TH

173.252.112.112
10:25:45 /dir/filename.html?fb_locale=id_ID

69.171.237.114
10:50:30 /dir/filename.html?fb_locale=pt_BR

173.252.73.118
10:56:56 /dir/filename.html?fb_locale=en_GB

31.13.99.115
11:07:40 /dir/filename.html?fb_locale=sv_SE

(That IP's Facebook Ireland, a newish one for me: 31.13.64.0 - 31.13.127.255; 31.13.64.0/18)

173.252.112.115
11:31:18 /dir/filename.html?fb_locale=ja_JP

173.252.73.112
11:30:25 /dir/filename.html?fb_locale=ru_RU

69.171.247.116
12:17:54 /dir/filename.html?fb_locale=tr_TR
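
The range quoted above for 31.13.99.115 can be double-checked with Python's standard ipaddress module. This is only a sketch for verifying the arithmetic of the /18; the helper name in_range is made up for illustration, and the check says nothing about who currently owns the block.

```python
import ipaddress

# CIDR block quoted in the post for Facebook Ireland.
FB_IRELAND = ipaddress.ip_network("31.13.64.0/18")

def in_range(ip, network=FB_IRELAND):
    """True if the dotted-quad address falls inside the network."""
    return ipaddress.ip_address(ip) in network

# First and last addresses of the /18.
print(FB_IRELAND[0], FB_IRELAND[-1])  # 31.13.64.0 31.13.127.255
print(in_range("31.13.99.115"))       # True
print(in_range("173.252.73.118"))     # False
```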

Variations on the above URIs include these root-level hits:

/?fb_locale=da_DK
/?fb_locale=es_ES
/?fb_locale=fr_FR
/?fb_locale=it_IT
/?fb_locale=nb_NO
/?fb_locale=sv_SE

There's even a second Spanish(?) version:

fb_locale=es_LA

So --

Do you see the same Fb hits 'including' non-English locales? Do you know their purpose at Fb's end of things? A country-localized search database in the works? They don't appear to be from other sites' links using Fb buttons or some such, ditto real-person posts.

And if you're seeing them, are you ignoring them?
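
For anyone who would rather normalize these requests than ignore them, here is a minimal sketch of separating the fb_locale parameter from the rest of the URI, using only Python's standard library. The function name strip_fb_locale is hypothetical, and what to do with the result (redirect, log, block) is left open; this only shows the string handling.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

def strip_fb_locale(uri):
    """Return (clean_uri, locale).

    locale is None when the URI carries no fb_locale parameter.
    Any other query parameters are kept in their original order.
    """
    parts = urlsplit(uri)
    pairs = parse_qsl(parts.query, keep_blank_values=True)
    locale = next((v for k, v in pairs if k == "fb_locale"), None)
    kept = [(k, v) for k, v in pairs if k != "fb_locale"]
    clean = urlunsplit(parts._replace(query=urlencode(kept)))
    return clean, locale

# Example with one of the request paths from the log above.
print(strip_fb_locale("/dir/filename.html?fb_locale=fr_FR"))
```

A server-side redirect to the clean URI would be one way to use this, but whether that is worth doing for a bot that never shows the result to a human is a judgment call.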

 

lucy24

WebmasterWorld Senior Member; lucy24 is a WebmasterWorld Top Contributor of All Time; Top Contributors of the Month



 
Msg#: 4683973 posted 9:01 pm on Jul 11, 2014 (gmt 0)

Hasty edit as I didn't realize we're on a new page.
Some sites are not set up properly and for some accesses return a directory of all folder contents. This is still common on some tech sites but for most sites directory browsing is turned off.

Well, for a given definition of "properly" ;) I'm sure many people's very first htaccess consisted of the single line

Options -Indexes

Or, ahem, IIS equivalent.

When you run a shared server you have to have default settings. I suspect most hosts' default is auto-indexing enabled. But even then, it only works if the directory physically exists, its name is known, and it doesn't contain a named index page.
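
The conditions above can be written down as a small decision function. This is a simplified model for illustration, not Apache's actual logic: the function name directory_response, the default index name, and the returned labels are all assumptions, and it ignores the arcane mod_dir combinations mentioned next.

```python
import os

def directory_response(path, indexes_enabled, index_names=("index.html",)):
    """Model what a request for a directory URL would get back.

    Assumes the common setup: a missing directory is a 404, an index
    file wins if present, and otherwise the auto-index setting decides
    between a generated listing and a 403.
    """
    if not os.path.isdir(path):
        return "404"
    for name in index_names:
        if os.path.isfile(os.path.join(path, name)):
            return "index page"
    return "listing" if indexes_enabled else "403"
```

With "Options -Indexes" in effect, indexes_enabled is False and an index-less directory falls through to the 403 case.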

In Apache, there's an arcane combination of mod_dir settings that will sometimes allow auto-indexing of directories that do contain an index file. But I've never heard of it happening by accident. You have to override a lot of "Do Not Try This At Home" warnings.

I go one step further and add a redirect "page" in (eg) image folders to push visitors back to the home page if they get uppity and try for the "default page".

That's why I insist on a user-friendly 403 page. Longtime webmasters tend to forget that the only time most humans see a 403 is when they're exploring and hit a subdirectory without an index page. As far as the server is concerned, that's exactly the same 403 as "Get out of my sight* you horrible Ukrainian". But the human meant no harm.


* My fingers typed "site", which would have worked too.

keyplyr

WebmasterWorld Senior Member; keyplyr is a WebmasterWorld Top Contributor of All Time; 10+ Year Member; Top Contributors of the Month



 
Msg#: 4683973 posted 10:59 pm on Jul 11, 2014 (gmt 0)

Well, I'm not about to explain my software in an open forum, but maybe this will represent the methodology: Open a local text or HTML editor to display a web document. Use the editor's Search or Find feature to identify all occurrences of "jpg". Now imagine software programmed to GET these matching files via their directory path. Voila! I don't call that a crawl, but if you choose to, go ahead.
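
The "find every jpg, then GET it" idea can be sketched in a few lines of Python. To be clear, this is not keyplyr's software or anything Fb is known to run; the pattern, the function name jpg_urls, and the example.com address are assumptions for illustration. The sketch only builds the list of URLs that would be requested and fetches nothing.

```python
import re
from urllib.parse import urljoin

# Any run of characters ending in .jpg or .jpeg, delimited by quotes,
# parentheses, angle brackets, "=" or whitespace. A rough text search,
# deliberately not a real HTML parser.
JPG_PATTERN = re.compile(r"""[^\s"'()<>=]+\.jpe?g\b""", re.IGNORECASE)

def jpg_urls(page_url, text):
    """Return absolute URLs for each distinct jpg reference found in text."""
    found = []
    for ref in JPG_PATTERN.findall(text):
        url = urljoin(page_url, ref)  # resolve relative paths against the page
        if url not in found:
            found.append(url)
    return found

sample = '<img src="images/a.jpg"><div style="background:url(/bg/b.JPG)"></div>'
print(jpg_urls("http://example.com/dir/filename.html", sample))
```

Note that this approach still reads the page text to find the file names; it just skips building a proper document tree.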

As stated in my earlier post, all that is needed is the location of the environment where these files exist. FB has the page location. Oftentimes it does not precede its image scrape by parsing the HTML (or PHP, or whatever the page is built with). It doesn't need to.

dstiles, yes, it is possible to retrieve all files from a directory without prior knowledge of the actual file names. All it needs is the directory name, often discovered in a previous crawl. Java has good libraries for this. And yes, Lucy, this is done with HTTP, not FTP or other protocols, although it certainly could be, given the right circumstances.

As another example: both G and B scrape image file directories and GET even the files that have never been used on any web document, ever, even though these files rarely make it into the index. Years ago I was also baffled by this... "how could they know that file was there?" The software simply took all the specific file types it was programmed to GET within that environment.

If you want to call that a crawl, so be it. Nothing more to say about this, at least not from me.

