
Forum Moderators: Ocean10000 & incrediBILL & keyplyr


?fb_locale=

from Facebook

7:50 pm on Jun 30, 2014 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Nov 5, 2005
posts: 2038
votes: 1

For the last few months, Fb has routinely appended locale/language designations to URIs, all too often referencing countries where I neither have nor want traffic, or where it's strictly limited because of longtime problems (Brazil; Indonesia; Russia; Turkey).

To date I've not redirected/rewritten the "?fb_locale="-suffixed URIs, even though, technically, those files don't exist. And they're starting to bug me. Here's today's rash: three hours of hits to the exact same plain HTML file (and only one or two to any graphics, unlike 'regular' Fb traffic):

UA: facebookexternalhit/1.1 (+http://www.facebook.com/externalhit_uatext.php)
09:21:56 /dir/filename.html?fb_locale=fr_FR
10:22:27 /dir/filename.html?fb_locale=th_TH
10:25:45 /dir/filename.html?fb_locale=id_ID
10:50:30 /dir/filename.html?fb_locale=pt_BR
10:56:56 /dir/filename.html?fb_locale=en_GB
11:07:40 /dir/filename.html?fb_locale=sv_SE

(That IP's Facebook Ireland, a newish one for me):
11:31:18 /dir/filename.html?fb_locale=ja_JP
11:30:25 /dir/filename.html?fb_locale=ru_RU
12:17:54 /dir/filename.html?fb_locale=tr_TR

Variations on the above URIs include these root-level hits:


There's even a second Spanish(?) version:


So --

Do you see the same Fb hits 'including' non-English locales? Do you know their purpose at Fb's end of things? A country-localized search database in the works? They don't appear to come from other sites' links using Fb buttons or the like, nor from real-person posts.

And if you're seeing them, are you ignoring them?
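For anyone who does decide to redirect these, here is one possible sketch (my assumption of an Apache/mod_rewrite setup, not anything actually posted in this thread) that 301s a lone fb_locale query string back to the clean URI:

```apache
# Sketch only: strip a bare ?fb_locale=xx_XX query and redirect to the plain URI.
RewriteEngine On
RewriteCond %{QUERY_STRING} ^fb_locale=[a-zA-Z_]+$
# The trailing "?" discards the query string (the QSD flag does the same on Apache 2.4+).
RewriteRule ^ %{REQUEST_URI}? [R=301,L]
```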
9:01 pm on July 11, 2014 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:Apr 9, 2011
votes: 536

Hasty edit as I didn't realize we're on a new page.
Some sites are not set up properly and, for some accesses, return a directory listing of all folder contents. This is still common on some tech sites, but for most sites directory browsing is turned off.

Well, for a given definition of "properly" ;) I'm sure many people's very first htaccess consisted of the single line

Options -Indexes

Or, ahem, IIS equivalent.

When you run a shared server you have to have default settings. I suspect most hosts' default is auto-indexing enabled. But even then, it only works if the directory physically exists, its name is known, and it doesn't contain a named index page.

In Apache, there's an arcane combination of mod_dir settings that will sometimes allow auto-indexing of directories that do contain an index file. But I've never heard of it happening by accident. You have to override a lot of "Do Not Try This At Home" warnings.
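To make that "arcane combination" concrete, here is a rough sketch of the kind of settings meant (Apache 2.4 syntax; purely illustrative, emphatically not a recommendation):

```apache
# Do Not Try This At Home: auto-index a directory even though index.html exists.
<Directory "/var/www/example/photos">
    # Tell mod_dir not to serve any index file for this directory...
    DirectoryIndex disabled
    # ...so mod_autoindex takes over and lists the contents instead.
    Options +Indexes
</Directory>
```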

I go one step further and add a redirect "page" in (eg) image folders to push visitors back to the home page if they get uppity and try for the "default page".
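One way such a redirect "page" might look (my sketch, not the poster's actual file) is a per-folder index that bounces visitors to the home page:

```html
<!-- Hypothetical /images/index.html: anyone requesting the bare folder
     gets pushed back to the home page instead of a listing or a 403. -->
<!DOCTYPE html>
<html>
<head>
  <meta http-equiv="refresh" content="0; url=/">
  <title>Redirecting</title>
</head>
<body><a href="/">Continue to the home page</a></body>
</html>
```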

That's why I insist on a user-friendly 403 page. Longtime webmasters tend to forget that the only time most humans see a 403 is when they're exploring and hit a subdirectory without an index page. As far as the server is concerned, that's exactly the same 403 as "Get out of my sight* you horrible Ukrainian". But the human meant no harm.
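Serving that user-friendly 403 is a one-liner in Apache (the path is illustrative):

```apache
# Replace the server's bare-bones 403 with a friendly local page.
ErrorDocument 403 /errors/forbidden.html
```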

* My fingers typed "site", which would have worked too.
10:59 pm on July 11, 2014 (gmt 0)

Moderator This Forum from US 

WebmasterWorld Administrator keyplyr is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Sept 26, 2001
votes: 511

Well, I'm not about to explain my software in an open forum, but maybe this will illustrate the methodology: open a local text or HTML editor to display a web document. Use the editor's Search or Find feature to identify all occurrences of "jpg". Now imagine software programmed to GET each matching file via its directory path. Voila! I don't call that a crawl, but if you choose to, go ahead.
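The find-every-"jpg"-then-GET approach described above can be sketched in a few lines of Python (the sample HTML, base URL, and function name are my own illustrations, not the poster's software):

```python
# Sketch of the string-match-then-GET approach: no HTML parsing, just a
# raw text search, exactly like Find-in-editor. Base URL is a placeholder.
import re

def find_jpg_urls(page_text, base="http://example.com/dir/"):
    """Match every something.jpg mentioned anywhere in the raw text,
    then build fetchable URLs from the known directory path."""
    names = sorted(set(re.findall(r'[\w./-]+\.jpg', page_text)))
    return [n if n.startswith("http") else base + n.lstrip("/") for n in names]

sample = '<img src="photos/cat.jpg"> <a href="dog.jpg">dog</a>'
print(find_jpg_urls(sample))
```

Each returned URL could then be fetched with an ordinary HTTP GET, which is the step the poster declines to call a crawl.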

As stated in my earlier post, all that is needed is the location of the environment where these files exist. FB has the page location. Often it does not precede its image scrape by parsing the HTML (or PHP or whatever the page is built with). It doesn't need to.

dstiles, yes, it is possible to retrieve all files from a directory without prior knowledge of the actual file names. All it needs is the directory name, often discovered in a previous crawl. Java is well suited to this. And yes, Lucy, this is done over HTTP, not FTP or other protocols, although it certainly could be, given the right circumstances.
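The directory-name-only retrieval described here works because an auto-indexed directory answers a single GET with HTML naming every file inside it. A rough sketch of harvesting such a listing (the markup is a made-up Apache-style fragment, and Python stands in for whatever language the software actually uses):

```python
# Sketch: with auto-indexing on, one GET of the bare directory URL returns
# a listing page; pulling the href targets out of it yields every filename.
# The listing below is an illustrative Apache-style fragment, not real data.
import re

def files_from_autoindex(listing_html):
    """Extract file links from an auto-generated index page, skipping
    sort links (start with '?') and the parent-directory link (starts with '/')."""
    return re.findall(r'href="([^"?/][^"]*)"', listing_html)

listing = ('<a href="?C=N;O=D">Name</a> <a href="/dir/">Parent Directory</a> '
           '<a href="photo1.jpg">photo1.jpg</a> <a href="photo2.jpg">photo2.jpg</a>')
print(files_from_autoindex(listing))
```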

As another example: both G and B scrape image file directories and GET even files that have never been used on any web document, ever, even though those files rarely make it into the index. Years ago I was also baffled by this... "how could they know that file was there?" The software simply took all the specific file types it was programmed to GET within that environment.

If you want to call that a crawl, so be it. Nothing more to say about this, at least not from me.
