
Search Engine Spider and User Agent Identification Forum

This 32 message thread spans 2 pages.
?fb_locale= from Facebook
Pfui

WebmasterWorld Senior Member 5+ Year Member



 
Msg#: 4683973 posted 7:50 pm on Jun 30, 2014 (gmt 0)

For the last few months, Fb's routinely appended locale/language designations to URIs, all too often referencing countries where I neither have nor want traffic, or it's strictly limited because of longtime problems (Brazil; Indonesia; Russia; Turkey).

To date I've not redirected/rewritten the "?fb_locale="-renamed files but technically, they don't exist. And they're starting to bug me. Here's today's rash, three hours of hits to the exact same plain html file (and only one or two to any graphics, unlike 'regular' Fb traffic):

UA: facebookexternalhit/1.1 (+http://www.facebook.com/externalhit_uatext.php)

69.171.237.112
09:21:56 /dir/filename.html?fb_locale=fr_FR

173.252.120.119
10:22:27 /dir/filename.html?fb_locale=th_TH

173.252.112.112
10:25:45 /dir/filename.html?fb_locale=id_ID

69.171.237.114
10:50:30 /dir/filename.html?fb_locale=pt_BR

173.252.73.118
10:56:56 /dir/filename.html?fb_locale=en_GB

31.13.99.115
11:07:40 /dir/filename.html?fb_locale=sv_SE

(That IP's Facebook Ireland, a newish one for me: 31.13.64.0 - 31.13.127.255; 31.13.64.0/18)

173.252.112.115
11:31:18 /dir/filename.html?fb_locale=ja_JP

173.252.73.112
11:30:25 /dir/filename.html?fb_locale=ru_RU

69.171.247.116
12:17:54 /dir/filename.html?fb_locale=tr_TR

Variations on the above URIs include these root-level hits:

/?fb_locale=da_DK
/?fb_locale=es_ES
/?fb_locale=fr_FR
/?fb_locale=it_IT
/?fb_locale=nb_NO
/?fb_locale=sv_SE

There's even a second Spanish(?) version:

fb_locale=es_LA

So --

Do you see the same Fb hits 'including' non-English locales? Do you know their purpose at Fb's end of things? A country-localized search database in the works? They don't appear to be from other sites' links using Fb buttons or some such, ditto real-person posts.

And if you're seeing them, are you ignoring them?

 

wilderness

WebmasterWorld Senior Member, Top Contributor of All Time, 10+ Year Member, Top Contributor of the Month



 
Msg#: 4683973 posted 1:47 am on Jul 1, 2014 (gmt 0)

69.171.230.115 - - [04/Jun/2014:11:38:41 -0600] "GET /MyFolder/MySub/MyPage.html?fb_locale=sv_SE

31.13.102.zzz - - [10/Jun/2014:04:53:11 -0600] "GET DifferentFolder/DifferentSub/DifferentPage.html?fb_locale=it_IT

Many Europeans are interested in my widget content, unfortunately, and unless I'm contacted with a referral from a person I'm familiar with, most of Europe is denied from my sites.

lucy24

WebmasterWorld Senior Member, Top Contributor of All Time, Top Contributor of the Month



 
Msg#: 4683973 posted 3:01 am on Jul 1, 2014 (gmt 0)

Huh. Those don't even look like "locales" if that's intended to mean "physical location". They're languages. Possibly different editions of FB? But how many can there be?

/?fb_locale=da_DK (Danish, Denmark variety)
/?fb_locale=es_ES (Spanish as spoken in Spain, as opposed to the really staggering number of Latin American forms)
/?fb_locale=fr_FR (no relation to fr_CA, haha)
/?fb_locale=it_IT (guess)
/?fb_locale=nb_NO (hm, interesting, that's specifically Norsk Bokmål as opposed to Nynorsk)
/?fb_locale=sv_SE (Swedish from Sweden, as opposed to, uh, Swedish as spoken in Finland?)

If your pages are bona fide html, i.e. no possibility of a query string, you would be perfectly within your rights to forcibly redirect, along the lines of

RewriteCond %{QUERY_STRING} fb_locale
RewriteRule (.*) http://www.example.com/$1? [R=301,L]


or, for that matter, simply (in the Condition)

RewriteCond %{QUERY_STRING} .

if you want to dump all of 'em.
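
On Apache 2.4 or later, the same redirect can also be written with the QSD ("query string discard") flag instead of the trailing question mark. What follows is only a sketch under a few assumptions: it lives in the root .htaccess, www.example.com is the placeholder host used above, and the tighter condition pattern (matching fb_locale only as a parameter name) is my own illustrative choice, not something Facebook documents.

```apache
RewriteEngine On

# Match fb_locale only when it starts a query parameter,
# e.g. "?fb_locale=fr_FR" or "?a=b&fb_locale=fr_FR".
RewriteCond %{QUERY_STRING} (^|&)fb_locale=

# Redirect to the same path with the query string dropped.
# QSD requires Apache 2.4+; on 2.2 keep the trailing "?" form instead.
RewriteRule (.*) http://www.example.com/$1 [R=301,L,QSD]
```

Either form should send /dir/filename.html?fb_locale=fr_FR to /dir/filename.html, provided the rule sits in the root as assumed.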

keyplyr

WebmasterWorld Senior Member, Top Contributor of All Time, 10+ Year Member, Top Contributor of the Month



 
Msg#: 4683973 posted 4:35 am on Jul 1, 2014 (gmt 0)

I see these every day... when someone shares a posted link to a different language FB UI. That FB UI (desktop or mobile) will now check the associated files again... and again if the link is posted on another language version of FB. The circle of life goes on...

lucy24

WebmasterWorld Senior Member, Top Contributor of All Time, Top Contributor of the Month



 
Msg#: 4683973 posted 6:34 am on Jul 1, 2014 (gmt 0)

But attaching their "locale" (language) to your URL just seems sloppy. Like when you find yourself in the Parameters area of wmt looking at "newpage=" or "printerfriendly=" and it's got nothing to do with you. If FB needs the extra information for internal administrative purposes, fine, but that doesn't mean everyone else has to put up with the clutter.

Humans following links will get here even if they have to go through a redirect along the way-- and facebookexternalthingy also follows redirects. (This is one of the many things I learned by observation when I moved sites half a year back. Ask for a page that no longer exists, and they'll presently show up at the new URL.)

Right now, major search engines don't "see" FB links. But anything could be different next year, next month,* next week, and then you've got octuplicate content that you never asked for.


* It is sheer coincidence that my computer clock currently says half an hour before midnight on the last day of the fiscal year.

keyplyr

WebmasterWorld Senior Member, Top Contributor of All Time, 10+ Year Member, Top Contributor of the Month



 
Msg#: 4683973 posted 8:03 am on Jul 1, 2014 (gmt 0)

> But attaching their "locale" (language) to your URL just seems sloppy.

FB does have a native translator. There's the possibility that the snippet FB grabs (url, title, description from meta tag) is translated to the respective language, but this is just a guess.

> If FB needs the extra information for internal administrative purposes, fine, but that doesn't mean everyone else has to put up with the clutter.

I don't think they do. I was only aware of the FB file checker referrer displaying the extra info.

> Right now, major search engines don't "see" FB links.

Well, that's a discussion of its own, but one thing I know... both Google & Bing image search have images/photos of mine I have only posted in FB. Of course, other people could have copied these images and posted them wherever for bots to scrape, but I only put them up at FB.

Pfui

WebmasterWorld Senior Member 5+ Year Member



 
Msg#: 4683973 posted 12:41 pm on Jul 1, 2014 (gmt 0)

Thanks for your thoughts and info, gang. Minutes after I wrote the OP, I got yet another hit pertaining to a restricted 'locale' --

173.252.73.115
/dir/filename.html?fb_locale=zh_CN

-- so I added the QUERY_STRING rewrite for "fb_locale" and we shall see. Here's hoping I don't have to start adding conditions based on Fb's iffy URI indicia. Their graphics handling wreaks enough .htaccess havoc as-is.

Pfui

WebmasterWorld Senior Member 5+ Year Member



 
Msg#: 4683973 posted 2:19 pm on Jul 1, 2014 (gmt 0)

P.S. I guess I'm not really surprised since it is a bot: Fb doesn't follow the 301 to the proper/original page. But at least real people might. Maybe. Have yet to see any.

Oh, also, found these:

https://developers.facebook.com/docs/internationalization/
(...that includes an explanation about and a link to this...)
http://www.facebook.com/translations/FacebookLocales.xml

Hmm. Even though Fb's clearly not doing translations on demand -- lookout Google Translate?

keyplyr

WebmasterWorld Senior Member, Top Contributor of All Time, 10+ Year Member, Top Contributor of the Month



 
Msg#: 4683973 posted 6:06 pm on Jul 1, 2014 (gmt 0)

I block "developers"

I block "translate"

lucy24

WebmasterWorld Senior Member, Top Contributor of All Time, Top Contributor of the Month



 
Msg#: 4683973 posted 7:30 pm on Jul 1, 2014 (gmt 0)

> so I added the QUERY_STRING rewrite

I hope you meant "redirect", even though it's configured as a RewriteRule.

FB doesn't follow redirects instantly, as a human browser would. But you should see them within a day or so. Or possibly not at all, if they've already seen the content* via a different language.

There's a Chinese facebook? Who knew.


* On second thought, this probably isn't true. Since FB uses the <noscript> version of a page, each initial request includes piwik's administrative gif. It's got the same name everywhere, and FB always gets served the same content, but they request it separately for each page.

keyplyr

WebmasterWorld Senior Member, Top Contributor of All Time, 10+ Year Member, Top Contributor of the Month



 
Msg#: 4683973 posted 8:27 pm on Jul 1, 2014 (gmt 0)

> Hmm. Even though Fb's clearly not doing translations on demand -- lookout Google Translate?

Not sure quite what you meant by this... FB does offer a translate option when one of your friends posts in a foreign language. But no, you can't import a text snippet to a FB utility for translation like Google Translator Service or Altavista's (now Yahoo's) Babel Fish.

Pfui

WebmasterWorld Senior Member 5+ Year Member



 
Msg#: 4683973 posted 5:26 pm on Jul 3, 2014 (gmt 0)

lucy: It's a rewrite that results in a redirect, right. But to me, to keep things straight in my head, it's a rewrite because it's a Rewrite rule, as opposed to a Redirect directive.

keyplyr: I was musing because there's no obvious (to me) reason for Fb to literally re-address our original links/filenames. And I can easily see Fb including more of our linked-to pages' content than the bits it does now -- and also translating same. Anything to trap eyeballs.

Pfui

WebmasterWorld Senior Member 5+ Year Member



 
Msg#: 4683973 posted 1:41 pm on Jul 8, 2014 (gmt 0)

lucy, this QUERY_STRING couplet --

RewriteCond %{QUERY_STRING} fb_locale
RewriteRule (.*) http://www.example.com/$1? [R=301,L]

-- doesn't send Fb (or me, testing) to the right place when the original target's in a subdir a la:

example.com/dir/filename.html?fb_locale=nl_NL

That incorrectly becomes:

example.com/filename.html

My mod_rewrite's so-o-o rusty. Do I need to put the couplet in every possible dir, or is there another solution, please? TIA

not2easy

WebmasterWorld Administrator 5+ Year Member Top Contributors Of The Month



 
Msg#: 4683973 posted 4:18 pm on Jul 8, 2014 (gmt 0)

Just a note about FB Translations - they say it is Bing doing the work. In my experience with it, G need not be concerned.

lucy24

WebmasterWorld Senior Member, Top Contributor of All Time, Top Contributor of the Month



 
Msg#: 4683973 posted 6:53 pm on Jul 8, 2014 (gmt 0)

> Do I need to put the couplet in every possible dir, or is there another solution, please?

Unless your name is jdMorgan, don't use mod_rewrite in more than one place. Put all your RewriteRules in the root.

> example.com/dir/filename.html?fb_locale=nl_NL
>
> That incorrectly becomes:
>
> example.com/filename.html

This can only happen if the rule is in a directory-specific htaccess (or, equivalently, in a <Directory> section of config). mod_rewrite strips off everything up to the directory that it's located in. So a request for

example.com/directory/subdir/filename.html

will be seen as
directory/subdir/filename.html
if the rule is located in the root (where it should be)

subdir/filename.html
if the rule is located in /directory/

and
filename.html
if the rule is located in /directory/subdir/

And then if you're reusing a capture, the three forms give you
example.com/directory/subdir/filename.html
example.com/subdir/filename.html
example.com/filename.html
depending on where the rule was located. See how that works?
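
One way to sidestep the prefix-stripping described above is to build the redirect target from %{REQUEST_URI}, which holds the full requested path no matter which directory the rule lives in. This is a minimal sketch, not a tested drop-in: it assumes the rule goes in the root .htaccess, uses the thread's placeholder host www.example.com, and assumes no deeper .htaccess with its own RewriteRules is overriding the root rules.

```apache
RewriteEngine On

# Only act on requests carrying Facebook's fb_locale parameter.
RewriteCond %{QUERY_STRING} fb_locale

# "^" matches any request; the captured path is not reused.
# %{REQUEST_URI} begins with "/" and contains the whole path,
# so /dir/filename.html stays /dir/filename.html.
# The trailing "?" drops the original query string.
RewriteRule ^ http://www.example.com%{REQUEST_URI}? [R=301,L]
```

With this form, a request for /directory/subdir/filename.html?fb_locale=nl_NL should redirect to /directory/subdir/filename.html whether the rule sits in the root or in a subdirectory.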

Pfui

WebmasterWorld Senior Member 5+ Year Member



 
Msg#: 4683973 posted 2:06 am on Jul 9, 2014 (gmt 0)

Thanks, lucy! But, um... egads. TMI? Sorry but that's confusing (to) me. All I know is that this in root --

RewriteCond %{QUERY_STRING} fb_locale
RewriteRule (.*) http://www.example.com/$1? [R=301,L]

-- is wrong vis-a-vis subdir requests as I explained. So, although I dearly, dearly admire, adhere to, and miss Jim, I do have dir-specific mod_rewrite rules in per-dir .htaccess files. So to make this fb_locale thing work, I just added the following to the dir's .htaccess --

RewriteCond %{QUERY_STRING} fb_locale
RewriteRule (.*) http://www.example.com/subdir/$1? [R=301,L]

-- and now that works correctly. I reckon I could also just add the following to root and axe all of Fb's permutations --

RewriteCond %{QUERY_STRING} fb_locale
RewriteRule .* - [F]

-- but that might set your (& Jim's) hair on fire;)

(I'll keep fiddling and if need be, will revisit the mod_rewrite bits in the Apache forum.)

lucy24

WebmasterWorld Senior Member, Top Contributor of All Time, Top Contributor of the Month



 
Msg#: 4683973 posted 3:17 am on Jul 9, 2014 (gmt 0)

If you've got directory-specific RewriteRules, make sure you never have more than one along the same path. That is, if you've got RewriteRules in
/directory/subdir/
you can't also have them in
/directory/
and vice versa-- unless you
either
add the line
RewriteOptions inherit
(which will not always behave as you want it to)
or
accept that the rules in /directory/ will be ignored by /directory/subdir/. All of them, not just the ones that happen to apply to the same request.
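
For illustration, the first option might look like the sketch below. The file location and the local rule are hypothetical, reusing the example paths from this thread. One documented wrinkle: in per-directory context, inherited rules are applied after the rules defined in the child directory, which is part of why the result is not always what you expect.

```apache
# Hypothetical file: /directory/subdir/.htaccess
RewriteEngine On

# Pull in the RewriteCond/RewriteRule set from the parent directory's
# configuration. Inherited rules run after the rules in this file.
RewriteOptions Inherit

# Illustrative local rule for this directory only.
RewriteCond %{QUERY_STRING} fb_locale
RewriteRule (.*) http://www.example.com/directory/subdir/$1? [R=301,L]
```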

mod_rewrite is weird and that's all there is to it.

> I reckon I could also just add the following to root and axe all of Fb's permutations

Nothing wrong with that if you choose to [F] instead of capture-and-redirect.

Now, come to think of it: Do they ever attach a query string to requests other than pages? If not, constrain your rule to filenames ending in \.html or whatever you use for pages. Obviously image files don't take queries-- but then, neither does html normally.

Pfui

WebmasterWorld Senior Member 5+ Year Member



 
Msg#: 4683973 posted 4:09 pm on Jul 9, 2014 (gmt 0)

- "RewriteOptions inherit" is a great boon, except for directories containing CGI-generated files.

- OP has examples of requests to 'plain' root, not .html-suffix files per se. Thus methinks cleanest common denominator is: fb_locale

lucy24

WebmasterWorld Senior Member, Top Contributor of All Time, Top Contributor of the Month



 
Msg#: 4683973 posted 7:34 pm on Jul 9, 2014 (gmt 0)

> 'plain' root, not .html-suffix files per se

Most of my [F] rules are expressed as
(^|/|\.html)$
because the chances of a robot requesting a non-page file are so small that it isn't worth making the server check conditions on every single request. Admittedly Facebook is different because they do ask for image files-- but presumably only if they've first seen the page, so they'd know what images to ask for.
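
Applied to the fb_locale case, that pattern could be combined with the query-string condition roughly as follows. This is a sketch assuming a root .htaccess. mod_rewrite tests the RewriteRule pattern before evaluating its conditions, so requests for images and other non-page files never reach the RewriteCond.

```apache
RewriteEngine On

# Checked only when the rule pattern below has already matched.
RewriteCond %{QUERY_STRING} fb_locale

# Matches the root (empty path), any directory request (trailing slash),
# or any path ending in .html; everything else skips the condition.
RewriteRule (^|/|\.html)$ - [F]
```

Both /?fb_locale=da_DK and /dir/filename.html?fb_locale=fr_FR would get a 403 under this rule, while /images/photo.jpg would pass through untouched.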

dstiles

WebmasterWorld Senior Member, Top Contributor of All Time, 5+ Year Member



 
Msg#: 4683973 posted 6:43 pm on Jul 10, 2014 (gmt 0)

> only if they've first seen the page

I cannot comment about images but several scrapers/bots ask for a full list of pages when they should not know about any of them, being returned a 403 on the first page. I see it as quite possible that a legit-looking browser visits the site first, probably a real one with an add-on that reads pages on a "human" time-scale.

keyplyr

WebmasterWorld Senior Member, Top Contributor of All Time, 10+ Year Member, Top Contributor of the Month



 
Msg#: 4683973 posted 7:25 pm on Jul 10, 2014 (gmt 0)

> Facebook is different because they do ask for image files-- but presumably only if they've first seen the page, so they'd know what images to ask for.

Not in my opinion. Why would it be necessary for FB to visit the page first? With a few exceptions, I can grab any image from any page without first visiting the page. I do not need to know the names of image files, I simply request all files from the environment that use select extensions. I would imagine FB knows this as well :)

lucy24

WebmasterWorld Senior Member, Top Contributor of All Time, Top Contributor of the Month



 
Msg#: 4683973 posted 7:55 pm on Jul 10, 2014 (gmt 0)

> I simply request all files from the environment that use select extensions.

In English?

wilderness

WebmasterWorld Senior Member, Top Contributor of All Time, 10+ Year Member, Top Contributor of the Month



 
Msg#: 4683973 posted 8:31 pm on Jul 10, 2014 (gmt 0)

site:example.com JPG

lucy24

WebmasterWorld Senior Member, Top Contributor of All Time, Top Contributor of the Month



 
Msg#: 4683973 posted 12:47 am on Jul 11, 2014 (gmt 0)

Huh? That's the formula for retrieving indexed content on a search engine. It's not even remotely how FB operates. On a first visit, they request all images that are called by a specific page. Generally they request each file at least twice: once naming the page as referer, once referer-less. The images may or may not all live in the same subdirectory, and conversely the subdirectory/ies may or may not contain images called by other pages. Some requested images may not be indexed at all.

keyplyr

WebmasterWorld Senior Member, Top Contributor of All Time, 10+ Year Member, Top Contributor of the Month



 
Msg#: 4683973 posted 6:27 am on Jul 11, 2014 (gmt 0)

> In English?

Simply put, a bit of code can GET all files of any or all specific type(s) once the environment (web document, account, etc) is known. In this case FB knows the URL (someone posted a link) so the program follows that link and gets the file type it is seeking (in this case jpg, png, gif, etc.) It is not necessary to know beforehand the name or path of these files. It is not necessary to request other file types (html, php, etc) to discover the specific files sought. This is not a web browser.

lucy24

WebmasterWorld Senior Member, Top Contributor of All Time, Top Contributor of the Month



 
Msg#: 4683973 posted 7:37 am on Jul 11, 2014 (gmt 0)

> so the program follows that link and gets the file type it is seeking (in this case jpg, png, gif, etc.)

I repeat the question. How? Without seeing a page or having access to someone else's cache, how does any entity of any sort, whether human or robot, know which images are called by that page? And what does "follows that link" mean, if it doesn't mean "requests the page"? (In FB's case, sometimes in the form of HEAD and/or 206, but that's only on repeat visits.)

Besides, I must point out that in every case where FB has visited my site, it has always, in fact, by actual observation, without exception, begun by requesting the html page. "Always" in this context = within the period covered by logs available* on my current hard drive. You can identify a first visit, because that's the one where it gets all image files instead of just one. Sometimes on a rarely visited page I can even pinpoint the exact human visit that led to "Woo hoo, gotta tell all my friends!" :)

:: detour to check something ::

Huh. I'd entirely forgotten that I've seen the fb_locale thing myself. Fortunately it's only attached to the html request, not the attendant images.
31.13.97.113 - - [15/Feb/2014:19:34:48 -0800] "GET /hovercraft/hovercraft.html HTTP/1.1" 200 16238 "-" "facebookexternalhit/1.1 (+http://www.facebook.com/externalhit_uatext.php)"
31.13.97.117 - - [15/Feb/2014:19:34:49 -0800] "GET /hovercraft/hovercraft.html?fb_locale=de_DE HTTP/1.1" 403 1124 "-" "facebookexternalhit/1.1 (+http://www.facebook.com/externalhit_uatext.php)"

Somewhere along the line I started blocking all html requests with queries, on the general assumption that they're up to no good. So that's not an FB-specific lockout; I just never noticed it before.

Pfui, is it always 31.13.etcetera? Does Facebook Ireland cover the whole non-English-speaking part of the world? If the fb_locale request is always preceded by a query-less request for the same page-- I only noticed a few, but wasn't actively looking-- then it really is safe to block the with-query form. They've still seen the page.


* Somewhere along the line I accidentally deleted an archive, so a three-month block of 2012 is gone forever. Blast. But I hardly ever look that far back.

keyplyr

WebmasterWorld Senior Member, Top Contributor of All Time, 10+ Year Member, Top Contributor of the Month



 
Msg#: 4683973 posted 7:56 am on Jul 11, 2014 (gmt 0)

Lucy, it's just the execution of the request. Surely you've seen this many times before. A bot comes and scrapes lots of files without ever coming to your site before. It is not necessary. The command is GET this file type from this URI, or GET all file types from this URI, or whatever the program is written to do. Your page, or your entire file hierarchy is crawled and the indicated files are retrieved.

Anyway, I'm repeating myself here.

Pfui

WebmasterWorld Senior Member 5+ Year Member



 
Msg#: 4683973 posted 10:10 am on Jul 11, 2014 (gmt 0)

> Pfui, is it always 31.13.etcetera? Does Facebook Ireland cover the whole non-English-speaking part of the world?

As mentioned, that was newish to me. And I'm not sure it's as easy as X language/X locale vis-a-vis proximity to a Fb server farm/CIDR. For example, why would en_GB (Great Britain/United Kingdom) hit out from Menlo Park and not Ireland?

(Heck, do they really even need an en_GB for most sites? Nah. So what are they up to, hmm?)

Graphics P.S.

Something I've not seen since late last year (yay) was Fb's literal following of relative graphics:

/../example.jpg

A typical tell of a badly coded bot, that set my teeth on edge no end.

lucy24

WebmasterWorld Senior Member, Top Contributor of All Time, Top Contributor of the Month



 
Msg#: 4683973 posted 10:26 am on Jul 11, 2014 (gmt 0)

> Your page, or your entire file hierarchy is crawled and the indicated files are retrieved.

There can be no crawl without a request. Those requests are logged. I hope you are not suggesting that my host is so corrupt, or so incompetent, that any passing robot can FTP-or-equivalent into any directory and view a listing of what files are in that directory, without my knowledge. Still less can they learn which images are used by a specific, individual page in order to request those images without previously seeing the page.

If you're claiming that there exists an HTTP command that translates to "give me all your jpg's", I want to know what that command is. And so do all the robots over the years who have, instead, gone to the trouble of crawling my site link by link, or asking for a long list of named files ("wp-admin/blahblah") on the off chance that one of them will exist.

> /../example.jpg

I don't have the details at my fingertips, but I fondly remember one robot so mind-bogglingly stupid, it simply looked for all occurrences of
<a something="blahblah"
and asked for "blahblah". Yes, whether or not the 'something' part was 'href', whether or not the href started in #. Got a lot of requests for
/directory/subdir/classname
and
/directory/subdir/fragmentname
that day.

dstiles

WebmasterWorld Senior Member, Top Contributor of All Time, 5+ Year Member



 
Msg#: 4683973 posted 8:09 pm on Jul 11, 2014 (gmt 0)

Lucy - I too am puzzled by keyplyr's crawl notion. I can see one possibility, though.

Some sites are not set up properly and for some accesses return a directory of all folder contents. This is still common on some tech sites but for most sites directory browsing is turned off. I go one step further and add a redirect "page" in (eg) image folders to push visitors back to the home page if they get uppity and try for the "default page".

I haven't tried this, but I think it would be possible, with a site directory such as this, to get all contents of a folder. Sub-folders (eg img) would usually be included in the directory listing so the whole site could be easily scraped.

keyplyr - is that what you mean? If not, please suggest a tool you have seen doing this and I will try to reproduce it. Also, see the above comment I made yesterday.

WebmasterWorld is a Developer Shed Community owned by Jim Boykin.
© Webmaster World 1996-2014 all rights reserved