| This 32 message thread spans 2 pages: < < 32 ( 1  ) || |
For the last few months, Fb's routinely appended locale/language designations to URIs, all too often referencing countries where I neither have nor want traffic, or it's strictly limited because of longtime problems (Brazil; Indonesia; Russia; Turkey).
To date I've not redirected/rewritten the "?fb_locale="-renamed files, but technically they don't exist. And they're starting to bug me. Here's today's rash: three hours of hits to the exact same plain HTML file (and only one or two to any graphics, unlike 'regular' Fb traffic):
UA: facebookexternalhit/1.1 (+http://www.facebook.com/externalhit_uatext.php)
(That IP's Facebook Ireland, a newish one for me: 22.214.171.124 - 126.96.36.199; 188.8.131.52/18)
Variations on the above URIs include these root-level hits:
There's even a second Spanish(?) version:
Do you see the same Fb hits 'including' non-English locales? Do you know their purpose at Fb's end of things? A country-localized search database in the works? They don't appear to be from other sites' links using Fb buttons or some such, ditto real-person posts.
And if you're seeing them, are you ignoring them?
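For anyone who does decide to redirect rather than ignore these requests, here is a minimal Python sketch of the rewrite step: it drops the "fb_locale" query parameter and leaves the rest of the URL alone. The function name and the example.com URLs are assumptions for illustration only; how the result is wired into a redirect depends on your server.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def strip_fb_locale(url):
    """Return the URL with any fb_locale query parameter removed.

    Hypothetical helper: every other query parameter, the path and
    the fragment are kept as they were.
    """
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != "fb_locale"
    ]
    return urlunsplit(parts._replace(query=urlencode(kept)))


if __name__ == "__main__":
    # Example URLs are made up for illustration.
    print(strip_fb_locale("http://example.com/page.html?fb_locale=pt_BR"))
    print(strip_fb_locale("http://example.com/page.html?id=7&fb_locale=es_LA"))
```

If the cleaned URL differs from the requested one, that is the target for a 301; if it is identical, the request carried no locale parameter and can be served normally.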
Hasty edit as I didn't realize we're on a new page.
|Some sites are not set up properly and for some accesses return a directory of all folder contents. This is still common on some tech sites but for most sites directory browsing is turned off. |
Well, for a given definition of "properly" ;) I'm sure many people's very first htaccess consisted of the single line
Or, ahem, IIS equivalent.
When you run a shared server you have to have default settings. I suspect most hosts' default is auto-indexing enabled. But even then, it only works if the directory physically exists, its name is known, and it doesn't contain a named index page.
In Apache, there's an arcane combination of mod_dir settings that will sometimes allow auto-indexing of directories that do contain an index file. But I've never heard of it happening by accident. You have to override a lot of "Do Not Try This At Home" warnings.
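The usual case described above (listing shown only when indexing is on, the directory physically exists, and there is no named index page) can be modelled in a few lines of Python. This is only a sketch of the logic, not how Apache implements it: the function name and the list of index file names are assumptions, and the arcane mod_dir exception is deliberately not modelled.

```python
import os

# Assumed set of index page names; a real server takes these from its config.
INDEX_NAMES = ("index.html", "index.htm", "index.php")


def would_auto_index(directory, indexing_enabled=True, index_names=INDEX_NAMES):
    """Return True if a request for this directory would yield a listing.

    Simplified model: indexing must be enabled, the directory must
    physically exist, and it must not contain a named index page.
    """
    if not indexing_enabled:
        return False
    if not os.path.isdir(directory):
        return False
    return not any(
        os.path.isfile(os.path.join(directory, name)) for name in index_names
    )
```

On Apache, `Options -Indexes` corresponds to `indexing_enabled=False` here; with that set, a directory without an index page answers 403 instead of a listing.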
|I go one step further and add a redirect "page" in (eg) image folders to push visitors back to the home page if they get uppity and try for the "default page". |
That's why I insist on a user-friendly 403 page. Longtime webmasters tend to forget that the only time most humans see a 403 is when they're exploring and hit a subdirectory without an index page. As far as the server is concerned, that's exactly the same 403 as "Get out of my sight* you horrible Ukrainian". But the human meant no harm.
* My fingers typed "site", which would have worked too.
Well I'm not about to explain my software in an open forum, but maybe this will represent the methodology: Open a local text or HTML editor to display a web document. Use the editor's Search or Find feature to identify all occurrences of "jpg". Now imagine software programed to GET these matching files, via their directory path. Voila! I don't call that a crawl, but if you choose to, go ahead.
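To make that search-for-"jpg" idea concrete, here is a rough Python sketch. It is not the poster's software and not Facebook's; it just scans a page's source for anything ending in .jpg or .jpeg and resolves each match against the page's own location. The function name, the regular expression and the example.com URLs are all assumptions, and the actual GET step is left out.

```python
import re
from urllib.parse import urljoin

# Anything that looks like a path ending in .jpg/.jpeg, wherever it appears
# in the source (attribute values, CSS url(...), inline script, etc.).
JPG_REF = re.compile(r"[^\s\"'<>()=]+\.jpe?g\b", re.IGNORECASE)


def jpg_urls(page_url, source):
    """Return absolute URLs for every .jpg reference found in the source.

    No HTML parsing is done: this is a plain text search, followed by
    resolving each hit against the location of the page itself.
    """
    found = []
    for match in JPG_REF.findall(source):
        absolute = urljoin(page_url, match)
        if absolute not in found:  # keep first-seen order, skip repeats
            found.append(absolute)
    return found


if __name__ == "__main__":
    html = '<img src="images/a.jpg"><a href="/pics/b.JPG">big</a>'
    for url in jpg_urls("http://example.com/gallery/page.html", html):
        print(url)
```

A client would then issue one GET per URL in the returned list; whether to call that a crawl is the question being argued in this thread.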
As stated in my earlier post, all that is needed is the location of the environment where these files exist. FB has the page location. Oftentimes it does not precede its image scrape by parsing the HTML (or PHP or whatever the page is built with). It doesn't need to.
dstiles, yes, it is possible to retrieve all files from a directory without prior knowledge of the actual file names. All it needs is the directory name, often discovered in a previous crawl. Java has good libraries for this. And yes Lucy, this is done with HTTP, not FTP or other protocols, although it certainly could be, given the right circumstances.
As another example: both G and B scrape image file directories and GET even the files that have never been used on any web document, ever, even though these files rarely make it into the index. Years ago I was also baffled by this... "how could they know that file was there?" The software simply took all specific file types it was programed to GET within that environment.
If you want to call that a crawl, so be it. Nothing more to say about this, at least not from me.