another one for the profilers

         

lucy24

12:16 am on Dec 5, 2014 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



File under: Once is happenstance, twice is coincidence, three times is a botnet.

Consider this log excerpt:

5.228.70.abc - - [04/Dec/2014:08:40:04 -0800] "GET /ebooks/aelfric/aelfric_full.html HTTP/1.1" 200 427034 "http://yandex.ru/yandsearch?text=searige&lr=213" "Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.2; WOW64; Trident/6.0; MAARJS)" 
{supporting files snipped}
128.72.134.abc - - [04/Dec/2014:08:40:06 -0800] "GET /ebooks/horn/KingHorn_KH.html HTTP/1.1" 200 119187 "http://yandex.ru/yandsearch?text=toryues+boston&lr=213" "Mozilla/5.0 (Windows NT 5.1; rv:26.0) Gecko/20100101 Firefox/26.0"
{supporting files snipped}
95.220.135.abc - - [04/Dec/2014:08:40:06 -0800] "GET /ebooks/paston/paston5.html HTTP/1.1" 200 289460 "http://yandex.ru/yandsearch?text=maknon+judith&lr=213" "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/31.0.1650.63 Safari/537.36"
{supporting files snipped}

Each individual request is utterly plausible: Some human from a Russian IP searches Yandex for a string in Roman script (occasionally including thorn or even yogh, not evident in today's example), and gets all supporting files including analytics.

But, but, but...
#1 Requests always come in sets of 2 or 3, within one or two seconds of each other, from the same search engine. ("lr=213" means Moscow area. Someone in these forums once pointed me to a page that lists all the "lr" values Yandex uses.) Requests are so close together that they're tangled up in logs. On my site, and particularly for these pages, that kind of clustering does not naturally occur. Trust me on this.
#2 Requests are always for ebooks in some form of early English (I've got a clutch of them, spanning the range from OE to barely-Early-Modern).
#3 Some requests are from currently or previously blocked IP ranges-- not server farms but assorted infection-prone machines. As far as I can tell they're all in Russia; don't know if they're really all in Moscow.
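For anyone who wants to pull the search string and the "lr" region code out of referers like the ones above, here is a minimal Python sketch. The function name yandex_query is made up for illustration, and it assumes the referer has the same shape as in the excerpt (host yandex.ru, path /yandsearch, parameters "text" and "lr"); other Yandex URL forms are not handled.

```python
from urllib.parse import urlsplit, parse_qs

def yandex_query(referer):
    """Return (search text, lr code) for a Yandex search referer, else None.

    Hypothetical helper: only recognises the /yandsearch form seen in the logs.
    """
    parts = urlsplit(referer)
    host = parts.hostname or ""
    if not (host == "yandex.ru" or host.endswith(".yandex.ru")):
        return None
    if not parts.path.startswith("/yandsearch"):
        return None
    params = parse_qs(parts.query)        # decodes "+" and %-escapes
    text = params.get("text", [""])[0]    # the search string
    lr = params.get("lr", [None])[0]      # region code, e.g. "213"
    return text, lr

print(yandex_query("http://yandex.ru/yandsearch?text=toryues+boston&lr=213"))
# ('toryues boston', '213')
```

With the decoded search text in hand, it is easy to check whether the words actually occur in the page that was requested.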

It's been going on sporadically for a couple of months. The pattern is so weird that I noticed it right away, but I remain stumped.

Thanks to the unusual content, I have no idea what the equivalent pattern would look like on anyone else's site. About all you can search for is multiple occurrences of /yandsearch with matching hour-and-minute timestamps.
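One way to run that search, as a rough sketch only: the Python below assumes Apache combined log format (as in the excerpt above) and groups Yandex-referred requests by hour-and-minute, keeping only minutes where at least two different IPs show up. The names LINE and yandex_clusters and the min_ips threshold are invented for this example.

```python
import re
from collections import defaultdict

# Combined log format: ip ident user [timestamp] "request" status size "referer" "agent"
LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] "(?P<req>[^"]*)" '
    r'(?P<status>\d{3}) \S+ "(?P<ref>[^"]*)" "(?P<ua>[^"]*)"'
)

def yandex_clusters(lines, min_ips=2):
    """Return {minute: [(ip, request), ...]} for minutes in which at least
    min_ips distinct IPs arrived with a /yandsearch referer."""
    by_minute = defaultdict(list)
    for line in lines:
        m = LINE.match(line)
        if not m or "/yandsearch" not in m["ref"]:
            continue  # unparseable line, or not a Yandex search referral
        minute = m["ts"][:17]  # "04/Dec/2014:08:40" (seconds and zone dropped)
        by_minute[minute].append((m["ip"], m["req"]))
    return {
        minute: hits
        for minute, hits in by_minute.items()
        if len({ip for ip, _ in hits}) >= min_ips
    }

# Two lines from the excerpt, user-agent strings shortened to "-" for space.
sample = [
    '5.228.70.abc - - [04/Dec/2014:08:40:04 -0800] "GET /ebooks/aelfric/aelfric_full.html HTTP/1.1" 200 427034 "http://yandex.ru/yandsearch?text=searige&lr=213" "-"',
    '128.72.134.abc - - [04/Dec/2014:08:40:06 -0800] "GET /ebooks/horn/KingHorn_KH.html HTTP/1.1" 200 119187 "http://yandex.ru/yandsearch?text=toryues+boston&lr=213" "-"',
]
for minute, hits in yandex_clusters(sample).items():
    print(minute, len(hits), "hits")
```

Bucketing by clock minute is crude: a pair of requests at :59 and :00 lands in two different buckets and is missed. On a busy site the threshold would probably need to be higher than 2 as well.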

Angonasec

2:20 pm on Dec 13, 2014 (gmt 0)



Q/
My brother uses TalkTalk.
/Q

I had to Allow a small TT range to permit a relative into our sites, but since I've explained why it is wise to avoid TT, he's switched to a reputable ISP. :)

BT are plain incompetent, another almost-British basket-case.

aristotle

10:07 pm on Dec 13, 2014 (gmt 0)




If any of my mystery visitors are human, I'd expect them to follow the link. If they're not, they will go away with only a small php page instead of a multi-hundred-K ebook with supporting files.

But I wonder if that really settles the matter. For suppose that the app has a directory or catalog that includes descriptions of your ebooks. Then if a user decides that he or she wants a copy of one of your books, and clicks to have it added to their personal library (probably stored in the cloud), the app might send a bot to retrieve the copy from your site and add it to the user's personal library. So although it's a bot that comes to your site, it might have been sent pursuant to a human request. Another possibility is that the app periodically retrieves fresh copies of your books to keep users' libraries up to date with the latest editions.

lucy24

11:23 pm on Dec 13, 2014 (gmt 0)




I simply don't believe a hypothetical app is the explanation. Especially after I tested some of the search strings and found they don't lead to anything on my site. (That is, the words simply don't occur in the relevant texts.) There's also the matter of one affected page in an unrelated directory.

Ebooks, by their nature, are static. A new edition would be under copyright and therefore not downloadable as-is. I'm based in the US, so I use the 1922 cutoff.

And if any app is so inept that it would return a short redirect page and say "Here's your book!" ... well, that's something for the consumer to take up with the app developer.

wilderness

12:03 am on Dec 14, 2014 (gmt 0)




so I use the 1922 cutoff.


Thank you Sonny Bono, what a putz ;)

Major universities had grants and programs already in place, and were forced to cancel those programs when the law was changed from 73 years to 100 years.

lucy24

12:44 am on Dec 14, 2014 (gmt 0)




<topic drift>
and were forced to cancel those programs when the law was changed

But changes were never retroactive; they only applied to the expiration date of copyrights that were still in effect at the time a given law was enacted. (It's actually 95 years, so the next visible rollover will be at the start of 2019, when 1923 becomes available.)

When was it ever 73 years? When I was growing up, it was 28+28 with some convenient loopholes. (Notably "published without a copyright notice" which happened to apply most often to university publications, leading to the mistaken belief that state-- not Federal-- publications aren't copyrighted.) I hate to think of a grant being conditioned on something that will potentially happen 39 years in the future.

What's fun is when a work is long out of copyright in its country of publication, but because the US doesn't subscribe to the Rule of the Shorter Term, it's technically still protected here. But this doesn't apply to anything I've currently got online.

:: idly picturing some living relative of the Paston family, or possibly Aelfric, lawyering up and sending out Cease And Desist orders ::
</drift>

aristotle

1:53 am on Dec 14, 2014 (gmt 0)




Well, Lucy, I may have used the word "edition" improperly. I was actually thinking about your own formatting and styling that determine your presentation of each ebook. If at some point you decide to update a file with some new formatting, fonts, colors, or whatever, that was what I meant by "new edition".

As for the Yandex thing, the creator of the app could have used those as bogus referers to improve the bot's chances of getting past blocks. And with regard to your new redirect to trap bots, it would be hard for any app creator to overcome that.

But as I said earlier, I'm just speculating about possibilities. Maybe someone will come forward with a better explanation - I'd like to see it.

wilderness

2:56 am on Dec 14, 2014 (gmt 0)




<topic drift>

But this doesn't apply to anything I've currently got online.


lucy,
Nearly all my online articles are within the '73 year' boundary.
Copyright is a very controversial topic.

For my widgets, consider that the previously published magazines were the only existing evidence of the copyright:
1) No master digital evidence of the data existed prior to hitting my scanner.
2) In the widget-field most authors were free-lance and the magazine originally compensated the author for the right.
3) Both the magazine and the author are deceased/defunct.
4) thus my own OCR of the individual characters (not an image of the article) created a copyright in its own right and beyond the original copyright.

FWIW, I've also got some articles online where the publications (multiple publications) failed in 1932, well before digital materials were even a possibility. Also, since the magazines failed (many, many magazines failed during the Great Depression in the US), their original typed copies were simply trashed.

Google Books has archived some bound editions of magazines that were part of complete libraries; however, these Google-archived issues are not even readily searchable or accessible. Additionally, they will not be available for another 60-70 years, which is absurd, especially when the magazine is defunct.

Time-Life put hordes and hordes of unused photos by major photographers online in their archives; however, after what was apparently excessive downloading, many of the photos were removed. (Fortunately for me, I saved all the widget photos I was able to locate and categorized them by publication and photographer.)

</drift>

lucy24

3:30 am on Dec 14, 2014 (gmt 0)




<continuing topic drift>
most authors were free-lance and the magazine originally compensated the author for the right

That was probably standard practice. As recently as the 1980's, authors were advised to put the exact words "First Serial Rights" on any MS submission, just to make it clear that they weren't signing over the entire copyright in perpetuity.

thus my own OCR of the individual characters (not an image of the article) created a copyright in its own right and beyond the original copyright

I'd be a lot happier if you had an intellectual-property lawyer confirming this assertion, because "sweat-of-brow copyright" is not an inherent property of US law. Now, if you intentionally introduced typos into the work (in non-essential places, of course!), and someone else parroted the typo, then clearly they'd be stealing from you and not the original rightsholder.

If copyright wasn't renewed, even material as recent as the 1960's may be fair game. (I just looked it up. The current cutoff is 1963.) I know I've worked on Augustan Reprints (university press, no copyright) as recent as about 1980.

they will not be available for another 60-70 years, which is absurd

And then there's the whole vexed issue of "orphan books", which has been pretty well turned on its head by POD ventures.
</drift>
This 38 message thread spans 2 pages: 38