trafilatura

         

Pfui

11:22 pm on Aug 1, 2023 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Courtesy of a Russian cloud. No robots.txt request. Hard to say whether it was manually dispatched: the filename it hit included a trailing left angle bracket (.html<), as if someone had copy-pasted the URL. Or not :)

trafilatura/1.6.1 (+https://github.com/adbar/trafilatura)

Among other unwanted effects... "Its main applications are web crawling, downloads, scraping, and extraction of main texts, metadata and comments." https://pypi.org/project/trafilatura/
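For anyone who wants to check their own logs for the same visitor, here is a minimal sketch in Python. It assumes the Apache/nginx "combined" log format, and the names LOG_LINE and flag_request are made up for illustration. It flags any request whose user-agent contains "trafilatura" or whose path contains an angle bracket, raw or encoded as %3C. It only reports matches; it does not block anything.

```python
import re

# Assumption: access logs use the Apache/nginx "combined" format.
# Adjust the pattern if your server logs a different layout.
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<proto>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<ua>[^"]*)"'
)


def flag_request(line):
    """Return a list of reasons a log line looks suspicious (empty if none)."""
    m = LOG_LINE.match(line)
    if m is None:
        # Line does not fit the assumed format; nothing to report.
        return []
    reasons = []
    # The UA string seen in this thread starts with "trafilatura/1.6.1".
    if "trafilatura" in m.group("ua").lower():
        reasons.append("trafilatura user-agent")
    # A stray "<" (raw or percent-encoded) after the filename, as in ".html<".
    path = m.group("path").lower()
    if "<" in path or "%3c" in path:
        reasons.append("angle bracket in requested path")
    return reasons


if __name__ == "__main__":
    # Hypothetical sample line; the IP is from a documentation-only range.
    sample = (
        '203.0.113.5 - - [01/Aug/2023:23:22:00 +0000] '
        '"GET /page.html%3C HTTP/1.1" 404 512 "-" '
        '"trafilatura/1.6.1 (+https://github.com/adbar/trafilatura)"'
    )
    print(flag_request(sample))
```

To scan a whole file, call flag_request on each line and print the lines that return a non-empty list. Keep in mind the user-agent is trivially changed by whoever runs the tool, so a match only catches visitors who left the default in place.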



[edited by: not2easy at 12:55 pm (utc) on Aug 12, 2023]
[edit reason] unlinked for readability [/edit]

not2easy

12:57 pm on Aug 12, 2023 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



I don't know how I missed this. I usually make a note of any new UA I haven't seen before, and I have no note for this one. Thank you, Pfui: a new thing to look out for. :(

lucy24

5:14 pm on Aug 12, 2023 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



"Its main applications are web crawling, downloads, scraping, and extraction of main texts, metadata and comments."
In short, “undesirable stuff”.

My first stop was the dictionary, which didn’t have trafilatura as a discrete lexical item, but did have trafilare. (My second stop was the English dictionary, which tells me that to “wiredraw”--one word--is to draw a metal into wire.) You can see the thought process.