Converting HTML to text

Making an ebook out of a website

         

CromeYellow

3:50 pm on Feb 26, 2003 (gmt 0)

10+ Year Member



Hi

I have recently had a couple of requests from users on one of my sites for a downloadable, printable version of the site.

The trouble is that, while the site is predominantly text (over 90%), there are so many pages that copying and pasting out of every one would take an age.

Does anyone know of a quick-and-nifty way to do this, or an application that will do it for me?

Yours lazily

Cy :)

kyr01

4:05 pm on Feb 26, 2003 (gmt 0)

10+ Year Member



Cy,
Honestly, I wouldn't give away all my content in a printable format so easily. I had similar requests from users, so I made a set of 21 PDF documents that users can request from the site and that are mailed to them automatically. A few things to note:
1- PDF makes it possible to protect documents, so that untrusted users cannot edit, print or even save them;
2- I use a simple PHP mailer which limits requests to a fixed number (4) per email address, to prevent trouble (you never know, someone may decide to use your documents for mail bombing);
3- I have a printable version (still protected against copying and modification) that I send manually to trusted parties;
4- Again, I *strongly* suggest using PDF to prevent modifications. You surely want your logo and domain name on those documents, don't you?
I know it is not the answer you were looking for, but I figured it was worth passing on some suggestions: been there, done that...
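
For anyone wondering what the per-address cap in point 2 might look like, here is a minimal sketch. It is a POSIX shell function rather than PHP, and it is not the actual mailer described above: the function name allow_request, the MAX_REQUESTS variable and the log format (one address per line) are all assumptions made for illustration.

```shell
#!/bin/sh
# Hypothetical cap, matching the "4 per email address" figure in the post.
MAX_REQUESTS=4

# allow_request EMAIL LOGFILE
# Returns 0 and records the request if EMAIL has made fewer than
# MAX_REQUESTS requests so far; returns 1 otherwise.
allow_request() {
    email=$1
    log=$2
    touch "$log"
    # Count lines that match the address exactly (-F fixed string, -x whole line).
    # grep exits non-zero when the count is 0, hence the "|| true".
    count=$(grep -c -F -x -e "$email" "$log" || true)
    if [ "$count" -ge "$MAX_REQUESTS" ]; then
        return 1
    fi
    printf '%s\n' "$email" >> "$log"
    return 0
}
```

A mailer script would call allow_request before sending and refuse the request when it returns non-zero. The match is exact, so a real version would probably want to normalise case first.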

CromeYellow

8:19 pm on Feb 26, 2003 (gmt 0)

10+ Year Member



Hi kyr01

Thanks for the tips. I think I'll use PDF, but allow printing, as that's why people want it in ebook format. I'll use the text protection feature so people can't just copy it.

As you say, that still leaves me with the crushingly boring task of converting all that HTML to text. Any ideas, anyone?

Cy

Romeo

9:30 pm on Feb 26, 2003 (gmt 0)

10+ Year Member



Cy,

If you have Acrobat (the full program, not just Acrobat Reader), just use File > "Open Web Page". Acrobat will connect to the site, fetch all the pages and make a PDF book out of the entire site... nice and easy.

Regards,
R.

andreasfriedrich

9:34 pm on Feb 26, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



The following Perl [perl.com] script expects a list of filenames on STDIN, one per line. Each file is converted to text and saved as filename.txt (the original name with .txt appended).


#!/usr/bin/perl
use strict;
use warnings;
use HTML::Parser ();

my $out;    # output handle for the file currently being converted

# Once <body> opens, print decoded text until </body> is seen.
sub start_handler {
    my ($tagname, $self) = @_;
    return if $tagname ne "body";
    $self->handler(text => sub { print {$out} $_[0] }, "dtext");
    $self->handler(end => sub {
        my ($tag, $parser) = @_;
        $parser->eof if $tag eq "body";    # stop parsing at </body>
    }, "tagname,self");
}

# Read one filename per line from STDIN.
while (my $file = <>) {
    chomp $file;                # drop the trailing newline from the filename
    next unless length $file;   # skip blank lines
    open $out, '>', "$file.txt" or die "Can't open $file.txt: $!\n";
    # Use a fresh parser per file so handlers set for the previous file don't carry over.
    my $p = HTML::Parser->new(api_version => 3);
    $p->handler(start => \&start_handler, "tagname,self");
    $p->parse_file($file) or die "Can't parse $file: $!\n";
    close $out;
}

Run the script like so:

script < filename_list.txt
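
If you don't have filename_list.txt yet, something like the following could build it. This is only a sketch, assuming a local copy of the site's HTML files is on disk; the helper name list_html_files and the directory name site used below are made up for illustration.

```shell
#!/bin/sh
# list_html_files DIR
# Print every .html or .htm file under DIR, one path per line, sorted.
list_html_files() {
    find "$1" -type f \( -name '*.html' -o -name '*.htm' \) | sort
}
```

With the pages in a directory called site, you would run list_html_files site > filename_list.txt and then feed that file to the script as shown above. Paths containing newlines are not handled, since the script reads one filename per line.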

HTH Andreas

CromeYellow

3:27 pm on Feb 27, 2003 (gmt 0)

10+ Year Member



Romeo, that sounds like the business - just what I'm looking for!

I do have Acrobat (I can make PDFs from Word documents), but I'm not sure how to do what you say. Can you point me in the right direction? Am I supposed to be looking within Distiller for this function?

Many thanks

Cy

CromeYellow

3:29 pm on Feb 27, 2003 (gmt 0)

10+ Year Member



P.S. Thanks also to you andreasfriedrich, but I am afraid that has gone over my head like a 747 at cruising altitude. ;)