Forum Library, Charter, Moderators: coopster & jatar k & phranque

Perl Server Side CGI Scripting Forum

    
Reading/parsing HTML email on cpanel server
I can play around with simple text mail but problems with html, help!
explorador

WebmasterWorld Senior Member 5+ Year Member



 
Msg#: 4601645 posted 10:44 pm on Aug 13, 2013 (gmt 0)

Hi there, I'm having problems here.

I need to read html email already on the cpanel server. Emails are stored as files on /mail/server/account/new in some weird format. I'm not using perl to get in the mailbox and read it because it required extra modules and I don't like it, personal choice (it affects emergency migrations from server to server if such things are not installed) so my approach is reading the file.

Yes, I'm reading the file stored there, and it works for simple text emails. The problem is when the user replies using html email, like yahoo, hotmail or outlook configured for html email with signatures, etc. In that case I get a lot of garbage.

It's not as simple as stripping the html, because different mail clients insert different garbage. Any suggestions?

What this is all about:
I'm building my own hosted tool for customer support, I have a network of sites and I'm tired of using outlook/opera to answer each mail from specific accounts. So my tool allows me to read the emails sent via website form and then reply with templates or whatever comes to my mind, keeping the original server email account. I can see the whole cases as if it was a forum, a ticket system. Identifying the emails per case and keeping it that way was easier than I thought, but html email is not.

Please keep in mind: personal choice, no mysql, no third party ticket systems, no Gmail, no paid things, pure personal choice. I just need to read the html email and convert it into plain email text.

Thanks in advance.
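
Since the approach here is to read the message file directly, a first step is usually to separate the headers from the body. The following is a minimal sketch using only core Perl, assuming each file holds one message with headers, a blank line, and then the body. The function name split_message and the commented-out path are hypothetical, not taken from the thread.

```perl
use strict;
use warnings;

# Split a raw mail message into a hash of headers and a body string.
# Header names are lowercased; folded (continued) header lines are unfolded.
sub split_message {
    my ($raw) = @_;
    $raw =~ s/\r\n/\n/g;                          # normalize line endings
    my ($head, $body) = split /\n\n/, $raw, 2;    # first blank line ends the headers
    $head = '' unless defined $head;
    $body = '' unless defined $body;
    $head =~ s/\n[ \t]+/ /g;                      # unfold continuation lines
    my %headers;
    for my $line (split /\n/, $head) {
        next unless $line =~ /^([^:\s]+):\s*(.*)$/;
        $headers{lc $1} = $2;                     # last occurrence wins
    }
    return (\%headers, $body);
}

# Hypothetical usage with a real file (the path is a placeholder):
# open my $fh, '<', '/path/to/account/new/message-file' or die $!;
# my $raw = do { local $/; <$fh> };

my $raw = "From: a\@example.com\nSubject: Hello\n there\n\nBody line 1\nBody line 2\n";
my ($h, $body) = split_message($raw);
print "$h->{subject}\n";    # prints "Hello there"
print $body;
```

Working on the body alone keeps header lines such as Content-Type out of the text that later gets stripped.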

 

incrediBILL

WebmasterWorld Administrator, Top Contributor of All Time, 5+ Year Member, Top Contributors Of The Month



 
Msg#: 4601645 posted 12:57 am on Aug 14, 2013 (gmt 0)

Any chance you want to do this in PHP as I have some easy solutions for your problem, including code, just not in PERL.

Either way, you should use the IMAP interface which allows you to open the mailbox and download the content whether you're hosting it and have direct access or whether it's on a different server than the website.

Accessing the mailbox files directly is frowned upon and usually not possible in most servers because of the security settings which jail the software to the user account making IMAP the way to go.

Brett_Tabke

WebmasterWorld Administrator, Top Contributor of All Time, 10+ Year Member



 
Msg#: 4601645 posted 1:10 am on Aug 14, 2013 (gmt 0)

I do something similar with a couple of sites myself. It is slick to do it that way. Having access to the direct mbox file is a boon for easy checking without having to be annoyed by a pop3 or imap checker.

Are the stock perl HTML parsers no help? You can put them in your own code and skip the whole 'module' issue. There are a lot of html strippers out there that are pretty easy to use.

Brett_Tabke

WebmasterWorld Administrator, Top Contributor of All Time, 10+ Year Member



 
Msg#: 4601645 posted 1:11 am on Aug 14, 2013 (gmt 0)

Here is my custom html strip code. It isn't perfect - but it hits 99% of the web.



sub strip_html {
    my $HTML_Text = shift;
    $HTML_Text =~ s/\&nbsp\;/ /gi;

    # strip HTML tags that contain an ALT="foo" and replace with the literal string "foo":
    $HTML_Text =~ s/<[^>]*\s+ALT\s*=\s*"(([^>"])*)"[^>]*>/ $1 /ig;

    # The following code strips everything inside <SCRIPT..>...</SCRIPT> and <STYLE..>...</STYLE> tags out of the HTML text:
    my $NoScript = '';
    foreach (split(m!(\<\/SCRIPT>|\<\/STYLE>)!i, $HTML_Text)) {
        next unless $_;
        # /s lets .* cross line breaks, so text before a <SCRIPT or <STYLE on a later line is still kept
        if (m!^(.*)(\<SCRIPT|\<STYLE)!is) {
            $NoScript .= ' '.$1;
        }
        else {
            $NoScript .= ' '.$_;
        }
    }
    $HTML_Text = $NoScript;

    $HTML_Text = &entity_strip($HTML_Text);

    $HTML_Text =~ s!<([^>]*?)>! !g; # strip all HTML tags and replace with blank spaces
    $HTML_Text =~ s!(\W|\_)! !g; # replace non-alphanumerics and underscores with blank spaces
    #print "$HTML_Text\n\n";
    return($HTML_Text);
}


sub entity_strip {
    my $t = shift;
    my @entity = (
"lt", "gt", "amp", "quot", "nbsp", "iexcl", "cent", "pound", "curren", "yen", "brvbar", "sect", "uml", "copy", "ordf", "laquo", "not", "shy", "reg", "macr", "deg", "plusmn", "sup2", "sup3", "acute",
"micro", "para", "middot", "cedil", "sup1", "ordm", "raquo", "frac14", "frac12", "frac34", "iquest", "Agrave", "Aacute", "Acirc", "Atilde", "Auml", "Aring", "AElig", "Ccedil", "Egrave", "Eacute", "Ecirc", "Euml", "Igrave", "Iacute",
"Icirc", "Iuml", "ETH", "Ntilde", "Ograve", "Oacute", "Ocirc", "Otilde", "Ouml", "times", "Oslash", "Ugrave", "Uacute", "Ucirc", "Uuml", "Yacute", "THORN", "szlig", "agrave", "aacute", "acirc", "atilde", "auml", "aring", "aelig",
"ccedil", "egrave", "eacute", "ecirc", "euml", "igrave", "iacute", "icirc", "iuml", "eth", "ntilde", "ograve", "oacute", "ocirc", "otilde", "ouml", "divide", "oslash", "ugrave", "uacute", "ucirc", "uuml", "yacute", "thorn","yuml");

    # Replace each "&name" with a space. The trailing ";" is left in place
    # and is removed later by the non-alphanumeric strip in strip_html.
    foreach my $i (@entity) {
        $t =~ s/\&$i/ /g;
    }
    return($t);
}

DrDoc

WebmasterWorld Senior Member, Top Contributor of All Time, 10+ Year Member



 
Msg#: 4601645 posted 3:31 pm on Aug 14, 2013 (gmt 0)

I don't understand this part:

$t =~ s/\&$i/ /gex;

So, I would end up with a whole bunch of " ;" all over the place instead of their proper character equivalents? Isn't it better to have "r vi dr n?" instead of " ;r vi d ;r ;n?"
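
For readers who do want the character equivalents rather than stripped entities, here is a minimal sketch of decoding instead of stripping. The function name decode_entities_basic and the small lookup table are illustrative assumptions; the table covers only a handful of names, maps nbsp to a plain space by choice, and leaves unknown entities untouched. The CPAN module HTML::Entities handles the full set, but it is not a core module.

```perl
use strict;
use warnings;

# A deliberately small, hypothetical table; extend as needed.
my %ENTITY = (
    lt => '<', gt => '>', amp => '&', quot => '"', nbsp => ' ',
    aacute => "\xE1", eacute => "\xE9", iacute => "\xED",
    oacute => "\xF3", uacute => "\xFA", ntilde => "\xF1",
);

# Map the inside of one entity ("amp", "#233", "#xE9") to its character.
sub entity_char {
    my ($name) = @_;
    return chr(hex($1)) if $name =~ /^#x([0-9A-Fa-f]+)$/;
    return chr($1)      if $name =~ /^#(\d+)$/;
    return exists $ENTITY{$name} ? $ENTITY{$name} : "&$name;";
}

# Decode named and numeric entities in a single pass, so that
# "&amp;lt;" becomes "&lt;" and is not decoded a second time.
sub decode_entities_basic {
    my ($t) = @_;
    $t =~ s/&(#x[0-9A-Fa-f]+|#\d+|[A-Za-z][A-Za-z0-9]*);/entity_char($1)/ge;
    return $t;
}

print decode_entities_basic("Tom &amp; Jerry &lt;3"), "\n";   # prints "Tom & Jerry <3"
```

Note that this requires the trailing semicolon, so it only touches well-formed entities; the stripping approach above is more forgiving of sloppy markup.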

explorador

WebmasterWorld Senior Member 5+ Year Member



 
Msg#: 4601645 posted 3:38 pm on Aug 14, 2013 (gmt 0)

IncrediBILL: Any chance you want to do this in PHP as I have some easy solutions for your problem, including code, just not in PERL.

Either way, you should use the IMAP interface which allows you to open the mailbox and download the content whether you're hosting it and have direct access or whether it's on a different server than the website.

Accessing the mailbox files directly is frowned upon and usually not possible in most servers because of the security settings which jail the software to the user account making IMAP the way to go.

Thanks for the kind offer, please allow me to try a bit more because there is already a lot of code and interface done and built in PERL, I went as far as I could by my own means/knowledge/limits. I'm near solving it or giving up on that so I'll let you know, thanks.

Thanks Brett for the kind contribution of code. Yes, I have direct access: easy checking with no need of pop3 or imap. Besides, sometimes the servers are working but the mail service goes down, and there is still no problem accessing the files! I tried parsers and hand-coded html strippers that worked fine with test texts but not with the actual file, so I'm stuck there. I will try the code you kindly posted and will come back after doing some tests.

Thanks

explorador

WebmasterWorld Senior Member 5+ Year Member



 
Msg#: 4601645 posted 9:01 pm on Aug 14, 2013 (gmt 0)

Done, thanks.

Well I have nothing to say but THANK YOU, thanks IncrediBILL for the code-help offer, and thanks specially to you Brett for your kind help with the code.

HTML emails in that file format on cpanel servers have a strange format: some have the same content 2 times, some 3 times. I was having problems dealing with the trash but also with getting only the content I was interested in. I don't fully understand why, but I was stripping all the garbage and some of it wouldn't go away, and some filters worked on beta tests but not on the actual file.

From the code I turned this: (it was returning an error) :)

$HTML_Text =~ s!<([^>]*?)>!g;# strip all HTML tag and replace with blank spaces:
$HTML_Text =~ s!(\W|\_)!g;# Strip non-alphanumerics and underscores:


Into this only:

$HTML_Text =~ s/<([^>]*?)>/ /g;


The following part is good:
$HTML_Text =~ s!(\W|\_)!g

But it was causing me problems: at the end I got text with random spaces and line breaks, no commas, no semicolons, no / etc. I was trying to get rid of the garbage at the beginning of the file and go straight to the interesting part, but somehow the regex refused to work on the string. So instead of continuing to try, I replaced the "Content-Transfer-Encoding\: quoted-printable" with a pipe and then performed a plain SPLIT, done!

From there all was very easy: removing the unnecessary line breaks, leaving only 2 contiguous max. I also got rid of the content type and char set (independent of the number or code). And finally I added some replacements ("ADDED") to get back some characters that were being converted into weird stuff.

I even kept the ">" indicating "old message" to make the replies and archive more understandable because the visitor gets a reply with the last message. Ignore the "" it was added for other reasons, now not needed and removed.

The email ticket system is almost, almost ready to go. Thanks!

Here is the code:


sub strip_html {
$HTML_Text = shift;
$HTML_Text =~ s/\&nbsp\;/ /gi;

### ADDED #################################
$HTML_Text=~ s/=20|=\?.*?Q\?|=\?.*?q\?|\?=/ /g;
$HTML_Text=~ s/=E1/a/g;
$HTML_Text=~ s/=E9/e/g;
$HTML_Text=~ s/=ED/i/g;
$HTML_Text=~ s/=F3|ó/o/g;
$HTML_Text=~ s/=B4/u/g;
$HTML_Text=~ s/=F1//g;
$HTML_Text=~ s/=2C/\,/g;
$HTML_Text=~ s/=3B/\;/g;
$HTML_Text=~ s/^ //g;
###########################################

$HTML_Text =~ s/<[^>]*\s+ALT\s*=\s*"(([^>"])*)"[^>]*>/ $1 /ig;
$NoScript = '';
foreach (split(m!(\<\/SCRIPT>|\<\/STYLE>)!i, $HTML_Text)) {
next unless $_;
if (m!^(.*)(\<SCRIPT|\<STYLE)!i) {$NoScript .= ' '.$1;}else {$NoScript .= ' '.$_;}
}

$HTML_Text = $NoScript;
$HTML_Text = &entity_strip($HTML_Text);
$HTML_Text =~ s/<([^>]*?)>/ /g;

############
$HTML_Text=~ s/Content-Transfer-Encoding\: quoted-printable/\|/gi;
($p1,$p2) = split(/\|/,$HTML_Text);$HTML_Text=$p2;
$HTML_Text =~ s/=\n|Content-Type\: text\/html\; charset=\".*?\"|--\_.*?\_|>\n\n//g;
$HTML_Text =~ s/\n+/\n/g;
$HTML_Text =~ s/To\:/\n---------------------------------------------\n\nTo\: /;
############

return($HTML_Text);
}


sub entity_strip {
my $t =shift;
@entity = ("lt", "gt", "amp", "quot", "nbsp", "iexcl", "cent", "pound", "curren", "yen", "brvbar", "sect", "uml", "copy", "ordf", "laquo", "not", "shy", "reg", "macr", "deg", "plusmn", "sup2", "sup3", "acute", "micro", "para", "middot", "cedil", "sup1", "ordm", "raquo", "frac14", "frac12", "frac34", "iquest", "Agrave", "Aacute", "Acirc", "Atilde", "Auml", "Aring", "AElig", "Ccedil", "Egrave", "Eacute", "Ecirc", "Euml", "Igrave", "Iacute", "Icirc", "Iuml", "ETH", "Ntilde", "Ograve", "Oacute", "Ocirc", "Otilde", "Ouml", "times", "Oslash", "Ugrave", "Uacute", "Ucirc", "Uuml", "Yacute", "THORN", "szlig", "agrave", "aacute", "acirc", "atilde", "auml", "aring", "aelig", "ccedil", "egrave", "eacute", "ecirc", "euml", "igrave", "iacute", "icirc", "iuml", "eth", "ntilde", "ograve", "oacute", "ocirc", "otilde", "ouml", "divide", "oslash", "ugrave", "uacute", "ucirc", "uuml", "yacute", "thorn","yuml");
foreach $i (@entity) {$t =~ s/\&$i/ /g;}
return($t)
}
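
The =E1, =E9, =2C style sequences handled one by one in the "ADDED" block are quoted-printable escapes: "=" followed by two hex digits stands for one byte, and "=" at the end of a line is a soft line break. As a hedged alternative to listing each code, the whole encoding can be decoded generically. This is a minimal sketch; the function name qp_decode is made up for illustration, and the result is in whatever charset the message declared (for example iso-8859-1), not necessarily UTF-8. Perl also ships the core module MIME::QuotedPrint, whose decode_qp function does the same job.

```perl
use strict;
use warnings;

# Decode a quoted-printable body using only regexes.
sub qp_decode {
    my ($s) = @_;
    $s =~ s/[ \t]+(?=\r?\n)//g;                   # trailing whitespace on a line is not significant
    $s =~ s/=\r?\n//g;                            # "=" at end of line is a soft line break: join the lines
    $s =~ s/=([0-9A-Fa-f]{2})/chr(hex($1))/ge;    # "=XX" is the byte with hex value XX
    return $s;
}

print qp_decode("total=3D5=2C ok=3B soft=\nwrap"), "\n";   # prints "total=5, ok; softwrap"
```

Decoding before stripping tags also rejoins tags that a soft line break had split across two lines, which may be one reason a regex works on test strings but not on the raw file.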

[edited by: phranque at 12:35 am (utc) on Aug 15, 2013]
[edit reason] disabled graphic smileys [/edit]

Brett_Tabke

WebmasterWorld Administrator, Top Contributor of All Time, 10+ Year Member



 
Msg#: 4601645 posted 12:01 pm on Aug 15, 2013 (gmt 0)

Ya, there are a couple of pipe characters in there that don't translate.

[webmasterworld.com...]

Doc - I strip them down that way for the purpose of the code base I was working with. I didn't want entities or multibyte chars in the code (it needed to work with ascii/ansi for further processing). So stripping off semi-colons was trivial later on.

Brett_Tabke

WebmasterWorld Administrator, Top Contributor of All Time, 10+ Year Member



 
Msg#: 4601645 posted 12:10 pm on Aug 15, 2013 (gmt 0)

There is also the cheap way of stripping cleanly formatted html by walking through and just ripping the tags out by splitting the html on the left and right brackets.

The theory is to start with the html in a string with return chars removed. Split the entire html on the left brackets of the html tags.

@html=split (/\</,$html); #split file on left bracket of all the html <tags>

So now each scalar in @html starts with the remaining fragment of a tag, followed by the text that came after it: tag on the left, text on the right. So, "<b> test" has become "b> test". Now we need to rip out the rest of the "b>" and leave us with "test".


foreach $line (@html) {
# a fragment with no right bracket is the text before the first tag, so keep it whole
if ($line =~ /\>/) {
($tagjunk,$line)=split (/\>/,$line,2); #strip off the remaining html tag and the text is now on the right.
}
push (@final,$line); #or do whatever you want to rebuild the resulting text.
}
$text = join(" ",@final);

The only breakdown in the above code working is when people embed html in an html comment. So, you may want a preprocessor routine that strips comments.

The easier way to do that in one regex is left as an exercise for the reader ;-)
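
One possible take on that exercise, offered as a sketch rather than the intended answer: remove comments first, then script and style blocks, then every remaining tag. The function name strip_tags is hypothetical. It uses a few substitutions rather than literally one, and it will still be fooled by a ">" inside an attribute value.

```perl
use strict;
use warnings;

# Strip comments, script/style blocks, and tags; tidy the leftover spaces.
sub strip_tags {
    my ($html) = @_;
    $html =~ s/<!--.*?-->//gs;                               # comments first, since they may contain tags
    $html =~ s/<(script|style)\b[^>]*>.*?<\/\1\s*>//gis;     # script and style blocks, contents included
    $html =~ s/<[^>]*>/ /g;                                  # any remaining tag becomes a space
    $html =~ s/[ \t]+/ /g;                                   # collapse runs of spaces and tabs
    $html =~ s/^ +| +$//gm;                                  # trim each line
    return $html;
}

my $html = '<p>Hello <b>world</b><!-- <i>hidden</i> --></p><style>p{}</style>';
print strip_tags($html), "\n";   # prints "Hello world"
```

Because [^>] also matches newlines, tags that span several lines are handled without removing the returns first.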

explorador

WebmasterWorld Senior Member 5+ Year Member



 
Msg#: 4601645 posted 9:19 pm on Aug 15, 2013 (gmt 0)

Nice. My script is fully working now; I will only add a few improvements. As for the split/regex, I don't fully target the < or > chars because I decided to keep them in the email responses. It helps a lot to read the last message along with the previous response.

Last message
> previous one

I even added reply with last message, so I'm able to answer the last email from one page alone, including a quote of the previous message to make it easy for the reader on the other side.
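
The "reply with last message" layout shown above can be sketched in a few lines. The function names quote_message and build_reply are illustrative assumptions, not the names used in the actual tool.

```perl
use strict;
use warnings;

# Prefix every line of the previous message with "> ".
# Empty lines get a bare ">" so no trailing space is added.
sub quote_message {
    my ($text) = @_;
    return join '', map { length $_ ? "> $_\n" : ">\n" } split /\n/, $text;
}

# Put the new reply on top and the quoted previous message below it.
sub build_reply {
    my ($reply, $previous) = @_;
    return "$reply\n\n" . quote_message($previous);
}

print build_reply("Last message", "previous one\n");
# prints:
# Last message
#
# > previous one
```

Quoting an already quoted message simply stacks the markers ("> > ..."), which keeps older exchanges distinguishable in the archive.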

As for the loop versus the one-regex approach: yes, I got it working, cleaning strings of html tags in just one regex, but so far I don't fully understand why that worked on tests but not on the file, even after removing line breaks, returns, etc. first. I use that one regex to clean pasted html content on one CMS I have (it helps a lot to copy and paste my own content when I'm doing some special tasks). Anyway, I'm now getting fully clean text with no extra spaces or line breaks.

Thanks again.
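
A note on the repeated content mentioned earlier in the thread: one likely cause is that HTML mail is commonly sent as a multipart message that carries the same text in several parts (typically text/plain and text/html), separated by a boundary line declared in the Content-Type header. Under that assumption, picking the text/plain part avoids most of the html stripping. The sketch below uses only core Perl; find_part is a hypothetical name, and nested multipart messages are not handled.

```perl
use strict;
use warnings;

# Return the body of the first part whose Content-Type starts with
# $want_type, or nothing (undef in scalar context) if no such part exists.
sub find_part {
    my ($raw, $want_type) = @_;
    $raw =~ s/\r\n/\n/g;                                      # normalize line endings
    return unless $raw =~ /boundary\s*=\s*"?([^";\s]+)"?/i;   # not a multipart message
    my $boundary = $1;
    # Parts are separated by lines of the form "--boundary" (or "--boundary--" at the end).
    my @parts = split /^--\Q$boundary\E(?:--)?[ \t]*\n?/m, $raw;
    shift @parts;                                             # drop the top-level headers and preamble
    for my $part (@parts) {
        my ($head, $body) = split /\n\n/, $part, 2;           # part headers, blank line, part body
        next unless defined $body;
        return $body if $head =~ /^Content-Type:\s*\Q$want_type\E/mi;
    }
    return;
}

my $msg = join "\n",
    'Content-Type: multipart/alternative; boundary="XYZ"', '',
    '--XYZ', 'Content-Type: text/plain; charset="iso-8859-1"', '', 'Hello plain', '',
    '--XYZ', 'Content-Type: text/html; charset="iso-8859-1"', '', '<p>Hello html</p>', '',
    '--XYZ--', '';

print find_part($msg, 'text/plain');
```

If a message has no text/plain part, the same call with 'text/html' returns the html part, which can then go through the stripping code discussed above.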


All trademarks and copyrights held by respective owners. Member comments are owned by the poster.
WebmasterWorld is a Developer Shed Community owned by Jim Boykin.
© Webmaster World 1996-2014 all rights reserved