Get Parts from MIME formatted emails

         

usrbin

9:13 pm on Aug 24, 2016 (gmt 0)

I have a string called $email which contains an incoming email which is in the MIME format. I'm trying to get the to address, from address, from name, and subject line and am using the following script:

$email = '
...
From: Display Name <from@domain.com>
Date: Tue, 23 Aug 2016 19:56:48 -0400
Message-ID: <CAKfQz8NQHvunh65vVnusnxHQSkAczCxYzy_3TorWoP+FWgDw@mail.gmail.com>
Subject: Popsicle
To: to@url.com
...';

if (preg_match('/\nTo: (.*)\n/', $email, $match)) $to = $match[1];
if (preg_match('/\nFrom: (.*) <(.*)>\n/', $email, $match)) { $user = $match[1]; $from = $match[2]; }
if (preg_match('/\nSubject: (.*)\n/', $email, $match)) $subj = $match[1];


So far the script works fine, but I'm wondering if I'm setting myself up for problems.

Are there always spaces after the colon? Does the from line always have a display name with the email address in brackets? Does the to line ever have a display name with the email in brackets? If there are multiple to email addresses will it break the script?

I'm trying to keep this script lightweight, so I'd rather not load a large library, but is there a more resilient way to do this?

whitespace

10:07 pm on Aug 24, 2016 (gmt 0)

"Are there always spaces after the colon?"


I'm not sure that this is explicitly stated in the spec. I think you should make the whitespace optional, e.g. \s* (zero or more whitespace characters).
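
A minimal sketch of the optional-whitespace idea, for illustration only. The helper name get_header() and the sample message are made up here, not part of the original script. It uses [ \t]* rather than \s* so that an empty header cannot make the match spill onto the following line, and like the original it still scans the whole message rather than just the header block.

```php
<?php
// Hypothetical helper (not from the original post): returns the value of the
// first line that starts with the given header name, or null if none exists.
function get_header(string $email, string $name): ?string
{
    // m: ^ and $ match at line boundaries; i: header names are case-insensitive.
    // [ \t]* makes the whitespace after the colon optional.
    $pattern = '/^' . preg_quote($name, '/') . ':[ \t]*(.*)$/mi';
    if (preg_match($pattern, $email, $match)) {
        // trim() also removes a trailing \r left over from CRLF line endings.
        return trim($match[1]);
    }
    return null;
}

// Sample message (assumed): note the missing and the extra spaces after the colons.
$email = "From: Display Name <from@domain.com>\nSubject:Popsicle\nTo:   to@url.com\n";
$subj = get_header($email, 'Subject'); // "Popsicle"
$to   = get_header($email, 'To');      // "to@url.com"
```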

"Does the from line always have a display name with the email address in brackets?"


No, it could consist solely of an email address. Incidentally, the "display name" could also be quoted.

"Does the to line ever have a display name with the email in brackets?"


Yes, as with the "From:" header, it can take either form. If you only want the actual email address, you could search for anything that looks like an "email address" rather than matching the precise format. Be aware, though, that the display name can itself look like an email address and yet differ from the actual address!
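
One way to cope with both forms, sketched under assumptions: parse_address() is a hypothetical helper, and it only handles a single address that is already on one line. It takes the address from the angle brackets when they are present, so a display name that merely looks like an email address is not mistaken for the real one.

```php
<?php
// Hypothetical helper: splits one address into display name and email.
function parse_address(string $value): array
{
    $value = trim($value);
    // Form 1: optional (possibly quoted) display name followed by <address>.
    if (preg_match('/^(.*?)\s*<([^<>]+)>$/', $value, $match)) {
        // Strip surrounding whitespace and double quotes from the name.
        $name = trim($match[1], " \t\"");
        return ['name' => $name, 'email' => trim($match[2])];
    }
    // Form 2: a bare address with no display name.
    return ['name' => '', 'email' => $value];
}

$a = parse_address('Display Name <from@domain.com>');
$b = parse_address('to@url.com');
$c = parse_address('"fake@example.org" <real@example.com>');
```

Here $c['email'] is the bracketed address, not the look-alike in the display name.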

"If there are multiple to email addresses will it break the script?"


Currently that is quite probable. Multiple "To:" addresses are comma separated and there isn't necessarily a newline between them. Do you just want to grab the first one, or all of them?
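
If you want all of them, a rough sketch of splitting the list follows. split_addresses() is a made-up name, and it assumes the whole header value has already been joined onto one line. A plain explode(',') would break on a quoted display name that contains a comma, so the pattern only splits on commas outside double quotes.

```php
<?php
// Hypothetical helper: splits a To/Cc header value into individual addresses.
function split_addresses(string $value): array
{
    // Split on commas followed by an even number of double quotes,
    // i.e. commas that are not inside a quoted display name.
    $parts = preg_split('/,(?=(?:[^"]*"[^"]*")*[^"]*$)/', $value);
    $parts = array_map('trim', $parts);
    // Drop empty entries, e.g. from a trailing comma, and reindex.
    return array_values(array_filter($parts, function ($part) {
        return $part !== '';
    }));
}

$to = split_addresses('"Doe, Jane" <jane@example.com>, to@url.com');
// $to[0] is '"Doe, Jane" <jane@example.com>', $to[1] is 'to@url.com'
```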

"I'm trying to keep this script lightweight so I'm not trying to load a large library, but is there a more resilient way to do this?"


Use a more specific regex and validate what you extract.
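
For the validation part, PHP's built-in filter_var() with FILTER_VALIDATE_EMAIL is one lightweight option. The wrapper below is only a sketch with a hypothetical name, and the filter checks syntax only, not whether the mailbox exists.

```php
<?php
// Hypothetical helper: returns the address if it is syntactically valid,
// otherwise null.
function valid_email_or_null(string $candidate): ?string
{
    $candidate = trim($candidate);
    // filter_var() returns the filtered value on success, false on failure.
    $result = filter_var($candidate, FILTER_VALIDATE_EMAIL);
    return $result === false ? null : $result;
}

$ok  = valid_email_or_null(' from@domain.com '); // surrounding spaces are trimmed
$bad = valid_email_or_null('Display Name');      // not an address
```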

Whether it is enough is dependent on your use case... are you only dealing with specific emails from a known group of users? How accurate does this data need to be? Will your script be vulnerable if you grab the wrong data (or the regex fails)?

Also note that you are only checking for the \n (i.e. LF, Line Feed) character. MIME headers should be terminated by \r\n (i.e. CRLF, Carriage Return + Line Feed). With your regex above you are possibly grabbing an additional whitespace character (the \r), which would need to be trimmed later.
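
To illustrate the stray \r, here is a small sketch using an assumed sample message with CRLF line endings. The second pattern accepts both CRLF and bare LF; the capture has to be lazy, otherwise (.*) would still swallow the \r.

```php
<?php
// Assumed sample message using CRLF line endings.
$email = "From: Display Name <from@domain.com>\r\nSubject: Popsicle\r\nTo: to@url.com\r\n\r\nBody text\r\n";

// Original-style pattern: "." also matches \r, so the capture keeps it.
preg_match('/\nSubject: (.*)\n/', $email, $match);
$raw = $match[1];  // "Popsicle\r"

// \r?\n with a lazy capture: works for CRLF and LF, nothing left to trim.
preg_match('/\nSubject: (.*?)\r?\n/', $email, $match);
$subj = $match[1]; // "Popsicle"
```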

usrbin

10:27 am on Aug 25, 2016 (gmt 0)

Thank you very much for your detailed response!