regex problem locating email

obfuscate email automatically with regex

         

techtheatre

2:31 am on Jun 24, 2009 (gmt 0)

10+ Year Member



I have a content management system I use for my clients. Sometimes (even though they shouldn't) they put email addresses on pages of the websites. While not completely foolproof, I want to convert the characters of each email address into their ASCII HTML entity equivalents in an effort to stop at least some basic spam spiders from finding them. I have all my page content in a PHP variable (let's call it $content for this example). Just before displaying $content on the website, I want to run the entire string through a function findEmails() to automatically perform the replacement. I found a couple of scripts online that I have spliced together into what I THOUGHT would work, but every variation I try keeps generating errors.

<?php
$content = 'Some text with an email@none.com and maybe another address@somewhere.org.';

echo findEmails($content);

function convertAsciiEmail($email)
{
$obfuscatedEmail='';//declare variable
$length = strlen($email);
for ($i = 0; $i < $length; $i++)
{
$obfuscatedEmail .= "&#" . ord($email[$i]); // creates ASCII HTML entity
}
$return = '<a href="mailto:' . $obfuscatedEmail . '">'.$obfuscatedEmail.'</a>';
return $return;
}

function findEmails($string)
{
$pattern = "[_a-z0-9-]+(\.[_a-z0-9-]+)*@[a-z0-9-]+(\.[a-z0-9-]+)*(\.[a-z]{2,3})";
//$pattern = "/^[_a-z0-9-]+(\.[_a-z0-9-]+)*@[a-z0-9-]+(\.[a-z0-9-]+)*(\.[a-z]{2,3})$/";
//$pattern = '/^[^\W][a-zA-Z0-9_]+(\.[a-zA-Z0-9_]+)*\@[a-zA-Z0-9_]+(\.[a-zA-Z0-9_]+)*\.[a-zA-Z]{2,4}$/';
//$pattern = '[_a-z0-9-]+(\.[_a-z0-9-]+)*@[a-z0-9-]+(\.[a-z0-9-]+)*(\.[a-z]{2,3})';
//$pattern = '^[a-zA-Z0-9._-]+@[a-zA-Z0-9-]+\.[a-zA-Z.]{2,5}$';
preg_match_all($pattern, $string, $split);
foreach ($split[0] as $value)
{
$email_to_find = $value[0];
$string = eregi_replace($email_to_find,convertAsciiEmail($value[0]),$string);
}
return $string;
}
?>

You can see above the various regex variations I have tried to use to locate the email addresses that may be in my $content. I am not even sure at this point whether the problem is in the regex string I am using or something else completely. THANKS!

The current error message is:

Warning: preg_match_all() [function.preg-match-all]: Unknown modifier '+' in /htdocs/stuff/email-test.php on line 26
Warning: Invalid argument supplied for foreach() in /htdocs/stuff/email-test.php on line 27

compatant

8:59 am on Jun 24, 2009 (gmt 0)

Try this :)

<?php

function get_emails ($str)
{
$emails = array();
preg_match_all("/\b\w+\@\w+[\.\w+]+\b/", $str, $output);
foreach($output[0] as $email) array_push ($emails, strtolower($email));
if (count ($emails) >= 1) return $emails;
else return false;
}

$str = 'Some text with an email@none.com and maybe another address@somewhere.org.';

$emails = get_emails ($str);

function convertAsciiEmail($email)
{
$obfuscatedEmail='';//declare variable
$length = strlen($email);
for ($i = 0; $i < $length; $i++)
{
$obfuscatedEmail .= "&#" . ord($email[$i]) . ";"; // creates ASCII HTML entity (the trailing semicolon terminates it)
}
$return = '<a href="mailto:' . $obfuscatedEmail . '">'.$obfuscatedEmail.'</a>';
return $return;
}

//print_r ($emails);
foreach($emails as $no => $email)
{
echo convertAsciiEmail($email)."<br />";

}
?>

techtheatre

8:45 pm on Jun 24, 2009 (gmt 0)

Well...yes, that does locate the email addresses and then convert them to ASCII, however the idea is to leave the original content intact and just replace the email addresses. Also these functions will ideally be self-contained so that one function can be applied to the whole string, and the "cleaned" output returned. While this was successful, I think it may be making things more difficult than necessary rather than easier. As far as I can tell, the only problem I am having is some sort of syntax issue in my findEmails() function:

function findEmails($string)
{
$pattern = "[_a-z0-9-]+(\.[_a-z0-9-]+)*@[a-z0-9-]+(\.[a-z0-9-]+)*(\.[a-z]{2,3})";
preg_match_all($pattern, $string, $split);
foreach ($split[0] as $value)
{
$email_to_find = $value[0];
$string = eregi_replace($email_to_find,convertAsciiEmail($value[0]),$string);
}
return $string;
}

newb2seo

3:42 pm on Jun 25, 2009 (gmt 0)

hi.

I think the problem is with your regex.

I reworked your code just picking it apart until I could get values to start showing up.

Try this and I bet it will point you in the right way.

#==========================================================


<?php

$content = 'Some text with an email@none.com and maybe another address@somewhere.org.';
$content .= 'Some text with an email@none.com and maybe another address2@somewhere2.org.';
echo '<br><b>output from findEmails: </b>'.findEmails($content);

echo '<hr>';

function findEmails($string)
{
echo '<br><b>$string = </b>'.$string;

$pattern = "|(.+?)@|";
#$pattern = "[_a-z0-9-]+(\.[_a-z0-9-]+)*@[a-z0-9-]+(\.[a-z0-9-]+)*(\.[a-z]{2,3})";
//$pattern = "/^[_a-z0-9-]+(\.[_a-z0-9-]+)*@[a-z0-9-]+(\.[a-z0-9-]+)*(\.[a-z]{2,3})$/";
//$pattern = '/^[^\W][a-zA-Z0-9_]+(\.[a-zA-Z0-9_]+)*\@[a-zA-Z0-9_]+(\.[a-zA-Z0-9_]+)*\.[a-zA-Z]{2,4}$/';
//$pattern = '[_a-z0-9-]+(\.[_a-z0-9-]+)*@[a-z0-9-]+(\.[a-z0-9-]+)*(\.[a-z]{2,3})';
//$pattern = '^[a-zA-Z0-9._-]+@[a-zA-Z0-9-]+\.[a-zA-Z.]{2,5}$';

echo '<br><b>$pattern = </b>'.$pattern.' STOP';

preg_match_all($pattern, $string, $split, PREG_PATTERN_ORDER);
echo '<br><b>$split[0] = </b>'.print_r($split[0], true); // true makes print_r return the text instead of printing it

foreach ($split[0] as $value)
{
echo '<hr><b>$value = </b>'.$value.' STOP';
$email_to_find = $value[0];
$string = eregi_replace($email_to_find,convertAsciiEmail($value[0]),$string);
}
return $string;
}

function convertAsciiEmail($email)
{
echo $email;

$obfuscatedEmail='';//declare variable
$length = strlen($email);
for ($i = 0; $i < $length; $i++)
{
$obfuscatedEmail .= "&#" . ord($email[$i]); // creates ASCII HTML entity
}
$return = '<a href="mailto:' . $obfuscatedEmail . '">'.$obfuscatedEmail.'</a>';
return $return;
}

#==========================================================

I wasn't getting anything to show up as a $value until I made the $pattern just very simple.
Then it started printing those out.

My suggestion is to just start from that and then add to your regex until it does what you want.

Then just take out all that junky debug stuff I added in there. :)

Hope that helps a little.
-Jeff

idfer

4:35 pm on Jun 25, 2009 (gmt 0)

techtheatre, going back to your original code, the only problem is that you're not delimiting your pattern in the call to preg_match_all. Change that call to:

preg_match_all("/$pattern/", $string, $split);

and you should be OK. Explanation: the preg_* functions treat the first character of the pattern as the opening delimiter. In the original code that character is [, so preg_match_all assumes the next ] ends your pattern and treats everything after it (starting with the +) as pattern modifiers. Since + is not a valid modifier, it chokes with "Unknown modifier '+'".
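
To make the delimiter rule concrete, here is a minimal sketch. The variable names ($body, $found, $emails) are mine and not part of the original script, and I widened the last group to {2,4} for the demo. It wraps the pattern in # delimiters, which should behave the same as / here, and adds the i modifier so uppercase letters match too.

```php
<?php
// Hypothetical demo: the email pattern from the question, wrapped in delimiters.
$body    = '[_a-z0-9-]+(\.[_a-z0-9-]+)*@[a-z0-9-]+(\.[a-z0-9-]+)*(\.[a-z]{2,4})';
$pattern = '#' . $body . '#i'; // '#' opens and closes the pattern, 'i' is a modifier

$text  = 'Some text with an email@none.com and maybe another address@somewhere.org.';
$count = preg_match_all($pattern, $text, $found);

// $found[0] holds the full matches, in the order they appear in $text.
$emails = $found[0];
print_r($emails);
```

Any non-alphanumeric, non-backslash, non-whitespace character can serve as the delimiter, as long as it is escaped wherever it appears inside the pattern itself.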

BTW, you can replace all the _a-z0-9 by \w.
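
A rough sketch of what that \w substitution might look like, assuming ASCII input (where \w is equivalent to [A-Za-z0-9_]). The sample text is made up for illustration. Note the hyphen sits last in each character class so it stays a literal hyphen rather than starting a range.

```php
<?php
// Sketch: [\w-] stands in for [_a-zA-Z0-9-] in the part before the @.
$pattern = '/[\w-]+(\.[\w-]+)*@[a-zA-Z0-9-]+(\.[a-zA-Z0-9-]+)*\.[a-zA-Z]{2,4}/';

// Hypothetical sample text, not from the original post.
$text = 'Contact First.Last@my-host.example.com or nobody.';
preg_match_all($pattern, $text, $found);

$emails = $found[0]; // full matches only
print_r($emails);
```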

BTW2, you don't need to call eregi_replace to make the substitutions. You can call str_replace instead, which is faster and safer because it treats the search string literally, so nothing in the matched address can be interpreted as a regex special character.

techtheatre

6:14 pm on Jun 25, 2009 (gmt 0)

THANKS idfer! That got things rolling again. I did discover a few other issues after that in my regex...but I have everything working now (I was unsuccessful with your "\w" replacement). I thought this script might be very useful to others in the future (an easy way in PHP to obfuscate email addresses on the fly), so the code is below.

I have one additional question/problem that I would love some help with. While this will not be a problem for me with my current CMS, I could see it being a problem for other people. If the incoming string already contains one or more mailto links (either labeled with custom text or with an email address), this script messes it all up, because the address inside the href gets replaced with a whole new link. I included an example. I will simply not allow mailto links on the CMS side of my script, but thought that if anyone wants to tackle this, it could be very helpful to the community (and to me). Any thoughts on stripping them automatically before processing these functions?

<?php
$content = 'Some text with an email@none.com and other what@ever.org text. Add a <a href="mailto:mailto@something.com">CLICK HERE</a> link. Start@something.net with an email address. Or end with an email@address.com.';

echo findEmails($content);

function findEmails($string)
{
$pattern = "[_a-zA-Z0-9-]+(\.[_a-zA-Z0-9-]+)*@[a-zA-Z0-9-]+(\.[a-zA-Z0-9-]+)*(\.[a-zA-Z]{2,4})";
preg_match_all("/$pattern/", $string, $matches);
foreach ($matches[0] as $value)
{
$MatchedEmail = $value;
//echo 'FOUND: '.$MatchedEmail.'<br />';//uncomment for debugging
$string = str_replace($MatchedEmail,convertAsciiEmail($MatchedEmail),$string);
}
return $string;
}

function convertAsciiEmail($email)
{
$obfuscatedEmail='';//declare variable
$length = strlen($email);
for ($i = 0; $i < $length; $i++)
{
$obfuscatedEmail .= "&#" . ord($email[$i]) . ";"; // creates ASCII HTML entity (the trailing semicolon terminates it)
}
$return = '<a href="mailto:' . $obfuscatedEmail . '">'.$obfuscatedEmail.'</a>';
return $return;
}

?>
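
On the question of stripping existing mailto links first, one possible approach is sketched below. The function name unwrapMailtoLinks() is hypothetical, and the pattern assumes double-quoted href attributes and no nested tags inside the link. It replaces each mailto anchor with the bare address from its href, so custom link text such as CLICK HERE is discarded.

```php
<?php
// Hypothetical helper: replace each mailto anchor with the bare address from its href.
function unwrapMailtoLinks($html)
{
    // Group 1 captures the address (up to a '?' or the closing quote);
    // the link text may be anything up to the first </a>.
    $pattern = '#<a\s[^>]*href="mailto:([^"?]+)[^"]*"[^>]*>.*?</a>#is';
    return preg_replace($pattern, '$1', $html);
}

$html  = 'Add a <a href="mailto:mailto@something.com">CLICK HERE</a> link.';
$plain = unwrapMailtoLinks($html);
echo $plain, "\n";
```

Running the text through something like this before findEmails() would leave only plain addresses for the main pattern to pick up.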

techtheatre

9:07 pm on Jun 25, 2009 (gmt 0)

More help still needed, but I have added some new functionality to the above. Please take a look at the code below, and feel free to re-use it in your projects. I found a JavaScript function that inserted the obfuscated code, thus providing another barrier to spam spiders. I heavily modified that script and embedded its functionality in my script below. As you will see from the example (if you run this code on your server), there is still the problem with email addresses that have already been set up as href links. Anyone know how to strip out the links?

<?php
$content = 'Some text with an email@none.com and other what@ever.org text. Add a <a href="mailto:mailto@link.com">mailto@link.com</a> link. Start@something.net with an email address. Or end with an email@address.com.';

// Example of use

//plain ASCII conversion of email addresses
$Output = findEmails($content, 0, 0);
echo $Output.'<br /><br /><br />'."\n\n";

//Convert ASCII and add mailto links
$Output = findEmails($content, 1, 0);
echo $Output.'<br /><br /><br />'."\n\n";

//Convert ASCII and add mailto links using JavaScript (default)
$Output = findEmails($content, 1, 1);
echo $Output.'<br /><br /><br />'."\n\n";

function findEmails($string, $MakeLink=1, $UseJavaScript=1)
{
//NOTE: if UseJavaScript is turned on (1), all email addresses are forced as links (regardless of "MakeLink" value)

$pattern = "[_a-zA-Z0-9-]+(\.[_a-zA-Z0-9-]+)*@[a-zA-Z0-9-]+(\.[a-zA-Z0-9-]+)*(\.[a-zA-Z]{2,4})";
preg_match_all("/$pattern/", $string, $matches);
foreach ($matches[0] as $value)
{
$MatchedEmail = $value;
//echo 'FOUND: '.$MatchedEmail.'<br />';//uncomment for debugging
$AsciiEmail = convertToAscii($MatchedEmail);
if($UseJavaScript==1)
{
$NewEmail = jsEmail($MatchedEmail);
}
elseif($MakeLink==1)
{
$NewEmail = '<a href="mailto:' . $AsciiEmail . '">'.$AsciiEmail.'</a>';
}
else
{
$NewEmail = $AsciiEmail;
}
$string = str_replace($MatchedEmail,$NewEmail,$string);
}
return $string;
}

function jsEmail($RawEmail)
{
// We split username and domain into separate strings - otherwise the bot will have no trouble finding the email address

// Split the email into user name and domain
list($user, $domain) = explode('@', $RawEmail);

// Form the href attribute
$mailtouser = "mailto:$user";

$user = convertToAscii($user);//plain user
$domain = convertToAscii($domain);//plain domain
$mailtouser = convertToAscii($mailtouser);//href user (mailto added)

// Generate output
$output = <<<EOT
<script>
document.write('<a href="$mailtouser' + '&#64;');
document.write('$domain' + '"');
document.write('>$user' + '&#64;');
document.write('$domain</a>');
</script>
EOT;
return $output;
}

function convertToAscii($text){
$output = '';
for($i = 0; $i < strlen($text); $i++)
{
$output .= '&#' . ord($text[$i]) . ';';
}
return $output;
}

?>

[edited by: coopster at 1:09 pm (utc) on June 26, 2009]
[edit reason] no urls please TOS [webmasterworld.com] [/edit]

techtheatre

10:57 pm on Jun 25, 2009 (gmt 0)

Okay...I have made some serious progress on this latest question, but have again hit a wall... HELP!

I wrote the following regex:
(<a href="mailto:)([\w-\.]+@([\w-]+\.)+[\w-]{2,4})(">)[\._a-zA-Z0-9@-]*(</a>)

I tested it against the following string:
<a href="mailto:test@none.com">test@none.com</a> and it was found successfully.

I tried to insert this regex into the following function, but it bombs out:

function stripMailtoLink($string)
{

$pattern = '(<a href="mailto:)([\w-\.]+@([\w-]+\.)+[\w-]{2,4})(">)[\._a-zA-Z0-9@-]*(</a>)';
preg_match_all("/$pattern/", $string, $matches);

foreach ($matches[0] as $value)
{
$MatchedLinkString = $value[0];
$MatchedEmail = $value[2]; //i think this should be correct
echo 'FOUND: '.$MatchedLinkString.'<br />';//debugging
}
$output = 'done';//debugging
return $output;
}

HELP! :-) Thanks!
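
Two things likely trip up a function like this. First, the pattern contains </a>, so a / delimiter ends the pattern early. Second, preg_match_all defaults to PREG_PATTERN_ORDER, where $matches[0] is a flat list of full-match strings, so $value[0] is only the first character of each match. The sketch below uses a simplified stand-in pattern and a made-up $html string to show the difference when PREG_SET_ORDER is used.

```php
<?php
// Hypothetical input; the pattern is a simplified stand-in for the one above.
$html    = 'Write to <a href="mailto:test@none.com">test@none.com</a> today.';
$pattern = '#<a href="mailto:([\w.-]+@[\w.-]+)">[^<]*</a>#'; // '#' avoids clashing with '</a>'

// Default order: $byGroup[0] = all full matches, $byGroup[1] = all group-1 captures.
preg_match_all($pattern, $html, $byGroup);

// PREG_SET_ORDER: one array per match; index 0 = full match, 1 = first group.
preg_match_all($pattern, $html, $bySet, PREG_SET_ORDER);

$fullMatch = $bySet[0][0];
$address   = $bySet[0][1];
echo $fullMatch, "\n", $address, "\n";
```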

[edited by: coopster at 1:11 pm (utc) on June 26, 2009]
[edit reason] removed url [/edit]

techtheatre

4:18 am on Jun 26, 2009 (gmt 0)

FINAL POST:

Done. The PHP script below will take an incoming string of text, strip any existing mailto links, search out all email addresses, convert them to ASCII character codes, and (by default) write them into the document using JavaScript. Feel free to use this on your own projects. Good luck.

<?php

$content = 'Some text with an email@none.com and other what@ever.org text. Add a <a href="mailto:test@none.com">TEST</a> link. Include another <a href="mailto:link@example.net">link@example.net</a> address. Start@something.net with an email address. Or end with an email@address.com.';

// Example of use

//plain ASCII conversion of email addresses
$Output = findEmails($content, 0, 0);
echo $Output.'<br /><br /><br />'."\n\n";

//Convert ASCII and add mailto links
$Output = findEmails($content, 1, 0);
echo $Output.'<br /><br /><br />'."\n\n";

//Convert ASCII and add mailto links using JavaScript (default)
$Output = findEmails($content, 1, 1);
echo $Output.'<br /><br /><br />'."\n\n";

function findEmails($string, $MakeLink=1, $UseJavaScript=1)
{
//NOTE: if UseJavaScript is turned on (1), all email addresses are forced as links (regardless of "MakeLink" value)

//start by removing any existing mailto links so we can work with plain text
$string = stripMailtoLink($string);

$pattern = "[_a-zA-Z0-9-]+(\.[_a-zA-Z0-9-]+)*@[a-zA-Z0-9-]+(\.[a-zA-Z0-9-]+)*(\.[a-zA-Z]{2,4})";
preg_match_all("/$pattern/", $string, $matches);
foreach ($matches[0] as $value)
{
$MatchedEmail = $value;
//echo 'FOUND: '.$MatchedEmail.'<br />';//uncomment for debugging
$AsciiEmail = convertToAscii($MatchedEmail);
if($UseJavaScript==1)
{
$NewEmail = jsEmail($MatchedEmail);
}
elseif($MakeLink==1)
{
$NewEmail = '<a href="mailto:' . $AsciiEmail . '">'.$AsciiEmail.'</a>';
}
else
{
$NewEmail = $AsciiEmail;
}
$string = str_replace($MatchedEmail,$NewEmail,$string);
}
return $string;
}

function jsEmail($RawEmail)
{
// We split username and domain into separate strings - otherwise the bot will have no trouble finding the email address

// Split the email into user name and domain
list($user, $domain) = explode('@', $RawEmail);

// Form the href attribute
$mailtouser = "mailto:$user";

$user = convertToAscii($user);//plain user
$domain = convertToAscii($domain);//plain domain
$mailtouser = convertToAscii($mailtouser);//href user (mailto added)

// Generate output
$output = <<<EOT
<script>
document.write('<a href="$mailtouser' + '&#64;');
document.write('$domain' + '"');
document.write('>$user' + '&#64;');
document.write('$domain</a>');
</script>
EOT;
return $output;
}

function convertToAscii($text){
$output = '';
for($i = 0; $i < strlen($text); $i++)
{
$output .= '&#' . ord($text[$i]) . ';';
}
return $output;
}

function stripMailtoLink($string)
{
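// | is the delimiter here because the pattern itself contains a / (in </a>); the hyphen in [\w.-] goes last so it is read literally
$pattern = '|(<a href="mailto:)([\w.-]+@([\w-]+\.)+[\w-]{2,4})(">)[\._a-zA-Z0-9@-]*(</a>)|';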
preg_match_all($pattern, $string, $matches, PREG_SET_ORDER);
foreach ($matches as $value)
{
$MatchedLinkString = $value[0];
$MatchedEmail = $value[2];
$string = str_replace($MatchedLinkString,$MatchedEmail,$string);
}
return $string;
}

?>
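
One caveat with the str_replace loop: when one address is contained in another (say a@b.org and aa@b.org), replacing the shorter one can also rewrite part of the longer one. A possible alternative, sketched here under the assumption of PHP 5.3 or later (for the closure), is preg_replace_callback, which rewrites each match in place in a single pass. The names obfuscateEmails() and toEntities() are hypothetical, and the sketch covers only the plain and mailto-link modes, not the JavaScript output or the stripping of existing links.

```php
<?php
// Hypothetical helper: encode every character as a decimal HTML entity.
function toEntities($text)
{
    $out = '';
    $len = strlen($text);
    for ($i = 0; $i < $len; $i++) {
        $out .= '&#' . ord($text[$i]) . ';';
    }
    return $out;
}

// Hypothetical single-pass variant of findEmails() (plain-text input only).
function obfuscateEmails($string, $makeLink = true)
{
    $pattern = '#[_a-zA-Z0-9-]+(\.[_a-zA-Z0-9-]+)*@[a-zA-Z0-9-]+(\.[a-zA-Z0-9-]+)*\.[a-zA-Z]{2,4}#';
    return preg_replace_callback($pattern, function ($m) use ($makeLink) {
        $ascii = toEntities($m[0]); // $m[0] is the address that was matched
        return $makeLink ? '<a href="mailto:' . $ascii . '">' . $ascii . '</a>' : $ascii;
    }, $string);
}

$result = obfuscateEmails('Mail a@b.org or aa@b.org.', false);
echo $result, "\n";
```

Because each match is replaced where it was found, the replacement text is never scanned again, so overlapping addresses do not interfere with each other.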

[edited by: coopster at 1:13 pm (utc) on June 26, 2009]
[edit reason] no personals please TOS [webmasterworld.com] [/edit]