Apache/PHP auto-convert email addresses in HTML source to images

Requires Apache 2.2.7 or later (for mod_substitute) and PHP with the GD library and FreeType support

     
6:09 am on Aug 8, 2008 (gmt 0)

New User

10+ Year Member

joined:Dec 7, 2003
posts: 17
votes: 0


The goal of this project was to serve web pages that show email addresses as images, so that the delivered source contains nothing a spam bot could harvest, while still allowing the original source code to contain standard markup such as

<a href="mailto:somebody@somewhere.com">somebody@somewhere.com</a>

without any additional modifications.

Sound like a contradiction? Maybe, but it can be done quite easily!

Email Spam Harvester Background
If you are reading this post, you probably already know about spam bots and some of the ways they harvest email addresses. If not, here is a great summary, study and discussion of various methods to obfuscate email addresses. Unfortunately, the study didn’t cover images of email addresses. However, images of email addresses have proven very difficult for spam bots to process without OCR (optical character recognition), which slows the bots down considerably.

Images of Email Addresses
Creating images of email addresses to spam-proof web pages is not a new concept, and it has proven very successful. However, it does have some drawbacks.

1. Usability - An image of an email address should be just that: an image, not a mailto link that contains the image. The mailto link would contain the email address, and just like that, the spam bots would have it too. This means that the email address must be retyped in the email client. Because the email address is an image, it cannot even be copied and pasted into the email client.

2. Ease of use - The image of the email address must somehow be generated. Most web authors are not going to want to create images manually, especially when their site may contain hundreds if not thousands of email addresses. Scripts can also be used to create images of email addresses dynamically, but how do you relay the email address to the script from the source code without revealing it to spam bots? There are some methods, one of which is used in the technique described in this post. Even if you do use server-side scripting to generate images of email addresses, changes will most likely need to be made at the source level. That isn’t easy for people who use content management systems and programs like Adobe Contribute for distributed web authoring and who may know little to nothing about HTML markup, let alone a scripting language like PHP or ASP.

In this example, I use a simple PHP script that takes URL variables containing the various parts of the email address and outputs them as an image.

Apache mod_substitute
Apache’s mod_substitute provides a mechanism to perform both regular expression and fixed string substitutions on response bodies. This is the key piece of functionality for a seamless, on-the-fly transformation from source code that spam bots could take advantage of into source code that better protects the page from them.

Putting it All Together
Now that we’ve covered the two main pieces of the puzzle (the Apache regular expression substitution and the PHP image generation routine), let’s bring it all together.

1. Apache - We’ll start with the Apache changes. You only need to add three configuration lines. The first makes sure that mod_substitute is loaded into Apache:


LoadModule substitute_module modules/mod_substitute.so

Otherwise, the next two lines will cause an error when you restart Apache.

AddOutputFilterByType SUBSTITUTE text/html
Substitute s|<a(\s)href\s?=\s?"\s?mailto\s?:\s?(.*)@(.*)\.(.*)"?\b[^>]*>(.*?)</a>|<a$1href="mailto:"><img$1src="/email_test/email.php?m=$2\&h=$3\&tld=$4"$1align="absmiddle"$1border="0"$1/></a>|i

The first line tells Apache to run the SUBSTITUTE output filter on responses with a content type of text/html.

The second line is where all of the work is done. It is a regular expression that matches a standard mailto link and replaces it with a call to the PHP script that will convert the email address into an image. Although I haven’t done it here, you may want to modify the search part of the regular expression to pick up email addresses that are not embedded in an anchor (mailto) tag.

The replace part of the regular expression replaces the entire <a> tag with a different <a> tag, which contains an <img> tag that calls the PHP script. The href attribute of the new <a> tag contains only “mailto:”. This allows the image to be clickable and bring up an empty email window. Notice that finishing the mailto link would be counter-productive, as it would give away the email address to the spam bots (this is one of the usability issues I mentioned above)! The email address is broken up into its various parts by the regular expression, and those parts are passed as URL variables to the PHP script.

You can use the two configuration lines above inside a <Directory> or <Location> section in httpd.conf, or inside an .htaccess file (provided the host allows FileInfo overrides), which is great for everyone running their sites on a shared server!
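If you want to sanity-check the search and replace parts before touching Apache, you can run a similar pattern through PHP’s preg_replace(). The following is only a sketch under a few assumptions: the function name rewrite_mailto_links() is made up for this example, and the search pattern is a tighter variant of mine (it uses negated character classes instead of .* so that each captured part stays inside the href value, even with several links on one line). I have only exercised it in PHP, so treat it as a starting point rather than a drop-in Substitute rule.

```php
<?php
// Hypothetical test harness (not part of the Apache setup): applies a
// mailto-to-image rewrite with preg_replace() so the pattern can be tried
// out on sample HTML.
function rewrite_mailto_links(string $html): string
{
    // $1 = mailbox, $2 = host, $3 = top-level domain.
    // Negated character classes stop each capture at the closing quote.
    $pattern = '|<a\s+href\s*=\s*"mailto:([^@"\s]+)@([^"@\s]+)\.([^."\s]+)"[^>]*>.*?</a>|i';

    // Same output shape as the Substitute rule: an empty mailto link
    // wrapping an <img> that points at the image script.
    $replacement = '<a href="mailto:"><img src="/email_test/email.php?m=$1&h=$2&tld=$3" align="absmiddle" border="0" /></a>';

    return preg_replace($pattern, $replacement, $html);
}

// Usage: the email address should no longer appear in the output.
$sample = 'Contact <a href="mailto:somebody@somewhere.com">somebody@somewhere.com</a> today.';
echo rewrite_mailto_links($sample), "\n";
```

Note that preg_replace() treats & in the replacement as a literal character, whereas in the Substitute directive it has to be written as \&.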

That’s it for the Apache configuration side of things!
2. PHP - I’ve developed a small PHP script which reassembles the email address from the URL variables passed to it and outputs it as an image:
/email_test/email.php:


<?php
//Font size of the image text - change as desired; falls back to 10 when no usable size is passed
$size = !empty($_GET['size']) ? max(1, (int) $_GET['size']) : 10;
$font = $_SERVER['DOCUMENT_ROOT'].'/includes/fonts/ttf/arial.ttf'; //Use whatever TTF font you desire
$email = $_GET['m'].'@'.$_GET['h'].'.'.$_GET['tld']; //Reconstructs the email address from the URL variables
$bb = imagettfbbox($size, 0, $font, $email); //Bounding box of the rendered text
$w = $bb[2] - $bb[0]; //Width: lower-right x minus lower-left x
$h = $bb[1] - $bb[7]; //Height: lower-left y minus upper-left y
header("Content-Type: image/png"); //Outputs as PNG, but can be GIF or JPG also
$im = imagecreate($w, $h);
$white = imagecolorallocate($im, 255, 255, 255); //First allocated color becomes the background - change as desired
$blue = imagecolorallocate($im, 0, 0, 255); //This will be the text color of the image - change as desired
imagettftext($im, $size, 0, 0, $size, $blue, $font, $email); //Draws the text with its baseline at y = $size
imagepng($im);
imagedestroy($im);
?>
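One thing to watch in the script: it sizes the canvas from imagettfbbox() but draws the text with its baseline at y = $size, which may clip or misplace the text with some fonts and sizes. A possible refinement is to derive both the canvas size and the drawing origin from the bounding box. The helper below is a sketch only; the name text_canvas_metrics(), the padding parameter and the sample bounding box values are all invented for illustration.

```php
<?php
// Hypothetical helper: derives canvas size and drawing origin from the
// 8-element array returned by imagettfbbox().
// Index layout: 0,1 lower-left; 2,3 lower-right; 4,5 upper-right; 6,7 upper-left.
// Coordinates are relative to the text basepoint, so upper y values are
// usually negative.
function text_canvas_metrics(array $bb, int $padding = 1): array
{
    $minX = min($bb[0], $bb[2], $bb[4], $bb[6]);
    $maxX = max($bb[0], $bb[2], $bb[4], $bb[6]);
    $minY = min($bb[1], $bb[3], $bb[5], $bb[7]);
    $maxY = max($bb[1], $bb[3], $bb[5], $bb[7]);

    return [
        'width'  => ($maxX - $minX) + 2 * $padding,
        'height' => ($maxY - $minY) + 2 * $padding,
        'x'      => $padding - $minX, // left edge of the text lands at $padding
        'y'      => $padding - $minY, // baseline chosen so the top lands at $padding
    ];
}

// Usage with a made-up bounding box (real values come from imagettfbbox()):
$metrics = text_canvas_metrics([-1, 3, 148, 3, 148, -11, -1, -11]);
print_r($metrics);

// In the image script this would replace the $w/$h lines, roughly:
//   $im = imagecreate($metrics['width'], $metrics['height']);
//   imagettftext($im, $size, 0, $metrics['x'], $metrics['y'], $blue, $font, $email);
```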

The great thing about this method is that, as long as the PHP script is configured to match, you can use whatever URL variable names you like to keep the spam bots from catching on. I used ‘m’ for mailbox, ‘h’ for host, and ‘tld’ for top-level domain. You can also obfuscate the parts of the email address however you like by modifying the replace part of the regular expression in Apache. Just remember to use an equivalent decoding technique in the PHP script!
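As a rough illustration of matching the script to a renamed, reordered set of URL variables: suppose the replace part emitted ?a=$4\&b=$2\&c=$3 instead of the m/h/tld names. The names a, b and c, the function rebuild_email() and the allowed character set are all assumptions made up for this sketch. The character check is deliberately strict and would reject some unusual but valid addresses.

```php
<?php
// Hypothetical decoder for an obfuscated query string where
// 'a' = top-level domain, 'b' = mailbox, 'c' = host.
// These names must mirror whatever the Apache replacement emits.
function rebuild_email(array $query): ?string
{
    foreach (['a', 'b', 'c'] as $key) {
        // Reject missing, non-string, or unexpected input rather than
        // rendering arbitrary text into the image.
        if (!isset($query[$key]) || !is_string($query[$key])
            || !preg_match('/^[A-Za-z0-9._+-]+$/', $query[$key])) {
            return null;
        }
    }

    // Put the parts back in mailbox@host.tld order.
    return $query['b'] . '@' . $query['c'] . '.' . $query['a'];
}

// Usage: in the image script, $_GET would be passed instead of a literal array.
var_dump(rebuild_email(['a' => 'com', 'b' => 'somebody', 'c' => 'somewhere']));
```

When the function returns null, the script could send a 404 or a blank image instead of drawing anything.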

Results
So what does that leave us with? Basically a source HTML file that looks like:


<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<title>Untitled Document</title>
</head>
<body>
My email address is <a href="mailto:somebody@somewhere.com">somebody@somewhere.com</a> at my organization.
</body>
</html>

And an output stream from the web server to the client that looks like:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<title>Untitled Document</title>
</head>
<body>
My email address is <a href="mailto:"><img src="/email_test/email.php?m=somebody&h=somewhere&tld=com" align="absmiddle" border="0" /></a> at my organization.
</body>
</html>

I'd love to hear your comments on this!

3:38 pm on Aug 9, 2008 (gmt 0)

Senior Member

jdmorgan (WebmasterWorld Top Contributor of All Time, 10+ Year Member)

joined:Mar 31, 2002
posts:25430
votes: 0


That's pretty cool, assuming that you include on-page instructions on how to access these e-mail addresses, and that your user base understands those instructions and your reason for obscuring the addresses.

Thanks for posting!

Jim

3:55 am on Aug 14, 2008 (gmt 0)

New User

10+ Year Member

joined:Aug 14, 2008
posts: 2
votes: 0


The real genius of that method is that it doesn't have to be a system that converts an email address to an image. You could use any replacement text, such as JavaScript or CSS obfuscation techniques.