Forum Moderators: coopster & jatar k

Simple Preg Match Parser for Emails

Error Message!

     
9:07 am on Jun 13, 2011 (gmt 0)

Preferred Member from GB 

10+ Year Member Top Contributors Of The Month

joined:July 25, 2005
posts: 406
votes: 17


Hi,

I'm trying to extract email addresses from old backup files of a custom-built email client. They are saved as HTML files and the code is a total mess. At least the email addresses are contained within regularly occurring <td> tags. I tried this simple preg_match script but it ends up saying:

Warning: preg_match() [function.preg-match]: Unknown modifier 't' in /home/whatever/public_html/scripts/parser/index.php on line 4
NULL


This is the script. The (.+?) bit is meant to represent the email address :)

<?php
$data = file_get_contents('http://www.whatever.com/scripts/parser/backup1.php');
$regex = '/Email Address:</td><td class="email">(.+?)</td>/';
preg_match($regex,$data,$match);
var_dump($match);
echo $match[1];
?>


Any ideas please?
9:44 am on June 13, 2011 (gmt 0)

Junior Member from IN 

10+ Year Member

joined:Nov 3, 2002
posts: 91
votes: 0


I'm not sure about the pattern as a whole, but all forward slashes and special characters need to be escaped with a backslash.

$regex = '/Email Address:</td><td class="email">(.+?)</td>/';


should become

$regex = '/Email Address\:<\/td><td class\=\"email\">(.+?)<\/td>/';
10:54 am on June 13, 2011 (gmt 0)

Preferred Member from GB 

10+ Year Member Top Contributors Of The Month

joined:July 25, 2005
posts: 406
votes: 17


@chrisranjana, thanks, that's great. It solved the problem, but now I'm trying to write a loop for it so that it goes to the end of the file and extracts all the emails.

<?php
$data = file_get_contents('http://www.whatever.com/scripts/parser/backup1.php');
$regex = '/Email Address\:<\/td><td class\=\"email\">(.+?)<\/td>/';
preg_match($regex,$data,$match);
foreach ($match as $val) {
echo $match[1];
}

This just lists the first result twice :D

Should I be using preg_match_all ? When I try this:

<?php
$page = "http://www.whatever.com/scripts/parser/backup1.php";
$content = file_get_contents("$page");
$regex = '/Email Address\:<\/td><td class\=\"email\">(.+?)<\/td>/';
preg_match_all($regex, $content, $matches, PREG_PATTERN_ORDER);
foreach ($matches as $val) {
print_r ($matches);
}
?>

I am getting all the results, but twice and mixed with some sort of "Array" output :(
11:20 am on June 13, 2011 (gmt 0)

Preferred Member from GB 

10+ Year Member Top Contributors Of The Month

joined:July 25, 2005
posts: 406
votes: 17


Aah, silly mistake, no arrays chosen. Got it to work now:

<?php
$page = "http://www.whatever.com/scripts/parser/backup1.php";
$content = file_get_contents($page);
$regex = '/Email Address\:<\/td><td class\=\"email\">(.+?)<\/td>/';
// Pass $content (the fetched page) as the subject string
preg_match_all($regex, $content, $matches, PREG_SET_ORDER);
foreach ($matches as $val) {
echo $val[1], "\n"; // $val[1] is the captured address; $val[0] is the whole match
}
?>

This little script will save me approx 30 hours of tedious manual work. PHP rulez!
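For anyone comparing the two attempts: the flags arrange the same results differently, which is why looping over the PREG_PATTERN_ORDER result and printing the whole array produced duplicated "Array" output. A small sketch, assuming a made-up two-row $content string and a # delimiter (both are illustrative, not taken from the real backup files):

```php
<?php
// Hypothetical sample with two addresses, standing in for the backup file.
$content = '<td>Email Address:</td><td class="email">a@example.com</td>'
         . '<td>Email Address:</td><td class="email">b@example.com</td>';

// "#" is used as the delimiter here so the slashes need no escaping.
$regex = '#Email Address:</td><td class="email">(.+?)</td>#';

// PREG_PATTERN_ORDER: $byGroup[0] holds every full match, $byGroup[1] every capture.
preg_match_all($regex, $content, $byGroup, PREG_PATTERN_ORDER);
echo implode(', ', $byGroup[1]), "\n"; // a@example.com, b@example.com

// PREG_SET_ORDER: one entry per match; [0] is the full match, [1] the capture.
preg_match_all($regex, $content, $byMatch, PREG_SET_ORDER);
foreach ($byMatch as $set) {
    echo $set[1], "\n";
}
```

With PREG_PATTERN_ORDER there is no need for a loop over $matches at all; the captured addresses are already collected in index 1.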
5:06 pm on June 13, 2011 (gmt 0)

Senior Member

WebmasterWorld Senior Member rocknbil is a WebmasterWorld Top Contributor of All Time 10+ Year Member

joined:Nov 28, 2004
posts:7999
votes: 0


You do not have to escape the double quotes in that regex. It errored because, and only because, you are using slashes for delimiters:

$regex = '/Email Address:</<-- PHP thinks your pattern ends here so expects this t to be a modifier --> td><td class="email">(.+?)</td>/';
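To illustrate the delimiter point, here is a minimal sketch, assuming a made-up $html sample string. It uses # as the delimiter so the slashes in </td> are ordinary characters, and shows preg_quote() as an alternative when you want to keep slash delimiters:

```php
<?php
// Hypothetical sample input, standing in for the backup file contents.
$html = '<td>Email Address:</td><td class="email">this@that.com</td>';

// With "#" as the delimiter, "/" needs no escaping inside the pattern.
$regex = '#Email Address:</td><td class="email">(.+?)</td>#';
preg_match($regex, $html, $match);
echo $match[1], "\n"; // this@that.com

// Alternative: keep "/" delimiters and let preg_quote() escape the literal parts.
$prefix = preg_quote('Email Address:</td><td class="email">', '/');
$suffix = preg_quote('</td>', '/');
preg_match('/' . $prefix . '(.+?)' . $suffix . '/', $html, $match2);
echo $match2[1], "\n"; // this@that.com
```

Passing the delimiter as the second argument to preg_quote() is what makes it escape the slashes as well.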

I'd also make it case-insensitive and, instead of "any character", use "anything that is not a <", because . (any character) will also match a <.
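A rough sketch of the difference, assuming a made-up $row with two cells on one line. The greedy (.+) is used here only to make the overrun obvious; the lazy (.+?) in the original script limits it, but the negated class states the intent more directly:

```php
<?php
// Hypothetical row containing two email cells on a single line.
$row = '<td class="email">a@example.com</td><td class="email">b@example.com</td>';

// "." also matches "<", so a greedy dot runs on to the last </td> on the line.
preg_match('#<td class="email">(.+)</td>#i', $row, $greedy);
echo $greedy[1], "\n"; // a@example.com</td><td class="email">b@example.com

// [^<]+ cannot cross a tag boundary, so it stops at the end of the first cell.
preg_match('#<td class="email">([^<]+)</td>#i', $row, $strict);
echo $strict[1], "\n"; // a@example.com

// The "i" modifier lets the same pattern match upper-case markup.
preg_match('#<td class="email">([^<]+)</td>#i', '<TD CLASS="email">c@example.com</TD>', $upper);
echo $upper[1], "\n"; // c@example.com
```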

Also if it's just that part after class, the rest of it is not needed. What happens if the string is valid HTML,

<label for="email">Email Address:</label></td><td class="email">this@that.com</td>

or spaces?

Email Address: </td><td class="email"> this@that.com </td>

or other attributes?

Email Address: </td><td class="email" title="my email"> this@that.com </td>

It would be better to grab the email pattern itself,

$regex = '/([^@\s<>]+@[^@\s<>]+\.[a-z]{2,})/i'; // This is a crude pattern but you get the idea

but using your approach,

$regex = '/<[^>]*class="email"[^>]*>([^<]+)</i';

The more specific you are, the less chance you'll have of an unexpected result cropping up in the future.
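As a quick check of that idea, here is a sketch that runs one class-based pattern over the three variants shown above. The pattern is one possible variation (it assumes the cell is a td and that the address contains no whitespace), not the only way to write it:

```php
<?php
// The three variants from the post above, used as test input.
$samples = [
    '<label for="email">Email Address:</label></td><td class="email">this@that.com</td>',
    'Email Address: </td><td class="email"> this@that.com </td>',
    'Email Address: </td><td class="email" title="my email"> this@that.com </td>',
];

// [^>]* tolerates extra attributes; \s* absorbs padding around the address.
$regex = '/<td[^>]*class="email"[^>]*>\s*([^<\s]+)\s*</i';

$found = [];
foreach ($samples as $sample) {
    if (preg_match($regex, $sample, $m)) {
        $found[] = $m[1];
    }
}
print_r($found); // three entries, each "this@that.com"
```

Anchoring on the class attribute rather than on the "Email Address:" label means the label markup can change without breaking the extraction.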
11:23 am on June 14, 2011 (gmt 0)

Preferred Member from GB 

10+ Year Member Top Contributors Of The Month

joined:July 25, 2005
posts: 406
votes: 17


Great advice, thank you!
 
