Forum Moderators: coopster

PHP and pattern recognition

         

hermes

10:18 pm on Jan 10, 2005 (gmt 0)

10+ Year Member



I am currently working on a PHP script that parses each received email for an email address in the message body.

I am wondering what syntax I should use for the searching of the email address. I want to look for a string of the general form of an email address:

xx@xx.xx

So, the only fixed elements are the @ and the .
The x's can of course vary in character and length.

So it is a pattern recognition problem, where the pattern can vary to quite a large degree, because I am not looking for a particular email address but for ANY email address.

Would be great if someone could point me in the right direction. The question mark in my mind is how PHP can search for a pattern with such variability. The only thing it has to cling onto is the @ followed by the .

Of course this must be possible, because all the web email harvesters must use something like this.

Many thanks. I will keep you posted on my email parsing script as I am sure it is going to have many issues along the way.

bcolflesh

10:20 pm on Jan 10, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



[webmasterworld.com...]

msg #10

s1dev

10:25 pm on Jan 10, 2005 (gmt 0)

10+ Year Member



Use PHP's regular expression functions:

[php.net...]

I hope this helps. I haven't done too much myself with regular expressions, but these functions will do the job.

jshpro2

1:36 am on Jan 11, 2005 (gmt 0)

10+ Year Member



^((?>[a-zA-Z\d!#$%&'*+\-/=?^_`{|}~]+\x20*|"((?=[\x01-\x7f])[^"\\]|\\[\x01-\x7f])*"\x20*)*(?<angle><))?((?!\.)(?>\.?[a-zA-Z\d!#$%&'*+\-/=?^_`{|}~]+)+|"((?=[\x01-\x7f])[^"\\]|\\[\x01-\x7f])*")@(((?!-)[a-zA-Z\d\-]+(?<!-)\.)+[a-zA-Z]{2,}|\[(((?(?<!\[)\.)(25[0-5]|2[0-4]\d|[01]?\d?\d)){4}|[a-zA-Z\d\-]*[a-zA-Z\d]:((?=[\x01-\x7f])[^\\\[\]]|\\[\x01-\x7f])+)\])(?(angle)>)$

[edited by: jatar_k at 4:36 am (utc) on Jan. 11, 2005]
[edit reason] disabled smilies [/edit]

hermes

12:19 pm on Jan 11, 2005 (gmt 0)

10+ Year Member



Thanks a lot guys. I found a nice tutorial on this at:
[phpbuilder.com...]
It has a section on regular expressions for email addresses.

I think the one I will use is this one (note that I am using the case-insensitive eregi() as opposed to the case-sensitive ereg()):

$regexp = "^([_a-z0-9-]+)(\.[_a-z0-9-]+)*@([a-z0-9-]+)(\.[a-z0-9-]+)*(\.[a-z]{2,4})$";

eregi($regexp, $email)

I have two questions about this.

1) Will this be able to pick up emails of the form:

.co.uk
.co.fr
....

Or only ones of the form:

.com
.info
.....

I am a little confused on this.

2) Should I change the * to a +
Here I repeat the expression but with the postulated change.

$regexp = "^([_a-z0-9-]+)(\.[_a-z0-9-]+)*@([a-z0-9-]+)(\.[a-z0-9-]+)+(\.[a-z]{2,4})$";

A bit of regular expression terminology:
The * denotes that the preceding group must occur "zero or more" times.
The + denotes that the preceding group must occur "one or more" times.

So, as I see it, if I have the *, the expression will pick up

abc@abc

as a valid email.

But if I have + it will not pick it up as valid. It will need to have a full stop in it (with characters after) to be picked up. ie. it must be of the form:

abc@abc.a
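
For anyone who wants to check the two variants directly, here is a minimal sketch. It assumes preg_match with / delimiters and the i flag as a stand-in for eregi() (which was removed in PHP 7), and the sample addresses are made up for illustration. The trailing (\.[a-z]{2,4}) group is mandatory in both versions, so neither should accept abc@abc. The + version needs at least two dots after the @, so it should reject abc@abc.com as well.

```php
<?php
// Both patterns are taken from the post above; only the delimiters and the
// "i" (case-insensitive) flag are added so they can be used with preg_match.
$star = '/^([_a-z0-9-]+)(\.[_a-z0-9-]+)*@([a-z0-9-]+)(\.[a-z0-9-]+)*(\.[a-z]{2,4})$/i';
$plus = '/^([_a-z0-9-]+)(\.[_a-z0-9-]+)*@([a-z0-9-]+)(\.[a-z0-9-]+)+(\.[a-z]{2,4})$/i';

// Made-up sample addresses.
$samples = ['abc@abc', 'abc@abc.a', 'abc@abc.com', 'abc@abc.co.uk'];

$results = [];
foreach ($samples as $address) {
    $results[$address] = [
        'star' => preg_match($star, $address) === 1,
        'plus' => preg_match($plus, $address) === 1,
    ];
    echo $address,
        ' | * version: ', $results[$address]['star'] ? 'match' : 'no match',
        ' | + version: ', $results[$address]['plus'] ? 'match' : 'no match',
        "\n";
}
```

So the * version covers both the .com and the .co.uk style, while switching to + would lose the plain .com style.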

hermes

12:23 pm on Jan 11, 2005 (gmt 0)

10+ Year Member



Sorry jshpro2. I am really grateful for your post. I am sorry for not going with your expression. I couldn't untangle it. It is pretty daunting! Has it much over and above the expression in my last post?

hermes

3:54 pm on Jan 11, 2005 (gmt 0)

10+ Year Member



I am having a problem with my code, leaving aside for a while the issue of which regular expression is best for picking out email addresses.

I have this code that works fine. It manages to pick out the email address "abc@ac.ss":

<?php
$email = "abc@ac.ss";
$regexp = "^([_a-z0-9-]+)(\.[_a-z0-9-]+)*@([a-z0-9-]+)(\.[a-z0-9-]+)*(\.[a-z]{2,4})$";
eregi($regexp, $email, $regs);
echo $regs[0];
?>

My problem comes when I put the email address in some wider text and then try to pluck it out of that text. This is more along the lines of what I want to do eventually: pluck email addresses out of text.

It is not working, unfortunately. What am I doing wrong?

<?php
$email = "hello james. This is an email to see how you are. abc@ac.ss Good. Bub bye.";
$regexp = "^([_a-z0-9-]+)(\.[_a-z0-9-]+)*@([a-z0-9-]+)(\.[a-z0-9-]+)*(\.[a-z]{2,4})$";
eregi($regexp, $email, $regs);
echo $regs[0];
?>

hermes

5:34 pm on Jan 11, 2005 (gmt 0)

10+ Year Member



In relation to my post 5: I have now sorted this out by doing some testing. The regular expression holds up pretty well and can snare most email address formats.

But I am still very much stuck on the issue in post 7: how to pull an email address out of a body of text, i.e. to go fishing for email addresses in a body of text. I want to do this to pull email addresses out of the email messages that I receive.

Just for anyone interested: this is the regular expression that I am going to use for email addresses:

$regexp = "^([_a-z0-9-]+)(\.[_a-z0-9-]+)*@([a-z0-9-]+)(\.[a-z0-9-]+)*(\.[a-z]{2,4})$";
eregi($regexp, $email)

jshpro2

10:44 pm on Jan 11, 2005 (gmt 0)

10+ Year Member



$data=file_get_contents('file.msg');
preg_match_all($regex, $data, $match);
foreach($match[0] as $addy) {
echo ($addy.'<br>');
}

hermes

6:55 pm on Jan 12, 2005 (gmt 0)

10+ Year Member



I am really sorry to bother the board again. I have hit yet more problems and am unsure how to proceed. I have got this code working:

<?php
$text = "hello james. This is an email to see how you are. abc@ac.ss Good. Bub bye.";
$regexp = "([_a-z0-9-]+)(\.[_a-z0-9-]+)*@([a-z0-9-]+)(\.[a-z0-9-]+)*(\.[a-z]{2,4})";
eregi($regexp, $text, $regs);
echo $regs[0];
?>

Note that I have taken the ^ and the $ off the ends of the regular expression now.
It can pull abc@ac.ss out of the body of text. Brilliant. We are going somewhere.
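
The reason this helps is that ^ and $ anchor the pattern to the start and end of the whole string, so the anchored version only matches when the string is nothing but an email address. A minimal sketch of the difference, assuming preg_match with / delimiters and the i flag in place of eregi() (the variable names are only for illustration):

```php
<?php
$text = "hello james. This is an email to see how you are. abc@ac.ss Good. Bub bye.";

// The core pattern from the post, without anchors or delimiters.
$core = '([_a-z0-9-]+)(\.[_a-z0-9-]+)*@([a-z0-9-]+)(\.[a-z0-9-]+)*(\.[a-z]{2,4})';

// Anchored: the whole string must be an email address.
$anchored = preg_match('/^' . $core . '$/i', $text, $m1);

// Unanchored: the address may appear anywhere in the string.
$unanchored = preg_match('/' . $core . '/i', $text, $m2);

echo "anchored matches: ", $anchored, "\n";      // 0
echo "unanchored matches: ", $unanchored, "\n";  // 1
echo "found: ", $m2[0] ?? '(nothing)', "\n";     // abc@ac.ss
```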

Now I am setting my heart on pulling multiple email addresses out of a body of text. For instance, pulling abc@ac.ss and ss@hh.co out of:

"hello james. This is an email ss@hh.co to see how you are. abc@ac.ss Good. Bub bye.";

I don't know whether eregi() is suitable for finding multiple email addresses, so I am going for preg_match_all instead. jshpro2, thank you ever so much for your code, which I enclose below.

$data=file_get_contents('file.msg');
preg_match_all($regex, $data, $match);
foreach($match[0] as $addy) {
echo ($addy.'<br>');
}

The thing is that I could not get this working. I altered it a bit to:

<?php
$data="hello james. ad@sdda.ds This is an email to see how you are. abc@ac.ss Good. Bub bye.";
$regex = "([_a-z0-9-]+)(\.[_a-z0-9-]+)*@([a-z0-9-]+)(\.[a-z0-9-]+)*(\.[a-z]{2,4})";
preg_match_all($regex, $data, $match);
foreach($match[0] as $addy) {
echo ($addy.'<br>');
}
?>

The above did not work either. I got the following error messages:

Warning: Unknown modifier '(' on line 4
Warning: Invalid argument supplied for foreach() on line 5

I guess these two errors are interlinked: if you solve the first, the second will go away. I don't know what it means by Unknown modifier '('.
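
A likely explanation for the warning: the preg functions expect the pattern to be wrapped in delimiters, with optional modifiers after the closing one. Because this pattern starts with (, PHP takes ( and the matching ) as the delimiters, so the pattern becomes just [_a-z0-9-]+ and everything after the first ) is read as modifiers, the first of which is another (. A minimal sketch of the fix, assuming / as the delimiter and the i flag for case-insensitivity (neither appears in the code above):

```php
<?php
$data = "hello james. ad@sdda.ds This is an email to see how you are. abc@ac.ss Good. Bub bye.";

// Same pattern body as in the post; it contains no "/" so "/" is safe as a delimiter.
$body = '([_a-z0-9-]+)(\.[_a-z0-9-]+)*@([a-z0-9-]+)(\.[a-z0-9-]+)*(\.[a-z]{2,4})';
$regex = '/' . $body . '/i';

$found = [];
if (preg_match_all($regex, $data, $match)) {
    // $match[0] holds the full text of every match.
    $found = $match[0];
}

foreach ($found as $addy) {
    echo $addy, "\n";
}
```

With the delimiters in place, $match[0] is always an array, so the foreach warning goes away too.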

I saw some example code on the web for plucking multiple phone numbers out of a body of text. It is below (I modified it a bit). It works fine. Returns both phone numbers.

<?php
preg_match_all ("/\(? (\d{3})? \)? (?(1) [\-\s] ) \d{3}-\d{4}/x", "Call 555-1212 or 1-800-555-1212", $phones);
print $phones[0][0];
print $phones[0][1];
?>

Using the layout above, I tried to apply this to my email search issue.

<?php
preg_match_all("([_a-z0-9-]+)(\.[_a-z0-9-]+)*@([a-z0-9-]+)(\.[a-z0-9-]+)*(\.[a-z]{2,4})", "hello james. ad@sdda.ds This is ss@hh.se an email", $phones);
print $phones[0][0];
print $phones[0][1];
?>

But it doesn't work. I get these error messages:
Warning: Unknown modifier '(' on line 2
Notice: Undefined offset: 0 on line 3

Please help. I don't know how to proceed.

hermes

11:13 pm on Jan 12, 2005 (gmt 0)

10+ Year Member



Got to the bottom of my problem thanks to some very kind help. It seems that the preg functions require Perl-style regex, with the pattern wrapped in delimiters such as / ... /. The code below works!

<?php
$data = "hello james. ad@sdda.ds This is an email to see how you are. abc@ac.ss Good. Bub bye.";
// The pattern is wrapped in / delimiters, as the preg functions require.
$regex = "/(([A-Za-z0-9]+_+)|([A-Za-z0-9]+\-+)|([A-Za-z0-9]+\.+)|([A-Za-z0-9]+\++))*[A-Za-z0-9]+@((\w+\-+)|(\w+\.))*\w{1,63}\.[a-zA-Z]{2,6}/";
if (preg_match_all($regex, $data, $match)) {
    // $match[0] holds every full match found in $data.
    foreach ($match[0] as $email) {
        echo '<br />' . $email;
    }
} else {
    echo 'does not contain email addresses';
}
?>
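
If the same address can appear more than once in a message, it may be handy to wrap this in a small function that returns each address only once. The sketch below is one way to do that, not a definitive implementation: extract_emails is a hypothetical name, the sample text is made up, and the pattern is the one above written with | for alternation.

```php
<?php
// Hypothetical helper: returns the distinct email-like strings found in $text.
function extract_emails(string $text): array
{
    // Pattern from the post above, on one line, in a single-quoted string.
    $regex = '/(([A-Za-z0-9]+_+)|([A-Za-z0-9]+\-+)|([A-Za-z0-9]+\.+)|([A-Za-z0-9]+\++))*[A-Za-z0-9]+@((\w+\-+)|(\w+\.))*\w{1,63}\.[a-zA-Z]{2,6}/';

    // preg_match_all returns 0 when nothing matches and false on error.
    if (!preg_match_all($regex, $text, $match)) {
        return [];
    }

    // array_unique keeps the first occurrence; array_values reindexes from 0.
    return array_values(array_unique($match[0]));
}

// Made-up sample text with one repeated address.
$data = "Write to ad@sdda.ds or abc@ac.ss. Again: ad@sdda.ds";
print_r(extract_emails($data));
```

For the sample text this should list ad@sdda.ds and abc@ac.ss once each, without the full stop that follows abc@ac.ss.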