a regular expression needed for extracting links

Forum Moderators: coopster

shankimout

11:42 pm on Mar 28, 2006 (gmt 0)

Hi, thank you for browsing my topic. I have two questions.

First question: I need a regular expression to extract page and image URLs from text with the preg_split function.

Second question:

I want to save a web page with a PHP script. Do you have any classes for this?

Thanks a lot.

shankimout

11:44 am on Mar 29, 2006 (gmt 0)

Nobody knows?

I want to extract all of the href='xx' and src='xx' values from the HTML text.

Alex_Miles

1:00 pm on Mar 29, 2006 (gmt 0)

As no-one else has replied I'll tell you how I would do it with wildcards, but this is a tedious manual way to do it. You can't automate it.

I would search and replace '>*<' with '><'

Next import the file to Excel '<' delimited.

Then I would have another look, and another think.

Usually saving as text files and importing with new delimiters will sort out difficult problems.

adb64

1:27 pm on Mar 29, 2006 (gmt 0)

You may try a pcre like :


/.*((href¦src) *= *'.+').*/

Take care of possible spaces around the = sign.
Replace the broken pipe (¦) with a solid pipe (|).
Didn't check it but you can use tools to check a regular expression online.

[edited by: coopster at 1:45 pm (utc) on Mar. 29, 2006]
[edit reason] removed url [/edit]

coopster

1:43 pm on Mar 29, 2006 (gmt 0)

Welcome to WebmasterWorld, shankimout.

With a regular expression you need to locate the 'href' OR 'src' attribute followed by an equals sign. Then match a single or double quotation mark, which may open the attribute value. In some older HTML the quote may not be present at all, so we make it optional with a question mark. Next is the part we really want, the value. The value is everything up to the next single quotation mark, double quotation mark, space (because there may be another attribute, such as a class) or closing greater-than sign (which marks the end of the element). There has to be at least one character that is not one of these to constitute a match, which is what the plus sign requires. You'll also notice that the character class begins with a caret, which negates the characters inside: it says to match anything that is NOT one of the following characters. Lastly, we use the same characters without negation to tell the regex engine what marks the end of our pattern.

// The forum displays the pipe as ¦; rekey it as | before running.
$pattern = '/(href¦src)=[\'"]?([^\'" >]+)[\'" >]/';
preg_match_all($pattern, $string, $matches);
// $matches[2] will contain the values:
print '<pre>'; print_r($matches[2]); print '</pre>';

Note, this hasn't been tested but should get you started in the right direction.
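
For illustration, here is a small self-contained run of that pattern against a made-up HTML snippet. The sample markup and the variable name $html are assumptions for the example, and the pipe has been rekeyed as |. It is a sketch for trying the idea out, not a tested general-purpose link extractor.

```php
<?php
// Hypothetical sample markup covering double-quoted, single-quoted,
// and unquoted attribute values.
$html = '<a href="page.html">One</a> '
      . "<img src='pic.png' alt='x'> "
      . '<a href=plain.html>Two</a>';

// Same pattern as in the post, with the pipe rekeyed as a real |.
$pattern = '/(href|src)=[\'"]?([^\'" >]+)[\'" >]/';

preg_match_all($pattern, $html, $matches);

// $matches[1] holds the attribute names, $matches[2] the values.
print_r($matches[2]);
```

The alt attribute is skipped because only href and src are listed in the first group.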

<added>thanks, adb64. Middle of posting when I got sidetracked. And good point, I often forget to remind folks to rekey that pipe character as the forum breaks them</added>

shankimout

2:24 pm on Mar 29, 2006 (gmt 0)

Thanks, my good friends. I am testing it and will post the result.

This is the best programming forum in the world. Thanks.

shankimout

2:43 pm on Mar 29, 2006 (gmt 0)

Doesn't work!

<?php
$str_text = file_get_contents("http://php.net");

$pattern = '/(href¦src)=[\'"]?([^\'" >]+)[\'" >]/';

preg_match_all($pattern, $str_text, $matches);

echo "<pre>";
var_dump($matches);
echo "</pre>";
?>

This code prints:

array(3) {
  [0]=>
  array(0) {
  }
  [1]=>
  array(0) {
  }
  [2]=>
  array(0) {
  }
}

coopster

2:51 pm on Mar 29, 2006 (gmt 0)

Works fine for me. Do you have error_reporting() [php.net] turned on during development so you can see any errors being thrown?
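
A rough sketch of what such a check could look like follows. The settings are one common development setup, not necessarily the one meant above, and the short $text string is a hypothetical stand-in for the fetched page. The return value of preg_match_all() helps to tell "the pattern failed" apart from "the pattern ran but found nothing".

```php
<?php
// Show every notice, warning, and error while developing.
error_reporting(E_ALL);
ini_set('display_errors', '1');

$pattern = '/(href|src)=[\'"]?([^\'" >]+)[\'" >]/';
$text    = '<a href="index.php">home</a>'; // stand-in for the fetched page

// preg_match_all() returns the number of full matches, or false on failure.
$count = preg_match_all($pattern, $text, $matches);

if ($count === false) {
    echo "The regex engine reported a failure\n";
} elseif ($count === 0) {
    echo "The pattern ran but matched nothing\n";
} else {
    echo "Found $count match(es)\n";
}
```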

shankimout

3:04 pm on Mar 29, 2006 (gmt 0)

I tested this too, but I didn't get a result.

Would you mind posting your result here? My PHP version is 4.

<?php
error_reporting(E_ALL);

$str_text = file_get_contents("http://php.net");

$pattern = '/(href¦src)=[\'"]?([^\'" >]+)[\'" >]/';

preg_match_all($pattern, $str_text, $matches);

echo "<pre>";
var_dump($matches);
echo "</pre>";
?>

coopster

3:59 pm on Mar 29, 2006 (gmt 0)

Did you remember to rekey the pipe symbol? The forum here breaks the pipe symbol and displays it as ¦, so if you copy and paste you need to key a real pipe (|) again in its place.
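
To see why the copied pattern returns empty arrays, the two versions can be compared side by side. This is a sketch with a made-up input string. The point it illustrates is that ¦ is an ordinary literal character to PCRE, so (href¦src) only matches the literal text "href¦src", which never occurs in normal HTML.

```php
<?php
$html = '<a href="a.html"><img src="b.png"></a>'; // hypothetical sample input

// As copied from the forum: ¦ is a literal character, not alternation.
$broken = '/(href¦src)=[\'"]?([^\'" >]+)[\'" >]/';
// Rekeyed with a real pipe: href OR src.
$fixed  = '/(href|src)=[\'"]?([^\'" >]+)[\'" >]/';

$brokenCount = preg_match_all($broken, $html, $brokenMatches);
$fixedCount  = preg_match_all($fixed, $html, $fixedMatches);

echo "broken pattern: $brokenCount matches\n";
echo "fixed pattern:  $fixedCount matches\n";
print_r($fixedMatches[2]);
```

No error is raised for the broken version because it is still a valid pattern; it simply never matches.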

shankimout

4:23 pm on Mar 29, 2006 (gmt 0)

Oh, yes, true.

I changed the pattern to this:

$pattern = '/(href|src)=[\'"]?([^\'" >]+)[\'" >]/';

and it works well.

Thank you.

jezra

7:12 pm on Mar 29, 2006 (gmt 0)

It would appear that if there is a space between "href" or "src" and the equals sign, or between the equals sign and the actual path of the image or URL, the pattern will not match. Is this a correct assumption?

shankimout

8:38 pm on Mar 29, 2006 (gmt 0)

Do you have another idea? I am writing a class for fetching the real links in a file; it will get the real path of each image or page, but I have a problem. [webmasterworld.com...]

adb64

9:08 pm on Mar 29, 2006 (gmt 0)

To Jezra: that's a correct assumption. I also stated that in my msg#4, and my pcre allows 0 or more spaces around the = sign.

shankimout, take care of the optional spaces.
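
One way to fold the optional spaces into coopster's pattern is to allow whitespace on both sides of the equals sign. The variant below is a sketch, not a tested drop-in from this thread. The sample markup is made up, and the trailing i flag (so that HREF and SRC also match) is an extra assumption that nobody above asked for.

```php
<?php
// Hypothetical input with spaces around some of the equals signs.
$html = '<a href = "one.html">1</a> <img src= \'two.png\'> <a HREF=three.html>3</a>';

// \s* permits optional whitespace before and after the equals sign;
// the trailing i makes the attribute names case-insensitive.
$pattern = '/(href|src)\s*=\s*[\'"]?([^\'" >]+)[\'" >]/i';

preg_match_all($pattern, $html, $matches);

// $matches[2] holds the extracted attribute values.
print_r($matches[2]);
```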