preg match all not finding all meta descriptions

Help with regex and php's preg_match_all

         

Philiboy

2:28 pm on Feb 10, 2009 (gmt 0)



Hi, can anyone help me with the code below? I want to open web pages, find every meta element that has a "name" attribute, and then grab the meta description, i.e. the value of the content attribute when 'name="description"' is present. The problem is that the code doesn't find the description when it is preceded by other '<meta name...' elements. E.g.
<!---
It will work for the following lines in the
content....
-->
<title>here is a title</title>
<meta NAME="DESCRIPTION" CONTENT="It works with
this description spread over
several lines">
<!---
but it won't work for.....
-->
.
.
<meta name="robots" content="all" />
<meta name="author" content="Fred Smith" />
<meta name="Copyright" content="Copyright (c) 2005 ..." />
<title>page title goes here</title>
<meta name="description" content="this is the content I want,
but it fails to grab it">
<!-- Now, here is the code. I suspect there is a problem with the regex or perhaps the way I am
calling preg_match_all
-->
<?php
include_once("class.errorHandler.php");
$urlOpen = // url for web page to open
$content = '';
ErrorHandler::set();
try
{
if ($handle = fopen($urlOpen, "r"))
{
while (!feof($handle))
{
$content .= fread($handle, 8192);
}
// close the handle only when fopen() actually succeeded
fclose($handle);
}
} catch(Exception $e)
{
// "Couldn't open file";
}
if ($content != "")
{
$pattern = '/<meta ([^>]*)name="([^"\'>]*)"([^>]*)/im';
if (preg_match_all($pattern,$content,$matches))
{
for ($i=0;$i<count($matches)&&!isset($descr);$i++)
// loop will terminate when $descr gets assigned the result, or when
// all matches have been looped through
{
$str_matches=strtolower($matches[$i][0]);
$pos = strpos($str_matches,'name=');
if (!(is_bool($pos) && !$pos))
{
$name = trim(substr($str_matches,$pos+5),'"');
$pos = strpos($name,'"');
$name = substr($name,0,$pos);
if (strcasecmp("description",$name)==0)
{
$pos = strpos($str_matches,'content=');
if (!(is_bool($pos) && !$pos))
$descr=trim(substr($str_matches,$pos+8),'"');
}
}
}
}
else
{
$descr='no description found';
}
}

----------------
Any help will be appreciated. Thanks in advance,
Phil

PHP_Chimp

2:56 pm on Feb 10, 2009 (gmt 0)



$pattern = '/<meta ([^>]*)name="([^"\'>]*)"([^>]*)/im';

Have you had a look at what you are getting in $matches?

I would try something like:


$pattern = '~<meta name=[\'"](\w+)[\'"] content=[\'"](\w+)[\'"](?: +/)?>~i';

That pattern will not work if someone writes <meta content='something' name='keywords'>, as content and name are the other way around. It will also fail if there is more than a single space between meta and name, or between name and content. You could add + after each space to cover this eventuality.

The second (\w+) may not work for you, so you could use ([^\'"]+) instead, as this will match everything that isn't the closing ' or ". \w only matches 0-9, a-z, A-Z and _; depending on locale there may be more.

My pattern should work for both HTML and XHTML meta tags, as the closing / is optional. You don't need that part captured, so you can use (?:...) to group things without capturing them.
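A minimal sketch of the pattern with those adjustments applied, in case it helps to see them together. The sample HTML is made up for illustration, and \s+ is used in place of "space plus +" so that line breaks between attributes are tolerated as well; that substitution is an assumption, not part of the suggestion above.

```php
<?php
// Made-up sample input; the second tag has doubled spaces and a
// description that spans two lines.
$html = <<<'HTML'
<meta name="robots" content="all" />
<meta  name="description"  content="this is the content I want,
but spread over two lines">
HTML;

// \s+ allows one or more whitespace characters between the parts.
// [^'"]+ captures everything up to the closing quote, newlines included.
// (?:\s*/)? makes the XHTML-style closing slash optional without capturing it.
$pattern = '~<meta\s+name=[\'"](\w+)[\'"]\s+content=[\'"]([^\'"]+)[\'"](?:\s*/)?>~i';

preg_match_all($pattern, $html, $matches, PREG_SET_ORDER);

// With PREG_SET_ORDER each $m is one tag: $m[1] is the name, $m[2] the content.
foreach ($matches as $m) {
    echo $m[1], ' => ', $m[2], PHP_EOL;
}
```

This still only matches tags where name comes before content, and it stops at the first quote character inside the content value, so it is a starting point rather than a complete solution.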

The m modifier only has an effect if you are anchoring the regex with ^ or $, so it does nothing in your pattern.
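A small illustration of that point, using a made-up two-line string: the m modifier changes where ^ is allowed to match and nothing else.

```php
<?php
$text = "first line\nmeta line";

// Without m, ^ matches only at the very start of the subject string.
$without = preg_match('/^meta/', $text);

// With m, ^ also matches immediately after each newline.
$with = preg_match('/^meta/m', $text);

var_dump($without); // int(0)
var_dump($with);    // int(1)
```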

There are quite a few online sites where you can check whether a regex works. Have a google; you can then test more easily whether you are getting the expected results.

Philiboy

5:20 pm on Feb 10, 2009 (gmt 0)



Thanks PHP_Chimp. I need a pattern which matches <meta.....name.....>, i.e. it does not have to have 'content=' in it. Only when I process each match afterwards in my for loop do I look for 'content=' in it (so the name part and content part can be either way around). My pattern works fine when I test it in EditPad Pro (i.e. it finds all occurrences of <meta.....name.....> one by one), but I'll test it with one of the other tools on the web and try to unearth what the problem is. Meanwhile, this stays as an open problem on the forum, if anyone has any ideas on what the problem is.....?
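The two-step idea described here (match each meta tag first, then look inside it for name and content) could be sketched roughly as follows. The helper attr_value() and the sample HTML are hypothetical, not code from this thread, and the sketch assumes PHP 7.1 or later for the nullable return type.

```php
<?php
// Hypothetical helper: returns the value of one attribute inside a
// single tag string, or null when the attribute is absent.
function attr_value(string $tag, string $attr): ?string
{
    // (["\']) remembers which quote opened the value; \1 requires the same one to close it.
    $re = '~\b' . preg_quote($attr, '~') . '\s*=\s*(["\'])(.*?)\1~is';
    return preg_match($re, $tag, $m) ? $m[2] : null;
}

// Made-up sample input with content placed before name.
$html = <<<'HTML'
<meta name="author" content="Fred Smith" />
<meta content="order does not matter" name="description">
HTML;

$description = null;

// Step 1: grab every <meta ...> tag as a whole string.
preg_match_all('~<meta\b[^>]*>~i', $html, $tags);

// Step 2: read name and content independently, so their order is irrelevant.
foreach ($tags[0] as $tag) {
    $name = attr_value($tag, 'name');
    if ($name !== null && strcasecmp($name, 'description') === 0) {
        $description = attr_value($tag, 'content');
        break;
    }
}

echo $description ?? 'no description found', PHP_EOL;
```

Because $tags[0] is iterated directly, the loop visits every tag. The sketch would still be fooled by a > inside an attribute value or by text such as name= appearing inside another attribute's value, so an HTML parser may be the safer route for arbitrary pages.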

Philiboy

7:29 pm on Feb 10, 2009 (gmt 0)



OK. Got it. The call to preg_match_all needs the PREG_SET_ORDER flag, so it becomes if (preg_match_all($pattern,$content,$matches,PREG_SET_ORDER)). Not sure why, but it means the results are ordered so that $matches[0] is an array holding the first set of matches, $matches[1] is an array holding the second set of matches, and so on. So I've solved my own problem.
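As to why the flag matters: without it, preg_match_all defaults to PREG_PATTERN_ORDER, where $matches[0] holds all the full matches and $matches[1], $matches[2], ... hold the values of each capture group. A loop that reads $matches[$i][0] therefore only ever sees pieces of the first tag found. The sketch below compares the two layouts using the original pattern and a made-up two-tag input.

```php
<?php
// Made-up input: the description is the second meta tag.
$html = '<meta name="robots" content="all" />'
      . '<meta name="description" content="wanted text">';

$pattern = '/<meta ([^>]*)name="([^"\'>]*)"([^>]*)/im';

// Default (PREG_PATTERN_ORDER): one row per capture group.
// $byGroup[0] = every full match, $byGroup[2] = every name value.
preg_match_all($pattern, $html, $byGroup);

// PREG_SET_ORDER: one row per match.
// $byMatch[1][0] = second full match, $byMatch[1][2] = its name value.
preg_match_all($pattern, $html, $byMatch, PREG_SET_ORDER);

echo count($byGroup), PHP_EOL;   // 4 (full match + 3 capture groups)
echo count($byMatch), PHP_EOL;   // 2 (two meta tags matched)
echo $byMatch[1][2], PHP_EOL;    // description
```

With the default layout, count($matches) is the number of groups plus one regardless of how many tags matched, which is why the loop appeared to work only when the description was the first meta tag on the page.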