Moderators: coopster & jatar k

PHP Server Side Scripting Forum

    
preg_match_all() help for matching JavaScript in HTML
anshul




msg:1245746
 11:28 am on Nov 16, 2005 (gmt 0)

I'm using:

$content = htmlentities($content);
preg_match_all("/(<script)[a-z0-9|>,;#!\"'()%~:_=&\-\.\?\/\s]*(<\/script>)/i", $content, $matches);

The above is not working well. Please help me get the JavaScript constructs used in an HTML page into an array.

Thanks.

 

coopster




msg:1245747
 12:36 pm on Nov 16, 2005 (gmt 0)

Are you looking to get rid of them? If so, perhaps you could just use PHP's strip_tags() [php.net] function?

anshul




msg:1245748
 3:51 pm on Nov 16, 2005 (gmt 0)

No. I intend to get all of these into an array and print them out. Actually, I'm coding an SEO tool to generate a JavaScript report for a Web page. Please help.

killroy




msg:1245749
 6:38 pm on Nov 16, 2005 (gmt 0)

Why that long character class? Why not simply use .* instead? Also note that a regexp will be easily confused by quoted closing script tags. There is not much you can do about that.

anshul




msg:1245750
 7:23 pm on Nov 16, 2005 (gmt 0)

Is there a way to do this with regexes?

Actually, if there's a </script> at the bottom of a page and one JavaScript block in the head of the HTML document, I get everything in between in the variable.

Is there a regex that can do what I want? Surely there is. I need you experts' help.
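
The behavior described above looks like greedy matching: a plain .* runs on to the last </script> in the page. A minimal sketch, assuming a made-up two-script page (the $html sample and variable names are hypothetical), comparing a greedy and a lazy quantifier:

```php
<?php
// Hypothetical sample page: one script in the head, one at the bottom.
$html = "<head><script>var a = 1;</script></head>"
      . "<body><p>text</p><script>var b = 2;</script></body>";

// Greedy: .* runs to the LAST </script>, so both scripts and the
// markup between them end up in a single match.
preg_match_all('/<script[^>]*>(.*)<\/script>/is', $html, $greedy);

// Lazy: .*? stops at the FIRST </script> after each opening tag.
preg_match_all('/<script[^>]*>(.*?)<\/script>/is', $html, $lazy);

echo count($greedy[0]), "\n"; // number of greedy matches
echo count($lazy[0]), "\n";   // number of lazy matches
print_r($lazy[1]);            // the captured script bodies
```

With this sample the greedy pattern yields one oversized match, while the lazy one yields one match per script element.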

coopster




msg:1245751
 4:49 am on Nov 17, 2005 (gmt 0)

You mean you want to get everything between script tags?
$pattern = "/<script[^>]*>(.*)<\/script>/Uis"; 
preg_match_all($pattern, $string, $matches);
print_r($matches);

The pattern matches anything that starts with '<script', followed by zero or more characters that are not a '>', followed by a '>', followed by anything (that part gets captured), followed by a closing '</script>' tag. The 'Uis' modifiers make the pattern ungreedy (U), case-insensitive (i), and let the dot metacharacter match newlines (s).
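
To see what each modifier contributes, here is a small sketch that runs the same pattern with one modifier removed at a time. The $html sample is hypothetical (an uppercase tag with a multi-line body), chosen so that both 'i' and 's' matter:

```php
<?php
// Hypothetical sample: uppercase tag names and a body spanning several lines.
$html = "<SCRIPT>\nvar a = 1;\n</SCRIPT>";

// All three modifiers: ungreedy, case-insensitive, dot matches newlines.
$full = preg_match_all('/<script[^>]*>(.*)<\/script>/Uis', $html, $m1);

// Without 'i', the lowercase 'script' in the pattern does not match 'SCRIPT'.
$noI = preg_match_all('/<script[^>]*>(.*)<\/script>/Us', $html, $m2);

// Without 's', the dot cannot cross the newlines inside the script body.
$noS = preg_match_all('/<script[^>]*>(.*)<\/script>/Ui', $html, $m3);

echo $full, " ", $noI, " ", $noS, "\n";
```

Note that U inverts greediness for the whole pattern, so writing .*? under U makes that quantifier greedy again. Using .*? without U is an equivalent and often clearer choice.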

anshul




msg:1245752
 6:39 am on Nov 17, 2005 (gmt 0)

Thank you. But it's not working as expected. Here's code to explain:

$content = file_get_contents($uri);
$pattern = "/<script[^>]*>(.*)<\/script>$/Uis";
preg_match_all($pattern, $content, $matches);
print_r($matches);

I expect output containing all the JavaScript constructs used in the Web page, and nothing else.

anshul




msg:1245753
 6:58 am on Nov 17, 2005 (gmt 0)

Perhaps this code is working well:

$content = htmlentities($content);
$pattern = "/(&lt;script)(.*)(&lt;\/script&gt;)/i";
preg_match_all($pattern, $content, $matches);
while( list($key, $value) = each($matches[0]) )
echo "<br>" . $value . "<br>";

I checked a few URIs and it did well, though I'm not sure! I'm trying more URIs ..

Can someone tell me why the result of preg_match_all() is a 2-dimensional array? I noticed that only $matches[0] is ever useful. Why are the other ones vague?

In the example above, I couldn't understand the U and ^ modifiers!

If $ is the match-at-end modifier, why does putting $pattern = "/(&lt;script)(.*)(&lt;\/script&gt;$)/i"; in this code make the result array empty?

Please explain.

I'm a regex novice. I have one last question: does using a lengthy regex slow the script? I mean, is there some efficiency concern when using regexes or PHP's Perl-compatible regex functions?

Thanks.

coopster




msg:1245754
 12:07 pm on Nov 17, 2005 (gmt 0)

To get only those scripts that are JavaScript, just modify your regular expression to say: give me anything that starts with '<script', followed by anything that includes the word 'javascript', followed by anything that is not a '>', followed by a '>', etc. (the end is the same as before).
$pattern = "/<script.*javascript[^>]*>(.*)<\/script>/Uis";
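
One thing to watch for with this pattern: with the 's' modifier, the .* before 'javascript' can run past the end of an opening tag that has no type attribute and pick up the word from a later tag. A hedged sketch of a possible variant that uses [^>]* there instead, so the word must appear inside the opening tag itself. The $html sample and the $loose/$strict names are hypothetical:

```php
<?php
// Hypothetical sample: a bare <script> followed by one with a type attribute.
$html = '<script>var a;</script><script type="text/javascript">var b;</script>';

// Pattern as posted: .* may cross tag boundaries to find 'javascript'.
$loose  = '/<script.*javascript[^>]*>(.*)<\/script>/Uis';

// Variant (an assumption, not from the thread): [^>]* keeps the search
// for 'javascript' inside the opening tag.
$strict = '/<script[^>]*javascript[^>]*>(.*)<\/script>/Uis';

preg_match_all($loose, $html, $looseMatches);
preg_match_all($strict, $html, $strictMatches);

echo $looseMatches[0][0], "\n";  // starts at the first, untyped <script>
echo $strictMatches[0][0], "\n"; // only the element whose tag names javascript
```

The variant skips script tags that do not mention 'javascript' at all, which may or may not be what the report needs.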

The matched values are always returned in the order of the capturing parentheses, with the first match always being the entire pattern itself.
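
That is also why the result is 2-dimensional. A short sketch of the two layouts preg_match_all() can return, using a made-up $html sample (the variable names are illustrative only):

```php
<?php
// Hypothetical sample with two script elements.
$html = '<script type="a">one</script><script type="b">two</script>';
$pattern = '/<script[^>]*>(.*?)<\/script>/is';

// Default layout (PREG_PATTERN_ORDER): one sub-array per group.
// $byGroup[0] holds every full match, $byGroup[1] every capture of group 1.
preg_match_all($pattern, $html, $byGroup);
print_r($byGroup);

// PREG_SET_ORDER: one sub-array per match.
// $byMatch[0] is [full match, group 1] for the first script, and so on.
preg_match_all($pattern, $html, $byMatch, PREG_SET_ORDER);
print_r($byMatch);
```

If only the script bodies are wanted, $byGroup[1] is usually the more useful element than $byGroup[0].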

The '$' anchor means the last character or last part of the subject string. Very last, not just "the last part of my particular pattern".
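
A quick sketch of that effect, assuming two made-up subject strings: one where markup follows the closing script tag and one where the script is the very last thing in the string:

```php
<?php
// Hypothetical subjects.
$withTrailing = "<script>var a = 1;</script><p>trailing markup</p>";
$scriptLast   = "<p>x</p><script>var a = 1;</script>";

// With '$', </script> must sit at the very end of the subject.
$anchored = preg_match_all('/<script[^>]*>(.*)<\/script>$/Uis', $withTrailing, $m1);

// Without '$', the same subject matches.
$unanchored = preg_match_all('/<script[^>]*>(.*)<\/script>/Uis', $withTrailing, $m2);

// With '$', a subject that ends in </script> matches.
$atEnd = preg_match_all('/<script[^>]*>(.*)<\/script>$/Uis', $scriptLast, $m3);

echo $anchored, " ", $unanchored, " ", $atEnd, "\n";
```

Since a real page normally ends with </body></html> rather than </script>, an anchored pattern will usually find nothing.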

anshul




msg:1245755
 12:21 pm on Nov 17, 2005 (gmt 0)

I don't know why $pattern = "/<script.*javascript[^>]*>(.*)<\/script>/Uis"; is not working.

But my problem is solved and I've put the JS tool live! First I used htmlentities(), and then this pattern: $pattern = "/(&lt;script)[a-z0-9>,;#!\"'()%~:_=&\-\.\?\/\s]*(&lt;\/script&gt;)/Ui";

U is critical in the above.

Thanks for the motivation ;)

