
Forum Moderators: coopster & jatar k


preg_match_all() help for matching JavaScript in HTML

     

anshul

11:28 am on Nov 16, 2005 (gmt 0)

10+ Year Member



I'm using:

$content = htmlentities($content);
preg_match_all("/(<script)[a-z0-9|>,;#!\"'()%~:_=&\-\.\?\/\s]*(<\/script>)/i", $content, $matches);

The above is not working well. Please help me get the JavaScript constructs used in the HTML into an array.

Thanks.

coopster

12:36 pm on Nov 16, 2005 (gmt 0)

WebmasterWorld Administrator coopster is a WebmasterWorld Top Contributor of All Time 10+ Year Member



Are you looking to get rid of them? If so, perhaps you could just use PHP's strip_tags() [php.net] function?

anshul

3:51 pm on Nov 16, 2005 (gmt 0)

10+ Year Member



No. I intend to get all of them into an array and print them out. Actually, I'm coding an SEO tool to generate a JavaScript report for a Web page. Please help.

killroy

6:38 pm on Nov 16, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Why that long character class? Why not simply use .* instead? Also note that a regexp will be easily confused by quoted closing script tags. There's not much you can do about that.

anshul

7:23 pm on Nov 16, 2005 (gmt 0)

10+ Year Member



Is there a way to do this with regexes?

Actually, if there's a </script> at the bottom of a page and one JavaScript block in the head of the HTML document, I get everything in between in the variable.

Is there a regex that does what I want? Surely there must be one. I need your expert help.

coopster

4:49 am on Nov 17, 2005 (gmt 0)

WebmasterWorld Administrator coopster is a WebmasterWorld Top Contributor of All Time 10+ Year Member



You mean you want to get everything between script tags?
$pattern = "/<script[^>]*>(.*)<\/script>/Uis"; 
preg_match_all($pattern, $string, $matches);
print_r($matches);

The pattern matches anything that starts with '<script' followed by zero or more of anything that is not a '>', followed by a '>', followed by anything (and that part gets captured), followed by a closing '</script>' tag. The 'Uis' modifiers make the pattern Ungreedy and case-insensitive, and let the dot metacharacter match newlines.
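
As a rough illustration of what the U modifier changes, here is a minimal sketch. The sample $html string is made up for this example and is not the original poster's page.

```php
<?php
// Hypothetical sample page with two script blocks (not from the thread).
$html = <<<HTML
<html><head>
<script type="text/javascript">var a = 1;</script>
</head><body>
<p>Some text</p>
<SCRIPT>
var b = 2;
</SCRIPT>
</body></html>
HTML;

// Greedy: .* runs from the first <script ...> to the LAST </script>.
preg_match_all('/<script[^>]*>(.*)<\/script>/is', $html, $greedy);

// Ungreedy (U): each .* stops at the nearest closing </script>.
preg_match_all('/<script[^>]*>(.*)<\/script>/Uis', $html, $ungreedy);

echo count($greedy[0]), "\n";    // number of greedy matches
echo count($ungreedy[0]), "\n";  // number of ungreedy matches
print_r($ungreedy[1]);           // the captured script bodies
```

On this sample the greedy version should return one large match that swallows the markup between the two blocks, while the ungreedy version returns each script block separately.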

anshul

6:39 am on Nov 17, 2005 (gmt 0)

10+ Year Member



Thank you. But it's not working as expected. Here's the code to explain:

$content = file_get_contents($uri);
$pattern = "/<script[^>]*>(.*)<\/script>$/Uis";
preg_match_all($pattern, $content, $matches);
print_r($matches);

I expect output containing all the JavaScript constructs used in the Web page, and nothing else.

anshul

6:58 am on Nov 17, 2005 (gmt 0)

10+ Year Member



This code seems to be working well:

$content = htmlentities($content);
$pattern = "/(&lt;script)(.*)(&lt;\/script&gt;)/i";
preg_match_all($pattern, $content, $matches);
// foreach replaces while/list/each(): each() was removed in PHP 8
foreach ($matches[0] as $value) {
    echo "<br>" . $value . "<br>";
}

I checked a few URIs and it did well, though I'm not sure yet! I'm trying more URIs.

Can someone tell me why the result of preg_match_all() is a 2-dimensional array? I noticed that only $matches[0] is ever useful. What are the other elements for?

In the example above, I couldn't understand the U modifier and the ^ character!

If $ means "match at the end", why does using $pattern = "/(&lt;script)(.*)(&lt;\/script&gt;$)/i"; in this code make the result array empty?

Please explain.

I'm a regex novice. I have one last question: does using a lengthy regex slow the script down? I mean, are there efficiency concerns when using regexes or PHP's Perl-compatible regex functions?

Thanks.

coopster

12:07 pm on Nov 17, 2005 (gmt 0)

WebmasterWorld Administrator coopster is a WebmasterWorld Top Contributor of All Time 10+ Year Member



To get only those scripts that are JavaScript, just modify your regular expression to say: give me anything that starts with '<script', followed by anything that includes the word 'javascript', followed by anything that is not a '>', followed by a '>', etc. (the rest is the same as before).
$pattern = "/<script.*javascript[^>]*>(.*)<\/script>/Uis";

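The matched values are always returned in the order of the capturing parentheses, with the first element always holding the match for the entire pattern itself: $matches[0] holds the full matches, $matches[1] holds whatever the first set of parentheses captured, and so on. That is why the result is a two-dimensional array.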
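
To make the layout of the result array concrete, here is a small sketch. The input string and the extra capture group around the tag attributes are assumptions added for illustration. PREG_SET_ORDER is a standard preg_match_all() flag that groups the results per match instead of per capture group.

```php
<?php
// Hypothetical input, for illustration only.
$html = '<script type="text/javascript">one();</script>'
      . '<script>two();</script>';

// Two capture groups: (1) the tag attributes, (2) the script body.
$pattern = '/<script([^>]*)>(.*)<\/script>/Uis';

// Default ordering (PREG_PATTERN_ORDER): one row per capture group.
preg_match_all($pattern, $html, $byGroup);
// $byGroup[0] = full matches, $byGroup[1] = attributes, $byGroup[2] = bodies
print_r($byGroup[2]);

// PREG_SET_ORDER: one row per match instead.
preg_match_all($pattern, $html, $byMatch, PREG_SET_ORDER);
foreach ($byMatch as $m) {
    // $m[0] = full match, $m[1] = attributes, $m[2] = body
    echo trim($m[1]), ' => ', $m[2], "\n";
}
```

If only the script bodies are needed, reading the second capture group is usually more convenient than post-processing $matches[0].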

The '$' anchor matches only at the very end of the subject string (or just before a final newline). That is the very end of the whole string, not just "the last part of my particular pattern".
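
A quick sketch of the anchor's effect, using made-up input strings: the anchored pattern only matches when the closing script tag is the last thing in the subject.

```php
<?php
// Hypothetical input: a script block followed by more markup.
$html = "<script>a();</script>\n<p>footer</p>";

// Without $: the script block is found wherever it sits in the page.
$n1 = preg_match_all('/<script[^>]*>(.*)<\/script>/Uis', $html, $m1);

// With $: </script> must be the very last thing in the subject
// (allowing only one final newline), so nothing matches here.
$n2 = preg_match_all('/<script[^>]*>(.*)<\/script>$/Uis', $html, $m2);

// If the subject really ends with </script>, the anchored pattern matches.
$tail = "<p>x</p><script>b();</script>";
$n3 = preg_match_all('/<script[^>]*>(.*)<\/script>$/Uis', $tail, $m3);

echo "$n1 $n2 $n3\n"; // match counts for the three cases
```

Since a real page almost always has markup after the last script block, the anchored pattern will normally return nothing.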

anshul

12:21 pm on Nov 17, 2005 (gmt 0)

10+ Year Member



I don't know why $pattern = "/<script.*javascript[^>]*>(.*)<\/script>/Uis"; is not working.

But my problem is solved and I've uploaded the js tool live! First, I used htmlentities()
and then this pattern: $pattern = "/(&lt;script)[a-z0-9>,;#!\"'()%~:_=&\-\.\?\/\s]*(&lt;\/script&gt;)/Ui";

The U modifier is critical in the pattern above.

Thanks for the motivation ;)

 
