Forum Moderators: coopster

strip tags, javascript workaround needed, it leaves the JS code

dynamically removing javascript

         

Hero_Doug

4:08 am on Dec 17, 2003 (gmt 0)

10+ Year Member



I've noticed that strip_tags() leaves JavaScript code in place. It removes <script language="JavaScript"> and </script>, but leaves all the code between them.

The catch is that I don't know how many JavaScript blocks there will be, if any.

I'm hoping to find code or a function that eliminates the two tags and everything in between.
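
To make the behaviour described above concrete, here is a minimal sketch. The sample markup and the variable names are made up for illustration; only strip_tags() itself comes from the post.

```php
<?php
// Hypothetical sample input; the markup and variable names are illustrative only.
$html = '<p>Hello</p><script language="JavaScript">alert("hi");</script><p>World</p>';

// strip_tags() removes the tags themselves but keeps the text between them,
// so the body of the script block survives.
$stripped = strip_tags($html);

echo $stripped, "\n";
```

The script body ends up in the middle of the visible text, which is why the script blocks have to be removed before strip_tags() runs.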

NickCoons

7:07 am on Dec 17, 2003 (gmt 0)

10+ Year Member



Write a function to ignore everything beginning with <script> and ending with </script>, then use strip_tags() to get rid of everything else.
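
One way to read that suggestion, sketched without regular expressions. The function name remove_script_blocks() and the sample input are hypothetical, and the sketch assumes the closing tag is written exactly as </script> (in any letter case, with no whitespace inside it).

```php
<?php
// Hypothetical helper (not from the thread): removes every <script ...>...</script>
// block by scanning for the opening and closing tags case-insensitively.
function remove_script_blocks(string $html): string
{
    $out = '';
    $pos = 0;
    while (($start = stripos($html, '<script', $pos)) !== false) {
        // Keep everything before the opening tag.
        $out .= substr($html, $pos, $start - $pos);
        $end = stripos($html, '</script>', $start);
        if ($end === false) {
            // Unclosed block: drop the rest of the input.
            return $out;
        }
        // Continue scanning after the closing tag.
        $pos = $end + strlen('</script>');
    }
    // Append whatever follows the last script block.
    return $out . substr($html, $pos);
}

// Illustrative input with two blocks, one of them spanning several lines.
$html = "<p>one</p><SCRIPT type=\"text/javascript\">\nvar a = 1;\n</SCRIPT><p>two</p><script>b();</script>";
$text = strip_tags(remove_script_blocks($html));
```

Because it works on plain string positions, this approach is not affected by newlines inside the script block.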

jamie

11:01 am on Dec 17, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



hi doug,

i think preg_replace would work, e.g. find the script tags and everything possible contained between them, then replace them with nothing.

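// "#" is the delimiter, so the "/" in </script> needs no escaping;
// [^>]* allows attributes, .*? stops at the first </script>, s lets . match newlines
$pattern = "#<script[^>]*>.*?</script>#is";
$replacement = "";
$str = preg_replace($pattern, $replacement, $str);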

then strip_tags

coopster

12:42 pm on Dec 17, 2003 (gmt 0)

WebmasterWorld Administrator 10+ Year Member



You may want to have a peek at example 5 in the PHP preg_replace [php.net] function manual pages.

Hero_Doug

5:07 pm on Dec 17, 2003 (gmt 0)

10+ Year Member



So far no examples have worked, except

----------
$readthree = eregi_replace("<script.*/script>", "", $readtwo);
----------

The problem with that is that the match is greedy. Yahoo, for instance, has one script block at the very top of the page and another at the very bottom, so everything from the first <script to the last /script> is replaced, instead of just each individual block.

If it helps to understand my problem, here is the full source code.

---------------------------------------------------------
<?php

$domain = 'http://www.yahoo.com/';

$open = fopen($domain, "r");

$readone = fread($open, 200000);
$readtwo = stristr($readone, '<body');
//$readthree = preg_replace("/<script(^>)+>.*?</script>/i", "", $readtwo);
$readthree = eregi_replace("<script(.*)/script>(.*)", "", $readtwo);

$read = strip_tags($readthree);

fclose($open);

<removed many repetitive eregi calls - jatar_k>

$$filtered = str_replace('!', ' ', $$filtered);
$$filtered = str_replace(',', ' ', $$filtered);
$$filtered = str_replace('.', ' ', $$filtered);
$$filtered = str_replace('?', ' ', $$filtered);
$$filtered = str_replace('&copy', ' ', $$filtered);
$$filtered = str_replace('&#149;', ' ', $$filtered);
$$filtered = str_replace('&#183;', ' ', $$filtered);
$$filtered = str_replace('&amp;', ' ', $$filtered);
$$filtered = str_replace('&nbsp;', ' ', $$filtered);
$$filtered = str_replace('&raquo;', ' ', $$filtered);
$$filtered = str_replace('¦', ' ', $$filtered);
$$filtered = str_replace('\\', ' ', $$filtered);
$$filtered = str_replace('/', ' ', $$filtered);
$$filtered = str_replace('<', ' ', $$filtered);
$$filtered = str_replace('>', ' ', $$filtered);
$$filtered = str_replace('-', ' ', $$filtered);
$$filtered = str_replace('_', ' ', $$filtered);
$$filtered = str_replace('^', ' ', $$filtered);
$$filtered = str_replace('(', ' ', $$filtered);
$$filtered = str_replace(')', ' ', $$filtered);
$$filtered = str_replace('{', ' ', $$filtered);
$$filtered = str_replace('}', ' ', $$filtered);
$$filtered = str_replace('[', ' ', $$filtered);
$$filtered = str_replace(']', ' ', $$filtered);
$$filtered = str_replace("'", ' ', $$filtered);
$$filtered = str_replace(';', ' ', $$filtered);
$$filtered = str_replace(':', ' ', $$filtered);
$$filtered = str_replace('"', ' ', $$filtered);

echo trim($$filtered);

?>
-----------------------------------------------

When you run it, you'll see only about one line of text, up to the point where the first script block starts. Much more has been removed than should have been.

[edited by: jatar_k at 9:31 pm (utc) on Dec. 17, 2003]
[edit reason] reduced amount of code [/edit]
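
As an aside on the long run of str_replace calls in the listing above: str_replace() accepts an array of search strings, so the chain can be collapsed into one call. A minimal sketch, assuming a plain $text variable and a shortened, illustrative character list rather than the full list from the post:

```php
<?php
// Hypothetical input; $text stands in for the page text after strip_tags().
$text = 'Yahoo! Mail, News &amp; more &nbsp;(beta)';

// Entities come first so that their '&' and ';' are removed as a unit.
$unwanted = ['&amp;', '&nbsp;', '&raquo;', '&#149;', '&#183;', '&copy;',
             '!', ',', '.', '?', '(', ')', ';', ':', '"'];

// str_replace() accepts an array of search strings and a single replacement.
$filtered = str_replace($unwanted, ' ', $text);

// Collapse the runs of whitespace left behind by the replacements.
$filtered = trim(preg_replace('/\s+/', ' ', $filtered));
```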

coopster

5:33 pm on Dec 17, 2003 (gmt 0)

WebmasterWorld Administrator 10+ Year Member



I don't have time to set up a regex and test it, but I'm guessing you may need to use the UNGREEDY [php.net] pattern modifier.
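
A small sketch of what the ungreedy modifier changes, using preg_replace and a made-up input with two script blocks. The variable names are illustrative.

```php
<?php
// Hypothetical input with two script blocks and real content between them.
$html = '<script>a();</script><p>keep me</p><script>b();</script>';

// Greedy: .* runs to the LAST </script>, so the paragraph is lost too.
$greedy = preg_replace('#<script[^>]*>.*</script>#i', '', $html);

// Ungreedy via the U modifier: .* stops at the FIRST </script>.
$ungreedy = preg_replace('#<script[^>]*>.*</script>#iU', '', $html);

// Equivalent without U: make the quantifier itself lazy with .*?
$lazy = preg_replace('#<script[^>]*>.*?</script>#i', '', $html);
```

Note that the modifier is an uppercase U (lowercase u switches on UTF-8 mode instead) and that it applies only to the preg_* functions. The POSIX ereg functions take no delimiters or modifiers.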

Hero_Doug

6:54 pm on Dec 17, 2003 (gmt 0)

10+ Year Member



If I did it right, it doesn't work.

eregi_replace("<script.*<\/script>/u", "", $readtwo);

jatar_k

9:35 pm on Dec 17, 2003 (gmt 0)

WebmasterWorld Administrator 10+ Year Member



did you look at the example coopster mentioned on php.net?

[ca2.php.net...]

Example 5. Convert HTML to text
<?php
// $document should contain an HTML document.
// This will remove HTML tags, javascript sections
// and white space. It will also convert some
// common HTML entities to their text equivalent.

$search = array ("'<script[^>]*?>.*?</script>'si", // Strip out javascript
"'<[\/\!]*?[^<>]*?>'si", // Strip out html tags
"'([\r\n])[\s]+'", // Strip out white space
"'&(quot|#34);'i", // Replace html entities
"'&(amp|#38);'i",
"'&(lt|#60);'i",
"'&(gt|#62);'i",
"'&(nbsp|#160);'i",
"'&(iexcl|#161);'i",
"'&(cent|#162);'i",
"'&(pound|#163);'i",
"'&(copy|#169);'i",
"'&#(\d+);'e"); // evaluate as php (the e modifier was removed in PHP 7.0)

$replace = array ("",
"",
"\\1",
"\"",
"&",
"<",
">",
" ",
chr(161),
chr(162),
chr(163),
chr(169),
"chr(\\1)");

$text = preg_replace ($search, $replace, $document);
?>
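
The last pattern in the manual example relies on the e modifier, which was deprecated in PHP 5.5 and removed in PHP 7.0. On current PHP versions that one step can be done with preg_replace_callback() instead. A minimal sketch covering only the numeric-entity step, with made-up input:

```php
<?php
// Hypothetical input; only the numeric-entity step of the manual example is shown.
$text = 'Copyright &#169; 2003 &#38; later';

// preg_replace_callback() passes the match array to a function,
// replacing the "chr(\\1)" string that the e modifier used to evaluate.
$text = preg_replace_callback(
    '/&#(\d+);/',
    function (array $m): string {
        // $m[1] holds the digits captured by (\d+).
        return chr((int) $m[1]);
    },
    $text
);
```

chr() returns a single byte, so this mirrors the manual example's single-byte output. If UTF-8 output is wanted, the built-in html_entity_decode() is probably the better fit.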

NickCoons

1:01 am on Dec 18, 2003 (gmt 0)

10+ Year Member



<So far no examples have worked>

Then write your own.. I didn't think I'd have to write it for you :-)


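$x = 0;
// Start both buckets as empty strings so that .= does not hit an undefined index
$CodeStuff = array(0 => "", 1 => "");
$fp = fopen("http://page", "r");

// fgets() returns false at end of file, which ends the loop
while(($Input = fgets($fp, 32768)) !== false)
{
// The line is lower-cased first, so it has to be compared against lower-case tags
if(trim(strtolower($Input)) == "<script>")
{
++$x;
}

$CodeStuff[$x] .= $Input;

if(trim(strtolower($Input)) == "</script>")
{
--$x;
}
}

fclose($fp);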


(untested)
Now $CodeStuff[1] contains everything between <SCRIPT> and </SCRIPT>, and $CodeStuff[0] contains everything else. You can probably clean it up a bit.

Hero_Doug

3:43 am on Dec 18, 2003 (gmt 0)

10+ Year Member



I spent 1.5 hours writing every combo I could think of.

It finally works though. Someone at devshed filled me in on the fact that the newlines were messing it up.
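
The thread does not show the final pattern. As a sketch of one way newlines can break a preg_replace attempt: by default the dot does not match a newline, so a script block that spans several lines is never matched unless the s modifier is set. The input and variable names below are made up.

```php
<?php
// Hypothetical input: a script block that spans several lines.
$html = "<p>before</p>\n<script language=\"JavaScript\">\nvar x = 1;\n</script>\n<p>after</p>";

// Without the s modifier, "." does not match a newline, so nothing is removed.
$without_s = preg_replace('#<script[^>]*>.*?</script>#i', '', $html);

// With s (PCRE_DOTALL), "." also matches newlines and the block is removed.
$with_s = preg_replace('#<script[^>]*>.*?</script>#is', '', $html);

// Strip the remaining tags once the script block is gone.
$text = trim(strip_tags($with_s));
```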