The scenario is that I don't know how many JavaScript code blocks there will be, if any.
I'm hoping for some code or a function that removes the two script tags and everything in between.
----------
$readthree = eregi_replace("<script.*/script>", "", $readtwo);
----------
The problem is that a page such as Yahoo's has one script at the very top and another at the very bottom. The pattern matches from the first opening tag to the last closing tag, so everything from start to finish is removed instead of just the content of each script block.
If it helps to understand my problem, here is the full source code.
---------------------------------------------------------
<?php
$domain = 'http://www.yahoo.com/';
$open = fopen($domain, "r");
$readone = fread($open, 200000);
$readtwo = stristr($readone, '<body');
//$readthree = preg_replace("/<script(^>)+>.*?</script>/i", "", $readtwo);
$readthree = eregi_replace("<script(.*)/script>(.*)", "", $readtwo);
$read = strip_tags($readthree);
fclose($open);
<removed many repetitive eregi calls - jatar_k>
$$filtered = str_replace('!', ' ', $$filtered);
$$filtered = str_replace(',', ' ', $$filtered);
$$filtered = str_replace('.', ' ', $$filtered);
$$filtered = str_replace('?', ' ', $$filtered);
$$filtered = str_replace('©', ' ', $$filtered);
$$filtered = str_replace('•', ' ', $$filtered);
$$filtered = str_replace('·', ' ', $$filtered);
$$filtered = str_replace('&', ' ', $$filtered);
$$filtered = str_replace(' ', ' ', $$filtered);
$$filtered = str_replace('»', ' ', $$filtered);
$$filtered = str_replace('¦', ' ', $$filtered);
$$filtered = str_replace('\\', ' ', $$filtered);
$$filtered = str_replace('/', ' ', $$filtered);
$$filtered = str_replace('<', ' ', $$filtered);
$$filtered = str_replace('>', ' ', $$filtered);
$$filtered = str_replace('-', ' ', $$filtered);
$$filtered = str_replace('_', ' ', $$filtered);
$$filtered = str_replace('^', ' ', $$filtered);
$$filtered = str_replace('(', ' ', $$filtered);
$$filtered = str_replace(')', ' ', $$filtered);
$$filtered = str_replace('{', ' ', $$filtered);
$$filtered = str_replace('}', ' ', $$filtered);
$$filtered = str_replace('[', ' ', $$filtered);
$$filtered = str_replace(']', ' ', $$filtered);
$$filtered = str_replace("'", ' ', $$filtered);
$$filtered = str_replace(';', ' ', $$filtered);
$$filtered = str_replace(':', ' ', $$filtered);
$$filtered = str_replace('"', ' ', $$filtered);
echo trim($$filtered);
?>
-----------------------------------------------
Now, when you run it, you'll see only about one line of text, which ends where the first script starts. Everything after that point is stripped, including a lot that shouldn't be.
[edited by: jatar_k at 9:31 pm (utc) on Dec. 17, 2003]
[edit reason] reduced amount of code [/edit]
[ca2.php.net...]
Example 5. Convert HTML to text
<?php
// $document should contain an HTML document.
// This will remove HTML tags, javascript sections
// and white space. It will also convert some
// common HTML entities to their text equivalent.

$search = array ("'<script[^>]*?>.*?</script>'si", // Strip out javascript
"'<[\/\!]*?[^<>]*?>'si", // Strip out html tags
"'([\r\n])[\s]+'", // Strip out white space
"'&(quot|#34);'i", // Replace html entities
"'&(amp|#38);'i",
"'&(lt|#60);'i",
"'&(gt|#62);'i",
"'&(nbsp|#160);'i",
"'&(iexcl|#161);'i",
"'&(cent|#162);'i",
"'&(pound|#163);'i",
"'&(copy|#169);'i",
"'&#(\d+);'e"); // evaluate as php

$replace = array ("",
"",
"\\1",
"\"",
"&",
"<",
">",
" ",
chr(161),
chr(162),
chr(163),
chr(169),
"chr(\\1)");

$text = preg_replace ($search, $replace, $document);
?>
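For the original question, only the first pattern of that example matters: the `?` after `.*` makes the match lazy, so each script block is removed on its own rather than everything from the first opening tag to the last closing tag. A minimal sketch of the difference, assuming the page is already in a string; the function name strip_script_blocks and the sample markup are made up for illustration:

```php
<?php
// Hypothetical helper: removes each <script>...</script> block separately.
function strip_script_blocks($html)
{
    // [^>]* allows attributes on the opening tag,
    // .*?   is lazy, so the match stops at the nearest </script>,
    // s     lets . cross newlines, i ignores case.
    return preg_replace('#<script[^>]*>.*?</script>#is', '', $html);
}

// Made-up sample with one script near the top and one at the bottom.
$page = "<body><script>var a = 1;</script><p>Hello</p>"
      . "<SCRIPT type=\"text/javascript\">\nvar b = 2;\n</SCRIPT></body>";

// Greedy pattern, equivalent to the behaviour described in the question.
$greedy = preg_replace('#<script.*/script>#is', '', $page);
$lazy   = strip_script_blocks($page);

echo $greedy, "\n"; // <body></body>
echo $lazy, "\n";   // <body><p>Hello</p></body>
```

Using `#` as the delimiter avoids having to escape the slash in `</script>`.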
Then write your own.. I didn't think I'd have to write it for you :-)
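$x = 0;
$CodeStuff = array('', ''); // [0] collects page text, [1] collects script text
$fp = fopen("http://page", "r");
do
{
$Input = fgets($fp, 32768);
// strtolower() output can never equal "<SCRIPT>", so compare against lowercase.
// This only matches a line that holds the bare tag and nothing else.
if(trim(strtolower($Input)) == "<script>")
{
++$x;
}
$CodeStuff[$x] .= $Input;
if(trim(strtolower($Input)) == "</script>")
{
--$x;
}
}
while(!feof($fp));
fclose($fp);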
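The line-by-line approach above relies on each tag sitting alone on its own line, which real pages rarely guarantee. A rough sketch of the same idea that scans for tag positions instead, assuming the whole page has been read into a string first; split_script_blocks is a hypothetical name, not an existing PHP function:

```php
<?php
// Hypothetical helper: separates a page into non-script text and script blocks
// by searching for tag positions rather than comparing whole lines.
function split_script_blocks($html)
{
    $text = '';
    $scripts = array();
    $pos = 0;
    $closeLen = strlen('</script>');

    // stripos() is case-insensitive, so <SCRIPT ...> is found as well.
    while (($start = stripos($html, '<script', $pos)) !== false) {
        $text .= substr($html, $pos, $start - $pos);  // text before the tag
        $end = stripos($html, '</script>', $start);
        if ($end === false) {
            // Unclosed script block: treat the rest of the page as script and stop.
            $pos = strlen($html);
            break;
        }
        $scripts[] = substr($html, $start, $end + $closeLen - $start);
        $pos = $end + $closeLen;
    }
    $text .= substr($html, $pos);                     // text after the last block

    return array($text, $scripts);
}

list($text, $scripts) = split_script_blocks(
    "<p>A</p><script src=\"x.js\"></script><p>B</p>"
);
echo $text, "\n";           // <p>A</p><p>B</p>
echo count($scripts), "\n"; // 1
```

Keeping the script blocks in a separate array mirrors what $CodeStuff does in the loop above; if only the cleaned text is needed, the second return value can be ignored.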