Extract only plain text from a page

How can I extract only plain text from a page without leaving code behind?

         

mcbrook

5:23 am on Aug 24, 2007 (gmt 0)



Hi, I'm new here but have been using PHP for some time now.

Basically, what I am trying to do is write some PHP code that will automatically take text from any web page and eliminate all the HTML, CSS, and JS code and formatting, leaving only the plain text from the page. I got my code started, but I have hit a snag with JavaScript and CSS. This is what I have so far:

<?php

$geturl = $_GET["url"];

ob_start();
include($geturl);
$page = ob_get_contents();
ob_end_clean();

$output = ereg_replace('<script.*.</script>', ' ', $page);

$output2 = ereg_replace('<style.*.</style>', ' ', $output);

$plaintext = strip_tags($output2);

echo $plaintext;

?>

The strip_tags function removes all HTML tags, but it does nothing about JavaScript and CSS. With ordinary HTML, whatever sits between an opening and a closing tag is text that should stay. With JavaScript and CSS, the code itself sits between the opening and closing tags, so it is left behind once the tags are stripped. For clarification:

html:
<div name="htmltag">Keep this text here</div>

javascript:
<script>function somejs() {remove all this code}</script>

As you can see, the text between the div tags should stay, but the js between the script tags should be removed because it is code.
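To make the behaviour concrete, here is a minimal sketch that runs strip_tags on the two sample snippets above. The variable names are only illustrative.

```php
<?php
// Sample markup taken from the two examples above.
$html = '<div name="htmltag">Keep this text here</div>'
      . '<script>function somejs() {remove all this code}</script>';

// strip_tags() drops the tags themselves but keeps whatever sits between them,
// so the JavaScript source survives as if it were ordinary text.
$stripped = strip_tags($html);

echo $stripped, "\n";
// Keep this text herefunction somejs() {remove all this code}
```

This is why the script and style blocks have to be removed before strip_tags is called.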

I then tried the ereg_replace function to get rid of the JS and CSS code, but there is a problem when there is more than one piece of JS or CSS code. The wildcard (.*.) skips over any ending script or style tags until it reaches the last ending tag, therefore deleting all the text between the two pieces of code. Example:

<SCRIPT>function somejs() {remove all this code}</script> //removes all text and code from beginning here
KEEP ALL THIS TEXT HERE
<script>function somejs() {remove all this code}</SCRIPT> //to end here

Now, finally, down to the question: is there any way to remove only the JS and CSS code between the beginning tag and the immediately following ending tag? Or is there any other way to get rid of the JavaScript and CSS code?

Thanks for any help...
while I am banging my head against the wall :)

vincevincevince

6:05 am on Aug 24, 2007 (gmt 0)




$output = ereg_replace('<script.*.</script>', ' ', $page);
$output2 = ereg_replace('<style.*.</style>', ' ', $output);

I don't use ereg as a rule, but if you use preg then the expression is:

$output = preg_replace('/\<script.*?\<\/script\>/ism', ' ', $page);
$output2 = preg_replace('/\<style.*?\<\/style\>/ism', ' ', $output);

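The ? makes the .* match non-greedy, so it 'eats' only the minimum it can in order to match the pattern. By default, the * of .* is greedy and will go for the biggest possible valid match. The i modifier makes the match case-insensitive, and the s modifier lets . match newlines, so scripts that span several lines are caught too.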
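As a rough illustration of the difference, the sketch below runs a greedy and a non-greedy pattern over the sample from the question. The sample string and variable names are made up for the demonstration, and the patterns are slightly simplified versions of the ones above (the backslashes before < and > are optional).

```php
<?php
// Two script blocks with text in between, as in the question.
$page = "<SCRIPT>function somejs() {remove all this code}</script>\n"
      . "KEEP ALL THIS TEXT HERE\n"
      . "<script>function somejs() {remove all this code}</SCRIPT>";

// Greedy: .* runs on to the LAST closing tag, so the text in between is lost.
$greedy = preg_replace('/<script.*<\/script>/is', ' ', $page);

// Non-greedy: .*? stops at the FIRST closing tag, so each block is removed on its own.
$lazy = preg_replace('/<script.*?<\/script>/is', ' ', $page);

var_dump(trim($greedy)); // string(0) ""
var_dump(trim($lazy));   // string(23) "KEEP ALL THIS TEXT HERE"
```

The same approach applies to the style pattern. Regular expressions are a pragmatic shortcut here and not a full HTML parser, so unusual markup may still slip through.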

mcbrook

8:28 pm on Aug 24, 2007 (gmt 0)



It looks like that did the trick. Thanks. But now I have another issue: I am trying to take the final plaintext string and put it in an RSS feed in an XML document. The bigger picture is that I am trying to do text-to-speech using Talkr.com's service. I already have a Flash MP3 player that uses Talkr automatically when it is given an item in an RSS feed that does not have enclosure tags, so the actual text-to-speech part is working fine. The only requirement is that the text I want it to read has to be in the <description> tag, so in this case I would echo the plaintext string in that tag. It looks fine when I look at the XML document, but the MP3 player is telling me there is an error with the include function in here:

ob_start();
include($geturl);
$page = ob_get_contents();
ob_end_clean();

So to solve this, is there something else I can use besides the include function to grab the source code from a page? Or is there any other method I can use to get the source code?
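One common alternative is file_get_contents, which returns the page as a string and does not execute anything in it, whereas include runs any PHP it finds in what it loads. The sketch below assumes allow_url_fopen is enabled and PHP 7 or later. The helper name fetch_page_source and the checks inside it are hypothetical, not something from this thread.

```php
<?php
// Hypothetical helper: fetch the source of a page as a plain string.
function fetch_page_source(string $url): string
{
    // Accept only well-formed http(s) URLs, since $url comes from user input.
    $scheme = strtolower((string) parse_url($url, PHP_URL_SCHEME));
    if (filter_var($url, FILTER_VALIDATE_URL) === false
        || !in_array($scheme, ['http', 'https'], true)) {
        throw new InvalidArgumentException('Only http(s) URLs are accepted.');
    }

    // Reads the response body into a string; nothing in it is executed.
    $source = @file_get_contents($url);
    if ($source === false) {
        throw new RuntimeException('Could not fetch ' . $url);
    }

    return $source;
}
```

With this in place, the four output-buffering lines could be replaced by something like $page = fetch_page_source($_GET["url"]); and the rest of the script stays the same. The scheme check is only a first line of defence, so a public script would probably want stricter limits on which hosts it will fetch.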

mcbrook

10:14 pm on Aug 24, 2007 (gmt 0)



Never mind. I found out that the error appeared because the feed was not receiving the page URL properly when going through the MP3 player. I'll fix this, and if I have any more problems, I'll post something. Thanks.