Extract only plain text from a page

Hi, I'm new here but have been using PHP for some time now.

Basically, I am trying to write PHP code that takes the text from any web page and strips out all the HTML, CSS, and JavaScript code and formatting, leaving only the plain text. I have made a start, but I have hit a snag with JavaScript and CSS. This is what I have so far:

<?php

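$geturl = $_GET["url"];

// Fetch the page as a string. file_get_contents() is used here because
// include() would execute any PHP code found in the fetched content.
$page = file_get_contents($geturl);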

$output = ereg_replace('<script.*.</script>', ' ', $page);

$output2 = ereg_replace('<style.*.</style>', ' ', $output);

$plaintext = strip_tags($output2);

echo $plaintext;

The strip_tags function removes all HTML tags, but it only removes the tags themselves and keeps whatever sits between them. That is fine for ordinary HTML, where the content between the tags is text I want to keep. For JavaScript and CSS, though, the content between the opening and closing tags is code, and it gets left behind. To illustrate:

HTML:
<div name="htmltag">Keep this text here</div>

JavaScript:
<script>function somejs() {remove all this code}</script>

As you can see, the text between the div tags should stay, but the JavaScript between the script tags should be removed because it is code.
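
To make that behaviour concrete, here is a minimal sketch that runs strip_tags on the two snippets above joined together. The variable names are only illustrative.

```php
<?php
// Illustrative input: the two snippets from the post, joined together.
$html = '<div name="htmltag">Keep this text here</div>'
      . '<script>function somejs() {remove all this code}</script>';

// strip_tags() deletes the tags themselves but keeps whatever sits
// between them, so the JavaScript source survives as ordinary text.
$stripped = strip_tags($html);

echo $stripped, "\n";
```

The div text comes through as wanted, but the body of the script block is printed right after it.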

I then tried ereg_replace to get rid of the JavaScript and CSS, but it goes wrong when the page contains more than one script or style block. The wildcard (.*.) is greedy: it skips over every closing script or style tag until it reaches the last one, so it also deletes all the text between the two blocks. Example:

<SCRIPT>function somejs() {remove all this code}</script> //removes all text and code from beginning here
KEEP ALL THIS TEXT HERE
<script>function somejs() {remove all this code}</SCRIPT> //to end here
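
The greedy behaviour can be reproduced with a short sketch. It assumes preg_replace rather than ereg_replace, since the ereg functions were removed in PHP 7; a plain .* is greedy in both. The sample is written in lowercase so that the case-sensitive pattern matches both blocks.

```php
<?php
// The example from the post, written in one case so the case-sensitive
// pattern below matches both script blocks.
$page = "<script>function somejs() {remove all this code}</script>\n"
      . "KEEP ALL THIS TEXT HERE\n"
      . "<script>function somejs() {remove all this code}</script>";

// Greedy: ".*" runs on to the LAST </script>, so a single match covers
// both script blocks and the text between them.
// The "s" modifier lets "." match newlines as well.
$greedy = preg_replace('#<script.*</script>#s', ' ', $page);

var_dump($greedy);
```

The whole input, including the line that should have been kept, collapses into the single replacement space.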

Now, finally, the question: is there any way to remove only the JavaScript and CSS between an opening tag and the very next closing tag? Or is there another way to get rid of the JavaScript and CSS altogether?

Thanks for any help...
while I am banging my head against the wall :)
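
One possible approach, sketched under a few assumptions: switch to preg_replace and use a lazy quantifier (.*?), which stops at the first closing tag after each opening tag. The helper name page_to_text and the sample input are hypothetical, and a regex like this is not a full HTML parser, so unusual markup may still slip through.

```php
<?php
/**
 * Hypothetical helper (not from the original post): removes <script> and
 * <style> blocks, then strips the remaining tags.
 */
function page_to_text(string $page): string
{
    // ".*?" is lazy: it stops at the FIRST closing tag after each opening tag.
    // "i" ignores case (<SCRIPT> vs </script>); "s" lets "." span newlines.
    $patterns = [
        '#<script\b[^>]*>.*?</script>#is',
        '#<style\b[^>]*>.*?</style>#is',
    ];
    $withoutCode = preg_replace($patterns, ' ', $page);

    // preg_replace() returns null on a regex error; fall back to the input.
    if ($withoutCode === null) {
        $withoutCode = $page;
    }

    // Remove the remaining tags, then collapse runs of whitespace.
    $text = strip_tags($withoutCode);
    return trim((string) preg_replace('/\s+/', ' ', $text));
}

// Sample input based on the example in the post, plus one style block.
$page = "<SCRIPT>function somejs() {remove all this code}</script>\n"
      . "KEEP ALL THIS TEXT HERE\n"
      . "<style>p { color: red; }</style>"
      . "<script>function somejs() {remove all this code}</SCRIPT>";

echo page_to_text($page), "\n";
```

With the sample input above, only KEEP ALL THIS TEXT HERE is left. Each script and style block is matched and removed on its own, so the text between them is untouched.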

mcbrook