Forum Moderators: coopster

Regular Expressions

Trying to find text and delete it

         

coho75

9:04 pm on Jul 7, 2004 (gmt 0)

10+ Year Member



I have some files I am working with that need to have some text deleted from them. The text is a mixture of HTML tags and plain text. I thought a good way to approach this was to insert <!-- DeleteBegin --> and <!-- DeleteEnd --> around the text I want to delete. This way I would be able to quickly find what needed to be deleted. So I started to write a program to go through the files. However, I am having some trouble getting it to work. What I have so far is:

<?php

$filename = "test.html";
####################################

// Read file to change
$fp = fopen($filename, "r") or die("Couldn't open $filename");

$str = fread($fp, 10000);
fclose($fp);

ereg("([^0-9]--[ \t\n\r]DeleteBegin[ \t\n\r]--[^0-9])([^0-9]--[ \t\n\r]DeleteEnd[ \t\n\r]--[^0-9])", $str, $out);

$found = "Test";

// $out is what should be between the <!-- DeleteBegin --> and <!-- DeleteEnd -->

$oldtext = "$out[2]";
$new_file_text = str_replace($oldtext , $found , $str);

// Write file

$filePointer = fopen($filename,"w");
fwrite($filePointer, $new_file_text);
fclose($filePointer);

?>

Maybe I am trying to do something that could be done an easier way or something that may not be possible. All help is appreciated.

Thanks,
coho75

ExpLarry

10:02 pm on Jul 7, 2004 (gmt 0)

10+ Year Member



ereg("([^0-9]--[ \t\n\r]DeleteBegin[ \t\n\r]--[^0-9])([^0-9]--[ \t\n\r]DeleteEnd[ \t\n\r]--[^0-9])", $str, $out);

Personally, I prefer the preg (Perl-compatible regular expression) syntax, and I'd do something like this:

$str = preg_replace('/<!--\s*DeleteBegin\s*-->.*?<!--\s*DeleteEnd\s*-->/','',$str);

This matches and replaces in one go. See: [php.net...] (Disclaimer: code untested).

You might also want to be sure you're reading in the whole of your file, not just the first 10000 bytes (unless you're sure the file won't exceed that).
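As a rough sketch of both points above (replace in one go, read the whole file), something along these lines might do. The helper name strip_marked_blocks and the temporary demo file are made up for illustration, and the s modifier is an addition that is not in the pattern posted above; it is assumed here so that a marked block may span several lines. It also relies on file_get_contents() and file_put_contents(), which older PHP 4 installs may not have.

```php
<?php
// Hypothetical helper: delete every block wrapped in the DeleteBegin/DeleteEnd
// comment markers, including the markers themselves.
function strip_marked_blocks(string $html): string
{
    // s: let . match newlines so a block may span several lines.
    // .*? is lazy, so each DeleteBegin pairs with the nearest DeleteEnd.
    $pattern = '/<!--\s*DeleteBegin\s*-->.*?<!--\s*DeleteEnd\s*-->/s';
    $result = preg_replace($pattern, '', $html);
    // preg_replace() returns null on error; fall back to the input unchanged.
    return $result === null ? $html : $result;
}

// Demo on a temporary file so nothing real is overwritten.
$filename = tempnam(sys_get_temp_dir(), 'del');
file_put_contents($filename, "Keep me <!-- DeleteBegin --> but\ndelete me <!-- DeleteEnd --> and keep me too\n");

// file_get_contents() reads the whole file, whatever its size.
$contents = file_get_contents($filename);
file_put_contents($filename, strip_marked_blocks($contents));

print file_get_contents($filename);
unlink($filename);
```

Reading the whole file avoids having to guess a byte count for fread().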

coho75

12:23 am on Jul 8, 2004 (gmt 0)

10+ Year Member



Thanks for your response. I couldn't get it to work. I tried (.*?) earlier and had no luck. All I got was error messages. Is there any way to print the text between <!-- DeleteBegin --> and <!-- DeleteEnd --> to the screen? Thanks for the help.

coho75

coopster

2:14 pm on Jul 8, 2004 (gmt 0)

WebmasterWorld Administrator 10+ Year Member



You need to set up the subpattern [php.net] as a capturing subpattern to get the correct data returned. When the whole pattern matches, the portion of the subject string that matched the subpattern is passed back to the caller via the ovector [1] argument of pcre_exec() [pcre.org]. Opening parentheses are counted from left to right (starting from 1) to obtain the numbers of the capturing subpatterns. You also need a back reference [php.net]. A backslash followed by a digit greater than 0 (and possibly further digits) is a back reference to a capturing subpattern earlier (i.e. to its left) in the pattern, provided there have been that many previous capturing left parentheses (Note: in PHP, the replacement may contain references of the form \\n or (since PHP 4.0.4) $n, with the latter form being the preferred one):

$str = preg_replace('/<!--\s*DeleteBegin\s*-->(.*?)<!--\s*DeleteEnd\s*-->/', "$1", $string); 

[1] ovector:
Captured substrings are returned to the caller via a vector of integer offsets whose address is passed in ovector.
Resource: PCRE man pages [pcre.org]
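To actually print the text between the markers, one option is to capture it and read it back from the match array. This is only a sketch: the sample string is made up, and the s modifier is an assumption added so that a block may contain newlines.

```php
<?php
// Sample input standing in for the file contents (made up for this sketch).
$str = "Keep me <!-- DeleteBegin --> but delete me <!-- DeleteEnd --> and keep me too\n"
     . "More <!-- DeleteBegin -->second\nblock<!-- DeleteEnd --> text";

// Group 1 captures whatever sits between the two markers.
$pattern = '/<!--\s*DeleteBegin\s*-->(.*?)<!--\s*DeleteEnd\s*-->/s';

// With the default flags, $matches[0] holds the full matches
// and $matches[1] holds what group 1 captured in each of them.
$count = preg_match_all($pattern, $str, $matches);

foreach ($matches[1] as $i => $between) {
    print "Block " . ($i + 1) . ": [" . $between . "]\n";
}
```

Nothing is modified here, so this can be run against the real file contents before any replacing is done.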

ExpLarry

5:06 pm on Jul 8, 2004 (gmt 0)

10+ Year Member



$str = preg_replace('/<!--\s*DeleteBegin\s*-->(.*?)<!--\s*DeleteEnd\s*-->/', "$1", $string);

This has the effect of deleting the comment tags DeleteBegin / DeleteEnd, but not the characters in between, which is not what coho75 wanted (if I understand correctly).

The sections of text being deleted could be printed using preg_replace_callback() like this:


$string = <<<_END
Keep me <!-- DeleteBegin --> but delete me <!-- DeleteEnd --> and keep me too
_END;

print "Original:\n$string\n";

$string = preg_replace_callback('/<!--\s*DeleteBegin\s*-->.*?<!--\s*DeleteEnd\s*-->/','show_deleted',$string);

print "String now:\n$string\n";

function show_deleted($matches) {
print "Deleting: ".$matches[0]."\n";
return '';
}

(disclaimer: script tested and seems to work)

coopster

5:23 pm on Jul 8, 2004 (gmt 0)

WebmasterWorld Administrator 10+ Year Member



Right, I understand ultimately where this might be going, but the last question asked was...

Is there any way to print the text between <!-- DeleteBegin --> and <!-- DeleteEnd --> to the screen?

...so I obliged. Figured there must be some other troubleshooting coho75 needed to perform.

[edited by: jatar_k at 6:55 pm (utc) on July 8, 2004]

coopster

7:16 pm on Jul 8, 2004 (gmt 0)

WebmasterWorld Administrator 10+ Year Member



BTW, you really wouldn't need to go through the callback function; it could all be done in the regex. As a matter of fact, you may even want to leave your tags in there, just in case you wanted to write some other content back in at some point.

$string = preg_replace('/(.*<!--\s*DeleteBegin\s*-->)(.*)(<!--\s*DeleteEnd\s*-->.*)/Uis', "$1$3", $string); 

If you don't want the comment tags, just move the parentheses in the subpatterns:

$string = preg_replace('/(.*)<!--\s*DeleteBegin\s*-->(.*)<!--\s*DeleteEnd\s*-->(.*)/Uis', "$1$3", $string); 

[edited by: coopster at 7:18 pm (utc) on July 8, 2004]

coho75

7:17 pm on Jul 8, 2004 (gmt 0)

10+ Year Member



Thanks guys for the help. ExpLarry, I could not get your code to work. You understood my first post exactly. I am trying to read a file and find the inserted comment tags and then delete what is between them. I would ultimately like to rewrite the original file to show the changes that were made. Currently, I cannot get the code to work. I thought by printing the portion that would get deleted to the screen it would be easier to make sure the correct text was being grabbed. Anyways, the code I am working with right now is:

<?php

$filename = "file.html";
####################################

// Read file to change
$fp = fopen($filename, "r") or die("Couldn't open $filename");

$str = fread($fp, 100000);
fclose($fp);

$string = <<<_END
Keep me <!-- DeleteBegin --> but delete me <!-- DeleteEnd --> and keep me too
_END;

print "Original:\n$string\n";

$string = preg_replace_callback('/<!--\s*DeleteBegin\s*-->.*?<!--\s*DeleteEnd\s*-->/','show_deleted',$string);

print "String now:\n$string\n";

function show_deleted($matches) {
print "Deleting: ".$matches[0]."\n";
return '';
}

?>

Thanks to everyone who has given help on this.

coho75

coopster

7:20 pm on Jul 8, 2004 (gmt 0)

WebmasterWorld Administrator 10+ Year Member



Posted while you were; see the previous message. The main things added were to process any newlines and to make the regex ungreedy (the Uis modifiers). With U in place you also have to remove the question mark (?) from the regex, because under the U modifier a quantifier followed by ? becomes greedy again.
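A small comparison may make the interaction between U and ? easier to see. This is a sketch on a made-up one-line subject with two marked blocks; the labels and variable names are only for illustration.

```php
<?php
// Two marked blocks on one line, so greedy and lazy matching give different results.
$subject = 'A <!-- DeleteBegin -->one<!-- DeleteEnd --> B <!-- DeleteBegin -->two<!-- DeleteEnd --> C';

$begin = '<!--\s*DeleteBegin\s*-->';
$end   = '<!--\s*DeleteEnd\s*-->';

$patterns = [
    'plain .*'  => '/' . $begin . '.*'  . $end . '/s',   // greedy by default
    'plain .*?' => '/' . $begin . '.*?' . $end . '/s',   // ? makes it lazy
    'U and .*'  => '/' . $begin . '.*'  . $end . '/Us',  // U makes .* lazy
    'U and .*?' => '/' . $begin . '.*?' . $end . '/Us',  // U plus ? flips back to greedy
];

$results = [];
foreach ($patterns as $label => $pattern) {
    $results[$label] = preg_replace($pattern, '', $subject);
    print str_pad($label, 10) . ' => ' . $results[$label] . "\n";
}
```

The two greedy variants swallow everything from the first DeleteBegin to the last DeleteEnd, including the " B " that should have been kept. The two lazy variants remove each block separately.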

coho75

7:28 pm on Jul 8, 2004 (gmt 0)

10+ Year Member



Thanks Coopster. I got it to work just the way I wanted it to. Could you explain what the "$1$3" does? Thanks for the help. It is really appreciated.

coho75

coopster

7:54 pm on Jul 8, 2004 (gmt 0)

WebmasterWorld Administrator 10+ Year Member



Sure. OK, from the top. First I added a bunch of newlines and other junk to make this look more like a real HTML page, then printed the original out, and a quick statement to let me know the "new" string was on its way...

$string = "Keep me <!-- DeleteBegin --> but \n 
delete me <!-- DeleteEnd --> and keep me too\n
<table><tr><td>This</td><td>is</td><td>a</td></tr>\n
<tr><td>small</td><td>little</td><td>table!</td></tr></table>\n
Another Keep me <!-- DeleteBegin --> another but delete me <!-- DeleteEnd --> another and keep me too";
print "Original:\n$string\n\n";
print "New:\n";

Next, you run the string through the regular expression. The first set of parentheses matches any text in front of the comment, and whatever each parenthesized subpattern matches is stored in a numbered reference ($1, $2, $3, etc.) as described in my first post. So, whatever the subpattern starting at the first opening (or left) parenthesis matched will be placed in $1, the second in $2, etc. These can also be referred to as \\1, \\2, \\3, etc. as the manual states. So by looking at the regex below, we can see that we intend to replace each match in the $string variable with "$1$3", which will be whatever we matched in the first subpattern followed by whatever was matched in the third subpattern (there is also a special 0 (zero) subpattern that is the entire match).

$string = preg_replace('/(.*)<!--\s*DeleteBegin\s*-->(.*)<!--\s*DeleteEnd\s*-->(.*)/Uis', "$1$3", $string); 
print "$string";
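To see the numbered references in isolation, here is a smaller sketch that captures only the two markers and the text between them. It uses a lazy .*? with the s modifier instead of U, and the sample string and variable names are made up for the example.

```php
<?php
$string = "Keep me <!-- DeleteBegin --> but\ndelete me <!-- DeleteEnd --> and keep me too";

// Group 1 = opening marker, group 2 = text between, group 3 = closing marker.
$pattern = '/(<!--\s*DeleteBegin\s*-->)(.*?)(<!--\s*DeleteEnd\s*-->)/s';

// '$1$3' puts groups 1 and 3 back and leaves out group 2,
// so the markers survive and only their contents disappear.
$kept_markers = preg_replace($pattern, '$1$3', $string);

// '[$2]' does the opposite: the markers go, the middle stays (in brackets).
$only_middle = preg_replace($pattern, '[$2]', $string);

print $kept_markers . "\n";
print $only_middle . "\n";
```

The replacement strings are single-quoted here, which sidesteps any question of PHP variable interpolation.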

Note, as I stated before, the Uis modifiers [php.net]:
U = ungreedy
i = case-insensitive
s = include newlines (the dot matches them too)

That's why we need to drop the question mark in the original post -- what if you had more than one comment to get rid of? That's what the U and s modifiers are taking care of for us.


A little trick to see how subpatterns get captured is to use preg_match and kick out the entire collection using print_r. Don't forget to comment out your preg_replace() function during testing though:

// $string = preg_replace('/(.*)<!--\s*DeleteBegin\s*-->(.*)<!--\s*DeleteEnd\s*-->(.*)/Uis', "$1$3", $string); 
preg_match("/(.*<!--\s*DeleteBegin\s*-->)(.*)(<!--\s*DeleteEnd\s*-->.*)/Uis", $string, $matches);
print '<pre>';
print_r($matches);
print '</pre>';

Hope this helps -- coopster

ExpLarry

8:52 pm on Jul 8, 2004 (gmt 0)

10+ Year Member



That's why we need to drop the question mark in the original post -- what if you had more than one comment to get rid of? That's what the U and s modifiers are taking care of for us.

Ah - now I see my problem. Though I prefer PHP for web stuff, my scripting and regex background is Perl, and the question mark there is a non-greedy modifier. Thus my regex works in Perl (assuming modification for several lines) but not in PHP.

Must read up on PCRE - I've been going through life thinking "Perl Compatible" meant identical.

Apologies for the confusion.
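For anyone checking this later, a short sketch with the pattern from the 10:02 pm post: in PHP's preg functions the question mark is also a non-greedy modifier as long as U is not set. What the pattern lacks for blocks that span several lines is the s modifier. The sample strings and variable names below are made up.

```php
<?php
// Pattern as first posted, and the same pattern with the s modifier added.
$lazy        = '/<!--\s*DeleteBegin\s*-->.*?<!--\s*DeleteEnd\s*-->/';
$lazy_dotall = '/<!--\s*DeleteBegin\s*-->.*?<!--\s*DeleteEnd\s*-->/s';

$one_line  = 'Keep me <!-- DeleteBegin --> but delete me <!-- DeleteEnd --> and keep me too';
$two_lines = "Keep me <!-- DeleteBegin --> but\ndelete me <!-- DeleteEnd --> and keep me too";

$a = preg_replace($lazy, '', $one_line);          // block removed
$b = preg_replace($lazy, '', $two_lines);         // unchanged: . stops at the newline
$c = preg_replace($lazy_dotall, '', $two_lines);  // block removed

print $a . "\n" . $b . "\n" . $c . "\n";
```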