Regular Expression Help - PHP Server Side Scripting forum at WebmasterWorld - WebmasterWorld

Regular Expression Help

Word Regex

RyanM

11:07 pm on Feb 21, 2005 (gmt 0)

10+ Year Member

Hi, I have downloaded some Word-stripping code that removes all the Word-specific tags. It works well, but it leaves in the TOC (table of contents) anchor links. These are occasionally wrapped around a heading, but most often are simply empty <a></a> pairs.

After scouring the internet for tutorials and examples I finally came up with a regex that should delete the opening tag (i.e. '<a name="_Toc437192468">'). In PHP, however, it leaves the first '<' behind, i.e. it deletes only 'a name="_Toc437192468">'.

so <a name="_Toc437192468"><h3>Heading</h3></a> becomes
<<h3>Heading</h3></a>

my regex is


$D = preg_replace('<a name="_Toc[^"\>]*">', '', $D);

Paraphrased:

Find '<a name="_Toc' + (any run of characters that are neither '"' nor '>') + '">'

What is wrong with this regex? Why is it ignoring the first "<" within the tag?

Also, if somebody could give me some pointers on how to find the first </a> after each of these and delete it, I would really appreciate it (I admit I'm stumped, and for the most part regexes seem to be going over my head).

Thanks

- Ryan

coopster

12:31 pm on Feb 22, 2005 (gmt 0)

WebmasterWorld Administrator

10+ Year Member

Welcome to WebmasterWorld, RyanM.

The first character is more than likely being recognized as a delimiter [php.net]. Try adding a delimiter to your pattern and see where you get...
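
A minimal sketch of the delimiter issue, using the sample tag from the question. The choice of "~" as the explicit delimiter is just an assumption for illustration (any non-alphanumeric, non-backslash character works), and the exact leftover from the undelimited pattern may differ between PHP versions.

```php
<?php
// Sample input taken from the question above.
$html = '<a name="_Toc437192468"><h3>Heading</h3></a>';

// No explicit delimiter: PHP takes the leading "<" as the opening delimiter
// and pairs it with a closing ">", so the "<" is not part of the regex and
// is never matched or removed.
$withoutDelimiters = preg_replace('<a name="_Toc[^"\>]*">', '', $html);

// Explicit "~" delimiters: "<" and ">" now belong to the pattern itself.
$withDelimiters = preg_replace('~<a name="_Toc[^">]*">~', '', $html);

echo $withoutDelimiters, "\n"; // still begins with a stray "<"
echo $withDelimiters, "\n";    // <h3>Heading</h3></a>
```

Using a delimiter that does not occur in HTML, such as "~" or "#", also avoids having to escape the "/" in closing tags.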

Bonusbana

12:41 pm on Feb 22, 2005 (gmt 0)

10+ Year Member

How about:

$D = preg_replace("/<a name=\"_Toc[^\"]+\">/", "", $D);

And if you want to delete the following </a> as well (but keep what's in between):

$D = preg_replace("/<a name=\"_Toc[^\"]+\">([^<]+)<\/a>/", "\1", $D);

RyanM

8:40 pm on Feb 22, 2005 (gmt 0)

10+ Year Member

Thanks for the help coopster and Bonusbana.

Bonusbana, that was great; the first regular expression worked very well. However, the second one does not seem to work, so I will keep tinkering with it. I think my problem was that I was basing my regular expression on ones that I was writing and testing within Dreamweaver; obviously Dreamweaver's regexes are not as powerful as the PHP ones.

Anyway, it is important for me to learn regexes so that I do not have to bother you guys about it again :). If someone does not mind, could you please have a look at my paraphrasing below and tell me if my understanding is correct?


First Regex:
Open Delimiter /
<a name="
_Toc
Any character other than "
One or more times
">
End Delimiter /

Second Regex:
Open Delimiter /
<a name="
_Toc
Any character other than "
One or more times
">
Sub Pattern Start
Any character other than <
One or more times
Sub Pattern End
</a>
End Delimiter /
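
The breakdown above can also be written straight into the pattern. This is only a sketch: it uses PCRE's "x" (extended) modifier, which ignores unescaped whitespace and allows "#" comments, and it swaps the "/" delimiter for "~" so that </a> needs no escaping. Both choices are assumptions made for readability, not something the posts above used.

```php
<?php
// Second regex from the thread, one piece per line with a comment.
// In "x" mode a literal space must be written as [ ] (or escaped).
$pattern = '~
    <a[ ]name="     # literal text: <a name="
    _Toc            # literal text: _Toc
    [^"]+           # any character other than a double quote, one or more times
    ">              # literal text: ">
    (               # sub-pattern (capture group 1) start
        [^<]+       # any character other than <, one or more times
    )               # sub-pattern end
    </a>            # literal text: </a>
~x';

$input = '<a name="_Toc437192468">Heading</a>';

// Keep only what the sub-pattern captured.
echo preg_replace($pattern, '$1', $input), "\n"; // Heading
```

Written this way, the reading given above matches what the pattern does: the sub-pattern only accepts text with no "<" in it, so it works for a plain heading but not for an anchor that wraps other tags.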

Also, within the second regex (the one that does not work, for whatever reason), why is the replacement string \1?

Thanks for all the help

- Ryan

ironik

12:20 am on Feb 24, 2005 (gmt 0)

10+ Year Member

To attempt to fix the second regex:

$D = preg_replace("/<a name=\"_Toc[^\"]+\">([^<]+)<\/a>/", "\\1", $D);

OR

$D = preg_replace("/<a name=\"_Toc[^\"]+\">([^<]+)<\/a>/", "$1", $D);

The regex engine stores matches indicated between the ( and ) characters and assigns a variable name based on the order in which they appear in the pattern.

The first match is \\1 or $1,
the second match is \\2 or $2, and so on.

I'm not sure what the difference between the two is (or even whether using \1 is correct or incorrect), but everything I've read uses \\1 and $1 as the format for replacement matches.
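
On the \1 versus \\1 question, a short sketch of what PHP itself does with each spelling before preg_replace ever sees it. The sample input is a simplified anchor with plain text inside, chosen so that the pattern from the posts above matches.

```php
<?php
$pattern = '/<a name="_Toc[^"]+">([^<]+)<\/a>/';
$input   = '<a name="_Toc437192468">Heading</a>';

// "\1" in double quotes is parsed by PHP as an octal escape: one byte, value 1.
// preg_replace receives a control character, not a backreference.
$octal = preg_replace($pattern, "\1", $input);

// "\\1" in double quotes reaches preg_replace as backslash + 1: a backreference.
$backslash = preg_replace($pattern, "\\1", $input);

// '$1' is the other accepted spelling of the same reference.
$dollar = preg_replace($pattern, '$1', $input);

echo bin2hex($octal), "\n"; // 01
echo $backslash, "\n";      // Heading
echo $dollar, "\n";         // Heading
```

In a single-quoted string '\1' would also arrive intact, since single quotes only treat \\ and \' specially.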

RyanM

12:44 am on Feb 24, 2005 (gmt 0)

10+ Year Member

Hi ironik, thanks for the help. However, your regular expressions do not seem to work either; they do not appear to find the tags at all.

Here is an example of what I am looking at:

<a name="_Toc437192531"><span style="font-weight: bold;">Article 1</span></a>

Obviously the span will be removed by the existing regexes.

thanks

- Ryan

RyanM

12:46 am on Feb 24, 2005 (gmt 0)

10+ Year Member

PS I have modified the regex so it is looking for <\/a

$D = preg_replace("/<a name=\"_Toc[^\"]+\">([^<\/a]+)<\/a>/", "$1", $D);

however this does not seem to work either.

thanks

- Ryan

ironik

1:23 am on Feb 24, 2005 (gmt 0)

10+ Year Member

I think it's not working because the test data you've shown there has nested tags. The regex will have to be modified so it permits multiple tags inside the <a></a> tags.
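
A small sketch of the nested-tag point, run against the sample line posted above. The lazy (.*?) group and the "s" modifier are one possible way to allow nested tags, offered as an assumption rather than as the fix the thread settled on; "~" is used as the delimiter so </a> needs no escaping.

```php
<?php
$input = '<a name="_Toc437192531"><span style="font-weight: bold;">Article 1</span></a>';

// [^<]+ cannot cross the "<" of the nested <span>, so nothing matches
// and the input comes back unchanged.
$strict = preg_replace('~<a name="_Toc[^"]+">([^<]+)</a>~', '$1', $input);

// (.*?) accepts any characters, nested tags included, up to the nearest </a>.
// The "s" modifier lets "." match newlines as well.
$lazy = preg_replace('~<a name="_Toc[^"]+">(.*?)</a>~s', '$1', $input);

echo $strict, "\n"; // unchanged input
echo $lazy, "\n";   // <span style="font-weight: bold;">Article 1</span>
```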

Here's an untested attempt:
$D = preg_replace("/<a name=\"_Toc[^\"]+\">([.]*)<\/a>/", "$1", $D);

I think, in theory, it should strip all the toc link tags out regardless of any nested tags. I've had trouble with matches using the period character, so here's an alternate just in case:

$D = preg_replace("/<a name=\"_Toc[^\"]+\">([\w?\W?\s?\S?]*)<\/a>/", "$1", $D);

(edit: oops, just realised you said that the span tags will be removed... I'm not sure why it won't work then?)
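
A quick check of how the period behaves in the two attempts above, as a sketch with made-up test strings. Inside square brackets a period is a literal dot, so ([.]*) can only capture dots; outside brackets it is a wildcard that stops at newlines unless the "s" modifier is present.

```php
<?php
// Inside square brackets the period is a literal dot, not a wildcard.
$dotsOnly   = preg_match('~\A[.]*\z~', '...');      // 1: nothing but dots
$classOnTag = preg_match('~\A[.]*\z~', '<b>x</b>'); // 0: tags are not dots

// Outside brackets the period matches any character except a newline...
$plainDot = preg_match('~\A.*\z~', "<b>x</b>\n<i>y</i>");  // 0
// ...unless the "s" modifier is added.
$dotAll   = preg_match('~\A.*\z~s', "<b>x</b>\n<i>y</i>"); // 1

// The alternate class matches every character; the "?" marks inside the
// brackets are literal question marks, not quantifiers.
$anyClass = preg_match('~\A[\w?\W?\s?\S?]*\z~', "<b>x</b>\n<i>y</i>"); // 1

var_dump($dotsOnly, $classOnTag, $plainDot, $dotAll, $anyClass);
```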

RyanM

1:44 am on Feb 24, 2005 (gmt 0)

10+ Year Member

Hi Ironik,

Thanks for that; unfortunately it still does not work. It is, however, a good starting point, and hopefully I can fiddle with them until I get it working. Of course any further help will be much appreciated.

Thanks

- Ryan

ironik

3:14 am on Feb 24, 2005 (gmt 0)

10+ Year Member

mmm... another stab in the dark then:

$D = preg_replace("/<a name=\"_Toc[0-9A-Za-z-]+\">([\w\s\S\W]+)<\/a>/i", "$1", $D);

Added a case-insensitive switch, and allowed only alphanumerics after the _Toc text (not sure if that other expression may have been causing it to fail).

Wish I had a test server available before posting them, apologies.

RyanM

3:31 am on Feb 24, 2005 (gmt 0)

10+ Year Member

Hi Ironik,

No need for apologies, you are already helping more than I could possibly have hoped for. If you lived in NZ I would owe you a beer :)

Anyway, for some reason it also does not work. Strangely, it works on only some (but not all) of the _Tocs.

an example of one that it did not work on is this:

<a name="_Toc437192532"><strong>Article 2</strong></a>

an example of one that it did work on is:

<a name="_Toc437192531"><strong>Article 1</strong></a>

There is no discernable pattern of where it is working and where it is not.
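
One possible explanation, offered as a sketch and assuming both anchors sit in the same string passed to preg_replace: the + in ([\w\s\S\W]+) is greedy, so a single match can run from the first _Toc anchor to the last </a> in the text, and the anchors swallowed inside that match are never looked at again. Adding "?" after the "+" makes the group lazy, so each anchor is matched on its own.

```php
<?php
// The two sample lines from the post above, joined as they might appear
// in one document.
$input = '<a name="_Toc437192531"><strong>Article 1</strong></a>' . "\n"
       . '<a name="_Toc437192532"><strong>Article 2</strong></a>';

// Greedy: one match spanning from the first anchor to the final </a>.
$greedy = preg_replace(
    '/<a name="_Toc[0-9A-Za-z-]+">([\w\s\S\W]+)<\/a>/i', '$1', $input);

// Lazy: each match stops at the nearest </a>, so both anchors are stripped.
$lazy = preg_replace(
    '/<a name="_Toc[0-9A-Za-z-]+">([\w\s\S\W]+?)<\/a>/i', '$1', $input);

echo $greedy, "\n\n"; // the second _Toc anchor is still present
echo $lazy, "\n";     // only the two <strong> elements remain
```

This would fit the behaviour described above, where Article 1 was cleaned up and Article 2 was not, although the real document may contain other differences.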

I might just have to give up and do it with code, I think. Regardless of whether or not this works, I have been given a good starting point for learning and eventually mastering regexes.

Thanks

- Ryan