Regex grab HTML tags keep back reference

Readie

10:56 am on Mar 12, 2010 (gmt 0)

I recently wrote a script that handles a list of allowed HTML, and I'm wondering if anyone with a bit more experience than me could compare the three regexes (all work) that I wrote to acquire the non-self-closing tags and tell me which one is likely to incur the least overhead. This will be going into effect for user comments and could end up looping several hundred times on (some) page loads.

In use:

/<([^ \/>]+)([^>]+)?>(?m)([^(\<\/\\1\>)]+)(?-m)<\/\\1>/is

Others:

/<([^ \/>]+)([^>]+)?>(?m)(.*?)(?-m)<\/\\1>/is
/<([^ \/>]+)([^>]+)?>(?m)(.*?(?!<\/\\1>).*?)(?-m)<\/\\1>/is

Cheers in advance,

Mike
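Before comparing speed, it may be worth checking that the three patterns accept the same input. The sketch below is only an illustration: the sample comments are made up, and the patterns are copied from the post as single-quoted PHP strings so that "\\1" reaches PCRE as \1. As far as PCRE's rules go, \1 inside a character class is read as an octal escape rather than a back reference, so the class in the "in use" pattern excludes the single characters ( < / > ) instead of the closing tag as a whole.

```php
<?php
// The three patterns from the post, as single-quoted PHP strings.
$patterns = array(
    'in_use'    => '/<([^ \/>]+)([^>]+)?>(?m)([^(\<\/\\1\>)]+)(?-m)<\/\\1>/is',
    'lazy'      => '/<([^ \/>]+)([^>]+)?>(?m)(.*?)(?-m)<\/\\1>/is',
    'lookahead' => '/<([^ \/>]+)([^>]+)?>(?m)(.*?(?!<\/\\1>).*?)(?-m)<\/\\1>/is',
);

// Hypothetical sample comments, chosen to contain "(", "/" and a nested tag.
$samples = array(
    '<b>plain text</b>',
    '<b>text (with brackets)</b>',
    '<a href="x">see a/b</a>',
    '<b>bold <i>and italic</i></b>',
);

// Record what each pattern matches (or null when it does not match at all).
$results = array();
foreach ($samples as $html) {
    foreach ($patterns as $name => $pattern) {
        $results[$html][$name] = preg_match($pattern, $html, $m) === 1 ? $m[0] : null;
    }
}

foreach ($results as $html => $byPattern) {
    echo $html, "\n";
    foreach ($byPattern as $name => $match) {
        echo '  ', $name, ': ', $match === null ? 'no match' : $match, "\n";
    }
}
```

On these samples the "in use" pattern rejects content containing a bracket or a slash, and on the nested sample it matches only the inner <i> element, while the lazy pattern matches the whole string each time. Also, (?m) only changes how ^ and $ behave, and none of the patterns use those anchors, so the (?m) and (?-m) toggles should have no effect; it is the s flag that lets the dot cross line breaks.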

eelixduppy

4:36 pm on Mar 12, 2010 (gmt 0)

>> looping several hundred times on (some) page loads.

I think this is more of a problem than how much overhead each of these regexes will have. To be honest, if I had to guess, I'd say that each of these would perform pretty close to the same, if not exactly the same, as far as anyone would be able to tell. If you want a more detailed analysis of their performance, that should be done on your box: record the timestamp (to microseconds) before and after and find the difference. I still don't think it will make that much of a difference, though. You should work more on getting it so that it doesn't have to run on page loads at all but only, for example, when a user submits a comment.
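The before-and-after timing described above could be sketched roughly as follows. This is a minimal sketch, not a proper benchmark: time_pattern() is a hypothetical helper, the sample comment and the iteration count are made up, and the patterns are the three from the original post. It relies on microtime(true), which returns the time as a float and has been available since PHP 5.0.

```php
<?php
// Hypothetical helper: total time for $iterations runs of one pattern.
function time_pattern($pattern, $subject, $iterations)
{
    $start = microtime(true);            // float seconds, microsecond resolution
    for ($i = 0; $i < $iterations; $i++) {
        preg_match($pattern, $subject);
    }
    return microtime(true) - $start;     // elapsed seconds
}

// Made-up sample comment; real comment text from the site would be better.
$subject = str_repeat('<b>some <i>user</i> comment</b> ', 50);

$patterns = array(
    'in_use'    => '/<([^ \/>]+)([^>]+)?>(?m)([^(\<\/\\1\>)]+)(?-m)<\/\\1>/is',
    'lazy'      => '/<([^ \/>]+)([^>]+)?>(?m)(.*?)(?-m)<\/\\1>/is',
    'lookahead' => '/<([^ \/>]+)([^>]+)?>(?m)(.*?(?!<\/\\1>).*?)(?-m)<\/\\1>/is',
);

$timings = array();
foreach ($patterns as $name => $pattern) {
    $timings[$name] = time_pattern($pattern, $subject, 1000);
    printf("%-10s %.6f s\n", $name, $timings[$name]);
}
```

Running it several times and on realistic comment text matters more than any single number, since the figures will vary from run to run and from box to box.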

Readie

4:50 pm on Mar 12, 2010 (gmt 0)

Hmm, the problem is that, the way the site has been coded, it applies the convert-to-HTML step as it pulls the content from the MySQL database...

Still, I'm not going to be letting people edit comments after posting, so I suppose I could just write a completely new system for comment saving that applies this during the insert.

Thanks for replying - and unfortunately the owner of the server I use is relying on Gentoo Portage for PHP updates and they still haven't cleared PHP 5.3 - so I can't do microseconds :(

chasehx

5:16 pm on Mar 12, 2010 (gmt 0)

I'd go:
<([A-Z][A-Z0-9]*)>.*?</\1>

Personally...

Readie

9:54 am on Mar 13, 2010 (gmt 0)

The problem with that, Chasehx, is that I want to allow the use of some attributes (which are of course checked against a list of invalid ones), and it is by far easier to validate both the tag and its attributes with separate back references.
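To illustrate why separate captures help here, a rough sketch follows. The tag and attribute lists and the check_element() helper are hypothetical (the thread does not show the real ones), and the pattern is a trimmed version of the lazy one from the first post. It is meant only to show the captures being validated independently, not to be a complete sanitizer.

```php
<?php
// Hypothetical lists; the real allowed tags and invalid attributes are not shown in the thread.
$allowedTags         = array('b', 'i', 'a');
$forbiddenAttributes = array('onclick', 'onload', 'style');

function check_element($html, $allowedTags, $forbiddenAttributes)
{
    // Group 1: tag name, group 2: raw attribute string, group 3: inner content.
    if (!preg_match('/<([^ \/>]+)([^>]+)?>(.*?)<\/\\1>/is', $html, $m)) {
        return false;
    }
    // Validate the tag name on its own.
    if (!in_array(strtolower($m[1]), $allowedTags, true)) {
        return false;
    }
    // Collect attribute names (the word before each "=") and validate them separately.
    preg_match_all('/([a-z][a-z0-9_:-]*)\s*=/i', $m[2], $attrs);
    foreach ($attrs[1] as $name) {
        if (in_array(strtolower($name), $forbiddenAttributes, true)) {
            return false;
        }
    }
    return true;
}

var_dump(check_element('<b>hi</b>', $allowedTags, $forbiddenAttributes));
var_dump(check_element('<a href="x" onclick="y">z</a>', $allowedTags, $forbiddenAttributes));
```

The attribute check here is deliberately crude: it treats any word followed by "=" as an attribute name, so it can reject harmless markup whose attribute values happen to contain such text.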

Anyways, I've had a thought on a way of modifying every use of my BB code function, my HTML function and my "webify" function to seriously reduce my overheads while still allowing editing (where I want it), and it is so simple I can't believe it didn't occur to me before.

Apply the functions during the insert, and save both pre-function and post-function content in the database.
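A minimal sketch of that idea, under stated assumptions: render_comment() is only a stand-in for the BB code, HTML and "webify" functions (which are not shown in the thread), and the table and column names in the commented-out insert are hypothetical.

```php
<?php
// Stand-in for the real conversion functions, which are not shown in the thread.
function render_comment($raw)
{
    return nl2br(htmlspecialchars($raw, ENT_QUOTES, 'UTF-8'));
}

// Build the row once, at insert time: the raw text is kept for later editing,
// the rendered text is what page loads read back and output unchanged.
function build_comment_row($raw)
{
    return array(
        'body_raw'  => $raw,
        'body_html' => render_comment($raw),
    );
}

$row = build_comment_row("Hello <b>world</b>\nSecond line");

// Hypothetical table and column names; with PDO the insert might look like:
//   $stmt = $pdo->prepare('INSERT INTO comments (body_raw, body_html) VALUES (?, ?)');
//   $stmt->execute(array($row['body_raw'], $row['body_html']));
echo $row['body_html'], "\n";
```

With this layout an edit would load body_raw into the form, then re-run the conversion and overwrite both columns on save, so the regexes run once per write rather than on every page load.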