preg_replace choking on \1, why?

csdude55

10:47 pm on Dec 18, 2022 (gmt 0)

I have what I thought was a simple regex to remove the tag around a link; e.g., converting

<a href="example.com">

to

example.com

So here's what I had:

$str = preg_replace("/<a href=(\"|')([^\1]+?)\1>/i", "$2", $str);

But it didn't match.

After playing around, it looks like [^\1] ends up matching example.com" (so it matches the ", too), leaving no trailing " to match. It has the same issue if I use <a href='example.com'>, too, so it's not the double-quote that's specifically the problem.

This gave the result I expected:

$str = preg_replace("/<a href=(\"|')([^\1]+?)[\"']>/i", "$2", $str);

but of course that would break if there's a ' in the tag somewhere. Which, I mean, there shouldn't be one, but since I'm dealing with user-submitted content you just never know.

Any thoughts?

phranque

11:18 pm on Dec 18, 2022 (gmt 0)

normalize the input, then strip the tag from the link

csdude55

12:29 am on Dec 19, 2022 (gmt 0)

If \1 isn't going to match it, though, then how would I normalize it?

I'm really more concerned that there's something here that I'm misunderstanding, and I'd hate to find that I've coded a glitch after it's all live :-O

lucy24

6:12 pm on Dec 19, 2022 (gmt 0)

After playing around, it looks like [^\1] ends up matching example.com" (so it matches the ", too), leaving no trailing " to match. It has the same issue if I use <a href='example.com'>, too, so it's not the double-quote that's specifically the problem.

This is the point at which I grab the code and say "OK, if you're so smart, tell me what you THINK \1 (by itself) is". I trust you can translate that into php.

Also double-check that the [^\1] construction doesn't have to become [^\\1] --or possibly even three or four backslashes--to work as intended. Got a vague notion I've met this issue before.
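
A minimal sketch of the backslash question, assuming a stock PHP CLI. It only looks at what PHP makes of each string literal before the regex engine sees it; the variable names are made up for illustration. In a double-quoted string, \1 is an octal escape for the character with code 1, so the regex never receives a backslash at all.

```php
<?php
// How PHP itself reads each literal before PCRE ever sees it.
$octal  = "\1";   // double quotes: \1 is an octal escape, i.e. chr(1)
$double = "\\1";  // double quotes: \\ collapses to one backslash, then "1"
$single = '\1';   // single quotes: the backslash is kept as-is

var_dump($octal === chr(1));   // bool(true)
var_dump(strlen($octal));      // int(1)
var_dump($double === $single); // bool(true)
var_dump(strlen($double));     // int(2)
```

Writing the pattern in single quotes is one way to avoid counting backslashes, at the cost of escaping any ' inside the pattern.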

Tangentially, there is no difference between [^blahblah]+ and [^blahblah]+? since the ^ means you have to stop anyway.
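
A small sketch of that point, assuming the negated class really does exclude the character that follows it. Here that is a literal double quote; the sample string is made up.

```php
<?php
$subject = 'say "one" and "two"';

// [^"] cannot consume a double quote, so greedy and lazy must both stop
// at the first closing quote.
preg_match('/"([^"]+)"/',  $subject, $greedy);
preg_match('/"([^"]+?)"/', $subject, $lazy);

var_dump($greedy[1]); // string(3) "one"
var_dump($lazy[1]);   // string(3) "one"
```

If the class does not actually exclude the terminator, the two quantifiers can give different results.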

csdude55

6:46 pm on Dec 19, 2022 (gmt 0)

You're right, @lucy24, double-slashing fixed it! Thanks!

I was on the same general path as you, I first used this to see what was matching where:

preg_replace("/<a href=(\"|')([^\1]+)\1>/i", "1 -- $1\n2 -- $2", $str);

That's where I found that [^\1]+ was matching example.com", leaving no trailing \1 to end the match.

(that's why I added the ?, thinking that the problem was that it was going past the match)

Then I tested my regex pattern itself, and it worked here:

[regex101.com...]

My only remaining thought was a glitch in PHP, then I saw your response and tried \\1 :-)

So for future readers, the final that works is:

preg_replace("/<a href=(\"|')([^\\1]+?)\\1>/i", "$2", $str);

I had to double-slash ALL of them, not just the one in the negated character class.
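
For future readers, a hedged sketch of what that final pattern appears to be doing. Inside a character class, PCRE reads \1 as an octal escape (the character with code 1), not as a backreference. So [^\1] matches nearly any character, and it is the lazy +? followed by the real \1 outside the class that stops at the closing quote. The sample string and the lookahead variant are illustrative assumptions, not something from the posts above.

```php
<?php
$str = '<a href="a.com"> and <a href="b.com">';

// The thread's final pattern. PCRE receives [^\1], which inside a character
// class is the octal escape for chr(1), not a backreference to group 1.
$lazy = preg_replace("/<a href=(\"|')([^\\1]+?)\\1>/i", "$2", $str);

// Same pattern with a greedy quantifier: the class does not stop at the
// quote, so the match runs on to the last "> in the string.
$greedy = preg_replace("/<a href=(\"|')([^\\1]+)\\1>/i", "$2", $str);

// Hypothetical alternative (not from the thread): a negative lookahead is one
// way to say "any character that is not the opening quote". Single quotes
// keep \1 intact without doubling the backslash.
$lookahead = preg_replace('/<a href=(["\'])((?:(?!\1).)+)\1>/i', '$2', $str);

var_dump($lazy);      // "a.com and b.com"
var_dump($greedy);    // "a.com"> and <a href="b.com"
var_dump($lookahead); // "a.com and b.com"
```

The lookahead version also copes with the other quote character inside the value, e.g. <a href="it's.com"> comes out as it's.com.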

Tangentially, there is no difference between [^blahblah]+ and [^blahblah]+? since the ^ means you have to stop anyway.

I actually plugged in the ? after discovering that [^\1] was matching the closing ", thinking that it wasn't stopping at the first match. I had a recent issue with regex in JavaScript where [^blah] wasn't stopping where I'd expected and the ? "fixed" it, so I was hoping that was the issue here, too.