Forum Moderators: coopster

Regex: find & replace particular html links

Scooter

7:23 am on Nov 17, 2008 (gmt 0)

10+ Year Member



I am trying to work out the right regex to disable certain HTML links whose URLs contain a particular string (eventually an array of strings).

What I want is to strip the <a href></a> tags, keeping the link text, whenever the href contains $filter_string.

The code below should be pretty close to working, but it doesn't do anything.

$filter_string = "\ba=*";

$regex = "/<a\s[^>]*href\s*=\s*([\"\']?)(".$filter_string."[^\" >]*?)\\1[^>]*>(.*)<\/a>/siU";

$html = '<a href ="http://www.example.com/s/?&n=701">link1</a>
<a href="http://www.example.com/s/?&n=702">link2</a>
<a href= "http://www.example.com/s/?a=u&n=703">link3</a>
<a href="http://www.example.com/s/sd?704">link4</a>
<a href ="http://www.example.com/s/ef?705">link5</a>
<a href = "http://www.example.com/s/?gt706">link6</a>
';

$replacement_phrase = "\\3";

$html = preg_replace($regex, $replacement_phrase, $html);

echo $html;

I am very new to regex, so any help is appreciated.

Scooter

8:46 am on Nov 17, 2008 (gmt 0)

10+ Year Member



Correction to the regex in my post above.

That regex was adapted from:
<snip>

For some reason the second question mark in the first parenthesis doesn't show up when I post here, so the regex above should have two consecutive question marks for it to work.

[edited by: dreamcatcher at 8:48 am (utc) on Nov. 17, 2008]
[edit reason] No urls please! [/edit]

Scooter

10:28 am on Nov 18, 2008 (gmt 0)

10+ Year Member



By the way, about the code in the first post:

I intend it to match only the "a=" in the 3rd link and disable that link.

The regex fails when anything comes before "a=" in the link.

I've spent days on this and it's a major bottleneck. Any help appreciated.

coopster

12:10 pm on Nov 18, 2008 (gmt 0)

WebmasterWorld Administrator 10+ Year Member



for some reason the second question mark in the first parenthesis doesn't show up when I post here. so the regex above should have 2 consecutive question marks for it to work.

Are you escaping that question mark? I'm assuming you are stating that the question mark may represent a query string marker? If so, you will want to escape it because in a regular expression it has a special meaning, unless it is inside your character class.
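To illustrate the point about escaping, here is a minimal sketch using one of the sample URLs from the first post. The variable names are made up for the example. It compares an unescaped "?", an escaped "\?", and a "?" inside a character class.

```php
<?php
$url = 'http://www.example.com/s/?a=u&n=703';

// Unescaped "?" is a quantifier: "/?" means "an optional slash".
// The pattern looks for "/s", an optional "/", then "a=", so it can
// never consume the literal "?" that sits in the URL.
$unescaped = preg_match('#/s/?a=#', $url);

// Escaped "\?" matches a literal question mark.
$escaped = preg_match('#/s/\?a=#', $url);

// Inside a character class "?" is already literal, no escape needed.
$in_class = preg_match('#/s/[?]a=#', $url);

// The unescaped form does match a URL that has no "?" at all.
$loose = preg_match('#/s/?a=#', 'http://www.example.com/s/a=1');

var_dump($unescaped, $escaped, $in_class, $loose);
```

When the text to match comes from a variable, preg_quote() does this escaping for you.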

Scooter

3:31 pm on Nov 18, 2008 (gmt 0)

10+ Year Member



Are you escaping that question mark? I'm assuming you are stating that the question mark may represent a query string marker? If so, you will want to escape it because in a regular expression it has a special meaning, unless it is inside your character class.

The article at [the-art-of-web.com...] (under "3. Allow for Missing Quotes") says:
"Because we used the U modifier, all patterns in the regexp default to 'ungreedy'. Adding an extra ? after a ? or * reverses that behaviour back to 'greedy' but just for the preceding pattern. Without this, for reasons that are difficult to explain, the expression fails. Basically anything following href= is lumped into the [^>]* expression."

I'm not exactly sure what this means, but it seems intended to reverse the U modifier for that group.
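A small sketch of what the quoted passage describes, with variable names invented for the example. The U modifier makes every quantifier lazy by default, and a trailing "?" flips that one quantifier back to greedy. The second half applies this to the optional-quote group, using the thread's pattern written in single quotes and one of the sample links.

```php
<?php
// Greediness with and without the U (ungreedy) modifier.
preg_match('/a+/',   'aaa', $m); $plain   = $m[0]; // greedy by default
preg_match('/a+/U',  'aaa', $m); $with_u  = $m[0]; // U makes "+" lazy
preg_match('/a+?/U', 'aaa', $m); $flipped = $m[0]; // "+?" under U is greedy again

$link = '<a href="http://www.example.com/s/sd?704">link4</a>';

// One "?": under U the optional quote is lazy, so it first tries to match
// nothing. Group 2 cannot consume the quote either, and the later [^>]*
// ends up taking the whole quoted URL.
$single = '/<a\s[^>]*href\s*=\s*(["\']?)([^" >]*?)\1[^>]*>(.*)<\/a>/siU';
preg_match($single, $link, $m1);

// Two "??": the optional quote is greedy again, so it takes the quote
// and group 2 receives the URL.
$double = '/<a\s[^>]*href\s*=\s*(["\']??)([^" >]*?)\1[^>]*>(.*)<\/a>/siU';
preg_match($double, $link, $m2);

var_dump($plain, $with_u, $flipped, $m1[2], $m2[2]);
```

This would also account for an empty \\2 when only one question mark is present.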

Scooter

3:35 pm on Nov 18, 2008 (gmt 0)

10+ Year Member



Are you escaping that question mark? I'm assuming you are stating that the question mark may represent a query string marker? If so, you will want to escape it because in a regular expression it has a special meaning, unless it is inside your character class.

I've been testing this regex all day. At first I thought the second question mark made no difference, but I later put it back into the code because the regex seemed to give different results without it. I'm not absolutely sure about that, though.

[edited by: Scooter at 3:38 pm (utc) on Nov. 18, 2008]

Scooter

3:50 pm on Nov 18, 2008 (gmt 0)

10+ Year Member



Finally got the regex to do something, but for some unknown reason it also removes the links that come before each match, and I want those to stay.

$filter_string = "a=";

$regex = "/<a\s[^>]*href\s*=\s*([\"\']?)(\s*.*".$filter_string."[^\" >]*?)\s*\\1[^>]*>(.*)<\/a>/siU";

$html = '<a href ="http://www.example.com/s/ete.php?ei=t&39&n=701">link1</a>
<a href="http://www.example.com/s/ete.php?ei=t&n=702">link2</a>
<a href= " dfa=u&n=703 ">link3</a>
<a href= "http://www.example.com/s/?&u&n=703 ">link32</a>
<a href="http://www.example.com/s/sd?704">link4</a>
<a href ="http://www.example.com/s/a=ef?705">link5</a>
<a href = "http://www.example.com/s/ete.php?ei=t&n=706">link6</a>
';

$replacement_phrase = "\\3";

$html = preg_replace($regex, $replacement_phrase, $html);

echo $html;

Any feedback for this regex newbie is appreciated. If the approach I'm pursuing is completely wrong, please point that out, because I'm finding that with:

$regex = "/<a\s[^>]*href\s*=\s*([\"\']?)([^\" >]*?)\\1[^>]*>(.*)<\/a>/siU";

there's nothing in \\2 for some links. It seems to behave inconsistently.
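A likely reason the earlier links disappear: with the s modifier, the ".*" inside the second group can run past the ">" of one tag, across newlines, and into the next link until it finds "a=". The match then starts at an earlier <a and swallows everything up to the link that really contains the filter string. The sketch below reproduces this on a shortened copy of the sample HTML and compares it with a bounded alternative. The bounded pattern is an assumption of mine, not something posted in the thread: it replaces ".*" with a character class that cannot leave the href value and drops the U modifier in favour of an explicit lazy ".*?".

```php
<?php
$html = '<a href ="http://www.example.com/s/ete.php?ei=t&n=702">link2</a>
<a href= " dfa=u&n=703 ">link3</a>
<a href="http://www.example.com/s/sd?704">link4</a>
';
$filter_string = 'a=';

// Pattern from the post above: ".*" with the s modifier can cross tag boundaries.
$unbounded = '/<a\s[^>]*href\s*=\s*(["\']?)(\s*.*' . $filter_string
           . '[^" >]*?)\s*\1[^>]*>(.*)<\/a>/siU';
$swallowed = preg_replace($unbounded, '$3', $html);

// Hypothetical bounded variant: [^"\'>]* stays inside the href value,
// and preg_quote() escapes any regex metacharacters in the filter string.
$bounded = '~<a\s[^>]*href\s*=\s*(["\']?)[^"\'>]*' . preg_quote($filter_string, '~')
         . '[^"\'>]*\1[^>]*>(.*?)</a>~si';
$fixed = preg_replace($bounded, '$2', $html);

echo $swallowed, "----\n", $fixed;
```

In the first result link2 is gone along with its tags. In the second only link3 loses its tags, and link2 and link4 are left untouched.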

Scooter

5:11 pm on Nov 20, 2008 (gmt 0)

10+ Year Member



Solved.
[webmasterworld.com...] (helpful)

$regex = "#(<a[^>]*?)$filter_string(.*?)<\/a>#s";
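For the "array of strings" goal mentioned in the first post, one possible sketch follows. It is not the regex from the line above. The function name strip_links_containing() and the $needles parameter are hypothetical, and the pattern is an assumption that restricts the search to the href value. Each filter string goes through preg_quote() and the results are joined into one alternation.

```php
<?php
// Hypothetical helper, not from the thread: removes the <a> tags (keeping
// the link text) of every link whose href contains any of the $needles.
function strip_links_containing(string $html, array $needles): string
{
    if ($needles === []) {
        return $html; // an empty alternation would match every link
    }

    // Escape regex metacharacters such as "?" or "." in each filter string.
    $quoted = array_map(function ($needle) {
        return preg_quote($needle, '~');
    }, $needles);

    $pattern = '~<a\s[^>]*href\s*=\s*(["\']?)[^"\'>]*(?:' . implode('|', $quoted)
             . ')[^"\'>]*\1[^>]*>(.*?)</a>~si';

    return preg_replace($pattern, '$2', $html);
}

$html = '<a href="http://www.example.com/s/?a=u&n=703">link3</a>' . "\n"
      . '<a href="http://www.example.com/s/sd?704">link4</a>' . "\n"
      . '<a href="http://www.example.com/s/?gt706">link6</a>';

$result = strip_links_containing($html, ['a=', 'sd?']);
echo $result;
```

Because of the i modifier the filter strings are matched case-insensitively. Drop it if "A=" should not count as a match.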