I have seen lots of regexps that match URLs, but what I'm trying to write is a regexp that matches every link that IE accepts as valid.
This includes "malformed" anchor tags. Here is a list of URLs:
-------------------------------------------------------------------
valid urls:<br><br>
<a href="/cgi-bin/redir.cgi?url=http://www.site.com/main.html" onmouseover="window.status='http://www.whatever.com/';return true;" onmouseout="window.status=' ';return true;">Go to site: 1</a><br>
<a href
=
"/cgi-bin/redir.cgi?url=http://www.site.com/main.html"
onmouseover="window.status='http://www.whatever.com/';return true;" onmouseout="window.status=' ';return true;">Go to site: 2</a><br>
<a href= '/cgi-bin/redir.cgi?url=http://www.site.com/main.html' onmouseover="window.status='http://www.whatever.com/';return true;" onmouseout="window.status=' ';return true;">Go to site: 3</a><br>
<a href=/cgi-bin/redir.cgi?url=http://www.site.com/main.html onmouseover="window.status='http://www.whatever.com/';return true;" onmouseout="window.status=' ';return true;">Go to site: 4</a><br>
<a href=www.site.com>Go to site: 5</a><br>
<a href = www.site.com onMouseOver="">Go to site: 6</a><br>
<a href="/cgi-bin/redir.cgi
?url=http://www.site.com/main.html" onmouseover="window.status='http://www.whatever.com/';return true;" onmouseout="window.status=' ';return true;">Go to site: 7</a><br>
<br>non valid urls: <br><br>
<a href='/cgi-bin/redir.cgi?url=http://www.site.com/main.html">Go to site: 8</a><br>
< a href="/cgi-bin/redir.cgi?url=http://www.site.com/main.html" onmouseover="window.status='http://www.whatever.com/';return true;" onmouseout="window.status=' ';return true;">Go to site: 9</a><br>
<br>wrong urls:<br><br>
<a href="/cgi-bin/redir.cgi?url=http://www.site.com/main.html' onmouseover="window.status='http://www.whatever.com/';return true;" onmouseout="window.status=' ';return true;">Go to site: 10</a><br>
<a href='/cgi-bin/redir.cgi?url=http://www.site.com/main.html" onmouseover="window.status='http://www.whatever.com/';return true;" onmouseout="window.status=' ';return true;">Go to site: 11</a><br>
-------------------------------------------------------------------
Basically, I want to match all the valid URLs based on the anchor tag and its href, so the expression should first look for: <a ... href = ... >
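To make that first step concrete, here is a rough sketch of just the "<a ... href =" skeleton check, with nothing captured yet. The function name has_anchor_href is made up for illustration, and the pattern assumes no attribute value before href contains a ">" character.

```php
<?php
// Hypothetical helper: true if the string contains an opening <a ...> tag
// that has an href attribute, with optional whitespace around the "=".
function has_anchor_href(string $html): bool
{
    // <a\s+            "<a" followed by whitespace (so "< a" and "<abbr" fail)
    // (?:[^>]*?\s)?    optionally, other attributes before href (assumes no ">" in them)
    // href\s*=         the attribute name, optional whitespace, then "="
    return preg_match('/<a\s+(?:[^>]*?\s)?href\s*=/i', $html) === 1;
}
```

Because \s also matches line breaks, this accepts tags that are split over several lines, like link 2 in the list above.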
I have an approach, but the main problem is that the submatches do not always hold the correct href content, because the capturing group that gets filled depends on the type of URL (single quotes, double quotes, or no quotes).
Also, I don't know how to make the expression match only when the href content is enclosed in single quotes, double quotes, or no quotes at all, and not match when the opening quote differs from the closing quote.
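For the quoted cases, one possible way to state "the closing quote must equal the opening quote" is a back reference. This is only a sketch: quoted_hrefs is a made-up name, it ignores unquoted hrefs entirely, and it assumes the href value itself never contains a quote character or ">" (otherwise a mismatched opening quote could pair up with a quote from a later attribute).

```php
<?php
// Hypothetical helper: returns href values wrapped in matching quotes only.
function quoted_hrefs(string $html): array
{
    // (["'])      group 1: the opening quote, single or double
    // ([^"'>]*)   group 2: the value; assumed to contain no quotes and no ">"
    // \1          back reference: the closing quote must be the same as group 1
    $pattern = '/<a\s+href\s*=\s*(["\'])([^"\'>]*)\1/i';
    preg_match_all($pattern, $html, $matches);
    return $matches[2];
}
```

With this pattern the href content is always in group 2, whichever quote style was used, and tags like links 8, 10 and 11 above do not match.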
Also, is there a way to make the href content always end up in the same submatch (back references or the like)?
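One candidate for that is PCRE's branch reset group, (?|...), in which every alternative numbers its capturing groups from the same starting point. The sketch below assumes a PCRE build recent enough to support it; href_values is a made-up name, and the pattern again assumes quoted values contain no quote characters or ">" and unquoted values contain no whitespace.

```php
<?php
// Hypothetical helper: returns the href value of each matching <a href=...> tag,
// always taken from capture group 1 regardless of the quoting style.
function href_values(string $html): array
{
    $pattern = '/<a\s+href\s*=\s*'
        . '(?|'                  // branch reset: each alternative fills group 1
        . '"([^"\'>]*)"'         // double-quoted value
        . '|\'([^"\'>]*)\''      // single-quoted value
        . '|([^\s"\'>]+)'        // unquoted value, up to whitespace or ">"
        . ')'
        . '(?=[\s>])/i';         // value must be followed by whitespace or ">"
    preg_match_all($pattern, $html, $matches);
    return $matches[1];
}
```

Usage would look like `$urls = href_values($html);`. Mismatched quotes fail all three alternatives, so tags like links 8, 10 and 11 above are skipped, and "< a href=..." is skipped because the pattern requires "<a" with no space in between.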
Here is my current expression:
[php]
preg_match_all("{<a\s+href\s*=\s*(\"([^>]+)\"\s*>|'([^>]+)'\s*>|([^\"]+[^>]+[^\"]+)\s*>)}x", $html, $matches, PREG_SET_ORDER);
[/php]
Thanks for any ideas!
[edited by: jatar_k at 7:21 pm (utc) on June 8, 2004]
[edit reason] turned off smiles and fixed sidescroll [/edit]