Regex question


Tamashi

1:42 pm on Mar 28, 2010 (gmt 0)

10+ Year Member



Greetings

My apologies if I picked the wrong forum to ask this. My script scans my reciprocal URLs daily with the following regex:

<a.*?href=["'].*?www.example.com/?["'].*?>.*?</a>

How would I have to set up the code to:
- Allow any anchor text
- Allow any title attribute
- NOT ALLOW rel="nofollow" as a match?

Currently my code matches the URL and allows anything in the rest of the tag, including a rel="nofollow" attribute on the link.

Could anyone give an example, or a hint toward a possible solution, so that if rel="nofollow" is applied to a reciprocal URL, it no longer matches?

g1smd

4:17 pm on Mar 28, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



As you won't know the order of the attributes, href then rel or rel then href, you'll need to run two tests: one for the URL and another to see if the rel is present.
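
A minimal Python sketch of the two-test idea, for illustration only. The thread never says which language the script is written in, so the helper name has_followed_link and the three patterns below are assumptions, not anyone's actual code. Each opening <a> tag is isolated first, then checked for the URL and, separately, for nofollow, so the attribute order does not matter.

```python
import re

# Hypothetical patterns; www.example.com stands in for the reciprocal URL.
OPEN_A_TAG = re.compile(r'<a\b[^>]*>', re.IGNORECASE)
SITE_HREF = re.compile(
    r'''\bhref\s*=\s*["']https?://(?:www\.)?example\.com(?:/[^"']*)?["']''',
    re.IGNORECASE,
)
NOFOLLOW_REL = re.compile(
    r'''\brel\s*=\s*["'][^"']*\bnofollow\b[^"']*["']''',
    re.IGNORECASE,
)


def has_followed_link(html):
    """Return True if any <a> tag links to the site without rel="nofollow"."""
    for tag in OPEN_A_TAG.finditer(html):
        opening = tag.group(0)
        # Test 1: does this tag's href point at the site?
        if not SITE_HREF.search(opening):
            continue
        # Test 2: is nofollow present anywhere in the same tag?
        if not NOFOLLOW_REL.search(opening):
            return True
    return False


print(has_followed_link('<a href="http://www.example.com">x</a>'))
print(has_followed_link('<a rel="nofollow" href="http://www.example.com">x</a>'))
```

Splitting the check this way keeps each pattern short, at the cost of a small loop around them.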

Tamashi

6:58 pm on Mar 28, 2010 (gmt 0)

10+ Year Member



The script only checks once. Isn't it possible to exclude specific words with the .*? part, as in "match anything but the word nofollow"?

Tamashi

7:01 pm on Mar 28, 2010 (gmt 0)

10+ Year Member



Something like
<a. (.*?[^"nofollow"]) href=["'].*?www.example.com/?["'](.*?[^"nofollow"]) >.*?</a>

Would something similar be possible?

g1smd

7:20 pm on Mar 28, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



It only needs to check once, but modify the script so that it looks for one thing then the other.

It's likely an extra two or three lines of code to add to the script.

jdMorgan

7:26 pm on Mar 28, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



The answer has already been provided, and is correct:

> you'll need to run two tests: one for the URL and another to see if the rel is present.

Square brackets indicate an alternate-character group (a character class), which matches or rejects single characters from the enclosed set, not strings. Therefore, the sub-pattern [^"nofollow"] matches exactly one character that is not a double-quote or one of the letters n, o, f, l, or w -- several of which are listed multiple times. It cannot exclude the word "nofollow" as a whole.
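
To make the character-versus-string point concrete, here is a small Python sketch. The helper names one_char_ok and all_chars_ok are made up for this illustration. It shows that the bracket expression tests one character at a time, so it also rejects harmless text that merely shares a letter with "nofollow".

```python
import re

# The bracket expression from the post above: a negated character class.
NEGATED_CLASS = r'[^"nofollow"]'


def one_char_ok(ch):
    """True if the class matches this single character."""
    return re.fullmatch(NEGATED_CLASS, ch) is not None


def all_chars_ok(text):
    """True if every character of text is allowed by the class."""
    return re.fullmatch('(?:' + NEGATED_CLASS + ')*', text) is not None


print(one_char_ok('x'))        # 'x' is not in the excluded set
print(one_char_ok('n'))        # 'n' is in the excluded set
print(all_chars_ok('target'))  # none of ", n, o, f, l, w appear
print(all_chars_ok('title'))   # contains 'l', so it is rejected
print(all_chars_ok('wolf'))    # a different word, rejected all the same
```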

Your original pattern is hugely inefficient, due to its use of multiple ambiguous ".*?" sub-patterns, each of which can match almost any run of characters. Use more-specific subpatterns to speed this up -- possibly by a factor of one thousand or more. Something like
<a(\ [^\ ]*)*\ href=["']https?://(www\.)?example\.com[^"']*["'][^>]*>([^<]*<)+/a>
will greatly reduce the number of 'back-off-and-retry' iterations required to be executed by the matching engine.
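
A short Python sketch comparing what the two patterns accept. Both patterns are copied from the thread; the variable names and the sample strings are invented for this illustration. It does not measure speed, so it says nothing about the thousand-fold estimate. It only shows the ambiguity: the unescaped dots and the open-ended ".*?" runs let the original pattern match text that is not a link to the site at all.

```python
import re

# Both patterns are copied from the thread; only the names are new.
LOOSE = re.compile(r'''<a.*?href=["'].*?www.example.com/?["'].*?>.*?</a>''')
SPECIFIC = re.compile(
    r'''<a(\ [^\ ]*)*\ href=["']https?://(www\.)?example\.com[^"']*["'][^>]*>([^<]*<)+/a>'''
)

# Invented sample strings.
real_link = '<a title="x" href="http://www.example.com/">Example</a>'
lookalike = '<a href="http://wwwxexampleycom">x</a>'
stray_text = ('<a href="http://other.test/">x</a>'
              '<p title="www.example.com">y</p><a href="/z">z</a>')

for name, sample in [('real_link', real_link),
                     ('lookalike', lookalike),
                     ('stray_text', stray_text)]:
    loose_hit = LOOSE.search(sample) is not None
    specific_hit = SPECIFIC.search(sample) is not None
    print(name, 'loose:', loose_hit, 'specific:', specific_hit)
```

Both patterns accept the real link. Only the loose one accepts the other two samples: in the lookalike the unescaped dots match the letters "x" and "y", and in the last sample the ".*?" runs stretch from the first link into the title attribute of an unrelated tag.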

See the regular-expressions tutorial cited in our Apache Forum Charter for more information.

Jim

Tamashi

2:19 pm on Mar 29, 2010 (gmt 0)

10+ Year Member



Got it working in a single line with this:
<a (?![^>]*rel=["\']nofollow[\'"])(?=[^>]*href=["\']http).*?href=["\']https?://(www\.)?example\.com[^"\']*["\'].*?>.*?</a>

g1smd

3:05 pm on Mar 29, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



And if the rel is listed after the href?

Tamashi

5:38 pm on Mar 29, 2010 (gmt 0)

10+ Year Member



Then it still works :)

Ran several test runs with my script, and no matter where the rel is placed, it rejects the reciprocal link :)

Tamashi

5:44 pm on Mar 29, 2010 (gmt 0)

10+ Year Member



<a href="http://www.example.com"> - Found
<a href="http://www.example.com" target="_blank"> - Found
<a href="http://www.example.com" title="something" target="_blank"> - Found
<a href="http://www.example.com" title="something" target="_blank" rel="nofollow"> - Not Found
<a rel="nofollow" target="_blank" rel="nofollow" href="http://www.example.com"> - Not Found
<a href="http://www.example.com" rel="nofollow" target="_blank" title="something"> - Not Found

Tried a few more combinations, and whenever "nofollow" was mentioned somewhere in the link, it was rejected by the script with the above regex values.
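
For anyone who wants to repeat these runs, a Python sketch of the same six cases. Two assumptions are made here. First, the backslash-escaped quotes in the posted pattern are treated as string-quoting escapes and written as plain quotes. Second, the tags above are opening tags only, so some link text and a closing tag are appended because the pattern requires them. The names PATTERN, OPENING_TAGS and check are invented for the example.

```python
import re

# Single-line pattern from the thread, with \' written as a plain '
# (assumed to have been string-quoting escapes in the original script).
PATTERN = re.compile(
    r'''<a (?![^>]*rel=["']nofollow['"])(?=[^>]*href=["']http).*?href=["']https?://(www\.)?example\.com[^"']*["'].*?>.*?</a>'''
)

# The six opening tags listed in the post above.
OPENING_TAGS = [
    '<a href="http://www.example.com">',
    '<a href="http://www.example.com" target="_blank">',
    '<a href="http://www.example.com" title="something" target="_blank">',
    '<a href="http://www.example.com" title="something" target="_blank" rel="nofollow">',
    '<a rel="nofollow" target="_blank" rel="nofollow" href="http://www.example.com">',
    '<a href="http://www.example.com" rel="nofollow" target="_blank" title="something">',
]


def check(opening_tag):
    """Search a complete link built from the opening tag."""
    # 'link</a>' is appended (an assumption) because the pattern
    # needs link text and a closing tag to match.
    return PATTERN.search(opening_tag + 'link</a>') is not None


results = [check(tag) for tag in OPENING_TAGS]
for tag, found in zip(OPENING_TAGS, results):
    print('Found    ' if found else 'Not Found', tag)
```

The negative lookahead sits right after "<a " and scans the whole opening tag, which is why the position of rel relative to href makes no difference.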

Tamashi

5:46 pm on Mar 29, 2010 (gmt 0)

10+ Year Member



I tried a few more combinations, and it seems to work fine with whatever I try, so if you have a combination that might not work, leave it here and I'll test it.
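
A few combinations that may be worth testing, sketched in Python. The single-line pattern only rejects a rel value that is exactly "nofollow", so a value with more than one keyword, such as rel="external nofollow", appears to slip through. VARIANT below is a hypothetical adjustment, not something posted in the thread: it looks for nofollow as a whole word anywhere inside the rel value and ignores case. The \' escapes are written as plain quotes and 'link</a>' is appended to each tag, both as assumptions.

```python
import re

# Pattern as posted in the thread (quotes unescaped for Python).
ORIGINAL = re.compile(
    r'''<a (?![^>]*rel=["']nofollow['"])(?=[^>]*href=["']http).*?href=["']https?://(www\.)?example\.com[^"']*["'].*?>.*?</a>'''
)

# Hypothetical variant: nofollow as a whole word inside the rel value.
VARIANT = re.compile(
    r'''<a (?![^>]*\brel=["'][^"']*\bnofollow\b)(?=[^>]*href=["']http).*?href=["']https?://(www\.)?example\.com[^"']*["'].*?>.*?</a>''',
    re.IGNORECASE,
)

# Invented test tags.
CASES = [
    '<a href="http://www.example.com">',
    '<a href="http://www.example.com" rel="noopener">',
    '<a href="http://www.example.com" rel="external nofollow">',
    '<a rel="nofollow noopener" href="http://www.example.com">',
]


def found(pattern, opening_tag):
    """Search a complete link built from the opening tag."""
    return pattern.search(opening_tag + 'link</a>') is not None


for tag in CASES:
    print('original:', found(ORIGINAL, tag),
          'variant:', found(VARIANT, tag), tag)
```

With these inputs the posted pattern reports all four links as found, while the variant rejects the two whose rel value contains nofollow.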