perl regular expression

     

newbies

3:45 am on Feb 13, 2014 (gmt 0)

10+ Year Member



I have HTML files which are online versions of articles, with citations in the text and references at the end. The citations in the text appear in patterns such as [1], [1-2], [1,3,6], or [1,3,4-6, 9]. I want each number to be linked to its reference without changing the visible text.

I know how to match the simple ones such as the first one: $html =~ s`\[\d+\]`\[<a href="#$1"><$1>\]`g;

But for the more complicated patterns, I have no idea how to wrap each number in a link.

Thank you for your help.

phranque

4:53 am on Feb 13, 2014 (gmt 0)

WebmasterWorld Administrator phranque is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month



$html =~ s`\[\d+\]`\[<a href="#$1"><$1>\]`g;

you need parentheses to capture something for the backreferences.
your anchor tag doesn't have a closing tag, so not sure what the intended syntax is there.

i would try something more like this:
$html =~ s|([-,\[ ])(\d+)([-,\] ])|$1<a href="#$2">$2</a>$3|g;


this will look for a pattern that begins with either a left square bracket, a comma, a dash, or a blank; then an integer; and ends with either a right square bracket, a comma, a dash, or a blank.
the integer is captured in the second group and the preceding and following character are captured in the first and third groups respectively.
with the g modifier, every non-overlapping match is substituted in a single left-to-right pass over the string.

newbies

5:39 am on Feb 13, 2014 (gmt 0)

10+ Year Member



Thank you phranque. It seems we are almost there.
for this pattern [1,3,4-6, 9]

The result is:
[<a href="#1">1</a>,3,<a href="#4">4</a>-6, <a href="#9">9</a>]

so, 3, 6 are not linked. How to revise?
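
A likely explanation, offered as a sketch rather than a definitive fix: the delimiter after each number is consumed as part of the match, so it is no longer available as the delimiter in front of the next number, and a single pass with the g modifier never revisits it. Zero-width lookarounds only assert the delimiters without consuming them. The sub name link_numbers is made up for illustration, and this version still links numbers outside square brackets.

```perl
use strict;
use warnings;

# Hypothetical helper: same character classes as the substitution above,
# but the surrounding delimiters are asserted, not consumed.
sub link_numbers {
    my ($html) = @_;
    $html =~ s{(?<=[-,\[ ])(\d+)(?=[-,\] ])}{<a href="#$1">$1</a>}g;
    return $html;
}

print link_numbers("[1,3,4-6, 9]"), "\n";
```

With this change every number in [1,3,4-6, 9] gets its own anchor in one pass.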

newbies

6:04 am on Feb 13, 2014 (gmt 0)

10+ Year Member



Another problem:

numbers not in [ ] may also get linked.

lucy24

9:08 am on Feb 13, 2014 (gmt 0)

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time Top Contributors Of The Month



:: thinking how I'd approach this in an e-book ::

pattern: 1 or more digits \d contained within square brackets, separated by one or more non-digits.

I think it's easiest to do it in multiple passes rather than try it all in one fell swoop. You're replacing
(\d+)
with something like
<a href[^>]+>\d+</a>
where presumably the captured digits $1 or \1 are used as part of the link.

So the pattern is
(\[(?:<a href[^>]+>\d+</a>[^\d<\]]+)*)(\d+)
change to
$1<a href et cetera>$2</a>
and repeat until it rinses clean.

In groups like "2-3" or "4-7" do you want each part separately anchored, or a single anchor for the whole unit? In the latter case, replace the sequence (\d+) wherever it occurs with (\d+)(-\d+)?. Note here that you have to capture the two parts separately, because only the first number will be used in the link. So you'll have
<a href et cetera including $2 somewhere>$2$3</a>

Anyway, that's what I would do in the index of an e-book (something I have a LOT of experience with :().

numbers not in [ ] may also get linked.

You wouldn't want them to, would you? Unless you have very unusual content, such that you can be reasonably certain the only numerals will be footnote references. You could also do it if the only non-footnote numerals are dates, so you can distinguish between \d{1,2} and \d{4,}. You would then need to express the search as \b\d\d?\b to make sure your Regular Expression doesn't sneakily try something like
1<a href et cetera including 94 here>94</a>5.

Edit: Oops. I guess I misread that. With appropriate use of [^\]] you can ensure that links only happen inside of square brackets.

newbies

4:06 pm on Feb 13, 2014 (gmt 0)

10+ Year Member



Thank you Lucy24!

numbers not in [ ] should not be linked.

Patterns like [3] or [4-6] are easy to handle, but for [1,3,4-6, 9], or even more complicated ones such as [1,3,4-6, 9, 11-24], I don't know how to do it.

lucy24

11:07 pm on Feb 13, 2014 (gmt 0)

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time Top Contributors Of The Month



Making a pattern that captures everything in one go is more trouble than it's worth. That is: It would take more time to devise the Regular Expression than it would take to hit Replace a few more times-- and this isn't something you will be using 20 times a day forever.

That's why I suggested the pattern with (blahblah)*. Simply keep running a global replace until nothing new comes up.

In forms like [3-5] do you want

<a href = blahblah3blahblah>3</a>-<a href = blahblah5blahblah>5</a>

or

<a href = blahblah3blahblah>3-5</a>
?

That's a pretty simple difference, so just decide what you want your user's experience to be. How often would someone click on the second element in a group 2-7 and expect to be taken directly to footnote 7? You know the site and your users better than anyone. At least I hope you do ;)
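
For the second form (one anchor per range), here is a minimal Perl sketch of the repeated-replace pattern with an optional "-digits" tail added to the number. The sub name link_units and the href="#N" scheme are assumptions for illustration; only the first number of a range goes into the link target.

```perl
use strict;
use warnings;

# Hypothetical helper: anchor "4-6" as one unit, pointing at reference 4.
sub link_units {
    my ($html) = @_;
    # $1 = "[" plus any units already anchored, $2 = next number,
    # $3 = optional "-digits" tail (empty string when absent).
    my $next_unit = qr{(\[(?:<a href[^>]+>\d+(?:-\d+)?</a>[^\d<\]]+)*)(\d+)((?:-\d+)?)};
    # Each pass anchors one more unit per bracket group; stop when nothing changes.
    1 while $html =~ s{$next_unit}{$1<a href="#$2">$2$3</a>}g;
    return $html;
}

print link_units("[1,3,4-6, 9]"), "\n";
```

Dropping the tail from the pattern gives the first form, where each end of the range is anchored separately.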

newbies

4:34 am on Feb 15, 2014 (gmt 0)

10+ Year Member



I almost got there using this code:
use strict;

my $c = "[1,3,4-6, 9, 12-23], [3-4]. some [11-12]. A 10-day 44 ";

my $c2 = replace($c);

print "$c2\n";

sub replace {
    my $html = shift;
    if ($html =~ m/\[/g) {
        $html =~ s`(.*?)(\d+)(?=.*?\])`$1<a href="#$2">$2</a>`g;
    }
    return $html;
}


the output is this which is exactly what I wanted:
[<a href="#1">1</a>,<a href="#3">3</a>,<a href="#4">4</a>-<a href="#6">6</a>, <a href="#9">9</a>, <a href="#12">12</a>-<a href="#23">23</a>], [<a href="#3">3</a>-<a href="#4">4</a>]. some [<a href="#11">11</a>-<a href="#12">12</a>]. A 10-day 44


However, when I use the subroutine on a real HTML file, the "10" in the sentence "A 10-day" is also linked. In the HTML file, that part of the text is exactly like the text in the $c variable, with no line break or other special characters between "[11-12]." and "A 10-day". I cannot understand why I got different results!

phranque

5:51 am on Feb 15, 2014 (gmt 0)

WebmasterWorld Administrator phranque is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month



$html =~ s`(.*?)(\d+)(?=.*?\])`$1<a href="#$2">$2</a>`g; 


(.*?) is about as ambiguous as it gets for regular expressions.
it means "unanchored but beginning with zero or more of anything or nothing and capture what you match".

i would try something more like "preceded by a left square bracket, then zero or more characters that are not a right square bracket, then one or more decimal digits, etc"

however i can't explain why you got different results.

newbies

8:06 am on Feb 15, 2014 (gmt 0)

10+ Year Member



Thank you.

You're right -
(.*?) is about as ambiguous as it gets for regular expressions.


It is that part that caused the problem.

Now that I have modified the code, everything works fine!

$html =~ s`\G(.*?)(\d+)(?=[,|\-|\s|\d]*\])`$1<a href="#$2">$2</a>`g;
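
A probable reason for the earlier difference, shown as a sketch: the lookahead (?=.*?\]) is satisfied by any "]" later on the same line, and the test string simply ended before another bracket appeared, while real text goes on. Restricting the lookahead to digits, commas, hyphens and whitespace ties each number to the bracket it sits in. The helper name numbers_linked and the longer sample sentence are assumptions for illustration; note also that inside a character class "|" is a literal character, so it is left out here.

```perl
use strict;
use warnings;

# Hypothetical helper: list the numbers a given lookahead would let through.
sub numbers_linked {
    my ($text, $lookahead) = @_;
    my @found = $text =~ /(\d+)$lookahead/g;
    return @found;
}

my $loose = qr/(?=.*?\])/;         # any "]" later on the same line
my $tight = qr/(?=[-,\s\d]*\])/;   # only digits, commas, hyphens, spaces before "]"

my $short = "some [11-12]. A 10-day 44 ";
my $long  = "some [11-12]. A 10-day 44 watts treatment [22].";

print "loose, short: @{[ numbers_linked($short, $loose) ]}\n";
print "loose, long:  @{[ numbers_linked($long,  $loose) ]}\n";
print "tight, long:  @{[ numbers_linked($long,  $tight) ]}\n";
```

The tight lookahead checks only what follows the number, so a group such as [adb1,3] would still have its numbers picked up.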

phranque

10:08 am on Feb 15, 2014 (gmt 0)

WebmasterWorld Administrator phranque is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month



sorry i misread the regexp so disregard my description.

does your code work if there isn't a left square bracket before the first digit?

newbies

5:59 pm on Feb 15, 2014 (gmt 0)

10+ Year Member



No, any number has to be within [ ] to be linked. The complete code is as follows:

if ($html =~ m/\[/g){
$html =~ s`\G(.*?)(\d+)(?=[,|\-|\s|\d]*\])`$1<a href="#$2">$2</a>`g;
}

lucy24

10:51 pm on Feb 15, 2014 (gmt 0)

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time Top Contributors Of The Month



(.*?)

I wouldn't dare do this. Express the generic . as [^\[<>] etc, excluding any characters that absolutely must not occur in this location. Otherwise you risk going from
see citation no. [<a href = blahblah1234blahblah>1234</a>]

to
see citation no. [<a href = blahblah1234blahblah>1<a href = blahblah23blahblah>23</a>4</a>]


Or, worse, from
see citation no. [<a href = blahblah56blahblah>56</a>]

to
see citation no. [<a href = blahblah<a href = blahblah56blahblah>56</a>blahblah>56</a>]


In my personal experience-- still talking ebooks-- I get it most often with physical page numbers in a multi-page index. To run it as an unsupervised global replace you have to make sure nothing gets anchored that isn't already anchored.
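
One way to check for this before running an unsupervised replace is to apply the substitution twice and compare the results. The following is a sketch under assumptions: link_loose and link_tight are made-up names wrapping the two lookaheads used earlier in the thread (with the literal "|" dropped from the character class), and the check covers only this one sample input.

```perl
use strict;
use warnings;

# Lookahead satisfied by any later "]" on the line.
sub link_loose {
    my ($html) = @_;
    $html =~ s{(\d+)(?=.*?\])}{<a href="#$1">$1</a>}g;
    return $html;
}

# Lookahead allowing only digits, commas, hyphens and whitespace before "]".
sub link_tight {
    my ($html) = @_;
    $html =~ s{(\d+)(?=[-,\s\d]*\])}{<a href="#$1">$1</a>}g;
    return $html;
}

for my $linker (\&link_loose, \&link_tight) {
    my $once  = $linker->("see citation no. [56]");
    my $twice = $linker->($once);
    print $once eq $twice ? "stable\n" : "changed on second pass: $twice\n";
}
```

A substitution that changes its own output on the second pass is one that anchors things which are already anchored.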

newbies

8:02 am on Feb 16, 2014 (gmt 0)

10+ Year Member



Indeed, that is a problem, and I don't know how to exclude those situations. I tried this code, which did not work:

if ($html =~ m/\[/g){
$html =~ s`\G([\d|\-|,\s]*)(\d+)(?=[,|\-|\s|\d]*\])`$1<a href="#$2">$2</a>`g;
}
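
An alternative sketch, not taken from the thread: match a whole citation group first, then link the numbers inside it with a second substitution run through the e modifier. The sub names link_group and link_bracketed are hypothetical, and the outer pattern assumes a citation group contains only numbers, commas, hyphens and whitespace.

```perl
use strict;
use warnings;

# Hypothetical helper: wrap every number of one citation group in an anchor.
sub link_group {
    my ($inner) = @_;
    $inner =~ s{(\d+)}{<a href="#$1">$1</a>}g;
    return $inner;
}

sub link_bracketed {
    my ($html) = @_;
    # Only groups made purely of numbers, commas, hyphens and spaces qualify,
    # so "[adb1,3]" and bare numbers outside brackets are left alone.
    $html =~ s{\[(\d+(?:\s*[-,]\s*\d+)*)\]}{'[' . link_group($1) . ']'}ge;
    return $html;
}

print link_bracketed("[1,3,4-6, 9, 12-23], [3-4]. some [11-12]. A 10-day 44 "), "\n";
```

Because an already linked group starts with "[<" rather than "[" plus a digit, running this a second time should leave the text unchanged.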

lucy24

11:20 am on Feb 16, 2014 (gmt 0)

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time Top Contributors Of The Month



Well, I don't speak the language, but can you translate this?

Search string:
(\[(?:<a href[^>]+>\d+</a>[^\d<\]]+)*)(\d+)


Replace with:
$1<a href blahblah including $2 here>$2</a>


where $1 is the stuff leading up to your first new capture and $2 is your fresh reference. Pay close attention to the brackets. It may be easier to see if you cut and paste into something using a fixed-pitch font. I'm accustomed to reading Regular Expressions in Courier and they look like so much gibberish in anything else.

Walkthrough:
#1 Look for an opening bracket. Take no action until you find one.
#2 Opening bracket may or may not be followed by one or more existing anchor packages, possibly separated by a few characters that are NOT digits, < or close-brackets.
#3 If, after all this, you find a set of digits before reaching the close-bracket, capture and anchor those suckers.
#4 Repeat as an unsupervised global replace until everything rinses clean.

That's assuming you want each set of numerals captured and anchored separately.

newbies

1:32 am on Feb 19, 2014 (gmt 0)

10+ Year Member



Still it is not working as expected.
s`(\[(?:<a href[^>]+>\d+</a>[^\d<\]]+)*)(\d+)`$1<a href="#$2">$2</a>`g; 


input:
$c = "[adb1,3,4-6, 9, 12-23], 11 should not be linked [3-4]. Some [11-12]. A 10-day 44 watts treatment [22].";

Output:
[adb1,3,4-6, 9, 12-23], 11 should not be linked [<a href="#3">3</a>-4]. Some [<a href="#11">11</a>-12]. A 10-day 44 watts treatment [<a href="#22">22</a>].

lucy24

3:18 am on Feb 19, 2014 (gmt 0)

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time Top Contributors Of The Month



[adb

Urk, didn't think of that. You'll need to add a

[^\d<\]]*


to the beginning of the pattern-- immediately after the \[ --to capture possible non-digits in this location.

So the part

(?:<a href[^>]+>\d+</a>[^\d<\]]+)*


isn't working for non-zero values of * meaning that the package as a whole has something wrong with it.

Hmmm...

:: beating head against wall ::

It's going to be something really obvious that will cause us all to cry "D'oh!" in unison. Hang on. We Will Get This To Work.

Edit:
Oh, wait. Are you running this pattern repeatedly? It won't pick up everything on the first pass. You have to keep running it until it rinses clean. Are you doing this manually or programmatically? Manually, just keep hitting Global Replace. In code, add a loop that says --in translation, of course, but you'll figure it out--

expr = blahblah
do
{ replacement here }
while (expr.test(source-text))
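
In Perl the loop above could be sketched as follows. The sub name link_citations and the href="#N" form are assumptions for illustration; the pattern is the search string given earlier, unchanged.

```perl
use strict;
use warnings;

sub link_citations {
    my ($html) = @_;
    # $1 = "[" plus any anchors already in place, $2 = next unlinked number.
    my $unlinked = qr{(\[(?:<a href[^>]+>\d+</a>[^\d<\]]+)*)(\d+)};
    # Perl spelling of "do { replace } while (pattern still matches)":
    # s///g returns the number of substitutions made, so the loop
    # stops on the first pass that changes nothing.
    1 while $html =~ s{$unlinked}{$1<a href="#$2">$2</a>}g;
    return $html;
}

my $c = "[adb1,3,4-6, 9, 12-23], 11 should not be linked [3-4]. "
      . "Some [11-12]. A 10-day 44 watts treatment [22].";
print link_citations($c), "\n";
```

Each pass anchors one more number per bracket group, which is why a single run links only the first number in [3-4] and [11-12].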

newbies

4:44 am on Feb 19, 2014 (gmt 0)

10+ Year Member



I am not running it manually, but programmatically.

For the pattern [adb1, 3], the numbers should not be anchored, because this is not a citation pattern. I included it as a case to exclude, because you said such a non-citation pattern might be anchored by mistake.

lucy24

6:37 am on Feb 19, 2014 (gmt 0)

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time Top Contributors Of The Month



Oops again, overlooked the 1 in adb1. Is it always [a-z]+ before the not-to-be-captured \d? If so, replace all occurrences of
[^\d<\]]*

with
[^\d<\]]*(?:[a-z]+\d+)?[^\d<\]]*


There are several reasons for using non-capturing groups. Here I'm doing it so the target doesn't have to change every time the pattern gets tweaked. Otherwise you'd be up to about $27 by now!

I don't speak perl. Can you translate the "do...while" bit into something usable?
