Perl Server Side CGI Scripting Forum

    
perl regular expression
newbies




msg:4644647
 3:45 am on Feb 13, 2014 (gmt 0)

I have HTML files that are online versions of articles, with citations in the text and references at the end. The citations in the text appear in the following patterns: [1] or [1-2] or [1,3,6] or [1,3,4-6, 9]. I want each number to be linked to its reference without changing the text.

I know how to match the simple ones such as the first one: $html =~ s`\[\d+\]`\[<a href="#$1"><$1>\]`g;

But for the more complicated patterns, I have no idea how to insert the links around the numbers.

Thank you for your help.

 

phranque




msg:4644653
 4:53 am on Feb 13, 2014 (gmt 0)

$html =~ s`\[\d+\]`\[<a href="#$1"><$1>\]`g;

you need parentheses to capture something for the backreferences.
your anchor tag doesn't have a closing tag, so not sure what the intended syntax is there.

i would try something more like this:
$html =~ s|([-,\[ ])(\d+)([-,\] ])|$1<a href="#$2">$2</a>$3|g;

this will look for a pattern that begins with either a left square bracket, a comma, a dash, or a blank; then an integer; and ends with either a right square bracket, a comma, a dash, or a blank.
the integer is captured in the second group and the preceding and following character are captured in the first and third groups respectively.
these matches are substituted as specified repeatedly until there are no more matches.

newbies




msg:4644658
 5:39 am on Feb 13, 2014 (gmt 0)

Thank you phranque. It seems we are almost there.
for this pattern [1,3,4-6, 9]

The result is:
[<a href="#1">1</a>,3,<a href="#4">4</a>-6, <a href="#9">9</a>]

So 3 and 6 are not linked. How should I revise it?
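The reason 3 and 6 are skipped is that the third group consumes the delimiter after each number, so that character is no longer available as the leading delimiter of the next number. Below is a minimal sketch of one way around it, assuming a lookahead is acceptable in place of the third capture group. The variable name $linked is only illustrative.

```perl
use strict;
use warnings;

my $html = '[1,3,4-6, 9]';

# The trailing delimiter is only peeked at with (?=...), not consumed,
# so it can still serve as the leading delimiter of the next number.
(my $linked = $html) =~ s|([-,\[ ])(\d+)(?=[-,\] ])|$1<a href="#$2">$2</a>|g;

print "$linked\n";
```

This still links numbers outside brackets whenever they sit between the same delimiters, for example the 10 in "a 10-day".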

newbies




msg:4644670
 6:04 am on Feb 13, 2014 (gmt 0)

Another problem:

numbers not in [ ] may also get linked.

lucy24




msg:4644723
 9:08 am on Feb 13, 2014 (gmt 0)

:: thinking how I'd approach this in an e-book ::

pattern: 1 or more digits \d contained within square brackets, separated by one or more non-digits.

I think it's easiest to do it in multiple passes rather than try it all in one fell swoop. You're replacing
(\d+)
with something like
<a href[^>]+>\d+</a>
where presumably the captured digits $1 or \1 are used as part of the link.

So the pattern is
(\[(?:<a href[^>]+>\d+</a>[^\d<\]]+)*)(\d+)
change to
$1<a href et cetera>$2</a>
and repeat until it rinses clean.

In groups like "2-3" or "4-7" do you want each part separately anchored, or a single anchor for the whole unit? In the latter case, replace the sequence (\d+) wherever it occurs with (\d+)(-\d+)?. Note here that you have to capture the two parts separately, because only the first number will be used in the link. So you'll have
<a href et cetera including $2 somewhere>$2$3</a>

Anyway, that's what I would do in the index of an e-book (something I have a LOT of experience with :().

numbers not in [ ] may also get linked.

You wouldn't want them to, would you? Unless you have very unusual content, such that you can be reasonably certain the only numerals will be footnote references. You could also do it if the only non-footnote numerals are dates, so you can distinguish between \d{1,2} and \d{4,}. You would then need to express the search as \b\d\d?\b to make sure your Regular Expression doesn't sneakily try something like
1<a href et cetera including 94 here>94</a>5.

Edit: Oops. I guess I misread that. With appropriate use of [^\]] you can ensure that links only happen inside of square brackets.

newbies




msg:4644843
 4:06 pm on Feb 13, 2014 (gmt 0)

Thank you Lucy24!

numbers not in [ ] should not be linked.

A pattern like [3] or [4-6] is easy to handle, but for [1,3,4-6, 9] or something even more complicated such as [1,3,4-6, 9, 11-24], I don't know how to do it.

lucy24




msg:4644966
 11:07 pm on Feb 13, 2014 (gmt 0)

Making a pattern that captures everything in one go is more trouble than it's worth. That is: It would take more time to devise the Regular Expression than it would take to hit Replace a few more times-- and this isn't something you will be using 20 times a day forever.

That's why I suggested the pattern with (blahblah)*. Simply keep running a global replace until nothing new comes up.

In forms like [3-5] do you want

<a href = blahblah3blahblah>3</a>-<a href = blahblah5blahblah>5</a>

or

<a href = blahblah3blahblah>3-5</a>
?

That's a pretty simple difference, so just decide what you want your user's experience to be. How often would someone click on the second element in a group 2-7 and expect to be taken directly to footnote 7? You know the site and your users better than anyone. At least I hope you do ;)
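For the single-anchor choice, here is a rough Perl sketch of the repeated-replace idea with a range such as 4-6 wrapped in one link. The helper name link_units and the href="#N" format are assumptions made for illustration. The optional "-N" tail is captured in a group that can be empty, so it is always defined in the replacement.

```perl
use strict;
use warnings;

# Hypothetical helper: one anchor per unit, so "4-6" becomes a single link to #4.
sub link_units {
    my ($html) = @_;

    my $pattern = qr{
        (                                      # group 1: text up to the next unlinked unit
            \[
            (?: <a\ href[^>]+>\d+(?:-\d+)?</a> # a unit that is already linked
                [^\d<\]]+                      # separator without digits, "<" or "]"
            )*
        )
        (\d+)                                  # group 2: first number, used in the href
        ((?:-\d+)?)                            # group 3: optional "-N" tail, may be empty
    }x;

    # Each pass links at most one more unit per bracket group,
    # so repeat until a pass changes nothing.
    1 while $html =~ s{$pattern}{$1<a href="#$2">$2$3</a>}g;
    return $html;
}

print link_units('[1,3,4-6, 9]'), "\n";
```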

newbies




msg:4645481
 4:34 am on Feb 15, 2014 (gmt 0)

I almost got there using this code:
use strict;

my $c = "[1,3,4-6, 9, 12-23], [3-4]. some [11-12]. A 10-day 44 ";

my $c2 = replace ($c);

print "$c2\n";

sub replace {
    my $html = shift;
    if ($html =~ m/\[/g){
        $html =~ s`(.*?)(\d+)(?=.*?\])`$1<a href="#$2">$2</a>`g;
    }
    return $html;
}


The output is this, which is exactly what I wanted:
[<a href="#1">1</a>,<a href="#3">3</a>,<a href="#4">4</a>-<a href="#6">6</a>, <a href="#9">9</a>, <a href="#12">12</a>-<a href="#23">23</a>], [<a href="#3">3</a>-<a href="#4">4</a>]. some [<a href="#11">11</a>-<a href="#12">12</a>]. A 10-day 44


However, when I use the subroutine on a real HTML file, the "10" in the sentence "A 10-day" is also linked. In the HTML file, that part of the text is exactly like the text in the $c variable, with no line break or other special characters between "[11-12]." and "A 10-day". I cannot understand why I got different results!

phranque




msg:4645487
 5:51 am on Feb 15, 2014 (gmt 0)

$html =~ s`(.*?)(\d+)(?=.*?\])`$1<a href="#$2">$2</a>`g;


(.*?) is about as ambiguous as it gets for regular expressions.
it means "unanchored but beginning with zero or more of anything or nothing and capture what you match".

i would try something more like "preceded by a left square bracket, then zero or more characters that are not a right square bracket, then one or more decimal digits, etc"

however i can't explain why you got different results.
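One plausible explanation, assuming the real file has another bracketed citation later on the same line: the lookahead (?=.*?\]) only checks that some "]" occurs later on the line, not that the number sits inside brackets. In the test string nothing follows "A 10-day 44", so the lookahead fails there. The second input below is a hypothetical continuation, not text from the original file.

```perl
use strict;
use warnings;

# The substitution from the post above, unchanged: the lookahead (?=.*?\])
# only asks whether some "]" occurs later on the same line.
sub replace_lookahead {
    my ($html) = @_;
    $html =~ s`(.*?)(\d+)(?=.*?\])`$1<a href="#$2">$2</a>`g;
    return $html;
}

# No "]" after "10-day": the 10 is left alone, as in the test string.
print replace_lookahead('some [11-12]. A 10-day 44 '), "\n";

# Hypothetical continuation with a later citation: a "]" now follows,
# so the lookahead succeeds and the 10 is linked too.
print replace_lookahead('some [11-12]. A 10-day treatment [22].'), "\n";
```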

newbies




msg:4645491
 8:06 am on Feb 15, 2014 (gmt 0)

Thank you.

You're right -
(.*?) is about as ambiguous as it gets for regular expressions.


It is that part that caused the problem.

Now that I have modified the code, everything works fine!

$html =~ s`\G(.*?)(\d+)(?=[,|\-|\s|\d]*\])`$1<a href="#$2">$2</a>`g;

phranque




msg:4645517
 10:08 am on Feb 15, 2014 (gmt 0)

sorry i misread the regexp so disregard my description.

does your code work if there isn't a left square bracket before the first digit?

newbies




msg:4645594
 5:59 pm on Feb 15, 2014 (gmt 0)

No, any number has to be within [ ] to be linked. The complete code is as follows:

if ($html =~ m/\[/g){
$html =~ s`\G(.*?)(\d+)(?=[,|\-|\s|\d]*\])`$1<a href="#$2">$2</a>`g;
}

lucy24




msg:4645629
 10:51 pm on Feb 15, 2014 (gmt 0)

(.*?)

I wouldn't dare do this. Express the generic . as [^\[<>] etc, excluding any characters that absolutely must not occur in this location. Otherwise you risk going from
see citation no. [<a href = blahblah1234blahblah>1234</a>]
to
see citation no. [<a href = blahblah1234blahblah>1<a href = blahblah23blahblah>23</a>4</a>]

Or, worse, from
see citation no. [<a href = blahblah56blahblah>56</a>]
to
see citation no. [<a href = blahblah<a href = blahblah56blahblah>56</a>blahblah>56</a>]

In my personal experience-- still talking ebooks-- I get it most often with physical page numbers in a multi-page index. To run it as an unsupervised global replace you have to make sure nothing gets anchored that isn't already anchored.

newbies




msg:4645681
 8:02 am on Feb 16, 2014 (gmt 0)

Indeed, that is a problem, and I don't know how to exclude those situations. I tried this code, which did not work:

if ($html =~ m/\[/g){
$html =~ s`\G([\d|\-|,\s]*)(\d+)(?=[,|\-|\s|\d]*\])`$1<a href="#$2">$2</a>`g;
}

lucy24




msg:4645691
 11:20 am on Feb 16, 2014 (gmt 0)

Well, I don't speak the language, but can you translate this?

Search string:
(\[(?:<a href[^>]+>\d+</a>[^\d<\]]+)*)(\d+)

Replace with:
$1<a href blahblah including $2 here>$2</a>

where $1 is the stuff leading up to your first new capture and $2 is your fresh reference. Pay close attention to the brackets. It may be easier to see if you cut and paste into something using a fixed-pitch font. I'm accustomed to reading Regular Expressions in Courier and they look like so much gibberish in anything else.

Walkthrough:
#1 Look for an opening bracket. Take no action until you find one.
#2 Opening bracket may or may not be followed by one or more existing anchor packages, possibly separated by a few characters that are NOT digits, < or close-brackets.
#3 If, after all this, you find a set of digits before reaching the close-bracket, capture and anchor those suckers.
#4 Repeat as an unsupervised global replace until everything rinses clean.

That's assuming you want each set of numerals captured and anchored separately.

newbies




msg:4646511
 1:32 am on Feb 19, 2014 (gmt 0)

It is still not working as expected.
s`(\[(?:<a href[^>]+>\d+</a>[^\d<\]]+)*)(\d+)`$1<a href="#$2">$2</a>`g;

input:
$c = "[adb1,3,4-6, 9, 12-23], 11 should not be linked [3-4]. Some [11-12]. A 10-day 44 watts treatment [22].";

Output:
[adb1,3,4-6, 9, 12-23], 11 should not be linked [<a href="#3">3</a>-4]. Some [<a href="#11">11</a>-12]. A 10-day 44 watts treatment [<a href="#22">22</a>].

lucy24




msg:4646522
 3:18 am on Feb 19, 2014 (gmt 0)

[adb

Urk, didn't think of that. You'll need to add a

[^\d<\]]*

to the beginning of the pattern-- immediately after the \[ --to capture possible non-digits in this location.

So the part

(?:<a href[^>]+>\d+</a>[^\d<\]]+)*

isn't working for non-zero values of * meaning that the package as a whole has something wrong with it.

Hmmm...

:: beating head against wall ::

It's going to be something really obvious that will cause us all to cry "D'oh!" in unison. Hang on. We Will Get This To Work.

Edit:
Oh, wait. Are you running this pattern repeatedly? It won't pick up everything on the first pass. You have to keep running it until it rinses clean. Are you doing this manually or programmatically? Manually, just keep hitting Global Replace. In code, add a loop that says --in translation, of course, but you'll figure it out--

expr = blahblah
do
{ replacement here }
while (expr.test(source-text))

newbies




msg:4646532
 4:44 am on Feb 19, 2014 (gmt 0)

I am not running it manually, but programmatically.

For the pattern [adb1, 3], the numbers should not be anchored, because this is not a citation pattern. I included it as a case to exclude, because you said this kind of non-citation pattern might be anchored by mistake.

lucy24




msg:4646545
 6:37 am on Feb 19, 2014 (gmt 0)

Oops again, overlooked the 1 in adb1. Is it always [a-z]+ before the not-to-be-captured \d? If so, replace all occurrences of
[^\d<\]]*
with
[^\d<\]]*(?:[a-z]+\d+)?[^\d<\]]*

There are several reasons for using non-capturing groups. Here I'm doing it so the target doesn't have to change every time the pattern gets tweaked. Otherwise you'd be up to about $27 by now!

I don't speak perl. Can you translate the "do...while" bit into something usable?

phranque




msg:4646573
 9:12 am on Feb 19, 2014 (gmt 0)

I don't speak perl. Can you translate the "do...while" bit into something usable?

usually something like this works:
while ($html =~ /some regular expression/){
# do something
}
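
Putting the pieces of this thread together, here is a minimal sketch that runs lucy24's search pattern in a loop until nothing changes, with each number anchored separately. The function name link_citations and the href="#N" format are assumptions made for illustration. Because s///g returns the number of substitutions it made, it can serve as the loop condition directly.

```perl
use strict;
use warnings;

# Hypothetical helper: link every number inside a bracketed citation
# group such as [1,3,4-6, 9], leaving numbers outside brackets alone.
sub link_citations {
    my ($html) = @_;

    my $pattern = qr{
        (                              # group 1: text up to the next unlinked number
            \[                         # opening bracket of the citation group
            (?:
                <a\ href[^>]+>\d+</a>  # a number that is already linked
                [^\d<\]]+              # separator without digits, "<" or "]"
            )*
        )
        (\d+)                          # group 2: the next number still to be linked
    }x;

    # Each pass links at most one more number per bracket group,
    # so repeat until a pass makes no substitution.
    1 while $html =~ s{$pattern}{$1<a href="#$2">$2</a>}g;
    return $html;
}

print link_citations('[1,3,4-6, 9], 11 stays [adb1,3]. A 10-day course [22].'), "\n";
```

Groups such as [adb1,3] are left alone because the first character after the bracket is neither a digit nor an existing anchor. Two limits of this sketch: a bracket followed directly by a non-citation number, such as [2014 edition], would still be linked, and a group with a space right after the bracket, such as [ 1], would not.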

All trademarks and copyrights held by respective owners. Member comments are owned by the poster.
WebmasterWorld is a Developer Shed Community owned by Jim Boykin.
© Webmaster World 1996-2014 all rights reserved