Perl Server Side CGI Scripting Forum

    
Matching a truncated string to add missing characters
Not always truncated at the same place
StaceyJ




msg:4257068
 9:38 pm on Jan 23, 2011 (gmt 0)

Say I have a string like "This page is about X-Widgets" (without the quotes) where X-Widgets is usually the last word in the string. Only a limited number of characters are allowed in the string, so sometimes "X-Widgets" might be "X-Widge" or "X-Wid" or anything in between. I want to correct this on the page so it always displays as "X-Widgets". Let's assume I have the "X-" to begin with, so I need to test whether "W", "Wi", "Wid", "Widg", "Widge", or "Widget" is present.

I guess I can use something like

(.*)X-[W]
(.*)X-[Wi]
(.*)X-[Wid]
(.*)X-[Widg]
(.*)X-[Widge]
(.*)X-[Widget]

But is there a better way? I tried

(.*)X-[W?i?d?g?e?t?]

but it didn't work, although it seemed like an interesting thing to try. I know there has to be a simple way, but I'm stuck. Thanks!
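
One likely reason the bracketed attempt did not behave as hoped: square brackets define a character class, so [W?i?d?g?e?t?] matches exactly one character from the set W, ?, i, d, g, e, t rather than an optional run of letters. A minimal sketch of the difference (the sample strings are made up for illustration):

```perl
use strict;
use warnings;

# Inside [...] every character is a literal set member, so this class
# matches exactly ONE character out of: W ? i d g e t
my $class = qr/X-[W?i?d?g?e?t?]/;

# Outside brackets, each ? makes the preceding letter optional.
my $optional = qr/X-W?i?d?g?e?t?/;

for my $s ('about X-Widg', 'about X-?', 'about X-') {
    my ($c) = $s =~ /($class)/;
    my ($o) = $s =~ /($optional)/;
    printf "%-14s class=%-8s optional=%s\n",
        $s, $c // '(none)', $o // '(none)';
}
```

For 'about X-Widg' the class version stops after "X-W", while the version without brackets consumes "X-Widg".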

 

StaceyJ




msg:4257084
 10:29 pm on Jan 23, 2011 (gmt 0)

Actually after scratching my head some more I tried

(.*)X-(W|Wi|Wid|Widg|Widge|Widget)

which seems to work. I'm checking to see if it has any adverse effects that might not be obvious.

janharders




msg:4257121
 12:16 am on Jan 24, 2011 (gmt 0)

That'll probably work just fine.
Make sure you add $ at the end of the regexp; otherwise the match stops at the first alternative that fits, so
".*X-W" will match "This page is about X-Wid" and leave the "id" behind.
So
(.*)X-(W|Wi|Wid|Widg|Widge|Widget)$
would work even better.
Of course, if the Widget part is dynamic, it might be easier to just get that part with

my $string = "this page about X-Widge";
if($string =~ m/^.*X-(.*)$/) {
 my $potentiallycut = $1;
}

and just check whether it matches a list of words.
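
To illustrate the point about the $ anchor, here is a small sketch, assuming the sample string from the post. Without the anchor the shortest alternative wins because it is listed first; with the anchor the engine has to backtrack until the match reaches the end of the string.

```perl
use strict;
use warnings;

my $string = 'This page is about X-Wid';

# Without $, the first alternative that fits (W) ends the match early.
my ($loose) = $string =~ /X-(W|Wi|Wid|Widg|Widge|Widget)/;

# With $, the engine backtracks through the alternatives until the
# match also reaches the end of the string.
my ($anchored) = $string =~ /X-(W|Wi|Wid|Widg|Widge|Widget)$/;

print "unanchored captured: $loose\n";     # W
print "anchored captured:   $anchored\n";  # Wid
```

The anchor only helps when the truncated word really is the last thing in the string.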

rocknbil




msg:4257426
 5:17 pm on Jan 24, 2011 (gmt 0)

I would do this instead: the W seems to be always present, case sensitivity may be a problem, and this limits the match to letters only . . . .

if($string =~ /^.*X-w[a-z]*$/i) {

StaceyJ




msg:4257464
 6:49 pm on Jan 24, 2011 (gmt 0)

Thanks for the replies and suggestions! After some more trial and error and further thinking I realized I may not have been totally clear. I also realized after my other posts that W isn't always present, so I needed to make the entire OR optional. Here's what I have so far (written before the previous post, which I haven't tried yet).

s/X-(W|Wi|Wid|Widg|Widge|Widget)?/X-Widgets/gi;

It sort of works (yes, really, sort of) and works well if it's only matching X-

But when it matches say X-Widg it returns X-Widgetsidg and if it matches X-Wid it returns X-Widgetsid and so on. I'm totally baffled at that. As you can see I stripped it down to bare bones to see what's going on, but I can't figure it out. Btw, adding the $ anchor makes it fail altogether. Any ideas?

janharders




msg:4257479
 7:25 pm on Jan 24, 2011 (gmt 0)

but it's always X- in the string?

maybe
s/X-.*$/X-Widgets/g;
will work?

StaceyJ




msg:4257498
 7:48 pm on Jan 24, 2011 (gmt 0)

Thanks both of you!

s/X-[a-z]*/X-Widget/gi;

seems to work for the non-plural version, but I also realized the S on the end of Widget(s) is sometimes optional. I tried the below but it doesn't replace the S if it is actually there.

s/X-[a-z]*(s)?/X-Widget$1/gi;

None of the examples work with the $ anchor.
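
A probable explanation for the missing S: [a-z]* is greedy and, with /i, already consumes the trailing "s", so the optional group has nothing left to capture. The sketch below shows this on a made-up sample string, along with one possible workaround that leaves "s" out of the class. The workaround is only an assumption about what is wanted, and it would also stop at an "s" in the middle of a word.

```perl
use strict;
use warnings;

my $string = 'All about X-Widgets today';

# [a-z]* swallows "Widgets" whole, so (s)? matches nothing.
my ($greedy_s) = $string =~ /X-[a-z]*(s)?/i;

# With "s" excluded from the class, the letters stop before the "s"
# and the optional group can capture it.
my ($narrow_s) = $string =~ /X-[a-rt-z]*(s)?/i;

print "greedy class captured: ", $greedy_s // '(undef)', "\n";  # (undef)
print "narrow class captured: ", $narrow_s // '(undef)', "\n";  # s
```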

phranque




msg:4257632
 12:23 am on Jan 25, 2011 (gmt 0)

> when it matches say X-Widg it returns X-Widgetsidg and if it matches X-Wid it returns X-Widgetsid and so on. I'm totally baffled at that

reverse the order of the alternatives in this match:
(W|Wi|Wid|Widg|Widge|Widget)
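
Why the order matters: Perl tries alternatives left to right and keeps the first one that lets the overall match succeed, so with the short forms first only "X-W" is replaced and the rest of the word is left behind. A minimal sketch comparing both orders (the helper name fix is hypothetical, and the replacement text X-Widgets is assumed from the output described earlier in the thread):

```perl
use strict;
use warnings;

my $short_first = qr/X-(W|Wi|Wid|Widg|Widge|Widget)?/i;
my $long_first  = qr/X-(Widget|Widge|Widg|Wid|Wi|W)?/i;

# Hypothetical helper: apply one substitution and return the result.
sub fix {
    my ($re, $in) = @_;
    (my $out = $in) =~ s/$re/X-Widgets/;
    return $out;
}

for my $in ('X-Widg', 'X-Wid', 'X-') {
    printf "%-7s short-first: %-13s long-first: %s\n",
        $in, fix($short_first, $in), fix($long_first, $in);
}
```

With the short forms first, X-Widg becomes X-Widgetsidg; with the long forms first, it becomes X-Widgets.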

janharders




msg:4257636
 12:25 am on Jan 25, 2011 (gmt 0)

or go with
s/X-[Widget]*(s?)/X-Widget$1/gi;

Key_Master




msg:4257656
 1:42 am on Jan 25, 2011 (gmt 0)

This is how I would match it:

$string =~ s/X-W?i?d?g?e?t?(s?)/X-Widget$1/g;


Hope this helps.

StaceyJ




msg:4258009
 7:38 pm on Jan 25, 2011 (gmt 0)

Thanks for all the suggestions! I really appreciate the input. I finally found something that works, at least I think so so far.

$string =~ s/X-W?i?d?g?e?t?(s)?/X-Widget$1/gi;

What was really odd, and I can't figure out what was causing it, was that if I left out the t? on the left side it would return Widgett or Widgetts. I have no idea where it was getting an extra "t" from. Very odd. But I was getting some odd behavior all around, I guess because it's all in the midst of a long string of html so there's really no way to anchor either end.

Key_Master




msg:4258133
 11:17 pm on Jan 25, 2011 (gmt 0)

Now that I think about it, if the string is truncated, how would you know whether an 's' character was supposed to be attached to the widget name, or for that matter, what if there is an 's' character in the middle of the widget name? Maybe you could get better results using an array to store your widget names and grep it for the best match. You would get a much more accurate guess than from the other code examples listed above. Of course, if all of your widgets start with a 'W' and the string is truncated to 'X-W', grep is going to return the first match it finds. The more characters in the string, the more accurate it would be. For example, it wouldn't return 'Widget' for 'getWid' like the other examples would.

Something like:

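my @widgets = ('Widget', 'Water', 'Bridget', 'Widgets');
my $string = 'This page is about X-Wa';
# Capture the (possibly truncated) word after "X-" at the end of the string
my ($search) = $string =~ /X-(\w+)$/;
# First widget name that starts with the captured fragment
my ($result) = grep { /^\Q$search\E/ } @widgets;
# Replace the fragment only if a candidate was found
$string =~ s/\Q$search\E$/$result/ if defined $result;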

phranque




msg:4258181
 2:29 am on Jan 26, 2011 (gmt 0)

have you tried this yet?
(Widget|Widge|Widg|Wid|Wi|W)

this (W?i?d?g?e?t?) will match all of the above, but it will also match strings you don't want, such as "Wide", "id", etc
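
A quick way to see the over-matching described here, assuming whole words are tested with ^ and $ anchors (the word list is made up for illustration):

```perl
use strict;
use warnings;

# Each letter optional on its own: letters may be skipped anywhere.
my $optional = qr/^W?i?d?g?e?t?$/;

# Only true prefixes of "Widget", longest first.
my $prefixes = qr/^(?:Widget|Widge|Widg|Wid|Wi|W)$/;

for my $word (qw(Widg Wide id Wet)) {
    my $opt = $word =~ $optional ? 'yes' : 'no';
    my $pre = $word =~ $prefixes ? 'yes' : 'no';
    print "$word: optional-letters=$opt prefix-list=$pre\n";
}
```

Only "Widg" passes both patterns; "Wide", "id" and "Wet" pass the optional-letter pattern alone.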

StaceyJ




msg:4258390
 3:35 pm on Jan 26, 2011 (gmt 0)

> have you tried this yet?
> (Widget|Widge|Widg|Wid|Wi|W)

I had tried that, although I had left out the entire word Widget because I thought it was silly to be replacing the correct word with the same word, and it was doing the same thing as I posted above, it was returning Widgett (double tt) somehow. So I added Widget to the OR as in your suggestion and that seems to have fixed the double tt thing and it's looking promising now. Thank you.

> Now that I think about it, if the string is truncated, how would you know if a 's' character was supposed to be attached to the widget name or for that matter what if there is an 's' character in the middle of the widget name.

I understand, and part of the problem is that Widget, which is always the word in question, actually starts with an S and can end with an S. So yes, it's complicated. I tried something like this too, with varying degrees of success, but nothing totally usable

s/X-(s)?[a-rt-z]{0,5}(s)?/X-Widget$2/gi;

And thanks for the array suggestion and example, but Widget is the only word. It may be preceded by several descriptive words, but it's always Widget or Widgets in this case.

janharders




msg:4258533
 6:33 pm on Jan 26, 2011 (gmt 0)

> I had tried that, although I had left out the entire word Widget because I thought it was silly to be replacing the correct word with the same word, and it was doing the same thing as I posted above, it was returning Widgett (double tt) somehow. So I added Widget to the OR as in your suggestion and that seems to have fixed the double tt thing and it's looking promising now. Thank you.


Just to explain: Widget was in there to match the complete word, not so much to replace it with itself. When you took it out, "Widget" was matched by "Widge" and "Widge" was replaced by "Widget", thus resulting in "Widgett".
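
The effect can be reproduced in a couple of lines. This is a sketch using a bare "X-Widget" as input; the same mechanism would presumably explain the extra t seen earlier when t? was left out of the optional-letter pattern.

```perl
use strict;
use warnings;

my $text = 'X-Widget';

# Full word missing from the list: only "X-Widge" is matched and
# replaced, and the original final "t" is left standing after it.
(my $without_full = $text) =~ s/X-(Widge|Widg|Wid|Wi|W)/X-Widget/;

# Full word listed first: the whole "X-Widget" is consumed.
(my $with_full = $text) =~ s/X-(Widget|Widge|Widg|Wid|Wi|W)/X-Widget/;

print "without full word: $without_full\n";  # X-Widgett
print "with full word:    $with_full\n";     # X-Widget
```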

StaceyJ




msg:4258583
 7:55 pm on Jan 26, 2011 (gmt 0)

> just to explain: Widget was in there to match the complete word, not so much to replace it with itself. When you took it out, "Widget" was matched by "Widge" and "Widge" was replaced by "Widget", thus resulting in "Widgett".

Thank you, I understand now, almost completely. Only thing I can't grasp is where the tt is coming from. So let me try to rationalize...

Widget is matched by Widge, but Widget still exists in its entire form - then Widge gets replaced by Widget, leaving the underlying t from the original Widget, hence Widgett? It would show better what I'm saying if I could use colors, but the board doesn't seem to allow them. I hope that all makes sense and thanks for helping me to understand! I think I got it.

I have a somewhat related question also but I'll start a new topic for that.

janharders




msg:4258595
 8:21 pm on Jan 26, 2011 (gmt 0)


> Widget is matched by Widge, but Widget still exists in its entire form - then Widge gets replaced by Widget, leaving the underlying t from the original Widget, hence Widgett? It would show better what I'm saying if I could use colors, but the board doesn't seem to allow them. I hope that all makes sense and thanks for helping me to understand! I think I got it.


exactly!


> I have a somewhat related question also but I'll start a new topic for that.


go for it. two topics are better than one, it'll make the perl-board seem very much alive ;)

StaceyJ




msg:4258650
 9:52 pm on Jan 26, 2011 (gmt 0)

> exactly!

Thank you! :)

>go for it. two topics are better than one, it'll make the perl-board seem very much alive ;)

I have plenty of things to help keep it alive. I just have to figure out how to ask them in a way that makes sense to everyone, not just me.

StaceyJ




msg:4258657
 9:56 pm on Jan 26, 2011 (gmt 0)

Also just for the record, this seems to be working just fine now

$string =~ s/X-(Widget|Widge|Widg|Wid|Wi|W)?(s)?/X-Widget$2/gi;
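
For anyone adapting this to another word, the longest-first alternation could be generated rather than typed by hand. The sketch below is one way to do that, not the exact code from the thread: the helper names prefix_alternation and repair are hypothetical, a non-capturing group is used so the optional "s" lands in $1 instead of $2, and the /e replacement avoids an "uninitialized value" warning when no "s" was captured.

```perl
use strict;
use warnings;

# Hypothetical helper: every non-empty prefix of $word, longest first,
# joined into an alternation such as Widget|Widge|Widg|Wid|Wi|W.
sub prefix_alternation {
    my ($word) = @_;
    my @prefixes = map { quotemeta(substr($word, 0, $_)) }
                   reverse(1 .. length($word));
    return join '|', @prefixes;
}

my $alt = prefix_alternation('Widget');
my $re  = qr/X-(?:$alt)?(s)?/i;

# Hypothetical helper: repair every truncated occurrence in $text,
# keeping a trailing "s" when one was present.
sub repair {
    my ($text) = @_;
    $text =~ s{$re}{'X-Widget' . ($1 // '')}ge;
    return $text;
}

for my $cut ('X-', 'X-W', 'X-Widg', 'X-Widget', 'X-Widgets') {
    print "$cut => ", repair($cut), "\n";
}
```

As noted earlier in the thread, a truncated string cannot tell you whether a plural "s" was lost, so every cut-off form comes back as the singular X-Widget.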

Thank you ALL for your help and input and suggestions! I learned a lot. I really appreciate it.

