Matching a truncated string to add missing characters

Not always truncated at the same place

     
9:38 pm on Jan 23, 2011 (gmt 0)

Junior Member

5+ Year Member

joined:Sept 18, 2007
posts:105
votes: 0


Say I have a string like "This page is about X-Widgets" (without the quotes) where X-Widgets is usually the last word in the string. Only a fixed number of characters is allowed in the string, so sometimes "X-Widgets" might be "X-Widge" or "X-Wid" or anything in between. I want to correct this on the page so it always displays as "X-Widgets". Let's assume I have the "X-" to begin with, so I need to test if "W", "Wi", "Wid", "Widg", "Widge", "Widget" are present.

I guess I can use something like

(.*)X-[W]
(.*)X-[Wi]
(.*)X-[Wid]
(.*)X-[Widg]
(.*)X-[Widge]
(.*)X-[Widget]

But is there a better way? I tried

(.*)X-[W?i?d?g?e?t?]

but it didn't work, although it seemed like an interesting thing to try. I know there has to be a simple way, but I'm stuck. Thanks!
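
Editor's sketch of why the bracketed attempt fails: square brackets form a character class, which matches a single character from the set, so the ? marks inside it are just literal question marks. Outside brackets, each ? makes the preceding letter optional. The sample strings below are assumptions chosen for illustration.

```perl
use strict;
use warnings;

# A bracketed class matches exactly ONE character from the set, so
# [W?i?d?g?e?t?] means "one of W, i, d, g, e, t or a literal ?".
my $class_re = qr/X-[W?i?d?g?e?t?]/;

# Outside brackets, each ? makes the preceding letter optional.
my $optional_re = qr/X-W?i?d?g?e?t?/;

for my $text ('about X-Wid', 'about X-Widge') {
    my ($by_class)    = $text =~ /($class_re)/;
    my ($by_optional) = $text =~ /($optional_re)/;
    print "$text: class matched '$by_class', optional matched '$by_optional'\n";
}
```

The class version stops after "X-W" in both cases, while the optional sequence reaches "X-Wid" and "X-Widge".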
10:29 pm on Jan 23, 2011 (gmt 0)

Junior Member

5+ Year Member

joined:Sept 18, 2007
posts:105
votes: 0


Actually after scratching my head some more I tried

(.*)X-(W|Wi|Wid|Widg|Widge|Widget)

which seems to work. I'm checking to see if it has any adverse effects that might not be obvious.
12:16 am on Jan 24, 2011 (gmt 0)

Senior Member

WebmasterWorld Senior Member 5+ Year Member

joined:May 31, 2008
posts:661
votes: 0


that'll probably work just fine.
make sure you add $ at the end of the regexp, otherwise
".*X-W" will match "This page is about X-Wid"
so
(.*)X-(W|Wi|Wid|Widg|Widge|Widget)$
would work even better.
Of course, if the Widget part is dynamic, it might be easier to just capture that part with
my $string = "this page about X-Widge"; if($string =~ m/^.*X-(.*)$/) { my $potentiallycut = $1; }

and just check whether it matches a list of words.
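
Editor's sketch of the "capture the tail, then check it against a list" idea. The list contents and the helper name complete_tail are hypothetical, and the prefix test with index() is one possible way to do the check, not necessarily what the poster had in mind.

```perl
use strict;
use warnings;

# Hypothetical list of full words the tail may have been cut from.
my @known_words = ('Widgets');

sub complete_tail {
    my ($string) = @_;
    # The $ anchor ties the match to the end of the string.
    if ($string =~ /^(.*X-)(.*)$/) {
        my ($head, $tail) = ($1, $2);
        for my $word (@known_words) {
            # index() returns 0 when $tail is a prefix of $word
            # (an empty tail counts as a prefix too).
            return $head . $word if index($word, $tail) == 0;
        }
    }
    return $string;    # leave anything unrecognised untouched
}

print complete_tail('This page is about X-Widge'), "\n";   # This page is about X-Widgets
```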
5:17 pm on Jan 24, 2011 (gmt 0)

Senior Member

WebmasterWorld Senior Member rocknbil is a WebmasterWorld Top Contributor of All Time 10+ Year Member

joined:Nov 28, 2004
posts:7999
votes: 0


I would do this instead: the W seems to be always present, case sensitivity may be a problem, and this limits the match to letters only . . . .

if($string =~ /^.*X-w[a-z]*$/i) {
6:49 pm on Jan 24, 2011 (gmt 0)

Junior Member

5+ Year Member

joined:Sept 18, 2007
posts:105
votes: 0


Thanks for the replies and suggestions! After some more trial and error and further thinking I realized I may not have been totally clear. I also realized after my other posts that W isn't always present, so I needed to make the entire OR optional. Here's what I have so far (I haven't tried the suggestion in the previous post yet).

s/X-(W|Wi|Wid|Widg|Widge|Widget)?/Widgets/gi;

It sort of works (yes, really, sort of) and works well if it's only matching X-

But when it matches say X-Widg it returns X-Widgetsidg and if it matches X-Wid it returns X-Widgetsid and so on. I'm totally baffled at that. As you can see I stripped it down to bare bones to see what's going on, but I can't figure it out. Btw, adding the $ anchor makes it fail altogether. Any ideas?
7:25 pm on Jan 24, 2011 (gmt 0)

Senior Member

WebmasterWorld Senior Member 5+ Year Member

joined:May 31, 2008
posts:661
votes: 0


but it's always X- in the string?

maybe
s/X-.*$/X-Widgets/g;
will work?
7:48 pm on Jan 24, 2011 (gmt 0)

Junior Member

5+ Year Member

joined:Sept 18, 2007
posts:105
votes: 0


Thanks both of you!

s/X-[a-z]*/X-Widget/gi;

seems to work for the non-plural version, but I also realized the S on the end of Widget(s) is sometimes optional. I tried the below but it doesn't replace the S if it is actually there.

s/X-[a-z]*(s)?/X-Widget$1/gi;

None of the examples work with the $ anchor.
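
Editor's sketch of why the trailing S gets lost: the greedy [a-z]* consumes every letter, including a final s, so the optional group after it has nothing left to capture. The lazy variant with a lookahead is one possible workaround and is an assumption, not something proposed in the thread; the sub names are hypothetical. Note that it uses (s?) rather than (s)? so that $1 is always defined.

```perl
use strict;
use warnings;

sub fix_greedy {
    my ($s) = @_;
    # [a-z]* swallows all letters, trailing s included, so (s?) is always empty.
    $s =~ s/X-[a-z]*(s?)/X-Widget$1/gi;
    return $s;
}

sub fix_lazy {
    my ($s) = @_;
    # [a-z]*? gives (s?) a chance to claim a final s; the lookahead
    # forces the match to run to the end of the word.
    $s =~ s/X-[a-z]*?(s?)(?![a-z])/X-Widget$1/gi;
    return $s;
}

for my $input ('X-Widgets', 'X-Widge', 'X-Wid') {
    print "$input: greedy -> ", fix_greedy($input), ", lazy -> ", fix_lazy($input), "\n";
}
```

Like the original pattern, the lazy version rewrites any run of letters after "X-", so a different word ending in s would also come out as "X-Widgets".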
12:23 am on Jan 25, 2011 (gmt 0)

Administrator

WebmasterWorld Administrator phranque is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Aug 10, 2004
posts:10544
votes: 8


when it matches say X-Widg it returns X-Widgetsidg and if it matches X-Wid it returns X-Widgetsid and so on. I'm totally baffled at that

reverse the order of the alternatives in this match:
(W|Wi|Wid|Widg|Widge|Widget)
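
Editor's sketch of why the order matters: Perl tries alternatives left to right and keeps the first one that lets the overall match succeed, so a bare "W" wins before "Widg" is ever tried. The replacement text "X-Widgets" is assumed here because it reproduces the output reported above; the sub names are hypothetical.

```perl
use strict;
use warnings;

sub fix_short_first {
    my ($s) = @_;
    # "W" is tried first and succeeds, so only "X-W" is replaced.
    $s =~ s/X-(W|Wi|Wid|Widg|Widge|Widget)?/X-Widgets/gi;
    return $s;
}

sub fix_long_first {
    my ($s) = @_;
    # Longest alternative first: the whole truncated word is consumed.
    $s =~ s/X-(Widget|Widge|Widg|Wid|Wi|W)?/X-Widgets/gi;
    return $s;
}

print fix_short_first('X-Widg'), "\n";   # X-Widgetsidg
print fix_long_first('X-Widg'), "\n";    # X-Widgets
```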
12:25 am on Jan 25, 2011 (gmt 0)

Senior Member

WebmasterWorld Senior Member 5+ Year Member

joined:May 31, 2008
posts:661
votes: 0


or go with
s/X-[Widget]*(s?)/X-Widget$1/gi;
1:42 am on Jan 25, 2011 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:July 27, 2001
posts:1472
votes: 0


This is how I would match it:

$string =~ s/X-W?i?d?g?e?t?(s?)/X-Widget$1/g;


Hope this helps.
7:38 pm on Jan 25, 2011 (gmt 0)

Junior Member

5+ Year Member

joined:Sept 18, 2007
posts:105
votes: 0


Thanks for all the suggestions! I really appreciate the input. I finally found something that works, at least I think so so far.

$string =~ s/X-W?i?d?g?e?t?(s)?/X-Widget$1/gi;

What was really odd, and I can't figure out what was causing it, was that if I left out the t? on the left side it would return Widgett or Widgetts. I have no idea where it was getting an extra "t" from. Very odd. But I was getting some odd behavior all around, I guess because it's all in the midst of a long string of html so there's really no way to anchor either end.
11:17 pm on Jan 25, 2011 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:July 27, 2001
posts:1472
votes: 0


Now that I think about it, if the string is truncated, how would you know if an 's' character was supposed to be attached to the widget name, or for that matter what if there is an 's' character in the middle of the widget name? Maybe you could get better results using an array to store your widget names and grep it for the best match. You would get a much more accurate guess than from the other code examples listed above. Of course, if all of your widgets start with a 'W' and the string is truncated to 'X-W', the array is going to return the first match it finds. The more characters in the string, the more accurate it would be. For example, it wouldn't return 'Widget' for 'getWid' like the other examples would.

Something like:

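my @widgets = ('Widget', 'Water', 'Bridget', 'Widgets');
my $string = 'This page is about X-Wa';
# Capture the (possibly truncated) word after "X-" at the end of the string
my ($search) = $string =~ /X-(\w+)$/;
# Take the first known name that starts with the captured fragment
my ($result) = defined $search ? grep { /^\Q$search\E/ } @widgets : ();
# Replace the fragment only if a candidate was found
$string =~ s/\Q$search\E$/$result/ if defined $result;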
2:29 am on Jan 26, 2011 (gmt 0)

Administrator

WebmasterWorld Administrator phranque is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Aug 10, 2004
posts:10544
votes: 8


have you tried this yet?
(Widget|Widge|Widg|Wid|Wi|W)

this (W?i?d?g?e?t?) will match all of the above, but it will also match strings you don't want, such as "Wide", "id", etc
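
Editor's sketch of the over-matching described above. Because every letter in W?i?d?g?e?t? is optional on its own, letters can be skipped anywhere. Nesting the optional groups, so that each letter is allowed only when the previous one matched, is an alternative to the reversed alternation; it is an assumption added for comparison, not something proposed in the thread. Both patterns are anchored here only to make the test words unambiguous.

```perl
use strict;
use warnings;

# Independent optionals: any letter may be skipped, so "Wide" and "id" pass.
my $loose = qr/^W?i?d?g?e?t?$/;

# Nested optionals: each letter is allowed only if the previous one matched.
my $strict = qr/^(?:W(?:i(?:d(?:g(?:e(?:t)?)?)?)?)?)?$/;

for my $word ('Widg', 'Wide', 'id', 'Widget') {
    printf "%-6s loose=%s strict=%s\n", $word,
        ($word =~ $loose  ? 'yes' : 'no'),
        ($word =~ $strict ? 'yes' : 'no');
}
```

The nested form accepts exactly the same strings as (Widget|Widge|Widg|Wid|Wi|W)?, namely the true prefixes of "Widget" plus the empty string.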
3:35 pm on Jan 26, 2011 (gmt 0)

Junior Member

5+ Year Member

joined:Sept 18, 2007
posts:105
votes: 0


have you tried this yet?
(Widget|Widge|Widg|Wid|Wi|W)

I had tried that, although I had left out the entire word Widget because I thought it was silly to be replacing the correct word with the same word, and it was doing the same thing as I posted above, it was returning Widgett (double tt) somehow. So I added Widget to the OR as in your suggestion and that seems to have fixed the double tt thing and it's looking promising now. Thank you.

Now that I think about it, if the string is truncated, how would you know if a 's' character was supposed to be attached to the widget name or for that matter what if there is an 's' character in the middle of the widget name.

I understand, and part of the problem is that Widget, which is always the word in question, actually starts with an S and can end with an S. So yes, it's complicated. I tried something like this too, with varying degrees of success, but nothing totally usable

s/X-(s)?[a-rt-z]{0,5}(s)?/X-Widget$2/gi;

And thanks for the array suggestion and example, but Widget is the only word. It may be preceded by several descriptive words, but it's always Widget or Widgets in this case.
6:33 pm on Jan 26, 2011 (gmt 0)

Senior Member

WebmasterWorld Senior Member 5+ Year Member

joined:May 31, 2008
posts:661
votes: 0


I had tried that, although I had left out the entire word Widget because I thought it was silly to be replacing the correct word with the same word, and it was doing the same thing as I posted above, it was returning Widgett (double tt) somehow. So I added Widget to the OR as in your suggestion and that seems to have fixed the double tt thing and it's looking promising now. Thank you.


just to explain: Widget was in there to match the complete word, not so much to replace it with itself. When you took it out, "Widget" was matched by "Widge" and "Widge" was replaced by "Widget", thus resulting in "Widgett".
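
Editor's sketch reproducing the doubled "t" described above, using simplified versions of the patterns from the thread (no trailing-s group, no /g).

```perl
use strict;
use warnings;

my $without_full = 'X-Widget';
# "Widge" is the longest alternative, so it matches the first five letters
# and the original final "t" is left behind after the replacement.
$without_full =~ s/X-(Widge|Widg|Wid|Wi|W)/X-Widget/i;

my $with_full = 'X-Widget';
# With "Widget" listed first, the whole word is consumed and replaced.
$with_full =~ s/X-(Widget|Widge|Widg|Wid|Wi|W)/X-Widget/i;

print "$without_full\n";   # X-Widgett
print "$with_full\n";      # X-Widget
```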
7:55 pm on Jan 26, 2011 (gmt 0)

Junior Member

5+ Year Member

joined:Sept 18, 2007
posts:105
votes: 0


just to explain: Widget was in there to match the complete word, not so much to replace it with itself. When you took it out, "Widget" was matched by "Widge" and "Widge" was replaced by "Widget", thus resulting in "Widgett".

Thank you, I understand now, almost completely. Only thing I can't grasp is where the tt is coming from. So let me try to rationalize...

Widget is matched by Widge, but Widget still exists in its entire form - then Widge gets replaced by Widget, leaving the underlying t from the original Widget, hence Widgett? It would show better what I'm saying if I could use colors, but the board doesn't seem to allow them. I hope that all makes sense and thanks for helping me to understand! I think I got it.

I have a somewhat related question also but I'll start a new topic for that.
8:21 pm on Jan 26, 2011 (gmt 0)

Senior Member

WebmasterWorld Senior Member 5+ Year Member

joined:May 31, 2008
posts:661
votes: 0



Widget is matched by Widge, but Widget still exists in its entire form - then Widge gets replaced by Widget, leaving the underlying t from the original Widget, hence Widgett? It would show better what I'm saying if I could use colors, but the board doesn't seem to allow them. I hope that all makes sense and thanks for helping me to understand! I think I got it.


exactly!


I have a somewhat related question also but I'll start a new topic for that.


go for it. two topics are better than one, it'll make the perl-board seem very much alive ;)
9:52 pm on Jan 26, 2011 (gmt 0)

Junior Member

5+ Year Member

joined:Sept 18, 2007
posts:105
votes: 0


> exactly!

Thank you! :)

>go for it. two topics are better than one, it'll make the perl-board seem very much alive ;)

I have plenty of things to help keep it alive. I just have to figure how to ask them that makes sense to everyone but me.
9:56 pm on Jan 26, 2011 (gmt 0)

Junior Member

5+ Year Member

joined:Sept 18, 2007
posts:105
votes: 0


Also just for the record, this seems to be working just fine now

$string =~ s/X-(Widget|Widge|Widg|Wid|Wi|W)?(s)?/X-Widget$2/gi;

Thank you ALL for your help and input and suggestions! I learned a lot. I really appreciate it.
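
Editor's sketch exercising a close variant of the final substitution on a few sample inputs (the inputs and the sub name fix_widget are assumptions). It swaps (s)? for (s?) and makes the first group non-capturing: with (s)?, the capture is undefined when no s is present, which can trigger an "uninitialized value" warning under use warnings, whereas (s?) always captures at least an empty string.

```perl
use strict;
use warnings;

sub fix_widget {
    my ($s) = @_;
    # Longest alternative first; (s?) keeps an existing plural s if there is one.
    $s =~ s/X-(?:Widget|Widge|Widg|Wid|Wi|W)?(s?)/X-Widget$1/gi;
    return $s;
}

for my $input ('about X-', 'about X-Wid', 'about X-Widget',
               'about X-Widgets', '<b>X-Wi</b> and X-Widgets') {
    print "$input => ", fix_widget($input), "\n";
}
```

Because nothing anchors the end of the match, an unrelated word starting with s right after "X-" would still be treated as a plural marker, which is the ambiguity raised earlier in the thread.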