Perl @array question

Check for duplicate before push(@array)

         

Birdman

3:30 pm on Jun 17, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Hello,

I'm very new to Perl and I'm sure this is easy for most of you. I'm looking for the equivalent of PHP's in_array() function. I think it is 'exists', but I still can't make it work.

Can someone tell me how to check the array for a duplicate, before adding it to the array? Thanks!


my @imgs = ();
sub callback {
my($tag, %attr) = @_;
return if $tag ne 'img';
push(@imgs, values %attr);
}
$p = HTML::LinkExtor->new(\&callback);

jatar_k

10:18 pm on Jun 17, 2003 (gmt 0)

WebmasterWorld Administrator 10+ Year Member



maybe if I keep posting bad answers to perl questions people might answer them to help me save face. ;)

maybe just stick to php Birdman

</bump>

Birdman

11:46 pm on Jun 17, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



>>>stick to php

:) You're right, I'm just a glutton for punishment.

Perl looks so good, I can't help but give it a go.

sugarkane

7:10 pm on Jun 18, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



The 'exists' function is for checking if a %hash key exists, rather than array elements.

Is there any reason why you need to check for duplicates before adding to the array? If you're going to be adding more than one or two entries, it's probably more efficient to just add everything and then remove the duplicates later:


foreach $i (@array) {
    # keep $i only the first time it is seen
    push(@new_array, $i) unless ($seen{$i}++);
}
@array=@new_array;
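
If you would rather check before each push, which is what the original question asked for, the same hash idea works at the point where you add the value. This is only a minimal sketch: the names push_unique, @list and %in_list are made up for illustration, and the grep line at the end is just one common way to get an in_array()-style test without a hash.

```perl
use strict;
use warnings;

my @list;       # the array being built
my %in_list;    # keys are the values already pushed onto @list

# Hypothetical helper: push a value only if it has not been pushed before.
sub push_unique {
    my ($value) = @_;
    return if exists $in_list{$value};   # already there, skip it
    $in_list{$value} = 1;
    push @list, $value;
}

push_unique($_) for qw(a.gif b.gif a.gif c.gif b.gif);
print "@list\n";    # prints: a.gif b.gif c.gif

# A rough in_array() equivalent without a hash: scan the whole array.
# In scalar context grep returns the number of matching elements.
my $found = grep { $_ eq 'b.gif' } @list;
print "b.gif is in the list\n" if $found;
```

The hash lookup stays cheap as the list grows, while the grep version walks the whole array on every check, so the hash is usually the better fit inside a callback that fires many times.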

cminblues

10:55 pm on Jun 18, 2003 (gmt 0)

10+ Year Member



Hi,
I'm not 100% sure, but doing the check in 'real time' with only a string is, yes, very ugly,
but maybe less expensive.

@not_parsed_data = ('red','yellow','black',
'red','green','blue',
'white','yellow','black',
'blue','green','black');
@new_array = ();
$boundary = '___';
$control_string = $boundary;
foreach $sing(@not_parsed_data) {
if (index($control_string, "$boundary$sing$boundary") < 0) {
push (@new_array, $sing);
$control_string .= "$sing$boundary";
}
}
print "@new_array\n";
exit 0;

[Note: used 'index' instead of regexp, only for performance reasons]

Birdman

11:40 am on Jun 19, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Thanks sugarkane and cminblues!

Sugarkane, welcome...back, as it seems, to the Perl forum. I read that you were getting bored in your other forum. I'm just picking Perl up, so I'll keep you plenty occupied for a while ;)

Back to the topic. Can you explain how this part works:

($seen{$i}++)

It worked like a charm :), I just want to get in my head how it worked. Is seen a function? Much appreciated!

sugarkane

1:04 pm on Jun 19, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Thanks Birdman, it's good to be back

> I'll keep you plenty occupied

Hehe, no worries ;)

Okay - $seen{$i}++

'seen' is just a random name for a hash variable (ie a list of key/value pairs). You could choose any name you want - 'seen' is just a touch more descriptive of what the variable is actually doing in this case.

A hash is similar to an array, but instead of just referencing the entries by number, you can also reference them by a string eg:

$colour{'banana'} = "yellow";
$colour{'tomato'} = "red";

For each hash 'key' (eg banana) there can only be one value - in this case yellow. But - the value can be anything - a string, a number, whatever. That de-duping snippet uses this fact to its advantage. The

push (@new_array, $i) unless ($seen{$i}++);

line is basically saying: test if $seen{$i} has a value, and if not, add $i to the array. Then increment $seen{$i} by one.

The trick is in the '++' which increments the value of $seen{$i} by one, but *after* it's been tested. So, if $i is a new, unique entry, $seen{$i} has no value yet, which Perl treats as zero (false), and $i will be added to the array. If the same value of $i appears later on in the loop, $seen{$i} will already have a value of one (or more), so it won't be added.
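
To watch the order of events for yourself, something like this prints the value the test sees and the value stored afterwards. It's only a small sketch: the colour list and the @before array are made up for the demonstration.

```perl
use strict;
use warnings;

my %seen;
my @before;    # the values that the 'unless' test would have seen

for my $i (qw(red blue red red)) {
    # Post-increment hands back the old value (0 for a brand new key),
    # then adds one to what is stored in the hash.
    my $old = $seen{$i}++;
    push @before, $old;
    print "$i: tested as $old, now stored as $seen{$i}\n";
}
```

The first 'red' and the first 'blue' are tested as 0, which is false, so 'unless' would let them through. The later 'red' entries are tested as 1 and 2, which are true, so they would be skipped.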

You could rewrite the line without the ++ operator, which might make it a bit clearer what's going on:
[perl]
foreach $i (@array) {
    push (@new_array, $i) unless ($seen{$i}); # if $seen{$i} has no value, it gets added to the array
    $seen{$i}=$seen{$i}+1; # Next time we see $i, $seen{$i} will have a value, so it won't be added to the array
}
@array=@new_array;
[/perl]

I hope that makes a little sense...

Birdman

1:36 pm on Jun 19, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



>>>I hope that makes a little sense..

Yes, and thank you for taking the time to spell it out for me. I'm still a bit fuzzy on the concept, but starting to get it.

When I add it to my script, it removes dupes but, for some reason, it breaks something else. Basically, I'm experimenting with a tiny spider (don't worry, I don't plan to cut it loose, except on my sites).

Notice at the end of the first loop where I clear the @links array? It works fine without the nested loop you showed me, but with the extra loop it doesn't seem to clear the array and the second url that gets spidered shows the links from the first url along with its own links.

Thanks again for your time!


use LWP::UserAgent;
use HTML::LinkExtor;
use URI::URL;

my @links = ();
sub callback {
    my($tag, %attr) = @_;
    return if $tag eq 'img';
    push(@links, values %attr);
}

@url = ("ht*p://www.site1.com/","h**p://www.site2.com/","ht*p://www.site3.com/");

foreach $newurl (@url){
print <<END;
<h1>$newurl</h1>
END

    $ua = LWP::UserAgent->new;
    $p = HTML::LinkExtor->new(\&callback);
    $res = $ua->request(HTTP::Request->new(GET => $newurl),
        sub {$p->parse($_[0])});
    my $base = $res->base;
    # make every link absolute
    @links = map { $_ = url($_, $base)->abs; } @links;
    # remove dupes
    foreach $i (@links) {
        push(@new_array, $i) unless ($seen{$i}++);
    }

    @links=@new_array;
    print join("<br />", @links), "\n";
    @links = ();
}

sugarkane

1:43 pm on Jun 19, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



You need to clear @new_array as well as @links - that should sort it.

> experimenting with a tiny spider

Hehe, always good fun ;)

Birdman

1:51 pm on Jun 19, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Ah, I see now! Thanks again!

>>Hehe, always good fun

Yeah, it is loads of fun. Just installed Apache/Perl/PHP/MySQL on my PC and it's like being in a toy store ;)

One more thing..if you have some good Perl tutorial links, I'd love to see them. The docs are cool but a bit technical for me.

cminblues

1:07 am on Jun 20, 2003 (gmt 0)

10+ Year Member



Birdman..
you also need to clear the %seen hash, otherwise a 'legitimate' entry in the 2nd loop can end up
not being accepted because it was also in the 1st.
[not to mention the hash's growing he..]

%seen=();
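
Putting both fixes together: one way to avoid the manual resets is to declare the array and the hash with 'my' inside the outer loop, so each URL starts with both of them empty. This is a sketch with no network calls. The %links_for data is made up and stands in for whatever the LinkExtor callback would collect, and %unique_for is just there to keep the results.

```perl
use strict;
use warnings;

# Made-up stand-in for the links collected from each page.
my %links_for = (
    'site1' => [qw(/a /b /a)],
    'site2' => [qw(/a /c /c)],
);

my %unique_for;    # de-duplicated links, per site
foreach my $url (sort keys %links_for) {
    my @new_array;   # fresh and empty for every URL
    my %seen;        # fresh and empty for every URL
    foreach my $i (@{ $links_for{$url} }) {
        push @new_array, $i unless $seen{$i}++;
    }
    $unique_for{$url} = [@new_array];
    print "$url: @new_array\n";
}
```

Note that '/a' shows up for site2 even though site1 had it too, which is the case that a %seen left over from the first pass would have dropped.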

Birdman

2:29 am on Jun 20, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Thanks cminblues, I never thought of that either. It's still tough to understand, but if I keep staring at it long enough, it'll come around.