regex in perl

         

Diceman

4:39 am on Oct 16, 2006 (gmt 0)

10+ Year Member



This may or may not belong in this forum; I didn't think any of the other forums were a better choice.

I have a Perl script that uses regular expressions to parse a webpage and return results from the page. There are 20 items that I am parsing for. All was going fine except that one item was not playing nice: it showed the value of the very first item I was searching for. Eventually I realized that the items on the page had changed order, and that was the problem. After fixing the order problem, all 20 values showed up correctly.

Here is the problem: this page changes the order of the items in my list on a regular basis. I need a way to keep using just one script and return all of the values without any extra cruft messages. Here is a sample of the script I am using for weather graphing. (Graphing is a side item; the script just returns values, just like the script I am having trouble with.)

#!/usr/bin/perl
use warnings;
use strict;

use LWP::Simple;

my $httpaddr = "http://www.aws.com/aws_2001/asp/obsForecast.asp?id=WISHT";

my %data;
my %trash;
my $content = LWP::Simple::get($httpaddr) or die "Couldn't get it!";

# regex in html source order
if ($content =~ /(<b>Temperature<\/b>)/g) { $trash{a} = $1; }
if ($content =~ /<b>(-?\d+\.\d+)<\/b>/g) { $data{Temp} = $1; }

if ($content =~ /(<b>Humidity<\/b>)/g) { $trash{a} = $1; }
if ($content =~ /<b>(\d+\.\d+)<\/b>/g) { $data{Humidity} = $1; }

if ($content =~ /(<b>Wind<\/b>)/g) { $trash{a} = $1; }
if ($content =~ /(\d+\.\d+)<\/b>/g) { $data{Wind} = $1; }

if ($content =~ /(<b>Daily Rain<\/b>)/g) { $trash{a} = $1; }
if ($content =~ /<b>(\d+\.\d+)<\/b>/g) { $data{Rain} = $1; }

if ($content =~ /(<b>Pressure<\/b>)/g) { $trash{a} = $1; }
if ($content =~ /<b>(\d+\.\d+)<\/b>/g) { $data{Pressure} = $1; }

if ($content =~ /(HEAT INDEX|WIND CHILL)/g) { $trash{a} = $1; }
if ($content =~ /(\d+\.\d+)/g) { $data{HeatIndex} = $1; }

if ($content =~ /(DEW POINT:)/g) { $trash{a} = $1; }
if ($content =~ /(\d+\.\d+)/g) { $data{DewPoint} = $1; }

for (keys %data) {
printf "%s:%s ", $_, $data{$_};
}
print "\n";

All of these values stay in the same order all the time; only the values returned change as the weather changes.

I have tried adding a


my %data;
my %trash;
my $content = LWP::Simple::get($httpaddr) or die "Couldn't get it!";

section for each item so it would start over each time. The problem is that it returns a bunch of messages about the variables changing, and that extra cruft prevents my graphing portion from reading the output properly.

any help is greatly appreciated. thanks!

perl_diver

5:34 am on Oct 16, 2006 (gmt 0)

10+ Year Member



I think this is a better approach:



#!/usr/bin/perl
use strict;
use warnings;
use LWP::Simple;

my $httpaddr = "http://www.example.com";
my %data;
my %trash;
my $content = get($httpaddr) or die "Couldn't get it!";

$content =~ s/<(?:[^>'"]*|(['"]).*?\1)*>//gs;
$content =~ s/\s+/ /gs;
$content =~ s/&[a-zA-Z]{3,4};//gs;

if ($content =~ /(Temperature).+?(-?\d+\.\d+)/) {
$trash{a} = $1;
$data{Temp} = $2;
}
if ($content =~ /(Humidity).+?(\d+\.\d+)/) {
$trash{a} = $1;
$data{Humidity} = $2;
}

if ($content =~ /(Wind).+?(\d+\.\d+)/) {
$trash{a} = $1;
$data{Wind} = $2;
}

if ($content =~ /(Daily Rain).+?(\d+\.\d+)/) {
$trash{a} = $1;
$data{Rain} = $2;
}

if ($content =~ /(Pressure).+?(\d+\.\d+)/) {
$trash{a} = $1;
$data{Pressure} = $2;
}

if ($content =~ /(HEAT INDEX|WIND CHILL).+?(\d+\.\d+)/) {
$trash{a} = $1;
$data{HeatIndex} = $2;
}

if ($content =~ /(DEW POINT:).+?(\d+\.\d+)/) {
$trash{a} = $1;
$data{DewPoint} = $2;
}

for (keys %data) {
printf "%s:%s ", $_, $data{$_};
}
print "\n";


I don't understand what %trash is being used for, though. The regexp for stripping HTML code is crude but seems to work OK in this case. When I run it against the URL you posted I get something like this printed out:

DewPoint:30.0 Humidity:40.5 Temp:53.4 Wind:2.2 Pressure:30.03 Rain:0.00 HeatIndex:53.4
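
With 20 items to extract, the repeated if blocks above could be driven from a lookup table instead. The following is only a sketch of that idea, not code from the thread: the %labels table, the sample string, and the sorted output are assumptions made for illustration, and the sample text stands in for a page that has already had its tags stripped.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical tag-stripped page text (stands in for the fetched $content).
my $content = 'Temperature 53.4 Humidity 40.5 Wind 2.2 Daily Rain 0.00';

# Output key => label pattern. Add one entry per item to extract.
my %labels = (
    Temp     => qr/Temperature/,
    Humidity => qr/Humidity/,
    Wind     => qr/Wind/,
    Rain     => qr/Daily Rain/,
);

my %data;
for my $key (sort keys %labels) {
    # No /g here, so every search starts at the beginning of $content.
    if ($content =~ /$labels{$key}.+?(-?\d+\.\d+)/) {
        $data{$key} = $1;
    }
}

# Sorted keys give a stable output order for whatever reads the result.
print join(' ', map { "$_:$data{$_}" } sort keys %data), "\n";
# prints: Humidity:40.5 Rain:0.00 Temp:53.4 Wind:2.2
```

Adding a 21st item then means adding one line to the table rather than another if block.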

rocknbil

9:55 am on Oct 16, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



If you look at the root of your task, I think you will find that XML::Simple and a reliable RSS weather feed will present **tons** fewer headaches for you. :-) Been where you are and can testify.

perl_diver

4:42 pm on Oct 16, 2006 (gmt 0)

10+ Year Member



I gotta agree with rocknbil. You should probably be looking into something like that instead of using the method you have been.

Diceman

10:10 pm on Oct 16, 2006 (gmt 0)

10+ Year Member



The method I have is working. I don't care to implement a huge change in this code, as I don't know anything about Perl; I jacked this code from a script that someone else made, changed it around a little, and have used it for various other things. Changing it does not help with my current goal of getting this working as soon as I can. The new script I am writing is not for weather; that is just a sample of the code base that I use. There is no RSS feed available for the page I am parsing.

If there is no way to do what I'm wanting to do, just say so, so that I can move along with friggen 20 separate scripts that pipe to a file, which then gets read to complete the primary goal. However, that is not my preferred method, as it is sloppy and rigged.

[edited by: Diceman at 10:11 pm (utc) on Oct. 16, 2006]

perl_diver

10:26 pm on Oct 16, 2006 (gmt 0)

10+ Year Member



>> any help is greatly appreciated. thanks!

Evidently not.

Diceman

2:33 am on Oct 17, 2006 (gmt 0)

10+ Year Member



Help related to the problem I asked about, not streamlining a script that I didn't write.

perl_diver

4:02 pm on Oct 17, 2006 (gmt 0)

10+ Year Member



The approach I suggested could be used regardless of the document you are parsing. The only order that is important is the two related bits of data, like:

Temperature 53.0

As long as those two bits of data are in sequence on the page, the order in which you parse the page will not matter, since each regexp searches through the entire document/variable until it finds the first correct match.
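
A small sketch of that point, using a made-up string in place of the real page: the items appear in a different order from the regexes, and each value is still found because a match without /g always starts at the beginning of the string. The sample text and its units are assumptions for illustration only.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical tag-stripped text; items are deliberately out of order
# relative to the regexes below.
my $content = 'Humidity 40.5 % Pressure 30.03 in Temperature 53.4 F';

my %data;
# Each label and its number are matched together in a single regex.
if ($content =~ /Temperature.+?(-?\d+\.\d+)/) { $data{Temp}     = $1; }
if ($content =~ /Humidity.+?(\d+\.\d+)/)      { $data{Humidity} = $1; }
if ($content =~ /Pressure.+?(\d+\.\d+)/)      { $data{Pressure} = $1; }

print join(' ', map { "$_:$data{$_}" } sort keys %data), "\n";
# prints: Humidity:40.5 Pressure:30.03 Temp:53.4
```

The only ordering that matters is within each pair: the number has to come after its own label.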

Diceman

8:19 pm on Oct 17, 2006 (gmt 0)

10+ Year Member



I did copy your script to my server and ran it, but it gave me a blank result, like it didn't find anything.

are you saying the numbers would have to be right next to the text with no other tags in between, because on most pages, this is not the case.

perl_diver

11:19 pm on Oct 17, 2006 (gmt 0)

10+ Year Member



My code was written specifically for the URL you posted, the page the data was being parsed out of, so it's doubtful it would work as-is for other data.


>> are you saying the numbers would have to be right next to the text with no other tags in between, because on most pages, this is not the case.

No. The code I posted removes all the html tags. Or at least it tries to. You are hopefully left with one string. That one string is parsed for the matching patterns. There could still be something left between the two related bits of data though. Spaces or just text for example. That has to be taken into account. Try this:



#!/usr/bin/perl
use strict;
use warnings;
use LWP::Simple;

my $httpaddr = "http://www.your-url-here.com";
my %data;
my %trash;
my $content = get($httpaddr) or die "Couldn't get it!";

$content =~ s/<(?:[^>'"]*|(['"]).*?\1)*>//gs; # removes html tags
$content =~ s/\s+/ /gs; # collapses runs of whitespace to one space
$content =~ s/&#?[a-zA-Z0-9]{3,6};//gs; # removes HTML entities

print $content;


and see what $content looks like. Note that this forum changes the pipe character '|' (the character above the backslash '\' on the keyboard) to the split pipe '¦' when code is posted. If you see '¦' in any code here, you will need to replace it with a real pipe.
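
To see what the three substitutions leave behind without fetching a page, they can be run on a small fragment. This is only an illustration: the HTML fragment below is invented and is not taken from the weather page.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical HTML fragment with tags, a newline and entities
# between the label and its value.
my $content = qq{<td class="x"><b>Temperature</b></td>\n<td>&nbsp;<b>53.4</b>&deg;F</td>};

$content =~ s/<(?:[^>'"]*|(['"]).*?\1)*>//gs;    # remove html tags
$content =~ s/\s+/ /gs;                          # collapse whitespace runs
$content =~ s/&#?[a-zA-Z0-9]{3,6};//gs;          # remove entities like &nbsp;

print "$content\n";
# prints: Temperature 53.4F

# The label and value are now separated only by a space.
my ($temp) = $content =~ /Temperature.+?(-?\d+\.\d+)/;
print "Temp:$temp\n";
# prints: Temp:53.4
```

Printing the stripped string first, as suggested above, shows what text actually sits between each label and its number before any patterns are written against it.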

pinterface

12:28 am on Oct 18, 2006 (gmt 0)

10+ Year Member



In scalar context, /g [perldoc.perl.org] means to start searching where the previous successful search on that string left off. Using it like you have means "find these things in this order". If you want to find things in any order, either remove the '/g's and match each label together with its value
if ($content =~ m!<b>Temperature</b>.*?<b>(-?\d+\.\d+)</b>!s) { $data{Temp} = $1; }

or use pos [perldoc.perl.org] to reset the search starting position in between each grouping:
pos($content) = undef;
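
A minimal sketch of the pos approach on a made-up string, with Humidity placed before Temperature so that the search order and the page order disagree. The sample markup and the nesting of the value match inside the label match are assumptions for illustration, not part of the original script.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical page text: Humidity comes before Temperature.
my $content = '<b>Humidity</b> <b>40.5</b> <b>Temperature</b> <b>53.4</b>';
my %data;

# The label match with /g leaves pos($content) just after the label,
# so the value match with /g finds the next number after that point.
if ($content =~ /<b>Temperature<\/b>/g) {
    if ($content =~ /<b>(-?\d+\.\d+)<\/b>/g) { $data{Temp} = $1; }
}

# Without this reset, the Humidity search would start after 53.4 and fail.
pos($content) = undef;

if ($content =~ /<b>Humidity<\/b>/g) {
    if ($content =~ /<b>(\d+\.\d+)<\/b>/g) { $data{Humidity} = $1; }
}

print "Temp:$data{Temp} Humidity:$data{Humidity}\n";
# prints: Temp:53.4 Humidity:40.5
```

Nesting the value match inside the label match is a design choice here: a failed /g match resets the position to the start of the string, so an un-nested value match would pick up the first number on the page whenever its label was not found.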

Diceman

1:06 am on Oct 18, 2006 (gmt 0)

10+ Year Member



I tried removing the /g from the second line in each search and it gave me an error. I put the /g back in and used pos($content) = undef; and it works beautifully.

Again, thanks very much, everyone, for the suggestions.

perl_diver

5:09 pm on Oct 18, 2006 (gmt 0)

10+ Year Member



It's going to work until the order changes, which I thought was the problem you wanted to avoid.