Forgotten my regex basics

Robber

1:18 pm on Oct 22, 2002 (gmt 0)

OK guys, I've been trying to build a nice simple regex today. I'm pulling an XML feed with PHP and I want to process it with an XSL stylesheet, no problem.

But the result returned from the HTTP request contains the header information above the actual XML doc. (Anyone know how to avoid this?) E.g.

HTTP/1.1 200 OK
Date: Tue, 22 Oct 2002 12:53:07 GMT
Server: blah blah blah
blah
Connection: close
Content-Type: text/xml

<?xml version="1.0" encoding="UTF-8"?>

Since I didn't know of a way to stop this happening, I thought I'd strip the header info using a regex - find all characters before the XML declaration and substitute them with nothing, i.e. delete all the header stuff.

Sounds simple but it's starting to annoy me. I'm sure I must have forgotten something really fundamental here.

Any ideas?

dingman

2:02 pm on Oct 22, 2002 (gmt 0)

s/(.*)(<?xml.*)/$2/

would be my first shot.

Robber

2:27 pm on Oct 22, 2002 (gmt 0)

Hi Dingman,

Thanks for helping. I'm in PHP at the moment, so I had tried things like:

$search = "/(.*)<\?xml/";
$replace= "<?xml";
$result = preg_replace ($search, $replace, $result);

Just running your suggestion at the mo with some slight tweaks - it managed to remove the whole xml declaration!

Cheers

[edited by: Robber at 2:40 pm (utc) on Oct. 22, 2002]

Robber

2:37 pm on Oct 22, 2002 (gmt 0)

Taking another look at your suggestion, would I be right in thinking that this should find zero or more of any character for the first group, then the start of the XML declaration followed by zero or more of any character? I guess this would make it look through the whole of the XML before ending the search, since it has to reach the very last character. It would be nice if it bailed out after finding <?xml and looked no further.

I would think something like the code below would do this - but it certainly isn't!

$search = "/(.*)<\?xml/";
$replace= "<?xml";
$result = preg_replace ($search, $replace, $result);

I have escaped the ? since otherwise it would be interpreted as a quantifier rather than a literal character. Surely this has a simple answer!

andreasfriedrich

3:07 pm on Oct 22, 2002 (gmt 0)

$search = "/(.*?)(?=<\?xml)/";  
$replace= "";  
$result = preg_replace ($search, $replace, $result);

I would think that the first match should be done non-greedily. You want to catch the first occurrence of the XML declaration. (I do know that there should be only one, but who knows.) The zero-width positive lookahead assertion '(?=<\?xml)' matches the '<?xml' but does not include it in the text to be replaced.

I didn't test this but hope it helps. Otherwise you might want to check for the first occurrence of two or more newlines, since that is what constitutes the boundary between the header and the content of an HTTP message.

Andreas

Robber

3:59 pm on Oct 22, 2002 (gmt 0)

Thanks Andreas,

I tried your first suggestion but unfortunately no go there. I was interested in the zero-width positive lookahead assertion '(?=<\?xml)', though, and I'll be taking a closer look at that at some point.

But you did set me on the right track suggesting looking for the 2 newlines. Tried \n\n and no joy there, but \r\n did the trick. Thanks very much to you both.

Cheers

Robber

4:00 pm on Oct 22, 2002 (gmt 0)

In case anyone else will find it useful, this is the code that worked:

$search = "/(.*\r\n)/";
$replace= "";
$result = preg_replace ($search, $replace, $result);

andreasfriedrich

4:38 pm on Oct 22, 2002 (gmt 0)

An HTTP message is defined as follows:

generic-message = start-line
                  *(message-header CRLF)
                  CRLF
                  [ message-body ]
start-line      = Request-Line | Status-Line

4.1 Message Types - RFC 2616 [ftp.isi.edu]

So you would need to match (\r\n){2}. A single \r\n will match any end of line, not just the blank one (HTTP/1.1 defines the sequence CR LF as the end-of-line marker for all protocol elements - 2.2 Basic Rules - RFC 2616 [ftp.isi.edu]). Since not all UAs and servers use CRLF you probably should match (\r\n){2,}|\n{2,}.
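As a rough illustration of splitting at that boundary, the sketch below cuts a raw response at the first blank line, accepting either CRLF or bare LF line endings. The function name split_http_response is made up for this example, the sample response is invented, and the syntax assumes PHP 7.1 or later. It matches exactly two line endings rather than "two or more", so blank lines at the start of the body are left alone.

```php
<?php
// Hypothetical helper (not from the thread): split a raw HTTP response
// into [headers, body] at the first blank line.
function split_http_response(string $raw): array
{
    // A limit of 2 means only the first blank line acts as the boundary.
    $parts = preg_split('/\r\n\r\n|\n\n/', $raw, 2);
    if ($parts === false || count($parts) < 2) {
        // No blank line found: treat everything as headers, with an empty body.
        return [$raw, ''];
    }
    return $parts;
}

// Invented sample response, for illustration only.
$raw = "HTTP/1.1 200 OK\r\nContent-Type: text/xml\r\n\r\n<?xml version=\"1.0\"?>\r\n<feed/>\r\n";
[$headers, $body] = split_http_response($raw);
echo $body;
```

The body keeps its original line endings; only the header block and the blank separator line are removed.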

Andreas

Robber

9:34 pm on Oct 22, 2002 (gmt 0)

Good point, I'll give that a whirl in the morning. Something else that has just struck me - I seem to remember that .* won't match past line ends unless you use the correct modifier (can't remember what it is off the top of my head though - I think it might be s), is that right?

This would explain why the original match wasn't working.

dingman

9:46 pm on Oct 22, 2002 (gmt 0)

Yup, s would be right. Along with m to make $ match newlines.

andreasfriedrich

9:52 pm on Oct 22, 2002 (gmt 0)

With /s . matches newline, with /m ^ and $ match not only at the beginning and end of the string but also next to a newline.
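A small sketch of the difference, using preg_match on an invented two-line string (the variable names are only for illustration):

```php
<?php
$text = "first line\nsecond line";

// Without /s the dot stops at the newline, so .* cannot span both lines.
$dotPlain = preg_match('/first.*second/', $text);
// With /s the dot also matches the newline.
$dotAll = preg_match('/first.*second/s', $text);

// Without /m, ^ matches only at the very start of the string.
$caretPlain = preg_match('/^second/', $text);
// With /m, ^ also matches immediately after a newline.
$caretMulti = preg_match('/^second/m', $text);

var_dump($dotPlain, $dotAll, $caretPlain, $caretMulti);
// prints int(0), int(1), int(0), int(1)
```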

Andreas

amoore

10:51 pm on Oct 22, 2002 (gmt 0)

I don't see any need to use a regular expression. Your headers and body are separated by an empty line. This snippet is written for clarity:

my $inheader = 1;  # true while we are still reading the headers
my $xml = '';      # accumulates the XML document
while (<>) {
    s/\r?\n\z//;   # strip the line ending (LF or CRLF)
    if ($inheader) {
        # the first empty line marks the end of the headers
        $inheader = 0 if $_ eq '';
    } else {
        $xml .= "$_\n";  # append this body line to the XML
    }
}

andreasfriedrich

11:17 pm on Oct 22, 2002 (gmt 0)

This approach will only work if you read the file line by line. But since there is an HTTP header, I'd suspect that this is not the case, but rather that the content arrives in arbitrarily sized chunks just as the server sends them.

Perhaps Robber can let us know how the RSS feed is retrieved.

Andreas

dingman

12:18 am on Oct 23, 2002 (gmt 0)

Andreas, did you see my braino in the two minutes it took me to correct it? ;)

Robber

8:32 am on Oct 23, 2002 (gmt 0)

As things stand at the moment I am sending an HTTP request, which responds with an XML doc. I am taking this response in chunks and storing it in a variable. It is this variable that I then run the regex on, once the whole of the XML is stored in it. So I guess the options are open, although I figured a regex might be better performance-wise, but that's little more than a gut feeling. Are there any guidelines on performance issues?
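For comparison, here is a regex-free sketch that looks for the literal '<?xml' with strpos and keeps everything from there on. The function name body_from_xml_declaration is invented for this example, and nothing in the thread measures whether this is faster than the regex, so treat any performance difference as something to benchmark rather than assume.

```php
<?php
// Hypothetical helper (not from the thread): return the response from the
// XML declaration onwards, or the input unchanged if no declaration exists.
function body_from_xml_declaration(string $response): string
{
    $pos = strpos($response, '<?xml');
    // strpos returns false when the marker is absent, so compare with ===
    // (a match at offset 0 must not be mistaken for "not found").
    if ($pos === false) {
        return $response;
    }
    return substr($response, $pos);
}

// Invented sample response, for illustration only.
$response = "HTTP/1.1 200 OK\r\nConnection: close\r\n\r\n<?xml version=\"1.0\" encoding=\"UTF-8\"?>\r\n<feed/>";
$xml = body_from_xml_declaration($response);
echo $xml;
```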

Now just trying to work with using variables in the xslt_process() function - it prefers to have the xml in files - I'll report my solution as there seems to be little on it.

<edit>
Strange, regarding xslt_process(): yesterday it didn't work, now it does - straight out of the PHP manual!

[edited by: Robber at 8:46 am (utc) on Oct. 23, 2002]

Robber

8:37 am on Oct 23, 2002 (gmt 0)

For the benefit of anyone else looking in, adapting Andreas's first solution to use the s modifier makes it work great:

$search = "/(.*?)(?=<\?xml)/s";

I think I will go with this solution, as to me it seems to have the least ambiguity about it, and I like the ?= lookahead that was introduced earlier in the thread.
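A self-contained sketch of that approach follows. Two details are additions for this example and not part of the thread: the pattern is anchored with \A and the replacement count is limited to 1, so the engine makes a single attempt from the start of the string instead of scanning the rest of the document for further matches. The function name strip_before_xml_declaration and the sample response are also invented.

```php
<?php
// Hypothetical helper (not from the thread): drop everything that precedes
// the XML declaration.
function strip_before_xml_declaration(string $response): string
{
    // \A anchors at the start of the string; .*? is lazy, so it stops at the
    // first declaration; /s lets the dot cross newlines; the lookahead
    // (?=<\?xml) requires the declaration to follow without consuming it.
    $result = preg_replace('/\A.*?(?=<\?xml)/s', '', $response, 1);
    // preg_replace returns null on error; fall back to the input then.
    return $result ?? $response;
}

// Invented sample response, for illustration only.
$response = "HTTP/1.1 200 OK\r\nContent-Type: text/xml\r\n\r\n<?xml version=\"1.0\" encoding=\"UTF-8\"?>\r\n<feed/>";
$stripped = strip_before_xml_declaration($response);
echo $stripped;
```

If the response contains no XML declaration at all, the pattern does not match and the input comes back unchanged.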

amoore

4:10 pm on Oct 23, 2002 (gmt 0)

This approach will only work if you read the file line by line. But since there is an HTTP header, I'd suspect that this is not the case, but rather that the content arrives in arbitrarily sized chunks just as the server sends them.

You can always read it in line by line. That just means dealing with the data up to the next newline (or $/ character) each time. It will be buffered on its way in.

Come to think of it, if you think that you read it in in the same size chunks as it is arriving, how do you know what size chunks it's arriving in and at what time? I don't think you can. That stuff is handled by the operating system and doesn't need to be dealt with by an application like this.

andreasfriedrich

4:36 pm on Oct 23, 2002 (gmt 0)

if you think that you read it in in the same size chunks

I did not write anything about same-size chunks. Quite the opposite: I wrote arbitrarily sized, meaning the size of the chunks is beyond the script's control.

You can always read it in line by line. That just means deal with the data up to the next newline

You sure can. But as you write yourself, you would need to collect the chunks as they arrive until there is a newline in it, work with that text and then go on and collect chunks again.

IMO in a streaming situation like the one at hand the RE approach is more suitable. If you get your data record by record then I would use the looping approach as suggested by you.
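To make the streaming case concrete, here is one possible sketch that accepts chunks of any size, searches the accumulated data for the blank line (which may be split across two chunks), and collects only the body from then on. The class name HeaderStrippingBuffer and its methods are invented for this example, the chunks are made up, and the typed properties assume PHP 7.4 or later.

```php
<?php
// Hypothetical class (not from the thread): feed it chunks as they arrive;
// it discards the header block and keeps only the message body.
final class HeaderStrippingBuffer
{
    private string $pending = '';   // data received before the boundary is found
    private string $body = '';      // body data collected so far
    private bool $inHeaders = true; // true until the blank line has been seen

    public function feed(string $chunk): void
    {
        if (!$this->inHeaders) {
            $this->body .= $chunk;
            return;
        }
        // The blank line may straddle two chunks, so search everything
        // received so far rather than the new chunk alone.
        $this->pending .= $chunk;
        $parts = preg_split('/\r\n\r\n|\n\n/', $this->pending, 2);
        if ($parts !== false && count($parts) === 2) {
            $this->inHeaders = false;
            $this->body = $parts[1];
            $this->pending = '';
        }
    }

    public function body(): string
    {
        return $this->body;
    }
}

// Invented chunks: the header boundary is deliberately split between them.
$buffer = new HeaderStrippingBuffer();
foreach (["HTTP/1.1 200 OK\r\nContent-Type: text/xml\r\n\r", "\n<?xml ver", "sion=\"1.0\"?>"] as $chunk) {
    $buffer->feed($chunk);
}
echo $buffer->body();
```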

Andreas