But the result returned from the HTTP request contains the header information above the actual XML doc. (Anyone know how to avoid this?) E.g.:
HTTP/1.1 200 OK
Date: Tue, 22 Oct 2002 12:53:07 GMT
Server: blah blah blah
blah
Connection: close
Content-Type: text/xml
<?xml version="1.0" encoding="UTF-8"?>
Since I didn't know of a way to stop this happening, I thought I'd strip the header info using a regex: find all characters before the XML declaration and substitute them with nothing, i.e. delete all the header stuff.
Sounds simple but it's starting to annoy me. I'm sure I must have missed something really fundamental here.
Any ideas?
Thanks for helping. I'm in PHP at the moment, so I had tried things like:
$search = "/(.*)<\?xml/";
$replace= "<?xml";
$result = preg_replace ($search, $replace, $result);
Just running your suggestion at the mo with some slight tweaks - it managed to remove the whole xml declaration!
Cheers
[edited by: Robber at 2:40 pm (utc) on Oct. 22, 2002]
I would think something like the below would do this, but it certainly isn't!
$search = "/(.*)<\?xml/";
$replace= "<?xml";
$result = preg_replace ($search, $replace, $result);
I have escaped the ? because otherwise it would be interpreted as a quantifier. Surely this has a simple answer!
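One likely reason the pattern above leaves the headers in place is that, by default, "." in PCRE does not match a newline, so (.*) can never reach back across the header lines. A minimal sketch of that behaviour, assuming a made-up sample response (the $response contents are illustrative, not taken from the real server):

```php
<?php
// Hypothetical sample response, for illustration only.
$response = "HTTP/1.1 200 OK\r\nContent-Type: text/xml\r\n\r\n"
          . "<?xml version=\"1.0\"?>\r\n<doc/>";

// Same pattern as in the post. Without a modifier, "." stops at "\n",
// so (.*) can only match characters on the line that holds the declaration.
$stripped = preg_replace('/(.*)<\?xml/', '<?xml', $response);

// The only match is the declaration itself with an empty (.*),
// which is replaced by an identical string.
var_dump($stripped === $response);
```

If the comparison comes out true, the regex did run, but it only replaced the declaration with itself.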
$search = "/(.*?)(?=<\?xml)/";
$replace= "";
$result = preg_replace ($search, $replace, $result);
I would think that the first match should be done non-greedily. You want to catch the first occurrence of the XML declaration. (I do know that there should be only one, but who knows.) The zero-width positive lookahead assertion '(?=<\?xml)' matches the '<?xml' but does not include it with the things to be replaced.
I didn't test this but hope it helps. Otherwise you might want to check for the first occurrence of two or more newlines, since that is what constitutes the boundary between the header and the content of an HTTP message.
Andreas
I tried your first suggestion but unfortunately no go there. I was interested in the zero-width positive lookahead assertion '(?=<\?xml)' though, and I'll be taking a closer look at that at some point.
But you did set me on the right track suggesting looking for the 2 newlines. Tried \n\n and no joy there, but \r\n did the trick. Thanks very much to you both.
Cheers
generic-message = start-line
                  *(message-header CRLF)
                  CRLF
                  [ message-body ]
start-line      = Request-Line | Status-Line
4.1 Message Types - RFC 2616 [ftp.isi.edu]
So you would need to match (\r\n){2}. A single \r\n will match just any end of line (HTTP/1.1 defines the sequence CR LF as the end-of-line marker for all protocol elements - 2.2 Basic Rules - RFC 2616 [ftp.isi.edu]). Since not all UAs and servers use CRLF, you probably should match (\r\n){2,}|\n{2,}.
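As a rough sketch of how that pattern could be used in PHP: split the raw response once, at the first blank line, and keep the second half. The function name split_http_response is made up for this example, and the type hints assume PHP 7 or later.

```php
<?php
// Hypothetical helper: splits a raw HTTP response at the first blank line.
function split_http_response(string $raw): array
{
    // A limit of 2 means we only split at the first header/body boundary,
    // so blank lines inside the body are left alone.
    $parts = preg_split('/(?:\r\n){2,}|\n{2,}/', $raw, 2);
    if (count($parts) < 2) {
        return [$raw, '']; // no blank line found: treat everything as headers
    }
    return $parts;
}

// Illustrative input only.
$raw = "HTTP/1.1 200 OK\r\nContent-Type: text/xml\r\n\r\n<?xml version=\"1.0\"?><doc/>";
[$headers, $body] = split_http_response($raw);
echo $body;
```

Splitting rather than replacing also keeps the headers around in case the status line or Content-Type is needed later.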
Andreas
This would explain why the original match wasn't working.
my $inheader = 1; # true while we are still reading the headers
my $xml = ''; # this is my xml document.
while (<>) {
    s/\r?\n\z//; # pull off the line ending (LF or CRLF).
    if ( $inheader ) {
        $inheader = 0 if $_ eq ''; # I just saw the blank line that ends the headers
    } else {
        $xml .= "$_\n"; # append this line onto the rest of the XML.
    }
}
Now just trying to work with using variables in the xslt_process() function - it prefers to have the xml in files - I'll report my solution as there seems to be little on it.
<edit>
Strange, regarding xslt_process(): yesterday it didn't work, now it does, straight out of the PHP manual!
[edited by: Robber at 8:46 am (utc) on Oct. 23, 2002]
$search = "/(.*?)(?=<\?xml)/s";
I think I will go with this solution, as to me it seems to have the least amount of ambiguity about it and I like the ?= that was introduced earlier in the thread.
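For reference, a small sketch of that pattern applied to a complete response. The sample $result is invented for the example, and the limit of 1 passed to preg_replace is an addition here (not in the original post) so that only the first match is replaced.

```php
<?php
// Hypothetical sample response, for illustration only.
$result = "HTTP/1.1 200 OK\r\nConnection: close\r\nContent-Type: text/xml\r\n\r\n"
        . "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<doc/>";

// The /s modifier lets "." cross line breaks; the lookahead checks for
// the declaration without consuming it, so it survives the replacement.
$search  = '/(.*?)(?=<\?xml)/s';
$replace = '';
$result  = preg_replace($search, $replace, $result, 1);

echo $result;
```

Note that this relies on the body containing an XML declaration; a response without one would come back unchanged.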
This approach will only work if you read the file line by line. But since there is an HTTP header, I'd suspect that this is not the case, but rather that the content arrives in arbitrarily sized chunks just as the server sends them.
Come to think of it, if you think that you read it in in the same size chunks as it is arriving, how do you know what size chunks it's arriving in and at what time? I don't think you can. That stuff is handled by the operating system and doesn't need to be dealt with by an application like this.
if you think that you read it in in the same size chunks
I did not write anything about same size chunks. Quite the opposite: I wrote arbitrarily sized [webmasterworld.com], meaning the size of the chunks is beyond the script's control.
You can always read it in line by line. That just means deal with the data up to the next newline
You sure can. But as you write yourself, you would need to collect the chunks as they arrive until there is a newline in it, work with that text and then go on and collect chunks again.
IMO in a streaming situation like the one at hand the RE approach is more suitable. If you get your data record by record then I would use the looping approach as suggested by you.
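To make the chunk discussion concrete, here is one possible sketch of handling data that arrives in arbitrary pieces: buffer the chunks until the blank line has been seen, then pass everything after it through. The function name body_from_chunks and the sample chunks are made up for the example, and it assumes PHP 7 or later.

```php
<?php
// Hypothetical helper: feed it chunks in arrival order, get the body back.
function body_from_chunks(iterable $chunks): string
{
    $buffer = '';
    $body = null; // stays null while we are still inside the headers

    foreach ($chunks as $chunk) {
        if ($body !== null) {
            $body .= $chunk; // headers already stripped: pass data through
            continue;
        }
        $buffer .= $chunk;
        // Search the whole buffer, since the blank line may straddle two chunks.
        if (preg_match('/\r\n\r\n|\n\n/', $buffer, $m, PREG_OFFSET_CAPTURE)) {
            $body = (string) substr($buffer, $m[0][1] + strlen($m[0][0]));
        }
    }
    return $body ?? '';
}

// Illustrative chunks; note the separator is split across the first two.
$chunks = ["HTTP/1.1 200 OK\r\nContent-Type: text/xml\r", "\n\r\n<?xml ver", "sion=\"1.0\"?><doc/>"];
echo body_from_chunks($chunks);
```

The script never needs to know the chunk sizes; it only needs to keep what it has received so far until the boundary shows up.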
Andreas