Spider using IO::Socket, HTTP/1.1

How do I get to see all the headers?

         

Damian

5:25 pm on Mar 28, 2003 (gmt 0)

10+ Year Member



After trying for a few hours I gave up on getting utf-8 encoded pages to appear properly in HTML using LWP, so I'm trying to switch to IO::Socket, which does return the content in a usable format for the conversion to HTML numeric entities.

That part is solved, but now I can't get to the headers I'm used to anymore :(

I get to see some header data such as Expires, Date, Server and Content-Type and a few other things,
but not for example Client-Date, Client-Peer, X-Meta-MSSmartTagsPreventParsing, X-Meta-Description, and X-Meta-Robots that I do get using LWP to retrieve the same page.

This is all a bit over my head, can someone explain in simple terms? Is there something missing in my script, am I really seeing all the headers served to my script, and/or is this inherent to using IO::Socket::INET?

I also didn't find a way of splitting the headers I do get from the content except by using homemade regular expressions. Is there a better way, something similar to LWP's
$response->headers_as_string?

Fischerlaender

11:47 am on Mar 29, 2003 (gmt 0)

10+ Year Member



Does the robot you created with IO::Socket send the same request as LWP does? If not, chances are that the web servers are returning a different set of headers. The missing headers you mentioned (such as Client-Date) are generated on the client side: LWP adds them to the response itself, so the server never sends them and a raw socket will not show them.
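The X-Meta-* headers are a case of this client-side generation: LWP builds them from the <meta> tags in the page's <head>, so they can be recreated from the content itself. Below is a minimal sketch of that idea using plain regular expressions. The helper names meta_headers and attr and the sample HTML are made up for illustration, and a real HTML parser would be more robust than these patterns.

```perl
use strict;
use warnings;

# Hypothetical helper: return the value of one attribute from the
# inside of a tag, accepting double or single quotes.
sub attr {
    my ($attrs, $key) = @_;
    return $attrs =~ /\b\Q$key\E\s*=\s*(?:"([^"]*)"|'([^']*)')/i
        ? (defined $1 ? $1 : $2)
        : undef;
}

# Hypothetical helper: build X-Meta-* style headers from <meta name=... content=...>.
sub meta_headers {
    my ($html) = @_;

    # Only look inside <head> when there is one.
    my ($head) = $html =~ m{<head\b[^>]*>(.*?)</head>}is;
    $head = $html unless defined $head;

    my %meta;
    while ($head =~ m{<meta\b([^>]*)>}gis) {
        my $attrs   = $1;
        my $name    = attr($attrs, 'name');
        my $content = attr($attrs, 'content');
        next unless defined $name && defined $content;
        $meta{ 'X-Meta-' . ucfirst $name } = $content;
    }
    return \%meta;
}

# Sample page, invented for this example.
my $html = '<html><head><title>t</title>'
         . '<meta name="description" content="A test page">'
         . "<meta name='robots' content=\"noindex, nofollow\" />"
         . '</head><body></body></html>';

my $meta = meta_headers($html);
print "$_: $meta->{$_}\n" for sort keys %$meta;
```

Run against the body you read from the socket, this gives roughly the same information LWP showed as X-Meta-Description and X-Meta-Robots.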

littleman

12:45 pm on Mar 29, 2003 (gmt 0)



Using IO::Socket, you are responsible for generating your own headers. It isn't all that hard. Here are some snippets from a script I wrote; they should give you an idea of how to format the headers.

use IO::Socket::INET;

my $EOL = "\015\012";    ## a CR LF pair for the server

## $host, $port and $document are set elsewhere in the script
my $remote = IO::Socket::INET->new(
    Proto    => "tcp",
    Timeout  => 5,
    PeerAddr => $host,
    PeerPort => "http($port)",
);

my $content;
if ($remote) {
    $remote->autoflush(1);
    print ">>flushed>>";
    print $remote "GET $document HTTP/1.1" . $EOL;
    print $remote "Host: $host" . $EOL;
    print $remote
        'User-Agent: Mozilla/5.0 Phase One/Bookmark Crawler ( http://collectivemind.sourceforge.net/phase-one.html )'
        . $EOL;
    print $remote
        'Accept: text/xml,application/xml,application/xhtml+xml,text/html,text/plain'
        . $EOL;
    print $remote "Connection: close" . $EOL;
    print $remote "Referer: Bookmark Crawler" . $EOL;
    print $remote $EOL;    ## a single empty line ends the request headers
    print " asked for $document..";

    ## read the whole response: status line, headers and body
    while (<$remote>) {
        $content .= $_;
    }

    close $remote;
    print "closed connection";
}

I also didn't find a way of splitting the headers I do get from the content except by using
homemade regular expressions. Is there a better way, something similar to LWP's
$response->headers_as_string?

Not as far as I know. I've had to rip the code out the old-fashioned way.
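For what it's worth, the "old-fashioned way" can be fairly short, because the headers always end at the first empty line of the response. Here is a minimal sketch using only core Perl. The helper name split_response and the sample response are invented for the example, and it does not handle folded (multi-line) headers or chunked bodies.

```perl
use strict;
use warnings;

# Hypothetical helper: split a raw HTTP response into the status line,
# a hash of headers (lower-cased names) and the body.
sub split_response {
    my ($raw) = @_;

    # Headers end at the first empty line (CRLF CRLF; bare LF is tolerated).
    my ($head, $body) = split /\015?\012\015?\012/, $raw, 2;
    $head = '' unless defined $head;
    $body = '' unless defined $body;

    my @lines  = split /\015?\012/, $head;
    my $status = shift @lines;

    my %headers;
    for my $line (@lines) {
        next unless $line =~ /^([^:\s]+):\s*(.*)$/;
        my ($name, $value) = (lc $1, $2);

        # A header that appears more than once is joined with a comma.
        $headers{$name} = exists $headers{$name}
            ? "$headers{$name}, $value"
            : $value;
    }
    return ($status, \%headers, $body);
}

# Sample response, made up for this example.
my $raw = "HTTP/1.1 200 OK\015\012"
        . "Content-Type: text/html\015\012"
        . "Server: Example\015\012"
        . "\015\012"
        . "<html>hello</html>";

my ($status, $headers, $body) = split_response($raw);
print "$status\n";
print "$_: $headers->{$_}\n" for sort keys %$headers;
```

In the script above you would pass $content to split_response once the read loop has finished. If the libwww-perl modules are installed anyway, HTTP::Response->parse should accept the same raw string and give you headers_as_string back, but I haven't shown that here.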

Damian

8:55 am on Mar 31, 2003 (gmt 0)

10+ Year Member



Thanks guys.

Does the robot you created with IO::Socket send the same request as LWP does?

Eh..I guess not, I didn't realise LWP was actually configuring/shaping my request to be readable by the server. I will have to brush up my knowledge about this header stuff.

Hehe, Littleman..I have to admit I had already found your cool example and used it as a basis..

Now I'm trying to figure out how to retrieve, for example, the IP address of the machine hosting the page I spider in the same request (Client-Peer), but I couldn't find any syntax examples.
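A possible approach for Client-Peer: the socket object already knows which address it is connected to, so the value can be read from it with peerhost and peerport instead of from the response. The sketch below is only an illustration. To stay self-contained it connects to a throwaway listener on 127.0.0.1 that stands in for the web server; in the spider you would call the same two methods on the $remote socket.

```perl
use strict;
use warnings;
use IO::Socket::INET;

# Stand-in for a web server: a listener on a free local port
# (an assumption made only so the example runs without network access).
my $server = IO::Socket::INET->new(
    Listen    => 1,
    LocalAddr => '127.0.0.1',
    LocalPort => 0,
    Proto     => 'tcp',
) or die "cannot listen: $!";

# The client socket, opened the same way as in the spider.
my $remote = IO::Socket::INET->new(
    Proto    => 'tcp',
    PeerAddr => '127.0.0.1',
    PeerPort => $server->sockport,
    Timeout  => 5,
) or die "cannot connect: $!";

# peerhost gives the remote IP address, peerport the remote port.
my $client_peer = $remote->peerhost . ':' . $remote->peerport;
print "Client-Peer: $client_peer\n";

close $remote;
close $server;
```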

I read somewhere that seeing the headers as part of the content when you retrieve a page may mean that the HTTP version was not properly recognised, and that the server returns the page as HTTP/0.9 instead of HTTP/1.0 or HTTP/1.1.
Could this be happening here, and how do I prove I actually got the page using HTTP/1.1?

Fischerlaender

9:35 am on Mar 31, 2003 (gmt 0)

10+ Year Member



how do I prove I actually got the page using HTTP 1.1?

Your request is something similar to this:

GET /thepage/iwant.htm HTTP/1.1 
Host: www.domain.com

The Host header is defined as part of HTTP/1.1, but all web servers today recognize this header even if they can't understand all of the 1.1 definitions. This is required for name-based virtual hosting.

The response to this request should be something like this:

HTTP/1.0 200 OK 
Content-Type: text/html

This tells you that the request was answered properly, but that the server just "speaks" 1.0. In other words: You requested with 1.1, the server answers with 1.0, because it does not recognize all of the details of the 1.1 protocol version.

My experience showed me that it is best to use HTTP/1.0 in combination with the Host header.
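To make the version check concrete, here is a small sketch of both halves: building an HTTP/1.0 request that still carries a Host header, and reading the protocol version out of the first line of the reply. The helper names build_request and parse_status_line are invented for the example. A reply with no status line at all is treated as an HTTP/0.9 style response, which is an assumption about how you would want to handle that case.

```perl
use strict;
use warnings;

my $EOL = "\015\012";

# Hypothetical helper: an HTTP/1.0 request with a Host header.
sub build_request {
    my ($host, $path) = @_;

    # The two empty strings produce the blank line that ends the headers.
    return join $EOL, "GET $path HTTP/1.0", "Host: $host", '', '';
}

# Hypothetical helper: return (version, status code, reason) from the
# first line of a raw response, or an empty list if there is no status line.
sub parse_status_line {
    my ($raw) = @_;
    my ($line) = split /\015?\012/, $raw, 2;
    return unless defined $line
        && $line =~ m{^HTTP/(\d+\.\d+)[ ]+(\d{3})(?:[ ]+(.*?))?\s*$};
    return ($1, $2, defined $3 ? $3 : '');
}

my $request = build_request('www.domain.com', '/thepage/iwant.htm');

# Sample reply, modelled on the one shown in the post above.
my $reply = "HTTP/1.0 200 OK" . $EOL
          . "Content-Type: text/html" . $EOL
          . $EOL
          . "<html></html>";

my ($version, $code, $reason) = parse_status_line($reply);
if (defined $version) {
    print "server answered with HTTP/$version, status $code $reason\n";
}
else {
    print "no status line: looks like an HTTP/0.9 style response\n";
}
```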