That part is solved, but now I can't get to the headers I'm used to anymore :(
I get to see some header data such as Expires, Date, Server and Content-Type and a few other things,
but not for example Client-Date, Client-Peer, X-Meta-MSSmartTagsPreventParsing, X-Meta-Description, and X-Meta-Robots that I do get using LWP to retrieve the same page.
This is all a bit over my head, can someone explain in simple terms? Is there something missing in my script, am I really seeing all the headers served to my script, and/or is this inherent to using IO::Socket::INET?
I also didn't find a way of splitting the headers I do get from the content except using homemade regular expressions..is there a better way, something similar to LWP's
$response->headers_as_string?
use IO::Socket::INET;

my $EOL = "\015\012";    ## a CR LF pair for the server
my $remote = IO::Socket::INET->new(
    Proto    => "tcp",
    Timeout  => 5,
    PeerAddr => $host,
    PeerPort => "http($port)",
);
my $content = '';
if ($remote) {
    $remote->autoflush(1);
    print ">>flushed>>";
    print $remote "GET $document HTTP/1.1" . $EOL;
    print $remote "Host: $host" . $EOL;
    print $remote
        'User-Agent: Mozilla/5.0 Phase One/Bookmark Crawler ( http://collectivemind.sourceforge.net/phase-one.html )'
        . $EOL;
    ## media types in Accept are separated by commas
    print $remote
        'Accept: text/xml,application/xml,application/xhtml+xml,text/html,text/plain'
        . $EOL;
    print $remote "Connection: close" . $EOL;
    print $remote "Referer: Bookmark Crawler" . $EOL;
    print $remote $EOL;    ## a single empty line ends the request headers
    print " asked for $document..";
    ## read status line, headers and body until the server closes the connection
    while (<$remote>) {
        $content .= $_;
    }
    close $remote;
    print "closed connection";
}
I also didn't find a way of splitting the headers I do get from the content except using
homemade regular expressions..is there a better way, something similar to LWP's
$response->headers_as_string?
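On splitting headers from content: in a raw HTTP response the header block ends at the first empty line, so one split on that boundary is enough. Below is a minimal sketch, not a full parser. The name split_response and the sample response are made up for illustration, and it does not decode a chunked body, which an HTTP/1.1 server is allowed to send. If libwww-perl is installed anyway, HTTP::Response->parse($content) should hand back an object with the usual headers_as_string method.

```perl
use strict;
use warnings;

# Hypothetical helper: split a raw HTTP response into
# (status line, hashref of headers, body).
# Header names are lowercased; repeated headers are joined with ", ".
sub split_response {
    my ($raw) = @_;

    # The header block ends at the first empty line (CRLF CRLF).
    my ($head, $body) = split /\015?\012\015?\012/, $raw, 2;
    $head = '' unless defined $head;
    $body = '' unless defined $body;

    # Unfold continuation lines (lines starting with space or tab).
    $head =~ s/\015?\012[ \t]+/ /g;

    my ($status, @lines) = split /\015?\012/, $head;
    $status = '' unless defined $status;

    my %headers;
    for my $line (@lines) {
        next unless $line =~ /^([^:\s]+):\s*(.*)$/;
        my ($name, $value) = (lc $1, $2);
        $headers{$name} =
            exists $headers{$name} ? "$headers{$name}, $value" : $value;
    }
    return ($status, \%headers, $body);
}

# Sample data standing in for the $content read from the socket.
my $content = join "\015\012",
    'HTTP/1.1 200 OK',
    'Server: Apache',
    'Content-Type: text/html',
    '',
    '<html>hello</html>';

my ($status, $headers, $body) = split_response($content);
print "$status\n";
print "$_: $headers->{$_}\n" for sort keys %$headers;
```
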
Does the robot you created with IO::Socket send the same request as LWP does?
Eh..I guess not, I didn't realise LWP was actually configuring/shaping my request to be readable by the server. I will have to brush up my knowledge about this header stuff.
Hehe, Littleman..I have to admit I had already found your cool example and used it as a basis..
Now I'm trying to figure out how to retrieve for example the ip address of the machine hosting the page I spider in the same request (Client-Peer) but I couldn't find any syntax examples.
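As far as I know, Client-Date and Client-Peer are never sent by the server: LWP adds them on the client side, and the X-Meta-* headers come from LWP reading the meta tags in the HTML head. So a raw socket will not show them, but the same information is available. The peer address can be read from the socket itself with peerhost and peerport. A small sketch, where client_peer is a made-up helper and the local listener only exists so the example runs without a network:

```perl
use strict;
use warnings;
use IO::Socket::INET;

# Hypothetical helper: "ip:port" of the machine on the other end of a
# connected socket, the kind of value LWP reports as Client-Peer.
sub client_peer {
    my ($sock) = @_;
    return $sock->peerhost . ':' . $sock->peerport;
}

# Throwaway local listener standing in for the web server.
my $listener = IO::Socket::INET->new(
    Listen    => 1,
    LocalAddr => '127.0.0.1',
    LocalPort => 0,          # let the system pick a free port
    Proto     => 'tcp',
) or die "listen failed: $!";

my $remote = IO::Socket::INET->new(
    PeerAddr => '127.0.0.1',
    PeerPort => $listener->sockport,
    Proto    => 'tcp',
) or die "connect failed: $!";

print "Client-Peer: ", client_peer($remote), "\n";
```

In the crawler script, client_peer($remote) would be called on the existing $remote before it is closed.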
I read somewhere that seeing the headers as part of the content when you retrieve a page may mean the HTTP version was not properly recognised, and the server returns the page as HTTP/0.9 instead of HTTP/1.0 or HTTP/1.1.
Could this be happening here and how do I prove I actually got the page using HTTP 1.1?
how do I prove I actually got the page using HTTP 1.1?
Your request is something similar to this:
GET /thepage/iwant.htm HTTP/1.1
Host: www.domain.com
The response to this request should be something like this:
HTTP/1.0 200 OK
Content-Type: text/html
My experience showed me that it is best to use HTTP/1.0 in combination with the Host header.
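To check which version the server actually answered with, look at the first line of what came back. A response that starts with a status line such as HTTP/1.0 or HTTP/1.1 was not served as HTTP/0.9, which has no status line or headers at all. A rough sketch, with response_version as a made-up name and a hard-coded sample in place of the real $content:

```perl
use strict;
use warnings;

# Hypothetical helper: return (version, status code) taken from the
# status line, or an empty list if the data does not start with one.
sub response_version {
    my ($raw) = @_;
    return $raw =~ m{\AHTTP/(\d+\.\d+)[ \t]+(\d{3})\b} ? ($1, $2) : ();
}

# Sample data standing in for the $content read from the socket.
my $sample = "HTTP/1.1 200 OK\015\012"
           . "Content-Type: text/html\015\012"
           . "\015\012"
           . "<html></html>";

my ($version, $code) = response_version($sample);
if (defined $version) {
    print "server answered with HTTP/$version, status $code\n";
}
else {
    print "no status line, looks like an HTTP/0.9 style reply\n";
}
```

Note that the version in the status line is the server's own, so getting HTTP/1.0 back after an HTTP/1.1 request is normal.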