Any idea why a 35a occurred, with no line break or anything?
AthlonInside
10:52 pm on Nov 11, 2003 (gmt 0)
I am crawling a page from the web. The link on the site works properly, and the site runs on Linux.
After I download the page with my script, however, the link becomes:
<a href="h 35a ttp://www.site.com">Site</a>
And I am on Windows.
I have no idea why the 35a would appear. Any insights? Thank you.
AthlonInside
10:56 pm on Nov 11, 2003 (gmt 0)
More funny numbers appear, as follows:
<s d1 trong>
<br 1a0 >
DrDoc
11:43 pm on Nov 11, 2003 (gmt 0)
Your parser is doing something funky? What does the download part of your script look like?
AthlonInside
6:36 pm on Nov 12, 2003 (gmt 0)
I do a manual fsockopen with PHP, using GET and HTTP/1.1. The documents I get contain the funny characters. However, IE can view the page without the problem! Is that related to line breaks or something?
DrDoc
7:18 pm on Nov 12, 2003 (gmt 0)
Are you reading batches of the file, or the whole thing at once?
AthlonInside
7:22 am on Nov 13, 2003 (gmt 0)
I fetch the complete file to my hard drive and view it with Notepad.
DrDoc
3:50 pm on Nov 13, 2003 (gmt 0)
Have you tried writing it in batches of 2k or so?
AthlonInside
5:26 pm on Nov 14, 2003 (gmt 0)
Can you explain further what you mean by 'batches of 2k'?
AthlonInside
5:33 pm on Nov 14, 2003 (gmt 0)
OK, I have solved the problem by using HTTP/1.0 instead of HTTP/1.1, and I have removed the 'Connection: close' header.
So, can someone figure out why HTTP/1.1 would cause this funny problem while HTTP/1.0 does not?
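A likely explanation, though nobody in the thread confirms it: an HTTP/1.1 server may send the body with "Transfer-Encoding: chunked", where each chunk is preceded by a line giving its length in hexadecimal. Values like 35a (858), d1 (209) and 1a0 (416) fit that pattern, and a browser such as IE strips them before rendering, which would explain why the page looks fine there. A server should not send a chunked body in reply to an HTTP/1.0 request. The sketch below is in Python rather than PHP, purely for illustration; decode_chunked and frame are hypothetical helper names, and it assumes the response headers have already been split off and that no trailer headers follow the last chunk.

```python
def decode_chunked(body: bytes) -> bytes:
    """Strip chunked transfer framing from a raw HTTP/1.1 response body."""
    out = bytearray()
    pos = 0
    while True:
        # Each chunk begins with its size in hex, terminated by CRLF.
        line_end = body.index(b"\r\n", pos)
        # Ignore optional chunk extensions after a ';'.
        size_field = body[pos:line_end].split(b";", 1)[0].strip()
        size = int(size_field.decode("ascii"), 16)
        pos = line_end + 2
        if size == 0:
            break  # a zero-length chunk marks the end of the body
        out += body[pos:pos + size]
        pos += size + 2  # skip the chunk data and its trailing CRLF
    return bytes(out)


def frame(data: bytes) -> bytes:
    """Wrap data as one chunk, the way a server would (illustration only)."""
    return format(len(data), "x").encode("ascii") + b"\r\n" + data + b"\r\n"


# A link split across two chunks, as in the first post.
raw = frame(b'<a href="h') + frame(b'ttp://www.site.com">Site</a>') + b"0\r\n\r\n"
print(decode_chunked(raw).decode("ascii"))
```

If the script has to stay on HTTP/1.1, checking the response headers for "Transfer-Encoding: chunked" and decoding along these lines would be one way to get a clean document.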
DrDoc
6:20 pm on Nov 14, 2003 (gmt 0)
Glad you solved it ;)
What I meant by "2k batches" is to only read in about 2048 bytes at a time, write those to a file, read the next 2048 bytes, etc.
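A minimal sketch of that batched read-and-write loop, in Python rather than PHP for illustration. The name copy_in_batches is hypothetical, and io.BytesIO stands in for the socket and the output file. Note that batching only limits how much is held in memory at once; on its own it would not change the bytes that end up in the file.

```python
import io


def copy_in_batches(src, dst, batch_size=2048):
    """Copy src to dst, reading at most batch_size bytes at a time."""
    total = 0
    while True:
        block = src.read(batch_size)
        if not block:
            break  # an empty read means the stream is exhausted
        dst.write(block)
        total += len(block)
    return total


# Stand-ins for the network stream and the file on disk.
source = io.BytesIO(b"x" * 5000)
target = io.BytesIO()
copied = copy_in_batches(source, target)
print(copied)
```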