Any idea why a 35a appears where there should be no line break or anything?

         

AthlonInside

10:52 pm on Nov 11, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I am crawling a page from the web. The link on the site works properly, and the site is hosted on Linux.

After I download the page with my script, however, the link becomes

<a href="h
35a
ttp://www.site.com">Site</a>

And I am on Windows.

I have no idea why the 35a would appear. Any insights? Thank you.

AthlonInside

10:56 pm on Nov 11, 2003 (gmt 0)



More funny numbers appear, as follows:

<s
d1
trong>

<br
1a0
>

DrDoc

11:43 pm on Nov 11, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Is your parser doing something funky?
What does the download part of your script look like?

AthlonInside

6:36 pm on Nov 12, 2003 (gmt 0)



I do a manual fsockopen with PHP, using GET and HTTP/1.1. The documents I get contain the funny characters. However, IE can view that page without the problem! Is it related to line breaks or something?

DrDoc

7:18 pm on Nov 12, 2003 (gmt 0)



Are you reading batches of the file, or the whole thing at once?

AthlonInside

7:22 am on Nov 13, 2003 (gmt 0)



I fetch the complete file to my hard drive and view it with Notepad.

DrDoc

3:50 pm on Nov 13, 2003 (gmt 0)



Have you tried writing batches of 2k or so?

AthlonInside

5:26 pm on Nov 14, 2003 (gmt 0)



Can you explain further what you mean by 'batches of 2k'?

AthlonInside

5:33 pm on Nov 14, 2003 (gmt 0)



OK, I have solved the problem by using HTTP/1.0 instead of HTTP/1.1. I have also removed the 'Connection: close' header.

So, can someone figure out why HTTP/1.1 would cause this funny problem while HTTP/1.0 does not?
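
A likely explanation: the stray values look like chunk sizes from HTTP/1.1 chunked transfer encoding. An HTTP/1.1 server may answer with "Transfer-Encoding: chunked", where each piece of the body is preceded by its length in hexadecimal on a line of its own (35a = 858 bytes, d1 = 209, 1a0 = 416). A browser such as IE strips these markers automatically, and an HTTP/1.0 request never receives a chunked response, which would match what was seen here. Below is a minimal Python sketch of the decoding step, not the PHP script from this thread; the function name decode_chunked and the sample data are made up for illustration, and it assumes the headers have already been removed and any trailers can be ignored.

```python
def decode_chunked(body: bytes) -> bytes:
    """Reassemble a chunked HTTP/1.1 body (headers already stripped)."""
    decoded = bytearray()
    pos = 0
    while True:
        # Each chunk starts with its size in hexadecimal, ended by CRLF.
        line_end = body.index(b"\r\n", pos)
        # Ignore optional chunk extensions after a semicolon.
        size_field = body[pos:line_end].split(b";", 1)[0].strip()
        size = int(size_field.decode("ascii"), 16)
        if size == 0:
            break  # a zero-size chunk marks the end of the body
        start = line_end + 2
        decoded += body[start:start + size]
        # Move past the chunk data and the CRLF that follows it.
        pos = start + size + 2
    return bytes(decoded)


# Hypothetical sample: the link from the first post, split across two chunks.
parts = [b'<a href="h', b'ttp://www.site.com">Site</a>']
raw = b"".join(b"%x\r\n%s\r\n" % (len(p), p) for p in parts) + b"0\r\n\r\n"
print(decode_chunked(raw).decode())
```

Only run something like this when the response headers actually say "Transfer-Encoding: chunked"; otherwise the body should be left untouched.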

DrDoc

6:20 pm on Nov 14, 2003 (gmt 0)



Glad you solved it ;)

What I meant by "2k batches" is to only read in about 2048 bytes at a time, write those to a file, read the next 2048 bytes, etc.
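
For anyone wondering what that looks like in practice, here is a rough Python sketch of the read-a-batch, write-a-batch loop (not the PHP from this thread). The name copy_in_batches is made up for illustration, and the in-memory streams stand in for a socket and a file on disk.

```python
import io


def copy_in_batches(source, destination, batch_size=2048):
    """Copy a binary stream in fixed-size batches; return bytes copied."""
    total = 0
    while True:
        batch = source.read(batch_size)  # read at most batch_size bytes
        if not batch:
            break  # an empty read means the stream is exhausted
        destination.write(batch)
        total += len(batch)
    return total


# Stand-ins for a network stream and an output file (assumptions).
source = io.BytesIO(b"x" * 5000)
destination = io.BytesIO()
print(copy_in_batches(source, destination))
```

Batching only changes how the bytes are read and written. It does not change what the server sends, so any extra markers in the response body would still end up in the saved file.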