Forum Moderators: phranque


Apache serving specific byte counts to clients. Is it an SEO problem?

Why is this server sending apparent length declarations in the middle of chunks?

         

robsoles

9:41 am on Jan 23, 2008 (gmt 0)

10+ Year Member



Hi All,

I wish I'd just looked for an Apache forum on WebmasterWorld before going anywhere else. I wish I could put this one in fewer than 100 words!

I've spent about a month working up a web crawler for my work. It's just in VB 6.0 using a 'winsock' control, and I can't do SSL yet, but that is not what has driven me here!

I found the basic difference between IIS and Apache pretty quickly: IIS sends 'Content-Length' in the header, and Apache tells you the length in hex after the two CRLFs that end the head. Cool, I thought: if there's no Content-Length in the head, check for all valid hex digits followed by CRLF at the start of the document and read that as the length if found.

A site I did some SEO on just before Christmas gained a bunch of backlinks and was in the top 10 for 5 keyword phrases by Christmas Day or so, and three of those fell out of the top 30 last Monday.

I checked the root page of the site and found that, unlike how I had left it, it had not only my <title> tag but also a <meta name="title" ..> tag (saying something much shorter and not on keywords). It's a Joomla site, so I jumped into their backend with super admin rights and turned the rotten meta title tag off in global config.

I ran my crawler over it, as it is fine and dandy so far for non-SSL sites, and was a bit stunned when I saw that several URLs it had parsed out of links in the documents it was reading had carriage return + hex-looking number + carriage return in them. I studied the raw bytes the server was sending me, wrote a fix to check for a valid hex figure at the start of each chunk, and re-ran the crawler.

Much to my dismay it appeared to have as many or more of these bizarre links queued to fetch! I got the crawler to save every byte it fetched into text files, marking starts and ends of chunks clearly, ran the spider and waited till the output showed me it had added some of those crooked URLs.

There are pretty much random length declarations scattered throughout the majority of chunks this Apache (v1.3.34) server sent me, and it is rare to see one of these 'length declarations' at the start of a chunk, where I would have expected to find them! I wrote a routine to pick these out and add them up, and sure enough they add up to within about 16 bytes of the length of the string that collected everything that looked like actual document!

Crawler can crawl their site successfully (tricky routine makes sure!), hits the external links and their host/developers link goes to the developer's site root and is 302 found at a subdirectory of same - brilliant!

I have an inkling that I may be being misled by something that the MS Winsock control is doing in terms of double buffering the conversation I'm hoping to have directly with the server to know for sure what is being sent - this supports that my friend's host is not doing evil SEO destructive nonsense as I am growing suspicious that they are (*he could buy their SEO services if he wanted to...).

Please tell me if you know: Are there configuration settings in Apache to do this or do you have to write it in php or other scripting language?

I operate Apache (2.x.x) in a centos 5 setup for my boss and haven't noticed where to set that one... I wish he'd let me set my friend's joomla site up on his server!

My browser and the Google cache show no sign of these length declarations, the stuff my crawler fetches from that server seems to have many more of them right smack bang after the domain name and forward slash of the internal URL in the body's source!

I wouldn't bother asking if one of two conditions existed: a) they were all 1000 hex (4k), or b) they only occurred directly after the head in the first chunk and then at the start of each subsequent chunk.

There are '1000's but there are also random figures of less than 1000, 5a7 seems a little popular but there are plenty of others and although I haven't done scan comparisons to the files I kept of each crawl I'd still swear these figures are never in the same place - although one of them seems to nail the domain name with /1000 as the docurl each crawl.

Is it one of the mambots I don't recognise in their 'hosted' joomla setup? Is it actually a reasonable way to serve?

Please help me, I'll be pleased to hear that Apache is easily configured like that and it's a completely reasonable way to do your webservice - It would make my life easier to just believe the rankings loss is coz of two titles situation - is only change I know of, wish I had crawler idea earlier and knew if this was same service from server as two weeks ago.

phranque

10:31 am on Jan 23, 2008 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



welcome to WebmasterWorld [webmasterworld.com], robsoles!

that doesn't sound familiar to me as an apache issue.
i would be more inclined to think it is the cms or its templates causing problems.
you should be able to test your crawler on other joomla sites and other apache sites to see if you see the same type of problem.

note that some sites ban crawlers - you probably wouldn't get far crawling WebmasterWorld...

robsoles

10:53 am on Jan 23, 2008 (gmt 0)




Good-aye phranque,

It is bizarre phranque, I had my crawler sending:

GET /robots.txt HTTP/1.1
Host: www.example.com
User-Agent: Spidey/Experimental WebBot
Accept: text/xml,application/xml,application/xhtml+xml,text/html,text/plain,image/png,*/*
Accept-Charset: ISO-8859-1,utf-8
Connection: close

And I got these chunks with the hex figures strewn in the middle of URLs. After I posted above I felt a bit shamed, went and tried to make sense of the RFCs on the topic, and changed it to:

GET / HTTP/1.0
Host: www.example.com
User-Agent: Spidey/Experimental WebBot
Accept: text/xml,application/xml,application/xhtml+xml,text/html,text/plain,image/png,*/*
Accept-Charset: ISO-8859-1,utf-8
Connection: close

The crawl with this request didn't receive any length declarations from the server at all. The crawler still reported line feeds and carriage returns in URLs that it parsed out, but looking at the text files with the raw data in them I see that there is no length declaration. I think the RFCs I tried to grasp were saying that's fair enough if 'Connection: close' is in the request, or possibly even just because it's HTTP/1.0.
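For anyone following along, here is a minimal Python sketch of how a client can decide where a response body ends, assuming the rules as I read them in RFC 2616: 'Transfer-Encoding: chunked' wins if present, otherwise 'Content-Length', otherwise the body simply runs until the server closes the connection. The same RFC says a server must not send chunked to an HTTP/1.0 client, which would fit the observation above. The function name `body_framing` is made up for illustration.

```python
def body_framing(headers: dict[str, str]) -> str:
    """Return how the body is delimited: 'chunked', 'length' or 'close'."""
    # Header names are case-insensitive, so normalise them before looking.
    lowered = {name.lower(): value.strip().lower() for name, value in headers.items()}
    if "chunked" in lowered.get("transfer-encoding", ""):
        return "chunked"   # hex size lines are interleaved with the body
    if "content-length" in lowered:
        return "length"    # read exactly that many bytes
    return "close"         # body ends when the server closes the connection


# Header sets modelled on the responses quoted in this thread.
robots_headers = {"Content-Length": "286", "Connection": "close"}
page_headers = {"Transfer-Encoding": "chunked", "Connection": "close"}
http10_headers = {"Connection": "close"}
```

With these three header sets the sketch reports 'length' for robots.txt, 'chunked' for the HTTP/1.1 page and 'close' for the HTTP/1.0 page, so only the second one should ever contain hex size lines.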

I've plenty to learn, I am just hoping someone can clearly indicate if this is an seo problem or not before I learn for sure by plodding on in my study of 'it all'.

Regards,
robsoles.

after-thought-edit: I'm not revealing the URL feeding the interesting raw data, the server at www.example.com is IIS, mine, and I'm pretty sure it's configured all 'nice and polite'.

[edited by: engine at 12:09 pm (utc) on Jan. 23, 2008]
[edit reason] examplified [/edit]

phranque

11:29 am on Jan 23, 2008 (gmt 0)




maybe it's junk in the robots.txt file, which would be a text/plain content type.
1.1 has been around since 99, so i think you'll be ok with that.

what happens when you use:
GET / HTTP/1.1

also, have you checked your response headers, so you know what you are getting?

robsoles

12:32 pm on Jan 23, 2008 (gmt 0)




Thanks for trying to help phranque, and for the warm welcome.

Thanks especially for directing my attention to the robots.txt, it's interesting to me that this file gets a length defined when requesting http/1.0

These 'bits' are small so I'll just drop in the robots request and response:

Request {
GET /robots.txt HTTP/1.1
Host: <HOST/DOMAIN REMOVED>
User-Agent: Spidey/Experimental WebBot
Accept: text/xml,application/xml,application/xhtml+xml,text/html,text/plain,image/png,*/*
Accept-Charset: ISO-8859-1,utf-8
Connection: close

}
Response (wrapped as [Head]all_raw[HeadChunkEnd])
[Head]HTTP/1.1 200 OK
Date: Wed, 23 Jan 2008 11:52:07 GMT
Server: Apache
Last-Modified: Tue, 15 Aug 2006 04:54:22 GMT
ETag: "<REMOVED>"
Accept-Ranges: bytes
Content-Length: 286
Connection: close
Content-Type: text/plain; charset=iso-8859-1

User-agent: *
Disallow: /administrator/
Disallow: /cache/
Disallow: /components/
Disallow: /editor/
Disallow: /help/
Disallow: /images/
Disallow: /includes/
Disallow: /language/
Disallow: /mambots/
Disallow: /media/
Disallow: /modules/
Disallow: /templates/
Disallow: /installation/

[HeadChunkEnd]

Actual_Length:286 Server State Length: 286

The last line there is appended to the string that the crawler is assembling in the 'data_arrival' routine of the winsock to show the math, what is interesting to me here is that the bottom line of the root page using http/1.0 reads:

Actual_Length:8442 Server State Length:

It is interesting that it specifies the length for robots.txt and doesn't for the text/html content type, which is possibly quite fair if 'chunked' is not on offer in http/1.0. All the info I put about this in the o/p is based on using http/1.1 in my request.

I've made routines to check the robots.txt file; rel="nofollow" and <meta name="robots" ...> directives are checked too. The routine can check for a specific robot or for all, and will use the 'all' rules if the specified bot is not named in the file. The bit that manages wildcards was really tricky till I read what Google says about the wildcards Googlebot will interpret, and I had to dumb it down a bit to just tricky.
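As a point of comparison for the robots.txt routine described above, here is a small Python sketch using the standard library's `urllib.robotparser`. It is not the VB crawler's code; the `allowed` helper and the shortened rule list are assumptions for illustration. As far as I know the standard parser does plain prefix matching and does not implement Googlebot-style wildcards, so treat it as the simple case only.

```python
from urllib.robotparser import RobotFileParser

# A shortened copy of the rules quoted in the robots.txt response above.
ROBOTS_TXT = """\
User-agent: *
Disallow: /administrator/
Disallow: /cache/
Disallow: /templates/
"""


def allowed(agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if the given user agent may fetch the URL under these rules."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network fetch is needed here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)
```

Because the bot is not named in the file, the `User-agent: *` group applies, which matches the fall-back behaviour described in the post.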

I had tested spidey very successfully on 20-odd sites before running it on (I nearly wrote the domain out) this particular server, owned by a particular hosting company and basically being rented by my friend. It already pulls all tags apart, storing the attributes of all of them and the content of the likes of <title> and <a href="... > tag pairs.

What it is doing now is rudimentary compared to what I want it to do, which is to get as close as possible to what Googlebot can tell about your pages. Right now it can reveal broken links, links that redirect, and whether that redirection works out to 200 OK. It then writes a report to a plainish html file: the title, all meta tags, certain response headers in case of a missing page or a redirect, and every link, with the status of the destination on the same line as the destination URL, followed by the anchor text, actual href value and title attributes.

Anyway, if you are interested I will post the robots.txt that I get when using http/1.1 on this server and (with friend domain removed) the raw response on either the messiest page served I can find amongst the collection I've been making or the root page if you specify.

Wanna see these crude bits? I'll show you webservice of the root page on an IIS6.2, Apache 2.x.x and this (IMO) rotten one if you like - I'll remove all but my own domain names, if ya don't mind.

(In my previous post I put a note on the bottom, I just want to say that I don't own www.example.com and I don't mind if my URL will be stripped out too.)

phranque

12:41 pm on Jan 23, 2008 (gmt 0)




we don't usually like "code dumps" in the forum, but this could have academic interest...

(just use example.com instead of your personal url - it's a reserved domain name and won't link on this forum)

robsoles

1:57 pm on Jan 23, 2008 (gmt 0)




I was pretty sure it would be seen as rude without an invite, so thanks for the invite phranque. You didn't specify which raw dump you wanted, so I chose the messiest response from when my request was using 'http/1.1'.

This took ages because I did my best to remove every method of identifying anybody from it, I'll have to write up a quickie to turn every word that is apparent 'visual content' and all domain names to example for me if you want to see the nice clean hit I get on a Joomla 1.0.13 site on Centos 5, or the other nice clean one I get on IIS6.2

The following is one of 51 files returned from the host for 61 links the crawler found. When I initially studied the list it looked like it had the most mid-chunk length data, but it isn't the worst in my collection. Without the routine for checking the middle of chunks my crawler was chasing nasty red herrings. In the hit I get using http/1.0 it has carriage returns in the URLs without hex digits. The stuff I wrote still fixes that to request the correct URL, but I really do wonder if doing this form of webservice doesn't make the site look bad to Googlebot.

Hope I nailed everything you could possibly search on to find this site, identify the developers etc. By the way, these documents have mixed use of CRLF and LF in them. The only hex digits which occur in the middle of a URL on this page are the '1000' on a line by itself below; the other mid-body hex digits (bb8) would affect the visual text of the site if mis(?)interpreted.

[Head]HTTP/1.1 200 OK
Date: Wed, 23 Jan 2008 12:52:08 GMT
Server: Apache
Cache-Control: no-store, no-cache, must-revalidate, post-check=0, pre-check=0
Expires: Mon, 26 Jul 1997 05:00:00 GMT
Pragma: no-cache
X-Powered-By: PHP/4.4.6
Set-Cookie: [removed /]; path=/
Last-Modified: Wed, 23 Jan 2008 12:52:08 GMT
Connection: close
Transfer-Encoding: chunked
Content-Type: text/html; charset=iso-8859-1

fd3
<?xml version="1.0" encoding="iso-8859-1"?><!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html>
<head>
<title>[removed for post /]</title>
<meta name="description" content="[removed /]." />
<meta name="keywords" content="example example example, example example, example example, example example, example example, example, example supplies, example suppliers, example distributors" />
<meta name="Generator" content="removed." />
<meta name="robots" content="index, follow" />
<link rel="shortcut icon" href="http://www.example.com/images/favicon.ico" />
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />
<link href="http://www.example.com/templates/example/css/template_css.css" rel="stylesheet" type="text/css" />
<!--[if IE]>
<link rel="stylesheet" type="text/css" href="http://www.example.com/templates/example/css/iehacks.css" />
<![endif]-->
<script type="text/javascript" src="http://www.example.com/templates/example/example.js"></script>
</head>
<body ondragstart="return false" onselectstart="return false">

<div id="holder">
<div id="header"><img src="http://www.example.com/templates/example/siteimg/toplogo.gif" alt="Distributors" width="313" height="107" border="0" /></div>
<div id="menu"><div id="menuitems"><ul id="mainlevelmainmenu"><li><a href="index.php?option=com_content&amp;task=view&amp;id=12&amp;Itemid=26" class="mainlevelmainmenu" >about us</a></li><li><span class="mainlevelmainmenu" >¦</span></li><li><a href="index.php?option=com_content&amp;task=view&amp;id=18&amp;Itemid=49" class="mainlevelmainmenu" >example example example</a></li><li><span class="mainlevelmainmenu" >¦</span></li><li><a href="index.php?option=com_content&amp;task=view&amp;id=21&amp;Itemid=35" class="mainlevelmainmenu" >example example</a></li><li><span class="mainlevelmainmenu" >¦</span></li><li><a href="index.php?option=com_content&amp;task=view&amp;id=76&amp;Itemid=151" class="mainlevelmainmenu" >h\e minerals</a></li><li><span class="mainlevelmainmenu" >¦</span></li><li><a href="index.php?option=com_login&amp;Itemid=126" class="mainlevelmainmenu" >professionals</a></li></ul></div[HeadChunkEnd][ChunkStart]></div>
<div id="flashcontent" style="background-image: url('/templates/example/siteimg/example_header.jpg');">
<div id="noflashheader"><h1>example</h1></div>
</div>
<script type="text/javascript">
var so = new SWFObject("flash/example.swf", "example", "100%", "100%", "7", "#FFFFFF");
so.addParam("wmode", "transparent");
so.addParam("salign", "t");
so.write("flashcontent");
</script><div id="colcont">
<div id="rightcol">
<ul id="mainlevelexamplemenu"><li><span class="mainlevelexamplemenu" >example example example</span></li><li><a href="index.php?option=com_content&amp;task=view&amp;id=18&amp;Itemid=49" class="mainlevelexamplemenu" >in example news</a></li><li><a href="index.php?option=com_content&amp;task=view&amp;id=19&amp;Itemid=63" class="mainlevelexamplemenu" >testimonials</a></li><li><a href="javascript:popUp('enquire.php')" class="mainlevelexamplemenu" >where to buy</a></li><li><span class="mainlevelexamplemenu" >articles</span></li><li><a href="index.php?option=com_content&amp;task=view&amp;id=25&amp;Itemid=55" class="mainlevelexamplemenu" >about example</a></li><li><a href="index.php?option=com_content&amp;task=view&amp;id=32&amp;Itemid=65" class="mainlevelexamplemenu" >birth of example</a></li><li><a href="index.php?option=com_content&amp;task=view&amp;id=33&amp;Itemid=66" class="mainlevelexamplemenu" >example example</a></li><li><a href="index.php?option=com_content&amp;task=view&amp;id=34&amp;Itemid=67" class="ma[ChunkEnd][ChunkStart]inlevelexamplemenu" >example a</a></li><li><a href="index.php?option=com_content&amp;task=view&amp;id=35
1000
&amp;Itemid=68" class="mainlevelexamplemenu" >example c</a></li><li><a href="index.php?option=com_content&amp;task=view&amp;id=36&amp;Itemid=69" class="mainlevelexamplemenu" >damaged example</a></li><li><a href="index.php?option=com_content&amp;task=view&amp;id=37&amp;Itemid=70" class="mainlevelexamplemenu" >example example</a></li><li><span class="mainlevelexamplemenu" >examples</span></li><li><a href="index.php?option=com_content&amp;task=view&amp;id=77&amp;Itemid=152" class="mainlevelexamplemenu" >what's new</a></li><li><a href="index.php?option=com_content&amp;task=view&amp;id=24&amp;Itemid=54" class="mainlevelexamplemenu" >interactive example</a></li><li><a href="index.php?option=com_content&amp;task=view&amp;id=26&amp;Itemid=57" class="mainlevelexamplemenu" >example example example</a></li><li><a href="index.php?option=com_content&amp;task=view&amp;id=27&amp;Itemid=145" class="mainlevelexamplemenu" >original example</a></li><li><a href="index.php?option=com_content&amp;task=view&amp;id=28&amp;Itemid=58" class="mainlevelexamplemenu" >example example</a></li><li><a href="index.php?option=com_content&amp;task=view&amp;id=29&amp;Itemid=59" class="mainlevelexamplemenu" >intensive example</a></li><li><a href="index.php?option=com_content&amp;task=view&amp;id=30&amp;Itemid=60" class="mainlevelexamplemenu" >example example</a></li><li><a href="in[ChunkEnd][ChunkStart]dex.php?option=com_content&amp;task=view&amp;id=31&amp;Itemid=61" class="mainlevelexamplemenu" >example roll-cit</a></li><li><a href="index.php?option=com_content&amp;task=view&amp;id=80&amp;Itemid=156" class="mainlevelexamplemenu" id="active_menuexamplemenu">example example</a></li><li><a href="index.php?option=com_content&amp;task=view&amp;id=83&amp;Itemid=158" class="mainlevelexamplemenu" >sun example example</a></li><li><a href="index.php?option=com_content&amp;task=view&amp;id=82&amp;Itemid=157" class="mainlevelexamplemenu" >example® example machine</a></li></ul></div>
<div id="leftcol">
<p class="sectiontitle" align="right">
<span style="color: #ff0000">NEW!</span>&nbsp; example example
</p>
<p class="subheading" align="right">
&nbsp;
</p>
<img style="width: 247px; height: 263px" src="images/stories/example_jpegs_env/example_example.jpg" border="0" alt="example_example" title="example_example" hspace="5" vspace="5" width="247" height="263" /><br />
<span class="subheading">example example</span> <br />
example example example example example example example.&nbsp; Its example example is to example your example with example example-example example that example example example of example in example example example.&nbsp; It is example example to maintain example example example examplee. &nbsp;example example includes example and example-example&trade; two example example example example example example contro[ChunkEnd][ChunkStart]l example example and example of example which is example for example example example example and an example example example. A examplet example example example example of example active example.<br />
example example example example example be used on example example example where example example example is example and example, or can be used on example example example and example for example v example example. example example example example example example of example, examples example and examples example transfer of example to example example.<br />
example example example must be exampleed to example and once it has been example example into example example example standard example&reg; examples may be used. example example in example example example example&reg; example and example, example example example example to example example of example and example example and example a example example with improved example.&nbsp;It is example to example that example example example example example example example example example of example example example example. example&reg;&rsquo;S example example example example of example. It example example example example example example example example example to example example und
bb8
esirable example.<br />
example new example example example is example in an example example example example example example.<span style="font-family: arial,[ChunkEnd][ChunkStart]helvetica,sans-serif"> <br />
<br />
</span><span class="subheading">example example</span> <br />
example example example example is a example example example example to example example, example example, example example transfer of example into example example as well as example, which is a example example found example. This example is example in that examplere are no example example examples example on example example. <br />
example example example example is example example example example greater example of example active example on example example. example example example be example on example example after example example example example example example example and exampleed to example.&nbsp;It example &ndash; even example but example on where example example is being used some will example example on example example example two example. example example example example example on example example, example more example example results.&nbsp;example may example a example example of example under example example. example should example example example example example example or example example if example example example too example. example example example example example example, example example of example examples and examplees.&nbsp;It is example to example that example example example example example example example example example of example A example example example example. example&reg;&rsquo;S Original example example example example example of example. example example example[ChunkEnd][ChunkStart] example example example example example example example A example example be example example to example example example.<br />
example example example exampleES are example in examplely example example with 30 example in example.<span style="color: #000000"><span style="font-size: 10pt"><span style="font-family: arial,helvetica,sans-serif"> </span></span></span>
<p class="subheading" align="right">
&nbsp;
</p>


</div>
</div>
<div id="footer">
<div id="footerlinks">
<a href="http://www.example.com/index.php?option=com_contact&task=view&contact_id=1&Itemid=51">contact us</a> ¦ <a href="http://www.example.com/index.php?option=com_login&Itemid=126">example</a> &nbsp;&nbsp; external links: <a href="http://www.example.com" target="_blank">example example</a> ¦ <a href="http://www.example.co.za" target="_blank">example</a> ¦ <a href="http://www.example.com" target="_blank">example</a>
</div>
<div id="example">
<a href="http://www.example.com" target="_blank"><img src="siteimg/example.gif" alt="Site by example example" width="100" height="29" border="0" /></a>
</div>
</div>
</div>
<script src="http://www.google-analytics.com/urchin.js" type="text/javascript">
</script>
<script type="text/javascript">
_uacct = "<removed />";
urchinTracker();
</script>
</body>
</html><!-- 1201092729 -->
0

[ChunkEnd]

Actual_Length:11154 Server State Length: 11147

[edited by: jdMorgan at 4:19 pm (utc) on Jan. 23, 2008]
[edit reason] Redacted by request. [/edit]

phranque

2:18 pm on Jan 23, 2008 (gmt 0)




you crack me up!
=8)
nice examplification...

looks like you're right - it's probably sending (as close as possible to) 4k chunks.
FD3 is 4051, then 1000 is 4096, then the partial final chunk.
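To double-check that arithmetic, here is a tiny Python sketch that converts every hex size line visible in the dump above (fd3, 1000, bb8 and the closing 0) and adds them up. Nothing here is assumed beyond those four values being the only size lines in that response.

```python
# Hex size lines as they appear in the posted response, in order.
sizes = ["fd3", "1000", "bb8", "0"]

# Each size line is a hexadecimal byte count for the chunk that follows it.
decoded = [int(size, 16) for size in sizes]
total = sum(decoded)

print(decoded, total)
```

The total comes to 11147, which is the 'Server State Length' figure at the bottom of the dump, so the size lines account for the whole document. The final 0 is the normal end-of-body marker in chunked encoding.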

really strange - i've never seen that.
i'll take your word, but you don't see this from the other servers?
no way you aren't printing a buffer at a time in your script and somehow outputting the length before/between buffers?
maybe you don't see it because only apache buffers?
just some wild guesses here...

jdMorgan

2:47 pm on Jan 23, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I would suggest a test session using a simple "terminal emulator" tool such as Telnet to send requests using HTTP/1.0 and HTTP/1.1 to this server, and to examine the raw response streams.
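If a terminal emulator proves awkward, a few lines of Python can play the same role. This is only a sketch under assumptions: the helper names `build_request` and `fetch_raw` are invented here, the header set is trimmed from the request quoted earlier in the thread, and www.example.com stands in for the real host. It sends the request with explicit CRLF line endings and keeps each piece exactly as `recv()` returned it.

```python
import socket


def build_request(host: str, path: str = "/", version: str = "1.1") -> bytes:
    """Build a minimal GET request with CRLF line endings and a blank line at the end."""
    lines = [
        f"GET {path} HTTP/{version}",
        f"Host: {host}",
        "User-Agent: Spidey/Experimental WebBot",
        "Connection: close",
        "",  # together these two empty entries produce the
        "",  # terminating CRLF CRLF after the last header
    ]
    return "\r\n".join(lines).encode("ascii")


def fetch_raw(host: str, path: str = "/", version: str = "1.1", port: int = 80) -> list[bytes]:
    """Return the raw response as the list of pieces recv() delivered, unmodified."""
    pieces = []
    with socket.create_connection((host, port), timeout=10) as sock:
        sock.sendall(build_request(host, path, version))
        while True:
            data = sock.recv(4096)
            if not data:      # empty read means the server closed the connection
                break
            pieces.append(data)
    return pieces
```

Calling `fetch_raw("www.example.com", "/", "1.0")` and then again with `"1.1"` would give the two raw streams to compare. Note that the pieces follow network timing, so their boundaries need not line up with anything in the HTTP message.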

Let's keep the code dumps short, and only post what's relevant, please.

Jim

robsoles

2:51 pm on Jan 23, 2008 (gmt 0)




yeah, it is pretty well exampled meaningless, isn't it?

The routine copies the data that 'arrives' and adds it to an 'otherwise untouched' buffer as it arrives. Each piece is prepended with either '[Head]' or '[ChunkStart]', and given a similar end marker, as it is added to that buffer string. The string sent to the parsing routines doesn't include the header of the response, the markers I wrote, or the hex digits that are arriving in the raw response.

I took great care to make sure my routines put the dead simple truth in the local copy files, weeded out all non-document parts of the page without taking characters that belong in place and retested all parsing routines to make sure they weren't dropping the ball - can't find it my end and since writing in the bit to record the raw responses in local file, can prove it's coming from their end (*lol, you have to get spidey off me or find one that makes as much obvious as spidey does so far) while they remain live with it.

Or, can someone confirm or deny my suggestion about winsock in o/p? Maybe their chunks are coming in with length unit at start and winsock nonsense is re-arranging chunks on me, making it look like these chunks come with length units in the middle of stuff.
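On that question, one thing worth keeping in mind is that the pieces a socket hands over follow TCP delivery, not HTTP chunk boundaries, so a size line can legitimately land anywhere inside an arrival. A decoder therefore has to work on the joined-up stream. Below is a minimal Python sketch of that idea; the class name `ChunkedDecoder` and its `feed` method are hypothetical, and it ignores trailers after the last chunk.

```python
class ChunkedDecoder:
    """Incrementally decode an HTTP/1.1 chunked body fed in arbitrary pieces."""

    def __init__(self) -> None:
        self.buffer = b""     # bytes received but not yet decoded
        self.body = b""       # decoded document bytes
        self.remaining = 0    # data bytes still owed by the current chunk
        self.state = "size"   # one of "size", "data", "crlf", "done"

    def feed(self, piece: bytes) -> None:
        self.buffer += piece
        while self.state != "done":
            if self.state == "size":
                end = self.buffer.find(b"\r\n")
                if end < 0:
                    return    # size line incomplete, wait for more bytes
                line, self.buffer = self.buffer[:end], self.buffer[end + 2:]
                # Drop optional chunk extensions after ';' and any padding spaces.
                self.remaining = int(line.split(b";")[0].strip(), 16)
                self.state = "data" if self.remaining else "done"
            elif self.state == "data":
                if not self.buffer:
                    return
                take = self.buffer[:self.remaining]
                self.body += take
                self.buffer = self.buffer[len(take):]
                self.remaining -= len(take)
                if self.remaining == 0:
                    self.state = "crlf"
            else:
                # Each chunk's data is followed by a CRLF that is not part of the body.
                if len(self.buffer) < 2:
                    return
                self.buffer = self.buffer[2:]
                self.state = "size"


# A small two-chunk body, delivered in 7-byte pieces to mimic arbitrary arrivals.
doc = b'<a href="index.php?id=35&amp;Itemid=68">example</a>'
wire = (b"18\r\n" + doc[:24] + b"\r\n"
        + format(len(doc) - 24, "x").encode() + b"\r\n" + doc[24:] + b"\r\n0\r\n\r\n")
decoder = ChunkedDecoder()
for start in range(0, len(wire), 7):
    decoder.feed(wire[start:start + 7])
```

In this run the second size line sits in the middle of one of the 7-byte pieces, right inside the URL, yet `decoder.body` comes out as the original link with nothing left in it from the size lines.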

Nope, never saw this out of any other crawl and I did a 400+ page joomla site (1.0.13) day before yesterday etc, etc. It gave the size of the page in a single figure preceding the first chunk of the body.

In fairness to them, I crawled the developer's site, and although there were no URLs broken up in their stuff (actually, is this more suss?), the random-looking length units were in their source too.

I wish an Apache developer would state that this is something you have to write yourself and Apache delivers much easier pages (as demonstrated by all apache/linux setups I crawled before that one! ~11) for user-agents to dissemble, especially if they are some unknown agent and they form a mediocre request in terms of elements present in it.

I could show what I now call a 'clean hit' off IIS 6.2 with less effort as my site is ASP in IIS6.2 and I can just find/replace my domain with example.com, not now though.

I have to retire now and go to work when I un-retire, catch you in my tomorrow most likely after work!

robsoles

3:02 pm on Jan 23, 2008 (gmt 0)




mmh, I can't edit the post with the raw page anymore and I left a prime piece of branding in an img tag's alt attribute. Can someone please remove the obvious branding with the word 'distributors' in it from the post?

Thanks in advance,
robsoles.

robsoles

11:54 pm on Jan 23, 2008 (gmt 0)




(popped in at work), Hey jdMorgan, thx for telnet suggestion and I didn't mean to offend with code dumps or unnecessary info.

I can't seem to form a decent request using telnet; the server in question disconnects me when I send the end of request 'CRLF+CRLF' and never once sends or echoes anything back to me. I will re-attack it tonight. An alternative to winsock which shows the true raw data will be terribly handy, so I'll have to try harder with telnet or similar when I get home.

Regards,
robsoles.
Ps. thanks very much to the fixer of the branding...

jdMorgan

12:12 am on Jan 24, 2008 (gmt 0)




As I recall, there is a setting in Telnet that adjusts the line-enders between *nix mode, Windows mode, and others. I think it also allows you to type CR-only to get the proper line-ender(s). Make sure that it is set correctly for HTTP requests.

Jim

robsoles

9:06 am on Jan 24, 2008 (gmt 0)




Hey jdMorgan, phranque and whoever else ever becomes interested.

I stopped mucking around with the MS-DOS version of 'telnet.exe' and switched to puttytel.exe that I realised I downloaded in a zip just chasing putty.exe for decent ssh a few months ago.

I got a few successful sample hits using puttytel, but the chunks arrive too quickly for me to determine if these 'length units' come at the start of each chunk or randomly 'mid-chunk', as the output of my crawler fairly strongly suggests.

The first hit from the server in question was as clean as a whistle, '1bfa' (or so) directly before unbroken body (just like what I see as a healthy Joomla/Apache/Centos hit) - as I parsed it with my eyes I worried for a second that I made a gross error in my coding of the crawler's raw-response-parser that was somehow introducing the problem - nup, second hit cleared that one up straight away.

As did the third and fourth hits - just trouble being that the chunks arrive too quick for me to be sure I am seeing the last byte of the prior chunk and the first byte of the next chunk as they arrive and I don't think I'll find a setting in any telnet program to place markers in between the chunks as they arrive (*tho, please correct me if you know better!)

To me, this very basically confirms that this server is sending 'length units' mid-body, but it neither confirms nor refutes that the server is sending them mid-chunk. These things break up body text and, more importantly, URLs inside href attributes, which I have seen from no server before it, and that is making me really wonder about it.

If you would help me but for some reason won't post a reply to this thread, please consider the following:
<snip>

[edited by: jdMorgan at 3:45 pm (utc) on Jan. 24, 2008]
[edit reason] No URLs in any form, please. See Terms of Service. [/edit]

jdMorgan

3:49 pm on Jan 24, 2008 (gmt 0)




There are bandwidth-reduction utilities available that will slow your internet connection down. They are mostly useful for emulating dial-up connection speeds, and may help with this case. Try a search for "sloppy connection speed java".

Jim

robsoles

10:13 pm on Jan 24, 2008 (gmt 0)




Thanks Jim, especially for persisting at helping me after having to fix my faux pas that make it obvious I didn't read too many of the forum rules before I started posting.

I'll pursue that idea when I get home and post back to say how I went, I am hoping to conclusively prove that the service I see out of 'that server' is perfectly reasonable and within RFC specs for it because the alternative just isn't cool.

Regards,
robsoles.

phranque

11:36 pm on Jan 24, 2008 (gmt 0)




no worries!
those rules make this a low noise place to discuss things.

that could be close to a record for longest first thread...
=8)

stick around and join the fun!

robsoles

11:06 am on Jan 25, 2008 (gmt 0)




Thanks very much phranque.

Jim, sloppy's pretty cool - whilst fetching stuff via it with putty I realised my crawler was already armed with a tool I could use to find out what a browser would display if it received that!

Deadly simple HTTP service, left over from the bits I constructed to proxy my browser's requests to my own sites, and then my boss's apache sites, after a quick squiz at the abundantly confusing RFCs.

I wrote the line required to make it spit out the contents of one of the files I was considering to be 'dirty' and...

FF - Perfect, 'Standards compliant' page. Context-menu has 'Page Info' and it states the length specified in the sum of the units that are scattered throughout the page, which are absent in the source.

MSIE - Looks same, right click disabled, source same and information unavailable in the properties - very clever.

CASE APPEARS CLOSED: Neither browser shows any sign of the 'length units' in the resultant page or the source thereof, nor do they affect the script's ability to manage compliance mode, and the page displays well in MSIE, successfully killing right-click.

I like it, thank goodness I can sing their developer/host's praises instead of saying potentially seriously regrettable stuff to anyone who might do anything silly with it!

Thanks again guys, I've gotten many answers to 'my current problem' over the last six months by finding something on www.webmasterworld.com (I hope I can make that one here!) by posing my problem as directly as possible in the Google search box.

May Google bless us all (or go to {I won't say in hope it's bless!})
robsoles.

phranque

7:10 pm on Jan 25, 2008 (gmt 0)




I've gotten many answers to 'my current problem' over the last six months by finding something on www.webmasterworld.com (I hope I can make that one here!) by posing my problem as directly as possible in the Google search box

WebmasterWorld indexes well on google - and quickly, often within minutes...