spiders and hostname header

Forum Moderators: DixonJones

Message Too Old, No Replies

spiders and hostname header

any spiders not using http/1.1 host headers?

plumsauce

9:23 pm on Jan 1, 2004 (gmt 0)

Has anyone observed legitimate major spiders that
do not issue the host: header as part of the request?

There is a lot of traffic these days in the logs
that do not grab the associated .js or .css files.
A class of these requests does not include the
host header, even when declaring the protocol
version as HTTP/1.1.

In other words, the request would look like this:

GET / HTTP/1.1

instead of:

GET / HTTP/1.1
Host: example.com

As most commonly used browsers issue a host
header as part of the request, and follow up
with requests for the .js or .css files included
on the page, I am concluding these are automated
scrapers.

I am considering writing an isapi filter to just
drop these requests on the ground.

Of course, I do not want to drop useful traffic.
Hence, the question at the top of this post.

Dreamquick

12:32 am on Jan 2, 2004 (gmt 0)

Pretty much any decent spider needs to be using the HOST: header otherwise they don't stand a chance of being able to index the multitude of virtually hosted sites that are out there.

The only legitimate reason I can think of not to include a host header is if you were accessing the site via IP address for some reason, aside from that the only programs that tend to issue automated requests without one are looking for exploits via IP addresses.

Email harvestors / screen scrapers etc. normally uses hostnames since it allows them to access the site content and harvest/scrape.

- Tony

plumsauce

7:12 pm on Jan 2, 2004 (gmt 0)

Thanks for replying.

Pretty much any decent spider needs to be using the HOST: header otherwise they don't stand a chance of being able to index the multitude of virtually hosted sites that are out there.

That's pretty well what I believe too.

As for automated exploit detection, I am not
seeing any followup requests that normally
accompany these. It seems as if there are
a bunch of zombied machines out there from
ip's all around the world.

+++

ronburk

4:28 am on Jan 3, 2004 (gmt 0)

The Host header is a non-optional requirement for an HTTP 1.1 request. Likewise, the standard also requires servers to respond with a 400 if an HTTP 1.1 request arrives that contains no Host header.

Quick check of www.microsoft.com (whom I suspect of running an IIS server) with a missing Host header garners the following reply:

HTTP/1.1 400 Bad Request
Content-Type: text/html
Date: Sat, 03 Jan 2004 04:22:24 GMT
Connection: close
Content-Length: 39

<h1>Bad Request (Invalid Hostname)</h1>

This seems better than violating the standard and just ignoring such requests.

If you have the ability, it would be of academic interest to examine such requests in more detail. Were they being hand-typed in telnet (separate packet for each char)? Does it look like it could be a broken proxy (Host header empty, as opposed to missing)?

plumsauce

6:31 am on Jan 3, 2004 (gmt 0)

If you have the ability, it would be of academic interest to examine such requests in more detail. Were they being hand-typed in telnet (separate packet for each char)? Does it look like it could be a broken proxy (Host header empty, as opposed to missing)?

Yep, the header is mandatory in 1.1, but who knows
who programmed these things. From the number that
are occurring, I doubt that it is a manually typed
request.

You bring up a good point with the proxies. This
used to be a common problem. The current state of
affairs is not known to me.

But, as the client is not getting a useful response
anyways, I guess no harm is done in dropping the
request.

While I use sniffers on test boxes, I don't think
a sniffer on a production server is going to be
very easy in terms of picking out a packet from
the trace logs.

+++

ronburk

6:09 pm on Jan 3, 2004 (gmt 0)

But, as the client is not getting a useful response
anyways, I guess no harm is done in dropping the
request.

The harm is that the client is not notified that they have made a mistake. Therefore, it's entirely reasonable for the client to try again. And again. And again. The 400 response explicitly says "do not send this request again -- it's screwed up." That is, in fact, a useful response for some clients. While you may be encountering a client that fails to obey the 400 response, the best bet for dealing with the entire world of clients (present and future) is to conform to the standard, IMO. If it's really a particular set of IP addresses that continues to annoy, I would rather drop them at the firewall than at the level of HTTP requests.

I have so far not encountered a single 1.1 request with a missing Host on the servers I manage (but now I'll add a weblog filter to watch for it :-). Should it start to crop up on a regular basis, I will leave my servers conformant and try to manually notify the humans behind the client of their error. Failing that, I would notify their ISP. Failing that, I would clip them at the firewall if they are so kind as to use a static IP address. Last resort after that would be to tarpit the connection and see if they notice their client is starting to get reeeeeal slow.

For senders of 1.0 requests that fail to include a Host header (legal, but equally useless when the IP address hosts multiple sites), I return a page that explicitly explains what they've done and why it's not particularly useful. Virtually all my 1.0 requests are from spiders and hackers (seems about an equal mix, some days).

plumsauce

8:44 pm on Jan 3, 2004 (gmt 0)

First, they definitely are being seen as version
1.1 requests. The only reason they stuck out
like a sore thumb is because of the missing host
header and lack of followup for style sheets and
javascript files.

The firewall is not going to be any real help as
they come from all over the world. So, notifying
their isp is not really something that is going
to be on the top of my to-do list.

*drop* was used in a general sense.
The actual response can be drop, 400, tarpit,
whatever. Conformance is not really a major
consideration when dealing with rogue requests.

The question really is: has anyone seen legitimate
traffic including mainstream search engines that
behaves like this?

Pedanticist?

+++