Log shows GET http://example.com/index.htm

That's not the usual format

         

SteveWh

7:29 am on Feb 9, 2011 (gmt 0)

10+ Year Member



Server: Linux/Apache 2.2

In my HTTP access log, a normal request for the home page looks like this:

GET /index.htm

which corresponds to the file /public_html/index.htm.
Almost all requests have this format:
GET is followed by a space and a /, and the remainder of the request path is relative to that.


But there are also infrequent log entries that look like this:

GET http://example.com/index.htm

There is no forward slash after GET, and the request contains the full protocol and website name, which normally would have been stripped off by Apache.

The result code is 200 and the bytes transferred seem to indicate that the server sent the correct page.
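To find entries like this in a log, a short script can pick out the request field and check whether the target starts with a scheme. This is only a minimal sketch: it assumes the log is in the common or combined log format with the request in double quotes, and the two sample lines (IP addresses, timestamps, byte counts) are made up for illustration, not taken from a real log.

```python
import re

# Matches the quoted request field of a common/combined-format log line,
# e.g. "GET http://example.com/index.htm HTTP/1.1"
REQUEST_FIELD = re.compile(r'"([A-Z]+) (\S+)(?: (HTTP/[\d.]+))?"')


def is_absolute_form(log_line):
    """Return True if the logged request target starts with http:// or https://."""
    match = REQUEST_FIELD.search(log_line)
    if match is None:
        return False
    target = match.group(2)
    return target.lower().startswith(("http://", "https://"))


# Hypothetical sample lines, invented for this example only.
sample = [
    '192.0.2.1 - - [09/Feb/2011:07:29:00 +0000] "GET /index.htm HTTP/1.1" 200 5120',
    '192.0.2.2 - - [09/Feb/2011:07:30:00 +0000] "GET http://example.com/index.htm HTTP/1.1" 200 5120',
]

# Keep only the unusual entries.
odd = [line for line in sample if is_absolute_form(line)]
```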

On my Apache server at home, I've tried to reproduce this with a browser and with wget, and cannot.

If I send a request like this:

GET http://example.com/http://example.com/index.htm

it shows in the log like this, still with the leading slash, and the server returns a 404 for it:

GET /http://example.com/index.htm

I cannot craft a request such that the server log entry doesn't start with "GET /".

Any ideas what text format is being used for these strange requests, or what download tool? Actually, it doesn't seem like the download tool would matter. The mystery is how an HTTP request can be crafted to look like that in the log.

jboy

12:59 pm on Feb 9, 2011 (gmt 0)

10+ Year Member



Have you tried accessing both www.example.com and example.com in your testing? Also with and without index.html, and maybe with and without the trailing slash on example.com/ and www.example.com/? Just a few thoughts.

SteveWh

10:18 am on Feb 11, 2011 (gmt 0)

10+ Year Member



Got it! They're using telnet, or a program that uses the telnet protocol, which can be used to request web pages.
[support.microsoft.com...]

jdMorgan

8:07 pm on Feb 17, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Or they're just using a badly-coded script...

Including the protocol and hostname is a valid "long form" of request-line formatting, but very, very rarely seen. It almost always indicates that the client is a script or some sort of 'tool' and is not a human using a standard browser.
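For testing, a request in this long form can be produced by writing the request line by hand over a plain TCP socket. The following is a minimal sketch, not a record of what the clients in the log actually did; the function names `build_absolute_form_request` and `send_raw` are hypothetical, and example.com stands in for your own host. A browser or wget talking directly to the server normally sends only the path, which would explain why typing a longer URL into one of them does not reproduce the log entry.

```python
import socket


def build_absolute_form_request(host, path="/index.htm"):
    """Build a GET request whose request line carries the full URL."""
    request_line = f"GET http://{host}{path} HTTP/1.1\r\n"
    headers = f"Host: {host}\r\nConnection: close\r\n\r\n"
    return (request_line + headers).encode("ascii")


def send_raw(host, payload, port=80, timeout=10):
    """Send raw bytes to host:port and return everything the server sends back."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(payload)
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # server closed the connection
                break
            chunks.append(data)
    return b"".join(chunks)


# Usage against your own test server (not run here):
# response = send_raw("example.com", build_absolute_form_request("example.com"))
```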

You can detect the http:// at the beginning, and if the requested domain is NOT your own domain (or a valid variant of it), then return a 403-Forbidden. If it is your own domain, but not a canonical hostname, then 301 redirect to the canonical hostname after stripping off the "http://example.com" part. If it is exactly your canonical hostname, then just do an internal rewrite to remove the protocol and hostname, and serve the content from the remaining URL-path.
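The three cases above can be sketched as a small decision function. This is an illustration of the logic only, not Apache configuration: the name `classify`, the constants, and the choice of example.com as the canonical hostname with www.example.com as its variant are all assumptions made for the example.

```python
from urllib.parse import urlsplit

CANONICAL_HOST = "example.com"                   # assumed canonical hostname
OWN_HOSTS = {"example.com", "www.example.com"}   # assumed valid variants


def classify(request_target):
    """Return (status, value) for a request target taken from the request line.

    (200, path)     serve the content at this URL-path
    (301, location) redirect to the canonical hostname
    (403, None)     refuse the request
    """
    if not request_target.lower().startswith(("http://", "https://")):
        return (200, request_target)  # ordinary request, nothing to strip

    parts = urlsplit(request_target)
    host = (parts.hostname or "").lower()
    path = parts.path or "/"
    if parts.query:
        path += "?" + parts.query

    if host not in OWN_HOSTS:
        return (403, None)  # someone else's domain
    if host != CANONICAL_HOST:
        return (301, f"http://{CANONICAL_HOST}{path}")  # non-canonical variant
    return (200, path)  # canonical host: internal rewrite to the bare path
```

For example, under these assumptions `classify("http://www.example.com/index.htm")` yields a 301 to `http://example.com/index.htm`, while a request naming any other domain yields a 403.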

Jim

SteveWh

10:35 pm on Feb 18, 2011 (gmt 0)

10+ Year Member



"You can detect the http:// at the beginning"

I decided to unconditionally 403 all these requests based on just that one test.