Forum Moderators: DixonJones
212.127.***.** - - [01/Feb/2005:02:41:38 -0800] "GET /robots.txt HTTP/1.1" 200 1751 "-" "Java/1.5.0_01"
212.127.***.** - - [01/Feb/2005:02:41:38 -0800] "GET /Blahblah.html HTTP/1.1" 200 8455 "-" "Qarp-0.33"

How can the log show two different UA strings, at the same second, from the same IP number?
Is it as simple as two browsers open, each with a different UA string?
When you google Qarp-0.33, at the moment, there is nothing...nada.
'Sumptin someone dreamt up on the spur of the moment?
If it asked for robots.txt, let's guess it's a bot, maybe written in Java. And just for fun, let's say it's multi-threaded. Furthermore, let's guess the programmer was 'sloppy' and didn't use the same base object for his HTTP requests. So in the robots.txt thread they didn't set the user agent string, but they did set that header field for the other threads that fetch the pages.
Not that any programmer I know would write such code of course...
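To make that guess concrete, here's a hypothetical sketch (the class and method names are made up, and nothing here is from the actual bot) of how two threads that each build their own request headers end up sending different User-Agents:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the "sloppy" pattern guessed at above: each thread
// builds its own HTTP request headers instead of sharing one base request
// object, so the User-Agent differs between requests from the same bot.
public class SloppyBot {
    // Thread 1: fetches robots.txt and never sets a User-Agent, so the
    // HTTP library's default (e.g. "Java/1.5.0_01") goes out on the wire.
    static Map<String, String> robotsRequestHeaders() {
        Map<String, String> headers = new HashMap<>();
        headers.put("Host", "example.com");
        // no User-Agent set here -- the library default is used
        return headers;
    }

    // Thread 2: fetches pages and *does* set its own User-Agent.
    static Map<String, String> pageRequestHeaders() {
        Map<String, String> headers = new HashMap<>();
        headers.put("Host", "example.com");
        headers.put("User-Agent", "Qarp-0.33");
        return headers;
    }

    public static void main(String[] args) {
        System.out.println("robots.txt UA: " + robotsRequestHeaders()
            .getOrDefault("User-Agent", "(library default, e.g. Java/1.5.0_01)"));
        System.out.println("page UA: " + pageRequestHeaders().get("User-Agent"));
    }
}
```

Two requests, one second apart or less, two different UA strings -- exactly what the log shows.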
Actually, I have seen the agent switch in my logs within the same visit when there is a browser add-in to manage downloads. Very confusing when first encountered.
Larry
>And just for fun, its multi-threaded.
I'm not really literate here at all. Could you flesh these two statements out for me? What does multi-threaded mean? That they can roam several servers simultaneously?
>didn't use the same base object for his http requests.
Same with 'base object'. What is the base object?
I've not become too learned on this subject, yet I find it fascinating how 'bots work.
Thanks.
Multi-threaded means a program that does more than one thing at a time. Pretty much any GUI program nowadays is multi-threaded, so that it can do stuff in the background. For instance, a word processor will do pagination or printing in the background while you type, responding to your typing right away and using the (for the computer, lengthy) period between keystrokes to do something else.
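A minimal illustration in Java (picked only because the log hints at a Java client): one thread does a lengthy background job while the main thread carries on immediately.

```java
public class TwoThreads {
    // the "background" job: something lengthy the user shouldn't wait on,
    // standing in for pagination or printing in a word processor
    static long backgroundWork() {
        long sum = 0;
        for (int i = 1; i <= 1000; i++) sum += i;
        return sum;
    }

    public static void main(String[] args) throws InterruptedException {
        Thread background = new Thread(
            () -> System.out.println("background finished: " + backgroundWork()));
        background.start();
        // the main thread carries on right away, like the word processor
        // responding to keystrokes while pagination runs behind the scenes
        System.out.println("main thread still responsive");
        background.join();  // wait for the background work before exiting
    }
}
```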
Apache is multi-tasking, with higher priority given to serving pages than to logging, for instance. I'm not familiar with the internals, but my guess would be that the web server has multiple tasks for serving pages, managing its cache, logging, CGI, PHP, etc. Pretty much each 'mod' you load is its own task.
Likewise for a spider. It might have one task that goes out to read robots.txt, and that is all that task knows how to do. Another task, which would have to be smarter in order to understand HTML, would read web pages, and a third task might be specially designed to read images.
A 'good' programmer would build a basic HTTP task that knows how to fetch URLs. Using objects, the programmer would extend that basic task to read robots.txt, html, gif, jpg, etc. So if the basic task identified itself as "Qarp-0.33" then all the derived tasks would also be "Qarp", but if they were all separate tasks with no common base, anything goes -- which is what I think might be happening in your case.
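A sketch of that 'good' design (all names here are invented for illustration, not taken from any real bot): the base task owns the User-Agent once, and every derived task inherits it.

```java
// Sketch of the shared-base approach: one base HTTP task owns the
// User-Agent, and the specialized tasks extend it, so every request
// from the bot identifies itself the same way.
public class BotDesign {
    static class HttpTask {
        String userAgent() { return "Qarp-0.33"; }  // set once, in the base
    }
    static class RobotsTask extends HttpTask { }    // reads robots.txt
    static class HtmlTask extends HttpTask { }      // reads web pages
    static class ImageTask extends HttpTask { }     // reads gif, jpg, etc.

    public static void main(String[] args) {
        // All derived tasks report the same UA because they share the base:
        System.out.println(new RobotsTask().userAgent());
        System.out.println(new HtmlTask().userAgent());
        System.out.println(new ImageTask().userAgent());
    }
}
```

Drop the common base, give each task its own headers, and anything goes -- which is the mixed-UA pattern in the log above.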
Hope that clears things up,
Larry