I wrote a small script that parses raw Apache logs to analyze which search-engine queries led visitors to a page on my site, and it works pretty well. I noticed that some rows in the logs are repeated for some reason. For example, in the requests below, the IPs are the same, the request times are almost the same, and the request URLs and referers are the same; the only difference is the response size (and even that is the same for some).
Do you have any idea why this single request is recorded as multiple requests? Is it something like a browser optimization, where the browser sends multiple concurrent requests for the same page (each retrieving a different chunk) to get a faster response?
Also, given that AWStats generates its stats from these logs, how does this affect the AWStats reports? Are these counted as 8 page hits in AWStats?
Thanks...
68.999.132.25 - - [30/Jul/2005:05:34:24 -0400] "GET /content/blogcategory/57/82/ HTTP/1.1" 200 35042 "http://www.google.com/search?hl=en&q=fuzzy+blue+widgets&meta=" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)"
68.999.132.25 - - [30/Jul/2005:05:34:25 -0400] "GET /content/blogcategory/57/82/ HTTP/1.1" 200 25270 "http://www.google.com/search?hl=en&q=fuzzy+blue+widgets&meta=" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)"
68.999.132.25 - - [30/Jul/2005:05:34:25 -0400] "GET /content/blogcategory/57/82/ HTTP/1.1" 200 35042 "http://www.google.com/search?hl=en&q=fuzzy+blue+widgets&meta=" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)"
68.999.132.25 - - [30/Jul/2005:05:34:26 -0400] "GET /content/blogcategory/57/82/ HTTP/1.1" 200 20974 "http://www.google.com/search?hl=en&q=fuzzy+blue+widgets&meta=" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)"
68.999.132.25 - - [30/Jul/2005:05:34:29 -0400] "GET /content/blogcategory/57/82/ HTTP/1.1" 200 35042 "http://www.google.com/search?hl=en&q=fuzzy+blue+widgets&meta=" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)"
68.999.132.25 - - [30/Jul/2005:05:34:32 -0400] "GET /content/blogcategory/57/82/ HTTP/1.1" 200 32430 "http://www.google.com/search?hl=en&q=fuzzy+blue+widgets&meta=" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)"
68.999.132.25 - - [30/Jul/2005:05:34:32 -0400] "GET /content/blogcategory/57/82/ HTTP/1.1" 200 20974 "http://www.google.com/search?hl=en&q=fuzzy+blue+widgets&meta=" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)"
68.999.132.25 - - [30/Jul/2005:05:34:32 -0400] "GET /content/blogcategory/57/82/ HTTP/1.1" 200 20974 "http://www.google.com/search?hl=en&q=fuzzy+blue+widgets&meta=" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)"
[edited by: jdMorgan at 12:58 am (utc) on Aug. 2, 2005]
[edit reason] Obscured specifics per TOS. [/edit]
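For readers following along, here is a minimal sketch of the kind of parsing described in the post above. It is not the poster's actual script. It assumes the Apache "combined" log format seen in the sample lines and that the search terms arrive in the referer's `q` parameter, as in the Google referers shown. The names `LOG_PATTERN`, `parse_line`, and `search_terms` are hypothetical.

```python
import re
from urllib.parse import urlparse, parse_qs

# Assumed layout (Apache "combined" format):
# host ident user [time] "request" status bytes "referer" "user-agent"
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) (?P<protocol>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)


def parse_line(line):
    """Return the log fields as a dict, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None


def search_terms(referer, param="q"):
    """Return the decoded search phrase from a referer URL, or None."""
    query = parse_qs(urlparse(referer).query)
    values = query.get(param)
    return values[0] if values else None
```

Feeding each log line to `parse_line` and then passing the `referer` field to `search_terms` gives the phrase the visitor searched for. Other search engines may use a different parameter name, which is why `param` is left adjustable.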
No, that's eight unique requests. If these were partial-GET requests, you'd see a server response code of 206 (Partial Content) [w3.org], not 200. This looks like a "keyword research 'bot" to me.
Do you have dynamic content, such as SSI or PHP on these pages? That could account for the varying response length (byte count).
Jim
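The check described in the reply above can be sketched roughly as follows: group log lines by IP, URL, and referer, find the largest number of hits that fall inside a short time window, and list the status codes seen, so 200 (full response) can be told apart from 206 (partial content). This is only an illustrative sketch assuming the combined log format from the sample lines. The names `LINE_RE` and `find_bursts` and the default thresholds are hypothetical choices, not anything from the thread.

```python
import re
from collections import defaultdict
from datetime import datetime, timedelta

# Only the fields needed here: host, timestamp, URL, status, referer.
LINE_RE = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"\S+ (?P<url>\S+) [^"]*" (?P<status>\d{3}) (?:\d+|-) '
    r'"(?P<referer>[^"]*)"'
)


def find_bursts(lines, window_seconds=10, min_hits=3):
    """Report (host, url, referer) groups with many hits in a short window."""
    groups = defaultdict(list)
    for line in lines:
        m = LINE_RE.match(line)
        if not m:
            continue  # skip lines that are not in the assumed format
        when = datetime.strptime(m["time"], "%d/%b/%Y:%H:%M:%S %z")
        groups[(m["host"], m["url"], m["referer"])].append((when, m["status"]))

    window = timedelta(seconds=window_seconds)
    bursts = []
    for key, hits in groups.items():
        hits.sort()
        # Sliding window: largest number of hits within `window` of each other.
        best, start = 0, 0
        for end in range(len(hits)):
            while hits[end][0] - hits[start][0] > window:
                start += 1
            best = max(best, end - start + 1)
        if best >= min_hits:
            bursts.append({
                "key": key,
                "max_hits_in_window": best,
                "statuses": sorted({status for _, status in hits}),
            })
    return bursts
```

Run against the eight sample lines, every hit in the group carries status 200, which fits the point that these are separate full requests rather than partial-GET chunks.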
Actually, this URL is for a PHP page.
The thing is, the IP in question is a home IP address from a DSL provider. Also, if it were a keyword 'bot, wouldn't it be using different search words, possibly with a different user agent? Why do you think it keeps coming to the page with the same Google search?
Thanks for the information. This has been puzzling me for quite some time, and I appreciate any information on this...
'Bots typically either ignore the 403 and continue requesting URLs not linked-to from my custom 403 page, or they stop dead and go away. Humans typically try their original request several times in quick succession, and finally may take the time to actually read my custom 403 page, which contains some helpful info.
Jim