
Forum Moderators: DixonJones & mademetop


HTTP GET request: text only



12:08 am on May 1, 2013 (gmt 0)


Not sure if this is the right place, but here goes:

I am looking at my raw access logs and seeing that most of the "visits" are by bots. I want to filter such activity out to get a true sense of human visitors.

Filters for the biggies like Googlebot, Bing, etc. are easy. However, there are numerous other small ones that come and go. One thing I have noticed is that these small bots only download the landing page text but none of the related graphics.

So, I want to write a small PHP script on my pages that looks at the GET request and, if it requests only text, excludes it from my stats. How does one identify such text-only requests? I know of variables such as $_SERVER['REMOTE_ADDR'] but don't know how to get the text-only part.



12:46 am on May 1, 2013 (gmt 0)

WebmasterWorld Senior Member lucy24

If I understand your question correctly, you're looking for something that can't be done.

A page is, by definition, requested before its supporting files. So at the time the request is made there is no way to tell whether it will be followed by other requests. You can look at other aspects of the request header, but each request is an island.

If you're talking about getting information after the fact from raw logs, you don't need to know ahead of time whether it was human or robot, because you can look at the package. Up until a few years ago, all you needed to look for was a request for the favicon. Robots that ask for this are few and far between, and most are known quantities that you can filter out.
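The favicon heuristic can be sketched roughly like this -- an illustration only, with made-up sample log lines and very simplified combined-log parsing; a real script would stream the actual access log:

```javascript
// Tag an IP as "probably human" if any of its requests was for the
// favicon. The sample lines below are invented for the example.
const lines = [
  '10.0.0.1 - - [01/May/2013:00:01:00 +0000] "GET /index.html HTTP/1.1" 200 5120',
  '10.0.0.1 - - [01/May/2013:00:01:01 +0000] "GET /favicon.ico HTTP/1.1" 200 318',
  '10.0.0.2 - - [01/May/2013:00:02:00 +0000] "GET /index.html HTTP/1.1" 200 5120',
];

function probablyHumanIps(logLines) {
  const human = new Set();
  for (const line of logLines) {
    // Capture the client IP and the requested path from each log line.
    const m = line.match(/^(\S+).*"GET (\S+) /);
    if (m && m[2] === '/favicon.ico') human.add(m[1]);
  }
  return human;
}

const humans = probablyHumanIps(lines);
console.log([...humans]); // only the IP that fetched the favicon
```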

Now, thanks to all those ### mobiles, it's trickier. Add requests for apple-touch-icon to the list. Then look for packages, such as search-engine query followed by requests for supporting files.

There are a few other robot flags that jump out at you in logs. One I use is the auto-referer: any request for a page that names the page itself as referer. This of course only works if your pages don't link to themselves-- no active "home" link on the home page, that kind of thing. Internal fragment links in # are OK because your server won't see them.

Another that works for me but may not work for everyone is requests giving my front page as referer when the page isn't linked from the front page-- in my case, everything but top-level directory index files.
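The auto-referer check described above could look something like this -- a sketch that assumes you've already pulled the host, path, and Referer value out of the log line:

```javascript
// Flag a request whose Referer names the requested page itself.
// "-" is the common placeholder for an empty referer in Apache logs.
function isAutoReferer(host, path, referer) {
  if (!referer || referer === '-') return false;
  try {
    const r = new URL(referer);
    return r.host === host && r.pathname === path;
  } catch (e) {
    return false; // malformed referer string
  }
}

console.log(isAutoReferer('example.com', '/page.html', 'http://example.com/page.html')); // auto-referer
console.log(isAutoReferer('example.com', '/page.html', 'http://example.com/'));          // ordinary internal link
```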

All this is assuming you want to do the work yourself by processing raw logs. Option B, of course, is to use a javascript-based analytics package like Piwik or GA. All you need to do then is block known robots from getting at the analytics code. Minor robots don't matter because they don't execute javascript anyway; you only need to screen out a few major search engines. They know who they are ;)


9:54 am on May 1, 2013 (gmt 0)

Hi Lucy,
Thanks for the reply. I understand the http thing better now.

When I looked at my raw logs closely I realized that even though the stats say I am getting 80+ visitors a day, in reality it's about 20 human visitors a day!

I have created a small JavaScript that makes an AJAX call to log the human visit.
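Such a beacon might be built like this -- a sketch only; the endpoint name /log-visit.php is an assumption, not something from this thread:

```javascript
// Build the beacon URL the page would request. Kept as a pure function
// so the URL-building logic is testable outside a browser.
function beaconUrl(pathname) {
  return '/log-visit.php?page=' + encodeURIComponent(pathname);
}

// In the browser you would then fire it, e.g.:
//   new Image().src = beaconUrl(location.pathname);
console.log(beaconUrl('/about us.html')); // /log-visit.php?page=%2Fabout%20us.html
```

Minor bots never execute the script, so only human visits (and javascript-capable crawlers) hit the endpoint.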


11:32 am on May 3, 2013 (gmt 0)


I have written a small JavaScript, but some bots are able to reach the callback function. How do I block robots from getting to my code?


8:40 pm on May 3, 2013 (gmt 0)

WebmasterWorld Senior Member lucy24

This is getting a bit circular isn't it? The code is intended to identify robots, so you need robots to keep out of the code so they don't get identified as...

:: pause for head to stop spinning ::

You could, of course, tuck the js into a roboted-out directory. But this will only stop the law-abiding robots. Beyond that, you can block robots in htaccess-- but, again, only if you know their names.
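For the roboted-out directory, the robots.txt entry is just (using /piwik/ as an example directory name, as in the rewrite rule below):

```
User-agent: *
Disallow: /piwik/
```

Law-abiding crawlers will then skip everything under /piwik/; the htaccess rules have to handle the ones that don't.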

:: shuffling papers ::

RewriteCond %{REMOTE_ADDR} ^(65\.5[2-5]|131\.253\.[2-4]\d|157\.(5[4-9]|60)|207\.46|209\.8[45])\. [OR]
RewriteCond %{HTTP_USER_AGENT} ([a-z]Bot|facebook|pinterest|Googlebot|Seznam|Preview) [NC,OR]
RewriteCond %{HTTP_REFERER} cache
RewriteRule (piwik|dir2|dir3)/ - [F]

That kind of thing. That's a direct cut-and-paste; the IP addresses are Bing's. Er, yes, "[a-z]Bot" plus "Googlebot" is redundant; I neglected to edit after adding [NC].

Just how many different robots are involved? Aside from major search engines and scrapers,* most don't bother with javascript.

* Squinting modifier, but I decided to leave it that way ;)
