You could maybe check the referrer and abort if it's empty. However, you would need to do this server-side, perhaps with PHP, processing all your .js files through PHP.
However, the MAJOR problem with this is that the referrer is unreliable. The referrer can be empty under perfectly normal conditions (for some users), and your pages would break for them.
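As a rough illustration of the server-side idea (with the caveat above that it will also lock out legitimate users whose browsers or proxies strip the referrer), an Apache .htaccess sketch might look like this; the `.js` pattern and the 403 response are assumptions for illustration, not something tested against this bot:

```apache
# Hypothetical sketch: refuse .js requests that arrive with no referrer.
# WARNING: many legitimate users send an empty referrer, so this WILL
# break pages for some visitors -- shown only to illustrate the idea.
RewriteEngine On
RewriteCond %{HTTP_REFERER} ^$
RewriteRule \.js$ - [F,L]
```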
Ok, I had a feeling .php would be the best option. Thank you.
The problem I'm having is the TalkTalk (UK ISP) Anti Virus Scanning bot. It hits my .js files around 20 times every day and is thus starting to use quite a chunk of bandwidth.
I've blocked its obvious UAs and as many IPs as I can trace, but it has now started using users' own IPs, which clearly would not be good to block. I don't mind them virus-checking my files; I just want them to stop doing it so damned often.
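For reference, the UA and IP blocking described above is usually done in .htaccess; the UA token below is the one reported later in this thread, while the IP range is a placeholder (documentation range), not an address traced from anyone's logs:

```apache
# Hypothetical sketch: deny by advertised user agent and by a traced
# IP range. 192.0.2.0/24 is a placeholder -- substitute what your own
# logs show. Does nothing against visits that spoof a normal browser UA.
SetEnvIfNoCase User-Agent "TalkTalk Virus Alerts" bad_bot
Order Allow,Deny
Allow from all
Deny from env=bad_bot
Deny from 192.0.2.0/24
```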
However, this is off-topic for this thread now so I'll maybe post elsewhere.
Does the TalkTalk bot not obey robots.txt?
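For completeness, the robots.txt rule one would normally try first looks like this; the exact user-agent token the bot would match on is a guess, and the reply below reports that this scanner ignores robots.txt anyway:

```
# Hypothetical robots.txt entry -- the token is a guess based on the
# UA string seen in logs, and this particular bot reportedly ignores it.
User-agent: TalkTalk Virus Alerts Scanning Engine
Disallow: /
```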
Just to follow on from my post above (in light of what you have mentioned). When a bot requests your page (or .js files) I'm not sure that this is the same as a 'direct request' anyway?!
No, it has ignored robots since day one. That was my very first attempt at stopping it but despite calling the robots file at almost every visit, it simply disregards it.
No, I guess you're correct there, but I'd still be more comfortable knowing I can stop it if/when its usage gets beyond acceptable. I've even tried emailing their support twice but have had no response, as I'm not a TalkTalk/Tiscali customer.
Well, look at it from the user's point of view. If you're doing a virus scan of every site you visit, or every unfamiliar site (don't know how it works at their end), and the scanner says "I'm not allowed to check" or "I couldn't get in", wouldn't that make you suspicious of the site?
I just did a quick log check and confirmed that TalkTalk's visits are always associated with a human visit to the same site at the same time. If two different people visit the same page in close succession, you'll see two separate TalkTalk visits. They only check text files-- in my case html and associated js. (Not css, so if you've figured out how to make a virus that lives in an external stylesheet you're home free.)
Ah OK, ... so this is a virus scanner running on the user's (home) computer? If the scanner is tied to user requests then it shouldn't be impacting your bandwidth too much, should it? Or is the content being downloaded twice - once by the scanner, once by the browser (since the scanner appears to be making its own requests)?!
I run Avast! anti virus which also has a web scanner, but it doesn't request any web content itself as far as I'm aware. It appears to intercept the content that the browser has requested.
But if the virus scanner bot is requesting site content and making its presence known by specifying a friendly User Agent then couldn't a malicious site cloak its 'bad' content anyway?!
They do a separate download, immediately adjacent to the user request. (Logs being logs, it might show up as either a second before or a second later. But I think this is about logs' timekeeping-- I've seen similar hiccups in above-suspicion contexts-- not any weirdness at their end.)
IPs I've seen recently (multiple occurrences of each):
where the nnn is not my obfuscation, it means more than one number in this slot. They do download their own copy of the page, but they don't download images, so unless you have really fat pages it shouldn't be much of a bandwidth issue.
Quick edit: Oops. When I said "same site" above, I of course meant "same page". Or same URL, if you want to be snarky about it ;)
|They do a separate download, immediately adjacent to the user request. |
This might be the case with this TalkTalk Anti Virus Scanner (I do see "(TalkTalk Virus Alerts Scanning Engine)" in my logs), but I see nothing to suggest that my "Avast! Web Shield" is doing the same; there are no additional entries in my access logs. Maybe this TalkTalk scanner is a different animal?
Definitely. I don't recognize the name Avasti-- and my computer came up cold when I asked it to search the Raw Logs directory-- so it's probably not doing anything to make itself visible in site logs. But somehow it's got to intercept the files before they arrive at your own computer, because otherwise it would be too late. I mean, you can't just ask an infected file to wait politely in the router while you do your examination ;)
|I just did a quick log check and confirmed that TalkTalk's visits are always associated with a human visit to the same site at the same time. |
On my site this is not the case. The 'TalkTalk Anti Virus Engine' bot AND other UAs using TalkTalk/Tiscali IPs visit regularly (sometimes four times per day) and only visit my *.js AND *.css files. On some occasions the visits are in conjunction with genuine human visitors, but on many they are not.
Perhaps I'm being a little over-cautious and thrifty when it comes to bandwidth usage but I would rather prevent than cure.
For the record, I've never seen Avasti in my logs.
It is actually "Avast!" with an exclamation mark (not an 'i' - eye) - a fairly common anti virus software in the UK. It seems to intercept all web traffic as it arrives at the computer, but before the application that requested it gets its hands on it! So far today, the computer has been on for almost 5 hours and it has scanned 4,664 files, mostly *.js by the looks. I have several browsers and lots of webpages open, but some pages are fairly active with AJAX requests so the scanned file count steadily increases without having to visit any new pages.
Do we know what this TalkTalk bot actually is? It might be a browser plugin that validates sites in search results?! Avast! have recently created a similar tool but I have chosen not to install it.
|All pages are scanned by the scanning engine regardless of having home safe activated or deactivated |
The TalkTalk Network / Virus Alerts
This also rang some bells:
|The product (according to press sources) that you are utilising is the HuaweiSymantecSpider which according to [huaweisymantec.com...] states that it honours robots.txt - if TalkTalk have changed (or written their own) product then why have TalkTalk not adhered to industry standards (though given TalkTalks lack of knowledge on other industry standards - e.g. RDNS - this doesn't surprise me one bit). |
The TalkTalk system apparently does a deep packet inspection of whatever the user is browsing, extracts the URLs the user visited, and then visits each URL about 20 seconds after the user did.
This does nothing to protect the user from anything malicious that might have been present on the website, as it checks the website after the user visited.
The bot reads public and private pages alike, and since it can replicate the exact same request the user made, complete with session IDs in the URL (has anyone tested to see if it ever sends the same cookies too?), their system can potentially "see" EXACTLY what the user just saw in their browser.
This has immense privacy implications, and could very well be illegal. The bot should be blocked but that is tricky as there are many user agents in use, including
TalkTalk Virus Alerts Scanning Engine
and several more that "look like" human visitors, including this one
Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0; .NET CLR 2.0.50727; .NET CLR 3.0.04506.648; .NET CLR 3.5.21022; .NET CLR 3.0.4506.2152; .NET CLR 3.5.30729; InfoPath.2)
but are not. Visits come from a variety of IPs, the only common thread being that the bot turns up about 20 seconds after a real visitor and makes the same request.
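Given that pattern - an identical request turning up roughly 20 seconds after a real visitor, from a different IP - one way to flag candidate scanner hits in a parsed access log might be a sketch like this Python fragment. The log entries, the field layout, and the 30-second window are all assumptions for illustration; real logs would need parsing first:

```python
from datetime import datetime

def find_shadow_requests(entries, window_seconds=30):
    """Flag requests that repeat an earlier request for the same URL
    from a DIFFERENT IP within `window_seconds` -- the signature this
    thread attributes to the TalkTalk scanner (~20s after the user).
    Each entry is a (ip, timestamp, request-line) tuple."""
    flagged = []
    for i, (ip, ts, req) in enumerate(entries):
        for prev_ip, prev_ts, prev_req in entries[:i]:
            gap = (ts - prev_ts).total_seconds()
            if req == prev_req and ip != prev_ip and 0 < gap <= window_seconds:
                flagged.append((ip, req))
                break
    return flagged

# Hard-coded sample log: a real visitor, then the same two requests
# repeated 20 seconds later from a different (placeholder) IP.
log = [
    ("203.0.113.5",  datetime(2010, 6, 1, 12, 0, 0),  "GET /page.html"),
    ("203.0.113.5",  datetime(2010, 6, 1, 12, 0, 1),  "GET /script.js"),
    ("198.51.100.9", datetime(2010, 6, 1, 12, 0, 20), "GET /page.html"),
    ("198.51.100.9", datetime(2010, 6, 1, 12, 0, 21), "GET /script.js"),
]
print(find_shadow_requests(log))
```

This only surfaces candidates for manual review; it cannot by itself distinguish the scanner from, say, two people behind different IPs visiting the same page in close succession.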
A number of extra details are noted in the thread at: [webmasterworld.com...]