I manage a large non-commercial Internet site (about half a million requests a day) and we're getting a lot of traffic from a single IP address (which I have found out is a popular commercial web proxy content filtering service). The thing is hammering our website - in fact it is responsible for almost 40% of all our requests! This has been going on 24/7 for months.
The thing's user agent identifies itself as either:
Microsoft Scheduled Cache Content Download Service
or
Fetch API Request
The thing visits most of the pages on our site, but the overwhelming majority of hits are to the pages that serve up content from a large database of publicly accessible content we maintain. The thing seems to know how to post keywords that closely match our industry sector into the search form so that search results are returned. Sometimes the keywords are even misspelt.
It seems to request each page twice.
Does anyone know what it might be doing?
I don't know; however, it's a safe assumption that they are either caching your pages or filtering the content for a user who has their software installed.
1) Fetch API is in almost every ban list ever created.
2) "Microsoft Scheduled Cache Content Download Service"
If a UA like this appeared in my visitor logs, it wouldn't matter to me who the visitor was or where they were coming from! Both the UA and the IP range would be denied.
There are keywords that nearly demand denial of access:
reap, fetch, download, cache, spider, link, agent, crawl, email, find, gather, loader, java, larbin, library, LWP, probe, capture, ANYTHING that begins with the word web, and there may even be others that I have missed or not seen.
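For anyone who wants to try the idea before touching the server config, here is a rough sketch in Python of the check described above: deny a request if its UA contains one of the keywords, or if its IP falls in a denied range. The function name `is_denied`, the constants, and the 192.0.2.0/24 range are all made up for illustration (that range is a reserved documentation block, not the proxy's real address). I have also assumed that "begins with the word web" means the UA string itself starts with "web", and that matching is case-insensitive.

```python
import ipaddress
import re

# Keywords copied from the list above; matched case-insensitively as substrings.
DENY_KEYWORDS = [
    "reap", "fetch", "download", "cache", "spider", "link", "agent",
    "crawl", "email", "find", "gather", "loader", "java", "larbin",
    "library", "lwp", "probe", "capture",
]

# Assumption: "begins with the word web" means the UA string starts with "web".
DENY_PATTERN = re.compile(
    r"^web|" + "|".join(re.escape(k) for k in DENY_KEYWORDS),
    re.IGNORECASE,
)

# Hypothetical range (reserved documentation block); substitute the real one.
DENY_NETWORKS = [ipaddress.ip_network("192.0.2.0/24")]


def is_denied(user_agent: str, remote_ip: str) -> bool:
    """Return True if the UA matches a keyword or the IP is in a denied range."""
    if DENY_PATTERN.search(user_agent):
        return True
    addr = ipaddress.ip_address(remote_ip)
    return any(addr in net for net in DENY_NETWORKS)
```

Both UAs from the original post would be caught by this (one on "cache" and "download", the other on "fetch"). Short substrings such as "link", "agent" or "find" can also turn up in legitimate clients, so it is worth running a list like this against your own logs before denying anything for real.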
Don