I assume it was a Google IP - the one you give is a local (private) network IP and so irrelevant.
The IP quoted in the other WebmasterWorld posting was certainly G but not one of their bot IPs. In fact it's in a range 188.8.131.52/16 that has been used in illegal - or certainly immoral - activities in the past (my notes say: this range suspect (apart from known bots) following mocality revelation - http blog[.]mocality[.]co[.]ke/2012/01/13/google-what-were-you-thinking/). I re-enabled the range briefly in June and quickly disabled it again following a lot of bad (ie non-bot) accesses.
My guess, with no knowledge but deep suspicion of G, is that they are testing something they probably do not want us to know about.
My own method of protection is: block ALL G ranges except for small ranges that can be shown by rDNS to be proper crawlers, and then to evaluate known user-agents against those IPs. Works for me, anyway, and for many here. :)
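For anyone wanting to try the rDNS part, here is a minimal sketch of a forward-confirmed reverse DNS check. The hostname suffixes and the function name `is_verified_crawler` are assumptions for illustration, not anything the poster described; the `reverse` and `forward` parameters exist only so the logic can be exercised without live DNS.

```python
import socket

# Assumed suffixes for genuine Google crawler hostnames; adjust to taste.
CRAWLER_SUFFIXES = (".googlebot.com", ".google.com")


def is_verified_crawler(ip, reverse=socket.gethostbyaddr,
                        forward=socket.gethostbyname_ex):
    """Forward-confirmed reverse DNS check for a claimed crawler IP."""
    try:
        # gethostbyaddr returns (hostname, aliases, addresses).
        host = reverse(ip)[0].lower().rstrip(".")
    except OSError:
        # No PTR record, or the lookup failed.
        return False
    if not host.endswith(CRAWLER_SUFFIXES):
        return False
    try:
        # gethostbyname_ex returns (hostname, aliases, addresses).
        addresses = forward(host)[2]
    except OSError:
        return False
    # The PTR name must resolve back to the same IP, otherwise it is forged.
    return ip in addresses
```

Only after this passes would the user-agent be compared against the agents expected from that IP.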
Thank you dstiles.
Unfortunately our Balancer does not retain the original IP at this time, so it's always one private IP that gets logged. This is going to change in the future.
But I think you are probably right that this is Google (I assume G) - we recently got a chunk of the patent hosting business that they provided in the past - lots of downloads of huge files.
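On the balancer problem: many load balancers can pass the original address along in an `X-Forwarded-For` header. Whether this particular balancer does is an assumption, as are the function name and the proxy range below. A rough sketch of recovering the client IP while trusting only your own proxies:

```python
import ipaddress


def client_ip(remote_addr, xff_header, trusted_proxies):
    """Return the first untrusted hop, reading X-Forwarded-For right to left."""
    trusted = [ipaddress.ip_network(n) for n in trusted_proxies]

    def is_trusted(ip):
        return any(ip in net for net in trusted)

    # If the direct peer is not one of our proxies, ignore the header entirely.
    if not is_trusted(ipaddress.ip_address(remote_addr)):
        return remote_addr

    hops = [h.strip() for h in (xff_header or "").split(",") if h.strip()]
    for hop in reversed(hops):
        try:
            ip = ipaddress.ip_address(hop)
        except ValueError:
            # Garbage in the header: stop and fall back to the peer address.
            break
        if not is_trusted(ip):
            return hop
    return remote_addr
```

Reading from the right matters because anything to the left of your own proxy's entry can be written by the client.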
If you cannot detect source IPs and make decisions on blocking them then your web site is toast. It WILL be scraped by everyone and their goat.
Primary defence against this is to detect source (and/or proxy) IP and block all coming from server farms, including all of Amazon, most of google, and so on. There are major lists of IPs to block in this forum but the list is by no means definitive.
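As a rough illustration of the primary defence, range blocking with a few carve-outs might look like this. The ranges are placeholder documentation blocks, not real server-farm ranges; the real lists are in the forum threads mentioned above.

```python
import ipaddress

# Placeholder ranges (RFC 5737 documentation blocks), NOT real server farms.
BLOCKED = [ipaddress.ip_network(n) for n in ("192.0.2.0/24", "198.51.100.0/24")]
# Small carve-outs inside blocked ranges, e.g. verified crawler addresses.
ALLOWED = [ipaddress.ip_network("192.0.2.64/28")]


def is_blocked(ip):
    """True if ip falls in a blocked range and not in an allowed carve-out."""
    addr = ipaddress.ip_address(ip)
    if any(addr in net for net in ALLOWED):
        return False
    return any(addr in net for net in BLOCKED)
```

In practice most people do this in the server or firewall config rather than application code; the sketch only shows the allow-before-deny ordering.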
Secondary defence includes bad user-agents, bad headers and more. Again, check previous postings in this forum.
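A sketch of the secondary checks, assuming request headers arrive as a plain dict. The patterns and the "missing Accept" rule are illustrative guesses, not a vetted rule set.

```python
import re

# Hypothetical deny patterns, taken from agents mentioned in this thread.
BAD_UA_PATTERNS = [re.compile(p, re.I) for p in (r"^curl/", r"google-xrawler")]


def suspicion_reasons(headers):
    """Return a list of reasons a request looks suspicious (empty if none)."""
    h = {k.lower(): v for k, v in headers.items()}
    ua = h.get("user-agent", "").strip()
    reasons = []
    if not ua:
        reasons.append("missing user-agent")
    elif any(p.search(ua) for p in BAD_UA_PATTERNS):
        reasons.append("denied user-agent")
    # Real browsers send an Accept header; many crude scripts do not.
    if "accept" not in h:
        reasons.append("missing accept header")
    return reasons
```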
To me "xrawler" screams typo. Look at your keyboard.
|Sometimes it uses HEAD |
HEAD /payb.php/downloads/abc/2013/abc.tar - HTTP/1.1 google-xrawler
Google has stated they do not use HEAD requests, because in their testing it really didn't provide any faster spidering since they had to make two requests rather than one for any modified page request and when they're just grabbing HTML it's a very fast process to just use GET.
It may be they've started using HEAD for grabbing something like a .tar, but personally I'd guess it's a spoof until proved otherwise.
I am just a developer who was in charge of writing code to download files of up to 10GB - thousands of them. To do this I had to understand request behaviors, distinguish between good and bad requests, and somehow avoid downloads when possible (GET, HEAD, Last-Modified, etc).
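To make the "avoid downloads when possible" idea concrete, here is a minimal sketch of the decision for a conditional request. The function name and return shape are made up for illustration; it assumes the file's modification time is known as a timezone-aware datetime.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def response_for(method, if_modified_since, file_mtime):
    """Return (status, send_body) for a GET or HEAD on a static file."""
    # HTTP dates have one-second resolution, so drop sub-second precision.
    mtime = file_mtime.replace(microsecond=0)
    since = None
    if if_modified_since:
        try:
            since = parsedate_to_datetime(if_modified_since)
        except (TypeError, ValueError):
            since = None  # unparseable header: treat as unconditional
        if since is not None and since.tzinfo is None:
            since = since.replace(tzinfo=timezone.utc)
    if since is not None and mtime <= since:
        return 304, False          # client copy is current, skip the transfer
    return 200, method == "GET"    # HEAD gets headers only, never the body
```

With multi-gigabyte files, the 304 and HEAD paths are where the bandwidth is saved.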
On the other hand, it is desired or even required that these files are public and available to any friendly customers.
My Network group is in charge of tuning the balancer, so I am hoping that they will make it work soon and that we will be able to block malicious IPs. I do not need to mention how much malicious requests slow down downloads.
And yes lucy it is "google-xrawler"
...did not finish previous post.. it is "google-xrawler"
A few others I believe to be fake:
and I believe a few other suspects are being used maliciously:
"MSIE 7.0(compatible; Mozilla/5.0; Windows NT 6.0; DigExt)"
"Mozilla/5.0 (compatible; MSIE 6.0; Windows NT 5.0)"
"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)"
-- can anyone confirm - without IPs?
There are a LOT of "malicious" user-agents including those you mention. The specific discussion here refers to user-agents emanating from google and purporting to be valid google agents.
Mozilla/6.0 (compatible) is a generic UA used by (as far as I can determine) proxy servers such as bluecoat. They are harmless and require only a temporary block, not a full-blown IP ban.
Dutch, I didn't mean typo on your part :) I meant that when a robot spoofs someone else's name to get past UA blocks, sometimes its original programmer hits the wrong key. For comparison purposes I recently posted about an "App3leWebKit/53.1".
Behind every stupid robot is a stupid human.
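The keyboard-slip idea can be automated, roughly. This sketch flags user-agent tokens that are close to, but not exactly, a well-known token. The token list, the 0.85 cutoff and the five-character minimum are arbitrary assumptions chosen for the two examples in this thread.

```python
import difflib
import re

# Tokens a genuine agent would spell correctly (illustrative, not complete).
KNOWN_TOKENS = ["mozilla", "applewebkit", "googlebot", "crawler", "compatible"]


def near_miss_tokens(user_agent, cutoff=0.85):
    """Return (token, probable_intended_token) pairs for likely typos."""
    hits = []
    for token in re.findall(r"[a-z0-9]+", user_agent.lower()):
        # Skip short fragments and correctly spelled known tokens.
        if len(token) < 5 or token in KNOWN_TOKENS:
            continue
        match = difflib.get_close_matches(token, KNOWN_TOKENS, n=1, cutoff=cutoff)
        if match:
            hits.append((token, match[0]))
    return hits
```

A hit is only a hint worth a look in the logs, not grounds for a ban by itself.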
Another G IP range, found today (updated 6-sep)...
NetRange: 184.108.40.206 - 220.127.116.11
Comment: ** The IP addresses under this Org-ID are in use by Google Cloud customers ***
Draw your own conclusions / block policy. :)
My example hit came in with curl, one of the most common and blatant scrapers.
And another new (to me) google cloud range...
18.104.22.168 - 22.214.171.124
And finally I have the IP addresses available. It seems that all of these are Google Translate:
Organization: Google Translate
Ah, there should be white space between the IP address and the Agent, like this:
IP ----------- Agent
Block the whole /16. With G it's the only way. There is nothing useful there if you discount translate (which is often used for other fell purposes as well). :(
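If you only have a single offending address and want the /16 around it, Python's `ipaddress` module can work it out. The address below is a documentation placeholder, not one of the ranges discussed here.

```python
import ipaddress


def enclosing_slash16(ip):
    """Return the /16 network containing ip (host bits are masked off)."""
    return ipaddress.ip_network(f"{ip}/16", strict=False)


# Placeholder address from a documentation range.
network = enclosing_slash16("203.0.113.77")
print(network)                                         # the /16 to block
print(ipaddress.ip_address("203.0.200.1") in network)  # membership test
```

`strict=False` is what lets you pass an address with host bits set instead of the bare network address.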