google-xrawler

Unknown User agent "google-xrawler"

     
5:22 pm on Aug 22, 2013 (gmt 0)

New User

joined:Aug 22, 2013
posts: 6
votes: 0


Does anyone know what this user agent is: google-xrawler?

I found only one trace of it on this forum, from two years ago:
[webmasterworld.com ]

There is nothing else about it on the entire Internet.

Sometimes it only hits pages:
GET /pgmpi.php - 80 - 172.31.88.68 HTTP/1.1 google-xrawler

Sometimes it uses HEAD:
HEAD /payb.php/downloads/abc/2013/abc.tar - HTTP/1.1 google-xrawler

Sometimes it requests the same download several times a second:
2013-08-22 15:32:35 GET /payb.php/downloads/abc/2013/abc.tar HTTP/1.1 google-xrawler
2013-08-22 15:32:35 GET /payb.php/downloads/abc/2013/abc.tar HTTP/1.1 google-xrawler
2013-08-22 15:32:35 GET /payb.php/downloads/abc/2013/abc.tar HTTP/1.1 google-xrawler
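
A minimal sketch of how bursts like the one above could be picked out of a log, assuming every line follows the layout of the three sample lines (date, time, method, path, protocol, user agent). The function name `find_bursts` and the threshold of 3 are illustrative choices, not anything taken from the server in question.

```python
from collections import Counter

def find_bursts(log_lines, threshold=3):
    """Return {(timestamp, path, agent): count} for GET requests that
    repeat at least `threshold` times within the same second."""
    counts = Counter()
    for line in log_lines:
        parts = line.split()
        # Assumed layout: date time method path protocol user-agent
        if len(parts) < 6:
            continue
        date, time, method, path, _protocol = parts[:5]
        agent = " ".join(parts[5:])  # the agent string may contain spaces
        if method != "GET":
            continue
        counts[(f"{date} {time}", path, agent)] += 1
    return {key: n for key, n in counts.items() if n >= threshold}

# The three identical lines quoted in the post.
SAMPLE = [
    "2013-08-22 15:32:35 GET /payb.php/downloads/abc/2013/abc.tar HTTP/1.1 google-xrawler",
] * 3

bursts = find_bursts(SAMPLE)
```

With the sample lines, `bursts` holds a single entry with a count of 3 for that path and agent.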
7:38 pm on Aug 23, 2013 (gmt 0)

Senior Member from GB 

WebmasterWorld Senior Member dstiles is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:May 14, 2008
posts:3091
votes: 2


I assume it was a google IP - the one you give is a local (private) network IP and so irrelevant.

The IP quoted in the other WebmasterWorld posting was certainly G but not one of their bot IPs. In fact it's in a range 74.125.0.0/16 that has been used in illegal - or certainly immoral - activities in the past (my notes say: this range suspect (apart from known bots) following mocality revelation - http blog[.]mocality[.]co[.]ke/2012/01/13/google-what-were-you-thinking/). I re-enabled the range briefly in June and quickly disabled it again following a lot of bad (ie non-bot) accesses.

My guess, with no knowledge but deep suspicion of G, is that they are testing something they probably do not want us to know about.

My own method of protection is: block ALL G ranges except for the small ranges that can be shown by rDNS to be proper crawlers, and then evaluate known user-agents against those IPs. Works for me, anyway, and for many here. :)
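
The rDNS check described above can be sketched roughly as follows. This is a forward-confirmed reverse DNS lookup, not necessarily the exact procedure used by the poster. The hostname suffixes are an assumption and should be checked against the search engine's current documentation; the `reverse` and `forward` parameters are hypothetical hooks so the lookup can be replaced in testing.

```python
import socket

# Assumed suffixes for genuine crawler hostnames (verify before use).
CRAWLER_SUFFIXES = (".googlebot.com", ".google.com")

def is_verified_crawler(ip, reverse=socket.gethostbyaddr,
                        forward=socket.gethostbyname_ex):
    """Forward-confirmed reverse DNS check for an IP claiming to be a crawler."""
    try:
        hostname = reverse(ip)[0]
    except OSError:
        return False  # no PTR record: treat as unverified
    if not hostname.lower().endswith(CRAWLER_SUFFIXES):
        return False
    try:
        addresses = forward(hostname)[2]
    except OSError:
        return False
    # The name must resolve back to the same IP, otherwise the PTR
    # record alone proves nothing.
    return ip in addresses
```

Only addresses that pass this check would then have their user-agent compared against the list of known crawler agents; everything else in the range stays blocked.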
8:21 pm on Aug 23, 2013 (gmt 0)

New User

joined:Aug 22, 2013
posts: 6
votes: 0


Thank you dstiles.

Unfortunately our balancer does not retain the original IP at this time, so it is always the same private IP that gets logged. This is going to change in the future.
But I think you are probably right that this is Google (I assume that is what G means) - we recently took over a chunk of patent-hosting business that they provided in the past, with lots of downloads of huge files.
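
One common way to recover the original address behind a balancer is for the balancer to add an X-Forwarded-For header. Nothing in this thread says that this particular balancer does so, so the following is only a sketch under that assumption. The `TRUSTED_PROXIES` set and the `client_ip` function are hypothetical, and treating 172.31.88.68 as the balancer address is a guess based on the log line quoted earlier.

```python
import ipaddress

# Assumption: the private address seen in the logs is the balancer.
TRUSTED_PROXIES = {"172.31.88.68"}

def client_ip(remote_addr, forwarded_for, trusted=TRUSTED_PROXIES):
    """Pick the client address, trusting X-Forwarded-For only when the
    direct peer is a known balancer."""
    if remote_addr not in trusted or not forwarded_for:
        return remote_addr
    # Walk the hops right to left; the first untrusted hop is the client.
    for hop in reversed([h.strip() for h in forwarded_for.split(",")]):
        try:
            ipaddress.ip_address(hop)
        except ValueError:
            return remote_addr  # malformed header: fall back to the peer
        if hop not in trusted:
            return hop
    return remote_addr
```

The header is only honoured when the connection itself comes from the balancer, since any client can send a forged X-Forwarded-For value.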
7:40 pm on Aug 24, 2013 (gmt 0)

Senior Member from GB 

WebmasterWorld Senior Member dstiles is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:May 14, 2008
posts:3091
votes: 2


If you cannot detect source IPs and make decisions on blocking them then your web site is toast. It WILL be scraped by everyone and their goat.

Primary defence against this is to detect source (and/or proxy) IP and block all coming from server farms, including all of Amazon, most of google, and so on. There are major lists of IPs to block in this forum but the list is by no means definitive.

Secondary defence includes bad user-agents, bad headers and more. Again, check previous postings in this forum.
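
As a rough illustration of the primary defence, here is a sketch that tests a source IP against a list of blocked ranges using Python's standard `ipaddress` module. The list holds only the 74.125.0.0/16 range mentioned earlier in this thread; a real block list would be far longer, and the name `is_blocked` is just illustrative.

```python
import ipaddress

# Only the range discussed earlier in this thread; extend as needed.
BLOCKED_NETWORKS = [ipaddress.ip_network("74.125.0.0/16")]

def is_blocked(ip, blocked=BLOCKED_NETWORKS):
    """Return True when the address falls inside any blocked range."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in blocked)
```

Exceptions for verified crawlers would be checked before this test, so that a confirmed bot inside a blocked range is still let through.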
9:46 pm on Aug 24, 2013 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time Top Contributors Of The Month

joined:Apr 9, 2011
posts:12696
votes: 244


To me "xrawler" screams typo. Look at your keyboard.
12:30 am on Aug 25, 2013 (gmt 0)

Senior Member

WebmasterWorld Senior Member Top Contributors Of The Month

joined:July 19, 2013
posts:1097
votes: 0


Sometimes it uses HEAD
HEAD /payb.php/downloads/abc/2013/abc.tar - HTTP/1.1 google-xrawler

Google has stated they do not use HEAD requests, because in their testing it really didn't provide any faster spidering since they had to make two requests rather than one for any modified page request and when they're just grabbing HTML it's a very fast process to just use GET.

It may be they've started using HEAD for grabbing something like a .tar, but personally I'd guess it's a spoof until proved otherwise.
12:38 pm on Aug 29, 2013 (gmt 0)

New User

joined:Aug 22, 2013
posts: 6
votes: 0


I am just a developer who was in charge of writing code to download files of up to 10GB - thousands of them. To do this I had to understand request behaviors, distinguish between good and bad requests, and avoid downloads where possible (GET, HEAD, Last-Modified, etc.).
On the other hand, it is desired, or even required, that these files are public and available to any friendly customer.
My network group is in charge of tuning the balancer, so I am hoping they will make it work soon and that we will be able to block malicious IPs. I do not need to mention how much malicious requests slow down downloads.
And yes lucy, it is "google-xrawler".
Thanks
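
One way of avoiding a download when the client already has a current copy is a conditional GET based on Last-Modified. The following is a minimal sketch of that decision only, not the code actually used on this site; the function name `response_status` is hypothetical, and `last_modified` is assumed to be a timezone-aware datetime.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

def response_status(if_modified_since, last_modified):
    """Return 304 when the client's copy is current, otherwise 200."""
    if not if_modified_since:
        return 200
    try:
        since = parsedate_to_datetime(if_modified_since)
    except (TypeError, ValueError):
        return 200  # unparseable header: send the full file
    if since.tzinfo is None:
        since = since.replace(tzinfo=timezone.utc)
    # HTTP dates have one-second resolution, so drop microseconds.
    return 304 if last_modified.replace(microsecond=0) <= since else 200

# Example values only; the date matches the log excerpt in this thread.
modified = datetime(2013, 8, 22, 15, 32, 35, tzinfo=timezone.utc)
status = response_status("Thu, 22 Aug 2013 15:32:35 GMT", modified)
```

A 304 response carries no body, so a well-behaved client that sends If-Modified-Since costs almost nothing, while a client that ignores the mechanism stands out in the logs.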
1:54 pm on Aug 29, 2013 (gmt 0)

New User

joined:Aug 22, 2013
posts: 6
votes: 0


...did not finish the previous post. It is "google-xrawler".

A few others that I believe to be fake:

"Mozilla/6.0 (compatible)"
"Internet Explorer"

and a few other suspects that I believe are used maliciously:

"MSIE 7.0(compatible; Mozilla/5.0; Windows NT 6.0; DigExt)"
"Mozilla/5.0 (compatible; MSIE 6.0; Windows NT 5.0)"
"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)"
"FDM 3.x"

-- can anyone confirm, without IPs?

Thanks
7:15 pm on Aug 29, 2013 (gmt 0)

Senior Member from GB 

WebmasterWorld Senior Member dstiles is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:May 14, 2008
posts:3091
votes: 2


There are a LOT of "malicious" user-agents including those you mention. The specific discussion here refers to user-agents emanating from google and purporting to be valid google agents.

Mozilla/6.0 (compatible) is a generic UA used by (as far as I can determine) proxy servers such as bluecoat. They are harmless and require only a temporary block, not a full-blown IP ban.
8:17 pm on Aug 29, 2013 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time Top Contributors Of The Month

joined:Apr 9, 2011
posts:12696
votes: 244


Dutch, I didn't mean typo on your part :) I meant that when a robot spoofs someone else's name to get past UA blocks, sometimes its original programmer hits the wrong key. For comparison purposes I recently posted about an "App3leWebKit/53.1".

Behind every stupid robot is a stupid human.
9:21 pm on Sept 14, 2013 (gmt 0)

Senior Member from GB 

WebmasterWorld Senior Member dstiles is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:May 14, 2008
posts:3091
votes: 2


Another G IP range, found today (updated 6-sep)...

NetRange: 173.255.112.0 - 173.255.127.255
CIDR: 173.255.112.0/20
NetName: GOOGLE-APPS
Comment: ** The IP addresses under this Org-ID are in use by Google Cloud customers ***

Draw your own conclusions / block policy. :)

My example hit came in with curl, one of the most common and blatant scrapers.
4:07 pm on Oct 26, 2013 (gmt 0)

Senior Member from GB 

WebmasterWorld Senior Member dstiles is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:May 14, 2008
posts:3091
votes: 2


And another new (to me) google cloud range...

162.216.148.0 - 162.216.151.255
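
For anyone whose block list needs CIDR notation rather than a start and end address, the standard `ipaddress` module can do the conversion. A small sketch using the range above:

```python
import ipaddress

first = ipaddress.ip_address("162.216.148.0")
last = ipaddress.ip_address("162.216.151.255")

# summarize_address_range yields the smallest set of networks
# that exactly covers the range.
cidrs = [str(net) for net in ipaddress.summarize_address_range(first, last)]
print(cidrs)  # ['162.216.148.0/22']
```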
6:13 pm on Nov 7, 2013 (gmt 0)

New User

joined:Aug 22, 2013
posts: 6
votes: 0


And finally I have the IP addresses available. It seems that all of these are Google Translate:

IPAgent
74.125.184.17google-xrawler
74.125.184.18google-xrawler
74.125.184.20google-xrawler
74.125.184.22google-xrawler
74.125.184.23google-xrawler
74.125.185.17google-xrawler
74.125.185.20google-xrawler
74.125.185.80google-xrawler
74.125.185.81google-xrawler
74.125.185.84google-xrawler

EXAMPLE:
IP:74.125.184.17
Decimal: 1249753105
Hostname: 74.125.184.17
ISP: Google
Organization: Google Translate

;-)
6:16 pm on Nov 7, 2013 (gmt 0)

New User

joined:Aug 22, 2013
posts: 6
votes: 0


Ah, there should be white space between the IP address and the Agent, like this:

IP ----------- Agent
74.125.184.17 google-xrawler
74.125.184.18 google-xrawler
74.125.184.20 google-xrawler
74.125.184.22 google-xrawler
74.125.184.23 google-xrawler
74.125.185.17 google-xrawler
74.125.185.20 google-xrawler
74.125.185.80 google-xrawler
74.125.185.81 google-xrawler
74.125.185.84 google-xrawler
7:51 pm on Nov 7, 2013 (gmt 0)

Senior Member from GB 

WebmasterWorld Senior Member dstiles is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:May 14, 2008
posts:3091
votes: 2


Block the whole /16. With G it's the only way. There is nothing useful there if you discount translate (which is often used for other fell purposes as well). :(
 
