Home / Forums Index / Search Engines / Search Engine Spider and User Agent Identification
Forum Library, Charter, Moderators: Ocean10000 & incrediBILL

Search Engine Spider and User Agent Identification Forum

    
google-xrawler
Unknown User agent "google-xrawler"
DutchTheWiz



 
Msg#: 4604177 posted 5:22 pm on Aug 22, 2013 (gmt 0)

Does anyone know what this user agent is: google-xrawler?

I found only one mention of it, on this forum two years ago:
[webmasterworld.com]

I can find nothing else about it anywhere on the Internet.

Sometimes it only hits pages:
GET /pgmpi.php - 80 - 172.31.88.68 HTTP/1.1 google-xrawler

Sometimes it uses HEAD
HEAD /payb.php/downloads/abc/2013/abc.tar - HTTP/1.1 google-xrawler

Sometimes it requests the same download several times a second:
2013-08-22 15:32:35 GET /payb.php/downloads/abc/2013/abc.tar HTTP/1.1 google-xrawler
2013-08-22 15:32:35 GET /payb.php/downloads/abc/2013/abc.tar HTTP/1.1 google-xrawler
2013-08-22 15:32:35 GET /payb.php/downloads/abc/2013/abc.tar HTTP/1.1 google-xrawler
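
A burst like the one above can be picked out of a log mechanically. The following is only a minimal sketch, assuming every line has the "date time method path protocol agent" layout of the sample; the function name `find_bursts` and the threshold of 3 are hypothetical choices, not anything the poster described.

```python
from collections import Counter

def find_bursts(lines, threshold=3):
    """Return (timestamp, path, agent) keys seen at least `threshold` times.

    Assumes each line looks like the sample above:
    date time method path protocol user-agent
    """
    counts = Counter()
    for line in lines:
        parts = line.split(None, 5)
        if len(parts) < 6:
            continue  # skip lines that do not match the assumed layout
        date, time, _method, path, _protocol, agent = parts
        counts[(f"{date} {time}", path, agent.strip())] += 1
    return {key: n for key, n in counts.items() if n >= threshold}

sample = [
    "2013-08-22 15:32:35 GET /payb.php/downloads/abc/2013/abc.tar HTTP/1.1 google-xrawler",
] * 3
bursts = find_bursts(sample)
```

Grouping by the one-second timestamp is crude; a sliding window would catch bursts that straddle two seconds.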

 

dstiles




 
Msg#: 4604177 posted 7:38 pm on Aug 23, 2013 (gmt 0)

I assume it was a google IP - the one you give is a local (private) network IP and so irrelevant.

The IP quoted in the other WebmasterWorld posting was certainly G but not one of their bot IPs. In fact it's in a range 74.125.0.0/16 that has been used in illegal - or certainly immoral - activities in the past (my notes say: this range suspect (apart from known bots) following mocality revelation - http blog[.]mocality[.]co[.]ke/2012/01/13/google-what-were-you-thinking/). I re-enabled the range briefly in June and quickly disabled it again following a lot of bad (ie non-bot) accesses.

My guess, with no knowledge but deep suspicion of G, is that they are testing something they probably do not want us to know about.

My own method of protection is: block ALL G ranges except for small ranges that can be shown by rDNS to be proper crawlers, and then to evaluate known user-agents against those IPs. Works for me, anyway, and for many here. :)
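
The rDNS step described above is usually done as a forward-confirmed reverse lookup: resolve the IP to a hostname, check the hostname's domain, then resolve the hostname back and confirm it returns the same IP. What follows is a rough sketch of that idea, not dstiles's actual setup. The suffix list is an assumption to be checked against Google's current documentation, and the `reverse` and `forward` parameters are hypothetical hooks so the logic can be exercised without live DNS.

```python
import socket

# Assumed hostname suffixes for genuine Google crawlers; verify against
# current documentation before relying on this list.
CRAWLER_SUFFIXES = (".googlebot.com", ".google.com")

def is_verified_crawler(ip, reverse=None, forward=None):
    """Forward-confirmed reverse DNS check for a single IPv4 address."""
    reverse = reverse or (lambda addr: socket.gethostbyaddr(addr)[0])
    forward = forward or (lambda host: socket.gethostbyname_ex(host)[2])
    try:
        host = reverse(ip).rstrip(".").lower()
        if not host.endswith(CRAWLER_SUFFIXES):
            return False
        # The name must resolve back to the same IP, or the PTR is untrusted.
        return ip in forward(host)
    except OSError:
        # Covers socket.herror and socket.gaierror (no PTR, lookup failure).
        return False
```

Only after an IP passes this check would the user agent string be compared against the list of agents expected from that source.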

DutchTheWiz



 
Msg#: 4604177 posted 8:21 pm on Aug 23, 2013 (gmt 0)

Thank you dstiles.

Unfortunately our load balancer does not retain the original IP at this time, so it's always the same private IP that gets logged. This is going to change in the future.
But I think you are probably right that this is Google (I assume that is what G means) - we recently took over a chunk of patent hosting business that they provided in the past - lots of downloads of huge files.
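
Many load balancers can pass the original client address along in an X-Forwarded-For header; whether the balancer discussed here does so is not stated in the thread. As a sketch under that assumption, the code below picks the right-most address in the header that is not one of the site's own proxies. The name `client_ip` and the trusted network 172.16.0.0/12 (which contains the 172.31.88.68 seen in the log above) are illustrative guesses.

```python
import ipaddress

# Assumption: the balancer lives somewhere in this private network.
TRUSTED_PROXIES = [ipaddress.ip_network("172.16.0.0/12")]

def client_ip(remote_addr, x_forwarded_for=""):
    """Return the right-most address that is not one of our own proxies."""
    hops = [h.strip() for h in x_forwarded_for.split(",") if h.strip()]
    # Walk from the nearest hop outwards; entries further left can be forged.
    for candidate in reversed(hops + [remote_addr]):
        try:
            addr = ipaddress.ip_address(candidate)
        except ValueError:
            break  # malformed entry: stop trusting the header
        if not any(addr in net for net in TRUSTED_PROXIES):
            return str(addr)
    return remote_addr
```

Taking the right-most untrusted entry rather than the left-most matters because a client can put anything it likes at the start of the header.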

dstiles




 
Msg#: 4604177 posted 7:40 pm on Aug 24, 2013 (gmt 0)

If you cannot detect source IPs and make decisions on blocking them then your web site is toast. It WILL be scraped by everyone and their goat.

Primary defence against this is to detect source (and/or proxy) IP and block all coming from server farms, including all of Amazon, most of google, and so on. There are major lists of IPs to block in this forum but the list is by no means definitive.

Secondary defence includes bad user-agents, bad headers and more. Again, check previous postings in this forum.

lucy24




 
Msg#: 4604177 posted 9:46 pm on Aug 24, 2013 (gmt 0)

To me "xrawler" screams typo. Look at your keyboard.

JD_Toims




 
Msg#: 4604177 posted 12:30 am on Aug 25, 2013 (gmt 0)

Sometimes it uses HEAD
HEAD /payb.php/downloads/abc/2013/abc.tar - HTTP/1.1 google-xrawler

Google has stated they do not use HEAD requests, because in their testing it really didn't provide any faster spidering since they had to make two requests rather than one for any modified page request and when they're just grabbing HTML it's a very fast process to just use GET.

It may be they've started using HEAD for grabbing something like a .tar, but personally I'd guess it's a spoof until proved otherwise.

DutchTheWiz



 
Msg#: 4604177 posted 12:38 pm on Aug 29, 2013 (gmt 0)

I am just a developer who was put in charge of writing the code that handles downloads of files up to 10GB - thousands of them. To do this I had to understand request behavior, distinguish between good and bad requests, and avoid sending the file when possible (GET, HEAD, Last-Modified, etc.).
On the other hand, it is desired, or even required, that these files are public and available to any friendly customer.
My network group is in charge of tuning the balancer, so I am hoping they will make it work soon and that we will then be able to block malicious IPs. I do not need to mention how much malicious requests slow down the downloads.
And yes, lucy, it is "google-xrawler".
Thanks
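
One standard way to avoid sending a large file is to honor HEAD and the If-Modified-Since request header. The sketch below shows that decision only; it is an illustration under assumptions, not the poster's code, and `should_send_body` is a hypothetical name.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

def should_send_body(method, last_modified, if_modified_since=None):
    """Decide whether a response needs the file body.

    `last_modified` is a timezone-aware UTC datetime for the file.
    """
    if method == "HEAD":
        return False  # HEAD responses never carry a body
    if if_modified_since:
        try:
            since = parsedate_to_datetime(if_modified_since)
        except (TypeError, ValueError):
            return True  # unparseable header: ignore it and send the file
        if since.tzinfo is None:
            since = since.replace(tzinfo=timezone.utc)
        # HTTP dates have one-second resolution, so drop microseconds.
        if last_modified.replace(microsecond=0) <= since:
            return False  # client copy is current: answer 304 Not Modified
    return True

lm = datetime(2013, 8, 22, 15, 32, 35, tzinfo=timezone.utc)
send = should_send_body("GET", lm, "Thu, 22 Aug 2013 15:32:35 GMT")
```

A client that repeats a plain GET several times a second, with no conditional header, gets no benefit from this, which is why IP-level blocking still matters.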

DutchTheWiz



 
Msg#: 4604177 posted 1:54 pm on Aug 29, 2013 (gmt 0)

...did not finish the previous post. It is "google-xrawler".

A few others that I take to be fake:

"Mozilla/6.0 (compatible)"
"Internet Explorer"

and a few others that I suspect are being used maliciously:

"MSIE 7.0(compatible; Mozilla/5.0; Windows NT 6.0; DigExt)"
"Mozilla/5.0 (compatible; MSIE 6.0; Windows NT 5.0)"
"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)"
"FDM 3.x"

-- can anyone confirm this, without IPs?

Thanks

dstiles




 
Msg#: 4604177 posted 7:15 pm on Aug 29, 2013 (gmt 0)

There are a LOT of "malicious" user-agents including those you mention. The specific discussion here refers to user-agents emanating from google and purporting to be valid google agents.

Mozilla/6.0 (compatible) is a generic UA used by (as far as I can determine) proxy servers such as bluecoat. They are harmless and require only a temporary block, not a full-blown IP ban.

lucy24




 
Msg#: 4604177 posted 8:17 pm on Aug 29, 2013 (gmt 0)

Dutch, I didn't mean typo on your part :) I meant that when a robot spoofs someone else's name to get past UA blocks, sometimes its original programmer hits the wrong key. For comparison purposes I recently posted about an "App3leWebKit/53.1".

Behind every stupid robot is a stupid human.
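
The "one wrong key" idea can be tested automatically by flagging user agent tokens that miss a well-known name by exactly one edit. This is a small sketch of that check; the reference list `KNOWN_TOKENS` is hypothetical and would need to be filled with whatever names matter on a given site.

```python
def edit_distance(a, b):
    """Levenshtein distance using a rolling row of the DP table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

# Hypothetical reference list: tokens a spoofer might be trying to type.
KNOWN_TOKENS = ["crawler", "AppleWebKit"]

def near_misses(token, known=KNOWN_TOKENS):
    """Known tokens that `token` misses by exactly one edit."""
    return [k for k in known if edit_distance(token.lower(), k.lower()) == 1]
```

With this list, `near_misses("xrawler")` matches "crawler" and `near_misses("App3leWebKit")` matches "AppleWebKit", while an exact match returns nothing.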

dstiles




 
Msg#: 4604177 posted 9:21 pm on Sep 14, 2013 (gmt 0)

Another G IP range, found today (updated 6-sep)...

NetRange: 173.255.112.0 - 173.255.127.255
CIDR: 173.255.112.0/20
NetName: GOOGLE-APPS
Comment: ** The IP addresses under this Org-ID are in use by Google Cloud customers ***

Draw your own conclusions / block policy. :)

My example hit came in with curl, one of the most common and blatant scrapers.

dstiles




 
Msg#: 4604177 posted 4:07 pm on Oct 26, 2013 (gmt 0)

And another new (to me) google cloud range...

162.216.148.0 - 162.216.151.255
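
Most firewalls and server configs want a first-last range like this in CIDR form. A quick way to convert it, sketched with Python's standard `ipaddress` module (the helper name `range_to_cidrs` is made up for this example):

```python
import ipaddress

def range_to_cidrs(first, last):
    """Collapse an inclusive first-last IPv4 range into CIDR blocks."""
    return [str(net) for net in ipaddress.summarize_address_range(
        ipaddress.IPv4Address(first), ipaddress.IPv4Address(last))]

cidrs = range_to_cidrs("162.216.148.0", "162.216.151.255")
# cidrs == ['162.216.148.0/22']
```

The same call on 173.255.112.0 - 173.255.127.255 gives 173.255.112.0/20, matching the CIDR line in the whois record quoted earlier in the thread.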

DutchTheWiz



 
Msg#: 4604177 posted 6:13 pm on Nov 7, 2013 (gmt 0)

And finally I have the IP addresses available. It seems that all of these are Google Translate:

IPAgent
74.125.184.17google-xrawler
74.125.184.18google-xrawler
74.125.184.20google-xrawler
74.125.184.22google-xrawler
74.125.184.23google-xrawler
74.125.185.17google-xrawler
74.125.185.20google-xrawler
74.125.185.80google-xrawler
74.125.185.81google-xrawler
74.125.185.84google-xrawler

EXAMPLE:
IP:74.125.184.17
Decimal: 1249753105
Hostname: 74.125.184.17
ISP: Google
Organization: Google Translate

;-)
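
The Decimal value in the lookup above is just the four octets read as one 32-bit number: 74·256³ + 125·256² + 184·256 + 17 = 1249753105. A short sketch of the conversion, with a hypothetical helper name:

```python
import ipaddress

def ip_to_decimal(ip):
    """Dotted quad to the 32-bit integer form shown in the lookup above."""
    a, b, c, d = (int(part) for part in ip.split("."))
    return (a << 24) | (b << 16) | (c << 8) | d

value = ip_to_decimal("74.125.184.17")
# Cross-check against the standard library's own conversion.
assert value == int(ipaddress.IPv4Address("74.125.184.17"))
```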

DutchTheWiz



 
Msg#: 4604177 posted 6:16 pm on Nov 7, 2013 (gmt 0)

Ah, there should be white space between the IP address and the agent, like this:

IP ----------- Agent
74.125.184.17 google-xrawler
74.125.184.18 google-xrawler
74.125.184.20 google-xrawler
74.125.184.22 google-xrawler
74.125.184.23 google-xrawler
74.125.185.17 google-xrawler
74.125.185.20 google-xrawler
74.125.185.80 google-xrawler
74.125.185.81 google-xrawler
74.125.185.84 google-xrawler

dstiles




 
Msg#: 4604177 posted 7:51 pm on Nov 7, 2013 (gmt 0)

Block the whole /16. With G it's the only way. There is nothing useful there if you discount translate (which is often used for other fell purposes as well). :(
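
For anyone applying a policy like this in application code rather than at the firewall, the membership test is simple. The sketch below uses the ranges quoted in this thread; 162.216.148.0/22 is the CIDR form of the 162.216.148.0 - 162.216.151.255 range, and the names `BLOCKED` and `is_blocked` are illustrative. Whether to block these ranges at all is a policy decision for each site.

```python
import ipaddress

# Ranges quoted in this thread; blocking them is a per-site policy choice.
BLOCKED = [ipaddress.ip_network(n) for n in
           ("74.125.0.0/16", "173.255.112.0/20", "162.216.148.0/22")]

def is_blocked(ip):
    """True if `ip` falls inside any listed range."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in BLOCKED)
```

Every google-xrawler address listed above, from 74.125.184.17 to 74.125.185.84, falls inside the first range.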


All trademarks and copyrights held by respective owners. Member comments are owned by the poster.
WebmasterWorld is a Developer Shed Community owned by Jim Boykin.
© Webmaster World 1996-2014 all rights reserved