Go 1.1 package http


keyplyr

9:55 am on Sep 2, 2015 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



"Go 1.1 package http" is the default UA sent by the Go programming language's standard HTTP library (as of Go 1.1). If the developer does not set a custom user agent when building their bot, this is the UA you'll see.

However, a quick web search will show a lot of webmasters complaining their sites were scraped at a high rate by this UA, from various IP ranges. So it appears to be easily used for malicious purposes.
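For reference, the default is trivial to observe and equally trivial to change, which is why a stock UA usually signals a quick, careless script. A minimal Go sketch, using the standard httptest package to see what a server actually receives (note: current Go releases send "Go-http-client/1.1", which appears later in this thread; Go 1.1 itself sent the string in the title):

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// sentUA performs one GET, optionally with a custom User-Agent, and reports
// the User-Agent header the server actually received.
func sentUA(customUA string) string {
	var seen string
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		seen = r.UserAgent()
	}))
	defer srv.Close()

	req, _ := http.NewRequest("GET", srv.URL, nil)
	if customUA != "" {
		req.Header.Set("User-Agent", customUA)
	}
	resp, _ := http.DefaultClient.Do(req)
	if resp != nil {
		resp.Body.Close()
	}
	return seen
}

func main() {
	fmt.Println(sentUA(""))                                    // Go's built-in default agent string
	fmt.Println(sentUA("my-bot/0.1 (+https://example.com/bot)")) // one Header.Set call renames it
}
```

Renaming costs a single `Header.Set` call, so an unchanged default UA generally means the operator didn't bother.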

Even though I had the UA blocked (because of a similarly named agent) and returned all 403s, this thing continued to hit the same web page of mine 6k times within an hour, then came back again and again. On that day it came from 2 different hosts in Sweden, then a host in the UK. I've been dealing with it for several days now, and lately it comes from a well-known German server farm, hitting several hundred times, then coming back a few hours later and doing it again and again.

I've sent close to a dozen email complaints to the abuse desk at all these hosts, with varied responses, but it just comes back from a different host... so either the guy buys cheap server space all over Europe, or these are all infected machines running this script.

Although I block the agent, the huge amount of 403s fill my logs with bloat. I've written a little piece of code to remove the hits from the downloaded access log, but that's another step I wish I didn't have to do.
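The cleanup code itself isn't shown; a hypothetical stand-in (the `stripAgent` helper and the sample log lines are mine, not the poster's) that drops the 403 noise from a downloaded access log might look like:

```go
package main

import (
	"fmt"
	"strings"
)

// stripAgent returns only the log lines that do not mention the given
// user agent string. (Hypothetical helper; the poster's script is not shown.)
func stripAgent(lines []string, agent string) []string {
	kept := make([]string, 0, len(lines))
	for _, l := range lines {
		if !strings.Contains(l, agent) {
			kept = append(kept, l)
		}
	}
	return kept
}

func main() {
	accessLog := []string{
		`1.2.3.4 - - "GET / HTTP/1.1" 200 "Mozilla/5.0"`,
		`5.6.7.8 - - "GET /page HTTP/1.1" 403 "Go 1.1 package http"`,
	}
	for _, l := range stripAgent(accessLog, "Go 1.1 package http") {
		fmt.Println(l)
	}
}
```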

aristotle

6:59 pm on Sep 3, 2015 (gmt 0)

If the developer does not rename after building their bot, this will be the UA.

I think this also happens sometimes with things like Ruby and Java, as well as with something that's currently hitting one of my sites hard under the following UA:
EventMachine HttpClient

In some cases they may just be testing a new bot they're working on and will rename it later.

P.S. Another one that might be in the same category is
Typhoeus - https://github.com/typhoeus/typhoeus

keyplyr

10:45 pm on Sep 3, 2015 (gmt 0)

P.S. Another one that might be in the same category is

Typhoeus - [github.com...]

GitHub = a den of malevolence

aristotle

6:05 pm on Sep 13, 2015 (gmt 0)

I just noticed the following UA that looks similar:
Go-http-client/1.1

In some ways this looks like a real human: the IP (70.119.37.22) lookup says Time Warner Cable, and it followed a 301 redirect, although it shows the old redirected page as the referer.

But it didn't download images or execute the Statcounter script.

So I don't know what it is.

robzilla

8:57 pm on Sep 13, 2015 (gmt 0)

Can't you block the IP server-side as soon as the user agent hits your site? It's a lot of hassle for one scraper, but if it's really bothering you, it might be worth it.

Thanks for the heads-up anyway; I have Java etc. blocked, but not Go.

So I don't know what it is.

Well it's definitely not a browser :-) Google it and you'll find it in the docs at golang.org. Just someone tinkering, I guess, for good or for evil.

wilderness

9:36 pm on Sep 13, 2015 (gmt 0)

cpe-70-119-37-22.tx.res.rr.com

keyplyr

9:41 pm on Sep 13, 2015 (gmt 0)

I said in the OP the hits were blocked. All 403s. That's not the point.

aristotle

11:16 pm on Sep 13, 2015 (gmt 0)

wilderness - I don't know enough to understand what cpe-70-119-37-22.tx.res.rr.com signifies. I saw it when I looked up the IP but I'm not familiar with that aspect of it. Can you give a further explanation of what it signifies in this case?

keyplyr

12:04 am on Sep 14, 2015 (gmt 0)

cpe-70-119-37-22.tx.res.rr.com

cpe = Customer Premises Equipment (the subscriber's own modem/router)
70-119-37-22 = the IP address (70.119.37.22)
tx = Texas
res = residential customer
rr.com = Road Runner (the ISP)

lucy24

12:45 am on Sep 14, 2015 (gmt 0)

It's your own judgement call whether it's worth the bother, but some robots do go away faster if you feed them a different response, such as the contemplate-your-navel redirect (to either 127.0.0.1 or their own IP), a 410, or a manual 404 (in Apache, R=404).

For the same reason it's worth naming unwanted robots in robots.txt. On rare occasions this will stop them from asking for pages, so your logs will show just one request instead of hundreds.
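As an illustrative sketch of the 404 alternative (mod_rewrite assumed; the UA pattern is the one from this thread, and rules like this normally go in .htaccess or the vhost config):

```apache
# Answer this agent with a 404 instead of a 403
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} "Go 1\.1 package http" [NC]
RewriteRule ^ - [R=404,L]
```

The robots.txt counterpart is just a named record (e.g. `User-agent: Go 1.1 package http` / `Disallow: /`); it only helps with robots that actually read the file.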

:: noting with interest, after running assorted minor sites' logs, that this past week appears to have been International Robot Week ::

wilderness

1:04 am on Sep 14, 2015 (gmt 0)

cpe-70-119-37-22.tx.res.rr.com

cpe = Customer Premises Equipment (the subscriber's own modem/router)
70-119-37-22 = the IP address (70.119.37.22)
tx = Texas
res = residential customer
rr.com = Road Runner (the ISP)


Clermont, Fla

Time Warner purchased RoadRunner a while back. The primary WHOIS shows up as Time Warner; however, more focused IP traces point to RoadRunner servers.
IP-trace sites can produce gross errors in their results. I ran three: two placed it in Clermont and one at the server hub in Plano, TX.

I started including traces in my IP searches some years back for more detail.
I tried using 'tracert' for a long while, but the results are too vague, and most queries simply time out when leaving the larger hubs and attempting to establish a smaller local identity.

As to what to surmise of cpe-70-119-37-22.tx.res.rr.com?
It could be anything from a compromised machine to a user running a server.

keyplyr

2:58 am on Sep 14, 2015 (gmt 0)

@ Lucy - true for robots, however Go-http-client/1.1 is not a robot per se.

Go-http-client/1.1 is used as a GET tool, usually straight out of the box (or script library, in this case). The versions I've seen are not built on a link-indexing crawler like most "crawlers" are; they request each file separately or from a list (which may be a series of pages with their associated files requested in sequence).

keyplyr

4:30 am on Sep 14, 2015 (gmt 0)

Thanks for that info, Don; however, I was responding to aristotle's question about what that string meant (even though he asked you).

In that string, the "tx" represents Texas.

wilderness

5:29 am on Sep 14, 2015 (gmt 0)

aristotle,
I do small range denies (both multi-conditional and temporary) using an address map that lucy provided a few years ago; it works effectively.

For that 70.119.37.22 IP, I would expand to 70.119.32-39 (70.119.32.0/21), which covers a defined area surrounding Clermont, FLA.
This is a very small and focused range, and it should not cause any great reduction in traffic regardless of what your web enterprise may be.

Historically speaking, RoadRunner provides customers with dynamic IPs; however, I have a longtime correspondent (more than a decade) whom RoadRunner has restricted to just two different Class B IPs (one normal, the other a temporary alternative) and a very narrow Class C range. (Their procedure seems to work like the secondary (alternative) routes utilities use to assure customers of electricity.)

wilderness

5:38 am on Sep 14, 2015 (gmt 0)

Thanks for that info, Don; however, I was responding to aristotle's question about what that string meant (even though he asked you).

In that string, the "tx" represents Texas.


keyplyr,
I'm aware of the designation for TX; however, as previously mentioned, that is the location of the data center in Plano.
Regional data goes through there and is then routed through relay station after relay station (and perhaps more and more) until it reaches Clermont, FLA.

There are major hubs for many providers and how traffic is routed (to and fro) through that hub is sometimes absurd.
I'm east of Chicago and everything goes back and forth via Chicago.
Most of this info (at least for your own provider) shows up when using tracert for queries. It's the smaller relay stations at the far end of the query where the timeouts take place, making tracert (a once-useful tool) pretty useless.

keyplyr

6:48 am on Sep 14, 2015 (gmt 0)

And I agree with all that :)

A question was asked about what the elements in the server string meant, and I defined them; nothing more.

robzilla

10:39 am on Sep 14, 2015 (gmt 0)

I said in the OP the hits were blocked. All 403s. That's not the point.

I know, I meant taking it up the chain: have your firewall reject/deny the IP as soon as the UA hits your site. Saves you the trouble of having to clean your logs.

keyplyr

11:25 am on Sep 14, 2015 (gmt 0)

Of course, but my personal site is on shared hosting, not a VPS or dedicated server, and as such I only have control of my account config, not the server config. Maybe I should have said that in the OP.

Funny, half of my clients have their sites on private servers. I just don't get traffic high enough to warrant that. I never get over 60k page loads per day. I've had clients with over a million daily.

...now if I took down my blocks I might :)

aristotle

1:10 pm on Sep 14, 2015 (gmt 0)

Thanks everyone for taking time to respond to my questions. At least I have a partial understanding now.