github.com/typhoeus

         

aristotle

9:12 pm on Apr 16, 2015 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



This hit one of my home pages 4 times in 5 seconds on April 15.
Host: 54.197.81.31
/
Http Code: 200 Date: Apr 15 18:29:47 Http Version: HTTP/1.1 Size in Bytes: -
Referer: -
Agent: Typhoeus - https://github.com/typhoeus/typhoeus

I never noticed it before and haven't seen it since. Evidently it came from a part of AWS that I don't block (my AWS blocking has gotten messed up). So I'm wondering what exactly it is, and why it came 4 times in 5 seconds.

[edited by: keyplyr at 5:55 am (utc) on Apr 1, 2017]
[edit reason] delinked URL [/edit]

Pfui

11:53 pm on Apr 16, 2015 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



"Typhoeus wraps libcurl in order to make fast and reliable requests." http://github.com/typhoeus/typhoeus

In a word: Ugh.

I've blocked Typhoeus [NC] since 2011 but I don't have additional info about it. FWIW, I also block any UA containing github.

Apparently I'm not the only one who doesn't like GitHub... [foxnews.com...]

[edited by: keyplyr at 5:56 am (utc) on Apr 1, 2017]
[edit reason] delinked URL [/edit]

aristotle

12:11 pm on Apr 17, 2015 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Thanks Pfui
I think these hits might be a test by someone who is working on a new crawler. At any rate, I've blocked that IP and also followed your suggestion and added github and typhoeus to my UA block list.

lucy24

7:34 pm on Apr 17, 2015 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



:: shuffling papers ::

a giant who tried to force Jupiter from heaven, and was buried beneath Etna. *

Not who I'd choose for a namesake, but burying robots beneath Etna is always an attractive option.


* Just from the dictionary, which is currently within arm's reach as I'm working on an ebook. The OCD presumably has more information, but cat is in my lap so I can't go check.

trintragula

8:44 pm on Apr 17, 2015 (gmt 0)

10+ Year Member Top Contributors Of The Month



Typhoeus, the deadliest monster of Greek mythology, with a hundred dragon heads (which spurt fire) and a bellowing many-tongued voice, created whirlwinds, fought with Zeus, then requested a couple of home pages on my site early this morning. And was turned down.

Watching for Echidna...

keyplyr

1:28 am on Apr 18, 2015 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Just an Apache-run data miner built on curl. Block it.

aristotle

4:01 pm on Apr 25, 2015 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Hmm I want to respond to a post I saw in this thread yesterday but it seems to be gone now.

trintragula

4:20 pm on Apr 25, 2015 (gmt 0)

10+ Year Member Top Contributors Of The Month



I noticed this also. There may be some technical difficulties as a result of a recent upgrade - check this thread:
(Home / Forums Index / Local / WebmasterWorld Community Center / WebmasterWorld Beta Bugs and Comments)
Links in posts also appear not to be working at the moment, which has already been reported.
At time of writing that thread is near the top of the Recent Posts list under the More v menu at the top of the page.
Hang in there - hopefully they'll get it straightened out soon...

aristotle

8:50 pm on Apr 29, 2015 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I decided to see what my raw logs show for this series of requests, and discovered that my original post at the top of this thread describes only a small part of what happened. In fact there were more than five hundred HEAD requests from IP 54.197.81.31, all of them occurring during a 30-second period starting at 15/Apr/2015:18:29:43. Within this 30-second period, there were several short pauses of about 1 or 2 seconds when no requests were made.

All of these requests had the following form:

54.197.81.31 - - [15/Apr/2015:18:29:43 -0400] "HEAD / HTTP/1.1" 200 - "-" "Typhoeus - https:// github.com/typhoeus/typhoeus"


In my first post at the start of this thread, I reported that "Latest Visitors" showed four of these requests. When I made that first post, I didn't know about the hundreds of others shown in my raw logs. I can't explain why "Latest Visitors" only showed four of them, unless perhaps it couldn't keep up.

So to summarize, there were more than five hundred HEAD requests, all within a 30-second time interval. Also, there weren't any GET requests.

I think someone might have been running a test.
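For anyone who would rather count these than eyeball them, here is a rough Python sketch of the kind of tally described above. It assumes the raw log uses the Apache "combined" format, as the entry quoted above appears to; the function names (parse_line, summarize) are made up for illustration and are not part of any tool mentioned in this thread.

```python
import re
from collections import Counter
from datetime import datetime

# Pattern for the Apache "combined" log format (an assumption about the host's logs).
LOG_PATTERN = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"$'
)


def parse_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    match = LOG_PATTERN.match(line.strip())
    if match is None:
        return None
    entry = match.groupdict()
    # Timestamps look like 15/Apr/2015:18:29:43 -0400
    entry["time"] = datetime.strptime(entry["time"], "%d/%b/%Y:%H:%M:%S %z")
    return entry


def summarize(lines, agent_snippet):
    """Count requests per (ip, method) for user agents containing agent_snippet
    (case-insensitive) and return the counts plus first and last timestamps."""
    counts = Counter()
    first = last = None
    needle = agent_snippet.lower()
    for line in lines:
        entry = parse_line(line)
        if entry is None or needle not in entry["agent"].lower():
            continue
        counts[(entry["ip"], entry["method"])] += 1
        when = entry["time"]
        if first is None or when < first:
            first = when
        if last is None or when > last:
            last = when
    return counts, first, last
```

Feeding it the lines of a downloaded log file with "typhoeus" as the snippet should give one count per IP and method, plus the start and end of the burst, which is roughly the summary given above.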

lucy24

9:09 pm on Apr 29, 2015 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I can't explain why "Latest Visitors" only showed four of them

A HEAD request alone wouldn't trigger anything javascript-based, would it?

Incidentally I've just noticed that your first post-- which predated the Forums code change and server migration-- illustrates something I'd only just noticed in a different venue: When you close a quote in the same line as an auto-link, the quote doesn't close. Technically the </blockquote> becomes part of the <a href> markup; you can see it in your browser. I don't suppose anyone remembers what that first post looked like in old-style Forums code?

aristotle

9:34 pm on Apr 29, 2015 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Lucy, when I made that first post, I added an extra space to the github URL to prevent it from displaying as a clickable link. But now it has somehow been turned into a clickable link anyway. When I made my last post, I had to insert TWO spaces into the URL to prevent it from becoming a clickable link.

As for "Latest Visitors", I don't know much about how it works, but its report is almost instantaneous, whereas the hosting I have doesn't make the raw logs available until the next day. That's why I prefer "Latest Visitors", plus the fact that it's easier on the eyes.

lucy24

10:05 pm on Apr 29, 2015 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I had to insert TWO spaces into the URL
One of them seems to have been eaten by the Forums software, because I only see one after the double-slash.

You can always avoid auto-linking by using [ code] instead of [ quote] markup. Even this won't stop the expansion of "wm w" or &, though.

the hosting I have doesn't make the raw logs available until the next day
Eeuw, what a ripoff. I can download mine at any time, meaning that I can try something on my test site and then instantly grab the logs to see what shows up.

If "Latest Visitors" is a function provided by the host, it's quite possible that they intentionally exclude HEAD requests, since it isn't really a visit.

[edited by: lucy24 at 10:06 pm (utc) on Apr 29, 2015]

keyplyr

10:05 pm on Apr 29, 2015 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Probably safe to block that Amazon AWS range:

54.192.0.0/12
54.208.0.0/16
54.192.0.0 - 54.208.255.255
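
A quick way to sanity-check ranges like these before pasting them into a block list is Python's standard ipaddress module. This is only a sketch; the helper names (is_blocked, span) are invented for the example.

```python
import ipaddress

# The two CIDR blocks listed above.
BLOCKED_NETWORKS = [
    ipaddress.ip_network("54.192.0.0/12"),
    ipaddress.ip_network("54.208.0.0/16"),
]


def is_blocked(ip_text):
    """Return True if the address falls inside any of the listed networks."""
    address = ipaddress.ip_address(ip_text)
    return any(address in network for network in BLOCKED_NETWORKS)


def span(network):
    """Return the first and last address of a network as strings."""
    return str(network.network_address), str(network.broadcast_address)
```

The /12 runs from 54.192.0.0 to 54.207.255.255 and the /16 adds 54.208.0.0 to 54.208.255.255, which together give the span on the third line. The original visitor, 54.197.81.31, falls inside the /12.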

aristotle

12:30 am on Apr 30, 2015 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Actually "Latest Visitors" can show HEAD requests, and I see them occasionally. The first post at the top of this thread shows an example. You can tell that it's a HEAD request because the "Size in Bytes" is blank.

keyplyr

12:52 am on Apr 30, 2015 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month




@aristotle - any time you rely on third-party software for crucial data about your site, you put yourself at a disadvantage, i.e. you inherit the limitations of that program. IMO these free site-stats pages offered by hosting companies are not sufficient to run a real web site in today's aggressive environment; you need more complete information. If your current host doesn't give you real-time access to server logs, I'd seriously consider moving to a new host that does provide this essential feature.

aristotle

1:08 am on Apr 30, 2015 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



keyplyr --
We all have different goals and different priorities. I have five websites, all of them non-commercial, and on most days don't spend more than 30 minutes looking at stats and logs for all of them together. I like the combination of Latest Visitors, Awstats, and Statcounter as my main sources of information. Also, I don't enjoy looking at raw logs and don't feel any need for real-time monitoring of them. As I said, everyone has their own goals and priorities (and preferences).

keyplyr

1:46 am on Apr 30, 2015 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Your site, your choice. "Priorities" are one thing, but if you don't keep a diligent eye on what's hitting your files, your site is toast!

aristotle

8:39 pm on May 1, 2015 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Well whatever this is, it showed up again early this morning (May 1, 2015). I noticed it when I saw three requests with the same UA in Latest Visitors. But I also have the raw log entries available, because I discovered that the hosting account for this particular site allows real-time access to raw logs. (My other sites are hosted at a different company which doesn't.)

At any rate, here are the raw log entries:
54.81.37.50 - - [01/May/2015:02:51:21 -0400] "HEAD / HTTP/1.1" 403 - "-" "Typhoeus - https://github.com/typhoeus/typhoeus"

54.167.240.74 - - [01/May/2015:02:51:34 -0400] "HEAD / HTTP/1.1" 403 - "-" "Typhoeus - https://github.com/typhoeus/typhoeus"

54.234.92.159 - - [01/May/2015:02:51:45 -0400] "HEAD / HTTP/1.1" 403 - "-" "Typhoeus - https://github.com/typhoeus/typhoeus"

So there were only three requests this time, instead of hundreds, possibly because of the 403 responses. These 403 responses occurred because I've added github and typhoeus to my UA snippets block list.

Also the requests came from three different IPs this time.

Question: Is it likely that the current UA string for these requests is temporary, and will be changed to something else at some point?
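
For reference, the UA snippet blocking that produced those 403 responses boils down to a case-insensitive substring test. The sketch below shows that logic in Python rather than in server config; the names BLOCKED_UA_SNIPPETS and status_for are made up for the example, and a real block list would live in the server's own rules.

```python
# Snippets taken from the posts above; matching is case-insensitive,
# in the spirit of an [NC] rule.
BLOCKED_UA_SNIPPETS = ("typhoeus", "github")


def status_for(user_agent):
    """Return 403 if the user agent contains a blocked snippet, else 200."""
    agent = (user_agent or "").lower()
    if any(snippet in agent for snippet in BLOCKED_UA_SNIPPETS):
        return 403
    return 200
```

This also shows why the question matters: a check like this only works for as long as the client keeps sending a string that contains one of the snippets.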

keyplyr

10:54 pm on May 1, 2015 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Is it likely that the current UA string for these requests is temporary, and will be changed to something else at some point?

No one can answer that really.

Easiest solution (once again) is to block AWS. In doing so, you'll not only solve your "typhoeus" problem, but all the other crap that github makes available to scrape our sites. In addition to that, AWS is home to a never-ending list of bad guys :)

AWS has its own thread here: [webmasterworld.com...] where you'll find a list of block ranges.

aristotle

11:09 pm on May 1, 2015 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Thanks keyplyr
A couple of years ago I copied an AWS blocklist into my IP block section. I don't remember where I got that AWS blocklist, but it must have been defective, because I occasionally saw it block a real human. So I started removing parts that blocked real people. Now it's messed up and I haven't had a chance yet to do anything about it.

keyplyr

11:19 pm on May 1, 2015 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



There are a few proxy ranges buried inside a couple AWS ranges. Depending on your site, human visitors may come through these proxies. You can poke holes to let them through. We also recently discovered that a default iOS app for Facebook uses a few AWS ranges to connect their users. That's kinda the problem with AWS. They are cloud services for anyone who wants to use them, for any reason, for any amount of time.

lucy24

1:20 am on May 2, 2015 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



there were only three requests this time, instead of hundreds, possibly because of the 403 responses

You may think that's good, but actually it's bad. Few things scare me more than a robot intelligent enough to watch responses in real time and modify its behavior accordingly. Your ordinary dumb robot comes in with a shopping list and works through the whole thing, obligingly eating one 403 after another. ("They don't have spaghetti in this aisle of the hardware store? Well, maybe they'll have rigatoni, fettucine or radiatore. Can't hurt to ask.")

Then again, I've never perfectly understood the many robots that come through just asking for the front page. What do they do with it? What possible content would lead them to make further incursions?

keyplyr

3:19 am on May 2, 2015 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Few things scare me more than a robot intelligent enough to watch responses in real time and modify its behavior accordingly.
Didn't James Cameron say that?

dstiles

7:14 pm on May 2, 2015 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



aristotle - as keyplyr said, there is a thread with just about every amazon range listed - at least, at last update.

The thing to watch out for, though, is the nokia range embedded therein AND the fact that some silk hits come from ANYWHERE in the range!

My blocking has a hole drilled for nokia but I had to allow anything through it with the relevant silk UA. A nuisance but one I feel necessary. I also let through FlipboardProxy UAs but that is a matter of choice.
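
As a rough illustration of "block the range, drill a hole, let certain UAs through", here is a Python sketch of that decision order. Every value in it is an assumption for the example: the hole range is invented, and the thread does not give the actual ranges or exact UA strings in use.

```python
import ipaddress

# Hypothetical values for illustration only.
BLOCKED_NETWORKS = [ipaddress.ip_network("54.192.0.0/12")]
ALLOWED_HOLES = [ipaddress.ip_network("54.200.0.0/24")]  # made-up "hole"
ALLOWED_UA_SNIPPETS = ("silk", "flipboardproxy")


def allow_request(ip_text, user_agent):
    """Decide whether a request gets through the block list."""
    address = ipaddress.ip_address(ip_text)
    if not any(address in net for net in BLOCKED_NETWORKS):
        return True  # outside the blocked ranges entirely
    if any(address in net for net in ALLOWED_HOLES):
        return True  # inside a hole drilled for a proxy range
    # Inside a blocked range: only let through the whitelisted UA snippets.
    agent = (user_agent or "").lower()
    return any(snippet in agent for snippet in ALLOWED_UA_SNIPPETS)
```

The obvious weakness is that the UA exception trusts whatever string the client sends, so anything in the range that claims to be Silk gets through as well.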

aristotle

9:11 pm on May 2, 2015 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



dstiles -- I don't know what you mean by "silk hits" and "silk UA". Can you give more information about this?

lucy24

4:47 am on May 3, 2015 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Silk? The one for Kindle?

[webmasterworld.com...]

I know there have been more recent discussions, because I'm darn sure my memory doesn't go back to 2011, but I can't find the thread I was thinking of.

keyplyr

5:41 am on May 3, 2015 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Kindle generic as well as Kindle Fire & Kindle Silk browsers are frequent in my logs since my couple sites are in the edu category and many schools use them (likely due to their low price.)

One of my students works for Amazon and is senior programmer for many Kindle projects. He's the guy they send to Beijing to talk code to the processor manufacturers. There's a lot of these tablets out there, but until recently they've been profiled as eBook readers. Since Fire & Silk (approx 2 years) they've been better equipped as a web device and as such, have been re-marketed so we *should* be seeing more of them.

I have reported several times in the last couple years that Kindle (et al) requests often come from various Amazon ranges, some clearly used for caching, others full browser hits. These sometimes come as parallel requests, one from the user's ISP & one from Amazon (image caching.) The mystery for me has always been in figuring out whether this is determined by ISP, by device, by OS or by region. Another question is... if Amazon is blocked by the web site's server, does Kindle still get all the files through the (unblocked) ISP connection?

keyplyr

12:04 pm on May 3, 2015 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Allowable time to edit is past due
hmmm...

This Kindle user connecting from a DSL pool in Germany got all files through his ISP & did not use AWS at all:

85.16.226.*** - - [02/May/2015:07:42:26 -0700] "GET /images/file.png HTTP/1.1" 200 1464 "http://example.com/page.html" "Mozilla/5.0 (Linux; U; Android 4.0.3; de-de; KFTT Build/IML74K) AppleWebKit/537.36 (KHTML, like Gecko) Silk/3.66 like Chrome/39.0.2171.93 Safari/537.36"

trintragula

2:24 pm on May 3, 2015 (gmt 0)

10+ Year Member Top Contributors Of The Month



There's quite a bit of documentation about Silk on the AWS website:
[docs.aws.amazon.com...]
and
[docs.aws.amazon.com...]

There's also a document (which I think I've seen posted on here before, but now can't find) that lists the user agent strings for the various models. Here it is again:

[docs.aws.amazon.com...]

That doc is interesting because it also mentions the Amazon Fire Phone(!)

and also user agent strings for Silk on a Linux desktop that's not Android, and for Silk-accelerated Macintosh Desktop/Tablet.

This all suggests there are more ways in which user traffic may come via AWS...

(The discussion about Silk is now fairly evenly split between the Amazon Hosts Bad Bots thread and this one, which was about Typhoeus/github.)