
Forum Moderators: phranque

Can anyone explain this partial website file access behavior?

web logs show strange file-access pattern

     
12:54 am on Oct 8, 2018 (gmt 0)

Junior Member

Top Contributors Of The Month

joined:Sept 8, 2016
posts:69
votes: 0


A hit to my company website results in a total of 32 files being delivered to the requesting browser: one default.html file, one favicon.ico file, and 30 gif files. I recognize a legit/human web-hit primarily when the complete set of these files is downloaded to a believable IP address using a legit-looking user-agent. A referrer like google or bing is almost always present too - but sometimes there is no referrer. Now if someone decides to browse the site, then more files will be downloaded. If they go away, that's all I'll see in the logs.

Starting July 17 this year, I started to see rare hits where only the default.html file and two specific gif files were downloaded. These 2 gifs are usually the first 2 that are normally downloaded when a "full hit" happens. I've never seen bots do this - they go after the html files and ignore the gifs.
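
The full-hit vs partial-hit distinction can be checked mechanically. A minimal sketch, assuming log lines already parsed to (IP, path) tuples - the gif file names here are placeholders, not the OP's real files:

```python
from collections import defaultdict

# Hypothetical names; substitute the site's real first two gifs.
PARTIAL_SET = {"/default.html", "/img/logo.gif", "/img/banner.gif"}
FULL_COUNT = 32  # one html + favicon + 30 gifs on a complete page load

def classify_hits(records):
    """records: iterable of (ip, path) tuples from the access log.
    Returns {ip: 'full' | 'partial' | 'other'}."""
    by_ip = defaultdict(set)
    for ip, path in records:
        by_ip[ip].add(path)
    result = {}
    for ip, paths in by_ip.items():
        if len(paths) >= FULL_COUNT:
            result[ip] = "full"
        elif paths == PARTIAL_SET:
            result[ip] = "partial"   # the suspicious 3-file pattern
        else:
            result[ip] = "other"
    return result
```

Grouping by bare IP is a simplification; a real pass would also window by time so two visits from one IP aren't merged.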

Looking at these hits in detail - the user agent is one of these:

a) Intel Mac OSX 10_x_something (where x is 11 or 12 or 13) - about 20%
b) Windows NT 10 (about 50%)
c) Windows NT 6.1 (about 30%)

In all cases, browser is Chrome.

There have been 26 such hits, all from different IPs, the last being Oct 3. At most there are 7 or 8 days between hits, sometimes not quite 24 hours. Typical seems to be 3 to 5 days.

21 of the IPs are from western countries (7 US, 6 Canada); the other 5 are what I would call third-world. Two are major biotech companies, 1 is a US university, and the rest seem to be a mix of residential and business big-ISP broadband (Verizon, sbcglobal, Comcast, Roadrunner, Bell, etc). In at least one case I got a hit last year from the exact same IP, and that hit looked "normal".

I have a theory: because the site is http (not https), something in the browser or some network device at the user's location has decided to prevent the user from surfing our site, and only the 3 files in question made it out to the user (or their network) before the session was terminated.

Does this sound legit? Or is it something else? Some sophisticated web-caching going on at the user's end, where they already have a copy of our files somehow?
2:45 am on Oct 8, 2018 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member keyplyr is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Sept 26, 2001
posts:12913
votes: 891


> Does this sound legit? Or is it something else? Some sophisticated web-caching going on at the user's end, where they already have a copy of our files somehow?
Sounds like a bot spoofing a browser.

I hope you're actually examining your server access log and not just getting your information from some analytics report.

If it were legit browser caching, you'd see a lot of 301s. A mobile browser sometimes uses a network cache where you wouldn't see 301 requests, but you said the OS was not mobile & the browser was not mobile... so it's a bot.

Most of Your Traffic is Not Human [webmasterworld.com]

Blocking Methods [webmasterworld.com]
3:09 am on Oct 8, 2018 (gmt 0)

Administrator from US 

WebmasterWorld Administrator not2easy is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Dec 27, 2006
posts:4161
votes: 262


Your 'partial load' sounds as good as any other explanation, especially when the IPs are not typically known for bots and when the browser signals may scare some people away. With various apps, extensions and sharing tools out there, it can be best guess time. Sometimes it is bot activity and sometimes it is the way people use the site. If you're not seeing signs of unwanted use (such as hotlinking) then I'd make note of it and see what else can be determined.
9:39 am on Oct 8, 2018 (gmt 0)

keyplyr (Senior Member from US)


Botnets will typically use the same, or nearly the same UA but come from a variety of compromised IPs, some company, some server farm and some ISPs.

However, your point about the browser blocking access after a couple round-trip requests being fulfilled may be a valid one, especially with Chrome.

I guess you have a reason for not using HTTPS, but expect more and more issues going forward. I think browsers will eventually block nonsecure sites outright. Of course that may be irrelevant if the search engines omit them from the index altogether.

What Will Happen if I Don't Switch to HTTPS? [webmasterworld.com]

Downsides of not using HTTPS [webmasterworld.com]

Why HTTPS Matters [developers.google.com]
12:33 pm on Oct 8, 2018 (gmt 0)

Preferred Member from CA 


joined:Feb 7, 2017
posts:521
votes: 46


If you log request headers, you could check these hits to see if they include a language. No language would suggest a bot; all human-run browsers should send one.

The volume of these odd requests is very low. I suspect this is a bot and not some human browser culling your site's contents.
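
If the Accept-Language header does make it into the logs, the check above is easy to script. A sketch assuming each parsed request is a dict with an optional "accept_language" key (field names are illustrative, not IIS's):

```python
def flag_missing_language(requests):
    """Return the IPs whose requests never carried an Accept-Language
    header - a common (though not conclusive) bot signal."""
    seen, with_lang = set(), set()
    for req in requests:
        seen.add(req["ip"])
        if req.get("accept_language"):
            with_lang.add(req["ip"])
    return seen - with_lang
```
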
3:41 pm on Oct 8, 2018 (gmt 0)

Junior Member (OP)


> I hope you're actually examining your server access log and
> not just getting your information from some analytics report.

I only look at raw IIS log files, one log file created each day for that day's hits. Usually these files run from 500 to 1500 lines (each line being an individual request for either "/" or a specific file). Google and bing bots probably account for 25% of the lines in a given log file. Weekends and holidays tend to generate smaller files. These numbers include IP pre-filtering in our Ubiquiti ER3-Lite router, which is currently blocking about 350 million IPv4 addresses (in CIDR blocks ranging from /24 to /8). We only have IPv4 internet access, through a single static IP.
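
For anyone scripting this kind of line-by-line review: IIS's W3C extended log format is self-describing - a `#Fields:` comment names the columns and data lines are space-separated. A minimal parser along those lines (exact fields vary by IIS version and configuration, and IIS4 may need the add-on filter the OP mentions to log some of them):

```python
from collections import Counter

def parse_w3c_log(lines):
    """Yield one dict per request from a W3C extended format log."""
    fields = []
    for line in lines:
        if line.startswith("#Fields:"):
            fields = line.split()[1:]   # e.g. date time c-ip cs-uri-stem sc-status
        elif line.startswith("#") or not line.strip():
            continue                    # skip other comments / blank lines
        else:
            yield dict(zip(fields, line.split()))

def hits_per_ip(lines):
    """Count requests per client IP."""
    return Counter(r["c-ip"] for r in parse_w3c_log(lines))
```
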

> If it was a legit browser caching, you'd see a lot of 301s

I've never seen 301 codes in the logs. I do see a moderate amount of 304's though.

> I guess you have a reason for not using HTTPS

I'm trying to get https up and running now on this NT4/IIS4 server. We run no server-side scripts, no e-commerce, no tracking of any sort, no integration with any ad network. It's a basic site whose structure / layout has not changed since it was created in the year 2000, apart from some minor tinkering.

> Of course that may be irrelevant if the Search Engines omit
> them from the index altogether.

Every on-line site safety/security scanner I submit my site to green-lights it with no issues, even though it is http (not https).

I see outfits like Trend Micro and Symantec hit our site every once in a while (Symantec has a division in Ireland that does this?) and I white-list them when I can identify them as such (vs being a random bot). BTW, I am blocking a lot of Google IP space (I think in the 35/8 network) - the rDNS comes back as "google-user" or google-user-content? I'm assuming that's a bot running on rented Google servers, and that blocking those has no implication for how google-search relates to us. Yes/no?
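
On that question: Google's documented way to tell its real crawler from something merely running on Google's cloud is a reverse-then-forward DNS check. A sketch (run it from any convenient machine, not necessarily the NT4 box):

```python
import socket

def looks_like_google_host(host):
    """True only for hostnames in Google's own crawler domains.
    Names ending in googleusercontent.com are Google Cloud customers
    (rented servers), not Google's crawler."""
    return host.endswith((".googlebot.com", ".google.com"))

def is_real_googlebot(ip):
    """Reverse-resolve the IP, check the domain, then forward-resolve
    the hostname and require that it maps back to the same IP."""
    try:
        host = socket.gethostbyaddr(ip)[0]
    except (socket.herror, socket.gaierror):
        return False
    if not looks_like_google_host(host):
        return False
    try:
        return ip in socket.gethostbyname_ex(host)[2]
    except socket.gaierror:
        return False
```

By this test, blocking googleusercontent.com ranges should not affect Googlebot crawling - though as always, verify against your own logs first.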

A google search for about a dozen keywords that we consider useful to our business always results in our site coming up on the first page of results, and could be as many as 5 of the first 10 results point back to us in some way. It's been this way for years, I haven't seen this change during the past few months.

> If you have implemented request headers you could check them to see if they have a language.

I'll have to see if there's a way to do that in IIS4. I remember that when the server was originally set up in like 1997 or 98, IIS didn't have the ability to log the requesting IP. I downloaded (from somewhere) an isapi filter dll that allowed IIS to log the requesting IP and referrer information.
1:32 am on Oct 9, 2018 (gmt 0)

Junior Member (OP)


Had a look at more recent logs today.

On Oct 6 another example of the mystery hit came in from Mount Sinai School of Medicine. User-agent was:

Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36

Yesterday afternoon another mystery hit from Emory University. User-agent was:

Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36

Between Feb 2016 and March 2018 there were 6 pdf requests from Emory (our pdf files are reprints of scientific papers). The last actual website hit was in January 2016 (and that one looked normal).

Then yesterday evening there were two hits from University of New South Wales (Australia). One was using Windows NT 10, the other was NT 6.1. Both using Firefox 62. Those were normal hits - all normal site files were requested. The referrer from one of them was duckduckgo.

Will be continuing this coming week to get https up and running and see if that changes anything. I'd like to know whether others operating http-only sites (which I admit are probably few to none - maybe some hobby / personal sites?) are seeing aborted site-access attempts like I'm describing, or whether there is otherwise solid info that *something* has changed in user, browser, or corporate / institutional firewall behavior when accessing non-https websites.
1:41 am on Oct 9, 2018 (gmt 0)

keyplyr (Senior Member from US)


> Had a look at more recent logs today...

That's how a botnet would work.

Good luck with the HTTPS security upgrade
1:52 am on Oct 9, 2018 (gmt 0)

Administrator

WebmasterWorld Administrator phranque is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Aug 10, 2004
posts:11567
votes: 182


> I do see a moderate amount of 304's though.

That should indicate a real browser that supports caching - are you seeing 304 responses from any of the same visitors that previously showed the partial access behavior?
2:33 am on Oct 9, 2018 (gmt 0)

Junior Member (OP)


I haven't yet seen examples where the partial-access people came back for another go at the site.

I have to think this (below) is the reason for what I'm seeing (as I said in first post, I started seeing this July 17).

And by the way, if this was bot activity, with nothing at all to do with http vs https, then some of you should be seeing this same behavior in your logs. And I've never seen bot activity from universities this heavy before (the exception being U-Mich and their IoT scanning).

============
[theverge.com...]

Feb 2 / 2018

Chrome will mark all HTTP sites as ‘not secure’ starting in July

Starting in July, Google Chrome will mark all HTTP sites as “not secure,” according to a blog post published today by Chrome security product manager Emily Schechter. Chrome currently displays a neutral information icon, but starting with version 68, the browser will warn users with an extra notification in the address bar. Chrome currently marks HTTPS-encrypted sites with a green lock icon and “Secure” sign.

Google has been nudging users away from unencrypted sites for years, but this is the most forceful nudge yet.

=====================

"Without that encryption, someone with access to your router or ISP could intercept information..."

Honestly, is that really a practical threat when you're at home, or using a hard-wired desktop/laptop?

Google has scanned my site a gazillion times and knows it inside and out, and Chrome has to put up a warning even though google search lists my site first or second (after a wikipedia entry), and third and fourth and fifth?

[edited by: phranque at 5:17 am (utc) on Oct 9, 2018]
[edit reason] fair use [/edit]

3:26 am on Oct 9, 2018 (gmt 0)

keyplyr (Senior Member from US)


> > I do see a moderate amount of 304's though.
>
> that should indicate a real browser that supports caching

Not with these requests.

> I've never seen bot activity from universities this heavy before (exception is U-Mich and their IoT scanning).

A "botnet" is not a normal bot. Botnets come from compromised ISP accounts, server farm accounts, shared hosting accounts, schools, and other places that have been infected with malicious code.

The code turns that server into a drone that sends requests out for specific reasons, usually to look for vulnerabilities on other servers, and so it goes on & on.

The requests from these compromised accounts usually come in bunches, requesting similar files, and will show in logs as coming from a dozen or more IP addresses from various sources.

Lists of these compromised accounts are traded & sold on the dark web to other bad actors.

I deal with this all the time.
3:39 am on Oct 9, 2018 (gmt 0)

Moderator from US 

WebmasterWorld Administrator martinibuster is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Apr 13, 2002
posts:14872
votes: 478


Chrome may be blocking users from your site because of the http.

I've been unable to access some sites because of the http issue. Chrome blocks the page entirely because it's insecure. I can use another browser to access it - but not Chrome.

Doesn't happen to all sites, just some sites.

Chrome can prefetch files; then the http issue hits and it stops. At that point, when the user clicks the link, Chrome prevents the user from reaching your site.
3:52 am on Oct 9, 2018 (gmt 0)

Junior Member (OP)


I get tons of requests for files and file-paths that I don't have on my site and have never had. I know what those hits look like. Sometimes they have long / weird strings of stuff in the user-agent or referrer. Or sometimes they do referrer spam. Or sometimes they scrape the site - only the .html files. I can tell you it's quite rare to see that coming from a US .EDU or medical center. Coming from India or China or Russia or Ukraine - sure.

I've only really been looking at each line of my log files in depth, every day, since mid-2015. Seeing a hit to default.html followed by only 2 gif files (always the same 2 gif files) is completely new behavior, and one that has zero usefulness from a bot / scraping POV. If this is distributed bot behavior, then why are they always grabbing the same 3 files? Why not the entire site as a group effort?

Is there a service (like spamhaus or senderbase for smtp spam) where I can submit an IP address and see if it has been identified or suspected of being a web bot/crawler?
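
One web-focused analogue of those SMTP blocklists is Project Honey Pot's http:BL, which works DNSBL-style: reverse the IP's octets, prepend a (free) API key, and query their DNS zone. A sketch following their published format - the key here is a placeholder, and the response decoding assumes their documented 127.days.threat.type layout:

```python
import socket

HTTPBL_KEY = "yourkeyhere"  # placeholder; requires free registration

def httpbl_query_name(ip, key=HTTPBL_KEY):
    """Build the http:BL lookup name: key + reversed octets + zone."""
    reversed_ip = ".".join(reversed(ip.split(".")))
    return f"{key}.{reversed_ip}.dnsbl.httpbl.org"

def check_httpbl(ip, key=HTTPBL_KEY):
    """Return (days_since_last_seen, threat_score, visitor_type) if the
    IP is listed, or None if not listed (lookup fails / NXDOMAIN)."""
    try:
        answer = socket.gethostbyname(httpbl_query_name(ip, key))
    except socket.gaierror:
        return None
    octets = answer.split(".")
    return int(octets[1]), int(octets[2]), int(octets[3])
```

Note this catalogs IPs seen misbehaving against their honeypots; as keyplyr says below, freshly compromised botnet nodes may not be listed anywhere yet.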
3:57 am on Oct 9, 2018 (gmt 0)

Junior Member (OP)


> Chrome can prefetch files then the http issue hits and it stops.
> At that point when the user clicks the link Chrome prevents the
> user from reaching your site.

I personally don't use Chrome, but I've just asked someone who does to look at the site and tell me what he sees. He says there is a little icon or something saying the site is not secure, but it's not putting up any message box telling the user they can't continue or some such garbage. If Chrome is preventing you from browsing to an http site - is it because you have enabled some other security setting? I'd like to know what Chrome 68/69's default behavior is regarding http browsing.
4:13 am on Oct 9, 2018 (gmt 0)

keyplyr (Senior Member from US)


> Is there a service (like spamhaus or senderbase for smtp spam) where I can submit an IP address and see if it has been identified or suspected of being a web bot/crawler?

Not for botnets. As previously stated, these are compromised accounts that will presumably be fixed at some point.

Usually a bot that pretends to be a human browser is malicious, so these aren't registered ranges.

Again, most of this info is outlined here: [webmasterworld.com...]
7:45 am on Oct 9, 2018 (gmt 0)

martinibuster (Moderator from US)


> I've just asked someone who does have it to look at the site and tell me what he sees.


Ask him to search for your keywords and click from the SERPs. That's how it's manifested to me.

It doesn't happen all the time, only to a few sites.

It's not consistent either. I've reached insecure news sites from Google News with no problems.
2:45 pm on Oct 9, 2018 (gmt 0)

Junior Member (OP)


> Ask him to search for your keywords and click from the SERPs. That's how it's manifested to me.

But then how is that browser-specific? Does Google give a user a different experience when they click on a "non-secure" SER link if they're using Chrome vs Firefox?
3:19 pm on Oct 9, 2018 (gmt 0)

martinibuster (Moderator from US)


Yes.
SERPs have frequently been different depending on browser. For years now.

Not limited to SERPs either. Googling on a VPN with Chrome can get you more CAPTCHAs than doing it on IE.
4:41 pm on Oct 9, 2018 (gmt 0)

Junior Member (OP)


> SERPs have frequently been different depending on browser.

What I mean is: I know you can get different SER's depending on what browser you use, even if the search query is exactly the same. But if one of the SER's is the exact same URL, and you click on it in one case with Chrome and in another with FF, will google throw up a warning or block you from proceeding with Chrome but not with FF?

By the way, I have confirmation that a google search for keywords pointing directly to our site, followed by clicking on those SER's, brings you to our site with no warning that the site is "insecure". This is with Windows / Chrome. The only indication that something is "wrong" is a "not secure" label at the left end of the url address bar. Presumably the entire page loads (all 32 files), not just the 3 files I'm seeing (the page wouldn't render correctly if only 2 of the 30 gifs were downloaded).

Could it be that this has to be tried on a system that has never hit our site before to fully test this theory? Maybe clear the cache and browser history to make it a real test? Maybe Chrome keeps some sort of reputation score for sites browsed in the past and will not throw up warnings or prevent access to http sites that were hit before?
7:19 pm on Oct 9, 2018 (gmt 0)

keyplyr (Senior Member from US)


> [Is] google throwing up a warning or is blocking you from proceeding when you're using chrome but not with FF?

The warnings are browser generated.

The search index doesn't create the warnings; each browser is responsible for what is displayed as far as safety warnings go.
 
