Blue Coat

lucy24

10:13 pm on Sep 15, 2015 (gmt 0)

Someone refresh my memory, please: why are we not supposed to block BlueCoat ranges?

199.19.249.196 - - [15/Sep/2015:01:11:25 -0700] "GET /fonts HTTP/1.1" 301 593 "-" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:40.0) Gecko/20100101 Firefox/40.0" 
199.19.249.196 - - [15/Sep/2015:01:11:26 -0700] "GET /fonts/ HTTP/1.1" 200 8261 "-" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:40.0) Gecko/20100101 Firefox/40.0"
199.19.249.196 - - [15/Sep/2015:01:15:02 -0700] "GET /hovercraft/april_blues.html HTTP/1.1" 200 277436 "-" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:40.0) Gecko/20100101 Firefox/40.0"
199.19.249.196 - - [15/Sep/2015:01:16:12 -0700] "GET /hovercraft/nunavut99 HTTP/1.1" 301 623 "-" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:40.0) Gecko/20100101 Firefox/40.0"

et cetera, for a total of 418 requests. I don't think this site actually has 418 pages; the total was bloated by all those slashless requests. There was also a request for /paintings// with a spurious extra slash, which makes me wildly uneasy, because the Googlebot has been requesting this lately too, and I swear I can't find any malformed links -- or else the other search engines would be requesting them as well. Mostly pages, except for a few non-page files that happen to have <a href> links.
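
For anyone who wants to count the same pattern in their own logs, a rough sketch (the combined log format and the "access.log" path are assumptions, and the no-trailing-slash-no-extension heuristic is only approximate):

import re

# Tally 301s issued for slashless directory requests in an Apache
# combined-format access log. Path and heuristic are assumptions.
REQUEST = re.compile(r'"(?:GET|HEAD) (\S+) HTTP/[\d.]+" (\d{3})')

slashless_301s = total = 0
with open("access.log") as log:          # placeholder path
    for line in log:
        m = REQUEST.search(line)
        if not m:
            continue
        path, status = m.groups()
        total += 1
        last_segment = path.rsplit("/", 1)[-1]
        if status == "301" and not path.endswith("/") and "." not in last_segment:
            slashless_301s += 1

print(f"{slashless_301s} of {total} requests were slashless-directory 301s")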

User-Agents toggled between
Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:40.0) Gecko/20100101 Firefox/40.0
and
Mozilla/5.0 (Windows NT 6.1; WOW64; rv:40.0) Gecko/20100101 Firefox/40.1
-- mostly the latter.
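
A quick way to spot that kind of toggling across a whole log, again as a sketch (combined format and the "access.log" path are assumptions):

import re
from collections import defaultdict

# Group User-Agents by IP to spot one client flipping between UA
# strings during a single visit.
COMBINED = re.compile(r'^(\S+) .*?"[^"]*" \d{3} \S+ "[^"]*" "([^"]*)"')

uas_by_ip = defaultdict(set)
with open("access.log") as log:
    for line in log:
        m = COMBINED.match(line)
        if m:
            uas_by_ip[m.group(1)].add(m.group(2))

for ip, uas in sorted(uas_by_ip.items()):
    if len(uas) > 1:
        print(f"{ip} used {len(uas)} User-Agents:")
        for ua in sorted(uas):
            print(f"  {ua}")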

No request for robots.txt, but plenty of requests for pages in one roboted-out directory (one of two, and the less visible one at that). Further weird quirk: they didn't get around to requesting the front page until about a quarter-hour into the visit. Final weird quirk: although they didn't bother with robots.txt, they did ask for the sitemap. I consider that rude.

Checking back in my records, I find that this particular range -- 199.19.248.0/21 -- was blocked at one time, later unblocked due to apparent humans ... and is now decidedly blocked, in case someone comes rattling the barn door in search of additional horses.
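
For anyone who doesn't speak CIDR, a minimal sketch of what that /21 covers, using Python's ipaddress module (the second sample IP is invented, just to show a near miss outside the mask):

from ipaddress import ip_address, ip_network

# 199.19.248.0/21 spans 199.19.248.0 through 199.19.255.255.
BLUECOAT = ip_network("199.19.248.0/21")

for ip in ("199.19.249.196", "199.19.247.1"):   # second IP is hypothetical
    verdict = "inside" if ip_address(ip) in BLUECOAT else "outside"
    print(f"{ip} is {verdict} {BLUECOAT}")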

....

Oh, ###. As I write this, I realize that an additional, seemingly unrelated robot from the next day -- which I'd been avoiding taking a closer look at because it's too ### complicated, having done a fine job of impersonating a human -- was using the same two User-Agents as this unwanted BlueCoat visitor. WTF?

keyplyr

7:33 am on Sep 16, 2015 (gmt 0)

I don't block bluecoat.com: 199.19.248.0/21 (199.19.248.0 - 199.19.255.255).

I agree they look pretty shady; however, AFAIK they provide company firewalls, so blocking them may mean blocking human visitors. Anyway, I have no notes of them ever causing any mischief for me.

Those two UAs are spoofed quite a bit in my logs too.

As for file paths that don't exist being requested, then requested again by Google: I always assume Google is following a toolbar user who clicked on a bad database link, or possibly crawling a scraped copy of my page somewhere. In these cases, ironically, there never seems to be a referrer to investigate.

lucy24

8:19 am on Sep 16, 2015 (gmt 0)

I spoke too soon.

After posting, having run out of ways to procrastinate, I dealt with that other robot, the one with the same two UAs. It turns out I'd got two sets of visits mixed up (the date in the posted log excerpt should have been a clue); the "other" one wasn't from the next day, it was at exactly the same time. Perfectly normal UA, but I have to postulate one of those wretched plugins that let a human browser scrape an entire site for your later reading pleasure -- even if you've only seen three pages and have absolutely no reason to think that everything else will be equally interesting to you. The human was on an AT&T Global address in the UK. Two addresses, in fact, because they used an entirely different IP for Piwik -- or they did, until they turned off scripting about ten pages in.

So BlueCoat was entirely blameless: it was just following along and requesting the HTML the way it's supposed to, while the browser gobbled up every last thing. Luckily all this took so long that I hadn't yet got around to uploading the .htaccess with the added IP block. I also learned that BlueCoat shares the oBot's problem with directory slashes: the human asks for /directory/; BlueCoat asks for /directory and then gets redirected to /directory/. Every single time.
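
You can watch that redirect pair happen with a few lines of Python -- a sketch only, with a placeholder hostname and the /fonts path from the excerpt above:

import http.client

# Request the slashless directory and show the 301 plus its Location
# header, i.e. the same GET /fonts -> /fonts/ pair as in the logs.
# "www.example.com" is a placeholder host.
conn = http.client.HTTPConnection("www.example.com")
conn.request("GET", "/fonts")
resp = conn.getresponse()
print(resp.status, resp.getheader("Location"))   # expect: 301 and .../fonts/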

The other IP -- the human/scraper one -- even asked for robots.txt at one point. And I was wrong: they did obey it. (I'd got them mixed up with an unidentified robot from -- of all places -- Japan, which came through a few days ago and devoured every last page. That barn door can stay locked.)

No idea why it toggled between those two different UAs (FF 40.1 for Windows, FF 40.0 for Linux). After a while it got bored and just used the Windows version. There was also a lone "Wget/1.16.1 (linux-gnu)" at the beginning of the scraping stage -- before the robots.txt request -- but that one was summarily 403'd, so they didn't try it again.

This annoys me. If some human wants to open a bunch of pages and save them to their HD for offline reading, they're welcome to do so. I can hardly object, since I've done the same thing myself. But there is absolutely zero possibility that someone manually visited every last page in every last directory -- excluding only the two roboted-out areas. And humans on this site don't usually make such a blizzard of requests that they run into a 503* overload.


* I think my server's current ceiling is 100 concurrent requests; I can't be sure, because the current error logs don't show 503s. I do see them sometimes on the game site, where a single "page" can have well over 100 associated non-page files. It doesn't seem to affect the user experience, though.
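
When the error log is silent, the access log can still give a rough answer -- a sketch, with the same assumptions as before about the combined format and the "access.log" path:

import re
from collections import Counter

# Count 503 responses and find the busiest single second, as a crude
# proxy for pressure on a concurrency ceiling.
STAMP_STATUS = re.compile(r'\[([^\]]+)\] "[^"]*" (\d{3}) ')

per_second = Counter()
count_503 = 0
with open("access.log") as log:
    for line in log:
        m = STAMP_STATUS.search(line)
        if m:
            per_second[m.group(1)] += 1
            if m.group(2) == "503":
                count_503 += 1

if per_second:
    stamp, peak = per_second.most_common(1)[0]
    print(f"{count_503} responses were 503s; busiest second: {stamp} ({peak} requests)")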