Automating the server farm identification

an alternative approach

         

trintragula

10:16 am on Dec 9, 2014 (gmt 0)

10+ Year Member Top Contributors Of The Month



Identifying server farms is a manual process. It's time consuming and open-ended.

However, every time a visitor does something that's manifestly human they're identifying themselves as not being from a server farm.
If we can collect enough of that information automatically, then the server farms are the other ones.

This is the germ of an idea.

As an example of this idea in practice, here is a list of the /8s that have never posted on my forum (in 5 years). So from my perspective, any candidates for a deny from /8 should be in this list.

0,3,4,6,7,8,9,10,11,13,14,15,16,17,18,19,20,21,22,25,26,28,
29,30,33,34,35,36,39,40,41,42,43,44,45,48,51,52,53,55,56,
57,102,104,111,117,125,126,127,133,135,136,140,148,153,158,
160,161,167,170,177,179,180,181,183,191,196,197,200,215,
221,223,224,225,226,227,228,229,230,231,232,233,234,235,
236,237,238,239,240,241,242,243,244,245,246,247,248,249,
250,251,252,253,254,255

54 is not included because I had posts from that /8 2 weeks before AWS bought them.

Because my forum is small (only around 1000 members) it's not a statistically strong sample (and individual numbers are still getting knocked off this list every few months), but someone with a much larger forum could produce this list the same way I did, with a single line SQL query, and could generate some better data.
I'd be interested...
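
For anyone who would rather work from an exported list of poster IPs than from SQL, the same idea can be sketched in a few lines of Python. This is only an illustration, not the original one-line query: the function name and the sample addresses are made up, and it assumes plain dotted-quad IPv4 strings.

```python
def unseen_slash8s(poster_ips):
    """Return the sorted list of /8 blocks (first octets) with no posts."""
    seen = set()
    for ip in poster_ips:
        first_octet = int(ip.split(".")[0])
        seen.add(first_octet)
    return [block for block in range(256) if block not in seen]


# Hypothetical sample data: three posts from two different /8s.
sample_ips = ["54.80.1.2", "93.115.80.7", "93.112.0.9"]
candidates = unseen_slash8s(sample_ips)
print(len(candidates))  # 254 blocks left as candidates for closer scrutiny
```

With real data the list shrinks as more /8s are seen posting, which is why a larger forum should give a tighter list.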

One way of looking at the bot problem is to think about asking "what IP addresses are you prepared to give the benefit of the doubt?"
I would regard the list above as being candidates for closer-than-usual scrutiny - e.g. throwing up a captcha page.

/8 are huge chunks. It would be nice to go a lot finer, but that would need more data. And of course if it's collected from forum posts, you need to make sure they're not spam! That gets harder to guarantee on a large forum. But there may be more reliable sources.

I've not really done anything about this yet, I'm just thinking aloud. There may be some flavours of this approach that could complement what people are doing with all those lists of server farms...

A quick look at today's log reveals that about 3% of my visits are from these /8s, including some Baidu, Synapse, Yisou and a few other bots. I'm already blocking the vast majority of bot traffic, though, so on an undefended site it might be much higher.

Angonasec

7:29 am on Dec 17, 2014 (gmt 0)



Q/
I am very sorry to have been a nuisance, and I have no wish to cause offence.
/Q

Terrific line for a bot UA.

Not taking offence is one of the seven secrets of happiness :)

A Merry Christmas and a Happy New Year to our Reader.

trintragula

5:09 pm on Dec 17, 2014 (gmt 0)

10+ Year Member Top Contributors Of The Month



I'm getting distracted by the season, and also some mad SQLing in my logs with phpmyadmin.

Q/
I am very sorry to have been a nuisance, and I have no wish to cause offence.
/Q

Terrific line for a bot UA.

I've almost seen that! There are quite a few UAs that have made me laugh... My favourite is 200pleasebot. Uh, 403thank-you.

lucy24

6:17 pm on Dec 17, 2014 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I once had a running thread [webmasterworld.com]* on good UA strings. For a while I was getting a lot of "Untrusted" versions of I-forget-which browser. "Took the words right out of my mouth."

Edit after rereading old thread: At the very end, someone stepped in with an explanation of what "untrusted" means. I also eventually figured out "rarely used", which is in fact quite frequently used. Has to do with mobile image search.


* The original thread title was "And the winner is..." I forget why they changed it.

trintragula

9:05 pm on Dec 17, 2014 (gmt 0)

10+ Year Member Top Contributors Of The Month



That was pretty funny... :)

I had one called just "Keep Out", and one called "Just-Crawling 0.1"
As for PlurkBot and Vagabondo - not sure I want to know.
Then there's:
"MSIE or Firefox mutant; not on Windows server;"
Oh well that's alright then...

Then there's:
"Windows-Live-Social-Object-Extractor-Engine/1.0"
I'm not sure I want my Live Social Objects extracted, thank you!

keyplyr

12:46 am on Dec 18, 2014 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I explained that the *connection* was untrusted and not the browser or UA.

lucy24

4:48 am on Dec 18, 2014 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



not the browser or UA

Yes, I remember. But they put it in the UA string, not in something like a supplementary header. Inevitably it prompts the question "Now that I know this, what am I expected to do?"

The user-agent "rarely used" turned out to be similarly mundane. (I didn't figure that one out until a later thread.) But there remain some real winners among UA strings.

keyplyr

7:57 am on Dec 18, 2014 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I don't think "rarely used" is/was mundane. It is/was used to download image files, if I remember correctly. Anyway, I block it.

Yes, I don't know why "untrusted" got into the UA string; that is odd. I guess there is nowhere else to put it?

trintragula

2:15 pm on Jan 4, 2015 (gmt 0)

10+ Year Member Top Contributors Of The Month



Last night being Saturday night, and me being the party animal that I am, I got to thinking:

So if I now have a list of server farms from WW (albeit partial), I wonder if any of the posts on my forum are from those server farms?

I know there's no spam on my forum, so this should be interesting.

So this morning with phpmyadmin, I try this against my database:
SELECT
    concat( f.ip, '/', f.mask ) AS farm_range,
    substring_index( from_unixtime( max( m.postertime ) ), ' ', 1 ) AS last_post
FROM (
    -- one row per /24: any poster IP in it, and its latest post time
    SELECT min( posterip ) AS posterip, max( postertime ) AS postertime
    FROM `smf_messages`
    GROUP BY substring_index( posterip, '.', 3 )
) AS m
INNER JOIN smf_farms AS f
    -- same network once both sides are masked to the farm's prefix length
    ON ( inet_aton( m.posterip ) & ( -1 << ( 32 - f.mask ) ) ) =
       ( inet_aton( f.ip ) & ( -1 << ( 32 - f.mask ) ) )
GROUP BY f.ip, f.mask
ORDER BY last_post DESC


Sorry about the awful formatting - is there a way to do better?

With this result:

54.80.0.0/12 2014-07-26 -- amazon
93.112.0.0/13 2014-07-26 -- voxility
93.115.80.0/20 2014-07-26 -- fullshop romania (voxility?)
54.192.0.0/12 2014-07-14 -- amazon
146.185.0.0/16 2014-06-11 -- HSI 100TB (was netsumo)
5.63.144.0/21 2014-06-09 -- HSI 100TB (was netsumo)
74.115.0.0/21 2014-04-01 -- anchorfree
91.108.180.0/22 2014-03-18 -- webexxpurts
65.192.0.0/11 2014-02-05 -- colostore (now some MCI/verizon?)
69.40.0.0/13 2013-09-04 -- windstream
2.232.0.0/13 2013-07-08 -- fastweb
69.174.0.0/17 2013-03-06 -- scansafe
209.251.192.0/19 2013-02-08 -- tampa time inc
93.114.40.0/21 2013-02-05 -- voxility
213.235.192.0/18 2013-01-26 -- austria tele2
209.68.0.0/18 2013-01-24 -- pairnet
69.48.0.0/12 2013-01-09 -- HSI/intergenia


These are farm ranges mentioned on the server farms thread stickied on this forum from which I have seen forum posts at one time or another. The dates listed above are when I last saw a post on the forum from that range.
The comment on the end is my cursory manual lookup of the IP itself.
I've only included the last 2 years, because older data is unlikely to be useful.
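
The range matching in that query can also be sketched outside the database. The following is a minimal Python version of the same masking idea, assuming the posts are available as (IP, Unix timestamp) pairs and the farm list as CIDR strings; the function name and the sample data are hypothetical.

```python
import ipaddress
from datetime import datetime, timezone


def last_post_per_farm(posts, farms):
    """Map each farm range to the date of the latest post seen from it.

    posts: iterable of (ip_string, unix_timestamp)
    farms: iterable of CIDR strings such as "54.80.0.0/12"
    """
    # strict=False masks off host bits, like "& (-1 << (32 - mask))" in SQL.
    networks = [ipaddress.ip_network(cidr, strict=False) for cidr in farms]
    latest = {}
    for ip, ts in posts:
        addr = ipaddress.ip_address(ip)
        for net in networks:
            if addr in net and ts > latest.get(net, 0):
                latest[net] = ts
    return {
        str(net): datetime.fromtimestamp(ts, tz=timezone.utc).date().isoformat()
        for net, ts in latest.items()
    }


# Made-up posts: two inside 54.80.0.0/12, one outside every listed range.
posts = [("54.81.2.3", 1406332800), ("54.82.0.1", 1400000000), ("8.8.8.8", 1406332800)]
farms = ["54.80.0.0/12", "93.112.0.0/13"]
print(last_post_per_farm(posts, farms))
```

Ranges with no matching posts simply do not appear in the result, which mirrors the inner join.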

I'm guessing most (but not all) of these are because ranges have changed hands, and I think in both directions (people posting before the range was listed, or after it was delisted).
There are 2-3 voxility ranges in there: I have a member who proxies through them for anonymity and I've not had cause to block him.
While my forum is international, it is pretty innocuous.

I have about 6000 distinct /24s amongst my forum posts, and I'm seeing maybe 0.5% of these here (old ones are not shown). The list of server farms I'm using from WW is pretty small - only 2000 ranges.
So far I'm only using this list for research - not blocking, but it does show that updates do happen and have a measurable effect.
A better check would look at the date the ranges were listed here and see whether they were listed at the time they were used. That's probably possible to automate with some more work - or for a list this short I could do it once by hand. Harder for me is to figure out whether any of these ranges have been delisted, or broken up and partially delisted.

trintragula

3:22 pm on Jan 4, 2015 (gmt 0)

10+ Year Member Top Contributors Of The Month



There's not much point to blocking 224.0.0.0/3 -- or indeed of referring to this sector in any way whatsoever -- since it appears to be perpetually unassigned (yes, even while 185 is being doled out in /22 slivers). Just makes a smidgen more work for the server.


224 to 239 is multicast so will not be used for normal web traffic of any kind. 240 up seems to be reserved for multicasting.


I got curious about this: It turns out I have seen a handful of visits in 2014 from north of 224/8, and also north of 240/8. Hackers maybe?
There shouldn't be anything out there. But there is.

EDIT: I'm suspecting something to do with the forwarding headers or shared hosting. This may not occur if you block at the htaccess level.

lucy24

6:56 pm on Jan 4, 2015 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



It turns out I have seen a handful of visits in 2014 from north of 224/8, and also north of 240/8. Hackers maybe?

Weird and unnerving, because if someone has figured out how to fake an IP we are in "Be afraid. Be very afraid." territory.

:: quick detour to raw logs to verify that I've never met anyone from ^2(2[4-9]|[3-5]\d)\. ::

Can you lay your hands on a specimen log entry?

trintragula

7:27 pm on Jan 4, 2015 (gmt 0)

10+ Year Member Top Contributors Of The Month



Alas because I'm on shared hosting, and doing my own logging using php and mysql in the forum software, I don't have a conventional log. The provider only provides them if you give them 24 hours notice before you know you need them :(.

The forum software does some stuff with headers to figure out the 'real' IP address from HTTP_X_FORWARDED_FOR and other headers when people are proxying so I rather suspect this isn't the IP address that it came in with, but something spoofed in the headers. Though why they should try to persuade me that they're from an illegal range, I don't know.
All I have is the IP and the useragent.

I think I may have to have a closer look at what the forum software is doing with the IP address before I see it.

Until then, I'm going to assume it's a false alarm, as it does seem very, very unlikely.

EDIT: yes, this is almost certainly happening through the headers. I'll need to fix that.
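
One possible shape for that fix, sketched in Python rather than in the forum's PHP: only accept a forwarded address if it parses as an IP and is not in a multicast, reserved, private or similar range, and otherwise fall back to the address the connection actually came from. The function name is hypothetical, and whether to trust forwarding headers at all is a separate policy decision.

```python
import ipaddress


def client_ip(remote_addr, forwarded_for=None):
    """Pick an IP to log: use X-Forwarded-For only if it looks plausible."""
    if forwarded_for:
        # The header may hold a comma-separated chain; the first entry is
        # the claimed original client.
        candidate = forwarded_for.split(",")[0].strip()
        try:
            addr = ipaddress.ip_address(candidate)
        except ValueError:
            return remote_addr  # not an IP at all, ignore the header
        bogus = (addr.is_multicast or addr.is_reserved or addr.is_private
                 or addr.is_loopback or addr.is_link_local
                 or addr.is_unspecified)
        if not bogus:
            return str(addr)
    return remote_addr


print(client_ip("198.51.100.7", "239.1.2.3"))  # multicast claim is ignored
```

A header claiming 224/4 or 240/4 is then logged under the connecting address instead of the claimed one.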

dstiles

8:29 pm on Jan 4, 2015 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I suggest that the amazon IPs you are seeing are genuine mobiles - there have been comments here about that (despite which, I block all amazon).

I suspect that other server farm IPs have been used as proxies. Some people set up proxies on their own servers for their own use or for public use. They may be innocuous but I still block most of them. There are sites showing open proxies that may be useful. Such proxies may be set up for nefarious purposes.

Scansafe (and ironport) are reasonable proxy sources used by (I think) mostly businesses that are worried about their own IPs being on the internet - some of this is scaremongering but not all. There are also some UK educational services that use proxies to protect school pupils (who, I suspect, are quite capable of circumventing such devices).

Apart from amazon, most /13 ranges are DSL so can be accepted. It's true there are some nasties on such ranges but that's true of all broadband ranges. In fact, I think some of your smaller ranges are also broadband (e.g. tele2).

Look at the DNS using linux Network Tools to determine the ownership and range (NOTE: some RIPE ranges are only sub-ranges and have to be pursued). Use linux UMIT to see if there are any open port IPs - I generally check three or four IPs but sometimes more. In this, note that clouds do not usually show open ports - still haven't figured out how that works! :(

What Lucy said about the upper IP ranges. I do see firewall log entries that seem to be DNS or similar lookups made by my server using 224.0.0.nnn over UDP but I can't recall having seen TCP usage.

jmccormac

8:46 pm on Jan 4, 2015 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



As an example of this idea in practice, here is a list of the /8s that have never posted on my forum (in 5 years). So from my perspective, any candidates for a deny from /8 should be in this list.
This kind of approach bothers me because it is so subjective. I did build a global map of IP addresses and owners as part of a mapping project a while ago. However I am not the average WW poster when it comes to this kind of thing and creating blocking ranges for webservers was not the main intent of the research. Ranges are redelegated and reassigned so it is possible that there could be ranges in various /8s that are human use rather than data centres.

Regards...jmcc

not2easy

8:54 pm on Jan 4, 2015 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



I have 224.239.0.0 Amazon Technologies
You can count on finding Amazon ranges..
(I did not dig deep enough to find what/when prompted that lookup, sorry)

trintragula

9:06 pm on Jan 4, 2015 (gmt 0)

10+ Year Member Top Contributors Of The Month



Thanks, that's useful.

I've not generally taken much of a stand against proxies, because I'm blocking by behaviour, so it doesn't matter much where it comes from, except when tracking a sequence of requests. I'm certainly aware that some of my members use proxies, presumably when visiting from work. Although I see it happen when I look, I don't have a permanent record of it. And of course some unwanted visitors will use proxies to hide their origin.

I'm interested generally in metrics to help classify visitors into good vs. bad, and the size of the range they're from is an additional clue. Currently I have a half-dozen metrics visible while I'm researching that I don't take action on.


Incidentally, I have seen a visitor who carried on a perfectly plausible session but with essentially every request from a different /32 within the same /24. I'm guessing this is some kind of high capacity firewall that's using a pool of addresses on the outside, as well as NAT on the inside. Does this happen? I'm not a network engineer, so I'm inferring this based on what I saw happening. I had initially assumed this was a server farm but on closer inspection it looked much more humanoid based on the pattern of requests.
I'm afraid I don't remember whether it was a mobile device. It would be a plausible thing for a mobile operator to do, but I've not generally seen that.

lucy24

9:33 pm on Jan 4, 2015 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



but with essentially every request from a different /32 within the same /24

That's innocuous. Sometimes you will even meet a cell-phone browser hopping about from different A (what you call /8) ranges. In my personal log-wrangling I use the pattern
^(\d+\.\d+\.\d+)\.\d+ {blahblah} \1\.\d+

because it's so common for the D part to vary.
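
The same grouping can be done programmatically. Here is a small Python sketch that buckets IP strings by their first three octets, on the assumption that the input is a list of dotted-quad addresses; the function name is made up for the example.

```python
import re
from collections import defaultdict

# Same idea as the pattern above: capture the first three octets (A.B.C).
SLASH24 = re.compile(r"^(\d+\.\d+\.\d+)\.\d+$")


def group_by_slash24(ips):
    """Group dotted-quad IP strings by their /24 prefix."""
    groups = defaultdict(set)
    for ip in ips:
        match = SLASH24.match(ip)
        if match:
            groups[match.group(1)].add(ip)
    return dict(groups)


print(group_by_slash24(["192.0.2.1", "192.0.2.77", "198.51.100.3"]))
```

A session whose requests all land in one bucket is then treated as one visitor even though the D part varies.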

trintragula

9:13 am on Jan 6, 2015 (gmt 0)

10+ Year Member Top Contributors Of The Month



<interlude>
While I'm researching I'm checking the logs often...

This morning I have a visitor from China! An actual human!
They haven't signed up on the forum, but they're showing a normal pattern of distinctly human behaviour. I hope they do sign up...

My site is international, being a media-based hobby site, although heavily dominated by the English-speaking nations.

I'm used to seeing Chinese IPs relegated to the 'bots' category on my log summariser, and it's very rare to see one being categorised as a human.

This one is from a university on CERNET (59.78.0.0 - a /15 which I think has nothing to do with the European CERN). I looked the range up here, but it's only listed as part of a way to block China, not as a source of specific abuse (except for dark rumblings about research spiders...)

Meanwhile I'm blocking a couple of Baidu visitors, and a couple of spammers from elsewhere in China, who are triggering multiple traps.

The only trigger this human is showing is being from a range which I haven't seen posts from before. There are no forwarding headers.

Well that's heart-warming anyway. :)


Meanwhile there's a bot at datashack that's been pounding away round the clock with no less than 29 different user agents... hmm SFS thinks it's a spambot. No kidding...

</interlude>

lucy24

7:02 pm on Jan 6, 2015 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Well, with a population in excess of one billion, you have to assume there are a few humans living in China.

You may find it illuminating to look at request headers. IncrediBill or someone like him once posted a bit of code that creates a php file listing all headers. I put it in my shared footer and glance at the files periodically.

:: idly wondering if anyone reading these forums has ever had a visit from 175.45.176.0/22 ::

not2easy

7:17 pm on Jan 6, 2015 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



:: idly wondering if anyone reading these forums has ever had a visit from 175.45.176.0/22 ::

Nothing so far that required keeping a record. But I'm not through last month's logs yet for a few sites.

trintragula

7:22 pm on Jan 6, 2015 (gmt 0)

10+ Year Member Top Contributors Of The Month



I've been watching the headers for a while now. One in particular is an astonishingly good predictor of botness. Thx iBill for the hint!

:: idly wondering if anyone reading these forums has ever had a visit from 175.45.176.0/22 ::

Nothing that's sprung any traps.
Though I vaguely recall something about this recently... perhaps they erased my memory.

dstiles

7:58 pm on Jan 6, 2015 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Lucy - good one! But yes, I have it in my database and I'm fairly sure it was as a reaction to an invalid hit 2 years ago this Wednesday! It could have hit since but as I flagged it as "Block" I wouldn't have seen it.

trintragula - cernet is the Chinese educational network. Relatively untroublesome.

lucy24

9:15 pm on Jan 6, 2015 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



For those who didn't bother to look it up, 175.45.176.0/22 (wilderness would say "^175\.45\.17[6-9]") is North Korea. That is: it's all of North Korea, encompassing the top 1024* governmental officials who are allowed to see the real internet as you and I see it. They don't appear to be interested in me, darn it.

Hasty edit as I realize I'm in the wrong thread.


* I chewed on this for a while and decided that, North Korea being North Korea, they probably don't share IP addresses the way a normal government would. That way it's easier to check on who's been visiting what.
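
The /22 arithmetic and the regex equivalent quoted above can be checked with Python's standard ipaddress module. This is just a verification sketch; nothing in it comes from anyone's logs.

```python
import ipaddress
import re

net = ipaddress.ip_network("175.45.176.0/22")
pattern = re.compile(r"^175\.45\.17[6-9]")

print(net.num_addresses)  # 1024
# Every address in the /22 matches the regex...
assert all(pattern.match(str(addr)) for addr in net)
# ...and the neighbours just outside it do not.
assert not pattern.match("175.45.175.255")
assert not pattern.match("175.45.180.0")
```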

trintragula

10:56 am on Jan 8, 2015 (gmt 0)

10+ Year Member Top Contributors Of The Month



Chrome and Firefox have been on rapid release now for a while, so at any given time, there's a sliding window of a few versions that are in common use. Old browsers are commonly associated with plain-clothes bots, though there are always a few human hold-outs.

Here's a quick query against my log of the last 24 hours for versions of non-mobile chrome that are in use

n_ips,n_requests,percent_blocked,version,ua_variants
1,1,0,9,1
1,1,0,14,1
2,2,0,18,1
1,1,100,25,1
6,21,9,26,1
5,7,0,27,3
2,7,0,29,2
14,33,3,30,4
2,17,0,32,2
2,24,0,33,2
1,2,0,34,1
1,4,25,35,1
8,16,31,36,7
15,53,30,37,9
11,56,0,38,9
234,1209,6,39,36
4,118,0,40,3
2,6,0,41,2

I present this in CSV format for a spreadsheet, as the forum appears to have no way to lay this out decently.

I've done a similar exercise with Firefox.

There's a spike around Chrome/30 that may well be a stealth bot that I'm not blocking.

It would certainly be possible to produce a dynamic weight of the likelihood that a given browser version is 'good' and this would be better done by closeness to the most popular version than simply by the number of hits.
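
As a rough illustration of weighting by closeness to the most popular version, here is a Python sketch using the n_ips column from the table above. The linear fall-off and the window of 4 versions are arbitrary assumptions chosen for the example, not something tuned against real traffic.

```python
def version_weights(ip_counts, window=4):
    """Score each version from 0.0 to 1.0 by distance from the most popular.

    ip_counts: dict of {major_version: number_of_distinct_ips}
    window: hypothetical cut-off; versions this far from the peak score 0.0
    """
    peak = max(ip_counts, key=ip_counts.get)
    return {
        version: max(0.0, 1.0 - abs(version - peak) / window)
        for version in ip_counts
    }


# n_ips per Chrome version, taken from the table above.
chrome_ips = {9: 1, 14: 1, 18: 2, 25: 1, 26: 6, 27: 5, 29: 2, 30: 14, 32: 2,
              33: 2, 34: 1, 35: 1, 36: 8, 37: 15, 38: 11, 39: 234, 40: 4, 41: 2}
weights = version_weights(chrome_ips)
print(weights[39], weights[38], weights[30])  # 1.0 0.75 0.0
```

Under this scheme the Chrome/30 spike scores zero despite its 14 IPs, because the score depends on distance from the peak rather than on hit count.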

trintragula

10:04 am on Jan 19, 2015 (gmt 0)

10+ Year Member Top Contributors Of The Month



I haven't forgotten about the original purpose of this thread, and the research is still ongoing. I can tell you're all on the edge of your seats...

In the meantime, here's a distraction I've been following up.

It's occurred to me that the timing of visits may tell you things about the visitor. I've long had a trap to watch for "clockbots" - visitors that apparently don't need sleep... but most mechanisms show periodicities that you don't find in humans.
Since these are most easily spotted by visualisation, I started on that yesterday.
My online monitor now shows the previous 24hrs for each visitor's visits divided into 10-minute time slices, with a bar for each 10-minute period in which there were visits.
I can't embed pictures here, so I've produced a text version at lower resolution which shows the general idea.

Here's a typical logged-in human visitor:
---------------------------------------------------------------x-x----xx
=--=-============-=========================-=-=====-==--=--==-======--==

Each hyphen in the first row is a 20-minute time slice. 'x' marks a slice during which there were visits by this visitor.
The second row shows visits (with an '=') by other visitors using the same User Agent.
Because I'm looking at recent activity, most normal visits are clustered at the right hand end.
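
A text row like this can be produced with very little code. The sketch below is one way to do it in Python, assuming the visits are available as Unix timestamps; the function and constant names are invented for the example and this is not the code behind the online monitor.

```python
SLICE_SECONDS = 20 * 60                          # one character per 20 minutes
SLICES_PER_DAY = 24 * 60 * 60 // SLICE_SECONDS   # 72 characters per row


def timeline(visit_times, now, mark="x"):
    """Render the 24 hours before `now` as a row of '-' and `mark`.

    visit_times: iterable of Unix timestamps for one visitor (or one UA)
    now: Unix timestamp for the right-hand end of the row
    """
    start = now - SLICES_PER_DAY * SLICE_SECONDS
    row = ["-"] * SLICES_PER_DAY
    for ts in visit_times:
        if start <= ts < now:
            row[int((ts - start) // SLICE_SECONDS)] = mark
    return "".join(row)


# Made-up visits: two early in the window and one in the final slice.
print(timeline([0, 1200, 86399], now=86400))
```

Calling it a second time with mark="=" and the timestamps of everyone else sharing the user agent gives the second row.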

A lot of visitors ask for a single page (often as a result of a search engine hit). When they're using a popular browser they look like this:
-----------------------------------------------------------------------x
===-=-============-======================-==-=-=====-==--=---=-======--=


With a rare browser they look like this:
-----------------------------------------------------------------------x
-----------------------------------------------------------------------=

I see this a lot with mobile devices.

Here's a googlebot:
--------xxx-xx-xxxxx--xxxxxxxxxx-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx-xxx
==--------------====-==========-======-----------=-=----================

you'll notice that there are 'x's where there is no corresponding '=' below. This is because the same IP also visited with other user agents.

Here's the Netlyzer "clock bot", which started visiting recently. Always 22 times a day at semi-regular intervals:
--x--x--x---x----x--x--x--x--x--x--x---x--x--x--x--x----x--x--x--x--x--x
--=--=--=---=----=--=--=--=--=--=--=---=--=--=--=--=----=--=--=--=--=--=


Here's a visitor from datashack - scraping (getting blocked for other reasons)
---x----------------------x---------------------x-x-------x--x---------x
--=-----------------=-----=--------------------------------------------=

There are multiple requests in bursts under each of those 'x's, so more pages are being requested than it may appear.
I've seen a few other periodic visitors like this also.

Oh and here's bing:
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
=-==---==-=-=--=-=-======-=-===-=----==-------===----=------=-==-------=

No surprise there. 15% of my bandwidth. Nuff said.

Yahoo slurp! is interesting:
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
========================================================================

not because of the above but because the number of hits per day is 863. My crawl delay is set at 100s, which allows at most 864 requests in a day, so they're actually obeying it to the letter... that's unusual for a bot.

At the higher resolution I'm using there's more detail, but you get the general idea.

I'm also noticing "cluster bots": several of them show up together, from completely unrelated IP addresses, within the same few minutes, and they do this periodically during the day. Botnets. Relatively easy to spot when they share a user agent, but probably possible to spot even when they don't.

So now I'm looking into doing some Fourier analysis: if I can automate what I'm doing by eye, I think I might have a useful tool.
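
As a starting point, here is a minimal sketch of what such an analysis might look like in plain Python: a direct discrete Fourier transform over one visitor's row of time slices, reporting the strongest period and how much of the signal's power it carries. The function name is hypothetical, and any threshold for calling something a clock bot would have to be found by experiment.

```python
import cmath


def dominant_period(series):
    """Return (period_in_slots, share_of_power) for the strongest DFT peak.

    series: list of 0/1 flags (or hit counts), one per time slice
    """
    n = len(series)
    mean = sum(series) / n
    centred = [value - mean for value in series]  # drop the DC component
    best_k, best_power, total = 0, 0.0, 0.0
    for k in range(1, n // 2 + 1):
        coeff = sum(
            value * cmath.exp(-2j * cmath.pi * k * t / n)
            for t, value in enumerate(centred)
        )
        power = abs(coeff) ** 2
        total += power
        if power > best_power:
            best_k, best_power = k, power
    if total == 0:
        return None, 0.0  # flat series: no periodicity to report
    return n / best_k, best_power / total


# An idealised clock bot: one visit every third 20-minute slice, all day.
period, share = dominant_period([1, 0, 0] * 24)
print(period, round(share, 3))
```

A strictly regular visitor puts nearly all of its power into one frequency, whereas irregular human browsing should spread it across many.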

lucy24

9:28 pm on Jan 19, 2015 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



My crawl delay is set at 100s, so they're actually obeying it to the letter... that's unusual for a bot.

If you throw all robots-- including the a priori blocked ones-- into the mix, the mere act of looking at robots.txt makes them unusual. In the specific case of the Googlebot, they forthrightly ignore crawl-delay in favor of requiring you to set the same information in wmt. Come to think of it, I've no idea whether they do honor the wmt directive. Don't know if anyone has ever checked.

Now, if you took that Bing diagram and redid it with X (capital) representing requests for robots.txt, it would probably look more like

XXXXXXxxxXXXXXXXxxXxxXXXXXXxXXXXXXxxXXXX ....

trintragula

10:23 pm on Jan 19, 2015 (gmt 0)

10+ Year Member Top Contributors Of The Month



LOL! Very probably!
As it is I'm thinking of converting the log into a sound file and listening for tones in it. The trouble is that at CD quality and one-second resolution a day's log would run about 2 seconds...

trintragula

11:44 am on Jan 20, 2015 (gmt 0)

10+ Year Member Top Contributors Of The Month



Incidentally...

I was running some queries to experiment with ways to find periodic robots and found a few: Pinterest came up, but also Firefox/12.0 (which are also getting blocked by other means).
On closer inspection I find that the Firefox/12 requests corresponded with hits from Pinterest's robot, with the same requested file and often to within the second. The IP addresses are also from within AWS.
Plain clothes pinterest, it would seem - not that unusual for a bot, but who knew?

The UA is this one, if anyone's interested:

Mozilla/5.0 (Windows NT 6.1; rv:12.0) Gecko/20120403211507_____Firefox/12.0

I've replaced five spaces by underscores there, because the forum collapses them, even inside code tags.

lucy24

8:18 pm on Jan 20, 2015 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Plain clothes pinterest, it would seem

Huh, that's interesting. Are these re-visits or primary visits? The only secondary FB UA I've noticed-- apart from the chronic and inexplicable toggle between 1.0 and 1.1 --is "visionutils". This seems to represent the single image the FB user ends up choosing for their pin. Er, like. Follow. Whatever.

I've never understood big-name robots using some wildly improbable UA that will all too likely get them blocked on its own merits. Like google's faviconbot moving up from no UA at all ... to Firefox 6. Huh what?

trintragula

9:29 pm on Jan 20, 2015 (gmt 0)

10+ Year Member Top Contributors Of The Month



If I understand you right, these are revisits: I'm seeing multiple requests for the same page in pairs of Pinterest/0.2 (+http://www.pinterest.com/) and FF/12 spread across the day. Looks like they're watching for updates.
On further inspection it looks like they're also doing this with Chrome/18, Safari/534.55.3 and MSIE 8.0. All the plain clothes UAs are distinguished by having runs of extra space characters in them.
And they're all Amazon IPs.
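
The extra-spaces tell is easy to test for. A minimal Python sketch, assuming the raw user agent string is available before anything collapses its whitespace (the function name is made up):

```python
import re

EXTRA_SPACES = re.compile(r" {2,}")


def has_space_runs(user_agent):
    """True if the UA string contains two or more consecutive spaces."""
    return bool(EXTRA_SPACES.search(user_agent))


# The Firefox/12 UA quoted earlier in the thread, with the five spaces restored.
suspect = "Mozilla/5.0 (Windows NT 6.1; rv:12.0) Gecko/20120403211507     Firefox/12.0"
print(has_space_runs(suspect))  # True
```

On its own this is only one more clue to weigh alongside the others, since a genuine browser could conceivably send odd spacing too.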

I think some MSIE 9.0 hits from Amazon may also be connected with Pinterest, but they don't match the telltale signs as well, so I'm not sure.

It appears to solve a mystery with some of the plain clothes bots I'm seeing from AWS, however.

At least google let you know who they are by putting their name in the favicon UA.

lucy24

11:16 pm on Jan 20, 2015 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Oops, my bad, my eyes saw "pinterest" and my brain inexplicably rendered it as "facebook". Never mind. I'll go away now :(