Coincidentally, last night I posted on this very topic in our March Google Updates thread [webmasterworld.com...] and suggested that, if you're seeing more than 3 entries for any "high commercial" keywords, it's likely a test...
In the past (since, say, July 2012), if your preferences were set to show 10 results per page, and the query was for "high commercial" phrases and not long tail, then Google was in testing mode....
I'm glad to see JS_Harris post the topic as a dedicated thread, and that doc_z linked to Brett Tabke's original thread. To give that thread a name and a more detailed reference, it was...
Many results from one site - Host Crowding vs Brand Authority June-Aug, 2012 https://www.webmasterworld.com/google/4464096.htm
The topic turned into a long and contentious discussion. Just to make it clear, I didn't like the nature of the test either, and I'm assuming Matt Cutts didn't either. It's a very invasive method of testing.
In his opening post, Brett linked to a Matt Cutts video about Google's history with host crowding and its move to multiple results. Here's another link to the video (again with more detailed references)...
How does Google decide when to display multiple results from the same website? Matt Cutts - June 11, 2012
trt 5:40 https://www.youtube.com/watch?v=AGpEdyIcZcU
From the thread, regarding Matt's comments, Brett almost nails it...
The only real new thing here is that for the first time, Matt Cutts talks about the issue in a video, but never does in fact answer the question entirely. I have yet to hear any major voice say this is a plus.
[youtube.com...]
Matt gives quite a few comments about why host crowding is a plus. In fact - listen close - seems almost as if he prefers the old method like all of us do...
Later in the thread I post along these same lines, but go further into my theory that Google was testing...
The Matt Cutts video, which Brett posted at the start of this thread, clearly lays out the positives and negatives of host crowding, along with the negatives of dropping it... and I feel at the end Matt hints that this might be a test. I can imagine a discussion in a Google meeting room where the approach was hashed out, and ultimately the opponents said, "OK, let's try it and see what kind of data we get."
The video is interesting to see again in light of the new results....
The test (or recalibration) unfolded over time, painfully slowly for many, I'm sure. To most who were following it carefully and sticking to the default 10 results per page, not 100 results per page, it became apparent that Google was methodically paring down the number of results, page by page, as it got sufficient data. I think that Matt in the video sets up the more-than-two-pages aspect fairly well, though he was never explicit that this was a test.
Over the course of the test, you could see the results get refined, initially on the first page of the serps, and then deeper and deeper into the serps, as the less prominent results accumulated sufficient searches and user data. For anyone interested in reading through it, I commented about this in the thread occasionally. Deeper pages, and less-searched queries, took the longest time to get sorted out.
I've also continued to see some very long tail queries show 4 (or even 5?) results on a page, and I've occasionally commented on those, where appropriate, in the forums. Some of these are for queries where I've not been able to get any useful data from SEMrush or SpyFu, and I'm thinking the reason is that they are very seldom searched.
IMO, my assertion that this was a test, and another way of looking at data, has pretty much held up.
One thing that isn't clear is whether Google is now looking at the long tail under different conditions, or whether they're back in the short tail. Martin_Ice_Web in the updates thread framed the queries as "high commercial", and I can't argue with that, because that might be a subjective thing.
I can only say that, during the time the above "domain crowding" test was evolving, I saw a great many queries that site owners felt were highly competitive terms, but which in fact had, say, only 120 instances on the web. They were very niche, and I'm sure they were competitive within that niche.
The difficulty of calibrating a niche, though, is the small amount of data. I have a feeling that this relative lack of data is what has slowed full automation of Panda and Penguin... and that for all we know, this might be a calibration step in the long tail area of those algorithms, if this is long tail. "Specialty sites" in xelaetaks' OP here suggests that this is long tail. Martin_Ice_Web's comment in the update thread suggests that it's not.
There's also RankBrain's effect on long tail, which might be getting mixed into this calibration now.
PS: Shortened end of post and edited for clarification.