It is judged using the same criteria as manual algo changes: a comparison between the machine-generated results and the hand-selected order of results.
If Google's hand-selected order of the top five results for a keyword is:
Site1, Site2, Site3, Site4, Site5
And the current algo produces:
Site2, Site1, Site5, Site4, Site3
Then a new algo that produced:
Site1, Site2, Site4, Site3, Site5
would be an improvement.
Naturally this is a way too trivial example though. Algos would be tested with hundreds (probably thousands) of keywords and phrases, and a lot deeper than 5 sites. Click-throughs, repeated similar searches by people, and many other metrics also determine the perceived quality of one algo over another.
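
To put a number on the Site1 to Site5 example above, here is a rough Python sketch. It counts how many pairs of sites each algo puts in the opposite order from the hand-selected list (a Kendall tau style distance). That choice of metric and the function name are my own assumptions for illustration, not anything Google has said it uses.

```python
from itertools import combinations


def kendall_tau_distance(reference, candidate):
    """Count pairs of sites that candidate orders opposite to reference.

    Hypothetical helper for illustration; 0 means the two orders match.
    Assumes both lists contain exactly the same sites.
    """
    position = {site: i for i, site in enumerate(candidate)}
    # combinations() yields (a, b) with a ranked above b in the reference,
    # so the pair is out of order when candidate ranks a below b.
    return sum(
        1
        for a, b in combinations(reference, 2)
        if position[a] > position[b]
    )


hand_picked = ["Site1", "Site2", "Site3", "Site4", "Site5"]
current_algo = ["Site2", "Site1", "Site5", "Site4", "Site3"]
new_algo = ["Site1", "Site2", "Site4", "Site3", "Site5"]

print(kendall_tau_distance(hand_picked, current_algo))  # 4
print(kendall_tau_distance(hand_picked, new_algo))      # 1
```

A lower count means closer to the hand-selected order, so the new algo (1 swapped pair) beats the current one (4 swapped pairs). A real test would presumably average something like this over a large set of keywords and combine it with the other signals mentioned above.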