The MIT 5k Dataset 2: The Ground Rules

by Dan Margulis on November 13, 2017

The following details the procedures used in evaluating the images in this study. It is posted separately so that I do not have to repeat it every time I discuss results in the future.

I went through the set of 5,000 original images and deleted those I thought were of limited interest. I used the same standards I would use in retaining images that I had shot on a trip. In other words, I was retaining by content rather than by technical difficulty. When finished, 1,315 originals remained.

I chose the hundred to be corrected by selecting 8 to 10 consecutive images from random places in the set. The purpose of choosing consecutive images was to avoid “cherry-picking” ones that I knew were likely to favor my approach. Naturally, this resulted in a fair number of easy exercises, and of others where PPW has no real advantage.
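For readers who want to reproduce this kind of selection, here is a minimal sketch of one way to draw runs of consecutive images from random positions. It is not the procedure I actually followed by hand. The function name, the assumption that runs must not overlap, the assumption that the draw is made from the 1,315 retained originals, and the shortening of the last run so that the total lands on exactly 100 are all illustrative choices.

```python
import random


def pick_consecutive_runs(pool_size, target=100, min_run=8, max_run=10, seed=None):
    """Return `target` image indices made up of runs of consecutive indices.

    Assumptions (not stated in the post): runs may not overlap, and the
    final run is shortened if needed so the total is exactly `target`.
    """
    if pool_size < target:
        raise ValueError("pool is smaller than the target")
    rng = random.Random(seed)
    chosen = set()
    attempts = 0
    while len(chosen) < target:
        attempts += 1
        if attempts > 10_000:
            raise RuntimeError("could not place enough non-overlapping runs")
        # Run length is 8 to 10, capped by how many images are still needed.
        run_len = min(rng.randint(min_run, max_run), target - len(chosen))
        start = rng.randrange(pool_size - run_len + 1)
        run = range(start, start + run_len)
        if chosen.isdisjoint(run):
            chosen.update(run)
    return sorted(chosen)


# Hypothetical usage: indices into the 1,315 retained originals.
picked = pick_consecutive_runs(1315, seed=1)
```

Because the start positions are random and the images within each run are taken as they come, the person choosing has no opportunity to favor images that suit a particular method.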
The 100 chosen images can be classified as follows:
*38 basically about people. Of these, 6 concentrated almost exclusively on faces. 22 were full figure, meaning that the faces were much smaller. 10 fell in between the two extremes.
*20 scenic shots. Of these, 8 featured desert or canyon settings, 7 were dominated by greenery, and 5 showed other settings, such as lake scenes.
*10 images primarily of animals or birds.
*6 night or twilight shots of cities.
*6 architectural, of which 4 were interior and 2 exterior.
*5 studies of flowers. Of these, two were large flowers of a single color, and three were bouquets containing many varied bright colors.
*3 studies of strongly colored objects other than flowers.
*3 food shots.
*2 sports photos.
*7 miscellaneous, such as a shot of a rainbow, of a concert on a beach, etc.

For the 100 chosen images, I also downloaded the five versions prepared by the retouching group. These files had been saved in 16-bit ProPhoto RGB; for testing I converted them to 8-bit sRGB and saved them at JPEG level 9, the same setting used for my corrections.

For each set of five, I also saved a sixth, “par” version, constructed by blending all five into a single-layered file, with each of the five given 20% weight. Without referring to any of these, I also prepared my own version. To try to emulate the conditions under which the retouchers worked, I limited myself to methods that could be automated: no local retouching or explicit use of the unsharp mask filter.
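The arithmetic behind a par version can be sketched in a few lines. This is an illustration under stated assumptions, not a record of the steps taken in Photoshop: images are represented as flat lists of 8-bit (R, G, B) tuples, the averaging is done directly on the encoded values, and both function names are hypothetical. The second function shows that stacking layers in Normal mode with opacities of 100%, 50%, 33%, 25%, and 20% from the bottom up gives the same equal-weight result, apart from rounding.

```python
def par_blend(versions):
    """Equal-weight average of same-sized images.

    Each image is a flat list of (R, G, B) tuples with 8-bit channel values.
    """
    if not versions:
        raise ValueError("need at least one version")
    size = len(versions[0])
    if any(len(v) != size for v in versions):
        raise ValueError("all versions must have the same number of pixels")
    n = len(versions)
    return [
        tuple(round(sum(px[c] for px in pixels) / n) for c in range(3))
        for pixels in zip(*versions)
    ]


def stacked_layers(versions):
    """Composite layers bottom-up in Normal mode, layer k at opacity 1/k."""
    result = [tuple(float(c) for c in px) for px in versions[0]]
    for k, layer in enumerate(versions[1:], start=2):
        opacity = 1.0 / k
        result = [
            tuple(below * (1.0 - opacity) + above * opacity
                  for below, above in zip(below_px, above_px))
            for below_px, above_px in zip(result, layer)
        ]
    return [tuple(round(c) for c in px) for px in result]


# Five hypothetical one-pixel "versions" of the same image.
five = [[(10, 20, 30)], [(20, 30, 40)], [(30, 40, 50)],
        [(40, 50, 60)], [(50, 60, 70)]]
par = par_blend(five)
```

With five versions, each one ends up contributing 20% to every channel of every pixel, which is the weighting described above.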

In my ACT classes, I compare groups of up to a dozen versions of the same image to each other, supported when needed by votes from the class members. As I do such comparisons hundreds of times in each class, and have been teaching the classes for 24 years, I have heard audience reaction to tens of thousands of head-to-head comparisons. I am therefore able to predict the reaction much better than the average person could, and my judgment is not affected by whether I personally am one of the contestants.

For this study I had 700 corrections to compare, seven for each of the 100 originals: the five versions prepared by the hired retouchers, the par version that averages all five into a single new version, and the version done by me as described above. I did three rounds of comparison:

1) Each par version versus each of its five parents.

2) My version versus each of the retouchers’ versions, not the par version.

In these two tests the standard was whether the likely favorite (the par version, or my version) was “significantly better” than a version produced by a group member. I define this as meaning that a jury, given a choice of only two versions (although “no preference” is a permitted response), would likely give at least a two-thirds vote to one version. Note the lack of distinction between the case where the favorite wins by 60 percent of the vote with 40 percent no preference, and the one where it loses unanimously to the underdog. Either counts as “not significantly better.”

3) My version versus each par version. This comparison is a key one because any decisive win suggests an advantage for PPW. I therefore rate each comparison as one of five outcomes: decisive win, significantly better, tie, significantly worse, decisive loss.

A “decisive” win or loss means that I believe observers would almost unanimously prefer one version with a single glance.

“Significantly better” is as defined above.
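The two-thirds standard can be written down as a small check. This is only a sketch of the definition: in the study the vote is my prediction of how a jury would react, not an actual tally, and the function name and its arguments are hypothetical. The one point it is meant to make explicit is that “no preference” responses count in the total, so they work against the favorite.

```python
def significantly_better(votes_for, votes_against, no_preference=0):
    """True if the favorite gets at least two-thirds of all responses.

    "No preference" responses are part of the total, so they make the
    two-thirds threshold harder to reach.
    """
    total = votes_for + votes_against + no_preference
    if total == 0:
        raise ValueError("jury has no members")
    # Integer form of votes_for / total >= 2/3, which avoids float rounding.
    return 3 * votes_for >= 2 * total


# The two cases mentioned above that are treated alike:
wins_60_with_40_abstaining = significantly_better(60, 0, 40)
loses_unanimously = significantly_better(0, 100, 0)
# A jury of twelve splitting 8 to 3 with one "no preference" just qualifies.
eight_of_twelve = significantly_better(8, 3, 1)
```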

A “tie” occurs when any of the following conditions are met:
1) Neither version would likely receive a two-thirds vote given only the two choices.
2) A straight 50-50 blend of the two results in a better image than either.
3) Blending 100% of the color of one with 100% of the luminosity of the other results in a better image than either.
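The two blends named in conditions 2 and 3 can be sketched per pixel as follows. Whether a blend is “better” remains a visual judgment that no code makes; the sketch only builds the candidates and states the tie rule. It assumes channel values scaled to the range 0 to 1 and uses the non-separable Color formula from the W3C Compositing and Blending specification (luminosity weights 0.3, 0.59, 0.11), which may not match Photoshop's output exactly. All function names are hypothetical.

```python
def lum(c):
    """Luminosity of an (R, G, B) pixel with channels in 0..1."""
    r, g, b = c
    return 0.3 * r + 0.59 * g + 0.11 * b


def clip_color(c):
    """Pull out-of-range channels back into 0..1 while keeping luminosity."""
    l = lum(c)
    n, x = min(c), max(c)
    if n < 0.0:
        c = tuple(l + (ch - l) * l / (l - n) for ch in c)
    if x > 1.0:
        c = tuple(l + (ch - l) * (1.0 - l) / (x - l) for ch in c)
    return c


def set_lum(c, l):
    """Shift pixel c so that its luminosity becomes l."""
    d = l - lum(c)
    return clip_color(tuple(ch + d for ch in c))


def even_blend(a, b):
    """Condition 2: a straight 50-50 blend of two pixels."""
    return tuple((ca + cb) / 2.0 for ca, cb in zip(a, b))


def color_of_a_lum_of_b(a, b):
    """Condition 3: 100% of the color of a with 100% of the luminosity of b."""
    return set_lum(a, lum(b))


def is_tie(two_thirds_winner, even_blend_better, color_lum_blend_better):
    """A tie occurs when any one of the three conditions is met."""
    return (not two_thirds_winner) or even_blend_better or color_lum_blend_better


# Hypothetical pixels: a saturated red from one version, a mid gray from the other.
mine, par = (1.0, 0.0, 0.0), (0.5, 0.5, 0.5)
half_and_half = even_blend(mine, par)
my_color_par_lum = color_of_a_lum_of_b(mine, par)
par_color_my_lum = color_of_a_lum_of_b(par, mine)
```

Condition 3 has two directions, since either version can supply the color, so both candidates have to be looked at before a comparison is called a tie on that basis.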

I have already published the results for the par-versus-its-parents tests as an indication of how powerful blending is. The par version was significantly better than a parent version in 382 of 500 comparisons (76.4 percent). The par version was significantly better than *all five* parent versions in 26 of 100 cases. These results are far better than would be expected if there were nothing advantageous about creating an averaged version. I have written about the results on the ACT list; the reasons for them are straightforward, and I will elaborate in later posts to this blog.

I have not yet published the results involving my own versions. Those will take considerable explaining because in addition to the images in which PPW has a decided advantage, there are many confounding factors, beginning with a nonprofessional retouching group whose work is being compared with that of someone with considerably more experience.
