Measuring image manipulation
Discussion of what level of manipulation is acceptable is usually conducted with rather broad brushstrokes. Debates typically appeal to analogue dark-room techniques to decide what is acceptable or not. Apart from being deplorably imprecise, this analogy limits the scope of the debate to the knowledge base of the participants – and today’s photographers have little to no idea of what went on in the wet dark-room. Alternatively, judges rely on subjective responses – i.e. what they feel – to decide how much manipulation is too much.
Retouching prints, negatives and even colour transparencies, multiple printing, masking, re-photography, shadow and highlight recovery, and using toners for subtle through to strong split-toning effects are all dark-room techniques. All could create substantial changes to image appearance and content, including the addition of completely new picture components. The fact that some of this work demanded high-level skills, and therefore cost a lot, meant only that it did not happen in the mainstream. For those with generous advertising budgets it was routine work.
Notwithstanding the effect of a knowledge gap, waving at analogies with film-based photography leads to confusion and error because the vagueness allows different parties to appear to agree when, in detail, they hold positions that are different. Alternatively – and worse – the protagonists appear to occupy opposing positions when, in fact, they agree on the basic notions, but these have been calibrated in different ways.
Perhaps worse still, differences in the assessment of the image may become apparent only when a specific image or narrow range of issues is under the judging microscope. This can lead to embarrassing discoveries about how e.g. competition rules were not sufficiently precise or were open to misinterpretation.
Here I propose a scoring system based on scales of visible differences. These combine to give a measure for our notions of what is allowable in terms of manipulation.
Importance of a metric
This scheme aims to help locate the important points of difference in judgements of photographs. Applying the measure still calls for subjective assessments. But, by placing assessments such as ‘just visible’ and ‘clearly visible’ on a perceptual scale, it is easier to arrive at rankings that everyone can agree on. It is generally easier to agree on the ranking of an individual quality than to assess several factors at once. This can be achieved relatively easily where the original capture and the submitted image can both be examined at the same time.
For example, in photojournalism tonal changes which normalise a non-normal scene are generally accepted. An early morning scene in low ambient light with soft, diffused lighting will record as low in mid-tone contrast. We generally accept that, for pictorial use, a photographer can boost mid-tone contrast in order to bring out details. The strength of the adjustment can vary from the nearly invisible to the clearly visible.
We can agree that adjustments with a certain effect are acceptable for a given purpose, e.g. photojournalism. We can then go on to state that any stronger adjustment which dramatically increases the effect of chiaroscuro (light and shade) confers values on the image that belong more to the aesthetics of camera-club pictorialism than it is right to allow in photojournalism.
For the purpose of evaluating images, competitions or institutions may calibrate their metrics by reference to test images. That would be the next step of development. This scheme is not intended to set any standards beyond dispute, but to provide a much firmer basis for discussion than having various judges arguing in vague abstractions about the differences between ‘moderate’ and ‘strong’, ‘dark-room’ or ‘beyond dark-room’.
The modifier ‘allowable’ means roughly “what is permissible, tolerable, admissible, compliant with certain rules” and is relative to a given framework of norms. It does not refer to any legal, cultural, aesthetic measure or artistic value.
It would be lovely to be able to aggregate the score for each image and compare it to an allowable sum. This works well for low sums, i.e. uses intolerant of any manipulation, but could easily lead to unwanted results where image use is more tolerant of manipulation, as a few large scores can hide within an otherwise low total. Competition administrators can, however, set maximum allowable scores in any category.
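The aggregation problem can be illustrated with a quick sketch. Here two hypothetical images receive scores across eleven parameters (the score values are invented for illustration): one is lightly adjusted everywhere, the other radically altered in just two places, yet their totals are identical.

```python
# Two illustrative score profiles on the 0-9 scale across eleven parameters.
even  = [2] * 11                 # every parameter lightly adjusted
spiky = [9, 9, 4] + [0] * 8      # two radical changes, the rest untouched

print(sum(even), sum(spiky))     # 22 22 - identical totals
print(max(even), max(spiky))     # 2 9  - very different levels of manipulation
```

This is why a per-parameter ceiling, not just a total cap, is needed where some manipulation is tolerated.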
So, to what extent should image manipulation be allowed? Any measurement or metric rests on a scale.
I suggest a 10-point scale of 0 (zero) to 9 (nine) as follows:
0 = no change at all between the captured image and the image under review.
1 = minor global changes reflecting differences, for example, between default conversions using different raw conversion engines.
2 = almost invisible change – needing a grader’s or printer’s eye – for the purpose of correcting to normal values or machine calibration.
3 = just visible change for the purpose of correcting to normal values.
4 = visible or light change for the purpose of visual enhancement; a minor but useful correction.
5 = clearly visible or moderate change for the purpose of visual improvement just beyond normal values, and without visible artefacts.
6 = clearly visible change to make image more appealing with values beyond normal but not unnatural or hyper-real, with small visible artefacts permissible.
7 = obvious or strong change emphasising a feature or quality for the purpose of producing a striking image, with values clearly beyond normal i.e. unnatural, exaggerated or hyper-real.
8 = radical change to exaggerate an effect in a feature or quality, with strong differences from parent image.
9 = very radical, exaggerated change for the purpose of graphic effect resulting in image obviously and markedly different from parent image, even unrecognisable.
How this works
Let’s take three scenarios or work situations in which the extent or degree of image manipulation makes some difference to whether the image is regarded as acceptable i.e. whether it is fit for a particular purpose. Let’s keep it simple by aiming for a straight, binary ‘Yes’ or ‘No’ decision.
Forensic, scientific or evidence-based recording
Where integrity of the actual capture is crucial as when gathering forensic scene-of-crime records, or scientific experiments subject to third-party scrutiny, we won’t allow any manipulation at all. We cannot accept anything that would disrupt the integrity of the record. This also forbids changes to file name, and any metadata. We want to ensure, as it were, as straight and uninterrupted a causal line between the subject and the image as possible.
So we’d want the image to score zero in all changes. For example, we look at Exposure/Brightness to see if that has been altered. If not at all, that scores zero, and we check the next one. If Exposure/Brightness has not been altered at all, Tone Curve will probably score zero too, as will Highlight/Shadow. But White Balance is independent of these, so could be varied. But we don’t allow any changes here either, so this should also score zero. In the case of raw files, the conversion should, of course, be at the settings used at point of capture.
In some circumstances, we may wish to allow a small adjustment in White Balance, in which case, we may stipulate that the score for this parameter is allowed to score 1, or 2.
And so on, through all eleven parameters (see from p.122). You may stipulate, for most strictness, that the total score cannot exceed 1. Or you may relax conditions and allow a total score up to 10 with no category allowed to score more than 1.
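The rule-checking just described can be sketched in code. This is a minimal illustration, not a fixed standard: the parameter names are the ones mentioned above, and the rule values are the example figures from the text (a total cap of 1 for strictness, or a relaxed total of 10 with no single parameter above 1).

```python
# Compliance check for the proposed metric, assuming scores are recorded
# per parameter on the 0-9 scale. Rule values are illustrative.
FORENSIC_RULES   = {"max_total": 1,  "max_per_parameter": 1}
RELAXED_FORENSIC = {"max_total": 10, "max_per_parameter": 1}

def is_compliant(scores, rules):
    """Return True if per-parameter scores satisfy both the total cap
    and the per-parameter cap."""
    total = sum(scores.values())
    worst = max(scores.values())
    return total <= rules["max_total"] and worst <= rules["max_per_parameter"]

# Example: a raw file converted at capture settings, untouched apart from
# a just-detectable white-balance tweak (score 1).
scores = {
    "exposure_brightness": 0,
    "tone_curve": 0,
    "highlight_shadow": 0,
    "white_balance": 1,
    # ...remaining parameters, all scoring 0
}

print(is_compliant(scores, FORENSIC_RULES))   # True under the strict rule
```

The same function serves any category: only the rule values change.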
Fine art, experimental, graphic design
Let’s go to the other end of the scale. We may characterise allowable changes as those that permit an almost complete severing of the visible link between the original capture and the manipulated image. In short, just about any alteration is allowed. In fact we may not even need to know – or have any interest in – how the final image was created, or how it compares with the original capture or captures. These images may score as much as 99 (eleven parameters at 9 apiece), but in practice will score somewhat less, as some parameters are likely to be only lightly adjusted.
The short conclusion is that, thus far, the proposed metric is producing the results we expect and adapting to different norms and practices. The tougher test now is to see how it performs in the middle range.
Photojournalism, news reporting
The stress test for any image manipulation scoring is how it deals with photojournalism and news reporting. Here, we are prepared to accept a certain level of global change in tone, exposure, mid-tone contrast, white balance and colour balance, and monochrome conversion which all broadly enhance the visual qualities of the image without a substantial deviation from the scene as originally captured.
This context therefore covers the middle ground in which some manipulation is allowed – usually constrained to those effects which shelter under the banner of ‘dark-room effects’ – but no more: and not too much. This raises the question of how much is ‘too much’. Any metric should be able to offer a crisp answer. It should be able to unpick the bundle of ‘dark-room’ effects so that, for instance, we can allow some removal of minor capture defects without stepping onto the slippery slope of allowing cloning within and into the image.
For example, suppose that in photojournalism and news reporting we wish to keep scores in all parameters to around 5 – broadly allowing corrections that enhance visual impact without shifting, distorting, supplementing or abridging content. A clearly visible application of dodge and burn would then score more than 4 or 5. The controversial Hansen image of World Press Photo 2013 might have scored 8 or even 9, to reflect the fact that the tone changes on the faces of the anguished men suggested auxiliary (but non-existent) light sources or nearby reflecting surfaces. If World Press Photo forbids scores greater than 4–5, a score of 7 or 8 would then press the red buzzer for rejection.
Still in news and photojournalism, if we look at cumulative scores over the eleven parameters, we could allow an image to score a maximum of, say, 50 so long as no single parameter scores more than, say, 7 or 8. This would describe an image which has received adjustments in exposure, tonality, colour and removal of minor defects, and so forth. But it has not transgressed by making a large change in any feature. For photojournalistic use we might judge this image OK in terms of post-processing.
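The cumulative rule just described can be sketched as follows. The eleven scores are invented for illustration; the caps of 50 and 7 are the example figures from the text, not a standard.

```python
# Photojournalism rule from the text: a cumulative cap across the eleven
# parameters plus a per-parameter ceiling. Values are illustrative.
def passes_photojournalism(scores, max_total=50, max_single=7):
    return sum(scores) <= max_total and max(scores) <= max_single

# An image with moderate global corrections and minor defect removal:
moderate = [5, 4, 5, 3, 4, 2, 3, 2, 1, 0, 2]   # eleven parameter scores
print(passes_photojournalism(moderate))          # True: total 31, highest score 5

# The same image with one radical local change (e.g. heavy dodge and burn):
radical = moderate[:-1] + [9]
print(passes_photojournalism(radical))           # False: one score exceeds 7
```

Note that the second image fails on the per-parameter ceiling even though its total (38) remains well under 50 – exactly the behaviour wanted here.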
The metric is cumbersome but that reflects the complexity of the job. Also, details will vary with the user and the context (e.g. whether it’s applied to a competition for amateurs, or to a journalism award on the world stage).
This metric offers the advantage of being able to capture some nuances that are often confused. For example, while dodge and burn effects are widely accepted to some degree, this method helps you define just how much dodge and burn is acceptable. It helps to distinguish a ‘minor’ dodge effect (not easily noticed) from a ‘major’ dodge effect (easily noticed, or one suggesting new light sources or reflective surfaces).
For more details, see from p.134 of my book Photo Judging.