Abstract

To quantify the excellence of multimedia quality, subjective evaluation experiments are conducted. In these experiments, quantitative assessment is the dominant tradition, but it disregards participants' interpretations, descriptions, and quality evaluation criteria. The goal of this paper is to present a new multimedia quality evaluation method called Open Profiling of Quality (OPQ) as a tool for building a deeper understanding of subjective quality. OPQ is a mixed method combining a conventional quantitative psychoperceptual evaluation with a qualitative descriptive quality evaluation based on the individual's own vocabulary. OPQ targets naïve participants and is applicable to experiments with heterogeneous and multimodal stimulus material. The paper presents the theoretical basis for the development of OPQ and gives an overview of the methods of audiovisual quality research. We present three extensive quality evaluation studies in which OPQ was used with a total of 120 participants. Finally, we conclude with recommendations for the further use of the method in quality evaluation research.

1. Introduction

To become successful, new multimedia systems and services need to meet the users' requirements, offer pleasurable experiences, and provide higher quality than the existing systems. At the same time, audiovisual systems are becoming more and more complex as technological progress provides new possibilities for presenting content. For example, audiovisual 3D on portable devices requires a high level of optimization of technical resources to handle huge amounts of data, with possible limitations due to transmission channel and device constraints. This can result in perceivable heterogeneous impairments along the value chain from content production to display techniques and can influence the user's perception of quality. To assess the experienced quality of these novel systems and services, subjective audiovisual quality evaluation experiments are conducted.

Subjective (perceptual, affective, experienced, sensorial) quality evaluation is based on human judgments of various aspects of the experienced material, grounded in perceptual processes [1–3]. These quality perceptions involve both low-level sensorial and high-level cognitive processing, including knowledge, emotions, attitudes, and expectations [4–7]. Since the 1970s, recommendations for video quality evaluations have offered a good basis for assessing one dimension of quality, its hedonistic excellence [8]. Recently, a broader view of quality has been taken by covering other aspects of active perception in the evaluations, including knowledge, different levels of human information processing, or even contextual behavior [9–15]. Although these evaluations have made a significant contribution to the understanding of quality, they are still limited to the investigation of quantitative quality preferences. Subjective impressions, interpretations, and experiences as factors to explain and understand the results (constructed in the evaluations of different system factors) beyond excellence are rarely considered [16, 17]. This may be partly due to a lack of reliable explorative instruments for tackling the descriptive characteristics of quality or, even more ambitiously, for relating quality preferences and descriptions. The few previous attempts suggested to the multimedia quality community have constraints in terms of accuracy, complexity, the required type of assessors, unimodal evaluations, or an emphasis only on qualitative methods [16–21].

The goal of this paper is to present a mixed method called Open Profiling of Quality (OPQ) for understanding the multimodal quality of experience. Mixed methods combine quantitative excellence evaluation and qualitative descriptive research in one single study to compensate for the weaknesses of each method, to expand the understanding of phenomena, and to provide complementary viewpoints [22–24]. For the method development, we conducted a literature review of mixed methods, of quality evaluation research on multimodal quality, and of related fields (food science, consumer research). The proposed method combines a conventional quantitative psychoperceptual evaluation and a qualitative descriptive quality evaluation using the participants' self-defined vocabulary. It is applicable to naïve participants and heterogeneous multimodal stimulus material. We present three multimedia quality studies in which OPQ was used and derive recommendations for the further use of the method in quality evaluation research. The method presented helps practitioners to conduct mixed method research in the field of audiovisual quality.

The paper is organized as follows: in Section 2, we outline the theoretical background of multimedia quality and the state of the art of mixed method research and its use in audio, video, and audiovisual quality evaluation. Section 3 contains the description of the Open Profiling of Quality method. Sections 4–6 present the three studies using the method. Finally, the discussion and conclusions on the method and its further use are given in Section 7.

2.1. Understanding Quality Perception

Multimedia quality is characterized by the relationship between produced and perceived quality. Produced quality is determined by the technical factors of multimedia, typically categorized into three abstraction levels: content, media, and network [25, 26]. Special requirements for produced multimedia quality arise from the juxtaposition of huge amounts of multimedia data with limited bandwidth, a vulnerable transmission channel, and the constraints of receiving devices. Perceived (also called experienced or sensorial) quality describes the users' or consumers' view of multimedia quality. It is characterized by active perceptual processes, including both bottom-up, low-level sensorial and top-down, high-level cognitive processing [4, 6]. The relationship between perceived and produced quality for end-to-end systems is described in terms of Quality of Experience (QoE) as “the overall acceptability of an application or service, as perceived subjectively by the end-user” [27]. More broadly, Wu et al. [28] have summarized it “as a multidimensional construct of user perceptions and behaviors.”

Quality perception is constructed in an active, multilayered process. In its early stage, it contains the extraction of relevant features from the incoming sensorial information (e.g., brightness, form, and motion for vision, or pitch, loudness, and location for audio) [6]. However, quality perception is not determined by stimulus-driven bottom-up processing alone. High-level top-down cognitive processing involves individual emotions, knowledge, expectations, and schemas representing reality, which can weight or modify the importance of each sensory attribute and enable contextual behavior and active quality interpretation [4–7]. In this stage, stimuli are interpreted according to their personal meaning and their relevance to human goal-oriented actions. In multimodal perception, one sensory channel can complement and modify the perception derived from another channel [29]. This dependency can be due to the clarity of the stimuli as well as to content, task, and context [29–32].

Multimedia quality studies aim at optimizing the quality produced under strict technical constraints or resources with as few negative perceptual effects as possible. Recent multimodal quality evaluation studies have started to underline the characteristics of active and multilayered quality perception. Quality does not derive only from the characteristics of the stimuli, but also from usage-, task-, and context-dependent factors [14, 16]. In this paper, we continue this line of work, see human perception as an interesting challenge, and develop explorative tools for understanding the underlying attributes of perceived quality.

2.2. Research Methods for Perceived Quality Evaluation
2.2.1. Quantitative Psychoperceptual Evaluation

Psychoperceptual quality evaluation is a method for examining the relation between physical stimuli and sensorial experience following the methods of experimental research. These methods have their origin in the classical psychophysics of the 19th century, and they have later been applied in uni- and multimodal quality assessment [2, 3, 8, 33]. In the quality evaluation domain, the applied methods are standardized in technical recommendations by the International Telecommunication Union (ITU) or the European Broadcasting Union (EBU) [8, 33, 34]. The goal of these methods is to quantitatively analyze the excellence of the perceived quality of stimuli in a test situation. Psychoperceptual quality evaluation studies are characterized by a high level of control over the variables and test circumstances and can include the use of standardized test sequences and procedures and the categorization of participants into naïve or professional evaluators to ensure the repeatability of the study. As an outcome, subjective quality is expressed as an affective degree of liking using mean quality satisfaction or opinion scores (MOS). In psychoperceptual studies, the quality range under test and the research question define the applicable method. Single-stimulus methods are useful for evaluating a large quality range from low to high with detectable differences between stimuli, while pairwise comparisons are powerful when comparing stimuli with small differences [8, 33]. A common single-stimulus method is Absolute Category Rating (ACR), in which short test sequences are presented one at a time. Each test sequence is then rated independently and retrospectively using a 5-, 9-, or 11-point scale [33]. In multimedia quality assessment, ACR has outperformed other evaluation methods [35, 36].
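
As an illustration of the quantitative outcome of such a test, the following minimal sketch computes MOS values with confidence intervals from ACR ratings; the long-format data layout and the scores are hypothetical and not part of the cited recommendations.

```python
# A minimal sketch, not part of any recommendation: computing mean
# opinion scores (MOS) from ACR ratings. The data layout and the
# scores themselves are hypothetical.
import pandas as pd

ratings = pd.DataFrame({
    "participant": [1, 1, 2, 2, 3, 3],
    "stimulus":    ["A", "B", "A", "B", "A", "B"],
    "score":       [4, 2, 5, 3, 4, 2],  # 5-point ACR scale
})

# MOS per stimulus with a normal-approximation 95% confidence interval.
mos = ratings.groupby("stimulus")["score"].agg(["mean", "sem", "count"])
mos["ci95"] = 1.96 * mos["sem"]
print(mos)
```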

Recently, conventional psychoperceptual methods have been extended from hedonistic assessment towards the evaluation of appropriateness for use- and goal-oriented actions (cf. the overview in [37]). Quality is measured as a multidimensional construct of cognitive information assimilation or of satisfaction constructed from enjoyment and subjective, but content-independent, objective quality [11, 12, 31, 38]. Furthermore, evaluations of acceptance act as an indicator of a service-dependent minimum useful quality level and have been established parallel to the assessment of quality satisfaction in laboratory and natural contexts of use [14, 15, 39–41]. However, all quantitative approaches lack the possibility to study the underlying rationale of the users' quality perception.

2.2.2. Descriptive Quality Evaluation

Descriptive quality evaluation approaches focus on a qualitative evaluation of perceived quality. The basic idea is that test participants are asked to describe their quality factors or the reasons for a certain quality rating. In more advanced methods, these expressions or quality attributes are used to rate test items in a subsequent task. We identified three main approaches: (1) the interview-based approach, (2) consensus vocabulary profiling, and (3) individual vocabulary profiling, which differ in terms of vocabulary elicitation methods, methods of analysis, and characteristics of participants (Table 1).

Interview-Based Evaluation
In the existing interview-based methods, naïve participants explicitly describe the characteristics of stimuli, their degradations, or their personal quality evaluation criteria in free-description or stimulus-assisted description tasks [16, 40–42]. The goal of these interviews is to generate terms for describing the quality and to check that the test participants perceived and rated the intended quality aspects. Semistructured interviews are commonly used. They are especially applicable to relatively unexplored research topics, are constructed from main and supporting questions, and, compared to open interviews, are less sensitive to interviewer effects [43]. Frameworks of data-driven analysis are applied, and the outcome is described in terms of the most commonly appearing characteristics [16, 17, 21].

Consensus Vocabulary Profiling
The “RaPID perceptual image description method” (RaPID) is based on a descriptive analysis assuming that image quality is the result of a combination of several attributes and that these attributes can be rated by a trained panel of assessors [2, 18, 44]. Its purpose is to develop a consensus vocabulary, on the basis of which trained test participants later rate quality. The multistep procedure contains (1) extensive group discussions in which panel members first develop a consensus vocabulary of quality attributes for image quality; (2) a refinement discussion in which the panel agrees on the important attributes and the extremes of the intensity scale for a specific test according to the available test stimuli; and (3) an evaluation task in which each test participant applies each attribute to a set of stimuli in a pair comparison of the test stimulus and a fixed reference. RaPID requires extensive and time-consuming panel training, can be sensitive to context effects, and requires an experienced researcher to conduct the experiments [18]. A comparable methodology is used for audio evaluation in the Audio Descriptive Analysis and Mapping (ADAM) technique [45].

Individual Vocabulary Profiling
In contrast to consensus vocabulary profiling, Lorho's Individual Vocabulary Profiling (IVP) is a descriptive quality evaluation method for naïve participants. His work was the first approach in multimedia quality assessment to use the test participants' individual vocabulary to evaluate quality. The procedure contains four steps. (1) Familiarization: participants become familiar with describing the attributes of stimuli, and they develop their individual vocabulary in two consecutive tasks. (2) An attribute list is generated in a triad stimulus comparison using an elicitation method called the Repertory Grid Technique. (3) The developed attributes are used to generate scales for the evaluation; each scale consists of an attribute and its minimal and maximal quantity. (4) Test participants train and then evaluate quality according to the developed attributes. The data are analyzed through hierarchical clustering, to identify underlying groups among all attributes, and through Generalized Procrustes Analysis [46], to develop perceptual spaces of quality. Compared to the other descriptive methods, the four-step procedure for individual vocabulary training can be time consuming. However, the analysis of IVP is relatively easy, and the researcher's interpretive process is located at the very end, in contrast to interview-based methods. Although this review shows that there are various methods for studying perceived multimedia quality quantitatively and qualitatively, none of these methods combines both approaches (Table 1). We see the challenge for modern evaluation methods also in the combination of both data sets.

2.3. Mixed Method Research
2.3.1. The Theory of Mixed Method Research

Fundamentally, mixed method research has its roots in pragmatic philosophy, represents the third wave of research methods, and is suitable for applied research, such as quality evaluation [22]. It is defined as the class of research in which the researcher mixes or combines quantitative and qualitative research techniques, methods, approaches, concepts, or language in a single study [22]. The major characteristics of traditional quantitative (QUAN) research are a focus on deduction, confirmation, theory/hypothesis testing, explanation, prediction, standardized data collection, and statistical analysis [22]. The major characteristics of traditional qualitative (QUAL) research are induction, discovery, exploration, theory/hypothesis generation, the researcher as the primary “instrument” of data collection, and qualitative analysis [22]. To combine these two research traditions, mixed methods are used to provide complementary viewpoints, to provide a complete picture of a phenomenon, to expand the understanding of a phenomenon, and to compensate for the weaknesses of one method [23]. The core of mixed method theory is the combination of quantitative and qualitative methods into one final result. There are four main design patterns for fusing these methods, with slight differences in the emphasis of the dominating method, the interdependency of the methods, and the purpose (Table 2) [50].

Triangulation is the most common and important mixed method design (Table 2) [50]. In triangulation, data collection and analysis are carried out independently for the QUAN and QUAL methods with no preference, and the final inference aims at creating a broad picture of the phenomenon [50]. Three possible outcomes can be expected in these studies: (1) convergence of the results, where both results lead to the same conclusions; (2) complementarity of the results, where the different results highlight different aspects of the same phenomenon; or (3) divergent or contradictory results [24]. The ideas of triangulation and other mixed method designs (Table 2) have been used in quality evaluation research, although researchers have not explicitly expressed the relationship to this methodological approach, for example, [16, 17, 42]. In the following, we present a review of the existing methods using mixed quantitative and qualitative evaluation of quality.

2.3.2. Mixed Methods in Audio, Visual, and Audiovisual Quality Evaluation

In multimedia quality evaluation methods, triangulation is the applied mixed method design. Jumisko-Pyykkö et al. [16] have introduced an approach combining quantitative psychoperceptual evaluation and posttask interviews to explore experienced quality factors for audiovisual quality with naïve test participants. The psychoperceptual evaluation follows the ITU recommendations to collect overall quality [8, 33]. The experienced quality factors were collected using a semistructured interview with a free-description task in which participants described the quality evaluation criteria they had used during the quantitative evaluation. Data-driven analysis, following the framework of Grounded Theory, was used in the interview analysis [51]. The results have underlined that experienced quality is constructed from impressions of (1) low-level features of the stimuli (e.g., audio, video, and audiovisual impairments), (2) high-level factors (e.g., the relationship of quality to use and content), and (3) the most varied variable representing the peaks or extremes of quality [16, 40, 42]. Finally, the interpretation of the quantitative and qualitative data was first carried out independently, and the interpretations were then integrated to support each other's conclusions. This method may suffer from inaccuracy, as the descriptions relate to a set of stimuli instead of a single stimulus. However, the descriptive task is fast to conduct and can easily be adapted to quality evaluations in challenging circumstances (e.g., in the field) [40].

Triangulation is also applied in the method called Interpretation Based Quality (IBQ) [17, 21], adapted from [52, 53]. IBQ also follows a two-step procedure with naïve participants: (1) a classification task using free-sorting and an interview-based description task and (2) a psychoperceptual evaluation based on one quality attribute. In the perceptive free-sorting task, test participants form groups of similar items and describe the characteristics of each group. The free-sorting task with naïve participants produces results comparable to the consensus vocabulary approach with expert participants in terms of describing the same sensations and the related wording of the attributes [52]. However, the costs of free-sorting are lower because of the use of naïve test participants, the absence of training, and the fast assessment of a large test set [52]. Extending the idea of the free-sorting task, IBQ allows combining preference and description data in a mixed analysis to better understand preferences and the underlying quality factors at the level of a single stimulus [17]. However, the analysis of interview-based methods for large data sets is time consuming, as it requires a multistep procedure and interrater reliability estimations. In contrast to the original definition of the method [17, 21], the term IBQ has later been used inconsistently to refer to monomethodological designs and variable procedures of descriptive tasks [49, 54]. In this paper, we refer to IBQ as it was originally presented as a research method.

Summarizing the review of related work, quality evaluation research has slowly started to extend its approach from quantitative excellence evaluation towards descriptive and mixed methods to create a broader understanding of quality. There are two main approaches in descriptive quality research, interview-based and vocabulary-based, both applicable to naïve participants. However, the most accurate versions of these methods have only been applied to the assessment of unimodal quality. The goal of this paper is to develop a new quality evaluation method which uses the mixed method approach to create a deeper understanding of multimodal quality and which is applicable to naïve participants.

3. Open Profiling of Quality

Open Profiling of Quality (OPQ) is a mixed method that combines the evaluation of quality preferences and the elicitation of idiosyncratic experienced quality factors. It therefore uses quantitative psychoperceptual evaluation and, subsequently, an adaptation of Free-Choice Profiling. OPQ is “open” in terms of being “free from limitations, boundaries, or restrictions” [55] and “accessible to new ideas” [56] to understand the participants' construct of overall quality without restricting or constraining their descriptions. The term “profile” refers to “to represent the outline (of something)” [56], targeting some kind of identity, characteristics, descriptions, and structure for the phenomenon under study. Finally, our method aims at capturing the dualistic nature of excellence and characteristics of quality according to its two central meanings as “the degree of excellence of something” [56] and as “a distinctive attribute or characteristic possessed by … something” [56]. The specific goals of an OPQ study are

(1) to define the excellence of overall quality for different stimuli using quantitative psychoperceptual evaluation methods;
(2) to understand the characteristics of quality perception by collecting individual quality attributes using qualitative sensory profiling methods;
(3) to combine the quantitative excellence and qualitative sensory profiling data to construct a link between preferences and quality attributes;
(4) to provide a test methodology that is applicable with naïve test participants.

In the following, we present the OPQ method step by step, introduce its theoretical background, and describe the test procedure for conducting an OPQ study.

3.1. General Considerations

Open Profiling of Quality as a research method consists of three subsequent parts (see Figure 1): (1) psychoperceptual evaluation, (2) sensory profiling, and (3) external preference mapping. The first two parts are conducted independently, and their data can be combined in the last part.

3.1.1. Test Participants

OPQ is designed to be applicable with naïve test participants who meet predefined sensorial acuity criteria. A naïve participant is defined as one who does not meet any particular selection criterion for assessment tests and who has experience neither in the research domain nor in the evaluation task [1, 16, 57]. Naïve participants are expected to give holistic quality evaluations and to produce unbiased results due to their lack of knowledge about the test stimuli and their production [47]. In contrast, expert assessors are trained for accurate, detailed, and domain-specific evaluation tasks (e.g., visual artifacts) [58]. A certain sensorial acuity level is required from the participants to make sure that the results are not biased by sensorial inaccuracy (e.g., our sensorial acuity tests screened for myopia and hyperopia (Snellen index: 20/40), color vision according to Ishihara, hearing threshold with respect to ISO 7029 [59], and, in our case, 3D vision using the Randot Stereo Test (60 arcsec)).

More broadly, the sample selection contributes to the external validity of the study and defines how well the results from the tested sample generalize to some broader population of interest [60]. The recommended number of participants according to the ITU recommendations is at least 15 [8, 33]. However, we recommend 25–30 participants for the psychoperceptual evaluation to provide good statistical conclusion validity in within-subject designs [61]. For sensory profiling and external preference mapping, a minimum of 12–20 participants is needed [62]. In the optimum case, all assessors participate in both parts of the evaluation, while the selection of a representative subsample can be considered.

3.1.2. Scheduling the Experiments

The psychoperceptual evaluation task is conducted prior to the sensorial evaluation. Although the order of the tasks may not have an impact on the outcome, as shown in [63], it is recommended to begin with the psychoperceptual evaluation, as the assessors are then “clear of influence” [63]. In addition, the subsequent profiling task can be done more precisely due to the already existing comprehension of the product under test [63].

The experiments are divided into several sessions. Depending on the number and length of the test stimuli as well as the final design of each part, the psychoperceptual evaluation and the sensory profiling each take 90–120 minutes. The length of the parts forces the researcher to conduct OPQ in two or three sessions.

3.2. Psychoperceptual Evaluation
3.2.1. Research Problem

The goal of psychoperceptual evaluation is to assess the degree of excellence of the perceived overall quality for multimedia.

3.2.2. Data Collection

Psychoperceptual evaluation is based on the standardized quantitative methodological recommendations [8, 33]. The selection of the appropriate method needs to be based on the goal of the study and the perceptual differences between the stimuli. It is recommended to follow the guidelines these recommendations provide for designing and conducting the experiments and for the quantitative data analysis (for a review, see [37]). The overall quality of the stimuli is assessed by the test participants for the three following reasons. (1) Overall quality can be used to evaluate heterogeneous stimulus material (e.g., multimedia quality) to build up a global or holistic judgment of quality [1]; this is in contrast to the assessment of a certain quality attribute, such as brightness. (2) It assumes that both stimuli-driven sensorial processing and high-level cognitive processing, including knowledge, expectations, emotions, and attitudes, are integrated into the final quality perception of the stimuli [1, 16, 64]. (3) It is a suitable task for consumer- or user-oriented studies in product development conducted with naïve participants [64]. In addition, the overall quality evaluations can be complemented with other simple evaluations. Especially in consumer-oriented studies, the evaluation of an acceptable quality level, as an indicator of the minimum useful quality level, can be appropriate for quality judgments of novel multimedia services [65].

The test procedure during data collection contains training and anchoring, followed by the evaluation task. In training and anchoring, participants familiarize themselves with the qualities and contents used in the experiment as well as with the data elicitation method of the evaluation task. Often a subset of the actual test set, representing the full range of quality in the study, is used. In the subsequent evaluation task, the full set of test stimuli is presented according to the selected research method. The stimuli can be evaluated several times.

3.2.3. Method of Analysis

The quantitative data can be analyzed using the Analysis of Variance (ANOVA) or comparable nonparametric methods if the assumptions of ANOVA are not fulfilled [43].

3.2.4. Results

Fulfilling the first goal of OPQ, the psychoperceptual evaluation results in a preference ranking of the excellence of all test stimuli. These results can be translated into preferences for the treatments or test parameters under evaluation.

3.3. Sensory Profiling
3.3.1. Research Problem

The goal of the sensory profiling is to understand the characteristics of quality perception by collecting individual quality attributes.

3.3.2. Data Collection

In sensory profiling, research methods are used to “evoke, measure, analyze, and interpret people's reaction to products based on the senses” [3]. In OPQ, we partly follow the method of Free-Choice Profiling (FCP), originally introduced by Williams and Langron in 1984 [66]. It allows naïve participants to use their own vocabulary, differing sensitivities, and idiosyncrasies to describe the characteristics of products in a multistep evaluation procedure [3, 66]. FCP is free of time-consuming panel training but produces results comparable to other methods of descriptive analysis [3, 47, 67, 68]. Furthermore, it is well established in the food sciences, acting as a good reference for multimodal quality evaluation in other research fields [47, 69].

The test procedure contains four subtasks: (1) introduction, (2) attribute elicitation, (3) attribute refinement, and (4) sensory evaluation task.

(1) Introduction
This subtask aims at training the participants to explicitly describe quality with their own quality attributes. These quality attributes are descriptors (preferably adjectives) for the characteristics of the stimuli in terms of perceived sensory quality [3]. The introduction helps participants to understand the nature of the descriptive evaluation task. The descriptive skills of the test participants will limit the attribute elicitation [70], and the ability to express quality is an important requirement for the participants to produce strong quality attributes. In the training, we start with a small task of describing something familiar to the participants, such as apples: “Imagine a basket full of apples. What kind of attributes, properties, or factors can you use to describe the similarities and differences of two randomly picked apples?” Thereby, the researcher may help the test participant to find attributes, but never comes up with suggestions. After the introductory task, the participants start to describe audiovisual quality following the idea presented.

(2) Attribute Elicitation
This subtask aims at identifying the individual quality attributes that characterize the participants' quality perception of the different test stimuli. The actual extraction of attributes can be done using the different elicitation methods available. In the original Free-Choice Profiling, assessors write down their attributes without limitations [66]. However, it has been reported that developing their vocabulary was a hard task for participants, and supporting elicitation techniques can therefore be applied [71]. In the Repertory Grid Technique, one of the supporting techniques [72], test participants develop attributes in triad stimuli presentations: attributes are developed as factors distinguishing two stimuli of the triad from the third. In the second technique, Natural Grouping [73], the stimuli are divided into two groups differing in one attribute; each new group can then be divided again by the use of a second attribute, and so on. We have applied the supporting-task-free technique in our case studies, as no additional benefit in terms of attribute quality has been found for the supporting tasks [71, 74]. Independent of the elicitation method used, stimuli can be replicated several times, and people need enough time to watch them and to iteratively develop their attributes, as we have learned over our studies. Overall, attribute elicitation is a very important step for successful sensory profiling, as only the attributes found in this phase will be taken into account in the later evaluation.
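
As an illustration, the following sketch prepares triad presentations of the kind used by the Repertory Grid Technique; the stimulus identifiers and the number of triads per assessor are hypothetical choices, not prescriptions of the method.

```python
# A minimal sketch of preparing triad presentations for attribute
# elicitation (as in the Repertory Grid Technique). The stimulus IDs
# and the number of triads per assessor are hypothetical choices.
import random
from itertools import combinations

stimuli = [f"S{i:02d}" for i in range(1, 9)]  # eight test stimuli

all_triads = list(combinations(stimuli, 3))   # every possible triad
random.seed(42)
presented = random.sample(all_triads, k=5)    # subset shown to one assessor

for n, (a, b, c) in enumerate(presented, start=1):
    print(f"Triad {n}: how are {a} and {b} similar to each other and different from {c}?")
```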

(3) Attribute Refinement
This subtask aims at separating strong attributes from all developed attributes. In FCP, participants may develop unnecessarily many attributes in the elicitation step, whereas strong attributes are needed for accurate profiling. We apply two rules to define a strong attribute. First, the participants must be able to define the attribute in their own words, that is, they must know very precisely which aspect of quality is covered by the attribute. This is important for the interpretation of the results, to understand the individual attributes [3]. Second, the attribute must be unique, or nonredundant [3]: each attribute must describe one aspect of quality. Following these rules, the test participants are allowed to modify their list of attributes. It has proven useful to also limit the maximum number of attributes, as a larger set of attributes can add more error than additional information to the sensory data [74]. However, this should be checked in a pilot study. At the end of the refinement, the test participants write down a definition of each of the attributes left over for the final evaluation. Each attribute is attached to a 10 cm long scale labeled with “min.” and “max.” at its extremes (see Figure 2). The result is an individual score card which the test participants will use for the stimuli evaluation.

(4) Sensory Evaluation Task
This subtask aims at quantifying the strength of the developed attributes per stimulus. The stimuli are presented one by one, and the assessment for each attribute is marked on a line with “min.” and “max.” at its extremes. “min.” means that the attribute is not perceived at all, while “max.” refers to its maximum sensation.

3.3.3. Method of Analysis

By measuring the distance from the beginning of the 10 cm long line to the mark of the rated intensity, the sensory sensation is transformed into quantitative values. Each test participant produces one configuration, that is, an M × N matrix with M rows (the number of test items) and N columns (the number of individual attributes). To analyze these configurations, they must be matched to a common basis, a consensus configuration. Generalized Procrustes Analysis (GPA), introduced by Gower in 1975 [46], rotates and translates all configurations by minimizing the residual distance between the configurations and their consensus [3, 75]. Kunert and Qannari [76] present an alternative approach to analyzing sensory profiling data, claiming this approach to be more applicable to FCP data analysis. The problem of scaling the individual configurations is solved in such a way that “all the configurations (are put) on the same footing as the sums of squares become equal for all the data sets”. The scaled data sets from GPA or from Kunert and Qannari's approach [76] can then be analyzed using Principal Component Analysis (PCA).
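
For illustration, the following is a minimal sketch of the Procrustes matching described above, assuming one M × N_k score matrix per assessor; full implementations, such as the XLSTAT routine used later in this paper, additionally apply isotropic scaling and convergence checks.

```python
# A minimal sketch of Generalized Procrustes Analysis for free-choice
# profiling data, assuming one M x N_k score matrix per assessor
# (M stimuli, N_k individual attributes). This sketch performs
# translation and rotation only; production routines also apply
# isotropic scaling and convergence diagnostics.
import numpy as np
from scipy.linalg import orthogonal_procrustes

def gpa(configs, n_iter=20):
    # Pad all configurations with zero columns to a common width and
    # center them (this removes the translation component).
    width = max(c.shape[1] for c in configs)
    X = [np.pad(c - c.mean(axis=0), ((0, 0), (0, width - c.shape[1])))
         for c in configs]
    consensus = np.mean(X, axis=0)
    for _ in range(n_iter):
        for k, xk in enumerate(X):
            R, _ = orthogonal_procrustes(xk, consensus)
            X[k] = xk @ R                     # rotate towards the consensus
        consensus = np.mean(X, axis=0)
    return consensus, X

rng = np.random.default_rng(1)
configs = [rng.random((8, n)) for n in (7, 10, 12)]  # 3 assessors, 8 stimuli
consensus, aligned = gpa(configs)

# Principal components of the consensus span the perceptual space.
_, s, _ = np.linalg.svd(consensus, full_matrices=False)
print("explained variance per component:", np.round(s**2 / np.sum(s**2), 3))
```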

3.3.4. Results

GPA and the alternative approach of Kunert and Qannari both create a low-dimensional model of the high-dimensional input matrix. As a measure of the excellence of the model, the explained variance gives the amount of variance of the high-dimensional space that is represented by the model. The results are finally plotted as word charts (correlation plots) showing the correlation of the individual attributes with the principal components of the low-dimensional model. In contrast to interview-based evaluation methods [16, 17], no personal data interpretation is introduced in the analysis. At this stage, the researcher starts to identify the principal components of the perceptual space and uses the GPA scores of the items and the attributes' correlation with the components to understand the rationale behind the model. This fulfills the second goal of the OPQ method.

3.4. External Preference Mapping
3.4.1. Research Problem

The goal of External Preference Mapping (EPM) is to combine the quantitative excellence and sensory profiling data to construct a link between preferences and the quality construct.

3.4.2. Research Method

In general, External Preference Mapping maps the participants' preference data into the perceptual space and so enables perceptual preferences to be understood through sensory explanations [62, 77]. EPM is carried out using methods of multiple polynomial regression, for example, Partial Least Squares Regression [78] or PREFMAP [77].
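
As a minimal sketch of this step, the following regresses preference scores onto the components of a perceptual space using Partial Least Squares; the dimensions and all data below are random placeholders, not results from the studies.

```python
# A minimal sketch of external preference mapping via Partial Least
# Squares regression; the perceptual-space coordinates and preference
# scores are random placeholders.
import numpy as np
from sklearn.cross_decomposition import PLSRegression

rng = np.random.default_rng(2)
space = rng.random((24, 2))        # GPA scores of 24 stimuli on 2 components
preferences = rng.random((24, 15)) # mean preference of 24 stimuli, 15 assessors

pls = PLSRegression(n_components=2).fit(space, preferences)

# The loadings indicate how preference directions map into the space;
# score() reports the R^2 averaged over all assessors.
print("R^2:", pls.score(space, preferences))
print("loadings:\n", pls.x_loadings_)
```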

To show how OPQ can be applied in multimedia quality research, we present three experiments in the field of audiovisual 3D quality. The first experiment explores experienced audiovisual quality when the room-acoustic audio reproduction and the visual presentation mode (2D/3D) on a midsized screen are varied. In the second experiment, experienced audiovisual quality is examined under different audio (mono/stereo) and visual (2D/3D) presentation modes on a small mobile screen. The third experiment investigates the influence of different 3D video coding methods on experienced quality on small screens. In all experiments, the constructed quality level can be considered moderate, containing perceivable impairments in the presentations.

4. Experiment 1: Experienced Quality of Audiovisual Depth

The goal of the first experiment is to explore the influence of audiovisual depth on perceived quality. In previous work, bimodal depth experiences have been studied for virtual reality systems with large screens and very high-quality multichannel audio, or only one modality has been explored at a time [79–82]. In this study, we investigate multimodal quality perception with mixed methods when depth is varied in both modalities. Our independent variables are mono- and stereoscopic visualization on a midsized screen and room-acoustic simulations of small and large spaces with multichannel loudspeaker reproduction.

4.1. Research Method
4.1.1. Test Participants

A total of 25 naïve assessors took part in the psychoperceptual quality evaluation task (gender: 9 females, 16 males; age: 18–27 years) [1, 57, 58]. Sensory profiling was conducted with a subsample of 19 participants. All participants had normal or corrected-to-normal visual acuity and normal auditory acuity.

4.1.2. Stimuli

We varied depth in the visual presentation mode (2D/3D) and in the room acoustic simulations (small/large room) in audio. Two different audiovisual contents, rendered from virtual rooms of different sizes, were used. Visually, a Sharp display offers the possibility of physically switching between 2D and 3D presentation of the content. For the audio part, the IAVAS player offers functions to render different room acoustics [83].

In the large room, visualized as a classroom, the audio is a presentation by a male speaker, and the sound source is represented by a manikin (see Figure 3(a)). In the small room, visualized as a student's living room, the audio is drum and bass music, and the sound source is represented by a laptop (see Figure 3(b)). The user's movement through the room is automated, containing movement straight ahead and turns to the right or left. In total, eight 15-second stimuli were used in the experiment.

The rooms were designed using Maya software. For playback in the IAVAS I3D player [83], the scenes were exported into Binary Format for Scenes (BIFS). The audio was included using Advanced Audio BIFS. The audio files were encoded with AAC at a bit rate of 128 kbps. The room acoustics were modeled using the perceptual approach that is provided by the player. For each room a suitable room acoustic was modeled taking into account the different sizes and acoustical characteristics of the rooms. To vary depth in audio perception, the room models were exchanged between the rooms.

4.1.3. Stimuli Presentation

The tests were conducted in the Listening Lab at Ilmenau University of Technology, set up according to [8, 84]. The videos were presented on a 15-inch Sharp AL3DU stereoscopic display based on parallax barrier technology. The parallax barrier is built as a secondary LCD layer which can be switched on and off so that the screen can be used for monoscopic and stereoscopic videos. The viewing distance was 55 cm. The sound was played back on a four-speaker surround setup at ±30° and ±110° and a distance of 1 meter from the assessor [79]. The stimuli were repeated twice in random order for the psychoperceptual evaluation.

4.1.4. Test Procedure

The test procedure is described according to the theoretical method description in Section 3.

Psychoperceptual Evaluation
Prior to the actual evaluation, training and anchoring took place. Participants trained viewing the scenes (i.e., finding a sweet spot) and the evaluation task, and they were shown all contents and the range of constructed quality, including four stimuli. Absolute Category Rating was applied for the psychoperceptual evaluation of overall quality, rated on an unlabeled 11-point scale [33]. In addition, the acceptance of overall quality was rated on a binary (yes/no) scale [39]. All stimuli were presented twice in random order. The simulator sickness questionnaire (SSQ) was filled out prior to and after the psychoperceptual evaluation [85, 86].

Sensory Profiling
The sensory profiling task was based on the Free-Choice Profiling methodology [47]. The procedure contained four parts, carried out in two sessions within three days. (1) Introduction: an introduction to the task was given using the imaginary apple description task. (2) Attribute elicitation: all stimuli were presented three times, one by one. The participants were asked to write down their individual attributes on a white sheet of paper. They were limited neither in the number of attributes nor in how they described their sensations. (3) Attribute refinement: the participants were given the task to rethink (add, remove, change) their attributes to define their final list of words, which was transformed into the assessor's individual score card. Finally, four randomly chosen stimuli were presented once, and the assessor practiced the evaluation using the score card. In contrast to the subsequent evaluation task, all these ratings were done on one score card; thus, the test participants were able to compare the different intensities of their attributes. (4) Evaluation task: each stimulus was presented three times in a row, and the participants rated it on a score card. If necessary, they were allowed to ask for a fourth repetition.

4.1.5. Methods of Analysis

Psychoperceptual Evaluation
Nonparametric methods of analysis were used (Kolmogorov-Smirnov). Friedman's test is applicable to measuring differences between several ordinal dependent variables, and Wilcoxon's test to their pairwise comparisons [43].
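
A minimal sketch of this analysis chain with hypothetical ratings could look as follows; the data and the uncorrected pairwise comparisons are for illustration only.

```python
# A minimal sketch of the nonparametric analysis chain, with
# hypothetical ratings (25 participants x 8 stimuli on an 11-point
# scale): Friedman's test over all stimuli, then pairwise Wilcoxon
# signed-rank tests.
from itertools import combinations

import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
ratings = rng.integers(0, 11, size=(25, 8))

chi2, p = stats.friedmanchisquare(*ratings.T)
print(f"Friedman: chi2 = {chi2:.2f}, p = {p:.3f}")

if p < 0.05:
    # Pairwise comparisons; in practice a correction for multiple
    # comparisons (e.g., Bonferroni) should be applied.
    for i, j in combinations(range(ratings.shape[1]), 2):
        stat, p_ij = stats.wilcoxon(ratings[:, i], ratings[:, j])
        print(f"stimulus {i} vs. {j}: p = {p_ij:.3f}")
```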

Sensory Profiling
The sensory data were analyzed using Microsoft Excel and the GPA routine of XLSTAT 2.9.0. The data were also analyzed using Kunert and Qannari's method [76]. As GPA produced stronger results in terms of the explained variance of the model, the GPA model is used for the further analysis.

4.2. Results
4.2.1. Psychoperceptual Evaluation

Acceptance of Overall Quality
All presented stimuli provided a highly acceptable quality level, reaching an acceptance level of at least 83%. The test parameters did not have an impact on the acceptance of overall quality (Cochran's Q: ns). All items were rated equally (McNemar: all pairwise comparisons nonsignificant).

Overall Quality Satisfaction
The visual presentation modes and room acoustic simulations did not have a significant influence on overall quality satisfaction (Friedman: χ² = 3.341, df = 7, ns). All stimuli were rated equally (all pairwise comparisons ns).

4.2.2. Sensory Profiling

The test participants developed a set of 289 attributes. A total of 216 of these attributes represented between 50% and 100% of the explained variance; these attributes are located between the inner and the outer circles in the correlation plots (see Figures 6, 7, and 8). On average, the assessors used 10 attributes (min. 7, max. 32) to describe their sensory perception in the dimensionally reduced data.

Identification of Dimensions and Attributes
Seven components were needed to explain 100% of the variance in the GPA model. The contribution of each component is described in Table 3. Considering the elbow criterion and the Heymann and Lawless rule of interpretability [3], the first three components are used for the further data interpretation. The GPA result with the three principal components forms the GPA model, or perceptual space.
To understand the perceptual space, the attributes and test stimuli are plotted into the model, resulting in a three-dimensional space. For better interpretation, components 2 and 3 are always plotted against component 1 to obtain the two-dimensional slices of the perceptual space shown in Figures 4 and 5. The item names are substituted by the corresponding variables. Comparing the variables and the separation of the items in the perceptual space allows the components to be identified. Figure 4 shows that dimension 1 (PC1) relates to the content (classroom or student's room). Dimension 2 (PC2) separates the test items according to the visual presentation mode (2D or 3D presentation); PC2 is identified as “video quality.” Dimension 3 (PC3) divides the items by the room acoustics (simulated small room and simulated large room); it relates to the “audio quality” of the stimuli. Although the interpretation was done based on the test items and their related test parameters, we will refer to the quality aspects of content, video representation, and room acoustics in the further interpretation. This first finding confirms that the test participants derived their individual quality factors from the chosen test parameters.
The attributes can be classified into two different groups. Technical descriptions directly describe the characteristics of the test variables (like reverberation or grainy). The second group of attributes is characterized by experiences, subjective impressions, and feelings about the test items (e.g., monotone, lively, or obtrusive); this group is called impression descriptions. In the following, we discuss the correlation of the attributes and attribute groups with the GPA model.

Correlation of Attributes and the Perceptual Space
Word charts represent the correlation of the individual attributes with the perceptual space (see Figures 6–8). The closer an attribute is placed to one of the dimensions, the more it correlates with this dimension. Attributes placed between two dimensions correlate with both dimensions equally.
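
As an illustration of such a word chart, the following sketch places hypothetical attributes at their correlations with two principal components and draws the outer (100%) and inner (50%) circles mentioned above.

```python
# A minimal sketch of a word chart (correlation plot): hypothetical
# attributes are placed at their correlation with two principal
# components, inside the 100% (outer) and 50% (inner) circles.
import matplotlib.pyplot as plt

attributes = {"sharp": (0.80, 0.30), "spacious": (-0.20, 0.85),
              "grainy": (-0.70, -0.40)}       # name -> (corr PC1, corr PC2)

fig, ax = plt.subplots(figsize=(5, 5))
for r in (1.0, 0.5):                          # outer and inner circles
    ax.add_patch(plt.Circle((0, 0), r, fill=False, linestyle="--"))
for name, (x, y) in attributes.items():
    ax.annotate(name, (x, y), ha="center")
ax.set_xlim(-1.1, 1.1); ax.set_ylim(-1.1, 1.1)
ax.set_xlabel("PC1"); ax.set_ylabel("PC2")
ax.set_aspect("equal")
plt.show()
```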

The Dimension “Content” (PC1, 37.09% of Explained Variance)
By interpreting the attributes, we identify the two polarities of this dimension as the classroom on the one side and the student's room on the other. However, only a few attributes, such as “unpleasant voice”, “comiclike”, or “messy”, describe the content or the layout of the room directly. PC1 is more an impression description of the content, or of the impression the content makes on individual perception. Descriptions such as “liveless”, “emotional”, and “likeable”, or “monotone” and “sterile”, correlate highly with one of the two polarities, respectively. The high number of impression descriptions shows that quality perception is formed on an abstract level by the test participants. The assessors were able to find individual attributes that describe quality on a general level across the test items.

The Dimension “Visual Presentation Mode” (PC2, 25.38% of Explained Variance)
The polarities agree with the varied visual presentation modes (monoscopic (2D) and autostereoscopic (3D)). The 2D polarity shows descriptions of “sharpness” or “sharp edges”, “high contrasts”, “clear”, “light”, or “colorful”. In contrast, the 3D presentation mode is described with negative descriptions of the visual artifacts, such as “skewed outline”, “unclear”, or “interlaced lines”. It seems that the artifacts and reduced brightness of 3D result from limitations of the display technique (the parallax barrier and the viewing angle of the display). However, the results also show the participants' ability to experience visual depth, which is described as “integration”, “three-dimensional”, “spacious”, or “tangible”.

The Dimension “Room Acoustic Model” (PC3, 18.9% of Explained Variance)
PC3 also corresponds directly to the varying room acoustic models used in the test. The dimension can be divided at its extreme values into the large room and the small room. While the small room acoustics are described sparsely, a lot of quality factors can be found for the large room acoustics. In this dimension, technical descriptions dominate. The large room correlates with a high amount of reverberation, a “full spacious sound”, and “filling the room”. On the level of impression descriptions, PC3 is characterized by “imaginable”, “insistent”, or “shrill”.

Interdimensional Attributes
Attributes that correlate with more than one dimension can be interesting, especially attributes that correlate with both PC2 and PC3, as they describe audiovisual effects. Interdimensional attributes between the audio and video dimensions are rare (see Table 8). In particular, the depth-related attributes, which we expected to correlate with both dimensions, correlate either with the video dimension (e.g., spacious (P3)) or with the audio dimension (e.g., spacious (P14)). These results show that depth was perceived or rated independently in either auditory or visual perception. In the next section, we therefore take a closer look at the participants' individual perceptual patterns.

Comparison of the Individual Configurations and the GPA Model
The results show individual differences in the perceptual space. By plotting each assessor's attributes into the perceptual space independently, we identified sensorial preferences between the participants. As an example, the word charts illustrate that audiophile assessors (e.g., P14) mainly pay attention to the auditory stimuli, while videophile assessors (e.g., P13) emphasize the visual part of the stimuli. Just a few assessors (e.g., P25) used the whole perceptual space to characterize the stimuli with their attributes. These results show that multimodal quality evaluation is also influenced by the participants' sensorial preferences.

4.2.3. External Preference Mapping

The external preference mapping was not applied, as the results of psychoperceptual evaluations did not show any preferences between stimuli.

4.3. Discussion and Conclusions

The results of our psychoperceptual quality evaluation did not show an influence of audiovisual depth on perceived quality. However, the results of the sensory profiling gave further insight into this finding. First, the nonsignificant difference was not caused by nondetectable differences between the stimuli, as the participants differentiated them qualitatively. Second, the perceived depth was highlighted in both modalities contributing to the overall audiovisual perception. Third, when the visual 3D presentation mode was used, it was described as spacious and three-dimensional, but, more importantly, it was attached to several negative terms of inferiority. It is known that the added value induced by visual depth perception is only acknowledged if the level of visible artifacts is low enough [87–89].

Our results also showed individual preferences towards the quality of one modality. It is known that there are modality-dependent individual differences in human information processing styles. For example, the categorization into visual and verbal information processing styles is common [90]. Our results indicate that these different processing styles can also contribute to final multimodal quality judgments. There are two suggestions for further work. Firstly, the influence of different processing styles on multimodal quality perception under different quality levels and heterogeneous stimulus material needs to be addressed in detail to confirm the phenomenon. Secondly, for the practitioners of audiovisual quality, a well-validated tool is needed for identifying the groups of different information processing styles and reporting these groups to characterize the sample.

5. Experiment 2: Experienced Quality of Audiovisual Depth in Mobile 3D Television and Video

We examined the influence of mono and stereo audio and of the visual presentation modes on experienced quality for mobile 3D television and video. Visual stereoscopic 3D experience is a multidimensional construct of video quality, depth perception, and visual comfort [88]. Previous work has shown that visual 3D has the added value of depth, which can be instrumented in evaluations of depth perception [89]. In overall quality evaluations of impaired 3D images, the artifacts dominate over the benefits of depth [91]. Furthermore, viewing comfort is a part of the experienced quality on stereoscopic displays, and on small displays it has a coinfluence on the viewer [92–94]. To date, only a limited number of published studies compare the subjective visual quality between different presentation modes (2D and 3D) on mobile screen sizes [49, 54, 95]. Although these studies underline the critical aspects of visual 3D, they do not pay attention to the overall multimodal quality, to which audio can also contribute. Previous studies have shown an interaction between audio and video quality for mobile television [41, 96], but they have not addressed experienced quality when depth is varied in both the audio and visual modalities.

5.1. Research Method
5.1.1. Test Participants

A total of 45 test participants (gender: 13 females, 32 males; age: 15–30 years, mean = 24 years) took part in the psychoperceptual evaluation task. For sensory profiling, a subsample of 15 participants was randomly selected. All test participants passed a screening for visual acuity, color and 3D vision, and hearing acuity. The majority of the participants were categorized as naïve assessors (87%) [1].

5.1.2. Stimuli

Variables and Their Production
Targeting different depth perceptions in the auditory and visual channels, the videos were varied in the video (monoscopic or stereoscopic) and audio (mono or stereo) presentation modes, resulting in 24 videos under test.
The original audio tracks of all videos were exported as mono and stereo tracks from Adobe Premiere in the required length, and the audio was normalized. The original videos were resized to a resolution of 856 × 240 px and exported as stereoscopic videos. To create the monoscopic videos, two original videos were imported into Shake and resized to 856 × 240 px. The right view was cropped from both videos, resulting in two left views, each with a resolution of 428 × 240 px. One of the cropped videos was shifted to the right side, and both videos were added, resulting in two left views next to each other with a resolution of 856 × 240 px. Finally, the monoscopic videos were exported with mono and stereo audio tracks, respectively. All videos were coded with the mp4v codec using Simulcast at 25 fps with high bit rates of at least 10 Mbit/s for the video track.
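
The per-frame effect of this pipeline can be sketched as follows; the actual processing was done in Shake, and the frame content here is a placeholder.

```python
# A minimal per-frame sketch of the 2D-version creation described
# above: the right view of a side-by-side stereoscopic frame is
# dropped and the left view is duplicated, keeping the 856 x 240 px
# container format. The frame content is a placeholder.
import numpy as np

frame_3d = np.zeros((240, 856, 3), dtype=np.uint8)  # H x W x RGB, side by side

left_view = frame_3d[:, :428, :]                    # crop the left 428 px view
frame_2d = np.concatenate([left_view, left_view], axis=1)

assert frame_2d.shape == (240, 856, 3)              # same resolution as input
```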

Contents
Six different contents were used to create the stimuli under test (Table 4). The selection criteria for the videos were spatial details, temporal resolution, amount of depth, and the user requirements for mobile 3D television and video [97].

5.1.3. Stimuli Presentation

The controlled laboratory conditions were similar to experiment 1 [8]. An NEC autostereoscopic 3.5-inch display with a resolution of 428 × 240 px was used to present the videos. This prototype of a mobile 3D display provides equal resolution for monoscopic and autostereoscopic presentation and is based on lenticular sheet technology [98]. The viewing distance was set to 40 cm. The display was connected to a Dell XPS 1330 laptop via DVI, and AKG K-450 headphones were connected to the laptop for the audio presentation. The laptop served as the playback device and control monitor during the study. The stimuli were presented in counterbalanced order in both evaluation tasks. All items were repeated once in the psychoperceptual evaluation task. In the sensory evaluation task, stimuli were repeated only when the participant wanted to see the video again.

5.1.4. Test Procedure

A two-part data-collection procedure follows the theoretical method description in Section 3.

Psychoperceptual Evaluation
The procedure was identical to experiment 1. To capture the positive aspects of autostereoscopic presentation, the participants also rated perceived depth on an 11-point unlabeled scale.

Sensory Profiling
A four-part sensory profiling task contained (1) an introduction to the task, identical to experiment 1; (2) attribute elicitation: the participants watched 15 randomly selected items in groups of three items. Triad presentation (as used in the Repertory Grid Method (RGM) [99]) was chosen to help the participants with the attribute elicitation through a comparison of different items in unlimited time. The number of generated attributes per triad was not limited, and at the end the participants were given a chance to review and revise their attributes; (3) attribute refinement: the aim of this task was to revise (remove, add, or redefine) all attributes. A set of 15 randomly selected items was presented in triads, and the participants rated each item using their score cards. In each triad, the three items were presented one after another without a break and rated on the same score card; this presentation was chosen to help the participants compare the items during the rating process. Each triad was repeated once if the participant needed it. At the end of this task, the participants defined their final attributes; (4) evaluation task: in the final evaluation task, all 24 items were rated independently with all attributes. Each item was presented once, and the rating time was not limited.
The study was conducted in two sessions of approximately 90 minutes each. The psychoperceptual evaluation and subtasks 1 and 2 of the sensory profiling took place in the first session, the rest in the second session.
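As an aside, the randomized triad presentation used in the elicitation and refinement subtasks is straightforward to reproduce. A minimal Python sketch (hypothetical function name, not the software used in the study):

import random

def make_triads(item_ids, n_items=15, seed=None):
    """Randomly select n_items stimuli and split them into triads for
    Repertory-Grid-style attribute elicitation."""
    rng = random.Random(seed)
    selected = rng.sample(item_ids, n_items)
    return [selected[i:i + 3] for i in range(0, n_items, 3)]

# e.g., five triads drawn from the 24 test items of this study
print(make_triads(list(range(24)), seed=1))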

5.1.5. Method of Analysis

Psychoperceptual Evaluation and Sensory Analysis
The methods were identical to experiment 1.

External Preference Mapping
External Preference Mapping was applied to map the users’ preferences into the perceptual space. Two models can be used to describe the participants’ preferences: the vector model and the ideal point model [77]. Within the PREFMAP method in XLSTAT, the most suitable model is chosen automatically.
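To sketch what this model selection involves: the vector model regresses an assessor's preference ratings linearly on the item coordinates in the perceptual space, while the (circular) ideal point model adds a quadratic distance term; the better-fitting family is retained. The following Python fragment illustrates the idea under these assumptions and is not the PREFMAP implementation itself:

import numpy as np

def _r_squared(X, y):
    """Ordinary least squares fit, returning the coefficient of determination."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return 1.0 - (resid @ resid) / ((y - y.mean()) @ (y - y.mean()))

def compare_prefmap_models(pc_scores, pref):
    """pc_scores: (n_items, 2) item coordinates (PC1, PC2);
    pref: (n_items,) preference ratings of one assessor."""
    n = len(pref)
    X_vector = np.column_stack([np.ones(n), pc_scores])   # vector model
    quad = (pc_scores ** 2).sum(axis=1)                   # squared distance term
    X_ideal = np.column_stack([X_vector, quad])           # circular ideal point model
    return {"vector": _r_squared(X_vector, pref),
            "ideal_point": _r_squared(X_ideal, pref)}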

5.2. Results
5.2.1. Psychoperceptual Evaluation

Acceptance of Overall Quality
Overall, all presented stimuli provided a highly acceptable quality level. On average, the 2D presentation mode reached an acceptance level of 90%, and every stimulus reached an acceptance of at least 88%. For the 3D visual presentation mode, the average acceptance level was 79%, and no stimulus fell below 63% acceptance.

Overall Quality Satisfaction
Parameter combinations influenced overall quality satisfaction when averaged over the content (χ²(3) = 92.2). Figure 9 shows the overall quality scores, averaged over the content and content by content, for the four parameter combinations (v2D_aMono, v2D_aStereo, v3D_aMono, and v3D_aStereo). The 2D video presentation mode provided more satisfying quality than the 3D video mode. The audio presentation mode had no effect on the quality ratings: mono and stereo audio were rated equally in both video presentation modes (ns). The content-by-content results follow this main tendency, with the content cave as an exception. Although parameter combinations had no overall effect on satisfaction for this content (χ²(3) = 4.46, ns), detailed pairwise comparisons show that the 3D presentation mode provides higher quality under equal audio conditions (3D versus 2D, mono: Z = −2.53; 3D versus 2D, stereo: Z = −3.12). However, 2D accompanied by stereo audio reaches a quality level equal to 3D with mono audio (Z = −1.61, ns).
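The statistics reported in this subsection are consistent with a Friedman test over the four related samples followed by pairwise Wilcoxon signed-rank tests, the nonparametric analysis we assume here given that the method was identical to experiment 1. A minimal scipy sketch with dummy data (not the study's data; note that scipy reports the Wilcoxon W statistic, whereas Z values like those above come from its normal approximation):

import numpy as np
from scipy.stats import friedmanchisquare, wilcoxon

# rows = participants; columns = v2D_aMono, v2D_aStereo, v3D_aMono, v3D_aStereo
rng = np.random.default_rng(0)
ratings = rng.uniform(0, 10, size=(30, 4))   # dummy satisfaction scores

# omnibus test over the four parameter combinations (chi-square with df = 3)
stat, p = friedmanchisquare(*ratings.T)
print(f"Friedman chi2(3) = {stat:.2f}, p = {p:.3f}")

# post hoc comparison, e.g., 3D versus 2D under mono audio
w, p_w = wilcoxon(ratings[:, 2], ratings[:, 0])
print(f"Wilcoxon W = {w:.1f}, p = {p_w:.3f}")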

3D Impression
The parameter combinations influenced the perceived depth when averaged over the content. Figure 9 shows the mean values of 3D impression and overall quality, averaged over all contents and separately for each of the six contents. The highest level of depth perception was provided by the stimuli in 3D presentation mode. Under the 3D mode, the audio presentation mode did not influence depth perception (Z = −1.45, ns), while stereo audio slightly outperformed mono when the 2D video mode was used (Z = −2.91).

5.2.2. Sensory Profiling

Fifteen participants developed 130 individual quality attributes in the OPQ task. The attributes of one participant were excluded from further analysis because this participant had already been removed from the quantitative analysis as an outlier. Finally, 116 attributes from 14 participants remained (mean = 8.3, min. = 3, max. = 14).

Identification of Dimensions and Attributes
A total of 14 components were needed to explain 100% of the variance in the GPA model (Table 5). The first two components were used for further data interpretation according to the elbow criterion and the rule of interpretability [3]. These two components form the perceptual space.
Figure 10 shows the test stimuli in the perceptual space of PC1 and PC2; correspondingly, Figure 11 shows the attributes in the same space. Attributes with an explained variance between 50% and 100% are emphasized and considered for further interpretation. As can be seen in both plots (Figures 10 and 11), PC1 divides the items by video presentation mode (2D and 3D), while PC2 relates to positive and negative descriptions of overall quality. Interestingly, the participants concentrated on describing the video quality and their impressions of it and did not generate attributes related to the audio presentation mode.
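The elbow criterion used above can be operationalized, for example, as the point of maximum curvature of the explained-variance curve. A small Python sketch with illustrative numbers (not the actual values of Table 5):

import numpy as np

def elbow_components(explained):
    """Return how many components to keep: the component at which the
    explained-variance curve bends most sharply (max second difference)."""
    ev = np.asarray(explained, dtype=float)
    second_diff = ev[:-2] - 2 * ev[1:-1] + ev[2:]
    return int(np.argmax(second_diff)) + 2   # second_diff[i] belongs to ev[i+1]

# a steeply decaying curve similar in shape to the GPA model above
ev = [68.2, 13.0, 5.1, 3.4, 2.8, 2.0, 1.6, 1.2, 0.9, 0.7, 0.5, 0.3, 0.2, 0.1]
print(elbow_components(ev))   # -> 2: keep the first two components

In the studies reported here, this heuristic was complemented by the rule of interpretability rather than applied mechanically.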

Dimension 1 (“visual presentation mode,” 68.21% explained variance)
The polarities of this dimension correspond to the varied visual presentation modes, monoscopic (2D) and stereoscopic (3D). Monoscopic videos are described by attributes like “normal,” “natural,” “flat,” “sharp,” and “focusable.” In contrast, attributes like “3D benefit,” “depth impression,” “3D feeling,” “three-dimensional,” “tangible,” or “sharp” describe videos in stereoscopic presentation mode. The assessors were able to distinguish between the 2D and 3D video presentation modes and described the quality on a general level.

Dimension 2 (“impression descriptions,” 12.96% explained variance)
This dimension divides the perceptual space into negative and positive impression descriptions. Videos in monoscopic presentation mode are described only by positive descriptions like “exciting,” “pleasant,” “beautiful,” “focusable image,” or “stress-free.” The participants described the stereoscopic videos with both positive and negative attributes. On the one hand, the negative descriptions are artifact-related and concern the display technology, like “image totters when moving the display,” “flickering,” “inconclusive,” “stressful,” “restless,” or “blurred.” On the other hand, 3D videos are described as “spatial,” “brilliant,” “appealing,” “illusory,” or “layered.” The effect of content on quality perception can be seen in the participants’ impression descriptions, like “interesting” or “exciting.”

5.2.3. External Preference Mapping

Psychoperceptual data was combined with the sensory profiling results using External Preference Mapping (Figure 12). Many participants prefer the content cave in stereoscopic mode and the other contents in monoscopic mode. Three preference clusters could be obtained from the preference map. The participants in cluster 1 prefer stereoscopic items, especially participants 10 and 25. The participants in cluster 2 prefer the item cave in stereoscopic mode. Monoscopic video items, especially the items knight and cave, are preferred by cluster 3. It is also possible to relate the users’ preferences to the attributes from the profiling task by examining the external preference map and the word plot side by side: the preferences from the preference map can be correlated with the attributes from the word plot. The participants who prefer cave in 3D video mode seem to like items that can be described as “spatial,” “to come forth,” or “three-dimensional.” The preferred 2D videos are described as “interesting,” “stress-free,” or “pleasant.” In contrast, the contents music, rhine, and oldtimer correlate with attributes like “blurred,” “stressful,” or “inconclusive” and are disliked.

5.3. Discussion and Conclusion

Our results underlined the dominance of visual quality factors over audio factors and their interaction in the experienced quality. This result was confirmed by the three different evaluation tasks used (psychoperceptual quality satisfaction, depth impression, and sensory profiling). In line with our results, nonsignificant influences of audio on audiovisual quality have been reported in the context of large displays and surround systems at a good quality level [100, 101]. Neuman et al. [100] have shown that naïve participants have difficulty differentiating between mono and stereo audio while watching video. Furthermore, Lessiter and Freeman [101] underlined that the feeling of presence is not enhanced by the audio mode. It is also possible that the visual variable was the most strongly varied factor in the experiment and therefore captured the greatest attention, as suggested by peak-end theory [102].

The results also showed a conflicting impact of the 3D presentation mode on overall quality and depth impression: while the use of the 3D mode increased the depth impression, it decreased overall satisfaction. The descriptive GPA results explained these findings by underlining the inferiority of 3D (spatial, stressfulness, flickering, eye-strain). However, our results also showed that, in artifact-free cases, 3D can reach higher perceived quality than 2D; in that case, the perceived depth and the exciting 3D sensation make the stereoscopic videos subjectively better. This indicates that the added value induced by depth perception in stereoscopic presentation holds only when the level of visible artifacts is low. These findings support results from earlier studies [88, 89, 91].

Further work needs to address the most annoying artifacts in order to improve 3D presentation within the limited technical resources of portable devices.

6. Experiment 3: Experienced Quality of Video Coding Methods for Mobile 3D Television

The third case study targeted the selection of an optimal stereo video coding method for mobile 3D television and video applications. Various coding approaches have recently been optimized for mobile 3D video [103], but no previous work has evaluated them in a large-scale study. Previous work on stereo video coding was mainly done on still images [65, 88, 91]. These studies showed that the added value of stereoscopic stimuli reported for the uncompressed case [89] does not hold for MPEG-2 or JPEG compressed material [65, 91, 104]. In these cases, depth perception did not increase the perceived overall quality of the stimuli.

6.1. Research Method
6.1.1. Test Participants

A total of 47 naïve assessors (gender: 23 females, 24 males; age: 16–37, mean: 24) took part in the psychoperceptual evaluation task [1]. Fifteen of them were randomly selected from this sample for the sensory profiling task. All assessors passed a screening for visual acuity, color vision, and 3D vision and were among the potential users of mobile 3D television [97]. Parental consent was required for the participation of underage assessors.

6.1.2. Stimuli

Variables and Their Production
We varied four coding methods and two quality levels in this study. The four coding methods, all especially adapted for mobile stereo video [103], were chosen for evaluation. As Video + Video approaches, we chose H.264/AVC Simulcast [105], a straightforward coding solution, as well as the more advanced H.264/AVC MVC [106] and Mixed Resolution Stereo Coding (MRSC) [107]. In addition, Video + Depth [108] was selected as an alternative to the Video + Video coding methods. As coding profile, the Baseline profile, that is, IPPP structure and CAVLC (Context-Adaptive Variable-Length Coding), was used, and the GOP size was set to 1. A low and a high quality level were defined for each test sequence. To guarantee comparable low and high quality across all sequences, individual bit rate points had to be determined for each sequence. To define the low quality level, the quantization parameters (QPs) for Simulcast coding were set to 30. The resulting bit rates for each sequence are given in Table 6 and were used as target rates for the other three approaches.
Two different codecs were used for video encoding. The H.264/AVC reference software JM 14.2 was used for Simulcast, Mixed Resolution, and Video + Depth coding; MVC encoding was performed with the H.264/MVC reference software JMVC 5.0.5. The test stimulus production for the Simulcast- and MVC-encoded sequences was straightforward, following the target bit rates in Table 6. To achieve these target bit rates, the quantization parameters for the left and the right view were changed together, so that both views were of the same quality. The depth maps for the Video + Depth approach were estimated from the left and the right view using a hybrid recursive matching algorithm [99], and the view synthesis was performed using Merkle et al.’s algorithm [109]. For the generation of the Mixed Resolution sequences, the right view was decimated by a factor of two in both the horizontal and the vertical direction. For up- and downsampling, the tools provided with the JSVM reference software for Scalable Video Coding were used. The applied optimization approach is described in [110]. The frame rate of all sequences was set to 15 fps.
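The mixed-resolution principle, decimating one view for coding and restoring it for display, can be illustrated with a toy down/up-sampling pass. This numpy sketch uses plain 2 × 2 block averaging and pixel repetition, whereas the study used the JSVM filter tools:

import numpy as np

def decimate_by_two(view):
    """Halve the resolution in both directions by 2x2 block averaging."""
    v = np.asarray(view, dtype=np.float32)
    v = v[: v.shape[0] - v.shape[0] % 2, : v.shape[1] - v.shape[1] % 2]
    return 0.25 * (v[0::2, 0::2] + v[1::2, 0::2] + v[0::2, 1::2] + v[1::2, 1::2])

def upsample_by_two(view):
    """Restore the size by pixel repetition (the JSVM uses proper filters)."""
    return np.repeat(np.repeat(view, 2, axis=0), 2, axis=1)

right_view = np.random.rand(240, 428)    # full-resolution right view (grayscale)
low_res = decimate_by_two(right_view)    # 120 x 214, cheaper to encode
for_display = upsample_by_two(low_res)   # 240 x 428 again for presentation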

Contents
Six different contents were chosen to create the test stimuli (Table 7), following selection criteria similar to experiment 2. None of the contents contained scene cuts.

6.1.3. Stimuli Presentation

The conditions in the controlled environment were similar to experiment 2, and the same setup was used, but without headphones. All items were presented twice in the psychoperceptual evaluation, whereas each item was presented three times in a row in the sensory profiling task.

6.1.4. Test Procedure

According to the theoretical model in Section 3, the study contained two parts: a psychoperceptual evaluation and a subsequent sensory profiling.

Psychoperceptual Evaluation
The psychoperceptual evaluation followed the same method as in experiments 1 and 2. The test participants evaluated the acceptance of and satisfaction with overall quality. The session took about 90 minutes.

Sensory Profiling
Sensory profiling was conducted in the second session, lasting 75 minutes. A Free-Choice Profiling approach was applied with the following subtasks: (1) introduction to the task, identical to experiments 1 and 2; (2) attribute elicitation: the test participants watched a subset of 24 randomly chosen test items and wrote down their idiosyncratic quality attributes while watching. The number of attributes was not limited in this step. During the last clips, the test participants were encouraged to review their attributes and check whether all quality aspects were covered; (3) attribute refinement: at the beginning of the attribute refinement, the assessors were asked to select a maximum of 15 attributes for their score cards. After the selection, 12 test items were presented, and the test participants evaluated them on their score cards. The possibility of revising the score card (adding, removing, or redefining attributes) was still given. The score card was then finalized, and each assessor defined his or her quality attributes; (4) evaluation task: in the final evaluation task, all 48 items were rated independently. Each item was shown three times in a row to allow enough time to apply all attributes, and the rating time was not limited.

6.1.5. Method of Analysis

Psychoperceptual Evaluation, Sensory Profiling, and External Preference Mapping
The analysis was identical to experiment 2.

6.2. Results
6.2.1. Psychoperceptual Evaluation

Acceptance of Overall Quality
All coding methods provided highly acceptable quality at the high quality level, with at least 80% acceptance. At the low quality level, MVC and Video + Depth still reached 60% acceptance, while the acceptance of MRSC and Simulcast fell below 40%.
The distributions of acceptable and unacceptable ratings on the satisfaction scale differ significantly (χ²(10) = 2368). The scores for non-accepted overall quality lie between 1.4 and 4.2 (mean: 2.8, SD: 1.4), whereas accepted quality was expressed with ratings between 4.5 and 8.5 (mean: 6.5, SD: 2.0). The acceptance threshold can therefore be located between 4.2 and 4.5.
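Note that the reported bounds are consistent with mean ± 1 SD bands (2.8 + 1.4 = 4.2 and 6.5 − 2.0 = 4.5). Under that assumption, bracketing the acceptance threshold can be sketched as follows (hypothetical helper, dummy data):

import numpy as np

def acceptance_threshold_band(scores, accepted):
    """Bracket the acceptance threshold between mean + 1 SD of the
    not-accepted ratings and mean - 1 SD of the accepted ratings."""
    scores = np.asarray(scores, dtype=float)
    accepted = np.asarray(accepted, dtype=bool)
    rejected, approved = scores[~accepted], scores[accepted]
    return (rejected.mean() + rejected.std(ddof=1),
            approved.mean() - approved.std(ddof=1))

rng = np.random.default_rng(1)
scores = np.clip(rng.normal(5, 2, 200), 0, 10)        # dummy satisfaction scores
accepted = scores > 4.3 + rng.normal(0, 0.5, 200)     # noisy acceptance judgments
print(acceptance_threshold_band(scores, accepted))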

Overall Quality Satisfaction
At the high quality level, the coding method influenced quality satisfaction (χ²(3) = 241.83; Figure 13). MVC and Video + Depth provided the highest overall quality satisfaction scores when averaging over the content (MVC versus V + D: Z = −0.828, ns), outperforming MRSC and Simulcast in all pairwise comparisons. The results were confirmed at the low quality level (χ²(3) = 648.97), where MVC and Video + Depth again outperform MRSC and Simulcast in all pairwise comparisons.
Content-by-content analysis showed that Video + Depth outperformed all other methods at both the high and the low quality level. For the Butterfly content, MVC had the best satisfaction scores at both quality levels. The coding method had no influence for the Bullinger content at the high quality level (χ²(3) = 2.942, ns).

6.2.2. Sensory Profiling

The 15 assessors in the sensory profiling session developed a total of 102 individual quality attributes.

Identification of Dimensions and Attributes
Considering Lawless and Heymann’s rule of interpretability [3], two dimensions were identified as important for the GPA model. The first two components of the GPA model explained 88.36% of the variance, with PC1 covering the majority (83.32%). Figure 14 shows the item plot and Figure 15 the correlation plot of the GPA model. The analysis emphasizes attributes explaining more than 50% of the variance. As can be seen from the plots, PC1 is mainly determined by video quality, while PC2 discriminates the items (Figure 14) into those with a high amount of motion (soccer) and those with a low amount of motion (bullinger).

Dimension 1 (“video quality”, 83.32% explained variance)
PC1 shows a high correlation of its negative polarity with attributes like “blurry,” “blocky,” or “grainy,” while its positive polarity correlates with attributes like “sharp,” “detailed,” and “resolution.” This component describes the video quality: it separates the model into good and bad quality, where bad quality mainly comprises descriptions of artifacts.

Dimension 2 (“amount of motion”, 5.03% explained variance)
Along PC2, static test contents (Bullinger, Mountain, Horse) and contents containing motion (Butterfly, Soccer2, Car) are separated (Figure 14). The explained variance of PC2 is remarkably small compared to the first dimension. Nevertheless, it is plausible that the amount of motion impacts perceived quality through the applied coding methods. No attributes were identified that describe the perception of motion itself.
A separate depth component was not identified in the GPA model. The correlation plot shows that 3D-related attributes like “spacious,” “3D reality,” or “background depth” correlate with the positive polarity of PC1. Depth descriptions thus seem to be part of good quality. If video quality is low due to coding artifacts, this degradation exceeds the additional value provided by the stereoscopic presentation, and depth is not taken into account when describing quality.

6.2.3. External Preference Mapping

The results show a preference for artifact-free content (Figure 12). The contents with the highest user preference are identified along PC1, and the least preferred items are all Bullinger clips at the opposite side of the map. It can also be seen that the Bullinger clips correlate with an attribute called “redundant.” Although this attribute appeared only once, it may explain the quantitative results for the Bullinger clips: the quantitative analysis showed that the differences between coding methods were rather small for this content. The “redundancy” of the Bullinger items may indicate that the participants evaluated this content on a more affective level rather than on the quality it provided.

6.3. Discussion and Conclusion

The results of our psychoperceptual evaluation showed that Multiview Coding and Video + Depth provide the highest experienced quality among the tested coding methods. They also represent contrasting approaches to coding 3D video: while MVC exploits inter- and intra-view dependencies between the two video streams (left and right eye), the Video + Depth approach renders virtual views from a given view and its depth map [103]. In addition, the provided quality level was highly acceptable compared to previous studies [40].

The results of sensory profiling showed that artifacts are still the determining quality factor for 3D. The expected added value through depth perception was rarely mentioned by the test participants and, when mentioned, it was connected to artifact-free video. These results are in line with previous studies concluding that depth perception and artifacts together determine 3D quality perception [88, 104]. In contrast to Seuntiëns’ model [88], our profiles showed a hierarchical dependency between depth perception and artifacts: when the visibility of artifacts is low, depth perception seems to contribute the added value of 3D. This experiment thus confirms the findings of experiment 2. With respect to stereo video coding methods, we can see that the compression of the depth map in Video + Depth approaches directly impacts depth quality, whereas depth is not affected by the coding method in Video + Video approaches. Further work needs to investigate the interaction between artifacts and depth more deeply to improve coding methods for mobile stereo video.

7. Discussion and Conclusions

The aim of this paper was to present OPQ, a novel method for multimedia quality evaluation with naïve participants. As a mixed method, OPQ combines a conventional quantitative psychoperceptual evaluation with a qualitative descriptive quality evaluation to gain a deeper understanding of quality factors. We applied the method in three audiovisual quality evaluation experiments.

7.1. Convergence and Complementation

Our three studies highlighted the complementarity and convergence of the results acquired with the different methods, underlining the positive features of mixed-method research [24]. The results are summarized in Table 8. The results complemented each other in all studies, and, even more importantly, quantitative quality preferences were explained by qualitative descriptions. For example, when no quantitative difference in excellence between stimuli was identified, the qualitative results still showed detectable differences between the varied factors; perceived inferiority nullified the positive influence of quality factors (audiovisual depth); and the participants’ sensorial preferences contributed to the final multimodal quality evaluations.

Furthermore, we were able to explain the differences in excellence between parameter combinations by understanding the relationship between quality and depth through sensory profiling. Descriptions of depth and error-freeness were attached to good quality when the visual presentation mode and coding factors were varied. Without the qualitative data, the reasons behind the quantitative results would have rested on assumptions, while sensorial data as a single method is not capable of showing preferences.

The convergence between the results was represented along the whole affective dimension. Poorly rated quality was attached to badness, inferiority, and erroneousness. In the neutral case, influences were visible in neither of the measures (e.g., the varied audio quality factors neither influenced the quality ratings nor were described in the sensory profiling). Similarly, the most satisfying variable was visible in both measures, consistently showing high quality ratings and the goodness of layered visual depth. Finally, one of our case studies also revealed a slightly contradictory aspect between the results: the participants were able to express preferences between visual stimuli, while these differences were not visible as such in the results of sensory profiling. This may indicate that, for stimuli with small differences, naïve participants can express overall quality preferences quantitatively, while their ability or sensitivity to express them in sensory profiling can be limited [111, 112] or may be guided by the most strongly varied factors, in line with peak-end theory [102]. To sum up, the benefits of using OPQ as a mixed method for multimedia quality evaluation lie in its ability to provide complementary results and, above all, to explain quantitative results with qualitative descriptions. Further work needs to systematically probe the method with small but detectable differences in multimodal stimuli with naïve participants to better understand the limits of its use.

7.2. Further Work

Other aspects of the further development of the OPQ method mainly concern the sample and the conduct and analysis of studies. For multimodal quality evaluation studies with naïve participants, our results suggest that it is worth using a well-validated tool for identifying groups with different information-processing styles (e.g., [90]) and reporting these groups to characterize the sample. Our experience with OPQ has repeatedly highlighted the importance of training and careful attribute development in sensorial studies. Individual differences in the ability to describe properties accurately are not only a typically reported challenge in food science [52]; they also seem to be present in multimedia quality studies. Based on our informal observations, the apple-description task used in the training, as something concrete and familiar, helps participants start creating their descriptions. We have also observed the importance of giving enough time for the attribute elicitation and refinement tasks, which can contribute to the success of the final sensory evaluation. As a last remark, the use of OPQ requires participation in multiple sessions. In general, drop-out rates can be a problem of construct validity in multi-session studies [60]. Although we did not face this problem in our studies, practitioners should keep this limitation in mind when considering small sample sizes for the sensory profiling task.

Based on our experience with the analysis of interview-based descriptive data (e.g., [16, 42]), the analysis of sensorial data seems comparably quick and straightforward. However, four main suggestions should be considered in further work. Firstly, guidelines for detecting outliers in this data are needed. While outliers can be detected and removed from quantitative results, sensory evaluation does not provide robust methods for this purpose. However, the residuals given for each configuration after GPA [46] show large differences between the most important (low residual) and the least important configurations (high residual); these residuals may provide a basis for outlier detection [113]. Secondly, the issue of dominant components needs to be addressed. PC1 of the perceptual models in studies 2 and 3 is very dominant, that is, it accounts for more than 60% of the explained variance. This may lead to a loss of information in the components with lower explained variance and eventually to an incomplete understanding of the perceptual mechanisms. Thirdly, the reliability of the interpretation of the perceptual spaces in sensory profiling and external preference mapping needs further consideration. Currently, the interpretations of GPA and EPM charts can be based on a single researcher’s reading. In interview-based, data-driven analysis (e.g., Grounded Theory, content analysis), reliability is addressed using inter-rater reliability estimates and reviews by multiple independent researchers [114]. A similar procedure should be considered to improve the reliability of the interpretation of GPA and EPM charts. Fourthly, the impact of different methods of analysis needs to be investigated. A comparison of GPA and Kunert and Qannari’s approach [76] returned a stronger model, in terms of explained variance, for Generalized Procrustes Analysis. Similar investigations are needed for the methods of External Preference Mapping (PREFMAP, PLS) to understand their similarities and differences and to minimize the impact of the choice of method on the results.
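One conceivable way to operationalize the residual-based outlier rule of the first suggestion is a simple fence on the per-configuration GPA residuals. The cutoff rule in this Python sketch is our assumption, not an established procedure:

import numpy as np

def flag_outlier_configurations(residuals, k=1.5):
    """Flag assessor configurations whose GPA residual exceeds the upper
    boxplot fence (Q3 + k * IQR); k is a tuning constant."""
    r = np.asarray(residuals, dtype=float)
    q1, q3 = np.percentile(r, [25, 75])
    return np.where(r > q3 + k * (q3 - q1))[0]

# residuals per assessor configuration after GPA (dummy values)
print(flag_outlier_configurations([0.8, 0.9, 1.1, 0.7, 3.4, 1.0]))   # -> [4]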

Finally, systematic comparisons between OPQ and existing methods are needed to provide practitioners with guidelines for their effective use. To probe aspects for such comparisons: OPQ offers a relatively easy data-collection and analysis procedure but requires multiple evaluation sessions, whereas interview-based methods require good interviewing skills of the research personnel and a relatively slow analysis procedure, while they can complete the whole study within a single session. The systematic comparisons need to examine performance-related aspects (e.g., accuracy over different quality ranges, validity, reliability, and costs), complexity (e.g., ease of planning, conducting, analyzing, and interpreting results), and evaluation factors (e.g., number of stimuli, required knowledge of the research personnel) (e.g., [115–117]). The long-term goal is to support the sound development of these instruments by understanding their benefits and limitations in capturing a deeper understanding of experienced multimedia quality.

Acknowledgments

The MOBILE3DTV project has received funding from the ICT programme of the European Community in the context of the Seventh Framework Programme (FP7/2007–2011) under Grant agreement no. 216503. The paper reflects only the authors’ views; the European Community and the other project partners are not liable for any use that may be made of the information contained herein. The work of the second author was supported by the Graduate School in User-Centered Information Technology (UCIT).