Abstract

This paper focuses on validating the implementation of a state-of-the-art audiovisual (AV) technology setup for the live broadcasting of cultural shows via broadband Internet. The main objective of the work was to study, configure, and set up dedicated audio-video equipment for capturing, processing, and transmitting extended-resolution, high-fidelity AV content, in order to increase realism and achieve the maximum audience sensation of presence. The Internet2 and GEANT broadband telecommunication networks were selected as the most suitable technology for delivering such traffic workloads. Validation procedures were conducted in combination with metric-based quality of service (QoS) and quality of experience (QoE) evaluation experiments, for the quantification and the perceptual interpretation of the quality achieved during content reproduction. The implemented system was successfully applied in real-world applications, such as the transmission of cultural events from the Thessaloniki Concert Hall throughout Greece, as well as the reproduction of Philadelphia Orchestra performances (USA) via the Internet2 and GEANT backbones.

1. Introduction

It is unquestionable that the rapid evolution of next-generation networks and broadband access has an increasing impact on traditional information and communication technology (ICT) services and applications. Digital multimedia production and broadcasting is among the fields most influenced by these changes, allowing full advantage to be taken of contemporary technological advances. Novel, user-oriented, and on-demand services are currently deployed for browsing, searching, and retrieving AV content, including news, multimedia e-learning, AV streams of cultural events, entertainment shows, and other applications. Web-based video-on-demand (VoD) services, digital/interactive TV (DTV/ITV), and IP-based TV (IPTV) programs over the Internet and mobile systems are typical examples where AV content is usually delivered via IP-based topologies [1–7]. This also complies with the persistent demand for the continuous extension of the available AV content resolution and fidelity, in an effort to achieve a better experience and create a sense of realism or telepresence [8–19]. High-definition (HD) AV technologies [5–7, 9, 10], Ultra High Definition Video (UHDV or Super Hi-Vision) [17–19], and digital cinema (D-cinema) [20] projects currently focus on these objectives, increasing the capacity of the related video streams and raising a compulsory demand for even higher data transfer rates. Besides the utilities that have recently been launched over next-generation network architectures [1–3], the combination of broadband services and high-definition multimedia broadcasting is a very challenging technological/research field that the current paper aims to discuss in depth.

The widespread adoption of the World Wide Web and the know-how obtained through the successful implementation of the Internet have led to the global adoption of IP technology and IP-based communications across a broad range of contemporary ICT services. There is a worldwide effort to construct broadband network backbones, such as the Internet2 [21] and GEANT [22] initiatives that have been implemented during the last decade in the USA and Europe, respectively. It is obvious that digital multimedia broadcasting inherently belongs to demanding broadband services such as the above, as well as to similar state-of-the-art technological approaches [8–10, 21, 22]. The purpose of the current paper is to analyze, implement, and evaluate the use of state-of-the-art digital multimedia broadcasting technologies in combination with broadband services for the transmission of demanding AV streams, captured by means of live-performance HDTV shots. The proposed AV configuration setup has been successfully deployed in real-world applications, such as the transmission of cultural events from the Thessaloniki Concert Hall throughout Greece [23], as well as the reproduction of Philadelphia Orchestra performances (USA) [24] via the Internet2 and GEANT backbones. Moreover, the presented methodology can be further utilized as guidance in related IPTV applications, especially those dealing with HD video/multimedia streaming.

The paper is organized as follows. The problem definition is described in Section 2, where the state of research and related works are also discussed. Section 3 presents the proposed system configuration, providing detailed information about the development steps of the work and all the physical and technical aspects faced during the implementation phases, while the employed metrics, their statistical relations, and their utilization are also considered. Experimental results are analyzed in Section 4, where the evaluation of the adopted system configuration is carried out, together with conclusions and remarks on future work.

2. Problem Definition and Background Work

The increased popularity of networked multimedia applications has created new demands for reliable and secure video transmission, and AV content is expected to account for a large portion of the traffic in the future Internet and in next-generation wireless systems [8, 25, 26]. This creates further needs for broadband networks, because digital audio and video carry large amounts of information, especially in critical/demanding cases where high resolution is required and quality compromises are not acceptable. To meet such demands, advanced compression techniques are continuously evolving, in combination with novel routing architectures and algorithms, in an effort to guarantee the required QoS at the currently available Internet data transfer rates [8–10, 25, 27, 28]. Classical examples of this category are the IPTV and VoD projects that have been launched in recent years [1–7]. As a consequence, various studies have appeared focusing on the evaluation of the network and compression parameters, in combination with various content types, using quantitative metrics (often perceptually adapted) as well as subjective tests that incorporate functional, perceptual, and aesthetic mean opinion scores (MOSs) [8, 27–35]. Recently, research effort has begun to focus on HDTV-related approaches [5–7, 9, 10, 33, 34], including D-cinema [20] and UHDV [16–19, 36], aiming to evaluate the advantages, as well as the technical difficulties, of these technologies toward the implementation of future AV broadcasting services.

The scope of the current paper is twofold: first, to present a solid system layout for HD video content production and transmission; second, to evaluate the effect of various parameters of the system configuration setup on the achieved quality. It is important to mention that the current paper deals with technical issues both at the production (physical theatre) and reproduction (remote amphitheatre) stages, as well as with all the intermediate phases related to content packaging and streaming (AV formats, compression algorithms, routing architectures, etc.). Hence, unlike the previously mentioned research works, it aims at providing end-to-end solutions, and their evaluation, for the successful broadcasting of live shows in auditoriums. In addition, audio and surround sound techniques are essential to accomplishing the “high fidelity” target. Thus, audio-related issues are equally important as video and deserve correspondingly careful treatment at all the recording, mixing, multiplexing, coding, and reproduction stages. Specifically, the problem under study is best described by the questions that commonly arise and should be answered. How many cameras and what formats should be employed? How many microphones, of what types, and where should they be placed? Which AV compression, multiplexing, and coding algorithms should be selected? Which are the most applicable AV content packaging/streaming techniques to be implemented in combination with the available network topologies and the corresponding routing architectures? Is there dedicated AV equipment that fulfills the reproduction demands, and how should it be used? How do the above parameters influence the achieved quality and the perception of the transmitted shows? The ultimate goal of the current work is to provide guidance for live AV capturing/IP-broadcasting/reproduction, as well as an integrated methodology for the prediction/estimation of the achieved end-to-end QoE, using QoS metrics (e.g., PSNR) related both to the application level and to the broadcasting-network level.

The majority of the corresponding research works focus mainly on video-related issues, which are the most technically demanding, since they relate to the larger portion of the AV data stream. For example, CCD-camera noise and lighting-related degradation may appear in cases where capturing conditions cannot be adjusted to the broadcasting demands [8, 37], as the performed show is mainly intended for the audience at the physical auditorium. In general, video signals can be corrupted by noise during acquisition, recording, digitization, processing, and transmission [37]. On the other hand, audio information requires careful treatment in order to create high-fidelity sound-field reproduction conditions. The importance of this task has been exhaustively analyzed in D-cinema and UHDV initiatives [13–16, 36], while various stereo sound recording techniques have been proposed, in combination with the corresponding surround sound systems, for its effective implementation [13–16, 36, 38–43]. From a practical point of view, 5.1 and 7.1 surround systems have prevailed in real-world applications, such as movie theatres and the cinema industry. In addition, subjective tests have been applied to provide guidance on best format selection, in combination with the corresponding content type and the available reproduction HDTV sets [33, 34], while future trends of HDTV and the third generation of HDTV formats, such as 1080p, have been studied [9, 10, 34]. However, in the case of HDTV, little attention has been given to the accompanying surround sound in combination with the viewing-distance conditions [9, 11].

Considering that video compression-related degradation has been studied during the development of the corresponding algorithms, many researchers focus exclusively on the evaluation of the video broadcasting procedures, studying the generated traffic demands in relation to the characteristics of the involved network technologies [8, 25, 26]. In general, there are three major research approaches to evaluating broadcasting networks video-wise: (a) actual video bitrate-based techniques, (b) video trace-based approaches, and (c) model-based techniques [26, 29–35]. Actual video streams exhibit originality and provide accurate results when their transmission over lossy networks is evaluated, but they raise difficulties connected with their availability and copyright permissions. In order to overcome such issues, video traces use only the number of bits of the video content and the related timing information instead of the video content itself [26, 29, 30]. These techniques focus on the study of the transmission characteristics but fail to estimate their effect on the decoding process, being unable to predict how the errors that occur are reflected in the perception of the received image sequences [26, 29, 30]. An evolution of the above methods includes the design of advanced video traces that incorporate various video-related features [30], enabling the study of the influence of lost packets on compressed sequences [28, 29]. Finally, model-based video traffic techniques use mathematical models to describe the propagation of video streams over the network, depending strongly on the validity of the selected model in a real-world application [26, 29, 30]. These methods represent the simplest evaluation approaches, providing lost-data statistics at the level of pixels, frames, and groups of pictures (GOPs), over various network bandwidths and topologies, different routing architectures, and, in general, variable QoS settings [8, 25, 26].

As already stated, packet-loss information is inadequate to estimate the actual video degradation, related to the received and perceived image quality. In most cases, reference content is available (full-reference (FR) methods [26, 29, 30]); therefore, comparisons between the broadcast (reference) and the received AV streams are employed to form the corresponding metrics. Peak signal-to-noise ratio (PSNR) and mean square error (MSE) are the most common arithmetic metrics usually employed for the objective evaluation of “processed” images and videos (i.e., after compression, enhancement/denoising, transmission, etc.) [8, 25–32, 37]:

\[ \mathrm{MSE}(n) = \frac{1}{MN}\sum_{x=1}^{M}\sum_{y=1}^{N}\big[I_R(x,y,n)-I_D(x,y,n)\big]^2, \qquad \mathrm{PSNR}(n) = 10\log_{10}\frac{I_{\max}^{2}}{\mathrm{MSE}(n)}, \tag{1} \]

where I_R and I_D are the reference and the impaired video frames for the corresponding video component (i.e., the luminance component Y, the color components Cb, Cr, etc.), (x, y) are the Cartesian image coordinates, n is the frame number, I_max is the maximum value of the 2D signals, related to the number of quantization bits (equal to 255 for an 8-bit image signal), and M and N are the horizontal and vertical image resolution, respectively [8, 32, 37].
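As an illustration of the definitions in (1), the following minimal Python sketch computes MSE and PSNR for one frame pair. It assumes 8-bit, single-component frames stored as NumPy arrays; the function and variable names are illustrative and are not part of the toolchain used in the paper.

```python
import numpy as np

def mse(ref_frame: np.ndarray, imp_frame: np.ndarray) -> float:
    """Mean square error between a reference and an impaired frame of the same shape."""
    diff = ref_frame.astype(np.float64) - imp_frame.astype(np.float64)
    return float(np.mean(diff ** 2))

def psnr(ref_frame: np.ndarray, imp_frame: np.ndarray, i_max: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB; i_max = 255 for an 8-bit signal."""
    e = mse(ref_frame, imp_frame)
    if e == 0.0:
        return float("inf")  # identical frames
    return 10.0 * np.log10(i_max ** 2 / e)

# Usage: per-frame PSNR of the luminance (Y) component
# y_ref, y_imp = ...  # M x N uint8 arrays
# print(psnr(y_ref, y_imp))
```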

Although the PSNR metric is quite acceptable for evaluating the integrity of the final signal, it is not capable of evaluating video degradations caused by structural dissimilarities. A quite simple and “computationally affordable” metric that applies to, and is widely used for, this task is the perceptually adapted structural similarity (SSIM) metric [31], which is based on the general principle that the human visual system (HVS) is highly sensitive to the structural information of the received optical stimulus. Thus, considering again the reference and the impaired signals (video frames) I_R and I_D, a simplified form of SSIM is given as follows [31]:

\[ \mathrm{SSIM}(I_R,I_D) = \frac{\big(2\mu_R\mu_D + C_1\big)\big(2\sigma_{RD} + C_2\big)}{\big(\mu_R^{2}+\mu_D^{2}+C_1\big)\big(\sigma_R^{2}+\sigma_D^{2}+C_2\big)}, \tag{2} \]

where μ_R, μ_D, σ_R, σ_D, and σ_RD are known statistics (means, standard deviations, cross-covariance) of the two frames, and C1, C2 are small positive constants introduced to prevent computational overflow and instability when the denominator terms are very close to zero.
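A corresponding sketch of the simplified (global-statistics) SSIM form in (2) is given below. Note that practical SSIM implementations usually operate on local windows, whereas this illustration uses whole-frame statistics, and the constant values shown are conventional assumptions rather than values taken from the paper.

```python
import numpy as np

def ssim_simplified(ref: np.ndarray, imp: np.ndarray,
                    c1: float = (0.01 * 255) ** 2,
                    c2: float = (0.03 * 255) ** 2) -> float:
    """Simplified SSIM using whole-frame means, variances, and cross-covariance."""
    x = ref.astype(np.float64)
    y = imp.astype(np.float64)
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov_xy = float(np.mean((x - mu_x) * (y - mu_y)))
    return ((2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)) / \
           ((mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2))
```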

Besides FR methods and their corresponding metrics (PSNR, MSE, SSIM), reduced-reference (RR) and no-reference (NR) methods are employed when the source (reference) content is only partially available, or unavailable, during the evaluation process [29–31]. These two approaches are commonly used in combination with subjective tests, which rely on MOS statistics in order to fully incorporate perceptual attributes of the HVS during evaluation. In addition, subjective tests are very useful in cases where the influence of various parameters is not predetermined or depends on the characteristics of the AV content itself (i.e., degradations at the acquisition/digitization phases, the impact of packet losses at the decompression phase, the appearance of errors and concealed errors with respect to image position and the motion activity of the content, the impact of the selected resolution/scanning on the perceived quality with regard to content type, etc.) [25–31].

To the best of our knowledge, there is insufficient guidance and related information on real-world, end-to-end implementations using state-of-the-art audiovisual HD equipment with the above characteristics, besides the research efforts and demos discussed in the previous paragraphs, which partially address some of the stated objectives. Within the scope of the current paper, statistical analysis was performed with both FR and NR methods, using standard quality metrics and testing hybrid evaluation functions on real-world broadcasts and simulated transmissions of HD streams.

3. The Proposed System Configuration

HD video and surround sound were considered the minimum viable choices for meeting the configuration requirements of the application in question. Additionally, various practical issues had to be considered so that the setup could be employed in demanding real-world applications.

3.1. Physical and Technical Issues: Application Demands and Technology Requirements

Addressing the above case requires replicating the conditions present at the actual event venue in a prospective virtual event venue. Considering the proportions involved, a suitably configured projection hall at a remote site could act as a virtual auditorium, provided that certain specifications are followed. Fortunately, the growth of high-speed Internet and digital technology during recent years presents a unique opportunity to combine all of the above in digital form, following suitable standards, and to transmit the content in real time to even the remotest audiences, satisfying the uniqueness and motivation needs of an actual cultural event. Within this frame, our research targeted the investigation of the conditions at the actual venue, the transmission infrastructure, and the virtual venue in terms of various factors, such as actual spectator perception, capturing, transport, reproduction, and so on. An overview of the architecture used is shown in Figure 1(a), while the AV capturing/reproduction setups are presented in Figures 1(b) and 1(c), respectively (detailed information about the selected architecture setups is provided in the following paragraphs).

3.1.1. Capture of the Performances at the Physical Theater: The Transmission Site

The first question that emerges in the design of such a system concerns the study of perception, that is, the experience of a spectator in a performance hall (auditorium, concert hall, opera, etc.). Given that the actual performance is held in an organized hall, we take its acoustical and visual integrity for granted. As a result, audience “sweet spots” exist and are known in each case. Thus, if one were to select a spectator position to serve as the experience reference, a suitable choice would be one of these spots, in terms of both audio and video stimuli [23, 42, 43]. Our effort targeted the transfer of the experience at a selected actual spectator position to all members of the remote audience [23].

At this point, it is useful to distinguish between the audio and video capturing properties. As far as audio is concerned, the perceived result in an actual hall is produced by the combination of the following factors: (a) the direct acoustic field, (b) the reverberant acoustic field (hall acoustics), and (c) the field created by the PA system, if any is installed [42, 43]. The reproduction system in the remote hall was specified to be a standard 5.1 surround system, which nowadays is the typical sound system installed in the majority of public projection halls (i.e., cinemas). In order to reproduce the original audio conditions effectively, it is crucial to acquire the above fields as isolated from one another as possible, so that the final 5.1 mix corresponds to the experience reference position. For that reason, two kinds of microphone setups were used concurrently: (a) a gun (shotgun) microphone array pointing at the stage for the direct field and (b) a soundfield microphone at the reference position, which holds spatial sound information by means of its four B-format (W, X, Y, Z) audio components [40–43] and is used to extract the actual spectator surround perception (Figure 1(b)).

Since the direct field arrives at the spectator with a delay proportional to the distance from the stage, the above setup is able to provide both the direct field and the reverberant/PA field at the position in question [42, 43]. According to [42, 43], such a hybrid system is capable of providing sound-field localization, virtual source positioning/panning, and signal enhancement by means of amplitude weighting and time-delay compensation, especially for small-sized sources (point-source model) or even still sources. Although these conditions were valid for most of the cultural performances dealt with in the current project (i.e., jazz, theatrical acts, recitals, etc.), we decided not to involve sound-source localization, both for practical reasons and in order to propose a universal recording layout that could be applied to any cultural show. However, we adopted the amplitude-delay weighting approach, aiming to capture and reproduce the audience experience related to the hall acoustic properties of the physical theatre. Based on the above remarks, the gun microphone array signals and the corresponding soundfield components were mixed to a 5.1 setup based on sound propagation criteria (amplitude-phase weighting), according to their positioning and the coordinates of the capturing (camera) spot [44].
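The following sketch illustrates, in simplified form, the amplitude-delay weighting idea described above: each captured signal is delayed to compensate for its propagation time to the reference spot and amplitude-weighted before being summed into a 5.1 bus. The channel mapping, speed of sound, and gain values are illustrative assumptions, not the exact mixing rules used in the production.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s, assumed room-temperature value

def delay_and_weight(signal: np.ndarray, distance_m: float, fs: int, gain: float) -> np.ndarray:
    """Delay a mono signal by its propagation time and apply an amplitude weight."""
    delay_samples = int(round(distance_m / SPEED_OF_SOUND * fs))
    delayed = np.concatenate([np.zeros(delay_samples), signal])[: len(signal)]
    return gain * delayed

def mix_to_51(sources, fs: int, n_samples: int) -> np.ndarray:
    """sources: list of (signal, distance_m, gains), where gains is a length-6
    per-channel weight vector (L, R, C, LFE, Ls, Rs); signals are assumed to
    hold at least n_samples samples."""
    bus = np.zeros((6, n_samples))
    for signal, distance_m, gains in sources:
        aligned = delay_and_weight(signal[:n_samples], distance_m, fs, 1.0)
        for ch, g in enumerate(gains):
            bus[ch] += g * aligned
    return bus
```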

As far as the encoding of the six audio channels to be transmitted is concerned, an AC3 encoder was used, creating a stream of 320 kbit/s and achieving an easily decodable, high-quality audio format for the consumer side [23]. This is the encoding used by the DVB format, and it was not examined further because of the proven quality it provides [45].

The case of video is more straightforward, as the view of a spectator in such an environment is limited by the viewing angle of the visible field and the boundaries of the stage, as seen from the selected position. The most critical point in this case is that a spectator is able to focus his/her attention on any point of the stage at any time during the performance. In order to achieve that feeling in the remote hall, it was necessary to provide the virtual spectators with a full stage view, at adequate quality and at least near-to-real dimensions. The chosen video capture device (HD camera) covered the stage area from a fixed position, in order to convey an unbiased stage aspect (without the intervention of a director) to the virtual audience [23]. The setup used in the actual venue is depicted in Figure 1(b).

The exact capturing position, that is, the distance from the stage and the height of the camera, had to be decided. Specifically, the determination of the acquisition point strongly depends on the stage’s physical dimensions (width and height). In addition, special care had to be taken with technical issues, such as the camera lens’s focal length and zoom, in order to avoid distortions due to zoom lenses. In any case, this set of parameters should be configured according to the restrictions and limitations of the theater stage and seating layout, and it is therefore quite different for each physical theater.

An alternative strategy for the live transmission of audiovisual (cultural) shows is to employ a multicamera, director-based setup for video and a 5.1 audio mix independent of the hall acoustic properties, as is currently done by the Philadelphia Orchestra in the Global Concert Series [24]. Both strategies can share identical configuration setups regarding AV broadcasting; however, there are major differences in the capturing and even the reproduction layouts, issues that affect the achieved QoE, as further discussed in Section 4.

The choice of the audiovisual framework was in fact the most crucial part of the design. The video format finally used was 1080i; however, the 720p format was also tested. The main reasons for this choice involved cost and the projection capability on the consumer side. The native interface used on the production side was the HD-SDI standard, in order to eliminate quality loss before transmission [23]. The visual content was MPEG-2 compressed, based on the fact that it is easily decoded and provides sufficient video quality in the scenario under question. At the time of the investigation, a variety of MPEG-2-enabled devices (encoders, decoders) had been on the market for a significant amount of time, which ensured the reliability of the method. Moreover, the relatively low bandwidth demands of this kind of task were able to ensure reliable transmission (given the Internet framework discussed below). The bitrate chosen was based on subjective, objective, and empirical criteria and was closely related to the nature of the transmitted material (a more thorough analysis is presented in Section 3.1.2) [5, 7, 9, 10].

Other implementation choices included decisions related to forward error correction (ProMPEG FEC), variable/constant bitrate, encoding profile, and GOP structure. Most of these parameters are under examination by the ITU [46] in relation to the processed content of past research works [33, 34], also taking into account the limitations of the transmitting networks. Based on the low-motion nature of the transmitted content, the final decisions included the use of the Main Profile at High Level (MP@HL) format of MPEG-2 encoding [23], CBR mode, and no error correction, as the least costly and most easily implemented options.

3.1.2. Reproduction at the Remote Auditorium: The Reception Site

Given the audio-recording layout already discussed, it is easy to describe the corresponding requirements at the remote auditorium. In fact, only a standard 5.1 surround system setup is required (Figure 1(c)), with slight variations depending on the dimensions of the remote auditorium and the appropriate number of loudspeakers. In any case, neutral acoustical hall behavior (low reverberation) is preferred, allowing the acoustical experience of the physical theatre to be transferred with greater fidelity. The sweet spot in this kind of environment is also defined mostly by the audio quality, superimposed on the conditions of each projection hall.

For the video part, the constraint posed is related to the projection size. For the greatest degree of realism to be achieved, the projection of the event should display the projected objects at their physical dimensions. Ideally, the size of the projection screen should match the dimensions of the physical theatre stage, but this is unlikely to be the case in remote auditoriums. Thus, the screen size should be as close as possible to that of the physical stage and, of course, should not exceed its physical dimensions. The screen size and the projection resolution affect the desirable viewing angle needed to convey the full sensation of presence, which for a wide-field video system is 80–100 degrees of arc [11, 12, 17]. In order to achieve the best audience viewing experience, the viewing distance should be greater than the shortest distance at which a person with normal vision of 1.0 is unable to recognize the pixel structure on the screen [12]. However, the sensation of telepresence decreases as the distance increases. A compromise between these two conflicting criteria, giving priority to the first, was adopted and recommended for the best viewing experience [44].
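As a rough illustration of the viewing-distance criterion, the sketch below estimates the shortest distance at which a viewer with normal (1.0) vision can no longer resolve the pixel structure, assuming an acuity limit of about one arcminute per pixel; the screen width and resolution in the example are placeholders, not the dimensions used in the actual installations.

```python
import math

def min_viewing_distance(screen_width_m: float, horizontal_pixels: int,
                         acuity_arcmin: float = 1.0) -> float:
    """Distance (m) beyond which one pixel subtends less than the acuity limit."""
    pixel_pitch = screen_width_m / horizontal_pixels       # metres per pixel
    acuity_rad = math.radians(acuity_arcmin / 60.0)        # arcmin -> radians
    return pixel_pitch / acuity_rad                        # small-angle approximation

# Example: a 5 m wide screen displaying 1920 horizontal pixels -> roughly 9 m
print(round(min_viewing_distance(5.0, 1920), 2), "m")
```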

On the reception side, the standards used for projection varied among HD-SDI, DVI, XGA, and HDMI, according to the projection equipment, owing to the versatility provided by the MPEG-2 decoding devices. Although there have been studies on the subjective evaluation of the HD format in combination with the projection screen size of flat TV sets [33, 34], there are no published works focusing on large-auditorium projection. Thorough technology/market research indicated that a DLP-technology projector, combined with electronic projection screens over five meters wide, was the most applicable choice. As a result, three alternative projection formats were examined: 720p50, 1080i25, and 1080i29.94, with the last one applied during the reproduction of the Philadelphia Orchestra transmissions [23, 24].

3.2. Transport of AV Data and Network Configuration Demands

Most of the attributes describing the transport framework can be derived from the audiovisual framework specifications. The final choices included MPEG-TS encapsulation using the RTP protocol at a combined stream bitrate of 30 Mbps. This choice was based on the fact that the MPEG-TS is capable of encapsulating a simultaneous audio stream in AC3 format, eliminating the synchronization problems that appeared in past efforts and allowing the delivery of multichannel surround sound. Finally, the incorporation of RTP introduced a degree of jitter immunity for consumers who could take advantage of this capability. The usage of ProMPEG FEC was also tested [23].
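As a back-of-the-envelope illustration of the transport load implied by a 30 Mbps MPEG-TS/RTP stream, the sketch below computes the RTP packet rate, assuming the common practice of carrying seven 188-byte TS packets per RTP datagram; this encapsulation parameter is an assumption, as the exact settings of the deployed equipment are not stated in the paper.

```python
TS_PACKET_BYTES = 188
TS_PER_RTP = 7                 # common convention, assumed here
STREAM_BITRATE_BPS = 30_000_000

payload_bits = TS_PACKET_BYTES * TS_PER_RTP * 8          # TS payload bits per RTP packet
rtp_packets_per_s = STREAM_BITRATE_BPS / payload_bits    # ignores RTP/UDP/IP header overhead
print(f"~{rtp_packets_per_s:.0f} RTP packets per second")  # roughly 2850 packets/s
```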

As previously mentioned, the present Internet attributes motivated the implementation of the effort under discussion. Following close collaboration with GRNET, the concluding setup was capable of offering our service reliably to a considerable number of Greek provincial and urban universities, thus covering a potentially large number of spectators. In order to serve as many interested consumers as possible, it was decided to multicast the transmissions through a single group [23]. The connection provided at the transmission site was established through an M-BGP-enabled switch with an end-to-end optical Gigabit Ethernet uplink to the GRNET backbone switch and copper-wire Gigabit Ethernet downlinks for the internal connections. The reception-side setup inside the university campuses relied on the already established network infrastructure, consisting of copper-wire Fast Ethernet wall sockets of certified functionality and efficiency. The links were tested for transmission/reception errors and were required to exhibit a high-quality link state, which was a relatively easy task since we were addressing locally well-organized networks (universities) and not the open public (i.e., home users). For transmission, a stand-alone IGMP-enabled MPEG-2 encoder/streamer was used, while reception required a corresponding receiver/decoder, which could consist of a stand-alone or a PC-based setup. The multicast traffic was handled in both cases by the Multicast Backbone (MBONE) infrastructure of the providers involved (GRNET and the university NOCs). A session announcement through SAP was also statically available during the transmission period, in order to encourage participation in our tests. The above setup ensured the minimization of physical-layer-related corruptions.

3.3. QoS and QoE Issues

Quality of service is a term that has mostly been used in network applications to describe data integrity during transmission, including timely-arrival demands. As already mentioned, in a video streaming/broadcasting application we may distinguish two sequential stages affecting the overall QoS: the application QoS (QoS_A) and the broadcasting-network QoS (QoS_B) [47–49]. This two-stage model has been broadly applied in related applications [47–49], and it was also adopted during the experimental phase of the current work for the broadcasting simulation setups and their statistical analysis (refer to Figure 2). Specifically, at the application level we may consider the influence of the bitrate [Mbps], input format [1080i/720p], encoding type [CBR/VBR], content motion activity [motionActivity], deinterlacing, and error-concealment strategy on QoS_A. As far as the network broadcasting level is concerned, QoS_B is influenced by the direct network packet losses and by the jitter, which produces indirect/secondary packet losses according to the routing/streaming/buffering strategy (buffer size, UDP/RTP). In general, QoS metrics (Qsm) may be configured using the already presented FR metrics PSNR/MSE (a single FR metric or a combination of them) to evaluate the video quality at each stage. Hence, given the preceding analysis, we may form simple expressions to model Qsm as a function of the application/broadcasting characteristics:

\[ \mathrm{Qsm}_A(n) = f_A(A\text{-vars}), \qquad \mathrm{Qsm}_B(n) = f_B(B\text{-vars}), \qquad \mathrm{Qsm}_{AB}(n) = f\big(\mathrm{Qsm}_A(n),\, \mathrm{Qsm}_B(n)\big), \tag{3} \]

where Qsm_AB, Qsm_A, and Qsm_B are the involved end-to-end, application, and broadcasting QoS metrics, respectively, n is the number of broadcast frames of the video sequence, and f_A, f_B, f are, in general, complicated functions aiming to relate the input independent variables (A-vars, B-vars) to the QoS estimates.

As already stated, video-stream packet losses and PSNR do not reflect linearly on the “video quality” during reproduction/projection of the broadcast AV content, since lost pixels might or might not be visible and, in general, have a different impact on the perceived QoE [29, 32, 35, 50]. In order to account for the gained experience and model the QoE, perceptual criteria related to the HVS should be considered. In the above context, the simplest rule for expressing QoE metrics (Qem) as functions of the input parameters (A-vars, B-vars) is to pass all the Qsm_A, Qsm_B, and Qsm_AB metrics through a filter that emulates the HVS behavior, so as to obtain the desired Qem_A, Qem_B, and Qem_AB. A different approach would be to deploy perceptually adapted metrics, such as SSIM, or even subjective MOS results. Thus, we may replace Qsm and Qem with the generic expression Qm, parameterizing QoS/QoE as a function of the corresponding application- and network-oriented QoS/QoE estimates:

\[ \mathrm{Qm}_{AB} = g\big(\mathrm{Qm}_A, \mathrm{Qm}_B\big) = g\big(g_A(A\text{-vars}),\ g_B(B\text{-vars})\big), \tag{4a} \]
\[ \mathrm{Qem}_{AB} = g\big(\mathrm{Qsm}_A, \mathrm{Qsm}_B, \mathrm{Qem}_A, \mathrm{Qem}_B\big) = g\big(\mathrm{PSNR}_A, \mathrm{PSNR}_B, \mathrm{SSIM}_A, \mathrm{SSIM}_B\big), \tag{4b} \]

where g, g_A, g_B are again nontrivial parametric functions controlled by the A-vars/B-vars independent variables (inputs) previously discussed. According to (4b), what is sought in the current approach is to model the QoE outcomes as a function of both the (Qsm_A, Qsm_B) and (Qem_A, Qem_B) pairs, expressed by means of PSNR and SSIM, respectively.

Let us take a closer look at (3), (4a), and (4b), trying to predict the Qm changes with respect to single-input variations. We will assume that QoS and QoE are correlated with increasing monotonicity, so that a Qsm rise (improved QoS) is reflected in a relative Qem increase (improved QoE), and vice versa. Hence, we may form simple rules to describe the influence of bitrate, format (720p/1080i), and content motion activity on Qm_A/QoS_A, and the influence of jitter on Qm_B/QoS_B. It is obvious that, as the bitrate increases, Qm_A also gets higher, because better image quality is obtained with fewer compression artifacts. Another issue connected with the bitrate is the content itself. For instance, if the AV streams feature high video motion activity, the encoding process becomes more demanding and complicated, so that video degradation worsens in order to attain the desired bandwidth. On the other hand, increasing the bitrate may cause the Qm_B metrics to decrease, since the effect of packet losses and jitter rises with the increasing network-traffic demands. Finally, it is clear that as the jitter increases the Qm_B metrics worsen, since the indirect packet-loss rates get higher. The effect of the remaining parameters is not considered here, as they are less important for the specific demands of the current application. For instance, CBR/VBR variations were not tested, on the basis that CBR is more robust and reliable for deployment over broadband networks, while deinterlacing and error-concealment options were excluded from the current study for the sake of simplicity (to avoid confusion from using too many parameters). Similarly, since direct packet losses are not very common in broadband networks, we considered only the influence of jitter. As far as routing strategies are concerned, the use of RTP is obviously superior to UDP (related experiments validated this statement), so the use of RTP was adopted as a fixed option. The above remarks can be formalized and expressed by the following relations:

\[ \frac{\partial\, \mathrm{Qm}_A}{\partial\, \mathrm{bitrate}} > 0, \qquad \frac{\partial\, \mathrm{Qm}_A}{\partial\, \mathrm{motionActivity}} < 0, \qquad \frac{\partial\, \mathrm{Qm}_B}{\partial\, \mathrm{bitrate}} < 0, \qquad \frac{\partial\, \mathrm{Qm}_B}{\partial\, \mathrm{jitter}} < 0, \]

where each partial derivative expresses the influence of one independent variable, considering that all the other input variables (A-vars, B-vars) remain unchanged. We may observe that some of the above changes have completely different impacts on the partial system responses Qm_A and Qm_B, so that no one could easily say which of the above parameters will prevail in the determination of the end-to-end metrics in a nontrivial/realistic broadcasting-configuration scenario, like the one dealt with in the current work.

Related to the motion-activity characteristics is also the use of interlaced (1080i) or progressive (720p) scanning. For instance, it would be preferable for a high-motion video sequence to be encoded in a progressive format (i.e., 720p), in order to avoid filtering out motion details due to interlaced scanning (in the case of 1080i). Although these motion artifacts are quite annoying and easily perceived by the subjects/spectators, both of the FR metrics, PSNR and SSIM, are unable to measure this limitation, because it also inherently exists in the original source material used as reference. Hence, once again, the influence of a single parameter (interlaced/progressive) has conflicting effects on the overall system behavior. It is important to mention that the use of 720p and 1080i was not intended for direct comparisons between the two formats using the previously mentioned FR metrics; rather, they were proposed as alternative source-content choices to confirm that they both follow certain rules and exhibit similar Qm dependencies on the remaining independent variables (A-vars, B-vars).

Based on the above analysis, evaluation procedures were conducted using both simulation setups and real video transmissions, according to the layout drawn in Figure 2.

4. Experimental Results and Discussion

Following the design and implementation of the system, as well as the related methodology, several applications involving the organization of transmissions and receptions were conducted. The experimental transmissions included, among others, four actual real-time transmissions from the Thessaloniki Concert Hall (three from the foyer and one from the main hall). Receiving projection venues were set up in four cities in Greece (Athens, Thessaloniki, Patras, and Heraklion) and one in the EU (Dublin, Ireland), involving a total of seven virtual halls varying from very small to medium size. The decoding devices that we encouraged the organizers to set up, and that were finally used, were PC-based decoders running the VLC media player, whereas hardware decoding was tested only by our team. Audience acceptance of this kind of activity was quite encouraging, expressed by the increasing number of spectators and the desire for the service to be continued.

We also tested the proposed methodology concerning the virtual-hall arrangement through the organization of two public projections of the Global Concert Series, also considering the organizational aspects of such an event. Several promotional actions were taken (TV promotional videos, posters, invitations, etc.) in order to stimulate and measure public interest in the subject. After the event, the spectators were asked to fill in a questionnaire in order to measure certain experience factors. The results, similar to those received from our own transmissions, are quite promising. A more exhaustive analysis of the parameter selection and the tests conducted is presented below.

4.1. Quantitative Analysis by Means of Metric-Based Evaluation

Before any NR qualitative evaluation, FR methods were also employed, aiming to evaluate video degradation by means of objective, metric-extracted quantities. This evaluation procedure was carried out only for the video content, because the audio coding/multiplexing was based on a well-tested technology (AC-3) that has been successfully implemented [45] during the last decades. However, the evaluation of the new recording layout is worthwhile, which is why audio-related subjective tests were included in the qualitative evaluation procedure during content reproduction at the remote site(s) (this issue is further analyzed in the next paragraph). In addition, more thorough subjective evaluation, in combination with quantitative analysis and the adoption/definition of appropriate audio metrics, is currently scheduled.

As far as video evaluation is concerned, uncompressed HD video content [33, 34, 51] was selected as reference material, to be compared with the received/decompressed video at the simulated remote site. In general, the evaluation procedure had to be carried out according to the following variables: (a) content type, (b) compression parameters, (c) streaming parameters, and (d) routing/QoS settings. The first variable divides content into two major categories according to the original HDTV format (720p, 1080i). Five subcategories are formed for each format type, based on the motion activity of the content, which implies the pace of the action [33, 34]. The involved “Sveriges Television AB” (SVT) reference video set has been used in the past for similar evaluations in HDTV-related applications [33, 34], an additional reason for its selection that reinforces its suitability for the demands of the current application. The 10 reference videos of different content types were edited separately for each format, and two different video clips were produced, one 720p sequence and one 1080i sequence. Each content type was used three times in each sequence, and a grey matte of 2 seconds was added between the different content types. The five different content types were ranked by us according to the motion activity they contain, graded on a scale from 1 to 5, with 1 containing the smallest motion activity.

Besides the content type, the compression parameters provide an additional variable that determines the encoding bitrate. Given the use of MPEG-2 compression, three different bitrates were involved in the simulations; these values were selected as recommended and used as reference by the IPTV Focus Group of the ITU [23, 25–29]. Other compression parameters, such as CBR versus VBR and the type and length of the GOP, were, as already stated, not examined further [23]. However, the unavoidable network-layer issues were put under investigation, in order to balance the factors of video stream bitrate/protocol versus quality under jitter conditions. By definition, jitter is the variation in the trip time from a transmitter to a receiver, leading to deterioration of the stream quality, especially in the case of synchronization-sensitive services such as multimedia applications. It is also highly dependent on the network topology of a packet-switching infrastructure. Since, in the Internet framework, the network complexity, and therefore the jitter involved, increases as the geographic distance between the venues grows, it is crucial to investigate this factor in the current context.

Regarding the streaming parameters, the protocol selection (UDP versus RTP) was the only variable involved, given that the MPEG2-TS formation was used. The superiority of RTP is rather obvious, so its use is preferable whenever possible. Nevertheless, the extra buffering memory that RTP implies is a cost proportional to the transmitted stream bitrate and cannot go unnoticed. Especially for streams of high bitrate, like the ones we transmit, the extra cost of using RTP is quite significant, and so it is considered the second, more costly choice after UDP. Since RTP actually adds a predefined immunity to a certain number of milliseconds of jitter, according to the buffer size used, the relation between the jitter effects under RTP and UDP can be expressed by the following relations:

\[ \mathrm{bitrate} = \frac{M \cdot N \cdot \mathrm{bpp} \cdot \mathrm{fps}}{\mathrm{compression\_ratio}}, \qquad \mathrm{jitter}_{\mathrm{RTP}} = \mathrm{jitter}_{\mathrm{UDP}} - \frac{\mathrm{buffer\_size}}{\mathrm{bitrate}}, \]

where M and N are the horizontal and vertical video resolution, respectively, bpp stands for bits per pixel, fps for frames per second, and compression_ratio is the ratio of the original versus the compressed stream size. Given the above, the case of UDP, which may be used for drawing general conclusions, was examined.
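The relation reconstructed above can be illustrated with the following sketch, which estimates the compressed bitrate from the video parameters and the jitter immunity (in milliseconds) provided by an RTP receive buffer of a given size. Both the reconstructed formula and the sample values are assumptions consistent with the surrounding text, not figures taken from the original equipment.

```python
def compressed_bitrate_bps(m: int, n: int, bpp: int, fps: float,
                           compression_ratio: float) -> float:
    """Compressed stream bitrate: raw pixel rate divided by the compression ratio."""
    return m * n * bpp * fps / compression_ratio

def jitter_immunity_ms(buffer_bytes: int, bitrate_bps: float) -> float:
    """Milliseconds of jitter absorbed by a receive buffer at the given bitrate."""
    return buffer_bytes * 8 / bitrate_bps * 1000.0

# Example: 1080i (1920x1080), 16 bpp (4:2:2), 25 fps, compression ratio ~27.6 (~30 Mbps)
br = compressed_bitrate_bps(1920, 1080, 16, 25, 27.6)
print(f"{br/1e6:.1f} Mbps, {jitter_immunity_ms(256_000, br):.1f} ms immunity for a 256 kB buffer")
```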

The network performance was simulated by the NETEM (NETwork EMulator) module [52], which is fully parameterizable. The network latency was set to a constant, typical value of 50 milliseconds, as it does not affect one-way transmission quality beyond adding a constant delay. The value of jitter (latency variation) was used as a control variable, with five values ranging from 0 to 0.12 milliseconds. As a result, 30 different simulations (2 resolution formats × 3 bitrates × 5 jitter values) had to be implemented, in order for all the possible combinations to be recorded. For this purpose, fifteen different hypothetical reference circuits (HRCs) were used, which are presented in Table 1.
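For reproducibility, a NETEM condition equivalent to the one described (50 ms constant latency plus a jitter term) can be applied on a Linux host with the standard tc tool; the interface name and the 0.12 ms jitter value shown here are examples, and the call is wrapped in Python only for consistency with the other sketches.

```python
import subprocess

def apply_netem(interface: str, delay_ms: float, jitter_us: int) -> None:
    """Add a netem qdisc with constant delay and normally distributed jitter."""
    subprocess.run(
        ["tc", "qdisc", "add", "dev", interface, "root", "netem",
         "delay", f"{delay_ms:g}ms", f"{jitter_us}us", "distribution", "normal"],
        check=True,
    )

# Example (requires root): emulate the 50 ms latency / 0.12 ms jitter condition
# apply_netem("eth0", 50.0, 120)
```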

One high-performance Windows-based PC, equipped with an HD-SDI video card and a high-transfer-rate striped disk array, was used as the player for the reference video clips. A Linux-based PC with the NETEM module installed was used as the network simulator, and, finally, the capture of the projected video was performed by another PC similar to the first. Encoding and decoding were performed by the standard Tandberg encoder-decoder system used in all transmissions. For compatibility reasons, the set of reference video content was converted and edited, taking special care to avoid any quality degradation during the whole pre-transmission process. The format used was uncompressed YUV 4:2:2 8-bit in an AVI file container. The coded and transmitted video signals had to be edited and converted to the same format in order for the comparisons to be made. Editing of the captured video was mandatory, since both of the recording software packages that we tried were unable to synchronize immediately with the playback system over the network connection.

The comparison and evaluation of the transmitted video signals were done with Semaca’s VQLab software [53], which can extract the PSNR and SSIM metrics of each video signal compared with a reference video. Both metrics were extracted twice: once for the end-to-end video degradation (PSNR_AB and SSIM_AB) caused by each HRC, using the originally played content videos as references, and once for the video degradation caused just by the network (Qm_B), using as reference the video produced right after the codec system. The latter is the video signal coded and transmitted over an ideal network (jitter and packet loss are zero), and it therefore coincides with the video produced by HRCs 1, 2, and 3 for the different coding bitrates.

The modeling of Qm_AB as a function of Qm_A and Qm_B is our intention, as already stated in Section 3.3. From the extracted values of the PSNR and SSIM metrics of HRCs 1, 2, and 3, Qm_A can be modeled as a function of the bitrate. Figure 3 presents the graphs of the experimental data, where their logarithmic trend can be seen. A logarithmic function is also used in [49] for standard-definition video SSIM modelling, so it may be assumed that the relation between the quality metrics and the bitrate can be described by the following equation in the high-definition case too:

\[ \mathrm{Qm}_A(\mathrm{bitrate}) = a \cdot \ln(\mathrm{bitrate}) + b. \]

By using our experimental data and the Levenberg-Marquardt algorithm for nonlinear curve fitting in LabVIEW 7.1, we calculated the coefficients of this function for each different content type and for both metrics. The two equations, one for each metric, are as follows:

\[ \mathrm{PSNR}_A(\mathrm{bitrate}) = a_{\mathrm{PSNR}} \cdot \ln(\mathrm{bitrate}) + b_{\mathrm{PSNR}}, \qquad \mathrm{SSIM}_A(\mathrm{bitrate}) = a_{\mathrm{SSIM}} \cdot \ln(\mathrm{bitrate}) + b_{\mathrm{SSIM}}. \]

The model works well for all content types, as can be seen from its mean errors and standard deviations, which are presented in Table 2. All the coefficients for each content type can be viewed in Table 3.
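The logarithmic fit can also be reproduced with standard least-squares tooling outside LabVIEW; the sketch below uses SciPy's curve_fit (which likewise applies Levenberg-Marquardt for unbounded problems) on illustrative placeholder data, not the measured values reported in Tables 2 and 3.

```python
import numpy as np
from scipy.optimize import curve_fit

def q_a_model(bitrate_mbps, a, b):
    """Logarithmic quality model for the application (encoding) subsystem."""
    return a * np.log(bitrate_mbps) + b

# Placeholder observations: (bitrate in Mbps, measured metric, e.g. PSNR_A in dB)
bitrates = np.array([8.0, 12.0, 18.0])
psnr_a   = np.array([33.1, 35.0, 36.9])   # illustrative values only

(a, b), _ = curve_fit(q_a_model, bitrates, psnr_a)
print(f"PSNR_A(bitrate) ~ {a:.2f}*ln(bitrate) + {b:.2f}")
```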

Table 3: Coefficients for the model of Qm_A.

Following the same strategy, Qm_B can be modeled from the measured PSNR and SSIM values of HRCs 6, 9, 12, and 15 as a function of jitter, using the measurements of HRC 3 as reference. Based on the logarithmic decay of the readings observed in Figure 4, the following exponential equation was initially used:

\[ \mathrm{Qm}_B(\mathrm{jitter}) = a \cdot e^{-b \,\cdot\, \mathrm{jitter}}. \]

By applying our experimental data to the previously mentioned fitting method, we calculated the coefficients of this function for each different content type and for both metrics. The two equations, one for each metric, are as follows:

\[ \mathrm{PSNR}_B(\mathrm{jitter}) = a_{\mathrm{PSNR}} \cdot e^{-b_{\mathrm{PSNR}} \,\cdot\, \mathrm{jitter}}, \qquad \mathrm{SSIM}_B(\mathrm{jitter}) = a_{\mathrm{SSIM}} \cdot e^{-b_{\mathrm{SSIM}} \,\cdot\, \mathrm{jitter}}. \]

The evaluation of the above model, through examination of the mean errors and standard deviations, showed undesirable behavior for the case of PSNR, providing concrete evidence of model-data incompatibility. However, the model proved acceptable for the case of SSIM, for both 720p and 1080i. These remarks are evident in Table 4. All the coefficients for each content type can be viewed in Table 5.

Table 5: Coefficients for the first model of Qm_B.

To overcome the instability of the above model in the case of PSNR, a linear mixture of exponential models was tested, expressed by the following equation:

\[ \mathrm{Qm}_B(\mathrm{jitter}) = a_1 \cdot e^{-b_1 \,\cdot\, \mathrm{jitter}} + a_2 \cdot e^{-b_2 \,\cdot\, \mathrm{jitter}}. \]

Thus, the resulting equations took the following form, and the fitting was based on the least-squares algorithm in the LabVIEW 7.1 environment, for each different content type and for both metrics:

\[ \mathrm{PSNR}_B(\mathrm{jitter}) = a_{1,\mathrm{PSNR}}\, e^{-b_{1,\mathrm{PSNR}}\,\mathrm{jitter}} + a_{2,\mathrm{PSNR}}\, e^{-b_{2,\mathrm{PSNR}}\,\mathrm{jitter}}, \qquad \mathrm{SSIM}_B(\mathrm{jitter}) = a_{1,\mathrm{SSIM}}\, e^{-b_{1,\mathrm{SSIM}}\,\mathrm{jitter}} + a_{2,\mathrm{SSIM}}\, e^{-b_{2,\mathrm{SSIM}}\,\mathrm{jitter}}. \]

The above model presented acceptable behavior for the SSIM case, as well as for the PSNR case of 1080i, but failed to adequately model the PSNR case of 720p, a fact that is evident from the mean errors and standard deviations of the second model fitting in Table 6. All the coefficients for each content type can be viewed in Tables 7 and 8.
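The two-term exponential mixture can be fitted in the same way; the sketch below again uses placeholder jitter/metric pairs (the actual measurements are those behind Tables 6-8), and rough initial guesses are supplied because a sum of exponentials is sensitive to the starting point.

```python
import numpy as np
from scipy.optimize import curve_fit

def q_b_mixture(jitter_ms, a1, b1, a2, b2):
    """Linear mixture of two decaying exponentials for the network subsystem."""
    return a1 * np.exp(-b1 * jitter_ms) + a2 * np.exp(-b2 * jitter_ms)

# Placeholder observations (jitter in ms, metric value), illustrative only
jitter = np.array([0.0, 0.03, 0.06, 0.09, 0.12])
metric = np.array([36.9, 34.0, 31.8, 30.2, 29.1])

p0 = [20.0, 5.0, 17.0, 0.5]                      # rough initial guesses
params, _ = curve_fit(q_b_mixture, jitter, metric, p0=p0, maxfev=10000)
print("a1, b1, a2, b2 =", np.round(params, 3))
```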

Table 7: Coefficients for the second model of Qm_B.
Table 8: Coefficients for the second model of Qm_B.

Finally, as far as the end-to-end model is concerned, a function combining the results of subsystems A and B, as defined in (4a) and (4b), is needed. From (2), it can be shown that the values of SSIM range from 0 to 1, according to the similarity of the original and the processed video frame. Moreover, subsystems A and B are connected in cascade, resulting in additive deterioration of the image quality according to the behavior of each. Based on the above remarks, the proposed model of the end-to-end system for the SSIM metric was defined by the following function:

\[ \mathrm{SSIM}_{AB} = \mathrm{SSIM}_A \cdot \mathrm{SSIM}_B. \tag{13} \]

On the other hand, the PSNR metric is a logarithmic measure expressed in dB, as presented in (1). Therefore, the final result of the end-to-end metric is in fact the superposition of the partial subsystem metrics, bounded by the minimum partial value. This is similar to the superposition applied to uncorrelated sound sources for the calculation of the equivalent sound pressure level [54]:

\[ L_{\mathrm{eq}} = 10 \log_{10} \sum_{i} 10^{L_i/10}. \tag{14} \]

In our case, the calculation formula based on (14) is transformed to the following:

\[ \mathrm{PSNR}_{AB} = -10 \log_{10}\!\left( 10^{-\mathrm{PSNR}_A/10} + 10^{-\mathrm{PSNR}_B/10} \right). \tag{15} \]

Functions (13) and (15) were applied to the observations of all the HRCs defined in Table 1, for 720p, 1080i, and the combination of the two, and the results are summarized in Table 9.
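The end-to-end combination rules (13) and (15) translate directly into code; the sketch below implements the reconstructed forms (multiplicative for SSIM, power-domain superposition for PSNR), under the assumption that this reconstruction matches the authors' intent, with the example values being illustrative only.

```python
import math

def ssim_end_to_end(ssim_a: float, ssim_b: float) -> float:
    """Eq. (13): cascaded subsystems combine multiplicatively for SSIM in [0, 1]."""
    return ssim_a * ssim_b

def psnr_end_to_end(psnr_a_db: float, psnr_b_db: float) -> float:
    """Eq. (15): dB-domain superposition; the result never exceeds the worse subsystem."""
    return -10.0 * math.log10(10 ** (-psnr_a_db / 10.0) + 10 ** (-psnr_b_db / 10.0))

# Example: encoding stage at 36.9 dB / 0.97 SSIM, network stage at 31.5 dB / 0.93 SSIM
print(psnr_end_to_end(36.9, 31.5), ssim_end_to_end(0.97, 0.93))
```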

4.2. Qualitative Evaluation

After examining the whole set of system parameters that can define its performance, and taking all the restrictions into account, a core system configuration was chosen, which was to undergo only minor changes. This configuration was based on the conclusions of other relevant research, as well as on economic and practical considerations. The use of technologies that had already been tested and evaluated, either separately or in other contexts, was decisive in proceeding to real-world experiments. This provided the opportunity to evaluate the system’s performance by transmitting or receiving real-world events.

As already stated, we carried out three transmissions and two receptions. From the first three transmissions, useful conclusions were drawn regarding the potential of the system. Alternative formats, recording techniques, and equipment were tested; the results were promising regarding the feasibility of the system, and there was also interest from the potential audience wherever the projection took place in a big auditorium (University of Patras). The audience accepted the projections of the events with enthusiasm, and the effect of high-definition video and surround sound was noticeable and positively evaluated by everyone. The use of just one camera was also evaluated positively, and in any case was not found monotonous, despite the fact that the size of the screen was smaller than the recommended one.

The reception of two transmissions from the Global Concert Series by the Philadelphia Orchestra gave us the chance to evaluate the system’s performance once more, and to compare a different approach at a fully controlled projection site, through a survey conducted on a larger group of people. Although the single-camera approach was not used in these transmissions, the rest of the transmission system was almost identical to ours, so the survey results are useful for the evaluation.

The network over which the transmission took place provided a high bandwidth of 100 Mbps and a very high QoS. The encoder’s bitrate was therefore 18 Mbps, the transmission format was 1080i25 with the MP@HL (4:2:0) profile, and the packet-formation protocol was UDP. The other parameters, for the reasons already mentioned, were constant bitrate (CBR), an IBBP GOP structure, and a relatively short GOP length of 12 frames. The encoder used was a Tandberg E5280 with an HD expansion module; the decoder was mainly a Tandberg T1228, with VLC software installed on a PC as a secondary option, and a DLP-technology projector was used.

With this configuration, in the cases of one transmission and one reception, 56 people participated in the survey and completed questionnaires. The subjects were not experts and were randomly selected. They were given the questionnaires before the projection or during the break and completed them at the end. The instructions were to read the questionnaire in advance, in order to ensure a worst-case scenario: after reading the questions, the subjects concentrated and paid attention to even minor quality degradations. The subjects were asked, among other things, to characterize the overall quality of the video and the audio separately, using a five-grade quality scale, and also to evaluate the quality degradations. The scale used is similar to the one proposed by the ITU for single stimulus continuous quality evaluation (SSCQE) [32–34, 55]. Moreover, they were asked about the overall experience of watching an event in this way and how realistic they found the projection. Other questions asked whether they would like to watch another event, what their motive for watching was, and what improvements they would suggest. During validation, three subjects were rejected for inconsistent answers or incomplete questionnaires.

The results presented in Figure 5 show that, in both cases (the transmission from Thessaloniki and the one from Philadelphia), more than 80% of the subjects evaluated the video quality as “very good” or “excellent,” and more than 70% rated the sound quality likewise. Most of the subjects, more than 60%, evaluated the degradations and impairments of the video and audio as “perceptible, but not annoying” or “imperceptible.” Lastly, the overall experience of the event was evaluated as “very realistic” or “interesting” by more than 70%.

The results are obviously influenced by a “halo effect,” which can be explained by the fact that the audience was watching a transmission like this for the first time. However, the percentages of positive answers were too high to be caused by that effect alone. Another interesting finding was that, comparing the answers of the subjects in the first survey (transmission) with those in the second survey (reception), the frequency of “realistic” ratings of the overall experience was higher in the first. The small sample sizes of both surveys result in wide confidence intervals for the percentages of “realistic” answers, so we can only assume a trend for the audience to evaluate the single-camera approach as more realistic. This can be claimed because all the other parameters were identical in both cases and the respective percentages differ considerably in favor of the first transmission, and consequently of the single-camera approach. In order to prove the validity of this statement, a more elaborate survey is necessary.

Figure 6 represents the spatial distribution of the subjects’ answers in the case of the second survey. The respective results of the first survey cannot be evaluated because of the size of the sample. A1 is located in front of the left part of the projection screen, facing the audience. The distribution of the subjects and their evaluation of several parameters can be observed. It is obvious from Figure 6(c) that the subjects seated near the minimum viewing distance evaluated the video quality better than those in front of or behind them. This figure, in combination with Figure 6(e), shows that when sitting closer than the minimum viewing distance, the viewing experience degrades and the video errors and impairments become more perceptible. Figure 6(d) reveals that, because of the poor acoustic performance of the remote auditorium, the listening experience was better close to the loudspeakers, which were beside the screen and at the back of the side walls.

4.3. Conclusion and Further Work

The purpose of the current work was to implement and evaluate HDTV-over-IP technologies in real-world multimedia-broadcasting applications, for the live transmission of cultural events via broadband networks (such as the Internet2 and GEANT backbones). Various evaluation procedures were conducted, in combination with network simulations and different configuration setups, before the finally selected architecture was decided upon and deployed. The soundness of the current work stems from the fact that similar experimentation procedures have not previously been considered for specific “enhanced reality” applications, such as e-learning, teleworking, telecollaboration, and others. The proposed system was enthusiastically accepted by the audience, proving to be feasible and reliable. The adopted methodology of using a single still camera, in combination with a large projection screen and theater-adapted surround sound, seems to provide increased realism. The above scenario, combined with the optimization tests we conducted, resulted in the specification of minimum requirements (bitrate, jitter) for such a task. Specifically, CBR encoding at 18 Mbps over UDP, under jitter conditions of 0.10 milliseconds, proved to be a minimal choice for high-quality transmission, although further experimentation could lead to better utilization of network resources and increased tolerance to QoS variations. For instance, the use of additional combinations of VBR, ProMPEG FEC for today’s MPEG-2, different compression formats (MPEG-4, WMV, etc.), various types and lengths of GOP, and lower/higher bitrates is under consideration. The potential for full interactivity of the system is another open issue.

Furthermore, a more elaborate investigation of transmission modeling was made, based on simulations of hypothetical reference circuits (HRCs). The creation of a conditional model setup showed the feasibility of end-to-end performance estimation from the distinct properties of two subsystems, regarding the encoding process (logarithmic curve fitting) and the transmission process (exponential-mixture curve fitting), respectively. The model presented acceptable performance for all cases except the network-subsystem PSNR modelling of 720p, which calls for further investigation. A qualitative evaluation of the applied system was also presented, confirming the assumptions made during the design process, mostly regarding the physical aspects of the project. Evolutions of the present model could include the incorporation of such subjective tests and perceptually adapted metrics into the QoS/QoE performance definition of a system, the extension of the subsystem parameterization properties, as well as audio performance estimation.

In any case, we may conclude that the impact of broadband networks on digital multimedia broadcasting applications like the one described will open a new era of cultural and educational prospects worldwide.

Acknowledgments

The authors would like to acknowledge the valuable collaboration of the Philadelphia Orchestra Global Concert Series project team, as well as the contribution of sound engineer and Ph.D. candidate K. Kontos during the development phases of the project.