Abstract

This paper presents a hybrid (geometry- and image-based) framework suitable for providing photorealistic walkthroughs of large, complex outdoor scenes at interactive frame rates. To this end, based just on a sparse set of real stereoscopic views of the scene, a set of morphable 3D-mosaics is automatically constructed first, and then, during rendering, a continuous morphing between those 3D-mosaics that are nearby to the current viewpoint takes place. The morphing is both photometric and geometric, and we also ensure that it proceeds in a physically valid manner, thus remaining transparent to the user. The effectiveness of our framework has been demonstrated in the 3D visual reconstruction of the Samaria Gorge in Crete, which is one of the largest and most beautiful gorges in Europe.

1. Introduction

One of the main research problems in computer graphics is the creation of modeling and rendering systems capable of providing photorealistic representations of complex, real-world environments. Minimal human intervention during the modeling process and real-time operation during rendering, a property that allows walkthroughs at interactive frame rates, are some of the desirable characteristics of such systems. Two approaches have been the most prominent in this regard so far. The first, more traditional, way of building this type of system is to take a purely geometric approach. In this case, a full 3D geometric model of the whole scene is constructed first, and that model is then used for rendering the scene afterwards. One big advantage of this approach is that it offers great flexibility with respect to manipulating or editing the scene's properties. For example, since a full 3D model is available, operations such as altering the viewpoint of the camera, changing the position of light sources, modifying part of the geometry, and so forth can be attained very easily. The price for this flexibility, however, lies in having to construct the underlying 3D model of the scene in the first place. Unless done automatically, this can be an extremely time-consuming and daunting task, especially for large-scale scenes. Due to this fact, a lot of effort has recently gone into obtaining such a geometric model automatically by using only real images as input [1, 2]. This task is now known as multiple view geometry in the computer vision literature, and has been a very active research topic over recent years [3–5]. Characteristic methods in this regard include the work of Pollefeys et al. [6] on 3D reconstruction from hand-held cameras, as well as the work of Debevec et al. [7] on recovering the geometry of architectural-type scenes, just to name a few. For scenes with irregular geometry, however, such as outdoor scenes or scenes with trees and plants, extracting an accurate 3D model still remains a very difficult task in the general case.

Relatively recently, however, researchers realized that geometry is often unnecessary for obtaining a realistic representation of the scene [8]. As a result, the so-called image-based rendering (IBR) methods have emerged. These methods completely skip the geometric modeling of the scene and, instead, concentrate just on how to fill the image pixels corresponding to a novel view. In this case, geometric modeling is replaced by simply capturing a dense set of real images, with each one of these images associated with a different view of the scene. When given a novel view, these methods then try to synthesize it simply by (appropriately) resampling the previously captured set of views. Intuitively, IBR methods can be thought of as trying to assemble a puzzle. The puzzle to be assembled is the novel view itself, while the pieces of the puzzle correspond to the pixels contained in the captured views. Since the introduction of image-based rendering, many such methods have been proposed (see [9] for an excellent survey), with the Light Field [10] and the Lumigraph [11] being two of the most characteristic examples. The biggest advantage of all IBR techniques is the extremely high level of photorealism they can attain (since they make use of actual images of the scene). The price for this, however, is the huge amount of data (i.e., input images) often required by IBR methods, which actually forms one of their main disadvantages. To see why this needs to be so in general, it is worth considering another interpretation of IBR techniques. According to it, these methods try to reconstruct the so-called plenoptic function [12]. This is a 7-dimensional function P(V_x, V_y, V_z, θ, φ, λ, t), which models a 3D dynamic environment by recording the light rays arriving at every space location (V_x, V_y, V_z), from every possible direction (θ, φ), over any range of wavelengths λ and at any time t. Each time we capture an image with a camera, the light rays passing through the camera's center of projection are recorded, and so that image can be considered as a specific sample of the plenoptic function. In this context, image-based rendering can be interpreted as a signal processing task, that is, that of reconstructing a continuous signal (in this case the plenoptic function) based only on a discrete set of samples from that signal. Given the high dimensionality of the plenoptic function, the reason why a huge number of input images is, in general, required by IBR techniques then follows directly from the well-known Nyquist theorem.

This issue with the huge amount of input data is also one of the reasons why the great majority of existing IBR techniques apply only to scenes of small or medium scale. By contrast, far fewer IBR methods have been proposed for the case of large-scale scenes. Motivated by this fact, the work presented in this article describes a novel, hybrid (i.e., both geometry-based and image-based) system that can deal efficiently with the visual reconstruction of large-scale, natural outdoor environments. We note that this reconstruction is always photorealistic, while its visualization can take place at interactive frame rates, that is, in real time. For reducing the otherwise huge amount of input data, our system makes use of a new compact data representation of a 3D scene, called morphable 3D-mosaics [13, 14]. As we will see, this scene representation consists of a set of morphable (both geometrically and photometrically) 3D models. Furthermore, these morphable models are constructed automatically based just on a sparse set of real stereoscopic views. Due to the above-mentioned properties, our system can be extremely useful in applications such as 3DTV, gaming, virtual reality, and so on, where one seeks to create a 3D depth impression of the observed scene. In fact, the latter has been one of the starting motivations for our system, which has thus already been used as part of a 3D virtual reality installation in the Natural History Museum of Crete with the goal of providing a lifelike virtual tour of the Samaria Gorge (one of the largest and most magnificent gorges in Europe).

Besides the proposal of a novel hybrid representation for a 3D scene, our system also includes the following new algorithms and techniques as part of its image-based modeling and rendering pipeline: (1) a robust photometric morphing procedure that is based on automatically extracting a dense field of 2D correspondences between wide-baseline images (Section 5.1), (2) an efficient geometric morphing algorithm based on estimating 3D correspondences between local geometric models (Section 5.2), (3) a robust algorithm for constructing 3D panoramas by combining local 3D models (Section 7) and (4) a highly optimized rendering pipeline that runs entirely on the GPU (thanks to employing pixel and vertex shaders for the morphing), while it also uses decimated, but visually correct, 3D geometric models (Section 6).

Of course, besides our system, other IBR techniques for large-scale scenes exist as well. In particular, work on unstructured/sparse lumigraphs has been proposed by various authors. One such example is the work of Buehler et al. [15]. These authors make use of a fixed geometric proxy that supposedly describes the global geometry of the scene in order to reduce the required amount of input data. However, this assumption of a fixed geometric proxy is not adequate in our case. Hence, contrary to [15], our system actually uses view-dependent geometry due to the continuous geometric morphing taking place. Another example of a sparse lumigraph is the work of Schirmacher et al. [16]. Although multiple geometric proxies are allowed in this work, any possible inconsistencies existing between these proxies (e.g., due to errors) are not taken into account during rendering. This is again in contrast to our work, where an “optical flow” estimation between wide-baseline images is used for dealing with this issue. Furthermore, this estimation significantly reduces the required number of input views (and hence the amount of input data). For these reasons, if either of the above two approaches were to be applied to large-scale scenes like those handled in our case, many more images (than ours) would be needed. In addition, due to our rendering path, which can be highly optimized on modern graphics hardware, we can achieve very high frame rates during rendering, whereas the corresponding frame rates listed, for example, in [16] are much lower due to an expensive barycentric coordinate computation that needs to take place. In the “interactive visual tours” approach [17], video (from multiple cameras) is recorded as one moves along predefined paths inside a real-world environment, and then image-based rendering techniques are used for replaying the tour and allowing the user to move along those paths. In this way, virtual walkthroughs of large scenes can be generated. However, a specialized acquisition system is needed in this case, whereas our method requires just off-the-shelf digital cameras. Also, in the “sea of images” approach [18], a set of omnidirectional images is captured for creating interactive walkthroughs of large, indoor environments. However, this set of images needs to be very dense, with an image spacing of about 1.5 inches. Finally, the authors in [19] have proposed an IBR system for providing constrained interactive walkthroughs inside large and complex virtual environments.

2. Overview of the Morphable 3D-Mosaics Framework

As already mentioned, our system is capable of providing photorealistic walkthroughs of large-scale scenes at interactive frame rates. Assuming, for simplicity, that during the walkthrough the user motion takes place along a predefined path inside the environment (we note, of course, that our system can be readily extended to handle the case where the stereoscopic views have been captured not just along a predefined path, but throughout the scene), the input to our system is then a sparse set of stereoscopic views captured at certain locations (called key-positions hereafter) along that path (see Figure 1). Considering, initially, the case where only one view per key-position exists, a series of local 3D models is then constructed based on these stereoscopic views. There exists one model per view, and each local model is assumed to approximately capture the photometric and geometric properties of the scene only at a local level. Then, instead of trying to create a global 3D model out of all these local models (a task that can prove to be extremely difficult in many cases, requiring a very accurate registration between local models), we rather follow a different approach, which is that of creating a series of so-called morphable 3D-models. The key idea then is that, during the transition between any two successive key-positions (say P_i, P_{i+1}) along the path, one of these morphable 3D-models, say M_i, is displayed by the rendering process (see Figure 1). At position P_i, this model coincides with the local model at that position, while, as we approach P_{i+1}, it is gradually transformed into the next local model, coinciding with that model upon reaching key-position P_{i+1}. Therefore, during the rendering process, and as the user traverses the predefined path, a continuous morphing between successive local 3D models takes place all the time. It is important to note that this morphing between local models is both photometric and geometric. Moreover, due to the way the morphable models are constructed, we ensure that the morphing always proceeds in a physically valid manner, thus remaining transparent to the user of the system. For this purpose, algorithms for extracting 2D correspondences between wide-baseline images, as well as algorithms for establishing 3D correspondences between local geometric models, are proposed and implemented. These are used for estimating the photometric and geometric morphings, respectively.

Our system can be also extended to handle the existence of multiple stereoscopic views per key position, all of them related by a pure rotation of the stereoscopic camera. In this case, there will also be multiple local models per key-position. Therefore, before applying the morphing procedure, a so-called 3D-mosaic per key-position needs to be constructed as well. Each 3D-mosaic will simply comprise the multiple local models at the corresponding key-position and, intuitively, will itself be a larger local model covering a wider field of view. Morphing can then proceed in the same way as before, with the only difference being that these 3D-mosaics will be the new local 3D models to be used during the stage of morphing (in place of the smaller individual ones). Hence, in this case, a morphable 3D mosaic (instead of a morphable 3D model) is said to be created by the algorithm.

Based on the above discussion, it follows that the 3D modeling pipeline of our framework consists of the stages listed below.

(i) Local models construction: this construction is accomplished based on the stereoscopic views that have been captured at key-positions along the path.

(ii) Relative pose estimation: a coarse estimation of the relative rotation and translation between successive local models takes place during this stage.

(iii) Generation of morphable models: this task amounts to estimating the photometric as well as the geometric morphings between successive local models.

(iv) 3D-mosaics generation: during this stage, a 3D-mosaic is generated at each key-position by aggregating all local models corresponding to that position.

In the sections that follow, each one of the above tasks will be described briefly, while the various algorithmic choices that have been made in each case will be justified and explained.

3. Local 3D Models Construction

As already mentioned in the previous section, a set of local 3D models needs to be extracted during the first stage of the 3D modeling pipeline. There will be one local 3D model per stereoscopic view (and hence per key-position as well, since we initially assume there is a one-to-one correspondence between key-positions and stereoscopic views). Such a 3D model is supposed to provide both a photometric and a geometric representation of the scene, but only at a local level (i.e., only near the corresponding key-position along the path). To this end, a stereo matching procedure is applied between the left and right images of the stereoscopic view, so that pixel correspondences between the two images are extracted [20]. Based on these correspondences (and assuming a calibrated pair of cameras), a 3D reconstruction can then take place via triangulation (i.e., by intersecting in 3D space the camera rays passing through corresponding pixels). In this manner, the 2D maps X, Y, and Z are produced, which contain, respectively, the x, y, and z coordinates of the reconstructed 3D point for each pixel of the left image (e.g., see the depth map in Figure 2). Without loss of generality, we will assume that these geometric maps are defined only over a region Ω on the image plane, and this region will be referred to as their domain hereafter.

These maps will form the local geometric representation of the scene, whereas the left image will be used as the corresponding photometric representation (i.e., the photometric map I). The set {X, Y, Z, I} will thus make up what we will hereafter refer to as a local model, and this term will always implicitly refer to such a set of elements. As its name suggests, such a model can be used for reproducing a photorealistic reconstruction of the real scene, but only in a local region around the model's key-position (e.g., see the right image in Figure 2). In fact, it is only in this local region of 3D space that the model will be used during rendering, which is exactly why the geometric maps of a local model do not have to provide a very accurate representation of the whole scene's geometry.
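As an illustration, for a calibrated and rectified stereo pair the triangulation step reduces to simple closed-form expressions. The following sketch (Python/NumPy; the parameter names and the dictionary layout of the local model are illustrative assumptions, not the system's actual data structures) shows how the maps X, Y, Z could be filled from a dense disparity field, whose estimation is described next.

import numpy as np

def local_model_from_disparity(left_image, disparity, f, cx, cy, baseline):
    """Triangulate a dense disparity map of a rectified, calibrated stereo pair
    into the geometric maps X, Y, Z of a local model; the left image serves as
    the photometric map I.  f: focal length (pixels), (cx, cy): principal
    point, baseline: camera separation."""
    h, w = disparity.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))   # pixel coordinates

    valid = disparity > 0                            # domain Omega of the model
    Z = np.full((h, w), np.nan)
    Z[valid] = f * baseline / disparity[valid]       # depth from disparity
    X = (u - cx) / f * Z                             # back-projected x coordinate
    Y = (v - cy) / f * Z                             # back-projected y coordinate

    return {"X": X, "Y": Y, "Z": Z, "I": left_image, "domain": valid}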

For solving the stereo matching problem, we estimate a discrete disparity field by minimizing the energy of a 2nd-order Markov random field. The nodes of this MRF are the pixels of the left image, while the labels belong to a discrete set of disparities L = {0, 1, ..., d_max}, where d_max represents the maximum allowed disparity. We then seek to assign a label f_p to each node p so that the following MRF energy is minimized:

\[ E(f) = \sum_{p} V_p(f_p) + \sum_{(p,q)\in N} V_{pq}(f_p, f_q), \]

where the symbol N denotes the set of interacting pairs of pixels on the image grid (a 4-neighborhood system is assumed).

The single-node potential for assigning disparity f_p to pixel p is set to a confidence-weighted matching cost, V_p(f_p) = w_p · c(p, f_p), where c(p, f_p) measures the intensity dissimilarity between pixel p in the left image and its candidate match (at disparity f_p) in the right image. The factor w_p expresses a confidence measure and is used for giving less weight to pixels that are less reliable for matching (e.g., because their neighborhood in the image has uniform intensity). Its value is set based on the minimum eigenvalue of the following autocorrelation matrix A(p):

\[ A(p) = \begin{pmatrix} \sum I_x^2 & \sum I_x I_y \\ \sum I_x I_y & \sum I_y^2 \end{pmatrix}. \]

Here, (I_x, I_y) represents the left-image gradient, while the sums are taken over a small window around the point p in the left image. The minimum eigenvalue is a measure of the “cornerness” of point p: it will be large in regions with high texture variation, but small in uniform regions. Regarding the pairwise potentials V_pq, these are set equal to the following truncated quadratic function: V_pq(f_p, f_q) = min(|f_p − f_q|^2, V_max), where V_max denotes the maximum allowed discontinuity penalty that can be imposed. For optimizing the energy of the above discrete MRF, the recently proposed primal-dual algorithms of Komodakis et al. [21, 22] have been used, which can guarantee an approximately optimal solution for a very wide range of MRFs. For turning the resulting discrete disparity field into a continuous one (which yields a local 3D model that is better for visualization purposes), a continuous optimization scheme (e.g., a simple gradient descent) can be applied afterwards as well.
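To make the role of the confidence weighting and of the truncated quadratic term concrete, the sketch below (Python, with NumPy and SciPy) shows one plausible way of assembling the MRF costs; the exact dissimilarity measure, window size, and penalty values are illustrative assumptions, and the resulting cost volume would then be handed to a discrete MRF optimizer such as the primal-dual scheme of [21, 22], which is not reproduced here.

import numpy as np
from scipy.ndimage import uniform_filter

def confidence_map(left, window=5, eps=1e-3):
    """Per-pixel confidence from the minimum eigenvalue of the local
    autocorrelation (structure tensor) matrix of the left image."""
    Iy, Ix = np.gradient(left.astype(np.float64))
    Sxx = uniform_filter(Ix * Ix, window)
    Syy = uniform_filter(Iy * Iy, window)
    Sxy = uniform_filter(Ix * Iy, window)
    tr, det = Sxx + Syy, Sxx * Syy - Sxy ** 2
    lam_min = tr / 2 - np.sqrt(np.maximum(tr ** 2 / 4 - det, 0))  # min eigenvalue
    return lam_min / (lam_min.max() + eps)

def unary_costs(left, right, d_max):
    """Cost volume V_p(d) = w_p * |I_L(x, y) - I_R(x - d, y)| (illustrative)."""
    h, w = left.shape
    w_p = confidence_map(left)
    costs = np.zeros((d_max + 1, h, w))
    for d in range(d_max + 1):
        shifted = np.roll(right.astype(np.float64), d, axis=1)  # I_R(x - d, y)
        shifted[:, :d] = shifted[:, d:d + 1]                    # crude border handling
        costs[d] = w_p * np.abs(left - shifted)
    return costs

def pairwise_cost(d_p, d_q, v_max=4.0):
    """Truncated quadratic smoothness term min(|d_p - d_q|^2, V_max)."""
    return min((d_p - d_q) ** 2, v_max)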

4. Relative Pose Estimation between Successive Local Models

Given any two successive local models L_i, L_{i+1} along the path, a rough estimation of their relative 3D pose (i.e., their relative 3D rotation and translation) needs to take place as well. Note that, similarly to the estimation of the models' geometry, only a very coarse and approximate estimation of the 3D pose is needed. Again, the reason is that this pose is not going to be used for performing an exact registration into a global 3D coordinate system, but merely for facilitating the morphing procedure that will take place later (i.e., during the next stage). One way of obtaining such a rough estimate is by, for example, computing a sparse set of approximate pixel correspondences between the photometric images I_i, I_{i+1} of models L_i, L_{i+1}. Based on these correspondences, the so-called fundamental matrix (which relates points in images I_i, I_{i+1}) can be computed, from which one can easily extract an approximate estimation of the relative 3D pose [3].

Therefore, the pose estimation problem is reduced to that of extracting a sparse set of correspondences between I_i and I_{i+1}. A usual method for tackling the latter problem is the following: first, a set of interest points of I_i is extracted (using an interest-point detector). Then, for each interest point, say p, a set of candidate points inside a large rectangular region of I_{i+1} is examined and the best one is selected according to a similarity measure. Usually, the candidate points are extracted by applying an interest-point detector to that region as well.

However, unlike the left/right images of a stereoscopic view, I_i and I_{i+1} are separated by a wide baseline. Simple measures like correlation have proved extremely ineffective in such cases. Assuming a smooth predefined path (and therefore a smooth change in orientation between P_i and P_{i+1}), it is safe to assume that the main difference in an object's appearance between images I_i and I_{i+1} comes from the forward camera motion along the optical axis (looming). The idea for extracting valid correspondences is then based on the following observation: the dominant effect of an object being closer to the camera in image I_{i+1} is that its image region in I_{i+1} appears scaled by a certain scale factor s, that is, if p, p' are corresponding pixels, a neighborhood of p' in I_{i+1} looks approximately like the corresponding neighborhood of p in I_i magnified by s. So an image patch of I_i at p should look similar to an image patch of an appropriately rescaled (by 1/s) version of I_{i+1}.

Of course, the scale factor varies across the image. Therefore, the following strategy can be applied for extracting reliable matches.

(1) Quantize the scale space into a discrete set of values S = {s_1, ..., s_N}.

(2) Rescale I_{i+1} by the inverse scale 1/s_j for all s_j in S to get the rescaled images I_{i+1}^{s_j}. For any point q and scale s_j, let us denote by patch^{s_j}(q) a (small) fixed-size patch around the projection of q on I_{i+1}^{s_j}, and by patch(p) an equal-size patch of I_i at p.

(3) Given any point p and its set of candidate points C(p) in I_{i+1}, use correlation to find, among the patches at any q in C(p) and across any scale s_j in S, the one most similar to the patch of I_i at p:

\[ (\hat q, \hat s) = \arg\max_{q \in C(p),\; s_j \in S} \; \mathrm{corr}\big(\mathrm{patch}(p), \mathrm{patch}^{s_j}(q)\big). \]

This way, apart from a matching point, a scale estimate is provided for point p as well.

The above strategy has proved very effective, giving a high percentage of exact matches even in cases with very large looming. Such an example can be seen in Figure 3.
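A minimal sketch of this multi-scale correlation search is given below (Python, using NumPy and OpenCV for the rescaling); the patch size, the particular scale set, and the use of normalized cross-correlation are illustrative choices, and the candidate set C(p) is assumed to come from an interest-point detector as described above.

import numpy as np
import cv2

def best_match_across_scales(I1, I2, p, candidates,
                             scales=(1.0, 1.25, 1.5, 2.0), half=7):
    """For an interest point p = (x, y) of I1 and candidate points of I2,
    return the candidate and scale whose (rescaled) patch correlates best
    with the patch of I1 at p."""
    def patch(img, x, y):
        x, y = int(round(x)), int(round(y))
        if (x - half < 0 or y - half < 0 or
                x + half + 1 > img.shape[1] or y + half + 1 > img.shape[0]):
            return None
        return img[y - half:y + half + 1, x - half:x + half + 1].astype(np.float64)

    ref = patch(I1, *p)
    best = (None, None, -np.inf)                        # (q_hat, s_hat, score)
    for s in scales:
        # step (2): rescale I2 by the inverse scale 1/s
        I2s = cv2.resize(I2, None, fx=1.0 / s, fy=1.0 / s,
                         interpolation=cv2.INTER_AREA)
        for q in candidates:
            cand = patch(I2s, q[0] / s, q[1] / s)       # projection of q on I2s
            if ref is None or cand is None:
                continue
            a, b = ref - ref.mean(), cand - cand.mean()
            denom = np.sqrt((a * a).sum() * (b * b).sum()) + 1e-9
            score = (a * b).sum() / denom               # normalized correlation
            if score > best[2]:
                best = (q, s, score)
    return best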

5. Estimation of Morphable 3D Models

The output of the previous stage will thus be a series of approximate local 3D models (along with approximate estimates of the relative pose between every two successive ones). Rather than trying to create a consistent global model by combining all local ones (a rather tedious task, requiring among other things high-quality geometry and pose estimation), we will instead follow a different approach, which is based on the following observation. Let L_i, L_{i+1} be any two successive local models along the path and let P_i, P_{i+1} be their corresponding key-positions. Near position P_i, the model L_i is ideal for representing the surrounding scene. However, as we move forward along the path and approach the key-position P_{i+1} of the next model L_{i+1}, the photometric and geometric properties of the environment are much better captured by that model. Compare, for example, the fine details of the rocks that are revealed in image I_{i+1} of Figure 4 (i.e., the photometric image of the next local model), but are not visible in image I_i of that figure (i.e., the photometric image of the previous local model). Hence, during the transition from P_i to P_{i+1}, we will try to gradually morph model L_i into a new destination model, which should coincide with L_{i+1} upon reaching position P_{i+1}. (In fact, only part of this destination model can coincide with L_{i+1} since, in general, models L_i, L_{i+1} will not represent exactly the same part of the scene.) This morphing should be geometric as well as photometric (the latter wherever possible), and should proceed in a physically valid way. For this reason, we will use what we call a morphable 3D-model:

\[ M_i = \{\, X_i, Y_i, Z_i, I_i,\; X_d, Y_d, Z_d, I_d \,\}. \]

As can be seen, in addition to including all the geometric and photometric maps of model L_i (called the source maps hereafter), M_i also consists of some additional destination maps: these are the geometric maps X_d, Y_d, Z_d, as well as the photometric map I_d, containing, respectively, the destination 3D vertices and colors for all points of the source model L_i. Note that both the source and the destination maps are defined over the same domain Ω_i. Hence, at any time during the rendering process, the 3D coordinates and colors of the morphable model will be given by

\[ X^t(p) = (1-t)\,X_i(p) + t\,X_d(p) \quad (\text{similarly for } Y^t, Z^t), \qquad I^t(p) = (1-t)\,I_i(p) + t\,I_d(p), \tag{5} \]

where t is a parameter determining the amount of morphing (with t = 0 at P_i, t = 1 at P_{i+1}, and 0 < t < 1 in between). Therefore, specifying M_i simply amounts to filling in the values of the destination maps at each point p of Ω_i.

For this purpose, a two-step procedure will be followed, depending on whether a point of L_i has a physically corresponding point in L_{i+1} or not.

(1) The main idea is that if a point of L_i has a physical correspondence in L_{i+1}, that correspondence can be directly used to fill the destination maps. Specifically, let Ω_common denote the subset of region Ω_i consisting of all those points p for which there exists a physical correspondence in L_{i+1} (i.e., region Ω_common refers to that part of the scene which is common to both models L_i, L_{i+1}). Let also u be a function that maps each p in Ω_common to its counterpart in L_{i+1}. Since model L_i (after morphing) should coincide with L_{i+1}, it must then hold that

\[ X_d(p) = X_{i+1}(u(p)), \quad Y_d(p) = Y_{i+1}(u(p)), \quad Z_d(p) = Z_{i+1}(u(p)), \quad I_d(p) = I_{i+1}(u(p)), \qquad p \in \Omega_{\mathrm{common}}. \tag{6} \]

This means that, for all p in Ω_common, the destination maps at p can be filled simply by copying the appropriate values from the maps of model L_{i+1}. As a result, all points of L_i in region Ω_common can be transformed both photometrically and geometrically.

(2) The rest of the points of L_i (i.e., all points of the region Ω_diff = Ω_i \ Ω_common) do not have counterparts in model L_{i+1}. Hence, no photometric morphing can be applied to those points. As a result, these points will retain their color value from model L_i, or

\[ I_d(p) = I_i(p), \qquad p \in \Omega_{\mathrm{diff}}. \]

Nevertheless, we still need to apply geometric morphing to them, so that no distortion/discontinuity in the 3D structure is observed during the transition from L_i to L_{i+1}. This means that we will still have to fill in the destination 3D coordinates (i.e., the destination geometric maps) for all p in Ω_diff.
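Schematically, step (1) above is a simple gather operation once the mapping u is available (its estimation is the subject of Section 5.1). The following Python/NumPy sketch assumes a nearest-neighbour lookup and a dictionary layout for the local models, both purely for illustration, and leaves region Ω_diff to the geometric completion of Section 5.2.

import numpy as np

def fill_destination_maps(model_i, model_next, u):
    """Fill the destination maps of a morphable model.  u[y, x] gives the
    corresponding pixel of the next local model (equation (6)), or None when
    pixel (y, x) has no physical counterpart (it then belongs to Omega_diff,
    keeps its source colour, and gets its destination geometry later)."""
    h, w = model_i["I"].shape[:2]
    dest = {k: model_i[k].copy() for k in ("X", "Y", "Z", "I")}
    omega_common = np.zeros((h, w), dtype=bool)
    for y in range(h):
        for x in range(w):
            if u[y, x] is None:
                continue                       # p in Omega_diff
            yn, xn = u[y, x]                   # counterpart u(p) in the next model
            omega_common[y, x] = True
            for k in ("X", "Y", "Z", "I"):     # copy values from the next model
                dest[k][y, x] = model_next[k][yn, xn]
    return dest, omega_common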

The two important remaining issues (which also constitute the core of the morphing procedure) are as follows:

(i) how to compute the mapping u; this is equivalent to estimating a 2D displacement field (also called an optical flow field hereafter) between images I_i and I_{i+1};

(ii) how to obtain the missing values of the destination geometric maps for all p in Ω_diff; these values are needed for fully specifying the geometric morphing of model L_i over the whole domain Ω_i.

These two issues are the subject of the two sections that follow.

5.1. Photometric Morphing via Optical Flow Estimation

In general, obtaining a reliable, dense optical flow field between images I_i and I_{i+1} is a particularly difficult problem. In fact, without additional input, usually only a sparse set of optical flow vectors can be obtained in the best case. The main difficulties stem from the fact that, unlike the left and right images of a stereo pair, images I_i and I_{i+1} are separated by a wide baseline (since they are captured at totally different positions P_i, P_{i+1}).

(i) On the one hand, this means that the resulting displacement vectors can be large, which, in turn, implies that, when searching for the correspondence of a pixel p, one needs to examine a very large region of image I_{i+1} (the search region corresponding to pixel p will be denoted by R(p) hereafter). In this manner, however, the chance of an erroneous optical flow vector increases significantly (and so does the computational cost).

(ii) On the other hand, for extracting pixel correspondences, one usually needs to compare patches from images I_i and I_{i+1} (where the assumption is that patches associated with corresponding pixels should look similar to each other). In the case of wide-baseline images, however, simple similarity measures for comparing patches (such as the correlation between patches) will typically produce very inaccurate correspondences when used naively. The reason is that, due to the wide baseline, the appearance of a patch may change significantly from one image to the next (e.g., the patch may have been rescaled and thus appear larger). Therefore, more appropriate similarity measures have to be used in this case.

(iii) Finally, even if both of the above problems are solved, optical flow estimation is inherently an ill-posed problem, and so additional regularization assumptions need to be imposed.

For dealing with the first of these issues, we will make use of the underlying geometric maps X_i, Y_i, Z_i of model L_i, as well as the relative pose between views I_i and I_{i+1}. Theoretically, by using these quantities, we can find the correspondence of a pixel p of I_i by reprojecting it onto image I_{i+1}. However, since all of the above quantities have been estimated only approximately, this merely allows us to narrow the search to a smaller region R(p) around the reprojection point, for example, see Figure 4 (in fact, the search region can be further restricted by taking its intersection with a small zone around the epipolar line corresponding to p). In addition, since we are interested in searching only for pixels of I_{i+1} that belong to the domain Ω_{i+1} (this is where model L_{i+1} is defined), the final search region for p will be R(p) ∩ Ω_{i+1}. If the resulting region is empty, then no optical flow vector will be estimated and pixel p will be considered as not belonging to region Ω_common.

For dealing with the second problem, we will use a technique similar to the one described in Section 4 for getting a sparse set of correspondences. As already stated therein, the dominant effect due to a looming of the camera is that pixel neighborhoods in image I_{i+1} are scaled by a factor that varies across the image. The solution proposed therein (and which we will also follow here) was to compare image patches of I_i with patches not only from I_{i+1} but also from rescaled versions of the latter image. We will again use a discrete set of scale factors S = {s_1, ..., s_N} and will denote by I_{i+1}^{s_j} the image I_{i+1} rescaled by 1/s_j.

However, even after the above enhancements, estimating correspondences independently for each pixel would produce a lot of errors. This is because optical flow estimation is an inherently ill-posed problem. Hence, to obtain high-quality results, one needs to regularize this problem by taking advantage of the fact that optical flow vectors at neighboring pixels typically exhibit a high correlation. Discrete Markov random fields (MRFs) [23] are ideal candidates for modeling this type of local correlations and have also been used with great success in many other cases. For this reason, we will reduce optical flow estimation to a discrete MRF optimization task, that is, to a discrete labeling problem. In this case, based on our earlier discussion, each label will consist not only of an optical flow vector, but also of a scale factor (hereafter, the optical flow vector of a label l will be denoted by v(l) and its scale by s(l)). Furthermore, the labels that can be assigned to a pixel p will belong to a set L(p), containing one label for every candidate position inside the search region R(p) ∩ Ω_{i+1} and for every scale s_j in S. This definition of the set L(p) simply encodes the fact that, when seeking a correspondence for a pixel p of image I_i, we are allowed to search across all scales (i.e., in all images I_{i+1}^{s_j}), but only inside region R(p) at each scale (see Figure 4).

Given all these label sets, obtaining an optical flow field is then equivalent to picking one element from the Cartesian product of the sets L(p). In MRF optimization, this is typically done by trying to choose the element that minimizes the MRF energy. This energy is defined as follows:

\[ E(l) = \sum_{p} V_p(l_p) + \sum_{(p,q)\in N} V_{pq}(l_p, l_q). \]

Here, the functions V_p, V_pq denote, respectively, the so-called unary and pairwise MRF potentials, while N denotes the set of all pairs of neighboring pixels in the region over which the flow is estimated. These two functions serve different roles in MRF optimization: the terms V_p try to penalize labelings that do not fit well to the observed data, whereas the terms V_pq try to enforce smoothness between the labels l_p, l_q of neighboring pixels. In our case, each term V_p(l_p) will measure the dissimilarity between corresponding patches according to label l_p. According to that label, a patch around p corresponds to a patch around the point p + v(l_p) of I_{i+1} that has been rescaled by s(l_p). Hence, to determine V_p(l_p), the point p + v(l_p) is projected onto the rescaled image I_{i+1}^{s(l_p)} and a fixed-size patch, denoted by patch(l_p), is taken around this projection. We then compare this patch with an equal-size patch patch(p) around p, or

\[ V_p(l_p) = \big\| \mathrm{patch}(p) - \mathrm{patch}(l_p) \big\|_1, \]

where ||·||_1 denotes the L1-norm and thus measures the sum of absolute intensity differences between corresponding patches.

On the other hand, as already mentioned, the role of the pairwise terms is to regularize the solution by imposing smoothness on the resulting optical flow vectors. These terms can be defined using the following truncated distances:

\[ V_{pq}(l_p, l_q) = \min\big(\|v(l_p) - v(l_q)\|,\; v_{\max}\big) + \min\big(|s(l_p) - s(l_q)|,\; s_{\max}\big). \]

As can be seen, these terms penalize labelings that assign discontinuous displacements or discontinuous scale factors to neighboring pixels (note that, due to the truncation by v_max and s_max, some discontinuities are still allowed to exist, e.g., at the boundary between two objects). Minimizing the resulting MRF energy is, in general, NP-hard. Yet, there exist optimization techniques which can provably guarantee approximately optimal solutions [21, 22, 24]. For example, Figure 5 contains the resulting optical flow for the images I_i, I_{i+1} of Figure 4. For comparison, we also show the corresponding optical flow when no search across scale takes place, that is, when only the scale factor s = 1 is used. As expected, in this case, the optical flow vectors are quite noisy in regions undergoing a large change of scale due to the camera motion. When using the fast optimization algorithms from [21], convergence is very fast (only a few iterations are needed) and the average running time per iteration is on the order of a few seconds.
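The sketch below (Python, with NumPy and OpenCV) illustrates how the label set L(p) and the unary costs just described could be assembled for a single pixel; the patch size, scale set, and array layout are assumptions made for illustration, and the resulting costs would again be passed to a discrete MRF optimizer [21, 22, 24] rather than minimized here.

import numpy as np
import cv2

def labels_and_unary_costs(I_i, I_next, p, search_region,
                           scales=(1.0, 1.5, 2.0), half=5):
    """Build, for pixel p = (x, y) of I_i, the label set L(p) (one label per
    candidate position q in the search region and per scale s) together with
    the unary costs V_p(l) = ||patch(p) - patch(l)||_1 (sum of absolute
    differences between corresponding patches)."""
    x, y = p
    assert half <= x < I_i.shape[1] - half and half <= y < I_i.shape[0] - half
    ref = I_i[y - half:y + half + 1, x - half:x + half + 1].astype(np.float64)
    # pre-compute the rescaled versions of the next image (rescaled by 1/s)
    rescaled = {s: cv2.resize(I_next, None, fx=1.0 / s, fy=1.0 / s,
                              interpolation=cv2.INTER_AREA) for s in scales}
    labels, costs = [], []
    for (qx, qy) in search_region:                      # candidates q in R(p)
        for s in scales:
            img = rescaled[s]
            px, py = int(round(qx / s)), int(round(qy / s))  # projection of q
            if (px - half < 0 or py - half < 0 or
                    px + half + 1 > img.shape[1] or py + half + 1 > img.shape[0]):
                continue
            cand = img[py - half:py + half + 1,
                       px - half:px + half + 1].astype(np.float64)
            labels.append(((qx - x, qy - y), s))        # label = (flow vector v, scale s)
            costs.append(np.abs(ref - cand).sum())      # L1 patch dissimilarity
    return labels, np.array(costs)

def flow_pairwise_cost(l_p, l_q, v_max=10.0, s_max=1.0):
    """Truncated smoothness term on the flow vectors and on the scale factors."""
    (v_p, s_p), (v_q, s_q) = l_p, l_q
    dv = np.hypot(v_p[0] - v_q[0], v_p[1] - v_q[1])
    return min(dv, v_max) + min(abs(s_p - s_q), s_max)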

5.2. Geometric Morphing

After estimating the mapping u, we may apply (6) to all points in region Ω_common. This allows us to copy values from the geometric maps of model L_{i+1} in order to fill in region Ω_common of the destination geometric maps X_d, Y_d, Z_d. For notational convenience, the resulting partially complete geometric maps, that is, the maps with known values only in region Ω_common, will be denoted by X'_d, Y'_d, Z'_d (e.g., see the middle image in Figure 6). To completely specify the morphing, it only remains to fill in the destination geometric maps over the rest of their domain, that is, over region Ω_diff. For this, we need to specify the corresponding destination 3D vertices for all points of L_i in Ω_diff. Since these points do not have a physically corresponding point in L_{i+1}, we cannot apply (6) to get a destination 3D coordinate from that model. A naive way to deal with this would be to simply apply no geometric morphing to these points, that is, to let the destination 3D vertices of these points coincide with their source vertices in L_i. In that case, we would end up with destination geometric maps:

(i) whose values in Ω_common would have been copied from model L_{i+1},

(ii) but whose values in Ω_diff would have been copied from model L_i.

In this manner, however, the resulting destination maps X_d, Y_d, Z_d would contain discontinuities along the common boundary (say ∂Ω) of regions Ω_common and Ω_diff (e.g., see the left image in Figure 6), thus leading to very annoying discontinuity artifacts (holes) during the morphing procedure (e.g., see the left image in Figure 7). This would indeed be the case, since the geometries of both L_i and L_{i+1} (as well as their relative pose) have been estimated only approximately, and thus these models may not match perfectly to each other.

Instead, the proper way to fill in the destination vertices over region Ω_diff is based on the observation that a physically valid destination 3D model should satisfy the following two conditions.

(1) Along the boundary ∂Ω of Ω_diff, no discontinuity in its 3D structure should exist, that is, the values of X_d, Y_d, Z_d at ∂Ω should coincide with the already filled-in values of X'_d, Y'_d, Z'_d at ∂Ω.

(2) In the interior of Ω_diff, the relative 3D structure of the destination model should coincide with the relative 3D structure of the source model L_i (since these two models are assumed to be related by a rigid transformation). Put otherwise, the relative 3D structure of L_i should be preserved during morphing, or, at least, it should not be distorted too much.

In mathematical terms, the first condition obviously translates to

\[ X_d\big|_{\partial\Omega} = X'_d\big|_{\partial\Omega}, \qquad Y_d\big|_{\partial\Omega} = Y'_d\big|_{\partial\Omega}, \qquad Z_d\big|_{\partial\Omega} = Z'_d\big|_{\partial\Omega}, \]

whereas the second condition, that is, that of preserving the relative 3D structure of L_i during morphing, implies

\[ X_d(p) - X_d(q) = X_i(p) - X_i(q) \quad \text{for neighboring } p, q \in \Omega_{\mathrm{diff}} \quad (\text{similarly for } Y_d, Z_d), \]

which is easily seen to be equivalent to

\[ \nabla X_d = \nabla X_i, \qquad \nabla Y_d = \nabla Y_i, \qquad \nabla Z_d = \nabla Z_i \qquad \text{inside } \Omega_{\mathrm{diff}}. \]

Therefore, based on the above conditions, the destination vertices can be extracted by independently solving three similar optimization problems (one problem per 3D coordinate). For instance, the missing values of X_d can be estimated by solving the following optimization problem:

\[ \min_{X_d} \int_{\Omega_{\mathrm{diff}}} \big\| \nabla X_d - \nabla X_i \big\|^2 \, dp \qquad \text{subject to} \qquad X_d\big|_{\partial\Omega} = X'_d\big|_{\partial\Omega}. \tag{13} \]

The finite-difference discretization of (13) (using the underlying pixel grid) yields a quadratic optimization problem, which can be reduced to a sparse (banded) system of linear equations that can then be solved easily. Alternatively, one may observe that a function minimizing (13) is also a solution of the following Poisson equation with Dirichlet boundary conditions [25, 26]:

\[ \Delta X_d = \Delta X_i \quad \text{in } \Omega_{\mathrm{diff}}, \qquad X_d\big|_{\partial\Omega} = X'_d\big|_{\partial\Omega}. \]

Therefore, in this case, the missing values of the destination geometric maps can be filled by independently solving three Poisson equations of the above type (note that these equations can be solved very efficiently, e.g., the needed time is at most only a few seconds). See the right image in Figure 6 for a destination depth map produced with this method. Also, the middle image of Figure 7 contains a corresponding rendering.
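A minimal sketch of this fill-in step is given below (Python/NumPy): it solves the discretized Poisson problem with a plain Gauss-Seidel iteration, which is much slower than the sparse or banded solvers mentioned above but makes the Dirichlet boundary conditions and the guidance by the source Laplacian explicit; the grid layout and the fixed iteration count are illustrative assumptions.

import numpy as np

def poisson_fill(dest, source, region, iters=2000):
    """Fill `dest` over `region` (a boolean mask for Omega_diff, assumed not to
    touch the image border) so that its Laplacian matches that of `source`
    there, while the values of `dest` just outside `region` act as Dirichlet
    boundary conditions.  One such call is made per destination geometric map
    (X_d, Y_d and Z_d), with `source` being the corresponding source map."""
    dest = dest.astype(np.float64).copy()
    lap = (-4.0 * source
           + np.roll(source, 1, 0) + np.roll(source, -1, 0)
           + np.roll(source, 1, 1) + np.roll(source, -1, 1))   # Laplacian of source
    ys, xs = np.nonzero(region)
    for _ in range(iters):                                     # Gauss-Seidel sweeps
        for y, x in zip(ys, xs):
            nb = dest[y - 1, x] + dest[y + 1, x] + dest[y, x - 1] + dest[y, x + 1]
            dest[y, x] = (nb - lap[y, x]) / 4.0
    return dest

In practice a direct sparse solver or a multigrid scheme would be preferable, which is consistent with the few-seconds solve times reported above.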

6. Rendering Pipeline

An important advantage of our framework is that, regardless of the scene's size, only one “morphable 3D-model” needs to be displayed at any time during rendering, that is, the rendering pipeline has to execute the geometric and photometric morphings for only one local model (as described in Section 5). This makes our system extremely scalable to large-scale scenes. In addition to that, by utilizing the enhanced capabilities of modern 3D graphics hardware, both types of morphing can admit a GPU implementation, thus making our system ideal for 3D acceleration and capable of achieving very high frame rates during rendering.

More specifically, for implementing the photometric morphing of model M_i, multitexturing needs to be employed as a first step. To this end, both images I_i and I_{i+1} will be used as textures, and each 3D vertex whose corresponding 2D point p is located inside region Ω_common will be assigned two pairs of texture coordinates: the first pair will coincide with the image coordinates of point p, while the second one will be equal to the image coordinates of the corresponding point u(p) (see (6)). Then, given these texture coordinates, a so-called pixel shader (along with its associated vertex shader) [27] can simply blend the two textures in order to implement (on the GPU) the photometric morphing defined by (5). Pixel and vertex shaders are user-defined scripts that are executed by the GPU for each output pixel and each incoming 3D vertex, respectively. One possible implementation of such scripts, for the case of photometric morphing, is shown on the left side of Figure 8 where, for this specific example, the OpenGL Shading Language (GLSL) [27] has been used for describing the shaders. As for the 3D vertices which are associated with points located inside region Ω_diff, the situation is even simpler, since no actual photometric morphing takes place there (see (3)) and so only image I_i needs to be texture-mapped onto these vertices.

On the other hand, for implementing the geometric morphing, the following procedure is used: two 2D triangulations of regions Ω_common, Ω_diff are first generated, resulting in two 2D triangle meshes TRI_com, TRI_diff. Based on these triangulations and the underlying geometric maps of L_i, two 3D triangle meshes MESH_com, MESH_diff are constructed. Similarly, using TRI_com, TRI_diff and the destination geometric maps X_d, Y_d, Z_d, two more 3D triangle meshes MESH_com^d, MESH_diff^d are constructed as well. It is then obvious that geometric morphing (as defined by (5)) amounts to a simple vertex blending operation, that is, meshes MESH_com, MESH_diff are weighted by 1 − t, meshes MESH_com^d, MESH_diff^d are weighted by t, and the resulting weighted vertices are then added together. Vertex blending, however, is an operation that is directly supported by all modern GPUs and, as an example, Figure 8 (right box) contains skeleton code in C showing how one can implement vertex blending using the OpenGL standard.

Therefore, based on the above observations, rendering a morphable model simply amounts to feeding into the GPU just 4 textured triangle meshes. This is, however, a rendering path which is highly optimized in all modern GPUs and, therefore, a considerable amount of 3D acceleration can be achieved in this manner during the rendering process.
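Although the blending runs on the GPU through the shaders of Figure 8, the operation itself is simple enough to state as a CPU reference; in the Python/NumPy sketch below (array layout assumed for illustration), each rendered frame just blends source and destination vertices and colours linearly with the current morphing parameter t, which is exactly what the vertex and pixel shaders compute per vertex and per pixel.

import numpy as np

def morphed_frame(src_vertices, dst_vertices, src_colors, dst_colors, t):
    """CPU reference of the per-frame morphing of equation (5): t = 0 at
    key-position P_i, t = 1 at P_{i+1}.  Vertices in Omega_diff would simply
    have dst_colors equal to src_colors, so only their geometry is blended."""
    vertices = (1.0 - t) * src_vertices + t * dst_vertices   # geometric morphing
    colors = (1.0 - t) * src_colors + t * dst_colors         # photometric morphing
    return vertices, colors

# toy usage: a single vertex halfway through the transition (t = 0.5)
v, c = morphed_frame(np.array([[0.0, 0.0, 1.0]]), np.array([[0.1, 0.0, 1.2]]),
                     np.array([[0.2, 0.3, 0.4]]), np.array([[0.3, 0.3, 0.5]]), 0.5)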

6.1. Decimation of Local 3D Models

Up to now, we have assumed that a full local 3D model is constructed each time, that is, all points of the image grid are included as vertices in the 2D triangulations TRI_com, TRI_diff. However, we can also use simplified versions of these 2D triangle meshes, provided, of course, that the simplified meshes approximate well the underlying geometric maps [28]. In fact, due to our framework's structure, a great amount of simplification can be achieved, and the reason is that a simplified model has to be a good approximation of the full local model only in the vicinity of the corresponding key-positions (remember that a local model is used only in a local region of the path around its key-position). Based on this observation, the following iterative procedure is used for the simplification of the 2D meshes: at the start of each iteration, there exists a current 2D Delaunay triangulation TRI which has as vertices only a subset of the points of the image grid. Based on TRI, an error function err(·) is defined over the image grid, measuring how well the current MESH approximates the underlying geometric maps (here MESH denotes the 3D surface defined by TRI). To each triangle T of TRI we then associate the following two quantities: the maximum value of err across T, and the interior point of T achieving this maximum error. At each iteration, the triangle of maximum error is selected and its maximum-error point is added as a new vertex in the triangulation. This way, a new Delaunay triangulation is given as input to the next iteration of the algorithm, and the process repeats until the maximum error falls below a user-specified threshold err_thresh, which basically controls the total amount of simplification to be applied to the local model. Our algorithm is initialized with a sparse Delaunay triangulation, and the only restriction imposed on it is that it should contain the edges along the boundary between regions Ω_common and Ω_diff (i.e., a constrained Delaunay triangulation has to be used), so that there are no cracks at the boundary of the corresponding meshes.

For completely specifying the decimation process, all that remains to be defined is the error function err(·). One option would be to set err(p) equal to the geometric deviation at p between MESH and the underlying geometric maps, that is, the distance between MESH(p) (the 3D point defined by MESH at p) and the 3D point of the full model at p. This, however, would not result in the greatest possible geometric decimation. The key observation for managing to increase the geometric simplification of MESH relies on the fact that MESH needs to provide a good approximation of the local model only in a local region between positions P_i and P_{i+1} of the path. As a result, we can choose to relate err to the projection error only at these locations. More specifically, we set

\[ \mathrm{err}(p) = \max\big( \mathrm{PROJ\_ERR}_{P_i}(p),\; \mathrm{PROJ\_ERR}_{P_{i+1}}(p) \big), \]

where PROJ_ERR_{P_i}(p) and PROJ_ERR_{P_{i+1}}(p) denote the projection error at positions P_i and P_{i+1}, respectively, that is, the distance (in pixels) between the projections, onto the image plane of a camera placed at the corresponding position, of the point MESH(p) and of the 3D point of the full model at p. In practice, this definition of the error has managed to achieve much larger reductions in the geometric complexity of the local 3D models, without sacrificing the visual quality of the resulting renderings, thus giving excellent results. Furthermore, the user-defined threshold err_thresh can now be expressed in pixel units and can thus be set in a far more intuitive way by the user. An example of a simplified local model that has been produced in this manner is shown in Figure 9, in which case err_thresh has been set equal to 0.5 pixels. We should finally note that by using the simplified local 3D models one can reduce the rendering time even further, thus achieving even higher frame rates, for example, over 50 fps.
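The greedy refinement loop described above can be sketched as follows (Python, using SciPy's Delaunay triangulation and a piecewise-linear interpolant). For brevity the sketch measures plain geometric deviation of the depth map instead of the projection-based error defined above, and it omits the constrained triangulation along the Ω_common/Ω_diff boundary; both simplifications are assumptions for illustration only.

import numpy as np
from scipy.spatial import Delaunay
from scipy.interpolate import LinearNDInterpolator

def decimate_depth_map(Z, err_thresh=0.05, max_vertices=5000):
    """Greedy simplification of a geometric map Z: start from the four image
    corners, then repeatedly insert the grid point where the current
    piecewise-linear approximation deviates most from Z, until the maximum
    deviation falls below err_thresh."""
    h, w = Z.shape
    ys, xs = np.mgrid[0:h, 0:w]
    grid = np.column_stack([xs.ravel(), ys.ravel()]).astype(np.float64)
    values = Z.ravel().astype(np.float64)

    verts = [(0, 0), (w - 1, 0), (0, h - 1), (w - 1, h - 1)]   # initial sparse TRI
    while len(verts) < max_vertices:
        pts = np.array(verts, dtype=np.float64)
        tri = Delaunay(pts)                                    # current triangulation
        interp = LinearNDInterpolator(
            tri, Z[pts[:, 1].astype(int), pts[:, 0].astype(int)])
        err = np.nan_to_num(np.abs(interp(grid) - values), nan=0.0)
        worst = int(np.argmax(err))                            # max-error grid point
        if err[worst] < err_thresh:
            break                                              # approximation good enough
        verts.append((grid[worst, 0], grid[worst, 1]))         # insert it as a vertex
    return np.array(verts), Delaunay(np.array(verts))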

7. 3D-Mosaics Construction

For simplicity, up to this point we have been assuming that, during the image acquisition process, only one stereoscopic image has been captured per key-position along the path. Our framework, however, can be readily extended to the case where multiple stereoscopic views (and hence multiple local models) exist per key-position, related to each other by a pure rotation of the stereoscopic camera. Such a scenario can be useful, for example, when an extended field of view is required during the virtual walkthrough (as in VR installations with large screens). To reduce this case to the one already examined, it suffices that a single local model (called 3D-mosaic hereafter) is constructed at each key-position. This can be done by assembling all individual local models at that position, which, as we will see, is an easy task, since these models differ only by a rotation. Then, at any time during the rendering process, instead of morphing the individual local models, a morphing between successive 3D-mosaics should take place, as before. For this reason, the term morphable 3D-mosaic is used in this case.

Intuitively, each 3D-mosaic should correspond to a local model produced from a stereoscopic camera with a wider field of view. Technically, let L_i^k, with k = 1, ..., K, be the individual local models at the i-th key-position. Then, constructing a 3D-mosaic corresponds to filling in the geometric and photometric maps of the mosaic based on all the maps of the models L_i^k. To this end, the rotation between each pair of local models L_i^k, L_i^l needs to be estimated first. This can be done very accurately, since no depth information is required for this task. Essentially, estimating this rotation is equivalent to estimating a 2D projective transformation (the infinite homography induced by the plane at infinity [3]) relating pixels in the photometric images I_i^k, I_i^l (and, for this, just a sparse set of pixel correspondences between I_i^k, I_i^l needs to be established).

The estimated homographies (for all pairs k, l) can then be used for aligning the maps of all local models onto a common image plane. Furthermore, these aligned maps can be fused to produce the maps of the 3D-mosaic. Although this suffices for the construction of the mosaic's photometric map, it turns out that, before fusing the geometric maps, for example, to produce the mosaic's depth map, an additional rectification step needs to be applied to all these maps, so that they become geometrically consistent with each other (recall that the geometry of each local model has been estimated only approximately). To this end, an operator is defined whose role is to modify the geometric map Z^k of one model so that it coincides with the map Z^l of another model over the intersection of their (aligned) domains Ω^k, Ω^l (this ensures geometric consistency between the two models), but without the relative 3D structure of Z^k over the rest of its domain being distorted too much. This operator can be easily seen to reduce to solving the following optimization problem:

\[ \min_{\hat Z^k} \int_{\Omega^k \setminus (\Omega^k \cap \Omega^l)} \big\| \nabla \hat Z^k - \nabla Z^k \big\|^2 \, dp \qquad \text{subject to} \qquad \hat Z^k = Z^l \ \ \text{on } \Omega^k \cap \Omega^l. \]

This problem is similar to problem (13) studied in the previous section, and can thus be reduced to a Poisson partial differential equation. Obviously, geometric consistency is ensured simply by applying this operator to all pairs of overlapping local models, that is, by solving a set of Poisson PDEs. See Figure 10 for some results produced with this method.
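For completeness, a brief sketch (Python/OpenCV) of the alignment step is given below, under the assumption that each estimated homography maps the pixels of one local model onto the chosen reference image plane; the dictionary layout of the models is illustrative, and the warped geometric maps would still go through the Poisson-based rectification described above before fusion.

import numpy as np
import cv2

def align_maps_to_reference(models, homographies, out_size):
    """Warp the photometric and geometric maps of every local model onto a
    common (reference) image plane prior to fusion.  homographies[k] is
    assumed to map pixels of model k to the reference plane; out_size is the
    (width, height) of the mosaic's image plane."""
    aligned = []
    for model, H in zip(models, homographies):
        warped = {key: cv2.warpPerspective(model[key].astype(np.float32),
                                           H.astype(np.float64), out_size)
                  for key in ("X", "Y", "Z", "I")}
        aligned.append(warped)
    return aligned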

As already mentioned, an important property of our geometric rectification algorithm is that the true relative 3D structure is preserved in the resulting 3D-mosaic. This is in deep contrast to other methods for fusing geometric maps, such as feathering, as the following toy example illustrates. Let us assume that we have two local models, L^1 and L^2, that must be fused to produce a 3D-mosaic. The true geometry of the 3D-mosaic should consist of a planar object at a constant depth Z_0. Due to noisy observations, however, the left half of this planar object (corresponding to local model L^1) was estimated to have depth Z_0 + e_1, while the right half of the object (corresponding to local model L^2) was estimated to have depth Z_0 + e_2. It is easy to verify that, no matter what the values of the errors e_1, e_2 are, the 3D-mosaic produced by our fusion algorithm will always be a planar object with constant depth. On the other hand, when using a feathering-like method, that is, when a weighted linear combination of the two depth estimates is used for creating the mosaic's depth map, this depth map will exhibit a high distortion in a zone around the middle of the object, that is, around the common boundary of L^1 and L^2. Inside this zone, the resulting 3D-mosaic will have depths that vary from Z_0 + e_1 to Z_0 + e_2, and will thus appear highly distorted to the human eye when visualized. Not only that, but the higher the errors e_1, e_2, the higher the distortion will be. In fact, since this error is typically proportional to how far the object is from the camera, it can be really high for large natural scenes, and so can the distortion in this case.

8. Visual Reconstruction of the Samaria Gorge

The “morphable 3D-mosaics” framework has been successfully applied to the visual 3D reconstruction of the well-known Samaria Gorge in Crete (a gorge that was recently awarded a Diploma First Class by the Council of Europe, as one of Europe's most beautiful spots). Based on this 3D reconstruction, and by also using a 3D virtual reality installation, the ultimate goal of this work has been to provide a lifelike virtual tour of the Samaria Gorge to all visitors of the Natural History Museum of Crete, located in the city of Heraklion. To this end, the most beautiful spots along the gorge have been selected, and for each such spot a predefined path, over 100 meters long, was chosen as well. About 15 key-positions have been selected along each path and approximately 45 stereoscopic views have been acquired at these positions, with 3 stereoscopic views per position (in this manner, a considerably wider field of view has been covered). Using the reconstructed morphable 3D-mosaics, a photorealistic walkthrough of the Samaria Gorge has been obtained, which was visualized at interactive frame rates by means of a virtual reality system. The hardware used for the virtual reality system was a common PC (with a Pentium 4 2.4 GHz CPU), connected to a single-channel stereoscopic projection system from Barco. The projection system consisted of a pair of circularly polarized LCD projectors (Barco Gemini), an active-to-passive stereo converter, as well as a projection screen, while the rendering was done on a single GeForce 6800 3D graphics card (installed on the PC). For the stereoscopic effect to take place, two views (corresponding to the left and right eyes) were rendered by the graphics card at any time (e.g., see Figure 11), and so the museum visitors were able to participate in the virtual tour simply by wearing stereo glasses matched to the circular polarization of the projectors.

Despite the fact that a common graphics card has been used, very high frame rates, of about 25 fps in stereo mode (i.e., 50 fps in mono mode), were obtained. On the one hand, this is due to the fact that, regardless of the actual size of the scene, only one morphable 3D-model needs to be displayed at any time during rendering. This is an important advantage, which makes our framework extremely scalable to large-scale scenes. On the other hand, the achieved frame rates are also due to the highly optimized rendering pipeline of our system. The reason for this is that both the photometric and the geometric morphings can be implemented directly on the GPU. For example, the photometric morphing operation (associated with, say, morphable model M_i) reduces to defining a simple so-called pixel shader. This is a user-defined script that is automatically executed by the GPU for each output pixel. In this case, that script should simply be responsible for blending corresponding pixels (as specified by the mapping u) from photometric images I_i and I_{i+1}. Similarly, geometric morphing amounts to a simple vertex blending operation, that is, an operation that is directly supported by all modern GPUs. Hence, rendering a morphable model essentially reduces to feeding a textured triangle mesh into the GPU (with pixel and vertex shaders being applied to this mesh). This is, of course, a highly optimized rendering path in all modern GPUs, and thus a considerable amount of acceleration can be attained in this manner.

Sample results from the visual reconstruction of the Samaria gorge are shown in Figure 12. That figure contains some rendered views that were generated in real time as the virtual camera traversed a path through the so-called Iron Gates area, which is one of the most famous parts of the gorge. A corresponding video, with just a short clip from a virtual tour into the Samaria Gorge, is also available at the URL in [29]. Also, Figure 13 contains some rendered views of the gorge, where a synthetically generated volumetric fog has been added to the scene to further enhance the visual experience of the virtual tour.

8.1. Extensions

Another difficulty that we had to face during the visual reconstruction of the Samaria Gorge was related to the fact that a small river was passing through a certain part of the gorge. This was a problem for the construction of the corresponding local 3D models, as our stereo matching algorithm could not possibly extract disparity (i.e., find correspondences) for the points on the water surface. This was the case because the water was not always static and, furthermore, the sun reflections that existed on its surface were violating the Lambertian assumption during stereo matching (see Figure 14(a)). Therefore, the disparity for all pixels lying on the water had to be estimated in a different way. To this end, since the water surface was approximately planar, we made the assumption that a 2D homography (i.e., a 2D projective transformation represented by a homogeneous matrix H) can directly map the left-image pixels that lie on the water to their corresponding points in the right image. For estimating H, we took advantage of the fact that most left-image pixels that are located on the ground next to the river lie approximately on the same plane as the water surface and, in addition to that, stereo matching can extract valid correspondences for these pixels, as they are not on the water. A set of such pixels x_k in the left image is thus extracted, and their matching points x'_k in the right image are also computed based on the already estimated disparity maps. Using this sparse set of matches, the matrix H can then be easily recovered using a RANSAC-based procedure to cope with outliers in the matches (x_k, x'_k). The inlier matches found by RANSAC can then be used for refining the estimate of H by minimizing the total reprojection error, that is, the sum of distances between H x_k and x'_k (note that the points x_k and x'_k are represented in homogeneous coordinates):

\[ E(H) = \sum_{k} d\big(H\,x_k,\; x'_k\big)^2 . \]

Given the matrix H, the disparity of an image point p on the water surface can then be set equal to the displacement between p and its mapped point H p. An example of a disparity field that has been estimated with this method can be seen in Figure 14(c). We should note that, by using a similar method, a 2D homography mapping the pixels of I_i that lie on the water to their corresponding pixels in image I_{i+1} can be computed as well. In this way, we can also estimate an optical flow field for all the pixels of image I_i that lie on the water surface.
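A sketch of this homography-based disparity completion is given below (Python/OpenCV); the RANSAC threshold and the interpretation of disparity as a purely horizontal displacement in the rectified pair are illustrative assumptions, and the nonlinear refinement of H over the inliers is omitted.

import numpy as np
import cv2

def water_disparity_from_homography(ground_left_pts, ground_right_pts, water_pts):
    """Estimate the homography H mapping left-image points on the water plane
    to the right image, using RANSAC on the ground matches next to the river,
    then set the disparity of every water pixel to the horizontal displacement
    implied by H."""
    H, inlier_mask = cv2.findHomography(np.float32(ground_left_pts),
                                        np.float32(ground_right_pts),
                                        cv2.RANSAC, ransacReprojThreshold=2.0)
    # map the water pixels through H (homogeneous coordinates)
    pts = np.hstack([np.float64(water_pts), np.ones((len(water_pts), 1))])
    mapped = (H @ pts.T).T
    mapped = mapped[:, :2] / mapped[:, 2:3]                 # dehomogenize
    disparity = np.float64(water_pts)[:, 0] - mapped[:, 0]  # horizontal displacement
    return H, disparity, inlier_mask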

9. Conclusions and Discussion

In conclusion, we have presented a new approach for obtaining photorealistic and interactive walkthroughs of large, outdoor scenes. To this end, a novel hybrid data structure has been proposed, called “morphable 3D-mosaics”. The resulting framework offers a series of important advantages. (1) To start with, no global 3D model of the environment needs to be assembled, a process which can be extremely cumbersome and error-prone, especially for large-scale scenes, where many local models need to be registered to each other. (2) Furthermore, rendering such a global model would be a very inefficient operation, with no inherent support for scalability, thus leading to very low frame rates for large-scale scenes. On the contrary, our method is extremely scalable to large-scale environments, as only one morphable 3D-model needs to be displayed at any time. (3) On top of this, it also makes use of a rendering path which is highly optimized in modern 3D graphics hardware. (4) Not only that, but by utilizing an image-based data representation, our framework is also capable of fully reproducing the photorealistic richness of the scene. (5) As another advantage, the data acquisition process is very easy (e.g., collecting the stereoscopic images for a path over 100 meters long took us only about 20 minutes), while it also requires no special or expensive equipment (just a pair of digital cameras and a tripod). (6) Finally, our framework makes up an end-to-end system, thus providing an almost automated processing of the input data, which are just a sparse set of stereoscopic images of the scene.

In the future, we intend to eliminate the need for calibrating the stereoscopic camera, as well as to allow the stereo baseline to vary during the acquisition of the stereoscopic views (these enhancements would allow for an even more flexible data acquisition process). Another issue that we wish to investigate is the ability to capture the dynamic appearance of any moving objects, such as moving water or grass, that are frequently encountered in outdoor scenes (instead of just rendering these objects as static). To this end, we plan to enhance our “morphable 3D-mosaics” framework so that it can also make use of real video textures that have been previously captured inside the scene. Finally, a current limitation of our method is that it assumes that the lighting conditions across the scene are not drastically different (something which is not always true in outdoor environments). One possible approach to deal with this issue is to obtain the radiometric response function of each photograph. In this case, when constructing the morphable 3D models, one should also use the estimated response functions so as to compensate for the image differences due to the varying lighting conditions.

Acknowledgments

This paper is part of the 03ED417 research project, implemented within the framework of the Reinforcement Programme of Human Research Manpower (PENED) and cofinanced by National and Community Funds (75% from E.U.-European Social Fund and 25% from the Greek Ministry of Development-General Secretariat of Research and Technology).