Abstract

The increase in computational power in recent years has opened a new door for image processing techniques. Three-dimensional object recognition, identification, pose estimation, and mapping are becoming popular. The need to map real-world objects into three-dimensional spatial representations is growing rapidly, especially given the huge leap made in the past decade in virtual reality and augmented reality. This paper discusses an algorithm to convert an array of captured images into estimated 3D coordinates of their external mappings. Elementary methods for generating three-dimensional models are also discussed. This framework will help the community estimate the three-dimensional coordinates of a convex-shaped object from a series of two-dimensional images. The built model could be further processed to increase its resemblance to the input object in terms of shape, contour, and texture.

1. Introduction

Three-dimensional image processing comprises the visualization, processing, and analysis of three-dimensional image datasets. With the increase in computational power, image processing techniques have evolved from traditional two-dimensional analysis to the generation and analysis of spaces and objects in three dimensions. This leap, particularly in mobile computation, has advanced research on the three-dimensional representation of common objects [1]. Applications such as virtual reality and augmented reality require real-world objects to be mapped into digital representations for a flawless experience.

1.1. Descriptors and Representation of Objects

Solid physical objects have a definite shape and size. The size of an object can be seen as the outline of the dimensions defined by its length, width, and height [2]. When a three-dimensional object is captured, the capturing device (such as a camera) essentially creates a view of the object from its own perspective. This results in a two-dimensional Cartesian plane in which each unit is represented by a pixel, itself a combination of red, green, and blue values. Image capturing is thus a function that generates the two-dimensional coordinates of a scene as seen from the capturing angle [3].
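This projection, and the depth information it discards, can be illustrated with a minimal pinhole-camera sketch. The code below is not part of the study; the focal length, principal point, and 3D coordinates are arbitrary illustrative values:

```python
import numpy as np

# A 3D point in the camera's coordinate frame (arbitrary example values).
point_3d = np.array([0.2, -0.1, 2.0])  # x (right), y (down), z (depth)

# Simple pinhole intrinsics: focal length in pixels and principal point
# of a hypothetical 640x480 sensor.
f = 800.0
cx, cy = 320.0, 240.0

# Perspective projection: depth (z) is divided out, which is exactly
# the information lost when a 3D scene is flattened to a 2D image.
u = f * point_3d[0] / point_3d[2] + cx
v = f * point_3d[1] / point_3d[2] + cy
print(f"pixel coordinates: ({u:.1f}, {v:.1f})")  # (400.0, 200.0)
```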

Consider the case of capturing a tangible convex object. The shape descriptors of the object might differ when it is seen from different angles or sides. Fundamentally, there are two types of objects with respect to this property:
(1) Objects having a uniform exterior shape (Type A)
(2) Objects that do not have a uniform shape (Type B)

Here, the “shape” corresponds to the outline of the object described by its edges. The shape of the first type of object can be expressed by a single uniform descriptor, whereas the latter type requires descriptions from multiple angles. For instance, a football belongs to the first category, since its shape can generally be described as “round” or “spherical,” whereas an object like a stapler cannot be described by a universal shape. Even though the shape remains uniform for the first category, the colour and texture might vary across the object [4, 5].

The edges of an object belonging to the second category, when captured, cannot describe the shape of the overall object, as a single image gives no perception from other angles [6]. The object in three dimensions is thus represented without taking its “depth” into account [7].

1.2. Applications

There are numerous applications in which three-dimensional object models are used as primitive components. Augmented reality and virtual reality applications are the most common examples: off-site control and demonstration of technical work, tutorials for equipment, simulation of medical operations, guides for fitting components, and so on [8, 9]. Generated three-dimensional models also play an important role in multimedia applications where mock-ups of objects are used, such as video calls with objects present near the caller and 3D viewer applications.

For any of these applications, it is essential to have accurately created object models. Whether the objective is to create an immersive experience or to compose a view of multiple objects, the rendering process requires prebuilt object models [10, 11]. Users often compare the presented models with the actual real-world objects. Therefore, a novel method for generating at least the preliminary structure of objects is required.

2. Background

Given an object, a designer has to spend a considerable amount of time in computer-aided design (CAD) software modelling each component with respect to its depth, shape, and other characteristics. The design process is tedious and requires skilled labour, since every component must be designed individually [12, 13].

The use of three-dimensional models in applications is becoming increasingly popular. There is a significant trade-off between the quality of the generated models and the manual effort spent in the modelling process. No single mapping model can be applied universally to all objects, and the relative spatial arrangement of features must be maintained at every level of the design process. This study suggests methods to convert an object into its three-dimensional mapping; the suggested framework considers generic objects [14, 15].

2.1. Existing Studies

As mentioned earlier, various studies discuss algorithms for generating three-dimensional maps from stereoscopic images, SAR images, and so on, and recent digitalization pushes beyond the limitations of traditional photo processing techniques. Several studies compare existing reconstruction algorithms used in a variety of applications. A survey focusing on triangulation and stereo vision [16–19] compared the speed, accuracy, and practicality of different algorithms used in the motion-parallax scenario. It also enumerates image-based, voxel-based, and object-based approaches for scene and geometric parallax reconstruction [20, 21]. Even though it did not address the generation of object coordinates, the bounding box method was used on objects [22] for pose estimation.

SV3DVision [23] is a depth-map generating algorithm used for reconstructing scenes from a single-photograph input, whereas a more calibrated method [24] using parallel axes employs a stereoscopic system. Both studies focus on identifying near and far objects for depth-map generation, targeting robotic vision applications [25]. A silhouette-based method [26, 27] built on the volume intersection approach is also available in the 3D model reconstruction literature; it uses camera calibration and bounding cube estimation for silhouette extraction through triangulation and decimation.

Apart from the algorithmic methods discussed above, approaches based on neural networks [28, 29] are also available. These studies estimate the depth of the human body, faces, videos, and so on, and suggest adversarial methods for reconstruction in single- and multi-view settings [30, 31].

To ease the computer-aided design process, methods should allow reconstruction of objects from the fewest possible images. Even when the approach follows stereoscopic vision, the triangulation method can still be used for scene reconstruction, with features mapped for triangulation [14, 32]. However, this requires special camera setups and is therefore not accessible to everyone. This problem is eliminated in the study [33], where stereo sequences captured with handheld cameras are used for reconstruction. The method generates disparity maps and object boundaries in textureless regions [34]. Although this is the only study that considered handheld cameras, it also focused on scene/space reconstruction, whereas our requirement is to do the same for objects [13].

Techniques such as binocular disparity, motion parallax, image blur, linear perspective, triangulation, and the silhouette process have their own advantages and disadvantages [35]. We have to weigh the shortcomings and advantages of each of these approaches to arrive at a better solution.

We observe that the space reconstruction and remodelling domain has improved considerably over the past years in terms of implemented methods. Research on parallax photography helps in reconstructing the dense three-dimensional geometry of a space or scene. However, with respect to reconstructing small objects from different viewpoints, progress has been limited, and studies on object estimation and reconstruction remain scarce [35]. It is important to note that none of the surveyed studies aims to speed up the computer-aided design process. We believe that this study can serve as an initiative in the field of three-dimensional object reconstruction towards that goal.

2.2. Problem Statement

Computer-aided design produces highly accurate representations of objects for visualization, dimensional analysis, and other applications. Objects are represented either in three-dimensional metric spaces or by vector edges. Based on this fact, the research problem is to generate metric points in three dimensions from an input set of images taken around an object. Figure 1 represents the overall system of our research.

This paper discusses an algorithm for generating preliminary object models that could be used for further processing. The study focuses only on replicating the external structure of a given object; the problem does not focus on replicating or predicting the internal structure of the object. A visual comparison of some input objects and generated object models is also given.

3. Methodology

It is observed that generating three-dimensional models usually requires highly sophisticated and costly capturing equipment. Our goal is to provide a framework usable even by non-experts with ordinary handheld cameras. It is also important to delimit the scope of this goal: ideally, objects would also be captured from the top and the bottom, but our scope does not cover those views. Consider the standard plane representation of a Rubik’s cube as given in Figure 2. With our method, we can construct the primary shape of the cube from the left, front, right, and back sides (excluding the white and yellow faces), that is, reconstruct the shape surrounding the given object but not its top and bottom views.

Our problem statement requires a series of images as input. There are a good number of 3D object databases, such as IKEA3D, LDOS, ObjectNet3D, and Thingi10K, but those are unusable in our case. Among the available datasets, COIL-100 [11] is the most suitable for our needs.

3.1. Dataset Description and Relevance

We use the COIL-100 dataset to carry out the study. COIL-100 was collected by the Center for Research on Intelligent Systems at the Department of Computer Science, Columbia University. It is an image dataset consisting of colour images of 100 different small objects taken at different angles. To be precise, the entire 360° view is divided into 72 positions, 5° apart, and an image is captured from each position.

The given images are normalized in size for homogeneity of image properties. The background of each image is black, and the objects exhibit a wide variety of complex geometric and reflectance characteristics. Even though this dataset was created for identifying the angular pose of an object, we use the same set of images to attempt a 360° object reconstruction.
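For readers who want to reproduce the setup, a minimal loading sketch is given below. It assumes the commonly used COIL-100 file naming convention objN__A.png, where A is the pose angle in degrees (0, 5, ..., 355); verify the convention against the downloaded copy of the dataset before relying on it:

```python
import cv2  # OpenCV; pip install opencv-python
from pathlib import Path

def load_object_views(root, obj_id):
    """Load all 72 views of one COIL-100 object, ordered by pose angle.

    Assumes files named objN__A.png with A in {0, 5, ..., 355}.
    """
    views = []
    for angle in range(0, 360, 5):
        path = Path(root) / f"obj{obj_id}__{angle}.png"
        img = cv2.imread(str(path))
        if img is None:
            raise FileNotFoundError(path)
        views.append((angle, img))
    return views

# Example: all 72 views of object 1.
# views = load_object_views("coil-100", 1)
```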

Figure 3 shows the various classes that are present in the COIL-100 dataset. There are 100 different objects that are part of the dataset. These objects are used in our study for replication purposes.

Figure 4 shows how each object is captured through different angles. For each object, all these images are taken and processed to generate models that are visually similar.

3.2. Testing and Evaluation Metrics

The final design criterion is deciding how to compare the results with the input images. The COIL-100 dataset provides images of various objects captured at different angles; hence, the modelled outputs can be visually compared against the input image set. Since there is no defined mechanism for comparing images with models, only a visual comparison of the models and the input image set is possible. In this paper, we give a step-by-step walkthrough for four different objects selected at random from the COIL-100 dataset.

4. Proposed System

The literature review in the earlier section showed several approaches currently practised for generating three-dimensional maps from two or more images. Since our primary objective is to produce paths (or vectors) as the end result, we cannot use volumetric or voxel-based reconstruction. Our approach has no target model; hence, there is no scope for using neural networks, even though they are good at giving predictive results. Considering these constraints, the overall framework is suggested as shown in Figure 5.

Our problem cannot be categorized as a simple triangulation or image binding method, as we have a series of images as input. Hence, the features present in each image play an important role in the reconstruction.

4.1. Steps of Proposed System

The steps of the proposed system are explained in simple terms below. The entire process is divided into two parts: the estimation process and the regeneration process. The algorithm discussed in the next section supports the following steps for the reconstruction process (a code sketch of Steps 2–4 follows this list):

Step 1. Preparing the input image sequence: a series of 72 images is taken per object as the input of the system. Any object in the COIL-100 dataset can be used as input.

Step 2. Classification of images: as far as we are concerned, there are two types of objects, Type A and Type B, as described in the introduction. The approaches for the two types differ; hence, we must classify each object as one or the other. The steps are to remove the background, extract edge descriptors, and find the variance of the edge features. If the variance is above a threshold T, the object is of Type B; otherwise, it is of Type A. A uniform shape (Type A) implies that the variation in the object’s edges will not be high.

Step 3. Texture removal: objects in general have their own outer border shape, referred to as the “outline” of the object throughout this paper. The proposed system extracts and reconstructs this outline, and the reconstruction process requires exact edges in its input. If an object has textures, designs, or contrast differences on its body, those low-level features must be removed, since any variation in texture could act as an object edge to the regeneration algorithm and cause the system to overfit. This step combines edge detection and adaptive thresholding: once the small edges (low-level features) are identified, adaptive thresholding fills the area with either of the binary colours, making the object image ready for defining its outer boundaries.

Step 4. Defining object boundaries: this step plays a key role in the entire process. For objects of Type A, the coordinate estimation ends here, as the average of all coordinate borders is smoothened and stored. For objects of Type B, individual extreme border paths are created: the leftmost and rightmost extremes define the distance magnitude of a pair of points. When all images of a given object are used, let N = 72; for the i-th image, the abovementioned extremes define the l_i and r_i data points, respectively, as shown in Figure 6. The generated values are stored as a new entry in a (3 × 72) matrix.

Step 5. Generating edge coordinates from object boundaries: the shape of the object is finalized in this step. Assuming the generated edges are of good accuracy, we define the entire object boundary and shape from the features extracted in the previous step. We use a regional edge linking process, which assumes that the edge points are well defined and ordered. Since we have the object boundaries of each image, the edge linking process is repeated over all images. We assume that the outline of a convex object is always a closed polygon. For each image, the edges are represented by an array of two-dimensional values: E_i = {(x_1, y_1), (x_2, y_2), ..., (x_k, y_k)}.

Step 6. Representing the features of all images: as per our goal, the coordinates must be represented mathematically in terms of vectors. The quality of the representation increases markedly after this step, as the generated matrix is normalized and its values are smoothened and converted into paths.

Step 7. Combining features and converting edges to mathematical paths: this is the most essential step in the reconstruction process. The normalized input matrix is converted back to image representations to find connectivity, using the proposed algorithm. This is repeated for all images, producing a file that can be opened in a CAD tool. The raster-to-vector conversion allows the new object to be resized to any extent.
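The sketch below illustrates one plausible reading of Steps 2–4, not the authors’ exact implementation: the edge descriptor (Canny edge density), the threshold values, and the variance cutoff T are all illustrative assumptions chosen for black-background COIL-100 images, and the paper’s adaptive thresholding is replaced by a global threshold plus contour filling.

```python
import cv2
import numpy as np

T = 1e-3  # variance cutoff for Type A/B classification (illustrative assumption)

def edge_density(img):
    """A simple per-view edge descriptor: fraction of Canny edge pixels."""
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    return cv2.Canny(gray, 50, 150).mean() / 255.0

def classify(views):
    """Step 2 (one plausible reading): Type A if the edge descriptor
    barely varies across the 72 views, Type B otherwise."""
    densities = [edge_density(img) for _, img in views]
    return "B" if np.var(densities) > T else "A"

def outline_mask(img):
    """Steps 2-3 (simplified): remove the black COIL-100 background and
    flatten interior texture so only the outer boundary survives."""
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    _, mask = cv2.threshold(gray, 10, 255, cv2.THRESH_BINARY)
    # Fill holes left by dark texture inside the silhouette.
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    filled = np.zeros_like(mask)
    cv2.drawContours(filled, contours, -1, 255, thickness=cv2.FILLED)
    return filled

def boundary_extremes(mask):
    """Step 4 (Type B): leftmost and rightmost object columns in one view,
    i.e. the l_i and r_i data points stored in the (3 x 72) matrix."""
    cols = np.where(mask.any(axis=0))[0]
    return int(cols.min()), int(cols.max())
```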

4.2. Image Outputs at Different Stages

The images show the effect of our processes at various steps; four objects at four different angles are given for easier comprehension. Figures 7(a)–7(e) show the input image set, background and texture removal, and edge detection. Figures 7(f)–7(h) show edge generation, combining features, and size normalization. Using the features extracted in the above steps, our algorithm generates the result shown in Figure 8. The shapes in the figure are mildly processed for easier visualization. Steps 6 and 7 cannot be visualized, as no image outputs are obtained from those steps.

4.3. Proposed Algorithm for Combining Images

The input to the algorithm below is the array of images captured at different angles (5° apart). A common feature is identified between pairs of images, which is then used for the depth analysis (an illustrative code sketch follows this list).

Step 1. Let i and j denote any two images in the input set. The epipoles are computed from the 2D features of the i-th and j-th images.

Step 2. The projection matrices are estimated using both images. The i-th view is used as the zero-orientation image, and the projection matrix for the j-th view is computed from the reference frame and epipole reconstruction along with the image feature matrix of the i-th image.

Step 3. An initial estimate of the 3D point coordinates is obtained through triangulation.

Step 4. For all remaining pairs of views, the estimation is done by repeating Steps 1 to 3. The result is an unrefined 3D coordinate estimate for each pair of images, with errors between the elements of the estimated result set.

Step 5. The obtained result set is optimized to refine the coordinates by comparison: the reprojection error of each estimated three-dimensional point is reduced, yielding better projective coordinates.

Step 6. The projective coordinates are transformed into three-dimensional metric coordinates by assuming a ground truth.

Step 7. Multiple 3D points are triangulated to estimate the three-dimensional structure.
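The sketch below illustrates Steps 1–3 for a single pair of views using standard OpenCV building blocks. It is not the authors’ implementation: it assumes a known intrinsic matrix K (recovering the pose via the essential matrix rather than the paper’s projective-then-metric route), and ORB feature matching is our substitute for the unspecified 2D features.

```python
import cv2
import numpy as np

def triangulate_pair(img_i, img_j, K):
    """Rough 3D point estimate from two views (Steps 1-3, simplified)."""
    gray_i = cv2.cvtColor(img_i, cv2.COLOR_BGR2GRAY)
    gray_j = cv2.cvtColor(img_j, cv2.COLOR_BGR2GRAY)

    # Detect and match 2D features between the i-th and j-th views.
    orb = cv2.ORB_create(1000)
    kp1, des1 = orb.detectAndCompute(gray_i, None)
    kp2, des2 = orb.detectAndCompute(gray_j, None)
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = matcher.match(des1, des2)
    pts1 = np.float32([kp1[m.queryIdx].pt for m in matches])
    pts2 = np.float32([kp2[m.trainIdx].pt for m in matches])

    # Steps 1-2: epipolar geometry and relative pose; the i-th view
    # serves as the zero-orientation reference frame.
    E, _ = cv2.findEssentialMat(pts1, pts2, K, method=cv2.RANSAC)
    _, R, t, _ = cv2.recoverPose(E, pts1, pts2, K)
    P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])  # reference view
    P2 = K @ np.hstack([R, t])                          # j-th view

    # Step 3: initial 3D coordinates by triangulation (homogeneous output).
    X = cv2.triangulatePoints(P1, P2, pts1.T, pts2.T)
    return (X[:3] / X[3]).T  # N x 3 Euclidean points
```

Repeating this over all view pairs and minimizing the reprojection error of the merged points would correspond to Steps 4 and 5.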

We can see that the algorithm runs for each pair of images with complexity O(n²), where n is the number of images used for feature construction.

5. Observations

Our primary requirement was to generate models as similar as possible to the input objects. Pixel matching based on mathematical accuracy measures cannot be done because the input and output representations are dissimilar. The system estimates stable posed 3D bounding boxes without additional 3D models.

5.1. Accuracy of the Proposed System

(1) The accuracy of the output depends on the number of images used in the process.
(2) Since the COIL-100 dataset has images of very low dimensions, the variation in the generated paths was very high; we suggest that larger image sizes will yield better accuracy.
(3) The objects of Type A have better visual accuracy than those of Type B.

5.2. Variations with Parameters

The only variable parameter is the number of images in the input array. Our observations show that the accuracy of the final output increases with the number of images. Table 1 shows a basic comparison of output quality against the number of images used, and Figure 9 presents the same comparison graphically.

5.3. Advantages

The figures above directly show the visual comparison of the expected and obtained output. The observed advantages are as follows:
(1) The process of computer-aided design of objects is sped up.
(2) Vector paths are produced instead of voxel or volumetric outputs.
(3) The discussed algorithm is universal in nature.
(4) Handheld and single cameras can be used.

5.4. Shortcomings

While proposing a new algorithm for a generic process, it is important to mention the observed disadvantages. The points below are the most important observations with respect to the shortcomings of the proposed system:
(1) The algorithm was implemented only on the COIL-100 dataset, in which the background of all images is black.
(2) The texture and volume of the objects are not available during the reconstruction process.
(3) It works only for convex-shaped objects, not for spaces.
(4) Only a preliminary outline of the object is generated; for the object model to be used in real-world applications, the generated shapes must be further processed.

5.5. Future Scope

We believe that this study is a primitive first attempt to automate the three-dimensional object generation process and will lead to a new area of research. There is still scope for repeating the work with another dataset, as all our images shared the same background. Furthermore, the generation of augmented views and coloured object outputs is yet to be achieved. Another future enhancement could be the inclusion of top- and bottom-aligned images to make a completely rotatable model.

6. Conclusion

As part of the big leap in the image processing domain, volumetric estimation and reconstruction algorithms are gaining popularity. As a primitive attempt to automate three-dimensional object reconstruction, our study suggests a framework for the process. Even though we could not produce a fully triple-axis rotatable model, owing to the unavailability of such images in the dataset, we were able to replicate the primitive shape and features of the input objects from a series of images. The suggested system can act as a replacement for manual designing, at least at the initial stages. We conclude that obtaining three-dimensional models is possible when a set of images taken around the object is given.

Data Availability

The data used to support the findings of this study are freely available at https://www.cs.columbia.edu/CAVE/software/softlib/coil-100.php.

Disclosure

The funder had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, or in the decision to publish the results.

Conflicts of Interest

The authors declare that there are no conflicts of interest.

Acknowledgments

The authors would like to thank the support from Taif University Researchers Supporting Project number TURSP-2020/211, Taif University, Taif, Saudi Arabia.