Volume 2013 (2013), Article ID 630579, 17 pages
Classification of Clothing Using Midlevel Layers
Department of Electrical and Computer Engineering, Clemson University, Clemson, SC 29634, USA
Received 31 December 2012; Accepted 26 January 2013
Academic Editors: K. K. Ahn, L. Asplund, and R. Safaric
Copyright © 2013 Bryan Willimon et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
We present a multilayer approach to classify articles of clothing within a pile of laundry. The classification features are composed of color, texture, shape, and edge information from 2D and 3D data within a local and global perspective. The contribution of this paper is a novel approach of classification termed L-M-H, more specifically L-C-S-H for clothing classification. The multilayer approach compartmentalizes the problem into a high (H) layer, multiple midlevel (characteristics (C), selection masks (S)) layers, and a low (L) layer. This approach produces “local” solutions to solve the global classification problem. Experiments demonstrate the ability of the system to efficiently classify each article of clothing into one of seven categories (pants, shorts, shirts, socks, dresses, cloths, or jackets). The results presented in this paper show that, on average, the classification rates improve by +27.47% for three categories (Willimon et al., 2011), +17.90% for four categories, and +10.35% for seven categories over the baseline system, using SVMs (Chang and Lin, 2001).
Sorting laundry is a common routine that involves classifying and labeling each piece of clothing, yet it is far from becoming an automated procedure. The laundry process consists of several steps: handling, washing, drying, separating/isolating, classifying, unfolding/flattening, folding, and putting each item away into a predetermined drawer or storage unit. Figure 1 gives a high-level flow chart of these various steps. In the past, several bodies of work have attempted to solve the tasks of handling [1–8], separating/isolating [8–12], classifying [6, 9, 11–15], unfolding/flattening [14, 16], and folding clothes.
A robotic classification system is designed to accurately sort a pile of clothes in predefined categories, before and after the washing/drying process. Laundry is normally sorted by individual, then by category. Our procedure allows for clothing to be classified/sorted by category, age, gender, color (i.e., whites, colors, darks), or season of use. The problem that we address in this paper is grouping isolated articles of clothing into a specified category (e.g., shirts, pants, shorts, cloths, socks, dresses, jackets) using midlevel layers (i.e., physical characteristics and selection masks).
Previous work on classifying clothing [6, 11–14] used dual manipulators to freely hang each article of clothing at two grasp points to reveal its overall shape. In contrast to previous work, our approach is tested on an unorganized, unflattened article for the purpose of classifying each individual piece of clothing. The database of clothing that we use in this paper consists of individual articles placed on a table by a single robotic manipulator.
Our approach uses a set of characteristics, a local/global histogram for each article, and low-level image/point cloud calculations in order to accomplish this task. The proposed method can be seen as a particular application of the paradigm of interactive perception, also known as manipulation-guided sensing, in which the manipulation is used to guide the sensing in order to gather information not obtainable through passive sensing alone [18–22]. The articles of clothing that we use in this paper have been manipulated in a predefined set of movements/operations in order to gather more information for our approach. Each article of clothing is initially placed on a flat surface and then pulled in four directions (i.e., up, down, left, right) which results in five unique configurations of each piece of laundry. In other words, deliberate actions change the state of the world in a way that simplifies perception and consequently future interactions.
In this paper, we demonstrate the characteristics-based approach, introduced in , for automatically classifying and sorting laundry. We extend that work by applying the approach to a larger database, which increases the number of midlevel characteristics as well as the number of categories. This paper presents the results of the algorithm as the complexity of the problem increases. Figure 2 and Table 1 illustrate examples of laundry items from several actual household hampers, which form the data set used in this paper. Our new approach comprises a multilayer classification strategy consisting of SVMs  and a bag-of-words model , similar to , that utilizes characteristics and selection masks to connect each calculated low-level feature to the correct category for each article of clothing.
2. Previous Work
There has been previous work in robot laundry related to the research in this paper. Kaneko and Kakikura  focused on a strategy to isolate and classify clothing. The work in  is an extension of the authors’ pioneering work in . In these papers, the authors categorize the steps and tasks needed for laundry (isolate, unfold, classify, fold, and put away clothes), as in Figure 1. The authors classify each object into one of three groups (shirt, pants, or towel) by deforming the article of clothing. Deforming the object aims to find hemlines to use as grasp points; the grasp points cannot be closer together than a predefined threshold. Once the object is held at two grasp points, its shape is broken up into different regions, and the values of those regions are used to calculate feature values. Each feature value is compared against a threshold, and based on whether the features fall above or below the thresholds, the object is classified as a shirt, pants, or towel using a decision tree. The results achieved a 90% classification rate.
Osawa et al.  use two manipulators to pick up/separate a piece of cloth, unfold it, and classify it. The approach in  begins by picking up a piece of cloth and holding it with one manipulator. While the item is held by the first manipulator, the second manipulator grasps the lowest point of the cloth as a second grasp point, spreading the article of clothing between its two most distant corner points. The article of clothing, now held by both robot arms, is imaged by a camera, and the image is converted into a binary image using background subtraction. This binary image is then compared with binary templates to classify the article of clothing. The approach was tested with eight different categories (male short sleeve shirts, women/child short sleeve shirts, long sleeve shirts, trousers, underwear shorts, bras, towels, and handkerchiefs). Fifty trials were conducted with clothing of the same type as the training data, achieving an average of 96% classification.
Kita et al.  extend their original work  on determining the state of clothing. The work in  provides a three-step process to successfully find a grasp point that the humanoid robot can grab: (1) clothing state estimation (previous work ); (2) calculation of the theoretically optimal position and orientation of the hand for the next action; and (3) modification of position and orientation, based on the robot’s limitations, so the action can be executed. The 3D model is used to calculate the orientation of the shirt model at the grasp point: it gives three vectors around the grasp point, which are used in the orientation calculation. The authors tested their approach on six scenarios and were successful in five. One scenario failed in predicting the correct clothing state, four scenarios were successful but had to use step (3) for completion, and the last scenario was successful without using step (3).
The results in  are another extension of Y. Kita and N. Kita’s previous work . The purpose is to accurately determine the shape/configuration of a piece of clothing based on the observed 3D data and a simulated 3D model. As a first step, each of the clothing states is compared against the observed 3D data based on the length of the clothing. Each shape state within a threshold value is selected for use in the next step of the process. Then, a correlation function is applied to the 2D plane of the observed and simulated data to determine which shape state model is closest to the observed data. The experimental results show that 22 out of 27 attempts were classified as the correct shape state.
Cusumano-Towner et al.  present an extension of Osawa et al.  and Kita et al.  combined. The authors identified a clothing article by matching silhouettes of the article to precomputed simulated articles of clothing. The silhouettes from each iteration were placed into a hidden Markov model (HMM) to determine how well the silhouettes of the current article match the simulated models of all categories of clothing. The authors placed the article on a table by sliding the clothing along the edge of the table to maintain the configuration it had while hanging in the air. The authors grasped two locations on the cloth during each iteration. The next set of grasp points was planned using a directed grasp-ability graph that locates all grasp points the system can reach. The authors tested the end-to-end approach on seven articles of clothing and were successful in 20 out of 30 trials.
Our work differs from the previous papers [6, 11–14] in that we work with a single manipulator and our results do not depend on hanging the clothing in the air by two grasp points. Our approach is independent of the type and number of manipulators used to collect the data. Our work also differs from that of [6, 13, 14] in that we do not require or use predefined shape models. We instead use novel characteristic features that describe each category of clothing. The characteristic features are not predefined but are learned by training the algorithm with previously collected clothing data.
Our current work extends our previous work  by expanding the midlevel layer with more characteristics and incorporating more categories in the high layer. We expand the database of clothing used in our previous paper from 117 articles to over 200 articles of various sizes, colors, and types. We then compare how much our algorithm improves over a baseline system of SVMs, one per category, in classifying a pile of laundry into three, four, and seven groups (Tables 3 and 4).
We call the proposed algorithm L-C-S-H, for low-level-characteristics-selection mask-high level. Low-level refers to the features that are calculated from the images of each article (e.g., 2D shape, 2D color, 2D texture, 2D edges, and 3D shape). Characteristics refer to the attributes that are found on generic articles of clothing (e.g., pockets, collars, hems). Selection mask refers to a vector that describes what characteristics are most useful for each category (e.g., collars and buttons are most useful for classifying shirts), used as a filter. High level refers to the categories that we are attempting to classify (e.g., shirts, socks, dresses). Figure 3 illustrates the path of the L-C-S-H process. The L-C-S-H algorithm is described in detail in later sections.
3.2. Clothing Database
The database that we used consists of over 200 articles of clothing in over 1000 total configurations. The entire database is separated into seven categories and more than three dozen subcategories. Each article of clothing is collected from real laundry hampers to better capture actual items encountered in the real world. Each one is laid flat on a table in a canonical position and dropped on the table in a crumpled position from four random grasp points for a total of five instances per article. Figure 1 illustrates seven instances of clothing for each of the seven categories that we used.
The database contains 2D color images, depth images, and 3D point clouds. For this paper, we will utilize the assortment of different types of items, such as shirts, cloths, pants, shorts, dresses, socks, and jackets, that might be encountered by a domestic robot involved in automating household laundry tasks. Figure 2 illustrates an example of a single color image, a single depth image, and a single point cloud for one article.
3.3. Low-Level Features
The L component of the approach uses the low-level features to estimate whether the article does or does not have a particular characteristic. The low-level features used in this approach consist of a color histogram (CH), histogram of line lengths (HLL), table point feature histogram (TPFH), boundary, scale-invariant feature transform (SIFT), and fast point feature histogram (FPFH). Each low-level feature is described in detail in the following subsections. The database that we used consisted of five instances of each article (see Section 3.2). To combine the low-level features of all five instances into a single value or histogram, we averaged each individual value and, in the case of a histogram, its immediate neighbors. Equation (1) gives this computation: the combined score of the article is the average, over the number of instances of each article (we use five), of the current value within the histogram together with its immediate neighbors to the left and right.
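As a concrete illustration, the instance-combination step can be sketched as follows. This is a minimal interpretation of (1), whose symbols were lost in extraction, so the boundary-bin handling is an assumption.

```python
import numpy as np

def combine_instances(histograms):
    """Combine the feature histograms of all instances of one article
    (we use five) into a single histogram by averaging each bin together
    with its immediate left/right neighbors across all instances.
    Boundary bins simply use the neighbors that exist (an assumption)."""
    H = np.asarray(histograms, dtype=float)      # shape: (n_instances, n_bins)
    n_bins = H.shape[1]
    combined = np.empty(n_bins)
    for j in range(n_bins):
        lo, hi = max(0, j - 1), min(n_bins, j + 2)   # bins j-1, j, j+1 clipped
        combined[j] = H[:, lo:hi].mean()
    return combined
```

The same routine applies to single-value features by treating them as one-bin histograms.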
For the part of the algorithm that converts from low-level features to characteristics, we compared the low-level features, described in the following subsections, to the various characteristics. Since the characteristics are binary values, we used libSVM  to solve the two-class problem: each low-level feature determines whether the characteristic falls into class 1 (positive instances) or class 2 (negative instances).
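A sketch of this per-characteristic two-class training, using scikit-learn's SVC (which wraps libSVM); the RBF kernel and the toy data are assumptions, not details from the paper.

```python
import numpy as np
from sklearn.svm import SVC

def train_characteristic_svm(features, has_characteristic):
    """One binary SVM per midlevel characteristic (e.g. "has collar").
    Labels: 1 = positive instance (class 1), 0 = negative (class 2)."""
    clf = SVC(kernel="rbf")                     # kernel choice is an assumption
    clf.fit(features, has_characteristic)
    return clf

# Toy stand-in data; a real run would use the low-level feature vectors.
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 8))
y = (X[:, 0] > 0).astype(int)                   # toy "characteristic" label
clf = train_characteristic_svm(X, y)
```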
For an article of clothing, we capture an RGB image and a raw depth map, from which we obtain a 3D point cloud. We perform background subtraction on the RGB image to yield an image of only the object, using graph-based segmentation  (see Section 3.3.1). Once the object is isolated within the image, multiple features are calculated from the RGB image and the 3D point cloud. These features, discussed in later sections, capture 2D shape, 2D color, 2D texture, 2D edges, and 3D shape for both global and local regions of the object.
3.3.1. Graph-Based Segmentation
We use Felzenszwalb and Huttenlocher’s graph-based segmentation algorithm  because of its straightforward implementation, effective results, and efficient computation. This algorithm uses a variation of Kruskal’s minimum spanning tree algorithm to iteratively cluster pixels in decreasing order of their similarity in appearance. An adaptive estimate of the internal similarity of the clusters is used to determine whether to continue clustering. Figure 4 shows the results of graph-based segmentation on an example 640 × 480 RGB color image taken in our lab, using the default value of the scale parameter. As can be seen, the segmentation provides a reasonable representation of the layout of the items in the pile. From this result, we determine the foreground regions as those that do not touch the boundary of the image.
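A hedged sketch of this step using scikit-image's implementation of the same Felzenszwalb-Huttenlocher algorithm; the parameter values here are placeholders, since the paper's defaults are not given.

```python
import numpy as np
from skimage.segmentation import felzenszwalb

def foreground_segments(rgb):
    """Segment an RGB image with Felzenszwalb-Huttenlocher and keep the
    regions that do not touch the image boundary (treated as foreground).
    scale/sigma/min_size are placeholder values, not the paper's."""
    labels = felzenszwalb(rgb, scale=100, sigma=0.8, min_size=50)
    border = np.unique(np.concatenate([labels[0, :], labels[-1, :],
                                       labels[:, 0], labels[:, -1]]))
    return labels, [l for l in np.unique(labels) if l not in border]
```

On a toy image with a bright square in the middle, the square's region is returned as foreground while the background region, which touches the border, is discarded.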
3.4. Global Features
3.4.1. Color Histogram (CH)
A CH is a representation of the distribution of the colors in a region of an image, derived by counting the number of pixels with a given set of color values. CHs are chosen in this work because they are invariant to translation and rotation about the viewing axis, and for most objects they remain stable despite changes in viewing direction, scale, and 3D rotation. The CH is used to distinguish, for example, between lights and darks, as well as denim. We compute the CH of each object in HSV color space, which is obtained by converting the RGB space to hue, saturation, and value. Equations (2)–(6) display the equations needed to compute the hue of a pixel. When the hue is undefined, the first bin in the hue color channel is incremented to keep track of all undefined instances. Equations (7)-(8) display the equations needed to compute the value and saturation of a pixel, respectively. We use 15 bins for the hue color channel and 10 bins each for the saturation and value color channels, leading to 35 total bins. The hue ranges from 0 to 360 degrees; the saturation and the value both range from 0 to 1. Figure 5 illustrates the difference between an article of clothing with light colored fabric and one with dark colored fabric.
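The binning just described (15 hue bins, 10 saturation bins, 10 value bins, undefined hue mapped to the first hue bin) can be sketched as:

```python
import colorsys
import numpy as np

def color_histogram(rgb_pixels):
    """35-bin HSV color histogram: bins 0-14 hue, 15-24 saturation,
    25-34 value.  Gray pixels (R = G = B) have undefined hue and are
    counted in the first hue bin, as described in the text."""
    hist = np.zeros(35)
    for r, g, b in rgb_pixels:                  # r, g, b in [0, 1]
        h, s, v = colorsys.rgb_to_hsv(r, g, b)
        if max(r, g, b) == min(r, g, b):        # hue undefined
            hist[0] += 1
        else:
            hist[min(int(h * 15), 14)] += 1
        hist[15 + min(int(s * 10), 9)] += 1
        hist[25 + min(int(v * 10), 9)] += 1
    return hist
```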
3.4.2. Histogram of Line Lengths (HLL)
The histogram of line lengths (HLL) is a novel feature that we introduce to help distinguish between stripes, patterns, plaid, and so forth. For this, we use the object image as before (after background subtraction) and compute the Canny edges , then erode with a structuring element of ones to remove effects of the object boundary. We then compute the length of each Canny edge using Kimura et al.’s  method, which uses diagonal movements and isothetic movements to accurately calculate the length of a pixelated edge within an image, see (9). Diagonal movements move from the center pixel to one of the four corners within a window, and isothetic movements move directly left, right, up, or down, see Figure 6. These pixel lengths are used to compute a histogram of 20 bins that range from 0 to 1000 pixels, so each bin captures lengths within 50 pixels. Lengths greater than 1000 are mapped to the last bin. Figure 7 illustrates the difference between articles of clothing with no texture, with stripes, and with plaid.
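A sketch of the HLL binning plus a simple chain-code length estimate; the √2 weighting of diagonal moves is a common approximation and only stands in for Kimura et al.'s exact formula (9).

```python
import numpy as np

def histogram_of_line_lengths(edge_lengths_px):
    """20-bin HLL: each bin spans 50 px of the 0-1000 px range; lengths
    beyond 1000 px fall into the last bin."""
    hist = np.zeros(20)
    for length in edge_lengths_px:
        hist[min(int(length // 50), 19)] += 1
    return hist

def chain_length(isothetic_moves, diagonal_moves):
    """Edge length from pixel moves: 1 per isothetic move and sqrt(2)
    per diagonal move (a stand-in for Kimura et al.'s weighting)."""
    return isothetic_moves + np.sqrt(2) * diagonal_moves
```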
3.4.3. Table Point Feature Histogram (TPFH)
The TPFH consists of a 263-dimension array of float values resulting from three 45-value subdivisions, calculated from an extended fast point feature histogram (eFPFH), and a 128-value subdivision for table angle information. This feature is a variant of the viewpoint feature histogram (VFH) . The eFPFH values are calculated by taking the difference between the estimated normal of each point and the estimated normal of the object’s centerpoint, after projecting the normals onto three orthogonal planes. The differences between the two normals yield three values, each ranging from 0 to 180 degrees (4 degrees per bin), which are placed within the bins of the three 45-value histograms of the eFPFH. Figure 8 illustrates the graphical representation of the normals and how they are calculated. The 128-value table histogram is computed by finding the angle between each normal vector and the translated central table vector for each point; the central table vector is translated to the same point as the normal currently being computed. Figure 8 illustrates the difference in direction for each vector. In TPFH, the eFPFH component is translation, rotation, and scale invariant, while the table component is only scale invariant; both are 3D representations of the local shape. Figure 9 illustrates how the TPFH values differ visually between two separate categories.
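The table component, the 128-bin histogram of angles between point normals and the table vector, can be sketched as follows; the 0-180 degree angle range is inferred from the bin counts and is an assumption.

```python
import numpy as np

def table_angle_histogram(normals, table_normal, n_bins=128):
    """128-bin histogram of the angle between each point normal and the
    (translated) central table vector, as in the TPFH table component."""
    table_normal = table_normal / np.linalg.norm(table_normal)
    normals = normals / np.linalg.norm(normals, axis=1, keepdims=True)
    cos = np.clip(normals @ table_normal, -1.0, 1.0)
    angles = np.degrees(np.arccos(cos))          # 0..180 degrees
    bins = np.minimum((angles / 180.0 * n_bins).astype(int), n_bins - 1)
    return np.bincount(bins, minlength=n_bins).astype(float)
```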
3.4.4. Boundary
The boundary feature captures 2D shape information by storing the Euclidean distances from the centroid of each article to its boundary. First, the centroid of each binary image containing the object (after background subtraction) is calculated. Then, starting at the angle of the major axis found by principal component analysis, 16 angles spaced 22.5 degrees apart, ranging from 0 to 360 degrees (i.e., 0 to 337.5 degrees), are calculated around the object. For each angle, we measure the distance from the centroid to the furthest boundary pixel, see Figure 10.
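A simplified sketch of the boundary feature; for brevity it measures the farthest object pixel per 22.5 degree sector starting from angle 0 rather than from the PCA major axis, which is an omission relative to the text.

```python
import numpy as np

def boundary_feature(mask, n_angles=16):
    """Distance from the object's centroid to the farthest object pixel
    in each of 16 angular sectors (22.5 degrees apart)."""
    ys, xs = np.nonzero(mask)
    cy, cx = ys.mean(), xs.mean()
    angles = np.arctan2(ys - cy, xs - cx)        # angle of every object pixel
    dists = np.hypot(ys - cy, xs - cx)
    bins = ((angles % (2 * np.pi)) / (2 * np.pi) * n_angles).astype(int) % n_angles
    feat = np.zeros(n_angles)
    for b, d in zip(bins, dists):
        feat[b] = max(feat[b], d)                # farthest pixel per sector
    return feat
```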
3.5. Local Features
3.5.1. SIFT Descriptor
The SIFT (scale-invariant feature transform ) descriptor is used to gather useful 2D local texture information. The SIFT descriptor locates points on the article (after background subtraction) that are local extrema when convolved with a Gaussian function. These points are then used to calculate a histogram of gradients (HoG) from the neighboring pixels. The descriptor consists of a 128-value feature vector that is scale and rotation invariant. Figure 11 illustrates the SIFT feature points found on an article; the arrows represent the orientation and magnitude of the feature points. After all of the feature points are found on an article, the SIFT descriptors are placed within a bag-of-words model  to calculate 100 codewords. These codewords provide each article with a 100-element feature vector representing the local texture information.
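The bag-of-words step can be sketched with k-means codewords; the descriptors below are random stand-ins for real SIFT output, and the 10-word codebook is scaled down from the paper's 100 for brevity.

```python
import numpy as np
from sklearn.cluster import KMeans

def bow_histogram(descriptors, kmeans):
    """Map a set of 128-D local descriptors (e.g. SIFT) to a normalized
    histogram over the learned codewords."""
    words = kmeans.predict(descriptors)
    hist = np.bincount(words, minlength=kmeans.n_clusters).astype(float)
    return hist / hist.sum()

# Toy stand-in for SIFT descriptors; a real pipeline would extract them
# from the segmented article images.
rng = np.random.default_rng(1)
train_desc = rng.normal(size=(300, 128))
codebook = KMeans(n_clusters=10, n_init=10, random_state=0).fit(train_desc)
h = bow_histogram(rng.normal(size=(50, 128)), codebook)
```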
3.5.2. FPFH Descriptor
The FPFH (fast point feature histogram ) descriptor is used to gather local 3D shape information. The FPFH descriptor utilizes the 3D point cloud and background subtraction for each article to segment the article from the background of the point cloud. For each 3D point, a simple point feature histogram (SPFH) is calculated by taking the difference of the normals between the current point and its neighboring points within a radius. Figure 12 illustrates an example of a 3D point along with its neighbors. The radius is precomputed for each point cloud to best capture local shape information. Once all of the SPFHs are computed, the FPFH descriptor of each point is found by adding the SPFH of that point to a weighted sum of the SPFHs of its neighbors, see (10). The FPFH is a histogram of 33 values, that is, three sets of 11 bins for three different orthogonal planes.
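A sketch of the weighted combination in (10), assuming the standard FPFH form of the point's own SPFH plus an inverse-distance-weighted average of its neighbors' SPFHs.

```python
import numpy as np

def fpfh_from_spfh(spfh, point, neighbors, dists):
    """FPFH of one point: its own SPFH plus the average of its
    neighbors' SPFHs weighted by inverse distance.  `spfh` maps a point
    index to its SPFH histogram (33 values in the paper)."""
    acc = np.zeros_like(spfh[point])
    for i, d in zip(neighbors, dists):
        acc += spfh[i] / d                       # closer neighbors weigh more
    return spfh[point] + acc / len(neighbors)
```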
3.5.3. Final Input Feature Vector
Once the features are computed, the global features are concatenated to create a histogram of 35 + 20 + 263 + 16 = 334 values. For the local features, SIFT and FPFH are calculated separately through bag-of-words to get two 100-element histograms of codewords; concatenated, these yield 200 values. Concatenating the global and local features yields 534 values, which are then fed to the multiple one-versus-all SVMs.
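The final concatenation can be sketched directly (zeros stand in for real feature values):

```python
import numpy as np

# Global features: CH (35) + HLL (20) + TPFH (263) + boundary (16) = 334.
# Local features: two 100-codeword bag-of-words histograms = 200.
ch, hll = np.zeros(35), np.zeros(20)
tpfh, boundary = np.zeros(263), np.zeros(16)
sift_bow, fpfh_bow = np.zeros(100), np.zeros(100)

global_feat = np.concatenate([ch, hll, tpfh, boundary])
local_feat = np.concatenate([sift_bow, fpfh_bow])
input_vector = np.concatenate([global_feat, local_feat])   # fed to the SVMs
```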
3.6. Midlevel Layers
3.6.1. Characteristics
The C component of the approach uses characteristics found on everyday clothing. We use a binary vector of 27 values, one per characteristic, to learn which attributes are needed to separate categories such as shirts and dresses. The 27 characteristics that were chosen are shown in Table 2, along with the number of instances of each characteristic in the database. Figure 13 illustrates the breakdown of how much each characteristic is used per category. Light red blocks represent small percentages, dark purple blocks represent high percentages, and blank areas represent a percentage of zero; the color scheme changes from red to purple as the percentage increases.
3.6.2. Selection Mask
The S component of the approach uses the characteristics to determine which subset is best suited for each category. The selection masks are stored as binary values: 0 (FALSE) means that the category does not need that particular characteristic, and 1 (TRUE) means that it does.
We aimed to determine which characteristics have higher importance and which have lower importance for each category. Therefore, we zeroed one characteristic at a time and evaluated the resulting classification percentage over the remaining characteristics. The percentage was determined by comparing the characteristic vector of each article against the mean vectors of each category, which are calculated in Section 3.7. If the current percentage was higher than or equal to the previously calculated percentage, then we zeroed that particular characteristic for the next iteration.
Importance values closer to 1.00 represent a positive type of instance, and importance values closer to 0.00 represent a negative type of instance. Importance values closer to 1.00 indicate that one particular category is likely to contain a characteristic more often than the other categories. Importance values closer to 0.00 indicate that one particular category is likely not to contain a characteristic that the other categories do contain. Positive importance values can be converted to +1 and negative importance values to −1 to better understand how each characteristic contributes to a category.
The next iteration repeated the same process of removing one characteristic at a time, with the previously chosen characteristics already removed. After permanently removing several characteristics, the resulting classification percentage rose to a peak and then decreased as the remaining characteristics were removed. We then created a binary vector for each category containing a 1 (TRUE) if the characteristic has a higher importance and a 0 (FALSE) if it has a lower importance. Algorithm 1 steps through the C-S process.
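Algorithm 1 is not reproduced here, but the greedy selection just described can be sketched as backward elimination; `eval_fn` is a hypothetical placeholder for the classification-rate computation of Section 3.7.

```python
import numpy as np

def greedy_selection_mask(eval_fn, n_chars):
    """Backward elimination sketch of the C-S step: repeatedly zero the
    characteristic whose removal does not hurt (or improves) the
    classification rate, stopping when every remaining removal hurts.
    eval_fn(mask) must return the classification rate under a 0/1 mask."""
    mask = np.ones(n_chars, dtype=int)
    best = eval_fn(mask)
    improved = True
    while improved and mask.sum() > 1:
        improved = False
        for i in np.flatnonzero(mask):
            trial = mask.copy()
            trial[i] = 0
            score = eval_fn(trial)
            if score >= best:                    # keep removals that don't hurt
                best, mask, improved = score, trial, True
                break
    return mask

# Toy objective: characteristics 0 and 1 matter, the rest are noise.
toy = lambda m: float(m[0] + m[1]) - 0.1 * m[2:].sum()
mask = greedy_selection_mask(toy, 5)
```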
3.7. High-Level Categories
The H component of the approach averages the binary characteristic vectors of the training articles within each category to create a mean vector that serves as a descriptor for that category. In other words, all shirts have their characteristic vectors averaged together, all dresses have theirs averaged, and so on, creating seven unique mean vectors, one for each category. The selection mask is then multiplied by the characteristic vector to zero out any characteristics with a negative importance.
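A sketch of the H step, under the assumption that an article is assigned to the category whose masked mean vector is closest; the paper compares vectors to category means, but the exact distance measure is not stated, so Euclidean distance here is an assumption.

```python
import numpy as np

def category_means(char_vectors, labels, n_categories):
    """Average the binary characteristic vectors of all training
    articles in each category to get one mean descriptor per category."""
    return np.array([np.mean([v for v, l in zip(char_vectors, labels) if l == c],
                             axis=0) for c in range(n_categories)])

def classify(char_vec, means, masks):
    """Score each category by the distance between the masked article
    vector and the masked category mean; the smallest distance wins."""
    d = [np.linalg.norm(mask * (char_vec - mean))
         for mean, mask in zip(means, masks)]
    return int(np.argmin(d))
```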
4. Experimental Results
The proposed approach was applied to a laundry scenario to test its ability to perform practical classification. In each experiment, the entire database consisted of 85 shirts, 30 cloths, 25 pants, 25 shorts, 10 dresses, 22 socks, and 5 jackets. The database was labeled using a supervised learning approach so that the corresponding category of each image was known. Our approach illustrates the usefulness of midlevel characteristics (attributes) in grouping clothing categories that are difficult to classify. First, we consider the baseline approach that uses SVMs to directly classify each category without midlevel characteristics. Then, we introduce characteristics as an alternate mapping from low-level features to high-level categories.
4.1. Baseline Approach without Midlevel Layers
This experiment demonstrates the baseline approach using the low-level features to classify each article of clothing with support vector machines (SVMs). We consider various aspects of the SVM approach to better understand how the articles are categorized. This experiment uses local, global, and combined sets of low-level features as input for each SVM used to classify a particular category. Since there are seven categories to choose from, seven SVMs are trained, each providing a probability value for each input vector from the test set. For this type of multiclass classification, we chose the one-versus-all (OVA) approach, which uses a single SVM to compare each class with all other classes combined. Previous researchers [32–35] suggest that for sparse data sets, the OVA approach provides better results than the max-wins-voting (MWV) approach, which uses a single SVM for each pair of classes in a voting scheme.
Each experiment was conducted using 5-fold cross validation (CV)  to completely test this stage of the algorithm. We used 80% of each category for training and 20% for testing, with the train and test sets being mutually exclusive for each validation. More specifically, 80% of the database that had a 1 (TRUE) for that category and 80% of the database that had a 0 (FALSE) for that category made up the training set, and the rest of the database went to the test set. The resulting value from each SVM is compared against a deciding threshold of 0.
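The baseline protocol (one OVA SVM per category, 5-fold CV with an 80/20 split per fold) can be sketched as follows, with toy data in place of the 534-element feature vectors.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import SVC

# Toy stand-in data: one binary "category" label per sample.
rng = np.random.default_rng(2)
X = rng.normal(size=(100, 10))
y = (X[:, 0] > 0).astype(int)

# 5-fold CV: each fold trains on 80% of positives and negatives and
# tests on the held-out 20%, mirroring the split described in the text.
rates = []
for train_idx, test_idx in StratifiedKFold(n_splits=5).split(X, y):
    clf = SVC().fit(X[train_idx], y[train_idx])
    rates.append((clf.predict(X[test_idx]) == y[test_idx]).mean())
mean_rate = float(np.mean(rates))
```

A full seven-category baseline would repeat this loop once per category, one OVA SVM each.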
4.2. Testing Approach with Characteristics (L-C-H)
For our experiment with L-C-H, we used 5-fold CV to test this stage of the algorithm, as in the previous experiments (Tables 5–8). The goal of this experiment is to determine the increase of overall performance with the addition of characteristics. The training set still consists of the ground truth characteristics to compute the mean vectors for each category.
4.3. L-C-H versus L-C-S-H
For our experiment with L-C-S-H, we used 5-fold CV to test this stage of the algorithm, as in the previous experiments. The goal of this experiment is to determine the increase in overall performance with the addition of a selection mask between the characteristics and the high-level categories. The training set still consists of the ground truth characteristics used to compute the mean vectors for each category. The threshold at which performance peaks varies based on the low-level features used and the category being tested.
Figure 14 illustrates the resulting TPR (true positive rate) as the most negatively important characteristic is removed at each iteration, using local and global features. Each category demonstrates an increase in percentage until a threshold has been reached; this threshold is chosen to coincide with the fewest remaining characteristics that produce the highest TPR. After the threshold, the percentage eventually decreases to 0% as the rest of the characteristics are zeroed out. Thus, zeroing out a subset of negatively important characteristics improves the overall TPR. Only the characteristic values of the mean category vectors in the training set are zeroed out: the training vectors go through this process to better describe how each category is characterized, while the testing vectors are left unchanged and compared for closeness to each mean vector.
We have proposed a novel method to classify clothing (L-M-H, more specifically L-C-S-H) in which midlevel layers (i.e., physical characteristics and selection masks) are used to classify categories of clothing. The proposed system uses a database of 2D color images and 3D point clouds of clothing captured from a Kinect sensor. The paper focuses on determining how well this new approach improves classification over a baseline system. The database of clothing that we use in our experiments contains articles whose configurations make it difficult to determine their corresponding category.
The overall improvement of this new approach illustrates the critical importance of middle-level information. The addition of middle-level features, termed characteristics and selection masks for this problem, improved classification, on average, by +27.47% for three categories, +17.90% for four categories, and +10.35% for seven categories, see Table 9. Increasing the number of categories increases the number of characteristics for the algorithm to classify. With the increase in complexity, the average true positive rate decreases due to the difficulty of the problem.
Using this approach could improve other classification/identification problems such as object recognition, object identification, and person recognition. The middle level is not limited to two layers of features (e.g., characteristics and selection masks); one or more layers may be used, and adding further features between the low and high levels may increase the classification rate. Layers of filters that distinguish between categories could include separating adult clothing from child clothing, or male clothing from female clothing.
Future extensions of this approach include classification of subcategories (e.g., types of shirts, types of dresses), age, gender, and the season of each article of clothing. The experiments used a subset of characteristics that are useful for each category; the number of characteristics used in the midlevel layer may correspond to the resulting classification percentages. This approach could also be applied to separating laundry into darks, colors, and whites before it is placed into the washing machine, and it can extend to grouping and labeling other rigid and nonrigid items beyond clothing.
This research was supported by the US National Science Foundation under Grants IIS-1017007 and IIS-0904116.
- S. Hata, T. Hiroyasu, J. Hayash, H. Hojoh, and T. Hamada, “Flexible handling robot system for cloth,” in Proceedings of the IEEE International Conference on Mechatronics and Automation (ICMA '09), pp. 49–54, August 2009.
- Y. Yoshida, J. Hayashi, S. Hata, H. Hojoh, and T. Hamada, “Status estimation of cloth handling robot using force sensor,” in Proceedings of the IEEE International Symposium on Industrial Electronics (ISIE '09), pp. 339–343, July 2009.
- K. Salleh, H. Seki, Y. Kamiya, and M. Hikizu, “Tracing manipulation in clothes spreading by robot arms,” Journal of Robotics and Mechatronics, vol. 18, no. 5, pp. 564–571, 2006.
- Y. Kita and N. Kita, “A model-driven method of estimating the state of clothes for manipulating it,” in Proceedings of the 6th Workshop on Applications of Computer Vision, pp. 63–69, 2002.
- Y. Kita, F. Saito, and N. Kita, “A deformable model driven visual method for handling clothes,” in Proceedings of the IEEE International Conference on Robotics and Automation, pp. 3889–3895, May 2004.
- Y. Kita, T. Ueshiba, E. Neo, and N. Kita, “A method for handling a specific part of clothing by dual arms,” in Proceedings of the Conference on Intelligent Robots and Systems (IROS '09), pp. 3403–3408, 2009.
- P. Gibbons, P. Culverhouse, and G. Bugmann, “Visual identification of grasp locations on clothing for a personal robot,” in Proceedings of the Conference Towards Autonomous Robotics Systems (TAROS '09), pp. 78–81, August 2009.
- H. Kobayashi, S. Hata, H. Hojoh, T. Hamada, and H. Kawai, “A study on handling system for cloth using 3-D vision sensor,” in Proceedings of the 34th Annual Conference of the IEEE Industrial Electronics Society (IECON '08), pp. 3403–3408, November 2008.
- B. Willimon, S. Birchfield, and I. Walker, “Classification of clothing using interactive perception,” in Proceedings of the IEEE International Conference on Robotics and Automation (ICRA '11), pp. 1862–1868, 2011.
- M. Kaneko and M. Kakikura, “Planning strategy for unfolding task of clothes—isolation of clothes from washed mass,” in Proceedings of the 13th annual conference of the Robotics Society of Japan (RSJ '96), pp. 455–456, 1996.
- M. Kaneko and M. Kakikura, “Planning strategy for putting away laundry—isolating and unfolding task,” in Proceedings of the IEEE International Symposium on Assembly and Task Planning (ISATP '01), pp. 429–434, May 2001.
- F. Osawa, H. Seki, and Y. Kamiya, “Unfolding of massive laundry and classification types by dual manipulator,” Journal of Advanced Computational Intelligence and Intelligent Informatics, vol. 11, no. 5, pp. 457–463.
- Y. Kita, T. Ueshiba, E. S. Neo, and N. Kita, “Clothes state recognition using 3d observed data,” in Proceedings of the IEEE International Conference on Robotics and Automation (ICRA '09), pp. 1220–1225, May 2009.
- M. Cusumano-Towner, A. Singh, S. Miller, J. F. O’Brien, and P. Abbeel, “Bringing clothing into desired configurations with limited perception,” in Proceedings of the International Conference on Robotics and Automation, May 2011.
- B. Willimon, I. Walker, and S. Birchfield, “A new approach to clothing classification using mid-level layers,” in Proceedings of the IEEE International Conference on Robotics and Automation (ICRA '13), 2013.
- B. Willimon, S. Birchfield, and I. Walker, “Model for unfolding laundry using interactive perception,” in Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS '11), 2011.
- S. Miller, M. Fritz, T. Darrell, and P. Abbeel, “Parametrized shape models for clothing,” in Proceedings of the International Conference on Robotics and Automation, pp. 4861–4868, May 2011.
- D. Katz and O. Brock, “Manipulating articulated objects with interactive perception,” in Proceedings of the IEEE International Conference on Robotics and Automation (ICRA '08), pp. 272–277, May 2008.
- J. Kenney, T. Buckley, and O. Brock, “Interactive segmentation for manipulation in unstructured environments,” in Proceedings of the International Conference on Robotics and Automation (ICRA '09), pp. 1377–1382, 2009.
- P. Fitzpatrick, “First contact: an active vision approach to segmentation,” in Proceedings of the International Conference on Intelligent Robots and Systems (IROS '03), pp. 2161–2166, 2003.
- B. Willimon, S. Birchfield, and I. Walker, “Rigid and non-rigid classification using interactive perception,” in Proceedings of the 23rd IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS '10), pp. 1728–1733, October 2010.
- R. B. Willimon, Interactive perception for cluttered environments [M.S. thesis], Clemson University, 2009.
- C. C. Chang and C. J. Lin, LIBSVM: A Library for Support Vector Machines, 2001.
- H. M. Wallach, “Topic modeling: beyond bag-of-words,” in Proceedings of the 23rd International Conference on Machine Learning (ICML '06), pp. 977–984, June 2006.
- A. Ramisa, G. Alenyá, F. Moreno-Noguer, and C. Torras, “Using depth and appearance features for informed robot grasping of highly wrinkled clothes,” in Proceedings of the International Conference on Robotics and Automation, pp. 1703–1708, 2012.
- P. F. Felzenszwalb and D. P. Huttenlocher, “Efficient graph-based image segmentation,” International Journal of Computer Vision, vol. 59, no. 2, pp. 167–181, 2004.
- J. F. Canny, “A computational approach to edge detection,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 8, no. 6, pp. 679–698, 1986.
- K. Kimura, S. Kikuchi, and S. Yamasaki, “Accurate root length measurement by image analysis,” Plant and Soil, vol. 216, no. 1, pp. 117–127, 1999.
- R. B. Rusu, G. Bradski, R. Thibaux, and J. Hsu, “Fast 3D recognition and pose using the viewpoint feature histogram,” in Proceedings of the 23rd IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS '10), pp. 2155–2162, October 2010.
- D. G. Lowe, “Distinctive image features from scale-invariant keypoints,” International Journal of Computer Vision, vol. 60, no. 2, pp. 91–110, 2004.
- R. Rusu, N. Blodow, and M. Beetz, “Fast point feature histograms (FPFH) for 3D registration,” in Proceedings of the IEEE International Conference on Robotics and Automation (ICRA '09), May 2009.
- A. Gidudu, G. Hulley, and T. Marwala, “Image classification using SVMs: one-against-one vs one-against-all,” in Proceedings of the 28th Asian Conference on Remote Sensing, 2007.
- J. Milgram, M. Cheriet, and R. Sabourin, “One against one or one against all: which one is better for handwriting recognition with SVMs?” in Proceedings of the 10th International Workshop on Frontiers in Handwriting Recognition, October 2006.
- R. Rifkin and A. Klautau, “In defense of one-vs-all classification,” Journal of Machine Learning Research, vol. 5, pp. 101–141, 2004.
- K.-B. Duan and S. S. Keerthi, “Which is the best multiclass SVM method? An empirical study,” in Proceedings of the 6th International Workshop on Multiple Classifier Systems, pp. 278–285, 2005.
- P. A. Devijver and J. Kittler, Pattern Recognition: A Statistical Approach, Prentice Hall, London, UK, 1982.