Abstract

Mental care has become crucial with the rapid growth of the economy and technology. However, recent movements, such as green technologies, place more emphasis on environmental issues than on mental care. Therefore, this study presents an emerging technology called orange computing for mental care applications. Orange computing refers to health, happiness, and physiopsychological care computing, which focuses on designing algorithms and systems for enhancing body and mind balance. The representative color of orange computing originates from a harmonic fusion of passion, love, happiness, and warmth. A case study on a human-machine interactive and assistive system for emotion care was conducted in this study to demonstrate the concept of orange computing. The system can detect the emotional states of users by analyzing their facial expressions, emotional speech, and laughter in a ubiquitous environment. In addition, the system can provide corresponding feedback to users according to the results. Experimental results show that the system achieves an average audiovisual recognition rate of 81.8%, demonstrating its feasibility. Compared with traditional questionnaire-based approaches, the proposed system can offer real-time analysis of emotional status more efficiently.

1. Introduction

During the past 200 years, the industrial revolution has had a considerable effect on human lifestyles [1, 2]. With the rapid growth of the economy and technology, a number of changes have occurred, including the information revolution [3], the second industrial revolution [4], and the development of biotechnology. Although such evolution has been considerably beneficial to humans, it has also caused a number of problems, such as capitalism, utilitarianism, a widening poverty gap, global warming, and an aging population [1, 2]. In view of these changes, many people have recognized these crises and appealed for effective solutions [5]; one example is the green movement [6], which has successfully created awareness of environmental protection and led to the development of green technology, or green computing. However, the green movement does not concentrate on body and mind balance. Therefore, a feasible solution for narrowing the gap between technology and humanity is of utmost concern.

In 1972, the King of Bhutan proposed a new concept, gross national happiness (GNH) [7], to describe the standard of living of a country instead of gross domestic product (GDP). The GNH has attracted considerable attention because it measures the mental health of people. Similar ideas have been proposed in other works. For example, Andrew Oswald advocated happiness economics [8], which combines economics with other research fields, such as psychology and sociology. Moreover, a book entitled “Well-Being” [9], written by Daniel Kahneman (winner of the Nobel Prize in Economic Sciences in 2002), explained the fundamentals of the psychology of happiness. The common objective of these theories is to improve people's quality of life and to bring more happiness into daily life. Recently, the IEEE launched the Humanitarian Technology Challenge (HTC) project (http://www.ieeehtc.org/) [10], sponsoring resource-constrained areas to build reliable electricity and medical facilities. Such an action also highlights the importance of humanistic care. Similar to the HTC project, Intel has supported the Center for Aging Services Technologies (CAST) (http://www.agingtech.org/), whose objective is to accelerate the development of innovative healthcare technologies. Several academic institutes have responded to this trend and initiated medical care research, such as the “CodeBlue” project at Harvard University [11] and “Computers in the Human Interaction Loop” (CHIL) at Carnegie Mellon University [12]. Inspired by these related concepts [1, 2, 6, 8–12], this study devised a research project on the new interdisciplinary “Orange Technology” to promote health, happiness, and humanistic care.

Instead of emphasizing the relations between environments and humans, as proposed by green technology, the objective of the orange computing project is to bring more care or happiness to humans and to promote mental wellness for the well-being of society.

Orange computing is an interdisciplinary field that includes computer science, electrical engineering, biomedical engineering, psychology, physiology, cognitive science, and social science. The research scope of orange computing contains the following:
(1) health and security care for the elderly, children, and infants;
(2) care and disaster relief for people in disaster-stricken areas;
(3) care for low-income families;
(4) body-mind care for people with physiological and psychological problems;
(5) happiness indicator measurement and happiness enhancement.

To demonstrate the concept of orange computing, a case study on a human-machine interactive and assistive system for emotion care was investigated in this study. The proposed system is capable of recognizing human emotions by analyzing facial expressions and speech. When the detected emotional status exceeds a threshold, an alarm is sent to a doctor or a nurse for further diagnosis and treatment.

The remainder of this paper is organized as follows: Section 2 introduces the orange computing models; Section 3 presents a discussion of a case study on the emotion recognition system for care services; Section 4 summarizes the performance of the proposed method and the analysis results; lastly, Section 5 offers conclusions.

2. Orange Computing Models

Orange computing originates from health informatics and comprises two research topics: physiological care and psychological care. Both topics focus on enhancing humans’ physical and mental health, enriching positive emotions, and ultimately bringing more happiness to others [13, 14]. The physiological and psychological care models of orange computing are similar to the health model in medical expert systems [15, 16], which have been well developed and commonly used in health informatics for several decades.

In a medical expert system, when a user inputs a query through the interface, the system automatically searches predefined knowledge databases and consults relevant experts or doctors. After querying the databases or merging the opinions of experts, the system replies to the user with an appropriate response. In traditional medical expert systems, database querying and feedback usually involve semantic understanding techniques and delicate interface design [17–19], so that users are not inconvenienced during the process. However, in some telemedical care systems, such as [20], knowledge databases and feedback mechanisms are replaced with caregivers for better interactivity. Recently, expert systems have gradually integrated knowledge-based information management systems with pervasive computing [21]. Although such systems have been prototyped and modeled in several studies [22, 23], they have not yet been widely deployed. Nevertheless, the abovementioned ideas have spurred the development of orange computing.

Happiness informatics, or the happiness model, is the key characteristic of orange computing. Similar to the health model, the happiness model also requires user input and a predefined database. The input is commonly measured from the biosignals or behavior of a user, for example, facial expressions, emotional speech, laughter, body gestures, gaits, blood pressure, heart rates, electroencephalograms (EEGs), electrocardiograms (ECGs), and electromyograms (EMGs) [24, 25]. With such information, the happiness model can help users evaluate their emotional status in various applications. Nevertheless, determining how to combine those data to infer emotional status remains quite challenging [26–28].

3. Case Study

This section demonstrates a technological application for daily humanistic care in home environments. The system uses contactless multimodal recognition techniques to measure the degree of users’ positive emotions. The recognition results can be logged in a database and sent to analysts for further processing. As shown in Figure 1, the ambient devices of the proposed system include multiple audiovisual sensors, a service robot, and a smart TV. The robot is a self-propelled machine with four wheels and serves as a remote agent between users and the server. To interact with users, it is equipped with audiovisual sensors, loudspeakers, and a touch screen. Similar to the robot, the TV is also used for interacting with users.

After the ambient sensors receive signals from users, the data are sent to a processing server through a cloud network. The data processing workflow comprises three stages: the first and second stages perform visual and audio recognition, and the last stage provides feedback. The details of each stage are described as follows.

3.1. Visual Recognition

At the image processing stage, as shown in Figure 2, after video streams are captured by the camera, Haar-like features [29] are extracted and sent to AdaBoost classifiers [29] to detect user faces. Subsequently, the system uses the Active Shape Model, which was proposed by Cootes et al. [30], to model facial regions. Thus, facial regions can be represented by a set of points using the point distribution model.
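
For illustration, the following minimal sketch shows how this detection front end could be prototyped with OpenCV, whose bundled Haar-cascade face detector uses the same Haar-like features and AdaBoost training described in [29]; the Active Shape Model fitting of [30] is left as a placeholder, since its implementation is not specified here.

```python
# Sketch of the face-detection front end (assumes OpenCV's bundled
# Haar-cascade model, trained with AdaBoost as in [29]). The Active
# Shape Model fitting of [30] is abstracted as a placeholder.
import cv2

def detect_faces(frame_bgr):
    """Return bounding boxes (x, y, w, h) of detected faces."""
    cascade_path = cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
    detector = cv2.CascadeClassifier(cascade_path)
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    return detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)

def fit_shape_model(gray_face):
    """Placeholder for Active Shape Model fitting; a real system would
    return the point-distribution-model landmarks of the face region."""
    raise NotImplementedError("plug in an ASM/landmark fitter here")
```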

A novel feature called the “Multilayered Histogram of Oriented Gradients” (MLHOG) is proposed in this study to generate reliable characteristics for estimating facial expressions. The MLHOGs are derived from Histograms of Oriented Gradients (HOGs) [31] and Pyramid Histograms of Oriented Gradients (PHOGs) [32]. Let $I(x, y)$ represent the pixel at coordinate $(x, y)$, $G_x$ and $G_y$ denote the horizontal and vertical gradients, $m(x, y)$ refer to the weight of a coordinate, and $\theta(x, y)$ be the edge direction. The histogram of oriented gradients can be expressed as follows:

$$G_x(x, y) = I(x+1, y) - I(x-1, y), \qquad G_y(x, y) = I(x, y+1) - I(x, y-1),$$
$$m(x, y) = \sqrt{G_x(x, y)^2 + G_y(x, y)^2}, \qquad \theta(x, y) = \tan^{-1}\frac{G_y(x, y)}{G_x(x, y)}.$$

After the gradients are computed, a histogram of edge directions is subsequently created to collect the number of pixels that belong to each direction.
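
As a concrete illustration of the above definitions, the following sketch computes a plain orientation histogram for one region of interest; the bin count and the magnitude weighting are illustrative choices rather than the paper's exact settings.

```python
# Minimal sketch of a plain HOG descriptor for one region of interest,
# following the gradient/orientation definitions above (numpy only).
import numpy as np

def hog_histogram(roi, n_bins=8):
    """roi: 2-D grayscale array. Returns an orientation histogram
    weighted by gradient magnitude."""
    roi = roi.astype(float)
    gx = np.zeros_like(roi)
    gy = np.zeros_like(roi)
    gx[:, 1:-1] = roi[:, 2:] - roi[:, :-2]      # horizontal gradient G_x
    gy[1:-1, :] = roi[2:, :] - roi[:-2, :]      # vertical gradient G_y
    magnitude = np.hypot(gx, gy)                # weight m(x, y) of each pixel
    angle = np.mod(np.arctan2(gy, gx), np.pi)   # edge direction in [0, pi)
    bins = np.minimum((angle / np.pi * n_bins).astype(int), n_bins - 1)
    hist = np.bincount(bins.ravel(), weights=magnitude.ravel(), minlength=n_bins)
    return hist / (hist.sum() + 1e-9)           # normalized histogram
```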

Unlike the pyramid histogram of oriented gradients, which concentrates on fixed rectangular regions inside an image, the proposed MLHOGs are modeled on object-based regions of interest (ROIs), such as eyes, mouths, noses, and combinations of ROIs. Furthermore, each object-based ROI has a dedicated classifier for recognizing the same type of ROI. A conceptual example of the multilayered histogram of oriented gradients and the multilayered local directional patterns is illustrated in Figure 3.

Similar to the proposed MLHOGs, our study also develops a new texture descriptor called the “Multilayered Local Directional Pattern” for enhancing recognition rates. Such multilayered directional patterns are computed according to the “edge responses” of pixels, based on the same concept as Jabid’s feature, the “Local Directional Pattern (LDP)” [33]. The difference is that the proposed method focuses on patterns at various ROI levels. The multilayered local directional patterns are computed as follows:

$$R_i = I * M_i, \quad i = 0, 1, \dots, 7, \qquad H(i) = \sum_{(x, y) \in \mathrm{ROI}} \delta\bigl(d(x, y), i\bigr),$$

where $I$ is the input image, $M_i$ denotes the eight-directional Kirsch edge masks (similar to Sobel operators), $R_i$ stands for the edge responses of $I$, $i$ indexes the eight directions, $d(x, y)$ is the dominant response direction at pixel $(x, y)$, and $H(i)$ is the number of edge responses in the designated direction. Before the system accumulates the edge responses into $H$, an LDP binary operation [33] is imposed on the responses to generate an invariant code. A one-by-eight histogram is adopted to collect the edge responses in the eight directions. In the proposed multilayered local directional patterns, only edge responses in objects of interest are collected, so that the histogram differs from ROI to ROI.
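
The sketch below illustrates a per-ROI directional histogram using the eight Kirsch masks; the top-k binary coding follows the general LDP idea in [33], with k treated as an assumed parameter rather than the paper's exact setting.

```python
# Illustrative sketch of a per-ROI Local Directional Pattern histogram
# using the eight Kirsch edge masks (scipy performs the convolutions).
import numpy as np
from scipy.ndimage import convolve

KIRSCH = [np.array(m) for m in (
    [[-3, -3, 5], [-3, 0, 5], [-3, -3, 5]],   # east
    [[-3, 5, 5], [-3, 0, 5], [-3, -3, -3]],   # northeast
    [[5, 5, 5], [-3, 0, -3], [-3, -3, -3]],   # north
    [[5, 5, -3], [5, 0, -3], [-3, -3, -3]],   # northwest
    [[5, -3, -3], [5, 0, -3], [5, -3, -3]],   # west
    [[-3, -3, -3], [5, 0, -3], [5, 5, -3]],   # southwest
    [[-3, -3, -3], [-3, 0, -3], [5, 5, 5]],   # south
    [[-3, -3, -3], [-3, 0, 5], [-3, 5, 5]],   # southeast
)]

def ldp_histogram(roi, k=3):
    """roi: 2-D grayscale array. Counts, per direction, how often that
    direction is among the k strongest edge responses of a pixel."""
    responses = np.stack([np.abs(convolve(roi.astype(float), m)) for m in KIRSCH])
    # rank directions per pixel; mark the k strongest as 1 (LDP binary code)
    order = np.argsort(-responses, axis=0)
    code = np.zeros_like(responses, dtype=bool)
    for rank in range(k):
        np.put_along_axis(code, order[rank:rank + 1], True, axis=0)
    return code.reshape(8, -1).sum(axis=1)     # one-by-eight histogram
```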

In addition to upright and full frontal faces, this work also supports roll/yaw angle estimation and correction. The active shape model can label facial regions. Relative positions, proportions of facial regions, and orientations of nonfrontal faces can be measured properly with the use of spatial geometry. Once the direction is determined, corresponding transformation matrices are applied to the nonfrontal faces for pose correction.

At the end of the image processing stage, multiple Support Vector Machines (SVMs) are used to classify facial expressions. Each of the SVMs is trained to recognize a specific facial region. The classification result is generated by majority voting.
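
A minimal sketch of this per-region SVM ensemble with majority voting is shown below; scikit-learn is assumed, feature extraction is abstracted away, and the RBF kernel and penalty constant mirror the settings reported in Section 4.

```python
# Sketch of the per-region SVM ensemble with majority voting.
from collections import Counter
from sklearn.svm import SVC

class FacialExpressionEnsemble:
    def __init__(self, region_names):
        # one classifier per facial region (e.g. eyes, mouth, nose)
        self.models = {name: SVC(kernel="rbf", C=1.0) for name in region_names}

    def fit(self, region_features, labels):
        """region_features: dict mapping region name -> feature matrix."""
        for name, model in self.models.items():
            model.fit(region_features[name], labels)

    def predict(self, region_features):
        """Majority vote over the per-region predictions for one sample
        (each value is a 1-D numpy feature vector)."""
        votes = [self.models[name].predict(feats.reshape(1, -1))[0]
                 for name, feats in region_features.items()]
        return Counter(votes).most_common(1)[0][0]
```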

3.2. Audio Recognition

Both audio signals and visual data have a considerable effect on deciphering human emotions. Therefore, the audio processing stage focuses on detecting emotional speech and laughter to extract emotional cues from acoustic signals. The workflow at this stage is illustrated in Figure 4.

First, silence segments in audio streams are removed using a voice activity detection (VAD) algorithm. Subsequently, an autocorrelation-like method called the “Average Magnitude Difference Function” (AMDF) [34] is used to extract phoneme information from the acoustic data. The AMDF can effectively estimate periodical signals, which are the main characteristics of speech, laughter, and other vowel-based nonspeech sounds. The AMDF is defined as follows:

$$\mathrm{AMDF}(\tau) = \frac{1}{N} \sum_{n=0}^{N-\tau-1} \bigl| s(n) - s(n+\tau) \bigr|,$$

where $s$ represents one of the segments in the acoustic signal, $N$ is the length of $s$, $n$ denotes the time index, and $\tau$ is the shifting length. After $\mathrm{AMDF}(\tau)$ reaches its minimum, a phoneme can be acquired by extracting the indices from $n$ to $n+\tau$.
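
The following numpy sketch illustrates the AMDF and the selection of the lag that minimizes it; the 50-400 Hz pitch-range search bounds are assumptions, not values taken from the paper.

```python
# Minimal numpy sketch of the Average Magnitude Difference Function and of
# picking the lag at which it is minimal (the segment is assumed to be
# longer than one period of f_min).
import numpy as np

def amdf(segment, tau):
    """Average magnitude difference of a 1-D signal at lag tau."""
    segment = np.asarray(segment, dtype=float)
    n = len(segment)
    return np.mean(np.abs(segment[: n - tau] - segment[tau:]))

def estimate_period(segment, sr, f_min=50, f_max=400):
    """Return the lag (in samples) that minimizes the AMDF within the pitch range."""
    tau_min, tau_max = int(sr / f_max), int(sr / f_min)
    lags = range(tau_min, min(tau_max, len(segment) - 1))
    return min(lags, key=lambda tau: amdf(segment, tau))
```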

Algorithm 1 describes the process of syllable extraction once the phonemes of a signal have been determined.

Initialization
 designate the beginning phoneme p(1);
For each phoneme p(i)
begin
 If Similarity(p(i), p(i + 1)) < T_sim
  If Distance(p(i), p(i + 1)) > T_dist
    p(i) is the end of a syllable;
end
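
A compact sketch of this boundary rule is given below; the phoneme feature representation, the cosine similarity measure, and both thresholds are illustrative assumptions, since Algorithm 1 does not fix them.

```python
# Sketch of the syllable-boundary rule in Algorithm 1.
import numpy as np

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def syllable_ends(phonemes, times, sim_threshold=0.7, gap_threshold=0.05):
    """phonemes: list of feature vectors; times: start time (s) of each phoneme.
    Returns the indices of phonemes that end a syllable."""
    ends = []
    for i in range(len(phonemes) - 1):
        dissimilar = cosine_similarity(phonemes[i], phonemes[i + 1]) < sim_threshold
        far_apart = (times[i + 1] - times[i]) > gap_threshold
        if dissimilar and far_apart:   # dissimilar AND separated -> boundary
            ends.append(i)
    return ends
```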

In the next step, to classify signals into their respective categories, energy and frequency changes are used as the first criteria for separating speech from vowel-based nonspeech, because the spectral discrepancy of speech is relatively small in most cases.

Compared with other vowel-based nonspeech sounds, the temporal pattern of laughter usually exhibits repetitiveness. To detect such patterns, this study uses cascade filters, consisting of a Dynamic Time Warping (DTW) filter [35] and a syllable discriminant filter, to compute similarities of the input data. Using Mel-frequency cepstral coefficients (MFCCs), the Dynamic Time Warping filter identifies candidate signals by matching them with the samples in the database. The signals that successfully pass through the first filter are subsequently input to the second filter. The syllable discriminant filter compares each input sequence with predefined patterns by using the inner product operation. When the score of an input is higher than a threshold, the input is labeled as laughter.
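
The following sketch illustrates the DTW filtering step: MFCC sequences are compared against stored laughter templates. librosa is assumed for the MFCCs, the DTW recursion is written out explicitly, and the acceptance threshold is an assumption.

```python
# Sketch of the DTW-based laughter filter over MFCC sequences.
import numpy as np
import librosa

def mfcc_sequence(y, sr):
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13).T   # frames x coefficients

def dtw_distance(a, b):
    """Classic O(len(a)*len(b)) dynamic time warping distance."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(a[i - 1] - b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return cost[n, m] / (n + m)   # length-normalized alignment cost

def passes_dtw_filter(query, templates, threshold=50.0):
    """True if the query is close enough to any laughter template."""
    return any(dtw_distance(query, t) < threshold for t in templates)
```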

For emotional speech recognition, this study follows previous works [36–38] and extracts prosodic and timbre features from speech to recognize emotional information in voices. Tables 1 and 2 show the acoustic features used in this system.

In addition to the acoustic features, this study also uses the keyword spotting technique to detect predefined keywords in speech because textual data offer more emotion clues than acoustic data. After detecting predefined keywords in utterances, the system iteratively computes the association degree between the detected keyword and each emotion category.

Let $e$ represent the index of the emotion categories, $j$ denote the index of the detected keyword in the sentence corpus, $k_j$ refer to the detected keyword, $N(k_j, e)$ represent the number of occurrences of $k_j$ in category $e$, and $S(k_j)$ denote the number of sentences containing $k_j$.

The association degree can then be defined as

$$A(e, k_j) = W(e, k_j) \times C(k_j),$$

where the first term $W(e, k_j)$ is the weighting score of keyword $k_j$ with respect to category $e$, and the second term $C(k_j)$ is the confidence score of $k_j$ (see [39] for the detailed formulation). The textual feature vector is subsequently combined with the acoustic feature vector and sent into a classifier (AdaBoost) for training and recognition.
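
For illustration only, the sketch below computes association degrees with stand-in definitions of the weighting and confidence scores (an occurrence ratio and a sentence-coverage ratio); the exact formulation is the one given in [39].

```python
# Sketch of the keyword-association step with assumed score definitions.
def association_degrees(keyword, occurrences, sentence_count, total_sentences):
    """occurrences: dict emotion_category -> count of `keyword` in that category.
    sentence_count: number of corpus sentences containing `keyword`."""
    total_occ = sum(occurrences.values()) or 1
    confidence = sentence_count / max(total_sentences, 1)    # C(k_j), assumed form
    degrees = {}
    for category, count in occurrences.items():
        weighting = count / total_occ                        # W(e, k_j), assumed form
        degrees[category] = weighting * confidence           # A(e, k_j)
    return degrees

# Example: a keyword seen 8 times in "joy" sentences and 2 times elsewhere
print(association_degrees("great", {"joy": 8, "neutral": 2}, 10, 200))
```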

3.3. Feedback Mechanism

After completion of the audiovisual recognition stage, the system generates three results along with their classification scores. One of the three results is the detected facial expression, another is the detected vocal emotion type, and the other is laughter. The classification scores are linearly combined with the recognition rates of the corresponding classifiers and finally output to users. Additionally, the recognition result is logged in the database 24 hours a day. A user can browse the curve of emotion changes by viewing the display. The system is also equipped with a telehealthcare module. Personal emotion status can be sent to family psychologists or psychiatrists for mental care. The service robot can serve as an agent between the cloud system and users, providing a remote interactive interface.
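
A minimal sketch of the score-fusion step is shown below. Weighting each module's score by its recognition rate is one plausible reading of the linear combination described above, and the weights simply reuse the rates reported in Section 4 for illustration; they are not calibrated parameters of the system.

```python
# Minimal sketch of linearly combining module scores with recognition rates.
MODULE_WEIGHTS = {"smile": 0.825, "laughter": 0.882, "emotional_speech": 0.746}

def fuse_scores(module_scores):
    """module_scores: dict module -> classification score in [0, 1].
    Returns a weighted average expressing the overall positive-emotion degree."""
    total_weight = sum(MODULE_WEIGHTS[m] for m in module_scores)
    return sum(MODULE_WEIGHTS[m] * s for m, s in module_scores.items()) / total_weight

print(fuse_scores({"smile": 0.9, "laughter": 0.4, "emotional_speech": 0.7}))
```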

4. Experimental Results

This study conducted an audiovisual emotion recognition experiment to assess the performance of our system. Only positive emotions, including smiling faces, laughter, and joyful voices, were tested in the experiment.

For the evaluation of facial expression recognition, 500 facial images containing smiles and nonsmiles were manually selected from the MPLab GENKI database (http://mplab.ucsd.edu/). The kernel function of the SVM was the radial basis function, and the penalty constant was empirically set to one. Furthermore, 50% of the dataset was used for training, and 50% was used for testing. For the evaluation of laughter recognition, a database consisting of 84 sound clips was created by recording the utterances of six people. Eighteen of these 84 clips were sounds of people laughing. After removing silent parts from all of the clips, the entire dataset was sent to the system for recognition. For emotional speech recognition, this research used the same database as that in our previous work [39]. Speech containing joyful and nonjoyful emotions was manually chosen and parsed to obtain its literal information and acoustic features. Finally, these features were input into an AdaBoost classifier for training and testing.

Figure 5 summarizes the experimental results of our system, in which the vertical axis denotes accuracy rates and the horizontal axis represents recognition modules. As shown in the figure, the accuracy rate of smile detection reaches 82.5%. The performance of laughter recognition also achieves an accuracy rate as high as 88.2%. Although the accuracy of emotional speech recognition is lower at 74.6%, this performance is comparable to those of related emotional speech recognition systems. When the results of all three modules are combined, the overall accuracy rate reaches an average of 81.8%.

The following experiment tested whether the proposed system can help users monitor and evaluate their emotional health status as caregivers do. During the experiment, a total of ten persons were selected from a sanatorium and a hospital to test the system for a week. The ages of the participants ranged from 40 to 70 years. The audiovisual sensors were installed in their living space, so that the emotional data could be acquired and analyzed in real time. For privacy, the sensors captured behavior only between 10:00 and 16:00. To avoid generating biased data, the testees were not aware of the locations of the sensors or the testing details of the experiment. Furthermore, after the system analyzed the data, medical doctors and nurses helped the testees complete questionnaires. The questionnaire contained a total of ten questions, nine of which were irrelevant to this experiment. The remaining question was the key criterion, which asked the testees to give a score from one (unhappy) to five (happy) for their daily moods.

The questionnaire scores were subsequently compared with the emotional status estimated by the proposed system. To obtain the estimated emotional score, the proposed method first calculates the duration of smiling facial expressions, joyful speech, and laughter of the testees. Next, a ratio is computed by converting this duration into a one-to-five rating scale based on the test period.
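
The conversion can be sketched as follows; the linear binning from the duration ratio onto the one-to-five scale is an assumed mapping, since the paper does not specify the exact conversion.

```python
# Sketch of converting daily positive-emotion duration into a 1-5 rating
# comparable with the questionnaire scale (assumed equal-width bins).
def emotion_score(positive_seconds, observed_seconds):
    """Map the fraction of observed time spent smiling, laughing, or speaking
    joyfully onto a 1 (unhappy) to 5 (happy) scale."""
    ratio = positive_seconds / max(observed_seconds, 1)
    return 1 + min(int(ratio * 5), 4)

# Example: 45 minutes of positive expressions over a 6-hour observation window
print(emotion_score(45 * 60, 6 * 3600))
```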

The correlation test in Figure 6 shows the performance of the questionnaire approach and the proposed system. The vertical axis represents the questionnaire result, whereas the score of the proposed system is listed on the horizontal axis. All the samples were collected from the testees. A close examination of the scatter plot in this figure reveals that Pearson’s correlation coefficient reaches 0.27, which suggests that our method is positively correlated with the questionnaire-based approach. Moreover, the linear regression analysis of the two groups of scores yields a slope of 0.33. The above findings indicate that the proposed method can allow computers to monitor users’ emotional health, thereby assisting caregivers in keeping track of users’ psychological status and saving human resources.

5. Conclusion

This paper presents a new concept called orange computing for health, happiness, and humanistic care. To demonstrate the concept, a case study on an audiovisual emotion recognition system for care services was also conducted. The system uses multimodal recognition techniques, including facial expression, laughter, and emotional speech recognition, to capture human behavior.

At the facial expression recognition stage, multilayered histograms of oriented gradients and multilayered local directional patterns are proposed to model facial features. To detect the patterns of laughing sounds, a cascade of two filters, consisting of a Dynamic Time Warping filter and a syllable discriminant filter, is used in the acoustic processing phase. Furthermore, when classifying emotional speech, the system combines textual, timbre, and prosodic features to calculate the association degree with predefined emotion classes.

Three analyses were conducted to evaluate the recognition performance of the proposed methods. Experimental results show that our system can reach an average accuracy rate of 81.8%. Concerning the feedback mechanism, data from the real-life test indicate that our method is comparable to the questionnaire-based approach, with a correlation of 0.27 between the two methods. The above results demonstrate that the proposed system is capable of recognizing users’ emotional health and thereby providing timely reminders for them.

In summary, orange computing aims to raise awareness of the importance of mental wellness (health, happiness, and warm care), thereby encouraging more people to join the movement, to share happiness with others, and ultimately to enhance the well-being of society.

Acknowledgments

This work was supported in part by the National Science Council of the Republic of China under Grant no. 100-2218-E-006-017. The authors would like to thank Yan-You Chen, Yi-Cheng Chen, Wei-Kang Fan, Chih-Hung Li, and Da-Yu Kwan for contributing the experimental data and supporting this research.