Abstract

The main goal of Ambient Assisted Living (AAL) solutions is to provide assistive technologies and services in smart environments, allowing elderly people to enjoy a high quality of life. Since 3D sensing technologies are increasingly investigated as monitoring solutions able to outperform traditional approaches, this work presents a noninvasive monitoring platform based on 3D sensors that provides a wide-ranging solution suitable for several assisted living scenarios. Detector nodes are managed by low-power embedded PCs that process 3D streams and extract postural features related to the person’s activities. The level of detail of the features is tuned according to the current context in order to save bandwidth and computational resources. The platform architecture is conceived as a modular system that can be integrated into third-party middleware to provide monitoring functionalities in several scenarios. The event detection capabilities were validated using both synthetic and real datasets collected in controlled and real-home environments. Results show the soundness of the presented solution in adapting to different application requirements, correctly detecting events related to four relevant AAL services.

1. Introduction

In recent years, the interest of the scientific community in smart environments has grown rapidly, especially within the European Ambient Assisted Living (AAL) Program, with the aim of increasing the independent living and the quality of life of older people. The design of AAL systems is normally based on monitoring infrastructures provided by smart environments. Such infrastructures include heterogeneous sensing devices organized in ad hoc networks with distributed data processing resources, coordinated by intelligent agents that offer information analysis and decision-making capabilities. Human activity monitoring is a crucial function of AAL systems, especially for the detection of critical situations (e.g., falls and abnormal behaviors) or as support during the execution of relevant tasks (e.g., daily living activities and training/rehabilitation exercises). Generally, human monitoring systems are based on either wearable devices or environmental equipment. In the first case, markers or kinematic sensors (e.g., MEMS accelerometers or gyroscopes) are worn by the end user to detect body movements. Recently, Baek et al. [1] presented a necklace embedding a triaxial accelerometer and a gyroscope able to distinguish falls from regular ADLs (activities of daily living) by measuring the angle of the upper body to the ground. Although wearable techniques can be accurate and suitable even in outdoor conditions, their efficiency is limited by false alarms occurring when the device is incorrectly worn or not worn at all (e.g., after a bath or during the night), as described by Kröse et al. [2]. On the other hand, ambient devices are normally based on some kind of vision system (monocular or stereo cameras, infrared cameras, 3D sensors, etc.) for people tracking, or on sensors installed in household furniture/appliances (pressure sensors on the floor, on/off switches, etc.) from which the person’s activities are inferred. Rimminen et al. [3] suggested an innovative floor sensor able to detect falls by using the near-field imaging (NFI) principle. However, ambient sensors typically require an ad hoc design or redesign of the home environment. Vision-based techniques are the most affordable and accurate solutions, thanks to the richness of the acquired data and an installation accessibility that does not require an ad hoc redesign of the environment. Foroughi et al. [4] presented a passive vision (monocular camera) approach for monitoring human activities, with particular interest in the problem of fall detection. More recently, Edgcomb and Vahid [5] made an attempt to detect falls in privacy-enhanced monocular video. On the other hand, vision systems based on new-generation 3D sensors, both TOF (Time of Flight) and non-TOF (e.g., structured light), overcome typical issues affecting passive vision systems, such as the dependence on ambient conditions (e.g., brightness, shadows, and the chromaticity and appearance of surfaces) and the poor preservation of privacy. Indeed, 3D sensors enable a new visual monitoring modality based on the metric reconstruction of a scene by processing only distance maps (i.e., range images), guaranteeing the person’s privacy. The use of range images simplifies both the preprocessing and the feature extraction steps, allowing the use of less computationally expensive algorithms, more suitable for the embedded PCs typically installed in AAL contexts.
The use of cheaper (non-TOF) 3D sensors has recently been described by Mastorakis and Makris [6] for the detection of falls in elderly people. However, non-TOF sensors estimate distances from the distortion of an infrared light pattern projected onto the scene, so their accuracy and covered range are seriously limited (distances up to 3-4 m with 4 cm accuracy), as reported by Khoshelham [7]. Moreover, such sensors are designed especially for human motion capture in entertainment/gaming applications and are therefore optimized for nonoccluded frontal viewing. TOF sensors, employing the so-called Laser Imaging Detection and Ranging (LIDAR) technique, estimate distances more reliably and accurately by measuring the delay between emitted and reflected light, so they can reach longer distances within a wider Field-of-View (FOV). Mixed approaches based on different kinds of sensors are also possible, in which data coming from heterogeneous sensors are correlated (data fusion) to compensate for false alarms, making the solution more reliable. A multisensor system for fall detection using both wearable accelerometers and 3D sensors has been described by Grassi et al. [8].

AAL systems present other critical issues concerning interoperability, modularity, and hardware independence, as observed by Fuxreiter et al. [9]: devices and applications are often isolated or proprietary, preventing effective customization and reuse, causing high development costs, and limiting the combination of services needed to adapt to user needs. New trends aim to integrate an open middleware into AAL systems, as a flexible intermediate layer able to accommodate different requirements and scenarios. Relevant AAL-oriented middleware architectures have been described by Wolf et al. [10], Schäfer [11], and Coronato et al. [12].

This paper presents a novel monitoring platform based on 3D sensors for delivering AAL services in smart environments. The proposed solution is able to process data acquired by different kinds of 3D sensors and is suitable for integration into AAL-oriented middleware, providing monitoring functionalities in several AAL applications.

2. Materials and Methods

The platform architecture is conceived as modular, distributed, and open, integrating several detector nodes and a coordinator node (Figure 1). It is designed to be integrated into wider AAL systems through open middleware. Three main logical layers have been defined: data processing resource, sensing resource, and AAL services management. The data processing resource layer is implemented by both the detector nodes (Figure 1(a)) and the coordinator node (Figure 1(b)). Moreover, the detector nodes implement the sensing resource layer. The 3D sensor network has a hierarchical topology, as shown in Figure 2(a), composed of M detector nodes, each managing several 3D sensor nodes, and one coordinator node that receives high-level reports from the detector nodes. Both the 3D sensor, shown in Figure 2(b), and the embedded PC implementing the coordinator and detector nodes, shown in Figure 2(c), are low-power, compact, and noiseless devices, in order to meet the typical requirements of AAL contexts.

A detector node can handle either overlapping 3D views (i.e., frames captured by distinct 3D sensors having at least a few common points) or nonoverlapping ones, whereas 3D views managed by distinct detector nodes must always be nonoverlapping. 3D data streams are fused at the level of the single detector node, whereas high-level data are fused at both the detector and coordinator nodes. The detector nodes are responsible for events involving either a single view or overlapping views, using data fusion to resolve occlusions. The coordinator, instead, handles events involving nonoverlapping views (inter-view events) and is responsible for achieving a global picture of the events (e.g., the detection of a wandering state with a recovered fall in the bedroom and an unrecovered fall in the kitchen). Since AAL systems are typically implemented to assist elderly people living alone, the issue of inter-view people identification has not been addressed (i.e., only one person at a time is assumed to be present in the home).

The coordinator layer includes the architectural modules for the management of the detector nodes (control and data gathering), high-level data fusion, inter-view event detection, and context management. The system manager (Figure 1(c)) manages the whole AAL system, which includes the monitoring platform presented in this work as a functional component. It is inspired by the open AAL middleware UniversAAL [13], in order to achieve global AAL service goals. Each of the aforementioned architectural layers is described in detail in the following.

2.1. The Sensing Resource

The sensing nodes are the 3D sensors connected to each detector node. Figure 3 shows one sensing node in a wall-mounted configuration with its extrinsic calibration parameters (θ: tilt angle, β: roll angle, H: height with respect to the floor plane) referred to the sensor reference system. The figure also shows the world reference system fixed on the floor plane. The 3D sensor management module reads the data stream from the 3D sensor (three Cartesian coordinate matrices, one per spatial dimension, in sensor coordinates). The preprocessing module includes functionalities for extrinsic calibration and people detection in the point cloud. The extrinsic calibration is fully automated (self-calibration) to simplify sensor installation, requiring neither a calibration tool nor user intervention. The self-calibration procedure is based on RANSAC floor plane detection, as suggested by Gallo et al. [14], in order to estimate the change from the sensor reference system to the world reference system in which the feature extraction process is carried out. To identify a person in the acquired data, a set of well-known vision processing steps, namely, background modeling, segmentation, and people tracking, is implemented according to a previous study by the authors [15]. The Mixture of Gaussians dynamical model proposed by Stauffer and Grimson [16] is used for background modeling, since it is able to rapidly compensate for small variations in the scene (e.g., movements of chairs and door opening/closing). For person detection and segmentation, a specific Bayesian formulation is defined in order to filter out nonperson objects even in very cluttered scenes.
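As an illustration of the self-calibration step, the following minimal sketch (not the authors' implementation) fits the floor plane with RANSAC on a raw point cloud and derives the extrinsic parameters θ, β, and H from the plane normal; the axis convention (+y up), the thresholds, and the function names are assumptions made for this example.

```python
import numpy as np

def ransac_floor_plane(points, n_iters=500, dist_thresh=0.03, rng=None):
    """Estimate the floor plane (n, d), with n.p + d = 0, from an N x 3
    point cloud in sensor coordinates, via RANSAC."""
    rng = np.random.default_rng(rng)
    best_count, best_plane = 0, None
    for _ in range(n_iters):
        p0, p1, p2 = points[rng.choice(len(points), 3, replace=False)]
        n = np.cross(p1 - p0, p2 - p0)
        norm = np.linalg.norm(n)
        if norm < 1e-9:                  # degenerate (collinear) sample
            continue
        n /= norm
        d = -n.dot(p0)
        count = int((np.abs(points @ n + d) < dist_thresh).sum())
        if count > best_count:
            best_count, best_plane = count, (n, d)
    return best_plane

def extrinsics_from_plane(n, d):
    """Recover tilt (theta), roll (beta), and sensor height H from the
    floor-plane normal, assuming a wall-mounted, downward-looking sensor
    and a +y-up axis convention (illustrative only)."""
    if n[1] < 0:                         # orient the normal upwards
        n, d = -n, -d
    theta = np.arctan2(n[2], n[1])       # tilt about the x-axis
    beta = np.arctan2(n[0], n[1])        # roll about the z-axis
    H = abs(d)                           # sensor height above the floor
    return theta, beta, H
```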

A multiple hypothesis tracker, using the conditional density propagation over time proposed by Isard and Blake [17], is implemented to track people effectively even in the presence of large semioccluding objects (e.g., tables and sofas), as frequently happens in home environments. Considering that only one person at a time can be present in the home, the range data generated from overlapping 3D views are fused together by using a triangulation-based prealignment refined with the fast Iterative Closest Point (ICP) algorithm, as suggested by Won-Seok et al. [18]. Finally, a middleware module plugs the 3D sensors into the whole system, also providing a semantic description of the 3D data. This module is able to handle different types of 3D sensors, both TOF and non-TOF, translating the specific data format into an abstract one common to all 3D sensors plugged into the system.
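To make the fusion step concrete, here is a minimal point-to-point ICP sketch; it illustrates the refinement of a coarse prealignment in general terms, not the specific fast ICP variant of Won-Seok et al. [18], and all parameter values are illustrative.

```python
import numpy as np
from scipy.spatial import cKDTree

def rigid_fit(src, dst):
    """Least-squares rotation R and translation t mapping src onto dst
    (Kabsch/SVD method)."""
    cs, cd = src.mean(0), dst.mean(0)
    U, _, Vt = np.linalg.svd((src - cs).T @ (dst - cd))
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:             # avoid reflections
        Vt[-1] *= -1
        R = Vt.T @ U.T
    return R, cd - R @ cs

def icp(src, dst, n_iters=30, tol=1e-6):
    """Refine the alignment of view 'src' onto view 'dst' (N x 3 arrays),
    starting from a coarse prealignment already applied to 'src'."""
    tree = cKDTree(dst)
    cur, prev_err = src.copy(), np.inf
    for _ in range(n_iters):
        dists, idx = tree.query(cur)     # closest-point correspondences
        R, t = rigid_fit(cur, dst[idx])
        cur = cur @ R.T + t
        err = dists.mean()
        if abs(prev_err - err) < tol:    # stop when the error stabilizes
            break
        prev_err = err
    return cur
```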

2.2. The Data Processing Resource

Two different kinds of data processing resources are defined, namely, the detector and the coordinator, whose details are provided in this section.

2.2.1. The Detector

The detector data processing resource includes the following modules: feature extraction, person’s position, posture recognition, and intra-view event detection. Features are extracted from 3D data by using two body descriptors having different levels of detail and computational complexity. Coarse-grained features are extracted by using a volumetric descriptor that exploits the spatial distribution of 3D points represented in cylindrical coordinates (height h, radius r, and angular location θ). Rotational invariance is obtained by choosing the h-axis along the body’s vertical central axis and suppressing the angular dimension θ. Scale invariance is obtained by normalizing by the size of a reference cylinder, whereas translational invariance is guaranteed by placing the cylinder axis on the body’s centroid M. The 3D points are thus grouped into rings orthogonal to and centered at the h-axis, each ring sharing the same height and radius; Figure 4(a) shows the cylindrical reference system, highlighting the kth ring and the 3D points it includes. The corresponding volumetric features are represented by the cylindrical histogram shown in Figure 4(b), obtained by taking the sum of the bin values for each ring. The fine-grained features are obtained by using a topological representation of the body information embedded in the 3D point cloud. The intrinsic topology of a generic shape, that is, a human body scan captured via TOF sensors, is graphically encoded by using the discrete Reeb graph (DRG) proposed by Xiao et al. [19]. To extract the DRG, the geodesic distance is used as an invariant Morse function [20], since it is invariant not only to translation, scale, and rotation but also to isometric transformations, ensuring high accuracy of the body part representation under postural changes. The geodesic distance map is computed by using a two-step procedure. At first, a connected mesh is built on the 3D point cloud (Figure 4(c)) by using the nearest-neighbor rule. Then, taking the body’s centroid M as the starting point, the geodesic distance between M and every other mesh node is computed as the shortest path on the mesh, using an efficient implementation of Dijkstra’s algorithm [21]. The computed geodesic map is shown in Figure 4(d), in which false colors represent geodesic distances. The DRG is extracted by subdividing the geodesic map into regular level sets and connecting them on the basis of an adjacency criterion, as described by Diraco et al. [22], who also suggest a methodology to handle self-occlusions (i.e., body parts overlapping other body parts). The DRG-based features are shown in Figure 4(e) and are represented by a topological descriptor that includes the DRG nodes and their related angles. The person’s position is defined in terms of 3D coordinates with respect to the world reference system associated with the 3D sensor in the case of a single (nonoverlapping) view.
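A minimal sketch of the volumetric descriptor described above, assuming an N × 3 point cloud with the vertical axis along +y; the bin counts and the reference-cylinder dimensions are illustrative assumptions, not values from the paper.

```python
import numpy as np

def cylindrical_histogram(points, n_h=12, n_r=8,
                          ref_height=2.2, ref_radius=1.0):
    """Coarse volumetric descriptor: histogram of 3D points over
    (height, radius) rings around the body's vertical axis, with the
    angular dimension suppressed for rotational invariance."""
    centroid = points.mean(0)
    local = points - centroid                 # translation invariance
    h = local[:, 1]                           # vertical axis (+y up assumed)
    r = np.hypot(local[:, 0], local[:, 2])    # radial distance; theta dropped
    # Scale invariance: normalize by the reference cylinder size.
    h = (h / ref_height + 0.5).clip(0, 1 - 1e-9)
    r = (r / ref_radius).clip(0, 1 - 1e-9)
    hist, _, _ = np.histogram2d(h, r, bins=(n_h, n_r),
                                range=((0, 1), (0, 1)))
    return (hist / len(points)).ravel()       # flattened ring counts
```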

In the case of overlapping views, the 3D position is estimated via triangulation (i.e., considering at least two sensor views and the relative position of the person in them), assuming one of the overlapping views as the reference view. In activity monitoring and related fields, the body posture is considered a crucial element, since a typical human action can be decomposed into a few relevant key postures that differ significantly from each other and are suitable for both representing and inferring activities and behaviors, as pointed out by Brendel and Todorovic [23] and by Cucchiara et al. [24], respectively. To cover as wide a range of AAL applications as possible, a large class of key postures organized into four levels of detail has been defined, as described in the following. The considered levels are summarized in Figure 5(a). At the first level, the four basic postures, Standing (St), Bending (Be), Sitting (Si), and Lying down (Ly), are extracted. At the second level, the height of the person’s centroid with respect to the floor plane is taken into account in order to discriminate, for instance, a “Lying down on bed” from a “Lying down on floor.” The orientation of the body’s torso is taken into account at the third level. The fourth and final level describes the configuration of the body’s extremities, providing a total of 19 postures. A sample TOF frame for each kind of posture is shown in Figure 5(b). Given the coarse-to-fine features extracted as previously discussed, the multiclass formulation of the SVM (Support Vector Machine) based on the one-against-one strategy described by Debnath et al. [25] is used for posture classification. Since interesting results have been obtained with the Radial Basis Function (RBF) kernel [26], the key kernel parameters (the regularization constant C and the kernel argument γ) are adjusted according to a global grid search strategy.
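The posture classification step could be reproduced along the following lines with scikit-learn, whose SVC implements the one-against-one multiclass strategy natively; the grid ranges and the use of feature standardization are assumptions of this sketch, not settings reported in the paper.

```python
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def train_posture_classifier(X, y):
    """RBF-kernel SVM posture classifier with a grid search over the
    regularization constant C and kernel argument gamma.
    X: feature vectors (coarse or fine descriptors), y: posture labels."""
    pipe = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
    grid = {"svc__C": [0.1, 1, 10, 100],        # illustrative ranges
            "svc__gamma": [1e-3, 1e-2, 1e-1, 1]}
    search = GridSearchCV(pipe, grid, cv=5)
    search.fit(X, y)
    return search.best_estimator_
```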

Furthermore, the detector node is responsible for detecting events related to both nonoverlapping and overlapping views (such as falls or activities happening entirely within the same detection area). In general, human actions are recognized by considering successive postures over a time period. A transition action occurs when the person changes from the current action to another action; thus, a transition action may include several transition postures. Such transition postures are recognized at the detector node level by using a backward search strategy, whereas transition actions are recognized by the coordinator. Starting from the current 3D frame, previous frames are checked to verify whether they refer to similar postures: if so, a transition posture is recognized; otherwise, the backward search continues with an earlier 3D frame. Recognized transition postures are sent to the coordinator, which is responsible for event detection in nonoverlapping views. If the transition postures yield a meaningful event (e.g., a fall), it is also sent to the coordinator. Finally, the detector data processing is plugged in via a middleware module as a data processing resource able to process data coming from the 3D sensor resources and to communicate with the coordinator.
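A rough sketch of the backward search idea follows, under the assumption that the hypothetical helpers classify() and similar() respectively label a frame's posture and test posture similarity; the window size is illustrative and the exact matching criterion in the paper may differ.

```python
def backward_search(frames, classify, similar, max_back=30):
    """Starting from the current (last) frame, walk backwards until a
    frame with a similar posture is found; the span in between is then
    reported as a candidate transition."""
    current = classify(frames[-1])
    for k in range(2, min(max_back, len(frames)) + 1):
        past = classify(frames[-k])
        if similar(past, current):
            return frames[-k:]        # frames spanning the transition
    return None                       # no transition found in the window
```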

2.2.2. The Coordinator

The coordinator data processing resource includes the following functional modules: detector node management, data fusion, inter-view event detection, and context management. The information concerning detector nodes gathered by the coordinator includes the node position within the home (e.g., living room and bedroom) and the adjacency of node views (i.e., whether two nonoverlapping views are directly accessible or accessible through a third view). In addition, on the basis of the current application context, the detector nodes are configured according to the most appropriate level of detail and the events of interest. The data reports gathered from the detectors are fused together on the basis of both detector position and timestamp. Inter-view events are detected by using a backward search strategy similar to the one already described in the previous subsection, but recognizing transition actions instead of transition postures. The transition actions are merged together to form single atomic actions, whereas global events are recognized by using Dynamic Bayesian Networks (DBNs) specifically designed for each application scenario, following an approach similar to the one proposed by Park and Kautz [27]. The designed DBNs have a hierarchical structure with three node layers named activity, interaction, and sensor. The activity layer sits on top of the hierarchy and includes hidden nodes modeling high-level activities (i.e., ADLs, behaviors, etc.). The interaction layer is a hidden layer as well and is devoted to modeling the states of evidence for interactions inside the home (i.e., appliance and furniture locations, and the person’s position). The sensor layer, at the bottom of the hierarchy, gathers data from the detector sensors: locations and postures. Each DBN is hence decomposed into multiple Hidden Markov Models (HMMs), including the interaction and sensor layers, and trained on the basis of the Viterbi algorithm as described by Jensen and Nielsen [28].
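For reference, the decoding step at the core of Viterbi-style HMM training can be sketched as follows; the matrix shapes and variable names are generic textbook conventions, not taken from [28] or from the paper's implementation.

```python
import numpy as np

def viterbi(obs, pi, A, B):
    """Most likely hidden-state path for a discrete HMM (log-space).
    obs: observation indices (T,), pi: initial probabilities (S,),
    A: state transition matrix (S, S), B: emission matrix (S, O)."""
    S, T = len(pi), len(obs)
    logA, logB = np.log(A), np.log(B)
    delta = np.log(pi) + logB[:, obs[0]]      # best log-score per state
    psi = np.zeros((T, S), dtype=int)         # best predecessor per state
    for t in range(1, T):
        scores = delta[:, None] + logA        # scores[i, j]: i -> j
        psi[t] = scores.argmax(0)
        delta = scores.max(0) + logB[:, obs[t]]
    path = np.zeros(T, dtype=int)             # backtrack the best path
    path[-1] = delta.argmax()
    for t in range(T - 2, -1, -1):
        path[t] = psi[t + 1, path[t + 1]]
    return path
```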

2.3. System Manager

The system manager refers to the management of the whole AAL system by means of the definition and execution of procedures and workflows that react to situations of interest. While such situations are identified by the context manager in charge of the coordinator, the system manager, through the procedural manager, handles the workflow on the basis of how the system is required to react (e.g., sending an alarm message). Furthermore, the procedural manager outlines service goals in an abstract way, whereas the composer is responsible for combining the available AAL services to achieve such goals.

3. Results and Discussion

The invariance properties and recognition performance of the suggested coarse-to-fine features were assessed by using realistic synthetic postures generated as suggested by Gond et al. [29]. A large posture dataset of about 6840 samples, with and without semiocclusions, was generated under different viewpoint angles (from 0° to 180° in 45° steps) and distances (Low: 2 m, Med: 6 m, High: 9 m). The 4-quadrant technique suggested by Li et al. [30] was adopted to simulate semiocclusions similar to those normally present in home environments. The achieved recognition rates, detailed for each feature level, angle, and distance, are reported in Table 1.

At levels 1 and 2, the volumetric descriptor exhibited a good recognition rate: without semiocclusions it was on average equal to 96% (average taken over all angles and distances), decreasing to 93% in the presence of semiocclusions. The topological descriptor at levels 1 and 2 exhibited classification rates without semiocclusions on average equal to 95% and 94%, respectively, dropping to 84% and 83%, respectively, with semiocclusions. These results suggest that the volumetric descriptor is more robust to semiocclusions than the topological one. At level 3, the volumetric descriptor showed an acceptable classification rate of 92% on average, which dropped to 87% with semiocclusions. At this level the topological descriptor gave average classification rates of 91% and 82% without and with semiocclusions, respectively. At level 4 the two descriptors exhibited the largest differences. In fact, the volumetric descriptor achieved very poor classification rates, whereas the topological descriptor was able to discriminate high-level postures well, achieving without semiocclusions an average of 96% at Low distances and 89% over all distances (up to 10 m), decreasing to 83% in the presence of semiocclusions. Whereas the volumetric classification rate was almost uniform across angles and distances, the topological one tended to decrease with distance and near viewpoint angles of 90° (lateral position), at which self-occlusions were heaviest.

The event detection performance was evaluated in real-home environments involving ten healthy subjects, 5 males and 5 females, with different physical characteristics in terms of age, height, and weight. Figure 6(a) shows sixteen sample 3D frames from the collected dataset. The typical apartment is shown in Figure 6(b), with the locations (from 1 to 11) in which the actions were performed. The sensor network used during the experiments included three sensing nodes: S1 in the bedroom, and S2 and S3 in the living room with overlapping views. Each sensing node is a MESA SwissRanger SR-4000 [31], shown in Figure 2(b), that is, a state-of-the-art TOF 3D sensor with compact dimensions (65 × 65 × 68 mm), noiseless functioning (0 dB noise), QCIF resolution (176 × 144 pixels), long distance range (up to 10 m), and wide (69° × 56°) Field-of-View (FOV). Since the SR-4000 sensor comes intrinsically calibrated by the manufacturer, the calibration procedure had to estimate only the extrinsic calibration parameters. The 3D sensors were managed by two logical detector nodes: one for S1 and another for both S2 and S3. The two logical detectors and the coordinator were implemented on the same physical node, an Intel Atom 1.6 GHz processor-based embedded PC shown in Figure 2(c). Four relevant AAL services were considered, namely, fall detection, wandering detection, ADL recognition, and training exercise recognition. One dataset for each service was collected, characterized by different combinations of occlusions, distances, angles, and feature levels, as reported in Table 2 (columns 2 to 8).

The last two columns in Table 2 report the achieved detection performance in terms of the True Negative Rate (TNR) and True Positive Rate (TPR) measures, defined as TNR = TN/(TN + FP) and TPR = TP/(TP + FN), where TP, TN, FP, and FN stand for True Positive, True Negative, False Positive, and False Negative, respectively.
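In code form, the two measures reduce to the following (a trivial helper, included only for completeness):

```python
def tnr_tpr(tp, tn, fp, fn):
    """True Negative Rate and True Positive Rate from confusion counts."""
    return tn / (tn + fp), tp / (tp + fn)
```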

Concerning the fall detection scenario, different fall (locations 2, 3, 4, and 5 in Figure 6(b)) and nonfall (locations 1 and 11 in Figure 6(b)) actions were performed, with and without the presence of occluding objects, in order to evaluate the discrimination performance. The system was able to discriminate falls correctly even in the presence of semiocclusions, achieving 97.5% TNR and 83% TPR. Since fall events were detected at the detector node level (intra-view), ambiguous situations such as those in which a fall was located between nonoverlapping views (e.g., location 4 in Figure 6(b)) gave rise to False Negatives. The wandering state, instead, was detected at the coordinator level, since it normally involves several nonoverlapping views. In general, it is not simple to detect a wandering state, since it is not just an aimless movement; rather, it is a “purposeful behavior initiated by a cognitively impaired and disoriented individual characterized by excessive ambulation,” as stated by Ellelt [32]. On the basis of this characterization, wandering states were discriminated from ADLs with 92.7% TNR and 81.6% TPR. While for fall detection the feature detail levels involved were almost exclusively the first two, with prevalent adoption of the volumetric representation, for wandering detection the third feature level was also involved, with moderate use of the topological representation. The following seven kinds of activities were performed in order to evaluate the ADL recognition capability: sleeping, waking up, eating, cooking, housekeeping, sitting to watch TV, and physical training. In this case all four feature levels were involved, although the fourth level had a low incidence (5%). ADLs were recognized with 98.3% TNR and 96.4% TPR. A moderate misclassification was observed for housekeeping activities, since they were occasionally recognized erroneously as a wandering state. For the training exercise scenario, a virtual trainer was developed, instructing participants to follow a physical activity program and to perform the recommended exercises correctly. The recommended physical exercises were of the following kinds: biceps curl, squatting, torso bending, and so forth. The feature levels involved were mostly the last two (40% and 50%, respectively), with prevalent use of topology-based features. The performed exercises were correctly recognized, achieving 99.2% TNR and 95.6% TPR. The detection results reported in Table 2 show that the system was able to select the most appropriate level of feature detail in almost all scenarios. The most computationally expensive steps were preprocessing, feature extraction, and posture classification. They were evaluated in terms of processing time, which was constant for preprocessing and classification, at 20 ms and 15 ms per frame, respectively. The volumetric descriptor took an average processing time of about 20 ms, corresponding to about 18 fps (frames per second). The topological approach, on the other hand, required a processing time that increased slightly across the hierarchical levels, from an average of 40 ms to 44 ms, due to the increasing occurrence of self-occlusions, achieving up to 13 fps.

Comparison with Related Studies. Different studies based on both wearable and ambient devices have been considered in order to compare the discussed results. All reported studies were carried out by detecting abnormal behaviors (e.g., falls and wandering) among normal human activities (e.g., ADLs and physical exercises) in real or near-real conditions. Studies based on 3D sensors are not reported since, to the best of the authors’ knowledge, comprehensive works on abnormal behavior detection with such sensors cannot yet be found in the literature. The results of the related studies are reported in Table 3.

4. Conclusions

The main contribution of this work is the design and evaluation of a unified solution for 3D sensor-based in-home monitoring supporting different context-aware AAL services. A modular platform has been presented, able to classify a large class of postures and detect events of interest while accommodating simple wall-mounted sensor installations. The platform was optimized and validated for embedded processing in order to meet typical in-home AAL requirements, such as low power consumption, noiselessness, and compactness. The experimental results have shown that the system was able to adapt effectively to four different important AAL scenarios by exploiting context-aware multilevel feature extraction. The process guarantees a reliable detection of relevant events, overcoming well-known problems affecting traditional vision-based monitoring systems, in a privacy-preserving way. Ongoing work concerns the on-field validation of the system, which will be deployed in elderly apartments in support of two different AAL scenarios concerning the detection of dangerous events and abnormal behaviors.

Acknowledgment

The presented work has been carried out within the BAITAH project funded by the Italian Ministry of Education, University and Research (MIUR).