Department of Electrical Engineering and Computer Science, University of Tennessee, Knoxville, TN 37996-3450, USA
Department of Biology, University of Memphis, Memphis, TN 38152-3150, USA
Abstract
Identifying functional groups of genes is a challenging problem for biological applications.
Text mining approaches can be used to build hierarchical clusters or trees from the information in the biological literature. In particular, the nonnegative matrix factorization (NMF) is examined as one approach to label hierarchical trees. A generic labeling algorithm as well as an evaluation technique is proposed, and the effects of different NMF parameters with regard to convergence and labeling accuracy are discussed. The primary goals of this study are to provide a qualitative assessment of the NMF and its various parameters and initialization, to provide an automated way to classify biomedical data, and to provide a method for evaluating labeled data assuming a static input tree. As a byproduct, a method for generating gold standard trees is proposed.
1. Introduction
High-throughput techniques in
genomics, proteomics, and related biological fields generate large amounts of
data that enable researchers to examine biological systems from a global
perspective. Unfortunately, however, the sheer mass of information available is
overwhelming, and data such as gene expression profiles from DNA microarray
analysis can be difficult to understand fully even for domain experts. Additionally,
performing these experiments in the lab can be expensive with respect to both
time and money.
In recent years, biological literature repositories
have become an alternative data source to examine phenotype. Many of the online
literature sources are manually curated, so annotations are assigned to
articles subjectively, in an imperfect and error-prone manner.
Given the time required to read and classify an article, automated methods may
help increase the annotation rate as well as improve existing annotations.
A recently developed tool that may help improve
annotation as well as identify functional groups of genes is the Semantic Gene
Organizer (SGO). SGO is a software environment based upon latent semantic
indexing (LSI) that enables researchers to view groups of genes in a global
context as a hierarchical tree or dendrogram [1]. The low-rank approximation
provided by LSI (for the original term-to-document associations) exposes latent
relationships so that the resulting hierarchical tree is simply a visualization
of those relationships that are reproducible and easily interpreted by
biologists. Homayouni et al. [2] have shown that SGO can identify groups of related
genes more accurately than term co-occurrence methods. LSI, however, is based
upon the singular value decomposition (SVD) [3], and since the input data for SGO is a nonnegative
matrix of weighted term frequencies, the negative values prevalent in the basis
vectors of the SVD are not easily interpreted.
On the other hand, the decomposition produced by the
recently popular nonnegative matrix factorization (NMF) can be readily
interpreted. Paatero and Tapper [4] were among the first researchers to investigate this
factorization, and Lee and Seung [5] demonstrated its use for both text mining and image
analysis. NMF is generated by an iterative algorithm that preserves the
nonnegativity of the original data; the factorization yields a low-rank,
parts-based representation of the data. In effect, common themes present in the
data can be identified simply by inspecting the factor matrices. Depending on
the interpretation, the factorization can induce both clustering and
classification. If NMF can accurately model the input data, it can be used to
both classify data and perform pattern recognition tasks [6]. Within the context of SGO,
this means that the groups of genes presented in the hierarchical trees can be
assigned labels that identify common attributes of protein function.
The interpretability of NMF, however, comes at a
price. Namely, convergence and stability are not guaranteed, and many
variations have been proposed [5], requiring different parameter choices. The goals of
this study are (1) to provide a qualitative assessment of the NMF and its
various parameters, particularly as they apply to the biomedical context, (2) to
provide an automated way to classify biomedical data, and (3) to provide a
method for evaluating labeled data assuming a static input tree. As a
byproduct, a method for generating “gold standard” trees is proposed.
2. Methods
As outlined in [7], hierarchical trees can be constructed for a given
group of genes. Once those trees are formed, techniques that label the interior
nodes of those trees can be examined.
2.1. Nonnegative matrix factorization
Given an $m \times n$ nonnegative matrix $A = [a_{ij}]$, where each entry $a_{ij}$ denotes the term weight of token $i$ in gene document $j$, the rows of $A$ represent term vectors that show how terms are distributed across the entire collection. Similarly, the columns of $A$ show which terms are present within a gene document. Consider the term-by-document matrix $A$ in Table 1 derived from the sample document collection [7] in Table 2. Here, log-entropy term weighting [8] is used to define the relative importance of term $i$ for document $j$. Specifically, $a_{ij} = l_{ij} g_i$, where
$$l_{ij} = \log_2(1 + f_{ij}), \qquad g_i = 1 + \frac{\sum_j p_{ij} \log_2 p_{ij}}{\log_2 n}, \tag{1}$$
$f_{ij}$ is the frequency of token $i$ in document $j$, and $p_{ij} = f_{ij} / \sum_j f_{ij}$ is the probability of token $i$ occurring in document $j$.
By design, tokens that appear less frequently across the collection but more
frequently within a document will be given higher weight. That is,
distinguishing tokens will tend to have higher weights assigned to them, while
more common tokens will have weights closer to zero.
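As a rough illustration of the weighting just described, the following sketch computes log-entropy weights for a tiny frequency matrix. It assumes the standard log-entropy form (local weight log2(1 + f), global weight 1 plus the normalized entropy of the token over the documents); the function name and the toy counts are hypothetical and not taken from the study.

```python
import math

def log_entropy_weights(freq):
    """Weight a term-by-document frequency matrix (list of rows) with log-entropy.

    freq[i][j] is the raw count of token i in document j.
    """
    n_docs = len(freq[0])
    weighted = []
    for row in freq:
        total = sum(row)
        # Entropy term: sum_j p_ij * log2(p_ij), skipping zero counts.
        entropy = 0.0
        for f in row:
            if f > 0:
                p = f / total
                entropy += p * math.log2(p)
        # Global weight: 1 + entropy / log2(n); defined as 1 for a single document.
        g = 1.0 + entropy / math.log2(n_docs) if n_docs > 1 else 1.0
        # Local weight: log2(1 + f_ij).
        weighted.append([g * math.log2(1 + f) for f in row])
    return weighted

counts = [
    [3, 0, 0, 0],   # token concentrated in a single document
    [1, 1, 1, 1],   # token spread evenly over every document
]
weights = log_entropy_weights(counts)
```

In this toy case the concentrated token keeps its full local weight in the one document that uses it, while the evenly spread token is weighted to zero everywhere, which matches the behavior described above.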
Table 1: Term-document matrix for the sample collection in Table 2.
Table 2: Sample collection with dictionary terms displayed in bold.
If NMF is applied to the sample term-document matrix in
Table 1, one possible factorization is given in Tables 3 and 4; the
approximation to the term-document matrix generated by multiplying
$W \times H$ is given in Table 5. The top-weighted terms
for each feature are presented in Table 6. By inspection, the sample collection
has features that represent leukemia, alcoholism, anxiety, and autism. If each document and term is
assigned to its most dominant feature, then the original term-document matrix
can be reorganized around those features. The restructured matrix typically
resembles a block diagonal matrix and is given in Table 7.
Table 3: Feature matrix $W$ for the sample collection.
Table 4: Coefficient matrix $H$ for the sample collection.
Table 5: Approximation to sample term-document matrix given in Table 1.
Table 6: Top 5 words for each feature from the sample collection.
Table 7: Rearranged term-document matrix for the sample collection.
NMF of $A$ is based on an iterative technique that attempts to find two nonnegative factor matrices, $W$ and $H$, such that
$$A \approx WH, \tag{2}$$
where $W$ and $H$ are $m \times k$ and $k \times n$ matrices, respectively. Typically, $k$ is chosen so that $k \ll \min(m, n)$.
The optimal choice of $k$ is problem-dependent [9]. This factorization
minimizes the squared Euclidean distance objective function [10]
$$F(W, H) = \|A - WH\|_F^2. \tag{3}$$
The objective (or cost) function is convex in either $W$ or $H$, but not in both variables together. As such, finding global minima to the problem
is unrealistic; however, finding several local minima is within reason. Also,
for each solution, the matrices $W$ and $H$ are not unique. This property is evident when
examining $WDD^{-1}H$ for any nonnegative invertible matrix $D$ [11].
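As a small worked illustration of this non-uniqueness, consider two factors and take the invertible matrix to be a positive diagonal scaling. The particular values below are chosen only for illustration and do not come from the study.

```latex
WH = W D D^{-1} H = (WD)(D^{-1}H), \qquad
D = \begin{pmatrix} 2 & 0 \\ 0 & \tfrac{1}{2} \end{pmatrix}, \quad
D^{-1} = \begin{pmatrix} \tfrac{1}{2} & 0 \\ 0 & 2 \end{pmatrix}
```

Here $WD$ doubles the first feature vector and halves the second, while $D^{-1}H$ rescales the matching rows of the coefficient matrix in the opposite direction. Both new factors stay nonnegative and their product is unchanged, so the same approximation is described by a different pair of factor matrices.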
The goal of NMF is to approximate the original
term-by-gene document space as accurately as possible with the factor matrices
$W$ and $H$.
As noted in [12], the
singular value decomposition (SVD) produces the optimal rank-$k$
approximation with respect to the Frobenius
norm. Unfortunately, this optimality frequently comes at the cost of negative
elements. The factor matrices of NMF, however, are strictly nonnegative, which
may facilitate direct interpretability of the factorization. Thus, although an
NMF approximation may not be optimal from a mathematical standpoint, it may be
sufficient and yield better insight into the dataset than the SVD for certain
applications.
Upon completion of NMF, the factor matrices $W$ and $H$ will, in theory, approximate the original
matrix $A$ and yet contain some valuable information
about the dataset in question. As presented in [10], if the approximation is
close to the original data, then the factor matrices can uncover some
underlying structure within the data. To reinforce this, $W$
is commonly referred to as the feature matrix containing feature vectors that describe the themes
inherent within the data, while $H$
can be called a coefficient matrix since its columns
describe how each document spans each feature and to what degree.
Currently, many implementations of NMF rely on random
nonnegative initialization. As NMF is sensitive to its initial seed, this
hinders the reproducibility of the results generated. Boutsidis and
Gallopoulos [13]
propose the nonnegative double singular value decomposition (NNDSVD) scheme as
a possible remedy to this concern. NNDSVD aims to exploit the SVD as the
optimal rank-$k$ approximation of $A$.
The heuristic overcomes the negative elements of the SVD by enforcing
nonnegativity whenever encountered and by iteratively approximating the outer
product of each pair of singular vectors. As a result, some of the properties
of the data are preserved in the initial starting matrices $W$ and $H$.
Once both matrices are initialized, they can be updated using the
multiplicative rule [10]:
$$H_{cj} \leftarrow H_{cj} \frac{(W^T A)_{cj}}{(W^T W H)_{cj}}, \qquad W_{ic} \leftarrow W_{ic} \frac{(A H^T)_{ic}}{(W H H^T)_{ic}}. \tag{4}$$
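The multiplicative rule can be sketched in a few lines of NumPy. This is a minimal illustration of the standard Lee and Seung style update for the squared Euclidean objective, not the implementation used in the study; the small constant `eps` in the denominators is an assumption added here to avoid division by zero, and the random toy matrices merely stand in for real data.

```python
import numpy as np

def multiplicative_update(A, W, H, eps=1e-9):
    """One update of H, then W, for the objective ||A - WH||_F^2.

    eps is a small constant (an assumption of this sketch) that guards
    against division by zero.
    """
    H = H * (W.T @ A) / (W.T @ W @ H + eps)
    W = W * (A @ H.T) / (W @ H @ H.T + eps)
    return W, H

rng = np.random.default_rng(0)
A = rng.random((6, 5))          # toy nonnegative "term-by-document" matrix
W = rng.random((6, 2))          # random nonnegative starting factors, k = 2
H = rng.random((2, 5))

errors = [np.linalg.norm(A - W @ H)]
for _ in range(50):
    W, H = multiplicative_update(A, W, H)
    errors.append(np.linalg.norm(A - W @ H))
```

Because every factor in the update is nonnegative, the entries of both matrices stay nonnegative, and an entry that starts at exactly zero remains zero, which is the "locking" behavior that motivates the different zero-replacement schemes.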
2.2. Labeling algorithm
Latent semantic indexing (LSI), which is based on the
SVD, can be used to create a global picture of the data automatically. In this
particular context, hierarchical trees can be constructed from pairwise
distances generated from the low-rank LSI space. Distance-based algorithms such
as FastME can create hierarchies that accurately approximate distance matrices
in
time [14]. Once a tree is built, a labeling algorithm can be
applied to identify branches of the tree. Finally, a “gold standard” tree
and a standard performance measure that evaluates the quality of tree labels
must be defined and applied.
Given a hierarchy, few well-established automated
labeling methods exist. To apply labels to a hierarchy, one can associate a
weighted list of terms with each taxon. Once these lists have been determined,
labeling the hierarchy is simply a matter of recursively inheriting terms up
the tree from each child node; adding weights of shared terms will ensure that
more frequently used terms are more likely to have a larger weight at higher
levels within the tree. Intuitively, these terms are often more general
descriptors.
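The recursive inheritance described above can be sketched as follows. The tree representation (nested dictionaries with `terms` and `children` keys) and the example terms and weights are hypothetical; only the idea of summing the weights of shared terms while moving up the tree comes from the text.

```python
from collections import Counter

def label_tree(node):
    """Return the weighted term list for `node`, filling in interior nodes.

    A node is a dict with either a "terms" mapping (leaf) or a "children"
    list (interior node). Interior labels are the summed child labels.
    """
    if "children" not in node:
        return Counter(node["terms"])
    combined = Counter()
    for child in node["children"]:
        combined.update(label_tree(child))   # adds weights of shared terms
    node["terms"] = dict(combined)
    return combined

tree = {
    "children": [
        {"terms": {"amyloid": 0.9, "protein": 0.4}},
        {"children": [
            {"terms": {"notch": 0.8, "protein": 0.3}},
            {"terms": {"reelin": 0.7, "protein": 0.5}},
        ]},
    ]
}
root_terms = label_tree(tree)
top_term = root_terms.most_common(1)[0][0]
```

In this toy tree the shared term accumulates weight from every leaf and ends up as the top label at the root, illustrating why more general descriptors tend to dominate the higher levels.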
This algorithm is robust in that it can be slightly
modified and applied to any tree where a ranked list can be applied to each
taxon. For example, by querying the SVD-generated vector space for each
document, a ranked list of terms can be created for each document and the tree
labeled accordingly. As a result, assuming the initial ranking procedure is
accurate, any ontological annotation can be enhanced with terms from the text
it represents.
To create a ranked list of terms from NMF, the
dominant coefficient $H_{ij}$ in $H$ is extracted for document $j$.
The corresponding feature $W_i$ is then scaled by $H_{ij}$ and assigned to the taxon representing
document $j$,
and the top 100 terms are chosen to represent the taxon. This method can be
expanded to incorporate branch length information, thresholds, or multiple
features.
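A minimal sketch of this ranking step is shown below, using plain nested lists for the feature and coefficient matrices. The function name, the toy matrices, and the three-term dictionary are hypothetical; the default of 100 terms follows the text.

```python
def ranked_terms(W, H, terms, doc, top_n=100):
    """Rank terms for document `doc` using its dominant NMF feature.

    W is the m x k feature matrix and H the k x n coefficient matrix, both
    given as nested lists; `terms` lists the m dictionary tokens.
    """
    k = len(H)
    # Dominant feature: the row index of the largest entry in column `doc`.
    feature = max(range(k), key=lambda i: H[i][doc])
    scale = H[feature][doc]
    # Scale the chosen feature vector by the dominant coefficient.
    scored = [(terms[t], W[t][feature] * scale) for t in range(len(terms))]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]

terms = ["leukemia", "alcohol", "anxiety"]
W = [[0.9, 0.0],
     [0.1, 0.8],
     [0.0, 0.6]]
H = [[0.2, 0.7],
     [0.9, 0.1]]
labels = ranked_terms(W, H, terms, doc=0, top_n=2)
```

Scaling by the coefficient does not change the ranking within a single document, but it matters once the lists of several documents are summed further up the tree, since strongly associated documents then contribute more weight.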
2.3. Recall measure
Once labelings are produced for a given hierarchical
tree, a measure of “goodness” must be calculated to determine which
labeling is the “best.” When dealing with simple return lists of
documents that can be classified as either relevant or not relevant to a user's
needs, information retrieval (IR) methods typically default to using precision
and recall to describe the performance of a given retrieval system. Precision
is the ratio of relevant returned items to total number of returned items,
while recall is the percentage of relevant returned items with respect to the
total number of relevant items. Once a group of words is chosen to label an
entity, the order of the words carries little meaning, so precision has limited
usefulness in this application. When comparing a generated labeling to a
“correct” one, recall is an intuitive measure.
Unfortunately, in this context, one labeled hierarchy
must be compared to another. Surprisingly, relatively little work has been done
that addresses this problem. Kiritchenko in [15] proposed the hierarchical precision and recall measures,
denoted as $hP$ and $hR$,
respectively. These measures take advantage of hierarchical consistency to
compare two labelings with a single number. Unfortunately, condensing all the
information held in a labeled tree into a single number loses some information.
In the case of NMF, the effects of parameters on labeling accuracy with respect
to node depth are of interest, so a different measure would be more informative.
One such measure finds the average recall of all the nodes at a certain depth
within the tree. To generate nonzero recall, however, common terms must exist
between the labelings being compared. Unfortunately, many of the terms present
in MeSH headings are not strongly represented in the text. As a result, the
text vocabulary must be mapped to the MeSH vocabulary to produce significant
recall.
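The depth-based measure can be sketched as follows, under the assumption that each node carries a predicted term list and a gold-standard term list and that recall is computed on the sets of terms. The tuple layout and the example terms are hypothetical.

```python
def recall(predicted, gold):
    """Fraction of gold-standard terms that also appear in the predicted label."""
    gold = set(gold)
    if not gold:
        return 0.0
    return len(gold & set(predicted)) / len(gold)

def average_recall_at_depth(nodes, depth):
    """Average recall over all nodes that sit `depth` edges below the root.

    Each node is a (depth, predicted_terms, gold_terms) tuple.
    """
    scores = [recall(p, g) for d, p, g in nodes if d == depth]
    return sum(scores) / len(scores) if scores else 0.0

nodes = [
    (0, ["protein", "brain", "cell"], ["protein", "cell"]),
    (1, ["amyloid", "tau"], ["amyloid", "plaque"]),
    (1, ["notch", "reelin"], ["notch", "reelin"]),
]
depth_one = average_recall_at_depth(nodes, 1)
```

Because recall only counts terms that occur in both vocabularies, a labeling whose vocabulary barely overlaps the gold standard scores near zero regardless of its quality, which is the motivation for mapping the text vocabulary onto the MeSH vocabulary.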
2.4. Feature vector replacement
When working with gene documents, many cases exist
where the terminology used in MeSH is not found within the gene documents themselves.
Even though a healthy percentage of the exact MeSH terms may exist in the corpus, the term-document matrix is so heavily overdetermined (i.e., the number of terms is significantly larger than the number of documents) that expecting significant recall values at any level within the tree becomes unreasonable.
This is not to imply that the terms produced by NMF are without value. On the
contrary, the value in those terms is exactly that they may reveal what was
previously unknown. For the purposes of validation, however, some method must
be developed that enables a user to discriminate between labelings even though
both have little or no recall with the MeSH-labeled hierarchy. In effect, the
vocabulary used to label the tree must be controlled for the purposes of
validation and evaluation.
To produce a labeling that is mapped into the MeSH
vocabulary, the top globally weighted MeSH headings are chosen for
each document; these MeSH headings can be extracted from the MeSH metacollection
[7]. By inspection of $H$,
the dominant feature associated with each document is chosen and assigned to
that document. The corresponding top MeSH headings are then themselves parsed into
tokens and assigned to a new MeSH feature vector appropriately scaled by the
corresponding coefficient in $H$.
The feature vector replacement algorithm is given in Algorithm 1. Note that
the MeSH feature matrix is distinguished from $W$
since the dictionary of MeSH headings will
likely differ in size and composition from the original corpus dictionary. The
number of documents, however, remains constant.
Algorithm 1: Feature vector replacement algorithm.
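The replacement procedure described in this section can be sketched roughly as follows. This is an interpretation of the prose, not a transcription of Algorithm 1: the data layout, the whitespace tokenization of headings, and the choice to weight each token by its heading's global weight times the dominant coefficient are all assumptions made for illustration.

```python
from collections import defaultdict

def replace_features(H, mesh_headings, top=10):
    """Build MeSH-vocabulary feature vectors from the coefficient matrix H.

    H is k x n (nested lists). mesh_headings[j] is a list of
    (heading, global_weight) pairs for document j; both the data layout and
    the weight-times-coefficient scaling are assumptions of this sketch.
    """
    k = len(H)
    features = [defaultdict(float) for _ in range(k)]
    for j, headings in enumerate(mesh_headings):
        # Dominant feature for document j and its coefficient.
        dominant = max(range(k), key=lambda i: H[i][j])
        coeff = H[dominant][j]
        # Keep only the highest-weighted headings for this document.
        best = sorted(headings, key=lambda hw: hw[1], reverse=True)[:top]
        for heading, weight in best:
            for token in heading.lower().split():
                features[dominant][token] += weight * coeff
    return [dict(f) for f in features]

H = [[0.8, 0.1, 0.6],
     [0.2, 0.9, 0.3]]
mesh = [
    [("Alzheimer Disease", 1.0), ("Amyloid", 0.5)],
    [("Cell Adhesion", 1.0)],
    [("Alzheimer Disease", 0.5)],
]
mesh_features = replace_features(H, mesh)
```

The resulting feature vectors live in the MeSH token vocabulary rather than the corpus vocabulary, while the number of documents and the assignment of documents to features are unchanged.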
Once full MeSH feature vectors have been constructed,
the tree can be labeled via the procedure outlined in [7]. As a result of this
replacement, better recall can be expected, and the specific word usage
properties inherent in the MeSH (or any other) ontology can be exploited.
2.5. Alternative labeling method
An alternative method to label a tree is to vary the
parameter $k$ from (2) with node depth. In theory, more
pertinent and accurate features will be preserved if the clusters inherent in
the NMF coincide with those in the tree generated via the SVD space. For
smaller clusters and more specific terms, higher $k$
should be necessary; conversely, the ancestor
nodes should require smaller $k$
and more general terms since they cover a
larger set of genes spanning a larger set of topics. Inheritance of terms can
be performed once again by inheriting common terms; however, an upper
threshold of inheritance can be imposed. For example, for all the nodes in the
subtree induced by a given node, high $k$ can be used. If all the genes induced by
that node are clustered together by NMF, then all the
nodes in the subtree it induces will maintain the same labels. For the
ancestor of that node, a different value of $k$ can be used. Although this method requires
some manual curation, it can potentially produce more accurate labels.
3. Results
The evaluation of the factorization produced by NMF is
nontrivial as there is no set standard for examining the quality of basis
vectors produced. In several studies thus far, the results of NMF runs have
been evaluated by domain experts. For example, Chagoyen et al. [16] performed several NMF runs
and then independently asked domain experts to interpret the resulting feature
vectors. This approach, however, limits the usefulness of NMF, particularly in
discovery-based genomic studies for which domain experts are not readily
available. Here, two different automated protocols are presented to evaluate
NMF results. First, the mathematical properties of the NMF runs are examined,
then the accuracy of the application of NMF to hierarchical trees is
scrutinized.
3.1. Input parameters
To test NMF, the 50TG collection presented in
[2] was used. This
collection was constructed manually by selecting genes known to be associated
with at least one of the following categories: (1) development, (2) Alzheimer's
disease, and (3) cancer biology. Each gene document is simply a concatenation of
all titles and abstracts of the MEDLINE citations cross-referenced in the
mouse, rat, and human EntrezGene (formerly LocusLink) entries for each gene.
Two different NMF initialization strategies were used:
the NNDSVD [17] and
randomization. Five different random trials were conducted while four were
performed using the NNDSVD method. Although the NNDSVD produces a static
starting matrix, different methods can be applied to remove zeros from the
initial approximation to prevent them from getting “locked” throughout
the update process. Initializations that maintained the original zero elements
are denoted NNDSVDz, while NNDSVDa, NNDSVDe, and NNDSVDme substitute the average
of all elements of $A$, a small constant $\epsilon$, or the machine epsilon,
respectively, for those zero elements;
$\epsilon$ was set to a value significantly smaller than the
smallest observed value in either $W$ or $H$, while the machine epsilon
is the smallest positive value the computer could represent.
Both NNDSVDz and NNDSVDa were described previously in [13], whereas NNDSVDe and
NNDSVDme are added in this study as natural extensions to NNDSVDz that would
not suffer from the restrictions of locking zeros due to the multiplicative
update. The parameter $k$
was assigned the values of 2, 4, 6, 8, 10, 15,
20, 25, and 30.
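The four zero-handling variants can be sketched as a single helper that post-processes an initial factor matrix. The NNDSVD computation itself is not reproduced here; `W0` is only a stand-in for one of its output factors. The value `1e-9` for the small constant is a placeholder rather than the setting used in the study, and the smallest positive normal float is used as a stand-in for the machine epsilon described in the text.

```python
import numpy as np

def fill_zeros(M, A, variant, eps=1e-9):
    """Replace the zero entries of an initial factor M according to `variant`.

    "z" keeps the zeros, "a" uses the mean of A, "e" uses a small constant
    eps (1e-9 is a placeholder, not the paper's value), and "me" uses the
    smallest positive normal float as a stand-in for the machine epsilon.
    """
    fills = {
        "z": 0.0,
        "a": A.mean(),
        "e": eps,
        "me": np.finfo(float).tiny,
    }
    out = M.astype(float).copy()
    out[out == 0] = fills[variant]
    return out

A = np.array([[1.0, 0.0], [3.0, 4.0]])
W0 = np.array([[0.5, 0.0], [0.0, 2.0]])   # stand-in for an NNDSVD factor
starts = {v: fill_zeros(W0, A, v) for v in ("z", "a", "e", "me")}
```

Only the "z" variant leaves exact zeros in the starting matrix; the other three give the multiplicative update a strictly positive starting point, so no entry is locked at zero from the outset.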
Each of the NMF runs iterated until it reached 1,000
iterations or a stationary point in both $W$ and $H$.
That is, at iteration $i$,
when $\|W_{i-1} - W_i\|_F < \tau$ and $\|H_{i-1} - H_i\|_F < \tau$,
convergence is assumed. The parameter $\tau$
was set to 0.01. Since convergence is not
guaranteed under all constraints, if the objective function increased between
iterations, the factorization was stopped and assumed not to converge.
The log-entropy term-weighting scheme (see [8]) was used to generate the original token weights for
each collection.
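The three stopping rules (iteration cap, stationary point, and an increase in the objective) can be sketched as a small driver loop. The Frobenius-norm test on successive iterates, the plain multiplicative updates, and the `eps` division guard are assumptions of this sketch; the threshold 0.01 and the cap of 1,000 iterations follow the text.

```python
import numpy as np

def run_nmf(A, W, H, tau=0.01, max_iter=1000, eps=1e-9):
    """Iterate multiplicative updates until one of three stopping rules fires.

    Returns (W, H, iterations, status), where status is "stationary",
    "max_iter", or "diverged". eps is a division guard assumed by this sketch.
    """
    previous = np.linalg.norm(A - W @ H) ** 2
    for it in range(1, max_iter + 1):
        H_new = H * (W.T @ A) / (W.T @ W @ H + eps)
        W_new = W * (A @ H_new.T) / (W @ H_new @ H_new.T + eps)
        objective = np.linalg.norm(A - W_new @ H_new) ** 2
        if objective > previous:
            return W, H, it, "diverged"       # objective increased: stop
        moved_w = np.linalg.norm(W_new - W)
        moved_h = np.linalg.norm(H_new - H)
        W, H, previous = W_new, H_new, objective
        if moved_w < tau and moved_h < tau:
            return W, H, it, "stationary"     # both factors stopped moving
    return W, H, max_iter, "max_iter"

rng = np.random.default_rng(1)
A = rng.random((8, 6))
W, H, iterations, status = run_nmf(A, rng.random((8, 3)), rng.random((3, 6)))
```

Returning the status alongside the factors makes it easy to separate converging runs from the rest when averaging results over several initializations.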
3.2. Relative error and convergence
The SVD produces the mathematically optimal low-rank
approximation of any matrix with respect to the Frobenius norm, and for all
other unitarily-invariant matrix norms. Whereas NMF can never produce a more
accurate approximation than the SVD, its proximity to $A$
relative to the SVD can be measured. Namely,
the relative error, computed as
$$RE = \frac{\|A - WH\|_F - \|A - USV^T\|_F}{\|A - USV^T\|_F}, \tag{5}$$
where both factorizations are
truncated after $k$ dimensions (or factors), can show how close
the feature vectors produced by the NMF are to the optimal basis [18].
Intuitively, as $k$ increases, the NMF factorization should more
closely approximate $A$.
As shown in Figure 1, this is exactly the case. Surprisingly, however, the
average of all converging NMF runs is under 10% relative error compared to the
SVD, with that error tending to rise as $k$
increases. The proximity of the NMF to the SVD
implies that, for this small dataset, NMF can accurately approximate the data.
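A relative error of this kind can be computed as sketched below, assuming it is defined as the gap between the NMF and truncated-SVD Frobenius errors divided by the truncated-SVD error. The random matrix and the 200 plain multiplicative updates are only placeholders for real data and a real NMF run.

```python
import numpy as np

def relative_error(A, W, H):
    """Error of WH relative to the best rank-k approximation from the SVD."""
    k = W.shape[1]
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    # Rank-k truncation: scale the first k left vectors by the singular values.
    svd_error = np.linalg.norm(A - (U[:, :k] * s[:k]) @ Vt[:k, :])
    nmf_error = np.linalg.norm(A - W @ H)
    return (nmf_error - svd_error) / svd_error

rng = np.random.default_rng(2)
A = rng.random((10, 7))
W, H = rng.random((10, 3)), rng.random((3, 7))
for _ in range(200):                      # plain multiplicative updates
    H *= (W.T @ A) / (W.T @ W @ H + 1e-9)
    W *= (A @ H.T) / (W @ H @ H.T + 1e-9)
rel_err = relative_error(A, W, H)
```

Since the truncated SVD is the optimal rank-k approximation in the Frobenius norm, this quantity cannot be negative (up to rounding), and a value near zero indicates that the NMF factors capture almost as much of the data as the optimal basis.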
Figure 1: Error measures for the SVD, best NMF run, and average NMF run for the 50TG collection.
Next, several different initialization methods
(discussed in Section 3.1) were examined. To study the effects on convergence,
one set of NMF parameters must be chosen as the baseline against which to
compare. When the NMF is examined with no additional constraints, the NNDSVDa
initialization method consistently produces the most accurate approximation
when compared to NNDSVDe, NNDSVDme, NNDSVDz, and random initialization [7]. The relative error that NNDSVDa
generates is less than 1% for most
tested values of $k$.
Unfortunately, NNDSVDa requires several hundred iterations to converge.
NNDSVDe performs comparably to NNDSVDa with regard to
relative error, often within a fraction of a percent. For smaller values of $k$,
NNDSVDe takes significantly longer to
converge than NNDSVDa, although the exact opposite is true for larger values
of $k$.
NNDSVDz, on the other hand, converges much faster for smaller values of $k$
at the cost of accuracy, as the locked zero
elements have an adverse effect on the best solution that can be converged
upon. Not surprisingly, NNDSVDme performed comparably to NNDSVDz in many cases;
however, it was able to achieve slightly more accurate approximations as the
number of iterations increased. In fact, NNDSVDme was identical to NNDSVDz in
most cases and will not be mentioned henceforth unless noteworthy behavior is
observed. Random initialization performs comparably to NNDSVDa in terms of
accuracy and favorably in terms of speed for small $k$,
but as $k$ increases, both speed and accuracy suffer. A
graph illustrating the convergence rates for a fixed value of $k$
is depicted in Figure 2.
Figure 2: Convergence graph comparing the NNDSVDa, NNDSVDe, NNDSVDme, NNDSVDz, and best random NMF runs of the 50TG collection.
In terms of actual elapsed time, the improved
performance of the NNDSVD does not come without a cost. In the context of SGO,
the time spent computing the initial SVD of $A$
for the first step of the NNDSVD algorithm is
assumed to be zero since the SVD is needed a priori for querying purposes.
However, the initialization time required to complete the NNDSVD for the value of $k$ considered here
is nearly 21 seconds, while the cost for
random initialization is relatively negligible. All runs were performed on a
machine running Debian Linux 3.0 with an Intel Pentium III 1-GHz processor and 256 MB
of memory. Since the cost of each NMF iteration is nearly 0.015 seconds per factor
at that value of $k$, the cost of performing the NNDSVD is
(approximately) equivalent to 55 NMF iterations. Convergence taking into
account this cost is shown in Figure 3.
Figure 3: Convergence graph comparing the NNDSVDa, NNDSVDe, NNDSVDme, NNDSVDz, and best random NMF runs of the 50TG collection, taking into account initialization time.
Figure 4: MAR as a function of $k$ under the various NNDSVD initialization schemes with no constraints for the 50TG collection.
3.3. Labeling recall
Measuring recall is a quantitative way to validate
“known” information within a hierarchy. Here, a method was developed to
measure recall at various branch points in a hierarchical tree (described in
Section 2.3). The gold standard used for measuring recall included the MeSH
headings associated with gene abstracts. The mean average recall (MAR) denotes the
value attained when the average recall at each level is averaged across all
branches of the tree. Here, a hierarchy level refers to all nodes that share
the same distance (number of edges) from the root. This section discusses the
parameter settings that provided the best labelings, in both the local and the
global sense, for the tree generated in [2]
with 47 interior nodes spread across 11 levels.
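One way to read the MAR definition is as an average of per-level average recalls, which can be sketched as follows. The dictionary layout and the recall values are made up for illustration, and this reading of "averaged across all branches" is an assumption.

```python
def mean_average_recall(recall_by_level):
    """Average the per-level mean recall over every level of the tree.

    recall_by_level maps a level (edges from the root) to the recall values
    of the nodes on that level.
    """
    level_means = [sum(v) / len(v) for v in recall_by_level.values() if v]
    return sum(level_means) / len(level_means) if level_means else 0.0

recalls = {0: [0.8], 1: [0.5, 0.7], 2: [0.2, 0.4, 0.6]}
mar = mean_average_recall(recalls)
```

Averaging per level first means that a level with many nodes does not outweigh a level with few, so the root and the deepest levels contribute equally to the final figure.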
After applying the labeling algorithm described in
Section 2.2 to the factors produced by NMF, the MAR generated was very low
(under 25%). Since the NMF-generated vocabulary did not overlap well with the
MeSH dictionary, the NMF features were mapped into MeSH features via the
procedure outlined in Algorithm 1, where the most dominant feature
represented each document only if the corresponding weight in the $H$
matrix was greater than 0.5. Also, the top 10
MeSH headings were chosen to represent each document, and the top 100
corresponding terms were extracted to formulate each new MeSH feature vector.
Consequently, the resulting MeSH feature vectors produced labelings with
greatly increased MAR.
With regard to the accuracy of the labelings, several
trends exist. As $k$
increases, the achieved MAR increases as well.
This behavior could be predicted since increasing the number of features also
increases the size of the effective labeling vocabulary, thus enabling a more
robust labeling. When
,
the average MAR across all runs is approximately 68%.
Since the NNDSVDa initialization provided the best
convergence properties, it will be used as a baseline against which to
compare. If
is not specified, assume
.
In terms of MAR, NNDSVDa produced below average results, with both NNDSVDe and
NNDSVDz consistently outperforming NNDSVDa for most values of $k$;
NNDSVDe and NNDSVDz attained similar MAR values as depicted in Figure 4. The
recall of the baseline case using NNDSVDa,
depicted by node level, is shown in Figure 6.
Figure 5: Hierarchical tree for a 50 test gene (50TG) collection described in [2] using updated MEDLINE abstracts.
Figure 6: Recall as a function of node level for the NNDSVD initialization on the 50TG collection. The achieved MAR for the baseline case is 58.95%, while the best achieved MAR for the NNDSVD initialization is 74.56%.
The 11 node levels of the 50TG hierarchical tree
[2] shown in Figure 5
can be broken into thirds to analyze the accuracy of a labeling within a depth
region of the tree. The MAR for NNDSVDa for each of the thirds is approximately
58%, 63%, and 54%, respectively. With respect to the topmost third of the tree,
any constraint applied to any NNDSVD initialization other than smoothing
applied to NNDSVDa provided an improvement
over the 58% MAR. In all cases, the resulting MAR was at least 75%. NNDSVDa
performed slightly below average over the middle third at 63%. Overall, nearly
any constraint improved or matched recall over the base case over all thirds,
with the exception that enforcing sparsity on one of the factor matrices
underperformed NNDSVDa in the bottom third of
the tree; all other constraints achieved at least 54% MAR for the bottom third.
With respect to different values of $k$,
similar tendencies exist over all thirds. NNDSVDa is among the worst in terms
of MAR, with the exception that it does well in the topmost third when $k$
is either 2 or 4. There was no discernible
advantage when comparing NNDSVD initialization to its random counterpart.
Overall, the best NNDSVD (and hence reproducible) MAR was achieved using
NNDSVDe (also shown in Figure 6).
3.4. Labeling evaluation
Although relative error and recall are measures that
can automatically evaluate a labeling, ultimately the final evaluation still requires some manual observation and interpretation. For example, assuming the tree given in Figure 7 with leaf nodes representing the gene clusters given in Table 8, one possible labeling using MeSH headings generated from Algorithm 1 is
given in Table 9, and a sample NMF-generated labeling is given in Table 10.
Table 8: Genes comprising each leaf node of the tree shown in Figure 7.
Table 9: Top 10 MeSH terms for the leaf nodes of the tree shown in Figure 7.
Table 10: Top 10 terms for the leaf nodes of the tree shown in Figure 7.
Figure 7: A hierarchical tree containing a set of genes related to Alzheimer's disease (leaf nodes A and B), brain development (leaf nodes C and D), or both Alzheimer's disease and brain development (leaf node E).
As expected, many of the MeSH terms were too general
and were associated with many of the five gene clusters, for example,
genetics, proteins, chemistry, and cell. However, some MeSH terms were indeed
useful in describing the function of the gene clusters. For example, Cluster A
MeSH labels are suggestive of the LDL and alpha macroglobulin receptor protein
family; Cluster B MeSH labels are associated with Alzheimer's disease and
amyloid beta metabolism; Cluster C labels are associated with extracellular
matrix and cell adhesion; Cluster D labels are associated with embryology and
inhibitors; and Cluster E labels are associated with tau protein and
lymphocytes.
In contrast to MeSH labeling, the text labeling by NMF was
much more specific and functionally descriptive. In general, the first few
terms (highest ranking terms) in each cluster defined either the gene name or
alias. Interestingly, each cluster also contained terms that were functionally
significant. For example, rap (Cluster A) is known to be a ligand for a2m and
lrp1 receptors. In addition, the 4 genes in Cluster C are known to be part of a
molecular signaling pathway involving Cajal-Retzius cells in the brain that
control neuronal positioning during development. Lastly, the physiological
effects of Notch1 (Cluster D) have been linked to activation of intracellular
transcription factors Hes1 and Hes5.
Importantly, the specific nature of text labeling by
NMF allows identification of previously unknown functional connections between
genes and clusters of genes. For example, the term PS1 appeared in both Cluster
B and Cluster D. This finding is very interesting in that PS1 encodes a protein
which is part of a protease complex called gamma secretases. In addition to
cleaving the Alzheimer protein APP, gamma
secretases have been shown to cleave the developmentally important Notch
protein. Therefore, these results indicate that NMF labeling provides a useful
tool for discovering new functional associations between genes in a cluster as
well as across multiple gene clusters.
4. Discussion
While comparing NMF runs, several trends can be
observed both with respect to mathematical properties and recall tendencies.
First, and as expected, as $k$
increases, the approximation achieved by the
SVD with respect to $A$
is more accurate; the NMF can provide a
relatively close approximation to $A$
in most cases, but its error relative to the SVD also increases
with $k$.
Second, NNDSVDa provides the fastest convergence in terms of number of
iterations to the closest approximations. Third, applying additional constraints
such as smoothing and sparsity [7] has little noticeable effect on either convergence or
recall, and in many cases greatly decreases the likelihood that a stationary
point will be reached. Finally, to generate relatively “good” approximation
error (within 5%), about 20–40 iterations are recommended using either NNDSVDa
or NNDSVDe initialization with no additional constraints when $k$
is reasonably large (about half the number of
documents). For smaller $k$,
performing approximately 25 iterations under random initialization will usually
accomplish 5% relative error, with the number of iterations required decreasing
as $k$ decreases.
While measuring error norms and convergence is useful
to expose mathematical properties and structural tendencies of the NMF, the
ultimate goal of this application is to provide a useful labeling of a
hierarchical tree from the NMF. In many cases, the “best” labeling may be
provided by a suboptimal run of NMF. Overall, more accurate labelings resulted
from higher values of $k$
because more feature vectors increased the
vocabulary size of the labeling dictionary. Generally speaking, the NNDSVDe,
NNDSVDme, and NNDSVDz schemes outperformed the NNDSVDa initialization. Overall,
the accuracy of the labelings appeared to be more a function of $k$
and the initial seed than of the
constraints applied.
Much research is being performed concerning the NMF,
and this work examines three methods based on the multiplicate update (see
Section 2.1). Many other NMF variations exist and more are being developed, so
their application to the biological realm should be studied. For example,
[19] proposes a hybrid least squares approach called GD-CLS that solves
the NMF and overcomes the problem of “locking” zeroed elements encountered
by MM; [20, 21] propose nonsmooth NMF as an alternative method to
incorporate sparseness; and [22] proposes an NMF technique
that generates three factor matrices and has shown promising clustering
results. NMF has been applied to microarray data [23], but efforts need to be
made to combine the text information with microarray data; some variation of
tensor factorization could possibly show how relationships change over time
[24].
With respect to labeling methods, MeSH heading labels
were generally useful, but provided few specific details about the
functional relationship between the genes in a cluster. On the other hand, text
labeling provided specific and detailed information regarding the function of
the genes in a cluster. Importantly, term labels provided some specific
connections between groups of genes that were not readily apparent. Thus, term
labeling offers a distinct advantage for discovering new relationships between
genes and can aid in the interpretation of high-throughput data.
Regardless of the techniques employed, one of the
issues that will always be prevalent regarding biological data is that of
quality versus quantity. Inherently related to this problem is the
establishment of standards within the field, especially as they pertain to hierarchical
data. Resources such as the Gene Ontology (GO) are being built and refined [25], but standard datasets for
comparing results and clearly defined (and accepted) evaluation measures could
facilitate more meaningful comparisons between methods.
In the case of SGO, developing methods to derive
“known” data is a major issue (even GO does not produce a “gold
standard” hierarchy given a set of genes). Access to more data and to
other hierarchies would help test the robustness of the method, but that
remains one of the problems inherent in the field. In general, approximations
that are more mathematically optimal do not always produce the “best”
labeling. Often, factorizations provided by the NMF can be deemed “good
enough,” and the final evaluation will remain subjective. In the end, if
automated approaches can approximate that subjectivity, then greater
understanding of more data will result.
Acknowledgments
This work was supported by the Center for Information
Technology Research and the Science Alliance Computational Sciences Initiative
at the University of Tennessee and by the National Institutes of Health under
Grant no. HD52472-01. The authors would like to thank the anonymous referees
for their comments and suggestions for improving the manuscript.
References
- K. E. Heinrich, Finding functional gene relationships using the semantic gene organizer (SGO), M.S. thesis, Department of Computer Science, University of Tennessee, Knoxville, Tenn, USA, 2004.
- R. Homayouni, K. Heinrich, L. Wei, and M. W. Berry, “Gene clustering by latent semantic indexing of MEDLINE abstracts,” Bioinformatics, vol. 21, no. 1, pp. 104–115, 2005.
- G. Golub and C. Van Loan, Matrix Computations, Johns Hopkins University Press, Baltimore, Md, USA, 3rd edition, 1996.
- P. Paatero and U. Tapper, “Positive matrix factorization: a non-negative factor model with optimal utilization of error estimates of data values,” Environmetrics, vol. 5, no. 2, pp. 111–126, 1994.
- D. D. Lee and H. S. Seung, “Learning the parts of objects by non-negative matrix factorization,” Nature, vol. 401, no. 6755, pp. 788–791, 1999.
- L. Weixiang, Z. Nanning, and Y. Qubo, “Nonnegative matrix factorization and its applications in pattern recognition,” Chinese Science Bulletin, vol. 51, no. 1, pp. 7–18, 2006.
- K. E. Heinrich, Automated gene classification using nonnegative matrix factorization on biomedical literature, Ph.D. thesis, Department of Computer Science, University of Tennessee, Knoxville, Tenn, USA, 2007.
- M. W. Berry and M. Browne, Understanding Search Engines: Mathematical Modeling and Text Retrieval, SIAM, Philadelphia, Pa, USA, 1999.
- S. Wild, J. Curry, and A. Dougherty, “Motivating non-negative matrix factorizations,” in Proceedings of the 8th SIAM Conference on Applied Linear Algebra (LA '03), Williamsburg, Va, USA, June, 2003.
- D. D. Lee and H. S. Seung, “Algorithms for non-negative matrix factorization,” in Advances in Neural Information Processing Systems, T. K. Leen, T. G. Dietterich, and V. Tresp, Eds., vol. 13, pp. 556–562, MIT Press, Cambridge, Mass, USA, 2001.
- M. W. Berry, M. Browne, A. N. Langville, V. P. Pauca, and R. J. Plemmons, “Algorithms and applications for approximate nonnegative matrix factorization,” Computational Statistics and Data Analysis, vol. 52, no. 1, pp. 155–173, 2007.
- C. Eckart and G. Young, “The approximation of one matrix by another of lower rank,” Psychometrika, vol. 1, no. 3, pp. 211–218, 1936.
- C. Boutsidis and E. Gallopoulos, “On SVD-based initialization for nonnegative matrix factorization,” University of Patras, Patras, Greece, 2005.
- R. Desper and O. Gascuel, “Fast and accurate phylogeny reconstruction algorithms based on the minimum-evolution principle,” Journal of Computational Biology, vol. 9, no. 5, pp. 687–705, 2002.
- S. Kiritchenko, Hierarchical text categorization and its applications to bioinformatics, Ph.D. thesis, University of Ottawa, Ottawa, Canada, 2005.
- M. Chagoyen, P. Carmona-Saez, H. Shatkay, J. M. Carazo, and A. Pascual-Montano, “Discovering semantic features in the literature: a foundation for building functional associations,” BMC Bioinformatics, vol. 7, article 41, pp. 1–19, 2006.
- C. Boutsidis and E. Gallopoulos, “SVD based initialization: a head start for nonnegative matrix factorization,” University of Patras, Patras, Greece, 2007.
- A. Langville, C. Meyer, and R. Albright, “Initializations for the nonnegative matrix factorization,” preprint, 2006.
- F. Shahnaz, M. W. Berry, V. P. Pauca, and R. J. Plemmons, “Document clustering using nonnegative matrix factorization,” Information Processing & Management, vol. 42, no. 2, pp. 373–386, 2006.
- A. Pascual-Montano, J. M. Carazo, K. Kochi, D. Lehmann, and R. D. Pascual-Marqui, “Nonsmooth nonnegative matrix factorization (nsNMF),” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 28, no. 3, pp. 403–415, 2006.
- P. Carmona-Saez, R. D. Pascual-Marqui, F. Tirado, J. M. Carazo, and A. Pascual-Montano, “Biclustering of gene expression data by non-smooth nonnegative matrix factorization,” BMC Bioinformatics, vol. 7, article 78, pp. 1–18, 2006.
- C. Ding, T. Li, W. Peng, and H. Park, “Orthogonal nonnegative matrix tri-factorizations for clustering,” in Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 126–135, ACM Press, Philadelphia, Pa, USA, August 2006.
- J.-P. Brunet, P. Tamayo, T. R. Golub, and J. P. Mesirov, “Metagenes and molecular pattern discovery using matrix factorization,” Proceedings of the National Academy of Sciences of the United States of America, vol. 101, no. 12, pp. 4164–4169, 2004.
- A. Cichocki, R. Zdunek, S. Choi, R. Plemmons, and S.-I. Amari, “Novel multi-layer nonnegative tensor factorization with sparsity constraints,” in Proceedings of the 8th International Conference on Adaptive and Natural Computing Algorithms (ICANNGA'07), vol. 4432 of Lecture Notes in Computer Science, pp. 271–280, Warsaw, Poland, April 2007.
- M. Ashburner, C. A. Ball, J. A. Blake, et al., “Gene ontology: tool for the unification of biology,” Nature Genetics, vol. 25, no. 1, pp. 25–29, 2000.