The general approach to modeling binary data for the purpose of estimating the
propagation of an internal solitary wave (ISW) is based on the maximum likelihood
estimate (MLE) method. In cases where the number of observations in the data is small,
any inferences made based on the asymptotic distribution of changes in the deviance may be unreliable for binary data (the
model's lack of fit is described in terms of a quantity known as the deviance). The
deviance for the binary data is given by D. Collett (2003).
Logistic regression shows that the P-values for the likelihood ratio test and the score
test are both <0.05. However, the null hypothesis is not rejected in the Wald test. The
seeming discrepancies in P-values obtained between the Wald test and the other two tests
are a sign that the large-sample approximation is not stable. We find that the parameters
and the odds ratio estimates obtained via conditional exact logistic regression are
different from those obtained via unconditional asymptotic logistic regression. Using
exact results is a good idea when the sample size is small and the approximate P-values
are <0.10. Thus in this study exact analysis is more appropriate.
1. Introduction
Internal waves refer to the motion at the interface
between layers of water of different densities in a stratified
water body, such as the ocean. The simplest oceanic density structure, where
differences in water density are mostly caused by differences in water
temperature or salinity, can be approximated by a two-layer model. Oceanic internal
waves typically have wavelengths ranging from hundreds of meters to tens of
kilometers, with periods from tens of minutes to tens of hours. In the
Andaman and Sulu Seas they can have amplitudes (peak-to-trough distance) exceeding 50 m, and in the South
China Sea the amplitude can exceed 110 m [1–9].
The mixing and dissipation generated by internal waves have important effects
on cross-slope exchange processes, the enhancement of bottom stress, and the
generation of nepheloid layers. It has recently been proposed that internal waves may make a significant contribution
to internal oceanic mixing and hence have an important influence on climatic
change. This is why it is necessary to
scrutinize the interaction of nonlinear internal solitary waves (ISWs)
with the seabed topography [10–17].
Several studies, including both simulations and
laboratory experiments, aiming
at exploring the mechanisms for the generation, propagation, and
evolution of ISWs, have already been carried out. However, since energy
dissipation plays such an important and varied role in water and sedimentary movement
in coastal seas [18], we need a better-fitting and more appropriate model for predicting
ISW propagation. A preliminary approach has recently been made in which the effects of weighted
parameters on the amplitude and reflection of energy-based ISWs from uniform
slopes in a two-layer fluid system were investigated [19]. The results are quite consistent with other experimental
results, and are applicable to the naturally occurring reflection of ISWs from
sloping bottoms. More recently, Chen et al. [20] concluded that the goodness-of-fit
and predictive ability of cumulative logistic regression models are
better than those of binary logistic regression
models. However, in cases where the data set is so small that there are some
observations with proportions close to zero or one, inferences based on the asymptotic
distribution of the change in deviance may be unreliable. In point of fact, reports on statistical
treatments of this theme are rather rare.
The rest of the paper is organized as follows. In Section 2 we describe the experimental
set-up and theoretical background needed to understand the hydrodynamic
interaction. We also discuss the analysis of the logistic regression
model, and introduce the exact conditional logistic model and the hypotheses tested on its parameters.
Section 3 is devoted to a comparison of the conditional exact logistic regression model and the
unconditional asymptotic logistic regression model. Finally, some conclusions
are drawn in Section 4. We note that a small sample size, in which some observations have
proportions close to zero or one, together with approximate P-values of less than 0.10,
is an indication that an exact analysis would be more appropriate.
2. Research Framework
Experiments were carried out in the laboratory using a
two-layer fluid system of fresh and briny water in a 12 m long wave flume (rectangular in
cross-section). The upper layer of water in the wave
flume consisted of fresh water and the lower layer of brine, each with its own density and depth.
The leading ISW was generated by the lifting of a
pneumatic sluice gate at one end of the flume. The wave propagated into the
main section of the flume to the left-hand side (LHS) of the gate. The amplitude a and characteristic length of the ISW were predetermined by arranging the step
length L and step depth (see Figure 1). Six ultrasonic probes, connected to an amplifier
unit and an A/D converter and then to a personal computer, gathered and processed
digital signals as the ISW propagated along the flume. As
the ISW propagated from the right-hand side (RHS) to the LHS of the
flume, the first ultrasonic probe (P1) recorded the properties of the incident
ISW (the wave amplitude and characteristic length), while the second probe (P2)
collected reflected signals showing the wave-obstacle interaction. The methodology for
measuring the physical properties related to the propagation and dissipation of
the ISW has been reported in detail by Chen et al. [21]. The
amplitude-based transmission rate during the wave-ridge interaction was
dependent on two factors, ridge height and potential energy.
Figure 1: A schematic view showing the set-up for ISW propagation in a two-layer fluid system over a single obstacle.
2.1. Exact Conditional Logistic Regression Model
The theoretical basis for the exact conditional logistic
regression model was originally laid down by Cox [22], but recent algorithmic
advances in computing the exact distributions have made the methodology more
practical. Since then Hirji et al.
[23] have developed an efficient algorithm for generating the required
conditional distributions. Cox and Snell [24] noted that it has been known
since the 1970s
how to extend the theory of Fisher’s exact test to logistic regression models.
The interested reader may refer to Mehta and Patel [25] for a useful summary of
exact logistic regression. A complete discussion of the exact logistic
regression methodology and more detailed applications can be found in a variety
of sources [26–30].
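Here, let $\pi(\mathbf{x})$ represent the probability of “success” for a binary response $Y$ for the
explanatory variables $\mathbf{x} = (x_1, \ldots, x_p)'$.
The notation can be simplified by using $\pi(\mathbf{x}) = E(Y \mid \mathbf{x})$
to represent the conditional mean of $Y$ given $\mathbf{x}$ when a logistic distribution is
utilized, such that
$$\pi(\mathbf{x}) = \frac{\exp(\mathbf{x}'\boldsymbol{\beta})}{1 + \exp(\mathbf{x}'\boldsymbol{\beta})}.$$
The transformation of $\pi(\mathbf{x})$,
which is central to this study of logistic regression, is the logit
transformation. This transformation is defined as
$$g(\mathbf{x}) = \ln\left[\frac{\pi(\mathbf{x})}{1 - \pi(\mathbf{x})}\right] = \mathbf{x}'\boldsymbol{\beta},$$
where $\boldsymbol{\beta} = (\beta_1, \ldots, \beta_p)'$ is an unknown parameter vector.
The sufficient statistics for the $\beta_j$ in the
unconditional likelihood function are
$$T_j = \sum_{i=1}^{n} y_i x_{ij}, \quad j = 1, \ldots, p,$$
where $y_i$ is the realization of $Y_i$.
Partition $\boldsymbol{\beta}$ into nuisance parameters $\boldsymbol{\beta}_N$ and parameters of interest $\boldsymbol{\beta}_I$, with the columns of the design matrix partitioned accordingly into $\mathbf{X}_N$ and $\mathbf{X}_I$.
If $\mathbf{T}_N$ and $\mathbf{T}_I$ indicate sufficient
statistics corresponding to $\boldsymbol{\beta}_N$ and $\boldsymbol{\beta}_I$, the conditional probability density function of $\mathbf{T}_I$ conditional
on $\mathbf{T}_N = \mathbf{t}_N$ can be formulated as
$$f_{\boldsymbol{\beta}_I}(\mathbf{t}_I \mid \mathbf{t}_N) = \frac{C(\mathbf{t}_N, \mathbf{t}_I)\exp(\mathbf{t}_I'\boldsymbol{\beta}_I)}{\sum_{\mathbf{u}} C(\mathbf{t}_N, \mathbf{u})\exp(\mathbf{u}'\boldsymbol{\beta}_I)},$$
where $C(\mathbf{t}_N, \mathbf{u})$ indicates the number of vectors $\mathbf{y}$ such that $\mathbf{y}'\mathbf{X}_N = \mathbf{t}_N$ and $\mathbf{y}'\mathbf{X}_I = \mathbf{u}$.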
Conditional exact inference involves the generation of the
conditional permutational distribution for the sufficient statistics for the parameters.
The distribution is called the permutation conditional distribution
or exact conditional distribution.
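The construction of the exact conditional distribution can be illustrated with a minimal brute-force sketch. It assumes a single nuisance column (an intercept) and a single covariate of interest, and it enumerates every binary response vector, which is only feasible for very small samples; it is not the efficient algorithm of Hirji et al. [23]. The toy data and the function names `conditional_counts` and `conditional_pmf` are hypothetical and are not taken from the experiment.

```python
from collections import Counter
from itertools import product
from math import exp


def conditional_counts(x_nuis, x_int, t_nuis):
    """Count binary vectors y with y'x_nuis == t_nuis, keyed by u = y'x_int."""
    counts = Counter()
    for y in product((0, 1), repeat=len(x_int)):
        if sum(yi * xi for yi, xi in zip(y, x_nuis)) == t_nuis:
            counts[sum(yi * xi for yi, xi in zip(y, x_int))] += 1
    return dict(counts)


def conditional_pmf(counts, beta_int=0.0):
    """Exact conditional distribution: C(u) * exp(u * beta) over its sum."""
    weights = {u: c * exp(u * beta_int) for u, c in counts.items()}
    total = sum(weights.values())
    return {u: weights[u] / total for u in sorted(weights)}


# Hypothetical toy data: an intercept column (nuisance) and one 0/1 covariate.
x_nuis = [1, 1, 1, 1, 1, 1]
x_int = [0, 0, 0, 1, 1, 1]
y_obs = [0, 0, 1, 1, 1, 1]

# Observed sufficient statistics for the nuisance and interest parameters.
t_nuis = sum(y * x for y, x in zip(y_obs, x_nuis))
t_int = sum(y * x for y, x in zip(y_obs, x_int))

counts = conditional_counts(x_nuis, x_int, t_nuis)
pmf0 = conditional_pmf(counts)  # distribution when the interest parameter is 0
print(t_int, counts, pmf0)
```

Conditioning on the nuisance sufficient statistic removes the intercept from the distribution, so the resulting probabilities depend only on the parameter of interest.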
2.2. Testing the Hypotheses
Exact logistic regression provides two tests, the exact conditional score test and the
probability test, of the null hypothesis that the parameters for the specified effect
are equal to zero. If an effect
consists of two or more parameters, then it is hypothesized that all the
parameters are simultaneously equal to zero [26, 27].
2.2.1. Exact Score Conditional Test
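The null hypothesis for the parameters of interest is $H_0\colon \boldsymbol{\beta}_I = \mathbf{0}$. The conditional
mean $\boldsymbol{\mu}_I$ and variance matrix $\boldsymbol{\Sigma}_I$ of the sufficient statistics $\mathbf{T}_I$ (conditional on the nuisance sufficient statistics $\mathbf{T}_N = \mathbf{t}_N$) are calculated for the exact
conditional score test. The score statistic for the observed value $\mathbf{t}_I$ is
$$s = (\mathbf{t}_I - \boldsymbol{\mu}_I)'\boldsymbol{\Sigma}_I^{-1}(\mathbf{t}_I - \boldsymbol{\mu}_I).$$
This is compared to the score for each member of the distribution,
$$S(\mathbf{T}_I) = (\mathbf{T}_I - \boldsymbol{\mu}_I)'\boldsymbol{\Sigma}_I^{-1}(\mathbf{T}_I - \boldsymbol{\mu}_I).$$
An exact P-value is the probability, assuming the null hypothesis, of obtaining
a statistic at least as extreme as the observed one.
The resulting P-value is
$$p = \Pr(S \ge s) = \sum_{\mathbf{u} \in \Omega_s} f_0(\mathbf{u}),$$
where $f_0$ is the conditional probability density function under $H_0$ and $\Omega_s = \{\mathbf{u} : \mathbf{u} \text{ exists conditional on } \mathbf{T}_N = \mathbf{t}_N \text{ and } S(\mathbf{u}) \ge s\}$.
A mid P-value adjusts for the discreteness of the distribution under the null
hypothesis.
The mid-p statistic is defined as
$$p_{\text{mid}} = p - \tfrac{1}{2} f_0(\mathbf{t}_I).$$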
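As a minimal sketch of the exact conditional score test for a single parameter of interest, the following code takes a null conditional distribution and an observed sufficient statistic and returns the score statistic, the exact P-value, and a mid P-value. The distribution `pmf0` is a hypothetical example, and the mid P-value is assumed here to be the exact P-value minus half the null probability of the observed statistic.

```python
def exact_score_test(pmf0, t_obs):
    """Exact conditional score test for one parameter of interest.

    pmf0 maps each attainable sufficient statistic u to its probability
    under the null hypothesis; t_obs is the observed statistic.
    """
    mean = sum(u * f for u, f in pmf0.items())
    var = sum((u - mean) ** 2 * f for u, f in pmf0.items())

    def score(u):
        return (u - mean) ** 2 / var

    s_obs = score(t_obs)
    # Sum the null probabilities of all points at least as extreme as t_obs
    # (a small tolerance guards against floating-point ties).
    p_exact = sum(f for u, f in pmf0.items() if score(u) >= s_obs - 1e-12)
    # Assumed mid-p convention: subtract half the probability of t_obs.
    p_mid = p_exact - 0.5 * pmf0[t_obs]
    return s_obs, p_exact, p_mid


# Hypothetical null conditional distribution and observed statistic.
pmf0 = {1: 0.2, 2: 0.6, 3: 0.2}
s_obs, p_exact, p_mid = exact_score_test(pmf0, t_obs=3)
print(s_obs, p_exact, p_mid)
```

In this example both tails are equally far from the conditional mean, so the exact P-value collects the probability of both extreme points.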
2.2.2. Probability Testing
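For small samples, the parameter inference process is
carried out using conditional distribution probabilities, such as exact P-values, rather than a crude approximation
[29]. For testing the null hypothesis we use $H_0\colon \boldsymbol{\beta}_I = \mathbf{0}$. Under the null hypothesis, the
exact probability test statistic is just $f_0(\mathbf{t}_I) = f_{\boldsymbol{\beta}_I = \mathbf{0}}(\mathbf{t}_I \mid \mathbf{t}_N)$, the conditional probability of the observed sufficient statistics $\mathbf{t}_I$ of the parameters of interest given those of the nuisance parameters, $\mathbf{t}_N$; the corresponding P-value
gives the probability of getting a less likely statistic,
$$p = \Pr\bigl(f_0(\mathbf{u}) \le f_0(\mathbf{t}_I)\bigr) = \sum_{\mathbf{u} \in \Omega_p} f_0(\mathbf{u}),$$
where $\Omega_p = \{\mathbf{u} : \mathbf{u} \text{ exists conditional on } \mathbf{T}_N = \mathbf{t}_N \text{ and } f_0(\mathbf{u}) \le f_0(\mathbf{t}_I)\}$.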
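The probability test can be sketched in the same way. The code below, a simplified illustration rather than a definitive implementation, sums the null probabilities of all attainable statistics that are no more likely than the observed one. The distribution `pmf0` is hypothetical, and the mid P-value again assumes the convention of subtracting half the probability of the observed statistic.

```python
def exact_probability_test(pmf0, t_obs):
    """Exact probability test for one parameter of interest.

    pmf0 maps each attainable sufficient statistic u to its probability
    under the null hypothesis; t_obs is the observed statistic.
    """
    f_obs = pmf0[t_obs]
    # Points whose null probability does not exceed that of t_obs
    # (a relative tolerance guards against floating-point ties).
    p_exact = sum(f for f in pmf0.values() if f <= f_obs * (1 + 1e-12))
    # Assumed mid-p convention: subtract half the probability of t_obs.
    p_mid = p_exact - 0.5 * f_obs
    return f_obs, p_exact, p_mid


# Hypothetical, asymmetric null conditional distribution.
pmf0 = {0: 0.1, 1: 0.4, 2: 0.3, 3: 0.2}
f_obs, p_exact, p_mid = exact_probability_test(pmf0, t_obs=3)
print(f_obs, p_exact, p_mid)
```

Unlike the score test, this test orders the sample points by their probability rather than by their distance from the conditional mean, so the two tests can give different P-values for asymmetric distributions.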
3. Analytical Results
The effects of the
ridge height, the depth of the lower water layer, and the potential energy on the
propagation of the ISW are all considered. The results from the laboratory
experiments are shown in the data sets. The amplitudes of the incident and
reflected waves are also included. The dependent variable for the binary
logistic regression model is classified into two groups, weak and strong, based
on the amplitude-based incident rate. When the hypothetical incident rate is >0.5
it is considered strong and when it is <0.5 it is considered weak. The
frequencies for the strong and weak levels are 35 and 28, respectively.
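The dichotomization of the response can be sketched as follows. The incident rates listed are hypothetical values, not the experimental data, and a rate of exactly 0.5 (a case the text does not specify) is assumed here to fall in the weak group.

```python
from collections import Counter


def classify_incident_rate(rate, threshold=0.5):
    """Label an amplitude-based incident rate as 'strong' or 'weak'.

    Assumption: a rate exactly equal to the threshold is labeled 'weak'.
    """
    return "strong" if rate > threshold else "weak"


# Hypothetical incident rates, for illustration only.
rates = [0.82, 0.41, 0.67, 0.30, 0.55]
labels = [classify_incident_rate(r) for r in rates]
frequencies = Counter(labels)
print(labels, dict(frequencies))
```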
3.1. Asymptotic Logistic Regression Model
The methodologies utilized in the asymptotic
logistic regression model and the diagnostics of the goodness-of-fit statistics
are discussed below.
3.1.1. Goodness-of-Fit Statistics
The Pearson Chi-square test and the deviance Chi-square test
are used. The Pearson Chi-square statistic has a Chi-square distribution
whose degrees of freedom are determined by the number of
explanatory variables, the number of response levels, and the number
of subpopulations.
The goodness-of-fit
statistics are shown in Table 1. The dispersion parameter, estimated as the
statistic divided by its degrees of freedom, is given in the value/DF column. The deviance dispersion
parameter is 0.7268 and the Pearson Chi-square dispersion parameter is 1.2157.
Ideally, these values should be very close to 1.00. The values of the Pearson Chi-square
and deviance Chi-square statistics are 60.7842 and 36.338, respectively, with
50 degrees of freedom. The Pearson Chi-square value is
slightly larger than the degrees of freedom; the P-values for the deviance and Pearson Chi-squares
are both larger than 0.05 (0.9260 and 0.1412, resp.). From this we see that although there
is a little overdispersion, this model seems to have an acceptable fit
with the data. The overdispersion means that the model still needs to be modified.
Table 1: Deviance and Pearson goodness-of-fit statistics.
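The dispersion parameters and P-values quoted above can be checked with a short sketch. The statistics and the 50 degrees of freedom are those reported in the text; the helper `chi2_sf_even_df` is a hypothetical name and uses the closed-form upper-tail probability of the Chi-square distribution that holds only for even degrees of freedom.

```python
from math import exp


def chi2_sf_even_df(x, df):
    """Upper-tail Chi-square probability for even df (finite Poisson sum)."""
    if df <= 0 or df % 2:
        raise ValueError("this closed form needs positive, even df")
    half = x / 2.0
    term = total = 1.0  # j = 0 term of the sum of half**j / j!
    for j in range(1, df // 2):
        term *= half / j
        total += term
    return exp(-half) * total


df = 50
statistics = {"deviance": 36.338, "Pearson": 60.7842}

# For each statistic: (dispersion = value / DF, upper-tail P-value).
summary = {
    name: (value / df, chi2_sf_even_df(value, df))
    for name, value in statistics.items()
}
for name, (dispersion, p_value) in summary.items():
    print(f"{name}: value/DF = {dispersion:.4f}, P = {p_value:.4f}")
```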
3.1.2. Regression Diagnostics
There
are a number of different ways to plot the regression diagnostics, each
directed at a particular aspect of the fit. For examples see
Hosmer and Lemeshow [28],
and Landwehr et al. [31] who discussed
graphical techniques for logistic regression diagnostics. Generally such
techniques offer a visual rather than numerical representation that may be more
intuitively appealing to some researchers. Index plots are useful for the identification
of extreme values [32]. An examination of the index plots of the Pearson
residuals (Figure 2) and the deviance residuals (Figure 3) for our data indicates
that case 11 and case 27 are poorly accounted for by the model. It can be seen
in the index plot of the diagonal elements of the hat matrix (Figure 4) that case 49 is at the
extreme point in the design space.
Figure 2: Plot of Pearson residual (Reschi) versus
case number index.
Figure 3: Plot of deviance residual (Resdev)
versus case number index.
Figure 4: Plot of hat diagonal versus case number index.
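The residuals plotted in Figures 2 and 3 can be computed from the fitted probabilities as in the following sketch for a binary response. The responses, fitted probabilities, and the cutoff of 2 used to flag cases are hypothetical; the hat matrix diagonal of Figure 4 requires the design matrix and is not covered here.

```python
from math import log, sqrt


def pearson_residual(y, p):
    """Pearson residual of a binary observation y with fitted probability p."""
    return (y - p) / sqrt(p * (1.0 - p))


def deviance_residual(y, p):
    """Signed square root of the observation's contribution to the deviance."""
    log_lik = log(p) if y == 1 else log(1.0 - p)
    sign = 1.0 if y == 1 else -1.0
    return sign * sqrt(-2.0 * log_lik)


# Hypothetical responses and fitted probabilities, for illustration only.
y = [1, 0, 1, 0]
p_hat = [0.90, 0.20, 0.05, 0.60]

pearson = [pearson_residual(yi, pi) for yi, pi in zip(y, p_hat)]
deviance = [deviance_residual(yi, pi) for yi, pi in zip(y, p_hat)]

# Index "plot" in text form: flag cases with a large residual (assumed cutoff 2).
flagged = [
    case for case, (rp, rd) in enumerate(zip(pearson, deviance), start=1)
    if abs(rp) > 2.0 or abs(rd) > 2.0
]
print(flagged)
```

A case is poorly accounted for by the model when its observed response is very unlikely under its fitted probability, as for the third hypothetical case here.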
3.1.3. Outliers and Influential Observations
The values of outliers can be quite substantial and influential. A look at
Table 5 shows the advantage of removing such observations from the data (here, case
11, case 27, and case 49), then refitting the newly revised model to the
remaining observations.
The goodness-of-fit statistics for the refitted model are presented in Table 2.
The dispersion estimates are shown in the column marked value/DF. The deviance
dispersion parameter is 0.3376 and the Pearson Chi-square dispersion parameter is 0.4752. The values of the deviance
and Pearson Chi-square are less than the
degrees of freedom, while the P-values
of the deviance and Pearson Chi-square are both >0.05 (1.0000 and 0.9993,
resp.). These indicate that this model seems to have an acceptable fit with the
data.
Table 2: Deviance and Pearson goodness-of-fit statistics.
3.1.4. Testing the Global Null Hypothesis: β = 0
Under the global null hypothesis, all the
explanatory variables have coefficients of zero. According to the large-sample Chi-square
analysis, the associated P-values are
all approximately zero, suggesting that the explanatory coefficients are not all zero.
The unconditional asymptotic logistic regression was rerun after the removal
of some of the observations from the data (i.e., case 11, case 27, and case 49); the
results are given in Table 3. They comprise the three Chi-square statistics for
testing the global null hypothesis
(likelihood ratio, score, and Wald tests). For the likelihood ratio and score
tests, the null hypothesis that β is zero is rejected, but it is not rejected
by the Wald test. The seeming discrepancies in P-values obtained between the Wald test and the other two tests are
a sign that the large-sample approximation is not stable.
Table 3: Testing of the global null hypothesis: β = 0.
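How the three large-sample tests can disagree in a small sample may be seen in a minimal sketch for the simplest case, a logistic model with one binary covariate, where all three statistics have closed forms in terms of the 2 x 2 table of counts. The counts below are hypothetical and are not the experimental data; each statistic is referred to a Chi-square distribution with 1 degree of freedom.

```python
from math import erfc, log, sqrt


def chi2_sf_1df(x):
    """Upper-tail probability of a Chi-square variate with 1 degree of freedom."""
    return erfc(sqrt(x / 2.0))


def global_tests_2x2(a, b, c, d):
    """Tests of beta = 0 for a logistic model with one binary covariate.

    a, b: successes and failures when the covariate is 1;
    c, d: successes and failures when the covariate is 0 (all cells > 0).
    """
    n = a + b + c + d
    rows = (a + b, c + d)
    cols = (a + c, b + d)
    cells = ((a, b), (c, d))
    # Likelihood ratio: twice the log-likelihood gain over the intercept-only fit.
    lr = 2.0 * sum(
        cells[i][j] * log(cells[i][j] * n / (rows[i] * cols[j]))
        for i in range(2)
        for j in range(2)
    )
    # Score: evaluated at the null fit; equals the Pearson Chi-square here.
    score = n * (a * d - b * c) ** 2 / (rows[0] * rows[1] * cols[0] * cols[1])
    # Wald: squared log odds ratio over its estimated large-sample variance.
    wald = log(a * d / (b * c)) ** 2 / (1 / a + 1 / b + 1 / c + 1 / d)
    return {"likelihood ratio": lr, "score": score, "Wald": wald}


# Hypothetical small-sample counts, for illustration only.
statistics = global_tests_2x2(5, 1, 1, 4)
p_values = {name: chi2_sf_1df(value) for name, value in statistics.items()}
for name in statistics:
    print(f"{name}: chi-square = {statistics[name]:.3f}, P = {p_values[name]:.4f}")
```

With these counts the likelihood ratio and score tests reject the null hypothesis at the 0.05 level while the Wald test does not, which is the same pattern of disagreement described above.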
3.2. Exact Logistic Regression Model
Exact logistic regression for binary outcomes can be utilized to provide an exact
score test and an exact probability test for hypotheses that the parameters
are equal to zero; these tests produce an exact P-value and a mid P-value.
To test whether individual parameter estimates are zero, we also require point
estimates of the parameters, odds ratios with two-sided confidence
limits, and the P-value.
3.2.1. Conditional Exact Tests:
The results of exact conditional
analysis obtained using the exact logistic regression model are shown in
Table 4. The results for the exact score conditional test and the probability test
are also reported in this table. The joint test hypothesizes that all the
parameters specified in the exact statement are simultaneously equal to zero, that is,
the null hypothesis is β = 0.
Table 4: Conditional exact test results.
Table 5: Analysis of MLEs.
In the joint test results, the exact score test produces an exact P-value of <.0001; the probability
test produces an exact P-value of 0.0023. These test results lead to
a rejection of the null hypothesis that β is zero. This
shows that the ridge height, lower layer water depth, and potential
energy are significant for the joint exact test.
For the individual effects of the ridge height, lower
layer water depth, and potential energy, the
exact P-value and mid P-value are both <.0001. These results
lead to a rejection of the
null hypothesis that the corresponding parameter is
zero. To put it another way, ridge height,
lower layer water depth, and potential
energy are significant factors associated with the amplitude-based
incident rate.
3.2.2. Parameter Estimation and Odds Ratio Estimation
Stokes et al. [27] have suggested that large sample
theory may not be appropriate for small-sized
data. This thus means that tests based on the asymptotic normality of the MLEs
may be unreliable. They recommend that when sample sizes are small, with
approximate P-values of less than
0.10, it is a good idea to look at the exact results. If the approximate P-values are larger than 0.15, then the
approximate methods are probably satisfactory, in the sense that the exact
results are likely to agree with them.
Parameter Estimates for Unconditional Asymptotic Logistic Regression
The analytical results for the maximum
likelihood estimates and odds ratios are shown in Tables 5 and 6. The ridge height, lower layer water depth, and potential energy are all significant
factors affecting the amplitude-based incident rate (P = .0106, P = .0053, and P = .0067, resp.).
The fitted unconditional asymptotic logistic regression
lines can be stated as
Table 6: Odds ratio estimates.
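The relation between a logistic regression coefficient and its odds ratio estimate can be sketched as follows: the odds ratio is the exponential of the coefficient, and large-sample (Wald) confidence limits are obtained by exponentiating the limits for the coefficient. The coefficient and standard error used below are hypothetical values, not those of Tables 5 and 6.

```python
from math import exp


def odds_ratio_with_wald_limits(beta, std_err, z=1.959964):
    """Odds ratio exp(beta) with two-sided Wald confidence limits.

    z is the standard normal quantile for the confidence level
    (1.959964 corresponds to 95%).
    """
    return (
        exp(beta),
        exp(beta - z * std_err),
        exp(beta + z * std_err),
    )


# Hypothetical coefficient and standard error, for illustration only.
odds_ratio, lower, upper = odds_ratio_with_wald_limits(beta=0.8, std_err=0.3)
print(f"OR = {odds_ratio:.4f}, 95% limits = ({lower:.4f}, {upper:.4f})")
```

These limits rest on the asymptotic normality of the MLE, which is the approximation that may be unreliable for small samples.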
Parameter Estimation for Conditional Exact Logistic Regression
The analytical
results of the exact parameter estimates and exact odds ratio estimates are
presented in Tables 7 and 8, respectively. The ridge height, lower layer water depth, and potential energy are all significant
factors affecting the amplitude-based incident rate (P < .0001). We report
a median unbiased estimate instead of the conditional MLE, because the value of
the observed sufficient statistic lies at the extreme end of the derived
distribution, which implies that the conditional MLE does not exist. Even
though the asymptotic results are unreliable, the exact analysis allows us to
conclude that these factors have a significant effect. The fitted conditional
exact logistic regression lines can be formulated as
We can see from Tables 5 and 7 that the parameters obtained from conditional exact logistic
regression are smaller than those obtained from unconditional asymptotic
logistic regression, while the P-values
of the unconditional asymptotic estimates are larger than those of the exact
estimates. A comparison of the odds ratio estimates (in
Tables 6 and 8) shows that the odds ratios obtained from
the conditional exact logistic regression are different from those obtained
from the unconditional asymptotic logistic regression.
Stokes et al. [27] recommended that when sample sizes are small and the
approximate P-values are less than
0.10, it is better to look at the exact results. Thus in this study, the small
sample size and P-values make exact
analysis more appropriate.
Table 7: Exact parameter estimates.
Table 8: Exact odds ratios.
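The median unbiased estimate mentioned above can be sketched for a single parameter as follows. The sketch assumes the usual convention in which one bound solves Pr(T ≥ t) = 0.5 and the other solves Pr(T ≤ t) = 0.5 under the conditional distribution; the estimate is their average, or the single finite bound when the observed statistic is at an extreme. The counts and function names are hypothetical, and the bounds are found by simple bisection.

```python
from math import exp


def tail_prob(counts, beta, t_obs, upper):
    """Pr(T >= t_obs) if upper else Pr(T <= t_obs) under parameter beta."""
    weights = {t: c * exp(t * beta) for t, c in counts.items()}
    total = sum(weights.values())
    tail = sum(
        w for t, w in weights.items() if (t >= t_obs if upper else t <= t_obs)
    )
    return tail / total


def solve_half(counts, t_obs, upper, lo=-30.0, hi=30.0, iterations=200):
    """Bisection for the beta at which the chosen tail probability is 0.5."""
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        p = tail_prob(counts, mid, t_obs, upper)
        # The upper tail increases with beta; the lower tail decreases.
        if (p < 0.5) == upper:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def median_unbiased_estimate(counts, t_obs):
    """Median unbiased estimate from counts C(t) of the conditional distribution."""
    t_min, t_max = min(counts), max(counts)
    if t_obs == t_max:  # lower-tail equation has no solution
        return solve_half(counts, t_obs, upper=True)
    if t_obs == t_min:  # upper-tail equation has no solution
        return solve_half(counts, t_obs, upper=False)
    return 0.5 * (
        solve_half(counts, t_obs, upper=True)
        + solve_half(counts, t_obs, upper=False)
    )


# Hypothetical counts C(t); the observed statistic sits at the maximum.
counts = {0: 1, 1: 3, 2: 3, 3: 1}
estimate = median_unbiased_estimate(counts, t_obs=3)
print(round(estimate, 4))
```

When the observed statistic is at the maximum, the conditional likelihood increases without bound in the parameter, so no conditional MLE exists, whereas the median unbiased estimate remains finite.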
4. Conclusions
A laboratory experiment was designed to investigate the propagation
of an internal solitary wave over a submerged ridge. Analytical methods and a logistic
regression model are employed to examine the amplitude-based incident rate.
Large sample theory may not be suitable for data with small cell counts. This
tends to make tests based on the asymptotic normality of the MLEs unreliable.
The ridge height, lower layer water depth, and potential
energy are considered in the regression model. Once a model has been fitted to
the observed values of a binary response variable, it is essential to check the
validity of the fit. We discuss some methods for exploring the adequacy of the model
and some diagnostic methods. The adequacy of the fitted unconditional asymptotic
logistic regression model is examined using goodness-of-fit statistics, regression
diagnostics, and tests of the global null hypothesis, and its results are compared
with those of conditional exact logistic regression. Based on the analytical results we can draw the following
conclusions.
(1) The unconditional asymptotic logistic model results
lead us to the conclusion that the three explanatory variables (ridge height,
lower layer water depth, and potential energy) are significant factors
affecting the amplitude-based incident rate. Both deviance and Pearson Chi-square
tests are used to examine the goodness-of-fit of the model. The dispersion
parameter for the estimate of deviance (value/DF) is 0.7268, and the Pearson Chi-square
dispersion parameter is 1.2157. Preferably, these values should be very close to
1.00. The Pearson Chi-square value is slightly larger than the degrees of freedom. We
note that there is still a little overdispersion with this model, which means
that it needs to be modified.
(2) A look at the index plots
for the Pearson residuals (Figure 2) and the deviance residuals (Figure 3)
shows that case 11 and case 27 are poorly accounted for by the model. In the
index plot of the diagonal elements of the hat matrix (Figure 4), case 49 is an
extreme point in the design space. After these observations (case 11, case 27,
and case 49) are removed from the data, the revised model is refitted based
on the remaining observations. The values of
the deviance and Pearson Chi-squares are now less than the degrees of freedom,
and the P-values for deviance and
Pearson Chi-square are both >0.05 (1.0000 and 0.9993, resp.). In other words,
this revised model seems to fit the data acceptably well.
(3) When testing the global null hypothesis β = 0, three
Chi-square statistics (likelihood ratio, score, and Wald tests) are generated. The P-values obtained by logistic
regression for the likelihood ratio test and score test are both <0.05.
However, the null hypothesis is not rejected for the Wald test. The seeming
discrepancies in P-values obtained
between the Wald test and the other two tests are a sign that the large-sample
approximation is not stable.
(4) The results of exact conditional
analysis from the exact logistic regression model are shown in
Table 4. The ridge height, lower layer water depth, and
potential energy are all significant in the joint results. Individually, the ridge height, lower layer water depth,
and potential energy effects are also all significant factors affecting
the amplitude-based incident rate.
(5) A comparison of the parameters shown in
Tables 5 and 7 and the
odds ratio estimates in Tables 6 and 8
shows that the parameters and the odds ratio estimates obtained from conditional exact logistic regression are
different from those obtained from unconditional asymptotic logistic regression. As recommended by Stokes
et al. [27], in cases of small
sample sizes where the approximate P-values
are less than 0.10, it is a good idea to look at the exact results. For this
study, the small sample size and P-values
indicate that an exact analysis would be more appropriate.
Acknowledgments
The authors would like
to thank the National Science Council of the Republic of China, Taiwan,
for financial support of this research under Contracts no. NSC 96-2628-E-366-004-MY2
and NSC 96-2628-E-132-001-MY2. They also wish to thank the editor of
Mathematical Problems in Engineering and the three anonymous reviewers for
their helpful suggestions on the improvement of this paper.