Abstract

This outlook paper reviews the research of van der Laan’s group on Targeted Learning, a subfield of statistics that is concerned with the construction of data adaptive estimators of user-supplied target parameters of the probability distribution of the data and corresponding confidence intervals, aiming to rely only on realistic statistical assumptions. Targeted Learning fully utilizes the state of the art in machine learning tools, while still preserving the important identity of statistics as a field that is concerned with both accurate estimation of the true target parameter value and assessment of uncertainty in order to make sound statistical conclusions. We also provide a philosophical and historical perspective on Targeted Learning, relating it to new developments in Big Data. We conclude with some remarks explaining the immediate relevance of Targeted Learning to the current Big Data movement.

1. Introduction

In Section 2 we start out with reviewing some basic statistical concepts such as the data probability distribution, the statistical model, and the target parameter, allowing us to define the field of Targeted Learning, a subfield of statistics that develops data adaptive estimators of user supplied target parameters of data distributions based on high dimensional data under realistic assumptions (e.g., incorporating the state of the art in machine learning) while preserving statistical inference. This also allows us to clarify how Targeted Learning is distinguished from typical current practice in data analysis, which relies on unrealistic assumptions, and to describe the key ingredients of targeted minimum loss based estimation (TMLE), a general tool to achieve the goals set out by Targeted Learning: a substitution estimator, construction of an initial estimator through super-learning, targeting of the initial estimator to achieve asymptotic linearity with known influence curve by solving the efficient influence curve estimating equation, and statistical inference in terms of a normal limiting distribution.

Targeted Learning resurrects the pillars of statistics, such as the fact that a model represents actual knowledge about the data generating experiment and that a target parameter represents the feature of the data generating distribution we want to learn from the data. In this manner, Targeted Learning defines a truth and sets a scientific standard for estimation procedures, while current practice typically defines a parameter as a coefficient in a misspecified parametric model (e.g., logistic linear regression, repeated measures generalized linear regression) or a small unrealistic semiparametric regression model (e.g., Cox proportional hazards regression), where different choices of such misspecified models yield different answers. This lack of truth in current practice, supported by statements such as “All models are wrong but some are useful,” allows a user to make arbitrary choices even though these choices result in different answers to the same estimation problem. In fact, this lack of truth in current practice is a fundamental driver behind the epidemic of false positives and the lack of power to detect true positives from which our field is suffering. In addition, this lack of truth makes many of us question the scientific integrity of the field we call statistics and makes it impossible to teach statistics as a scientific discipline, even though the foundations of statistics, including a very rich theory, are purely scientific. That is, our field has suffered from a disconnect between the theory of statistics and the practice of statistics, while practice should be driven by relevant theory and theoretical developments should be driven by practice. For example, a theorem establishing consistency and asymptotic normality of a maximum likelihood estimator for a parametric model that is known to be misspecified is not a relevant theorem for practice, since the true data generating distribution is not captured by this theorem.

Defining the statistical model to actually contain the true probability distribution has enormous implications for the development of valid estimators. For example, maximum likelihood estimators are now ill defined due to the curse of dimensionality of the model. In addition, even regularized maximum likelihood estimators are seriously flawed: a general problem with maximum likelihood based estimators is that the maximum likelihood criterion only cares about how well the density estimator fits the true density, resulting in a wrong trade-off for the actual target parameter of interest. From a practical perspective, when we use AIC, BIC, or the cross-validated log-likelihood to select variables in our regression model, that procedure is ignorant of the specific feature of the data distribution we want to estimate. That is, in large statistical models it is immediately apparent that estimators need to be targeted towards their goal, just like a human being learns the answer to a specific question in a targeted manner, and maximum likelihood based estimators fail to do that.

In Section 3 we review the roadmap for Targeted Learning of a causal quantity, involving defining a causal model and causal quantity of interest, establishing an estimand of the data distribution that equals the desired causal quantity under additional causal assumptions, applying pure statistical Targeted Learning of the relevant estimand based on a statistical model that is compatible with the causal model and is guaranteed to contain the true data distribution, and careful interpretation of the results. In Section 4 we proceed with describing our proposed targeted minimum loss-based estimation (TMLE) template, which represents a concrete template for the construction of targeted efficient substitution estimators that are not only asymptotically consistent, asymptotically normally distributed, and asymptotically efficient, but also tailored to have robust finite sample performance. Subsequently, in Section 5 we review some of our most important advances in Targeted Learning, demonstrating the remarkable power and flexibility of this TMLE methodology, and in Section 6 we describe future challenges and areas of research. In Section 7 we provide a historical and philosophical perspective on Targeted Learning. Finally, in Section 8 we conclude with some remarks, putting Targeted Learning in the context of the modern era of Big Data.

We refer to our papers and book on Targeted Learning for overviews of relevant parts of the literature that place our specific contributions to Targeted Learning in the context of the current literature, thereby allowing us to focus on Targeted Learning itself in the current outlook paper.

2. Targeted Learning

Our research takes place in a subfield of statistics we named Targeted Learning [1, 2]. In statistics the data $O_1, \ldots, O_n$ on $n$ units is viewed as a realization of a random variable, or equivalently, an outcome of a particular experiment, and thereby has a probability distribution $P_0^n$, often called the data distribution. For example, one might observe $O_i = (W_i, A_i, Y_i)$ on a subject $i$, where $W_i$ are baseline characteristics of the subject, $A_i$ is a binary treatment or exposure the subject received, and $Y_i$ is a binary outcome of interest such as an indicator of death, $Y_i \in \{0, 1\}$. Throughout this paper we will use this data structure to demonstrate the concepts and estimation procedures.

2.1. Statistical Model

A statistical model $\mathcal{M}$ is defined as a set of possible probability distributions for the data distribution and thus represents the available statistical knowledge about the true data distribution $P_0$. In Targeted Learning, this core definition of the statistical model is fully respected in the sense that one should define the statistical model to contain the true data distribution: $P_0 \in \mathcal{M}$. So contrary to the often conveniently used slogan “All models are wrong, but some are useful” and the erosion over time of the original true meaning of a statistical model throughout applied research, Targeted Learning defines the model for what it actually is [3]. If there is truly no statistical knowledge available, then the statistical model is defined as all data distributions. A possible statistical model is the model that assumes that $O_1, \ldots, O_n$ are independent and identically distributed random variables with completely unknown probability distribution $P_0$, representing the case that the sampling of the data involved repeating the same experiment independently. In our example, this would mean that we assume that $(W_i, A_i, Y_i)$, $i = 1, \ldots, n$, are independent with a completely unspecified common probability distribution $P_0$. For example, if $W$ is 10-dimensional, while $(A, Y)$ is two-dimensional, then $P_0$ is described by a 12-dimensional density, and this statistical model does not put any restrictions on this 12-dimensional density. One could factorize this density of $O = (W, A, Y)$ as follows:
\[
p_0(W, A, Y) = q_{W,0}(W)\, g_0(A \mid W)\, q_{Y,0}(Y \mid A, W),
\]
where $q_{W,0}$ is the density of the marginal distribution of $W$, $g_0$ is the conditional density of $A$, given $W$, and $q_{Y,0}$ is the conditional density of $Y$, given $A$, $W$. In this model, each of these factors is unrestricted. On the other hand, suppose now that the data is generated by a randomized controlled trial in which we randomly assign treatment $A \in \{0, 1\}$ with probability 0.5 to a subject. In that case, the conditional density of $A$, given $W$, is known, but the marginal distribution of the covariates and the conditional distribution of the outcome, given covariates and treatment, might still be unrestricted. Even in an observational study, one might know that treatment decisions were only based on a small subset of the available covariates $W$, so that it is known that $g_0(A \mid W)$ only depends on $W$ through these few covariates. In the case that death represents a rare event, it might also be known that the probability of death is bounded between 0 and some small number (e.g., 0.03). This restriction should then be included in the model $\mathcal{M}$.

In various applications, careful understanding of the experiment that generated the data might show that even these rather large statistical models, assuming the data generating experiment equals the independent repetition of a common experiment, are too small to be true: see [48] for models in which $O = (O_1, \ldots, O_n)$ is a joint random variable described by a single experiment, which nonetheless involves a variety of conditional independence assumptions. That is, the typical statement that $O_1, \ldots, O_n$ are independent and identically distributed (i.i.d.) might already represent a wrong statistical model. For example, in a community randomized trial it is often the case that the treatments are assigned by the following type of algorithm: based on the characteristics $(W_1, \ldots, W_n)$ of the $n$ communities, one first applies an algorithm that aims to split the communities into pairs that are similar with respect to baseline characteristics; subsequently, one randomly assigns treatment and control within each pair. Clearly, even when the communities would have been randomly sampled from a target population of communities, the treatment assignment mechanism creates dependence, so that the data generating experiment cannot be described as an independent repetition of experiments: see [7] for a detailed presentation.

In a study in which one observes a single community of interconnected individuals one might have that the outcome $Y_i$ for subject $i$ is not only affected by the subject’s own past $(W_i, A_i)$, but also affected by the covariates and treatments of the friends of subject $i$. Knowing the friends of each subject $i$ would now impose strong conditional independence assumptions on the density of the data $O = (O_1, \ldots, O_n)$, but one cannot assume that the data is a result of $n$ independent experiments: in fact, as in the community randomized trial example, such data sets have sample size 1 since the data can only be described as the result of a single experiment [8].

In group sequential randomized trials, one often may use a randomization probability for the next recruited $i$th subject that depends on the observed data of the previously recruited and observed subjects $O_1, \ldots, O_{i-1}$, which makes the treatment assignment $A_i$ a function of $O_1, \ldots, O_{i-1}$. Even when the subjects are sampled randomly from a target population, this type of dependence between the treatment $A_i$ and the past data implies that the data is the result of a single large experiment (again, the sample size equals 1) [46].

Indeed, many realistic statistical models only involve independence and conditional independence assumptions and known bounds (e.g., it is known that the observed clinical outcome is bounded between known values, or that the conditional probability of death is bounded between 0 and a small number). Either way, whether the data distribution is described by a sequence of independent (and possibly identical) experiments or by a single experiment satisfying a variety of conditional independence restrictions, parametric models, though representing common practice, are practically always invalid statistical models, since such knowledge about the data distribution is essentially never available.

An important by-product of requiring that the statistical model needs to be truthful is that one is forced to obtain as much knowledge as possible about the experiment before committing to a model, which is precisely the role a good statistician should play. On the other hand, if one commits to a parametric model, then why would one still bother trying to find out the truth about the data generating experiment?

2.2. Target Parameter

The target parameter $\Psi$ is defined as a mapping $\Psi : \mathcal{M} \to \mathbb{R}^d$ that maps the data distribution into the desired finite dimensional feature of the data distribution one wants to learn from the data: $P \mapsto \Psi(P)$. This choice of target parameter requires careful thought, independent from the choice of statistical model, and is not a choice made out of convenience. The use of parametric or semiparametric models such as the Cox proportional hazards model is often accompanied with the implicit statement that the unknown coefficients represent the parameter of interest. Even in the unrealistic scenario that these small statistical models would be true, there is absolutely no reason why the very parametrization of the data distribution should correspond with the target parameter of interest. Instead, the statistical model and the choice of target parameter are two completely separate choices, and by no means should one imply the other. That is, the statistical knowledge about the experiment that generated the data and defining what we hope to learn from the data are two important key steps in science that should not be conflated. The true target parameter value $\psi_0 = \Psi(P_0)$ is obtained by applying the target parameter mapping to the true data distribution $P_0$ and represents the estimand of interest.

For example, if $O_1, \ldots, O_n$ are independent and have common probability distribution $P_0$, then one might define the target parameter as an average of the conditional $W$-specific treatment effects:
\[
\Psi(P_0) = E_{P_0}\bigl[ E_{P_0}(Y \mid A = 1, W) - E_{P_0}(Y \mid A = 0, W) \bigr].
\]
By using that $Y$ is binary, this can also be written as follows:
\[
\Psi(P_0) = E_{P_0}\bigl[ \bar{Q}_0(1, W) - \bar{Q}_0(0, W) \bigr],
\]
where $\bar{Q}_0(a, W) = P_0(Y = 1 \mid A = a, W)$ denotes the true conditional probability of death, given treatment $A = a$ and covariate $W$.

For example, suppose that the true conditional probability of death is given by some logistic function
\[
P_0(Y = 1 \mid A, W) = \frac{1}{1 + \exp\{-f_0(A, W)\}}
\]
for some function $f_0$ of the treatment $A$ and covariates $W$. The reader can plug in a possible form for $f_0$. Given this function $f_0$, the true value $\psi_0$ is computed by the above formula as follows:
\[
\psi_0 = E_{P_0}\bigl[ P_0(Y = 1 \mid A = 1, W) - P_0(Y = 1 \mid A = 0, W) \bigr].
\]
This parameter has a clear statistical interpretation as the average of all the $W$-specific additive treatment effects $P_0(Y = 1 \mid A = 1, W) - P_0(Y = 1 \mid A = 0, W)$.
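To make this computation concrete, the following R sketch approximates $\psi_0$ by Monte Carlo for one hypothetical choice of the function $f_0$ and a hypothetical standard normal covariate $W$; both choices are illustrative assumptions and not part of the running example itself.

```r
# Hypothetical data generating choices (illustrative only).
set.seed(1)
f0    <- function(a, w) -1 + 1.2 * a - 0.8 * w + 0.4 * a * w  # assumed form of f0
Qbar0 <- function(a, w) plogis(f0(a, w))                      # P0(Y = 1 | A = a, W = w)

# Approximate psi0 = E_W[Qbar0(1, W) - Qbar0(0, W)] by Monte Carlo over W ~ N(0, 1).
W    <- rnorm(1e6)
psi0 <- mean(Qbar0(1, W) - Qbar0(0, W))
psi0
```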

2.3. The Important Role of Models Also Involving Nontestable Assumptions

However, this particular statistical estimand has an even richer interpretation if one is willing to make additional so called causal (nontestable) assumptions. Let us assume that $W$, $A$, $Y$ are generated by a set of so called structural equations:
\[
W = f_W(U_W), \qquad A = f_A(W, U_A), \qquad Y = f_Y(W, A, U_Y),
\]
where $U = (U_W, U_A, U_Y)$ are random inputs following a particular unknown probability distribution, while the functions $f_W$, $f_A$, $f_Y$ deterministically map the realization of the random input $U$ sequentially into a realization of $W$, $A$, $Y$. One might not make any assumptions about the form of these functions $f_W$, $f_A$, $f_Y$. In that case, these causal assumptions put no restrictions on the probability distribution $P_0$ of $O = (W, A, Y)$, but through these assumptions we have parametrized $P_0$ by a choice of functions $(f_W, f_A, f_Y)$ and a choice of distribution of $U$. Pearl [9] refers to such assumptions as a structural causal model for the distribution of $O = (W, A, Y)$.

This structural causal model allows one to define a corresponding postintervention probability distribution that corresponds with replacing $A = f_A(W, U_A)$ by our desired intervention on the intervention node $A$. For example, a static intervention $A = 1$ results in a new system of equations $W = f_W(U_W)$, $A = 1$, $Y_1 = f_Y(W, 1, U_Y)$, where this new random variable $Y_1$ is called a counterfactual outcome or potential outcome corresponding with intervention $A = 1$. Similarly, one can define $Y_0 = f_Y(W, 0, U_Y)$. Thus, $Y_a$ ($a \in \{0, 1\}$) represents the outcome on the subject one would have seen if the subject had been assigned treatment $A = a$. One might now define the causal effect of interest as $E_0 Y_1 - E_0 Y_0$, that is, the difference between the expected outcome of $Y_1$ and the expected outcome of $Y_0$. If one also assumes that $A$ is independent of the counterfactuals $(Y_0, Y_1)$, given $W$, which is often referred to as the assumption of no unmeasured confounding or the randomization assumption, then it follows that $E_0 Y_1 - E_0 Y_0 = \Psi(P_0)$. That is, under the structural causal model, including this no unmeasured confounding assumption, $\Psi(P_0)$ can not only be interpreted purely statistically as an average of conditional treatment effects, but it actually equals the marginal additive causal effect.

In general, causal models or, more generally, sets of nontestable assumptions can be used to define underlying target quantities of interest and corresponding statistical target parameters that equal this target quantity under these assumptions. Well known classes of such models are models for censored data, in which the observed data is represented as a many-to-one mapping on the full data of interest and censoring variable, and the target quantity is a parameter of the full-data distribution. Similarly, causal inference models represent the observed data as a mapping on counterfactuals and the observed treatment (either explicitly as in the Neyman-Rubin model or implicitly as in the Pearl structural causal models), and one defines the target quantity as a parameter of the distribution of the counterfactuals. One is now often concerned with providing sets of assumptions on the underlying distribution (i.e., of the full data) that allow identifiability of the target quantity from the observed data distribution (e.g., coarsening at random or the randomization assumption). These nontestable assumptions do not change the statistical model and, as a consequence, once one has defined the relevant estimand $\Psi(P_0)$, do not affect the estimation problem either.

2.4. Estimation Problem

The estimation problem is defined by the statistical model (i.e., $P_0 \in \mathcal{M}$) and the choice of target parameter (i.e., $\Psi : \mathcal{M} \to \mathbb{R}$). Targeted Learning is now the field concerned with the development of estimators of the target parameter that are asymptotically consistent as the number of units $n$ converges to infinity and whose appropriately standardized version (e.g., $\sqrt{n}(\psi_n - \psi_0)$) converges in distribution to some limit probability distribution (e.g., a normal distribution), so that one can construct confidence intervals that for large enough sample size contain, with a user supplied high probability, the true value of the target parameter. In the case that $O_1, \ldots, O_n$ are i.i.d. $\sim P_0$, a common method for establishing asymptotic normality of an estimator is to demonstrate that the estimator minus truth can be approximated by an empirical mean of a function of $O_i$. Such an estimator is called asymptotically linear at $P_0$. Formally, an estimator $\psi_n$ is asymptotically linear under i.i.d. sampling from $P_0$ if
\[
\psi_n - \psi_0 = \frac{1}{n}\sum_{i=1}^{n} IC(P_0)(O_i) + o_P\!\left(1/\sqrt{n}\right),
\]
where $O \mapsto IC(P_0)(O)$ is the so called influence curve at $P_0$. In that case, the central limit theorem teaches us that $\sqrt{n}(\psi_n - \psi_0)$ converges to a normal distribution with variance $\sigma^2 = E_{P_0} IC(P_0)(O)^2$ defined as the variance of the influence curve. An asymptotic 0.95 confidence interval for $\psi_0$ is then given by $\psi_n \pm 1.96\,\sigma_n/\sqrt{n}$, where $\sigma_n^2$ is the sample variance of an estimate $IC_n(O_i)$ of the true influence curve $IC(P_0)(O_i)$, $i = 1, \ldots, n$.
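As a minimal illustration of such influence-curve-based inference, the R sketch below computes a Wald-type confidence interval from a point estimate and a vector of estimated influence curve values; the inputs `psi_n` and `IC_n` are hypothetical placeholders for whatever estimator and influence curve estimate one has available.

```r
# Wald-type confidence interval based on an estimated influence curve.
# psi_n: point estimate; IC_n: numeric vector with IC_n[i] = estimated IC at O_i.
wald_ci <- function(psi_n, IC_n, level = 0.95) {
  n  <- length(IC_n)
  se <- sqrt(var(IC_n) / n)            # sigma_n / sqrt(n)
  z  <- qnorm(1 - (1 - level) / 2)     # 1.96 for level = 0.95
  c(lower = psi_n - z * se, estimate = psi_n, upper = psi_n + z * se)
}
```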

The empirical mean of the influence curve $IC(P_0)$ of an estimator represents the first order linear approximation of the estimator as a functional of the empirical distribution, and the derivation of the influence curve is a by-product of the application of the so called functional delta-method for statistical inference based on functionals of the empirical distribution [10–12]. That is, the influence curve of an estimator, viewed as a mapping $P_n \mapsto \hat{\Psi}(P_n)$ from the empirical distribution $P_n$ into the estimated value $\hat{\Psi}(P_n)$, is defined as the directional derivative at $P_0$ in the direction $\delta_O - P_0$, where $\delta_O$ is the empirical distribution of a single observation $O$.

2.5. Targeted Learning Respects Both Local and Global Constraints of the Statistical Model

Targeted Learning is not just satisfied with asymptotic performance such as asymptotic efficiency. Asymptotic efficiency requires fully respecting the local statistical constraints for shrinking neighborhoods around the true data distribution implied by the statistical model, defined by the so called tangent space generated by all scores of parametric submodels through $P_0$ [13], but it does not require respecting the global constraints on the data distribution implied by the statistical model (e.g., see [14]). Instead, Targeted Learning pursues the development of such asymptotically efficient estimators that also have excellent and robust practical performance by also fully respecting the global constraints of the statistical model. In addition, Targeted Learning is also concerned with the development of confidence intervals with good practical coverage. For that purpose, our proposed methodology for Targeted Learning, the so called targeted minimum loss based estimation discussed below, does not only result in asymptotically efficient estimators, but the estimators (1) utilize unified cross-validation to make practically sound choices for estimator construction that actually work well with the very data set at hand [15–19], (2) focus on the construction of substitution estimators that by definition also fully respect the global constraints of the statistical model, and (3) use influence curve theory to construct targeted, computer friendly estimators of the asymptotic distribution, such as the normal limit distribution based on an estimator of the asymptotic variance of the estimator.

Let us succinctly review the immediate relevance to Targeted Learning of the above mentioned basic concepts: influence curve, efficient influence curve, substitution estimator, cross-validation, and super-learning. For the sake of discussion, let us consider the case that the $n$ observations are independent and identically distributed: $O_i \sim P_0 \in \mathcal{M}$, and $\Psi(P_0) \in \mathbb{R}$ can now be defined as a parameter of the common distribution $P_0$ of $O_i$, but each of the concepts has a generalization to dependent data as well (e.g., see [8]).

2.6. Targeted Learning Is Based on a Substitution Estimator

Substitution estimators are estimators that can be described as the target parameter mapping applied to an estimator of the data distribution that is an element of the statistical model. More generally, if the target parameter is represented as a mapping $\Psi(Q_0)$ on a part $Q_0 = Q(P_0)$ of the data distribution $P_0$ (e.g., a factor of the likelihood), then a substitution estimator can be represented as $\Psi(Q_n)$, where $Q_n$ is an estimator of $Q_0$ that is contained in the parameter space $\{Q(P) : P \in \mathcal{M}\}$ implied by the statistical model $\mathcal{M}$. Substitution estimators are known to be particularly robust by fully respecting that the true target parameter is obtained by evaluating the target parameter mapping on this statistical model. For example, substitution estimators are guaranteed to respect known bounds on the target parameter (e.g., it is a probability or a difference between two probabilities) as well as known bounds on the data distribution implied by the model $\mathcal{M}$.

In our running example, we can define $Q_0 = (Q_{W,0}, \bar{Q}_0)$, where $Q_{W,0}$ is the probability distribution of $W$ under $P_0$, and $\bar{Q}_0(A, W) = E_0(Y \mid A, W)$ is the conditional mean of the outcome, given the treatment and covariates, and represent the target parameter
\[
\Psi(Q_0) = E_{Q_{W,0}}\bigl[\bar{Q}_0(1, W) - \bar{Q}_0(0, W)\bigr]
\]
as a function of the conditional mean $\bar{Q}_0$ and the probability distribution $Q_{W,0}$ of $W$. The model might restrict $\bar{Q}_0$ to be between 0 and a small number $\delta < 1$ but otherwise puts no restrictions on $Q_0$. A substitution estimator is now obtained by plugging in the empirical distribution $Q_{W,n}$ for $Q_{W,0}$ and a data adaptive estimator $\bar{Q}_n$ of the regression $\bar{Q}_0$:
\[
\psi_n = \Psi(Q_n) = \frac{1}{n}\sum_{i=1}^{n}\bigl\{\bar{Q}_n(1, W_i) - \bar{Q}_n(0, W_i)\bigr\}.
\]

Not every type of estimator is a substitution estimator. For example, an inverse probability of treatment type estimator of $\psi_0$ could be defined as
\[
\psi_n^{IPTW} = \frac{1}{n}\sum_{i=1}^{n} \frac{2A_i - 1}{g_n(A_i \mid W_i)}\, Y_i,
\]
where $g_n$ is an estimator of the conditional probability of treatment $g_0(A \mid W) = P_0(A \mid W)$. This is clearly not a substitution estimator. In particular, if $g_n(A_i \mid W_i)$ is very small for some observations, this estimator might not be between $-1$ and $1$ and thus completely ignores known constraints.
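The contrast between the two types of estimators can be made concrete with the following R sketch, which computes both the substitution (plug-in) estimator of the previous subsection and the IPTW estimator on simulated data. The data generating mechanism and the parametric working models for $\bar{Q}_0$ and $g_0$ are illustrative assumptions only; in practice one would use super-learning for both.

```r
set.seed(2)
n <- 1000
W <- rnorm(n)
A <- rbinom(n, 1, plogis(0.6 * W))            # hypothetical treatment mechanism g0
Y <- rbinom(n, 1, plogis(-1 + A - 0.5 * W))   # hypothetical outcome regression Qbar0

# Substitution (plug-in) estimator: fit Qbar_n, then average Qbar_n(1, W) - Qbar_n(0, W).
Qfit <- glm(Y ~ A + W, family = binomial)
Q1   <- predict(Qfit, newdata = data.frame(A = 1, W = W), type = "response")
Q0   <- predict(Qfit, newdata = data.frame(A = 0, W = W), type = "response")
psi_plugin <- mean(Q1 - Q0)                   # always between -1 and 1

# IPTW estimator: not a substitution estimator; can escape [-1, 1] when g_n is small.
gfit <- glm(A ~ W, family = binomial)
g1   <- predict(gfit, type = "response")      # g_n(1 | W)
gA   <- ifelse(A == 1, g1, 1 - g1)            # g_n(A | W)
psi_iptw <- mean((2 * A - 1) / gA * Y)

c(plugin = psi_plugin, iptw = psi_iptw)
```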

2.7. Targeted Estimator Relies on Data Adaptive Estimator of Nuisance Parameter

The construction of targeted estimators of the target parameter requires construction of estimators of infinite dimensional nuisance parameters, specifically, the initial estimator of the relevant part $Q_0$ of the data distribution in the TMLE, and the estimator of the nuisance parameter $g_0$ that is needed to target the fit of this relevant part in the TMLE. In our running example, we have $Q_0 = (Q_{W,0}, \bar{Q}_0)$, and the nuisance parameter $g_0$ is the conditional distribution of $A$, given $W$.

2.8. Targeted Learning Uses Super-Learning to Estimate the Nuisance Parameter

In order to optimize these estimators of the nuisance parameters $(Q_0, g_0)$, we use a so called super-learner that is guaranteed to asymptotically outperform any available procedure, by simply including it in the library of estimators that is used to define the super-learner.

The super-learner is defined by a library of estimators of the nuisance parameter and uses cross-validation to select the best weighted combination of these estimators. The asymptotic optimality of the super-learner is implied by the oracle inequality for the cross-validation selector that compares the performance of the estimator that minimizes the cross-validated risk over all possible candidate estimators with the oracle selector that simply selects the best possible choice (as if one has available an infinite validation sample). The only assumption this asymptotic optimality relies upon is that the loss function used in cross-validation is uniformly bounded and that the number of algorithms in the library does not increase at a faster rate than a polynomial power in sample size when sample size converges to infinity [15–19]. However, cross-validation is a method that goes beyond optimal asymptotic performance, since the cross-validated risk measures the performance of the estimator on the very sample it is based upon, making it a practically very appealing method for estimator selection.

In our running example, we have that $\bar{Q}_0 = \arg\min_{\bar{Q}} E_{P_0} L(\bar{Q})(O)$, where $L(\bar{Q})(O) = (Y - \bar{Q}(A, W))^2$ is the squared error loss, or one can also use the log-likelihood loss $L(\bar{Q})(O) = -\{Y \log \bar{Q}(A, W) + (1 - Y)\log(1 - \bar{Q}(A, W))\}$. Usually, there are a variety of possible loss functions one could use to define the super-learner: the choice could be based on the dissimilarity implied by the loss function [15], but probably should itself be data adaptively selected in a targeted manner. The cross-validated risk of a candidate estimator of $\bar{Q}_0$ is then defined as the empirical mean over a validation sample of the loss of the candidate estimator fitted on the training sample, averaged across different splits of the sample into a validation and training sample. A typical way to obtain such sample splits is so called $V$-fold cross-validation, in which one first partitions the sample into $V$ subsets of equal size, and each of the $V$ subsets plays the role of a validation sample while its complement of $V - 1$ subsets equals the corresponding training sample. Thus, $V$-fold cross-validation results in $V$ sample splits into a validation sample and corresponding training sample. A possible candidate estimator is a maximum likelihood estimator based on a logistic linear regression working model for $P_0(Y = 1 \mid A, W)$. Different choices of such logistic linear regression working models result in different possible candidate estimators. So in this manner one can already generate a rich library of candidate estimators. However, the statistics and machine learning literature has also generated lots of data adaptive estimators based on smoothing, data adaptive selection of basis functions, and so on, resulting in another large collection of possible candidate estimators that can be added to the library. Given a library of candidate estimators, the super-learner selects the estimator that minimizes the cross-validated risk over all the candidate estimators. This selected estimator is now applied to the whole sample to give our final estimate of $\bar{Q}_0$. One can enrich the collection of candidate estimators by taking any weighted combination of an initial library of candidate estimators, thereby generating a whole parametric family of candidate estimators.
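As an illustration of the cross-validation selector underlying the super-learner, the following R sketch implements a discrete super-learner: $V$-fold cross-validation is used to select among a few logistic regression working models for $\bar{Q}_0$ using the squared error loss. The simulated data and the small candidate library are illustrative assumptions; the SuperLearner R package additionally computes the best weighted combination of the candidates rather than only the single best one.

```r
set.seed(3)
n <- 500
W <- rnorm(n); A <- rbinom(n, 1, 0.5)
Y <- rbinom(n, 1, plogis(-0.5 + A - W))
dat <- data.frame(W = W, A = A, Y = Y)

# A small library of candidate estimators (logistic regression working models).
library_formulas <- list(Y ~ A, Y ~ A + W, Y ~ A * W)

V    <- 10
fold <- sample(rep(1:V, length.out = n))        # V-fold sample splits
cv_risk <- sapply(library_formulas, function(f) {
  mean(sapply(1:V, function(v) {
    train <- dat[fold != v, ]; valid <- dat[fold == v, ]
    fit   <- glm(f, data = train, family = binomial)
    pred  <- predict(fit, newdata = valid, type = "response")
    mean((valid$Y - pred)^2)                    # squared error loss on validation sample
  }))
})

best   <- which.min(cv_risk)                    # cross-validation selector
Qbar_n <- glm(library_formulas[[best]], data = dat, family = binomial)  # refit on full sample
```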

Similarly, one can define a super-learner of the conditional distribution of $A$, given $W$.

The super-learner’s performance improves by enlarging the library. Even though for a given data set one of the candidate estimators will do as well as the super-learner, across a variety of data sets the super-learner beats an estimator that is betting on particular subsets of the parameter space containing the truth or allowing good approximations of the truth. The use of the super-learner provides an important step in creating a robust estimator whose performance does not rely on being lucky, but on generating a rich library so that a weighted combination of the estimators provides a good approximation of the truth, wherever the truth might be located in the parameter space.

2.9. Asymptotic Efficiency

An asymptotically efficient estimator of the target parameter is an estimator that can be represented as the target parameter value plus an empirical mean of the so called (mean zero) efficient influence curve $D^*(P_0)$, up to a second order term that is asymptotically negligible [13]. That is, an estimator $\psi_n$ is efficient if and only if it is asymptotically linear with influence curve equal to the efficient influence curve $D^*(P_0)$:
\[
\psi_n - \psi_0 = \frac{1}{n}\sum_{i=1}^{n} D^*(P_0)(O_i) + o_P\!\left(1/\sqrt{n}\right).
\]

The efficient influence curve is also called the canonical gradient and is indeed defined as the canonical gradient of the pathwise derivative of the target parameter $\Psi : \mathcal{M} \to \mathbb{R}$. Specifically, one defines a rich family of one-dimensional submodels $\{P_\epsilon : \epsilon\}$ through $P$ at $\epsilon = 0$, and one represents the pathwise derivative as an inner product
\[
\frac{d}{d\epsilon}\Psi(P_\epsilon)\Big|_{\epsilon = 0} = E_P\bigl[D(P)(O)\, S(P)(O)\bigr]
\]
(the covariance operator in the Hilbert space $L^2_0(P)$ of functions of $O$ with mean zero and inner product $\langle h_1, h_2\rangle_P = E_P h_1(O) h_2(O)$), where $S(P)$ is the score of the path and $D(P)$ is a so called gradient. The unique gradient that is also in the closure of the linear span of all scores generated by the family of one-dimensional submodels through $P$, also called the tangent space at $P$, is now the canonical gradient $D^*(P)$ at $P$. Indeed, the canonical gradient can be computed as the projection of any given gradient $D(P)$ onto the tangent space in the Hilbert space $L^2_0(P)$. An interesting result in efficiency theory is that an influence curve of a regular asymptotically linear estimator is a gradient.

In our running example, it can be shown that the efficient influence curve of the additive treatment effect $\Psi(Q_0)$ is given by
\[
D^*(P_0)(O) = \frac{2A - 1}{g_0(A \mid W)}\bigl(Y - \bar{Q}_0(A, W)\bigr) + \bar{Q}_0(1, W) - \bar{Q}_0(0, W) - \Psi(Q_0).
\]

As noted earlier, the influence curve $IC(P_0)$ of an estimator also characterizes the limit variance $\sigma^2 = E_{P_0} IC(P_0)(O)^2$ of the mean zero normal limit distribution of $\sqrt{n}(\psi_n - \psi_0)$. This variance can be estimated with $\sigma_n^2 = \frac{1}{n}\sum_{i=1}^{n} IC_n(O_i)^2$, where $IC_n$ is an estimator of the influence curve $IC(P_0)$. Efficiency theory teaches us that for any regular asymptotically linear estimator its influence curve has a variance that is larger than or equal to the variance of the efficient influence curve, $\sigma^2 \geq E_{P_0} D^*(P_0)(O)^2$, which is also called the generalized Cramer-Rao lower bound. In our running example, the asymptotic variance of an efficient estimator is thus estimated with the sample variance of an estimate $D_n^*(O_i)$, $i = 1, \ldots, n$, of $D^*(P_0)(O_i)$, obtained by plugging in the estimator $g_n$ of $g_0$ and the estimator $\bar{Q}_n$ of $\bar{Q}_0$, and replacing $\Psi(Q_0)$ by $\Psi(Q_n)$.

2.10. Targeted Estimator Solves the Efficient Influence Curve Equation

The efficient influence curve is a function $D^*(Q_0, g_0)$ of $O$ that depends on $P_0$ through $Q_0$ and a possible nuisance parameter $g_0$, and it can be calculated as the canonical gradient of the pathwise derivative of the target parameter mapping along paths through $P_0$. It is also called the efficient score. Thus, given the statistical model and target parameter mapping, one can calculate the efficient influence curve whose variance defines the best possible asymptotic variance of an estimator, also referred to as the generalized Cramer-Rao lower bound for the asymptotic variance of a regular estimator. The principal building block for achieving asymptotic efficiency of a substitution estimator $\Psi(Q_n^*)$, beyond $Q_n^*$ being an excellent estimator of $Q_0$ as achieved with super-learning, is that the estimator solves the so called efficient influence curve equation
\[
\frac{1}{n}\sum_{i=1}^{n} D^*(Q_n^*, g_n)(O_i) = 0
\]
for a good estimator $g_n$ of $g_0$. This property cannot be expected to hold for a super-learner, and that is why the TMLE discussed in Section 4 involves an additional update of the super-learner that guarantees that it solves this efficient influence curve equation.

For example, maximum likelihood estimators solve all score equations, including this efficient score equation that targets the target parameter, but maximum likelihood estimators for large semiparametric models typically do not exist for finite sample sizes. Fortunately, for efficient estimation of the target parameter one should only be concerned with solving this particular efficient score equation tailored for the target parameter. Using the notation $Pf \equiv E_P f(O)$ for the expectation operator, one way to understand why the efficient influence curve equation indeed targets the true target parameter value is that there are many cases in which $P_0 D^*(Q, g_0) = \Psi(Q_0) - \Psi(Q)$ and, in general, as a consequence of $D^*$ being a canonical gradient,
\[
P_0 D^*(Q, g) = \Psi(Q_0) - \Psi(Q) + R(Q, g, Q_0, g_0),
\]
where $R(Q, g, Q_0, g_0)$ is a term involving second order differences such as $(Q - Q_0)(g - g_0)$, $(Q - Q_0)^2$, and $(g - g_0)^2$. This key property explains why solving $P_0 D^*(Q, g_0) = 0$ targets $\Psi(Q)$ to be close to $\Psi(Q_0)$ and thus explains why solving $P_n D^*(Q_n, g_n) = 0$ targets $Q_n$ to fit $\psi_0$.

In our running example, we have
\[
P_0 D^*(\bar{Q}, g) = \Psi(Q_0) - \Psi(Q) + R(\bar{Q}, g, \bar{Q}_0, g_0),
\]
where
\[
R(\bar{Q}, g, \bar{Q}_0, g_0) = \sum_{a \in \{0,1\}} (2a - 1)\, E_0\!\left[\frac{(g - g_0)(a \mid W)}{g(a \mid W)}\,\bigl(\bar{Q} - \bar{Q}_0\bigr)(a, W)\right].
\]
So in our example, the remainder $R$ only involves a cross-product difference $(g - g_0)(\bar{Q} - \bar{Q}_0)$. In particular, the remainder equals zero if either $\bar{Q} = \bar{Q}_0$ or $g = g_0$, which is often referred to as double robustness of the efficient influence curve with respect to $(\bar{Q}, g)$ in the causal and censored data literature (see, e.g., [20]). This property translates into double robustness of estimators that solve the efficient influence curve estimating equation.

Due to this identity, an estimator $(Q_n^*, g_n)$ that solves $P_n D^*(Q_n^*, g_n) = 0$ and is in a local neighborhood of $(Q_0, g_0)$, so that $R(Q_n^*, g_n, Q_0, g_0) = o_P(1/\sqrt{n})$, approximately solves
\[
\Psi(Q_n^*) - \Psi(Q_0) \approx (P_n - P_0) D^*(Q_n^*, g_n),
\]
where the latter behaves as a mean zero centered empirical mean with minimal variance that will be approximately normally distributed. This is formalized in an actual proof of asymptotic efficiency in the next subsection.

2.11. Targeted Estimator Is Asymptotically Linear and Efficient

In fact, combining $P_n D^*(Q_n^*, g_n) = 0$ with the above identity at $(Q_n^*, g_n)$ yields
\[
\Psi(Q_n^*) - \Psi(Q_0) = (P_n - P_0) D^*(Q_n^*, g_n) + R_n,
\]
where $R_n = R(Q_n^*, g_n, Q_0, g_0)$ is a second order term. Thus, if second order differences such as $(\bar{Q}_n^* - \bar{Q}_0)(g_n - g_0)$, $(\bar{Q}_n^* - \bar{Q}_0)^2$, and $(g_n - g_0)^2$ converge to zero at a rate faster than $1/\sqrt{n}$, then it follows that $R_n = o_P(1/\sqrt{n})$. To make this assumption as reasonable as possible one should use super-learning for both $\bar{Q}_n^*$ and $g_n$. In addition, empirical process theory teaches us that $(P_n - P_0) D^*(Q_n^*, g_n) = (P_n - P_0) D^*(Q_0, g_0) + o_P(1/\sqrt{n})$ if $P_0\{D^*(Q_n^*, g_n) - D^*(Q_0, g_0)\}^2$ converges to zero in probability as $n$ converges to infinity (a consistency condition) and if $D^*(Q_n^*, g_n)$ falls in a so called Donsker class of functions [11]. An important Donsker class is the class of all $d$-variate real valued functions that have a uniform sectional variation norm that is bounded by some universal $M < \infty$: that is, the variation norm of the function itself and the variation norms of its sections are all bounded by this $M$. This Donsker class condition essentially excludes estimators that heavily overfit the data, so that their variation norms converge to infinity as $n$ converges to infinity. So under this Donsker class condition, $R_n = o_P(1/\sqrt{n})$, and the consistency condition, we have
\[
\Psi(Q_n^*) - \Psi(Q_0) = \frac{1}{n}\sum_{i=1}^{n} D^*(Q_0, g_0)(O_i) + o_P\!\left(1/\sqrt{n}\right).
\]
That is, $\Psi(Q_n^*)$ is asymptotically efficient. In addition, the right-hand side converges to a normal distribution with mean zero and variance equal to the variance of the efficient influence curve. So, in spite of the fact that the efficient influence curve equation only represents a finite dimensional equation for an infinite dimensional object $Q_n^*$, it implies consistency of $\Psi(Q_n^*)$ up to a second order term $R_n$ and even asymptotic efficiency if $R_n = o_P(1/\sqrt{n})$, under some weak regularity conditions.

3. Road Map for Targeted Learning of Causal Quantity or Other Underlying Full-Data Target Parameters

This is a good moment to review the roadmap for Targeted Learning. We have formulated a transparent roadmap for Targeted Learning of a causal quantity [2, 9, 21], involving the following steps:
(i) defining a full-data model such as a causal model and a parameterization of the observed data distribution in terms of the full-data distribution (e.g., the Neyman-Rubin-Robins counterfactual model [22–28] or the structural causal model [9]);
(ii) defining the target quantity of interest as a target parameter of the full-data distribution;
(iii) establishing identifiability of the target quantity from the observed data distribution under possible additional assumptions that are not necessarily believed to be reasonable;
(iv) committing to the resulting estimand $\Psi(P_0)$ and the statistical model $\mathcal{M}$ that is believed to contain the true $P_0$;
(v) a subroadmap for the TMLE discussed below to construct an asymptotically efficient substitution estimator of the statistical target parameter;
(vi) establishing an asymptotic distribution and corresponding estimator of this limit distribution to construct a confidence interval;
(vii) honest interpretation of the results, possibly including a sensitivity analysis [29–32].

That is, the statistical target parameters of interest are often constructed through the following process. One assumes an underlying model of probability distributions, which we will call the full-data model, and one defines the data distribution in terms of this full-data distribution. This can be thought of as modeling: that is, one obtains a parameterization $\mathcal{M} = \{P_\theta : \theta \in \Theta\}$ for the statistical model $\mathcal{M}$ for some underlying parameter space $\Theta$ and parameterization $\theta \mapsto P_\theta$. The target quantity of interest is defined as some parameter of the full-data distribution, that is, of $\theta_0$. Under certain assumptions one establishes that the target quantity can be represented as a parameter of the data distribution, a so called estimand: such a result is called an identifiability result for the target quantity. One might now decide to use this estimand as the target parameter and develop a TMLE for this target parameter. Under the nontestable assumptions the identifiability result relied upon, the estimand can be interpreted as the target quantity of interest, but importantly it can always be interpreted as a statistical feature of the data distribution (due to the statistical model being true), possibly of independent interest. In this manner, one can define estimands that are equal to a causal quantity of interest defined in an underlying (counterfactual) world. The TMLE of this estimand, which is only defined by the statistical model and the target parameter mapping, and is thus ignorant of the nontestable assumptions that allowed the causal interpretation of the estimand, now provides an estimator of this causal quantity. In this manner, Targeted Learning is in complete harmony with the development of models such as causal and censored data models and identification results for underlying quantities: the latter just provides us with a definition of a target parameter mapping and statistical model, and thereby the pure statistical estimation problem that needs to be addressed.

4. Targeted Minimum Loss Based Estimation (TMLE)

The TMLE [1, 2, 4] is defined according to the following steps. Firstly, one writes the target parameter mapping as a mapping applied to a part of the data distribution $P_0$, say $Q_0 = Q(P_0)$, that can be represented as the minimizer of a criterion at the true data distribution $P_0$ over all candidate values $Q$ for this part of the data distribution: we refer to this criterion $R_{P_0}(Q)$ as the risk of the candidate value $Q$.

Typically, the risk at a candidate parameter value $Q$ can be defined as the expectation under the data distribution of a loss function $(O, Q) \mapsto L(Q)(O)$ that maps the unit data structure and the candidate parameter value into a real number: $R_{P_0}(Q) = E_{P_0} L(Q)(O)$. Examples of loss functions are the squared error loss for a conditional mean and the (negative) log-likelihood loss for a (conditional) density. This representation of $Q_0$ as a minimizer of a risk allows us to estimate it with (e.g., loss-based) super-learning.

Secondly, one computes the efficient influence curve $(Q, g) \mapsto D^*(Q, g)(O)$ identified by the canonical gradient of the pathwise derivative of the target parameter mapping along paths through a data distribution $P$, where this efficient influence curve does only depend on $P$ through $Q(P)$ and some nuisance parameter $g(P)$. Given an estimator $g_n$, one now defines a path $\{Q_{n,\epsilon} : \epsilon\}$ with Euclidean parameter $\epsilon$ through the super-learner $Q_n$ whose score at $\epsilon = 0$ spans the efficient influence curve $D^*(Q_n, g_n)$ at the initial estimator $Q_n$: this is called a least favorable parametric submodel through the super-learner.

In our running example, we have $Q_0 = (Q_{W,0}, \bar{Q}_0)$, so that it suffices to construct a path through $\bar{Q}_n$ and $Q_{W,n}$ with corresponding loss functions and show that their scores span the efficient influence curve. We can define the path
\[
\operatorname{logit} \bar{Q}_{n,\epsilon_1}(A, W) = \operatorname{logit} \bar{Q}_n(A, W) + \epsilon_1 H^*(A, W), \qquad H^*(A, W) = \frac{2A - 1}{g_n(A \mid W)},
\]
with loss function $L(\bar{Q})(O) = -\{Y \log \bar{Q}(A, W) + (1 - Y)\log(1 - \bar{Q}(A, W))\}$. Note that
\[
\frac{d}{d\epsilon_1} L\bigl(\bar{Q}_{n,\epsilon_1}\bigr)(O)\Big|_{\epsilon_1 = 0} = -\,\frac{2A - 1}{g_n(A \mid W)}\bigl(Y - \bar{Q}_n(A, W)\bigr).
\]
We also define the path $Q_{W,n,\epsilon_2} = \bigl(1 + \epsilon_2 D_W(Q_n)\bigr) Q_{W,n}$ with loss function $L(Q_W)(O) = -\log Q_W(W)$, where $D_W(Q_n)(W) = \bar{Q}_n(1, W) - \bar{Q}_n(0, W) - \Psi(Q_n)$. Note that
\[
\frac{d}{d\epsilon_2} L\bigl(Q_{W,n,\epsilon_2}\bigr)(O)\Big|_{\epsilon_2 = 0} = -\,D_W(Q_n)(W).
\]
Thus, if we define the sum loss function $L(Q) = L(\bar{Q}) + L(Q_W)$, then the scores of the combined path $\{Q_{n,\epsilon} : \epsilon = (\epsilon_1, \epsilon_2)\}$ at $\epsilon = 0$ span
\[
D^*(Q_n, g_n)(O) = \frac{2A - 1}{g_n(A \mid W)}\bigl(Y - \bar{Q}_n(A, W)\bigr) + \bar{Q}_n(1, W) - \bar{Q}_n(0, W) - \Psi(Q_n).
\]
This proves that indeed these proposed paths through $\bar{Q}_n$ and $Q_{W,n}$ and corresponding loss functions span the efficient influence curve at $(Q_n, g_n)$, as required.

The dimension of $\epsilon$ can be selected to be equal to the dimension of the target parameter $\psi_0$, but by creating extra components in $\epsilon$ one can arrange to solve additional score equations beyond the efficient score equation, providing important additional flexibility and power to the procedure. In our running example, we can use an $\epsilon_1$ for the path through $\bar{Q}_n$ and a separate $\epsilon_2$ for the path through $Q_{W,n}$. In this case, the TMLE update $Q_n^*$ will solve two score equations, $P_n D_W(Q_n^*) = 0$ and $P_n D_Y(Q_n^*, g_n) = 0$, where $D_W$ and $D_Y$ denote the two components of the efficient influence curve spanned by the $\epsilon_2$- and $\epsilon_1$-paths above, and thus, in particular, $P_n D^*(Q_n^*, g_n) = 0$. In this example, the main benefit of using a bivariate $\epsilon = (\epsilon_1, \epsilon_2)$ is that the TMLE does not update $Q_{W,n}$ (if selected to be the empirical distribution) and converges in a single step.

One fits the unknown parameter $\epsilon$ of this path by minimizing the empirical risk $\epsilon \mapsto P_n L(Q_{n,\epsilon})$ along this path through the super-learner, resulting in an estimator $\epsilon_n$. This now defines an update of the super-learner fit, defined as $Q_n^1 = Q_{n,\epsilon_n}$. This updating process is iterated till $\epsilon_n \approx 0$. The final update we will denote with $Q_n^*$, the TMLE of $Q_0$, and the target parameter mapping applied to $Q_n^*$ defines the TMLE of the target parameter $\psi_0$. This TMLE $\Psi(Q_n^*)$ solves the efficient influence curve equation $P_n D^*(Q_n^*, g_n) = 0$, providing the basis, in combination with statistical properties of $(Q_n^*, g_n)$, for establishing that the TMLE is asymptotically consistent, normally distributed, and asymptotically efficient, as shown above.

In our running example, we have $\epsilon_n = (\epsilon_{1,n}, \epsilon_{2,n})$, while $\epsilon_{2,n}$ equals zero. That is, the TMLE does not update $Q_{W,n}$, since the empirical distribution is already a nonparametric maximum likelihood estimator solving all score equations. In this case $Q_n^* = Q_n^1$, since the convergence of the TMLE algorithm occurs in one step, and, of course, $Q_{W,n}^* = Q_{W,n}$ is just the initial empirical distribution function of $W_1, \ldots, W_n$. The TMLE of $\psi_0$ is the substitution estimator
\[
\psi_n^* = \Psi(Q_n^*) = \frac{1}{n}\sum_{i=1}^{n}\bigl\{\bar{Q}_n^*(1, W_i) - \bar{Q}_n^*(0, W_i)\bigr\}.
\]
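To make the TMLE algorithm for the running example concrete, here is a minimal, self-contained R sketch of this single-step TMLE of the additive treatment effect, using simple parametric initial estimators in place of super-learners (an illustrative simplification) and reporting an influence-curve-based 95% confidence interval.

```r
set.seed(4)
n <- 1000
W <- rnorm(n)
A <- rbinom(n, 1, plogis(0.4 * W))             # hypothetical g0
Y <- rbinom(n, 1, plogis(-1 + A - 0.7 * W))    # hypothetical Qbar0

# Initial estimators (in practice: super-learners).
Qfit <- glm(Y ~ A + W, family = binomial)
gfit <- glm(A ~ W, family = binomial)
g1 <- predict(gfit, type = "response")         # g_n(1 | W)
gA <- ifelse(A == 1, g1, 1 - g1)               # g_n(A | W)
QA <- predict(Qfit, type = "response")         # Qbar_n(A, W)
Q1 <- predict(Qfit, newdata = data.frame(A = 1, W = W), type = "response")
Q0 <- predict(Qfit, newdata = data.frame(A = 0, W = W), type = "response")

# Targeting step: clever covariate H*(A, W) = (2A - 1) / g_n(A | W);
# fit epsilon by logistic regression of Y on H* with offset logit(Qbar_n).
H   <- (2 * A - 1) / gA
eps <- coef(glm(Y ~ -1 + H + offset(qlogis(QA)), family = binomial))

# Update Qbar_n along the least favorable submodel and evaluate the plug-in.
Q1star <- plogis(qlogis(Q1) + eps / g1)
Q0star <- plogis(qlogis(Q0) - eps / (1 - g1))
psi_tmle <- mean(Q1star - Q0star)              # TMLE of psi0

# Influence-curve-based 95% confidence interval.
QAstar <- ifelse(A == 1, Q1star, Q0star)
IC <- (2 * A - 1) / gA * (Y - QAstar) + Q1star - Q0star - psi_tmle
se <- sqrt(var(IC) / n)
c(lower = psi_tmle - 1.96 * se, estimate = psi_tmle, upper = psi_tmle + 1.96 * se)
```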

5. Advances in Targeted Learning

As apparent from the above presentation, TMLE is a general method that can be developed for all types of challenging estimation problems. It is a matter of representing the target parameter as a parameter $\Psi(Q_0)$ of a smaller relevant part $Q_0$ of the data distribution, defining a path through $Q$ and a loss function whose generalized score spans the efficient influence curve, and implementing the corresponding iterative targeted minimum loss-based estimation algorithm.

We have used this framework to develop TMLE in a large number of estimation problems that assume that $O_1, \ldots, O_n$ are i.i.d. $\sim P_0$. Specifically, we developed TMLE of a large variety of effects (e.g., causal) of single and multiple time point interventions on an outcome of interest that may be subject to right-censoring, interval censoring, case-control sampling, and time-dependent confounding: see, for example, [4, 33–63, 63–72].

An original example of a particular type of TMLE (based on a double robust parametric regression model) for estimation of a causal effect of a point-treatment intervention was presented in [73] and we refer to [47] for a detailed review of this earlier literature and its relation to TMLE.

It is beyond the scope of this overview paper to get into a review of some of these examples. For a general comprehensive book on Targeted Learning, which includes many of these applications on TMLE and more, we refer to [2].

To provide the reader with a sense, consider generalizing our running example to a general longitudinal data structure $O = (L_0, A_0, L_1, A_1, \ldots, L_K, A_K, Y)$, where $L_0$ are baseline covariates, $L_k$ are time dependent covariates realized between intervention nodes $A_{k-1}$ and $A_k$, and $Y$ is the final outcome of interest. The intervention nodes could include both censoring variables and treatment variables: the desired intervention for the censoring variables is always “no censoring,” since the outcome is only of interest when it is not subject to censoring (in which case it might just be a forward imputed value, e.g.).

One may now assume a structural causal model of the type discussed earlier and be interested in the mean counterfactual outcome under a particular intervention on all the intervention nodes, where these interventions could be static, dynamic, or even stochastic. Under the so called sequential randomization assumption, this target quantity is identified by the so called G-computation formula for the postintervention distribution corresponding with a stochastic intervention $g^*$:
\[
P_0^{g^*}(O) = \prod_{k=0}^{K+1} P_0\bigl(L_k \mid \bar{L}_{k-1}, \bar{A}_{k-1}\bigr) \prod_{k=0}^{K} g^*\bigl(A_k \mid \bar{A}_{k-1}, \bar{L}_k\bigr),
\]
where $\bar{L}_k = (L_0, \ldots, L_k)$, $\bar{A}_k = (A_0, \ldots, A_k)$, and $Y \equiv L_{K+1}$. Note that this postintervention distribution is nothing else but the actual distribution of $O$ factorized according to the time-ordering, but with the true conditional distributions of $A_k$, given its parent nodes $(\bar{A}_{k-1}, \bar{L}_k)$, replaced by the desired stochastic intervention $g^*$. The statistical target parameter is thus $E_{P_0^{g^*}} Y$, that is, the mean outcome under this postintervention distribution. A big challenge in the literature has been to develop robust efficient estimators of this estimand, and, more generally, one likes to estimate this mean outcome under a user supplied class of stochastic interventions $g^*$. Such robust efficient substitution estimators have now been developed using the TMLE framework [58, 60], where the latter is a TMLE inspired by important double robust estimators established in earlier work of [74]. This work thus includes causal effects defined by working marginal structural models for static and dynamic treatment regimens, time to event outcomes, and incorporating right-censoring.

In many data sets one is interested in assessing the effect of one variable on an outcome, controlling for many other variables, across a large collection of variables. For example, one might want to know the effect of a single nucleotide polymorphism (SNP) on a trait of a subject across a whole genome, controlling each time for a large collection of other SNPs in the neighborhood of the SNP in question. Or one is interested in assessing the effect of a mutation in the HIV virus on viral load drop (a measure of drug resistance) when treated with a particular drug class, controlling for the other mutations in the HIV virus and for characteristics of the subject in question. Therefore, it is important to carefully define the effect of interest for each variable. If the variable is binary, one could use the target parameter $\Psi(P_0) = E_0[\bar{Q}_0(1, W) - \bar{Q}_0(0, W)]$ of our running example, but with $A$ now being the SNP in question and $W$ being the variables one wants to control for, while $Y$ is the outcome of interest. We often refer to such a measure as a particular variable importance measure. Of course, one now defines such a variable importance measure for each variable. When the variable is continuous, the above measure is not appropriate. In that case, one might define the variable importance as the projection of $E_0(Y \mid A, W) - E_0(Y \mid A = 0, W)$ onto a linear model such as $\beta A$ and use $\beta_0$ as the variable importance measure of interest [75], but one could think of a variety of other interesting effect measures. Either way, for each variable, one uses a TMLE of the corresponding variable importance measure. The stacked TMLE across all variables is now an asymptotically linear estimator of the stacked variable importance measure with stacked influence curve and thus approximately follows a multivariate normal distribution that can be estimated from the data. One can now carry out multiple testing procedures controlling a desired familywise type I error rate and construct simultaneous confidence intervals for the stacked variable importance measure, based on this multivariate normal limit distribution, as sketched below. In this manner, one uses Targeted Learning to target a large family of target parameters while still providing honest statistical inference taking into account multiple testing. This approach deals with a challenge in machine learning in which one wants estimators of a prediction function that simultaneously yield good estimates of the variable importance measures. Examples of such efforts are random forest and LASSO, but both regression methods fail to provide reliable variable importance measures and fail to provide any type of statistical inference. The truth is that if the goal is not prediction but to obtain a good estimate of the variable importance measures across the variables, then one should target the estimator of the prediction function towards the particular variable importance measure, for each variable separately, and only then one obtains valid estimators and statistical inference. For TMLE of effects of variables across a large set of variables, a so called variable importance analysis, including the application to genomic data sets, we refer to [48, 75–79].
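As a minimal sketch of that last step, the following R function forms simultaneous confidence intervals for a stacked vector of variable importance estimates from their estimated influence curves. The inputs `psi` (a vector of J TMLEs) and `IC` (an n x J matrix with the estimated influence curve of each TMLE) are hypothetical placeholders, and the simultaneous critical value is approximated by simulating from the estimated multivariate normal limit distribution.

```r
# psi: length-J vector of estimates; IC: n x J matrix with IC[i, j] = estimated IC_j(O_i).
simultaneous_ci <- function(psi, IC, level = 0.95, B = 1e4) {
  n <- nrow(IC)
  Sigma <- cov(IC) / n                  # estimated covariance of the stacked estimator
  R <- cov2cor(Sigma)                   # correlation matrix of the limit distribution
  # Simulate max_j |Z_j| for Z ~ N(0, R) to obtain the simultaneous critical value.
  Z <- matrix(rnorm(B * ncol(R)), B) %*% chol(R)
  q <- quantile(apply(abs(Z), 1, max), level)
  se <- sqrt(diag(Sigma))
  cbind(lower = psi - q * se, estimate = psi, upper = psi + q * se)
}
```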

Software has been developed in the form of general R packages implementing super-learning and TMLE for general longitudinal data structures: these packages are publicly available on CRAN under the function names tmle(), ltmle(), and SuperLearner().
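For orientation, a hypothetical usage sketch of these packages is given below. The argument names follow the package documentation as we recall it and may differ across package versions, and the chosen library of candidate learners is an arbitrary illustration; consult the package documentation before use.

```r
# Not run: requires the SuperLearner and tmle packages from CRAN, and a data set
# with binary outcome Y, binary treatment A, and a data.frame of baseline covariates W.
library(SuperLearner)
library(tmle)

SL.lib <- c("SL.glm", "SL.mean", "SL.glmnet")   # illustrative library of candidate learners

# Super-learner fit of the outcome regression Qbar0(A, W) = E0(Y | A, W).
sl_fit <- SuperLearner(Y = Y, X = cbind(A = A, W),
                       family = binomial(), SL.library = SL.lib)

# TMLE of the additive treatment effect, with super-learning for both Qbar0 and g0.
tmle_fit <- tmle(Y = Y, A = A, W = W,
                 Q.SL.library = SL.lib, g.SL.library = SL.lib)
tmle_fit$estimates$ATE
```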

Beyond the development of TMLE in this large variety of complex statistical estimation problems, as usual, the careful study of real world applications resulted in new challenges for the TMLE, and in response to that we have developed general TMLE that have additional properties dealing with these challenges. In particular, we have shown that the TMLE framework has the flexibility and capability to enhance the finite sample performance of the TMLE under the following specific challenges that come with real data applications.

Dealing with Rare Outcomes. If the outcome is rare, then the data is still sparse even though the sample size might be quite large. When the data is sparse with respect to the question of interest, the incorporation of global constraints of the statistical model becomes extremely important and can make a real difference in a data analysis. Consider our running example and suppose that $Y$ is the indicator of a rare event. In such cases it is often known that the probability of $Y = 1$, conditional on a treatment and covariate configuration, should not exceed a certain value $\delta < 1$: for example, the marginal prevalence is known and it is known that there are no subpopulations that increase the relative risk by more than a certain factor relative to the marginal prevalence. So the statistical model should now include the global constraint that $\bar{Q}_0(A, W) < \delta$ for some known $\delta < 1$. A TMLE should now be based on an initial estimator $\bar{Q}_n$ satisfying this constraint, and the least favorable submodel $\{\bar{Q}_{n,\epsilon} : \epsilon\}$ should also satisfy this constraint for each $\epsilon$ so that it is a real submodel. In [80] such a TMLE is constructed, and it is demonstrated to very significantly enhance its practical performance for finite sample sizes. Even though a TMLE ignoring this constraint would still be asymptotically efficient, by ignoring this important knowledge its practical performance for finite samples suffers.

Targeted Estimation of Nuisance Parameter in TMLE. Even though an asymptotically consistent estimator of $g_0$ yields an asymptotically efficient TMLE, the practical performance of the TMLE might be enhanced by tuning this estimator not only with respect to its performance in estimating $g_0$, but also with respect to how well the resulting TMLE fits $\psi_0$. Consider our running example. Suppose that among the components of $W$ there is a $W_j$ that is an almost perfect predictor of $A$ but has no effect on the outcome $Y$. Inclusion of such a covariate $W_j$ in the fit of $g_n$ makes sense if the sample size is very large and one tries to remove some residual confounding due to not adjusting for $W_j$, but in most finite samples adjustment for $W_j$ in $g_n$ will hurt the practical performance of the TMLE, and effort should be put in variables that are stronger confounders than $W_j$. We developed a method for building an estimator $g_n$ that uses as criterion the change in fit between the initial estimator of $\bar{Q}_0$ and the updated estimator (i.e., the TMLE) and thereby selects variables that result in the maximal increase in fit during the TMLE updating step. However, eventually, as sample size converges to infinity, all variables will be adjusted for, so that asymptotically the resulting TMLE is still efficient. This version of TMLE is called the collaborative TMLE, since it fits $g_n$ in collaboration with the initial estimator $\bar{Q}_n$ [2, 44–46, 58]. Finite sample simulations and data analyses have shown remarkably important finite sample gains of C-TMLE relative to TMLE (see the above references).

Cross-Validated TMLE. The asymptotic efficiency of TMLE relies on a so called Donsker class condition. For example, in our running example, it requires that $\bar{Q}_n$ and $g_n$ are not too erratic functions of $(A, W)$. This condition is not just theoretical, but one can observe its effects in finite samples by evaluating the TMLE when using a heavily overfitted initial estimator. This makes sense, since if we use an overfitted initial estimator, there is little reason to think that the $\epsilon_n$ that maximizes the fit of the update of the initial estimator along the least favorable parametric submodel will still do a good job. Instead, one should use an $\epsilon_n$ that maximizes an honest estimate of the fit of the resulting update of the initial estimator, as measured by the cross-validated empirical mean of the loss function. This insight results in a so called cross-validated TMLE, and we have proven that one can establish asymptotic linearity of this CV-TMLE without a Donsker class condition [2, 81]; thus the CV-TMLE is asymptotically linear under weak conditions compared to the TMLE.

Guaranteed Minimal Performance of TMLE. If the initial estimator $\bar{Q}_n$ is inconsistent, but $g_n$ is consistent, then the TMLE is still consistent for models and target parameters in which the efficient influence curve is double robust. However, there might be other estimators that will now asymptotically beat the TMLE, since the TMLE is not efficient anymore. The desire for estimators to have a guarantee to beat certain user supplied estimators was formulated and implemented for double robust estimating equation based estimators in [82]. Such a property can also be arranged within the TMLE framework by incorporating additional fluctuation parameters in its least favorable submodel through the initial estimator, so that the TMLE solves additional score equations that guarantee that it beats a user supplied estimator, even under heavy misspecification of the initial estimator [58, 68].

Targeted Selection of Initial Estimator in TMLE. In situations where it is unreasonable to expect that the initial estimator will be close to the true $\bar{Q}_0$, such as in randomized controlled trials in which the sample size is small, one may improve the efficiency of the TMLE by using a criterion for tuning the initial estimator that directly evaluates the efficiency of the resulting TMLE of $\psi_0$. This general insight was formulated as empirical efficiency maximization in [83] and further worked out in the TMLE context in Chapter 12 and the Appendix of [2].

Double Robust Inference. If the efficient influence curve is double robust, then the TMLE remains consistent if either $\bar{Q}_n$ or $g_n$ is consistent. However, if one uses a data adaptive consistent estimator of $g_0$ (and thus with bias larger than $1/\sqrt{n}$), and $\bar{Q}_n$ is inconsistent, then the bias of $g_n$ might directly map into a bias for the resulting TMLE of $\psi_0$ of the same order. As a consequence, the TMLE might have a bias with respect to $\psi_0$ that is larger than $1/\sqrt{n}$, so that it is not asymptotically linear. However, one can incorporate additional fluctuation parameters in the least favorable submodel (by also fluctuating $g_n$) to guarantee that the TMLE remains asymptotically linear with known influence curve when either $\bar{Q}_n$ or $g_n$ is inconsistent, but we do not know which one [84]. So these enhancements of TMLE result in TMLE that are asymptotically linear under weaker conditions than a standard TMLE, just like the CV-TMLE that removed a condition for asymptotic linearity. These TMLE now involve not only targeting $\bar{Q}_n$ but also targeting $g_n$ to guarantee that, when $\bar{Q}_n$ is misspecified, the required smooth function of $g_n$ will behave as a TMLE, and, when $g_n$ is misspecified, the required smooth functional of $\bar{Q}_n$ is still asymptotically linear. The same method was used to develop an IPTW estimator that targets the fit of $g_n$, so that the IPTW estimator is asymptotically linear with known influence curve even when $g_0$ is estimated with a highly data adaptive estimator.

Super-Learning Based on CV-TMLE of the Conditional Risk of a Candidate Estimator. The super-learner relies on a cross-validated estimate of the risk of a candidate estimator. The oracle inequalities for the cross-validation selector assumed that the cross-validated risk is simply an empirical mean over the validation sample of a loss function evaluated at the candidate estimator based on the training sample, averaged across different sample splits. We have generalized these results to loss functions that depend on an unknown nuisance parameter (which is thus estimated in the cross-validated risk).

For example, suppose that in our running example the treatment is continuous, and we are concerned with estimation of the causal dose-response curve, that is, the mean counterfactual outcome as a function of the treatment level. One might define the risk of a candidate dose-response curve as a mean squared error with respect to the true curve. However, this risk of a candidate curve is itself an unknown real-valued target parameter. Contrary to standard prediction or density estimation, this risk is not simply a mean of a known loss function, and the proposed unknown loss functions, indexed by a nuisance parameter, can have large values, making the cross-validated risk a nonrobust estimator. Therefore, we have proposed to estimate this conditional risk of a candidate curve with TMLE and, similarly, the conditional risk of a candidate estimator with a CV-TMLE. One can then develop a super-learner that uses the CV-TMLE as an estimate of the conditional risk of a candidate estimator [53, 85]. We applied this to construct a super-learner of the causal dose-response curve for a continuous valued treatment, and we obtained a corresponding oracle inequality for the performance of the cross-validation selector [53].
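
As a hedged illustration (the notation $m$, $m_0$, and $F$ is introduced here for exposition and need not match [53]): writing $m_0(a) = E_0 E_0(Y \mid A = a, W)$ for the true causal dose-response curve, the risk of a candidate curve $m$ could be taken as
\[
R_0(m) = \int \bigl( m(a) - m_0(a) \bigr)^2 \, dF(a),
\]
for some user-supplied weighting measure $F$ over treatment levels. Since $m_0$ is unknown, $R_0(m)$ is itself a real-valued target parameter of the data distribution, which is why it is estimated with a TMLE (and, for $m$ a candidate estimator rather than a fixed curve, with a CV-TMLE).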

6. Eye on the Future of Targeted Learning

We hope that the above clarifies that Targeted Learning is an ongoing, exciting research area that is able to address important practical challenges. Each new application concerning learning from data can be formulated in terms of a statistical estimation problem with a large statistical model and a target parameter. One can then use the general framework of super-learning and TMLE to develop efficient targeted substitution estimators and corresponding statistical inference. As is apparent from the previous section, the general structure of TMLE and super-learning appears to be flexible enough to handle and adapt to any new challenges that come up, allowing researchers in Targeted Learning to make important progress in tackling real world problems. By being honest in the formulation of the estimation problem, new challenges typically come up that ask for expert input from a variety of researchers, ranging from subject-matter scientists and computer scientists to statisticians. Targeted Learning requires multidisciplinary teams, since it asks for careful knowledge about the data-generating experiment and the questions of interest, possible informed guesses for estimation that can be incorporated as candidates in the library of the super-learner, and input from the state of the art in computer science to produce scalable software algorithms implementing the statistical procedures.

There are a variety of important areas of research in Targeted Learning that we have begun to explore.

Variance Estimation. The asymptotic variance of an estimator such as the TMLE, that is, the variance of the influence curve of the estimator, is just another target parameter of great interest. It is common practice to estimate this asymptotic variance with the empirical sample variance of the estimated influence curve values. However, in the context of sparsity, influence curves can be large, making such an estimator highly nonrobust. In particular, such a sample-mean-type estimator will not respect the global constraints of the model. Again, this is not just a theoretical issue, since we have observed that in sparse data situations standard estimators of the asymptotic variance often underestimate the variance of the estimator, thereby resulting in overly optimistic confidence intervals. This sparsity can be due to rare outcomes, strong confounding, or highly informative censoring, for example, and occurs naturally even when sample sizes are large. Careful inspection of these variance estimators shows that the essential problem is that they are not substitution estimators. Therefore, we are in the process of applying TMLE to improve the estimators of the asymptotic variance of the TMLE of a target parameter, thereby improving the finite sample coverage of our confidence intervals, especially in sparse-data situations.
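
For concreteness, the standard (nonrobust) procedure criticized here is simply a Wald interval built from the sample variance of the estimated influence curve; below is a minimal Python sketch, with our own function name and simulated influence curve values.

```python
import numpy as np
from scipy import stats

def wald_ci_from_ic(psi_hat, est_ic, level=0.95):
    """Wald-type confidence interval from estimated influence curve values.

    The asymptotic variance is estimated by the sample variance of the
    estimated influence curve divided by n. This estimator is not a
    substitution estimator and, under sparsity (large influence curve
    values), tends to underestimate the true variance.
    """
    n = len(est_ic)
    se = np.sqrt(np.var(est_ic, ddof=1) / n)
    z = stats.norm.ppf(1 - (1 - level) / 2)
    return psi_hat - z * se, psi_hat + z * se

# toy usage: heavy-tailed influence curve values mimic a sparse-data setting
rng = np.random.default_rng(1)
ic = rng.standard_t(df=3, size=1000)
print(wald_ci_from_ic(psi_hat=0.1, est_ic=ic))
```

The targeted alternative referred to in the text replaces this plug-in sample variance with a TMLE of the asymptotic variance, viewed as its own target parameter.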

Dependent Data. Contrary to experiments that involve random sampling from a target population, if one observes the real world over time, then naturally there is no way to argue that the experiment can be represented as a collection of independent experiments, let alone identical independent experiments. An environment over time and space is a single organism that cannot be separated out into independent units without making very artificial assumptions and losing very essential information: the world needs to be seen as a whole to see truth. Data collection in our societies is moving more and more towards measuring total populations over time, resulting in what we often refer to as Big Data, and these populations consist of interconnected units. Even in randomized controlled settings where one randomly samples units from a target population, one often likes to look at the past data and change the sampling design in response to the observed past, in order to optimize the data collection with respect to certain goals. Once again, this results in a sequence of experiments that cannot be viewed as independent experiments; the next experiment is only defined once one knows the data generated by the past experiments.

Therefore we believe that our models that assume independence, even though they are so much larger than the models used in current practice, are still not realistic models in many applications of interest. On the other hand, even when the sample size equals 1, things are not hopeless if one is willing to assume that the likelihood of the data factorizes in many factors due to conditional independence assumptions and stationarity assumptions that state that conditional distributions might be constant across time or that different units are subject to the same laws for generating their data as a function of their parent variables. In more recent research we have started to develop TMLE for statistical models that do not assume that the unit-specific data structures are independent, handling adaptive pair matching in community randomized controlled trials, group sequential adaptive randomization designs, and studies that collect data on units that are interconnected through a causal network [48].

Data Adaptive Target Parameters. It is common practice that people first look at the data before determining their choice of the target parameter they want to learn, even though it is taught that this is unacceptable practice since it makes the p values and confidence intervals unreliable. But maybe we should view this common practice as a sign that a priori specification of the target parameter (and null hypothesis) limits the learning from data too much, and that by enforcing it we only force data analysts to cheat. Current teaching would tell us that one is only allowed to do this by splitting the sample, using one part of the sample to generate a target parameter and the other part of the sample to estimate this target parameter and obtain confidence intervals. Clearly, this means that one has to sacrifice a lot of sample size for being allowed to look at the data first. Another possible approach for obtaining inference for a data driven parameter is to a priori formulate a large class of target parameters and use multiple testing or simultaneous confidence interval adjustments. However, with this approach one also has to pay a big price through the multiple testing adjustment, and one still needs to list the target parameters a priori.

For that purpose, acknowledging that one likes to mine the data to find interesting questions that are supported by the data, we developed statistical inference based on CV-TMLE for a large class of target parameters that are defined as functions of the data [86]. This allows one to define an algorithm that when applied to the data generates an interesting target parameter, while we provide formal statistical inference in terms of confidence intervals for this data adaptive target parameter. This provides a much broader class of a priori specified statistical analyses than current practice which requires a priori specification of the target parameter, while still providing valid statistical inference. We believe that this is a very promising direction for future research, opening up many new applications which would normally be overlooked.

Optimal Individualized Treatment. One is often interested in learning the best rule for treating a subject in response to certain time-dependent measurements on that subject, where best rule might be defined as the rule that optimizes the expected outcome. Such a rule is called an individualized treatment rule, or dynamic treatment regimen, and an optimal treatment rule is defined as the rule that minimizes the mean outcome for a certain outcome (e.g., indicator of death or other health measurement). We started to address data adaptive learning of the best possible treatment rule by developing super-learners of this important target parameter, while still providing statistical inference (and thus confidence intervals) for the mean of the outcome in the counterfactual world in which one applies this optimal dynamic treatment to everybody in the target population [87]. In particular, this problem itself provides a motivation for a data adaptive target parameter, namely, the mean outcome under a treatment rule fitted based on the data. Optimal dynamic treatment has been an important area in statistics and computer science, but we target this problem within the framework of Targeted Learning, thereby avoiding reliance on unrealistic assumptions that cannot be defended and will heavily affect the true optimality of the fitted rules.

Statistical Inference Based on Higher Order Inference. Another key assumption on which the asymptotic efficiency or asymptotic linearity of the TMLE relies is that the remainder (second-order term) is $o_P(1/\sqrt{n})$. For example, in our running example this means that the product of the rates at which the super-learner estimators of the outcome regression and the treatment mechanism converge to their targets converges to zero faster than $1/\sqrt{n}$. The density estimation literature proves that if the density is many times differentiable, then it is possible to construct density estimators whose bias is driven by the last term of a higher order Taylor expansion of the density around a point. Robins et al. [88] have developed theory based on higher order influence functions, under the assumption that the target parameter is higher order pathwise differentiable. Just as density estimators exploit underlying smoothness, this theory also aims to construct estimators of higher order pathwise differentiable target parameters whose bias is driven by the last term of the higher order Taylor expansion of the target parameter. The practical implementation of the proposed estimators has been challenging and suffers from a lack of robustness. Targeted Learning based on these higher order expansions (thus incorporating not only the first order efficient influence function but also the higher order influence functions that define the Taylor expansion of the target parameter) appears to be a natural area of future research to further build on these advances.
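
To make the condition concrete for the average treatment effect (notation introduced here for illustration: $\bar Q$ for the outcome regression, $g$ for the treatment mechanism, subscript $0$ for their true values, and the convention $\Psi(P) - \Psi(P_0) = -P_0 D^*(P) + R_2$), the second-order remainder takes the familiar double-robust form
\[
R_2(\bar Q, g) = \sum_{a \in \{0,1\}} (2a-1) \int \frac{g(a \mid w) - g_0(a \mid w)}{g(a \mid w)}\, \bigl(\bar Q(a, w) - \bar Q_0(a, w)\bigr)\, dP_0(w),
\]
so that, by the Cauchy-Schwarz inequality (and assuming $g$ is bounded away from zero), $|R_2|$ is bounded by a constant times the product of the $L^2(P_0)$ distances of the estimators to $\bar Q_0$ and $g_0$; the condition in the text is that this product is $o_P(1/\sqrt{n})$.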

Online TMLE: Trading Off Statistical Optimality and Computing Cost. We will be more and more confronted with online databases that continuously grow and are massive in size. Nonetheless, one wants to know if the new data changes the inference about the target parameters of interest, and one wants to know it right away. Recomputing the TMLE based on the old data augmented with the new chunk of data would be immensely computer intensive. Therefore, we are confronted with the challenge of constructing an estimator that is able to update a current estimate without having to recompute it from scratch; instead one wants to update it based on computations with the new data only. More generally, one is interested in high quality statistical procedures that are scalable. We have started research on such online TMLEs, which preserve all or most of the good properties of TMLE but can be continuously updated, where the number of computations required for the update is only a function of the size of the new chunk of data.

7. Historical Philosophical Perspective on Targeted Learning: A Reconciliation with Machine Learning

In the previous sections the main characteristics of the TMLE/SL methodology have been outlined. We introduced the most important fundamental ideas and statistical concepts, urged the need for a revision of current data-analytic practice, and showed some recent advances and application areas. Research in progress on such issues as dependent data and data adaptive target parameters has also been brought forward. In this section we put the methodology in a broader historical-philosophical perspective, trying to support the claim that its relevance exceeds the realms of statistics in a strict sense and even those of methodology. To this aim we will discuss both the significance of TMLE/SL for contemporary epistemology and its implications for the current debate on Big Data and the generally advocated, emerging new discipline of Data Science. Some of these issues have been elaborated more extensively in [3, 89–91], where we have put the present state of statistical data analysis in a historical and philosophical perspective with the purpose of clarifying, understanding, and accounting for the current situation in statistical data analysis and relating the main ideas underlying TMLE/SL to it.

First and foremost, it must be emphasized that rather than extending the toolkit of the data analyst, TMLE/SL establishes a new methodology. From a technical point of view it offers an integrative approach to data analysis or statistical learning by combining inferential statistics with techniques derived from the field of computational intelligence. This field includes such related and usually eloquently phrased disciplines as machine learning, data mining, knowledge discovery in databases, and algorithmic data analysis. From a conceptual or methodological point of view, it sheds new light on several stages of the research process, including such items as the research question, assumptions and background knowledge, modeling, and causal inference and validation, by anchoring these stages or elements of the research process in statistical theory. According to TMLE/SL all these elements should be related to or defined in terms of (properties of) the data generating distribution, and to this aim the methodology provides both clear heuristics and formal underpinnings. Among other things this means that the concept of a statistical model is reestablished in a prudent and parsimonious way, allowing humans to include only their true, realistic knowledge in the model. In addition, the scientific question and background knowledge are to be translated into a formal causal model and target causal parameter using the causal graphs and counterfactual (potential outcome) frameworks, including specifying a working marginal structural model. And, even more significantly, TMLE/SL reassigns to the very concept of estimation, canonical as it has always been in statistical inference, the leading role in any theory of or approach to learning from data, whether it deals with establishing causal relations, classifying or clustering, time series forecasting, or multiple testing. Indeed, inferential statistics arose against the background of randomness and variation in a world represented or encoded by probability distributions, and it has therefore always presumed and exploited the sample-population dualism, which underlies the very idea of estimation. Nevertheless, the whole concept of estimation seems to be discredited and disregarded in contemporary data analytical practice.

In fact, the current situation in data analysis is rather paradoxical and inconvenient. From a foundational perspective the field consists of several competing schools with sometimes incompatible principles, approaches, or viewpoints. Some of these can be traced back to Karl Pearson's goodness-of-fit approach to data analysis or to the Fisherian tradition of significance testing and ML estimation. Some principles and techniques have been derived from the Neyman-Pearson school of hypothesis testing, such as the comparison between two alternative hypotheses and the identification of two kinds of errors of usually unequal importance that should be dealt with. And, last but not least, the toolkit contains all kinds of ideas taken from the Bayesian paradigm, which rigorously pulls statistics into the realms of epistemology. We only have to refer here to the subjective interpretation of probability and the idea that hypotheses should be analyzed in a probabilistic way by assigning probabilities to these hypotheses, thus abandoning the idea that a parameter is a fixed, unknown quantity and thus moving the knowledge about the hypotheses from the meta-language into the object language of probability calculus. In spite of all this, the burgeoning statistical textbook market offers many primers and even advanced studies which wrongly suggest a uniform and united field with foundations that are fixed and on which full agreement has been reached. It offers a toolkit based on the alleged unification of ideas and methods derived from the aforementioned traditions. As pointed out in [3], this situation is rather inconvenient from a philosophical point of view for two related reasons.

First, nearly all scientific disciplines have experienced a probabilistic revolution since the late 19th century. Increasingly, key notions are probabilistic: research methods and entire theories are probabilistic, if not the underlying worldview itself; that is, they are all dominated by and rooted in probability theory and statistics. When the probabilistic revolution emerged in the late 19th century, this transition became recognizable in old, established sciences like physics (kinetic gas theory; the statistical mechanics of Boltzmann, Maxwell, and Gibbs), but especially in new emerging disciplines like the social sciences (Quetelet and later Durkheim), biology (evolution, genetics, zoology), agricultural science, and psychology. Biology even came to maturity due to its close interaction with statistics. Today, this trend has only strengthened further, and as a result there is a plethora of fields of application of statistics ranging from biostatistics, geostatistics, epidemiology, and econometrics to actuarial science, statistical finance, quality control, and operational research in industrial engineering and management science. Probabilistic approaches have also intruded into many branches of computer science; most noticeably they dominate artificial intelligence.

Secondly, at a more abstract level, probabilistic approaches also dominate epistemology, the branch of philosophy committed to classical questions on the relation between knowledge and reality like: What is reality? Does it exist independently of the mind? Do we have access to it? If yes, how? Do our postulated theoretical entities exist? How do they correspond to reality? Can we make true statements about it? If yes, what is truth and how is it connected to reality? The analyses conducted to address these issues are usually intrinsically probabilistic. As a result these approaches dominate key issues and controversies in epistemology such as the scientific realism debate, the structure of scientific theories, Bayesian confirmation theory, causality, models of explanation, and natural laws. All too often scientific reasoning seems nearly synonymous with probabilistic reasoning. In view of the fact that scientific inference more and more depends on probabilistic reasoning and that statistical analysis is not as well-founded as might be expected, the issue addressed here is of crucial importance for epistemology [3].

Despite these philosophical objections against the hybrid character of inferential statistics, its successes were enormous in the first decades of the twentieth century. In newly established disciplines like psychology and economics, significance testing and maximum likelihood estimation were applied with methodological rigor in order to enhance prestige and apply scientific method to these fields. Although criticism emerged from the start of the application of statistics that a mere chasing of low p values and a naive use of parametric statistics did not do justice to the specific characteristics of the sciences involved, the success story was immense. However, this rise of the inference experts, as Gigerenzer calls them in The Rise of Statistical Thinking, was just a phase or stage in the development of statistics and data analysis, which manifests itself as a Hegelian triptych that unmistakably is now being completed in the era of Big Data. After this thesis of a successful but not unified field of inferential statistics, an antithesis in the Hegelian sense of the word was unavoidable, and it was this antithesis that gave rise to the current situation in data-analytical practice as well. Apart from the already mentioned Bayesian revolt, the rise of nonparametric statistics in the thirties must be mentioned here as an intrinsically statistical criticism that heralds this antithesis. The major caesura in this process, however, was the work of John Tukey in the sixties and seventies of the previous century. After a long career in statistics and other mathematical disciplines, Tukey wrote Exploratory Data Analysis in 1977. This study is in many ways a remarkable, unorthodox book. First, it contains no axioms, theorems, lemmas, or proofs, and barely even formulas. There are no theoretical distributions, significance tests, p values, hypothesis tests, parameter estimation, or confidence intervals. No inferential or confirmatory statistics, but just the understanding of data, looking for patterns, relationships, and structures in data, and visualizing the results. According to Tukey the statistician is a detective; as a contemporary Sherlock Holmes he must search for signs and “clues.” Tukey maintains this metaphor consistently throughout the book and wants to provide the data analyst with a toolbox full of methods for understanding frequency distributions, smoothing techniques, scale transformations, and, above all, many graphical techniques for exploration, storage, and summary illustrations of data. The unorthodox approach of Tukey in EDA reveals not so much a contrarian spirit as a fundamental dissatisfaction with the prevailing statistical practice and the underlying paradigm of inferential/confirmatory statistics [90].

In EDA Tukey endeavors to emphasize the importance of confirmatory, classical statistics, but for the main part this looks like a matter of politeness and courtesy. In fact, he had already put his cards on the table in 1962 in the famous opening passage of The Future of Data Analysis: “for a long time I have thought that I was a statistician, interested in inferences from the particular to the general. But as I have watched mathematical statistics evolve, I have had cause to wonder and to doubt. And when I have pondered about why such techniques as the spectrum analysis of time series have proved so useful, it has become clear that their “dealing with fluctuations” aspects are, in many circumstances, of lesser importance than the aspects that would already have been required to deal effectively with the simpler case of very extensive data where fluctuations would no longer be a problem. All in all, I have come to feel that my central interest is in data analysis, which I take to include, among other things, procedures for analyzing data, techniques for interpreting the results of such procedures, ways of planning the gathering of data to make its analysis easier, more precise or more accurate, and all the machinery and results of mathematical statistics which apply to analyzing data. … Data analysis is a larger and more varied field than inference, or allocation.” Also in other writings Tukey makes a sharp distinction between statistics and data analysis.

First, Tukey gave an unmistakable impulse to the emancipation of the descriptive/visual approach, after the pioneering work of William Playfair (18th century) and Florence Nightingale (19th century) on graphical techniques, which had soon been overshadowed by the “inferential” coup that marked the probabilistic revolution. Furthermore, it is somewhat ironic that many consider Tukey a pioneer of computational fields such as data mining and machine learning, although he himself preferred a small role for the computer in his analysis and kept it in the background. More importantly, however, because of his alleged antitheoretical stance, Tukey is sometimes considered the man who tried to reverse or undo the Fisherian revolution and an exponent or forerunner of today’s erosion of models: the view that all models are wrong, that the classical notion of truth is obsolete, and that pragmatic criteria such as predictive success in data analysis must prevail. The idea, currently frequently uttered in the data analytical tradition, that the presence of Big Data will make much of the statistical machinery superfluous is also an important aspect of the here very briefly sketched antithesis. Before we come to the intended synthesis, the final stage of the Hegelian triptych, let us make two remarks concerning Tukey’s heritage. Although it almost sounds like a cliché, it must be noted that EDA techniques nowadays are routinely applied in all statistical packages, along with in themselves sometimes hybrid inferential methods. In current empirical methodology, EDA is integrated with inferential statistics at different stages of the research process. Secondly, it could be argued that Tukey did not so much undermine the revolution initiated by Galton and Pearson as understand its ultimate consequences. It was Galton who had shown that variation and change are intrinsic in nature and that we have to look for the deviant, the special, or the peculiar. It was Pearson who realized that the constraints of the normal distribution (Laplace, Quetelet) had to be abandoned and who distinguished different families of distributions as an alternative. Galton’s heritage came only somewhat under pressure from the successes of parametric Fisherian statistics based on strong model assumptions, and it could well be stated that it was partially reinstated by Tukey.

Unsurprisingly, the final stage of the Hegelian triptych strives for some convergence, if not synthesis. The 19th century dialectical German philosopher G.W.F. Hegel argued that history is a process of becoming or development, in which a thesis evokes and binds itself to an antithesis; in addition, both are placed at a higher level to be completed and to result in a fulfilling synthesis. Applied to the less metaphysically oriented present problem, this dialectical principle seems particularly relevant in the era of Big Data, which makes a reconciliation between inferential statistics and computational science imperative. Big Data sets high demands and offers challenges to both. For example, it sets high standards for data management, storage, and retrieval and has great influence on research into the efficiency of machine learning algorithms. But it is also accompanied by new problems, pitfalls, and challenges for statistical inference and its underlying mathematical theory. Examples include the effects of wrongly specified models, the problems of small, high-dimensional datasets (e.g., microarray data), the search for causal relationships in nonexperimental data, quantifying uncertainty, efficiency theory, and so on. The fact that many data-intensive empirical sciences are highly dependent on machine learning algorithms and statistics makes bridging the gap compelling for practical reasons as well.

In addition, it seems that Big Data itself also transforms the nature of knowledge: the way of acquiring knowledge, research methodology, and the nature and status of models and theories. In reflections on all of this, the briefly sketched contradiction often emerges, and in the popular literature the differences are usually exaggerated, leading to an annexation of Big Data by one of the two disciplines.

Of course the gap between the two has many aspects, both philosophical and technical, that have been left out here. However, it must be emphasized that for the main part Targeted Learning intends to support the reconciliation between inferential statistics and computational intelligence. It starts with the specification of a nonparametric or semiparametric model that contains only realistic background knowledge and focuses on the parameter of interest, which is considered as a property of the as yet unknown, true data-generating distribution. From a methodological point of view it is a clear imperative that model and parameter of interest must be specified in advance. The (empirical) research question must be translated in terms of the parameter of interest, and a rehabilitation of the concept of a model is achieved. Targeted Learning then involves a flexible, data-adaptive estimation procedure that proceeds in two steps. First, an initial estimate is sought of the relevant part of the true distribution that is needed to evaluate the target parameter. This initial estimator is found by means of the super learning algorithm. In short, this is based on a library of many diverse analytical techniques ranging from logistic regression to ensemble techniques, random forests, and support vector machines. Because the choice of one of these techniques by human intervention is highly subjective and the variation in the results of the various techniques is usually substantial, SL uses a weighted combination of the candidate fits, with weights calculated by means of cross-validation. Based on this initial estimator, the second stage of the estimation procedure can be initiated. The initial fit is updated with the goal of an optimal bias-variance trade-off for the parameter of interest. This is accomplished with a targeted maximum likelihood estimator of the fluctuation parameter of a parametric submodel selected by the initial estimator. The statistical inference is then completed by calculating standard errors on the basis of “influence-curve theory” or resampling techniques. This parameter estimation retains a crucial place in the data analysis. If one wants to do justice to variation and change in the phenomena, then one cannot deny Fisher’s unshakable insight that randomness is intrinsic and implies that the estimator of the parameter of interest itself has a distribution. Thus Fisher proved himself to be a dualist in making the explicit distinction between sample and population. Neither Big Data nor full census research nor any other attempt to take into account the whole of reality, or a world encoded or encrypted in data, can compensate for it. Although many aspects have remained undiscussed in this contribution, we hope to have shown that TMLE/SL contributes to the intended reconciliation between inferential statistics and computational science and that both, rather than being in contradiction, should be integrating parts in any concept of Data Science.
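
The two-stage procedure just described can be illustrated with a minimal Python sketch for the average treatment effect with a binary outcome and binary treatment. Everything in it is a simplifying assumption of ours: plain logistic regressions stand in for the super learner library, the truncation constants are arbitrary, and the function name tmle_ate is hypothetical; it is a sketch of the general recipe, not the authors' software.

```python
import numpy as np
import statsmodels.api as sm
from scipy.special import expit, logit
from sklearn.linear_model import LogisticRegression

def tmle_ate(W, A, Y):
    """Sketch of a two-stage TMLE of the average treatment effect (binary Y, A)."""
    n = len(Y)
    # Stage 1: initial estimators (stand-ins for the super learner)
    q_fit = LogisticRegression(max_iter=1000).fit(np.column_stack([A, W]), Y)
    g_fit = LogisticRegression(max_iter=1000).fit(W, A)
    g1 = np.clip(g_fit.predict_proba(W)[:, 1], 0.025, 0.975)        # P(A=1 | W)
    Q_A = q_fit.predict_proba(np.column_stack([A, W]))[:, 1]        # Qbar(A, W)
    Q_1 = q_fit.predict_proba(np.column_stack([np.ones(n), W]))[:, 1]
    Q_0 = q_fit.predict_proba(np.column_stack([np.zeros(n), W]))[:, 1]

    # Stage 2: one-dimensional logistic fluctuation along a least favorable
    # submodel, with the "clever covariate" H as regressor and logit(Qbar) as offset
    H = A / g1 - (1 - A) / (1 - g1)
    clip = lambda p: np.clip(p, 1e-6, 1 - 1e-6)
    eps = sm.GLM(Y, H[:, None], family=sm.families.Binomial(),
                 offset=logit(clip(Q_A))).fit().params[0]
    Q1_star = expit(logit(clip(Q_1)) + eps / g1)
    Q0_star = expit(logit(clip(Q_0)) - eps / (1 - g1))

    # Plug-in (substitution) estimate and influence-curve-based inference
    psi = np.mean(Q1_star - Q0_star)
    QA_star = np.where(A == 1, Q1_star, Q0_star)
    ic = H * (Y - QA_star) + Q1_star - Q0_star - psi
    se = np.sqrt(np.var(ic, ddof=1) / n)
    return psi, (psi - 1.96 * se, psi + 1.96 * se)

# toy usage with simulated confounded data
rng = np.random.default_rng(0)
W = rng.normal(size=(2000, 2))
A = rng.binomial(1, expit(0.8 * W[:, 0]))
Y = rng.binomial(1, expit(W[:, 0] + 0.5 * A - 0.3))
print(tmle_ate(W, A, Y))
```

The fluctuation step is the one-dimensional logistic regression with the clever covariate as regressor and the logit of the initial fit as offset; the standard error comes from the sample variance of the estimated influence curve, one of the two routes to inference mentioned above.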

8. Concluding Remark: Targeted Learning and Big Data

The expansion of available data has resulted in a new field often referred to as Big Data. Some advocate that Big Data changes the perspective on statistics: for example, since we measure everything, why do we still need statistics? Clearly, Big Data refers to measuring (possibly very) high dimensional data on a very large number of units. The truth is that there will never be so much data that careful design of studies and careful interpretation of data are no longer needed.

To start with, lots of bad data are useless, so one will need to respect the experiment that generated the data in order to carefully define the target parameter and its interpretation, and design of experiments is as important as ever so that the target parameters of interest can actually be learned.

Even though the standard error of a simple sample mean might be so small that there is no need for confidence intervals, one is often interested in much more complex statistical target parameters. For example, consider the average treatment effect of our running example, which is not a very complex parameter relative to many other parameters of interest, such as an optimal individualized treatment rule. Evaluation of the average treatment effect based on a sample (i.e., the substitution estimator obtained by plugging in the empirical distribution of the sample) would require computing the mean outcome for each possible stratum of treatment and covariates. Even with an enormous number of observations, most of these strata will be empty for reasonable dimensions of the covariates, so that this pure empirical estimator is not defined. As a consequence, we will need smoothing (i.e., super learning), and really, we will also need Targeted Learning for unbiased estimation and valid statistical inference.
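
A toy calculation (with numbers chosen by us purely for illustration) of why the pure plug-in estimator breaks down:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 1_000_000, 25                        # a large sample, 25 binary covariates
X = rng.integers(0, 2, size=(n, d + 1))     # one treatment column + d covariates
n_strata = 2 ** (d + 1)                     # number of treatment-covariate strata
occupied = len({tuple(row) for row in X})   # strata containing at least one observation
print(f"{occupied:,} of {n_strata:,} strata occupied "
      f"({100 * occupied / n_strata:.3f}%)")
# With 2**26 (about 67 million) strata and only 10**6 observations, at most
# roughly 1.5% of strata can be nonempty, so the stratum-specific mean outcome
# is undefined almost everywhere and smoothing (super learning) is unavoidable.
```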

Targeted Learning was developed in response to high dimensional data, for which reasonably sized parametric models are simply impossible to formulate and are immensely biased anyway. The high dimension of the data only emphasizes the need for realistic (and thereby large semiparametric) models, target parameters defined as features of the data distribution instead of coefficients in these parametric models, and Targeted Learning.

The massive dimension of the data does make it appealing not to be restricted by an a priori specification of the target parameters of interest, so that Targeted Learning of data adaptive target parameters, as discussed above, is a particularly important future area of research, providing important additional flexibility without giving up on statistical inference.

One possible consequence of the building of large databases that collect data on total populations is that the data might correspond with observing a single process, like a community of individuals over time, in which case one cannot assume that the data are the realization of a collection of independent experiments, the typical assumption most statistical methods rely upon. That is, data cannot be represented as random samples from some target population since we sample all units of the target population. In these cases, it is important to document the connections between the units so that one can pose statistical models that rely on a variety of conditional independence assumptions, as in the causal inference for networks developed in [8]. That is, we need Targeted Learning for dependent data whose data distribution is modeled through realistic conditional independence assumptions.

Such statistical models do not allow for statistical inference based on simple methods such as the bootstrap (i.e., the sample size is 1), so that asymptotic theory for estimators based on influence curves, together with state of the art advances in weak convergence theory, is more crucial than ever. That is, the state of the art in probability theory will only become more important in this new era of Big Data. Specifically, one will need to establish convergence in distribution of standardized estimators in these settings, in which the data correspond with the realization of one gigantic random variable for which the statistical model assumes a lot of structure in terms of conditional independence assumptions.

Of course, Targeted Learning with Big Data will require the programming of scalable algorithms, putting fundamental constraints on the type of super-learners and TMLE.

Clearly, Big Data does require integration of different disciplines, fully respecting the advances made in the different fields such as computer science, statistics, probability theory, and scientific knowledge that allows us to reduce the size of the statistical model and to target the relevant target parameters. Funding agencies need to recognize this so that money can be spent in the best possible way: the best possible way is not to give up on theoretical advances, but the theory has to be relevant to address the real challenges that come with real data. The Biggest Mistake we can make in this Big Data Era is to give up on deep statistical and probabilistic reasoning and theory and corresponding education of our next generations and somehow think that it is just a matter of applying algorithms to data.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

The authors thank the reviewers for their very helpful comments which improved the paper substantially. This research was supported by an NIH Grant 2R01AI074345.