Abstract

Peer-to-Peer (P2P) lending has attracted increasing attention recently. As an emerging micro-finance platform, P2P lending removes intermediaries, reduces transaction costs, and increases the benefits of both borrowers and lenders. However, P2P lending investment faces two major challenges: the lack of historical observations of loans for a given borrower and the ambiguity of the estimated loan return distribution. To address these difficulties, this paper proposes a data-driven robust portfolio optimization model with relative entropy constraints based on an “instance-based” credit risk assessment framework. The model exploits a nonparametric kernel approach to estimate P2P loans’ expected return and risk when the historical data of the same borrower are unavailable. Furthermore, we construct a robust mean–variance optimization problem based on the relative entropy method for P2P loan investment decisions. We validate the proposed model using a real-world dataset from a notable P2P lending platform, Prosper. Empirical results show that our model provides better investment performance than the existing model.

1. Introduction

Peer-to-peer lending, as an emerging form of online micro-finance, provides services that bring borrowers and lenders together virtually and help them lend to and borrow from each other directly. P2P lending platforms remove traditional financial intermediaries, reduce transaction costs, and increase the benefits of both borrowers and lenders; therefore, they improve the efficiency of the financial market. However, traditional financial intermediaries can use collateral, certified accounts, and other means to enhance the creditworthiness of borrowers. In their absence, severe information asymmetry exists between borrowers and lenders, and the credit risk of P2P loan investment is very high.

Credit risk of P2P lending refers to the potential monetary loss arising from a borrower's default on a loan. Efficient and reasonable investment in P2P loans needs to be based on a reliable assessment of the credit risk distribution. Estimating the credit risk distribution of P2P loans is very challenging because it is difficult to obtain historical return (or loss) data for a loan awaiting investment. In other words, the historical yield data about the same borrower are usually unavailable. Moreover, even when the distribution of loans’ returns (or losses) is approximated from the limited available data or from expert knowledge, the approximation is usually not accurate; this is known as the distribution ambiguity (probability measure uncertainty) problem. In this paper, we formulate a data-driven robust portfolio optimization model based on an “instance-based” credit risk assessment method for investment decisions in P2P lending.

To help personal lenders mitigate the risk, current online P2P lending platforms have taken some risk-reducing measures, such as filtering out high-risk borrowers whose FICO scores are below a threshold, assigning a preliminary rating to each loan, and providing investors with the risk level of each loan. Thus, each loan is assigned a grade, such as AA, A, B, C, D, E, or NR, and loans with the same grade are considered to have the same risk level. These rating-based models are more suitable for traditional banks and lending institutions, since they can grant large numbers of loans to diversify their investments. Individual investors, however, possess only small amounts of funds, so they need more refined risk assessment methods and investment strategies.

Similar to bond investment, P2P investors can fund a portion, not the whole, of each loan. Therefore, investors can decide which loans to invest in and, meanwhile, determine the amount to invest in each loan. This mechanism allows investors to construct a credit portfolio to mitigate risk.

Markowitz [1] proposed the famous mean-variance model, which is still widely used in portfolio selection and risk management. Since then, researchers have proposed a variety of mean-risk models, such as the mean-downside risk model [2], the mean-VaR model [3], the mean-CVaR model [4], and so on. In practice, the distribution of the assets needs to be estimated first, and then the optimal portfolio can be identified by the optimization model.

For P2P lending investment, as mentioned above, such procedures face at least two major challenges, i.e., the deficiency of loans’ historical observations and the ambiguity problem of estimated loans’ distribution (probability measure uncertainty problem). Thus, this paper proposes a data-driven robust model of portfolio optimization based on relative entropy constraints combined with an instance-based credit risk assessment method.

Specifically, we use the “instance-based” credit risk assessment method proposed by Guo et al. [5] to evaluate the return and risk of each loan without sufficient historical data of loans for each individual borrower. In this instance-based framework, the expected return of each loan is predicted as a weighted average of historical loans of other similar borrowers, where the optimal weights are learnt based on kernel regression. Furthermore, using the moment information (mean and variance) of the new loans, we formulate the robust portfolio optimization model with relative entropy constraints, which could obtain an optimal portfolio under the worst scenario and has the ability of reducing the potential loss caused by the uncertainty of loans distribution.

Our work is related to the papers by Guo et al. [5] and Yam et al. [6]. Guo et al. [5] introduce the instance-based framework into the credit risk assessment of P2P loans and use the classical mean-variance model to obtain the optimal allocation. Yam et al. [6] derive a robust mean-variance optimization model with relative entropy constraints on the uncertainty of the interaction between the returns of different assets and discuss its mathematical and financial properties in portfolio selection. Although other scholars have contributed novel insights into credit risk assessment of P2P lending and into robust optimization, to the best of our knowledge, few have considered the two jointly. The main contribution of this paper is that we propose a data-driven robust portfolio optimization model based on relative entropy constraints combined with an instance-based risk assessment framework for P2P loan investment and obtain superior performance in numerical experiments.

The rest of this paper is organized as follows. Section 2 provides the literature review. Section 3 introduces the instance-based model for credit risk assessment, as well as the mathematical framework of the kernel regression approach. In Section 4, we elaborate the robust optimization model based on the relative entropy method and formulate a robust mean-variance optimization model for P2P lending investment. The empirical results on the effectiveness of our model are reported in Section 5. Finally, Section 6 concludes this work.

2. Literature Review

Researchers have conducted many studies to assess risk and assist investment decision making in P2P lending. Emekter et al. [7] explore the dominant factors that explain funding success and credit risk and, meanwhile, measure the performance of P2P loans. They find that credit grade, debt-to-income ratio, FICO score, and revolving line utilization play an important role in loan defaults. Furthermore, loans with lower credit grade and longer duration may have a higher mortality rate, and the higher interest rates charged to low-credit-grade borrowers are not sufficient to cover the potential loss from the higher likelihood of default. Thus, the authors suggest that investors should invest more in high-grade loans. Similarly, Berkovich’s [8] study finds that high-quality loans offer excess return.

The above studies investigate the factors determining credit risk and analyze the performance of P2P loans; however, they do not propose a mechanism that assists individual investors in allocating funds across loans effectively and making optimal investment decisions.

To help personal lenders mitigate the risk, popular online P2P platforms, like Lending Club and Prosper, have developed credit scoring systems to assess the creditworthiness of each borrower based on data mining or machine learning techniques. There is a large body of literature concerned with credit rating using data mining techniques, for example, linear discriminant analysis (LDA) [9], k-nearest neighbors [10], logistic regression [11], classification and regression trees (CART) [12], Markov chains [13], survival analysis [14], artificial neural networks (ANN) [15], genetic methods [16], support vector machines (SVM) [17, 18], lasso-probit [19], and so on.

In the portfolio selection problem, full knowledge of the assets’ distribution is usually assumed to determine the optimal portfolio. In most real-life applications, we need to approximate the assets’ distribution. However, the approximations are not necessarily accurate, and it is known as the distribution ambiguity (probability measure uncertainty) problem.

Robust optimization is an attractive way to solve the portfolio selection problem under distribution ambiguity. As the exact parameters are unavailable, Natarajan et al. [20] use a set of parameters (which represent different distributions or scenarios) rather than a point estimate of the parameters to formulate the asset allocation problem. Following this idea, there are different ways to model ambiguity by using a set of parameters. Chen et al. [21] take the lower partial moments and CVaR as two risk measures and consider a tight bound that is likely to cover the possible parameters. Epstein [22] considers intervals that may include the actual parameters. Natarajan et al. [23] use a piecewise-linear concave utility function to derive accurate and estimated optimal strategies for the expected utility model in portfolio optimization under worst-case scenarios. Paç and Pinar [24] use an ellipsoidal uncertainty set to represent the distribution ambiguity and identify the optimal portfolio.

Since relative entropy can measure the difference between two probability distributions (probability measures), it can be used to construct the uncertainty set for robust optimization. In the studies of Hansen and Sargent [25] and Calafiore [26], relative entropy is used to model uncertainty and obtain the optimal investment decision. Yam et al. [6] derive a robust mean-variance optimization model with relative entropy constraints on the uncertainty of the interaction between the returns of different assets and discuss its mathematical and financial properties in portfolio selection.

In recent years, data-driven methods have been studied extensively. In this framework, it is assumed that investors possess only the historical data of asset returns. Bertsimas et al. [27] use the KS test, the χ2 test, the Anderson-Darling test, and some other testing tools to construct uncertainty sets and take the worst case of each set to formulate the robust optimization. They assume that the uncertainty sets are defined by certain structures and sizes based on the data points available. In contrast, the structure of the uncertainty set in our study is not predefined, and we consider the uncertainty of the mean, covariance, and distribution jointly. Kang et al. [28] propose a data-driven robust mean-CVaR portfolio selection model under distribution ambiguity and adopt a nonparametric bootstrap approach to calibrate the levels of ambiguity. Their work is based on the mean-CVaR framework with data of stock indices, while our work is based on the mean-variance framework with data of P2P loans.

3. Instance-Based Model for Credit Risk Assessment

Using historical data to evaluate future performance and potential loss is conventional. However, unlike investment in bonds or stocks, the historical yield data about the same P2P borrower are usually unavailable. Thus, the risk assessment of a new loan is very challenging. In this section, we briefly introduce the instance-based credit risk assessment model proposed by Guo et al. [5].

3.1. Instance-Based Assessment Framework

In this instance-based assessment framework, the expected return of each loan is estimated as a weighted average of historical observations of other borrowers’ closed loans. Specifically, for a new loan i, using n past loans, each with a historical return $R_j$ (j = 1, 2, ..., n), we can calculate the expected return of loan i, $\mu_i$, as a weighted average of the past loans’ actual returns:

$$\mu_i = \sum_{j=1}^{n} w_{ij} R_j, \tag{1}$$

where $w_{ij}$ denotes the weight of loan j for predicting the expected return of loan i. The weight depends on the similarity between loan i and loan j. Intuitively, the greater the similarity, the greater the weight. The calculation of the weight is introduced in Section 3.2.

The weighted returns of the past loans are treated as historical observations of the new loan. Following this line of thought and taking variance as the risk measure, the weighted variance of past loans is used to assess the new loan’s risk, that is,

$$\sigma_i^2 = \sum_{j=1}^{n} w_{ij}\,(R_j - \mu_i)^2, \tag{2}$$

where $w_{ij}$, $R_j$, and $\mu_i$ have the same meanings as in (1).

The absolute deviation between two loans’ default probabilities is used to measure the similarity; the smaller the absolute deviation, the greater the similarity, and, therefore, the larger the weight. In particular, the absolute deviation of default probabilities between loans i and j is defined as $d_{ij} = |p_i - p_j|$, where $p_i$ and $p_j$ are the default probabilities of loans i and j, respectively. Kernel regression is exploited to investigate the nonlinear relationship between the absolute deviation and the weight. This process is introduced in the next subsection.

3.2. Kernel Regression of Return and Risk

Kernel regression is a nonparametric statistical method, based on kernel density estimation, for investigating the nonlinear relation between random variables. We first introduce the preliminaries of kernel estimation.

Given n realizations $z_j$, j = 1, ..., n, of a random variable z, the kernel estimate of the probability density function p(z) is defined by

$$\hat{p}(z) = \frac{1}{nh}\sum_{j=1}^{n} K\!\left(\frac{z - z_j}{h}\right), \tag{3}$$

where K(·) is a kernel function and h is a smoothing parameter.

The kernel function K(·) is nonnegative and bounded and, meanwhile, satisfies the following properties:

(a) $\int K(u)\,du = 1$; (b) $\int u\,K(u)\,du = 0$; (c) $0 < \int u^2 K(u)\,du < \infty$.

There is a range of commonly used kernel functions, such as uniform, triangular, biweight, triweight, and Gaussian [29]. Because the kernel estimate is insensitive to the choice of kernel function, we use the Gaussian kernel for its convenient mathematical properties, which is written as $K(u) = \frac{1}{\sqrt{2\pi}}\,e^{-u^2/2}$.

The smoothing parameter h = h(n), also called the bandwidth, depends on the sample size n. Specifically, h(n) decreases to 0 while n·h(n) tends to ∞ as n tends to ∞.

Many studies show that the choice of kernel function does not affect the estimate significantly, whereas the choice of the bandwidth is vital [30, 31]. The determination of the bandwidth is shown in detail in Section 5.3.

In the following, we introduce the kernel regression model proposed by Nadaraya [32]. Theoretically, we assume that each observation is denoted as (X, Y), an $\mathbb{R}^2$-valued random vector. With the sample set $\{(x_j, y_j),\ j = 1, 2, \ldots, n\}$, the kernel estimator of the target y given its predictive observation x is defined as

$$\hat{y}(x) = \frac{\sum_{j=1}^{n} K\!\left(\frac{x - x_j}{h}\right) y_j}{\sum_{j=1}^{n} K\!\left(\frac{x - x_j}{h}\right)}, \tag{4}$$

where K(·) is a kernel function and h is the bandwidth.

For the instance-based credit risk modeling, the set of historical observations is represented as $\{(p_j, R_j),\ j = 1, 2, \ldots, n\}$, where $p_j$ and $R_j$ are the default probability and return rate of the jth loan, respectively. Thereby, the estimate of the ith loan’s return can be written as

$$\mu_i = \frac{\sum_{j=1}^{n} K\!\left(\frac{p_i - p_j}{h}\right) R_j}{\sum_{j=1}^{n} K\!\left(\frac{p_i - p_j}{h}\right)}. \tag{5}$$

Note that the determination of the loans’ default probabilities is introduced in Section 5.1.

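Comparing (1) with (5), we can represent the optimal weight as

$$w_{ij} = \frac{K\!\left(\frac{p_i - p_j}{h}\right)}{\sum_{k=1}^{n} K\!\left(\frac{p_i - p_k}{h}\right)}. \tag{6}$$

Using the optimal weight and the expected return derived from (5), (2) can be rewritten as

$$\sigma_i^2 = \frac{\sum_{j=1}^{n} K\!\left(\frac{p_i - p_j}{h}\right)(R_j - \mu_i)^2}{\sum_{j=1}^{n} K\!\left(\frac{p_i - p_j}{h}\right)}. \tag{7}$$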
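The estimation described in this section can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: it assumes a Gaussian kernel applied to the difference in default probabilities, and the function names and toy numbers are hypothetical.

```python
import math

def gaussian_kernel(u):
    """Standard Gaussian kernel K(u) = exp(-u^2 / 2) / sqrt(2 * pi)."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def kernel_weights(p_new, p_hist, h):
    """Normalized kernel weights of the historical loans for one new loan."""
    raw = [gaussian_kernel((p_new - p_j) / h) for p_j in p_hist]
    total = sum(raw)  # assumed > 0; a tiny h can underflow every term to 0
    return [k / total for k in raw]

def instance_based_estimate(p_new, p_hist, r_hist, h):
    """Kernel-weighted mean and variance of past returns for a new loan."""
    w = kernel_weights(p_new, p_hist, h)
    mu = sum(w_j * r_j for w_j, r_j in zip(w, r_hist))
    var = sum(w_j * (r_j - mu) ** 2 for w_j, r_j in zip(w, r_hist))
    return mu, var

# Hypothetical toy data: default probabilities and realized returns of past loans.
p_hist = [0.05, 0.10, 0.30, 0.60]
r_hist = [0.08, 0.06, -0.20, -0.70]
weights = kernel_weights(0.08, p_hist, h=0.05)
mu, var = instance_based_estimate(0.08, p_hist, r_hist, h=0.05)
```

With these toy numbers, the two past loans whose default probabilities are closest to 0.08 receive almost all of the weight, so the estimated return lies between their realized returns.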

4. Robust Investment Decision Model

Similar to bond investment, P2P lenders can invest in a portion of each loan. Thus, P2P loan investment decisions can be transformed into a credit portfolio optimization problem. This section introduces the portfolio optimization model for investment decisions in P2P lending, which accounts for the uncertainty of the distribution of the loans. We proceed from the classical mean-variance optimization model proposed by Markowitz [1] to its tractable robust counterpart.

4.1. Robust Optimization Model Based on Relative Entropy Constraints

In the classical mean-variance optimization model, the optimal asset allocation strategy is identified by solving the tradeoff between risk and return according to investors’ risk preference. A portfolio that invests in n assets is represented as a vector of weights, $\lambda \in \mathbb{R}^n$, where each weight denotes the proportion of wealth allocated to an asset. Then the return and risk of the portfolio become $\lambda^{T}\mu$ and $\lambda^{T}V\lambda$, respectively, where $\mu \in \mathbb{R}^n$ and $V \in \mathbb{R}^{n\times n}$ are the expected return and the covariance matrix of the assets’ returns under the probability measure (or probability distribution) P, respectively. Here, P represents the ideal estimated market condition where μ and V, estimated by using all available information, including historical observations, news, expert knowledge, and so on, are assumed to be the actual expected return and covariance matrix. Thus, the classical mean-variance portfolio selection problem (MV) can be formulated as

$$\min_{\lambda \in \Omega}\ \lambda^{T} V \lambda \quad \text{s.t.} \quad \lambda^{T}\mu \ge r_0, \tag{8}$$

where $\Omega \subseteq \mathbb{R}^n$ denotes the set of feasible portfolios and $r_0$ is the required return rate specified by the investor.

In reality, the assumption that the expected return μ and covariance matrix V are known with certainty is unrealistic. It is quite possible that the estimated parameters differ from the actual ones. Thus, the optimal portfolio identified by using the estimated input parameters μ and V directly may be inappropriate. Robust optimization seeks portfolios that are insensitive to the uncertainty in the parameters, with solutions that remain feasible whatever the actual values of the parameters are.

Investors might consider a set of probability measures, i.e., an uncertainty set, to cover a range of scenarios based on their assessments, and then use robust optimization to obtain approximately optimal strategies for the worst scenarios within the uncertainty set. In this paper, we define $\mathcal{Q}$ as the set of probability measures representing the possible scenarios, and $\mu_Q$ and $V_Q$ as the expected return and covariance matrix estimated under the probability measure $Q \in \mathcal{Q}$. Mathematically, the robust counterpart of the classical mean-variance optimization problem (RMV) can be written as

$$\min_{\lambda \in \Omega}\ \max_{Q \in \mathcal{Q}}\ \lambda^{T} V_Q \lambda \quad \text{s.t.} \quad \min_{Q \in \mathcal{Q}} \lambda^{T}\mu_Q \ge r_0, \tag{9}$$

where $r_0$ is the investor's required return rate. It is rational to assume that the actual values of the parameters are in the neighborhood of the estimates. Thus, we can generate the uncertainty set based on the assumption that the measures in the set should not be far from the ideal measure P. Relative entropy, also known as the Kullback–Leibler divergence, can be used to measure the difference between probability measures. The relative entropy of the measure $Q$ in $\mathcal{Q}$ with respect to the measure P is

$$R(Q\,\|\,P) = \int f_Q(x)\,\ln\frac{f_Q(x)}{f_P(x)}\,dx, \tag{10}$$

where $f_P$ and $f_Q$ are the probability density functions (pdf) of the loans’ returns under probability measures P and $Q$, respectively. In the context of mean-variance analysis, relative entropy can be rewritten as

$$R(Q\,\|\,P) = \frac{1}{2}\left[\mathrm{tr}\!\left(V^{-1}V_Q\right) - n - \ln\frac{|V_Q|}{|V|} + (\mu_Q - \mu)^{T} V^{-1} (\mu_Q - \mu)\right], \tag{11}$$

where $\mu$, $V$, $\mu_Q$, and $V_Q$ carry the same meaning as in (8) and (9); $\mathrm{tr}(V)$, $|V|$, and $V^{T}$ are the trace, the determinant, and the transpose of V, respectively; and n is the number of assets in the portfolio.

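Let $\mathcal{U}$ denote the set of parameters $(\mu_Q, V_Q)$ under the measure Q in $\mathcal{Q}$. Using the relative entropy constraint, we can rewrite the robust optimization model (9) as

$$\min_{\lambda \in \Omega}\ \max_{(\mu_Q, V_Q) \in \mathcal{U}}\ \lambda^{T} V_Q \lambda \quad \text{s.t.} \quad \min_{(\mu_Q, V_Q) \in \mathcal{U}} \lambda^{T}\mu_Q \ge r_0, \quad R(Q\,\|\,P) \le K, \tag{12}$$

where K is a positive constant that determines the size of the uncertainty set. The parameter K measures the level of uncertainty and reflects the investors’ confidence in μ and V estimated under probability measure P; i.e., the greater the value of K, the lower the confidence.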
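As a concrete illustration of a relative entropy constraint, the sketch below computes the Kullback–Leibler divergence between two candidate parameter sets and checks it against a bound K. It rests on assumptions not stated in the text above: returns are treated as multivariate normal under both measures, and both covariance matrices are diagonal. The function names and numbers are hypothetical.

```python
import math

def gaussian_relative_entropy(mu_q, var_q, mu_p, var_p):
    """R(Q || P) for Q = N(mu_q, diag(var_q)) and P = N(mu_p, diag(var_p))."""
    n = len(mu_p)
    # tr(V^-1 V_Q) for diagonal matrices
    trace_term = sum(vq / vp for vq, vp in zip(var_q, var_p))
    # ln(|V_Q| / |V|) for diagonal matrices
    logdet_term = sum(math.log(vq / vp) for vq, vp in zip(var_q, var_p))
    # (mu_Q - mu)^T V^-1 (mu_Q - mu)
    mean_term = sum((mq - mp) ** 2 / vp
                    for mq, mp, vp in zip(mu_q, mu_p, var_p))
    return 0.5 * (trace_term - n - logdet_term + mean_term)

def in_uncertainty_set(mu_q, var_q, mu_p, var_p, K):
    """True if the candidate parameters lie within relative entropy K of P."""
    return gaussian_relative_entropy(mu_q, var_q, mu_p, var_p) <= K

# Hypothetical reference estimates under P for two loans.
mu_p = [0.08, 0.05]
var_p = [0.04, 0.02]

same = gaussian_relative_entropy(mu_p, var_p, mu_p, var_p)
shifted = gaussian_relative_entropy([0.06, 0.05], var_p, mu_p, var_p)
```

Identical parameters give a divergence of zero, and lowering the first loan's expected return by 0.02 gives a small positive value, so that scenario falls inside the uncertainty set for any K above that value.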

Yam et al. [6] prove that the robust mean-variance portfolio selection model based on the relative entropy method (RMV-RE) can be formulated as a quadratic optimization problem, which is tractable and can be solved efficiently. That is,

$$\min_{\lambda \in \Omega}\ \lambda^{T} \tilde{V} \lambda \quad \text{s.t.} \quad \lambda^{T}\tilde{\mu} \ge r_0. \tag{13}$$

Herein, $\tilde{\mu} = \zeta\mu$ and $\tilde{V} = V + \zeta(1-\zeta)\mu\mu^{T}$, and the parameter $\zeta$ is closely related to K in (12) and reflects the level of confidence in μ and V estimated under measure P. For example, ζ = 1 means that investors believe the estimated μ and V are the true parameters, and the investor’s confidence weakens as ζ decreases. For the details of the proof, see Yam et al. [6].

4.2. Robust Mean-Variance Portfolio Optimization Model in P2P Lending

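In Section 3.2, we estimated each loan’s expected return and variance of return, i.e., $\mu_i$ and $\sigma_i^2$, using the instance-based credit risk assessment model. Let $\mu = (\mu_1, \ldots, \mu_n)^{T}$ and $V = \mathrm{diag}(\sigma_1^2, \ldots, \sigma_n^2)$ denote the expected return vector and the covariance matrix of the loans’ returns under the probability measure P. Here, we assume that the correlation between P2P loans is negligible. Now we can rewrite (13) as

$$\min_{\lambda \in \Omega}\ \sum_{i=1}^{n} \lambda_i^2 \sigma_i^2 + \zeta(1-\zeta)\left(\sum_{i=1}^{n} \lambda_i \mu_i\right)^{2} \quad \text{s.t.} \quad \zeta \sum_{i=1}^{n} \lambda_i \mu_i \ge r_0, \tag{15}$$

where $r_0$ is the required return rate. The feasible region Ω of our problem is defined by the following constraints:
(1) The value of the portfolio remains at its initial value, i.e., $\sum_{i=1}^{n} \lambda_i = 1$.
(2) Short-selling is forbidden; thus $\lambda_i \ge 0$.
(3) For each loan, the amount that the lender can invest is no more than the amount the borrower requests, $m_i$; thereby, $\lambda_i M \le m_i$, where M is the total investment amount the investor has available.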
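To make the model above concrete, the following sketch evaluates the robust objective and checks the feasibility constraints for a candidate allocation. It does not solve the optimization problem. It assumes uncorrelated loans and reads the robust adjustment as an adjusted mean ζμ and an adjusted covariance V + ζ(1−ζ)μμᵀ, which is the dimensionally consistent interpretation; the symbol r0 for the required return, the function names, and all numbers are hypothetical.

```python
def robust_objective(lam, mu, sigma2, zeta):
    """Robust portfolio variance for uncorrelated loans.

    Equals lam' (V + zeta * (1 - zeta) * mu mu') lam with V = diag(sigma2).
    """
    portfolio_return = sum(l * m for l, m in zip(lam, mu))
    own_risk = sum(l * l * s for l, s in zip(lam, sigma2))
    return own_risk + zeta * (1.0 - zeta) * portfolio_return ** 2

def is_feasible(lam, mu, zeta, r0, M, m, tol=1e-9):
    """Check budget, no-short-selling, per-loan cap, and return constraints."""
    budget_ok = abs(sum(lam) - 1.0) <= tol
    no_short_ok = all(l >= -tol for l in lam)
    cap_ok = all(l * M <= m_i + tol for l, m_i in zip(lam, m))
    robust_return = zeta * sum(l * mu_i for l, mu_i in zip(lam, mu))
    return budget_ok and no_short_ok and cap_ok and robust_return >= r0 - tol

# Hypothetical example with three loans.
lam = [0.5, 0.3, 0.2]            # candidate portfolio weights
mu = [0.10, 0.08, 0.12]          # estimated expected returns
sigma2 = [0.04, 0.02, 0.09]      # estimated return variances
m = [10000.0, 5000.0, 4000.0]    # amounts requested by the borrowers

objective = robust_objective(lam, mu, sigma2, zeta=0.75)
feasible = is_feasible(lam, mu, zeta=0.75, r0=0.05, M=15000.0, m=m)
```

A quadratic programming solver would search over all feasible allocations for the one minimizing this objective; the sketch only scores a single candidate.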

5. Empirical Analysis

In this section, we investigate the validity of the robust mean-variance portfolio optimization model in P2P lending using a real-world dataset from a notable P2P lending platform, Prosper. All numerical experiments are performed using MATLAB on a PC.

5.1. Data Description and Preprocess

The dataset for the empirical study is from a notable P2P lending platform in the United States, Prosper. It consists of 17,001 loans, including 3039 default loans and 13908 completed loans, with issue dates from November 2005 to March 2014.

Using the data, a credit scoring model is learnt to transform the loan attributes into the default probability. The loan attributes are as follows: the borrower’s FICO score, which reflects the borrower’s creditworthiness; the borrower’s number of inquiries in the past six months; the monetary amount of the loan; the homeownership status of the borrower; the debt-to-income ratio of the borrower; the borrower’s current delinquencies, representing the number of accounts delinquent; and the borrower’s number of public records in the past 10 years (Rows 1-7 in Table 1). The target variable is binary (0 represents completed and 1 represents default), as described in Row 8 of Table 1.

There exist many credit scoring models to predict the default probability of a loan, such as the XGBoost model [33–35], the hybrid KMV model [36], credit scoring based on genetic algorithms [37, 38], and so on. However, discussing how to choose and construct the optimal credit scoring model is beyond the scope of this study, and we use the most popular model, logistic regression, to make the prediction in this preprocessing step.

We randomly divide the dataset into two parts: the first contains 40% of all loans and is used to determine the optimal bandwidth h in (5), as described in detail in Section 5.3, and the second contains the remaining 60% of the loans. Moreover, using 20-fold cross-validation, we randomly divide the second part into 20 subsets, each of which contains approximately 510 loans. In each round, one of the subsets is used as the testing set, which consists of loans waiting to be invested in, so their pay-back statuses are unknown; all other subsets are taken as the training set, which consists of historical loans with known yield.
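The data partition described above can be sketched as follows. This is a minimal illustration using Python's standard library; the random seed, the function names, and the use of integer loan identifiers are assumptions for the example, not details reported in the paper.

```python
import random

def split_dataset(loan_ids, tune_fraction=0.4, n_folds=20, seed=0):
    """Shuffle the loans, hold out a bandwidth-tuning part, fold the rest."""
    ids = list(loan_ids)
    random.Random(seed).shuffle(ids)
    n_tune = int(len(ids) * tune_fraction)
    tuning, rest = ids[:n_tune], ids[n_tune:]
    # Striding through the shuffled list yields folds of near-equal size.
    folds = [rest[k::n_folds] for k in range(n_folds)]
    return tuning, folds

def cross_validation_rounds(folds):
    """Yield (training set, testing set) pairs, one round per fold."""
    for k, test in enumerate(folds):
        train = [x for j, fold in enumerate(folds) if j != k for x in fold]
        yield train, test

# 17,001 loans, identified here by hypothetical integer ids.
tuning, folds = split_dataset(range(17001))
rounds = list(cross_validation_rounds(folds))
```

With 17,001 loans this yields 6,800 loans for bandwidth tuning and 20 folds of 510 or 511 loans each, in line with the approximate fold size stated above.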

5.2. Model Description

In this paper, we propose a robust credit portfolio optimization model for investment decisions in P2P lending. In order to show its effectiveness, we compare it with a benchmark model proposed by Guo et al. [5]. In the following, we describe models in detail.

IOM is the instance-based model proposed by Guo et al. [5]. Each loan is assessed using kernel weights and the historical performance of similar loans. The classical mean-variance model (8) is then used to identify the optimal allocation strategy. This model outperforms some rating-based models, as the results of Guo et al. [5] show.

RIOM is the robust instance-based model in this study. Expected return and risk of each loan are also assessed based on the “instance-based” assessment framework. However, we use the robust model of credit portfolio optimization based on relative entropy method, Equation (15), to obtain the optimal investment decision.

We compare the two models by the following procedure:
(1) Train the credit risk assessment model with the training set, and use the trained model to predict the expected return ($\mu_i$) and variance ($\sigma_i^2$) of each loan in the testing set. Thus, the expected return vector and the covariance matrix, μ and V, can be obtained.
(2) For each model, feed the predicted expected return vector μ and the covariance matrix V of the testing loans into the portfolio optimization algorithm, and compute the performance of investment in the optimal portfolio.
(3) Compare the return rates of the two models.

5.3. Analysis of Results

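As mentioned before, we select the Gaussian kernel, $K(u) = \frac{1}{\sqrt{2\pi}}\,e^{-u^2/2}$, as the kernel function. The important parameter in the kernel regression model, the bandwidth h, is optimized by the following leave-one-out cross validation:

$$CV(h) = \frac{1}{n}\sum_{j=1}^{n}\left(R_j - \hat{\mu}_{-j}\right)^{2}, \tag{16}$$

where $\hat{\mu}_{-j}$ is the leave-one-out estimate of the expected return rate $\mu_j$; specifically,

$$\hat{\mu}_{-j} = \frac{\sum_{k \ne j} K\!\left(\frac{p_j - p_k}{h}\right) R_k}{\sum_{k \ne j} K\!\left(\frac{p_j - p_k}{h}\right)}. \tag{17}$$

The curve of CV(h) is exhibited in Figure 1. The shape of the curve clearly shows a minimal point, and the h corresponding to the minimal point is the optimal bandwidth for the model.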
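The bandwidth selection can be illustrated with a small grid search. The sketch below assumes a Gaussian kernel and a mean-squared leave-one-out prediction error as the criterion; the candidate grid, function names, and toy data are hypothetical.

```python
import math

def loo_cv_score(p, r, h):
    """Mean squared leave-one-out prediction error for bandwidth h."""
    n = len(p)
    squared_error = 0.0
    for j in range(n):
        num = den = 0.0
        for k in range(n):
            if k == j:
                continue  # leave loan j out of its own prediction
            # The Gaussian normalizing constant cancels in the ratio.
            weight = math.exp(-0.5 * ((p[j] - p[k]) / h) ** 2)
            num += weight * r[k]
            den += weight
        if den == 0.0:
            return math.inf  # h is too small: no neighbour has any weight
        squared_error += (r[j] - num / den) ** 2
    return squared_error / n

def select_bandwidth(p, r, candidates):
    """Return the candidate bandwidth with the smallest CV score."""
    return min(candidates, key=lambda h: loo_cv_score(p, r, h))

# Hypothetical default probabilities and realized returns of past loans.
p = [0.05, 0.10, 0.15, 0.20, 0.40, 0.45]
r = [0.08, 0.07, 0.05, 0.02, -0.30, -0.35]
candidates = [0.01, 0.02, 0.05, 0.10, 0.20]

scores = {h: loo_cv_score(p, r, h) for h in candidates}
best_h = select_bandwidth(p, r, candidates)
```

Plotting the scores against the candidate bandwidths would give a curve of the same kind as the CV(h) curve discussed above, with the selected bandwidth at its minimum.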

To apply the robust credit portfolio optimization method to obtain the optimal investment strategy in problem (13), we select the parameter ζ = 0.75, the investment amount M = 15 thousand dollars, and the required rate of return $r_0$ = 0.05. We also set the risk-free return rate to 0.025, which is approximately the average yield of T-Bills over the same period. We use the MATLAB built-in solver “quadprog” to solve the two portfolio optimization problems.

Table 2 summarizes the investment return rate of each test subset and the average performance on the Prosper dataset. It shows that the two portfolios are almost always efficient and feasible, except for subset 16. The results also show that the optimal portfolio derived from RIOM always outperforms the optimal portfolio from IOM in actual performance. The Sharpe ratio shows that the RIOM optimal portfolio performs better as well.

To verify that the conclusions obtained from the above experiments are stable, we consider different investment amounts and required returns as input parameters for portfolio selection and keep other conditions unchanged. As summarized in Table 3, we consider nine parameter pairs of required return rate and investment amount M.

The computational results for each parameter pair are summarized in Table 4, which compares the performance of the two optimal portfolios in terms of the actual return rate of investment. The results are shown more intuitively in Figure 2, which compares the actual return rates of the two models. The first 9 numbers on the horizontal axis in Figure 2 represent the corresponding parameter combinations (sets 1 through 9 from Table 3), and the number 10 shows the average. We find that the RIOM model outperforms the IOM model across all parameter combinations.

In conclusion, the optimal portfolio identified by the robust optimization model in this study is more efficient than that of the existing model, and the performance of our model is more robust and stable.

6. Conclusions

In this paper, we formulate a data-driven robust portfolio optimization model with relative entropy constraints based on an instance-based credit risk assessment framework for investment decisions in P2P lending. This P2P lending investment decision model has at least three advantages. Firstly, it provides a more refined measure of P2P loans’ risk and gives investors a more intuitive and quantitative risk estimate, instead of just labelling each loan with a credit grade. Secondly, this model can estimate each loan’s expected return and risk when the historical observations of the same borrower are unavailable. Finally, this model considers the loans’ distribution ambiguity (probability measure uncertainty) problem and uses relative entropy to model parameter uncertainty, so that the optimal allocation strategy remains efficient and feasible under various actual scenarios. Numerical experiments indicate that the P2P lending investment decision model using robust optimization with relative entropy constraints provides better performance than the existing model.

Data Availability

The data used in this paper were downloaded from the website of Prosper: https://www.prosper.com/invest/download.aspx.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

The research is supported by the National Natural Science Foundation of China (Grants nos. 71471027, 71731003, and 71873103), the National Social Science Foundation of China (Grant no. 16BTJ017), National Natural Science Foundation of China Youth Project (Grant no. 71601041), Liaoning Economic and Social Development Key Issues (Grant no. 2015lslktzdian-05), and Liaoning Provincial Social Science Planning Fund Project (Grant no. L16BJY016). The authors acknowledge the organizations mentioned above.