Article

An application of two-stage quantile regression to insurance ratemaking


Abstract

Two-part models based on generalized linear models are widely used in insurance ratemaking for predicting the expected loss. This paper explores an alternative method based on quantile regression, which provides more information about the loss distribution and can also be used for insurance underwriting. Quantile regression allows estimating the aggregate claim cost quantiles of a policy given a number of covariates. To do so, a first stage is required, which involves fitting a logistic regression to estimate, for every policy, the probability of submitting at least one claim. The proposed methodology is illustrated using a portfolio of car insurance policies. This application shows that the results of the quantile regression are highly dependent on the claim probability estimates. The paper also examines an application of quantile regression to premium safety loading calculation, the so-called Quantile Premium Principle (QPP). We propose a premium calculation based on quantile regression which inherits the good properties of the quantiles. Using the same insurance portfolio dataset, we find that the QPP captures the riskiness of the policies better than the expected value premium principle.
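The two-stage procedure described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: the data frame `policies` and the columns `has_claim`, `loss`, `age`, `power`, and `zone` are hypothetical, and the sketch uses the standard relation between the unconditional quantile level and the level applied to the positive losses.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

def two_stage_quantiles(policies: pd.DataFrame, tau: float = 0.95) -> pd.Series:
    """Sketch of a two-stage quantile estimate of the aggregate claim cost."""
    # Stage 1: logistic regression for the probability of at least one claim.
    logit_fit = smf.logit("has_claim ~ age + power + zone", data=policies).fit(disp=0)
    p_claim = logit_fit.predict(policies)

    # If P(no claim) = 1 - p exceeds tau, the tau-quantile of the aggregate
    # loss is 0; otherwise it equals the tau*-quantile of the positive losses.
    tau_star = 1.0 - (1.0 - tau) / p_claim

    # Stage 2: quantile regression on the policies with at least one claim,
    # refitted at each distinct adjusted level (one per risk class).
    positives = policies[policies["has_claim"] == 1]
    quantiles = pd.Series(0.0, index=policies.index)
    for level in sorted(set(tau_star.round(4))):
        if level <= 0:
            continue
        mask = tau_star.round(4) == level
        qr_fit = smf.quantreg("loss ~ age + power + zone", data=positives).fit(q=level)
        quantiles[mask] = np.asarray(qr_fit.predict(policies[mask]))
    return quantiles
```

In practice one would fit the second stage once per risk class rather than per distinct adjusted level; the loop above is kept only to make the level adjustment explicit.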


... The main question is then how to find the risk factors that cause the highest variations of the individual premiums. A traditional approach to choosing such risk factors is based on the p-values of the logistic regression coefficients (Heras et al., 2018). Nevertheless, the significance of regression coefficients alone does not provide a comprehensive understanding of the ranking of risk factors in terms of importance, which is a critical aspect to explore. ...
... To overcome this limitation, Heras et al. (2018) extend Kudryavtsev's model through the estimation of a different probability of having no claims for each risk class. Firstly, they adopt a logistic regression to estimate the class-specific probabilities of having no claims. ...
... In the second stage, the QR is applied to model the conditional loss distribution given that a claim occurs. Although the two-part model of Heras et al. (2018) offers a more accurate approach, it is computationally inefficient for portfolios with a medium or large number of policyholders, since it requires running a QR for each risk class. To solve this issue, Baione and Biancalana (2019) consider the specification of a unique probability level for the conditional aggregate claims amount. ...
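For reference, the relation that ties the class-specific claim probability estimated in the first stage to the quantile level used in the second stage can be written as follows. This is the standard two-part identity, assuming the claim indicator and the positive loss are independent:

```latex
S_i = B_i Y_i,\quad p_i = \Pr(B_i = 1):\qquad
F_{S_i}(s) = (1 - p_i) + p_i\,F_{Y_i}(s),\ \ s \ge 0
\;\Longrightarrow\;
Q_\tau(S_i) = Q_{\tau_i^{*}}(Y_i),\qquad
\tau_i^{*} = \frac{\tau - (1 - p_i)}{p_i}\quad (\tau > 1 - p_i).
```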
Article
Full-text available
The ratemaking process is a key issue in insurance pricing. It consists in pooling together policyholders with similar risk profiles into rating classes and assigning the same premium to policyholders in the same class. In actuarial practice, rating systems are typically not based on all risk factors; rather, only some of the factors are selected to construct the rating classes. The objective of this study is to investigate the selection of risk factors in order to construct rating classes that exhibit maximum internal homogeneity. For this selection, we adopt the Shapley effects from global sensitivity analysis. While these sensitivity indices are usually used for model interpretability, we apply them to construct rating classes. We provide a new strategy to estimate them, and we connect them to the intra-class variability and heterogeneity of the rating classes. To verify the appropriateness of our procedure, we introduce a measure of heterogeneity specifically designed to compare rating systems with a different number of classes. Using a well-known car insurance dataset, we show that the rating system constructed with the Shapley effects is the one minimizing this heterogeneity measure.
... computing the Value-at-Risk of a given portfolio). Modeling the quantile claim amount through quantile regression (QR) has already been discussed by a handful of authors: Kudryavtsev (2009) was the first to introduce the use of the two-stage QR model to estimate the quantile of the total claim amount; Heras et al. (2018) propose a refinement of the previous model since they take into account heterogeneous claim probabilities, whereas Kudryavtsev (2009) only considers a single probability of having claims for each type of policyholder; Baione & Biancalana (2019) propose an alternative two-stage approach, where the risk margin considered in the ratemaking is calibrated on the claim's severity for each risk class in the portfolio, avoiding some of the drawbacks that characterise the technique proposed by Heras et al. (2018). ...
... Following this approach, we fit a logistic regression for the binary variable to estimate the claim probability while we use a QRNN or a Quantile-CANN to model the quantile of the positive outcome. Using the estimated quantiles of the claim amount, we finally calculate a loaded premium following the quantile premium principle considered in Heras et al. (2018). ...
Article
Full-text available
In this paper, we discuss the estimation of conditional quantiles of aggregate claim amounts for non-life insurance, embedding the problem in a quantile regression framework using the neural network approach. As the first step, we consider the quantile regression neural network (QRNN) procedure to compute quantiles for the insurance ratemaking framework. As the second step, we propose a new quantile regression combined actuarial neural network (Quantile-CANN) combining the traditional quantile regression approach with a QRNN. In both cases, we adopt a two-part model scheme where we fit a logistic regression to estimate the probability of positive claims and the QRNN or the Quantile-CANN for the positive outcomes. Through a case study based on a health insurance dataset, we highlight the overall better performance of the proposed models with respect to the classical quantile regression approach. We then use the estimated quantiles to calculate a loaded premium following the quantile premium principle, showing that the proposed models provide a better risk differentiation.
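Both quantile regression and quantile regression neural networks are trained with the pinball (check) loss rather than squared error. The minimal numpy sketch below shows that objective; the function name and example values are illustrative, not taken from the paper.

```python
import numpy as np

def pinball_loss(y_true: np.ndarray, y_pred: np.ndarray, tau: float) -> float:
    """Average check (pinball) loss at level tau, the objective minimised by
    quantile regression and by quantile regression neural networks."""
    u = y_true - y_pred
    return float(np.mean(np.maximum(tau * u, (tau - 1.0) * u)))

# The loss is asymmetric: with tau = 0.9, under-predicting a large claim costs
# nine times as much as over-predicting it by the same amount.
y_obs = np.array([100.0, 250.0, 900.0])
print(pinball_loss(y_obs, np.array([120.0, 240.0, 500.0]), tau=0.9))
```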
... A rich variety of premium principles has been proposed in the actuarial literature for predicting the risk premium of individual policies, for example, Bühlmann (1970), Mack (1997), Wang et al. (1997), Kudryavtsev (2009), and Heras et al. (2018). The standard approach for predicting the risk premium involves a separate analysis of two parts of the risk premium: the pure premium and the risk loading. ...
... This premium principle explains the need for risk loading quite well, as it estimates the maximum possible loss that an individual policy may incur with a given probability 1 − τ during the forecasting period. Following the VaR premium principle, the quantile premium principle for classification ratemaking is proposed by Heras et al. (2018), and the corresponding risk premium is calculated as follows: ...
... While the top-down method is well developed, see for example Cossette et al. (2012) and Heras et al. (2018), the use of covariate information to estimate the risk loading parameters through generalized linear models and quantile regression models has received much less attention. Following this line of study, Baione and Biancalana (2019) extend the work of Heras et al. (2018) by developing a down-top-down method for risk premium calculation in classification ratemaking. ...
Preprint
The risk premium of a policy is the sum of the pure premium and the risk loading. In the classification ratemaking process, generalized linear models are usually used to calculate pure premiums, and various premium principles are applied to derive the risk loadings. No matter which premium principle is used, some risk loading parameters must be specified in advance, subjectively. To overcome this subjectivity and calculate the risk premium more reasonably and objectively, we propose a top-down method to calculate these risk loading parameters. First, we implement the bootstrap method to calculate the total risk premium of the portfolio. Then, under the constraint that the portfolio's total risk premium should equal the sum of the risk premiums of the individual policies, the risk loading parameters are determined. During this process, besides generalized linear models, three kinds of quantile regression models are also applied, namely the traditional quantile regression model, the fully parametric quantile regression model, and the quantile regression model with coefficient functions. The empirical results show that the risk premiums calculated by the proposed method are more coherent and can reasonably differentiate the heterogeneity of different risk classes.
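As a toy illustration of the top-down calibration idea, under an assumed expected-value-type loading (the functional form and variable names are assumptions, not the preprint's exact specification), one can solve for the loading parameter so that the individual risk premiums add up to the bootstrapped portfolio total:

```python
import numpy as np

def calibrate_loading(pure_premiums: np.ndarray, portfolio_risk_premium: float) -> float:
    """Choose theta in P_i = (1 + theta) * pure_i so that sum_i P_i equals the
    bootstrapped total risk premium of the portfolio."""
    return portfolio_risk_premium / pure_premiums.sum() - 1.0

pure = np.array([120.0, 80.0, 310.0, 45.0])        # illustrative pure premiums
theta = calibrate_loading(pure, portfolio_risk_premium=620.0)
loaded = (1.0 + theta) * pure                       # sums to 620 by construction
```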
... Two-part models are useful for modelling the expectation of the loss variables of interest, but they cannot provide further information about the high risks of the losses. Recently, two-stage approaches based on quantile regression (henceforth QR, (10) and (2)) have been widely studied in the literature for the estimation of the value-at-risk of the loss variables of interest. Following (3), we suggest adopting a Two-Stage Parametric Quantile Regression model (henceforth TSPQR): the first stage of TSPQR involves fitting a logistic regression to estimate, for every policy, the probability of submitting at least one claim; then a PQR is used to estimate the individual aggregate claim amount. In this way it is possible to estimate the quantiles of the loss random variable conditioned on a set of covariates and to obtain information not only in terms of expected values but also in terms of the high risks of the losses. ...
... , I) and a set of independent covariates x_i = (x_{i1}, ..., x_{im}), in order to model the random variable (S | x_i) = S_i, the total claim amount per insured, we consider the two-part model proposed by Heras et al. (10). Under this approach, the r.v. ...
Preprint
Full-text available
In 2012 the European Court of Justice introduced the ban on differentiating car insurance premiums by gender to avoid gender inequality. This paper deals with a gender analysis of driving ability by investigating the relationship between gender and the relative total claim amount in Motor Third Party Liability insurance, also considering the effect of age. Leveraging a two-part model based on parametric quantile regression, we investigate the average behaviour of drivers and the tail behaviour in order to highlight the importance of dispersion and of the largest claims. As a consequence, the purpose of our contribution is to study how gender and age can influence the entire probability distribution of the insurance claim, with a particular focus on the quantiles with high probability levels, which are very important indicators for determining the effective riskiness of an insured. We apply our model to an Australian insurance dataset; our results suggest that men are in general riskier in terms of both average and extreme behaviour.
... To better appreciate our study, as a starting point, we first model the conditional claim occurrence probabilities by logistic regression, the first step of the ratemaking process in the existing literature, see Heras et al. (2018) and Kang et al. (2020). Then, we test the serial dynamics and dependence structure of conditional claim occurrence indicators over time. ...
... For forecasting the conditional claim frequency, one needs to specify a discrete distribution, such as the Poisson or Negative Binomial, which affects the variance of the claim frequency and hence may influence the ratemaking decision. For forecasting the conditional Value-at-Risk of the aggregate loss, one can employ quantile regression at an adjusted risk level estimated from the above logistic regression; see Kudryavtsev (2009), Heras et al. (2018), Kang et al. (2020), and Kang et al. (2021). As the main goal of this paper is to provide a fundamental starting point to test the serial dynamics and dependence assumptions, we refrain from fully specifying the claim frequency and severity distributions. ...
Preprint
In non-life insurance, it is essential to understand the serial dynamics and dependence structure of longitudinal insurance data before using them. Existing actuarial literature primarily focuses on modeling, which typically assumes a lack of serial dynamics and a pre-specified dependence structure of claims across multiple years. To fill this research gap, we develop two diagnostic tests, namely a serial dynamic test and a correlation test, to assess the appropriateness of these assumptions and provide justifiable modeling directions. The tests involve the following ingredients: i) computing the change of the cross-sectional estimated parameters under a logistic regression model and the empirical residual correlations of the claim occurrence indicators across time, which serve as indications to detect serial dynamics; ii) quantifying estimation uncertainty using the randomly weighted bootstrap approach; iii) developing asymptotic theories to construct proper test statistics. The proposed tests are examined on simulated data and applied to two non-life insurance datasets, revealing that the two datasets behave differently.
... Additionally, it would be intriguing to examine our proposal within a frequency-severity modeling framework (Frees, 2009). In this context, besides examining loss amounts, it becomes pertinent to analyze loss frequency, which signifies whether a claim has occurred or, more generally, the overall number of observed claims (Frees et al., 2013; Heras et al., 2018). This approach would entail a two-part model: one part for frequency, typically modeled using logit or probit models, and another part for losses, which would be conditionally distributed based on one of our proposed models. ...
Article
Full-text available
Insurance loss data have peculiar features that can rarely be accounted for by simple parametric distributions. Thus, in this manuscript, we first introduce a new type of location mixture model: the mode mixture. By using convenient mode-parameterized hump-shaped distributions, we present a family of eight mode mixture of unimodal distributions. Then, we fit these models to two real insurance loss datasets, where they are evaluated in terms of goodness of fit and ability to reproduce classical risk measures. We extend the comparisons to existing models based on mode-parameterized hump-shaped distributions. Lastly, using simulated data, we further investigate the performance of the estimated risk measures of our models.
... We will apply it using the R package quantreg (Koenker 2015). This methodology is more recent in its application to insurance and is mainly oriented to ratemaking; see, for instance, Heras et al. (2018) and Baione and Biancalana (2021). ...
Article
Full-text available
There is growing concern that climate change poses a serious threat to the sustainability of the insurance business. Understanding whether climate warming is a cause for an increase in claims and losses, and how this cause–effect relationship will develop in the future, are two significant open questions. In this article, we answer both questions by particularizing the geographical area of Spain, and a precise risk, hailstorm in crop insurance in the line of business of wine grapes. We quantify climate change using the Spanish Actuarial Climate Index (SACI). We utilize a database containing all the claims resulting from hail risk in Spain from 1990 to 2022. With homogenized data, we consider as dependent variables the monthly number of claims, the monthly number of loss costs equal to one, and the monthly total losses. The independent variable is the monthly Spanish Actuarial Climate Index (SACI). We attempt to explain the former through the latter using regression and quantile regression models. Our main finding is that climate change, as measured by the SACI, explains these three dependent variables. We also provide an estimate of the increase in the monthly total losses’ Value at Risk, corresponding to a future increase in climate change measured in units of the SACI. Spanish crop insurance managers should carefully consider these conclusions in their decision-making process to ensure the sustainability of this line of business in the future.
... In addition, QRNNs have also been implemented in the insurance field, for example in relation to the claim amount of an insurance policy (cf. Laporta et al. 2021, 2023; Heras et al. 2018). ...
Article
Full-text available
The study deals with the application of a neural network algorithm to addressing and solving problems connected with riskiness in financial contexts. We consider a specific contract whose characteristics make it a paradigm of a complex financial transaction, that is, the Reverse Mortgage. Reverse Mortgages allow elderly homeowners to obtain a credit line that will be repaid through the sale of their homes after their deaths, letting them continue to live there. In accordance with regulatory guidelines that direct prudent assessments of future losses to ensure solvency, within the perspective of the risk assessment of Reverse Mortgage portfolios, the paper deals with the estimation of the Conditional Value at Risk. Since the riskiness is affected by nonlinear relationships between risk factors, the Conditional Value at Risk is estimated using Neural Networks, as they are a suitable method for fitting nonlinear functions. The Conditional Value at Risk estimated by means of the Neural Network approach is compared with the traditional Value at Risk in a numerical application.
... To facilitate inferences, we define marginal quantile treatment effects and develop inference tools to determine their statistical significance. Similar applications of two-part quantile regression models have been used in Heras et al. (2018) to estimate actuarial profiles and provide new insights for actuarial science. The work, however, does not involve any theoretical validation and development. ...
Article
An extension of quantile regression is proposed to model zero-inflated outcomes, which have become increasingly common in biomedical studies. The method is flexible enough to depict complex and nonlinear associations between the covariates and the quantiles of the outcome. We establish the theoretical properties of the estimated quantiles, and develop inference tools to assess the quantile effects. Extensive simulation studies indicate that the novel method generally outperforms existing zero-inflated approaches and the direct quantile regression in terms of the estimation and inference of the heterogeneous effect of the covariates. The approach is applied to data from the Northern Manhattan Study to identify risk factors for carotid atherosclerosis, measured by the ultrasound carotid plaque burden.
... The binary variable could illustrate whether a claim happened, and the continuous variable could indicate the amount of a claim. The GLM is a widely used method to estimate the expectation of the binary variable ([38,39]). In particular, logit and probit models are typical models utilized in insurance pricing and are commonly used to estimate and predict the probability of claims ([40]). ...
Article
Full-text available
This paper investigates the effectiveness of the Actuaries Climate Index (ACI), a climate index jointly launched by multiple actuarial societies in North America in 2016, on predicting crop yields and (re)insurance ratemaking. The ACI is created using a variety of climate variables reflecting extreme weather conditions in 12 subregions in the US and Canada. Using data from eight Midwestern states in the US, we find that the ACI has significant predictive power for crop yields. Moreover, allowing the constituting variables of the ACI to have data-driven rather than pre-determined weights could further improve the predictive accuracy. Furthermore, we create the county-level ACI index using high-resolution climate data and investigate its predictive power on county-level corn yields, which are more relevant to insurance practices. We find that although the self-constructed ACI index leads to a slightly worse fit due to noisier county-specific yield data, the predictive results are still reasonable. Our findings suggest that the ACI index is promising for crop yield forecasting and (re)insurance ratemaking, and its effectiveness could be further improved by allowing for the data-driven weights of the constituting variables and could be created at higher resolution levels.
... Smyth and Jørgensen (2002) used double generalized linear models for the case where we only observe the claim cost but not the frequency. Many authors have proposed methods for insurance pricing using different frameworks other than GLM, including quantile regression (Heras et al. 2018), hierarchical modeling (Frees and Valdez 2008), machine learning (Kašćelan et al. 2015; Yang et al. 2016), the copula model (Czado et al. 2012), and the spatial model (Gschlößl and Czado 2007). ...
Article
Full-text available
This paper aims to better predict highly skewed auto insurance claims by combining candidate predictions. We analyze a version of the Kangaroo Auto Insurance company data and study the effects of combining different methods using five measures of prediction accuracy. The results show the following. First, when there is an outstanding (in terms of Gini Index) prediction among the candidates, the “forecast combination puzzle” phenomenon disappears. The simple average method performs much worse than the more sophisticated model combination methods, indicating that combining different methods could help us avoid performance degradation. Second, the choice of the prediction accuracy measure is crucial in defining the best candidate prediction for “low frequency and high severity” (LFHS) data. For example, mean square error (MSE) does not distinguish well between model combination methods, as the values are close. Third, the performances of different model combination methods can differ drastically. We propose using a new model combination method, named ARM-Tweedie, for such LFHS data; it benefits from an optimal rate of convergence and exhibits a desirable performance in several measures for the Kangaroo data. Fourth, overall, model combination methods improve the prediction accuracy for auto insurance claim costs. In particular, Adaptive Regression by Mixing (ARM), ARM-Tweedie, and constrained Linear Regression can improve forecast performance when there are only weak learners or when no dominant learner exists.
... Many actuaries use a two-part model to set rates, where the first part predicts how many claims a policyholder will have (the claim frequency) and the second part predicts the average cost of an individual claim (Frees and Sun 2010; Heras et al. 2018; Prabowo et al. 2019). Multiplying the two outputs of these models predicts the total cost of a given policyholder. ...
Article
Full-text available
Two-part models are important to and used throughout insurance and actuarial science. Since insurance is required for registering a car, obtaining a mortgage, and participating in certain businesses, it is especially important that the models that price insurance policies are fair and non-discriminatory. Black box models can make it very difficult to know which covariates are influencing the results, resulting in model risk and bias. SHAP (SHapley Additive exPlanations) values enable interpretation of various black box models, but little progress has been made in two-part models. In this paper, we propose mSHAP (or multiplicative SHAP), a method for computing SHAP values of two-part models using the SHAP values of the individual models. This method will allow for the predictions of two-part models to be explained at an individual observation level. After developing mSHAP, we perform an in-depth simulation study. Although the kernelSHAP algorithm is also capable of computing approximate SHAP values for a two-part model, a comparison with our method demonstrates that mSHAP is exponentially faster. Ultimately, we apply mSHAP to a two-part ratemaking model for personal auto property damage insurance coverage. Additionally, an R package (mshap) is available to easily implement the method in a wide variety of applications.
... Alfò et al. (2017), for example, defined a finite mixture of quantile regression models for heterogeneous data. Quantile regression and two-part models have also been positively considered in several studies: see for example Grilli et al. (2016); Heras et al. (2018); Sauzet et al. (2019) and Biswas et al. (2020). In particular, Biswas et al. ...
Preprint
This paper develops a two-part finite mixture quantile regression model for semi-continuous longitudinal data. The proposed methodology allows heterogeneity sources that influence the model for the binary response variable, to influence also the distribution of the positive outcomes. As is common in the quantile regression literature, estimation and inference on the model parameters are based on the Asymmetric Laplace distribution. Maximum likelihood estimates are obtained through the EM algorithm without parametric assumptions on the random effects distribution. In addition, a penalized version of the EM algorithm is presented to tackle the problem of variable selection. The proposed statistical method is applied to the well-known RAND Health Insurance Experiment dataset which gives further insights on its empirical behavior.
Preprint
Full-text available
We propose a class of linear classifier models and consider a flexible loss function to study binary classification problems. The loss function consists of two penalty terms, one penalizing false positive (FP) and the other penalizing false negative (FN), and can accommodate various classification targets by choosing a weighting function to adjust the impact of FP and FN on classification. We show, through both a simulated study and an empirical analysis, that the proposed models under certain parametric weight functions outperform the logistic regression model and can be trained to meet flexible targeted rates on FP or FN.
Article
Disasters have severe implications for life and property, often requiring large-scale collective action to facilitate recovery. One key determinant of recovery is access to resources that mitigate damage losses and shorten disaster recovery trajectories. However, communities with the disabled present may be excluded from such services despite federal mandates for equal access and reasonable accommodations. We examined Hurricane Harvey federal recovery assistance distributions based on underlying community disability profiles. Through cross-sectional quantile regression, we used Federal Emergency Management Agency (FEMA) direct-to-household administrative data at the zip code level regressed onto American Community Survey estimates of disability. We found that as the prevalence of disability increased in communities, the total dollar amount of FEMA direct-to-household assistance decreased, controlling for factors such as storm damage, poverty, population density, and race/ethnicity. Moreover, disability-related funding disparities were driven primarily by hearing-related disabilities, with disparities in funding widening as total assistance increased in communities. Such inequities in community-level funding have implications for how well communities may recover from disasters.
Article
In actuarial practice, modern statistical methodologies are a primary consideration for real actuarial problems, such as premium calculation, insurance preservation, marginal risk analysis, etc. Claim data usually possess a complex structure, so direct applications of statistical techniques will result in unstable prediction. For example, insurance losses are semicontinuous variables, where a positive mass at zero is often associated with an otherwise positive continuous outcome. Thus, the prediction of high-risk events in claim data needs additional treatment to avoid significant underestimation. In this article, we propose a new two-stage composite quantile regression model for the prediction of the value-at-risk of aggregate insurance losses. As we are interested in the statistical properties of our method, asymptotic results are established corresponding to different types of risk levels. Finally, some simulation studies and a data analysis are implemented to illustrate our method.
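For context, the standard composite quantile regression objective (Zou and Yuan's formulation; assuming the two-stage approach above applies something of this form to the positive losses after the claim-occurrence stage) combines several quantile levels that share common slopes:

```latex
\bigl(\hat b_1,\ldots,\hat b_K,\hat\beta\bigr)
 = \arg\min_{b_1,\ldots,b_K,\;\beta}
   \sum_{k=1}^{K}\sum_{i=1}^{n}
   \rho_{\tau_k}\!\bigl(y_i - b_k - x_i^{\top}\beta\bigr),
\qquad
\rho_{\tau}(u) = u\bigl(\tau - \mathbf{1}\{u<0\}\bigr),
\quad \tau_k = \tfrac{k}{K+1}.
```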
Article
As catastrophic events happen more and more frequently, accurately forecasting risk at a high level is vital for the financial stability of the insurance industry. This paper proposes an efficient three-step procedure to deal with the semicontinuous property of insurance claim data and forecast extreme risk. The first step uses a logistic regression model to estimate the nonzero claim probability. The second step employs a quantile regression model to select a dynamic threshold for fitting the loss distribution semiparametrically. The third step fits a generalized Pareto distribution to exceedances over the selected dynamic threshold. Combining these three steps leads to an efficient risk forecast. Furthermore, a random weighted bootstrap method is employed to quantify the uncertainty of the derived risk forecast. Finally, we apply the proposed method to an automobile insurance data set.
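A minimal sketch of the three steps, assuming a data frame `policies` with hypothetical columns `has_claim`, `loss`, `age`, and `zone`, and a 90% quantile-regression threshold (both the covariates and the threshold level are illustrative choices, not the paper's):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from scipy.stats import genpareto

def three_step_fit(policies: pd.DataFrame, threshold_level: float = 0.90):
    # Step 1: logistic regression for the probability of a non-zero claim.
    occurrence_fit = smf.logit("has_claim ~ age + zone", data=policies).fit(disp=0)

    # Step 2: a covariate-dependent (dynamic) threshold from a quantile
    # regression fitted to the positive losses.
    positives = policies[policies["has_claim"] == 1]
    threshold_fit = smf.quantreg("loss ~ age + zone", data=positives).fit(q=threshold_level)
    thresholds = np.asarray(threshold_fit.predict(positives))

    # Step 3: a generalized Pareto distribution for the exceedances over the
    # dynamic threshold (location fixed at zero).
    excess = positives["loss"].to_numpy() - thresholds
    shape, _, scale = genpareto.fit(excess[excess > 0], floc=0.0)
    return occurrence_fit, threshold_fit, (shape, scale)
```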
Article
We consider quantile regressions (1) for adequate cyber-insurance pricing across heterogeneous policyholders and (2) for the calculation of claims costs associated with data breach events. Using a novel dataset, our study is the first to take firm-specific security information into account. We show that the impact of a firm's revenue is stronger (weaker) in the lower (upper) quantile of the cost distribution. This result suggests that mispricing may occur if small and large firms are priced using the average effect estimated by the traditional least squares approach. We find that firms with weaker security levels than the industry average are more likely to be exposed to large-cost events. Regarding data breaches, small or mid-size loss events are related to a higher cost per breached record. We compare the premiums of a quantile-based insurance pricing scheme with those of a two-part generalized linear model (GLM) and the Tweedie model to explore the usefulness of the quantile-based model in addressing heterogeneous effects of firm size. Our findings provide useful implications for cyber insurers and policymakers who wish to assess the impacts of firm-specific factors in pricing insurance and to estimate the cost of claims.
Article
In short-term non-life (e.g. car and homeowner) insurance, policies are renewed yearly. Insurance companies typically keep track of each policyholder's claims per year, resulting in longitudinal data. Efficient modeling of time dependence in longitudinal claim data will improve the prediction of future claims needed for routine actuarial practice, such as ratemaking. Insurance claim data usually follow a two-part mixed distribution: a probability mass at zero corresponding to no claim and an otherwise positive claim from a skewed and long-tailed distribution. This two-part data structure leads to difficulties in applying established models for longitudinal data. In this paper, we propose a two-part D-vine copula model to study longitudinal mixed claim data. We build two stationary D-vine copulas. One is used to model the time dependence in binary outcomes resulting from whether or not a claim has occurred. The other studies the dependence in the claim size given occurrence. Under the proposed model, the prediction of the probability of making claims and the quantiles of severity given occurrence is straightforward. We use our approach to investigate a dataset from the Local Government Property Insurance Fund in the state of Wisconsin.
Article
This article develops a two-part finite mixture quantile regression model for semi-continuous longitudinal data. The proposed methodology allows heterogeneity sources that influence the model for the binary response variable to also influence the distribution of the positive outcomes. As is common in the quantile regression literature, estimation and inference on the model parameters are based on the asymmetric Laplace distribution. Maximum likelihood estimates are obtained through the EM algorithm without parametric assumptions on the random effects distribution. In addition, a penalized version of the EM algorithm is presented to tackle the problem of variable selection. The proposed statistical method is applied to the well-known RAND Health Insurance Experiment dataset which gives further insights on its empirical behaviour.
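The asymmetric Laplace working likelihood referred to above links maximum likelihood estimation to the usual quantile (check) loss. In its standard form (shown here for reference, with μ the conditional τ-quantile and σ a scale parameter):

```latex
f_\tau(y \mid \mu, \sigma)
 = \frac{\tau(1-\tau)}{\sigma}
   \exp\!\left\{-\,\rho_\tau\!\left(\frac{y-\mu}{\sigma}\right)\right\},
\qquad
\rho_\tau(u) = u\bigl(\tau - \mathbf{1}\{u<0\}\bigr).
```

Maximising this density in μ over the observations is equivalent to minimising the check loss, which is why the asymmetric Laplace distribution serves as a working likelihood for quantile regression.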
Article
Recently, Heras et al. (2018, An application of two-stage quantile regression to insurance ratemaking, Scandinavian Actuarial Journal 9, 753–769) propose a two-step inference to forecast the Value-at-Risk of aggregated losses in insurance ratemaking by combining logistic regression and quantile regression, without discussing the critical issue of uncertainty quantification. This paper proposes a random weighted bootstrap method to quantify the estimation uncertainty and an alternative two-step inference via weighted quantile regression.
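As a simplified illustration of the random weighted bootstrap idea (applied here only to a marginal sample quantile; the paper perturbs the logistic and quantile regression estimating equations themselves, and the Exp(1) weights are one common choice rather than a prescription from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

def weighted_quantile(x: np.ndarray, w: np.ndarray, tau: float) -> float:
    """Smallest value of x whose cumulative weight share reaches tau."""
    order = np.argsort(x)
    cum = np.cumsum(w[order]) / w.sum()
    idx = min(np.searchsorted(cum, tau), x.size - 1)
    return float(x[order][idx])

def random_weighted_bootstrap(losses: np.ndarray, tau: float = 0.95,
                              n_rep: int = 2000, level: float = 0.95):
    """Perturb each observation with an Exp(1) weight and re-estimate the
    tau-quantile; the spread of the replicates quantifies the uncertainty."""
    estimate = float(np.quantile(losses, tau))
    replicates = np.empty(n_rep)
    for b in range(n_rep):
        w = rng.exponential(1.0, size=losses.size)
        replicates[b] = weighted_quantile(losses, w, tau)
    lower, upper = np.quantile(replicates, [(1 - level) / 2, (1 + level) / 2])
    return estimate, (lower, upper)
```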
Article
The projection of outstanding liabilities caused by incurred losses or claims has played a fundamental role in general insurance operations. Loss reserving methods based on individual losses generally perform better than those based on aggregate losses. This study uses a parametric individual information model taking into account not only individual losses but also individual information from the policies themselves, such as age, gender, and so on. Based on this model, this study proposes a computation procedure for the projection of the outstanding liabilities, discusses the estimation and statistical properties of the unknown parameters, and explores the asymptotic behaviors of the resulting loss reserving as the portfolio size approaches infinity. Most importantly, this study demonstrates the benefits of individual information in loss reserving. Remarkably, the accuracy gained from individual information is much greater than that gained from considering individual losses. Therefore, it is highly recommended to use individual information in loss reserving in general insurance.
Article
This paper deals with the use of parametric quantile regression for the calculation of a loaded premium, based on a quantile measure, corresponding to individual insurance risk. Heras et al. have recently introduced a ratemaking process based on a two-stage quantile regression model. In the first stage, the probability of having at least one claim is estimated by a GLM logit, whereas in the second stage several quantile regressions are necessary to estimate the severity component. The number of quantile regressions to be performed is equal to the number of risk classes selected for ratemaking. In the actuarial context, when a large number of risk classes are considered (e.g. in Motor Third Party Liability), such an approach can imply over-parameterization and be time-consuming. To this aim, in the second stage, we suggest applying a more parsimonious approach based on Parametric Quantile Regression, as introduced by Frumento and Bottai and not previously used in the actuarial context. This more parsimonious approach allows premiums to be estimated without losing efficiency compared to the traditional Quantile Regression.
Article
To better forecast the Value-at-Risk of aggregate insurance losses, Heras et al. (2018) propose a two-step inference using logistic regression and quantile regression, without providing detailed model assumptions, deriving the related asymptotic properties, or quantifying the inference uncertainty. This paper argues that the application of quantile regression in the second step is not necessary when the explanatory variables are categorical. After describing the explicit model assumptions, we propose another two-step inference using logistic regression and the sample quantile. Also, we provide an efficient empirical likelihood method to quantify the uncertainty. A simulation study confirms the good finite sample performance of the proposed method.
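A hedged sketch of that alternative for purely categorical rating factors, where the quantile regression of the second step reduces to a per-class sample quantile of the positive losses. The column names `risk_class`, `has_claim`, and `loss` are assumptions, and the class claim frequency stands in for the logistic regression estimate for brevity:

```python
import pandas as pd

def class_level_var(policies: pd.DataFrame, tau: float = 0.95) -> pd.Series:
    """Per risk class: estimate the claim probability, adjust the quantile
    level, and take the sample quantile of the positive losses."""
    results = {}
    for cls, grp in policies.groupby("risk_class"):
        p = grp["has_claim"].mean()
        if p <= 1.0 - tau:               # claim probability too low: VaR is 0
            results[cls] = 0.0
            continue
        tau_star = 1.0 - (1.0 - tau) / p
        positive_losses = grp.loc[grp["has_claim"] == 1, "loss"]
        results[cls] = float(positive_losses.quantile(tau_star))
    return pd.Series(results)
```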
Article
This article deals with the use of quantile regression and generalized linear models for a premium calculation based on quantiles. A premium principle is a functional that assigns a (usually loaded) premium to any distribution of claims. The loaded premium is generally greater than the expected value of the loss, and the difference is considered to be a risk margin or safety loading. Failure to charge the correct individual risk rate exposes the insurer to adverse selection and, consequently, to deteriorating financial results. The article's aim is to define the individual pure premium rates and the corresponding risk margin in balance with some profit or solvency constraints. We develop a ratemaking process based on a two-part model using quantile regression and a gamma generalized linear model, respectively, in order to estimate the claim severity quantiles. Generalized linear models focus on the estimation of the mean of the conditional loss distribution, but they have some drawbacks in assessing distribution moments other than the mean and are very sensitive to outliers. By contrast, quantile regression overcomes these limits, leading to estimates of the conditional quantiles, and measures the variability of a risk class more accurately. The proposed methodology for premium calculation also reveals further limits of the generalized linear model, since the solution we found, under specific assumptions for a generalized linear model, is equivalent to the application of the expected value premium principle. Finally, the methodology we suggest for the two-part quantile regression reduces the practical issues of over-modeling and over-parameterization compared with other proposals on the same topic.
Article
Full-text available
This paper concerns the study of the diversification effect involved in a portfolio of non-life policies priced via traditional premium principles when individual pure premiums are calculated via Quantile Regression. Our aim is to use Quantile Regression to estimate the individual conditional loss distribution given a vector of rating factors. To this aim, we compare the outcomes obtained via Quantile Regression with those of the widely used industry-standard method based on generalized linear models. Then, considering a specific premium principle, we calculate the individual pure premium by means of a specific functional of the conditional loss distribution, the standard deviation. We determine the portfolio risk margin according to the Solvency 2 framework and then allocate it over each policy in a way consistent with its riskiness. Indeed, considering a portfolio of heterogeneous policies, we determine the individual reduction of the safety loading due to diversification, and we measure the risk contribution of each individual policy.
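For reference, one common form of a standard-deviation-based premium and a simple proportional allocation of the portfolio risk margin RM are shown below; the allocation rule is an illustrative assumption, not necessarily the one used in the paper:

```latex
P_i = \mathbb{E}[\,S_i \mid x_i\,] + \alpha\,\sigma(S_i \mid x_i),
\qquad
RM_i = RM \cdot \frac{\sigma(S_i \mid x_i)}{\sum_{j}\sigma(S_j \mid x_j)} .
```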
Article
Classical statistics deals with the following standard problem of estimation. Given: random variables X_1, X_2, ..., X_n, independent and identically distributed, and observations x_1, x_2, ..., x_n. Estimate: a parameter (or function thereof) of the distribution function common to all X_i. It is not surprising that the "classical actuary" has mostly been involved in solving the actuarial equivalent of this problem in insurance, namely the following. Given: risks R_1, R_2, ..., R_n (no contagion, homogeneous group). Find: the proper (common) rate for all risks in the given class. There have, of course, always been actuaries who have questioned the assumptions of independence (no contagion) and/or identical distribution (homogeneity). As long as ratemaking is considered equivalent to the determination of the mean, there seem to be no additional difficulties if the hypothesis of independence is dropped. But is there a way to drop the condition of homogeneity (identical distribution)?
Article
The present notes aim at providing a basis in non-life insurance mathematics which forms a core subject of actuarial sciences. It discusses collective risk modeling, individual claim size modeling, approximations for compound distributions, ruin theory, premium calculation principles, tariffication with generalized linear models, credibility theory, claims reserving and solvency.
Article
This paper synthesizes and extends the literature on multivariate two-part regression modelling, with an emphasis on actuarial applications. To illustrate the modelling, we use data from the US Medical Expenditure Panel Survey to explore expenditures that come in two parts. In the first part, zero expenditures correspond to no payments for health care services during a year. For the second part, a positive expenditure corresponds to the payment amount, a measure of utilization. Expenditures are multivariate, the five components being (i) office-based, (ii) hospital outpatient, (iii) emergency room, (iv) hospital inpatient, and (v) home health expenditures. Not surprisingly, there is a high degree of association among expenditure types and so we utilize models that account for these associations. These models include multivariate binary regressions for the payment type and generalized linear models with Gaussian copulas for payment amounts. As anticipated, the strong associations among expenditure types allow us to establish significant model differences on an in-sample basis. Despite these strong associations, we find that commonly used statistical measures perform similarly on a held-out validation sample. In contrast, out-of-sample risk measures used by actuaries reveal differences in the association among expenditure types.
Article
We develop links between credibility theory and quantiles. More specifically, we show how quantiles can be embedded within H. Bühlmann's classical credibility model ["Experience rating and credibility", Astin Bull. 4, No. 3, 199–207 (1967)] and within C. A. Hachemeister's regression credibility model [in: Credibility: Theory and Applications, Proc. Actuarial Res. Conf., Berkeley 1974, 129–163 (1975)]. The context of influence functions is also incorporated into the above two models. For each model, credibility estimators are established and applications to real data are presented.
Article
A simple minimization problem yielding the ordinary sample quantiles in the location model is shown to generalize naturally to the linear model, generating a new class of statistics we term "regression quantiles." The estimator which minimizes the sum of absolute residuals is an important special case. Some equivariance properties and the joint asymptotic distribution of regression quantiles are established. These results permit a natural generalization to the linear model of certain well-known robust estimators of location. Estimators are suggested which have comparable efficiency to least squares for Gaussian linear models while substantially out-performing the least-squares estimator over a wide class of non-Gaussian error distributions.
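In the now-standard notation, the τ-th regression quantile is defined through the check function; this is the textbook formulation rather than a quotation from the article:

```latex
\hat{\beta}(\tau)
 = \arg\min_{\beta \in \mathbb{R}^{m}}
   \sum_{i=1}^{n} \rho_\tau\!\bigl(y_i - x_i^{\top}\beta\bigr),
\qquad
\rho_\tau(u) = u\bigl(\tau - \mathbf{1}\{u < 0\}\bigr),
```

with τ = 1/2 recovering the least-absolute-residuals (median regression) estimator highlighted in the abstract.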
Chapter
In this review article, we discuss how premium principles arise in theory and in practice. We propose three methods that actuaries use to derive premium principles: (1) the ad hoc method in which the actuary simply proposes a (potentially) reasonable premium principle then studies its properties; (2) the characterization method in which the actuary lists the properties that she wants the premium principle to satisfy, then finds (all) the premium principles that satisfy those properties; and (3) the economic method in which the actuary espouses an economic theory and determines the resulting premium principle. These methods are not exclusive, as we note in the article.
Book
This is the only book actuaries need to understand generalized linear models (GLMs) for insurance applications. GLMs are used in the insurance industry to support critical decisions. Until now, no text has introduced GLMs in this context or addressed the problems specific to insurance data. Using insurance data sets, this practical, rigorous book treats GLMs, covers all standard exponential family distributions, extends the methodology to correlated data structures, and discusses recent developments which go beyond the GLM. The issues in the book are specific to insurance data, such as model selection in the presence of large data sets and the handling of varying exposure times. Exercises and data-based practicals help readers to consolidate their skills, with solutions and data sets given on the companion website. Although the book is package-independent, SAS code and output examples feature in an appendix and on the website. In addition, R code and output for all the examples are provided on the website.
Book
This text gives budding actuaries and financial analysts a foundation in multiple regression and time series. They will learn about these statistical techniques using data on the demand for insurance, lottery sales, foreign exchange rates, and other applications. Although no specific knowledge of risk management or finance is presumed, the approach introduces applications in which statistical techniques can be used to analyze real data of interest. In addition to the fundamentals, this book describes several advanced statistical topics that are particularly relevant to actuarial and financial practice, including the analysis of longitudinal, two-part (frequency/severity), and fat-tailed data. Datasets with detailed descriptions, sample statistical software scripts in ‘R’ and ‘SAS’, and tips on writing a statistical report, including sample projects, can be found on the book’s Web site: http://research.bus.wisc.edu/RegActuaries.
Article
Regression models are popular tools for ratemaking in the framework of heterogeneous insurance portfolios; however, the traditional regression methods have some disadvantages, particularly their sensitivity to assumptions, which significantly restricts the area of their applications. This paper is devoted to an alternative approach, quantile regression, which is free of some of the disadvantages of the traditional models. The quality of estimators for the approach described is approximately the same as, or sometimes better than, that for the traditional regression methods. Moreover, quantile regression is consistent with the idea of using a distribution quantile for ratemaking. This paper provides detailed comparisons between the approaches and gives a practical example of using the new methodology.
Article
In this paper we investigate the adequacy of the own funds a company requires in order to remain healthy and avoid insolvency. Two methods are applied here: the quantile regression method and the method of mixed effects models. Quantile regression is capable of providing a more complete statistical analysis of the stochastic relationship among random variables than least squares estimation. The estimated mixed effects line can be considered as an internal industry equation (norm), which explains a systematic relation between a dependent variable (such as own funds) and independent variables (e.g. financial characteristics, such as assets, provisions, etc.). The above two methods are implemented with two data sets.