Document Type: Research Article


University of Sfax, Probability and Statistics Laboratory


Policyholder classification is important in helping actuaries make sound decisions when predicting optimal premiums. This paper proposes, first, an optimal construction of policyholder classes. Second, a Poisson-negative binomial mixture regression model is proposed as an alternative for dealing with the overdispersion of these classes. The proposed method is distinctive in that it uses Tunisian data and classifies the insured population with the K-means approach, an unsupervised machine learning algorithm. Model choice becomes extremely difficult because of the presence of a zero mass in one of the classes and the significant degree of overdispersion. For this purpose, we propose a mixture regression model that allows us to estimate the density of each class and to predict its probability distribution, which in turn helps us understand the underlying properties of the data. In the learning phase, we estimate the model parameters using the Expectation-Maximization algorithm. This allows us to determine the probability of occurrence for each new insured and so to create the most accurate classification. The goal of using mixture regression is to obtain a classification that is as heterogeneous as possible while achieving a better approximation. The proposed mixture regression model, which uses a number of factors, has been evaluated on several criteria, including mean square error, variance, chi-square test and accuracy. According to the experimental findings on several datasets, the approach can reach an overall accuracy of 80%. Finally, the application to real Tunisian data shows the effectiveness of the mixture regression model.
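To make the learning phase more concrete, the following is a minimal sketch of an Expectation-Maximization loop for a two-component mixture of a Poisson and a negative binomial distribution. It is an illustration under simplifying assumptions, not the paper's implementation: there are no covariates (so this is a plain mixture rather than a mixture regression), the negative binomial dispersion `r` is treated as known and fixed so that the M-step has a closed form, and the function names, the starting values and the toy counts are all hypothetical.

```python
import math


def poisson_pmf(y, lam):
    """Poisson probability of the count y for a mean lam > 0."""
    return math.exp(y * math.log(lam) - lam - math.lgamma(y + 1))


def negbin_pmf(y, mu, r):
    """Negative binomial probability of y, with mean mu and dispersion r
    (variance = mu + mu**2 / r)."""
    p = r / (r + mu)
    log_pmf = (math.lgamma(y + r) - math.lgamma(r) - math.lgamma(y + 1)
               + r * math.log(p) + y * math.log(1.0 - p))
    return math.exp(log_pmf)


def em_poisson_negbin(counts, r=2.0, n_iter=200, tol=1e-8):
    """EM for weight * Poisson(lam) + (1 - weight) * NegBin(mu, r).

    Assumption: r is fixed, so each M-step update is a weighted mean.
    """
    n = len(counts)
    mean = max(sum(counts) / n, 1e-6)
    # Hypothetical starting values: two components around the sample mean.
    weight, lam, mu = 0.5, 0.5 * mean, 1.5 * mean
    history = []
    for _ in range(n_iter):
        # E-step: posterior probability that each count is from the Poisson part.
        resp = []
        loglik = 0.0
        for y in counts:
            a = weight * poisson_pmf(y, lam)
            b = (1.0 - weight) * negbin_pmf(y, mu, r)
            resp.append(a / (a + b))
            loglik += math.log(a + b)
        history.append(loglik)
        if len(history) > 1 and abs(history[-1] - history[-2]) < tol:
            break
        # M-step: closed-form updates; small floors keep the means positive.
        n_pois = max(sum(resp), 1e-12)
        n_nb = max(n - n_pois, 1e-12)
        weight = n_pois / n
        lam = max(sum(g * y for g, y in zip(resp, counts)) / n_pois, 1e-9)
        mu = max(sum((1.0 - g) * y for g, y in zip(resp, counts)) / n_nb, 1e-9)
    return {"weight": weight, "lam": lam, "mu": mu, "loglik_history": history}


# Synthetic claim counts with many zeros and a long tail (not real data).
toy_counts = [0] * 50 + [1] * 20 + [2] * 10 + [3] * 6 + [5] * 5 + [8] * 4 + [12] * 3 + [20] * 2
fit = em_poisson_negbin(toy_counts)
print(fit["weight"], fit["lam"], fit["mu"])
```

The E-step responsibilities are the quantities that would be used to assign a new insured to a component. Estimating the dispersion as well, or letting the means depend on rating factors through a log link, would require a numerical M-step, which this sketch deliberately leaves out.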

