Datamining models for scoring the risk-evaluation using quantitative and qualitative information
Silvia Carducci | Università degli Studi di Bologna (13/07/2011)
Degree programme: Direzione Aziendale (Business Management), curriculum International Management
Supervisor: Prof. Furio Camillo
This research aims to provide, through an integrated data mining approach, a model that improves credit scoring and risk evaluation practices by better addressing the estimation of the default probability. The result is a more accurate prediction of the default probability of new clients, based on the analysis of both qualitative and quantitative information collected on past clients. This can be very useful for banks and lending institutions, because by allocating the risk they face more accurately and granting credit more carefully they can achieve significant savings; at the same time it can benefit society as a whole through a better overall credit management policy. The final decision will of course be made by the specific lending institution on a case-by-case basis; however, the research proposes an approach that is accurate without being too restrictive, thereby allowing those who potentially deserve credit to receive it.
The analysis focuses on the concept of credit risk from a data mining and statistical point of view, with the evaluation of the default probability as its pillar. A first, theoretical part describes the most common data mining methods, procedures and algorithms, providing the main points as well as mathematical insights taken from the established literature. A second, empirical part deals with a real-world dataset of over 85,000 firms, obtained through an agreement with the UniCredit bank, which has been analysed in order to provide a realistic approach. Different data mining models have been considered in order to predict the default probability of a new potential client applying to the bank for credit. The default probability is estimated from past qualitative and quantitative information. Clients are then classified into one of two groups, good or bad clients, and are granted the credit or not accordingly. Following the Basel procedures, it is more correct to speak of default probability; for the purpose of this research, the good clients are therefore those predicted not to default within twelve months of receiving the credit, while the bad clients are those predicted to default.
The empirical part has been carried out using SAS analytics software which, thanks to a wide variety of tools and solutions, has been very useful for analysing the data, simulating a real-world situation and calculating the scoring functions. The SAS software has been fundamental for investigating the dataset, for applying data mining models aimed at identifying creditworthiness, and thus for understanding the needs of customers and banks with a view to improving the services offered. In detail, both SAS BASE and SAS Enterprise Guide have been used: the former offers high flexibility thanks to the programmability of its procedures, while the latter offers a more immediate approach for the most common models.
As regards the specific analysis undertaken for this research, the first data mining model considered is logistic regression, chosen because it predicts a binary variable well and because it is the model currently used by UniCredit. When performing the logistic regression, a correction has been applied, namely under-sampling. The bad client is a rare event, since only 4% of UniCredit's past clients defaulted; hence the bad clients are difficult to predict. An asymmetry has been found between the predictions for the two groups: the good clients are always predicted better than the bad ones, although identifying the bad clients should be the first priority of the lending institution. The overall predictive capacity of the logistic regression, performed on the dataset and validated with an out-of-sample test, can be considered good since it is above 70%; however, there may still be room for improvement. In order to determine whether the predictive power can be increased, discriminant analysis has been chosen, which the literature shows to be a valid alternative to logistic regression.
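The under-sampling correction and the asymmetry between the two groups can be illustrated with a minimal Python sketch. It is not the SAS code used in the research: the function names (`undersample`, `per_class_accuracy`), the field name `default` and the synthetic portfolio are assumptions made for illustration only, and the balanced one-to-one sampling ratio is one possible choice, not necessarily the one adopted in the thesis.

```python
import random
from collections import Counter


def undersample(rows, label="default", seed=0):
    """Keep every rare-class row (default == 1) and an equally sized
    random draw of majority-class rows (default == 0)."""
    rng = random.Random(seed)
    bad = [r for r in rows if r[label] == 1]
    good = [r for r in rows if r[label] == 0]
    sampled_good = rng.sample(good, k=min(len(bad), len(good)))
    balanced = bad + sampled_good
    rng.shuffle(balanced)
    return balanced


def per_class_accuracy(y_true, y_pred):
    """Share of correctly predicted clients within each true class."""
    hits, totals = Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        totals[t] += 1
        hits[t] += int(t == p)
    return {cls: hits[cls] / totals[cls] for cls in totals}


# Hypothetical portfolio: 1000 past clients, 4% of whom defaulted.
portfolio = [{"id": i, "default": 1 if i < 40 else 0} for i in range(1000)]
balanced = undersample(portfolio)
counts = Counter(r["default"] for r in balanced)  # 40 bad, 40 good

# Hypothetical predictions: overall accuracy is 90%, yet only half of
# the bad clients are recognised, which is the asymmetry described above.
y_true = [0] * 8 + [1] * 2
y_pred = [0] * 8 + [1, 0]
acc = per_class_accuracy(y_true, y_pred)  # {0: 1.0, 1: 0.5}
```

Reporting the accuracy separately for each group, rather than only the overall figure, is what makes the asymmetry visible: with a 4% default rate, a model can reach a high overall score while still missing many bad clients.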
Since the dataset contains both qualitative and quantitative variables, a major problem is that discriminant analysis can be applied to quantitative variables only. To overcome this limitation, Prof. Gilbert Saporta developed a specific model, called Disqual, that allows discriminant analysis to be performed on qualitative variables. With the help of Prof. Saporta, Disqual has been studied and then applied to the dataset. It should be remembered that Disqual performs a multiple correspondence analysis of the qualitative variables in order to obtain factorial axes, which are then used as input to a SAS procedure (DISCRIM or CANDISC) that performs the discriminant analysis. The results show that the model predicts the default probability even better than the logistic regression: its overall predictive capacity is very high, around 80%, and the asymmetry between the two groups also decreases. It should also be highlighted that including qualitative variables in the model improves the prediction. Nevertheless, the research demonstrates that there is margin for improvement in the field of credit risk evaluation and that, even if resources may be needed to investigate the problem further, improving the estimation of the default probability yields a series of positive outcomes.
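The two steps of Disqual described above, a multiple correspondence analysis followed by a discriminant analysis on the factorial axes, can be sketched in Python with NumPy. This is only an illustrative sketch under stated assumptions, not the SAS implementation used in the research: the discriminant step is a two-group Fisher linear discriminant with a small ridge term added for numerical stability (both are assumptions), the cut-off assumes equal priors and costs, and the ten firms with their variables (legal form, payment history) are invented toy data.

```python
import numpy as np


def indicator_matrix(X):
    """Disjunctive coding: one 0/1 column per category of each variable."""
    blocks = []
    for j in range(X.shape[1]):
        cats, codes = np.unique(X[:, j], return_inverse=True)
        blocks.append(np.eye(len(cats))[codes])
    return np.hstack(blocks)


def mca_row_coordinates(Z, n_axes):
    """Multiple correspondence analysis of the indicator matrix Z.
    Returns the principal coordinates of the rows on the first axes."""
    P = Z / Z.sum()
    r = P.sum(axis=1)  # row masses
    c = P.sum(axis=0)  # column masses
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    U, sing, _ = np.linalg.svd(S, full_matrices=False)
    return (U[:, :n_axes] * sing[:n_axes]) / np.sqrt(r)[:, None]


def fisher_direction(F, y, ridge=1e-8):
    """Two-group Fisher discriminant on the factorial coordinates F.
    Returns the weight vector and the midpoint cut-off."""
    F0, F1 = F[y == 0], F[y == 1]
    m0, m1 = F0.mean(axis=0), F1.mean(axis=0)
    Sw = (F0 - m0).T @ (F0 - m0) + (F1 - m1).T @ (F1 - m1)
    Sw /= len(F) - 2  # pooled within-group covariance
    w = np.linalg.solve(Sw + ridge * np.eye(F.shape[1]), m1 - m0)
    return w, w @ (m0 + m1) / 2


# Hypothetical toy data: legal form and payment history of ten firms.
X = np.array([
    ["srl", "regular"], ["srl", "regular"], ["spa", "regular"],
    ["spa", "regular"], ["spa", "late"], ["srl", "regular"],
    ["srl", "late"], ["spa", "late"], ["srl", "late"], ["srl", "regular"],
])
y = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])  # 1 = defaulted (bad client)

Z = indicator_matrix(X)               # 10 firms x 4 categories
F = mca_row_coordinates(Z, n_axes=2)  # factorial axes
w, cutoff = fisher_direction(F, y)
scores = F @ w - cutoff               # positive score: closer to bad clients
predicted_bad = scores > 0
```

With two binary variables there are at most two non-trivial factorial axes, so `n_axes=2` keeps all the information here; on a real dataset the number of axes retained is a modelling choice.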
First of all, a bank or lending institution that chooses to allocate resources to improving its credit-evaluation model has a good chance of achieving significant savings thanks to a better classification of clients and a safer credit policy. Last but not least, clients that have the potential to receive credit will be able to do so, which is important for promoting economic growth and investment. The correct prediction of the default probability is at the centre of several recent studies and has become a fundamental issue in recent times, given the subprime mortgage crisis, the fear of banks carrying uncovered risks and the need to boost the economy so that it can restart; this research can therefore offer useful insights on the subject.