Data Science Credit Risk Modelling



The full credit risk modelling cycle

At Booking.com, I led a project to improve credit risk evaluation across Booking's global partner base, resulting in a comprehensive method that increased our prediction accuracy and operational efficiency. We began purely from a business perspective, defining what we mean by 'risk'. We ended up with a system that categorizes partners into different levels of risk. This approach provides a detailed view not only of each partner's risk level but also of the entire portfolio's global risk exposure. It is now used to inform various partner-facing actions and to measure several key performance indicators (KPIs) related to Booking's treasury operations.

  • Talking to Stakeholders is the essential, and often overlooked, first step in such a cycle. Understanding the problem down to the smallest detail is necessary for an optimal outcome.
  • Data Gathering from Multiple Source Systems was an early and critical phase, where we aggregated diverse data sets across various internal platforms. This phase ensured a rich and robust dataset, essential for accurate model training and prediction.
  • Setting Up an ETL Pipeline was instrumental in streamlining our data workflow. By automating the extraction, transformation, and loading processes, we achieved a seamless flow of information, significantly reducing manual intervention and data processing time.
  • Feature Selection Using Information Value allowed us to identify the most predictive variables for default. This technique enabled us to streamline our model by focusing on the most impactful features, enhancing both efficiency and predictive power.
  • Feature Transformation using Weight of Evidence involved adjusting data variables to better reflect their relationship with the probability of default. This step was crucial for improving model accuracy and ensuring that our predictions were grounded in the real-world dynamics of credit risk.
  • Bayesian Hyperparameter Tuning was employed to optimize our models. By adopting a Bayesian approach to tuning, we navigated the complex hyperparameter space more effectively, enhancing model performance without overfitting.
  • Modelling Using Logistic Regression emerged as our strategy of choice, following extensive experimentation with LightGBM and Random Forest models. The decision was heavily influenced by the need for model explainability, a critical requirement in financial services for regulatory compliance and internal stakeholder trust.
  • Optimal Threshold Finding Based Upon F1 Score, in close consultation with the business, was a key phase where we balanced precision and recall to find the best operational point. This approach ensured our model's practical utility in identifying default risk while maintaining operational efficiency.
  • Coefficient Calibration followed, fine-tuning our model to reflect the true scale and impact of each feature. This calibration process was pivotal in ensuring that our predictions accurately mirrored the likelihood of default.
  • Risk Bucketing Using K-Means allowed us to segment potential defaults into distinct risk categories. This segmentation enabled tailored risk management strategies, optimizing both risk mitigation efforts and resource allocation.
  • Inference Pipeline Implementation was the culmination of our efforts, where we deployed the model into production. This pipeline facilitated real-time risk assessment, enabling dynamic risk management and decision-making.
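To make the ETL phase concrete, here is a minimal sketch of the extract-transform-load pattern in pandas. The source frames, column names, and the `overdue_ratio` feature are all hypothetical stand-ins, not Booking's actual schema:

```python
import pandas as pd

# Extract: in production these would come from different source systems;
# here two in-memory frames stand in for bookings and payments extracts.
bookings = pd.DataFrame({"partner_id": [1, 2, 3],
                         "gross_bookings": [100.0, 250.0, 80.0]})
payments = pd.DataFrame({"partner_id": [1, 2],
                         "overdue_amount": [10.0, 0.0]})

# Transform: join per partner and derive a model-ready feature.
merged = bookings.merge(payments, on="partner_id", how="left")
merged["overdue_amount"] = merged["overdue_amount"].fillna(0.0)
merged["overdue_ratio"] = merged["overdue_amount"] / merged["gross_bookings"]

# Load: write to the feature store (a CSV file stands in here).
merged.to_csv("partner_features.csv", index=False)
print(merged)
```

In a real pipeline each step would be a scheduled, monitored job; the point of automating them is that no one has to run these joins by hand.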
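The Information Value step can be sketched as follows. This is an illustrative implementation on synthetic data using quantile binning, not our production code:

```python
import numpy as np
import pandas as pd

def information_value(feature, target, bins=5):
    """Information Value of one feature against a binary default flag
    (1 = default), using quantile binning."""
    df = pd.DataFrame({"x": feature, "y": target})
    df["bin"] = pd.qcut(df["x"], q=bins, duplicates="drop")
    grouped = df.groupby("bin", observed=True)["y"].agg(["sum", "count"])
    bad = grouped["sum"]                      # defaults per bin
    good = grouped["count"] - grouped["sum"]  # non-defaults per bin
    # Small additive smoothing avoids log(0) in empty cells.
    dist_bad = (bad + 0.5) / (bad.sum() + 0.5 * len(grouped))
    dist_good = (good + 0.5) / (good.sum() + 0.5 * len(grouped))
    woe = np.log(dist_good / dist_bad)
    return float(((dist_good - dist_bad) * woe).sum())

rng = np.random.default_rng(0)
x = rng.normal(size=2000)
# Synthetic target: higher x means higher default probability.
y = (rng.random(2000) < 1 / (1 + np.exp(-2 * x))).astype(int)
noise = rng.normal(size=2000)

print(information_value(x, y))      # predictive feature: large IV
print(information_value(noise, y))  # pure noise: IV near zero
```

Ranking features by IV like this is what let us drop weakly predictive variables before modelling.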
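The Weight of Evidence transformation replaces each raw value with the WoE of the bin it falls into. A minimal sketch on synthetic data (quantile bins learned on training data, then applied to new values):

```python
import numpy as np
import pandas as pd

def fit_woe_map(feature, target, bins=5):
    """Learn a bin -> Weight of Evidence mapping on training data."""
    df = pd.DataFrame({"x": feature, "y": target})
    df["bin"], edges = pd.qcut(df["x"], q=bins, retbins=True,
                               duplicates="drop")
    grouped = df.groupby("bin", observed=True)["y"].agg(["sum", "count"])
    bad = grouped["sum"] + 0.5                      # smoothed defaults
    good = grouped["count"] - grouped["sum"] + 0.5  # smoothed non-defaults
    woe = np.log((good / good.sum()) / (bad / bad.sum()))
    return edges, woe.to_numpy()

def apply_woe(feature, edges, woe_values):
    """Replace raw values with the WoE of the bin they fall into."""
    idx = np.clip(np.searchsorted(edges, feature, side="right") - 1,
                  0, len(woe_values) - 1)
    return woe_values[idx]

rng = np.random.default_rng(1)
x = rng.normal(size=1000)
y = (rng.random(1000) < 1 / (1 + np.exp(-x))).astype(int)
edges, woe = fit_woe_map(x, y)
x_woe = apply_woe(x, edges, woe)
print(x_woe[:5])
```

Because WoE is on the log-odds scale, the transformed features feed naturally into logistic regression and make each feature's direction of effect easy to read.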
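Bayesian hyperparameter tuning can be sketched from first principles: fit a Gaussian process to the scores observed so far and evaluate next where Expected Improvement is highest. This toy version tunes a single logistic regression hyperparameter (`C`, on a log scale) on synthetic data; in practice a library would handle this, and the search range here is arbitrary:

```python
import numpy as np
from scipy.stats import norm
from sklearn.datasets import make_classification
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=600, n_features=10, random_state=0)

def objective(log_c):
    """Cross-validated AUC of a logistic regression with C = 10**log_c."""
    model = LogisticRegression(C=10.0 ** log_c, max_iter=1000)
    return cross_val_score(model, X, y, cv=3, scoring="roc_auc").mean()

# Start from a few random evaluations, then let a Gaussian process
# propose each next point via Expected Improvement.
rng = np.random.default_rng(0)
obs_x = list(rng.uniform(-3, 3, size=3))
obs_y = [objective(v) for v in obs_x]
gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)

for _ in range(7):
    gp.fit(np.array(obs_x).reshape(-1, 1), obs_y)
    grid = np.linspace(-3, 3, 200).reshape(-1, 1)
    mu, sigma = gp.predict(grid, return_std=True)
    best, sigma = max(obs_y), np.maximum(sigma, 1e-9)
    imp = mu - best
    ei = imp * norm.cdf(imp / sigma) + sigma * norm.pdf(imp / sigma)
    nxt = float(grid[np.argmax(ei), 0])
    obs_x.append(nxt)
    obs_y.append(objective(nxt))

best_log_c = obs_x[int(np.argmax(obs_y))]
print("best C:", 10.0 ** best_log_c, "AUC:", max(obs_y))
```

The appeal over grid search is that each evaluation is spent where the surrogate model expects the most gain, which matters when a single cross-validated fit is expensive.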
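The explainability argument for logistic regression is concrete: every coefficient is a log-odds contribution that can be read off directly. A small sketch on synthetic data with hypothetical feature names:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=4, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X, y)

# Each coefficient is a log-odds contribution per unit of the feature,
# so the model's reasoning can be explained feature by feature to
# regulators and internal stakeholders.
for name, coef in zip(["f0", "f1", "f2", "f3"], model.coef_[0]):
    print(f"{name}: {coef:+.3f} log-odds per unit")
```

A LightGBM model may score slightly better, but no single number in it answers "why was this partner flagged?" the way these coefficients do.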
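Finding the F1-optimal threshold amounts to sweeping candidate cut-offs and picking the one where precision and recall balance best. A sketch on a synthetic imbalanced dataset:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve

# Imbalanced data: roughly 10% defaults, as is typical for credit risk.
X, y = make_classification(n_samples=800, weights=[0.9], random_state=0)
probs = LogisticRegression(max_iter=1000).fit(X, y).predict_proba(X)[:, 1]

precision, recall, thresholds = precision_recall_curve(y, probs)
# The curve has one more (precision, recall) pair than thresholds,
# so drop the final point before computing F1 per threshold.
f1 = (2 * precision[:-1] * recall[:-1]
      / np.maximum(precision[:-1] + recall[:-1], 1e-9))
best = thresholds[np.argmax(f1)]
print("best threshold:", best, "F1:", f1.max())
```

In practice the threshold should be chosen on a held-out set, and the business may deliberately shift it away from the F1 optimum when missed defaults cost more than false alarms.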
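One common form of calibration, shown here as an illustration rather than our exact procedure, is the prior-correction offset: when the model was trained on an over-sampled default rate, its intercept is shifted so that output PDs match the true portfolio rate:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def calibrated_pd(linear_score, sampled_rate, portfolio_rate):
    """Shift the log-odds so PDs reflect the true portfolio default
    rate rather than the (over-sampled) training rate."""
    offset = (np.log(sampled_rate / (1 - sampled_rate))
              - np.log(portfolio_rate / (1 - portfolio_rate)))
    return sigmoid(linear_score - offset)

# Model trained on a balanced sample (50% defaults), but the real
# portfolio default rate is 3%: a raw score of 0 (PD 0.5 in-sample)
# calibrates down to roughly 0.03.
scores = np.array([-2.0, 0.0, 2.0])
print(calibrated_pd(scores, sampled_rate=0.5, portfolio_rate=0.03))
```

The correction only moves the intercept, so the ranking of partners is unchanged; what changes is that the predicted probabilities become usable as actual default likelihoods.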
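Risk bucketing with k-means can be sketched as clustering the one-dimensional predicted PDs and relabelling the clusters so bucket 0 is the lowest risk. The skewed beta distribution below is a synthetic stand-in for a real PD portfolio:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
pds = rng.beta(1, 20, size=1000)  # skewed predicted PDs, synthetic

# Cluster the 1-D PDs into risk buckets, then sort the cluster
# centres so that bucket labels increase with risk.
km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(pds.reshape(-1, 1))
order = np.argsort(km.cluster_centers_.ravel())
relabel = np.empty(4, dtype=int)
relabel[order] = np.arange(4)
buckets = relabel[km.labels_]

for b in range(4):
    print(f"bucket {b}: mean PD {pds[buckets == b].mean():.3f}")
```

Letting k-means place the boundaries adapts the buckets to where the portfolio's PDs actually concentrate, instead of imposing arbitrary fixed cut-offs.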
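Finally, the shape of an inference pipeline: fit the preprocessing and model once as a single object, then reuse it to score partners on demand. A minimal sklearn sketch with a hypothetical `score_partner` entry point:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Train once; the fitted pipeline bundles preprocessing and model so
# scoring always applies the exact transformations seen in training.
X, y = make_classification(n_samples=500, random_state=0)
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
]).fit(X, y)

def score_partner(features):
    """Probability of default for one partner's feature vector."""
    return float(pipeline.predict_proba(
        np.asarray(features).reshape(1, -1))[:, 1])

print(score_partner(X[0]))  # PD for the first partner in the batch
```

In production this object would be serialized and served behind a batch job or API, which is what enables the real-time risk assessment described above.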