Data driven initialization for machine learning classification models

López Jaimes, David Santiago

doi:https://doi.org/10.48713/10336_34737

Open Access

https://repository.urosario.edu.co/handle/10336/34737

Authors

López Jaimes, David Santiago

Date

2022-05-08

Advisors

Caicedo Dorado, Alexander

Publisher

Universidad del Rosario

Summary

The main objective of this degree project is to develop a strategy for initializing the parameters θ of logistic regression (linear classifier), multinomial regression, and classical neural networks (fully connected feed-forward). The initialization is based on the properties of the statistical distribution of the data on which the models are trained. The aim is to start the model in a more suitable region of the cost function, so that it improves its convergence rate and produces better results in shorter training times. The thesis presents an intuitive and mathematical explanation of the proposed initialization models, and contrasts the theoretical development with a benchmark that used different datasets, including toy examples. It also analyzes these results and discusses the limitations of the proposals and the future work that can be derived from them.

Abstract

Thanks to advances in technology, the growth of computing resources, and the strong impact of the "big data" era on society, artificial intelligence has become a widely studied and widely used field. Machine learning is a branch of artificial intelligence whose main objective is to build models that can learn from a set of data without being explicitly programmed. These models use tools from different branches of mathematics, such as statistics and linear algebra, to identify patterns and relationships in the data. In particular, machine learning allows us to build models that classify data based on its intrinsic characteristics and its relationship with a target variable. These models are widely used in real-life problems, such as classifying a bank transaction as malicious or normal, estimating the probability that a tumor is malignant or benign, or estimating a person’s credit risk. Most of these classification models learn through gradient descent or one of its variants, an iterative algorithm that finds the parameters θ of the model that minimize a cost function and yield an adequate classification. These parameters are usually initialized randomly. However, training these models has several limitations. Real data is typically high-dimensional, and it is difficult to know the shape of the cost surface it generates. As a result, the models require careful tuning and a large amount of time and computational resources to train. Moreover, because in most cases the cost function is not convex, as is normally the case for neural networks, random weight initialization may cause the algorithm to stall because it started in a flat region of the cost function, or to start in a very rough region and fail to converge to an appropriate minimum.
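
The training loop described above can be sketched as follows. This is a minimal illustration, not code from the thesis: the toy data, the learning rate `lr`, the number of `steps`, and all function names are assumptions, and only the Python standard library is used to keep the block self-contained. It runs batch gradient descent on a logistic regression whose parameters θ start from a random draw.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def cost(theta, X, y):
    """Mean cross-entropy of a logistic regression with parameters theta."""
    eps = 1e-12  # avoids log(0)
    total = 0.0
    for xi, yi in zip(X, y):
        p = sigmoid(sum(t * v for t, v in zip(theta, xi)))
        total -= yi * math.log(p + eps) + (1 - yi) * math.log(1 - p + eps)
    return total / len(X)


def gradient_descent(X, y, theta, lr=0.1, steps=200):
    """Batch gradient descent; returns the final theta and the cost history."""
    theta = list(theta)
    history = [cost(theta, X, y)]
    for _ in range(steps):
        grad = [0.0] * len(theta)
        for xi, yi in zip(X, y):
            err = sigmoid(sum(t * v for t, v in zip(theta, xi))) - yi
            for j, v in enumerate(xi):
                grad[j] += err * v / len(X)
        theta = [t - lr * g for t, g in zip(theta, grad)]
        history.append(cost(theta, X, y))
    return theta, history


# Toy data (made up): two Gaussian blobs; the leading 1.0 is the bias feature.
rng = random.Random(0)
X = [[1.0, rng.gauss(-2, 1), rng.gauss(-2, 1)] for _ in range(50)]
X += [[1.0, rng.gauss(2, 1), rng.gauss(2, 1)] for _ in range(50)]
y = [0] * 50 + [1] * 50

theta0 = [rng.uniform(-1, 1) for _ in range(3)]  # random initialization
theta, history = gradient_descent(X, y, theta0)
```

Comparing `history[0]` with `history[-1]` shows how far the cost fell from its random starting point. That starting point is what an initialization strategy changes.
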
This is why the present study proposes an initialization strategy for classification problems that places the models in an appropriate region of the cost function, in order to improve their convergence rate and produce better results in shorter training times. We propose a new deterministic initialization strategy for logistic regression (linear classifier), multinomial logistic regression, and classical neural networks (fully connected feed-forward) for classification problems. The strategy is based on the properties of the statistical distribution of the data on which the models are trained. For logistic regression and multinomial logistic regression, we propose to initialize the models with a characteristic vector of the data distribution of each class, such as its mean or median. For fully connected feed-forward neural networks, we propose to use prototype data from each of the classes. These prototypes are not the most representative data of the entire class distribution; rather, they are data that map and linearize the separation boundary with the other classes. The proposed initialization was benchmarked on various real datasets for classification tasks from the UCI and Kaggle repositories, and was also tested on different toy examples. For logistic regression, we compared the behavior of the model under random initialization and under the proposed initialization. For fully connected feed-forward neural networks, we compared the proposed initialization with the state-of-the-art initializations for these models, Xavier’s and He’s. In both cases, we were able to initialize the models successfully, reducing their required training time and making the learning algorithm start in a better region of the cost function.
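
For reference, the two random baselines named above are commonly defined as follows. This sketch uses the standard textbook forms (Xavier/Glorot uniform and He normal). The abstract does not say which variants the thesis benchmarked, so the function names, layer sizes, and choice of variant here are assumptions.

```python
import math
import random


def xavier_uniform(fan_in, fan_out, rng):
    """Glorot/Xavier: U(-a, a) with a = sqrt(6 / (fan_in + fan_out))."""
    a = math.sqrt(6.0 / (fan_in + fan_out))
    return [[rng.uniform(-a, a) for _ in range(fan_in)] for _ in range(fan_out)]


def he_normal(fan_in, fan_out, rng):
    """He: N(0, 2 / fan_in), the variant usually paired with ReLU units."""
    std = math.sqrt(2.0 / fan_in)
    return [[rng.gauss(0.0, std) for _ in range(fan_in)] for _ in range(fan_out)]


rng = random.Random(0)
W_xavier = xavier_uniform(fan_in=64, fan_out=32, rng=rng)  # 32 rows x 64 columns
W_he = he_normal(fan_in=64, fan_out=32, rng=rng)
```

Both schemes depend only on the layer sizes, not on the training data. That is the contrast with the data-driven initialization proposed here.
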
In this way, we proposed new initialization strategies for multinomial logistic regression and neural network models for classification problems. The logistic regression initialization is based on statistical estimators of the data distribution and distance metrics, particularly the mean and the Euclidean distance between different scalar products. The neural network strategy is based on linearizing the decision boundary using prototype data from each of the classes. Our approach worked very well on all the tested datasets, considerably reducing the computational resources required to train these models and increasing their performance.
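
To make the idea of a mean-based starting point concrete, here is one hypothetical way to build it for the binary case. It is not the formula from the thesis, which the abstract does not spell out: the rule used here (weights equal to the difference of the class means, bias chosen so the boundary passes through their midpoint) and the names `class_means` and `mean_based_init` are assumptions made for illustration.

```python
def class_means(X, y):
    """Return {label: mean feature vector} for the rows of X grouped by y."""
    sums, counts = {}, {}
    for xi, yi in zip(X, y):
        acc = sums.setdefault(yi, [0.0] * len(xi))
        for j, v in enumerate(xi):
            acc[j] += v
        counts[yi] = counts.get(yi, 0) + 1
    return {k: [s / counts[k] for s in acc] for k, acc in sums.items()}


def mean_based_init(X, y):
    """Hypothetical data-driven start for binary logistic regression.

    Weights point from the class-0 mean to the class-1 mean, and the bias
    places the decision boundary at the midpoint between the two means.
    """
    mu = class_means(X, y)
    w = [b - a for a, b in zip(mu[0], mu[1])]
    mid = [(a + b) / 2.0 for a, b in zip(mu[0], mu[1])]
    bias = -sum(wj * mj for wj, mj in zip(w, mid))
    return w, bias


# Tiny made-up dataset with two features and labels 0/1.
X = [[0.0, 0.0], [1.0, 0.0], [4.0, 4.0], [5.0, 4.0]]
y = [0, 0, 1, 1]
w, bias = mean_based_init(X, y)  # w == [4.0, 4.0], bias == -18.0
```

On this toy set the initial boundary already separates the two classes before any gradient step. This is the kind of head start a deterministic, data-driven initialization aims for.
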

Keywords

Neural Networks, Logistic Regression, Gradient Descent, Parameters of a Classification Model, Characteristic Vectors, Class Distribution