Development of an MLOps Library: Versioning, Traceability, and Automation of the Model Lifecycle in Big Data Environments

dc.contributor.advisor: Avilán Vargas, Nicolás Guillermo
dc.creator: Acevedo Orjuela, Bryam Camilo
dc.creator.degree: Master's in Applied Mathematics and Computer Science
dc.date.accessioned: 2026-03-18T15:13:16Z
dc.date.available: 2026-03-18T15:13:16Z
dc.date.created: 2026-02-27
dc.description: This master's thesis presents the design, implementation, and empirical validation of MomentumML, a modular MLOps library built on PySpark and MLflow and developed to close the gap between ML model experimentation and reliable production deployment, a challenge supported by evidence indicating that more than 90% of developed models never reach stable production environments. The library comprises 34,582 lines of code organized into specialized modules that cover the complete ML lifecycle: preprocessing (10 Transformer classes), training (5 Estimator classes with support for 8 algorithms), automatic versioning in Unity Catalog, prediction, and drift monitoring using multimodal statistical techniques such as the Population Stability Index (PSI), Kolmogorov-Smirnov, Jensen-Shannon divergence, and chi-squared. Validated for six months in a real organization in the telecommunications sector, the library produced compelling results: an 81% reduction in code for end-to-end pipelines, a 71% decrease in deployment time (from 3–4 weeks to 5–7 days), a 740% increase in deployment frequency, a 77% reduction in failure rate, and a 40% decrease in Databricks compute unit consumption. The most notable result: 35 of 85 operational models (41.2%) transitioned successfully to QA environments, an unprecedented milestone in the organization. The work contributes a practical, open-source, scalable framework that integrates software engineering, data science, and operations, positioning itself as a replicable reference for enterprise adoption of MLOps.
dc.description.abstract: This master's thesis presents the design, implementation, and empirical validation of MomentumML, a modular MLOps library built on PySpark and MLflow, developed to address the well-documented gap between ML model experimentation and reliable production deployment. Industry figures suggest that over 90% of developed models never reach stable production environments. The library comprises 34,582 lines of code organized into specialized modules covering the full ML lifecycle: preprocessing (10 Transformer classes), training (5 Estimator classes supporting 8 algorithms), automatic versioning via Unity Catalog, inference, and drift monitoring through multimodal statistical techniques including the Population Stability Index (PSI), Kolmogorov-Smirnov, Jensen-Shannon divergence, and chi-squared tests. Validated over six months in a real telecommunications organization, the library delivered substantial results: an 81% reduction in end-to-end pipeline code, a 71% decrease in deployment time (from 3–4 weeks to 5–7 days), a 740% increase in deployment frequency, a 77% reduction in failure rate, and a 40% reduction in Databricks compute unit consumption. Most notably, 35 of 85 operational models (41.2%) successfully transitioned to QA environments, an organizational first. The work contributes a practical, open-source, and scalable framework that bridges software engineering, data science, and operations, positioning itself as a replicable reference for enterprise MLOps adoption.
dc.format.extent: 60 pp
dc.format.mimetype: application/pdf
dc.identifier.doi: https://doi.org/10.48713/10336_47639
dc.identifier.uri: https://repository.urosario.edu.co/handle/10336/47639
dc.language.iso: spa
dc.publisher: Universidad del Rosario
dc.publisher.department: Facultad de Ciencias Naturales y Matemáticas-Estudios Profesionales
dc.publisher.program: Master's in Applied Mathematics and Computer Science
dc.rights: Attribution-NonCommercial-ShareAlike 4.0 International
dc.rights.accesRights: info:eu-repo/semantics/openAccess
dc.rights.acceso: Open (Full Text)
dc.rights.uri: http://creativecommons.org/licenses/by-nc-sa/4.0/
dc.source.instname: instname:Universidad del Rosario
dc.source.reponame: reponame:Repositorio Institucional EdocUR
dc.subject: MLOps
dc.subject: Machine Learning
dc.subject: PySpark
dc.subject: MLflow
dc.subject: Model lifecycle
dc.subject: Automation
dc.subject: Traceability
dc.subject: Reproducibility
dc.subject: Versioning
dc.subject: Drift detection
dc.subject: Big Data
dc.subject: Distributed environments
dc.subject: Model deployment
dc.subject: Continuous monitoring
dc.subject: Technical debt
dc.subject: Unity Catalog
dc.subject: Databricks
dc.subject: ML pipelines
dc.subject: Software engineering
dc.subject: Data science
dc.subject.keyword: MLOps
dc.subject.keyword: Machine Learning
dc.subject.keyword: PySpark
dc.subject.keyword: MLflow
dc.subject.keyword: Model Lifecycle
dc.subject.keyword: Automation
dc.subject.keyword: Traceability
dc.subject.keyword: Reproducibility
dc.subject.keyword: Versioning
dc.subject.keyword: Drift Detection
dc.subject.keyword: Distributed Environments
dc.subject.keyword: Big Data
dc.subject.keyword: Model Deployment
dc.subject.keyword: Continuous Monitoring
dc.subject.keyword: Technical Debt
dc.subject.keyword: Unity Catalog
dc.subject.keyword: Databricks
dc.subject.keyword: ML Pipelines
dc.subject.keyword: Software Engineering
dc.subject.keyword: Data Science
dc.title: Desarrollo de una librería MLOps: versionamiento, trazabilidad y automatización del ciclo de vida de modelos en entornos Big Data
dc.title.TranslatedTitle: Development of an MLOps Library: Versioning, Traceability, and Automation of the Model Lifecycle in Big Data Environments
dc.type: masterThesis
dc.type.hasVersion: info:eu-repo/semantics/acceptedVersion
dc.type.spa: Master's thesis
local.department.report: Escuela de Ciencias e Ingeniería
local.regiones: Bogotá
Files
Original bundle
Name: Desarrollo_de_una_libreria_MLOps.pdf
Size: 852.49 KB
Format: Adobe Portable Document Format