Item
Open Access
Development of an MLOps Library: Versioning, Traceability, and Automation of the Model Lifecycle in Big Data Environments
| dc.contributor.advisor | Avilán Vargas, Nicolás Guillermo | |
| dc.creator | Acevedo Orjuela, Bryam Camilo | |
| dc.creator.degree | Master's in Applied Mathematics and Computer Science | |
| dc.date.accessioned | 2026-03-18T15:13:16Z | |
| dc.date.available | 2026-03-18T15:13:16Z | |
| dc.date.created | 2026-02-27 | |
| dc.description | This master's thesis presents the design, implementation, and empirical validation of MomentumML, a modular MLOps library built on PySpark and MLflow and developed to close the gap between ML model experimentation and reliable production deployment, a challenge backed by evidence indicating that more than 90% of developed models never reach stable production environments. The library comprises 34,582 lines of code organized into specialized modules that cover the full ML lifecycle: preprocessing (10 Transformer classes), training (5 Estimator classes supporting 8 algorithms), automatic versioning in Unity Catalog, prediction, and drift monitoring using multimodal statistical techniques such as PSI, Kolmogorov-Smirnov, Jensen-Shannon Divergence, and Chi-squared. Validated over six months in a real telecommunications organization, the library delivered clear results: an 81% reduction in code for end-to-end pipelines, a 71% decrease in deployment time (from 3–4 weeks to 5–7 days), a 740% increase in deployment frequency, a 77% reduction in failure rate, and a 40% decrease in Databricks compute unit consumption. The most notable result: 35 of 85 operational models (41.2%) successfully moved to QA environments, a first for the organization. The work contributes a practical, open-source, scalable framework that integrates software engineering, data science, and operations, and positions itself as a replicable reference for enterprise MLOps adoption. | |
| dc.description.abstract | This master's thesis presents the design, implementation, and empirical validation of MomentumML, a modular MLOps library built on PySpark and MLflow, developed to address the well-documented gap between ML model experimentation and reliable production deployment. Industry figures suggest that over 90% of developed models never reach stable production environments. The library comprises 34,582 lines of code organized into specialized modules covering the full ML lifecycle: preprocessing (10 Transformer classes), training (5 Estimator classes supporting 8 algorithms), automatic versioning via Unity Catalog, inference, and drift monitoring through multimodal statistical techniques including PSI, Kolmogorov-Smirnov, Jensen-Shannon Divergence, and Chi-squared tests. In a six-month validation at a real telecommunications organization, the library achieved an 81% reduction in end-to-end pipeline code, a 71% decrease in deployment time (from 3–4 weeks to 5–7 days), a 740% increase in deployment frequency, a 77% reduction in failure rate, and a 40% reduction in Databricks compute unit consumption. Most notably, 35 of 85 operational models (41.2%) successfully transitioned to QA environments, an organizational first. The work contributes a practical, open-source, and scalable framework that bridges software engineering, data science, and operations, and is intended as a replicable reference for enterprise MLOps adoption. | |
| dc.format.extent | 60 pp | |
| dc.format.mimetype | application/pdf | |
| dc.identifier.doi | https://doi.org/10.48713/10336_47639 | |
| dc.identifier.uri | https://repository.urosario.edu.co/handle/10336/47639 | |
| dc.language.iso | spa | |
| dc.publisher | Universidad del Rosario | spa |
| dc.publisher.department | Facultad de Ciencias Naturales y Matemáticas-Estudios Profesionales | |
| dc.publisher.program | Maestría en Matemáticas Aplicadas y Ciencias de la Computación | spa |
| dc.rights | Attribution-NonCommercial-ShareAlike 4.0 International | * |
| dc.rights.accesRights | info:eu-repo/semantics/openAccess | |
| dc.rights.acceso | Open (Full Text) | |
| dc.rights.uri | http://creativecommons.org/licenses/by-nc-sa/4.0/ | * |
| dc.source.bibliographicCitation | M. Allurwar and G. Dhule, "MLOps – A complete guide for operationalizing machine learning," Int. Res. J. Eng. Technol. (IRJET), vol. 8, no. 6, pp. 4370–4373, 2021. | |
| dc.source.bibliographicCitation | Amdocs, "Cloud-based machine learning operations (MLOps) drives 2.3% revenue uplift for FinTech unicorn," 2024. [Online]. Available: https://www.amdocs.com/insights/case-study/cloud-based-machinelearning-operations-mlops-drives-23-revenue-uplift-fintech | |
| dc.source.bibliographicCitation | S. Amershi et al., "Software engineering for machine learning: A case study," in Proc. 41st Int. Conf. Softw. Eng.: Softw. Eng. Practice (ICSE-SEIP), Montreal, QC, Canada, 2019, pp. 291–300. | |
| dc.source.bibliographicCitation | Analytics Vidhya, "Best practices and performance tuning activities for PySpark," 2024. [Online]. Available: https://www.analyticsvidhya.com/blog/2021/08/best-practices-and-performance-tuningactivities-for-pyspark/ | |
| dc.source.bibliographicCitation | BinaryScripts, "Optimizing PySpark applications for large data processing," 2024. [Online]. Available: https://binaryscripts.com/spark/2024/12/19/optimizing-pyspark-applications-for-large-dataprocessing.html | |
| dc.source.bibliographicCitation | E. Breck, S. Cai, E. Nielsen, M. Salib, and D. Sculley, "The ML test score: A rubric for ML production readiness and technical debt reduction," Google Research, 2017. [Online]. Available: https://arxiv.org/abs/1706.08536 | |
| dc.source.bibliographicCitation | ChaosGenius, "7 ways to optimize Apache Spark performance," 2025. [Online]. Available: https://www.chaosgenius.io/blog/spark-performance-tuning/ | |
| dc.source.bibliographicCitation | ChaosGenius, "10 tips to reduce Databricks costs," 2025. [Online]. Available: https://www.chaosgenius.io/blog/databricks-optimization-techniques/ | |
| dc.source.bibliographicCitation | Databricks Community, "PySpark optimizations and best practices," 2023. [Online]. Available: https://community.databricks.com/t5/data-engineering/pyspark-optimizations-and-best-practices/tdp/10456 | |
| dc.source.bibliographicCitation | DEV Community, "PySpark optimization techniques," 2024. [Online]. Available: https://dev.to/rado_mayank/pyspark-optimization-techniques-56l0 | |
| dc.source.bibliographicCitation | B. Eken, S. Pallewatta, L. E. Lwakatare, A. van der Hoek, and J. Bosch, "A multivocal review of MLOps practices, challenges and open issues," arXiv preprint, arXiv:2406.09737, 2024. | |
| dc.source.bibliographicCitation | Finout, "Databricks cost optimization," 2025. [Online]. Available: https://www.finout.io/blog/optimizedatabricks-costs | |
| dc.source.instname | instname:Universidad del Rosario | |
| dc.source.reponame | reponame:Repositorio Institucional EdocUR | spa |
| dc.subject | MLOps | |
| dc.subject | Machine Learning | |
| dc.subject | PySpark | |
| dc.subject | MLflow | |
| dc.subject | Model Lifecycle | |
| dc.subject | Automation | |
| dc.subject | Traceability | |
| dc.subject | Reproducibility | |
| dc.subject | Versioning | |
| dc.subject | Drift Detection | |
| dc.subject | Big Data | |
| dc.subject | Distributed Environments | |
| dc.subject | Model Deployment | |
| dc.subject | Continuous Monitoring | |
| dc.subject | Technical Debt | |
| dc.subject | Unity Catalog | |
| dc.subject | Databricks | |
| dc.subject | ML Pipelines | |
| dc.subject | Software Engineering | |
| dc.subject | Data Science | |
| dc.subject.keyword | MLOps | |
| dc.subject.keyword | Machine Learning | |
| dc.subject.keyword | PySpark | |
| dc.subject.keyword | MLflow | |
| dc.subject.keyword | Model Lifecycle | |
| dc.subject.keyword | Automation | |
| dc.subject.keyword | Traceability | |
| dc.subject.keyword | Reproducibility | |
| dc.subject.keyword | Versioning | |
| dc.subject.keyword | Drift Detection | |
| dc.subject.keyword | Distributed Environments | |
| dc.subject.keyword | Big Data | |
| dc.subject.keyword | Model Deployment | |
| dc.subject.keyword | Continuous Monitoring | |
| dc.subject.keyword | Technical Debt | |
| dc.subject.keyword | Unity Catalog | |
| dc.subject.keyword | Databricks | |
| dc.subject.keyword | ML Pipelines | |
| dc.subject.keyword | Software Engineering | |
| dc.subject.keyword | Data Science | |
| dc.title | Desarrollo de una librería MLOps: versionamiento, trazabilidad y automatización del ciclo de vida de modelos en entornos Big Data | |
| dc.title.TranslatedTitle | Development of an MLOps Library: Versioning, Traceability, and Automation of the Model Lifecycle in Big Data Environments | |
| dc.type | masterThesis | |
| dc.type.hasVersion | info:eu-repo/semantics/acceptedVersion | |
| dc.type.spa | Tesis de maestría | |
| local.department.report | Escuela de Ciencias e Ingeniería | |
| local.regiones | Bogotá |
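
The abstract lists PSI (Population Stability Index) among the drift-monitoring statistics. As a rough illustration of what such a check computes, here is a minimal pure-Python sketch. It is not MomentumML's implementation: the function name `psi`, the equal-width binning over the reference range, the default of 10 bins, and the `eps` floor for empty bins are all assumptions made for this example.

```python
import math
from typing import List, Sequence


def psi(expected: Sequence[float], actual: Sequence[float],
        bins: int = 10, eps: float = 1e-6) -> float:
    """Population Stability Index of `actual` relative to `expected`.

    Hypothetical helper: bins are equal-width over the range of the
    reference sample, and empty bins are floored at `eps` to avoid log(0).
    """
    lo, hi = min(expected), max(expected)
    # Fall back to a width of 1.0 when the reference sample is constant.
    width = (hi - lo) / bins or 1.0

    def proportions(values: Sequence[float]) -> List[float]:
        counts = [0] * bins
        for v in values:
            # Clip values outside the reference range into the edge bins.
            idx = min(max(int((v - lo) / width), 0), bins - 1)
            counts[idx] += 1
        return [max(c / len(values), eps) for c in counts]

    p = proportions(expected)  # reference (e.g. training) distribution
    q = proportions(actual)    # current (e.g. scoring) distribution
    return sum((qi - pi) * math.log(qi / pi) for pi, qi in zip(p, q))


reference = [i / 100 for i in range(100)]
shifted = [0.5 + i / 200 for i in range(100)]
print(psi(reference, reference))  # identical samples: no drift
print(psi(reference, shifted))    # shifted sample: positive PSI
```

Each term of the sum is non-negative, so the index is zero only when the binned proportions match and grows as the current sample departs from the reference.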
Files
Original bundle
- Name: Desarrollo_de_una_libreria_MLOps.pdf
- Size: 852.49 KB
- Format: Adobe Portable Document Format



