Evaluating replication for parallel jobs: an efficient approach

dc.creator: Qiu, Zhan [spa]
dc.creator: Pérez, Juan F. [spa]
dc.date.accessioned: 2020-08-19T14:43:23Z
dc.date.available: 2020-08-19T14:43:23Z
dc.date.created: 2015-10-30 [spa]
dc.description.abstract: Many modern software applications rely on parallel job processing to exploit large resource pools available in cloud and grid infrastructures. The response time of a parallel job, made of many subtasks, is determined by the last subtask that finishes. Thus, a single laggard subtask or a failure, requiring re-processing, may increase the response time substantially. To overcome these issues, we explore concurrent replication with canceling. This mechanism executes two job replicas concurrently, and retrieves the result of the first replica that completes, immediately canceling the other one. To analyze this mechanism we propose a stochastic model that considers replication at both job-level and task-level. We find that task-level replication achieves a much higher reliability and shorter response times than job-level replication. We also observe that the impact of replication depends on the system utilization, the subtask reliability, and the correlation among replica failures. Based on the model, we propose a resource-provisioning strategy that determines the minimum number of computing nodes needed to achieve a service-level objective (SLO) defined as a response-time percentile. This strategy is evaluated by considering realistic traffic patterns from a parallel cluster, where task-level replication shows the potential to reduce the resource requirements for tight response-time SLOs. [eng]
dc.format.mimetype: application/pdf
dc.identifier.doi: https://doi.org/10.1109/TPDS.2015.2496593
dc.identifier.issn: ISSN: 1045-9219
dc.identifier.issn: EISSN: 1558-2183
dc.identifier.uri: https://repository.urosario.edu.co/handle/10336/27694
dc.language.iso: eng [spa]
dc.publisher: IEEE [spa]
dc.relation.citationEndPage: 2302
dc.relation.citationIssue: No. 8
dc.relation.citationStartPage: 2288
dc.relation.citationTitle: IEEE Transactions on Parallel and Distributed Systems
dc.relation.citationVolume: Vol. 27
dc.relation.ispartof: IEEE Transactions on Parallel and Distributed Systems, ISSN: 1045-9219; EISSN: 1558-2183, Vol. 27, No. 8 (1 Aug 2016); pp. 2288-2302 [spa]
dc.relation.uri: https://ieeexplore.ieee.org/document/7313012 [spa]
dc.rights.accesRights: info:eu-repo/semantics/restrictedAccess
dc.rights.acceso: Restricted (access to specific groups) [spa]
dc.source: IEEE Transactions on Parallel and Distributed Systems [spa]
dc.source.instname: instname:Universidad del Rosario
dc.source.reponame: reponame:Repositorio Institucional EdocUR
dc.subject.keyword: Time factors [spa]
dc.subject.keyword: Reliability [spa]
dc.subject.keyword: Correlation [spa]
dc.subject.keyword: Program processors [spa]
dc.subject.keyword: Computational modeling [spa]
dc.subject.keyword: Absorption [spa]
dc.subject.keyword: Servers [spa]
dc.title: Evaluating replication for parallel jobs: an efficient approach [spa]
dc.title.TranslatedTitle: Evaluación de la replicación para trabajos paralelos: un enfoque eficiente [spa]
dc.type: article [eng]
dc.type.hasVersion: info:eu-repo/semantics/publishedVersion
dc.type.spa: Article [spa]
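
The abstract's contrast between job-level and task-level replication can be illustrated with a small Monte Carlo sketch. This is not the stochastic model proposed in the article: it assumes independent, exponentially distributed subtask times with unit rate, no failures, no queueing, and no correlation between replicas. The function name `simulate` and its parameters are hypothetical, chosen only for this illustration.

```python
import random


def simulate(n_tasks, n_jobs, seed=0):
    """Mean job response time under three schemes (toy assumptions only).

    Assumed, not taken from the article: i.i.d. exponential subtask times
    with rate 1, no failures, no queueing, uncorrelated replicas.
    """
    rng = random.Random(seed)
    totals = {"none": 0.0, "job": 0.0, "task": 0.0}
    for _ in range(n_jobs):
        # Subtask times for the original job (a) and for its replica (b).
        a = [rng.expovariate(1.0) for _ in range(n_tasks)]
        b = [rng.expovariate(1.0) for _ in range(n_tasks)]
        # No replication: the job ends when its last subtask finishes.
        totals["none"] += max(a)
        # Job-level replication: the first whole replica to finish wins.
        totals["job"] += min(max(a), max(b))
        # Task-level replication: each subtask takes the faster of its two
        # copies, and the job ends when the slowest such subtask finishes.
        totals["task"] += max(min(x, y) for x, y in zip(a, b))
    return {scheme: total / n_jobs for scheme, total in totals.items()}


means = simulate(n_tasks=16, n_jobs=2000)
for scheme in ("none", "job", "task"):
    print(f"{scheme:>4}: mean response time {means[scheme]:.3f}")
```

Under these assumptions the task-level time can never exceed the job-level time for the same samples, since each subtask's faster copy finishes no later than either full replica. The sketch says nothing about reliability, utilization, or provisioning, which the article's model addresses.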