HeSRPT: parallel scheduling to minimize mean slowdown

Benjamin Berg, Rein Vesilo, Mor Harchol-Balter

Research output: Contribution to journal › Article › peer-review

Abstract

Modern data centers serve workloads which can exploit parallelism. When a job parallelizes across multiple servers, it completes more quickly. However, it is unclear how to share a limited number of servers among many parallelizable jobs.

In this paper we consider a typical scenario where a data center composed of N servers is tasked with completing a set of M parallelizable jobs. Typically, M is much smaller than N. In our scenario, each job consists of some amount of inherent work, which we refer to as the job's size. We assume that job sizes are known to the system up front, and that each job can utilize any number of servers at any moment in time. These assumptions are reasonable for many parallelizable workloads, such as training neural networks using TensorFlow [2].

Our goal in this paper is to allocate servers to jobs so as to minimize the mean slowdown across all jobs, where the slowdown of a job is the job's completion time divided by its running time if given exclusive access to all N servers. Slowdown measures how much a job was delayed by interference from other jobs in the system, and is often the metric of interest in the theoretical parallel scheduling literature (where it is also called stretch), as well as in the HPC community (where it is called expansion factor).
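To make the slowdown metric concrete, here is a minimal sketch that computes mean slowdown for one simple policy: run jobs to completion one at a time, smallest first, giving each job all N servers. The sketch assumes perfectly linear speedup (a job of size s finishes in s/N time on N servers), which is a simplification; this is an illustration of the metric, not the HeSRPT policy itself.

```python
def mean_slowdown_smallest_first(sizes, n_servers):
    """Run jobs one at a time, smallest first, each on all n_servers.

    With linear speedup, a job of size s finishes in s / n_servers
    time when given every server; that is its baseline running time.
    Slowdown = completion time / baseline running time.
    """
    t = 0.0
    slowdowns = []
    for s in sorted(sizes):
        baseline = s / n_servers       # running time alone on all servers
        t += baseline                  # job occupies all servers until done
        slowdowns.append(t / baseline) # completion time / baseline
    return sum(slowdowns) / len(slowdowns)

# Three jobs of sizes 1, 2, and 4 on a 4-server system:
# completion times are 0.25, 0.75, 1.75, giving slowdowns 1, 1.5, 1.75.
print(mean_slowdown_smallest_first([1.0, 2.0, 4.0], n_servers=4))  # → ~1.4167
```

Note that the smallest job always has slowdown 1 under this policy, while later jobs pay for the work queued ahead of them; a scheduler that splits servers across jobs trades off these terms, which is the optimization the paper studies.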
Original language: English
Pages (from-to): 35-36
Number of pages: 2
Journal: Performance Evaluation Review
Volume: 48
Issue number: 3
DOIs
Publication status: Published - Mar 2021
