TY - JOUR
T1 - Diversity/parallelism trade-off in distributed systems with redundancy
AU - Peng, Pei
AU - Soljanin, Emina
AU - Whiting, Philip
PY - 2022/2
Y1 - 2022/2
N2 - Distributed computing enables parallel execution of smaller tasks that make up a large computing job. Its purpose is to reduce the job completion time. However, random fluctuations in task service times lead to straggling tasks with long execution times. Redundancy provides diversity that allows job completion when only a subset of redundant tasks is executed, thus removing the dependency on the straggling tasks. Under constrained resources (here, a fixed number of parallel servers), increasing redundancy reduces the available resources for parallelism. In this paper, we characterize the diversity vs. parallelism trade-off and identify the optimal strategy among replication, coding, and splitting, which minimizes the expected job completion time. We consider three common service time distributions and establish three models that describe the scaling of these distributions with the task size. We find that different distributions with different scaling models operate optimally at different redundancy levels, thus requiring very different code rates.
AB - Distributed computing enables parallel execution of smaller tasks that make up a large computing job. Its purpose is to reduce the job completion time. However, random fluctuations in task service times lead to straggling tasks with long execution times. Redundancy provides diversity that allows job completion when only a subset of redundant tasks is executed, thus removing the dependency on the straggling tasks. Under constrained resources (here, a fixed number of parallel servers), increasing redundancy reduces the available resources for parallelism. In this paper, we characterize the diversity vs. parallelism trade-off and identify the optimal strategy among replication, coding, and splitting, which minimizes the expected job completion time. We consider three common service time distributions and establish three models that describe the scaling of these distributions with the task size. We find that different distributions with different scaling models operate optimally at different redundancy levels, thus requiring very different code rates.
UR - http://www.scopus.com/inward/record.url?scp=85119411765&partnerID=8YFLogxK
U2 - 10.1109/TIT.2021.3127920
DO - 10.1109/TIT.2021.3127920
M3 - Article
AN - SCOPUS:85119411765
SN - 0018-9448
VL - 68
SP - 1279
EP - 1295
JO - IEEE Transactions on Information Theory
JF - IEEE Transactions on Information Theory
IS - 2
ER -