TY - GEN
T1 - Robust scheduling for large-scale distributed systems
AU - Lee, Young Choon
AU - King, Jayden
AU - Kim, Young Ki
AU - Hong, Seok-Hee
PY - 2020
Y1 - 2020
AB - In large-scale distributed systems, such as clouds, failures are the norm rather than the exception. These failures include job failures, server failures, network outages and power failures. Among them, server failures are the most common. With the wide adoption of cloud computing, the impact of server failures in clouds is far greater than in traditional computer clusters, as jobs of different tenants are often co-located (multi-tenancy). In this paper, we address the problem of robust scheduling, with realistic failure modeling, to minimize such impact on the execution of (co-located) jobs. To this end, we develop four online failure-aware (FA) scheduling algorithms, FAFF-WJ, FAFF-FC, FABF-WJ and FABF-FC, that consider the availability and reliability of servers. In particular, FF (First-Fit) and BF (Best-Fit) indicate how the availability of servers is checked, while WJ (Waiting Job) and FC (Failure Count) differ primarily in whether reliability is measured from the job's perspective or the server's perspective. All four algorithms essentially combine these availability and reliability check methods. We evaluate our scheduling algorithms with failures generated from our failure modeling of six real-world server failure traces. Our evaluation results show the effectiveness of our scheduling algorithms in robust job execution, with respect to both performance and cost.
KW - Clouds
KW - Reliability
KW - Robust scheduling
KW - Server failures
UR - http://www.scopus.com/inward/record.url?scp=85101263971&partnerID=8YFLogxK
U2 - 10.1109/TrustCom50675.2020.00019
DO - 10.1109/TrustCom50675.2020.00019
M3 - Conference proceeding contribution
AN - SCOPUS:85101263971
SN - 9780738143804
T3 - IEEE International Conference on Trust Security and Privacy in Computing and Communications
SP - 38
EP - 45
BT - Proceedings - 2020 IEEE 19th International Conference on Trust, Security and Privacy in Computing and Communications, TrustCom 2020
A2 - Wang, Guojun
A2 - Ko, Ryan
A2 - Bhuiyan, Md Zakirul Alam
A2 - Pan, Yi
PB - Institute of Electrical and Electronics Engineers (IEEE)
CY - Piscataway, NJ
T2 - 19th IEEE International Conference on Trust, Security and Privacy in Computing and Communications, TrustCom 2020
Y2 - 29 December 2020 through 1 January 2021
ER -