TY - JOUR
T1 - Rollback mechanisms for cloud management APIs using AI planning
AU - Satyal, Suhrid
AU - Weber, Ingo
AU - Bass, Len
AU - Fu, Min
PY - 2020/1
Y1 - 2020/1
N2 - Human-induced faults play a large role in systems reliability. In cloud platforms, system administrators may inadvertently make catastrophic mistakes, like deleting a virtual disk with important data. Providing rollback for cloud operations can reduce the severity and impact of such mistakes by allowing the system to be reverted to a known, good state. However, in the context of cloud management this is non-trivial, since cloud consumers have only limited visibility and indirect control. In this paper, we present a scalable approach to rolling back operations that change the state of a system on proprietary cloud platforms. In our previous work, we provided a system that augments cloud APIs and provides a rollback operation using an AI planner. In this paper, we build upon our previous work, but parallelize the rollback plan generation based on characteristics unique to the rollback scenario. Furthermore, we introduce a distributed anytime algorithm that gradually improves plan quality over time, until either an optimal plan is found or a timeout is reached. Through experimental evaluation we show that our approach scales better than a naïve approach, and effectively avoids the exponential behavior of AI planning. Further, we explore the trade-offs between the quality of rollback plans and plan generation time.
AB - Human-induced faults play a large role in systems reliability. In cloud platforms, system administrators may inadvertently make catastrophic mistakes, like deleting a virtual disk with important data. Providing rollback for cloud operations can reduce the severity and impact of such mistakes by allowing the system to be reverted to a known, good state. However, in the context of cloud management this is non-trivial, since cloud consumers have only limited visibility and indirect control. In this paper, we present a scalable approach to rolling back operations that change the state of a system on proprietary cloud platforms. In our previous work, we provided a system that augments cloud APIs and provides a rollback operation using an AI planner. In this paper, we build upon our previous work, but parallelize the rollback plan generation based on characteristics unique to the rollback scenario. Furthermore, we introduce a distributed anytime algorithm that gradually improves plan quality over time, until either an optimal plan is found or a timeout is reached. Through experimental evaluation we show that our approach scales better than a naïve approach, and effectively avoids the exponential behavior of AI planning. Further, we explore the trade-offs between the quality of rollback plans and plan generation time.
KW - AI planning
KW - cloud computing
KW - Reliability
KW - system administration
KW - web service
UR - http://www.scopus.com/inward/record.url?scp=85028909488&partnerID=8YFLogxK
U2 - 10.1109/TDSC.2017.2729543
DO - 10.1109/TDSC.2017.2729543
M3 - Article
AN - SCOPUS:85028909488
SN - 1545-5971
VL - 17
SP - 148
EP - 161
JO - IEEE Transactions on Dependable and Secure Computing
JF - IEEE Transactions on Dependable and Secure Computing
IS - 1
ER -