Abstract
Artificial intelligence (AI) server infrastructure has been built to
support AI applications and handle data-intensive workloads. AI server
infrastructure is the essential building blocks, and errors in AI server
infrastructure may lead to fatal consequences to any AI applications
built upon it. Compared to traditional software, software for AI server
infrastructure is more configurable, and thus more likely to have
configuration errors that might prevent correct software behaviors.
Previous work on misconfiguration diagnosis requires sufficient
execution history or manual intervention, and can hardly diagnose
potential misconfigurations which are not triggered at launching. In
this paper, we propose a real-time method to address these issues.
Specifically, we combine program analysis and real-time log parsing to
diagnose configuration errors. It maps each configuration option to the
log code by applying program slicing only once, and parses real-time
logs during the operation of the AI server without manual intervention.
We evaluate the effectiveness of our approach on the core components of
Hadoop, an exemplar AI Server Infrastructure Software. The results show
that our method mapped more than 80% of the configuration options to log
outputs, identified 90% of the configuration read sites as the slicing
seeds, and successfully diagnosed about 10% configuration errors that
can not be addressed by previous studies.
Original language | English |
---|---|
Journal | IEEE Transactions on Dependable and Secure Computing |
DOIs | |
Publication status | E-pub ahead of print - 10 Apr 2023 |
Keywords
- AI server infrastructure
- Artificial intelligence
- Computer crashes
- diagnosis
- misconfiguration
- Real-time systems
- Servers
- slicing
- Software
- Source coding
- static program analysis
- Yarn