Autonomous Performance Engineering Framework Using Artificial Intelligence for Resilient Cloud Native Systems
Abstract
Large-scale cloud services operate under highly dynamic conditions in which changing workloads, distributed components, and inter-service dependencies make it increasingly difficult to sustain performance, reliability, and scalability. Conventional monitoring and rule-based management approaches struggle to identify anomalies at an early stage or to allocate resources optimally. This paper proposes an Explainable Artificial Intelligence (XAI)-based enterprise reliability analytics approach to performance optimization in large-scale cloud services. The proposed framework combines real-time monitoring agents, machine-learning-based anomaly detection, explainable inference, reliability analytics, and an autonomous performance optimization controller. The system gathers operational metrics and log data from the cloud infrastructure, identifies abnormal system behavior using AI models, and applies explainability techniques such as SHAP to identify the factors driving poor performance. Using these insights, the framework dynamically adjusts resource allocation and scaling plans to keep services efficient and reliable. Experimental analysis indicates that the AI-driven system significantly improves the performance of cloud systems. Under dynamic workloads, the results show a higher failure detection rate, shorter system recovery time, more efficient resource utilization, and lower response latency. Moreover, the proposed framework improves overall service availability and SLA compliance compared with conventional cloud management systems. These results suggest that combining explainable AI with reliability analytics can be an effective strategy for developing intelligent, transparent, and self-optimizing cloud service management systems.
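The detect, explain, and act loop described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: per-metric z-scores are used here as a simple stand-in for both the machine-learning detector and SHAP attribution, and the metric names, the threshold of 3.0, and the action labels are all hypothetical choices made for illustration.

```python
from statistics import mean, stdev


def zscores(baseline, sample):
    """Return the z-score of each metric in `sample` against its baseline history."""
    scores = {}
    for name, history in baseline.items():
        mu, sigma = mean(history), stdev(history)
        # A constant history has zero spread; treat it as "no deviation".
        scores[name] = (sample[name] - mu) / sigma if sigma > 0 else 0.0
    return scores


def detect_and_explain(baseline, sample, threshold=3.0):
    """Flag an anomaly and rank metrics by how strongly they deviate.

    The ranking is a crude stand-in for a SHAP-style attribution: it only
    says which metric deviates most, not how a trained model used it.
    """
    scores = zscores(baseline, sample)
    anomalous = any(abs(z) >= threshold for z in scores.values())
    ranked = sorted(scores, key=lambda name: abs(scores[name]), reverse=True)
    return anomalous, ranked, scores


def plan_action(anomalous, ranked):
    """Map the top-ranked metric to a (hypothetical) remediation action."""
    if not anomalous:
        return "no_action"
    actions = {
        "cpu_util": "scale_out",
        "latency_ms": "scale_out",
        "memory_util": "increase_memory_limit",
    }
    # Unknown drivers are escalated rather than acted on automatically.
    return actions.get(ranked[0], "alert_operator")
```

For example, given a baseline in which `cpu_util` hovers around 0.45 and a new sample with `cpu_util` at 0.95, `detect_and_explain` flags the sample, ranks `cpu_util` first, and `plan_action` returns `"scale_out"`. In a real deployment the z-score step would be replaced by a trained detector and an actual explainer, and the action table by the controller's own policy.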