Kubernetes, an open-source platform for automating the deployment, scaling, and management of containerized applications, plays a pivotal role in modern software engineering and DevOps. As Kubernetes clusters grow more complex, troubleshooting failures becomes critical for DevOps practitioners. This paper introduces a method that uses Large Language Models (LLMs) to analyze Kubernetes logs, aiding in the troubleshooting and identification of bugs in Kubernetes environments. Our proof of concept demonstrates the potential of LLMs to automate application monitoring, and we hope the community can build on this insight and leverage LLMs for this task.
As organizations adopt Kubernetes for workload deployment, the complexity of Kubernetes clusters increases, necessitating efficient troubleshooting for system reliability. This paper proposes leveraging Kubernetes logs and Large Language Models (LLMs) to automate troubleshooting, expediting the identification of affected services and offering actionable insights.
Our technique involves teaching LLMs about the system architecture and the services affected when a single service fails. This enables the LLM to analyze logs efficiently, determine the severity of a failure, and identify the downstream services it impacts.
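As a minimal sketch of how this architecture context could be supplied to an LLM, the snippet below prepends a service-dependency description to the system prompt before asking about a log excerpt. The OpenAI client, the model name, and the dependency text are all illustrative assumptions; the paper does not prescribe a specific API.

```python
from openai import OpenAI  # any chat-style LLM client would do; this one is an assumption

# Hypothetical service-dependency description taught to the LLM.
ARCHITECTURE = """\
Services: frontend -> checkout -> payments -> postgres
          frontend -> catalog -> redis
If a service fails, the services that depend on it (to its left) degrade.
"""

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def analyze_log(log_excerpt: str) -> str:
    """Ask the LLM to assess a Kubernetes log excerpt in light of the architecture."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[
            {"role": "system",
             "content": "You are a Kubernetes troubleshooting assistant.\n" + ARCHITECTURE},
            {"role": "user",
             "content": "Which service failed, how severe is it, and which services "
                        "are impacted?\n\nLogs:\n" + log_excerpt},
        ],
    )
    return response.choices[0].message.content
```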
In instances where access to company-specific logs is restricted, we employ LLMs for data augmentation: using the original logs as references, we task the LLM with generating synthetic logs to enrich our dataset.
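A sketch of that augmentation step, under the same assumed client as above; the reference log line and the prompt wording are invented for illustration.

```python
# Illustrative reference line; real company logs would be substituted here.
REFERENCE_LOG = (
    "2024-01-15T10:32:07Z checkout-7d4f9 ERROR "
    "dial tcp 10.0.3.12:5432: connect: connection refused"
)

def generate_synthetic_logs(reference: str, n: int = 5) -> str:
    """Ask the LLM for n plausible log lines in the style of the reference."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[
            {"role": "user",
             "content": f"Here is a real Kubernetes log line:\n{reference}\n\n"
                        f"Generate {n} synthetic log lines in the same format, varying "
                        "timestamps, pod names, and failure modes. Do not copy any "
                        "identifiers from the original."},
        ],
    )
    return response.choices[0].message.content
```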
We ran experiments using zero-shot, few-shot, and few-shot chain-of-thought prompting on different LLMs to compare their performance, as described below.
For zero-shot prompting, we supplied the architecture description, asked the LLM to learn it, and then posed questions about our logs without any worked examples.
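The exact prompt wording is not given in the text; a hedged zero-shot example might look like the following, reusing the illustrative ARCHITECTURE string from above.

```python
# Illustrative zero-shot prompt: architecture context plus a question, no examples.
LOG = "checkout-7d4f9 ERROR dial tcp 10.0.3.12:5432: connection refused"  # invented

zero_shot_prompt = (
    f"Learn the following system architecture:\n{ARCHITECTURE}\n\n"
    f"Logs:\n{LOG}\n\n"
    "Question: Which service failed, how severe is it, and which services are impacted?\n"
    "Answer:"
)
```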
For few-shot prompting, we gave the LLM additional assistance by providing worked answers for three different failure scenarios before asking questions about our logs and stack traces.
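A corresponding few-shot sketch, with three invented (log, answer) exemplars standing in for the three scenarios; the real exemplars are not reproduced in the text.

```python
# Illustrative few-shot prompt: three worked (log, answer) scenarios precede the question.
EXAMPLES = [
    ("payments ERROR connection to postgres timed out",
     "postgres failed; payments, checkout, and frontend are impacted; severity: high"),
    ("catalog WARN redis latency above 200ms",
     "redis is degraded; catalog and frontend are impacted; severity: medium"),
    ("frontend INFO readiness probe passed",
     "no failure; no services impacted; severity: none"),
]

exemplar_text = "\n\n".join(f"Logs:\n{log}\nAnswer: {ans}" for log, ans in EXAMPLES)
few_shot_prompt = (
    f"Learn the following system architecture:\n{ARCHITECTURE}\n\n"
    f"{exemplar_text}\n\n"
    f"Logs:\n{LOG}\n"
    "Answer:"
)
```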
For few-shot chain-of-thought, we kept the few-shot setup and additionally instructed the model to reason about the solution step by step.
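Extending the few-shot sketch into chain-of-thought only requires an explicit instruction to reason before answering; again, the exact wording is an assumption.

```python
# Illustrative few-shot chain-of-thought prompt: the few-shot prompt plus an
# instruction to reason step by step before the final answer.
instruction = (
    "Think step by step: identify the failing service from the log line, trace "
    "which services depend on it in the architecture, then state the severity.\n"
)

# Insert the instruction before the final, unanswered "Answer:" only,
# leaving the exemplar answers untouched.
head, sep, tail = few_shot_prompt.rpartition("Answer:")
cot_prompt = head + instruction + sep + tail
```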
Companies generate petabytes of log data each day, and with the right anonymisation our approach can be extended to such data. We have explored just one aspect of DevOps; there is scope for this kind of work in many others.
We appreciate your attention and interest in our work. If you have any further questions or feedback, please feel free to reach out.