Proceedings of the 18th International Conference on Agents and Artificial Intelligence

Authors

L. Cernau, A. Dobrescu, Ecaterina Cărbune, Georgiana Asandei

Abstract

The use of large language models (LLMs) to analyse code and identify errors in it is becoming increasingly common among developers. While many studies aim to improve the quality and effectiveness of LLM-generated code, this paper investigates how LLMs perceive real-world code in comparison with established code metrics. We evaluate two widely used models, OpenAI GPT-4o and Gemini 2.0 Flash, to determine whether their identification of architectural issues remains consistent when they are given code alone, code combined with metrics, or metrics alone. Under each of these conditions, we asked the models to provide a brief assessment of potential problems in a file. Our analysis shows that LLMs can often correctly identify errors from metrics alone. Building on this finding, we further asked the models to assign a simple label (GOOD, MEDIUM, or BAD) reflecting their evaluation with minimal context. Prior research has focused on LLMs as code generators or bug fixers, and few studies have explored their ability to evaluate code quality using abstract indicators such as metrics. We combine the two approaches by also testing a hybrid condition in which the model receives both code and metrics. Our results suggest that LLMs can assess code quality even in the absence of full code context.
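
As an illustration of the three input conditions described in the abstract (code only, code with metrics, metrics only), the following is a minimal sketch of how such prompts might be assembled and how a GOOD/MEDIUM/BAD label might be read back from a reply. It is not the authors' implementation: the function names `build_prompt` and `parse_label`, the prompt wording, and the example metric names are all assumptions made for illustration, and no model API is called.

```python
import re
from typing import Optional

# The three labels named in the abstract.
LABEL_PATTERN = re.compile(r"\b(GOOD|MEDIUM|BAD)\b")


def build_prompt(code: Optional[str] = None, metrics: Optional[dict] = None) -> str:
    """Assemble a prompt for one condition: code only, metrics only, or both.

    The wording of the instructions is hypothetical, not taken from the paper.
    """
    if code is None and metrics is None:
        raise ValueError("provide code, metrics, or both")
    parts = [
        "Briefly assess potential architectural problems in this file.",
        "Finish with exactly one label: GOOD, MEDIUM, or BAD.",
    ]
    if code is not None:
        parts.append("Code:\n" + code)
    if metrics is not None:
        # Sort metric names so the prompt is deterministic across runs.
        lines = [f"{name}: {value}" for name, value in sorted(metrics.items())]
        parts.append("Metrics:\n" + "\n".join(lines))
    return "\n\n".join(parts)


def parse_label(reply: str) -> Optional[str]:
    """Return the last upper-case label in the reply, or None if there is none."""
    found = LABEL_PATTERN.findall(reply)
    return found[-1] if found else None


if __name__ == "__main__":
    # Hypothetical metric names; the abstract does not list the metrics used.
    example_metrics = {"lines_of_code": 1250, "cyclomatic_complexity": 48}
    print(build_prompt(metrics=example_metrics))
    print(parse_label("High complexity and a very long file. Label: BAD"))
```

Keeping prompt construction in one function makes it easy to hold the instructions constant while varying only the inputs, which is what a consistency comparison across the three conditions requires.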

Citation

@Inproceedings{Cernau2026ComparativeAO,
 author = {L. Cernau and A. Dobrescu and Ecaterina Cărbune and Georgiana Asandei},
 booktitle = {Proceedings of the 18th International Conference on Agents and Artificial Intelligence},
 title = {Comparative Analysis of LLMs for Software Quality Assessment via Code and Metrics},
 year = {2026}
}