2024

Are Large Language Models Memorizing Bug Benchmarks?

Daniel Ramos, Claudia Mamede*, Kush Jain*, Paulo Canelas*, Catarina Gamboa*, Claire Le Goues (* equal contribution)

arXiv. Under review. 2024.  

Large Language Models (LLMs) have become integral to various software engineering tasks, including code generation, bug detection, and repair. To evaluate model performance in these domains, numerous bug benchmarks containing real-world bugs from software projects have been developed. However, a growing concern within the software engineering community is that these benchmarks may not reliably reflect true LLM performance due to the risk of data leakage. Despite this concern, limited research has been conducted to quantify the impact of potential leakage. In this paper, we systematically evaluate popular LLMs to assess their susceptibility to data leakage from widely used bug benchmarks. To identify potential leakage, we use multiple metrics, including a study of benchmark membership within commonly used training datasets, as well as analyses of negative log-likelihood and *n*-gram accuracy. Our findings show that certain models, in particular CodeGen, exhibit significant evidence of memorization in widely used benchmarks like Defects4J, while newer models trained on larger datasets like Llama 3.1 exhibit limited signs of leakage. These results highlight the need for careful benchmark selection and the adoption of robust metrics to adequately assess models' capabilities.
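
As a rough illustration of the negative log-likelihood analysis mentioned in the abstract, here is a minimal sketch of computing the mean per-token NLL of a code snippet under a causal language model. The model name and the Java snippet are illustrative placeholders, not the paper's exact setup; the intuition is that unusually low NLL on benchmark code, compared with comparable unseen code, can hint at memorization.

```python
# Minimal sketch: mean per-token negative log-likelihood (NLL) of a snippet
# under a causal code LM. Model and snippet below are illustrative only.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Salesforce/codegen-350M-mono"  # any Hugging Face causal LM works
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

snippet = "public int add(int a, int b) { return a + b; }"  # hypothetical example
inputs = tokenizer(snippet, return_tensors="pt")

with torch.no_grad():
    # Passing labels makes the model return the mean cross-entropy (= mean NLL)
    # over the snippet's tokens.
    outputs = model(**inputs, labels=inputs["input_ids"])

print(f"Mean NLL per token: {outputs.loss.item():.3f}")
```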

Understanding Misconfigurations in ROS: An Empirical Study and Current Approaches

Paulo Canelas, Bradley Schmerl, Alcides Fonseca, Christopher S. Timperley

International Symposium on Software Testing and Analysis (ISSTA). 2024.  

The Robot Operating System (ROS) is a popular framework for building robot software from reusable components, but configuring and connecting these components correctly is challenging. Developers often face issues due to unstated assumptions, leading to misconfigurations that can result in unpredictable and dangerous behavior. To improve the reliability of ROS projects, it is critical to identify the broader set of misconfigurations. To that end, we performed a study on ROS Answers, a Q&A platform, to categorize these misconfigurations and evaluate how well existing detection techniques cover them. We identified 12 high-level categories and 50 sub-categories, with 27 not covered by current techniques.

Is it a Bug? Understanding Physical Unit Mismatches in Robot Software

Paulo Canelas, Trenton Tabor, John-Paul Ore, Alcides Fonseca, Claire Le Goues, Christopher S. Timperley

International Conference on Robotics and Automation (ICRA). 2024.  

2023

Usability-Oriented Design of Liquid Types for Java

Catarina Gamboa, Paulo Canelas, Alcides Fonseca, Christopher S. Timperley

International Conference on Software Engineering (ICSE). 2023.  

2022

Data types as a more ergonomic frontend for Grammar-Guided Genetic Programming

Guilherme Espada, Leon Ingelse, Paulo Canelas, Pedro Barbosa, Alcides Fonseca

International Conference on Generative Programming: Concepts and Experiences (GPCE). 2022.  

An Experience Report on Challenges in Learning the Robot Operating System

Paulo Canelas, Miguel Tavares, Ricardo Cordeiro, Alcides Fonseca, Christopher S. Timperley

International Workshop on Robotics Software Engineering (RoSE) at the International Conference on Software Engineering (ICSE). 2022.  

Grammatical Evolution Mapping for Semantically-Constrained Genetic Programming

Alcides Fonseca, Paulo Santos, Guilherme Espada, Sara Silva

Genetic Programming Theory and Practice XVIII. 2022.  

2021

Augmenting Search-based Techniques with Static Synthesis-based Input Generation

Paulo Santos, José Campos, Christopher S. Timperley, Alcides Fonseca

International Workshop on Search-Based Software Testing (SBST) at the International Conference on Software Engineering (ICSE). 2021.  

2020

The Usability Argument for Refinement Typed Genetic Programming

Alcides Fonseca, Paulo Santos, Sara Silva

International Conference on Parallel Problem Solving from Nature (PPSN). 2020.  

Towards the Conceptualization of Refinement Typed Genetic Programming

Paulo Santos

Master's Thesis. Faculty of Sciences of the University of Lisbon. 2020.  

Advised by Alcides Fonseca

Refined Typed Genetic Programming as a user interface for Genetic Programming

Paulo Santos, Alcides Fonseca, Sara Silva

Short Paper. Genetic and Evolutionary Computation Conference Companion (GECCO). 2020.  
