Paulo Canelas
PhD Student at Carnegie Mellon University
PhD Student at the University of Lisbon

I am a PhD student working under the supervision of Alcides Fonseca, Sara Silva, and Christopher S. Timperley. My research focuses on developing program analysis techniques to detect errors in software systems. Previously, I worked on evolutionary program synthesis using refinement types. I am currently researching how software engineering techniques can be applied to robotics (Software Engineering for Robotics).


Education
  • Carnegie Mellon University
    Sep. 2020 - Jul. 2026 (Expected)
    Dual Degree PhD in Software Engineering with the University of Lisbon
    Thesis: Specification-Driven Detection of Misconfigurations in ROS-based Robotic Systems
    Advisors: Alcides Fonseca, Sara Silva, and Chris Timperley
  • Faculty of Sciences of University of Lisbon
    Sep. 2018 - Jul. 2020
    MSc. in Computer Science
    Thesis: Towards the Conceptualization of Refinement Typed Genetic Programming
    Advisor: Alcides Fonseca
  • Faculty of Sciences of University of Lisbon
    Sep. 2015 - Jul. 2018
    BSc. in Computer Science

Work Experience
  • Uber Technologies Inc.
    May 2024 - Aug. 2024
    PhD Software Engineer Research Intern

Teaching Experience
  • Carnegie Mellon University
    Teaching Assistant
    17-643 - Quality Management
    Mar. 2024 - May 2024
    17-623 - Quality Assurance
    Oct. 2023 - Dec. 2023
  • Faculty of Sciences of University of Lisbon
    Invited Teaching Assistant
    Programming
    Sep. 2021 - Feb. 2022
    Object Oriented Development
    Jan. 2021 - Jun. 2021

Selected Publications
Are Large Language Models Memorizing Bug Benchmarks?

Daniel Ramos, Claudia Mamede*, Kush Jain*, Paulo Canelas*, Catarina Gamboa*, Claire Le Goues (* equal contribution)

Large Language Models for Code (LLM4Code) Workshop. 2025. Just Accepted! 🎉

Large Language Models (LLMs) have become integral to various software engineering tasks, including code generation, bug detection, and repair. To evaluate model performance in these domains, numerous bug benchmarks containing real-world bugs from software projects have been developed. However, a growing concern within the software engineering community is that these benchmarks may not reliably reflect true LLM performance due to the risk of data leakage. Despite this concern, limited research has been conducted to quantify the impact of potential leakage. In this paper, we systematically evaluate popular LLMs to assess their susceptibility to data leakage from widely used bug benchmarks. To identify potential leakage, we use multiple metrics, including a study of benchmark membership within commonly used training datasets, as well as analyses of negative log-likelihood and *n*-gram accuracy. Our findings show that certain models, in particular CodeGen, exhibit significant evidence of memorization in widely used benchmarks like Defects4J, while newer models trained on larger datasets like Llama 3.1 exhibit limited signs of leakage. These results highlight the need for careful benchmark selection and the adoption of robust metrics to adequately assess models' capabilities.

Understanding Misconfigurations in ROS: An Empirical Study and Current Approaches

Paulo Canelas, Bradley Schmerl, Alcides Fonseca, Christopher S. Timperley

International Symposium on Software Testing and Analysis (ISSTA). 2024.  

The Robot Operating System (ROS) is a popular framework for building robot software from reusable components, but configuring and connecting these components correctly is challenging. Developers often face issues due to unstated assumptions, leading to misconfigurations that can result in unpredictable and dangerous behavior. To improve the reliability of ROS projects, it is critical to identify the broader set of misconfigurations. To that end, we perform a study on ROS Answers, a Q&A platform, to categorize these misconfigurations and evaluate how well existing detection techniques cover them. We identified 12 high-level categories and 50 sub-categories, with 27 not covered by current techniques.

News
2024
Paper on ROS Misconfigurations accepted at the International Symposium on Software Testing and Analysis (ISSTA)!
Jul 03
Started my PhD Software Engineer Summer Internship at Uber Technologies Inc.
May 14
Paper on Physical Unit Mismatches accepted at the International Conference on Robotics and Automation (ICRA).
Jan 15
2023
2-minute Lightning Talk at ROSCon 2023 on Understanding, Detecting and Repairing Misconfigurations in ROS. ⚡
Oct 20
Paper on the Usability of Liquid Types in Java accepted at the International Conference on Software Engineering (ICSE).
Jan 12
2022
Paper on the Challenges in Learning ROS accepted at the International Workshop on Robotics Software Engineering (RoSE).
Feb 25
2020
Our project ecoServer placed in the top 15 out of 1152 at the EDP University Challenge competition.
Jul 10
Best Poster award at the 5th LASIGE Workshop! 🏆
Feb 14