MLKD

Ongoing dissertations

Unmixing mass spectra using machine learning for species identification
Supervised by Arlindo L. Oliveira and authored by Vasco Paisana

MALDI-TOF mass spectrometry is a fast and powerful tool for identifying biological species based on molecular fingerprints. However, when samples contain mixtures, such as multiple species or strains, the resulting spectra overlap, making it hard to tell them apart. To solve this, we can apply machine learning techniques that “unmix” these complex signals into their original components. This concept, known as spectral deconvolution, is similar to separating instruments in a song or resolving mixed cell types in genomics. With access to a high-quality reference spectral database from the EU-funded MALDIBANK project, we can explore statistical and ML models that enhance identification accuracy, potentially transforming how we analyze complex biological samples.

This project will develop and test spectral deconvolution models, such as ICA, NMF, and neural networks, to separate overlapping MALDI-TOF signals. It will also evaluate how different input formats (raw spectra vs. spectral peaks) and modeling approaches (including Bayesian techniques) affect performance. The goal is to build a prototype pipeline that improves species identification from mixed samples.
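
As a rough illustration of the unmixing idea, the sketch below estimates how much each known reference spectrum contributes to a mixed spectrum. It assumes the reference spectra are already available as binned intensity vectors and applies the multiplicative NMF update to the mixing weights only, keeping the references fixed. The function name `unmix` and the toy numbers are hypothetical and are not taken from MALDIBANK.

```python
def unmix(mixed, references, iterations=1000, eps=1e-12):
    """Estimate non-negative weights w so that sum_k w[k] * references[k]
    approximates `mixed`.

    Applies the multiplicative NMF update to the weights only, keeping the
    reference spectra (the basis) fixed.
    """
    k, n = len(references), len(mixed)
    # Inner product of each reference with the mixed spectrum.
    ref_dot_mixed = [sum(references[a][j] * mixed[j] for j in range(n))
                     for a in range(k)]
    # Gram matrix of the references.
    gram = [[sum(references[a][j] * references[b][j] for j in range(n))
             for b in range(k)] for a in range(k)]
    weights = [1.0] * k  # a positive start keeps every update non-negative
    for _ in range(iterations):
        fitted = [sum(gram[a][b] * weights[b] for b in range(k))
                  for a in range(k)]
        weights = [weights[a] * ref_dot_mixed[a] / (fitted[a] + eps)
                   for a in range(k)]
    return weights


# Toy binned intensities (hypothetical, for illustration only).
references = [
    [1.0, 0.0, 2.0, 0.0],
    [0.0, 3.0, 0.0, 1.0],
    [1.0, 1.0, 0.0, 0.0],
]
true_weights = [0.6, 0.3, 0.1]
mixed = [sum(w * ref[j] for w, ref in zip(true_weights, references))
         for j in range(4)]

estimated = unmix(mixed, references)
print([round(w, 3) for w in estimated])
```

In a real setting the references would not reproduce the mixture exactly, and the basis itself might have to be learned, which is where full NMF, ICA, or neural models come in.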

Requisites: The student must have a background in Data Science and Machine Learning.

Notes: This project is part of MALDIBANK, a recently funded EU project (https://cordis.europa.eu/project/id/101188201). A fellowship may be offered depending on progress and results. The selected student will have access to the facilities of INESC-ID and the MLKD group (https://mlkd.idss.inesc-id.pt/cluster), including computing facilities with four DELL PowerEdge C41402 servers, eight NVIDIA 32GB Tesla V100, four NVIDIA 48GB A40 and four NVIDIA 64GB Tesla A100, among other computing servers (https://mlkd.idss.inesc-id.pt/cluster).

Developing a Sense of Humour in Large Language Models
Supervised by Arlindo L. Oliveira and authored by Duarte Sousa

Artificial Intelligence systems, particularly Large Language Models (LLMs) such as GPT, Gemini, and Claude, demonstrate high levels of autonomy, sometimes even surpassing human capabilities across various domains. However, in the domain of comedy and humour, these models still show notable limitations: they often struggle to generate genuinely funny content and to understand the subtleties of humour, and they are not effective at supporting the creative process involved in generating comedy.

This dissertation focuses on exploring the roots of these limitations, whether they are based on the lack of targeted training data, inherent architectural constraints, or other factors. With a particular emphasis on Portuguese humour, this research aims to investigate how LLMs can be improved to better understand and generate comedy content.

The core objectives of the thesis include:
- Review related work, such as “A Robot Walks into a Bar: Can Language Models Serve as Creativity Support Tools for Comedy? An Evaluation of LLMs’ Humour Alignment with Comedians” (https://arxiv.org/pdf/2405.20956) and define a roadmap for the thesis.
- Curate a dataset composed of comedy sketches and material from well-known Portuguese comedic groups (“Gato Fedorento”, “Herman José”, “Porta dos Fundos”…).
- Use this dataset in a Retrieval-Augmented Generation (RAG) pipeline to enhance LLM performance in comedy brainstorming and joke creation.
- Fine-tune open-source language models such as LLaMA, Mistral, or Qwen with the dataset to assess whether domain-specific tuning improves their humour capabilities.
- Analyze model activations and embeddings.
- Evaluate model outputs through structured human feedback to measure improvements in comedy quality and relevance.
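
The retrieval step of the RAG objective above could be prototyped along the following lines. This is only a sketch under simplifying assumptions: it ranks texts with bag-of-words cosine similarity instead of learned embeddings, the corpus entries are placeholders rather than real sketch transcripts, and the helper names (`retrieve`, `build_prompt`) are hypothetical. The resulting prompt would then be passed to whichever LLM is being evaluated.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase a text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def retrieve(query, corpus, top_k=2):
    """Return the top_k corpus entries most similar to the query."""
    q = Counter(tokenize(query))
    ranked = sorted(corpus,
                    key=lambda doc: cosine(q, Counter(tokenize(doc))),
                    reverse=True)
    return ranked[:top_k]


def build_prompt(request, corpus, top_k=2):
    """Prepend the retrieved sketches to the request as context."""
    context = "\n".join(f"- {doc}" for doc in retrieve(request, corpus, top_k))
    return f"Reference sketches:\n{context}\n\nTask: {request}"


# Placeholder entries standing in for curated sketch transcripts (hypothetical).
corpus = [
    "sketch about a waiter who refuses to bring the bill",
    "sketch about a football commentator narrating a chess match",
    "sketch about a bureaucrat who needs a form to request a form",
]
print(build_prompt("write a joke about a chess commentator", corpus, top_k=1))
```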

Requisites: The student should have significant programming experience and practical knowledge of machine learning frameworks and environments, such as PyTorch or TensorFlow.

Notes: The selected student will have access to the facilities of INESC-ID and the MLKD group (https://mlkd.idss.inesc-id.pt/), including computing facilities with four DELL PowerEdge C41402 servers, eight NVIDIA 32GB Tesla V100, four NVIDIA 48GB A40 and four NVIDIA 64GB Tesla A100, among other computing servers (https://mlkd.idss.inesc-id.pt/cluster).

An Intelligent Software Agent for Personalized Tutoring
Supervised by Arlindo L. Oliveira and authored by Natan Gloeh

Software agents play a central role in modern artificial intelligence, particularly in digital environments that require autonomy, adaptability, and goal-oriented behaviour. Among their many applications, intelligent tutoring systems stand out as a promising domain in which software agents can deliver personalised, context-aware support to learners. These systems aim to adapt content, strategies, and feedback to individual students’ needs, thereby enhancing learning outcomes and engagement. This dissertation will focus on the development of an intelligent software agent designed to act as a tutor, supporting the creation and dynamic management of personalised study plans.

The primary objective of the dissertation is to design, implement, and evaluate a goal-driven tutoring agent that operates within a virtual learning environment. The agent will monitor a learner’s progress, propose tailored study plans, and adapt its guidance based on performance, preferences, and evolving goals. The project will draw on models of autonomous software agents, particularly utility-based and plan-based approaches, and will use the Letta platform to support the design, execution, and coordination of agentic behaviours.

The dissertation will pursue the following goals:
- Survey the literature on intelligent tutoring systems and software agents, with a focus on approaches that support autonomy, personalisation, and decision-making under uncertainty.
- Design an agent architecture that can represent user goals, monitor learning activities, and select pedagogical strategies accordingly. The agent should be capable of interfacing with structured curricular resources and managing evolving study plans.
- Implement the tutoring agent using a platform such as Letta, Jason, or another existing alternative that offers a structured framework for agent specification, including goal definition, capability management, and reasoning over action choices.
- Enable multi-modal interaction within a software environment, such as recommending resources, issuing reminders, or querying user preferences and goals. While the system will remain fully digital, it may simulate dialogue-like exchanges with the learner to enhance responsiveness.
- Evaluate the effectiveness of the tutor agent, both in terms of its internal reasoning capabilities and its ability to adapt to different learner profiles. Metrics may include goal completion rate, alignment with user preferences, and perceived usefulness in simulated user trials.
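
To make the utility-based approach mentioned above more concrete, the sketch below shows one possible way an agent could propose a study plan and adapt it to performance. Everything in it is an assumption made for illustration: the `Topic` fields, the utility formula, and the greedy selection are hypothetical and do not use the Letta or Jason APIs.

```python
from dataclasses import dataclass


@dataclass
class Topic:
    name: str
    mastery: float   # estimated mastery in [0, 1]
    priority: float  # learner-assigned importance in [0, 1]
    minutes: int     # estimated study time


def utility(topic):
    """Hypothetical utility: remaining gain (1 - mastery) weighted by
    priority, per minute of study time."""
    return topic.priority * (1.0 - topic.mastery) / topic.minutes


def propose_plan(topics, budget_minutes):
    """Greedily pick the highest-utility topics that fit in the time budget."""
    plan, remaining = [], budget_minutes
    for topic in sorted(topics, key=utility, reverse=True):
        if topic.minutes <= remaining:
            plan.append(topic.name)
            remaining -= topic.minutes
    return plan


def record_result(topic, score, rate=0.5):
    """Move the mastery estimate towards the latest quiz score (in [0, 1])."""
    topic.mastery += rate * (score - topic.mastery)


topics = [
    Topic("linear algebra", mastery=0.6, priority=0.5, minutes=30),
    Topic("probability", mastery=0.3, priority=0.9, minutes=45),
    Topic("optimization", mastery=0.5, priority=0.7, minutes=30),
]
print(propose_plan(topics, budget_minutes=90))  # plan before any feedback
record_result(topics[2], score=1.0)             # learner does well on optimization
print(propose_plan(topics, budget_minutes=90))  # plan adapts to the new estimate
```

A plan-based agent would replace the greedy step with explicit reasoning over goals and action sequences; the sketch only covers the monitor, update, and re-plan loop.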

The dissertation is expected to lead to a functioning prototype and to include an analysis of the challenges and trade-offs involved in developing software agents for personalised education. Potential extensions include integrating domain-specific knowledge models, supporting long-term learning trajectories, or incorporating multiple agents for collaborative learning support.

Requisites: The student should have significant programming experience and practical knowledge of machine learning frameworks and environments, such as PyTorch or TensorFlow.

Notes: The selected student will have access to the facilities of INESC-ID and the MLKD group (https://mlkd.idss.inesc-id.pt/), including computing facilities with four DELL PowerEdge C41402 servers, eight NVIDIA 32GB Tesla V100, four NVIDIA 48GB A40 and four NVIDIA 64GB Tesla A100, among other computing servers (https://mlkd.idss.inesc-id.pt/cluster).

Reasoning in Large Language Models and the Dual-Process Theory of Cognition
Supervised by Arlindo L. Oliveira and authored by Adrian Graur

Large Language Models (LLMs) such as GPT-4, Claude, and Gemini represent a major leap in artificial intelligence, demonstrating capabilities in language understanding, generation, and increasingly, reasoning. However, their reasoning abilities remain a topic of active investigation, raising fundamental questions about the nature of their inferential processes, their limitations, and their relation to human cognition. One fruitful perspective to approach this challenge is through the lens of dual-process theories of reasoning, particularly the distinction between System 1 (fast, intuitive, automatic thinking) and System 2 (slow, deliberate, analytical reasoning), as developed in cognitive science since the 19th century and popularized by Daniel Kahneman.

The goal of this dissertation is to explore the reasoning behaviour of LLMs and investigate how it maps onto the System 1 / System 2 framework. The project will combine theoretical insights with empirical evaluations to better understand whether—and under what conditions—LLMs exhibit characteristics of fast, intuitive responses versus slow, structured reasoning.

The main objectives of the dissertation include:
- Review the literature on LLM reasoning capabilities, including benchmarks such as logical reasoning tasks, mathematical problem solving, commonsense inference, and chain-of-thought prompting. The review will also include relevant studies in cognitive science on dual-process theories.
- Formulate a conceptual mapping between typical behaviours of LLMs and features of System 1 and System 2. For instance, short unprompted completions may align with intuitive (System 1) outputs, while multi-step chain-of-thought reasoning may reflect deliberative (System 2) processes—albeit implemented via different mechanisms.
- Design and run experiments using open LLMs (e.g., LLaMA or Mistral) or proprietary models accessed via API (e.g., GPT-4) to test their performance across tasks designed to dissociate intuitive from analytical reasoning. This may include tasks such as syllogistic reasoning, cognitive reflection tests (CRT), and problems known to elicit System 1/System 2 divergence in humans.
- Investigate the role of prompting strategies, such as zero-shot, few-shot, and chain-of-thought prompts, in shifting the LLM's behaviour between fast, heuristic-like responses and slower, structured reasoning patterns.
- Analyse the results quantitatively and qualitatively to determine to what extent LLMs emulate dual-process characteristics, and reflect on the limitations of this analogy—e.g., the lack of internal metacognition or working memory in current LLMs.
- Discuss implications for both AI and cognitive science: what LLM performance tells us about artificial reasoning systems, and whether dual-process theories can inform the design or evaluation of next-generation AI.
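
The experimental objectives above could start from a harness like the one sketched here, which builds two prompt styles for the well-known bat-and-ball CRT item and labels numeric answers as correct, intuitive (the lure), or other. The prompt wording, the function names, and the example answers are assumptions made for illustration; no model is called, and parsing answers out of real model outputs is left out.

```python
CRT_ITEM = {
    "question": ("A bat and a ball cost $1.10 in total. The bat costs $1.00 "
                 "more than the ball. How much does the ball cost, in cents?"),
    "intuitive_cents": 10,  # the common fast but incorrect answer
    "correct_cents": 5,     # ball = 5, bat = 105, total = 110
}


def zero_shot_prompt(question):
    """Prompt asking for a direct answer, with no intermediate reasoning."""
    return f"{question}\nAnswer with a single number."


def chain_of_thought_prompt(question):
    """Prompt asking for step-by-step reasoning before the answer."""
    return f"{question}\nLet's think step by step, then give the final number."


def classify(answer_cents, item):
    """Label an answer as 'correct', 'intuitive' (the lure), or 'other'."""
    if answer_cents == item["correct_cents"]:
        return "correct"
    if answer_cents == item["intuitive_cents"]:
        return "intuitive"
    return "other"


def summarize(answers, item):
    """Count labels over the numeric answers collected for one prompt style."""
    counts = {"correct": 0, "intuitive": 0, "other": 0}
    for answer in answers:
        counts[classify(answer, item)] += 1
    return counts


# Hypothetical answers; in an experiment these would come from model outputs.
print(zero_shot_prompt(CRT_ITEM["question"]))
print(summarize([10, 10, 5], CRT_ITEM))
print(summarize([5, 5, 7], CRT_ITEM))
```

Comparing the label counts per prompt style gives a first quantitative handle on whether chain-of-thought prompting shifts answers away from the intuitive lure.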

This dissertation will bridge computational experimentation with cognitive theory and is well suited for students interested in the intersection of AI, psychology, and the philosophy of mind. Optional extensions may include comparisons across LLM architectures, integration with neuro-symbolic methods, or the use of LLMs as models of human reasoning behaviour in behavioural science simulations.

Requisites: The student should have significant programming experience and practical knowledge of machine learning frameworks and environments, such as PyTorch or TensorFlow.

Notes: The selected student will have access to the facilities of INESC-ID and the MLKD group (https://mlkd.idss.inesc-id.pt/), including computing facilities with four DELL PowerEdge C41402 servers, eight NVIDIA 32GB Tesla V100, four NVIDIA 48GB A40 and four NVIDIA 64GB Tesla A100, among other computing servers (https://mlkd.idss.inesc-id.pt/cluster).

Handling Portuguese Varieties with Pre-Trained Language Models
Supervised by Bruno Martins and Arlindo L. Oliveira and authored by Rodrigo Laia

Automatic language identification of written texts is a well-established area of research within Natural Language Processing (NLP). State-of-the-art algorithms often rely on n-gram character models, or more recently on neural language models, to identify the correct language of texts, with good results seen for different European languages. However, distinguishing between similar language varieties with a considerable overlap in the lexicon and in frequently used linguistic expressions, as is the case for Brazilian Portuguese (PT-BR) and European Portuguese (PT-PT), still presents significant challenges.

Accurate identification of language varieties now has important applications in training large language models. In the particular case of the Portuguese language, the predominance of Brazilian Portuguese corpora online leaves linguistic traces in those models, limiting their adoption outside Brazil. To address this gap and promote the creation of European Portuguese resources, this work will cover:
(a) the development of training datasets focusing on the discrimination between European and Brazilian Portuguese, e.g. leveraging abundant sources of data such as OpenSubtitles; and
(b) the fine-tuning of pre-trained language models to classify textual utterances according to language variety (i.e., PT-PT and PT-BR), and to generate transliterations between varieties.
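
As a point of comparison for item (b), the character n-gram approach mentioned earlier can be sketched as a small Naive Bayes classifier. This is a baseline illustration and not the fine-tuned model the project targets. The class name `NgramClassifier` is hypothetical, class priors are assumed uniform, and the six training sentences are toy examples built around well-known lexical differences (autocarro/ônibus, comboio/trem, telemóvel/celular), not a real dataset.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Return the overlapping character n-grams of a lowercased text."""
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]


class NgramClassifier:
    """Naive Bayes over character n-grams with add-one smoothing.

    Class priors are assumed uniform, so only the n-gram likelihoods count.
    """

    def __init__(self, n=3):
        self.n = n
        self.counts = {}    # label -> Counter of n-gram counts
        self.vocab = set()  # all n-grams seen during training

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            grams = char_ngrams(text, self.n)
            self.counts.setdefault(label, Counter()).update(grams)
            self.vocab.update(grams)
        return self

    def score(self, text, label):
        """Log-likelihood of the text under the n-gram model of one label."""
        counts = self.counts[label]
        total = sum(counts.values()) + len(self.vocab)
        return sum(math.log((counts[g] + 1) / total)
                   for g in char_ngrams(text, self.n))

    def predict(self, text):
        return max(self.counts, key=lambda label: self.score(text, label))


# Toy training sentences (illustrative only).
texts = [
    "estou a apanhar o autocarro", "o comboio está atrasado", "perdi o telemóvel",
    "estou pegando o ônibus", "o trem está atrasado", "perdi o celular",
]
labels = ["PT-PT"] * 3 + ["PT-BR"] * 3

classifier = NgramClassifier(n=3).fit(texts, labels)
print(classifier.predict("o autocarro está atrasado"))
print(classifier.predict("perdi o ônibus"))
```

With realistic data the two varieties share most of their n-grams, which is why the proposal turns to fine-tuned pre-trained models for this task.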

In a second stage, the project may also explore the use of European Portuguese resources, filtered through the model that is to be developed, for fine-tuning LLMs in order to better support the PT-PT variety.

Experiments will be performed on existing corpora used in previous studies (e.g., the DSL-TL corpus, used for instance in the study “Enhancing Portuguese Varieties Identification with Cross-Domain Approaches”). The main deliverable from this project will be a technical report detailing the results of a large set of comparative experiments, with a view to publication at a conference in the areas of natural language processing or information retrieval.

Requisites:
• Commitment and availability to work on the project (e.g., not recommended for students with other professional activities);
• Interest in the application of deep learning methods to natural language processing — previous projects involving the use of Transformer-based models will be valued;
• Good command of English;
• Excellent grades in courses related to the topics of the project (i.e., an average grade of 17 values or higher);
• Knowledge and experience with the use of tools like Overleaf and GitHub;
• Knowledge of Python and machine learning libraries such as PyTorch and HuggingFace Transformers;
• Preference will be given to students enrolled, or interested in enrolling, in the PhD fast track programme.

Notes: The project will be supervised by Bruno Martins (DEEC/IST and INESC-ID) and Arlindo Oliveira (DEI/IST and INESC-ID). It will be developed in the context of ongoing research projects currently being executed at the Human Language Technologies (HLT) group of INESC-ID (e.g., the Center for Responsible AI — https://centerforresponsible.ai).

Within INESC-ID, students will have access to computational resources supporting the training of large neural networks (i.e., servers with A100 GPUs), and they are expected to interact with other HLT researchers working on similar topics (e.g., Ph.D. students in the group that can act as mentors to newcomers).

Students interested in this proposal should contact Prof. Bruno Martins (bruno.g.martins@tecnico.ulisboa.pt) in order to schedule an interview. One student from MEIC, named Rodrigo Farate Laia, has already expressed his interest in selecting this proposal.

Assessing Large Language Models for information verification in Portuguese
Supervised by Arlindo L. Oliveira and Nuno Cardoso and authored by Luís Câmara

In an increasingly interconnected world driven by digital technologies, the ability to navigate, critically evaluate, and effectively use information is more essential than ever. However, amidst the vast amount of online information, misinformation proliferates, posing significant challenges to informed decision-making. With the advent of Large Language Models (LLMs), there is a potential solution in using these models to verify the accuracy of information circulating online. Different Artificial Intelligence models offer distinct capabilities in processing and understanding large amounts of textual data. This research aims to explore and compare the effectiveness of these models in the context of information verification.

The objectives of this thesis are:
- Expand the knowledge of the state of the art in information verification using Artificial Intelligence.
- Identify and select appropriate methods for developing a solution that investigates the ability of Large Language Models to verify the truthfulness of a text against a knowledge base.
- Use open-source models (LLaMA, Vicuna, Mistral, etc.) and proprietary ones (GPT-4, BARD, Claude, etc.) to verify information.
- Analyze the results obtained in experiments, comparing the performance of different LLMs with the existing state of the art, and draw pertinent conclusions based on these results.

Requisites: The student should have significant programming experience and practical knowledge of machine learning frameworks and environments, such as PyTorch or TensorFlow.

Notes: The work will be developed in cooperation with SIED (Serviço de Informações Estratégicas de Defesa), which has significant expertise in the topic. The selected student will have access to the facilities of INESC-ID and the MLKD group (https://mlkd.idss.inesc-id.pt/), including computing facilities with four DELL PowerEdge C41402 servers, eight NVIDIA 32GB Tesla V100, four NVIDIA 48GB A40 and four NVIDIA 64GB Tesla A100, among other computing servers (https://mlkd.idss.inesc-id.pt/cluster).

Deep neural network architectures for dual process computation
Supervised by Arlindo L. Oliveira and Bruno Martins and authored by João Meneses Santos

Dual process theories have been used to explain the different modes of behavior of the human brain when processing information. These theories became popular with Kahneman's book Thinking, Fast and Slow, but they are based on decades of experimental evidence that the human brain works in two different modes. System 1 processes large amounts of visual and sensory information, efficiently and unconsciously. For instance, face and object recognition, speech processing and many other automatic functions are performed effortlessly by the human brain, using System 1. Other tasks require conscious effort, like answering complex riddles, executing non-trivial arithmetic operations or planning unfamiliar tasks. These tasks are performed by System 2. Existing systems, like convolutional neural networks for vision or transformers for natural language processing, behave very much like System 1 in the human brain: they perform fast, high-throughput processing of high-dimensional information, in an unconscious way.

This dissertation will focus on the design of deep neural network architectures that can be used to emulate the dual process computation that characterizes the human brain, and also on the relation between dual process architectures and consciousness.

Requisites: The student should have significant programming experience and practical knowledge of machine learning frameworks and environments, such as PyTorch or TensorFlow. The student should also be interested in developing an understanding of neuroscience and human psychology.

Notes: The selected student will have access to the facilities of INESC-ID and the MLKD group (https://mlkd.idss.inesc-id.pt/), including computing facilities with four DELL PowerEdge C41402 servers, eight NVIDIA 32GB Tesla V100, four NVIDIA 48GB A40 and four NVIDIA 64GB Tesla A100, among other computing servers (https://mlkd.idss.inesc-id.pt/cluster).