Sign Language Recognition with Large Vision-and-Language Models
Supervised by Bruno Martins and Arlindo L. Oliveira
Sign language plays a vital role in enabling communication for deaf and hard-of-hearing individuals,
yet automatic Sign Language Recognition (SLR) remains a challenging task due to its high visual complexity
and the limited availability of large annotated datasets, particularly for underrepresented sign languages such as
Portuguese Sign Language (Língua Gestual Portuguesa, LGP). Traditional approaches often rely on intermediate gloss
annotations, which are costly to produce and restrict scalability. Meanwhile, recent progress in Large Vision-and-Language Models (LVLMs)
opens new opportunities to bypass these constraints by enabling direct translation from video to text without the need for intermediate representations.
These models can exploit rich semantic and temporal features directly from raw video input, offering an end-to-end alternative for gloss-free sign language understanding.
This project proposes the development of a sign language recognition system that directly utilizes video frames as input to a Large Vision-and-Language Model,
taking inspiration from recent proposals (e.g., arXiv:2412.16524, arXiv:2404.00925, and arXiv:2408.10593).
Unlike previous approaches dependent on body key-point extraction or gloss-level supervision, the method will explore the use of LVLMs fine-tuned with paired video-text data,
treating sign language video frames as a first-class input modality. Inspired by recent advances in multimodal learning and the bridging of visual and language token spaces,
the model will be trained to recognize and translate full video sequences into natural language descriptions. In parallel, the project will address a critical gap in data availability for LGP
by curating a new dataset from publicly available sources, such as televised news broadcasts interpreted in sign language, and applying automatic alignment and segmentation strategies to create training pairs.
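Pipelines of this kind typically begin by sub-sampling each video down to the fixed number of frames that the visual encoder of an LVLM can accept. The helper below is a minimal sketch of uniform frame sampling; the function name and the default budget of 8 frames are illustrative assumptions, not part of any specific model's API.

```python
def sample_frame_indices(num_frames: int, num_samples: int = 8) -> list[int]:
    """Uniformly sample `num_samples` frame indices from a video with
    `num_frames` frames, as is common when preparing a clip for a
    vision-and-language model with a fixed visual token budget."""
    if num_frames <= 0:
        raise ValueError("video must contain at least one frame")
    if num_frames <= num_samples:
        return list(range(num_frames))  # keep every frame for short clips
    step = num_frames / num_samples
    # Take the midpoint of each of the `num_samples` equal-length windows.
    return [int(step * i + step / 2) for i in range(num_samples)]
```

The selected indices would then be used to extract the corresponding frames before they are passed to the model's image/video processor.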
Evaluation will be conducted using both newly collected LGP data and existing sign language datasets for other languages, such as
PHOENIX14T, CSL-Daily, and How2Sign (see how2sign.github.io and signllm.github.io), or the data from the Sign Language Translation Task of WMT22/WMT23 (wmt-slt.com).
The proposed model will be benchmarked against existing state-of-the-art gloss-free SLT systems, with comparisons focusing on translation accuracy, robustness to signer variation,
and performance in low-resource language scenarios. The main deliverable will be a technical report detailing the system architecture, data collection and preprocessing pipelines,
and experimental results, with the aim of submission to a leading conference in computer vision or natural language processing.
Requisites:
• Commitment and availability to work on the project (e.g., not recommended for students with other professional activities);
• Interest in the application of deep learning methods to natural language processing — previous projects involving the use of Transformer-based models will be valued;
• Good command of English;
• Excellent grades in courses related to the topics of the project (i.e., an average grade of 17 values or higher);
• Knowledge and experience with the use of tools like Overleaf and GitHub;
• Knowledge of Python and machine learning libraries such as PyTorch and HuggingFace Transformers;
• Preference will be given to students enrolled, or interested in enrolling, in the PhD fast track programme.
Notes: The project will be supervised by
Bruno Martins (DEEC/IST and INESC-ID)
and Arlindo Oliveira (DEI/IST and INESC-ID).
It will be developed in the context of ongoing research projects currently being executed at the Human Language Technologies (HLT) group of INESC-ID
(e.g., the Center for Responsible AI —
https://centerforresponsible.ai).
Within INESC-ID, students will have access to computational resources supporting the training of large neural networks (i.e., servers with A100 GPUs),
and they are expected to interact with other HLT researchers working on similar topics (e.g., Ph.D. students in the group that can act as mentors to newcomers).
Students interested in this proposal should contact Prof. Bruno Martins
(bruno.g.martins@tecnico.ulisboa.pt)
in order to schedule an interview.
Text Segmentation by TextTiling with Modern Neural Retrieval Models
Supervised by Bruno Martins and Arlindo L. Oliveira
Text segmentation is a foundational task in natural language processing, supporting a wide range of applications such as document summarization, topic modeling, media content repurposing,
and more recently, Retrieval-Augmented Generation (RAG) systems
(arXiv:2503.09600).
Effective segmentation enables better contextualization of content, allowing downstream models to focus on coherent blocks of information rather than arbitrary fixed-length inputs.
Classical approaches like TextTiling segment texts into semantically coherent passages based on lexical cohesion, typically using text similarity measures.
Subsequent approaches have revisited the TextTiling idea leveraging representations obtained from encoder models such as BERT
(see, e.g., PeerJ Computer Science and the DeepTiling repository on GitHub).
More recently, with the advent of neural retrieval and large language models (LLMs), there is a renewed interest in developing segmentation strategies that can more precisely model topical boundaries
and improve the efficiency and quality of retrieval-based systems. Recent approaches have even proposed the use of LLMs to perform text segmentation
(arXiv:2406.17526), although computational efficiency is a concern with these methods.
In the context of this M.Sc. research project, the candidate will revisit the original TextTiling method and augment it by integrating modern neural sentence representations,
particularly those derived from models pre-trained for dense retrieval tasks (e.g., sentence-transformers or embedding models based on larger Transformer decoders fine-tuned for semantic search).
By replacing TF-IDF vectors with dense embeddings, the method aims to improve the detection of semantic shifts between text segments, while maintaining the interpretability
and unsupervised nature of the original algorithm. In parallel, we will compare this enhanced TextTiling method with LLM-based segmentation approaches such as LumberChunker,
which dynamically determine chunk boundaries through generative prompting. The goal is to assess the trade-offs between lightweight embedding-based segmentation,
and more computationally intensive instruction-tuned models in real-world scenarios.
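The embedding-based variant of TextTiling described above can be sketched as follows. This is a simplified illustration: real sentence embeddings would come from a dense-retrieval model such as sentence-transformers, the depth score here uses the highest similarity peak on each side of a gap rather than the original hill-climbing procedure, and the function names and threshold value are illustrative assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def texttiling_boundaries(sent_embeddings, depth_threshold=0.1):
    """TextTiling-style boundary detection over dense sentence embeddings.

    Gap scores are cosine similarities between adjacent sentences; a gap is
    a boundary when its depth (how far the similarity dips below the highest
    peaks on both sides) exceeds `depth_threshold`. Returns gap indices: a
    boundary at index i separates sentence i from sentence i + 1.
    """
    sims = [cosine(sent_embeddings[i], sent_embeddings[i + 1])
            for i in range(len(sent_embeddings) - 1)]
    boundaries = []
    for i, s in enumerate(sims):
        left = max(sims[:i + 1])   # highest similarity up to gap i
        right = max(sims[i:])      # highest similarity from gap i onward
        depth = (left - s) + (right - s)
        if depth > depth_threshold:
            boundaries.append(i)
    return boundaries
```

Swapping TF-IDF vectors for dense embeddings leaves the algorithm unsupervised and interpretable: the sequence of gap scores can still be inspected directly to explain why a boundary was (or was not) placed.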
Evaluation will be performed using a mixture of segmentation benchmarks, perhaps also considering downstream task performance within RAG pipelines.
Benchmarks such as Wiki-727K, or the GutenQA benchmark used in LumberChunker, will be employed to quantify segmentation quality using established metrics such as Pk and WindowDiff.
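As an illustration of one such metric, the Pk score of Beeferman et al. estimates the probability that a window of width k disagrees with the reference about whether its endpoints lie in the same segment. The sketch below is a minimal implementation under stated assumptions: segmentations are represented as lists of segment lengths, and k defaults to the conventional choice of half the mean reference segment length.

```python
def pk(reference, hypothesis, k=None):
    """Pk segmentation error (Beeferman et al.): the fraction of sliding
    windows of width k whose endpoints are placed in the same segment by
    one segmentation but in different segments by the other. Segmentations
    are lists of segment lengths; lower is better (0.0 = perfect)."""
    def labels(seg_lengths):
        # Map each position to the index of the segment containing it.
        out = []
        for seg_id, length in enumerate(seg_lengths):
            out.extend([seg_id] * length)
        return out

    ref, hyp = labels(reference), labels(hypothesis)
    assert len(ref) == len(hyp), "segmentations must cover the same text"
    n = len(ref)
    if k is None:
        # Standard choice: half the mean reference segment length.
        k = max(1, round(n / (2 * len(reference))))
    errors = sum((ref[i] == ref[i + k]) != (hyp[i] == hyp[i + k])
                 for i in range(n - k))
    return errors / (n - k)
```

In practice a vetted implementation (e.g., the one shipped with NLTK) would be preferred, but the metric itself is simple enough to state explicitly.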
A technical report will document the results, envisioning the submission to a leading conference in natural language processing or information retrieval.
Requisites:
• Commitment and availability to work on the project (e.g., not recommended for students with other professional activities);
• Interest in the application of deep learning methods to natural language processing — previous projects involving the use of Transformer-based models will be valued;
• Good command of English;
• Excellent grades in courses related to the topics of the project (i.e., an average grade of 17 values or higher);
• Knowledge and experience with the use of tools like Overleaf and GitHub;
• Knowledge of Python and machine learning libraries such as PyTorch and HuggingFace Transformers;
• Preference will be given to students enrolled, or interested in enrolling, in the PhD fast track programme.
Notes: The project will be supervised by
Bruno Martins (DEEC/IST and INESC-ID)
and Arlindo Oliveira (DEI/IST and INESC-ID).
It will be developed in the context of ongoing research projects currently being executed at the Human Language Technologies (HLT) group of INESC-ID
(e.g., the Center for Responsible AI —
https://centerforresponsible.ai).
Within INESC-ID, students will have access to computational resources supporting the training of large neural networks (i.e., servers with A100 GPUs),
and they are expected to interact with other HLT researchers working on similar topics (e.g., Ph.D. students in the group that can act as mentors to newcomers).
Students interested in this proposal should contact Prof. Bruno Martins
(bruno.g.martins@tecnico.ulisboa.pt)
in order to schedule an interview.