MLKD

New dissertations (open for applications)

Multimodal Document Representation Learning
Supervised by Arlindo L. Oliveira

This dissertation aims to learn representations (embeddings) of documents from their constituent elements, such as images, text, and the layout of the document, i.e., the positions of components such as text blocks, paragraphs, and tables. These embeddings can be used to search for similar documents, which in turn enables document classification based on similarity to existing entries in a database, a useful capability for routing documents to different processes.

Concrete implementation suggestions:

  1. Use CLIP [1] with an image encoder and a text encoder, aligning the representations of both. With two distinct encoders sharing an embedding space, it would in principle be possible to search for documents using text and also using images of other documents.
  2. Follow approach 1 but use a text and layout encoder instead of text alone, or use 3 encoders: one for text, one for layout, and one for images. Using the layout, it would be possible to also search by elements at certain positions in the document.
  3. Instead of using CLIP, simply use the encoder of a multimodal model already pre-trained on documents (UDOP [2], LayoutLMv3 [3], etc.) with the various modalities described above and train the embeddings to bring similar documents closer together. This approach has the added difficulty of needing to aggregate similar documents and define what makes a document similar, making it not 100% self-supervised.
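
As a rough illustration of suggestion 1, the sketch below computes a CLIP-style symmetric contrastive loss over a batch of paired image and text embeddings. It is written in plain Python so that it is self-contained; a real implementation would more likely use PyTorch tensors and the pre-trained CLIP encoders. The function names and the temperature value of 0.07 are assumptions made for this example.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        row_max = max(row)
        log_partition = row_max + math.log(sum(math.exp(x - row_max) for x in row))
        total += log_partition - row[i]
    return total / len(logits)


def clip_style_loss(image_embeddings, text_embeddings, temperature=0.07):
    """Symmetric contrastive loss: pair i of one modality should match pair i of the other."""
    images = [l2_normalize(v) for v in image_embeddings]
    texts = [l2_normalize(v) for v in text_embeddings]
    # Similarity matrix: rows are images, columns are texts.
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in texts]
        for img in images
    ]
    logits_transposed = [list(column) for column in zip(*logits)]
    image_to_text = cross_entropy_diagonal(logits)
    text_to_image = cross_entropy_diagonal(logits_transposed)
    return 0.5 * (image_to_text + text_to_image)


# Toy batch: the loss is low when pairs are aligned and high when they are swapped.
aligned = clip_style_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = clip_style_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(aligned < swapped)
```

The same loss would apply to suggestion 2 by computing it for each pair of modalities (text, layout, image) that should share the embedding space.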
Cooperation with external entity: Fidelidade S.A.
Requisites: Fidelidade values close contact with students, so ideally the student would be present at the office on at least some of the team's in-person working days.

  • [1] Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., … Sutskever, I. (2021). Learning Transferable Visual Models From Natural Language Supervision. arXiv [cs.CV]. Retrieved from http://arxiv.org/abs/2103.00020
  • [2] Tang, Z., Yang, Z., Wang, G., Fang, Y., Liu, Y., Zhu, C., … Bansal, M. (2023). Unifying Vision, Text, and Layout for Universal Document Processing. arXiv [cs.CV]. Retrieved from http://arxiv.org/abs/2212.02623
  • [3] Huang, Y., Lv, T., Cui, L., Lu, Y., & Wei, F. (2022). LayoutLMv3: Pre-training for Document AI with Unified Text and Image Masking. arXiv [cs.CL]. Retrieved from http://arxiv.org/abs/2204.08387
Improving Retrieval Augmented Generation using LLMs
Supervised by Arlindo L. Oliveira

In a business context, specifically in the insurance sector, conversational RAG models grounded in a company's private information can have a significant impact on productivity. Despite the potential of these systems, the high risk associated with the insurance sector makes the accuracy and reliability of the system's answers a crucial factor. Additionally, the many different methods and architectures available, which must be chosen according to the specific case under study, make evaluating these systems challenging. This dissertation aims to investigate recent RAG techniques, as well as the metrics used to evaluate them, with the goal of finding the best architecture for the specific Fidelidade case study.

Techniques to explore:

  • Chunking, Indexing and Retrieval techniques for tabular data
  • Chunking, Indexing and Retrieval techniques for images
  • Agent frameworks with Reasoning for complex/iterative questions
  • European Portuguese Large Language Models
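
To make the first item more concrete, here is a minimal sketch of one possible way to chunk tabular data for retrieval: each chunk repeats the column names next to the values, so a chunk remains interpretable on its own. The token-overlap scoring is only a stand-in for an embedding-based retriever, and the table contents, function names, and `rows_per_chunk` parameter are hypothetical.

```python
import re


def chunk_table(header, rows, rows_per_chunk=2):
    """Serialize rows as 'column: value' pairs so every chunk carries its own header."""
    chunks = []
    for start in range(0, len(rows), rows_per_chunk):
        lines = [
            "; ".join(f"{name}: {value}" for name, value in zip(header, row))
            for row in rows[start:start + rows_per_chunk]
        ]
        chunks.append("\n".join(lines))
    return chunks


def tokenize(text):
    """Lowercase word tokens; a real system would use an embedding model instead."""
    return set(re.findall(r"\w+", text.lower()))


def retrieve(query, chunks, k=1):
    """Return the k chunks sharing the most tokens with the query."""
    query_tokens = tokenize(query)
    ranked = sorted(chunks, key=lambda c: len(query_tokens & tokenize(c)), reverse=True)
    return ranked[:k]


# Hypothetical table, for illustration only.
header = ["product", "coverage", "premium"]
rows = [
    ["auto", "collision", "300"],
    ["home", "fire", "150"],
    ["life", "death", "90"],
    ["travel", "baggage", "20"],
]
chunks = chunk_table(header, rows)
print(retrieve("premium for travel baggage", chunks)[0])
```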

Cooperation with company or external entity: Fidelidade S.A.

Requisites: Fidelidade values close contact with students, so ideally the student would be present at the office on at least some of the team's in-person working days.
Generation of New Scientific Knowledge using LLMs
Supervised by Arlindo L. Oliveira and Vitória Cruz

Large Language Models (LLMs) have achieved impressive capabilities, yet they remain largely constrained by the data and distributions they were trained on. Scientific discovery, however, requires generating new knowledge. This raises the question: how can we enable LLMs to move beyond their training data and produce novel scientific knowledge? The scientific method relies on experimentation, but text-only chatbot LLMs receive no feedback on the hypotheses they generate unless they are grounded in environments where their ideas can be executed and evaluated. Recent work explores agentic frameworks in which LLMs write and execute scientific code, shifting the model from a passive text generator to an active agent capable of proposing ideas, testing them, and iterating toward better solutions.

Recent systems such as AlphaEvolve [1] and other “AI Scientist” [2] frameworks highlight the promise of this approach, having already discovered algorithmic improvements that sometimes surpass the state of the art. So far, these frameworks have not been widely applied to AI itself, although there are early examples, such as work at Google using AlphaEvolve to improve parts of the Gemini training infrastructure [1], the Darwin Gödel Machine (a self-improving LLM-based agent) [3], and applications to kernel optimization [4, 5]. More recently, Andrej Karpathy’s “autoresearch” exposes a small LLM training pipeline to coding agents. The goal of this master’s thesis is to contribute to this direction by identifying suitable problems in AI research that can be tackled with automated research frameworks, and by applying (and possibly improving) these frameworks to solve them.

The core objectives of this thesis are:

  • Review existing work on automated research frameworks and understand what kinds of AI problems they have been applied to.
  • Identify a small number of AI research problems that are simple enough to experiment with, but still interesting (for example, improving how LLMs generate answers by testing different prompting or sampling strategies).
  • Define an evaluation metric for each problem and build an experimental setup where many solutions can be tested efficiently (e.g. by using tools such as vLLM to speed up inference). Faster experiments will allow more ideas to be explored.
  • Apply an automated research framework (e.g., ShinkaEvolve [6], an open-source variation of AlphaEvolve) to these problems and analyze the results, looking at the quality, diversity, and originality of the solutions found.
  • Release the experimental setup as open-source software and write a paper with the results of the experiments.
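
The propose, evaluate, and iterate loop behind these objectives can be sketched as a simple greedy evolutionary search. This is only an illustration and not how AlphaEvolve or ShinkaEvolve are implemented: the mutation operator here adds random noise where a real framework would ask an LLM to rewrite a program, and the toy problem (tuning a single sampling temperature whose best value is assumed to be 0.7) is hypothetical.

```python
import random


def evolve(initial, mutate, score, generations=50, children_per_generation=8, seed=0):
    """Greedy evolutionary loop: mutate the current best and keep any improvement."""
    rng = random.Random(seed)
    best, best_score = initial, score(initial)
    history = [best_score]
    for _ in range(generations):
        candidates = [mutate(best, rng) for _ in range(children_per_generation)]
        for candidate in candidates:
            candidate_score = score(candidate)
            if candidate_score > best_score:
                best, best_score = candidate, candidate_score
        history.append(best_score)
    return best, best_score, history


def toy_score(temperature):
    """Hypothetical evaluation metric, peaking at a temperature of 0.7."""
    return -(temperature - 0.7) ** 2


def toy_mutate(temperature, rng):
    """Stand-in for an LLM proposal: perturb the current solution with Gaussian noise."""
    return temperature + rng.gauss(0.0, 0.1)


best, best_score, history = evolve(0.0, toy_mutate, toy_score)
print(round(best, 2), len(history))
```

Because every candidate must be scored, the cost of `score` dominates the loop, which is why a fast experimental setup allows more ideas to be explored.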

Requisites: The student should have an interest in the field of theory of mind, significant programming experience, and practical knowledge of machine learning languages and environments, such as PyTorch or TensorFlow.

Notes: The selected student will have access to the facilities of INESC-ID and the MLKD group (https://mlkd.idss.inesc-id.pt/), including computing facilities with four DELL PowerEdge C41402 servers, eight NVIDIA 32GB Tesla V100, four NVIDIA 48GB A40, and four NVIDIA 64GB Tesla A100, among other computing servers (https://mlkd.idss.inesc-id.pt/cluster).

[1] Novikov, Alexander, et al. "Alphaevolve: A coding agent for scientific and algorithmic discovery." arXiv preprint arXiv:2506.13131 (2025).
[2] Lu, Chris, et al. "The ai scientist: Towards fully automated open-ended scientific discovery." arXiv preprint arXiv:2408.06292 (2024).
[3] Zhang, Jenny, Shengran Hu, Cong Lu, Robert Lange, and Jeff Clune. "Darwin Godel Machine: Open-ended evolution of self-improving agents." arXiv preprint.
[4] Yuksekgonul, Mert, et al. "Learning to discover at test time." arXiv preprint arXiv:2601.16175 (2026).
[5] Liao, Gang, et al. "Kernelevolve: Scaling agentic kernel coding for heterogeneous ai accelerators at meta." arXiv preprint arXiv:2512.23236 (2025).
[6] Lange, Robert Tjarko, Yuki Imajuku, and Edoardo Cetin. "Shinkaevolve: Towards open-ended and sample-efficient program evolution." arXiv preprint arXiv:2509.19349 (2025).
Enhancing Fine-Grained Perceptual Understanding of Scientific Graphs in Vision Language Models
Supervised by Arlindo L. Oliveira and Hélder Dias

While modern Vision-Language Models (VLMs) have demonstrated remarkable capabilities in general scene understanding, they are frequently hindered by a "global bias": a tendency to prioritize high-level semantic summaries over local, fine-grained details. As explored in [1], this bias is often a byproduct of pre-training on general image-caption datasets that favor the "gist" of a scene over pixel-level precision. In the context of scientific communication, this limitation becomes a critical failure point; the misinterpretation of a single outlier or a subtle change in slope can lead to entirely inappropriate conclusions. Recent evaluations in MultiChartQA [2] underscore this issue, demonstrating that even frontier models struggle to integrate information across complex, multi-plot figures due to foundational perceptual inaccuracies. In that study, 37% of all errors made by the top-performing model could be attributed to such perceptual issues. Similarly, research in MeasureBench [3] documents systematic fine-grained errors where models consistently misread precise visual markers such as axis ticks and pointers. This project aims to address this "perception gap" by developing methods that force VLMs to move beyond global heuristics toward a more rigorous, fine-grained interpretation of scientific graphs.

Possible research directions (more ideas welcome):

  • Domain-specific tokenization. Current models use a generic "patch-to-token" pipeline that often fragments small details. The student could explore object-centric or coordinate-aware tokenization, where the model learns to represent a graph not as a grid, but as a set of semantically meaningful units. See [4, 5, 6, 7] for more inspiration.
  • Patch-grid alignment & symbolic anchoring. Research in [8] shows that VLM performance drops when a small visual feature (like a data point) shifts from the center of a patch to a boundary. The student could investigate "symbolic anchoring," where the model is fine-tuned to explicitly detect and "mark" axis ticks and legend symbols as a pre-processing step to align the visual grid with the numerical scale.
  • Contrastive locality alignment. Standard VLMs suffer from global bias because they are trained on image-wide captions. This direction explores fine-tuning the vision-language connector using contrastive learning specifically on "local" pairs: for example, contrasting a plot with its correct outlier value vs. a plot with a subtly "shifted" outlier.
  • Neural-Symbolic hybrid interpretation. Building on the DePlot concept [9], this project would move away from simple "image-to-table" translation toward "image-to-coordinate" extraction. The student would develop a module that predicts a sparse numerical representation of the curves/points first, which is then fed into the LLM as a "visual hint," bypassing the noisy perception of the raw pixels during the reasoning phase.
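
As a small illustration of the patch-grid alignment direction, the sketch below measures how close a plotted data point lies to the nearest patch boundary of a vision encoder, which could be used to select "hard" examples for probing or fine-tuning. It assumes continuous pixel coordinates with patch edges at multiples of the patch size; the patch size of 14 and the `margin` parameter are assumptions for this example, not values taken from [8].

```python
def patch_boundary_distance(x, y, patch_size=14):
    """Distance in pixels from the point (x, y) to the nearest patch edge."""
    offset_x = x % patch_size
    offset_y = y % patch_size
    return min(offset_x, patch_size - offset_x, offset_y, patch_size - offset_y)


def points_near_boundary(points, patch_size=14, margin=1.0):
    """Select the points lying within `margin` pixels of a patch edge."""
    return [
        point for point in points
        if patch_boundary_distance(point[0], point[1], patch_size) <= margin
    ]


# Pixel coordinates of four hypothetical data points in a rendered plot.
points = [(7, 7), (14, 3), (15, 7), (20, 20)]
print(points_near_boundary(points))
```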

Requisites: The student should have a solid understanding of the Transformer architecture and familiarity with Vision Encoders. They should be comfortable with programming in Python with PyTorch and the HuggingFace ecosystem.

Notes: The selected student will have access to the facilities of INESC-ID and the MLKD group (https://mlkd.idss.inesc-id.pt/), including computing facilities with four DELL PowerEdge C41402 servers, eight NVIDIA 32GB Tesla V100, four NVIDIA 48GB A40, and four NVIDIA 64GB Tesla A100, among other computing servers (https://mlkd.idss.inesc-id.pt/cluster).

[1] Covert, I., Sun, T., Zou, J., & Hashimoto, T. (2024). Locality alignment improves vision-language models. arXiv preprint arXiv:2410.11087.
[2] Zhu, Z., Jia, M., Zhang, Z., Li, L., & Jiang, M. (2025, April). MultiChartQA: Benchmarking vision-language models on multi-chart problems. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers) (pp. 11341-11359).
[3] Lin, F., Liu, Y., Xu, H., Yue, C., He, Z., Zhao, M., ... & Yang, X. (2025). Do Vision-Language Models Measure Up? Benchmarking Visual Measurement Reading with MeasureBench. arXiv preprint arXiv:2510.26865.
[4] Sun, Z., Ma, Y., Liu, G., Chen, Y., Tang, X., Hu, Y., & Xu, Y. (2026). IVC-Prune: Revealing the Implicit Visual Coordinates in LVLMs for Vision Token Pruning. arXiv preprint arXiv:2602.03060.
[5] Guo, Z., Diao, E., Yang, C., & Shi, C. (2026). Graph Tokenization for Bridging Graphs and Transformers. arXiv preprint arXiv:2603.11099.
[6] Chen, J., Kong, L., Wei, H., Liu, C., Ge, Z., Zhao, L., ... & Zhang, X. (2024, October). Onechart: Purify the chart structural extraction via one auxiliary token. In Proceedings of the 32nd ACM International Conference on Multimedia (pp. 147-155).
[7] Wu, J., Lu, B., Di, Z., Gan, X., Jin, M., Fu, L., ... & Zhou, C. (2026). <SOG_k>: One LLM Token for Explicit Graph Structural Understanding. arXiv preprint arXiv:2602.01771.
[8] Yuan, Y. (2025). Seeing Small: Probing Visual Perception Limits of Vision-Language Models (Master's thesis, University of California, Los Angeles).
[9] Liu, F., Eisenschlos, J., Piccinno, F., Krichene, S., Pang, C., Lee, K., ... & Altun, Y. (2023, July). DePlot: One-shot visual language reasoning by plot-to-table translation. In Findings of the Association for Computational Linguistics: ACL 2023 (pp. 10381-10399).
Multi-Modal Autoencoder Framework for Predicting Plasma Proteome from Genomic Data Using AlphaGenome
Supervised by Arlindo L. Oliveira and Inês Duarte

This study aims to investigate how integrating multi-modal genomic data with AlphaGenome predictions can enhance the prediction of plasma protein abundance, using the UK Biobank Pharma Proteomics Project dataset and focusing on cancer genetic variants and their proteomic effects. The primary objective is to develop a deep learning autoencoder model that combines genomic variant data from cancer GWAS with AlphaGenome-derived regulatory features to predict how genetic variation influences protein expression levels in cancer contexts [1,2]. The model will learn compressed representations (embeddings) that capture how genetic variants influence protein expression through gene regulatory mechanisms, addressing the fundamental challenge of predicting molecular phenotypes from genotypic data in cancer biology. Assigning function to genetic variants as expression quantitative trait loci is an expanding and useful approach, but it focuses exclusively on mRNA rather than protein levels, and many variants remain without annotation [3]. Recent work has demonstrated that trans-associations of cancer-related variants with distal molecular targets can identify convergent regulatory effects across related cancers, with studies identifying shared proteomic effects across cancer types [4], highlighting the importance of understanding trans-regulatory mechanisms in protein abundance prediction for cancer research. By incorporating state-of-the-art AI predictions of intermediate regulatory processes through AlphaGenome [5], this work aims to bridge the gap between cancer genotype and proteotype more effectively than traditional approaches.

The student will design and implement a neural network architecture that processes multi-modal genomic data from cancer GWAS, combining genotype information from genome-wide significant cancer-associated pQTL variants with AlphaGenome's functional predictions. AlphaGenome takes a DNA sequence as input and predicts thousands of functional genomic tracks [5]. The focus on cancer-associated variants will enable the model to learn regulatory patterns specific to oncogenic processes and tumor biology. A deeply stacked denoising autoencoder approach has been successfully applied to protein structure reconstruction [6], demonstrating the viability of deep autoencoder architectures for biological prediction tasks. Performance evaluation will compare architectural variants, benchmarking them against the traditional linear regression approaches used in conventional cancer pQTL analyses [7,8].
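
For reference, the linear regression baseline mentioned above can be sketched as a single-variant ordinary least squares fit of protein level on genotype dosage (0, 1, or 2 copies of the effect allele). This is a simplified illustration: real pQTL analyses typically also adjust for covariates such as age, sex, and genetic principal components, which are omitted here, and the data and function name are hypothetical.

```python
def pqtl_effect(dosages, protein_levels):
    """Least-squares slope and intercept of protein level regressed on genotype dosage."""
    n = len(dosages)
    mean_dosage = sum(dosages) / n
    mean_protein = sum(protein_levels) / n
    covariance = sum(
        (g - mean_dosage) * (p - mean_protein)
        for g, p in zip(dosages, protein_levels)
    )
    variance = sum((g - mean_dosage) ** 2 for g in dosages)
    beta = covariance / variance
    intercept = mean_protein - beta * mean_dosage
    return beta, intercept


# Hypothetical data: each extra allele copy raises the protein level by 0.5 units.
dosages = [0, 1, 2, 0, 1, 2]
protein_levels = [1.0, 1.5, 2.0, 1.0, 1.5, 2.0]
beta, intercept = pqtl_effect(dosages, protein_levels)
print(beta, intercept)
```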

The final phase focuses on extracting biological insights relevant to cancer biology through systematic interpretation and validation of predictions. The student will implement attention mechanisms or gradient-based feature attribution methods to quantify which regulatory features contribute most to accurate protein abundance predictions in cancer contexts. Protein quantitative trait loci (pQTLs) identify novel relationships between inter-individual protein levels and genetic variants [3], and understanding which regulatory mechanisms mediate these relationships is crucial for identifying potential therapeutic targets and biomarkers in cancer. Latent space analysis using dimensionality reduction will reveal whether the model learns biologically meaningful clusters where cancer-associated variants with similar regulatory mechanisms group together, potentially identifying shared regulatory programs across cancer types [9].

Validation will include stratified performance analysis comparing cis-pQTLs versus trans-pQTLs in cancer-related genes, testing whether incorporating regulatory information particularly improves predictions for non-coding variants in cancer-relevant regulatory regions, and conducting case studies on well-characterized oncogenic regulatory variants to confirm whether the model captures known biological mechanisms in cancer pathways. Expression quantitative trait loci (eQTLs) and other molecular QTL studies have been valuable resources in identifying candidate causal genes from GWAS loci through statistical colocalization methods [10], and this work extends that framework to the cancer proteome level with regulatory context.

Requisites: Implementation will use Python with PyTorch or TensorFlow, requiring development of efficient data preprocessing pipelines for large-scale cancer genomic data and feature engineering from AlphaGenome's outputs.

Cooperation with company or external entity: Potential collaboration with researchers at the National Cancer Institute (NCI).

[1] Sun BB, et al. Plasma proteomic associations with genetics and health in the UK Biobank. Nature 622:329-338 (2023).
[2] Sun BB, et al. Genomic atlas of the human plasma proteome. Nature 558:73-79 (2018).
[3] Lourdusamy A, et al. Protein Quantitative Trait Loci Identify Novel Candidates Modulating Cellular Response to Chemotherapy. PLoS Genetics 10(4):e1004192 (2014). PMC3974641.
[4] Mukherjee D, et al. TASTE identifies shared proteomic effects on multiple related cancers. medRxiv doi:10.64898/2025.12.19.25342717 (2025).
[5] Avsec Ž, et al. Advancing regulatory variant effect prediction with AlphaGenome. Nature 649:1206-1218 (2026).
[6] Li H, et al. A Template-Based Protein Structure Reconstruction Method Using Deep Autoencoder Learning. J Proteomics Bioinform 9(12):306-313 (2016). PMC29081613.
[7] Suhre K, et al. Connecting genetic risk to disease end points through the human blood plasma proteome. Nat Commun 8:14357 (2017).
[8] Melzer D, et al. A Genome-Wide Association Study Identifies Protein Quantitative Trait Loci (pQTLs). PLoS Genet 4(5):e1000072 (2008).
[9] Dutta D, et al. Aggregative trans-eQTL analysis detects trait-specific target gene sets in whole blood. Nat Commun 13:4323 (2022).
[10] Zhang Z, et al. ezQTL: Interactive Visualization and Colocalization of Quantitative Trait Loci and GWAS. NCI Division of Cancer Epidemiology and Genetics (2022). https://dceg.cancer.gov/tools/analysis/ez-qtl
Adaptive Early Vision Networks: Learning Biologically Constrained Front-Ends for Robust Visual Recognition
Supervised by Arlindo L. Oliveira and Tiago Marques

Deep neural networks have achieved remarkable performance in visual recognition, but they remain more fragile than biological vision systems when images are corrupted, perturbed, or shifted away from the training distribution. In contrast, primate vision is highly robust across changes in contrast, noise, blur, illumination, and viewpoint. This has motivated a growing research area investigating whether principles from neuroscience can help design more robust and interpretable artificial vision systems.

Previous work from our group and collaborators introduced VOneNets, hybrid neural networks that place a biologically inspired model of the primate primary visual cortex (V1) at the front of standard convolutional neural networks. This architecture showed that constraining the early stages of artificial vision using known properties of biological visual processing can improve robustness to several image perturbations. More recently, Early Vision Networks (EVNets) extended this idea by explicitly modeling pre-cortical visual processing stages, including retina- and LGN-inspired computations, before the V1-like stage.

The goal of this Master's thesis is to develop a new generation of biologically inspired neural network architectures in which the early visual front-end is no longer fully fixed, but instead learnable under biological constraints. The project will investigate whether selected parameters of the VOneNet/EVNet front-end — such as spatial frequency tuning, orientation selectivity, center-surround interactions, contrast normalization, or neural noise — can be optimized during training while remaining close to biologically plausible distributions. This would allow the model to adapt to the task and dataset while preserving the interpretability and robustness benefits of biologically inspired design.

The student will implement and evaluate adaptive variants of VOneNet/EVNet architectures, comparing fixed, unconstrained learnable, and biologically constrained learnable front-ends. The project will involve model implementation in PyTorch, training and evaluation on standard computer vision datasets, and systematic ablation studies to understand which biological constraints contribute most to accuracy, robustness, and representation quality.

Possible research directions include making selected V1 or subcortical parameters learnable with regularization toward empirical biological distributions; introducing differentiable constraints on Gabor filters, Difference-of-Gaussian filters, center-surround mechanisms, and normalization parameters; comparing fixed versus adaptive early visual front-ends; evaluating whether biological constraints improve robustness compared with fully learnable alternatives; and exploring whether the adaptive front-end can be combined with different downstream architectures such as ResNets, EfficientNets, ConvNeXt models, or Vision Transformers.
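
One way to read "learnable with regularization toward empirical biological distributions" is sketched below: a Gabor filter is generated from its parameters, and a Gaussian penalty pulls a learnable parameter toward a prior mean. This is a plain-Python illustration under simplifying assumptions (an isotropic envelope and a single Gaussian prior per parameter), not the VOneNet/EVNet implementation; in PyTorch the same penalty would be added to the task loss with a weighting factor. All names and numeric values are hypothetical.

```python
import math


def gabor_kernel(size, spatial_frequency, theta, sigma, phase=0.0):
    """Square Gabor filter: a Gaussian envelope multiplied by an oriented cosine grating."""
    half = size // 2
    kernel = []
    for y in range(-half, half + 1):
        row = []
        for x in range(-half, half + 1):
            # Coordinate along the grating direction given by theta (radians).
            x_rotated = x * math.cos(theta) + y * math.sin(theta)
            envelope = math.exp(-(x * x + y * y) / (2.0 * sigma * sigma))
            carrier = math.cos(2.0 * math.pi * spatial_frequency * x_rotated + phase)
            row.append(envelope * carrier)
        kernel.append(row)
    return kernel


def prior_penalty(value, prior_mean, prior_std):
    """Gaussian negative log-likelihood (up to a constant) of a parameter under its prior."""
    return 0.5 * ((value - prior_mean) / prior_std) ** 2


kernel = gabor_kernel(size=5, spatial_frequency=0.1, theta=0.0, sigma=2.0)
print(kernel[2][2], prior_penalty(2.0, prior_mean=1.0, prior_std=0.5))
```

Setting the penalty weight to zero would correspond to the unconstrained learnable variant, while a very large weight approaches the fixed front-end.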

Requisites: The student should be very comfortable programming in Python. Experience with PyTorch and neural network training is recommended.

Cooperation with company or external entity: Collaboration with the Digital Surgery Lab of the Champalimaud Foundation, where the co-supervisor Tiago Marques is a PI.

[1] Dapello, J., Marques, T., Schrimpf, M., Geiger, F., Cox, D. D., & DiCarlo, J. J. (2020). Simulating a Primary Visual Cortex at the Front of CNNs Improves Robustness to Image Perturbations. NeurIPS
[2] Piper, L., Oliveira, A. L., & Marques, T. (2025). Explicitly Modeling Subcortical Vision with a Neuro-Inspired Front-End Improves CNN Robustness. NeurIPS
[3] Hendrycks, D., & Dietterich, T. (2019). Benchmarking Neural Network Robustness to Common Corruptions and Perturbations. ICLR
[4] Cirincione, A., Verrier, R., Bic, A., Olaiya, S., DiCarlo, J. J., Udeigwe, L., & Marques, T. (2022). Implementing Divisive Normalization in CNNs Improves Robustness to Common Image Corruptions. SVRHM Workshop @ NeurIPS
Optimization Model for Biomass Plant Control
Supervised by Arlindo L. Oliveira

The goal is to develop and validate a computational operating model for an industrial biomass plant, capable of processing data from multiple sensors in real time and recommending operating configurations that maximize combustion efficiency. Given the intrinsic variability of organic residues and fluctuations in their moisture content, the ideal combustion model is highly dynamic in nature. Solving this problem requires the following fundamental steps:

1. Multimodal data ingestion and processing: designing an architecture capable of fusing data from heterogeneous sensors (numerical signals for temperature and pressure, and unstructured data from video) into a unified data stream.
2. Prediction/simulation model development: creating a model (based on Artificial Intelligence, Machine Learning, or physical-mathematical models) capable of predicting the thermal efficiency of the boiler based on current state variables.
3. Operating parameter optimization: implementing an optimization algorithm that identifies, in a multidimensional space, the ideal combination of inputs for the current scenario.
4. Image analysis for combustion control: developing computer vision algorithms to extract flame characteristics (color, position, height, etc.) from the video feed, using them as early indicators of combustion quality.
5. Decision support system prototyping: developing an interface that provides operators with actionable real-time recommendations, enabling dynamic adjustments to boiler operation.
6. Validation and testing: evaluating model performance against historical data and validating the accuracy of suggestions against pre-established efficiency metrics.
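
As an illustration of step 4, the sketch below extracts simple flame descriptors (area, height, and centroid) from a single grayscale frame by thresholding bright pixels. It is a minimal example under strong assumptions: the frame is given as a nested list of intensities from 0 to 255, and a fixed threshold of 200 separates flame from background. A real system would more likely use a library such as OpenCV and a calibrated segmentation method.

```python
def flame_features(frame, threshold=200):
    """Describe the bright region of a grayscale frame (row 0 is the top of the image)."""
    bright = [
        (r, c)
        for r, row in enumerate(frame)
        for c, value in enumerate(row)
        if value >= threshold
    ]
    if not bright:
        return None  # No flame detected in this frame.
    rows = [r for r, _ in bright]
    cols = [c for _, c in bright]
    return {
        "area": len(bright),                  # Number of bright pixels.
        "height": max(rows) - min(rows) + 1,  # Vertical extent in pixels.
        "centroid": (sum(rows) / len(rows), sum(cols) / len(cols)),
    }


# Tiny hypothetical frame with a flame-like bright region.
frame = [
    [0, 0, 0, 0],
    [0, 255, 0, 0],
    [0, 255, 255, 0],
    [0, 250, 0, 0],
]
print(flame_features(frame))
```

Tracking these descriptors over time would give the early combustion-quality indicators mentioned in step 4, which can then feed the prediction model of step 2.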

Cooperation with company or external entity: The work will be developed with Renova's teams.

Requisites: Renova values close contact with students, and the experience will enrich the student's professional preparation. The success of the project also requires the student to be present at Renova's facilities on at least some days of the week.
"Agentification" in a Manufacturing Context Supervised by Arlindo L. Oliveira Development and implementation of an ecosystem of distributed intelligent agents capable of automating administrative and operational processes in a manufacturing context, ensuring interoperability between heterogeneous data sources (ERP, email, databases) and optimizing the value chain. The challenges inherent to this project include:

- Multi-source integration architecture: developing a robust connector system that ensures bidirectional data flow between agents and external sources (SAP, email servers, production logs, and internal databases).
- NLP processing agents: creating agents specialized in the automatic triage and categorization of incidents, capable of interpreting free text and routing issues to the responsible departments.
- Automated data extraction: developing intelligent extraction modules for documents (invoices, delivery notes, purchase orders, etc.) to automatically feed the management system, reducing manual errors.
- Quality Assurance (QA) optimization: implementing monitoring agents that, through data analysis, identify non-conformance patterns and suggest corrective actions.
- Agent orchestrators: creating a workflow management layer that coordinates communication between the different agents, ensuring that the output of one process correctly triggers the next action.
- Industrial validation: evaluating the impact of the solution through performance metrics (KPIs), such as reduction in incident response time, data extraction precision rates, and efficiency gains in quality control.
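
The orchestration idea, where the output of one agent triggers the next, can be sketched as a simple sequential pipeline. This is only a schematic example: the keyword-based triage stands in for an NLP model, and the department names, keywords, and agent functions are hypothetical rather than part of any existing Renova system.

```python
# Hypothetical routing table: department -> keywords that suggest it.
ROUTING_KEYWORDS = {
    "maintenance": ("machine", "breakdown"),
    "finance": ("invoice", "payment"),
    "quality": ("defect", "non-conformance"),
}


def triage_agent(task):
    """Assign a department by keyword matching (a stand-in for an NLP classifier)."""
    text = task["text"].lower()
    for department, keywords in ROUTING_KEYWORDS.items():
        if any(keyword in text for keyword in keywords):
            return {**task, "department": department}
    return {**task, "department": "unassigned"}


def notify_agent(task):
    """Record that the task was forwarded to the department chosen by triage."""
    return {**task, "notification": f"Forwarded to {task['department']}"}


def orchestrate(task, agents):
    """Run agents in order, feeding each agent's output to the next one."""
    trace = []
    for agent in agents:
        task = agent(task)
        trace.append(agent.__name__)
    return task, trace


result, trace = orchestrate(
    {"text": "Invoice 123 has a payment error"},
    [triage_agent, notify_agent],
)
print(result["notification"], trace)
```

The `trace` list gives a minimal audit trail, which would also be a natural place to collect the KPIs mentioned under industrial validation.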

Cooperation with company or external entity: The work will be developed with Renova's teams.

Requisites: Renova values close contact with students, and the experience will enrich the student's professional preparation. The success of the project also requires the student to be present at Renova's facilities on at least some days of the week.
AI-Driven Pulmonary Function Analysis for Respiratory Screening
Supervised by Arlindo L. Oliveira

The early detection and monitoring of chronic respiratory diseases such as Chronic Obstructive Pulmonary Disease (COPD) remains a major global healthcare challenge, particularly in primary care and low-resource settings. Pulmonary Function Testing (PFT) is considered the clinical gold standard for diagnosing respiratory conditions, yet traditional spirometry systems are difficult to deploy at scale due to their operational complexity, strict quality-control requirements, and the need for specialized interpretation. At the same time, recent advances in Artificial Intelligence (AI) and large-scale medical data platforms create new opportunities for intelligent, scalable, and automated pulmonary diagnostics capable of supporting clinicians in real-world settings.

This project proposes the development of an AI-powered pulmonary function analysis platform within the context of the China-Portugal AI Pulmonary Function Data Platform initiative, a collaborative effort involving Guangzhou Medical University, INESC-ID Portugal, and Macau University of Science and Technology. The proposed research will focus on the application of deep learning methods to respiratory signal analysis, including the automatic interpretation of spirometry curves, real-time quality control, and anomaly detection. Inspired by recent advances in AI-driven healthcare systems, the project will explore machine learning models capable of reconstructing and predicting full pulmonary function patterns from short-duration expiratory signals, while correcting artifacts such as cough interruptions or premature termination of breathing maneuvers.
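
To ground the terminology, the sketch below computes two standard spirometry indices from a volume-time curve, FVC (total exhaled volume) and FEV1 (volume exhaled in the first second), along with a simple end-of-test plateau check that could flag premature termination. It is a minimal illustration: the curve is assumed to start at time zero of the forced expiration, and the plateau criterion (less than 0.025 L change over the final second) is an assumed threshold for this example.

```python
def interpolate(times, volumes, t):
    """Linearly interpolate the exhaled volume at time t."""
    for i in range(len(times) - 1):
        t0, t1 = times[i], times[i + 1]
        if t0 <= t <= t1:
            v0, v1 = volumes[i], volumes[i + 1]
            return v0 + (v1 - v0) * (t - t0) / (t1 - t0)
    raise ValueError("t lies outside the recorded maneuver")


def spirometry_indices(times, volumes):
    """FVC, FEV1 and their ratio from a volume-time curve (seconds, litres)."""
    fvc = max(volumes)
    fev1 = interpolate(times, volumes, 1.0)
    return {"FVC": fvc, "FEV1": fev1, "FEV1/FVC": fev1 / fvc}


def reached_plateau(times, volumes, window_s=1.0, max_change_l=0.025):
    """True if the volume changed by at most max_change_l over the final window."""
    end_time = times[-1]
    if end_time - window_s < times[0]:
        return False  # Maneuver shorter than the window itself.
    start_volume = interpolate(times, volumes, end_time - window_s)
    return volumes[-1] - start_volume <= max_change_l


# Hypothetical maneuver sampled at a few time points.
times = [0.0, 0.5, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
volumes = [0.0, 2.0, 3.0, 3.6, 3.9, 4.0, 4.0, 4.0]
print(spirometry_indices(times, volumes), reached_plateau(times, volumes))
```

A learned model for reconstructing full curves from short signals could be evaluated by comparing the indices computed from its predicted curve with those from the complete maneuver.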

Requisites: The student should have significant programming experience and practical knowledge of machine learning languages and environments, such as PyTorch or TensorFlow.

Notes: The project will be supervised by Arlindo Oliveira (DEI/IST and INESC-ID). It will be developed in the context of ongoing research projects currently being executed at the China-Portugal AI joint laboratory. Within INESC-ID, students will have access to computational resources supporting the training of large neural networks (i.e., servers with A100 GPUs), and they are expected to interact with other MLKD researchers (https://mlkd.idss.inesc-id.pt/) working on similar topics (e.g., Ph.D. students in the group who can act as mentors to newcomers).
Improving Medical Image Segmentation through Human Feedback and Reward Modeling
Supervised by Arlindo L. Oliveira

Medical image segmentation is a fundamental task in AI-assisted diagnosis and clinical image analysis. It consists of identifying and delineating regions of interest within medical images, such as tumors, lesions, or specific anatomical structures. Accurate segmentation enables clinicians and AI systems to better characterize findings based on properties such as shape, size, texture, and spatial organization, while also supporting downstream tasks including disease detection, treatment planning, quantitative analysis, and quality control. However, the creation of high-quality segmentation annotations remains a major bottleneck, as it requires substantial domain expertise and is both time-consuming and expensive.

Recent advances in deep learning have led to highly effective segmentation architectures, including models such as U-Net and DeepLabv3+, as well as transformer-based approaches, which achieve strong performance across multiple imaging domains. Nevertheless, medical imaging continues to present unique challenges due to high inter-observer variability, noisy annotations, domain shifts across institutions, and the need for extremely precise boundaries in clinically relevant regions.

This project proposes the development of reward-based learning methods for medical image segmentation using Reinforcement Learning from Human Feedback (RLHF). The central idea is to incorporate expert preferences directly into the training process through reward models that learn to evaluate the quality of segmentation masks based on human feedback. These reward models may consist of lightweight neural networks trained to distinguish preferred segmentations from suboptimal ones, enabling the segmentation system to iteratively improve according to expert-defined criteria such as boundary precision, anatomical consistency, or clinical usefulness.
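
A reward model of this kind is commonly trained with a pairwise preference (Bradley-Terry) loss, where the loss is small when the preferred segmentation receives the higher reward. The sketch below trains a tiny linear reward model on hand-made feature vectors. It is only an illustration of the reward-modeling step, not of the full RLHF pipeline: the features (for example, a boundary smoothness score and an overlap score) and all numeric values are hypothetical, and a real reward model would be a neural network operating on the image and mask.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def preference_loss(reward_preferred, reward_rejected):
    """Bradley-Terry loss: -log sigmoid(r_preferred - r_rejected), computed stably."""
    diff = reward_preferred - reward_rejected
    if diff >= 0:
        return math.log1p(math.exp(-diff))
    return -diff + math.log1p(math.exp(diff))


def train_reward_model(pairs, dim, learning_rate=0.1, epochs=100):
    """Fit weights w so that w . features is higher for preferred masks."""
    weights = [0.0] * dim
    for _ in range(epochs):
        for preferred, rejected in pairs:
            deltas = [p - r for p, r in zip(preferred, rejected)]
            diff = sum(w * d for w, d in zip(weights, deltas))
            # Gradient of -log sigmoid(diff) with respect to diff.
            grad_scale = -(1.0 - sigmoid(diff))
            weights = [w - learning_rate * grad_scale * d for w, d in zip(weights, deltas)]
    return weights


def reward(weights, features):
    return sum(w * f for w, f in zip(weights, features))


# Hypothetical (preferred, rejected) feature pairs derived from expert feedback.
pairs = [([0.9, 0.2], [0.5, 0.2]), ([0.8, 0.6], [0.3, 0.7])]
weights = train_reward_model(pairs, dim=2)
print(all(reward(weights, p) > reward(weights, r) for p, r in pairs))
```

Once trained, such a reward could score candidate masks produced by the segmentation model, providing the signal used to iteratively improve it.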

The initial focus of the project will be on histopathology datasets based on Hematoxylin and Eosin (H&E) stained tissue images, where fine-grained segmentation plays a crucial role in cancer analysis and tissue characterization. The proposed framework, however, will be designed to generalize to other medical imaging modalities and applications.

The expected outcomes include the development of novel RLHF methodologies for medical image segmentation and improved human-in-the-loop annotation workflows.

Requisites: The student should have significant programming experience and practical knowledge of machine learning frameworks and environments, such as PyTorch or TensorFlow.

Notes: The project will be supervised by Arlindo Oliveira (DEI/IST and INESC-ID). The selected student will have access to the facilities of INESC-ID and the MLKD group (https://mlkd.idss.inesc-id.pt/), including computing facilities with four DELL PowerEdge C41402 servers, eight NVIDIA 32GB Tesla V100, four NVIDIA 48GB A40 and four NVIDIA 64GB Tesla A100, among other computing servers (https://mlkd.idss.inesc-id.pt/cluster).
Sign Language Recognition with Large Vision-and-Language Models Supervised by Bruno Martins and Arlindo L. Oliveira Sign language plays a vital role in enabling communication for deaf and hard-of-hearing individuals, yet automatic Sign Language Recognition (SLR) remains a challenging task due to its high visual complexity and the limited availability of large annotated datasets, particularly for underrepresented sign languages such as Portuguese Sign Language (Língua Gestual Portuguesa, LGP). Traditional approaches often rely on intermediate gloss annotations, which are costly to produce and restrict scalability. Meanwhile, recent progress in Large Vision-and-Language Models (LVLMs) opens new opportunities to bypass these constraints by enabling direct translation from video to text without the need for intermediate representations. These models can exploit rich semantic and temporal features directly from raw video input, offering an end-to-end alternative for gloss-free sign language understanding.

This project proposes the development of a sign language recognition system that directly utilizes video frames as input to a Large Vision-and-Language Model, taking inspiration from recent proposals (e.g., 2412.16524, 2404.00925, 2408.10593). Unlike previous approaches dependent on body key-point extraction or gloss-level supervision, the method will explore the use of LVLMs fine-tuned with paired video-text data, treating sign language video frames as a first-class input modality. Inspired by recent advances in multimodal learning and the bridging of visual and language token spaces, the model will be trained to recognize and translate full video sequences into natural language descriptions. In parallel, the project will address a critical gap in data availability for LGP, by curating a new dataset from publicly available sources such as televised news broadcasts interpreted in sign language, applying automatic alignment and segmentation strategies to create training pairs.

Evaluation will be conducted using both newly collected LGP data and existing sign language datasets for other languages, such as PHOENIX14T, CSL-Daily, How2Sign (how2sign.github.io, signllm.github.io), or the data from the Sign Language Translation Task of WMT22/WMT23 (wmt-slt.com). The proposed model will be benchmarked against existing state-of-the-art gloss-free SLT systems, with comparisons focusing on translation accuracy, robustness to signer variation, and performance in low-resource language scenarios. The main deliverable will be a technical report detailing the system architecture, data collection and preprocessing pipelines, and experimental results, with the aim of submission to a leading conference in computer vision or natural language processing.

Requisites:
• Commitment and availability to work on the project (e.g., not recommended for students with other professional activities);
• Interest in the application of deep learning methods to natural language processing — previous projects involving the use of Transformer-based models will be valued;
• Good command of English;
• Excellent grades in courses related to the topics of the project (i.e., an average grade of 17 out of 20 or higher);
• Knowledge and experience with the use of tools like Overleaf and GitHub;
• Knowledge of Python and machine learning libraries such as PyTorch and HuggingFace Transformers;
• Preference will be given to students enrolled, or interested in enrolling, in the PhD fast track programme.

Notes: The project will be supervised by Bruno Martins (DEEC/IST and INESC-ID) and Arlindo Oliveira (DEI/IST and INESC-ID). It will be developed in the context of ongoing research projects currently being executed at the Human Language Technologies (HLT) group of INESC-ID (e.g., the Center for Responsible AI — https://centerforresponsible.ai).

Within INESC-ID, students will have access to computational resources supporting the training of large neural networks (i.e., servers with A100 GPUs), and they are expected to interact with other HLT researchers working on similar topics (e.g., Ph.D. students in the group that can act as mentors to newcomers).

Students interested in this proposal should contact Prof. Bruno Martins (bruno.g.martins@tecnico.ulisboa.pt) in order to schedule an interview.
Text Segmentation by TextTiling with Modern Neural Retrieval Models Supervised by Bruno Martins and Arlindo L. Oliveira Text segmentation is a foundational task in natural language processing, supporting a wide range of applications such as document summarization, topic modeling, media content repurposing, and more recently, Retrieval-Augmented Generation (RAG) systems (2503.09600). Effective segmentation enables better contextualization of content, allowing downstream models to focus on coherent blocks of information rather than arbitrary fixed-length inputs. Classical approaches like TextTiling segment texts into semantically coherent passages based on lexical cohesion, typically using text similarity measures. Subsequent approaches have revisited the TextTiling idea leveraging representations obtained from encoder models such as BERT (PeerJ-CS, GitHub - DeepTiling). More recently, with the advent of neural retrieval and large language models (LLMs), there is a renewed interest in developing segmentation strategies that can more precisely model topical boundaries and improve the efficiency and quality of retrieval-based systems. Recent approaches have even proposed the use of LLMs to perform text segmentation (2406.17526), although computational efficiency is a concern with these methods.

In the context of this M.Sc. research project, the candidate will revisit the original TextTiling method and augment it by integrating modern neural sentence representations, particularly those derived from models pre-trained for dense retrieval tasks (e.g., sentence-transformers or embedding models based on larger Transformer decoders fine-tuned for semantic search). By replacing TF-IDF vectors with dense embeddings, the method aims to improve the detection of semantic shifts between text segments, while maintaining the interpretability and unsupervised nature of the original algorithm. In parallel, we will compare this enhanced TextTiling method with LLM-based segmentation approaches such as LumberChunker, which dynamically determine chunk boundaries through generative prompting. The goal is to assess the trade-offs between lightweight embedding-based segmentation, and more computationally intensive instruction-tuned models in real-world scenarios.
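
As a rough illustration of the embedding-based TextTiling variant described above, the sketch below scores each gap between sentences by the cosine similarity of the mean embeddings of the blocks on either side, converts the similarities into depth scores, and places boundaries at depth peaks above a cutoff. Several things here are assumptions made for the example: `toy_embed` is a bag-of-words stand-in for a real dense retrieval encoder (any function mapping a list of sentences to a list of vectors could be plugged in instead), the block size of 2 is arbitrary, and the boundary rule (local depth maximum above the mean minus half a standard deviation) is a simplification of the cutoff used in the classical method.

```python
import math
import re


def toy_embed(sentences):
    """HYPOTHETICAL stand-in for a dense encoder: word-count vectors
    over the vocabulary of the given sentences."""
    tokenized = [re.findall(r"[a-z]+", s.lower()) for s in sentences]
    vocab = sorted({w for toks in tokenized for w in toks})
    index = {w: i for i, w in enumerate(vocab)}
    vectors = []
    for toks in tokenized:
        v = [0.0] * len(vocab)
        for w in toks:
            v[index[w]] += 1.0
        vectors.append(v)
    return vectors


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


def mean_vector(vectors):
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def gap_similarities(vectors, block=2):
    """Similarity between the blocks before and after each sentence gap."""
    sims = []
    for gap in range(1, len(vectors)):
        left = mean_vector(vectors[max(0, gap - block):gap])
        right = mean_vector(vectors[gap:gap + block])
        sims.append(cosine(left, right))
    return sims


def depth_scores(sims):
    """Depth of each similarity valley relative to the nearest peaks."""
    depths = []
    for i, s in enumerate(sims):
        left = s
        for j in range(i - 1, -1, -1):
            if sims[j] < left:
                break
            left = sims[j]
        right = s
        for j in range(i + 1, len(sims)):
            if sims[j] < right:
                break
            right = sims[j]
        depths.append((left - s) + (right - s))
    return depths


def segment(sentences, embed_fn=toy_embed, block=2, k=0.5):
    """Split sentences into segments at sufficiently deep similarity valleys."""
    if len(sentences) < 2:
        return [list(sentences)]
    depths = depth_scores(gap_similarities(embed_fn(sentences), block))
    mean = sum(depths) / len(depths)
    std = math.sqrt(sum((d - mean) ** 2 for d in depths) / len(depths))
    cutoff = mean - k * std
    segments, start = [], 0
    for i, d in enumerate(depths):
        is_peak = (i == 0 or d > depths[i - 1]) and (
            i == len(depths) - 1 or d >= depths[i + 1])
        if d > cutoff and is_peak:
            segments.append(sentences[start:i + 1])
            start = i + 1
    segments.append(sentences[start:])
    return segments


if __name__ == "__main__":
    # Toy input, invented for illustration: two topics, three sentences each.
    text = [
        "the cat chased the mouse",
        "the cat ate the mouse",
        "a mouse fears every cat",
        "stocks fell as markets closed",
        "markets rallied and stocks rose",
        "investors watch stocks and markets",
    ]
    for seg in segment(text):
        print(seg)
```

Keeping the encoder behind a single `embed_fn` argument means the similarity, depth, and boundary steps stay unchanged when the toy vectors are replaced by embeddings from a retrieval model, which is the comparison the project is interested in.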

Evaluation will be performed using a mixture of segmentation benchmarks, possibly also considering downstream task performance within RAG pipelines. Benchmarks such as Wiki-727K, or the GutenQA benchmark used in LumberChunker, will be employed to quantify segmentation quality using established metrics. A technical report will document the results, with a view to submission to a leading conference in natural language processing or information retrieval.
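
The proposal does not name the segmentation metrics. As an assumption, the sketch below implements two that are commonly reported for text segmentation, Pk and WindowDiff. Segmentations are given as lists of segment lengths in sentences, and the window size defaults to half the mean reference segment length; the function names and this input format are choices made for the example.

```python
def segment_ids(lengths):
    """Expand segment lengths into one segment id per sentence,
    e.g. [2, 3] -> [0, 0, 1, 1, 1]."""
    ids = []
    for seg, n in enumerate(lengths):
        ids.extend([seg] * n)
    return ids


def _prepare(ref_lengths, hyp_lengths, k):
    ref, hyp = segment_ids(ref_lengths), segment_ids(hyp_lengths)
    if len(ref) != len(hyp):
        raise ValueError("reference and hypothesis cover different lengths")
    if k is None:
        # Half the mean reference segment length, at least 1.
        k = max(1, round(len(ref) / (2 * len(ref_lengths))))
    if len(ref) - k <= 0:
        raise ValueError("window size k is too large for this text")
    return ref, hyp, k


def pk(ref_lengths, hyp_lengths, k=None):
    """Fraction of windows where reference and hypothesis disagree on
    whether the two window ends fall in the same segment."""
    ref, hyp, k = _prepare(ref_lengths, hyp_lengths, k)
    windows = len(ref) - k
    errors = sum(
        (ref[i] == ref[i + k]) != (hyp[i] == hyp[i + k])
        for i in range(windows)
    )
    return errors / windows


def window_diff(ref_lengths, hyp_lengths, k=None):
    """Fraction of windows where the number of boundaries inside the
    window differs between reference and hypothesis."""
    ref, hyp, k = _prepare(ref_lengths, hyp_lengths, k)
    windows = len(ref) - k
    errors = sum(
        (ref[i + k] - ref[i]) != (hyp[i + k] - hyp[i])
        for i in range(windows)
    )
    return errors / windows


if __name__ == "__main__":
    reference = [3, 3]   # two segments of three sentences each
    print(pk(reference, [3, 3]), window_diff(reference, [3, 3]))
    print(pk(reference, [6]), window_diff(reference, [6]))
```

Both scores are error rates, so lower is better and a perfect segmentation scores 0.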

Requisites:
• Commitment and availability to work on the project (e.g., not recommended for students with other professional activities);
• Interest in the application of deep learning methods to natural language processing — previous projects involving the use of Transformer-based models will be valued;
• Good command of English;
• Excellent grades in courses related to the topics of the project (i.e., an average grade of 17 out of 20 or higher);
• Knowledge and experience with the use of tools like Overleaf and GitHub;
• Knowledge of Python and machine learning libraries such as PyTorch and HuggingFace Transformers;
• Preference will be given to students enrolled, or interested in enrolling, in the PhD fast track programme.

Notes: The project will be supervised by Bruno Martins (DEEC/IST and INESC-ID) and Arlindo Oliveira (DEI/IST and INESC-ID). It will be developed in the context of ongoing research projects currently being executed at the Human Language Technologies (HLT) group of INESC-ID (e.g., the Center for Responsible AI — https://centerforresponsible.ai).

Within INESC-ID, students will have access to computational resources supporting the training of large neural networks (i.e., servers with A100 GPUs), and they are expected to interact with other HLT researchers working on similar topics (e.g., Ph.D. students in the group that can act as mentors to newcomers).

Students interested in this proposal should contact Prof. Bruno Martins (bruno.g.martins@tecnico.ulisboa.pt) in order to schedule an interview.
Currently, there are no dissertations open for application.
Analysis of sensor data for monitoring of open spaces Supervised by Arlindo L. Oliveira and authored by André Duarte