MLKD

New dissertations (open for applications)

MIRACLE – MultImodal Representations for Automated Concept Learning and Extraction
Supervised by Arlindo L. Oliveira and André Carreiro
After the success of Large Language Models, multimodal foundation models are gaining traction, with new models effectively combining image, text, video, and audio to improve capabilities and the potential for interpretability. Concept learning, which involves discovering and leveraging high-level features from data (think of color or shape as concepts in images, or tone and pitch as concepts in audio), is crucial for improving explainability and control in both discriminative and generative tasks. Building on the groundwork laid by previous work, this dissertation will develop a framework that automatically discovers concepts from multimodal datasets (e.g., X-rays and their reports) and extracts concept-related features from new samples. The methodology focuses on self-supervised multimodal representation learning. Additionally, we aim to leverage the recent Concept Relevance Propagation (CRP) framework, used to improve the interpretability of ML models, to study how concepts defined through multimodal relationships (e.g., text-image) are relevant for downstream tasks. Real-world applications are numerous, and the student will have access to a large dataset combining chest X-ray images with corresponding textual medical reports. The dissertation will also leverage the Concept-MNIST platform, a dataset generator with controllable image concepts such as shape, color, texture, rotation, and scale, providing a scalable and trusted ground truth for concept learning. Questions to be addressed include: how can multimodal representation learning be extended to include and effectively integrate additional data types beyond image-text; what alternative strategies beyond CLIP-based models can be employed for multimodal learning to enhance concept discovery and extraction, e.g., with smoother latent representations; how can concept relationships (e.g., a concept graph) be effectively modeled and utilized to enhance multimodal representations; how can existing knowledge bases (e.g., dictionaries or ontologies) be integrated into the learning process to support and improve concept discovery and inference; and how can the CRP framework be adapted to incorporate multimodal data, and what improvements can this bring to the explainability of AI models?
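As an illustration of the self-supervised multimodal representation learning mentioned above, the sketch below shows a minimal CLIP-style contrastive objective for paired image/report embeddings in PyTorch. The encoders, embedding dimension, and temperature value are illustrative assumptions, not part of the proposal.

```python
# Minimal sketch of a CLIP-style contrastive loss for paired image/text embeddings.
# The random tensors stand in for the outputs of hypothetical image and text encoders.
import torch
import torch.nn.functional as F

def contrastive_loss(image_emb, text_emb, temperature=0.07):
    # Normalize so the dot product becomes a cosine similarity.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    # Pairwise similarity: entry (i, j) compares image i with report j.
    logits = image_emb @ text_emb.t() / temperature
    # Matching pairs lie on the diagonal; use the symmetric cross-entropy of CLIP.
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_i = F.cross_entropy(logits, targets)      # image -> text direction
    loss_t = F.cross_entropy(logits.t(), targets)  # text -> image direction
    return (loss_i + loss_t) / 2

# Toy usage with random embeddings standing in for X-ray and report encoders.
img = torch.randn(8, 512)
txt = torch.randn(8, 512)
print(contrastive_loss(img, txt).item())
```

Alternative objectives with smoother latent spaces, as asked in the research questions, would replace this loss while keeping the same paired-batch setup.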
Transfer single-cell foundation models to predict drug resistance in cancer
Supervised by Arlindo L. Oliveira and Emanuel Gonçalves
Drug resistance is one of the leading causes of therapy failure in cancer, with devastating consequences for patients and their close ones. Single-cell technologies offer unprecedented resolution, analysing thousands to millions of cells and revealing the heterogeneity of tumours along with rare drug-tolerant cells. Deep learning models, e.g. variational autoencoders and transformers, have been successfully applied to integrate a wide range of multi-modal single-cell datasets spanning thousands to millions of cells. Single-cell generative models, including foundation models such as scBERT, are now becoming commonplace for studying many aspects of cancer biology, including predicting how cancer cells respond to gene deletions in order to propose novel drug targets. Aligned with this, recent efforts focus on the creation of computational pipelines to integrate and harmonise single-cell perturbation datasets. Large-scale unsupervised single-cell models can be fine-tuned for specific tasks, such as cancer type identification and prediction of drug responses. In this project, we propose to harness these foundation models and explore zero-shot, few-shot and fine-tuning approaches to integrate Cancer Dependency Map data (https://www.broadinstitute.org/cancer/cancer-dependency-map). This will leverage screens of hundreds of anti-cancer drugs across >1,000 cancer cell lines, enabling us to study the variation in drug responses among different cancer types using computational models. Ultimately, we can pinpoint cell populations that are resistant to drug treatments, shedding light on the regulatory mechanisms behind drug resistance and uncovering new strategies to overcome it.
Requisites: The student should have significant programming experience, and practical knowledge of machine learning languages and environments, such as PyTorch or TensorFlow. He/she should also be interested in developing an understanding of large language models and LLM APIs.
Notes: The selected student will have access to the facilities of INESC-ID and the MLKD group (https://mlkd.idss.inesc-id.pt/), including computing facilities that include four DELL PowerEdge C41402 servers, eight NVIDIA 32GB Tesla V100, four NVIDIA 48GB A40 and four NVIDIA 64GB Tesla A100 GPUs, among other computing servers (https://mlkd.idss.inesc-id.pt/cluster).
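As one possible starting point for the fine-tuning approach described above, the sketch below trains a small regression head on embeddings produced by a hypothetical, frozen single-cell foundation model to predict a drug-response measure such as IC50. All dimensions, tensors, and the data-loading step are placeholders rather than the actual DepMap pipeline.

```python
# Minimal sketch: a drug-response regression head on top of precomputed embeddings
# from a (frozen) single-cell foundation model. Random tensors stand in for data.
import torch
from torch import nn

class DrugResponseHead(nn.Module):
    def __init__(self, emb_dim=512, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(emb_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1)
        )

    def forward(self, cell_emb):
        # One scalar response prediction per cell (or cell line) embedding.
        return self.net(cell_emb).squeeze(-1)

model = DrugResponseHead()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

embeddings = torch.randn(256, 512)  # placeholder for foundation-model embeddings
ic50 = torch.randn(256)             # placeholder for measured drug responses

for epoch in range(5):
    optimizer.zero_grad()
    loss = loss_fn(model(embeddings), ic50)
    loss.backward()
    optimizer.step()
```

Zero-shot and few-shot variants would keep the same embedding interface but replace or shrink the training step.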
Characterization of the internal representations of multi-modal Large Language Models
Supervised by Arlindo L. Oliveira and Bruno Martins
Multi-modal large language models, mainly based on the transformer architecture and trained on extensive amounts of multi-modal data, have shown a surprising ability to perform many different tasks with little specific training. In particular, some of these models have exhibited an ability to perform few-shot learning and even zero-shot learning. Reports on the emerging abilities of large language models, such as Gemini, GPT-4o and Claude 3, have also presented evidence that these models are able to significantly extrapolate from the training data used and solve problems in which they were not expected to be proficient. This dissertation will explore the internal representations of large language models, in order to characterize the nature of these representations and shed light on the mechanisms used to represent knowledge. The work to be developed will use open-source models, such as LLaMA and LLaVA, together with publicly available data from functional MRI experiments with animals, to explore and characterize the internal representations of large language models, with the objective of shedding light on the way these models work and, possibly, on how they are able to generalize from training data to new domains.
Requisites: The student should have significant programming experience, and practical knowledge of machine learning languages and environments, such as PyTorch or TensorFlow. He/she should also be interested in developing an understanding of large language models and LLM APIs.
Notes: The selected student will have access to the facilities of INESC-ID and the MLKD group (https://mlkd.idss.inesc-id.pt/), including computing facilities that include four DELL PowerEdge C41402 servers, eight NVIDIA 32GB Tesla V100, four NVIDIA 48GB A40 and four NVIDIA 64GB Tesla A100 GPUs, among other computing servers (https://mlkd.idss.inesc-id.pt/cluster).
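One concrete way to begin characterizing internal representations is to extract layer-wise hidden states with the Hugging Face transformers library, as sketched below. Here "gpt2" is used only as a small stand-in; an open-weight model such as LLaMA is loaded through the same interface given local weights and sufficient memory.

```python
# Minimal sketch of extracting per-layer hidden states from a causal language model.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "gpt2"  # placeholder checkpoint; swap for an open-weight LLaMA-family model
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, output_hidden_states=True)
model.eval()

inputs = tokenizer("The cat sat on the mat.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# One tensor per layer (plus the embedding layer), shape [batch, tokens, hidden_dim].
# These per-layer activations are the raw material that could later be compared with
# fMRI responses, e.g. through representational similarity analysis.
for i, h in enumerate(outputs.hidden_states):
    print(i, tuple(h.shape))
```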
Deep neural network architectures for dual process computation
Supervised by Arlindo L. Oliveira
Dual process theories have been used to explain the different modes of behavior of the human brain when processing information. These theories became popular with Kahneman's work, Thinking, Fast and Slow, but they are based on decades of experimental evidence that the human brain works in two different modes. System 1 processes large amounts of visual and sensory information efficiently and unconsciously. For instance, face and object recognition, speech processing and many other automatic functions are performed effortlessly by the human brain using system 1. Other tasks require conscious effort, like answering complex riddles, executing non-trivial arithmetic operations or planning unfamiliar tasks. These tasks are performed by system 2. Existing systems, like convolutional neural networks for vision, or transformers for natural language processing, behave very much like system 1 in the human brain: they perform fast, high-throughput processing of high-dimensional information in an unconscious way. This dissertation will focus on the design of deep neural network architectures that can be used to emulate the dual process computation that characterizes the human brain, and also on the relation between dual process architectures and consciousness.
Requisites: The student should have significant programming experience, and practical knowledge of machine learning languages and environments, such as PyTorch or TensorFlow. He/she should also be interested in developing an understanding of neuroscience and human psychology.
Notes: The selected student will have access to the facilities of INESC-ID and the MLKD group (https://mlkd.idss.inesc-id.pt/), including computing facilities that include four DELL PowerEdge C41402 servers, eight NVIDIA 32GB Tesla V100, four NVIDIA 48GB A40 and four NVIDIA 64GB Tesla A100 GPUs, among other computing servers (https://mlkd.idss.inesc-id.pt/cluster).
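By way of illustration only, the sketch below shows one possible dual-pathway design: a fast feed-forward "system 1" branch combined with a slower, iterative "system 2" branch that refines its state over several recurrent steps. All module choices and sizes are assumptions made for the example, not the architecture to be developed in the dissertation.

```python
# Minimal sketch of a dual-process network: a single fast pass plus an iterative branch.
import torch
from torch import nn

class DualProcessNet(nn.Module):
    def __init__(self, in_dim=64, hidden=128, out_dim=10, system2_steps=5):
        super().__init__()
        self.system1 = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())
        self.system2 = nn.GRUCell(in_dim, hidden)  # slow, iterative refinement
        self.steps = system2_steps
        self.readout = nn.Linear(2 * hidden, out_dim)

    def forward(self, x):
        fast = self.system1(x)  # fast, single-pass "system 1" response
        h = torch.zeros(x.size(0), self.system2.hidden_size, device=x.device)
        for _ in range(self.steps):  # deliberate, multi-step "system 2" loop
            h = self.system2(x, h)
        return self.readout(torch.cat([fast, h], dim=-1))

out = DualProcessNet()(torch.randn(4, 64))
print(out.shape)  # torch.Size([4, 10])
```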
Assessing Large Language Models for information verification in Portuguese
Supervised by Arlindo L. Oliveira and José Martinho
In an increasingly interconnected world driven by digital technologies, the ability to navigate, critically evaluate, and effectively use information is more essential than ever. However, amidst the vast amount of online information, misinformation proliferates, posing significant challenges to informed decision-making. With the advent of Large Language Models (LLMs), there is a potential solution in using these models to verify the accuracy of information circulating online. Different Artificial Intelligence models offer distinct capabilities in processing and understanding large amounts of textual data. This research aims to explore and compare the effectiveness of these models in the context of information verification. The objectives of this thesis are: to expand the knowledge of the state of the art in the area of information verification using Artificial Intelligence; to identify and select appropriate methods for developing a solution that investigates the ability of Large Language Models to verify the truthfulness of a text against a knowledge base; to use open-source models (LLaMA, Vicuna, Mistral, etc.) and proprietary ones (GPT-4, Bard, Claude, etc.) to verify information; and to analyze the results obtained in experiments, comparing the performance of different LLMs with the existing state of the art, and to draw pertinent conclusions based on these results.
Requisites: The student should have significant programming experience, and practical knowledge of machine learning languages and environments, such as PyTorch or TensorFlow.
Notes: The work will be developed in cooperation with SIED (Serviço de Informações Estratégicas de Defesa), which has significant expertise in the topic. The selected student will have access to the facilities of INESC-ID and the MLKD group (https://mlkd.idss.inesc-id.pt/), including computing facilities that include four DELL PowerEdge C41402 servers, eight NVIDIA 32GB Tesla V100, four NVIDIA 48GB A40 and four NVIDIA 64GB Tesla A100 GPUs, among other computing servers (https://mlkd.idss.inesc-id.pt/cluster).
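As a minimal illustration of the verification setup, the sketch below prompts an open-source LLM (here "gpt2" as a small placeholder, standing in for the models named above) to judge a claim against a retrieved evidence passage. The retrieval from the knowledge base and the comparison across models are assumed to happen elsewhere.

```python
# Minimal sketch: claim verification as a prompted judgment over retrieved evidence.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")  # placeholder open model

claim = "Lisbon is the capital of Portugal."
evidence = "Lisbon is the capital and largest city of Portugal."  # retrieved passage

prompt = (
    "Evidence: " + evidence + "\n"
    "Claim: " + claim + "\n"
    "Question: Is the claim supported by the evidence? "
    "Answer SUPPORTED, REFUTED, or NOT ENOUGH INFO.\n"
    "Answer:"
)
result = generator(prompt, max_new_tokens=10, do_sample=False)
print(result[0]["generated_text"])
```

The same prompt template could be sent to proprietary models through their APIs, allowing a like-for-like comparison of verification accuracy.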
Automatic penetration testing with Large Language Models
Supervised by Arlindo L. Oliveira and José Martinho
This thesis proposes to explore the application of LLMs in automating the generation of vulnerability identification scripts, leveraging the extensive database of vulnerability reports and existing payloads. This research not only holds the promise of streamlining vulnerability identification processes but also of pushing the boundaries of what is currently achievable at the intersection of artificial intelligence and information security. The objectives include: to investigate the capability of LLMs in generating code, by assessing the current state-of-the-art LLMs' ability to understand and generate code, specifically focusing on scripts used in penetration testing; to develop a methodology for automating vulnerability identification, by creating a framework that utilizes LLMs to automate the generation of vulnerability identification scripts based on existing vulnerability reports and payloads; and to evaluate the effectiveness of the automated scripts, by testing the generated scripts against a controlled set of vulnerabilities to measure their accuracy, efficiency, and reliability in identifying security flaws.
Requisites: The student should have significant programming experience, and practical knowledge of machine learning languages and environments, such as PyTorch or TensorFlow.
Notes: The work will be developed in cooperation with Ethiack, which has significant expertise in the topic. The selected student will have access to the facilities of INESC-ID and the MLKD group (https://mlkd.idss.inesc-id.pt/), including computing facilities that include four DELL PowerEdge C41402 servers, eight NVIDIA 32GB Tesla V100, four NVIDIA 48GB A40 and four NVIDIA 64GB Tesla A100 GPUs, among other computing servers (https://mlkd.idss.inesc-id.pt/cluster).
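As a minimal, hedged illustration of the script-generation step, the sketch below prompts an open-weight LLM (a small placeholder model) with a vulnerability report and asks for a candidate detection script. The report text is invented for the example, and any generated script would only be executed against the controlled, authorised targets described in the proposal.

```python
# Minimal sketch: generating a candidate vulnerability-identification script from a report.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")  # placeholder open model

report = (
    "Example report: the /login endpoint reflects the 'next' parameter without "
    "sanitisation, allowing reflected cross-site scripting."
)
prompt = (
    "You are a security assistant. Given the vulnerability report below, write a "
    "short Python script that sends a benign probe request and reports whether "
    "the target appears vulnerable.\n\n"
    "Report: " + report + "\n\nScript:\n"
)
candidate = generator(prompt, max_new_tokens=200, do_sample=False)
# The generated text is a *candidate* script: it must be reviewed and evaluated
# against the controlled set of vulnerabilities before any use.
print(candidate[0]["generated_text"])
```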
Analysis of sensor data for monitoring of open spaces Supervised by Arlindo L. Oliveira and authored by André Duarte