MIRACLE – MultImodal Representations for Automated Concept
Learning and Extraction
Supervised by Arlindo L. Oliveira and André Carreiro
After the success of Large Language Models, multimodal
foundation models are gaining traction, with new models
effectively combining image, text, video, and audio to improve
capabilities and the potential for interpretability. Concept
learning, which involves discovering and leveraging high-level
features from data (think of color or shape as concepts in
images, or tone and pitch as concepts in audio), is crucial in
improving explainability and control in both discriminative
and generative tasks. Based on the groundwork laid by previous
work, this dissertation will build a framework that
automatically discovers concepts from multimodal datasets
(e.g., X-rays and their reports) and extracts concept-related
features from new samples. The methodology focuses on
self-supervised multimodal representation learning.
Additionally, we aim to leverage the recent Concept Relevance
Propagation (CRP) framework, used to improve the
interpretability of ML models, to study how concepts, defined
through multimodal relationships (e.g., text-image), are
relevant for downstream tasks. Real-world applications are
numerous; in particular, the student will have access to a
large dataset combining chest X-ray images with corresponding
textual medical reports. The dissertation will also leverage the
platform Concept-MNIST, a dataset generator with controllable
image concepts such as shape, color, texture, rotation, scale,
among others, providing a scalable and trusted ground truth for
concept learning. Questions to be addressed include: how can
multimodal representation learning be extended to include and
effectively integrate additional data types beyond image-text;
what alternative strategies beyond CLIP-based models can be
employed for multimodal learning to enhance concept discovery
and extraction, e.g., with smoother latent representations;
how can concept relationships (e.g. concept graph) be
effectively modeled and utilized to enhance multimodal
representations; how can existing knowledge bases (e.g.
dictionaries or ontologies) be integrated into the learning
process to support and improve concept discovery and
inference; how can the CRP framework be adapted to incorporate
multimodal data, and what improvements can this bring to the
explainability of AI models?
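As a starting point for the CLIP-based baseline mentioned above, zero-shot concept scoring reduces to cosine similarity between an image embedding and a set of text-concept embeddings. The sketch below uses random toy vectors in place of real encoder outputs; the concept names and dimensions are illustrative assumptions, not part of the proposal.

```python
import numpy as np

# Toy stand-ins for encoder outputs; in practice these would come from
# a CLIP-style image encoder and text encoder.
rng = np.random.default_rng(0)
image_embedding = rng.normal(size=512)
concept_texts = ["round opacity", "enlarged heart", "clear lung field"]
concept_embeddings = rng.normal(size=(3, 512))

def concept_scores(img, concepts, temperature=0.07):
    # L2-normalise so dot products become cosine similarities,
    # then softmax over concepts, as in CLIP-style zero-shot scoring.
    img = img / np.linalg.norm(img)
    concepts = concepts / np.linalg.norm(concepts, axis=1, keepdims=True)
    logits = concepts @ img / temperature
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()

scores = concept_scores(image_embedding, concept_embeddings)
for text, score in zip(concept_texts, scores):
    print(f"{text}: {score:.3f}")
```

The resulting distribution over concepts is exactly the kind of concept-level signal the framework would extract from new samples.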
Transfer single-cell foundation models to predict drug
resistance in cancer
Supervised by Arlindo L. Oliveira and Emanuel Gonçalves
Drug resistance is one of the leading causes of therapy
failure in cancer, with devastating consequences for patients
and their loved ones. Single-cell technologies offer an
unprecedented resolution, analysing thousands to millions of
cells, revealing the heterogeneity of tumours along with rare
drug-tolerant cells. Deep learning models, e.g. variational
autoencoders and transformers, have been very successfully
applied to integrate a wide range of multi-modal single-cell
datasets spanning thousands to millions of cells.
Single-cell generative models, including foundation models
such as scBERT, are now becoming commonplace to study many aspects
of cancer biology, including predicting how cancer cells
respond to gene deletions to propose novel drug targets.
Aligned with this, recent efforts focus on the creation of
computational pipelines to integrate and harmonise single-cell
perturbation datasets. Large-scale unsupervised single-cell
models can be fine-tuned for specific tasks, such as cancer
type identification and prediction of drug responses. In this
project, we propose to harness these foundation models to
explore zero-shot, few-shot and fine-tuning approaches to
integrate Cancer Dependency Map data
(https://www.broadinstitute.org/cancer/cancer-dependency-map).
This will leverage screens of hundreds of anti-cancer drugs
across >1,000 cancer cell lines, enabling us to
study the variation in drug responses among different cancer
types using computational models. Ultimately, we can pinpoint
cell populations that are resistant to drug treatments,
shedding light on the regulatory mechanisms behind drug
resistance and uncovering new strategies to overcome it.
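The fine-tuning route described above typically freezes the pretrained encoder and trains a small task head on drug-response labels. The sketch below uses a small MLP and synthetic data as stand-ins for the foundation model and the Cancer Dependency Map screens; the gene count, head, and targets are illustrative assumptions.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Stand-in for a pretrained single-cell foundation model encoder;
# here a small MLP over 2,000 genes, purely illustrative.
encoder = nn.Sequential(nn.Linear(2000, 256), nn.ReLU(), nn.Linear(256, 128))
for p in encoder.parameters():
    p.requires_grad = False  # freeze the foundation model

# Task head fine-tuned to predict a drug-response value per cell line.
head = nn.Linear(128, 1)
optimizer = torch.optim.Adam(head.parameters(), lr=1e-3)

# Synthetic batch: 32 expression profiles with response targets.
x = torch.randn(32, 2000)
y = torch.randn(32, 1)

for _ in range(5):
    pred = head(encoder(x))
    loss = nn.functional.mse_loss(pred, y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

print(pred.shape, loss.item())
```

Unfreezing some encoder layers, or replacing the head, covers the zero-shot-to-fine-tuned spectrum the project proposes to explore.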
Requisites: The student should have significant programming
experience, and practical knowledge of machine learning
languages and environments, such as PyTorch or TensorFlow.
The student should also be interested in developing an
understanding of large language models and LLM APIs. Notes:
The selected student will have access to the facilities of
INESC-ID and the MLKD group (https://mlkd.idss.inesc-id.pt/),
including computing facilities that include four DELL
PowerEdge C41402 servers, eight NVIDIA 32GB Tesla V100, four
NVIDIA 48GB A40 and four NVIDIA 64GB Tesla A100, among other
computing servers (https://mlkd.idss.inesc-id.pt/cluster)
Characterization of the internal representations of
multi-modal Large Language Models
Supervised by Arlindo L. Oliveira and Bruno Martins
Multi-modal large language models, mainly based on the
transformer architecture and trained on extensive amounts of
multi-modal data, have shown a surprising ability to perform
many different tasks with little specific training. In
particular, some of these models exhibited an ability to
perform few-shot and even zero-shot learning. Reports
on the emerging abilities of large language models, such as
Gemini, GPT-4o and Claude 3, have also presented evidence that
these models are able to significantly extrapolate from the
training data used and solve problems in which they were not
supposed to be proficient. This dissertation will explore the
internal representations of large language models, in order to
characterize the nature of these representations and shed
light on the mechanisms used to represent knowledge. The work
to be developed will use open-source models, such as LLaMA and
LLaVA, together with publicly available data from functional
MRI experiments with animals to explore and characterize the
internal representations of large language models, with the
objective of shedding light on the way these models work and,
possibly, on how they are able to generalize from training
data to new domains. Requisites: The student should have
significant programming experience, and practical knowledge of
machine learning languages and environments, such as PyTorch
or TensorFlow. The student should also be interested in
developing an understanding of large language models and LLM APIs.
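A common recipe for characterizing internal representations is to register forward hooks that capture each layer's hidden states for later analysis. The sketch below applies this to a tiny transformer standing in for an open-source LLM; the model size and input are illustrative assumptions, but the hook mechanism carries over to real models.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Tiny transformer standing in for an open-source LLM.
layer = nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True)
model = nn.TransformerEncoder(layer, num_layers=3)

activations = {}

def save_activation(name):
    def hook(module, inputs, output):
        activations[name] = output.detach()
    return hook

# Register a forward hook on every layer to capture its hidden states.
for i, block in enumerate(model.layers):
    block.register_forward_hook(save_activation(f"layer_{i}"))

tokens = torch.randn(1, 10, 64)  # one sequence of 10 token embeddings
model(tokens)

# Each captured tensor can now be analysed (PCA, probing classifiers,
# comparison with fMRI response patterns, etc.).
for name, act in activations.items():
    print(name, act.shape)
```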
Notes: The selected student will have access to the facilities
of INESC-ID and the MLKD group
(https://mlkd.idss.inesc-id.pt/), including computing
facilities that include four DELL PowerEdge C41402 servers,
eight NVIDIA 32GB Tesla V100, four NVIDIA 48GB A40 and four
NVIDIA 64GB Tesla A100, among other computing servers
(https://mlkd.idss.inesc-id.pt/cluster)
Deep neural network architectures for dual process computation
Supervised by Arlindo L. Oliveira
Dual process theories have been used to explain the different
modes of behavior of the human brain when processing
information. These theories became popular with the work of
Kahneman's Thinking, Fast and Slow, but they are based on decades
of experimental evidence that the human brain works in two
different modes. System 1 processes large amounts of visual
and sensory information, efficiently and unconsciously. For
instance, face and object recognition, speech processing and
many other automatic functions are performed effortlessly by
the human brain, using system 1. Other tasks require conscious
effort, like answering complex riddles, executing non-trivial
arithmetic operations, or planning unfamiliar tasks. These
tasks are performed by system 2. Existing systems, like
convolutional neural networks, for vision, or transformers,
for natural language processing, behave very much like system
1 in the human brain: they perform fast, high-throughput,
processing of high-dimensional information, in an unconscious
way. This dissertation will be focused on the design of deep
neural network architectures that can be used to emulate the
dual process computation that characterizes the human brain
and also on the relation of dual process architectures and
consciousness. Requisites: The student should have significant
programming experience, and practical knowledge of machine
learning languages and environments, such as PyTorch or
TensorFlow. The student should also be interested in developing an
understanding of neuroscience and human psychology. Notes: The
selected student will have access to the facilities of
INESC-ID and the MLKD group (https://mlkd.idss.inesc-id.pt/),
including computing facilities that include four DELL
PowerEdge C41402 servers, eight NVIDIA 32GB Tesla V100, four
NVIDIA 48GB A40 and four NVIDIA 64GB Tesla A100, among other
computing servers (https://mlkd.idss.inesc-id.pt/cluster)
Assessing Large Language Models for information verification
in Portuguese
Supervised by Arlindo L. Oliveira and José Martinho
In an increasingly interconnected world driven by digital
technologies, the ability to navigate, critically evaluate,
and effectively use information is more essential than ever.
However, amidst the vast amount of online information,
misinformation proliferates, posing significant challenges to
informed decision-making. With the advent of Large Language
Models (LLMs), there is a potential solution in using these
models to verify the accuracy of circulating information
online. Different Artificial Intelligence models offer
distinct capabilities in processing and understanding large
amounts of textual data. This research aims to explore and
compare the effectiveness of these models in the context of
information verification. The objectives of this thesis are:
to expand the knowledge of the state of the art in the area of
information verification using Artificial Intelligence; to
identify and select appropriate methods for developing a
solution that aims to investigate the ability of Large
Language Models to verify the truthfulness of a text, compared
to a knowledge base; to use open-source models (LLaMA, Vicuna,
Mistral, etc.) and proprietary ones (GPT-4, Bard, Claude,
etc.) to verify information; to analyze the results obtained
in experiments, comparing the performance of different LLMs
with the existing state of the art and to draw pertinent
conclusions based on these results. The student should have
significant programming experience, and practical knowledge of
machine learning languages and environments, such as PyTorch
or TensorFlow. Notes: The work will be developed in
cooperation with SIED (Serviço de Informações Estratégicas de
Defesa), which has significant expertise in the topic.
The selected student will have access to the facilities of
INESC-ID and the MLKD group (https://mlkd.idss.inesc-id.pt/),
including computing facilities that include four DELL
PowerEdge C41402 servers, eight NVIDIA 32GB Tesla V100, four
NVIDIA 48GB A40 and four NVIDIA 64GB Tesla A100, among other
computing servers (https://mlkd.idss.inesc-id.pt/cluster)
Automatic penetration testing with Large Language Models
Supervised by Arlindo L. Oliveira and José Martinho
This thesis proposes to explore the application of LLMs in
automating the generation of vulnerability identification
scripts, leveraging the extensive database of vulnerability
reports and existing payloads. This research not only holds
the promise of streamlining vulnerability identification
processes but also of pushing the boundaries of what is
currently achievable in the intersection of artificial
intelligence and information security. The objectives include:
to investigate the capability of LLMs to generate code,
assessing state-of-the-art LLMs' ability to understand and
generate code, with a specific focus on scripts used in
penetration testing; to develop a methodology for automating
vulnerability identification, creating a framework that uses
LLMs to generate vulnerability identification scripts from
existing vulnerability reports and payloads; and to evaluate
the effectiveness
of automated scripts by testing the generated scripts against
a controlled set of vulnerabilities to measure their accuracy,
efficiency, and reliability in identifying security flaws.
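The evaluation step above amounts to comparing the findings of the generated scripts against a known ground truth. The sketch below computes precision and recall over sets of vulnerability identifiers; all identifiers are hypothetical examples, not results from the project.

```python
# Findings reported by generated scripts vs. a controlled ground truth
# (identifiers are illustrative only).
ground_truth = {"CVE-2021-44228", "CVE-2017-5638", "CVE-2019-0708"}
script_findings = {"CVE-2021-44228", "CVE-2019-0708", "CVE-2014-0160"}

true_positives = ground_truth & script_findings
precision = len(true_positives) / len(script_findings)
recall = len(true_positives) / len(ground_truth)

print(f"precision={precision:.2f} recall={recall:.2f}")
```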
Requisites: The student should have significant programming
experience, and practical knowledge of machine learning
languages and environments, such as PyTorch or TensorFlow.
Notes: The work will be developed in cooperation with
Ethiack, which has significant expertise in the topic.
The selected student will have access to the facilities of
INESC-ID and the MLKD group (https://mlkd.idss.inesc-id.pt/),
including computing facilities that include four DELL
PowerEdge C41402 servers, eight NVIDIA 32GB Tesla V100, four
NVIDIA 48GB A40 and four NVIDIA 64GB Tesla A100, among other
computing servers (https://mlkd.idss.inesc-id.pt/cluster)