Assessing Large Language Models for information verification
in Portuguese
Supervised by Arlindo L. Oliveira and Nuno Cardoso and authored by Luís Câmara
In an increasingly interconnected world driven by digital
technologies, the ability to navigate, critically evaluate,
and effectively use information is more essential than ever.
However, amidst the vast amount of online information,
misinformation proliferates, posing significant challenges to
informed decision-making. With the advent of Large Language
Models (LLMs), there is a potential solution in using these
models to verify the accuracy of information circulating
online. Different Artificial Intelligence models offer
distinct capabilities in processing and understanding large
amounts of textual data. This research aims to explore and
compare the effectiveness of these models in the context of
information verification. The objectives of this thesis are:
to expand the knowledge of the state of the art in the area of
information verification using Artificial Intelligence; to
identify and select appropriate methods for developing a
solution that investigates the ability of Large
Language Models to verify the truthfulness of a text against
a knowledge base; to use open-source models (LLaMA, Vicuna,
Mistral, etc.) and proprietary ones (GPT-4, BARD, Claude,
etc.) to verify information; to analyze the results obtained
in experiments, comparing the performance of different LLMs
with the existing state of the art; and to draw pertinent
conclusions based on these results. Requisites: The student
should have
significant programming experience, and practical knowledge of
machine learning languages and environments, such as PyTorch
or TensorFlow. Notes: The work will be developed in
cooperation with SIED (Serviço de Informações Estratégicas de
Defesa), which has significant expertise in the topic.
The selected student will have access to the facilities of
INESC-ID and the MLKD group (https://mlkd.idss.inesc-id.pt/),
including four DELL PowerEdge C41402 servers, eight NVIDIA
32GB Tesla V100, four NVIDIA 48GB A40 and four NVIDIA 64GB
Tesla A100, among other computing servers
(https://mlkd.idss.inesc-id.pt/cluster).
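
As a rough illustration of the verification setup described in this proposal, where a text is checked against a knowledge base, the following minimal Python sketch retrieves the passages most similar to a claim and asks a model for a verdict. It is only a sketch under stated assumptions: the label set, the word-overlap retrieval, the prompt wording and the `llm` callable (a stand-in for any open-source or proprietary model) are all hypothetical and are not part of the proposal.

```python
import re
from typing import Callable

# Hypothetical label set; the proposal does not prescribe one.
LABELS = ("SUPPORTED", "REFUTED", "NOT ENOUGH INFO")


def tokenize(text: str) -> set[str]:
    """Lowercase the text and split it into a set of word tokens."""
    return set(re.findall(r"\w+", text.lower()))


def retrieve(claim: str, knowledge_base: list[str], k: int = 2) -> list[str]:
    """Return the k passages sharing the most word tokens with the claim."""
    claim_tokens = tokenize(claim)
    ranked = sorted(
        knowledge_base,
        key=lambda passage: len(claim_tokens & tokenize(passage)),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(claim: str, evidence: list[str]) -> str:
    """Assemble a plain-text prompt (the wording is an assumption)."""
    lines = ["Evidence:"]
    lines += [f"- {passage}" for passage in evidence]
    lines.append(f"Claim: {claim}")
    lines.append("Answer with one of: " + ", ".join(LABELS) + ".")
    return "\n".join(lines)


def verify(claim: str, knowledge_base: list[str],
           llm: Callable[[str], str]) -> str:
    """Ask the model for a verdict and map its answer to a known label."""
    prompt = build_prompt(claim, retrieve(claim, knowledge_base))
    answer = llm(prompt).strip().upper()
    for label in LABELS:
        if answer.startswith(label):
            return label
    # Anything unparseable is treated as "no verdict".
    return "NOT ENOUGH INFO"
```

Any function that maps a prompt string to a response string can be passed as `llm`, which keeps the comparison between different models independent of the retrieval and parsing steps. A real study would likely replace the word-overlap retrieval with a stronger retriever.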
Deep neural network architectures for dual process computation
Supervised by Arlindo L. Oliveira and Bruno Martins and authored by João Meneses Santos
Dual process theories have been used to explain the different
modes of behavior of the human brain when processing
information. These theories became popular with Kahneman's
book Thinking, Fast and Slow, but they are based on decades
of experimental evidence that the human brain works in two
different modes. System 1 processes large amounts of visual
and sensory information, efficiently and unconsciously. For
instance, face and object recognition, speech processing and
many other automatic functions are performed effortlessly by
the human brain, using System 1. Other tasks require conscious
effort, like answering complex riddles, executing non-trivial
arithmetic operations or planning unfamiliar tasks. These
tasks are performed by System 2. Existing systems, like
convolutional neural networks, for vision, or transformers,
for natural language processing, behave very much like System
1 in the human brain: they perform fast, high-throughput
processing of high-dimensional information, in an unconscious
way. This dissertation will be focused on the design of deep
neural network architectures that can be used to emulate the
dual process computation that characterizes the human brain
and also on the relation between dual process architectures
and consciousness. Requisites: The student should have significant
programming experience, and practical knowledge of machine
learning languages and environments, such as PyTorch or
TensorFlow. He/she should also have an interest in developing
an understanding of neuroscience and human psychology. Notes:
The selected student will have access to the facilities of
INESC-ID and the MLKD group (https://mlkd.idss.inesc-id.pt/),
including four DELL PowerEdge C41402 servers, eight NVIDIA
32GB Tesla V100, four NVIDIA 48GB A40 and four NVIDIA 64GB
Tesla A100, among other computing servers
(https://mlkd.idss.inesc-id.pt/cluster).
Characterization of the internal representations of
multi-modal Large Language Models
Supervised by Arlindo L. Oliveira and Bruno Martins and authored by João Marques Cardoso
Multi-modal large language models, mainly based on the
transformer architecture and trained on extensive amounts of
multi-modal data, have shown a surprising ability to perform
many different tasks with little specific training. In
particular, some of these models exhibited an ability to
perform few-shot learning and even zero-shot learning. Reports
on the emerging abilities of large language models, such as
Gemini, GPT-4o and Claude 3, have also presented evidence that
these models are able to significantly extrapolate from the
training data used and solve problems in which they were not
supposed to be proficient. This dissertation will explore the
internal representations of large language models, in order to
characterize the nature of these representations and shed
light on the mechanisms used to represent knowledge. The work
to be developed will use open-source models, such as LLaMA and
LLaVA, together with publicly available data from functional
MRI experiments with animals to explore and characterize the
internal representations of large language models, with the
objective of shedding light on the way these models work and,
possibly, on how they are able to generalize from training
data to new domains. Requisites: The student should have
significant programming experience, and practical knowledge of
machine learning languages and environments, such as PyTorch
or TensorFlow. He/she should also have an interest in
developing an understanding of large language models and LLM
APIs. Notes: The selected student will have access to the
facilities of INESC-ID and the MLKD group
(https://mlkd.idss.inesc-id.pt/), including four DELL
PowerEdge C41402 servers, eight NVIDIA 32GB Tesla V100, four
NVIDIA 48GB A40 and four NVIDIA 64GB Tesla A100, among other
computing servers
(https://mlkd.idss.inesc-id.pt/cluster).
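
One common way to compare the internal representations of a model with brain recordings is representational similarity analysis (RSA). The proposal does not name a specific method, so the Python sketch below is only an assumption about how such a comparison could be set up. Each system is summarised by the pairwise dissimilarities between its responses to the same stimuli, and the two sets of dissimilarities are then correlated. The function names and the idea of one response vector per stimulus are hypothetical, and Pearson correlation is used throughout for simplicity (rank correlation is also common).

```python
import math


def pearson(x: list[float], y: list[float]) -> float:
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def rdm_upper(responses: list[list[float]]) -> list[float]:
    """Upper triangle of the dissimilarity matrix (1 - Pearson r).

    `responses` holds one vector per stimulus, e.g. hidden-layer
    activations or voxel responses (hypothetical inputs).
    """
    n = len(responses)
    return [
        1.0 - pearson(responses[i], responses[j])
        for i in range(n)
        for j in range(i + 1, n)
    ]


def rsa_score(model_acts: list[list[float]],
              brain_resps: list[list[float]]) -> float:
    """Correlate the two dissimilarity structures.

    Both inputs must list the same stimuli in the same order; the
    vectors may have different lengths in the two systems.
    """
    return pearson(rdm_upper(model_acts), rdm_upper(brain_resps))
```

Because only the dissimilarity structure is compared, the number of model units and the number of recorded voxels do not need to match, which is the main reason this kind of analysis is convenient for model-to-brain comparisons.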
Transfer single-cell foundation models to predict drug resistance in cancer
Supervised by Arlindo L. Oliveira and Emanuel Gonçalves and authored by Gonçalo Gonçalves
Drug resistance is one of the leading causes of therapy
failure in cancer, with devastating consequences for patients
and those close to them. Single-cell technologies offer an
unprecedented resolution, analysing thousands to millions of
cells, revealing the heterogeneity of tumours along with rare
drug-tolerant cells. Deep learning models, e.g. variational
autoencoders and transformers, have been very successfully
applied to integrate a wide range of multi-modal single-cell
datasets spanning over thousands to millions of cells.
Single-cell generative models, including foundation models
such as scBERT, are now becoming commonplace in the study of
many aspects of cancer biology, including predicting how cancer
cells respond to gene deletions to propose novel drug targets.
Aligned with this, recent efforts focus on the creation of
computational pipelines to integrate and harmonise single-cell
perturbation datasets. Large-scale unsupervised single-cell
models can be fine-tuned for specific tasks, such as cancer
type identification and prediction of drug responses. In this
project, we propose to harness these foundation models to
explore zero-shot, few-shot and fine-tuning approaches to
integrate Cancer Dependency Map data
(https://www.broadinstitute.org/cancer/cancer-dependency-map).
This will leverage screens of hundreds of anti-cancer drugs
across >1,000 cancer cell lines. This will enable us to
study the variation in drug responses among different cancer
types using computational models. Ultimately, we can pinpoint
cell populations that are resistant to drug treatments,
shedding light on the regulatory mechanisms behind drug
resistance and uncovering new strategies to overcome it.
Requisites: The student should have significant programming
experience, and practical knowledge of machine learning
languages and environments, such as PyTorch or TensorFlow.
He/she should also have an interest in developing an
understanding of large language models and LLM APIs. Notes:
The selected student will have access to the facilities of
INESC-ID and the MLKD group (https://mlkd.idss.inesc-id.pt/),
including four DELL PowerEdge C41402 servers, eight NVIDIA
32GB Tesla V100, four NVIDIA 48GB A40 and four NVIDIA 64GB
Tesla A100, among other computing servers
(https://mlkd.idss.inesc-id.pt/cluster).
Biologically inspired CNNs for Medical Imaging tasks
Supervised by Arlindo L. Oliveira and Tiago Marques and
authored by Daniela Carvalho
Medical image data poses several challenges for computer
vision algorithms: it spans multiple imaging modalities and
biological tissues, it contains several sources of noise and
variation, and there is a scarcity of available labeled
datasets. Some recent advances in computer vision models, such
as the use of vision transformers and self-supervised learning,
have shown promising results in dealing with some of these
challenges. However, it has not been tested whether the use of
biologically inspired computations, another recent advance in
computer vision with considerable improvements in robustness,
also translates to gains in medical imaging tasks. The goal of
this project is to adapt the VOneNet family, a hybrid CNN with
a front-end inspired and constrained by the primate primary
visual cortex (V1), to multiple computer vision neural network
architectures used for medical imaging tasks and to test their
performance in a wide range of related benchmarks.
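
The V1-inspired front-end of VOneNet is built around a fixed bank of Gabor filters. As a minimal, hedged illustration of that building block only (this is not the VOneNet implementation), the Python sketch below constructs Gabor kernels at several orientations. The function names, kernel size, wavelength and sigma are hypothetical choices made for readability.

```python
import math


def gabor_kernel(size: int, wavelength: float, theta: float,
                 sigma: float, phase: float = 0.0) -> list[list[float]]:
    """Return a size x size Gabor kernel (size is assumed to be odd).

    The kernel is a cosine grating with the given wavelength and
    orientation theta (radians), multiplied by an isotropic Gaussian
    envelope with standard deviation sigma (in pixels).
    """
    half = size // 2
    kernel = []
    for y in range(-half, half + 1):
        row = []
        for x in range(-half, half + 1):
            # Rotate the pixel coordinates by theta.
            x_r = x * math.cos(theta) + y * math.sin(theta)
            y_r = -x * math.sin(theta) + y * math.cos(theta)
            envelope = math.exp(-(x_r ** 2 + y_r ** 2) / (2 * sigma ** 2))
            carrier = math.cos(2 * math.pi * x_r / wavelength + phase)
            row.append(envelope * carrier)
        kernel.append(row)
    return kernel


def gabor_bank(size: int, wavelength: float, sigma: float,
               n_orientations: int) -> list[list[list[float]]]:
    """Kernels at n_orientations evenly spaced angles in [0, pi)."""
    return [
        gabor_kernel(size, wavelength, k * math.pi / n_orientations, sigma)
        for k in range(n_orientations)
    ]
```

In a deep learning framework, kernels like these would typically be loaded as non-trainable weights of the first convolutional layer, so that the rest of the architecture is trained on top of a fixed front-end.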
Using large language models to interact with personal
information systems
Supervised by Arlindo L. Oliveira and authored by João Amoroso
Large language models, such as ChatGPT and GPT-4, have shown
remarkable abilities to interact in natural language. However,
they cannot be used to access and learn from personal data,
stored in email records, note-taking systems or photos and
videos. The objective of this dissertation is to design a
system that uses a large language model as the interface for
personal data, using APIs and enabling the user to query,
relate and retrieve information stored in different
sub-systems, such as mailboxes, Google records and note-taking
platforms such as Obsidian. The resulting system should be
able to emulate the behavior of an intelligent assistant that
has access to all stored personal data and, ultimately, to
answer questions about that data in a way similar to the user
that owns the data. Requisites: The student should have
significant programming experience, and practical knowledge of
machine learning languages and environments, such as PyTorch
or TensorFlow. He/she should also have an interest in
developing an understanding of large language models and LLM
APIs. Notes: The selected student will have access to the
facilities of INESC-ID and the MLKD group
(https://mlkd.idss.inesc-id.pt/), including four DELL
PowerEdge C41402 servers, eight NVIDIA 32GB Tesla V100S and
eight NVIDIA 64GB Tesla A100, among other computing servers
(https://mlkd.idss.inesc-id.pt/cluster).
Accurate prediction of stroke outcome from computed tomography
scans
Supervised by Arlindo L. Oliveira and authored by João
Teixeira
Accurately predicting the functional outcome of stroke
patients remains a problem with medical relevance that cannot
yet be adequately solved by automated means. Although it is
known that brain computed tomography (CT) scans contain
relevant information, their practical usefulness in predicting
this variable remains an open question. The objective of this
thesis is to develop algorithms to determine whether brain CT
scans (with and without contrast) could be automatically
analysed using deep learning models, to improve the prediction
of the three-month post-stroke functional outcome, as
measured by the modified Rankin Scale. The selected student
will study the application of deep learning architectures,
including convolutional neural networks and vision
transformers, to this problem. One intermediate variable
that will be studied as a possible predictor for the
functional outcome is occlusion, the existence of blocked
arteries that lead to the death of brain regions. Requisites:
The student should have significant programming experience,
and practical knowledge of machine learning languages and
environments, such as PyTorch or TensorFlow. The student
should have an interest in becoming familiar with the
biological and medical phenomena involved in stroke. Notes:
This work will be developed in close cooperation with the
neurology department of the Santa Maria Hospital. The selected
student will have access to the facilities of INESC-ID and the
MLKD group (https://mlkd.idss.inesc-id.pt/), including four
DELL PowerEdge C41402 servers, eight NVIDIA 32GB Tesla V100S
and eight NVIDIA 64GB Tesla A100, among other computing
servers.
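
The modified Rankin Scale (mRS) is an ordinal score from 0 (no symptoms) to 6 (death). If the prediction task is cast as binary classification, one common convention is to treat scores of 0 to 2 as a favourable outcome. The proposal does not fix this choice, so the threshold and the function names in the following Python sketch are assumptions, shown only to make the evaluation target concrete.

```python
def dichotomize_mrs(score: int, threshold: int = 2) -> int:
    """Map an mRS score (0-6) to 1 (favourable) or 0 (unfavourable)."""
    if not 0 <= score <= 6:
        raise ValueError(f"mRS must be between 0 and 6, got {score}")
    return int(score <= threshold)


def sensitivity_specificity(y_true: list[int],
                            y_pred: list[int]) -> tuple[float, float]:
    """Sensitivity and specificity for binary labels.

    Assumes both classes occur in y_true; otherwise a division by
    zero is raised.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    return tp / (tp + fn), tn / (tn + fp)
```

Reporting sensitivity and specificity separately, rather than accuracy alone, is useful here because favourable and unfavourable outcomes are often unevenly represented in clinical cohorts.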