MLKD

Ongoing dissertations

Assessing Large Language Models for information verification in Portuguese
Supervised by Arlindo L. Oliveira and Nuno Cardoso and authored by Luís Câmara
In an increasingly interconnected world driven by digital technologies, the ability to navigate, critically evaluate, and effectively use information is more essential than ever. However, amidst the vast amount of online information, misinformation proliferates, posing significant challenges to informed decision-making. With the advent of Large Language Models (LLMs), there is a potential solution in using these models to verify the accuracy of information circulating online. Different Artificial Intelligence models offer distinct capabilities in processing and understanding large amounts of textual data. This research aims to explore and compare the effectiveness of these models in the context of information verification. The objectives of this thesis are: to expand the knowledge of the state of the art in the area of information verification using Artificial Intelligence; to identify and select appropriate methods for developing a solution that investigates the ability of Large Language Models to verify the truthfulness of a text against a knowledge base; to use open-source models (LLaMA, Vicuna, Mistral, etc.) and proprietary ones (GPT-4, BARD, Claude, etc.) to verify information; to analyze the results obtained in experiments, comparing the performance of different LLMs with the existing state of the art; and to draw pertinent conclusions based on these results.
Requisites: The student should have significant programming experience, and practical knowledge of machine learning languages and environments, such as PyTorch or TensorFlow.
Notes: The work will be developed in cooperation with SIED (Serviço de Informações Estratégicas de Defesa), which has significant expertise in the topic.
The selected student will have access to the facilities of INESC-ID and the MLKD group (https://mlkd.idss.inesc-id.pt/), including computing facilities with four DELL PowerEdge C41402 servers, eight NVIDIA 32GB Tesla V100, four NVIDIA 48GB A40 and four NVIDIA 64GB Tesla A100, among other computing servers (https://mlkd.idss.inesc-id.pt/cluster).
Deep neural network architectures for dual process computation
Supervised by Arlindo L. Oliveira and Bruno Martins and authored by João Meneses Santos
Dual process theories have been used to explain the different modes of behavior of the human brain when processing information. These theories became popular with Kahneman's book Thinking, Fast and Slow, but they are based on decades of experimental evidence that the human brain works in two different modes. System 1 processes large amounts of visual and sensory information, efficiently and unconsciously. For instance, face and object recognition, speech processing and many other automatic functions are performed effortlessly by the human brain, using System 1. Other tasks require conscious effort, like answering complex riddles, executing non-trivial arithmetic operations or planning unfamiliar tasks. These tasks are performed by System 2. Existing systems, like convolutional neural networks for vision or transformers for natural language processing, behave very much like System 1 in the human brain: they perform fast, high-throughput processing of high-dimensional information, in an unconscious way. This dissertation will focus on the design of deep neural network architectures that can be used to emulate the dual process computation that characterizes the human brain, and also on the relation between dual process architectures and consciousness.
Requisites: The student should have significant programming experience, and practical knowledge of machine learning languages and environments, such as PyTorch or TensorFlow. He/she should also have an interest in developing an understanding of neuroscience and human psychology.
Notes: The selected student will have access to the facilities of INESC-ID and the MLKD group (https://mlkd.idss.inesc-id.pt/), including computing facilities with four DELL PowerEdge C41402 servers, eight NVIDIA 32GB Tesla V100, four NVIDIA 48GB A40 and four NVIDIA 64GB Tesla A100, among other computing servers (https://mlkd.idss.inesc-id.pt/cluster).
Characterization of the internal representations of multi-modal Large Language Models
Supervised by Arlindo L. Oliveira and Bruno Martins and authored by João Marques Cardoso
Multi-modal large language models, mainly based on the transformer architecture and trained on extensive amounts of multi-modal data, have shown a surprising ability to perform many different tasks with little specific training. In particular, some of these models have exhibited an ability to perform few-shot learning and even zero-shot learning. Reports on the emerging abilities of large language models, such as Gemini, GPT-4o and Claude 3, have also presented evidence that these models are able to significantly extrapolate from the training data used and solve problems in which they were not supposed to be proficient. This dissertation will explore the internal representations of large language models, in order to characterize the nature of these representations and shed light on the mechanisms used to represent knowledge. The work to be developed will use open-source models, such as LLaMA and LLaVA, together with publicly available data from functional MRI experiments with animals to explore and characterize the internal representations of large language models, with the objective of shedding light on the way these models work and, possibly, on how they are able to generalize from training data to new domains.
Requisites: The student should have significant programming experience, and practical knowledge of machine learning languages and environments, such as PyTorch or TensorFlow. He/she should also have an interest in developing an understanding of large language models and LLM APIs.
Notes: The selected student will have access to the facilities of INESC-ID and the MLKD group (https://mlkd.idss.inesc-id.pt/), including computing facilities with four DELL PowerEdge C41402 servers, eight NVIDIA 32GB Tesla V100, four NVIDIA 48GB A40 and four NVIDIA 64GB Tesla A100, among other computing servers (https://mlkd.idss.inesc-id.pt/cluster).
Transfer single-cell foundation models to predict drug resistance in cancer
Supervised by Arlindo L. Oliveira and Emanuel Gonçalves and authored by Gonçalo Gonçalves
Drug resistance is one of the leading causes of therapy failure in cancer, with devastating consequences for patients and those close to them. Single-cell technologies offer an unprecedented resolution, analysing thousands to millions of cells and revealing the heterogeneity of tumours along with rare drug-tolerant cells. Deep learning models, e.g. variational autoencoders and transformers, have been very successfully applied to integrate a wide range of multi-modal single-cell datasets spanning thousands to millions of cells. Single-cell generative models, including foundation models such as scBERT, are now becoming commonplace to study many aspects of cancer biology, including predicting how cancer cells respond to gene deletions in order to propose novel drug targets. Aligned with this, recent efforts focus on the creation of computational pipelines to integrate and harmonise single-cell perturbation datasets. Large-scale unsupervised single-cell models can be fine-tuned for specific tasks, such as cancer type identification and prediction of drug responses. In this project, we propose to harness these foundation models to explore zero-shot, few-shot and fine-tuning approaches to integrate Cancer Dependency Map data (https://www.broadinstitute.org/cancer/cancer-dependency-map). This will leverage screens of hundreds of anti-cancer drugs across more than 1,000 cancer cell lines, enabling us to study the variation in drug responses among different cancer types using computational models. Ultimately, we can pinpoint cell populations that are resistant to drug treatments, shedding light on the regulatory mechanisms behind drug resistance and uncovering new strategies to overcome it.
Requisites: The student should have significant programming experience, and practical knowledge of machine learning languages and environments, such as PyTorch or TensorFlow. He/she should also have an interest in developing an understanding of large language models and LLM APIs.
Notes: The selected student will have access to the facilities of INESC-ID and the MLKD group (https://mlkd.idss.inesc-id.pt/), including computing facilities with four DELL PowerEdge C41402 servers, eight NVIDIA 32GB Tesla V100, four NVIDIA 48GB A40 and four NVIDIA 64GB Tesla A100, among other computing servers (https://mlkd.idss.inesc-id.pt/cluster).
Biologically inspired CNNs for Medical Imaging tasks
Supervised by Arlindo L. Oliveira and Tiago Marques and authored by Daniela Carvalho
Medical image data poses several challenges for computer vision algorithms: it spans multiple imaging modalities and biological tissues, it contains several sources of noise and variation, and there is a scarcity of available labeled datasets. Some recent advances in computer vision models, such as the use of vision transformers and self-supervised learning, have shown promising results in dealing with some of these challenges. However, it has not been tested whether the use of biologically inspired computations, another recent advance in computer vision with considerable improvements in robustness, also translates to gains in medical imaging tasks. The goal of this project is to adapt the VOneNet family, a hybrid CNN with a front-end inspired and constrained by the primate primary visual cortex (V1), to multiple computer vision neural network architectures used for medical imaging tasks and to test their performance in a wide range of related benchmarks.
Using large language models to interact with personal information systems
Supervised by Arlindo L. Oliveira and authored by João Amoroso
Large language models, such as ChatGPT and GPT-4, have shown remarkable abilities to interact in natural language. However, they cannot be used to access and learn from personal data stored in email records, note-taking systems or photos and videos. The objective of this dissertation is to design a system that uses a large language model as the interface for personal data, using APIs and enabling the user to query, relate and retrieve information stored in different sub-systems, such as mailboxes, Google records and note-taking platforms such as Obsidian. The resulting system should be able to emulate the behavior of an intelligent assistant that has access to all stored personal data and, ultimately, to answer questions about that data in a way similar to the user that owns the data.
Requisites: The student should have significant programming experience, and practical knowledge of machine learning languages and environments, such as PyTorch or TensorFlow. He/she should also have an interest in developing an understanding of large language models and LLM APIs.
Notes: The selected student will have access to the facilities of INESC-ID and the MLKD group (https://mlkd.idss.inesc-id.pt/), including computing facilities with four DELL PowerEdge C41402 servers, eight NVIDIA 32GB Tesla V100S and eight NVIDIA 64GB Tesla A100, among other computing servers (https://mlkd.idss.inesc-id.pt/cluster).
Accurate prediction of stroke outcome from computed tomography scans
Supervised by Arlindo L. Oliveira and authored by João Teixeira
Accurately predicting the functional outcome of stroke patients remains a problem with medical relevance that cannot yet be adequately solved by automated means. Although it is known that brain computed tomography (CT) scans contain relevant information, their practical usefulness in predicting this variable remains an open question. The objective of this thesis is to develop algorithms to determine whether brain CT scans (with and without contrast) can be automatically analysed using deep learning models to improve the prediction of the three-month post-stroke functional outcome, as measured by the modified Rankin Scale. The selected student will study the application of deep learning architectures, including convolutional neural networks and vision transformers, to this problem. One intermediate variable that will be studied as a possible predictor for the functional outcome is occlusion, the existence of blocked arteries that lead to the death of brain regions.
Requisites: The student should have significant programming experience, and practical knowledge of machine learning languages and environments, such as PyTorch or TensorFlow. The student should have an interest in becoming familiar with the biological and medical phenomena involved in stroke.
Notes: This work will be developed in close cooperation with the neurology department of the Santa Maria Hospital. The selected student will have access to the facilities of INESC-ID and the MLKD group (https://mlkd.idss.inesc-id.pt/), including computing facilities with four DELL PowerEdge C41402 servers, eight NVIDIA 32GB Tesla V100S and eight NVIDIA 64GB Tesla A100, among other computing servers.