Sign Language Recognition with Large Vision-and-Language Models
Supervised by Bruno Martins and Arlindo L. Oliveira
Sign language plays a vital role in enabling communication for deaf and hard-of-hearing individuals,
yet automatic Sign Language Recognition (SLR) remains a challenging task due to its high visual complexity
and the limited availability of large annotated datasets, particularly for underrepresented sign languages such as
Portuguese Sign Language (Língua Gestual Portuguesa, LGP). Traditional approaches often rely on intermediate gloss
annotations, which are costly to produce and restrict scalability. Meanwhile, recent progress in Large Vision-and-Language Models (LVLMs)
opens new opportunities to bypass these constraints by enabling direct translation from video to text without the need for intermediate representations.
These models can exploit rich semantic and temporal features directly from raw video input, offering an end-to-end alternative for gloss-free sign language understanding.
This project proposes the development of a sign language recognition system that directly utilizes video frames as input to a Large Vision-and-Language Model,
taking inspiration from recent proposals (e.g., arXiv:2412.16524, arXiv:2404.00925, and arXiv:2408.10593).
Unlike previous approaches dependent on body key-point extraction or gloss-level supervision, the method will explore the use of LVLMs fine-tuned with paired video-text data,
treating sign language video frames as a first-class input modality. Inspired by recent advances in multimodal learning and the bridging of visual and language token spaces,
the model will be trained to recognize and translate full video sequences into natural language descriptions. In parallel, the project will address a critical gap in data availability for LGP
by curating a new dataset from publicly available sources, such as televised news broadcasts interpreted in sign language, and applying automatic alignment and segmentation strategies to create training pairs.
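Pipelines of this kind typically begin by sub-sampling each video down to the fixed number of frames that the visual encoder of an LVLM can accept. The helper below is a minimal sketch of uniform frame sampling; the function name and the default budget of 8 frames are illustrative assumptions, not part of any specific model's API.

```python
def sample_frame_indices(num_frames: int, num_samples: int = 8) -> list[int]:
    """Uniformly sample `num_samples` frame indices from a video with
    `num_frames` frames, as is common when preparing a clip for a
    vision-and-language model with a fixed visual token budget."""
    if num_frames <= 0:
        raise ValueError("video must contain at least one frame")
    if num_frames <= num_samples:
        return list(range(num_frames))  # keep every frame for short clips
    step = num_frames / num_samples
    # Take the midpoint of each of the `num_samples` equal-length windows.
    return [int(step * i + step / 2) for i in range(num_samples)]
```

The selected indices would then be used to extract the corresponding frames before they are passed to the model's image/video processor.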
Evaluation will be conducted using both newly collected LGP data and existing sign language datasets for other languages, such as
PHOENIX14T, CSL-Daily, and How2Sign (see how2sign.github.io and signllm.github.io), or the data from the Sign Language Translation Task of WMT22/WMT23 (wmt-slt.com).
The proposed model will be benchmarked against existing state-of-the-art gloss-free SLT systems, with comparisons focusing on translation accuracy, robustness to signer variation,
and performance in low-resource language scenarios. The main deliverable will be a technical report detailing the system architecture, data collection and preprocessing pipelines,
and experimental results, with the aim of submission to a leading conference in computer vision or natural language processing.
Requisites:
• Commitment and availability to work on the project (e.g., not recommended for students with other professional activities);
• Interest in the application of deep learning methods to natural language processing — previous projects involving the use of Transformer-based models will be valued;
• Good command of English;
• Excellent grades in courses related to the topics of the project (i.e., an average grade of 17 values or higher);
• Knowledge and experience with the use of tools like Overleaf and GitHub;
• Knowledge of Python and machine learning libraries such as PyTorch and HuggingFace Transformers;
• Preference will be given to students enrolled, or interested in enrolling, in the PhD fast track programme.
Notes: The project will be supervised by
Bruno Martins (DEEC/IST and INESC-ID)
and Arlindo Oliveira (DEI/IST and INESC-ID).
It will be developed in the context of ongoing research projects currently being executed at the Human Language Technologies (HLT) group of INESC-ID
(e.g., the Center for Responsible AI —
https://centerforresponsible.ai).
Within INESC-ID, students will have access to computational resources supporting the training of large neural networks (i.e., servers with A100 GPUs),
and they are expected to interact with other HLT researchers working on similar topics (e.g., Ph.D. students in the group that can act as mentors to newcomers).
Students interested in this proposal should contact Prof. Bruno Martins
(bruno.g.martins@tecnico.ulisboa.pt)
in order to schedule an interview.
Text Segmentation by TextTiling with Modern Neural Retrieval Models
Supervised by Bruno Martins and Arlindo L. Oliveira
Text segmentation is a foundational task in natural language processing, supporting a wide range of applications such as document summarization, topic modeling, media content repurposing,
and more recently, Retrieval-Augmented Generation (RAG) systems
(arXiv:2503.09600).
Effective segmentation enables better contextualization of content, allowing downstream models to focus on coherent blocks of information rather than arbitrary fixed-length inputs.
Classical approaches like TextTiling segment texts into semantically coherent passages based on lexical cohesion, typically using text similarity measures.
Subsequent approaches have revisited the TextTiling idea leveraging representations obtained from encoder models such as BERT
(see, e.g., PeerJ Computer Science and the DeepTiling repository on GitHub).
More recently, with the advent of neural retrieval and large language models (LLMs), there is a renewed interest in developing segmentation strategies that can more precisely model topical boundaries
and improve the efficiency and quality of retrieval-based systems. Recent approaches have even proposed the use of LLMs to perform text segmentation
(arXiv:2406.17526), although computational efficiency is a concern with these methods.
In the context of this M.Sc. research project, the candidate will revisit the original TextTiling method and augment it by integrating modern neural sentence representations,
particularly those derived from models pre-trained for dense retrieval tasks (e.g., sentence-transformers or embedding models based on larger Transformer decoders fine-tuned for semantic search).
By replacing TF-IDF vectors with dense embeddings, the method aims to improve the detection of semantic shifts between text segments, while maintaining the interpretability
and unsupervised nature of the original algorithm. In parallel, we will compare this enhanced TextTiling method with LLM-based segmentation approaches such as LumberChunker,
which dynamically determine chunk boundaries through generative prompting. The goal is to assess the trade-offs between lightweight embedding-based segmentation,
and more computationally intensive instruction-tuned models in real-world scenarios.
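The embedding-based variant of TextTiling described above can be sketched as follows. This is a simplified illustration: real sentence embeddings would come from a dense-retrieval model such as sentence-transformers, the depth score here uses the highest similarity peak on each side of a gap rather than the original hill-climbing procedure, and the function names and threshold value are illustrative assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def texttiling_boundaries(sent_embeddings, depth_threshold=0.1):
    """TextTiling-style boundary detection over dense sentence embeddings.

    Gap scores are cosine similarities between adjacent sentences; a gap is
    a boundary when its depth (how far the similarity dips below the highest
    peaks on both sides) exceeds `depth_threshold`. Returns gap indices: a
    boundary at index i separates sentence i from sentence i + 1.
    """
    sims = [cosine(sent_embeddings[i], sent_embeddings[i + 1])
            for i in range(len(sent_embeddings) - 1)]
    boundaries = []
    for i, s in enumerate(sims):
        left = max(sims[:i + 1])   # highest similarity up to gap i
        right = max(sims[i:])      # highest similarity from gap i onward
        depth = (left - s) + (right - s)
        if depth > depth_threshold:
            boundaries.append(i)
    return boundaries
```

Swapping TF-IDF vectors for dense embeddings leaves the algorithm unsupervised and interpretable: the sequence of gap scores can still be inspected directly to explain why a boundary was (or was not) placed.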
Evaluation will be performed using a mixture of segmentation benchmarks, perhaps also considering downstream task performance within RAG pipelines.
Benchmarks such as Wiki-727K, or the GutenQA benchmark used in LumberChunker, will be employed to quantify segmentation quality using established metrics such as Pk and WindowDiff.
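As an illustration of one such metric, the Pk score of Beeferman et al. estimates the probability that a window of width k disagrees with the reference about whether its endpoints lie in the same segment. The sketch below is a minimal implementation under stated assumptions: segmentations are represented as lists of segment lengths, and k defaults to the conventional choice of half the mean reference segment length.

```python
def pk(reference, hypothesis, k=None):
    """Pk segmentation error (Beeferman et al.): the fraction of sliding
    windows of width k whose endpoints are placed in the same segment by
    one segmentation but in different segments by the other. Segmentations
    are lists of segment lengths; lower is better (0.0 = perfect)."""
    def labels(seg_lengths):
        # Map each position to the index of the segment containing it.
        out = []
        for seg_id, length in enumerate(seg_lengths):
            out.extend([seg_id] * length)
        return out

    ref, hyp = labels(reference), labels(hypothesis)
    assert len(ref) == len(hyp), "segmentations must cover the same text"
    n = len(ref)
    if k is None:
        # Standard choice: half the mean reference segment length.
        k = max(1, round(n / (2 * len(reference))))
    errors = sum((ref[i] == ref[i + k]) != (hyp[i] == hyp[i + k])
                 for i in range(n - k))
    return errors / (n - k)
```

In practice a vetted implementation (e.g., the one shipped with NLTK) would be preferred, but the metric itself is simple enough to state explicitly.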
A technical report will document the results, envisioning the submission to a leading conference in natural language processing or information retrieval.
Requisites:
• Commitment and availability to work on the project (e.g., not recommended for students with other professional activities);
• Interest in the application of deep learning methods to natural language processing — previous projects involving the use of Transformer-based models will be valued;
• Good command of English;
• Excellent grades in courses related to the topics of the project (i.e., an average grade of 17 values or higher);
• Knowledge and experience with the use of tools like Overleaf and GitHub;
• Knowledge of Python and machine learning libraries such as PyTorch and HuggingFace Transformers;
• Preference will be given to students enrolled, or interested in enrolling, in the PhD fast track programme.
Notes: The project will be supervised by
Bruno Martins (DEEC/IST and INESC-ID)
and Arlindo Oliveira (DEI/IST and INESC-ID).
It will be developed in the context of ongoing research projects currently being executed at the Human Language Technologies (HLT) group of INESC-ID
(e.g., the Center for Responsible AI —
https://centerforresponsible.ai).
Within INESC-ID, students will have access to computational resources supporting the training of large neural networks (i.e., servers with A100 GPUs),
and they are expected to interact with other HLT researchers working on similar topics (e.g., Ph.D. students in the group that can act as mentors to newcomers).
Students interested in this proposal should contact Prof. Bruno Martins
(bruno.g.martins@tecnico.ulisboa.pt)
in order to schedule an interview.