MLKD

Publications

See the studies developed by the group.
Segmentation of X-ray coronary angiography with an artificial intelligence deep learning model: Impact in operator visual assessment of coronary stenosis severity MN Menezes, B Silva, JL Silva, T Rodrigues, JS Marques, C Guerreiro, ...
2023 Catheterization and cardiovascular interventions: official journal of the Society for Cardiac Angiography & Interventions
Visual assessment of the percentage diameter stenosis (%DSVE) of lesions is essential in coronary angiography (CAG) interpretation. We have previously developed an artificial intelligence (AI) model capable of accurate CAG segmentation. We aim to compare operators' %DSVE in angiography versus AI-segmented images. Quantitative coronary analysis (QCA) %DS (%DSQCA) was previously performed in our published validation dataset. Operators were asked to estimate %DSVE of lesions in angiography versus AI-segmented images in separate sessions and differences were assessed using angiography %DSQCA as reference. Results: A total of 123 lesions were included. %DSVE was significantly higher in both the angiography (77% ± 20% vs. 56% ± 13%, p < 0.001) and segmentation groups (59% ± 20% vs. 56% ± 13%, p < 0.001), with a much smaller absolute %DS difference in the …
Improving Address Matching using Siamese Transformer Networks AV Duarte, AL Oliveira
2023 arXiv preprint arXiv:2307.02300
Matching addresses is a critical task for companies and post offices involved in the processing and delivery of packages. The ramifications of delivering a package to the wrong recipient are numerous, ranging from harm to the company's reputation to economic and environmental costs. This research introduces a deep learning-based model designed to increase the efficiency of address matching for Portuguese addresses. The model comprises two parts: (i) a bi-encoder, which is fine-tuned to create meaningful embeddings of Portuguese postal addresses, used to retrieve the top 10 likely matches of the un-normalized target address from a normalized database, and (ii) a cross-encoder, which is fine-tuned to accurately rerank the 10 addresses obtained by the bi-encoder. The model has been tested on a real-case scenario of Portuguese addresses and exhibits a high degree of accuracy, exceeding 95% at the door level. When run on a GPU, inference is about 4.5 times faster than traditional approaches such as BM25. An implementation of this system in a real-world scenario would substantially increase the effectiveness of the distribution process. Such an implementation is currently under investigation.
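For illustration, a minimal sketch of the retrieve-and-rerank pattern this abstract describes, using the sentence-transformers library; the model names and addresses below are placeholders, not the fine-tuned models or data from the paper.

```python
# Minimal retrieve-and-rerank sketch for address matching.
# Model names and data are placeholders, not the paper's fine-tuned models.
from sentence_transformers import SentenceTransformer, CrossEncoder, util

normalized_db = [
    "Rua Augusta 100, 1100-053 Lisboa",
    "Avenida da Liberdade 10, 1250-144 Lisboa",
    # ... the full normalized address database goes here
]
query = "r. augusta n 100 lisboa"  # un-normalized target address

# Stage 1: bi-encoder retrieves the top-10 candidate matches.
bi_encoder = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
db_emb = bi_encoder.encode(normalized_db, convert_to_tensor=True)
query_emb = bi_encoder.encode(query, convert_to_tensor=True)
hits = util.semantic_search(query_emb, db_emb, top_k=10)[0]

# Stage 2: cross-encoder reranks the retrieved candidates.
cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
pairs = [(query, normalized_db[h["corpus_id"]]) for h in hits]
scores = cross_encoder.predict(pairs)
best = max(zip(scores, pairs), key=lambda x: x[0])
print(best[1][1])  # highest-scoring normalized address
```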
Coronary X-ray angiography segmentation using Artificial Intelligence: a multicentric validation study of a deep learning model M Nobre Menezes, JL Silva, B Silva, T Rodrigues, C Guerreiro, ...
2023 The international journal of cardiovascular imaging, 1-12
We previously developed an artificial intelligence (AI) model for automatic coronary angiography (CAG) segmentation, using deep learning. To validate this approach, the model was applied to a new dataset and results are reported. Retrospective selection of patients undergoing CAG and percutaneous coronary intervention or invasive physiology assessment over a one-month period from four centers. A single frame was selected from images containing a lesion with a 50–99% stenosis (visual estimation). Automatic Quantitative Coronary Analysis (QCA) was performed with validated software. Images were then segmented by the AI model. Lesion diameters, area overlap [based on true positive (TP) and true negative (TN) pixels] and a global segmentation score (GSS, 0–100 points), previously developed and published, were measured.
Artificial intelligence-based diagnosis of acute pulmonary embolism: Development of a machine learning model using 12-lead electrocardiogram BV Silva, J Marques, MN Menezes, AL Oliveira, FJ Pinto
2023 Revista Portuguesa de Cardiologia
Pulmonary embolism (PE) is a life-threatening condition, in which diagnostic uncertainty remains high given the lack of specificity in clinical presentation. It requires confirmation by computed tomography pulmonary angiography (CTPA). Electrocardiography (ECG) signals can be analyzed by artificial intelligence (AI) with precision. The purpose of this study was to develop an AI model for predicting PE using a 12-lead ECG. We extracted 1014 ECGs from patients admitted to the emergency department who underwent CTPA due to suspected PE: 911 ECGs were used for development of the AI model and 103 ECGs for validation. An AI algorithm based on an ensemble neural network was developed. The performance of the AI model was compared against the guideline recommended clinical prediction rules for PE (Wells and Geneva scores combined with a standard D-dimer cut-off of 500 ng/mL …
Connecting metrics for shape-texture knowledge in computer vision T Oliveira, T Marques, AL Oliveira
2023 arXiv preprint arXiv:2301.10608
Modern artificial neural networks, including convolutional neural networks and vision transformers, have mastered several computer vision tasks, including object recognition. However, there are many significant differences between the behavior and robustness of these systems and of the human visual system. Deep neural networks remain brittle and susceptible to many changes in the image that do not cause humans to misclassify images. Part of this different behavior may be explained by the type of features humans and deep neural networks use in vision tasks. Humans tend to classify objects according to their shape while deep neural networks seem to rely mostly on texture. Exploring this question is relevant, since it may lead to better performing neural network architectures and to a better understanding of the workings of the vision system of primates. In this work, we advance the state of the art in our understanding of this phenomenon, by extending previous analyses to a much larger set of deep neural network architectures. We found that the performance of models in image classification tasks is highly correlated with their shape bias measured at the output and penultimate layer. Furthermore, our results showed that the numbers of neurons that represent shape and texture are strongly anti-correlated, thus providing evidence that there is competition between these two types of features. Finally, we observed that while in general there is a correlation between performance and shape bias, there are significant variations between architecture families.
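For context, a sketch of the standard cue-conflict shape-bias metric (in the style of Geirhos et al.) and of correlating it with accuracy; the per-model numbers below are illustrative, not results from the paper.

```python
# Sketch of the cue-conflict shape-bias metric and its correlation with
# accuracy; the per-model measurements are illustrative placeholders.
import numpy as np

def shape_bias(preds, shape_labels, texture_labels):
    """Fraction of cue-conflict images classified by shape, among those
    classified as either the shape class or the texture class."""
    preds = np.asarray(preds)
    shape_hits = preds == np.asarray(shape_labels)
    texture_hits = preds == np.asarray(texture_labels)
    decided = shape_hits | texture_hits
    return shape_hits[decided].mean()

# Hypothetical (accuracy, shape bias) pairs for three models:
models = {"cnn_a": (0.76, 0.21), "cnn_b": (0.79, 0.28), "vit_a": (0.81, 0.42)}
acc, bias = zip(*models.values())
r = np.corrcoef(acc, bias)[0, 1]
print(f"Pearson r between accuracy and shape bias: {r:.2f}")
```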
Learning the dynamics of realistic models of C. elegans nervous system with recurrent neural networks R Barbulescu, G Mestre, AL Oliveira, LM Silveira
2023 Scientific Reports 13 (1), 467
Given the inherent complexity of the human nervous system, insight into the dynamics of brain activity can be gained from studying smaller and simpler organisms. While some of the potential target organisms are simple enough that their behavioural and structural biology might be well-known and understood, others might still lead to computationally intractable models that require extensive resources to simulate. Since such organisms are frequently only acting as proxies to further our understanding of underlying phenomena or functionality, often one is not interested in the detailed evolution of every single neuron in the system. Instead, it is sufficient to observe the subset of neurons that capture the effect that the profound nonlinearities of the neuronal system have in response to different stimuli. In this paper, we consider the well-known nematode Caenorhabditis elegans and seek to investigate the possibility of …
Biomedical image analysis competitions: The state of current participation practice M Eisenmann, A Reinke, V Weru, MD Tizabi, F Isensee, TJ Adler, ...
2022 arXiv preprint arXiv:2212.08568
The number of international benchmarking competitions is steadily increasing in various fields of machine learning (ML) research and practice. So far, however, little is known about the common practice as well as bottlenecks faced by the community in tackling the research questions posed. To shed light on the status quo of algorithm development in the specific field of biomedical imaging analysis, we designed an international survey that was issued to all participants of challenges conducted in conjunction with the IEEE ISBI 2021 and MICCAI 2021 conferences (80 competitions in total). The survey covered participants' expertise and working environments, their chosen strategies, as well as algorithm characteristics. A median of 72% of challenge participants took part in the survey. According to our results, knowledge exchange was the primary incentive (70%) for participation, while the reception of prize money played only a minor role (16%). While a median of 80 working hours was spent on method development, a large portion of participants stated that they did not have enough time for method development (32%). 25% perceived the infrastructure to be a bottleneck. Overall, 94% of all solutions were deep learning-based. Of these, 84% were based on standard architectures. 43% of the respondents reported that the data samples (e.g., images) were too large to be processed at once. This was most commonly addressed by patch-based training (69%), downsampling (37%), and solving 3D analysis tasks as a series of 2D tasks. K-fold cross-validation on the training set was performed by only 37% of the participants and only 50% of the participants …
Development of deep learning segmentation models for coronary X-ray angiography: Quality assessment by a new global segmentation score and comparison with human performance MN Menezes, J Lourenço-Silva, B Silva, T Rodrigues, ARG Francisco, ...
2022 Revista Portuguesa de Cardiologia 41 (12), 1011-1021
Introduction and objectives: Although automatic artificial intelligence (AI) coronary angiography (CAG) segmentation is arguably the first step toward future clinical application, it is underexplored. We aimed to (1) develop AI models for CAG segmentation and (2) assess the results using similarity scores and a set of criteria defined by expert physicians. Methods: Patients undergoing CAG were randomly selected in a retrospective study at a single center. Per incidence, an ideal frame was segmented, forming a baseline human dataset (BH), used for training a baseline AI model (BAI). Enhanced human segmentation (EH) was created by combining the best of both. An enhanced AI model (EAI) was trained using the EH. Results were assessed by experts using 11 weighted criteria, combined into a Global Segmentation Score (GSS: 0–100 points). Generalized Dice Score (GDS) and Dice Similarity Coefficient (DSC) were …
Modeling the geospatial evolution of COVID-19 using spatio-temporal convolutional sequence-to-sequence neural networks M Cardoso, A Cavalheiro, A Borges, AF Duarte, A Soares, MJ Pereira, ...
2022 ACM Transactions on Spatial Algorithms and Systems
Europe was hit hard by the COVID-19 pandemic and Portugal was severely affected, having suffered three waves in the first twelve months. Between approximately January 19th and February 5th, 2021, Portugal had the highest incidence rate in the world, with 14-day incidence rates per 100,000 inhabitants in excess of 1,000. Despite its importance, accurate prediction of the geospatial evolution of COVID-19 remains a challenge, since existing analytical methods fail to capture the complex dynamics that result from the contagion within a region and the spreading of the infection from infected neighboring regions. We use a previously developed methodology and official municipality level data from the Portuguese Directorate-General for Health (DGS), relative to the first twelve months of the pandemic, to compute an estimate of the incidence rate in each location of mainland Portugal. The resulting …
Encoder-Decoder Architectures for Clinically Relevant Coronary Artery Segmentation J Lourenço-Silva, MN Menezes, T Rodrigues, B Silva, FJ Pinto, ...
2022 Computational Advances in Bio and Medical Sciences: 11th International …
Coronary X-ray angiography is a crucial clinical procedure for the diagnosis and treatment of coronary artery disease, which accounts for roughly 16% of global deaths every year. However, the images acquired in these procedures have low resolution and poor contrast, making lesion detection and assessment challenging. Accurate coronary artery segmentation not only helps mitigate these problems, but also allows the extraction of relevant anatomical features for further analysis by quantitative methods. Although automated segmentation of coronary arteries has been proposed before, previous approaches have used non-optimal segmentation criteria, leading to less useful results. Most methods either segment only the major vessel, discarding important information from the remaining ones, or segment the whole coronary tree, based mostly on contrast information, producing a noisy output that includes vessels …
Pretraining the Vision Transformer using self-supervised methods for vision based Deep Reinforcement Learning M Goulão, AL Oliveira
2022 arXiv preprint arXiv:2209.10901
The Vision Transformer architecture has been shown to be competitive in the computer vision (CV) space, where it has dethroned convolution-based networks in several benchmarks. Nevertheless, Convolutional Neural Networks (CNN) remain the preferential architecture for the representation module in Reinforcement Learning. In this work, we study pretraining a Vision Transformer using several state-of-the-art self-supervised methods and assess data-efficiency gains from this training framework. We propose a new self-supervised learning method called TOV-VICReg that extends VICReg to better capture temporal relations between observations by adding a temporal order verification task. Furthermore, we evaluate the resultant encoders with Atari games in a sample-efficiency regime. Our results show that the vision transformer, when pretrained with TOV-VICReg, outperforms the other self-supervised methods but still struggles to overcome a CNN. Nevertheless, we were able to outperform a CNN in two of the ten games in our 100k-step evaluation. Ultimately, we believe that such approaches in Deep Reinforcement Learning (DRL) might be the key to achieving new levels of performance as seen in natural language processing and computer vision. Source code will be available at: https://github.com/mgoulao/TOV-VICReg
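For illustration, a minimal sketch of a temporal order verification auxiliary task of the kind the abstract describes: a binary head predicts whether two frame embeddings are in the correct temporal order. The encoder, shapes, and loss wiring are assumptions for illustration, not the paper's exact architecture.

```python
# Minimal sketch of a temporal order verification (TOV) auxiliary task.
# Encoder and shapes are illustrative placeholders, not the paper's model.
import torch
import torch.nn as nn

embed_dim = 256
encoder = nn.Sequential(nn.Flatten(), nn.Linear(84 * 84, embed_dim), nn.ReLU())
order_head = nn.Linear(2 * embed_dim, 2)  # classes: {in order, swapped}

def tov_loss(frames_t, frames_t1):
    """frames_t, frames_t1: (B, 84*84) consecutive observations."""
    z_t, z_t1 = encoder(frames_t), encoder(frames_t1)
    correct = torch.cat([z_t, z_t1], dim=1)   # in-order pairs -> label 0
    swapped = torch.cat([z_t1, z_t], dim=1)   # reversed pairs -> label 1
    logits = order_head(torch.cat([correct, swapped], dim=0))
    labels = torch.cat([torch.zeros(len(z_t)), torch.ones(len(z_t))]).long()
    return nn.functional.cross_entropy(logits, labels)

# This auxiliary loss would be added to the VICReg objective during pretraining.
loss = tov_loss(torch.randn(8, 84 * 84), torch.randn(8, 84 * 84))
```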
Stenosis Detection in X-ray Coronary Angiography with Deep Neural Networks Leveraged by Attention Mechanisms PV Stralen, DL Rodrigues, AL Oliveira, MN Menezes, FJ Pinto
2022 Proceedings of the 9th International Conference on Bioinformatics Research …
Coronary artery disease (CAD) is one of the most prevalent causes of death worldwide. The automatic detection of coronary artery stenosis on X-ray images is important in coronary heart disease diagnosis. CAD is caused by atherosclerotic plaques with subsequent stenosis (i.e., narrowing) of the coronary arteries. This makes the heart work harder, risking failure. Automated identification of stenosis may be used for triage or as a second reader in clinical practice, providing a valuable tool for cardiologists. In this paper, we evaluate the detection of stenosis in X-ray coronary angiography images with novel object detection methods based on deep neural networks. We trained and tested three promising object detectors based on different neural network architectures leveraging attention mechanisms (EfficientDet, RetinaNet ResNet-50-FPN, and Faster R-CNN ResNet-101) using clinical …
Using a Siamese Network to Accurately Detect Ischemic Stroke in Computed Tomography Scans AB Vieira, AC Fonseca, J Ferro, AL Oliveira
2022 Progress in Artificial Intelligence: 21st EPIA Conference on Artificial …
Diagnosing stroke, a leading cause of death in the world, using computed tomography (CT) scans makes it possible to assess the severity of the incident and to determine the type and location of the lesion. The fact that the brain has two hemispheres with a high level of anatomical similarity, exhibiting significant symmetry, has led to extensive research based on the assumption that a decrease in symmetry is directly related to the presence of pathologies. This work is focused on the analysis of the symmetry (or lack of it) of the two brain hemispheres, and on the use of this information for the classification of computed tomography brain scans of stroke patients. The objective is to contribute to a more precise diagnosis of brain lesions caused by ischemic stroke events. To perform this task, we used a Siamese network architecture that receives a double two-dimensional image of a CT slice (the original and a …
COVID-19 Contact Tracing Applications in Portugal: Effectiveness and Privacy Issues AL Oliveira
2022 Towards Trustworthy Artificial Intelligent Systems, 109-114
Portugal developed and adopted a contact tracing smartphone application in order to help contain the spread of the COVID-19 pandemic. The particular technology used was adopted in order to guarantee privacy and security, but it did not prove particularly effective, since it was downloaded by only a fraction of the population and was only lightly used to report cases of contacts between infected and non-infected persons. These issues are addressed and discussed in the present text.
Assessing the impact of attention and self-attention mechanisms on the classification of skin lesions R Pedro, AL Oliveira
2022 2022 International Joint Conference on Neural Networks (IJCNN), 1-8
Attention mechanisms have raised significant interest in the research community, since they promise relevant improvements in the performance of neural network architectures. However, in any specific problem, we still lack a principled way to choose specific mechanisms and hyper-parameters that lead to guaranteed improvements. More recently, self-attention has been proposed and widely used in transformer-like architectures, leading to significant breakthroughs in some applications. In this work we focus on two forms of attention mechanisms: attention modules and self-attention. Attention modules are used to reweight the features of each layer input tensor. Different modules have different ways to perform this reweighting in fully connected or convolutional layers. The attention models studied are completely modular and in this work they will be used with the popular ResNet architecture. Self-attention, originally …
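For illustration, a minimal channel-attention module of the kind the abstract describes ("squeeze-and-excitation" style), which reweights the channels of a layer's input tensor; the hyper-parameters are assumptions, not the paper's exact configuration.

```python
# Minimal channel-attention (squeeze-and-excitation style) module that
# reweights the channels of an input tensor. Hyper-parameters illustrative.
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)        # squeeze: global context
        self.fc = nn.Sequential(                   # excitation: channel weights
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        w = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)
        return x * w                               # reweight input features

# Usage: insert after any convolutional block, e.g. inside a ResNet stage.
y = ChannelAttention(64)(torch.randn(2, 64, 32, 32))
```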
Using soft labels to model uncertainty in medical image segmentation J Lourenço-Silva, AL Oliveira
2022 Brainlesion: Glioma, Multiple Sclerosis, Stroke and Traumatic Brain Injuries …
Medical image segmentation is inherently uncertain. For a given image, there may be multiple plausible segmentation hypotheses, and physicians will often disagree on lesion and organ boundaries. To be suited to real-world application, automatic segmentation systems must be able to capture this uncertainty and variability. Thus far, this has been addressed by building deep learning models that, through dropout, multiple heads, or variational inference, can produce a set (infinite, in some cases) of plausible segmentation hypotheses for any given image. However, in clinical practice, it may not be practical to browse all hypotheses. Furthermore, recent work shows that segmentation variability plateaus after a certain number of independent annotations, suggesting that a large enough group of physicians may be able to represent the whole space of possible segmentations. Inspired by this, we propose a simple …
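As a sketch of the soft-label idea, assuming several annotator masks per image are averaged into a per-pixel probability target (the general approach the abstract points to; shapes and the exact loss are illustrative):

```python
# Sketch of soft-label training: multiple expert masks averaged into a
# per-pixel probability target, trained with cross-entropy. Shapes illustrative.
import torch
import torch.nn.functional as F

def soft_label_loss(logits, annotations):
    """logits: (B, 1, H, W) raw model outputs.
    annotations: (B, K, H, W) binary masks from K annotators."""
    soft_target = annotations.float().mean(dim=1, keepdim=True)  # values in [0, 1]
    return F.binary_cross_entropy_with_logits(logits, soft_target)

loss = soft_label_loss(torch.randn(2, 1, 64, 64),
                       torch.randint(0, 2, (2, 4, 64, 64)))
```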
A blueprint for conscious machines AL Oliveira
2022 Proceedings of the National Academy of Sciences 119 (23), e2205971119
Although it has been the subject of human thought for many centuries, consciousness remains a mysterious and controversial topic. Every one of us is overly familiar with the phenomenon, but there is little agreement on what it is, what it entails, and how it is created. Certainly, no other phenomenon is simultaneously so familiar and so hard to explain. Researchers from fields as different as medicine, philosophy, psychology, neurobiology, and computer science have tried for centuries to frame, describe, and address the mind–body problem—how consciousness arises from purely physical processes—but success has been, to say the least, limited. Many believe that it is a phenomenon that will remain, forever, unknowable, outside the reach of human understanding. David Chalmers (1) famously argued that, while we may advance in our understanding of the different physical processes that are associated with …
Assessing Policy, Loss and Planning Combinations in Reinforcement Learning using a New Modular Architecture TG Oliveira, AL Oliveira
2022 EPIA Conference on Artificial Intelligence, 427-439
The model-based reinforcement learning paradigm, which uses planning algorithms and neural network models, has recently achieved unprecedented results in diverse applications, leading to what is now known as deep reinforcement learning. These agents are quite complex and involve multiple components, factors that create challenges for research and development of new models. In this work, we propose a new modular software architecture suited for these types of agents, and a set of building blocks that can be easily reused and assembled to construct new model-based reinforcement learning agents. These building blocks include search algorithms, policies, and loss functions (Code available at https://github.com/GaspTO/Modular_MBRL). We illustrate the use of this architecture by combining several of these building blocks to implement and test agents that are optimized to three different test …
Modelling Neuronal Behaviour with Time Series Regression: Recurrent Neural Networks on C. Elegans Data G Mestre, R Barbulescu, AL Oliveira, LM Silveira
2021 arXiv preprint arXiv:2107.06762
Given the inner complexity of the human nervous system, insight into the dynamics of brain activity can be gained from understanding smaller and simpler organisms, such as the nematode C. elegans. The behavioural and structural biology of these organisms is well-known, making them prime candidates for benchmarking modelling and simulation techniques. In these complex neuronal collections, classical, white-box modelling techniques based on intrinsic structural or behavioural information are either unable to capture the profound nonlinearities of the neuronal response to different stimuli or generate extremely complex models, which are computationally intractable. In this paper we show how the nervous system of C. elegans can be modelled and simulated with data-driven models using different neural network architectures. Specifically, we target the use of state-of-the-art recurrent neural network architectures such as LSTMs and GRUs and compare these architectures in terms of their properties and their accuracy, as well as the complexity of the resulting models. We show that GRU models with a hidden layer size of 4 units are able to reproduce with high accuracy the system's response to very different stimuli.
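A minimal sketch of the kind of model the abstract reports, a GRU with a hidden size of 4 mapping a stimulus time series to a neuronal response; the input/output dimensions and training setup are assumptions for illustration.

```python
# Minimal GRU regression model for stimulus -> neuronal response mapping.
# Dimensions are illustrative; hidden size 4 follows the abstract.
import torch
import torch.nn as nn

class NeuronGRU(nn.Module):
    def __init__(self, n_stimuli=1, hidden=4, n_neurons=1):
        super().__init__()
        self.gru = nn.GRU(n_stimuli, hidden, batch_first=True)
        self.readout = nn.Linear(hidden, n_neurons)

    def forward(self, stimulus):          # stimulus: (B, T, n_stimuli)
        h, _ = self.gru(stimulus)
        return self.readout(h)            # (B, T, n_neurons)

model = NeuronGRU()
response = model(torch.randn(8, 200, 1))  # predicted response over 200 steps
# Training would minimize e.g. MSE against simulated or recorded neuron traces.
```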
Artificial Intelligence Applications in Stroke AL Oliveira
2021 Precision Medicine in Stroke, 261
Machine learning and other artificial intelligence techniques have been extensively used in a large number of medical applications, in diagnosis, treatment selection, mining of electronic health records, genetics, and image processing, among several others. Machine learning techniques can be used to infer predictive models, from labeled data, in many areas of medicine, including stroke. In particular, in patients who suffered from stroke, machine learning can be used to perform or improve outcome prediction, lesion segmentation, and treatment assessment, among others. Machine learning algorithms can be based on a number of different approaches. In this chapter, we cover the symbolic, statistical, similarity-based, and connectionist approaches, which have different properties and trade-offs. Recently, a set of techniques generally known as deep learning have increased significantly the range of applicability of …
Combining Off and On-Policy Training in Model-Based Reinforcement Learning A Borges, A Oliveira
2021 Adaptive and Learning Agents Workshop
The combination of deep learning and Monte Carlo Tree Search (MCTS) has been shown to be effective in various domains, such as board and video games. AlphaGo represented a significant step forward in our ability to learn complex board games, and it was rapidly followed by significant advances, such as AlphaGo Zero and AlphaZero. Recently, MuZero demonstrated that it is possible to master both Atari games and board games by directly learning a model of the environment, which is then used with MCTS to decide what move to play in each position. During tree search, the algorithm simulates games by exploring several possible moves and then picks the action that corresponds to the most promising trajectory. When training, limited use is made of these simulated games since none of their trajectories are directly used as training examples. Even if we consider that not all trajectories from simulated games are useful, there are thousands of potentially useful trajectories that are discarded. Using information from these trajectories would provide more training data, more quickly, leading to faster convergence and higher sample efficiency. Recent work introduced an off-policy value target for AlphaZero that uses data from simulated games. In this work, we propose a way to obtain off-policy targets using data from simulated games in MuZero. We combine these off-policy targets with the on-policy targets already used in MuZero in several ways, and study the impact of these targets and their combinations in three environments with distinct characteristics. When used in the right combinations, our results show that these targets can speed up the …
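To make the idea of combining targets concrete, a deliberately simplified sketch; the convex mixing rule shown is an assumption for illustration, not the paper's exact formulation.

```python
# Illustrative sketch of mixing on-policy and off-policy value targets.
# The convex combination below is an assumed example, not the paper's rule.
def combined_value_target(on_policy_target, off_policy_target, beta=0.5):
    """Mix the standard MuZero (on-policy) target with a target computed
    from trajectories simulated during MCTS (off-policy)."""
    return beta * on_policy_target + (1.0 - beta) * off_policy_target

# e.g. value_target = combined_value_target(n_step_return, mcts_rollout_return)
```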
Automated detection of coronary artery stenosis in x-ray angiography using deep neural networks DL Rodrigues, MN Menezes, FJ Pinto, AL Oliveira
2021 arXiv preprint arXiv:2103.02969
Coronary artery disease leading up to stenosis, the partial or total blocking of coronary arteries, is a severe condition that affects millions of patients each year. Automated identification and classification of stenosis severity from minimally invasive procedures would be of great clinical value, but existing methods do not match the accuracy of experienced cardiologists, due to the complexity of the task. Although a number of computational approaches for quantitative assessment of stenosis have been proposed to date, the performance of these methods is still far from the required levels for clinical applications. In this paper, we propose a two-step deep-learning framework to partially automate the detection of stenosis from X-ray coronary angiography images. In the two steps, we used two distinct convolutional neural network architectures, one to automatically identify and classify the angle of view, and another to determine the bounding boxes of the regions of interest in frames where stenosis is visible. Transfer learning and data augmentation techniques were used to boost the performance of the system in both tasks. We achieved a 0.97 accuracy on the task of classifying the Left/Right Coronary Artery (LCA/RCA) angle view and 0.68/0.73 recall on the determination of the regions of interest, for LCA and RCA, respectively. These results compare favorably with previous results obtained using related approaches, and open the way to a fully automated method for the identification of stenosis severity from X-ray angiographies.
The bio.tools registry of software tools and data resources for the life sciences J Ison, H Ienasescu, P Chmura, E Rydza, H Ménager, M Kalaš, ...
2019 Genome biology 20 (1), 1-4
Bioinformaticians and biologists rely increasingly upon workflows for the flexible utilization of the many life science tools that are needed to optimally convert data into knowledge. We outline a pan-European enterprise to provide a catalogue (https://bio.tools) of tools and databases that can be used in these workflows. bio.tools not only lists where to find resources, but also provides a wide variety of practical information.
Biotechnology, big data and artificial intelligence AL Oliveira
2019 Biotechnology journal 14 (8), 1800613
Developments in biotechnology are increasingly dependent on the extensive use of big data, generated by modern high-throughput instrumentation technologies, and stored in thousands of databases, public and private. Future developments in this area depend, critically, on the ability of biotechnology researchers to master the skills required to effectively integrate their own contributions with the large amounts of information available in these databases. This article offers a perspective on the relations that exist between the fields of big data and biotechnology, including the related technologies of artificial intelligence and machine learning, and describes how data integration, data exploitation, and process optimization correspond to three essential steps in any future biotechnology project. The article also lists a number of application areas where the ability to use big data will become a key factor, including drug …
Inteligência artificial A Oliveira
2019 Fundação Francisco Manuel dos Santos
We are the most intelligent animal species known, the product of an extraordinary biological evolution spanning billions of years. It is therefore not surprising that we now want to surpass our own limits, creating systems that reproduce intelligent behaviors artificially. Where do we stand in this adventure? Will these systems one day surpass the intelligence of their creators? Should we fear them? What role can they play in the future evolution of the human species and in the conquest of space? This essay describes, in an accessible way, what artificial intelligence is and its relation to human intelligence, as well as possible applications and societal and economic implications. It includes a historical perspective and an analysis of the current state of the technology. It reflects on the possible consequences of the development of artificial intelligence and invites us to project our intelligence into the future.
Graph Neural Networks for Traffic Forecasting J Rico, J Barateiro, A Oliveira
2019 17th International Operations & Maintenance Conference in the Arab Countries
The significant increase in world population and urbanisation has brought several important challenges, in particular regarding the sustainability, maintenance and planning of urban mobility. At the same time, the exponential increase of computing capability and of available sensor and location data has offered the potential for innovative solutions to these challenges. In this work, we focus on the challenge of traffic forecasting and review the recent development and application of graph neural networks (GNN) to this problem. GNNs are a class of deep learning methods that directly process the input as graph data. This leverages more directly the spatial dependencies of traffic data and makes use of the advantages of deep learning, producing state-of-the-art results. We introduce and review the emerging topic of GNNs, including their most common variants, with a focus on their application to traffic forecasting. We address the different ways of modelling traffic forecasting as a (temporal) graph, the different approaches developed so far to combine the graph and temporal learning components, as well as current limitations and research opportunities.
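For illustration, one graph-convolution step of the standard GCN form H' = σ(ÂHW) applied to a toy road graph; the graph and features are made up for the example, not data from the survey.

```python
# One graph-convolution step (H' = relu(A_norm @ H @ W)) on a toy road graph.
# The adjacency, features, and weights below are illustrative placeholders.
import numpy as np

def gcn_layer(A, H, W):
    """A: (N, N) adjacency of road segments; H: (N, F) node features
    (e.g. recent speed readings); W: (F, F') learned weights."""
    A_hat = A + np.eye(len(A))                 # add self-loops
    D_inv_sqrt = np.diag(1.0 / np.sqrt(A_hat.sum(axis=1)))
    A_norm = D_inv_sqrt @ A_hat @ D_inv_sqrt   # symmetric normalization
    return np.maximum(A_norm @ H @ W, 0.0)     # ReLU activation

N, F = 5, 3                                    # 5 road segments, 3 features each
A = np.triu(np.random.randint(0, 2, (N, N)), 1)
A = A + A.T                                    # symmetric road graph
H = np.random.rand(N, F)
W = np.random.rand(F, 8)
print(gcn_layer(A, H, W).shape)                # (5, 8)
```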
Computer Architecture: Digital Circuits to Microprocessors G Arroz, J Monteiro, A Oliveira
2019
This chapter is focused on the ways computers represent information in digital format. In particular, it discusses the mechanisms used to represent various quantities in a digital computer. The digital electronic circuits that are commonly used in a digital computer can assume only one of two possible values, which implies that different quantities have to be represented in a format compatible with this restriction.
Identifying the best machine learning algorithms for brain tumor segmentation, progression assessment, and overall survival prediction in the BRATS challenge S Bakas, M Reyes, A Jakab, S Bauer, M Rempfler, A Crimi, RT Shinohara, ...
2018 arXiv preprint arXiv:1811.02629
Gliomas are the most common primary brain malignancies, with different degrees of aggressiveness, variable prognosis and various heterogeneous histologic sub-regions, i.e., peritumoral edematous/invaded tissue, necrotic core, active and non-enhancing core. This intrinsic heterogeneity is also portrayed in their radio-phenotype, as their sub-regions are depicted by varying intensity profiles disseminated across multi-parametric magnetic resonance imaging (mpMRI) scans, reflecting varying biological properties. Their heterogeneous shape, extent, and location are some of the factors that make these tumors difficult to resect, and in some cases inoperable. The amount of resected tumor is a factor also considered in longitudinal scans, when evaluating the apparent tumor for potential diagnosis of progression. Furthermore, there is mounting evidence that accurate segmentation of the various tumor sub-regions can offer the basis for quantitative image analysis towards prediction of patient overall survival. This study assesses the state-of-the-art machine learning (ML) methods used for brain tumor image analysis in mpMRI scans, during the last seven instances of the International Brain Tumor Segmentation (BraTS) challenge, i.e., 2012-2018. Specifically, we focus on i) evaluating segmentations of the various glioma sub-regions in pre-operative mpMRI scans, ii) assessing potential tumor progression by virtue of longitudinal growth of tumor sub-regions, beyond use of the RECIST/RANO criteria, and iii) predicting the overall survival from pre-operative mpMRI scans of patients that underwent gross total resection. Finally, we investigate the …
ISLES 2016 and 2017-benchmarking ischemic stroke lesion outcome prediction based on multispectral MRI S Winzeck, A Hakim, R McKinley, JA Pinto, V Alves, C Silva, M Pisov, ...
2018 Frontiers in neurology 9, 679
The performance of a model depends heavily not only on the algorithm used but also on the data set it is applied to. This makes the comparison of newly developed tools to previously published approaches difficult. Either researchers need to implement others' algorithms first, to establish an adequate benchmark on their data, or a direct comparison of new and old techniques is infeasible. The Ischemic Stroke Lesion Segmentation (ISLES) challenge, which has now run for three consecutive years, aims to address this problem of comparability. ISLES 2016 and 2017 focused on lesion outcome prediction after ischemic stroke: by providing a uniformly pre-processed data set, researchers from all over the world could apply their algorithms directly. A total of nine teams participated in ISLES 2015, and 15 teams participated in ISLES 2016. Their performance was evaluated in a fair and transparent way to identify the state-of-the-art among all submissions. Top ranked teams almost always employed deep learning tools, which were predominantly convolutional neural networks (CNNs). Despite the great efforts, lesion outcome prediction remains challenging. The annotated data set remains publicly available and new approaches can be compared directly via the online evaluation system, serving as a continuing benchmark (www.isles-challenge.org).
Making algorithms work for us A Oliveira
2018 Nature Electronics 1 (9), 487-487
We are, increasingly, unwitting slaves to algorithms. Some of the largest companies in the world deal primarily with data manipulation, using powerful algorithms to help us decide what we buy, which songs we listen to, where we go and how we get there. And yet, few people understand what an algorithm is, what artificial intelligence really means, or what machine learning can do. In Hello World, Hannah Fry—a mathematician at University College London—opens a window into this world of algorithms and the ways they are changing our lives. For programmers, like myself, the title will be familiar. The first program that many of us wrote was one that prints “Hello world” on the terminal, a tradition that originated from the book by Brian Kernighan and Dennis Ritchie on how to program in C. But Fry’s book is not about programming, it is about the diverse ways algorithms are being used to process data and obtain …
Sparse network-based regularization for the analysis of patientomics high-dimensional survival data A Veríssimo, E Carrasquinha, MB Lopes, AL Oliveira, MF Sagot, S Vinga
2018 bioRxiv, 403402
The data made available by modern sequencing technologies represent a major challenge in oncological survival analysis, as the increasing amount of molecular data hampers the generation of models that are both accurate and interpretable. To tackle this problem, this work evaluates the introduction of graph centrality measures in classical sparse survival models such as the elastic net. We explore the use of network information as part of the regularization applied to the inverse problem, obtained both from external knowledge on the features evaluated and from the data themselves. A sparse solution is obtained either by promoting features that are isolated from the network or, alternatively, hubs, i.e., features that are highly connected within the network. We show that introducing the degree information of the features when inferring survival models consistently improves the model predictive performance in breast invasive carcinoma (BRCA) transcriptomic TCGA data while enhancing model interpretability. Preliminary clinical validation is performed using the Cancer Hallmarks Analytics Tool API and the String database. These case studies are included in the recently released glmSparseNet R package, a flexible tool to explore the potential of sparse network-based regularizers in generalized linear models for the analysis of omics data.
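In generic form, the degree-weighted sparse penalty the abstract describes can be written as below; the notation is an assumption for illustration, and the exact weight function used by glmSparseNet may differ.

```latex
% Degree-weighted elastic-net objective (generic form; notation assumed).
% \ell is the model log-likelihood, d_j the network degree of feature j:
\min_{\beta}\; -\ell(\beta)
  + \lambda \sum_{j=1}^{p} w(d_j)\Big(\alpha\,|\beta_j|
  + \tfrac{1-\alpha}{2}\,\beta_j^{2}\Big)
```

Choosing w decreasing in d_j penalizes highly connected features less (promoting hubs), while an increasing w promotes features that are isolated in the network.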
Conditional random fields as recurrent neural networks for 3d medical imaging segmentation M Monteiro, MAT Figueiredo, AL Oliveira
2018 arXiv preprint arXiv:1807.07464
The Conditional Random Field as a Recurrent Neural Network layer is a recently proposed algorithm meant to be placed on top of an existing Fully-Convolutional Neural Network to improve the quality of semantic segmentation. In this paper, we test whether this algorithm, which was shown to improve semantic segmentation for 2D RGB images, is able to improve segmentation quality for 3D multi-modal medical images. We developed an implementation of the algorithm which works for any number of spatial dimensions, input/output image channels, and reference image channels. As far as we know this is the first publicly available implementation of this sort. We tested the algorithm with two distinct 3D medical imaging datasets and concluded that the performance differences observed were not statistically significant. Finally, in the discussion section of the paper, we examine the reasons why this technique transfers poorly from natural images to medical images.
Computational approach to the discovery of phytochemical molecules with therapeutic potential targets to the PKCZ protein PG Freitas, TC Elias, IA Pinto, LT Costa, PVSD de Carvalho, ...
2018 Letters in Drug Design & Discovery 15 (5), 488-499
Background: Head and neck squamous cell carcinoma (HNSCC) is one of the most common malignancies in humans and the average 5-year survival rate is one of the lowest among aggressive cancers. Protein kinase C zeta (PKCZ) is highly expressed in head and neck tumors, and the inhibition of PKCZ reduces MAPK activation in five of seven head and neck tumor cell lines. Considering the world-wide HNSCC problem, there is an urgent need to develop new drugs to treat this disease that have low toxicity, produce effective results, and are relatively inexpensive. Methods: A unified approach involving homology modeling, docking and molecular dynamics simulation studies on PKCZ is presented. The in silico study on this enzyme was undertaken using 10 compounds from latex of Euphorbia tirucalli L. (aveloz). Results: The binding free energies highlight that the main contribution in energetic terms for the …
Using machine learning to improve the prediction of functional outcome in ischemic stroke patients M Monteiro, AC Fonseca, AT Freitas, TP e Melo, AP Francisco, JM Ferro, ...
2018 IEEE/ACM transactions on computational biology and bioinformatics 15 (6 …
Ischemic stroke is a leading cause of disability and death worldwide among adults. The individual prognosis after stroke is extremely dependent on treatment decisions physicians take during the acute phase. In the last five years, several scores such as the ASTRAL, DRAGON, and THRIVE have been proposed as tools to help physicians predict the patient functional outcome after a stroke. These scores are rule-based classifiers that use features available when the patient is admitted to the emergency room. In this paper, we apply machine learning techniques to the problem of predicting the functional outcome of ischemic stroke patients, three months after admission. We show that a pure machine learning approach achieves only a marginally superior Area Under the ROC Curve (AUC) than that of the best score when using the features available at admission. However, we observed …
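A hedged sketch of the evaluation setup the abstract describes, training a standard classifier on admission features and computing the AUC; the classifier choice, feature matrix, and labels below are placeholders, not the paper's data or model.

```python
# Sketch: train a classifier on admission features and compute AUC,
# to be compared against rule-based scores. Data is a random placeholder.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X = np.random.rand(541, 20)          # admission features (illustrative)
y = np.random.randint(0, 2, 541)     # dichotomized 3-month functional outcome

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
auc = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])
print(f"ML model AUC: {auc:.3f}")    # compare with ASTRAL/DRAGON/THRIVE AUCs
```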
The digital mind: How science is redefining humanity A Oliveira
2017 MIT Press
How developments in science and technology may enable the emergence of purely digital minds—intelligent machines equal to or greater in power than the human brain. What do computers, cells, and brains have in common? Computers are electronic devices designed by humans; cells are biological entities crafted by evolution; brains are the containers and creators of our minds. But all are, in one way or another, information-processing devices. The power of the human brain is, so far, unequaled by any existing machine or known living being. Over eons of evolution, the brain has enabled us to develop tools and technology to make our lives easier. Our brains have even allowed us to develop computers that are almost as powerful as the human brain itself. In this book, Arlindo Oliveira describes how advances in science and technology could enable us to create digital minds. Exponential growth is a pattern built deep into the scheme of life, but technological change now promises to outstrip even evolutionary change. Oliveira describes technological and scientific advances that range from the discovery of laws that control the behavior of the electromagnetic fields to the development of computers. He calls natural selection the ultimate algorithm, discusses genetics and the evolution of the central nervous system, and describes the role that computer imaging has played in understanding and modeling the brain. Having considered the behavior of the unique system that creates a mind, he turns to an unavoidable question: Is the human brain the only system that can host a mind? If digital minds come into existence—and, Oliveira says, it is difficult …
Brains, Minds, and Machines A Oliveira
2017 MIT Press
The preceding chapters focused on the behavior of the one system known that undoubtedly creates a mind: the human brain. The question of whether the human brain is the only system that can host a mind is worth discussing in some detail.
From Maxwell to the Internet A Oliveira
2017 MIT Press
A number of important technologies developed rapidly in the nineteenth and twentieth centuries, including electricity, electronics, and computers, but also biotechnology, nanotechnologies, and technologies derived from them. In fact, it is the exponential rate of development of these technologies, more than anything else, that has resulted in the scientific and economic developments of recent decades. In those technologies and in others, that rate is likely to increase.
The Quest for Intelligent Machines A Oliveira
2017 MIT Press
Can a computer be intelligent? Can a program, running in a computer, behave intelligently in a human-like way? These are not simple questions, and answers have eluded scientists and thinkers for hundreds of years.
Understanding the Brain A Oliveira
2017 MIT Press
For many reasons, understanding the mechanisms that underlie the development and the plasticity phenomena of the human brain would be of enormous value. From a medical standpoint, the understanding of these mechanisms would lead to ways to prevent degenerative brain disorders, to extend life, and to increase the length of time people live in full possession of their mental abilities.
How the Brain Works A Oliveira
2017 MIT Press
Ever since Alcmaeon of Croton (fifth century BC) and Hippocrates (fourth century BC) recognized the brain as the seat of intelligence, humanity has been interested in understanding how it works. Despite this early start, it took more than two millennia to become common knowledge that it is in the brain that intelligence and memory reside.
Speculations A Oliveira
2017 MIT Press
The possibilities addressed in the chapters above, far-fetched as they may seem, raise the possibility that future technological developments may be even more unpredictable than one might expect from recent history. In this final chapter, I discuss a few possible long-term directions for the evolution of technology. These are educated guesses or, if you prefer, wild speculations.
The Universal Machine A Oliveira
2017 MIT Press
Computers are now so ubiquitous that it is hard to imagine a time when the very concept of computer didn't exist. The word computer, referring to a person who carried out calculations, or computations, was probably used for the first time in the seventeenth century, and continued to be used in that sense until the middle of the twentieth century. A computer is someone or something that executes sequences of simple calculations—sequences that can be programmed.
Cells, Bodies, and Brains A Oliveira
2017 MIT Press
The work that led Charles Darwin to write On the Origin of Species by Means of Natural Selection probably didn't begin as an effort to start what would become the major scientific revolution of all time, a paradigm shift that would change people's views of the world more deeply than any other in human history.
The Exponential Nature of Technology A Oliveira
2017 MIT Press
Since the dawn of mankind, cultures and civilizations have developed many different technologies that have changed profoundly the way people live. At the time of its introduction, each technology changed the lives of individuals, tribes, and whole populations. Change didn't occur at a constant pace, though. The speed at which new technologies have been developed and used has been on the increase ever since the first innovations were introduced, hundreds of thousands of years ago.
Challenges and Promises A Oliveira
2017 MIT Press
It is difficult to argue, with conviction, that digital minds will never exist, unless you are a firm believer in dualism. In fact, even though today we don't have the technologies necessary to create digital minds, it is difficult to believe that such technologies will never be developed.
Biology Meets Computation A Oliveira
2017 MIT Press
If the twentieth century was the century of physics, the twenty-first may well become the century of biology. One may argue that biology is a science with an already long and rich history and that the twenty-first century will be a century of many technologies. Still, the parallel has its merits.
Mentes digitais: a ciência redefinindo a humanidade A Oliveira
2017 Lisboa: IST Press
What do computers, cells, and brains have in common? Computers are electronic devices designed by human beings; cells are biological entities created by evolution; brains are the containers and creators of our minds. All are, in one way or another, information-processing devices, resulting from the exponential growth patterns that characterize not only living beings but also many technological developments. The power of the human brain is, to date, unmatched by any known machine or living being. Perfected by evolution, the most powerful algorithm of all time, over millions of years, this brain has allowed us to develop tools and technologies that simplify our lives, and has even allowed us to develop computers almost as powerful as the brain itself. We may soon have the ability to create digital minds, made possible by developments in electronics, computing, physics, and biology. Is the human brain the only system capable of hosting an intelligent mind? If it is not, what path must we travel before we have digital minds? Will they be conscious, our partners or our rivals? What are the social, economic, legal, and ethical implications? Complex questions, which this book addresses and attempts to answer.
Improving the Prediction of Functional Outcome in Ischemic Stroke Patients M Monteiro, AC Fonseca, AT Freitas, TP e Melo, AP Francisco, JM Ferro, ...
2017 Proceedings of International Workshop on Data Mining in Bioinformatics …
Our original dataset comprised 541 patients with acute stroke from the Safe Implementation of Treatments in Stroke (SITS)-Thrombolysis registry [20]. The cohort of patients came from the Hospital of Santa Maria, in Lisbon, Portugal, which is a tertiary university hospital. Even though different hospitals have different cohorts, all hospitals collect the same data for the SITS registry. This registry had baseline data collected at admission, follow-up data collected 2 hours, 24 hours, and 7 days after the initial stroke event, data collected on discharge, and the mRS recorded three months after the event. All of the patients on the registry were treated using Recombinant Tissue Plasminogen Activator (rtPA).
Fully Convolutional Neural Network for 3D Stroke Lesion Segmentation M Monteiro, AL Oliveira
2017 ISLES: Ischemic Stroke Lesion Segmentation Challenge 2017
The segmentation of stroke lesions is an important step for the assessment of stroke outcome. This problem can be particularly difficult if the size of the lesion is very small when compared with the size of healthy tissue. Our solution consists of what has been denoted as a U-shaped [1] or V-shaped [2] fully convolutional neural network. This type of architecture has been shown to perform well for other medical image segmentation problems. To address the fact that the lesions can be very small sized, we propose a new loss function to optimize our network. This loss function consists of the sum between the cross entropy and a “soft” dice coefficient, which is the dice coefficient calculated before binarizing the outputs. We found that this loss function leads to much faster convergence than the simple cross entropy loss or just the dice coefficient as proposed in [2]. For post processing we used a dense Conditional …
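The loss the abstract describes, cross-entropy plus a "soft" Dice term computed on the un-binarized outputs, can be sketched as below; the tensor shapes and the exact way the two terms are summed are assumptions for illustration.

```python
# Sketch of the loss described above: cross-entropy plus a "soft" Dice
# term computed before binarizing the outputs. Shapes are illustrative.
import torch
import torch.nn.functional as F

def soft_dice_ce_loss(logits, target, eps=1e-6):
    """logits: (B, 1, D, H, W) raw outputs; target: binary lesion mask."""
    p = torch.sigmoid(logits)                       # un-binarized probabilities
    inter = (p * target).sum()
    soft_dice = (2 * inter + eps) / (p.sum() + target.sum() + eps)
    ce = F.binary_cross_entropy_with_logits(logits, target)
    return ce + (1 - soft_dice)                     # minimize CE, maximize Dice

loss = soft_dice_ce_loss(torch.randn(1, 1, 16, 64, 64),
                         torch.randint(0, 2, (1, 1, 16, 64, 64)).float())
```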
Network-based sparse modeling of breast invasive carcinoma survival data A Veríssimo, E Carrasquinha, MB Lopes, AL Oliveira, MF Sagot, S Vinga
2017
Learning survival models from oncological data has now become a major challenge due to the significant increase of molecular information. The inherent high-dimensionality of these datasets, where the number of features largely exceeds the number of observations, leads to ill-posed inverse problems and, consequently, to models that often lack interpretability. In order to tackle this problem, regularized optimization has emerged as a promising approach, allowing constraints to be imposed on the structure of the solutions; these include sparsity, for example using LASSO, or other penalizing functions that use network-based information if the features have a graph-based configuration. We compared how different sparse methods perform when applied to a Breast Invasive Carcinoma dataset and how introducing network knowledge impacts model prediction. These include Elastic Net and its coupling with DegreeCox. The results regarding the concordance c-index show an improvement when network information is included, whereas the log-rank tests on the separation between high- and low-risk patients exhibit a decrease in performance. It is expected that the obtained models can further support clinical decisions and the prognostic assessment of oncological patients.
DegreeCox–a network-based regularization method for survival analysis A Veríssimo, AL Oliveira, MF Sagot, S Vinga
2016 BMC bioinformatics 17, 109-121
Background: Modeling survival oncological data has become a major challenge as the increase in the amount of molecular information nowadays available means that the number of features greatly exceeds the number of observations. One possible solution to cope with this dimensionality problem is the use of additional constraints in the cost function optimization. Lasso and other sparsity methods have thus already been successfully applied with this idea. Although this leads to more interpretable models, these methods still do not fully profit from the relations between the features, especially when these can be represented through graphs. We propose DegreeCox, a method that applies network-based regularizers to infer Cox proportional hazard models, when the features are genes and the outcome is patient survival. In particular, we propose to use network centrality measures to constrain the model in terms of …
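The general shape of such an estimator is the Cox log partial likelihood with a network-weighted penalty; the formulation below is a generic sketch in the spirit of DegreeCox, with the weight function w an assumed placeholder.

```latex
% Cox log partial likelihood with a network (centrality-weighted) ridge-type
% penalty; w(d_k) is an assumed weight on the centrality d_k of gene k:
\hat{\beta} = \arg\max_{\beta}\;
  \sum_{i:\,\delta_i = 1}\Big[x_i^{\top}\beta
    - \log\!\!\sum_{j:\,t_j \ge t_i}\!\exp\big(x_j^{\top}\beta\big)\Big]
  \;-\; \lambda \sum_{k=1}^{p} w(d_k)\,\beta_k^{2}
```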
The ELIXIR channel in F1000Research N Blomberg, AL Oliveira, B Mons, B Persson, I Jonassen
2015 F1000Research 4
ELIXIR, the European life science infrastructure for biological information, is a unique initiative to consolidate Europe's national centres, services, and core bioinformatics resources into a single, coordinated infrastructure. ELIXIR brings together Europe's major life-science data archives and connects these with national bioinformatics infrastructures, the ELIXIR Nodes. This editorial introduces the ELIXIR channel in F1000Research; the aim of the channel is to collect and present ELIXIR's scientific and operational output, engage with the broad life science community and encourage discussion on proposed infrastructure solutions. Submissions will be assessed by the ELIXIR channel Editorial Board to ensure they are relevant to the ELIXIR community, and subjected to the F1000Research open peer review process.
The YEASTRACT database: an upgraded information system for the analysis of gene and genomic transcription regulation in Saccharomyces cerevisiae MC Teixeira, PT Monteiro, JF Guerreiro, JP Gonçalves, NP Mira, ...
2014 Nucleic acids research, gkt1015
The YEASTRACT (http://www.yeastract.com) information system is a tool for the analysis and prediction of transcription regulatory associations in Saccharomyces cerevisiae. Last updated in June 2013, this database contains over 200 000 regulatory associations between transcription factors (TFs) and target genes, including 326 DNA binding sites for 113 TFs. All regulatory associations stored in YEASTRACT were revisited and new information was added on the experimental conditions in which those associations take place and on whether the TF is acting on its target genes as activator or repressor. Based on this information, new queries were developed allowing the selection of specific environmental conditions, experimental evidence or positive/negative regulatory effect. This release further offers tools to rank the TFs controlling a gene or genome-wide response by their relative importance, based on (i …
Efficient identification of epistatic effects in multifactorial disorders O Anunciaçao, S Vinga, AL Oliveira
2013 NIPS Machine Learning in Computational Biology Workshop
Complex diseases are typically caused by the combined effects of multiple genetic variations. When epistatic effects are present, these genetic variations show stronger effects when considered together than when considered individually. To identify groups of single nucleotide polymorphisms (SNP) that can be of use in the explanation of multifactorial conditions, we look for pairs of SNPs with a high value of information interaction. We show that standard classification methods or greedy feature selection methods do not perform well on this problem. We propose a computationally efficient method that uses the information interaction value as a figure of merit, and compare it with state of the art methods (BEAM and SNPHarvester) in artificial datasets simulating epistatic interactions. The results show that the method is powerful and efficient, and more effective at detecting pairwise epistatic interactions than existing alternatives. We also present results of the application of the method to the WTCCC breast cancer dataset. We found 89 statistically significant pairwise interactions with a p-value lower than 10⁻³. Somewhat unexpectedly, almost all the SNPs involved in pairs with high value of information interaction also have moderate or high marginals, a result that may imply that the search for more complex interactions may be more effectively conducted by looking only at SNPs which, by themselves, have correlations with the condition under study.
Using information interaction to discover epistatic effects in complex diseases O Anunciação, S Vinga, AL Oliveira
2013 PLoS One 8 (10), e76300
It is widely agreed that complex diseases are typically caused by the joint effects of multiple genetic variations, rather than a single one. These genetic variations may show stronger effects when considered together than when considered individually, a phenomenon known as epistasis or multilocus interaction. In this work, we explore the applicability of information interaction to discover pairwise epistatic effects related to complex diseases. We start by showing that traditional approaches such as classification methods or greedy feature selection methods (such as the Fleuret method) do not perform well on this problem. We then compare our information interaction method with BEAM and SNPHarvester in artificial datasets simulating epistatic interactions and show that our method is more powerful at detecting pairwise epistatic interactions than its competitors. We show results of the application of the information interaction method to the WTCCC breast cancer dataset. Our results are validated using permutation tests. We were able to find 89 statistically significant pairwise interactions with a p-value lower than 10⁻³. Even though many recent algorithms have been designed to find epistasis with low marginals, we observed that all (except one) of the SNPs involved in statistically significant interactions have moderate or high marginals. We also report that the interactions found in this work were not present in the STRING gene-gene interaction network.
Lempel-Ziv sliding window update with suffix arrays A Ferreira, A Oliveira, M Figueiredo
2013 i-ETC: ISEL Academic Journal of Electronics Telecommunications and Computers …
The sliding window dictionary-based algorithms of the Lempel-Ziv (LZ) 77 family are widely used for universal lossless data compression. The encoding component of these algorithms performs repeated substring search. Data structures such as hash tables, binary search trees, and suffix trees have been used to speed up these searches, at the expense of memory usage. Previous work has shown how suffix arrays (SA) can be used for dictionary representation and LZ77 decomposition. In this paper, we improve over that work by proposing a new efficient algorithm to update the sliding window each time a token is produced at the output. The proposed algorithm toggles between two SAs on consecutive tokens. The resulting SA-based encoder requires less memory than conventional tree-based encoders. Comparing our SA-based technique against tree-based encoders on a large set of benchmark files, we find that, in some compression settings, our encoder is also faster.
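As an aside for readers unfamiliar with suffix-array search, the following sketch shows the longest-match query that such encoders perform against the dictionary. The naive construction and the '\uffff' sentinel are simplifications for illustration; the paper's contribution, the two-SA window update, is not reproduced here:

```python
import bisect

def build_suffix_array(s):
    """Naive O(n^2 log n) construction; real encoders use linear-time builds."""
    return sorted(range(len(s)), key=lambda i: s[i:])

def longest_match(dictionary, sa, lab):
    """Longest prefix of the look-ahead buffer `lab` occurring in
    `dictionary`, returned as (position, length); (-1, 0) if none.
    Assumes '\uffff' does not occur in the text (sentinel trick)."""
    suffixes = [dictionary[i:] for i in sa]   # materialized for clarity only
    best, lo, hi = (-1, 0), 0, len(sa)
    for k in range(1, len(lab) + 1):
        prefix = lab[:k]
        lo = bisect.bisect_left(suffixes, prefix, lo, hi)
        hi = bisect.bisect_right(suffixes, prefix + '\uffff', lo, hi)
        if lo >= hi:                          # no suffix extends the match
            break
        best = (sa[lo], k)
    return best

text = "abracadabra"
sa = build_suffix_array(text[:7])             # dictionary = "abracad"
print(longest_match(text[:7], sa, text[7:]))  # matches "abra" -> (0, 4)
```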
Mining query log graphs towards a query folksonomy AP Francisco, R Baeza‐Yates, AL Oliveira
2012 Concurrency and Computation: Practice and Experience 24 (17), 2179-2192
The human interaction through the web generates both implicit and explicit knowledge. An example of an implicit contribution is searching, as people contribute with their knowledge by clicking on retrieved documents. When this information is available, an important and interesting challenge is to extract relations from query logs and, in particular, semantic relations between queries and their terms. In this paper, we present and discuss results on query contextualization through the association of tags to queries, that is, query folksonomies. Note that tags may not even occur within the query. Our results rely on the analysis of large query log induced graphs, namely click induced graphs. Results obtained with real data show that the inferred query folksonomy provides interesting insights both on semantic relations among queries and on web users' intent.
TAPYR: An efficient high-throughput sequence aligner for re-sequencing applications F Fernandes, PGS da Fonseca, LMS Russo, AL Oliveira, AT Freitas
2012 7th International Conference of the Brazilian Association for Bioinformatics …
During the last two decades most laboratories used Sanger's "shotgun" method in many significant large-scale sequencing projects, with this method considered the "gold standard" in terms of both read length and sequencing accuracy. Recently, several next generation sequencing (NGS) technologies have emerged, including the GS FLX (454) Genome Analyzer, Illumina's Solexa 1G Sequencer, the SOLiD and the Ion Torrent systems, which are able to generate three to four orders of magnitude more sequences and are considerably less expensive than the Sanger method. However, the read lengths of NGS technologies create important algorithmic challenges. While the 454 platform is able to obtain reads in the 400-600 base pair (bp) range, Illumina's Solexa 1G Sequencer and the Ion Torrent systems produce reads with an average length of 100 bp, and the SOLiD platform is currently limited to 25-50 bp.
Efficient and accurate haplotype inference by combining parsimony and pedigree information A Graça, I Lynce, J Marques-Silva, AL Oliveira
2012 Algebraic and Numeric Biology: 4th International Conference, ANB 2010 …
Existing genotyping technologies have enabled researchers to genotype hundreds of thousands of SNPs efficiently and inexpensively. Methods for the imputation of non-genotyped SNPs and the inference of haplotype information from genotypes, however, remain important, since they have the potential to increase the power of statistical association tests. In many cases, studies are conducted in sets of individuals where the pedigree information is relevant, and can be used to increase the power of tests and to decrease the impact of population structure on the obtained results. This paper proposes a new Boolean optimization model for haplotype inference combining two combinatorial approaches: the Minimum Recombinant Haplotyping Configuration (MRHC), which minimizes the number of recombinant events within a pedigree, and the Haplotype Inference by Pure Parsimony (HIPP), which aims at finding a …
Efficient alignment of pyrosequencing reads for re-sequencing applications F Fernandes, PGS da Fonseca, LMS Russo, AL Oliveira, AT Freitas
2011 BMC bioinformatics 12, 1-9
Background: Over the past few years, new massively parallel DNA sequencing technologies have emerged. These platforms generate massive amounts of data per run, greatly reducing the cost of DNA sequencing. However, these techniques also raise important computational difficulties mostly due to the huge volume of data produced, but also because of some of their specific characteristics such as read length and sequencing errors. Among the most critical problems is that of efficiently and accurately mapping reads to a reference genome in the context of re-sequencing projects. Results: We present an efficient method for the local alignment of pyrosequencing reads produced by the GS FLX (454) system against a reference sequence. Our approach explores the characteristics of the data in these re-sequencing applications and uses state of the art …
GRISOTTO: A greedy approach to improve combinatorial algorithms for motif discovery with prior knowledge AM Carvalho, AL Oliveira
2011 Algorithms for Molecular Biology 6 (1), 1-13
Position-specific priors (PSPs) have been used with success to boost EM and Gibbs sampler-based motif discovery algorithms. PSP information has been computed from different sources, including orthologous conservation, DNA duplex stability, and nucleosome positioning. Prior information has not yet been used in the context of combinatorial algorithms. Moreover, priors have been used only independently, and the gain of combining priors from different sources has not yet been studied. We extend RISOTTO, a combinatorial algorithm for motif discovery, by post-processing its output with a greedy procedure that uses prior information. PSPs from different sources are combined into a scoring criterion that guides the greedy search procedure. The resulting method, called GRISOTTO, was evaluated over 156 yeast TF ChIP-chip sequence-sets commonly used to benchmark prior-based motif discovery algorithms. Results show that GRISOTTO is at least as accurate as twelve other state-of-the-art approaches for the same task, even without combining priors. Furthermore, by considering combined priors, GRISOTTO is considerably more accurate than the state-of-the-art approaches for the same task. We also show that PSPs improve GRISOTTO's ability to retrieve motifs from mouse ChIP-seq data, indicating that the proposed algorithm can be applied to data from a different technology and for a higher eukaryote. The conclusions of this work are twofold. First, post-processing the output of combinatorial algorithms by incorporating prior information leads to a very efficient and effective motif discovery method. Second, combining priors from …
TFRank: network-based prioritization of regulatory associations underlying transcriptional responses JP Gonçalves, AP Francisco, NP Mira, MC Teixeira, I Sá-Correia, ...
2011 Bioinformatics 27 (22), 3149-3157
Motivation: Uncovering mechanisms underlying gene expression control is crucial to understand complex cellular responses. Studies in gene regulation often aim to identify regulatory players involved in a biological process of interest, either transcription factors coregulating a set of target genes or genes eventually controlled by a set of regulators. These are frequently prioritized with respect to a context-specific relevance score. Current approaches rely on relevance measures accounting exclusively for direct transcription factor-target interactions, namely overrepresentation of binding sites or target ratios. Gene regulation has, however, intricate behavior, with overlapping, indirect effects that should not be neglected. In addition, the rapid accumulation of regulatory data already enables the prediction of large-scale networks suitable for higher level exploration by methods based on graph theory. A …
Quantitative modeling of the Saccharomyces cerevisiae FLR1 regulatory network using an S-system formalism D Calçada, S Vinga, AT Freitas, AL Oliveira
2011 Journal of Bioinformatics and Computational Biology 9 (05), 613-630
In this study we address the problem of finding a quantitative mathematical model for the genetic network regulating the stress response of the yeast Saccharomyces cerevisiae to the agricultural fungicide mancozeb. An S-system formalism was used to model the interactions of a five-gene network encoding four transcription factors (Yap1, Yrr1, Rpn4 and Pdr3) regulating the transcriptional activation of the FLR1 gene. Parameter estimation was accomplished by decoupling the resulting system of nonlinear ordinary differential equations into a larger nonlinear algebraic system, and using the Levenberg-Marquardt algorithm to fit the model's predictions to experimental data. The introduction of constraints in the model, related to the putative topology of the network, was explored. The results show that forcing the network connectivity to adhere to this topology did not lead to better results than the ones obtained using an …
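A toy illustration of the decoupling idea: once slopes estimated from the time series stand in for the derivatives, each gene's S-system parameters can be fitted independently by least squares. The data, the starting point and the absence of topology constraints below are assumptions made for the sketch, not the study's setup:

```python
import numpy as np
from scipy.optimize import least_squares

# S-system rate law for one gene i:
#   dX_i/dt = alpha * prod_j X_j**g_j - beta * prod_j X_j**h_j
def residuals(theta, X, slopes):
    n = X.shape[1]
    alpha, beta = theta[0], theta[1]
    g, h = theta[2:2 + n], theta[2 + n:]
    return alpha * np.prod(X ** g, axis=1) - beta * np.prod(X ** h, axis=1) - slopes

# Hypothetical expression time series: rows are time points, columns are genes.
t = np.linspace(0.1, 1.0, 20)
X = np.column_stack([1.0 + t, np.exp(-t) + 0.5])
slopes = np.gradient(X[:, 0], t)        # decoupling: estimated dX_0/dt

theta0 = np.ones(2 + 2 * X.shape[1])    # alpha, beta, g_j, h_j
fit = least_squares(residuals, theta0, args=(X, slopes), method='lm')
print(fit.x)                            # fitted rate constants and kinetic orders
```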
Fully compressed suffix trees LMS Russo, G Navarro, AL Oliveira
2011 ACM transactions on algorithms (TALG) 7 (4), 1-34
Suffix trees are by far the most important data structure in stringology, with a myriad of applications in fields like bioinformatics and information retrieval. Classical representations of suffix trees require Θ(n log n) bits of space, for a string of size n. This is considerably more than the n log₂ σ bits needed for the string itself, where σ is the alphabet size. The size of suffix trees has been a barrier to their wider adoption in practice. Recent compressed suffix tree representations require just the space of the compressed string plus Θ(n) extra bits. This is already spectacular, but the linear extra bits are still unsatisfactory when σ is small as in DNA sequences. In this article, we introduce the first compressed suffix tree representation that breaks this Θ(n)-bit space barrier. The Fully Compressed Suffix Tree (FCST) representation requires only sublinear space on top of the compressed text size, and supports a wide set of …
Qualitative modelling and formal verification of the FLR1 gene mancozeb response in Saccharomyces cerevisiae PT Monteiro, PJ Dias, D Ropers, AL Oliveira, I Sá-Correia, MC Teixeira, ...
2011 IET systems biology 5 (5), 308-316
Background: Qualitative models allow understanding the relation between the structure and the dynamics of gene regulatory networks. The dynamical properties of these models can be automatically analysed by means of formal verification methods, like model checking. This facilitates the model-validation process and the test of new hypotheses to reconcile model predictions with the experimental data. Results: The authors report in this study the qualitative modelling and simulation of the transcriptional regulatory network controlling the response of the model eukaryote Saccharomyces cerevisiae to the agricultural fungicide mancozeb. The model allowed the analysis of the regulation level and activity of the components of the mancozeb-induced gene network controlling the transcriptional activation of the FLR1 gene, which is proposed to confer multidrug resistance through its putative role as a drug efflux pump …
Discriminative Learning of Bayesian Networks via Factorized Conditional Log-Likelihood. AM Carvalho, T Roos, AL Oliveira, P Myllymäki
2011 Journal of machine learning research 12 (7)
We propose an efficient and parameter-free scoring criterion, the factorized conditional log-likelihood (fCLL), for learning Bayesian network classifiers. The proposed score is an approximation of the conditional log-likelihood criterion. The approximation is devised in order to guarantee decomposability over the network structure, as well as efficient estimation of the optimal parameters, achieving the same time and space complexity as the traditional log-likelihood scoring criterion. The resulting criterion has an information-theoretic interpretation based on interaction information, which exhibits its discriminative nature. To evaluate the performance of the proposed criterion, we present an empirical comparison with state-of-the-art classifiers. Results on a large suite of benchmark data sets from the UCI repository show that fCLL-trained classifiers achieve at least as good accuracy as the best compared classifiers, using significantly less computational resources.
Haplotype inference with pseudo-Boolean optimization. A Graça, J Marques-Silva, I Lynce, A Oliveira
2011 Annals of Operations Research 184 (1)
The fast development of sequencing techniques in the recent past has required an urgent development of efficient and accurate haplotype inference tools. Besides being a crucial issue in genetics, haplotype inference is also a challenging computational problem. Among others, pure parsimony is a viable modeling approach to solve the problem of haplotype inference and also an interesting NP-hard problem in itself. Recently, the introduction of SAT-based methods, including pseudo-Boolean optimization (PBO) methods, has produced very efficient solvers. This paper provides a detailed description of RPoly, a PBO approach for the haplotype inference by pure parsimony (HIPP) problem. Moreover, an extensive evaluation of existing HIPP solvers, on a comprehensive set of instances, confirms that RPoly is currently the most efficient and robust HIPP approach.
Sliding window update using suffix arrays A Ferreira, A Oliveira, M Figueiredo
2011 2011 Data Compression Conference (DCC), 456-456
Using systems biology approaches to study a multidrug resistance network PJ Dias, CP Costa, I Sá-Correia, MC Teixeira, PT Monteiro, AL Oliveira, ...
2011 1st Portuguese Biomedical Engineering Meeting, 1-4
Multidrug resistance (MDR), a phenomenon with impact on Human Health and on Agro-Food and Environmental Biotechnology, often results from the activation of drug efflux pumps, frequently controlled at the transcriptional level. The complex transcriptional control of these genes has been the focus of our research, guided by the information gathered in the YEASTRACT database. In this paper, the approach used to elucidate the transcriptional control of FLR1, encoding a Saccharomyces cerevisiae Drug:H+ Antiporter, in response to stress induced by the fungicide mancozeb is explained. The transcription regulatory network underlying FLR1 activation was defined based on experimental data. Subsequently, a mathematical model describing this network was built and its response to mancozeb stress in different genetic backgrounds was simulated, using the Genetic Network Analyzer (GNA) software. This …
On community detection in very large networks AP Francisco, AL Oliveira
2011 Complex Networks: Second International Workshop, CompleNet 2010, Rio de …
Community detection or graph clustering is an important problem in the analysis of computer networks, social networks, biological networks and many other natural and artificial networks. These networks are in general very large and, thus, finding hidden structures and functional modules is a very hard task. In this paper we propose new data structures and a new implementation of a well known agglomerative greedy algorithm to find community structure in large networks, the CNM algorithm. The experimental results show that the improved data structures speed up the method by a large factor, for large networks, making it competitive with other state of the art algorithms.
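For readers who wish to experiment, the same agglomerative greedy (Clauset-Newman-Moore) algorithm ships with networkx; what follows is that baseline implementation, not the authors' improved data structures:

```python
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities, modularity

# CNM greedily merges the pair of communities whose merge yields the
# largest modularity gain, stopping when no merge improves modularity.
G = nx.karate_club_graph()
communities = greedy_modularity_communities(G)
for i, nodes in enumerate(communities):
    print(f"community {i}: {sorted(nodes)}")
print("modularity:", round(modularity(G, communities), 3))
```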
Using graph modularity analysis to identify transcription factor binding sites AP Francisco, S Schbath, AT Freitas, AL Oliveira
2010 2010 IEEE International Conference on Bioinformatics and Biomedicine …
Despite the remarkable success of computational biology methods in some areas of application, like gene finding and sequence alignment, there are still topics for which no definitive approaches have been proposed. One of these is the accurate detection of biologically significant cis-regulatory motifs, which remains an open problem despite intensive research in the field. Probabilistic motif finders are the most popular, mainly because combinatorial motif finders generate extensive and hard to understand lists of potential motifs. In this work, we present Needle, a method for de novo motif discovery that works by post-processing the output of a combinatorial motif finder, using graph analysis techniques. The method is based on the identification of highly connected modules in the graph that is obtained by connecting the nodes that correspond to motifs if these motifs are co-located in the sequences under analysis. We have …
YEASTRACT: providing a programmatic access to curated transcriptional regulatory associations in Saccharomyces cerevisiae through a web services interface D Abdulrehman, PT Monteiro, MC Teixeira, NP Mira, AB Lourenco, ...
2010 Nucleic acids research 39 (suppl_1), D136-D140
The YEAst Search for Transcriptional Regulators And Consensus Tracking (YEASTRACT) information system (http://www.yeastract.com) was developed to support the analysis of transcription regulatory associations in Saccharomyces cerevisiae. Last updated in June 2010, this database contains over 48 200 regulatory associations between transcription factors (TFs) and target genes, including 298 specific DNA-binding sites for 110 characterized TFs. All regulatory associations stored in the database were revisited, and detailed information on the experimental evidence that sustains those associations was added and classified as direct or indirect. The inclusion of this new data, gathered in response to the requests of YEASTRACT users, allows users to restrict their queries to subsets of the data based on the existence or not of experimental evidence for the direct action of the TFs in the promoter …
Apt-pbo: solving the software dependency problem using pseudo-boolean optimization P Trezentos, I Lynce, AL Oliveira
2010 Proceedings of the IEEE/ACM international conference on Automated software …
The installation of software packages depends on the correct resolution of dependencies and conflicts between packages. This problem is NP-complete and, as expected, hard to solve in practice. Moreover, today's technology still does not address this problem in an acceptable way. This paper introduces a new approach to solving the software dependency problem in a Linux environment, devising a way for solving dependencies according to available packages and user preferences. This work introduces the "apt-pbo" tool, the first publicly available tool that solves dependencies in a complete and optimal way.
Haplotype inference by Pure Parsimony: a survey A Graça, I Lynce, J Marques-Silva, AL Oliveira
2010 Journal of Computational Biology 17 (8), 969-992
Given a set of genotypes from a population, the process of recovering the haplotypes that explain the genotypes is called haplotype inference. The haplotype inference problem under the assumption of pure parsimony consists in finding the smallest number of haplotypes that explain a given set of genotypes. This problem is NP-hard. The original formulations for solving the Haplotype Inference by Pure Parsimony (HIPP) problem were based on integer linear programming and branch-and-bound techniques. More recently, solutions based on Boolean satisfiability, pseudo-Boolean optimization, and answer set programming have been shown to be remarkably more efficient. HIPP can now be regarded as a feasible approach for haplotype inference, which can be competitive with other different approaches. This article provides an overview of the methods for solving the HIPP problem, including preprocessing …
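The pure-parsimony criterion is simple enough to state executably. Below is a brute-force sketch for toy instances, with genotypes coded 0/1/2 (2 marking heterozygous sites); the solvers surveyed above replace this exponential search with SAT, PBO or ILP encodings:

```python
from itertools import combinations, product

def explains(h1, h2, g):
    """h1/h2 explain g iff, per site: g=0 -> both 0, g=1 -> both 1, g=2 -> differ."""
    return all((a != b) if s == 2 else (a == b == s)
               for a, b, s in zip(h1, h2, g))

def compatible_haplotypes(g):
    """All haplotypes that can participate in a pair explaining g."""
    het = [i for i, s in enumerate(g) if s == 2]
    haps = []
    for bits in product((0, 1), repeat=len(het)):
        h = list(g)
        for i, b in zip(het, bits):
            h[i] = b
        haps.append(tuple(h))
    return haps

def hipp_bruteforce(genotypes):
    """Smallest haplotype set explaining every genotype (exponential search)."""
    candidates = sorted({h for g in genotypes for h in compatible_haplotypes(g)})
    for k in range(1, len(candidates) + 1):
        for subset in combinations(candidates, k):
            if all(any(explains(h1, h2, g) for h1 in subset for h2 in subset)
                   for g in genotypes):
                return subset

print(hipp_bruteforce([(2, 1, 2), (0, 1, 2), (2, 1, 0)]))
# three haplotypes suffice here: (0,1,0), (0,1,1), (1,1,0)
```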
Computational modeling and analysis of the yeast FLR1 regulatory network in mancozeb-challenged cells PT Monteiro, PJ Dias, D Ropers, AL Oliveira, I Sá-Correia, MC Teixeira, ...
2010 HAL 2010
Fully Generalized Graph Cores. AP Francisco, AL Oliveira
2010 CompleNet, 22-34
A core in a graph is usually taken as a set of highly connected vertices. Although general, this definition is intuitive and useful for studying the structure of many real networks. Nevertheless, depending on the problem, different formulations of graph core may be required, leading us to the known concept of generalized core. In this paper we study and further extend the notion of generalized core. Given a graph, we propose a definition of graph core based on a subset of its subgraphs and on a subgraph property function. Our approach generalizes several notions of graph core proposed independently in the literature, introducing a general and theoretically sound framework for the study of fully generalized graph cores. Moreover, we discuss emerging applications of graph cores, such as improved graph clustering methods and complex network motif detection.
Improved Model Checking Techniques for State Space Analysis of Gene Regulatory Networks HC Pais, KL McMillan, EM Sentovich, AT Freitas, AL Oliveira
2010 Handbook of Research on Computational Methodologies in Gene Regulatory …
A better understanding of the behavior of a cell, as a system, depends on our ability to model and understand the complex regulatory mechanisms that control gene expression. High-level, qualitative models of gene regulatory networks can be used to analyze and characterize the behavior of complex systems, and to provide important insights into their behavior. In this chapter, we describe a number of additional functionalities that, when supported by a symbolic model checker, make it possible to answer important questions about the nature of the state spaces of gene regulatory networks, such as the nature and size of attractors, and the characteristics of the basins of attraction. We illustrate the type of analysis that can be performed by applying an improved model checker to two well studied gene regulatory models, the network that controls the cell cycle in the yeast S. cerevisiae, and the network that …
Parallel and distributed compressed indexes LMS Russo, G Navarro, AL Oliveira
2010 Combinatorial Pattern Matching: 21st Annual Symposium, CPM 2010, New York …
We study parallel and distributed compressed indexes. Compressed indexes are a new and functional way to index text strings. They exploit the compressibility of the text, so that their size is a function of the compressed text size. Moreover, they support a considerable amount of functionality, more than many classical indexes. We make use of this extended functionality to obtain, in a shared-memory parallel machine, near-optimal speedups for solving several stringology problems. We also show how to distribute compressed indexes across several machines.
Mining query logs induced graphs AP Francisco, R Baeza-Yates, AL Oliveira
2010 Proceedings of Tech. Rep. INESC-ID 36
The human interaction through the web generates both implicit and explicit knowledge. An example of an implicit contribution is searching, as people contribute with their knowledge by clicking on retrieved documents. When this information is available, an important and interesting challenge is to extract relations from query logs and, in particular, semantic relations between queries and their terms. In this paper we present and discuss results on mining large query log induced graphs, namely click induced graphs, and how they contribute to query classification and to understanding user intent and interest. Our approach consists of efficiently obtaining a hierarchical clustering of such graphs, which provides a hierarchical query folksonomy. We compare it against conventional taxonomies, namely the ODP categorization. Results obtained with real data show that the hierarchical clustering and the inferred query folksonomy provide interesting insights both on semantic relations among queries and on web users' intent, being quite different from standard taxonomies.
Mining large query induced graphs towards a hierarchical query folksonomy AP Francisco, R Baeza-Yates, AL Oliveira
2010 String Processing and Information Retrieval: 17th International Symposium …
The human interaction through the web generates both implicit and explicit knowledge. An example of an implicit contribution is searching, as people contribute with their knowledge by clicking on retrieved documents. Thus, an important and interesting challenge is to extract semantic relations among queries and their terms from query logs. In this paper we present and discuss results on mining large query log induced graphs, and how they contribute to query classification and to understanding user intent and interest. Our approach consists of efficiently obtaining a hierarchical clustering of such graphs and, from it, a hierarchical query folksonomy. Results obtained with real data provide interesting insights on semantic relations among queries and are compared with conventional taxonomies, namely the ODP categorization.
A data mining approach for the detection of high-risk breast cancer groups O Anunciação, BC Gomes, S Vinga, J Gaspar, AL Oliveira, J Rueff
2010 Advances in Bioinformatics: 4th International Workshop on Practical …
It is widely agreed that complex diseases are typically caused by the joint effects of multiple instead of a single genetic variation. These genetic variations may show very little effect individually but strong effects if they occur jointly, a phenomenon known as epistasis or multilocus interaction. In this work, we explore the applicability of decision trees to this problem. A case-control study was performed, composed of 164 controls and 94 cases with 32 SNPs available from the BRCA1, BRCA2 and TP53 genes. There was also information about tobacco and alcohol consumption. We used a decision tree to find a group with high susceptibility to breast cancer. Our goal was to find one or more leaves with a high percentage of cases and a small percentage of controls. To statistically validate the association found, permutation tests were used. We found a high-risk breast cancer group composed of 13 …
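A sketch of the approach on synthetic data with scikit-learn; the planted two-SNP interaction, sample sizes and thresholds below are invented for illustration and do not reproduce the study:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(0)
X = rng.integers(0, 3, size=(258, 32))             # 258 subjects, 32 SNPs coded 0/1/2
risk = (X[:, 3] == 2) & (X[:, 17] > 0)             # planted two-SNP interaction
y = (risk | (rng.random(258) < 0.15)).astype(int)  # cases = risk group + noise

tree = DecisionTreeClassifier(max_depth=3, min_samples_leaf=10, random_state=0)
tree.fit(X, y)
print(export_text(tree, feature_names=[f"SNP{i}" for i in range(32)]))
# Leaves with a high proportion of cases flag candidate high-risk groups,
# whose significance must then be checked, e.g. with permutation tests.
```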
Refining current knowledge on the yeast FLR1 regulatory network by combined experimental and computational approaches MC Teixeira, PJ Dias, PT Monteiro, A Sala, AL Oliveira, AT Freitas, ...
2010 Molecular BioSystems 6 (12), 2471-2481
Multidrug resistance is often the result of the activation of drug efflux pumps able to catalyze the extrusion of the toxic compound to the outer medium, this activation being frequently controlled at the transcriptional level. Transcriptional regulation in the model eukaryote S. cerevisiae is the result of the interaction and cross-talk between networks of transcription factors. This is the case of the transcriptional activation of the FLR1 gene occurring in response to stress induced by the agricultural fungicide mancozeb in yeast. FLR1 up-regulation depends on the integrated action of Yap1, a key regulator of oxidative stress response, Pdr3 and Yrr1, two of the transcription factors controlling multidrug resistance, and Rpn4, a regulator of proteasome gene expression, which interplay to produce the observed transcriptional up-shift. Based on the expression profiles of FLR1, YAP1, PDR3, YRR1 and RPN4 registered during …
Time and Memory Efficient Lempel-Ziv Compression Using Suffix Arrays A Ferreira, A Oliveira, M Figueiredo
2009 arXiv preprint arXiv:0912.5449
The well-known dictionary-based algorithms of the Lempel-Ziv (LZ) 77 family are the basis of several universal lossless compression techniques. These algorithms are asymmetric regarding encoding/decoding time and memory requirements, with the former being much more demanding. In recent years, considerable attention has been devoted to the problem of finding efficient data structures to support these searches, aiming at optimizing the encoders in terms of speed and memory. Hash tables, binary search trees and suffix trees have been widely used for this purpose, as they allow fast search at the expense of memory. Some recent research has focused on suffix arrays (SA), due to their low memory requirements and linear construction algorithms. Previous work has shown how the LZ77 decomposition can be computed using a single SA or an SA with an auxiliary array with the longest common prefix information. The SA-based algorithms use less memory than the tree-based encoders, allocating the strictly necessary amount of memory, regardless of the contents of the text to search/encode. In this paper, we improve on previous work by proposing faster SA-based algorithms for LZ77 encoding and sub-string search, keeping their low memory requirements. For some compression settings, on a large set of benchmark files, our low-memory SA-based encoders are also faster than tree-based encoders. This provides time and memory efficient LZ77 encoding, being a possible replacement for trees in well-known encoders like LZMA. Our algorithm is also suited for text classification, because it provides a compact way to describe text in a …
Modeling of the Saccharomyces cerevisiae FLR1 Regulatory Network using an S-System Formalism D Calçada, S Vinga, A Oliveira
2009
This work tackled the problem of finding a mathematical model for the genetic network regulating the stress response of the yeast Saccharomyces cerevisiae to the fungicide mancozeb. This regulatory network comprises five genes, and an S-system formalism was used to model their interactions. Parameter estimation was accomplished by decoupling the resulting system of nonlinear ordinary differential equations into a larger nonlinear algebraic system, and using the Levenberg-Marquardt algorithm to fit the model's predictions to available experimental data. The introduction of constraints in the parameter estimation procedure, related to the partially known connectivity of the network, was explored. The success of the results obtained was limited, mainly due to the insufficient number of experimental points available, as well as to their poor dynamical variability.
BiGGEsTS: integrated environment for biclustering analysis of time series gene expression data JP Gonçalves, SC Madeira, AL Oliveira
2009 BMC research notes 2 (1), 1-11
The ability to monitor changes in expression patterns over time, and to observe the emergence of coherent temporal responses using expression time series, is critical to advance our understanding of complex biological processes. Biclustering has been recognized as an effective method for discovering local temporal expression patterns and unraveling potential regulatory mechanisms. The general biclustering problem is NP-hard. In the case of time series this problem is tractable, and efficient algorithms can be used. However, there is still a need for specialized applications able to take advantage of the temporal properties inherent to expression time series, both from a computational and a biological perspective. BiGGEsTS makes available state-of-the-art biclustering algorithms for analyzing expression time series. Gene Ontology (GO) annotations are used to assess the biological relevance of the biclusters. Methods for preprocessing expression time series and post-processing results are also included. The analysis is additionally supported by a visualization module capable of displaying informative representations of the data, including heatmaps, dendrograms, expression charts and graphs of enriched GO terms. BiGGEsTS is a free open source graphical software tool for revealing local coexpression of genes in specific intervals of time, while integrating meaningful information on gene annotations. It is freely available at: http://kdbio.inesc-id.pt/software/biggests . We present a case study on the discovery of transcriptional regulatory modules in the response of Saccharomyces cerevisiae to heat stress.
A polynomial time biclustering algorithm for finding approximate expression patterns in gene expression time series SC Madeira, AL Oliveira
2009 Algorithms for Molecular Biology 4 (1), 1-39
The ability to monitor the change in expression patterns over time, and to observe the emergence of coherent temporal responses using gene expression time series, obtained from microarray experiments, is critical to advance our understanding of complex biological processes. In this context, biclustering algorithms have been recognized as an important tool for the discovery of local expression patterns, which are crucial to unravel potential regulatory mechanisms. Although most formulations of the biclustering problem are NP-hard, when working with time series expression data the interesting biclusters can be restricted to those with contiguous columns. This restriction leads to a tractable problem and enables the design of efficient biclustering algorithms able to identify all maximal contiguous column coherent biclusters. In this work, we propose e-CCC-Biclustering, a biclustering algorithm that finds and reports all maximal contiguous column coherent biclusters with approximate expression patterns in time polynomial in the size of the time series gene expression matrix. This polynomial time complexity is achieved by manipulating a discretized version of the original matrix using efficient string processing techniques. We also propose extensions to deal with missing values, discover anticorrelated and scaled expression patterns, and different ways to compute the errors allowed in the expression patterns. We propose a scoring criterion combining the statistical significance of expression patterns with a similarity measure between overlapping biclusters. We present results in real data showing the effectiveness of e-CCC-Biclustering and its …
Approximate string matching with compressed indexes LMS Russo, G Navarro, AL Oliveira, P Morales
2009 Algorithms 2 (3), 1105-1136
A compressed full-text self-index for a text T is a data structure requiring reduced space and able to search for patterns P in T. It can also reproduce any substring of T, thus actually replacing T. Despite the recent explosion of interest on compressed indexes, there has not been much progress on functionalities beyond the basic exact search. In this paper we focus on indexed approximate string matching (ASM), which is of great interest, say, in bioinformatics. We study ASM algorithms for Lempel-Ziv compressed indexes and for compressed suffix trees/arrays. Most compressed self-indexes belong to one of these classes. We start by adapting the classical method of partitioning into exact search to self-indexes, and optimize it over a representative of either class of self-index. Then, we show that a Lempel-Ziv index can be seen as an extension of the classical q-samples index. We give new insights on this type of index, which can be of independent interest, and then apply them to a Lempel-Ziv index. Finally, we improve hierarchical verification, a successful technique for sequential searching, so as to extend the matches of pattern pieces to the left or right. Most compressed suffix trees/arrays support the required bidirectionality, thus enabling the implementation of the improved technique. In turn, the improved verification largely reduces the accesses to the text, which are expensive in self-indexes. We show experimentally that our algorithms are competitive and provide useful space-time tradeoffs compared to classical indexes.
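The partitioning technique reads almost directly as code. In the sketch below, str.find stands in for the exact search a self-index would provide, and the verification step is deliberately naive; it assumes the pattern is longer than k:

```python
def edit_distance(a, b):
    """Classic O(|a||b|) dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def approx_occurrences(text, pattern, k):
    """Partitioning into exact search: any match with <= k errors contains
    one of k+1 pattern pieces verbatim, so exact piece hits seed candidate
    start positions, which are then verified by edit distance."""
    m = len(pattern)
    assert m > k, "pattern must be longer than the error bound"
    step = m // (k + 1)
    occs = set()
    for p in range(k + 1):
        piece = pattern[p * step:(p + 1) * step] if p < k else pattern[p * step:]
        pos = text.find(piece)
        while pos != -1:
            # indels shift a piece by at most k positions around its offset
            for s in range(max(0, pos - p * step - k), pos - p * step + k + 1):
                if any(edit_distance(pattern, text[s:s + L]) <= k
                       for L in range(max(1, m - k), m + k + 1)):
                    occs.add(s)
            pos = text.find(piece, pos + 1)
    return sorted(occs)

print(approx_occurrences("abracadabra", "acad", 1))   # [2, 3, 4]
```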
Haplotype inference combining pedigrees and unrelated individuals A Graça, I Lynce, J Marques-Silva, A Oliveira
2009 Workshop on Constraint Based Methods for Bioinformatics (WCB 2009), 27-36
Haplotype inference is a crucial topic in genetic studies and also represents a challenging computational problem. A significant number of combinatorial approaches tackle the haplotype inference problem either for pedigrees or for unrelated individuals. This work integrates two relevant and well-known constraint based haplotyping approaches. The Minimum Recombinant Haplotyping Configuration (MRHC) problem targets the haplotyping solution which minimizes the number of recombinant events within a pedigree. MRHC only takes into consideration the family information. In contrast, the Haplotype Inference by Pure Parsimony (HIPP) problem aims at finding a solution which minimizes the number of distinct haplotypes. The HIPP approach is adequate for phasing unrelated individuals from the same population. This paper proposes a method for inferring haplotypes for individuals of the same population, although organized in different families, thus combining both MRHC and HIPP approaches. This new method can take into account family information and population information, both important in haplotype inference. Experimental results show that the proposed approach is more accurate, both in terms of switch error rate and missing error rate, than the MRHC approach (performed by the PedPhase tool), on sets of families from the same population.
On the use of suffix arrays for memory-efficient Lempel-Ziv data compression A Ferreira, A Oliveira, M Figueiredo
2009 2009 Data Compression Conference, 444-444
The Lempel-Ziv 77 (LZ77) and LZ-Storer-Szymanski (LZSS) text compression algorithms use a sliding window over the sequence of symbols, with two sub-windows: the dictionary (symbols already encoded) and the look-ahead-buffer (LAB) (symbols not yet encoded). Binary search trees and suffix trees (ST) have been used to speed up the search for the LAB over the dictionary, at the expense of high memory usage [1]. A suffix array (SA) is a simpler, more compact data structure which uses (much) less memory [2,3] to hold the same information. The SA for a length m string is an array of integers (a[1], ..., a[k], ..., a[m]) that stores the lexicographic order of suffix k of the string; sub-string searching, as used in LZ77/LZSS, is done by searching the SA.
Constant time clash detection in protein folding MMF Bugalho, AL Oliveira
2009 Journal of bioinformatics and computational biology 7 (01), 55-74
Applications for the manipulation of molecular structures are usually computationally intensive. Problems like protein docking or ab-initio protein folding need to frequently determine if two atoms in the structure collide. Therefore, an efficient algorithm for this problem, usually referred to as clash detection, can greatly improve the application efficiency. This work focuses mainly on the ab-initio protein folding problem. A naive approach for the clash problem, the one most commonly used by molecular structure programs, consists in calculating the distance between every pair of atoms. We propose an efficient data structure that uses a three-dimensional array to store the atoms' positions. We compare the proposed data structure with one of the best known general data structures for this type of problem (SAT tree) and with the naive approach. While the naive approach takes time linear in the number of atoms to verify if a new …
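A minimal sketch of the grid idea, using a dictionary of cells rather than the paper's dense three-dimensional array: with cell size equal to the clash cutoff, a query inspects at most the 27 neighbouring cells, so each check costs constant time. Class and parameter names are illustrative:

```python
from math import floor

class ClashGrid:
    """Uniform 3-D grid for constant-time clash queries."""
    def __init__(self, cutoff):
        self.cutoff = cutoff
        self.cells = {}

    def _cell(self, x, y, z):
        c = self.cutoff
        return (floor(x / c), floor(y / c), floor(z / c))

    def add(self, atom):
        self.cells.setdefault(self._cell(*atom), []).append(atom)

    def clashes(self, atom):
        # Any atom within the cutoff must lie in one of the 27 adjacent cells.
        cx, cy, cz = self._cell(*atom)
        cut2 = self.cutoff ** 2
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                for dz in (-1, 0, 1):
                    for ax, ay, az in self.cells.get((cx + dx, cy + dy, cz + dz), ()):
                        if (ax - atom[0])**2 + (ay - atom[1])**2 + (az - atom[2])**2 < cut2:
                            return True
        return False

grid = ClashGrid(cutoff=2.0)
grid.add((0.0, 0.0, 0.0))
print(grid.clashes((1.0, 1.0, 1.0)))   # True: distance ~1.73 < 2.0
print(grid.clashes((5.0, 5.0, 5.0)))   # False
```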
The regulatory network underlying the transcriptional up-regulation of the FRL1 gene in mancozeb stressed yeast cells: qualitative modeling and simulation PT Monteiro, PJ Dias, D Ropers, AL Oliveira, AT Freitas, I Sá-Correia, ...
2009 HAL 2009
On the suitability of suffix arrays for Lempel-Ziv data compression AJ Ferreira, AL Oliveira, MAT Figueiredo
2009 e-Business and Telecommunications: International Conference, ICETE 2008 …
Lossless compression algorithms of the Lempel-Ziv (LZ) family are widely used nowadays. Regarding time and memory requirements, LZ encoding is much more demanding than decoding. In order to speed up the encoding process, efficient data structures, like suffix trees, have been used. In this paper, we explore the use of suffix arrays to hold the dictionary of the LZ encoder, and propose an algorithm to search over it. We show that the resulting encoder attains roughly the same compression ratios as those based on suffix trees. However, the amount of memory required by the suffix array is fixed, and much lower than the variable amount of memory used by encoders based on suffix trees (which depends on the text to encode). We conclude that suffix arrays, when compared to suffix trees in terms of the trade-off among time, memory, and compression ratio, may be preferable in scenarios (e.g., embedded …
Exploring HTTP Content Negotiation X Wang, AL Oliveira
2009 Proceedings of the 18th International Conference on World Wide Web, WWW2009 …
In this paper, we explored a relatively underused feature of the Hypertext Transfer Protocol (HTTP): content negotiation. With a concrete use case, we illustrated how extending content negotiation with Uniform Resource Identifiers (URI) can provide us with many new and exciting possibilities, such as managed URI transparency and contract-first methodology, to improve the life of data on the Web. In addition, we discussed the conceptual issues raised by content negotiation and showed how Fred Dretske’s semantic information theory can help us formulate a concrete definition of “information resource” and a more meaningful architectural model of the Web, which, in turn, can help us clarify and probably settle many other issues, such as the new URI scheme, the fragment identifier, and the metadata issue, that have been heatedly debated in the Web. Finally, we suggested a design pattern, the AND pattern, that will …
Efficient biclustering algorithms for time series gene expression data analysis SC Madeira, AL Oliveira
2009 Distributed Computing, Artificial Intelligence, Bioinformatics, Soft …
We present a summary of a PhD thesis proposing efficient biclustering algorithms for time series gene expression data analysis, able to discover important aspects of gene regulation such as anticorrelation and time-lagged relationships, together with a scoring method based on statistical significance and similarity measures. The ability of the proposed algorithms to efficiently identify sets of genes with statistically significant and biologically meaningful expression patterns is shown to be instrumental in the discovery of relevant biological phenomena, leading to more convincing evidence of specific transcriptional regulatory mechanisms.
Clique analysis of query log graphs AP Francisco, R Baeza-Yates, AL Oliveira
2009 String Processing and Information Retrieval: 15th International Symposium …
In this paper we propose a method for the analysis of very large graphs obtained from query logs, using query coverage inspection. The goal is to extract semantic relations between queries and their terms. We take a new approach to successfully and efficiently cluster these large graphs by analyzing clique overlap and a priori induced cliques. The clustering quality is evaluated with an extension of the modularity score. Results obtained with real data show that the identified clusters can be used to infer properties of the queries and interesting semantic relations between them and their terms. The quality of the semantic relations is evaluated both using a tf-idf based score and data from the Open Directory Project. The proposed approach is also able to identify and filter out multitopical URLs, a feature that is interesting in itself.
Improved algorithm and data structures for modularity analysis of large networks AP Francisco, AL Oliveira
2008 NIPS Workshop on Analyzing Graphs
Graph clustering is an important problem in the analysis of computer networks, social networks, biological networks and many other natural and artificial networks. These networks are in general very large and, thus, finding hidden structures and functional modules is a very hard task. In this paper we propose new data structures and make available a new implementation of a well known agglomerative greedy algorithm to find community structure in large networks. The experimental results show that the improved data structures speed up the method by a large factor, for very large networks.
An analysis of the positional distribution of DNA motifs in promoter regions and its biological relevance AC Casimiro, S Vinga, AT Freitas, AL Oliveira
2008 BMC bioinformatics 9 (1), 1-13
Motif finding algorithms have steadily improved in their ability to use computationally efficient methods to detect patterns in biological sequences. However, the posterior classification of the output still suffers from some limitations, which makes it difficult to assess the biological significance of the motifs found. Previous work has highlighted the existence of positional bias of motifs in the DNA sequences, which might indicate not only that the pattern is important, but also provide hints of the positions where these patterns occur preferentially. We propose to integrate position uniformity tests and over-representation tests to improve the accuracy of the classification of motifs. Using artificial data, we have compared three different statistical tests (Chi-Square, Kolmogorov-Smirnov and a Chi-Square bootstrap) to assess whether a given motif occurs uniformly in the promoter region of a gene. Using the test that performed best on this dataset, we proceeded to study the positional distribution of several well known cis-regulatory elements, in the promoter sequences of different organisms (S. cerevisiae, H. sapiens, D. melanogaster, E. coli and several Dicotyledon plants). The results show that position conservation is relevant for the transcriptional machinery. We conclude that many biologically relevant motifs appear heterogeneously distributed in the promoter region of genes, and therefore, that non-uniformity is a good indicator of biological relevance and can be used to complement the over-representation tests commonly used. In this article we present the results obtained for the S. cerevisiae data sets.
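A small illustration of the position-uniformity idea with synthetic positions; the binning and the numbers below are assumptions for the sketch, not the paper's exact protocol:

```python
import numpy as np
from scipy.stats import chisquare

# Bin motif start positions within the promoter and test the bin counts
# against a uniform expectation.
rng = np.random.default_rng(1)
background = rng.integers(0, 1000, 300)                     # uniformly placed hits
biased = rng.normal(120, 25, 200).astype(int).clip(0, 999)  # hits clustered near 120
positions = np.concatenate([background, biased])

counts, _ = np.histogram(positions, bins=20, range=(0, 1000))
stat, p = chisquare(counts)                # H0: occurrences uniform across bins
print(f"chi2 = {stat:.1f}, p = {p:.3g}")   # tiny p-value: position is conserved
```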
Indexed Hierarchical Approximate String Matching. LMS Russo, G Navarro, AL Oliveira
2008 SPIRE, 144-154
We present a new search procedure for approximate string matching over suffix trees. We show that hierarchical verification, which is a well-established technique for on-line searching, can also be used with an indexed approach. For this, we need that the index supports bidirectionality, meaning that the search for a pattern can be updated by adding a letter at the right or at the left. This turns out to be easily supported by most compressed text self-indexes, which represent the index and the text essentially in the same space of the compressed text alone. To complete the symbiotic exchange, our hierarchical verification largely reduces the need to access the text, which is expensive in compressed text self-indexes. The resulting algorithm can, in particular, run over an existing fully compressed suffix tree, which makes it very appealing for applications in computational biology. We compare our algorithm with related …
Haplotype inference with boolean constraint solving: An overview I Lynce, A Graça, J Marques-Silva, AL Oliveira
2008 2008 20th IEEE International Conference on Tools with Artificial …
Boolean satisfiability (SAT) finds a wide range of practical applications, including Artificial Intelligence and, more recently, Bioinformatics. Although encoding some combinatorial problems using Boolean logic may not be the most intuitive solution, the efficiency of state-of-the-art SAT solvers often makes it worthwhile to consider encoding a problem to SAT. One representative application of SAT in Bioinformatics is haplotype inference. The problem of haplotype inference under the assumption of pure parsimony consists in finding the smallest number of haplotypes that explains a given set of genotypes. The original formulations for solving the problem of Haplotype Inference by Pure Parsimony (HIPP) were based on Integer Linear Programming. More recently, solutions based on SAT have been shown to be remarkably more efficient. This paper provides an overview of SAT-based approaches for solving the HIPP …
A compressed self-index using a Ziv–Lempel dictionary LMS Russo, AL Oliveira
2008 Information Retrieval 11, 359-388
A compressed full-text self-index for a text T, of size u, is a data structure used to search for patterns P, of size m, in T, that requires reduced space, i.e. space that depends on the empirical entropy (H_k or H_0) of T, and is, furthermore, able to reproduce any substring of T. In this paper we present a new compressed self-index able to locate the occurrences of P in O((m + occ) log u) time, where occ is the number of occurrences. The fundamental improvement over previous LZ78-based indexes is the reduction of the search time dependency on m from O(m²) to O(m). To achieve this result we point out the main obstacle to linear time algorithms based on LZ78 data compression and expose and explore the nature of a recurrent structure in LZ-indexes, the suffix tree. We show that our method is very competitive in practice by comparing it against …
ccrB typing tool: an online resource for staphylococci ccrB sequence typing DC Oliveira, M Santos, C Milheiriço, JA Carriço, S Vinga, AL Oliveira, ...
2008 Journal of antimicrobial chemotherapy 61 (4), 959-960
In 2006, we published a manuscript in this journal describing the allelic variation in the ccrAB locus in a representative collection of methicillin-resistant Staphylococcus aureus (MRSA), based on DNA sequencing of internal fragments of both genes. This work confirmed the very close relationships among ccrAB alleles associated with SCCmec types I-IV and VI, which was found to be independent of the MRSA lineage, geographic origin or isolation period. Moreover, particularly for the ccrB gene, SCCmec types II and IV, both defined by ccrAB allotype 2, could be discriminated. This method provides a significant improvement in ccrAB typing resolution since these SCCmec types have different epidemiological characteristics: type II is mostly found among hospital-acquired MRSA (e.g. ST5/USA100 and ST36/EMRSA-16 clones), whereas type IV is mostly found among community-acquired MRSA (e.g. ST80, ST30 and …
Identification of regulatory modules in time series gene expression data using a linear time biclustering algorithm SC Madeira, MC Teixeira, I Sa-Correia, AL Oliveira
2008 IEEE/ACM Transactions on Computational Biology and Bioinformatics 7 (1), 153-165
Although most biclustering formulations are NP-hard, in time series expression data analysis, it is reasonable to restrict the problem to the identification of maximal biclusters with contiguous columns, which correspond to coherent expression patterns shared by a group of genes in consecutive time points. This restriction leads to a tractable problem. We propose an algorithm that finds and reports all maximal contiguous column coherent biclusters in time linear in the size of the expression matrix. The linear time complexity of CCC-Biclustering relies on the use of a discretized matrix and efficient string processing techniques based on suffix trees. We also propose a method for ranking biclusters based on their statistical significance and a methodology for filtering highly overlapping and, therefore, redundant biclusters. We report results in synthetic and real data showing the effectiveness of the approach and its …
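Stripped of the suffix-tree machinery that makes it linear-time, the core idea fits in a few lines: discretize the matrix, then group genes sharing an identical symbolic pattern over contiguous column windows. A quadratic sketch with a simplified maximality filter, for illustration only; thresholds and names are assumptions:

```python
import numpy as np
from collections import defaultdict

def discretize(M, down=-0.3, up=0.3):
    """Map expression values to D/N/U (down, no-change, up) symbols."""
    return np.where(M <= down, 'D', np.where(M >= up, 'U', 'N'))

def ccc_biclusters(S, min_genes=2, min_cols=2):
    """Contiguous-column biclusters whose genes share an identical pattern.
    O(n m^2) enumeration with a simplified maximality filter; the paper's
    suffix-tree formulation reports all maximal biclusters in linear time."""
    n, m = S.shape
    found = {}
    for lo in range(m):
        for hi in range(lo + min_cols, m + 1):
            groups = defaultdict(list)
            for gene in range(n):
                groups[tuple(S[gene, lo:hi])].append(gene)
            for pattern, genes in groups.items():
                if len(genes) >= min_genes:
                    found[(tuple(genes), lo, hi)] = pattern
    return {k: v for k, v in found.items()   # drop column-contained duplicates
            if not any(k2 != k and k2[0] == k[0] and k2[1] <= k[1] and k2[2] >= k[2]
                       for k2 in found)}

M = np.array([[0.5, 0.6, -0.4, 0.0],
              [0.4, 0.7, -0.5, 0.1],
              [-0.6, 0.0, 0.5, 0.4]])
for (genes, lo, hi), pat in ccc_biclusters(discretize(M)).items():
    print(f"genes {genes}, columns {lo}..{hi - 1}: {''.join(pat)}")
```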
Using grammatical inference techniques to learn ontologies that describe the structure of domain instances AL Martins, H Sofia Pinto, AL Oliveira
2008 Applied Artificial Intelligence 22 (1-2), 139-167
Information produced by people usually has an implicit agreed-upon structure. However, this structure is not usually available to computer programs, where it could be used, for example, to aid in answering search queries. For example, when considering technical articles, one could ask for the occurrence of a keyword in a particular part of the article, such as the reference section. This implicit structure could be used, in the form of an ontology, to further the efforts of improving search in the semantic web. We propose a method to build ontologies encoding this structure information by the application of grammar inference techniques. This results in a semi-automatic approach to the inference of such ontologies. Our approach has two main components: (1) the inference of a grammatical description of the implicit structure of the supplied examples, and (2) the transformation of that description into an ontology. We …
Generic ILP vs specialized 0-1 ILP for haplotype inference A Graça, I Lynce, J Marques-Silva, A Oliveira
2008
Haplotype inference is an important and computationally challenging problem in genetics. A well-known approach to haplotype inference is pure parsimony (HIPP). Despite being based on a simple optimization criterion, HIPP is a computationally hard problem. Recent work has shown that approaches based on Boolean satisfiability, namely pseudo-Boolean optimization (PBO), are very effective at tackling the HIPP problem. Extensive work on PBO-based HIPP approaches has recently been developed. Considering that the PBO problem, also known as the 0-1 ILP problem, is a particular case of the integer linear programming (ILP) problem, generic ILP solvers can be considered. This paper compares the performance of PBO and ILP solvers on a variety of HIPP models. We conclude that specialized PBO solvers are more suitable than generic ILP solvers.
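The pure parsimony criterion itself is compact enough to state in code. The following brute-force Python sketch (exponential, for tiny instances only; it assumes the usual encoding of sites as 0/1 when homozygous and 2 when heterozygous) finds a minimum set of haplotypes explaining a set of genotypes, which is exactly the quantity the PBO and ILP models optimize:

from itertools import combinations_with_replacement, product

def explains(h1, h2, g):
    """A haplotype pair (h1, h2) explains genotype g iff every site is
    either homozygous and matched by both haplotypes, or heterozygous (2)
    with the haplotypes disagreeing."""
    return all((gi == 2 and a != b) or (gi == a == b)
               for a, b, gi in zip(h1, h2, g))

def min_haplotypes(genotypes):
    """Smallest set H of haplotypes such that every genotype is explained
    by some pair from H -- the HIPP objective, solved by brute force."""
    m = len(genotypes[0])
    universe = list(product((0, 1), repeat=m))
    for size in range(1, len(universe) + 1):
        for H in combinations_with_replacement(universe, size):
            if all(any(explains(h1, h2, g) for h1 in H for h2 in H)
                   for g in genotypes):
                return H
    return None

print(min_haplotypes([(2, 0), (0, 2), (2, 2)]))   # ((0, 0), (0, 1), (1, 0)): 3 suffice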
Ontology design principles and normalization techniques in the Web: a BioPAX use case X Wang, J Almeida, A Oliveira
2008 WWW
The open, decentralized nature of the Semantic Web demands fundamental changes in our approach to ontology development and deployment. To maximize the expressiveness and robustness of an ontology system, ontologies should be ideally designed to be orthogonal to each other in their conceptualization while physically independent in their engineering representation. Using the existing design and deployment of BioPAX ontologies as examples, this article illustrates how various design principles and engineering techniques can be used to improve the overall stability and reusability of an ontological system.
An Evaluation of the Impact of Side Chain Positioning on the Accuracy of Discrete Models of Protein Structures MMF Bugalho, AL Oliveira
2008 Advances in Bioinformatics and Computational Biology: Third Brazilian …
Discrete models are important to reduce the complexity of the protein folding problem. However, a compromise must be made between the model complexity and the accuracy of the model. Previous work by Park and Levitt has shown that the protein backbone can be modeled with good accuracy by four state discrete models. Nonetheless, for ab-initio protein folding, the side chains are important to determine if the structure is physically possible and well packed. We extend the work of Park and Levitt by taking into account the positioning of the side chain in the evaluation of the accuracy. We show that the problem becomes much harder and more dependent on the type of protein being modeled. In fact, the structure fitting method used in their work is no longer adequate for this extended version of the problem. We propose a new method to test the model accuracy. The presented …
Identification of transcription factor binding sites in promoter regions by modularity analysis of the motif co-occurrence graph AP Francisco, AL Oliveira, AT Freitas
2008 Bioinformatics Research and Applications: Fourth International Symposium …
Many algorithms have been proposed to date for the problem of finding biologically significant motifs in promoter regions. They can be classified into two large families: combinatorial methods and probabilistic methods. Probabilistic methods have been used more extensively, since their output is easier to interpret. Combinatorial methods have the potential to identify hard to detect motifs, but their output is much harder to interpret, since it may consist of hundreds or thousands of motifs. In this work, we propose a method that processes the output of combinatorial motif finders in order to find groups of motifs that represent variations of the same motif, thus reducing the output to a manageable size. This processing is done by building a graph that represents the co-occurrences of motifs, and finding communities in this graph. We show that this innovative approach leads to a method that is as easy to use as a …
Suffix Arrays: A Competitive Choice for Fast Lempel-Ziv Compressions. AJ Ferreira, AL Oliveira, MAT Figueiredo
2008 SIGMAP, 5-12
Lossless compression algorithms of the Lempel-Ziv (LZ) family are widely used in a variety of applications. The LZ encoder and decoder exhibit a high asymmetry, regarding time and memory requirements, with the former being much more demanding. Several techniques have been used to speed up the encoding process; among them is the use of suffix trees. In this paper, we explore the use of a simple data structure, named suffix array, to hold the dictionary of the LZ encoder, and propose an algorithm to search the dictionary. A comparison with the suffix tree based LZ encoder is carried out, showing that the compression ratios are roughly the same. The amount of memory required by the suffix array is fixed, being much lower than the variable memory requirements of the suffix tree encoder, which depend on the text to encode. We conclude that suffix arrays are a very interesting option regarding the trade-off between time, memory, and compression ratio, when compared with suffix trees, which makes them preferable in some compression scenarios.
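The essence of the proposed lookup can be pictured as follows: build a suffix array over the already-seen text (naively here, for brevity) and binary-search it for the longest prefix of the lookahead that occurs in it. This is our illustration of the idea, not the paper's implementation:

import bisect

def suffix_array(s):
    """Naive suffix array: start positions of all suffixes of s, sorted."""
    return sorted(range(len(s)), key=lambda i: s[i:])

def longest_match(s, sa, pattern):
    """Longest prefix of `pattern` occurring in s, found by binary search
    over the suffix array -- the dictionary lookup of an LZ encoder."""
    best = 0
    suffixes = [s[i:] for i in sa]          # materialized for clarity only
    for end in range(1, len(pattern) + 1):
        p = pattern[:end]
        j = bisect.bisect_left(suffixes, p)
        if j < len(suffixes) and suffixes[j].startswith(p):
            best = end
        else:
            break
    return pattern[:best]

text = "abracadabra"
print(longest_match(text, suffix_array(text), "abrupt"))   # 'abr'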
Ontology design principles and normalization techniques in the web X Wang, JS Almeida, AL Oliveira
2008 Data Integration in the Life Sciences: 5th International Workshop, DILS 2008 …
The fundamental issue of knowledge sharing in the web is the ability to share the ontological constraints associated with Uniform Resource Identifiers (URIs). To maximize the expressiveness and robustness of an ontological system in the web, each ontology should be ideally designed for a confined conceptual domain and deployed with minimal dependencies upon others. Through a retrospective analysis of the existing design of BioPAX ontologies, we illustrate the often encountered problems in ontology design and deployment. In this paper, we identify three design principles – minimal ontological commitment, granularity separation, and orthogonal domain – and two deployment techniques – intensionally normalized form (INF) and extensionally normalized form (ENF) – as the potential remedies for these problems.
Efficient haplotype inference with combined CP and OR techniques A Graça, J Marques-Silva, I Lynce, A Oliveira
2008
Haplotype inference has relevant biological applications, and represents a challenging computational problem. Among others, pure parsimony provides a viable modeling approach for haplotype inference and provides a simple optimization criterion. Alternative approaches have been proposed for haplotype inference by pure parsimony (HIPP), including branch and bound, integer programming and, more recently, propositional satisfiability and pseudo-Boolean optimization (PBO). Among these, the currently best performing HIPP approach is based on PBO. This paper proposes a number of effective improvements to PBO-based HIPP, including the use of lower bounding and pruning techniques effective with other approaches. The new PBO-based HIPP approach reduces by 50% the number of instances that remain unsolvable by HIPP-based approaches.
Dynamic fully-compressed suffix trees LMS Russo, G Navarro, AL Oliveira
2008 Combinatorial Pattern Matching: 19th Annual Symposium, CPM 2008, Pisa, Italy …
Suffix trees are by far the most important data structure in stringology, with myriads of applications in fields like bioinformatics, data compression and information retrieval. Classical representations of suffix trees require O(n logn) bits of space, for a string of size n. This is considerably more than the n log2 σ bits needed for the string itself, where σ is the alphabet size. The size of suffix trees has been a barrier to their wider adoption in practice. A recent so-called fully-compressed suffix tree (FCST) requires asymptotically only the space of the text entropy. FCSTs, however, have the disadvantage of being static, not supporting updates to the text. In this paper we show how to support dynamic FCSTs within the same optimal space of the static version and executing all the operations in polylogarithmic time. In particular, we are able to build the suffix tree within optimal space.
Fully-compressed suffix trees LMS Russo, G Navarro, AL Oliveira
2008 LATIN 2008: Theoretical Informatics: 8th Latin American Symposium, Búzios …
Suffix trees are by far the most important data structure in stringology, with myriads of applications in fields like bioinformatics and information retrieval. Classical representations of suffix trees require O(n log n) bits of space, for a string of size n. This is considerably more than the n log2 σ bits needed for the string itself, where σ is the alphabet size. The size of suffix trees has been a barrier to their wider adoption in practice. Recent compressed suffix tree representations require just the space of the compressed string plus Θ(n) extra bits. This is already spectacular, but still unsatisfactory when σ is small as in DNA sequences. In this paper we introduce the first compressed suffix tree representation that breaks this linear-space barrier. Our representation requires sublinear extra space and supports a large set of navigational operations in logarithmic time. An essential ingredient of our representation is the lowest …
Learning Bayesian networks consistent with the optimal branching AM Carvalho, AL Oliveira
2007 Sixth International Conference on Machine Learning and Applications (ICMLA …
We introduce a polynomial-time algorithm to learn Bayesian networks whose structure is restricted to nodes with in-degree at most k and to edges consistent with the optimal branching, which we call consistent k-graphs (CkG). The optimal branching is used as a heuristic for a primary causality order between network variables, which is subsequently refined, according to a certain score, into an optimal CkG Bayesian network. This approach augments the search space exponentially, in the number of nodes, relative to trees, while keeping a polynomial-time bound. The proposed algorithm can be applied to scores that decompose over the network structure, such as the well known LL, MDL, AIC, BIC, K2, BD, BDe, BDeu and MIT scores. We tested the proposed algorithm in a classification task. We show that the induced classifier always scores better than or the same as the Naive Bayes and Tree Augmented Naive Bayes …
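As a rough illustration of the starting point of the method, the sketch below greedily assigns each node its best-scoring parent while refusing edges that would close a cycle; this is a simplified stand-in for the optimal branching (which Edmonds' algorithm computes exactly), and the scores are invented:

def greedy_branching(nodes, edge_score):
    """Simplified stand-in for the optimal branching heuristic: give each
    node its best-scoring parent, skipping any edge that would close a
    cycle. Edmonds' algorithm handles the cyclic cases exactly; this
    greedy version only illustrates the CkG starting point."""
    parent = {v: None for v in nodes}

    def reaches(src, dst):                 # does walking up from src hit dst?
        while src is not None:
            if src == dst:
                return True
            src = parent[src]
        return False

    edges = sorted(((edge_score(u, v), u, v)
                    for u in nodes for v in nodes if u != v), reverse=True)
    for score, u, v in edges:              # best candidate edges first
        if parent[v] is None and score > 0 and not reaches(u, v):
            parent[v] = u                  # u becomes v's parent in the branching
    return parent

# toy score: made-up gain of adding edge u -> v
gain = {("A", "B"): 2.0, ("B", "C"): 1.5, ("C", "A"): 1.0, ("A", "C"): 0.5}
print(greedy_branching("ABC", lambda u, v: gain.get((u, v), 0.0)))
# {'A': None, 'B': 'A', 'C': 'B'} -- the cycle-closing edge C -> A is rejected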
YEASTRACT-DISCOVERER: new tools to improve the analysis of transcriptional regulatory associations in Saccharomyces cerevisiae PT Monteiro, ND Mendes, MC Teixeira, S d’Orey, S Tenreiro, NP Mira, ...
2007 Nucleic acids research 36 (suppl_1), D132-D136
The Yeast search for transcriptional regulators and consensus tracking (YEASTRACT) information system (www.yeastract.com) was developed to support the analysis of transcription regulatory associations in Saccharomyces cerevisiae. Last updated in September 2007, this database contains over 30 990 regulatory associations between Transcription Factors (TFs) and target genes and includes 284 specific DNA binding sites for 108 characterized TFs. Computational tools are also provided to facilitate the exploitation of the gathered data when solving a number of biological questions, in particular the ones that involve the analysis of global gene expression results. In this new release, YEASTRACT includes DISCOVERER, a set of computational tools that can be used to identify complex motifs over-represented in the promoter regions of co-regulated genes. The motifs identified are then clustered in families …
An efficient clash detection method for molecular structures applications MMF Bugalho, AL Oliveira
2007
Molecular structure applications are usually computationally intensive. Problems like protein docking, ab-initio protein folding and even some visualization software that verifies structure correctness need to constantly determine if two atoms in the structure collide. This problem is usually referred to as clash detection. If for the ab-initio and docking problems the cost of these computations determines the number of configurations that can be searched, for the visualization program the cost determines the application response time. Either way, a fast clash detection method can greatly improve the application's effectiveness. This work focuses mainly on the ab-initio protein folding problem. Several protein folding methods avoid collisions by using energy functions that consider the van der Waals forces. For other methods, which apply less expensive scoring functions, a faster clash detection algorithm may greatly increase the overall time efficiency. This will allow for more structures to be considered and thus improve the chances of finding the right structure. We propose an efficient data structure with which we can achieve constant time clash detection, independently of the size of the protein. We compare the proposed data structure with one of the best known general data structures for this type of problem (range queries) and with the naive approach. The results show that the proposed data structure surpasses the other techniques for any protein size. The proposed data structure takes near half the time of the general data structure and close to a fifth of the time of the naive approach for the larger proteins.
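A standard way to obtain constant-time clash detection, and plausibly the spirit of such a structure, is a uniform grid keyed by quantized coordinates, so that only the 27 cells around an atom need to be inspected. The sketch below is our illustration, not the paper's exact data structure:

from collections import defaultdict
from itertools import product

class ClashGrid:
    """Uniform grid (spatial hash) for constant-time clash detection with a
    fixed clash radius: two atoms can only collide if they fall in the same
    or adjacent cells."""

    def __init__(self, radius):
        self.radius = radius
        self.cells = defaultdict(list)

    def _cell(self, pos):
        return tuple(int(c // self.radius) for c in pos)

    def add(self, pos):
        self.cells[self._cell(pos)].append(pos)

    def clashes(self, pos):
        """True if any stored atom lies within `radius` of `pos`; only the
        27 neighbouring cells are inspected."""
        cx, cy, cz = self._cell(pos)
        for dx, dy, dz in product((-1, 0, 1), repeat=3):
            for q in self.cells.get((cx + dx, cy + dy, cz + dz), ()):
                if sum((a - b) ** 2 for a, b in zip(pos, q)) < self.radius ** 2:
                    return True
        return False

grid = ClashGrid(radius=2.0)
grid.add((0.0, 0.0, 0.0))
print(grid.clashes((1.0, 1.0, 1.0)))   # True: distance ~1.73 < 2.0
print(grid.clashes((5.0, 5.0, 5.0)))   # False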
Identification of cooperative mechanisms in transcription regulatory networks using non-supervised learning techniques AT Freitas, AP Ramalho, CA Oliveira, CS Nogueira, MC Teixeira, ...
2007 2nd Workshop in Data Mining in Functional Genomics and Proteomics, 79
Accurate identification of biological processes and gene regulatory mechanisms remains an open problem that needs to be addressed, since it represents a key stepping stone in our path to the goal of systems biology. In this work, we propose a new methodology for the identification of putative cooperative regulatory mechanisms in transcription regulatory networks. Our approach is based on the identification of biclusters in a binary regulation matrix, where each row corresponds to a given transcription factor, each column to a target gene and each matrix entry contains the value one if there is a documented (or potential) regulation between the transcription factor and the gene that correspond to that entry. We show that a perfectly homogeneous bicluster in this matrix identifies a set of transcription factors that jointly regulate a group of genes. Less than perfectly homogeneous biclusters may also provide new clues about still unknown regulation processes. Our methodology was tested using data from a microarray analysis of the early response of the eukaryotic model Saccharomyces cerevisiae to the widely used herbicide 2,4-dichlorophenoxyacetic acid (2,4-D). Results obtained with this validation show that the approach has the ability to uncover groups of genes and transcription factors that are jointly involved in a number of relevant biological processes. More significantly, our method has the potential to identify a number of possibly interesting new hypotheses of regulatory mechanisms that may be validated experimentally.
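The central observation is easy to restate operationally: a perfectly homogeneous bicluster in the binary regulation matrix is a set of transcription factors whose target-gene sets have a sufficiently large common intersection. An exponential toy sketch in Python (illustrative names and data):

from itertools import combinations

def joint_regulators(reg, min_tfs=2, min_genes=2):
    """Perfectly homogeneous biclusters in a binary regulation matrix: for
    every set of TFs, the genes they all regulate form the columns of an
    all-ones submatrix. Exponential sketch, usable for small TF sets only."""
    tfs = list(reg)
    out = []
    for r in range(min_tfs, len(tfs) + 1):
        for group in combinations(tfs, r):
            shared = set.intersection(*(reg[t] for t in group))
            if len(shared) >= min_genes:
                out.append((group, sorted(shared)))
    return out

# toy regulation matrix as TF -> set of target genes
reg = {"Msn2": {"HSP12", "CTT1", "TPS1"},
       "Msn4": {"HSP12", "CTT1"},
       "Yap1": {"CTT1", "TRX2"}}
print(joint_regulators(reg))   # [(('Msn2', 'Msn4'), ['CTT1', 'HSP12'])]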
Efficient generation of super condensed neighborhoods LMS Russo, AL Oliveira
2007 Journal of Discrete Algorithms 5 (3), 501-513
Indexing methods for the approximate string matching problem spend a considerable effort generating condensed neighborhoods. Condensed neighborhoods, however, are not a minimal representation of a pattern neighborhood. Super condensed neighborhoods, proposed in this work, are smaller, provably minimal and can be used to locate approximate matches that can later be extended by on-line search. We present an algorithm for generating Super Condensed Neighborhoods. The algorithm can be implemented either by using dynamic programming or non-deterministic automata. The time complexity is O(ms) for the first case and O(kms) for the second, where m is the pattern size, s is the size of the super condensed neighborhood and k the number of errors. Previous algorithms depended on the size of the condensed neighborhood instead. These algorithms can be implemented using Bit-Parallelism and …
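Taking, following the paper, the super condensed neighbourhood to be the set of words of the k-error neighbourhood none of whose proper substrings also belong to it, the exponential sketch below merely restates that definition on a toy pattern (the paper's algorithms generate the set efficiently via dynamic programming or automata instead):

from itertools import product

def edit_distance(a, b):
    """Classic dynamic programming Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[-1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def super_condensed(pattern, k, alphabet):
    """Enumerate the k-neighbourhood of `pattern` and keep only words none
    of whose proper substrings are also in the neighbourhood."""
    lo, hi = max(len(pattern) - k, 0), len(pattern) + k
    neigh = {"".join(w) for n in range(lo, hi + 1)
             for w in product(alphabet, repeat=n)
             if edit_distance("".join(w), pattern) <= k}
    def minimal(w):
        subs = {w[i:j] for i in range(len(w)) for j in range(i + 1, len(w) + 1)}
        return not any(s != w and s in neigh for s in subs)
    return sorted(w for w in neigh if minimal(w))

print(super_condensed("aba", 1, "ab"))   # ['aa', 'ab', 'ba']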
Sensitivity to SEUs evaluation using probabilistic testability analysis at RTL JM Fernandes, MB Santos, AL Oliveira, JP Teixeira, R Velazco
2007 8th Latin-American Test Workshop (LATW'07), session 9
This work presents probabilistic methods for testability analysis at RTL and their use to evaluate the sensitivity of a digital circuit to Single Event Upsets (SEUs). A new probabilistic testability metric is proposed, in order to evaluate if the possible changes caused by SEUs are ignored or propagated by the dynamic behavior of the circuit. The new metric, event detectability, is defined based on the recently proposed event observability metric combined with the controllability. A methodology for error rate prediction based on the new metric is proposed. The testability analysis methods were implemented in a tool (ASCOPA) that takes as input a Verilog RTL description, solves the Chapman-Kolmogorov equations that describe the steady-state of the circuit, and outputs the computed values for the testability metrics. The proposed metric is used to obtain the relation between the static and the dynamic SEU cross-sections for ITC'99 benchmark circuits. It is shown that the error probability for the b04 benchmark can be reduced by 50% while radiation-hardening only 1/3 of the registers.
Semi-supervised single-label text categorization using centroid-based classifiers A Cardoso-Cachopo, AL Oliveira
2007 Proceedings of the 2007 ACM symposium on Applied computing, 844-851
In this paper we study the effect of using unlabeled data in conjunction with a small portion of labeled data on the accuracy of a centroid-based classifier used to perform single-label text categorization. We chose to use centroid-based methods because they are very fast when compared with other classification methods, but still present an accuracy close to that of the state-of-the-art methods. Efficiency is particularly important for very large domains, like regular news feeds, or the web. We propose the combination of Expectation-Maximization with a centroid-based method to incorporate information about the unlabeled data during the training phase. We also propose an alternative to EM, based on the incremental update of a centroid-based method with the unlabeled documents during the training phase. We show that these approaches can greatly improve accuracy relatively to a simple centroid-based method, in …
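A hedged sketch of the incremental alternative, as we read it (our simplification in numpy, not the paper's exact procedure): start from the labeled centroids, then alternately assign unlabeled documents to the nearest centroid under cosine similarity and recompute the centroids:

import numpy as np

def semi_supervised_centroids(X_lab, y_lab, X_unlab, n_iter=10):
    """Centroid-based classifier refined with unlabeled data: alternate
    between labeling the unlabeled documents with the nearest centroid
    (cosine similarity) and recomputing the centroids over all documents."""
    classes = np.unique(y_lab)

    def unit(M):
        return M / np.linalg.norm(M, axis=1, keepdims=True)

    X_lab, X_unlab = unit(X_lab), unit(X_unlab)
    centroids = np.stack([X_lab[y_lab == c].mean(axis=0) for c in classes])
    for _ in range(n_iter):
        sim = X_unlab @ unit(centroids).T           # cosine similarities
        y_unlab = classes[sim.argmax(axis=1)]
        X_all = np.vstack([X_lab, X_unlab])
        y_all = np.concatenate([y_lab, y_unlab])
        centroids = np.stack([X_all[y_all == c].mean(axis=0) for c in classes])
    return classes, centroids

# toy usage with random stand-ins for tf-idf vectors
rng = np.random.default_rng(0)
X_lab, y_lab = rng.random((6, 5)), np.array([0, 0, 0, 1, 1, 1])
X_unlab = rng.random((20, 5))
classes, centroids = semi_supervised_centroids(X_lab, y_lab, X_unlab)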
New Generation of Linux Meta-installers P Trezentos, R Di Cosmo, S Lauriere, M Morgado, J Abecasis, ...
2007 Research Track of FOSDEM 2007
One of the points that most differentiates Linux distributions nowadays is the meta-installer. Meta-installers are software programs that run over RPM/DEB/TGZ installation systems, providing for automatic resolution of dependencies. Meta-installers like Apt-get, Smart or Yum are widely used and are starting to have more features. A framework for the new generation of meta-installers is being devised by a joint group of research and commercial partners of the EDOS project (Environment for the Development and Distribution of Open Source software)[1]. This article presents the new meta-installer features that are being developed and the ones that are still under research, such as: rollback, new dependency solving and hardware support by RPM/DEB packages. These features have in common the fact that they extend the packaging system.
Arquitectura de Computadores, dos Sistemas Digitais aos Microprocessadores G Arroz, JC Monteiro, A Oliveira
2007 Sections 5 (5.2), 5.3
This text was developed to support the teaching of introductory courses, at the university level, in the areas of digital systems and computer architecture. The approach proposed in this book provides a realistic view of processors as physical systems, which must obey numerous restrictions and trade-offs imposed by several factors. It is, for this reason, the most suitable for students in Engineering fields that require specific knowledge of processor technology, such as Electrical, Computer, Electronics, Telecommunications, Aeronautical and Mechatronics Engineering, among others. In addition to the text itself, a set of supporting materials is made available online, including an assembler and a simulator for an architecture described in the book, the Pequeno Processador Pedagógico P3, a board with an implementation of this processor connected to a number of peripherals, and various additional teaching materials.
P1303 clfB typing in Staphylococcus aureus: online database for display, clustering analysis and storage of clfB DNA sequences AR Gomes, J Carriço, S Vinga, AL Oliveira, M Zavolan, H de Lencastre
2007 International Journal of Antimicrobial Agents, S360
Bioinformatics: a new approach for the challenges of molecular biology AL Oliveira, AT Freitas, I Sá-Correia
2007 A Portrait of State-of-the-Art Research at the Technical University of …
We describe the research being undertaken by the ALGOS/KDBIO and Biological Sciences groups of Instituto Superior Técnico in the field of bioinformatics and computational biology, with emphasis on the efforts under way to develop new approaches, methods and algorithms for the determination of gene regulatory networks. We put the field in perspective by first looking at recent developments in the field of bioinformatics, and how these developments contributed to the advance of science. We then describe the approach that is being followed, based on the development of algorithms and information systems for the problems of motif detection, gene expression analysis and inference of gene regulatory networks. We conclude by pointing out possible directions for future research in the fields of systems biology and synthetic biology, two critical areas for the development of science in the coming years.
Efficient learning of Bayesian network classifiers: An extension to the TAN classifier AM Carvalho, AL Oliveira, MF Sagot
2007 AI 2007: Advances in Artificial Intelligence: 20th Australian Joint …
We introduce a Bayesian network classifier less restrictive than the Naive Bayes (NB) and Tree Augmented Naive Bayes (TAN) classifiers. Considering that learning an unrestricted network is unfeasible, the proposed classifier is confined to be consistent with the breadth-first search order of an optimal TAN. We propose an efficient algorithm to learn such classifiers for any score that decomposes over the network structure, including the well known scores based on information theory and Bayesian scoring functions. We show that the induced classifier always scores better than or the same as the NB and TAN classifiers. Experiments on modeling transcription factor binding sites show that, in many cases, the improved scores translate into increased classification accuracy.
Combining LSI with other classifiers to improve accuracy of single-label text categorization A Cardoso-Cachopo, A Oliveira
2007 First European Workshop on Latent Semantic Analysis in Technology Enhanced …
This paper describes the combination of k-NN and SVM with LSI to improve their performance in single-label text categorization tasks, and the experiments performed with six datasets to show that both k-NN-LSI (the combination of k-NN with LSI) and SVM-LSI (the combination of SVM with LSI) outperform the original methods for a significant fraction of the datasets. Overall, both combinations present an average Accuracy over the six datasets used in this work that is higher than the average Accuracy of each original method. Bearing in mind that SVM is usually considered the best performing classification method, it is particularly interesting that the combinations perform even better for some datasets.
Efficient Biclustering Algorithms for identifying transcriptional regulation relationships using time series gene expression data SC Madeira, JP Gonçalves, AL Oliveira
2007 INESC-ID Tech. Rep. 22
Biclustering algorithms have been shown to be remarkably effective in a variety of applications. Although the biclustering problem is known to be NP-complete, in the particular case of time series gene expression data analysis, efficient and complete biclustering algorithms are known and have been used to identify biologically relevant expression patterns. However, these algorithms, namely CCC-Biclustering (a linear time algorithm to identify biclusters with perfect expression patterns) and e-CCC-Biclustering (a polynomial time algorithm to identify biclusters with approximate expression patterns), do not take into account the importance of time-lagged and anti-correlation relationships in the study of transcription regulation using time series expression data. Furthermore, it is known that certain types of network motifs can generate temporal programs of expression, in which genes are activated one by one in a predefined order and can thus produce time-lagged expression patterns. In this context, we proposed extensions to these state of the art algorithms to analyze time series gene expression data in order to identify time-lagged activation together with anti-correlation while dealing directly with missing values. We present preliminary results obtained by applying the extended version of the CCC-Biclustering algorithm to the transcriptomic expression patterns occurring in Saccharomyces cerevisiae in response to heat stress.
IR-BASE: An integrated framework for the research and teaching of information retrieval technologies P Calado, A Cardoso-Cachopo, AL Oliveira
2007 First International Workshop on Teaching and Learning of Information …
Due to the rapid growth in digital storage technologies, Information Retrieval (IR) has gained a significant importance, both for academia and for the industry. However, very few proposals have been made for the creation of an environment capable of both providing useful IR services and acting as a workbench for the design, implementation, and research of new IR solutions. In this work, we propose IR-BASE, a basic object oriented framework for the integration of components, documentation and services, focused on the rapid development of prototypes for research and teaching. At its core, IR-BASE consists of a set of base classes that provide a skeleton to create new IR components, a set of guidelines to ensure that the developed components are fully interoperable, and a component pool, where IR-BASE users can find implementations of all kinds of functionalities needed to build a full IR system, and to which IR-BASE users can contribute with their own implementations. IR-BASE is of interest not only to IR researchers and professionals, who want to rapidly develop prototypes to test new ideas, but also to teachers and students, who will have an easily configurable, modifiable and extendible set of components to act as a basis for learning, building and experimenting with current IR algorithms.
Approximate string matching with Lempel-Ziv compressed indexes LMS Russo, G Navarro, AL Oliveira
2007 String Processing and Information Retrieval: 14th International Symposium …
A compressed full-text self-index for a text T is a data structure requiring reduced space and able to search for patterns P in T. Furthermore, the structure can reproduce any substring of T, thus it actually replaces T. Despite the explosion of interest in self-indexes in recent years, there has not been much progress on search functionalities beyond the basic exact search. In this paper we focus on indexed approximate string matching (ASM), which is of great interest, say, in computational biology applications. We present an ASM algorithm that works on top of a Lempel-Ziv self-index. We consider the so-called hybrid indexes, which are the best in practice for this problem. We show that a Lempel-Ziv index can be seen as an extension of the classical q-samples index. We give new insights on this type of index, which can be of independent interest, and then apply them to the Lempel-Ziv index. We show …
Efficient and tight upper bounds for haplotype inference by pure parsimony using delayed haplotype selection J Marques-Silva, I Lynce, A Graça, AL Oliveira
2007 Progress in Artificial Intelligence: 13th Portuguese Conference on …
Haplotype inference from genotype data is a key step towards a better understanding of the role played by genetic variations on inherited diseases. One of the most promising approaches uses the pure parsimony criterion. This approach is called Haplotype Inference by Pure Parsimony (HIPP) and is NP-hard as it aims at minimising the number of haplotypes required to explain a given set of genotypes. The HIPP problem is often solved using constraint satisfaction techniques, for which the upper bound on the number of required haplotypes is a key issue. Another very well-known approach is Clark’s method, which resolves genotypes by greedily selecting an explaining pair of haplotypes. In this work, we combine the basic idea of Clark’s method with a more sophisticated method for the selection of explaining haplotypes, in order to explicitly introduce a bias towards parsimonious explanations. This new …
An efficient biclustering algorithm for finding genes with similar patterns in time-series expression data SC Madeira, AL Oliveira
2007 Proceedings Of The 5th Asia-Pacific Bioinformatics Conference, 67-80
Biclustering algorithms have emerged as an important tool for the discovery of local patterns in gene expression data. For the case where the expression data corresponds to time-series, efficient algorithms that work with a discretized version of the expression matrix are known. However, these algorithms assume that the biclusters to be found are perfect, in the sense that each gene in the bicluster exhibits exactly the same expression pattern along the conditions that belong to it. In this work, we propose an algorithm that identifies genes with similar, but not necessarily equal, expression patterns, over a subset of the conditions. The results demonstrate that this approach identifies biclusters biologically more significant than those discovered by other algorithms in the literature.
Efficient haplotype inference with pseudo-Boolean optimization A Graça, J Marques-Silva, I Lynce, AL Oliveira
2007 Algebraic Biology: Second International Conference, AB 2007, Castle of …
Haplotype inference from genotype data is a key computational problem in bioinformatics, since retrieving haplotype information directly from DNA samples is not feasible using existing technology. One of the methods for solving this problem uses the pure parsimony criterion, an approach known as Haplotype Inference by Pure Parsimony (HIPP). Initial work in this area was based on a number of different Integer Linear Programming (ILP) models and branch and bound algorithms. Recent work has shown that the utilization of a Boolean Satisfiability (SAT) formulation and state of the art SAT solvers represents the most efficient approach for solving the HIPP problem. Motivated by the promising results obtained using SAT techniques, this paper investigates the utilization of modern Pseudo-Boolean Optimization (PBO) algorithms for solving the HIPP problem. The paper starts by applying PBO to existing …
MUSA: a parameter free algorithm for the identification of biologically significant motifs ND Mendes, AC Casimiro, PM Santos, I Sa-Correia, AL Oliveira, ...
2006 Bioinformatics 22 (24), 2996-3002
Motivation: The ability to identify complex motifs, i.e. non-contiguous nucleotide sequences, is a key feature of modern motif finders. Addressing this problem is extremely important, not only because these motifs can accurately model biological phenomena but because their extraction is highly dependent upon the appropriate selection of numerous search parameters. Currently available combinatorial algorithms have proved to be highly efficient in exhaustively enumerating motifs (including complex motifs), which fulfill certain extraction criteria. However, one major problem with these methods is the large number of parameters that need to be specified. Results: We propose a new algorithm, MUSA (Motif finding using an UnSupervised Approach), that can be used either to autonomously find over-represented complex motifs or to estimate search parameters for modern motif finders. This method …
DFT and probabilistic testability analysis at RTL JM Fernandes, MB Santos, AL Oliveira, JC Teixeira
2006 2006 IEEE International High Level Design Validation and Test Workshop, 41-47
This work presents probabilistic methods for testability analysis at RTL and their use to guide DFT techniques like partial-scan and TPI. Controllability is analyzed using three different approaches, an exact one, an approximated one that ignores the correlation between state variables and a third that only takes into account correlations within pre-defined groups that are formed based on an originally proposed heuristic that uses RTL information. These controllability analysis methods are evaluated using simulation-based controllability as a reference. Two observability metrics are originally defined: event observability and LSA observability. The proposed testability analysis methods were implemented in a tool that takes as input a Verilog RTL description, solves the Chapman-Kolmogorov equations that describe the steady-state of the circuit, and outputs the computed values for the testability. A methodology for …
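The steady-state computation at the core of these metrics can be illustrated in a few lines: the Chapman-Kolmogorov stationarity condition pi = pi P is solved as an eigenvector problem (toy 3-state transition matrix of our own; the tool solves the analogous equations for the RTL state variables):

import numpy as np

# Stationary distribution of a state-transition matrix P: solve pi = pi P,
# i.e. find the eigenvector of P.T associated with eigenvalue 1.
P = np.array([[0.5, 0.5, 0.0],
              [0.2, 0.5, 0.3],
              [0.0, 0.4, 0.6]])

w, v = np.linalg.eig(P.T)
pi = np.real(v[:, np.argmin(np.abs(w - 1.0))])
pi /= pi.sum()                      # normalize to a probability vector
print(pi)                           # ~[0.186, 0.465, 0.349]; controllability
                                    # metrics are derived from such probabilities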
Analysis of the Dynamic Behavior of Gene Regulatory Networks JV Caldas, AL Oliveira, AT Freitas
2006
Recent technological advancements have led to a dramatic increase in the quantity of available biological data. In this context, a general challenge nowadays is to transform the data into useful information. One of the most important problems is how to model and infer a gene regulatory network (GRN) from microarray expression data. A GRN consists of a set of mutually influencing genes. Influence is exerted by several means, such as transcriptional regulation or post-translational modifications. There are several computational approaches for modeling and inferring GRNs. In this technical report we study the following topics: modeling the yeast's cell cycle GRN and analyzing its dynamics in both wild type and mutant strains, using Boolean networks; inferring parameters in artificial GRNs, from artificial expression data, using a particular piecewise linear equations model.
Improved lower bounds for SAT-based haplotype inference J Marques-Silva, I Lynce, AL Oliveira
2006 Technical Report 17
Mutation in DNA is the principal cause for differences among human beings, and Single Nucleotide Polymorphisms (SNPs) are the most common mutations. Hence, a fundamental task is to complete a map of haplotypes (which identify SNPs) in the human population. Associated with this effort, a key computational problem is the inference of haplotype data from genotype data. Existing solutions for the haplotype inference by pure parsimony (HIPP) problem include Integer Linear Programming (ILP)[6, 1, 2] and branch and bound [13]. Recent work has shown that Boolean Satisfiability (SAT) approaches to the HIPP problem, built on top of modern SAT solvers [3], are remarkably efficient, yielding most often orders of magnitude speedups with respect to previous solutions to the HIPP problem. The objectives of this paper are twofold. The first objective is to provide a brief survey of SHIPs [10, 11], a SAT-based solution to the HIPP problem. An additional objective of the paper is to propose improvements to the SHIPs approach described in [10, 11]. One important aspect of SHIPs is the lower bounding procedure, which reduces the number of iterations of the basic algorithm, but also indirectly simplifies the resulting SAT model. The paper describes a new, much more effective, lower bounding procedure. This new lower bounding procedure is guaranteed to always be as tight as the procedure of [10, 11]. In practice, however, the new lower bounding procedure is in most cases significantly tighter, allowing relevant performance speedups in challenging problem instances.
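One classical lower-bounding idea for HIPP, sketched here purely as an illustration (not necessarily the exact procedure improved in this report), grows a clique of pairwise incompatible genotypes; since incompatible genotypes can share no haplotypes, the clique members' haplotype requirements add up:

def incompatible(g1, g2):
    """Two genotypes can share no haplotype if some site is homozygous in
    both but with opposite values (0 vs 1)."""
    return any({a, b} == {0, 1} for a, b in zip(g1, g2))

def hipp_lower_bound(genotypes):
    """Greedy clique in the incompatibility graph: each clique member needs
    2 haplotypes if it has a heterozygous site (value 2), else 1, and the
    sets are pairwise disjoint, so the counts sum to a valid lower bound."""
    clique = []
    for g in genotypes:
        if all(incompatible(g, h) for h in clique):
            clique.append(g)
    return sum(2 if 2 in g else 1 for g in clique)

gs = [(0, 0, 2), (1, 1, 2), (2, 0, 0)]
print(hipp_lower_bound(gs))   # 4: the first two genotypes are incompatible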
Empirical evaluation of centroid-based models for single-label text categorization A Cardoso-Cachopo, AL Oliveira
2006 INESC-ID Technical Report 7
Centroid-based models have been used in Text Categorization because, despite their computational simplicity, they show a robust behavior and good performance. In this paper we experimentally evaluate several centroid-based models on single-label text categorization tasks. We also analyze document length normalization and two different term weighting schemes. We show that: (1) document length normalization is not always the best option in a classification task; (2) the traditional tfidf term weighting approach remains very effective, even when compared to more recent approaches; (3) despite the fact that several ways to calculate the centroid of a class in a dataset have been proposed, there is one that always outperforms the others; (4) a computationally simple and fast centroid-based model can give results similar to the top-performing SVM model.
An efficient algorithm for the identification of structured motifs in DNA promoter sequences AM Carvalho, AT Freitas, AL Oliveira, MF Sagot
2006 IEEE/ACM Transactions on Computational Biology and Bioinformatics 3 (2), 126-140
We propose a new algorithm for identifying cis-regulatory modules in genomic sequences. The proposed algorithm, named RISO, uses a new data structure, called box-link, to store the information about conserved regions that occur in a well-ordered and regularly spaced manner in the data set sequences. This type of conserved regions, called structured motifs, is extremely relevant in the research of gene regulatory mechanisms since it can effectively represent promoter models. The complexity analysis shows a time and space gain over the best known exact algorithms that is exponential in the spacings between binding sites. A full implementation of the algorithm was developed and made available online. Experimental results show that the algorithm is much faster than existing ones, sometimes by more than four orders of magnitude. The application of the method to biological data sets shows its ability to extract …
Probabilistic testability analysis and DFT methods at RTL JM Fernandes, MB Santos, AL Oliveira, JC Teixeira
2006 2006 IEEE Design and Diagnostics of Electronic Circuits and systems, 214-215
This work presents probabilistic methods for testability analysis at RTL and their use to guide DFT techniques like partial-scan and TPI. Controllability is analyzed using an approach that takes into account correlations within pre-defined groups formed based on an originally proposed heuristic. A method for observability computation at RTL based on the Boolean difference is presented. These testability analysis methods were implemented in a tool that reads a Verilog RTL description, solves the Chapman-Kolmogorov equations that describe the steady-state of the circuit, and outputs the computed values for the testability. A methodology for partial-scan and TPI optimization is proposed and implemented. The methodology is based on the testability metrics and on a "DFT dictionary". The proposed heuristic and methodology are evaluated using the ITC99 benchmark circuits.
Assessing the efficacy of haplotype inference by pure parsimony on biological data I Lynce, JP Marques-Silva, AL Oliveira
2006
Motivation: Mutation in DNA is the principal factor responsible for the differences among human beings, and Single Nucleotide Polymorphisms (SNPs) are the most common mutations. Hence, it is fundamental to complete a map of haplotypes (which identify SNPs) in the human population. However, due to technological limitations, genotype data rather than haplotype data is usually obtained. The haplotype inference by pure parsimony (HIPP) problem consists in inferring haplotypes from genotypes such that the number of required haplotypes is minimized. Experimental results provide support for this method: the number of haplotypes in a large population is typically very small, although genotypes exhibit a great diversity. Results: We define a new method for solving the HIPP problem using Boolean satisfiability (SAT). Experimental results clearly demonstrate that our SAT-based tool is by far more efficient than the existing tools following the pure parsimony approach. Besides being more efficient, our tool is also the only one capable of computing the solution for a large number of instances. Hence, we have been able to infer haplotypes for a wide variety of biological data that no other tool following the pure parsimony approach is able to handle. Additionally, we compare the accuracy of the results obtained with our tool with the accuracy of the results obtained with tools following statistical approaches, and conclude that the pure parsimony criterion used by our methods leads to good results in terms of precision of inferred haplotypes.
Discovering modules in time-series gene expression data using biclustering SC Madeira, AL Oliveira
2006
Several non-supervised machine learning methods have been used in the analysis of gene expression data. Recently, biclustering, a non-supervised approach that performs simultaneous clustering on the row and column dimensions of the data matrix, has been shown to be remarkably effective in a variety of applications. The advantages of biclustering (when compared to clustering) in the discovery of local expression patterns has been extensively studied and documented [1]. These expression patterns can be used to identify relevant biological processes possibly involved in regulatory mechanisms. Although, in its general form, biclustering is NP-complete, in the case of time-series expression data the interesting biclusters can be restricted to those with contiguous columns leading to a tractable problem. In this context, we have recently proposed CCC-Biclustering [2], an algorithm that finds and reports all …
Dotted suffix trees: a structure for approximate text indexing LP Coelho, AL Oliveira
2006 String Processing and Information Retrieval: 13th International Conference …
In this work, the problem we address is text indexing for approximate matching. Given a text which undergoes some preprocessing to generate an index, we can later query this index to identify the places where a string occurs up to a certain number of errors k (edit distance). The indexing structure occupies, in the average case, space independent of the alphabet size. This structure can be used to report the existence of a match with k errors, and to report the occurrences in O(3^k m^(k+1) + ed) time, where m is the length of the pattern and ed the number of matching edit scripts. The construction time of the structure is bounded by a function of N, the number of nodes in the index, and |Σ|, the alphabet size.
A compressed self-index using a Ziv-Lempel dictionary LMS Russo, AL Oliveira
2006 SPIRE 4209, 163-180
A compressed full-text self-index for a text T, of size u, is a data structure used to search for patterns P, of size m, in T that requires reduced space, i.e. space that depends on the empirical entropy (Hk, H0) of T, and is, furthermore, able to reproduce any substring of T. In this paper we present a new compressed self-index able to locate the occurrences of P in O((m + occ) log n) time, where occ is the number of occurrences and σ the size of the alphabet of T. The fundamental improvement over previous LZ78-based indexes is the reduction of the search time dependency on m from O(m^2) to O(m). To achieve this result we point out the main obstacle to linear time algorithms based on LZ78 data compression and expose and explore the nature of a recurrent structure in LZ-indexes, the T78 suffix tree. We show that our method is very competitive in practice by comparing it against the LZ-Index, the FM-index and a compressed …
The YEASTRACT database: a tool for the analysis of transcription regulatory associations in Saccharomyces cerevisiae MC Teixeira, P Monteiro, P Jain, S Tenreiro, AR Fernandes, NP Mira, ...
2006 Nucleic acids research 34 (suppl_1), D446-D451
We present the YEAst Search for Transcriptional Regulators And Consensus Tracking (YEASTRACT; www.yeastract.com) database, a tool for the analysis of transcription regulatory associations in Saccharomyces cerevisiae. This database is a repository of 12 346 regulatory associations between transcription factors and target genes, based on experimental evidence which was spread throughout 861 bibliographic references. It also includes 257 specific DNA-binding sites for more than a hundred characterized transcription factors. Further information about each yeast gene included in the database was obtained from Saccharomyces Genome Database (SGD), Regulatory Sequences Analysis Tools and Gene Ontology (GO) Consortium. Computational tools are also provided to facilitate the exploitation of the gathered data when solving a number of biological questions as exemplified in the Tutorial also …
An evaluation of discretization methods for non-supervised analysis of time-series gene expression data SC Madeira, AL Oliveira
2005 INESC-ID Technical Report 42
Gene expression data has been extensively analyzed using non-supervised machine learning algorithms, with the objective of extracting potential relationships between genes. Many of these algorithms work with discretized versions of the expression data. However, the many possible methods that can be used to discretize the data have not been comprehensively studied. In this paper, we describe a number of methods that have been used to discretize gene expression data, and analyze the application of these methods to time series gene expression data. The effectiveness of the discretization methods is assessed by evaluating the biological relevance of the results obtained by a non-supervised learning algorithm based on biclustering, applied to the discretized versions of Yeast cell cycle data.
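As a concrete example of the kind of method under study, one common discretization maps the transition between consecutive time points to up/down/no-change symbols; the threshold and alphabet below are illustrative choices, not a prescription from the report:

import numpy as np

def discretize_transitions(X, alpha=0.1):
    """Discretize a time series expression matrix by comparing consecutive
    time points: U (up), D (down) or N (no change) depending on whether the
    variation exceeds the threshold alpha."""
    diff = np.diff(X, axis=1)
    symbols = np.where(diff > alpha, "U", np.where(diff < -alpha, "D", "N"))
    return ["".join(row) for row in symbols]

X = np.array([[0.1, 0.9, 0.8, 0.2],
              [0.5, 0.5, 0.9, 0.9]])
print(discretize_transitions(X))   # ['UND', 'NUN']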
Inference of regular languages using state merging algorithms with search M Bugalho, AL Oliveira
2005 Pattern Recognition 38 (9), 1457-1467
State merging algorithms have emerged as the solution of choice for the problem of inferring regular grammars from labeled samples, a known NP-complete problem of great importance in the grammatical inference area. These methods derive a small deterministic finite automaton from a set of labeled strings (the training set), by merging parts of the acceptor that corresponds to this training set. Experimental and theoretical evidence has shown that the generalization ability exhibited by the resulting automata is highly correlated with the number of states in the final solution. As originally proposed, state merging algorithms do not perform search. This means that they are fast, but also means that they are limited by the quality of the heuristics they use to select the states to be merged. Sub-optimal choices lead to automata that have many more states than needed and exhibit poor generalization ability. In this work …
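A toy sketch of the state-merging idea (our simplification: the prefix tree acceptor of the positive sample is built, and merges that would create conflicting transitions are simply rejected, whereas full algorithms such as RPNI resolve them by cascading merges):

def accepts(dfa, finals, s, start=""):
    q = start
    for ch in s:
        if (q, ch) not in dfa:
            return False          # undefined transition: not accepted
        q = dfa[(q, ch)]
    return q in finals

def consistent(dfa, finals, pos, neg):
    """Keep a hypothesis only if it accepts every positive string and
    rejects every negative one."""
    return (all(accepts(dfa, finals, s) for s in pos) and
            not any(accepts(dfa, finals, s) for s in neg))

def merge(dfa, finals, keep, drop):
    """Replace state `drop` by `keep` everywhere, or fail on conflicts."""
    new = {}
    for (q, ch), r in dfa.items():
        q2 = keep if q == drop else q
        r2 = keep if r == drop else r
        if new.setdefault((q2, ch), r2) != r2:
            return None, None     # two different targets for (q2, ch)
    return new, {keep if f == drop else f for f in finals}

def state_merging(pos, neg):
    # prefix tree acceptor states, in canonical (length, lexicographic) order
    states = sorted({s[:i] for s in pos for i in range(len(s) + 1)},
                    key=lambda s: (len(s), s))
    dfa = {(p[:-1], p[-1]): p for p in states if p}
    finals = set(pos)
    live, i = states[:], 1
    while i < len(live):
        for j in range(i):        # try merging live[i] into an earlier state
            cand, cfin = merge(dfa, finals, live[j], live[i])
            if cand is not None and consistent(cand, cfin, pos, neg):
                dfa, finals = cand, cfin
                del live[i]
                break
        else:
            i += 1
    return dfa, finals

dfa, finals = state_merging(pos=["aa", "aaaa"], neg=["a", "aaa"])
print(sorted({q for q, _ in dfa} | set(dfa.values())))
# ['', 'a', 'aa', 'aaa']: a 4-state cycle accepting an even number of a's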
YEASTRACT: a database of transcription regulatory associations in Saccharomyces cerevisiae P Monteiro, MC Teixeira, P Jain, S Tenreiro, AR Fernandes, N Mira, ...
2005 BKDB2005-Bioinformatics: Knowledge Discovery in Biology
We present the YEASTRACT (Yeast Search for Transcriptional Regulators And Consensus Tracking; www.yeastract.com) database. This database is a repository of more than 9000 regulatory associations between genes and transcription factors in Saccharomyces cerevisiae, based on more than 900 bibliographic references. It also includes the description of 242 specific DNA binding sites for 102 characterized transcription factors. Further information about each yeast gene included in the database was extracted from a number of different sources, and combined in order to make available a number of queries related to regulatory processes in yeast. All the information in YEASTRACT will be updated regularly to match the latest data from SGD, GO consortium and recent literature on yeast regulatory networks. Future releases of the database will include additional computational tools to support researchers in the process of identification of transcription regulatory associations.
Faster generation of super condensed neighbourhoods using finite automata LMS Russo, AL Oliveira
2005 String Processing and Information Retrieval: 12th International Conference …
We present a new algorithm for generating super condensed neighbourhoods. Super condensed neighbourhoods have recently been presented as the minimal set of words that represent a pattern neighbourhood. These sets play an important role in the generation phase of hybrid algorithms for indexed approximate string matching. An existing algorithm for this purpose is based on a dynamic programming approach, implemented using bit-parallelism. In this work we present a bit-parallel algorithm based on automata which is faster, conceptually much simpler and uses less memory than the existing method.
An efficient algorithm for generating super condensed neighborhoods LMS Russo, AL Oliveira
2005 Combinatorial Pattern Matching: 16th Annual Symposium, CPM 2005, Jeju Island …
Indexing methods for the approximate string matching problem spend a considerable effort generating condensed neighborhoods. Here, we point out that condensed neighborhoods are not a minimal representation of a pattern neighborhood. We show that we can restrict our attention to super condensed neighborhoods which are minimal. We then present an algorithm for generating Super Condensed Neighborhoods. The algorithm runs in O(m ⌈m/w⌉ s), where m is the pattern size, s is the size of the super condensed neighborhood and w the size of the processor word. Previous algorithms took O(m ⌈m/w⌉ c) time, where c is the size of the condensed neighborhood. We further improve this algorithm by using Bit-Parallelism and Increased Bit-Parallelism techniques. Our experimental results show that the resulting algorithm is very fast.
Using a more powerful teacher to reduce the number of queries of the L* algorithm in practical applications AL Martins, HS Pinto, AL Oliveira
2005 Progress in Artificial Intelligence: 12th Portuguese Conference on …
In this work we propose to use a more powerful teacher to effectively apply query learning algorithms to identify regular languages in practical, real-world problems. More specifically, we define a more powerful set of replies to the membership queries posed by the L* algorithm that reduces the number of such queries by several orders of magnitude in a practical application. The basic idea is to avoid the needless repetition of membership queries in cases where the reply will be negative as long as a particular condition is met by the string in the membership query. We present an example of the application of this method to a real problem, that of inferring a grammar for the structure of technical articles.
Constraint relaxations for discovering unknown sequential patterns C Antunes, AL Oliveira
2005 Knowledge Discovery in Inductive Databases: Third International Workshop …
The main drawbacks of sequential pattern mining have been its lack of focus on user expectations and the high number of discovered patterns. However, the commonly accepted solution, the use of constraints, turns the mining process into a verification of which of the specified patterns are frequent, instead of the discovery of unknown and unexpected patterns. In this paper, we propose a new methodology to mine sequential patterns, keeping the focus on user expectations, without compromising the discovery of unknown patterns. Our methodology is based on the use of constraint relaxations, and it consists in using them to filter accepted patterns during the mining process. We propose a hierarchy of relaxations, applied to constraints expressed as context-free languages, classifying the existing relaxations (legal, valid and naïve, previously proposed), and proposing several …
A linear time biclustering algorithm for time series gene expression data SC Madeira, AL Oliveira
2005 Algorithms in Bioinformatics: 5th International Workshop, WABI 2005 …
Several non-supervised machine learning methods have been used in the analysis of gene expression data obtained from microarray experiments. Recently, biclustering, a non-supervised approach that performs simultaneous clustering on the row and column dimensions of the data matrix, has been shown to be remarkably effective in a variety of applications. The goal of biclustering is to find subgroups of genes and subgroups of conditions, where the genes exhibit highly correlated behaviors. In the most common settings, biclustering is an NP-complete problem, and heuristic approaches are used to obtain sub-optimal solutions using reasonable computational resources. In this work, we examine a particular setting of the problem, where we are concerned with finding biclusters in time series expression data. In this context, we are interested in finding biclusters with consecutive columns. For …
A highly scalable algorithm for the extraction of cis-regulatory regions AM Carvalho, AT Freitas, AL Oliveira, MF Sagot
2005 Proceedings Of The 3rd Asia-Pacific Bioinformatics Conference, 273-282
In this paper we propose a new algorithm for identifying cis-regulatory modules in genomic sequences. In particular, the algorithm extracts structured motifs, defined as a collection of highly conserved regions with pre-specified sizes and spacings between them. This type of motifs is extremely relevant in the research of gene regulatory mechanisms since it can effectively represent promoter models. The proposed algorithm uses a new data structure, called box-link, to store the information about conserved regions that occur in a well-ordered and regularly spaced manner in the dataset sequences. The complexity analysis shows a time and space gain over previous algorithms that is exponential in the spacings between binding sites. Experimental results show that the algorithm is much faster than existing ones, sometimes by more than two orders of magnitude. The application of the method to biological datasets …
Probabilistic and Simulation-Based Masked BIST Implementation F Guerreiro, JM Fernandes, MB Santos, AL Oliveira, IM Teixeira, ...
2004 Proc. Design of Circuits and Integrated Systems Conference (DCIS)
Among the high-quality BIST solutions for digital systems (reseeding, bit flipping, etc.) the mask-based (m-BIST) solution has been proposed, leading to the identification (at RTL) of hard functionality, accessibility problems and functional test generation which may be reused for production test. Mask generation is not a trivial problem for sequential modules. In this paper, a novel, hybrid approach for m-BIST is proposed, combining simulation and probabilistic (S-based and P-based) techniques for automatic mask generation. Both controllability and observability metrics are computed. Two proprietary tools, VeriDOS and Ascopa, are introduced for m-BIST implementation. A limited subset of feedback registers is identified at RTL for partial scan BIST solutions. An ITC'99 benchmark circuit (b13) is used as a test vehicle. Results are finally ascertained by fault simulation at the structural level, leading to high single line stuck-at fault coverage results.
Sequential pattern mining algorithms: trade-offs between speed and memory C Antunes, A Oliveira
2004 MGTS
Increased application of structured pattern mining requires a perfect understanding of the problem and a clear identification of the advantages and disadvantages of existing algorithms. Among those algorithms, pattern-growth methods have been shown to have the best performance when applied to sequential pattern mining. However, their advantages over apriori-based methods are not well explained and understood. Detailed analysis of the performance and memory requirements of these algorithms shows that counting the support for each potential pattern is the most computationally demanding step. Additionally, the analysis makes clear that the main advantage of pattern-growth over apriori-based methods resides in the restriction of the search space that is obtained from the creation of projected databases. In this paper, we present this analysis and describe how apriori-based algorithms can achieve the efficiency of pattern-growth methods.
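The projected-database idea discussed here can be shown with a minimal PrefixSpan-style sketch (illustrative: single-item events and first-occurrence projection):

from collections import Counter

def prefixspan(sequences, min_support, prefix=()):
    """Minimal pattern-growth miner: count frequent items in the projected
    database, extend the prefix with each, and recurse on the projection of
    the database past the item's first occurrence. The projection is what
    restricts the search space relative to apriori-style candidate generation."""
    counts = Counter()
    for seq in sequences:
        counts.update(set(seq))            # support: one count per sequence
    patterns = []
    for item, sup in counts.items():
        if sup >= min_support:
            new_prefix = prefix + (item,)
            patterns.append((new_prefix, sup))
            projected = [seq[seq.index(item) + 1:] for seq in sequences
                         if item in seq]
            patterns += prefixspan(projected, min_support, new_prefix)
    return patterns

db = [("a", "b", "c"), ("a", "c"), ("b", "a", "c")]
print(prefixspan(db, min_support=2))
# [(('a',), 3), (('a', 'c'), 3), (('b',), 2), (('b', 'c'), 2), (('c',), 3)]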
Mining patterns using relaxations of user defined constraints C Antunes, AL Oliveira
2004 Proc. of the Workshop on Knowledge Discovery in Inductive Databases
The main drawbacks of sequential pattern mining have been its lack of focus on user expectations and the high number of discovered patterns. However, the commonly accepted solution, the use of constraints, reduces the mining process to a hypothesis-testing task. In this paper, we propose a new methodology to mine sequential patterns, keeping the focus on user expectations without compromising the discovery of unknown patterns. Our methodology is based on the use of constraint relaxations, and it consists of using them to filter accepted patterns during the mining process. We propose a hierarchy of relaxations, applied to constraints expressed as context-free languages.
Biclustering algorithms for biological data analysis: a survey SC Madeira, AL Oliveira
2004 IEEE/ACM transactions on computational biology and bioinformatics 1 (1), 24-45
A large number of clustering approaches have been proposed for the analysis of gene expression data obtained from microarray experiments. However, the results from the application of standard clustering methods to genes are limited. This limitation is imposed by the existence of a number of experimental conditions where the activity of genes is uncorrelated. A similar limitation exists when clustering of conditions is performed. For this reason, a number of algorithms that perform simultaneous clustering on the row and column dimensions of the data matrix have been proposed. The goal is to find submatrices, that is, subgroups of genes and subgroups of conditions, where the genes exhibit highly correlated activities for every condition. In this paper, we refer to this class of algorithms as biclustering. Biclustering is also referred to in the literature as coclustering and direct clustering, among other names, and has also …
Towards automatic learning of a structure ontology for technical articles AL Martins, HS Pinto, AL Oliveira
2004 Semantic Web Workshop at SIGIR 2004
Despite the high level of success attained by keyword-based information retrieval methods, a significant fraction of information retrieval tasks still needs to take into account the semantics of the data. We propose a method that combines a hand-crafted ontology with a robust inductive inference method to assign semantic labels to pieces of technical articles available on the Web. This approach, together with a query language developed for the purpose, supports queries that cannot be resolved using currently available tools. We present preliminary results that describe the precision of the assigned labels and the accuracy of the replies to the semantic queries presented to the system.
A parallel algorithm for the extraction of structured motifs AM Carvalho, AL Oliveira, AT Freitas, MF Sagot
2004 Proceedings of the 2004 ACM symposium on Applied computing, 147-153
In this work we propose a parallel algorithm for the efficient extraction of binding-site consensus from genomic sequences. This algorithm, based on an existing approach, extracts structured motifs, that consist of an ordered collection of p ≥ 1 boxes with sizes and spacings between them specified by given parameters. The contents of the boxes, which represent the extracted motifs, are unknown at the start of the process and are found by the algorithm using a suffix tree as the fundamental data structure. By partitioning the structured motif searching space we divide the most demanding part of the algorithm by a number of processors that can be loosely coupled. In this way we obtain, under conditions that are easily met, a speedup that is linear on the number of available processing units. This speedup is verified by both theoretical and experimental analysis, also presented in this paper.
Special issue on string processing and information retrieval AHF Laender, AL Oliveira
2004 Journal of Discrete Algorithms 2 (1), 1-2
Sequential pattern mining with approximated constraints C Antunes, AL Oliveira
2004 Int. Conf Applied Computing, 131-138
The lack of focus that is characteristic of unsupervised pattern mining in sequential data represents one of the major limitations of this approach. This lack of focus is due to the inherently large number of rules that is likely to be discovered in any but the most trivial sets of sequences. Several authors have promoted the use of constraints to reduce that number, but those constraints reduce the mining task to a hypothesis-testing task. In this paper, we propose the use of constraint approximations to guide the mining process, reducing the number of discovered patterns without compromising the prime goal of data mining: to discover unknown information. We show that existing algorithms that use regular languages as constraints can be used with minor adaptations. We propose a simple algorithm (ε-accepts) that verifies whether a sequence is approximately accepted by a given regular language.
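The paper's ε-accepts procedure is not reproduced here; the sketch below shows one simplified relaxation of acceptance by a regular language (substitutions only, no insertions or deletions): a DFA accepts a sequence approximately if some sequence within Hamming distance k of it is accepted.

```python
def approx_accepts(dfa, start, finals, alphabet, seq, k):
    """Return True if some sequence within Hamming distance k of `seq`
    is accepted by the DFA. dfa maps (state, symbol) -> state."""
    # best[s] = fewest substitutions needed to reach state s so far
    best = {start: 0}
    for sym in seq:
        nxt = {}
        for state, errs in best.items():
            for a in alphabet:
                cost = errs + (a != sym)  # substituting sym by a costs 1
                if cost <= k:
                    t = dfa[(state, a)]
                    if cost < nxt.get(t, k + 1):
                        nxt[t] = cost
        best = nxt
    return any(s in finals for s in best)

# DFA for the language (ab)* over {a, b}; state 2 is a dead state.
dfa = {(0, 'a'): 1, (0, 'b'): 2, (1, 'a'): 2, (1, 'b'): 0,
       (2, 'a'): 2, (2, 'b'): 2}
print(approx_accepts(dfa, 0, {0}, "ab", "aa", 1))    # True: one substitution
print(approx_accepts(dfa, 0, {0}, "ab", "aaaa", 1))  # False: needs two
```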
A probabilistic method for the computation of testability of RTL constructs JM Fernandes, MB Santos, AL Oliveira, JC Teixeira
2004 Proceedings Design, Automation and Test in Europe Conference and Exhibition …
Validation of RTL descriptions remains one of the principal bottlenecks in the circuit design process. Random simulation based methods for functional validation suffer from fundamental limitations and may be inappropriate or too expensive. In fact, for some circuits, a large number of vectors is required in order to make the circuit reach hard-to-test constructs and to obtain accurate values for their testability. In this work, we present a static, non-simulation based, method for the determination of the controllability of RTL constructs that is efficient and gives accurate feedback to designers regarding the presence of hard-to-control constructs in their RTL code. The method takes as input a Verilog RTL description, solves the Chapman-Kolmogorov equations that describe the steady-state of the circuit and outputs the computed values for the controllability of the RTL constructs. To avoid the exponential blow-up that …
Grammatical Inference: Algorithms and Applications: 5th International Colloquium, ICGI 2000, Lisbon, Portugal, September 11-13, 2000 Proceedings AL Oliveira
2004 Springer
The Fifth International Colloquium on Grammatical Inference (ICGI-2000) was held in Lisbon on September 11–13th, 2000. ICGI-2000 was the fifth in a series of successful biennial international conferences in the area of grammatical inference. Previous conferences were held in Essex, UK; Alicante, Spain; Montpellier, France; and Ames, Iowa, USA. This series of meetings seeks to provide a forum for the presentation and discussion of original research on all aspects of grammatical inference. Grammatical inference, the process of inferring grammar from given data, is a field that is not only challenging from a purely scientific standpoint but also finds many applications in real world problems.
Biclustering Algorithms for Biological Data Analysis: A Survey SC Madeira, AL Oliveira
2004 INESC-ID Tec. Rep., Jan
A large number of clustering approaches have been proposed for the analysis of gene expression data obtained from microarray experiments. However, the results of the application of standard clustering methods to genes are limited. This limitation is imposed by the existence of a number of experimental conditions where the activity of genes is uncorrelated. A similar limitation exists when clustering of conditions is performed. For this reason, a number of algorithms that perform simultaneous clustering on the row and column dimensions of the gene expression matrix have been proposed to date. This simultaneous clustering, usually designated biclustering, seeks to find sub-matrices, that is, subgroups of genes and subgroups of columns, where the genes exhibit highly correlated activities for every condition. This type of algorithm has also been proposed and used in other fields, such as information retrieval and data mining. In this comprehensive survey, we analyze a large number of existing approaches to biclustering, and classify them in accordance with the type of biclusters they can find, the patterns of biclusters that are discovered, the methods used to perform the search and the target applications.
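One widely used bicluster merit function among the approaches this survey covers is Cheng and Church's mean squared residue, which is zero for a perfectly additive submatrix; a minimal numpy sketch:

```python
import numpy as np

def mean_squared_residue(X):
    """Mean squared residue of a submatrix: 0 for a perfectly coherent
    bicluster where each entry is row effect + column effect + background."""
    row_means = X.mean(axis=1, keepdims=True)
    col_means = X.mean(axis=0, keepdims=True)
    residue = X - row_means - col_means + X.mean()
    return float((residue ** 2).mean())

coherent = np.array([[1.0, 2.0, 3.0],
                     [2.0, 3.0, 4.0]])   # additive pattern -> MSR 0
noisy = coherent + np.random.default_rng(0).normal(0, 0.5, coherent.shape)
print(mean_squared_residue(coherent), mean_squared_residue(noisy))
```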
Un modelo de recuperación de información basado en SVMs MF Maldonado, AML Oliveira, JA Sánchez
2004 e-Gnosis, 0
Classifiers such as SVMs (Support Vector Machines) have been used very efficiently for document classification, but their usefulness for information retrieval (IR), at the point of ranking documents, has not been established. In this article we propose a transformation that maps the IR process to a new vector space in which an SVM-based classifier is trained to learn the concept of similarity between documents.
A new quick point location algorithm J Poveda, M Gould, A Oliveira
2004 Conceptual Modeling for Advanced Application Domains: ER 2004 Workshops …
We present a new quick algorithm for the solution of the well-known point location problem and for the more specific problem of point-in-polygon determination. Previous approaches to this problem are presented in the first sections of this paper. In the remainder of the paper, we present a new quick location algorithm based on a quaternary partition of the space, as well as its associated cost and data structures.
Metrics for grid applicability: a distributed elliptic curve platform assessment P Trezentos, AL Oliveira
2004 Parallel Processing and Applied Mathematics: 5th International Conference …
The distribution of computational load among several nodes is an important step for problems requiring High Performance Computing (HPC). This phase is critical because bad decisions could cost time and money. The emergence of heterogeneous networks with resources available for a wide group of users as provided by Grid Computing [3] creates the need for formal processes and metrics to evaluate if the application of the problem to the Grid is feasible. In this article we introduce some auxiliary indicators for measuring the potential applicability of a parallel application to a Grid. Using the measures defined in the internal draft (GWD-I) produced by the Network Measurement WG [6] from the Global Grid Forum and RFC 2330 [7], the authors present some aggregate metrics that are useful in the characterization of the parallel applications that are well adapted to existing grids. The …
Efficient extraction of structured motifs using box-links AM Carvalho, AT Freitas, AL Oliveira, MF Sagot
2004 String Processing and Information Retrieval: 11th International Conference …
In this paper we propose a new data structure for the efficient extraction of structured motifs from DNA sequences. A structured motif is defined as a collection of highly conserved motifs with pre-specified sizes and spacings between them. The new data structure, called box-link, stores the information on how to jump over the spacings which separate each motif in a structured motif. A factor tree, a variation of a suffix tree, endowed with box-links provides the means for the efficient extraction of structured motifs.
Analog macromodeling using kernel methods J Phillips, J Afonso, A Oliveira, LM Silveira
2003 ICCAD-2003. International Conference on Computer Aided Design (IEEE Cat. No …
In this paper we explore the potential of using a general class of functional representation techniques, kernel-based regression, in the nonlinear model reduction problem. The kernel-based viewpoint provides a convenient computational framework for regression, unifying and extending the previously proposed polynomial and piecewise-linear reduction methods. Furthermore, as many familiar methods for linear system manipulation can be leveraged in a nonlinear context, kernels provide insight into how new, more powerful, nonlinear modeling strategies can be constructed. We present an SVD-like technique for automatic compression of nonlinear models that allows systematic identification of model redundancies and rigorous control of approximation error.
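A minimal kernel ridge regression sketch illustrating the kernel-based regression viewpoint (the paper's SVD-like model compression is not shown; the RBF kernel, `gamma`, and `lam` are illustrative choices):

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    """Gaussian (RBF) kernel matrix between row-stacked point sets."""
    d = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d)

def fit_kernel_ridge(X, y, lam=1e-3, gamma=1.0):
    """Kernel regression: the nonlinear map is a weighted sum of kernel
    evaluations at the training points; weights come from a linear solve."""
    K = rbf_kernel(X, X, gamma)
    alpha = np.linalg.solve(K + lam * np.eye(len(X)), y)
    return lambda Xq: rbf_kernel(Xq, X, gamma) @ alpha

# Toy static nonlinearity standing in for a device characteristic.
X = np.linspace(-2, 2, 40)[:, None]
y = np.tanh(3 * X[:, 0])
model = fit_kernel_ridge(X, y, gamma=2.0)
print(model(np.array([[0.5]])))  # close to tanh(1.5) ~ 0.905
```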
On the problem of gate assignment under different rise and fall delays AL Oliveira, R Murgai
2003 IEEE Transactions on Computer-Aided Design of Integrated Circuits and …
In most libraries, gate parameters such as the pin-to-pin intrinsic delays, load-dependent coefficients, and input pin capacitances have different values for rising and falling signals. Most performance optimization algorithms, however, assume a single value for each parameter. It is known that under the load-independent delay model, the gate assignment (or resizing) problem is solvable in time polynomial in the circuit size when a single value is assumed for each parameter (Kukimoto et al., 1998). We show that, in the presence of different rise and fall parameter values, this problem is NP-complete even for chain and tree topology circuits under the simple load-independent delay model (Murgai, 1999). However, we also show that, for tree circuits, the problem is not NP-complete in the strong sense, and we propose a dynamic programming algorithm that solves it exactly in pseudopolynomial time. More specifically …
Introdução aos sistemas digitais e microprocessadores G Arroz, J Monteiro, A Oliveira
2003 IST
The digital circuit design techniques, combinational and sequential, presented in the previous chapters allow the implementation of systems of low and medium complexity. The level of detail at which these techniques are applied is too high for them to be used in the design of large circuits. Thus, in the design of systems with more complex functionality, a higher level of abstraction is needed, so as to hide many details and make the problem manageable. This chapter describes the design of digital systems in terms of two components. One is the Processing Unit, also called the datapath, which contains all the logic that performs the actual computations, as well as the registers where data are stored. The second is the Control Unit, which determines which operations the processing unit should perform in each clock cycle. This approach assumes that processing of higher complexity generally requires several clock cycles to complete. In fact, operations above a certain level of complexity could imply a dedicated logic circuit of such size that its practical implementation would be infeasible. These operations are therefore divided into a sequence of simpler operations, which are easily implemented in hardware. The processing unit is the circuit that provides these simpler operations, and the control unit is the circuit that sequences them so as to carry out the complex operation. To allow …
Implicit resolution of the Chapman-Kolmogorov equations for sequential circuits: an application in power estimation AT Freitas, AL Oliveira
2003 2003 Design, Automation and Test in Europe Conference and Exhibition, 764-769
In this work we describe an approach that implicitly formulates and solves the Chapman-Kolmogorov equations that describe the state probabilities associated with the stationary behavior of sequential circuits. Unlike previous approaches that assumed uncorrelated input signals, we model the more general case where the sequential circuit is driven by a sequence of inputs described by a discrete time Markov chain. This Markov chain is described implicitly using a formalism that allows for a compact description of chains with an exponentially high number of states. Using this approach, we present an application in power estimation of sequential circuits that takes into account all the temporal and spatial correlations between the primary inputs and the internal signals. We present results showing that, in some cases, it is possible to solve exactly the Chapman-Kolmogorov equations for systems with more than 10^…
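For intuition, the Chapman-Kolmogorov stationary equations can be solved explicitly for a small Markov chain (the paper's contribution is doing this implicitly for chains far too large to enumerate; the transition matrix below is illustrative):

```python
import numpy as np

def stationary(P):
    """Solve the stationary equations pi P = pi, sum(pi) = 1, for an
    explicit transition matrix P whose rows sum to 1."""
    n = P.shape[0]
    A = np.vstack([P.T - np.eye(n), np.ones(n)])  # balance + normalization
    b = np.zeros(n + 1)
    b[-1] = 1.0
    pi, *_ = np.linalg.lstsq(A, b, rcond=None)
    return pi

P = np.array([[0.9, 0.1, 0.0],
              [0.0, 0.8, 0.2],
              [0.5, 0.0, 0.5]])
print(stationary(P))  # state probabilities feeding switching-activity estimates
```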
A data mining approach to credit risk evaluation and behaviour scoring SC Madeira, AL Oliveira, CS Conceiçao
2003 Progress in Artificial Intelligence: 11th Portuguese Conference on …
Behaviour scoring is used in several companies to score the customers according to credit risk by analyzing historical data about their past behaviour. In this paper we describe a data mining approach to credit risk evaluation in a Portuguese telecommunication company.
Generalization of pattern-growth methods for sequential pattern mining with gap constraints C Antunes, AL Oliveira
2003 Machine Learning and Data Mining in Pattern Recognition: Third International …
The problem of sequential pattern mining is one of several that have deserved particular attention in the general area of data mining. Despite the important developments of recent years, the best algorithm in the area (PrefixSpan) does not deal with gap constraints and consequently does not allow for the introduction of background knowledge into the process. In this paper we present a generalization of the PrefixSpan algorithm to deal with gap constraints, using a new method to generate projected databases. Studies on performance and scalability were conducted on synthetic and real-life datasets, and the respective results are presented.
An empirical comparison of text categorization methods A Cardoso-Cachopo, AL Oliveira
2003 SPIRE, 183-196
In this paper we present a comprehensive comparison of the performance of a number of text categorization methods in two different data sets. In particular, we evaluate the Vector and Latent Semantic Analysis (LSA) methods, a classifier based on Support Vector Machines (SVM) and the k-Nearest Neighbor variations of the Vector and LSA models. We report the results obtained using the Mean Reciprocal Rank as a measure of overall performance, a commonly used evaluation measure for question answering tasks. We argue that this evaluation measure is also very well suited for text categorization tasks. Our results show that overall, SVMs and k-NN LSA perform better than the other methods, in a statistically significant way.
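The Mean Reciprocal Rank measure used in this evaluation is straightforward to compute; a minimal sketch (labels below are illustrative):

```python
def mean_reciprocal_rank(ranked_labels, true_labels):
    """MRR: average reciprocal of the rank at which the correct category
    first appears in each ranked prediction list (truth assumed present)."""
    total = 0.0
    for ranking, truth in zip(ranked_labels, true_labels):
        rank = ranking.index(truth) + 1
        total += 1.0 / rank
    return total / len(true_labels)

preds = [["sports", "politics"], ["politics", "sports"], ["tech", "sports"]]
truth = ["sports", "sports", "sports"]
print(mean_reciprocal_rank(preds, truth))  # (1 + 1/2 + 1/2) / 3 = 0.667
```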
Implicit FSM decomposition applied to low-power design JC Monteiro, AL Oliveira
2002 IEEE transactions on very large scale integration (VLSI) systems 10 (5), 560-565
Clock-gating techniques are very effective in the reduction of the switching activity in sequential logic circuits. In this paper, we describe a clock-gating technique based on finite-state machine (FSM) decomposition. The approach is based on the computation of two sub-FSMs that together have the same functionality as the original FSM. For all the transitions within one sub-FSM, the clock for the other sub-FSM is disabled. To minimize the average switching activity, we search for a small cluster of states with high stationary state probability and use it to create the small sub-FSM. Explicit manipulation of the state transition graph requires time and space exponential on the number of registers in the circuit, thereby restricting the applicability of explicit methods to relatively small circuits. The approach we propose is based on a method that implicitly performs the FSM decomposition. Using this technique, the FSM …
Proceedings of the 9th International Symposium on String Processing and Information Retrieval AHF Laender, AL Oliveira
2002 Springer-Verlag
String Processing and Information Retrieval: 9th International Symposium, SPIRE 2002, Lisbon, Portugal, September 11-13, 2002 Proceedings AHF Laender
2002 Springer Science & Business Media
This volume of the Lecture Notes in Computer Science series provides a comprehensive, state-of-the-art survey of recent advances in string processing and information retrieval. It includes invited and research papers presented at the 9th International Symposium on String Processing and Information Retrieval, SPIRE 2002, held in Lisbon, Portugal. SPIRE has its origins in the South American Workshop on String Processing, which was first held in Belo Horizonte, Brazil, in 1993. Starting in 1998, the focus of the workshop was broadened to include the area of information retrieval due to its increasing relevance and its inter-relationship with the area of string processing. The call for papers for SPIRE 2002 resulted in the submission of 54 papers from researchers around the world. Of these, 19 were selected for inclusion in the program (an acceptance rate of 35%). In addition, the Program Committee decided to accept six other papers, considered as describing interesting ongoing research, in the form of short papers. The authors of these 25 papers came from 18 different countries (Argentina, Australia, Brazil, Canada, Czech Republic, Chile, Colombia, Finland, France, Germany, Japan, Italy, Mexico, Saudi Arabia, Switzerland, Spain, United Kingdom, and USA).
Using context-free grammars to constrain apriori-based algorithms for mining temporal association rules C Antunes, A Oliveira
2002 Proc. Workshop on Temporal Data Mining
Algorithms for the inference of association rules with sequential information have been proposed and used, but are ineffective in some cases because too many candidate rules are extracted, and filtering the relevant ones is usually difficult and inefficient. In this work, we present an algorithm for the inference of temporal association rules that uses context-free grammars to restrict the search process, in order to filter, in an efficient and effective way, the associations discovered by the algorithm. Moreover, we present experimental results that empirically evaluate its performance using synthetic data.
Inference of sequential association rules guided by context-free grammars CM Antunes, AL Oliveira
2002 Grammatical Inference: Algorithms and Applications: 6th International …
One of the main unresolved problems in data mining is related to the treatment of data that is inherently sequential. Algorithms for the inference of association rules that manipulate sequential data have been proposed and used to some extent, but are ineffective in some cases because too many candidate rules are extracted, and filtering the relevant ones is difficult and inefficient. In this work, we present a method and algorithm for the inference of sequential association rules that uses context-free grammars to guide the discovery process, in order to filter, in an efficient and effective way, the associations discovered by the algorithm.
Temporal Data Mining: an Overview C Antunes, AL Oliveira
2001 Proceedings of KDD Workshop on Temporal Data Mining, 1-13
Circuit partitioning techniques for power estimation using the full set of input correlations AT Freitas, AL Oliveira
2001 ICECS 2001. 8th IEEE International Conference on Electronics, Circuits and …
Exact power estimation, at logic level, is only possible if all the input correlations are taken into account. Recently, a probabilistic approach that uses a simple but powerful formalism for power estimation taking into account all the input correlations has been proposed. With this probabilistic approach it is possible to compute exactly the power dissipation of combinational modules using input statistics that would require extremely large traces if simulation based methods were to be used. However, the applicability of the method is limited to very small circuits. This paper describes a circuit partitioning technique that speeds up that method. By using partitioning techniques we avoid the computation of global BDD representations for node functions, thereby extending considerably the range of applicability of the algorithm. Moreover, the partitioning maintains the full set of correlations and, therefore, does not induce any …
Techniques for the creation of digital watermarks in sequential circuit designs AL Oliveira
2001 IEEE Transactions on Computer-Aided Design of Integrated Circuits and …
We present a methodology for the watermarking of synchronous sequential circuits that makes it possible to identify the authorship of designs by imposing a digital watermark on the state transition graph (STG) of the circuit. The methodology is applicable to sequential designs that are made available as firm intellectual property, the designation commonly used to characterize designs specified in structural hardware description languages or as circuit netlists. The watermarking is obtained by manipulating the STG of the design in such a way as to make it exhibit a chosen property that is extremely rare in nonwatermarked circuits while, at the same time, not changing the functionality of the circuit. This manipulation is performed without ever actually computing this graph in either implicit or explicit form. Instead, the digital watermark is obtained by direct manipulation of the circuit description. We present evidence that no …
Efficient algorithms for the inference of minimum size DFAs AL Oliveira, JPM Silva
2001 Machine Learning 44, 93-119
This work describes algorithms for the inference of minimum size deterministic automata consistent with a labeled training set. The algorithms presented represent the state of the art for this problem, known to be computationally very hard. In particular, we analyze the performance of algorithms that use implicit enumeration of solutions and algorithms that perform explicit search but incorporate a set of techniques known as dependency directed backtracking to prune the search tree effectively. We present empirical results that show the comparative efficiency of the methods studied and discuss alternative approaches to this problem, evaluating their advantages and drawbacks.
An exact gate assignment algorithm for tree circuits under rise and fall delays AL Oliveira, R Murgai
2000 IEEE/ACM International Conference on Computer Aided Design. ICCAD-2000. IEEE …
In most libraries, gate parameters such as the pin-to-pin intrinsic delays, load-dependent coefficients, and input pin capacitances have different values for rising and falling signals. The performance optimization algorithms, however, assume a single value for each parameter. It is known that under the load-independent delay model, the gate assignment (or resizing) problem is solvable in time polynomial in the circuit size when a single value is assumed for each parameter. In the presence of different rise and fall parameter values, this problem was recently shown to be NP-complete even for chain and tree topology circuits under the simple load-independent delay model. In this paper, we propose a dynamic programming algorithm for solving this problem exactly in pseudo-polynomial time for tree circuits. More specifically, we show that the problem can be solved in time proportional to the size of the tree circuit, the …
On the complexity of power estimation problems AT Freitas, HC Neto, AL Oliveira
2000 International Workshop on Logic Synthesis (ILWS), 239-244
Although many algorithms for power estimation have been proposed to date, no comprehensive results have been presented on the actual complexity of power estimation problems. In this paper we select a number of relevant problems and show that they belong to classes of high complexity. In particular we show that: a) for a given combinational circuit, the decision version of the peak power estimation problem is NP-complete; b) average power estimation for combinational circuits is #P-complete under the uniform input distribution; c) the decision version of the sequential power estimation problem is PSPACE-complete. We also address a number of related problems and conclude that even approximation algorithms for some of these problems are NP-hard.
FSM decomposition by direct circuit manipulation applied to low power design JC Monteiro, AL Oliveira
2000 Proceedings of the 2000 Asia and South Pacific Design Automation Conference …
Clock-gating techniques are very effective in the reduction of the switching activity in sequential logic circuits. In particular, recent work has shown that significant power reductions are possible with techniques based on finite state machine (FSM) decomposition. A serious limitation of previously proposed techniques is that they require the state transition graph (STG) of the FSM to be given or extracted from the circuit. Since the size of the STG can be exponential on the number of registers in the circuit, explicit techniques can only be applied to relatively small sequential circuits. In this paper, we present a new approach to perform FSM decomposition by direct manipulation of the circuit. This way, we do not require the STG, either explicit or implicit, thus further avoiding the limitations imposed by the use of BDDs. Therefore, this technique can be applied to circuits with very large STGs. We provide a set of experimental …
An Empirical Comparison of Text Categorization Methods A Cardoso-Cachopo, AL Oliveira
2000 Technical Report, Instituto Superior Técnico, Departamento de Engenharia Informática
In this paper we present a comprehensive comparison of the performance of a number of text categorization methods in two different data sets. In particular, we evaluate the Vector and Latent Semantic Analysis (LSA) methods, a classifier based on Support Vector Machines (SVM) and the k-Nearest Neighbor variations of the Vector and LSA models. We report the results obtained using the Mean Reciprocal Rank as a measure of overall performance, a commonly used evaluation measure for question answering tasks. We argue that this evaluation measure is also very well suited for text categorization tasks. Our results show that overall, SVMs and k-NN LSA perform better than the other methods, in a statistically significant way.
Integrating Dynamic Power Management in the Design Flow A Mota, N Ferreira, A Oliveira, J Monteiro
2000 VLSI: Systems on a Chip: IFIP TC10 WG10. 5 Tenth International Conference on …
Power dissipation has recently emerged as one of the most critical design constraints. A wide range of techniques has already been proposed for the optimization of logic circuits for low power. Power management methods are among the most effective techniques for power reduction. These methods detect periods of time during which parts of the circuit are not doing useful work and shut them down by either turning off the power supply or the clock signal. In this work, we describe the integration of dynamic power management tools in a design flow. The designer can thus easily apply these techniques to the design, evaluate the optimization achieved and decide if the changes are worthwhile to include in the final design. We have used this design flow with a real project. An HDLC controller has been designed and these power management techniques have been applied.
A new algorithm for exact reduction of incompletely specified finite state machines JM Pena, AL Oliveira
1999 IEEE Transactions on Computer-Aided Design of Integrated Circuits and …
We propose a new algorithm for the problem of state reduction in incompletely specified finite state machines. Unlike the most commonly used algorithms for this problem, our approach is not based on the enumeration of compatible sets, and, therefore, its performance is not dependent on its number. Instead, the algorithm uses techniques for finite state machine identification that are well known in the computer science literature, but have never been applied to this problem. We prove that the algorithm is exact and present results that show that, in a set of hard problems, it is much more efficient than both the explicit and implicit approaches based on the enumeration of compatible sets. We also present a complexity analysis for the special cases where worst case polynomial time bounds can be obtained and present experiments that validate empirically the bounds obtained.
Exact power estimation using word level transition probabilities AT Freitas, AL Oliveira, JC Monteiro, HC Neto
1999 Proceedings of the Ninth International Workshop on Power and Timing …
We propose a model and an algorithm to perform exact power estimation taking into account all temporal and spatial correlations of the input signals. The proposed methodology is able to accurately model temporal and spatial correlations at the logic level, with the input signal correlations being specified at the word level using a simple but effective formulation. This paper includes three independent contributions: a proposal of a simple specification language that can represent complex input correlations in a compact form; a methodology for module characterization that is independent of a specific input trace; and an extension of the model to make it applicable to RT-level power estimation by abstracting the details of the internal structure of the combinational modules. Although the complexity associated with the exact characterization of the modules restricts the applicability of the approach to medium sized circuits, it provides an important tool to validate approximate methods. We are currently researching heuristic approximations that can be used to extend the range of the applicability of the methodology while minimizing the loss in accuracy.
Low overhead encodings for reduced activity in data and address buses P Ramos, A Oliveira, RA Redol
1999 IEEE International Symposium on Signals, Circuits and Systems, 21-24
In complex VLSI systems, the power dissipated in buses represents a large fraction of the total power. This is due, on one hand, to the high capacitance of bus lines, and, on the other hand, to the increasing larger bandwidth required in high performance systems. Due to the large difference in intrinsic capacitance exhibited by external buses lines and internal nodes, the dynamic power dissipated on bus lines can represent more than half of the total power dissipated in the system. For these reasons, there have been several proposals for bus encoding techniques that aim at reducing activity on external, high capacitance lines, at the expense of some added circuit complexity.
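The best-known encoding of this kind is the bus-invert code of Stan and Burleson; a minimal sketch (an 8-bit bus is assumed for illustration):

```python
def bus_invert(words, width=8):
    """Bus-invert coding: send the complement of a word (plus an extra
    invert line) whenever more than half the bus lines would toggle."""
    prev, out = 0, []
    for w in words:
        flips = bin((prev ^ w) & ((1 << width) - 1)).count("1")
        if flips > width // 2:            # inverting reduces transitions
            w = ~w & ((1 << width) - 1)
            out.append((w, 1))            # invert line asserted
        else:
            out.append((w, 0))
        prev = w                          # bus holds what was transmitted
    return out

print(bus_invert([0x00, 0xFF, 0x0F]))  # 0xFF is sent inverted as 0x00
```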
Robust techniques for watermarking sequential circuit designs AL Oliveira
1999 Proceedings of the 36th annual ACM/IEEE design automation conference, 837-842
We present a methodology for the watermarking of synchronous sequential circuits that makes it possible to identify the authorship of designs by imposing a digital watermark on the state transition graph of the circuit. The methodology is applicable to sequential designs that are made available as firm Intellectual Property (IP), the designation commonly used to characterize designs specified as structural descriptions or circuit netlists. The watermarking is obtained by manipulating the state transition graph of the design in such a way as to make it exhibit a chosen property that is extremely rare in non-watermarked circuits, while, at the same time, not changing the functionality of the circuit. This manipulation is performed without ever actually computing this graph in either implicit or explicit form. We present both theoretical and experimental results that show that the watermarking can be created and verified efficiently.
A new algorithm for the reduction of incompletely specified finite state machines JM Pena, AL Oliveira
1998 Proceedings of the 1998 IEEE/ACM international conference on Computer-aided …
We propose a new algorithm for the problem of state reduction in incompletely specified finite state machines. This algorithm is not based on the enumeration of compatible sets, and, therefore, its performance is not dependent on the number of prime compatibles. We prove that the algorithm is exact and present results that show that, in a set of hard problems, it is much more efficient than both the explicit and implicit approaches based on the enumeration of compatible sets.
Exact minimization of binary decision diagrams using implicit techniques AL Oliveira, LP Carloni, T Villa, AL Sangiovanni-Vincentelli
1998 IEEE Transactions on Computers 47 (11), 1282-1296
This paper addresses the problem of binary decision diagram (BDD) minimization in the presence of don't care sets. Specifically, given an incompletely specified function g and a fixed ordering of the variables, we propose an exact algorithm for selecting f such that f is a cover for g and the binary decision diagram for f is of minimum size. The approach described is the only known exact algorithm for this problem not based on the enumeration of the assignments to the points in the don't care set. We show also that our problem is NP-complete. We show that the BDD minimization problem can be formulated as a binate covering problem and solved using implicit enumeration techniques. In particular, we show that the minimum-sized binary decision diagram compatible with the specification can be found by solving a problem that is very similar to the problem of reducing incompletely specified finite state machines. We …
Efficient search techniques for the inference of minimum size finite automata AL Oliveira, JPM Silva
1998 Proceedings. String Processing and Information Retrieval: A South American …
Models and algorithms for computing minimum-size prime implicants VM Manquinho, AL Oliveira, J Marques-Silva
1998 Proceedings of the International Workshop on Boolean Problems
Minimum-size prime implicants of Boolean functions find application in many areas of Computer Science including, among others, Electronic Design Automation and Artificial Intelligence. The main purpose of this paper is to describe and evaluate two fundamentally different modeling and algorithmic solutions for the computation of minimum-size prime implicants. One is based on explicit search methods, and uses Integer Linear Programming models and algorithms, whereas the other is based on implicit techniques, and so it uses Binary Decision Diagrams. For the explicit approach we propose new dedicated ILP algorithms, specifically targeted at solving this type of problem. As shown by the experimental results, other well-known ILP algorithms are in general impractical for computing minimum-size prime implicants. Moreover, we experimentally evaluate the two proposed algorithmic strategies.
Satisfiability-based algorithms for 0-1 integer programming VM Manquinho, JPM Silva, AL Oliveira, KA Sakallah
1998 Proceedings of the IEEE/ACM International Workshop on Logic Synthesis (IWLS)
In this paper we describe two Propositional Satisfiability-based algorithms for solving 0-1 integer linear programs (ILP). The algorithms are specifically targeted at ILP instances that are highly constrained, i.e., instances for which the constraints are hard to satisfy. The two algorithms are based on recent algorithms for solving instances of Propositional Satisfiability (SAT), which are also highly constrained. In particular we illustrate how the algorithms for solving ILPs can be improved with search pruning techniques commonly used in SAT algorithms. The usefulness of the proposed algorithms is illustrated on a practical application for which instances are in general highly constrained.
Power optimization of combinational modules using self-timed precomputation A Mota, J Monteiro, A Oliveira
1998 1998 IEEE International Symposium on Circuits and Systems (ISCAS) 2, 17-20
Precomputation has recently been proposed as a very effective power management technique. Precomputation works by preventing some of the inputs from being loaded into the input registers, thus significantly reducing the switching activity in the circuit. In this paper we present a self-timed approach for the precomputation of combinational logic circuits. This technique allows for maximum power savings without the need for a clock signal. However, we may incur some delay penalty. We describe how to achieve significant power reductions without increasing the maximum delay, by choosing a judicious placement of the latches in the combinational logic circuit. Experimental results are presented for arithmetic modules, confirming that power dissipation can be greatly reduced with marginal increases in circuit area and almost zero delay increase.
Using complementation and resequencing to minimize transitions R Murgai, M Fujita, A Oliveira
1998 Proceedings of the 35th annual Design Automation Conference, 694-697
Recently, in [3], the following problem was addressed: Given a set of data words or messages to be transmitted over a bus such that the sequence (order) in which they are transmitted is irrelevant, determine the optimum sequence that minimizes the total number of transitions on the bus. In 1994, Stan and Burleson [5] presented the bus-invert method as a means of encoding words for reducing I/O power, in which a word may be inverted and then transmitted if doing so reduces the number of transitions. In this paper, we combine the two paradigms into one — that of sequencing words under the bus-invert scheme for the minimum transitions, i.e., words can be complemented, reordered and then transmitted. We prove that this problem DOPI — Data Ordering Problem with Inversion — is NP-complete. We present a polynomial-time approximation algorithm to solve DOPI that comes within a factor of 1.5 from the …
Finite state machine decomposition for low power JC Monteiro, AL Oliveira
1998 Proceedings of the 35th annual Design Automation Conference, 758-763
Clock-gating techniques have been shown to be very effective in the reduction of the switching activity in sequential logic circuits. In this paper we describe a new clock-gating technique based on finite state machine (FSM) decomposition. We compute two sub-FSMs that together have the same functionality as the original FSM. For all the transitions within one sub-FSM, the clock for the other sub-FSM is disabled. To minimize the average switching activity, we search for a small cluster of states with high stationary state probability and use it to create the small sub-FSM. This way we will have a small amount of logic that is active most of the time, during which is disabling a much larger circuit, the other sub-FSM. We provide a set of experimental results that show that power consumption can be substantially reduced, in some cases up to 80%.
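A minimal sketch of the state-selection criterion (the decomposition itself, done in the paper by direct circuit manipulation, is not shown; the state names, probabilities, and threshold below are illustrative):

```python
def pick_cluster(state_probs, target=0.8):
    """Choose a small cluster of states with high cumulative stationary
    probability; the rest of the FSM can be clock-gated while execution
    stays inside the cluster."""
    ranked = sorted(state_probs.items(), key=lambda kv: -kv[1])
    cluster, mass = [], 0.0
    for state, p in ranked:
        cluster.append(state)
        mass += p
        if mass >= target:
            break
    return cluster, mass

probs = {"idle": 0.55, "rx": 0.30, "tx": 0.10, "err": 0.05}
print(pick_cluster(probs))  # (['idle', 'rx'], 0.85)
```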
Prime implicant computation using satisfiability algorithms VM Manquinho, PF Flores, JPM Silva, AL Oliveira
1997 Proceedings Ninth IEEE International Conference on Tools with Artificial …
The computation of prime implicants has several and significant applications in different areas, including automated reasoning, non-monotonic reasoning, electronic design automation, among others. The authors describe a new model and algorithm for computing minimum-size prime implicants of propositional formulas. The proposed approach is based on creating an integer linear program (ILP) formulation for computing the minimum-size prime implicant, which simplifies existing formulations. In addition, they introduce two new algorithms for solving ILPs, both of which are built on top of an algorithm for propositional satisfiability (SAT). Given the organization of the proposed SAT algorithm, the resulting ILP procedures implement powerful search pruning techniques, including a non-chronological backtracking search strategy, clause recording procedures and identification of necessary assignments. Experimental …
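For intuition, the objective being optimized can be brute-forced on tiny functions (the paper's ILP/SAT algorithms scale far beyond this illustrative enumeration; the function name is hypothetical):

```python
from itertools import product

def min_size_implicant(f, n):
    """Brute-force the implicant of f with the fewest literals.
    A term assigns each variable 0, 1, or None (variable absent); any
    minimum-literal implicant is necessarily prime."""
    best = None
    for term in product((0, 1, None), repeat=n):
        lits = sum(v is not None for v in term)
        if best is not None and lits >= best[0]:
            continue
        # term is an implicant iff every completion satisfies f
        free = [i for i, v in enumerate(term) if v is None]
        ok = True
        for bits in product((0, 1), repeat=len(free)):
            x = list(term)
            for i, b in zip(free, bits):
                x[i] = b
            if not f(x):
                ok = False
                break
        if ok:
            best = (lits, term)
    return best

# f = x0 or (x1 and x2): the minimum-size prime implicant is the literal x0.
print(min_size_implicant(lambda x: x[0] or (x[1] and x[2]), 3))
```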
Improving Satisfiability Algorithms with Dominance and Partitioning JPM Silva, AL Oliveira
1997 International Workshop on Logic Synthesis
In this paper we describe how several search pruning concepts, commonly used in algorithms for solving covering problems, can be incorporated in algorithms for propositional satisfiability (SAT). In particular, we show that the concepts of row dominance and matrix partitioning, commonly used for solving unate and binate covering problems, can be naturally applied to SAT algorithms. Experimental results, conducted on a large number of benchmarks, indicate that these techniques lead to significant performance improvements for an existing SAT algorithm.
An implicit formulation for exact BDD minimization of incompletely specified functions AL Oliveira, LP Carloni, T Villa, A Sangiovanni-Vincentelli
1997 VLSI: Integrated Systems on Silicon: IFIP TC10 WG10. 5 International …
This paper addresses the problem of binary decision diagram (BDD) minimization in the presence of don’t care sets. Specifically, given an incompletely specified function g and a fixed ordering of the variables, we propose an exact algorithm for selecting f such that f is a cover for g and the binary decision diagram for f is of minimum size. We proved that this problem is NP-complete. Here we show that the BDD minimization problem can be formulated as a binate covering problem and solved using implicit enumeration techniques similar to the ones used in the reduction of incompletely specified finite state machines.
Branch and bound algorithms for highly constrained integer programs VM Manquinho, JPM Silva, AL Oliveira, KA Sakallah
1997 Technical Report, Cadence European Laboratories, Portugal
In this paper we describe a new branch and bound algorithm for solving 0-1 integer linear programs (ILP). The algorithm is specifically targeted at ILP instances that are highly constrained, i.e., instances for which the constraints are hard to satisfy. Our approach is based on recent algorithms for solving instances of Propositional Satisfiability (SAT), which are also highly constrained. In particular we illustrate how the branch and bound algorithm for solving ILPs can be improved with search pruning techniques commonly used in SAT algorithms. The usefulness of the proposed branch and bound algorithm is illustrated on a practical application for which instances are in general highly constrained.
Using the minimum description length principle to infer reduced ordered decision graphs AL Oliveira, A Sangiovanni-Vincentelli
1996 Machine Learning 25, 23-50
We propose an algorithm for the inference of decision graphs from a set of labeled instances. In particular, we propose to infer decision graphs where the variables can only be tested in accordance with a given order and no redundant nodes exist. This type of graphs, reduced ordered decision graphs, can be used as canonical representations of Boolean functions and can be manipulated using algorithms developed for that purpose. This work proposes a local optimization algorithm that generates compact decision graphs by performing local changes in an existing graph until a minimum is reached. The algorithm uses Rissanen's minimum description length principle to control the tradeoff between accuracy in the training set and complexity of the description. Techniques for the selection of the initial decision graph and for the selection of an appropriate ordering of the variables are also presented …
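A minimal sketch of the two-part MDL tradeoff the algorithm controls (the paper's actual coding scheme for decision graphs differs; the bit counts below are illustrative):

```python
import math

def mdl_cost(model_bits, errors, n_samples):
    """Two-part MDL code length: bits to describe the hypothesis plus bits
    to point out which of the n training instances it misclassifies."""
    # log2 C(n, errors): cost of identifying the exceptions
    exceptions = (math.lgamma(n_samples + 1) - math.lgamma(errors + 1)
                  - math.lgamma(n_samples - errors + 1)) / math.log(2)
    # log2(n+1): cost of encoding the number of exceptions
    return model_bits + math.log2(n_samples + 1) + exceptions

# A larger graph with no errors vs. a smaller graph with 5 errors on 100
# samples: here the simpler, slightly inaccurate hypothesis wins.
print(mdl_cost(model_bits=120, errors=0, n_samples=100))  # ~126.7 bits
print(mdl_cost(model_bits=60, errors=5, n_samples=100))   # ~92.8 bits
```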
Exact minimization of boolean decision diagrams using implicit techniques AL Oliveira, LP Carloni, T Villa, AL Sangiovanni-Vincentelli
1996
This paper addresses the problem of binary decision diagram (BDD) minimization in the presence of don't care sets. Specifically, given an incompletely specified function g and a fixed ordering of the variables, we propose an exact algorithm for selecting f such that f is a cover for g and the binary decision diagram for f is of minimum size. The approach described is the only known exact algorithm for this problem not based on the enumeration of the assignments to the points in the don't care set. We show also that our problem is NP-complete. We show that the BDD minimization problem can be formulated as a binate covering problem and solved using implicit enumeration techniques. In particular, we show that the minimum-sized binary decision diagram compatible with the specification can be found by solving a problem that is very similar to the problem of reducing incompletely specified finite state machines. We report experiments of an implicit implementation of our algorithm, by means of which a class of interesting examples was solved exactly. We compare it with existing heuristic algorithms to measure the quality of the latter.
Limits of exact algorithms for inference of minimum size finite state machines AL Oliveira, S Edwards
1996 Algorithmic Learning Theory: 7th International Workshop, ALT'96 Sydney …
We address the problem of selecting the minimum sized finite state machine consistent with given input/output samples. The problem can be solved by computing the minimum finite state machine equivalent to a finite state machine without loops obtained from the training set. We compare the performance of four algorithms for this task: two algorithms for incompletely specified finite state machine reduction, an algorithm based on a well known explicit search procedure and an algorithm based on a new implicit search procedure that is introduced in this paper.
Inference of state machines from examples of behavior AL Oliveira, SA Edwards
1995 UCB/ERL Tech. Rep. M 95
Often, the desired behavior of a system is known before a design of the system is known. It usually falls to the designer to correctly translate this behavior into a system that exhibits it. This report describes two algorithms that design systems guaranteed to exhibit specified behavior. Specifically, our algorithms identify a state machine with the fewest states exhibiting behavior specified by a set of input/output strings. One algorithm builds the machine explicitly by fitting together transitions that correspond to input/output pairs. The other implicitly considers all machines of a given size and discards those not exhibiting the desired behavior. Although the problem is NP-complete, our algorithms behave exponentially only in the number of states in the minimum machine; other state-minimization algorithms behave exponentially in the size of the behavior specification. This advantage has allowed machines of up to fourteen states to be identified exactly within an hour.
Inferring reduced ordered decision graphs of minimum description length AL Oliveira, A Sangiovanni-Vincentelli
1995 Machine Learning Proceedings 1995, 421-429
We propose an heuristic algorithm that induces decision graphs from training sets using Rissanen's minimum description length principle to control the tradeoff between accuracy in the training set and complexity of the hypothesis description.
Inductive learning by selection of minimal complexity representations AML de Oliveira
1994 Electronics Research Laboratory, College of Engineering, University of …
The main objective of the discipline known as artificial intelligence (AI) is to make computers behave in ways that can be defined as intelligent. One of the hallmarks of intelligence is the ability to learn from past experience and to adapt behavior in accordance with this experience. Machine learning is the branch of AI that is concerned with the ability of systems to learn and adapt. In fact, this branch became so important, and the techniques developed have found applications in such a wide variety of fields, that machine learning is now a very important discipline in its own right. One of the central topics of research in machine learning is inductive inference, the study of algorithms that enable systems to learn from examples. This subject is also the central topic of this dissertation. In particular, this thesis describes algorithms for the inference of classification rules in discrete domains. The exposition made in the following sections is intended to introduce the basic concepts and definitions involved in inductive inference, but does not aim to be exhaustive or even complete. The reader is referred to [43] and [64] for comprehensive reviews of both the empirical and theoretical issues involved in inductive inference.
Learning complex boolean functions: Algorithms and applications A Oliveira, A Sangiovanni-Vincentelli
1993 Advances in Neural Information Processing Systems 6
The most commonly used neural network models are not well suited to direct digital implementations because each node needs to perform a large number of operations between floating point values. Fortunately, the ability to learn from examples and to generalize is not restricted to networks of this type. Indeed, networks where each node implements a simple Boolean function (Boolean networks) can be designed in such a way as to exhibit similar properties. Two algorithms that generate Boolean networks from examples are presented. The results show that these algorithms generalize very well in a class of problems that accept compact Boolean network descriptions. The techniques described are general and can be applied to tasks that are not known to have that characteristic. Two examples of applications are presented: image reconstruction and hand-written character recognition.
What Can Boolean Networks Learn? AL Oliveira, A Sangiovanni-Vincentelli
1992
We study the generalization abilities of networks that are composed of boolean nodes, i.e., nodes that implement only basic boolean functions: and, or, and not. The majority of the network learning algorithms proposed so far generate networks where each node implements a threshold function and are inappropriate for the generation of boolean networks from training set data. We propose an algorithm that, given a training set, generates a boolean network of small complexity that is compatible with the training set. The algorithm, inspired by techniques used in the logic synthesis community for the design of VLSI circuits, generates both the connectivity pattern and the architecture of the network. Furthermore, the resulting network can be implemented in silicon in a straightforward way. Experimental results obtained on a set of problems from the machine learning literature show that the generalization performed by boolean networks synthesized with this algorithm compares favorably with the generalization obtained by alternative learning algorithms. Some of these results and examples of the layout of networks obtained using the algorithm are presented and discussed.
Constructive induction using a non-greedy strategy for feature selection A Sangiovanni-Vincentelli, AL Oliveira
1992 Machine Learning Proceedings 1992: Proceedings of the Ninth International …
We present a method for feature construction and selection that finds a minimal set of conjunctive features that are appropriate to perform the classification task. For problems where this bias is appropriate, the method outperforms other constructive induction algorithms and is able to achieve higher classification accuracy. The application of the method in the search for minimal multi-level boolean expressions is presented and analyzed with the help of some examples.
LSAT-An Algorithm for the Synthesis of Two Level Threshold Gate Networks. AL Oliveira, AL Sangiovanni-Vincentelli
1991 ICCAD, 130-133
We present an algorithm for the synthesis of two-level threshold gate networks inspired by techniques used in classical two-level minimization of logic circuits. We specifically address a restricted version of the problem where the on-set and off-set minterms are explicitly listed. Experimental results show that a simple branch and bound algorithm can be used to obtain solutions close to the absolute minimum in a set of standard problems, outperforming other minimizers even when restricted to using only classic logic gates as building blocks.
Empirical Learning of Boolean Functions Using Two-Level Logic Synthesis AL Oliveira, A Sangiovanni-Vincentelli
1991
Most approaches to the design of networks that learn from examples do not address the architecture design problem except as a side issue. Logic synthesis techniques can be used to derive both the structure and the connection patterns for a network that matches the examples in the training set while minimizing some cost function, such as the size of the network. This cost function yields a very effective solution for the classification of examples not present in the training set. Thus, minimizing the size of the network can be considered an effective generalization principle, in accordance with the Occam's razor paradigm. In this paper we restrict our attention to the use of two-level and-or networks. An efficient algorithm for the synthesis of networks of this type is proposed, and the quality of the generalization performed by the network is evaluated against alternative approaches. The algorithm was tested on several problems taken from the machine learning literature. Results show that, in most cases, the proposed approach outperforms other popular methods, both in accuracy and in the number of examples needed for accurate generalization.
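A minimal sketch of the two-level learning flavor described above, using an assumed greedy cube-growing heuristic (illustrative, not the paper's algorithm): each cube (conjunction of literals) is grown by dropping literals while no negative example is covered, and cubes are collected until every positive example is covered; the OR of the cubes is the learned and-or network.

```python
# Minimal sketch of learning a two-level and-or network by greedy cube cover.
# A cube is a list of (variable index, required value) pairs.

def covers(cube, x):
    return all(x[i] == v for i, v in cube)

def grow_cube(pos, negatives):
    cube = list(enumerate(pos))               # fully specified minterm
    for lit in list(cube):
        trial = [l for l in cube if l != lit]
        if not any(covers(trial, n) for n in negatives):
            cube = trial                      # dropping this literal is safe
    return cube

def learn_cover(positives, negatives):
    uncovered, cover = list(positives), []
    while uncovered:
        cube = grow_cube(uncovered[0], negatives)
        cover.append(cube)
        uncovered = [p for p in uncovered if not covers(cube, p)]
    return cover

# XOR of the first two variables, third variable irrelevant:
pos = [(0, 1, 0), (1, 0, 0)]
neg = [(0, 0, 0), (1, 1, 0)]
print(learn_cover(pos, neg))   # two cubes, each a 2-literal conjunction
```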
Learning concepts by synthesizing minimal threshold gate networks AL Oliveira, A Sangiovanni-Vincentelli
1991 Machine Learning Proceedings 1991, 193-197
We propose a new methodology for the synthesis of two-level networks of threshold gates based on techniques related to the ones used in the logic synthesis of digital networks. The proposed approach starts with a large network that performs the desired mapping and reduces its size by applying transformations that preserve the functionality for all examples in the training set.
Bottom-up methodology for test preparation and refinement JA Grácio, PA Bicudo, NN Rua, AM Oliveira, CFB Almeida, JP Teixeira
1989 IEEE International Symposium on Circuits and Systems,, 949-952
A bottom-up testing methodology is used to derive realistic fault lists (at layout level), to refine gate-level test patterns, and to provide accurate test validation (at switch-level), thus making possible the evaluation of the fault coverage of the most likely circuit faults. For this purpose, the developed software tools, which are integrated in the ICD toolbox, are described, and their usefulness is ascertained through several design examples. Interesting spin-offs of the proposed methodology and directions of future work are also described.
Test preparation and fault analysis using a bottom-up methodology JA Grácio, PA Bicudo, NN Rua, AM Oliveira, CFB Almeida, JP Teixeira
1989 Proceedings of the 1st European Test Conference, 168-174 …
The testing methodology for digital VLSI circuits proposed by JP Teixeira et al.(1988) is based on the automatic definition of a realistic fault list which depends on the technology, the manufacturing process, and the IC layout. The proposed methodology is extended to include an initial stage of test preparation during the top-down design phase. In this step, an initial test-pattern generation is performed, to be used for test refinement during the bottom-up verification phase. Fault collecting is done by means of a hierarchical layout-to-fault extractor based on the physical failures most likely to occur in MOS designs. Fault list compression is performed, according to user-defined fault listing objectives, by means of a postprocessor. Test vectors derived by means of a gate-level automatic test-pattern generator are validated by using a concurrent switch-level fault simulator. The visualization of undetected faults in the layout is …
Test Preparation for MOS Digital Circuits Using Heuristics for Reliable Fault Simulation JPC Teixeira, IMC Teixeira, CFB Almeida, A Oliveira, F Goncalves
1989
The production of complex digital VLSI (very large scale integration) circuits requires a reliable and cost-effective testing procedure, allowing the release of chips with a very low defect level. For MOS (metal oxide semiconductor) VLSI circuits, test preparation, which has to be carried out by the IC (integrated circuit) designer, needs to take into account, with reasonable computer costs, the possible occurrence of faults which are not detected by test patterns derived for stuck-at-fault detection. In this paper, a bottom-up test preparation methodology is reviewed, together with the set of software tools that implement it, which allows the generation of switch-level realistic fault lists, fault analysis and accurate test pattern validation, and paves the way to testability enhancements, at layout level, by LLDFT (Layout-Level Design for Testability) techniques. Simulation results are presented, ascertaining the features of the software …
Bottom-up testing methodology for VLSI JP Teixeira, CFB Almeida, JA Grácio, PA Bicudo, AL Oliveira, N Rua
1988 Proceedings of the IEEE 1988 Custom Integrated Circuits Conference, 16.6/1 …
A testing methodology for digital VLSI circuits is proposed that is based on the definition of a realistic fault list, which depends on the technology, the manufacturing process, and the IC layout. Automatic fault listing is carried out by a hierarchical layout-to-fault extractor, LIFE. Fault-list compression is performed according to user-defined fault listing objectives. Test-pattern validation is made by an accurate switch-level fault simulator with timing information capabilities, SWIFT. The methodology and the corresponding software tools can be used in the IC design and production testing environments. At present, the two software tools are to be included in the ICD tool box, in a workstation-based IC design environment.
Session details: Graph based retrieval (IR) A Oliveira
Session chair: Arlindo Oliveira, Technical University of Lisbon and INESC-ID. In CIKM '07: Proceedings of the sixteenth ACM conference on Conference on information and knowledge management.
Additional materials and results for MUSA ND Mendes, AC Casimiro, PM Santos, I Sá-Correia, AL Oliveira, ...
We also present and discuss the results of applying the proposed method to artificially generated (synthetic) data sets. The advantage of using synthetic data sets is the ability to specify exactly which motifs we wish to plant in the data against a random background. In this way, we can safely test our method, since we control every aspect of the signal hidden in the random sequences. However, synthetic sequences are still very different from real sequences, since the generators used are not precise enough to model the actual distribution of nucleotides in promoter regions. This appendix also includes additional results of the application of our method to the data set composed of 69 putative σ54-dependent promoter sequences of Pseudomonas putida KT2440 and the well characterized σ54-dependent promoter Pu. These results are very useful for understanding the post-processing assembly of motifs presented in Table 1 of the original paper. To conclude, we present a comparison of the obtained results with competing approaches. We believe this comparison is very important for understanding the main advantages of this new method.
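A minimal sketch of the synthetic-data protocol described above (all parameters are illustrative, not those of the MUSA experiments): a motif is planted, with a controlled number of point mutations, at a random position of each uniform random background sequence.

```python
import random

# Minimal sketch: generate random background sequences and plant a motif,
# with a few point mutations, at a random position in each one.

def make_dataset(motif, n_seqs=69, seq_len=200, mutations=1, seed=0):
    rng = random.Random(seed)
    sequences = []
    for _ in range(n_seqs):
        background = [rng.choice("ACGT") for _ in range(seq_len)]
        planted = list(motif)
        for i in rng.sample(range(len(motif)), mutations):
            planted[i] = rng.choice("ACGT")       # may mutate back to itself
        pos = rng.randrange(seq_len - len(motif) + 1)
        background[pos:pos + len(motif)] = planted
        sequences.append("".join(background))
    return sequences

data = make_dataset("TTGACA")
print(len(data), len(data[0]))   # 69 sequences of 200 bases each
```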
Additional File 1 “GRISOTTO: A greedy approach to improve combinatorial algorithms for motif discovery with prior knowledge” AM Carvalho, AL Oliveira
Herein we describe the call to the RISOTTO algorithm found in Step 1 of Algorithm 1 (main text). This call tries to tune the RISOTTO input, presented in Section 1.1, in order to obtain good initial starting points to be processed by GRISOTTO. In Section 1.2, a description of the core idea and the pseudocode of the tuning procedure is provided.
Biclustering Algorithms for Biological Data Analysis SC Madeira, AL Oliveira
Presentation slides for the survey "Biclustering Algorithms for Biological Data Analysis" (presentation by Matthew Hibbs). Biclustering: given an n×m matrix A with rows X = {x1, …, xn} and columns Y = {y1, …, ym}, find a set of submatrices, each defined by a subset of rows I ⊆ X and a subset of columns J ⊆ Y, such that the contents of each submatrix A_IJ follow a desired pattern; row and column order need not be consistent between different submatrices. Why bicluster: genes are not regulated under all conditions, genes are regulated by multiple factors and processes concurrently, and biclusters are key to determining the function of genes and the classification of conditions. A matrix can also be viewed as a bipartite graph over its rows and columns.
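As one concrete example of the bicluster quality measures such surveys cover, here is a sketch of the mean squared residue of Cheng and Church (an illustrative choice; the survey reviews many pattern definitions): a submatrix with residue 0 follows a perfect additive pattern.

```python
# Minimal sketch of the Cheng-and-Church mean squared residue of a submatrix
# A[I][J]: zero residue means a perfect additive (row/column shifted) pattern.

def mean_squared_residue(A, rows, cols):
    row_mean = {i: sum(A[i][j] for j in cols) / len(cols) for i in rows}
    col_mean = {j: sum(A[i][j] for i in rows) / len(rows) for j in cols}
    all_mean = sum(row_mean[i] for i in rows) / len(rows)
    return sum((A[i][j] - row_mean[i] - col_mean[j] + all_mean) ** 2
               for i in rows for j in cols) / (len(rows) * len(cols))

A = [[1, 2, 3],
     [2, 3, 4],
     [9, 0, 5]]
print(mean_squared_residue(A, rows=[0, 1], cols=[0, 1, 2]))   # 0.0: additive rows
```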
An analysis of the positional distribution of DNA motifs in promoter regions and its biolog-ical relevance: additional materials and results AC Casimiro, S Vinga, AT Freitas, AL Oliveira
Drosophila. For this dataset, three general factors are reported: the TATA-box, the DPE (downstream promoter element) and the Initiator. The TATA-box consensus consists of a sequence with 5 of 6 nucleotides conforming to the consensus TATAWAWR. The DPE consensus consists of a sequence with 5 of 6 nucleotides conforming to the DPE functional range set A/G/T-C/G-A/T-C/T-A/C/G-C/T. The Initiator is described by the range set t/gCA/tg/t/ct/c/ac/tt/c/gt/c. Figure 1 shows the distribution of the documented general transcription factors described for Drosophila melanogaster. The p-values obtained using the proposed test are, respectively, 9.999 × 10⁻⁵, 0.0789 and 0.0789. Both histograms and numerical values suggest that these factors are not located randomly along the promoter region but have positional preferences.
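The snippet above does not reproduce the test itself, so the following is an assumed stand-in: a Monte Carlo test that compares the spread of observed motif positions with positions drawn uniformly at random along the promoter region (smaller spread suggests a positional preference).

```python
import random

# Minimal sketch of a Monte Carlo test for positional preference
# (an illustrative stand-in, not the paper's statistic).

def positional_preference_pvalue(positions, region_len, trials=10000, seed=0):
    rng = random.Random(seed)
    def spread(ps):
        m = sum(ps) / len(ps)
        return sum((p - m) ** 2 for p in ps) / len(ps)
    observed = spread(positions)
    hits = sum(
        spread([rng.randrange(region_len) for _ in positions]) <= observed
        for _ in range(trials))
    return (hits + 1) / (trials + 1)   # add-one Monte Carlo estimate

# Occurrences tightly clustered near one end of a 600 bp region:
print(positional_preference_pvalue([565, 570, 572, 568, 571], region_len=600))
```

Note that the smallest value this add-one estimate can return with 10000 trials is 1/10001 ≈ 9.999 × 10⁻⁵, the same granularity as the first p-value quoted above, which is consistent with a Monte Carlo procedure of this general kind; the paper's actual test may differ.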
Grammatical inference: algorithms and applications (Lisbon, 11-13 September 2000) AL Oliveira
Edited proceedings of ICGI 2000, the 5th International Colloquium on Grammatical Inference (Lisbon, 11-13 September 2000). Lecture Notes in Computer Science, Springer, Berlin, 2000; VIII + 311 pp. ISSN 0302-9743, ISBN 3-540-41011-2.
Efficient high-throughput read mapping for re-sequencing applications F Fernandes, PGS da Fonseca, LMS Russo, AL Oliveira, AT Freitas
Over the past few years, new massively parallel DNA sequencing (MPDS) technologies based on the sequencing-by-synthesis approach have emerged. These platforms generate massive amounts of data per run, greatly reducing the cost of DNA sequencing compared to the traditional capillary electrophoresis method. However, these techniques also raise important computational difficulties due to the huge volume of data produced and their specific error characteristics. One of the main applications of these technologies consists of sequencing a new genome or region of interest using a homologous sequence of another individual of the same species, or of a taxonomically related species, as a reference. A crucial step of this task corresponds to mapping the sequenced fragments, or reads, onto the reference sequence. The sheer volume of data generated by MPDS technologies (on the order of hundreds of gigabases per run) and the need to align reads to large reference genomes limit the applicability of standard techniques. We have recently proposed a new method for the alignment of MPDS reads [1]. Our initial efforts focused on high-throughput pyrosequencing data, like those produced by the Roche 454 GS FLX platform. However, the heuristic employed by our method has proven to be sufficiently generic to handle reads with different lengths and error characteristics equally well. Our proposed tool, called TAPyR, builds a Burrows-Wheeler Transform (BWT) based index of the reference sequence to accelerate the alignment. It then employs a multiple seed heuristic to anchor the best candidate alignments. Contrary to other seed-based …
Fully Compressed Suffix Trees LMS Russo, G Navarro, AL Oliveira
Suffix trees are by far the most important data structure in stringology, with a myriad of applications in fields like bioinformatics and information retrieval. Classical representations of suffix trees require Θ(n log n) bits of space for a string of size n. This is considerably more than the n log2 σ bits needed for the string itself, where σ is the alphabet size. The size of suffix trees has been a barrier to their wider adoption in practice. Recent compressed suffix tree representations require just the space of the compressed string plus Θ(n) extra bits. This is already spectacular, but the linear extra bits are still unsatisfactory when σ is small, as in DNA sequences. In this paper we introduce the first compressed suffix tree representation that breaks this Θ(n)-bit space barrier. The Fully Compressed Suffix Tree (FCST) representation requires only sublinear space on top of the compressed text size, and supports a wide set of navigational operations in almost logarithmic time. This includes extracting arbitrary text substrings, so the FCST replaces the text using almost the same space as the compressed text. An essential ingredient of FCSTs is the lowest common ancestor (LCA) operation. We reveal important connections between LCAs and suffix tree navigation. We also describe how to make FCSTs dynamic, i.e., support updates to the text. The dynamic FCST also supports several operations. In particular, it can build the static FCST within optimal space and polylogarithmic time per symbol. Our theoretical results are also validated experimentally, showing that FCSTs are very effective in practice as well.
SPIRE 2002: string processing and information retrieval (Lisbon, 11-13 September 2002) AHF Laender, AL Oliveira
Edited proceedings of SPIRE 2002, the 9th International Symposium on String Processing and Information Retrieval (Lisbon, 11-13 September 2002). Lecture Notes in Computer Science, Springer, 2002; XI + 336 pp. ISSN 0302-9743, ISBN 3-540-44158-1.
SUFFIX ARRAYS AJ Ferreira, AL Oliveira, MAT Figueiredo
Lossless compression algorithms of the Lempel-Ziv (LZ) family are widely used in a variety of applications. The LZ encoder and decoder exhibit a high asymmetry regarding time and memory requirements, with the former being much more demanding. Several techniques have been used to speed up the encoding process; among them is the use of suffix trees. In this paper, we explore the use of a simple data structure, named suffix array, to hold the dictionary of the LZ encoder, and propose an algorithm to search the dictionary. A comparison with the suffix tree based LZ encoder is carried out, showing that the compression ratios are roughly the same. The amount of memory required by the suffix array is fixed, and much lower than the variable memory requirements of the suffix tree encoder, which depend on the text to encode. We conclude that suffix arrays are a very interesting option regarding the tradeoff between time, memory, and compression ratio, when compared with suffix trees, making them preferable in some compression scenarios.
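A minimal sketch of the core idea with a naively built suffix array (an illustration; the paper's encoder and search are more refined): the longest previous match for a position can be found among the lexicographically nearest suffixes that start earlier in the text, a standard property behind suffix-array-based LZ factorizations.

```python
# Minimal sketch of greedy LZ factorization using a (naively built) suffix
# array as the dictionary: for position i, the best previous match is among
# the nearest suffixes, in lexicographic order, that start before i.

def lz_factorize(text):
    n = len(text)
    sa = sorted(range(n), key=lambda i: text[i:])     # naive suffix array
    rank = {s: r for r, s in enumerate(sa)}

    def lcp(a, b):                       # length of longest common prefix
        k = 0
        while a + k < n and b + k < n and text[a + k] == text[b + k]:
            k += 1
        return k

    factors, i = [], 0
    while i < n:
        best_len, best_pos = 0, -1
        for r in range(rank[i] - 1, -1, -1):   # nearest earlier suffix below
            if sa[r] < i:
                best_len, best_pos = lcp(sa[r], i), sa[r]
                break
        for r in range(rank[i] + 1, n):        # nearest earlier suffix above
            if sa[r] < i:
                m = lcp(sa[r], i)
                if m > best_len:
                    best_len, best_pos = m, sa[r]
                break
        if best_len == 0:
            factors.append(text[i]); i += 1            # emit a literal
        else:
            factors.append((best_pos, best_len)); i += best_len
    return factors

print(lz_factorize("abababb"))   # ['a', 'b', (0, 4), (1, 1)]
```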
A graph based approach to motif clustering AP Francisco, AL Oliveira, AT Freitas
Many algorithms have been proposed to date for the problem of finding biologically significant motifs in promoter regions. They can be classified into two large families: combinatorial methods and probabilistic methods. Probabilistic methods have been used more extensively, since they require less input from the user, and their output is easier to interpret. Combinatorial methods have the potential to identify hard to detect motifs, but their output is much harder to interpret, since it may consist of hundreds or thousands of motifs. In this work, we propose a method that processes the output of combinatorial motif finders in order to find groups of motifs that represent variations of the same motif, thus reducing the output to a manageable size. This processing is done by building a graph that represents the co-occurrences of motifs, and finding clusters or communities in this graph. Structured motifs, i.e., motifs with gaps or spacers, are also processed. We show that this innovative approach leads to a method that is as easy to use as a probabilistic motif finder, and as sensitive to low-quorum motifs as a combinatorial motif finder. The method was integrated with two combinatorial motif finders, and made available on the Web.
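A minimal sketch of the post-processing idea, with connected components standing in for the community-finding step actually used (names and the threshold are illustrative): motifs become nodes, edges link motifs that co-occur in enough sequences, and each component then groups variations of one motif.

```python
from collections import defaultdict
from itertools import combinations

# Minimal sketch: build a motif co-occurrence graph and extract its
# connected components as clusters of motif variants.

def cluster_motifs(occurrences, min_shared=2):
    """occurrences: dict mapping motif -> set of sequence ids containing it."""
    graph = defaultdict(set)
    for a, b in combinations(occurrences, 2):
        if len(occurrences[a] & occurrences[b]) >= min_shared:
            graph[a].add(b)
            graph[b].add(a)
    seen, clusters = set(), []
    for motif in occurrences:
        if motif in seen:
            continue
        stack, component = [motif], set()
        while stack:                                  # depth-first traversal
            m = stack.pop()
            if m not in component:
                component.add(m)
                stack.extend(graph[m] - component)
        seen |= component
        clusters.append(component)
    return clusters

occ = {"TTGACA": {1, 2, 3}, "TTGACT": {2, 3, 4}, "GGGCGG": {5, 6}}
print(cluster_motifs(occ))   # [{'TTGACA', 'TTGACT'}, {'GGGCGG'}]
```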
e-CCC-Biclustering: Algorithmic and complexity details SC Madeira, AL Oliveira
For clarity, we repeat here the details of the main steps of e-CCC-Biclustering already presented in the main manuscript. We believe that a complete description of the algorithmic details makes it easier to understand the detailed complexity analysis presented afterwards.
Prime Implicant Computation Using Satisfiability Algorithms VM Manquinho, PF Flores, JPM Silva, AL Oliveira
The computation of prime implicants has several significant applications in different areas, including Automated Reasoning, Non-Monotonic Reasoning, and Electronic Design Automation, among others. In this paper we describe a new model and algorithm for computing minimum-size prime implicants of propositional formulas. The proposed approach is based on creating an integer linear programming (ILP) formulation for computing the minimum-size prime implicant, which simplifies existing formulations. In addition, we introduce two new algorithms for solving ILPs, both of which are built on top of an algorithm for propositional satisfiability (SAT). Given the organization of the proposed SAT algorithm, the resulting ILP procedures implement powerful search pruning techniques, including a non-chronological backtracking search strategy, clause recording procedures and identification of necessary assignments …
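For orientation, a brute-force sketch of the problem being solved, not of the paper's ILP/SAT machinery: a term is an implicant of a formula exactly when every assignment extending it satisfies the formula, so trying terms in order of increasing size returns a minimum-size implicant first. Only viable for tiny formulas.

```python
from itertools import combinations, product

# Brute-force sketch: search terms (partial assignments) by increasing size;
# the first term whose every completion satisfies the formula is a
# minimum-size implicant.

def minimum_size_implicant(formula, num_vars):
    variables = list(range(num_vars))
    for size in range(num_vars + 1):
        for vs in combinations(variables, size):
            free = [v for v in variables if v not in vs]
            for values in product([False, True], repeat=size):
                term = dict(zip(vs, values))
                if all(formula([{**term, **dict(zip(free, rest))}[v]
                                for v in variables])
                       for rest in product([False, True], repeat=len(free))):
                    return term
    return None   # the formula is unsatisfiable

# f = (x0 and x1) or x2: the minimum-size implicant is the single literal x2.
f = lambda x: (x[0] and x[1]) or x[2]
print(minimum_size_implicant(f, 3))   # {2: True}
```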
e-CCC-Biclustering: Related work on biclustering algorithms for time series gene expression data SC Madeira, AL Oliveira
This document provides supplementary material describing related work on biclustering algorithms for time series gene expression data analysis. We describe in detail three state-of-the-art biclustering approaches specifically designed to discover biclusters in gene expression time series, and identify their strengths and weaknesses.
Accurate Power Estimation Using Circuit Partitioning AT Freitas, AL Oliveira, HC Neto
Recently, a probabilistic approach that uses a simple but powerful formalism for exact power estimation, taking into account word-level input correlations, has been proposed. This paper describes a circuit partitioning technique that can be used to speed up this method, which is otherwise limited to small circuits. By using partitioning techniques, the method does not require the computation of global BDD representations for node functions, thereby extending considerably its range of applicability. Moreover, the partitioning maintains the full set of correlations and, therefore, does not induce any loss of accuracy. By applying the approach proposed in this paper, it is possible to compute exactly the power dissipation of combinational modules using input statistics that would require extremely large traces if simulation-based methods were used.
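A minimal sketch of the quantity at stake, by exhaustive enumeration on a toy module and assuming independent inputs (the paper's point is precisely to handle word-level correlations and larger circuits, via BDDs and partitioning): the signal probability p of a node, from which the average switching activity 2p(1-p) follows under temporal independence.

```python
from itertools import product

# Minimal sketch: exact signal probability of a node by enumerating all
# input combinations, weighting each by its probability under independence.

def signal_probability(f, input_probs):
    p = 0.0
    for bits in product([0, 1], repeat=len(input_probs)):
        weight = 1.0
        for b, q in zip(bits, input_probs):
            weight *= q if b else (1 - q)
        p += weight * f(bits)
    return p

# Node y = (x0 AND x1) OR x2 with the given input probabilities:
y = lambda x: (x[0] & x[1]) | x[2]
p = signal_probability(y, [0.5, 0.5, 0.1])
print(p, 2 * p * (1 - p))   # signal probability and switching activity
```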
Impact of side chain positioning on the accuracy of discrete models MMF Bugalho, AL Oliveira
Discrete models are important to reduce the complexity of the protein folding problem. However, a compromise must be made between the complexity of the model and its accuracy. Previous work by Park and Levitt has shown that the protein backbone can be modeled with good accuracy by four-state discrete models. Nonetheless, for ab initio protein folding, the side chains are important to determine whether the structure is physically possible and well packed.
Approximate string matching with Ziv-Lempel compressed indexes LMS Russo, G Navarro, AL Oliveira
A compressed full-text self-index for a text T is a data structure requiring reduced space and capable of searching for patterns P in T. Furthermore, the structure can reproduce any substring of T, thus it actually replaces T. Despite the explosion of interest in self-indexes in recent years, there has not been much progress on search functionalities beyond basic exact search. In this paper we focus on indexed approximate string matching (ASM), which is of great interest, say, in computational biology applications. We present an ASM algorithm that works on top of a Lempel-Ziv self-index. We consider the so-called hybrid indexes, which are the best in practice for this problem. We show that a Lempel-Ziv index can be seen as an extension of the classical q-samples index. We give new insights on this type of index, which can be of independent interest, and then apply them to the Ziv-Lempel index. We show experimentally that our algorithm has a competitive performance and provides a useful space-time tradeoff compared to classical indexes.
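A minimal sketch of the filtration principle behind hybrid indexes, using plain string search instead of a compressed index (an illustration, not the paper's algorithm): if a pattern occurs with at most k errors, at least one of k+1 disjoint pieces of it occurs exactly (pigeonhole), so exact piece matches anchor candidate windows that are then verified by dynamic programming.

```python
# Minimal sketch of approximate matching by pattern partitioning:
# exact matches of pattern pieces anchor windows, verified by a
# semi-global edit distance.

def best_match_distance(pattern, window):
    # Semi-global DP: the pattern may align to any substring of the window,
    # so leading and trailing window characters are free.
    prev = [0] * (len(window) + 1)
    for i, cp in enumerate(pattern, 1):
        cur = [i]
        for j, cw in enumerate(window, 1):
            cur.append(min(prev[j] + 1, cur[-1] + 1, prev[j - 1] + (cp != cw)))
        prev = cur
    return min(prev)

def approximate_search(text, pattern, k):
    step = len(pattern) // (k + 1)
    pieces = [(i * step, pattern[i * step:(i + 1) * step if i < k else None])
              for i in range(k + 1)]
    windows = set()
    for offset, piece in pieces:
        start = text.find(piece)
        while start != -1:
            lo = max(0, start - offset - k)
            hi = min(len(text), start - offset + len(pattern) + k)
            if best_match_distance(pattern, text[lo:hi]) <= k:
                windows.add(lo)               # verified candidate window
            start = text.find(piece, start + 1)
    return sorted(windows)

print(approximate_search("the quick brwon fox", "brown", k=2))   # [8]
```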
Efficient Search Techniques for the Inference of Minimum Sized Finite State Machines AL Oliveira, JPM Silva
We propose a new algorithm for the inference of minimum size finite state machines from training set data. Our approach is based on a well known search algorithm proposed by Bierman, but it incorporates a set of techniques known as dependency directed backtracking. Although these techniques have already been used in other applications, we believe we are the first to apply them to this problem. The results show that the application of these techniques yields an algorithm that is, for the problems studied, orders of magnitude faster than existing approaches.
Efficient algorithms for the inference of minimum size DFAs AL Oliveira, JPM Silva
This work describes algorithms for the inference of minimum size deterministic automata consistent with a labeled training set. The algorithms presented represent the state of the art for this problem, known to be computationally very hard. In particular, we analyze the performance of algorithms that use implicit enumeration of solutions and algorithms that perform explicit search but incorporate a set of techniques known as dependency directed backtracking to prune the search tree effectively.
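A minimal sketch of the explicit-search formulation in the Bierman style, without the dependency-directed backtracking the papers above add to prune the search: map every sample prefix to one of k states so that accepted and rejected prefixes get different states and transitions are deterministic, for k = 1, 2, …; the first k that works gives a minimum size DFA.

```python
# Minimal sketch of explicit search for a minimum size DFA consistent with
# labeled strings. Naive chronological backtracking; the papers' point is
# to prune this search far more aggressively.

def infer_dfa(positive, negative):
    prefixes = sorted({w[:i] for w in positive + negative
                       for i in range(len(w) + 1)})
    label = {w: True for w in positive}
    label.update({w: False for w in negative})

    def consistent(assign):
        state_label, next_state = {}, {}
        for w, q in assign.items():
            if w in label and state_label.setdefault(q, label[w]) != label[w]:
                return False              # one state both accepts and rejects
            if w and w[:-1] in assign:
                key = (assign[w[:-1]], w[-1])
                if next_state.setdefault(key, q) != q:
                    return False          # nondeterministic transition
        return True

    for k in range(1, len(prefixes) + 1):
        def backtrack(i, assign):
            if i == len(prefixes):
                return dict(assign)
            for q in range(k):
                assign[prefixes[i]] = q
                if consistent(assign):
                    result = backtrack(i + 1, assign)
                    if result:
                        return result
            del assign[prefixes[i]]
            return None
        solution = backtrack(0, {})
        if solution:
            return k, solution
    return None

k, states = infer_dfa(positive=["a", "aaa"], negative=["", "aa"])
print(k)   # 2: the target language is "odd number of a's"
```

Re-checking all constraints after every assignment, as done here, is what dependency-directed backtracking avoids: it records which earlier assignments caused each conflict and jumps back directly to them.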
DFDF Primer. Status: Draft. This version: http://dfdf.inesc-id.pt/tr/doc/primer/20070925. Latest version: http://dfdf.inesc-id.pt/tr/primer X Wang, JS Almeida, AL Oliveira
This document describes the data format description framework (DFDF), a system that uses semantic web technology to describe the format of data resources. In addition, the framework is also used as the basis for defining URI fragment identifiers that can be used to address, and subsequently access, parts of binary resources. As in all other DFDF documents, all terminologies, special notations and syntax are defined in a separate document at "http://dfdf.inesc-id.pt/tr/terms".
Discovering transcriptional regulatory modules using BiGGEsTS A case study JP Gonçalves, SC Madeira, AL Oliveira
This document provides supplementary material to the manuscript “BiGGEsTS: integrated environment for biclustering analysis of time series gene expression data”. We present a case study, describing how to use the software to discover transcriptional regulatory modules in a dataset containing the response of Saccharomyces cerevisiae to heat stress, and reproducing the results published in [1].
An Overview on Mixture and Hidden Markov Models of Gene Expression in Time Series SC Madeira, AL Oliveira
There have been a number of approaches to modeling gene expression in time series. These can be divided into two classes, depending on whether or not they assume the different experiments to be independent. This paper overviews models of gene expression in time series, with emphasis on mixture and hidden Markov models, having in mind that taking into account the inherent nature of the data and using the dependencies along the time axis should improve the quality of the models. We present the expression models used for time series expression data, study the learning approaches used to infer the models, and analyze how they were used to solve real biological problems. This analysis includes a study of the task(s) for which each model was designed, the biological problem under study, the specificities of time series expression data taken into consideration during model design, the type of time series used to test the model, and the data transformations applied.
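As a concrete building block of the hidden Markov models discussed above, here is a sketch of the forward algorithm on discretized expression levels (all parameters below are illustrative): it computes the likelihood of an observed time series under a given model, the basic quantity needed when fitting models or comparing how well they explain the data.

```python
# Minimal sketch of the forward algorithm for a discrete HMM.

def forward(obs, start, trans, emit):
    """Probability of an observation sequence under a discrete HMM."""
    states = range(len(start))
    alpha = [start[s] * emit[s][obs[0]] for s in states]
    for o in obs[1:]:
        alpha = [sum(alpha[p] * trans[p][s] for p in states) * emit[s][o]
                 for s in states]
    return sum(alpha)

# Two hidden states ("up-regulated", "down-regulated"); observations are
# discretized expression levels: 0 = low, 1 = high.
start = [0.6, 0.4]
trans = [[0.8, 0.2], [0.3, 0.7]]
emit  = [[0.1, 0.9], [0.8, 0.2]]   # the up-regulated state mostly emits "high"
print(forward([1, 1, 0, 1], start, trans, emit))
```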