Bioinformatics

Research Laboratory for Bioinformatics Technology


Bioinformatics is the field that studies the applications of computers to molecular biology and biochemistry. It encompasses:

the information processing needs of biological data;
the acquisition of knowledge from the data;
the mathematical modeling and computer modeling of phenomena in a cell; and
visualization of models and information.

Bioinformatics is the driver for many of the discoveries from genomics. Researchers in genomics need collaborators with a broad range of expertise in computer systems, software, and bioinformatics, namely:

computer systems installation and administration;
networks, internet/web, distributed computing environments;
sequence analysis, data analysis, scientific computing and visualization;
databases, machine learning, data mining; and
software development.

This is a fertile domain for both biologists and computer scientists. It needs genuine advances in algorithms, data- and knowledge-base systems, artificial intelligence, and software for supporting the experimental process, data analysis, and the creative aspects of biology and biochemistry.

The BioIT Lab

The aim of the BioIT Lab is to research, develop, and apply advanced computing technology to the problems of genomics.

Greg Butler: software and database technology, scientific knowledge-bases.
Clement Lam: large-scale computation, algorithms.
Volker Haarslev: description logics, ontologies, semantic web.
Gosta Grahne: web databases, data mining, sequence databases.
Nematollaah Shiri: knowledge-based systems.
Ahmed Seffah: usability, user interface design.
Sudhir P. Mudur: visualization.
Leila Kosseim: natural language processing.
Sabine Bergler: natural language processing.

Collaboration

The Centre for Structural and Functional Genomics at Concordia University, which focuses on micro-organisms with industrial, agricultural, and environmental importance.
McGill Centre for Bioinformatics.
The European Media Lab's research groups in Bioinformatics and in Scientific Databases and Visualization.

Projects - Bioinformatics Infrastructure

Scientists working on the genome projects have been early adopters of internet technology. Faced with a huge amount of data that is growing rapidly in volume, location, and diversity, they have developed pragmatic approaches to their problems of data access, data analysis, large-scale computation, and intelligent data mining in order to create scientific knowledge from the raw experimental data.

We propose a three-layer approach to providing an infrastructure for supporting data management, computational analysis, and reasoning in bioinformatics.

Know-It-All framework for data management
Do-It-All framework for workflow and computational tasks
Solve-It-All framework for intelligent computing

Know-It-All

The Know-It-All framework concerns the development of an infrastructure for data management that incorporates the wide variety of data, allows the different data models to be integrated, and provides query mechanisms that are intuitive to scientists. Deeper questions of how to handle uncertain and incomplete data will also be explored.

Support for managing the variety of data in bioinformatics is seriously lacking. There are many data models in the research literature that handle a range of data types: relations, objects, spatial and geometric data, images, networks, temporal information, and many more. All the research prototypes supporting the novel data models are implemented by mapping to a relational database or an object database. Although the data model might provide quite intuitive query mechanisms for the data type, these are not truly supported by the query processing, query optimization, and indexing facilities of the underlying relational or object database.

Realistically, the only widely available database technologies that have proven themselves in bioinformatics applications are flat files and relational database management systems (DBMS). There is still hope that object DBMS will become a proven technology, but most bioinformaticians hedge their bets by creating an object data model for their application and then mapping it to either flat files or a relational DBMS. Queries must then be written in SQL, or SQL-like extensions, which are neither matched to the novel data types nor intuitive for the scientist. Today, an intuitive query equates to either a form-based query or point-and-click browsing of text and GIF images on the web. Neither supports complex or ad hoc queries.
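The mapping described above can be illustrated with a minimal sketch. It assumes Python's built-in sqlite3 module as a stand-in for a relational DBMS; the Sequence class, the table layout, and the sample records are hypothetical and are not taken from the lab's systems.

```python
import sqlite3
from dataclasses import dataclass


@dataclass
class Sequence:
    """Hypothetical application object; not the lab's actual data model."""
    accession: str
    organism: str
    residues: str


def save(conn, seq):
    # Map the object's fields onto one row of a relational table.
    conn.execute(
        "INSERT INTO sequence (accession, organism, residues) VALUES (?, ?, ?)",
        (seq.accession, seq.organism, seq.residues),
    )


def find_by_motif(conn, motif):
    # The scientist's question ("which sequences contain this motif?")
    # must be phrased as a SQL substring test. The DBMS has no notion of
    # a biological sequence, so it cannot index or optimize for it.
    rows = conn.execute(
        "SELECT accession, organism, residues FROM sequence "
        "WHERE instr(residues, ?) > 0 ORDER BY accession",
        (motif,),
    )
    return [Sequence(*row) for row in rows]


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sequence "
    "(accession TEXT PRIMARY KEY, organism TEXT, residues TEXT)"
)
# Invented sample records, for illustration only.
save(conn, Sequence("X1", "Aspergillus niger", "ATGGCGTTAA"))
save(conn, Sequence("X2", "Aspergillus niger", "ATGCCCTGA"))

hits = find_by_motif(conn, "GCG")
print([s.accession for s in hits])
```

Even in this small case the object model lives only in the application: the database sees plain text columns, and anything richer than a substring test (approximate matching, for example) would have to be computed outside the query engine.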

Integration of databases has been well resolved from a pragmatic point of view. Indeed, the use of ontologies specified in description logic gives TAMBIS a solid theoretical foundation for integration at the conceptual level.

Our approach is to apply the software technology of frameworks and product-lines to the development of a framework for database management and knowledge management. Frameworks are designed to be customizable and extensible, providing rapid implementation of variations in concepts, strategies, or techniques.

Do-It-All

The Do-It-All project concerns the development of an infrastructure for distributed computing with workflow programming and the deployment of computational grids. We intend to use Jini as the development platform, with application servers for individual computations and further servers for grid access and workflow control.

Sequence analysis will provide a test case for the infrastructure. The analysis of DNA and protein sequences is a large-scale computation requiring a very large number of individual analyses of many sequences, followed by an intelligent assessment of the results of those analyses.
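The shape of such a computation (many independent analyses, then an assessment of their results) can be sketched as follows. This is only an illustration under stated assumptions: a local thread pool stands in for the distributed services, and the two analyses, the assess function, and its threshold are invented placeholders rather than part of Do-It-All.

```python
from concurrent.futures import ThreadPoolExecutor


def gc_content(seq):
    # Placeholder analysis: fraction of G and C bases.
    return sum(base in "GC" for base in seq) / len(seq)


def has_start_codon(seq):
    # Placeholder analysis: does the sequence begin with ATG?
    return seq.startswith("ATG")


# Hypothetical registry of analyses to run on every sequence.
ANALYSES = {"gc_content": gc_content, "has_start_codon": has_start_codon}


def run_workflow(sequences, max_workers=4):
    """Fan out every (sequence, analysis) pair, then gather per sequence."""
    results = {name: {} for name in sequences}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {
            (seq_name, analysis_name): pool.submit(func, seq)
            for seq_name, seq in sequences.items()
            for analysis_name, func in ANALYSES.items()
        }
        for (seq_name, analysis_name), future in futures.items():
            results[seq_name][analysis_name] = future.result()
    return results


def assess(result):
    # Placeholder for the assessment step; the 0.4 threshold is arbitrary.
    return result["has_start_codon"] and result["gc_content"] >= 0.4


sequences = {"s1": "ATGGCGCC", "s2": "TTTTAATA"}  # invented test data
results = run_workflow(sequences)
candidates = sorted(name for name, r in results.items() if assess(r))
print(candidates)
```

Because each analysis depends only on its own input sequence, the fan-out step could be handed to remote servers or a grid without changing the gather and assessment steps.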

Solve-It-All

The Solve-It-All project concerns the addition of reasoning facilities to the Do-It-All system. Initially, the intelligent computing will be provided by an expert system shell such as CLIPS/JESS, but we propose to incorporate mediators, such as blackboard architectures, for intelligent control of large-scale computations. In the longer term, we will use agent technology and the "semantic web" (the internet passing XML documents with semantics defined using OIL) as the basis of Solve-It-All.

Analysis of sequence function is chosen as the application area for the prototype infrastructure because it is reasonably well understood. However, there are many potential applications within bioinformatics for the infrastructure, such as gene expression data analysis, 3D structure prediction, docking, and threading.

The analysis of DNA and protein sequences is a large-scale computation requiring a very large number of individual analyses of many sequences, followed by an intelligent assessment of the results of those analyses. The end result is a putative assignment of a functional role to a sequence, together with the ability to "drill down" to examine the steps in the reasoning and the results of individual analyses.
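A minimal sketch of rule-based assignment with drill-down is given below. It is written in plain Python rather than in an expert system shell, and the rule names, thresholds, result fields, and the "putative kinase" label are all invented for illustration; none of them come from Solve-It-All.

```python
from dataclasses import dataclass, field


@dataclass
class Conclusion:
    label: str
    # Names of the rules that fired, kept so a user can drill down.
    evidence: list = field(default_factory=list)


# Each rule: (name, condition over the analysis results, label it supports).
# Rules and thresholds are hypothetical.
RULES = [
    ("strong_hit", lambda r: r["best_hit_evalue"] < 1e-10, "putative kinase"),
    ("has_domain", lambda r: "kinase_domain" in r["domains"], "putative kinase"),
]


def assign_function(results):
    """Fire every rule once and record which rules support each label."""
    conclusions = {}
    for name, condition, label in RULES:
        if condition(results):
            conclusion = conclusions.setdefault(label, Conclusion(label))
            conclusion.evidence.append(name)
    return list(conclusions.values())


# Invented analysis results for one sequence.
results = {"best_hit_evalue": 1e-30, "domains": ["kinase_domain"]}
assigned = assign_function(results)
for c in assigned:
    print(c.label, "supported by", c.evidence)
```

Keeping the list of fired rules alongside each conclusion is what makes the assignment inspectable: the putative role can always be traced back to the individual analyses that support it.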

Projects - Bioinformatics Algorithms

We are experts in the development, implementation, and optimization of algorithms. We are interested in their performance in practice, rather than their asymptotic behaviour (their big-O complexity).

We have worked in the following areas, and are keen to do more:

sequence alignment
secondary structure prediction
3D structure prediction
machine learning applied to biomedical applications.

Department of Computer Science and Software Engineering (CSSE)