Tutorials

The tutorials will be held on Tuesday, 19 September 2006, at the CS department, Sand 6/7.

The following tutorials will be presented:

  1. Daniel Huson: Introduction to Phylogenetic Networks
  2. Yu Wang: Non-coding RNA - No Longer the Dark Matter in a Cellular Universe
  3. Gunnar Rätsch and Cheng Soon Ong: Kernel Methods for Predictive Sequence Analysis
  4. Hagit Shatkay: Mining the Biomedical Literature: State of the Art, Challenges and Evaluation Issues

Schedule:

    12:00   Registration Open
    13:00   Tutorial A (Lecture Hall 1)                     Tutorial B (Lecture Hall 2)
    15:30   Coffee Break
    16:00   Tutorial C (Lecture Hall 1)                     Tutorial D (Lecture Hall 2)
    18:30   Reception (Sand 1, A104)


Introduction to Phylogenetic Networks

Daniel Huson
Center for Bioinformatics Tuebingen
Tuebingen University

The evolutionary history of species is usually represented by a phylogenetic tree, and many well-known methods exist for reconstructing such trees, in particular from biomolecular sequences. However, for some types of data a tree may not be appropriate, and approaches that generate phylogenetic networks may be more suitable. For example, hybridization between different plant species or recombination between closely related bacteria can lead to sequence data whose evolutionary history is best represented by a network.

There are two types of phylogenetic networks: ones that are interpretable as non-tree-like models of evolution, and ones that do not possess such an interpretation, but should rather be considered as visualizations of incompatibilities within a data set. Examples of the former are ancestral recombination graphs, hybridization networks and reticulation networks, whereas examples of the latter are splits graphs, consensus networks and neighbor-nets. This tutorial will discuss both types.

We will first give a brief introduction to reticulate evolution and phylogenetic networks. Then, a number of important network reconstruction methods will be discussed. Finally, the application of some phylogenetic methods will be illustrated using typical data sets (haplotypes, plant hybrids, and whole genome data for a set of prokaryotes).

Download handout, slides.

Kernel Methods for Predictive Sequence Analysis

Gunnar Rätsch and Cheng Soon Ong
Friedrich Miescher Laboratory
Tuebingen

This tutorial is meant for a broad audience: students, researchers, biologists and computer scientists interested in (a) an overview of general and efficient algorithms for statistical learning used in computational biology, and (b) sequence kernels for problems such as remote homology, promoter or splice site detection, and protein subcellular localization. No specific knowledge will be required, since the tutorial is self-contained and most fundamental concepts are introduced during the course.
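As a rough illustration of what a sequence kernel computes, the sketch below implements a k-mer (spectrum) kernel: two sequences are compared by counting the substrings of length k they share. The abstract does not say which kernels the tutorial covers, so the choice of kernel, the function names and the default k = 3 are assumptions made for this example only.

```python
from collections import Counter


def kmer_counts(seq, k):
    """Count every overlapping substring of length k in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def spectrum_kernel(x, y, k=3):
    """Inner product of the k-mer count vectors of sequences x and y.

    Illustrative sketch only; not taken from the tutorial material.
    """
    cx, cy = kmer_counts(x, k), kmer_counts(y, k)
    # Only k-mers present in both sequences contribute to the sum.
    return sum(cx[m] * cy[m] for m in cx.keys() & cy.keys())


if __name__ == "__main__":
    print(spectrum_kernel("ACGTACGT", "TACGTT", k=3))
```

Because the value is an inner product of count vectors, it can be used wherever a kernel method such as a support vector machine expects a similarity between two inputs.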

Download slides.

Non-coding RNA - No Longer the Dark Matter in a Cellular Universe

Yu Wang
German Research Center for Environment and Health
Neuherberg

Non-coding RNA (ncRNA) has recently emerged as an important landmark on the genomic landscape of eukaryotes. Recent large-scale studies of the human and mouse transcriptomes, using both cDNA cloning and genome tiling arrays, indicate that the majority of the mammalian genome is transcribed. For example, the FANTOM3 project reported that an astonishing 62 percent of the mouse genome is transcribed and that half of the 181,000 independent transcripts are ncRNAs. Mapping transcripts of 10 human chromosomes showed that unannotated, nonpolyadenylated transcripts comprise the major proportion of the transcriptional output of the human genome. In plants, massively parallel signature sequencing (MPSS) of Arabidopsis flowers and seedlings discovered two million small RNAs, of which about 75,000 are unique small RNA signatures. This tremendous amount of ncRNA data might only be the tip of a huge iceberg. Indeed, an unbiased mapping of Sp1, cMyc, and p53 along human chromosomes 21 and 22 has identified a surprisingly large number of transcription factor binding sites (TFBS), many of which are significantly associated with ncRNAs. Computational prediction of ncRNAs based on RNA secondary structures also hints that a large number of ncRNAs are encoded in the human genome.

In the post-genomic era, assigning function to all protein-coding genes and understanding their interactions is already a challenge. Now we face an even more difficult task: identifying genes that encode non-coding RNAs, elucidating their functions, and incorporating them into genetic networks. The lack of open reading frames and other statistical signals makes systematic prediction of ncRNAs in genomes a formidable task. We have to rely on RNA secondary structure and thermodynamic stability together with primary sequence conservation. Computational approaches will certainly play a leading role in answering questions such as "How many ncRNAs are encoded by the genome?"

In this tutorial, we will review the definition of ncRNA and the current status of ncRNA research in different organisms, from Arabidopsis to fly, mouse and human. We will also discuss the common computational strategies in ncRNA identification and functional prediction. Finally, we will present the RNA Ontology Consortium, an ongoing project which aims to create an integrated conceptual framework for RNomics, the study of RNAs and their interactions.

Download slides.

Mining the Biomedical Literature: State of the Art, Challenges and Evaluation Issues

Hagit Shatkay
Queen's University
Kingston, Ontario

Almost every kind of known or postulated information pertaining to genes, proteins, and their role in biological processes is reported in the vast amounts of published literature. The advancement of biological techniques supporting large-scale genomics and proteomics is accompanied by an overwhelming increase in the amount of literature discussing the biology of genes and proteins. The ability to rapidly survey the literature forms a necessary step in both the design and the interpretation of any large-scale experiment. Moreover, automated text mining offers a yet-untapped opportunity to integrate many fragments of information, gathered by researchers from multiple fields of expertise, into a complete picture exposing the interrelated roles of various genes, proteins and chemical reactions in cells and organisms. For all these reasons, the past few years have seen a surge of interest in utilizing biomedical text for various purposes, ranging from identifying gene and protein names within sentences and articles, to trying to establish and predict regulatory networks. (See complete sessions in past ISMB and PSB conferences, as well as publications in biological and bioinformatics journals such as Nature Reviews, Nature Genetics, Journal of Computational Biology and Bioinformatics.) Several text-related disciplines are harnessed in such efforts, including natural language processing, information extraction, and information retrieval.

The objective of this tutorial is to provide a structured introduction to biomedical text mining, from both the biomedical-application and the text-mining perspectives. It will provide background and raise the awareness of researchers, whether they come to the area from the biomedical or the text-mining domain, of the well-established and the more recently formed theory, techniques, data sources and tools.

The tutorial will present general and biomedical-specific text mining methods. It will discuss the kinds of text these methods can be applied to, and the work that has been done so far towards such applications. Existing work in biomedical literature mining will be analyzed and put in the context of the explicit text mining disciplines, as well as examined from a machine learning perspective. Until recently, little had been done about objective assessment of text mining in biology. The tutorial will put a special emphasis on critical assessment, validation and evaluation methods that are used in text mining and information retrieval, and focus on their application in the context of emerging benchmarks. In particular, we shall discuss recent evaluation efforts such as KDD Cup 2002, BioCreAtIvE 2004, and TREC Genomics.
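Evaluation in text mining and information retrieval is commonly reported as precision, recall and their harmonic mean (F1). The sketch below computes these for a set of extracted gene names against a hand-annotated reference set. The gene names and the function name are hypothetical, and the abstract does not state which measures the tutorial or the named benchmarks use, so this is an illustration of the general idea only.

```python
def precision_recall_f1(predicted, gold):
    """Compare a set of predicted items with a gold-standard set.

    Returns (precision, recall, F1). Illustrative sketch only.
    """
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    # Guard against empty sets to avoid division by zero.
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    # Hypothetical system output and reference annotation.
    extracted = {"BRCA1", "TP53", "MYC"}
    annotated = {"BRCA1", "TP53", "EGFR", "KRAS"}
    print(precision_recall_f1(extracted, annotated))
```

In this made-up example two of the three extracted names are correct (precision 2/3) and two of the four annotated names are found (recall 1/2).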