Identifying Candidate Genes Using The BioWarehouse: A Case Study

Citation

Pouliot, Y., Lee, T.J., Wagner, V. and Karp, P.D. Identifying Candidate Genes Using The BioWarehouse: A Case Study, in Eighteenth International Conference on Systems Engineering, pp. 332-40, 2005.

Abstract

The BioWarehouse is an open source data warehousing environment focused on supporting bioinformatics databases (DBs). Operating on the MySQL or Oracle relational database management systems (RDBMSs), BioWarehouse integrates public source DBs such as Swiss-Prot and GenBank into a unified, normalized schema under a single DB management system. BioWarehouse also imposes partial semantic normalization on the source data, thus decreasing semantic heterogeneity and facilitating multi-DB queries using the Structured Query Language (SQL). As an application case study of the BioWarehouse, we have identified candidate genes for “orphan” activities, defined as activities for which no cognate gene sequences are known. Of the enzymatic activities that have been assigned an Enzyme Commission (EC) number, 1,356 (36%) are orphans (Karp, 2004). Such high prevalence is problematic, given that many of these activities were first described decades ago and often perform essential functions. Most notably, the existence of orphans introduces gaps in sequence data that significantly limit the accuracy of genome annotation and metabolic pathway prediction. Fortunately, with more than 200 genomes sequenced to completion, and with the availability of systems such as BioWarehouse, the computational identification of candidate genes associated with orphan activities can be envisioned. The BioWarehouse’s conglomeration of databases, combined with Oracle 10g’s native integration of analytical tools into SQL queries, such as the Basic Local Alignment Search Tool (BLAST) and POSIX regular expressions, enabled us to identify a small number of high-confidence candidate genes associated with a specific orphan activity. We describe the complex queries used in this work to illustrate the value of the data warehousing approach to bioinformatics research.
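
As a rough illustration of the definition of an orphan activity given above, the following sketch finds EC numbers that have no associated protein sequence using a single SQL anti-join. It is not the query used in the paper and does not reflect the actual BioWarehouse schema: the table names (enzyme_activity, protein), their columns, and the toy EC numbers are all hypothetical, and an in-memory SQLite database (via Python's sqlite3 module) stands in for MySQL or Oracle.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with toy data (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE enzyme_activity (ec_number TEXT PRIMARY KEY, name TEXT);
        CREATE TABLE protein (
            id INTEGER PRIMARY KEY,
            ec_number TEXT,
            sequence TEXT
        );

        -- Toy activities; the EC numbers are placeholders, not real entries.
        INSERT INTO enzyme_activity VALUES ('0.0.0.1', 'toy activity A');
        INSERT INTO enzyme_activity VALUES ('0.0.0.2', 'toy activity B');
        INSERT INTO enzyme_activity VALUES ('0.0.0.3', 'toy activity C');

        -- Only activity A has a sequence on record.
        INSERT INTO protein (ec_number, sequence) VALUES ('0.0.0.1', 'MKTAYIAK');
        """
    )
    return conn


def find_orphan_activities(conn):
    """Return EC numbers with no associated protein sequence (anti-join)."""
    rows = conn.execute(
        """
        SELECT a.ec_number
        FROM enzyme_activity AS a
        LEFT JOIN protein AS p ON p.ec_number = a.ec_number
        WHERE p.id IS NULL
        ORDER BY a.ec_number
        """
    ).fetchall()
    return [row[0] for row in rows]


if __name__ == "__main__":
    print(find_orphan_activities(build_demo_db()))
```

Because the warehouse holds all source DBs under one schema, a question of this kind can in principle be answered with one query rather than by cross-referencing separate downloads, which is the point the case study is meant to illustrate.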

