BioMediator is a data integration system that provides a common interface to Web-accessible sources of biologic information. Standard data integration techniques can be used to provide access to these sources, but these techniques are not always adequate for biologic researchers (e.g., as new experiments are devised, the mediated schema needs to evolve). BioMediator includes several features (e.g., an easily modified mediated schema) that address the challenges introduced by biologic researchers’ needs.

In the post-genomic era, biologic research can benefit from access to the large amounts of data and knowledge stored in public repositories. Each of these sources was developed to address specific needs and is organized around a unifying concept and/or organism. However, the new “systems” approach to biology requires analyzing experimental results in a more general context. Thus, this approach requires integrating information from a distributed set of highly heterogeneous sources including both public and private data sources. More generally, inductive research (i.e., research that generates rather than tests hypotheses) depends on integrating immense data sets.

The biologists performing these experiments have two other needs not addressed by standard data integration approaches. First, the system must support both poorly specified queries (e.g., “What is known about the genetic disease narcolepsy?”) and very specific queries (e.g., “A mutation of what gene(s) results in dysprothrombinemia, a form of haemophilia caused by an inactive protein?”). Second, the mediated schema must be easily customized for different user groups whose needs evolve over time.

BioMediator addresses the first challenge by providing support for flexible query answering. A user begins by issuing a declarative query, which establishes the basic topics of interest. The user can then browse these results and issue new queries to explore related topics. For example, a query for narcolepsy returns a number of related genes. The user can then retrieve more information about one, some, or all of these genes.
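The query-then-browse pattern described above can be pictured with a minimal Python sketch. It is not BioMediator's actual interface: the `Entity` class, the `seed_query` and `expand` functions, and the placeholder records (`GENE_A`, `PROTEIN_A`, and so on) are all assumptions made for illustration, and the in-memory dictionary merely stands in for results gathered from several sources.

```python
from dataclasses import dataclass, field


@dataclass
class Entity:
    """One record in a hypothetical integrated result graph."""
    kind: str
    name: str
    links: list = field(default_factory=list)  # names of related entities


# Toy data with placeholder names; not real biological facts.
GRAPH = {
    "narcolepsy": Entity("Disease", "narcolepsy", ["GENE_A", "GENE_B"]),
    "GENE_A": Entity("Gene", "GENE_A", ["PROTEIN_A"]),
    "GENE_B": Entity("Gene", "GENE_B", []),
    "PROTEIN_A": Entity("Protein", "PROTEIN_A", []),
}


def seed_query(keyword):
    """Initial query: return entities whose name contains the keyword."""
    return [e for e in GRAPH.values() if keyword.lower() in e.name.lower()]


def expand(entity, kinds=None):
    """Browse step: follow links from one entity, optionally filtered by kind."""
    related = [GRAPH[n] for n in entity.links if n in GRAPH]
    if kinds is not None:
        related = [e for e in related if e.kind in kinds]
    return related


# A loosely specified starting query, followed by selective browsing.
seeds = seed_query("narcolepsy")
genes = expand(seeds[0], kinds={"Gene"})
details = expand(genes[0])  # drill into just one of the related genes
```

The point of the sketch is that the first query only fixes a starting set; each later step is a small follow-up request against whichever results the user chooses to pursue.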

The second challenge requires several innovations. In biologic data integration systems, the mediated schema is often hard-coded. However, the heterogeneous and evolving needs of biologists mandate supporting multiple mediated schemas that can easily be changed. As a consequence, BioMediator is driven by information stored in a Protégé knowledge-base that is modified using a graphical interface. The contents of this knowledge-base can be easily extended to support new user groups.

As a result, researchers can create custom mediated schemas. A typical timeline is as follows: A new user selects an existing mediated schema and begins using BioMediator. Eventually, limitations of this schema become evident and the user makes changes to a copy of the original schema (by copying the Protégé knowledge-base). After running some experiments, the user modifies the new schema again to reflect the new data needs.
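The copy-then-modify timeline can be sketched in a few lines of Python. This is only an analogy: in BioMediator the schema lives in a Protégé knowledge-base, whereas here a schema is assumed to be a plain dictionary from class names to attribute lists, and `derive_schema` and all class and attribute names are hypothetical.

```python
import copy

# Hypothetical mediated schema: class name -> list of attribute names.
base_schema = {
    "Gene": ["symbol", "name"],
    "Disease": ["name"],
}


def derive_schema(schema, additions):
    """Return a modified copy of a schema, leaving the original untouched."""
    derived = copy.deepcopy(schema)
    for cls, attrs in additions.items():
        existing = derived.setdefault(cls, [])
        for attr in attrs:
            if attr not in existing:
                existing.append(attr)
    return derived


# A user group starts from the shared schema, then extends its own copy.
lab_schema = derive_schema(
    base_schema, {"Gene": ["chromosome"], "Experiment": ["date"]}
)
```

Because the derived schema is a deep copy, other groups that still rely on `base_schema` are unaffected by the new group's changes.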

To support user-driven schema evolution, the wrappers are as general as possible. All of the available data-fields are exposed, whether or not they appear in a given mediated schema. When changes are made to a mediated schema, previously invisible data-fields can be mapped to the new schema. This can often be done with no additional programming using a plugin for the Protégé environment. These features make BioMediator well suited to data integration in the biologic sciences.
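The separation between a general wrapper and a schema-specific mapping might look like the following Python sketch. The record, its field names, its placeholder values, and the `apply_mapping` helper are all assumptions for illustration; BioMediator itself expresses such mappings in the Protégé knowledge-base rather than in code.

```python
def wrapper_fetch():
    """Hypothetical wrapper: returns every field the source provides."""
    return {
        "gene_symbol": "GENE_A",        # placeholder values, not real data
        "full_name": "example gene A",
        "map_location": "chrN",
        "last_updated": "unknown",
    }


def apply_mapping(record, mapping):
    """Project a raw source record onto mediated-schema attributes.

    mapping: mediated-schema attribute -> source field name.
    """
    return {attr: record[src] for attr, src in mapping.items() if src in record}


raw = wrapper_fetch()

# The first schema version exposes only two of the four source fields.
mapping_v1 = {"symbol": "gene_symbol", "name": "full_name"}
view_v1 = apply_mapping(raw, mapping_v1)

# The schema evolves: a previously invisible field is mapped in,
# and the wrapper itself does not change.
mapping_v2 = {**mapping_v1, "chromosome": "map_location"}
view_v2 = apply_mapping(raw, mapping_v2)
```

Since the wrapper already returns every field, extending the schema only requires a new mapping entry, which is why such a change can often be made without additional programming.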