MBL-RPI

From Tetherless World Wiki

people

affiliation                                       name
RPI, Professor, Tetherless World Constellation    Peter Fox
RPI                                               Li Ding
MBL, Director, MBLWHOI Library                    Cathy Norton
MBL                                               Holly Miller
MBL                                               Ryan Schenk

applications@MBL

http://ligercat.ubio.org (new)

  • the user selects a journal by tag, and the application shows a tag cloud of the keywords in the selected publications
    • gene -GenBank-> tag cloud
    • keyword (e.g. a person, a concept) -PubMed-> tag cloud
    • journal -> selection of journal -PubMed-> tag cloud

http://ontospecies.ubio.org/

  • Ryan: It's an ontology browser written in JavaScript, with a Rails backend. The data is stored in MySQL, and I wrote a basic (emphasis on basic) "semantic" reasoning engine.
  • the user interface allows filtering the taxonomy by constraints
  • difficulty: "show me all the diseases related to the organism or its subclasses" (the recursive sub-class relation killed the database application)
  • Show organisms that live (longer than|between) X years. The eventual idea here is that, once we get this backed by a triple store and a real reasoning engine, you can see the tree of life as filtered through various parameters. For example: show me all the animals that live 100 years and have the gene for Alzheimer's.
  • Here it is with a Gene Ontology filter on top: http://ontospecies.ubio.org/terms/filter/go/51179_8150?
  • This is NOT a fully-baked application. It's a proof of concept at best, as well as an exercise in creating a more humane tree layout in pure CSS
  • Click any of the species in the left-hand column and you can get to a less-than-aesthetically-pleasing display of gene and disease information for that species.


http://taxatoy.ubio.org/

  • lists the number of organisms discovered per year
  • we may switch to another UI that counts organisms by age, or use a more complex query

http://aging.ubio.org

  • Biology of Aging Portal. There are twelve exemplar species on the upper left of the page.

http://aging.ubio.org/organisms/5857/genes

discussion log

  • (Cathy, Jan 30) Planning : I am going to put you in touch with Holly Miller and Ryan Schenk who have been working on the species data as related to aging. They have it in MySQL tables and the data files include data on: Species, Genes, Diseases, Location, Life Span
  • (Ryan, Feb 10) Dataset acquisition (see below #dataset)
  • (Li, Ryan) Dataset clarification
  • (Ryan, Feb 18) pointed to the Biology of Aging Portal: http://aging.ubio.org
  • (Li, Feb 25) RPI Wiki populated http://tw2.tw.rpi.edu/wiki/bot
  • (Ryan, Mar 4) new application, http://ligercat.ubio.org
  • (Li, Ryan, Holly, Mar 5) telecon to review the RPI wiki and the MBL applications. To do:
    1. (MBL) get sample queries for filtering the taxonomy in http://ontospecies.ubio.org/
    2. (RPI) connect an AllegroGraph expert with MBL
    3. (MBL) try the RDF dump generated by RPI
    4. (MBL) come up with the next task within the next week

Task 1: convert data to RDF and query it with SPARQL - Jan 30

  • Cathy: You can do what YOU want with the data, but what would be helpful to us is if you could take this data and develop a strategy for converting it into triples in a triple store, so that it can be queried using a SPARQL search engine.
  • Ryan: What would be helpful to me is if you at some point publish our data in RDF/OWL or N-Triples format.
  • Li: The wiki has been created with extra features, an RDF dump has been generated, and SPARQL has been tested.

dataset

There are six files in two categories: "dictionary" files, and "triples" files.

  • Dictionary files contain definitions of our data objects.
  • Triple files contain triples. Pretty straightforward.

As Cathy mentioned, our data is currently stored in "triples" in MySQL database tables, which I have on my machine as well as up on a server. I can write a script to export the data in any (simple) format that you need. Just let me know what you need and in what format, and we can work together on getting the data out of our database and into yours.

The figure below shows the database schema

Image:mbl_organism_data_schema.png

clarification1: duplicated name

Question: I see entries with duplicated names in disease_dictionary.csv. Is that intended?

 10564,"Neoplasms, Squamous Cell"
 9796,"Neoplasms, Squamous Cell"

Answer:

  • The terms in the disease_dictionary are Medical Subject Headings (MeSH terms) and, to the best of my knowledge, should not occur twice.
  • The following 10 diseases are bogus duplicates that can be safely deleted from disease_dictionary
disease_id, name
9042,"Neoplasms, Neuroepithelial"
9659,"Neoplasms, Connective and Soft Tissue"
9899,"Sexually Transmitted Diseases, Viral"
10119,"Fractures, Bone"
10458,"Carcinoma, Squamous Cell"
10564,"Neoplasms, Squamous Cell"
10920,"Anemia, Hemolytic"
11410,"Neoplasms, Germ Cell and Embryonal"
11948,"Genital Neoplasms, Male"
11953,"Heart Defects, Congenital"
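
The cleanup described above could be scripted roughly as follows. This is a minimal sketch, assuming disease_dictionary.csv uses the two-column disease_id,name layout shown here; the function name drop_bogus_duplicates is hypothetical, and the sample rows are the two from the question.

```python
import csv
import io

# IDs listed above as bogus duplicates in disease_dictionary.csv.
BOGUS_DISEASE_IDS = {9042, 9659, 9899, 10119, 10458, 10564, 10920, 11410, 11948, 11953}


def drop_bogus_duplicates(csv_text):
    """Return (disease_id, name) rows with the bogus duplicate IDs removed."""
    kept = []
    for row in csv.reader(io.StringIO(csv_text)):
        # Skip blank lines and a header row, if the file has one (assumption).
        if not row or not row[0].strip().isdigit():
            continue
        disease_id, name = int(row[0]), row[1]
        if disease_id not in BOGUS_DISEASE_IDS:
            kept.append((disease_id, name))
    return kept


# The two rows quoted in the question above.
sample = '10564,"Neoplasms, Squamous Cell"\n9796,"Neoplasms, Squamous Cell"\n'
cleaned = drop_bogus_duplicates(sample)
print(cleaned)
```

Only the row with disease_id 9796 survives for this sample, so each MeSH term appears once.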

clarification2: meaning of ID

Question: Are the IDs (of gene, disease, organism) domain knowledge, i.e. are they assigned by an organization such as NIH rather than generated automatically by your database? If I refer to a gene in a bioinformatics paper, is gene_id the commonly agreed identifier for that gene, with the name being more of an alias?

Answer:

  • The gene_ids that we are using are NOT assigned by an organization -- they are internal to our application. The genbank_id, when available, is assigned by NIH.
  • If you are citing a gene in a paper, please do not use our gene_ids. I suspect you could use the genbank_id if available. I would guess that you'd want to use as many of the three metadata fields (symbol, secondary symbol, genbank_id) as possible, but you'll want to ask someone more knowledgeable than I about the subject.
  • As for the diseases, you can use the name itself; the names come straight out of a controlled vocabulary called MeSH (Medical Subject Headings), so I suspect that as long as you specify that the term you're using is a MeSH Heading, that will be acceptable.
  • The Organism IDs (ubids) are from an ontology called uBiota that my former supervisor Neil Sarkar developed. Theoretically, you should be able to use the uBiota IDs.

Media:organism_dictionary.csv

  • fields
    • UBID - stands for uBiota ID, a unique id for each taxon
    • name - the scientific name of that taxon
  • example
ubid name
455440 Xenopus laevis
  • note: We have 11 species for which we have gene and disease data. Each has an average of 600 genes, and 186 diseases.
Organism ID (UBID)    Scientific Name             Number of Genes    Number of Diseases
455440                Xenopus laevis              3317               221
1521361               Corvus brachyrhynchos       7                  243
694596                Drosophila melanogaster     198                230
737968                Geochelone nigra            4                  5
308477                Myxine glutinosa            7                  20
1704605               Homo sapiens                1414               255
332224                Mus musculus                944                257
488779                Caenorhabditis elegans      412                221
844172                Ursus maritimus             1                  94
332219                Rattus norvegicus           185                257
1138320               Saccharomyces cerevisiae    98                 245
Average:                                          599                186

Media:gene_dictionary.csv

  • fields
    • gene_id - a unique id for each gene (assigned by MBL);
    • symbol - the gene's symbol,
    • secondary symbol
    • name
    • genbank_id - if available
  • example
gene_id symbol secondary symbol name genbank_id
6891 tcp1-a t-complex polypeptide 1 "Xenopus laevis t-complex polypeptide 1 (tcp1-a), mRNA"
19079 6707287
  • note: Not every gene has all four of those fields. In fact, most of them do not; the only assumption you should make is that every gene has a unique gene_id.
  • note: Be aware that some of the fields are quoted if they contain commas.
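
A minimal sketch of reading this file with the two notes above in mind. It assumes the columns appear in the order of the field list; the function name parse_gene_dictionary is hypothetical and the sample rows are made up for illustration.

```python
import csv
import io

# Column order assumed from the field list above.
GENE_FIELDS = ["gene_id", "symbol", "secondary_symbol", "name", "genbank_id"]


def parse_gene_dictionary(csv_text):
    """Parse gene dictionary rows into dicts keyed by gene_id; empty fields become None."""
    genes = {}
    # csv.reader honours quoting, so a comma inside quotes stays within one field.
    for row in csv.reader(io.StringIO(csv_text)):
        if not row or not row[0].strip().isdigit():
            continue  # skip blank lines and a header row, if present
        # Pad short rows so every record has all five keys.
        padded = row + [""] * (len(GENE_FIELDS) - len(row))
        record = {key: (value.strip() or None) for key, value in zip(GENE_FIELDS, padded)}
        record["gene_id"] = int(record["gene_id"])
        genes[record["gene_id"]] = record
    return genes


# Made-up rows for illustration only; they are not taken from gene_dictionary.csv.
sample = '1,abc1,,"Example gene, mRNA",\n2,,,,12345\n'
genes = parse_gene_dictionary(sample)
print(genes[1]["name"])
print(genes[2]["symbol"])
```

Keying the result on gene_id follows the one guarantee stated above: every gene has a unique gene_id, and nothing else should be relied on.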



Media:disease_dictionary.csv

  • fields
    • disease_id - a unique id for each disease (assigned by MBL)
    • name - the name of the disease
  • example
disease_id name
9042 "Neoplasms, Neuroepithelial"
  • note: Be aware that some of the disease names contain commas; those are surrounded by quotes.



Media:taxonomy_triples.csv

  • fields
    • subject - ubid
    • relation
    • object - ubid
  • example
subject relation object
2 (Animalia UBID) is_parent_of 92 (Chordata UBID)
  • note: this table contains only the sub-class relation (is_parent_of)
  • note: From this file you can construct the taxonomic tree for our 11 species, from Kingdom all the way down to Species
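
Constructing the tree from these rows could look roughly like this. It is a sketch under the assumption that the rows form a proper tree (one parent per taxon, no cycles); the helper names are hypothetical, and apart from the 2 -> 92 row from the example above the UBIDs are placeholders.

```python
from collections import defaultdict


def build_tree(triples):
    """Index (subject, relation, object) rows by parent and by child."""
    children = defaultdict(list)
    parent = {}
    for subject, relation, obj in triples:
        if relation == "is_parent_of":
            children[subject].append(obj)
            parent[obj] = subject
    return children, parent


def lineage(ubid, parent):
    """Walk up from a taxon to the root; returns the ancestors, nearest first."""
    ancestors = []
    while ubid in parent:
        ubid = parent[ubid]
        ancestors.append(ubid)
    return ancestors


def descendants(ubid, children):
    """Collect every taxon below ubid using an explicit stack instead of recursion."""
    found, stack = [], list(children.get(ubid, []))
    while stack:
        node = stack.pop()
        found.append(node)
        stack.extend(children.get(node, []))
    return found


# 2 -> 92 is the example row above; 900001 and 900002 are placeholder UBIDs.
triples = [(2, "is_parent_of", 92),
           (92, "is_parent_of", 900001),
           (92, "is_parent_of", 900002)]
children, parent = build_tree(triples)
print(lineage(900001, parent))
print(sorted(descendants(2, children)))
```

Holding the whole parent/child index in memory keeps the lineage and subtree lookups cheap, which sidesteps the repeated recursive sub-class queries against the relational tables.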

Media:gene_triples.csv

  • fields
    • subject - ubid
    • relation
    • object - gene_id
  • example
subject relation object
2 (Animalia UBID) has_gene ...
  • note: Records which species (UBID) have which genes (gene_id)

Media:disease_triples.csv

  • fields
    • subject - ubid
    • relation
    • object - disease_id
  • example
subject relation object
2 (Animalia UBID) has_disease ...
  • note: Records which species (UBID) are afflicted by which diseases (disease_id)
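
Because the three triples files share the subject/relation/object layout, exporting them as N-Triples is mostly string formatting. The sketch below assumes a placeholder namespace (http://example.org/mbl/) and a hypothetical URI scheme; neither comes from the project.

```python
# Hypothetical base URI; replace with whatever namespace the project settles on.
BASE = "http://example.org/mbl/"

# Which kind of ID each relation's object is, per the file descriptions above.
OBJECT_KIND = {"is_parent_of": "organism", "has_gene": "gene", "has_disease": "disease"}


def to_ntriple(subject, relation, obj):
    """Render one (ubid, relation, id) row as a single N-Triples line."""
    kind = OBJECT_KIND[relation]  # raises KeyError on an unknown relation
    return "<{0}organism/{1}> <{0}relation/{2}> <{0}{3}/{4}> .".format(
        BASE, subject, relation, kind, obj
    )


# The taxonomy example row from above: Animalia (2) is_parent_of Chordata (92).
line = to_ntriple(2, "is_parent_of", 92)
print(line)
```

Minting separate URI paths for organisms, genes and diseases keeps the three ID spaces from colliding, since a gene_id and a disease_id can share the same number.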


examples

For example, the data combines the taxonomy of each species with its genes, diseases, location and longevity.

  • Geochelone nigra (Galapagos tortoise)
    • Taxonomy: Animalia:Chordata:Reptilia:Testudines:Testudinidae:Geochelone:Geochelone nigra
    • Lifespan: 177 years
    • Diseases: Coccidiosis, Protozoan Infections, Gastrointestinal Diseases, Intestinal Diseases, Bacterial Infections
    • Genes: COX1, Cmos
    • Location: unknown (we have not yet gotten all the geolocation data in for this one)
  • Mus musculus
    • Taxonomy: Animalia:Chordata:Mammalia:Rodentia:Muridae:Mus:Mus musculus
    • Lifespan: 6 years
    • Location: Mediterranean Region, China, As
    • Diseases: Dementia, Alzheimer Disease, Memory Disorders, etc.
    • Genes: ERCC2, WRN, SPDd, etc.


RPI wiki

We imported the data into a wiki: http://tw2.tw.rpi.edu/wiki/bot

  • each page corresponds to an organism, disease, or gene and includes its direct semantic annotations. A page can
    • display an HTML view
    • export an RDF/XML dump
  • we can generate an RDF/XML dump for the entire dataset
# Test case: list every resource typed as a page in the Gene category
#
PREFIX rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX wiki: <http://tw2.tw.rpi.edu/wiki/r2d2/>
SELECT ?org
FROM <http://tw.rpi.edu/2009/03/mbl_organism.rdf>
WHERE { ?org rdf:type wiki:Category-3AGene }

Task 2: triple store backend - March 14

Requirements

Essentially, what I'd like to do is be able to power the OntoSpecies taxonomic tree browser with the triple store, rather than the hacky relational setup I have now. Most of the queries below reflect functions of OntoSpecies.

  • Given a disease, show me the full taxonomic hierarchy (Kingdom, Phylum, Class ... Species) of every organism that is afflicted with that disease
  • At any level of the taxonomic hierarchy, (recursively) show me all the diseases and genes associated with all of its children, e.g. show me all the genes and diseases for all the species that are members of phylum Chordata
  • Given a taxon/species, show all the parents (i.e. the lineage)
  • Given a taxon/species, show all the direct children
  • Only species (tips of branches) have lifespan values; we would like to present aggregated lifespan info for higher-level taxa, for example maximum lifespan, minimum lifespan, average lifespan, lifespan range
  • Similar to the above, only species have genes, but the gene list for a parent of that species should include the genes from all the children of that node
  • Would it be possible to look at aggregated genes and diseases for two taxa (e.g. Primates and Rodentia) and do some comparisons, for example: which diseases do both have, and which diseases does one have that the other doesn't?
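
The aggregation and comparison requirements above can be sketched in plain Python before committing to a particular triple store. This is an illustration only: the function names are hypothetical, taxon names stand in for UBIDs, the toy tree skips intermediate ranks, the disease sets are made up, and the lifespan mapping is an assumed extra input.

```python
def leaves_under(taxon, children):
    """Return the species (nodes without children) at or below taxon."""
    leaves, stack = [], [taxon]
    while stack:
        node = stack.pop()
        kids = children.get(node, [])
        if kids:
            stack.extend(kids)
        else:
            leaves.append(node)
    return leaves


def aggregate(taxon, children, per_species):
    """Union the per-species sets (genes or diseases) of every species under taxon."""
    result = set()
    for species in leaves_under(taxon, children):
        result |= per_species.get(species, set())
    return result


def lifespan_summary(taxon, children, lifespan):
    """Min, max, mean and range over the species under taxon that have a lifespan."""
    values = [lifespan[s] for s in leaves_under(taxon, children) if s in lifespan]
    if not values:
        return None
    return {"min": min(values), "max": max(values),
            "mean": sum(values) / len(values), "range": max(values) - min(values)}


def compare(taxon_a, taxon_b, children, per_species):
    """Split the aggregated sets into shared, only-in-a and only-in-b."""
    a = aggregate(taxon_a, children, per_species)
    b = aggregate(taxon_b, children, per_species)
    return {"both": a & b, "only_a": a - b, "only_b": b - a}


# Toy data: names stand in for UBIDs and the disease IDs d1..d3 are made up.
children = {"Mammalia": ["Primates", "Rodentia"],
            "Primates": ["Homo sapiens"],
            "Rodentia": ["Mus musculus", "Rattus norvegicus"]}
diseases = {"Homo sapiens": {"d1", "d2"},
            "Mus musculus": {"d2", "d3"},
            "Rattus norvegicus": {"d3"}}
lifespan = {"Mus musculus": 6}  # years, the figure quoted in the examples section

print(compare("Primates", "Rodentia", children, diseases))
print(lifespan_summary("Rodentia", children, lifespan))
```

In a triple store the same traversal would normally be expressed as a transitive query over is_parent_of; the in-memory version is mainly useful for checking query results against a known answer.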

Semantic Wiki Prototype

All requirements, except lifespan (which is not available in the provided data), are supported by the Semantic Wiki (see http://tw2.tw.rpi.edu/wiki/bot). - Li 23:31, 16 March 2009 (EDT)
