MBL-RPI
From Tetherless World Wiki
people
| affiliation | name |
| RPI, Professor, Tetherless World Constellation | Peter Fox |
| RPI | Li Ding |
| MBL, Director MBLWHOI Library | Cathy Norton |
| MBL | Holly Miller |
| MBL | Ryan Schenk |
applications@MBL
http://ligercat.ubio.org (new)
- select a journal by tag; once the user selects a journal, show a tag cloud of keywords from the selected publications
- gene -GenBank-> tag cloud
- keyword (e.g. a person, a concept) -PubMed-> tag cloud
- journal -> selection of journal -PubMed->
- Ryan: It's an ontology browser written in Javascript, with a Rails backend. The data is stored in MySQL, and I wrote a basic (emphasis on basic) "semantic" reasoning engine.
- the user interface allows filtering the taxonomy by constraints
- difficulty: "show me all the diseases related to the organism or its subclasses" (the recursive sub-class relation killed the database application)
- Show organisms that live (longer than|between) X years. The eventual idea here is that, once we get this backed by a triple store and a real reasoning engine, you can see the tree of life as filtered through various parameters. E.g.: Show me all the animals that live 100 years and have the gene for Alzheimer's.
- Here it is with a Gene Ontology filter on top: http://ontospecies.ubio.org/terms/filter/go/51179_8150?
- This is NOT a fully-baked application. It's a proof of concept at best, as well as an exercise in creating a more humane tree layout in pure CSS
- Click any of the species in the left-hand column and you can get to a less-than-aesthetically-pleasing display of gene and disease information for that species.
- listing the number of organisms discovered by year
- we may switch to another UI that counts organisms by age, or uses a more complex query
- Biology of Aging Portal, there are twelve exemplar species on the upper left of the page.
http://aging.ubio.org/organisms/5857/genes
- this is equivalent to http://tw2.tw.rpi.edu/wiki/bot/index.php/Category:Muridae, while the inference is not yet done
discussion log
- (Cathy, Jan 30) Planning : I am going to put you in touch with Holly Miller and Ryan Schenk who have been working on the species data as related to aging. They have it in MySQL tables and the data files include data on: Species, Genes, Diseases, Location, Life Span
- (Ryan, Feb 10) Dataset acquisition (see below #dataset)
- (Li, Ryan) Dataset clarification
- (Ryan, Feb 18) pointed to the Biology of Aging Portal: http://aging.ubio.org
- (Li, Feb 25) RPI Wiki populated http://tw2.tw.rpi.edu/wiki/bot
- (Ryan, Mar 4) new application, http://ligercat.ubio.org
- (Li, Ryan, Holly, Mar 5) telecon: reviewed the RPI wiki and MBL applications. Todo:
  1. (MBL) get sample queries for filtering the taxonomy in http://ontospecies.ubio.org/
  2. (RPI) connect an AllegroGraph expert with MBL
  3. (MBL) try the RDF dump generated by RPI
  4. (MBL) come up with the next task within the next week
Task 1: convert data to RDF and make it queryable with SPARQL - Jan 30
- Cathy: You can do what YOU want with the data, but what would be helpful to us is if you could take this data and develop a strategy for converting it into a triple store so that it can be queried using a SPARQL search engine.
- Ryan: What would be helpful to me is if you at some point publish our data in RDF/OWL or N-Triples format.
- Li: wiki has been created with extra features, RDF dump has been generated, SPARQL was tested.
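As a rough illustration of what "publishing the data as N-Triples" could look like, here is a minimal sketch that renders (subject, relation, object) rows as N-Triples lines. The namespace `http://example.org/mbl/` and the URI layout (`organism/`, `gene/`, `disease/`) are assumptions made for this example only; they are not the URIs used by the RPI wiki or by MBL. The relation names are the ones listed in the dataset section.

```python
# Hypothetical namespace for illustration; not taken from the MBL or RPI systems.
BASE = "http://example.org/mbl/"

# Which kind of identifier the object column holds for each relation.
OBJECT_KIND = {
    "is_parent_of": "organism",
    "has_gene": "gene",
    "has_disease": "disease",
}


def to_ntriple(subject_ubid, relation, object_id):
    """Render one (subject, relation, object) row as a single N-Triples line."""
    kind = OBJECT_KIND[relation]  # raises KeyError on an unknown relation
    return "<{b}organism/{s}> <{b}{r}> <{b}{k}/{o}> .".format(
        b=BASE, s=subject_ubid, r=relation, k=kind, o=object_id
    )


# The Animalia -> Chordata row from the dataset description.
rows = [(2, "is_parent_of", 92)]
lines = [to_ntriple(*row) for row in rows]
print("\n".join(lines))
```

Every subject in the three triple files is a UBID, so only the object needs a per-relation lookup.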
dataset
There are six files in two categories: "dictionary" files, and "triples" files.
- Dictionary files contain definitions of our data objects.
- Triple files contain triples. Pretty straightforward.
As Cathy mentioned, our data is currently stored in "triples" in MySQL database tables, which I have on my machine as well as up on a server. I can write a script to export the data in any (simple) format that you need. Just let me know what you need and in what format, and we can work together on getting the data out of our database and into yours.
The figure below shows the database schema
clarification1: duplicated name
Question: I see entries with duplicated name in disease_dictionary.csv, is that intended?
10564,"Neoplasms, Squamous Cell"
9796,"Neoplasms, Squamous Cell"
Answer:
- The terms in the disease_dictionary are Medical Subject Headings (MeSH terms) and, to the best of my knowledge, should not occur twice.
- The following 10 diseases are bogus duplicates that can be safely deleted from disease_dictionary
disease_id, name
9042,"Neoplasms, Neuroepithelial"
9659,"Neoplasms, Connective and Soft Tissue"
9899,"Sexually Transmitted Diseases, Viral"
10119,"Fractures, Bone"
10458,"Carcinoma, Squamous Cell"
10564,"Neoplasms, Squamous Cell"
10920,"Anemia, Hemolytic"
11410,"Neoplasms, Germ Cell and Embryonal"
11948,"Genital Neoplasms, Male"
11953,"Heart Defects, Congenital"
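A duplicate check like the one behind this question can be sketched in a few lines of Python. This assumes disease_dictionary.csv has a `disease_id,name` header row, which is an assumption about the file layout; the sample rows are the ones quoted in the question plus one more entry from the dictionary.

```python
import csv
import io
from collections import defaultdict

# Stand-in for disease_dictionary.csv; the header row is assumed.
sample = io.StringIO(
    'disease_id,name\n'
    '9796,"Neoplasms, Squamous Cell"\n'
    '10564,"Neoplasms, Squamous Cell"\n'
    '9042,"Neoplasms, Neuroepithelial"\n'
)


def find_duplicate_names(handle):
    """Map each disease name that occurs more than once to its disease_ids."""
    ids_by_name = defaultdict(list)
    # skipinitialspace tolerates a header written as "disease_id, name".
    for row in csv.DictReader(handle, skipinitialspace=True):
        ids_by_name[row["name"]].append(int(row["disease_id"]))
    return {name: ids for name, ids in ids_by_name.items() if len(ids) > 1}


duplicates = find_duplicate_names(sample)
print(duplicates)
```

Using the `csv` module rather than splitting on commas matters here, because the quoted MeSH names themselves contain commas.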
clarification2: meaning of ID
Question: Are the IDs (of gene, disease, organism) domain knowledge, i.e., are they assigned by an organization such as NIH rather than automatically generated by your database? If I refer to a gene in a bioinformatics paper, is the gene_id commonly agreed to refer to that gene, with the name being merely an alias?
Answer:
- The gene_ids that we are using are NOT assigned by an organization -- they are internal to our application. The genbank_id, when available, is assigned by NIH.
- If you are citing a gene in a paper, please do not use our gene_ids. I suspect you could use the genbank_id if available. I would guess that you'd want to use as much of the three metadata fields (symbol, secondary symbol, genbank_id) as are possible, but you'll want to ask someone more knowledgeable than I about the subject.
- As for the diseases, you can use the name itself. The names come straight out of a controlled vocabulary called MeSH (Medical Subject Headings), so I suspect that as long as you specify that the term you're using is a MeSH Heading, that will be acceptable.
- The Organism IDs (UBIDs) are from an ontology called uBiota, which my former supervisor Neil Sarkar developed. Theoretically, you should be able to use the uBiota IDs.
- fields
- UBID - (stands for uBiota ID -- a unique id for each taxon)
- name - the scientific name of that taxon.
- example
| ubid | name |
| 455440 | Xenopus laevis |
- note: We have gene and disease data for 11 species. On average, each has about 599 genes and 186 diseases, but the counts vary widely between species.
| Organism ID (UBID) | Scientific Name | Number of Genes | Number of Diseases |
| 455440 | Xenopus laevis | 3317 | 221 |
| 1521361 | Corvus brachyrhynchos | 7 | 243 |
| 694596 | Drosophila melanogaster | 198 | 230 |
| 737968 | Geochelone nigra | 4 | 5 |
| 308477 | Myxine glutinosa | 7 | 20 |
| 1704605 | Homo sapiens | 1414 | 255 |
| 332224 | Mus musculus | 944 | 257 |
| 488779 | Caenorhabditis elegans | 412 | 221 |
| 844172 | Ursus maritimus | 1 | 94 |
| 332219 | Rattus norvegicus | 185 | 257 |
| 1138320 | Saccharomyces cerevisiae | 98 | 245 |
| Average | | 599 | 186 |
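The averages in the last row can be reproduced from the per-species counts in the table; a quick check, using only the numbers shown above:

```python
from statistics import mean

# Per-species counts, in the same order as the table rows.
genes = [3317, 7, 198, 4, 7, 1414, 944, 412, 1, 185, 98]
diseases = [221, 243, 230, 5, 20, 255, 257, 221, 94, 257, 245]

avg_genes = round(mean(genes))        # 6587 / 11
avg_diseases = round(mean(diseases))  # 2048 / 11
print(avg_genes, avg_diseases)
```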
- fields
- gene_id - a unique id for each gene (assigned by MBL);
- symbol - the gene's symbol,
- secondary symbol
- name
- genbank_id - if available
- example
| gene_id | symbol | secondary symbol | name | genbank_id |
| 6891 | tcp1-a | t-complex polypeptide 1 | "Xenopus laevis t-complex polypeptide 1 (tcp1-a), mRNA" | |
| 19079 | | | | 6707287 |
- note: Not every gene has all four of those fields. In fact, most of them do not; the only assumption you should make is that every gene has a unique gene_id.
- note: Be aware that some of the fields are quoted if they contain commas.
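The two notes above (optional fields, quoted values containing commas) suggest reading the gene file with a real CSV parser and treating empty cells as missing. A minimal sketch, assuming the columns appear in the order listed under "fields" and that the file has no header row (both are assumptions about the export):

```python
import csv
import io

# Assumed column order, following the field list above.
FIELDS = ["gene_id", "symbol", "secondary_symbol", "name", "genbank_id"]


def parse_genes(handle):
    """Parse gene rows; empty cells become None, gene_id becomes an int."""
    genes = []
    for row in csv.reader(handle):
        if not row:
            continue  # skip blank lines
        row = row + [""] * (len(FIELDS) - len(row))  # pad short rows
        record = {key: (value.strip() or None) for key, value in zip(FIELDS, row)}
        record["gene_id"] = int(record["gene_id"])  # the only guaranteed field
        genes.append(record)
    return genes


# The example row from the table above; genbank_id is empty.
sample = io.StringIO(
    '6891,tcp1-a,t-complex polypeptide 1,'
    '"Xenopus laevis t-complex polypeptide 1 (tcp1-a), mRNA",\n'
)
genes = parse_genes(sample)
print(genes[0])
```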
- fields
- disease_id: a unique id for each disease (assigned by MBL)
- name of the disease.
| disease_id | name |
| 9042 | "Neoplasms, Neuroepithelial" |
- note: Be aware that some of the disease names contain commas; those are surrounded by quotes.
- fields
- subject - ubid
- relation
- object - ubid
- example
| subject | relation | object |
| 2 (Animalia UBID) | is_parent_of | 92 (Chordata UBID) |
- note: this table contains only the sub-class relation (is_parent_of)
- note: From this file you can construct the taxonomic tree for our 11 species, from Kingdom all the way down to Species
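Constructing the tree from these rows can be sketched as follows. The first triple is the Animalia/Chordata example above and the two species UBIDs come from the species table; the intermediate UBID 9001 is a hypothetical placeholder standing in for the ranks between Chordata and the species. The sketch also assumes every taxon has at most one parent.

```python
from collections import defaultdict

triples = [
    (2, "is_parent_of", 92),         # Animalia -> Chordata (example above)
    (92, "is_parent_of", 9001),      # 9001 is a hypothetical intermediate taxon
    (9001, "is_parent_of", 332224),  # -> Mus musculus
    (9001, "is_parent_of", 332219),  # -> Rattus norvegicus
]


def build_tree(rows):
    """Return (children, parent) lookup tables built from is_parent_of rows."""
    children = defaultdict(list)
    parent = {}
    for subject, relation, obj in rows:
        if relation != "is_parent_of":
            continue
        children[subject].append(obj)
        parent[obj] = subject  # assumes a single parent per taxon
    return children, parent


def lineage(ubid, parent):
    """Walk up from a taxon to the root; returns UBIDs from root to taxon."""
    path = [ubid]
    while path[-1] in parent:
        path.append(parent[path[-1]])
    return path[::-1]


children, parent = build_tree(triples)
print(lineage(332224, parent))
print(children[9001])
```

The `parent` table answers lineage questions and the `children` table answers "direct children" questions, so both directions are a dictionary lookup rather than a recursive join.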
- fields
- subject - ubid
- relation
- object - gene_id
- example
| subject | relation | object |
| 2 (Animalia UBID) | has_gene | ... |
- note: Records which species (UBID) have which genes (gene_id)
- fields
- subject - ubid
- relation
- object - disease_id
- example
| subject | relation | object |
| 2 (Animalia UBID) | has_disease | ... |
- note: Records which species (UBID) are afflicted by which diseases (disease_id)
examples
For example, there is information on the taxonomy of each species along with its genes, diseases, location and longevity.
Taxonomy: Animalia:Chordata:Reptilia:Testudines:Testudinidae:Geochelone:Geochelone nigra (Galápagos tortoise). Lifespan: 177 years. Diseases: Coccidiosis, Protozoan Infections, Gastrointestinal Diseases, Intestinal Diseases, Bacterial Infections. Genes: COX1, Cmos. Location: unknown, since we have not yet gotten all the geolocation data in for this one.
Taxonomy: Animalia:Chordata:Mammalia:Rodentia:Muridae:Mus:Mus musculus Lifespan 6 years Lives in Location: Mediterranean Region, China, As Diseases: Dementia, Alzheimer Disease, Memory Disorders, etc. Gene: ERCC2, WRN, SPDd, etc.
RPI wiki
we imported the data into a wiki, http://tw2.tw.rpi.edu/wiki/bot
- each page corresponds to an organism, disease, or gene and includes its direct semantic annotations. Each page can
- display HTML view
- export RDF/XML dump
- we can generate an RDF/XML dump for the entire dataset
- a dataset dump has been generated http://tw.rpi.edu/2009/03/mbl_organism.rdf
- we can run a SPARQL query (here: search for all genes) using an online SPARQL service http://onto.rpi.edu/alpha/sparql/
# Test case: list every resource typed as a Gene in the RDF dump
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX wiki: <http://tw2.tw.rpi.edu/wiki/r2d2/>
SELECT ?org
FROM <http://tw.rpi.edu/2009/03/mbl_organism.rdf>
WHERE { ?org rdf:type wiki:Category-3AGene }
Task 2 triple store backend - March 14
Requirements
Essentially, what I'd like to do is be able to power the OntoSpecies taxonomic tree browser with the triple store, rather than the hacky relational setup I have now. Most of the queries below reflect functions of OntoSpecies.
- Given a disease, show me the full taxonomic hierarchy (Kingdom, Phylum, Class ... Species) of every organism that is afflicted with that disease
- At any level of the taxonomic hierarchy, (recursively) show me all the diseases and genes associated with all of its children, e.g., show me all the genes and diseases for all the species that are members of phylum Chordata
- Given a taxon/species, show all the parents (i.e. the lineage)
- Given a taxon/species, show all the direct children
- Only species (tips of branches) have lifespan values; we would like to present aggregated lifespan info for higher-level taxa, for example maximum lifespan, minimum lifespan, average lifespan, lifespan range
- Similarly, only species have genes, but the gene list for a parent of that species should include the genes from all the children of that node
- Would it be possible to look at aggregated genes and diseases for two orders (e.g. Primates and Rodentia) and do some comparisons, for example, what diseases do both have, and what diseases does one have that the other doesn't?
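The recursive aggregation and comparison requirements above can be sketched in memory as follows. This is only an illustration of the logic, not how the Semantic Wiki or a triple store would implement it. The hierarchy is a toy one keyed by taxon name with intermediate ranks omitted, the disease codes D1 to D4 are hypothetical placeholders, and the two lifespan values are the ones quoted in the examples section.

```python
def descendants(root, children):
    """Return root plus every taxon below it (iterative, so deep trees are safe)."""
    seen, stack = set(), [root]
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        stack.extend(children.get(node, ()))
    return seen


def aggregate(root, children, facts):
    """Union the per-species sets (genes or diseases) found under root."""
    result = set()
    for node in descendants(root, children):
        result |= facts.get(node, set())
    return result


def lifespan_summary(root, children, lifespan):
    """Min, max and mean lifespan over the species under root, or None."""
    values = [lifespan[n] for n in descendants(root, children) if n in lifespan]
    if not values:
        return None
    return {"min": min(values), "max": max(values), "mean": sum(values) / len(values)}


# Toy hierarchy (intermediate ranks omitted) and placeholder disease codes.
children = {
    "Chordata": ["Rodentia", "Testudines"],
    "Rodentia": ["Mus musculus", "Rattus norvegicus"],
    "Testudines": ["Geochelone nigra"],
}
diseases = {
    "Mus musculus": {"D1", "D2"},
    "Rattus norvegicus": {"D2", "D3"},
    "Geochelone nigra": {"D3", "D4"},
}
lifespan = {"Mus musculus": 6, "Geochelone nigra": 177}  # years, from the examples

rodents = aggregate("Rodentia", children, diseases)
turtles = aggregate("Testudines", children, diseases)
shared = rodents & turtles          # diseases both groups have
only_rodents = rodents - turtles    # diseases only the first group has
summary = lifespan_summary("Chordata", children, lifespan)
print(sorted(shared), sorted(only_rodents), summary)
```

Once the per-taxon sets are aggregated, the comparisons in the last requirement reduce to plain set intersection and difference.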
Semantic Wiki Prototype
All requirements (except lifespan, which is not available in the provided data) are supported by the Semantic Wiki (see http://tw2.tw.rpi.edu/wiki/bot) - ( Li 23:31, 16 March 2009 (EDT) )

