Dynamic Semantic Metadata on Web Resources in Biomedicine

Tim Clark
Director of Informatics, Massachusetts General Hospital, Boston MA, USA

Tuesday, May 3rd, 2011
Advanced Semantic Technologies Class
Winslow Building (105 8th Street)
Room 1140, 4-5 PM

Biomedical research foregrounds a number of Web data integration challenges seemingly well-adapted to applications of semantic Web technologies. However, experience has shown that a "pure" semantic Web approach to biomedical science may be doomed to failure. We will discuss the reasons for this, and present an alternative hybrid approach in which semantic Web data is "injected" into the scientific communications ecosystem by activities of scientific researchers themselves. Finally we will look at some other applications of this suite of technology patterns in other scientific disciplines and present challenges for future work.


Tim Clark is Director of Informatics at the Mass General Institute for Neurodegenerative Disease (MIND) and an Instructor in Neurology at Harvard Medical School. His research group is based in the neurology department of Massachusetts General Hospital, and its members explore new applications for neuroinformatics, Semantic Web, and social computing.

Recent projects have included the SWAN Alzheimer Disease Knowledge Base; StemBook, an online review of stem cell biology; the SWAN scientific discourse ontology; the PD Online Research web community for Parkinson's Disease researchers; the Domeo semantic web document annotation framework; AO, the web document Annotation Ontology; and the forthcoming Pain Research Forum scientific web community.

Tim was formerly vice president of informatics at Millennium Pharmaceuticals, where his team built one of the first integrated bio- and chemi-informatics software platforms in the pharmaceutical industry. He began his career in life science informatics at the National Center for Biotechnology Information (NCBI), where he led the database development team for GenBank.

Class Notes

writing - one of the oldest forms of semantic communication
Rosetta Stone

ways of representing data on the web in formal ways

catch 22 - why you need a hybrid approach
web 3.0 integrates data and documents

these diseases…. these are all complex disorders, and one thing they have in common ___

need to integrate findings from thousands of researchers across these diseases (a lot of labs and a lot of publications)
Major breakthrough Brain
by the time symptoms occur - it's already at the end stage. patient has lost a large number of neurons
want to detect before - to stop the neuron death process
at a stage MCI - mild cognitive impairment
Alzheimer's Disease Neuroimaging Initiative (ADNI) - 218 subjects, radiolabel bound to a protein - little fragments that condense into the plaque are toxic (solution phase problem)

  • can use CSF (cerebrospinal fluid) to diagnose early onset Alzheimer's
  • these images are statistical overlays, not one subject, so it was a chore to integrate these images

need to be able to do this in real time, and not build specific new platforms for it

now want to see if these disorders have anything in common
looking at relations among the neurodegenerative diseases -

  • but, not how people publish their data,

Catch 22
we want to organize all our facts
there are no facts i
all we have are assertions supported by the primary data

lots of advances in the publishing of this material, and it is available on the web, but it is available as PDF

how do you inject a semantic aspect into the paper/discourse
want to exploit the structure you see and tie them back into the documents

extract the proto-truth into these arguments

hundreds of years of work on terminology - complex

automatic annotations - to be able to publish and share in the community as they wish

annotation ontology - linking schema between a doc fragment and a term expressed as a URI (OWL-DL)
SWAN AF - manages the documents (collab w/ text mining people)

AO is based on the Annotea project
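
A minimal sketch of the linking idea in the notes above: an annotation ties a document fragment to a term identified by a URI, and records who made it. The class, the field names, and the example.org URIs are hypothetical illustrations, not the actual AO vocabulary.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    """Hypothetical record linking a document fragment to a term URI."""
    document_uri: str  # the web document being annotated
    fragment: str      # the text the annotation is about
    term_uri: str      # the ontology term the fragment is linked to
    creator: str       # provenance: the person or tool that made the link


def terms_for(annotations, document_uri):
    """Return the set of term URIs attached to one document."""
    return {a.term_uri for a in annotations if a.document_uri == document_uri}


# Placeholder data: two documents, annotated by a curator and a text miner.
annotations = [
    Annotation("http://example.org/paper1", "amyloid beta",
               "http://example.org/terms/ABeta", "curator-1"),
    Annotation("http://example.org/paper1", "tau",
               "http://example.org/terms/Tau", "text-miner"),
    Annotation("http://example.org/paper2", "amyloid beta",
               "http://example.org/terms/ABeta", "text-miner"),
]

paper1_terms = terms_for(annotations, "http://example.org/paper1")
```

Because the term is a URI and not a free-text label, the same concept can be recognized across documents even when the surrounding wording differs.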

term localization and curation - can manually curate - hover over and it gives provenance, but can annotate
term localization in text - pre and postfix and offset in publication [need both because web documents have different formats]
can also select a polygon in an image
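
One way to picture why both the prefix/postfix and the offset are kept: the offset is a fast first guess, and the surrounding text is the fallback when the same publication is served in a different format. The sketch below is an assumption about how such a lookup could work, not the Domeo or AO implementation; `TextSelector` and `locate` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TextSelector:
    prefix: str   # text immediately before the annotated span
    exact: str    # the annotated span itself
    postfix: str  # text immediately after the annotated span
    offset: int   # character offset of `exact` in the text first annotated


def matches_at(text: str, sel: TextSelector, start: int) -> bool:
    """True if `exact` sits at `start` with the expected prefix and postfix."""
    end = start + len(sel.exact)
    return (
        start >= 0
        and text[start:end] == sel.exact
        and text[:start].endswith(sel.prefix)
        and text[end:].startswith(sel.postfix)
    )


def locate(text: str, sel: TextSelector) -> Optional[int]:
    """Return the start index of the annotated span in `text`, or None."""
    # First try the stored offset; it works when the format is unchanged.
    if matches_at(text, sel, sel.offset):
        return sel.offset
    # Otherwise search for the span by its surrounding context.
    pos = text.find(sel.prefix + sel.exact + sel.postfix)
    if pos == -1:
        return None
    return pos + len(sel.prefix)


sel = TextSelector(prefix="Deposits of ", exact="amyloid beta",
                   postfix=" are found", offset=12)
original = "Deposits of amyloid beta are found in plaques."
reformatted = "<p>Deposits of amyloid beta are found in plaques.</p>"

start_original = locate(original, sel)        # offset still valid
start_reformatted = locate(reformatted, sel)  # found via prefix/postfix
```

In the reformatted copy the stored offset no longer points at the span, so the context search is what recovers it.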

GWT (?)

Hypothesis management - relevant in drug development. 10,000 projects get started and 1 comes. high failure rate.

Mons/Groth model of a nanopublication - context, provenance, subject, ...
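
A rough sketch of a nanopublication-like record, using only the parts named in the notes (context, provenance, subject). Splitting the statement into subject/predicate/object is an assumption made here for illustration, and every name and value below is hypothetical; this is not the published Mons/Groth schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Nanopub:
    """Hypothetical minimal record: one statement plus who said it and where."""
    subject: str     # what the statement is about
    predicate: str   # assumed field: the relation being asserted
    obj: str         # assumed field: the other end of the relation
    context: str     # conditions under which the statement is claimed to hold
    provenance: str  # source of the statement (lab, paper, curator)


def sources_for(nanopubs, subject, predicate, obj):
    """List the provenance of every record asserting the same statement."""
    return sorted(
        n.provenance
        for n in nanopubs
        if (n.subject, n.predicate, n.obj) == (subject, predicate, obj)
    )


# Placeholder hypotheses from two hypothetical sources.
nanopubs = [
    Nanopub("protein-X", "is-target-for", "disease-Y", "mouse model", "lab-A"),
    Nanopub("protein-X", "is-target-for", "disease-Y", "cell culture", "lab-B"),
    Nanopub("protein-Z", "is-target-for", "disease-Y", "mouse model", "lab-A"),
]

support = sources_for(nanopubs, "protein-X", "is-target-for", "disease-Y")
```

Keeping provenance on each record means a hypothesis stays an attributed assertion and is never silently promoted to a fact.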

using drug development in pharma - use hypothesis to add to knowledge base

What's the incentive for pharma to share nanopubs of hypotheses? - internally they are now more motivated. they have had a terrible time coming up with new drugs -

  • right, but the problem is not a lack of targets, it's the high costs at the later end of the drug development pipeline (Phase 1, Phase 2, human trials)

approx 150 different types of lung cancer -
looking for concepts and tools when doing research

    • can we learn from the dirt in data? and can we use that to clean it (or is there info content in that, so maybe it shouldn't be clean). maybe there's a way to capture or deal with the different types of dirt - not all dirt is the same.

Is the KB of a size at which you can do large scale consistency checking?

What's the multiplicative factor (of making the annotations first class citizens) - in provenance typically say 10:1

Different types of citations - supporting, inconsistent, .. please think about this, i cited this guy - is as fine grained as they got.

