Semantic eScience Meeting March 10, 2014


General Meeting Information

  • This Pad
  • Previous Meeting
  • Call-in information
    • Goto Meeting
      • Dial +1 (805) 309-0012
      • Access Code: 776-009-689
      • Audio PIN: Shown after joining the meeting
      • Meeting ID: 776-009-689


Agenda

  • Jin’s talk about his thesis work


Attendees

  • Deborah will join remotely for a portion of this meeting
  • Linyun Fu
  • Han Wang
  • Jin Zheng
  • Xixi Luo
  • Anirudh Prabhu
  • Corey Li
  • Marshall X Ma
  • Massimo di Stefano (remote)

Past Action Items

Action Items

  • Jin to upload his slides and send the link to the group.


Keep this list from week to week so we know who’s presented and who will present

  • Please sign up if you have a good topic to share with others.
  • Feb. 3, 2014 - Introductions, no presentation
  • Feb. 24, 2014 - Linyun Fu - Data theory (Wickett et al.’s and Mealy’s papers)
  • Mar. 10, 2014 - Jin Zheng - thesis
  • Mar. 24, 2014 - Yu Chen - TBD
  • Apr. 14, 2014 - Patrick West - ToolMatch
  • Apr. 28, 2014 - Peter Fox - event calculus


Notes

  • Jin’s thesis work
    • semantic similarity: definition (similar in meaning) and problem (how to compute similarity scores)
    • applications: entity matching and entity recognition
    • approach: similarity calculation based on the information entropy and weighted similarity (IEWS) model
    • challenges on the web of data: 1) two entities describing same information with different properties; 2) same information structured differently; 3) unusable extra information; 4) scalability
    • advantages: well-structured and linked
    • assumptions: no cross language matching, no conflicting descriptions, descriptions are complete, similar entities have similar literal descriptions
    • similarity calculation based on string (lexical) similarity, property matching and semantic content collection
    • introducing information entropy
      • related to number of possible values of a property
    • importance of property
      • we use weight to measure the importance
      • weight learning problem -- reduced to binary classification problem
    • entity matching problem -- instance and ontology matching
      • workflow: block, compute and select
      • entities sharing a common keyword in their description contents belong to the same block
      • inside each block, perform pair-wise similarity calculation
      • threshold-based or top-k match selection
    • surface form base from the Billion Triple Challenge 2009 dataset to test entity recognition
      • information entropy obtained by analyzing the whole Billion Triple Challenge dataset
      • no weight learning
      • only direct entity descriptions are used
    • evaluation
      • human-based intuition validation
      • comparison with 17 other systems
      • IIMB dataset
      • F1-measure
      • NYTimes and DBpedia instances (people, organizations, and locations) to test blocking effectiveness
      • information-entropy-based stop-traverse algorithm validated with NYTimes and DBpedia data
      • Cucerzan’s dataset on news entities linking to Wikipedia entries
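
To make the block, compute and select workflow from the notes concrete, here is a minimal Python sketch. It is not Jin's IEWS implementation. Every name in it (property_entropy, entity_similarity, build_blocks, match), the toy entities, and the 0.8 threshold are assumptions made for illustration. Property weights are taken directly from Shannon entropy instead of being learned, and string similarity uses difflib from the standard library in place of whatever lexical measure the thesis uses.

```python
import math
from collections import Counter, defaultdict
from difflib import SequenceMatcher
from itertools import combinations


def property_entropy(values):
    """Shannon entropy (in bits) of the observed values of one property."""
    counts = Counter(values)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def string_similarity(a, b):
    """Lexical similarity in [0, 1]; a stand-in for the thesis' string measure."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def entity_similarity(e1, e2, weights):
    """Weighted average of string similarity over properties both entities share."""
    shared = [p for p in e1 if p in e2 and weights.get(p, 0) > 0]
    total = sum(weights[p] for p in shared)
    if total == 0:
        return 0.0
    return sum(weights[p] * string_similarity(e1[p], e2[p]) for p in shared) / total


def build_blocks(entities):
    """Block step: entities sharing a keyword in their descriptions share a block."""
    blocks = defaultdict(set)
    for entity_id, description in entities.items():
        for value in description.values():
            for token in value.lower().split():
                blocks[token].add(entity_id)
    return blocks


def match(entities, threshold=0.8, top_k=None):
    """Compute pairwise scores inside each block, then select matches."""
    # Assumption: a property's weight is simply the entropy of its values.
    properties = {p for d in entities.values() for p in d}
    weights = {
        p: property_entropy([d[p] for d in entities.values() if p in d])
        for p in properties
    }
    scored = {}
    for ids in build_blocks(entities).values():
        for a, b in combinations(sorted(ids), 2):
            if (a, b) not in scored:
                scored[(a, b)] = entity_similarity(entities[a], entities[b], weights)
    # Select step: threshold first, then optionally keep only the top-k pairs.
    selected = [(pair, s) for pair, s in scored.items() if s >= threshold]
    selected.sort(key=lambda item: -item[1])
    return selected if top_k is None else selected[:top_k]


# Hypothetical toy data, not taken from the NYTimes or DBpedia datasets.
entities = {
    "nyt:1": {"name": "Barack Obama", "type": "person"},
    "dbp:1": {"name": "Barack Obama", "type": "person"},
    "dbp:2": {"name": "Michelle Obama", "type": "person"},
    "dbp:3": {"name": "Troy", "type": "place"},
}
matches = match(entities)
```

In this sketch a property with more distinct, evenly spread values (here, name) gets a higher weight than one with few values (type), which is one simple reading of "related to number of possible values of a property". Blocking keeps the pairwise comparison from running over every pair of entities, which is where the scalability concern in the notes comes in.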