ABSTRACT: Deep learning has been successfully applied to image-based and free-text applications, but several challenges arise in applying it to structured or semi-structured data. In this talk, I discuss deep-learning solutions to two important tasks in data management: entity/record matching and table search. Entity matching is important for identifying duplicate records and for integrating data from different sources to provide a comprehensive description of an entity. Machine learning approaches to the problem require significant amounts of training data, which can be a barrier to practical application of automated matching. An automated approach is typically adopted precisely because there are insufficient resources (people and time) to manually match large numbers of records, yet that manual matching is exactly what is needed to create a training set for a new matching task! I describe a zero-shot learning approach to the problem, in which useful matching can be achieved on a new matching task after the system has been trained only on different tasks. Furthermore, I show how the same approach can be applied in a few-shot learning setting to get even better results: that is, accuracy increases as the number of task-specific training examples increases. In the second part of my talk, I discuss the problem of locating relevant tabular data, a problem often faced by data journalists and academic researchers. I describe an approach that learns a representation for a table that can be used in keyword search and query-by-example tasks. In particular, our approach fully encompasses the structural information of the table by considering how data is organized by both columns and rows. We create structurally-informed sequences that are input into the well-known BERT model in order to produce a comprehensive embedding of the table. We then demonstrate how this embedding can be leveraged to achieve state-of-the-art results in keyword search and query-by-example tasks.
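To make the idea of "structurally-informed sequences" concrete, the sketch below shows one hypothetical way a table might be linearized along both its rows and its columns before being passed to a BERT-style encoder. This is an illustrative assumption, not the talk's actual implementation: the function name, the cell-tagging scheme, and the `[SEP]` boundary markers are all invented here for exposition.

```python
# Illustrative sketch (NOT the speaker's actual method): linearizing a
# table two ways, so a BERT-style model can see both row structure
# (records) and column structure (attributes). All names are hypothetical.

def serialize_table(header, rows):
    """Produce row-wise and column-wise token sequences for a table.

    Each cell is prefixed with its header so the encoder knows which
    attribute a value belongs to; "[SEP]" marks structural boundaries.
    """
    # Row-wise view: each record becomes one "sentence".
    row_seqs = [
        " ".join(f"{h} : {cell}" for h, cell in zip(header, row)) + " [SEP]"
        for row in rows
    ]
    # Column-wise view: all values of one attribute grouped together.
    col_seqs = [
        f"{h} : " + " ".join(str(row[i]) for row in rows) + " [SEP]"
        for i, h in enumerate(header)
    ]
    return row_seqs, col_seqs

header = ["city", "population"]
rows = [["Bethlehem", "75000"], ["Allentown", "125000"]]
row_seqs, col_seqs = serialize_table(header, rows)
# row_seqs[0] -> "city : Bethlehem population : 75000 [SEP]"
# col_seqs[1] -> "population : 75000 125000 [SEP]"
```

In a full pipeline, sequences like these would be tokenized and encoded, and the resulting vectors pooled into a single table embedding usable for keyword search or query-by-example retrieval.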
BIOGRAPHY: Dr. Jeff Heflin is an associate professor in the Department of Computer Science and Engineering at Lehigh University. He is generally interested in applying artificial intelligence to problems in data management. His specific research interests include establishing semantic interoperability between heterogeneous information systems, machine learning for dataset search, exploration and analysis of complex data, and scalable ontology reasoning. He is one of the pioneers of Semantic Web research and wrote the first Ph.D. dissertation on the subject. He has been involved in the design of many important Semantic Web languages, including SHOE, DAML+OIL, and OWL. In 2004, he received an NSF CAREER award to study the theory and algorithms of distributed ontologies. His research group developed the Lehigh University Benchmark (LUBM), the de facto standard for evaluating the capabilities of large-scale Semantic Web knowledge bases. He is a Senior Member of AAAI and serves on the editorial board of the Artificial Intelligence Journal. He has been a guest editor for four journal issues, was co-program chair for ISWC 2012, and will be the general chair for ISWC 2017. Dr. Heflin received his B.S. in computer science from the College of William and Mary, and his M.S. and Ph.D. in computer science from the University of Maryland.