I don’t remember exactly how many times I have been to the Bay Area, mostly for summer internships. Foggy, windy San Francisco is nothing new to me, but the AGU meeting was an adventure I could never have imagined. I knew from lab-mates that AGU is a huge event, with over 20,000 attendees from numerous domains, but it was not until I registered at the Moscone Center West that I realized its overwhelming diversity. To be honest, as I talked to students at the student breakfast, I nominally “knew” their backgrounds, such as biological chemistry, but the actual problems and approaches were never familiar to me. Initially I tried to show appreciation as a sign of politeness, but I never felt any spontaneous connection in those conversations. I told myself I needed to get something out of these seven days of conversations, something useful for both parties.
Therefore, I decided to take charge of the conversations, asking questions like: “I am working on A, which tries to solve general problems such as B and C; does that benefit what you are working on?” Basically, I treated most conversations as a survey of whether something I did would make their lives easier. One reason is my observation that domain scientists, the category most attendees fall into, are using information technologies such as MATLAB, R, or Excel, but are not satisfied with them. Most scientists, as general users, might not know exactly what they want, since otherwise they would just build the tool themselves! In most cases they are looking for some magical tool that does things better, just as when the automobile was first invented most people thought they needed faster horses. From this observation, I had several great conversations with a couple of scientists about the problems I am working on, and these conversations really helped me a lot!
Data integration:
One topic many people are talking about is data integration. Essentially, this is the problem of ontology and vocabulary matching, where many efforts have been made to align heterogeneous schemas and the corresponding keywords within them. In the project I am working on with Cyndy Chandler and Adam Shepherd from WHOI, domain data scientists from the US, EU, and Australia try to align part of their research data with a commonly accepted vocabulary, the NERC vocabulary, which looks like http://vocab.nerc.ac.uk/collection/L06/current/. Most of the alignment so far has been done manually, so my goal is to apply some natural language processing tricks to align the terms based on their metadata, such as descriptions and definitions. Some folks are already working on this, though not as a completely automatic process: they try to provide a guideline tool for scientists who are not sure which terms to use when describing their datasets. They learn the “recommended” terms by clustering a large set of vocabularies with LDA. However, I really doubt whether this would work, since LDA is an unsupervised method, which means we cannot specify which terms should fall into the same category; that disregards an important aspect of the dataset and is not a good fit for the problem. I also have concerns about the applicability of this work: most domain scientists already know which vocabularies to use, and those who do not are probably not domain scientists. Usually, computer scientists create these datasets under the supervision of domain scientists, so a vocabulary guideline service may not be necessary for them either. Anyway, bringing AI techniques to this domain is always worth a try.
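To make the metadata-based alignment idea concrete, here is a minimal sketch of the kind of matching I have in mind: score each controlled-vocabulary entry against a local term by the similarity of their definitions, using plain bag-of-words cosine similarity. The vocabulary labels and definitions below are made up for illustration, not real NERC entries, and a real system would use richer features than token counts.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase and split text into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in set(a) & set(b))
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def best_match(local_term, local_def, vocab):
    """Rank vocabulary entries by definition similarity; return the top one."""
    query = Counter(tokenize(local_term + " " + local_def))
    scored = [(cosine(query, Counter(tokenize(label + " " + definition))), label)
              for label, definition in vocab.items()]
    return max(scored)

# Hypothetical vocabulary entries (labels and definitions are illustrative).
vocab = {
    "water temperature": "temperature of the water body measured in situ",
    "wind speed": "speed of air movement above the sea surface",
}
score, label = best_match("sea temp", "in situ temperature of the sea water", vocab)
print(label)  # -> water temperature
```

Unlike the LDA-based clustering I am skeptical of, this kind of pairwise scoring keeps a human in the loop: the tool proposes a ranked list and the curator confirms the mapping.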
Data exploration and data portals:
Another trend is that many people are concerned with building data portals, many of which feature facet search, map-based search, and filtering. Seeing a number of posters from George Mason, I realized they are doing the same thing as S2S, except that their facets and user interface are generated from hard-coded queries and XML files; they are not using an ontology to drive the hierarchy of the user interface. But when I briefly described our work on S2S, they became very interested and eager to try it out, asking whether source code and enough documentation were available for the S2S framework. I showed them our work on DCO, not a perfect example but one that already articulates the idea. They even suggested creating a user community for the S2S framework, so that whenever someone has a question or wants to contribute a widget, there is a place to discuss and commit code. Besides George Mason, other schools raised the same suggestion.
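The design difference here is worth spelling out: in an ontology-driven portal, the facets and their values come from a model of the domain rather than being baked into the UI code, so the interface updates when the model does. A minimal sketch, with an entirely illustrative two-facet "hierarchy" standing in for a real ontology:

```python
# Illustrative stand-in for an ontology: each facet lists its allowed values.
# In a hard-coded portal this structure would live in the UI code or XML;
# here it drives the interface from the outside.
hierarchy = {
    "Measurement": ["Temperature", "Salinity"],
    "Platform": ["Ship", "Buoy"],
}

# Toy dataset records annotated with facet values.
records = [
    {"Measurement": "Temperature", "Platform": "Ship"},
    {"Measurement": "Salinity", "Platform": "Buoy"},
]

def facet_counts(records, hierarchy):
    """Count matching records per facet value, driven by the hierarchy."""
    return {facet: {value: sum(1 for r in records if r.get(facet) == value)
                    for value in values}
            for facet, values in hierarchy.items()}

print(facet_counts(records, hierarchy))
```

Adding a new facet or value is then a one-line change to the model, with no UI code touched, which is the maintenance win the ontology-driven approach is after.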
Big data analysis:
Lots of people are talking about big data analysis, from NASA down to small institutes. What I really expected from the talks was the technical side: the scale of their datasets and how they solve problems using parallelism, redundancy, and so on. However, most of the talks discussed the architecture of their systems and its key components, without much detail on what the data looks like, what problems they came across, and how those problems were solved. It might be because the audience at this meeting mostly has domain backgrounds and does not really know, or even care about, those technical details.
The best thing I found about this meeting is that several people are working toward the same goal as I am, but with slightly different approaches. That is good because it shows that what I am interested in is meaningful, and there is still room for me since we are not doing exactly the same thing. One professor I talked with is Prof. Jia Zhang from Carnegie Mellon. She works on helping scientists reuse data, practices, and algorithms, both to prevent reinventing the wheel and, more importantly, to accelerate adopting someone else’s work and save time and effort. She has also developed a service recommendation system for scientists: given input metadata and a goal, the system suggests specific algorithms, and the workflow connecting them is found using path-finding algorithms. However, they have not yet built a web-based platform to execute the workflow, and they are not using a reasoning engine to find the path. I described some of my work, ideas, and thoughts, and how I would approach the problem using semantic web technologies. We were both happy after the conversation: although the goal is the same, the approaches are slightly different, so there is a good source of reference and collaboration.
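To illustrate the path-finding idea in workflow composition, here is a minimal sketch: treat each service as an edge from its input data type to its output data type, and search for a chain from what you have to what you want. The service names and types are hypothetical, and this plain breadth-first search stands in for whatever path-finding algorithm their system actually uses (and for the reasoning-engine approach I would try instead).

```python
from collections import deque

# Hypothetical services, each mapping an input data type to an output type.
services = {
    "clean": ("raw_data", "clean_data"),
    "grid": ("clean_data", "gridded_data"),
    "plot": ("gridded_data", "figure"),
}

def find_workflow(start, goal, services):
    """BFS over data types; return a list of service names from start to goal,
    or None if no chain of services connects them."""
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        dtype, path = queue.popleft()
        if dtype == goal:
            return path
        for name, (inp, out) in services.items():
            if inp == dtype and out not in seen:
                seen.add(out)
                queue.append((out, path + [name]))
    return None

print(find_workflow("raw_data", "figure", services))  # -> ['clean', 'grid', 'plot']
```

A reasoning engine would replace the exact string match on types with subsumption (e.g., a service accepting "tabular data" also accepts "gridded data"), which is where the semantic web angle comes in.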
After the meeting, I have a clearer picture of the significance of my work, possible directions, and potential collaborators. One regret is that I did not present any of my own work, so this is a really good catalyst for me to get down to work and contribute something to this community in the coming year.