Geodata 2014

June 30th, 2014

A few weeks ago I attended the 2014 Geodata Workshop. Like the previous Geodata workshop in 2011, this workshop was focused on discussing policies and techniques to improve inter-agency geographic data integration and data citation. While there have been advances in recommendations for data citation and geodata integration since the last Geodata workshop, I felt the mood of the attendees indicated that we are now in much the same place we were in 2011. There was strong consensus as to the importance of data citation and integration, but a feeling that no one is really doing it at scale, the tools aren’t where we need them to be, and the agency policies are not yet at a state to successfully drive widespread adoption. Despite these hurdles, this is a community that is clearly excited and willing to take the first steps towards making widespread data integration and data citation a reality in the geodata community.

Meanwhile, in the trenches…

I had several conversations with attendees who represent publishers of oceanographic vocabularies. Many of these vocabularies have been publicly available for several years, but have traditionally been 3-star open data (publicly available in a non-proprietary machine-readable format, no links to external vocabularies). These publishers are excited about upgrading their vocabulary services to be 5-star open data (use open W3C standards such as RDF/SPARQL, identify things with resolvable URIs, link to other people’s data) because they see a major benefit in being able to refer to the authoritative source for a term or identified resource that is related to their vocabulary but for which they are not the authoritative source. This is a great example of a group that has already identified a specific real-world need and benefit from integration and that is actively laying the groundwork that will enable that integration to be successful. This group was enthusiastic about cross-linking their vocabularies and I have no doubt their efforts will be viewed as a data integration success at the next Geodata workshop.
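To make the cross-linking idea concrete, here is a minimal sketch of how a publisher might state that one of its own terms matches a term in someone else's vocabulary. It is only an illustration: the two concept URIs are made-up placeholders, and the helper `link_triples` is a hypothetical name, not part of any vocabulary service mentioned above. The only real identifier is the SKOS namespace and its `exactMatch` mapping property.

```python
# Namespace of the W3C SKOS vocabulary (real); everything else below is illustrative.
SKOS = "http://www.w3.org/2004/02/skos/core#"


def link_triples(local_uri, external_uris, relation="exactMatch"):
    """Return N-Triples lines linking one local concept to external concepts."""
    return [
        f"<{local_uri}> <{SKOS}{relation}> <{ext}> ."
        for ext in external_uris
    ]


# Placeholder URIs standing in for a local term and an external authoritative term.
lines = link_triples(
    "http://example.org/vocab/salinity",
    ["http://example.com/other-vocab/sea_water_salinity"],
)
print("\n".join(lines))
```

A statement like this is what moves a vocabulary from "published" to "linked": the local publisher keeps its own URI but points at the source it considers authoritative.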

Where we can help…

As a result of these discussions our lab is starting a Linked Vocabulary API effort whose goal is to provide a Linked Data API configuration specialized for publishing SKOS vocabularies. We want to develop a configuration that makes bootstrapping a RESTful linked data API to a SKOS vocabulary simple and accessible for the broad scientific community. This effort is based on work we previously did for the CMSPV project.

In conclusion

What I will remember most from Geodata 2014 is the excitement members of the community had about adopting new technologies and techniques and making widespread data integration and citation a reality. Where conventions have yet to be established, the community is willing to take the first steps and establish best practices. Where policies have yet to be formalized, the community is ready to work with policy makers to ensure clear and helpful policies are established. Whenever the next Geodata workshop is held, I am confident that its narrative will be full of success stories that began at the 2014 workshop.

GeoData 2014 in sunny “People’s Republic of Boulder”

June 25th, 2014

I was so excited to see so many domain scientists as well as data specialists getting together to talk about one common topic: data science. It gave me a great chance to communicate with and learn from others. We had people from NOAA, NASA, USGS, RDA, UCAR, universities and industry, as well as many other organizations and agencies.

I have to say that Boulder is such a wonderful small town to stay in. A large number of government scientific agencies are located there. Local people are very friendly, and from any spot in town you can have a wonderful view of the beautiful mountains on the west side. Moreover, it is said that there are around 300 sunny days per year over there! No wonder Patrick and Stephan love this place so much and prefer not to stay in Troy all the time, lol~

The first important thing I learned from the workshop is about data policies. I have to admit that as a student I never cared about data policies in the real world, which in fact always exist and should be obeyed. Regulations have been made about data citations, making scientific data much easier to access and reuse. This has really broadened my horizons and given me a chance to get prepared for real-world challenges. However, there is still much more we need to do about data policies. For example, there are both federal and university scientists working on similar scientific data. Sometimes it is not so easy to make the policies work for these different groups. Coordinating all entities is difficult. Some places are really running with it while others are left behind.

During one break, Prof. Fox reminded me that I should walk around and mingle with people instead of sitting there alone. This was indeed great advice for me. Research sometimes needs a great amount of communication. You never know how much others could help with your own work and how much others could also learn from your research achievements. This is also a main purpose of the whole workshop, which is to stimulate academic and agency collaboration in geoinformatics and in geodata retrieval, integration, reuse and citation.

During one lunch break I happened to sit with a guy who worked on visualizations of time-series hydrologic data. He used a simple 2-D grid to visualize the data (x-axis as days in a year, y-axis as years, each small cell as one data point for a particular day in the record, different colors as different values). Compared to a whole bunch of plots or hard-to-read 3-D visualizations, I was very surprised to see that such a simple idea could reveal so many more conclusions from the data. This is just one simple example of how important communication and collaboration are.
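The grid layout described above is easy to sketch. The snippet below is only my own rough reconstruction of the idea, not the code that person used: the sample values are made up, the function names `build_year_day_grid` and `render_row` are hypothetical, and text characters stand in for colors so that it runs with nothing but the Python standard library.

```python
from datetime import date


def build_year_day_grid(records):
    """Arrange (date, value) pairs into {year: [366 cells]}, one cell per day of year."""
    grid = {}
    for day, value in records:
        row = grid.setdefault(day.year, [None] * 366)
        # tm_yday is 1-based, so shift it to a 0-based column index.
        row[day.timetuple().tm_yday - 1] = value
    return grid


def render_row(row, low, high, shades=" .:*#"):
    """Map each value to a shade character; days without data become '?'."""
    span = (high - low) or 1
    chars = []
    for value in row:
        if value is None:
            chars.append("?")
        else:
            idx = int((value - low) / span * (len(shades) - 1))
            # Clamp so values outside [low, high] still map to a valid shade.
            idx = max(0, min(len(shades) - 1, idx))
            chars.append(shades[idx])
    return "".join(chars)


# Made-up sample measurements, purely for illustration.
records = [
    (date(2012, 1, 1), 3.0),
    (date(2012, 1, 2), 9.0),
    (date(2013, 1, 1), 6.0),
]
grid = build_year_day_grid(records)
for year in sorted(grid):
    # Show only the first five days of each year to keep the output short.
    print(year, render_row(grid[year][:5], low=3.0, high=9.0))
```

With real data the same grid could be handed to a plotting library and drawn as a colored image, so seasonal patterns line up vertically and unusual years stand out as rows.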

Additionally, we still have a long way to go to move from the age of the relational database to the semantic triple store. Nobody could deny that a triple store does a much better job than a relational database on the “linked data” aspect. However, it also costs much more on the maintenance side. So in the real world, especially in industry, people still prefer not to use it, since the main goal of business is to make profits. However, I believe that this will change in the near future, starting with the domain experts who participated in this workshop.

According to the feedback from a breakout session, a large number of people are still confused about the difference between an ontology and a vocabulary. Data-related education is another key issue we discussed during the workshop. I felt so lucky that I had the chance to learn data-related knowledge systematically from our Tetherless World professors. However, it is still a big challenge to make the whole data community realize how important it is to make data management plans, to carry out data citations, and so on. It is everyone’s responsibility to create an even bigger and better “open data” community.

Conceptual model of a workshop

June 24th, 2014

In June 2014 I helped organize two workshops, the DCO Data Science Day 2014 and GeoData 2014. The experience was unique and I thought it necessary to write down some notes for future events. I hope they will also be useful to other people who are planning to organize a workshop or small conference.
The list of models below follows the idea of an ontology spectrum.

Model 1 (via Bruce Caron, easy and impressive): people, coffee, beer + shaking well.

Model 2 (following the context model of 5W1H): date, topic, location, people, agenda, logistics.

Model 3 (things to do – result of a brainstorm):
0 website;
1 date;
2 central topic, purpose, output;
3 topic of sessions, preferred topic of invited talks, topic of panels, topic of breakouts;
4 meeting rooms, hotel, visa application support;
5 organizing committee, meeting chair, session chair, invited speaker, breakout moderator, note taker, technical assistant, workshop report writer;
6 handouts pack (agenda, badge, logistics memo);
7 logistics: announcement, wifi, power strips, emergency contact, projector, whiteboard and marker, remote access facility, alcohol service permission, travel support, travel agency, dietary requirement, morning and afternoon break, lunch, dinner, reception, local transportation, reimbursement method.

Model 4 (following a timeline):
0 Science: topic, purpose;
1 Finance: meeting budget;
2 Planning: meeting proposal, organizing committee, logistics administrator, organizing meetings, date, location, announcement;
3 Agenda: topic of sessions, preferred topic of invited talks, topic of panels, topic of breakouts, meeting rooms, meeting chair, session chair, invited speaker, breakout moderator, note taker, technical assistant;
4 Logistics: handouts pack (agenda, badge, reimbursement form, logistics memo), emergency contact, wifi, power strips, projector, whiteboard and marker, remote access facility, travel support, travel agency, hotel, visa application support, dietary requirement, morning and afternoon break, lunch, dinner, reception, alcohol service permission, local transportation, reimbursement method;
5 Output: online virtual community of attendees, workshop summary and recommendations, workshop report writer.

Model 5 (an ontology? ;-))
Should be something like:
twc:Workshop a prov:Activity.
twc:SessionChair a prov:Role.

Comments and complements are welcome!

Data Science Day Symposium

June 14th, 2014

It was my first time attending a Data Science workshop since I began to work for the DCO project. I must say that it was a great experience to mingle with scientists and scholars in different domains from all around the world. I felt very proud to see researchers from far away gathering together at our beautiful RPI campus to have conversations about Data Science.

We have done lots of wonderful work to build the Data Science and Management infrastructures for the DCO. Now it is the time to convince our DCO colleagues why it is so important to share “linked data” and “open data” for the whole DCO community.

During the breakout session in the afternoon, I joined the EPC group. Dr. Mark Ghiorso led our discussion. Currently a lot of members in DCO haven’t realized how important it is to make a management plan to preserve the data generated in their research. One of the questions raised was: what types and formats of data do you produce or use in your work, and how do you archive them? This question will lead to one of our DCO-DS boundary activities. Some research publications are so old that no electronic versions exist at all. If we need to reuse the data from this published literature, we have to figure out a way to regenerate the data from the paper versions. OCR (optical character recognition) could be used to convert images into machine-readable text after scanning the paper versions. Then one key problem we need to solve is how to extract the metadata and data we need automatically from the text while maintaining quality control. We still have a long way to go on this.

In the meantime, the fact that we have already lost a lot of valuable data over the years clearly shows why it is so crucial to make a good data management plan before conducting any research activities in the future. The Data Science and Management infrastructure our data science team built for DCO plays a very significant role in this, and we also managed to make the portals user-friendly.

During the second day of our workshop, we were very glad to meet a lot of DCO researchers who are willing to share “linked data” via our Drupal-VIVO-CKAN Data Science and Management infrastructure. We believe that after this workshop more and more researchers will make use of our data science platforms, which could benefit the whole DCO community!

Congrui Li, DCO-DS team member

The Day of Data Science

June 14th, 2014

The DCO Data Science team hosted a two-day workshop at RPI to promote the data management and infrastructure we have been working very hard on for DCO. As part of the team, I attended the whole event and met many scientists and scholars from the DCO communities.

This was a very good opportunity for our team to understand how the DCO scientists deal with their data and what their data needs are. In the breakout session on the afternoon of the first day, I joined the DL-DE group and had a good chat with them about data science. There was some discussion of data management, and apparently most of them have a very loose “plan” for their data, especially the raw data. They produce spreadsheets from the raw data and leave them poorly archived and hard to access. Worst of all, they don’t have a habit of writing good data management plans, at least not ones that they would stick to. But the good news is that they have already started to realize the benefit of good data management, and they were seeking help from us.

One piece of feedback we got from the scientists about the DCO portal was that it was “rather low priority for people”. They said it did contain helpful information, but people just didn’t go there all the time. They need some sort of pop-up reminder that shows up on their desktops so that they can see it without logging in to the portal.

Another feature the scientists would very much like to have is a cross-dataset search capability. For example, they were very interested in pulling up all the data about a certain concentration of hydrogen from multiple datasets and databases, and they weren’t able to do this because of the lack of compatible schemas across datasets. This is definitely something interesting, but it hasn’t been on our radar in our DCO-DS work. Digging into the domain data might not be easy, but once we establish the model, the power of linked data will reveal itself.
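As a rough illustration of what such a cross-dataset search involves, here is a small sketch in which each dataset's local column names are mapped onto one shared field name before searching. Everything in it is hypothetical: the dataset names, the column names, the sample values and the helper functions are made up for this example and do not describe any actual DCO dataset. It also assumes both datasets already report hydrogen in the same unit, which real data would not guarantee.

```python
# Hypothetical per-dataset mappings from local column names to shared field names.
FIELD_MAPS = {
    "dataset_a": {"H2_ppm": "hydrogen_ppm", "sample": "sample_id"},
    "dataset_b": {"hydrogen": "hydrogen_ppm", "id": "sample_id"},
}


def normalize(dataset_name, row):
    """Rename a row's local columns to the shared names and record its source."""
    mapping = FIELD_MAPS[dataset_name]
    record = {shared: row[local] for local, shared in mapping.items() if local in row}
    record["source"] = dataset_name
    return record


def search(datasets, field, low, high):
    """Return normalized records from every dataset whose field lies in [low, high]."""
    hits = []
    for name, rows in datasets.items():
        for row in rows:
            record = normalize(name, row)
            value = record.get(field)
            if value is not None and low <= value <= high:
                hits.append(record)
    return hits


# Made-up sample rows with incompatible column names.
datasets = {
    "dataset_a": [{"sample": "A1", "H2_ppm": 12.0}, {"sample": "A2", "H2_ppm": 40.0}],
    "dataset_b": [{"id": "B7", "hydrogen": 15.5}],
}
matches = search(datasets, "hydrogen_ppm", 10.0, 20.0)
for match in matches:
    print(match)
```

The hand-written mapping table is the part a shared model would replace: once datasets describe their fields with common, linked terms, the renaming step no longer has to be maintained dataset by dataset.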

These are just some of the things we talked about in the workshop. All in all, as more and more data get collected in research activities, scientists have become aware that they really need to get these data in order so that they can make the most of them. On the other hand, we, the data science crew, still have a lot of work to do to make this world a better data place :-)


