Data Science 2018

Instructor: Thilanka Munasinghe, munast at rpi dot edu

TA: Aayushi Baghel - baghea at rpi dot edu

Class Meeting times: Monday 0900-1200 ET (synchronous) and online (asynchronous; see Location)

Class Location: Lally 104 and Adobe Connect (login as guest) and Learning Management System (LMS) 1709_Data Science (RCS login)

Instructor Office Hours: Tue/Fri from 1030 - 1130 ET or by appointment/ email/ online

Instructor Office Location: Amos Eaton 133

TA Office Hours: Wednesday 1100 - 1300 ET or by appointment
 



Course Numbers:
  • CSCI/ERTH/ITWS 4350/ 6350
Description:
Data science is advancing the inductive conduct of science and is driven by the greater volumes, complexity and heterogeneity of data being made available over the Internet. Data science combines aspects of data management, library science, computer science, and physical science using supporting cyberinfrastructure and information technology. It is changing the way all of these disciplines do both their individual and collaborative work. Key methodologies in application areas based on real research experience are taught to build a skill-set.

Goal:
To instruct future scientists how to sustainably generate/collect and use data for their research as well as for others: data science. To instruct future technologists how to understand and support the essential data and information needs of a wide variety of producers and consumers. For both to know the tools and requirements to properly handle data and information. Students will learn and be evaluated on the full life-cycle of data and relevant methods, technologies, and best practices.
Learning Objective:
Through class lectures, practical sessions, written and oral presentation assignments, and projects, students should:
  • Develop and demonstrate skill in Data Collection and Management
  • Develop Data Models and Generate Metadata
  • Demonstrate Knowledge of Data Standards
  • Demonstrate Skill in Data Science Tool Use and Evaluation
  • Demonstrate the application of Data Life-Cycle principles
  • Become proficient in Data and Information Product Generation
Assessment Criteria:
  • Via written assignments, with the specific percentage of grade allocation provided with each assignment
  • Via oral presentations, with the specific percentage of grade allocation provided
  • Via group presentations
  • Via participation in class (not to exceed 10% of total)
Late submission policy: first time with a valid reason, no penalty; otherwise 20% of the score is deducted for each late day.
Academic Integrity:
Student-teacher relationships are built on trust. For example, students must trust that teachers have made appropriate decisions about the structure and content of the courses they teach, and teachers must trust that the assignments that students turn in are their own. Acts that violate this trust undermine the educational process. The Rensselaer Handbook of Student Rights and Responsibilities defines various forms of Academic Dishonesty, and you should make yourself familiar with these. In this class, all assignments that are turned in for a grade must represent the student’s own work. In cases where help was received, or teamwork was allowed, a notation on the assignment should indicate your collaboration. Submission of any assignment that is in violation of this policy will result in a penalty. If found in violation of the academic dishonesty policy, students may be subject to two types of penalties. The instructor administers an academic (grade) penalty, and the student may also enter the Institute judicial process and be subject to such additional sanctions as: warning, probation, suspension, expulsion, and alternative actions as defined in the current Handbook of Student Rights and Responsibilities. If you have any question concerning this policy before submitting an assignment, please ask for clarification. A first violation results in a zero grade for the relevant portion of the work. A second offence results in a failing grade.

Syllabus/ Calendar

Refer to Reading/ Assignment/ Reference list for each week (see below).

  • Week 1 (Sep. 10): History of Data and Information, Data, Information, Knowledge Concepts and State-of-the-Art, Data life-cycle for Science; Data acquisition, curation, preservation, metadata
  • Week 2 (Sep. 17): Data and information acquisition (curation) and metadata/ provenance - management
  • Week 3 (Sep. 24): Data formats, metadata standards, conventions, reading and writing data and information
  • Week 4 (Oct. 01): Module 2 and 3 Review
  • Week 5 (Oct. 9) (Tuesday follows Monday schedule): Class exercise - collecting data - individual
  • Week 6 (Oct. 15): Presentations: present your data (part of Assignment 2)
  • Week 7 (Oct. 22): Data Analysis II and Class exercise - group project definitions - working with someone else's data
  • Week 8 (Oct. 29): Intro to Data Mining for Data Science
  • Week 9 (Nov. 5): Academic basis for Data Science, Data Models, Schema, Markup Languages
  • Week 10 (Nov. 12): Data Workflow Management, Preservation and Data Stewardship
  • Week 11 (Nov. 19): Data Quality, Uncertainty, and Bias
  • Week 12 (Nov. 26): Webs of Data and Data on the Web, the Deep Web, Data Infrastructures, Data Discovery, Data Citation
  • Week 13 (Dec. 3): Final Project Presentations
  • Week 14 (Dec. 10): Final Project Presentations

Reading/ Assignment/ Reference List

Class 1: Reading Assignment (choose 5-6 and read at least 2-3 in depth):

  • Changing Science: Chris Anderson: [1]
  • Rise of the Data Scientist [2]
  • Where to draw the line? [3]
  • Career of the Future [4]
  • What is Data Science (I) [5]
  • What is Data Science (II) [6]
  • Data Scientist: The Hottest Job You've Never Heard Of [7]
  • What Is a Data Scientist? [8]
  • Data Scientist - sexiest job of the 21st C? [9]
  • An example of data science [10]
  • Big Data Science [11]
  • A Very Short History of Data Science [12]
  • Data Science Programs on the Increase [13]

Reference

  • Fourth Paradigm: [14]
  • Humanities - Digging into Data [15]
  • National Science Foundation Cyberinfrastructure Plan chapter on Data [15a]

Class 2: Reading Assignment:

  • ISO Lineage Model (NOAA Environmental Data Management) [16]
  • Earth Science Information Partners Data Management Workshop: [17]
  • Earth Science Information Partners: Course Outline [18]
  • Univ. Minnesota [19]
  • Moore et al., Data Management Systems for Scientific Applications, IFIP Conference Proceedings; Vol. 188, Proceedings of the IFIP TC2/WG2.5 Working Conference on the Architecture of Scientific Software, pp. 273 – 284 (2000) [20]
  • Data Management and Workflows [21]
  • Metadata and Provenance Management [22]
  • Provenance Management in Astronomy (case study) [23]
  • Web Data Provenance for QA [24]
  • W3 PROV Overview [25]
  • W3 PROV Data Model [26]

Class 3: Reading Assignment:

  • Data formats: netCDF [27]
  • Spatial Data Transfer Standard GIS format [28]
  • Metadata resources [29]
  • Metadata Encoding and Transfer Standard - METS [30]
  • Open Archives Initiative - Protocol for Metadata Harvesting - OAI-PMH [31]
  • Keyhole Markup Language - KML Tutorial [32]
  • Earth Science Markup Language - ESML [33]
  • Climate Science Markup Language - CSML [34]
  • Climate and Forecast (CF) conventions [35]

     

Class 4: Reading Assignment:

  • Brief Introduction to Data Mining [36]
  • Longer Introduction to Data Mining and slide sets [37]
  • See the software resources list [38]
  • Data Analysis - Introduction [39]
  • Example: Data Mining [40]

Class 5: Reading Assignment: None.

Class 6: Reading Assignment: None.

Class 7: Reading Assignment: preview government and other (science) data repositories

Some of these have no single "entry point" to their data; you can find them fairly easily by searching for the name of the agency:

  • Department of Energy EIA [41]
  • Humanities - Digging into Data [42]
  • Environmental Protection Agency (EPA)
  • US Geological Survey (and state surveys) (USGS), data.usgs.gov
  • NASA Earth Observing System (EOS) and ECHO, data.nasa.gov
  • National Oceanic and Atmospheric Administration (NOAA) NCEI, data.noaa.gov
  • Department of Energy (DoE): [43]
  • National Library of Medicine (NLM): [43a]
  • data.gov [44]
  • data.ny.gov [45]
  • Find one of your own

 

Class 8: Reading Assignment:

  • See Class 4 reading

Class 9: Reading Assignment: pre-reading

  • Another Look at Data (Mealy 1967)! [53]
  • Identifying Content and Levels of Representation in Scientific Data (Wickett et al. 2012) [54]

Class 10: Reading Assignment:

  • Introduction to Data Management [45]
  • Changing software, hardware a nightmare for tracking scientific data [46] (and Parts I, II and III)
  • Overview of Scientific Workflow Systems, Gil (AAAI08 Tutorial) [47]
  • Comparison of workflow software products, Krasimira Stoilova, Todor Stoilov [48]
  • Scientific Workflow Systems for 21st Century, New Bottle or New Wine? Yong Zhao, Ioan Raicu, Ian Foster [49]
  • OCLC Sustainable Digital Preservation and Access [50]
  • Preservation and Access of NOAA Open Data [51]
  • NITRD report: [52]

Class 11: Reading Assignment:

Class 12: Reading Assignment:

  • The Deep Web (Internet Tutorials) [55]
  • Digital Image Resources on the Deep Web [56]
  • Facilitating Discovery of Public Datasets [57]
  • Tom Heath Linked Data Tutorial (2009) [58]
  • Relational Databases on the Semantic Web, Tim Berners-Lee, Design Issue Note, 1998-2009. [59]
  • A Survey of Current Approaches for Mapping of Relational Databases to RDF (PDF), Satya S. Sahoo, Wolfgang Halb, Sebastian Hellmann, Kingsley Idehen, Ted Thibodeau Jr, Sören Auer, Juan Sequeda, Ahmed Ezzat, 2009-01-31. [60]
  • On directly mapping relational databases to RDF and OWL, 2012, Sequeda, Arenas, Miranker in WWW '12 Proceedings of the 21st international conference on World Wide Web, pp. 649-658 [61]

 

Class 13: Reading Assignment: none

Reference material (purchase not required - please ask instructor if you are interested in any of these):

  • Parsons and Fox, Is Data Publication the Right Metaphor? [61]
  • Beautiful data: [62]
  • Scientific data management: [63]
  • BRDI activities: [64]
  • Data policy [65]
  • Self-directed study (answer the quiz): [66]
