Data Science 2021

  • Instructor: Thilanka Munasinghe, munast at rpi dot edu
  • TA: Devanshoo Jain - jaind2 at rpi dot edu
  • TA Office Hours: Monday 9:30 to 11:30 AM ET, or by appointment via email (virtual office hours via WebEx; see LMS for login info)
  • Class Meeting times: Thursdays 11:00 AM - 1:50 PM EST
  • Class Location: Lally 104 and Learning Management System (LMS) 1709_Data Science (RCS login)
  • Instructor Office Hours: Tue/Fri 12:30 PM - 1:30 PM EST, or by appointment, email, or online via WebEx (login information for the instructor's online office hours is available on LMS)
  • Instructor Office Location: Amos Eaton 133

Sections: CSCI 4350/6350, ITWS 4350/6350, ERTH 4350/6350

Syllabus/ Calendar

Refer to Reading/ Assignment/ Reference list for each week (see below).

Recommended Reading/Textbooks: Data Action: Using Data for Public Good by Prof. Sarah Williams

  • Week 1 (Sep. 02): History of Data and Information, Data, Information, Knowledge Concepts and State-of-the-Art, Data life-cycle for Science; Data acquisition, curation, preservation, metadata
  • Week 2 (Sep. 09): Data and information acquisition (curation) and metadata/ provenance - management
  • Week 3 (Sep. 16): Data formats, metadata standards, conventions, reading and writing data and information
  • Week 4 (Sep. 23): Module 2 and 3 Review, Data Analysis I
  • Week 5 (Sep. 30): Class exercise - collecting data - individual
  • Week 6 (Oct. 07): Presentations: present your data (part of Assignment 2)
  • Week 7 (Oct. 14): Presentations: present your data (part of Assignment 2)
  • Week 8 (Oct. 21): Academic basis for Data Science, Data Models, Schema, Markup Languages, group project, working with someone else's data
  • Week 9 (Oct. 28): Intro to Data Mining for Data Science
  • Week 10 (Nov. 04): Data Analysis II and Class exercise
  • Week 11 (Nov. 11): Data Workflow Management, Preservation and Data Stewardship
  • Week 12 (Nov. 18): Data Quality, Uncertainty and Bias, Final Project Preparation – Project work discussion with the instructor
  • Week 13 (Nov. 25): No Classes - Thanksgiving recess
  • Week 14 (Dec. 02): Webs of Data and Data on the Web, the Deep Web, Data Discovery, Data Integration, Data Citation
  • Week 15 (Dec. 09): Final Project Presentation, Final Project Group Discussion – Project work discussion with the instructor/TA
  • Week 15 (Dec. 09): Final Project Report Due on LMS

Reading/ Assignment/ Reference List

Class 1 Reading Assignment (choose 5-6 and at least 2-3 in depth):


  • Fourth Paradigm: [14]
  • Humanities - Digging into Data [15]
  • National Science Foundation Cyberinfrastructure Plan chapter on Data [15a]

Class 2: Reading Assignment:

  • Data Lineage [16]
  • Earth Science Information Partners Data Management Workshop: [17]
  • Earth Science Information Partners: Course Outline [18]
  • Data Management Plan - Univ. Minnesota [19]
  • Moore et al., Data Management Systems for Scientific Applications, IFIP Conference Proceedings; Vol. 188, Proceedings of the IFIP TC2/WG2.5 Working Conference on the Architecture of Scientific Software, pp. 273 – 284 (2000) [20]
  • Data Management Plan (DMP) and Getting Started with Data Management Plan Template - Univ. Minnesota [21]
  • Metadata and Provenance Management [22]
  • Provenance Management in Astronomy (case study) [23]
  • Web Data Provenance for QA [24]
  • W3 PROV Overview [25]
  • W3 PROV Data Model [26]

Class 3: Reading Assignment:

  • Data formats: netCDF [27]
  • Spatial Data Transfer Standard GIS format [28]
  • HDF5 TUTORIAL: Learning HDF5 with HDFVIEW [29]
  • Metadata Encoding and Transfer Standard - METS [30]
  • Open Archives Initiative - Protocol for Metadata Harvesting - OAI-PMH [31]
  • Keyhole Markup Language - KML Tutorial [32]
  • Earth Science Markup Language - ESML [33]
  • HDF5View User's Guide [34]
  • HDF5 files in Python [35]

Class 4: Reading Assignment:

  • Brief Introduction to Data Mining [36]
  • Longer Introduction to Data Mining and slide sets [37]
  • See the software resources list [38]
  • Data Analysis - Introduction [39]
  • Example: Data Mining [40]

Class 5: Reading Assignment: Data Action: Using Data for Public Good (how to use data as a tool for empowerment rather than oppression).

Class 6: Reading Assignment: None.

Class 7: Reading Assignment: preview government and other (science) data repositories

Some of these have no single "entry point" to their data; you can find them fairly easily by searching for the name of the agency:

  • Department of Energy EIA [41]
  • Humanities - Digging into Data [42]
  • Environmental Protection Agency (EPA)
  • US Geological Survey (and state surveys) (USGS)
  • NASA Earth Observing System (EOS) and ECHO
  • National Oceanic and Atmospheric Administration (NOAA) NCEI
  • Department of Energy (DoE): [43]
  • National Library of Medicine (NLM): [43a]
  • [44]
  • [45]
  • Find one of your own

Class 8: Reading Assignment:

Class 9: Reading Assignment: pre-reading

  • Another Look at Data (Mealy 1967)! [53]
  • Identifying Content and Levels of Representation in Scientific Data (Wickett et al. 2012) [54]

Class 10: Reading Assignment: none

  • Introduction to Data Management [45]
  • Changing software, hardware a nightmare for tracking scientific data [46] (and Parts I, II and III)
  • Overview of Scientific Workflow Systems, Gil (AAAI08 Tutorial) [47]
  • Comparison of workflow software products, Krasimira Stoilova, Todor Stoilov [48]
  • Scientific Workflow Systems for 21st Century, New Bottle or New Wine? Yong Zhao, Ioan Raicu, Ian Foster [49]
  • OCLC Sustainable Digital Preservation and Access [50]
  • Preservation and Access of NOAA Open Data [51]
  • NITRD report: [52]

Class 11: Reading Assignment:

Class 12: Reading Assignment:

  • The Deep Web (Internet Tutorials) [55]
  • Digital Image Resources on the Deep Web [56]
  • Facilitating Discovery of Public Datasets [57]
  • Tom Heath Linked Data Tutorial (2009) [58]
  • Relational Databases on the Semantic Web, Tim Berners-Lee, Design Issue Note, 1998-2009. [59]
  • A Survey of Current Approaches for Mapping of Relational Databases to RDF (PDF), Satya S. Sahoo, Wolfgang Halb, Sebastian Hellmann, Kingsley Idehen, Ted Thibodeau Jr, Sören Auer, Juan Sequeda, Ahmed Ezzat, 2009-01-31. [60]
  • On directly mapping relational databases to RDF and OWL, 2012, Sequeda, Arenas, Miranker in WWW '12 Proceedings of the 21st international conference on World Wide Web, pp. 649-658 [61]

Class 13: Reading Assignment: none

Reference material (purchase not required - please ask instructor if you are interested in any of these):

Course Goals / Objectives

  • To instruct future scientists how to sustainably generate, collect, and use data for their own research as well as for others: data science.
  • To instruct future technologists how to understand and support the essential data and information needs of a wide variety of producers and consumers.
  • For both groups, to know the tools and requirements needed to properly handle data and information.
  • Students will learn and be evaluated on the full life-cycle of data and on relevant methods, technologies, and best practices.

Through class lectures, practical sessions, written and oral presentation assignments, and projects, students should:

  • Develop and demonstrate skill in Data Collection and Management
  • Develop Data Models and Generate Metadata
  • Demonstrate Knowledge of Data Standards
  • Demonstrate Skill in Data Science Tool Use and Evaluation
  • Demonstrate the application of Data Life-Cycle principles
  • Become proficient in Data and Information Product Generation

Academic Integrity:

Student-teacher relationships are built on trust. For example, students must trust that teachers have made appropriate decisions about the structure and content of the courses they teach, and teachers must trust that the assignments that students turn in are their own. Acts that violate this trust undermine the educational process. The Rensselaer Handbook of Student Rights and Responsibilities and The Graduate Student Supplement define various forms of Academic Dishonesty, and you should make yourself familiar with these. In this class, all assignments that are turned in for a grade must represent the student’s own work. In cases where help was received, or teamwork was allowed, a notation on the assignment should indicate your collaboration. Submission of any assignment that is in violation of this policy will result in a penalty. If found in violation of the academic dishonesty policy, students may be subject to two types of penalties. The instructor administers an academic (grade) penalty, and the student may also enter the Institute judicial process and be subject to such additional sanctions as: warning, probation, suspension, expulsion, and alternative actions as defined in the current Handbook of Student Rights and Responsibilities. If you have any questions concerning this policy before submitting an assignment, please ask for clarification. A first violation results in a zero grade for the relevant portion of the work. A second offense results in a failing grade.

COVID-19 code of conduct:

All students must comply with all health and safety protocols specified by the Institute under the Return-to-Campus plan available at the Rensselaer COVID-19 website. Appropriate action will be taken against those who do not comply fully with these protocols.

This code will apply to any class that meets fully or partially in an on-campus physical classroom for in-person instruction.

Violations: Refusal to comply with the COVID-19 code of conduct will be treated just as any classroom disruption. The student will first be asked to comply immediately; failing that, the student will be asked to leave the classroom. Any further noncompliance will result in the dismissal of the entire class. All COVID-19-related violations will be reported by the instructor to the Compliance Officer at Lally School and the Dean of Students. A student who is found to be in violation of the code, or who requires repeated reminders for compliance, will be asked to participate in all classes remotely. This is to protect their health and safety as well as the health and safety of their classmates, instructor, and the university community.

Masks: All students must wear a mask in classrooms and all public places including anywhere inside the building. Masks will be provided to the student by the Institute.

Traffic Flow and Social Distancing: Students and faculty will respect the need for social distancing. They are required to follow the traffic flow arrows posted in all rooms and buildings, including bathrooms and common areas.

In-Class Seating: Students should sit in the appropriate designated seating in the classroom. Students are not allowed to move furniture or sit in seats not designated by the Institute.

Cleaning of Spaces: Students are encouraged to clean the surfaces of the chairs/tables/desks they occupy before they sit down and as they prepare to leave. Cleaning and sanitizing solutions will be provided in the classroom.

Students who are ill, under quarantine for COVID-19, or suspect they are ill should not come to class. All faculty will make every reasonable effort to accommodate the student’s absence and will communicate that accommodation directly to the student. Students who need to report an illness should contact the Student Health Center via email or call 518-276-6287. Students seen off campus may request an excused absence with an uploaded doctor’s note that excuses them.

  • Parsons and Fox, Is Data Publication the Right Metaphor? [61]
  • Beautiful data: [62]
  • Scientific data management: [63]
  • BRDI activities: [64]
  • Data policy [65]
  • Self-directed study (answer the quiz): [66]
