Instructor: Professor Peter Fox - pfox at cs dot rpi dot edu
TA: Katie Chastain - chastk at rpi dot edu
Meeting times: Tuesday morning 9:00 am - 11:50 am.
Office Hours: Monday 2:00-3:00pm in Winslow 2120 or by appointment in JRSC 1W06
phone: 518-276-4862
Class Listing: DATA SCIENCE - 45229 - ITEC-CSCI-ERTH 6961 - 01
Class Location: Sage 2715
- To instruct future scientists how to sustainably generate/collect and use data for their own research as well as for others: data science.
- To instruct future technologists how to understand and support the essential data and information needs of a wide variety of producers and consumers.
- For both to know the tools and requirements needed to properly handle data and information.
- Students will learn and be evaluated on the full life-cycle of data and the relevant methods, technologies and best practices.
Syllabus/ Calendar
Refer to Reading/ Assignment/ Reference list for each week (see below). Note that the schedule is likely to change based on the number of people in the class, especially around weeks 5 and 6.
- Week 1 (Aug. 28): History of Data and Information, Data, Information, Knowledge Concepts and State-of-the-Art, Data life-cycle for Science; Data acquisition, curation, preservation, metadata Week 1 slides [Download] Week 1 in-class notes [Download]
- Week 2 (Sep. 4): Data and information acquisition (curation) and metadata/ provenance - management Week 2 slides [Download] Week 2 in-class notes [Download]
- Week 3 (Sep. 11): Data formats, metadata standards, conventions, reading and writing data and information Week 3 slides [Download] Week 3 in-class notes
- Week 4 (Sep. 18): Class exercise - collecting data - individual Week 4 notes [Download]
- Week 5 (Sep. 25): Class Presentations: present your data I
- Week 6 (Oct. 2): Class Presentations: present your data II
- Oct. 9 - no classes (Tuesday follows Monday schedule)
- Week 7 (Oct. 16): Data Analysis Week 7 slides [Download]
- Week 8 (Oct. 23): Data Mining and Class exercise - group project - working with someone else's data Week 8 slides and notes [Download]
- Week 9 (Oct. 30): Academic basis for Data and Information Science, Data Models, Schema, Markup Languages and Data as Service Paradigms Week 9 slides [Download]
- Week 10 (Nov. 6): Data Workflow Management, Preservation and Data Stewardship Week 10 slides [Download]
- Week 11 (Nov. 13): Data Quality, Uncertainty and Bias in more detail, some examples Week 11 slides [Download]
- Week 12 (Nov. 20): Webs of Data and Data on the Web, the Deep Web, Data Discovery, Data Integration, Data Citation Week 12 slides [Download]
- Week 13 (Nov. 27): Final Project Presentations
Reading/ Assignment/ Reference List
Class 1 Reading Assignment:
- Changing Science: Chris Anderson: [1]
- Rise of the Data Scientist: [2]
- Where to draw the line: [3]
- Career of the Future
- What is Data Science: [4]
- Data Scientist: The Hottest Job You've Never Heard Of [4a]
- What Is a Data Scientist? [4b]
- Data Scientist - sexiest job of the 21st C? [4c]
- An example of data science: [5]
- BRDI activities: [6]
- Data policy [7]
- Self-directed study (answer the quiz): [8]
Reference
Class 2: Reading Assignment:
- MIT Libraries: [11]
- Earth Science Information Partners Data Management Workshop: [11a]
- Earth Science Information Partners: Course Outline [11c]
- Univ. Minnesota [12]
- Moore et al., Data Management Systems for Scientific Applications, IFIP Conference Proceedings; Vol. 188, Proceedings of the IFIP TC2/WG2.5 Working Conference on the Architecture of Scientific Software, pp. 273 – 284 (2000) [13]
- Data Management and Workflows [14]
- Metadata and Provenance Management [15]
- Provenance Management in Astronomy (case study) [16]
- Web Data Provenance for QA [17]
Assignment 1 - Data Science 2012 Assignment 1 [Download] Preparing for Data Collection (10% of grade) due week 3 on Sept. 11, 2012
Class 3: Reading Assignment:
- Data formats: netCDF [18]
- Spatial Data Transfer Standard GIS format [19]
- Metadata resources [20]
- Metadata Encoding and Transfer Standard - METS [21]
- Open Archives Initiative - Protocol for Metadata Harvesting - OAI-PMH [22]
- Keyhole Markup Language - KML Tutorial [23]
- Earth Science Markup Language - ESML [24]
- Climate Science Markup Language - CSML [25]
- Climate and Forecast (CF) conventions [26]
Assignment 2: Data Science 2012 Assignment 2 [Download] Presenting your Data (20% of grade) due in week 5, Sept. 25, 2012.
Class 4: Reading Assignment: None
Class 5: Reading Assignment:
- Brief Introduction to Data Mining [27]
- Longer Introduction to Data Mining and slide sets [28]
- See the software resources list [29]
- Data Analysis - Introduction
- Example: Data Mining
Class 6: Reading Assignment:
- None
Assignment 3: Data Science 2012 Assignment 3 [Download] Reformatting Data (20% of grade) due in week 8, October 23, 2012
Class 7: Reading Assignment: preview government and other (science) data repositories
Some of these have no single "entry point" to their data; you can find them fairly easily by searching for the name of the agency:
- Department of Energy EIA [30]
- Humanities - Digging into Data [31]
- Environmental Protection Agency (EPA)
- US Geological Survey (and state surveys) (USGS)
- NASA Earth Observing System (EOS) and ECHO
- National Oceanic and Atmospheric Administration (NOAA) NODC, NGDC, NCDC
- Department of Energy (DoE): [32]
- National Library of Medicine (NLM): [32a]
- Cancer Grid (CaBIG)
- OneGeology
- data.gov [33]
- Find one of your own
Class 8: Reading Assignment:
- None
Assignment 4: Data Science 2012 Assignment 4 [Download] Working with someone else's data (40% of grade)
Class 9: Reading Assignment:
- NITRD report: [38]
- OCLC Sustainable Digital Preservation and Access [39]
- National Science Foundation Cyberinfrastructure Plan chapter on Data [40]
- European High-Level Group on Data [41]
- Relational Databases on the Semantic Web, Tim Berners-Lee, Design Issue Note, 1998-2009. [42]
- A Survey of Current Approaches for Mapping of Relational Databases to RDF (PDF), Satya S. Sahoo, Wolfgang Halb, Sebastian Hellmann, Kingsley Idehen, Ted Thibodeau Jr, Sören Auer, Juan Sequeda, Ahmed Ezzat, 2009-01-31. [43]
Class 10: Reading Assignment:
- Introduction to Data Management [44]
- Changing software, hardware a nightmare for tracking scientific data [45] (and Parts I, II and III)
- Overview of Scientific Workflow Systems, Gil (AAAI08 Tutorial) [46]
- Comparison of workflow software products, Krasimira Stoilova, Todor Stoilov [47]
- Scientific Workflow Systems for 21st Century, New Bottle or New Wine? Yong Zhao, Ioan Raicu, Ian Foster [48]
Assignment - final: Data Science 2012 Final Assignment [Download] Stewardship: Workflow construction for Preservation (10% of grade)
Class 11: Reading Assignment:
Class 12: Reading Assignment:
- The Deep Web (Internet Tutorials) [50]
- Digital Image Resources on the Deep Web [51]
- Parsons and Fox, Is Data Publication the Right Metaphor? [52]
- Tom Heath, Linked Data Tutorial (2009) [53]
Reference material (purchase not required - please ask instructor if you are interested in any of these):
- Beautiful data: [52]
- Scientific data management: [53]
- Interface to Science Archives [54]
Learning Objective:
Through class lectures, practical sessions, written and oral presentation assignments and projects, students should:
- Develop and demonstrate skill in Data Collection and Management
- Develop Data Models and Generate Metadata
- Demonstrate Knowledge of Data Standards
- Demonstrate Skill in Data Science Tool Use and Evaluation
- Demonstrate the application of Data Life-Cycle principles
- Become proficient in Data and Information Product Generation
- Via written assignments with specific percentage of grade allocation provided with each assignment
- Via oral presentations with specific percentage of grade allocation provided
- Via group presentations
- Via participation in class (not to exceed 10% of total)
- Late submission policy: first time with valid reason – no penalty, otherwise 20% of score deducted each late day
Course: Data Science