Data Science (2009 Fall)

From Tetherless World Wiki

Revision as of 13:18, 18 November 2009 by Pfox (Talk | contribs)
Event Info
Type: Course
Title: Data Science
Location: LOW 4040
When: 2009/08/31 09:00:00 AM - 2009/12/10 12:00:00 PM

Data Science.

Instructors: Professor Peter Fox

Meeting times: Wednesday morning, 9:00 am - 11:50 am, in Low Center for Industrial Innovation, Room 4040 (initially).

Office Hours: Tuesday 10-11am in Winslow 2120 or by appointment

phone: 276-4862

Class Listing: DATA SCIENCE - 45229 - ITEC-CSCI-ERTH 6961 - 01



Science has fully entered a new mode of operation. Escience, defined as a combination of science, informatics, computer science, cyberinfrastructure and information technology, is changing the way all of these disciplines do both their individual and collaborative work.

Scientists are facing global problems of such magnitude, complexity and interdisciplinary nature that progress is limited by the lack of a trained and agile workforce.

At present, there is a lack of formal training in the key cognitive and skill areas that would enable graduates to become key participants in escience collaborations. The need is to teach key methodologies in application areas, based on real research experience, and to build a skill-set.

At the heart of this new way of doing science, especially experimental and observational science but also increasingly computational science, is the generation of data.

Goals: to instruct future scientists how to sustainably generate/collect and use data for their research as well as for others: data science. Participants will learn and be evaluated on the full life-cycle of data and relevant methods, technologies and best practices.

Topics for Data Science/ Foundations:

  • History of Data and Information
  • Data, Information, Knowledge Concepts and State-of-the-Art
  • Academic basis for Data and Information Science
  • Introduction to Informatics
  • Data life-cycle for Science
  • Data acquisition, curation, preservation
  • Data Integration
  • Metadata
  • Data Models, Schema
  • Data Tools and Data as Service Paradigms
  • Webs of Data and Data on the Web, the Deep Web
  • Data Workflow Management
  • Data Visualization
  • Data Discovery
  • Data and Information Management

Data Science Applications:

  • Geoinformatics
  • Bioinformatics
  • Sun, Earth, Environment and Climate
  • Chemistry, Physics and Astronomy
  • Environmental Engineering
  • Digital Libraries and Scientific Publications

Data Science Project options (examples):

  • Data Collection and Management
  • Data Models and Metadata
  • Data Standards
  • Tool Use and Evaluation
  • Data Life-Cycle Studies
  • Data and Information Product Generation

Syllabus/ Calendar

Refer to Reading/ Assignment/ Reference list for each week (see below).

  • Week 1 (Sep. 2): History of Data and Information, Data, Information, Knowledge Concepts and State-of-the-Art, Data life-cycle for Science; Data acquisition, curation, preservation, metadata slides
  • Week 2 (Sep. 9): Data and information acquisition (curation, preservation) and metadata - management slides
  • Week 3 (Sep. 16): Class exercise - collecting data - individual notes
  • Week 4 (Sep. 23): Data formats, metadata standards, conventions, reading and writing data and information slides
  • Week 5 (Sep. 30): Class Presentations: present your data I
  • Week 6 (Oct. 7): Class Presentations: present your data II, and Introduction to Data Mining slides
  • Week 7 (Oct. 14): Data Mining slides
  • Week 8 (Oct. 21): Academic basis for Data and Information Science, Data Models, Schema, Markup Languages and Data as Service Paradigms slides
  • Week 9 (Oct. 28): Data Analysis and Visualization slides
  • Week 10 (Nov. 4): Class exercise - group project - working with someone else's data notes
  • Week 11 (Nov. 11): Webs of Data and Data on the Web, the Deep Web, Data Discovery, Data Integration slides
  • Week 12 (Nov. 18): Data Workflow Management and Data Stewardship slides
  • Nov. 25 - no classes
  • Week 13 (Dec. 2): TBD
  • Week 14 (Dec. 9): Project Presentations

Academic Integrity

Student-teacher relationships are built on trust. For example, students must trust that teachers have made appropriate decisions about the structure and content of the courses they teach, and teachers must trust that the assignments that students turn in are their own. Acts that violate this trust undermine the educational process. The Rensselaer Handbook of Student Rights and Responsibilities defines various forms of Academic Dishonesty and you should make yourself familiar with these. In this class, all assignments that are turned in for a grade must represent the student’s own work. In cases where help was received, or teamwork was allowed, a notation on the assignment should indicate your collaboration. Submission of any assignment that is in violation of this policy will result in a penalty. If found in violation of the academic dishonesty policy, students may be subject to two types of penalties. The instructor administers an academic (grade) penalty, and the student may also enter the Institute judicial process and be subject to such additional sanctions as: warning, probation, suspension, expulsion, and alternative actions as defined in the current Handbook of Student Rights and Responsibilities. If you have any questions concerning this policy before submitting an assignment, please ask for clarification.


  • To instruct future scientists how to sustainably generate/collect and use data for their research as well as for others: data science.
  • To instruct future technologists how to understand and support the essential data and information needs of a wide variety of producers and consumers.
  • For both groups to know the tools and requirements needed to properly handle data and information.
  • Participants will learn and be evaluated on the full life-cycle of data and relevant methods, technologies and best practices.

Course Learning Objectives

Through class lectures, practical sessions, written and oral presentation assignments and projects, students should:

  • Understand and develop skill in Data Collection and Management
  • Understand and know how to develop Data Models and Metadata
  • Gain knowledge of Data Standards
  • Develop skill in Data Science Tool Use and Evaluation
  • Understand and apply the Data Life-Cycle principles
  • Become proficient in Data and Information Product Generation

Assessment Criteria

  • Via written assignments with specific percentage of grade allocation provided with each assignment
  • Via oral presentations with specific percentage of grade allocation provided
  • Via group presentations
  • Via participation in class (not to exceed 10% of total)
  • Late submission policy: first time with valid reason – no penalty, otherwise 20% of score deducted each late day
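
As a rough illustration of the late-submission arithmetic, the sketch below assumes that the 20% deduction is taken from the score earned, accumulates linearly for each late day, and never drops the result below zero. The policy above does not spell out these details, and the function and parameter names are hypothetical.

```python
def late_adjusted_score(score: float, days_late: int,
                        excused_first_time: bool = False) -> float:
    """Apply the late policy under the assumptions stated above.

    score: points earned before any late deduction.
    days_late: whole days past the deadline (0 means on time).
    excused_first_time: True for a first late submission with a valid reason.
    """
    # On-time work and an excused first late submission carry no penalty.
    if days_late <= 0 or excused_first_time:
        return score
    # Assumed: 20% of the earned score is lost per late day, floored at 0%.
    kept_percent = max(0, 100 - 20 * days_late)
    return score * kept_percent / 100


print(late_adjusted_score(80, 2))
```

Under these assumptions, an assignment scored 80 and submitted two days late would count as 48, and one submitted five or more days late would count as 0.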

Suggested Prerequisites

  • Knowledge such as that gained in a Database class (e.g., CSCI-XXXX)
  • Knowledge such as that gained in a Data Structures class (e.g., CSCI-XXXX)
  • or permission of the instructor

Reading/ Assignment/ Reference List

Class 1: Reading Assignment:

  • Changing Science: Chris Anderson: [1]
  • BRDI activities: [2]
  • Data policy [3]
  • Self-directed study (answer the quiz): [4]
  • Humanities - Digging into Data [5]

Class 2: Reading Assignment: None

Assignment 1 - Preparing for Data Collection (10% of grade)

Class 3: Reading Assignment:

  • Data formats: netCDF [6]
  • Spatial Data Transfer Standard GIS format [7]
  • Metadata resources [8]
  • Metadata Encoding and Transfer Standard - METS [9]
  • Open Archives Initiative - Protocol for Metadata Harvesting - OAI-PMH [10]
  • Earth Science Markup Language - ESML [11]
  • Climate Science Markup Language - CSML [12]
  • Climate and Forecast (CF) conventions [13]

Class 4: Reading Assignment: None

Assignment 2: Presenting your Data (20% of grade)

Class 5: Reading Assignment:

  • Brief Introduction to Data Mining [14]
  • Longer Introduction to Data Mining and slide sets [15]
  • See the software resources list [16]

Class 6: Reading Assignment:

  • Baker, Barton, Peterson and Fox 2008, reprint
  • Library and Information Science (none required)

Class 7: Reading Assignment: None

Assignment 3: Reformatting Data (20% of grade)

Class 8: Reading Assignment:

  • Peirce and Semiotics - [17]
  • Modern Visualization - [18]
  • Periodic Table of Visualization - [19]

Class 9: Reading Assignment: preview government and other (science) data repositories

  • Department of Energy EIA [20]
  • Humanities - Digging into Data [21]
  • Environmental Protection Agency (EPA)
  • US Geological Survey (and state surveys) (USGS)
  • NASA Earth Observing System (EOS) and ECHO
  • National Oceanic and Atmospheric Administration (NOAA) NODC, NGDC, NCDC
  • Department of Energy (DoE): [22]
  • National Library of Medicine (NLM)
  • Cancer Grid (CaBIG)
  • OneGeology
  • [23]
  • Find one of your own

Assignment 4: Working with someone else's data (40% of grade)

Class 10: Reading Assignment:

  • NITRD report: [24]
  • National Science Foundation Cyberinfrastructure Plan chapter on Data [25]

Class 11: Reading Assignment:

  • Relational Databases on the Semantic Web, Tim Berners-Lee, Design Issue Note, 1998-2009. [26]
  • A Survey of Current Approaches for Mapping of Relational Databases to RDF (PDF), Satya S. Sahoo, Wolfgang Halb, Sebastian Hellmann, Kingsley Idehen, Ted Thibodeau Jr, Sören Auer, Juan Sequeda, Ahmed Ezzat, 2009-01-31. [27]
  • Semantic Deep Web, James Geller, Soon Ae Chun, and Yoo Jung An, [28]
  • The Deep Web (Internet Tutorials) [29]
  • Digital Image Resources on the Deep Web [30]

Class 12: Reading Assignment:

  • Introduction to Data Management [31]
  • Overview of Scientific Workflow Systems, Gil (AAAI08 Tutorial) [32]
  • Comparison of workflow software products, Krasimira Stoilova, Todor Stoilov [33]
  • Scientific Workflow Systems for 21st Century, New Bottle or New Wine? Yong Zhao, Ioan Raicu, Ian Foster [34]
  • Guest Editors’ Introduction to the Special Section on Scientific Workflows, Ludäscher and Goble [35]

Assignment - final: Stewardship: Workflow construction for Preservation (10% of grade)

Class 13: Reading Assignment:

  • Optional - may be assigned by guest Lecturer

Reference material (purchase not required - please ask instructor if you are interested in any of these):

  • Beautiful data: [36]
  • Scientific data management: [37]
  • Interface to Science Archives [38]

Class 14: No Reading Assignment. Present group projects; Assignment 4 and the final assignment are due.

Attendance Policy

Enrolled students may miss at most one class without permission of the instructor.
