Instructor: Professor Peter Fox
TA: Jiaju Shen - shenj6 at rpi dot edu
Meeting times: TF 12-1:50
Office Hours: Winslow 2120 or by appointment in Lally 207A
phone: x4862
TA Office Hours: TBD
Class Listing: ITWS 4963/ITWS 6965
Class Location: LALLY HALL 102
Syllabus/ Calendar
Refer to Reading/ Assignment/ Reference list for each week (see below).
- Week 1 (Jan. 27/30): Introduction to Course, Case Studies, and Preview of Course Material Week 1 Tuesday slides [Download], Introduction/ refresher on basic statistics Week 1 Friday slides [Download]
- Week 2 (Feb. 3/6): Starting with Data and Information Resources, Role of Hypothesis, Synthesis and Model Choices Week 2 Tuesday slides [Download], Data filtering, hypothesis exploration, visual analysis, model consideration and assessment (lab) Week 2 Friday slides [Download]
- Week 3 (Feb. 10/13): Preliminary Analysis, Interpretation, Detailed Analysis, Introduction to Analytic Methods, Types of Data Mining for Analytics Week 3 Tuesday slides [Download], (lab) Week 3 Friday slides [Download]
- Week 4 (Feb. 20 - NOTE no class on Feb. 17): Exercises for linear regression, kNN and K-means (lab) Week 4 Friday slides [Download], Week 3 solution [Download], Week 3 output [Download]
- Week 5 (Feb. 24/27): Weighted kNN, Clustering, early decision trees and Bayesian Inference Week 5 Tuesday slides [Download], lab for kNN and K-means on a dirty dataset Week 5 Friday slides [Download]
- Week 6 (Mar. 3/6): More Clustering and Bayesian Inference Week 6 Tuesday slides [Download] (lab) Week 6 Friday slides [Download] revised lab 5b Revised Week 5 Friday slides [Download]
- Week 7 (Mar. 10/13): (lab) - Week 7 Tuesday lab slides [Download], Interpreting weighted kNN, decision trees, cross-validation, dimension reduction and scaling Week 7 Friday slides [Download]
- Week 8 (Mar. 17/20): Assignment 5 presentations: Project Proposals (Tuesday/ Friday)
- Mar. 24/27 - no classes - Spring Break
- Week 9 (Mar. 31/Apr. 3): Support Vector Machines (Tuesday) Week 9 Tuesday slides (linear, up to duality) [Download], and Week 9 Friday slides (nonlinear) and SVM lab (Friday) Week 9 Friday lab [Download]
- Week 10 (Apr. 7/10): Factor Analysis, Fisher Linear Discriminant Week 10 Tuesday slides [Download] Cross-validation, Random Forest, Dimension Reduction, MDS, Factor Analysis and Fisher LD - lab Week 10 Friday slides [Download]
- Week 11 (Apr. 14/17): Bootstrapping, Bagging Week 11 Tuesday slides [Download] Trees and Cross-validation (lab) Week 11 Friday slides [Download]
- Week 12 (Apr. 21/24): Revisiting Regression - local methods Week 12 Tuesday slides [Download] Lab - Regression - local methods and continue project and assignment work Week 12 Friday slides [Download]
- Week 13 (Apr. 28/May 1): Mixed Models, Optimizing, Iterating Week 13 Tuesday slides [Download] Open Lab and continue project and assignment work - Assignment 7 due (no slides)
- Week 14 (May 5/May 8): Final Project Presentations (NO CLASS on May 12)
Reading/ Assignment/ Reference List
Class 1: Reading Assignment:
- Sports Analytics – Moneyball (http://www.imdb.com/title/tt1210166/),
- Nate Silver (http://en.wikipedia.org/wiki/Nate_Silver)
- Google Analytics - http://www.marketingscoop.com/google-analytics-casestudy.htm
- http://www.slideshare.net/lsakoda/case-studies-utilizing-real-time-data-...
- http://www.marketquotient.com/case-studies.html
- http://www.ibm.com/analytics/us/en/case-studies/
Class 2 Reading Assignment: no reading
Class 3 Reading Assignment:
- http://en.wikipedia.org/wiki/Degrees_of_freedom_(statistics)
- http://en.wikipedia.org/wiki/Regression_analysis
- http://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm
- http://varianceexplained.org/r/kmeans-free-lunch/
- http://en.wikipedia.org/wiki/K-means_clustering
Class 4 Reading Assignment: none
Class 5 Reading Assignment: none
Class 6 Reading Assignment: none
Class 7 Reading Assignment:
Class 8 Reading Assignment: none
- http://stat-www.berkeley.edu/users/breiman/RandomForests/ Random Forests
Class 9 Reading Assignment: None
Class 10 Reading Assignment:
- http://escience.rpi.edu/data/DA/v15i09.pdf Karatzoglou et al. 2006
- http://escience.rpi.edu/data/DA/svmbasic_notes.pdf Vert SVM basic
- http://www.stjuderesearch.org/site/data/ALL1/ ALL dataset
- http://www.stanford.edu/group/wonglab/RSVMpage/R-SVM.html RSVM
Class 11 Reading Assignment: None
Class 12 Reading Assignment:
Class 13 Reading Assignment: none
Reference material (available through RPI library - RCS login required):
- Predictive Analytics: The Power to Predict Who Will Click, Buy, Lie, or Die (online) (RECOMMENDED)
- Big data analytics : turning big data into big money
- Big Data Analytics : Turning Big Data into Big Money (online)
- Big Data Analytics : From Strategic Planning to Enterprise Integration with Tools, Techniques, NoSQL, and Graph (online)
- Big Data Analytics with R and Hadoop (online)
- R for Everyone: Advanced Analytics and Graphics (online)
Course Description:
Data and information analytics extends analysis (descriptive and predictive models to obtain knowledge from data) by using insight from analyses to recommend action or to guide and communicate decision-making. Thus, analytics is not so much concerned with individual analyses or analysis steps, but with an entire methodology. The world at large is confronted with increasingly large and complex sets of structured and unstructured information from sensors, instruments, and computer simulations; data is "hidden" in websites, application servers, social networks and on mobile devices. For the nation, assimilating information across disparate domains (e.g., intelligence, economics, science) has the potential to provide improved capabilities for decision makers. In commerce and industry, analytics-driven enterprises are becoming mainstream. Yet there is a shortfall in the key educational skills needed to meet the growing need. Traditional enterprises are moving toward analytics-driven approaches for core business functions. In government and corporations, cybersecurity problems are prevalent. The investment in advanced analytics capabilities could potentially be more broadly leveraged today, and greater, than any prior government investment in computing. Emphasis is now placed on disruptive data and information sources on the Web and Internet: using Web Science and informatics to explore social networks, platform competition, the "long tail" and economic or resource impacts of the search for new findings. Key topics include advanced statistical computing theory, multivariate analysis, and application of computer science courses such as data mining and machine learning, and change detection by uncovering unexpected patterns in data.
Course goals:
• Introduce students to relevant methods to recognize and apply quantitative algorithms, techniques and interpretation
• Develop students' strategic thinking skills, combined with a solid technical foundation in data and model-driven decision-making.
• Develop ability to apply critical and analytical methods to formulate and solve science, engineering, medical, and business problems
• Students will examine real-world examples using modern cyberinfrastructure to place statistical and data-mining techniques in context, to develop data-analytic thinking, and to illustrate that proper application is as much an art as it is a science.
• By the end of the course, students can effectively communicate analytic findings to non-specialists
Course Learning Objectives:
- Students to demonstrate knowledge of relevant analytic methods, and to recognize and apply quantitative algorithms and techniques and interpret results
- Students to demonstrate strategic thinking skills, combined with a solid technical foundation in data and model-driven decision-making.
- Students to develop ability to apply critical and analytical methods to formulate and solve science, engineering, medical, and business problems
- Students will examine real-world examples to place data-mining techniques in context, to develop data-analytic thinking, and to illustrate that proper application is as much an art as it is a science.
- Students must effectively communicate analytic findings to non-specialists.
- [graduate level] Students must develop and demonstrate a working knowledge of decision making under uncertainty, and be able to build optimization models that incorporate random parameters: static stochastic optimization, two-stage optimization with recourse, chance-constrained optimization, and sequential decision making.
Course: Data Analytics