Instructor: Professor Peter Fox
TA: Lakshmi Chenicheri chenil at rpi dot edu
Meeting times: TF 12-1:50
Office Hours: Winslow 2120 or by appointment in Lally 207A
Phone: x4862
TA Office Hours: TBD
Class Listing: ITWS 4963/ITWS 6965
Class Location: SAGE 3101
Syllabus/ Calendar
Refer to Reading/ Assignment/ Reference list for each week (see below).
- Week 1 (Jan. 21/24): Introduction to Course, Case Studies, and Preview of Course Material (Week 1 Tuesday slides [Download]); Relevant software and getting it installed (lab) (Week 1 Friday slides [Download])
- Week 2 (Jan. 28/31): Starting with Data and Information Resources, Role of Hypothesis, Synthesis and Model Choices (Week 2 Tuesday slides [Download]); Data filtering, hypothesis exploration, visual analysis, model consideration and assessment (lab) (Week 2 Friday slides [Download])
- Week 3 (Feb. 4/7): Preliminary Analysis, Interpretation, Detailed Analysis, Assessment (Week 3 Tuesday slides [Download]); lab (Week 3 Friday slides [Download])
- Week 4 (Feb. 11/14): Introduction to Analytic Methods, Types of Data Mining for Analytics (Week 4 Tuesday slides [Download]); Exercises for linear regression, kNN and K-means (lab) (Week 4 Friday slides [Download])
- Week 5 (Feb. 21; NOTE: no class on Feb. 18, Tuesday follows Monday schedule): Interpreting regression, kNN and K-means results, evaluating models (Week 5 Friday slides [Download])
- Week 6 (Feb. 25/28): Interpreting kNN and K-means, Clustering and Bayesian Inference (Week 6 Tuesday slides [Download]); Weighted kNN, Clustering and Bayesian Inference (lab) (Week 6 Friday slides [Download])
- Week 7 (Mar. 4/7): Applying models, Decision Trees, Decision Making with Certainty, Uncertainty, Qualitative Methods (Week 7 Tuesday slides [Download]); lab (Week 7 Friday slides [Download])
- Mar. 11/14 - no classes - Spring Break
- Week 8 (Mar. 18/21): Assignment 5 presentations: Project Proposals (Tuesday); lab (Week 8 Friday slides [Download])
- Week 9 (Mar. 25/28): Remainder of Assignment 4 and 5 presentations, Support Vector Machines and other tree models (lab) (Week 9 Friday slides [Download])
- Week 10 (Apr. 1/4): Support Vector Machines (Week 10 Tuesday slides [Download]); SVM (lab) (Week 10 Friday slides [Download])
- Week 11 (Apr. 8/11): Interpreting Support Vector Machines, Decision Trees and Cross-validation (optimizing) (Week 11 Tuesday slides [Download]); Trees and Cross-validation (lab) (Week 11 Friday slides [Download])
- Week 12 (Apr. 15/18): New Models, Weak models, Optimizing, Iterating (lab) (Week 12 Tuesday slides [Download]); Lab and continue project and assignment work (Week 12 Friday slides [Download])
- Week 13 (Apr. 22/25): Boosting, dimension reduction and a preview of the return to Big Data (Week 13 Tuesday slides [Download]); Open Lab and continue project and assignment work (no slides)
- Week 14 (Apr. 29/May 2): No class Tuesday; Friday lab: PCA and Big Data Infrastructure (Week 14 Friday slides [Download]) and continue project and assignment work
- Week 15 (May 6): Final Project Presentations
Reading/ Assignment/ Reference List
Class 1 Reading Assignment:
- Sports Analytics – Moneyball (http://www.imdb.com/title/tt1210166/)
- Nate Silver (http://en.wikipedia.org/wiki/Nate_Silver)
- Google Analytics - http://www.marketingscoop.com/google-analytics-casestudy.htm
- http://www.slideshare.net/lsakoda/case-studies-utilizing-real-time-data-...
- http://www.marketquotient.com/case-studies.html
- http://www.ibm.com/analytics/us/en/case-studies/
Class 2 Reading Assignment: no reading
Class 3 Reading Assignment:
- http://kkovacs.eu/cassandra-vs-mongodb-vs-couchdb-vs-redis
- http://www.r-tutor.com/r-introduction/data-frame
- http://www.r-tutor.com/r-introduction/
Class 4 Reading Assignment: none
Class 5 Reading Assignment: none
Class 6 Reading Assignment: none
Class 7 Reading Assignment:
- Random Forests: http://stat-www.berkeley.edu/users/breiman/RandomForests/
Class 9 Reading Assignment: none
Class 10 Reading Assignment:
- Karatzoglou et al. 2006: http://escience.rpi.edu/data/DA/v15i09.pdf
- Vert SVM basic: http://escience.rpi.edu/data/DA/svmbasic_notes.pdf
- ALL dataset: http://www.stjuderesearch.org/site/data/ALL1/
- RSVM: http://www.stanford.edu/group/wonglab/RSVMpage/R-SVM.html
Class 11 Reading Assignment: none
Class 12 Reading Assignment:
Class 13 Reading Assignment: none
Class 14 Reading Assignment: none
Reference material (available through RPI library - RCS login required):
- Predictive Analytics: The Power to Predict Who Will Click, Buy, Lie, or Die (online) (RECOMMENDED)
- Big Data Analytics: Turning Big Data into Big Money
- Big Data Analytics: Turning Big Data into Big Money (online)
- Big Data Analytics: From Strategic Planning to Enterprise Integration with Tools, Techniques, NoSQL, and Graph (online)
- Big Data Analytics with R and Hadoop (online)
- R for Everyone: Advanced Analytics and Graphics (online)
Course Description:
Data and information analytics extends analysis (descriptive and predictive models to obtain knowledge from data) by using insight from analyses to recommend action or to guide and communicate decision-making. Thus, analytics is concerned not so much with individual analyses or analysis steps as with an entire methodology. The world at large is confronted with increasingly large and complex sets of structured and unstructured information from sensors, instruments, and computer simulations; data is "hidden" in websites, application servers, social networks, and mobile devices. For the nation, assimilating information across disparate domains (e.g., intelligence, economics, science) has the potential to provide improved capabilities for decision makers. In commerce and industry, analytics-driven enterprises are becoming mainstream, and traditional enterprises are moving toward analytics-driven approaches for core business functions. Yet there is a shortfall in the education and skills needed to meet these growing needs. In government and corporations, cybersecurity problems are prevalent. The investment in advanced analytics capabilities could potentially be greater than any prior government investment in computing, and could be leveraged more broadly. Emphasis is now placed on disruptive data and information sources on the Web and Internet: using Web Science and informatics to explore social networks, platform competition, the "long tail", and the economic or resource impacts of the search for new findings. Key topics include advanced statistical computing theory, multivariate analysis, the application of material from computer science courses such as data mining and machine learning, and change detection by uncovering unexpected patterns in data.
Course goals:
• Introduce students to relevant methods, and to recognizing and applying quantitative algorithms, techniques, and interpretation.
• Develop students' strategic thinking skills, combined with a solid technical foundation in data and model-driven decision-making.
• Develop students' ability to apply critical and analytical methods to formulate and solve science, engineering, medical, and business problems.
• Examine real-world examples using modern cyberinfrastructure to place statistical and data-mining techniques in context, to develop data-analytic thinking, and to illustrate that proper application is as much an art as it is a science.
• Enable students, by the end of the course, to communicate analytic findings effectively to non-specialists.
Course Learning Objectives:
- Students will demonstrate knowledge of relevant analytic methods, recognize and apply quantitative algorithms and techniques, and interpret results.
- Students will demonstrate strategic thinking skills, combined with a solid technical foundation in data and model-driven decision-making.
- Students will develop the ability to apply critical and analytical methods to formulate and solve science, engineering, medical, and business problems.
- Students will examine real-world examples to place data-mining techniques in context, to develop data-analytic thinking, and to illustrate that proper application is as much an art as it is a science.
- Students must effectively communicate analytic findings to non-specialists.
- [graduate level] Students must develop and demonstrate a working knowledge of decision making under uncertainty and be able to build optimization models that incorporate random parameters: static stochastic optimization, two-stage optimization with recourse, chance-constrained optimization, and sequential decision making.
Course: Data Analytics