Course Descriptions


Advanced Statistical Modeling and Computing Using R

Din Chen

Summary

This short course covers advanced statistical modeling and computing using R. We will review multiple linear regression for continuous data and then cover logistic regression for binary/binomial data, Poisson and negative binomial regression for count data, longitudinal data analysis, and general multi-level modeling. We also cover statistical computing topics, including data simulation and bootstrapping.

Why Take This Course?

Topics to be covered (Day 1):

  • R packages and how to get help in R
  • Basic statistical computing and simulations
  • Multiple linear regression using the R “lm” function with real data analysis
  • Logistic regression for binary/binomial data using the R “glm” function with real data analysis
  • Poisson regression for count data using the R “glm” function with real data analysis
  • Generalized Linear Model using “glm”

Topics to be covered (Day 2):

  • Longitudinal data using the R package “nlme”
  • Multi-level modeling for continuous data using the R package “nlme”
  • Multi-level modeling for binary/binomial data using the R package “lme4”
  • Multi-level modeling for count data using the R package “nlme”
  • Introduction to statistical simulations and bootstrapping

What Will Participants Learn?

Participants will learn advanced statistical modeling and computing using R to analyze continuous and non-continuous data with multiple regression, logistic regression, Poisson regression, negative binomial regression, and general multi-level modeling. They will learn how to choose the statistical models that best suit the type of data and the research questions of a given study. They will be better able to conceptualize, design, run, and interpret analyses, and to communicate results clearly and effectively, using the advanced modeling techniques covered in the course.

Course delivery will be presentation with handouts. Lecture will be interspersed with questions and answers, software demonstration, and discussion.

Reference: Chen, D.G., Peace, K.E. and Zhang, P.G. (2017). Clinical Trial Data Analysis Using R and SAS. Chapman & Hall/CRC Biostatistics Series.

Prerequisites and Requirements

  • Participants should bring a laptop to this session. Please also install R software, and R packages “nlme” and “bootstrap.”
  • To fully participate, students should be familiar with descriptive and inferential statistics as well as multiple regression analysis.
  • Some R programming skills are helpful, but not required.

Advanced R: Automating Workflow & Sharing Work

Justin Post

Summary

This class is a primer on using R Markdown to write reproducible reports and presentations that embed R output, and on creating interactive, customizable web applets called R Shiny applications.

Why Take This Course?

This class will be useful for automating the creation of data analyses, descriptive statistics, and graphical summaries.

What Will Participants Learn?

Students will learn how to use R Markdown to produce slides, web pages, and reports, and how to build applications with the R Shiny web application framework.

Prerequisites and Requirements

Please bring a laptop with you. A reasonably solid understanding of R syntax, objects, and functions is assumed.


An Introduction to Deep Learning with R

Ashok Krishnamurthy

Summary

In the past few years, Deep Learning has emerged as a powerful Machine Learning method with applications in areas such as object recognition, image classification, video analysis, and natural language processing. This course will introduce participants to Deep Learning from a hands-on perspective. The approach will be to avoid most of the math and concentrate instead on the underlying ideas and principles. We will use Tensorflow/Keras as the underlying computational platform and the R interface to Tensorflow created by RStudio as the coding platform. At the end of the course you will have a basic understanding of Deep Learning and Tensorflow, of how Deep Learning can be used to solve problems with images, and of how to use Tensorflow/Keras to set up Deep Learning models.

Why Take This Course?

Deep Learning appears to be a magical algorithm that can solve difficult problems in a variety of domains. A number of converging trends underlie the success of Deep Learning. In this course, you will get a chance to look under the hood of the Deep Learning hype and see the underlying architecture, and you will be exposed to a set of tools that you can use to create your own Deep Learning models.

What Will Participants Learn?

The course will focus on the following topics:

  • Neural Networks
  • Neural Networks as universal approximators
  • Training Neural Networks as an optimization problem
  • Deep Neural Networks
  • Why Deep Learning now?
  • Convolutional Neural Networks
  • Tensors, Tensorflow and Keras
  • Autoencoders
  • Transfer Learning

Participants will complete a number of computer exercises using RStudio, R, Keras and Tensorflow.

Prerequisites and Requirements

This course will assume a basic understanding of statistics and calculus at the undergraduate level. Experience with R is necessary to get full benefits from the class.


Introduction to Effective Information Visualization

Eric Monson

Summary

Participants will see how freely and commonly available software can be used to create effective visualizations, learn how to clean and re-structure data, and learn basic principles of effective visualization so they can go beyond the defaults and create eye-catching and impactful figures!

Why Take This Course?

Visualization is a powerful way to reveal patterns in data, attract attention, and get your message across to an audience quickly and clearly. But, there are many steps in that journey from information to influence, and many questions – what visualization tools to use, how to get data into the right format, and which choices to make when putting it all together to tell your story? This course will quickly walk participants through a wide variety of data and chart types to help even beginners feel comfortable embarking on a new visualization project.

What Will Participants Learn?

The course will cover four major topic areas. The sessions will combine lecture and hands-on activities in data cleaning, transformation, and visualization.

  • Tips for effective data visualization
  • Charts, maps and interactive dashboards with Tableau
  • Cleaning and re-structuring data using OpenRefine
  • Free web-based tools such as RAW for less-common visualization types

Prerequisites and Requirements

This course is designed at an introductory level, but will assume a basic understanding of spreadsheets as a way of storing and processing data. No programming will be necessary. Bringing a laptop is not required, but participants are strongly encouraged to do so. Having pre-installed Tableau (Public or Desktop), a web browser, and OpenRefine will allow them to get in-class experience during the hands-on computer-based activities.


Introduction to Machine Learning and Data Mining Using R

Ashok Krishnamurthy (Section 1 – Mon/Tue)

Summary

This course will introduce participants to a selection of the techniques used in machine learning and data mining in a hands-on, application-oriented way. Topics covered will include data exploration, clustering, association rules, decision trees, random forests, support vector machines, and deep learning. The hands-on exercises will be based on the statistical programming language R. At the end of the two days, you will be able to explore a data set, determine which analysis method is appropriate for the data, and be able to use R packages to do the analysis.

Why Take This Course?

The ready availability of digital data from numerous sources is a tremendous opportunity for businesses and scientists to obtain new insights and confirm hypotheses. Machine learning and data mining provide a collection of algorithms and computational methods to manage, analyze and get information from the data. In the world of big data and data science, machine learning is a fundamental tool for data insights.

What Will Participants Learn?

The course will be organized in the following major sections: 

  • introduction
  • data exploration
  • association rules 
  • decision trees
  • random forests
  • clustering
  • support vector machines
  • deep learning  

Each section will have an associated computer exercise; we will make extensive use of R and R packages.

Prerequisites and Requirements

Participants should bring laptops to this course. This course will assume a basic understanding of statistics and calculus at the undergraduate level. Experience with R is necessary to get the full benefits of the class.


Introduction to Machine Learning and Data Mining Using R

Yufeng Liu (Section 2 – Thu/Fri)

Summary

Statistical machine learning and data mining are interdisciplinary research areas that are closely related to statistics, computer science, engineering, and bioinformatics. Many statistical machine learning and data mining techniques and algorithms are useful for various scientific areas. This two-day short course will provide an overview of statistical machine learning and data mining techniques with applications to the analysis of real data.

Both supervised and unsupervised techniques will be covered. Supervised learning techniques include penalized regression such as LASSO and its variants, support vector machines, Boosting, and tree-based methods. Unsupervised learning techniques include dimension reduction methods such as principal components analysis and clustering analysis. The main emphasis will be on the analysis of real data sets from various scientific fields. The techniques discussed will be demonstrated in R.

Why Take This Course?

This course is intended for researchers who have some knowledge of statistics and want to be introduced to statistical machine learning and data mining, or practitioners who would like to apply statistical machine learning techniques to their problems.

What Will Participants Learn?

The course will cover the following:

  • Fundamentals of Statistical Learning
    • Training versus test error rates
    • Supervised versus unsupervised methods
    • Bias/Variance tradeoff
  • Linear regression and Penalized Regression
    • Ridge regression
    • Lasso
    • Further extensions
  • Cross-validation & Bootstrapping
  • Classification Techniques
    • Logistic regression and penalized logistic regression
    • Linear discriminant analysis
    • Quadratic discriminant analysis
    • Nearest neighbors classification
    • Support vector machines
  • Tree-based Methods
    • Bagging
    • Random forests
  • Boosting
  • Unsupervised Learning Techniques
    • Dimension reduction: Principal Component Analysis and others
    • Clustering
    • Other Selected Topics
  • R exercises will be included throughout the two days.

Prerequisites and Requirements

Participants should be familiar with linear regression and basic statistical and probability concepts, and should have some familiarity with R programming. Participants should bring laptops to class.


Introduction to Programming Using Python

Jason Carter

Summary

This course is an introduction to programming as a skill, a discipline, and a profession for graduate students. We will dive into hands-on programming from day one and progress to evaluating and using open source libraries and frameworks to manage large and complex datasets. We will focus equally on reading and writing code.

Why Take This Course?

Students will leave the course with real skills, an ability to learn new programming technologies, and an understanding of how to incorporate open source code into their projects. It will serve as an appropriate foundation for students seeking to use programming as a tool in their data science tool belt. This course will help students evaluate, communicate with, or work with programmers or code.

What Will Participants Learn?

At the end of this course, students should have:

  • The ability to analyze large or complex data sets.
  • The skills required to solve problems by creating and modifying programs and systems, using modern programming tools.
  • The knowledge of basic programming concepts, their appropriate usage, and how and where to learn more.
  • An attitude of confidence when reading, writing, or discussing computer code.

Prerequisites and Requirements

There are no prerequisite course requirements for the course. However, you will need to be skilled in the use of basic mathematics and algebra, as well as email and web usage.


Introduction to R for Data Science

Justin Post (Section 1)
Jonathan Duggins (Section 2)

Summary

This course provides a basic introduction to the R software environment for the purpose of data science. The course covers importing and exporting data, manipulating data and recoding variables, data visualization, and basic statistical analyses.

Why Take This Course?

R has recently become the preferred computing and statistical analysis software for academic analysis because it offers unparalleled breadth of tools for virtually any model of interest to social scientists—and particularly those interested in so-called “big data.” Unfortunately, R also has a steep learning curve. Courses such as this one are therefore indispensable for obtaining a basic working knowledge of the language and learning how to navigate the complex web of information about R that is currently available online.

What Will Participants Learn?

The course provides an overview of common data objects in R and how to manipulate and access them. We will discuss how to import data from other software formats such as Excel, STATA, and SPSS. We will also cover methods for data cleaning and manipulation, basic descriptive statistics such as cross-tabs, histograms, and scatter plots, and linear regression models.

Prerequisites and Requirements

In order to participate in the hands-on sections of the course, participants must bring their own laptop computer with enough space to install R and RStudio. This course assumes no knowledge of computer programming, but basic familiarity with another statistical analysis software such as STATA, SPSS, or SAS will make the course easier to follow.


Network Analysis for Data Scientists

Bill Shi

Summary

This course will provide an introduction to network analysis with a focus on data and applications. It will introduce basic concepts and ideas in network science, and cover methods that are practical and useful in dealing with network data. At the completion of this course, participants will have a solid understanding of what network analysis does and be able to run common methods on network data.

Why Take This Course?

We live in a connected world: the Internet, social networks, the neural networks in our brains, you name it. Beyond systems with explicit connections, most real-life systems are composed of complicated components with implicit interactions, and ignoring those interactions leaves us unable to predict system behavior, as in the unexpected financial market crisis of 2008. Networks have therefore emerged as a powerful tool for studying these complex systems and have found fruitful applications in a wide range of areas, from technology to public health to social problems. This course will provide an introduction to this emerging area with a focus on data and real applications.

What Will Participants Learn?

  • Concepts and terminologies in network science, such as nodes, edges, degrees, small world, etc.
  • Centrality measures of nodes and edges in a network, such as pagerank, betweenness, closeness, etc.
  • Structural features such as clustering coefficient, mixing coefficient, etc.
  • Random network models such as Erdős-Rényi network, scale-free network, small world, etc.
  • Community detection methods
  • Link prediction methods

Prerequisites and Requirements

  • Basic knowledge in statistics
  • The igraph package (http://igraph.org/r/) in R
  • Basic programming skills in R
  • Open-source network visualization software Gephi (gephi.org)

Intermediate Programming in R

Justin Post (Section 1)
Jonathan Duggins (Section 2)

Summary

This class provides an introduction to basic programming techniques in R. The R language is widely used among statisticians and data miners for developing statistical software and data analysis. R’s popularity has increased substantially in recent years.

Why Take This Course?

This class will be useful to those who wish to restructure or clean unstructured data, streamline code by automating repeated code, or improve the speed of data analysis.

What Will Participants Learn?

Students will learn the concepts and syntax of basic programming techniques including “for” loops, if/else statements, applying a function systematically to R objects, and writing their own functions.

Prerequisites and Requirements

Please bring a laptop with you. Basic understanding of R objects (e.g. matrices, lists, data frames, etc.) and how to access them is assumed.


Text Analysis Using R

Alison Blaine

Summary

This course explains how to analyze text-based data collected from the internet using R. The course will cover how to load text data from Application Programming Interfaces (APIs) as well as files, and methods for doing basic natural language processing on the data.

Why Take This Course?

The skills required to analyze textual data are useful in a wide array of academic disciplines. Having coding skills provides you many more options for getting data and deriving insights from it and doing so at little to no cost. This course will provide students already familiar with R with some of the programming skills necessary to take more control over their data analysis process and feel empowered to dive deeper.

What Will Participants Learn?

This course is divided into two segments. In the first section, participants will learn basic techniques for collecting text-based data from the Internet by writing code to extract data from application programming interfaces. The second section will explain how to apply natural language processing approaches to analyze data.

Prerequisites and Requirements

Please bring a laptop with you. This course is best suited for those who already have a basic working knowledge of R. Students with no knowledge of R might consider pairing this course with the Introduction to R for Data Science course that is being offered earlier in the week.


Visualization for Data Science Using R

Angela Zoss

Summary

This course is designed for two audiences: experienced visualization designers looking to apply open data science techniques to their work and data science professionals who have limited experience with visualization. Participants will develop skills in visualization design using R, a tool commonly used for data science.

Why Take This Course?

Data science skills are increasingly important for research and industry projects. With complex data science projects, however, come complex needs for understanding and communicating analysis processes and results. Ultimately, an analyst’s data science toolbox is incomplete without visualization skills. Incorporating effective visualizations directly into the analysis tool you are using can facilitate quick data exploration, streamline your research process, and improve the reproducibility of your research.

What Will Participants Learn?

The course will take a project-based approach to learning best practices for visualization for data science. Participants will be guided through several sample analysis and visualization projects that will highlight different types of visualization, different features of R and its visualization libraries, and different challenges that arise when trying to apply an open data science philosophy to visualization. In short, students will learn the following:

  • introduction to visualization in R
  • basic syntax for ggplot2
  • applying common graphic design principles to ggplot2 visualizations
  • using Shiny to create interactive websites that include R data and visualizations

Prerequisites and Requirements

Participants should bring laptops to this course. We will use RStudio to interact with R, and all exercises will be distributed in RMarkdown files (rather than simple R script files). This allows us to combine R code with non-code elements and promotes a literate programming approach to research.

This course assumes basic familiarity with R — e.g., R syntax, data structures, development environments. Visualizations will be created with ggplot2 and other tidyverse libraries, but prior experience with those libraries is not required. In order to fully participate in class exercises, participants should install the following on their laptops: current versions of R, RStudio, the tidyverse package, and the knitr package (optional).


Working with Messy Data

Brown Biggers

Summary

When working with data, one thing is certain: data are rarely in an optimized format. A misplaced space here or an extra comma there can mean the difference between two clicks and two hours of work. In this course, we will work with ways to manipulate, interpret, and present data from webpages and text using Python 2.7. This class will also cover libraries for import, data munging, analysis, and visualization including Regular Expressions, Pandas, NumPy, Matplotlib, and Bokeh.

Why Take This Course?

The tools for handling data can be both expensive and complicated. This class will cover a wide range of techniques with common and open-source data processing tools. If you find yourself curious about how better to handle tabular data but are often intimidated by the steep learning curve associated with programming, this class will get you started on the path towards better data management.

What Will Participants Learn?

Participants will learn Python programming as applied to importing, processing, and exporting data. We will cover some of the libraries associated with mathematical and statistical analysis, grouping, time-based analysis, as well as using regular expressions.
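As a rough illustration of the kind of cleanup regular expressions make possible, the sketch below tidies a comma-separated line with stray spaces and extra commas. It is not taken from the course materials: the function name `clean_line` and the sample input are hypothetical, and collapsing repeated commas assumes the data has no intentionally empty fields.

```python
import re


def clean_line(line):
    """Normalize one comma-separated line of text (hypothetical helper)."""
    line = line.strip()                    # drop leading/trailing whitespace
    line = re.sub(r"\s*,\s*", ",", line)   # remove spaces around commas
    line = re.sub(r",{2,}", ",", line)     # collapse runs of commas (assumes no empty fields)
    line = re.sub(r"\s{2,}", " ", line)    # collapse repeated whitespace inside fields
    return line.strip(",")                 # drop a leading or trailing comma


print(clean_line("  Smith ,, John ,  42 ,"))
```

The call syntax shown runs under both Python 2.7 and Python 3. For real tabular files, a dedicated parser such as the standard `csv` module or Pandas is usually a safer starting point than hand-written patterns.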

Prerequisites and Requirements

Please bring a laptop for this hands-on course. Please also note that this course is intended for individuals with basic-to-intermediate understanding of one or more of the following: the Python programming language, data import/export formats, and text processing. We will be using Anaconda (for Python 2.7) and SublimeText. Though you are welcome to use your preferred installations, this class expects that you will have a laptop with a running installation of Python 2.7, a text editor with regular expression find-and-replace, and a web browser with internet connectivity.

Recommended installations:
Anaconda for Python 2.7: https://www.anaconda.com/download/
SublimeText 3: https://www.sublimetext.com/3