Course Descriptions

Please note that, due to their popularity, we offer two sections of some courses. For these courses, please pay attention to which instructor you wish to take the class with.

Mon/Tue Block: Introduction to R for Data Science > Specify instructor Duggins or Post

Wed Block: Programming in R > Specify instructor Duggins or Post

Thurs/Fri Block: Intro to Data Mining & Machine Learning > Specify instructor Krishnamurthy or Vatsavai

  • August 13-14, 2018

Introduction to R for Data Science - POST
Instructor: Justin Post

This course provides a basic introduction to the R software environment for the purpose of data science. The course covers importing and exporting data, manipulating data or recoding variables, and visualization and statistical analysis.

R has recently become the preferred computing and statistical analysis software for academic analysis because it offers an unparalleled breadth of tools for virtually any model of interest to social scientists, particularly those interested in so-called “big data.” Unfortunately, R also has a steep learning curve because it is maintained by academics who have few career incentives to make it user-friendly. Courses such as this one are therefore indispensable for obtaining a basic working knowledge of the language and learning how to navigate the complex web of information about R that is currently available online.

The course provides an overview of how to install R on your computer, read in data files, and import data sources stored in other software formats, such as STATA, SPSS, and SAS. The course also covers data cleaning and manipulation, basic descriptive statistics such as cross-tabs, histograms, and scatter plots, and basic analyses such as linear regression models.

Please bring a laptop with you. This course assumes no knowledge of computer programming, but basic familiarity with another statistical analysis program such as STATA, SPSS, or SAS will make the course easier to follow.

Note: In order to participate in the hands-on sections of the course, participants must bring their own laptop computer with enough space to install R and RStudio.


Introduction to R for Data Science - DUGGINS
Instructor: Jonathan Duggins

This course provides a basic introduction to the R software environment for the purpose of data science. The course covers importing and exporting data, manipulating data or recoding variables, and visualization and statistical analysis.


R has recently become the preferred computing and statistical analysis software for academic analysis because it offers an unparalleled breadth of tools for virtually any model of interest to social scientists, particularly those interested in so-called “big data.” Unfortunately, R also has a steep learning curve because it is maintained by academics who have few career incentives to make it user-friendly. Courses such as this one are therefore indispensable for obtaining a basic working knowledge of the language and learning how to navigate the complex web of information about R that is currently available online.

The course provides an overview of how to install R on your computer, read in data files, and import data sources stored in other software formats, such as STATA, SPSS, and SAS. The course also covers data cleaning and manipulation, basic descriptive statistics such as cross-tabs, histograms, and scatter plots, and basic analyses such as linear regression models.

Please bring a laptop with you. This course assumes no knowledge of computer programming, but basic familiarity with another statistical analysis program such as STATA, SPSS, or SAS will make the course easier to follow.

Note: In order to participate in the hands-on sections of the course, participants must bring their own laptop computer with enough space to install R and RStudio.

Effective Information Visualization
Instructor: Eric Monson

Participants will learn how to clean and structure data, see how freely and commonly available software can be used to create effective visualizations, and learn basic design principles so they can go beyond the defaults and create eye-catching, impactful figures and infographics.


Visualization is a powerful way to reveal patterns in data, attract attention, and get your message across to an audience quickly and clearly. But there are many steps in the journey from information to influence, and many questions: which visualization tools to use, how to get data into the right format, and which choices to make when putting it all together to tell your story. This course will walk participants through a wide variety of data and chart types to help even beginners feel comfortable embarking on a new visualization project.

The course will cover four major topics: charts, maps and interactive dashboards with Tableau; free web-based tools such as RAW for less-common visualization types; graphic design for information visualization; and infographics creation in tools like PowerPoint. The sessions will combine lecture and hands-on activities in data handling, visualization, and graphic design.

This course will assume a basic understanding of spreadsheets as a way of storing and processing data. No programming will be necessary. Bringing a laptop is not required, but participants are encouraged to do so. Having pre-installed Tableau (Public or Desktop), a web browser, a spreadsheet program, and presentation software will allow students to get in-class experience during the computer-based activities.

Introduction to Python
Instructor: Jason Carter

This course is an introduction to programming as a skill, a discipline, and a profession for graduate students. We will dive into hands-on programming from day one and progress to evaluating and using open source libraries and frameworks to manage large and complex datasets. We will focus equally on reading and writing code.


Students will leave the course with real skills, an ability to learn new programming technologies, and an understanding of how to incorporate open source code into their projects. It will serve as an appropriate foundation for students seeking to use programming as a tool in their data science tool belt. This course will also help students evaluate, communicate with, or work with programmers or code.

At the end of this course, students should have:
• The ability to analyze large or complex data sets.
• The skills required to solve problems by creating and modifying programs and systems, using modern programming tools.
• The knowledge of basic programming concepts, their appropriate usage, and how and where to learn more.
• An attitude of confidence when reading, writing, or discussing computer code.

There are no prerequisite courses. However, you will need to be skilled in basic mathematics and algebra, as well as in the use of email and the web.

Advanced Statistics in R:
Generalized Linear Models & Multi-level Modeling
Instructor: Din Chen

This short course covers advanced statistical modelling and computing using R. We will review multiple linear regression for continuous data and then cover logistic regression for binary/binomial data, Poisson regression and negative binomial regression for count data, longitudinal data analysis, and general multi-level modelling. We will also cover selected topics in statistical computing, including data simulation and bootstrapping.


Topics to be covered (Day 1):
• R packages and how to get help in R.
• Basic statistical computing and simulations.
• Multiple linear regression using the R function “lm” with real data analysis.
• Logistic regression for binary/binomial data using the R function “glm” with real data analysis.
• Poisson regression for count data using the R function “glm” with real data analysis.
• Generalized Linear Model using “glm.”

Topics to be covered (Day 2):
• Longitudinal data using the R package “nlme.”
• Multi-level modelling for continuous data using the R package “nlme.”
• Multi-level modelling for binary/binomial data using the R package “lme4.”
• Multi-level modelling for count data using the R package “nlme.”
• Introduction to statistical simulations and bootstrapping.
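
As a rough illustration of the bootstrapping topic listed above, the sketch below resamples a small data set with replacement to estimate a percentile confidence interval for the mean. The course itself works in R; Python is used here only for illustration, and the function name `bootstrap_mean_ci`, its parameters, and the sample values are hypothetical rather than taken from the course materials.

```python
import random
import statistics


def bootstrap_mean_ci(data, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean (illustrative)."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(data)
    means = []
    for _ in range(n_resamples):
        # Draw n observations with replacement from the original sample.
        resample = [data[rng.randrange(n)] for _ in range(n)]
        means.append(statistics.mean(resample))
    means.sort()
    # Take the alpha/2 and 1 - alpha/2 percentiles of the resampled means.
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Made-up measurements, used only to exercise the function.
sample = [2.1, 3.4, 1.9, 5.6, 4.2, 3.8, 2.7, 4.9]
low, high = bootstrap_mean_ci(sample)
print("95% bootstrap CI for the mean:", low, high)
```

The percentile method shown here is the simplest bootstrap interval; other variants exist and may be what the course presents.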


Participants will learn advanced statistical modelling and computing using R for analyzing continuous and non-continuous data with multiple regression, logistic regression, Poisson regression, negative binomial regression, and general multi-level modelling. They will learn how to choose the statistical models that best suit the type of data and the research questions of a given study. They will be better able to conceptualize, design, run, interpret, and communicate results clearly and effectively based on the advanced modelling techniques covered in the course.

The course will be delivered as a presentation with handouts. Lectures will be interspersed with questions and answers, software demonstrations, and discussion.

Prerequisites and Requirements:
• Participants should bring a laptop to this session. Please also install R and the R packages “nlme” and “bootstrap.”
• To fully participate, students should be familiar with descriptive and inferential statistics as well as multiple regression analysis.
• Some R programming skills are helpful, but not required.


  • August 15, 2018

Text Analysis in R
Instructors: Allison Blaine; Markus Wust

This course explains how to collect, classify, and analyze text-based data from the internet or other digital sources using R. The course will cover a quick overview of R as a programming language, screen-scraping, interfacing with Application Programming Interfaces (APIs), and basic natural language processing such as topic models.


The study of how to use text as data crosses so many different academic disciplines, programming languages, and styles of communication that those who wish to enter this field are quickly overwhelmed. This course will provide students with a panoramic perspective of the field of “big data” and introduce them to some of the programming skills necessary to navigate the rapidly growing wealth of information online about this subject.

This course is divided into four sections. The first section will provide a quick and basic overview of R as a programming language for those who have no prior experience with R and a review for those who have worked with R before. In the second section, participants will learn basic techniques for collecting text-based data from the internet such as screen scraping and writing code to extract data from application programming interfaces. The third section will explain how to clean and code text-based data using a variety of pre-processing techniques such as stemming. The fourth and final section will explain how to apply topic models and other natural language processing tools to sample data.

Please bring a laptop with you. Although there will be a short introduction to R at the beginning of the workshop, this course is best suited for those who have a basic working knowledge of the R language. Students with no knowledge of R might consider pairing this course with the “Introduction to R for Data Science” course that is also being offered early in the week.

Programming in R - DUGGINS
Instructor: Jonathan Duggins

This class provides students with an introduction to basic programming techniques in R, a language with stronger object-oriented programming facilities than most statistical computing languages. R is widely used among statisticians and data miners for developing statistical software and performing data analysis, and its popularity has increased substantially in recent years.


This class will be useful to those who wish to restructure or clean unstructured data, collect new data in an automated fashion, or improve the speed of data analysis.

Students will learn basic programming techniques such as functions, “for” loops, if/else statements, vectorized functions, and parallel computing techniques.

Please bring a laptop with you. This course requires basic familiarity with R syntax and objects (e.g., matrices, lists, and data frames).

Programming in R - POST
Instructor: Justin Post

This class provides students with an introduction to basic programming techniques in R, a language with stronger object-oriented programming facilities than most statistical computing languages. R is widely used among statisticians and data miners for developing statistical software and performing data analysis, and its popularity has increased substantially in recent years.


This class will be useful to those who wish to restructure or clean unstructured data, collect new data in an automated fashion, or improve the speed of data analysis.

Students will learn basic programming techniques such as functions, “for” loops, if/else statements, vectorized functions, and parallel computing techniques.

Please bring a laptop with you. This course requires basic familiarity with R syntax and objects (e.g., matrices, lists, and data frames).

Network Analysis for Data Scientists
Instructor: Bill Shi

This course will provide an introduction to network analysis with a focus on data and applications. It will introduce basic concepts and ideas in network science, and cover methods that are practically useful in dealing with network data. At the completion of this course, participants will have a solid understanding of what network analysis does, and be able to run common methods on network data.


We live in a connected world: the internet, social networks, and the neural networks in our brains. Beyond systems with explicit connections, we have realized that most real-life systems are composed of complicated components with implicit interactions, and that ignoring those interactions leaves us unable to predict system behavior, as in the unexpected collapse of the financial market in 2008. Networks have therefore emerged as a powerful tool for studying these complex systems and have found fruitful applications in a wide range of areas, from technology to public health to social problems. This course will provide an introduction to this emerging area with a focus on data and real applications.

Participants will learn:
• Concepts and terminology in network science, such as nodes, edges, degrees, and small worlds.
• Centrality measures of nodes and edges in a network, such as PageRank, betweenness, and closeness.
• Structural features such as the clustering coefficient and the mixing coefficient.
• Random network models such as Erdős–Rényi, scale-free, and small-world networks.
• Community detection methods.
• Link prediction methods.
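
To make two of the measures above concrete, here is a minimal sketch of node degree and closeness centrality on a tiny undirected network. The course itself uses R with the igraph package; plain Python is used here only for illustration, and the four-node graph and the helper names `degree` and `closeness` are hypothetical.

```python
from collections import deque


def degree(graph):
    """Number of neighbors of each node in an undirected graph."""
    return {node: len(neighbors) for node, neighbors in graph.items()}


def closeness(graph, source):
    """Closeness centrality: (reachable nodes) / (sum of shortest-path lengths)."""
    # Breadth-first search gives shortest path lengths in an unweighted graph.
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in graph[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    total = sum(dist.values())
    return (len(dist) - 1) / total if total else 0.0


# A made-up network stored as an adjacency dictionary.
graph = {
    "A": {"B", "C"},
    "B": {"A", "C"},
    "C": {"A", "B", "D"},
    "D": {"C"},
}

print(degree(graph))
print(closeness(graph, "C"), closeness(graph, "D"))
```

Node C touches every other node directly, so it scores highest on both measures, while the peripheral node D scores lowest.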

Please bring a laptop with you to this course.

Participants are expected to have the following skills:
• Basic knowledge of statistics.
• Basic programming skills in R.

Please also download the following:
• The igraph package (http://igraph.org/r/) in R.
• Open-source network visualization software Gephi (gephi.org).


Dynamic/Interactive Visualization
Instructor: Lorin Bruckner

The web has become an important and popular tool for communicating research findings, but it carries a layer of complexity not found in other media: user interactivity. This course will teach participants how to engage their audience in immersive presentations of their data. Participants will learn basic user experience (UX) principles and apply them to interactive dashboards in Tableau. Beginner experience with dashboard creation in Tableau is required.


Interactive visualizations allow users to answer their own questions about a dataset through an exploratory process, enabling them to become more intimate with the information they consume. Interactivity can even turn data visualizations into useful applications and digital tools. However, in order to create effective and successful interactive systems, a foundational understanding of UX concepts and methods is needed.

Through both lectures and hands-on, project-based activities, participants will learn to:
• Place the user at the center of their thinking.
• Understand the principles of interaction design and UX.
• Communicate effectively with users through visuals and text.
• Create interactive dashboards in Tableau that employ UX best practices.
• Work with Dashboard Actions, Stories and other interactive features in Tableau.
• Evaluate the usability of interactive systems.

A laptop with Tableau Public or Tableau Desktop and spreadsheet software installed is necessary to participate in hands-on projects. This course will assume participants have already been introduced to Tableau and have experience creating basic worksheets and dashboards.

  • August 16-17, 2018

Working with Messy Data
Instructor: Brown Biggers

When working with data, one thing is fairly certain: data is rarely in an optimized format. A misplaced space here, or an extra comma there, can mean the difference between two clicks and two hours of work. In this course, we will work with ways to manipulate, interpret, and present data from web pages and text using Python version 2.7 and OpenRefine. This class will also cover regular expressions, various imported libraries to extend Python functionality, and import/export of data in OpenRefine.


The tools for handling data are often both expensive and complicated. In this course, we will cover a wide range of techniques with common and open-source data processing tools. If you find yourself curious about how better to handle tabular data, but are often intimidated by the steep learning curve associated with programming, this class will get you started on the path toward better data management.

The participants in this course will learn basic and intermediate Python programming and scripting as it pertains to the import and export of data. We will cover some of the libraries associated with mathematical and statistical analysis, as well as text processing using regular expressions. We will cover OpenRefine to adjust datasets to compensate for input inconsistency.
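
As a small illustration of the kind of clean-up described above, the sketch below uses regular expressions to repair a record with stray spaces and extra commas. It is an assumption-laden example, not course material: the function name `clean_record` and the sample line are hypothetical, it treats empty fields as noise (which is not always appropriate), and it uses only the standard `re` module, avoiding version-specific features.

```python
import re


def clean_record(line):
    """Normalize a messy comma-separated line into a list of fields."""
    # Collapse runs of whitespace into single spaces and trim the ends.
    line = re.sub(r"\s+", " ", line).strip()
    # Remove any spaces surrounding commas.
    line = re.sub(r"\s*,\s*", ",", line)
    # Assumption: repeated commas are typing errors, not empty fields.
    line = re.sub(r",{2,}", ",", line)
    # Drop a trailing comma left at the end of the record.
    line = line.rstrip(",")
    return line.split(",")


messy = "  Smith ,  Jane,, 42 ,"
print(clean_record(messy))  # prints ['Smith', 'Jane', '42']
```

If empty fields carry meaning in a given data set, the comma-collapsing step should be dropped and the blanks handled explicitly.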

Please bring a laptop with you. This course is intended for data scientists with a basic-to-intermediate understanding of one or more of the following: the Python programming language, data import/export formats, text processing, and statistical analysis. This class assumes that you will have a laptop with a running installation of Python 2.7, OpenRefine, and a web browser with internet connectivity.

Recommended Installations:
In this class, we will be using Anaconda (for Python 2.7) and OpenRefine from the links below. You are welcome to use your preferred installations, but this class expects that you will have a laptop with a running installation of Python 2.7, OpenRefine, and a web browser with internet connectivity.
• Anaconda for Python 2.7: https://www.continuum.io/downloads
• OpenRefine: http://openrefine.org/download.html

Intermediate Programming in R
Instructor: Justin Post

This class provides students with a primer on using R to write reproducible reports and presentations that embed R output with R Markdown, and to create interactive, customizable web applets called R Shiny applications.


This class will be useful for automating the creation of data analyses, descriptive statistics, and graphical summaries.

Students will learn about the flexibility and use of R Markdown for producing slides, web pages, and reports, and about the R Shiny web application framework.

Please bring a laptop with you. This course requires basic familiarity with R syntax, objects, and functions.

Visualization in Data Science Using R
Instructor: Angela Zoss

This course is designed for two audiences: experienced visualization designers looking to apply open data science techniques to their work, and data science professionals who have limited experience with visualization. Participants will develop skills in visualization design using R, a tool commonly used for data science. Basic familiarity with R is required.


Data science skills are increasingly important for research and industry projects. With complex data science projects, however, come complex needs for understanding and communicating analysis processes and results. The rise of data science has accompanied a comparable rise in business intelligence and the demand for visualizations and dashboards that can explain models, summarize results, assist with decision making, and even predict outcomes. Ultimately, an analyst's data science toolbox is incomplete without visualization skills.

The course will take a project-based approach to learning best practices for visualization for data science. Participants will be guided through 2-3 sample analysis and visualization projects that will highlight different types of visualization, different features of R and its visualization libraries, and different challenges that arise when trying to apply an open data science philosophy to visualization.
• Introduction to visualization in R.
• Visualization for data exploration.
• Visualization for communication.
• Interactive visualizations with Shiny.

This course assumes basic familiarity with R (e.g., R syntax, data structures, and development environments). Most visualizations will be created with ggplot2 and other tidyverse libraries, but prior experience with those libraries is not required. In order to participate in class exercises, participants should bring a laptop on which current versions of R and RStudio have been installed and on which the participant has sufficient privileges to install new R packages on demand.


Introduction to Data Mining and Machine Learning - KRISHNAMURTHY
Instructor: Ashok Krishnamurthy

This course will introduce participants to a selection of the techniques used in data mining and machine learning in a hands-on, application-oriented way. Topics covered will include data exploration, decision trees, clustering, association rules, regression, and pattern classification. The computing exercises will be based on the statistical programming language R. By the end of the two days, you will be able to explore a data set, determine which analysis method is appropriate for the data, and use R packages to obtain results.


The ready availability of digital data from numerous sources is a tremendous opportunity for businesses and scientists to obtain new insights and confirm hypotheses. Data mining provides the theoretical basis, algorithms and computational methods to manage, analyze and get information from the data. In the world of big data and data science, data mining is a fundamental tool for data insights.

The course will be organized in the following major sections:
• Data exploration.
• Association rules.
• Decision trees.
• Clustering.
• Regression.
• Classification.
Each section will have an associated computer exercise. We will make extensive use of R and R packages in the computer exercises.

Prerequisites and Requirements: None.

Introduction to Data Mining and Machine Learning - VATSAVAI
Instructor: Raju Vatsavai

This course will introduce participants to a selection of the techniques used in data mining and machine learning in a hands-on, application-oriented way. Topics covered will include data exploration, decision trees, clustering, association rules, regression, and pattern classification. The computing exercises will be based on the statistical programming language R. By the end of the two days, you will be able to explore a data set, determine which analysis method is appropriate for the data, and use R packages to obtain results.


The ready availability of digital data from numerous sources is a tremendous opportunity for businesses and scientists to obtain new insights and confirm hypotheses. Data mining provides the theoretical basis, algorithms and computational methods to manage, analyze and get information from the data. In the world of big data and data science, data mining is a fundamental tool for data insights.

The course will be organized in the following major sections:
• Data exploration.
• Association rules.
• Decision trees.
• Clustering.
• Regression.
• Classification.
Each section will have an associated computer exercise. We will make extensive use of R and R packages in the computer exercises.

Prerequisites and Requirements: None.