Course descriptions subject to updates.

- Basics of R for Data Science and Statistics
- Introduction to Programming in R
- Introduction to Python
- Introduction to Big Data & Machine Learning for Survey Researchers & Social Scientists
- Visualization in Data Science Using R
- Advanced Visualization in R: R Shiny
- Exploratory Data Analysis Using R Markdown
- Intermediate Python
- Introduction to Geospatial Data for the Data Scientist
- Overview of AI and Deep Learning
- Basic Statistics in R
- Deep Learning in Python
- Geospatial Analytics
- Introduction to Effective Information Visualization
- Introduction to Statistical Machine Learning in R

### Basics of R for Data Science and Statistics

Santiago Olivella

**Summary**

This course introduces the use of the popular R statistical programming language for conducting and sharing your data science projects. R is a mature and vibrant programming language, and one of the major platforms for doing data science. This course covers how to use RStudio effectively, frequently used data structures, importing data, common data manipulations, summary statistics, data visualizations, and basic regression modeling through the suite of packages called the “tidyverse.”

**Why Take This Course?**

R is an extremely versatile programming language that has the capability to fit a fantastic array of statistical and machine learning models. It is open-source, free, and gives you the ability to easily and widely share your analyses. The “tidyverse” offers an ecosystem of tools that help you harness R’s versatility and power, allowing you to easily import data, wrangle it, and use statistics to gain insights from it. In turn, RStudio offers a one-stop-shop development environment for your data analysis projects. By introducing you to some of the most useful features of RStudio and the tidyverse, this course will help you build a solid foundation in the use of R to complete the common tasks associated with a typical data science pipeline.

**What Will Participants Learn?**

The course provides a modern introduction to using R for data science through the extremely popular suite of packages called the “tidyverse.” More specifically, the course will cover:

Day 1:

- Working with the RStudio IDE
- R object types and data storage
- Reading data from common formats into R
- Piping multiple operations
- Common data manipulations and reshaping

Day 2:

- Numerical data summaries
- Graphical exploratory data analysis
- Basic regression and classification models
- Sharing your work in reproducible ways using RMarkdown

**Prerequisites and Requirements**

This course will make heavy use of hands-on programming in R. Thus, some prior familiarity with the basics of the R language is required. We will generally introduce a topic and then have hands-on exercises to practice and explore that topic. Participants must have access to the internet and the ability to install programs and download files. This course assumes a strong working knowledge of computers. Past experience with the logic of programming and/or conducting statistical analyses is beneficial but not required. Participants are provided instructions and resources in advance to assist with installing R, RStudio, and the tidyverse.

### Introduction to Programming in R

Jonathan Duggins

**Summary**

Statistical programming is an integral part of many data-intensive careers and data literacy, and programming skills have become a necessary component of employment in many industries. This course begins with necessary concepts for new programmers—both general and statistical—and explores some necessary programming topics for any job that utilizes data. The R software package will be used as it is one of the most popular programs in statistics and data science. R programming is done in the RStudio integrated development environment.

**Why Take This Course?**

Data is everywhere. Every industry, and most careers, now involve working with data at some point. As data-driven decisions become the norm in many careers, individuals need to add to their professional skills by learning the essentials of programming.

Whether their goal is to better communicate with coworkers who program or to integrate programming in their day-to-day jobs, this course benefits individuals who are looking to understand general principles behind statistical programming and to develop programming skills necessary to work in data-related areas.

Even though computers are “literal,” programming offers many techniques for solving a particular problem. The combination of necessary precision and freedom of choice regarding techniques can be frustrating for beginners.

**What Will Participants Learn?**

The course covers both general computing concepts and an introduction to R. A general outline is given below:

Day 1:

- General computing concepts
- Fundamental programming concepts
- Introduction to R and RStudio
- R functions
- Common R data structures

Day 2:

- R packages
- Reading and writing data files (readr package)
- Subsetting data by rows, columns, or both (dplyr package)
- Deriving new variables unconditionally and conditionally (dplyr package)
- Introduction to summarizing data

**Prerequisites and Requirements**

This course aims to support individuals with little to no previous programming experience as they gain the fundamentals necessary to be successful programmers. No prior programming experience is expected in R or any other language. However, this course relies on active participation of the participants via hands-on programming. R and RStudio should be installed in advance so that participants can devote their time to practicing the skills from the course. Participants are provided instructions and resources in advance to assist with the installation process.

### Introduction to Python

Jason Carter

**Summary**

Learning Python is important for any aspiring data scientist who is interested in programming as a skill, a discipline, and a profession.

**Why Take This Course?**

This course will serve as a foundation for students seeking to use programming as a tool in their data science tool belt. This course will help students evaluate, communicate with, and work with programmers or code. Participants will get hands-on practice with creating and manipulating data sets. At an introductory level, participants will learn to create programs that manipulate data, perform error handling, and store and retrieve data using relational databases.

**What Will Participants Learn?**

Students will learn how to:

- create programs that manipulate data
- perform error handling
- store and retrieve data using relational databases
- use the pandas library, the de facto standard to work with tabular data in Python
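As a rough illustration of the kind of program these topics lead to (a sketch, not taken from the course materials), the snippet below cleans a few raw records, handles a bad value with `try`/`except`, and stores and queries the result in a relational database. It uses only the standard-library `sqlite3` module; the table name `scores`, the sample records, and the helper names `parse_score` and `load_scores` are hypothetical.

```python
import sqlite3


def parse_score(text):
    """Convert raw text to a float, returning None when the text is not numeric."""
    try:
        return float(text)
    except ValueError:
        # Error handling: a malformed value is flagged instead of crashing the program.
        return None


def load_scores(rows):
    """Store (name, score) rows in an in-memory SQLite table and return the connection."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE scores (name TEXT PRIMARY KEY, score REAL)")
    conn.executemany("INSERT INTO scores VALUES (?, ?)", rows)
    conn.commit()
    return conn


# Hypothetical raw input: one record has a non-numeric score.
raw = [("ana", "91.5"), ("ben", "n/a"), ("cy", "78")]

# Data manipulation: parse every score, then drop records that failed to parse.
parsed = [(name, parse_score(text)) for name, text in raw]
clean = [(name, score) for name, score in parsed if score is not None]

# Store the clean records and retrieve a summary with SQL.
conn = load_scores(clean)
average = conn.execute("SELECT AVG(score) FROM scores").fetchone()[0]
print(average)
```

The same cleaning step could be written with pandas; the standard library is used here only to keep the sketch self-contained.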

**Prerequisites and Requirements**

Participants will need to be skilled in basic mathematics and algebra as well as e-mail and web usage. They will also, of course, need a computer to participate in the hands-on exercises.

### Introduction to Big Data & Machine Learning for Survey Researchers & Social Scientists

Trent Buskirk

**Summary**

In this course we explore how Big Data concepts, processes and machine learning methods can be used within the context of Survey and Social Science Research. Throughout this course we will illustrate key concepts using specific survey research examples including tailored survey designs and nonresponse adjustments and evaluation.

**Why Take This Course**

The amount of data generated as a by-product in society is growing fast, including data from satellites, sensors, transactions, social media and smartphones, just to name a few. Such data are often referred to as “big data” and can be used to create value in different areas such as health and crime prevention, commerce and fraud detection. An emerging practice in many areas is to append or link big data sources with more specific and smaller scale sources that often contain much more limited information. This practice has been used for some time by survey researchers in constructing frames by appending auxiliary information that is often not directly available on the frame, but can be obtained from an external source. Using Big Data has the potential to go beyond the sampling phase for survey researchers and in fact has the potential to influence the social sciences in general. Big Data is of interest to public opinion researchers and statistical agencies seeking alternative data sources to reduce costs, improve estimates, or produce estimates in a more timely fashion. However, Big Data poses several interesting new challenges for survey researchers, social scientists, and others who want to extract information from data. As Robert Groves (2012) pointedly commented, the era is “appropriately called Big Data and not Big Information”, because there is a lot of work for analysts before information can be gained from “auxiliary traces of some process that is going on in society.” As we embark on the fourth era of survey research, we will see more use of big data within the survey research process as well as increased use of machine learning methods to process and analyze these data.

**What Participants Will Learn**

This course will offer participants:

- an overview of key big data terminology and concepts
- a discussion of some primary issues with linking big data with survey data
- a discussion of the opportunities for big data within the survey research process
- a discussion of the opportunities for survey data within the big data ecosystem
- issues of coverage and measurement errors within the big data context
- a discussion of information extraction and signal detection in the context of big data
- a discussion of the similarities and differences in model building for inference versus prediction
- a discussion of how to visualize massive amounts of data
- an overview of four popular machine learning methods (k-means clustering, hierarchical clustering, classification and regression trees, and random forests) using R, with example code provided
- an introduction to the Rattle package that provides a graphical user interface for machine learning within the R environment
- a discussion and illustration about how these and other methods can be used in the survey research process

Course Outline:

Day 1:

- An overview of big data, its potential, perils and total error framework
- An overview of machine learning
- An introduction to the difference between prediction and inference
- Challenges and solutions for plotting big data
- Introduction to unsupervised learning with examples and hands-on activities
- K-means clustering

Day 2:

- K-means clustering, cont.
- Hierarchical clustering
- k-nearest neighbors
- Introduction to supervised learning with examples and hands-on activities
- Tree-based methods
- Ensemble methods – including random forests and extra trees
- Introduction to the Rattle package in R

**Prerequisite and Requirements**

The course is aimed at both producers and users of survey data as well as social scientists who want to learn machine learning methods that can be used within the broader context of social science data. The course is aimed equally at researchers and students from academia, government and the voluntary and private sectors and is appropriate for researchers new to this topic who have some familiarity with survey research concepts such as responsive/tailored survey designs, measurement error, nonresponse bias and data linkage. Familiarity with model building and model selection, as well as with R, is not required but would be helpful. While this course is not intended to teach participants machine learning via R, we will explore four common machine learning algorithms and provide R code and output to illustrate these methods.

### Visualization in Data Science Using R

Angela Zoss

**Summary**

This course is designed for two audiences: experienced visualization designers looking to apply open data science techniques to their work, and data science professionals who have limited experience with visualization. Participants will develop skills in visualization design using R, a tool commonly used for data science. Basic familiarity with R is required.

**Why Take This Course**

Data science skills are increasingly important for research and industry projects. With complex data science projects, however, come complex needs for understanding and communicating analysis processes and results. Ultimately, an analyst’s data science toolbox is incomplete without visualization skills. Incorporating effective visualizations directly into the analysis tool you are using can facilitate quick data exploration, streamline your research process, and improve the reproducibility of your research.

**What Participants Will Learn**

The course will take a project-based approach to learning best practices for visualization for data science. Participants will be guided through several sample analysis and visualization projects that will highlight different types of visualization, different features of R and its visualization capabilities, and different challenges that arise when trying to apply an open data science philosophy to visualization.

- Introduction to visualization in R
- Using ggplot2 for publication-ready graphics
- Applying common graphic design principles to ggplot2 visualizations
- Adding interactivity to visualizations through R Markdown and HTML widgets

**Prerequisite and Requirements**

As indicated above, this course assumes basic familiarity with R—e.g., R syntax, data structures, development environments. Participants with no knowledge of R should consider taking an introductory R short course prior to this class.

We will use RStudio to interact with R, and all exercises will be distributed in R Markdown files (rather than simple R script files). This allows us to combine R code with non-code elements and promotes a literate programming approach to research.

A significant portion of the course will use ggplot2 and other tidyverse packages to create visualizations, but prior experience with those packages is not required. In order to participate in class exercises, participants should have installed current versions of R, RStudio, and the following packages: tidyverse, markdown, knitr, readxl, plotly, maps, mapproj, and sf. Permissions to install packages on the fly will be useful.

### Advanced Visualization in R: R Shiny

Angela Zoss

**Summary**

This course will cover the basics of creating R-based web applications with Shiny, an R package that blends data science and statistical operations with interactive interface components. Participants will learn to connect interactive inputs with R operations, develop skills in web application design, and explore different options for hosting Shiny applications on the web. Basic familiarity with R is required.

**Why Take This Course?**

Modern data science projects go beyond research publications and static presentations. Stakeholders interested in the results of a data analysis workflow may need a more direct interface to explore the data themselves. Rather than preparing exhaustive reports that summarize as many different aspects of the results as possible, it may make more sense to create a way for stakeholders to interact with either the data analysis itself or the various outputs of the analysis.

Shiny is a robust web application development system for R. Shiny can be used to build interactive dashboards, adjust parameters of a model, generalize a data processing workflow, and even allow users to customize the look and feel of reports and figures.

**What Will Participants Learn?**

This course will cover the basics of building simple Shiny applications. The course will also cover the range of options for more advanced Shiny applications and the basic process for hosting and sharing a Shiny application. The following broad topics will be included:

- Introduction to web applications
- Planning for interactivity
- Layout and UI design
- Writing reactive R code
- Using charts as inputs

**Prerequisites and Requirements**

As indicated above, this course assumes basic familiarity with R—e.g., R syntax, data structures, development environments. Participants with no knowledge of R should consider taking an introductory R short course prior to this class.

We will use RStudio to interact with R. Exercise files and slides will be shared using a GitHub repository, but no prior experience using GitHub is required.

In order to participate in class exercises, participants should have installed current versions of R, RStudio, and the shiny package. Additional required packages will be shared before the start of the course. Permissions to install packages on the fly will be useful.

### Exploratory Data Analysis Using R Markdown

Jonathan Duggins

**Summary**

This course introduces techniques for Exploratory Data Analysis (EDA)—both numeric and graphical—to help provide insight into data sets. Using R and RStudio, we will demonstrate various tools to generate data summaries and introduce R Markdown for generating HTML output. Incorporating R Markdown allows for the generation of output which includes results from statistical analysis, plots, figures (such as diagrams), and explanations of any output.

**Why Take This Course?**

Reading and writing data are only part of the analysis process. Analysis and documentation of results remain a cornerstone of the statistics and data science practices. In this course, methods for basic statistical analysis allow participants the opportunity to analyze data in a familiar way while developing R Markdown skills for creating files with a seamless integration of code and results along with a narrative explaining those methods and their output.

By presenting the process in terms of familiar statistics concepts, participants can substitute any statistical analysis from their toolbox to produce reports that match the required sophistication level for any analysis.

**What Will Participants Learn?**

Covering both general computing concepts and an introduction to R, a tentative outline of the course is given below:

- Numeric data summaries
- Graphical data summaries (ggplot2 package)
- Creating high-quality graphics (ggplot2 package)
- R Markdown for professional documentation/reproducibility

**Prerequisites and Requirements**

This course assumes a working knowledge of computers, and some prior experience in R is required. At a minimum, knowledge of data objects, an understanding of R syntax, and experience with the tidyverse collection of packages are strongly recommended.

This course relies on active participation of the participants via hands-on programming. R and RStudio should be installed in advance so that participants can devote their time to practicing the skills from the course. Participants will be provided instructions and resources in advance to assist with the installation process.

### Intermediate Python

Jason Carter

**Summary**

Learning Python is important for any aspiring data scientist who is interested in programming as a skill, a discipline, and a profession. For those students attending the Introduction to Python class on Monday and Tuesday, we will continue working with hands-on programming and managing large and complex datasets.

**Why Take This Course**

This course moves beyond the introductory concepts presented in the two-day course “Introduction to Python” (offered on Monday-Tuesday) and presents more programming-focused content with data applications. Students will get hands-on practice with creating and manipulating data sets, performing error handling, and storing and retrieving data using relational databases.

**What Participants Will Learn**

Students will learn how to:

- create programs that manipulate data
- perform error handling
- store and retrieve data using relational databases
- use the pandas library, the de facto standard to work with tabular data in Python

**Prerequisite and Requirements**

An introductory knowledge of Python is required. You will also need to be skilled in basic mathematics and algebra as well as e-mail and web usage. Please bring a laptop to the course.

### Introduction to Geospatial Data for the Data Scientist

Bill Wheaton

**Summary**

This course offers a broad introduction into the use of geospatial data in data science applications. The course will be highly focused on what makes geospatial data different from other types of data and what these differences imply for using and applying geospatial data. The course materials will be built for non-geospatial professionals who find themselves needing to use geospatial data effectively.

**Why Take This Course**

The availability and uses of geospatial data have been growing for decades. Recently, with the advent of robust web-mapping and dynamic client-side web tools many data analysts, applications programmers, web developers, and data scientists of all types have been confronted with geospatial data without having a background in geography or Geographic Information Systems (GIS). This course will ground students in fundamental concepts of geospatial data science, geospatial computing, and geospatial applications so they can be more efficient and accurate in using geospatial data in their daily jobs.

**What Participants Will Learn**

Students will learn:

- Basics of map projections and the use of projected and un-projected geospatial data
- How issues of scale, precision, and accuracy affect applications of geospatial data
- Geospatial data models and the main ways geospatial data is presented in computer form
- Key open-source and commercial off-the-shelf applications that handle geospatial data
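To make the difference between projected and un-projected data concrete, here is a minimal sketch (not taken from the course materials) that converts longitude/latitude in degrees to Web Mercator coordinates in meters using the standard spherical formulas. The function name `to_web_mercator` and the sample points are assumptions chosen for illustration.

```python
import math

EARTH_RADIUS_M = 6378137.0  # WGS 84 semi-major axis, the sphere radius used by Web Mercator


def to_web_mercator(lon_deg, lat_deg):
    """Project longitude/latitude in degrees to Web Mercator x/y in meters."""
    x = EARTH_RADIUS_M * math.radians(lon_deg)
    y = EARTH_RADIUS_M * math.log(math.tan(math.pi / 4 + math.radians(lat_deg) / 2))
    return x, y


# One degree of longitude at the equator and at 60 degrees north.
x0, y0 = to_web_mercator(0.0, 0.0)
x1, y1 = to_web_mercator(1.0, 0.0)
x2, y2 = to_web_mercator(1.0, 60.0)
x3, y3 = to_web_mercator(2.0, 60.0)

# Both widths are identical in projected units, even though a degree of
# longitude at 60 degrees north covers roughly half the ground distance.
print(x1 - x0, x3 - x2)
```

The equal projected widths show why scale and accuracy depend on the projection: distances measured in Web Mercator are increasingly exaggerated away from the equator.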

**Prerequisite and Requirements**

Basic computer skills are required. An understanding of tools such as spreadsheets, relational database management systems (RDBMS), and programming will be beneficial, but is not required.

### Overview of AI and Deep Learning

Siobhan Day Grady

**Summary**

There has been tremendous growth in AI over the past 10 years. Every day we hear of challenging problems that are being solved using AI. At the same time, we hear about the lack of explainability of decisions made by AI and about biases in AI models. Many of the key advances in AI are due to advances in machine learning, especially deep learning. Natural language processing, computer vision, speech translation, biomedical imaging, and robotics are some of the areas that have benefited from deep learning methods. This course is designed to provide an overview of AI and, in particular, deep learning. We will look at the history of neural networks, how advances in data collection and computing have driven the revival of neural networks, the different types of deep learning networks and their applications, and the tools and software available to design and deploy deep networks. The objective is to provide you with an overview, not necessarily to start coding and creating AI systems.

**Why Take This Course?**

This course is for those who are interested in understanding more about AI and deep learning, how they are used in different applications, and their advantages and disadvantages. It is not meant to teach you about any of the deep learning frameworks or developing deep learning models.

**What Will Participants Learn?**

The course will focus on the following topics:

- History of neural networks
- Neural networks as universal approximators
- Training neural networks as an optimization problem
- Deep neural networks
- Why deep learning now?
- Types of neural networks
- Applications of deep learning and neural networks in image processing, natural language processing, robotics, computer vision, biomedicine, and health care

**Prerequisites and Requirements**

This course does not have any prerequisites.

### Basic Statistics in R

Vanessa Miller

Course description coming soon.

### Deep Learning in Python

Adel Alaeddini

**Summary**

In the past few years, deep learning (DL) has emerged as a powerful machine learning method that has found applications in areas such as object recognition, image classification, video analysis, and natural language processing. This course will discuss where and how deep learning is used and how to get people to use it. The approach will be to minimize the math and concentrate instead on the underlying ideas and principles. We will concentrate on TensorFlow/Keras as the underlying computational platform and use Python to write the DL code. Much of the course will be driven by a number of hands-on exercises that will help you build simple networks in Keras/TensorFlow. These exercises will cover analysis of tabular data, images, videos, and text. At the end of the course you will have a basic understanding of DL, TensorFlow/Keras as a DL platform, and example applications.

**Why Take This Course?**

Deep learning appears to be a magical algorithm that can solve difficult problems in a variety of domains. The coming together of a number of trends underlies the success of DL. In this course, you will get a chance to look under the hood of the DL hype and see the underlying architectures, and you will be exposed to a set of tools that you can use to create your own deep learning models.

**What Will Participants Learn?**

The course will focus on the following topics:

- Neural networks
- Deep neural networks
- Training of deep neural networks
- Tensors, TensorFlow and Keras
- Convolutional neural networks
- Transformers
- Autoencoders
- Generative adversarial networks
- Transfer learning
- Text as data

Participants will complete a number of computer exercises using Python and Keras/TensorFlow.

**Prerequisites**

To get the full benefit of the class, participants should have an understanding of statistics and calculus at the undergraduate level and programming experience with Python.

### Geospatial Analytics

Laura Tateosian

**Summary**

This course will focus on how to explore, analyze, and visualize geospatial data. Using Python and ArcGIS Pro, we will inspect and manipulate geospatial data, use powerful GIS tools to analyze spatial relationships, link tabular data with spatial data, and map the data. Interactive activities throughout the course will provide hands-on experience to cement learning. In these activities, participants will use Python and the arcpy library to invoke key GIS tools for spatial analysis and mapping. The course will provide participants with core programming skills for harnessing GIS data.

**Why Take This Course**

Geographic data is ubiquitous and the ability to analyze and visualize geospatial aspects of a problem can contribute valuable insights. Everything happens somewhere and location matters. Familiarity with GIS data formats and GIS capabilities and the ability to automate the analysis of GIS datasets is beneficial in a broad spectrum of government and industry positions, as well as academic fields.

**What Will Participants Learn**

The course will introduce geospatial data formats and data projection and the capabilities of geographic information systems (GIS). Then students will learn the use of Python for each of the following:

- Invoke GIS tools
    - Project data
    - Perform proximity analysis
    - Select features based on location or data attributes
    - Join tabular data with spatial data
    - Calculate fields from geometries
    - Perform density analysis
    - Calculate spatial statistics
    - Compute map algebra
- Batch geoprocessing
    - List GIS datasets in a workspace
    - Batch process the dataset
    - List the fields in a GIS dataset
    - Access the field properties
- Read and modify GIS tables with cursors
- Map GIS data and export maps
- Additional topics (if time permits)
    - Mapping in 3D scene view
    - arcpy Charts
    - ArcGIS deep learning tools

**Prerequisites and Requirements**

You will need a computer during class and an installation of ArcGIS Pro. This course is best suited for those who already have a basic working knowledge of Python. Students with no knowledge of Python should consider taking an introductory Python short course, such as the Monday-Tuesday “Introduction to Python”. Students who do not already have access to ArcGIS Pro will need to install the free trial prior to class. Additional instructions for testing the installation will be provided in advance of the course.

### Introduction to Effective Information Visualization

Eric Monson

**Summary**

Participants will see how freely and commonly available software can be used to create effective visualizations, learn how to clean and re-structure data, and learn basic principles of effective visualization so they can go beyond the defaults and create eye-catching and impactful figures.

**Why Take This Course?**

Visualization is a powerful way to reveal patterns in data, attract attention, and get your message across to an audience quickly and clearly. However, there are many steps in the journey from information to influence, and many questions: which visualization tools to use, how to get data into the right format, and which choices to make when putting it all together to tell your story. This course will quickly walk participants through a wide variety of data and chart types to help even beginners feel comfortable embarking on a new visualization project.

**What Will Participants Learn?**

The course will cover five major topic areas during the eight time blocks over two days. Sessions will combine lecture and hands-on activities in data cleaning, transformation, and visualization.

- Tips for effective data visualization and graphic design (2 course modules)
- Charts, maps, and interactive dashboards with Tableau (3 course modules)
- Cleaning and re-structuring data using OpenRefine (1 course module)
- Free web-based tools such as RAW for less-common visualization types
- Critique of existing visualizations and information graphics

**Prerequisites and Requirements**

This course is designed at an introductory level but will assume a basic understanding of spreadsheets as a way of storing and processing data. No programming will be necessary. *Note that the Tableau training starts at a very introductory level, so if you are already an advanced user, that particular content may feel basic to you.*

Before class, please install recent versions of Tableau (Public or Desktop), a web browser, and OpenRefine in order to participate in the hands-on computer-based activities. *Please have software installed before class. We will not be spending time on that during the day.*

### Introduction to Statistical Machine Learning in R

Yufeng Liu

**Summary**

Statistical machine learning and data mining is an interdisciplinary research area closely related to statistics, computer science, engineering, and bioinformatics. Many statistical machine learning and data mining techniques and algorithms are useful in various scientific areas. This 2-day short course will provide an overview of statistical machine learning and data mining techniques with applications to the analysis of real data. Both supervised and unsupervised techniques will be covered.

Supervised learning techniques include penalized regression such as LASSO and its variants, support vector machines and tree-based methods.

Unsupervised learning techniques include dimension reduction methods such as principal components analysis and clustering analysis. The main emphasis will be on the analysis of real data sets from various scientific fields. The techniques discussed will be demonstrated in R.

**Why Take This Course?**

This course is intended for researchers who have some knowledge of statistics and would like to be introduced to statistical machine learning and data mining as well as for practitioners who would like to apply statistical machine learning techniques to their problems. Machine learning plays an increasingly significant role in data science and has become an integral part of many fields—from biomedicine to business/marketing to social media. It is grounded directly in our daily lives.

**What Will Participants Learn?**

Discussion and R exercises will be included, as time permits, on the following:

- Fundamentals of statistical learning
    - Training versus test error rates
    - Supervised versus unsupervised methods
    - Bias/Variance tradeoff
- Linear regression and penalized regression
    - Ridge regression
    - Lasso
    - Further extensions (if time permits)
- Cross-validation
- Classification techniques
    - Logistic regression and penalized logistic regression
    - Nearest neighbors classification
    - Support vector machines
- Tree-based methods
    - Bagging
    - Random forests
- Unsupervised learning techniques
    - Dimension reduction: Principal component analysis
    - Other dimension reduction techniques (if time permits)
    - Clustering
- Other selected topics (if time permits)

**Prerequisites and Requirements**

Participants should be familiar with linear regression and basic statistical and probability concepts, as well as with R programming. R and RStudio should be installed before class.