- Introduction to Effective Information Visualization
- Basics of R for Data Science and Statistics
- Introduction to Programming in R
- Introduction to Programming Using Python
- Exploratory Data Analysis Using R Markdown
- Improving R Programs
- Intermediate Python
- Using Linked Data
- Visualization for Data Science Using R
- Introduction to Deep Learning with Python
- Introduction to Statistical Machine Learning Using R
- R for Automating Workflow & Sharing Work
- Text Analysis Using R

### Introduction to Effective Information Visualization

Eric Monson

**Summary**

Participants will see how freely and commonly available software can be used to create effective visualizations; learn how to clean and re-structure data; and learn basic effective visualization principles in order to go beyond the defaults and create eye-catching and impactful figures.

**Why Take This Course?**

Visualization is a powerful way to reveal patterns in data, attract attention, and get your message across to an audience quickly and clearly. But there are many steps in the journey from information to influence, and many questions: which visualization tools to use, how to get data into the right format, and which choices to make when putting it all together to tell your story. This course will quickly walk participants through a wide variety of data and chart types to help even beginners feel comfortable embarking on a new visualization project.

**What Will Participants Learn?**

The sessions will combine lecture and hands-on activities in data cleaning, transformation, and visualization. The course will cover four major topic areas:

- Tips for effective data visualization
- Charts, maps and interactive dashboards with Tableau
- Cleaning and re-structuring data using OpenRefine
- Free web-based tools such as RAW for less-common visualization types

**Prerequisites and Requirements**

This course is designed at an introductory level, but will assume a basic understanding of spreadsheets as a way of storing and processing data. No programming will be necessary. Having pre-installed recent versions of Tableau (Public or Desktop), a web browser, and OpenRefine will allow students to get in-class experience during the hands-on computer-based activities.

Please have software installed on your laptop before class begins. We will not be spending time on installation during class.

### Basics of R for Data Science and Statistics

Justin Post

**Summary**

This course introduces the powerful and popular R statistical software through the RStudio integrated development environment. R is a fully developed programming language and one of the major platforms for doing data science. This course covers frequently used data structures, importing raw data, common data manipulations, summary statistics, and data visualizations through the suite of packages called the “tidyverse.”

**Why Take This Course?**

R is an extremely versatile programming language that has the capability to fit a fantastic array of statistical and machine learning models, is extremely easy to collaborate with, and has the capacity to easily and widely share your analyses. However, to be able to use these vast capabilities we must, of course, import data, likely create variables, and subset our data appropriately. We will also want to understand and validate our data through summarizations. R can easily handle these tasks in a multitude of ways. However, the flexibility that comes with R also creates a difficult learning environment. There are often many ways to do the same task and it can be overwhelming, at first, to determine the best methods. This course will help you to gain a solid foundation in the use of R to perform the common tasks mentioned above.

**What Will Participants Learn?**

The course provides a modern introduction to R through the extremely popular suite of packages called the “tidyverse.” More specifically, the course will cover:

Day 1:

- Basics of how R stores data
- R Packages and the tidyverse
- Reading data from common formats into R (readr package)
- Using R Markdown for reproducibility (rmarkdown and knitr packages)
- Common data manipulations and creating new variables (dplyr package)

Day 2:

- Reshaping data for summarizing and modeling (tidyr package)
- Types of data and numeric summaries (including across groups)
- Creating publication-ready graphs (ggplot2 package)

**Prerequisites and Requirements**

This course will make heavy use of hands-on programming. We will generally introduce a topic and then have hands-on exercises to practice and explore that topic. Your laptop will need internet access, and you’ll need the ability/permissions to install programs and download files. This course assumes a strong working knowledge of computers and, although not required, it would be beneficial to have past experience with the logic of programming and/or executing statistical analyses. Participants are provided instructions and resources in advance to assist with installing R, and that step will need to be completed before class begins.

### Introduction to Programming in R

Jonathan Duggins

**Summary**

Statistical programming is an integral part of many data-intensive careers, and data literacy and programming skills have become a necessary component of employment in many industries. This course begins with necessary concepts for new programmers, both general and statistical, and explores some necessary programming topics for any job that utilizes data. R will be used, as it is one of the most popular tools in statistics and data science. R programming is done in the RStudio integrated development environment.

**Why Take This Course?**

Data is everywhere. Every industry, and most careers, now involve working with data at some point. As data-driven decisions become the norm in many careers, individuals need to add to their professional skills by learning the essentials of programming.

Whether their goal is to better communicate with coworkers who program or to integrate programming in their day-to-day jobs, this course benefits individuals who are looking to understand general principles behind statistical programming and to develop programming skills necessary to work in data-related areas.

Even though computers are “literal,” programming offers many techniques for solving a particular problem. The combination of necessary precision and freedom of choice regarding techniques can be frustrating for beginners.

**What Will Participants Learn?**

The course covers both general computing concepts and an introduction to R. A general outline is given below:

Day 1:

- General computing concepts
- Fundamental programming concepts
- Introduction to R and RStudio
- R functions
- Common R data structures

Day 2:

- R packages
- Reading and writing data files (readr package)
- Subsetting data by rows, columns, or both (dplyr package)
- Deriving new variables unconditionally and conditionally (dplyr package)
- Introduction to summarizing data

**Prerequisites and Requirements**

This course aims to support individuals with little to no previous programming experience as they gain the fundamentals necessary to be successful programmers. No prior programming experience is expected in R or any other language. However, this course relies on active participation of the participants via hands-on programming. Thus, everyone is expected to have their own laptop computer that has access to the internet and ability/permissions to install programs and download files. R and RStudio should be installed in advance so that participants can devote their time to practicing the skills from the course. Participants are provided instructions and resources in advance to assist with the installation process, and installation will need to occur before the class begins.

### Introduction to Programming Using Python

Jason Carter

**Summary**

This course is an introduction to programming as a skill, a discipline, and a profession for graduate students. We will dive into hands-on programming from day one and progress to evaluating and using open source libraries and frameworks to manage large and complex datasets. We will focus equally on reading and writing code.

**Why Take This Course?**

Students will leave the course with real skills, an ability to learn new programming technologies, and an understanding of how to incorporate open source code into their projects. It will serve as an appropriate foundation for students seeking to use programming as a tool in their data science tool belt. This course will help students evaluate, communicate with, or work with programmers or code.

**What Will Participants Learn?**

At the end of this course, students should have:

- The ability to analyze large or complex data sets.
- The skills required to solve problems by creating and modifying programs and systems, using modern programming tools.
- The knowledge of basic programming concepts, their appropriate usage, and how and where to learn more.
- An attitude of confidence when reading, writing, or discussing computer code.

**Prerequisites and Requirements**

There are no prerequisite course requirements for the course. However, you will need to be skilled in the use of basic mathematics and algebra, as well as email and web usage.

You will need to have a computer for class that can access the internet and the ability/permissions to download/install software.

### Exploratory Data Analysis Using R Markdown

Jonathan Duggins

**Summary**

This course introduces techniques for Exploratory Data Analysis (EDA)—both numeric and graphical—to help provide insight into data sets. Using R and RStudio, we will demonstrate various tools to generate data summaries and introduce R Markdown for generating HTML output. Incorporating R Markdown allows for the generation of output which includes results from statistical analysis, plots, figures (such as diagrams), and explanations of any output.

**Why Take This Course?**

Reading and writing data are only part of the analysis process. Analysis and documentation of results remain a cornerstone of the statistics and data science practices. In this course, methods for basic statistical analysis allow participants the opportunity to analyze data in a familiar way while developing R Markdown skills for creating files with a seamless integration of code and results along with a narrative explaining those methods and their output.

By presenting the process in terms of familiar statistical concepts, participants can substitute any statistical analysis from their toolbox to produce reports that match the required sophistication level for any analysis.

**What Will Participants Learn?**

A tentative outline of the course is given below:

- Numeric data summaries
- Graphical data summaries (ggplot2 package)
- Creating high-quality graphics (ggplot2 package)
- R Markdown for professional documentation/reproducibility

**Prerequisites and Requirements**

This course assumes a working knowledge of computers, and some prior experience in R is required. At a minimum, knowledge about data objects, an understanding of R syntax, and experience with the tidyverse collection of packages are strongly recommended.

This course relies on active participation of the participants via hands-on programming. Thus, everyone will need their own computer that has access to the internet and the ability/permissions to install programs and download files. R and RStudio should be installed in advance so that participants can devote their time to practicing the skills from the course. Participants will be provided instructions and resources in advance to assist with the installation process.

### Improving R Programs

Justin Post

**Summary**

This course introduces common programming techniques that can improve the efficiency of your R programs. These techniques include the use of loops and vectorized functions to avoid repeated sections of code. To really take R programs to the next level, we’ll see how to write custom functions that will help to streamline code.

**Why Take This Course?**

R is an extremely versatile programming language that has the capability to fit a fantastic array of statistical and machine learning models, is extremely easy to collaborate with, and has the capacity to easily and widely share your analyses. Oftentimes, the same types of steps are taken repeatedly to each column, object, or dataset. Rather than copying and pasting the same code over and over and simply making small tweaks, which is error prone, participants will learn how to automate these changes.

Specifically, we will use loops to iteratively reevaluate code while changing elements and then demonstrate the efficiency of using vectorized functions to do similar evaluations. We will also discuss the idea of breaking up common tasks into custom functions to help write clean code that is easy to debug. Being able to write your own R functions opens up the possibilities that R has and can help with general understanding of how R works.

**What Will Participants Learn?**

The course provides a brief overview of R data structures and then covers the following topics:

- Loops in R
- Vectorized functions (apply family of functions)
- How R functions work
- Function writing

**Prerequisites and Requirements**

This course will make heavy use of hands-on programming. We will generally introduce a topic and then have exercises to practice and explore. As such, participants will need their own computer that has access to the internet and the ability/permissions to install programs and download files. This course assumes basic knowledge of how to program in R. Participants taking the Monday-Tuesday course “Basics of R for Data Science and Statistics” will be prepared for this course. Participants with a basic understanding of R are also welcome to register. The course assumes that R is already installed on your computer.

### Intermediate Python

Jason Carter

**Summary**

Learning Python is important for any aspiring data scientist who is interested in programming as a skill, a discipline, or a profession. For students attending the “Introduction to Programming Using Python” course on Monday and Tuesday, we will continue working with hands-on programming and managing large and complex datasets.

**Why Take This Course?**

This course will serve as a foundation for students seeking to use programming as a tool in their data science tool belt. This course will help students evaluate, communicate with, and work with programmers and code. Students will get hands-on practice with creating and manipulating data sets, performing error handling, and storing and retrieving data using relational databases.

**What Will Participants Learn?**

Students will learn how to:

- create programs that manipulate data
- perform error handling
- store and retrieve data using relational databases
- use the pandas library, the de facto standard for working with tabular data in Python
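
As a rough illustration of two of the topics above (error handling, and storing and retrieving data in a relational database), here is a minimal sketch using only Python's standard-library `sqlite3` module. It is not taken from the course materials: the `load_scores` function, the `scores` table, and the sample rows are all hypothetical, and an in-memory database is assumed so that nothing is written to disk.

```python
import sqlite3


def load_scores(rows):
    """Store (name, score) rows in an in-memory SQLite table.

    Hypothetical helper for illustration only. Rows that cannot be
    stored are skipped and reported instead of stopping the program.
    """
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE scores (name TEXT PRIMARY KEY, score REAL)")
        skipped = []
        for name, score in rows:
            try:
                # float() raises ValueError for text such as "n/a";
                # the PRIMARY KEY raises IntegrityError for duplicate names.
                conn.execute(
                    "INSERT INTO scores VALUES (?, ?)", (name, float(score))
                )
            except (ValueError, sqlite3.IntegrityError) as err:
                skipped.append((name, str(err)))
        conn.commit()
        stored = conn.execute(
            "SELECT name, score FROM scores ORDER BY name"
        ).fetchall()
    finally:
        # Always release the connection, even if something unexpected fails.
        conn.close()
    return stored, skipped


# Made-up sample data: one good row, one unparseable score, one duplicate name.
stored, skipped = load_scores([("ada", "91.5"), ("grace", "n/a"), ("ada", "70")])
print(stored)
print([name for name, _ in skipped])
```

Catching only the two expected exception types keeps unexpected failures visible instead of hiding them.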

**Prerequisites and Requirements**

An introductory knowledge of Python is required. You will also need to be skilled in basic mathematics and algebra, as well as email and web usage. You will need a computer that can access the internet and the ability/permissions to download and install software.

### Using Linked Data

Jim Balhoff

**Summary**

Linked data technologies provide the means to create flexible, dynamic knowledge graphs using open standards. This course offers an introduction to linked data and the semantic web tools underlying its use. We will build linked data knowledge graphs using the Resource Description Framework (RDF) data model. RDF represents facts as simple three-part statements called triples; statements that share subjects or objects link together to form a graph. We will learn how to combine data sets into a single knowledge graph, query open linked data resources on the web, and add structure to our data using logical inferences and graph shape constraints.
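
To make the idea of triples linking into a graph concrete, here is a minimal sketch in plain Python. It is only an illustration, not an RDF library and not course material: the `ex:` identifiers, the sample facts, and the helper functions `objects_of` and `located_in` are all hypothetical. Real RDF uses full URIs and is normally queried with SPARQL.

```python
# Each fact is a (subject, predicate, object) triple. All names are made up.
triples = [
    ("ex:Alice", "ex:worksAt", "ex:Library"),
    ("ex:Library", "ex:locatedIn", "ex:City"),
    ("ex:Alice", "ex:knows", "ex:Bob"),
]


def objects_of(subject, predicate):
    """Return every object of the triples matching a subject and predicate."""
    return [o for s, p, o in triples if s == subject and p == predicate]


def located_in(person):
    """Follow two links in the graph: person -> workplace -> place.

    The object of the first triple ("ex:Library") is the subject of the
    second, which is what joins the separate statements into one graph.
    """
    return [
        place
        for workplace in objects_of(person, "ex:worksAt")
        for place in objects_of(workplace, "ex:locatedIn")
    ]


print(located_in("ex:Alice"))  # ['ex:City']
```

Combining two data sets in this toy model is just concatenating their triple lists. That only works when both use the same identifiers for the same things, which is why shared global identifiers matter.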

**Why Take This Course?**

Knowledge graphs provide a natural way to structure information, and by using linked data standards, different sources of knowledge can be readily combined and queried as one graph. While efforts to create a global “semantic web” have at times been over-hyped, the linked data standards and technologies this field has produced are both approachable and extremely useful for a wide range of applications in data integration and manipulation. When RDF-based linked data is used to share structured information on the web, providers can use a common set of global identifiers to make it possible to readily combine data from different sources. Linked data technologies have found particular adoption within many life sciences data resources, for example the UniProt Knowledgebase and the Gene Ontology, but are also used for knowledge organization at the BBC, tracking cultural heritage objects, music databases, and more. Wikidata, the structured data component of Wikipedia, provides a comprehensive RDF knowledgebase to which anyone can contribute.

**What Will Participants Learn?**

The course will cover the following topics:

- The RDF graph data model
    - Elements of RDF: URIs, literals
    - RDF file formats
- Querying RDF with SPARQL, a graph-oriented query language
- Tools for manipulating and querying RDF
    - command line tools
    - RDF databases
- Linked data resources on the Web
    - controlled vocabularies and ontologies
    - Wikidata
- Combining linked data sets
- Verifying graph structure using shape constraints
    - Shape definition languages: SHACL and ShEx
- Adding semantics with an ontology

**Prerequisites and Requirements**

You will need a computer that can access the internet during the class. Many of the exercises will be web-based and require no software installation; however, we will also introduce some command-line tools that can be used for RDF processing on your own computer. To follow along with these examples, participants will need an installation of Java and should be comfortable running commands in their laptop’s terminal window. Prior to the course, participants will receive detailed instructions on how to obtain Java and install the command-line tools used in the examples.

### Visualization for Data Science Using R

Angela Zoss

**Summary**

This course is designed for two audiences: experienced visualization designers looking to apply open data science techniques to their work and data science professionals who have limited experience with visualization. Participants will develop skills in visualization design using R, a tool commonly used for data science. Basic familiarity with R is required.

**Why Take This Course?**

Data science skills are increasingly important for research and industry projects. With complex data science projects, however, come complex needs for understanding and communicating analysis processes and results. Ultimately, an analyst’s data science toolbox is incomplete without visualization skills. Incorporating effective visualizations directly into the analysis tool you are using can facilitate quick data exploration, streamline your research process, and improve the reproducibility of your research.

**What Will Participants Learn?**

The course will take a project-based approach to learning best practices for visualization for data science. Participants will be guided through several sample analysis and visualization projects that will highlight different types of visualization, different features of R and its visualization capabilities, and different challenges that arise when trying to apply an open data science philosophy to visualization.

- Introduction to visualization in R
- Using ggplot2 for publication-ready graphics
- Applying common graphic design principles to ggplot2 visualizations
- Adding interactivity to visualizations through R Markdown, HTML widgets, and Shiny applications

**Prerequisites and Requirements**

As indicated above, this course assumes basic familiarity with R—e.g., R syntax, data structures, development environments. Participants with no knowledge of R should consider taking an introductory R short course.

We will use RStudio to interact with R, and all exercises will be distributed in R Markdown files (rather than simple R script files). This allows us to combine R code with non-code elements and promotes a literate programming approach to research.

A significant portion of the course will use ggplot2 and other tidyverse packages to create visualizations, but prior experience with those packages is not required. In order to participate in class exercises, participants will need a computer with the following installed: current versions of R and RStudio, and the packages tidyverse, knitr, shiny, plotly, crosstalk, flexdashboard, maps, mapproj, and sf. Permissions to install packages on the fly would be useful.

### Introduction to Deep Learning with Python

Ashok Krishnamurthy

**Summary**

In the past few years, Deep Learning (DL) has emerged as a powerful Machine Learning method that has found applications in areas such as object recognition, image classification, video analysis, and natural language processing. This course will introduce participants to Deep Learning from a hands-on perspective. The approach will be to minimize the math and concentrate instead on the underlying ideas and principles. We will concentrate on TensorFlow/Keras and PyTorch as the underlying computational platforms and use Python to write the DL code. At the end of the course you will have a basic understanding of DL, of TensorFlow/Keras and PyTorch as DL platforms, and of example applications. We will do a number of hands-on labs that use DL for image/object recognition and text-as-data analysis.

**Why Take This Course?**

Deep Learning appears to be a magical algorithm that can solve difficult problems in a variety of domains. The coming together of a number of trends underlies the success of DL. In this course, you will get a chance to look under the hood of the DL hype and see the underlying architectures, and you will be exposed to a set of tools that you can use to create your own Deep Learning models.

**What Will Participants Learn?**

The course will focus on the following topics:

- Neural Networks
- Neural Networks as universal approximators
- Training Neural Networks as an optimization problem
- Deep Neural Networks
- Why Deep Learning now?
- Convolutional Neural Networks
- Tensors, TensorFlow and Keras
- PyTorch
- Autoencoders
- Transfer Learning
- Text as data

Participants will complete a number of computer exercises using Python, Keras/TensorFlow and PyTorch.
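
The topic "training neural networks as an optimization problem" can be previewed with a very small sketch. It is not the course's lab code and uses neither TensorFlow nor PyTorch: it fits a single linear unit by plain gradient descent on mean squared error. The toy data, the learning rate, the step count, and the function names `mse` and `train` are all assumptions chosen for illustration.

```python
# Hypothetical toy data generated from y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]


def mse(w, b):
    """Mean squared error of the prediction w * x + b over the toy data."""
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def train(steps=2000, lr=0.05):
    """Minimize the loss by repeatedly stepping against its gradient."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Partial derivatives of the mean squared error with respect to w and b.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


w, b = train()
print(round(w, 3), round(b, 3))
```

Deep learning frameworks automate the gradient computation and scale the same loop to models with many layers and parameters.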

**Prerequisites**

This course will assume a basic understanding of statistics and calculus at the undergraduate level and some programming experience with Python to get full benefits from the class. You will need a computer to use during the class.

### Introduction to Statistical Machine Learning Using R

Yufeng Liu

**Summary**

Statistical machine learning and data mining is an interdisciplinary research area that is closely related to statistics, computer science, engineering, and bioinformatics. Many statistical machine learning and data mining techniques and algorithms are useful in various scientific areas. This 2-day short course will provide an overview of statistical machine learning and data mining techniques with applications to the analysis of real data. Both supervised and unsupervised techniques will be covered.

Supervised learning techniques include penalized regression such as LASSO (and its variants), support vector machines, and tree-based methods.

Unsupervised learning techniques include dimension reduction methods such as principal components analysis and clustering analysis. The main emphasis will be on the analysis of real data sets from various scientific fields. The techniques discussed will be demonstrated in R.

**Why Take This Course?**

This course is intended for researchers who have some knowledge of statistics and would like to be introduced to statistical machine learning and data mining as well as for practitioners who would like to apply statistical machine learning techniques to their problems. Machine learning plays an increasingly significant role in data science and has become an integral part of many fields—from biomedicine to business/marketing to social media. It is grounded directly in our daily lives.

**What Will Participants Learn?**

Discussion and R exercises will be included, as time permits, on the following:

- Fundamentals of Statistical Learning
    - Training versus test error rates
    - Supervised versus unsupervised methods
    - Bias/Variance tradeoff
- Linear regression and Penalized Regression
    - Ridge regression
    - Lasso
    - Further extensions (if time permits)
- Cross-validation
- Classification Techniques
    - Logistic regression and penalized logistic regression
    - Nearest neighbors classification
    - Support vector machines
- Tree-based Methods
    - Bagging
    - Random forests
- Unsupervised Learning Techniques
    - Dimension reduction: Principal Component Analysis
    - Other dimension reduction techniques (if time permits)
    - Clustering
- Other selected topics (if time permits)

**Prerequisites and Requirements**

Participants should be familiar with linear regression and basic statistical and probability concepts, as well as with R programming. You will need a computer for class with R and RStudio installed.

### R for Automating Workflow & Sharing Work

Justin Post

**Summary**

The course provides participants an introduction to utilizing R for writing reproducible reports and presentations that easily embed R output, using online repositories and version control software for collaboration, creation of basic websites using R, and the development of interactive dashboards and web applets.

**Why Take This Course?**

Being able to easily collaborate and share work is of paramount importance. This course discusses many topics that can help improve workflow, collaboration, and dissemination of analyses. We will cover how the knitr package can be used to create a multitude of different output files (including PDF, HTML, slideshows, and more) that include both formatted text and R code using the simple R Markdown language. Further, we will see how to automate common analyses to create multiple versions of a report for different subsets of data.

In order to better collaborate, we will discuss basic use of Git and GitHub through the RStudio environment. This software not only allows for easy collaboration but also provides strong version control, ensuring a record of all changes made over time. We will also see that GitHub and R Markdown can be used to create sleek looking websites to easily share your analyses.

Other course topics will include the creation of interactive and customizable dashboards and web applets through R Shiny. These give the user of the app the ability to change sliders, enter values, and more, with R running calculations on the backend.

**What Will Participants Learn?**

Students will learn about the following topics:

- The R Markdown language
- How to automate reports with R Markdown
- Use of Git and GitHub for collaboration and version control
- Basic creation of websites through Markdown and GitHub
- R Shiny web apps

**Prerequisites and Requirements**

This course will make heavy use of hands-on programming. We will generally introduce a topic and then have exercises to practice and explore. As such, participants will need a computer that has access to the internet and the ability/permissions to install programs and download files during class. This course assumes basic knowledge of how to program in R. Participants taking the course “Basics of R for Data Science and Statistics” (Monday-Tuesday) or “Improving R Programs” (Wednesday) will be prepared for this course. Other participants with experience using R are also welcome to register for the course.

### Text Analysis Using R

Alison Blaine

**Summary**

This course explains how to clean and analyze textual data using R, including both raw and structured texts. It will cover multiple hands-on approaches to getting data into R and applying analytical methods to it, with a focus on techniques from the fields of text mining and Natural Language Processing.

**Why Take This Course?**

The skills required to analyze textual data are useful in a wide array of academic disciplines and industries. This course will provide students already familiar with R with some of the programming skills necessary to take more control over their data analysis process and feel empowered to dive deeper.

**What Will Participants Learn?**

Participants will learn how to load text-based data into R, how to format and process the data for analysis, and then how to apply multiple methods for analyzing those texts. The instructional approach is to teach a concept by going step-by-step through exercises with lots of opportunities to practice concepts learned.

**Prerequisites and Requirements**

You will need a computer during class. This course is best suited for those who already have a basic working knowledge of R. Students with no knowledge of R should consider taking an introductory R short course. Two options are “Basics of R for Data Science and Statistics” (Monday-Tuesday) and “Introduction to Programming in R” (Monday-Tuesday).