Introduction to R
R is one of the most widely used environments for data analysis, and it is the choice of many data scientists and business analysts.
This 3-day workshop will introduce R to these audiences and provide the basic skills needed to conduct data analysis projects independently in R.
This introductory workshop adopts a hands-on, interactive approach. It consists of a few “interactive demonstration” modules that introduce key concepts. In these modules we follow a well-documented R script, occasionally deviating from it to answer questions and deepen our understanding of the material presented. Interspersed between these demonstrations are in-class exercises in which participants solve real (sometimes slightly simplified) business cases on their own, using R. Finally, there is always room for participants to bring their own data to the class and discuss how to perform their task in R. This modular structure makes it possible to cater to the specific needs of participants and their organization. The typical workshop takes a minimum of 2 full days (2×8 hours), but 3 are advised. More “open audience” consultation time can be added, as well as more modules covering additional topics and statistical methods.
The workshop is designed with two main audiences in mind:
– Statisticians and business analysts. It is expected that participants have experience with data analysis, but there is no need for prior programming experience.
– Programmers who need to develop data-oriented software. Here the assumption is that participants have a programming background, but not necessarily a statistical one.
Experience shows that it is not a good idea to mix these two populations, as each needs a different emphasis and approach to master the required skills.
Topics and modules (typical):
Module 1: R basics
– introducing the R and RStudio environment
– basic data structures: vectors, lists & data frames
– basic operations on data: element-wise operations, selection and summary operations
– working with dates and strings
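As a rough illustration of the Module 1 topics, here is a minimal base R sketch. It is not taken from the workshop materials; the variable names and values (prices, customer, orders) are made up for the example.

```r
# A numeric vector; arithmetic is applied element by element
prices <- c(10, 20, 30, 40)
discounted <- prices * 0.5

# Logical selection keeps only the elements that satisfy a condition
expensive <- prices[prices > 15]

# A list can hold elements of different types
customer <- list(name = "Dana", visits = 3L)

# A data frame is a table whose columns are equal-length vectors
orders <- data.frame(
  id = 1:4,
  price = prices,
  placed = as.Date(c("2024-01-05", "2024-01-20", "2024-02-03", "2024-02-10"))
)

# Summary operation: the mean of one column
mean_price <- mean(orders$price)

# Dates: extract the month as text; strings: build labels and change case
orders$month <- format(orders$placed, "%m")
orders$label <- toupper(paste0("order-", orders$id))

print(orders)
```

Printing `orders` shows the two derived columns next to the original ones, which is a convenient way to check each step interactively.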
Module 2: Data manipulation
– reading and writing data from/to files (including the haven and readxl packages)
– the dplyr package: merging, recoding and aggregating data
– the reshape2 package: long and wide data formats and their usage
Module 3: Basic statistics
– descriptive statistics
– univariate and bivariate statistical tests (t-test, chi-square, etc.)
– linear regression
– logistic regression
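The Module 3 topics can be sketched with functions from base R's stats package and the built-in mtcars data set. The choice of data set and variables is an assumption made for illustration only; the workshop itself may use different data.

```r
data(mtcars)

# Descriptive statistics for fuel consumption (miles per gallon)
mpg_summary <- summary(mtcars$mpg)

# Two-sample t-test: mpg by transmission type (am: 0 = automatic, 1 = manual)
tt <- t.test(mpg ~ am, data = mtcars)

# Chi-square test of independence between two categorical variables.
# Small cell counts trigger an approximation warning, suppressed here.
tab <- table(mtcars$am, mtcars$cyl)
chi <- suppressWarnings(chisq.test(tab))

# Linear regression: mpg explained by weight and horsepower
fit_lm <- lm(mpg ~ wt + hp, data = mtcars)

# Logistic regression: probability of a manual transmission given weight
fit_glm <- glm(am ~ wt, data = mtcars, family = binomial)

print(mpg_summary)
print(tt)
print(coef(summary(fit_lm)))
print(coef(summary(fit_glm)))
```

Each fitted object can be inspected further with `summary()`, and predictions are available through `predict()`.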
Module 4: Data visualization
– basic plot commands
– the ggplot2 package to cover histograms, scatterplots, boxplots, etc.
Module 5: Process automation
– basics of R programming
– (if time permits) template-based reporting with the knitr package
– (if time permits) generating paper-like tables for publication with the stargazer package
Introduction to predictive analytics
Traditional data analysis is about describing the data we have. Some more advanced models are used to explain why we get what we see. However, with advances in technology and statistical theory, most companies now employ (or can employ) a new statistical paradigm, predictive analytics. Instead of finding characteristics of customers, predictive models can predict which customer is likely to leave, or which lead is most likely to convert. This change in paradigm benefits decision makers and managers, as it provides more precise insights that lead to focused and valuable actions. However, it also requires new statistical capabilities. It turns out that the best explanatory models are not always the best predictive models, and analysts now need to develop, evaluate and interpret their models differently.
In this workshop we will get an overview of the world of predictive models and analytics, enter into this new “mindset”, and learn the basic considerations and evaluation techniques. In a “hands on” manner we will learn to classify, estimate, cluster and predict outcomes in real world settings, using the R statistical environment. The workshop is modular and built as a mix of interactive demonstrations of key topics, in-class exercises based on real business settings and “open audience consultations” in which participants bring their own data and receive advice on how to accomplish their goal. The typical workshop takes 3-4 full days, and specific topics can be tailored to the needs and background of participants. Some background in using R is required.
The workshop is aimed at business analysts, statisticians and other professionals working with data, who need (or want) to use predictive models in their day-to-day analysis tasks. Previous experience in analyzing data is required, as well as some knowledge of R.
Topics and modules (typical):
Module 1: mapping the terrain
– descriptive, explanatory and predictive modeling
– example: why you need “predictive” regression
– the kind of problems predictive models deal with
– rule and pattern extraction
– time series prediction and time-to-event predictions
– supervised vs. unsupervised learning
– feature creation and feature selection
Module 2: the basic “mindset” and evaluation techniques
– cross validation and bootstrapping techniques
– estimation problems using regression
– regularization of models and its role (bias-variance trade-off)
– tree-based estimation
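To make the evaluation "mindset" of Module 2 concrete, here is a minimal sketch of k-fold cross-validation in base R. The data are simulated, and the helper names (cv_mse, in_sample_mse) are hypothetical; this is one simple way to set it up, not necessarily the approach used in the workshop.

```r
set.seed(1)

# Simulated data: the true relation between x and y is linear plus noise
n <- 100
x <- runif(n, -2, 2)
y <- 1 + 2 * x + rnorm(n)
dat <- data.frame(x = x, y = y)

# Assign each row to one of k folds at random
k <- 5
fold <- sample(rep(seq_len(k), length.out = n))

# Cross-validated mean squared error for a polynomial of a given degree:
# fit on k - 1 folds, measure the error on the held-out fold, then average
cv_mse <- function(degree) {
  errs <- numeric(k)
  for (i in seq_len(k)) {
    train <- dat[fold != i, ]
    test  <- dat[fold == i, ]
    fit   <- lm(y ~ poly(x, degree), data = train)
    errs[i] <- mean((test$y - predict(fit, newdata = test))^2)
  }
  mean(errs)
}

# In-sample mean squared error: fit and evaluate on the same data
in_sample_mse <- function(degree) {
  fit <- lm(y ~ poly(x, degree), data = dat)
  mean(residuals(fit)^2)
}

degrees <- c(1, 3, 10)
results <- data.frame(
  degree = degrees,
  in_sample = sapply(degrees, in_sample_mse),
  cross_validated = sapply(degrees, cv_mse)
)
print(results)
```

The in-sample error cannot increase as the polynomial degree grows, because each model contains the previous one. The cross-validated error carries no such guarantee, which is why it is the more honest guide when choosing a model for prediction.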
Module 3: classification
– logistic regression
– decision trees
– random forest
Module 4: clustering and pattern analysis
– hierarchical clustering
– association rules and market basket analysis
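The first Module 4 topic, hierarchical clustering, can be sketched in base R as follows. The built-in USArrests data set, the complete-linkage method and the choice of four clusters are all assumptions made for illustration. Association rules are not shown here, since they need an extension package.

```r
data(USArrests)

# Standardize the columns so that no single variable dominates the distances
scaled <- scale(USArrests)

# Pairwise Euclidean distances between states
d <- dist(scaled)

# Agglomerative clustering with complete linkage
hc <- hclust(d, method = "complete")

# Cut the tree into 4 groups and count the members of each
groups <- cutree(hc, k = 4)
print(table(groups))
```

Calling `plot(hc)` draws the dendrogram, which helps when deciding how many clusters to keep.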
Module 5: feature selection, feature extraction and other value-enhancing techniques
– principal component analysis
– practical considerations in creating and selecting features
– the concept of boosting
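For the principal component analysis item in Module 5, a minimal sketch using base R's prcomp function could look like this. The built-in iris measurements are an assumed example data set, and keeping two components is an arbitrary choice for the illustration.

```r
data(iris)

# Use the four numeric measurement columns only
measurements <- iris[, 1:4]

# PCA on centered and scaled variables
pca <- prcomp(measurements, center = TRUE, scale. = TRUE)

# Share of the total variance captured by each component
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
print(round(var_explained, 3))

# Scores on the first two components can serve as extracted features
scores <- pca$x[, 1:2]
print(head(scores))
```

The `scores` matrix has one row per observation, so it can replace the original columns as input to a later model.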
Module 6: common business applications
– customer analytics (customer churn, etc.)
– recommender systems
– people analytics
Additional potentially relevant modules:
– time series models
– survival analysis and time-to-event predictions
– social network analysis
Social network data analysis
Relationships matter, and as social network data become abundant, organizations can extract value from analyzing relations and connections made by social and other activities. In this workshop we explore both the theory and practice of analyzing social network data and visualizing it using the R software package.
Relations matter. Relations create networks among people, businesses, and other entities, and understanding these networks entails understanding important business processes such as influence, information flow, dependencies, etc. In recent years data about relationships and networks have become available, with many applications externalizing, storing and collecting network data. In parallel, academic research in sociology, economics and psychology has developed theories and valuable insights about the value of networks and relationships. For example, did you know you can spot the more creative people in your organization just by looking at their position in the social network?
In this workshop we will learn both the theory and practice of analyzing social network data. We will survey some of the state-of-the-art theoretical findings and case studies, and then learn step-by-step how to perform network analysis using the R statistical software and related extension packages. Finally, the beauty of network analysis is its ability to visualize structures in a very compelling manner. We will discuss issues of network visualization in depth.
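To give a flavor of what "position in the network" means, here is a small base R sketch that computes degree centrality from an edge list. The people and connections are invented for the example. The workshop relies on extension packages for this kind of analysis; base R is used here only so that the sketch runs without installing anything.

```r
# Hypothetical "who talks to whom" edge list; all names are made up
edges <- data.frame(
  from = c("Ann", "Ann", "Ben", "Cal", "Dee"),
  to   = c("Ben", "Cal", "Cal", "Dee", "Eve"),
  stringsAsFactors = FALSE
)

# Build an adjacency matrix with one row and one column per person
people <- sort(unique(c(edges$from, edges$to)))
adj <- matrix(0, nrow = length(people), ncol = length(people),
              dimnames = list(people, people))
adj[cbind(edges$from, edges$to)] <- 1

# Treat ties as undirected: a link in either direction counts for both people
adj <- pmax(adj, t(adj))

# Degree centrality: the number of direct connections each person has
degree <- rowSums(adj)
print(sort(degree, decreasing = TRUE))
```

Degree is the simplest centrality measure. Dedicated network packages provide richer measures and the visualization tools discussed in the workshop.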
Advanced: Time Series Forecasting, Survival Models
As data and statistical packages become more available, business analysts are encouraged to use more sophisticated statistical models in order to bring better value to their organizations. Two of the most common advanced approaches are time series forecasting and survival models, which have important applications in marketing (e.g. customer churn analysis), HR and operations research, among other areas. In this workshop we will learn the skills required to develop and perform these kinds of analyses.