Data science for Doctors: Variable importance Exercises


Data science enhances people’s decision making. Doctors and researchers are making critical decisions every day. Therefore, it is absolutely necessary for them to have some basic knowledge of data science. This series aims to help people in and around the medical field enhance their data science skills.

We will work with a health-related database, the famous “Pima Indians Diabetes Database”. It was generously donated by Vincent Sigillito from Johns Hopkins University. Please find further information regarding the dataset here.

This is the tenth part of the series, and it aims to cover the very basics of the correlation coefficient and principal component analysis, two methods that illustrate how variables are related.
In my opinion, it is necessary for researchers to have a notion of the relationships between variables, in order to identify potential cause-and-effect relations (although such a relation remains hypothetical: you cannot claim a cause-effect relation only because the correlation between two variables is high), to remove unnecessary variables, and so on. In particular, we will go through the Pearson correlation coefficient, confidence intervals by the bootstrap, and principal component analysis.

Before proceeding, it might be helpful to look over the help pages for ggplot, cor, cor.test, boot.cor, quantile, eigen, princomp, summary, plot, and autoplot.

Moreover, please load the following libraries.
install.packages("ggplot2")
library(ggplot2)
install.packages("ggfortify")
library(ggfortify)

Please run the code below in order to load the data set and transform it into a proper data frame format:

url <- "https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data"
data <- read.table(url, fileEncoding="UTF-8", sep=",")
names <- c('preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class')
colnames(data) <- names
data <- data[-which(data$mass ==0),]

Answers to the exercises are available here.

If you obtained a different (correct) answer than those listed on the solutions page, please feel free to post your answer as a comment on that page.

Exercise 1

Compute the value of the correlation coefficient for the variables age and preg.

Exercise 2

Construct the scatterplot for the variables age and preg.

Exercise 3

Apply a correlation test for the variables age and preg, with the null hypothesis that the correlation is zero and the alternative that it is different from zero.
hint: cor.test
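
As a generic illustration of these functions (on made-up vectors, not the exercise data), cor returns the Pearson coefficient and cor.test performs the corresponding hypothesis test:

x <- c(21, 25, 32, 41, 47, 53, 60)          # toy data, for demonstration only
y <- c(1, 2, 2, 4, 5, 6, 8)
cor(x, y)                                   # Pearson correlation coefficient
cor.test(x, y, alternative = "two.sided")   # H0: correlation = 0, H1: correlation != 0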

Exercise 4

Construct a 95% confidence interval by the bootstrap. First, find the correlation by bootstrap.
hint: mean

Exercise 5

Now that you have found the correlation, find the 95% confidence interval.
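
A minimal sketch of a bootstrap correlation and its 95% percentile confidence interval, again on made-up vectors rather than the Pima variables:

set.seed(1)
x <- rnorm(50)
y <- x + rnorm(50)
boot_cor <- replicate(1000, {
  idx <- sample(seq_along(x), replace = TRUE)   # resample pairs with replacement
  cor(x[idx], y[idx])
})
mean(boot_cor)                        # bootstrap estimate of the correlation
quantile(boot_cor, c(0.025, 0.975))   # 95 percent percentile confidence interval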

Exercise 6

Find the eigenvalues and eigenvectors for the data set (exclude the class.fac variable).

Exercise 7

Compute the principal components for the dataset used above.

Exercise 8

Show the importance of each principal component.

Exercise 9

Plot the principal components using an elbow graph.

Exercise 10

Construct a scatterplot with the x-axis being the first principal component and the y-axis being the second component. Moreover, if possible, draw the eigenvectors on the plot.
hint: autoplot
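
For reference, here is a generic PCA workflow on the built-in iris data (not the Pima data); it only illustrates the functions named in the help-page list above, and autoplot from ggfortify can overlay the loading vectors:

library(ggfortify)                 # autoplot methods for PCA objects
X <- iris[, 1:4]                   # numeric columns only
eigen(cov(X))                      # eigenvalues and eigenvectors of the covariance matrix
pca <- princomp(X)
summary(pca)                       # importance (explained variance) of each component
plot(pca, type = "l")              # elbow / scree plot
autoplot(pca, loadings = TRUE, loadings.label = TRUE)   # PC1 vs PC2 with the eigenvectors drawn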




Data Science for Operational Excellence (Part-4)

Suppose your friend is a restaurant chain owner (only 3 units) facing competitive challenges related to low prices; let’s call it a price war. Inside his business he knows there is not much cost left to cut. But he thinks that if he tries harder to find better suppliers with lower freight and product costs, he could be in a better position. So he decided to hire you, a recently graduated data scientist, to figure out how to solve this problem and to build a tool so your findings can be incorporated into his daily operations. As a data scientist, you know that this problem can be solved using the lpSolve package.

Our goal here is to expand your knowledge of creating custom constraints for use in real business problems.
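
As a rough sketch of what such a formulation can look like with the lp function from lpSolve (toy dimensions and purely illustrative numbers, not the exercise data or its expected answers):

library(lpSolve)
set.seed(1234)
nr <- 3; ns <- 4; np <- 2                                   # restaurants, suppliers, products
vars  <- expand.grid(rest = 1:nr, sup = 1:ns, prod = 1:np)  # one decision variable per combination
costs <- sample(0:1000, nrow(vars), replace = TRUE)

# demand rows: each restaurant must receive at least a minimum of each product (>=)
dem_keys <- expand.grid(rest = 1:nr, prod = 1:np)
dem_con  <- t(apply(dem_keys, 1, function(k) as.numeric(vars$rest == k["rest"] & vars$prod == k["prod"])))
dem_rhs  <- sample(100:500, nrow(dem_keys), replace = TRUE)

# offer rows: each supplier can ship at most a maximum of each product (<=)
off_keys <- expand.grid(sup = 1:ns, prod = 1:np)
off_con  <- t(apply(off_keys, 1, function(k) as.numeric(vars$sup == k["sup"] & vars$prod == k["prod"])))
off_rhs  <- sample(200:700, nrow(off_keys), replace = TRUE)

sol <- lp("min", costs,
          rbind(dem_con, off_con),
          c(rep(">=", nrow(dem_keys)), rep("<=", nrow(off_keys))),
          c(dem_rhs, off_rhs))
sol$status     # 0 means an optimal solution was found, 2 means this random instance is infeasible
sol$objval     # total cost
sol$solution   # quantity shipped for each restaurant/supplier/product combination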

Answers to the exercises are available here.

Exercise 1
We will solve a transportation problem for your friend’s restaurant chain, with 2 different products sold from 4 different suppliers. Create a cost vector that models a different cost for each combination of restaurant, supplier and product. Use integer random numbers from 0 to 1000 to fill this vector. In order to be reproducible, set the seed equal to 1234.

Exercise 2
Create the demand constraints. Consider that every restaurant needs a specific quantity of each product. Use integer random numbers from 100 to 500 to define the minimum quantities needed to keep each restaurant open without running out of any supplies.

Exercise 3
Create the offer constraints. Consider that every supplier can deliver a specific quantity related to each product. Use integer random numbers from 200 to 700 to define maximum quantities that each supplier can deliver.

Exercise 4
Prepare the parameters of the lp() function using the variables created above.

Exercise 5
Now, solve the problem with the constraints created so far.


Exercise 6
We know that some suppliers have minimum order quantities. Create a new set of constraints to represent that. Use integer random numbers from 50 to 70 to define minimum quantities that we can order from each supplier.

Exercise 7
Now, solve the problem with the constraints created so far.

Exercise 8
We also know that some vehicles have maximum capacity in terms of weight and volume. Create a new set of constraints to represent that. Use integer random numbers from 100 to 500 to define maximum quantities that we can order from each supplier.

Exercise 9
Prepare the lp() function parameters again using the variables created above.

Exercise 10
Now, solve the problem with all the constraints.




Data Science for Operational Excellence (Part-3)


Optimized transportation planning is a task usually left to the firm’s logistics department. However, it is often difficult to visualize, especially if there are many points involved in the logistics network. R and its packages can help solve this issue. Our goal here is to expand logistics network visualization. In order to do that, we will use packages such as ggmap and leaflet.
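
As a minimal sketch of the kind of map this set builds toward, the snippet below draws one straight origin-destination segment with leaflet; the coordinates are rough, hand-filled approximations, not values returned by geocode:

library(leaflet)
seg <- data.frame(city = c("Sapezal", "Santos"),
                  lat  = c(-13.5, -24.0),
                  lng  = c(-58.8, -46.3))
leaflet(seg) %>%
  addTiles() %>%                                          # background map tiles
  addCircleMarkers(lng = ~lng, lat = ~lat, label = ~city) %>%
  addPolylines(lng = ~lng, lat = ~lat)                    # straight line from origin to destination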

Answers to the exercises are available here.

Exercise 1
Load libraries: ggmap, fields, lpSolve, leaflet, dplyr, magrittr. Use the following vectors to create a new one called allCitiesAux: soyaCities <- c("Sapezal","Sorriso", "Nova Mutum", "Diamantino", "Cascavel") , transhipment <- c("Alto Araguaia", "Cascavel"), ports <- c("Santos", "Paranagua").
Exercise 2
Use the function geocode to collect latitude and longitude for all cities.
Exercise 3
Create a data frame with the column names City, lat and lng.
Exercise 4
Create a matrix that contains the distances between all cities. We will use this in the lp.transport function, so remember that rows must be offer (supply) points and columns demand points.


Exercise 5
Create row.signs, row.rhs, col.signs and col.rhs. For that, remember to set a seed equal to 123 and that all soya must be exported through the ports. For the “right-hand side” variables use randomly generated numbers: port demands should be between 300 and 600, and soya production should be between 50 and 300.
Exercise 6
Solve the transportation problem and change the column and row names to match the names from the cost matrix.
Exercise 7
Create a list of data frames to store all the segments present in the solution. For example, one of these segments should be Sapezal to Santos.
Exercise 8
Create a map using leaflet and add lines for each segment based on the list of data frames created previously.
Exercise 9
Create a list of data frames to store road routes extracted using the route function from ggmap.
Exercise 10
Create a new map using leaflet that, instead of showing straight lines from origin to destination, shows road routes.




Experimental Design Exercises

In this set of exercises we shall follow the practice of conducting an experimental study. A researcher wants to see if there is any influence of working out on body mass. Three groups of subjects with similar food and sport habits were included in the experiment. Each group was subjected to a different set of exercises. Body mass was measured before and after the workout. The focus of the research is the difference in body mass between groups, measured after working out. In order to examine these effects, we shall use the paired t-test, the t-test for independent samples, one-way and two-way analysis of variance, and analysis of covariance.

You can download the dataset here. The data is fictitious.
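
Since the file’s exact column names are not given here, the self-contained sketch below uses a simulated data frame (hypothetical columns group, age, initial and final) only to show which R functions correspond to the tests used in these exercises:

set.seed(1)
workout <- data.frame(group   = factor(rep(c("A", "B", "C"), each = 20)),
                      age     = factor(sample(c("younger", "older"), 60, replace = TRUE)),
                      initial = rnorm(60, mean = 80, sd = 8))
workout$final <- workout$initial - rnorm(60, mean = 2, sd = 1.5)

shapiro.test(workout$final)                            # normality of the final measurement
t.test(workout$initial, workout$final, paired = TRUE)  # paired t-test: effect of the workout
one_way <- aov(final ~ group, data = workout)
summary(one_way)                                       # one-way ANOVA across groups
TukeyHSD(one_way)                                      # post-hoc pairwise comparisons
summary(aov(final ~ group * age, data = workout))      # two-way ANOVA with age
summary(aov(final ~ initial + group, data = workout))  # ANCOVA controlling for the initial value
# Levene's test (Exercise 3) is available as leveneTest(final ~ group, data = workout) in the car package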

Answers to the exercises are available here.

If you have a different solution, feel free to post it.

Exercise 1

Load the data. Calculate descriptive statistics and test for the normality of both initial and final measurements, for the whole sample and for each group.

Exercise 2

Is there an effect of the exercises, and what is the size of that effect for each group? (Tip: You should use the paired t-test.)

Exercise 3

Is the variance of body mass at the final measurement the same for each of the three groups? (Tip: Use Levene’s test for homogeneity of variances.)

Exercise 4

Is there a difference between groups at the final measurement, and what is the effect size? (Tip: Use one-way ANOVA.)


Exercise 5

Between which groups does the difference in body mass appear after working out? (Tip: Conduct a post-hoc test.)

Exercise 6

What is the impact of age and the workout program on body mass at the final measurement? (Tip: Use two-way between-groups ANOVA.)

Exercise 7

What is the origin of the effect of the workout program between subjects of different ages? (Tip: You should conduct a post-hoc test.)

Exercise 8

Is there a linear relationship between the initial and final measurements of body mass for each group?

Exercise 9

Is there a significant difference in body mass at the final measurement between groups, while controlling for the initial measurement?

Exercise 10

How much of the variance is explained by the independent variable? How much of the variance is explained by the covariate?




Data science for Doctors: Cluster Analysis Exercises


Data science enhances people’s decision making. Doctors and researchers are making critical decisions every day. Therefore, it is absolutely necessary for them to have some basic knowledge of data science. This series aims to help people in and around the medical field enhance their data science skills.

We will work with a health-related database, the famous “Pima Indians Diabetes Database”. It was generously donated by Vincent Sigillito from Johns Hopkins University. Please find further information regarding the dataset here.

This is the ninth part of the series, and it aims to cover the very basics of cluster analysis.
In my opinion, it is necessary for researchers to know how to discover relationships between patients and diseases. Therefore, in this set of exercises we will go through the basics of relationship discovery via cluster analysis. In particular, we will use hierarchical clustering and centroid-based clustering (k-means clustering and k-median clustering).

Before proceeding, it might be helpful to look over the help pages for ggplot, geom_point, dist, hclust, cutree, stats::rect.hclust, multiplot, kmeans, and kGmedian.

Moreover, please load the following libraries.
install.packages("ggplot2")
library(ggplot2)
install.packages("Gmedian")
library(Gmedian)

Please run the code below in order to load the data set and transform it into a proper data frame format:

url <- "https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data"
data <- read.table(url, fileEncoding="UTF-8", sep=",")
names <- c('preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class')
colnames(data) <- names
data <- data[-which(data$mass ==0),]

Answers to the exercises are available here.

If you obtained a different (correct) answer than those listed on the solutions page, please feel free to post your answer as a comment on that page.

Exercise 1

Construct a scatterplot with the x-axis being the mass variable and the y-axis being the age variable. Moreover, determine the colour of the points based on the class of the candidate (0 or 1).

Exercise 2

Create a distance matrix for the data.

Exercise 3

Run a hierarchical clustering analysis using the single linkage method. Then create an object that contains only two clusters.

Exercise 4

Run a hierarchical clustering analysis using the complete linkage method (the default). Then create an object that contains only two clusters.

Exercise 5

Construct the trees produced by exercises 3 and 4 and draw the two clusters on those plots.
hint: rect.hclust
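
As a generic illustration of the functions listed above (shown on the built-in iris data, not the Pima set, and not meant as the exercise solution):

library(ggplot2)
X <- iris[, c("Sepal.Length", "Petal.Length")]   # toy numeric data

d   <- dist(X)                        # distance matrix
hc  <- hclust(d, method = "complete")
plot(hc)                              # dendrogram
grp <- cutree(hc, k = 2)              # cut the tree into two clusters
rect.hclust(hc, k = 2)                # draw the two clusters on the dendrogram

km <- kmeans(X, centers = 2)          # k-means with k = 2
ggplot(X, aes(Sepal.Length, Petal.Length, colour = factor(km$cluster))) + geom_point()
# the kGmedian function from the Gmedian package provides the k-median analogue of kmeans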


Exercise 6

Construct two scatterplots with the x-axis being the mass variable and the y-axis being the age variable. Moreover, determine the colour of the points based on the cluster that each point belongs to. Each scatterplot corresponds to a different clustering method.

If possible, display those scatterplots (one at a time) next to the plot of exercise 1, to see whether the clustering can discriminate the positively classified from the negatively classified patients. If you don’t do that, you can find it in the solutions section; I highly encourage you to check it out.

Exercise 7

Run the following in order to create dummy variables: data_mat <- model.matrix(~.+0, data = data).
Run a centroid-based cluster analysis using the k-means method with k equal to 2. Apply the k-means clustering to the data_mat matrix.

Exercise 8

Construct a scatterplot with the x-axis being the mass variable and the y-axis being the age variable. Moreover, determine the colour of the points based on the cluster (retrieved from the k-means method) that each point belongs to.

If possible, display this scatterplot next to the plot of exercise 1.

Exercise 9

Run a centroid-based cluster analysis using the k-median method with k equal to 2. Apply the k-median clustering to the data_mat matrix.

Exercise 10

Construct a scatterplot with the x-axis being the mass variable and the y-axis being the age variable. Moreover, determine the colour of the points based on the cluster (retrieved from the k-median method) that each point belongs to.

If possible, display this scatterplot next to the plot of exercise 1.




User Defined Functions in R Exercises (Part 1)

In these exercises, we will discuss user-defined functions in R.
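
As a quick, generic reminder of the syntax before you start (not tied to any particular exercise): a user-defined function is created with function(), and the value of its last evaluated expression is returned unless return() is used explicitly.

square <- function(x) {
  x^2                       # the last expression is the return value
}
square(4)                   # 16

pow <- function(x, p = 2) { # p has a default value
  x^p
}
pow(3)                      # 9
pow(3, 3)                   # 27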

Answers to the exercises are available here.

Exercise 1

Create a function to print the square of a number.

Exercise 2

Create a function to print a number raised to the power of another, with one of the two arguments having a default value.

Exercise 3

Create a function to print the class of an argument.

Exercise 4

Create a function that accepts two matrix arguments and performs matrix operations with them.

Exercise 5

Create a user-defined function to accept a name from the user.

Exercise 6

Create a user-defined function to accept values from the user using scan and return them.

Exercise 7

Create a user-defined function to create a matrix and return it.

Exercise 8

Create a function that takes two arguments, one with student marks and the other with student names, and plots a graph based on them.

Exercise 9

Create a function that accepts an employee data frame (Name, Gender, Age, Designation & SSN) and prints the first and fifth employees, as well as the names and designations of all the employees.

Exercise 10

Create a function that creates an employee data frame (Name, Gender, Age, Designation & SSN) and returns the Name, Age & Designation of all employees.




Forecasting: Exponential Smoothing Exercises (Part-3)

Exponential smoothing is a method of finding patterns in time series, which can be used to make forecasts. In its simple form, exponential smoothing is a weighted moving average: each smoothed value is a weighted average of all past time series values (with weights decreasing exponentially from the most recent to the oldest values). In more complicated forms, exponential smoothing is applied to a time series recursively to allow for a trend and seasonality. In that case, the model is said to consist of three components – error, trend, and seasonality, from which another notation for exponential smoothing (“ETS”) is derived.
This set of exercises focuses primarily on the ets function from the forecast package. The function can be used to apply various exponential smoothing methods (including Holt’s and Holt-Winters’ methods), and allows for both automatic and manual selection of the model structure (for example, whether the model includes trend and seasonal components). The exercises are based on the monthly data on the US civilian unemployment rate as a percentage of the labor force for 2012-2017, retrieved from FRED, the Federal Reserve Bank of St. Louis database (download here).
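
As a minimal, generic sketch of the ets workflow (shown on the built-in AirPassengers series rather than the unemployment data):

library(forecast)
fit <- ets(AirPassengers)     # automatic selection of the error/trend/seasonality structure
summary(fit)                  # the selected structure is printed as ETS(error, trend, seasonality)
fc  <- forecast(fit, h = 12)  # forecast the next 12 months
plot(fc)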

For other parts of the series follow the tag forecasting.
Answers to the exercises are available here.

Exercise 1
Load the data, transform it to the ts type (indicating that the data is monthly and the first period is January 2012), and plot it.

Exercise 2
Use the ses function from the forecast package to get a forecast based on simple exponential smoothing for the next 12 months, and plot the forecast.

Exercise 3
Estimate an exponential smoothing model using the ets function with default parameters. Then pass the model as input to the forecast function to get a forecast for the next 12 months, and plot the forecast (both functions are from the forecast package).

Exercise 4
Print a summary of the model estimated in the previous exercise, and find the automatically estimated structure of the model. Does it include trend and seasonal components? If those components are present, are they additive or multiplicative?

Exercise 5
Use the ets function to estimate an exponential smoothing model with a damped trend. Make a forecast based on the model for the next 12 months, and plot it.


Exercise 6
Use the ets function to estimate another model that does not include a trend component. Make a forecast based on the model for the next 12 months, and plot it.

Exercise 7
Find a function in the forecast package that estimates the BATS model (exponential smoothing state space model with Box-Cox transformation, ARMA errors, trend and seasonal components). Use it to estimate the model with a damped trend, and make a forecast. Plot the forecast.

Exercise 8
Use the accuracy function from the forecast package to get a matrix of accuracy measures for the forecast obtained in the previous exercise. Explore the structure of the matrix, and save a measure of the mean absolute error (MAE) in a variable.

Exercise 9
Write a function that inputs a time series and a list of model estimation functions, calculates forecasts for the next 12 periods using each of the functions (with default parameters), and outputs the forecast with the smallest mean absolute error.
Run the function using the unemployment time series and a list of functions that includes ets, bats, and auto.arima. Plot the obtained result.

Exercise 10
Modify the function written in the previous exercise so that it prints the mean absolute error for each forecasting model along with the name of that model (the name can be retrieved from the forecast object).




Data science for Doctors: Inferential Statistics Exercises (Part-5)


Data science enhances people’s decision making. Doctors and researchers are making critical decisions every day. Therefore, it is absolutely necessary for them to have some basic knowledge of data science. This series aims to help people in and around the medical field enhance their data science skills.

We will work with a health-related database, the famous “Pima Indians Diabetes Database”. It was generously donated by Vincent Sigillito from Johns Hopkins University. Please find further information regarding the dataset here.

This is the eighth part of the series, and it aims to partially cover the subject of inferential statistics.
Researchers rarely have the capability of testing many patients, or of experimenting with a new treatment on many patients; therefore, making inferences from a sample is a necessary skill to have. This is where inferential statistics comes into play.
In more detail, in this part we will go through hypothesis tests for the normality of distributions (Shapiro–Wilk test, Anderson–Darling test) and for the existence of outliers (Grubbs’ test for outliers). We will also cover the case where the normality assumption doesn’t hold and how to deal with it (rank tests). Finally, we will do a brief recap of the previous exercises on inferential statistics.

Before proceeding, it might be helpful to look over the help pages for hist, qqnorm, qqline, shapiro.test, ad.test, grubbs.test, and wilcox.test.

Moreover, please load the following libraries.
install.packages("ggplot2")
library(ggplot2)
install.packages("nortest")
library(nortest)
install.packages("outliers")
library(outliers)

Please run the code below in order to load the data set and transform it into a proper data frame format:

url <- "https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data"
data <- read.table(url, fileEncoding="UTF-8", sep=",")
names <- c('preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class')
colnames(data) <- names
data <- data[-which(data$mass ==0),]

Moreover, run the chunk below in order to generate the samples that we will test in this set of exercises.
f_1 <- rnorm(28,29,3)
f_2 <- rnorm(23,29,6)
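
As a generic illustration of the tests listed above, applied to a simulated sample rather than the pres variable:

set.seed(1)
x <- rnorm(40, mean = 70, sd = 10)   # simulated sample for demonstration only
hist(x)
qqnorm(x); qqline(x)                 # visual normality checks
shapiro.test(x)                      # Shapiro-Wilk test
ad.test(x)                           # Anderson-Darling test (nortest package)
grubbs.test(x)                       # Grubbs' test for a single outlier (outliers package)
grubbs.test(x, two.sided = TRUE)     # two-sided version
wilcox.test(x, mu = 70)              # rank-based test when normality is in doubt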

Answers to the exercises are available here.

If you obtained a different (correct) answer than those listed on the solutions page, please feel free to post your answer as a comment on that page.

Exercise 1

Plot a histogram of the variable pres.

Exercise 2

Plot the QQ-plot with a QQ-line for the variable pres.

Exercise 3

Apply a Shapiro-Wilk normality test for the variable pres.

Exercise 4

Apply an Anderson-Darling normality test for the variable pres.

Exercise 5

What is the percentage of the data that passes a normality test?
This might be a bit challenging; consider using the apply function.


Exercise 6

Construct a boxplot of pres and see whether there are outliers or not.

Exercise 7

Apply Grubbs’ test on the pres variable to see whether it contains outlier values.

Exercise 8

Apply a two-sided Grubbs’ test on the pres variable to see whether it contains outlier values.

Exercise 9

Suppose we test a new diet on a sample of 14 people from the candidates (take a random sample from the set), and after the diet the average mass was 29 with a standard deviation of 4 (generate 14 normally distributed samples with the properties mentioned before). Apply the Wilcoxon signed rank test to the mass variable before and after the diet.

Exercise 10

Check whether the positive and negative candidates have the same distribution for the pres variable. In order to check that, apply a Wilcoxon rank sum test to the pres variable with respect to the class.fac variable.




Forecasting: Linear Trend and ARIMA Models Exercises (Part-2)

There are two main approaches to time series forecasting. One of them is to find persistent patterns in a time series itself and extrapolate those patterns. Another approach is to discover how a series depends on other variables, which serve as predictors.
This set of exercises focuses on the first approach, while the second one will be considered in a later set. The present set allows you to practice applying three forecasting models:
– a naive model, which provides probably the simplest forecasting technique, but still can be useful as a benchmark for evaluating other methods,
– a linear trend model (based on a simple linear regression),
– the ARIMA model, a more sophisticated and popular model, which assumes a linear dependence of a time series on its past values and random shocks.
The exercises do not require a deep understanding of the underlying theories, and make use of the automatic model estimation functions included in the forecast package. The set also helps you practice retrieving useful data from forecasts (confidence intervals, forecast errors), and comparing the predictive accuracy of different models. The exercises are based on data on e-commerce retail sales in the USA for 1999-2016, retrieved from FRED, the Federal Reserve Bank of St. Louis database (download here). The data represent quarterly sales volumes in millions of dollars.
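
As a generic sketch of the three approaches (using the built-in AirPassengers series as a stand-in, not the e-commerce data):

library(forecast)
y <- AirPassengers                          # any ts object works here

naive_fc  <- naive(y, h = 8)                # naive forecast: repeats the last observed value
lm_fit    <- tslm(y ~ trend + season)       # linear model with trend and seasonal dummies
lm_fc     <- forecast(lm_fit, h = 8)
arima_fit <- auto.arima(y)                  # automatic ARIMA selection
arima_fc  <- forecast(arima_fit, h = 8)

plot(arima_fc)
residuals(lm_fc)                            # forecast errors (residuals) of the linear model
accuracy(arima_fc)                          # accuracy measures such as MAE and RMSE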

For other parts of the series follow the tag forecasting.

Answers to the exercises are available here.

Exercise 1
Read the data from the file, and transform it into a time series (ts) object (given that the data is quarterly and the starting period is the fourth quarter of 1999).
Plot the obtained series.

Exercise 2
Make a naive forecast for the next 8 periods using the appropriate function from the forecast package (i.e. create an object of class forecast using the function that implements the naive method of forecasting). Note that this method sets all forecast values equal to the last known time series value.

Exercise 3
Plot the forecast values.

Exercise 4
Make a forecast for the next 8 periods based on a linear model in two steps:
(1) create a linear regression model for the forecast using the tslm function from the forecast package (use the series as the dependent variable, trend and season as independent variables),
(2) make a forecast based on the model using the forecast function from the same package.
Plot the forecast.

Exercise 5
Retrieve forecast errors (residuals) from the linear model based forecast and save them as a separate variable.

Learn more about Forecasting in the online course Time Series Analysis and Forecasting in R. In this course you will learn how to:

  • A complete introduction on Forecasting
  • Work thru an exponentional smoothing instruction
  • And much more

Exercise 6
Make a forecast for the next 8 periods based on the ARIMA model in two steps:
(1) create a model using the auto.arima function from the forecast package,
(2) make a forecast based on the model using the forecast function from the same package.
Plot the forecast.

Exercise 7
Print the summary of the forecast based on the ARIMA model.

Exercise 8
Explore the structure of the forecast summary. Find the forecast value for the last period, and its 95% confidence interval values.

Exercise 9
Retrieve forecast errors (residuals) from the ARIMA based forecast.

Exercise 10
Use the errors from the ARIMA based forecast and the errors from the linear model based forecast to compare predictive accuracy of the two models with the Diebold-Mariano test (implemented as a function in the forecast package). Test the hypothesis that the ARIMA based forecast is more accurate than the linear model based forecast.
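
A hedged sketch of the Diebold-Mariano test call, using purely illustrative residual vectors in place of the two forecasts’ errors:

library(forecast)
set.seed(1)
e_lm    <- rnorm(40, sd = 2.0)    # stand-in residuals of the linear model based forecast
e_arima <- rnorm(40, sd = 1.5)    # stand-in residuals of the ARIMA based forecast
dm.test(e_lm, e_arima, h = 1)                            # H0: both forecasts are equally accurate
dm.test(e_lm, e_arima, alternative = "greater", h = 1)   # one-sided test; see ?dm.test for which
                                                         # method each alternative refers to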




Data Science for Operational Excellence (Part-2)

Network problems are everywhere. We can easily find instances in logistics, telecom, and project management, among others. In order to attack these problems using linear programming, we need to go beyond the assignment and transportation problems that we saw in part I. Our goal here is to expand the set of problems we can solve using the lpSolve and igraph R packages. For that, we will formulate the transportation problem using the more generic lp function.
Answers to the exercises are available here. If you obtained a different (correct) answer than those listed on the solutions page, please feel free to post your answer as a comment on that page.
Please install and load the packages lpSolve and igraph before starting the exercise.


Exercise 1
Load the libraries lpSolve and igraph, and learn how to use the lp function by replicating the example in ?lp.

Exercise 2
We want to rewrite the transportation problem stated in part I, but this time using the lp function. First, load the data used in the part I exercise.

Exercise 3
Run this transportation problem one more time using lp.transport. Check the objective function and decision variable values.

Exercise 4
Rearrange the data into the format required by the objective function of the lp function.

Exercise 5
Construct the binary matrix that models the DEMAND constraints. Observe that the variables should be in the same order as defined in the objective function.


Exercise 6
Forget about the OFFER constraints for a while. Run the lp function with only the DEMAND constraints and see what happens.

Exercise 7
Construct the binary matrix that models the OFFER constraints. Observe that the variables should be in the same order as defined in the objective function.

Exercise 8
Bind both the offer and demand constraints to construct the matrix f.con and the vectors f.dir and f.rhs.

Exercise 9
Now, solve the problem using the lp function. Find the solution for each variable and the objective function value. Check if they match the values from lp.transport.

Exercise 10
Rename the rows to represent factories and the columns to represent depots. Create a graph using graph_from_incidence_matrix based on the decision variables’ optimum values.
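
A small, illustrative sketch of graph_from_incidence_matrix on a made-up factory-by-depot matrix of shipped quantities (not the exercise’s optimum values):

library(igraph)
ship <- matrix(c(10, 0, 5,
                 0, 8, 0),
               nrow = 2, byrow = TRUE,
               dimnames = list(c("Factory1", "Factory2"),
                               c("DepotA", "DepotB", "DepotC")))
g <- graph_from_incidence_matrix(ship, weighted = TRUE)   # bipartite graph; edges where quantity > 0
plot(g, edge.label = E(g)$weight)                         # label edges with the shipped amounts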