Data Science for Doctors – Part 4: Inferential Statistics (1/5)

Data science enhances people’s decision making. Doctors and researchers make critical decisions every day, so it is essential for them to have some basic knowledge of data science. This series aims to help people in and around the medical field to sharpen their data science skills.

We will work with a health-related database, the famous “Pima Indians Diabetes Database”. It was generously donated by Vincent Sigillito from Johns Hopkins University. Please find further information regarding the dataset here.

This is the fourth part of the series, and it partially covers the subject of inferential statistics. Researchers rarely have the capability of testing many patients, or of trying a new treatment on many patients, so making inferences from a sample is a necessary skill to have. This is where inferential statistics comes into play.

Before proceeding, it might be helpful to look over the help pages for sample, mean, sd, sort, and pnorm. Moreover, it is crucial to be familiar with the Central Limit Theorem.

You may also need to install and load the moments library:
install.packages("moments")
library(moments)

Please run the code below in order to load the data set and transform it into a proper data frame format:

url <- "https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data"
data <- read.table(url, fileEncoding="UTF-8", sep=",")    # read the CSV from the UCI repository
names <- c('preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class')
colnames(data) <- names                                   # assign descriptive column names
data <- data[-which(data$mass == 0),]                     # drop rows with an impossible mass of 0

Answers to the exercises are available here.

If you obtained a different (correct) answer than those listed on the solutions page, please feel free to post your answer as a comment on that page.

Exercise 1

Generate a sampling distribution (10,000 iterations) of sample means, with sample size 50, for the variable mass.

You are encouraged to experiment with different sample sizes and iteration counts in order to see the impact they have on the distribution (standard deviation, skewness, and kurtosis). Moreover, you can plot the distributions to get a better sense of what you are working with; a sketch of one possible approach follows.
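A hedged sketch (the names n, iters, and samp_dist are illustrative, not required by the exercise):

n <- 50                                      # sample size
iters <- 10000                               # number of iterations
samp_dist <- replicate(iters, mean(sample(data$mass, n)))
hist(samp_dist, main = "Sampling distribution of mean mass")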

Exercise 2

Find the mean and standard error (standard deviation) of the sampling distribution.

You are encouraged to compare these with the values from the original distribution (data$mass) in order to see how the mean and standard deviation of the sampling distribution are derived, as well as the effect that the sample size has on the distribution.
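A minimal sketch of the comparison, assuming samp_dist from the sketch above; by the Central Limit Theorem the empirical standard error should be close to sd(data$mass)/sqrt(50):

mean(samp_dist); sd(samp_dist)                # empirical mean and standard error
mean(data$mass); sd(data$mass) / sqrt(50)     # theoretical counterparts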

Exercise 3

Find the skewness and kurtosis of the distribution you generated before.

Exercise 4

Suppose that we ran an experiment: we took a sample of size 50 from the population and they followed an organic food diet. Their average mass was 30.5. What is the Z-score for a mean of 30.5?

Exercise 5

What is the probability of drawing a sample of 50 with mean less than 30.5? Use the z-table if you feel you need to.
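A hedged sketch covering Exercises 4 and 5, treating data$mass as the population:

se <- sd(data$mass) / sqrt(50)        # standard error for samples of size 50
z  <- (30.5 - mean(data$mass)) / se   # Z-score of the observed sample mean
pnorm(z)                              # P(sample mean < 30.5)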

Exercise 6

Suppose that you repeated the experiment, this time with a larger sample of size 150, and you found the average mass to be 31. Compute the Z-score for this mean.

Exercise 7

What is the probability of drawing a sample of 150 with mean less than 31?

Exercise 8

Suppose everybody adopted the diet from the experiment. Find the margin of error covering 95% of the sample means.
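A sketch of the margin of error at the 95% level, assuming samples of size 50 as in Exercise 4; qnorm(0.975) returns the familiar critical value of about 1.96:

se <- sd(data$mass) / sqrt(50)
me <- qnorm(0.975) * se               # margin of error covering 95% of sample means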

Exercise 9

What would be our interval estimate that is 95% likely to contain the population mean, if everyone in our population adopted the organic diet?

Exercise 10

Find the interval estimates for the 98% and 99% likelihood levels.




Data Science for Doctors – Part 2: Descriptive Statistics

Data science enhances people’s decision making. Doctors and researchers make critical decisions every day, so it is essential for them to have some basic knowledge of data science. This series aims to help people in and around the medical field to sharpen their data science skills.

We will work with a health-related database, the famous “Pima Indians Diabetes Database”. It was generously donated by Vincent Sigillito from Johns Hopkins University. Please find further information regarding the dataset here.

This is the second part of the series; it contains the main descriptive statistics measures you will use most of the time. Those measures are divided into measures of central tendency and measures of spread. Most of the exercises can be solved with built-in functions, but I would encourage you to solve them “by hand”, because once you know the mechanics of the measures, you will be far more confident in using them. On the “solutions” page, I provide both methods, so even if you didn’t solve them by hand, it is worth checking them out.

Before proceeding, it might be helpful to look over the help pages for mean, median, sort, unique, tabulate, sd, var, IQR, mad, abs, cov, cor, summary, str, and rcorr.

You may also need to install and load the Hmisc library:
install.packages('Hmisc')
library(Hmisc)

In case you haven’t solved Part 1, run the following script to load the prerequisites for this part.
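The script below mirrors the data-loading code used elsewhere in this series:

url <- "https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data"
data <- read.table(url, fileEncoding="UTF-8", sep=",")    # read the CSV from the UCI repository
names <- c('preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class')
colnames(data) <- names                                   # assign descriptive column names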

Answers to the exercises are available here.

If you obtained a different (correct) answer than those listed on the solutions page, please feel free to post your answer as a comment on that page.

Exercise 1

Find the mean of the mass variable.

Exercise 2

Find the median of the mass variable.

Exercise 3

Find the mode of the mass variable.
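R has no built-in mode function for data, so you have to build one; a common idiom, sketched from the unique(), match() and tabulate() functions mentioned in the introduction:

ux <- unique(data$mass)                        # the distinct values
ux[which.max(tabulate(match(data$mass, ux)))]  # the most frequent one (first, in case of ties)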

Exercise 4

Find the standard deviation of the age variable.

Learn more about descriptive statistics in the online courses Learn by Example: Statistics and Data Science in R (including 8 lectures specifically on descriptive statistics), and Introduction to R.

Exercise 5

Find the variance of the mass variable.

Unlike the popular mean/standard deviation combination, the interquartile range and the median/mean absolute deviation are not sensitive to the presence of outliers. Of these, the MAD is often preferred because it can be used to approximate the standard deviation.
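A quick hedged illustration with simulated data (the numbers are arbitrary): R’s mad() is scaled by 1.4826 by default, so for roughly normal data it approximates the standard deviation even after outliers are added.

set.seed(1)
x <- rnorm(1000, mean = 33, sd = 7)   # roughly normal, no outliers
sd(x); mad(x)                         # very close to each other
x_out <- c(x, rep(150, 10))           # contaminate with a few outliers
sd(x_out); mad(x_out)                 # sd inflates, mad barely moves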

Exercise 6

Find the interquartile range of the age variable.

Exercise 7

Find the median absolute deviation of the age variable. Assume that age follows a normal distribution.

Exercise 8

Find the covariance of the variables age and mass.

Exercise 9

Find the Spearman and Pearson correlations of the variables age and mass.

Exercise 10

Print the summary statistics and the structure of the data set. Moreover, construct the correlation matrix of the data set.




Multiple Regression (Part 2) – Diagnostics

Multiple Regression is one of the most widely used methods in statistical modelling. However, despite its many benefits, it is oftentimes used without checking the underlying assumptions. This can lead to results which can be misleading or even completely wrong. Therefore, applying diagnostics to detect any strong violations of the assumptions is important. In the exercises below we cover some material on multiple regression diagnostics in R.

Answers to the exercises are available here.

If you obtain a different (correct) answer than those listed on the solutions page, please feel free to post your answer as a comment on that page.

Multiple Regression (Part 1) can be found here.

We will be using the dataset state.x77, which is part of the state datasets available in R. (Additional information about the dataset can be obtained by running help(state.x77).)

Exercise 1
a. Load the state datasets.
b. Convert the state.x77 dataset to a dataframe.
c. Rename the Life Exp variable to Life.Exp, and HS Grad to HS.Grad. (This avoids problems with referring to these variables when specifying a model.)
d. Produce the correlation matrix.
e. Create a scatterplot matrix for the variables Life.Exp, HS.Grad, Murder, and Frost.

Exercise 2
a. Fit the model with Life.Exp as dependent variable, and HS.Grad and Murder as predictors.
b. Obtain the residuals.
c. Obtain the fitted values.

Exercise 3
a. Create a residual plot (residuals vs. fitted values).
b. Create the same residual plot using the plot command on the lm object from Exercise 2.
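A minimal sketch of both versions, made self-contained for convenience (the model matches Exercise 2):

st <- as.data.frame(state.x77)
names(st) <- make.names(names(st))    # "Life Exp" -> "Life.Exp", "HS Grad" -> "HS.Grad"
fit <- lm(Life.Exp ~ HS.Grad + Murder, data = st)
plot(fitted(fit), resid(fit), xlab = "Fitted values", ylab = "Residuals")  # by hand
abline(h = 0, lty = 2)
plot(fit, which = 1)                  # the same plot via plot.lm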

Learn more about multiple linear regression in the online courses Linear regression in R for Data Scientists, Statistics with R – advanced level, and Linear Regression and Modeling.

Exercise 4
Create plots of the residuals vs. each of the predictor variables.

Exercise 5
a. Create a Normality plot.
b. Create the same plot using the plot command on the lm object from Exercise 2.

Exercise 6
a. Obtain the studentized residuals.
b. Do there appear to be any outliers?

Exercise 7
a. Obtain the leverage value for each observation and plot them.
b. Obtain the conventional threshold for leverage values. Are any observations influential?
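A hedged sketch, assuming fit and st from the sketch above; the cutoff shown is the common 2p/n rule of thumb, one of several conventions:

lev <- hatvalues(fit)                 # leverage = diagonal of the hat matrix
plot(lev, type = "h")
p <- length(coef(fit)); n <- nrow(st)
abline(h = 2 * p / n, lty = 2)        # rule-of-thumb threshold (assumed convention)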

Exercise 8
a. Obtain DFFITS values.
b. Obtain the conventional threshold. Are any observations influential?
c. Obtain DFBETAS values.
d. Obtain the conventional threshold. Are any observations influential?

Exercise 9
a. Obtain Cook’s distance values and plot them.
b. Obtain the same plot using the plot command on the lm object from Exercise 2.
c. Obtain the threshold value. Are any observations influential?

Exercise 10
Create the Influence Plot using a function from the car package.




Multiple Regression (Part 1)

In the exercises below we cover some material on multiple regression in R.

Answers to the exercises are available here.

If you obtained a different (correct) answer than those listed on the solutions page, please feel free to post your answer as a comment on that page.

We will be using the dataset state.x77, which is part of the state datasets available in R. (Additional information about the dataset can be obtained by running help(state.x77).)

Exercise 1

a. Load the state datasets.
b. Convert the state.x77 dataset to a dataframe.
c. Rename the Life Exp variable to Life.Exp, and HS Grad to HS.Grad. (This avoids problems with referring to these variables when specifying a model.)
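A minimal sketch of these steps in base R (the name st is illustrative):

data(state)                                        # a. load the state datasets
st <- as.data.frame(state.x77)                     # b. matrix -> data frame
names(st)[names(st) == "Life Exp"] <- "Life.Exp"   # c. rename the two variables
names(st)[names(st) == "HS Grad"]  <- "HS.Grad"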

Exercise 2
Suppose we wanted to enter all the variables in a first-order linear regression model with Life Expectancy as the dependent variable. Fit this model.

Exercise 3

Suppose we wanted to remove the Income, Illiteracy, and Area variables from the model in Exercise 2. Use the update function to fit this model.

Learn more about multiple linear regression in the online course Linear regression in R for Data Scientists. In this course you will learn how to:

  • Model basic and complex real world problem using linear regression
  • Understand when models are performing poorly and correct it
  • Design complex models for hierarchical data
  • And much more

Exercise 4
Let’s assume that we have settled on a model that has HS.Grad and Murder as predictors. Fit this model.

Exercise 5
Add an interaction term to the model in Exercise 4 (3 different ways).
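For reference, R’s formula interface offers several equivalent spellings of an interaction model; a hedged sketch, assuming st from the sketch in Exercise 1:

fit <- lm(Life.Exp ~ HS.Grad + Murder, data = st)                   # the Exercise 4 model
m1  <- update(fit, . ~ . + HS.Grad:Murder)                          # way 1: update()
m2  <- lm(Life.Exp ~ HS.Grad + Murder + HS.Grad:Murder, data = st) # way 2: explicit term
m3  <- lm(Life.Exp ~ HS.Grad * Murder, data = st)                   # way 3: * expands to main effects + interaction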

Exercise 6
For this and the remaining exercises in this set we will use the model from Exercise 4.

Obtain 95% confidence intervals for the coefficients of the two predictor variables.

Exercise 7
Predict the Life Expectancy for a state where 55% of the population are High School graduates, and the murder rate is 8 per 100,000.

Exercise 8

Obtain a 98% confidence interval for the mean Life Expectancy in a state where 55% of the population are High School graduates, and the murder rate is 8 per 100,000.

Exercise 9

Obtain a 98% confidence interval for the Life Expectancy of a person living in a state where 55% of the population are High School graduates, and the murder rate is 8 per 100,000.
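The difference between Exercises 8 and 9 is the interval type passed to predict.lm; a hedged sketch, assuming fit from the sketch after Exercise 5:

new_state <- data.frame(HS.Grad = 55, Murder = 8)
predict(fit, new_state, interval = "confidence", level = 0.98)  # mean response (Exercise 8)
predict(fit, new_state, interval = "prediction", level = 0.98)  # single observation (Exercise 9)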

Exercise 10

Since our model only has two predictor variables, we can generate a 3D plot of our data and the fitted regression plane. Create this plot.




Intermediate Tree 2

This is a continuation of the intermediate decision tree exercise.

Answers to the exercises are available here.

If you obtained a different (correct) answer than those listed on the solutions page, please feel free to post your answer as a comment on that page.

 

Exercise 1
Use the predict() command to make predictions on the Train data. Set the type to “class”; this returns class labels instead of probability scores. Store this prediction in pred_dec.

Exercise 2
Print out the confusion matrix.

Exercise 3
What is the accuracy of the model? Use the confusion matrix.

Exercise 4
What is the misclassification error rate? Refer to the Basic_decision_tree exercise to get the formula.

Exercise 5
Let’s say we want a baseline model against which to compare our prediction improvement. We create a base model using this code:

length(Test$class)                  # number of Test observations (3183 here)
base <- rep(1, length(Test$class))  # baseline: predict class 1 for every observation

Use the table() command to create a confusion matrix between base and Test$class.


Exercise 6
What is the difference in confusion-matrix accuracy between the dec model and the base model?

Exercise 7

Remember the predict() command from Exercise 1. We will use the same model and the same command, except we will set the type to “prob”, which gives us probability estimates. Store this in pred_dec_reg.

Exercise 8
Load the ROCR package.

Use the prediction(), performance() and plot() commands to print the ROC curve. Use the pred_dec_reg variable from Exercise 7. You can also refer to the previous exercise set to see the code.

Exercise 9
plot() the same ROC curve, but set colorize=TRUE.

Exercise 10
Comment on your findings using the ROC curve and accuracy. Is it a good model? Did you notice that the ROCR prediction() command only takes probability predictions as one of its arguments? Why is that so?




ROC curves

Every dataset comes with unique problems, but the essence of it all is to understand how to evaluate models. Today we will briefly touch on ROC curves.

Answers to the exercises are available here.

If you obtained a different (correct) answer than those listed on the solutions page, please feel free to post your answer as a comment on that page.

Learn more about evaluating the performance of classification models in the online course Machine Learning A-Z™: Hands-On Python & R In Data Science. It includes 5 lectures (32 mins in total) on false positives/false negatives, the confusion matrix, accuracy paradox, and CAP curve analysis.

Exercise 1

ROC is an interesting topic. The only way to learn ROC fast is to look at different models and see how they differ in their ROC curves. First we are going to load the ChickWeight dataset, and then we are going to run different models to compare the change in the ROC curve. If you do not have any of the packages below, use the install.packages() command to install them. Note the steps as well. We are essentially trying to predict whether a chicken is fed diet 1 or not. We are doing a 70:30 split, and then we are using a logistic regression model (glm) to make the prediction.

Write this code in R:


library(ROCR)
library(caTools)
df=ChickWeight
df$Diet=ifelse(df$Diet==1,1,0)                 # 1 if the chick is on diet 1, 0 otherwise
set.seed(100)                                  # make the split reproducible
spl=sample.split(df$Diet,SplitRatio = 0.7)
Train=df[spl==TRUE,]
Test=df[spl==FALSE,]
model=glm(Diet~.,data=Train,family = binomial)
pred=predict(model,newdata = Test,type="response")   # probabilities, not log-odds

Exercise 2

Use the table() command to print the confusion matrix. Hint: use > 0.5 to convert the probabilities into classes. What is the accuracy of the model?

Exercise 3

We want to plot the ROC curve now. Write this code down in R:

ROCRpred1=prediction(pred,Test$Diet)             # standardize predictions for ROCR
ROCRperf1=performance(ROCRpred1,"tpr","fpr")     # true positive rate vs. false positive rate

Now use the plot() command to plot ROCRperf1.

Exercise 4

What are your observations of the curve? Hint: explain in terms of tpr and fpr and the direction the curve follows.

Exercise 5

Create a baseline model:

a. Use the nrow() command to see the length of the Test set. Store this in a variable ‘a’.
b. Use the rep() command to create a list of 1’s whose length equals your answer in a.
c. Store the result of b in the variable rd.

What this essentially does is create a predictor that has roughly a 50/50 chance of getting the answer right, or slightly better or worse depending on the distribution of the Diet variable. In other words, it is equivalent to a model whose accuracy is about 50 percent, or close to random chance. A minimal sketch follows.
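A sketch of steps a–c, assuming the Test split from Exercise 1:

a  <- nrow(Test)    # a. number of rows in the Test set
rd <- rep(1, a)     # b./c. baseline: predict 1 for every observation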

Exercise 6

Using the same code structure as in Exercise 3, make the changes needed for the baseline and plot the baseline curve. Hint: use rd instead of pred.

Exercise 7

What do you observe from the curve? Explain your findings.

Exercise 8
model1 had a high accuracy rate because the chick number was 100 percent correlated with diet. You can see that just from pure observation of the dataset. But what if we created an average model? We can do this by using weight as the only predictor.

Write the code below to create model2


model2=glm(Diet~weight,data=Train,family = binomial)
pred2=predict(model2,newdata = Test,type="response")   # predicted probabilities

Exercise 9
Using the table() function again, create a confusion matrix between Test$Diet and pred2. Note the accuracy.

Exercise 10
Using the same code structure as in Exercise 3, write code to plot the ROC curve of model2, and set add=TRUE as one of the plot() arguments so that the new curve is drawn on top of the baseline. How does it compare with model1 and the baseline model? Explain in terms of tpr and fpr.




Intermediate Tree 1

If you followed the Basic Decision Tree exercise, this set should be useful for you. It is a continuation, but we add much more. We are working with a bigger and badder dataset, and we will also use techniques we learned from model evaluation, working with ROC, accuracy, and other metrics.

Answers to the exercises are available here.

If you obtained a different (correct) answer than those listed on the solutions page, please feel free to post your answer as a comment on that page.

Exercise 1
Read in the adult.csv file with header=FALSE. Store this in df. Use the str() command to inspect the dataframe. Download the data from here.

Exercise 2
You are given the meta_data file that goes with the CSV; you can download it here. Use it to add the column names to your dataframe. Notice that df’s columns are ordered V1, V2, V3, and so on. As a side note, it is always best practice to use the metadata to match and check that all the columns were read in correctly.

Exercise 3
Use the table() command to print out the distribution of the class feature.

Exercise 4
Change the class column to binary.


Exercise 5
Use the cor() command to see the correlation of all the numeric and integer columns, including the class column. Remember that numbers close to 1 mean high correlation and numbers close to 0 mean low correlation. This will give you a rough idea for feature selection; a sketch follows.
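A hedged sketch of one way to restrict cor() to the numeric and integer columns (assuming the class column was already made numeric in Exercise 4):

num_cols <- sapply(df, is.numeric)   # TRUE for numeric and integer columns
round(cor(df[, num_cols]), 2)        # correlation matrix, including class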

Exercise 6
Split the dataset into Train and Test samples. You may use sample.split() with a ratio of 0.7 and set the seed to 1000. Make sure to install and load the caTools package.

Exercise 7
Check the number of rows of Train
Check the number of rows of Test

Exercise 8
We are ready to use a decision tree on our dataset. Load the packages “rpart” and “rpart.plot”. If either is not installed, use the install.packages() command.

Exercise 9
Use rpart to build the decision tree on the Train set. Include all features. Store this model in dec.

Exercise 10
Use the prp() function to plot the decision tree. If you get any error, use this code before the prp() command:

par(mar = rep(2, 4))




Model Evaluation 2

We are committed to bringing you 100% authentic exercise sets. We even try to include as many different datasets as possible to give you an understanding of different problems. No more classifying the Titanic dataset; R has tons of datasets in its library. This is to encourage you to try as many datasets as possible. We will be comparing two models by checking their accuracy, area under the curve (AUC), ROC performance, etc.

It will be helpful to go over Tom Fawcett’s research paper ‘An introduction to ROC analysis’.

Answers to the exercises are available here.

If you obtained a different (correct) answer than those listed on the solutions page, please feel free to post your answer as a comment on that page.

Learn more about evaluating different statistical models in the online courses Linear regression in R for Data Scientists and Structural equation modeling (SEM) with lavaan.

Exercise 1

Run the following code. If you do not have ROCR package installed, you can use install.packages() command to install it.


library(ROCR)
library(caTools)
library(caret)
data("GermanCredit")
df1=GermanCredit
df1$Class=ifelse(df1$Class=="Bad",1,0)          # 1 = bad credit, 0 = good credit
set.seed(100)
spl=sample.split(df1$Class,SplitRatio = 0.7)
Train1=df1[spl==TRUE,]
Test1=df1[spl==FALSE,]
model1=glm(Class~.,data=Train1,family = binomial)
pred1=predict(model1,Test1,type="response")     # predicted probabilities
table(Test1$Class,pred1>0.5)                    # confusion matrix at the 0.5 threshold

Exercise 2

Using the confusion matrix, what is the accuracy of this model?

Exercise 3

Great. Now let’s see the ROC curve of the model. Use the code below, and then use the plot() command to plot ROCRperf1.


ROCRpred1=prediction(pred1,Test1$Class)
ROCRperf1=performance(ROCRpred1,"tpr","fpr")

The plot gives us an idea of the performance of the model. Is this a good or a bad model? State your reasons.

Exercise 4

Use the summary() function on the model to see the summary. Note that the more stars next to a feature, the more statistically significant it is for our target variable.

Exercise 5

Although we found the accuracy of the model in Exercise 2, it is still not the best measure. A better measure is the area under the ROC curve (AUC). AUC takes account of the class distribution and lies in the range 0 to 1, with 1 being the best and 0 the worst. It can also be read as a probability: an AUC of 0.70 means there is a 0.7 chance that the model ranks a randomly chosen positive instance above a randomly chosen negative one.

Run the code below to obtain the AUC. What is the AUC score? Is it better than the accuracy obtained in Exercise 2?

auc= performance(ROCRpred1,measure="auc")   # AUC as an ROCR performance object
auc=auc@y.values[[1]]                       # extract the numeric AUC value

Exercise 6

Now create another model, called model2, that includes the 11 variables with at least one star next to their name. Hint: use the summary() command; the intercept does not count.

Exercise 7

Now predict the target variable on the Test1 sample using model2, and store the result in pred2.

Exercise 8
Use the table() command to get the confusion matrix. Note the accuracy.

Exercise 9
What is the AUC of model2?

Exercise 10
Is model2 better than model1? If so, why?




Basic Tree 2 Exercises


This is a continuation of the exercise Basic Tree 1.

Answers to the exercises are available here.

If you obtained a different (correct) answer than those listed on the solutions page, please feel free to post your answer as a comment on that page.

Learn more about decision trees in the online courses Regression Machine Learning with R and Machine Learning A-Z™: Hands-On Python & R In Data Science.

Exercise 1
Load the tree library. If it is not installed, use the install.packages() command to install it.

Exercise 2
Convert all the features (columns) into factors, including the class column.

Exercise 3
Use the sampling methods you learned from the sample_exercise to split the data into two sets with a SplitRatio of 0.7. Hint: use the caTools library and the sample.split() function. Store the results in Train and Test.

Exercise 4
Use the tree() command to build the model. Use class as the target variable and everything else as predictors. Also, use the Train set as the data source. Store the model in a variable called model1.

Exercise 5
Use the plot() command to plot the model and use the text() command to add the text.

Exercise 6

Use the predict() command to predict the classes on the Test dataset. Store this in the variable pred_class.

Exercise 7

Use the table() command to print the confusion matrix. Hint: you are comparing the class column from the Test set with the predicted vector. This tells you whether the model is getting things right or wrong.

Exercise 8
Use summary() to print the summary of the model, and note the misclassification error rate.

Exercise 9
Now find the misclassification error rate of the model on the Test data, using the formula: mean(Test$class != pred_class).

Exercise 10
Compare the two misclassification error rates and determine which is worse and why. How can we improve the model?




Recursive Partitioning and Regression Trees Exercises


[For this exercise set, we will work with the package rpart. This is a beginner-level exercise set. Please refer to the help pages of the rpart package.]

Answers to the exercises are available here.

Exercise 1

Consider the Kyphosis data frame (type help('kyphosis') for more details), which contains:
- Kyphosis: a factor with levels absent/present, indicating whether a kyphosis (a type of deformation) was present after the operation.
- Age: in months.
- Number: the number of vertebrae involved.
- Start: the number of the first (topmost) vertebra operated on.

1) Build a tree to classify Kyphosis from Age, Number and Start.
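A minimal sketch, essentially the example from the rpart help page:

library(rpart)
fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis)
print(fit)    # text summary of the splits and nodes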

Exercise 2

Consider the tree built in Exercise 1.
1) Which variables are used to explain kyphosis presence?
2) How many observations do the terminal nodes contain?

Exercise 3

Consider the Kyphosis data frame.
1) Build a tree using the first 60 observations of kyphosis.
2) Predict the kyphosis presence for the other 21 observations.
3) What is the misclassification rate (prediction error)?

Exercise 4

Consider the iris data frame (type help('iris') for more details).
1) Build a tree to classify Species from the other variables.
2) Plot the tree and add node information.

Exercise 5

Consider the tree built in Exercise 4.
Prune the tree using the median complexity parameter (cp) associated with the tree.
Plot the pruned tree and the original tree in the same window.

Exercise 6

Consider the tree built in Exercise 4.
1) In which terminal node is each observation of iris classified?
2) Which species has flowers with Petal.Length greater than 2.45 and Petal.Width less than 1.75?

Exercise 7

Consider the car90 data frame (type help('car90') for more details).
1) Build a tree to predict Price from the other variables.
2) Plot the tree and add node information.

Exercise 8

Consider the tree built in Exercise 7.
1) Which variables are used to explain the price?
2) Which terminal nodes have a mean Price less than mean(car90$Price)?

Exercise 9

Consider the car.test.frame data frame (type help('car.test.frame') for more details).
1) Build a tree to explain Mileage using the other variables.
2) Snip the tree at node number 2.
3) Plot both trees together.

Exercise 10
Consider the tree built in Exercise 9.
What is the depth of the tree (with the root node counted as depth 0)?
Set the maximum depth of the final tree to 2.