Data science for Doctors: Variable importance Exercises


Data science enhances people’s decision making. Doctors and researchers are making critical decisions every day. Therefore, it is absolutely necessary for those people to have some basic knowledge of data science. This series aims to help people in the medical field enhance their data science skills.

We will work with a health-related database, the famous “Pima Indians Diabetes Database”. It was generously donated by Vincent Sigillito from Johns Hopkins University. Please find further information regarding the dataset here.

This is the tenth part of the series and it aims to cover the very basics of the Pearson correlation coefficient and principal component analysis; these two methods illustrate how variables are related.
In my opinion, it is necessary for researchers to have a notion of the relationships between variables, in order to find potential cause-and-effect relations (this relation is only hypothetical: you can’t claim that there is a cause-effect relation just because the correlation between two variables is high), to remove unnecessary variables, and so on. In particular, we will go through the Pearson correlation coefficient, confidence intervals by the bootstrap, and principal component analysis.
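As a minimal illustration of what the Pearson correlation coefficient measures, here is a toy sketch on made-up vectors (not the Pima data):

x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 6)
cor(x, y)      # Pearson correlation; close to 1 for a strong increasing linear relation
cor(x, rev(y)) # reversing one vector flips the sign of the correlation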

Before proceeding, it might be helpful to look over the help pages for the ggplot, cor, cor.test, boot.cor, quantile, eigen, princomp, summary, plot, autoplot.

Moreover, please load the following libraries.
install.packages("ggplot2")
library(ggplot2)
install.packages("ggfortify")
library(ggfortify)

Please run the code below in order to load the data set and transform it into a proper data frame format:

url <- "https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data"
data <- read.table(url, fileEncoding="UTF-8", sep=",")
names <- c('preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class')
colnames(data) <- names
data <- data[-which(data$mass ==0),]

Answers to the exercises are available here.

If you obtained a different (correct) answer than those listed on the solutions page, please feel free to post your answer as a comment on that page.

Exercise 1

Compute the value of the correlation coefficient for the variables age and preg.

Exercise 2

Construct the scatterplot for the variables age and preg.

Exercise 3

Apply a correlation test for the variables age and preg, with the null hypothesis that the correlation is zero and the alternative that it is different from zero.
hint: cor.test

Exercise 4

Construct a 95% confidence interval by the bootstrap. First, find the bootstrap estimate of the correlation.
hint: mean

Exercise 5

Now that you have found the correlation, find the 95% confidence interval.

Exercise 6

Find the eigenvalues and eigenvectors of the data set (exclude the class variable).

Exercise 7

Compute the principal components for the dataset used above.

Exercise 8

Show the importance of each principal component.

Exercise 9

Plot the principal components using an elbow graph.

Exercise 10

Construct a scatterplot with the x-axis being the first component and the y-axis being the second component. Moreover, if possible, draw the eigenvectors on the plot.
hint: autoplot




Data science for Doctors: Variable importance Solutions

Below are the solutions to these exercises on principal component analysis.

####################
#                  #
#    Exercise 1    #
#                  #
####################

cor(data$age,data$preg)
## [1] 0.5443412
####################
#                  #
#    Exercise 2    #
#                  #
####################

plot(data$age,data$preg)
#OR
ggplot(data,aes(x = age, y = preg))+ geom_count() # we use geom_count()  to capture the frequency.
#OR

ggplot(data,aes(x = age, y = preg))+ geom_point(colour = "blue", alpha = 0.2) # we use transparent points to capture the frequency. I personally prefer this scatterplot because it captures the frequency in a cleaner way.
####################
#                  #
#    Exercise 3    #
#                  #
####################

cor.test(data$age,data$preg)
## 
## 	Pearson's product-moment correlation
## 
## data:  data$age and data$preg
## t = 17.959, df = 766, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.4925652 0.5922775
## sample estimates:
##       cor 
## 0.5443412
####################
#                  #
#    Exercise 4    #
#                  #
####################

nboot <- 1000                                 # number of bootstrap resamples
boot.cor <- matrix(0,nrow=nboot,ncol = 1)     # storage for the resampled correlations
data_mat <- matrix(c(data$age,data$preg),ncol=2,byrow=FALSE)
for (i in 1:nboot){
  dat.star <- data_mat[sample(1:nrow(data_mat),replace=TRUE),] # resample the rows with replacement
  boot.cor[i,] <- cor(dat.star)[2,1]                           # correlation of the resampled pairs
}
mean(boot.cor)
## [1] 0.5437089
####################
#                  #
#    Exercise 5    #
#                  #
####################

quantile(boot.cor[,1],c(0.025,0.975))
##      2.5%     97.5% 
## 0.4891308 0.5950294
####################
#                  #
#    Exercise 6    #
#                  #
####################

Z <- data.matrix(data[-ncol(data)]) # exclude the class column
K <- eigen(cor(Z))                  # eigen-decomposition of the correlation matrix

K$values; K$vectors
## [1] 2.0943799 1.7312101 1.0296299 0.8755290 0.7623444 0.6826284 0.4198162
## [8] 0.4044620
##            [,1]       [,2]        [,3]        [,4]       [,5]         [,6]
## [1,] -0.1284321 -0.5937858  0.01308692  0.08069115  0.4756057 -0.193598168
## [2,] -0.3930826 -0.1740291 -0.46792282 -0.40432871 -0.4663280 -0.094161756
## [3,] -0.3600026 -0.1838921  0.53549442  0.05598649 -0.3279531  0.634115895
## [4,] -0.4398243  0.3319653  0.23767380  0.03797608  0.4878621 -0.009589438
## [5,] -0.4350262  0.2507811 -0.33670893 -0.34994376  0.3469348  0.270650609
## [6,] -0.4519413  0.1009598  0.36186463  0.05364595 -0.2532038 -0.685372179
## [7,] -0.2706114  0.1220690 -0.43318905  0.83368010 -0.1198105  0.085784088
## [8,] -0.1980271 -0.6205885 -0.07524755  0.07120060  0.1092900  0.033357170
##             [,7]         [,8]
## [1,]  0.58879003  0.117840984
## [2,]  0.06015291  0.450355256
## [3,]  0.19211793 -0.011295538
## [4,] -0.28221253  0.566283799
## [5,]  0.13200992 -0.548621381
## [6,]  0.03536644 -0.341517637
## [7,]  0.08609107 -0.008258731
## [8,] -0.71208542 -0.211661979
####################
#                  #
#    Exercise 7    #
#                  #
####################

pca <- princomp(Z, cor=TRUE, scores=TRUE) # cor=TRUE performs PCA on the correlation matrix; note that princomp has no 'center' argument
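As a side note, prcomp is a base R alternative that does accept explicit centering and scaling arguments. A minimal sketch, assuming only the variance decomposition is needed (the loadings may differ in sign):

pca_alt <- prcomp(Z, center = TRUE, scale. = TRUE) # SVD-based PCA on the standardized variables
summary(pca_alt) # the proportions of variance agree with princomp(Z, cor = TRUE)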
####################
#                  #
#    Exercise 8    #
#                  #
####################

summary(pca)
## Importance of components:
##                           Comp.1    Comp.2    Comp.3    Comp.4     Comp.5
## Standard deviation     1.4471973 1.3157546 1.0147068 0.9356971 0.87312335
## Proportion of Variance 0.2617975 0.2164013 0.1287037 0.1094411 0.09529305
## Cumulative Proportion  0.2617975 0.4781988 0.6069025 0.7163436 0.81163667
##                            Comp.6     Comp.7     Comp.8
## Standard deviation     0.82621328 0.64793223 0.63597331
## Proportion of Variance 0.08532855 0.05247702 0.05055776
## Cumulative Proportion  0.89696522 0.94944224 1.00000000
####################
#                  #
#    Exercise 9    #
#                  #
####################

plot(pca,type="l") # elbow (scree) plot of the component variances
####################
#                  #
#    Exercise 10   #
#                  #
####################

autoplot(pca, data = data, colour = 'age',
         loadings = TRUE, loadings.label = TRUE, loadings.label.size = 3) + scale_color_gradient(low="green", high="red")



Data science for Doctors: Cluster Analysis Exercises


Data science enhances people’s decision making. Doctors and researchers are making critical decisions every day. Therefore, it is absolutely necessary for those people to have some basic knowledge of data science. This series aims to help people in the medical field enhance their data science skills.

We will work with a health-related database, the famous “Pima Indians Diabetes Database”. It was generously donated by Vincent Sigillito from Johns Hopkins University. Please find further information regarding the dataset here.

This is the ninth part of the series and it aims to cover the very basics of the subject of cluster analysis.
In my opinion, it is necessary for researchers to know how to discover relationships between patients and diseases. Therefore, in this set of exercises we will go through the basics of relationship discovery via cluster analysis. In particular, we will use hierarchical clustering and centroid-based clustering, namely k-means clustering and k-median clustering.
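As a minimal sketch of the two families of methods on made-up toy data (not the Pima data):

set.seed(1)
toy <- matrix(rnorm(20), ncol = 2) # 10 random points in 2 dimensions
hc <- hclust(dist(toy))            # hierarchical: repeatedly merge the closest clusters
cutree(hc, k = 2)                  # cut the resulting tree into 2 clusters
kmeans(toy, centers = 2)$cluster   # centroid-based: assign each point to the nearest of 2 centroids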

Before proceeding, it might be helpful to look over the help pages for the ggplot, geom_point, dist, hclust, cutree, stats::rect.hclust, multiplot, kmeans, kGmedian.

Moreover, please load the following libraries.
install.packages("ggplot2")
library(ggplot2)
install.packages("Gmedian")
library(Gmedian)

Please run the code below in order to load the data set and transform it into a proper data frame format:

url <- "https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data"
data <- read.table(url, fileEncoding="UTF-8", sep=",")
names <- c('preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class')
colnames(data) <- names
data <- data[-which(data$mass ==0),]

Answers to the exercises are available here.

If you obtained a different (correct) answer than those listed on the solutions page, please feel free to post your answer as a comment on that page.

Exercise 1

Construct a scatterplot with x-axis to be the mass variable and y-axis to be the age variable. Moreover, determine the colour of the points based on the class of the candidate (0 or 1).

Exercise 2

Create a distance matrix for the data.

Exercise 3

Perform a hierarchical clustering analysis using the single linkage method. Then create an object that contains only two clusters.

Exercise 4

Perform a hierarchical clustering analysis using the complete linkage method (the default). Then create an object that contains only two clusters.

Exercise 5

Plot the trees produced in exercises 3 and 4 and draw the two clusters on the plots.
hint: rect.hclust


Exercise 6

Construct two scatterplots, one for each clustering method, with the x-axis being the mass variable and the y-axis being the age variable. Moreover, determine the colour of the points based on the cluster each point belongs to.

If possible, place each scatterplot (one at a time) next to the plot of exercise 1, to see whether the clustering can discriminate the positively classified from the negatively classified patients. In case you didn’t do that, you can find it in the solutions section; I highly encourage you to check it out.

Exercise 7

Run the following in order to create dummy variables: data_mat <- model.matrix(~.+0, data = data).
Perform a centroid-based cluster analysis using the k-means method with k equal to 2. Apply the k-means clustering to the data_mat matrix.
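To see what model.matrix(~.+0, data = ...) does, here is a toy sketch with a hypothetical data frame containing a factor column (the Pima data is all numeric, so there the call essentially returns the numeric matrix unchanged):

df <- data.frame(x = 1:3, g = factor(c("a", "b", "a")))
model.matrix(~ . + 0, data = df) # keeps x as is and expands the factor g into dummy columns ga and gb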

Exercise 8

Construct a scatterplot with the x-axis being the mass variable and the y-axis being the age variable. Moreover, determine the colour of the points based on the cluster (retrieved from the k-means method) that those points belong to.

If possible, place that scatterplot next to the plot of exercise 1.

Exercise 9

Perform a centroid-based cluster analysis using the k-median method with k equal to 2. Apply the k-median clustering to the data_mat matrix.

Exercise 10

Construct a scatterplot with the x-axis being the mass variable and the y-axis being the age variable. Moreover, determine the colour of the points based on the cluster (retrieved from the k-median method) that those points belong to.

If possible, place that scatterplot next to the plot of exercise 1.




Data science for Doctors: Cluster Analysis Exercises Solutions

Below are the solutions to these exercises on cluster analysis.

In order to run the multiplot function, run this script first.
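If you don’t have the multiplot script at hand, the gridExtra package offers similar side-by-side plotting; a sketch, assuming you are willing to substitute grid.arrange for multiplot:

install.packages("gridExtra")
library(gridExtra)
grid.arrange(p1, p2, ncol = 2) # draws two ggplot objects (such as p1 and p2 below) next to each other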

####################
#                  #
#    Exercise 1    #
#                  #
####################

p1 <- ggplot(data, aes(mass, age, color = class)) + geom_point()+ scale_colour_gradient(low="green", high="red")
p1
####################
#                  #
#    Exercise 2    #
#                  #
####################

d <- dist(as.matrix(data)) # Euclidean distance matrix between all pairs of patients

####################
#                  #
#    Exercise 3    #
#                  #
####################

single_link <- hclust(d,method="single") # hierarchical clustering with single linkage
clust_sing <- cutree(single_link, k = 2) # cut the tree into two clusters

####################
#                  #
#    Exercise 4    #
#                  #
####################

complete_link <- hclust(d,method="complete") # hierarchical clustering with complete linkage
clust_complete <- cutree(complete_link, k = 2) # cut the tree into two clusters

####################
#                  #
#    Exercise 5    #
#                  #
####################

plot(complete_link)
stats::rect.hclust(complete_link, 2)
plot(single_link)
stats::rect.hclust(single_link, 2)
####################
#                  #
#    Exercise 6    #
#                  #
####################

p2 <-ggplot(data, aes(mass, age, color = clust_complete)) + geom_point()+ scale_color_gradient(low="green", high="red")

p3 <- ggplot(data, aes(mass, age, color = clust_sing)) + geom_point()+ scale_color_gradient(low="green", high="red")

multiplot(p1,p2, cols = 2)
multiplot(p1,p3, cols = 2)
####################
#                  #
#    Exercise 7    #
#                  #
####################


data_mat <- model.matrix(~.+0, data = data) # the dummy-variable matrix from the exercise statement
kmean_cluster <- kmeans(data_mat, 2, nstart = 20) # initializes the centroids 20 times and keeps the run with the lowest within-cluster variation

####################
#                  #
#    Exercise 8    #
#                  #
####################

kmean_cluster$cluster <- kmean_cluster$cluster -1 # not required, but recodes the clusters from 1/2 to 0/1 so the colour scale matches the class plot of exercise 1

p4 <- ggplot(data, aes(mass, age, color = kmean_cluster$cluster)) + geom_point()+ scale_color_gradient(low="green", high="red")

multiplot(p1,p4, cols = 2)
####################
#                  #
#    Exercise 9    #
#                  #
####################

kmedian_cluster <-kGmedian(data_mat, ncenters=2, nstart = 20)

####################
#                  #
#    Exercise 10   #
#                  #
####################
kmedian_cluster$cluster <- kmedian_cluster$cluster -1 # recode to 0/1, as in exercise 8
p5 <- ggplot(data, aes(mass, age, color = kmedian_cluster$cluster)) + geom_point()+ scale_color_gradient(low="green", high="red")

multiplot(p1,p5, cols = 2)



Data science for Doctors: Inferential Statistics Exercises(Part-5) Solutions

Below are the solutions to these exercises on inferential statistics.

####################
#                  #
#    Exercise 1    #
#                  #
####################

hist(unlist(data["pres"]))
#OR

ggplot(data, aes(unlist(data["pres"]))) + geom_histogram()
####################
#                  #
#    Exercise 2    #
#                  #
####################

qqnorm(unlist(data["pres"]))
qqline(unlist(data["pres"]))
####################
#                  #
#    Exercise 3    #
#                  #
####################

shapiro.test(unlist(data["pres"]))
## 
## 	Shapiro-Wilk normality test
## 
## data:  unlist(data["pres"])
## W = 0.81892, p-value < 2.2e-16
####################
#                  #
#    Exercise 4    #
#                  #
####################


ad.test(unlist(data["pres"]))
## 
## 	Anderson-Darling normality test
## 
## data:  unlist(data["pres"])
## A = 33.901, p-value < 2.2e-16
####################
#                  #
#    Exercise 5    #
#                  #
####################

sh <- apply(data, 1, function(x) shapiro.test(x)$p.value) # Shapiro-Wilk p-value for each row of the data
sum(sh > 0.05)/nrow(data) * 100 # percentage of rows for which normality is not rejected at the 5% level
## [1] 32.68229
####################
#                  #
#    Exercise 6    #
#                  #
####################

boxplot(data["pres"])
# OR 

qplot(y=data$pres, x= 1, geom = "boxplot")
#OR

ggplot(data, aes(x="pres", y=pres))+geom_boxplot()
####################
#                  #
#    Exercise 7    #
#                  #
####################

grubbs.test(unlist(data$pres))
## 
## 	Grubbs test for one outlier
## 
## data:  unlist(data$pres)
## G = 3.57030, U = 0.98336, p-value = 0.1299
## alternative hypothesis: lowest value 0 is an outlier
####################
#                  #
#    Exercise 8    #
#                  #
####################

grubbs.test(unlist(data["pres"]),two.sided = TRUE)
## 
## 	Grubbs test for one outlier
## 
## data:  unlist(data["pres"])
## G = 3.57030, U = 0.98336, p-value = 0.2598
## alternative hypothesis: lowest value 0 is an outlier
####################
#                  #
#    Exercise 9    #
#                  #
####################

mass_1 <- data$mass[sample(length(data$mass), 14)]
mass_2 <- rnorm(14,28,4)

wilcox.test(mass_1,mass_2,paired = TRUE)
## 
## 	Wilcoxon signed rank test
## 
## data:  mass_1 and mass_2
## V = 71, p-value = 0.2676
## alternative hypothesis: true location shift is not equal to 0
####################
#                  #
#    Exercise 10   #
#                  #
####################

wilcox.test(data$pres ~ data$class)
## 
## 	Wilcoxon rank sum test with continuity correction
## 
## data:  data$pres by data$class
## W = 55414, p-value = 7.559e-05
## alternative hypothesis: true location shift is not equal to 0



Data science for Doctors: Inferential Statistics Exercises(Part-5)


Data science enhances people’s decision making. Doctors and researchers are making critical decisions every day. Therefore, it is absolutely necessary for those people to have some basic knowledge of data science. This series aims to help people in the medical field enhance their data science skills.

We will work with a health-related database, the famous “Pima Indians Diabetes Database”. It was generously donated by Vincent Sigillito from Johns Hopkins University. Please find further information regarding the dataset here.

This is the eighth part of the series and it aims to cover partially the subject of inferential statistics.
Researchers rarely have the capability of testing many patients, or of experimenting with a new treatment on many patients; therefore, making inferences from a sample is a necessary skill to have. This is where inferential statistics comes into play.
In more detail, in this part we will go through hypothesis testing for the normality of distributions (Shapiro-Wilk test, Anderson-Darling test) and for the existence of outliers (Grubbs’ test for outliers). We will also cover the case where the normality assumption doesn’t hold and how to deal with it (rank tests). Finally, we will do a brief recap of the previous exercises on inferential statistics.
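As a quick illustration of how these normality tests behave on samples of known shape (simulated toy data, not the Pima variables):

set.seed(1)
shapiro.test(rnorm(100))$p.value # normal sample: large p-value, normality is not rejected
shapiro.test(rexp(100))$p.value  # skewed sample: tiny p-value, normality is rejected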

Before proceeding, it might be helpful to look over the help pages for the hist, qqnorm, qqline, shapiro.test, ad.test, grubbs.test, wilcox.test.

Moreover, please load the following libraries.
install.packages("ggplot2")
library(ggplot2)
install.packages("nortest")
library(nortest)
install.packages("outliers")
library(outliers)

Please run the code below in order to load the data set and transform it into a proper data frame format:

url <- "https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data"
data <- read.table(url, fileEncoding="UTF-8", sep=",")
names <- c('preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class')
colnames(data) <- names
data <- data[-which(data$mass ==0),]


Answers to the exercises are available here.

If you obtained a different (correct) answer than those listed on the solutions page, please feel free to post your answer as a comment on that page.

Exercise 1

Plot a histogram of the variable pres.

Exercise 2

Plot the QQ-plot with a QQ-line for the variable pres.

Exercise 3

Apply a Shapiro-Wilk normality test for the variable pres.

Exercise 4

Apply an Anderson-Darling normality test for the variable pres.

Exercise 5

What percentage of the data passes a normality test?
This might be a bit challenging; consider using the apply function.


Exercise 6

Construct a boxplot of pres and see whether there are outliers or not.

Exercise 7

Apply a Grubbs’ test on pres to see whether the variable contains outlier values.

Exercise 8

Apply a two-sided Grubbs’ test on pres to see whether the variable contains outlier values.

Exercise 9

Suppose we test a new diet on a sample of 14 people from the candidates (take a random sample from the set), and after the diet the average mass was 28 with a standard deviation of 4 (generate 14 normally distributed samples with the properties mentioned before). Apply a Wilcoxon signed rank test for the mass variable before and after the diet.

Exercise 10

Check whether the positive and negative candidates have the same distribution for the pres variable. In order to check that, apply a Wilcoxon rank sum test for the pres variable with respect to the class variable.




Data science for Doctors: Inferential Statistics Exercises(Part-4)


Data science enhances people’s decision making. Doctors and researchers are making critical decisions every day. Therefore, it is absolutely necessary for those people to have some basic knowledge of data science. This series aims to help people in the medical field enhance their data science skills.

We will work with a health-related database, the famous “Pima Indians Diabetes Database”. It was generously donated by Vincent Sigillito from Johns Hopkins University. Please find further information regarding the dataset here.

This is the seventh part of the series and it aims to cover partially the subject of inferential statistics.
Researchers rarely have the capability of testing many patients, or of experimenting with a new treatment on many patients; therefore, making inferences from a sample is a necessary skill to have. This is where inferential statistics comes into play.
In more detail, in this part we will go through hypothesis testing for the F-distribution (F-test) and the Chi-squared distribution (Chi-squared test). If you are not familiar with the mentioned distributions, please go here to acquire the necessary background. An assumption of the t-test (which we covered last time here) is that the two population variances are equal; such an assumption can serve as the null hypothesis for the F-test. Moreover, sometimes we want to test a hypothesis with respect to more than one probability, and this is where the Chi-squared test comes into play.
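As a minimal sketch of the F-test logic on simulated toy samples (not the exercise data):

set.seed(1)
a <- rnorm(30, sd = 1)
b <- rnorm(30, sd = 2)
var(a) / var(b) # the F-statistic is the ratio of the sample variances
var.test(a, b)  # rejects equal variances when this ratio is far from 1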

Before proceeding, it might be helpful to look over the help pages for the sd, var, var.test, chisq.test.

Please run the code below in order to load the data set and transform it into a proper data frame format:

url <- "https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data"
data <- read.table(url, fileEncoding="UTF-8", sep=",")
names <- c('preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class')
colnames(data) <- names
data <- data[-which(data$mass ==0),]

Moreover run the chunk below in order to generate the samples that we will test on this set of exercises.
f_1 <- rnorm(28,29,3)
f_2 <- rnorm(23,29,6)

Answers to the exercises are available here.

If you obtained a different (correct) answer than those listed on the solutions page, please feel free to post your answer as a comment on that page.

Exercise 1

Compute the F-statistic (the test statistic for the F-test).

Exercise 2

Compute the degrees of freedom for the numerator and denominator.

Exercise 3

Apply a two-sided F-test for the two samples.

Exercise 4

Apply a one-sided F-test for the two samples, with the alternative hypothesis that the standard deviation of the first sample is smaller than that of the second.

Exercise 5

Retrieve the p-value and the ratio of variances for both tests.

Exercise 6

Find the number of patients who show signs of diabetes and those who don’t.

Exercise 7

Assume that the hypothesis we made is that 10% of people show signs of diabetes. Is that a valid claim to make? Test it using the chi-squared test.

Exercise 8

Suppose that the mass index affects whether the patients show signs of diabetes, and we assume that people who weigh more than the average are more likely to have diabetes signs. Make a matrix that contains the true-positives, false-positives, true-negatives, and false-negatives of our hypothesis.
hint: True-positive: data$class==1 & data$mass >= mean(data$mass)

Exercise 9

Test the hypothesis we made at exercise 8 using chi-squared test.

Exercise 10

The hypothesis we made at exercise 8 cannot be validated; however, we have noticed that the dataset contains outliers which affect the average. Therefore, we make another assumption: patients who are heavier than the lightest 25% of the patients are more likely to show signs of diabetes. Test that hypothesis.
hint: it is similar to the process we followed in exercises 8 and 9, but with a different criterion.




Data science for Doctors: Inferential Statistics Solutions(Part-4)

Below are the solutions to these exercises on inferential statistics.

####################
#                  #
#    Exercise 1    #
#                  #
####################

f_1 <- rnorm(28,29,3)
f_2 <- rnorm(23,29,6)

f <- sd(f_1)^2/sd(f_2)^2;f
## [1] 0.253262
#OR
f <- var(f_1)/var(f_2); f
## [1] 0.253262
####################
#                  #
#    Exercise 2    #
#                  #
####################

df_num <- length(f_1)-1; df_num
## [1] 27
df_den <- length(f_2)-1; df_den
## [1] 22
####################
#                  #
#    Exercise 3    #
#                  #
####################

two_sided_test <- var.test(f_1,f_2,alternative = "two.sided")
two_sided_test
## 
## 	F test to compare two variances
## 
## data:  f_1 and f_2
## F = 0.25326, num df = 27, denom df = 22, p-value = 0.0009226
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
##  0.1101789 0.5626353
## sample estimates:
## ratio of variances 
##           0.253262
####################
#                  #
#    Exercise 4    #
#                  #
####################

one_sided_test <- var.test(f_1,f_2,alternative = "less")
one_sided_test
## 
## 	F test to compare two variances
## 
## data:  f_1 and f_2
## F = 0.25326, num df = 27, denom df = 22, p-value = 0.0004613
## alternative hypothesis: true ratio of variances is less than 1
## 95 percent confidence interval:
##  0.0000000 0.4938655
## sample estimates:
## ratio of variances 
##           0.253262
####################
#                  #
#    Exercise 5    #
#                  #
####################

two_sided_test$estimate
## ratio of variances 
##           0.253262
two_sided_test$p.value
## [1] 0.0009226243
one_sided_test$estimate
## ratio of variances 
##           0.253262
one_sided_test$p.value
## [1] 0.0004613122
####################
#                  #
#    Exercise 6    #
#                  #
####################

positive <- length(data$class[which(data$class==1)])
negative <- length(data$class[which(data$class==0)])

#OR

positive <- sum(data$class); positive
## [1] 268
negative <- length(data$class)-positive; negative
## [1] 500
####################
#                  #
#    Exercise 7    #
#                  #
####################

prob <- c(0.10,0.90)
res <- c(positive,negative)
chisq.test(res, p=prob)
## 
## 	Chi-squared test for given probabilities
## 
## data:  res
## X-squared = 528.9, df = 1, p-value < 2.2e-16
####################
#                  #
#    Exercise 8    #
#                  #
####################

TP <- length(data$class[which(data$class==1 & data$mass >= mean(data$mass))])
FP <- length(data$class[which(data$class==1 & data$mass < mean(data$mass))])
TN <- length(data$class[which(data$class==0 & data$mass < mean(data$mass))])
FN <- length(data$class[which(data$class==0 & data$mass >= mean(data$mass))])
res_mut <- matrix(c(TP,FP,TN,FN),2,byrow=TRUE); res_mut
##      [,1] [,2]
## [1,]  184   84
## [2,]  289  211
####################
#                  #
#    Exercise 9    #
#                  #
####################

chisq.test(res_mut)
## 
## 	Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  res_mut
## X-squared = 8.2403, df = 1, p-value = 0.004097
####################
#                  #
#    Exercise 10   #
#                  #
####################

TP <- length(data$class[which(data$class==1 & data$mass >= quantile(data$mass, 0.25))])
FP <- length(data$class[which(data$class==1 & data$mass < quantile(data$mass, 0.25))])
TN <- length(data$class[which(data$class==0 & data$mass < quantile(data$mass, 0.25))])
FN <- length(data$class[which(data$class==0 & data$mass >= quantile(data$mass, 0.25))])
res_mut <- matrix(c(TP,FP,TN,FN),2,byrow=TRUE); res_mut
##      [,1] [,2]
## [1,]  248   20
## [2,]  170  330
chisq.test(res_mut)
## 
## 	Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  res_mut
## X-squared = 238.68, df = 1, p-value < 2.2e-16



Data science for Doctors: Inferential Statistics Solutions (part-3)

Below are the solutions to these exercises on inferential statistics.

####################
#                  #
#    Exercise 1    #
#                  #
####################

n <- 25   # sample size
m <- 29   # sample mean after the diet
std <- 4  # sample standard deviation
x <- rnorm(n, mean = m, sd = std)
t <- sqrt(n)*(mean(x)-mean(data$mass))/sd(x); t # one-sample t-statistic against the population mean of mass
## [1] -3.992123
####################
#                  #
#    Exercise 2    #
#                  #
####################

p = 2*pt(-abs(t),n-1); p # two-sided p-value from the t distribution with n-1 degrees of freedom
## [1] 0.0005374991
####################
#                  #
#    Exercise 3    #
#                  #
####################

qt(0.025, n - 1); qt(0.975, n - 1)
## [1] -2.063899
## [1] 2.063899
# We reject the null hypothesis since the t-value lies outside the interval defined by these quantiles

####################
#                  #
#    Exercise 4    #
#                  #
####################

t.test(x,mu=mean(data$mass))
## 
## 	One Sample t-test
## 
## data:  x
## t = -3.9921, df = 24, p-value = 0.0005375
## alternative hypothesis: true mean is not equal to 31.99258
## 95 percent confidence interval:
##  27.14068 30.44775
## sample estimates:
## mean of x 
##  28.79421
####################
#                  #
#    Exercise 5    #
#                  #
####################

t.test(x, mu=mean(data$mass), alternative = c("less"))
## 
## 	One Sample t-test
## 
## data:  x
## t = -3.9921, df = 24, p-value = 0.0002687
## alternative hypothesis: true mean is less than 31.99258
## 95 percent confidence interval:
##      -Inf 30.16492
## sample estimates:
## mean of x 
##  28.79421
####################
#                  #
#    Exercise 6    #
#                  #
####################

n <- 27
m <- 31
std <- 5
y <- rnorm(n, mean = m, sd = std)

t.test(x,y, var.equal=FALSE)
## 
## 	Welch Two Sample t-test
## 
## data:  x and y
## t = -1.0558, df = 47.52, p-value = 0.2964
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -4.062233  1.265441
## sample estimates:
## mean of x mean of y 
##  28.79421  30.19261
####################
#                  #
#    Exercise 7    #
#                  #
####################

t.test(x,y, var.equal=FALSE, paired=FALSE, alternative = c("less"))
## 
## 	Welch Two Sample t-test
## 
## data:  x and y
## t = -1.0558, df = 47.52, p-value = 0.1482
## alternative hypothesis: true difference in means is less than 0
## 95 percent confidence interval:
##      -Inf 0.823576
## sample estimates:
## mean of x mean of y 
##  28.79421  30.19261
####################
#                  #
#    Exercise 8    #
#                  #
####################

n <- 27
m <- 31
std <- 4 # same standard deviation as the first sample, as the exercise assumes
z <- rnorm(n, mean = m, sd = std)
t.test(x, z, var.equal=TRUE, paired=FALSE)
## (output depends on the random sample)
####################
#                  #
#    Exercise 9    #
#                  #
####################

t.test(x, z, var.equal=TRUE, paired=FALSE, alternative = c("less"))
## (output depends on the random sample)
####################
#                  #
#    Exercise 10   #
#                  #
####################

n <- 27
m_1 <- 29
std_1 <- 4
t_1 <- rnorm(n, mean = m_1, sd = std_1)

n <- 27
m_2 <- 28
std_2 <- 4
t_2 <- rnorm(n, mean = m_2, sd = std_2)

t.test(t_1, t_2, var.equal=FALSE, paired=TRUE)
## 
## 	Paired t-test
## 
## data:  t_1 and t_2
## t = 1.8306, df = 26, p-value = 0.07865
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.1658083  2.8642267
## sample estimates:
## mean of the differences 
##                1.349209



Data science for Doctors: Inferential Statistics Exercises (part-3)


Data science enhances people’s decision making. Doctors and researchers are making critical decisions every day. Therefore, it is absolutely necessary for those people to have some basic knowledge of data science. This series aims to help people in the medical field enhance their data science skills.

We will work with a health-related database, the famous “Pima Indians Diabetes Database”. It was generously donated by Vincent Sigillito from Johns Hopkins University. Please find further information regarding the dataset here.

This is the sixth part of the series and it aims to cover partially the subject of inferential statistics.
Researchers rarely have the capability of testing many patients, or of experimenting with a new treatment on many patients; therefore, making inferences from a sample is a necessary skill to have. This is where inferential statistics comes into play.
In more detail, in this part we will go through hypothesis testing for Student’s t-distribution (Student’s t-test), which may be the most used test you will need to apply, since in most cases the standard deviation σ of the population is not known. We will cover the one-sample t-test and the two-sample t-test (both with equal and unequal variances). If you are not familiar with the mentioned distributions, please go here to acquire the necessary background.
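As a minimal sketch of the one-sample t-statistic on a simulated toy sample (the hypothesized mean mu below is made up; Exercise 1 applies the same formula to the diet sample):

set.seed(1)
x_toy <- rnorm(25, mean = 29, sd = 4)
mu <- 32 # hypothetical population mean, chosen only for illustration
(mean(x_toy) - mu) / (sd(x_toy) / sqrt(length(x_toy))) # t-statistic computed by hand
t.test(x_toy, mu = mu)$statistic # matches the built-in test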

Before proceeding, it might be helpful to look over the help pages for the t.test.

Please run the code below in order to load the data set and transform it into a proper data frame format:

url <- "https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data"
data <- read.table(url, fileEncoding="UTF-8", sep=",")
names <- c('preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class')
colnames(data) <- names
data <- data[-which(data$mass ==0),]

Answers to the exercises are available here.

If you obtained a different (correct) answer than those listed on the solutions page, please feel free to post your answer as a comment on that page.

Exercise 1

Suppose that we take a sample of 25 candidates who tried a diet and, after the experiment, had an average weight of 29 (generate 25 normally distributed samples with mean 29 and standard deviation 4).
Find the t-value.

Exercise 2

Find the p-value.

Exercise 3

Find the 95% confidence interval.

Exercise 4

Apply a t-test with the null hypothesis that the true mean of the sample is equal to the mean of the population, at the 5% significance level.

Exercise 5

Apply a t-test with the null hypothesis that the true mean of the sample is equal to the mean of the population, and the alternative that the true mean is less than the mean of the population, at the 5% significance level.

Exercise 6

Suppose that we want to compare the current diet with another one. Assume we test a different diet on a sample of 27 with a mass average of 31 (generate 27 normally distributed samples with mean 31 and standard deviation 5). Test whether the two diets are significantly different.
Note that the two distributions have different variances.
hint: This is a two-sample hypothesis test with unequal variances.

Exercise 7

Test whether the first diet is more efficient than the second.

Exercise 8

Assume that the second diet has the same variance as the first one. Is it significantly different?

Exercise 9

Assume that the second diet has the same variance as the first one. Is it significantly better?

Exercise 10

Suppose that you take a sample of 27 with an average mass of 29, and after the diet the average mass is 28 (generate the samples with rnorm(27, average, 4)). Are they significantly different?
hint: paired sample t-test.