Data science for Doctors: Inferential Statistics Exercises (part-2)


Data science enhances people’s decision making. Doctors and researchers make critical decisions every day, so it is essential for them to have some basic knowledge of data science. This series aims to help people who work in or around the medical field to enhance their data science skills.

We will work with a health-related database, the famous “Pima Indians Diabetes Database”. It was generously donated by Vincent Sigillito from Johns Hopkins University. Please find further information regarding the dataset here.

This is the fifth part of the series, and it aims to partially cover the subject of inferential statistics. Researchers rarely have the capability of testing many patients, or of trying a new treatment on many patients, so making inferences from a sample is a necessary skill to have. This is where inferential statistics comes into play. In more detail, in this part we will go through hypothesis testing for the binomial distribution (binomial test) and the normal distribution (Z-test). If you are not familiar with these distributions, please go here to acquire the necessary background.

Before proceeding, it might be helpful to look over the help pages for the binom.test, mean, sd, sqrt, and z.test. Moreover, it is crucial to be familiar with the Central Limit Theorem.
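As a quick refresher, here is a minimal sketch of the Central Limit Theorem in action (the population and sample size are purely illustrative): averages of samples are approximately normal even when the underlying data are not, which is exactly what the Z-tests below rely on.

set.seed(42)
pop <- rexp(10000, rate = 1)                           # a clearly non-normal population
sample_means <- replicate(5000, mean(sample(pop, 30))) # means of many samples of size 30
hist(sample_means)                                     # roughly bell-shaped, as the CLT promises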

install.packages("TeachingDemos")
library(TeachingDemos)

Please run the code below in order to load the data set and transform it into a proper data frame format:

url <- "https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data"
data <- read.table(url, fileEncoding="UTF-8", sep=",")
names <- c('preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class')
colnames(data) <- names
data <- data[-which(data$mass == 0),] # remove observations where BMI (mass) is recorded as 0

Answers to the exercises are available here.

If you obtained a different (correct) answer than those listed on the solutions page, please feel free to post your answer as a comment on that page.

Exercise 1

Suppose that we take a sample of 30 candidates who tried a medicine, and 5 of them test positive.
The null hypothesis H0: p = the average of the class variable is to be tested against the alternative H1: p != the average of the class variable.
In practice, this tests whether the drug had an effect on the patients.

Exercise 2

Apply the same test as above, but instead of specifying the number of successes and the number of trials, apply the test with respect to the numbers of successes and failures (5, 25).

Exercise 3

Keeping the same null hypothesis as in exercises 1 and 2, apply a one-sided test with H1: p < the average of the class variable.

Exercise 4

In the previous exercises we didn’t specify the confidence level, so the default of 0.95 was applied. Run the test from exercise 3, but with a confidence level of 0.99 instead of 0.95.

Exercise 5

We have created another drug and tested it on another 30 candidates. After they had taken the medicine for a few weeks, only 2 out of 30 were positive. We got really excited and decided to set the confidence level to 0.999. Does that drug have an actual impact?

Exercise 6

Suppose that we establish a new diet, and that a sample of 30 candidates who tried this diet had an average mass of 29 after the testing period. Find the confidence interval for a significance level of 0.05. Keep in mind that we run the test with respect to the data$mass variable.

Exercise 7

Find the Z-score of the sample.

Exercise 8

Find the p-value for the experiment.

Exercise 9

Run the z-test using the z.test function with a confidence level of 0.95, and let the alternative hypothesis be that the diet had an effect (two-sided test).

Exercise 10

Let’s get a bit more intuitive now: let the alternative hypothesis be that the diet leads to a lower average body mass, with a confidence level of 0.99 (one-sided test).




Data science for Doctors: Inferential Statistics Solutions (part-2)

Below are the solutions to these exercises on inferential statistics.

####################
#                  #
#    Exercise 1    #
#                  #
####################

binom.test(5, 30, mean(data$class), alternative = "two.sided")
## 
## 	Exact binomial test
## 
## data:  5 and 30
## number of successes = 5, number of trials = 30, p-value = 0.03587
## alternative hypothesis: true probability of success is not equal to 0.3489583
## 95 percent confidence interval:
##  0.0564217 0.3472117
## sample estimates:
## probability of success 
##              0.1666667
####################
#                  #
#    Exercise 2    #
#                  #
####################

# When x is given as a vector of length 2, binom.test reads it as
# (successes, failures) and ignores the positional n argument, so the
# hypothesized probability must be passed explicitly as p.
binom.test(c(5, 25), p = mean(data$class), alternative = "two.sided")
## 
## 	Exact binomial test
## 
## data:  c(5, 25)
## number of successes = 5, number of trials = 30, p-value = 0.03587
## alternative hypothesis: true probability of success is not equal to 0.3489583
## 95 percent confidence interval:
##  0.0564217 0.3472117
## sample estimates:
## probability of success 
##              0.1666667
####################
#                  #
#    Exercise 3    #
#                  #
####################

binom.test(5, 30, mean(data$class), alternative="less")
## 
## 	Exact binomial test
## 
## data:  5 and 30
## number of successes = 5, number of trials = 30, p-value = 0.0239
## alternative hypothesis: true probability of success is less than 0.3489583
## 95 percent confidence interval:
##  0.0000000 0.3189712
## sample estimates:
## probability of success 
##              0.1666667
#OR 
pbinom(5, 30, mean(data$class)) # P(X <= 5), which equals the one-sided p-value
## [1] 0.0238959
# p-value < 0.05, so we reject the null hypothesis at the 5% level

####################
#                  #
#    Exercise 4    #
#                  #
####################

binom.test(5, 30, mean(data$class), conf.level = 0.99, alternative = "less")
## 
## 	Exact binomial test
## 
## data:  5 and 30
## number of successes = 5, number of trials = 30, p-value = 0.0239
## alternative hypothesis: true probability of success is less than 0.3489583
## 99 percent confidence interval:
##  0.0000000 0.3808047
## sample estimates:
## probability of success 
##              0.1666667
# We can't reject the null hypothesis at the 1% significance level (p-value = 0.0239 > 0.01)

####################
#                  #
#    Exercise 5    #
#                  #
####################

binom.test(2, 30, mean(data$class), conf.level = 0.999, alternative = "less")
## 
## 	Exact binomial test
## 
## data:  2 and 30
## number of successes = 2, number of trials = 30, p-value =
## 0.0003637
## alternative hypothesis: true probability of success is less than 0.3489583
## 99.9 percent confidence interval:
##  0.0000000 0.3214435
## sample estimates:
## probability of success 
##             0.06666667
# p-value < 0.001, so we reject the null hypothesis

####################
#                  #
#    Exercise 6    #
#                  #
####################

z <- 1.96 # critical value for a 95% confidence interval
low <- mean(data$mass) - z*sd(data$mass)/sqrt(30)
high <- mean(data$mass) + z*sd(data$mass)/sqrt(30)
low;high
## [1] 29.17127
## [1] 34.81389
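# A hedged aside: rather than hard-coding 1.96, the critical value can be
# derived from the normal quantile function:
qnorm(0.975)
## [1] 1.959964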
####################
#                  #
#    Exercise 7    #
#                  #
####################

z <- (29 - mean(data$mass))/(sd(data$mass)/sqrt(30)) # (sample mean - population mean) / standard error

####################
#                  #
#    Exercise 8    #
#                  #
####################

2*pnorm(-abs(z),0,1) # two-sided p-value; p < 0.05, so we reject the null hypothesis
## [1] 0.03761903
####################
#                  #
#    Exercise 9    #
#                  #
####################

library(TeachingDemos)
z.test(29, mu = mean(data$mass), sd = sd(data$mass)/sqrt(30), alternative = "two.sided", conf.level = 0.95)
## 
## 	One Sample z-test
## 
## data:  29
## z = -2.079, n = 1.0000, Std. Dev. = 1.4394, Std. Dev. of the
## sample mean = 1.4394, p-value = 0.03762
## alternative hypothesis: true mean is not equal to 31.99258
## 95 percent confidence interval:
##  26.17874 31.82126
## sample estimates:
## mean of 29 
##         29
####################
#                  #
#    Exercise 10   #
#                  #
####################

z.test(29, mu = mean(data$mass), sd = sd(data$mass)/sqrt(30), alternative = "less", conf.level = 0.99)
## 
## 	One Sample z-test
## 
## data:  29
## z = -2.079, n = 1.0000, Std. Dev. = 1.4394, Std. Dev. of the
## sample mean = 1.4394, p-value = 0.01881
## alternative hypothesis: true mean is less than 31.99258
## 99 percent confidence interval:
##      -Inf 32.34865
## sample estimates:
## mean of 29 
##         29



Data Science for Doctors – Part 4 : Inferential Statistics (1/5) Solutions

Below are the solutions to these exercises on inferential statistics.

####################
#                  #
#    Exercise 1    #
#                  #
####################

iter <- 10000
means <- rep(NA, iter)

for (i in 1:iter){
  sam_50 <- sample(data$mass, 50)
  means[i] <- mean(sam_50)
}

hist(means)
hist(data$mass)
####################
#                  #
#    Exercise 2    #
#                  #
####################

mean(data$mass)
## [1] 31.99258
sd(data$mass)/sqrt(50)
## [1] 1.114989
#OR 
mean(means)
## [1] 31.98233
sd(means)
## [1] 1.081871
####################
#                  #
#    Exercise 3    #
#                  #
####################

library(moments)
skewness(means)
## [1] -0.0333564
#  slight negative skewness; the distribution is nearly symmetric
kurtosis(means)
## [1] 3.064367
# The kurtosis is close to the expected value 3.
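# As an illustrative check (random draws, so the exact number will vary),
# a large sample from a normal distribution also has kurtosis close to 3:
kurtosis(rnorm(100000))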

####################
#                  #
#    Exercise 4    #
#                  #
####################

z = (30.5-mean(data$mass))/(sd(data$mass)/sqrt(50)) # (sample mean - population mean) / standard error
z
## [1] -1.338649
####################
#                  #
#    Exercise 5    #
#                  #
####################

pnorm(z) # P(sample mean < 30.5)
## [1] 0.09034253
####################
#                  #
#    Exercise 6    #
#                  #
####################

z = (31-mean(data$mass))/(sd(data$mass)/sqrt(150)) # z-score for a mean of 31 with n = 150

####################
#                  #
#    Exercise 7    #
#                  #
####################

pnorm(z) # P(sample mean < 31)
## [1] 0.06154952
####################
#                  #
#    Exercise 8    #
#                  #
####################

1.96*sd(data$mass)/sqrt(150) # margin of error = critical value * standard error
## [1] 1.261728
####################
#                  #
#    Exercise 9    #
#                  #
####################

z <- 1.96
# the experiment used a sample of size 150 (see Exercise 6)
low <- 31 - z*sd(data$mass)/sqrt(150)
high <- 31 + z*sd(data$mass)/sqrt(150)
low;high
## [1] 29.73827
## [1] 32.26173
####################
#                  #
#    Exercise 10   #
#                  #
####################

z <- 2.33 # 98%
low <- 31 - z*sd(data$mass)/sqrt(150)
high <- 31 + z*sd(data$mass)/sqrt(150)
low;high
## [1] 29.50009
## [1] 32.49991
z <- 2.58 # 99%
low <- 31 - z*sd(data$mass)/sqrt(150)
high <- 31 + z*sd(data$mass)/sqrt(150)
low;high
## [1] 29.33915
## [1] 32.66085



Data Science for Doctors – Part 4 : Inferential Statistics (1/5)

Data science enhances people’s decision making. Doctors and researchers make critical decisions every day, so it is essential for them to have some basic knowledge of data science. This series aims to help people who work in or around the medical field to enhance their data science skills.

We will work with a health-related database, the famous “Pima Indians Diabetes Database”. It was generously donated by Vincent Sigillito from Johns Hopkins University. Please find further information regarding the dataset here.

This is the fourth part of the series, and it aims to partially cover the subject of inferential statistics. Researchers rarely have the capability of testing many patients, or of trying a new treatment on many patients, so making inferences from a sample is a necessary skill to have. This is where inferential statistics comes into play.

Before proceeding, it might be helpful to look over the help pages for the sample, mean, sd, sort, and pnorm. Moreover, it is crucial to be familiar with the Central Limit Theorem.

You may also need to load the moments library.
install.packages("moments")
library(moments)

Please run the code below in order to load the data set and transform it into a proper data frame format:

url <- "https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data"
data <- read.table(url, fileEncoding="UTF-8", sep=",")
names <- c('preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class')
colnames(data) <- names
data <- data[-which(data$mass == 0),] # remove observations where BMI (mass) is recorded as 0

Answers to the exercises are available here.

If you obtained a different (correct) answer than those listed on the solutions page, please feel free to post your answer as a comment on that page.

Exercise 1

Generate a sampling distribution (10,000 iterations) of sample means, using samples of size 50 drawn from the mass variable.

You are encouraged to experiment with different sample sizes and iteration counts in order to see the impact they have on the distribution (standard deviation, skewness, and kurtosis). Moreover, you can plot the distributions to get a better sense of what you are working with.

Exercise 2

Find the mean and standard error (the standard deviation of the sampling distribution).

You are encouraged to use the values from the original distribution (data$mass) in order to understand how the mean and standard deviation are derived, as well as the effect that the sample size has on the distribution.

Exercise 3

Find the skewness and kurtosis of the distribution you generated before.

Exercise 4

Suppose that we ran an experiment: we took a sample of size 50 from the population, and they followed an organic food diet. Their average mass was 30.5. What is the Z-score for a mean of 30.5?

Exercise 5

What is the probability of drawing a sample of 50 with mean less than 30.5? Use the z-table if you feel you need to.

Exercise 6

Suppose that you repeated the experiment, this time with a larger sample size of 150, and you found the average mass to be 31. Compute the z-score for this mean.

Exercise 7

What is the probability of drawing a sample of 150 with mean less than 31?

Exercise 8

Suppose everybody adopted the diet from the experiment. Find the margin of error that covers 95% of the sample means.

Exercise 9

What would be our interval estimate that, with 95% likelihood, contains the population mean if everyone in our population adopted the organic diet?

Exercise 10

Find the interval estimates for 98% and 99% likelihood.




Data Science for Doctors – Part 3 : Distributions

Data science enhances people’s decision making. Doctors and researchers make critical decisions every day, so it is essential for them to have some basic knowledge of data science. This series aims to help people who work in or around the medical field to enhance their data science skills.

This is the third part of the series, and it covers the main distributions that you will use most of the time. This part was created to make sure that you have (or will have, after solving this set of exercises) the knowledge required for the parts to come. The distributions that we will see are:

1) Binomial Distribution: The binomial distribution fits repeated trials, each with a dichotomous outcome such as success-failure, healthy-diseased, heads-tails.

2) Normal Distribution: It is the most famous distribution; it is also assumed for many gene expression values.

3) T-Distribution: The t-distribution has many useful applications for testing hypotheses when the sample size is lower than thirty.

4) Chi-squared Distribution: The chi-squared distribution plays an important role in testing hypotheses about frequencies.

5) F-Distribution: The F-distribution is important for testing the equality of two variances.

Before proceeding, it might be helpful to look over the help pages for the choose, dbinom, pbinom, qbinom, rbinom, dnorm, pnorm, qnorm, rnorm, pt, qt, rt, dchisq, pchisq, qchisq, rchisq, df, pf, qf, and rf.
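All of these follow R’s d/p/q/r naming convention: for every distribution there is a density function (d prefix), a cumulative probability function (p), a quantile function (q), and a random generator (r). A minimal sketch with the standard normal, purely for orientation:

dnorm(0)      # density at 0
pnorm(1.96)   # P(X <= 1.96), a cumulative probability
qnorm(0.975)  # the 0.975 quantile, the inverse of pnorm
rnorm(5)      # five random draws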

Answers to the exercises are available here.

If you obtained a different (correct) answer than those listed on the solutions page, please feel free to post your answer as a comment on that page.

Exercise 1

Let X be binomially distributed with n = 100 and p = 0.3. Compute the following:
a) P(X = 34), P(X ≥ 34), and P(X ≤ 34)
b) P(30 ≤ X ≤ 60)
c) The quantiles x0.025 and x0.975

Exercise 2

Let X be normally distributed with mean = 3 and standard deviation = 1. Compute the following:
a) P(X < 2), P(2 ≤ X ≤ 4)
b) The quantiles x0.025, x0.5, and x0.975.

Exercise 3

Let T8 follow a t-distribution with 8 degrees of freedom. Compute the following:
a) P(T8 < 1), P(T8 > 2), P(-1 < T8 < 1).
b) The quantiles t0.025, t0.5, and t0.975. Can you justify the values of the quantiles?

Exercise 4

Compute the following for the chi-squared distribution with 5 degrees of freedom:
a) P(X²5 < 2), P(X²5 > 4), P(4 < X²5 < 6).
b) The quantiles g0.025, g0.5, and g0.975.

Exercise 5

Compute the following for the F6,3 distribution:
a) P(F6,3 < 2), P(F6,3 > 3), P(1 < F6,3 < 4).
b) The quantiles f0.025, f0.5, and f0.975.

Exercise 6

Generate 100 observations following a binomial distribution and plot them (if possible, on the same plot):
a) n = 20, p = 0.3
b) n = 20, p = 0.5
c) n = 20, p = 0.7

Exercise 7

Generate 100 observations following a normal distribution and plot them (if possible, on the same plot):
a) standard normal distribution ( N(0,1) )
b) mean = 0, s = 3
c) mean = 0, s = 7

Exercise 8

Generate 100 observations following a t-distribution and plot them (if possible, on the same plot):
a) df = 5
b) df = 10
c) df = 25

Exercise 9

Generate 100 observations following a chi-squared distribution and plot them (if possible, on the same plot):
a) df = 5
b) df = 10
c) df = 25

Exercise 10

Generate 100 observations following an F-distribution and plot them (if possible, on the same plot):
a) df1 = 3, df2 = 9
b) df1 = 9, df2 = 3
c) df1 = 15, df2 = 15




Data Science for Doctors – Part 3 : Distributions Solutions

Below are the solutions to these exercises on distributions.

####################
#                  #
#    Exercise 1    #
#                  #
####################

n <- 100
p <- 0.3

#a
dbinom(34, n, p) # P(X = 34)
## [1] 0.05788395
sum(dbinom(34:n, n, p)) # P(X >= 34)
## [1] 0.2207422
pbinom(34, n, p) # P(X <= 34)
## [1] 0.8371417
#b
sum(dbinom(30:60, n, p)) # P(30 <= X <= 60)
## [1] 0.5376603
#c
qbinom(0.025,n,p)
## [1] 21
qbinom(0.975,n,p)
## [1] 39
####################
#                  #
#    Exercise 2    #
#                  #
####################

m <- 3
s <- 1
#a

pnorm(2,m,s)
## [1] 0.1586553
pnorm(4,m,s) - pnorm(2,m,s)
## [1] 0.6826895
#b
qnorm(0.025,m,s)
## [1] 1.040036
qnorm(0.975,m,s)
## [1] 4.959964
qnorm(0.5,m,s)
## [1] 3
####################
#                  #
#    Exercise 3    #
#                  #
####################

df <- 8
#a
pt(1,df)
## [1] 0.8267032
1-pt(2,df)
## [1] 0.04025812
pt(1,df)-pt(-1,df)
## [1] 0.6534065
#b  
qt(0.025,df)
## [1] -2.306004
qt(0.5,df)
## [1] 0
qt(0.975,df)
## [1] 2.306004
# The t-distribution is symmetric about zero, which is why t0.5 = 0 and t0.975 = -t0.025.
####################
#                  #
#    Exercise 4    #
#                  #
####################

df <- 5
#a
pchisq(2,df)
## [1] 0.150855
1-pchisq(4,df)
## [1] 0.549416
# OR
pchisq(4,df,lower.tail = FALSE)
## [1] 0.549416
pchisq(6,df)-pchisq(4,df)
## [1] 0.243197
#b
qchisq(0.025, df, lower.tail=TRUE)
## [1] 0.8312116
qchisq(0.5, df, lower.tail=TRUE)
## [1] 4.35146
qchisq(0.025, df, lower.tail=FALSE)
## [1] 12.8325
####################
#                  #
#    Exercise 5    #
#                  #
####################

df_1 <- 6
df_2 <- 3

pf(2, df_1, df_2)
## [1] 0.6958948
1 - pf(3, df_1, df_2)
## [1] 0.1977977
pf(4, df_1, df_2) - pf(1, df_1, df_2)
## [1] 0.4039858
qf(0.025,df_1, df_2)
## [1] 0.1515427
qf(0.975,df_1, df_2)
## [1] 14.73472
####################
#                  #
#    Exercise 6    #
#                  #
####################

library(ggplot2) # needed for the plots below

# Note: the simulated data frame below reuses the name 'data', shadowing the
# diabetes data set loaded earlier.
data <- data.frame(case = factor(rep(c("A","B","C"), each=100)),
                  gen = c(rbinom(100, 20, 0.3), rbinom(100, 20, 0.5),
                             rbinom(100, 20, 0.7)))

ggplot(data, aes(x=gen, fill=case)) + geom_density(alpha=.3)
####################
#                  #
#    Exercise 7    #
#                  #
####################


data <- data.frame(case = factor(rep(c("A","B","C"), each=100)),
                   gen = c(rnorm(100, 0, 1), rnorm(100, 0, 3),
                           rnorm(100, 0, 7)))

ggplot(data, aes(x=gen, fill=case)) + geom_density(alpha=.3)
####################
#                  #
#    Exercise 8    #
#                  #
####################

data <- data.frame(case = factor(rep(c("A","B","C"), each = 100)),
                   gen = c(rt(100, 5), rt(100, 10),
                           rt(100, 25)))


ggplot(data, aes(x=gen, fill=case)) + geom_density(alpha=.3)
#Notice the variance, which decreases as the degrees of freedom increase 


####################
#                  #
#    Exercise 9    #
#                  #
####################

data <- data.frame(case = factor(rep(c("A","B","C"), each = 100)),
                   gen = c(rchisq(100, 5), rchisq(100, 10),
                           rchisq(100, 25)))

ggplot(data, aes(x=gen, fill=case)) + geom_density(alpha=.3)
# Observe that the graphs change from heavily right-skewed to more bell-shaped.


####################
#                  #
#    Exercise 10   #
#                  #
####################

data <- data.frame(case = factor(rep(c("A","B","C"), each = 100)),
                    gen = c(rf(100, 3, 9), rf(100, 9, 3),
                            rf(100,15, 15)))

ggplot(data, aes(x=gen, fill=case)) + geom_density(alpha=.3)+xlim(0, 10)



Data Science for Doctors – Part 2 : Descriptive Statistics Solutions

Below are the solutions to these exercises on descriptive statistics.

Learn more about descriptive statistics in the online courses Learn by Example: Statistics and Data Science in R (including 8 lectures specifically on descriptive statistics), and Introduction to R.

####################
#                  #
#    Exercise 1    #
#                  #
####################

mean(data[['mass']])
## [1] 31.99258
#OR

sum(data[['mass']])/length(data[['mass']])
## [1] 31.99258
####################
#                  #
#    Exercise 2    #
#                  #
####################

median(data[['mass']])
## [1] 32
#OR

(sort(data[['mass']])[length(data[['mass']])/2] + sort(data[['mass']])[length(data[['mass']])/2+1] )/2
## [1] 32
# This is a fairly long command, give yourself some time to make sure you understood everything.

####################
#                  #
#    Exercise 3    #
#                  #
####################

getmode <- function(v) {
  uniqv <- unique(v)
  uniqv[which.max(tabulate(match(v, uniqv)))] # the most frequent unique value
}
getmode(data[['mass']])
## [1] 32
####################
#                  #
#    Exercise 4    #
#                  #
####################

sd(data[["age"]])
## [1] 11.76023
#OR

num <- 0
for (i in 1:length(data$age)){
  num <- num + (data$age[i]-mean(data$age))^2
}
sqrt(num/length(data$age)) # divides by n, while sd() divides by n-1, hence the small difference
## [1] 11.75257
####################
#                  #
#    Exercise 5    #
#                  #
####################

var(data$mass)
## [1] 62.15998
#OR

num <- 0
for (i in 1:length(data$mass)){
  num <- num + (data$mass[i] - mean(data$mass))^2
}
num/length(data$mass) # divides by n, while var() divides by n-1, hence the small difference
## [1] 62.07905
####################
#                  #
#    Exercise 6    #
#                  #
####################

IQR(data[["age"]]) # interquartile range
## [1] 17
#OR

(sort(data[['age']])[length(data[['age']])*.75] - sort(data[['age']])[length(data[['age']])*.25] )
## [1] 17
####################
#                  #
#    Exercise 7    #
#                  #
####################

mad(data[['age']])
## [1] 10.3782
#OR

num <- 1:length(data$age)
for (i in 1:length(data$age)){
  num[i] <- abs(data$age[i]-median(data$age))
}
1.4826*median(num) # 1.4826 is the constant used when the data follow a normal distribution.
## [1] 10.3782
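# A hedged aside: the constant is also exposed as the constant argument of
# mad(), so the raw median absolute deviation can be obtained directly:
mad(data[['age']], constant = 1)
## [1] 7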
####################
#                  #
#    Exercise 8    #
#                  #
####################

cov(data$age,data$mass)
## [1] 3.36033
#OR

num <- 0
for (i in 1:length(data$age)){
  num <- num + (data$age[i] - mean(data$age)) * (data$mass[i] - mean(data$mass))
}
num/length(data$age) # divides by n, while cov() divides by n-1, hence the small difference
## [1] 3.355954
####################
#                  #
#    Exercise 9    #
#                  #
####################

# Note: Pearson measures the degree of linear relationship between two variables,
# while Spearman makes no such assumption (it only requires a monotonic relationship)

cor(data$age,data$mass,method = "spearman")
## [1] 0.1311859
# Pearson: used to measure the degree of the relationship between linearly related variables
cor(data$age,data$mass,method = "pearson")
## [1] 0.03624187
#OR

# Spearman
1-6*(sum((rank(data$age)-rank(data$mass))^2)/(length(data$age)*((length(data$age))^2-1)))
## [1] 0.132268
# The small difference from cor(..., method = "spearman") arises because this
# textbook formula does not correct for ties.
# Pearson #1
num <- 0
den <- 0
x <- 0
y <- 0
for ( i in 1:length(data$age)){
  num <- num + (data$age[i] - mean(data$age))*(data$mass[i] - mean(data$mass))
  x <- x + (data$age[i] - mean(data$age))^2
  y <- y + (data$mass[i] - mean(data$mass))^2
}
den <- sqrt(x*y)
num/den
## [1] 0.03624187
# Pearson #2
cov(data$age,data$mass)/(sd(data$age)*sd(data$mass))
## [1] 0.03624187
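# A quick illustration of the difference (toy data, not from the data set):
# a perfectly monotonic but non-linear relation has Spearman correlation
# exactly 1, while Pearson falls below 1.
x <- 1:20
cor(x, x^3, method = "pearson")  # less than 1
cor(x, x^3, method = "spearman") # exactly 1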
####################
#                  #
#    Exercise 10   #
#                  #
####################

summary(data)
##       preg             plas            pres             skin      
##  Min.   : 0.000   Min.   :  0.0   Min.   :  0.00   Min.   : 0.00  
##  1st Qu.: 1.000   1st Qu.: 99.0   1st Qu.: 62.00   1st Qu.: 0.00  
##  Median : 3.000   Median :117.0   Median : 72.00   Median :23.00  
##  Mean   : 3.845   Mean   :120.9   Mean   : 69.11   Mean   :20.54  
##  3rd Qu.: 6.000   3rd Qu.:140.2   3rd Qu.: 80.00   3rd Qu.:32.00  
##  Max.   :17.000   Max.   :199.0   Max.   :122.00   Max.   :99.00  
##       test            mass            pedi             age       
##  Min.   :  0.0   Min.   : 0.00   Min.   :0.0780   Min.   :21.00  
##  1st Qu.:  0.0   1st Qu.:27.30   1st Qu.:0.2437   1st Qu.:24.00  
##  Median : 30.5   Median :32.00   Median :0.3725   Median :29.00  
##  Mean   : 79.8   Mean   :31.99   Mean   :0.4719   Mean   :33.24  
##  3rd Qu.:127.2   3rd Qu.:36.60   3rd Qu.:0.6262   3rd Qu.:41.00  
##  Max.   :846.0   Max.   :67.10   Max.   :2.4200   Max.   :81.00  
##      class          class.fac  
##  Min.   :0.000   Negative:500  
##  1st Qu.:0.000   Positive:268  
##  Median :0.000                 
##  Mean   :0.349                 
##  3rd Qu.:1.000                 
##  Max.   :1.000
str(data)
## 'data.frame':	768 obs. of  10 variables:
##  $ preg     : int  6 1 8 1 0 5 3 10 2 8 ...
##  $ plas     : int  148 85 183 89 137 116 78 115 197 125 ...
##  $ pres     : int  72 66 64 66 40 74 50 0 70 96 ...
##  $ skin     : int  35 29 0 23 35 0 32 0 45 0 ...
##  $ test     : int  0 0 0 94 168 0 88 0 543 0 ...
##  $ mass     : num  33.6 26.6 23.3 28.1 43.1 25.6 31 35.3 30.5 0 ...
##  $ pedi     : num  0.627 0.351 0.672 0.167 2.288 ...
##  $ age      : int  50 31 32 21 33 30 26 29 53 54 ...
##  $ class    : int  1 0 1 0 1 0 1 0 1 1 ...
##  $ class.fac: Factor w/ 2 levels "Negative","Positive": 2 1 2 1 2 1 2 1 2 2 ...
rcorr(as.matrix(data[-length(data)]),type = "spearman")
##        preg plas  pres  skin  test mass  pedi   age class
## preg   1.00 0.13  0.19 -0.09 -0.13 0.00 -0.04  0.61  0.20
## plas   0.13 1.00  0.24  0.06  0.21 0.23  0.09  0.29  0.48
## pres   0.19 0.24  1.00  0.13 -0.01 0.29  0.03  0.35  0.14
## skin  -0.09 0.06  0.13  1.00  0.54 0.44  0.18 -0.07  0.09
## test  -0.13 0.21 -0.01  0.54  1.00 0.19  0.22 -0.11  0.07
## mass   0.00 0.23  0.29  0.44  0.19 1.00  0.14  0.13  0.31
## pedi  -0.04 0.09  0.03  0.18  0.22 0.14  1.00  0.04  0.18
## age    0.61 0.29  0.35 -0.07 -0.11 0.13  0.04  1.00  0.31
## class  0.20 0.48  0.14  0.09  0.07 0.31  0.18  0.31  1.00
## 
## n= 768 
## 
## 
## P
##       preg   plas   pres   skin   test   mass   pedi   age    class 
## preg         0.0003 0.0000 0.0182 0.0004 0.9971 0.2313 0.0000 0.0000
## plas  0.0003        0.0000 0.0965 0.0000 0.0000 0.0114 0.0000 0.0000
## pres  0.0000 0.0000        0.0004 0.8514 0.0000 0.4057 0.0000 0.0000
## skin  0.0182 0.0965 0.0004        0.0000 0.0000 0.0000 0.0643 0.0129
## test  0.0004 0.0000 0.8514 0.0000        0.0000 0.0000 0.0015 0.0656
## mass  0.9971 0.0000 0.0000 0.0000 0.0000        0.0000 0.0003 0.0000
## pedi  0.2313 0.0114 0.4057 0.0000 0.0000 0.0000        0.2349 0.0000
## age   0.0000 0.0000 0.0000 0.0643 0.0015 0.0003 0.2349        0.0000
## class 0.0000 0.0000 0.0000 0.0129 0.0656 0.0000 0.0000 0.0000
rcorr(as.matrix(data[-length(data)]),type = "pearson")
##        preg plas pres  skin  test mass  pedi   age class
## preg   1.00 0.13 0.14 -0.08 -0.07 0.02 -0.03  0.54  0.22
## plas   0.13 1.00 0.15  0.06  0.33 0.22  0.14  0.26  0.47
## pres   0.14 0.15 1.00  0.21  0.09 0.28  0.04  0.24  0.07
## skin  -0.08 0.06 0.21  1.00  0.44 0.39  0.18 -0.11  0.07
## test  -0.07 0.33 0.09  0.44  1.00 0.20  0.19 -0.04  0.13
## mass   0.02 0.22 0.28  0.39  0.20 1.00  0.14  0.04  0.29
## pedi  -0.03 0.14 0.04  0.18  0.19 0.14  1.00  0.03  0.17
## age    0.54 0.26 0.24 -0.11 -0.04 0.04  0.03  1.00  0.24
## class  0.22 0.47 0.07  0.07  0.13 0.29  0.17  0.24  1.00
## 
## n= 768 
## 
## 
## P
##       preg   plas   pres   skin   test   mass   pedi   age    class 
## preg         0.0003 0.0000 0.0236 0.0416 0.6246 0.3535 0.0000 0.0000
## plas  0.0003        0.0000 0.1124 0.0000 0.0000 0.0001 0.0000 0.0000
## pres  0.0000 0.0000        0.0000 0.0137 0.0000 0.2534 0.0000 0.0715
## skin  0.0236 0.1124 0.0000        0.0000 0.0000 0.0000 0.0016 0.0383
## test  0.0416 0.0000 0.0137 0.0000        0.0000 0.0000 0.2432 0.0003
## mass  0.6246 0.0000 0.0000 0.0000 0.0000        0.0000 0.3158 0.0000
## pedi  0.3535 0.0001 0.2534 0.0000 0.0000 0.0000        0.3530 0.0000
## age   0.0000 0.0000 0.0000 0.0016 0.2432 0.3158 0.3530        0.0000
## class 0.0000 0.0000 0.0715 0.0383 0.0003 0.0000 0.0000 0.0000



Data Science for Doctors – Part 2 : Descriptive Statistics

Data science enhances people’s decision making. Doctors and researchers make critical decisions every day, so it is essential for them to have some basic knowledge of data science. This series aims to help people who work in or around the medical field to enhance their data science skills.

We will work with a health-related database, the famous “Pima Indians Diabetes Database”. It was generously donated by Vincent Sigillito from Johns Hopkins University. Please find further information regarding the dataset here.

This is the second part of the series, and it contains the main descriptive statistics measures you will use most of the time. These measures are divided into measures of central tendency and measures of spread. Moreover, most of the exercises can be solved with built-in functions, but I would encourage you to also solve them “by hand”, because once you know the mechanics of a measure, you become much more confident in using it. On the “solutions” page I include both methods, so even if you didn’t solve them by hand, it would be nice to check them out.

Before proceeding, it might be helpful to look over the help pages for the mean, median, sort, unique, tabulate, sd, var, IQR, mad, abs, cov, cor, summary, str, and rcorr.

You may also need to load the Hmisc library.
install.packages('Hmisc')
library(Hmisc)

In case you haven’t solved part 1, run the following script to load the prerequisites for this part.

Answers to the exercises are available here.

If you obtained a different (correct) answer than those listed on the solutions page, please feel free to post your answer as a comment on that page.

Exercise 1

Find the mean of the mass variable.

Exercise 2

Find the median of the mass variable.

Exercise 3

Find the mode of the mass.

Exercise 4

Find the standard deviation of the age variable.

Learn more about descriptive statistics in the online courses Learn by Example: Statistics and Data Science in R (including 8 lectures specifically on descriptive statistics), and Introduction to R.

Exercise 5

Find the variance of the mass variable.

Unlike the popular mean/standard deviation combination, the interquartile range and the median absolute deviation are not sensitive to the presence of outliers. Of the two, the MAD is often recommended because it can be used to approximate the standard deviation.
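A minimal sketch of that robustness claim (the toy vector below is purely illustrative, not taken from the data set): one extreme value inflates the standard deviation, while the IQR and the MAD barely move.

x <- c(18, 21, 22, 24, 25, 26, 29) # a small toy sample
y <- c(x, 95)                      # the same sample plus one extreme outlier
sd(x); sd(y)                       # the standard deviation explodes
IQR(x); IQR(y)                     # the interquartile range barely changes
mad(x); mad(y)                     # the MAD barely changes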

Exercise 6

Find the interquartile range of the age variable.

Exercise 7

Find the median absolute deviation of the age variable. Assume that the age follows a normal distribution.

Exercise 8
Find the covariance of the variables age and mass.

Exercise 9

Find the Spearman and Pearson correlations of the variables age and mass.

Exercise 10

Print the summary statistics and the structure of the data set. Moreover, construct the correlation matrix of the data set.




Data Science for Doctors – Part 1 : Data Display

Data science enhances people’s decision making. Doctors and researchers make critical decisions every day, so it is essential for them to have some basic knowledge of data science. This series aims to help people who work in or around the medical field to enhance their data science skills.

We will work with a health-related database, the famous “Pima Indians Diabetes Database”. It was generously donated by Vincent Sigillito from Johns Hopkins University. Please find further information regarding the dataset here.

This is the first part of the series, and it is about data display.

Before proceeding, it might be helpful to look over the help pages for the table, pie, geom_bar, coord_polar, barplot, stripchart, geom_jitter, density, geom_density, hist, geom_histogram, boxplot, geom_boxplot, qqnorm, qqline, plot, and geom_point.

You may also need to load the ggplot2 library.
install.packages('ggplot2')
library(ggplot2)

Please run the code below in order to load the data set and transform it into a proper data frame format:

url <- "https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data"
data <- read.table(url, fileEncoding="UTF-8", sep=",")
names <- c('preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class')
colnames(data) <- names

Answers to the exercises are available here.

If you obtained a different (correct) answer than those listed on the solutions page, please feel free to post your answer as a comment on that page.

Exercise 1

Create a frequency table of the class variable.

Exercise 2

data[['class.fac']] <- factor(data[['class']], levels=c(0,1), labels=c("Negative","Positive"))

Create a pie chart of the class.fac variable.

Exercise 3

Create a bar plot for the age variable.

Exercise 4

Create a strip chart for the mass against class.fac.

Exercise 5

Create a density plot for the preg variable.

Exercise 6

Create a histogram for the preg variable.

Exercise 7

Create a boxplot for the age against class.fac.

Exercise 8

Create a normal QQ plot and a line which passes through the first and third quartiles.

Exercise 9

Create a scatter plot of the age variable against the mass variable.

Exercise 10

Create scatter plots for every variable of the data set against every variable of the data set on a single window.
hint: it is quite simple, don’t overthink it.




Data Science for Doctors – Part 1 : Data Display Solutions

Below are the solutions to these exercises on data display.

####################
#                  #
#    Exercise 1    #
#                  #
####################

table(data['class'])
## 
##   0   1 
## 500 268
####################
#                  #
#    Exercise 2    #
#                  #
####################

pie(table(data['class.fac']))
# OR
ggplot(data, aes(x = factor(1), fill = class.fac)) +
  geom_bar(width = 1) +
  coord_polar(theta = "y") +
  labs(x = " ", y = " ")
####################
#                  #
#    Exercise 3    #
#                  #
####################

barplot(data[['age']])
#OR

ggplot(data, aes(age)) + geom_bar() +
  labs(x = "Age", y = "# of Candidates")
####################
#                  #
#    Exercise 4    #
#                  #
####################

stripchart(data[["mass"]] ~ data[['class.fac']], method="jitter")
#OR

ggplot(data, aes(mass,class.fac)) + geom_jitter() +
  labs(x = "BMI", y = "Diagnosis")
####################
#                  #
#    Exercise 5    #
#                  #
####################

plot(density(data$preg))
#OR

ggplot(data, aes(preg)) +
  geom_density() +
  labs(x = "# of pregnancies")
####################
#                  #
#    Exercise 6    #
#                  #
####################

hist(data[['preg']])
#OR

ggplot(data, aes(preg)) + geom_histogram() +
  labs(x = "# of pregnancies", y = "# of Candidates")
####################
#                  #
#    Exercise 7    #
#                  #
####################

boxplot(data[['age']] ~ data[['class.fac']])
#OR

ggplot(data, aes(class.fac,age)) + geom_boxplot() +
  labs(x = "Diagnosis", y = "Age")
####################
#                  #
#    Exercise 8    #
#                  #
####################

qqnorm(data[["age"]])
qqline(data[["age"]])
####################
#                  #
#    Exercise 9    #
#                  #
####################

plot(data$age,data$mass)
#OR

ggplot(data, aes(age, mass)) +
  geom_point() +
  labs(x = "Age", y = "BMI")
####################
#                  #
#    Exercise 10   #
#                  #
####################

plot(data)