Data Science for Doctors – Part 4 : Inferential Statistics (1/5) Solutions

Below are the solutions to these exercises on inferential statistics.

####################
#                  #
#    Exercise 1    #
#                  #
####################

iter <- 10000
means <- rep(NA, iter)

for (i in 1:iter){
  sam_50 <- sample(data$mass, 50)
  means[i] <- mean(sam_50)
}

hist(means)
hist(data$mass)
####################
#                  #
#    Exercise 2    #
#                  #
####################

mean(data$mass)
## [1] 31.99258
sd(data$mass)/sqrt(50)
## [1] 1.114989
#OR 
mean(means)
## [1] 31.98233
sd(means)
## [1] 1.081871
####################
#                  #
#    Exercise 3    #
#                  #
####################

library(moments)
skewness(means)
## [1] -0.0333564
#  slight negative skewness: the sampling distribution is almost symmetric, with a marginally longer left tail
kurtosis(means)
## [1] 3.064367
# The kurtosis is close to the expected value 3.
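
To make the two measures above less of a black box, here is a minimal base-R sketch of the moment-based definitions: skewness as m3 / m2^1.5 and kurtosis as m4 / m2^2, where m_k is the k-th central moment computed with divisor n. As far as I understand, this matches what the moments package computes, but treat that as an assumption. The helper names (central_moment, skew_by_hand, kurt_by_hand) and the toy vectors are illustrative and not part of the original solutions.

```r
# k-th central moment with divisor n (not n - 1)
central_moment <- function(x, k) mean((x - mean(x))^k)

# Moment-based skewness and (non-excess) kurtosis
skew_by_hand <- function(x) central_moment(x, 3) / central_moment(x, 2)^1.5
kurt_by_hand <- function(x) central_moment(x, 4) / central_moment(x, 2)^2

symmetric    <- c(1, 2, 3, 4, 5)    # toy data, symmetric around 3
right_tailed <- c(1, 2, 3, 4, 20)   # toy data with one large value

skew_sym   <- skew_by_hand(symmetric)     # symmetric data: skewness of 0
skew_right <- skew_by_hand(right_tailed)  # long right tail: positive skewness
kurt_sym   <- kurt_by_hand(symmetric)

print(c(skew_sym = skew_sym, skew_right = skew_right, kurt_sym = kurt_sym))
```

A positive skewness indicates a longer right tail and a negative one a longer left tail; a kurtosis of 3 is the value for a normal distribution.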

####################
#                  #
#    Exercise 4    #
#                  #
####################

z = (30.5-mean(data$mass))/(sd(data$mass)/sqrt(50))
z
## [1] -1.338649
####################
#                  #
#    Exercise 5    #
#                  #
####################

pnorm(z)
## [1] 0.09034253
####################
#                  #
#    Exercise 6    #
#                  #
####################

z = (31-mean(data$mass))/(sd(data$mass)/sqrt(150))

####################
#                  #
#    Exercise 7    #
#                  #
####################

pnorm(z)
## [1] 0.06154952
####################
#                  #
#    Exercise 8    #
#                  #
####################

# margin of error = z for 95% confidence (1.96) times the standard error for n = 150
1.96*sd(data$mass)/sqrt(150)
## approximately 1.26
####################
#                  #
#    Exercise 9    #
#                  #
####################

z = 1.96
low <- 31 - z*sd(data$mass)/sqrt(250)
high <- 31 + z*sd(data$mass)/sqrt(250)
low;high
## [1] 30.02267
## [1] 31.97733
####################
#                  #
#    Exercise 10   #
#                  #
####################

z = 2.33
low <- 31 - z*sd(data$mass)/sqrt(250)
high <- 31 + z*sd(data$mass)/sqrt(250)
low;high
## [1] 29.83817
## [1] 32.16183
z = 2.58
low <- 31 - z*sd(data$mass)/sqrt(250)
high <- 31 + z*sd(data$mass)/sqrt(250)
low;high
## [1] 29.71351
## [1] 32.28649



Data Science for Doctors – Part 4 : Inferential Statistics (1/5)

Data science enhances people’s decision making. Doctors and researchers make critical decisions every day, so it is essential for them to have some basic knowledge of data science. This series aims to help people working in or around the medical field to enhance their data science skills.

We will work with a health-related database, the famous “Pima Indians Diabetes Database”. It was generously donated by Vincent Sigillito from Johns Hopkins University. Please find further information regarding the dataset here.

This is the fourth part of the series, and it partially covers the subject of inferential statistics. Researchers rarely have the capability of testing many patients, or of trying a new treatment on many patients; therefore, making inferences from a sample is a necessary skill to have. This is where inferential statistics comes into play.

Before proceeding, it might be helpful to look over the help pages for sample, mean, sd, sort, and pnorm. Moreover, it is crucial to be familiar with the Central Limit Theorem.

You also may need to load the moments library.
install.packages("moments")
library(moments)

Please run the code below in order to load the data set and transform it into a proper data frame format:

url <- "https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data"
data <- read.table(url, fileEncoding="UTF-8", sep=",")
names <- c('preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class')
colnames(data) <- names
data <- data[data$mass != 0, ]  # drop rows where mass is recorded as 0; also safe when no such rows exist

Answers to the exercises are available here.

If you obtained a different (correct) answer than those listed on the solutions page, please feel free to post your answer as a comment on that page.

Exercise 1

Generate (10000 iterations) a sampling distribution of sample size 50, for the variable mass.

You are encouraged to experiment with different sample sizes and numbers of iterations in order to see the impact they have on the distribution (standard deviation, skewness, and kurtosis). Moreover, you can plot the distributions to get a better sense of what you are working on.

Exercise 2

Find the mean and standard error (standard deviation) of the sampling distribution.

You are encouraged to use the values from the original distribution (data$mass) in order to understand how the mean and standard deviation are derived, as well as the effect that the sample size has on the distribution.
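
To illustrate the role of the sample size, the sketch below compares the empirical standard deviation of simulated sample means with the theoretical standard error, sd(population) / sqrt(n). It uses a synthetic, right-skewed population (an assumption standing in for data$mass, so that the example is self-contained), and the helper name sampling_means is hypothetical.

```r
set.seed(42)  # fixed seed so the sketch is reproducible

# Synthetic stand-in for data$mass (assumption): mean about 32, sd about 8
population <- rgamma(5000, shape = 16, rate = 0.5)

# Draw `iter` samples of size n (without replacement) and return their means
sampling_means <- function(x, n, iter = 2000) {
  replicate(iter, mean(sample(x, n)))
}

sizes <- c(10, 50, 200)
empirical_se   <- sapply(sizes, function(n) sd(sampling_means(population, n)))
theoretical_se <- sd(population) / sqrt(sizes)

result <- data.frame(n = sizes, empirical_se, theoretical_se)
print(result)
```

The two columns should be close to each other, and both shrink as n grows: quadrupling the sample size roughly halves the standard error.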

Exercise 3

Find the skewness and kurtosis of the distribution you generated before.

Exercise 4

Suppose that we ran an experiment in which we took a sample of size 50 from the population and had them follow an organic food diet. Their average mass was 30.5. What is the z-score for a mean of 30.5?

Exercise 5

What is the probability of drawing a sample of 50 with a mean less than 30.5? Use the z-table if you feel you need to.
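
As a rough illustration of how a z-table lookup relates to R, the sketch below shows that looking up a z-score in the standard normal table (pnorm(z)) gives the same probability as evaluating the normal distribution of the sample mean directly. All numbers (mu, sigma, n, xbar) are hypothetical round values, not the ones computed from the data set.

```r
mu    <- 32     # hypothetical population mean
sigma <- 8      # hypothetical population standard deviation
n     <- 50     # sample size
xbar  <- 30.5   # observed sample mean

se <- sigma / sqrt(n)        # standard error of the sample mean
z  <- (xbar - mu) / se       # z-score of the observed mean

p_from_z <- pnorm(z)                          # "z-table" lookup
p_direct <- pnorm(xbar, mean = mu, sd = se)   # same probability without standardizing

print(c(z = z, p_from_z = p_from_z, p_direct = p_direct))
```

Standardizing first is only a convenience for table lookups; pnorm accepts the mean and standard deviation directly.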

Exercise 6

Suppose that you ran the experiment again, but with a larger sample of size 150, and you found the average mass to be 31. Compute the z-score for this mean.

Exercise 7

What is the probability of drawing a sample of 150 with mean less than 31?

Exercise 8

Suppose everybody adopted the diet of the experiment. Find the margin of error for 95% of sample means.

Exercise 9

What would be our interval estimate that is 95% likely to contain the population mean if everyone in our population started adopting the organic diet?

Exercise 10

Find the interval estimate for 98% and 99% likelihood.
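
The interval estimates in Exercises 8 to 10 all follow the same recipe: sample mean plus or minus a critical z value times the standard error. A minimal sketch of that recipe is given below; the function name z_interval and the input values (xbar = 31, sigma = 8, n = 100) are hypothetical, and the critical value is taken from qnorm rather than typed in from a table.

```r
# Two-sided z interval for a mean, assuming a known standard deviation sigma
z_interval <- function(xbar, sigma, n, level = 0.95) {
  z      <- qnorm(1 - (1 - level) / 2)   # e.g. about 1.96 for level = 0.95
  margin <- z * sigma / sqrt(n)          # margin of error
  c(low = xbar - margin, high = xbar + margin, margin = margin)
}

levels <- c(0.95, 0.98, 0.99)
intervals <- t(sapply(levels, function(l) z_interval(31, 8, 100, l)))
rownames(intervals) <- paste0(levels * 100, "%")
print(intervals)
```

A higher confidence level needs a larger critical value, so the interval widens as the level goes from 95% to 99%.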




Data Science for Doctors – Part 3 : Distributions

Data science enhances people’s decision making. Doctors and researchers make critical decisions every day, so it is essential for them to have some basic knowledge of data science. This series aims to help people working in or around the medical field to enhance their data science skills.

This is the third part of the series. It covers the main distributions that you will use most of the time. This part is intended to make sure that you have (or will have after solving this set of exercises) the knowledge needed for the parts to come. The distributions that we will see are:

1) Binomial Distribution: The binomial distribution fits repeated trials, each with a dichotomous outcome such as success-failure, healthy-diseased, or heads-tails.

2) Normal Distribution: It is the most famous distribution, and it is also assumed for many gene expression values.

3) T-Distribution: The t-distribution has many useful applications for testing hypotheses when the sample size is lower than thirty.

4) Chi-squared Distribution: The chi-squared distribution plays an important role in testing hypotheses about frequencies.

5) F-Distribution: The F-distribution is important for testing the equality of two variances.

Before proceeding, it might be helpful to look over the help pages for choose, dbinom, pbinom, qbinom, rbinom, dnorm, pnorm, qnorm, rnorm, pt, qt, rt, dchisq, pchisq, qchisq, rchisq, df, pf, qf, and rf.
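
The functions listed above follow R's naming convention for distributions: the prefix d gives the density (or probability mass), p the cumulative probability, q the quantile, and r random draws. A short sketch of how they relate, using arbitrary illustrative parameters (n = 10, p = 0.4, k = 3):

```r
n <- 10; p <- 0.4; k <- 3   # arbitrary illustrative values

# p* is the running sum of d* for a discrete distribution
cdf_by_sum <- sum(dbinom(0:k, n, p))   # P(X <= k) from the mass function
cdf_direct <- pbinom(k, n, p)          # P(X <= k) directly

# Upper tail: P(X >= k) = 1 - P(X <= k - 1)
upper_by_sum <- sum(dbinom(k:n, n, p))
upper_direct <- pbinom(k - 1, n, p, lower.tail = FALSE)

# q* inverts p* for a continuous distribution
q_roundtrip <- qnorm(pnorm(1.5))

# r* generates random observations
set.seed(1)
draws <- rbinom(1000, n, p)

print(c(cdf_by_sum = cdf_by_sum, cdf_direct = cdf_direct,
        upper_by_sum = upper_by_sum, upper_direct = upper_direct,
        q_roundtrip = q_roundtrip, mean_of_draws = mean(draws)))
```

Note the off-by-one in the upper tail of a discrete distribution: P(X ≥ k) needs pbinom(k - 1, ...), not pbinom(k, ...).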

Answers to the exercises are available here.

If you obtained a different (correct) answer than those listed on the solutions page, please feel free to post your answer as a comment on that page.

Exercise 1

Let X be binomially distributed with n = 100 and p = 0.3. Compute the following:
a) P(X = 34), P(X ≥ 34), and P(X ≤ 34)
b) P(30 ≤ X ≤ 60)
c) The quantiles x0.025 and x0.975

Exercise 2

Let X be normally distributed with mean = 3 and standard deviation = 1. Compute the following:
a) P(X < 2), P(2 ≤ X ≤ 4)
b) The quantiles x0.025, x0.5, and x0.975.

Exercise 3

Let T8 follow a t-distribution with 8 degrees of freedom. Compute the following:
a) P(T8 < 1), P(T8 > 2), P(-1 < T8 < 1).
b) The quantiles t0.025, t0.5, and t0.975. Can you justify the values of the quantiles?

Exercise 4

Compute the following for the chi-squared distribution with 5 degrees of freedom:
a) P(χ²₅ < 2), P(χ²₅ > 4), P(4 < χ²₅ < 6).
b) The quantiles g0.025, g0.5, and g0.975.

Exercise 5

Compute the following for the F6,3 distribution:
a) P(F6,3 < 2), P(F6,3 > 3), P(1 < F6,3 < 4).
b) The quantiles f0.025, f0.5, and f0.975.

Exercise 6

Generate 100 observations following the binomial distribution and plot them (if possible, on the same plot):
a) n = 20, p = 0.3
b) n = 20, p = 0.5
c) n = 20, p = 0.7

Exercise 7

Generate 100 observations following the normal distribution and plot them (if possible, on the same plot):
a) standard normal distribution ( N(0,1) )
b) mean = 0, s = 3
c) mean = 0, s = 7

Exercise 8

Generate 100 observations following the t-distribution and plot them (if possible, on the same plot):
a) df = 5
b) df = 10
c) df = 25

Exercise 9

Generate 100 observations following the chi-squared distribution and plot them (if possible, on the same plot):
a) df = 5
b) df = 10
c) df = 25

Exercise 10

Generate 100 observations following the F-distribution and plot them (if possible, on the same plot):
a) df1 = 3, df2 = 9
b) df1 = 9, df2 = 3
c) df1 = 15, df2 = 15




Data Science for Doctors – Part 3 : Distributions Solutions

Below are the solutions to these exercises on distributions.

####################
#                  #
#    Exercise 1    #
#                  #
####################

n <- 100
p <- 0.3

#a
dbinom(34, n, p)
## [1] 0.05788395
sum(dbinom(34:n, n, p))
## [1] 0.2207422
pbinom(34, n, p)
## [1] 0.8371417
#b
sum(dbinom(30:60, n, p))
## [1] 0.5376603
#c
qbinom(0.025,n,p)
## [1] 21
qbinom(0.975,n,p)
## [1] 39
####################
#                  #
#    Exercise 2    #
#                  #
####################

m <- 3
s <- 1
#a

pnorm(2,m,s)
## [1] 0.1586553
pnorm(4,m,s) - pnorm(2,m,s)
## [1] 0.6826895
#b
qnorm(0.025,m,s)
## [1] 1.040036
qnorm(0.975,m,s)
## [1] 4.959964
qnorm(0.5,m,s)
## [1] 3
####################
#                  #
#    Exercise 3    #
#                  #
####################

df <- 8
#a
pt(1,df)
## [1] 0.8267032
1-pt(2,df)
## [1] 0.04025812
pt(1,df)-pt(-1,df)
## [1] 0.6534065
#b  
qt(0.025,df)
## [1] -2.306004
qt(0.5,df)
## [1] 0
qt(0.975,df)
## [1] 2.306004
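
The exercise also asks for a justification of the quantile values. One way to see it, sketched below, is that the t-distribution is symmetric around zero, so its median is 0 and the 2.5% and 97.5% quantiles are mirror images of each other. The comparison with the standard normal quantile is an extra illustration, not part of the original exercise.

```r
df <- 8

lower    <- qt(0.025, df)   # 2.5% quantile
median_t <- qt(0.5, df)     # median
upper    <- qt(0.975, df)   # 97.5% quantile

# Symmetry around zero: the lower quantile is the negative of the upper one
symmetric <- isTRUE(all.equal(lower, -upper))

# Heavier tails than the standard normal: the t quantile lies further out
z_upper <- qnorm(0.975)

print(c(lower = lower, median = median_t, upper = upper, z_upper = z_upper))
```

Because of the heavier tails, the t quantile for 8 degrees of freedom is larger in absolute value than the familiar normal value of about 1.96.
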
####################
#                  #
#    Exercise 4    #
#                  #
####################

df <- 5
#a
pchisq(2,df)
## [1] 0.150855
1-pchisq(4,df)
## [1] 0.549416
# OR
pchisq(4,df,lower.tail = FALSE)
## [1] 0.549416
pchisq(6,df)-pchisq(4,df)
## [1] 0.243197
#b
qchisq(0.025, df, lower.tail=TRUE)
## [1] 0.8312116
qchisq(0.5, df, lower.tail=TRUE)
## [1] 4.35146
qchisq(0.975, df, lower.tail=TRUE)
## [1] 12.8325
####################
#                  #
#    Exercise 5    #
#                  #
####################

df_1 <- 6
df_2 <- 3

pf(2, df_1, df_2)
## [1] 0.6958948
1 - pf(3, df_1, df_2)
## [1] 0.1977977
pf(4, df_1, df_2) - pf(1, df_1, df_2)
## [1] 0.4039858
#b
qf(0.025,df_1, df_2)
## [1] 0.1515427
qf(0.5, df_1, df_2)   # median of the F(6,3) distribution
qf(0.975,df_1, df_2)
## [1] 14.73472
####################
#                  #
#    Exercise 6    #
#                  #
####################

library(ggplot2)  # needed for the ggplot() calls in Exercises 6 to 10

data <- data.frame(case = factor(rep(c("A","B","C"), each=100)),
                  gen = c(rbinom(100, 20, 0.3), rbinom(100, 20, 0.5),
                             rbinom(100, 20, 0.7)))

ggplot(data, aes(x=gen, fill=case)) + geom_density(alpha=.3)
####################
#                  #
#    Exercise 7    #
#                  #
####################


data <- data.frame(case = factor(rep(c("A","B","C"), each=100)),
                   gen = c(rnorm(100, 0, 1), rnorm(100, 0, 3),
                           rnorm(100, 0, 7)))

ggplot(data, aes(x=gen, fill=case)) + geom_density(alpha=.3)
####################
#                  #
#    Exercise 8    #
#                  #
####################

data <- data.frame(case = factor(rep(c("A","B","C"), each = 100)),
                   gen = c(rt(100, 5), rt(100, 10),
                           rt(100, 25)))


ggplot(data, aes(x=gen, fill=case)) + geom_density(alpha=.3)
#Notice the variance, which decreases as the degrees of freedom increase 


####################
#                  #
#    Exercise 9    #
#                  #
####################

data <- data.frame(case = factor(rep(c("A","B","C"), each = 100)),
                   gen = c(rchisq(100, 5), rchisq(100, 10),
                           rchisq(100, 25)))

ggplot(data, aes(x=gen, fill=case)) + geom_density(alpha=.3)
#Observe that the densities change from heavily right-skewed to more bell-shaped as the degrees of freedom increase.


####################
#                  #
#    Exercise 10   #
#                  #
####################

data <- data.frame(case = factor(rep(c("A","B","C"), each = 100)),
                    gen = c(rf(100, 3, 9), rf(100, 9, 3),
                            rf(100,15, 15)))

ggplot(data, aes(x=gen, fill=case)) + geom_density(alpha=.3)+xlim(0, 10)



Data Science for Doctors – Part 2 : Descriptive Statistics Solutions

Below are the solutions to these exercises on descriptive statistics.

Learn more about descriptive statistics in the online courses Learn by Example: Statistics and Data Science in R (including 8 lectures specifically on descriptive statistics), and Introduction to R.

####################
#                  #
#    Exercise 1    #
#                  #
####################

mean(data[['mass']])
## [1] 31.99258
#OR

sum(data[['mass']])/length(data[['mass']])
## [1] 31.99258
####################
#                  #
#    Exercise 2    #
#                  #
####################

median(data[['mass']])
## [1] 32
#OR

(sort(data[['mass']])[length(data[['mass']])/2] + sort(data[['mass']])[length(data[['mass']])/2+1] )/2
## [1] 32
# This is a fairly long command, give yourself some time to make sure you understood everything.

####################
#                  #
#    Exercise 3    #
#                  #
####################

getmode <- function(v) {
  uniqv <- unique(v)
  uniqv[which.max(tabulate(match(v, uniqv)))]
}
getmode(data[['mass']])
## [1] 32
####################
#                  #
#    Exercise 4    #
#                  #
####################

sd(data[["age"]])
## [1] 11.76023
#OR

num <- 0
for (i in 1:length(data$age)){
  num <- num + (data$age[i]-mean(data$age))^2
}
sqrt(num/(length(data$age)-1))  # the sample standard deviation divides by n - 1, as sd() does
## [1] 11.76023
####################
#                  #
#    Exercise 5    #
#                  #
####################

var(data$mass)
## [1] 62.15998
#OR

num <- 0
for (i in 1:length(data$mass)){
  num <- num + (data$mass[i] - mean(data$mass))^2
}
num/(length(data$mass)-1)  # the sample variance divides by n - 1, as var() does
## [1] 62.15998
####################
#                  #
#    Exercise 6    #
#                  #
####################

IQR(data[["age"]]) # interquartile range
## [1] 17
#OR

(sort(data[['age']])[length(data[['age']])*.75] - sort(data[['age']])[length(data[['age']])*.25] )
## [1] 17
####################
#                  #
#    Exercise 7    #
#                  #
####################

mad(data[['age']])
## [1] 10.3782
#OR

num <- 1:length(data$age)
for (i in 1:length(data$age)){
  num[i] <- abs(data$age[i]-median(data$age))
}
1.4826*median(num) # 1.4826 is the scaling constant that makes the MAD comparable to the standard deviation for normally distributed data.
## [1] 10.3782
####################
#                  #
#    Exercise 8    #
#                  #
####################

cov(data$age,data$mass)
## [1] 3.36033
#OR

num <- 0
for (i in 1:length(data$age)){
  num <- num + (data$age[i] - mean(data$age)) * (data$mass[i] - mean(data$mass))
}
num/(length(data$age)-1)  # the sample covariance divides by n - 1, as cov() does
## [1] 3.36033
####################
#                  #
#    Exercise 9    #
#                  #
####################

#Note: Pearson measures the strength of a linear relationship between the variables,
# while Spearman only assumes a monotonic one

cor(data$age,data$mass,method  = "spearman")
## [1] 0.1311859
#Used to measure the degree of the relationship between linearly related variables
cor(data$age,data$mass,method  = "pearson")
## [1] 0.03624187
#OR

# Spearman (shortcut formula; it assumes no tied ranks, so with ties it only approximates the value from cor())
n <- length(data$age)
1-6*sum((rank(data$age)-rank(data$mass))^2)/(n*(n^2-1))
## approximately 0.1323
# Pearson #1
num <- 0
den <- 0
x <- 0
y <- 0
for ( i in 1:length(data$age)){
  num <- num + (data$age[i] - mean(data$age))*(data$mass[i] - mean(data$mass))
  x <- x + (data$age[i] - mean(data$age))^2
  y <- y + (data$mass[i] - mean(data$mass))^2
}
den <- sqrt(x*y)
num/den
## [1] 0.03624187
# Pearson #2
cov(data$age,data$mass)/(sd(data$age)*sd(data$mass))
## [1] 0.03624187
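
To make the difference between the two coefficients concrete, the sketch below uses synthetic data (an assumption, not the Pima variables) with a perfectly monotonic but strongly non-linear relationship. It also shows that Spearman's coefficient is simply Pearson's coefficient applied to the ranks.

```r
set.seed(3)
x <- runif(100, 1, 10)   # synthetic predictor
y <- exp(x)              # monotonic but strongly non-linear in x

pearson          <- cor(x, y, method = "pearson")
spearman         <- cor(x, y, method = "spearman")
spearman_by_rank <- cor(rank(x), rank(y))   # Pearson on the ranks

print(c(pearson = pearson, spearman = spearman,
        spearman_by_rank = spearman_by_rank))
```

Spearman reaches 1 because the ordering of y follows the ordering of x exactly, while Pearson stays below 1 because the points do not lie on a straight line.
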
####################
#                  #
#    Exercise 10   #
#                  #
####################

summary(data)
##       preg             plas            pres             skin      
##  Min.   : 0.000   Min.   :  0.0   Min.   :  0.00   Min.   : 0.00  
##  1st Qu.: 1.000   1st Qu.: 99.0   1st Qu.: 62.00   1st Qu.: 0.00  
##  Median : 3.000   Median :117.0   Median : 72.00   Median :23.00  
##  Mean   : 3.845   Mean   :120.9   Mean   : 69.11   Mean   :20.54  
##  3rd Qu.: 6.000   3rd Qu.:140.2   3rd Qu.: 80.00   3rd Qu.:32.00  
##  Max.   :17.000   Max.   :199.0   Max.   :122.00   Max.   :99.00  
##       test            mass            pedi             age       
##  Min.   :  0.0   Min.   : 0.00   Min.   :0.0780   Min.   :21.00  
##  1st Qu.:  0.0   1st Qu.:27.30   1st Qu.:0.2437   1st Qu.:24.00  
##  Median : 30.5   Median :32.00   Median :0.3725   Median :29.00  
##  Mean   : 79.8   Mean   :31.99   Mean   :0.4719   Mean   :33.24  
##  3rd Qu.:127.2   3rd Qu.:36.60   3rd Qu.:0.6262   3rd Qu.:41.00  
##  Max.   :846.0   Max.   :67.10   Max.   :2.4200   Max.   :81.00  
##      class          class.fac  
##  Min.   :0.000   Negative:500  
##  1st Qu.:0.000   Positive:268  
##  Median :0.000                 
##  Mean   :0.349                 
##  3rd Qu.:1.000                 
##  Max.   :1.000
str(data)
## 'data.frame':	768 obs. of  10 variables:
##  $ preg     : int  6 1 8 1 0 5 3 10 2 8 ...
##  $ plas     : int  148 85 183 89 137 116 78 115 197 125 ...
##  $ pres     : int  72 66 64 66 40 74 50 0 70 96 ...
##  $ skin     : int  35 29 0 23 35 0 32 0 45 0 ...
##  $ test     : int  0 0 0 94 168 0 88 0 543 0 ...
##  $ mass     : num  33.6 26.6 23.3 28.1 43.1 25.6 31 35.3 30.5 0 ...
##  $ pedi     : num  0.627 0.351 0.672 0.167 2.288 ...
##  $ age      : int  50 31 32 21 33 30 26 29 53 54 ...
##  $ class    : int  1 0 1 0 1 0 1 0 1 1 ...
##  $ class.fac: Factor w/ 2 levels "Negative","Positive": 2 1 2 1 2 1 2 1 2 2 ...
rcorr(as.matrix(data[-length(data)]),type = "spearman")
##        preg plas  pres  skin  test mass  pedi   age class
## preg   1.00 0.13  0.19 -0.09 -0.13 0.00 -0.04  0.61  0.20
## plas   0.13 1.00  0.24  0.06  0.21 0.23  0.09  0.29  0.48
## pres   0.19 0.24  1.00  0.13 -0.01 0.29  0.03  0.35  0.14
## skin  -0.09 0.06  0.13  1.00  0.54 0.44  0.18 -0.07  0.09
## test  -0.13 0.21 -0.01  0.54  1.00 0.19  0.22 -0.11  0.07
## mass   0.00 0.23  0.29  0.44  0.19 1.00  0.14  0.13  0.31
## pedi  -0.04 0.09  0.03  0.18  0.22 0.14  1.00  0.04  0.18
## age    0.61 0.29  0.35 -0.07 -0.11 0.13  0.04  1.00  0.31
## class  0.20 0.48  0.14  0.09  0.07 0.31  0.18  0.31  1.00
## 
## n= 768 
## 
## 
## P
##       preg   plas   pres   skin   test   mass   pedi   age    class 
## preg         0.0003 0.0000 0.0182 0.0004 0.9971 0.2313 0.0000 0.0000
## plas  0.0003        0.0000 0.0965 0.0000 0.0000 0.0114 0.0000 0.0000
## pres  0.0000 0.0000        0.0004 0.8514 0.0000 0.4057 0.0000 0.0000
## skin  0.0182 0.0965 0.0004        0.0000 0.0000 0.0000 0.0643 0.0129
## test  0.0004 0.0000 0.8514 0.0000        0.0000 0.0000 0.0015 0.0656
## mass  0.9971 0.0000 0.0000 0.0000 0.0000        0.0000 0.0003 0.0000
## pedi  0.2313 0.0114 0.4057 0.0000 0.0000 0.0000        0.2349 0.0000
## age   0.0000 0.0000 0.0000 0.0643 0.0015 0.0003 0.2349        0.0000
## class 0.0000 0.0000 0.0000 0.0129 0.0656 0.0000 0.0000 0.0000
rcorr(as.matrix(data[-length(data)]),type = "pearson")
##        preg plas pres  skin  test mass  pedi   age class
## preg   1.00 0.13 0.14 -0.08 -0.07 0.02 -0.03  0.54  0.22
## plas   0.13 1.00 0.15  0.06  0.33 0.22  0.14  0.26  0.47
## pres   0.14 0.15 1.00  0.21  0.09 0.28  0.04  0.24  0.07
## skin  -0.08 0.06 0.21  1.00  0.44 0.39  0.18 -0.11  0.07
## test  -0.07 0.33 0.09  0.44  1.00 0.20  0.19 -0.04  0.13
## mass   0.02 0.22 0.28  0.39  0.20 1.00  0.14  0.04  0.29
## pedi  -0.03 0.14 0.04  0.18  0.19 0.14  1.00  0.03  0.17
## age    0.54 0.26 0.24 -0.11 -0.04 0.04  0.03  1.00  0.24
## class  0.22 0.47 0.07  0.07  0.13 0.29  0.17  0.24  1.00
## 
## n= 768 
## 
## 
## P
##       preg   plas   pres   skin   test   mass   pedi   age    class 
## preg         0.0003 0.0000 0.0236 0.0416 0.6246 0.3535 0.0000 0.0000
## plas  0.0003        0.0000 0.1124 0.0000 0.0000 0.0001 0.0000 0.0000
## pres  0.0000 0.0000        0.0000 0.0137 0.0000 0.2534 0.0000 0.0715
## skin  0.0236 0.1124 0.0000        0.0000 0.0000 0.0000 0.0016 0.0383
## test  0.0416 0.0000 0.0137 0.0000        0.0000 0.0000 0.2432 0.0003
## mass  0.6246 0.0000 0.0000 0.0000 0.0000        0.0000 0.3158 0.0000
## pedi  0.3535 0.0001 0.2534 0.0000 0.0000 0.0000        0.3530 0.0000
## age   0.0000 0.0000 0.0000 0.0016 0.2432 0.3158 0.3530        0.0000
## class 0.0000 0.0000 0.0715 0.0383 0.0003 0.0000 0.0000 0.0000



Data Science for Doctors – Part 2 : Descriptive Statistics

Data science enhances people’s decision making. Doctors and researchers make critical decisions every day, so it is essential for them to have some basic knowledge of data science. This series aims to help people working in or around the medical field to enhance their data science skills.

We will work with a health-related database, the famous “Pima Indians Diabetes Database”. It was generously donated by Vincent Sigillito from Johns Hopkins University. Please find further information regarding the dataset here.

This is the second part of the series. It covers the main descriptive statistics measures you will use most of the time. Those measures are divided into measures of central tendency and measures of spread. Most of the exercises can be solved with built-in functions, but I would encourage you to solve them “by hand”, because once you know the mechanics of the measures, you will be far more confident using them. On the “solutions” page, I have included both methods, so even if you didn’t solve them by hand, it would be worth checking them out.
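
As a small illustration of what solving “by hand” involves, the sketch below computes the variance and standard deviation of a toy vector (hypothetical values, not the Pima data) from the sum of squared deviations. The point to notice is the divisor: R's var() and sd() divide by n - 1 (the sample versions), not by n.

```r
x <- c(21, 24, 29, 33, 41, 50)   # toy values, not the Pima data
n <- length(x)

ss <- sum((x - mean(x))^2)       # sum of squared deviations from the mean

var_sample     <- ss / (n - 1)   # sample variance, what var() returns
var_population <- ss / n         # population variance, slightly smaller
sd_sample      <- sqrt(var_sample)

print(c(var_sample = var_sample, var_builtin = var(x),
        sd_sample = sd_sample, sd_builtin = sd(x),
        var_population = var_population))
```

With a large data set the two divisors give nearly the same result, which is why a by-hand answer that divides by n can look almost, but not exactly, like the built-in one.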

Before proceeding, it might be helpful to look over the help pages for mean, median, sort, unique, tabulate, sd, var, IQR, mad, abs, cov, cor, summary, str, and rcorr.

You also may need to load the Hmisc library.
install.packages('Hmisc')
library(Hmisc)

In case you haven’t solved Part 1, run the following script to load the prerequisites for this part.

Answers to the exercises are available here.

If you obtained a different (correct) answer than those listed on the solutions page, please feel free to post your answer as a comment on that page.

Exercise 1

Find the mean of the mass variable.

Exercise 2

Find the median of the mass variable.

Exercise 3

Find the mode of the mass.

Exercise 4

Find the standard deviation of the age variable.

Exercise 5

Find the variance of the mass variable.

Unlike the popular mean/standard deviation combination, the interquartile range and the median absolute deviation (MAD) are not sensitive to the presence of outliers. Of the two, the MAD is usually recommended because, once scaled, it can approximate the standard deviation.
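
The claim about outliers can be checked with a quick sketch. It uses synthetic, roughly age-like data (an assumption, generated with rnorm) and adds a single extreme value; the helper name spread is hypothetical.

```r
set.seed(7)
clean <- rnorm(200, mean = 33, sd = 12)   # synthetic data without outliers
dirty <- c(clean, 500)                    # the same data plus one extreme value

# Three measures of spread side by side
spread <- function(x) c(sd = sd(x), mad = mad(x), iqr = IQR(x))

comparison <- rbind(clean = spread(clean), dirty = spread(dirty))
print(comparison)
```

A single outlier inflates the standard deviation several times over, while the MAD and the interquartile range barely move. On the clean data, the scaled MAD is also close to the standard deviation.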

Exercise 6

Find the interquartile range of the age variable.

Exercise 7

Find the median absolute deviation of the age variable. Assume that the age follows a normal distribution.

Exercise 8

Find the covariance of the variables age and mass.

Exercise 9

Find the Spearman and Pearson correlations of the variables age and mass.

Exercise 10

Print the summary statistics, and the structure of the data set. Moreover construct the correlation matrix of the data set.




Data Science for Doctors – Part 1 : Data Display

Data science enhances people’s decision making. Doctors and researchers make critical decisions every day, so it is essential for them to have some basic knowledge of data science. This series aims to help people working in or around the medical field to enhance their data science skills.

We will work with a health-related database, the famous “Pima Indians Diabetes Database”. It was generously donated by Vincent Sigillito from Johns Hopkins University. Please find further information regarding the dataset here.

This is the first part of the series, and it is about data display.

Before proceeding, it might be helpful to look over the help pages for table, pie, geom_bar, coord_polar, barplot, stripchart, geom_jitter, density, geom_density, hist, geom_histogram, boxplot, geom_boxplot, qqnorm, qqline, geom_point, and plot.

You also may need to load the ggplot2 library.
install.packages('ggplot2')
library(ggplot2)

Please run the code below in order to load the data set and transform it into a proper data frame format:

url <- "https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data"
data <- read.table(url, fileEncoding="UTF-8", sep=",")
names <- c('preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class')
colnames(data) <- names

Answers to the exercises are available here.

If you obtained a different (correct) answer than those listed on the solutions page, please feel free to post your answer as a comment on that page.

Exercise 1

Create a frequency table of the class variable.

Exercise 2

data[['class.fac']] <- factor(data[['class']],levels=c(0,1), labels= c("Negative","Positive"))

Create a pie chart of the class.fac variable.

Exercise 3

Create a bar plot for the age variable.

Exercise 4

Create a strip chart for the mass against class.fac.

Exercise 5

Create a density plot for the preg variable.

Exercise 6

Create a histogram for the preg variable.

Exercise 7

Create a boxplot for the age against class.fac.

Exercise 8

Create a normal QQ plot and a line which passes through the first and third quartiles.

Exercise 9

Create a scatter plot of the age variable against the mass variable.

Exercise 10

Create scatter plots for every variable of the data set against every variable of the data set on a single window.
Hint: it is quite simple, so don’t overthink it.




Data Science for Doctors – Part 1 : Data Display Solutions

Below are the solutions to these exercises on data display.

####################
#                  #
#    Exercise 1    #
#                  #
####################

table(data['class'])
## 
##   0   1 
## 500 268
####################
#                  #
#    Exercise 2    #
#                  #
####################

pie(table(data['class.fac']))
# OR
ggplot(data, aes(x = factor(1), fill = class.fac)) +
  geom_bar(width = 1) +
  coord_polar(theta = "y") +
  labs(x = " ", y = " ")
####################
#                  #
#    Exercise 3    #
#                  #
####################

barplot(table(data[['age']]))  # bar heights are the counts of each age, matching geom_bar() below
#OR

ggplot(data, aes(age)) + geom_bar() +
  labs(x = "Age", y = "# of Candidates")
####################
#                  #
#    Exercise 4    #
#                  #
####################

stripchart(data[["mass"]] ~ data[['class.fac']], method="jitter")
#OR

ggplot(data, aes(mass,class.fac)) + geom_jitter() +
  labs(x = "BMI", y = "Diagnosis")
####################
#                  #
#    Exercise 5    #
#                  #
####################

plot(density(data$preg))
#OR

ggplot(data, aes(preg)) +
  geom_density() +
  labs(x = "# of pregnancies")
####################
#                  #
#    Exercise 6    #
#                  #
####################

hist(data[['preg']])
#OR

ggplot(data, aes(preg)) + geom_histogram() +
  labs(x ="# of pregnancies", y = "# of Candidates")
####################
#                  #
#    Exercise 7    #
#                  #
####################

boxplot(data[['age']] ~data[['class.fac']])
#OR

ggplot(data, aes(class.fac,age)) + geom_boxplot() +
  labs(x = "Diagnosis", y = "Age")
####################
#                  #
#    Exercise 8    #
#                  #
####################

qqnorm(data[["age"]])
qqline(data[["age"]])
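
For readers curious how a line “through the first and third quartiles” is determined, here is a minimal sketch of the calculation, which (with the default arguments) is what qqline is documented to draw. It uses a synthetic skewed sample as a stand-in for data[["age"]] (an assumption), and the variable names are illustrative.

```r
set.seed(11)
y <- rexp(300)   # synthetic, skewed sample standing in for data[["age"]]

probs <- c(0.25, 0.75)
y_q <- quantile(y, probs, names = FALSE)   # sample quartiles
x_q <- qnorm(probs)                        # theoretical normal quartiles

# Straight line through the two (theoretical, sample) quartile points
slope     <- diff(y_q) / diff(x_q)
intercept <- y_q[1] - slope * x_q[1]

print(c(intercept = intercept, slope = slope))
```

After qqnorm(y), calling abline(intercept, slope) should overlay the line that qqline(y) draws. Points bending away from it, as a skewed sample will, indicate a departure from normality.
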
####################
#                  #
#    Exercise 9    #
#                  #
####################

plot(data$age,data$mass)
#OR

ggplot(data, aes(age, mass)) +
  geom_point() +
  labs(x = "Age", y = "BMI")
####################
#                  #
#    Exercise 10   #
#                  #
####################

plot(data)



Descriptive Analytics-Part 6: Interactive dashboard ( 2/2) solutions

Below are the solutions to these exercises on interactive dashboarding.

In case you feel like you need the full script, you can find it here.

###############
#             #
# Exercise 1  #
#             #
###############

library(shiny)  # provides fluidPage() and the other UI functions used below

ui <- fluidPage(pageWithSidebar(
  headerPanel("Visualization")))
###############
#             #
# Exercise 2  #
#             #
###############

ui <- fluidPage(pageWithSidebar(
  headerPanel("Visualization"),
  sidebarPanel()))
###############
#             #
# Exercise 3  #
#             #
###############

ui <- fluidPage(pageWithSidebar(
  headerPanel("Visualization"),
  sidebarPanel(
    selectInput("delays",
                h3("Select type of delay"),
                list("Carrier" = "CarrierDelay",
                     "Weather" = "WeatherDelay",
                     "NAS" = "NASDelay","Security"="SecurityDelay",
                     "LateAircraft"="LateAircraftDelay"),selected = "CarrierDelay" ),
    selectInput("var",
                h3("Select categorical variable"),
                list("Destination" = "Dest",
                     "Origin" = "Origin",
                     "Carrier" = "UniqueCarrier","Airplane"="TailNum",
                     "CancellationCode"="CancellationCode"),selected = "Dest" ))))
###############
#             #
# Exercise 4  #
#             #
###############

ui <- fluidPage(pageWithSidebar(
  headerPanel("Visualization"),
  sidebarPanel(
    selectInput("delays",
                h3("Select type of delay"),
                list("Carrier" = "CarrierDelay",
                     "Weather" = "WeatherDelay",
                     "NAS" = "NASDelay","Security"="SecurityDelay",
                     "LateAircraft"="LateAircraftDelay"),selected = "CarrierDelay" ),
    selectInput("var",
                h3("Select categorical variable"),
                list("Destination" = "Dest",
                     "Origin" = "Origin",
                     "Carrier" = "UniqueCarrier","Airplane"="TailNum",
                     "CancellationCode"="CancellationCode"),selected = "Dest" ),
    radioButtons("plot_cont",
                 h3("Select plot"),
                 list("Histogram" = 1,
                      "Scatterplot" = 2,"ViolinPlot"=3),selected = 1 ))))
###############
#             #
# Exercise 5  #
#             #
###############

ui <- fluidPage(pageWithSidebar(
  headerPanel("Visualization"),
  sidebarPanel(
    selectInput("delays",
                h3("Select type of delay"),
                list("Carrier" = "CarrierDelay",
                     "Weather" = "WeatherDelay",
                     "NAS" = "NASDelay","Security"="SecurityDelay",
                     "LateAircraft"="LateAircraftDelay"),selected = "CarrierDelay" ),
    selectInput("var",
                h3("Select categorical variable"),
                list("Destination" = "Dest",
                     "Origin" = "Origin",
                     "Carrier" = "UniqueCarrier","Airplane"="TailNum",
                     "CancellationCode"="CancellationCode"),selected = "Dest" ),
    radioButtons("plot_cont",
                 h3("Select plot"),
                 list("Histogram" = 1,
                      "Scatterplot" = 2,"ViolinPlot"=3),selected = 1 ),
    radioButtons("plot_cat",
                 h3("Select plot"),
                 list("Barplot" = 1,
                      "Pie Chart" = 2,
                      "Rose wind" = 3),selected = 1 ))))
###############
#             #
# Exercise 6  #
#             #
###############

ui <- fluidPage(pageWithSidebar(
  headerPanel("Visualization"),
  sidebarPanel(
    selectInput("delays",
                h3("Select type of delay"),
                list("Carrier" = "CarrierDelay",
                     "Weather" = "WeatherDelay",
                     "NAS" = "NASDelay","Security"="SecurityDelay",
                     "LateAircraft"="LateAircraftDelay"),selected = "CarrierDelay" ),
    selectInput("var",
                h3("Select categorical variable"),
                list("Destination" = "Dest",
                     "Origin" = "Origin",
                     "Carrier" = "UniqueCarrier","Airplane"="TailNum",
                     "CancellationCode"="CancellationCode"),selected = "Dest" ),
    radioButtons("plot_cont",
                 h3("Select plot"),
                 list("Histogram" = 1,
                      "Scatterplot" = 2,"ViolinPlot"=3),selected = 1 ),
    radioButtons("plot_cat",
                 h3("Select plot"),
                 list("Barplot" = 1,
                      "Pie Chart" = 2,
                      "Rose wind" = 3),selected = 1 )),
  mainPanel(tabsetPanel())))
###############
#             #
# Exercise 7  #
#             #
###############

ui <- fluidPage(pageWithSidebar(
  headerPanel("Visualization"),
  sidebarPanel(
    selectInput("delays",
                h3("Select type of delay"),
                list("Carrier" = "CarrierDelay",
                     "Weather" = "WeatherDelay",
                     "NAS" = "NASDelay","Security"="SecurityDelay",
                     "LateAircraft"="LateAircraftDelay"),selected = "CarrierDelay" ),
    selectInput("var",
                h3("Select categorical variable"),
                list("Destination" = "Dest",
                     "Origin" = "Origin",
                     "Carrier" = "UniqueCarrier","Airplane"="TailNum",
                     "CancellationCode"="CancellationCode"),selected = "Dest" ),
    radioButtons("plot_cont",
                 h3("Select plot"),
                 list("Histogram" = 1,
                      "Scatterplot" = 2,"ViolinPlot"=3),selected = 1 ),
    radioButtons("plot_cat",
                 h3("Select plot"),
                 list("Barplot" = 1,
                      "Pie Chart" = 2,
                      "Rose wind" = 3),selected = 1 )),
  mainPanel(tabsetPanel(
    tabPanel("Delays",plotOutput("cont")),
    tabPanel("Categorical",plotOutput("cat")))
  )))
###############
#             #
# Exercise 8  #
#             #
###############

server <- function(input, output) {
  observe({
    if (input$plot_cont == 1){
      output$cont <- renderPlot({
        ggplot(flights, aes(flights[[input$delays]])) +
          geom_histogram(breaks=seq(0, 100, by =2),
                         col="red",
                         aes(fill=..count..)) +
          scale_fill_gradient("Count", low = "green", high = "red") +
          # paste() returns the title string; cat() would only print it and return NULL
          labs(title=paste("Histogram for", input$delays,"time"), x=input$delays,y="# of flights")
      })
    }else if (input$plot_cont == 2 ){
      output$cont <- renderPlot({
        ggplot(flights,
               aes(x=Full_Date,
                   y=flights[[input$delays]],
                   color= UniqueCarrier,alpha =1/3))+
          geom_point()+ theme_bw(base_family='Times')+
          theme(axis.text.x=element_blank(),
                axis.ticks.x=element_blank())
      })
    }
    else{
      output$cont <- renderPlot({
        ggplot(flights, aes(factor(DayOfWeek), flights[[input$delays]]))+
          geom_violin(aes(fill = factor(DayOfWeek)),trim = FALSE)+ guides(fill=FALSE)+
          scale_y_continuous(limits = c(0, 25))+
          labs( y=input$delays,x="Day of Week")
      })
    }
})
}
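
When a plot title is built from an input value, it helps to remember that paste() returns the string, while cat() only writes to the console and returns NULL. The short check below is a sketch in base R; the variable delay is a stand-in for input$delays and is not part of the app.

```r
delay <- "CarrierDelay"  # stand-in for input$delays (assumption for illustration)

# cat() prints its arguments and returns NULL, so it cannot supply a title
from_cat <- cat("Histogram for", delay, "time\n")
is.null(from_cat)

# paste() returns the character string itself
from_paste <- paste("Histogram for", delay, "time")
from_paste
```

Passing the result of paste() to labs(title = ...) therefore gives the plot a title, whereas the result of cat() leaves it empty.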
###############
#             #
# Exercise 9  #
#             #
###############

server <- function(input, output) {
  observe({
    if (input$plot_cont == 1){
      output$cont <- renderPlot({
        ggplot(flights, aes(flights[[input$delays]])) +
          geom_histogram(breaks=seq(0, 100, by =2),
                         col="red",
                         aes(fill=..count..)) +
          scale_fill_gradient("Count", low = "green", high = "red") +
          # paste() returns the title string; cat() would only print it and return NULL
          labs(title=paste("Histogram for", input$delays,"time"), x=input$delays,y="# of flights")
      })
    }else if (input$plot_cont == 2 ){
      output$cont <- renderPlot({
        ggplot(flights,
               aes(x=Full_Date,
                   y=flights[[input$delays]],
                   color= UniqueCarrier,alpha =1/3))+
          geom_point()+ theme_bw(base_family='Times')+
          theme(axis.text.x=element_blank(),
                axis.ticks.x=element_blank())
      })
    }
    else{
      output$cont <- renderPlot({
        ggplot(flights, aes(factor(DayOfWeek), flights[[input$delays]]))+
          geom_violin(aes(fill = factor(DayOfWeek)),trim = FALSE)+ guides(fill=FALSE)+
          scale_y_continuous(limits = c(0, 25))+
          labs( y=input$delays,x="Day of Week")
      })
    }

    if (input$plot_cat == 1 ){
      output$cat <- renderPlot({
        ggplot (flights)+ aes (as.factor(flights[[input$var]])) +
          labs(title=paste("Bar plot for", input$var), x=input$var,y="# of flights")+ theme(axis.text.x = element_text(angle=90))+
          geom_bar()
      })
    }else if (input$plot_cat == 2 ){
      output$cat <- renderPlot({
         ggplot(flights, aes(x = factor(1), fill = as.factor(flights[[input$var]]))) +
          geom_bar(width = 1) + coord_polar(theta = "y")
      })
    }else {
      output$cat <- renderPlot({
        # fill must map the selected column, not the literal string in input$var
        ggplot(flights, aes(x = DayOfWeek, fill = as.factor(flights[[input$var]]))) + geom_bar(width = 1) + coord_polar()
      })
    }
    })
}
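
The solutions above index the data frame directly inside aes() (flights[[input$var]]). A possible alternative, sketched here on a made-up data frame rather than the flights data, is ggplot2's .data pronoun, which looks a column up by its name given as a string. The names toy and var are placeholders for flights and input$var.

```r
library(ggplot2)

# Toy stand-ins for the flights data and for input$var (assumptions)
toy <- data.frame(Dest = c("ATL", "ATL", "ORD"))
var <- "Dest"

# .data[[var]] maps the column whose name is stored in var
p <- ggplot(toy, aes(x = .data[[var]])) +
  geom_bar() +
  labs(title = paste("Bar plot for", var), x = var, y = "# of flights")

# Bar heights computed by geom_bar(), one per category
counts <- layer_data(p)$count
counts
```

Inside renderPlot() the same pattern would read aes(x = .data[[input$var]]), which keeps the mapping tied to the data argument of ggplot().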

###############
#             #
# Exercise 10 #
#             #
###############

shinyApp(ui = ui, server = server)



Descriptive Analytics - Part 6: Interactive dashboard (2/2)

Descriptive Analytics is the examination of data or content, usually manually performed, to answer the question “What happened?”. As this series of exercises comes to an end, the last part is the development of a data product. Not everybody is able to code in R, so it is useful to be able to build GUIs in order to share your work with non-technical people. This part may be a little challenging, since it requires some basic knowledge of the shiny package. The outcome of this set of exercises will be almost like this web app (some variables are missing because I had to reduce the size of the data set).

In order to solve this set of exercises you should have solved part 0, part 1, part 2, part 3, and part 4 of this series, and you should also run this script, which contains some more data cleaning. In case you haven’t, run this script on your machine; it contains the lines of code we used to modify our data set. This is the tenth set of exercises in a series that aims to provide a descriptive analytics solution to the ‘2008’ data set from here. This data set, which contains the arrival and departure information for all domestic flights in the US in 2008, has become the “iris” data set for Big Data. The goal of descriptive analytics is to inform the user about what is going on in the data set. Before proceeding, it might be helpful to look over the help pages for fluidPage, pageWithSidebar, headerPanel, sidebarPanel, selectInput, mainPanel, tabPanel, observe, verbatimTextOutput, renderPrint, and shinyApp.

For this set of exercises you will need to install and load the package shiny.

install.packages('shiny')
library(shiny)

I have also changed the values of the DayOfWeek variable; if you wish to do that as well, the code is:
install.packages('lubridate')
library(lubridate)
flights$DayOfWeek <- wday(as.Date(flights$Full1_Date,'%m/%d/%Y'), label=TRUE)
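
To see what the format string '%m/%d/%Y' does before applying it to the whole column, here is a small base-R sketch on a made-up date string (the value is illustrative, not taken from the data set). It uses format() with "%u" as a locale-independent weekday lookup; note that lubridate's wday() numbers the days differently by default (Sunday = 1).

```r
# A made-up date string in month/day/year form (illustrative only)
raw <- "01/15/2008"

# Parse it with the same format string used above
d <- as.Date(raw, "%m/%d/%Y")
d

# ISO weekday number as a string: "1" = Monday ... "7" = Sunday
iso_day <- format(d, "%u")
iso_day
```

If the parsed date comes back as NA, the format string does not match how the dates are stored in the column.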

Because the app requires some time to run, I have also removed the rows with missing values from the data set just to save some time.

flights <-flights[which(!is.na(flights['WeatherDelay'])),]
flights <-flights[which(!is.na(flights['ArrDelay'])),]
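
The effect of this filtering can be checked on a toy data frame first. The sketch below uses base R's complete.cases() as an alternative way to drop rows with a missing value in either column; the data frame toy and its values are made up for illustration.

```r
# Made-up data with missing values in both delay columns (illustrative only)
toy <- data.frame(
  WeatherDelay = c(0, NA, 12, 5),
  ArrDelay     = c(3, 10, NA, 7)
)

# Keep only the rows where both columns are present
kept <- toy[complete.cases(toy[, c("WeatherDelay", "ArrDelay")]), ]
kept
nrow(kept)
```

Only the first and last rows survive, since each of the other two has an NA in one of the columns.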

Answers to the exercises are available here.

If you obtained a different (correct) answer than those listed on the solutions page, please feel free to post your answer as a comment on that page. Moreover, it would be really nice of you to share the links to the apps you have developed. It would be a great contribution to the community.


Exercise 1

Create the user interface and set the header of the web app to “Descriptive Analysis”.

Exercise 2

Create a side panel.

Exercise 3

Create two select list input controls. The former will contain the variables CarrierDelay, WeatherDelay, NASDelay, SecurityDelay, and LateAircraftDelay. The latter will contain the variables Dest, Origin, UniqueCarrier, TailNum, and CancellationCode.

Exercise 4

Create a set of radio buttons used to select a plot from a list (Histogram, Scatter plot, Violin plot), and set the Histogram as the default plot.

Exercise 5

Create a set of radio buttons used to select a plot from a list (bar plot, pie chart, rose wind), and set the bar plot as the default plot.

Exercise 6

Create a main panel.

Exercise 7

Create two tabs in the main panel, named “Delays” and “Categorical”, that will contain the plots of exercises 4 and 5 respectively.

Exercise 8

Now that we are done with the user interface, create the server side of the app. Create the output of the first tab, which will be the plots from exercise 4 with respect to the first set of variables from exercise 3 (notice that they are all continuous variables). Bear in mind that in the scatter plot the x-axis should be Full_Date and in the violin plot the x-axis should be DayOfWeek, as in the previous set of exercises. (Please check out the first tab of the app to make things clearer.)

Exercise 9

Create the output of the second tab, which will be the plots from exercise 5 with respect to the second set of variables from exercise 3. Use the knowledge you acquired in the previous exercises for the plots, and make them as interesting as you can. (Please check out the second tab of the app to make things clearer.)

Exercise 10

Launch the app.