# Data science for Doctors: Variable importance Exercises

Data science enhances people’s decision making. Doctors and researchers make critical decisions every day, so it is important for them to have some basic knowledge of data science. This series aims to help people in and around the medical field enhance their data science skills.

We will work with a health-related database, the famous “Pima Indians Diabetes Database”. It was generously donated by Vincent Sigillito from Johns Hopkins University. Please find further information regarding the dataset here.

This is the tenth part of the series. It aims to cover the very basics of the correlation coefficient and principal component analysis, two methods that illustrate how variables are related.

In my opinion, researchers need a notion of the relationships between variables, both to spot potential cause-and-effect relations and to remove unnecessary variables. Any such causal relation remains hypothetical: a high correlation between two variables is not, by itself, evidence that one causes the other. In particular, we will go through the Pearson correlation coefficient, confidence intervals by the bootstrap, and principal component analysis.

Before proceeding, it might be helpful to look over the help pages for `ggplot`, `cor`, `cor.test`, `boot.cor`, `quantile`, `eigen`, `princomp`, `summary`, `plot`, and `autoplot`.

Moreover, please load the following libraries.

`install.packages("ggplot2")`

`library(ggplot2)`

`install.packages("ggfortify")`

`library(ggfortify)`

Please run the code below in order to load the data set and transform it into a proper data frame format:

`url <- "https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data"`

`data <- read.table(url, fileEncoding="UTF-8", sep=",")`

`names <- c('preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class')`

`colnames(data) <- names`

`data <- data[data$mass != 0, ]  # drop rows where mass is recorded as 0`

Answers to the exercises are available here.

If you obtained a different (correct) answer than those listed on the solutions page, please feel free to post your answer as a comment on that page.

Exercise 1

Compute the value of the correlation coefficient for the variables `age` and `preg`.
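As a reminder of what `cor` computes, here is a minimal sketch of the Pearson coefficient worked out by hand and compared with the built-in function. It uses the built-in `mtcars` data (an assumption for illustration only, not the exercise data), and the names `r_manual` and `r_builtin` are hypothetical.

```r
# Built-in data, used only to illustrate the formula (not the exercise data).
x <- mtcars$wt
y <- mtcars$mpg

# Pearson r: covariation of x and y scaled by the spread of each variable.
r_manual <- sum((x - mean(x)) * (y - mean(y))) /
  sqrt(sum((x - mean(x))^2) * sum((y - mean(y))^2))

# The same quantity from base R.
r_builtin <- cor(x, y, method = "pearson")

print(c(manual = r_manual, builtin = r_builtin))
```

Both values should agree up to floating-point error.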

Exercise 2

Construct the scatterplot for the variables `age` and `preg`.
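If you have not built a scatterplot with `ggplot` before, the general pattern might look like the sketch below. It uses the built-in `mtcars` data as a stand-in, so the variables and axis labels are assumptions for illustration only.

```r
library(ggplot2)

# Built-in data, used only for illustration (not the exercise data).
p <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  labs(x = "Weight (1000 lbs)", y = "Miles per gallon")

print(p)
```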

Exercise 3

Apply a correlation test for the variables `age` and `preg`, with the null hypothesis that the correlation is zero and the alternative that it differs from zero.

hint: `cor.test`
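For orientation, a two-sided Pearson test with `cor.test` could be sketched as follows. The `mtcars` variables are stand-ins chosen for illustration, and `res` is a hypothetical name.

```r
# Built-in data, used only for illustration (not the exercise data).
res <- cor.test(mtcars$wt, mtcars$mpg,
                alternative = "two.sided",  # H1: correlation differs from zero
                method = "pearson")

res$estimate   # sample correlation
res$p.value    # p-value for H0: correlation equals zero
res$conf.int   # 95% confidence interval (based on Fisher's z transform)
```

A small p-value suggests rejecting the null hypothesis of zero correlation. It says nothing about causation.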

Exercise 4

Construct a 95% confidence interval for the correlation by the bootstrap. First, find the bootstrap estimate of the correlation.

hint: `mean`

Exercise 5

Now that you have found the correlation, find the 95% confidence interval.
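One common way to approach Exercises 4 and 5 is the percentile bootstrap: resample the observation pairs with replacement, recompute the correlation each time, and then take the mean and the 2.5% and 97.5% quantiles of the resampled values. The sketch below assumes this percentile method, the built-in `mtcars` data, a fixed seed, and 2000 resamples. All of these are illustrative choices, not necessarily those of the official solution.

```r
set.seed(42)        # assumed seed, only for reproducibility
x <- mtcars$wt      # built-in data, for illustration only
y <- mtcars$mpg
n <- length(x)
n_boot <- 2000      # hypothetical number of bootstrap resamples

boot_cor <- replicate(n_boot, {
  idx <- sample.int(n, size = n, replace = TRUE)  # resample rows so pairs stay together
  cor(x[idx], y[idx])
})

boot_estimate <- mean(boot_cor)                          # bootstrap estimate of the correlation
boot_ci <- quantile(boot_cor, probs = c(0.025, 0.975))   # 95% percentile interval

boot_estimate
boot_ci
```

Resampling row indices, rather than `x` and `y` separately, keeps each pair intact. Resampling them separately would destroy the very relationship being measured.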

Exercise 6

Find the eigenvalues and eigenvectors for the data set (exclude the `class` variable).
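Eigenvalues and eigenvectors are defined for a square matrix, so in practice `eigen` is applied to the covariance or correlation matrix of the numeric columns. The sketch below assumes the correlation matrix (equivalent to standardizing the variables first) and uses the built-in `USArrests` data for illustration.

```r
# Built-in data with four numeric columns, used only for illustration.
num_vars <- USArrests

# Assumption: decompose the correlation matrix rather than the covariance matrix.
cor_mat <- cor(num_vars)
eig <- eigen(cor_mat)

eig$values    # eigenvalues, in decreasing order
eig$vectors   # eigenvectors, one per column
```

For a correlation matrix the eigenvalues sum to the number of variables, which is a handy sanity check.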

Exercise 7

Compute the principal components for the dataset used above.

Exercise 8

Show the importance of each principal component.
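As a rough guide to Exercises 7 and 8, `princomp` fits the components and `summary` reports how much variance each one explains. The sketch below uses the built-in `USArrests` data and sets `cor = TRUE` (an assumption, so that variables on different scales contribute equally). `prop_var` is a hypothetical name.

```r
# Built-in data, used only for illustration (not the exercise data).
pca <- princomp(USArrests, cor = TRUE)  # assumption: PCA on the correlation matrix

summary(pca)  # standard deviation, proportion and cumulative proportion of variance

# The same proportions computed by hand from the component standard deviations.
prop_var <- pca$sdev^2 / sum(pca$sdev^2)
prop_var
cumsum(prop_var)
```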

Exercise 9

Plot the principal components using an elbow graph.
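An elbow graph (scree plot) shows the variance of each component in decreasing order. The point where the curve flattens suggests how many components are worth keeping. A minimal sketch with base R, again on the built-in `USArrests` data as a stand-in:

```r
# Built-in data, used only for illustration (not the exercise data).
pca <- princomp(USArrests, cor = TRUE)

# Variances of the components drawn as a line; look for the "elbow".
screeplot(pca, type = "lines", main = "Scree plot (USArrests)")
```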

Exercise 10

Construct a scatterplot with the first component on the x-axis and the second component on the y-axis. Moreover, if possible, draw the eigenvectors on the plot.

hint: `autoplot`
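With `ggfortify` loaded, `autoplot` can draw the first two components of a PCA object and overlay the loadings (eigenvectors) as arrows. The sketch below assumes the built-in `iris` data and colours the points by `Species`. Both choices are purely illustrative.

```r
library(ggplot2)
library(ggfortify)

# Built-in data, used only for illustration; column 5 (Species) is not numeric.
pca <- princomp(iris[, 1:4], cor = TRUE)

p <- autoplot(pca,
              data = iris,            # original data, needed for colouring
              colour = "Species",     # colour points by a grouping column
              loadings = TRUE,        # draw the eigenvectors as arrows
              loadings.label = TRUE)  # label each arrow with its variable name

print(p)
```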