# Data science for Doctors: Cluster Analysis Exercises

Data science enhances people’s decision making. Doctors and researchers make critical decisions every day, so it is essential for them to have some basic knowledge of data science. This series aims to help people in and around the medical field enhance their data science skills.

We will work with a health-related database, the famous “Pima Indians Diabetes Database”. It was generously donated by Vincent Sigillito from Johns Hopkins University. Please find further information regarding the dataset here.

This is the ninth part of the series, and it aims to cover the very basics of cluster analysis.

In my opinion, it is necessary for researchers to know how to discover relationships between patients and diseases. Therefore, in this set of exercises we will go through the basics of relationship discovery with cluster analysis. In particular, we will use hierarchical clustering and two centroid-based methods: k-means clustering and k-median clustering.

Before proceeding, it might be helpful to look over the help pages for `ggplot`, `geom_point`, `dist`, `hclust`, `cutree`, `stats::rect.hclust`, `multiplot`, `kmeans`, and `kGmedian`.

Moreover, please install and load the following libraries.

`install.packages("ggplot2")`

`library(ggplot2)`

`install.packages("Gmedian")`

`library(Gmedian)`

Please run the code below in order to load the data set and transform it into a proper data frame format:

`url <- "https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data"`

`data <- read.table(url, fileEncoding="UTF-8", sep=",")`

`names <- c('preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class')`

`colnames(data) <- names`

`data <- data[data$mass != 0, ]`

Answers to the exercises are available here.

If you obtained a different (correct) answer than those listed on the solutions page, please feel free to post your answer as a comment on that page.

Exercise 1

Construct a scatterplot with the `mass` variable on the x-axis and the `age` variable on the y-axis. Moreover, colour the points according to the class of the candidate (0 or 1).
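As a hedged illustration of colouring points by a group label, here is a minimal sketch on hypothetical toy data (the data frame `toy` and its columns `x`, `y` and `label` are invented for this example and are not part of the Pima data set).

```r
library(ggplot2)

# Hypothetical toy data: two numeric measurements and a 0/1 label.
toy <- data.frame(
  x = c(1.0, 1.5, 2.0, 6.0, 6.5, 7.0),
  y = c(2.0, 1.8, 2.4, 7.1, 6.7, 7.5),
  label = c(0, 0, 0, 1, 1, 1)
)

# Wrapping the label in factor() makes ggplot use a discrete colour scale
# instead of a continuous gradient.
p <- ggplot(toy, aes(x = x, y = y, colour = factor(label))) +
  geom_point() +
  labs(colour = "label")

print(p)
```

The same pattern should carry over to any label column, including cluster assignments produced later in this set.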

Exercise 2

Create a distance matrix for the data.
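To make the idea of a distance matrix concrete, here is a small sketch on three hypothetical points (the matrix `pts` is made up for illustration). It assumes the default Euclidean distance.

```r
# Three hypothetical points in the plane, one per row.
pts <- matrix(c(0, 0,
                3, 4,
                6, 8),
              ncol = 2, byrow = TRUE,
              dimnames = list(c("a", "b", "c"), c("x", "y")))

# dist() computes pairwise distances between rows and stores the lower
# triangle only; the default method is Euclidean.
d <- dist(pts)

# as.matrix() expands the result to a full symmetric matrix for inspection.
as.matrix(d)
```

Note that variables measured on larger scales contribute more to the Euclidean distance. Whether to standardise first (for example with `scale()`) is a modelling choice that this exercise leaves open.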

Exercise 3

Perform a hierarchical clustering analysis using the single linkage method. Then create an object that assigns each observation to one of two clusters.

Exercise 4

Perform a hierarchical clustering analysis using the complete linkage method (the default). Then create an object that assigns each observation to one of two clusters.

Exercise 5

Construct the trees (dendrograms) that are produced by exercises 3 and 4 and draw the two clusters on the plots.

Hint: `rect.hclust`
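The workflow from distance matrix to outlined dendrogram can be sketched as follows. This is only an illustration on hypothetical toy data (the matrix `toy` is invented here), not the solution for the Pima data.

```r
# Hypothetical toy data: two well-separated groups of three points each.
toy <- matrix(c(1.0, 2.0,
                1.5, 1.8,
                2.0, 2.4,
                6.0, 7.1,
                6.5, 6.7,
                7.0, 7.5),
              ncol = 2, byrow = TRUE)

d <- dist(toy)

# hclust() uses complete linkage by default;
# pass method = "single" for single linkage.
hc_single   <- hclust(d, method = "single")
hc_complete <- hclust(d)

# cutree() cuts the tree into k groups and returns one label per row.
groups <- cutree(hc_complete, k = 2)

# Draw the dendrogram, then outline the two clusters on it.
plot(hc_complete)
rect.hclust(hc_complete, k = 2, border = "red")
```

`rect.hclust` draws on the most recent plot, so it has to be called right after `plot()` for the same tree.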

Exercise 6

Construct two scatterplots with the `mass` variable on the x-axis and the `age` variable on the y-axis. Moreover, colour the points according to the cluster they belong to. Each scatterplot corresponds to a different clustering method.

If possible, display each of those scatterplots (one at a time) next to the plot from exercise 1, to see whether the clustering can discriminate the positively classified patients from the negatively classified ones. If you skipped this step, I highly encourage you to check it out in the solutions section.

Exercise 7

Run the following in order to create dummy variables: `data_mat <- model.matrix(~.+0, data = data)`.

Perform a centroid-based cluster analysis using the k-means method with k = 2. Apply the k-means clustering to the `data_mat` matrix.
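For orientation, a minimal k-means sketch on hypothetical toy data is shown below (the matrix `toy`, the seed and the choice `nstart = 10` are assumptions made for this example).

```r
# k-means starts from random centres, so fix the seed for reproducibility.
set.seed(42)

# Hypothetical toy data: two well-separated groups of three points each.
toy <- matrix(c(1.0, 2.0,
                1.5, 1.8,
                2.0, 2.4,
                6.0, 7.1,
                6.5, 6.7,
                7.0, 7.5),
              ncol = 2, byrow = TRUE)

# centers = 2 asks for two clusters; nstart = 10 repeats the random
# initialisation ten times and keeps the best solution.
km <- kmeans(toy, centers = 2, nstart = 10)

km$cluster  # cluster label for each row
km$centers  # coordinates of the two centroids
```

The numeric labels themselves are arbitrary: which group is called 1 and which is called 2 can change between runs.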

Exercise 8

Construct a scatterplot with the `mass` variable on the x-axis and the `age` variable on the y-axis. Moreover, colour the points according to the cluster (retrieved from the k-means method) they belong to.

If possible, display this scatterplot next to the plot from exercise 1.

Exercise 9

Perform a centroid-based cluster analysis using the k-median method with k = 2. Apply the k-median clustering to the `data_mat` matrix.
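A hedged sketch of k-median clustering with the `Gmedian` package follows. The toy data and the seed are invented for this example, and the call relies on the `kGmedian(X, ncenters = ...)` interface as described in the package help page, so please check `?kGmedian` for your installed version.

```r
library(Gmedian)

# The algorithm is stochastic, so fix the seed for reproducibility.
set.seed(42)

# Hypothetical toy data: 50 points around (0, 0) and 50 points around (5, 5).
toy <- rbind(
  matrix(rnorm(100, mean = 0), ncol = 2),
  matrix(rnorm(100, mean = 5), ncol = 2)
)

# ncenters = 2 asks for two clusters; the other arguments keep their defaults.
kmed <- kGmedian(toy, ncenters = 2)

# Cluster label for each row; as.vector() guards against a one-column matrix.
labels <- as.vector(kmed$cluster)
table(labels)

kmed$centers  # estimated cluster medians
```

Because medians are less sensitive to extreme values than means, k-median results can differ from k-means results when the data contain outliers.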

Exercise 10

Construct a scatterplot with the `mass` variable on the x-axis and the `age` variable on the y-axis. Moreover, colour the points according to the cluster (retrieved from the k-median method) they belong to.

If possible, display this scatterplot next to the plot from exercise 1.