Data Science for Doctors – Part 4 : Inferential Statistics (1/5)

Data science enhances people’s decision making. Doctors and researchers make critical decisions every day, so it is absolutely necessary for them to have some basic knowledge of data science. This series aims to help people in and around the medical field enhance their data science skills.

We will work with a health-related database, the famous “Pima Indians Diabetes Database”. It was generously donated by Vincent Sigillito from Johns Hopkins University. Please find further information regarding the dataset here.

This is the fourth part of the series, and it aims to partially cover the subject of inferential statistics. Researchers rarely have the capability of testing many patients, or of experimenting with a new treatment on many patients, so making inferences from a sample is a necessary skill to have. This is where inferential statistics comes into play.

Before proceeding, it might be helpful to look over the help pages for sample, mean, sd, sort, and pnorm. Moreover, it is crucial to be familiar with the Central Limit Theorem.

You may also need to load the ggplot2 library, and to install and load the moments package:
install.packages("moments")
library(moments)

Please run the code below in order to load the data set and transform it into a proper data frame format:

url <- "https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data"
data <- read.table(url, fileEncoding="UTF-8", sep=",")
names <- c('preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class')
colnames(data) <- names
data <- data[-which(data$mass == 0), ]

Answers to the exercises are available here.

If you obtained a different (correct) answer than those listed on the solutions page, please feel free to post your answer as a comment on that page.

Exercise 1

Generate a sampling distribution of sample means (10,000 iterations) with sample size 50, for the variable mass.

You are encouraged to experiment with different sample sizes and numbers of iterations in order to see the impact they have on the distribution (standard deviation, skewness, and kurtosis). Moreover, you can plot the distributions to get a better sense of what you are working with. One possible approach is sketched below.
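A minimal sketch, using replicate() and sample() and treating the data as the population (the object name sample_means is illustrative):

sample_means <- replicate(10000, mean(sample(data$mass, 50, replace = TRUE)))
hist(sample_means)  # plot the sampling distribution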

Exercise 2

Find the mean and standard error (standard deviation) of the sampling distribution.

You are encouraged to use the values from the original distribution (data$mass) in order to understand how the mean and standard deviation are derived, as well as the effect that the sample size has on the distribution.
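A sketch of the comparison, assuming the sample_means object from Exercise 1:

mean(sample_means)        # close to mean(data$mass)
sd(sample_means)          # standard error of the mean
sd(data$mass) / sqrt(50)  # theoretical standard error for n = 50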

Exercise 3

Find the skewness and kurtosis of the distribution you generated before.
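A sketch using the moments package loaded above:

skewness(sample_means)  # near 0 for an approximately normal distribution
kurtosis(sample_means)  # near 3 for an approximately normal distribution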

Exercise 4

Suppose we ran an experiment in which we took a sample of 50 people from the population and put them on an organic food diet. Their average mass was 30.5. What is the z-score for a mean of 30.5?

Exercise 5

What is the probability of drawing a sample of 50 with mean less than 30.5? Use the z-table if you feel you need to.
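A sketch covering this exercise and the previous one, using the original data as the population (Exercises 6 and 7 are the same computation with n = 150 and a mean of 31):

se <- sd(data$mass) / sqrt(50)       # standard error for n = 50
z  <- (30.5 - mean(data$mass)) / se  # Exercise 4: z-score
pnorm(z)                             # Exercise 5: P(sample mean < 30.5)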

Exercise 6

Suppose you did the experiment again, but with a larger sample of 150, and you found the average mass to be 31. Compute the z-score for this mean.

Exercise 7

What is the probability of drawing a sample of 150 with mean less than 31?

Exercise 8

Suppose everybody adopted the diet of the experiment. Find the margin of error for 95% of sample means.

Exercise 9

What would be our interval estimate that, with 95% likelihood, contains the population mean if everyone in the population adopted the organic diet?

Exercise 10

Find the interval estimates for 98% and 99% likelihood.
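A sketch for Exercises 8-10, built on the n = 150 sample with mean 31 (which sample to use is an assumption; qnorm() gives the critical values):

se_150 <- sd(data$mass) / sqrt(150)
me_95  <- qnorm(0.975) * se_150          # Exercise 8: 95% margin of error
c(31 - me_95, 31 + me_95)                # Exercise 9: 95% interval estimate
31 + c(-1, 1) * qnorm(0.99)  * se_150    # Exercise 10: 98% interval
31 + c(-1, 1) * qnorm(0.995) * se_150    # Exercise 10: 99% interval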




Data Hacking with RDSTK 3

RDSTK is a very versatile package. It includes functions to help you convert IP addresses to geolocations and derive statistics from them. It also allows you to input a body of text and convert it into sentiments.

This is a continuation of the last exercise set, RDSTK 2.
We are going to use the function we created in that set to derive statistics programmatically with the coordinates2statistics() function. Last week we talked about local and global variables; this is important to understand before proceeding. Also, refresh your memory on the ip2coordinates() function.

This package provides an R interface to Pete Warden’s Data Science Toolkit. For more information, click here.

Answers to the exercises are available here.

If you obtained a different (correct) answer than those listed on the solutions page, please feel free to post your answer as a comment on that page.

Exercise 1
This week we will give you a bigger and badder list to work with: a list of more than a dozen proxy IP addresses from the internet. Run the code below.


list=c("97.77.104.22","104.199.228.65","50.93.204.169","107.189.46.5","104.154.142.10","104.131.255.12","209.212.253.44","70.248.28.23","52.119.20.75","192.169.168.15","47.88.31.75 80","107.178.4.109","152.160.35.171","104.236.54.196","50.93.197.102","159.203.117.1","206.125.41.132","50.93.201.28","8.21.67.248 31","104.28.16.199")

Exercise 2

Remember how, in the first RDSTK exercise set, we used iterators to run through each location and derive the stats with the ip2coordinates() function? Let's do the same here. Store the results in df. A minimal sketch follows.
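One possible approach (this assumes every address resolves and returns the same columns):

results <- lapply(list, ip2coordinates)  # one data frame per IP address
df <- do.call(rbind, results)            # combine into a single data frame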

Exercise 3

If you came this far, great. Let's recall the function that we created in the last exercise set. If you do not remember it, here is the code. Run the code below and then run stat_maker("population_density"). You should see a new column called pop.

stat_maker = function(s2) {
  s1 = "statistics"
  s3 = "value"
  s2 = as.character(s2)
  for (i in 1:nrow(df)) {
    df$pop[i] <<- coordinates2statistics(df[i, 3], df[i, 6], s2)[paste(s1, s2, s3, sep = ".")]
    assign("test2", 50, envir = .GlobalEnv)
  }
}

Note how the column is selected: the function builds a name in the format “statistics.population_density.value”.

Exercise 4

Modify the function so that it accepts a string and creates a global variable holding the values of that statistic. For example, if you input elevation, the function will create a global variable called elevation with the results from the for loop stored in it. One possible solution is sketched below.
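A sketch patterned on the function above (it assumes, as before, that columns 3 and 6 of df hold the latitude and longitude):

stat_maker = function(s2) {
  s2 <- as.character(s2)
  col <- paste("statistics", s2, "value", sep = ".")
  out <- numeric(nrow(df))
  for (i in 1:nrow(df)) {
    out[i] <- coordinates2statistics(df[i, 3], df[i, 6], s2)[[col]]
  }
  assign(s2, out, envir = .GlobalEnv)  # global variable named after the statistic
}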

Exercise 5

Test out the function.


stat_maker("elevation")

Exercise 6

Test the function stat_maker with stat_maker("population_density"). Notice that it did not explicitly make the changes to df, but just returned the result once you called the function. This is because we did not define df as a global variable. But that’s okay; we will learn that later.

Exercise 7

Great. Now, before we modify our function, let’s learn how we can make a global variable inside a function. Use the same code from Exercise 5, but this time, instead of defining df$pop2 as a local variable, define it as a global variable. Run the function and test it again.

Exercise 8

Run the code

stat_maker("us_population_poverty")

Notice that our function does not work for this case. This is because anything with the prefix us_population will return a dataframe with a column name like statistics.us_population.value, so you need to modify the function a little to accommodate this. A sketch follows.
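One way to handle the prefix behavior described above (a sketch; the naming rule is taken from the description in this exercise):

stat_maker = function(s2) {
  s2 <- as.character(s2)
  key <- if (startsWith(s2, "us_population")) "us_population" else s2
  col <- paste("statistics", key, "value", sep = ".")
  out <- numeric(nrow(df))
  for (i in 1:nrow(df)) {
    out[i] <- coordinates2statistics(df[i, 3], df[i, 6], s2)[[col]]
  }
  assign(s2, out, envir = .GlobalEnv)
}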

Exercise 9

Run the following commands. You can use any string starting with us_population for this function, but the goal is to make global variables that hold this data. You can refer to the whole list of statistics functions at www.datasciencetoolkit.org.

stat_maker("us_population")
stat_maker("us_population_poverty")
stat_maker("us_population_asian")
stat_maker("us_population_bachelors_degree")
stat_maker("us_population_black_or_african_american")
stat_maker("us_population_black_or_african_american_not_hispanic ")
stat_maker("us_population_eighteen_to_twenty_four_years_old")
stat_maker("us_population_five_to_seventeen_years_old")
stat_maker("us_population_foreign_born")
stat_maker("us_population_hispanic_or_latino")

Exercise 10

Use the cbind() command to bind all the global variables into df. Print the results of df.

Note: You can choose to build this df in other ways, but this method was used to guide you through modifying functions, global/local variables, and working with strings.
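A minimal sketch of Exercise 10, assuming the global variables from Exercise 9 exist (only the first few are shown):

df <- cbind(df, us_population, us_population_poverty, us_population_asian)
df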




Data Science for Doctors – Part 3 : Distributions

Data science enhances people’s decision making. Doctors and researchers make critical decisions every day, so it is absolutely necessary for them to have some basic knowledge of data science. This series aims to help people in and around the medical field enhance their data science skills.

This is the third part of the series; it covers the main distributions that you will use most of the time. This part was created to make sure that you have (or will have, after solving this set of exercises) the knowledge required for the parts to come. The distributions that we will see are:

1) Binomial distribution: the binomial distribution fits repeated trials, each with a dichotomous outcome such as success-failure, healthy-disease, heads-tails.

2) Normal distribution: the most famous distribution; it is also assumed for many gene expression values.

3) T-distribution: the T-distribution has many useful applications for testing hypotheses when the sample size is lower than thirty.

4) Chi-squared distribution: the chi-squared distribution plays an important role in testing hypotheses about frequencies.

5) F-distribution: the F-distribution is important for testing the equality of two variances.

Before proceeding, it might be helpful to look over the help pages for choose, dbinom, pbinom, rbinom, qbinom, pnorm, qnorm, rnorm, dnorm, pchisq, qchisq, dchisq, df, pf, and qf.

Answers to the exercises are available here.

If you obtained a different (correct) answer than those listed on the solutions page, please feel free to post your answer as a comment on that page.

Exercise 1

Let X be binomially distributed with n = 100 and p = 0.3. Compute the following:
a) P(X = 34), P(X ≥ 34), and P(X ≤ 34)
b) P(30 ≤ X ≤ 60)
c) The quantiles x0.025, and x0.975
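A sketch of the built-in binomial functions for this exercise:

dbinom(34, size = 100, prob = 0.3)           # P(X = 34)
1 - pbinom(33, 100, 0.3)                     # P(X >= 34)
pbinom(34, 100, 0.3)                         # P(X <= 34)
pbinom(60, 100, 0.3) - pbinom(29, 100, 0.3)  # P(30 <= X <= 60)
qbinom(c(0.025, 0.975), 100, 0.3)            # the two quantiles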

Exercise 2

Let X be normally distributed with mean = 3 and standard deviation = 1. Compute the following:
a) P(X < 2), P(2 ≤ X ≤ 4)
b) The quantiles x0.025, x0.5, and x0.975.
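A sketch using pnorm() and qnorm():

pnorm(2, mean = 3, sd = 1)                     # P(X < 2)
pnorm(4, 3, 1) - pnorm(2, 3, 1)                # P(2 <= X <= 4)
qnorm(c(0.025, 0.5, 0.975), mean = 3, sd = 1)  # quantiles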

Exercise 3

Let T8 follow a T-distribution with 8 degrees of freedom. Compute the following:
a) P(T8 < 1), P(T8 > 2), P(-1 < T8 < 1)
b) The quantiles t0.025, t0.5, and t0.975. Can you justify the values of the quantiles?
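A sketch using pt() and qt():

pt(1, df = 8)                     # P(T8 < 1)
1 - pt(2, df = 8)                 # P(T8 > 2)
pt(1, 8) - pt(-1, 8)              # P(-1 < T8 < 1)
qt(c(0.025, 0.5, 0.975), df = 8)  # quantiles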

Exercise 4

Compute the following for the chi-squared distribution with 5 degrees of freedom:
a) P(χ²5 < 2), P(χ²5 > 4), P(4 < χ²5 < 6)
b) The quantiles g0.025, g0.5, and g0.975.
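A sketch using pchisq() and qchisq():

pchisq(2, df = 5)                     # P(χ²5 < 2)
1 - pchisq(4, df = 5)                 # P(χ²5 > 4)
pchisq(6, 5) - pchisq(4, 5)           # P(4 < χ²5 < 6)
qchisq(c(0.025, 0.5, 0.975), df = 5)  # quantiles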

Exercise 5

Compute the following for the F6,3 distribution:
a) P(F6,3 < 2), P(F6,3 > 3), P(1 < F6,3 < 4)
b) The quantiles f0.025, f0.5, and f0.975.
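A sketch using pf() and qf():

pf(2, df1 = 6, df2 = 3)                     # P(F6,3 < 2)
1 - pf(3, 6, 3)                             # P(F6,3 > 3)
pf(4, 6, 3) - pf(1, 6, 3)                   # P(1 < F6,3 < 4)
qf(c(0.025, 0.5, 0.975), df1 = 6, df2 = 3)  # quantiles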

Exercise 6

Generate 100 observations following the binomial distribution and plot them (if possible, in the same plot):
a) n = 20, p = 0.3
b) n = 20, p = 0.5
c) n = 20, p = 0.7
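One way to overlay the three samples (density plots are just one choice of visualization; the same pattern works for Exercises 7-10 with rnorm(), rt(), rchisq(), and rf()):

plot(density(rbinom(100, 20, 0.3)), xlim = c(0, 20), main = "Binomial samples")
lines(density(rbinom(100, 20, 0.5)), col = "red")
lines(density(rbinom(100, 20, 0.7)), col = "blue")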

Exercise 7

Generate 100 observations following the normal distribution and plot them (if possible, in the same plot):
a) standard normal distribution ( N(0,1) )
b) mean = 0, sd = 3
c) mean = 0, sd = 7

Exercise 8

Generate 100 observations following the T-distribution and plot them (if possible, in the same plot):
a) df = 5
b) df = 10
c) df = 25

Exercise 9

Generate 100 observations following the chi-squared distribution and plot them (if possible, in the same plot):
a) df = 5
b) df = 10
c) df = 25

Exercise 10

Generate 100 observations following the F-distribution and plot them (if possible, in the same plot):
a) df1 = 3, df2 = 9
b) df1 = 9, df2 = 3
c) df1 = 15, df2 = 15




Data Hacking with RDSTK 2

RDSTK is a very versatile package. It includes functions to help you convert IP addresses to geolocations and derive statistics from them. It also allows you to input a body of text and convert it into sentiments.

This is a continuation of the last exercise set, RDSTK 1.
This package provides an R interface to Pete Warden’s Data Science Toolkit. See www.datasciencetoolkit.org for more information.

Answers to the exercises are available here.

If you obtained a different (correct) answer than those listed on the solutions page, please feel free to post your answer as a comment on that page.

Exercise 1
Load the RDSTK library and download the dataset here.

Exercise 2

a. Create a string called s1 and store “statistics” inside it.
b. Create a string called s3 and store “value” inside it.
c. Create a function that takes a string s2 as input and outputs a string in the format s1+s2+s3, separated by “.”. Name this function “stringer”.
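A minimal sketch, using paste() with a sep argument:

s1 <- "statistics"
s3 <- "value"
stringer <- function(s2) paste(s1, s2, s3, sep = ".")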

Exercise 3

Let’s test out this function.


stringer("hello")

You should see an output in the format “statistics.hello.value”

Exercise 4

Create a for loop that iterates over the rows in df and derives the population density of each location using the coordinates2statistics() function. Save the results in df$pop. A sketch follows.
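One possible loop (this assumes columns 3 and 6 of df hold the latitude and longitude, and it uses stringer() from Exercise 2 to build the column name):

for (i in 1:nrow(df)) {
  df$pop[i] <- coordinates2statistics(df[i, 3], df[i, 6],
                 "population_density")[[stringer("population_density")]]
}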

Exercise 5

Let’s now make a function using elements you learned from Exercises 3 and 4. The function takes a string as input, like s2 from Exercise 3. Inside the function, combine it with s1 and s3, and create the same for loop from Exercise 4. Instead of storing the result of the for loop in df$pop, use df$pop2. You should see a new feature inside df with all the results once you return df from the function. Name this function “stat_maker”.

Exercise 6

Test the function stat_maker with stat_maker("population_density"). Notice that it did not explicitly make the changes to df, but just returned the result once you called the function. This is because we did not define df as a global variable. But that’s okay; we will learn that later.

Exercise 7

Great. Now, before we modify our function, let’s learn how we can make a global variable inside a function. Use the same code from Exercise 5, but this time, instead of defining df$pop2 as a local variable, define it as a global variable. Run the function and test it again.

Exercise 8

You can also use the assign() function inside a function and set the result as a global variable. Let’s see an example of the assign() function:

assign("test", 50)

Now, if you type test in your console, you should see 50. Try it.

Exercise 9

Now try putting the same code from Exercise 8 inside the stat_maker function, changing test to test2. Once you test the function, you will see that test2 does not return anything. This is because it was not set as a global variable.

Exercise 10

Set test2 as a global variable inside the stat_maker function. Run the function, and now you should see test2 return 50 when you call it.
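The key is the envir argument of assign(). A minimal sketch (the wrapper name is illustrative):

demo_global <- function() {
  assign("test2", 50, envir = .GlobalEnv)  # .GlobalEnv makes test2 global
}
demo_global()
test2  # now returns 50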




Data Hacking with RDSTK (part 1)

RDSTK is a very versatile package. It includes functions to help you convert IP addresses to geolocations and derive statistics from them. It also allows you to input a body of text and convert it into sentiments.

This package provides an R interface to Pete Warden’s Data Science Toolkit. See www.datasciencetoolkit.org for more information.

Answers to the exercises are available here. If you obtained a different (correct) answer than those listed on the solutions page, please feel free to post your answer as a comment on that page.

Exercise 1
Install and load the RDSTK package.

Exercise 2

Convert the IP address “165.124.145.197” to coordinates using the ip2coordinates() function. Store the results under the variable stat.
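A minimal sketch:

library(RDSTK)
stat <- ip2coordinates("165.124.145.197")
str(stat)  # inspect the returned features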

Exercise 3

Derive the elevation of that location using the latitude and longitude. Use the coordinates2statistics() function to achieve this. Once you get the elevation, store it back as one of the features of stat.
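A sketch, assuming (as in the Exercise 5 snippet below) that columns 3 and 6 of stat hold the latitude and longitude:

stat$elevation <- coordinates2statistics(stat[3], stat[6],
                    "elevation")[["statistics.elevation.value"]]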

Exercise 4

Derive the population_density of that location using the latitude and longitude. Use the coordinates2statistics() function to achieve this. Once you get the population density, store it back as one of the features of stat called pop_den.

Exercise 5

Great. You are getting the hang of it. Let us try getting the mean temperature of that location. You will notice that it returns a list of 12 numbers, one for each month.

Run this code and see for yourself:

coordinates2statistics(stat[3],stat[6],"mean_temperature")[1]

Exercise 6

We have to transform mean_temperature so we can store it as one of the features in our stat dataset. One way to do this is to convert it from long to wide format, but that would be too redundant. Let’s just find the mean temperature across January-December. You might find the sapply() function useful for converting each element of the list to an integer.
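A sketch for this exercise and the next one (it assumes the 12 monthly values come back as a list, as described above):

temps <- coordinates2statistics(stat[3], stat[6], "mean_temperature")[[1]]
mean(sapply(temps, as.integer))        # Exercise 6: January-December mean
mean(sapply(temps[6:12], as.integer))  # Exercise 7: June-December mean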

Exercise 7

We decided we do not really need the January-December mean value. We actually need the mean temperature from June-December. Make that adjustment to your last code and store the result back in stat under the name mean_temp.

Exercise 8

Okay, great. Now let’s work with more IP address data. Here is a list of a few IP addresses scraped from a few commenters on my exercises.

list = c("165.124.145.197", "31.24.74.155", "79.129.19.173")
df = data.frame(list)
df[, 1] = as.character(df[, 1])

Exercise 9

Use an iterator from the apply family that will go through the list and derive the statistics with the ip2coordinates() function. This is the first part. You may get a list-within-a-list sort of result. Store this in a variable called data.

Exercise 10

Use a method to convert that list-within-a-list into a dataframe with 3 rows and all the columns derived from the ip2coordinates() function. You are free to use any method for this. One possibility is sketched below.
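A sketch covering Exercises 9 and 10 (the name df2 is illustrative):

data <- lapply(df[, 1], ip2coordinates)  # Exercise 9: list of one-row data frames
df2 <- do.call(rbind, data)              # Exercise 10: one data frame with 3 rows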




Multiple Regression (Part 2) – Diagnostics

Multiple Regression is one of the most widely used methods in statistical modelling. However, despite its many benefits, it is oftentimes used without checking the underlying assumptions. This can lead to results which can be misleading or even completely wrong. Therefore, applying diagnostics to detect any strong violations of the assumptions is important. In the exercises below we cover some material on multiple regression diagnostics in R.

Answers to the exercises are available here.

If you obtain a different (correct) answer than those listed on the solutions page, please feel free to post your answer as a comment on that page.

Multiple Regression (Part 1) can be found here.

We will be using the dataset state.x77, which is part of the state datasets available in R. (Additional information about the dataset can be obtained by running help(state.x77).)

Exercise 1
a. Load the state datasets.
b. Convert the state.x77 dataset to a dataframe.
c. Rename the Life Exp variable to Life.Exp, and HS Grad to HS.Grad. (This avoids problems with referring to these variables when specifying a model.)
d. Produce the correlation matrix.
e. Create a scatterplot matrix for the variables Life.Exp, HS.Grad, Murder, and Frost.
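A sketch of the setup (the dataframe name st is illustrative):

data(state)
st <- as.data.frame(state.x77)
colnames(st)[colnames(st) == "Life Exp"] <- "Life.Exp"
colnames(st)[colnames(st) == "HS Grad"]  <- "HS.Grad"
cor(st)                                                  # d. correlation matrix
pairs(~ Life.Exp + HS.Grad + Murder + Frost, data = st)  # e. scatterplot matrix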

Exercise 2
a. Fit the model with Life.Exp as dependent variable, and HS.Grad and Murder as predictors.
b. Obtain the residuals.
c. Obtain the fitted values.

Exercise 3
a. Create a residual plot (residuals vs. fitted values).
b. Create the same residual plot using the plot command on the lm object from Exercise 2.
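A sketch covering Exercises 2 and 3, assuming the st dataframe from Exercise 1:

fit <- lm(Life.Exp ~ HS.Grad + Murder, data = st)
res <- resid(fit)     # residuals
fv  <- fitted(fit)    # fitted values
plot(fv, res)         # residual plot by hand
plot(fit, which = 1)  # the same plot from the lm object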

Learn more about multiple linear regression in the online courses Linear regression in R for Data Scientists, Statistics with R – advanced level, and Linear Regression and Modeling.

Exercise 4
Create plots of the residuals vs. each of the predictor variables.

Exercise 5
a. Create a Normality plot.
b. Create the same plot using the plot command on the lm object from Exercise 2.

Exercise 6
a. Obtain the studentized residuals.
b. Do there appear to be any outliers?

Exercise 7
a. Obtain the leverage value for each observation and plot them.
b. Obtain the conventional threshold for leverage values. Are any observations influential?

Exercise 8
a. Obtain DFFITS values.
b. Obtain the conventional threshold. Are any observations influential?
c. Obtain DFBETAS values.
d. Obtain the conventional threshold. Are any observations influential?
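A sketch, assuming the fit object from Exercise 2; the cutoffs below are common rules of thumb, and textbooks vary:

n <- nrow(st)
p <- length(coef(fit))                        # coefficients, including the intercept
dff <- dffits(fit)
dff[abs(dff) > 2 * sqrt(p / n)]               # DFFITS beyond a common threshold
dfb <- dfbetas(fit)
dfb[apply(abs(dfb) > 2 / sqrt(n), 1, any), ]  # DFBETAS beyond a common threshold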

Exercise 9
a. Obtain Cook’s distance values and plot them.
b. Obtain the same plot using the plot command on the lm object from Exercise 2.
c. Obtain the threshold value. Are any observations influential?
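A sketch; the 4/n cutoff is one common convention:

cd <- cooks.distance(fit)
plot(cd, type = "h")   # Cook's distance by hand
plot(fit, which = 4)   # the same plot from the lm object
cd[cd > 4 / nrow(st)]  # potentially influential observations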

Exercise 10
Create the Influence Plot using a function from the car package.
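A minimal sketch using the car package:

library(car)
influencePlot(fit)  # bubble area reflects Cook's distance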




Multiple Regression (Part 1)

In the exercises below we cover some material on multiple regression in R.

Answers to the exercises are available here.

If you obtained a different (correct) answer than those listed on the solutions page, please feel free to post your answer as a comment on that page.

We will be using the dataset state.x77, which is part of the state datasets available in R. (Additional information about the dataset can be obtained by running help(state.x77).)

Exercise 1

a. Load the state datasets.
b. Convert the state.x77 dataset to a dataframe.
c. Rename the Life Exp variable to Life.Exp, and HS Grad to HS.Grad. (This avoids problems with referring to these variables when specifying a model.)

Exercise 2
Suppose we wanted to enter all the variables in a first-order linear regression model with Life Expectancy as the dependent variable. Fit this model.
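A sketch, assuming st was prepared as in Exercise 1 (with Life Exp renamed to Life.Exp):

fit_full <- lm(Life.Exp ~ ., data = st)
summary(fit_full)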

Exercise 3

Suppose we wanted to remove the Income, Illiteracy, and Area variables from the model in Exercise 2. Use the update function to fit this model.
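A sketch using update() on the full model from Exercise 2:

fit_reduced <- update(fit_full, . ~ . - Income - Illiteracy - Area)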

Learn more about multiple linear regression in the online course Linear regression in R for Data Scientists. In this course you will learn how to:

  • Model basic and complex real-world problems using linear regression
  • Understand when models are performing poorly and correct them
  • Design complex models for hierarchical data
  • And much more

Exercise 4
Let’s assume that we have settled on a model that has HS.Grad and Murder as predictors. Fit this model.

Exercise 5
Add an interaction term to the model in Exercise 4 (3 different ways).
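Three equivalent formula notations for the interaction, assuming the st dataframe from Exercise 1:

lm(Life.Exp ~ HS.Grad + Murder + HS.Grad:Murder, data = st)
lm(Life.Exp ~ HS.Grad * Murder, data = st)
lm(Life.Exp ~ (HS.Grad + Murder)^2, data = st)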

Exercise 6
For this and the remaining exercises in this set we will use the model from Exercise 4.

Obtain 95% confidence intervals for the coefficients of the two predictor variables.
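A sketch, with the Exercise 4 model stored in fit:

fit <- lm(Life.Exp ~ HS.Grad + Murder, data = st)
confint(fit, level = 0.95)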

Exercise 7
Predict the Life Expectancy for a state where 55% of the population are High School graduates, and the murder rate is 8 per 100,000.

Exercise 8

Obtain a 98% confidence interval for the mean Life Expectancy in a state where 55% of the population are High School graduates, and the murder rate is 8 per 100,000.

Exercise 9

Obtain a 98% confidence interval for the Life Expectancy of a person living in a state where 55% of the population are High School graduates, and the murder rate is 8 per 100,000.
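A combined sketch for Exercises 7-9, using predict() on the fit object from the Exercise 6 sketch:

new <- data.frame(HS.Grad = 55, Murder = 8)
predict(fit, new)                                         # Exercise 7: point estimate
predict(fit, new, interval = "confidence", level = 0.98)  # Exercise 8: mean response
predict(fit, new, interval = "prediction", level = 0.98)  # Exercise 9: single observation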

Exercise 10

Since our model only has two predictor variables, we can generate a 3D plot of our data and the fitted regression plane. Create this plot.
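One possible approach uses the scatterplot3d package (an assumption; base R persp() would also work). It relies on the st dataframe and fit model from the earlier sketches:

library(scatterplot3d)
s3d <- scatterplot3d(st$HS.Grad, st$Murder, st$Life.Exp)
s3d$plane3d(fit)  # add the fitted regression plane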




Intermediate Tree 2

This is a continuation of the intermediate decision tree exercise.

Answers to the exercises are available here.

If you obtained a different (correct) answer than those listed on the solutions page, please feel free to post your answer as a comment on that page.

 

Exercise 1
Use the predict() command to make predictions on the Test data. Set type to “class”; class returns classifications instead of probability scores. Store this prediction in pred_dec.

Exercise 2
Print out the confusion matrix

Exercise 3
What is the accuracy of the model? Use the confusion matrix.

Exercise 4
What is the misclassification error rate? Refer to the Basic_decision_tree exercise to get the formula.
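A combined sketch for Exercises 1-4, assuming the dec model and the Test set from the Intermediate Tree 1 exercises:

pred_dec <- predict(dec, Test, type = "class")  # Exercise 1
cm <- table(Test$class, pred_dec)               # Exercise 2: confusion matrix
accuracy <- sum(diag(cm)) / sum(cm)             # Exercise 3
1 - accuracy                                    # Exercise 4: misclassification error rate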

Exercise 5
Let’s say we want a baseline model against which to compare our prediction improvement. We create a base model using the code below.


length(Test$class)
base=rep(1,3183)

Use the table() command to create a confusion matrix between the base and Test$class.


Exercise 6
What is the difference in accuracy between the dec model and the base model?

Exercise 7

Remember the predict() command from Exercise 1. We will use the same model and the same command, except we will set the method to “regression”. This gives us probability estimates. Store these in pred_dec_reg.

Exercise 8
Load the ROCR package.

Use the prediction(), performance() and plot() commands to print the ROC curve. Use the pred_dec_reg variable from Exercise 7. You can also refer to the previous exercise set to see the code.
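A sketch, assuming pred_dec_reg holds the probability of the positive class:

library(ROCR)
pred <- prediction(pred_dec_reg, Test$class)
perf <- performance(pred, "tpr", "fpr")  # true vs. false positive rates
plot(perf)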

Exercise 9
plot() the same ROC curve, but set colorize=TRUE.

Exercise 10
Comment on your findings using the ROC curve and accuracy. Is it a good model? Did you notice that the ROCR prediction() command only takes probability predictions as one of its arguments? Why is that so?




Working with Shapefiles in R Exercises

R has many powerful libraries to handle spatial data, and the list of things that R can do with maps keeps growing. This exercise set demonstrates a few basic functionalities of R while dealing with shapefiles.

A shapefile is a simple, nontopological format for storing the geometric location and attribute information of geographic features. Geographic features in a shapefile can be represented by points, lines, or polygons (ESRI). The geographic features are associated with an attribute table which is very similar to an R dataframe.

The rgdal package in R provides bindings to the popular Geospatial Data Abstraction Library (GDAL) for reading, writing and converting between spatial formats. We are using a very popular dataset, the London sports participation shapefile (download here). The attributes Pop_2001 and Partic_Per represent the population of the London Boroughs in 2001 and the percentage of the population participating in sporting activities.

Answers to the exercises are available here. If you obtained a different (correct) answer than those listed on the solutions page, please feel free to post your answer as a comment on that page.

Please install and load the package rgdal before starting the exercises.

Exercise 1
Read the London sports map from the shapefile london_sports.shp.
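A minimal sketch with rgdal (this assumes the shapefile sits in the working directory):

library(rgdal)
london <- readOGR(dsn = ".", layer = "london_sports")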

Exercise 2
Change the coordinate system of the map to WGS 84.
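One way to reproject, using spTransform() from the sp package (loaded with rgdal):

london <- spTransform(london, CRS("+proj=longlat +datum=WGS84"))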

Exercise 3
Find the names of the zones where the sports participation rate is more than 25%.

Exercise 4
Plot the London map in sky blue, along with a title.

Exercise 5
Plot the zones in London with Sports Participation Rates less than 15% in red. Retain the earlier blue color for other zones.

Exercise 6
Plot the zones in London with Sports Participation Rates more than 25% in green. Retain the earlier color for other zones.

Exercise 7
Place a black circle marker at the centre of each zone. Retain previous maps.

Exercise 8
Put labels for each zone. Place the labels to the right of the black marker.

Exercise 9
Add another categorical attribute, sports_part, which has the values "low", "medium" and "high" for sports participation rates less than or equal to 15%, between 15% and 25%, and greater than 25%, respectively.
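A sketch using cut(), whose default right-closed intervals match the cutoffs above:

london$sports_part <- cut(london$Partic_Per,
                          breaks = c(-Inf, 15, 25, Inf),
                          labels = c("low", "medium", "high"))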

Exercise 10
Save the new map object with modified attribute table as a new shapefile “london_sport2.shp”.
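A sketch using writeOGR() from rgdal:

writeOGR(london, dsn = ".", layer = "london_sport2", driver = "ESRI Shapefile")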




Intermediate Tree 1

If you followed through the Basic Decision Tree exercise set, this should be useful for you. This is like a continuation, but we add so much more. We are working with a bigger and badder dataset. We will also be using techniques we learned from model evaluation and work with ROC, accuracy and other metrics.

Answers to the exercises are available here.

If you obtained a different (correct) answer than those listed on the solutions page, please feel free to post your answer as a comment on that page.

Exercise 1
Read in the adult.csv file with header=FALSE. Store this in df. Use the str() command to inspect the dataframe. Download the data from here.
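A minimal sketch (assuming the CSV sits in the working directory):

df <- read.csv("adult.csv", header = FALSE)
str(df)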

Exercise 2
You are given the meta data that goes with the CSV, which you can download here. Use it to add the column names to your dataframe. Notice that the df columns are ordered V1, V2, V3, and so on. As a side note, it is always best practice to use the meta data to check whether all the columns were read in correctly.

Exercise 3
Use the table() command to print out the distribution of the class feature.

Exercise 4
Change the class column to binary.
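A sketch, assuming the last column was named class in Exercise 2 and the raw labels are "<=50K"/">50K" (possibly with leading spaces):

df$class <- ifelse(trimws(as.character(df$class)) == ">50K", 1, 0)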


Exercise 5
Use the cor() command to see the correlation of all the numeric and integer columns, including the class column. Remember that numbers close to 1 mean high correlation and numbers close to 0 mean low correlation. This will give you a rough idea for feature selection.

Exercise 6
Split the dataset into Train and Test samples. You may use sample.split() with a ratio of 0.7, and set the seed to 1000. Make sure to install and load the caTools package.
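A sketch with caTools:

library(caTools)
set.seed(1000)
split <- sample.split(df$class, SplitRatio = 0.7)
Train <- subset(df, split == TRUE)
Test  <- subset(df, split == FALSE)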

Exercise 7
Check the number of rows of Train
Check the number of rows of Test

Exercise 8
We are ready to use a decision tree on our dataset. Load the packages “rpart” and “rpart.plot”. If they are not installed, use the install.packages() command.

Exercise 9
Use rpart() to build the decision tree on the Train set. Include all features. Store this model in dec.
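A sketch (coercing class to a factor so rpart builds a classification tree is an implementation choice):

library(rpart)
Train$class <- as.factor(Train$class)
dec <- rpart(class ~ ., data = Train, method = "class")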

Exercise 10
Use the prp() function to plot the decision tree. If you get an error, use this code before the prp() command:

par(mar = rep(2, 4))
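A minimal sketch:

library(rpart.plot)
par(mar = rep(2, 4))  # widen the plot margins if prp() complains
prp(dec)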