Data science for Doctors: Inferential Statistics Exercises (part-2)


Data science enhances people’s decision making. Doctors and researchers are making critical decisions every day. Therefore, it is absolutely necessary for those people to have some basic knowledge of data science. This series aims to help people in the medical field enhance their data science skills.

We will work with a health-related dataset, the famous “Pima Indians Diabetes Database”. It was generously donated by Vincent Sigillito from Johns Hopkins University. Please find further information regarding the dataset here.

This is the fifth part of the series, and it aims to partially cover the subject of inferential statistics. Researchers rarely have the capability of testing many patients, or experimenting with a new treatment on many patients; therefore, making inferences from a sample is a necessary skill to have. This is where inferential statistics comes into play. In more detail, in this part we will go through hypothesis testing for the binomial distribution (binomial test) and the normal distribution (Z-test). If you are not familiar with these distributions, please go here to acquire the necessary background.

Before proceeding, it might be helpful to look over the help pages for binom.test, mean, sd, sqrt, and z.test. Moreover, it is crucial to be familiar with the Central Limit Theorem.

install.packages("TeachingDemos")
library(TeachingDemos)

Please run the code below in order to load the data set and transform it into a proper data frame format:

url <- "https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data"
data <- read.table(url, fileEncoding="UTF-8", sep=",")
names <- c('preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class')
colnames(data) <- names
data <- data[-which(data$mass ==0),]

Answers to the exercises are available here.

If you obtained a different (correct) answer than those listed on the solutions page, please feel free to post your answer as a comment on that page.

Exercise 1

Suppose that we take a sample of 30 candidates that tried a medicine, and 5 of them test positive.
The null hypothesis H0: p = average of classes is to be tested against H1: p != average of classes.
In practice, this tests whether the drug had an effect on the patients.
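
A minimal sketch of one possible approach, assuming "positive" counts as a success and taking the null proportion to be the mean of data$class:

# Two-sided binomial test: 5 successes out of 30 trials,
# null proportion taken as the mean of the class variable
binom.test(5, 30, p = mean(data$class), alternative = "two.sided")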

Exercise 2

Apply the same test as above, but instead of specifying the number of trials, apply the test in terms of the numbers of successes and failures (5, 25).
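
The same test can be specified by passing successes and failures as a length-2 vector, e.g.:

binom.test(c(5, 25), p = mean(data$class))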

Exercise 3

Keeping the same null hypothesis as in Exercises 1 and 2, apply a one-sided test where H1: p < average of classes.

Exercise 4

In the previous exercises we didn’t specify the confidence level, so the tests used the default of 0.95. Run the test from Exercise 3, but with a confidence level of 0.99.

Exercise 5

We have created another drug and tested it on another 30 candidates. After they had taken the medicine for a few weeks, only 2 out of 30 tested positive. We got really excited and decided to set the confidence level to 0.999. Does the drug have an actual impact?

Exercise 6

Suppose that we establish a new diet, and a sample of 30 candidates who tried it had an average mass of 29 after the testing period. Find the confidence interval at a significance level of 0.05. Keep in mind that we run the test with respect to the data$mass variable.
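
A sketch of one way to build the interval under the CLT, assuming the population standard deviation is estimated from data$mass:

xbar <- 29                       # sample mean after the diet
se <- sd(data$mass) / sqrt(30)   # standard error of the mean
xbar + c(-1, 1) * qnorm(0.975) * se   # 95% confidence interval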

Exercise 7

Find the Z-score of the sample.

Exercise 8

Find the p-value for the experiment.
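
A sketch covering Exercises 7 and 8, using data$mass as the reference population:

z <- (29 - mean(data$mass)) / (sd(data$mass) / sqrt(30))   # Z-score of the sample mean
2 * pnorm(-abs(z))                                         # two-sided p-value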

Exercise 9

Run the Z-test using the z.test function with a confidence level of 0.95, and let the alternative hypothesis be that the diet had an effect (two-sided test).
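
One hedged sketch, assuming TeachingDemos' z.test accepts a summary mean together with its n argument:

z.test(29, mu = mean(data$mass), stdev = sd(data$mass), n = 30,   # n = 30 is an assumption about the interface
       alternative = "two.sided", conf.level = 0.95)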

Exercise 10

Let’s get a bit more intuitive now: let the alternative hypothesis be that the diet leads to a lower average body mass, with a confidence level of 0.99 (one-sided test).




One way MANOVA exercises

In ANOVA, our interest lies in knowing whether one continuous dependent variable is affected by one or more categorical independent variables. MANOVA is an extension of ANOVA in which we can understand how several dependent variables are affected by independent variables. For example, consider an investigation in which a medical investigator has developed three back-pain therapies. Patients are enrolled for a 10-week trial, and at the end the investigator interviews them on the reduction of physiological, emotional, and cognitive pain. Interest is in knowing which therapy is best at reducing pain.

Just like in ANOVA, we can have one-way or two-way MANOVA, depending on the number of independent variables.

When conducting MANOVA it is important to understand the assumptions that need to be satisfied so that the results are valid. The assumptions are explained below.

  • The observations are independent. Observations that are collected over time, over space and in any groupings violate the assumption of independence.
  • The data follow a multivariate normal distribution. When there are many observations, we can rely on the central limit theorem (CLT) to achieve approximate normality. MANOVA is robust to non-normality that arises from skewness, but it is not robust to non-normality resulting from outliers. Outliers should be checked and appropriate action taken; the analysis can be run with and without the outliers to check sensitivity.
  • The variance in all the groups is homogeneous. A Bartlett test is useful for assessing the homogeneity of variance. MANOVA is not robust to deviations from this assumption, so a variance-stabilizing transformation may be required.

MANOVA can be used to understand the interactions and main effects of independent variables. The four test statistics that can be used are Wilks’ lambda, Pillai’s trace, the Hotelling-Lawley trace, and Roy’s largest root. Among the four, Pillai’s trace is least affected by violations of the assumptions, but Wilks’ lambda is the most commonly used.
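
As a hedged illustration, with hypothetical variable and data frame names, a one-way MANOVA in base R might look like this:

# three change scores as responses, therapy as the factor (names are hypothetical)
fit <- manova(cbind(physiological, emotional, cognitive) ~ therapy, data = koro)
summary(fit, test = "Pillai")   # alternatives: "Wilks", "Hotelling-Lawley", "Roy"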

In this first part of the MANOVA exercises we will use data from a study investigating a control and three therapies aimed at reducing symptoms of koro. Forty patients were selected for inclusion in the study, and 10 patients were assigned to each of the four groups. Interest is in understanding which therapy is best at reducing symptoms. We will create three variables that hold the change in indices before and after treatment. Here we have one independent variable and three dependent variables, resulting in a one-way MANOVA.

Solutions to these exercises can be found here.

Exercise 1

Import data into R

Exercise 2

Check the number of observations in each group

Exercise 3

Create the variables that hold the change in indices

Exercise 4

Summarize the change variables

Exercise 5

Get descriptive statistics for each therapy

Exercise 6

Obtain the correlation matrix

Exercise 7

Check for outliers

Exercise 8

Check for homogeneity of variance

Exercise 9

Run MANOVA test with outliers

Exercise 10

Interpret results




Multiple Regression (Part 3) Diagnostics

In the exercises below we cover some more material on multiple regression diagnostics in R. This includes added-variable (partial-regression) plots, component+residual (partial-residual) plots, CERES plots, VIF values, tests for heteroscedasticity (non-constant variance), tests for normality, and a test for autocorrelation of residuals. These are perhaps not as common as what we saw in Multiple Regression (Part 2), but they are valuable aids in investigating our model’s assumptions.

Answers to the exercises are available here.

If you obtained a different (correct) answer than those listed on the solutions page, please feel free to post your answer as a comment on that page.

Multiple Regression (Part 2) Diagnostics can be found here.

As usual, we will be using the dataset state.x77, which is part of the state datasets available in R. (Additional information about the dataset can be obtained by running help(state.x77).)

First, please run the following code to obtain and format the data as usual:
data(state)
state77 <- as.data.frame(state.x77)
names(state77)[4] <- "Life.Exp"
names(state77)[6] <- "HS.Grad"

Exercise 1
For the model with Life.Exp as dependent variable, and HS.Grad and Murder as predictors, suppose we would like to study the marginal effect of each predictor variable, given that the other predictor is in the model.
a. Use a function from the car package to obtain added-variable (partial regression) plots for this purpose.
b. Re-create the added-variable plots from part a., labeling the two most influential points in the plots (according to Mahalanobis distance).
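
A possible sketch with the car package (the id argument syntax below follows recent versions of car):

library(car)
fit <- lm(Life.Exp ~ HS.Grad + Murder, data = state77)
avPlots(fit)                                       # part a
avPlots(fit, id = list(method = "mahal", n = 2))   # part b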

Learn more about multiple linear regression in the online course Linear regression in R for Data Scientists. In this course you will learn how to:

  • Model basic and complex real-world problems using linear regression
  • Understand when models are performing poorly and correct them
  • Design complex models for hierarchical data
  • And much more

Exercise 2
a. Illiteracy is highly correlated with both HS.Grad and Murder. To illustrate problems that occur when multicollinearity exists, suppose we would like to study the marginal effect of Illiteracy (only), given that HS.Grad and Murder are in the model. Use a function from the car package to get the relevant added-variable plot.
b. From the correlation matrix in the previous Exercise Set, we know that Population and Area are the least strongly correlated variables with Life.Exp. Create added-variable plots for each of these two variables, given that all other six variables are in the model.

Exercise 3
Consider the model with HS.Grad, Murder, Income, and Area as predictors. Create component+residual (partial-residual) plots for this model.

Exercise 4
Create CERES plots for the model in Exercise 3.

Exercise 5
As an illustration of high collinearity, compute VIF (Variance Inflation Factor) values for a model with Life.Exp as the response that includes all the variables as predictors. Which variables seem to be causing the most problems?

Exercise 6
Using a function from the package lmtest, conduct a Breusch-Pagan test for heteroscedasticity (non-constant variance) for the model in Exercise 1.
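
A minimal sketch, assuming fit is the lm object from Exercise 1:

library(lmtest)
bptest(fit)   # Breusch-Pagan test for heteroscedasticity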

Exercise 7
Re-do the test in the previous exercise by using a function from the car package.

Exercise 8
The test in Exercise 6 (and 7) is for linear forms of heteroscedasticity. To test for nonlinear heteroscedasticity (e.g., “bowtie-shape” in a residual plot), conduct White’s test.

Exercise 9
a. Conduct the Kolmogorov-Smirnov normality test for the residuals from the model in Exercise 1.
b. Now conduct the Shapiro-Wilk normality test.
Note: More Normality tests can be found in the nortest package.
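
A sketch of both tests, reusing the fit object from above (strictly speaking, estimating the Normal parameters from the data calls for the Lilliefors correction, nortest::lillie.test):

res <- residuals(fit)
ks.test(res, "pnorm", mean = mean(res), sd = sd(res))   # part a
shapiro.test(res)                                       # part b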

Exercise 10
For illustration purposes only, conduct the Durbin-Watson test for autocorrelation in residuals. (NOTE: This test is ONLY appropriate when the response variable is a time series, or somehow time-related (e.g., ordered by data collection time.))




Spatial analysis with ggmap Exercises (part-1)

R has many powerful libraries to handle spatial data, and the range of things R can do with maps keeps growing. This exercise set demonstrates a few basic functionalities of the ggmap package in R while dealing with raster images.

The ggmap package can be used to access maps from the Google Maps API and other APIs as raster layers and to perform various raster operations on them. Moreover, features such as points, polygons and lines can be added to the basemap using the layered grammar of graphics, along the lines of the ggplot2 package. In addition, the package provides geocoding facilities via the popular Google API.

Answers to the exercises are available here. If you obtained a different (correct) answer than those listed on the solutions page, please feel free to post your answer as a comment on that page.

Please install and load the package ggmap before starting the exercises.

Exercise 1
Get a basemap for the United Kingdom from the Google Maps API and plot the map. Keep the zoom level such that the entire UK is visible in the map.
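
A minimal sketch (recent ggmap versions require registering a Google API key first; zoom level 5 is an assumption that usually fits the whole country):

library(ggmap)
# register_google(key = "YOUR_KEY")   # hypothetical key, needed in current ggmap versions
uk <- get_map(location = "United Kingdom", zoom = 5, source = "google")
ggmap(uk)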

Exercise 2
Repeat Exercise 1 to get a Stamen map for the United Kingdom. Keep the maptype as toner.

Exercise 3
Consider the following football clubs currently playing in the English Premier League: Arsenal FC, Manchester City FC, Manchester United FC, Liverpool FC, Chelsea FC and Tottenham Hotspur FC. Please locate these clubs on the basemap obtained in Exercise 1 as red points.

Exercise 4
Consider the geolocation (-0.119543, 51.50332). Find an address on the map that corresponds to this location.
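
A sketch with ggmap's revgeocode(), which expects a (longitude, latitude) pair:

revgeocode(c(-0.119543, 51.50332))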

Exercise 5
Get a basemap for London from the Google Maps API and plot it. Choose an appropriate zoom level.

Exercise 6
Consider the following London-based football clubs: Arsenal FC, Tottenham Hotspur FC, Chelsea FC, West Ham FC and Crystal Palace FC. As in Exercise 3, plot these clubs on the London map generated in Exercise 5. Use a different color and shape for each club.

Exercise 7
Calculate the driving distance from Emirates Stadium, London to Wembley, London in kilometers, along with the time taken in minutes.
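
A sketch with ggmap's mapdist():

d <- mapdist("Emirates Stadium, London", "Wembley, London", mode = "driving")
d$km        # distance in kilometers
d$minutes   # travel time in minutes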

Exercise 8
Calculate the maximum zoom level at which a basemap can be fetched such that both Emirates Stadium and Wembley are included in the map.

Exercise 9
Get a basemap around Wembley at zoom level 12 from the Google Maps API. Keep the maptype as roadmap.

Exercise 10
Draw the driving route from Emirates Stadium, London to Wembley, London on the basemap obtained in Exercise 9. Keep the color of the route red.




Data Science for Doctors – Part 4 : Inferential Statistics (1/5)

Data science enhances people’s decision making. Doctors and researchers are making critical decisions every day. Therefore, it is absolutely necessary for those people to have some basic knowledge of data science. This series aims to help people in the medical field enhance their data science skills.

We will work with a health-related dataset, the famous “Pima Indians Diabetes Database”. It was generously donated by Vincent Sigillito from Johns Hopkins University. Please find further information regarding the dataset here.

This is the fourth part of the series, and it aims to partially cover the subject of inferential statistics. Researchers rarely have the capability of testing many patients, or experimenting with a new treatment on many patients; therefore, making inferences from a sample is a necessary skill to have. This is where inferential statistics comes into play.

Before proceeding, it might be helpful to look over the help pages for sample, mean, sd, sort, and pnorm. Moreover, it is crucial to be familiar with the Central Limit Theorem.

You also may need to load the ggplot2 library.
install.packages("moments")
library(moments)

Please run the code below in order to load the data set and transform it into a proper data frame format:

url <- "https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data"
data <- read.table(url, fileEncoding="UTF-8", sep=",")
names <- c('preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class')
colnames(data) <- names
data <- data[-which(data$mass ==0),]

Answers to the exercises are available here.

If you obtained a different (correct) answer than those listed on the solutions page, please feel free to post your answer as a comment on that page.

Exercise 1

Generate a sampling distribution (10,000 iterations) of sample size 50 for the variable mass.

You are encouraged to experiment with different sample sizes and iteration counts in order to see the impact they have on the distribution (standard deviation, skewness, and kurtosis). Moreover, you can plot the distributions to get a better sense of what you are working with.
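
A minimal sketch of one way to generate it:

set.seed(1)   # for reproducibility
samp_dist <- replicate(10000, mean(sample(data$mass, 50)))
hist(samp_dist)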

Exercise 2

Find the mean and standard error (standard deviation) of the sampling distribution.

You are encouraged to use the values from the original distribution (data$mass) in order to understand how the mean and standard deviation are derived, as well as the effect the sample size has on the distribution.

Exercise 3

Find the skewness and kurtosis of the distribution you generated before.

Exercise 4

Suppose that we ran an experiment in which a sample of size 50 from the population followed an organic food diet, and their average mass was 30.5 afterwards. What is the Z-score for a mean of 30.5?

Exercise 5

What is the probability of drawing a sample of 50 with mean less than 30.5? Use the z-table if you feel you need to.

Exercise 6

Suppose that you did the experiment again, but with a larger sample size of 150, and you found the average mass to be 31. Compute the Z-score for this mean.

Exercise 7

What is the probability of drawing a sample of 150 with mean less than 31?

Exercise 8

Suppose everybody adopted the diet of the experiment. Find the margin of error for 95% of the sample means.
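
A sketch, using data$mass as the population:

qnorm(0.975) * sd(data$mass) / sqrt(50)   # margin of error at the 95% level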

Exercise 9

What would be our interval estimate that is 95% likely to contain the population mean, if everyone in our population adopted the organic diet?

Exercise 10

Find the interval estimates for 98% and 99% likelihood.




Data Hacking with RDSTK 3

RDSTK is a very versatile package. It includes functions to help you convert IP addresses to geographic locations and derive statistics from them. It also allows you to input a body of text and convert it into sentiments.

This is a continuation of the last exercise set, RDSTK 2.
We are going to use the function that we created in the last set as a programmatic way to derive statistics with the coordinates2statistics() function. Last week we talked about local and global variables; this is important to understand before proceeding. Also refresh your memory of the ip2coordinates() function.

This package provides an R interface to Pete Warden’s Data Science Toolkit. For more information, click here.

Answers to the exercises are available here.

If you obtained a different (correct) answer than those listed on the solutions page, please feel free to post your answer as a comment on that page.

Exercise 1
This week we will give you a bigger and badder list to work with. It’s a list of more than a dozen proxy IP addresses from the internet. Run the code below:


list=c("97.77.104.22","104.199.228.65","50.93.204.169","107.189.46.5","104.154.142.10","104.131.255.12","209.212.253.44","70.248.28.23","52.119.20.75","192.169.168.15","47.88.31.75 80","107.178.4.109","152.160.35.171","104.236.54.196","50.93.197.102","159.203.117.1","206.125.41.132","50.93.201.28","8.21.67.248 31","104.28.16.199")

Exercise 2

Remember how we used iterators to run through each location and derive the stats with the ip2coordinates() function in the first RDSTK exercise set? Let’s do the same here. Store the results in df.
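
A possible sketch, assuming ip2coordinates() returns a one-row data frame per address (entries with stray text, such as trailing port numbers, may need cleaning first):

results <- lapply(list, ip2coordinates)   # query each IP address
df <- do.call(rbind, results)             # row-bind into one data frame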

Exercise 3

If you came this far, great. Let’s recall the function that we created in RDSTK 2. If you do not remember it, here is the code. Run the code below and then run stat_maker("population_density"). You should see a new column called pop.

stat_maker <- function(s2) {
  s1 <- "statistics"
  s3 <- "value"
  s2 <- as.character(s2)
  for (i in 1:nrow(df)) {
    df$pop[i] <<- coordinates2statistics(df[i, 3], df[i, 6], s2)[paste(s1, s2, s3, sep = ".")]
    assign("test2", 50, envir = .GlobalEnv)
  }
}


Exercise 4

Modify the function so that it accepts a string and creates a global variable holding the elements of that statistic. For example, if you input elevation, the function will create a global variable called elevation with the results from the for loop stored in it.
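
One hedged sketch of the modification (the latitude/longitude column positions 3 and 6 follow the code above):

stat_maker <- function(s2) {
  s1 <- "statistics"; s3 <- "value"
  s2 <- as.character(s2)
  out <- numeric(nrow(df))
  for (i in 1:nrow(df)) {
    out[i] <- unlist(coordinates2statistics(df[i, 3], df[i, 6], s2)[paste(s1, s2, s3, sep = ".")])
  }
  assign(s2, out, envir = .GlobalEnv)   # global variable named after the statistic
}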

Exercise 5

Test out the function.


stat_maker("elevation")

Exercise 6

Test the function with stat_maker("population_density"). Notice it did not explicitly make the changes to df but just returned the result once you called the function. This is because we did not define df as a global variable. But that’s okay; we will learn that later.

Exercise 7

Great. Now before we modify our function, let’s learn how we can make a global variable inside a function. Use the same code from Exercise 5, but this time, instead of defining df$pop2 as a local variable, define it as a global variable. Run the function and test it again.

Exercise 8

Run the code

stat_maker("us_population_poverty")

Notice that our function does not work for this case. This is because anything with the prefix us_population returns a data frame with a column named like statistics.us_population.value, so you need to modify the function a little to accommodate this.

Exercise 9

Run the following commands. You can use any string starting with us_population for this function. The goal is to make global variables that hold this data. You can refer to the whole list of statistics functions at www.datasciencetoolkit.org.

stat_maker("us_population")
stat_maker("us_population_poverty")
stat_maker("us_population_asian")
stat_maker("us_population_bachelors_degree")
stat_maker("us_population_black_or_african_american")
stat_maker("us_population_black_or_african_american_not_hispanic ")
stat_maker("us_population_eighteen_to_twenty_four_years_old")
stat_maker("us_population_five_to_seventeen_years_old")
stat_maker("us_population_foreign_born")
stat_maker("us_population_hispanic_or_latino")

Exercise 10

Use the cbind() command to bind all the global variables into df. Print the results of df.

Note: You can choose to build this df in other ways, but this method was used to guide you through modifying functions, global/local variables, and working with strings.




Data Science for Doctors – Part 3 : Distributions

Data science enhances people’s decision making. Doctors and researchers are making critical decisions every day. Therefore, it is absolutely necessary for those people to have some basic knowledge of data science. This series aims to help people in the medical field enhance their data science skills.

This is the third part of the series; it covers the main distributions that you will use most of the time. This part was created to make sure that you have (or will have, after solving this set of exercises) the knowledge needed for the parts to come. The distributions that we will see are:

1) Binomial Distribution: The binomial distribution fits repeated trials, each with a dichotomous outcome such as success-failure, healthy-disease, heads-tails.

2) Normal Distribution: It is the most famous distribution; it is also assumed for many gene expression values.

3) T-Distribution: The T-distribution has many useful applications for testing hypotheses when the sample size is lower than thirty.

4) Chi-squared Distribution: The chi-squared distribution plays an important role in testing hypotheses about frequencies.

5) F-Distribution: The F-distribution is important for testing the equality of two variances.

Before proceeding, it might be helpful to look over the help pages for choose, dbinom, pbinom, rbinom, qbinom, pnorm, qnorm, rnorm, dnorm, pchisq, qchisq, dchisq, df, pf, qf.

Answers to the exercises are available here.

If you obtained a different (correct) answer than those listed on the solutions page, please feel free to post your answer as a comment on that page.

Exercise 1

Let X be binomially distributed with n = 100 and p = 0.3. Compute the following:
a) P(X = 34), P(X ≥ 34), and P(X ≤ 34)
b) P(30 ≤ X ≤ 60)
c) The quantiles x0.025 and x0.975
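
A sketch of the standard calls:

dbinom(34, 100, 0.3)                          # P(X = 34)
1 - pbinom(33, 100, 0.3)                      # P(X >= 34)
pbinom(34, 100, 0.3)                          # P(X <= 34)
pbinom(60, 100, 0.3) - pbinom(29, 100, 0.3)   # P(30 <= X <= 60)
qbinom(c(0.025, 0.975), 100, 0.3)             # quantiles x0.025 and x0.975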

Exercise 2

Let X be normally distributed with mean = 3 and standard deviation = 1. Compute the following:
a) P(X < 2), P(2 ≤ X ≤ 4)
b) The quantiles x0.025, x0.5, and x0.975.

Exercise 3

Let T8 denote the t-distribution with 8 degrees of freedom. Compute the following:
a) P(T8 < 1), P(T8 > 2), P(-1 < T8 < 1).
b) The quantiles t0.025, t0.5, and t0.975. Can you justify the values of the quantiles?

Exercise 4

Compute the following for the chi-squared distribution with 5 degrees of freedom:
a) P(χ²5 < 2), P(χ²5 > 4), P(4 < χ²5 < 6).
b) The quantiles g0.025, g0.5, and g0.975.

Exercise 5

Compute the following for the F6,3 distribution:
a) P(F6,3 < 2), P(F6,3 > 3), P(1 < F6,3 < 4).
b) The quantiles f0.025, f0.5, and f0.975.

Exercise 6

Generate 100 observations following a binomial distribution and plot them (if possible, on the same plot):
a) n = 20, p = 0.3
b) n = 20, p = 0.5
c) n = 20, p = 0.7
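
A sketch for part a); parts b) and c) only change prob:

x <- rbinom(100, size = 20, prob = 0.3)
plot(density(x), xlim = c(0, 20))
lines(density(rbinom(100, 20, 0.5)), col = "red")
lines(density(rbinom(100, 20, 0.7)), col = "blue")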

Exercise 7

Generate 100 observations following a normal distribution and plot them (if possible, on the same plot):
a) standard normal distribution ( N(0,1) )
b) mean = 0, s = 3
c) mean = 0, s = 7

Exercise 8

Generate 100 observations following a T distribution and plot them (if possible, on the same plot):
a) df = 5
b) df = 10
c) df = 25

Exercise 9

Generate 100 observations following a chi-squared distribution and plot them (if possible, on the same plot):
a) df = 5
b) df = 10
c) df = 25

Exercise 10

Generate 100 observations following an F distribution and plot them (if possible, on the same plot):
a) df1 = 3, df2 = 9
b) df1 = 9, df2 = 3
c) df1 = 15, df2 = 15




Data Hacking with RDSTK 2

RDSTK is a very versatile package. It includes functions to help you convert IP addresses to geographic locations and derive statistics from them. It also allows you to input a body of text and convert it into sentiments.

This is a continuation of the last exercise set, RDSTK 1.
This package provides an R interface to Pete Warden’s Data Science Toolkit. See www.datasciencetoolkit.org for more information.

Answers to the exercises are available here.

If you obtained a different (correct) answer than those listed on the solutions page, please feel free to post your answer as a comment on that page.

Exercise 1
Load the RDSTK library and download the dataset here.

Exercise 2

a. Create a string called s1 that stores "statistics".
b. Create a string called s3 that stores "value".
c. Create a function that takes a string s2 as input and outputs a string in the format s1+s2+s3, separated by ".". Name this function "stringer".
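
A minimal sketch of the three parts:

s1 <- "statistics"
s3 <- "value"
stringer <- function(s2) paste(s1, s2, s3, sep = ".")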

Exercise 3

Let’s test out this function.


stringer("hello")

You should see an output in the format “statistics.hello.value”

Exercise 4

Create a for loop that iterates over the rows in df and derives the population density of each location using the coordinates2statistics() function. Save the results in df$pop.
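
A hedged sketch, assuming latitude and longitude sit in columns 3 and 6 of df (as returned by ip2coordinates() in the previous set):

for (i in 1:nrow(df)) {
  df$pop[i] <- unlist(coordinates2statistics(df[i, 3], df[i, 6], "population_density")["statistics.population_density.value"])
}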

Exercise 5

Let’s now make a function using elements you learned in Exercises 3 and 4. The function will take a string as input, like s2 from Exercise 3. Inside the function, combine it with s1 and s3, and create the same for loop from Exercise 4. Instead of storing the result of the for loop in df$pop, use df$pop2. Name the function stat_maker. You should see a new feature inside df with all the results once you return df from it.

Exercise 6

Test the function with stat_maker("population_density"). Notice it did not explicitly make the changes to df but just returned the result once you called the function. This is because we did not define df as a global variable. But that’s okay; we will learn that later.

Exercise 7

Great. Now before we modify our function, let’s learn how we can make a global variable inside a function. Use the same code from Exercise 5, but this time, instead of defining df$pop2 as a local variable, define it as a global variable. Run the function and test it again.

Exercise 8

You can also use the assign() function inside a function and set the result as a global variable. Let’s look at an example of the assign() function:

assign("test", 50)

Now if you type test in your console, you should see 50. Try it.

Exercise 9

Now try putting the same code from Exercise 8 in the stat_maker function, changing test to test2. Once you test the function, you will see that test2 does not return anything. This is because it was not set as a global variable.

Exercise 10

Set test2 as a global variable inside the stat_maker function. Run the function, and now you should see test2 return 50 when you call it.




Data Hacking with RDSTK (part 1)

RDSTK is a very versatile package. It includes functions to help you convert IP addresses to geographic locations and derive statistics from them. It also allows you to input a body of text and convert it into sentiments.

This package provides an R interface to Pete Warden’s Data Science Toolkit. See www.datasciencetoolkit.org for more information.

Answers to the exercises are available here. If you obtained a different (correct) answer than those listed on the solutions page, please feel free to post your answer as a comment on that page.

Exercise 1
Install and load the RDSTK package.

Exercise 2

Convert the IP address to coordinates: address = "165.124.145.197". Store the results under the variable stat.
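
A minimal sketch:

stat <- ip2coordinates("165.124.145.197")
stat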

Exercise 3

Derive the elevation of that location using the latitude and longitude. Use the coordinates2statistics() function to achieve this. Once you get the elevation, store it back as one of the features of stat.

Exercise 4

Derive the population density of that location using the latitude and longitude. Use the coordinates2statistics() function to achieve this. Once you get the population density, store it back as one of the features of stat, called pop_den.

Exercise 5

Great. You are getting the hang of it. Let us try getting the mean temperature of that location. You will notice that it returns a list of 12 numbers, one for each month.

Run this code and see for yourself:

coordinates2statistics(stat[3],stat[6],"mean_temperature")[1]

Exercise 6

We have to transform mean_temperature so we can store it as one of the features in our stat dataset. One way to do this is to convert it from long to wide format, but that would be too redundant. Let’s just find the mean temperature across January-December. You might find the sapply function useful for converting each element of the list to an integer.
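
A sketch, assuming the first element of the result holds the twelve monthly values as in the snippet above:

temps <- coordinates2statistics(stat[3], stat[6], "mean_temperature")[[1]]
mean(sapply(temps, as.integer))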

Exercise 7

We decided we do not really need the January-December mean value. We actually need the mean temperature from June-December. Make that adjustment to your last code and store the result back in stat under the name mean_temp.

Exercise 8

Okay, great. Now let’s work with more IP address data. Here is a list of a few IP addresses scraped from a few commenters of my exercises.

list <- c("165.124.145.197", "31.24.74.155", "79.129.19.173")
df <- data.frame(list)
df[, 1] <- as.character(df[, 1])

Exercise 9

Use an iterator like apply that goes through the list and derives its statistics with the ip2coordinates() function. This is the first part; you may get a list-within-a-list sort of result. Store this in a variable called data.

Exercise 10

Use a method to convert that list within a list into a data frame with 3 rows and all columns derived from the ip2coordinates() function. You are open to use any method for this.




Multiple Regression (Part 2) – Diagnostics

Multiple Regression is one of the most widely used methods in statistical modelling. However, despite its many benefits, it is oftentimes used without checking the underlying assumptions. This can lead to results that are misleading or even completely wrong. Therefore, applying diagnostics to detect any strong violations of the assumptions is important. In the exercises below we cover some material on multiple regression diagnostics in R.

Answers to the exercises are available here.

If you obtain a different (correct) answer than those listed on the solutions page, please feel free to post your answer as a comment on that page.

Multiple Regression (Part 1) can be found here.

We will be using the dataset state.x77, which is part of the state datasets available in R. (Additional information about the dataset can be obtained by running help(state.x77).)

Exercise 1
a. Load the state datasets.
b. Convert the state.x77 dataset to a dataframe.
c. Rename the Life Exp variable to Life.Exp, and HS Grad to HS.Grad. (This avoids problems with referring to these variables when specifying a model.)
d. Produce the correlation matrix.
e. Create a scatterplot matrix for the variables Life.Exp, HS.Grad, Murder, and Frost.

Exercise 2
a. Fit the model with Life.Exp as dependent variable, and HS.Grad and Murder as predictors.
b. Obtain the residuals.
c. Obtain the fitted values.
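
A minimal sketch of the three parts:

fit <- lm(Life.Exp ~ HS.Grad + Murder, data = state77)   # part a
res <- residuals(fit)                                    # part b
fv <- fitted(fit)                                        # part c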

Exercise 3
a. Create a residual plot (residuals vs. fitted values).
b. Create the same residual plot using the plot command on the lm object from Exercise 2.

Learn more about multiple linear regression in the online courses Linear regression in R for Data Scientists, Statistics with R – advanced level, and Linear Regression and Modeling.

Exercise 4
Create plots of the residuals vs. each of the predictor variables.

Exercise 5
a. Create a Normality plot.
b. Create the same plot using the plot command on the lm object from Exercise 2.

Exercise 6
a. Obtain the studentized residuals.
b. Do there appear to be any outliers?

Exercise 7
a. Obtain the leverage value for each observation and plot them.
b. Obtain the conventional threshold for leverage values. Are any observations influential?
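
A sketch, using the common 2p/n convention for the threshold:

lev <- hatvalues(fit)
plot(lev)
thresh <- 2 * length(coef(fit)) / nrow(state77)   # 2p/n, with p = number of coefficients
lev[lev > thresh]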

Exercise 8
a. Obtain DFFITS values.
b. Obtain the conventional threshold. Are any observations influential?
c. Obtain DFBETAS values.
d. Obtain the conventional threshold. Are any observations influential?

Exercise 9
a. Obtain Cook’s distance values and plot them.
b. Obtain the same plot using the plot command on the lm object from Exercise 2.
c. Obtain the threshold value. Are any observations influential?
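
A sketch, continuing with fit from above; 4/(n - p) is one common rule of thumb for the threshold:

cd <- cooks.distance(fit)
plot(fit, which = 4)   # Cook's distance plot from the lm object
thresh <- 4 / (nrow(state77) - length(coef(fit)))
cd[cd > thresh]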

Exercise 10
Create the Influence Plot using a function from the car package.