Multiple Regression (Part 1)

In the exercises below we cover some material on multiple regression in R.

Answers to the exercises are available here.

If you obtained a different (correct) answer than those listed on the solutions page, please feel free to post your answer as a comment on that page.

We will be using the dataset state.x77, which is part of the state datasets available in R. (Additional information about the dataset can be obtained by running help(state.x77).)

Exercise 1

a. Load the state datasets.
b. Convert the state.x77 dataset to a dataframe.
c. Rename the Life Exp variable to Life.Exp, and HS Grad to HS.Grad. (This avoids problems with referring to these variables when specifying a model.)
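One possible approach (the object name `st` is just a suggestion; the solutions page may differ):

```r
# Load the built-in state datasets and convert state.x77 to a data frame
data(state)
st <- as.data.frame(state.x77)

# Rename "Life Exp" and "HS Grad" so the names are valid in model formulas
names(st)[names(st) == "Life Exp"] <- "Life.Exp"
names(st)[names(st) == "HS Grad"]  <- "HS.Grad"
```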

Exercise 2
Suppose we wanted to enter all the variables in a first-order linear regression model with Life Expectancy as the dependent variable. Fit this model.

Exercise 3

Suppose we wanted to remove the Income, Illiteracy, and Area variables from the model in Exercise 2. Use the update function to fit this model.
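A sketch of Exercises 2 and 3 together (object names are suggestions only):

```r
data(state)
st <- as.data.frame(state.x77)
names(st)[names(st) == "Life Exp"] <- "Life.Exp"
names(st)[names(st) == "HS Grad"]  <- "HS.Grad"

# Full first-order model (Exercise 2), then drop three predictors with update()
fit_full    <- lm(Life.Exp ~ ., data = st)
fit_reduced <- update(fit_full, . ~ . - Income - Illiteracy - Area)
```

In the `update()` formula, `.` stands for "whatever was there before", so `. ~ . - Income` means "same response, same predictors, minus Income".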

Learn more about multiple linear regression in the online course Linear regression in R for Data Scientists. In this course you will learn how to:

  • Model basic and complex real-world problems using linear regression
  • Understand when models are performing poorly and correct them
  • Design complex models for hierarchical data
  • And much more

Exercise 4
Let’s assume that we have settled on a model that has HS.Grad and Murder as predictors. Fit this model.

Exercise 5
Add an interaction term to the model in Exercise 4 (3 different ways).
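Three equivalent formula idioms can express the interaction; a sketch (fit4 stands for the model from Exercise 4):

```r
data(state)
st <- as.data.frame(state.x77)
names(st)[names(st) == "Life Exp"] <- "Life.Exp"
names(st)[names(st) == "HS Grad"]  <- "HS.Grad"
fit4 <- lm(Life.Exp ~ HS.Grad + Murder, data = st)

# 1. '*' expands to both main effects plus the interaction
m1 <- lm(Life.Exp ~ HS.Grad * Murder, data = st)
# 2. ':' adds only the interaction term, so the main effects are spelled out
m2 <- lm(Life.Exp ~ HS.Grad + Murder + HS.Grad:Murder, data = st)
# 3. update() adds the interaction to the existing model
m3 <- update(fit4, . ~ . + HS.Grad:Murder)
```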

Exercise 6
For this and the remaining exercises in this set we will use the model from Exercise 4.

Obtain 95% confidence intervals for the coefficients of the two predictor variables.
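The confint() function does this directly; a sketch using the Exercise 4 model:

```r
data(state)
st <- as.data.frame(state.x77)
names(st)[names(st) == "Life Exp"] <- "Life.Exp"
names(st)[names(st) == "HS Grad"]  <- "HS.Grad"
fit4 <- lm(Life.Exp ~ HS.Grad + Murder, data = st)

# 95% confidence intervals for the two slope coefficients only
ci <- confint(fit4, c("HS.Grad", "Murder"), level = 0.95)
```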

Exercise 7
Predict the Life Expectancy for a state where 55% of the population are High School graduates, and the murder rate is 8 per 100,000.

Exercise 8

Obtain a 98% confidence interval for the mean Life Expectancy in a state where 55% of the population are High School graduates, and the murder rate is 8 per 100,000.

Exercise 9

Obtain a 98% confidence interval for the Life Expectancy of a person living in a state where 55% of the population are High School graduates, and the murder rate is 8 per 100,000.
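The same predict() call handles Exercises 7 through 9; only the interval type and level change. A sketch:

```r
data(state)
st <- as.data.frame(state.x77)
names(st)[names(st) == "Life Exp"] <- "Life.Exp"
names(st)[names(st) == "HS Grad"]  <- "HS.Grad"
fit4 <- lm(Life.Exp ~ HS.Grad + Murder, data = st)

new_state <- data.frame(HS.Grad = 55, Murder = 8)

point   <- predict(fit4, newdata = new_state)                                          # Exercise 7
mean_ci <- predict(fit4, newdata = new_state, interval = "confidence", level = 0.98)   # Exercise 8
pred_ci <- predict(fit4, newdata = new_state, interval = "prediction", level = 0.98)   # Exercise 9
```

Note that the prediction interval (for an individual) is always wider than the confidence interval (for the mean), since it must also cover the residual variation.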

Exercise 10

Since our model only has two predictor variables, we can generate a 3D plot of our data and the fitted regression plane. Create this plot.
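One way to draw this, assuming the scatterplot3d package is installed (other packages such as rgl would also work):

```r
library(scatterplot3d)   # assumed installed: install.packages("scatterplot3d")

data(state)
st <- as.data.frame(state.x77)
names(st)[names(st) == "Life Exp"] <- "Life.Exp"
names(st)[names(st) == "HS Grad"]  <- "HS.Grad"
fit4 <- lm(Life.Exp ~ HS.Grad + Murder, data = st)

# 3D scatter of the data, then overlay the fitted regression plane
s3d <- scatterplot3d(st$HS.Grad, st$Murder, st$Life.Exp,
                     xlab = "HS.Grad", ylab = "Murder", zlab = "Life.Exp")
s3d$plane3d(fit4)
```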




Intermediate Tree 2

This is a continuation of the intermediate decision tree exercise.

Answers to the exercises are available here.

If you obtained a different (correct) answer than those listed on the solutions page, please feel free to post your answer as a comment on that page.

 

Exercise 1
Use the predict() command to make predictions on the Train data. Set the type argument to "class", which returns classifications instead of probability scores. Store this prediction in pred_dec.
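The pattern looks like this; since the adult Train set and the dec model from the previous exercise set aren't reproduced here, a small rpart tree on iris stands in:

```r
library(rpart)  # ships with R

# Stand-in for the adult data: fit a classification tree on iris
Train <- iris
dec <- rpart(Species ~ ., data = Train, method = "class")

# type = "class" returns predicted classes rather than probabilities
pred_dec <- predict(dec, newdata = Train, type = "class")
```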

Exercise 2
Print out the confusion matrix.

Exercise 3
What is the accuracy of the model? Use the confusion matrix.

Exercise 4
What is the misclassification error rate? Refer to Basic_decision_tree exercise to get the formula.
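The misclassification error rate is simply the share of predictions that disagree with the actual labels (i.e. 1 minus the accuracy). A toy example with made-up vectors:

```r
actual <- c(1, 0, 1, 1, 0)
pred   <- c(1, 0, 0, 1, 1)

# two of the five predictions are wrong, so the error rate is 0.4
err <- mean(pred != actual)
```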

Exercise 5
Let's say we want a baseline model against which to compare our prediction improvement. We create a base model using this code:


length(Test$class)
base=rep(1,3183)

Use the table() command to create a confusion matrix between the base and Test$class.

Learn more about decision trees in the online courses

Exercise 6
What is the difference in accuracy between the confusion matrices of dec and base?

Exercise 7

Remember the predict() command from Exercise 1. We will use the same model and the same command, except this time we set the type so that probability estimates are returned instead of classes. Store this in pred_dec_reg.

Exercise 8
Load the ROCR package.

Use the prediction(), performance() and plot() commands to print the ROC curve. Use the pred_dec_reg variable from Exercise 7. You can also refer to the previous exercise set to see the code.

Exercise 9
Plot the same ROC curve, but set colorize=TRUE.

Exercise 10
Comment on your findings using the ROC curve and the accuracy. Is it a good model? Did you notice that the ROCR prediction() command only takes probability predictions as one of its arguments? Why is that so?




ROC curves

Every dataset comes with unique problems, but the essence of model building is understanding how to evaluate the models. Today we will briefly touch on ROC curves.

Answers to the exercises are available here.

If you obtained a different (correct) answer than those listed on the solutions page, please feel free to post your answer as a comment on that page.

Learn more about evaluating the performance of classification models in the online course Machine Learning A-Z™: Hands-On Python & R In Data Science. It includes 5 lectures (32 mins in total) on false positives/false negatives, the confusion matrix, accuracy paradox, and CAP curve analysis.

Exercise 1

ROC is an interesting topic. The quickest way to learn about ROC is to look at different models and see how their ROC curves differ. First we are going to load the ChickWeight dataset, and then we are going to run different models to compare the changes in the ROC curve. If you do not have one of the required packages, use the install.packages() command to install it. Note the steps as well. We are essentially trying to predict whether a chicken is fed diet 1 or not. We do a 70:30 split and then use a glm logistic regression model to make the prediction.

Write this code in R:


library(ROCR)
library(caTools)
attach(ChickWeight)
df=ChickWeight
df$Diet=ifelse(df$Diet==1,1,0)
set.seed(100)
spl=sample.split(df$Diet,SplitRatio = 0.7)
Train=df[spl==TRUE,]
Test=df[spl==FALSE,]
model=glm(Diet~.,data=Train,family = binomial)
pred=predict(model,newdata = Test,type="response")

Exercise 2

Use the table() command to print the confusion matrix. Hint: use >0.5 to convert the probabilities into classes. What is the accuracy of the model?
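The generic pattern, with made-up probabilities:

```r
probs  <- c(0.9, 0.2, 0.7, 0.4)   # hypothetical predicted probabilities
actual <- c(1, 0, 1, 1)

# threshold at 0.5, cross-tabulate, and read accuracy off the diagonal
cm  <- table(actual, probs > 0.5)
acc <- sum(diag(cm)) / sum(cm)
```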

Exercise 3

We want to plot the ROC curve now. Write this code down in R.

ROCRpred1=prediction(pred,Test$Diet)
ROCRperf1=performance(ROCRpred1,"tpr","fpr")

Now use the plot() command to plot ROCRperf1.

Exercise 4

What are your observations of the curve? Hint: explain in terms of tpr and fpr and the direction the curve follows.

Exercise 5

Create a baseline model:

a. Use the nrow() command to see the length of the Test set. Store this in a variable 'a'.
b. Use the rep() command to create a list of 1's whose number of instances equals your answer in a.
c. Store the result of b in the variable rd.

This essentially creates a predictor that has roughly a 50/50 chance of getting the answer right, or slightly better or worse depending on the distribution of the Diet variable. It is equivalent to a model whose accuracy is about 50 percent, almost by random chance.

Exercise 6

Using the same code structure as Q3, make the changes needed for the baseline and plot the baseline curve. Hint: use rd instead of pred.

Exercise 7

What do you observe from the curve and explain your findings?

Exercise 8
model1 had a high accuracy rate because the chick number was 100 percent correlated with diet; you can see that by pure observation of the dataset. But what if we created an average model? We can do this by using weight as the only predictor.

Write the code below to create model2


model2=glm(Diet~weight,data=Train,family = binomial)
pred2=predict(model2,newdata = Test,type="response")

Exercise 9
Using the table() function again, create a confusion matrix between Test$Diet and pred2. Note the accuracy.

Exercise 10
Using the same code structure as Q3, write code to plot the ROC curve of model2, and set add=TRUE as one of the arguments so that the new curve is plotted on top of the baseline. How does it compare with model1 and the baseline model? Explain in terms of tpr and fpr.




Intermediate Tree 1

If you followed the Basic Decision Tree exercise, this set should be useful for you. It is a continuation, but we add much more. We are working with a bigger and badder dataset, and we will also be using techniques we learned from model evaluation, working with ROC, accuracy and other metrics.

Answers to the exercises are available here.

If you obtained a different (correct) answer than those listed on the solutions page, please feel free to post your answer as a comment on that page.

Exercise 1
Read in the adult.csv file with header=FALSE. Store this in df. Use the str() command to see the dataframe. Download the data from here.

Exercise 2
You are given the meta data that goes with the CSV; you can download it here. Use it to add the column names for your dataframe. Notice the df columns are ordered V1, V2, V3 and so on. As a side note, it is always best practice to use the meta data to check whether all the columns were read in correctly.

Exercise 3
Use the table command and print out the distribution of the class feature.

Exercise 4
Change the class column to binary.

Learn more about decision trees in the online courses

Exercise 5
Use the cor() command to see the correlation of all the numeric and integer columns, including the class column. Remember that numbers close to 1 mean high correlation and numbers close to 0 mean low correlation. This will give you a rough idea for feature selection.

Exercise 6
Split the dataset into Train and Test samples. You may use sample.split() with a ratio of 0.7, and set the seed to 1000. Make sure to install and load the caTools package.

Exercise 7
Check the number of rows of Train
Check the number of rows of Test

Exercise 8
We are ready to use a decision tree on our dataset. Load the packages "rpart" and "rpart.plot". If they are not installed, use the install.packages() command.

Exercise 9
Use rpart to build the decision tree on the Train set. Include all features. Store this model in dec.

Exercise 10
Use the prp() function to plot the decision tree. If you get an error, run this code before the prp() command:

par(mar = rep(2, 4))




Model Evaluation 2

We are committed to bringing you 100% authentic exercise sets. We even try to include as many different datasets as possible to give you exposure to different problems. No more classifying the Titanic dataset; R has tons of datasets in its library, and this is to encourage you to try as many of them as possible. We will be comparing two models by checking their accuracy, area under the curve, ROC performance, etc.

It will be helpful to go over Tom Fawcett's paper 'An introduction to ROC analysis'.

Answers to the exercises are available here.

If you obtained a different (correct) answer than those listed on the solutions page, please feel free to post your answer as a comment on that page.

Learn more about evaluating different statistical models in the online courses Linear regression in R for Data Scientists and Structural equation modeling (SEM) with lavaan

Exercise 1

Run the following code. If you do not have ROCR package installed, you can use install.packages() command to install it.


library(ROCR)
library(caTools)
library(caret)
data("GermanCredit")
df1=GermanCredit
df1$Class=ifelse(df1$Class=="Bad",1,0)
set.seed(100)
spl=sample.split(df1$Class,SplitRatio = 0.7)
Train1=df1[spl==TRUE,]
Test1=df1[spl==FALSE,]
model1=glm(Class~.,data=Train1,family = binomial)
pred1=predict(model1,Test1,type="response")
table(Test1$Class,pred1>0.5)

Exercise 2

Using the confusion matrix, what is the accuracy of this model?

Exercise 3

Great. Now let's see the ROC curve of the model. Use the code below, and then use the plot() command to plot ROCRperf1.


ROCRpred1=prediction(pred1,Test1$Class)
ROCRperf1=performance(ROCRpred1,"tpr","fpr")

The plot above gives us an idea of the performance of the model. Is this a good or a bad model? State your reasons.

Exercise 4

Use the summary function on the model to see the summary. Note that the more stars next to a feature, the more strongly it is associated with our target variable.

Exercise 5

Although we found the accuracy of the model in Q2, it is still not the best measure. A better measure is the area under the curve (AUC). AUC takes the class distribution into account and ranges from 0 to 1, with 1 being the best and 0 the worst. It can also be read as a probability: an AUC of 0.70 means there is a 0.7 probability that the model ranks a randomly chosen positive instance higher than a randomly chosen negative one.

Run the code below to obtain the AUC. What is the AUC score? Is it better than the accuracy obtained in Q2?

auc= performance(ROCRpred1,measure="auc")
auc=auc@y.values[[1]]

Exercise 6

Now create another model called model2 and include the 11 variables that have at least one star next to their name. Hint: use the summary() command; the intercept does not count.

Exercise 7

Now predict the target variable on the Test1 sample using model2, and store the result in pred2.

Exercise 8
Use the table() command to get the confusion matrix. Note the accuracy.

Exercise 9
What is the AUC of model2?

Exercise 10
Is model2 better than model1? If so, why?




Basic Tree 2 Exercises

This is a continuation of the exercise Basic Tree 1.

Answers to the exercises are available here.

If you obtained a different (correct) answer than those listed on the solutions page, please feel free to post your answer as a comment on that page.

Learn more about decision trees in the online courses Regression Machine Learning with R and Machine Learning A-Z™: Hands-On Python & R In Data Science.

Exercise 1
Load the tree library. If it is not installed, use the install.packages() command to install it.

Exercise 2
Convert all the features (columns) into factors, including the class column.

Exercise 3
Use the sample methods that you learnt from the sample_exercise to split the data into two sets with a SplitRatio of 0.7. Hint: Use caTools library and sample.split() function. Store the results into Train and Test.

Exercise 4
Use the tree() command to build the model. Use class as the target variable and everything else as predictors. Also, use the Train variable as the data source. Store the model in a variable called model1.

Exercise 5
Use the plot() command to plot the model and use the text() command to add the text.

Exercise 6

Use the predict() command on the Test dataset; we want to predict the classes. Store the result in the variable pred_class.

Exercise 7

Use the table() command to print the confusion matrix. Hint: you are comparing the class from the Test set with the predicted vector. This tells you whether the model is getting things right or wrong.

Exercise 8
Use summary() to print the summary of the model, and note the misclassification error rate.

Exercise 9
Now find the misclassification error rate of the model on the Test data, using the formula: mean(Test$class != pred_class).

Exercise 10
Compare the two misclassification error rates and determine which is worse and why. How can we improve the model?




Recursive Partitioning and Regression Trees Exercises

[For this exercise we will work with the package rpart. This is a beginner-level exercise. Please refer to the help pages of the rpart package.]

Answers to the exercises are available here.

Exercise 1

Consider the Kyphosis data frame (type help('kyphosis') for more details), which contains:
- Kyphosis: a factor with levels absent/present indicating if a kyphosis (a type of deformation) was present after the operation.
- Age: in months.
- Number: the number of vertebrae involved.
- Start: the number of the first (topmost) vertebra operated on.

1) Build a tree to classify Kyphosis from Age, Number and Start.
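A sketch of the fitting call (the kyphosis data frame ships with the rpart package):

```r
library(rpart)

# classification tree: Kyphosis explained by Age, Number and Start
fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis, method = "class")
print(fit)
```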

Exercise 2

Consider the tree built in Exercise 1.
1) Which variables are used to explain kyphosis presence?
2) How many observations do the terminal nodes contain?

Exercise 3

Consider the Kyphosis data frame.
1) Build a tree using the first 60 observations of kyphosis.
2) Predict the kyphosis presence for the other 21 observations.
3) What is the misclassification rate (prediction error)?

Exercise 4

Consider the iris data frame (type help('iris') for more details).
1) Build a tree to classify Species from the other variables.
2) Plot the tree and add node information.

Exercise 5

Consider the tree built in Exercise 4.
Prune the tree using the median complexity parameter (cp) associated with the tree.
Plot the pruned and the original tree in the same window.
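One way to prune at the median cp (a sketch, refitting the iris tree from Exercise 4):

```r
library(rpart)

fit <- rpart(Species ~ ., data = iris, method = "class")

# prune back to the median of the complexity parameters in the cp table
cp_med <- median(fit$cptable[, "CP"])
pruned <- prune(fit, cp = cp_med)

# original and pruned tree side by side
par(mfrow = c(1, 2))
plot(fit);    text(fit)
plot(pruned); text(pruned)
```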

Exercise 6

Consider the tree built in Exercise 4.
1) In which terminal node is each observation of iris classified?
2) Which Species has a flower with Petal.Length greater than 2.45 and Petal.Width less than 1.75?

Exercise 7

Consider the car90 data frame (type help('car90') for more details).
1) Build a tree to predict Price from the other variables.
2) Plot the tree and add node information.

Exercise 8

Consider the tree built in Exercise 7.
1) Which variables are used to explain the price?
2) Which terminal nodes have a mean Price less than mean(car90$Price)?

Exercise 9

Consider the car.test.frame data frame (type help('car.test.frame') for more details).
1) Build a tree to explain Mileage using the other variables.
2) Snip the tree at node number 2.
3) Plot both trees together.
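A sketch using snip.rpart() (car.test.frame ships with rpart):

```r
library(rpart)

fit <- rpart(Mileage ~ ., data = car.test.frame)

# remove the subtree rooted at node number 2, turning it into a leaf
snipped <- snip.rpart(fit, toss = 2)

par(mfrow = c(1, 2))
plot(fit);     text(fit)
plot(snipped); text(snipped)
```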

Exercise 10
Consider the tree built in Exercise 9.
What is the depth of the tree (with the root node counted as depth 0)?
Set the maximum depth of the final tree to 2.




Basic Tree 1 Exercises

Using the knowledge you acquired in the previous exercises on sampling and selecting (here), we will now go through an entire data analysis process, using what you know as crutches to solve the problems. Don't worry: it might look intimidating, but follow the sequence and you will see that modeling a decision tree is the best decision you made today. We will take you through all stages of the data pipeline: data loading, feature selection, sampling, plotting, modeling and evaluating a decision tree.

Answers to the exercises are available here.

If you obtained a different (correct) answer than those listed on the solutions page, please feel free to post your answer as a comment on that page.

Learn more about decision trees in the online courses Regression Machine Learning with R and Machine Learning A-Z™: Hands-On Python & R In Data Science.

Exercise 1
Use the read.csv() command to load the lenses.csv data and store it in lens. Use the str() command to see lens. Download the dataset from here.

Exercise 2
Notice there are no column names. The column names are as follows:
index, age, spec_pres, astigmatic, tpr, class. Use one line of code to change the column names to the aforementioned names.

Exercise 3
Given the meta data

age: (1) young, (2) pre-presbyopic, (3) presbyopic
spec_pres: (1) myope, (2) hypermetrope
astigmatic: (1) no, (2) yes
tpr: (1) reduced, (2) normal
class: (1) patient needs hard contact lens, (2) patient needs soft contact lens, (3) patient does not need contact lens

Type the code lens$age[lens$age == "1"]="young"
Use the same format to change all the coded values to their names for the age and spec_pres variables.
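The recoding pattern, shown on a small hypothetical vector rather than the lens data:

```r
# hypothetical coded values, as in the lens age column
age <- c("1", "2", "3", "1")

age[age == "1"] <- "young"
age[age == "2"] <- "pre-presbyopic"
age[age == "3"] <- "presbyopic"
```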

Exercise 4
Use the str() command to see the changes. Notice that the astigmatic column is a factor that stores numbers as characters. To get everything into the same format, let's convert it to character: use as.character() to convert this column's data type to character.

Exercise 5
Now change the astigmatic column data to the right names

Exercise 6
Use the following code to replace the 1 with "reduced" in the tpr column:

lens$tpr[lens$tpr==1]="reduced"

Now type str(lens) to see the dataframe. Notice that the tpr column's data type changed from integer to character. Any time you introduce something that is not a number into a numeric column, it becomes a character column.

Exercise 7
Go ahead and replace 2 in the tpr column with “normal”

Exercise 8
Use the table() command to see the counts of each value.

Exercise 9
Notice that there is a "g" in the counts. That could possibly be a typo. We can go ahead and remove that row, since only one row has the typo. Hint: you can select all rows that do not have the typo and store the result back in the lens dataframe.
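The filtering idiom looks like this on a toy data frame (the column name is hypothetical):

```r
lens <- data.frame(tpr = c("reduced", "normal", "g"),
                   stringsAsFactors = FALSE)

# keep only the rows without the typo and overwrite lens
lens <- lens[lens$tpr != "g", , drop = FALSE]
```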

Exercise 10
Great work! We realized that the index column is not necessary for our modeling purposes, so let's remove the index column.




Model Evaluation Exercises 1

We are committed to bringing you 100% authentic exercise sets. We even try to include as many different datasets as possible to give you exposure to different problems. No more classifying the Titanic dataset; R has tons of datasets in its library, and this is to encourage you to try as many of them as possible. We will grasp the basics of gauging the accuracy of a model.

It will be helpful to go over Tom Fawcett's paper 'An introduction to ROC analysis'.

Answers to the exercises are available here.

If you obtained a different (correct) answer than those listed on the solutions page, please feel free to post your answer as a comment on that page.

Exercise 1

In pattern recognition and classification we use precision (the fraction of retrieved instances that are relevant) and recall (the fraction of relevant instances that are retrieved). In other words, precision tells us how many of the items the model flags are actually correct, while recall, also called sensitivity, gauges how many of the positives we can identify out of all the positives.

Suppose we design a model to identify iPhones in a video that also contains Android phones. The program identifies 5 iPhones in a scene containing 7 iPhones and some Android phones; 3 of the identifications are correct, but 2 are actually Android phones.
a. What is the precision of the model?
b. What is the recall of the model?
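As a warm-up with different (hypothetical) counts, so the exercise itself stays yours to solve: suppose a model makes 10 positive identifications, 8 of them correct, and misses 4 real positives.

```r
TP <- 8   # correct identifications
FP <- 2   # wrong identifications
FN <- 4   # real positives that were missed

precision <- TP / (TP + FP)   # 8/10
recall    <- TP / (TP + FN)   # 8/12
```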

Exercise 2

Suppose we created the model and want to assess how well it predicted. We will compare the actual value from the Test set to the predicted value that is derived from the model.

      FALSE  TRUE
  0      94    23
  1      24   100

Using the confusion matrix above (rows are the actual class, columns the prediction), answer the following questions:
a. Number of true positives?
b. Number of false negatives?
c. Number of true negatives?
d. Number of false positives?

Exercise 3
A quick way to gauge the accuracy of the model, assuming the target class is balanced, is the formula (TN+TP)/N. What is the accuracy of the model?

Exercise 4
A quick way to gauge the error rate is the formula (FP+FN)/N. What is the overall error rate?
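With a hypothetical matrix (not the one above, so the exercises stay yours to solve), the two formulas look like this:

```r
TN <- 50; FP <- 10; FN <- 5; TP <- 35
N  <- TN + FP + FN + TP

accuracy   <- (TN + TP) / N   # 0.85
error_rate <- (FP + FN) / N   # 0.15, i.e. 1 - accuracy
```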

Exercise 5
Sensitivity is defined as TP/(TP+FN). Specificity is defined as TN/(TN+FP). Using that information, answer the following questions:
a. What is the sensitivity of the model?
b. What is the specificity of the model?

Exercise 6

There are usually two types of errors: false positives and false negatives. The false negative error rate is defined as FN/(TP+FN). The false positive error rate is defined as FP/(TN+FP). Now answer the following questions:
a. What is the false negative error rate?
b. What is the false positive error rate?
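Using the same hypothetical counts as before, the four rates fit together neatly:

```r
TN <- 50; FP <- 10; FN <- 5; TP <- 35

sensitivity <- TP / (TP + FN)   # true positive rate
specificity <- TN / (TN + FP)   # true negative rate
fnr <- FN / (TP + FN)           # false negative error rate = 1 - sensitivity
fpr <- FP / (TN + FP)           # false positive error rate = 1 - specificity
```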

Exercise 7

Now suppose the confusion matrix above belongs to a model like the one used to identify iPhones among Androids in a video (Q1).
a. What is the precision?
b. What is the recall? Hint: Recall is the same as sensitivity.

Exercise 8
Go ahead and run the code below; you may run it line by line to see what is happening. In brief, we load the housing data, change the Cont variable to a binary class, split the data into Train and Test sets, build a model, predict, and compare the actual Cont values in the Test set with the predicted values in pred.

library(MASS)      # housing dataset
library(caTools)   # sample.split()
attach(housing)
housing$Cont=ifelse(housing$Cont=="High",1,0)
spl=sample.split(housing$Cont,SplitRatio = 0.7)
Train=housing[spl==TRUE,]
Test=housing[spl==FALSE,]
model=glm(Cont~.,data=Train,family=binomial)
pred=predict(model,newdata=Test,type="response")
table(Test$Cont,pred>0.5)

Exercise 9
Using the confusion matrix output from your last code, answer the following questions.
a) What is the Precision?
b) What is the recall?

Exercise 10
Using the confusion matrix, answer the following questions:
c) What is the accuracy of the model?
d) What is the overall error rate?




Frequency and chi-square test for independence Exercises

In this exercise, we cover some basics on frequency tables. We also briefly look at the chi-square test for independence to find relationships between two variables. Before proceeding, it might be helpful to look over the help pages for the table, summary, and margin.table functions.

Answers to the exercises are available here.

If you obtained a different (correct) answer than those listed on the solutions page, please feel free to post your answer as a comment on that page.

Exercise 1
Use the attach() command to load the trees dataset in R.

Exercise 2
Use the table() command with the arguments: trees$Height and trees$Volume. This will generate a two-way frequency table. Store this in variable mytable.

Exercise 3
If you are familiar with Excel pivot tables, then you will know this function. Use the margin.table() function to get the Height frequencies summed over Volume.
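A tiny illustration of margin.table() on a two-way table built from made-up vectors:

```r
m <- table(c("a", "a", "b"), c("x", "y", "x"))

row_sums <- margin.table(m, 1)   # frequencies of the first variable
col_sums <- margin.table(m, 2)   # frequencies of the second variable
```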

Exercise 4
Use the margin.table() function to get the Volume frequencies summed over Height.

Exercise 5
Now use the table() function again, but with all the features of the trees dataset: Girth, Height and Volume. This will print a three-way frequency table.

Exercise 6
Suppose you have a variable ‘a’ that stores a second sample of heights of trees.

a=c(70, 65, 63, 72, 80, 83, 66, 75, 80, 75, 79, 76, 76, 69, 75, 74, 85, 8, 71, 63, 78, 80, 74, 72, 77, 81, 82, 80, 86, 80, 87)

Use cbind() to add the a column to your trees dataset. Store the result back into trees.

Exercise 7
Now create a two-way frequency table between Height and a. Store this table in mytable_2.

Exercise 8
Observe the contents of mytable and mytable_2. What differences do you observe from the results?

Exercise 9
Chi-square test for independence:
a) Print the results of the summary() function on mytable. Note the chi-square test for independence results and the p-value.
b) Print the results of the summary() function on mytable_2. Note the chi-square test for independence results and the p-value.
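summary() on a table object reports the chi-square test of independence directly; for example:

```r
data(trees)
mytable <- table(trees$Height, trees$Volume)

# the summary of a two-way table includes the chi-square statistic and p-value
s <- summary(mytable)
```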

Exercise 10
What did the chi-square test for independence help you to see?