We are committed to bringing you 100% authentic exercise sets. We also try to use as many different datasets as possible so that you get exposure to different kinds of problems. No more classifying the Titanic dataset: R ships with plenty of datasets in its libraries, and we encourage you to try as many of them as you can. In this set we will compare two models by checking their accuracy, ROC curve, and area under the curve (AUC).

It will be helpful to go over Tom Fawcett’s paper ‘An introduction to ROC analysis’.

Answers to the exercises are available here.

If you obtained a different (correct) answer than those listed on the solutions page, please feel free to post your answer as a comment on that page.

**Learn more** about evaluating different statistical models in the online courses Linear regression in R for Data Scientists and Structural equation modeling (SEM) with lavaan.

**Exercise 1**

Run the following code. If you do not have the ROCR, caTools, or caret packages installed, you can use the install.packages() command to install them.

library(ROCR)

library(caTools)

library(caret)

data("GermanCredit")

df1=GermanCredit

df1$Class=ifelse(df1$Class=="Bad",1,0)

set.seed(100)

spl=sample.split(df1$Class,SplitRatio = 0.7)

Train1=df1[spl==TRUE,]

Test1=df1[spl==FALSE,]

model1=glm(Class~.,data=Train1,family = binomial)

pred1=predict(model1,Test1,type="response")

table(Test1$Class,pred1>0.5)

**Exercise 2**

Using the confusion matrix, state the accuracy of this model.
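As a reminder of how accuracy is read off a confusion matrix, here is a minimal sketch in base R. The labels and probabilities are made-up toy values (an assumption for illustration, not taken from GermanCredit), so it does not give away the answer.

```r
# Hypothetical labels and predicted probabilities (toy data, not GermanCredit).
actual <- c(0, 0, 0, 0, 1, 1, 1, 0, 1, 0)
prob   <- c(0.1, 0.4, 0.7, 0.2, 0.8, 0.3, 0.9, 0.6, 0.55, 0.05)

# Rows are the actual classes, columns are the predictions at a 0.5 cutoff.
cm <- table(actual, predicted = prob > 0.5)

# Accuracy is the share of cases on the diagonal (correct predictions).
accuracy <- sum(diag(cm)) / sum(cm)

print(cm)
print(accuracy)
```

Note that diag() only lines up with the correct predictions when both FALSE and TRUE appear as columns, which is the case here.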

**Exercise 3**

Great. Now let’s look at the ROC curve of the model. Run the code below, then use the plot() command to plot ROCRperf1.

ROCRpred1=prediction(pred1,Test1$Class)

ROCRperf1=performance(ROCRpred1,"tpr","fpr")

The plot gives us an idea of the performance of the model. Is this a good or bad model? State your reasons.
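To judge a ROC curve it helps to know where its points come from. The sketch below rebuilds the curve by hand in base R on toy scores (hypothetical values, not model output): each cutoff yields one (false positive rate, true positive rate) pair. A curve that hugs the top-left corner is good, and one that stays near the diagonal is close to random guessing.

```r
# Hypothetical labels and scores (toy data for illustration only).
actual <- c(0, 0, 1, 0, 1, 1, 0, 1)
score  <- c(0.1, 0.3, 0.35, 0.4, 0.6, 0.7, 0.8, 0.9)

# Sweep the cutoff over every observed score, from strictest to loosest.
cutoffs <- sort(unique(score), decreasing = TRUE)

# At each cutoff: share of positives flagged (tpr) and of negatives flagged (fpr).
tpr <- sapply(cutoffs, function(k) mean(score[actual == 1] >= k))
fpr <- sapply(cutoffs, function(k) mean(score[actual == 0] >= k))

roc <- data.frame(cutoff = cutoffs, fpr = fpr, tpr = tpr)
print(roc)

# Step plot of the curve, with the diagonal of a random classifier for reference.
plot(c(0, roc$fpr), c(0, roc$tpr), type = "s",
     xlab = "False positive rate", ylab = "True positive rate")
abline(0, 1, lty = 2)
```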

**Exercise 4**

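Use the summary() function on the model to see its summary. Note that the stars next to a feature indicate the statistical significance of its coefficient: more stars mean a smaller p-value, that is, stronger evidence that the coefficient differs from zero. They do not measure how strongly the feature is correlated with the target variable.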
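The stars printed by summary() are derived from the p-value column of the coefficient table, so you can also read that column directly. Here is a minimal sketch on the built-in mtcars data; the model and the names fit, pvals and starred are illustrative assumptions, not part of the exercise.

```r
# Small logistic regression on a built-in dataset (illustrative only).
fit <- glm(am ~ hp + wt, data = mtcars, family = binomial)

# The coefficient table behind summary(): estimate, std. error, z value, p-value.
coefs <- coef(summary(fit))
pvals <- coefs[, "Pr(>|z|)"]
print(round(pvals, 4))

# At least one star corresponds to p < 0.05; the intercept is left out.
starred <- names(pvals)[pvals < 0.05 & names(pvals) != "(Intercept)"]
print(starred)
```

The "." marker that summary() prints for p-values between 0.05 and 0.1 is not a star, which is why the cutoff here is 0.05.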

**Exercise 5**

Although we found the accuracy of the model in Exercise 2, accuracy is still not the best measure, because it depends on the chosen cutoff and on the class distribution. A better measure is the area under the ROC curve (AUC). AUC summarizes performance across all cutoffs and ranges from 0 to 1: 1 is a perfect ranking, 0.5 is no better than random guessing, and values below 0.5 are worse than random. It can also be read as a probability: an AUC of 0.70 means there is a 0.7 chance that the model scores a randomly chosen positive case higher than a randomly chosen negative case.
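The probability reading of AUC can be checked by hand. The sketch below uses toy scores (hypothetical values) and base R only: it compares every positive case with every negative case and counts how often the positive one gets the higher score. For these toy values the result is 0.75.

```r
# Hypothetical labels and scores (toy data for illustration only).
actual <- c(0, 0, 1, 0, 1, 1, 0, 1)
score  <- c(0.1, 0.3, 0.35, 0.4, 0.6, 0.7, 0.8, 0.9)

pos <- score[actual == 1]
neg <- score[actual == 0]

# Compare every positive with every negative; ties count as half a win.
wins <- outer(pos, neg, ">") + 0.5 * outer(pos, neg, "==")

# AUC is the share of positive/negative pairs that are ranked correctly.
auc_rank <- mean(wins)
print(auc_rank)
```

The pairwise comparison grows with the product of the two class sizes, so it is meant for understanding the measure rather than for large test sets.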

Run the code below to obtain the AUC. What is the AUC score? Is it better than the accuracy obtained in Exercise 2?

auc= performance(ROCRpred1,measure="auc")

auc=auc@y.values[[1]]

**Exercise 6**

Now create another model called model2 that includes only the 11 variables that have at least one star next to their name. Hint: use the summary() command; the intercept does not count.
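Typing out a long formula by hand is error-prone, so one possible approach is to build it from a vector of variable names with reformulate(). This is a sketch on the built-in mtcars data; the variable names and the objects keep, f and fit2 are placeholders, not the 11 variables asked for here.

```r
# Names of the predictors to keep (placeholders for illustration).
keep <- c("hp", "wt")

# Build the formula am ~ hp + wt from the character vector.
f <- reformulate(keep, response = "am")
print(f)

# Fit the reduced logistic regression with that formula.
fit2 <- glm(f, data = mtcars, family = binomial)
print(names(coef(fit2)))
```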

**Exercise 7**

Now use model2 to predict the target variable on the Test1 sample and store the result in pred2.
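One thing to watch when predicting from a glm: by default predict() returns values on the link scale (log-odds for a binomial model), and type = "response" is needed to get probabilities. The sketch below shows the relationship on the built-in mtcars data (an illustrative model, not the one from this exercise). A cutoff of 0.5 applied to log-odds corresponds to a probability cutoff of about 0.62, not 0.5.

```r
# Small logistic regression on a built-in dataset (illustrative only).
fit <- glm(am ~ hp + wt, data = mtcars, family = binomial)

log_odds <- predict(fit, mtcars)                     # default: type = "link"
probs    <- predict(fit, mtcars, type = "response")  # probabilities in [0, 1]

# The two scales are related by the logistic function.
print(all.equal(plogis(log_odds), probs))

# Probability that corresponds to a log-odds value of 0.5.
print(plogis(0.5))
```

Because the logistic function is monotone, the ROC curve and the AUC are the same on either scale; only cutoff-based measures such as the confusion matrix change.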

**Exercise 8**

Use the `table()` command to get the confusion matrix. Note the accuracy.

**Exercise 9**

What is the AUC of model2?

**Exercise 10**

Is model2 better than model1? If so, why?

Dggv says

“use the summary function on the model to see the summary. Note that if there are more stars next to a feature, then it is highly corelated with our target variable.”

Lawl. Dude, read an econometrics text.