Multiple Regression (Part 1)

In the exercises below we cover some material on multiple regression in R.

Answers to the exercises are available here.

If you obtained a different (correct) answer than those listed on the solutions page, please feel free to post your answer as a comment on that page.

We will be using the dataset state.x77, which is part of the state datasets available in R. (Additional information about the dataset can be obtained by running help(state.x77).)
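
Before diving in, it can help to glance at the raw object. A minimal sketch using only base R, just to get oriented:

data(state)                     # loads state.x77, state.name, state.region, ...
str(state.x77)                  # note that state.x77 is a matrix, not a data frame
head(as.data.frame(state.x77))  # preview the first few rows as a data frame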

Exercise 1

a. Load the state datasets.
b. Convert the state.x77 dataset to a dataframe.
c. Rename the Life Exp variable to Life.Exp, and HS Grad to HS.Grad. (This avoids problems with referring to these variables when specifying a model.)

Exercise 2
Suppose we wanted to enter all the variables in a first-order linear regression model with Life Expectancy as the dependent variable. Fit this model.

Exercise 3

Suppose we wanted to remove the Income, Illiteracy, and Area variables from the model in Exercise 2. Use the update function to fit this model.
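
If you have not used update() before, the general pattern looks like the sketch below. The data and variable names (y, x1, x2, x3) are purely illustrative and are not taken from state.x77:

dat <- data.frame(y = rnorm(20), x1 = rnorm(20), x2 = rnorm(20), x3 = rnorm(20))
fit <- lm(y ~ ., data = dat)          # first-order model with all predictors
fit2 <- update(fit, . ~ . - x2 - x3)  # the same model with x2 and x3 dropped
summary(fit2)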


Exercise 4
Let’s assume that we have settled on a model that has HS.Grad and Murder as predictors. Fit this model.

Exercise 5
Add an interaction term to the model in Exercise 4 (3 different ways).

Exercise 6
For this and the remaining exercises in this set we will use the model from Exercise 4.

Obtain 95% confidence intervals for the coefficients of the two predictor variables.

Exercise 7
Predict the Life Expectancy for a state where 55% of the population are High School graduates, and the murder rate is 8 per 100,000.

Exercise 8

Obtain a 98% confidence interval for the mean Life Expectancy in a state where 55% of the population are High School graduates, and the murder rate is 8 per 100,000.

Exercise 9

Obtain a 98% confidence interval for the Life Expectancy of a person living in a state where 55% of the population are High School graduates, and the murder rate is 8 per 100,000.
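
Exercises 8 and 9 differ only in the kind of interval requested: a confidence interval for the mean response versus a prediction interval for a single new observation. In predict.lm() this is controlled by the interval argument; the sketch below uses made-up data and variable names (x1, x2) just to show the two calls:

set.seed(1)
dat <- data.frame(x1 = runif(50, 40, 70), x2 = runif(50, 1, 15))
dat$y <- 70 + 0.05 * dat$x1 - 0.3 * dat$x2 + rnorm(50, sd = 0.5)
fit <- lm(y ~ x1 + x2, data = dat)
new_obs <- data.frame(x1 = 55, x2 = 8)
predict(fit, new_obs, interval = "confidence", level = 0.98)  # interval for the mean response
predict(fit, new_obs, interval = "prediction", level = 0.98)  # interval for one new observation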

Exercise 10

Since our model only has two predictor variables, we can generate a 3D plot of our data and the fitted regression plane. Create this plot.




Intermediate Tree 2

This is a continuation of the intermediate decision tree exercise.

Answers to the exercises are available here.

If you obtained a different (correct) answer than those listed on the solutions page, please feel free to post your answer as a comment on that page.

 

Exercise 1
Use the predict() command to make predictions on the Train data. Set the type argument to "class", which returns class labels instead of probability scores. Store this prediction in pred_dec.

Exercise 2
Print out the confusion matrix.

Exercise 3
What is the accuracy of the model? Use the confusion matrix.
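
As a reminder, accuracy is the proportion of correct predictions: the sum of the diagonal of the confusion matrix divided by the total number of cases. A minimal sketch with a made-up confusion matrix:

cm <- table(actual = c(0, 0, 1, 1, 1, 0), predicted = c(0, 1, 1, 1, 0, 0))  # toy example
accuracy <- sum(diag(cm)) / sum(cm)   # correct predictions / all predictions
accuracy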

Exercise 4
What is the misclassification error rate? Refer to the Basic Decision Tree exercise set for the formula.

Exercise 5
Let's say we want a baseline model against which to compare our prediction improvement. We create a base model that predicts class 1 for every observation, using this code:

n_test <- length(Test$class)   # number of observations in the Test set
base <- rep(1, n_test)         # baseline: predict class 1 for every observation

Use the table() command to create a confusion matrix between base and Test$class.


Exercise 6
What is the difference in accuracy between the dec model and the base model (compare their confusion matrices)?

Exercise 7

Remember the predict() command from Exercise 1. We will use the same model and the same command, except this time we ask for probability estimates instead of class labels (for an rpart classification tree, set the type argument to "prob"). Store this in pred_dec_reg.

Exercise 8
Load the ROCR package.

Use the prediction(), performance(), and plot() commands to produce the ROC curve. Use the pred_dec_reg variable from Exercise 7. You can also refer to the previous exercise set to see the code.

Exercise 9
Plot the same ROC curve, but set colorize=TRUE.

Exercise 10
Comment on your findings using the ROC curve and accuracy. Is it a good model? Did you notice that the ROCR prediction() command only accepts probability-type predictions as one of its arguments? Why is that so?




Working with Shapefiles in R Exercises

R has many powerful libraries for handling spatial data, and the range of things R can do with maps keeps growing. This exercise set demonstrates a few basic functionalities of R when dealing with shapefiles.

A shapefile is a simple, nontopological format for storing the geometric location and attribute information of geographic features. Geographic features in a shapefile can be represented by points, lines, or polygons (ESRI). The geographic features are associated with an attribute table which is very similar to an R dataframe.

The rgdal package in R provides bindings to the popular Geospatial Data Abstraction Library (GDAL) for reading, writing, and converting between spatial formats. We are using a popular London sports participation shapefile (download here). The attributes Pop_2001 and Partic_Per represent the population of each London Borough in 2001 and the percentage of the population participating in sporting activities.

Answers to the exercises are available here. If you obtained a different (correct) answer than those listed on the solutions page, please feel free to post your answer as a comment on that page.

Please install and load the package rgdal before starting the exercises.
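
If rgdal is new to you, the two functions you will rely on most are readOGR(), which reads a shapefile, and spTransform(), which re-projects it. A hedged sketch, assuming the downloaded shapefile and its companion files (.dbf, .shx, .prj) sit in your working directory:

library(rgdal)
lnd <- readOGR(dsn = ".", layer = "london_sport")      # read the shapefile
proj4string(lnd)                                       # inspect the current coordinate system
lnd_wgs84 <- spTransform(lnd, CRS("+init=epsg:4326"))  # re-project to WGS 84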

Exercise 1
Read the London sports map from the shapefile london_sport.shp.

Exercise 2
Change the coordinate system of the map to WGS 84.

Exercise 3
Find the names of the zones where the sports participation rate is more than 25%.

Exercise 4
Plot the London map in sky blue, along with a title.

Exercise 5
Plot the zones in London with Sports Participation Rates less than 15% in red. Retain the earlier blue color for other zones.

Exercise 6
Plot the zones in London with Sports Participation Rates more than 25% in green. Retain the earlier color for other zones.

Exercise 7
Place a black circle marker at the centre of each zone. Retain previous maps.

Exercise 8
Put labels for each zone. Place the labels to the right of the black marker.
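
For Exercises 7 and 8, note that the sp package (loaded along with rgdal) provides coordinates(), which returns the label point (roughly the centroid) of each polygon. A sketch, assuming the map object is called lnd and has a name attribute:

centres <- coordinates(lnd)                            # one (x, y) pair per zone
points(centres, pch = 16, col = "black")               # black circle at each zone centre
text(centres, labels = lnd$name, pos = 4, cex = 0.6)   # labels to the right of the markers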

Exercise 9
Add another categorical attribute, sports_part, with values "low", "medium", and "high" for sports participation rates of at most 15%, between 15% and 25%, and greater than 25%, respectively.

Exercise 10
Save the new map object with modified attribute table as a new shapefile “london_sport2.shp”.




Intermediate Tree 1

If you worked through the Basic Decision Tree exercise set, this one should be useful for you. It is a continuation, but we add much more: we work with a bigger dataset, and we also use the techniques we learned in model evaluation, working with ROC curves, accuracy, and other metrics.

Answers to the exercises are available here.

If you obtained a different (correct) answer than those listed on the solutions page, please feel free to post your answer as a comment on that page.

Exercise 1
Read in the adult.csv file with header=FALSE and store it in df. Use the str() command to inspect the data frame. Download the data from here.
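
A minimal sketch, assuming adult.csv has been saved in your working directory:

df <- read.csv("adult.csv", header = FALSE)  # header = FALSE: the file has no column names
str(df)                                      # columns come in as V1, V2, V3, ...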

Exercise 2
You are given the metadata that goes with the CSV; you can download it here. Use it to add the column names to your data frame. Notice that the columns of df are currently named V1, V2, V3, and so on. As a side note, it is always good practice to check the metadata against the data frame to make sure all the columns were read in correctly.

Exercise 3
Use the table() command to print out the distribution of the class feature.

Exercise 4
Change the class column to binary.


Exercise 5
Use the cor() command to see the correlation of all the numeric and integer columns, including the class column. Remember that values close to 1 (in absolute value) mean high correlation and values close to 0 mean low correlation. This will give you a rough idea for feature selection.
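
One common way to restrict cor() to the numeric and integer columns is to select them with sapply(). A sketch, assuming the class column has already been recoded to 0/1 in Exercise 4:

num_cols <- sapply(df, is.numeric)    # TRUE for numeric and integer columns
round(cor(df[, num_cols]), 2)         # correlation matrix, rounded for readability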

Exercise 6
Split the dataset into Train and Test samples. You may use sample.split() with a split ratio of 0.7, and set the seed to 1000. Make sure to install and load the caTools package.

Exercise 7
Check the number of rows of Train
Check the number of rows of Test

Exercise 8
We are ready to use a decision tree on our dataset. Load the packages “rpart” and “rpart.plot”. If they are not installed, use the install.packages() command.

Exercise 9
Use rpart() to build the decision tree on the Train set. Include all features. Store this model in dec.

Exercise 10
Use the prp() function to plot the decision tree. If you get an error (for example, about figure margins), run this code before the prp() command:

par(mar = rep(2, 4))




Descriptive Analytics-Part 5: Data Visualisation (Spatial data)

Descriptive Analytics is the examination of data or content, usually performed manually, to answer the question “What happened?”.

In order to be able to solve this set of exercises you should have solved part 0, part 1, part 2, part 3, and part 4 of this series, and you should also run this script, which contains some more data cleaning. In case you haven’t, run that script on your machine; it contains the lines of code we used to modify our data set. This is the eighth set of exercises in a series that aims to provide a descriptive analytics solution to the ‘2008’ data set from here. This data set, which contains the arrival and departure information for all domestic flights in the US in 2008, has become the “iris” data set for Big Data.

In order to solve this set of exercises, you also have to download this data set, which provides the coordinates of each airport. Please find the script used to create the merged dataset here. I don’t expect you to do the pre-processing yourself, since it is beyond the scope of this set, but I highly encourage you to give it a try; in case you did it in a better or more efficient way than I did, please post your solution in the comment section (it will be highly appreciated). Moreover, we will remove the rows with missing values (various delays), because the methods we will use are computationally expensive and keeping a very large data set would just be a waste of time.

The goal of descriptive analytics is to inform the user about what is going on in the dataset. A great way to do that quickly and effectively is data visualisation. Data visualisation is also a form of art: it has to be simple, comprehensible, and full of information. In this set of exercises we will explore different ways of visualising spatial data using the well-known ggmap package. Before proceeding, it might be helpful to look over the help pages for get_map, ggmap, and facet_wrap.

For this set of exercises you will need to install and load the packages ggplot2, dplyr, and ggmap.

install.packages('ggplot2')
library(ggplot2)
install.packages('dplyr')
library(dplyr)
install.packages('ggmap')
library(ggmap)

I have also changed the values of the DayOfWeek variable; if you wish to do that as well, the code is:
install.packages('lubridate')
library(lubridate)
flights$DayOfWeek <- wday(as.Date(flights$Full1_Date,'%m/%d/%Y'), label=TRUE)

Answers to the exercises are available here.

If you obtained a different (correct) answer than those listed on the solutions page, please feel free to post your answer as a comment on that page.

Exercise 1

Query a map of the United States using the get_map function.
It is recommended to experiment with the various map types and select the one you think works best. (I have used toner-lite from Stamen Maps.)
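
A hedged sketch of the kind of call involved; the zoom level and map type are just one reasonable choice, and depending on your ggmap version you may need an API key for geocoding the location string:

us_map <- get_map(location = "United States", zoom = 4,
                  source = "stamen", maptype = "toner-lite")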

Exercise 2

Print the map that you have selected.

Exercise 3

Modify the printed map so that a bigger image is printed (see the extent argument of ggmap) and assign it to an object m.

Exercise 4

Plot the destination airports of the flights on the map.

Exercise 5

Plot the destination airports of the flights on the map, where the size of the points is based on the number of flights that arrived at each destination airport.

Exercise 6

Plot the destination airports of the flights on the map, where the colour of the points is based on the number of flights that arrived at each destination airport. Make it a bit prettier: use scale_colour_gradient and set the low and high colours to your preference.

Exercise 7

Plot the destination airports of the flights on the map, where the colour of the points is based on the number of flights that arrived at each destination airport and the size of the points is based on the total arrival delay of the flights that arrived at that airport.
Something is not right here, right?

Exercise 8

Plot the destination airports of the flights on the map, where the colour of the points is based on the number of flights that arrived at each destination airport and the size of the points is based on the total arrival delay divided by the number of flights per destination.

Exercise 9

Plot the destination airports for every day of the week (hint: facet_wrap).

Exercise 10
Plot the destination airports of the flights on the map, where the colour of the points is based on the number of flights that arrived at each destination airport and the size of the points is based on the total arrival delay at that airport, for every day of the week.
(This may be a bit more challenging; if you can’t solve it, go to the solutions and try to understand the reasoning behind what I did. If you have any questions, please post them in the comment section.)




Model Evaluation 2

We are committed to bringing you 100% authentic exercise sets. We also try to use as many different datasets as possible, to expose you to an understanding of different problems; no more classifying the Titanic dataset. R has plenty of datasets in its library, and this is to encourage you to try as many of them as possible. In this set we will compare two models by checking their accuracy, area under the curve (AUC), ROC performance, and so on.

It will be helpful to go over Tom Fawcett’s paper ‘An introduction to ROC analysis’.

Answers to the exercises are available here.

If you obtained a different (correct) answer than those listed on the solutions page, please feel free to post your answer as a comment on that page.


Exercise 1

Run the following code. If you do not have the ROCR, caTools, or caret packages installed, you can use the install.packages() command to install them.


library(ROCR)
library(caTools)
library(caret)
data("GermanCredit")
df1 <- GermanCredit
df1$Class <- ifelse(df1$Class == "Bad", 1, 0)      # recode the target as 1 = Bad, 0 = Good
set.seed(100)
spl <- sample.split(df1$Class, SplitRatio = 0.7)   # 70/30 train/test split
Train1 <- df1[spl == TRUE, ]
Test1 <- df1[spl == FALSE, ]
model1 <- glm(Class ~ ., data = Train1, family = binomial)
pred1 <- predict(model1, Test1, type = "response") # predicted probabilities
table(Test1$Class, pred1 > 0.5)                    # confusion matrix at a 0.5 cutoff

Exercise 2

Using the confusion matrix, what is the accuracy of this model?

Exercise 3

Great. Now let’s see the ROC curve of the model. Use the code below and then use the plot() command to plot ROCRperf1.


ROCRpred1 <- prediction(pred1, Test1$Class)
ROCRperf1 <- performance(ROCRpred1, "tpr", "fpr")

The plot above gives us an idea of the performance of the model. Is this a good or a bad model? State your reasons.

Exercise 4

Use the summary() function on the model to see the summary. Note that the more stars next to a feature, the more significant its association with the target variable.

Exercise 5

Although we found the accuracy of the model in Exercise 2, it is not always the best measure. A better measure is the area under the ROC curve (AUC). AUC is insensitive to the class distribution and lies in the range 0 to 1, with 1 being the best and 0.5 being no better than random guessing. It can also be interpreted as a probability: an AUC of 0.70 means there is a 0.70 chance that the model ranks a randomly chosen positive case higher than a randomly chosen negative one.

Run the code below to obtain the AUC. What is the AUC score? How does it compare with the accuracy obtained in Exercise 2?

auc <- performance(ROCRpred1, measure = "auc")
auc <- auc@y.values[[1]]

Exercise 6

Now create another model, called model2, that includes the 11 variables that have at least one star next to their name. Hint: use the summary() command; the intercept does not count.

Exercise 7

Now predict the target variable on the Test1 sample using model2 and store the result in pred2.

Exercise 8
Use the table() command to get the confusion matrix. Note the accuracy.

Exercise 9
What is the AUC of model2?

Exercise 10
Is model2 better than model1? If so, why?




Descriptive Analytics-Part 5: Data Visualisation (Categorical variables)

Descriptive Analytics is the examination of data or content, usually performed manually, to answer the question “What happened?”.

In order to be able to solve this set of exercises you should have solved part 0, part 1, part 2, part 3, and part 4 of this series, and you should also run this script, which contains some more data cleaning. In case you haven’t, run that script on your machine; it contains the lines of code we used to modify our data set. This is the sixth set of exercises in a series that aims to provide a descriptive analytics solution to the ‘2008’ data set from here. This data set, which contains the arrival and departure information for all domestic flights in the US in 2008, has become the “iris” data set for Big Data.

The goal of descriptive analytics is to inform the user about what is going on in the dataset. A great way to do that quickly and effectively is data visualisation. Data visualisation is also a form of art: it has to be simple, comprehensible, and full of information. In this set of exercises we will explore different ways of visualising categorical variables using the well-known ggplot2 package. Before proceeding, it might be helpful to look over the help pages for ggplot, geom_bar, facet_wrap, facet_grid, coord_polar, geom_raster, and scale_fill_distiller.

For this set of exercises you will need to install and load the packages ggplot2, dplyr, and RColorBrewer.

install.packages('ggplot2')
library(ggplot2)
install.packages('dplyr')
library(dplyr)
install.packages('RColorBrewer')
library(RColorBrewer)

I have also changed the values of the DayOfWeek variable; if you wish to do that as well, the code is:
install.packages('lubridate')
library(lubridate)
flights$DayOfWeek <- wday(as.Date(flights$Full1_Date,'%m/%d/%Y'), label=TRUE)

Answers to the exercises are available here.

If you obtained a different (correct) answer than those listed on the solutions page, please feel free to post your answer as a comment on that page.

Exercise 1
Construct a barplot which illustrates the number of flights per carrier.
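
If geom_bar() is new to you, the basic counting pattern is sketched below on the mpg data that ships with ggplot2 (not on the flights data); mapping fill to a second variable is the idea that Exercises 2 and 3 build on:

library(ggplot2)
ggplot(mpg, aes(x = manufacturer)) +
  geom_bar()                                      # one bar per manufacturer, height = row count
ggplot(mpg, aes(x = manufacturer, fill = class)) +
  geom_bar()                                      # stacked bars, filled by a second variable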

Exercise 2
Make a barplot that illustrates the number of flights per carrier, where each bar also contains information about the number of cancellations for that carrier.

Exercise 3
Make a barplot of the number of flights per carrier, but with two bars for every carrier: one showing the number of flights that were cancelled and one showing the flights that departed.

Exercise 4
Make a barplot that shows the proportion of cancelled flights per carrier.

Exercise 5
Make seven barplots (one for each day of the week) that illustrate the number of flights per carrier, where each bar also contains information about the number of cancellations per carrier. Hint: facet.

Exercise 6
Make a single barplot that illustrates the number of flights per carrier, where each bar also contains information about the number of cancellations per carrier for every day of the week.

Exercise 7
Create a pie chart that illustrates the number of flights per carrier

Exercise 8
Create a wind rose that illustrates the number of flights per carrier for every day of the week.

Exercise 9
Make a heat map that illustrates the number of flights per carrier for every day of the week.

Exercise 10
Using the same data as the heat map from the previous exercise, also display the cancellation ratio (2 digits recommended) and customise the heat map so that the higher values stand out more.




R-SQL Exercises

How can you write Structured Query Language (SQL) code in R? There are many packages on CRAN that relate to databases.


In the exercises below we cover some of the important data manipulation operations using SQL in R. We will use the ‘sqldf’ package, an R package for running SQL statements on data frames.
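
sqldf() takes an SQL statement as a string and treats any data frame in your workspace as a table. A quick sketch on the built-in mtcars data, just to show the pattern:

library(sqldf)
sqldf("SELECT cyl, COUNT(*) AS n FROM mtcars GROUP BY cyl")  # rows per number of cylinders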

Answers to the exercises are available here.

If you obtained a different (correct) answer than those listed on the solutions page, please feel free to post your answer as a comment on that page.

Exercise 1
Install the ‘sqldf’ and ‘PASWR’ packages and load them. Also load the ‘titanic3’ data from the ‘PASWR’ package.

Exercise 2
Count the number of rows in the ‘titanic3’ data using sqldf function. Below is the R equivalent code to do the same.

nrow(titanic3)

Exercise 3
Select all the columns and rows from ‘titanic3’ data and put it into a variable ‘TitanicData’. Below is the R equivalent code to the same.

TitanicData <- titanic3[ , ]

Exercise 4
Select the first two columns of the ‘titanic3’ data and put it into a variable ‘TitanicSubset2Cols’. Below is the R equivalent code to the same. Note: you need to specify the column names in sqldf function.

TitanicSubset2Cols <- titanic3[, c(1, 2)]

Exercise 5
Print the first 6 rows of the ‘titanic3’ dataset using sqldf function. Below is the R equivalent code to do the same.

head(titanic3)

Exercise 6
Count the number of people in the ‘titanic3’ dataset where the sex is female. Below is the R equivalent code to do the same.

nrow(titanic3[titanic3$sex=="female",])

Exercise 7
Count the number of people in the ‘titanic3’ dataset where the sex is female and the port of embarkation is Southampton. Below is the R equivalent code to do the same.

nrow(titanic3[(titanic3$sex=="female" & titanic3$embarked=="Southampton"),])

Exercise 8
Calculate the total fare paid by females (where sex is female). Below is the R equivalent code to do the same.

sum(titanic3$fare[titanic3$sex=="female"])

Exercise 9
Count the number of distinct cabins on the ship. Below is the R equivalent code to do the same.

length(unique(titanic3$cabin))

Exercise 10
Count the number of people on the ship whose name starts with ‘A’. Below is the R equivalent code to do the same.

nrow(titanic3[grep("^A", titanic3$name),])




Functions exercises vol. 2


[For this exercise, first write down your answer, without using R. Then, check your answer using R.]

Answers to the exercises are available here.

Exercise 1

Create a function that, given a data frame and a vector, adds the vector to the data frame as a new variable (provided the length of the vector matches the number of rows of the data frame).

Exercise 2

Consider a data frame df:

Id=c(1:10)
Age=c(14,12,15,10,23,21,41,56,78,12)
Sex=c('F','M','M','F','M','F','M','M','F','M')
Code=letters[1:10]
df=data.frame(Id,Age,Sex,Code)

Create a function that, given a data frame and two indexes, exchanges two values of the Code variable with each other.
For example, if the indexes are 1 and 3, you want the effect of:

df[1,'Code']=df[3,'Code']
df[3,'Code']=df[1,'Code']
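
Note that running those two assignments one after the other would overwrite df[1,'Code'] before it can be copied back, so a working function needs a temporary copy (or a single vectorised assignment). A sketch of one possible helper; the function and argument names are purely illustrative:

swap_code <- function(data, i, j) {
  tmp <- data[i, "Code"]            # keep a temporary copy of the first value
  data[i, "Code"] <- data[j, "Code"]
  data[j, "Code"] <- tmp
  data                              # return the modified data frame
}
df <- swap_code(df, 1, 3)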

Exercise 3

Consider two integer variables, x and y, and a data frame df:

A=c(1:10)
B=seq(100,10,-10)
H=seq(-200,-50,along.with=B)
df=data.frame(A,B,H)

Create a function that, given the data frame df, calculates a new variable ‘SUM_x_y’ (if x=2 and y=3, the new variable will be ‘SUM_2_3’; if x=4 and y=10, it will be ‘SUM_4_10’), such that the value for each row ‘i’ is equal to:

sum(x*df[1:i,1])+sum(y*df[1:i,2])

Exercise 4

Create a function that, given a numeric vector, sorts it in ascending order and multiplies each element by two.

Exercise 5

Create a function that, given an alphanumeric vector, keeps only the numbers and applies the function created in Exercise 4.
For example, if the input is the vector w="a" "v" "7" "4" "q", the function will return 8 14.

Exercise 6

Create a function that given a string

ST='NAME: Maria /COUNTRY:uruguay /EMAIL: mariaUY@gmail.com'

returns the matrix:

     [,1]      [,2]
[1,] "NAME"    " Maria "
[2,] "COUNTRY" "uruguay "
[3,] "EMAIL"   " mariaUY@gmail.com"

Exercise 7

Consider a vector:

ST=c('NAME:Maria /COUNTRY:uruguay /EMAIL:mariaUY@gmail.com','NAME:Paul/COUNTRY:UK /EMAIL:PaulUK@gmail.com',
'NAME:Jhon /COUNTRY:USA /EMAIL:JhonUSA@gmail.com','NAME:Carlos /COUNTRY:Spain /EMAIL:CarlosSP@gmail.com')

Create a function that, given the string vector ST, returns the matrix:

     [,1]      [,2]                [,3]               [,4]                [,5]
[1,] "NAME"    "Maria "            "Paul"             "Jhon "             "Carlos "
[2,] "COUNTRY" "uruguay "          "UK "              "USA "              "Spain "
[3,] "EMAIL"   "mariaUY@gmail.com" "PaulUK@gmail.com" "JhonUSA@gmail.com" "CarlosSP@gmail.com"

Exercise 8

Create a function that, given a numeric vector X, returns the digits 0 to 9 that are not in X. If X = 0 2 4 8,
the function returns 1 3 5 6 7 9.

Exercise 9

Create a function that, given two strings (one word each), checks whether one is an anagram of the other.

Exercise 10
Create a function that, given one word, returns the positions of the word’s letters in the built-in letters vector.
For example, if the word is ‘abc’, the function will return 1 2 3.

Want to practice functions a bit more? We have more exercise sets on this topic here.




Best practices while writing R code Exercises

How can I write R code that other people can understand and use?


In the exercises below we cover some best practices for writing anything from a small piece of R code to a fully automated script. Keeping these practices in mind while coding will make your life a lot easier.

Answers to the exercises are available here.

If you obtained a different (correct) answer than those listed on the solutions page, please feel free to post your answer as a comment on that page.

Exercise 1
We want to create a numeric vector with values between 1 and 10, starting from 1 with a step of 2. Below is code that generates this vector. Make the suitable changes so that it follows the standard practice for assignment.

NumVector = seq(1,10,by=2)

Exercise 2
The command below installs the “car” package. Change the command so that all the packages on which “car” depends are installed as well.

install.packages("car")

Exercise 3
Change the code below so that it is easy for other users to read and follows the standard practice for writing an if/else statement in R.

y <- 0
x <- 0

if (y == 0)
{
log(x)
} else {
y ^ x
}

Exercise 4
Update the below code so that it is easy for other users to read it.

NumVector <- seq(1,10,by=2)

if(length(NumVector) > 10 && debug)
message("Length of the numeric vector is greater than 10")

Exercise 5
Correct the indentation in the code below so that it is easy for you and other users to read and understand.

test<-1

if (test==1) {
print("Hello World!")
print("The value of test is 1 here")
} else{
print("The value of test is not 1 here");
}
print(test*test+1);

Exercise 6
Update the code below so that it first checks whether the “dplyr” package is present. If it is already present, don’t install it; just load it. If it is not present, install it and then load it.

install.packages("dplyr",dependencies = T)
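
One widely used pattern (a sketch, not the only acceptable answer) checks for the package with requireNamespace() before deciding whether to install:

if (!requireNamespace("dplyr", quietly = TRUE)) {  # TRUE only if dplyr is already installed
  install.packages("dplyr", dependencies = TRUE)
}
library(dplyr)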

Exercise 7
Change the code below so that it doesn’t print package-related information while loading the plyr package.

library(plyr)

Exercise 8
Change the code below so that it doesn’t print a warning while calculating the correlation between the two vectors.

a <- c(1,1)
b <- c(2,3)
cor(a,b)

Exercise 9
Update the command below so that it calls the ‘rename’ function from the ‘plyr’ package. A function with the same name is present in both the ‘plyr’ and ‘dplyr’ packages.

rename(head(mtcars), c(mpg = "NewName"))

Exercise 10
Create a scalar ‘a’ with a value of 1e-02 (1/100). The code below prints it in scientific format. Make changes so that it prints in a fixed (non-scientific) numeric format.

a <- 1e-02
print(a)