Data Science for Operational Excellence (Part-4)

Suppose your friend owns a restaurant chain (only 3 units) and is facing a challenge from competitors related to low prices; let’s call it a price war. He knows that inside his business there isn’t much cost left to cut. But he thinks that if he tries harder to find suppliers with lower freight and product costs, he could be in a better position. So, he decided to hire you, a recently graduated data scientist, to figure out how to solve this problem and to build a tool that incorporates your findings into his daily operations. As a data scientist, you know that this problem can be solved using the lpSolve package.

Our goal here is to expand your knowledge by creating custom constraints for use in real business problems.

Answers to the exercises are available here.

Exercise 1
We will solve a transportation problem for your friend’s restaurant chain, with 2 different products sold by 4 different suppliers. Create a cost vector that models a different cost for each combination of restaurant, supplier and product. Use integer random numbers from 0 to 1000 to fill this vector. In order to be reproducible, set the seed to 1234.
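One possible shape for this, as a sketch (the ordering of the 3 restaurants × 4 suppliers × 2 products = 24 combinations is up to you, as long as you keep it consistent with the constraint matrices built later):
set.seed(1234)
# one cost per (restaurant, supplier, product) combination
costs <- sample(0:1000, 3 * 4 * 2, replace = TRUE)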

Exercise 2
Create the demand constraints. Consider that every restaurant needs a specific quantity of each product. Use integer random numbers from 100 to 500 to define the minimum quantities needed to keep each restaurant open without running out of any supplies.

Exercise 3
Create the offer constraints. Consider that every supplier can deliver a specific quantity of each product. Use integer random numbers from 200 to 700 to define the maximum quantities that each supplier can deliver.

Exercise 4
Prepare the parameters of the lp() function using the variables created above.
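As a rough sketch of how the pieces fit together (demand.mat, offer.mat, demand.rhs and offer.rhs are illustrative names for the objects built in Exercises 2 and 3):
library(lpSolve)
# one row per constraint, one column per decision variable
const.mat <- rbind(demand.mat, offer.mat)
const.dir <- c(rep(">=", nrow(demand.mat)), rep("<=", nrow(offer.mat)))
const.rhs <- c(demand.rhs, offer.rhs)
sol <- lp("min", objective.in = costs, const.mat = const.mat,
const.dir = const.dir, const.rhs = const.rhs)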

Exercise 5
Now, solve the problem with the constraints created so far.


Exercise 6
We know that some suppliers have minimum order quantities. Create a new set of constraints to represent that. Use integer random numbers from 50 to 70 to define minimum quantities that we can order from each supplier.

Exercise 7
Now, solve the problem with the constraints created so far.

Exercise 8
We also know that some vehicles have a maximum capacity in terms of weight and volume. Create a new set of constraints to represent that. Use integer random numbers from 100 to 500 to define the maximum quantities that we can order from each supplier.

Exercise 9
Prepare the lp() function parameters again using the variables created above.

Exercise 10
Now, solve the problem with all of the constraints.




Shiny Application Layouts Exercises (Part-2)

SHINY APPLICATION LAYOUTS-PLOT PLUS COLUMNS

In the second part of our series we will build another small shiny app, but with a different UI.

More specifically, we will present the example of a UI with a plot at the top and columns at the bottom that contain the inputs that drive the plot. For our case we are going to use the diamonds dataset to create a Diamonds Analyzer App.

This part can be useful for you in two ways.

First of all, you can see different ways to enhance the appearance and the utility of your shiny app.

Secondly, you can review what you learned in the “Building Shiny App” series, as we will build basic shiny components in order to present them in the proper way.

Read the examples below to understand the logic of what we are going to do, and then test your skills with the exercise set we prepared for you. Let’s begin!

Answers to the exercises are available here.

Shiny Installation

In order to create the app we have to install and load the package shiny.

Exercise 1

Install and load shiny.

Grid Layout

The sidebarLayout uses Shiny’s grid layout functions. Rows are created by the fluidRow function and include columns defined by the column function. Column widths should add up to 12 within a fluidRow.
The first parameter to the column function is its width. You can also change the position of columns to decide the location of UI elements. You can push columns to the right by adding the offset parameter to the column function. Each unit of offset increases the left margin of a column by a whole column.
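For instance, a generic row like the following (a standalone sketch, not yet part of our app) places one width-4 column and a second width-4 column pushed one column to the right:
#ui.R
fluidRow(
column(4, "left content"),
column(4, offset = 1, "content shifted one column to the right")
)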

Now let’s begin to build our UI. First of all we will place the fluidPage with a title, as below:

#ui.R
library(shiny)
shinyUI(fluidPage(
title = "Diamonds",
h4("Diamonds Analyzer")

))
#server.R
library(shiny)
function(input, output) {}

Exercise 2

Create the initial UI of your app and name it “Diamonds Analyzer”.

You can use the fluidRow function with column functions of width = 2 inside of it, like this:
#ui.R
library(shiny)
shinyUI(fluidPage(
title = "Diamonds",
h4("Diamonds Analyzer"),
fluidRow(column(2),
column(2))

))

Exercise 3

Create a fluidRow with two columns of width = 4 inside it. NOTE: Do not expect to see anything yet.

Now it is time to fill these columns with some tools that will help us determine the variables that we are going to use for our plot.
In the first 4 columns we will put a selectInput, as in the code below.
#ui.R
fluidRow(column(4,
h4("Variable X"),
selectInput('x', 'X', names(diamonds))))

Exercise 4

Put a selectInput in the first 4 columns of your UI. Name it “Variable X”. HINT: Use names to get the names of the diamonds dataset as inputs.

Now let’s move to the next four columns. We are going to put another selectInput in there and select the second of the dataset’s names as the default. We are also going to see what offset does by setting it to 1 and then deactivating it again, as in the example below. You can use the code as it is, or change the given parameters to understand the logic behind its use.
#ui.R
offset = 1,
selectInput('y', 'Y', names(diamonds), names(diamonds)[[2]])

Exercise 5

Create a selectInput from column 5 to column 8. Choose the second of the dataset’s names as the default. Name it “Variable Y”. HINT: Use names to get the names of the diamonds dataset as inputs.

Exercise 6

Set the offset parameter to 1 from columns 5 to 8.

Now let’s call our plot and put it at the top of our UI. Look at the sketch below.
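A minimal sketch of the idea (the output id 'plot' is the one the server side will render into; hr adds a horizontal rule between the plot and the input columns):
#ui.R
plotOutput('plot'),
hr(),
fluidRow(
# ... the two input columns from the previous exercises go here
)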

Exercise 7

Place the plot at the top of your UI. HINT: Use plotOutput and hr. NOTE: You are not going to see the plot in your UI yet, because you have not created the server side.

We are going to create a reactive expression in order to combine the selected variables into a new data frame. Look at the example:
#server.R
selectedData <- reactive({
diamonds[, c(input$x, input$y)]
})

Exercise 8

Create a reactive expression in order to combine the selected variables into a new data frame. HINT: Use reactive.

Now plot your new data frame like the example:
#server.R
output$plot <- renderPlot({
plot(selectedData())
})

Exercise 9

Plot your data frame. HINT: Use renderPlot.

As mentioned before, the width of our UI is equal to 12 columns. So what is going to happen if we add a new column of width = 4 next to the other two? You have to find out in order to better understand how it works.

Exercise 10

Create a new selectInput and try to put it next to “Variable Y”. Try to explain the result. NOTE: You do not have to connect it with your plot.




Forecasting for small business Exercises (Part-3)

Uncertainty is the biggest enemy of a profitable business. That is especially true of small businesses, which don’t have enough resources to survive an unexpected drop in revenue or to capitalize on a sudden increase in demand. In this context, it is especially important to be able to accurately predict changes in the market in order to make better decisions and stay competitive.

This series of posts will teach you how to use data to make sound predictions. In the last set of exercises, we saw how to make predictions on a random walk by isolating the white noise component via differencing of the terms of the time series. But this approach is valid only if the random components of the time series follow a normal distribution of constant mean and variance, and if those components are added together in each iteration to create the new observations.

Today, we’ll see some transformations we can apply to a time series to make it stationary, especially how to stabilise the variance and how to detect and remove seasonality.

To be able to do these exercises, you have to install the packages forecast and tseries.
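If needed, they can be installed and loaded like this:
install.packages(c("forecast", "tseries"))
library(forecast)
library(tseries)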

Answers to the exercises are available here.

Exercise 1
Use the data() function to load the EuStockMarkets dataset from the R library. Then use the diff() function on EuStockMarkets[,1] to isolate the random component, and plot it. Differencing is the most used transformation with time series.

We can see that the mean of the random component of the time series seems to stay constant over time, but the variance seems to get bigger near 1997.
Exercise 2
Apply the log() function on EuStockMarkets[,1] and repeat the steps of exercise 1. The logarithmic transformation is often used to stabilise non-constant variance.

Exercise 3
Use the adf.test() function from the tseries package to test if the time series you obtained in the last exercise is stationary. Use a lag of 1.

Exercise 4
Load and plot the co2 dataset from the R datasets library. Use the lowess() function to create a trend line and add it to the plot of the time series.

By looking at the last plot, we can see that the time series oscillates from one side of the trend line to the other with a constant period. That characteristic is called seasonality and is often observed in time series. Just think about temperature, which changes predictably from season to season.
Exercise 5
To eliminate the upward trend in the data, use the diff() function and save the resulting time series in a variable called diff.co2. Plot the autocorrelation plot of diff.co2.

Exercise 6
This last autocorrelation plot uses years as its unit, which is not really intuitive in our scenario. Make another autocorrelation plot where the x axis has months as its unit. By looking at this plot, can you tell what the seasonal period of this time series is?

Another way to verify whether a time series shows seasonality is to use the tbats function from the forecast package. As its name says, this function fits a TBATS model on the time series and returns a smooth curve that fits the data. We’ll learn more about that model in a future post.
Exercise 7
Use the tbats function on the co2 time series and store the result in a variable called tbats.model. If the time series shows signs of seasonality, the $seasonal.periods value of tbats.model will store the period value; otherwise it will be NULL.

Exercise 8
Use the diff() function with the appropriate lag to remove the seasonality of the co2 time series, then use it again to remove the trend. Plot the resulting random component and the autocorrelation plot.

Exercise 9
Apply the ADF test, the KPSS test and the Ljung-Box test on the result of the last exercise to make sure that the random component is stationary.

Exercise 10
An interesting way to analyse a time series is to use the decompose() function, which uses a moving average to estimate the seasonal, random and trend components of a time series. With that in mind, use this function and plot each component of the co2 time series.




Data Science for Operational Excellence (Part-3)


Optimized transportation planning is a task usually left to the firm’s logistics department. However, it is often difficult to visualize, especially if there are many points involved in the logistics network. R and its packages can help solve this issue. Our goal here is to expand logistics network visualization. In order to do that, we will use packages such as ggmap and leaflet.

Answers to the exercises are available here.

Exercise 1
Load libraries: ggmap, fields, lpSolve, leaflet, dplyr, magrittr. Use the following vectors to create a new one called allCitiesAux: soyaCities <- c("Sapezal","Sorriso", "Nova Mutum", "Diamantino", "Cascavel") , transhipment <- c("Alto Araguaia", "Cascavel"), ports <- c("Santos", "Paranagua").
Exercise 2
Use the function geocode to collect latitude and longitude for all cities.
Exercise 3
Create a data frame with the column names City, lat and lng.
Exercise 4
Create a matrix that contains the distances between all cities. We will use this in the lp.transport function, so remember that rows must be offer points and columns demand points.
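One way to build such a matrix, as a sketch (assuming a data frame called cities from Exercise 3, with columns City, lat and lng; rdist.earth from the fields package expects longitude-latitude matrices):
coords <- as.matrix(cities[, c("lng", "lat")])
distMatrix <- rdist.earth(coords, coords, miles = FALSE) # great-circle distances in km
rownames(distMatrix) <- colnames(distMatrix) <- cities$City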


Exercise 5
Create row.signs, row.rhs, col.signs and col.rhs. For that, remember to set the seed to 123 and that all soya must be exported through ports. For the “right hand side” variables, use randomly generated numbers. Port demands should be between 300 and 600. Soya production should be between 50 and 300.
Exercise 6
Solve the transportation problem and change the column and row names to match the names from the cost matrix.
Exercise 7
Create a list of data frames to store all segments present in the solution. For example, one of these segments should be Sapezal to Santos.
Exercise 8
Create a map using leaflet and add lines for each segment based on the list of data frames created previously.
Exercise 9
Create a list of data frames to store road routes extracted using the route function from ggmap.
Exercise 10
Create a new map using leaflet that, instead of showing straight lines from origin to destination, shows road routes.




Experimental Design Exercises

In this set of exercises we shall follow the practice of conducting an experimental study. A researcher wants to see if working out has any influence on body mass. Three groups of subjects with similar food and sport habits were included in the experiment. Each group was subjected to a different set of exercises. Body mass was measured before and after the workout program. The focus of the research is the difference in body mass between groups, measured after working out. In order to examine these effects, we shall use the paired t test, the t test for independent samples, one-way and two-way analysis of variance, and analysis of covariance.

You can download the dataset here. The data is fictitious.

Answers to the exercises are available here.

If you have a different solution, feel free to post it.

Exercise 1

Load the data. Calculate descriptive statistics and test for the normality of both the initial and final measurements, for the whole sample and for each group.

Exercise 2

Is there an effect of exercise, and what is the size of that effect for each group? (Tip: You should use the paired t test.)
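A minimal sketch of such a test (dat, mass_before and mass_after are hypothetical names; adapt them to the downloaded dataset):
t.test(dat$mass_before, dat$mass_after, paired = TRUE) # paired t test on before/after measurements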

Exercise 3

Is the variance of body mass at the final measurement the same for each of the three groups? (Tip: Use Levene’s test for homogeneity of variances.)
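The car package provides this test; a sketch with hypothetical names:
library(car)
leveneTest(mass_final ~ group, data = dat) # 'group' should be a factor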

Exercise 4

Is there a difference between groups at the final measurement, and what is the effect size? (Tip: Use one-way ANOVA.)


Exercise 5

Between which groups does the difference in body mass appear after working out? (Tip: Conduct a post-hoc test.)

Exercise 6

What is the impact of age and workout program on body mass at the final measurement? (Tip: Use two-way between-groups ANOVA.)

Exercise 7

What is the origin of the effect of the workout program between subjects of different ages? (Tip: You should conduct a post-hoc test.)

Exercise 8

Is there a linear relationship between the initial and final measurements of body mass for each group?

Exercise 9

Is there a significant difference in body mass at the final measurement between groups, while controlling for the initial measurement?

Exercise 10

How much of the variance is explained by the independent variable? How much of the variance is explained by the covariate?




Data science for Doctors: Cluster Analysis Exercises


Data science enhances people’s decision making. Doctors and researchers are making critical decisions every day. Therefore, it is absolutely necessary for those people to have some basic knowledge of data science. This series aims to help people who are around the medical field to enhance their data science skills.

We will work with a health-related database, the famous “Pima Indians Diabetes Database”. It was generously donated by Vincent Sigillito from Johns Hopkins University. Please find further information regarding the dataset here.

This is the ninth part of the series and it aims to cover the very basics of the subject of cluster analysis.
In my opinion, it is necessary for researchers to know how to discover relationships between patients and diseases. Therefore, in this set of exercises we will go through the basics of cluster analysis for relationship discovery. In particular, we will use hierarchical clustering and centroid-based clustering: k-means clustering and k-median clustering.

Before proceeding, it might be helpful to look over the help pages for ggplot, geom_point, dist, hclust, cutree, stats::rect.hclust, multiplot, kmeans, and kGmedian.

Moreover, please load the following libraries.
install.packages("ggplot2")
library(ggplot2)
install.packages("Gmedian")
library(Gmedian)

Please run the code below in order to load the data set and transform it into a proper data frame format:

url <- "https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data"
data <- read.table(url, fileEncoding="UTF-8", sep=",")
names <- c('preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class')
colnames(data) <- names
data <- data[-which(data$mass ==0),]

Answers to the exercises are available here.

If you obtained a different (correct) answer than those listed on the solutions page, please feel free to post your answer as a comment on that page.

Exercise 1

Construct a scatterplot with the mass variable on the x-axis and the age variable on the y-axis. Moreover, determine the colour of the points based on the class of the candidate (0 or 1).

Exercise 2

Create a distance matrix for the data.

Exercise 3

Perform a hierarchical clustering analysis using the single linkage method. Then create an object that contains only two clusters.
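For reference, the relevant calls could look like this (assuming d is the distance matrix from Exercise 2; the object names are illustrative):
hc.single <- hclust(d, method = "single")
clusters.single <- cutree(hc.single, k = 2) # cluster membership for 2 clusters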

Exercise 4

Perform a hierarchical clustering analysis using the complete linkage method (the default). Then create an object that contains only two clusters.

Exercise 5

Plot the trees produced in exercises 3 and 4 and draw the two clusters on the plots.
hint: rect.hclust


Exercise 6

Construct two scatterplots with the mass variable on the x-axis and the age variable on the y-axis. Moreover, determine the colour of the points based on the cluster those points belong to. Each scatterplot corresponds to a different clustering method.

If possible, display those scatterplots (one at a time) next to the plot from exercise 1, to see whether the clustering can discriminate the positively classified from the negatively classified patients. In case you don’t do that, you can find it in the solutions section; I highly encourage you to check it out.

Exercise 7

Run the following in order to create dummy variables: data_mat <- model.matrix(~.+0, data = data).
Perform a centroid-based cluster analysis using the k-means method with k equal to 2. Apply the k-means clustering on the data_mat matrix.

Exercise 8

Construct a scatterplot with the mass variable on the x-axis and the age variable on the y-axis. Moreover, determine the colour of the points based on the cluster (retrieved from the k-means method) those points belong to.

If possible, display this scatterplot next to the plot from exercise 1.

Exercise 9

Perform a centroid-based cluster analysis using the k-median method with k equal to 2. Apply the k-median clustering on the data_mat matrix.

Exercise 10

Construct a scatterplot with the mass variable on the x-axis and the age variable on the y-axis. Moreover, determine the colour of the points based on the cluster (retrieved from the k-median method) those points belong to.

If possible, display this scatterplot next to the plot from exercise 1.




User Defined Functions in R Exercises (Part 1)

In these exercises we will discuss user defined functions in R.
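As a reminder of the basic syntax before you start (a trivial, illustrative example):
# a user defined function: a name bound to function(arguments) { body }
square <- function(x) {
x^2
}
square(4) # returns 16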

Answers to the exercises are available here.

Exercise 1

Create a function to print the square of a number.

Exercise 2

Create a function to print a number raised to the power of another, where one of the arguments has a default value.

Exercise 3

Create a function to print the class of an argument.

Exercise 4

Create a function to accept two matrix arguments and perform matrix operations with them.

Exercise 5

Create a user defined function to accept a name from the user.

Exercise 6

Create a user defined function to accept values from the user using scan, and return the values.

Exercise 7

Create a user defined function to create a matrix and return it.

Exercise 8

Create a function that takes two arguments, one of student marks and the other of student names, and plots a graph based on them.

Exercise 9

Create a function to accept an employee data frame (Name, Gender, Age, Designation & SSN) and print the first and fifth employees, as well as the names and designations of all the employees, inside the function.

Exercise 10

Create a function to create an employee data frame (Name, Gender, Age, Designation & SSN) and return the Name, Age & Designation of all employees.




Shiny Application Layouts Exercises (Part-1)


Welcome to the first part of our new series “Shiny Application Layouts”. As you can understand from the title, we will see how to organize the output of our application in various ways. For this reason we will build together 10 simple apps that will help you understand what kinds of interfaces shiny provides.

In this part we will look at tabsets, one of the simplest ways to organize our app. This part can be useful for you in two ways.
First of all, as already mentioned, you can see different ways to enhance the appearance and the utility of your shiny app.
Secondly, you can review what you learned in the Shiny Apps series, as we will build basic shiny components in order to present them in the proper way.

In part 1 we will use a dataset called data, which we will generate within the app, and we will create three tabPanels, each one with its own utility.

Read the examples below to understand the logic of what we are going to do, and then test your skills with the exercise set we prepared for you. Let’s begin!

Answers to the exercises are available here.

Tabsets

We use tabsets in our application to organize output. Firstly, install the shiny package and then call it in a new R script in your working directory.
You can use install.packages() and library() for these.
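For example:
install.packages("shiny")
library(shiny)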

Exercise 1

Install and load the shiny package.

Tab Panels

Tabsets can be created by calling the tabsetPanel function. Each tab panel includes a list of output elements.

In this case we will create a histogram, a summary and a table view of the data, each rendered on its own tab.

Let’s start by creating the basic interface, like the example below. As we know, the UI includes three basic parts (headerPanel, sidebarPanel, mainPanel).
#ui.R
library(shiny)

shinyUI(pageWithSidebar(
headerPanel("Example"),
sidebarPanel(),
mainPanel()
))

#server.R
shinyServer(function(input, output) {})

Exercise 2

Create the basic interface and name it “APP 1”.

Secondly, let’s fill our sidebar with some widgets. We will create radioButtons and a sliderInput to control the random distribution and the number of observations. Look at the example.
#ui.R
library(shiny)

shinyUI(pageWithSidebar(
headerPanel("Example"),
sidebarPanel(
radioButtons("cho", "Choice:",
list("One" = "one",
"Two" = "two",
"Three" = "three",
"Four" = "four")),
br(),

sliderInput("n",
"Number of observations:",
value = 50,
min = 1,
max = 100)
),
mainPanel()
))

#server.R
shinyServer(function(input, output) {})

Exercise 3

Put radioButtons inside the sidebarPanel. The four choices will be “Normal”, “Uniform”, “Log-Normal” and “Exponential”, and its label “Distributions”.

Exercise 4

Put a sliderInput under the radioButtons. Its values should be from 1 to 1000 and the selected value should be 500. Name it “Number of observations”.

It is time to create three tabs in order to place a plot, a summary and a table view of our dataset there. More specifically, we will use a plotOutput, a verbatimTextOutput and a tableOutput. Look at the code and then build your own.
#ui.R
tabsetPanel(
tabPanel("Plot", plotOutput("plot")),
tabPanel("Summary", verbatimTextOutput("summary")),
tabPanel("Table", tableOutput("table"))
)

Exercise 5

Create your tabsetPanel, which will include the tabPanels.


Exercise 6

Create the first tabPanel for your plot. Use plotOutput.
Create the second tabPanel for your summary statistics. Use verbatimTextOutput.
Create the third tabPanel for your table. Use tableOutput.

Tabs and Reactive Data

Working with tabs in our user interface magnifies the importance of creating reactive expressions for our data. In this example each tab provides its own view of the dataset. We have to mention that the bigger the dataset, the slower our app will be.
We should use a reactive expression to generate the requested distribution. It is re-evaluated whenever the inputs change. The renderers defined below all use the value computed from this expression.
#server.R
data <- reactive({
dist <- switch(input$dist,
norm = rnorm,
unif = runif,
lnorm = rlnorm,
exp = rexp,
rnorm)

dist(input$n)
})

Exercise 7

Put data() in the correct place of your code in order to activate it.

Now we should generate a plot of the data. Note that both dependencies on the data are tracked by the dependency graph.

As you can see from the code below, we use the renderPlot function to create the reactive plot, and inside it we use the hist function in order to create the histogram.
#server.R
output$plot <- renderPlot({
dist <- input$dist
n <- input$n

hist(data())
})

Exercise 8

Create a reactive histogram inside the tabPanel “Plot”.

Let’s print a summary of the data in the tabPanel “Summary”. As you can see below, we will use renderPrint for our case, together with the summary function.
#server.R
output$summary <- renderPrint({
summary(data())
})

Exercise 9

Print the dataset’s summary in the tabPanel “Summary”.

Last but not least, we will create an HTML table view of the data. For this we will use the renderTable and data.frame functions.
#server.R
output$table <- renderTable({
data.frame(x=data())
})

Exercise 10

Create a data table in the tabPanel “Table”.




Forecasting: Exponential Smoothing Exercises (Part-3)

Exponential smoothing is a method of finding patterns in time series, which can be used to make forecasts. In its simple form, exponential smoothing is a weighted moving average: each smoothed value is a weighted average of all past time series values (with weights decreasing exponentially from the most recent to the oldest values). In more complicated forms, exponential smoothing is applied to a time series recursively to allow for a trend and seasonality. In that case, the model is said to consist of three components – error, trend, and seasonality, from which another notation for exponential smoothing (“ETS”) is derived.
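To make the simple form concrete, here is a minimal hand-rolled version of the recursion (purely illustrative; the function name and the value of the smoothing parameter alpha are arbitrary):
ses_by_hand <- function(x, alpha = 0.3) {
s <- numeric(length(x))
s[1] <- x[1] # initialise with the first observation
for (t in 2:length(x)) {
# each smoothed value mixes the newest observation with the previous smoothed value
s[t] <- alpha * x[t] + (1 - alpha) * s[t - 1]
}
s
}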
This set of exercises focuses primarily on the ets function from the forecast package. The function can be used to apply various exponential smoothing methods (including Holt’s and Holt-Winters’ methods), and allows for both automatic and manual selection of the model structure (for example, whether the model includes trend and seasonal components). The exercises are based on the monthly data on the US civilian unemployment rate as a percentage of the labor force for 2012-2017, retrieved from FRED, the Federal Reserve Bank of St. Louis database (download here).

For other parts of the series follow the tag forecasting.
Answers to the exercises are available here.

Exercise 1
Load the data, transform it to the ts type (indicating that the data is monthly and the first period is January 2012), and plot it.

Exercise 2
Use the ses function from the forecast package to get a forecast based on simple exponential smoothing for the next 12 months, and plot the forecast.

Exercise 3
Estimate an exponential smoothing model using the ets function with default parameters. Then pass the model as input to the forecast function to get a forecast for the next 12 months, and plot the forecast (both functions are from the forecast package).

Exercise 4
Print a summary of the model estimated in the previous exercise, and find the automatically estimated structure of the model. Does it include trend and seasonal components? If those components are present are they additive or multiplicative?

Exercise 5
Use the ets function to estimate an exponential smoothing model with a damped trend. Make a forecast based on the model for the next 12 months, and plot it.


Exercise 6
Use the ets function to estimate another model that does not include a trend component. Make a forecast based on the model for the next 12 months, and plot it.

Exercise 7
Find a function in the forecast package that estimates the BATS model (an exponential smoothing state space model with Box-Cox transformation, ARMA errors, and trend and seasonal components). Use it to estimate the model with a damped trend, and make a forecast. Plot the forecast.

Exercise 8
Use the accuracy function from the forecast package to get a matrix of accuracy measures for the forecast obtained in the previous exercise. Explore the structure of the matrix, and save a measure of the mean absolute error (MAE) in a variable.

Exercise 9
Write a function that inputs a time series and a list of model estimation functions, calculates forecasts for the next 12 periods using each of the functions (with default parameters), and outputs the forecast with the smallest mean absolute error.
Run the function using the unemployment time series and a list of functions that includes ets, bats, and auto.arima. Plot the obtained result.

Exercise 10
Modify the function written in the previous exercise so that it prints the mean absolute error for each forecasting model along with the name of that model (the name can be retrieved from the forecast object).




Data science for Doctors: Inferential Statistics Exercises(Part-5)


Data science enhances people’s decision making. Doctors and researchers are making critical decisions every day. Therefore, it is absolutely necessary for those people to have some basic knowledge of data science. This series aims to help people who are around the medical field to enhance their data science skills.

We will work with a health-related database, the famous “Pima Indians Diabetes Database”. It was generously donated by Vincent Sigillito from Johns Hopkins University. Please find further information regarding the dataset here.

This is the eighth part of the series, and it aims to partially cover the subject of inferential statistics.
Researchers rarely have the capability of testing many patients, or of trying a new treatment on many patients; therefore, making inferences from a sample is a necessary skill to have. This is where inferential statistics comes into play.
In more detail, in this part we will go through hypothesis tests for the normality of distributions (the Shapiro–Wilk test and the Anderson–Darling test) and for the existence of outliers (Grubbs’ test for outliers). We will also cover the case where the normality assumption doesn’t hold and how to deal with it (rank tests: https://en.wikipedia.org/wiki/Rank_test). Finally, we will do a brief recap of the previous exercises on inferential statistics.

Before proceeding, it might be helpful to look over the help pages for hist, qqnorm, qqline, shapiro.test, ad.test, grubbs.test, and wilcox.test.

Moreover, please load the following libraries.
install.packages("ggplot2")
library(ggplot2)
install.packages("nortest")
library(nortest)
install.packages("outliers")
library(outliers)

Please run the code below in order to load the data set and transform it into a proper data frame format:

url <- "https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data"
data <- read.table(url, fileEncoding="UTF-8", sep=",")
names <- c('preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class')
colnames(data) <- names
data <- data[-which(data$mass ==0),]

Moreover, run the chunk below in order to generate the samples that we will test in this set of exercises.
f_1 <- rnorm(28,29,3)
f_2 <- rnorm(23,29,6)

Answers to the exercises are available here.

If you obtained a different (correct) answer than those listed on the solutions page, please feel free to post your answer as a comment on that page.

Exercise 1

Plot a histogram of the variable pres.

Exercise 2

Plot the QQ-plot with a QQ-line for the variable pres.

Exercise 3

Apply a Shapiro-Wilk normality test for the variable pres.

Exercise 4

Apply an Anderson-Darling normality test for the variable pres.

Exercise 5

What is the percentage of the data that passes a normality test?
This might be a bit challenging; consider using the apply function.


Exercise 6

Construct a boxplot of pres and see whether there are outliers or not.

Exercise 7

Apply Grubbs’ test on the pres variable to see whether it contains outlier values.

Exercise 8

Apply a two-sided Grubbs’ test on the pres variable to see whether it contains outlier values.

Exercise 9

Suppose we test a new diet on a sample of 14 people from the candidates (take a random sample from the set), and after the diet the average mass was 29 with a standard deviation of 4 (generate 14 normally distributed samples with the properties mentioned before). Apply the Wilcoxon signed rank test for the mass variable before and after the diet.

Exercise 10

Check whether the positive and negative candidates have the same distribution for the pres variable. In order to check that, apply a Wilcoxon rank sum test for the pres variable with respect to the class.fac variable.