Data Science for Doctors – Part 4 : Inferential Statistics (1/5)

Data science enhances people’s decision making. Doctors and researchers make critical decisions every day, so it is essential for them to have some basic knowledge of data science. This series aims to help people in and around the medical field enhance their data science skills.

We will work with a health-related dataset, the famous “Pima Indians Diabetes Database”. It was generously donated by Vincent Sigillito from Johns Hopkins University. Please find further information regarding the dataset here.

This is the fourth part of the series and it partially covers inferential statistics. Researchers rarely have the capacity to test many patients or to try a new treatment on many patients, so making inferences from a sample is a necessary skill. This is where inferential statistics comes into play.

Before proceeding, it might be helpful to look over the help pages for sample, mean, sd, sort and pnorm. Moreover, it is crucial to be familiar with the Central Limit Theorem.

You may also need to install and load the moments library, which provides the skewness and kurtosis functions.
install.packages("moments")
library(moments)

Please run the code below in order to load the data set and transform it into a proper data frame format:

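# Download the Pima Indians Diabetes data (comma-separated, no header row)
url <- "https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data"
data <- read.table(url, fileEncoding = "UTF-8", sep = ",")
# Give the columns readable names (col_names avoids masking base::names)
col_names <- c('preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class')
colnames(data) <- col_names
# Drop the rows where mass is recorded as 0
data <- data[data$mass != 0, ]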

Answers to the exercises are available here.

If you obtained a different (correct) answer than those listed on the solutions page, please feel free to post your answer as a comment on that page.

Exercise 1

Generate (10000 iterations) a sampling distribution of sample size 50, for the variable mass.

You are encouraged to experiment with different sample sizes and numbers of iterations in order to see the impact that they have on the distribution (standard deviation, skewness, and kurtosis). Moreover, you can plot the distributions to get a better feel for what you are working on.
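As a rough illustration of what a sampling distribution is, the sketch below repeatedly draws samples from a simulated population and records the mean of each sample. The vector `population` is made-up data standing in for `data$mass`, and the names `n_iter`, `n` and `sample_means` are chosen for this sketch only.

```r
set.seed(42)
# Hypothetical stand-in for data$mass: 750 simulated values
population <- rnorm(750, mean = 32, sd = 7)

n_iter <- 10000  # number of samples to draw
n <- 50          # size of each sample

# Draw n_iter samples (with replacement) and keep the mean of each one
sample_means <- replicate(n_iter, mean(sample(population, n, replace = TRUE)))

# Centre and spread of the sampling distribution
mean(sample_means)
sd(sample_means)

# The Central Limit Theorem suggests the spread is close to sd / sqrt(n)
sd(population) / sqrt(n)

hist(sample_means, main = "Sampling distribution of the mean")
```

Changing `n` shows how quickly the spread of the sample means shrinks as the sample size grows.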

Exercise 2

Find the mean and standard error (standard deviation) of the sampling distribution.

You are encouraged to use the values from the original distribution (data$mass) in order to understand how the mean and standard deviation are derived, as well as how much the sample size matters for the distribution.

Exercise 3

Find the skewness and kurtosis of the distribution you generated before.

Exercise 4

Suppose that we ran an experiment in which we took a sample of size 50 from the population and had them follow an organic food diet. Their average mass was 30.5. What is the Z score for a mean of 30.5?

Exercise 5

What is the probability of drawing a sample of 50 with a mean less than 30.5? Use the z-table if you feel you need to.
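The link between a z score and a probability can be sketched as follows. The population mean `mu` and standard deviation `sigma` below are hypothetical placeholders, not values taken from the dataset; in the exercises they would come from `data$mass`.

```r
# Hypothetical population values; replace with mean(data$mass) and sd(data$mass)
mu <- 32
sigma <- 7
n <- 50        # sample size
x_bar <- 30.5  # observed sample mean

se <- sigma / sqrt(n)    # standard error of the mean
z <- (x_bar - mu) / se   # z score of the observed sample mean
p_less <- pnorm(z)       # P(sample mean < x_bar) under the normal approximation

z
p_less
```

Here `pnorm(z)` plays the role of the z-table lookup: it returns the area under the standard normal curve to the left of `z`.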

Exercise 6

Suppose that you repeated the experiment with a larger sample of size 150 and found the average mass to be 31. Compute the z score for this mean.

Exercise 7

What is the probability of drawing a sample of 150 with mean less than 31?

Exercise 8

Suppose everybody adopted the diet of the experiment. Find the margin of error for 95% of sample means.

Exercise 9

What would be our interval estimate that is 95% likely to contain the population mean if everyone in our population started adopting the organic diet?

Exercise 10

Find the interval estimate for 98% and 99% likelihood.
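A z-based interval estimate is the sample mean plus or minus a margin of error, where the margin is a critical value times the standard error. The sketch below uses hypothetical numbers, and the helper name `z_interval` is an assumption of this example rather than a function from any package.

```r
# Hypothetical values for illustration only
x_bar <- 30.5  # sample mean
sigma <- 7     # assumed population standard deviation
n <- 50        # sample size

se <- sigma / sqrt(n)

# Hypothetical helper: z-based interval for a given confidence level
z_interval <- function(x_bar, se, level = 0.95) {
  z_star <- qnorm(1 - (1 - level) / 2)  # critical value, about 1.96 for 95%
  margin <- z_star * se                 # margin of error
  c(lower = x_bar - margin, upper = x_bar + margin)
}

z_interval(x_bar, se, 0.95)
z_interval(x_bar, se, 0.99)
```

A higher confidence level gives a larger critical value and therefore a wider interval.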




Building Shiny App exercises part 7

Connect widgets & plots

In the seventh part of our journey we are ready to connect more of the widgets we created before with our k-means plot in order to fully control its output. Of course, we will also rework the plot itself to make it a real k-means plot.
Read the examples below to understand the logic of what we are going to do, and then test your skills with the exercise set we prepared for you. Let’s begin!

Answers to the exercises are available here.

If you obtained a different (correct) answer than those listed on the solutions page, please feel free to post your answer as a comment on that page.

First of all let’s move the widgets we are going to use from the sidebarPanel into the mainPanel and specifically under our plot.

Learn more about Shiny in the online course R Shiny Interactive Web Apps – Next Level Data Visualization. In this course you will learn how to create advanced Shiny web apps; embed video, pdfs and images; add focus and zooming tools; and many other functionalities (30 lectures, 3hrs.).

Exercise 1

Remove the textInput from your server.R file. Then place the checkboxGroupInput and the selectInput in the same row as the sliderInput. Name them “Variable X” and “Variable Y” respectively. HINT: Use fluidRow and column.
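For readers who have not used the grid functions before, here is a minimal sketch of how fluidRow and column place widgets side by side. The input ids, labels and choices are placeholders invented for this sketch, not the ones from the app built in this series.

```r
library(shiny)

# Three placeholder widgets in one row; a row is 12 units wide,
# so three columns of width 4 fill it exactly
widget_row <- fluidRow(
  column(4, sliderInput("demo_slider", "Slider", min = 1, max = 10, value = 5)),
  column(4, checkboxGroupInput("demo_boxes", "Boxes", choices = c("a", "b"))),
  column(4, selectInput("demo_select", "Select", choices = c("a", "b")))
)
```

The resulting object can be placed inside mainPanel like any other UI element.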

Create a reactive expression

Reactive expressions are expressions that can read reactive values and call other reactive expressions. Whenever a reactive value changes, any reactive expressions that depended on it are marked as “invalidated” and will automatically re-execute if necessary. If a reactive expression is marked as invalidated, any other reactive expressions that recently called it are also marked as invalidated. In this way, invalidations ripple through the expressions that depend on each other.
A reactive expression is created like this: example <- reactive({ })
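The invalidation idea can be tried at the console, outside of an app. In this small sketch, `x` and `doubled` are names made up for the illustration; reactiveVal, reactive and isolate are regular Shiny functions, and isolate is used here only to read reactive values outside of a running app.

```r
library(shiny)

x <- reactiveVal(2)             # a reactive value
doubled <- reactive(x() * 2)    # a reactive expression that depends on x

first <- isolate(doubled())     # computed from x = 2

x(5)                            # changing x invalidates doubled
second <- isolate(doubled())    # re-executed on demand with x = 5

first
second
```

Inside an app the same re-execution happens automatically whenever an input that the expression reads changes.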

Exercise 2

Place a reactive expression in server.R, at any spot except inside output$All and name it “Data”. HINT: Use reactive

Connect your dataset’s variables with your widgets.

Now let’s connect your selectInput with the variables of your dataset as in the example below.

# ui.R
library(shiny)
shinyUI(fluidPage(
  titlePanel("Shiny App"),

  sidebarLayout(
    sidebarPanel(
      h2("Menu"),
      selectInput("ycol", "Y Variable", names(iris))
    ),
    mainPanel(h1("Main"))
  )
))

# server.R
shinyServer(function(input, output) {
  # Reactive subset: the column currently chosen in the selectInput
  example <- reactive({
    iris[, c(input$ycol)]
  })
})

Exercise 3

Put the variables of the iris dataset as inputs in your selectInput as “Variable Y”. HINT: Use names.

Exercise 4

Do the same for checkboxGroupInput and “Variable X”. HINT: Use names.

Select the fourth variable as the default, as in the example below.

# ui.R
library(shiny)
shinyUI(fluidPage(
  titlePanel("Shiny App"),

  sidebarLayout(
    sidebarPanel(
      h2("Menu"),
      checkboxGroupInput("xcol", "Variable X", names(iris),
                         selected = names(iris)[[4]]),
      selectInput("ycol", "Y Variable", names(iris),
                  selected = names(iris)[[4]])
    ),
    mainPanel(h1("Main"))
  )
))

# server.R
shinyServer(function(input, output) {
  # Reactive subset: the columns chosen in both widgets
  example <- reactive({
    iris[, c(input$xcol, input$ycol)]
  })
})

Exercise 5

Make the second variable the default choice for both widgets. HINT: Use selected.

Now follow the example below to create a second reactive expression that holds the k-means calculation.

# ui.R
library(shiny)
shinyUI(fluidPage(
  titlePanel("Shiny App"),

  sidebarLayout(
    sidebarPanel(
      h2("Menu"),
      checkboxGroupInput("xcol", "Variable X", names(iris),
                         selected = names(iris)[[4]]),
      selectInput("ycol", "Y Variable", names(iris),
                  selected = names(iris)[[4]])
    ),
    mainPanel(h1("Main"))
  )
))

# server.R
shinyServer(function(input, output) {
  example <- reactive({
    iris[, c(input$xcol, input$ycol)]
  })
  # kmeans needs a number of centers; 3 is only a placeholder here
  example2 <- reactive({
    kmeans(example(), centers = 3)
  })
})

Exercise 6

Create the reactive function Clusters and put in there the function kmeans which will be applied on the function Data. HINT: Use reactive.

Connect your plot with the widgets.

It is time to connect your plot with the widgets.

Exercise 7

Put Data inside renderPlot as first argument replacing the data that you have chosen to be plotted until now. Moreover delete xlab and ylab.

Improve your k-means visualization.

You can automatically change the colours of your clusters by copying and pasting this piece of code into the expression passed as the first argument of renderPlot, before the plot function:

palette(c("#E41A1C", "#377EB8", "#4DAF4A", "#984EA3",
"#FF7F00", "#FFFF33", "#A65628", "#F781BF", "#999999"))

We will choose to have up to nine clusters so we choose nine colours.

Exercise 8

Set min of your sliderInput to 1, max to 9 and value to 4 and use the palette function to give colours.

This is how you can give different colors to your clusters. To activate these colors put this part of code into your plot function.

col = Clusters()$cluster,

Exercise 9

Activate the palette function.

To make your clusters easy to spot, you can draw the points as large filled dots by adding this to the plot function:
pch = 20, cex = 3
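Outside of Shiny, the effect of the palette, `col`, `pch` and `cex` settings can be previewed with a plain base R plot. This is only a sketch: the two iris columns, the choice of three centers and the object names `xy` and `fit` are assumptions made for the illustration.

```r
# Nine colours, one per possible cluster
palette(c("#E41A1C", "#377EB8", "#4DAF4A", "#984EA3",
          "#FF7F00", "#FFFF33", "#A65628", "#F781BF", "#999999"))

set.seed(1)
# Two numeric iris columns chosen only for illustration
xy <- iris[, c("Sepal.Length", "Petal.Length")]
fit <- kmeans(xy, centers = 3)

# col picks one palette colour per cluster;
# pch = 20 and cex = 3 draw large filled dots
plot(xy, col = fit$cluster, pch = 20, cex = 3)

# Mark the cluster centres with crosses
points(fit$centers, pch = 4, cex = 4, lwd = 4)
```

In the app, `fit$cluster` corresponds to `Clusters()$cluster`.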

Exercise 10

Fully color the points of your plot.




Data Hacking with RDSTK 3

RDSTK is a very versatile package. It includes functions to help you convert IP addresses to geolocations and derive statistics from them. It also allows you to input a body of text and derive its sentiment.

This is a continuation of the previous exercise set, RDSTK 2.
We are going to use the function that we created in our last exercise set to derive statistics programmatically with the coordinates2statistics() function. Last week we talked about local and global variables, which are important to understand before proceeding. Also refresh your memory of the ip2coordinates() function.

This package provides an R interface to Pete Warden’s Data Science Toolkit. For more information click here.

Answers to the exercises are available here.

If you obtained a different (correct) answer than those listed on the solutions page, please feel free to post your answer as a comment on that page.

Exercise 1
This week we will give you a bigger and badder list to work with. It’s a list of more than a dozen proxy IP addresses from the internet. Run the code


list=c("97.77.104.22","104.199.228.65","50.93.204.169","107.189.46.5",
"104.154.142.10","104.131.255.12","209.212.253.44","70.248.28.23",
"52.119.20.75","192.169.168.15","47.88.31.75","107.178.4.109",
"152.160.35.171","104.236.54.196","50.93.197.102","159.203.117.1",
"206.125.41.132","50.93.201.28","8.21.67.248","104.28.16.199")

Exercise 2

Remember how we used iterators to run through each location and derive the stats with the ip2coordinates() function in the first RDSTK exercise set? Let’s do the same here. Store the results in df.

Exercise 3

If you came this far, great. Let’s recall the function that we created in the RDSTK 2 exercises. If you do not remember the function, here is the code for it. Run the code below and then run stat_maker("population_density"). You should see a new column called pop.

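stat_maker=function(s2){
  s1="statistics"
  s3="value"
  s2=as.character(s2)
  for (i in 1:nrow(df)) {
    # Columns 3 and 6 of df are passed as latitude and longitude;
    # <<- writes the result into the global df instead of a local copy
    df$pop[i] <<-coordinates2statistics(df[i,3],df[i,6],s2)[paste(s1,s2,s3, sep = ".")]
    # Carried over from the previous exercise set: creates the global variable test2
    assign("test2",50,envir = .GlobalEnv)
  }
}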

You should see an output in the format “statistics.hello.value”

Exercise 4

Modify the function so that it accepts a string and creates a global variable that holds the values of that statistic. For example, if you input elevation, the function will create a global variable called elevation that stores the results from the for loop.

Exercise 5

Test out the function.


stat_maker("elevation")

Exercise 6

Test the function stat_maker: stat_maker("population_density"). Notice that it did not explicitly change df but just returned it once you called the function. This is because we did not define df as a global variable. But that’s okay. We will learn it later.

Exercise 7

Great. Now, before we modify our function, let’s learn how we can make a global variable inside a function. Use the same code from exercise 5, but this time, instead of defining df$pop2 as a local variable, define it as a global variable. Run the function and test it again.

Exercise 8

Run the code

stat_maker("us_population_poverty")

Notice that our function does not work for this case. This is because anything with the prefix us_population will return a dataframe with a column value like statistics.us_population.value
So you need to modify the function a little to accommodate this.

Exercise 9

Run the following commands. You can also use any string starting with us_population for this function. The goal is to make global variables that hold this data. You can refer to the whole list of statistic functions at www.datasciencetoolkit.org

stat_maker("us_population")
stat_maker("us_population_poverty")
stat_maker("us_population_asian")
stat_maker("us_population_bachelors_degree")
stat_maker("us_population_black_or_african_american")
stat_maker("us_population_black_or_african_american_not_hispanic")
stat_maker("us_population_eighteen_to_twenty_four_years_old")
stat_maker("us_population_five_to_seventeen_years_old")
stat_maker("us_population_foreign_born")
stat_maker("us_population_hispanic_or_latino")

Exercise 10

Use cbind command to bind all the global variables into the df. Print the results of df.

Note: You can choose to build this df in other ways, but this method was used to guide you through modifying functions, global/local variables and working with strings.




Data Science for Doctors – Part 3 : Distributions

Data science enhances people’s decision making. Doctors and researchers make critical decisions every day, so it is essential for them to have some basic knowledge of data science. This series aims to help people in and around the medical field enhance their data science skills.

This is the third part of the series. It covers the main distributions that you will use most of the time. This part is meant to make sure that you have (or will have after solving this set of exercises) the knowledge needed for the parts to come. The distributions that we will see are:

1) Binomial Distribution: The binomial distribution fits repeated trials, each with a dichotomous outcome such as success-failure, healthy-disease, heads-tails.

2) Normal Distribution: It is the most famous distribution, and it is also assumed for many gene expression values.

3) T-Distribution: The T-distribution has many useful applications for testing hypotheses when the sample size is lower than thirty.

4) Chi-squared Distribution: The chi-squared distribution plays an important role in testing hypotheses about frequencies.

5) F-Distribution: The F-distribution is important for testing the equality of two variances.

Before proceeding, it might be helpful to look over the help pages for choose, dbinom, pbinom, rbinom, qbinom, pnorm, qnorm, rnorm, dnorm, pchisq, qchisq, dchisq, df, pf and qf.
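R names its distribution functions with a common prefix scheme: d for the density or probability mass, p for the cumulative probability, q for quantiles and r for random draws. The sketch below shows the scheme with illustrative parameter values that differ from the ones used in the exercises.

```r
# Binomial with 10 trials and success probability 0.5 (illustrative values)
dbinom(5, size = 10, prob = 0.5)      # P(X = 5)
pbinom(5, size = 10, prob = 0.5)      # P(X <= 5)
1 - pbinom(4, size = 10, prob = 0.5)  # P(X >= 5)
qbinom(0.975, size = 10, prob = 0.5)  # 97.5% quantile

set.seed(1)
draws <- rbinom(100, size = 10, prob = 0.5)  # 100 random observations

# The same prefixes work for the other families
pnorm(1.96)         # P(Z <= 1.96) for the standard normal, about 0.975
qt(0.975, df = 20)  # 97.5% quantile of the t-distribution with 20 df
```

For continuous distributions, an interval probability such as P(a < X < b) is the difference of two p-function calls.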

Answers to the exercises are available here.

If you obtained a different (correct) answer than those listed on the solutions page, please feel free to post your answer as a comment on that page.

Exercise 1

Let X be binomially distributed with n = 100 and p = 0.3. Compute the following:
a) P(X = 34), P(X ≥ 34), and P(X ≤ 34)
b) P(30 ≤ X ≤ 60)
c) The quantiles x0.025, and x0.975

Exercise 2

Let X be normally distributed with mean = 3 and standard deviation = 1. Compute the following:
a) P(X 2),P(2 ≤ X ≤ 4)
b) The quantiles x0.025, x0.5 and x0.975.

Exercise 3

Consider the T8 distribution, i.e. the t-distribution with 8 degrees of freedom. Compute the following:
a) P(T8 < 1), P(T8 > 2), P(-1 < T8 < 1).
b) The quantiles t0.025, t0.5, and t0.975. Can you justify the values of the quantiles?

Exercise 4

Compute the following for the chi-squared distribution with 5 degrees of freedom:
a) P(χ²₅ < 2), P(χ²₅ > 4), P(4 < χ²₅ < 6).
b) The quantiles g0.025, g0.5, and g0.975.

Exercise 5

Compute the following for the F6,3 distribution:
a) P(F6,3 < 2), P(F6,3 > 3), P(1 < F6,3 < 4).
b) The quantiles f0.025, f0.5, and f0.975.

Exercise 6

Generate 100 observations following the binomial distribution and plot them (if possible on the same plot):
a) n = 20, p = 0.3
b) n = 20, p = 0.5
c) n = 20, p = 0.7

Exercise 7

Generate 100 observations following the normal distribution and plot them (if possible on the same plot):
a) standard normal distribution ( N(0,1) )
b) mean = 0, s = 3
c) mean = 0, s = 7

Exercise 8

Generate 100 observations following the T distribution and plot them (if possible on the same plot):
a) df = 5
b) df = 10
c) df = 25

Exercise 9

Generate 100 observations following the chi-squared distribution and plot them (if possible on the same plot):
a) df = 5
b) df = 10
c) df = 25

Exercise 10

Generate 100 observations following the F distribution and plot them (if possible on the same plot):
a) df1 = 3, df2 = 9
b) df1 = 9, df2 = 3
c) df1 = 15, df2 = 15




Data Hacking with RDSTK 2

RDSTK is a very versatile package. It includes functions to help you convert IP addresses to geolocations and derive statistics from them. It also allows you to input a body of text and derive its sentiment.

This is a continuation of the previous exercise set, RDSTK 1.
This package provides an R interface to Pete Warden’s Data Science Toolkit. See www.datasciencetoolkit.org for more information.

Answers to the exercises are available here.

If you obtained a different (correct) answer than those listed on the solutions page, please feel free to post your answer as a comment on that page.

Exercise 1
Load the rdstk library and download the dataset here

Exercise 2

a. Create a string called s1 and store “statistics” inside.
b. Create a string called s3 and store “value” inside.
c. Create a function that takes a string s2 as input and outputs a string in the format s1+s2+s3, separated by “.”. Name this function “stringer”.

Exercise 3

Let’s test out this function.


stringer("hello")

You should see an output in the format “statistics.hello.value”

Exercise 4

Create a for loop that will iterate over the rows in df and derive the population density of the location using coordinates2statistics function. Save the results in df$pop

Exercise 5

Let’s now make a function, called stat_maker, using the elements you learned in exercises 3 and 4. The function takes a string as input, like s2 from exercise 3. Inside the function you can combine it with s1 and s3. You have to create the same for loop as in exercise 4, but instead of storing the result of the loop in df$pop, use df$pop2. Once you return df from the function, you should see a new feature inside df with all the results.

Exercise 6

Test the function stat_maker: stat_maker("population_density"). Notice that it did not explicitly change df but just returned it once you called the function. This is because we did not define df as a global variable. But that’s okay. We will learn it later.

Exercise 7

Great. Now, before we modify our function, let’s learn how we can make a global variable inside a function. Use the same code from exercise 5, but this time, instead of defining df$pop2 as a local variable, define it as a global variable. Run the function and test it again.

Exercise 8

You can also use the assign() function inside a function and set the result as a global variable. Let’s see an example of the assign function:

assign("test",50)

Now if you type test in your console, you should see 50. Try it.

Exercise 9

Now try putting the same code from exercise 8 inside the stat_maker function, changing test to test2. Once you run the function, you will see that calling test2 does not return anything. This is because it was not set as a global variable.

Exercise 10

Set test2 as a global variable inside the stat_maker function. Run the function, and now you should see test2 return 50 when you call it.
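The difference between a local and a global assignment can be seen in isolation with two tiny functions. The function and variable names below (`make_local`, `make_global`, `demo_local`, `demo_global`) are made up for this sketch and are not part of the exercises.

```r
make_local <- function() {
  # Without envir, assign() creates the variable inside the function only
  assign("demo_local", 50)
  invisible(NULL)
}

make_global <- function() {
  # envir = .GlobalEnv creates the variable in the global environment
  assign("demo_global", 50, envir = .GlobalEnv)
  invisible(NULL)
}

make_local()
make_global()

exists("demo_local")   # the local variable disappeared with the function call
exists("demo_global")  # the global variable is still there
```

The `<<-` operator is another way to write to a variable outside the function, which is what stat_maker relies on when it updates df.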




Building Shiny App exercises part 6

RENDER FUNCTIONS

In the sixth part of our series we will talk about the renderPlot and renderUI functions, and then we will be ready to create our first visualization. (Find parts 1-5 here).
We are going to create a simple interactive scatterplot that will help us see the clusters that are created when we run the k-means algorithm on our dataset. Read the examples below to understand how to use the renderPlot function, and then test your skills with the exercise set we prepared for you. Let’s begin!

Answers to the exercises are available here.

If you obtained a different (correct) answer than those listed on the solutions page, please feel free to post your answer as a comment on that page.

DESCRIPTIVE STATISTICS

As in every statistical application, it is wise to compute descriptive statistics for your dataset and present this information to the user in an easy-to-read way. So, first of all, we will place a Data Table inside the “SUMMARY” tabPanel. The example below can be your guide.

# ui.R
library(shiny)
shinyUI(fluidPage(
  sidebarLayout(
    sidebarPanel(),
    mainPanel(
      dataTableOutput("Table")
    )
  )
))

# server.R
shinyServer(function(input, output, session) {
  # Summary statistics of iris as a data frame (named to avoid masking base::sum)
  iris_summary <- as.data.frame.array(summary(iris))
  output$Table <- renderDataTable(iris_summary)
})

Exercise 1

Create a Data Table(“Table2”) with the descriptive statistics of your dataset. HINT: Use summary, as.data.frame.array and renderDataTable.

renderPlot

The renderPlot function renders a reactive plot that is suitable for assigning to an output slot. The general form of the function is shown below:

renderPlot(expr, width = "auto", height = "auto", res = 72, ...,
env = parent.frame(), quoted = FALSE, execOnResize = FALSE,
outputArgs = list())

The example below shows you how to create a simple scatterplot between two variables of the iris dataset(“Sepal Length” and “Sepal Width”).

# ui.R
library(shiny)
shinyUI(fluidPage(
  sidebarLayout(
    sidebarPanel(),
    mainPanel(
      plotOutput("plot1")
    )
  )
))

# server.R
shinyServer(function(input, output, session) {
  output$plot1 <- renderPlot({
    plot(iris$Sepal.Length, iris$Sepal.Width)
  })
})

First, remove renderImage and radioButtons from the tabPanel “K Means”.

Exercise 2

Add a scatterplot inside the tabPanel “K Means” between two variables of the iris dataset.

INTERACTIVE PLOTS

Shiny has built-in support for interacting with static plots generated by R’s base graphics functions. This makes it easy to add features like selecting points and regions, as well as zooming in and out of images.
To get the position of the mouse when a plot is clicked, you simply need to use the click option of plotOutput. For example, this app will print out the x and y coordinates of the mouse cursor when a click occurs.

# ui.R
library(shiny)
shinyUI(fluidPage(
  sidebarLayout(
    sidebarPanel(),
    mainPanel(
      # click = "plot_click" exposes the click position as input$plot_click
      plotOutput("plot1", click = "plot_click"),
      verbatimTextOutput("info")
    )
  )
))

# server.R
shinyServer(function(input, output, session) {
  output$plot1 <- renderPlot({
    plot(iris$Sepal.Length, iris$Sepal.Width)
  })
  output$info <- renderText({
    paste0("x=", input$plot_click$x, "\ny=", input$plot_click$y)
  })
})

Exercise 3

Add click inside the plotOutput you just created. Name it “mouse”.

Exercise 4

Add a verbatimTextOutput inside the “K Means” tabPanel, under the plotOutput you created before. Name it “coord”.

Exercise 5

Make “x” and “y” coordinates appear in the pre-tag you just created. HINT : Use renderText and paste0 and do not forget to activate it with the submitButton.

Exercise 6

Set height = “auto” and width = “auto”.

PLOT ANNOTATION

The title function can be used to add labels to a plot. Its first four principal arguments can also be used as arguments in most high-level plotting functions. They must be of type character or expression. In the latter case, quite a bit of mathematical notation is available, such as sub- and superscripts, Greek letters, fractions, etc.
title(main = NULL, sub = NULL, xlab = NULL, ylab = NULL,
line = NA, outer = FALSE, ...)

Look at the example below:
# ui.R
library(shiny)
shinyUI(fluidPage(
  sidebarLayout(
    sidebarPanel(),
    mainPanel(
      plotOutput("plot1", click = "plot_click"),
      verbatimTextOutput("info")
    )
  )
))

# server.R
shinyServer(function(input, output, session) {
  output$plot1 <- renderPlot({
    plot(iris$Sepal.Length, iris$Sepal.Width,
         main = "SCATTER PLOT", sub = "K Means",
         xlab = "Sepal Length", ylab = "Sepal Width")
  })
  output$info <- renderText({
    paste0("x=", input$plot_click$x, "\ny=", input$plot_click$y)
  })
})

Exercise 7

Set scatterplot title to “K-Means”, the X-axis label to “Petal Length” and the Y-axis label to “Petal Width”. HINT: Use main,xlab,ylab.

You can also modify and set other graphical parameters related to the title and subtitle like the example below:

# ui.R
library(shiny)
shinyUI(fluidPage(
  sidebarLayout(
    sidebarPanel(),
    mainPanel(
      plotOutput("plot1", click = "plot_click"),
      verbatimTextOutput("info")
    )
  )
))

# server.R
shinyServer(function(input, output, session) {
  output$plot1 <- renderPlot({
    plot(iris$Sepal.Length, iris$Sepal.Width,
         main = "SCATTER PLOT", sub = "K Means",
         xlab = "Sepal Length", ylab = "Sepal Width",
         cex.main = 3, font.main = 5, col.main = "green",
         cex.sub = 0.65, font.sub = 4, col.sub = "orange")
  })
  output$info <- renderText({
    paste0("x=", input$plot_click$x, "\ny=", input$plot_click$y)
  })
})

Exercise 8

Give values to the rest of the graphical parameters of the title like the example above and get used to them. HINT: Use cex.main, font.main and col.main.

renderUI

renderUI(expr, env = parent.frame(), quoted = FALSE, outputArgs = list())

Makes a reactive version of a function that generates HTML using the Shiny UI library. As you can see in the example below this expression returns a tag object.

# ui.R
library(shiny)
shinyUI(fluidPage(
  sidebarLayout(
    sidebarPanel(uiOutput("Controls")),
    mainPanel(
      plotOutput("plot1", click = "plot_click"),
      verbatimTextOutput("info")
    )
  )
))

# server.R
shinyServer(function(input, output, session) {
  output$plot1 <- renderPlot({
    plot(iris$Sepal.Length, iris$Sepal.Width,
         main = "SCATTER PLOT", sub = "K Means",
         xlab = "Sepal Length", ylab = "Sepal Width",
         cex.main = 2, font.main = 4, col.main = "blue",
         cex.sub = 0.75, font.sub = 3, col.sub = "red")
  })
  output$info <- renderText({
    paste0("x=", input$plot_click$x, "\ny=", input$plot_click$y)
  })
  output$Controls <- renderUI({
    tagList(
      sliderInput("n", "N", 1, 1000, 500),
      textInput("label", "Label")
    )
  })
})

Exercise 9

Put a uiOutput inside tabPanel “K-Means” and name it “All”. Then create its output in server.R with a tagList into it. HINT: Use uiOutput, renderUI and tagList.

Exercise 10

Remove the submitButton and move the sliderInput and the textOutput from the ui.R into the tagList.




Data Science for Doctors – Part 2 : Descriptive Statistics

Data science enhances people’s decision making. Doctors and researchers make critical decisions every day, so it is essential for them to have some basic knowledge of data science. This series aims to help people in and around the medical field enhance their data science skills.

We will work with a health-related dataset, the famous “Pima Indians Diabetes Database”. It was generously donated by Vincent Sigillito from Johns Hopkins University. Please find further information regarding the dataset here.

This is the second part of the series. It covers the main descriptive statistics measures that you will use most of the time. Those measures are divided into measures of central tendency and measures of spread. Most of the exercises can be solved with built-in functions, but I would encourage you to solve them “by hand”, because once you know the mechanics of the measures, you will be far more confident using them. On the “solutions” page I show both methods, so even if you didn’t solve them by hand, it would be worth checking them out.

Before proceeding, it might be helpful to look over the help pages for mean, median, sort, unique, tabulate, sd, var, IQR, mad, abs, cov, cor, summary, str and rcorr.

You also may need to load the Hmisc library.
install.packages('Hmisc')
library(Hmisc)

In case you haven’t solved part 1, run the following script to load the prerequisites for this part.

Answers to the exercises are available here.

If you obtained a different (correct) answer than those listed on the solutions page, please feel free to post your answer as a comment on that page.

Exercise 1

Find the mean of the mass variable.

Exercise 2

Find the median of the mass variable.

Exercise 3

Find the mode of the mass.

Exercise 4

Find the standard deviation of the age variable.

Learn more about descriptive statistics in the online courses Learn by Example: Statistics and Data Science in R (including 8 lectures specifically on descriptive statistics), and Introduction to R.

Exercise 5

Find the variance of the mass variable.

Unlike the popular mean/standard deviation combination, the interquartile range and the median absolute deviation (MAD) are not sensitive to the presence of outliers. Of the two, MAD is often recommended because it can approximate the standard deviation.
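A small made-up example shows how differently these measures react to a single outlier. The vectors `ages` and `ages_outlier` are invented for this sketch and are not taken from the dataset.

```r
# Made-up ages; the extra value in ages_outlier is a deliberate outlier
ages <- c(21, 24, 25, 27, 29, 30, 33, 35, 38, 41)
ages_outlier <- c(ages, 400)

sd(ages); sd(ages_outlier)    # the outlier inflates the standard deviation
IQR(ages); IQR(ages_outlier)  # the interquartile range moves only slightly
mad(ages); mad(ages_outlier)  # the median absolute deviation barely reacts

# By default mad() multiplies by 1.4826 so that it estimates the standard
# deviation for normal data; constant = 1 gives the raw median absolute deviation
mad(ages, constant = 1)
```

The default scaling constant is the reason MAD can stand in for the standard deviation when the data are roughly normal.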

Exercise 6

Find the interquartile range of the age variable.

Exercise 7

Find the median absolute deviation of the age variable. Assume that age follows a normal distribution.

Exercise 8
Find the covariance of the variables age, mass.

Exercise 9

Find the Spearman and Pearson correlations of the variables age and mass.

Exercise 10

Print the summary statistics, and the structure of the data set. Moreover construct the correlation matrix of the data set.




Multipanel Graphics in R (part 1)

In many situations, we require that several plots are placed in the same figure as subplots. R has various ways of doing this. Base Graphics has three different ways to draw subplots, i.e. mfrow, layout and split.screen, with increasing degree of complexity and, at the same time, increasing control over the plot elements. This exercise set introduces the mfrow and mfcol parameters (set via par) and the layout function in Base Graphics. We use the familiar iris dataset for the illustrations.
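To make the two approaches concrete, here is a minimal sketch using the built-in mtcars dataset instead of iris, so that the exercises below remain open. The call to `pdf(NULL)` only serves to let the sketch run without a display; remove it and the final `dev.off()` to see the plots on screen. The names `old_par` and `n_plots` are chosen for this example.

```r
# Null PDF device so the sketch also runs without a display
pdf(NULL)

# mfrow: a 1 x 2 grid that is filled row by row
old_par <- par(mfrow = c(1, 2))
plot(mtcars$wt, mtcars$mpg, main = "mpg vs wt", col = "blue", pch = 16)
plot(mtcars$hp, mtcars$mpg, main = "mpg vs hp", col = "red", pch = 17)
par(old_par)  # restore the previous settings

# layout: the matrix states which plot goes into which cell;
# plot 3 spans the whole second row
n_plots <- layout(matrix(c(1, 2,
                           3, 3), nrow = 2, byrow = TRUE))
plot(mtcars$wt, mtcars$mpg, main = "mpg vs wt")
plot(mtcars$hp, mtcars$mpg, main = "mpg vs hp")
plot(mtcars$disp, mtcars$mpg, main = "mpg vs disp")

dev.off()
```

Using mfcol instead of mfrow keeps the same grid but fills it column by column.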

Answers to the exercises are available here. If you obtained a different (correct) answer than those listed on the solutions page, please feel free to post your answer as a comment on that page.

Exercise 1
Consider the iris dataset, draw the following scatterplots, a) Sepal.Length vs Sepal.Width, b) Sepal.Length vs Petal.Length , and c) Sepal.Length vs Petal.Width . Annotate each scatterplot with a title. Use separate colors and plotting characters for each plot.

Exercise 2
Plot the three scatterplots in the same figure as subplots arranged in one row. Use mfrow.

Exercise 3
Plot the three scatterplots in the same figure as subplots arranged in one column. Use mfrow .

Exercise 4
Repeat the same scatterplots. Partition the figure in such a way that the first row contains plots a and b, and the second row contains plot c. Use mfrow.

Exercise 5
Repeat Exercise 2 with mfcol.

Exercise 6
Repeat Exercise 3 with mfcol.

Exercise 7
Repeat Exercise 4 with mfcol.

Exercise 8
Repeat Exercise 2 with layout.

Exercise 9
Repeat Exercise 3 with layout.

Exercise 10
Repeat Exercise 4 with layout. In this case, let scatterplot c occupy the second row completely.




Data Hacking with RDSTK (part 1)

RDSTK is a very versatile package. It includes functions to help you convert IP addresses to geolocations and derive statistics from them. It also allows you to input a body of text and derive its sentiment.

This package provides an R interface to Pete Warden’s Data Science Toolkit. See www.datasciencetoolkit.org for more information.

Answers to the exercises are available here. If you obtained a different (correct) answer than those listed on the solutions page, please feel free to post your answer as a comment on that page.

Exercise 1
Install and load the RDSTK package.

Exercise 2

Convert the IP address to coordinates. address="165.124.145.197". Store the results in the variable stat.

Exercise 3

Derive the elevation of that location using the latitude and longitude. Use the coordinates2statistics() function to achieve this. Once you get the elevation, store it as one of the features of stat.

Exercise 4

Derive the population_density of that location using the latitude and longitude. Use the coordinates2statistics() function to achieve this. Once you get the population density, store it as one of the features of stat, called pop_den.

Exercise 5

Great. You are getting the hang of it. Let us try getting the mean temperature of that location. You will notice that it returns a list of 12 numbers, one for each month.

Run this code and see for yourself:

coordinates2statistics(stat[3],stat[6],"mean_temperature")[1]

Exercise 6

We have to transform mean_temperature so that we can store it as one of the features in our stat dataset. One way to do this is to convert it from long to wide format, but that would be too redundant. Let's just find the mean temperature from January to December. You might find the sapply function useful for converting each element of the list to an integer.
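
As a minimal sketch of the sapply idea, the block below works on a made-up list of twelve monthly values stored as strings. The values and the list structure are assumptions for illustration only; the actual object returned by coordinates2statistics() may be shaped differently.

```r
# Hypothetical monthly values (January to December), stored as strings.
# This mock list only stands in for the API response.
monthly <- list("-4", "-2", "3", "9", "15", "20",
                "23", "22", "18", "11", "4", "-2")

# sapply() applies as.numeric to every element and simplifies to a vector.
monthly_num <- sapply(monthly, as.numeric)

# Mean over all twelve months.
annual_mean <- mean(monthly_num)
annual_mean  # 9.75
```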

Exercise 7

We have decided that we do not really need the January-December mean. We actually need the mean temperature from June to December. Make that adjustment to your last code and store the result back in stat under the name mean_temp.

Exercise 8

Okay, great. Now let's work with more IP address data. Here is a list of a few IP addresses scraped from commenters on my exercises.

 

list=c("165.124.145.197","31.24.74.155","79.129.19.173")
df=data.frame(list)
df[,1]=as.character(df[,1])

 

Exercise 9

Use an iterator such as apply to go through the list and derive the statistics for each address with the ip2coordinates() function. This is the first part. You may get a list-within-a-list result. Store it in a variable called data.

Exercise 10

Convert that list within a list into a data frame with 3 rows and all the columns derived from the ip2coordinates() function. You are free to use any method for this.
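
One possible approach, sketched here on mock data, is to turn each inner list into a one-row data frame and bind the rows together. The field names and values below are hypothetical placeholders (the addresses come from the reserved documentation range) and are not the real ip2coordinates() output.

```r
# Mock nested list standing in for the per-address results.
# Field names and values are illustrative assumptions only.
results <- list(
  list(ip = "192.0.2.1", latitude = 10.5, longitude = 20.25),
  list(ip = "192.0.2.2", latitude = -3.0, longitude = 45.75),
  list(ip = "192.0.2.3", latitude = 51.5, longitude = -0.5)
)

# Convert each inner list to a one-row data frame, then stack the rows.
rows <- lapply(results, function(x) as.data.frame(x, stringsAsFactors = FALSE))
combined <- do.call(rbind, rows)

dim(combined)  # 3 rows, 3 columns
```

This relies on every inner list having the same field names; if some fields are missing for an address, the rows would need to be aligned first.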




Data Science for Doctors – Part 1 : Data Display

Data science enhances people’s decision making. Doctors and researchers make critical decisions every day, so it is important for them to have some basic knowledge of data science. This series aims to help people working in and around the medical field to enhance their data science skills.

We will work with a health-related database, the famous “Pima Indians Diabetes Database”. It was generously donated by Vincent Sigillito from Johns Hopkins University. Please find further information regarding the dataset here.

This is the first part of the series; it covers data display.

Before proceeding, it might be helpful to look over the help pages for table, pie, geom_bar, coord_polar, barplot, stripchart, geom_jitter, density, geom_density, hist, geom_histogram, boxplot, geom_boxplot, qqnorm, qqline, geom_point, and plot.

You also may need to load the ggplot2 library.
install.packages('ggplot2')
library(ggplot2)

Please run the code below in order to load the data set and transform it into a proper data frame format:

url <- "https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data"
data <- read.table(url, fileEncoding="UTF-8", sep=",")
names <- c('preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class')
colnames(data) <- names

Answers to the exercises are available here.

If you obtained a different (correct) answer than those listed on the solutions page, please feel free to post your answer as a comment on that page.

Exercise 1

Create a frequency table of the class variable.

Exercise 2

class.fac <- factor(data[['class']],levels=c(0,1), labels= c("Negative","Positive"))

Create a pie chart of the class.fac variable.

Exercise 3

Create a bar plot for the age variable.

Exercise 4

Create a strip chart of mass against class.fac.

Exercise 5

Create a density plot for the preg variable.

Exercise 6

Create a histogram for the preg variable.

Exercise 7

Create a boxplot of age against class.fac.

Exercise 8

Create a normal QQ plot and a line which passes through the first and third quartiles.
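
As a hedged illustration on simulated data (not the exercise's variable), the sketch below draws a normal QQ plot with qqnorm and adds a reference line with qqline, which by default passes through the first and third quartiles.

```r
# Simulated, roughly normal sample; stands in for a real variable.
set.seed(42)
z <- rnorm(200, mean = 50, sd = 10)

# qqnorm() draws sample quantiles against theoretical normal quantiles
# and (invisibly) returns the plotted coordinates.
qq <- qqnorm(z, main = "Normal Q-Q plot (simulated data)")

# qqline() adds a line through the first and third quartiles by default.
qqline(z, col = "red")
```

Points that stay close to the line suggest the sample is approximately normal; systematic curvature at the ends points to skewness or heavy tails.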

Exercise 9

Create a scatter plot of the age variable against the mass variable.

Exercise 10

Create scatter plots of every variable in the data set against every other variable, all in a single window.
Hint: it is quite simple; don’t overthink it.