Working with air quality and meteorological data Exercises (Part-1)

Atmospheric air pollution is one of the most important environmental concerns in many countries around the world, and it is strongly affected by meteorological conditions. Accordingly, in this set of exercises we use the openair package to work with and analyze air quality and meteorological data. This package provides tools to import data directly from the air quality measurement networks across the UK, as well as tools to analyse the data and produce reports. In this exercise set we will import and analyze data from the MY1 station, located on Marylebone Road in London, UK.

Answers to the exercises are available here.

Please install and load the package openair before starting the exercises.

Exercise 1
Import the MY1 data for the year 2016 and save it into a dataframe called my1data.
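A minimal sketch of one way to do this, using openair's importAURN() function (the site code and year come from the exercise text):
library(openair)
my1data <- importAURN(site = "my1", year = 2016)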

Exercise 2
Get basic statistical summaries of the my1data dataframe.

Exercise 3
Calculate monthly means of:
a. pm10
b. pm2.5
c. nox
d. no
e. o3

You can use air quality data and weather patterns in combination with spatial data visualization. Learn more about spatial data in the online course
[Intermediate] Spatial Data Analysis with R, QGIS & More
. In this course you will learn how to:

  • Work with Spatial data and maps
  • Learn about different tools to develop spatial data next to R
  • And much more

Exercise 4
Calculate daily means of:
a. pm10
b. pm2.5
c. nox
d. no
e. o3

Exercise 5
Calculate daily maximums of:
a. nox
b. no
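For Exercises 3-5, one hedged approach is openair's timeAverage() function, which aggregates the hourly series to a coarser resolution; the column names follow the imported AURN data:
timeAverage(my1data, avg.time = "month")                   # monthly means
timeAverage(my1data, avg.time = "day")                     # daily means
timeAverage(my1data, avg.time = "day", statistic = "max")  # daily maximums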




Sending Emails from R Exercises

When monitoring a data source, model, or other automated process, it’s convenient to have a method for easily delivering performance metrics and notifying you whenever something is amiss. One option is to use a dashboard; however, this requires active time and effort to grab numbers and catch errors. An alternative approach is to send an email alert on the performance of the process. In this exercise set, we will explore the email approach using the mailR package.

Exercises in this section will be solved using the mailR package as well as basic HTML and CSS. It is recommended to take a look at the mailR documentation before continuing.

Answers to the exercises are available here.

Exercise 1
Let’s begin by sending “Hello World. This is my email!” as the body parameter from yourself to yourself.
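A minimal sketch with mailR's send.mail(); the SMTP host, port, and credentials below are placeholders that you need to replace with your own provider's settings:
library(mailR)
send.mail(from = "me@example.com",
          to = "me@example.com",
          body = "Hello World. This is my email!",
          smtp = list(host.name = "smtp.example.com", port = 587,
                      user.name = "me@example.com", passwd = "password", ssl = TRUE),
          authenticate = TRUE,
          send = TRUE)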

Exercise 2
By passing in a vector for the to parameter, you can send the email to multiple recipients. Send the above email to yourself and a friend.

Exercise 3
So far, your emails have had no subject. Send the email from Exercise 1 to yourself with “Email Testing” for the subject parameter.

Exercise 4
With this package, we can take full advantage of CSS when constructing the body of an email. Send the email from the previous exercise from yourself to yourself where “Hello World.” is now red and “This is my email!” is now blue.

Note: make sure that html = TRUE.
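One possible sketch of such a body, using inline CSS spans; smtp_settings is a placeholder standing for an SMTP list like the one in the Exercise 1 sketch:
body_html <- paste0('<span style="color:red;">Hello World.</span> ',
                    '<span style="color:blue;">This is my email!</span>')
send.mail(from = "me@example.com", to = "me@example.com",
          subject = "Email Testing", body = body_html, html = TRUE,
          smtp = smtp_settings, authenticate = TRUE, send = TRUE)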

Learn more about HTML functionality and web connection in the online course A complete journey to web analytics using R tool. In this course you will learn how to:

  • Perform a web-based analytics question from start to end
  • Learn how to import data from different online platforms such as Twitter
  • And much more

Exercise 5
If you write a complex email containing images, dynamic elements, etc. as an HTML file, then you can reference this file with the body parameter. Create an HTML file containing “Hello World. This is my email!” called my_email.html. Send this email to yourself.

Exercise 6
Using knitr, you can compile HTML files. Compile the default knitr document that uses the mtcars dataset to an HTML file and email this to yourself.

Exercise 7
Create a new R script called mailr_six.R containing your code from the above exercises and attach that to your email by referencing the file path to mailr_six.R in the attach.files parameter. Send this email from yourself to yourself.

Exercise 8
The attached R script above does not have a description or a name. Add these in the file.descriptions and file.names parameters, respectively. Send the resulting email to yourself.

Exercise 9
Just as with the recipients, you can attach multiple files, descriptions, and names by passing vectors to the respective parameters. Create a new R script called mailr_eight.R containing your code from the above exercises and attach both mailr_six.R and mailr_eight.R to your email. Send the resulting email to yourself.

Exercise 10
Create a new R script where a random integer called important_number is generated. If important_number is even, then send an email to yourself notifying you that important_number is even.
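A minimal sketch of the alert logic; the email call reuses the placeholder smtp_settings from the earlier sketches, and the way the random integer is generated is an assumption:
important_number <- sample(1:100, 1)   # one random integer
if (important_number %% 2 == 0) {
  send.mail(from = "me@example.com", to = "me@example.com",
            subject = "important_number is even",
            body = paste("important_number is even:", important_number),
            smtp = smtp_settings, authenticate = TRUE, send = TRUE)
}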




Hacking statistics or: How I Learned to Stop Worrying About Calculus and Love Stats Exercises (Part-6)

Statistics are often taught in school by and for people who like mathematics. As a consequence, in those classes the emphasis is put on learning equations, solving calculus problems and creating mathematical models instead of building an intuition for probabilistic problems. But, if you are reading this, you know a bit of R programming and have access to a computer that is really good at computing stuff! So let’s learn how we can tackle useful statistical problems by writing simple R queries and how to think in probabilistic terms.

In the previous set, we’ve seen how to compute probabilities based on certain density distributions, how to simulate situations to compute their probability, and how to use that knowledge to make decisions in obvious situations. But what is a probability? Is there a more scientific way to make those decisions? What is the p-value xkcd keeps talking about? In this exercise set, we will learn the answers to most of those questions and more!

One simple definition of the probability that an event will occur is that it’s the frequency of the observations of this event in a data set divided by the total number of observations in this set. For example, if you have a survey where 2 respondents out of 816 say that they are interested in a potential partner only if they are dressed in an animal costume, you can say that the probability that someone in the population is a furry is about 2/816, or 1/408, or 0.00245…, or 0.245%.

Answers to the exercises are available here.

Exercise 1
The average height of males in the USA is about 5 foot 9 inches, with a standard deviation of 2.94 inches. If this measure follows a normal distribution, write a function that takes a sample size as input and computes the probability of observing a subject taller than 5 foot 8 and smaller than 5 foot 9 in a sample of that size. Then set the seed to 42 and compute the probability for a sample size of 200.
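A minimal simulation sketch, assuming heights are measured in inches (5 foot 9 is 69 inches, 5 foot 8 is 68 inches); the function name height_prob is hypothetical:
height_prob <- function(n) {
  heights <- rnorm(n, mean = 69, sd = 2.94)  # simulate n heights
  mean(heights > 68 & heights < 69)          # proportion between 5'8" and 5'9"
}
set.seed(42)
height_prob(200)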

Exercise 2
We can deduce a lot from that definition. First, the probability is always a fraction, but since we are usually not used to big numbers and have a hard time doing division in our head, 3968/17849 is not a really useful way to state a probability. In consequence, we will usually use a percentage or a real number between 0 and 1 to represent a probability. Why 0 and 1? If an event is not present in the data set, its frequency is 0, so whatever the total number of observations, its probability is 0; and if all the observations are of this event, the fraction is going to be equal to 1. Also, if you think about the example of the furries in the survey, maybe you think that there’s a chance that there are only two furries in the entire population and they both took the survey, so the probability that an individual is a furry is in reality a lot lower than 0.245%. Or maybe there are a lot more furries in the population and only two were surveyed, which would make the real probability much higher. You are right, keen reader! In a survey, we estimate the real probability, and we can never tell the real probability from a small sample (that’s why, if you are against the national survey in your country, all the statisticians hate you in silence). However, the larger the sample size of a survey, the less often those rare occurrences happen.

  1. Compute the probability that an American male is taller than 5 foot 8 and smaller than 5 foot 9 with the pnorm function.
  2. Write a function that draws a sample of subjects from this distribution, computes the probability of observing a male of this height and computes the percentage difference between that estimate and the real value. Make sure that you can repeat this process for all sample sizes between two values.
  3. Use this function to draw samples of size 1 to 10000 and store the results in a matrix.
  4. Plot the difference between the estimation of the probability and the real value.

This plot shows that the bigger the sample size, the smaller the estimation error, but the difference in error between a sample of size 1000 and one of size 10000 is quite small.

Learn more about probability functions in the online course Statistics with R – Advanced Level. In this course you will learn how to:

  • Work with different binomial and logistic regression techniques
  • Know how to compare regression models and choose the right fit
  • And much more

Exercise 3
We have already seen that density probability can be used to compute probability, but how?

For a standard normal distribution:

  1. Compute the probability that x is smaller or equal to zero, then plot the distribution and draw a vertical line at 0.
  2. Compute the probability that x is greater than zero.
  3. Compute the probability that x is less than -0.25, then plot the distribution and draw a vertical line at -0.25.
  4. Compute the probability that x is smaller than zero and greater than -0.25.

Yeah, the area under the curve of a density function between two points is equal to the probability that an event takes a value in this interval. That’s why densities are really useful: they let us compute the probability of an event by doing calculus. Often we will use the cumulative distribution function (cdf), which is the antiderivative of the density function, to compute directly the probability of an event on an interval. The function pnorm(), for example, computes the value of the cdf between minus infinity and a value x. Note that a cdf returns the probability that a random variable takes a value smaller than or equal to x.
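For example, this is how pnorm() could be used to answer questions like those in Exercise 3 for a standard normal distribution:
pnorm(0)                  # P(X <= 0), i.e. 0.5
1 - pnorm(0)              # P(X > 0)
pnorm(-0.25)              # P(X <= -0.25)
pnorm(0) - pnorm(-0.25)   # P(-0.25 < X <= 0)
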
Exercise 4
For a standard normal distribution, find the values x such that:

  1. 99% of the observations are smaller than x.
  2. 97.5% of the observations are smaller than x.
  3. 95% of the observations are smaller than x.
  4. 99% of the observations are greater than x.
  5. 97.5% of the observations are greater than x.
  6. 95% of the observations are greater than x.
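The qnorm() function inverts the cdf, so a sketch of the kind of call involved here is:
qnorm(0.99)    # x such that 99% of observations are smaller than x
qnorm(0.95)    # x such that 95% of observations are smaller than x
qnorm(0.01)    # equivalently, 99% of observations are greater than this x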

Exercise 5
Since probabilities are often estimated, it is useful to measure how good the estimation is and to report that measure along with the estimation. That’s why you often hear survey results reported in the form of “x% of the population, with a y% margin of error, 19 times out of 20”. In practice, the size of the survey and the variance of the results are the two most important factors that influence the estimation of a probability. Simulation and bootstrap methods are great ways to find the margin of error of an estimation.

Load this dataset and use bootstrapping to compute the interval that has 95% (19/20) chance to contain the real probability of getting a value between 5 and 10. What is the margin of error of this estimation?

This process can be applied to any statistic that is estimated, like a mean, a proportion, etc.

When doing estimation, we can use a statistical test to draw conclusions about our estimation and eventually make decisions based on it. For example, if in a survey we estimate that the average number of miles traveled by car each week by Americans is 361.47, we could be interested in knowing whether the real average is bigger than 360. To do so, we could start by formulating a null and an alternative hypothesis to test. In our scenario, the null hypothesis would be that the mean is equal to or less than 360. We will follow the steps of the test and, if at the end we cannot support this hypothesis, we will conclude that the alternative hypothesis is probably true. In our scenario that hypothesis would be that the mean is bigger than 360.

Then we choose the percentage of the time we could afford to be wrong. This value determines the range of possible values for which we will accept the null hypothesis and is called the significance level (α).

Then we can use a mathematical formula or a bootstrap method to estimate the probability that a sample from this population would produce an estimate as extreme as 361.47. If this probability is less than the significance level, we reject the null hypothesis and go with the alternative hypothesis. If not, we cannot reject the null hypothesis.

So basically, what we do is look at how often our estimate should occur if the null hypothesis is true and, if it’s rare enough for our taste (the significance level), we conclude that it’s not a random occurrence but a sign that the null hypothesis is false.
Exercise 6
This dataset represents the survey of the situation above.

  1. Estimate the mean of this dataset.
  2. Use the bootstrap method to find 10000 estimations of the mean from this dataset.
  3. Find the value from this bootstrap sample that is bigger than 5% of all the other values. This value is called the critical value of the test and corresponds to α.
  4. From the data we have, should we conclude that the mean of the population is bigger than 360? What is the significance level of this test?
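A minimal bootstrap sketch, assuming the survey values are loaded into a numeric vector x (the seed is an arbitrary choice for reproducibility):
set.seed(42)
boot_means <- replicate(10000, mean(sample(x, length(x), replace = TRUE)))
crit <- quantile(boot_means, probs = 0.05)   # critical value: bigger than 5% of the others
crit > 360                                   # TRUE would favour the alternative hypothesis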

Exercise 7
We can represent the test visually. Since we fail to reject the null hypothesis if the percentage of bootstrapped means smaller than 360 is bigger than 5%, we can simply look at where the fifth percentile lies on the histogram of the bootstrapped means. If it’s to the left of the 360 value, we know that more than 5% of bootstrapped means are smaller than 360 and we don’t reject the null hypothesis.

Draw the histogram of the bootstrapped mean and draw two vertical lines: one at 360 and one at the fifth percentile.
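A sketch of that plot, reusing the boot_means vector from the previous sketch:
hist(boot_means, breaks = 50, main = "Bootstrapped means")
abline(v = 360, col = "red")                          # reference value
abline(v = quantile(boot_means, 0.05), col = "blue")  # fifth percentile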

Exercise 8
There are two ways a mean can be unequal to a value: when the mean is bigger than the value and when it’s smaller than the value. So if we want to test the equality of the mean to a specific value, we must verify whether most of our estimations lie around this value or whether a lot of them are far from it. To do so, we create an interval whose endpoints are our estimated mean and another point that is at the same distance from this value as the mean. Then we can compute the probability of getting an estimation outside this interval. This way, we test that the estimate is neither bigger nor smaller than the value, at the 1-α confidence level.

Here are the steps to test the hypothesis that the mean of the dataset of Exercise 6 is equal to 363:

  1. To simulate that our distribution has a mean of 363, shift the dataset so that this value becomes the mean.
  2. Generate 10000 bootstrapped means from this distribution.
  3. Compute the endpoints of the test interval.
  4. Compute the probability that the mean is outside this interval.
  5. What conclusion can we make with a α of 5%?
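A hedged sketch of these steps, again assuming the data is in a numeric vector x:
x_shifted  <- x - mean(x) + 363              # step 1: shift so the mean becomes 363
boot_shift <- replicate(10000, mean(sample(x_shifted, length(x_shifted), replace = TRUE)))
d <- abs(mean(x) - 363)                      # distance between our estimate and 363
p_out <- mean(boot_shift <= 363 - d | boot_shift >= 363 + d)  # step 4: probability outside the interval
p_out < 0.05                                 # TRUE means we reject equality at α = 5%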

Exercise 9
Repeat the steps of Exercise 8, but this time test whether the mean is smaller than 363.

This shows that a one-sided test is more powerful than a two-sided test in this situation, since there’s less wiggle room between the reference value and the critical region of the test. So if you have prior knowledge that could make you believe that an estimate is bigger or smaller than a value, testing for that would give you more assurance of the validity of your results.

Exercise 10
The p-value of a test is the probability of observing an estimate like the one we made if the null hypothesis is true. This value is often used in scientific reports since it’s a concise way to express statistical findings. If we know the p-value of a test and the significance level α, we can deduce the result of the test, since the null hypothesis is rejected when p<α. In other words: you have been using the p-value all this time to draw conclusions!

Load the dataset of Exercise 5 and compute the p-value associated with the test that the mean is equal to 13, with α equal to 5%.




ggvis Exercises (Part-1)

INTRODUCTION

The ggvis package is used to make interactive data visualizations. The fact that it combines shiny’s reactive programming model and dplyr’s grammar of data transformation makes it a useful tool for data scientists.

This package allows us to implement features like interactivity, but on the other hand every interactive ggvis plot must be connected to a running R session.

Before proceeding, please follow our short tutorial.

Look at the examples given and try to understand the logic behind them. Then try to solve the exercises below using R and without looking at the answers. Then check the solutions to check your answers.

Exercise 1

Create a list which will include the variables “Horsepower” and “MPG.city” of the “Cars93” data set. HINT: Use ggvis().

Exercise 2

Use the list you just created to make a scatterplot. HINT: Use layer_points().

Exercise 3

Use %>% to create the scatterplot of Exercise 2.
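A minimal sketch of Exercises 1-3 combined; Cars93 ships with the MASS package:
library(MASS)
library(ggvis)
Cars93 %>%
ggvis(~Horsepower, ~MPG.city) %>%
layer_points()

For Exercises 4-7, a variable is mapped to a property with =, for example stroke = ~Cylinders inside ggvis() or layer_points().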

Learn more about using ggvis in the online course R: Complete Data Visualization Solutions. In this course you will learn how to:

  • Work extensively with the ggvis package and its functionality
  • Learn what visualizations exist for your specific use case
  • And much more

Exercise 4

Use the list you created in Exercise 1 to create a scatterplot and use “Cylinders” as stroke.

Exercise 5

Use the list you created in Exercise 1 to create a scatterplot and use “Cylinders” as fill.

Exercise 6

Use the list you created in Exercise 1 to create a scatterplot and use “EngineSize” as size.

Exercise 7

Use the list you created in Exercise 1 to create a scatterplot and use “Cylinders” as shape.

Exercise 8

Use the list you created in Exercise 1 to create a scatterplot with red color and black stroke.
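In ggvis, fixed properties are set with := rather than mapped with =, so one possible sketch for this exercise is:
Cars93 %>%
ggvis(~Horsepower, ~MPG.city) %>%
layer_points(fill := "red", stroke := "black")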

Exercise 9

Use the list you created in Exercise 1 to create a scatterplot with size set to 300 and opacity set to 0.5.

Exercise 10

Use the list you created in Exercise 1 to create a scatterplot with cross as shape.




More string Hacking with Regex and Rebus

For a beginner in R, or any other language, regular expressions might seem like a daunting task. The rebus package in R lowers the barrier for common regular expression tasks and is useful for beginners, and even for advanced users, for most common regex work in a more intuitive yet verbose way. Check out the package and try these exercises to test your knowledge.
Load stringr (or stringi) as well for this set of exercises. I encourage you to do this and this before working on this set.
Answers are available here.

If you obtained a different (correct) answer than those listed on the solutions page, please feel free to post your answer as a comment on that page.

Exercise 1
Suppose you have a vector
x <- c("stringer","stringi","rebus","redbus")

Use rebus to find the strings starting with st. Hint: use START from rebus.
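A minimal sketch with rebus and stringr (START %R% "st" builds the regex "^st"):
library(rebus)
library(stringr)
x <- c("stringer","stringi","rebus","redbus")
str_subset(x, START %R% "st")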

Exercise 2

Use the same string vector and find the strings which end with bus.

Exercise 3
You have a vector like
m <- c("aba","aca","abba","accdda")

Find the strings which start and end with a and have a single character in between.
Hint – use ANY_CHAR
Exercise 4
y <- c("brain","brawn","rain","train")

Find all the strings that start with br and end with n.
Hint – use any_char with hi=Inf to build the regex

Learn more about text analysis in the online course Text Analytics/Text Mining Using R. In this course you will learn how to create, analyse and finally visualize your text-based data source. Having all the steps easily outlined will be a great reference source for future work.

Exercise 5
Use the same vector as the previous exercise and find strings starting with br or tr.
Hint – use or()

Exercise 6
Now we turn our attention to character classes. If you are familiar with character classes in regex, you will find them pretty easy with rebus; and if you are just starting with regex, you might find them easier to remember with rebus.
Suppose you have a vector
l <- c("Canada","america","france")

Find the strings with C or m in them, so your answer should be Canada and america.

Exercise 7
From the string 123abc, find the digits, using rebus.

Exercise 8
Create a character class for vowels and find all the vowels in the vector
vow <- c("blue","sue","CLUE","TRUE")

Exercise 9
Find the characters other than vowels in the above vector.

Exercise 10
Now create a new vector
vow1 <- c("blue","sue","CLUE","TRUE","aue")

Find the string which is made up of only vowels.




Soccer data sparring: Scraping, merging and analyzing exercises

While understanding and spending time improving specific techniques, and strengthening individual muscles, is important, occasionally it is necessary to do some rounds of actual sparring to see your flow and spot weaknesses. This exercise set forces you to use all that you have practiced: scraping links, downloading data, using regular expressions, merging data and then analyzing it.

We will download data from the website football-data.co.uk, which has data on the results of some football/soccer leagues and the odds quoted by bookmakers where you can bet on the results.

Answers are available here.

Exercise 1

Use R to scan the German section on football-data.co.uk for any links and save them in a character vector called all_links. There are many ways to accomplish this.
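A minimal sketch using the rvest package; the exact URL of the German section is an assumption you should verify on the site:
library(rvest)
page <- read_html("http://www.football-data.co.uk/germanym.php")
all_links <- html_attr(html_nodes(page, "a"), "href")
head(all_links)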

Exercise 2

Among the links you found there should be a number pointing to comma-separated values files with data on Bundesliga 1 and 2, separated by season. Now update the all_links vector so that only links to csv files remain. Use regular expressions.

Learn more about Data Pre-Processing in the online course R Data Pre-Processing & Data Management – Shape your Data!. In this course you will learn how to:

  • import data into R in several ways while also being able to identify a suitable import tool
  • use SQL code within R
  • And much more

Exercise 3

Again, update all_links so that only links to csv tables ‘from Bundesliga 1 from Season 1993/1994 to 2013/2014 inclusive’ remain.

Exercise 4

Import into a list in your workspace all the 21 remaining csv files in all_links, each one as a data.frame. Use read.csv, with the url and na.strings = c("", "NA"). Note that you might need to add a prefix to them so the links are complete.

Exercise 5

Take the list and generate one big data.frame with all the data.frames previously imported. One way to do this is using the rbind.fill function from a well-known package. Name the new data.frame bundesl.

Exercise 6

Take a good look at the new dataset. Our read.csv did not work perfectly on this data: it turns out that there are some empty rows and empty columns; identify and count them. Update bundesl so it no longer has empty rows nor columns.

Exercise 7

Format the Date column so that R understands it as a date, using as.Date().

Exercise 8

Remove all columns which are not 100% complete, and the variable Div as well.

Exercise 9

Which are the top 3 teams in terms of number of wins in Bundesliga 1 for our period? You are free to use base-R functions or any package. Be warned that this task is not as simple as it seems, due to the nature of the data and small inconsistencies in it.

Exercise 10

Which team has held the longest winning streak in our data?




Data visualization with googleVis exercises part 10

Timeline, Merging & Flash charts

This is part 10 of our series and we are going to explore the features of some interesting types of charts that googleVis provides, like timeline and Flash charts, and learn how to merge two googleVis charts into one.

Read the examples below to understand the logic of what we are going to do and then test your skills with the exercise set we prepared for you. Let’s begin!

Answers to the exercises are available here.

Package & Data frame

As you already know, the first thing you have to do is install and load the googleVis package with:
install.packages("googleVis")
library(googleVis)

Secondly we will create an experimental data frame which will be used for our charts’ plotting. You can create it with:
datTLc <- data.frame(Position=c(rep("President", 3), rep("Vice", 3)),
Name=c("Washington", "Adams", "Jefferson",
"Adams", "Jefferson", "Burr"),
start=as.Date(x=rep(c("1789-03-29", "1797-02-03",
"1801-02-03"),2)),
end=as.Date(x=rep(c("1797-02-03", "1801-02-03",
"1809-02-03"),2)))

You can explore the “datTLc” data frame with head().

NOTE: The charts are created locally by your browser. In case they are not displayed at once press F5 to reload the page. All charts require an Internet connection.

Timeline Chart

It is quite simple to create a timeline chart with googleVis. We will use the “datTLc” data frame we just created.
Look at the example below to create a simple timeline chart:
TLC <- gvisTimeline(data=datTLc)
plot(TLC)

Exercise 1

Create a list named “TLC” and pass to it the “datTLc” data frame as a timeline chart. HINT: Use gvisTimeline().

Exercise 2

Plot the timeline chart. HINT: Use plot().

You can select the variables you want as rows and columns with:
TLC <- gvisTimeline(data=dataframe,
rowlabel="var1",
barlabel="var2",
start="var3",
end="var4")
plot(TLC)

Exercise 3

Put “Name” as rowlabel, “Position” as barlabel, “start” as start, “end” as end and plot the chart.

Options

You can group your chart by row or not with:
options=list(timeline="{groupByRowLabel:true}")

Learn more about using GoogleVis in the online course Mastering in Visualization with R programming. In this course you will learn how to:

  • Work extensively with the GoogleVis package and its functionality
  • Learn what visualizations exist for your specific use case
  • And much more

Exercise 4

Group your timeline chart NOT by rowlabel and plot it.

You can set the colours and size of your chart with:
TLC <- gvisTimeline(data=datTLc,
options=list(timeline="{groupByRowLabel:false}",
backgroundColor='yellow',
height=300,
colors="['blue', 'brown']"))
plot(TLC)

Exercise 5

Set the background color of your chart to white, the “Position” colours to red and green respectively, the height to 400 and plot it.

Merging charts

We will now see how to merge two charts into one. For this purpose we are going to use a Geo chart and a Table, which we saw in parts 6 & 7 respectively.

Geo <- gvisGeoChart(Exports, "Country", "Profit",
options=list(width=400, height=400))
Table <- gvisTable(Exports,
options=list(width=320, height=400))

After you create these two charts you can merge them with:
GeoTable <- gvisMerge(Geo,Table, horizontal=TRUE)
plot(GeoTable)

Exercise 6

Create a Geo chart and Table like the example above and merge them. HINT: Use gvisMerge().

Flash charts

All the following charts require a Flash player.

Motion chart

The most exciting type of chart that googleVis provides, in my opinion, is the motion chart. It is quite simple to create a motion chart with googleVis. We will use the “Fruits” data set for this example. You can see the variables of your data set with head().
Look at the example below to create a simple motion chart:

MotionC=gvisMotionChart(Fruits,
idvar = "Fruit",
timevar = "Year"
)
plot(MotionC)

Exercise 7

Create a list named “MotionC” and pass to it the “Fruits” data set as a motion chart. HINT: Use gvisMotionChart().

Exercise 8

Plot the motion chart. HINT: Use plot().

As you saw, the variables were set automatically, but you can set them as you want with:
MotionC=gvisMotionChart(Fruits,
idvar = "Fruit",
timevar = "Year",
xvar = "Expenses",
yvar = "Sales",
sizevar ="Profit",
colorvar = "Location")
plot(MotionC)

Exercise 9

Create a list named “MotionC” and pass to it the “Fruits” data set as a motion chart. You can use the example above or you can use the variables differently to see the differences. HINT: Use gvisMotionChart().

Exercise 10

Plot the motion chart. HINT: Use plot().




R Markdown exercises part 2

INTRODUCTION

R Markdown is one of the most popular data science tools and is used to save and execute code and to create exceptional reports which are easily shareable.

The documents that R Markdown provides are fully reproducible and support a wide variety of static and dynamic output formats.

It uses markdown syntax, which provides an easy way of creating documents that can be converted to many other file types, while embedding R code in the report, so it is not necessary to keep the report and the R script separately. Furthermore, the report is written as normal text, so knowledge of HTML is not required. Of course, no additional files are needed because everything is incorporated in the HTML file.

Before proceeding, please follow our short tutorial.

Look at the examples given and try to understand the logic behind them. Then try to solve the exercises below using R and without looking at the answers. Then check the solutions to check your answers.

Exercise 1

Make a table out of the object dataframe you created and set its numbers to have one significant figure. HINT: Use kable().
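A minimal sketch, assuming the data frame built earlier in the report is stored in an object called dataframe; the digits argument of knitr's kable() rounds numeric columns, here to one digit:
knitr::kable(dataframe, digits = 1)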

Exercise 2

Use bold text for your report’s title. HINT: Use ** **.

Exercise 3

Use Italic text for the author’s name. HINT: Use * *.

Learn more about reporting your results in the online course: R for Data Science Solutions. In this course you will learn how to:

  • Build a complete workflow in R for your data science problem
  • Get in-depth knowledge on how to report your results in an interactive way
  • And much more

Exercise 4

Add “Summary” as a header of size 1 above your summary content.

Exercise 5

Add “Plot”, “Dataframe” and “Table 1” as headers of size 3 above the other three objects of your report, respectively.

Exercise 6

Manually create a small table for your dataframe.

Exercise 7

Apply right alignment to the column “B”.

Exercise 8

Create an unordered list of the contents of column “A” of your dataframe.

Exercise 9

Transform the list you just created into an ordered list.

Exercise 10

Add a link named “Link” that leads to “www.r-exercises.com”.




Parallel Computing Exercises: Snow and Rmpi (Part-3)


The foreach statement, which was introduced in the previous set of exercises of this series, can work with various parallel backends. This set lets you train in working with backends provided by the snow and Rmpi packages (on a single machine with multiple CPUs). The name of the former package stands for “Simple Network of Workstations”. It can employ various parallelization techniques; socket clustering is used here. The latter package is an R wrapper for MPI (Message-Passing Interface), which is another parallelization technique.
The set also demonstrates that inter-process communication overhead has to be taken into account when preparing to use parallelization. If short tasks are run in parallel the overhead can offset the gains in performance from using multiple CPUs, and in some cases execution can get even slower. For parallelization to be useful, tasks that are run in parallel have to be long enough.
The exercises are based on an example of using bootstrapping to estimate the sampling distribution of linear regression coefficients. The regression is run multiple times on different sets of data derived from an original sample. The size of each derived data set is equal to the size of the original sample, which is possible because the sets are produced by random sampling with replacement. The original sample is taken from the InstEval data set, which comes with the lme4 package, and represents lecture/instructor evaluations by students at the ETH. The estimated distribution is not analyzed in the exercises.
The exercises require the packages foreach, snow, doSNOW, Rmpi, and doMPI to be installed.
IMPORTANT NOTE: the Rmpi package depends on an MPI software, which has to be installed on the machine separately. The software can be the following:

  • Windows: either the Microsoft MPI, or Open MPI library (the former one can be installed as an ordinary application).
  • OS X/macOS: the Open MPI library (available through Homebrew).
  • Linux: the Open MPI library (look for packages named libopenmpi (or openmpi, lib64openmpi, or similar), as well as libopenmpi-dev (or libopenmpi-devel, or similar) in your distribution’s repository).

The zipped data set can be downloaded here. For other parts of the series follow the tag parallel computing.
Answers to the exercises are available here.

Exercise 1
Load the data set, and assign it to the data_large variable.

Exercise 2
Create a smaller data set that will be used to compare how parallel computing performance depends on the size of the task. Use the sample function to obtain a random subset from the loaded data. Its size has to be 10% of the size of the original dataset (in terms of rows). Assign the subset to the data_small variable.
For reproducibility, set the seed to 1234.
Print the number of rows in the data_large and data_small data sets.

Exercise 3
Write a function that will be used as a task in parallel computing. The function has to take a data set as an input, and do the following:

  1. Resample the data, i.e. obtain a new sample of data based on the input data set. The number of rows in the new sample has to be equal to the one in the input data set (use the sample function as in the previous exercise, but change parameters to allow for resampling with replacement).
  2. Run a linear regression on the resampled data. Use y as the dependent variable, and the others as independent variables (this can be done by using the formula y ~ . as an argument to the lm function).
  3. Return a vector of coefficients of the linear regression.
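A minimal sketch of such a task function; the name boot_coef is hypothetical:
boot_coef <- function(df) {
  rows <- sample(nrow(df), size = nrow(df), replace = TRUE)  # resample rows with replacement
  fit <- lm(y ~ ., data = df[rows, ])                        # regression on the resampled data
  coef(fit)                                                  # return the coefficient vector
}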

Learn more about optimizing your workflow in the online course Getting Started with R for Data Science. In this course you will learn how to:

  • efficiently organize your workflow to get the best performance of your entire project
  • get a full introduction to using R for a data science project
  • And much more

Exercise 4
Let’s test how much time it takes to run the task multiple times sequentially (not in parallel). Use the foreach statement with the %do% operator (as discussed in the previous set of exercises of this series) to run it:

  • 10 times with the data_large data set, and
  • 100 times with the data_small data set.

Use the rbind function as an argument to foreach to combine the results.
In both cases, measure how much time is spent on execution of the task (with the system.time function). Theoretically, the length of time spent should be roughly the same because the total number of rows processed is equal (it is 100,000 rows: 10,000 rows 10 times in the first case, and 1,000 rows 100 times in the second case), and the row length is the same. But is this the case in practice?

Exercise 5
Now we’ll prepare to run the task in parallel using 2 CPU cores. First, we’ll use a parallel computing backend for the foreach statement from the snow package. This requires two steps:

  1. Make a cluster for parallel execution using the makeCluster function from the snow package. Pass two arguments to this function: the size of the cluster (i.e. the number of CPU cores that will be used in computations), and the type of the cluster ("SOCK" in this case).
  2. Register the cluster with the registerDoSNOW function from the doSNOW package (which provides a foreach parallel adapter for the 'snow' package).

Exercise 6
Run the task 10 times with the large data set in parallel using the foreach statement with the %dopar% operator (as discussed in the previous set of exercises of this series). Measure the time spent on execution with the system.time function.
When done, use the stopCluster function from the snow package to stop the cluster.
Is the execution time smaller compared to the one measured in Exercise 4?
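A hedged sketch of the snow-backed run described in Exercises 5 and 6, reusing the boot_coef task function sketched after Exercise 3:
library(foreach)
library(snow)
library(doSNOW)
cl <- makeCluster(2, type = "SOCK")   # two-worker socket cluster
registerDoSNOW(cl)                    # register it as the foreach backend
system.time(
  res <- foreach(i = 1:10, .combine = rbind) %dopar% boot_coef(data_large)
)
stopCluster(cl)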

Exercise 7
Repeat the steps listed in Exercise 5 and Exercise 6 to run the task 100 times using the small data set.
What is the change in the execution time?

Exercise 8
Next, we’ll use another parallel backend for the foreach function: the one provided by the Rmpi package (R’s wrapper for the Message-Passing Interface), accessible through an adapter from the doMPI package. From the user’s perspective, it differs from the snow-based backend in the following ways:

  • as mentioned above, additional software has to be installed for this backend to work (either (a) the Open MPI library, available for Windows, macOS, and Linux, or (b) the Microsoft MPI library, which is available for Windows),
  • when an mpi cluster is created, it immediately starts using CPUs as much as it can,
  • when the work is complete, the mpi execution environment has to be terminated; once terminated, it can’t be relaunched without restarting the R session (if you try to create an mpi cluster after the environment was terminated, the session will be aborted, which may result in a loss of data; see Exercise 10 for more details).

In this exercise, we’ll create an mpi execution environment to run the task using 2 CPU cores. This requires actions similar to the ones performed in Exercise 5:

  1. Make a cluster for parallel execution using the startMPIcluster function from the doMPI package. This function can take just one argument, which is the number of CPU cores to be used in computations.
  2. Register the cluster with the registerDoMPI function from the doMPI package.

After creating a cluster, you may check whether the CPU usage on your machine increased using Resource Monitor (Windows), Activity Monitor (macOS), top or htop commands (Linux), or other tools.

Exercise 9
Stop the cluster created in the previous exercise with the closeCluster command from the doMPI package. The CPU usage should fall immediately.

Exercise 10
Create an mpi cluster again, and use it as a backend for the foreach statement to run the task defined above:

  • 10 times with the data_large data set, and
  • 100 times with the data_small data set.

In both cases, start a cluster before running the task, and stop it afterwards. Measure how much time is spent on execution of the task. How does the time compare to the execution time with the snow cluster (found in Exercises 6 and 7)?
When done working with the clusters, terminate the mpi execution environment with the mpi.finalize function. Note that this function always returns 1.
Important! As mentioned above, if you intend to create an mpi cluster again after the environment was terminated you have to restart the R session, otherwise the current session will be aborted, which may result in a loss of data. In RStudio, an R session can be relaunched from the Session menu (relaunching the session this way does not affect the data, you’ll only need to reload libraries). In other cases, you may have to quit and restart R.




Data wrangling : Transforming (3/3)


Data wrangling is a task of great importance in data analysis. Data wrangling is the process of importing, cleaning and transforming raw data into actionable information for analysis. It is a time-consuming process which is estimated to take about 60-80% of an analyst’s time. In this series we will go through this process. It will be a brief series with the goal of sharpening the reader’s skills on the data wrangling task. This is the third part of the series and it aims to cover transforming the data used. This can include filtering, summarizing, and ordering your data by different means. It also includes combining various data sets, creating new variables, and many other manipulation tasks. In this post, we will go through a few more advanced transformation tasks on the mtcars data set, in particular table manipulation.

Before proceeding, it might be helpful to look over the help pages for inner_join, full_join, left_join, right_join, semi_join, anti_join, intersect, union, setdiff and bind_rows.

Moreover, please load the following library and run the following link.
install.packages("dplyr")
library(dplyr)

Answers to the exercises are available here.

If you obtained a different (correct) answer than those listed on the solutions page, please feel free to post your answer as a comment on that page.

Exercise 1

Create a new object named car_inner containing the observations that have matching values in both tables mtcars and cars_table using as key the variable ID.
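A minimal sketch with dplyr, assuming cars_table was built in the earlier parts of the series and that both tables carry an ID column:
car_inner <- inner_join(mtcars, cars_table, by = "ID")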

Exercise 2

Create a new object named car_left containing all the observations from the left table (mtcars), and the matched records from the right table (cars_table) using as key the variable ID.

Learn more about Data Pre-Processing in the online course R Data Pre-Processing & Data Management – Shape your Data!. In this course you will learn how to:

  • Work with popular libraries such as dplyr
  • Learn about methods such as pipelines
  • And much more

Exercise 3

Create a new object named car_right containing all the observations from the right table (cars_table), and the matched records from the left table (mtcars), using as key the variable ID.

Exercise 4

Create a new object named car_full containing all the observations, whether or not there is a match between the left (cars_table) and right (mtcars) tables, using as key the variable ID.

Exercise 5

Create a new object named car_semi containing all the observations from mtcars where there are matching values in cars_table using as key the variable ID.

Exercise 6
Create a new object named car_anti containing all the observations from mtcars where there are no matching values in cars_table, using as key the variable ID.

Exercise 7

Create a new object named cars_inter which contains rows that appear in both tables mtcars and cars.

Exercise 8

Create a new object named cars_union which contains rows that appear in either of the tables mtcars and cars.

Exercise 9

Create a new object named cars_diff which contains rows that appear in table mtcars but not in cars.

Exercise 10

Append mtcars to cars and assign it to the object car_rows.