Working with the xlsx package Exercises (part 2)

This exercise set provides further practice in writing Excel documents with the xlsx package, as well as in importing and general data manipulation. In particular, it uses loops so you can practice scaling your code. A previous exercise set focused on writing a simple sheet with the same package; see here.

We will use a subset of commuting data from the Dublin area from AIRO and the 2011 Irish census.

Solutions are available here.

Exercise 1
Load the xlsx package. If necessary install it as indicated in the previous xlsx exercise set.

Exercise 2
Download the data to your computer and read it into your R workspace as commuting using read.xlsx2() or the slower alternative read.xlsx(). Use colClasses to set relevant classes, as we will be manipulating the data later on.
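A sketch of the import step, assuming the file was saved as commuting.xlsx in the working directory and the data sit in the first sheet (the colClasses vector below is illustrative; match it to the actual columns):

```r
library(xlsx)

# read.xlsx2() is faster than read.xlsx() for larger files; colClasses
# assigns a class per column so later manipulation works as expected.
commuting <- read.xlsx2("commuting.xlsx", sheetIndex = 1,
                        colClasses = c("character", rep("numeric", 5)))
str(commuting)
```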

Exercise 3
Clean the data a bit by removing 'Population_Aged_5_Over_By_' and 'To_Work_School_College_' from the column names.
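One way to do the cleaning is a single gsub() over the column names; the helper below just applies the pattern to example strings:

```r
# The pipe in the pattern matches either prefix; gsub() removes it
# wherever it occurs in a name.
clean_names <- function(x) {
  gsub("Population_Aged_5_Over_By_|To_Work_School_College_", "", x)
}

clean_names("Population_Aged_5_Over_By_Foot")   # "Foot"
clean_names("To_Work_School_College_Bicycle")   # "Bicycle"
```

On the real data you would apply it in place with names(commuting) <- clean_names(names(commuting)).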

Exercise 4
Sum the 'population aged 5 and over' variables by electoral division name using for instance aggregate() or data.table and save the result as commuting_ed.
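A sketch with the aggregate() formula interface on a toy data frame; Electoral_Division, Foot and Bicycle are stand-ins for the actual column names in the census file:

```r
toy <- data.frame(Electoral_Division = c("A", "A", "B"),
                  Foot    = c(10, 5, 7),
                  Bicycle = c(2, 3, 1))

# The dot on the left-hand side means "every other column"; each is
# summed within each electoral division.
commuting_ed <- aggregate(. ~ Electoral_Division, data = toy, FUN = sum)
commuting_ed
```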

Learn more about working with excel and R in the online course Learn By Example: Statistics and Data Science in R. In this course you will learn how to:

  • Learn some of the differences between working in Excel with regression modelling and R
  • Learn about different statistical concepts
  • And much more

Exercise 5
Create an xlsx workbook object in your R workspace and call it wb.

Exercise 6
Create three sheet objects in wb named sheet1, sheet2 and sheet3, both in the workbook and in your workspace. Use a loop.
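One way to write the loop, using assign() to drop each sheet object into the workspace under its own name (a sketch on a fresh workbook):

```r
library(xlsx)

wb <- createWorkbook()

# createSheet() adds the sheet to the workbook; assign() binds the
# returned sheet object to the same name in the global environment.
for (s in paste0("sheet", 1:3)) {
  assign(s, createSheet(wb, sheetName = s))
}

names(getSheets(wb))  # check the three sheets exist
```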

Exercise 7
Make a data.frame that lists the proportion of respondents in each of the following categories by electoral division: travel on foot, travel by bicycle, and leave home before 6:30.

Exercise 8
Using a loop, add the top 5 electoral divisions in each category, with their proportions, to the sheets created earlier. Leave the first row free.

Exercise 9
Add a fitting title to the first row of each sheet and apply some style to it.

Exercise 10
Save your workbook to your working directory and open using Excel. Go back to R and continue formatting and adding information to your workbook at will.




Volatility modelling in R exercises (Part-1)

Volatility modelling is typically used for high frequency financial data. Asset returns are typically uncorrelated while the variation of asset prices (volatility) tends to be correlated across time.
In this exercise set we will use the rugarch package (package description: here) to implement the ARCH (Autoregressive Conditional Heteroskedasticity) model in R.

Answers to the exercises are available here.

Exercise 1
Load the rugarch package and the dmbp dataset (Bollerslev, T. and Ghysels, E. (1996), "Periodic Autoregressive Conditional Heteroscedasticity", Journal of Business and Economic Statistics, 14, 139–151). The dataset contains daily logarithmic nominal returns for the Deutsche Mark/British Pound exchange rate, along with a dummy variable indicating non-trading days.

Exercise 2
Define the daily return as a time series variable and plot the return against time. Notice the unpredictability apparent from the graph.

Exercise 3
Plot the graph of the autocorrelation function of returns. Notice that there is hardly any evidence of autocorrelation of returns.

Exercise 4
Plot the graph of the autocorrelation function of squared returns. Notice the apparent strong serial correlation.

Learn more about Model Evaluation in the online course Regression Machine Learning with R. In this course you will learn how to:

  • Avoid model over-fitting using cross-validation for optimal parameter selection
  • Explore maximum margin methods such as best penalty of error term support vector machines with linear and non-linear kernels.
  • And much more

Exercise 5
We will first simulate and analyze an ARCH process. Use the ugarchspec function to define an ARCH(1) process. The return has a simple mean specification with mean = 0. The variance follows an AR(1) process with constant = 0.2 and AR(1) coefficient = 0.7.
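In rugarch's notation an ARCH(1) is a GARCH(1,0): one ARCH (alpha) term and no GARCH (beta) term. A sketch of the specification with the parameters fixed at the values above (dropping the mean term is one way to encode mean = 0):

```r
library(rugarch)

spec <- ugarchspec(
  variance.model = list(model = "sGARCH", garchOrder = c(1, 0)),
  mean.model     = list(armaOrder = c(0, 0), include.mean = FALSE),
  fixed.pars     = list(omega = 0.2, alpha1 = 0.7)  # constant and ARCH term
)
spec
```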

Exercise 6
Simulate the ARCH process for 500 time periods. Exercises 7 to 9 use this simulated data.

Exercise 7
Plot the returns vs time and note the apparent unpredictability. Plot the path of conditional sigma vs time and note that there is some persistence over time.

Exercise 8
Plot the ACF of returns and squared returns. Note that there is no autocorrelation between returns, but squared returns have significant first-order autocorrelation, as we specified in Exercise 5.

Exercise 9
Test for ARCH effects using the Ljung Box test for the simulated data.
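The test itself is base R's Box.test(); applied to squared returns, a small p-value signals ARCH effects. A self-contained sketch on placeholder white-noise returns (substitute the simulated series, and likewise the currency returns in the next exercise):

```r
set.seed(1)
ret <- rnorm(500)  # placeholder; use the actual return series here

# Ljung-Box test on the squared series: serial correlation in ret^2
# is evidence of conditional heteroskedasticity.
bt <- Box.test(ret^2, lag = 10, type = "Ljung-Box")
bt
```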

Exercise 10
Test for ARCH effects using the Ljung Box test for the currency returns data.




Data Visualization with googleVis exercises part 4

Adding Features to your Charts

We saw in the previous charts some basic and well-known types of charts that googleVis offers to users. Before continuing with other, more sophisticated charts in the next parts we are going to “dig a little deeper” and see some interesting features of those we already know.

Read the examples below to understand the logic of what we are going to do and then test your skills with the exercise set we prepared for you. Let's begin!

Answers to the exercises are available here.

Package & Data frame

As you already know, the first thing you have to do is install and load the googleVis package with:
install.packages("googleVis")
library(googleVis)

Secondly we will create an experimental data frame which will be used for our charts’ plotting. You can create it with:
df=data.frame(name=c("James", "Curry", "Harden"),
Pts=c(20,23,34),
Rbs=c(13,7,9))

NOTE: The charts are created locally by your browser. In case they are not displayed at once press F5 to reload the page.

Customizing Chart

We are going to use the two-axis Line Chart we created in part 1. This is the code we used, in case you forgot it:

LineC2 <- gvisLineChart(df, "name", c("Pts","Rbs"),
options=list(
series="[{targetAxisIndex: 0},
{targetAxisIndex:1}]",
vAxes="[{title:'Pts'}, {title:'Rbs'}]"
))
plot(LineC2)

Colours

To set the color of every line we can use:
series="[{color:'green', targetAxisIndex: 0,

Exercise 1

Change the colours of your line chart to green and yellow respectively and plot the chart.

Line Width

You can determine the line width of every line with:
series="[{color:'green',targetAxisIndex: 0, lineWidth: 3},

Exercise 2

Change the line width of your lines to 3 and 6 respectively and plot the chart.

Dashed lines

You can transform your lines to dashed with:
series="[{color:'green', targetAxisIndex: 0,
lineWidth: 1, lineDashStyle: [2, 2, 20, 2, 20, 2]},

There are many styles and colours available and you can find them here.

Learn more about using GoogleVis in the online course Mastering in Visualization with R programming. In this course you will learn how to:

  • Work extensively with the GoogleVis package and its functionality
  • Learn what visualizations exist for your specific use case
  • And much more

Exercise 3

Choose two different styles of dashed lines for every line of your chart from the link above and plot your chart.

Point Shape

With the pointShape option you can choose from a variety of shapes for your points.

We will use the scatter chart we built in part 3 to see how it works. Here is the code:
ScatterCD <- gvisScatterChart(cars,
options=list(
legend="none",
pointSize=3,lineWidth=2,
title="Cars", vAxis="{title:'speed'}",
hAxis="{title:'dist'}",
width=600, height=300))
plot(ScatterCD)
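The pointShape option slots into the same options list as the others. A sketch using 'diamond' (so as not to give away the exercises); other documented values include 'circle', 'triangle', 'square', 'star' and 'polygon':

```r
library(googleVis)

ScatterCD <- gvisScatterChart(cars,
               options = list(legend = "none",
                              pointSize = 3, lineWidth = 2,
                              pointShape = "diamond",
                              title = "Cars", vAxis = "{title:'speed'}",
                              hAxis = "{title:'dist'}",
                              width = 600, height = 300))
# plot(ScatterCD)  # opens the chart in the browser
```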

Exercise 4

Change the shape of your scatter chart’s points to ‘square’ and plot it. HINT: Use pointShape.

Exercise 5

Change the shape of your scatter chart’s points to ‘triangle’, their point size to 7 and plot it.

Edit Button

A really useful and easy feature that googleVis provides is the edit button which gives the user the ability to customize the chart in an automated way.
options=list(gvis.editor="Edit!"))

Exercise 6

Add an edit button in the scatter chart you just created. HINT: Use gvis.editor.

Chart with more options

Now let’s see how we can create a chart with many features that can enhance its appearance. We will use again the 2-axis line that we used before.
LineCD2 <- gvisLineChart(df, "name", c("Pts","Rbs"),
options=list(
series="[{color:'green',targetAxisIndex: 0, lineWidth: 3,
lineDashStyle: [14, 2, 2, 7]},
{color:'yellow',targetAxisIndex:1,lineWidth: 6,
lineDashStyle: [10, 2]}]",
vAxes="[{title:'Pts'}, {title:'Rbs'}]"
))
plot(LineCD2)

Background color

You can decide the background color of your chart with:
backgroundColor="red",

Exercise 7

Set the background color of your line chart to “lightblue” and plot it. HINT: Use backgroundColor.

Title

To give a title and decide its features you can use:
title="Title",
titleTextStyle="{color:'orange',
fontName:'Courier',
fontSize:14}",

Exercise 8

Give a title of your choice to the line chart and set its font to blue, Courier, size 16. HINT: Use titleTextStyle.

Curve Type & Legend

Another nice-looking choice that googleVis gives you is to display the lines as curves with:
curveType="function"

You can also move the legend of your chart to the bottom with:
legend="bottom"

Exercise 9

Smooth the lines of your line chart by setting the curveType option to function and move the legend to the bottom. HINT: Use curveType and legend.

Axes features

Finally you can “play” with your axes. This is an example:
vAxis="{gridlines:{color:'green', count:4}}",
hAxis="{title:'City', titleTextStyle:{color:'red'}}",
series="[{color:'yellow', targetAxisIndex: 0},
{color: 'brown',targetAxisIndex:1}]",
vAxes="[{title:'val1'}, {title:'val2'}]",

Exercise 10

Give the title “Name” to your hAxis and color it orange. Separate your vAxis with 3 red gridlines. HINT: Use titleTextStyle and gridlines.




Bonus: Improve Data Consistency With Vapply()

The vapply() function improves consistency by letting you pre-specify the type and length of the return value; it is also faster than sapply().

The usage of vapply():
vapply(X, FUN, FUN.VALUE, ..., USE.NAMES = TRUE)
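FUN.VALUE is what sets vapply() apart from sapply(): it declares the type and length of each result, and vapply() errors out if FUN returns anything else. A minimal illustration:

```r
# length() returns a single integer per element, so FUN.VALUE is
# integer(1); the list names are kept on the result.
lens <- vapply(list(a = 1:3, b = 1:5), length, FUN.VALUE = integer(1))
lens
# a b
# 3 5
```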

Answers to the exercises are available here.

Exercise 1

Beginning Level

The dataframe used for this exercise:
dataset1 <- data.frame(observationA = 16:8, observationB = c(20:19, 6:12))

Using vapply(), find the length of dataset1‘s observations.

Exercise 2

Beginning Level

Find the mean of dataset1‘s observations.

Exercise 3

Beginning Level

Using vapply(), find the sums of dataset1‘s observations.

Learn more about the apply family in the online course R Programming: Advanced Analytics In R For Data Science. In this course you will learn how to:

  • Work with all common data types like dates, integers and characters
  • Indepth use and analysis of the apply family of functions
  • And much more

Exercise 4

Intermediate Level

Find the class of dataset1‘s observations.

Exercise 5

Intermediate Level

Use vapply() to verify all the “mtcars” columns are numeric.

Exercise 6

Intermediate Level

Find the range of dataset1.

Exercise 7

Intermediate Level

Print dataset1 with the vapply() function.

Exercise 8

Advanced Level

Find the quantiles of dataset1, with the vapply() function.

Exercise 9

Advanced Level

Find the number of characters in the following dataset of strings:
cars <- c("Corolla", "Firebird", "Europa")

Exercise 10

Advanced Level

Using the following function, process dataset1‘s observations with toValue set to FALSE.

Required function:
convert <- function(x, toValue = TRUE) {
  if (toValue) {
    x * 25.4
  } else {
    x / 25.4
  }
}




Data wrangling : Reshaping


Data wrangling is a task of great importance in data analysis: it is the process of importing, cleaning and transforming raw data into actionable information. It is a time-consuming process, estimated to take about 60–80% of an analyst's time. In this series we will go through this process. It will be a brief series aimed at sharpening the reader's data wrangling skills. This is the second part of the series and it covers reshaping data into a tidy form. By tidy form, we mean that each feature forms a column and each observation forms a row.

Before proceeding, it might be helpful to look over the help pages for spread, gather, unite, separate, replace_na, fill and extract_numeric.
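As a warm-up, the wide/long round trip on a toy data frame (the names here are illustrative, not the airquality columns used below):

```r
library(tidyr)

long <- data.frame(id    = c(1, 1, 2, 2),
                   key   = c("a", "b", "a", "b"),
                   value = c(10, 20, 30, 40))

wide <- spread(long, key, value)        # keys become column headings
back <- gather(wide, key, value, a, b)  # and back to long form
wide
```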

Moreover please load the following libraries.
install.packages("magrittr")
library(magrittr)
install.packages("tidyr")
library(tidyr)

Please run the code below in order to load the data set:

data <- airquality[4:6]

Answers to the exercises are available here.

If you obtained a different (correct) answer than those listed on the solutions page, please feel free to post your answer as a comment on that page.

Exercise 1

Print out the structure of the data frame.

Exercise 2

Let’s turn the data frame into a wider form: turn the Month variable into column headings and spread the Temp values across the months they belong to.

Exercise 3

Turn the wide (exercise 2) data frame into its initial format using the gather function, specify the columns you would like to gather by index number.

Exercise 4

Turn the wide (exercise 2) data frame into its initial format using the gather function, specify the columns you would like to gather by column name.

Learn more about Data Pre-Processing in the online course R Data Pre-Processing & Data Management – Shape your Data!. In this course you will learn how to:

  • import data into R in several ways while also being able to identify a suitable import tool
  • use SQL code within R
  • And much more

Exercise 5

Turn the wide (exercise 2) data frame into its initial format using the gather function, specifying the columns by the remaining column names (the ones you don’t use for gathering).

Exercise 6

Unite the variables Day and Month into a new feature named Date with the format %d-%m.

Exercise 7

Restore the data frame to its format before exercise 6: separate the variable you created (Date) back into Day and Month.

Exercise 8

Replace the missing values (NA) with 'Unknown'.

Exercise 9

Run the script below, so that you make a new feature year.
back2long_na$year <- rep(NA, nrow(back2long_na))

back2long_na$year[1] <- '2015'
back2long_na$year[as.integer(nrow(back2long_na)/3)] <- '2016'

back2long_na$year[as.integer(2*nrow(back2long_na)/3)] <- '2017'

You will have noticed that the new column has many missing values. Fill the NAs with the non-missing value right above them (e.g. the NAs below the ‘2016’ value get ‘2016’, and those below ‘2017’ get ‘2017’).

Hint: use the fill function.

Exercise 10

Extract the numeric values from the Temp feature.

Hint: extract_numeric. This is a very useful function when the variable is a character with ‘noise’, for example ‘$40’, and you want to transform it to 40.




Neural networks Exercises (Part-3)



Neural networks have become a cornerstone of machine learning in the last decade. Created in the late 1940s with the intention of building computer programs that mimic the way neurons process information, these algorithms were long believed to be only an academic curiosity, deprived of practical use, since they require a lot of processing power and other machine learning algorithms outperformed them. However, since the mid-2000s, the creation of new neural network types and techniques, coupled with the increased availability of fast computers, has made the neural network a powerful tool that every data analyst or programmer must know.

In this series of articles, we’ll see how to fit a neural network with R, learn the core concepts needed to apply those algorithms well, and see how to evaluate whether our model is appropriate for production use. In the last exercise sets, we saw how to implement a feed-forward neural network in R. That kind of neural network is quite useful for matching a single input value to a specific output value, either a dependent variable in regression problems or a class in classification problems. However, sometimes a sequence of inputs can give the network much more information than a single value. For example, if you want to train a neural network to predict which letter will come next in a word based on which letters have been typed, making the prediction from the last letter entered can give good results, but if all the previous letters are used, the results should be better, since the arrangement of the previous letters carries important information about the rest of the word.

In today’s exercise set, we will see a type of neural network that is designed to make use of the information available in a sequence of inputs. These “recurrent neural networks” do so by using a hidden state at time t-1 that influences the calculation of the weights at time t. For more information about this type of neural network, you can read this article, which is a good introduction to the subject.

Answers to the exercises are available here.

Exercise 1
We will start by using a recurrent neural network to predict the values of a time series. Load the EuStockMarkets dataset from the datasets package and save the first 1400 observations of the “DAX” time series as your working dataset.

Exercise 2
Process the dataset so it can be used in a neural network.

Exercise 3
Create two matrices containing 10 sequences of 140 observations each from the previous dataset. The first one must be made of the original observations and will be the input of our neural network. The second one will be the output: since we want to predict the value of the stock market at time t+1 based on the value at time t, this matrix is the same as the first one with all the elements shifted by one position. Make sure that each sequence is coded as a row of each matrix.

Exercise 4
Set the seed to 42 and randomly choose eight sequences to train your model and two sequences for later validation. Once that’s done, load the rnn package and use the trainr() function to train a recurrent neural network on the training dataset. For now, use a learning rate of 0.01, one hidden layer of one neuron and 500 epochs.
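A sketch of the trainr() interface on synthetic data scaled to [0, 1] (the real inputs would be the sequence matrices built in exercise 3; the shapes and hyperparameters match the exercise):

```r
library(rnn)

set.seed(42)

# 8 training sequences of length 140, one sequence per row; Y is X
# shifted by one position so the network predicts t + 1 from t.
X <- matrix(runif(8 * 140), nrow = 8)
Y <- cbind(X[, -1], runif(8))

model <- trainr(Y = Y, X = X,
                learningrate = 0.01,
                hidden_dim   = 1,
                numepochs    = 500)
```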

Exercise 5
Use the predictr function to make predictions on all 10 sequences of your original data matrix, then plot the real and predicted values on the same graph. Also draw the plot of the predictions on the test set against the real values of your dataset.

Exercise 6
The last model seems to underestimate the stock values that are higher than 0.5. Repeat the steps of exercises 3 and 4, but this time use a hidden layer of 10 neurons. Once that’s done, calculate the RMSE of your predictions. This will be the baseline model for the rest of this exercise set.

Learn more about neural networks in the online course Machine Learning A-Z™: Hands-On Python & R In Data Science. In this course you will learn how to:

  • Work with Deep Learning networks and related packages in R
  • Create Natural Language Processing models
  • And much more

Exercise 7
One interesting method often used to accelerate the training of a neural network is “Nesterov momentum”. This procedure is based on the fact that while trying to find the weights that minimize the cost function of your neural network, optimization algorithms like gradient descent “zigzag” around a straight path to the minimum. By adding a momentum matrix, which keeps track of the general direction of the gradient, to the gradient, we can minimize the deviation from this optimal path and speed up the convergence of the algorithm. You can see this video for more information about this concept.

Repeat the last exercise, but this time use 250 epochs and a momentum of 0.7.

Exercise 8
A special type of recurrent neural network trained by backpropagation through time is the Long Short-Term Memory (LSTM) network. This type of recurrent neural network is quite useful in a deep learning context, since it is robust against the vanishing gradient problem. We will see both of those concepts in more detail in a future exercise set, but for now you can read about them here.

The trainr() function gives us the ability to train an LSTM network by setting the network_type parameter to “lstm”. Use this algorithm with 500 epochs and 20 neurons in the hidden layer to predict the values of your time series.

Exercise 9
When working with a recurrent neural network, it is important to choose an input sequence length that gives the algorithm the maximum information possible without adding useless noise. Until now we have used 10 sequences of 140 observations. Train a recurrent neural network on 28 sequences of 50 observations, make predictions and compute the RMSE to see whether this encoding affects your predictions.

Exercise 10
Try using all 1860 observations of the “DAX” time series to train and test a recurrent neural network. Then post the settings you used for your model, and why you chose them, in the comments.




Data Visualization with googleVis exercises part 3

Scatter & Bubble chart

This is the third part of our data visualization series and at this part we will explore the features of two more of the charts that googleVis provides.

Read the examples below to understand the logic of what we are going to do and then test your skills with the exercise set we prepared for you. Let's begin!

Answers to the exercises are available here.

Package Installation

As you already know, the first thing you have to do is install and load the googleVis package with:
install.packages("googleVis")
library(googleVis)

NOTE: The charts are created locally by your browser. In case they are not displayed at once press F5 to reload the page.

Scatter chart

It is quite simple to create a scatter chart with googleVis. We will use the cars dataset. Look at the example below:
ScatterC <- gvisScatterChart(cars)
plot(ScatterC)

Exercise 1

Create a list named “ScatterC” and pass to it the cars dataset as a scatter chart. HINT: Use gvisScatterChart().

Exercise 2

Plot the scatter chart. HINT: Use plot().

Titles

It is time to learn how to enhance the appearance of our googleVis charts. We shall give a title to the chart and also name hAxis and vAxis. Look at the example:
options=list(title="Cars", vAxis="{title:'speed'}",
hAxis="{title:'dist'}" )

Exercise 3

Name your chart “Cars”, your chart’s vAxis “speed”, your chart’s hAxis “dist” and plot the chart. HINT: Use list().

Size

You can adjust the size with width and height.

Exercise 4

Set your chart’s width to 600 and height to 300.

Legend

You can deactivate your chart’s legend if you set it to “none”.

Exercise 5

Deactivate your chart’s legend.

Learn more about using GoogleVis in the online course Mastering in Visualization with R programming. In this course you will learn how to:

  • Work extensively with the GoogleVis package and its functionality
  • Learn what visualizations exist for your specific use case
  • And much more

Point size & Line width

You can determine the size of the chart’s points with pointSize and also connect them with lines using lineWidth. For example:
pointSize=4, lineWidth=3

Exercise 6

Set point size to 3 and line width to 2.

Bubble Chart

Another amazing type of chart that googleVis provides is the bubble chart. You can create a simple Bubble Chart of the Fruits dataset like this:
BubbleC <- gvisBubbleChart(Fruits)
plot(BubbleC)

Exercise 7

Create a list named “BubbleC” and pass to it the Fruits dataset as a bubble chart. HINT: Use gvisBubbleChart().

Exercise 8

Plot the chart. HINT: Use plot().

Bubble Chart’s Features

As you can see, you created a bubble chart but it seems to be useless. In order to make it useful you should pass to it some of your dataset’s variables as features. It depends on what you want to be displayed and how. If you type head(Fruits) you can easily recognize the numeric variables of your dataset. Then you can use them like this:
BubbleC <- gvisBubbleChart(Fruits,idvar="VAR1",
xvar="VAR2", yvar="VAR3",
colorvar="VAR4", sizevar="VAR5")

Exercise 9

Find the numeric variables of Fruits, then set “Fruit” as idvar, “Sales” as xvar, “Expenses” as yvar, “Year” as colorvar and “Profit” as sizevar and plot your chart. HINT: Use head().

Data range

You can also adjust the minimum and maximum number of hAxis and vAxis that you want to be displayed. Look at the example below:
options=list(
hAxis='{minValue:50, maxValue:150}')

Exercise 10

Set your hAxis range from 70 to 130 and your vAxis range from 50 to 100.




Ridge regression in R exercises

Bias vs Variance tradeoff is always encountered in applying supervised learning algorithms. Least squares regression provides a good fit for the training set but can suffer from high variance which lowers predictive ability. To counter this problem, we can regularize the beta coefficients by employing a penalization term. Ridge regression applies l2 penalty to the residual sum of squares. In contrast, LASSO regression, which was covered here previously, applies l1 penalty.
Using ridge regression, we can shrink the beta coefficients towards zero which would reduce variance at the cost of higher bias which can result in better predictive ability than least squares regression. In this exercise set we will use the glmnet package (package description: here) to implement ridge regression in R.

Answers to the exercises are available here.

Exercise 1
Load the lars package and the diabetes dataset (Efron, Hastie, Johnstone and Tibshirani (2003) “Least Angle Regression” (with discussion), Annals of Statistics). This is the same dataset as in the LASSO exercise set and has patient-level data on the progression of diabetes. Next, load the glmnet package, which we will now use to implement ridge regression.
The dataset has three matrices x, x2 and y. x has a smaller set of independent variables while x2 contains the full set with quadratic and interaction terms. y is the dependent variable which is a quantitative measure of the progression of diabetes.
Generate separate scatterplots with the line of best fit for all the predictors in x with y on the vertical axis.
Regress y on the predictors in x using OLS. We will use this result as benchmark for comparison.

Exercise 2
Fit the ridge regression model using the glmnet function and plot the trace of the estimated coefficients against lambdas. Note that coefficients are shrunk closer to zero for higher values of lambda.
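A sketch of the fit on simulated data (with the diabetes data, x and y come from the lars package instead); alpha = 0 is what selects the ridge, i.e. l2, penalty in glmnet:

```r
library(glmnet)

set.seed(1)
x <- matrix(rnorm(100 * 10), ncol = 10)  # stand-in predictor matrix
y <- rnorm(100)                          # stand-in response

fit <- glmnet(x, y, alpha = 0)           # alpha = 0 -> ridge penalty
plot(fit, xvar = "lambda", label = TRUE) # coefficient traces vs log-lambda
```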

Exercise 3
Use the cv.glmnet function to get the cross validation curve and the value of lambda that minimizes the mean cross validation error.
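A sketch of the cross-validation step, again on stand-in data; lambda.min is the minimizer of the mean cross-validated error and lambda.1se (used in exercise 5) is the largest lambda within one standard error of it:

```r
library(glmnet)

set.seed(1)
x <- matrix(rnorm(100 * 10), ncol = 10)
y <- rnorm(100)

cv <- cv.glmnet(x, y, alpha = 0)  # k-fold CV over the lambda path
plot(cv)
cv$lambda.min
cv$lambda.1se
```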

Exercise 4
Using the minimum value of lambda from the previous exercise, get the estimated beta matrix. Note that coefficients are lower than least squares estimates.

Exercise 5
To get a more parsimonious model we can use a higher value of lambda that is within one standard error of the minimum. Use this value of lambda to get the beta coefficients. Note the shrinkage effect on the estimates.

Learn more about Model Evaluation in the online course Regression Machine Learning with R. In this course you will learn how to:

  • Avoid model over-fitting using cross-validation for optimal parameter selection
  • Explore maximum margin methods such as best penalty of error term support vector machines with linear and non-linear kernels.
  • And much more

Exercise 6
Split the data randomly between a training set (80%) and test set (20%). We will use these to get the prediction standard error for least squares and ridge regression models.

Exercise 7
Fit the ridge regression model on the training set and get the estimated beta coefficients for both the minimum lambda and the higher lambda within 1-standard error of the minimum.

Exercise 8
Get predictions from the ridge regression model for the test set and calculate the prediction standard error. Do this for both the minimum lambda and the higher lambda within 1-standard error of the minimum.

Exercise 9
Fit the least squares model on the training set.

Exercise 10
Get predictions from the least squares model for the test set and calculate the prediction standard error.




Manipulate Biological Data Using Biostrings Package Exercises (Part 4)


Bioinformatics is an amalgamation of biology and computer science: biological data is manipulated using computers and computer software. Biological data includes DNA, RNA and proteins. DNA and RNA are made of nucleotides, which encode our genetic material, while our structure and functions are carried out by proteins, which are built of amino acids.
In this exercise set we manipulate DNA, RNA and protein strings using the Biostrings package.
Install the Biostrings package from Bioconductor with:
BiocManager::install("Biostrings")
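A taste of the functions used below, on a short made-up sequence (findPalindromes() looks for reverse-complement palindromes; dinucleotideFrequency() tabulates all 16 dinucleotides):

```r
library(Biostrings)

dna <- DNAString("ACCTAGGTACGTA")  # arbitrary example sequence
findPalindromes(dna)
dinucleotideFrequency(dna)
```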

Answers to the exercises are available here.

If you obtained a different (correct) answer than those listed on the solutions page, please feel free to post your answer as a comment on that page.

Exercise 1

Create an RNA string and find palindromes in the sequence

Exercise 2

Create a DNA string and find palindromes in the sequence

Exercise 3

Create a DNA string and find the dinucleotide frequency of the sequences

Exercise 4

Create an RNA string and find the dinucleotide frequency of the sequences

Learn more about Data Pre-Processing in the online course R Data Pre-Processing & Data Management – Shape your Data!. In this course you will learn how to:

  • import data into R in several ways while also being able to identify a suitable import tool
  • use SQL code within R
  • And much more

Exercise 5

Create a DNA string and find the oligonucleotide frequency in the sequences

Exercise 6

Create an RNA string and find the oligonucleotide frequency in the sequences

Exercise 7

Create a DNA string and find the trinucleotide frequency in the sequences

Exercise 8

Create an RNA string and find the trinucleotide frequency in the sequences

Exercise 9

Print amino acid alphabets

Exercise 10

Create an Amino acid string and print the frequency of the amino acid strings in the sequence




Using the xlsx package to create an Excel file

Microsoft Excel is perhaps the most popular data analysis tool out there. While arguably convenient, spreadsheet software is error-prone and Excel code can be very hard to review and test.

After successfully completing this exercise set, you will be able to prepare a basic Excel document using just R (no need to touch Excel yourself), leaving behind a reproducible R-script.

Solutions are available here.

Exercise 1
Install and load the xlsx package, using the dependencies = TRUE option.

Exercise 2
Create an xlsx workbook object in your R workspace and call it wb.

Exercise 3
Create a sheet object in wb named iris and assign it the name sheet1 in your workspace.

Exercise 4
Write the built-in iris data.frame to the iris sheet without row names. Hint: use the addDataFrame() function.

Now you can write your workbook anytime to your working directory using saveWorkbook(wb, "filename.xlsx").

Learn more about working with excel and R in the online course Learn By Example: Statistics and Data Science in R. In this course you will learn how to:

  • Learn some of the differences between working in Excel with regression modelling and R
  • Learn about different statistical concepts
  • And much more

Exercise 5
Apply ‘freeze pane’ on the top row.
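One possible approach, sketched on a throwaway workbook; the argument order follows the xlsx documentation (see ?createFreezePane), where rowSplit = 2 keeps row 1 visible while scrolling:

```r
library(xlsx)

wb     <- createWorkbook()
sheet1 <- createSheet(wb, sheetName = "iris")
addDataFrame(iris, sheet1, row.names = FALSE)

# Freeze everything above row 2 and to the left of column 1,
# i.e. the header row stays put.
createFreezePane(sheet1, rowSplit = 2, colSplit = 1)
```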

Exercise 6
Set width of columns 1 through 5 to 12, that is 84 pixels.

Exercise 7
Use Font, CellBlock and CB.setFont to make the header in bold.

Exercise 8
Using tapply, generate a table with the mean of ‘petal width’ by species and write it to a new sheet called pw, from row 2 down.
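The tapply() half of the exercise is plain base R; writing the result to the pw sheet then goes through addDataFrame() with startRow = 2:

```r
# Split Petal.Width by Species and average within each group; the
# result is a named numeric vector, one mean per species.
pw_mean <- tapply(iris$Petal.Width, iris$Species, mean)
pw_mean
#     setosa versicolor  virginica
#      0.246      1.326      2.026
```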

Exercise 9
Add a title in cell A1 above the table, merge the cells of the first three columns.

Exercise 10
Save your workbook to your working directory and open using Excel. Go back to R and continue formatting and adding information to your workbook at will.