Descriptive Analytics is the examination of data or content, usually manually performed, to answer the question “What happened?”.

This is the first set in a series of exercises that aims to provide a descriptive analytics solution for the ‘2008’ data set from here. Download it and save it as a csv file. This data set, which contains the arrival and departure information for all domestic flights in the US in 2008, has become the “iris” data set for Big Data. The exercises below cover the basics of data exploration. I chose this as ‘part 0’ of the descriptive analytics solution because you need to get to know your data set before you can pre-process and describe it, even though exploration is not formally part of the descriptive analytics value chain. Before proceeding, it might be helpful to look over the help pages for the `str`, `summary`, `dim`, `nrow`, `ncol`, `names`, `is.na`, and `match` functions.
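As a quick orientation, here is a minimal sketch of what those functions return on a tiny, made-up data frame. The data frame `toy` and its values are hypothetical and only stand in for the real flights data; the column names `DepDelay` and `Origin` are assumptions chosen for illustration.

```r
# Toy data frame standing in for the flights data (hypothetical values)
toy <- data.frame(
  Year     = c(2008L, 2008L, 2008L, 2008L),
  DepDelay = c(5L, NA, 12L, -3L),
  Origin   = c("SFO", "JFK", NA, "ORD"),
  stringsAsFactors = FALSE
)

str(toy)                      # column types and the first few values
summary(toy)                  # per-column summaries; numeric columns also report NA counts
dim(toy)                      # number of rows, then number of columns
nrow(toy)                     # number of rows only
ncol(toy)                     # number of columns only
names(toy)                    # column (variable) names
is.na(toy$DepDelay)           # logical vector, TRUE where a value is missing
match("Origin", names(toy))   # position of a column name among the names
```

Running the same calls on `flights` works the same way, only with far more rows and columns.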

Answers to the exercises are available here.

If you obtained a different (correct) answer than those listed on the solutions page, please feel free to post your answer as a comment on that page.

Load the data before proceeding. Let’s name the dataset ‘flights’.

`flights <- read.csv('2008.csv')`

**Exercise 1**

Print the structure of the data. What do you think about it?

**Exercise 2**

Print the summary statistics of the data. What do you think about the values? (format, consistency, completeness)

**Exercise 3**

Print the dimensionality of the data (number of rows and columns)

**Exercise 4**

Print the number of rows. This may seem like a silly command, but it is quite useful for loops and if statements.
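To illustrate why the row count is handy in loops and if statements, here is a small sketch on a made-up data frame. The names `toy` and `roots` are hypothetical and not part of the exercise data.

```r
# Hypothetical three-row data frame
toy <- data.frame(x = c(1, 4, 9))

# Guard against an empty data set before doing any work
if (nrow(toy) > 0) {
  roots <- numeric(nrow(toy))        # pre-allocate one slot per row
  for (i in seq_len(nrow(toy))) {    # seq_len() is safe even when nrow() is 0
    roots[i] <- sqrt(toy$x[i])
  }
}

roots
```

In practice a vectorised call such as `sqrt(toy$x)` would be preferred here; the loop is only meant to show where `nrow` fits.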

**Exercise 5**

Print the number of columns.

**Exercise 6**

Print the names of the variables.

**Exercise 7**

Print whether the first column has missing values (NAs). Try to answer this question in two ways. Hint: %in%
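To see how the hint behaves before applying it to the data, here is a minimal sketch on two made-up vectors (`x` and `y` are hypothetical). It also shows why a plain `==` comparison does not work for this purpose.

```r
x <- c(3, NA, 7)   # hypothetical vector with one missing value
y <- c(3, 5, 7)    # hypothetical vector with no missing values

has_na_x  <- NA %in% x   # %in% treats NA as matching NA
has_na_y  <- NA %in% y
any_na_x  <- anyNA(x)    # dedicated helper for the same question
compare_x <- x == NA     # every element is NA: comparing with NA never yields TRUE

has_na_x
has_na_y
any_na_x
compare_x
```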

**Exercise 8**

Print the number of variables that contain missing values.

**Exercise 9**

Find the proportion of the variables that contain missing values. What do you think about it?

**Exercise 10**

Print the names of the variables that contain missing values.

VESSELIN NIKOV says

# Exercise 6

# Number of NA’s in the first column

sum(is.na(flights[,1]))

Vasileios Tsakalos says

Hello Vesselin, I suppose you are referring to exercise 7. The goal of the exercise is to return a logical value so that it can be used afterwards in a conditional statement.

Thank you very much for your comment.

Carl Sutton says

The question asked whether NA’s were present, not how many (I am being nit picky here).

anyNA answers that question. Perhaps not as informative as the sum, but the summary function is quite good at answering that question.

Randy Minder says

Another way to do this, and get the NA count for all columns:

sapply(flights, function(x) sum(is.na(x)))

Ashok Harnal says

What I do not understand is why the following takes more time than the above solution:

apply(flights,2,function (x) sum(is.na(x)) )

Pasquale Dente says

The time difference is likely due to the fact that behind `sapply` there is (mostly) fast C code, while `apply` is `R` code. Please, do not trust me, have a look yourself at the source code.

Carl Sutton says

exercise 8 can be handled by

# Exercise 8

# Print the number of variables that contain missing values.

library(data.table)  # needed for data.table() and .SD

flights <- data.table(flights)

na_cols <- flights[,lapply(.SD, anyNA)]

Reduce("+",na_cols)

Note that converting first to data.table, using "j" to create a TRUE/FALSE list, and then applying Reduce (thank you Stack Overflow, "How to sum a numeric list elements in R") gives the answer of 14, which is what the solution shows.

Vasileios Tsakalos says

That’s a great answer. I haven’t thought of that. I really appreciate your feedback . Thanks for your time.

Cheers !

Victor says

My version of last four exercises:

#7

which(is.na(flights$Year))  # '== NA' always yields NA, so test with is.na()

NA %in% flights$Year

#8

NAvalues <- logical(ncol(flights))

for (i in seq_along(flights)) {

NAvalues[i]<- NA %in% flights[,i]

}

sum(NAvalues)

#9

portion <- sum(NAvalues)/ncol(flights)

portion

#10

names(flights[,NAvalues])

Vasileios Tsakalos says

Hello Victor, I really appreciate your feedback.

Thanks for sharing !

Cheers!

Jose Sanchez says

Went a different way with the last ones:

#ex 7.

sum(is.na(flights[, "Year"]))

# ex 8.

sum(0<apply(flights, 2, function(x) sum(is.na(x))))

# ex 9.

sum(0<apply(flights, 2, function(x) sum(is.na(x))))/ncol(flights)

# ex 10.

names(flights[0<apply(flights, 2, function(x) sum(is.na(x)))])

Roger says

# Exercise 7

any(is.na(flights[1]))