R & Shiny codes Archives - Data Secrets: Smart Research for Real Results

Rsample (Part 1) – Bootstrap estimate of a confidence interval for a mean

Miloš Vilotić — Thu, 26 Jul 2018 00:39:48 +0000

The [su_label]Rsample[/su_label] package contains functions that allow different types of resampling (e.g. cross-validation, bootstrap, etc.). The data structure in which resampling data is stored is a data frame and is very convenient for further work. You can read more about the [su_label]Rsample[/su_label] package on the official package page: https://github.com/tidymodels/rsample. The package is part of the RStudio “tidymodels” project (project leader is Max Kuhn).

In the first post about the [su_label]Rsample[/su_label] package, we will show its application for estimating confidence interval of the mean. This example will show the basic structure of the package. We can present the task in the following way. We want to estimate the 95% confidence interval of the mean, and we do not know the analytical apparatus for evaluating the interval, in this case the t-test. The idea is to create a large number of bootstrap samples (repeat samples of the same size as the original sample) from the original sample and to calculate the mean value for each. The estimate of the confidence interval would then be calculated by simply calculating the 0.025 and 0.975 quantiles from the obtained mean values.

First, we will create a sample from a normal distribution with a mean value of 5 and standard deviation 1. The sample size is 100.

[code lang=”r”]
library(rsample)
library(ggplot2)
library(dplyr)
library(purrr)

set.seed(54321)
data <- data_frame(x = rnorm(100, 5))
data
[/code]

# A tibble: 100 x 1
       x
   
 1  4.82
 2  4.07
 3  4.22
 4  3.35
 5  4.59
 6  3.90
 7  3.31
 8  7.52
 9  6.40
10  5.18
# ... with 90 more rows

We will use the [su_label]bootstraps[/su_label] function from the [su_label]Rsample[/su_label] package to generate 1000 bootstrap samples. Generated samples are stored in the data frame in the [su_label]splits[/su_label] column. Column [su_label]id[/su_label] denotes the name of each sample.

[code firstline=”9″ language=”r”]
bt_data <- bootstraps(data, times = 1000)
bt_data
[/code]

# Bootstrap sampling 
# A tibble: 1,000 x 2
   splits       id           
                  
 1  Bootstrap0001
 2  Bootstrap0002
 3  Bootstrap0003
 4  Bootstrap0004
 5  Bootstrap0005
 6  Bootstrap0006
 7  Bootstrap0007
 8  Bootstrap0008
 9  Bootstrap0009
10  Bootstrap0010
# ... with 990 more rows

The structure of the first bootstrap sample can be seen via the following command:

[code firstline=”11″ language=”r”]
bt_data$splits[[1]]
[/code]

<100/39/100>

We see that the new sample contains 100 observations (analysis data in the terminology of the [su_label]Rsample[/su_label] package), 39 observations were in the [su_label]assessment[/su_label] (often used to evaluate the performance of a model that was fit to the analysis data). The last number indicates the size of the original set. This data is expected as it is a sample with replacement. Each new sample can be accessed using the [su_label]analysis(sample)[/su_label] function. Assessment data is accessed using the [su_label]assessment(sample)[/su_label] function.

Once we have created bootstrap samples the mean value for each sample is calculated. We create an auxiliary function [su_label]get_split_mean[/su_label] to calculate the sample mean value and later apply it to each sample.

[code firstline=”11″ language=”r”]
get_split_mean <- function(split) {
# access to the sample data
split_data <- analysis(split)
# calculate the sample mean value
split_mean <- mean(split_data$x)
return(split_mean)
}
[/code]

Using the [su_label]map_dbl[/su_label] function from the [su_label]purrr[/su_label] package, we will pass through all the samples and calculate the mean values. We add the new vector of mean values to the existing [su_label]bt_data[/su_label] data frame. In this way, we get a very nice structure (tidy data) for further work.

[code firstline=”19″ language=”r”]
bt_data$bt_means <- map_dbl(bt_data$splits, get_split_mean)
bt_data
[/code]

# Bootstrap sampling 
# A tibble: 1,000 x 3
   splits       id            bt_means
                      
 1  Bootstrap0001     5.02
 2  Bootstrap0002     4.93
 3  Bootstrap0003     4.99
 4  Bootstrap0004     4.86
 5  Bootstrap0005     4.90
 6  Bootstrap0006     5.03
 7  Bootstrap0007     5.10
 8  Bootstrap0008     5.00
 9  Bootstrap0009     4.78
10  Bootstrap0010     5.07
# ... with 990 more rows

The 95% confidence interval is obtained using the [su_label]quantile[/su_label] function.

[code firstline=”21″ language=”r”]
bt_ci <- round(quantile(bt_data$bt_means, c(0.025, 0.975)), 3)
bt_ci
[/code]

 2.5% 97.5% 
4.704 5.117

We calculate the confidence interval of the original set using the [su_label]t.test[/su_label] function.

[code firstline=”23″ language=”r”]
t_test <- t.test(data$x)
tt_ci <- round(t_test$conf.int, 3)
tt_ci
[/code]

[1] 4.697 5.117
attr(,"conf.level")
[1] 0.95

We see that the intervals are almost identical. In the same way we can estimate other parameters for which an analytical solution can’t be obtained or the solution is too complex.

The estimate of the confidence interval using bootstrap samples is shown in the following graph. The code for the graph is in the Appendix.

Appendix

[code lang=”r”]
# Uploading required R packages
library(rsample)
library(ggplot2)
library(dplyr)
library(purrr)

# Generating sample from a normal distribution
set.seed(54321)
data <- data_frame(x = rnorm(100, 5))
data

# Generate 1000 bootstrap samples
bt_data <- bootstraps(data, times = 1000)
bt_data

# Display the structure of the first bootstrap sample
bt_data$splits[[1]]

# Create an auxiliary function for calculating the sample mean value
get_split_mean <- function(split) {
# access to the sample data
split_data <- analysis(split)
# calculate the sample mean value
split_mean <- mean(split_data$x)
return(split_mean)
}

# Adding the new vector to the existing data frame.
bt_data$bt_means <- map_dbl(bt_data$splits, get_split_mean)
bt_data

# Calculate the 95% confidence interval.
bt_ci <- round(quantile(bt_data$bt_means, c(0.025, 0.975)), 3)
bt_ci

# Calculate the confidence interval of the original set.
t_test <- t.test(data$x)
tt_ci <- round(t_test$conf.int, 3)
tt_ci

#################
# Generate graph
##################
bt_ci_lower <- quantile(bt_data$bt_means, c(0.025))
bt_ci_upper <- quantile(bt_data$bt_means, c(0.975))
bt_mean <- mean(bt_data$bt_means)

bt_data <- bt_data %>%
dplyr::mutate(Color = ifelse(bt_means < bt_ci_lower | bt_means > bt_ci_upper, "Out of 95% CI", "In 95% CI"))

ggplot(bt_data, aes(x = bt_means, y = 0)) +
geom_jitter(aes(color = Color), alpha = 0.6, size = 3, width = 0) +
geom_vline(xintercept = round(c(bt_ci_lower, bt_mean, bt_ci_upper),2), linetype = c(2,1,2)) +
scale_x_continuous(breaks = round(c(bt_ci_lower, bt_mean, bt_ci_upper),2),
labels = c("Lower CI (95%)", "Mean", "Upper CI (95%)"),
sec.axis = sec_axis(~., breaks = round(c(bt_ci_lower, bt_mean, bt_ci_upper),2),
labels = round(c(bt_ci_lower, bt_mean, bt_ci_upper),2))) +
scale_color_manual(values = c("gray40", "firebrick4")) +
theme_void() +
theme(legend.title = element_blank(),
legend.position = "top",
legend.text = element_text(size = 14),
axis.text.x = element_text(size = 14))
[/code]

Miloš Vilotić

Master in Statistics. Developing R packages and Shiny applications.

The post Rsample (Part 1) – Bootstrap estimate of a confidence interval for a mean appeared first on Data Secrets: Smart Research for Real Results.

Using a recursive function to conditionally aggregate a contingency table

Miloš Vilotić — Thu, 07 Jun 2018 23:12:34 +0000

In this post, we will show on a simple example the application of a recursive function. The problem that we want to solve is the following. We have a matrix of dimensions N x M. This can be, for example, a two-dimensional contingency table. Columns that have a mean value less than the set value must be added to the adjacent column. In the case where there are two adjacent columns the lower one is selected. This kind of problem can be resolved elegantly with the use of recursive functions.
In the example we will use the matrix 5 x 8. We will generate 40 integers in the interval from 1 to 9.

[code lang=”r” firstline=”1″]
# Define a seed for the random number generator
set.seed(155)

# Create a matrix 5 x 8
mat <- matrix(ceiling(runif(40, 0, 9)), 5, 8)

# Name columns and rows of a newly created matrix
colnames(mat) <- LETTERS[1:8]
rownames(mat) <- letters[1:5]

# Print the matrix
print(mat)
[/code]

  A B C D E F G H
a 8 3 9 8 6 5 5 3
b 4 7 2 5 5 1 4 6
c 7 6 2 5 3 2 7 8
d 7 5 5 8 2 5 8 8
e 5 5 2 2 2 8 3 5

We want to merge other columns with all columns with a mean value of less than 5. In this example, columns with mean values less than 5 are C, E and F. Column C must be added to column B (mean B < mean D) and column E must be added to column F (mean F < mean D). Since column F is added to column E, it is not necessary to merge it with other columns.

[code lang=”r” firstline=”13″]
# Calculating columns mean values
col_means <- round(colMeans(mat), 1)

# Print the columns mean values
print(t(as.matrix(col_means)))
[/code]

       A   B C   D   E   F   G H
[1,] 6.2 5.2 4 5.6 3.6 4.2 5.4 6

First we will create an auxiliary [su_label]merge_columns[/su_label] function that merges two columns according to the above described algorithm. Implementation of the function is in the appendix at the end of the post.
The recursive function will call itself as long as there is a column with a mean value less than 5.

[code lang=”r” firstline=”18″]
collapsing_matrix <- function(mat, threshold) {
col_means <- colMeans(mat)
ind <- which(col_means < threshold)[1]

if (!is.na(ind)) {
mat <- merge_columns(mat, ind)

# the function calls itself if there is a column
# with a mean value less than threshold
collapsing_matrix(mat, threshold)
} else {
return(mat)
}
}
[/code]

By applying the function to the matrix defined above, we are getting the expected result.

[code lang=”r” firstline=”32″]
# Generating a matrix according to a specified condition
new_mat <- collapsing_matrix(mat, threshold = 5)

# Print the matrix
print(new_mat)
[/code]

  A B,C D E,F G H
a 8  12 8  11 5 3
b 4   9 5   6 4 6
c 7   8 5   5 7 8
d 7  10 8   7 8 8
e 5   7 2  10 3 5

The function can be applied to a matrix of any dimension. For instance, we have a matrix 5 x 20.

  A B C D E F G H I J K L M N O P Q R S T
a 8 3 9 8 6 5 5 3 8 3 9 8 6 5 5 3 8 3 9 8
b 4 7 2 5 5 1 4 6 4 7 2 5 5 1 4 6 4 7 2 5
c 7 6 2 5 3 2 7 8 7 6 2 5 3 2 7 8 7 6 2 5
d 7 5 5 8 2 5 8 8 7 5 5 8 2 5 8 8 7 5 5 8
e 5 5 2 2 2 8 3 5 5 5 2 2 2 8 3 5 5 5 2 2

The mean values of the columns of the matrix.

       A   B C   D   E   F   G H   I   J K   L   M   N   O P   Q   R S   T
[1,] 6.2 5.2 4 5.6 3.6 4.2 5.4 6 6.2 5.2 4 5.6 3.6 4.2 5.4 6 6.2 5.2 4 5.6

Result after calling the function [su_label]collapsing_matrix()[/su_label] for threshold value 5.

  A B,C D E,F G H I J,K L M,N O P Q R,S T
a 8  12 8  11 5 3 8  12 8  11 5 3 8  12 8
b 4   9 5   6 4 6 4   9 5   6 4 6 4   9 5
c 7   8 5   5 7 8 7   8 5   5 7 8 7   8 5
d 7  10 8   7 8 8 7  10 8   7 8 8 7  10 8
e 5   7 2  10 3 5 5   7 2  10 3 5 5   7 2

With fewer changes in the R code, the recursive function in the appendix can be written so that instead of the mean value, different criterion could be used, for example, the minimum value.

Appendix

[code lang=”r”]
merge_columns <- function(mat, ind) {
n_cols <- ncol(mat)
col_ind <- 1:n_cols
col_names <- colnames(mat)
col_means <- colMeans(mat)

if (ind == 1) {
start <- NULL
i <- 0
j <- 1
end <- setdiff(col_ind, c(1,2))
} else {
if (ind == ncol(mat)) {
start <- setdiff(col_ind, c(n_cols-1, n_cols))
i <- -1
j <- 0
end <- NULL
} else {
mean_before <- col_means[ind-1]
mean_after <- col_means[ind+1]
if (mean_before <= mean_after) {
start <- col_ind[col_ind < ind-1]
i <- -1
j <- 0
end <- col_ind[col_ind > ind]
} else {
start <- col_ind[col_ind < ind]
i <- 0
j <- 1
end <- col_ind[col_ind > ind+1]
}
}
}

mat <- cbind(mat[,start], c(mat[,ind+i] + mat[,ind+j]), mat[,end])
new_col_names <- c(col_names[start], paste(col_names[ind+i], col_names[ind+j], sep = ","), col_names[end])
colnames(mat) <- new_col_names
return(mat)
}
[/code]

Miloš Vilotić

Master in Statistics. Developing R packages and Shiny applications.

The post Using a recursive function to conditionally aggregate a contingency table appeared first on Data Secrets: Smart Research for Real Results.