The post Quick start with R: Improving our regression model (Part 29) appeared first on My Statistical Consultant Blog.

Last time we created two variables, used the `lm()` command to perform a least squares regression on them, and diagnosed our regression using the `plot()` command. Here are the data again.
`height = c(176, 154, 138, 196, 132, 176, 181, 169, 150, 175)`

`bodymass = c(82, 49, 53, 112, 47, 69, 77, 71, 62, 78)`

Just as we did last time, we perform the regression using `lm()`. This time we store it as an object `M`. Indeed – R allows you to do that!

`M <- lm(height ~ bodymass)`

Now we use the `summary()` command to obtain useful information about our regression:

`summary(M)`

Our model’s *p*-value is highly significant (approximately 0.0004) and we have very good explanatory power (over 81% of the variability in height is explained by body mass).

We saw in the previous blog that points 2, 4, 5 and 6 have great influence on the model. Now we see how to re-fit our model while omitting one datum. Let’s omit point 6. Note the syntax we use to do so, involving the `subset` argument inside the `lm()` command and omitting the point using the operator `!=`, which stands for “not equal to”. The syntax instructs R to fit a linear model on a subset of the data in which all points except the sixth are included.

`M2 <- lm(height ~ bodymass, subset=(1:length(height)!=6))`

`summary(M2)`

Because we have omitted one observation, we have lost one degree of freedom (from 8 to 7) but our model has greater explanatory power (i.e. the Multiple *R*-Squared has increased from 0.81 to 0.85). From that perspective, our model has improved, but of course, point 6 may well be a valid observation, and perhaps should be retained. Whether you omit or retain such data is a matter of judgement.
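As an aside, the same refit can be written more compactly with base R’s `update()` function, which refits an existing model with modified arguments. A brief sketch using the `M` object above (a negative index in `subset` excludes that observation):

```r
# Refit M without the sixth observation; equivalent to the
# subset = (1:length(height) != 6) form used above.
M2 <- update(M, subset = -6)
summary(M2)
```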

Our diagnostic plots were as follows:

Comparing them with the diagnostic plots in the previous blog, we see no significant changes. In other words, omitting point 6 didn’t improve the quality of the regression.

David

```r
# Create two variables.
height = c(176, 154, 138, 196, 132, 176, 181, 169, 150, 175)
bodymass = c(82, 49, 53, 112, 47, 69, 77, 71, 62, 78)

# Store the regression model as an object.
M <- lm(height ~ bodymass)

# Obtain useful information about the regression.
summary(M)

# Store the regression model as an object after omitting point 6.
M2 <- lm(height ~ bodymass, subset = (1:length(height) != 6))

# Obtain useful information about the new regression.
summary(M2)

# Create a plotting environment of two rows and two columns and plot the model.
par(mfrow = c(2, 2))
plot(M2)
```

Senior Academic Manager at the *New Zealand Institute of Sport* and Director of *Sigma Statistics and Research Ltd*. Author of the book *R Graph Essentials*.


The post Quick start with R: Diagnosing our regression model (Part 28) appeared first on My Statistical Consultant Blog.

Last time we created two variables and used the `lm()` command to perform a least squares regression on them, treating one of them as the dependent variable and the other as the independent variable. Here they are again.
`height = c(176, 154, 138, 196, 132, 176, 181, 169, 150, 175)`

`bodymass = c(82, 49, 53, 112, 47, 69, 77, 71, 62, 78)`

Today we learn how to obtain more useful diagnostic information about a regression model. As before, we perform the regression.

`lm(height ~ bodymass)`

Now we can use several R diagnostic plots and influence statistics to understand our model. These diagnostic plots are as follows:

- Residuals vs. fitted values
- Q-Q plots
- Scale Location plots
- Cook’s distance plots.

To use R’s regression diagnostic plots, we set up the regression model as an object and create a plotting environment of two rows and two columns. Then we use the `plot()` command, treating the model as an argument.

`model <- lm(height ~ bodymass)`

`par(mfrow = c(2,2))`

`plot(model)`

The first plot gives the residuals plotted against the fitted values. If our data had no scatter, so that all points fell on the regression line, then every datum would fall on the horizontal line of the first plot. The red curve is a smoothed representation of the residuals, and ideally should be relatively flat and close to the horizontal line; that is – there should be no trend. However, this is not the case for our data. To summarise: the first plot (residuals vs. fitted values) should look more or less random, but that is not what we see here.

The second plot is a Quantile-Quantile (Q-Q) plot of the residuals. This plot helps us to identify whether or not the residuals are distributed normally. In our example, most of the points lie close to the dashed line. If the residuals were distributed normally, all of the points would lie along this line. For real data, there will be deviations, but any deviations should be small. To summarise: the second plot (normal Q-Q errors) will give a straight line if the errors are distributed normally, but points 4, 5 and 6 deviate from the straight line.

The third plot (Scale-Location) shows the square root of the standardized residuals (which have a mean of zero and a variance of unity) plotted against the fitted values. Large residuals (both positive and negative) appear at the top of the plot, while small residuals appear at the bottom. The red curve indicates any trend in the standardised residuals. If the red line is reasonably flat, then the variance in the residuals doesn’t change greatly over the range of the independent variable (and we have homoscedastic data). To summarise: the third plot is similar to the first, and should look random. However, ours does not.

The last plot (lower right) gives the standardized residuals, plotted against leverage. For normally distributed residuals, the standardized residuals will be centred symmetrically on zero. Leverage provides one measure of the extent to which each point influences the regression. Because the regression line passes through the geometric centre of the data, points that lie far from the geometric centre have greater leverage, and their leverage increases when the points are relatively isolated (i.e. there are not many points close to the point of interest). The leverage of any point depends on the distance from the geometric centre and on the isolation of that point. Data that are simultaneously outliers and have high leverage influence both the slopes and intercept of a regression model. We see that point 4 has high leverage.

The last plot also gives contours of Cook’s distance, which is a measure of how much the regression would change if a point were omitted from the regression. Cook’s distance increases when leverage is large. When the residuals are large, any point far from the geometric centre and that has a large residual distorts the regression. Ideally, the red smoothed line remains close to the horizontal dashed line and ideally no points have a large Cook’s distance (i.e. > 0.5). Neither of these two conditions apply for our data. To summarise: the last plot (Cook’s distance) tells us which points have the greatest influence on the regression (leverage points). We see that point 4 (having both high leverage and high Cook’s Distance) has considerable influence on the model.
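The leverage and Cook’s distance values behind this plot can also be inspected numerically with base R. A short sketch using the `model` object defined above:

```r
# Leverage (hat values) for each observation.
hatvalues(model)

# Cook's distance for each observation.
cooks.distance(model)

# Flag observations exceeding the conventional 0.5 threshold.
which(cooks.distance(model) > 0.5)
```

These numeric values make it easy to confirm which points (such as point 4 here) dominate the fit.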

David

```r
# Create two variables.
height = c(176, 154, 138, 196, 132, 176, 181, 169, 150, 175)
bodymass = c(82, 49, 53, 112, 47, 69, 77, 71, 62, 78)

# Estimate the regression model.
lm(height ~ bodymass)

# Store the regression model as an object.
model <- lm(height ~ bodymass)

# Create a plotting environment of two rows and two columns and plot the model.
par(mfrow = c(2, 2))
plot(model)
```


The post Quick start with R: More on regression (Part 27) appeared first on My Statistical Consultant Blog.

Last time we created two variables and used the `lm()` command to perform a least squares regression on them, treating one of them as the dependent variable and the other as the independent variable. Here they are again.

`height = c(176, 154, 138, 196, 132, 176, 181, 169, 150, 175)`

`bodymass = c(82, 49, 53, 112, 47, 69, 77, 71, 62, 78)`

Today we learn how to obtain useful diagnostic information about a regression model and then how to draw residuals on a plot. As before, we perform the regression.

`lm(height ~ bodymass)`

Now let’s find out more about the regression. First, let’s store the regression model as an object called `mod` and then use the `summary()` command to learn about the regression.

`mod <- lm(height ~ bodymass)`

`summary(mod)`

Here is what R gives you.

R has given you a great deal of diagnostic information about the regression. The most useful items in this output are the coefficients themselves, the Adjusted *R*-squared, the *F*-statistic and the *p*-value for the model.
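These quantities can also be extracted programmatically from the summary object rather than read off the printout. A small sketch using the `mod` object above:

```r
s <- summary(mod)

coef(mod)          # the fitted coefficients
s$adj.r.squared    # the adjusted R-squared
s$fstatistic       # the F-statistic (value, numerator df, denominator df)
s$coefficients     # coefficient table, with p-values in the "Pr(>|t|)" column
```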

Now let’s use R’s `predict()` command to create a vector of fitted values.

`regmodel <- predict(lm(height ~ bodymass))`

`regmodel`

Here are the fitted values:

Now let’s plot the data and regression line again.

`plot(bodymass, height, pch = 16, cex = 1.3, col = "blue", main = "HEIGHT PLOTTED AGAINST BODY MASS", xlab = "BODY MASS (kg)", ylab = "HEIGHT (cm)")`

`abline(lm(height ~ bodymass))`

We can plot the residuals using R’s for loop and a subscript `k` that runs from 1 to the number of data points. We know that there are 10 data points, but if we do not know the number of data we can find it using the `length()` command on either the height or body mass variable.

`npoints <- length(height)`

`npoints`

Now let’s implement the loop and draw the residuals (the differences between the observed data and the corresponding fitted values) using the `lines()` command. Note the syntax we use to draw in the residuals.

`for (k in 1: npoints) lines(c(bodymass[k], bodymass[k]), c(height[k], regmodel[k]))`

Here is our plot, including the residuals.
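As an aside, base R’s `segments()` function can draw all of the residual lines in a single vectorised call, with no loop required. A brief sketch using the objects defined above:

```r
# One line segment per point: from the observed value (bodymass, height)
# to the fitted value (bodymass, regmodel), for all points at once.
segments(bodymass, height, bodymass, regmodel)
```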

None of this was so difficult!

Next time we will look at more advanced aspects of regression models and see what R has to offer. See you then!

David

```r
# Create two variables.
height = c(176, 154, 138, 196, 132, 176, 181, 169, 150, 175)
bodymass = c(82, 49, 53, 112, 47, 69, 77, 71, 62, 78)

# Estimate the regression model.
lm(height ~ bodymass)

# Store the regression model as an object.
mod <- lm(height ~ bodymass)
summary(mod)

# Create a vector of fitted values.
regmodel <- predict(lm(height ~ bodymass))
regmodel

# Plot the data and regression line.
plot(bodymass, height, pch = 16, cex = 1.3, col = "blue", main = "HEIGHT PLOTTED AGAINST BODY MASS", xlab = "BODY MASS (kg)", ylab = "HEIGHT (cm)")
abline(lm(height ~ bodymass))

# Find the number of data points.
npoints <- length(height)
npoints

# Draw in the residuals.
for (k in 1:npoints) lines(c(bodymass[k], bodymass[k]), c(height[k], regmodel[k]))
```


The post Quick start with R: Scatterplot with regression line (Part 26) appeared first on My Statistical Consultant Blog.

Let’s take height to be a variable that describes the heights (in cm) of ten people. Copy and paste the following code to the R command line to create the height variable.

`height <- c(176, 154, 138, 196, 132, 176, 181, 169, 150, 175)`

Now let’s take bodymass to be a variable that describes the masses (in kg) of the same ten people. Copy and paste the following code to the R command line to create the weight variable.

`bodymass <- c(82, 49, 53, 112, 47, 69, 77, 71, 62, 78)`

Both variables are now stored in the R workspace. To view them, enter:

`height`

`bodymass`

We can now create a simple plot of the two variables as follows:

`plot(bodymass, height)`

We can enhance this plot using various arguments within the `plot()` command. Copy and paste the following code into the R workspace:

`plot(bodymass, height, pch = 16, cex = 1.3, col = "blue", main = "HEIGHT PLOTTED AGAINST BODY MASS", xlab = "BODY MASS (kg)", ylab = "HEIGHT (cm)")`

In the above code, the syntax `pch = 16` creates solid dots, while `cex = 1.3` creates dots that are 1.3 times bigger than the default (where `cex = 1`). More about these commands later.

Now let’s perform a linear regression using `lm()` on the two variables by adding the following text at the command line:

`lm(height ~ bodymass)`

We see that the intercept is 98.0054 and the slope is 0.9528. By the way – `lm` stands for “linear model”.

Finally, we can add a best fit line (regression line) to our plot by adding the following text at the command line:

`abline(98.0054, 0.9528)`

Another line of syntax that will plot the regression line is:

`abline(lm(height ~ bodymass))`
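Rather than typing the fitted coefficients by hand, you can extract them from the model with `coef()`. A brief sketch (assuming the scatterplot above has already been drawn):

```r
fit <- lm(height ~ bodymass)
b <- coef(fit)        # named vector: (Intercept) and bodymass

b["(Intercept)"]      # approximately 98.0054
b["bodymass"]         # approximately 0.9528

# Draw the same regression line without hard-coding the numbers.
abline(a = b[1], b = b[2])
```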

None of this was so difficult! In our next blog we will look again at regression.

David

```r
# Create the height variable.
height <- c(176, 154, 138, 196, 132, 176, 181, 169, 150, 175)

# Create the weight variable.
bodymass <- c(82, 49, 53, 112, 47, 69, 77, 71, 62, 78)

# View both variables.
height
bodymass

# Create a scatterplot of height against the bodymass variable.
plot(bodymass, height)

# Create a more complete scatterplot of height against the bodymass variable.
plot(bodymass, height, pch = 16, cex = 1.3, col = "blue", main = "HEIGHT PLOTTED AGAINST BODY MASS", xlab = "BODY MASS (kg)", ylab = "HEIGHT (cm)")

# Perform a linear regression on the two variables.
lm(height ~ bodymass)

# Add a best fit line (regression line) to the plot.
abline(98.0054, 0.9528)

# Alternatively, plot the regression line using:
abline(lm(height ~ bodymass))
```


]]>The post Quick start with R: Symbol colours and legend in qplot (Part 25) appeared first on My Statistical Consultant Blog.

In the previous blog we saw how to use `qplot` to map symbol size to a categorical variable. Now we see how to control symbol colours and create legend titles. Copy in the same dataset as in Blog 24 (a medical dataset relating to patients in a randomised controlled trial):

`M <- structure(list(PATIENT = structure(c(32L, 15L, 41L, 42L, 44L, 17L, 31L, 10L, 38L, 18L, 22L, 30L), .Label = c("Adrienne", "Alan", "Andy", "Ann ", "Anne ", "Anton", "Audrey", "Ben", "Bernie", "Beth", "Bob", "Bobby", "Bruce", "Charles", "Dave", "Dianne", "Frida", "Guy", "Henry", "Hugh", "Ian", "Irina", "James", "Jim", "Jo ", "John", "Jonah", "Joseph", "Lesley", "Liz", "Magnus", "Mary", "Max", "Merril", "Mike", "Mikhail", "Nick", "Peter", "Robert", "Robin", "Simon", "Steve", "Stuart", "Sue", "Telu"), class = "factor"), GENDER = structure(c(1L, 2L, 2L, 2L, 1L, 1L, 2L, 1L, 2L, 2L, 1L, 1L), .Label = c("F", "M"), class = "factor"), TREATMENT = structure(c(1L, 2L, 3L, 1L, 1L, 2L, 1L, 3L, 1L, 3L, 2L, 3L), .Label = c("A", "B", "C"), class = "factor"), AGE = structure(c(3L, 2L, 2L, 1L, 2L, 2L, 1L, 1L, 2L, 1L, 2L, 2L), .Label = c("E", "M", "Y"), class = "factor"), WEIGHT_1 = c(79.2, 58.8, 72, 59.7, 79.6, 83.1, 68.7, 67.6, 79.1, 39.9, 64.7, 65.6), WEIGHT_2 = c(76.6, 59.3, 70.1, 57.3, 79.8, 82.3, 66.8, 67.4, 76.8, 41.4, 65.3, 63.2), HEIGHT = c(169L, 161L, 175L, 149L, 179L, 177L, 175L, 170L, 177L, 138L, 170L, 165L), SMOKE = structure(c(2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L), .Label = c("N", "Y"), class = "factor"), EXERCISE = c(TRUE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, TRUE, TRUE, FALSE, FALSE, TRUE), RECOVER = c(1L, 0L, 1L, 1L, 1L, 0L, 1L, 1L, 1L, 1L, 0L, 1L)), .Names = c("PATIENT", "GENDER", "TREATMENT", "AGE", "WEIGHT_1", "WEIGHT_2", "HEIGHT", "SMOKE", "EXERCISE", "RECOVER"), class = "data.frame", row.names = c(1L, 4L, 5L, 13L, 15L, 17L, 22L, 29L, 33L, 41L, 42L, 43L))`

`M`

Now let’s map symbol size to `GENDER` and symbol colour to `EXERCISE`, but choosing our own colours. To control your symbol colours, use the layer `scale_colour_manual(values = )` and select your desired colours. We choose red and blue, and symbol sizes 3 and 7.

`qplot(HEIGHT, WEIGHT_1, data = M, geom = c("point"), xlab = "HEIGHT (cm)", ylab = "WEIGHT BEFORE TREATMENT (kg)" , size = factor(GENDER), color = factor(EXERCISE)) + scale_size_manual(values = c(3, 7)) + scale_colour_manual(values = c("red", "blue"))`

Here is our graph with red and blue points:

Now let’s see how to control the legend title (the title that sits directly above the legend). For this example, we control the legend title through the `name` argument within the two functions `scale_size_manual()` and `scale_colour_manual()`. Enter this syntax, in which we choose appropriate legend titles:

`qplot(HEIGHT, WEIGHT_1, data = M, geom = c("point"), xlab = "HEIGHT (cm)", ylab = "WEIGHT BEFORE TREATMENT (kg)" , size = factor(GENDER), color = factor(EXERCISE)) + scale_size_manual(values = c(3, 7), name="Gender") + scale_colour_manual(values = c("red","blue"), name="Exercise")`

We now have our preferred symbol colour and size, and legend titles of our choosing.

David

```r
# Create and display the dataset.
M <- structure(list(PATIENT = structure(c(32L, 15L, 41L, 42L, 44L, 17L, 31L, 10L, 38L, 18L, 22L, 30L), .Label = c("Adrienne", "Alan", "Andy", "Ann ", "Anne ", "Anton", "Audrey", "Ben", "Bernie", "Beth", "Bob", "Bobby", "Bruce", "Charles", "Dave", "Dianne", "Frida", "Guy", "Henry", "Hugh", "Ian", "Irina", "James", "Jim", "Jo ", "John", "Jonah", "Joseph", "Lesley", "Liz", "Magnus", "Mary", "Max", "Merril", "Mike", "Mikhail", "Nick", "Peter", "Robert", "Robin", "Simon", "Steve", "Stuart", "Sue", "Telu"), class = "factor"), GENDER = structure(c(1L, 2L, 2L, 2L, 1L, 1L, 2L, 1L, 2L, 2L, 1L, 1L), .Label = c("F", "M"), class = "factor"), TREATMENT = structure(c(1L, 2L, 3L, 1L, 1L, 2L, 1L, 3L, 1L, 3L, 2L, 3L), .Label = c("A", "B", "C"), class = "factor"), AGE = structure(c(3L, 2L, 2L, 1L, 2L, 2L, 1L, 1L, 2L, 1L, 2L, 2L), .Label = c("E", "M", "Y"), class = "factor"), WEIGHT_1 = c(79.2, 58.8, 72, 59.7, 79.6, 83.1, 68.7, 67.6, 79.1, 39.9, 64.7, 65.6), WEIGHT_2 = c(76.6, 59.3, 70.1, 57.3, 79.8, 82.3, 66.8, 67.4, 76.8, 41.4, 65.3, 63.2), HEIGHT = c(169L, 161L, 175L, 149L, 179L, 177L, 175L, 170L, 177L, 138L, 170L, 165L), SMOKE = structure(c(2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L), .Label = c("N", "Y"), class = "factor"), EXERCISE = c(TRUE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, TRUE, TRUE, FALSE, FALSE, TRUE), RECOVER = c(1L, 0L, 1L, 1L, 1L, 0L, 1L, 1L, 1L, 1L, 0L, 1L)), .Names = c("PATIENT", "GENDER", "TREATMENT", "AGE", "WEIGHT_1", "WEIGHT_2", "HEIGHT", "SMOKE", "EXERCISE", "RECOVER"), class = "data.frame", row.names = c(1L, 4L, 5L, 13L, 15L, 17L, 22L, 29L, 33L, 41L, 42L, 43L))
M

# Create a scatterplot of patient height against weight before treatment.
qplot(HEIGHT, WEIGHT_1, data = M, geom = c("point"), xlab = "HEIGHT (cm)", ylab = "WEIGHT BEFORE TREATMENT (kg)", size = factor(GENDER), color = factor(EXERCISE)) + scale_size_manual(values = c(3, 7)) + scale_colour_manual(values = c("red", "blue"))

# Change the legend titles.
qplot(HEIGHT, WEIGHT_1, data = M, geom = c("point"), xlab = "HEIGHT (cm)", ylab = "WEIGHT BEFORE TREATMENT (kg)", size = factor(GENDER), color = factor(EXERCISE)) + scale_size_manual(values = c(3, 7), name="Gender") + scale_colour_manual(values = c("red","blue"), name="Exercise")
```


]]>The post Quick start with R: Symbol sizes in qplot (Part 24) appeared first on My Statistical Consultant Blog.

Previously we saw how to use `qplot` to map symbol colour to a categorical variable. Copy in the following dataset (a medical dataset relating to patients in a randomised controlled trial):

`M <- structure(list(PATIENT = structure(c(32L, 15L, 41L, 42L, 44L, 17L, 31L, 10L, 38L, 18L, 22L, 30L), .Label = c("Adrienne", "Alan", "Andy", "Ann ", "Anne ", "Anton", "Audrey", "Ben", "Bernie", "Beth", "Bob", "Bobby", "Bruce", "Charles", "Dave", "Dianne", "Frida", "Guy", "Henry", "Hugh", "Ian", "Irina", "James", "Jim", "Jo ", "John", "Jonah", "Joseph", "Lesley", "Liz", "Magnus", "Mary", "Max", "Merril", "Mike", "Mikhail", "Nick", "Peter", "Robert", "Robin", "Simon", "Steve", "Stuart", "Sue", "Telu"), class = "factor"), GENDER = structure(c(1L, 2L, 2L, 2L, 1L, 1L, 2L, 1L, 2L, 2L, 1L, 1L), .Label = c("F", "M"), class = "factor"), TREATMENT = structure(c(1L, 2L, 3L, 1L, 1L, 2L, 1L, 3L, 1L, 3L, 2L, 3L), .Label = c("A", "B", "C"), class = "factor"), AGE = structure(c(3L, 2L, 2L, 1L, 2L, 2L, 1L, 1L, 2L, 1L, 2L, 2L), .Label = c("E", "M", "Y"), class = "factor"), WEIGHT_1 = c(79.2, 58.8, 72, 59.7, 79.6, 83.1, 68.7, 67.6, 79.1, 39.9, 64.7, 65.6), WEIGHT_2 = c(76.6, 59.3, 70.1, 57.3, 79.8, 82.3, 66.8, 67.4, 76.8, 41.4, 65.3, 63.2), HEIGHT = c(169L, 161L, 175L, 149L, 179L, 177L, 175L, 170L, 177L, 138L, 170L, 165L), SMOKE = structure(c(2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L), .Label = c("N", "Y"), class = "factor"), EXERCISE = c(TRUE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, TRUE, TRUE, FALSE, FALSE, TRUE), RECOVER = c(1L, 0L, 1L, 1L, 1L, 0L, 1L, 1L, 1L, 1L, 0L, 1L)), .Names = c("PATIENT", "GENDER", "TREATMENT", "AGE", "WEIGHT_1", "WEIGHT_2", "HEIGHT", "SMOKE", "EXERCISE", "RECOVER"), class = "data.frame", row.names = c(1L, 4L, 5L, 13L, 15L, 17L, 22L, 29L, 33L, 41L, 42L, 43L))`

`M`

Now we create a scatterplot of patient height against weight before treatment, and we map both symbol size and colour to `GENDER` using `factor()`. Enter the following syntax:

`qplot(HEIGHT, WEIGHT_1, data = M, xlab = "HEIGHT (cm)", ylab = "WEIGHT BEFORE TREATMENT (kg)" , size = factor(GENDER), color = factor(GENDER)) + scale_size_manual(values = c(5, 7))`

Note how we mapped symbol size and colour to `GENDER` using the syntax `size = factor(GENDER)` and `color = factor(GENDER)`.

Also note how we controlled symbol size using the layer `+ scale_size_manual(values = c(5, 7))`.

In this example I have chosen symbol sizes of 5 and 7. You may select different sizes, depending on your preferences. You will quickly gain experience and select the symbol sizes that suit your graphs best. Of course, you can experiment with the above syntax yourself, each time changing the symbol size values. For example:

`qplot(HEIGHT, WEIGHT_1, data = M, xlab = "HEIGHT (cm)", ylab = "WEIGHT BEFORE TREATMENT (kg)" , size = factor(GENDER), color = factor(GENDER)) + scale_size_manual(values = c(2, 9))`

The difference in point sizes is now rather extreme, but you now see how to control symbol size. Soon we will learn how to control symbol colour too. See you later!

David

```r
# Create and display the dataset.
M <- structure(list(PATIENT = structure(c(32L, 15L, 41L, 42L, 44L, 17L, 31L, 10L, 38L, 18L, 22L, 30L), .Label = c("Adrienne", "Alan", "Andy", "Ann ", "Anne ", "Anton", "Audrey", "Ben", "Bernie", "Beth", "Bob", "Bobby", "Bruce", "Charles", "Dave", "Dianne", "Frida", "Guy", "Henry", "Hugh", "Ian", "Irina", "James", "Jim", "Jo ", "John", "Jonah", "Joseph", "Lesley", "Liz", "Magnus", "Mary", "Max", "Merril", "Mike", "Mikhail", "Nick", "Peter", "Robert", "Robin", "Simon", "Steve", "Stuart", "Sue", "Telu"), class = "factor"), GENDER = structure(c(1L, 2L, 2L, 2L, 1L, 1L, 2L, 1L, 2L, 2L, 1L, 1L), .Label = c("F", "M"), class = "factor"), TREATMENT = structure(c(1L, 2L, 3L, 1L, 1L, 2L, 1L, 3L, 1L, 3L, 2L, 3L), .Label = c("A", "B", "C"), class = "factor"), AGE = structure(c(3L, 2L, 2L, 1L, 2L, 2L, 1L, 1L, 2L, 1L, 2L, 2L), .Label = c("E", "M", "Y"), class = "factor"), WEIGHT_1 = c(79.2, 58.8, 72, 59.7, 79.6, 83.1, 68.7, 67.6, 79.1, 39.9, 64.7, 65.6), WEIGHT_2 = c(76.6, 59.3, 70.1, 57.3, 79.8, 82.3, 66.8, 67.4, 76.8, 41.4, 65.3, 63.2), HEIGHT = c(169L, 161L, 175L, 149L, 179L, 177L, 175L, 170L, 177L, 138L, 170L, 165L), SMOKE = structure(c(2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L), .Label = c("N", "Y"), class = "factor"), EXERCISE = c(TRUE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, TRUE, TRUE, FALSE, FALSE, TRUE), RECOVER = c(1L, 0L, 1L, 1L, 1L, 0L, 1L, 1L, 1L, 1L, 0L, 1L)), .Names = c("PATIENT", "GENDER", "TREATMENT", "AGE", "WEIGHT_1", "WEIGHT_2", "HEIGHT", "SMOKE", "EXERCISE", "RECOVER"), class = "data.frame", row.names = c(1L, 4L, 5L, 13L, 15L, 17L, 22L, 29L, 33L, 41L, 42L, 43L))
M

# Create a scatterplot of patient height against weight before treatment.
qplot(HEIGHT, WEIGHT_1, data = M, xlab = "HEIGHT (cm)", ylab = "WEIGHT BEFORE TREATMENT (kg)", size = factor(GENDER), color = factor(GENDER)) + scale_size_manual(values = c(5, 7))

# Change the symbol size values.
qplot(HEIGHT, WEIGHT_1, data = M, xlab = "HEIGHT (cm)", ylab = "WEIGHT BEFORE TREATMENT (kg)", size = factor(GENDER), color = factor(GENDER)) + scale_size_manual(values = c(2, 9))
```


]]>The post Quick start with R: Using qplot() function (Part 23) appeared first on My Statistical Consultant Blog.

Today we see how to use `qplot` to create a simple scatterplot.

The `qplot` (quick plot) system is a subset of the `ggplot2` (grammar of graphics) package, which you can use to create nice graphs. It is great for creating graphs of categorical data, because you can map symbol colour, size and shape to the levels of your categorical variable. To use `qplot`, first install `ggplot2` as follows:

`install.packages("ggplot2")`

and then load `ggplot2` using the command:

`library(ggplot2)`

The `qplot` syntax is as follows:

`qplot(x = X, y = X, data = X, colour = X, shape = X, geom = X, main = "Title")`

where:

- `x` gives the `x` values you wish to plot.
- `y` gives the `y` values you wish to plot. You now have bivariate data and must provide an appropriate `geom`.
- `data` gives the object name of the data frame.
- `colour` maps the colour scheme onto a factor variable, and `qplot` then selects different colours for different levels of the variable. You can use special syntax to set your own colours.
- `shape` maps the symbol shapes onto a factor variable, and `qplot` then selects different shapes for different levels of the factor variable. You can use special syntax to set your own shapes.
- `geom` provides a list of keywords that control the kind of plot, including: "`histogram`", "`density`", "`line`" and "`point`".
- `main` provides the title for the plot.

In `qplot`, you can set your desired aesthetics using the operator `I()`. For example, if you want red, use `colour = I("red")`. If you want to control the size of the symbols, use `size = I(N)`, where a value of `N` greater than 1 expands the symbols. For example, `size = I(5)` produces very big symbols.

Anyway – let’s start with a simple example where we set up a simple scatterplot with blue symbols. Now read in this dataset:

`T <- structure(list(A = c(1, 2, 4, 5, 6, 7), B = c(1, 4, 16, 25, 36, 49)), .Names = c("A", "B"), row.names = c(NA, -6L), class = "data.frame")`

`T`

Now plot `A` against `B`, using `I()` for colour and symbol size. We include axis labels of our choice and use symbol size 5 (large symbols).

`qplot(A, B, data = T, xlab = "NUMBERS", ylab = "VERTICAL AXIS", colour = I("blue"), size = I(5))`

Note the default background, grey in colour and including a grid. We can modify those attributes quite easily, and we will do so in a later blog.

Now we create a scatterplot with a smooth curve using `geom = c("smooth")`.

`qplot(A, B, data = T, xlab = "NUMBERS", ylab = "VERTICAL AXIS", colour = I("blue"), size = I(1), geom = c("smooth"))`

We chose `size = I(1)` for this example, but we can include a larger value to get a thicker line.

That wasn’t so hard! In Blog 24 we will look at further plotting techniques using `qplot`.

See you later!

David

```r
# Install the ggplot2 package.
install.packages("ggplot2")

# Load the ggplot2 package.
library(ggplot2)

# Read in and display the dataset.
T <- structure(list(A = c(1, 2, 4, 5, 6, 7), B = c(1, 4, 16, 25, 36, 49)), .Names = c("A", "B"), row.names = c(NA, -6L), class = "data.frame")
T

# Plot A against B.
qplot(A, B, data = T, xlab = "NUMBERS", ylab = "VERTICAL AXIS", colour = I("blue"), size = I(5))

# Create a scatterplot with a smooth curve.
qplot(A, B, data = T, xlab = "NUMBERS", ylab = "VERTICAL AXIS", colour = I("blue"), size = I(1), geom = c("smooth"))
```


]]>The post Quick start with R: Mathematical expressions for graphs (Part 22) appeared first on My Statistical Consultant Blog.

Mathematical expressions on graphs are made possible through `expression(paste())` and `substitute()`. If you need mathematical expressions as axis labels, switch off the default axes and include Greek symbols by writing them out in English. You can create fractions through the `frac()` command. Note how we obtain the plus-or-minus sign through the syntax `%+-%`.

Here is a nice example. Let’s create a set of 71 values from -6 to +6. These values are the horizontal axis values.

`x <- seq(-6, 6, len = 71)`

Now we plot a cosine function as a continuous curve (using `type="l"`) while suppressing the x axis using the syntax `xaxt="n"`:

`plot(x, cos(x), type="l", xaxt="n", xlab=expression(paste("Angle ", theta)), ylab=expression("cos "*beta))`

where we have inserted relevant mathematical text for the axis labels using `expression(paste())`. Here is the graph so far:

Now we create a horizontal axis to our own specifications, including relevant labels:

`axis(1, at = c(-2*pi, -1.5*pi, -pi, -pi/2, 0, pi/2, pi, 1.5*pi, 2*pi), lab = expression(-2*phi, -1.5*phi, -phi, -phi/2, 0, phi/2, phi, 1.5*phi, 2*phi))`

Let’s put in some mathematical expressions, centred appropriately. The first argument within each `text()` function gives the value along the horizontal axis about which the text will be centred.

`text(-0.7*pi,0.5,substitute(chi^2=="23.5"))`

`text(0.1*pi, -0.5, expression(paste(frac(alpha*omega, sigma*phi*sqrt(2*pi)), " ", e^{frac(-(5*x+2*mu)^3, 5*sigma^3)})))`

`text(0.3*pi, 0, expression(hat(z) %+-% frac(se, alpha)))`

Here is our graph, complete with mathematical expressions:

That wasn’t so hard! In Blog 23 we will look at further plotting techniques in R.

See you later!

David

```r
# Create a set of 71 horizontal axis values.
x <- seq(-6, 6, len = 71)

# Plot a cosine function.
plot(x, cos(x), type="l", xaxt="n", xlab=expression(paste("Angle ", theta)), ylab=expression("cos "*beta))

# Customise the horizontal axis.
axis(1, at = c(-2*pi, -1.5*pi, -pi, -pi/2, 0, pi/2, pi, 1.5*pi, 2*pi), lab = expression(-2*phi, -1.5*phi, -phi, -phi/2, 0, phi/2, phi, 1.5*phi, 2*phi))

# Add a few mathematical expressions.
text(-0.7*pi, 0.5, substitute(chi^2 == "23.5"))
text(0.1*pi, -0.5, expression(paste(frac(alpha*omega, sigma*phi*sqrt(2*pi)), " ", e^{frac(-(5*x+2*mu)^3, 5*sigma^3)})))
text(0.3*pi, 0, expression(hat(z) %+-% frac(se, alpha)))
```


]]>The post Quick start with R: Plotting multiple graphs on the same page (Part 21) appeared first on My Statistical Consultant Blog.

To plot multiple graphs on the same page, use `par(mfrow=c(A,B))`, where `A` refers to the number of rows and `B` to the number of columns (and where each cell will hold a single graph). This syntax sets up a plotting environment of `A` rows and `B` columns.

First we create four vectors, all of the same length.

`X <- c(1, 2, 3, 4, 5, 6, 7)`

`Y1 <- c(2, 4, 5, 7, 12, 14, 16)`

`Y2 <- c(3, 6, 7, 8, 9, 11, 12)`

`Y3 <- c(1, 7, 3, 2, 2, 7, 9)`

Now we set up a plotting environment of two rows and three columns (in order to hold six graphs), using the `mfrow` argument of `par()`:

`par(mfrow=c(2,3))`

Now we plot six graphs in the same plotting environment. We use the `plot()` command six times in succession, each time graphing one of the `Y` vectors against the `X` vector and choosing a different plotting symbol with the `pch` argument.

`plot(X, Y1, pch = 1)`

`plot(X, Y2, pch = 2)`

`plot(X, Y3, pch = 3)`

`plot(X, Y1, pch = 4)`

`plot(X, Y2, pch = 15)`

`plot(X, Y3, pch = 16)`

Our plot looks like this:

That wasn’t so hard! In Blog 22 we will look at further plotting techniques in R.

See you later!

David

```r
# Create four vectors of the same length.
X  <- c(1, 2, 3, 4, 5, 6, 7)
Y1 <- c(2, 4, 5, 7, 12, 14, 16)
Y2 <- c(3, 6, 7, 8, 9, 11, 12)
Y3 <- c(1, 7, 3, 2, 2, 7, 9)

# Set up multiple graphs (2 x 3) on the same page.
par(mfrow = c(2, 3))

# Create six scatterplots.
plot(X, Y1, pch = 1)
plot(X, Y2, pch = 2)
plot(X, Y3, pch = 3)
plot(X, Y1, pch = 4)
plot(X, Y2, pch = 15)
plot(X, Y3, pch = 16)
```
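One practical note, as a small sketch rather than part of the post: `par(mfrow = ...)` changes the layout for every subsequent plot in the session, so it is common to save the old settings and restore them when you are done:

```r
# Save the current graphical parameters while setting a 2 x 3 layout,
# then restore them so later plots go back to one per page.
op <- par(mfrow = c(2, 3))

X  <- c(1, 2, 3, 4, 5, 6, 7)
Y1 <- c(2, 4, 5, 7, 12, 14, 16)
for (p in c(1, 2, 3, 4, 15, 16)) plot(X, Y1, pch = p)

par(op)  # back to the previous (single-plot) layout
```

`par()` returns the old values of whatever parameters you set, which is what makes this save-and-restore idiom work.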

*New Zealand Institute of Sport* and Director of *Sigma Statistics and Research Ltd*. Author of the book: *R Graph Essentials*.

The post Quick start with R: Plotting multiple graphs on the same page (Part 21) appeared first on My Statistical Consultant Blog.


In this first post about the rsample package, we show how to use it to estimate a confidence interval for a mean; the example also illustrates the basic structure of the package. The task is as follows: we want to estimate the 95% confidence interval of the mean, and we suppose we do not know the analytical method for obtaining it (in this case, the t-test). The idea is to draw a large number of bootstrap samples (resamples, with replacement, of the same size as the original sample) from the original sample and to calculate the mean of each. The confidence interval is then estimated simply by taking the 0.025 and 0.975 quantiles of the resulting mean values.
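Before turning to rsample, the whole idea fits in a few lines of base R — a minimal sketch under the same settings as the example below (the names `boot_means` and `x` are illustrative):

```r
# Bootstrap 95% CI for the mean, without any packages.
set.seed(54321)
x <- rnorm(100, mean = 5, sd = 1)

# 1000 resamples (with replacement, same size as x) and their means.
boot_means <- replicate(1000, mean(sample(x, replace = TRUE)))

# The CI estimate is simply the 2.5% and 97.5% quantiles of those means.
quantile(boot_means, c(0.025, 0.975))
```

The rest of the post does exactly this, but with rsample keeping the resamples, their ids, and the derived statistics together in one tidy data frame.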

First, we will create a sample from a normal distribution with a mean value of 5 and standard deviation 1. The sample size is 100.

```r
library(rsample)
library(ggplot2)
library(dplyr)
library(purrr)

set.seed(54321)
data <- data_frame(x = rnorm(100, 5))
data
```

```
# A tibble: 100 x 1
       x
   <dbl>
 1  4.82
 2  4.07
 3  4.22
 4  3.35
 5  4.59
 6  3.90
 7  3.31
 8  7.52
 9  6.40
10  5.18
# ... with 90 more rows
```

We use the `bootstraps()` function from the rsample package to generate 1000 bootstrap samples. The generated samples are stored in the `splits` column of the returned data frame, and the `id` column holds the name of each sample.

```r
bt_data <- bootstraps(data, times = 1000)
bt_data
```

```
# Bootstrap sampling
# A tibble: 1,000 x 2
   splits       id
   <list>       <chr>
 1 <S3: rsplit> Bootstrap0001
 2 <S3: rsplit> Bootstrap0002
 3 <S3: rsplit> Bootstrap0003
 4 <S3: rsplit> Bootstrap0004
 5 <S3: rsplit> Bootstrap0005
 6 <S3: rsplit> Bootstrap0006
 7 <S3: rsplit> Bootstrap0007
 8 <S3: rsplit> Bootstrap0008
 9 <S3: rsplit> Bootstrap0009
10 <S3: rsplit> Bootstrap0010
# ... with 990 more rows
```

The structure of the first bootstrap sample can be seen via the following command:

`bt_data$splits[[1]]`

`<100/39/100>`

We see that the new sample contains 100 observations (the *analysis data* in rsample terminology) and that 39 observations ended up in the assessment set (often used to evaluate the performance of a model fitted to the analysis data). The last number is the size of the original data set. These counts are expected, because the bootstrap samples with replacement. The analysis data of a split is accessed with `analysis(split)`, and the assessment data with `assessment(split)`.
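The count of 39 assessment observations is no accident: sampling n rows with replacement leaves any given row out with probability (1 - 1/n)^n, which for large n approaches e^-1 ≈ 0.368, so roughly a third of the rows end up out-of-bag. A quick check in base R (illustrative, not from the post):

```r
# Each row of a size-n bootstrap sample is left out with
# probability (1 - 1/n)^n, close to exp(-1) ~ 0.368 for n = 100.
n <- 100
(1 - 1/n)^n  # theoretical leave-out probability
exp(-1)

# Empirically: draw one bootstrap sample and count the left-out rows.
set.seed(1)
idx <- sample(n, replace = TRUE)
length(setdiff(seq_len(n), idx))
```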

Once we have created the bootstrap samples, we calculate the mean of each. We create an auxiliary function, `get_split_mean()`, to calculate the mean of a single sample, and later apply it to every sample.

```r
get_split_mean <- function(split) {
  # access the sample data
  split_data <- analysis(split)
  # calculate the sample mean value
  split_mean <- mean(split_data$x)
  return(split_mean)
}
```

Using the `map_dbl()` function from the purrr package, we pass over all the samples and calculate their mean values. We add the new vector of means to the existing `bt_data` data frame. In this way we get a very convenient structure (*tidy data*) for further work.

```r
bt_data$bt_means <- map_dbl(bt_data$splits, get_split_mean)
bt_data
```

```
# Bootstrap sampling
# A tibble: 1,000 x 3
   splits       id            bt_means
   <list>       <chr>            <dbl>
 1 <S3: rsplit> Bootstrap0001     5.02
 2 <S3: rsplit> Bootstrap0002     4.93
 3 <S3: rsplit> Bootstrap0003     4.99
 4 <S3: rsplit> Bootstrap0004     4.86
 5 <S3: rsplit> Bootstrap0005     4.90
 6 <S3: rsplit> Bootstrap0006     5.03
 7 <S3: rsplit> Bootstrap0007     5.10
 8 <S3: rsplit> Bootstrap0008     5.00
 9 <S3: rsplit> Bootstrap0009     4.78
10 <S3: rsplit> Bootstrap0010     5.07
# ... with 990 more rows
```
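For readers without purrr: base R's `vapply()` does the same job as `map_dbl()` — apply a function over a list and guarantee one numeric result per element. A sketch with plain vectors standing in for the rsplit objects:

```r
# The map_dbl() step in base R: vapply() over a list, with numeric(1)
# declaring that each result must be a single double.
samples <- list(c(1, 2, 3), c(4, 5, 6), c(7, 8, 9))
means <- vapply(samples, mean, numeric(1))
means  # 2 5 8
```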

The 95% confidence interval is obtained with the `quantile()` function.

```r
bt_ci <- round(quantile(bt_data$bt_means, c(0.025, 0.975)), 3)
bt_ci
```

```
 2.5% 97.5%
4.704 5.117
```

For comparison, we calculate the confidence interval from the original sample using the `t.test()` function.

```r
t_test <- t.test(data$x)
tt_ci <- round(t_test$conf.int, 3)
tt_ci
```

```
[1] 4.697 5.117
attr(,"conf.level")
[1] 0.95
```

We see that the intervals are almost identical. In the same way we can estimate other parameters for which an analytical solution can’t be obtained or the solution is too complex.
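For example, a confidence interval for the median — for which no simple textbook formula applies — needs only one change: compute `median()` instead of `mean()` inside `get_split_mean()`. Sketched here in base R for brevity (names illustrative):

```r
# Bootstrap 95% CI for the median: identical recipe, different statistic.
set.seed(54321)
x <- rnorm(100, mean = 5, sd = 1)
boot_medians <- replicate(1000, median(sample(x, replace = TRUE)))
round(quantile(boot_medians, c(0.025, 0.975)), 3)
```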

The estimate of the confidence interval using bootstrap samples is shown in the following graph. The code for the graph is in the Appendix.

```r
# Uploading required R packages
library(rsample)
library(ggplot2)
library(dplyr)
library(purrr)

# Generating sample from a normal distribution
set.seed(54321)
data <- data_frame(x = rnorm(100, 5))
data

# Generate 1000 bootstrap samples
bt_data <- bootstraps(data, times = 1000)
bt_data

# Display the structure of the first bootstrap sample
bt_data$splits[[1]]

# Create an auxiliary function for calculating the sample mean value
get_split_mean <- function(split) {
  # access the sample data
  split_data <- analysis(split)
  # calculate the sample mean value
  split_mean <- mean(split_data$x)
  return(split_mean)
}

# Adding the new vector to the existing data frame.
bt_data$bt_means <- map_dbl(bt_data$splits, get_split_mean)
bt_data

# Calculate the 95% confidence interval.
bt_ci <- round(quantile(bt_data$bt_means, c(0.025, 0.975)), 3)
bt_ci

# Calculate the confidence interval of the original set.
t_test <- t.test(data$x)
tt_ci <- round(t_test$conf.int, 3)
tt_ci

#################
# Generate graph
#################
bt_ci_lower <- quantile(bt_data$bt_means, c(0.025))
bt_ci_upper <- quantile(bt_data$bt_means, c(0.975))
bt_mean <- mean(bt_data$bt_means)

bt_data <- bt_data %>%
  dplyr::mutate(Color = ifelse(bt_means < bt_ci_lower | bt_means > bt_ci_upper,
                               "Out of 95% CI", "In 95% CI"))

ggplot(bt_data, aes(x = bt_means, y = 0)) +
  geom_jitter(aes(color = Color), alpha = 0.6, size = 3, width = 0) +
  geom_vline(xintercept = round(c(bt_ci_lower, bt_mean, bt_ci_upper), 2),
             linetype = c(2, 1, 2)) +
  scale_x_continuous(breaks = round(c(bt_ci_lower, bt_mean, bt_ci_upper), 2),
                     labels = c("Lower CI (95%)", "Mean", "Upper CI (95%)"),
                     sec.axis = sec_axis(~.,
                                         breaks = round(c(bt_ci_lower, bt_mean, bt_ci_upper), 2),
                                         labels = round(c(bt_ci_lower, bt_mean, bt_ci_upper), 2))) +
  scale_color_manual(values = c("gray40", "firebrick4")) +
  theme_void() +
  theme(legend.title = element_blank(),
        legend.position = "top",
        legend.text = element_text(size = 14),
        axis.text.x = element_text(size = 14))
```

The post Rsample (Part 1) – Bootstrap estimate of a confidence interval for a mean appeared first on My Statistical Consultant Blog.
