`lm()`

command to perform a least squares regression on them, and diagnosing our regression using the `plot()`

command. Here are the data again.
`height = c(176, 154, 138, 196, 132, 176, 181, 169, 150, 175)`

`bodymass = c(82, 49, 53, 112, 47, 69, 77, 71, 62, 78)`

Just as we did last time, we perform the regression using `lm()`

. This time we store it as an object `M`

. Indeed – R allows you to do that!

`M <- lm(height ~ bodymass)`

Now we use the `summary()`

command to obtain useful information about our regression:

`summary(M)`

Our model *p*-value is very significant (approximately 0.0004) and we have very good explanatory power (over 81% of the variability in height is explained by body mass).
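If you would rather have these figures as numbers than read them off the printed summary, you can extract them from the summary object. The following is a minimal sketch; the names `s`, `r2`, `fstat` and `p_model` are our own choices for illustration.

```r
# Data from this post.
height <- c(176, 154, 138, 196, 132, 176, 181, 169, 150, 175)
bodymass <- c(82, 49, 53, 112, 47, 69, 77, 71, 62, 78)

M <- lm(height ~ bodymass)
s <- summary(M)

# Multiple R-squared: the proportion of variability in height explained by body mass.
r2 <- s$r.squared

# The summary stores the F statistic together with its degrees of freedom.
# The model p-value is the upper tail of the corresponding F distribution.
fstat <- s$fstatistic
p_model <- unname(pf(fstat["value"], fstat["numdf"], fstat["dendf"], lower.tail = FALSE))

r2
p_model
```

Storing the values this way makes it easy to compare several models without copying numbers by hand.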

We saw in the previous blog that points 2, 4, 5 and 6 have great influence on the model. Now we see how to re-fit our model while omitting one datum. Let’s omit point 6. Note the syntax we use to do so, involving the `subset()`

command inside the `lm()`

command and omitting the point using the syntax `!=`

which stands for “not equal to”. The syntax instructs R to fit a linear model on a subset of the data in which all points are included except the sixth point.

`M2 <- lm(height ~ bodymass, subset=(1:length(height)!=6))`

`summary(M2)`

Because we have omitted one observation, we have lost one degree of freedom (from 8 to 7) but our model has greater explanatory power (i.e. the Multiple *R*-Squared has increased from 0.81 to 0.85). From that perspective, our model has improved, but of course, point 6 may well be a valid observation, and perhaps should be retained. Whether you omit or retain such data is a matter of judgement.
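To see the two fits side by side, we can collect the residual degrees of freedom and the Multiple *R*-squared in a small table. This is only a sketch; the object names `M3` and `comparison` are ours. It also shows that a negative index in `subset` should give the same fit as the `!=` syntax.

```r
# Data from this post.
height <- c(176, 154, 138, 196, 132, 176, 181, 169, 150, 175)
bodymass <- c(82, 49, 53, 112, 47, 69, 77, 71, 62, 78)

M  <- lm(height ~ bodymass)
M2 <- lm(height ~ bodymass, subset = (1:length(height) != 6))

# A negative index drops the sixth observation.
M3 <- lm(height ~ bodymass, subset = -6)

# Residual degrees of freedom and R-squared for the full and reduced models.
comparison <- data.frame(
  model     = c("all points", "point 6 omitted"),
  resid_df  = c(df.residual(M), df.residual(M2)),
  r_squared = c(summary(M)$r.squared, summary(M2)$r.squared)
)
comparison

# The two ways of omitting point 6 give the same coefficients.
all.equal(coef(M2), coef(M3))
```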

Our diagnostic plots were as follows:

When we compare them with the diagnostic plots in the previous blog, we see no substantial changes. In other words, omitting point 6 didn’t improve the regression diagnostics.

David

    # Create two variables.
    height = c(176, 154, 138, 196, 132, 176, 181, 169, 150, 175)
    bodymass = c(82, 49, 53, 112, 47, 69, 77, 71, 62, 78)
    # Store the regression model as an object.
    M <- lm(height ~ bodymass)
    # Obtain useful information about regression.
    summary(M)
    # Store regression model as object after omitting point 6.
    M2 <- lm(height ~ bodymass, subset=(1:length(height)!=6))
    # Obtain useful information about new regression.
    summary(M2)
    # Create a plotting environment of two rows and two columns and plot the model.
    par(mfrow = c(2,2))
    plot(M2)

Senior Academic Manager in *New Zealand Institute of Sport* and Director of *Sigma Statistics and Research Ltd*. Author of the book: *R Graph Essentials*.

`lm()`

command to perform a least squares regression on them, treating one of them as the dependent variable and the other as the independent variable. Here they are again.
`height = c(176, 154, 138, 196, 132, 176, 181, 169, 150, 175)`

`bodymass = c(82, 49, 53, 112, 47, 69, 77, 71, 62, 78)`

Today we learn how to obtain more useful diagnostic information about a regression model. As before, we perform the regression.

`lm(height ~ bodymass)`

Now we can use several R diagnostic plots and influence statistics to understand our model. These diagnostic plots are as follows:

- Residuals vs. fitted values
- Q-Q plots
- Scale Location plots
- Cook’s distance plots.

To use R’s regression diagnostic plots, we set up the regression model as an object and create a plotting environment of two rows and two columns. Then we use the `plot()`

command, treating the model as an argument.

`model <- lm(height ~ bodymass)`

`par(mfrow = c(2,2))`

`plot(model)`

The first plot gives the residuals plotted against the fitted values. If our data had no scatter, so that all points fell on the regression line, then every datum would fall on the horizontal line of the first plot. The red curve is a smoothed representation of the residuals, and ideally should be relatively flat and close to the horizontal line; that is – there should be no trend. However, this is not the case for our data. To summarise: the first plot (residuals vs. fitted values) should look more or less random, but that is not what we see here.

The second plot is a Quantile-Quantile (Q-Q) plot of the residuals. This plot helps us to identify whether or not the residuals are distributed normally. In our example, most of the points lie close to the dashed line. If the residuals were distributed normally, all of the points would lie along this line. For real data, there will be deviations, but any deviations should be small. To summarise: the second plot (normal Q-Q errors) will give a straight line if the errors are distributed normally, but points 4, 5 and 6 deviate from the straight line.

The third plot (Scale-Location) gives the square root of the absolute standardized residuals, plotted against the fitted values. (Standardized residuals are rescaled to have a mean of zero and a variance of unity.) Large residuals (both positive and negative) appear at the top of the plot, while small residuals appear at the bottom. The red curve indicates any trend in the standardized residuals. If the red line is reasonably flat, then the variance in the residuals doesn’t change greatly over the range of the independent variable (and we have homoscedastic data). To summarise: the third plot is similar to the first, and should look random. However, ours does not.

The last plot (lower right) gives the standardized residuals, plotted against leverage. For normally distributed residuals, the standardized residuals will be centred symmetrically on zero. Leverage provides one measure of the extent to which each point influences the regression. Because the regression line passes through the geometric centre of the data, points that lie far from the geometric centre have greater leverage, and their leverage increases when the points are relatively isolated (i.e. there are not many points close to the point of interest). The leverage of any point depends on the distance from the geometric centre and on the isolation of that point. Data that are simultaneously outliers and have high leverage influence both the slope and the intercept of a regression model. We see that point 4 has high leverage.

The last plot also gives contours of Cook’s distance, which is a measure of how much the regression would change if a point were omitted from the regression. Cook’s distance increases when leverage is large and when the residuals are large: any point that lies far from the geometric centre and also has a large residual distorts the regression. Ideally, the red smoothed line remains close to the horizontal dashed line and no points have a large Cook’s distance (i.e. > 0.5). Neither of these two conditions applies for our data. To summarise: the last plot (Cook’s distance) tells us which points have the greatest influence on the regression (leverage points). We see that point 4 (having both high leverage and high Cook’s distance) has considerable influence on the model.
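The quantities behind the last plot can also be printed as numbers, which helps when it is hard to judge from the graph which points stand out. Here is a minimal sketch using the base R functions `hatvalues()`, `rstandard()` and `cooks.distance()`; the data frame name `diagnostics` and its column names are our own.

```r
# Data from this post.
height <- c(176, 154, 138, 196, 132, 176, 181, 169, 150, 175)
bodymass <- c(82, 49, 53, 112, 47, 69, 77, 71, 62, 78)

model <- lm(height ~ bodymass)

# One row per observation: leverage, standardized residual and Cook's distance.
diagnostics <- data.frame(
  point    = seq_along(height),
  leverage = hatvalues(model),
  std_res  = rstandard(model),
  cooks_d  = cooks.distance(model)
)

# Points ordered from most to least influential by Cook's distance.
diagnostics[order(diagnostics$cooks_d, decreasing = TRUE), ]
```

For these data, point 4 should appear at the top of the table, in agreement with the plot.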

David

    # Create two variables.
    height = c(176, 154, 138, 196, 132, 176, 181, 169, 150, 175)
    bodymass = c(82, 49, 53, 112, 47, 69, 77, 71, 62, 78)
    # Estimate the regression model.
    lm(height ~ bodymass)
    # Store the regression model as an object.
    model <- lm(height ~ bodymass)
    # Create a plotting environment of two rows and two columns and plot the model.
    par(mfrow = c(2,2))
    plot(model)


`lm()`

command to perform a least squares regression on them, treating one of them as the dependent variable and the other as the independent variable. Here they are again.

`height = c(176, 154, 138, 196, 132, 176, 181, 169, 150, 175)`

`bodymass = c(82, 49, 53, 112, 47, 69, 77, 71, 62, 78)`

Today we learn how to obtain useful diagnostic information about a regression model and then how to draw residuals on a plot. As before, we perform the regression.

`lm(height ~ bodymass)`

Now let’s find out more about the regression. First, let’s store the regression model as an object called `mod`

and then use the `summary()`

command to learn about the regression.

`mod <- lm(height ~ bodymass)`

`summary(mod)`

Here is what R gives you.

R has given you a great deal of diagnostic information about the regression. The most useful items are the coefficients themselves, the Adjusted *R*-squared, the *F*-statistic and the *p*-value for the model.

Now let’s use R’s `predict()`

command to create a vector of fitted values.

`regmodel <- predict(lm(height ~ bodymass))`

`regmodel`

Here are the fitted values:

Now let’s plot the data and regression line again.

`plot(bodymass, height, pch = 16, cex = 1.3, col = "blue", main = "HEIGHT PLOTTED AGAINST BODY MASS", xlab = "BODY MASS (kg)", ylab = "HEIGHT (cm)")`

`abline(lm(height ~ bodymass))`

We can plot the residuals using R’s for loop and a subscript `k`

that runs from 1 to the number of data points. We know that there are 10 data points, but if we do not know the number of data we can find it using the `length()`

command on either the height or body mass variable.

`npoints <- length(height)`

`npoints`

Now let’s implement the loop and draw the residuals (the differences between the observed data and the corresponding fitted values) using the `lines()`

command. Note the syntax we use to draw in the residuals.

`for (k in 1: npoints) lines(c(bodymass[k], bodymass[k]), c(height[k], regmodel[k]))`
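As an alternative to the loop, base R’s `segments()` function accepts whole vectors, so the residuals can be drawn in a single call. The sketch below should produce the same picture; the names `fit` and `res` are ours, and `fitted()` is used in place of `predict()`, which gives the same values for the original data.

```r
# Data from this post.
height <- c(176, 154, 138, 196, 132, 176, 181, 169, 150, 175)
bodymass <- c(82, 49, 53, 112, 47, 69, 77, 71, 62, 78)

mod <- lm(height ~ bodymass)

plot(bodymass, height, pch = 16, cex = 1.3, col = "blue",
     main = "HEIGHT PLOTTED AGAINST BODY MASS",
     xlab = "BODY MASS (kg)", ylab = "HEIGHT (cm)")
abline(mod)

# Fitted values for the observed body masses.
fit <- fitted(mod)

# One vectorised call draws a vertical segment from each observation to the line.
segments(x0 = bodymass, y0 = height, x1 = bodymass, y1 = fit)

# The residuals are the observed values minus the fitted values.
res <- height - fit
```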

Here is our plot, including the residuals.

None of this was so difficult!

Next time we will look at more advanced aspects of regression models and see what R has to offer. See you then!

David

    # Create two variables.
    height = c(176, 154, 138, 196, 132, 176, 181, 169, 150, 175)
    bodymass = c(82, 49, 53, 112, 47, 69, 77, 71, 62, 78)
    # Estimate the regression model.
    lm(height ~ bodymass)
    # Store the regression model as an object.
    mod <- lm(height ~ bodymass)
    summary(mod)
    # Create a vector of fitted values.
    regmodel <- predict(lm(height ~ bodymass))
    regmodel
    # Plot the data and regression line.
    plot(bodymass, height, pch = 16, cex = 1.3, col = "blue", main = "HEIGHT PLOTTED AGAINST BODY MASS", xlab = "BODY MASS (kg)", ylab = "HEIGHT (cm)")
    abline(lm(height ~ bodymass))
    # Find the number of data.
    npoints <- length(height)
    npoints
    # Draw in the residuals.
    for (k in 1: npoints) lines(c(bodymass[k], bodymass[k]), c(height[k], regmodel[k]))


`height <- c(176, 154, 138, 196, 132, 176, 181, 169, 150, 175)`

Now let’s take `bodymass` to be a variable that describes the masses (in kg) of the same ten people. Copy and paste the following code to the R command line to create the `bodymass` variable.

`bodymass <- c(82, 49, 53, 112, 47, 69, 77, 71, 62, 78)`

Both variables are now stored in the R workspace. To view them, enter:

`height`

`bodymass`

We can now create a simple plot of the two variables as follows:

`plot(bodymass, height)`

We can enhance this plot using various arguments within the `plot()`

command. Copy and paste the following code into the R workspace:

`plot(bodymass, height, pch = 16, cex = 1.3, col = "blue", main = "HEIGHT PLOTTED AGAINST BODY MASS", xlab = "BODY MASS (kg)", ylab = "HEIGHT (cm)")`

In the above code, the syntax `pch = 16`

creates solid dots, while `cex = 1.3`

creates dots that are 1.3 times bigger than the default (where `cex = 1`

). More about these commands later.

Now let’s perform a linear regression using `lm()`

on the two variables by adding the following text at the command line:

`lm(height ~ bodymass)`

We see that the intercept is 98.0054 and the slope is 0.9528. By the way – `lm`

stands for “linear model”.

Finally, we can add a best fit line (regression line) to our plot by adding the following text at the command line:

`abline(98.0054, 0.9528)`

Another line of syntax that will plot the regression line is:

`abline(lm(height ~ bodymass))`
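If you prefer not to type the coefficients by hand, you can take them from the fitted model with `coef()`. This is a small sketch; the names `fit` and `est` are ours.

```r
# Data from this post.
height <- c(176, 154, 138, 196, 132, 176, 181, 169, 150, 175)
bodymass <- c(82, 49, 53, 112, 47, 69, 77, 71, 62, 78)

fit <- lm(height ~ bodymass)

# Named vector holding the intercept and the slope.
est <- coef(fit)
est

plot(bodymass, height)

# In abline(), a is the intercept and b is the slope.
abline(a = est["(Intercept)"], b = est["bodymass"])
```

This avoids rounding the coefficients and keeps the line in step with the model if the data change.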

None of this was so difficult! In our next blog we will look again at regression.

David

    # Create the height variable.
    height <- c(176, 154, 138, 196, 132, 176, 181, 169, 150, 175)
    # Create the weight variable.
    bodymass <- c(82, 49, 53, 112, 47, 69, 77, 71, 62, 78)
    # View both variables.
    height
    bodymass
    # Create a scatterplot of height against bodymass variable.
    plot(bodymass, height)
    # Create more complete scatterplot of height against bodymass variable.
    plot(bodymass, height, pch = 16, cex = 1.3, col = "blue", main = "HEIGHT PLOTTED AGAINST BODY MASS", xlab = "BODY MASS (kg)", ylab = "HEIGHT (cm)")
    # Perform a linear regression on the two variables.
    lm(height ~ bodymass)
    # Add a best fit line (regression line) to the plot.
    abline(98.0054, 0.9528)
    # Alternatively we can plot the regression line using the following syntax:
    abline(lm(height ~ bodymass))


`qplot`

to map symbol size to a categorical variable. Now we see how to control symbol colours and create legend titles. Copy in the same dataset as in Blog 24 (a medical dataset relating to patients in a randomised controlled trial):

`M <- structure(list(PATIENT = structure(c(32L, 15L, 41L, 42L, 44L, 17L, 31L, 10L, 38L, 18L, 22L, 30L), .Label = c("Adrienne", "Alan", "Andy", "Ann ", "Anne ", "Anton", "Audrey", "Ben", "Bernie", "Beth", "Bob", "Bobby", "Bruce", "Charles", "Dave", "Dianne", "Frida", "Guy", "Henry", "Hugh", "Ian", "Irina", "James", "Jim", "Jo ", "John", "Jonah", "Joseph", "Lesley", "Liz", "Magnus", "Mary", "Max", "Merril", "Mike", "Mikhail", "Nick", "Peter", "Robert", "Robin", "Simon", "Steve", "Stuart", "Sue", "Telu"), class = "factor"), GENDER = structure(c(1L, 2L, 2L, 2L, 1L, 1L, 2L, 1L, 2L, 2L, 1L, 1L), .Label = c("F", "M"), class = "factor"), TREATMENT = structure(c(1L, 2L, 3L, 1L, 1L, 2L, 1L, 3L, 1L, 3L, 2L, 3L), .Label = c("A", "B", "C"), class = "factor"), AGE = structure(c(3L, 2L, 2L, 1L, 2L, 2L, 1L, 1L, 2L, 1L, 2L, 2L), .Label = c("E", "M", "Y"), class = "factor"), WEIGHT_1 = c(79.2, 58.8, 72, 59.7, 79.6, 83.1, 68.7, 67.6, 79.1, 39.9, 64.7, 65.6), WEIGHT_2 = c(76.6, 59.3, 70.1, 57.3, 79.8, 82.3, 66.8, 67.4, 76.8, 41.4, 65.3, 63.2), HEIGHT = c(169L, 161L, 175L, 149L, 179L, 177L, 175L, 170L, 177L, 138L, 170L, 165L), SMOKE = structure(c(2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L), .Label = c("N", "Y"), class = "factor"), EXERCISE = c(TRUE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, TRUE, TRUE, FALSE, FALSE, TRUE), RECOVER = c(1L, 0L, 1L, 1L, 1L, 0L, 1L, 1L, 1L, 1L, 0L, 1L)), .Names = c("PATIENT", "GENDER", "TREATMENT", "AGE", "WEIGHT_1", "WEIGHT_2", "HEIGHT", "SMOKE", "EXERCISE", "RECOVER"), class = "data.frame", row.names = c(1L, 4L, 5L, 13L, 15L, 17L, 22L, 29L, 33L, 41L, 42L, 43L))`

`M`

Now let’s map symbol size to `GENDER`

and symbol colour to `EXERCISE`

, but choosing our own colours. To control your symbol colours, use the layer: `scale_colour_manual(values = )`

and select your desired colours. We choose red and blue, and symbol sizes 3 and 7.

`qplot(HEIGHT, WEIGHT_1, data = M, geom = c("point"), xlab = "HEIGHT (cm)", ylab = "WEIGHT BEFORE TREATMENT (kg)" , size = factor(GENDER), color = factor(EXERCISE)) + scale_size_manual(values = c(3, 7)) + scale_colour_manual(values = c("red", "blue"))`

Here is our graph with red and blue points:

Now let’s see how to control the legend title (the title that sits directly above the legend). For this example, we control the legend title through the name argument within the two functions `scale_size_manual()`

and `scale_colour_manual()`

. Enter this syntax in which we choose appropriate legend titles:

`qplot(HEIGHT, WEIGHT_1, data = M, geom = c("point"), xlab = "HEIGHT (cm)", ylab = "WEIGHT BEFORE TREATMENT (kg)" , size = factor(GENDER), color = factor(EXERCISE)) + scale_size_manual(values = c(3, 7), name="Gender") + scale_colour_manual(values = c("red","blue"), name="Exercise")`

We now have our preferred symbol colour and size, and legend titles of our choosing.
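For readers who use the full `ggplot()` interface (recent releases of `ggplot2` mark `qplot()` as deprecated), the same graph can be sketched as follows. This is an illustration only: `M_small` is our own name for a cut-down copy of the dataset holding just the four columns used in the plot, with `GENDER` written out as text.

```r
library(ggplot2)

# The four plotted columns, copied from the dataset in this post.
M_small <- data.frame(
  GENDER   = c("F", "M", "M", "M", "F", "F", "M", "F", "M", "M", "F", "F"),
  EXERCISE = c(TRUE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, TRUE, TRUE, FALSE, FALSE, TRUE),
  WEIGHT_1 = c(79.2, 58.8, 72, 59.7, 79.6, 83.1, 68.7, 67.6, 79.1, 39.9, 64.7, 65.6),
  HEIGHT   = c(169, 161, 175, 149, 179, 177, 175, 170, 177, 138, 170, 165)
)

# Map size to GENDER and colour to EXERCISE, then set values and legend titles.
p <- ggplot(M_small, aes(x = HEIGHT, y = WEIGHT_1,
                         size = factor(GENDER), colour = factor(EXERCISE))) +
  geom_point() +
  scale_size_manual(values = c(3, 7), name = "Gender") +
  scale_colour_manual(values = c("red", "blue"), name = "Exercise") +
  labs(x = "HEIGHT (cm)", y = "WEIGHT BEFORE TREATMENT (kg)")

p
```

The two `scale_*_manual()` layers are the same ones used with `qplot` above, so the symbol sizes, colours and legend titles carry over unchanged.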

David

    # Create and display the dataset.
    M <- structure(list(PATIENT = structure(c(32L, 15L, 41L, 42L, 44L, 17L, 31L, 10L, 38L, 18L, 22L, 30L), .Label = c("Adrienne", "Alan", "Andy", "Ann ", "Anne ", "Anton", "Audrey", "Ben", "Bernie", "Beth", "Bob", "Bobby", "Bruce", "Charles", "Dave", "Dianne", "Frida", "Guy", "Henry", "Hugh", "Ian", "Irina", "James", "Jim", "Jo ", "John", "Jonah", "Joseph", "Lesley", "Liz", "Magnus", "Mary", "Max", "Merril", "Mike", "Mikhail", "Nick", "Peter", "Robert", "Robin", "Simon", "Steve", "Stuart", "Sue", "Telu"), class = "factor"), GENDER = structure(c(1L, 2L, 2L, 2L, 1L, 1L, 2L, 1L, 2L, 2L, 1L, 1L), .Label = c("F", "M"), class = "factor"), TREATMENT = structure(c(1L, 2L, 3L, 1L, 1L, 2L, 1L, 3L, 1L, 3L, 2L, 3L), .Label = c("A", "B", "C"), class = "factor"), AGE = structure(c(3L, 2L, 2L, 1L, 2L, 2L, 1L, 1L, 2L, 1L, 2L, 2L), .Label = c("E", "M", "Y"), class = "factor"), WEIGHT_1 = c(79.2, 58.8, 72, 59.7, 79.6, 83.1, 68.7, 67.6, 79.1, 39.9, 64.7, 65.6), WEIGHT_2 = c(76.6, 59.3, 70.1, 57.3, 79.8, 82.3, 66.8, 67.4, 76.8, 41.4, 65.3, 63.2), HEIGHT = c(169L, 161L, 175L, 149L, 179L, 177L, 175L, 170L, 177L, 138L, 170L, 165L), SMOKE = structure(c(2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L), .Label = c("N", "Y"), class = "factor"), EXERCISE = c(TRUE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, TRUE, TRUE, FALSE, FALSE, TRUE), RECOVER = c(1L, 0L, 1L, 1L, 1L, 0L, 1L, 1L, 1L, 1L, 0L, 1L)), .Names = c("PATIENT", "GENDER", "TREATMENT", "AGE", "WEIGHT_1", "WEIGHT_2", "HEIGHT", "SMOKE", "EXERCISE", "RECOVER"), class = "data.frame", row.names = c(1L, 4L, 5L, 13L, 15L, 17L, 22L, 29L, 33L, 41L, 42L, 43L))
    M
    # Create a scatterplot of patient height against weight before treatment.
    qplot(HEIGHT, WEIGHT_1, data = M, geom = c("point"), xlab = "HEIGHT (cm)", ylab = "WEIGHT BEFORE TREATMENT (kg)" , size = factor(GENDER), color = factor(EXERCISE)) + scale_size_manual(values = c(3, 7)) + scale_colour_manual(values = c("red", "blue"))
    # Change the legend.
    qplot(HEIGHT, WEIGHT_1, data = M, geom = c("point"), xlab = "HEIGHT (cm)", ylab = "WEIGHT BEFORE TREATMENT (kg)" , size = factor(GENDER), color = factor(EXERCISE)) + scale_size_manual(values = c(3, 7), name="Gender") + scale_colour_manual(values = c("red","blue"), name="Exercise")


`qplot`

to map symbol colour to a categorical variable. Copy in the following dataset (a medical dataset relating to patients in a randomised controlled trial):`M <- structure(list(PATIENT = structure(c(32L, 15L, 41L, 42L, 44L, 17L, 31L, 10L, 38L, 18L, 22L, 30L), .Label = c("Adrienne", "Alan", "Andy", "Ann ", "Anne ", "Anton", "Audrey", "Ben", "Bernie", "Beth", "Bob", "Bobby", "Bruce", "Charles", "Dave", "Dianne", "Frida", "Guy", "Henry", "Hugh", "Ian", "Irina", "James", "Jim", "Jo ", "John", "Jonah", "Joseph", "Lesley", "Liz", "Magnus", "Mary", "Max", "Merril", "Mike", "Mikhail", "Nick", "Peter", "Robert", "Robin", "Simon", "Steve", "Stuart", "Sue", "Telu"), class = "factor"), GENDER = structure(c(1L, 2L, 2L, 2L, 1L, 1L, 2L, 1L, 2L, 2L, 1L, 1L), .Label = c("F", "M"), class = "factor"), TREATMENT = structure(c(1L, 2L, 3L, 1L, 1L, 2L, 1L, 3L, 1L, 3L, 2L, 3L), .Label = c("A", "B", "C"), class = "factor"), AGE = structure(c(3L, 2L, 2L, 1L, 2L, 2L, 1L, 1L, 2L, 1L, 2L, 2L), .Label = c("E", "M", "Y"), class = "factor"), WEIGHT_1 = c(79.2, 58.8, 72, 59.7, 79.6, 83.1, 68.7, 67.6, 79.1, 39.9, 64.7, 65.6), WEIGHT_2 = c(76.6, 59.3, 70.1, 57.3, 79.8, 82.3, 66.8, 67.4, 76.8, 41.4, 65.3, 63.2), HEIGHT = c(169L, 161L, 175L, 149L, 179L, 177L, 175L, 170L, 177L, 138L, 170L, 165L), SMOKE = structure(c(2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L), .Label = c("N", "Y"), class = "factor"), EXERCISE = c(TRUE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, TRUE, TRUE, FALSE, FALSE, TRUE), RECOVER = c(1L, 0L, 1L, 1L, 1L, 0L, 1L, 1L, 1L, 1L, 0L, 1L)), .Names = c("PATIENT", "GENDER", "TREATMENT", "AGE", "WEIGHT_1", "WEIGHT_2", "HEIGHT", "SMOKE", "EXERCISE", "RECOVER"), class = "data.frame", row.names = c(1L, 4L, 5L, 13L, 15L, 17L, 22L, 29L, 33L, 41L, 42L, 43L))`

`M`

Now we create a scatterplot of patient height against weight before treatment, and we map both symbol size and colour to `GENDER`

using `factor()`

. Enter the following syntax:

`qplot(HEIGHT, WEIGHT_1, data = M, xlab = "HEIGHT (cm)", ylab = "WEIGHT BEFORE TREATMENT (kg)" , size = factor(GENDER), color = factor(GENDER)) + scale_size_manual(values = c(5, 7))`

Note how we mapped symbol size and colour to `GENDER`

using the syntax:

`size = factor(GENDER)`

and `color = factor(GENDER)`

Also note how we controlled symbol size using the layer:

`+ scale_size_manual(values = c(5, 7))`

In this example I have chosen symbol sizes of 5 and 7. You may select different sizes, depending on your preferences. Very quickly you will gain experience and select the symbol sizes that suit your graphs best. Of course you can experiment with the above syntax yourselves, each time changing the symbol size values. For example:

`qplot(HEIGHT, WEIGHT_1, data = M, xlab = "HEIGHT (cm)", ylab = "WEIGHT BEFORE TREATMENT (kg)" , size = factor(GENDER), color = factor(GENDER)) + scale_size_manual(values = c(2, 9))`

The difference in point sizes is now rather extreme, but you now see how to control symbol size. Soon we will learn how to control symbol colour too. See you later!

David

    # Create and display the dataset.
    M <- structure(list(PATIENT = structure(c(32L, 15L, 41L, 42L, 44L, 17L, 31L, 10L, 38L, 18L, 22L, 30L), .Label = c("Adrienne", "Alan", "Andy", "Ann ", "Anne ", "Anton", "Audrey", "Ben", "Bernie", "Beth", "Bob", "Bobby", "Bruce", "Charles", "Dave", "Dianne", "Frida", "Guy", "Henry", "Hugh", "Ian", "Irina", "James", "Jim", "Jo ", "John", "Jonah", "Joseph", "Lesley", "Liz", "Magnus", "Mary", "Max", "Merril", "Mike", "Mikhail", "Nick", "Peter", "Robert", "Robin", "Simon", "Steve", "Stuart", "Sue", "Telu"), class = "factor"), GENDER = structure(c(1L, 2L, 2L, 2L, 1L, 1L, 2L, 1L, 2L, 2L, 1L, 1L), .Label = c("F", "M"), class = "factor"), TREATMENT = structure(c(1L, 2L, 3L, 1L, 1L, 2L, 1L, 3L, 1L, 3L, 2L, 3L), .Label = c("A", "B", "C"), class = "factor"), AGE = structure(c(3L, 2L, 2L, 1L, 2L, 2L, 1L, 1L, 2L, 1L, 2L, 2L), .Label = c("E", "M", "Y"), class = "factor"), WEIGHT_1 = c(79.2, 58.8, 72, 59.7, 79.6, 83.1, 68.7, 67.6, 79.1, 39.9, 64.7, 65.6), WEIGHT_2 = c(76.6, 59.3, 70.1, 57.3, 79.8, 82.3, 66.8, 67.4, 76.8, 41.4, 65.3, 63.2), HEIGHT = c(169L, 161L, 175L, 149L, 179L, 177L, 175L, 170L, 177L, 138L, 170L, 165L), SMOKE = structure(c(2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L), .Label = c("N", "Y"), class = "factor"), EXERCISE = c(TRUE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, TRUE, TRUE, FALSE, FALSE, TRUE), RECOVER = c(1L, 0L, 1L, 1L, 1L, 0L, 1L, 1L, 1L, 1L, 0L, 1L)), .Names = c("PATIENT", "GENDER", "TREATMENT", "AGE", "WEIGHT_1", "WEIGHT_2", "HEIGHT", "SMOKE", "EXERCISE", "RECOVER"), class = "data.frame", row.names = c(1L, 4L, 5L, 13L, 15L, 17L, 22L, 29L, 33L, 41L, 42L, 43L))
    M
    # Create a scatterplot of patient height against weight before treatment.
    qplot(HEIGHT, WEIGHT_1, data = M, xlab = "HEIGHT (cm)", ylab = "WEIGHT BEFORE TREATMENT (kg)" , size = factor(GENDER), color = factor(GENDER)) + scale_size_manual(values = c(5, 7))
    # Change the symbol size values.
    qplot(HEIGHT, WEIGHT_1, data = M, xlab = "HEIGHT (cm)", ylab = "WEIGHT BEFORE TREATMENT (kg)" , size = factor(GENDER), color = factor(GENDER)) + scale_size_manual(values = c(2, 9))


`qplot`

to create a simple scatterplot.

The `qplot` (quick plot) system is a subset of the `ggplot2` (grammar of graphics) package which you can use to create nice graphs. It is great for creating graphs of categorical data, because you can map symbol colour, size and shape to the levels of your categorical variable. To use `qplot` first install `ggplot2` as follows:

`install.packages("ggplot2")`

and then load `ggplot2` using the command:

`library(ggplot2)`

The `qplot` syntax is as follows:

`qplot(x = X, y = X, data = X, colour = X, shape = X, geom = X, main = "Title")`

where

- `x` gives the `x` values you wish to plot.
- `y` gives the `y` values you wish to plot. You now have bivariate data and must provide an appropriate `geom`.
- `data` gives the object name of the data frame.
- `colour` maps the colour scheme onto a factor variable, and `qplot` now selects different colours for different levels of the variable. You can use special syntax to set your own colours.
- `shape` maps the symbol shapes onto a factor variable, and `qplot` now selects different shapes for different levels of the factor variable. You can use special syntax to set your own shapes.
- `geom` provides a list of keywords that control the kind of plot, including: “`histogram`”, “`density`”, “`line`”, “`point`”.
- `main` provides the title for the plot.

In `qplot`, you can set your desired aesthetics using the operator `I()`. For example, if you want red use: `color = I("red")`. If you want to control the size of the symbols, use: `size = I(N)`, where a value of `N` greater than 1 expands the symbols. For example, `size = I(5)` produces very big symbols.

Anyway – let’s start with a simple example where we set up a simple scatterplot with blue symbols. Now read in this dataset:

`T <- structure(list(A = c(1, 2, 4, 5, 6, 7), B = c(1, 4, 16, 25, 36, 49)), .Names = c("A", "B"), row.names = c(NA, -6L), class = "data.frame")`

`T`

Now plot `A`

against `B`

using `I()`

for colour and symbol size. We include axis labels of our choice and use symbol size 5 (large symbols).

`qplot(A, B, data = T, xlab = "NUMBERS", ylab = "VERTICAL AXIS", colour = I("blue"), size = I(5))`

Note the default background, grey in colour and including a grid. We can modify those attributes quite easily and we will do so in a later blog.

Now we create a scatterplot with a smooth curve using `geom = c("smooth")`

.

`qplot(A, B, data = T, xlab = "NUMBERS", ylab = "VERTICAL AXIS", colour = I("blue"), size = I(1), geom = c("smooth"))`

We chose `size = I(1)`

for this example, but we can include a larger value to get a thicker line.

That wasn’t so hard! In Blog 24 we will look at further plotting techniques using `qplot`

.

See you later!

David

    # Install ggplot2 package.
    install.packages("ggplot2")
    # Load ggplot2 package.
    library(ggplot2)
    # Read in and display the dataset.
    T <- structure(list(A = c(1, 2, 4, 5, 6, 7), B = c(1, 4, 16, 25, 36, 49)), .Names = c("A", "B"), row.names = c(NA, -6L), class = "data.frame")
    T
    # Plot A against B.
    qplot(A, B, data = T, xlab = "NUMBERS", ylab = "VERTICAL AXIS", colour = I("blue"), size = I(5))
    # Create a scatterplot with a smooth curve.
    qplot(A, B, data = T, xlab = "NUMBERS", ylab = "VERTICAL AXIS", colour = I("blue"), size = I(1), geom = c("smooth"))


Mathematical expressions on graphs are made possible through

`expression(paste())`

and `substitute()`

. If you need mathematical expressions as axis labels, switch off the default axes and include Greek symbols by writing them out in English. You can create fractions through the `frac()`

command. Note how we obtain the plus or minus sign through the syntax: `%+-%`

Here is a nice example. Let’s create a set of 71 values from -6 to +6. These values are the horizontal axis values.

`x <- seq(-6, 6, len = 71)`

Now we plot a cosine function using a continuous curve (using

`type="l"`

) while suppressing the x axis using the syntax: `xaxt="n"`

`plot(x, cos(x),type="l",xaxt="n", xlab=expression(paste("Angle ",theta)), ylab=expression("sin "*beta))`

where we have inserted relevant mathematical text for the axis labels using

`expression(paste())`

. Here is the graph so far:

Now we create a horizontal axis to our own specifications, including relevant labels:

`axis(1, at = c(-2*pi, -1.5*pi, -pi, -pi/2, 0, pi/2, pi, 1.5*pi, 2*pi), lab = expression(-2*phi, -1.5*phi, -phi, -phi/2, 0, phi/2, phi, 1.5*phi, 2*phi))`

Let’s put in some mathematical expressions, centred appropriately. The first argument within each `text()`

function gives the value along the horizontal axis about which the text will be centred.

`text(-0.7*pi,0.5,substitute(chi^2=="23.5"))`

`text(0.1*pi, -0.5, expression(paste(frac(alpha*omega, sigma*phi*sqrt(2*pi)), " ", e^{frac(-(5*x+2*mu)^3, 5*sigma^3)})))`

`text(0.3*pi,0,expression(hat(z) %+-% frac(se, alpha)))`
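In the first `text()` call above the value 23.5 is typed directly into the expression. When the value is held in a variable, base R’s `bquote()` can insert it for you. Here is a minimal sketch; the variable `chi_sq` and its value are hypothetical, and `lab` is our own name.

```r
x <- seq(-6, 6, len = 71)
plot(x, cos(x), type = "l")

# A value computed elsewhere (hypothetical here).
chi_sq <- 23.5

# bquote() builds the expression and replaces .(chi_sq) with its current value.
lab <- bquote(chi^2 == .(chi_sq))

text(-0.7 * pi, 0.5, lab)
```

The label updates automatically whenever `chi_sq` changes, so it never needs to be retyped.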

Here is our graph, complete with mathematical expressions:

That wasn’t so hard! In Blog 23 we will look at further plotting techniques in R.

See you later!

David

    # Create a set of 71 horizontal axis values.
    x <- seq(-6, 6, len = 71)
    # Plot a cosine function.
    plot(x, cos(x),type="l",xaxt="n", xlab=expression(paste("Angle ",theta)), ylab=expression("sin "*beta))
    # Customise a horizontal axis.
    axis(1, at = c(-2*pi, -1.5*pi, -pi, -pi/2, 0, pi/2, pi, 1.5*pi, 2*pi), lab = expression(-2*phi, -1.5*phi, -phi, -phi/2, 0, phi/2, phi, 1.5*phi, 2*phi))
    # Add a few mathematical expressions.
    text(-0.7*pi,0.5,substitute(chi^2=="23.5"))
    text(0.1*pi, -0.5, expression(paste(frac(alpha*omega, sigma*phi*sqrt(2*pi)), " ", e^{frac(-(5*x+2*mu)^3, 5*sigma^3)})))
    text(0.3*pi,0,expression(hat(z) %+-% frac(se, alpha)))


`par(mfrow = c(A, B))`

, where `A`

refers to the number of rows and `B`

to the number of columns (and where each cell will hold a single graph). This syntax sets up a plotting environment of `A`

rows and `B`

columns.

First we create four vectors, all of the same length.

`X <- c(1, 2, 3, 4, 5, 6, 7)`

`Y1 <- c(2, 4, 5, 7, 12, 14, 16)`

`Y2 <- c(3, 6, 7, 8, 9, 11, 12)`

`Y3 <- c(1, 7, 3, 2, 2, 7, 9)`

Now we set up a plotting environment of two rows and three columns (in order to hold six graphs), using the `mfrow` argument of `par()`:

`par(mfrow=c(2,3))`

Now we plot six graphs on the same plotting environment. We use the

`plot()`

command six times in succession, each time graphing one of the `Y` vectors against the `X` vector.

`plot(X,Y1, pch = 1)`

`plot(X,Y2, pch = 2)`

`plot(X,Y3, pch = 3)`

`plot(X,Y1, pch = 4)`

`plot(X,Y2, pch = 15)`

`plot(X,Y3, pch = 16)`
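The same six panels can also be drawn with a loop, which is handy when there are many graphs. The sketch below also saves the old graphical settings and restores them afterwards, so that later plots are not squeezed into the 2 x 3 grid. The names `old_par`, `ys` and `pchs` are ours.

```r
X  <- c(1, 2, 3, 4, 5, 6, 7)
Y1 <- c(2, 4, 5, 7, 12, 14, 16)
Y2 <- c(3, 6, 7, 8, 9, 11, 12)
Y3 <- c(1, 7, 3, 2, 2, 7, 9)

# par() returns the previous settings, so we can restore them later.
old_par <- par(mfrow = c(2, 3))

# The y vectors and plotting symbols, in the order of the six panels.
ys   <- list(Y1 = Y1, Y2 = Y2, Y3 = Y3, Y1 = Y1, Y2 = Y2, Y3 = Y3)
pchs <- c(1, 2, 3, 4, 15, 16)

for (i in seq_along(ys)) {
  plot(X, ys[[i]], pch = pchs[i], ylab = names(ys)[i])
}

# Return to the original single-plot layout.
par(old_par)
```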

Our plot looks like this:

That wasn’t so hard! In Blog 22 we will look at further plotting techniques in R.

See you later!

David

    # Create four vectors of the same length.
    X <- c(1, 2, 3, 4, 5, 6, 7)
    Y1 <- c(2, 4, 5, 7, 12, 14, 16)
    Y2 <- c(3, 6, 7, 8, 9, 11, 12)
    Y3 <- c(1, 7, 3, 2, 2, 7, 9)
    # Set up multiple graphs (2 x 3) on the same page.
    par(mfrow=c(2,3))
    # Create six scatterplots.
    plot(X,Y1, pch = 1)
    plot(X,Y2, pch = 2)
    plot(X,Y3, pch = 3)
    plot(X,Y1, pch = 4)
    plot(X,Y2, pch = 15)
    plot(X,Y3, pch = 16)


```
A <- c(3, 2, NA, 5, 3, 7, NA, NA, 5, 2, 6)
A
```

We can re-code all missing values by another number (such as zero) as follows:

```
A[ is.na(A) ] <- 0
A
```

Let’s re-code all values less than 5 to the value 99.

```
A[ A < 5 ] <- 99
A
```
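Note that, because the missing values were set to zero first, they were also re-coded to 99 in the second step. If you would rather leave the missing values alone, one possible approach is to index with `which()`, which ignores comparisons that come out as `NA`. The vector name `B` is ours.

```r
# Start again from the original vector, this time keeping the missing values.
B <- c(3, 2, NA, 5, 3, 7, NA, NA, 5, 2, 6)

# which() returns only the positions where the comparison is TRUE,
# so the NA elements are left untouched.
B[which(B < 5)] <- 99
B
```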

However, some re-coding tasks are more complex, particularly when you wish to re-code a categorical variable or factor. In such cases, you might want to re-code an array with character elements to numeric elements.

```
gender <- c("MALE","FEMALE","FEMALE","UNKNOWN","MALE")
gender
```

Let’s re-code males as 1 and females as 2. The following re-coding syntax is very useful because it works in many practical situations. It involves repeated (nested) use of the `ifelse()`

command.

`ifelse(gender == "MALE", 1, ifelse(gender == "FEMALE", 2, 3))`

The element with unknown gender was re-coded as 3. Make a note of this syntax. It’s great for re-coding within R programmes.
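When there are many levels, nested `ifelse()` calls become hard to read. An alternative worth knowing is a named lookup vector, sketched below under the same coding scheme; the names `codes` and `recoded` are ours.

```r
gender <- c("MALE", "FEMALE", "FEMALE", "UNKNOWN", "MALE")

# Each name is an old value and each element is its new code.
codes <- c(MALE = 1, FEMALE = 2)

# Indexing by name looks up every element at once; unmatched values give NA.
recoded <- unname(codes[gender])

# Anything that was not matched gets the catch-all code 3.
recoded[is.na(recoded)] <- 3
recoded
```

The result should match the output of the nested `ifelse()` command above.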

Another example, this time using a rectangular array.

```
A <- data.frame(Gender = c("F", "F", "M", "F", "B", "M", "M"), Height = c(154, 167, 178, 145, 169, 183, 176))
A
```

We have deliberately introduced an error where gender is misclassified as `B`

. This one gets re-coded to the value 99. Note that the `Gender`

variable is located in the first column, or `A[ ,1]`

.

```
A[ ,1] <- ifelse(A[ ,1] == "M", 1, ifelse(A[,1] == "F", 2, 99))
A
```

You can use the same approach to code as many different levels as you need to. Let’s re-code for four different levels. My last example is drawn from the films of *The Lord of the Rings* and *The Hobbit*. The sets where Peter Jackson produced these films are just a short walk from where I live, so the example is relevant for me.

```
S <- data.frame(SPECIES = c("ORC", "HOBBIT", "ELF", "TROLL", "ORC", "ORC", "ELF", "HOBBIT"), HEIGHT = c(194, 127, 178, 195, 149, 183, 176, 134))
S
```

We now use nested `ifelse()`

commands to re-code Orcs as 1, Elves as 2, Hobbits as 3, and Trolls as 4.

```
S[,1] <- ifelse(S[,1] == "ORC", 1, ifelse(S[,1] == "ELF", 2, ifelse(S[,1] == "HOBBIT", 3, ifelse(S[,1] == "TROLL", 4, 99))))
S
```

We can recode back to characters just as easily.

```
S[,1] <- ifelse(S[,1] == 1, "ORC", ifelse(S[,1] == 2, "ELF", ifelse(S[,1] == 3, "HOBBIT", ifelse(S[,1] == 4, "TROLL", 99))))
S
```
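R’s `factor()` function offers another way to move between labels and numeric codes. The following sketch assumes the same coding as above (Orcs 1, Elves 2, Hobbits 3, Trolls 4); the names `species`, `f`, `codes` and `back` are ours.

```r
species <- c("ORC", "HOBBIT", "ELF", "TROLL", "ORC", "ORC", "ELF", "HOBBIT")

# The order of the levels determines the integer code of each label.
f <- factor(species, levels = c("ORC", "ELF", "HOBBIT", "TROLL"))

# Labels to codes.
codes <- as.integer(f)
codes

# Codes back to labels.
back <- levels(f)[codes]
back
```

A value that is not among the levels becomes `NA` here rather than 99, so misclassified entries are still easy to spot.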

The general approach is the same as before, but now you have a few additional sets of parentheses. That wasn’t so hard! In Blog 21 I will present another tip for data analysis in R.

See you later!

David

    # Set up a vector that has missing values.
    A <- c(3, 2, NA, 5, 3, 7, NA, NA, 5, 2, 6)
    A
    # Re-code all missing values by another number (such as zero).
    A[ is.na(A) ] <- 0
    A
    # Re-code all values less than 5 to the value 99.
    A[ A < 5 ] <- 99
    A
    # Re-code an array with character elements to numeric elements.
    gender <- c("MALE","FEMALE","FEMALE","UNKNOWN","MALE")
    gender
    # Re-code males as 1 and females as 2. It involves repeated (nested) use of the ifelse() command.
    ifelse(gender == "MALE", 1, ifelse(gender == "FEMALE", 2, 3))
    # Another example, using a rectangular array.
    A <- data.frame(Gender = c("F", "F", "M", "F", "B", "M", "M"), Height = c(154, 167, 178, 145, 169, 183, 176))
    A
    # Re-code misclassified B to the value 99. Gender variable is located in the first column, or A[ ,1].
    A[ ,1] <- ifelse(A[ ,1] == "M", 1, ifelse(A[,1] == "F", 2, 99))
    A
    # Another example from the films of the Lord of the Rings and the Hobbit.
    S <- data.frame(SPECIES = c("ORC", "HOBBIT", "ELF", "TROLL", "ORC", "ORC", "ELF", "HOBBIT"), HEIGHT = c(194, 127, 178, 195, 149, 183, 176, 134))
    S
    # Use nested ifelse commands to re-code Orcs as 1, Elves as 2, Hobbits as 3, and Trolls as 4.
    S[,1] <- ifelse(S[,1] == "ORC", 1, ifelse(S[,1] == "ELF", 2, ifelse(S[,1] == "HOBBIT", 3, ifelse(S[,1] == "TROLL", 4, 99))))
    S
    # Recode back to characters.
    S[,1] <- ifelse(S[,1] == 1, "ORC", ifelse(S[,1] == 2, "ELF", ifelse(S[,1] == 3, "HOBBIT", ifelse(S[,1] == 4, "TROLL", 99))))
    S
