In Part 10, let’s look at the aggregate() command for creating summary tables using R. You may have a complex dataset that includes categorical variables with several levels, and you may wish to create summary tables for each level of a categorical variable. For example, your dataset may include the variable Gender, a two-level categorical variable with levels Male and Female. Your dataset may include other categorical variables such as Ethnicity, Hair Colour, or the Treatments received by patients in a medical study. In any case, you may wish to produce summary statistics for each level of the categorical variable. This is where the aggregate() command is so helpful.
Here is a dataset of patients receiving medical treatment (A, B or C). We have data on their gender, their body mass in kg, whether or not they exercise, whether or not they smoke, and whether or not they recovered after treatment. Copy and paste the following dataset into R.
patients <- structure(list(GENDER = structure(c(2L, 2L, 1L, 1L, 1L, 2L, 2L, 1L, 2L, 2L, 1L, 1L, 1L, 2L, 2L, 2L, 1L, 1L, 2L, 2L, 2L, 1L, 1L, 1L, 2L, 2L, 1L), .Label = c("F", "M"), class = "factor"), TREATMENT = structure(c(1L, 1L, 2L, 1L, 2L, 2L, 3L, 2L, 1L, 1L, 2L, 2L, 2L, 1L, 3L, 2L, 2L, 1L, 3L, 1L, 1L, 2L, 3L, 2L, 2L, 2L, 3L), .Label = c("A", "B", "C"), class = "factor"), MASS = c(79L, 87L, 65L, 58L, 72L, 95L, 76L, 56L, 77L, 104L, 67L, 82L, 59L, 68L, 79L, 125L, 83L, 63L, 57L, 84L, 72L, 68L, 65L, 64L, 87L, 92L, 56L), SMOKE = structure(c(2L, 2L, 1L, 2L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 2L, 1L, 1L, 1L, 2L, 1L, 2L), .Label = c("N", "Y"), class = "factor"), EXERCISE = structure(c(2L, 2L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 1L, 2L, 2L, 1L, 1L, 2L, 1L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 1L, 1L, 2L, 1L), .Label = c("N", "Y"), class = "factor"), RECOVER = structure(c(2L, 2L, 2L, 1L, 1L, 2L, 2L, 1L, 1L, 2L, 1L, 1L, 1L, 1L, 2L, 2L, 1L, 1L, 2L, 2L, 2L, 2L, 1L, 2L, 2L, 2L, 2L), .Label = c("N", "Y"), class = "factor")), .Names = c("GENDER", "TREATMENT", "MASS", "SMOKE", "EXERCISE", "RECOVER"), class = "data.frame", row.names = c(NA, -27L))
attach(patients)
patients
Let’s use the aggregate() command to obtain a table of mean body mass across the two levels of gender. Note the syntax. The continuous variable becomes the first argument. Then the categorical variable appears inside the list() command as the second argument. Finally, the function you wish to apply (in this case the mean) becomes the third argument, passed as FUN.
Table1 <- aggregate(MASS, list(GENDER), FUN=mean)
Table1
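As a quick check of what aggregate() is doing here, the sketch below retypes the first six rows of the patients dataset so that it runs on its own, and compares the aggregate() result with means worked out directly for each gender. The names six and check are only for illustration, and the columns are referred to as six$MASS and six$GENDER rather than through attach().

```r
# First six rows of the patients dataset, retyped so this block is self-contained.
# The names `six` and `check` are illustrative, not part of the original post.
six <- data.frame(
  GENDER = factor(c("M", "M", "F", "F", "F", "M")),
  MASS   = c(79, 87, 65, 58, 72, 95)
)

# Same call pattern as Table1: value, list of grouping variables, function.
check <- aggregate(six$MASS, list(six$GENDER), FUN = mean)
check

# The same means computed directly from each subset.
mean(six$MASS[six$GENDER == "F"])  # (65 + 58 + 72) / 3 = 65
mean(six$MASS[six$GENDER == "M"])  # (79 + 87 + 95) / 3 = 87
```

The x column of check should match the two hand-computed means, one row per level of gender.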
Note that the aggregate() command does not carry the variable names into the table: the grouping column is labelled Group.1 and the summary column is labelled x. Now we use the aggregate() command to obtain a table of mean body mass across the two levels of smoker (i.e. people who smoke and people who do not).
Table2 <- aggregate(MASS, list(SMOKE), FUN=mean)
Table2
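If you would rather see the variable names than the generic Group.1 and x labels, two options are sketched below: naming the element inside list(), or using the formula interface of aggregate(). This is a minimal sketch on the first six rows of the patients dataset, retyped so the block runs on its own. The names six, named and by_formula are illustrative only.

```r
# First six rows of the patients dataset (illustrative subset named `six`).
six <- data.frame(
  SMOKE = factor(c("Y", "Y", "N", "Y", "N", "N")),
  MASS  = c(79, 87, 65, 58, 72, 95)
)

# Option 1: name the list element; the grouping column takes that name.
named <- aggregate(six$MASS, list(SMOKE = six$SMOKE), FUN = mean)
names(named)

# Option 2: formula interface; both columns keep their variable names.
by_formula <- aggregate(MASS ~ SMOKE, data = six, FUN = mean)
names(by_formula)
by_formula
```

With the formula interface, the data frame is passed through the data argument, so attach() is not needed for that call.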
The aggregate() command allows us to create more complex tables, across the levels of several categorical variables together.
Table3 <- aggregate(MASS, list(GENDER, TREATMENT), FUN=mean)
Table3
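To see how the rows of a two-factor table are built, here is a small sketch on the first six rows of the patients dataset, retyped so it runs on its own. It uses the formula interface, with the grouping variables joined by +. The names six and two_way are illustrative only.

```r
# First six rows of the patients dataset (illustrative subset named `six`).
six <- data.frame(
  GENDER    = factor(c("M", "M", "F", "F", "F", "M")),
  TREATMENT = factor(c("A", "A", "B", "A", "B", "B")),
  MASS      = c(79, 87, 65, 58, 72, 95)
)

# One row per observed combination of gender and treatment.
two_way <- aggregate(MASS ~ GENDER + TREATMENT, data = six, FUN = mean)
two_way

# Look up a single cell: males on treatment A, (79 + 87) / 2 = 83.
two_way$MASS[two_way$GENDER == "M" & two_way$TREATMENT == "A"]
```

Each row of the result is the mean body mass for one gender within one treatment, which is the same layout as Table3.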
Finally, let’s look at maximum body mass across the levels of gender, smoker and treatment.
Table4 <- aggregate(MASS, list(GENDER, SMOKE, TREATMENT), FUN=max)
Table4
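One thing worth knowing when reading a table built from several factors: by default, aggregate() returns rows only for combinations of levels that actually occur in the data. The sketch below illustrates this on the first six rows of the patients dataset, retyped so it runs on its own. The names six and three_way are illustrative only.

```r
# First six rows of the patients dataset (illustrative subset named `six`).
six <- data.frame(
  GENDER    = factor(c("M", "M", "F", "F", "F", "M")),
  SMOKE     = factor(c("Y", "Y", "N", "Y", "N", "N")),
  TREATMENT = factor(c("A", "A", "B", "A", "B", "B")),
  MASS      = c(79, 87, 65, 58, 72, 95)
)

# Same call pattern as Table4, with named list elements for readable columns.
three_way <- aggregate(
  six$MASS,
  list(GENDER = six$GENDER, SMOKE = six$SMOKE, TREATMENT = six$TREATMENT),
  FUN = max
)
three_way

# 2 x 2 x 2 = 8 combinations are possible, but only the observed ones appear.
nrow(three_way)
```

In this subset only four of the eight possible combinations are present, so the table has four rows.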
That wasn’t so hard! In Part 11 we will look at further techniques in R.
See you soon!
David
Annex: R code used
[code lang="r"]
# Create and display the following dataset.
patients <- structure(list(GENDER = structure(c(2L, 2L, 1L, 1L, 1L, 2L, 2L, 1L, 2L, 2L, 1L, 1L, 1L, 2L, 2L, 2L, 1L, 1L, 2L, 2L, 2L, 1L, 1L, 1L, 2L, 2L, 1L), .Label = c("F", "M"), class = "factor"), TREATMENT = structure(c(1L, 1L, 2L, 1L, 2L, 2L, 3L, 2L, 1L, 1L, 2L, 2L, 2L, 1L, 3L, 2L, 2L, 1L, 3L, 1L, 1L, 2L, 3L, 2L, 2L, 2L, 3L), .Label = c("A", "B", "C"), class = "factor"), MASS = c(79L, 87L, 65L, 58L, 72L, 95L, 76L, 56L, 77L, 104L, 67L, 82L, 59L, 68L, 79L, 125L, 83L, 63L, 57L, 84L, 72L, 68L, 65L, 64L, 87L, 92L, 56L), SMOKE = structure(c(2L, 2L, 1L, 2L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 2L, 1L, 1L, 1L, 2L, 1L, 2L), .Label = c("N", "Y"), class = "factor"), EXERCISE = structure(c(2L, 2L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 1L, 2L, 2L, 1L, 1L, 2L, 1L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 1L, 1L, 2L, 1L), .Label = c("N", "Y"), class = "factor"), RECOVER = structure(c(2L, 2L, 2L, 1L, 1L, 2L, 2L, 1L, 1L, 2L, 1L, 1L, 1L, 1L, 2L, 2L, 1L, 1L, 2L, 2L, 2L, 2L, 1L, 2L, 2L, 2L, 2L), .Label = c("N", "Y"), class = "factor")), .Names = c("GENDER", "TREATMENT", "MASS", "SMOKE", "EXERCISE", "RECOVER"), class = "data.frame", row.names = c(NA, -27L))
attach(patients)
patients
# Create a table of mean body mass across the two levels of gender.
Table1 <- aggregate(MASS, list(GENDER), FUN=mean)
Table1
# Create a table of mean body mass across the two levels of smoker (i.e. people who smoke and people who do not).
Table2 <- aggregate(MASS, list(SMOKE), FUN=mean)
Table2
# Create a table of mean body mass across the levels of gender and treatment variables together.
Table3 <- aggregate(MASS, list(GENDER, TREATMENT), FUN=mean)
Table3
# Create a table of maximum body mass across the levels of gender, smoker and treatment.
Table4 <- aggregate(MASS, list(GENDER, SMOKE, TREATMENT), FUN=max)
Table4
[/code]