In Part 10, let’s look at the `aggregate()` command for creating summary tables in R. You may have a complex dataset that includes categorical variables with several levels, and you may wish to create summary tables for each level of a categorical variable. For example, your dataset may include the variable Gender, a two-level categorical variable with levels Male and Female. It may also include other categorical variables such as Ethnicity, Hair Colour, or the Treatments received by patients in a medical study. In any case, you may wish to produce summary statistics for each level of the categorical variable. This is where the `aggregate()` command is so helpful.

Here is a dataset of patients receiving medical treatment (A, B or C). We have data on their gender, their body mass in kg, whether or not they exercise, whether or not they smoke, and whether or not they recovered after treatment. Copy and paste the following code into R to create the dataset.

`patients <- structure(list(GENDER = structure(c(2L, 2L, 1L, 1L, 1L, 2L, 2L, 1L, 2L, 2L, 1L, 1L, 1L, 2L, 2L, 2L, 1L, 1L, 2L, 2L, 2L, 1L, 1L, 1L, 2L, 2L, 1L), .Label = c("F", "M"), class = "factor"), TREATMENT = structure(c(1L, 1L, 2L, 1L, 2L, 2L, 3L, 2L, 1L, 1L, 2L, 2L, 2L, 1L, 3L, 2L, 2L, 1L, 3L, 1L, 1L, 2L, 3L, 2L, 2L, 2L, 3L), .Label = c("A", "B", "C"), class = "factor"), MASS = c(79L, 87L, 65L, 58L, 72L, 95L, 76L, 56L, 77L, 104L, 67L, 82L, 59L, 68L, 79L, 125L, 83L, 63L, 57L, 84L, 72L, 68L, 65L, 64L, 87L, 92L, 56L), SMOKE = structure(c(2L, 2L, 1L, 2L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 2L, 1L, 1L, 1L, 2L, 1L, 2L), .Label = c("N", "Y"), class = "factor"), EXERCISE = structure(c(2L, 2L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 1L, 2L, 2L, 1L, 1L, 2L, 1L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 1L, 1L, 2L, 1L), .Label = c("N", "Y"), class = "factor"), RECOVER = structure(c(2L, 2L, 2L, 1L, 1L, 2L, 2L, 1L, 1L, 2L, 1L, 1L, 1L, 1L, 2L, 2L, 1L, 1L, 2L, 2L, 2L, 2L, 1L, 2L, 2L, 2L, 2L), .Label = c("N", "Y"), class = "factor")), .Names = c("GENDER", "TREATMENT", "MASS", "SMOKE", "EXERCISE", "RECOVER"), class = "data.frame", row.names = c(NA, -27L))`

`attach(patients)`

`patients`

Let’s use the `aggregate()` command to obtain a table of mean body mass across the two levels of gender. Note the syntax. The continuous variable is the first argument. The categorical variable comes second, wrapped in the `list()` command. Finally, the function you wish to apply (in this case, the mean) is the third argument.

`Table1 <- aggregate(MASS, list(GENDER), FUN=mean)`

`Table1`
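To make the three arguments explicit, here is a minimal sketch that spells out their names (`x`, `by` and `FUN`) and refers to the columns with `$` rather than relying on `attach()`. The data frame `toy` is an assumption introduced purely for illustration: it retypes the first six rows of `patients` (gender and mass only) so that the block runs on its own.

```r
# 'toy' is an illustrative data frame (not part of the original post):
# the first six rows of 'patients', with only the columns used here.
toy <- data.frame(
  GENDER = factor(c("M", "M", "F", "F", "F", "M")),
  MASS   = c(79, 87, 65, 58, 72, 95)
)

# x   = the continuous variable to summarise
# by  = a list holding the categorical (grouping) variable
# FUN = the summary function applied within each group
toy_means <- aggregate(x = toy$MASS, by = list(toy$GENDER), FUN = mean)
toy_means
```

On the full dataset, the equivalent call would be `aggregate(x = patients$MASS, by = list(patients$GENDER), FUN = mean)`.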

Note that the `aggregate()` command does not return the variable names: the grouping column is labelled `Group.1` and the summary column `x`. Now let’s use the `aggregate()` command to obtain a table of mean body mass across the two levels of smoker (i.e. people who smoke and people who do not).

`Table2 <- aggregate(MASS, list(SMOKE), FUN=mean)`

`Table2`
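If you would like the table to carry meaningful column names, two approaches may help. The sketch below shows both on a small illustrative data frame, `toy`, which is an assumption made here (the first six rows of `patients`, smoking status and mass only); the label `Smoker` is likewise just a name chosen for the example.

```r
# 'toy' is an illustrative data frame (not part of the original post):
# the first six rows of 'patients', with only the columns used here.
toy <- data.frame(
  SMOKE = factor(c("Y", "Y", "N", "Y", "N", "N")),
  MASS  = c(79, 87, 65, 58, 72, 95)
)

# Option 1: name the list element; the grouping column takes that name.
# The summary column is still called "x".
by_name <- aggregate(toy$MASS, list(Smoker = toy$SMOKE), FUN = mean)
names(by_name)

# Option 2: the formula interface keeps both original column names.
by_formula <- aggregate(MASS ~ SMOKE, data = toy, FUN = mean)
names(by_formula)
```

The formula version also has the advantage of taking a `data` argument, so it does not depend on the dataset being attached.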

The `aggregate()` command also allows us to create more complex tables, across the levels of several categorical variables together.

`Table3 <- aggregate(MASS, list(GENDER, TREATMENT), FUN=mean)`

`Table3`
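To see how a two-way table is laid out, here is a small sketch on an illustrative data frame `toy` (an assumption for this example: the first six rows of `patients`, with three columns). The list names `Gender` and `Treatment` are labels chosen here. The result has one row for each combination of levels that actually occurs in the data, and a single cell can be picked out by subsetting.

```r
# 'toy' is an illustrative data frame (not part of the original post):
# the first six rows of 'patients', with only the columns used here.
toy <- data.frame(
  GENDER    = factor(c("M", "M", "F", "F", "F", "M")),
  TREATMENT = factor(c("A", "A", "B", "A", "B", "B")),
  MASS      = c(79, 87, 65, 58, 72, 95)
)

# One row per GENDER x TREATMENT combination that occurs in the data.
two_way <- aggregate(toy$MASS,
                     list(Gender = toy$GENDER, Treatment = toy$TREATMENT),
                     FUN = mean)
two_way

# Pull out a single cell of the table, e.g. females on treatment B.
two_way$x[two_way$Gender == "F" & two_way$Treatment == "B"]
```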

Finally, let’s look at maximum body mass across the levels of gender, smoker and treatment.

`Table4 <- aggregate(MASS, list(GENDER, SMOKE, TREATMENT), FUN=max)`

`Table4`
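The `FUN` argument is not limited to `mean` and `max`; as far as I know, any function that returns a summary of a vector can be supplied, including one you write yourself. The sketch below is an illustration only: `toy` is an assumed data frame (the first six rows of `patients`, gender and mass only), and the "spread" function is a hypothetical example rather than anything used in the tables above.

```r
# 'toy' is an illustrative data frame (not part of the original post):
# the first six rows of 'patients', with only the columns used here.
toy <- data.frame(
  GENDER = factor(c("M", "M", "F", "F", "F", "M")),
  MASS   = c(79, 87, 65, 58, 72, 95)
)

# Built-in summary function, as in the text.
mass_max <- aggregate(toy$MASS, list(Gender = toy$GENDER), FUN = max)
mass_max

# A user-defined function: spread (max minus min) of mass in each group.
mass_spread <- aggregate(toy$MASS, list(Gender = toy$GENDER),
                         FUN = function(m) max(m) - min(m))
mass_spread
```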

That wasn’t so hard! In Blog 11 we will look at further techniques in R.

See you soon!

David

#### Annex: R codes used

`# Create and display the following dataset.`

`patients <- structure(list(GENDER = structure(c(2L, 2L, 1L, 1L, 1L, 2L, 2L, 1L, 2L, 2L, 1L, 1L, 1L, 2L, 2L, 2L, 1L, 1L, 2L, 2L, 2L, 1L, 1L, 1L, 2L, 2L, 1L), .Label = c("F", "M"), class = "factor"), TREATMENT = structure(c(1L, 1L, 2L, 1L, 2L, 2L, 3L, 2L, 1L, 1L, 2L, 2L, 2L, 1L, 3L, 2L, 2L, 1L, 3L, 1L, 1L, 2L, 3L, 2L, 2L, 2L, 3L), .Label = c("A", "B", "C"), class = "factor"), MASS = c(79L, 87L, 65L, 58L, 72L, 95L, 76L, 56L, 77L, 104L, 67L, 82L, 59L, 68L, 79L, 125L, 83L, 63L, 57L, 84L, 72L, 68L, 65L, 64L, 87L, 92L, 56L), SMOKE = structure(c(2L, 2L, 1L, 2L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 2L, 1L, 1L, 1L, 2L, 1L, 2L), .Label = c("N", "Y"), class = "factor"), EXERCISE = structure(c(2L, 2L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 1L, 2L, 2L, 1L, 1L, 2L, 1L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 1L, 1L, 2L, 1L), .Label = c("N", "Y"), class = "factor"), RECOVER = structure(c(2L, 2L, 2L, 1L, 1L, 2L, 2L, 1L, 1L, 2L, 1L, 1L, 1L, 1L, 2L, 2L, 1L, 1L, 2L, 2L, 2L, 2L, 1L, 2L, 2L, 2L, 2L), .Label = c("N", "Y"), class = "factor")), .Names = c("GENDER", "TREATMENT", "MASS", "SMOKE", "EXERCISE", "RECOVER"), class = "data.frame", row.names = c(NA, -27L))`

`attach(patients)`

`patients`

`# Create a table of mean body mass across the two levels of gender.`

`Table1 <- aggregate(MASS, list(GENDER), FUN=mean)`

`Table1`

`# Create a table of mean body mass across the two levels of smoker (i.e. people who smoke and people who do not).`

`Table2 <- aggregate(MASS, list(SMOKE), FUN=mean)`

`Table2`

`# Create a table of mean body mass across the levels of gender and treatment variables together.`

`Table3 <- aggregate(MASS, list(GENDER, TREATMENT), FUN=mean)`

`Table3`

`# Create a table of maximum body mass across the levels of gender, smoker and treatment.`

`Table4 <- aggregate(MASS, list(GENDER, SMOKE, TREATMENT), FUN=max)`

`Table4`

Senior Academic Manager in *New Zealand Institute of Sport* and Director of *Sigma Statistics and Research Ltd*. Author of the book: *R Graph Essentials*.