AMR/vignettes/freq.Rmd

---
title: "Creating Frequency Tables"
author: "Matthijs S. Berends"
output: 
  rmarkdown::html_vignette:
    toc: true
vignette: >
  %\VignetteIndexEntry{Creating Frequency Tables}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE, results = 'markup'}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#"
)
library(dplyr)
library(AMR)
```

## Introduction

Frequency tables (or frequency distributions) are summaries of the distribution of values in a sample. With the `freq` function, you can create univariate frequency tables. Multiple variables will be pasted into one variable, so it forces a univariate distribution. We take the `septic_patients` dataset (included in this AMR package) as example.

## Frequencies of one variable

To only show and quickly review the content of one variable, you can just select this variable in various ways. Let's say we want to get the frequencies of the `sex` variable of the `septic_patients` dataset:
```{r, echo = TRUE, results = 'hide'}
# just using base R
freq(septic_patients$sex)

# using base R to select the variable and pass it on with a pipe from the dplyr package
septic_patients$sex %>% freq()

# do it all with pipes, using the `select` function from the dplyr package
septic_patients %>%
  select(sex) %>%
  freq()

# or the preferred way: using a pipe to pass the variable on to the freq function
septic_patients %>% freq(sex) # this also shows 'age' in the title

```
This will all lead to the following table:
```{r, echo = TRUE}
freq(septic_patients$sex)
```
This immediately shows the class of the variable, its length and availability (i.e. the amount of `NA`), the amount of unique values and (most importantly) that among septic patients men are more prevalent than women.

## Frequencies of more than one variable

Multiple variables will be pasted into one variable to review individual cases, keeping a univariate frequency table.

For illustration, we could add some more variables to the `septic_patients` dataset to learn about bacterial properties:
```{r, echo = TRUE, results = 'hide'}
my_patients <- septic_patients %>% left_join_microorganisms()
```
Now all variables of the `microorganisms` dataset have been joined to the `septic_patients` dataset. The `microorganisms` dataset consists of the following variables:
```{r, echo = TRUE}
colnames(microorganisms)
```

If we compare the dimensions between the old and new dataset, we can see that these `r ncol(my_patients) - ncol(septic_patients)` variables were added:
```{r, echo = TRUE}
dim(septic_patients)
dim(my_patients)
```

So now the `genus` and `species` variables are available. A frequency table of these combined variables can be created like this:
```{r, echo = TRUE}
my_patients %>% freq(genus, species)
```

## Frequencies of numeric values

Frequency tables can be created of any input.

In case of numeric values (like integers, doubles, etc.) additional descriptive statistics will be calculated and shown into the header:

```{r, echo = TRUE}
# # get age distribution of unique patients
septic_patients %>% 
  distinct(patient_id, .keep_all = TRUE) %>% 
  freq(age, nmax = 5)
```

So the following properties are determined, where `NA` values are always ignored:

* **Mean**

* **Standard deviation**

* **Coefficient of variation** (CV), the standard deviation divided by the mean

* **Five numbers of Tukey** (min, Q1, median, Q3, max)

* **Coefficient of quartile variation** (CQV, sometimes called coefficient of dispersion), calculated as (Q3 - Q1) / (Q3 + Q1) using quantile with `type = 6` as quantile algorithm to comply with SPSS standards

* **Outliers** (total count and unique count)

So for example, the above frequency table quickly shows the median age of patients being `r my_patients %>% distinct(patient_id, .keep_all = TRUE) %>% pull(age) %>% median(na.rm = TRUE)`.

## Frequencies of factors

Frequencies of factors will be sorted on factor level instead of item count by default. This can be changed with the `sort.count` parameter. Frequency tables of factors always show the factor level as an additional last column.

`sort.count` is `TRUE` by default, except for factors. Compare this default behaviour...

```{r, echo = TRUE}
septic_patients %>%
  freq(hospital_id)
```

... with this, where items are now sorted on count:

```{r, echo = TRUE}
septic_patients %>%
  freq(hospital_id, sort.count = TRUE)
```

All classes will be printed into the header. Variables with the new `rsi` class of this AMR package are actually ordered factors and have three classes (look at `Class` in the header):

```{r, echo = TRUE}
septic_patients %>%
  select(amox) %>% 
  freq()
```

## Frequencies of dates

Frequencies of dates will show the oldest and newest date in the data, and the amount of days between them:

```{r, echo = TRUE}
septic_patients %>%
  select(date) %>% 
  freq(nmax = 5)
```

## Assigning a frequency table to an object

A frequency table is actaually a regular `data.frame`, with the exception that it contains an additional class.

```{r, echo = TRUE}
my_df <- septic_patients %>% freq(age)
class(my_df)
```

Because of this additional class, a frequency table prints like the examples above. But the object itself contains the complete table without a row limitation:

```{r, echo = TRUE}
dim(my_df)
```

## Additional parameters

### Parameter `na.rm`
With the `na.rm` parameter (defaults to `TRUE`, but they will always be shown into the header), you can include `NA` values in the frequency table:

```{r, echo = TRUE}
septic_patients %>%
  freq(amox, na.rm = FALSE)
```

### Parameter `row.names`
The default frequency tables shows row indices. To remove them, use `row.names = FALSE`:

```{r, echo = TRUE}
septic_patients %>%
  freq(hospital_id, row.names = FALSE)
```

### Parameter `markdown`
The `markdown` parameter can be used in reports created with R Markdown. This will always print all rows:

```{r, echo = TRUE}
septic_patients %>%
  freq(hospital_id, markdown = TRUE)
```

----
```{r, echo = FALSE}
# this will print "2018" in 2018, and "2018-yyyy" after 2018.
yrs <- c(2018:format(Sys.Date(), "%Y"))
yrs <- c(min(yrs), max(yrs))
yrs <- paste(unique(yrs), collapse = "-")
```
AMR, (c) `r yrs`, `r packageDescription("AMR")$URL`

Licensed under the [GNU General Public License v2.0](https://github.com/msberends/AMR/blob/master/LICENSE).
added vignette of freq 2018-05-09 11:44:46 +02:00			`---`
			`title: "Creating Frequency Tables"`
			`author: "Matthijs S. Berends"`
			`output:`
			`rmarkdown::html_vignette:`
			`toc: true`
			`vignette: >`
edit vignette title 2018-05-14 11:54:08 +02:00			`%\VignetteIndexEntry{Creating Frequency Tables}`
added vignette of freq 2018-05-09 11:44:46 +02:00			`%\VignetteEngine{knitr::rmarkdown}`
			`%\VignetteEncoding{UTF-8}`
			`---`

			```{r setup, include = FALSE, results = 'markup'}
			`knitr::opts_chunk$set(`
			`collapse = TRUE,`
			`comment = "#"`
			`)`
			`library(dplyr)`
			`library(AMR)`
			```

			`## Introduction`

			Frequency tables (or frequency distributions) are summaries of the distribution of values in a sample. With the `freq` function, you can create univariate frequency tables. Multiple variables will be pasted into one variable, so it forces a univariate distribution. We take the `septic_patients` dataset (included in this AMR package) as example.

			`## Frequencies of one variable`

			To only show and quickly review the content of one variable, you can just select this variable in various ways. Let's say we want to get the frequencies of the `sex` variable of the `septic_patients` dataset:
			```{r, echo = TRUE, results = 'hide'}
extra unit tests, add row.names to freq 2018-06-19 15:20:14 +02:00			`# just using base R`
added vignette of freq 2018-05-09 11:44:46 +02:00			`freq(septic_patients$sex)`

new g.test() and edited freq() 2018-07-01 21:40:37 +02:00			`# using base R to select the variable and pass it on with a pipe from the dplyr package`
added vignette of freq 2018-05-09 11:44:46 +02:00			`septic_patients$sex %>% freq()`

new g.test() and edited freq() 2018-07-01 21:40:37 +02:00			# do it all with pipes, using the `select` function from the dplyr package
added vignette of freq 2018-05-09 11:44:46 +02:00			`septic_patients %>%`
			`select(sex) %>%`
			`freq()`
new g.test() and edited freq() 2018-07-01 21:40:37 +02:00
			`# or the preferred way: using a pipe to pass the variable on to the freq function`
			`septic_patients %>% freq(sex) # this also shows 'age' in the title`

added vignette of freq 2018-05-09 11:44:46 +02:00			```
			`This will all lead to the following table:`
			```{r, echo = TRUE}
			`freq(septic_patients$sex)`
			```
			This immediately shows the class of the variable, its length and availability (i.e. the amount of `NA`), the amount of unique values and (most importantly) that among septic patients men are more prevalent than women.

			`## Frequencies of more than one variable`

			`Multiple variables will be pasted into one variable to review individual cases, keeping a univariate frequency table.`

			For illustration, we could add some more variables to the `septic_patients` dataset to learn about bacterial properties:
			```{r, echo = TRUE, results = 'hide'}
new g.test() and edited freq() 2018-07-01 21:40:37 +02:00			`my_patients <- septic_patients %>% left_join_microorganisms()`
added vignette of freq 2018-05-09 11:44:46 +02:00			```
			Now all variables of the `microorganisms` dataset have been joined to the `septic_patients` dataset. The `microorganisms` dataset consists of the following variables:
			```{r, echo = TRUE}
			`colnames(microorganisms)`
			```

			If we compare the dimensions between the old and new dataset, we can see that these `r ncol(my_patients) - ncol(septic_patients)` variables were added:
			```{r, echo = TRUE}
			`dim(septic_patients)`
			`dim(my_patients)`
			```

			So now the `genus` and `species` variables are available. A frequency table of these combined variables can be created like this:
			```{r, echo = TRUE}
new g.test() and edited freq() 2018-07-01 21:40:37 +02:00			`my_patients %>% freq(genus, species)`
added vignette of freq 2018-05-09 11:44:46 +02:00			```

			`## Frequencies of numeric values`

			`Frequency tables can be created of any input.`

			`In case of numeric values (like integers, doubles, etc.) additional descriptive statistics will be calculated and shown into the header:`

			```{r, echo = TRUE}
			`# # get age distribution of unique patients`
			`septic_patients %>%`
			`distinct(patient_id, .keep_all = TRUE) %>%`
new g.test() and edited freq() 2018-07-01 21:40:37 +02:00			`freq(age, nmax = 5)`
added vignette of freq 2018-05-09 11:44:46 +02:00			```

			So the following properties are determined, where `NA` values are always ignored:

			`* Mean`

			`* Standard deviation`

			`* Coefficient of variation (CV), the standard deviation divided by the mean`

			`* Five numbers of Tukey (min, Q1, median, Q3, max)`

			* Coefficient of quartile variation (CQV, sometimes called coefficient of dispersion), calculated as (Q3 - Q1) / (Q3 + Q1) using quantile with `type = 6` as quantile algorithm to comply with SPSS standards

			`* Outliers (total count and unique count)`

			So for example, the above frequency table quickly shows the median age of patients being `r my_patients %>% distinct(patient_id, .keep_all = TRUE) %>% pull(age) %>% median(na.rm = TRUE)`.

			`## Frequencies of factors`

			Frequencies of factors will be sorted on factor level instead of item count by default. This can be changed with the `sort.count` parameter. Frequency tables of factors always show the factor level as an additional last column.

Update freq function 2018-05-22 16:34:22 +02:00			`sort.count` is `TRUE` by default, except for factors. Compare this default behaviour...
added vignette of freq 2018-05-09 11:44:46 +02:00
			```{r, echo = TRUE}
			`septic_patients %>%`
new g.test() and edited freq() 2018-07-01 21:40:37 +02:00			`freq(hospital_id)`
added vignette of freq 2018-05-09 11:44:46 +02:00			```

Update freq function 2018-05-22 16:34:22 +02:00			`... with this, where items are now sorted on count:`
added vignette of freq 2018-05-09 11:44:46 +02:00
			```{r, echo = TRUE}
			`septic_patients %>%`
new g.test() and edited freq() 2018-07-01 21:40:37 +02:00			`freq(hospital_id, sort.count = TRUE)`
added vignette of freq 2018-05-09 11:44:46 +02:00			```

			All classes will be printed into the header. Variables with the new `rsi` class of this AMR package are actually ordered factors and have three classes (look at `Class` in the header):

			```{r, echo = TRUE}
			`septic_patients %>%`
			`select(amox) %>%`
			`freq()`
			```

			`## Frequencies of dates`

			`Frequencies of dates will show the oldest and newest date in the data, and the amount of days between them:`

			```{r, echo = TRUE}
			`septic_patients %>%`
			`select(date) %>%`
			`freq(nmax = 5)`
			```

new g.test() and edited freq() 2018-07-01 21:40:37 +02:00			`## Assigning a frequency table to an object`

			A frequency table is actaually a regular `data.frame`, with the exception that it contains an additional class.

			```{r, echo = TRUE}
			`my_df <- septic_patients %>% freq(age)`
			`class(my_df)`
			```

			`Because of this additional class, a frequency table prints like the examples above. But the object itself contains the complete table without a row limitation:`

			```{r, echo = TRUE}
			`dim(my_df)`
			```

added vignette of freq 2018-05-09 11:44:46 +02:00			`## Additional parameters`

			### Parameter `na.rm`
			With the `na.rm` parameter (defaults to `TRUE`, but they will always be shown into the header), you can include `NA` values in the frequency table:
extra unit tests, add row.names to freq 2018-06-19 15:20:14 +02:00
added vignette of freq 2018-05-09 11:44:46 +02:00			```{r, echo = TRUE}
			`septic_patients %>%`
new g.test() and edited freq() 2018-07-01 21:40:37 +02:00			`freq(amox, na.rm = FALSE)`
added vignette of freq 2018-05-09 11:44:46 +02:00			```

extra unit tests, add row.names to freq 2018-06-19 15:20:14 +02:00			### Parameter `row.names`
			The default frequency tables shows row indices. To remove them, use `row.names = FALSE`:

			```{r, echo = TRUE}
			`septic_patients %>%`
new g.test() and edited freq() 2018-07-01 21:40:37 +02:00			`freq(hospital_id, row.names = FALSE)`
extra unit tests, add row.names to freq 2018-06-19 15:20:14 +02:00			```

added vignette of freq 2018-05-09 11:44:46 +02:00			### Parameter `markdown`
			The `markdown` parameter can be used in reports created with R Markdown. This will always print all rows:

			```{r, echo = TRUE}
			`septic_patients %>%`
new g.test() and edited freq() 2018-07-01 21:40:37 +02:00			`freq(hospital_id, markdown = TRUE)`
added vignette of freq 2018-05-09 11:44:46 +02:00			```

			`----`
			```{r, echo = FALSE}
			`# this will print "2018" in 2018, and "2018-yyyy" after 2018.`
			`yrs <- c(2018:format(Sys.Date(), "%Y"))`
			`yrs <- c(min(yrs), max(yrs))`
			`yrs <- paste(unique(yrs), collapse = "-")`
			```
			AMR, (c) `r yrs`, `r packageDescription("AMR")$URL`

			`Licensed under the [GNU General Public License v2.0](https://github.com/msberends/AMR/blob/master/LICENSE).`