diff --git a/NEWS.md b/NEWS.md
index cffdf5e7..afacac0b 100644
--- a/NEWS.md
+++ b/NEWS.md
@@ -1,6 +1,6 @@
 # 0.2.9000 (development version)
 #### New
-* Vignettes about frequency tables: [vignettes/freq.html](vignettes/freq.html)
+* Vignettes about frequency tables
 * Possibility to globally set the default for the amount of items to print in frequency tables (`freq` function), with `options(max.print.freq = n)`
 
 #### Changed
diff --git a/vignettes/.gitignore b/vignettes/.gitignore
new file mode 100644
index 00000000..5a5283f9
--- /dev/null
+++ b/vignettes/.gitignore
@@ -0,0 +1,4 @@
+figure
+*.html
+*.md
+rsconnect
diff --git a/vignettes/freq.Rmd b/vignettes/freq.Rmd
index 7eb895e7..04702aac 100644
--- a/vignettes/freq.Rmd
+++ b/vignettes/freq.Rmd
@@ -5,7 +5,7 @@ output:
   rmarkdown::html_vignette:
     toc: true
 vignette: >
-  %\VignetteIndexEntry{Vignette Title}
+  %\VignetteIndexEntry{Creating Frequency Tables}
   %\VignetteEngine{knitr::rmarkdown}
   %\VignetteEncoding{UTF-8}
 ---
diff --git a/vignettes/freq.html b/vignettes/freq.html
deleted file mode 100644
index f956a698..00000000
--- a/vignettes/freq.html
+++ /dev/null
@@ -1,344 +0,0 @@
-Frequency tables (or frequency distributions) are summaries of the distribution of values in a sample. With the `freq` function, you can create univariate frequency tables. Multiple variables will be pasted into one variable, so it forces a univariate distribution. We take the `septic_patients` dataset (included in this AMR package) as an example.
To quickly review the content of a single variable, you can select it in various ways. Let's say we want to get the frequencies of the `sex` variable of the `septic_patients` dataset:
# # just using base R
-freq(septic_patients$sex)
-
-# # using base R to select the variable and pass it on with a pipe
-septic_patients$sex %>% freq()
-
-# # do it all with pipes, using the `select` function of the dplyr package
-septic_patients %>%
- select(sex) %>%
- freq()
This will all lead to the following table:
-freq(septic_patients$sex)
-# Class: character
-# Length: 2000 (of which NA: 0 = 0.0%)
-# Unique: 2
-#
-# Item Count Percent Cum. Count Cum. Percent
-# ----- ------ -------- ----------- -------------
-# M 1112 55.6% 1112 55.6%
-# F 888 44.4% 2000 100.0%
This immediately shows the class of the variable, its length and availability (i.e. the number of `NA` values), the number of unique values and (most importantly) that among septic patients men are more prevalent than women.
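The columns of the table above can be reproduced with base R alone. The following is a minimal sketch and not the `freq` implementation itself; the `sex` vector is made-up illustration data rather than the `septic_patients` column.

```r
# Made-up example values (an assumption, not data from septic_patients)
sex <- c("M", "F", "M", "M", "F")

# Count each unique value and sort by descending count
counts <- sort(table(sex), decreasing = TRUE)

# Assemble the same columns a frequency table shows
tbl <- data.frame(item = names(counts),
                  count = as.integer(counts),
                  stringsAsFactors = FALSE)
tbl$percent <- tbl$count / sum(tbl$count)
tbl$cum_count <- cumsum(tbl$count)
tbl$cum_percent <- cumsum(tbl$percent)

print(tbl)
```

The cumulative columns are running sums over the sorted rows, which is why the last row always reaches the full length and 100%.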
Multiple variables will be pasted into one variable to review individual cases, keeping a univariate frequency table.
-For illustration, we could add some more variables to the `septic_patients` dataset to learn about bacterial properties:
my_patients <- septic_patients %>%
- left_join_microorganisms()
Now all variables of the `microorganisms` dataset have been joined to the `septic_patients` dataset. The `microorganisms` dataset consists of the following variables:
colnames(microorganisms)
-# [1] "bactid" "bactsys" "family" "genus"
-# [5] "species" "subspecies" "fullname" "type"
-# [9] "gramstain" "aerobic" "type_nl" "gramstain_nl"
If we compare the dimensions of the old and new datasets, we can see that 11 of these 12 variables were added (one was already present in `septic_patients` as the column the join matched on):
-dim(septic_patients)
-# [1] 2000 47
-dim(my_patients)
-# [1] 2000 58
So now the `genus` and `species` variables are available. A frequency table of these combined variables can be created like this:
my_patients %>%
- select(genus, species) %>%
- freq()
-# Columns: 2
-# Length: 2000 (of which NA: 0 = 0.0%)
-# Unique: 137
-#
-# Item Count Percent Cum. Count Cum. Percent
-# ---------------------------------- ------ -------- ----------- -------------
-# Escherichia coli 485 24.2% 485 24.2%
-# Staphylococcus coagulase negatief 297 14.8% 782 39.1%
-# Staphylococcus aureus 200 10.0% 982 49.1%
-# Staphylococcus epidermidis 150 7.5% 1132 56.6%
-# Streptococcus pneumoniae 97 4.9% 1229 61.5%
-# Staphylococcus hominis 67 3.4% 1296 64.8%
-# Klebsiella pneumoniae 65 3.2% 1361 68.0%
-# Enterococcus faecalis 44 2.2% 1405 70.2%
-# Proteus mirabilis 33 1.7% 1438 71.9%
-# Pseudomonas aeruginosa 31 1.6% 1469 73.5%
-# Streptococcus pyogenes 30 1.5% 1499 75.0%
-# Enterococcus faecium 27 1.4% 1526 76.3%
-# Bacteroides fragilis 26 1.3% 1552 77.6%
-# Enterobacter cloacae 25 1.2% 1577 78.8%
-# Klebsiella oxytoca 23 1.1% 1600 80.0%
-# ... and 122 more (n = 400; 20.0%). Use `nmax` to show more or less rows.
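Pasting several variables into one can be pictured with a small base R sketch. It assumes the values are joined row by row with a single space, which matches the items printed above but is not confirmed as the exact `freq` internals; the data frame is made-up illustration data.

```r
# Made-up example rows (an assumption, not data from my_patients)
df <- data.frame(genus = c("Escherichia", "Staphylococcus", "Escherichia"),
                 species = c("coli", "aureus", "coli"),
                 stringsAsFactors = FALSE)

# Paste the two columns row by row into one combined variable
combined <- paste(df$genus, df$species)

# Count the combined values, most frequent first
counts <- sort(table(combined), decreasing = TRUE)
print(counts)
```

Because the combined values are counted as a single variable, the result stays a univariate frequency table.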
Frequency tables can be created for any input.
-In case of numeric values (like integers, doubles, etc.) additional descriptive statistics will be calculated and shown in the header:
-# # get age distribution of unique patients
-septic_patients %>%
- distinct(patient_id, .keep_all = TRUE) %>%
- select(age) %>%
- freq(nmax = 5)
-# Class: integer
-# Length: 1920 (of which NA: 0 = 0.0%)
-# Unique: 94
-#
-# Mean: 68
-# Std. dev.: 18 (CV: 0.27)
-# Five-Num: 0 | 61 | 72 | 80 | 101 (CQV: 0.13)
-# Outliers: 94 (unique: 26)
-#
-# Item Count Percent Cum. Count Cum. Percent
-# ----- ------ -------- ----------- -------------
-# 0 34 1.8% 34 1.8%
-# 1 5 0.3% 39 2.0%
-# 2 5 0.3% 44 2.3%
-# 3 2 0.1% 46 2.4%
-# 4 1 0.1% 47 2.4%
-# ... and 89 more (n = 1873; 97.6%).
So the following properties are determined, where `NA` values are always ignored:
* Mean
* Standard deviation
* Coefficient of variation (CV), the standard deviation divided by the mean
* Tukey's five-number summary (min, Q1, median, Q3, max)
* Coefficient of quartile variation (CQV, sometimes called coefficient of dispersion), calculated as (Q3 - Q1) / (Q3 + Q1) using `quantile` with `type = 6` as quantile algorithm to comply with SPSS standards
* Outliers (total count and unique count)
So for example, the above frequency table quickly shows that the median age of patients is 72.
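The header statistics listed above can be computed with base R functions. This is a sketch under assumptions: the `ages` vector is made up, and using `fivenum` and `boxplot.stats` for the five numbers and the outliers is a guess at a reasonable approach, not a statement about how `freq` does it. Only the `quantile(..., type = 6)` call for the CQV comes from the text.

```r
# Made-up ages with one missing value (an assumption, not septic_patients data)
ages <- c(0, 34, 61, 68, 72, 75, 80, 83, 101, NA)

# NA values are ignored for all statistics
x <- ages[!is.na(ages)]

# Coefficient of variation: standard deviation divided by the mean
cv <- sd(x) / mean(x)

# Tukey's five numbers: min, lower hinge, median, upper hinge, max
five_num <- fivenum(x)

# Coefficient of quartile variation with quantile algorithm type 6
q <- quantile(x, probs = c(0.25, 0.75), type = 6, names = FALSE)
cqv <- (q[2] - q[1]) / (q[2] + q[1])

# Outliers by the boxplot rule (1.5 times the hinge spread)
outliers <- boxplot.stats(x)$out

print(list(cv = cv, five_num = five_num, cqv = cqv, outliers = outliers))
```

Note that `fivenum` returns hinges, which can differ slightly from the type 6 quartiles used for the CQV.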
-Frequencies of factors will be sorted by factor level instead of item count by default. This can be changed with the `sort.count` parameter. Frequency tables of factors always show the factor level as an additional last column.
`sort.count` is `TRUE` by default, except for factors. Compare this default behaviour:
septic_patients %>%
- select(hospital_id) %>%
- freq()
-# Class: factor
-# Length: 2000 (of which NA: 0 = 0.0%)
-# Unique: 5
-#
-# Item Count Percent Cum. Count Cum. Percent (Factor Level)
-# ----- ------ -------- ----------- ------------- ---------------
-# A 233 11.7% 233 11.7% 1
-# B 583 29.1% 816 40.8% 2
-# C 221 11.1% 1037 51.8% 3
-# D 650 32.5% 1687 84.4% 4
-# E 313 15.7% 2000 100.0% 5
To this, where items are now sorted on item count:
-septic_patients %>%
- select(hospital_id) %>%
- freq(sort.count = TRUE)
-# Class: factor
-# Length: 2000 (of which NA: 0 = 0.0%)
-# Unique: 5
-#
-# Item Count Percent Cum. Count Cum. Percent (Factor Level)
-# ----- ------ -------- ----------- ------------- ---------------
-# D 650 32.5% 650 32.5% 4
-# B 583 29.1% 1233 61.7% 2
-# E 313 15.7% 1546 77.3% 5
-# A 233 11.7% 1779 88.9% 1
-# C 221 11.1% 2000 100.0% 3
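The difference between the two orderings can be sketched in base R. The counts below are taken from the tables above; the rest is an illustration and not the `freq` implementation.

```r
# Rebuild a factor with the hospital counts shown in the tables above
hospital <- factor(rep(c("A", "B", "C", "D", "E"),
                       times = c(233, 583, 221, 650, 313)))

# table() on a factor keeps the factor level order
by_level <- table(hospital)

# Sorting the counts gives the order obtained with sort.count = TRUE
by_count <- sort(by_level, decreasing = TRUE)

print(names(by_level))
print(names(by_count))
```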
All classes will be printed in the header. Variables with the new `rsi` class of this AMR package are actually ordered factors and have three classes (look at `Class` in the header):
septic_patients %>%
- select(amox) %>%
- freq()
-# Class: factor > ordered > rsi
-# Length: 2000 (of which NA: 678 = 33.9%)
-# Unique: 3
-#
-# Item Count Percent Cum. Count Cum. Percent (Factor Level)
-# ----- ------ -------- ----------- ------------- ---------------
-# S 561 42.4% 561 42.4% 1
-# I 49 3.7% 610 46.1% 2
-# R 712 53.9% 1322 100.0% 3
Frequencies of dates will show the oldest and newest date in the data, and the number of days between them:
-septic_patients %>%
- select(date) %>%
- freq(nmax = 5)
-# Class: Date
-# Length: 2000 (of which NA: 0 = 0.0%)
-# Unique: 1662
-#
-# Oldest: 2 januari 2001
-# Newest: 18 oktober 2017 (+6133)
-#
-# Item Count Percent Cum. Count Cum. Percent
-# ----------- ------ -------- ----------- -------------
-# 2008-12-24 5 0.2% 5 0.2%
-# 2010-12-10 4 0.2% 9 0.4%
-# 2011-03-03 4 0.2% 13 0.6%
-# 2013-06-24 4 0.2% 17 0.8%
-# 2017-09-01 4 0.2% 21 1.1%
-# ... and 1657 more (n = 1979; 99.0%).
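The day difference in the header can be checked with base R date arithmetic. This sketch uses the oldest and newest dates from the output above plus one date from the table; it only illustrates the calculation and is not the `freq` implementation.

```r
# Oldest and newest dates from the header above, plus one date from the table
dates <- as.Date(c("2001-01-02", "2008-12-24", "2017-10-18"))

oldest <- min(dates)
newest <- max(dates)

# Number of days between the oldest and newest date
span_days <- as.integer(difftime(newest, oldest, units = "days"))
print(span_days)
```

The result corresponds to the `(+6133)` shown next to the newest date.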
na.rm
The `na.rm` parameter defaults to `TRUE`, which leaves `NA` values out of the table rows (their count is always shown in the header). Set it to `FALSE` to include `NA` values in the frequency table:
septic_patients %>%
- select(amox) %>%
- freq(na.rm = FALSE)
-# Class: factor > ordered > rsi
-# Length: 2678 (of which NA: 678 = 25.3%)
-# Unique: 4
-#
-# Item Count Percent Cum. Count Cum. Percent (Factor Level)
-# ----- ------ -------- ----------- ------------- ---------------
-# S 561 28.1% 561 28.1% 1
-# I 49 2.5% 610 30.5% 2
-# R 712 35.6% 1322 66.1% 3
-# <NA> 678 33.9% 2000 100.0% <NA>
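Including `NA` values changes the denominator of every percentage. The base R sketch below rebuilds the counts from the two `amox` tables above to show this; it is an illustration and not the `freq` implementation.

```r
# Rebuild the amox counts shown above: 561 S, 49 I, 712 R and 678 NA
amox <- factor(rep(c("S", "I", "R", NA), times = c(561, 49, 712, 678)),
               levels = c("S", "I", "R"))

# Proportions without NA: the denominator is 1322
pct_without_na <- prop.table(table(amox))

# Proportions with NA as its own row: the denominator is 2000
pct_with_na <- prop.table(table(amox, useNA = "ifany"))

print(round(pct_without_na * 100, 1))
print(pct_with_na)
```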
markdown
The `markdown` parameter can be used in reports created with R Markdown. This will always print all rows:
septic_patients %>%
- select(hospital_id) %>%
- freq(markdown = TRUE)
-#
-# Class: factor
-#
-# Length: 2000 (of which NA: 0 = 0.0%)
-#
-# Unique: 5
-#
-# |Item | Count| Percent| Cum. Count| Cum. Percent| (Factor Level)|
-# |:----|-----:|-------:|----------:|------------:|--------------:|
-# |A | 233| 11.7%| 233| 11.7%| 1|
-# |B | 583| 29.1%| 816| 40.8%| 2|
-# |C | 221| 11.1%| 1037| 51.8%| 3|
-# |D | 650| 32.5%| 1687| 84.4%| 4|
-# |E | 313| 15.7%| 2000| 100.0%| 5|
as.data.frame
With the `as.data.frame` parameter you can assign the frequency table to an object, or just print it as a `data.frame` to the console:
my_df <- septic_patients %>%
- select(hospital_id) %>%
- freq(as.data.frame = TRUE)
-
-my_df
-# item count percent cum_count cum_percent factor_level
-# 1 A 233 0.1165 233 0.1165 1
-# 2 B 583 0.2915 816 0.4080 2
-# 3 C 221 0.1105 1037 0.5185 3
-# 4 D 650 0.3250 1687 0.8435 4
-# 5 E 313 0.1565 2000 1.0000 5
-
-class(my_df)
-# [1] "data.frame"
AMR, (c) 2018, https://github.com/msberends/AMR
-Licensed under the GNU General Public License v2.0.
-