How to create frequency tables

Introduction

Frequency tables (or frequency distributions) are summaries of the distribution of values in a sample. With the freq function, you can create univariate frequency tables. Multiple variables will be pasted into one variable, so the result is always a univariate distribution. We take the septic_patients dataset (included in this AMR package) as an example.

Frequencies of one variable

To only show and quickly review the content of one variable, you can just select this variable in various ways. Let’s say we want to get the frequencies of the gender variable of the septic_patients dataset:

septic_patients %>% freq(gender)

Frequency table of gender from a data.frame (2,000 x 49)
Class: character (character)
Length: 2,000 (of which NA: 0 = 0.00%)
Unique: 2

Shortest: 1
Longest: 1

	Item	Count	Percent	Cum. Count	Cum. Percent
1	M	1,031	51.6%	1,031	51.6%
2	F	969	48.5%	2,000	100.0%

This immediately shows the class of the variable, its length and availability (i.e. the number of NA values), the number of unique values and (most importantly) that among septic patients men are more prevalent than women.
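To make the five columns of the table concrete, they can be rebuilt by hand with base R. This is only a minimal sketch on a small made-up vector (the gender values below are hypothetical, not the septic_patients column), and it is not meant to show how freq is implemented internally:

```r
# Hypothetical example data; not taken from septic_patients
gender <- c("M", "F", "M", "M", "F", "M", "F", "M")

# Count each unique value and sort from most to least frequent
counts <- sort(table(gender), decreasing = TRUE)

freq_tbl <- data.frame(
  item    = names(counts),
  count   = as.integer(counts),
  percent = as.numeric(counts) / length(gender),
  stringsAsFactors = FALSE
)

# Cumulative columns are running totals of the two columns above
freq_tbl$cum_count   <- cumsum(freq_tbl$count)
freq_tbl$cum_percent <- cumsum(freq_tbl$percent)

print(freq_tbl)
```

The last row of the cumulative columns always equals the total number of values and 100%, which is a quick sanity check for any frequency table.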

Frequencies of more than one variable

Multiple variables will be pasted into one variable to review individual cases, keeping a univariate frequency table.

For illustration, we could add some more variables to the septic_patients dataset to learn about bacterial properties:

my_patients <- septic_patients %>% left_join_microorganisms()
# Joining, by = "mo"

Now all variables of the microorganisms dataset have been joined to the septic_patients dataset. The microorganisms dataset consists of the following variables:

colnames(microorganisms)
#  [1] "mo"         "tsn"        "genus"      "species"    "subspecies"
#  [6] "fullname"   "family"     "order"      "class"      "phylum"    
# [11] "subkingdom" "kingdom"    "gramstain"  "prevalence" "ref"

If we compare the dimensions between the old and new dataset, we can see that these 14 variables were added:

dim(septic_patients)
# [1] 2000   49
dim(my_patients)
# [1] 2000   63
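The column arithmetic (49 + 15 - 1 = 63) follows from joining on a shared key. As a rough illustration with base R, assuming two tiny made-up tables (the patients and organisms objects and the mo codes below are hypothetical), a left join on mo adds every column of the second table except the key itself:

```r
# Hypothetical miniature tables; the real datasets are much larger
patients <- data.frame(
  patient_id = 1:3,
  mo = c("mo1", "mo2", "mo1"),
  stringsAsFactors = FALSE
)
organisms <- data.frame(
  mo = c("mo1", "mo2"),
  genus = c("Escherichia", "Staphylococcus"),
  species = c("coli", "aureus"),
  stringsAsFactors = FALSE
)

# Base R left join on the shared key column "mo"
joined <- merge(patients, organisms, by = "mo", all.x = TRUE)

# The key column is not duplicated, so the number of added columns
# equals ncol(organisms) - 1
added <- ncol(joined) - ncol(patients)
print(added)
```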

So now the genus and species variables are available. A frequency table of these combined variables can be created like this:

my_patients %>%
  freq(genus, species, nmax = 15)

Frequency table of genus and species from a data.frame (2,000 x 63)
Columns: 2
Length: 2,000 (of which NA: 0 = 0.00%)
Unique: 96

Shortest: 12
Longest: 34

	Item	Count	Percent	Cum. Count	Cum. Percent
1	Escherichia coli	467	23.4%	467	23.4%
2	Staphylococcus coagulase negative	313	15.7%	780	39.0%
3	Staphylococcus aureus	235	11.8%	1,015	50.7%
4	Staphylococcus epidermidis	174	8.7%	1,189	59.5%
5	Streptococcus pneumoniae	117	5.9%	1,306	65.3%
6	Staphylococcus hominis	81	4.1%	1,387	69.4%
7	Klebsiella pneumoniae	58	2.9%	1,445	72.3%
8	Enterococcus faecalis	39	2.0%	1,484	74.2%
9	Proteus mirabilis	36	1.8%	1,520	76.0%
10	Pseudomonas aeruginosa	30	1.5%	1,550	77.5%
11	Serratia marcescens	25	1.3%	1,575	78.8%
12	Enterobacter cloacae	23	1.2%	1,598	79.9%
13	Enterococcus faecium	21	1.1%	1,619	81.0%
14	Staphylococcus capitis	21	1.1%	1,640	82.0%
15	Bacteroides fragilis	20	1.0%	1,660	83.0%

(omitted 81 entries, n = 340 [17.0%])
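The idea of pasting variables together can be sketched in base R. The vectors below are hypothetical example data, and the space separator is an assumption based on how the items appear in the table above:

```r
# Hypothetical example data; the real columns come from the joined dataset
genus   <- c("Escherichia", "Staphylococcus", "Escherichia", "Staphylococcus")
species <- c("coli", "aureus", "coli", "epidermidis")

# Paste the two variables into one, so each row becomes a single item
combined <- paste(genus, species)

# Count the combined items, most frequent first
counts <- sort(table(combined), decreasing = TRUE)
print(counts)
```

Because every row is reduced to one pasted string, the result is still a univariate table, no matter how many variables were selected.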

Frequencies of numeric values

Frequency tables can be created from any input.

In case of numeric values (like integers, doubles, etc.), additional descriptive statistics will be calculated and shown in the header:

# get age distribution of unique patients
septic_patients %>% 
  distinct(patient_id, .keep_all = TRUE) %>% 
  freq(age, nmax = 5, header = TRUE)

Frequency table of age from a data.frame (981 x 49)
Class: numeric (numeric)
Length: 981 (of which NA: 0 = 0.00%)
Unique: 73

Mean: 71.08
SD: 14.05 (CV: 0.20, MAD: 13.34)
Five-Num: 14 | 63 | 74 | 82 | 97 (IQR: 19, CQV: 0.13)
Outliers: 15 (unique count: 12)

	Item	Count	Percent	Cum. Count	Cum. Percent
1	83	44	4.5%	44	4.5%
2	76	43	4.4%	87	8.9%
3	75	37	3.8%	124	12.6%
4	82	33	3.4%	157	16.0%
5	78	32	3.3%	189	19.3%

(omitted 68 entries, n = 792 [80.7%])

So the following properties are determined, where NA values are always ignored:

Mean
Standard deviation
Coefficient of variation (CV), the standard deviation divided by the mean
Five numbers of Tukey (min, Q1, median, Q3, max)
Coefficient of quartile variation (CQV, sometimes called coefficient of dispersion), calculated as (Q3 - Q1) / (Q3 + Q1) using quantile with type = 6 as quantile algorithm to comply with SPSS standards
Outliers (total count and unique count)

So for example, the above frequency table quickly shows the median age of patients being 74.
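These statistics can be reproduced with functions from base R. The sketch below uses a made-up age vector and is only an approximation of the header: the quantile call with type = 6 follows the description above, while the use of fivenum and boxplot.stats is an assumption about how such values are commonly computed, not a statement about the package internals.

```r
# Hypothetical ages; not taken from septic_patients
age <- c(14, 63, 70, 74, 74, 80, 82, 85, 97, NA)

# NA values are ignored
x <- age[!is.na(age)]

# Coefficient of variation: standard deviation divided by the mean
cv <- sd(x) / mean(x)

# Tukey's five numbers: min, lower hinge, median, upper hinge, max
five <- fivenum(x)

# Quartiles with quantile algorithm type 6, then the CQV
q   <- quantile(x, probs = c(0.25, 0.75), type = 6, names = FALSE)
cqv <- (q[2] - q[1]) / (q[2] + q[1])

# One common outlier rule (more than 1.5 x IQR beyond the hinges);
# an assumption here, other definitions exist
outliers <- boxplot.stats(x)$out

print(five)
print(cqv)
print(outliers)
```

Note that the hinges returned by fivenum are not guaranteed to equal the type 6 quartiles, so the Q1 and Q3 used for the CQV can differ slightly from the five-number summary on small samples.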

Frequencies of factors

To sort frequencies of factors on factor level instead of item count, use the sort.count parameter.

sort.count is TRUE by default. Compare this default behaviour…

septic_patients %>%
  freq(hospital_id)

Frequency table of hospital_id from a data.frame (2,000 x 49)
Class: factor (numeric)
Levels: A, B, C, D
Length: 2,000 (of which NA: 0 = 0.00%)
Unique: 4

	Item	Count	Percent	Cum. Count	Cum. Percent
1	D	762	38.1%	762	38.1%
2	B	663	33.2%	1,425	71.3%
3	A	321	16.1%	1,746	87.3%
4	C	254	12.7%	2,000	100.0%

… with this, where items are now sorted on factor level:

septic_patients %>%
  freq(hospital_id, sort.count = FALSE)

Frequency table of hospital_id from a data.frame (2,000 x 49)
Class: factor (numeric)
Levels: A, B, C, D
Length: 2,000 (of which NA: 0 = 0.00%)
Unique: 4

	Item	Count	Percent	Cum. Count	Cum. Percent
1	A	321	16.1%	321	16.1%
2	B	663	33.2%	984	49.2%
3	C	254	12.7%	1,238	61.9%
4	D	762	38.1%	2,000	100.0%
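The difference between the two orderings can be seen with base R alone. In this small sketch the hospital_id values are hypothetical; table keeps the factor level order, and sort puts the most frequent items first:

```r
# Hypothetical hospital codes; not taken from septic_patients
hospital_id <- factor(
  c("D", "B", "D", "A", "B", "D", "C", "A", "B", "D"),
  levels = c("A", "B", "C", "D")
)

# table() on a factor returns counts in factor level order
by_level <- table(hospital_id)

# Sorting those counts gives the most frequent items first
by_count <- sort(by_level, decreasing = TRUE)

print(names(by_level))  # "A" "B" "C" "D"
print(names(by_count))  # "D" "B" "A" "C"
```

Sorting on factor level is mostly useful when the levels carry a meaningful order, for example ordered categories or codes that readers expect to see alphabetically.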

All classes of a variable will be printed in the header when header = TRUE (it defaults to FALSE when using markdown, as in this document). Variables with the new rsi class of this AMR package are actually ordered factors and have three classes (look at Class in the header):

septic_patients %>%
  freq(amox, header = TRUE)

Frequency table of amox from a data.frame (2,000 x 49)
Class: factor > ordered > rsi (numeric)
Levels: S < I < R
Length: 2,000 (of which NA: 771 = 38.55%)
Unique: 3

%IR: 55.82% (ratio S : IR = 1.0 : 1.3)

	Item	Count	Percent	Cum. Count	Cum. Percent
1	R	683	55.6%	683	55.6%
2	S	543	44.2%	1,226	99.8%
3	I	3	0.2%	1,229	100.0%

Frequencies of dates

Frequencies of dates will show the oldest and newest date in the data, and the number of days between them:

septic_patients %>%
  freq(date, nmax = 5, header = TRUE)

Frequency table of date from a data.frame (2,000 x 49)
Class: Date (numeric)
Length: 2,000 (of which NA: 0 = 0.00%)
Unique: 1,140

Oldest: 2 January 2002
Newest: 28 December 2017 (+5,839)
Median: 31 July 2009 (47.39%)

	Item	Count	Percent	Cum. Count	Cum. Percent
1	2016-05-21	10	0.5%	10	0.5%
2	2004-11-15	8	0.4%	18	0.9%
3	2013-07-29	8	0.4%	26	1.3%
4	2017-06-12	8	0.4%	34	1.7%
5	2015-11-19	7	0.4%	41	2.1%

(omitted 1,135 entries, n = 1,959 [98.0%])
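The number of days between the oldest and newest date is plain Date arithmetic in R. As a small check, assuming a made-up vector that happens to contain the two boundary dates from the header above:

```r
# Hypothetical dates; only the first and third match the header above
dates <- as.Date(c("2002-01-02", "2009-07-31", "2017-12-28", "2009-07-31"))

oldest <- min(dates)
newest <- max(dates)

# Subtracting two Date objects gives a difference in days
span_days <- as.integer(newest - oldest)

print(oldest)
print(newest)
print(span_days)  # 5839
```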

Assigning a frequency table to an object

A frequency table is actually a regular data.frame, with the exception that it contains an additional class.

my_df <- septic_patients %>% freq(age)
class(my_df)

[1] "frequency_tbl" "data.frame"

Because of this additional class, a frequency table prints like the examples above. But the object itself contains the complete table without a row limitation:

dim(my_df)

[1] 74 5
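How an extra class can change printing without changing the data is a general S3 mechanism in R. The following sketch uses a hypothetical class name (my_freq_tbl) and a hypothetical print method; it only illustrates the principle and is not the code behind frequency_tbl:

```r
# A plain data.frame with one extra class in front (class name is hypothetical)
tbl <- data.frame(item = c("a", "b", "c"), count = c(3L, 2L, 1L))
class(tbl) <- c("my_freq_tbl", class(tbl))

# A print method for the extra class that shows only the first rows
print.my_freq_tbl <- function(x, nmax = 2, ...) {
  plain <- x
  class(plain) <- "data.frame"  # drop the extra class to avoid recursion
  shown <- utils::head(plain, nmax)
  print(shown, ...)
  omitted <- nrow(plain) - nrow(shown)
  if (omitted > 0) cat("(omitted", omitted, "entries)\n")
  invisible(x)
}

print(tbl)                          # prints 2 rows and a note about 1 omitted entry
print(dim(tbl))                     # the object itself still has all 3 rows
print(inherits(tbl, "data.frame"))  # TRUE, so data.frame functions keep working
```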

Additional parameters

Parameter `na.rm`

With the na.rm parameter, you can include NA values in the frequency table. It defaults to TRUE, which leaves NA values out of the table itself, although their count is always shown in the header:

septic_patients %>%
  freq(amox, na.rm = FALSE)

Frequency table of amox from a data.frame (2,000 x 49)
Class: factor > ordered > rsi (numeric)
Levels: S < I < R
Length: 2,771 (of which NA: 771 = 27.82%)
Unique: 4

%IR: 34.30% (ratio S : IR = 1.0 : 1.3)

	Item	Count	Percent	Cum. Count	Cum. Percent
1	(NA)	771	38.6%	771	38.6%
2	R	683	34.2%	1,454	72.7%
3	S	543	27.2%	1,997	99.9%
4	I	3	0.2%	2,000	100.0%
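The effect on the percentages comes from the changed denominator: once NA is counted as an item, every other item is divided by a larger total. A comparable effect can be seen with the useNA argument of base R's table, shown here on hypothetical data:

```r
# Hypothetical susceptibility results with missing values
amox <- c("R", "S", NA, "R", NA, "S", "R", "I")

# Default: NA values are dropped from the counts
without_na <- table(amox)

# useNA = "always" keeps NA as its own item
with_na <- table(amox, useNA = "always")

# Share of "R" with and without NA in the denominator
share_r_without <- without_na[["R"]] / sum(without_na)  # 3 / 6
share_r_with    <- with_na[["R"]] / sum(with_na)        # 3 / 8

print(share_r_without)
print(share_r_with)
```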

Parameter `row.names`

By default, frequency tables show row indices. To remove them, use row.names = FALSE:

septic_patients %>%
  freq(hospital_id, row.names = FALSE)

Frequency table of hospital_id from a data.frame (2,000 x 49)
Class: factor (numeric)
Levels: A, B, C, D
Length: 2,000 (of which NA: 0 = 0.00%)
Unique: 4

Item	Count	Percent	Cum. Count	Cum. Percent
D	762	38.1%	762	38.1%
B	663	33.2%	1,425	71.3%
A	321	16.1%	1,746	87.3%
C	254	12.7%	2,000	100.0%

Parameter `markdown`

The markdown parameter is TRUE by default in non-interactive sessions, like in reports created with R Markdown. In that case all rows will be printed, unless nmax is set.

septic_patients %>%
  freq(hospital_id, markdown = TRUE)

Frequency table of hospital_id from a data.frame (2,000 x 49)
Class: factor (numeric)
Levels: A, B, C, D
Length: 2,000 (of which NA: 0 = 0.00%)
Unique: 4

	Item	Count	Percent	Cum. Count	Cum. Percent
1	D	762	38.1%	762	38.1%
2	B	663	33.2%	1,425	71.3%
3	A	321	16.1%	1,746	87.3%
4	C	254	12.7%	2,000	100.0%

Matthijs S. Berends

09 February 2019
