How to create frequency tables

Introduction

Frequency tables (or frequency distributions) summarise the distribution of values in a sample. With the freq() function, you can create univariate frequency tables. When multiple variables are supplied, they are pasted into one variable, so the result is always a univariate distribution. We take the septic_patients dataset (included in this AMR package) as an example.

Frequencies of one variable

To quickly review the content of a single variable, you can select it in various ways. Let’s say we want the frequencies of the gender variable of the septic_patients dataset:

# Any of these will work:
# freq(septic_patients$gender)
# freq(septic_patients[, "gender"])

# Using tidyverse:
# septic_patients$gender %>% freq()
# septic_patients[, "gender"] %>% freq()
# septic_patients %>% freq("gender")

# Probably the fastest and easiest:
septic_patients %>% freq(gender)

Frequency table of gender from septic_patients (2,000 x 49)

Class: character
Length: 2,000 (of which NA: 0 = 0.00%)
Unique: 2

Shortest: 1
Longest: 1

	Item	Count	Percent	Cum. Count	Cum. Percent
1	M	1,031	51.6%	1,031	51.6%
2	F	969	48.4%	2,000	100.0%

This immediately shows the class of the variable, its length and availability (i.e. the number of NA values), the number of unique values and (most importantly) that men are more prevalent than women among septic patients.
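To make the columns of such a table concrete, they can be rebuilt with base R alone. This is only a sketch on a small made-up vector (gender below is hypothetical data, not the package dataset), and it does not claim to be how freq() works internally:

```r
# Hypothetical data: a small character vector standing in for one variable
gender <- c("M", "F", "M", "M", "F", "M")

# Count each value and put the most frequent value first
counts <- sort(table(gender), decreasing = TRUE)

freq_tbl <- data.frame(
  item  = names(counts),
  count = as.integer(counts),
  stringsAsFactors = FALSE
)

# Derive the remaining columns from the counts
freq_tbl$percent     <- freq_tbl$count / sum(freq_tbl$count)
freq_tbl$cum_count   <- cumsum(freq_tbl$count)
freq_tbl$cum_percent <- cumsum(freq_tbl$percent)

freq_tbl
```

The cumulative columns always end at the total number of values and at 100%, which is a quick sanity check for any frequency table.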

Frequencies of more than one variable

Multiple variables will be pasted into one variable to review individual cases, keeping a univariate frequency table.

For illustration, we could add some more variables to the septic_patients dataset to learn about bacterial properties:

my_patients <- septic_patients %>% left_join_microorganisms()
# Joining, by = "mo"

Now all variables of the microorganisms dataset have been joined to the septic_patients dataset. The microorganisms dataset consists of the following variables:

colnames(microorganisms)
#  [1] "mo"         "col_id"     "fullname"   "kingdom"    "phylum"    
#  [6] "class"      "order"      "family"     "genus"      "species"   
# [11] "subspecies" "rank"       "ref"        "species_id" "source"    
# [16] "prevalence"

If we compare the dimensions of the old and new dataset, we can see that 15 variables were added (all variables listed above except mo, which was used for joining):

dim(septic_patients)
# [1] 2000   49
dim(my_patients)
# [1] 2000   64

So now the genus and species variables are available. A frequency table of these combined variables can be created like this:

my_patients %>%
  freq(genus, species, nmax = 15)

Frequency table of genus and species from my_patients (2,000 x 64)

Columns: 2
Length: 2,000 (of which NA: 0 = 0.00%)
Unique: 95

Shortest: 8
Longest: 34

	Item	Count	Percent	Cum. Count	Cum. Percent
1	Escherichia coli	467	23.4%	467	23.4%
2	Staphylococcus coagulase-negative	313	15.6%	780	39.0%
3	Staphylococcus aureus	235	11.7%	1,015	50.7%
4	Staphylococcus epidermidis	174	8.7%	1,189	59.4%
5	Streptococcus pneumoniae	117	5.8%	1,306	65.3%
6	Staphylococcus hominis	81	4.0%	1,387	69.4%
7	Klebsiella pneumoniae	58	2.9%	1,445	72.2%
8	Enterococcus faecalis	39	2.0%	1,484	74.2%
9	Proteus mirabilis	36	1.8%	1,520	76.0%
10	Pseudomonas aeruginosa	30	1.5%	1,550	77.5%
11	Serratia marcescens	25	1.2%	1,575	78.8%
12	Enterobacter cloacae	23	1.2%	1,598	79.9%
13	Enterococcus faecium	21	1.0%	1,619	81.0%
14	Staphylococcus capitis	21	1.0%	1,640	82.0%
15	Bacteroides fragilis	20	1.0%	1,660	83.0%

(omitted 80 entries, n = 340 [17.0%])
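The idea of pasting several variables into one before counting can be illustrated with base R. The data frame bugs below is hypothetical and only stands in for the genus and species columns; the sketch shows the principle, not the package's own code:

```r
# Hypothetical data standing in for the genus and species columns
bugs <- data.frame(
  genus   = c("Escherichia", "Staphylococcus", "Escherichia", "Staphylococcus"),
  species = c("coli", "aureus", "coli", "epidermidis"),
  stringsAsFactors = FALSE
)

# Paste both columns into one variable, then tabulate that single variable
combined <- paste(bugs$genus, bugs$species)
counts   <- sort(table(combined), decreasing = TRUE)

counts
```

Because the combination is treated as one value, "Staphylococcus aureus" and "Staphylococcus epidermidis" are counted as separate items even though they share a genus.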

Frequencies of numeric values

Frequency tables can be created from any input.

For numeric values (like integers, doubles, etc.) additional descriptive statistics are calculated and shown in the header:

# get age distribution of unique patients
septic_patients %>% 
  distinct(patient_id, .keep_all = TRUE) %>% 
  freq(age, nmax = 5, header = TRUE)

Frequency table of age from a data.frame (981 x 49)

Class: numeric
Length: 981 (of which NA: 0 = 0.00%)
Unique: 73

Mean: 71.08
SD: 14.05 (CV: 0.20, MAD: 13.34)
Five-Num: 14 | 63 | 74 | 82 | 97 (IQR: 19, CQV: 0.13)
Outliers: 15 (unique count: 12)

	Item	Count	Percent	Cum. Count	Cum. Percent
1	83	44	4.5%	44	4.5%
2	76	43	4.4%	87	8.9%
3	75	37	3.8%	124	12.6%
4	82	33	3.4%	157	16.0%
5	78	32	3.3%	189	19.3%

(omitted 68 entries, n = 792 [80.7%])

So the following properties are determined, where NA values are always ignored:

Mean
Standard deviation
Coefficient of variation (CV), the standard deviation divided by the mean
Median absolute deviation (MAD), the median of the absolute deviations from the median - a more robust statistic than the standard deviation
Five numbers of Tukey, namely: the minimum, Q1, median, Q3 and maximum
Interquartile range (IQR), the distance between Q1 and Q3
Coefficient of quartile variation (CQV, sometimes called coefficient of dispersion), calculated as (Q3 - Q1) / (Q3 + Q1) using quantile() with type = 6 as quantile algorithm to comply with SPSS standards
Outliers (total count and unique count)

So for example, the above frequency table quickly shows the median age of patients being 74.
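The statistics listed above can be approximated with base R functions. The sketch below uses a hypothetical age vector; the use of boxplot.stats() for outliers is an assumption about the definition, and the exact functions freq() calls may differ:

```r
# Hypothetical ages, including one NA that is dropped first
age <- c(14, 63, 70, 74, 78, 82, 97, NA)
x   <- age[!is.na(age)]

avg  <- mean(x)
sdev <- sd(x)
cv   <- sdev / avg    # coefficient of variation
mad_ <- mad(x)        # median absolute deviation (base R scales it by 1.4826)
five <- fivenum(x)    # Tukey: minimum, lower hinge, median, upper hinge, maximum

# Quartiles with quantile algorithm type 6, as mentioned for the CQV
q   <- quantile(x, probs = c(0.25, 0.75), type = 6, names = FALSE)
iqr <- q[2] - q[1]
cqv <- (q[2] - q[1]) / (q[2] + q[1])

# Assumption: outliers are the values flagged by the boxplot rule
out <- boxplot.stats(x)$out
```

Note that fivenum() reports hinges, which can differ slightly from the type 6 quartiles on small samples.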

Frequencies of factors

To sort frequencies of factors on their levels instead of item count, use the sort.count parameter.

sort.count is TRUE by default. Compare this default behaviour…

septic_patients %>%
  freq(hospital_id)

Frequency table of hospital_id from septic_patients (2,000 x 49)

Class: factor (numeric)
Length: 2,000 (of which NA: 0 = 0.00%)
Levels: 4: A, B, C, D
Unique: 4

	Item	Count	Percent	Cum. Count	Cum. Percent
1	D	762	38.1%	762	38.1%
2	B	663	33.2%	1,425	71.2%
3	A	321	16.0%	1,746	87.3%
4	C	254	12.7%	2,000	100.0%

… to this, where items are now sorted on factor levels:

septic_patients %>%
  freq(hospital_id, sort.count = FALSE)

Frequency table of hospital_id from septic_patients (2,000 x 49)

Class: factor (numeric)
Length: 2,000 (of which NA: 0 = 0.00%)
Levels: 4: A, B, C, D
Unique: 4

	Item	Count	Percent	Cum. Count	Cum. Percent
1	A	321	16.0%	321	16.0%
2	B	663	33.2%	984	49.2%
3	C	254	12.7%	1,238	61.9%
4	D	762	38.1%	2,000	100.0%
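The difference between the two orderings can also be seen in base R, where table() follows the factor levels and sort() reorders by count. The hospital factor below is hypothetical data used only for illustration:

```r
# Hypothetical factor with levels A to D
hospital <- factor(
  c("D", "B", "D", "A", "B", "D", "C", "A", "B", "D"),
  levels = c("A", "B", "C", "D")
)

# table() keeps the order of the factor levels
by_level <- table(hospital)

# Sorting the counts puts the most frequent item first
by_count <- sort(by_level, decreasing = TRUE)

names(by_level)
names(by_count)
```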

All classes are printed in the header. Variables with the new rsi class of this AMR package are actually ordered factors and have three classes (look at Class in the header):

septic_patients %>%
  freq(AMX, header = TRUE)

Frequency table of AMX from septic_patients (2,000 x 49)

Class: factor > ordered > rsi (numeric)
Length: 2,000 (of which NA: 771 = 38.55%)
Levels: 3: S < I < R
Unique: 3

Drug: Amoxicillin (AMX, J01CA04)
Group: Beta-lactams/penicillins
%SI: 44.43%

	Item	Count	Percent	Cum. Count	Cum. Percent
1	R	683	55.6%	683	55.6%
2	S	543	44.2%	1,226	99.8%
3	I	3	0.2%	1,229	100.0%
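The %SI value in the header can be checked by hand from the counts in the table. The sketch below rebuilds a plain ordered factor from those counts in base R; it does not create the package's rsi class, so only two of the three classes appear:

```r
# Counts taken from the table above, plus the 771 missing values
amx <- factor(
  rep(c("S", "I", "R", NA), times = c(543, 3, 683, 771)),
  levels  = c("S", "I", "R"),
  ordered = TRUE
)

# A base R ordered factor carries the classes "ordered" and "factor"
class(amx)

# Share of S and I among the non-missing results, as a percentage
si <- mean(amx[!is.na(amx)] %in% c("S", "I"))
round(100 * si, 2)
```

The result is (543 + 3) / 1,229, which rounds to the 44.43% shown in the header.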

Frequencies of dates

Frequencies of dates will show the oldest and newest date in the data, and the number of days between them:

septic_patients %>%
  freq(date, nmax = 5, header = TRUE)

Frequency table of date from septic_patients (2,000 x 49)

Class: Date (numeric)
Length: 2,000 (of which NA: 0 = 0.00%)
Unique: 1,140

Oldest: 2 January 2002
Newest: 28 December 2017 (+5,839)
Median: 31 July 2009 (47.39%)

	Item	Count	Percent	Cum. Count	Cum. Percent
1	2016-05-21	10	0.5%	10	0.5%
2	2004-11-15	8	0.4%	18	0.9%
3	2013-07-29	8	0.4%	26	1.3%
4	2017-06-12	8	0.4%	34	1.7%
5	2015-11-19	7	0.4%	41	2.0%

(omitted 1,135 entries, n = 1,959 [98.0%])
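The number of days shown after the newest date can be verified with base R date arithmetic. This small sketch uses only the oldest, newest and median dates from the header above:

```r
# Dates taken from the header above
dates <- as.Date(c("2002-01-02", "2017-12-28", "2009-07-31"))

# range() returns the oldest and newest date; diff() gives the time between them
span         <- range(dates)
days_between <- as.integer(diff(span))

days_between
```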

Assigning a frequency table to an object

A frequency table is actually a regular data.frame, with the exception that it contains an additional class.

my_df <- septic_patients %>% freq(age)
class(my_df)

[1] "freq"       "data.frame"

Because of this additional class, a frequency table prints like the examples above. But the object itself contains the complete table without a row limitation:

dim(my_df)

[1] 74 5
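This combination of an extra class and a custom print method is a general S3 pattern in R. The sketch below illustrates it with a hypothetical class my_freq and a hypothetical nmax argument; it is not the package's own print method:

```r
# Hypothetical table with more rows than we want to print
tbl <- data.frame(item = letters[1:10], count = 10:1)

# Add an extra class in front of "data.frame"
class(tbl) <- c("my_freq", class(tbl))

# Print method for the hypothetical class: show only the first rows
print.my_freq <- function(x, nmax = 3, ...) {
  shown <- head(as.data.frame(x), nmax)
  print(shown, ...)
  cat("(omitted", nrow(x) - nrow(shown), "entries)\n")
  invisible(x)
}

print(tbl)  # prints the first 3 rows and a note about the rest
dim(tbl)    # the object itself still holds all 10 rows
```

Because the object still inherits from data.frame, regular data.frame operations such as subsetting or dim() keep working on it.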

Additional parameters

Parameter `na.rm`

The na.rm parameter controls whether NA values are removed from the frequency table. It defaults to TRUE; the number of NA values is always shown in the header. Set na.rm = FALSE to include them as an item:

septic_patients %>%
  freq(AMX, na.rm = FALSE)

Frequency table of AMX from septic_patients (2,000 x 49)

Class: factor > ordered > rsi (numeric)
Length: 2,000 (of which NA: 771 = 38.55%)
Levels: 3: S < I < R
Unique: 4

Drug: Amoxicillin (AMX, J01CA04)
Group: Beta-lactams/penicillins
%SI: 44.43%

	Item	Count	Percent	Cum. Count	Cum. Percent
1	(NA)	771	38.6%	771	38.6%
2	R	683	34.2%	1,454	72.7%
3	S	543	27.2%	1,997	99.8%
4	I	3	0.2%	2,000	100.0%
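Base R's table() offers a comparable switch through its useNA argument, which may help to see why the percentages change once NA is counted as an item. The vector x below is hypothetical data:

```r
# Hypothetical results with two missing values
x <- c("R", "S", NA, "R", "I", NA, "R", "S")

# By default NA is dropped from the counts
without_na <- table(x)

# With useNA = "always", NA is counted as its own item
with_na <- table(x, useNA = "always")

# Percentages now use all 8 values as the denominator instead of 6
prop.table(without_na)
prop.table(with_na)
```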

Parameter `row.names`

A frequency table shows row indices. To remove them, use row.names = FALSE:

septic_patients %>%
  freq(hospital_id, row.names = FALSE)

Frequency table of hospital_id from septic_patients (2,000 x 49)

Class: factor (numeric)
Length: 2,000 (of which NA: 0 = 0.00%)
Levels: 4: A, B, C, D
Unique: 4

Item	Count	Percent	Cum. Count	Cum. Percent
D	762	38.1%	762	38.1%
B	663	33.2%	1,425	71.2%
A	321	16.0%	1,746	87.3%
C	254	12.7%	2,000	100.0%

Parameter `markdown`

The markdown parameter is TRUE by default in non-interactive sessions, like in reports created with R Markdown. With markdown, all rows are always printed unless nmax is set. Without markdown (like in regular R), a frequency table would print like:

septic_patients %>%
  freq(hospital_id, markdown = FALSE)
# Frequency table of `hospital_id` from `septic_patients` (2,000 x 49) 
# 
# Class:   factor (numeric)
# Length:  2,000 (of which NA: 0 = 0.00%)
# Levels:  4: A, B, C, D
# Unique:  4
# 
#      Item    Count   Percent   Cum. Count   Cum. Percent
# ---  -----  ------  --------  -----------  -------------
# 1    D         762     38.1%          762          38.1%
# 2    B         663     33.2%        1,425          71.2%
# 3    A         321     16.0%        1,746          87.3%
# 4    C         254     12.7%        2,000         100.0%

Matthijs S. Berends

10 July 2019
