diff --git a/appveyor.yml b/appveyor.yml index 26f3c673..2ee5ffa3 100644 --- a/appveyor.yml +++ b/appveyor.yml @@ -23,7 +23,7 @@ init: ps: | $ErrorActionPreference = "Stop" - Invoke-WebRequest http://raw.github.com/krlmlr/r-appveyor/master/scripts/appveyor-tool.ps1 -OutFile "..\appveyor-tool.ps1" + Invoke-WebRequest https://raw.githubusercontent.com/krlmlr/r-appveyor/master/scripts/appveyor-tool.ps1 -OutFile "..\appveyor-tool.ps1" Import-Module '..\appveyor-tool.ps1' install: diff --git a/docs/articles/AMR.html b/docs/articles/AMR.html index f9678395..1e382202 100644 --- a/docs/articles/AMR.html +++ b/docs/articles/AMR.html @@ -1,35 +1,75 @@ + - - - - + + + + How to conduct AMR analysis • AMR (for R) - + + + - - - - - - - - + + + + + + + + + + + + + + + + + + + + + + + + + + + + - - - - - - + + + + - - + + + + +
-
+
+ -
+ + + +
-

-Adding new variables

+

Adding new variables

Now that we have the microbial ID, we can add some taxonomic properties:

data <- data %>% 
-  mutate(gramstain = mo_gramstain(bacteria),
-         genus = mo_genus(bacteria),
-         species = mo_species(bacteria))
+ mutate(gramstain = mo_gramstain(bacteria), + genus = mo_genus(bacteria), + species = mo_species(bacteria))
-

-First isolates

+

First isolates

We also need to know which isolates we can actually use for analysis.

To conduct an analysis of antimicrobial resistance, you must only include the first isolate of every patient per episode (Hindler et al., Clin Infect Dis. 2007). If you would not do this, you could easily get an overestimate or underestimate of the resistance of an antibiotic. Imagine that a patient was admitted with an MRSA and that it was found in 5 different blood cultures the following weeks (yes, some countries like the Netherlands have these blood drawing policies). The resistance percentage of oxacillin of all isolates would be overestimated, because you included this MRSA more than once. It would clearly be selection bias.

The Clinical and Laboratory Standards Institute (CLSI) appoints this as follows:

(…) When preparing a cumulative antibiogram to guide clinical decisions about empirical antimicrobial therapy of initial infections, only the first isolate of a given species per patient, per analysis period (eg, one year) should be included, irrespective of body site, antimicrobial susceptibility profile, or other phenotypical characteristics (eg, biotype). The first isolate is easily identified, and cumulative antimicrobial susceptibility test data prepared using the first isolate are generally comparable to cumulative antimicrobial susceptibility test data calculated by other methods, providing duplicate isolates are excluded.
M39-A4 Analysis and Presentation of Cumulative Antimicrobial Susceptibility Test Data, 4th Edition. CLSI, 2014. Chapter 6.4

-

This AMR package includes this methodology with the first_isolate() function. It adopts the episode of a year (can be changed by user) and it starts counting days after every selected isolate. This new variable can easily be added to our data:

+

This AMR package includes this methodology with the first_isolate() function. It adopts the episode of a year (can be changed by user) and it starts counting days after every selected isolate. This new variable can easily be added to our data:

-

So only 28.5% is suitable for resistance analysis! We can now filter on it with the filter() function, also from the dplyr package:

+#> => Found 5,639 first isolates (28.2% of total)
+

So only 28.2% is suitable for resistance analysis! We can now filter on it with the filter() function, also from the dplyr package:

-

For future use, the above two syntaxes can be shortened with the filter_first_isolate() function:

+ filter(first == TRUE)
+

For future use, the above two syntaxes can be shortened with the filter_first_isolate() function:

data_1st <- data %>% 
-  filter_first_isolate()
+ filter_first_isolate()
-

-First weighted isolates

+

First weighted isolates

We made a slight twist to the CLSI algorithm, to take into account the antimicrobial susceptibility profile. Imagine this data, sorted on date:

- - +
+ + @@ -512,33 +547,34 @@ - + + - + - + - + - - + + - + @@ -549,7 +585,7 @@ - + @@ -560,65 +596,65 @@ - + - + - + - - + + - + - - + + - + - - + + - + + - - + - + - + @@ -626,20 +662,21 @@
isolate date patient_idcipr gent first
12010-01-312010-01-19 C6 B_ESCHR_COL R SSR S TRUE
22010-03-092010-03-29 C6 B_ESCHR_COL SSSRR S FALSE
32010-05-302010-05-27 C6 B_ESCHR_COL S
42010-06-222010-06-25 C6 B_ESCHR_COL S
52011-01-102010-06-28 C6 B_ESCHR_COL SI R SS FALSE
62011-02-192010-07-31 C6 B_ESCHR_COLI S S STRUESFALSE
72011-02-212010-08-07 C6 B_ESCHR_COL RIRSS S FALSE
82011-05-112010-11-13 C6 B_ESCHR_COLSSRI S S FALSE
92011-07-062011-04-02 C6 B_ESCHR_COLR S S SSFALSETRUE
102011-08-112011-06-26 C6 B_ESCHR_COLSR S S S
-

Only 2 isolates are marked as ‘first’ according to CLSI guideline. But when reviewing the antibiogram, it is obvious that some isolates are absolutely different strains and should be included too. This is why we weigh isolates, based on their antibiogram. The key_antibiotics() function adds a vector with 18 key antibiotics: 6 broad spectrum ones, 6 small spectrum for Gram negatives and 6 small spectrum for Gram positives. These can be defined by the user.

-

If a column exists with a name like ‘key(…)ab’ the first_isolate() function will automatically use it and determine the first weighted isolates. Mind the NOTEs in below output:

+

Only 2 isolates are marked as ‘first’ according to CLSI guideline. But when reviewing the antibiogram, it is obvious that some isolates are absolutely different strains and should be included too. This is why we weigh isolates, based on their antibiogram. The key_antibiotics() function adds a vector with 18 key antibiotics: 6 broad spectrum ones, 6 small spectrum for Gram negatives and 6 small spectrum for Gram positives. These can be defined by the user.

+

If a column exists with a name like ‘key(…)ab’ the first_isolate() function will automatically use it and determine the first weighted isolates. Mind the NOTEs in below output:

data <- data %>% 
-  mutate(keyab = key_antibiotics(.)) %>% 
-  mutate(first_weighted = first_isolate(.))
+  mutate(keyab = key_antibiotics(.)) %>% 
+  mutate(first_weighted = first_isolate(.))
 #> NOTE: Using column `bacteria` as input for `col_mo`.
 #> NOTE: Using column `bacteria` as input for `col_mo`.
 #> NOTE: Using column `date` as input for `col_date`.
 #> NOTE: Using column `patient_id` as input for `col_patient_id`.
 #> NOTE: Using column `keyab` as input for `col_keyantibiotics`. Use col_keyantibiotics  = FALSE to prevent this.
 #> [Criterion] Inclusion based on key antibiotics, ignoring I.
-#> => Found 15,864 first weighted isolates (79.3% of total)
- - +#> => Found 15,809 first weighted isolates (79.0% of total) +
+ + @@ -650,23 +687,36 @@ - + + - + - + - + + + + + + + + + + + + + @@ -676,21 +726,9 @@ - - - - - - - - - - - - - + @@ -702,70 +740,70 @@ - + - + - + - - + + - + - - + + - + - - + + - + - + + - - - + + - + - + @@ -774,18 +812,19 @@
isolate date patient_idgent first first_weighted
12010-01-312010-01-19 C6 B_ESCHR_COL R SSR S TRUE TRUE
22010-03-092010-03-29C6B_ESCHR_COLSRRSFALSETRUE
32010-05-27 C6 B_ESCHR_COL SFALSE TRUE
32010-05-30C6B_ESCHR_COLSSSSFALSEFALSE
42010-06-222010-06-25 C6 B_ESCHR_COL S
52011-01-102010-06-28 C6 B_ESCHR_COL SI R SS FALSE TRUE
62011-02-192010-07-31 C6 B_ESCHR_COLI S S STRUESFALSE TRUE
72011-02-212010-08-07 C6 B_ESCHR_COL RIRSS S FALSE TRUE
82011-05-112010-11-13 C6 B_ESCHR_COLSSRI S S FALSETRUEFALSE
92011-07-062011-04-02 C6 B_ESCHR_COLR S S SSFALSEFALSETRUETRUE
102011-08-112011-06-26 C6 B_ESCHR_COLSR S S S
-

Instead of 2, now 6 isolates are flagged. In total, 79.3% of all isolates are marked ‘first weighted’ - 50.9% more than when using the CLSI guideline. In real life, this novel algorithm will yield 5-10% more isolates than the classic CLSI guideline.

-

As with filter_first_isolate(), there’s a shortcut for this new algorithm too:

+

Instead of 2, now 7 isolates are flagged. In total, 79% of all isolates are marked ‘first weighted’ - 50.8% more than when using the CLSI guideline. In real life, this novel algorithm will yield 5-10% more isolates than the classic CLSI guideline.

+

As with filter_first_isolate(), there’s a shortcut for this new algorithm too:

data_1st <- data %>% 
-  filter_first_weighted_isolate()
-

So we end up with 15,864 isolates for analysis.

+ filter_first_weighted_isolate()
+

So we end up with 15,809 isolates for analysis.

We can remove unneeded columns:

data_1st <- data_1st %>% 
-  select(-c(first, keyab))
+ select(-c(first, keyab))

Now our data looks like:

-
head(data_1st)
- - +
head(data_1st)
+
+ + @@ -800,13 +839,30 @@ - + + - - - + + + + + + + + + + + + + + + + + + + @@ -818,84 +874,68 @@ - - - - - - - - - - - - - - - - - - - + + + - + + - - - - + + + - - + + - + - - - - - - - - - - - - - - - - + + + + + + + + + + + + + + + + - - - + + + - + + - - - - - + + + + @@ -904,253 +944,258 @@
-

-Analysing the data

+

Analysing the data

You might want to start by getting an idea of how the data is distributed. It’s an important start, because it also decides how you will continue your analysis.

-

-Dispersion of species

-

To just get an idea how the species are distributed, create a frequency table with our freq() function. We created the genus and species column earlier based on the microbial ID. With paste(), we can concatenate them together.

-

The freq() function can be used like the base R language was intended:

-
freq(paste(data_1st$genus, data_1st$species))
+

Dispersion of species

+

To just get an idea how the species are distributed, create a frequency table with our freq() function. We created the genus and species column earlier based on the microbial ID. With paste(), we can concatenate them together.

+

The freq() function can be used like the base R language was intended:

+

Or can be used like the dplyr way, which is easier readable:

-
data_1st %>% freq(genus, species)
-

Frequency table of genus and species from a data.frame (15,864 x 13)

-

Columns: 2
-Length: 15,864 (of which NA: 0 = 0.00%)
+

+

Frequency table of genus and species from a data.frame (15,809 x 13)

+

Columns: 2
+Length: 15,809 (of which NA: 0 = 0.00%)
Unique: 4

-

Shortest: 16
+

Shortest: 16
Longest: 24

-
date patient_idgenus species first_weighted
12012-08-24V322010-01-25W1 Hospital BB_KLBSL_PNERISSFGram negativeKlebsiellapneumoniaeTRUE
32011-02-16N10Hospital C B_ESCHR_COL S Scoli TRUE
22013-04-28S3Hospital BB_STPHY_AURRSSSFGram positiveStaphylococcusaureusTRUE
32013-05-13F242010-04-01J4 Hospital DB_STPHY_AURB_ESCHR_COLS S SR S MGram positiveStaphylococcusaureusGram negativeEscherichiacoli TRUE
52015-12-23G92015-02-27O2 Hospital BB_STPHY_AURB_ESCHR_COL SSSSMGram positiveStaphylococcusaureusTRUE
62011-12-19T2Hospital BB_STPHY_AUR R S SS FGram negativeEscherichiacoliTRUE
72011-08-19D6Hospital BB_STPHY_AURSISSM Gram positive Staphylococcus aureus TRUE
72011-11-26J1102017-01-12P10 Hospital AB_ESCHR_COLB_STPHY_AURS R S SRMGram negativeEscherichiacoliFGram positiveStaphylococcusaureus TRUE
- +
+ + - + + - - - - + + + + - - - - + + + + - - - - + + + + - - - + + +
Item Count Percent Cum. Count Cum. Percent
1 Escherichia coli7,84949.5%7,84949.5%7,88949.9%7,88949.9%
2 Staphylococcus aureus4,01025.3%11,85974.8%3,87824.5%11,76774.4%
3 Streptococcus pneumoniae2,44315.4%14,30290.2%2,50115.8%14,26890.3%
4 Klebsiella pneumoniae1,5629.8%15,8641,5419.7%15,809 100.0%
-

-Resistance percentages

+

Resistance percentages

The functions portion_R, portion_RI, portion_I, portion_IS and portion_S can be used to determine the portion of a specific antimicrobial outcome. They can be used on their own:

-
data_1st %>% portion_IR(amox)
-#> [1] 0.4777484
-

Or can be used in conjuction with group_by() and summarise(), both from the dplyr package:

+
data_1st %>% portion_IR(amox)
+#> [1] 0.4737808
+

Or can be used in conjuction with group_by() and summarise(), both from the dplyr package:

data_1st %>% 
-  group_by(hospital) %>% 
-  summarise(amoxicillin = portion_IR(amox))
- - + group_by(hospital) %>% + summarise(amoxicillin = portion_IR(amox)) +
+ + - + + - + - + - + - +
hospital amoxicillin
Hospital A0.48142770.4735608
Hospital B0.48112520.4708504
Hospital C0.48438150.4779765
Hospital D0.46107210.4762201
-

Of course it would be very convenient to know the number of isolates responsible for the percentages. For that purpose the n_rsi() can be used, which works exactly like n_distinct() from the dplyr package. It counts all isolates available for every group (i.e. values S, I or R):

+

Of course it would be very convenient to know the number of isolates responsible for the percentages. For that purpose the n_rsi() can be used, which works exactly like n_distinct() from the dplyr package. It counts all isolates available for every group (i.e. values S, I or R):

data_1st %>% 
-  group_by(hospital) %>% 
-  summarise(amoxicillin = portion_IR(amox),
-            available = n_rsi(amox))
- - + group_by(hospital) %>% + summarise(amoxicillin = portion_IR(amox), + available = n_rsi(amox)) +
+ + - + + - - + + - - + + - - + + - - + +
hospital amoxicillin available
Hospital A0.481427748190.47356084690
Hospital B0.481125255100.47085045609
Hospital C0.484381524010.47797652293
Hospital D0.461072131340.47622013217

These functions can also be used to get the portion of multiple antibiotics, to calculate co-resistance very easily:

data_1st %>% 
-  group_by(genus) %>% 
-  summarise(amoxicillin = portion_S(amcl),
-            gentamicin = portion_S(gent),
-            "amox + gent" = portion_S(amcl, gent))
- - + group_by(genus) %>% + summarise(amoxicillin = portion_S(amcl), + gentamicin = portion_S(gent), + "amox + gent" = portion_S(amcl, gent)) +
+ + - + + - - - + + + - - - + + + - - - + + + - + - +
genus amoxicillin gentamicin amox + gent
Escherichia0.71805330.89833100.97044210.73228550.90074790.9751553
Klebsiella0.71895010.88796410.96542890.73523690.89941600.9759896
Staphylococcus0.72319200.92169580.97905240.72408460.91877260.9744714
Streptococcus0.73720840.7417033 0.00000000.73720840.7417033

To make a transition to the next part, let’s see how this difference could be plotted:

data_1st %>% 
-  group_by(genus) %>% 
-  summarise("1. Amoxicillin" = portion_S(amcl),
-            "2. Gentamicin" = portion_S(gent),
-            "3. Amox + gent" = portion_S(amcl, gent)) %>% 
-  tidyr::gather("Antibiotic", "S", -genus) %>%
-  ggplot(aes(x = genus,
+  group_by(genus) %>% 
+  summarise("1. Amoxicillin" = portion_S(amcl),
+            "2. Gentamicin" = portion_S(gent),
+            "3. Amox + gent" = portion_S(amcl, gent)) %>% 
+  tidyr::gather("Antibiotic", "S", -genus) %>%
+  ggplot(aes(x = genus,
              y = S,
              fill = Antibiotic)) +
-  geom_col(position = "dodge2")
-

+ geom_col(position = "dodge2")
+

-

-Plots

+

Plots

To show results in plots, most R users would nowadays use the ggplot2 package. This package lets you create plots in layers. You can read more about it on their website. A quick example would look like these syntaxes:

-
ggplot(data = a_data_set,
-       mapping = aes(x = year,
+
-

The AMR package contains functions to extend this ggplot2 package, for example geom_rsi(). It automatically transforms data with count_df() or portion_df() and show results in stacked bars. Its simplest and shortest example:

-
ggplot(data_1st) +
-  geom_rsi(translate_ab = FALSE)
-

+ggplot(a_data_set, + aes(year, value) + + geom_bar()
+

The AMR package contains functions to extend this ggplot2 package, for example geom_rsi(). It automatically transforms data with count_df() or portion_df() and show results in stacked bars. Its simplest and shortest example:

+
ggplot(data_1st) +
+  geom_rsi(translate_ab = FALSE)
+

Omit the translate_ab = FALSE to have the antibiotic codes (amox, amcl, cipr, gent) translated to official WHO names (amoxicillin, amoxicillin and betalactamase inhibitor, ciprofloxacin, gentamicin).

If we group on e.g. the genus column and add some additional functions from our package, we can create this:

# group the data on `genus`
-ggplot(data_1st %>% group_by(genus)) + 
+ggplot(data_1st %>% group_by(genus)) + 
   # create bars with genus on x axis
   # it looks for variables with class `rsi`,
   # of which we have 4 (earlier created with `as.rsi`)
-  geom_rsi(x = "genus") + 
+  geom_rsi(x = "genus") + 
   # split plots on antibiotic
-  facet_rsi(facet = "Antibiotic") +
+  facet_rsi(facet = "Antibiotic") +
   # make R red, I yellow and S green
-  scale_rsi_colours() +
+  scale_rsi_colours() +
   # show percentages on y axis
-  scale_y_percent(breaks = 0:4 * 25) +
+  scale_y_percent(breaks = 0:4 * 25) +
   # turn 90 degrees, make it bars instead of columns
-  coord_flip() +
+  coord_flip() +
   # add labels
-  labs(title = "Resistance per genus and antibiotic", 
-       subtitle = "(this is fake data)") +
+  labs(title = "Resistance per genus and antibiotic", 
+       subtitle = "(this is fake data)") +
   # and print genus in italic to follow our convention
   # (is now y axis because we turned the plot)
-  theme(axis.text.y = element_text(face = "italic"))
-

-

To simplify this, we also created the ggplot_rsi() function, which combines almost all above functions:

+ theme(axis.text.y = element_text(face = "italic"))
+

+

To simplify this, we also created the ggplot_rsi() function, which combines almost all above functions:

data_1st %>% 
-  group_by(genus) %>%
-  ggplot_rsi(x = "genus",
-             facet = "Antibiotic",
+  group_by(genus) %>%
+  ggplot_rsi(x = "genus",
+             facet = "Antibiotic",
              breaks = 0:4 * 25,
              datalabels = FALSE) +
-  coord_flip()
-

+ coord_flip() +

-

-Independence test

+

Independence test

The next example uses the included septic_patients, which is an anonymised data set containing 2,000 microbial blood culture isolates with their full antibiograms found in septic patients in 4 different hospitals in the Netherlands, between 2001 and 2017. It is true, genuine data. This data.frame can be used to practice AMR analysis.

-

We will compare the resistance to fosfomycin (column fosf) in hospital A and D. The input for the final fisher.test() will be this:

- - +

We will compare the resistance to fosfomycin (column fosf) in hospital A and D. The input for the final fisher.test() will be this:

+
+ + - + + @@ -1166,16 +1211,16 @@ Longest: 24

A D
IR

We can transform the data and apply the test in only a couple of lines:

septic_patients %>%
-  filter(hospital_id %in% c("A", "D")) %>% # filter on only hospitals A and D
-  select(hospital_id, fosf) %>%            # select the hospitals and fosfomycin
-  group_by(hospital_id) %>%                # group on the hospitals
-  count_df(combine_IR = TRUE) %>%          # count all isolates per group (hospital_id)
-  tidyr::spread(hospital_id, Value) %>%    # transform output so A and D are columns
-  select(A, D) %>%                         # and select these only
-  as.matrix() %>%                          # transform to good old matrix for fisher.test()
-  fisher.test()                            # do Fisher's Exact Test
+  filter(hospital_id %in% c("A", "D")) %>% # filter on only hospitals A and D
+  select(hospital_id, fosf) %>%            # select the hospitals and fosfomycin
+  group_by(hospital_id) %>%                # group on the hospitals
+  count_df(combine_IR = TRUE) %>%          # count all isolates per group (hospital_id)
+  tidyr::spread(hospital_id, Value) %>%    # transform output so A and D are columns
+  select(A, D) %>%                         # and select these only
+  as.matrix() %>%                          # transform to good old matrix for fisher.test()
+  fisher.test()                            # do Fisher's Exact Test
 #> 
-#>  Fisher's Exact Test for Count Data
+#>  Fisher's Exact Test for Count Data
 #> 
 #> data:  .
 #> p-value = 0.03104
@@ -1192,56 +1237,49 @@ Longest: 24

-