diff --git a/DESCRIPTION b/DESCRIPTION index a56411ca..b95d2456 100644 --- a/DESCRIPTION +++ b/DESCRIPTION @@ -1,6 +1,6 @@ Package: AMR -Version: 0.5.0.9016 -Date: 2019-02-04 +Version: 0.5.0.9017 +Date: 2019-02-08 Title: Antimicrobial Resistance Analysis Authors@R: c( person( diff --git a/NAMESPACE b/NAMESPACE index 0965661a..79cac6fa 100755 --- a/NAMESPACE +++ b/NAMESPACE @@ -128,6 +128,7 @@ export(mo_subkingdom) export(mo_subspecies) export(mo_taxonomy) export(mo_type) +export(mo_uncertainties) export(mo_year) export(mrgn) export(n_rsi) @@ -195,6 +196,7 @@ importFrom(crayon,black) importFrom(crayon,blue) importFrom(crayon,bold) importFrom(crayon,green) +importFrom(crayon,has_color) importFrom(crayon,italic) importFrom(crayon,magenta) importFrom(crayon,red) diff --git a/NEWS.md b/NEWS.md index 8f2e9770..16b6a055 100755 --- a/NEWS.md +++ b/NEWS.md @@ -6,6 +6,7 @@ * Support for data from [WHONET](https://whonet.org/) and [EARS-Net](https://ecdc.europa.eu/en/about-us/partnerships-and-networks/disease-and-laboratory-networks/ears-net) (European Antimicrobial Resistance Surveillance Network): * Exported files from WHONET can be read and used in this package. For functions like `first_isolate()` and `eucast_rules()`, all parameters will be filled in automatically. * This package now knows all antibiotic abbrevations by EARS-Net (which are also being used by WHONET) - the `antibiotics` data set now contains a column `ears_net`. + * The function `as.mo()` now knows all WHONET species abbreviations too, because more than 1,600 microbial abbreviations were added to the `microorganisms.codes` data set. * All `ab_*` functions are deprecated and replaced by `atc_*` functions: ```r ab_property -> atc_property() @@ -24,6 +25,7 @@ * Support for the upcoming [`dplyr`](https://dplyr.tidyverse.org) version 0.8.0 * New function `guess_ab_col()` to find an antibiotic column in a table * New function `mo_failures()` to review values that could not be coerced to a valid MO code, using `as.mo()`. This latter function will now only show a maximum of 10 uncoerced values and will refer to `mo_failures()`. +* New function `mo_uncertainties()` to review values that could be coerced to a valid MO code using `as.mo()`, but with uncertainty. * New function `mo_renamed()` to get a list of all returned values from `as.mo()` that have had taxonomic renaming * New function `age()` to calculate the (patients) age in years * New function `age_groups()` to split ages into custom or predefined groups (like children or elderly). This allows for easier demographic antimicrobial resistance analysis per age group. @@ -46,23 +48,27 @@ filter(only_firsts == TRUE) %>% select(-only_firsts) ``` +* New function `availability()` to check the number of available (non-empty) results in a `data.frame` * New vignettes about how to conduct AMR analysis, predict antimicrobial resistance, use the *G*-test and more. These are also available (and even easier readable) on our website: https://msberends.gitlab.io/AMR. #### Changed +* Function `eucast_rules()`: + * Updated EUCAST Clinical breakpoints to [version 9.0 of 1 January 2019](http://www.eucast.org/clinical_breakpoints/), the data set `septic_patients` now reflects these changes + * Fixed a critical bug where some rules that depend on previous applied rules would not be applied adequately + * Emphasised in manual that penicillin is meant as benzylpenicillin (ATC [J01CE01](https://www.whocc.no/atc_ddd_index/?code=J01CE01)) + * New info is returned when running this function, stating exactly what has been changed or added. Use `eucast_rules(..., verbose = TRUE)` to get a data set with all changed per bug and drug combination. +* Added 605 *Aspergillus* species and 23 *Trichophyton* species to the `microorganisms` data set * Added 65 antibiotics to the `antibiotics` data set, from the [Pharmaceuticals Community Register](http://ec.europa.eu/health/documents/community-register/html/atc.htm) of the European Commission * Removed columns `atc_group1_nl` and `atc_group2_nl` from the `antibiotics` data set * Functions `atc_ddd()` and `atc_groups()` have been renamed `atc_online_ddd()` and `atc_online_groups()`. The old functions are deprecated and will be removed in a future version. * Function `guess_mo()` is now deprecated in favour of `as.mo()` and will be removed in future versions * Function `guess_atc()` is now deprecated in favour of `as.atc()` and will be removed in future versions -* Function `eucast_rules()`: - * Updated EUCAST Clinical breakpoints to [version 9.0 of 1 January 2019](http://www.eucast.org/clinical_breakpoints/) - * Fixed a critical bug where some rules that depend on previous applied rules would not be applied adequately - * Emphasised in manual that penicillin is meant as benzylpenicillin (ATC [J01CE01](https://www.whocc.no/atc_ddd_index/?code=J01CE01)) * Improvements for `as.mo()`: * Fix for vector containing only empty values * Finds better results when input is in other languages * Better handling for subspecies * Better handling for *Salmonellae* + * Understanding of highly virulent *E. coli* strains like EIEC, EPEC and STEC * There will be looked for uncertain results at default - these results will be returned with an informative warning * Manual now contains more info about the algorithms * Progress bar will be shown when it takes more than 3 seconds to get results diff --git a/R/data.R b/R/data.R index 53150089..42e76ea2 100755 --- a/R/data.R +++ b/R/data.R @@ -134,7 +134,7 @@ #' #' A data set containing the complete microbial taxonomy of the kingdoms Bacteria, Fungi and Protozoa from ITIS. MO codes can be looked up using \code{\link{as.mo}}. #' @inheritSection ITIS ITIS -#' @format A \code{\link{data.frame}} with 18,833 observations and 15 variables: +#' @format A \code{\link{data.frame}} with 19,456 observations and 15 variables: #' \describe{ #' \item{\code{mo}}{ID of microorganism} #' \item{\code{tsn}}{Taxonomic Serial Number (TSN), as defined by ITIS} @@ -153,6 +153,17 @@ #' \item{\code{ref}}{Author(s) and year of concerning publication as found in ITIS, see Source} #' } #' @source Integrated Taxonomic Information System (ITIS) public online database, \url{https://www.itis.gov}. +#' @details Manually added were: +#' \itemize{ +#' \item{605 species of Aspergillus (as Aspergillus misses from ITIS, list from https://en.wikipedia.org/wiki/List_of_Aspergillus_species on 2019-02-05)} +#' \item{23 species of Trichophyton (as Trichophyton misses from ITIS, list from https://en.wikipedia.org/wiki/Trichophyton on 2019-02-05)} +#' \item{9 species of Streptococcus (beta haemolytic groups A, B, C, D, F, G, H, K and unspecified)} +#' \item{2 species of Straphylococcus (coagulase-negative [CoNS] and coagulase-positive [CoPS])} +#' \item{1 species of Candida (C. glabrata)} +#' \item{2 other undefined (unknown Gram negatives and unknown Gram positives)} +#' } +#' +#' These manual entries have no Taxonomic Serial Number (TSN), so can be looked up with \code{filter(microorganisms, is.na(tsn)}. #' @inheritSection AMR Read more on our website! #' @seealso \code{\link{as.mo}} \code{\link{mo_property}} \code{\link{microorganisms.codes}} "microorganisms" @@ -175,12 +186,13 @@ #' Translation table for microorganism codes #' -#' A data set containing commonly used codes for microorganisms. Define your own with \code{\link{set_mo_source}}. -#' @format A \code{\link{data.frame}} with 3,303 observations and 2 variables: +#' A data set containing commonly used codes for microorganisms, from laboratory systems and WHONET. Define your own with \code{\link{set_mo_source}}. +#' @format A \code{\link{data.frame}} with 4,731 observations and 2 variables: #' \describe{ #' \item{\code{certe}}{Commonly used code of a microorganism} -#' \item{\code{mo}}{Code of microorganism in \code{\link{microorganisms}}} +#' \item{\code{mo}}{ID of the microorganism in the \code{\link{microorganisms}} data set} #' } +#' @inheritSection ITIS ITIS #' @inheritSection AMR Read more on our website! #' @seealso \code{\link{as.mo}} \code{\link{microorganisms}} "microorganisms.codes" @@ -246,17 +258,21 @@ #' @name supplementary_data #' @inheritSection AMR Read more on our website! # # Renew data: +# # sorted on (1) bacteria, (2) fungi, (3) protozoa and then human pathogenic prevalence and then TSN: # microorganismsDT <- data.table::as.data.table(AMR::microorganisms) -# # sort on (1) bacteria, (2) fungi, (3) protozoa and then human pathogenic prevalence and then TSN: # data.table::setkey(microorganismsDT, kingdom, prevalence, fullname) -# microorganisms.prevDT <- microorganismsDT[prevalence == 9999,] -# microorganisms.unprevDT <- microorganismsDT[prevalence != 9999,] +# microorganisms.prevDT <- microorganismsDT[prevalence != 9999,] +# microorganisms.unprevDT <- microorganismsDT[prevalence == 9999,] # microorganisms.oldDT <- data.table::as.data.table(AMR::microorganisms.old) # data.table::setkey(microorganisms.oldDT, tsn, name) -# devtools::use_data(microorganismsDT, overwrite = TRUE) -# devtools::use_data(microorganisms.prevDT, overwrite = TRUE) -# devtools::use_data(microorganisms.unprevDT, overwrite = TRUE) -# devtools::use_data(microorganisms.oldDT, overwrite = TRUE) +# usethis::use_data(microorganismsDT, overwrite = TRUE) +# usethis::use_data(microorganisms.prevDT, overwrite = TRUE) +# usethis::use_data(microorganisms.unprevDT, overwrite = TRUE) +# usethis::use_data(microorganisms.oldDT, overwrite = TRUE) +# rm(microorganismsDT) +# rm(microorganisms.prevDT) +# rm(microorganisms.unprevDT) +# rm(microorganisms.oldDT) "microorganismsDT" #' @rdname supplementary_data diff --git a/R/eucast_rules.R b/R/eucast_rules.R index 059c1604..77b734df 100755 --- a/R/eucast_rules.R +++ b/R/eucast_rules.R @@ -25,7 +25,7 @@ #' @param tbl table with antibiotic columns, like e.g. \code{amox} and \code{amcl} #' @param info print progress #' @param rules a character vector that specifies which rules should be applied - one or more of \code{c("breakpoints", "expert", "other", "all")} -#' @param verbose a logical to indicate whether extensive info should be returned as a \code{data.frame} with info about which rows and columns are effected +#' @param verbose a logical to indicate whether extensive info should be returned as a \code{data.frame} with info about which rows and columns are effected. It runs all EUCAST rules, but will not be applied to an output - only an informative \code{data.frame} with changes will be returned as output. #' @param amcl,amik,amox,ampi,azit,azlo,aztr,cefa,cfep,cfot,cfox,cfra,cfta,cftr,cfur,chlo,cipr,clar,clin,clox,coli,czol,dapt,doxy,erta,eryt,fosf,fusi,gent,imip,kana,levo,linc,line,mero,mezl,mino,moxi,nali,neom,neti,nitr,norf,novo,oflo,oxac,peni,pipe,pita,poly,pris,qida,rifa,roxi,siso,teic,tetr,tica,tige,tobr,trim,trsu,vanc column name of an antibiotic, see Antibiotics #' @param ... parameters that are passed on to \code{eucast_rules} #' @inheritParams first_isolate @@ -101,7 +101,7 @@ #' @export #' @importFrom dplyr %>% select pull mutate_at vars #' @importFrom crayon bold bgGreen bgYellow bgRed black green blue italic strip_style -#' @return The input of \code{tbl}, possibly with edited values of antibiotics. Or, if \code{verbose = TRUE}, a \code{data.frame} with verbose info. +#' @return The input of \code{tbl}, possibly with edited values of antibiotics. Or, if \code{verbose = TRUE}, a \code{data.frame} with all original and new values of the affected bug-drug combinations. #' @source #' \itemize{ #' \item{ @@ -144,7 +144,9 @@ #' # 4 Klebsiella pneumoniae - - - - - S S #' # 5 Pseudomonas aeruginosa - - - - - S S #' -#' b <- eucast_rules(a, "mo") # 18 results are forced as R or S +#' +#' # apply EUCAST rules: 18 results are forced as R or S +#' b <- eucast_rules(a) #' #' b #' # mo vanc amox coli cfta cfur peni cfox @@ -153,6 +155,11 @@ #' # 3 Escherichia coli R - - - - R S #' # 4 Klebsiella pneumoniae R R - - - R S #' # 5 Pseudomonas aeruginosa R R - - R R R +#' +#' +#' # do not apply EUCAST rules, but rather get a a data.frame +#' # with 18 rows, containing all details about the transformations: +#' c <- eucast_rules(a, verbose = TRUE) eucast_rules <- function(tbl, col_mo = NULL, info = TRUE, @@ -406,22 +413,31 @@ eucast_rules <- function(tbl, trsu <- col.list[trsu] vanc <- col.list[vanc] - number_changed <- 0 + number_added_S <- 0 + number_added_I <- 0 + number_added_R <- 0 + number_changed_to_S <- 0 + number_changed_to_I <- 0 + number_changed_to_R <- 0 + number_affected_rows <- integer(0) - verbose_info <- data.frame(rule_type = character(0), - rule_set = character(0), - force_to = character(0), - found = integer(0), - changed = integer(0), - target_columns = integer(0), - target_rows = integer(0), + verbose_info <- data.frame(row = integer(0), + col = character(0), + mo = character(0), + mo_fullname = character(0), + old = character(0), + new = character(0), + rule_source = character(0), + rule_group = character(0), stringsAsFactors = FALSE) # helper function for editing the table edit_rsi <- function(to, rule, rows, cols) { cols <- unique(cols[!is.na(cols) & !is.null(cols)]) if (length(rows) > 0 & length(cols) > 0) { + before_df <- tbl_original before <- as.character(unlist(as.list(tbl_original[rows, cols]))) + tryCatch( # insert into original table tbl_original[rows, cols] <<- to, @@ -442,29 +458,81 @@ eucast_rules <- function(tbl, suppressWarnings( tbl[rows, cols] <<- to )) + after <- as.character(unlist(as.list(tbl_original[rows, cols]))) - number_changed <<- number_changed + sum(before != after, na.rm = TRUE) + + tbl[rows, cols] <<- tbl_original[rows, cols] + + number_newly_added_S <- sum(!before %in% c("S", "I", "R") & after == "S", na.rm = TRUE) + number_newly_added_I <- sum(!before %in% c("S", "I", "R") & after == "I", na.rm = TRUE) + number_newly_added_R <- sum(!before %in% c("S", "I", "R") & after == "R", na.rm = TRUE) + number_newly_changed_to_S <- sum(before %in% c("I", "R") & after == "S", na.rm = TRUE) + number_newly_changed_to_I <- sum(before %in% c("S", "R") & after == "I", na.rm = TRUE) + number_newly_changed_to_R <- sum(before %in% c("S", "I") & after == "R", na.rm = TRUE) + + # totals + number_added_S <<- number_added_S + number_newly_added_S + number_added_I <<- number_added_I + number_newly_added_I + number_added_R <<- number_added_R + number_newly_added_R + number_changed_to_S <<- number_changed_to_S + number_newly_changed_to_S + number_changed_to_I <<- number_changed_to_I + number_newly_changed_to_I + number_changed_to_R <<- number_changed_to_R + number_newly_changed_to_R number_affected_rows <<- unique(c(number_affected_rows, rows)) - changed_results <<- changed_results + sum(before != after, na.rm = TRUE) # will be reset at start of every rule + + # will be reset at start of every rule + changed_results <<- changed_results + + number_newly_added_S + + number_newly_added_I + + number_newly_added_R + + number_newly_changed_to_S + + number_newly_changed_to_I + + number_newly_changed_to_R if (verbose == TRUE) { - for (i in 1:length(cols)) { - # add new row for every affected column - verbose_new <- data.frame(rule_type = strip_style(rule[1]), - rule_set = strip_style(rule[2]), - force_to = to, - found = length(before), - changed = sum(before != after, na.rm = TRUE), - target_column = cols[i], - stringsAsFactors = FALSE) - verbose_new$target_rows <- list(unname(rows)) - rownames(verbose_new) <- NULL - verbose_info <<- rbind(verbose_info, verbose_new) + for (r in 1:length(rows)) { + for (c in 1:length(cols)) { + old <- before_df[rows[r], cols[c]] + new <- tbl[rows[r], cols[c]] + if (!identical(old, new)) { + verbose_new <- data.frame(row = rows[r], + col = cols[c], + mo = tbl_original[rows[r], col_mo], + mo_fullname = "", + old = old, + new = new, + rule_source = strip_style(rule[1]), + rule_group = strip_style(rule[2]), + stringsAsFactors = FALSE) + verbose_info <<- rbind(verbose_info, verbose_new) + } + } } + # verbose_new <- data.frame(row = integer(0), + # col = character(0), + # old = character(0), + # new = character(0), + # rule_source = character(0), + # rule_group = character(0), + # stringsAsFactors = FALSE) + # a <<- rule + # for (i in 1:length(cols)) { + # # add new row for every affected column + # verbose_new <- data.frame(rule_type = strip_style(rule[1]), + # rule_set = strip_style(rule[2]), + # force_to = to, + # found = length(before), + # changed = sum(before != after, na.rm = TRUE), + # target_column = cols[i], + # stringsAsFactors = FALSE) + # verbose_new$target_rows <- list(unname(rows)) + # rownames(verbose_new) <- NULL + # verbose_info <<- rbind(verbose_info, verbose_new) + # } } } } + na.rm <- function(col) { if (is.null(col)) { "" @@ -489,15 +557,15 @@ eucast_rules <- function(tbl, # since ampicillin ^= amoxicillin, get the first from the latter (not in original EUCAST table) if (!is.null(ampi) & !is.null(amox)) { if (verbose == TRUE) { - cat(bgGreen("\n VERBOSE: transforming", - length(which(tbl[, amox] == "S" & !tbl[, ampi] %in% c("S", "I", "R"))), - "empty ampicillin fields to 'S' based on amoxicillin. ")) - cat(bgGreen("\n VERBOSE: transforming", - length(which(tbl[, amox] == "I" & !tbl[, ampi] %in% c("S", "I", "R"))), - "empty ampicillin fields to 'I' based on amoxicillin. ")) - cat(bgGreen("\n VERBOSE: transforming", - length(which(tbl[, amox] == "R" & !tbl[, ampi] %in% c("S", "I", "R"))), - "empty ampicillin fields to 'R' based on amoxicillin. \n")) + cat("\n VERBOSE: transforming", + length(which(tbl[, amox] == "S" & !tbl[, ampi] %in% c("S", "I", "R"))), + "empty ampicillin fields to 'S' based on amoxicillin. ") + cat("\n VERBOSE: transforming", + length(which(tbl[, amox] == "I" & !tbl[, ampi] %in% c("S", "I", "R"))), + "empty ampicillin fields to 'I' based on amoxicillin. ") + cat("\n VERBOSE: transforming", + length(which(tbl[, amox] == "R" & !tbl[, ampi] %in% c("S", "I", "R"))), + "empty ampicillin fields to 'R' based on amoxicillin. \n") } tbl[which(tbl[, amox] == "S" & !tbl[, ampi] %in% c("S", "I", "R")), ampi] <- "S" tbl[which(tbl[, amox] == "I" & !tbl[, ampi] %in% c("S", "I", "R")), ampi] <- "I" @@ -1804,22 +1872,46 @@ eucast_rules <- function(tbl, } else { wouldve <- "" } - if (number_changed == 0) { - colour <- green + if (sum(number_added_S, number_added_I, number_added_R, + number_changed_to_S, number_changed_to_I, number_changed_to_R, + na.rm = TRUE) == 0) { + colour <- green # is function } else { - colour <- blue + colour <- blue # is function } decimal.mark <- getOption("OutDec") big.mark <- ifelse(decimal.mark != ",", ",", ".") + formatnr <- function(x) { + format(x, big.mark = big.mark, decimal.mark = decimal.mark) + } cat(bold(paste('\n=> EUCAST rules', paste0(wouldve, 'affected'), - number_affected_rows %>% length() %>% format(big.mark = big.mark, decimal.mark = decimal.mark), - 'out of', nrow(tbl_original) %>% format(big.mark = big.mark, decimal.mark = decimal.mark), - 'rows ->', - colour(paste0(wouldve, 'changed'), - number_changed %>% format(big.mark = big.mark, decimal.mark = decimal.mark), 'test results.\n\n')))) + number_affected_rows %>% length() %>% formatnr(), + 'out of', nrow(tbl_original) %>% formatnr(), + 'rows\n'))) + total_added <- number_added_S + number_added_I + number_added_R + total_changed <- number_changed_to_S + number_changed_to_I + number_changed_to_R + cat(colour(paste0(" -> ", wouldve, "added ", + bold(formatnr(total_added), "test results"), + if(total_added > 0) + paste0(" (", formatnr(number_added_S), " as S; ", + formatnr(number_added_I), " as I; ", + formatnr(number_added_R), " as R)"), + "\n"))) + cat(colour(paste0(" -> ", wouldve, "changed ", + bold(formatnr(total_changed), "test results"), + if(total_changed > 0) + paste0(" (", formatnr(number_changed_to_S), " to S; ", + formatnr(number_changed_to_I), " to I; ", + formatnr(number_changed_to_R), " to R)"), + "\n"))) } if (verbose == TRUE) { + suppressWarnings( + suppressMessages( + verbose_info$mo_fullname <- mo_fullname(verbose_info$mo) + ) + ) return(verbose_info) } diff --git a/R/freq.R b/R/freq.R index 34fe6013..4158bdc9 100755 --- a/R/freq.R +++ b/R/freq.R @@ -228,7 +228,7 @@ frequency_tbl <- function(x, x.name <- x.name %>% strsplit("%>%", fixed = TRUE) %>% unlist() %>% .[1] %>% trimws() } if (x.name == ".") { - x.name <- "a `data.frame`" + x.name <- "a data.frame" } else { x.name <- paste0("`", x.name, "`") } @@ -797,11 +797,30 @@ print.frequency_tbl <- function(x, opt <- attr(x, "opt") opt$header_txt <- header(x) + dots <- list(...) + if ("markdown" %in% names(dots)) { + if (dots$markdown == TRUE) { + opt$tbl_format <- "markdown" + } else { + opt$tbl_format <- "pandoc" + } + } + if (!missing(markdown)) { + if (markdown == TRUE) { + opt$tbl_format <- "markdown" + } else { + opt$tbl_format <- "pandoc" + } + } + if (length(opt$vars) == 0) { opt$vars <- NULL } if (is.null(opt$title)) { + if (isTRUE(opt$data %like% "^a data.frame") & opt$tbl_format == "markdown") { + opt$data <- gsub("data.frame", "`data.frame`", opt$data, fixed = TRUE) + } if (!is.null(opt$data) & !is.null(opt$vars)) { title <- paste0("`", paste0(opt$vars, collapse = "` and `"), "` from ", opt$data) } else if (!is.null(opt$data) & is.null(opt$vars)) { @@ -845,21 +864,6 @@ print.frequency_tbl <- function(x, if (!missing(big.mark)) { opt$big.mark <- big.mark } - dots <- list(...) - if ("markdown" %in% names(dots)) { - if (dots$markdown == TRUE) { - opt$tbl_format <- "markdown" - } else { - opt$tbl_format <- "pandoc" - } - } - if (!missing(markdown)) { - if (markdown == TRUE) { - opt$tbl_format <- "markdown" - } else { - opt$tbl_format <- "pandoc" - } - } if (!missing(header)) { opt$header <- header } diff --git a/R/mo.R b/R/mo.R index 22d78454..8b015b05 100755 --- a/R/mo.R +++ b/R/mo.R @@ -54,7 +54,7 @@ #' #' This function uses Artificial Intelligence (AI) to help getting fast and logical results. It tries to find matches in this order: #' \itemize{ -#' \item{Taxonomic kingdom: it first searches in bacteria, then fungi, then protozoa} +#' \item{Taxonomic kingdom: it first searches in Bacteria, then Fungi, then Protozoa} #' \item{Human pathogenic prevalence: it first searches in more prevalent microorganisms, then less prevalent ones} #' \item{Valid MO codes and full names: it first searches in already valid MO code and known genus/species combinations} #' \item{Breakdown of input values: from here it starts to breakdown input values to find possible matches} @@ -69,13 +69,30 @@ #' } #' This means that looking up human pathogenic microorganisms takes less time than looking up human \strong{non}-pathogenic microorganisms. #' -#' When using \code{allow_uncertain = TRUE} (which is the default setting), it will use additional rules if all previous AI rules failed to get valid results. Examples: +#' \strong{UNCERTAIN RESULTS} \cr +#' When using \code{allow_uncertain = TRUE} (which is the default setting), it will use additional rules if all previous AI rules failed to get valid results. These are: +#' \itemize{ +#' \item{It tries to look for previously accepted (but now invalid) taxonomic names} +#' \item{It strips off values between brackets and the brackets itself, and re-evaluates the input with all previous rules} +#' \item{It strips off words from the end one by one and re-evaluates the input with all previous rules} +#' \item{It strips off words from the start one by one and re-evaluates the input with all previous rules} +#' \item{It tries to look for some manual changes which are not yet published to the ITIS database (like \emph{Propionibacterium} not yet being \emph{Cutibacterium})} +#' } +#' +#' Examples: #' \itemize{ #' \item{\code{"Streptococcus group B (known as S. agalactiae)"}. The text between brackets will be removed and a warning will be thrown that the result \emph{Streptococcus group B} (\code{B_STRPTC_GRB}) needs review.} #' \item{\code{"S. aureus - please mind: MRSA"}. The last word will be stripped, after which the function will try to find a match. If it does not, the second last word will be stripped, etc. Again, a warning will be thrown that the result \emph{Staphylococcus aureus} (\code{B_STPHY_AUR}) needs review.} #' \item{\code{"D. spartina"}. This is the abbreviation of an old taxonomic name: \emph{Didymosphaeria spartinae} (the last "e" was missing from the input). This fungus was renamed to \emph{Leptosphaeria obiones}, so a warning will be thrown that this result (\code{F_LPTSP_OBI}) needs review.} +#' \item{\code{"Fluoroquinolone-resistant Neisseria gonorrhoeae"}. The first word will be stripped, after which the function will try to find a match. A warning will be thrown that the result \emph{Neisseria gonorrhoeae} (\code{B_NESSR_GON}) needs review.} #' } #' +#' Use \code{mo_failures()} to get a vector with all values that could not be coerced to a valid value. +#' +#' Use \code{mo_uncertainties()} to get a vector with all values that were coerced to a valid value, but with uncertainty. +#' +#' Use \code{mo_renamed()} to get a vector with all values that could be coerced based on an old, previously accepted taxonomic name. +#' #' @inheritSection ITIS ITIS # (source as a section, so it can be inherited by other man pages) #' @section Source: @@ -154,7 +171,7 @@ is.mo <- function(x) { #' @importFrom dplyr %>% pull left_join n_distinct progress_estimated filter #' @importFrom data.table data.table as.data.table setkey -#' @importFrom crayon magenta red italic +#' @importFrom crayon magenta red silver italic has_color exec_as.mo <- function(x, Becker = FALSE, Lancefield = FALSE, allow_uncertain = TRUE, reference_df = get_mo_source(), property = "mo", clear_options = TRUE) { @@ -170,6 +187,7 @@ exec_as.mo <- function(x, Becker = FALSE, Lancefield = FALSE, if (clear_options == TRUE) { options(mo_failures = NULL) + options(mo_uncertainties = NULL) options(mo_renamed = NULL) } @@ -194,6 +212,7 @@ exec_as.mo <- function(x, Becker = FALSE, Lancefield = FALSE, } notes <- character(0) + uncertainties <- character(0) failures <- character(0) x_input <- x # only check the uniques, which is way faster @@ -251,7 +270,7 @@ exec_as.mo <- function(x, Becker = FALSE, Lancefield = FALSE, x_backup <- trimws(x, which = "both") # remove spp and species - x <- trimws(gsub(" +(spp.?|ssp.?|subsp.?|species)", " ", x_backup, ignore.case = TRUE), which = "both") + x <- trimws(gsub(" +(spp.?|ssp.?|sp.? |ss ?.?|subsp.?|subspecies|biovar |serovar |species)", " ", x_backup, ignore.case = TRUE), which = "both") x_species <- paste(x, "species") # translate to English for supported languages of mo_property x <- gsub("(Gruppe|gruppe|groep|grupo|gruppo|groupe)", "group", x, ignore.case = TRUE) @@ -259,6 +278,14 @@ exec_as.mo <- function(x, Becker = FALSE, Lancefield = FALSE, x <- gsub("(no MO)", "", x, fixed = TRUE) # remove non-text in case of "E. coli" except dots and spaces x <- gsub("[^.a-zA-Z0-9/ \\-]+", "", x) + # replace minus by a space + x <- gsub("-+", " ", x) + # replace hemolytic by haemolytic + x <- gsub("ha?emoly", "haemoly", x) + # place minus back in streptococci + x <- gsub("(alpha|beta|gamma) haemoly", "\\1-haemolytic", x) + # remove genus as first word + x <- gsub("^Genus ", "", x) # but spaces before and after should be omitted x <- trimws(x, which = "both") @@ -272,13 +299,13 @@ exec_as.mo <- function(x, Becker = FALSE, Lancefield = FALSE, x <- gsub("[ .]+", ".*", x) # add start en stop regex x <- paste0('^', x, '$') - x_withspaces_start <- paste0('^', x_withspaces) - x_withspaces <- paste0('^', x_withspaces, '$') + x_withspaces_start_only <- paste0('^', x_withspaces) + x_withspaces_start_end <- paste0('^', x_withspaces, '$') # cat(paste0('x "', x, '"\n')) # cat(paste0('x_species "', x_species, '"\n')) - # cat(paste0('x_withspaces_start "', x_withspaces_start, '"\n')) - # cat(paste0('x_withspaces "', x_withspaces, '"\n')) + # cat(paste0('x_withspaces_start_only "', x_withspaces_start_only, '"\n')) + # cat(paste0('x_withspaces_start_end "', x_withspaces_start_end, '"\n')) # cat(paste0('x_backup "', x_backup, '"\n')) # cat(paste0('x_trimmed "', x_trimmed, '"\n')) # cat(paste0('x_trimmed_species "', x_trimmed_species, '"\n')) @@ -290,16 +317,17 @@ exec_as.mo <- function(x, Becker = FALSE, Lancefield = FALSE, progress$tick()$print() - if (identical(x_trimmed[i], "")) { - # empty values + if (tolower(x_trimmed[i]) %in% c("", "xxx", "other", "none", "unknown")) { + # empty and nonsense values, ignore without warning ("xxx" is WHONET code for 'no growth') x[i] <- NA_character_ next } - if (nchar(x_trimmed[i]) < 3) { + + if (nchar(gsub("[^a-zA-Z]", "", x_trimmed[i])) < 3) { # check if search term was like "A. species", then return first genus found with ^A - if (x_backup[i] %like% "species" | x_backup[i] %like% "spp[.]?") { + if (x_backup[i] %like% "[a-z]+ species" | x_backup[i] %like% "[a-z] spp[.]?") { # get mo code of first hit - found <- microorganismsDT[fullname %like% x_withspaces_start[i], mo] + found <- microorganismsDT[fullname %like% x_withspaces_start_only[i], mo] if (length(found) > 0) { mo_code <- found[1L] %>% strsplit("_") %>% unlist() %>% .[1:2] %>% paste(collapse = "_") found <- microorganismsDT[mo == mo_code, ..property][[1]] @@ -316,14 +344,13 @@ exec_as.mo <- function(x, Becker = FALSE, Lancefield = FALSE, next } - # no nonsense text - if (toupper(x_trimmed[i]) %in% c('OTHER', 'NONE', 'UNKNOWN')) { + if (x_trimmed[i] %like% "virus") { + # there is no fullname like virus, so don't try to coerce it x[i] <- NA_character_ failures <- c(failures, x_backup[i]) next } - # translate known trivial abbreviations to genus + species ---- if (!is.na(x_trimmed[i])) { if (toupper(x_trimmed[i]) %in% c('MRSA', 'MSSA', 'VISA', 'VRSA')) { @@ -339,6 +366,10 @@ exec_as.mo <- function(x, Becker = FALSE, Lancefield = FALSE, x[i] <- microorganismsDT[mo == 'B_ENTRC', ..property][[1]][1L] next } + if (toupper(x_trimmed[i]) %in% c('EHEC', 'EPEC', 'EIEC', 'STEC', 'ATEC')) { + x[i] <- microorganismsDT[mo == 'B_ESCHR_COL', ..property][[1]][1L] + next + } if (toupper(x_trimmed[i]) == 'MRPA') { # multi resistant P. aeruginosa x[i] <- microorganismsDT[mo == 'B_PDMNS_AER', ..property][[1]][1L] @@ -398,13 +429,25 @@ exec_as.mo <- function(x, Becker = FALSE, Lancefield = FALSE, next } if (grepl("[sS]almonella [A-Z][a-z]+ ?.*", x_trimmed[i])) { - # Salmonella with capital letter species like "Salmonella Goettingen" - they're all S. enterica - x[i] <- microorganismsDT[mo == 'B_SLMNL_ENT', ..property][[1]][1L] - notes <- c(notes, - magenta(paste0("Note: ", italic(x_trimmed[i]), - " was considered (a subspecies of) ", - italic("Salmonella enterica"), - " (B_SLMNL_ENT)"))) + if (x_trimmed[i] %like% "Salmonella group") { + # Salmonella Group A to Z, just return S. species for now + x[i] <- microorganismsDT[mo == 'B_SLMNL', ..property][[1]][1L] + notes <- c(notes, + magenta(paste0("Note: ", + italic("Salmonella"), " ", trimws(gsub("Salmonella", "", x_trimmed[i])), + " was considered ", + italic("Salmonella species"), + " (B_SLMNL)"))) + } else { + # Salmonella with capital letter species like "Salmonella Goettingen" - they're all S. enterica + x[i] <- microorganismsDT[mo == 'B_SLMNL_ENT', ..property][[1]][1L] + notes <- c(notes, + magenta(paste0("Note: ", + italic("Salmonella"), " ", trimws(gsub("Salmonella", "", x_trimmed[i])), + " was considered a subspecies of ", + italic("Salmonella enterica"), + " (B_SLMNL_ENT)"))) + } next } } @@ -417,14 +460,14 @@ exec_as.mo <- function(x, Becker = FALSE, Lancefield = FALSE, x[i] <- found[1L] next } - if (nchar(x_trimmed[i]) > 4) { - # not when abbr is esco, stau, klpn, etc. - found <- microorganismsDT[tolower(fullname) %like% gsub(" ", ".*", x_trimmed_species[i], fixed = TRUE), ..property][[1]] + if (nchar(x_trimmed[i]) >= 6) { + found <- microorganismsDT[tolower(fullname) %like% paste0(x_withspaces_start_only[i], "[a-z]+ species"), ..property][[1]] if (length(found) > 0) { x[i] <- found[1L] next } } + # rest of genus only is in allow_uncertain part. } # TRY OTHER SOURCES ---- @@ -472,29 +515,27 @@ exec_as.mo <- function(x, Becker = FALSE, Lancefield = FALSE, next } - # try any match keeping spaces ---- - found <- microorganisms.prevDT[fullname %like% x_withspaces[i], ..property][[1]] - if (length(found) > 0) { + found <- microorganisms.prevDT[fullname %like% x_withspaces_start_end[i], ..property][[1]] + if (length(found) > 0 & nchar(x_trimmed[i]) >= 6) { x[i] <- found[1L] next } # try any match keeping spaces, not ending with $ ---- - found <- microorganisms.prevDT[fullname %like% x_withspaces_start[i], ..property][[1]] - if (length(found) > 0) { + found <- microorganisms.prevDT[fullname %like% x_withspaces_start_only[i], ..property][[1]] + if (length(found) > 0 & nchar(x_trimmed[i]) >= 6) { x[i] <- found[1L] next } # try any match diregarding spaces ---- found <- microorganisms.prevDT[fullname %like% x[i], ..property][[1]] - if (length(found) > 0) { + if (length(found) > 0 & nchar(x_trimmed[i]) >= 6) { x[i] <- found[1L] next } - # try splitting of characters in the middle and then find ID ---- # only when text length is 6 or lower # like esco = E. coli, klpn = K. pneumoniae, stau = S. aureus, staaur = S. aureus @@ -512,7 +553,7 @@ exec_as.mo <- function(x, Becker = FALSE, Lancefield = FALSE, # try fullname without start and stop regex, to also find subspecies ---- # like "K. pneu rhino" >> "Klebsiella pneumoniae (rhinoscleromatis)" = KLEPNERH - found <- microorganisms.prevDT[fullname %like% x_withspaces_start[i], ..property][[1]] + found <- microorganisms.prevDT[fullname %like% x_withspaces_start_only[i], ..property][[1]] if (length(found) > 0) { x[i] <- found[1L] next @@ -549,13 +590,13 @@ exec_as.mo <- function(x, Becker = FALSE, Lancefield = FALSE, next } # try any match keeping spaces ---- - found <- microorganisms.unprevDT[fullname %like% x_withspaces[i], ..property][[1]] + found <- microorganisms.unprevDT[fullname %like% x_withspaces_start_end[i], ..property][[1]] if (length(found) > 0) { x[i] <- found[1L] next } # try any match keeping spaces, not ending with $ ---- - found <- microorganisms.unprevDT[fullname %like% x_withspaces_start[i], ..property][[1]] + found <- microorganisms.unprevDT[fullname %like% x_withspaces_start_only[i], ..property][[1]] if (length(found) > 0) { x[i] <- found[1L] next @@ -583,7 +624,7 @@ exec_as.mo <- function(x, Becker = FALSE, Lancefield = FALSE, # try fullname without start and stop regex, to also find subspecies ---- # like "K. pneu rhino" >> "Klebsiella pneumoniae (rhinoscleromatis)" = KLEPNERH - found <- microorganisms.unprevDT[fullname %like% x_withspaces_start[i], ..property][[1]] + found <- microorganisms.unprevDT[fullname %like% x_withspaces_start_only[i], ..property][[1]] if (length(found) > 0) { x[i] <- found[1L] next @@ -594,7 +635,7 @@ exec_as.mo <- function(x, Becker = FALSE, Lancefield = FALSE, # look for old taxonomic names ---- found <- microorganisms.oldDT[tolower(name) == tolower(x_backup[i]) | tsn == x_trimmed[i] - | name %like% x_withspaces[i],] + | name %like% x_withspaces_start_end[i],] if (NROW(found) > 0) { # when property is "ref" (which is the case in mo_ref, mo_authors and mo_year), return the old value, so: # mo_ref("Chlamydia psittaci) = "Page, 1968" (with warning) @@ -604,22 +645,36 @@ exec_as.mo <- function(x, Becker = FALSE, Lancefield = FALSE, } else { x[i] <- microorganismsDT[tsn == found[1, tsn_new], ..property][[1]] } - notes <- c(notes, - renamed_note(name_old = found[1, name], - name_new = microorganismsDT[tsn == found[1, tsn_new], fullname], - ref_old = found[1, ref], - ref_new = microorganismsDT[tsn == found[1, tsn_new], ref], - mo = microorganismsDT[tsn == found[1, tsn_new], mo])) + was_renamed(name_old = found[1, name], + name_new = microorganismsDT[tsn == found[1, tsn_new], fullname], + ref_old = found[1, ref], + ref_new = microorganismsDT[tsn == found[1, tsn_new], ref], + mo = microorganismsDT[tsn == found[1, tsn_new], mo]) next } # check for uncertain results ---- if (allow_uncertain == TRUE) { - uncertain_fn <- function(a.x_backup, b.x_trimmed, c.x_withspaces, d.x_withspaces_start, e.x) { - # (1) look again for old taxonomic names, now for G. species ---- - found <- microorganisms.oldDT[name %like% c.x_withspaces - | name %like% d.x_withspaces_start + uncertain_fn <- function(a.x_backup, b.x_trimmed, c.x_withspaces_start_end, d.x_withspaces_start_only, e.x) { + + # (1) look for genus only, part of name ---- + if (nchar(b.x_trimmed) > 4 & !b.x_trimmed %like% " ") { + if (!grepl("^[A-Z][a-z]+", b.x_trimmed, ignore.case = FALSE)) { + # not when input is like Genustext, because then Neospora would lead to Actinokineospora + found <- microorganismsDT[tolower(fullname) %like% paste(b.x_trimmed, "species"), ..property][[1]] + if (length(found) > 0) { + x[i] <- found[1L] + uncertainties <<- c(uncertainties, + paste0("'", a.x_backup, "' >> ", microorganismsDT[mo == found[1L], fullname][[1]], " (", found[1L], ")")) + return(x) + } + } + } + + # (2) look again for old taxonomic names, now for G. species ---- + found <- microorganisms.oldDT[name %like% c.x_withspaces_start_end + | name %like% d.x_withspaces_start_only | name %like% e.x,] if (NROW(found) > 0 & nchar(b.x_trimmed) >= 6) { if (property == "ref") { @@ -630,32 +685,29 @@ exec_as.mo <- function(x, Becker = FALSE, Lancefield = FALSE, } else { x <- microorganismsDT[tsn == found[1, tsn_new], ..property][[1]] } - warning(red(paste0('(UNCERTAIN) "', - a.x_backup, '" >> ', italic(found[1, name]), " (TSN ", found[1, tsn], ")")), - call. = FALSE, immediate. = FALSE) - notes <<- c(notes, - renamed_note(name_old = found[1, name], - name_new = microorganismsDT[tsn == found[1, tsn_new], fullname], - ref_old = found[1, ref], - ref_new = microorganismsDT[tsn == found[1, tsn_new], ref], - mo = microorganismsDT[tsn == found[1, tsn_new], mo])) + was_renamed(name_old = found[1, name], + name_new = microorganismsDT[tsn == found[1, tsn_new], fullname], + ref_old = found[1, ref], + ref_new = microorganismsDT[tsn == found[1, tsn_new], ref], + mo = microorganismsDT[tsn == found[1, tsn_new], mo]) + uncertainties <<- c(uncertainties, + paste0("'", a.x_backup, "' >> ", found[1, name], " (TSN ", found[1, tsn], ")")) return(x) } - # (2) strip values between brackets ---- + # (3) strip values between brackets ---- a.x_backup_stripped <- gsub("( [(].*[)])", "", a.x_backup) a.x_backup_stripped <- trimws(gsub(" ", " ", a.x_backup_stripped, fixed = TRUE)) found <- suppressMessages(suppressWarnings(exec_as.mo(a.x_backup_stripped, clear_options = FALSE, allow_uncertain = FALSE))) if (!is.na(found) & nchar(b.x_trimmed) >= 6) { found_result <- found found <- microorganismsDT[mo == found, ..property][[1]] - warning(red(paste0('(UNCERTAIN) "', - a.x_backup, '" >> ', italic(microorganismsDT[mo == found_result[1L], fullname][[1]]), " (", found_result[1L], ")")), - call. = FALSE, immediate. = FALSE) + uncertainties <<- c(uncertainties, + paste0("'", a.x_backup, "' >> ", microorganismsDT[mo == found_result[1L], fullname][[1]], " (", found_result[1L], ")")) return(found[1L]) } - # (3) try to strip off one element and check the remains ---- + # (4) try to strip off one element from end and check the remains ---- x_strip <- a.x_backup %>% strsplit(" ") %>% unlist() if (length(x_strip) > 1 & nchar(b.x_trimmed) >= 6) { for (i in 1:(length(x_strip) - 1)) { @@ -664,22 +716,39 @@ exec_as.mo <- function(x, Becker = FALSE, Lancefield = FALSE, if (!is.na(found)) { found_result <- found found <- microorganismsDT[mo == found, ..property][[1]] - warning(red(paste0('(UNCERTAIN) "', - a.x_backup, '" >> ', italic(microorganismsDT[mo == found_result[1L], fullname][[1]]), " (", found_result[1L], ")")), - call. = FALSE, immediate. = FALSE) + uncertainties <<- c(uncertainties, + paste0("'", a.x_backup, "' >> ", microorganismsDT[mo == found_result[1L], fullname][[1]], " (", found_result[1L], ")")) return(found[1L]) } } } - # (4) not yet implemented taxonomic changes in ITIS + # (5) try to strip off one element from start and check the remains ---- + x_strip <- a.x_backup %>% strsplit(" ") %>% unlist() + if (length(x_strip) > 1 & nchar(b.x_trimmed) >= 6) { + for (i in 2:(length(x_strip))) { + x_strip_collapsed <- paste(x_strip[i:length(x_strip)], collapse = " ") + found <- suppressMessages(suppressWarnings(exec_as.mo(x_strip_collapsed, clear_options = FALSE, allow_uncertain = FALSE))) + if (!is.na(found)) { + found_result <- found + found <- microorganismsDT[mo == found, ..property][[1]] + uncertainties <<- c(uncertainties, + paste0("'", a.x_backup, "' >> ", microorganismsDT[mo == found_result[1L], fullname][[1]], " (", found_result[1L], ")")) + return(found[1L]) + } + } + } + + # (6) not yet implemented taxonomic changes in ITIS ---- found <- suppressMessages(suppressWarnings(exec_as.mo(TEMPORARY_TAXONOMY(b.x_trimmed), clear_options = FALSE, allow_uncertain = FALSE))) if (!is.na(found)) { found_result <- found found <- microorganismsDT[mo == found, ..property][[1]] - warning(red(paste0('(UNCERTAIN) "', - a.x_backup, '" >> ', italic(microorganismsDT[mo == found_result[1L], fullname][[1]]), " (", found_result[1L], ")")), + warning(silver(paste0('Guessed with uncertainty: "', + a.x_backup, '" >> ', italic(microorganismsDT[mo == found_result[1L], fullname][[1]]), " (", found_result[1L], ")")), call. = FALSE, immediate. = FALSE) + uncertainties <<- c(uncertainties, + paste0('"', a.x_backup, '" >> ', microorganismsDT[mo == found_result[1L], fullname][[1]], " (", found_result[1L], ")")) return(found[1L]) } @@ -687,7 +756,7 @@ exec_as.mo <- function(x, Becker = FALSE, Lancefield = FALSE, return(NA_character_) } - x[i] <- uncertain_fn(x_backup[i], x_trimmed[i], x_withspaces[i], x_withspaces_start[i], x[i]) + x[i] <- uncertain_fn(x_backup[i], x_trimmed[i], x_withspaces_start_end[i], x_withspaces_start_only[i], x[i]) if (!is.na(x[i])) { next } @@ -696,26 +765,39 @@ exec_as.mo <- function(x, Becker = FALSE, Lancefield = FALSE, # not found ---- x[i] <- NA_character_ failures <- c(failures, x_backup[i]) - } } + # failures failures <- failures[!failures %in% c(NA, NULL, NaN)] if (length(failures) > 0) { options(mo_failures = sort(unique(failures))) - plural <- "" + plural <- c("value", "it") if (n_distinct(failures) > 1) { - plural <- "s" + plural <- c("values", "them") } total_failures <- length(x_input[x_input %in% failures & !x_input %in% c(NA, NULL, NaN)]) total_n <- length(x_input[!x_input %in% c(NA, NULL, NaN)]) - msg <- paste0("\n", n_distinct(failures), " unique value", plural, + msg <- paste0("\n", n_distinct(failures), " unique ", plural[1], " (^= ", percent(total_failures / total_n, round = 1, force_zero = TRUE), ") could not be coerced to a valid MO code") if (n_distinct(failures) <= 10) { msg <- paste0(msg, ": ", paste('"', unique(failures), '"', sep = "", collapse = ', ')) } - msg <- paste0(msg, ". Use mo_failures() to review failured input.") + msg <- paste0(msg, ". Use mo_failures() to review ", plural[2], ".") + warning(red(msg), + call. = FALSE, + immediate. = TRUE) # thus will always be shown, even if >= warnings + } + # uncertainties + if (length(uncertainties) > 0) { + options(mo_uncertainties = sort(unique(uncertainties))) + plural <- c("value", "it") + if (n_distinct(failures) > 1) { + plural <- c("values", "them") + } + msg <- paste0("\nResults of ", n_distinct(uncertainties), " input ", plural[1], + " guessed with uncertainty. Use mo_uncertainties() to review ", plural[2], ".") warning(red(msg), call. = FALSE, immediate. = TRUE) # thus will always be shown, even if >= warnings @@ -774,6 +856,9 @@ exec_as.mo <- function(x, Becker = FALSE, Lancefield = FALSE, x[x == microorganismsDT[mo == 'B_STRPTC_SAL', ..property][[1]][1L]] <- microorganismsDT[mo == 'B_STRPTC_GRK', ..property][[1]][1L] } + + # Wrap up ---------------------------------------------------------------- + # comply to x, which is also unique and without empty values x_input_unique_nonempty <- unique(x_input[!is.na(x_input) & !is.null(x_input) & !identical(x_input, "")]) @@ -794,10 +879,15 @@ exec_as.mo <- function(x, Becker = FALSE, Lancefield = FALSE, x <- as.integer(x) } - if (length(notes > 0)) { + if (length(mo_renamed()) > 0) { + if (has_color()) { + notes <- getOption("mo_renamed") + } else { + notes <- mo_renamed() + } notes <- sort(notes) for (i in 1:length(notes)) { - base::message(notes[i]) + base::message(blue(paste("Note:", notes[i]))) } } @@ -810,7 +900,7 @@ TEMPORARY_TAXONOMY <- function(x) { } #' @importFrom crayon blue italic -renamed_note <- function(name_old, name_new, ref_old = "", ref_new = "", mo = "") { +was_renamed <- function(name_old, name_new, ref_old = "", ref_new = "", mo = "") { if (!is.na(ref_old)) { ref_old <- paste0(" (", ref_old, ")") } else { @@ -828,10 +918,7 @@ renamed_note <- function(name_old, name_new, ref_old = "", ref_new = "", mo = "" } msg <- paste0(italic(name_old), ref_old, " was renamed ", italic(name_new), ref_new, mo) msg <- gsub("et al.", italic("et al."), msg) - msg_plain <- paste0(name_old, ref_old, " >> ", name_new, ref_new) - msg_plain <- c(getOption("mo_renamed", character(0)), msg_plain) - options(mo_renamed = sort(msg_plain)) - return(blue(paste("Note:", msg))) + options(mo_renamed = sort(msg)) } #' @exportMethod print.mo @@ -882,20 +969,20 @@ pull.mo <- function(.data, ...) { pull(as.data.frame(.data), ...) } -#' Vector of failed coercion attempts -#' -#' Returns a vector of all failed attempts to coerce values to a valid MO code with \code{\link{as.mo}}. -#' @seealso \code{\link{as.mo}} +#' @rdname as.mo #' @export mo_failures <- function() { getOption("mo_failures") } -#' Vector of taxonomic renamed items -#' -#' Returns a vector of all renamed items of the last coercion to valid MO codes with \code{\link{as.mo}}. -#' @seealso \code{\link{as.mo}} +#' @rdname as.mo +#' @export +mo_uncertainties <- function() { + getOption("mo_uncertainties") +} + +#' @rdname as.mo #' @export mo_renamed <- function() { - getOption("mo_renamed") + strip_style(gsub("was renamed", ">>", getOption("mo_renamed"), fixed = TRUE)) } diff --git a/R/mo_property.R b/R/mo_property.R index ae8bb176..5d67e37d 100755 --- a/R/mo_property.R +++ b/R/mo_property.R @@ -248,7 +248,11 @@ mo_gramstain <- function(x, language = get_locale(), ...) { #' @rdname mo_property #' @export mo_TSN <- function(x, ...) { - mo_validate(x = x, property = "tsn", ...) + res <- mo_validate(x = x, property = "tsn", ...) + if (any(is.na(res))) { + warning("Some results do not have a TSN, because they are missing from ITIS and were added manually. See ?microorganisms.") + } + res } #' @rdname mo_property diff --git a/_pkgdown.yml b/_pkgdown.yml index 10518726..01e534e4 100644 --- a/_pkgdown.yml +++ b/_pkgdown.yml @@ -119,6 +119,7 @@ reference: Functions for conducting AMR analysis, like counting isolates, calculating resistance or susceptibility, creating frequency tables or make plots. contents: + - '`availability`' - '`count`' - '`portion`' - '`freq`' @@ -148,8 +149,6 @@ reference: contents: - '`get_locale`' - '`like`' - - '`mo_failures`' - - '`mo_renamed`' - '`ab_property`' diff --git a/data/WHONET.rda b/data/WHONET.rda index 0a777481..9a079c86 100644 Binary files a/data/WHONET.rda and b/data/WHONET.rda differ diff --git a/data/microorganisms.codes.rda b/data/microorganisms.codes.rda index c18dc4c7..518404aa 100644 Binary files a/data/microorganisms.codes.rda and b/data/microorganisms.codes.rda differ diff --git a/data/microorganisms.oldDT.rda b/data/microorganisms.oldDT.rda index ad07df63..a9d189ef 100644 Binary files a/data/microorganisms.oldDT.rda and b/data/microorganisms.oldDT.rda differ diff --git a/data/microorganisms.prevDT.rda b/data/microorganisms.prevDT.rda index 62bf1795..a84a2d25 100644 Binary files a/data/microorganisms.prevDT.rda and b/data/microorganisms.prevDT.rda differ diff --git a/data/microorganisms.rda b/data/microorganisms.rda index 85b0691e..9c7a29d4 100755 Binary files a/data/microorganisms.rda and b/data/microorganisms.rda differ diff --git a/data/microorganisms.unprevDT.rda b/data/microorganisms.unprevDT.rda index 1eb394f0..78e1b8d4 100644 Binary files a/data/microorganisms.unprevDT.rda and b/data/microorganisms.unprevDT.rda differ diff --git a/data/microorganismsDT.rda b/data/microorganismsDT.rda index 14c7ac46..2b4f8a68 100644 Binary files a/data/microorganismsDT.rda and b/data/microorganismsDT.rda differ diff --git a/data/septic_patients.rda b/data/septic_patients.rda index 10975011..4fa1a989 100755 Binary files a/data/septic_patients.rda and b/data/septic_patients.rda differ diff --git a/docs/LICENSE-text.html b/docs/LICENSE-text.html index 708f719b..d7f04512 100644 --- a/docs/LICENSE-text.html +++ b/docs/LICENSE-text.html @@ -78,7 +78,7 @@
diff --git a/docs/articles/AMR.html b/docs/articles/AMR.html index ee872f5a..bbce516b 100644 --- a/docs/articles/AMR.html +++ b/docs/articles/AMR.html @@ -185,7 +185,7 @@AMR.Rmd
Note: values on this page will change with every website update since they are based on randomly created values and the page was written in RMarkdown. However, the methodology remains unchanged. This page was generated on 04 February 2019.
+Note: values on this page will change with every website update since they are based on randomly created values and the page was written in RMarkdown. However, the methodology remains unchanged. This page was generated on 08 February 2019.
Use the frequency table function freq()
to look specifically for unique values in any variable. For example, for the gender
variable:
# Frequency table of `gender` from a `data.frame` (5,000 x 9)
+# Frequency table of `gender` from a data.frame (5,000 x 9)
# Class: factor (numeric)
# Levels: F, M
# Length: 5,000 (of which NA: 0 = 0.00%)
@@ -396,8 +396,8 @@
#
# Item Count Percent Cum. Count Cum. Percent
# --- ----- ------ -------- ----------- -------------
-# 1 M 2,560 51.2% 2,560 51.2%
-# 2 F 2,440 48.8% 5,000 100.0%
+# 1 M 2,551 51.0% 2,551 51.0%
+# 2 F 2,449 49.0% 5,000 100.0%
So, we can draw at least two conclusions immediately. From a data scientist perspective, the data looks clean: only values M
and F
. From a researcher perspective: there are slightly more men. Nothing we didn’t already know.
The data is already quite clean, but we still need to transform some variables. The bacteria
column now consists of text, and we want to add more variables based on microbial IDs later on. So, we will transform this column to valid IDs. The mutate()
function of the dplyr
package makes this really easy:
data <- data %>%
@@ -428,10 +428,10 @@
# Kingella kingae (no changes)
#
# EUCAST Expert Rules, Intrinsic Resistance and Exceptional Phenotypes (v3.1, 2016)
-# Table 1: Intrinsic resistance in Enterobacteriaceae (348 changes)
+# Table 1: Intrinsic resistance in Enterobacteriaceae (345 changes)
# Table 2: Intrinsic resistance in non-fermentative Gram-negative bacteria (no changes)
# Table 3: Intrinsic resistance in other Gram-negative bacteria (no changes)
-# Table 4: Intrinsic resistance in Gram-positive bacteria (702 changes)
+# Table 4: Intrinsic resistance in Gram-positive bacteria (673 changes)
# Table 8: Interpretive rules for B-lactam agents and Gram-positive cocci (no changes)
# Table 9: Interpretive rules for B-lactam agents and Gram-negative rods (no changes)
# Table 10: Interpretive rules for B-lactam agents and other Gram-negative bacteria (no changes)
@@ -447,7 +447,9 @@
# Non-EUCAST: piperacillin/tazobactam = S where piperacillin = S (no changes)
# Non-EUCAST: trimethoprim/sulfa = S where trimethoprim = S (no changes)
#
-# => EUCAST rules affected 1,820 out of 5,000 rows -> changed 1,050 test results.
So only 59% is suitable for resistance analysis! We can now filter on it with the filter()
function, also from the dplyr
package:
So only 58.8% is suitable for resistance analysis! We can now filter on it with the filter()
function, also from the dplyr
package:
For future use, the above two syntaxes can be shortened with the filter_first_isolate()
function:
Only 3 isolates are marked as ‘first’ according to CLSI guideline. But when reviewing the antibiogram, it is obvious that some isolates are absolutely different strains and should be included too. This is why we weigh isolates, based on their antibiogram. The key_antibiotics()
function adds a vector with 18 key antibiotics: 6 broad spectrum ones, 6 small spectrum for Gram negatives and 6 small spectrum for Gram positives. These can be defined by the user.
isolate | @@ -637,11 +639,11 @@|||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | -2010-03-08 | -G3 | +2010-04-03 | +C3 | B_ESCHR_COL | S | -I | +S | S | S | TRUE | @@ -649,20 +651,20 @@||
2 | -2010-05-08 | -G3 | +2010-10-31 | +C3 | B_ESCHR_COL | -S | -S | R | S | +S | +S | FALSE | TRUE |
3 | -2010-06-21 | -G3 | +2010-11-12 | +C3 | B_ESCHR_COL | S | S | @@ -673,22 +675,22 @@||||||
4 | -2010-12-01 | -G3 | +2010-11-21 | +C3 | B_ESCHR_COL | R | -S | -S | R | +R | +S | FALSE | TRUE |
5 | -2011-01-05 | -G3 | +2010-12-01 | +C3 | B_ESCHR_COL | -R | +S | S | S | S | @@ -697,11 +699,11 @@|||
6 | -2012-01-16 | -G3 | +2011-10-22 | +C3 | B_ESCHR_COL | S | -S | +I | S | S | TRUE | @@ -709,23 +711,23 @@||
7 | -2012-04-11 | -G3 | +2012-03-22 | +C3 | B_ESCHR_COL | S | S | -R | +S | S | FALSE | -TRUE | +FALSE |
8 | -2012-10-23 | -G3 | +2012-05-14 | +C3 | B_ESCHR_COL | -S | -S | +R | +R | S | S | FALSE | @@ -733,20 +735,8 @@|
9 | -2012-11-24 | -G3 | -B_ESCHR_COL | -S | -S | -S | -S | -FALSE | -FALSE | -||||
10 | -2014-01-26 | -G3 | +2012-10-26 | +C3 | B_ESCHR_COL | S | S | @@ -755,13 +745,25 @@TRUE | TRUE | ||||
10 | +2013-06-13 | +C3 | +B_ESCHR_COL | +S | +R | +S | +S | +FALSE | +TRUE | +
Instead of 3, now 9 isolates are flagged. In total, 88.5% of all isolates are marked ‘first weighted’ - 29.5% more than when using the CLSI guideline. In real life, this novel algorithm will yield 5-10% more isolates than the classic CLSI guideline.
+Instead of 3, now 9 isolates are flagged. In total, 87.7% of all isolates are marked ‘first weighted’ - 29% more than when using the CLSI guideline. In real life, this novel algorithm will yield 5-10% more isolates than the classic CLSI guideline.
As with filter_first_isolate()
, there’s a shortcut for this new algorithm too:
So we end up with 4,424 isolates for analysis.
+So we end up with 4,387 isolates for analysis.
We can remove unneeded columns:
@@ -769,6 +771,7 @@date | patient_id | hospital | @@ -785,58 +788,46 @@||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2010-05-26 | -E8 | -Hospital C | -B_ESCHR_COL | +1 | +2010-10-06 | +F7 | +Hospital D | +B_KLBSL_PNE | R | S | -R | +S | S | M | Gram negative | -Escherichia | -coli | -TRUE | -
2016-11-27 | -D6 | -Hospital B | -B_STRPTC_PNE | -R | -S | -S | -R | -M | -Gram positive | -Streptococcus | +Klebsiella | pneumoniae | TRUE | |||||
2015-03-24 | -J2 | +|||||||||||||||||
2 | +2015-07-29 | +T1 | Hospital A | B_ESCHR_COL | R | S | S | S | -M | +F | Gram negative | Escherichia | coli | TRUE | ||||
2014-09-12 | -Y4 | -Hospital A | +||||||||||||||||
3 | +2017-10-20 | +P2 | +Hospital B | B_STPHY_AUR | S | S | -S | +R | S | F | Gram positive | @@ -844,31 +835,49 @@aureus | TRUE | |||||
2015-05-27 | -M8 | -Hospital B | -B_ESCHR_COL | +|||||||||||||||
4 | +2010-02-07 | +Z6 | +Hospital A | +B_KLBSL_PNE | +R | R | S | S | +F | +Gram negative | +Klebsiella | +pneumoniae | +TRUE | +|||||
6 | +2016-02-08 | +S2 | +Hospital B | +B_ESCHR_COL | S | -M | +S | +S | +S | +F | Gram negative | Escherichia | coli | TRUE | ||||
2017-10-14 | -R8 | +9 | +2016-10-31 | +H3 | Hospital C | B_STPHY_AUR | +R | S | +R | S | -S | -S | -F | +M | Gram positive | Staphylococcus | aureus | @@ -891,9 +900,9 @@|
1 | Escherichia coli | -2,141 | -48.4% | -2,141 | -48.4% | +2,129 | +48.5% | +2,129 | +48.5% | |||||||||
2 | Staphylococcus aureus | -1,126 | -25.5% | -3,267 | -73.8% | +1,098 | +25.0% | +3,227 | +73.6% | |||||||||
3 | Streptococcus pneumoniae | -699 | -15.8% | -3,966 | -89.6% | +688 | +15.7% | +3,915 | +89.2% | |||||||||
4 | Klebsiella pneumoniae | -458 | -10.4% | -4,424 | +472 | +10.8% | +4,387 | 100.0% | ||||||||||
Hospital A | -0.4566642 | +0.4544765 | ||||||||||||||||
Hospital B | -0.4615894 | +0.4920107 | ||||||||||||||||
Hospital C | -0.4807122 | +0.4686567 | ||||||||||||||||
Hospital D | -0.4579008 | +0.4570792 |
EUCAST.Rmd
G_test.Rmd
Predict.Rmd
WHONET.Rmd
We will have to transform some variables to simplify and automate the analysis:
mo
) using the ITIS reference data set, which contains all ~20,000 microorganisms from the taxonomic kingdoms Bacteria, Fungi and Protozoa. We do the tranformation with as.mo()
.mo
) using the ITIS reference data set, which contains all ~20,000 microorganisms from the taxonomic kingdoms Bacteria, Fungi and Protozoa. We do the tranformation with as.mo()
. This function also recognises almost all WHONET abbreviations of microorganisms."S"
, "I"
or "R"
. That is exactly where the as.rsi()
function is for.# transform variables
data <- WHONET %>%
# get microbial ID based on given organism
- mutate(mo = as.mo(Organism)) %>%
+ mutate(mo = as.mo(Organism)) %>%
# transform everything from "AMP_ND10" to "CIP_EE" to the new `rsi` class
- mutate_at(vars(AMP_ND10:CIP_EE), as.rsi)
No errors or warnings, so all values are transformed succesfully. Let’s check it though, with a couple of frequency tables:
Frequency table of mo
from a data.frame
(500 x 54)
-Class: mo (character)
+Class: mo
(character
)
Length: 500 (of which NA: 0 = 0.00%)
Unique: 56
Families: 14
@@ -329,7 +329,7 @@ Species: 51
Frequency table of AMC_ND2
from a data.frame
(500 x 54)
-Class: factor > ordered > rsi (numeric)
+Class: factor
> ordered
> rsi
(numeric
)
Levels: S < I < R
Length: 500 (of which NA: 41 = 8.20%)
Unique: 3
ab_property.Rmd
benchmarks.Rmd
freq.Rmd
To only show and quickly review the content of one variable, you can just select this variable in various ways. Let’s say we want to get the frequencies of the gender
variable of the septic_patients
dataset:
Frequency table
+Frequency table of gender
from a data.frame
(2,000 x 49)
+Class: character
(character
)
+Length: 2,000 (of which NA: 0 = 0.00%)
+Unique: 2
Shortest: 1
+Longest: 1
@@ -255,7 +260,12 @@ |
---|
@@ -399,8 +409,8 @@ septic_patients %>% distinct(patient_id, .keep_all = TRUE) %>% freq(age, nmax = 5, header = TRUE) - |
---|
@@ -526,7 +540,11 @@ Outliers: 15 (unique count: 12) |
---|
@@ -574,8 +592,8 @@ Outliers: 15 (unique count: 12) |
---|
@@ -758,7 +781,11 @@ Median: 31 July 2009 (47.39%) |
---|
Item | @@ -806,7 +833,11 @@ Median: 31 July 2009 (47.39%)
---|
diff --git a/docs/articles/index.html b/docs/articles/index.html index 25f9296c..fcbf71f0 100644 --- a/docs/articles/index.html +++ b/docs/articles/index.html @@ -78,7 +78,7 @@ diff --git a/docs/articles/mo_property.html b/docs/articles/mo_property.html index 76677e4f..f4475ca1 100644 --- a/docs/articles/mo_property.html +++ b/docs/articles/mo_property.html @@ -40,7 +40,7 @@ @@ -185,7 +185,7 @@ |
---|