(v1.8.0.9001) as.mo improvement, fixes #52

2025-07-08 11:51:59 +02:00 · 2022-02-26 21:58:23 +01:00
parent be792cc9eb
commit 18e8525d10
108 changed files with 568 additions and 399 deletions
--- a/R/catalogue_of_life.R
+++ b/R/catalogue_of_life.R
@@ -43,7 +43,7 @@ format_included_data_number <- function(data) {
 #'
 #' This package contains the complete taxonomic tree (last updated: `r CATALOGUE_OF_LIFE$yearmonth_LPSN`) of almost all microorganisms from the authoritative and comprehensive Catalogue of Life (CoL), supplemented with data from the List of Prokaryotic names with Standing in Nomenclature (LPSN).
 #' @section Catalogue of Life:
-#' \if{html}{\figure{logo_col.png}{options: height=40px style=margin-bottom:5px} \cr}
+#' \if{html}{\figure{logo_col.png}{options: height="40" style=margin-bottom:"5"} \cr}
 #' This package contains the complete taxonomic tree of almost all microorganisms (`r format_included_data_number(microorganisms)` species) from the authoritative and comprehensive Catalogue of Life (CoL, <http://www.catalogueoflife.org>). The CoL is the most comprehensive and authoritative global index of species currently available. Nonetheless, we supplemented the CoL data with data from the List of Prokaryotic names with Standing in Nomenclature (LPSN, [lpsn.dsmz.de](https://lpsn.dsmz.de)). This supplementation is needed until the [CoL+ project](https://github.com/CatalogueOfLife/general) is finished, which we await.
 #'
 #' [Click here][catalogue_of_life] for more information about the included taxa. Check which versions of the CoL and LPSN were included in this package with [catalogue_of_life_version()].
--- a/R/lifecycle.R
+++ b/R/lifecycle.R
@@ -32,23 +32,23 @@
 #' @rdname lifecycle
 #' @description Functions in this `AMR` package are categorised using [the lifecycle circle of the Tidyverse as found on www.tidyverse.org/lifecycle](https://lifecycle.r-lib.org/articles/stages.html).
 #' 
-#' \if{html}{\figure{lifecycle_tidyverse.svg}{options: height=200px style=margin-bottom:5px} \cr}
+#' \if{html}{\figure{lifecycle_tidyverse.svg}{options: height="200" style=margin-bottom:"5"} \cr}
 #' This page contains a section for every lifecycle (with text borrowed from the aforementioned Tidyverse website), so they can be used in the manual pages of the functions. 
 #' @section Experimental Lifecycle:
-#' \if{html}{\figure{lifecycle_experimental.svg}{options: style=margin-bottom:5px} \cr}
+#' \if{html}{\figure{lifecycle_experimental.svg}{options: style=margin-bottom:"5"} \cr}
 #' The [lifecycle][AMR::lifecycle] of this function is **experimental**. An experimental function is in early stages of development. The underlying code might be changing frequently. Experimental functions might be removed without deprecation, so you are generally best off waiting until a function is more mature before you use it in production code. Experimental functions are only available in development versions of this `AMR` package and will thus not be included in releases that are submitted to CRAN, since such functions have not yet matured enough.
 #' @section Maturing Lifecycle:
-#' \if{html}{\figure{lifecycle_maturing.svg}{options: style=margin-bottom:5px} \cr}
+#' \if{html}{\figure{lifecycle_maturing.svg}{options: style=margin-bottom:"5"} \cr}
 #' The [lifecycle][AMR::lifecycle] of this function is **maturing**. The underlying code of a maturing function has been roughed out, but finer details might still change. Since this function needs wider usage and more extensive testing, you are very welcome [to suggest changes at our repository](https://github.com/msberends/AMR/issues) or [write us an email (see section 'Contact Us')][AMR::AMR].
 #' @section Stable Lifecycle:
-#' \if{html}{\figure{lifecycle_stable.svg}{options: style=margin-bottom:5px} \cr}
+#' \if{html}{\figure{lifecycle_stable.svg}{options: style=margin-bottom:"5"} \cr}
 #' The [lifecycle][AMR::lifecycle] of this function is **stable**. In a stable function, major changes are unlikely. This means that the underlying code will generally evolve by adding new arguments; removing arguments or changing the meaning of existing arguments will be avoided.
 #' 
 #' If the underlying code needs breaking changes, they will occur gradually. For example, an argument will be deprecated and first continue to work, but will emit a message informing you of the change. Next, typically after at least one newly released version on CRAN, the message will be transformed to an error.
 #' @section Retired Lifecycle:
-#' \if{html}{\figure{lifecycle_retired.svg}{options: style=margin-bottom:5px} \cr}
+#' \if{html}{\figure{lifecycle_retired.svg}{options: style=margin-bottom:"5"} \cr}
 #' The [lifecycle][AMR::lifecycle] of this function is **retired**. A retired function is no longer under active development, and (if appropriate) a better alternative is available. No new arguments will be added, and only the most critical bugs will be fixed. In a future version, this function will be removed.
 #' @section Questioning Lifecycle:
-#' \if{html}{\figure{lifecycle_questioning.svg}{options: style=margin-bottom:5px} \cr}
+#' \if{html}{\figure{lifecycle_questioning.svg}{options: style=margin-bottom:"5"} \cr}
 #' The [lifecycle][AMR::lifecycle] of this function is **questioning**. This function might no longer be the optimal approach, or it is questionable whether this function should be in this `AMR` package at all.
 NULL
--- a/R/mo.R
+++ b/R/mo.R
@@ -31,9 +31,9 @@
 #' @param Becker a [logical] to indicate whether staphylococci should be categorised into coagulase-negative staphylococci ("CoNS") and coagulase-positive staphylococci ("CoPS") instead of their own species, according to Karsten Becker *et al.* (1,2,3).
 #'
 #' This excludes *Staphylococcus aureus* by default; use `Becker = "all"` to also categorise *S. aureus* as "CoPS".
-#' @param Lancefield a [logical] to indicate whether beta-haemolytic *Streptococci* should be categorised into Lancefield groups instead of their own species, according to Rebecca C. Lancefield (4). These *Streptococci* will be categorised in their first group, e.g. *Streptococcus dysgalactiae* will be group C, although officially it was also categorised into groups G and L.
+#' @param Lancefield a [logical] to indicate whether a beta-haemolytic *Streptococcus* should be categorised into Lancefield groups instead of their own species, according to Rebecca C. Lancefield (4). These streptococci will be categorised in their first group, e.g. *Streptococcus dysgalactiae* will be group C, although officially it was also categorised into groups G and L.
 #'
-#' This excludes *Enterococci* at default (who are in group D), use `Lancefield = "all"` to also categorise all *Enterococci* as group D.
+#' This excludes enterococci (which are in group D) by default; use `Lancefield = "all"` to also categorise all enterococci as group D.
 #' @param allow_uncertain a number between `0` (or `"none"`) and `3` (or `"all"`), or `TRUE` (= `2`) or `FALSE` (= `0`) to indicate whether the input should be checked for less probable results, see *Details*
 #' @param reference_df a [data.frame] to be used for extra reference when translating `x` to a valid [`mo`]. See [set_mo_source()] and [get_mo_source()] to automate the usage of your own codes (e.g. used in your analysis or organisation).
 #' @param ignore_pattern a regular expression (case-insensitive) of which all matches in `x` must return `NA`. This can be convenient to exclude known non-relevant input and can also be set with the option `AMR_ignore_pattern`, e.g. `options(AMR_ignore_pattern = "(not reported|contaminated flora)")`.
@@ -1188,9 +1188,38 @@ exec_as.mo <- function(x,
                return(found)
              }
              
-              # (6) try to strip off half an element from end and check the remains ----
+              # (6) remove non-taxonomic prefix and suffix ----
              if (isTRUE(debug)) {
-                cat(font_bold("\n[ UNCERTAINTY LEVEL", now_checks_for_uncertainty_level, "] (6) try to strip off half an element from end and check the remains\n"))
+                cat(font_bold("\n[ UNCERTAINTY LEVEL", now_checks_for_uncertainty_level, "] (6) remove non-taxonomic prefix and suffix\n"))
+              }
+              x_without_nontax <- gsub("(^[a-zA-Z]+[./-]+[a-zA-Z]+[^a-zA-Z]* )([a-zA-Z.]+ [a-zA-Z]+.*)",
+                                              "\\2", a.x_backup, perl = TRUE)
+              x_without_nontax <- gsub("( *[(].*[)] *)[^a-zA-Z]*$", "", x_without_nontax, perl = TRUE)
+              if (isTRUE(debug)) {
+                message("Running '", x_without_nontax, "'")
+              }
+              
+              # first try without dyslexia mode
+              found <- suppressMessages(suppressWarnings(exec_as.mo(x_without_nontax, initial_search = FALSE, dyslexia_mode = FALSE, allow_uncertain = FALSE, debug = debug, reference_data_to_use = uncertain.reference_data_to_use, actual_uncertainty = 2, actual_input = x_without_nontax)))
+              if (empty_result(found)) {
+                # then with dyslexia mode
+                found <- suppressMessages(suppressWarnings(exec_as.mo(x_without_nontax, initial_search = FALSE, dyslexia_mode = TRUE, allow_uncertain = FALSE, debug = debug, reference_data_to_use = uncertain.reference_data_to_use, actual_uncertainty = 2, actual_input = x_without_nontax)))
+              }
+              if (!empty_result(found) & nchar(g.x_backup_without_spp) >= 6) {
+                # we ran with actual_input = x_without_nontax, so now correct for a.x_backup:
+                uncertain_df <- attr(found, which = "uncertainties", exact = TRUE)
+                uncertain_df$input <- a.x_backup
+                found_result <- found
+                uncertainties <<- rbind(uncertainties,
+                                        uncertain_df,
+                                        stringsAsFactors = FALSE)
+                found <- lookup(mo == found)
+                return(found)
+              }
+              
+              # (7) try to strip off half an element from end and check the remains ----
+              if (isTRUE(debug)) {
+                cat(font_bold("\n[ UNCERTAINTY LEVEL", now_checks_for_uncertainty_level, "] (7) try to strip off half an element from end and check the remains\n"))
              }
              x_strip <- a.x_backup %pm>% strsplit("[ .]") %pm>% unlist()
              if (length(x_strip) > 1) {
@@ -1220,9 +1249,9 @@ exec_as.mo <- function(x,
                  }
                }
              }
-              # (7) try to strip off one element from end and check the remains ----
+              # (8) try to strip off one element from end and check the remains ----
              if (isTRUE(debug)) {
-                cat(font_bold("\n[ UNCERTAINTY LEVEL", now_checks_for_uncertainty_level, "] (7) try to strip off one element from end and check the remains\n"))
+                cat(font_bold("\n[ UNCERTAINTY LEVEL", now_checks_for_uncertainty_level, "] (8) try to strip off one element from end and check the remains\n"))
              }
              if (length(x_strip) > 1) {
                for (i in seq_len(length(x_strip) - 1)) {
@@ -1249,9 +1278,9 @@ exec_as.mo <- function(x,
                  }
                }
              }
-              # (8) check for unknown yeasts/fungi ----
+              # (9) check for unknown yeasts/fungi ----
              if (isTRUE(debug)) {
-                cat(font_bold("\n[ UNCERTAINTY LEVEL", now_checks_for_uncertainty_level, "] (8) check for unknown yeasts/fungi\n"))
+                cat(font_bold("\n[ UNCERTAINTY LEVEL", now_checks_for_uncertainty_level, "] (9) check for unknown yeasts/fungi\n"))
              }
              if (b.x_trimmed %like_case% "yeast") {
                found <- "F_YEAST"
@@ -1275,9 +1304,9 @@ exec_as.mo <- function(x,
                                        stringsAsFactors = FALSE)
                return(found)
              }
-              # (9) try to strip off one element from start and check the remains (only allow >= 2-part name outcome) ----
+              # (10) try to strip off one element from start and check the remains (only allow >= 2-part name outcome) ----
              if (isTRUE(debug)) {
-                cat(font_bold("\n[ UNCERTAINTY LEVEL", now_checks_for_uncertainty_level, "] (9) try to strip off one element from start and check the remains (only allow >= 2-part name outcome)\n"))
+                cat(font_bold("\n[ UNCERTAINTY LEVEL", now_checks_for_uncertainty_level, "] (10) try to strip off one element from start and check the remains (only allow >= 2-part name outcome)\n"))
              }
              x_strip <- a.x_backup %pm>% strsplit("[ .]") %pm>% unlist()
              if (length(x_strip) > 1 & nchar(g.x_backup_without_spp) >= 6) {
@@ -1311,9 +1340,9 @@ exec_as.mo <- function(x,
            if (uncertainty_level >= 3) {
              now_checks_for_uncertainty_level <- 3
              
-              # (10) try to strip off one element from start and check the remains (any text size) ----
+              # (11) try to strip off one element from start and check the remains (any text size) ----
              if (isTRUE(debug)) {
-                cat(font_bold("\n[ UNCERTAINTY LEVEL", now_checks_for_uncertainty_level, "] (10) try to strip off one element from start and check the remains (any text size)\n"))
+                cat(font_bold("\n[ UNCERTAINTY LEVEL", now_checks_for_uncertainty_level, "] (11) try to strip off one element from start and check the remains (any text size)\n"))
              }
              x_strip <- a.x_backup %pm>% strsplit("[ .]") %pm>% unlist()
              if (length(x_strip) > 1 & nchar(g.x_backup_without_spp) >= 6) {
@@ -1338,10 +1367,10 @@ exec_as.mo <- function(x,
                  }
                }
              }
-              # (11) try to strip off one element from end and check the remains (any text size) ----
+              # (12) try to strip off one element from end and check the remains (any text size) ----
               # (this is in fact 8 but without nchar limit of >=6)
              if (isTRUE(debug)) {
-                cat(font_bold("\n[ UNCERTAINTY LEVEL", now_checks_for_uncertainty_level, "] (11) try to strip off one element from end and check the remains (any text size)\n"))
+                cat(font_bold("\n[ UNCERTAINTY LEVEL", now_checks_for_uncertainty_level, "] (12) try to strip off one element from end and check the remains (any text size)\n"))
              }
              if (length(x_strip) > 1) {
                for (i in seq_len(length(x_strip) - 1)) {
@@ -1366,9 +1395,9 @@ exec_as.mo <- function(x,
                }
              }
              
-              # (12) part of a name (very unlikely match) ----
+              # (13) part of a name (very unlikely match) ----
              if (isTRUE(debug)) {
-                cat(font_bold("\n[ UNCERTAINTY LEVEL", now_checks_for_uncertainty_level, "] (12) part of a name (very unlikely match)\n"))
+                cat(font_bold("\n[ UNCERTAINTY LEVEL", now_checks_for_uncertainty_level, "] (13) part of a name (very unlikely match)\n"))
              }
              if (isTRUE(debug)) {
                message("Running '", f.x_withspaces_end_only, "'")
@@ -1882,7 +1911,7 @@ print.mo_uncertainties <- function(x, ...) {
  if (NROW(x) == 0) {
    return(NULL)
  }
-  cat(word_wrap("Matching scores", ifelse(has_colour(), " (in blue)", ""), " are based on human pathogenic prevalence and the resemblance between the input and the full taxonomic name. See `?mo_matching_score`.\n\n", add_fn = font_blue))
+  cat(word_wrap("Matching scores", ifelse(has_colour(), " (in blue)", ""), " are based on pathogenicity in humans and the resemblance between the input and the full taxonomic name. See `?mo_matching_score`.\n\n", add_fn = font_blue))
  
  txt <- ""
  for (i in seq_len(nrow(x))) {
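
Note on step (6) in the hunk above: the two `gsub()` calls can be tried in isolation to see what counts as a "non-taxonomic prefix and suffix". The sketch below copies both patterns verbatim from the diff; the wrapper name `strip_nontaxonomic()` and the example inputs are hypothetical and not part of the package.

```r
# Hypothetical wrapper around the two gsub() calls added in step (6)
strip_nontaxonomic <- function(x) {
  # drop a leading token such as "Gram-neg: " (letters, then ./-, then letters),
  # but only if at least two words follow it
  x <- gsub("(^[a-zA-Z]+[./-]+[a-zA-Z]+[^a-zA-Z]* )([a-zA-Z.]+ [a-zA-Z]+.*)",
            "\\2", x, perl = TRUE)
  # drop a trailing parenthesised remark and any non-letters after it
  gsub("( *[(].*[)] *)[^a-zA-Z]*$", "", x, perl = TRUE)
}

strip_nontaxonomic("Gram-neg: Escherichia coli (ESBL)")  # "Escherichia coli"
strip_nontaxonomic("Escherichia coli")                   # unchanged
strip_nontaxonomic("E. coli")                            # unchanged: "E." is not followed directly by letters
```

The prefix pattern requires a `.`, `/` or `-` between two runs of letters, so an abbreviated genus such as "E. coli" is left alone.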
--- a/R/mo_matching_score.R
+++ b/R/mo_matching_score.R
@@ -27,13 +27,13 @@
 #' 
 #' This algorithm is used by [as.mo()] and all the [`mo_*`][mo_property()] functions to determine the most probable match of taxonomic records based on user input. 
 #' @inheritSection lifecycle Stable Lifecycle
-#' @author Dr. Matthijs Berends
+#' @author Dr Matthijs Berends
 #' @param x Any user input value(s)
 #' @param n A full taxonomic name, that exists in [`microorganisms$fullname`][microorganisms]
 #' @section Matching Score for Microorganisms:
 #' With ambiguous user input in [as.mo()] and all the [`mo_*`][mo_property()] functions, the returned results are chosen based on their matching score using [mo_matching_score()]. This matching score \eqn{m}, is calculated as:
 #' 
-#' \ifelse{latex}{\deqn{m_{(x, n)} = \frac{l_{n} - 0.5 \cdot \min \begin{cases}l_{n} \\ \textrm{lev}(x, n)\end{cases}}{l_{n} \cdot p_{n} \cdot k_{n}}}}{\ifelse{html}{\figure{mo_matching_score.png}{options: width="300px" alt="mo matching score"}}{m(x, n) = ( l_n * min(l_n, lev(x, n) ) ) / ( l_n * p_n * k_n )}}
+#' \ifelse{latex}{\deqn{m_{(x, n)} = \frac{l_{n} - 0.5 \cdot \min \begin{cases}l_{n} \\ \textrm{lev}(x, n)\end{cases}}{l_{n} \cdot p_{n} \cdot k_{n}}}}{\ifelse{html}{\figure{mo_matching_score.png}{options: width="300" alt="mo matching score"}}{m(x, n) = ( l_n - 0.5 * min(l_n, lev(x, n) ) ) / ( l_n * p_n * k_n )}}
 #' 
 #' where:
 #' 
@@ -49,6 +49,8 @@
 #' All characters in \eqn{x} and \eqn{n} are ignored that are other than A-Z, a-z, 0-9, spaces and parentheses.
 #' 
 #' All matches are sorted descending on their matching score and for all user input values, the top match will be returned. This will lead to the effect that e.g., `"E. coli"` will return the microbial ID of *Escherichia coli* (\eqn{m = `r round(mo_matching_score("E. coli", "Escherichia coli"), 3)`}, a highly prevalent microorganism found in humans) and not *Entamoeba coli* (\eqn{m = `r round(mo_matching_score("E. coli", "Entamoeba coli"), 3)`}, a less prevalent microorganism in humans), although the latter would alphabetically come first. 
+#' 
+#' Since `AMR` version 1.8.1, common microorganism abbreviations are ignored in determining the matching score. These abbreviations are currently: `r vector_and(pkg_env$mo_field_abbreviations, quotes = FALSE)`.
 #' @export
 #' @inheritSection AMR Reference Data Publicly Available
 #' @inheritSection AMR Read more on Our Website!
@@ -65,9 +67,16 @@ mo_matching_score <- function(x, n) {
  x <- parse_and_convert(x)
  # no dots and other non-whitespace characters
  x <- gsub("[^a-zA-Z0-9 \\(\\)]+", "", x)
+  
+  # remove abbreviations known to the field
+  x <- gsub(paste0("(^|[^a-z0-9]+)(",
+                   paste0(pkg_env$mo_field_abbreviations, collapse = "|"),
+                   ")([^a-z0-9]+|$)"),
+            "", x, perl = TRUE, ignore.case = TRUE)
+  
  # only keep one space
  x <- gsub(" +", " ", x)
-
+  
  # n is always a taxonomically valid full name
  if (length(n) == 1) {
    n <- rep(n, length(x))
@@ -82,7 +91,7 @@ mo_matching_score <- function(x, n) {
  l_n.lev <- double(length = length(x))
  for (i in seq_len(length(x))) {
    # determine Levenshtein distance, but maximise to nchar of n
-    lev[i] <- utils::adist(x[i], n[i], ignore.case = FALSE, fixed = TRUE)
+    lev[i] <- utils::adist(x[i], n[i], ignore.case = FALSE, fixed = TRUE, costs = c(ins = 1, del = 1, sub = 1))
    # minimum of (l_n, Levenshtein distance)
    l_n.lev[i] <- min(l_n[i], as.double(lev[i]))
  }
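
Note on the two hunks above: the effect of dropping field abbreviations before computing the Levenshtein distance can be sketched as follows. The regular expressions and the `utils::adist()` call are taken from the diff; the helper names `clean_input()` and `score_numerator()` are hypothetical, the abbreviation vector is only a subset of `pkg_env$mo_field_abbreviations`, and only the numerator of the score (`l_n - 0.5 * min(l_n, lev(x, n))`) is computed, since the definitions of `p_n` and `k_n` are not part of this diff.

```r
# Subset of pkg_env$mo_field_abbreviations, for illustration only
abbreviations <- c("MRSA", "VRE", "EHEC")

# Hypothetical helper mirroring the input cleaning in mo_matching_score()
clean_input <- function(x) {
  # keep only letters, digits, spaces and parentheses
  x <- gsub("[^a-zA-Z0-9 \\(\\)]+", "", x)
  # remove abbreviations known to the field
  pattern <- paste0("(^|[^a-z0-9]+)(",
                    paste0(abbreviations, collapse = "|"),
                    ")([^a-z0-9]+|$)")
  x <- gsub(pattern, "", x, perl = TRUE, ignore.case = TRUE)
  # only keep one space
  gsub(" +", " ", x)
}

# Hypothetical helper: numerator of the matching score only
score_numerator <- function(x, n) {
  l_n <- nchar(n)
  lev <- as.double(utils::adist(clean_input(x), n, ignore.case = FALSE, fixed = TRUE,
                                costs = c(ins = 1, del = 1, sub = 1)))
  l_n - 0.5 * min(l_n, lev)
}

score_numerator("E. coli", "Escherichia coli")                          # 11
score_numerator("MRSA Staphylococcus aureus", "Staphylococcus aureus")  # 21, i.e. nchar(n)
```

With the abbreviation removed, "MRSA Staphylococcus aureus" has a Levenshtein distance of 0 to the full name, so the numerator equals `l_n`; without the removal the distance would be 5 and the numerator lower.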
--- a/R/rsi.R
+++ b/R/rsi.R
@@ -310,11 +310,15 @@ as.rsi.default <- function(x, ...) {
    x[x %like% "([^a-z]|^)sus(cep(tible)?)?"] <- "S"
    x[x %like% "([^a-z]|^)int(er(mediate)?)?|incr.*exp"] <- "I"
    # remove other invalid characters
-    x <- gsub("[^rsiRSIHi]+", "", x, perl = TRUE)
-    # some labs now report "H" instead of "I" to not interfere with EUCAST prior to 2019
-    x <- gsub("H", "I", x, ignore.case = TRUE)
    # set to capitals
    x <- toupper(x)
+    x <- gsub("[^RSIHDU]+", "", x, perl = TRUE)
+    # some labs now report "H" instead of "I" to not interfere with EUCAST prior to 2019
+    x <- gsub("^H$", "I", x, perl = TRUE)
+    # and MIPS uses D for Dose-dependent (which is I, but it will throw a note)
+    x <- gsub("^D$", "I", x, perl = TRUE)
+    # and MIPS uses U for "susceptible urine"
+    x <- gsub("^U$", "S", x, perl = TRUE)
    # in cases of "S;S" keep S, but in case of "S;I" make it NA
    x <- gsub("^S+$", "S", x)
    x <- gsub("^I+$", "I", x)
@@ -333,6 +337,15 @@ as.rsi.default <- function(x, ...) {
                 "%) that were invalid antimicrobial interpretations: ",
                 list_missing, call = FALSE)
      }
+      if (any(toupper(x.bak) == "U") && message_not_thrown_before("as.rsi", "U")) {
+        warning_("in as.rsi(): 'U' was interpreted as 'S', following some laboratory systems", call = FALSE)
+      }
+      if (any(toupper(x.bak) == "D") && message_not_thrown_before("as.rsi", "D")) {
+        warning_("in as.rsi(): 'D' (dose-dependent) was interpreted as 'I', following some laboratory systems", call = FALSE)
+      }
+      if (any(toupper(x.bak) == "H") && message_not_thrown_before("as.rsi", "H")) {
+        warning_("in as.rsi(): 'H' was interpreted as 'I', following some laboratory systems", call = FALSE)
+      }
    }
  }
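
Note on the `as.rsi.default()` change above: the new order (upper-case first, then filter, then map single letters) can be sketched on its own. The function name `normalise_rsi()` is hypothetical; the real code also handles words such as "susceptible" beforehand and raises the warnings through package-internal helpers (`warning_()`, `message_not_thrown_before()`), which are not reproduced here.

```r
# Hypothetical stand-alone version of the letter handling added in this commit
normalise_rsi <- function(x) {
  # set to capitals first, so the filter below needs upper-case letters only
  x <- toupper(x)
  # remove other invalid characters
  x <- gsub("[^RSIHDU]+", "", x, perl = TRUE)
  # a lone "H" or "D" becomes "I", a lone "U" becomes "S"
  x <- gsub("^H$", "I", x, perl = TRUE)
  x <- gsub("^D$", "I", x, perl = TRUE)
  x <- gsub("^U$", "S", x, perl = TRUE)
  x
}

normalise_rsi(c("h", "D", "u", "S", "R"))
# "I" "I" "S" "S" "R"
```

Anchoring the patterns with `^` and `$` means only values that consist of the single letter are translated, unlike the previous `gsub("H", "I", ...)`, which replaced every "H" in the value.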
  
--- a/R/whocc.R
+++ b/R/whocc.R
@@ -27,7 +27,7 @@
 #'
 #' All antimicrobial drugs and their official names, ATC codes, ATC groups and defined daily dose (DDD) are included in this package, using the WHO Collaborating Centre for Drug Statistics Methodology.
 #' @section WHOCC:
-#' \if{html}{\figure{logo_who.png}{options: height=60px style=margin-bottom:5px} \cr}
+#' \if{html}{\figure{logo_who.png}{options: height="60" style=margin-bottom:"5"} \cr}
 #' This package contains **all ~550 antibiotic, antimycotic and antiviral drugs** and their Anatomical Therapeutic Chemical (ATC) codes, ATC groups and Defined Daily Dose (DDD) from the World Health Organization Collaborating Centre for Drug Statistics Methodology (WHOCC, <https://www.whocc.no>) and the Pharmaceuticals Community Register of the European Commission (<https://ec.europa.eu/health/documents/community-register/html/reg_hum_atc.htm>). 
 #'
 #' These have become the gold standard for international drug utilisation monitoring and research.
--- a/R/zzz.R
+++ b/R/zzz.R
@@ -26,6 +26,11 @@
 # set up package environment, used by numerous AMR functions
 pkg_env <- new.env(hash = FALSE)
 pkg_env$mo_failed <- character(0)
+pkg_env$mo_field_abbreviations <- c("AIEC", "ATEC", "BORSA", "CRSM", "DAEC", "EAEC",
+                                    "EHEC", "EIEC", "EPEC", "ETEC", "GISA", "MRPA",
+                                    "MRSA", "MRSE", "MSSA", "MSSE", "NMEC", "PISP",
+                                    "PRSP", "STEC", "UPEC", "VISA", "VISP", "VRE",
+                                    "VRSA", "VRSP")

 # determine info icon for messages
 utf8_supported <- isTRUE(base::l10n_info()$`UTF-8`)