AMR/R/mo_matching_score.R

# ==================================================================== #
# TITLE                                                                #
# Antimicrobial Resistance (AMR) Data Analysis for R                   #
#                                                                      #
# SOURCE                                                               #
# https://github.com/msberends/AMR                                     #
#                                                                      #
# LICENCE                                                              #
# (c) 2018-2022 Berends MS, Luz CF et al.                              #
# Developed at the University of Groningen, the Netherlands, in        #
# collaboration with non-profit organisations Certe Medical            #
# Diagnostics & Advice, and University Medical Center Groningen.       # 
#                                                                      #
# This R package is free software; you can freely use and distribute   #
# it for both personal and commercial purposes under the terms of the  #
# GNU General Public License version 2.0 (GNU GPL-2), as published by  #
# the Free Software Foundation.                                        #
# We created this package for both routine data analysis and academic  #
# research and it was publicly released in the hope that it will be    #
# useful, but it comes WITHOUT ANY WARRANTY OR LIABILITY.              #
#                                                                      #
# Visit our website for the full manual and a complete tutorial about  #
# how to conduct AMR data analysis: https://msberends.github.io/AMR/   #
# ==================================================================== #

#' Calculate the Matching Score for Microorganisms
#' 
#' This algorithm is used by [as.mo()] and all the [`mo_*`][mo_property()] functions to determine the most probable match of taxonomic records based on user input. 
#' @author Dr Matthijs Berends
#' @param x Any user input value(s)
#' @param n A full taxonomic name that exists in [`microorganisms$fullname`][microorganisms]
#' @section Matching Score for Microorganisms:
#' With ambiguous user input in [as.mo()] and all the [`mo_*`][mo_property()] functions, the returned results are chosen based on their matching score using [mo_matching_score()]. This matching score \eqn{m} is calculated as:
#' 
#' \ifelse{latex}{\deqn{m_{(x, n)} = \frac{l_{n} - 0.5 \cdot \min \begin{cases}l_{n} \\ \textrm{lev}(x, n)\end{cases}}{l_{n} \cdot p_{n} \cdot k_{n}}}}{\ifelse{html}{\figure{mo_matching_score.png}{options: width="300" alt="mo matching score"}}{m(x, n) = ( l_n - 0.5 * min(l_n, lev(x, n)) ) / ( l_n * p_n * k_n )}}
#' 
#' where:
#' 
#' * \ifelse{html}{\out{<i>x</i> is the user input;}}{\eqn{x} is the user input;}
#' * \ifelse{html}{\out{<i>n</i> is a taxonomic name (genus, species, and subspecies);}}{\eqn{n} is a taxonomic name (genus, species, and subspecies);}
#' * \ifelse{html}{\out{<i>l<sub>n</sub></i> is the length of <i>n</i>;}}{l_n is the length of \eqn{n};}
#' * \ifelse{html}{\out{<i>lev</i> is the <a href="https://en.wikipedia.org/wiki/Levenshtein_distance">Levenshtein distance function</a>, which counts each insertion, deletion and substitution needed to change <i>x</i> into <i>n</i> as 1;}}{lev is the Levenshtein distance function, which counts each insertion, deletion and substitution needed to change \eqn{x} into \eqn{n} as 1;}
#' * \ifelse{html}{\out{<i>p<sub>n</sub></i> is the human pathogenic prevalence group of <i>n</i>, as described below;}}{p_n is the human pathogenic prevalence group of \eqn{n}, as described below;}
#' * \ifelse{html}{\out{<i>k<sub>n</sub></i> is the taxonomic kingdom of <i>n</i>, set as Bacteria = 1, Fungi = 2, Protozoa = 3, Archaea = 4, others = 5.}}{k_n is the taxonomic kingdom of \eqn{n}, set as Bacteria = 1, Fungi = 2, Protozoa = 3, Archaea = 4, others = 5.}
#' 
#' The grouping into human pathogenic prevalence (\eqn{p}) is based on experience from several microbiological laboratories in the Netherlands in conjunction with international reports on pathogen prevalence. **Group 1** (most prevalent microorganisms) consists of all microorganisms where the taxonomic class is Gammaproteobacteria or where the taxonomic genus is *Enterococcus*, *Staphylococcus* or *Streptococcus*. This group consequently contains all common Gram-negative bacteria, such as *Pseudomonas* and *Legionella* and all species within the order Enterobacterales. **Group 2** consists of all microorganisms where the taxonomic phylum is Proteobacteria, Firmicutes, Actinobacteria or Sarcomastigophora, or where the taxonomic genus is *Absidia*, *Acremonium*, *Actinotignum*, *Alternaria*, *Anaerosalibacter*, *Apophysomyces*, *Arachnia*, *Aspergillus*, *Aureobacterium*, *Aureobasidium*, *Bacteroides*, *Basidiobolus*, *Beauveria*, *Blastocystis*, *Branhamella*, *Calymmatobacterium*, *Candida*, *Capnocytophaga*, *Catabacter*, *Chaetomium*, *Chryseobacterium*, *Chryseomonas*, *Chrysonilia*, *Cladophialophora*, *Cladosporium*, *Conidiobolus*, *Cryptococcus*, *Curvularia*, *Exophiala*, *Exserohilum*, *Flavobacterium*, *Fonsecaea*, *Fusarium*, *Fusobacterium*, *Hendersonula*, *Hypomyces*, *Koserella*, *Lelliottia*, *Leptosphaeria*, *Leptotrichia*, *Malassezia*, *Malbranchea*, *Mortierella*, *Mucor*, *Mycocentrospora*, *Mycoplasma*, *Nectria*, *Ochroconis*, *Oidiodendron*, *Phoma*, *Piedraia*, *Pithomyces*, *Pityrosporum*, *Prevotella*, *Pseudallescheria*, *Rhizomucor*, *Rhizopus*, *Rhodotorula*, *Scolecobasidium*, *Scopulariopsis*, *Scytalidium*, *Sporobolomyces*, *Stachybotrys*, *Stomatococcus*, *Treponema*, *Trichoderma*, *Trichophyton*, *Trichosporon*, *Tritirachium* or *Ureaplasma*. **Group 3** consists of all other microorganisms.
#' 
#' All characters in \eqn{x} and \eqn{n} other than A-Z, a-z, 0-9, spaces and parentheses are ignored.
#' 
#' All matches are sorted by matching score in descending order, and the top match is returned for each user input value. As a result, `"E. coli"` returns the microbial ID of *Escherichia coli* (\eqn{m = `r round(mo_matching_score("E. coli", "Escherichia coli"), 3)`}, a highly prevalent microorganism found in humans) and not *Entamoeba coli* (\eqn{m = `r round(mo_matching_score("E. coli", "Entamoeba coli"), 3)`}, a less prevalent microorganism in humans), although the latter would come first alphabetically.
#' 
#' Since `AMR` version 1.8.1, common microorganism abbreviations are ignored in determining the matching score. These abbreviations are currently: `r vector_and(pkg_env$mo_field_abbreviations, quotes = FALSE)`.
#' @export
#' @inheritSection AMR Reference Data Publicly Available
#' @examples 
#' as.mo("E. coli")
#' mo_uncertainties()
#' 
#' mo_matching_score(x = "E. coli",
#'                   n = c("Escherichia coli", "Entamoeba coli"))
mo_matching_score <- function(x, n) {
  meet_criteria(x, allow_class = c("character", "data.frame", "list"))
  meet_criteria(n, allow_class = "character")
  
  x <- parse_and_convert(x)
  # keep only letters, digits, spaces and parentheses (drops dots and other punctuation)
  x <- gsub("[^a-zA-Z0-9 \\(\\)]+", "", x)
  
  # remove abbreviations known to the field
  x <- gsub(paste0("(^|[^a-z0-9]+)(",
                   paste0(pkg_env$mo_field_abbreviations, collapse = "|"),
                   ")([^a-z0-9]+|$)"),
            "", x, perl = TRUE, ignore.case = TRUE)
  
  # only keep one space
  x <- gsub(" +", " ", x)
  
  # n is always a taxonomically valid full name
  if (length(n) == 1) {
    n <- rep(n, length(x))
  }
  if (length(x) == 1) {
    x <- rep(x, length(n))
  }
  
  # length of fullname
  l_n <- nchar(n)
  lev <- double(length = length(x))
  l_n.lev <- double(length = length(x))
  for (i in seq_along(x)) {
    # determine Levenshtein distance; it is capped at nchar of n below
    lev[i] <- utils::adist(x[i], n[i], ignore.case = FALSE, fixed = TRUE, costs = c(ins = 1, del = 1, sub = 1))
    # minimum of (l_n, Levenshtein distance)
    l_n.lev[i] <- min(l_n[i], as.double(lev[i]))
  }
  # human pathogenic prevalence (1 to 3), see ?as.mo
  p_n <- MO_lookup[match(n, MO_lookup$fullname), "prevalence", drop = TRUE]
  # kingdom index (Bacteria = 1, Fungi = 2, Protozoa = 3, Archaea = 4, others = 5)
  k_n <- MO_lookup[match(n, MO_lookup$fullname), "kingdom_index", drop = TRUE]
  
  # matching score:
  (l_n - 0.5 * l_n.lev) / (l_n * p_n * k_n)
}
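
# -------------------------------------------------------------------- #
# Illustrative sketch only, not part of the exported API.
# The matching score formula documented above can be re-computed by hand
# with base R. `sketch_matching_score` is a hypothetical helper name, and
# the prevalence group `p_n` and kingdom index `k_n` are ASSUMPTIONS that
# the caller passes in, whereas mo_matching_score() looks them up in
# MO_lookup. Removal of field abbreviations is deliberately left out, and
# only a single `x` and a single `n` are handled.
# -------------------------------------------------------------------- #
sketch_matching_score <- function(x, n, p_n = 1, k_n = 1) {
  stopifnot(length(x) == 1, length(n) == 1)
  # keep only letters, digits, spaces and parentheses, as in the function above
  x <- gsub("[^a-zA-Z0-9 ()]+", "", x)
  # length of the full taxonomic name
  l_n <- nchar(n)
  # Levenshtein distance with unit costs for insertion, deletion and substitution
  lev <- as.double(utils::adist(x, n, ignore.case = FALSE, fixed = TRUE))
  # the distance is capped at l_n before it is weighted by 0.5
  (l_n - 0.5 * min(l_n, lev)) / (l_n * p_n * k_n)
}
# Usage, assuming p_n = 1 and k_n = 1 for Escherichia coli (class
# Gammaproteobacteria, kingdom Bacteria, following the grouping described above):
#   sketch_matching_score("E. coli", "Escherichia coli")
# "E. coli" is cleaned to "E coli", which needs 10 insertions to become the
# 16-character name, so the score is (16 - 0.5 * 10) / (16 * 1 * 1) = 0.6875.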