(v3.0.0.9003) eucast_rules fix, new tidymodels integration

2025-10-21 15:16:17 +02:00 · 2025-06-13 14:03:21 +02:00
parent 3742e9e994
commit 72db2b2562
22 changed files with 760 additions and 107 deletions
--- a/man/amr-tidymodels.Rd
+++ b/man/amr-tidymodels.Rd
@@ -0,0 +1,122 @@
+% Generated by roxygen2: do not edit by hand
+% Please edit documentation in R/tidymodels.R
+\name{amr-tidymodels}
+\alias{amr-tidymodels}
+\alias{all_mic}
+\alias{all_mic_predictors}
+\alias{all_sir}
+\alias{all_sir_predictors}
+\alias{step_mic_log2}
+\alias{step_sir_numeric}
+\title{AMR Extensions for Tidymodels}
+\usage{
+all_mic()
+
+all_mic_predictors()
+
+all_sir()
+
+all_sir_predictors()
+
+step_mic_log2(recipe, ..., role = NA, trained = FALSE, columns = NULL,
+  skip = FALSE, id = recipes::rand_id("mic_log2"))
+
+step_sir_numeric(recipe, ..., role = NA, trained = FALSE, columns = NULL,
+  skip = FALSE, id = recipes::rand_id("sir_numeric"))
+}
+\arguments{
+\item{recipe}{A recipe object. The step will be added to the sequence of
+operations for this recipe.}
+
+\item{...}{One or more selector functions to choose variables for this step.
+See \code{\link[recipes:selections]{selections()}} for more details.}
+
+\item{role}{Not used by this step since no new variables are created.}
+
+\item{trained}{A logical to indicate if the quantities for preprocessing have
+been estimated.}
+
+\item{skip}{A logical. Should the step be skipped when the recipe is baked by
+\code{\link[recipes:bake]{bake()}}? While all operations are baked when \code{\link[recipes:prep]{prep()}} is run, some
+operations may not be able to be conducted on new data (e.g. processing the
+outcome variable(s)). Care should be taken when using \code{skip = TRUE} as it
+may affect the computations for subsequent operations.}
+
+\item{id}{A character string that is unique to this step to identify it.}
+}
+\description{
+This family of functions allows using AMR-specific data types such as \verb{<mic>} and \verb{<sir>} inside \code{tidymodels} pipelines.
+}
+\details{
+You can read more in our online \href{https://amr-for-r.org/articles/AMR_with_tidymodels.html}{AMR with tidymodels introduction}.
+
+Tidyselect helpers include:
+\itemize{
+\item \code{\link[=all_mic]{all_mic()}} and \code{\link[=all_mic_predictors]{all_mic_predictors()}} to select \verb{<mic>} columns
+\item \code{\link[=all_sir]{all_sir()}} and \code{\link[=all_sir_predictors]{all_sir_predictors()}} to select \verb{<sir>} columns
+}
+
+Pre-processing pipeline steps include:
+\itemize{
+\item \code{\link[=step_mic_log2]{step_mic_log2()}} to convert MIC columns to numeric (via \code{as.numeric()}) and apply a log2 transform, to be used with \code{\link[=all_mic_predictors]{all_mic_predictors()}}
+\item \code{\link[=step_sir_numeric]{step_sir_numeric()}} to convert SIR columns to numeric (via \code{as.numeric()}), to be used with \code{\link[=all_sir_predictors]{all_sir_predictors()}}: \code{"S"} = 1, \code{"I"}/\code{"SDD"} = 2, \code{"R"} = 3. All other values are rendered \code{NA}. Keep this in mind for further processing, especially if the model does not allow for \code{NA} values.
+}
+
+These steps integrate with \code{recipes::recipe()} and work like standard preprocessing steps. They are useful for preparing data for modelling, especially with classification models.
+}
+\examples{
+library(tidymodels)
+
+# The below approach formed the basis for this paper: DOI 10.3389/fmicb.2025.1582703
+# Presence of ESBL genes was predicted based on raw MIC values.
+
+
+# example data set in the AMR package
+esbl_isolates
+
+# Prepare a binary outcome and convert to ordered factor
+data <- esbl_isolates \%>\%
+  mutate(esbl = factor(esbl, levels = c(FALSE, TRUE), ordered = TRUE))
+
+# Split into training and testing sets
+split <- initial_split(data)
+training_data <- training(split)
+testing_data <- testing(split)
+
+# Create and prep a recipe with MIC log2 transformation
+mic_recipe <- recipe(esbl ~ ., data = training_data) \%>\%
+  # Optionally remove non-predictive variables
+  remove_role(genus, old_role = "predictor") \%>\%
+  # Apply the log2 transformation to all MIC predictors
+  step_mic_log2(all_mic_predictors()) \%>\%
+  prep()
+
+# View prepped recipe
+mic_recipe
+
+# Apply the recipe to training and testing data
+out_training <- bake(mic_recipe, new_data = NULL)
+out_testing <- bake(mic_recipe, new_data = testing_data)
+
+# Fit a logistic regression model
+fitted <- logistic_reg(mode = "classification") \%>\%
+  set_engine("glm") \%>\%
+  fit(esbl ~ ., data = out_training)
+
+# Generate predictions on the test set
+predictions <- predict(fitted, out_testing) \%>\%
+  bind_cols(out_testing)
+
+# Evaluate predictions using standard classification metrics
+our_metrics <- metric_set(accuracy, kap, ppv, npv)
+metrics <- our_metrics(predictions, truth = esbl, estimate = .pred_class)
+
+# Show performance:
+# - negative predictive value (NPV) of ~98\%
+# - positive predictive value (PPV) of ~94\%
+metrics
+}
+\seealso{
+\code{\link[recipes:recipe]{recipes::recipe()}}, \code{\link[=as.mic]{as.mic()}}, \code{\link[=as.sir]{as.sir()}}
+}
+\keyword{internal}
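Note on man/amr-tidymodels.Rd: the two conversions described under "Pre-processing pipeline steps" can be illustrated in base R, without the AMR or recipes packages. This is only a sketch of the documented mapping, not the package's implementation. The input vectors are made up, and stripping the `<=`/`>` operators before the numeric conversion is an assumption made for illustration.

```r
# Hypothetical raw MIC strings (illustrative values only).
mic_raw <- c("0.25", "<=0.5", "2", ">16", "64")

# Assumption: drop comparison operators, then convert to numeric.
mic_num <- as.numeric(gsub("[<>=]", "", mic_raw))

# log2 transform, the operation step_mic_log2() is documented to apply.
mic_log2 <- log2(mic_num)

# SIR mapping as documented: "S" = 1, "I"/"SDD" = 2, "R" = 3, all else NA.
sir_raw <- c("S", "I", "SDD", "R", "NI", NA)
sir_map <- c(S = 1, I = 2, SDD = 2, R = 3)
sir_num <- unname(sir_map[sir_raw])

print(data.frame(mic_raw, mic_num, mic_log2))
print(data.frame(sir_raw, sir_num))
```

The `NA` values produced by the SIR mapping are the ones the documentation warns about: a model that does not accept missing values needs them imputed or removed first.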
--- a/man/as.sir.Rd
+++ b/man/as.sir.Rd
@@ -247,7 +247,7 @@ To determine which isolates are multi-drug resistant, be sure to run \code{\link

 The function \code{\link[=is.sir]{is.sir()}} detects if the input contains class \code{sir}. If the input is a \link{data.frame} or \link{list}, it iterates over all columns/items and returns a \link{logical} vector.

-The base R function \code{\link[=as.double]{as.double()}} can be used to retrieve quantitative values from a \code{sir} object: \code{"S"} = 1, \code{"I"}/\code{"SDD"} = 2, \code{"R"} = 3. All other values are rendered \code{NA} . \strong{Note:} Do not use \code{as.integer()}, since that (because of how R works internally) will return the factor level indices, and not these aforementioned quantitative values.
+The base R function \code{\link[=as.double]{as.double()}} can be used to retrieve quantitative values from a \code{sir} object: \code{"S"} = 1, \code{"I"}/\code{"SDD"} = 2, \code{"R"} = 3. All other values are rendered \code{NA}. \strong{Note:} Do not use \code{as.integer()}, since that (because of how R works internally) will return the factor level indices, and not these aforementioned quantitative values.

 The function \code{\link[=is_sir_eligible]{is_sir_eligible()}} returns \code{TRUE} when a column contains at most 5\% potentially invalid antimicrobial interpretations, and \code{FALSE} otherwise. The threshold of 5\% can be set with the \code{threshold} argument. If the input is a \link{data.frame}, it iterates over all columns and returns a \link{logical} vector.
 }
--- a/man/esbl_isolates.Rd
+++ b/man/esbl_isolates.Rd
@@ -0,0 +1,27 @@
+% Generated by roxygen2: do not edit by hand
+% Please edit documentation in R/data.R
+\docType{data}
+\name{esbl_isolates}
+\alias{esbl_isolates}
+\title{Data Set with 500 ESBL Isolates}
+\format{
+A \link[tibble:tibble]{tibble} with 500 observations and 19 variables:
+\itemize{
+\item \code{esbl}\cr Logical indicator of whether the isolate is ESBL-producing
+\item \code{genus}\cr Genus of the microorganism
+\item \code{AMC:COL}\cr MIC values for 17 antimicrobial agents, transformed to class \code{\link{mic}} (see \code{\link[=as.mic]{as.mic()}})
+}
+}
+\usage{
+esbl_isolates
+}
+\description{
+A data set containing 500 microbial isolates with MIC values of common antibiotics and a binary \code{esbl} column for extended-spectrum beta-lactamase (ESBL) production. This data set contains randomised fictitious data but reflects reality and can be used to practise AMR-related machine learning, e.g., classification modelling with \href{https://amr-for-r.org/articles/AMR_with_tidymodels.html}{tidymodels}.
+}
+\details{
+See our \link[=amr-tidymodels]{tidymodels integration} for an example using this data set.
+}
+\examples{
+esbl_isolates
+}
+\keyword{datasets}
--- a/man/random.Rd
+++ b/man/random.Rd
@@ -7,19 +7,25 @@
 \alias{random_sir}
 \title{Random MIC Values/Disk Zones/SIR Generation}
 \usage{
-random_mic(size = NULL, mo = NULL, ab = NULL, ...)
+random_mic(size = NULL, mo = NULL, ab = NULL, skew = "right",
+  severity = 1, ...)

-random_disk(size = NULL, mo = NULL, ab = NULL, ...)
+random_disk(size = NULL, mo = NULL, ab = NULL, skew = "left",
+  severity = 1, ...)

 random_sir(size = NULL, prob_SIR = c(0.33, 0.33, 0.33), ...)
 }
 \arguments{
 \item{size}{Desired size of the returned vector. If used in a \link{data.frame} call or \code{dplyr} verb, will get the current (group) size if left blank.}

-\item{mo}{Any \link{character} that can be coerced to a valid microorganism code with \code{\link[=as.mo]{as.mo()}}.}
+\item{mo}{Any \link{character} that can be coerced to a valid microorganism code with \code{\link[=as.mo]{as.mo()}}. Can be the same length as \code{size}.}

 \item{ab}{Any \link{character} that can be coerced to a valid antimicrobial drug code with \code{\link[=as.ab]{as.ab()}}.}

+\item{skew}{Direction of skew for MIC or disk values, either \code{"right"} or \code{"left"}. A left-skewed distribution has the majority of the data on the right.}
+
+\item{severity}{Skew severity; higher values increase the skewness. Default is \code{1}; use \code{0} to prevent skewing.}
+
 \item{...}{Ignored, only in place to allow future extensions.}

 \item{prob_SIR}{A vector of length 3: the probabilities for "S" (1st value), "I" (2nd value) and "R" (3rd value).}
@@ -31,17 +37,25 @@ class \code{mic} for \code{\link[=random_mic]{random_mic()}} (see \code{\link[=a
 These functions can be used for generating random MIC values and disk diffusion diameters, for AMR data analysis practice. By providing a microorganism and antimicrobial drug, the generated results will reflect reality as much as possible.
 }
 \details{
-The base \R function \code{\link[=sample]{sample()}} is used for generating values.
-
-Generated values are based on the EUCAST 2025 guideline as implemented in the \link{clinical_breakpoints} data set. To create specific generated values per bug or drug, set the \code{mo} and/or \code{ab} argument.
+Internally, MIC and disk zone values are sampled based on clinical breakpoints defined in the \link{clinical_breakpoints} data set. To create specific generated values per bug or drug, set the \code{mo} and/or \code{ab} argument. The MICs are sampled on a log2 scale and disks linearly, using weighted probabilities. The weights are based on the \code{skew} and \code{severity} arguments:
+\itemize{
+\item \code{skew = "right"} places more emphasis on lower MIC or higher disk values.
+\item \code{skew = "left"} places more emphasis on higher MIC or lower disk values.
+\item \code{severity} controls the exponential bias applied.
+}
 }
 \examples{
 random_mic(25)
 random_disk(25)
 random_sir(25)

+# increase the skew, and make the result more realistic by setting a bug and/or drug:
+disks <- random_disk(100, severity = 2, mo = "Escherichia coli", ab = "CIP")
+plot(disks)
+# `plot()` and `ggplot2::autoplot()` allow for coloured bars if `mo` and `ab` are set
+plot(disks, mo = "Escherichia coli", ab = "CIP", guideline = "CLSI 2025")
+
 \donttest{
-# make the random generation more realistic by setting a bug and/or drug:
 random_mic(25, "Klebsiella pneumoniae") # range 0.0625-64
 random_mic(25, "Klebsiella pneumoniae", "meropenem") # range 0.0625-16
 random_mic(25, "Streptococcus pneumoniae", "meropenem") # range 0.0625-4
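Note on man/random.Rd: the new details section describes MICs being sampled on a log2 scale with weighted probabilities controlled by `skew` and `severity`. The sketch below shows one way such weighting could work in base R. The function `sample_mic_sketch()`, its `min_mic`/`max_mic` arguments, and the exact weight formula are hypothetical and chosen for illustration; the package's actual weighting may differ.

```r
# Hypothetical helper, not part of the AMR package.
sample_mic_sketch <- function(size, min_mic = 0.0625, max_mic = 64,
                              skew = c("right", "left"), severity = 1) {
  skew <- match.arg(skew)
  # Candidate values: every doubling dilution between min_mic and max_mic.
  exponents <- seq(log2(min_mic), log2(max_mic), by = 1)
  # Relative position of each candidate along the range, from 0 to 1.
  pos <- seq(0, 1, length.out = length(exponents))
  # Assumed exponential weights: favour low values for a right skew,
  # high values for a left skew; severity = 0 gives uniform weights.
  weights <- if (skew == "right") exp(-severity * pos) else exp(severity * pos)
  2^sample(exponents, size = size, replace = TRUE, prob = weights)
}

set.seed(42)
right_skewed <- sample_mic_sketch(2000, skew = "right", severity = 3)
left_skewed <- sample_mic_sketch(2000, skew = "left", severity = 3)
print(table(right_skewed))
print(table(left_skewed))
```

With `skew = "right"` most draws fall on the lower dilutions, and with `skew = "left"` on the higher ones, which matches the behaviour the documentation describes for MIC values.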