(v3.0.0.9003) eucast_rules fix, new tidymodels integration

2025-07-12 19:41:58 +02:00 · 2025-06-13 14:03:21 +02:00
parent 3742e9e994
commit 72db2b2562
22 changed files with 760 additions and 107 deletions
--- a/vignettes/AMR_with_tidymodels.Rmd
+++ b/vignettes/AMR_with_tidymodels.Rmd
@@ -26,7 +26,14 @@ knitr::opts_chunk$set(

 Antimicrobial resistance (AMR) is a global health crisis, and understanding resistance patterns is crucial for managing effective treatments. The `AMR` R package provides robust tools for analysing AMR data, including convenient antimicrobial selector functions like `aminoglycosides()` and `betalactams()`. 

-In this post, we will explore how to use the `tidymodels` framework to predict resistance patterns in the `example_isolates` dataset in two examples. 
+In this post, we will explore how to use the `tidymodels` framework to predict resistance patterns in three examples, using the `example_isolates` and `esbl_isolates` datasets.
+
+This post contains the following examples:
+
+1. Using Antimicrobial Selectors
+2. Predicting ESBL Presence Using Raw MICs
+3. Predicting AMR Over Time
+

 ## Example 1: Using Antimicrobial Selectors

@@ -208,10 +215,150 @@ This workflow is extensible to other antimicrobial classes and resistance patter

 ---

+## Example 2: Predicting ESBL Presence Using Raw MICs

-## Example 2: Predicting AMR Over Time
+In this second example, we demonstrate how to use `<mic>` columns directly in `tidymodels` workflows using AMR-specific recipe steps. This includes a transformation to `log2` scale using `step_mic_log2()`, which prepares MIC values for use in classification models.

-In this second example, we aim to predict antimicrobial resistance (AMR) trends over time using `tidymodels`. We will model resistance to three antibiotics (amoxicillin `AMX`, amoxicillin-clavulanic acid `AMC`, and ciprofloxacin `CIP`), based on historical data grouped by year and hospital ward.
+This approach formed the basis for a publication ([DOI: 10.3389/fmicb.2025.1582703](https://doi.org/10.3389/fmicb.2025.1582703)) on modelling the presence of extended-spectrum beta-lactamases (ESBL).
+
+### **Objective**
+
+Our goal is to:
+
+1. Use raw MIC values to predict whether a bacterial isolate produces ESBL.
+2. Apply AMR-aware preprocessing in a `tidymodels` recipe.
+3. Train a classification model and evaluate its predictive performance.
+
+### **Data Preparation**
+
+We use the `esbl_isolates` dataset that comes with the AMR package.
+
+```{r}
+# Load required libraries
+library(AMR)
+library(tidymodels)
+
+# View the esbl_isolates data set
+esbl_isolates
+
+# Prepare a binary outcome and convert to ordered factor
+data <- esbl_isolates %>%
+  mutate(esbl = factor(esbl, levels = c(FALSE, TRUE), ordered = TRUE))
+```
+
+**Explanation:**
+
+- `esbl_isolates`: Contains MIC test results and ESBL status for each isolate.
+- `mutate(esbl = ...)`: Converts the target column to an ordered factor for classification.
+
+### **Defining the Workflow**
+
+#### 1. Preprocessing with a Recipe
+
+We use our `step_mic_log2()` function to log2-transform MIC values, ensuring that MICs are numeric and properly scaled. All MIC predictors can be selected easily and agnostically using the new `all_mic_predictors()`:
+
+```{r}
+# Split into training and testing sets
+set.seed(123)
+split <- initial_split(data)
+training_data <- training(split)
+testing_data <- testing(split)
+
+# Define the recipe
+mic_recipe <- recipe(esbl ~ ., data = training_data) %>%
+  remove_role(genus, old_role = "predictor") %>%  # Remove non-informative variable
+  step_mic_log2(all_mic_predictors())             # Log2 transform all MIC predictors
+# No explicit prep() here: the workflow preps the recipe when it is fitted
+
+mic_recipe
+```
+
+**Explanation:**
+
+- `remove_role()`: Drops the predictor role from variables that should not be used in the model, such as `genus`.
+- `step_mic_log2()`: Applies `log2(as.numeric(...))` to all MIC predictors in one go.
+- `prep()`: Not called explicitly here; the workflow estimates the recipe from the training data when it is fitted.
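
To make the log2 step concrete, the following is a minimal base R sketch of the idea behind `log2(as.numeric(...))`. It is an approximation for illustration only, not the implementation of `step_mic_log2()`, and the MIC strings are hypothetical example values.

```r
# Base R approximation of a log2 MIC transformation (not the AMR implementation).
# The MIC strings below are hypothetical example values.
mic_raw <- c("<=0.5", "1", "2", "8", ">=16")

# Drop the qualifiers (<, >, =) so that only the numeric concentration remains
mic_num <- as.numeric(gsub("[<>=]", "", mic_raw))

# MICs typically come from two-fold dilution series, so log2 spaces them evenly
mic_log2 <- log2(mic_num)
mic_log2
# -1 0 1 3 4
```

On the log2 scale, each step of one unit corresponds to one doubling of the concentration, which is usually easier for a linear model to work with than the raw values.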
+
+#### 2. Specifying the Model
+
+We use a simple logistic regression to model ESBL presence, though more flexible models such as xgboost ([link to `parsnip` manual](https://parsnip.tidymodels.org/reference/details_boost_tree_xgboost.html)) may predict considerably better.
+
+```{r}
+# Define the model
+model <- logistic_reg(mode = "classification") %>%
+  set_engine("glm")
+
+model
+```
+
+**Explanation:**
+
+- `logistic_reg()`: Specifies a binary classification model.
+- `set_engine("glm")`: Uses the base R GLM engine.
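
For readers who want to see what the `glm` engine does underneath, here is a minimal sketch that fits a logistic regression directly with base R. The data are simulated and the names `log2_mic`, `esbl`, and `fit` are hypothetical; this is not the `esbl_isolates` data or the fitted workflow.

```r
# Simulated data (an assumption for illustration, not the esbl_isolates data)
set.seed(1)
n <- 200
log2_mic <- sample(-2:6, n, replace = TRUE)   # hypothetical log2 MIC values
p_true <- plogis(-2 + 0.8 * log2_mic)         # assumed true relationship
esbl <- rbinom(n, size = 1, prob = p_true)    # 1 = ESBL, 0 = no ESBL

# Logistic regression with base R, the function behind the "glm" engine
fit <- glm(esbl ~ log2_mic, family = binomial())

# Predicted probabilities and a simple 0.5 cut-off for the predicted class
pred_prob <- predict(fit, type = "response")
pred_class <- as.integer(pred_prob > 0.5)

coef(fit)
```

The `parsnip` specification adds a consistent interface on top of this, so the engine can be swapped without rewriting the rest of the workflow.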
+
+#### 3. Building the Workflow
+
+```{r}
+# Create workflow
+workflow_model <- workflow() %>%
+  add_recipe(mic_recipe) %>%
+  add_model(model)
+
+workflow_model
+```
+
+### **Training and Evaluating the Model**
+
+```{r}
+# Fit the model
+fitted <- fit(workflow_model, training_data)
+
+# Generate predictions
+predictions <- predict(fitted, testing_data) %>%
+  bind_cols(testing_data)
+
+# Evaluate model performance
+our_metrics <- metric_set(accuracy, kap, ppv, npv)
+metrics <- our_metrics(predictions, truth = esbl, estimate = .pred_class)
+
+metrics
+```
+
+**Explanation:**
+
+- `fit()`: Trains the model on the processed training data.
+- `predict()`: Produces predictions for unseen test data.
+- `metric_set()`: Allows evaluating multiple classification metrics.
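
As a reminder of what PPV and NPV measure, the sketch below computes both by hand in base R. The `truth` and `pred` vectors are hypothetical, and `TRUE` is explicitly taken as the positive (ESBL) class.

```r
# Hypothetical true and predicted ESBL status for eight isolates
truth <- c(TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, TRUE)
pred  <- c(TRUE, TRUE, FALSE, FALSE, FALSE, TRUE, TRUE, TRUE)

# Cells of the confusion matrix, with TRUE as the positive class
tp <- sum(truth & pred)     # true positives
fp <- sum(!truth & pred)    # false positives
fn <- sum(truth & !pred)    # false negatives
tn <- sum(!truth & !pred)   # true negatives

ppv <- tp / (tp + fp)       # share of predicted positives that are correct
npv <- tn / (tn + fn)       # share of predicted negatives that are correct

c(ppv = ppv, npv = npv)
```

Note that `yardstick` treats the first factor level as the event by default (`event_level = "first"`). With levels `c(FALSE, TRUE)`, it may therefore be worth checking which class the reported PPV and NPV refer to.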
+
+It appears we can predict ESBL gene presence with a positive predictive value (PPV) of `r round(metrics$.estimate[3], 3) * 100`% and a negative predictive value (NPV) of `r round(metrics$.estimate[4], 3) * 100`% using a simple logistic regression model.
+
+### **Visualising Predictions**
+
+We can visualise predictions by comparing predicted and actual ESBL status.
+
+```{r}
+library(ggplot2)
+
+ggplot(predictions, aes(x = esbl, fill = .pred_class)) +
+  geom_bar(position = "stack") +
+  labs(title = "Predicted vs Actual ESBL Status",
+       x = "Actual ESBL",
+       y = "Count") +
+  theme_minimal()
+```
+
+### **Conclusion**
+
+In this example, we showcased how the new `AMR`-specific recipe steps simplify working with `<mic>` columns in `tidymodels`. The `step_mic_log2()` transformation converts ordered MICs to log2-transformed numerics, improving compatibility with classification models.
+
+This pipeline enables realistic, reproducible, and interpretable modelling of antimicrobial resistance data.
+
+---
+
+
+## Example 3: Predicting AMR Over Time
+
+In this third example, we aim to predict antimicrobial resistance (AMR) trends over time using `tidymodels`. We will model resistance to three antibiotics (amoxicillin `AMX`, amoxicillin-clavulanic acid `AMC`, and ciprofloxacin `CIP`), based on historical data grouped by year and hospital ward.

 ### **Objective**

--- a/vignettes/welcome_to_AMR.Rmd
+++ b/vignettes/welcome_to_AMR.Rmd
@@ -28,7 +28,7 @@ Note: to keep the package size as small as possible, we only include this vignet

 The `AMR` package is a peer-reviewed, [free and open-source](https://amr-for-r.org/#copyright) R package with [zero dependencies](https://en.wikipedia.org/wiki/Dependency_hell) to simplify the analysis and prediction of Antimicrobial Resistance (AMR) and to work with microbial and antimicrobial data and properties, by using evidence-based methods. **Our aim is to provide a standard** for clean and reproducible AMR data analysis, that can therefore empower epidemiological analyses to continuously enable surveillance and treatment evaluation in any setting. We are a team of [many different researchers](https://amr-for-r.org/authors.html) from around the globe to make this a successful and durable project!

-This work was published in the Journal of Statistical Software (Volume 104(3); \doi{10.18637/jss.v104.i03}) and formed the basis of two PhD theses (\doi{10.33612/diss.177417131} and \doi{10.33612/diss.192486375}).
+This work was published in the Journal of Statistical Software (Volume 104(3); [DOI 10.18637/jss.v104.i03](https://doi.org/10.18637/jss.v104.i03)) and formed the basis of two PhD theses ([DOI 10.33612/diss.177417131](https://doi.org/10.33612/diss.177417131) and [DOI 10.33612/diss.192486375](https://doi.org/10.33612/diss.192486375)).

 After installing this package, R knows [**`r AMR:::format_included_data_number(AMR::microorganisms)` distinct microbial species**](https://amr-for-r.org/reference/microorganisms.html) (updated June 2024) and all [**`r AMR:::format_included_data_number(NROW(AMR::antimicrobials) + NROW(AMR::antivirals))` antimicrobial and antiviral drugs**](https://amr-for-r.org/reference/antimicrobials.html) by name and code (including ATC, EARS-Net, ASIARS-Net, PubChem, LOINC and SNOMED CT), and knows all about valid SIR and MIC values. The integral clinical breakpoint guidelines from CLSI `r min(as.integer(gsub("[^0-9]", "", subset(AMR::clinical_breakpoints, grepl("CLSI", guideline))$guideline)))`-`r max(as.integer(gsub("[^0-9]", "", subset(AMR::clinical_breakpoints, grepl("CLSI", guideline))$guideline)))` and EUCAST `r min(as.integer(gsub("[^0-9]", "", subset(AMR::clinical_breakpoints, grepl("EUCAST", guideline))$guideline)))`-`r max(as.integer(gsub("[^0-9]", "", subset(AMR::clinical_breakpoints, grepl("EUCAST", guideline))$guideline)))` are included, even with epidemiological cut-off (ECOFF) values. It supports and can read any data format, including WHONET data. This package works on Windows, macOS and Linux with all versions of R since R-3.0 (April 2013). **It was designed to work in any setting, including those with very limited resources**. It was created for both routine data analysis and academic research at the Faculty of Medical Sciences of the [University of Groningen](https://www.rug.nl) and the [University Medical Center Groningen](https://www.umcg.nl).