(v2.1.1.9121) support tidymodels

2025-07-12 02:22:08 +02:00 · 2024-12-19 20:17:15 +01:00
parent 8249cfda46
commit 15fc72fc66
16 changed files with 638 additions and 89 deletions
--- a/vignettes/AMR_with_tidymodels.Rmd
+++ b/vignettes/AMR_with_tidymodels.Rmd
@ -0,0 +1,191 @@
+---
+title: "`AMR` with `tidymodels`"
+output: 
+  rmarkdown::html_vignette:
+    toc: true
+    toc_depth: 3
+vignette: >
+  %\VignetteIndexEntry{`AMR` with `tidymodels`}
+  %\VignetteEncoding{UTF-8}
+  %\VignetteEngine{knitr::rmarkdown}
+editor_options: 
+  chunk_output_type: console
+---
+
+```{r setup, include = FALSE, results = 'markup'}
+knitr::opts_chunk$set(
+  warning = FALSE,
+  collapse = TRUE,
+  comment = "#>",
+  fig.width = 7.5,
+  fig.height = 5
+)
+```
+
+Antimicrobial resistance (AMR) is a global health crisis, and understanding resistance patterns is crucial for managing effective treatments. The `AMR` R package provides robust tools for analysing AMR data, including convenient antibiotic selector functions like `aminoglycosides()` and `betalactams()`. In this post, we will explore how to use the `tidymodels` framework to predict resistance patterns in the `example_isolates` dataset. 
+
+By leveraging the power of `tidymodels` and the `AMR` package, we’ll build a reproducible machine learning workflow to predict resistance to two important antibiotic classes: aminoglycosides and beta-lactams.
+
+---
+
+### **Objective**
+
+Our goal is to build a predictive model using the `tidymodels` framework to determine resistance patterns based on microbial data. We will:
+
+1. Preprocess data using the selector functions `aminoglycosides()` and `betalactams()`.
+2. Define a logistic regression model for prediction.
+3. Use a structured `tidymodels` workflow to preprocess, train, and evaluate the model.
+
+---
+
+### **Data Preparation**
+
+We begin by loading the required libraries and preparing the `example_isolates` dataset from the `AMR` package.
+
+```{r}
+# Load required libraries
+library(tidymodels)   # For machine learning workflows, and data manipulation (dplyr, tidyr, ...)
+library(AMR)          # For AMR data analysis
+
+# Load the example_isolates dataset
+data("example_isolates")  # Preloaded dataset with AMR results
+
+# Select relevant columns for prediction
+data <- example_isolates %>%
+  # select AB results dynamically
+  select(mo, aminoglycosides(), betalactams()) %>%
+  # replace NAs with NI (not-interpretable)
+   mutate(across(where(is.sir),
+                 ~replace_na(.x, "NI")),
+          # make factors of SIR columns
+          across(where(is.sir),
+                 as.integer),
+          # get Gramstain of microorganisms
+          mo = as.factor(mo_gramstain(mo))) %>%
+  # drop NAs - the ones without a Gramstain (fungi, etc.)
+  drop_na() # %>%
+  # Cefepime is not reliable
+  #select(-FEP)
+```
+
+**Explanation:**
+- `aminoglycosides()` and `betalactams()` dynamically select columns for antibiotics in these classes.
+- `drop_na()` ensures the model receives complete cases for training.
+
+---
+
+### **Defining the Workflow**
+
+We now define the `tidymodels` workflow, which consists of three steps: preprocessing, model specification, and fitting.
+
+#### 1. Preprocessing with a Recipe
+
+We create a recipe to preprocess the data for modelling. This includes:
+- Encoding resistance results (`S`, `I`, `R`) as binary (resistant or not resistant).
+- Converting microbial organism names (`mo`) into numerical features using one-hot encoding.
+
+```{r}
+# Define the recipe for data preprocessing
+resistance_recipe <- recipe(mo ~ ., data = data) %>%
+  step_corr(c(aminoglycosides(), betalactams()), threshold = 0.9)
+resistance_recipe
+```
+
+**Explanation:**
+- `step_mutate()` transforms resistance results (`R`) into binary variables (TRUE/FALSE).
+- `step_dummy()` converts categorical organism (`mo`) names into one-hot encoded numerical features, making them compatible with the model.
+
+#### 2. Specifying the Model
+
+We define a logistic regression model since resistance prediction is a binary classification task.
+
+```{r}
+# Specify a logistic regression model
+logistic_model <- logistic_reg() %>%
+  set_engine("glm") # Use the Generalized Linear Model engine
+logistic_model
+```
+
+**Explanation:**
+- `logistic_reg()` sets up a logistic regression model.
+- `set_engine("glm")` specifies the use of R's built-in GLM engine.
+
+#### 3. Building the Workflow
+
+We bundle the recipe and model together into a `workflow`, which organizes the entire modeling process.
+
+```{r}
+# Combine the recipe and model into a workflow
+resistance_workflow <- workflow() %>%
+  add_recipe(resistance_recipe) %>% # Add the preprocessing recipe
+  add_model(logistic_model) # Add the logistic regression model
+resistance_workflow
+```
+
+---
+
+### **Training and Evaluating the Model**
+
+To train the model, we split the data into training and testing sets. Then, we fit the workflow on the training set and evaluate its performance.
+
+```{r}
+# Split data into training and testing sets
+set.seed(123) # For reproducibility
+data_split <- initial_split(data, prop = 0.8) # 80% training, 20% testing
+training_data <- training(data_split) # Training set
+testing_data <- testing(data_split)   # Testing set
+
+# Fit the workflow to the training data
+fitted_workflow <- resistance_workflow %>%
+  fit(training_data) # Train the model
+
+fitted_workflow
+```
+
+**Explanation:**
+- `initial_split()` splits the data into training and testing sets.
+- `fit()` trains the workflow on the training set.
+
+Next, we evaluate the model on the testing data.
+
+```{r}
+# Make predictions on the testing set
+predictions <- fitted_workflow %>%
+  predict(testing_data)                # Generate predictions
+probabilities <- fitted_workflow %>%
+  predict(testing_data, type = "prob") # Generate probabilities
+
+predictions <- predictions %>%
+  bind_cols(probabilities) %>%
+  bind_cols(testing_data) # Combine with true labels
+
+predictions
+
+# Evaluate model performance
+metrics <- predictions %>%
+  metrics(truth = mo, estimate = .pred_class) # Calculate performance metrics
+
+metrics
+```
+
+**Explanation:**
+- `predict()` generates predictions on the testing set.
+- `metrics()` computes evaluation metrics like accuracy and AUC.
+
+It appears we can predict the Gram based on AMR results with a `r round(metrics$.estimate[1], 3)` accuracy. The ROC curve looks like:
+
+```{r}
+predictions %>%
+  roc_curve(mo, `.pred_Gram-negative`) %>%
+  autoplot()
+```
+
+---
+
+### **Conclusion**
+
+In this post, we demonstrated how to build a machine learning pipeline with the `tidymodels` framework and the `AMR` package. By combining selector functions like `aminoglycosides()` and `betalactams()` with `tidymodels`, we efficiently prepared data, trained a model, and evaluated its performance.
+
+This workflow is extensible to other antibiotic classes and resistance patterns, empowering users to analyse AMR data systematically and reproducibly.
+
+---