`AMR` with `tidymodels`
Antimicrobial resistance (AMR) is a global health crisis, and understanding resistance patterns is crucial for managing effective treatments. The `AMR` R package provides robust tools for analysing AMR data, including convenient antibiotic selector functions like `aminoglycosides()` and `betalactams()`. In this post, we will explore how to use the `tidymodels` framework to predict resistance patterns in the `example_isolates` dataset.

By leveraging the power of `tidymodels` and the `AMR` package, we'll build a reproducible machine learning workflow to predict resistance to two important antibiotic classes: aminoglycosides and beta-lactams.
Objective

Our goal is to build a predictive model using the `tidymodels` framework to determine resistance patterns based on microbial data. We will:
- Preprocess data using the selector functions `aminoglycosides()` and `betalactams()`.
- Define a logistic regression model for prediction.
- Use a structured `tidymodels` workflow to preprocess, train, and evaluate the model.
Data Preparation

We begin by loading the required libraries and preparing the `example_isolates` dataset from the `AMR` package.
# Load required libraries
library(tidymodels)  # For machine learning workflows, and data manipulation (dplyr, tidyr, ...)
#> ── Attaching packages ────────────────────────────────────── tidymodels 1.2.0 ──
#> ✔ broom        1.0.7     ✔ recipes      1.1.0
#> ✔ dials        1.3.0     ✔ rsample      1.2.1
#> ✔ dplyr        1.1.4     ✔ tibble       3.2.1
#> ✔ ggplot2      3.5.1     ✔ tidyr        1.3.1
#> ✔ infer        1.0.7     ✔ tune         1.2.1
#> ✔ modeldata    1.4.0     ✔ workflows    1.1.4
#> ✔ parsnip      1.2.1     ✔ workflowsets 1.1.0
#> ✔ purrr        1.0.2     ✔ yardstick    1.3.1
#> ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
#> ✖ purrr::discard() masks scales::discard()
#> ✖ dplyr::filter()  masks stats::filter()
#> ✖ dplyr::lag()     masks stats::lag()
#> ✖ recipes::step()  masks stats::step()
#> • Use tidymodels_prefer() to resolve common conflicts.
library(AMR)         # For AMR data analysis

# Load the example_isolates dataset
data("example_isolates")  # Preloaded dataset with AMR results

# Select relevant columns for prediction
data <- example_isolates %>%
  # select AB results dynamically
  select(mo, aminoglycosides(), betalactams()) %>%
  # replace NAs with NI (not-interpretable)
  mutate(across(where(is.sir),
                ~replace_na(.x, "NI")),
         # convert SIR columns to integer codes
         across(where(is.sir),
                as.integer),
         # get the Gram stain of the microorganisms
         mo = as.factor(mo_gramstain(mo))) %>%
  # drop NAs - the ones without a Gram stain (fungi, etc.)
  drop_na() # %>%
  # Cefepime is not reliable
  # select(-FEP)
#> ℹ For aminoglycosides() using columns 'GEN' (gentamicin), 'TOB'
#>   (tobramycin), 'AMK' (amikacin), and 'KAN' (kanamycin)
#> ℹ For betalactams() using columns 'PEN' (benzylpenicillin), 'OXA'
#>   (oxacillin), 'FLC' (flucloxacillin), 'AMX' (amoxicillin), 'AMC'
#>   (amoxicillin/clavulanic acid), 'AMP' (ampicillin), 'TZP'
#>   (piperacillin/tazobactam), 'CZO' (cefazolin), 'FEP' (cefepime), 'CXM'
#>   (cefuroxime), 'FOX' (cefoxitin), 'CTX' (cefotaxime), 'CAZ' (ceftazidime),
#>   'CRO' (ceftriaxone), 'IPM' (imipenem), and 'MEM' (meropenem)
Explanation:

- `aminoglycosides()` and `betalactams()` dynamically select columns for antibiotics in these classes.
- `drop_na()` ensures the model receives complete cases for training.
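As a quick sanity check (not shown in the original post), we can inspect how the prepared outcome is distributed; a minimal sketch using `dplyr`:

# Sanity check (illustrative): number of isolates per Gram stain after preparation
data %>%
  count(mo)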
Defining the Workflow

We now define the `tidymodels` workflow, which consists of three steps: preprocessing, model specification, and fitting.
1. Preprocessing with a Recipe

We create a recipe to preprocess the data for modelling. The recipe declares the Gram stain (`mo`) as the outcome and all antibiotic results as predictors, and filters out predictors that are highly correlated with one another.
# Define the recipe for data preprocessing
resistance_recipe <- recipe(mo ~ ., data = data) %>%
  step_corr(c(aminoglycosides(), betalactams()), threshold = 0.9)
resistance_recipe
#> 
#> ── Recipe ──────────────────────────────────────────────────────────────────────
#> 
#> ── Inputs
#> Number of variables by role
#> outcome:    1
#> predictor: 20
#> 
#> ── Operations
#> • Correlation filter on: c(aminoglycosides(), betalactams())
Explanation:

- `recipe(mo ~ ., data = data)` assigns the Gram stain (`mo`) as the outcome and all antibiotic columns as predictors.
- `step_corr()` removes antibiotic predictors that are highly correlated with each other (absolute correlation above 0.9), since these carry largely redundant information.
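To see which columns the correlation filter actually drops, the recipe can be estimated and inspected. This step is not part of the original post; a sketch using `prep()` and `tidy()` from `recipes`:

# Estimate the recipe and list the predictors removed by step_corr()
prepped <- prep(resistance_recipe)
tidy(prepped, number = 1)  # the 'terms' column shows the removed columns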
2. Specifying the Model

We define a logistic regression model, since the outcome (Gram stain) is binary, making this a binary classification task.
# Specify a logistic regression model
logistic_model <- logistic_reg() %>%
  set_engine("glm")  # Use the Generalized Linear Model engine
logistic_model
#> Logistic Regression Model Specification (classification)
#> 
#> Computational engine: glm
Explanation:

- `logistic_reg()` sets up a logistic regression model.
- `set_engine("glm")` specifies the use of R's built-in GLM engine.
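Because `parsnip` separates the model type from its computational engine, other classifiers could be swapped in with minimal changes. Purely as an illustration (not used in the rest of this post, and requiring the `ranger` package), a random forest specification might look like:

# Illustrative alternative: a random forest classifier via the ranger engine
rf_model <- rand_forest(trees = 500) %>%
  set_engine("ranger") %>%
  set_mode("classification")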
3. Building the Workflow

We bundle the recipe and model together into a `workflow`, which organizes the entire modelling process.
# Combine the recipe and model into a workflow
resistance_workflow <- workflow() %>%
  add_recipe(resistance_recipe) %>%  # Add the preprocessing recipe
  add_model(logistic_model)          # Add the logistic regression model
resistance_workflow
#> ══ Workflow ════════════════════════════════════════════════════════════════════
#> Preprocessor: Recipe
#> Model: logistic_reg()
#> 
#> ── Preprocessor ────────────────────────────────────────────────────────────────
#> 1 Recipe Step
#> 
#> • step_corr()
#> 
#> ── Model ───────────────────────────────────────────────────────────────────────
#> Logistic Regression Model Specification (classification)
#> 
#> Computational engine: glm
Training and Evaluating the Model

To train the model, we split the data into training and testing sets. Then, we fit the workflow on the training set and evaluate its performance.
# Split data into training and testing sets
set.seed(123)  # For reproducibility
data_split <- initial_split(data, prop = 0.8)  # 80% training, 20% testing
training_data <- training(data_split)  # Training set
testing_data  <- testing(data_split)   # Testing set

# Fit the workflow to the training data
fitted_workflow <- resistance_workflow %>%
  fit(training_data)  # Train the model
#> ℹ For aminoglycosides() using columns 'GEN' (gentamicin), 'TOB'
#>   (tobramycin), 'AMK' (amikacin), and 'KAN' (kanamycin)
#> ℹ For betalactams() using columns 'PEN' (benzylpenicillin), 'OXA'
#>   (oxacillin), 'FLC' (flucloxacillin), 'AMX' (amoxicillin), 'AMC'
#>   (amoxicillin/clavulanic acid), 'AMP' (ampicillin), 'TZP'
#>   (piperacillin/tazobactam), 'CZO' (cefazolin), 'FEP' (cefepime), 'CXM'
#>   (cefuroxime), 'FOX' (cefoxitin), 'CTX' (cefotaxime), 'CAZ' (ceftazidime),
#>   'CRO' (ceftriaxone), 'IPM' (imipenem), and 'MEM' (meropenem)

fitted_workflow
#> ══ Workflow [trained] ══════════════════════════════════════════════════════════
#> Preprocessor: Recipe
#> Model: logistic_reg()
#> 
#> ── Preprocessor ────────────────────────────────────────────────────────────────
#> 1 Recipe Step
#> 
#> • step_corr()
#> 
#> ── Model ───────────────────────────────────────────────────────────────────────
#> 
#> Call:  stats::glm(formula = ..y ~ ., family = stats::binomial, data = data)
#> 
#> Coefficients:
#> (Intercept)          GEN          TOB          AMK          KAN          PEN
#>   101.11641     -3.69738      4.55879      1.86703    -23.37497     -0.57182
#>         OXA          FLC          AMC          AMP          TZP          CZO
#>    -4.68575    -11.69742      0.79748     -1.56197      0.87667     -2.28424
#>         FEP          CXM          FOX          CAZ          CRO          IPM
#>    -0.19847      0.02659     10.32455     10.27248      0.97321     -0.93096
#>         MEM
#>    -0.88753
#> 
#> Degrees of Freedom: 1573 Total (i.e. Null);  1555 Residual
#> Null Deviance:       2071
#> Residual Deviance: 74.91     AIC: 112.9
Explanation:

- `initial_split()` splits the data into training and testing sets.
- `fit()` trains the workflow on the training set.
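A single train/test split can give an optimistic picture. Although the original post does not include this, the same workflow could also be assessed with cross-validation on the training data; a sketch using `vfold_cv()` and `fit_resamples()`:

# Illustrative check: 5-fold cross-validation of the workflow on the training set
set.seed(123)
folds <- vfold_cv(training_data, v = 5)

resistance_workflow %>%
  fit_resamples(folds) %>%
  collect_metrics()  # summarises the default metrics (e.g. accuracy, ROC AUC) across folds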
Next, we evaluate the model on the testing data.
# Make predictions on the testing set
predictions <- fitted_workflow %>%
  predict(testing_data)  # Generate predictions
probabilities <- fitted_workflow %>%
  predict(testing_data, type = "prob")  # Generate probabilities

predictions <- predictions %>%
  bind_cols(probabilities) %>%
  bind_cols(testing_data)  # Combine with true labels

predictions
#> # A tibble: 394 × 24
#>    .pred_class   `.pred_Gram-negative` `.pred_Gram-positive` mo        GEN   TOB
#>    <fct>                         <dbl>                 <dbl> <fct>   <int> <int>
#>  1 Gram-positive              1.07e- 1              8.93e- 1 Gram-p…     5     5
#>  2 Gram-positive              3.17e- 8              1.00e+ 0 Gram-p…     5     1
#>  3 Gram-negative              9.99e- 1              1.42e- 3 Gram-n…     5     5
#>  4 Gram-positive              2.22e-16              1   e+ 0 Gram-p…     5     5
#>  5 Gram-negative              9.46e- 1              5.42e- 2 Gram-n…     5     5
#>  6 Gram-positive              1.07e- 1              8.93e- 1 Gram-p…     5     5
#>  7 Gram-positive              2.22e-16              1   e+ 0 Gram-p…     1     5
#>  8 Gram-positive              2.22e-16              1   e+ 0 Gram-p…     4     4
#>  9 Gram-negative              1   e+ 0              2.22e-16 Gram-n…     1     1
#> 10 Gram-positive              6.05e-11              1.00e+ 0 Gram-p…     4     4
#> # ℹ 384 more rows
#> # ℹ 18 more variables: AMK <int>, KAN <int>, PEN <int>, OXA <int>, FLC <int>,
#> #   AMX <int>, AMC <int>, AMP <int>, TZP <int>, CZO <int>, FEP <int>,
#> #   CXM <int>, FOX <int>, CTX <int>, CAZ <int>, CRO <int>, IPM <int>, MEM <int>
# Evaluate model performance
metrics <- predictions %>%
  metrics(truth = mo, estimate = .pred_class)  # Calculate performance metrics

metrics
#> # A tibble: 2 × 3
#>   .metric  .estimator .estimate
#>   <chr>    <chr>          <dbl>
#> 1 accuracy binary         0.995
#> 2 kap      binary         0.989
Explanation:

- `predict()` generates class predictions on the testing set.
- `metrics()` computes evaluation metrics, here accuracy and Cohen's kappa.
It appears we can predict the Gram stain based on AMR results with an accuracy of 0.995. The ROC curve looks like:

[ROC curve of the fitted workflow on the testing data]
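The plot itself is not reproduced here. For reference, such a curve could be drawn from the `predictions` tibble with `yardstick`, roughly as follows (a sketch, using the probability column of the first factor level, Gram-negative):

# Sketch: ROC curve for the Gram-negative class probabilities
predictions %>%
  roc_curve(truth = mo, `.pred_Gram-negative`) %>%
  autoplot()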
Conclusion

In this post, we demonstrated how to build a machine learning pipeline with the `tidymodels` framework and the `AMR` package. By combining selector functions like `aminoglycosides()` and `betalactams()` with `tidymodels`, we efficiently prepared data, trained a model, and evaluated its performance.
This workflow is extensible to other antibiotic classes and resistance patterns, empowering users to analyse AMR data systematically and reproducibly.
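As one illustration of that extensibility (not part of the analysis above), the data preparation could target other classes simply by swapping the selector functions, for example `carbapenems()` and `fluoroquinolones()` from the `AMR` package:

# Illustrative variation: select carbapenem and fluoroquinolone results instead
data_alt <- example_isolates %>%
  select(mo, carbapenems(), fluoroquinolones())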