
# AMR with tidymodels

> This page was entirely written by our [AMR for R
> Assistant](https://chat.amr-for-r.org), a ChatGPT manually-trained
> model able to answer any question about the `AMR` package.

Antimicrobial resistance (AMR) is a global health crisis, and
understanding resistance patterns is crucial for managing effective
treatments. The `AMR` R package provides robust tools for analysing AMR
data, including convenient antimicrobial selector functions like
[`aminoglycosides()`](https://amr-for-r.org/reference/antimicrobial_selectors.md)
and
[`betalactams()`](https://amr-for-r.org/reference/antimicrobial_selectors.md).

In this post, we will explore how to use the `tidymodels` framework to
predict resistance patterns in the `example_isolates` dataset in three
examples.

This post contains the following examples:

1.  Using Antimicrobial Selectors
2.  Predicting ESBL Presence Using Raw MICs
3.  Predicting AMR Over Time

## Example 1: Using Antimicrobial Selectors

By leveraging the power of `tidymodels` and the `AMR` package, we’ll
build a reproducible machine learning workflow to predict the Gram stain
of a microorganism from its susceptibility results for two important
antibiotic classes: aminoglycosides and beta-lactams.

### **Objective**

Our goal is to build a predictive model using the `tidymodels` framework
to determine the Gram stain of the microorganism based on its
antimicrobial susceptibility results. We will:

1.  Preprocess data using the selector functions
    [`aminoglycosides()`](https://amr-for-r.org/reference/antimicrobial_selectors.md)
    and
    [`betalactams()`](https://amr-for-r.org/reference/antimicrobial_selectors.md).
2.  Define a logistic regression model for prediction.
3.  Use a structured `tidymodels` workflow to preprocess, train, and
    evaluate the model.

### **Data Preparation**

We begin by loading the required libraries and preparing the
`example_isolates` dataset from the `AMR` package.

``` r
# Load required libraries
library(AMR)          # For AMR data analysis
library(tidymodels)   # For machine learning workflows, and data manipulation (dplyr, tidyr, ...)
```

Prepare the data:

``` r
# Your data could look like this:
example_isolates
#> # A tibble: 2,000 × 46
#>    date       patient   age gender ward     mo           PEN   OXA   FLC   AMX
#>    <date>     <chr>   <dbl> <chr>  <chr>    <mo>         <sir> <sir> <sir> <sir>
#>  1 2002-01-02 A77334     65 F      Clinical B_ESCHR_COLI   R     NA    NA    NA
#>  2 2002-01-03 A77334     65 F      Clinical B_ESCHR_COLI   R     NA    NA    NA
#>  3 2002-01-07 067927     45 F      ICU      B_STPHY_EPDR   R     NA    R     NA
#>  4 2002-01-07 067927     45 F      ICU      B_STPHY_EPDR   R     NA    R     NA
#>  5 2002-01-13 067927     45 F      ICU      B_STPHY_EPDR   R     NA    R     NA
#>  6 2002-01-13 067927     45 F      ICU      B_STPHY_EPDR   R     NA    R     NA
#>  7 2002-01-14 462729     78 M      Clinical B_STPHY_AURS   R     NA    S     R
#>  8 2002-01-14 462729     78 M      Clinical B_STPHY_AURS   R     NA    S     R
#>  9 2002-01-16 067927     45 F      ICU      B_STPHY_EPDR   R     NA    R     NA
#> 10 2002-01-17 858515     79 F      ICU      B_STPHY_EPDR   R     NA    S     NA
#> # ℹ 1,990 more rows
#> # ℹ 36 more variables: AMC <sir>, AMP <sir>, TZP <sir>, CZO <sir>, FEP <sir>,
#> #   CXM <sir>, FOX <sir>, CTX <sir>, CAZ <sir>, CRO <sir>, GEN <sir>,
#> #   TOB <sir>, AMK <sir>, KAN <sir>, TMP <sir>, SXT <sir>, NIT <sir>,
#> #   FOS <sir>, LNZ <sir>, CIP <sir>, MFX <sir>, VAN <sir>, TEC <sir>,
#> #   TCY <sir>, TGC <sir>, DOX <sir>, ERY <sir>, CLI <sir>, AZM <sir>,
#> #   IPM <sir>, MEM <sir>, MTR <sir>, CHL <sir>, COL <sir>, MUP <sir>, …

# Select relevant columns for prediction
data <- example_isolates %>%
  # select AB results dynamically
  select(mo, aminoglycosides(), betalactams()) %>%
  # replace NAs with NI (not interpretable)
  mutate(across(where(is.sir),
                ~replace_na(.x, "NI")),
         # convert SIR columns to their integer codes
         across(where(is.sir),
                as.integer),
         # get Gram stain of microorganisms
         mo = as.factor(mo_gramstain(mo))) %>%
  # drop NAs - the ones without a Gram stain (fungi, etc.)
  drop_na()
#> ℹ For `aminoglycosides()` using columns 'GEN' (gentamicin), 'TOB'
#>   (tobramycin), 'AMK' (amikacin), and 'KAN' (kanamycin)
#> ℹ For `betalactams()` using columns 'PEN' (benzylpenicillin), 'OXA'
#>   (oxacillin), 'FLC' (flucloxacillin), 'AMX' (amoxicillin), 'AMC'
#>   (amoxicillin/clavulanic acid), 'AMP' (ampicillin), 'TZP'
#>   (piperacillin/tazobactam), 'CZO' (cefazolin), 'FEP' (cefepime), 'CXM'
#>   (cefuroxime), 'FOX' (cefoxitin), 'CTX' (cefotaxime), 'CAZ' (ceftazidime),
#>   'CRO' (ceftriaxone), 'IPM' (imipenem), and 'MEM' (meropenem)
```

**Explanation:**

- [`aminoglycosides()`](https://amr-for-r.org/reference/antimicrobial_selectors.md)
  and
  [`betalactams()`](https://amr-for-r.org/reference/antimicrobial_selectors.md)
  dynamically select columns for antimicrobials in these classes.
- `drop_na()` ensures the model receives complete cases for training.
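
The integer values in the prepared data come from `as.integer()`, which returns the position of each value among the levels of the underlying factor. The following is a minimal base R sketch using a plain factor as a hypothetical stand-in for a `<sir>` column. The level order `S`, `SDD`, `I`, `R`, `NI` is an assumption; check `levels()` on your own columns.

``` r
# Hypothetical stand-in for a <sir> column: a plain factor with an ASSUMED level order
sir_levels <- c("S", "SDD", "I", "R", "NI")
x <- factor(c("S", "R", NA, "I"), levels = sir_levels)

# Replace missing values with "NI", mirroring replace_na(.x, "NI")
x[is.na(x)] <- "NI"

# as.integer() returns the level index, not a measured quantity
codes <- as.integer(x)
codes
```

Because these codes are level positions, a model that uses them as numbers treats them as equally spaced, which is a simplification.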

### **Defining the Workflow**

We now define the `tidymodels` workflow, which consists of three steps:
preprocessing, model specification, and fitting.

#### 1. Preprocessing with a Recipe

We create a recipe to preprocess the data for modelling.

``` r
# Define the recipe for data preprocessing
resistance_recipe <- recipe(mo ~ ., data = data) %>%
  step_corr(c(aminoglycosides(), betalactams()), threshold = 0.9)
resistance_recipe
#>
#> ── Recipe ──────────────────────────────────────────────────────────────────────
#>
#> ── Inputs
#> Number of variables by role
#> outcome:    1
#> predictor: 20
#>
#> ── Operations
#> • Correlation filter on: c(aminoglycosides(), betalactams())
```

For a recipe that includes at least one preprocessing operation, like we
have with `step_corr()`, the necessary parameters can be estimated from
a training set using `prep()`:

``` r
prep(resistance_recipe)
#> ℹ For `aminoglycosides()` using columns 'GEN' (gentamicin), 'TOB'
#>   (tobramycin), 'AMK' (amikacin), and 'KAN' (kanamycin)
#> ℹ For `betalactams()` using columns 'PEN' (benzylpenicillin), 'OXA'
#>   (oxacillin), 'FLC' (flucloxacillin), 'AMX' (amoxicillin), 'AMC'
#>   (amoxicillin/clavulanic acid), 'AMP' (ampicillin), 'TZP'
#>   (piperacillin/tazobactam), 'CZO' (cefazolin), 'FEP' (cefepime), 'CXM'
#>   (cefuroxime), 'FOX' (cefoxitin), 'CTX' (cefotaxime), 'CAZ' (ceftazidime),
#>   'CRO' (ceftriaxone), 'IPM' (imipenem), and 'MEM' (meropenem)
#>
#> ── Recipe ──────────────────────────────────────────────────────────────────────
#>
#> ── Inputs
#> Number of variables by role
#> outcome:    1
#> predictor: 20
#>
#> ── Training information
#> Training data contained 1968 data points and no incomplete rows.
#>
#> ── Operations
#> • Correlation filter on: AMX CTX | Trained
```

**Explanation:**

- `recipe(mo ~ ., data = data)` will take the `mo` column as outcome and
  all other columns as predictors.
- `step_corr()` removes predictors (i.e., antibiotic columns) whose
  absolute pairwise correlation with another predictor exceeds the
  threshold of 0.9.

Notice how the recipe contains just the antimicrobial selector
functions - no need to define the columns specifically. In the prepared
recipe (obtained with `prep()`) we can see that the columns ‘AMX’ and
‘CTX’ were removed because they correlate too strongly with other
variables.
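
To see what a correlation filter looks for, here is a small base R sketch on synthetic data. It only flags columns that take part in a highly correlated pair. Which column of each pair is dropped is decided by the `recipes` implementation and is not reproduced here. All object names are hypothetical.

``` r
# Toy predictors: b is almost a copy of a, c is unrelated (synthetic data)
set.seed(1)
a <- rnorm(100)
toy <- data.frame(a = a,
                  b = a + rnorm(100, sd = 0.05),
                  c = rnorm(100))

# Absolute pairwise correlations between all predictors
abs_cor <- abs(cor(toy))
diag(abs_cor) <- 0 # ignore each column's correlation with itself

# Columns involved in at least one pair above the threshold
threshold <- 0.9
flagged <- names(which(apply(abs_cor, 2, max) > threshold))
flagged
```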

#### 2. Specifying the Model

We define a logistic regression model since predicting the Gram stain
(Gram-negative or Gram-positive) is a binary classification task.

``` r
# Specify a logistic regression model
logistic_model <- logistic_reg() %>%
  set_engine("glm") # Use the Generalised Linear Model engine
logistic_model
#> Logistic Regression Model Specification (classification)
#>
#> Computational engine: glm
```

**Explanation:**

- `logistic_reg()` sets up a logistic regression model.
- `set_engine("glm")` specifies the use of R’s built-in GLM engine.
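
For orientation, the `"glm"` engine corresponds to base R's `glm()` with a binomial family. The sketch below fits such a model directly on synthetic data, so the data and column names are hypothetical and not taken from `example_isolates`.

``` r
# Synthetic data: a two-level outcome and one integer-coded predictor (hypothetical)
set.seed(42)
toy <- data.frame(predictor = sample(1:5, 200, replace = TRUE))
toy$outcome <- factor(ifelse(toy$predictor + rnorm(200) > 3,
                             "Gram-positive", "Gram-negative"))

# Logistic regression with base R's glm(); family = binomial uses the logit link
fit <- glm(outcome ~ predictor, data = toy, family = binomial)

# Predicted probabilities of the second factor level ("Gram-positive")
prob <- predict(fit, type = "response")
range(prob)
```

With a factor outcome, `glm()` models the probability of the second level, so the level order determines which class the probabilities refer to.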

#### 3. Building the Workflow

We bundle the recipe and model together into a `workflow`, which
organises the entire modelling process.

``` r
# Combine the recipe and model into a workflow
resistance_workflow <- workflow() %>%
  add_recipe(resistance_recipe) %>% # Add the preprocessing recipe
  add_model(logistic_model) # Add the logistic regression model
resistance_workflow
#> ══ Workflow ════════════════════════════════════════════════════════════════════
#> Preprocessor: Recipe
#> Model: logistic_reg()
#>
#> ── Preprocessor ────────────────────────────────────────────────────────────────
#> 1 Recipe Step
#>
#> • step_corr()
#>
#> ── Model ───────────────────────────────────────────────────────────────────────
#> Logistic Regression Model Specification (classification)
#>
#> Computational engine: glm
```

### **Training and Evaluating the Model**

To train the model, we split the data into training and testing sets.
Then, we fit the workflow on the training set and evaluate its
performance.

``` r
# Split data into training and testing sets
set.seed(123) # For reproducibility
data_split <- initial_split(data, prop = 0.8) # 80% training, 20% testing
training_data <- training(data_split) # Training set
testing_data <- testing(data_split)   # Testing set

# Fit the workflow to the training data
fitted_workflow <- resistance_workflow %>%
  fit(training_data) # Train the model
```

**Explanation:**

- `initial_split()` splits the data into training and testing sets.
- `fit()` trains the workflow on the training set.
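
Conceptually, an 80/20 split amounts to sampling row indices at random. Here is a minimal base R sketch on a toy data frame. This is not how `rsample` implements `initial_split()`, and the object names are hypothetical.

``` r
# Toy data frame with 10 rows (synthetic)
set.seed(123)
toy <- data.frame(id = 1:10, value = rnorm(10))

# Draw 80% of the row indices at random for training
n_train <- floor(0.8 * nrow(toy))
train_idx <- sample(nrow(toy), size = n_train)

train <- toy[train_idx, ]  # rows used to fit the model
test  <- toy[-train_idx, ] # held-out rows used only for evaluation
c(train = nrow(train), test = nrow(test))
```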

Notice that the antimicrobial selector functions are called again
inside `fit()`: they are stored in the recipe, so they are evaluated
whenever the recipe is trained.

Next, we evaluate the model on the testing data.

``` r
# Make predictions on the testing set
predictions <- fitted_workflow %>%
  predict(testing_data)                # Generate predictions
probabilities <- fitted_workflow %>%
  predict(testing_data, type = "prob") # Generate probabilities

predictions <- predictions %>%
  bind_cols(probabilities) %>%
  bind_cols(testing_data) # Combine with true labels

predictions
#> # A tibble: 394 × 24
#>    .pred_class   `.pred_Gram-negative` `.pred_Gram-positive` mo        GEN   TOB
#>    <fct>                         <dbl>                 <dbl> <fct>   <int> <int>
#>  1 Gram-positive              1.07e- 1             8.93 e- 1 Gram-p…     5     5
#>  2 Gram-positive              3.17e- 8             1.000e+ 0 Gram-p…     5     1
#>  3 Gram-negative              9.99e- 1             1.42 e- 3 Gram-n…     5     5
#>  4 Gram-positive              2.22e-16             1    e+ 0 Gram-p…     5     5
#>  5 Gram-negative              9.46e- 1             5.42 e- 2 Gram-n…     5     5
#>  6 Gram-positive              1.07e- 1             8.93 e- 1 Gram-p…     5     5
#>  7 Gram-positive              2.22e-16             1    e+ 0 Gram-p…     1     5
#>  8 Gram-positive              2.22e-16             1    e+ 0 Gram-p…     4     4
#>  9 Gram-negative              1   e+ 0             2.22 e-16 Gram-n…     1     1
#> 10 Gram-positive              6.05e-11             1.000e+ 0 Gram-p…     4     4
#> # ℹ 384 more rows
#> # ℹ 18 more variables: AMK <int>, KAN <int>, PEN <int>, OXA <int>, FLC <int>,
#> #   AMX <int>, AMC <int>, AMP <int>, TZP <int>, CZO <int>, FEP <int>,
#> #   CXM <int>, FOX <int>, CTX <int>, CAZ <int>, CRO <int>, IPM <int>, MEM <int>

# Evaluate model performance
metrics <- predictions %>%
  metrics(truth = mo, estimate = .pred_class) # Calculate performance metrics

metrics
#> # A tibble: 2 × 3
#>   .metric  .estimator .estimate
#>   <chr>    <chr>          <dbl>
#> 1 accuracy binary         0.995
#> 2 kap      binary         0.989


# To assess some other model properties, you can make your own `metrics()` function
our_metrics <- metric_set(accuracy, kap, ppv, npv) # add Positive Predictive Value and Negative Predictive Value
metrics2 <- predictions %>%
  our_metrics(truth = mo, estimate = .pred_class) # run again on our `our_metrics()` function

metrics2
#> # A tibble: 4 × 3
#>   .metric  .estimator .estimate
#>   <chr>    <chr>          <dbl>
#> 1 accuracy binary         0.995
#> 2 kap      binary         0.989
#> 3 ppv      binary         0.987
#> 4 npv      binary         1
```

**Explanation:**

- [`predict()`](https://rdrr.io/r/stats/predict.html) generates
  predictions on the testing set.
- `metrics()` computes evaluation metrics like accuracy and kappa.
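
The metrics reported here can all be derived from a 2 x 2 confusion matrix. The base R sketch below uses made-up labels and treats the first factor level as the positive class, which is an assumption of this example. It computes accuracy, Cohen's kappa, and the positive and negative predictive values by hand.

``` r
# Hypothetical truth and predictions; the first level is treated as "positive"
lv <- c("Gram-negative", "Gram-positive")
truth    <- factor(c("Gram-negative", "Gram-negative", "Gram-negative",
                     "Gram-positive", "Gram-positive", "Gram-positive",
                     "Gram-positive", "Gram-negative"), levels = lv)
estimate <- factor(c("Gram-negative", "Gram-negative", "Gram-positive",
                     "Gram-positive", "Gram-positive", "Gram-positive",
                     "Gram-negative", "Gram-negative"), levels = lv)

tab <- table(estimate, truth) # rows: predicted, columns: actual
n <- sum(tab)

accuracy <- sum(diag(tab)) / n
ppv <- tab[1, 1] / sum(tab[1, ]) # correct among predicted positives
npv <- tab[2, 2] / sum(tab[2, ]) # correct among predicted negatives

# Cohen's kappa: observed agreement corrected for chance agreement
expected <- sum(rowSums(tab) * colSums(tab)) / n^2
kap <- (accuracy - expected) / (1 - expected)

round(c(accuracy = accuracy, kap = kap, ppv = ppv, npv = npv), 3)
```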

It appears we can predict the Gram stain with a 99.5% accuracy based on
AMR results of only aminoglycosides and beta-lactam antibiotics. The ROC
curve looks like this:

``` r
predictions %>%
  roc_curve(mo, `.pred_Gram-negative`) %>%
  autoplot()
```

![](AMR_with_tidymodels_files/figure-html/unnamed-chunk-8-1.png)
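
A ROC curve is built by sweeping a probability threshold and recording sensitivity and specificity at each step. Here is a minimal base R sketch with made-up probabilities and labels. It is a simplification for illustration, not the `yardstick` implementation.

``` r
# Hypothetical predicted probabilities for the positive class, and the true labels
prob_pos <- c(0.95, 0.80, 0.70, 0.40, 0.30, 0.10)
is_pos   <- c(TRUE, TRUE, FALSE, TRUE, FALSE, FALSE)

# For each threshold, classify as positive when the probability reaches it
thresholds <- sort(unique(prob_pos), decreasing = TRUE)
roc_points <- data.frame(
  threshold   = thresholds,
  sensitivity = sapply(thresholds, function(t) mean(prob_pos[is_pos] >= t)),
  specificity = sapply(thresholds, function(t) mean(prob_pos[!is_pos] < t))
)
roc_points
```

Lowering the threshold can only keep or increase sensitivity and keep or decrease specificity, which is what produces the characteristic shape of the curve.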

### **Conclusion**

In this post, we demonstrated how to build a machine learning pipeline
with the `tidymodels` framework and the `AMR` package. By combining
selector functions like
[`aminoglycosides()`](https://amr-for-r.org/reference/antimicrobial_selectors.md)
and
[`betalactams()`](https://amr-for-r.org/reference/antimicrobial_selectors.md)
with `tidymodels`, we efficiently prepared data, trained a model, and
evaluated its performance.

This workflow is extensible to other antimicrobial classes and
resistance patterns, empowering users to analyse AMR data systematically
and reproducibly.

------------------------------------------------------------------------

## Example 2: Predicting ESBL Presence Using Raw MICs

In this second example, we demonstrate how to use `<mic>` columns
directly in `tidymodels` workflows using AMR-specific recipe steps. This
includes a transformation to `log2` scale using `step_mic_log2()`, which
prepares MIC values for use in classification models.

This approach and idea formed the basis for the publication [DOI:
10.3389/fmicb.2025.1582703](https://doi.org/10.3389/fmicb.2025.1582703)
to model the presence of extended-spectrum beta-lactamases (ESBL).
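
To illustrate why a `log2` scale suits MIC values, the sketch below applies base R's `log2()` to plain numbers from a two-fold dilution series. It uses a numeric vector as a hypothetical stand-in for a `<mic>` column and does not show `step_mic_log2()` itself.

``` r
# Hypothetical MIC values (mg/L) from a two-fold dilution series
mic_values <- c(0.25, 0.5, 1, 2, 4, 8, 16)

# On the log2 scale, each doubling dilution becomes a step of one unit
log2_mic <- log2(mic_values)
log2_mic
```

After the transformation the values are evenly spaced, so a model no longer sees the difference between 8 and 16 as much larger than the difference between 0.25 and 0.5.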

> Note: this example will be available in a future version.
>
> The new AMR package version will contain new tidymodels recipe steps
> such as `step_mic_log2()`.

------------------------------------------------------------------------

## Example 3: Predicting AMR Over Time

In this third example, we aim to predict antimicrobial resistance (AMR)
trends over time using `tidymodels`. We aggregate resistance to three
antibiotics (amoxicillin `AMX`, amoxicillin-clavulanic acid `AMC`, and
ciprofloxacin `CIP`) by year and Gram stain, and then model amoxicillin
resistance based on these historical data.

### **Objective**

Our goal is to:

1.  Prepare the dataset by aggregating resistance data over time.
2.  Define a regression model to predict AMR trends.
3.  Use `tidymodels` to preprocess, train, and evaluate the model.

### **Data Preparation**

We start by transforming the `example_isolates` dataset into a
structured time-series format.

``` r
# Load required libraries
library(AMR)
library(tidymodels)

# Transform dataset
data_time <- example_isolates %>%
  top_n_microorganisms(n = 10) %>% # Filter on the top #10 species
  mutate(year = as.integer(format(date, "%Y")),  # Extract year from date
         gramstain = mo_gramstain(mo)) %>% # Get Gram stain
  group_by(year, gramstain) %>%
  summarise(across(c(AMX, AMC, CIP),
                   function(x) resistance(x, minimum = 0),
                   .names = "res_{.col}"),
            .groups = "drop") %>%
  filter(!is.na(res_AMX) & !is.na(res_AMC) & !is.na(res_CIP)) # Drop missing values
#> ℹ Using column 'mo' as input for `col_mo`.

data_time
#> # A tibble: 32 × 5
#>     year gramstain     res_AMX res_AMC res_CIP
#>    <int> <chr>           <dbl>   <dbl>   <dbl>
#>  1  2002 Gram-negative   1      0.105   0.0606
#>  2  2002 Gram-positive   0.838  0.182   0.162
#>  3  2003 Gram-negative   1      0.0714  0
#>  4  2003 Gram-positive   0.714  0.244   0.154
#>  5  2004 Gram-negative   0.464  0.0938  0
#>  6  2004 Gram-positive   0.849  0.299   0.244
#>  7  2005 Gram-negative   0.412  0.132   0.0588
#>  8  2005 Gram-positive   0.882  0.382   0.154
#>  9  2006 Gram-negative   0.379  0       0.1
#> 10  2006 Gram-positive   0.778  0.333   0.353
#> # ℹ 22 more rows
```

**Explanation:**

- `mo_gramstain(mo)`: Converts microorganism codes into their Gram stain
  (Gram-negative or Gram-positive).
- [`resistance()`](https://amr-for-r.org/reference/proportion.md):
  Converts AMR results into numeric values (proportion of resistant
  isolates). Setting `minimum = 0` disables the minimum number of
  isolates normally required to return a result.
- `group_by(year, gramstain)`: Aggregates resistance rates by year and
  Gram stain.
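
The aggregation step boils down to computing, per group, the share of results that are resistant. Here is a base R sketch on made-up isolate-level data. It ignores the extra handling in the package's `resistance()` function (such as its `minimum` argument), and all object names are hypothetical.

``` r
# Hypothetical isolate-level results (not taken from example_isolates)
toy <- data.frame(
  year      = c(2002, 2002, 2002, 2003, 2003, 2003, 2003),
  gramstain = c("Gram-negative", "Gram-negative", "Gram-positive",
                "Gram-negative", "Gram-positive", "Gram-positive",
                "Gram-positive"),
  AMX       = c("R", "S", "R", "R", "S", "R", NA)
)

# Proportion resistant: number of "R" results / number of non-missing results
prop_r <- function(x) mean(x == "R", na.rm = TRUE)

# One row per combination of year and Gram stain
res <- aggregate(toy["AMX"], by = toy[c("year", "gramstain")], FUN = prop_r)
names(res)[names(res) == "AMX"] <- "res_AMX"
res
```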

### **Defining the Workflow**

We now define the modelling workflow, which consists of a preprocessing
step, a model specification, and the fitting process.

#### 1. Preprocessing with a Recipe

``` r
# Define the recipe
resistance_recipe_time <- recipe(res_AMX ~ year + gramstain, data = data_time) %>%
  step_dummy(gramstain, one_hot = TRUE) %>%  # Convert categorical to numerical
  step_normalize(year) %>%  # Normalise year for better model performance
  step_nzv(all_predictors())  # Remove near-zero variance predictors

resistance_recipe_time
#>
#> ── Recipe ──────────────────────────────────────────────────────────────────────
#>
#> ── Inputs
#> Number of variables by role
#> outcome:   1
#> predictor: 2
#>
#> ── Operations
#> • Dummy variables from: gramstain
#> • Centering and scaling for: year
#> • Sparse, unbalanced variable filter on: all_predictors()
```

**Explanation:**

- `step_dummy()`: Encodes the categorical variable `gramstain` as
  numerical indicators.
- `step_normalize()`: Normalises the `year` variable.
- `step_nzv()`: Removes near-zero variance predictors.
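
What the first two steps do to the data can be sketched by hand in base R. The example below uses a made-up data frame and is only a conceptual illustration. The column names and layout produced by `recipes` may differ.

``` r
# Hypothetical aggregated data
toy <- data.frame(
  year      = c(2002, 2003, 2004, 2005),
  gramstain = c("Gram-negative", "Gram-positive",
                "Gram-negative", "Gram-positive")
)

# One-hot encoding: one 0/1 indicator column per category
one_hot <- sapply(unique(toy$gramstain),
                  function(g) as.integer(toy$gramstain == g))

# Normalisation: subtract the mean and divide by the standard deviation
year_norm <- (toy$year - mean(toy$year)) / sd(toy$year)

cbind(year = round(year_norm, 3), one_hot)
```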

#### 2. Specifying the Model

We use a linear regression model to predict resistance trends.

``` r
# Define the linear regression model
lm_model <- linear_reg() %>%
  set_engine("lm") # Use linear regression

lm_model
#> Linear Regression Model Specification (regression)
#>
#> Computational engine: lm
```

**Explanation:**

- `linear_reg()`: Defines a linear regression model.
- `set_engine("lm")`: Uses R’s built-in linear regression engine.
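
The `"lm"` engine corresponds to base R's `lm()`. As a small sketch, the example below fits a linear trend to made-up yearly resistance proportions. The data are synthetic and constructed to rise by exactly 0.02 per year.

``` r
# Hypothetical yearly resistance proportions with a built-in upward trend
toy <- data.frame(year = 2002:2011)
toy$res <- 0.30 + 0.02 * (toy$year - 2002)

# Ordinary least squares: resistance as a linear function of year
fit <- lm(res ~ year, data = toy)

# The slope is the estimated change in resistance proportion per year
slope <- unname(coef(fit)["year"])
slope
```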

#### 3. Building the Workflow

We combine the preprocessing recipe and model into a workflow.

``` r
# Create workflow
resistance_workflow_time <- workflow() %>%
  add_recipe(resistance_recipe_time) %>%
  add_model(lm_model)

resistance_workflow_time
#> ══ Workflow ════════════════════════════════════════════════════════════════════
#> Preprocessor: Recipe
#> Model: linear_reg()
#>
#> ── Preprocessor ────────────────────────────────────────────────────────────────
#> 3 Recipe Steps
#>
#> • step_dummy()
#> • step_normalize()
#> • step_nzv()
#>
#> ── Model ───────────────────────────────────────────────────────────────────────
#> Linear Regression Model Specification (regression)
#>
#> Computational engine: lm
```

### **Training and Evaluating the Model**

We split the data into training and testing sets, fit the model, and
evaluate performance.

``` r
# Split the data
set.seed(123)
data_split_time <- initial_split(data_time, prop = 0.8)
train_time <- training(data_split_time)
test_time <- testing(data_split_time)

# Train the model
fitted_workflow_time <- resistance_workflow_time %>%
  fit(train_time)

# Make predictions
predictions_time <- fitted_workflow_time %>%
  predict(test_time) %>%
  bind_cols(test_time)

# Evaluate model
metrics_time <- predictions_time %>%
  metrics(truth = res_AMX, estimate = .pred)

metrics_time
#> # A tibble: 3 × 3
#>   .metric .estimator .estimate
#>   <chr>   <chr>          <dbl>
#> 1 rmse    standard      0.0774
#> 2 rsq     standard      0.711
#> 3 mae     standard      0.0704
```

**Explanation:**

- `initial_split()`: Splits data into training and testing sets.
- `fit()`: Trains the workflow.
- [`predict()`](https://rdrr.io/r/stats/predict.html): Generates
  resistance predictions.
- `metrics()`: Evaluates model performance.
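
The three regression metrics reported above can be computed by hand. The sketch below uses made-up values and computes R squared as the squared correlation between observed and predicted values, which is how `yardstick::rsq()` defines it.

``` r
# Hypothetical observed and predicted resistance proportions
truth    <- c(0.40, 0.55, 0.70, 0.85)
estimate <- c(0.45, 0.50, 0.75, 0.80)

err <- estimate - truth

rmse <- sqrt(mean(err^2))      # root mean squared error
mae  <- mean(abs(err))         # mean absolute error
rsq  <- cor(truth, estimate)^2 # squared correlation

c(rmse = rmse, rsq = rsq, mae = mae)
```

Both error metrics are on the same scale as the outcome, so an RMSE of 0.05 here means predictions are typically off by about five percentage points.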

### **Visualising Predictions**

We plot resistance trends over time for amoxicillin.

``` r
library(ggplot2)

# Plot actual vs predicted resistance over time
ggplot(predictions_time, aes(x = year)) +
  geom_point(aes(y = res_AMX, color = "Actual")) +
  geom_line(aes(y = .pred, color = "Predicted")) +
  labs(title = "Predicted vs Actual AMX Resistance Over Time",
       x = "Year",
       y = "Resistance Proportion") +
  theme_minimal()
```

![](AMR_with_tidymodels_files/figure-html/unnamed-chunk-14-1.png)

Additionally, we can visualise resistance trends in `ggplot2` and
directly add linear models there:

``` r
ggplot(data_time, aes(x = year, y = res_AMX, color = gramstain)) +
  geom_line() +
  labs(title = "AMX Resistance Trends",
       x = "Year",
       y = "Resistance Proportion") +
  # add a linear model directly in ggplot2:
  geom_smooth(method = "lm",
              formula = y ~ x,
              alpha = 0.25) +
  theme_minimal()
```

![](AMR_with_tidymodels_files/figure-html/unnamed-chunk-15-1.png)

### **Conclusion**

In this example, we demonstrated how to analyse AMR trends over time
using `tidymodels`. By aggregating resistance rates for amoxicillin
(`AMX`), amoxicillin-clavulanic acid (`AMC`), and ciprofloxacin (`CIP`)
by year and Gram stain, we built a predictive model to track changes in
resistance to amoxicillin.

This method can be extended to other antibiotics and resistance
patterns, providing valuable insights into AMR dynamics in healthcare
settings.