AMR/vignettes/AMR_with_tidymodels.Rmd

---
title: "AMR with tidymodels"
output:
  rmarkdown::html_vignette:
    toc: true
    toc_depth: 3
vignette: >
  %\VignetteIndexEntry{AMR with tidymodels}
  %\VignetteEncoding{UTF-8}
  %\VignetteEngine{knitr::rmarkdown}
editor_options:
  chunk_output_type: console
---

```{r setup, include = FALSE, results = 'markup'}
knitr::opts_chunk$set(
  warning = FALSE,
  collapse = TRUE,
  comment = "#>",
  fig.width = 7.5,
  fig.height = 5
)
```

> This page was entirely written by our [AMR for R Assistant](https://chat.amr-for-r.org), a ChatGPT manually-trained model able to answer any question about the `AMR` package.

Antimicrobial resistance (AMR) is a global health crisis, and understanding resistance patterns is crucial for managing effective treatments. The `AMR` R package provides robust tools for analysing AMR data, including convenient antimicrobial selector functions like `aminoglycosides()` and `betalactams()`.

In this post, we will explore how to use the `tidymodels` framework to predict resistance patterns in the `example_isolates` dataset in two examples.

This post contains the following examples:

1. Using Antimicrobial Selectors
2. Predicting ESBL Presence Using Raw MICs
3. Predicting AMR Over Time


## Example 1: Using Antimicrobial Selectors

By leveraging the power of `tidymodels` and the `AMR` package, we’ll build a reproducible machine learning workflow to predict the Gramstain of the microorganism to two important antibiotic classes: aminoglycosides and beta-lactams.

### **Objective**

Our goal is to build a predictive model using the `tidymodels` framework to determine the Gramstain of the microorganism based on microbial data. We will:

1. Preprocess data using the selector functions `aminoglycosides()` and `betalactams()`.
2. Define a logistic regression model for prediction.
3. Use a structured `tidymodels` workflow to preprocess, train, and evaluate the model.

### **Data Preparation**

We begin by loading the required libraries and preparing the `example_isolates` dataset from the `AMR` package.

```{r lib packages, message = FALSE, warning = FALSE, results = 'asis'}
# Load required libraries
library(AMR)          # For AMR data analysis
library(tidymodels)   # For machine learning workflows, and data manipulation (dplyr, tidyr, ...)
```

Prepare the data:

```{r}
# Your data could look like this:
example_isolates

# Select relevant columns for prediction
data <- example_isolates %>%
  # select AB results dynamically
  select(mo, aminoglycosides(), betalactams()) %>%
  # replace NAs with NI (not-interpretable)
   mutate(across(where(is.sir),
                 ~replace_na(.x, "NI")),
          # make factors of SIR columns
          across(where(is.sir),
                 as.integer),
          # get Gramstain of microorganisms
          mo = as.factor(mo_gramstain(mo))) %>%
  # drop NAs - the ones without a Gramstain (fungi, etc.)
  drop_na()
```

**Explanation:**

- `aminoglycosides()` and `betalactams()` dynamically select columns for antimicrobials in these classes.
- `drop_na()` ensures the model receives complete cases for training.

### **Defining the Workflow**

We now define the `tidymodels` workflow, which consists of three steps: preprocessing, model specification, and fitting.

#### 1. Preprocessing with a Recipe

We create a recipe to preprocess the data for modelling.

```{r}
# Define the recipe for data preprocessing
resistance_recipe <- recipe(mo ~ ., data = data) %>%
  step_corr(c(aminoglycosides(), betalactams()), threshold = 0.9)
resistance_recipe
```

For a recipe that includes at least one preprocessing operation, like we have with `step_corr()`, the necessary parameters can be estimated from a training set using `prep()`:

```{r}
prep(resistance_recipe)
```

**Explanation:**

- `recipe(mo ~ ., data = data)` will take the `mo` column as outcome and all other columns as predictors.
- `step_corr()` removes predictors (i.e., antibiotic columns) that have a higher correlation than 90%.

Notice how the recipe contains just the antimicrobial selector functions - no need to define the columns specifically. In the preparation (retrieved with `prep()`) we can see that the columns or variables `r paste0("'", suppressMessages(prep(resistance_recipe))$steps[[1]]$removals, "'", collapse = " and ")` were removed as they correlate too much with existing, other variables.

#### 2. Specifying the Model

We define a logistic regression model since resistance prediction is a binary classification task.

```{r}
# Specify a logistic regression model
logistic_model <- logistic_reg() %>%
  set_engine("glm") # Use the Generalised Linear Model engine
logistic_model
```

**Explanation:**

- `logistic_reg()` sets up a logistic regression model.
- `set_engine("glm")` specifies the use of R's built-in GLM engine.

#### 3. Building the Workflow

We bundle the recipe and model together into a `workflow`, which organises the entire modelling process.

```{r}
# Combine the recipe and model into a workflow
resistance_workflow <- workflow() %>%
  add_recipe(resistance_recipe) %>% # Add the preprocessing recipe
  add_model(logistic_model) # Add the logistic regression model
resistance_workflow
```

### **Training and Evaluating the Model**

To train the model, we split the data into training and testing sets. Then, we fit the workflow on the training set and evaluate its performance.

```{r}
# Split data into training and testing sets
set.seed(123) # For reproducibility
data_split <- initial_split(data, prop = 0.8) # 80% training, 20% testing
training_data <- training(data_split) # Training set
testing_data <- testing(data_split)   # Testing set

# Fit the workflow to the training data
fitted_workflow <- resistance_workflow %>%
  fit(training_data) # Train the model
```

**Explanation:**

- `initial_split()` splits the data into training and testing sets.
- `fit()` trains the workflow on the training set.

Notice how in `fit()`, the antimicrobial selector functions are internally called again. For training, these functions are called since they are stored in the recipe.

Next, we evaluate the model on the testing data.

```{r}
# Make predictions on the testing set
predictions <- fitted_workflow %>%
  predict(testing_data)                # Generate predictions
probabilities <- fitted_workflow %>%
  predict(testing_data, type = "prob") # Generate probabilities

predictions <- predictions %>%
  bind_cols(probabilities) %>%
  bind_cols(testing_data) # Combine with true labels

predictions

# Evaluate model performance
metrics <- predictions %>%
  metrics(truth = mo, estimate = .pred_class) # Calculate performance metrics

metrics


# To assess some other model properties, you can make our own `metrics()` function
our_metrics <- metric_set(accuracy, kap, ppv, npv) # add Positive Predictive Value and Negative Predictive Value
metrics2 <- predictions %>%
  our_metrics(truth = mo, estimate = .pred_class) # run again on our `our_metrics()` function

metrics2
```

**Explanation:**

- `predict()` generates predictions on the testing set.
- `metrics()` computes evaluation metrics like accuracy and kappa.

It appears we can predict the Gram stain with a `r round(metrics$.estimate[1], 3) * 100`% accuracy based on AMR results of only aminoglycosides and beta-lactam antibiotics. The ROC curve looks like this:

```{r}
predictions %>%
  roc_curve(mo, `.pred_Gram-negative`) %>%
  autoplot()
```

### **Conclusion**

In this post, we demonstrated how to build a machine learning pipeline with the `tidymodels` framework and the `AMR` package. By combining selector functions like `aminoglycosides()` and `betalactams()` with `tidymodels`, we efficiently prepared data, trained a model, and evaluated its performance.

This workflow is extensible to other antimicrobial classes and resistance patterns, empowering users to analyse AMR data systematically and reproducibly.


---

## Example 2: Predicting ESBL Presence Using Raw MICs

In this second example, we demonstrate how to use `<mic>` columns directly in `tidymodels` workflows using AMR-specific recipe steps. This includes a transformation to `log2` scale using `step_mic_log2()`, which prepares MIC values for use in classification models.

This approach and idea formed the basis for the publication [DOI: 10.3389/fmicb.2025.1582703](https://doi.org/10.3389/fmicb.2025.1582703) to model the presence of extended-spectrum beta-lactamases (ESBL).

### **Objective**

Our goal is to:

1. Use raw MIC values to predict whether a bacterial isolate produces ESBL.
2. Apply AMR-aware preprocessing in a `tidymodels` recipe.
3. Train a classification model and evaluate its predictive performance.

### **Data Preparation**

We use the `esbl_isolates` dataset that comes with the AMR package.

```{r}
# Load required libraries
library(AMR)
library(tidymodels)

# View the esbl_isolates data set
esbl_isolates

# Prepare a binary outcome and convert to ordered factor
data <- esbl_isolates %>%
  mutate(esbl = factor(esbl, levels = c(FALSE, TRUE), ordered = TRUE))
```

**Explanation:**

- `esbl_isolates`: Contains MIC test results and ESBL status for each isolate.
- `mutate(esbl = ...)`: Converts the target column to an ordered factor for classification.

### **Defining the Workflow**

#### 1. Preprocessing with a Recipe

We use our `step_mic_log2()` function to log2-transform MIC values, ensuring that MICs are numeric and properly scaled. All MIC predictors can easily and agnostically selected using the new `all_mic_predictors()`:

```{r}
# Split into training and testing sets
set.seed(123)
split <- initial_split(data)
training_data <- training(split)
testing_data <- testing(split)

# Define the recipe
mic_recipe <- recipe(esbl ~ ., data = training_data) %>%
  remove_role(genus, old_role = "predictor") %>%  # Remove non-informative variable
  step_mic_log2(all_mic_predictors()) #%>%         # Log2 transform all MIC predictors
 # prep()

mic_recipe
```

**Explanation:**

- `remove_role()`: Removes irrelevant variables like genus.
- `step_mic_log2()`: Applies `log2(as.numeric(...))` to all MIC predictors in one go.
- `prep()`: Finalises the recipe based on training data.

#### 2. Specifying the Model

We use a simple logistic regression to model ESBL presence, though recent models such as xgboost ([link to `parsnip` manual](https://parsnip.tidymodels.org/reference/details_boost_tree_xgboost.html)) could be much more precise.

```{r}
# Define the model
model <- logistic_reg(mode = "classification") %>%
  set_engine("glm")

model
```

**Explanation:**

- `logistic_reg()`: Specifies a binary classification model.
- `set_engine("glm")`: Uses the base R GLM engine.

#### 3. Building the Workflow

```{r}
# Create workflow
workflow_model <- workflow() %>%
  add_recipe(mic_recipe) %>%
  add_model(model)

workflow_model
```

### **Training and Evaluating the Model**

```{r}
# Fit the model
fitted <- fit(workflow_model, training_data)

# Generate predictions
predictions <- predict(fitted, testing_data) %>%
  bind_cols(testing_data)

# Evaluate model performance
our_metrics <- metric_set(accuracy, kap, ppv, npv)
metrics <- our_metrics(predictions, truth = esbl, estimate = .pred_class)

metrics
```

**Explanation:**

- `fit()`: Trains the model on the processed training data.
- `predict()`: Produces predictions for unseen test data.
- `metric_set()`: Allows evaluating multiple classification metrics.

It appears we can predict ESBL gene presence with a positive predictive value (PPV) of `r round(metrics$.estimate[3], 3) * 100`% and a negative predictive value (NPV) of `r round(metrics$.estimate[4], 3) * 100` using a simplistic logistic regression model.

### **Visualising Predictions**

We can visualise predictions by comparing predicted and actual ESBL status.

```{r}
library(ggplot2)

ggplot(predictions, aes(x = esbl, fill = .pred_class)) +
  geom_bar(position = "stack") +
  labs(title = "Predicted vs Actual ESBL Status",
       x = "Actual ESBL",
       y = "Count") +
  theme_minimal()
```

### **Conclusion**

In this example, we showcased how the new `AMR`-specific recipe steps simplify working with `<mic>` columns in `tidymodels`. The `step_mic_log2()` transformation converts ordered MICs to log2-transformed numerics, improving compatibility with classification models.

This pipeline enables realistic, reproducible, and interpretable modelling of antimicrobial resistance data.

---


## Example 3: Predicting AMR Over Time

In this third example, we aim to predict antimicrobial resistance (AMR) trends over time using `tidymodels`. We will model resistance to three antibiotics (amoxicillin `AMX`, amoxicillin-clavulanic acid `AMC`, and ciprofloxacin `CIP`), based on historical data grouped by year and hospital ward.

### **Objective**

Our goal is to:

1. Prepare the dataset by aggregating resistance data over time.
2. Define a regression model to predict AMR trends.
3. Use `tidymodels` to preprocess, train, and evaluate the model.

### **Data Preparation**

We start by transforming the `example_isolates` dataset into a structured time-series format.

```{r}
# Load required libraries
library(AMR)
library(tidymodels)

# Transform dataset
data_time <- example_isolates %>%
  top_n_microorganisms(n = 10) %>% # Filter on the top #10 species
  mutate(year = as.integer(format(date, "%Y")),  # Extract year from date
         gramstain = mo_gramstain(mo)) %>% # Get taxonomic names
  group_by(year, gramstain) %>%
  summarise(across(c(AMX, AMC, CIP),
                   function(x) resistance(x, minimum = 0),
                   .names = "res_{.col}"),
            .groups = "drop") %>%
  filter(!is.na(res_AMX) & !is.na(res_AMC) & !is.na(res_CIP)) # Drop missing values

data_time
```

**Explanation:**

- `mo_name(mo)`: Converts microbial codes into proper species names.
- `resistance()`: Converts AMR results into numeric values (proportion of resistant isolates).
- `group_by(year, ward, species)`: Aggregates resistance rates by year and ward.

### **Defining the Workflow**

We now define the modelling workflow, which consists of a preprocessing step, a model specification, and the fitting process.

#### 1. Preprocessing with a Recipe

```{r}
# Define the recipe
resistance_recipe_time <- recipe(res_AMX ~ year + gramstain, data = data_time) %>%
  step_dummy(gramstain, one_hot = TRUE) %>%  # Convert categorical to numerical
  step_normalize(year) %>%  # Normalise year for better model performance
  step_nzv(all_predictors())  # Remove near-zero variance predictors

resistance_recipe_time
```

**Explanation:**

- `step_dummy()`: Encodes categorical variables (`ward`, `species`) as numerical indicators.
- `step_normalize()`: Normalises the `year` variable.
- `step_nzv()`: Removes near-zero variance predictors.

#### 2. Specifying the Model

We use a linear regression model to predict resistance trends.

```{r}
# Define the linear regression model
lm_model <- linear_reg() %>%
  set_engine("lm") # Use linear regression

lm_model
```

**Explanation:**

- `linear_reg()`: Defines a linear regression model.
- `set_engine("lm")`: Uses R’s built-in linear regression engine.

#### 3. Building the Workflow

We combine the preprocessing recipe and model into a workflow.

```{r}
# Create workflow
resistance_workflow_time <- workflow() %>%
  add_recipe(resistance_recipe_time) %>%
  add_model(lm_model)

resistance_workflow_time
```

### **Training and Evaluating the Model**

We split the data into training and testing sets, fit the model, and evaluate performance.

```{r}
# Split the data
set.seed(123)
data_split_time <- initial_split(data_time, prop = 0.8)
train_time <- training(data_split_time)
test_time <- testing(data_split_time)

# Train the model
fitted_workflow_time <- resistance_workflow_time %>%
  fit(train_time)

# Make predictions
predictions_time <- fitted_workflow_time %>%
  predict(test_time) %>%
  bind_cols(test_time)

# Evaluate model
metrics_time <- predictions_time %>%
  metrics(truth = res_AMX, estimate = .pred)

metrics_time
```

**Explanation:**

- `initial_split()`: Splits data into training and testing sets.
- `fit()`: Trains the workflow.
- `predict()`: Generates resistance predictions.
- `metrics()`: Evaluates model performance.

### **Visualising Predictions**

We plot resistance trends over time for amoxicillin.

```{r}
library(ggplot2)

# Plot actual vs predicted resistance over time
ggplot(predictions_time, aes(x = year)) +
  geom_point(aes(y = res_AMX, color = "Actual")) +
  geom_line(aes(y = .pred, color = "Predicted")) +
  labs(title = "Predicted vs Actual AMX Resistance Over Time",
       x = "Year",
       y = "Resistance Proportion") +
  theme_minimal()
```

Additionally, we can visualise resistance trends in `ggplot2` and directly add linear models there:

```{r}
ggplot(data_time, aes(x = year, y = res_AMX, color = gramstain)) +
  geom_line() +
  labs(title = "AMX Resistance Trends",
       x = "Year",
       y = "Resistance Proportion") +
  # add a linear model directly in ggplot2:
  geom_smooth(method = "lm",
              formula = y ~ x,
              alpha = 0.25) +
  theme_minimal()
```

### **Conclusion**

In this example, we demonstrated how to analyze AMR trends over time using `tidymodels`. By aggregating resistance rates by year and hospital ward, we built a predictive model to track changes in resistance to amoxicillin (`AMX`), amoxicillin-clavulanic acid (`AMC`), and ciprofloxacin (`CIP`).

This method can be extended to other antibiotics and resistance patterns, providing valuable insights into AMR dynamics in healthcare settings.