mirror of https://github.com/msberends/AMR.git synced 2024-12-26 17:26:12 +01:00
AMR/vignettes/AMR_with_tidymodels.Rmd


---
title: "`AMR` with `tidymodels`"
output:
  rmarkdown::html_vignette:
    toc: true
    toc_depth: 3
vignette: >
  %\VignetteIndexEntry{`AMR` with `tidymodels`}
  %\VignetteEncoding{UTF-8}
  %\VignetteEngine{knitr::rmarkdown}
editor_options:
  chunk_output_type: console
---
```{r setup, include = FALSE, results = 'markup'}
knitr::opts_chunk$set(
warning = FALSE,
collapse = TRUE,
comment = "#>",
fig.width = 7.5,
fig.height = 5
)
```
Antimicrobial resistance (AMR) is a global health crisis, and understanding resistance patterns is crucial for managing effective treatments. The `AMR` R package provides robust tools for analysing AMR data, including convenient antibiotic selector functions like `aminoglycosides()` and `betalactams()`. In this post, we will explore how to use the `tidymodels` framework to build a predictive model on the `example_isolates` dataset.
By combining `tidymodels` with the `AMR` package, we'll build a reproducible machine learning workflow that uses resistance results for two important antibiotic classes, aminoglycosides and beta-lactams, to predict the Gram stain of a microorganism.
---
### **Objective**
Our goal is to build a predictive model using the `tidymodels` framework that determines the Gram stain of a microorganism from its resistance results. We will:
1. Preprocess data using the selector functions `aminoglycosides()` and `betalactams()`.
2. Define a logistic regression model for prediction.
3. Use a structured `tidymodels` workflow to preprocess, train, and evaluate the model.
---
### **Data Preparation**
We begin by loading the required libraries and preparing the `example_isolates` dataset from the `AMR` package.
```{r}
# Load required libraries
library(tidymodels) # For machine learning workflows and data manipulation (dplyr, tidyr, ...)
library(AMR) # For AMR data analysis

# Load the example_isolates dataset
data("example_isolates") # Preloaded dataset with AMR results

# Select relevant columns for prediction
data <- example_isolates %>%
  # select AB results dynamically
  select(mo, aminoglycosides(), betalactams()) %>%
  # replace NAs with NI (not-interpretable)
  mutate(across(where(is.sir),
                ~replace_na(.x, "NI")),
         # convert SIR columns to integer codes
         across(where(is.sir),
                as.integer),
         # get Gramstain of microorganisms
         mo = as.factor(mo_gramstain(mo))) %>%
  # drop NAs - the ones without a Gramstain (fungi, etc.)
  drop_na() # %>%
  # Cefepime is not reliable
  #select(-FEP)
```
**Explanation:**
- `aminoglycosides()` and `betalactams()` dynamically select columns for antibiotics in these classes.
- `drop_na()` ensures the model receives complete cases for training.
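
To see which columns the selectors pick up, they can be run on their own. The chunk below is a minimal sketch, assuming the `AMR` and `dplyr` packages are installed; the object names `ag_cols` and `bl_cols` are made up for illustration and are not used elsewhere in this post.

```{r}
library(AMR)
library(dplyr)

# Column names in example_isolates matched by each antibiotic selector
# (ag_cols and bl_cols are illustrative names only)
ag_cols <- example_isolates %>% select(aminoglycosides()) %>% colnames()
bl_cols <- example_isolates %>% select(betalactams()) %>% colnames()

ag_cols
bl_cols
```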
---
### **Defining the Workflow**
We now define the `tidymodels` workflow, which consists of three steps: preprocessing, model specification, and fitting.
#### 1. Preprocessing with a Recipe
We create a recipe to preprocess the data for modelling. This includes:
- Declaring the Gram stain (`mo`) as the outcome and the antibiotic columns as predictors.
- Removing antibiotic columns that are highly correlated with other predictors.
```{r}
# Define the recipe for data preprocessing
resistance_recipe <- recipe(mo ~ ., data = data) %>%
  step_corr(c(aminoglycosides(), betalactams()), threshold = 0.9)
resistance_recipe
```
**Explanation:**
- `recipe(mo ~ ., data = data)` sets the Gram stain (`mo`) as the outcome and all remaining columns as predictors.
- `step_corr()` removes antibiotic columns whose absolute correlation with another predictor exceeds the threshold of 0.9, since highly correlated predictors carry largely redundant information.
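
The effect of a correlation filter is easiest to see on a small made-up dataset. The following is a sketch only: the data frame `toy_corr` and its columns are hypothetical, and it uses `all_numeric_predictors()` instead of the `AMR` selectors so that it runs without the isolate data.

```{r}
library(recipes)

set.seed(1)
drug_a <- rnorm(100)
toy_corr <- data.frame(
  outcome     = factor(rep(c("neg", "pos"), 50)),
  drug_a      = drug_a,
  drug_a_copy = drug_a + rnorm(100, sd = 0.01), # nearly identical to drug_a
  drug_b      = rnorm(100)                      # generated independently
)

toy_recipe <- recipe(outcome ~ ., data = toy_corr) %>%
  step_corr(all_numeric_predictors(), threshold = 0.9)

# prep() estimates which columns to drop; bake() applies that decision
toy_baked <- toy_recipe %>% prep() %>% bake(new_data = NULL)
names(toy_baked)
```

Only one column of the near-duplicate pair should remain after baking, while the independent column and the outcome are kept.
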
#### 2. Specifying the Model
We define a logistic regression model since predicting the Gram stain (Gram-negative or Gram-positive) is a binary classification task.
```{r}
# Specify a logistic regression model
logistic_model <- logistic_reg() %>%
  set_engine("glm") # Use the Generalized Linear Model engine
logistic_model
```
**Explanation:**
- `logistic_reg()` sets up a logistic regression model.
- `set_engine("glm")` specifies the use of R's built-in GLM engine.
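
To make the link between the `parsnip` specification and base R concrete, the sketch below fits the same kind of model both ways and compares coefficients. It uses the built-in `mtcars` data purely as a stand-in; `toy_cars`, `toy_parsnip_fit` and `toy_base_fit` are illustrative names, not part of the resistance workflow.

```{r}
library(parsnip)

# Toy data: transmission type (two classes) as a function of car weight
toy_cars <- mtcars
toy_cars$am <- factor(toy_cars$am, levels = c(0, 1),
                      labels = c("automatic", "manual"))

# Logistic regression through parsnip with the glm engine
toy_parsnip_fit <- logistic_reg() %>%
  set_engine("glm") %>%
  fit(am ~ wt, data = toy_cars)

# The same model fitted directly with stats::glm()
toy_base_fit <- glm(am ~ wt, data = toy_cars, family = binomial)

# The underlying glm object is stored in the `fit` element
all.equal(coef(toy_parsnip_fit$fit), coef(toy_base_fit))
```
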
#### 3. Building the Workflow
We bundle the recipe and model together into a `workflow`, which organises the entire modelling process.
```{r}
# Combine the recipe and model into a workflow
resistance_workflow <- workflow() %>%
  add_recipe(resistance_recipe) %>% # Add the preprocessing recipe
  add_model(logistic_model) # Add the logistic regression model
resistance_workflow
```
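
A workflow carries its recipe through to prediction, so new data is preprocessed in the same way before it reaches the model. The sketch below shows this end to end on the built-in `mtcars` data as a stand-in; all `toy_wf_*` names and `toy_workflow` are hypothetical and separate from the resistance workflow above.

```{r}
library(workflows)
library(recipes)
library(parsnip)

# Toy two-class outcome
toy_wf_data <- mtcars
toy_wf_data$am <- factor(toy_wf_data$am, levels = c(0, 1),
                         labels = c("automatic", "manual"))

# Bundle a recipe and a model specification
toy_workflow <- workflow() %>%
  add_recipe(recipe(am ~ wt + hp, data = toy_wf_data)) %>%
  add_model(logistic_reg() %>% set_engine("glm"))

# Fitting the workflow preps the recipe and trains the model in one call
toy_wf_fit <- fit(toy_workflow, data = toy_wf_data)

# Class probabilities: one column per outcome level
toy_wf_probs <- predict(toy_wf_fit, new_data = head(toy_wf_data), type = "prob")
toy_wf_probs
```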
---
### **Training and Evaluating the Model**
To train the model, we split the data into training and testing sets. Then, we fit the workflow on the training set and evaluate its performance.
```{r}
# Split data into training and testing sets
set.seed(123) # For reproducibility
data_split <- initial_split(data, prop = 0.8) # 80% training, 20% testing
training_data <- training(data_split) # Training set
testing_data <- testing(data_split) # Testing set
# Fit the workflow to the training data
fitted_workflow <- resistance_workflow %>%
  fit(training_data) # Train the model
fitted_workflow
```
**Explanation:**
- `initial_split()` splits the data into training and testing sets.
- `fit()` trains the workflow on the training set.
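
The split above is purely random. When the outcome classes are unbalanced, `initial_split()` also accepts a `strata` argument that keeps class proportions similar in both sets. The sketch below illustrates this optional variation on made-up data; `toy_isolates` and `toy_split` are hypothetical names, and the 70/30 class ratio is an assumption chosen for the example.

```{r}
library(rsample)

set.seed(123)
# Made-up data with an unbalanced two-class outcome (70% / 30%)
toy_isolates <- data.frame(
  mo = factor(rep(c("Gram-negative", "Gram-positive"), times = c(700, 300))),
  x  = rnorm(1000)
)

# Stratify on the outcome so both sets keep roughly the same class ratio
toy_split <- initial_split(toy_isolates, prop = 0.8, strata = mo)

prop.table(table(training(toy_split)$mo))
prop.table(table(testing(toy_split)$mo))
```
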
Next, we evaluate the model on the testing data.
```{r}
# Make predictions on the testing set
predictions <- fitted_workflow %>%
  predict(testing_data) # Generate predictions
probabilities <- fitted_workflow %>%
  predict(testing_data, type = "prob") # Generate probabilities

predictions <- predictions %>%
  bind_cols(probabilities) %>%
  bind_cols(testing_data) # Combine with true labels

predictions

# Evaluate model performance
metrics <- predictions %>%
  metrics(truth = mo, estimate = .pred_class) # Calculate performance metrics

metrics
```
**Explanation:**
- `predict()` generates predictions on the testing set.
- `metrics()` computes evaluation metrics; for class predictions these are accuracy and Cohen's kappa.
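
Which metrics `metrics()` returns depends on what is passed to it. The sketch below uses a tiny hand-made prediction table (`toy_preds`, hypothetical) to show that class predictions alone give accuracy and kappa, while also passing the probability column of the first outcome level adds probability-based metrics such as the area under the ROC curve.

```{r}
library(yardstick)

# Hand-made predictions: six cases, one of them misclassified
toy_preds <- data.frame(
  truth     = factor(c("neg", "neg", "pos", "pos", "pos", "neg"),
                     levels = c("neg", "pos")),
  .pred_neg = c(0.9, 0.8, 0.3, 0.4, 0.6, 0.7)
)
toy_preds$.pred_class <- factor(
  ifelse(toy_preds$.pred_neg > 0.5, "neg", "pos"),
  levels = c("neg", "pos")
)

# Class predictions only: accuracy and kappa
toy_class_metrics <- metrics(toy_preds, truth = truth, estimate = .pred_class)
toy_class_metrics

# Adding the probability of the first level ("neg") also gives
# log loss and ROC AUC
toy_prob_metrics <- metrics(toy_preds, truth = truth,
                            estimate = .pred_class, .pred_neg)
toy_prob_metrics
```
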
It appears we can predict the Gram stain based on AMR results with an accuracy of `r round(metrics$.estimate[1], 3)`. The ROC curve looks like:
```{r}
predictions %>%
  roc_curve(mo, `.pred_Gram-negative`) %>%
  autoplot()
```
---
### **Conclusion**
In this post, we demonstrated how to build a machine learning pipeline with the `tidymodels` framework and the `AMR` package. By combining selector functions like `aminoglycosides()` and `betalactams()` with `tidymodels`, we efficiently prepared data, trained a model, and evaluated its performance.
This workflow is extensible to other antibiotic classes and resistance patterns, empowering users to analyse AMR data systematically and reproducibly.
---