mirror of
https://github.com/msberends/AMR.git
synced 2025-01-14 00:11:50 +01:00
---
title: "`AMR` with `tidymodels`"
output:
  rmarkdown::html_vignette:
    toc: true
    toc_depth: 3
vignette: >
  %\VignetteIndexEntry{`AMR` with `tidymodels`}
  %\VignetteEncoding{UTF-8}
  %\VignetteEngine{knitr::rmarkdown}
editor_options:
  chunk_output_type: console
---

```{r setup, include = FALSE, results = 'markup'}
knitr::opts_chunk$set(
  warning = FALSE,
  collapse = TRUE,
  comment = "#>",
  fig.width = 7.5,
  fig.height = 5
)
```

Antimicrobial resistance (AMR) is a global health crisis, and understanding resistance patterns is crucial for choosing effective treatments. The `AMR` R package provides tools for analysing AMR data, including antibiotic selector functions such as `aminoglycosides()` and `betalactams()`. In this post, we use the `tidymodels` framework to build a predictive model on the `example_isolates` dataset.

By combining `tidymodels` with the `AMR` package, we build a reproducible machine learning workflow that predicts the Gram stain of a microorganism from its resistance results for two important antibiotic classes: aminoglycosides and beta-lactams.

---

### **Objective**

Our goal is to build a predictive model with the `tidymodels` framework that determines the Gram stain of a microorganism from its resistance results. We will:

1. Preprocess data using the selector functions `aminoglycosides()` and `betalactams()`.
2. Define a logistic regression model for prediction.
3. Use a structured `tidymodels` workflow to preprocess, train, and evaluate the model.

---

### **Data Preparation**

We begin by loading the required libraries and preparing the `example_isolates` dataset from the `AMR` package.

```{r}
# Load required libraries
library(tidymodels)   # For machine learning workflows, and data manipulation (dplyr, tidyr, ...)
library(AMR)          # For AMR data analysis

# Load the example_isolates dataset
data("example_isolates")  # Preloaded dataset with AMR results

# Select relevant columns for prediction
data <- example_isolates %>%
  # select AB results dynamically
  select(mo, aminoglycosides(), betalactams()) %>%
  # replace NAs with NI (not-interpretable)
  mutate(across(where(is.sir),
                ~replace_na(.x, "NI")),
         # convert SIR columns to integer codes
         across(where(is.sir),
                as.integer),
         # get Gram stain of microorganisms
         mo = as.factor(mo_gramstain(mo))) %>%
  # drop NAs - the ones without a Gram stain (fungi, etc.)
  drop_na() # %>%
  # Cefepime is not reliable
  #select(-FEP)
```

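**Explanation:**

- `aminoglycosides()` and `betalactams()` dynamically select columns for antibiotics in these classes.
- `replace_na()` recodes missing results as `NI` (not interpretable), after which `as.integer()` converts each SIR column to integer codes that can serve as numeric predictors.
- `mo_gramstain()` replaces each microorganism code with its Gram stain, which becomes the outcome to predict.
- `drop_na()` ensures the model receives complete cases for training.
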
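To make the `NI` recoding and integer conversion concrete, here is a minimal base R sketch on a toy factor. The level order `S`, `I`, `R`, `NI` is an assumption for illustration only; the levels of the real `sir` class in `AMR` may differ, so the integer codes it yields may differ as well.

```{r}
# Toy stand-in for one SIR column; the level order is assumed for illustration
toy_sir <- factor(c("S", "R", NA, "I"), levels = c("S", "I", "R", "NI"))

# Recode missing results as "NI" (not interpretable)
toy_sir[is.na(toy_sir)] <- "NI"

# Convert to integer codes: each value becomes the position of its level
toy_codes <- as.integer(toy_sir)
toy_codes  # 1 3 4 2
```
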
---

### **Defining the Workflow**

We now define the `tidymodels` workflow, which consists of three steps: preprocessing, model specification, and fitting.

#### 1. Preprocessing with a Recipe

We create a recipe to preprocess the data for modelling. This includes:

- Declaring the Gram stain (`mo`) as the outcome and all antibiotic columns as predictors.
- Removing predictors that are highly correlated with one another, since they carry largely redundant information.

```{r}
# Define the recipe for data preprocessing
resistance_recipe <- recipe(mo ~ ., data = data) %>%
  step_corr(c(aminoglycosides(), betalactams()), threshold = 0.9)
resistance_recipe
```

**Explanation:**

- `recipe(mo ~ ., data = data)` sets `mo` as the outcome and uses all remaining columns as predictors.
- `step_corr()` removes antibiotic columns so that no remaining pair has an absolute correlation above the `threshold` of 0.9.

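As a rough illustration of what a correlation filter does, the sketch below applies `step_corr()` to simulated data in which `x2` is almost a copy of `x1`. The tibble `toy_corr` and its columns are made up for this example and are not part of the `AMR` package. Which of the two correlated columns is dropped is left to `step_corr()`, so the sketch only inspects the resulting column names.

```{r}
library(tidymodels)

set.seed(1)
toy_corr <- tibble(
  outcome = factor(rep(c("a", "b"), times = 50)),
  x1 = rnorm(100),
  x3 = rnorm(100)
) %>%
  mutate(x2 = x1 + rnorm(100, sd = 0.01)) # nearly identical to x1

toy_recipe <- recipe(outcome ~ ., data = toy_corr) %>%
  step_corr(all_numeric_predictors(), threshold = 0.9)

# prep() estimates which columns to drop; bake() returns the filtered data
toy_baked <- toy_recipe %>%
  prep() %>%
  bake(new_data = NULL)

names(toy_baked)
```
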
#### 2. Specifying the Model

We define a logistic regression model, since predicting the Gram stain (Gram-negative or Gram-positive) is a binary classification task.

```{r}
# Specify a logistic regression model
logistic_model <- logistic_reg() %>%
  set_engine("glm") # Use the Generalized Linear Model engine
logistic_model
```

**Explanation:**

- `logistic_reg()` sets up a logistic regression model.
- `set_engine("glm")` specifies the use of R's built-in GLM engine.

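The `glm` engine delegates to base R's `stats::glm()` with a binomial family, so a `parsnip` fit and a direct `glm()` call should give the same coefficients. The sketch below checks this on simulated data; `toy_glm` and its columns are hypothetical. Note that with a two-level factor outcome, `glm()` models the probability of the second factor level.

```{r}
library(tidymodels)

set.seed(42)
toy_glm <- tibble(x = rnorm(200)) %>%
  mutate(y = factor(ifelse(x + rnorm(200) > 0, "pos", "neg")))

# Fit through parsnip with a logistic regression specification on the glm engine
toy_parsnip_fit <- logistic_reg() %>%
  set_engine("glm") %>%
  fit(y ~ x, data = toy_glm)

# Fit directly with base R
toy_base_fit <- glm(y ~ x, data = toy_glm, family = binomial)

# The underlying glm object is stored in the `fit` element of the parsnip fit
all.equal(coef(toy_parsnip_fit$fit), coef(toy_base_fit))
```
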
#### 3. Building the Workflow

We bundle the recipe and model together into a `workflow`, which organises the entire modelling process.

```{r}
# Combine the recipe and model into a workflow
resistance_workflow <- workflow() %>%
  add_recipe(resistance_recipe) %>% # Add the preprocessing recipe
  add_model(logistic_model) # Add the logistic regression model
resistance_workflow
```

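A workflow is fitted and used for prediction as a single object: the recipe is estimated on the data passed to `fit()` and re-applied automatically by `predict()`. The following sketch shows this on simulated data; `toy_wf_data` and the normalisation step are assumptions chosen for illustration, not part of the analysis in this post.

```{r}
library(tidymodels)

set.seed(1)
toy_wf_data <- tibble(
  y = factor(sample(c("a", "b"), size = 100, replace = TRUE)),
  x1 = rnorm(100),
  x2 = rnorm(100)
)

toy_wf <- workflow() %>%
  add_recipe(
    recipe(y ~ ., data = toy_wf_data) %>%
      step_normalize(all_numeric_predictors())
  ) %>%
  add_model(logistic_reg() %>% set_engine("glm"))

toy_wf_fit <- fit(toy_wf, data = toy_wf_data)

# Model coefficients, extracted from the fitted workflow
toy_coefs <- toy_wf_fit %>%
  extract_fit_parsnip() %>%
  tidy()
toy_coefs

# Predictions: the preprocessing is applied to new data automatically
toy_wf_pred <- predict(toy_wf_fit, new_data = head(toy_wf_data))
toy_wf_pred
```
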
---

### **Training and Evaluating the Model**

To train the model, we split the data into training and testing sets. Then, we fit the workflow on the training set and evaluate its performance.

```{r}
# Split data into training and testing sets
set.seed(123) # For reproducibility
data_split <- initial_split(data, prop = 0.8) # 80% training, 20% testing
training_data <- training(data_split) # Training set
testing_data <- testing(data_split) # Testing set

# Fit the workflow to the training data
fitted_workflow <- resistance_workflow %>%
  fit(training_data) # Train the model

fitted_workflow
```

**Explanation:**

- `initial_split()` splits the data into training and testing sets.
- `fit()` trains the workflow on the training set.

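When the outcome classes are imbalanced, a stratified split keeps their proportions similar in both sets. `initial_split()` supports this through its `strata` argument. Whether stratification is needed for this analysis is a judgment call, and the data below are simulated for illustration; on the data used here, the corresponding call would be `initial_split(data, prop = 0.8, strata = mo)`.

```{r}
library(tidymodels)

set.seed(123)
toy_split_data <- tibble(
  outcome = factor(rep(c("Gram-negative", "Gram-positive"), times = c(70, 30))),
  x = rnorm(100)
)

# Stratify on the outcome so both sets keep roughly a 70/30 class ratio
toy_split <- initial_split(toy_split_data, prop = 0.8, strata = outcome)
toy_train <- training(toy_split)
toy_test <- testing(toy_split)

count(toy_train, outcome)
count(toy_test, outcome)
```
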
Next, we evaluate the model on the testing data.

```{r}
# Make predictions on the testing set
predictions <- fitted_workflow %>%
  predict(testing_data) # Generate predictions
probabilities <- fitted_workflow %>%
  predict(testing_data, type = "prob") # Generate probabilities

predictions <- predictions %>%
  bind_cols(probabilities) %>%
  bind_cols(testing_data) # Combine with true labels

predictions

# Evaluate model performance
metrics <- predictions %>%
  metrics(truth = mo, estimate = .pred_class) # Calculate performance metrics

metrics
```

**Explanation:**

- `predict()` generates predictions on the testing set.
- `metrics()` computes evaluation metrics; given only class predictions, it returns accuracy and Cohen's kappa.

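Which metrics `metrics()` returns depends on what is passed to it. A small sketch on hand-made predictions (the tibble `toy_eval` is hypothetical): with only a class estimate it reports accuracy and Cohen's kappa, and adding the probability column of the first factor level also yields log loss and ROC AUC.

```{r}
library(tidymodels)

toy_eval <- tibble(
  truth       = factor(c("neg", "neg", "pos", "pos", "neg", "pos")),
  .pred_class = factor(c("neg", "pos", "pos", "pos", "neg", "neg")),
  .pred_neg   = c(0.9, 0.4, 0.2, 0.1, 0.8, 0.6)
)

# Class predictions only: accuracy and kappa
toy_class_metrics <- metrics(toy_eval, truth = truth, estimate = .pred_class)
toy_class_metrics

# Class predictions plus class probabilities: adds mn_log_loss and roc_auc
toy_prob_metrics <- metrics(toy_eval, truth = truth, estimate = .pred_class, .pred_neg)
toy_prob_metrics
```
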
It appears we can predict the Gram stain from AMR results with an accuracy of `r round(metrics$.estimate[1], 3)`. The ROC curve looks like this:

```{r}
predictions %>%
  roc_curve(mo, `.pred_Gram-negative`) %>%
  autoplot()
```

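The curve can be summarised in a single number with `roc_auc()`, which takes the same truth and probability columns as `roc_curve()`. A hedged sketch on hand-made data follows; `toy_roc` is hypothetical, and its column names only mimic those of the `predictions` object.

```{r}
library(tidymodels)

toy_roc <- tibble(
  mo = factor(c("Gram-negative", "Gram-negative", "Gram-positive",
                "Gram-positive", "Gram-negative", "Gram-positive")),
  `.pred_Gram-negative` = c(0.9, 0.4, 0.2, 0.1, 0.8, 0.6)
)

# By default, yardstick treats the first factor level as the event of interest,
# so the probability column for "Gram-negative" is the one to pass
toy_auc <- roc_auc(toy_roc, mo, `.pred_Gram-negative`)
toy_auc
```
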
---

### **Conclusion**

In this post, we demonstrated how to build a machine learning pipeline with the `tidymodels` framework and the `AMR` package. By combining selector functions like `aminoglycosides()` and `betalactams()` with `tidymodels`, we efficiently prepared data, trained a model, and evaluated its performance.

This workflow is extensible to other antibiotic classes and resistance patterns, empowering users to analyse AMR data systematically and reproducibly.

---