mirror of
				https://github.com/msberends/AMR.git
				synced 2025-11-04 03:44:07 +01:00 
			
		
		
		
	
		
			
				
	
	
		
			184 lines
		
	
	
		
			6.7 KiB
		
	
	
	
		
			Plaintext
		
	
	
	
	
	
			
		
		
	
	
			184 lines
		
	
	
		
			6.7 KiB
		
	
	
	
		
			Plaintext
		
	
	
	
	
	
---
 | 
						||
title: "AMR with tidymodels"
 | 
						||
output: 
 | 
						||
  rmarkdown::html_vignette:
 | 
						||
    toc: true
 | 
						||
    toc_depth: 3
 | 
						||
vignette: >
 | 
						||
  %\VignetteIndexEntry{AMR with tidymodels}
 | 
						||
  %\VignetteEncoding{UTF-8}
 | 
						||
  %\VignetteEngine{knitr::rmarkdown}
 | 
						||
editor_options: 
 | 
						||
  chunk_output_type: console
 | 
						||
---
 | 
						||
 | 
						||
```{r setup, include = FALSE, results = 'markup'}
 | 
						||
knitr::opts_chunk$set(
 | 
						||
  warning = FALSE,
 | 
						||
  collapse = TRUE,
 | 
						||
  comment = "#>",
 | 
						||
  fig.width = 7.5,
 | 
						||
  fig.height = 5
 | 
						||
)
 | 
						||
```
 | 
						||
 | 
						||
> This page was entirely written by our [AMR for R Assistant](https://chatgpt.com/g/g-M4UNLwFi5-amr-for-r-assistant), a ChatGPT manually-trained model able to answer any question about the AMR package.
 | 
						||
 | 
						||
Antimicrobial resistance (AMR) is a global health crisis, and understanding resistance patterns is crucial for managing effective treatments. The `AMR` R package provides robust tools for analysing AMR data, including convenient antibiotic selector functions like `aminoglycosides()` and `betalactams()`. In this post, we will explore how to use the `tidymodels` framework to predict resistance patterns in the `example_isolates` dataset. 
 | 
						||
 | 
						||
By leveraging the power of `tidymodels` and the `AMR` package, we’ll build a reproducible machine learning workflow to predict the Gramstain of the microorganism to two important antibiotic classes: aminoglycosides and beta-lactams.
 | 
						||
 | 
						||
### **Objective**
 | 
						||
 | 
						||
Our goal is to build a predictive model using the `tidymodels` framework to determine the Gramstain of the microorganism based on microbial data. We will:
 | 
						||
 | 
						||
1. Preprocess data using the selector functions `aminoglycosides()` and `betalactams()`.
 | 
						||
2. Define a logistic regression model for prediction.
 | 
						||
3. Use a structured `tidymodels` workflow to preprocess, train, and evaluate the model.
 | 
						||
 | 
						||
### **Data Preparation**
 | 
						||
 | 
						||
We begin by loading the required libraries and preparing the `example_isolates` dataset from the `AMR` package.
 | 
						||
 | 
						||
```{r}
 | 
						||
# Load required libraries
 | 
						||
library(tidymodels)   # For machine learning workflows, and data manipulation (dplyr, tidyr, ...)
 | 
						||
library(AMR)          # For AMR data analysis
 | 
						||
 | 
						||
# Load the example_isolates dataset
 | 
						||
data("example_isolates")  # Preloaded dataset with AMR results
 | 
						||
 | 
						||
# Select relevant columns for prediction
 | 
						||
data <- example_isolates %>%
 | 
						||
  # select AB results dynamically
 | 
						||
  select(mo, aminoglycosides(), betalactams()) %>%
 | 
						||
  # replace NAs with NI (not-interpretable)
 | 
						||
   mutate(across(where(is.sir),
 | 
						||
                 ~replace_na(.x, "NI")),
 | 
						||
          # make factors of SIR columns
 | 
						||
          across(where(is.sir),
 | 
						||
                 as.integer),
 | 
						||
          # get Gramstain of microorganisms
 | 
						||
          mo = as.factor(mo_gramstain(mo))) %>%
 | 
						||
  # drop NAs - the ones without a Gramstain (fungi, etc.)
 | 
						||
  drop_na()
 | 
						||
```
 | 
						||
 | 
						||
**Explanation:**
 | 
						||
 | 
						||
- `aminoglycosides()` and `betalactams()` dynamically select columns for antibiotics in these classes.
 | 
						||
- `drop_na()` ensures the model receives complete cases for training.
 | 
						||
 | 
						||
### **Defining the Workflow**
 | 
						||
 | 
						||
We now define the `tidymodels` workflow, which consists of three steps: preprocessing, model specification, and fitting.
 | 
						||
 | 
						||
#### 1. Preprocessing with a Recipe
 | 
						||
 | 
						||
We create a recipe to preprocess the data for modelling.
 | 
						||
 | 
						||
```{r}
 | 
						||
# Define the recipe for data preprocessing
 | 
						||
resistance_recipe <- recipe(mo ~ ., data = data) %>%
 | 
						||
  step_corr(c(aminoglycosides(), betalactams()), threshold = 0.9)
 | 
						||
resistance_recipe
 | 
						||
```
 | 
						||
 | 
						||
**Explanation:**
 | 
						||
 | 
						||
- `recipe(mo ~ ., data = data)` will take the `mo` column as outcome and all other columns as predictors.
 | 
						||
- `step_corr()` removes predictors (i.e., antibiotic columns) that have a higher correlation than 90%.
 | 
						||
 | 
						||
Notice how the recipe contains just the antibiotic selector functions - no need to define the columns specifically.
 | 
						||
 | 
						||
#### 2. Specifying the Model
 | 
						||
 | 
						||
We define a logistic regression model since resistance prediction is a binary classification task.
 | 
						||
 | 
						||
```{r}
 | 
						||
# Specify a logistic regression model
 | 
						||
logistic_model <- logistic_reg() %>%
 | 
						||
  set_engine("glm") # Use the Generalized Linear Model engine
 | 
						||
logistic_model
 | 
						||
```
 | 
						||
 | 
						||
**Explanation:**
 | 
						||
 | 
						||
- `logistic_reg()` sets up a logistic regression model.
 | 
						||
- `set_engine("glm")` specifies the use of R's built-in GLM engine.
 | 
						||
 | 
						||
#### 3. Building the Workflow
 | 
						||
 | 
						||
We bundle the recipe and model together into a `workflow`, which organizes the entire modeling process.
 | 
						||
 | 
						||
```{r}
 | 
						||
# Combine the recipe and model into a workflow
 | 
						||
resistance_workflow <- workflow() %>%
 | 
						||
  add_recipe(resistance_recipe) %>% # Add the preprocessing recipe
 | 
						||
  add_model(logistic_model) # Add the logistic regression model
 | 
						||
```
 | 
						||
 | 
						||
### **Training and Evaluating the Model**
 | 
						||
 | 
						||
To train the model, we split the data into training and testing sets. Then, we fit the workflow on the training set and evaluate its performance.
 | 
						||
 | 
						||
```{r}
 | 
						||
# Split data into training and testing sets
 | 
						||
set.seed(123) # For reproducibility
 | 
						||
data_split <- initial_split(data, prop = 0.8) # 80% training, 20% testing
 | 
						||
training_data <- training(data_split) # Training set
 | 
						||
testing_data <- testing(data_split)   # Testing set
 | 
						||
 | 
						||
# Fit the workflow to the training data
 | 
						||
fitted_workflow <- resistance_workflow %>%
 | 
						||
  fit(training_data) # Train the model
 | 
						||
```
 | 
						||
 | 
						||
**Explanation:**
 | 
						||
 | 
						||
- `initial_split()` splits the data into training and testing sets.
 | 
						||
- `fit()` trains the workflow on the training set.
 | 
						||
 | 
						||
Notice how in `fit()`, the antibiotic selector functions are internally called again. For training, these functions are called since they are stored in the recipe.
 | 
						||
 | 
						||
Next, we evaluate the model on the testing data.
 | 
						||
 | 
						||
```{r}
 | 
						||
# Make predictions on the testing set
 | 
						||
predictions <- fitted_workflow %>%
 | 
						||
  predict(testing_data)                # Generate predictions
 | 
						||
probabilities <- fitted_workflow %>%
 | 
						||
  predict(testing_data, type = "prob") # Generate probabilities
 | 
						||
 | 
						||
predictions <- predictions %>%
 | 
						||
  bind_cols(probabilities) %>%
 | 
						||
  bind_cols(testing_data) # Combine with true labels
 | 
						||
 | 
						||
predictions
 | 
						||
 | 
						||
# Evaluate model performance
 | 
						||
metrics <- predictions %>%
 | 
						||
  metrics(truth = mo, estimate = .pred_class) # Calculate performance metrics
 | 
						||
 | 
						||
metrics
 | 
						||
```
 | 
						||
 | 
						||
**Explanation:**
 | 
						||
 | 
						||
- `predict()` generates predictions on the testing set.
 | 
						||
- `metrics()` computes evaluation metrics like accuracy and kappa.
 | 
						||
 | 
						||
It appears we can predict the Gram based on AMR results with a `r round(metrics$.estimate[1], 3)` accuracy based on AMR results of aminoglycosides and beta-lactam antibiotics. The ROC curve looks like this:
 | 
						||
 | 
						||
```{r}
 | 
						||
predictions %>%
 | 
						||
  roc_curve(mo, `.pred_Gram-negative`) %>%
 | 
						||
  autoplot()
 | 
						||
```
 | 
						||
 | 
						||
### **Conclusion**
 | 
						||
 | 
						||
In this post, we demonstrated how to build a machine learning pipeline with the `tidymodels` framework and the `AMR` package. By combining selector functions like `aminoglycosides()` and `betalactams()` with `tidymodels`, we efficiently prepared data, trained a model, and evaluated its performance.
 | 
						||
 | 
						||
This workflow is extensible to other antibiotic classes and resistance patterns, empowering users to analyse AMR data systematically and reproducibly.
 |