--- title: "AMR with tidymodels" output: rmarkdown::html_vignette: toc: true toc_depth: 3 vignette: > %\VignetteIndexEntry{AMR with tidymodels} %\VignetteEncoding{UTF-8} %\VignetteEngine{knitr::rmarkdown} editor_options: chunk_output_type: console --- ```{r setup, include = FALSE, results = 'markup'} knitr::opts_chunk$set( warning = FALSE, collapse = TRUE, comment = "#>", fig.width = 7.5, fig.height = 5 ) ``` > This page was entirely written by our [AMR for R Assistant](https://chatgpt.com/g/g-M4UNLwFi5-amr-for-r-assistant), a ChatGPT manually-trained model able to answer any question about the AMR package. Antimicrobial resistance (AMR) is a global health crisis, and understanding resistance patterns is crucial for managing effective treatments. The `AMR` R package provides robust tools for analysing AMR data, including convenient antibiotic selector functions like `aminoglycosides()` and `betalactams()`. In this post, we will explore how to use the `tidymodels` framework to predict resistance patterns in the `example_isolates` dataset. By leveraging the power of `tidymodels` and the `AMR` package, we’ll build a reproducible machine learning workflow to predict the Gramstain of the microorganism to two important antibiotic classes: aminoglycosides and beta-lactams. ### **Objective** Our goal is to build a predictive model using the `tidymodels` framework to determine the Gramstain of the microorganism based on microbial data. We will: 1. Preprocess data using the selector functions `aminoglycosides()` and `betalactams()`. 2. Define a logistic regression model for prediction. 3. Use a structured `tidymodels` workflow to preprocess, train, and evaluate the model. ### **Data Preparation** We begin by loading the required libraries and preparing the `example_isolates` dataset from the `AMR` package. ```{r} # Load required libraries library(tidymodels) # For machine learning workflows, and data manipulation (dplyr, tidyr, ...) library(AMR) # For AMR data analysis # Load the example_isolates dataset data("example_isolates") # Preloaded dataset with AMR results # Select relevant columns for prediction data <- example_isolates %>% # select AB results dynamically select(mo, aminoglycosides(), betalactams()) %>% # replace NAs with NI (not-interpretable) mutate(across(where(is.sir), ~replace_na(.x, "NI")), # make factors of SIR columns across(where(is.sir), as.integer), # get Gramstain of microorganisms mo = as.factor(mo_gramstain(mo))) %>% # drop NAs - the ones without a Gramstain (fungi, etc.) drop_na() ``` **Explanation:** - `aminoglycosides()` and `betalactams()` dynamically select columns for antibiotics in these classes. - `drop_na()` ensures the model receives complete cases for training. ### **Defining the Workflow** We now define the `tidymodels` workflow, which consists of three steps: preprocessing, model specification, and fitting. #### 1. Preprocessing with a Recipe We create a recipe to preprocess the data for modelling. ```{r} # Define the recipe for data preprocessing resistance_recipe <- recipe(mo ~ ., data = data) %>% step_corr(c(aminoglycosides(), betalactams()), threshold = 0.9) resistance_recipe ``` **Explanation:** - `recipe(mo ~ ., data = data)` will take the `mo` column as outcome and all other columns as predictors. - `step_corr()` removes predictors (i.e., antibiotic columns) that have a higher correlation than 90%. Notice how the recipe contains just the antibiotic selector functions - no need to define the columns specifically. #### 2. Specifying the Model We define a logistic regression model since resistance prediction is a binary classification task. ```{r} # Specify a logistic regression model logistic_model <- logistic_reg() %>% set_engine("glm") # Use the Generalized Linear Model engine logistic_model ``` **Explanation:** - `logistic_reg()` sets up a logistic regression model. - `set_engine("glm")` specifies the use of R's built-in GLM engine. #### 3. Building the Workflow We bundle the recipe and model together into a `workflow`, which organizes the entire modeling process. ```{r} # Combine the recipe and model into a workflow resistance_workflow <- workflow() %>% add_recipe(resistance_recipe) %>% # Add the preprocessing recipe add_model(logistic_model) # Add the logistic regression model ``` ### **Training and Evaluating the Model** To train the model, we split the data into training and testing sets. Then, we fit the workflow on the training set and evaluate its performance. ```{r} # Split data into training and testing sets set.seed(123) # For reproducibility data_split <- initial_split(data, prop = 0.8) # 80% training, 20% testing training_data <- training(data_split) # Training set testing_data <- testing(data_split) # Testing set # Fit the workflow to the training data fitted_workflow <- resistance_workflow %>% fit(training_data) # Train the model ``` **Explanation:** - `initial_split()` splits the data into training and testing sets. - `fit()` trains the workflow on the training set. Notice how in `fit()`, the antibiotic selector functions are internally called again. For training, these functions are called since they are stored in the recipe. Next, we evaluate the model on the testing data. ```{r} # Make predictions on the testing set predictions <- fitted_workflow %>% predict(testing_data) # Generate predictions probabilities <- fitted_workflow %>% predict(testing_data, type = "prob") # Generate probabilities predictions <- predictions %>% bind_cols(probabilities) %>% bind_cols(testing_data) # Combine with true labels predictions # Evaluate model performance metrics <- predictions %>% metrics(truth = mo, estimate = .pred_class) # Calculate performance metrics metrics ``` **Explanation:** - `predict()` generates predictions on the testing set. - `metrics()` computes evaluation metrics like accuracy and kappa. It appears we can predict the Gram based on AMR results with a `r round(metrics$.estimate[1], 3)` accuracy based on AMR results of aminoglycosides and beta-lactam antibiotics. The ROC curve looks like this: ```{r} predictions %>% roc_curve(mo, `.pred_Gram-negative`) %>% autoplot() ``` ### **Conclusion** In this post, we demonstrated how to build a machine learning pipeline with the `tidymodels` framework and the `AMR` package. By combining selector functions like `aminoglycosides()` and `betalactams()` with `tidymodels`, we efficiently prepared data, trained a model, and evaluated its performance. This workflow is extensible to other antibiotic classes and resistance patterns, empowering users to analyse AMR data systematically and reproducibly.