# Conduct principal component analysis (PCA) for AMR **NOTE: This page will be updated soon, as the pca() function is currently being developed.** ## Introduction ## Transforming For PCA, we need to transform our AMR data first. This is what the `example_isolates` data set in this package looks like: ``` r library(AMR) library(dplyr) glimpse(example_isolates) #> Rows: 2,000 #> Columns: 46 #> $ date 2002-01-02, 2002-01-03, 2002-01-07, 2002-01-07, 2002-01-13, 2… #> $ patient "A77334", "A77334", "067927", "067927", "067927", "067927", "4… #> $ age 65, 65, 45, 45, 45, 45, 78, 78, 45, 79, 67, 67, 71, 71, 75, 50… #> $ gender "F", "F", "F", "F", "F", "F", "M", "M", "F", "F", "M", "M", "M… #> $ ward "Clinical", "Clinical", "ICU", "ICU", "ICU", "ICU", "Clinical"… #> $ mo "B_ESCHR_COLI", "B_ESCHR_COLI", "B_STPHY_EPDR", "B_STPHY_EPDR",… #> $ PEN R, R, R, R, R, R, R, R, R, R, R, R, R, R, R, R, R, R, R, R, S,… #> $ OXA NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA… #> $ FLC NA, NA, R, R, R, R, S, S, R, S, S, S, NA, NA, NA, NA, NA, R, R… #> $ AMX NA, NA, NA, NA, NA, NA, R, R, NA, NA, NA, NA, NA, NA, R, NA, N… #> $ AMC I, I, NA, NA, NA, NA, S, S, NA, NA, S, S, I, I, R, I, I, NA, N… #> $ AMP NA, NA, NA, NA, NA, NA, R, R, NA, NA, NA, NA, NA, NA, R, NA, N… #> $ TZP NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA… #> $ CZO NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, R, NA,… #> $ FEP NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA… #> $ CXM I, I, R, R, R, R, S, S, R, S, S, S, S, S, NA, S, S, R, R, S, S… #> $ FOX NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, R, NA,… #> $ CTX NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, S, S, NA, S, S… #> $ CAZ NA, NA, R, R, R, R, R, R, R, R, R, R, NA, NA, NA, S, S, R, R, … #> $ CRO NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, S, S, NA, S, S… #> $ GEN NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA… #> $ TOB NA, NA, NA, NA, NA, NA, S, S, NA, NA, NA, NA, S, S, NA, NA, NA… #> $ AMK NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA… #> $ KAN NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA… #> $ TMP R, R, S, S, R, R, R, R, S, S, NA, NA, S, S, S, S, S, R, R, R, … #> $ SXT R, R, S, S, NA, NA, NA, NA, S, S, NA, NA, S, S, S, S, S, NA, N… #> $ NIT NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, R,… #> $ FOS NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA… #> $ LNZ R, R, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, R, R, R, R, R, N… #> $ CIP NA, NA, NA, NA, NA, NA, NA, NA, S, S, NA, NA, NA, NA, NA, S, S… #> $ MFX NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA… #> $ VAN R, R, S, S, S, S, S, S, S, S, NA, NA, R, R, R, R, R, S, S, S, … #> $ TEC R, R, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, R, R, R, R, R, N… #> $ TCY R, R, S, S, S, S, S, S, S, I, S, S, NA, NA, I, R, R, S, I, R, … #> $ TGC NA, NA, S, S, S, S, S, S, S, NA, S, S, NA, NA, NA, R, R, S, NA… #> $ DOX NA, NA, S, S, S, S, S, S, S, NA, S, S, NA, NA, NA, R, R, S, NA… #> $ ERY R, R, R, R, R, R, S, S, R, S, S, S, R, R, R, R, R, R, R, R, S,… #> $ CLI R, R, NA, NA, NA, R, NA, NA, NA, NA, NA, NA, R, R, R, R, R, NA… #> $ AZM R, R, R, R, R, R, S, S, R, S, S, S, R, R, R, R, R, R, R, R, S,… #> $ IPM NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, S, S, NA, S, S… #> $ MEM NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA… #> $ MTR NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA… #> $ CHL NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA… #> $ COL NA, NA, R, R, R, R, R, R, R, R, R, R, NA, NA, NA, R, R, R, R, … #> $ MUP NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA… #> $ RIF R, R, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, R, R, R, R, R, N… ``` Now to transform this to a data set with only resistance percentages per taxonomic order and genus: ``` r resistance_data <- example_isolates %>% group_by( order = mo_order(mo), # group on anything, like order genus = mo_genus(mo) ) %>% # and genus as we do here summarise_if(is.sir, resistance) %>% # then get resistance of all drugs select( order, genus, AMC, CXM, CTX, CAZ, GEN, TOB, TMP, SXT ) # and select only relevant columns head(resistance_data) #> # A tibble: 6 × 10 #> # Groups: order [5] #> order genus AMC CXM CTX CAZ GEN TOB TMP SXT #> #> 1 (unknown order) (unknown ge… NA NA NA NA NA NA NA NA #> 2 Actinomycetales Schaalia NA NA NA NA NA NA NA NA #> 3 Bacteroidales Bacteroides NA NA NA NA NA NA NA NA #> 4 Campylobacterales Campylobact… NA NA NA NA NA NA NA NA #> 5 Caryophanales Gemella NA NA NA NA NA NA NA NA #> 6 Caryophanales Listeria NA NA NA NA NA NA NA NA ``` ## Perform principal component analysis The new [`pca()`](https://amr-for-r.org/reference/pca.md) function will automatically filter on rows that contain numeric values in all selected variables, so we now only need to do: ``` r pca_result <- pca(resistance_data) #> ℹ Columns selected for PCA: "AMC", "CAZ", "CTX", "CXM", "GEN", "SXT", #> "TMP", and "TOB". Total observations available: 7. ``` The result can be reviewed with the good old [`summary()`](https://rdrr.io/r/base/summary.html) function: ``` r summary(pca_result) #> Groups (n=4, named as 'order'): #> [1] "Caryophanales" "Enterobacterales" "Lactobacillales" "Pseudomonadales" #> Importance of components: #> PC1 PC2 PC3 PC4 PC5 PC6 PC7 #> Standard deviation 2.1539 1.6807 0.6138 0.33879 0.20808 0.03140 1.232e-16 #> Proportion of Variance 0.5799 0.3531 0.0471 0.01435 0.00541 0.00012 0.000e+00 #> Cumulative Proportion 0.5799 0.9330 0.9801 0.99446 0.99988 1.00000 1.000e+00 ``` #> Groups (n=4, named as 'order'): #> [1] "Caryophanales" "Enterobacterales" "Lactobacillales" "Pseudomonadales" Good news. The first two components explain a total of 93.3% of the variance (see the PC1 and PC2 values of the *Proportion of Variance*. We can create a so-called biplot with the base R [`biplot()`](https://rdrr.io/r/stats/biplot.html) function, to see which antimicrobial resistance per drug explain the difference per microorganism. ## Plotting the results ``` r biplot(pca_result) ``` ![](PCA_files/figure-html/unnamed-chunk-5-1.png) But we can’t see the explanation of the points. Perhaps this works better with our new [`ggplot_pca()`](https://amr-for-r.org/reference/ggplot_pca.md) function, that automatically adds the right labels and even groups: ``` r ggplot_pca(pca_result) ``` ![](PCA_files/figure-html/unnamed-chunk-6-1.png) You can also print an ellipse per group, and edit the appearance: ``` r ggplot_pca(pca_result, ellipse = TRUE) + ggplot2::labs(title = "An AMR/PCA biplot!") ``` ![](PCA_files/figure-html/unnamed-chunk-7-1.png)