Produces a ggplot2 variant of a biplot for PCA (principal component analysis) that is more flexible and more visually appealing than the base R biplot() function.

ggplot_pca(
  x,
  choices = 1:2,
  scale = TRUE,
  pc.biplot = TRUE,
  labels = NULL,
  labels_textsize = 3,
  labels_text_placement = 1.5,
  groups = NULL,
  ellipse = TRUE,
  ellipse_prob = 0.68,
  ellipse_size = 0.5,
  ellipse_alpha = 0.5,
  points_size = 2,
  points_alpha = 0.25,
  arrows = TRUE,
  arrows_colour = "darkblue",
  arrows_size = 0.5,
  arrows_textsize = 3,
  arrows_alpha = 0.75,
  base_textsize = 10,
  ...
)

Arguments

x

an object returned by pca(), prcomp() or princomp()

choices

length 2 vector specifying the components to plot. Only the default is a biplot in the strict sense.

scale

The variables are scaled by lambda ^ scale and the observations are scaled by lambda ^ (1-scale) where lambda are the singular values as computed by princomp. Normally 0 <= scale <= 1, and a warning will be issued if the specified scale is outside this range.

pc.biplot

If true, use what Gabriel (1971) refers to as a "principal component biplot", with lambda = 1 and observations scaled up by sqrt(n) and variables scaled down by sqrt(n). Then inner products between variables approximate covariances and distances between observations approximate Mahalanobis distance.

labels

an optional vector of labels for the observations. If set, the labels will be placed below their respective points. When using the pca() function as input for x, this will be determined automatically based on the attribute non_numeric_cols, see pca().

labels_textsize

the size of the text used for the labels

labels_text_placement

adjustment factor for the placement of the variable names (>=1 means further away from the arrow head)

groups

an optional vector of groups for the labels, with the same length as labels. If set, the points and labels will be coloured according to these groups. When using the pca() function as input for x, this will be determined automatically based on the attribute non_numeric_cols, see pca().

ellipse

a logical to indicate whether a normal data ellipse should be drawn for each group (set with groups)

ellipse_prob

statistical size of the ellipse in normal probability

ellipse_size

the size of the ellipse line

ellipse_alpha

the alpha (transparency) of the ellipse line

points_size

the size of the points

points_alpha

the alpha (transparency) of the points

arrows

a logical to indicate whether arrows should be drawn

arrows_colour

the colour of the arrows and their text

arrows_size

the size (thickness) of the arrow lines

arrows_textsize

the size of the text at the end of the arrows

arrows_alpha

the alpha (transparency) of the arrows and their text

base_textsize

the text size for all plot elements except the labels and arrows

...

Parameters passed on to functions
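
The scaling described for the scale and pc.biplot arguments can be illustrated with base R alone. The sketch below follows the approach of the base R biplot() method for prcomp objects; it is not taken from the ggplot_pca() source, whose internals may differ. The helper name scale_scores_and_loadings() and the use of the built-in iris data are assumptions made purely for illustration.

```r
# Hypothetical helper (not part of AMR or stats): applies the lambda scaling
# in the way base R's biplot() method for prcomp objects does.
scale_scores_and_loadings <- function(x, choices = 1:2, scale = 1,
                                      pc.biplot = FALSE) {
  scores <- x$x
  n <- nrow(scores)
  # lambda for the chosen components
  lam <- x$sdev[choices] * sqrt(n)
  if (scale < 0 || scale > 1) warning("'scale' is outside [0, 1]")
  # scale = 0 leaves scores and loadings untouched
  lam <- if (scale != 0) lam^scale else 1
  # "principal component biplot": remove the sqrt(n) factor again
  if (pc.biplot) lam <- lam / sqrt(n)
  list(
    observations = t(t(scores[, choices]) / lam),
    variables    = t(t(x$rotation[, choices]) * lam)
  )
}

# Illustration on the built-in iris data (an assumption, not from this page)
pca_model <- prcomp(iris[, 1:4], scale. = TRUE)
scaled <- scale_scores_and_loadings(pca_model, scale = 1, pc.biplot = TRUE)
head(scaled$observations)
scaled$variables
```

With scale = 1 and pc.biplot = TRUE, each plotted component of the observations has unit standard deviation, which is why distances between observations then approximate Mahalanobis distances.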

Source

The ggplot_pca() function is based on the ggbiplot() function from the ggbiplot package by Vince Vu, as found on GitHub: https://github.com/vqv/ggbiplot (retrieved: 2 March 2020, their latest commit: 7325e88; 12 February 2015).

As per their GPL-2 licence that demands documentation of code changes, the changes made based on the source code were:

  1. Rewritten code to remove the dependency on packages plyr, scales and grid

  2. Parametrised more options, like arrow and ellipse settings

  3. Added total amount of explained variance as a caption in the plot

  4. Cleaned all syntax based on the lintr package and added integrity checks

  5. Updated documentation

Details

The colours for labels and points can be changed by adding another scale layer for colour, like scale_colour_viridis_d() or scale_colour_brewer().
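
As a minimal sketch of adding such a scale layer, the code below rebuilds the PCA model from the Examples section and appends scale_colour_viridis_d(). It assumes the AMR version documented on this page (which still provides is.rsi()) and that the dplyr and ggplot2 packages are installed; the object name p is chosen here for illustration only.

```r
library(AMR)
library(dplyr)
library(ggplot2)

# Same PCA model as in the Examples section of this page
pca_model <- example_isolates %>%
  filter(mo_genus(mo) == "Staphylococcus") %>%
  group_by(species = mo_shortname(mo)) %>%
  summarise_if(is.rsi, resistance) %>%
  pca(FLC, AMC, CXM, GEN, TOB, TMP, SXT, CIP, TEC, TCY, ERY)

# Adding a colour scale layer changes the colours used for points and labels
p <- ggplot_pca(pca_model) +
  scale_colour_viridis_d()
p
```

Replacing scale_colour_viridis_d() with scale_colour_brewer() works the same way, since both are ordinary ggplot2 colour scales for discrete data.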

Maturing lifecycle


The lifecycle of this function is maturing. The underlying code of a maturing function has been roughed out, but finer details might still change. Since this function needs wider usage and more extensive testing, you are very welcome to suggest changes at our repository or write us an email (see section 'Contact Us').

Examples

# `example_isolates` is a dataset available in the AMR package.
# See ?example_isolates.

if (FALSE) {
# See ?pca for more info about Principal Component Analysis (PCA).
library(AMR)   # provides example_isolates, mo_genus(), pca() and ggplot_pca()
library(dplyr)
pca_model <- example_isolates %>%
  filter(mo_genus(mo) == "Staphylococcus") %>%
  group_by(species = mo_shortname(mo)) %>%
  summarise_if(is.rsi, resistance) %>%
  pca(FLC, AMC, CXM, GEN, TOB, TMP, SXT, CIP, TEC, TCY, ERY)

# old
biplot(pca_model)

# new
ggplot_pca(pca_model)
}