improve top_n_microorganisms(): add property_for_each, fix property=NULL, enforce rank order (#297)

2026-06-29 17:36:20 +02:00 · 2026-06-26 21:40:11 +02:00
parent 02bd9a71c1
commit f7d353361c
15 changed files with 143 additions and 95 deletions
--- a/man/AMR.Rd
+++ b/man/AMR.Rd
@@ -12,7 +12,7 @@ The \code{AMR} package is a peer-reviewed, \href{https://amr-for-r.org/#copyrigh

 This work was published in the Journal of Statistical Software (Volume 104(3); \doi{10.18637/jss.v104.i03}) and formed the basis of two PhD theses (\doi{10.33612/diss.177417131} and \doi{10.33612/diss.192486375}).

-After installing this package, R knows \href{https://amr-for-r.org/reference/microorganisms.html}{\strong{~97 000 distinct microbial species}} (updated May 2026) and all \href{https://amr-for-r.org/reference/antimicrobials.html}{\strong{~620 antimicrobial and antiviral drugs}} by name and code (including ATC, EARS-Net, ASIARS-Net, PubChem, LOINC and SNOMED CT), and knows all about valid SIR and MIC values. The integral clinical breakpoint guidelines from CLSI 2011-2026 and EUCAST 2011-2026 are included, even with epidemiological cut-off (ECOFF) values. It supports and can read any data format, including WHONET data. This package works on Windows, macOS and Linux with all versions of R since R-3.0 (April 2013). \strong{It was designed to work in any setting, including those with very limited resources}. It was created for both routine data analysis and academic research at the Faculty of Medical Sciences of the \href{https://www.rug.nl}{University of Groningen} and the \href{https://www.umcg.nl}{University Medical Center Groningen}.
+After installing this package, R knows \href{https://amr-for-r.org/reference/microorganisms.html}{\strong{~97 000 distinct microbial species}} (updated mei 2026) and all \href{https://amr-for-r.org/reference/antimicrobials.html}{\strong{~620 antimicrobial and antiviral drugs}} by name and code (including ATC, EARS-Net, ASIARS-Net, PubChem, LOINC and SNOMED CT), and knows all about valid SIR and MIC values. The integral clinical breakpoint guidelines from CLSI 2011-2026 and EUCAST 2011-2026 are included, even with epidemiological cut-off (ECOFF) values. It supports and can read any data format, including WHONET data. This package works on Windows, macOS and Linux with all versions of R since R-3.0 (April 2013). \strong{It was designed to work in any setting, including those with very limited resources}. It was created for both routine data analysis and academic research at the Faculty of Medical Sciences of the \href{https://www.rug.nl}{University of Groningen} and the \href{https://www.umcg.nl}{University Medical Center Groningen}.

 The \code{AMR} package is available in English, Arabic, Bengali, Chinese, Czech, Danish, Dutch, Finnish, French, German, Greek, Hindi, Indonesian, Italian, Japanese, Korean, Norwegian, Polish, Portuguese, Romanian, Russian, Spanish, Swahili, Swedish, Turkish, Ukrainian, Urdu, and Vietnamese. Antimicrobial drug (group) names and colloquial microorganism names are provided in these languages.
 }
--- a/man/g.test.Rd
+++ b/man/g.test.Rd
@@ -46,7 +46,7 @@ A list with class \code{"htest"} containing the following
    \code{(observed - expected) / sqrt(expected)}.}
  \item{stdres}{standardized residuals,
    \code{(observed - expected) / sqrt(V)}, where \code{V} is the
-    residual cell variance (Agresti, 2007, section 2.4.5
+    residual cell variance {(\if{html}{\out{<a href="#reference+chisq.test.Rd+R+3AAgresti+3A2007" class="citation">}}Agresti 2007\if{html}{\out{</a>}}, section 2.4.5)}
    for the case where \code{x} is a matrix, \code{n * p * (1 - p)} otherwise).}
 }
 \description{
--- a/man/ggplot_pca.Rd
+++ b/man/ggplot_pca.Rd
@@ -59,8 +59,9 @@ ggplot_pca(
  }

 \item{pc.biplot}{
-    If true, use what Gabriel (1971) refers to as a "principal component
-    biplot", with \code{lambda = 1} and observations scaled up by sqrt(n) and
+    If true, use what {\if{html}{\cite{}\out{<a href="#reference+biplot.princomp.Rd+R+3AGabriel+3A1971" class="citation">}}Gabriel (1971)\if{html}{\out{</a>}}} refers to as a
+    \dQuote{principal component biplot},
+    with \code{lambda = 1} and observations scaled up by sqrt(n) and
    variables scaled down by sqrt(n).  Then inner products between
    variables approximate covariances and distances between observations
    approximate Mahalanobis distance.
--- a/man/top_n_microorganisms.Rd
+++ b/man/top_n_microorganisms.Rd
@@ -9,6 +9,7 @@ top_n_microorganisms(
  n,
  property = "species",
  n_for_each = NULL,
+  property_for_each = "species",
  col_mo = NULL,
  ...
 )
@@ -16,37 +17,40 @@ top_n_microorganisms(
 \arguments{
 \item{x}{A data frame containing microbial data.}

-\item{n}{An integer specifying the maximum number of unique values of the \code{property} to include in the output.}
+\item{n}{A positive whole number specifying the maximum number of unique values of \code{property} to include in the output.}

-\item{property}{A character string indicating the microorganism property to use for filtering. Must be one of the column names of the \link{microorganisms} data set: \code{"mo"}, \code{"fullname"}, \code{"status"}, \code{"domain"}, \code{"kingdom"}, \code{"phylum"}, \code{"class"}, \code{"order"}, \code{"family"}, \code{"genus"}, \code{"species"}, \code{"subspecies"}, \code{"rank"}, \code{"ref"}, \code{"oxygen_tolerance"}, \code{"morphology"}, \code{"source"}, \code{"lpsn"}, \code{"lpsn_parent"}, \code{"lpsn_renamed_to"}, \code{"mycobank"}, \code{"mycobank_parent"}, \code{"mycobank_renamed_to"}, \code{"gbif"}, \code{"gbif_parent"}, \code{"gbif_renamed_to"}, \code{"prevalence"}, or \code{"snomed"}. If \code{NULL}, the raw values from \code{col_mo} will be used without transformation. When using \code{"species"} (default) or \code{"subpecies"}, the genus will be added to make sure each (sub)species still belongs to the right genus.}
+\item{property}{A character string indicating the microorganism property to use for filtering. Must be one of the column names of the \link{microorganisms} data set: \code{"mo"}, \code{"fullname"}, \code{"status"}, \code{"domain"}, \code{"kingdom"}, \code{"phylum"}, \code{"class"}, \code{"order"}, \code{"family"}, \code{"genus"}, \code{"species"}, \code{"subspecies"}, \code{"rank"}, \code{"ref"}, \code{"oxygen_tolerance"}, \code{"morphology"}, \code{"source"}, \code{"lpsn"}, \code{"lpsn_parent"}, \code{"lpsn_renamed_to"}, \code{"mycobank"}, \code{"mycobank_parent"}, \code{"mycobank_renamed_to"}, \code{"gbif"}, \code{"gbif_parent"}, \code{"gbif_renamed_to"}, \code{"prevalence"}, or \code{"snomed"}. If \code{NULL}, the raw values from \code{col_mo} will be used without transformation. When using \code{"species"} (default) or \code{"subspecies"}, the genus is prepended to ensure each name is unambiguous.}

-\item{n_for_each}{An optional integer specifying the maximum number of rows to retain for each value of the selected property. If \code{NULL}, all rows within the top \emph{n} groups will be included.}
+\item{n_for_each}{An optional positive whole number specifying the maximum number of distinct microorganism groups at the level of \code{property_for_each} to retain within each of the top \emph{n} groups. Only used when \code{property_for_each} is also set.}
+
+\item{property_for_each}{The microorganism property to use for sub-grouping within each top \emph{n} group. Must be one of the column names of the \link{microorganisms} data set and at a strictly lower taxonomic rank than \code{property} (allowed order: domain > kingdom > phylum > class > order > family > genus > species > subspecies). Defaults to \code{"species"}. Only relevant when \code{n_for_each} is set.}

 \item{col_mo}{A character string indicating the column in \code{x} that contains microorganism names or codes. Defaults to the first column of class \code{\link{mo}}. Values will be coerced using \code{\link[=as.mo]{as.mo()}}.}

 \item{...}{Additional arguments passed on to \code{\link[=mo_property]{mo_property()}} when \code{property} is not \code{NULL}.}
 }
 \description{
-This function filters a data set to include only the top \emph{n} microorganisms based on a specified property, such as taxonomic family or genus. For example, it can filter a data set to the top 3 species, or to any species in the top 5 genera, or to the top 3 species in each of the top 5 genera.
+Filters a data set to include only the top \emph{n} microorganisms based on a specified property, such as taxonomic family or genus. For example, it can filter a data set to the top 3 species, to any species in the top 5 genera, or to the top 3 species in each of the top 5 genera.
 }
 \details{
-This function is useful for preprocessing data before creating \link[=antibiogram]{antibiograms} or other analyses that require focused subsets of microbial data. For example, it can filter a data set to only include isolates from the top 10 species.
+This function is useful for preprocessing data before creating \link[=antibiogram]{antibiograms} or other analyses that require focused subsets of microbial data.
 }
 \examples{
 # filter to the top 3 species:
-top_n_microorganisms(example_isolates,
-  n = 3
-)
+top_n_microorganisms(example_isolates, n = 3)

 # filter to any species in the top 5 genera:
-top_n_microorganisms(example_isolates,
-  n = 5, property = "genus"
-)
+top_n_microorganisms(example_isolates, n = 5, property = "genus")

 # filter to the top 3 species in each of the top 5 genera:
 top_n_microorganisms(example_isolates,
  n = 5, property = "genus", n_for_each = 3
 )
+
+# filter to the top 2 genera in each of the top 3 families:
+top_n_microorganisms(example_isolates,
+  n = 3, property = "family", n_for_each = 2, property_for_each = "genus"
+)
 }
 \seealso{
 \code{\link[=mo_property]{mo_property()}}, \code{\link[=as.mo]{as.mo()}}, \code{\link[=antibiogram]{antibiogram()}}