AMR/R/mo_source.R

315 lines
14 KiB
R
Raw Normal View History

2019-01-21 15:53:01 +01:00
# ==================================================================== #
# TITLE #
# Antimicrobial Resistance (AMR) Data Analysis for R #
2019-01-21 15:53:01 +01:00
# #
# SOURCE #
2020-07-08 14:48:06 +02:00
# https://github.com/msberends/AMR #
2019-01-21 15:53:01 +01:00
# #
# LICENCE #
2021-12-23 18:56:28 +01:00
# (c) 2018-2022 Berends MS, Luz CF et al. #
2020-10-08 11:16:03 +02:00
# Developed at the University of Groningen, the Netherlands, in #
# collaboration with non-profit organisations Certe Medical #
# Diagnostics & Advice, and University Medical Center Groningen. #
2019-01-21 15:53:01 +01:00
# #
# This R package is free software; you can freely use and distribute #
# it for both personal and commercial purposes under the terms of the #
# GNU General Public License version 2.0 (GNU GPL-2), as published by #
# the Free Software Foundation. #
# We created this package for both routine data analysis and academic #
# research and it was publicly released in the hope that it will be #
# useful, but it comes WITHOUT ANY WARRANTY OR LIABILITY. #
2020-10-08 11:16:03 +02:00
# #
# Visit our website for the full manual and a complete tutorial about #
# how to conduct AMR data analysis: https://msberends.github.io/AMR/ #
2019-01-21 15:53:01 +01:00
# ==================================================================== #
#' User-Defined Reference Data Set for Microorganisms
2019-01-21 15:53:01 +01:00
#'
#' @description These functions can be used to predefine your own reference to be used in [as.mo()] and consequently all [`mo_*`][mo_property()] functions (such as [mo_genus()] and [mo_gramstain()]).
2019-02-28 13:56:28 +01:00
#'
#' This is **the fastest way** to have your organisation (or analysis) specific codes picked up and translated by this package, since you don't have to bother about it again after setting it up once.
2022-05-09 20:36:44 +02:00
#' @param path location of your reference file, this can be any text file (comma-, tab- or pipe-separated) or an Excel file (see *Details*). Can also be `""`, `NULL` or `FALSE` to delete the reference file.
2020-12-17 16:22:25 +01:00
#' @param destination destination of the compressed data file, default to the user's home directory.
2019-01-21 15:53:01 +01:00
#' @rdname mo_source
#' @name mo_source
#' @aliases set_mo_source get_mo_source
2020-12-21 22:46:29 +01:00
#' @details The reference file can be a text file separated with commas (CSV) or tabs or pipes, an Excel file (either 'xls' or 'xlsx' format) or an \R object file (extension '.rds'). To use an Excel file, you will need to have the `readxl` package installed.
2019-01-21 15:53:01 +01:00
#'
2022-05-09 20:36:44 +02:00
#' [set_mo_source()] will check the file for validity: it must be a [data.frame], must have a column named `"mo"` which contains values from [`microorganisms$mo`][microorganisms] or [`microorganisms$fullname`][microorganisms] and must have a reference column with your own defined values. If all tests pass, [set_mo_source()] will read the file into \R and will ask to export it to `"~/mo_source.rds"`. The CRAN policy disallows packages to write to the file system, although '*exceptions may be allowed in interactive sessions if the package obtains confirmation from the user*'. For this reason, this function only works in interactive sessions so that the user can **specifically confirm and allow** that this file will be created. The destination of this file can be set with the `destination` argument and defaults to the user's home directory. It can also be set as an \R option, using `options(AMR_mo_source = "my/location/file.rds")`.
#'
2022-05-09 20:36:44 +02:00
#' The created compressed data file `"mo_source.rds"` will be used at default for MO determination (function [as.mo()] and consequently all `mo_*` functions like [mo_genus()] and [mo_gramstain()]). The location and timestamp of the original file will be saved as an [attribute][base::attributes()] to the compressed data file.
#'
2020-12-17 16:22:25 +01:00
#' The function [get_mo_source()] will return the data set by reading `"mo_source.rds"` with [readRDS()]. If the original file has changed (by checking the location and timestamp of the original file), it will call [set_mo_source()] to update the data file automatically if used in an interactive session.
2019-01-21 15:53:01 +01:00
#'
2020-05-25 01:01:14 +02:00
#' Reading an Excel file (`.xlsx`) with only one row has a size of 8-9 kB. The compressed file created with [set_mo_source()] will then have a size of 0.1 kB and can be read by [get_mo_source()] in only a couple of microseconds (millionths of a second).
#'
#' @section How to Setup:
2020-05-25 01:01:14 +02:00
#'
2022-05-09 20:36:44 +02:00
#' Imagine this data on a sheet of an Excel file. The first column contains the organisation specific codes, the second column contains valid taxonomic names:
#'
#' ```
2022-05-09 20:36:44 +02:00
#' | A | B |
#' --|--------------------|-----------------------|
#' 1 | Organisation XYZ | mo |
#' 2 | lab_mo_ecoli | Escherichia coli |
#' 3 | lab_mo_kpneumoniae | Klebsiella pneumoniae |
#' 4 | | |
#' ```
2019-01-21 15:53:01 +01:00
#'
#' We save it as `"home/me/ourcodes.xlsx"`. Now we have to set it as a source:
2020-05-25 01:01:14 +02:00
#'
#' ```
2019-02-27 11:36:12 +01:00
#' set_mo_source("home/me/ourcodes.xlsx")
2020-12-17 16:22:25 +01:00
#' #> NOTE: Created mo_source file '/Users/me/mo_source.rds' (0.3 kB) from
#' #> '/Users/me/Documents/ourcodes.xlsx' (9 kB), columns
#' #> "Organisation XYZ" and "mo"
#' ```
2019-01-21 15:53:01 +01:00
#'
2020-12-17 16:22:25 +01:00
#' It has now created a file `"~/mo_source.rds"` with the contents of our Excel file. Only the first column with foreign values and the 'mo' column will be kept when creating the RDS file.
2019-02-28 13:56:28 +01:00
#'
#' And now we can use it in our functions:
2020-05-25 01:01:14 +02:00
#'
#' ```
2019-01-21 15:53:01 +01:00
#' as.mo("lab_mo_ecoli")
2020-12-17 16:22:25 +01:00
#' #> Class <mo>
2020-05-25 01:01:14 +02:00
#' #> [1] B_ESCHR_COLI
2019-01-21 15:53:01 +01:00
#'
#' mo_genus("lab_mo_kpneumoniae")
2020-05-25 01:01:14 +02:00
#' #> [1] "Klebsiella"
2019-03-01 09:34:04 +01:00
#'
#' # other input values still work too
#' as.mo(c("Escherichia coli", "E. coli", "lab_mo_ecoli"))
2020-12-17 16:22:25 +01:00
#' #> NOTE: Translation to one microorganism was guessed with uncertainty.
#' #> Use mo_uncertainties() to review it.
#' #> Class <mo>
2020-05-25 01:01:14 +02:00
#' #> [1] B_ESCHR_COLI B_ESCHR_COLI B_ESCHR_COLI
#' ```
2019-01-21 15:53:01 +01:00
#'
2020-05-25 01:01:14 +02:00
#' If we edit the Excel file by, let's say, adding row 4 like this:
#'
#' ```
2022-05-09 20:36:44 +02:00
#' | A | B |
#' --|--------------------|-----------------------|
#' 1 | Organisation XYZ | mo |
#' 2 | lab_mo_ecoli | Escherichia coli |
#' 3 | lab_mo_kpneumoniae | Klebsiella pneumoniae |
#' 4 | lab_Staph_aureus | Staphylococcus aureus |
#' 5 | | |
#' ```
2019-01-21 15:53:01 +01:00
#'
2019-11-30 12:01:50 +01:00
#' ...any new usage of an MO function in this package will update your data file:
2020-05-25 01:01:14 +02:00
#'
#' ```
2019-02-28 13:56:28 +01:00
#' as.mo("lab_mo_ecoli")
2020-12-17 16:22:25 +01:00
#' #> NOTE: Updated mo_source file '/Users/me/mo_source.rds' (0.3 kB) from
#' #> '/Users/me/Documents/ourcodes.xlsx' (9 kB), columns
2020-12-17 16:22:25 +01:00
#' #> "Organisation XYZ" and "mo"
#' #> Class <mo>
2020-05-25 01:01:14 +02:00
#' #> [1] B_ESCHR_COLI
2019-02-28 13:56:28 +01:00
#'
#' mo_genus("lab_Staph_aureus")
2020-05-25 01:01:14 +02:00
#' #> [1] "Staphylococcus"
#' ```
2019-02-28 13:56:28 +01:00
#'
2020-05-25 01:01:14 +02:00
#' To delete the reference data file, just use `""`, `NULL` or `FALSE` as input for [set_mo_source()]:
#'
#' ```
2019-02-28 13:56:28 +01:00
#' set_mo_source(NULL)
2020-12-17 16:22:25 +01:00
#' #> Removed mo_source file '/Users/me/mo_source.rds'
#' ```
2020-05-25 01:01:14 +02:00
#'
#' If the original file (in the previous case an Excel file) is moved or deleted, the `mo_source.rds` file will be removed upon the next use of [as.mo()] or any [`mo_*`][mo_property()] function.
2019-02-28 13:56:28 +01:00
#' @export
2020-12-17 16:22:25 +01:00
set_mo_source <- function(path, destination = getOption("AMR_mo_source", "~/mo_source.rds")) {
meet_criteria(path, allow_class = "character", has_length = 1, allow_NULL = TRUE)
meet_criteria(destination, allow_class = "character", has_length = 1)
2020-12-22 00:51:17 +01:00
stop_ifnot(destination %like% "[.]rds$", "the `destination` must be a file location with file extension .rds.")
2020-07-13 09:17:24 +02:00
2020-12-17 16:22:25 +01:00
mo_source_destination <- path.expand(destination)
2020-07-13 09:17:24 +02:00
2020-12-22 00:51:17 +01:00
stop_ifnot(interactive(), "this function can only be used in interactive mode, since it must ask for the user's permission to write a file to their home folder.")
2020-05-25 01:01:14 +02:00
if (is.null(path) || path %in% c(FALSE, "")) {
2020-12-27 00:07:00 +01:00
pkg_env$mo_source <- NULL
2020-12-17 16:22:25 +01:00
if (file.exists(mo_source_destination)) {
unlink(mo_source_destination)
message_("Removed mo_source file '", font_bold(mo_source_destination), "'",
2020-10-27 15:56:51 +01:00
add_fn = font_red,
as_note = FALSE)
2019-01-21 15:53:01 +01:00
}
return(invisible())
}
2020-07-13 09:17:24 +02:00
stop_ifnot(file.exists(path), "file not found: ", path)
2020-07-13 09:17:24 +02:00
2022-05-09 20:36:44 +02:00
df <- NULL
2019-10-11 17:21:02 +02:00
if (path %like% "[.]rds$") {
2019-01-21 15:53:01 +01:00
df <- readRDS(path)
2020-07-13 09:17:24 +02:00
2019-10-11 17:21:02 +02:00
} else if (path %like% "[.]xlsx?$") {
2019-01-21 15:53:01 +01:00
# is Excel file (old or new)
2020-12-27 20:32:40 +01:00
stop_ifnot_installed("readxl")
df <- readxl::read_excel(path)
2020-07-13 09:17:24 +02:00
2019-10-11 17:21:02 +02:00
} else if (path %like% "[.]tsv$") {
2022-05-09 20:36:44 +02:00
df <- utils::read.table(file = path, header = TRUE, sep = "\t", stringsAsFactors = FALSE)
} else if (path %like% "[.]csv$") {
df <- utils::read.table(file = path, header = TRUE, sep = ",", stringsAsFactors = FALSE)
2020-07-13 09:17:24 +02:00
2019-01-21 15:53:01 +01:00
} else {
# try comma first
try(
2022-05-09 20:36:44 +02:00
df <- utils::read.table(file = path, header = TRUE, sep = ",", stringsAsFactors = FALSE),
2019-01-21 15:53:01 +01:00
silent = TRUE)
2020-12-22 00:51:17 +01:00
if (!check_validity_mo_source(df, stop_on_error = FALSE)) {
2019-02-27 11:36:12 +01:00
# try tab
try(
2022-05-09 20:36:44 +02:00
df <- utils::read.table(file = path, header = TRUE, sep = "\t", stringsAsFactors = FALSE),
2019-02-27 11:36:12 +01:00
silent = TRUE)
}
2020-12-22 00:51:17 +01:00
if (!check_validity_mo_source(df, stop_on_error = FALSE)) {
2019-01-21 15:53:01 +01:00
# try pipe
try(
2022-05-09 20:36:44 +02:00
df <- utils::read.table(file = path, header = TRUE, sep = "|", stringsAsFactors = FALSE),
2019-01-21 15:53:01 +01:00
silent = TRUE)
}
}
2020-07-13 09:17:24 +02:00
2020-05-16 13:05:47 +02:00
# check integrity
2022-05-09 20:36:44 +02:00
if (is.null(df)) {
stop_("the path '", path, "' could not be imported as a dataset.")
}
2020-12-22 00:51:17 +01:00
check_validity_mo_source(df)
2020-07-13 09:17:24 +02:00
2020-05-25 01:01:14 +02:00
df <- subset(df, !is.na(mo))
2020-07-13 09:17:24 +02:00
2019-03-01 09:34:04 +01:00
# keep only first two columns, second must be mo
2019-01-21 15:53:01 +01:00
if (colnames(df)[1] == "mo") {
2020-05-25 01:01:14 +02:00
df <- df[, c(colnames(df)[2], "mo")]
2019-03-01 09:34:04 +01:00
} else {
2020-05-25 01:01:14 +02:00
df <- df[, c(colnames(df)[1], "mo")]
2019-01-21 15:53:01 +01:00
}
2020-07-13 09:17:24 +02:00
2019-01-21 15:53:01 +01:00
df <- as.data.frame(df, stringAsFactors = FALSE)
2022-05-09 20:36:44 +02:00
df[, "mo"] <- as.mo(df[, "mo", drop = TRUE])
2020-07-13 09:17:24 +02:00
2019-01-21 15:53:01 +01:00
# success
2020-12-17 16:22:25 +01:00
if (file.exists(mo_source_destination)) {
2019-01-21 15:53:01 +01:00
action <- "Updated"
} else {
action <- "Created"
# only ask when file is created, not when it is updated
2020-12-17 16:22:25 +01:00
txt <- paste0(word_wrap(paste0("This will write create the new file '",
mo_source_destination,
"', for which your permission is needed.")),
"\n\n",
word_wrap("Do you agree that this file will be created?"))
showQuestion <- import_fn("showQuestion", "rstudioapi", error_on_fail = FALSE)
if (!is.null(showQuestion)) {
q_continue <- showQuestion("Create new file in home directory", txt)
} else {
q_continue <- utils::menu(choices = c("OK", "Cancel"), graphics = FALSE, title = txt)
}
if (q_continue %in% c(FALSE, 2)) {
return(invisible())
}
2019-01-21 15:53:01 +01:00
}
2020-12-17 16:22:25 +01:00
attr(df, "mo_source_location") <- path
2020-12-22 00:51:17 +01:00
attr(df, "mo_source_destination") <- mo_source_destination
2020-12-17 16:22:25 +01:00
attr(df, "mo_source_timestamp") <- file.mtime(path)
saveRDS(df, mo_source_destination)
2020-12-27 00:07:00 +01:00
pkg_env$mo_source <- df
2020-12-17 16:22:25 +01:00
message_(action, " mo_source file '", font_bold(mo_source_destination),
"' (", formatted_filesize(mo_source_destination),
") from '", font_bold(path),
"' (", formatted_filesize(path),
'), columns "', colnames(df)[1], '" and "', colnames(df)[2], '"')
2019-01-21 15:53:01 +01:00
}
#' @rdname mo_source
#' @export
2020-12-17 16:22:25 +01:00
get_mo_source <- function(destination = getOption("AMR_mo_source", "~/mo_source.rds")) {
if (!file.exists(path.expand(destination))) {
if (interactive()) {
2020-12-22 00:51:17 +01:00
# source file might have been deleted, so update reference
2020-12-17 16:22:25 +01:00
set_mo_source("")
}
2020-05-25 01:01:14 +02:00
return(NULL)
}
2020-12-27 00:07:00 +01:00
if (is.null(pkg_env$mo_source)) {
pkg_env$mo_source <- readRDS(path.expand(destination))
2019-01-21 15:53:01 +01:00
}
2020-05-25 01:01:14 +02:00
2020-12-27 00:07:00 +01:00
old_time <- attributes(pkg_env$mo_source)$mo_source_timestamp
new_time <- file.mtime(attributes(pkg_env$mo_source)$mo_source_location)
2020-12-17 16:22:25 +01:00
if (interactive() && !identical(old_time, new_time)) {
# source file was updated, also update reference
2020-12-27 00:07:00 +01:00
set_mo_source(attributes(pkg_env$mo_source)$mo_source_location)
2020-05-25 01:01:14 +02:00
}
2020-12-27 00:07:00 +01:00
pkg_env$mo_source
2019-01-21 15:53:01 +01:00
}
2019-03-05 22:47:42 +01:00
2020-12-22 00:51:17 +01:00
check_validity_mo_source <- function(x, refer_to_name = "`reference_df`", stop_on_error = TRUE) {
2020-02-14 19:54:13 +01:00
check_dataset_integrity()
2020-11-10 16:35:56 +01:00
if (paste(deparse(substitute(x)), collapse = "") == "get_mo_source()") {
2019-03-05 22:47:42 +01:00
return(TRUE)
}
2020-12-27 00:07:00 +01:00
if (is.null(pkg_env$mo_source) && (identical(x, get_mo_source()))) {
2019-03-05 22:47:42 +01:00
return(TRUE)
}
if (is.null(x)) {
2020-05-16 13:05:47 +02:00
if (stop_on_error == TRUE) {
2020-11-10 16:35:56 +01:00
stop_(refer_to_name, " cannot be NULL", call = FALSE)
2020-05-16 13:05:47 +02:00
} else {
return(FALSE)
}
2019-03-05 22:47:42 +01:00
}
if (!is.data.frame(x)) {
2020-05-16 13:05:47 +02:00
if (stop_on_error == TRUE) {
2020-11-10 16:35:56 +01:00
stop_(refer_to_name, " must be a data.frame", call = FALSE)
2020-05-16 13:05:47 +02:00
} else {
return(FALSE)
}
2019-03-05 22:47:42 +01:00
}
if (!"mo" %in% colnames(x)) {
2020-05-16 13:05:47 +02:00
if (stop_on_error == TRUE) {
2020-11-10 16:35:56 +01:00
stop_(refer_to_name, " must contain a column 'mo'", call = FALSE)
2020-05-16 13:05:47 +02:00
} else {
return(FALSE)
}
}
2022-05-09 20:36:44 +02:00
if (!all(x$mo %in% c("", microorganisms$mo, microorganisms$fullname), na.rm = TRUE)) {
2020-05-16 13:05:47 +02:00
if (stop_on_error == TRUE) {
2022-05-09 20:36:44 +02:00
invalid <- x[which(!x$mo %in% c("", microorganisms$mo, microorganisms$fullname)), , drop = FALSE]
2020-05-16 13:05:47 +02:00
if (nrow(invalid) > 1) {
plural <- "s"
} else {
plural <- ""
}
stop_("Value", plural, " ", vector_and(invalid[, 1, drop = TRUE], quotes = TRUE),
2020-05-16 13:05:47 +02:00
" found in ", tolower(refer_to_name),
", but with invalid microorganism code", plural, " ", vector_and(invalid$mo, quotes = TRUE),
2020-11-10 16:35:56 +01:00
call = FALSE)
2020-05-16 13:05:47 +02:00
} else {
return(FALSE)
}
2019-03-05 22:47:42 +01:00
}
2020-11-10 16:35:56 +01:00
if (colnames(x)[1] != "mo" & nrow(x) > length(unique(x[, 1, drop = TRUE]))) {
if (stop_on_error == TRUE) {
stop_(refer_to_name, " contains duplicate values in column '", colnames(x)[1], "'", call = FALSE)
} else {
return(FALSE)
}
}
if (colnames(x)[2] != "mo" & nrow(x) > length(unique(x[, 2, drop = TRUE]))) {
if (stop_on_error == TRUE) {
stop_(refer_to_name, " contains duplicate values in column '", colnames(x)[2], "'", call = FALSE)
} else {
return(FALSE)
}
}
return(TRUE)
2019-03-05 22:47:42 +01:00
}