---
title: "Data sets for download / own use"
date: '`r format(Sys.Date(), "%d %B %Y")`'
output:
  rmarkdown::html_vignette:
    toc: true
    toc_depth: 1
vignette: >
  %\VignetteIndexEntry{Data sets for download / own use}
  %\VignetteEncoding{UTF-8}
  %\VignetteEngine{knitr::rmarkdown}
editor_options:
  chunk_output_type: console
---

```{r setup, include = FALSE, results = "markup"}
knitr::opts_chunk$set(
  warning = FALSE,
  collapse = TRUE,
  comment = "#",
  fig.width = 7.5,
  fig.height = 5
)

library(AMR)
library(dplyr)

options(knitr.kable.NA = "")

# Describe the dimensions and column names of a data set in one sentence
structure_txt <- function(dataset) {
  paste0(
    "A data set with ",
    format(nrow(dataset), big.mark = ","), " rows and ",
    ncol(dataset), " columns, containing the following column names: \n",
    AMR:::vector_or(colnames(dataset), quotes = "*", last_sep = " and ", sort = FALSE), "."
  )
}

# Build the "last updated" text and the list of download links for a data set
download_txt <- function(filename) {
  msg <- paste0(
    "It was last updated on ",
    trimws(format(file.mtime(paste0("../data/", filename, ".rda")), "%e %B %Y %H:%M:%S %Z", tz = "UTC")),
    ". Find more info about the structure of this data set [here](https://msberends.github.io/AMR/reference/", ifelse(filename == "antivirals", "antibiotics", filename), ".html).\n"
  )
  github_base <- "https://github.com/msberends/AMR/raw/main/data-raw/"
  filename <- paste0("../data-raw/", filename)
  rds <- paste0(filename, ".rds")
  txt <- paste0(filename, ".txt")
  excel <- paste0(filename, ".xlsx")
  feather <- paste0(filename, ".feather")
  parquet <- paste0(filename, ".parquet")
  sas <- paste0(filename, ".sas")
  spss <- paste0(filename, ".sav")
  stata <- paste0(filename, ".dta")

  # One Markdown bullet per file format: a link if the file exists, a notice otherwise
  create_txt <- function(filename, type, software, exists) {
    if (isTRUE(exists)) {
      paste0(
        "* Download as [", software, "](", github_base, filename, ") (",
        AMR:::formatted_filesize(filename), ") \n"
      )
    } else {
      paste0("* *(unavailable as ", software, ")*\n")
    }
  }

  # Only add the download section if at least one export file is present
  if (any(
    file.exists(rds),
    file.exists(txt),
    file.exists(excel),
    file.exists(feather),
    file.exists(parquet),
    file.exists(sas),
    file.exists(spss),
    file.exists(stata)
  )) {
    msg <- c(
      msg, "\n**Direct download links:**\n\n",
      create_txt(rds, "rds", "original R Data Structure (RDS) file", file.exists(rds)),
      create_txt(txt, "txt", "tab-separated text file", file.exists(txt)),
      create_txt(excel, "xlsx", "Microsoft Excel workbook", file.exists(excel)),
      create_txt(feather, "feather", "Apache Feather file", file.exists(feather)),
      create_txt(parquet, "parquet", "Apache Parquet file", file.exists(parquet)),
      create_txt(sas, "sas", "SAS data file", file.exists(sas)),
      create_txt(spss, "sav", "IBM SPSS Statistics data file", file.exists(spss)),
      create_txt(stata, "dta", "Stata DTA file", file.exists(stata))
    )
  }
  paste0(msg, collapse = "")
}

# Print the first rows of a data set as a table, abbreviating list columns
print_df <- function(x, rows = 6) {
  x %>%
    as.data.frame(stringsAsFactors = FALSE) %>%
    head(n = rows) %>%
    mutate_all(function(x) {
      if (is.list(x)) {
        sapply(x, function(y) {
          if (length(y) > 3) {
            paste0(paste(y[1:3], collapse = ", "), ", ...")
          } else if (length(y) == 0 || all(is.na(y))) {
            ""
          } else {
            paste(y, collapse = ", ")
          }
        })
      } else {
        x
      }
    }) %>%
    knitr::kable(align = "c")
}
```

All reference data (about microorganisms, antibiotics, R/SI interpretation, EUCAST rules, etc.) in this `AMR` package are reliable, up-to-date and freely available. We continually export our data sets to formats for use in R, MS Excel, Apache Feather, Apache Parquet, SPSS, SAS, and Stata. We also provide tab-separated text files that are machine-readable and suitable for input in any software program, such as laboratory information systems.

On this page, we explain how to download them and what the structure of the data sets looks like.

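
As a minimal sketch of how one of the tab-separated text files could be read in base R: the helper name `read_amr_txt()` and the toy file below are assumptions made for illustration and are not part of the `AMR` package. A downloaded export should be readable the same way, provided it has a header row and uses tabs as separators.

```r
# Hypothetical helper (not part of AMR): read a tab-separated export
# into a data frame, keeping text columns as character vectors.
read_amr_txt <- function(path) {
  utils::read.delim(path, stringsAsFactors = FALSE)
}

# Toy file standing in for a downloaded export; the values are illustrative.
tmp <- tempfile(fileext = ".txt")
writeLines(c("ab\tname", "AMX\tAmoxicillin", "CIP\tCiprofloxacin"), tmp)

toy <- read_amr_txt(tmp)
dim(toy) # 2 rows, 2 columns
```

Since `read.delim()` also accepts a URL, the same helper could in principle be pointed at one of the direct download links listed for each data set below.
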
## `microorganisms`: Full Microbial Taxonomy

`r structure_txt(microorganisms)`

This data set is available in R as `microorganisms` after you load the `AMR` package.

`r download_txt("microorganisms")`

**NOTE: The exported files for Excel, SAS, SPSS and Stata contain only the first 50 SNOMED codes per record, as their file size would otherwise exceed 100 MB, the file size limit of GitHub.** Our advice: use R instead.

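
To illustrate what this truncation means in practice, here is a small sketch on a toy data frame with a list column. The column name `snomed`, the codes and the row labels are assumptions for illustration; this is not necessarily how the exported files are produced.

```r
# Toy data: two records, the first with 120 made-up codes, the second with none.
toy <- data.frame(mo = c("MO_A", "MO_B"), stringsAsFactors = FALSE)
toy$snomed <- list(as.character(1:120), character(0))

# Keep at most the first 50 codes per record.
toy$snomed <- lapply(toy$snomed, head, n = 50)

lengths(toy$snomed) # 50 0
```
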
### Source

This data set contains the full microbial taxonomy of `r nr2char(length(unique(AMR::microorganisms$kingdom[!AMR::microorganisms$kingdom %like% "unknown"])))` kingdoms from the List of Prokaryotic names with Standing in Nomenclature (LPSN) and the Global Biodiversity Information Facility (GBIF):

* `r AMR:::TAXONOMY_VERSION$LPSN$citation` Accessed from <`r AMR:::TAXONOMY_VERSION$LPSN$url`> on `r documentation_date(AMR:::TAXONOMY_VERSION$LPSN$accessed_date)`.
* `r AMR:::TAXONOMY_VERSION$GBIF$citation` Accessed from <`r AMR:::TAXONOMY_VERSION$GBIF$url`> on `r documentation_date(AMR:::TAXONOMY_VERSION$GBIF$accessed_date)`.
* `r AMR:::TAXONOMY_VERSION$SNOMED$citation` URL: <`r AMR:::TAXONOMY_VERSION$SNOMED$url`>

### Example content

Included (sub)species per taxonomic kingdom:

```{r, echo = FALSE}
microorganisms %>%
  count(kingdom) %>%
  mutate(n = format(n, big.mark = ",")) %>%
  setNames(c("Kingdom", "Number of (sub)species")) %>%
  print_df()
```

Example rows when filtering on genus *Escherichia*:

```{r, echo = FALSE}
microorganisms %>%
  filter(genus == "Escherichia") %>%
  print_df()
```

## `antibiotics`: Antibiotic Agents

`r structure_txt(antibiotics)`

This data set is available in R as `antibiotics` after you load the `AMR` package.

`r download_txt("antibiotics")`

### Source

This data set contains all EARS-Net and ATC codes gathered from WHO and WHONET, and all compound IDs from PubChem. It also contains all brand names (synonyms) as found on PubChem and Defined Daily Doses (DDDs) for oral and parenteral administration.

* [ATC/DDD index from WHO Collaborating Centre for Drug Statistics Methodology](https://www.whocc.no/atc_ddd_index/) (note: this may not be used for commercial purposes, but is freely available from the WHO CC website for personal use)
* [PubChem by the US National Library of Medicine](https://pubchem.ncbi.nlm.nih.gov)
* [WHONET software 2019](https://whonet.org)

### Example content

```{r, echo = FALSE}
antibiotics %>%
  filter(ab %in% colnames(example_isolates)) %>%
  print_df()
```

## `antivirals`: Antiviral Agents

`r structure_txt(antivirals)`

This data set is available in R as `antivirals` after you load the `AMR` package.

`r download_txt("antivirals")`

### Source

This data set contains all ATC codes gathered from WHO and all compound IDs from PubChem. It also contains all brand names (synonyms) as found on PubChem and Defined Daily Doses (DDDs) for oral and parenteral administration.

* [ATC/DDD index from WHO Collaborating Centre for Drug Statistics Methodology](https://www.whocc.no/atc_ddd_index/) (note: this may not be used for commercial purposes, but is freely available from the WHO CC website for personal use)
* [PubChem by the US National Library of Medicine](https://pubchem.ncbi.nlm.nih.gov)

### Example content

```{r, echo = FALSE}
antivirals %>%
  print_df()
```

## `rsi_translation`: Interpretation from MIC values / disk diameters to R/SI

`r structure_txt(rsi_translation)`

This data set is available in R as `rsi_translation` after you load the `AMR` package.

`r download_txt("rsi_translation")`

### Source

This data set contains interpretation rules for MIC values and disk diffusion diameters. Included guidelines are CLSI (`r min(as.integer(gsub("[^0-9]", "", subset(rsi_translation, guideline %like% "CLSI")$guideline)))`-`r max(as.integer(gsub("[^0-9]", "", subset(rsi_translation, guideline %like% "CLSI")$guideline)))`) and EUCAST (`r min(as.integer(gsub("[^0-9]", "", subset(rsi_translation, guideline %like% "EUCAST")$guideline)))`-`r max(as.integer(gsub("[^0-9]", "", subset(rsi_translation, guideline %like% "EUCAST")$guideline)))`).

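
As a rough sketch of how a pair of MIC breakpoints translates into R/SI: the function name `interpret_mic()`, its argument names and the breakpoint values below are assumptions for illustration. The sketch assumes the EUCAST-style convention in which an MIC at or below the S breakpoint is susceptible and an MIC above the R breakpoint is resistant. Other guidelines may treat the boundary values differently, and a real interpretation also depends on the guideline, microorganism and antibiotic.

```r
# Hypothetical helper (not part of AMR): interpret MIC values against one
# pair of breakpoints, assuming S if MIC <= breakpoint_S and R if MIC > breakpoint_R.
interpret_mic <- function(mic, breakpoint_S, breakpoint_R) {
  ifelse(mic <= breakpoint_S, "S",
    ifelse(mic > breakpoint_R, "R", "I")
  )
}

# Toy breakpoints, not taken from any guideline.
interpret_mic(c(0.5, 4, 16), breakpoint_S = 2, breakpoint_R = 8)
# "S" "I" "R"
```
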
### Example content

```{r, echo = FALSE}
rsi_translation %>%
  mutate(mo_name = mo_name(mo, language = NULL), .after = mo) %>%
  mutate(ab_name = ab_name(ab, language = NULL), .after = ab) %>%
  print_df()
```

## `intrinsic_resistant`: Intrinsic Bacterial Resistance

`r structure_txt(intrinsic_resistant)`

This data set is available in R as `intrinsic_resistant` after you load the `AMR` package.

`r download_txt("intrinsic_resistant")`

### Source

This data set contains all intrinsic resistance defined by EUCAST for all bug-drug combinations, and is based on `r AMR:::format_eucast_version_nr("3.3")`.

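
A table of bug-drug combinations like this one can be used as a simple lookup. The sketch below uses a toy table with placeholder codes; the helper name `is_intrinsic()` and all values are assumptions for illustration and are not taken from the real data set.

```r
# Toy reference table with placeholder microorganism and antibiotic codes.
toy_intrinsic <- data.frame(
  mo = c("MO_A", "MO_A", "MO_B"),
  ab = c("AB_1", "AB_2", "AB_1"),
  stringsAsFactors = FALSE
)

# Hypothetical helper (not part of AMR): TRUE where the combination of
# microorganism and antibiotic occurs in the reference table.
is_intrinsic <- function(mo, ab, reference) {
  paste(mo, ab) %in% paste(reference$mo, reference$ab)
}

is_intrinsic(c("MO_A", "MO_B"), c("AB_2", "AB_2"), toy_intrinsic)
# TRUE FALSE
```
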
### Example content

Example rows when filtering on *Enterobacter cloacae*:

```{r, echo = FALSE}
intrinsic_resistant %>%
  transmute(
    microorganism = mo_name(mo),
    antibiotic = ab_name(ab)
  ) %>%
  filter(microorganism == "Enterobacter cloacae") %>%
  arrange(antibiotic) %>%
  print_df(rows = Inf)
```

## `dosage`: Dosage Guidelines from EUCAST

`r structure_txt(dosage)`

This data set is available in R as `dosage` after you load the `AMR` package.

`r download_txt("dosage")`

### Source

EUCAST breakpoints used in this package are based on the dosages in this data set.

The dosages currently included in the data set are meant for: `r AMR:::format_eucast_version_nr(unique(dosage$eucast_version))`.

### Example content

```{r, echo = FALSE}
dosage %>%
  print_df()
```

## `example_isolates`: Example Data for Practice

`r structure_txt(example_isolates)`

This data set is available in R as `example_isolates` after you load the `AMR` package.

`r download_txt("example_isolates")`

### Source

This data set contains randomised fictitious data, but reflects reality and can be used to practise AMR data analysis.

### Example content

```{r, echo = FALSE}
example_isolates %>%
  print_df()
```

## `example_isolates_unclean`: Example Data for Practice

`r structure_txt(example_isolates_unclean)`

This data set is available in R as `example_isolates_unclean` after you load the `AMR` package.

`r download_txt("example_isolates_unclean")`

### Source

This data set contains randomised fictitious data, but reflects reality and can be used to practise AMR data analysis.

### Example content

```{r, echo = FALSE}
example_isolates_unclean %>%
  print_df()
```