---
title: "Data sets for download"
output: 
  rmarkdown::html_vignette:
    toc: true
    toc_depth: 3
vignette: >
  %\VignetteIndexEntry{Data sets for download}
  %\VignetteEncoding{UTF-8}
  %\VignetteEngine{knitr::rmarkdown}
editor_options: 
  chunk_output_type: console
---

```{r setup, include = FALSE, results = 'markup'}
knitr::opts_chunk$set(
  warning = FALSE,
  collapse = TRUE,
  comment = "#",
  fig.width = 7.5,
  fig.height = 5
)
options(knitr.kable.NA = '')

file_size <- function(...) {
  size_kb <- file.size(...) / 1024
  if (size_kb > 500) {
    paste(round(size_kb / 1024, 1), "MB")
  } else {
    paste(round(size_kb, 1), "kB")
  }
}

structure_txt <- function(dataset) {
  paste0("A data set with ",
         format(nrow(dataset), big.mark = ","), " rows and ", 
         ncol(dataset), " columns, containing the following column names:\n\n*",
         paste0(colnames(dataset), collapse = ", "), "*.")
}

download_txt <- function(filename) {
  msg <- paste0("Preferably download the data set in the format of the software you use, so the data file already has the correct data structure. The files below were updated on ", 
                trimws(format(file.mtime(paste0("../data/", filename, ".rda")), "%e %B %Y %H:%M:%S %Z")), ".")
  github_base <- "https://github.com/msberends/AMR/raw/master/data-raw/"
  gitlab_base <- "https://gitlab.com/msberends/AMR/-/raw/master/data-raw/"
  filename <- paste0("../data-raw/", filename)
  txt <- paste0(filename, ".txt")
  rds <- paste0(filename, ".rds")
  spss <- paste0(filename, ".sav")
  stata <- paste0(filename, ".dta")
  sas <- paste0(filename, ".sas")
  excel <- paste0(filename, ".xlsx")
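  create_txt <- function(filename, type) {
    # link to the bare file name; the relative path is only needed for the size lookup
    paste0("* ", type, ": ",
           "[from GitHub](", github_base, basename(filename), "), ",
           "[from GitLab](", gitlab_base, basename(filename), ") ",
           "(file size: ", file_size(filename), ")")
  }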

  if (file.exists(rds)) msg <- c(msg, create_txt(rds, "R file (.rds)"))
  if (file.exists(excel)) msg <- c(msg, create_txt(excel, "Excel workbook (.xlsx)"))
  if (file.exists(spss)) msg <- c(msg, create_txt(spss, "SPSS file (.sav)"))
  if (file.exists(stata)) msg <- c(msg, create_txt(stata, "Stata file (.dta)"))
  if (file.exists(sas)) msg <- c(msg, create_txt(sas, "SAS file (.sas)"))
  if (file.exists(txt)) msg <- c(msg, create_txt(txt, "Tab separated file (.txt)"))
  paste0(msg, collapse = "\n\n")
}

library(AMR)
library(dplyr)

print_df <- function(x) {
  x %>% 
    head() %>% 
    mutate_all(function(x) {
      if (is.list(x)) {
        sapply(x, function(y) {
          if (length(y) > 3) {
            paste0(paste(y[1:3], collapse = ", "), ", ...")
          } else if (length(y) == 0 || all(is.na(y))) {
            ""
          } else {
            paste(y, collapse = ", ")
          }
        })
      } else {
        x
      }
    }) %>%
    knitr::kable(align = "c")
}

```

This package contains many reference data sets that are reliable, up to date and free to download. You can even use them outside of R, for example to train your laboratory information system (LIS) about intrinsic resistance!

We included them in our `AMR` package, but also automatically 'mirror' them to our public repository in different software formats. On this page, we explain how to download them and what the structure of the data sets looks like. The tab-separated files **allow for machine reading taxonomic data and EUCAST and CLSI interpretation guidelines**, which is almost impossible with the Excel and PDF files distributed by EUCAST and CLSI.
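
As a minimal sketch of what machine reading can look like, a tab-separated file can be read with base R's `read.delim()`. The URL below combines the repository path used for the download links on this page with the file name `microorganisms.txt`, which is an assumption that such a file exists there. The helper `read_tsv_file()` is hypothetical, and the temporary file with a few illustrative rows only stands in for a real download so that the example runs offline.

```r
# Assumed location of the tab-separated export (base path as used on this page)
base_url <- "https://github.com/msberends/AMR/raw/master/data-raw/"
txt_url <- paste0(base_url, "microorganisms.txt")

# Offline stand-in: write two illustrative rows to a temporary tab-separated file
tmp <- tempfile(fileext = ".txt")
writeLines(c("mo\tfullname\tkingdom",
             "B_ESCHR_COLI\tEscherichia coli\tBacteria",
             "B_KLBSL_PNMN\tKlebsiella pneumoniae\tBacteria"),
           tmp)

# Hypothetical helper: `path` may be a local file or a URL such as txt_url
read_tsv_file <- function(path) {
  read.delim(path, stringsAsFactors = FALSE, na.strings = "")
}

example_rows <- read_tsv_file(tmp)
str(example_rows)
```

Passing `txt_url` instead of `tmp` would read the published file directly, provided the assumed file name is correct and a network connection is available.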

*Note: Years and dates of updates mentioned on this page are from `AMR` package version `r utils::packageVersion("AMR")`, released online on `r format(utils::packageDate("AMR"), "%e %B %Y")`. **If you are reading this page from within R, please [visit our website](https://msberends.github.io/AMR/articles/datasets.html) for the latest update.***

## Microorganisms

This data set is available in R as `microorganisms` after you load the `AMR` package.

#### Source

Our full taxonomy of microorganisms is based on the authoritative and comprehensive:

* [Catalogue of Life](http://www.catalogueoflife.org) (included version: `r AMR:::catalogue_of_life$year`)
* [List of Prokaryotic names with Standing in Nomenclature](https://lpsn.dsmz.de) (LPSN, included version: `r AMR:::catalogue_of_life$yearmonth_DSMZ`)

#### Structure

`r structure_txt(microorganisms)`

Included per taxonomic kingdom:

```{r, echo = FALSE}
microorganisms %>% 
  pull(kingdom) %>% 
  table() %>% 
  as.data.frame() %>% 
  mutate(Freq = format(Freq, big.mark = ",")) %>% 
  setNames(c("Kingdom", "Number of (sub)species")) %>% 
  print_df()
```


#### Download

`r download_txt("microorganisms")`

#### Example

Example rows when filtering on genus *Escherichia*:

```{r, echo = FALSE}
microorganisms %>%
  filter(genus == "Escherichia") %>% 
  print_df()
```

## Antibiotic agents

This data set is available in R as `antibiotics` after you load the `AMR` package.

#### Source

This data set contains all EARS-Net and ATC codes gathered from WHO and WHONET, and all compound IDs from PubChem. It also contains all brand names (synonyms) as found on PubChem and Defined Daily Doses (DDDs) for oral and parenteral administration.

* [ATC/DDD index from WHO Collaborating Centre for Drug Statistics Methodology](https://www.whocc.no/atc_ddd_index/) (note: this may not be used for commercial purposes, but is freely available from the WHO CC website for personal use)
* [PubChem by the US National Library of Medicine](https://pubchem.ncbi.nlm.nih.gov)
* [WHONET software 2019](https://whonet.org)

#### Structure

`r structure_txt(antibiotics)`

#### Download

`r download_txt("antibiotics")`

#### Example

Example rows:

```{r, echo = FALSE}
antibiotics %>%
  filter(ab %in% colnames(example_isolates)) %>% 
  print_df()
```


## Antiviral agents

This data set is available in R as `antivirals` after you load the `AMR` package.

#### Source

This data set contains all ATC codes gathered from WHO and all compound IDs from PubChem. It also contains all brand names (synonyms) as found on PubChem and Defined Daily Doses (DDDs) for oral and parenteral administration.

* [ATC/DDD index from WHO Collaborating Centre for Drug Statistics Methodology](https://www.whocc.no/atc_ddd_index/) (note: this may not be used for commercial purposes, but is freely available from the WHO CC website for personal use)
* [PubChem by the US National Library of Medicine](https://pubchem.ncbi.nlm.nih.gov)

#### Structure

`r structure_txt(antivirals)`

#### Download

`r download_txt("antivirals")`

#### Example

Example rows:

```{r, echo = FALSE}
antivirals %>%
  print_df()
```


## Intrinsic bacterial resistance

This data set is available in R as `intrinsic_resistant` after you load the `AMR` package.

#### Source

This data set contains all intrinsic resistance defined by EUCAST, for all bug-drug combinations.

The data set is based on 'EUCAST Expert Rules, Intrinsic Resistance and Exceptional Phenotypes', version `r AMR:::EUCAST_VERSION_EXPERT_RULES`.

#### Structure

`r structure_txt(intrinsic_resistant)`

#### Download

`r download_txt("intrinsic_resistant")`

#### Example

Example rows:

```{r, echo = FALSE}
intrinsic_resistant %>%
  filter(microorganism %like% "^Klebsiella") %>% 
  print_df()
```
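
A table like the one above can serve as a simple lookup outside of the package. The sketch below uses a small hand-made data frame in place of the real data set; the column name `antibiotic` and the helper `is_intrinsic_resistant()` are assumptions made for illustration, not part of the package.

```r
# Hand-made stand-in for the real table (a few illustrative rows only)
lookup <- data.frame(
  microorganism = c("Klebsiella pneumoniae", "Klebsiella pneumoniae", "Pseudomonas aeruginosa"),
  antibiotic = c("Ampicillin", "Ticarcillin", "Ampicillin"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: TRUE when the bug-drug combination is listed in the table
is_intrinsic_resistant <- function(microorganism, antibiotic, table = lookup) {
  any(table$microorganism == microorganism & table$antibiotic == antibiotic)
}

is_intrinsic_resistant("Klebsiella pneumoniae", "Ampicillin")
is_intrinsic_resistant("Klebsiella pneumoniae", "Meropenem")
```

The first call finds a listed combination; the second does not, because that combination is absent from the stand-in table.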


## Interpretation from MIC values / disk diameters to R/SI

This data set is available in R as `rsi_translation` after you load the `AMR` package.

#### Source

This data set contains interpretation rules for MIC values and disk diffusion diameters. Included guidelines are CLSI (`r min(as.integer(gsub("[^0-9]", "", subset(rsi_translation, guideline %like% "CLSI")$guideline)))`-`r max(as.integer(gsub("[^0-9]", "", subset(rsi_translation, guideline %like% "CLSI")$guideline)))`) and EUCAST (`r min(as.integer(gsub("[^0-9]", "", subset(rsi_translation, guideline %like% "EUCAST")$guideline)))`-`r max(as.integer(gsub("[^0-9]", "", subset(rsi_translation, guideline %like% "EUCAST")$guideline)))`).

#### Structure

`r structure_txt(rsi_translation)`

#### Download

`r download_txt("rsi_translation")`

#### Example

Example rows:

```{r, echo = FALSE}
rsi_translation %>% 
  mutate(ab = ab_name(ab), mo = mo_name(mo)) %>% 
  print_df()
```
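
To illustrate how such breakpoints turn a MIC value into an interpretation, here is a minimal sketch that follows the EUCAST convention: susceptible when the MIC is at or below the S breakpoint, resistant when it is above the R breakpoint, and intermediate in between. The helper `interpret_mic()`, its argument names and the breakpoint values are hypothetical and not taken from any guideline. For disk diffusion the comparison runs the other way, since larger zone diameters indicate susceptibility.

```r
# Hypothetical helper following the EUCAST convention for MIC breakpoints:
# S when MIC <= S breakpoint, R when MIC > R breakpoint, otherwise I
interpret_mic <- function(mic, breakpoint_S, breakpoint_R) {
  ifelse(mic <= breakpoint_S, "S",
         ifelse(mic > breakpoint_R, "R", "I"))
}

# Made-up breakpoints in mg/L, for illustration only
interpret_mic(c(0.5, 4, 16), breakpoint_S = 2, breakpoint_R = 8)
# "S" "I" "R"
```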