AMR/R/g.test.R

# ==================================================================== #
# TITLE                                                                #
# AMR: An R Package for Working with Antimicrobial Resistance Data     #
#                                                                      #
# SOURCE                                                               #
# https://github.com/msberends/AMR                                     #
#                                                                      #
# CITE AS                                                              #
# Berends MS, Luz CF, Friedrich AW, Sinha BNM, Albers CJ, Glasner C    #
# (2022). AMR: An R Package for Working with Antimicrobial Resistance  #
# Data. Journal of Statistical Software, 104(3), 1-31.                 #
# doi:10.18637/jss.v104.i03                                            #
#                                                                      #
# Developed at the University of Groningen and the University Medical  #
# Center Groningen in The Netherlands, in collaboration with many      #
# colleagues from around the world, see our website.                   #
#                                                                      #
# This R package is free software; you can freely use and distribute   #
# it for both personal and commercial purposes under the terms of the  #
# GNU General Public License version 2.0 (GNU GPL-2), as published by  #
# the Free Software Foundation.                                        #
# We created this package for both routine data analysis and academic  #
# research and it was publicly released in the hope that it will be    #
# useful, but it comes WITHOUT ANY WARRANTY OR LIABILITY.              #
#                                                                      #
# Visit our website for the full manual and a complete tutorial about  #
# how to conduct AMR data analysis: https://msberends.github.io/AMR/   #
# ==================================================================== #

#' *G*-test for Count Data
#'
#' [g.test()] performs *G*-tests on contingency tables and *G*-tests of goodness-of-fit. It is used much like [chisq.test()], but is more reliable (1). A *G*-test can be used to see whether the number of observations in each category fits a theoretical expectation (called a ***G*-test of goodness-of-fit**), or to see whether the proportions of one variable are different for different values of the other variable (called a ***G*-test of independence**).
#' @inherit stats::chisq.test params return
#' @details If `x` is a [matrix] with one row or column, or if `x` is a vector and `y` is not given, then a *goodness-of-fit test* is performed (`x` is treated as a one-dimensional contingency table). The entries of `x` must be non-negative integers. In this case, the hypothesis tested is whether the population probabilities equal those in `p`, or are all equal if `p` is not given.
#'
#' If `x` is a [matrix] with at least two rows and columns, it is taken as a two-dimensional contingency table: the entries of `x` must be non-negative integers. Otherwise, `x` and `y` must be vectors or factors of the same length; cases with missing values are removed, the objects are coerced to factors, and the contingency table is computed from these. Then a *G*-test is performed of the null hypothesis that the joint distribution of the cell counts in a two-dimensional contingency table is the product of the row and column marginals.
#'
#' The p-value is computed from the asymptotic chi-squared distribution of the test statistic.
#'
#' Unlike [chisq.test()], [g.test()] does not offer Monte Carlo simulation of p-values: the `simulate.p.value` argument was removed. If the asymptotic approximation is doubtful, use an exact test such as [fisher.test()] instead.
#'
#' ### *G*-test Of Goodness-of-Fit (Likelihood Ratio Test)
#' Use the *G*-test of goodness-of-fit when you have one nominal variable with two or more values (such as male and female, or red, pink and white flowers). You compare the observed counts in each category with the expected counts, which you calculate from a theoretical expectation (such as a 1:1 sex ratio or a 1:2:1 ratio in a genetic cross).
#'
#' If the expected number of observations in any category is too small, the *G*-test may give inaccurate results, and you should use an exact test instead ([fisher.test()]).
#'
#' The *G*-test of goodness-of-fit is an alternative to the chi-square test of goodness-of-fit ([chisq.test()]); each of these tests has some advantages and some disadvantages, and the results of the two tests are usually very similar.
#'
#' ### *G*-test of Independence
#' Use the *G*-test of independence when you have two nominal variables, each with two or more possible values. You want to know whether the proportions for one variable are different among values of the other variable.
#'
#' It is also possible to do a *G*-test of independence with more than two nominal variables. For example, Jackson et al. (2013) also had data for children under 3, so you could do an analysis of old vs. young, thigh vs. arm, and reaction vs. no reaction, all analyzed together. Note that [g.test()] itself only accepts one- or two-dimensional input.
#'
#' Fisher's exact test ([fisher.test()]) is an **exact** test, whereas the *G*-test is only an **approximation**. For any 2x2 table, Fisher's exact test may be slower but will still run in seconds, even if the sum of your observations is in the millions.
#'
#' The *G*-test of independence is an alternative to the chi-square test of independence ([chisq.test()]), and they will give approximately the same results.
#'
#' ### How the Test Works
#' Unlike the exact test of goodness-of-fit ([fisher.test()]), the *G*-test does not directly calculate the probability of obtaining the observed results or something more extreme. Instead, like almost all statistical tests, the *G*-test has an intermediate step; it uses the data to calculate a test statistic that measures how far the observed data are from the null expectation. You then use a mathematical relationship, in this case the chi-square distribution, to estimate the probability of obtaining that value of the test statistic.
#'
#' The *G*-test uses the log of the ratio of two likelihoods as the test statistic, which is why it is also called a likelihood ratio test or log-likelihood ratio test. The formula to calculate a *G*-statistic is:
#'
#' \eqn{G = 2 * sum(x * log(x / E))}
#'
#' where `E` are the expected values. Since `G` approximately follows a chi-square distribution under the null hypothesis, the p-value can be calculated in \R with:
#' ```
#' p <- stats::pchisq(G, df, lower.tail = FALSE)
#' ```
#' where `df` is the number of degrees of freedom.
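#'
#' As a minimal sketch of these two steps, assuming a goodness-of-fit test on the counts and the 1:2:1 expectation of Example 1 below, and using base \R only:
#' ```
#' x <- c(772, 1611, 737)                         # observed counts
#' E <- sum(x) * c(1, 2, 1) / 4                   # expected counts under a 1:2:1 ratio
#' G <- 2 * sum(x * log(x / E))                   # G-statistic
#' df <- length(x) - 1                            # degrees of freedom for goodness-of-fit
#' p <- stats::pchisq(G, df, lower.tail = FALSE)  # upper-tail probability
#' ```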
#'
#' If there are more than two categories and you want to find out which ones differ significantly from their null expectation, you can test each category against the sum of all other categories, using a *G*-test for each comparison and applying the Bonferroni correction to the resulting p-values.
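#'
#' A rough sketch of such a post-hoc comparison, assuming the counts and the 1:2:1 expectation of Example 1 below and base \R only. The helper name `g_one_vs_rest` is hypothetical and not part of this package:
#' ```
#' x <- c(772, 1611, 737)                     # observed counts
#' p0 <- c(1, 2, 1) / 4                       # null proportions
#' g_one_vs_rest <- function(i) {
#'   obs <- c(x[i], sum(x) - x[i])            # category i vs. all other categories
#'   expd <- sum(x) * c(p0[i], 1 - p0[i])     # expected counts for that split
#'   G <- 2 * sum(obs * log(obs / expd))      # G-statistic with 1 degree of freedom
#'   stats::pchisq(G, df = 1, lower.tail = FALSE)
#' }
#' p_raw <- vapply(seq_along(x), g_one_vs_rest, numeric(1))
#' p_adj <- stats::p.adjust(p_raw, method = "bonferroni")
#' ```
#' This sketch assumes that all counts are positive; a zero count would need separate handling.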
#' @seealso [chisq.test()]
#' @references 1. McDonald, J.H. 2014. **Handbook of Biological Statistics (3rd ed.)**. Sparky House Publishing, Baltimore, Maryland. <http://www.biostathandbook.com/gtestgof.html>.
#' @source The code for this function is identical to that of [chisq.test()], except that:
#' - The calculation of the statistic was changed to \eqn{2 * sum(x * log(x / E))}
#' - Yates' continuity correction was removed as it does not apply to a *G*-test
#' - The possibility to simulate p values with `simulate.p.value` was removed
#' @export
#' @importFrom stats pchisq complete.cases
#' @examples
#' # = EXAMPLE 1 =
#' # Shivrain et al. (2006) crossed clearfield rice (which are resistant
#' # to the herbicide imazethapyr) with red rice (which are susceptible to
#' # imazethapyr). They then crossed the hybrid offspring and examined the
#' # F2 generation, where they found 772 resistant plants, 1611 moderately
#' # resistant plants, and 737 susceptible plants. If resistance is controlled
#' # by a single gene with two co-dominant alleles, you would expect a 1:2:1
#' # ratio.
#'
#' x <- c(772, 1611, 737)
#' g.test(x, p = c(1, 2, 1) / 4)
#'
#' # There is no significant difference from a 1:2:1 ratio.
#' # Meaning: resistance controlled by a single gene with two co-dominant
#' # alleles is plausible.
#'
#'
#' # = EXAMPLE 2 =
#' # Red crossbills (Loxia curvirostra) have the tip of the upper bill either
#' # right or left of the lower bill, which helps them extract seeds from pine
#' # cones. Some have hypothesized that frequency-dependent selection would
#' # keep the number of right and left-billed birds at a 1:1 ratio. Groth (1992)
#' # observed 1752 right-billed and 1895 left-billed crossbills.
#'
#' x <- c(1752, 1895)
#' g.test(x)
#'
#' # There is a significant difference from a 1:1 ratio.
#' # Meaning: there are significantly more left-billed birds.
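#'
#'
#' # = EXAMPLE 3 =
#' # A minimal sketch of a G-test of independence. The counts below are
#' # hypothetical and only illustrate matrix input: the rows stand for two
#' # groups and the columns for three outcome categories. A 2x3 table is used
#' # because g.test() suggests fisher.test() for 2x2 tables.
#'
#' tab <- matrix(c(30, 20, 25, 35, 15, 40), nrow = 2)
#' g.test(tab)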
g.test <- function(x,
                   y = NULL,
                   # correct = TRUE,
                   p = rep(1 / length(x), length(x)),
                   rescale.p = FALSE) {
  DNAME <- deparse(substitute(x))
  if (is.data.frame(x)) {
    x <- as.matrix(x)
  }
  if (is.matrix(x)) {
    if (min(dim(x)) == 1L) {
      x <- as.vector(x)
    }
  }
  if (!is.matrix(x) && !is.null(y)) {
    if (length(x) != length(y)) {
      stop("'x' and 'y' must have the same length")
    }
    DNAME2 <- deparse(substitute(y))
    xname <- if (length(DNAME) > 1L || nchar(DNAME, "w") > 30) {
      ""
    } else {
      DNAME
    }
    yname <- if (length(DNAME2) > 1L || nchar(DNAME2, "w") > 30) {
      ""
    } else {
      DNAME2
    }
    OK <- complete.cases(x, y)
    x <- factor(x[OK])
    y <- factor(y[OK])
    if ((nlevels(x) < 2L) || (nlevels(y) < 2L)) {
      stop("'x' and 'y' must have at least 2 levels")
    }
    x <- table(x, y)
    names(dimnames(x)) <- c(xname, yname)
    DNAME <- paste(
      paste(DNAME, collapse = "\n"), "and",
      paste(DNAME2, collapse = "\n")
    )
  }
  if (any(x < 0) || anyNA(x)) {
    stop("all entries of 'x' must be nonnegative and finite")
  }
  if ((n <- sum(x)) == 0) {
    stop("at least one entry of 'x' must be positive")
  }


  if (is.matrix(x)) {
    METHOD <- "G-test of independence"
    nr <- as.integer(nrow(x))
    nc <- as.integer(ncol(x))
    if (is.na(nr) || is.na(nc) || is.na(nr * nc)) {
      stop("invalid nrow(x) or ncol(x)", domain = NA)
    }
    # add fisher.test suggestion
    if (nr == 2 && nc == 2) {
      warning("`fisher.test()` is always more reliable for 2x2 tables and although much slower, often only takes seconds.")
    }
    sr <- rowSums(x)
    sc <- colSums(x)
    E <- outer(sr, sc, "*") / n
    v <- function(r, c, n) c * r * (n - r) * (n - c) / n^3
    V <- outer(sr, sc, v, n)
    dimnames(E) <- dimnames(x)

    STATISTIC <- 2 * sum(x * log(x / E), na.rm = TRUE) # sum((abs(x - E) - YATES)^2/E) for chisq.test
    PARAMETER <- (nr - 1L) * (nc - 1L)
    PVAL <- pchisq(STATISTIC, PARAMETER, lower.tail = FALSE)
  } else {
    if (length(dim(x)) > 2L) {
      stop("invalid 'x'")
    }
    if (length(x) == 1L) {
      stop("'x' must at least have 2 elements")
    }
    if (length(x) != length(p)) {
      stop("'x' and 'p' must have the same number of elements")
    }
    if (any(p < 0)) {
      stop("probabilities must be non-negative.")
    }
    if (abs(sum(p) - 1) > sqrt(.Machine$double.eps)) {
      if (rescale.p) {
        p <- p / sum(p)
      } else {
        stop("probabilities must sum to 1.")
      }
    }
    METHOD <- "G-test of goodness-of-fit (likelihood ratio test)"
    E <- n * p
    V <- n * p * (1 - p)
    # sum((x - E)^2/E) for chisq.test; na.rm = TRUE drops the NaN that a zero count yields (0 * log(0))
    STATISTIC <- 2 * sum(x * log(x / E), na.rm = TRUE)
    names(E) <- names(x)

    PARAMETER <- length(x) - 1
    PVAL <- pchisq(STATISTIC, PARAMETER, lower.tail = FALSE)
  }
  names(STATISTIC) <- "X-squared"
  names(PARAMETER) <- "df"
  if (any(E < 5) && is.finite(PARAMETER)) {
    warning("G-statistic approximation may be incorrect due to E < 5")
  }

  structure(list(
    statistic = STATISTIC, parameter = PARAMETER,
    p.value = PVAL, method = METHOD, data.name = DNAME,
    observed = x, expected = E, residuals = (x - E) / sqrt(E),
    stdres = (x - E) / sqrt(V)
  ), class = "htest")
}