# *G*-test for Count Data

`g.test()` performs chi-squared contingency table tests and
goodness-of-fit tests, just like
[`chisq.test()`](https://rdrr.io/r/stats/chisq.test.html) but is more
reliable (1). A *G*-test can be used to see whether the number of
observations in each category fits a theoretical expectation (called a
***G*-test of goodness-of-fit**), or to see whether the proportions of
one variable are different for different values of the other variable
(called a ***G*-test of independence**).

## Usage

``` r
g.test(x, y = NULL, p = rep(1/length(x), length(x)), rescale.p = FALSE)
```

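Depending on the form of `x`, the same call performs either test; a
minimal sketch with made-up counts:

``` r
# Made-up counts, for illustration only
g.test(c(60, 40))                          # goodness-of-fit against equal expected proportions
g.test(matrix(c(12, 8, 5, 15), nrow = 2))  # test of independence for a 2x2 table
```
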
## Source

The code for this function is identical to that of
[`chisq.test()`](https://rdrr.io/r/stats/chisq.test.html), except that:

- The calculation of the statistic was changed to
  \(2 \sum_i x_i \ln(x_i / E_i)\)

- Yates' continuity correction was removed as it does not apply to a
  *G*-test

- The possibility to simulate p values with `simulate.p.value` was
  removed

## Arguments

- x: a numeric vector or matrix. `x` and `y` can also both be factors.

- y: a numeric vector; ignored if `x` is a matrix. If `x` is a factor,
  `y` should be a factor of the same length.

- p: a vector of probabilities of the same length as `x`. An error is
  given if any entry of `p` is negative.

- rescale.p: a logical scalar; if TRUE then `p` is rescaled (if
  necessary) to sum to 1. If `rescale.p` is FALSE, and `p` does not sum
  to 1, an error is given.

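As an illustration of the `p` and `rescale.p` arguments, an expected
ratio can be given either as proportions or as raw weights to be
rescaled (hypothetical counts):

``` r
x <- c(25, 52, 23)                            # hypothetical counts
g.test(x, p = c(1, 2, 1) / 4)                 # expected proportions summing to 1
g.test(x, p = c(1, 2, 1), rescale.p = TRUE)   # same test; the 1:2:1 ratio is rescaled
```
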
## Value

A list with class `"htest"` containing the following components:

- statistic: the value of the chi-squared test statistic.

- parameter: the degrees of freedom of the approximate chi-squared
  distribution of the test statistic, `NA` if the p-value is computed by
  Monte Carlo simulation.

- p.value: the p-value for the test.

- method: a character string indicating the type of test performed, and
  whether Monte Carlo simulation or continuity correction was used.

- data.name: a character string giving the name(s) of the data.

- observed: the observed counts.

- expected: the expected counts under the null hypothesis.

- residuals: the Pearson residuals,
  `(observed - expected) / sqrt(expected)`.

- stdres: standardized residuals, `(observed - expected) / sqrt(V)`,
  where `V` is the residual cell variance (Agresti, 2007, section 2.4.5
  for the case where `x` is a matrix, `n * p * (1 - p)` otherwise).

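The result is a regular `htest` object, so the components listed above
can be extracted directly; for example (using the counts from Example 1
below):

``` r
res <- g.test(c(772, 1611, 737), p = c(1, 2, 1) / 4)
res$statistic  # the G-statistic
res$p.value    # the p-value
res$expected   # expected counts under the null hypothesis
```
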
## Details

If `x` is a [matrix](https://rdrr.io/r/base/matrix.html) with one row or
column, or if `x` is a vector and `y` is not given, then a
*goodness-of-fit test* is performed (`x` is treated as a one-dimensional
contingency table). The entries of `x` must be non-negative integers. In
this case, the hypothesis tested is whether the population probabilities
equal those in `p`, or are all equal if `p` is not given.

If `x` is a [matrix](https://rdrr.io/r/base/matrix.html) with at least
two rows and columns, it is taken as a two-dimensional contingency
table: the entries of `x` must be non-negative integers. Otherwise, `x`
and `y` must be vectors or factors of the same length; cases with
missing values are removed, the objects are coerced to factors, and the
contingency table is computed from these. The *G*-test is then performed
of the null hypothesis that the joint distribution of the cell counts in
a 2-dimensional contingency table is the product of the row and column
marginals.

The p-value is computed from the asymptotic chi-squared distribution of
the test statistic.

In the contingency table case, simulation is done by random sampling
from the set of all contingency tables with given marginals, and works
only if the marginals are strictly positive. Note that this is not the
usual sampling situation assumed for a chi-squared test (such as the
*G*-test) but rather that for Fisher's exact test.

In the goodness-of-fit case, simulation is done by random sampling from
the discrete distribution specified by `p`, each sample being of size
`n = sum(x)`. This simulation is done in R and may be slow.

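A minimal sketch of the vector/factor form described above, using
made-up data (the call is equivalent to passing the cross-tabulated
counts):

``` r
# Two made-up nominal variables of the same length
age      <- rep(c("old", "old", "young", "young"), times = c(17, 8, 9, 16))
reaction <- rep(c("reaction", "no reaction", "reaction", "no reaction"),
                times = c(17, 8, 9, 16))

g.test(age, reaction)         # coerced to factors and cross-tabulated internally
g.test(table(age, reaction))  # same result from the contingency table
```
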
### *G*-test Of Goodness-of-Fit (Likelihood Ratio Test)

Use the *G*-test of goodness-of-fit when you have one nominal variable
with two or more values (such as male and female, or red, pink and white
flowers). You compare the observed counts of observations in each
category with the expected counts, which you calculate using some kind
of theoretical expectation (such as a 1:1 sex ratio or a 1:2:1 ratio in
a genetic cross).

If the expected number of observations in any category is too small, the
*G*-test may give inaccurate results, and you should use an exact test
instead ([`fisher.test()`](https://rdrr.io/r/stats/fisher.test.html)).

The *G*-test of goodness-of-fit is an alternative to the chi-square test
of goodness-of-fit
([`chisq.test()`](https://rdrr.io/r/stats/chisq.test.html)); each of
these tests has some advantages and some disadvantages, and the results
of the two tests are usually very similar.

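To see how similar the two tests usually are, both can be run on the
same hypothetical counts:

``` r
x <- c(48, 102, 50)                  # hypothetical counts, expected 1:2:1 ratio
g.test(x, p = c(1, 2, 1) / 4)        # G-test of goodness-of-fit
chisq.test(x, p = c(1, 2, 1) / 4)    # chi-square test of goodness-of-fit
```
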
### *G*-test of Independence

Use the *G*-test of independence when you have two nominal variables,
each with two or more possible values. You want to know whether the
proportions for one variable are different among values of the other
variable.

It is also possible to do a *G*-test of independence with more than two
nominal variables. For example, Jackson et al. (2013) also had data for
children under 3, so you could do an analysis of old vs. young, thigh
vs. arm, and reaction vs. no reaction, all analyzed together.

Fisher's exact test
([`fisher.test()`](https://rdrr.io/r/stats/fisher.test.html)) is an
**exact** test, whereas the *G*-test is still only an **approximation**.
For any 2x2 table, Fisher's exact test may be slower but will still run
in seconds, even if the sum of your observations runs into the millions.

The *G*-test of independence is an alternative to the chi-square test of
independence
([`chisq.test()`](https://rdrr.io/r/stats/chisq.test.html)), and they
will give approximately the same results.

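A minimal sketch of a *G*-test of independence on a hypothetical 2x2
table, with Fisher's exact test as the exact alternative:

``` r
# Hypothetical counts: rows are groups, columns are outcomes
tbl <- matrix(c(30, 10,
                20, 25),
              nrow = 2, byrow = TRUE,
              dimnames = list(group = c("A", "B"),
                              outcome = c("reaction", "no reaction")))

g.test(tbl)        # G-test of independence
fisher.test(tbl)   # exact test on the same table
```
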
### How the Test Works

Unlike the exact test of goodness-of-fit
([`fisher.test()`](https://rdrr.io/r/stats/fisher.test.html)), the
*G*-test does not directly calculate the probability of obtaining the
observed results or something more extreme. Instead, like almost all
statistical tests, the *G*-test has an intermediate step; it uses the
data to calculate a test statistic that measures how far the observed
data are from the null expectation. You then use a mathematical
relationship, in this case the chi-square distribution, to estimate the
probability of obtaining that value of the test statistic.

The *G*-test uses the log of the ratio of two likelihoods as the test
statistic, which is why it is also called a likelihood ratio test or
log-likelihood ratio test. The formula to calculate a *G*-statistic is:

\(G = 2 \sum_i x_i \ln(x_i / E_i)\)

where `E` are the expected values. Since this is approximately
chi-square distributed, the p-value can be calculated in R with:

    p <- stats::pchisq(G, df, lower.tail = FALSE)

where `df` are the degrees of freedom.

If there are more than two categories and you want to find out which
ones are significantly different from their null expectation, you can
use the same method of testing each category vs. the sum of all
categories, with the Bonferroni correction. You use *G*-tests for each
category, of course.

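A worked sketch of this calculation, reproducing Example 1 below by
hand:

``` r
x  <- c(772, 1611, 737)                   # observed counts (Example 1 below)
E  <- sum(x) * c(1, 2, 1) / 4             # expected counts under a 1:2:1 ratio
G  <- 2 * sum(x * log(x / E))             # G-statistic
df <- length(x) - 1                       # degrees of freedom
stats::pchisq(G, df, lower.tail = FALSE)  # p-value, matching g.test(x, p = c(1, 2, 1) / 4)
```
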
## References

1. McDonald, J.H. 2014. **Handbook of Biological Statistics (3rd
   ed.)**. Sparky House Publishing, Baltimore, Maryland.

## See also

[`chisq.test()`](https://rdrr.io/r/stats/chisq.test.html)

## Examples

``` r
# = EXAMPLE 1 =
# Shivrain et al. (2006) crossed clearfield rice (which are resistant
# to the herbicide imazethapyr) with red rice (which are susceptible to
# imazethapyr). They then crossed the hybrid offspring and examined the
# F2 generation, where they found 772 resistant plants, 1611 moderately
# resistant plants, and 737 susceptible plants. If resistance is controlled
# by a single gene with two co-dominant alleles, you would expect a 1:2:1
# ratio.

x <- c(772, 1611, 737)
g.test(x, p = c(1, 2, 1) / 4)
#> 
#> G-test of goodness-of-fit (likelihood ratio test)
#> 
#> data: x
#> X-squared = 4.1471, p-value = 0.1257
#> 

# There is no significant difference from a 1:2:1 ratio.
# Meaning: it is plausible that resistance is controlled by a single gene
# with two co-dominant alleles.


# = EXAMPLE 2 =
# Red crossbills (Loxia curvirostra) have the tip of the upper bill either
# right or left of the lower bill, which helps them extract seeds from pine
# cones. Some have hypothesized that frequency-dependent selection would
# keep the number of right and left-billed birds at a 1:1 ratio. Groth (1992)
# observed 1752 right-billed and 1895 left-billed crossbills.

x <- c(1752, 1895)
g.test(x)
#> 
#> G-test of goodness-of-fit (likelihood ratio test)
#> 
#> data: x
#> X-squared = 5.6085, p-value = 0.01787
#> 

# There is a significant difference from a 1:1 ratio.
# Meaning: there are significantly more left-billed birds.
```