Calculate the matching score for microorganisms — mo_matching

This helper function is used by as.mo() to determine the most probable match of taxonomic records, based on user input.

mo_matching_score(x, n)

Arguments

x	Any user input value(s)
n	A full taxonomic name that exists in `microorganisms$fullname`

Matching score for microorganisms

With ambiguous user input in as.mo() and all the mo_* functions, the returned results are chosen based on their matching score using mo_matching_score(). This matching score $m$, ranging from 0 to 100%, is calculated as:

$$m_{(x, n)} = \frac{l_{n} - 0.5 \cdot \min\bigl(l_{n},\ \operatorname{lev}(x, n)\bigr)}{l_{n} \cdot p_{n} \cdot k_{n}}$$

where:

$x$ is the user input;
$n$ is a taxonomic name (genus, species and subspecies) as found in microorganisms$fullname;
$l_{n}$ is the length of $n$;
$\operatorname{lev}$ is the Levenshtein distance function;
$p_{n}$ is the human pathogenic prevalence of $n$, categorised into groups $1$, $2$ and $3$ (see Details in ?as.mo), meaning that $p_{n} \in \{1, 2, 3\}$;
$k_{n}$ is the kingdom index of $n$, set as follows: Bacteria = $1$, Fungi = $2$, Protozoa = $3$, Archaea = $4$, and all others = $5$, meaning that $k_{n} \in \{1, 2, 3, 4, 5\}$.

For example, the user input x = "E. coli" gets a matching score of 68.8% for Escherichia coli and a matching score of 7.9% for Entamoeba coli.
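
The two percentages can be reproduced by hand. The working below is a sketch under two assumptions that this page does not state explicitly: that $\operatorname{lev}$ is the plain, case-sensitive Levenshtein distance with unit costs applied to the raw input, and that Escherichia coli has $p_{n} = 1$, $k_{n} = 1$ (Bacteria) while Entamoeba coli has $p_{n} = 3$, $k_{n} = 3$ (Protozoa). The prevalence groups are inferred from the reported scores, not looked up. Under these assumptions, $\operatorname{lev}$ is $10$ for Escherichia coli ($l_{n} = 16$) and $8$ for Entamoeba coli ($l_{n} = 14$):

$$m_{(\text{E. coli},\ \text{Escherichia coli})} = \frac{16 - 0.5 \cdot \min(16, 10)}{16 \cdot 1 \cdot 1} = \frac{11}{16} \approx 0.688$$

$$m_{(\text{E. coli},\ \text{Entamoeba coli})} = \frac{14 - 0.5 \cdot \min(14, 8)}{14 \cdot 3 \cdot 3} = \frac{10}{126} \approx 0.079$$

The string-similarity part of the two scores is close ($11/16$ versus $10/14$). The large gap comes almost entirely from the denominator, which penalises the less prevalent, non-bacterial candidate by a factor of $9$.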

All matches are sorted in descending order of their matching score, and for each user input value the top match is returned.
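
The scoring and ranking described in this section can be sketched in base R. This is an illustrative re-implementation of the documented formula, not the package's source code. The function name `sketch_matching_score()` and its `p` and `k` arguments are hypothetical (the real `mo_matching_score(x, n)` takes only `x` and `n`), and the prevalence groups in the example are assumed values chosen to be consistent with the scores quoted above. `utils::adist()` is used as the Levenshtein distance, with its default unit costs.

```r
# Illustrative sketch of the documented formula; NOT the AMR implementation.
# p (prevalence group) and k (kingdom index) are passed explicitly here,
# which is an assumption made to keep the example self-contained.
sketch_matching_score <- function(x, n, p, k) {
  l_n <- nchar(n)                        # length of each taxonomic name
  lev <- as.vector(utils::adist(x, n))   # Levenshtein distance of x to each n
  (l_n - 0.5 * pmin(l_n, lev)) / (l_n * p * k)
}

# Two candidate names for the ambiguous input "E. coli"
candidates <- data.frame(
  fullname   = c("Entamoeba coli", "Escherichia coli"),
  prevalence = c(3, 1),  # assumed groups, consistent with the documented scores
  kingdom    = c(3, 1),  # Protozoa = 3, Bacteria = 1
  stringsAsFactors = FALSE
)

candidates$score <- sketch_matching_score(
  "E. coli", candidates$fullname, candidates$prevalence, candidates$kingdom
)

# Sort descending on score; the first row is the match that would be returned
ranked <- candidates[order(candidates$score, decreasing = TRUE), ]
top_match <- ranked$fullname[1]
print(ranked)
```

Under these assumptions, Escherichia coli ranks first with a score of 0.6875 and Entamoeba coli follows with roughly 0.079, matching the 68.8% and 7.9% given above.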

Examples

as.mo("E. coli")
mo_uncertainties()

mo_matching_score("E. coli", "Escherichia coli")