This helper function is used by as.mo() to determine the most probable match of taxonomic records, based on user input.
mo_matching_score(x, n)
Argument | Description |
---|---|
x | Any user input value(s) |
n | A full taxonomic name, as found in microorganisms$fullname |
With ambiguous user input in as.mo() and all the mo_* functions, the returned results are chosen based on their matching score using mo_matching_score(). This matching score \(m\), ranging from 0 to 100%, is calculated as:
$$m_{(x, n)} = \frac{l_{n} - 0.5 \cdot \min \begin{cases}l_{n} \\ \operatorname{lev}(x, n)\end{cases}}{l_{n} \cdot p_{n} \cdot k_{n}}$$
where:
\(x\) is the user input;
\(n\) is a taxonomic name (genus, species and subspecies) as found in microorganisms$fullname;
\(l_{n}\) is the length of \(n\);
\(\operatorname{lev}\) is the Levenshtein distance function;
\(p_{n}\) is the human pathogenic prevalence of \(n\), categorised into groups \(1\), \(2\) and \(3\) (see Details in ?as.mo), meaning that \(p \in \{1, 2, 3\}\);
\(k_{n}\) is the kingdom index of \(n\), set as follows: Bacteria = \(1\), Fungi = \(2\), Protozoa = \(3\), Archaea = \(4\), and all others = \(5\), meaning that \(k \in \{1, 2, 3, 4, 5\}\).
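The formula can be sketched in a few lines of base R. This is only an illustration of the equation as written here, not the package's own implementation: the function name matching_score_sketch is hypothetical, the prevalence p and kingdom index k are passed in by hand instead of being looked up in the microorganisms data set, and utils::adist() is assumed to be an acceptable stand-in for the Levenshtein distance.

```r
# Hypothetical sketch of the matching score formula; NOT the AMR package code.
# x: a single user input string; n: a single full taxonomic name.
# p: prevalence group (1, 2 or 3); k: kingdom index (1 to 5).
# Both p and k are supplied manually here, which is an assumption of this sketch.
matching_score_sketch <- function(x, n, p = 1, k = 1) {
  l_n <- nchar(n)                          # length of the taxonomic name
  lev <- as.vector(utils::adist(x, n))     # Levenshtein distance between x and n
  (l_n - 0.5 * pmin(l_n, lev)) / (l_n * p * k)
}

# An input with one wrong character against a 16-character bacterial name
matching_score_sketch("Escherichia coly", "Escherichia coli", p = 1, k = 1)
# 0.96875
```

Because p and k only appear in the denominator, a name in a higher prevalence group or a higher kingdom index always scores lower than an otherwise identical match.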
For example, the user input x = "E. coli" gets a matching score of 68.8% for Escherichia coli and 7.9% for Entamoeba coli.
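These two scores can be reproduced by hand. The values below are assumptions that are consistent with the stated results rather than values quoted from the package data: "Escherichia coli" has 16 characters and a Levenshtein distance of 10 to "E. coli", with \(p = 1\) and \(k = 1\) (Bacteria); "Entamoeba coli" has 14 characters and a distance of 8, with \(p = 3\) and \(k = 3\) (Protozoa).

$$m_{(\text{E. coli},\ \text{Escherichia coli})} = \frac{16 - 0.5 \cdot \min(16, 10)}{16 \cdot 1 \cdot 1} = \frac{11}{16} \approx 68.8\%$$

$$m_{(\text{E. coli},\ \text{Entamoeba coli})} = \frac{14 - 0.5 \cdot \min(14, 8)}{14 \cdot 3 \cdot 3} = \frac{10}{126} \approx 7.9\%$$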
All matches are sorted in descending order of matching score and, for each user input value, the top match is returned.
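The selection step amounts to sorting candidates by score and taking the first one. A minimal base R sketch, using the two scores from the "E. coli" example as hard-coded proportions (the variable names are hypothetical and no package lookup is performed):

```r
# Candidate names with their matching scores for the input "E. coli"
# (values copied from the example in the text, expressed as proportions)
scores <- c("Entamoeba coli" = 0.079, "Escherichia coli" = 0.688)

# Sort descending on matching score and keep the top match
ranked <- sort(scores, decreasing = TRUE)
top_match <- names(ranked)[1]
top_match
# "Escherichia coli"
```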
as.mo("E. coli")
mo_uncertainties()
mo_matching_score("E. coli", "Escherichia coli")