(v1.3.0.9030) matching score update

2026-07-18 13:10:57 +02:00 · 2020-09-26 16:26:01 +02:00
parent 9667c2994f
commit 050a9a04fb
33 changed files with 249 additions and 175 deletions
--- a/docs/reference/as.mo.html
+++ b/docs/reference/as.mo.html
@@ -82,7 +82,7 @@
      </button>
      <span class="navbar-brand">
        <a class="navbar-link" href="../index.html">AMR (for R)</a>
-        <span class="version label label-default" data-toggle="tooltip" data-placement="bottom" title="Latest development version">1.3.0.9028</span>
+        <span class="version label label-default" data-toggle="tooltip" data-placement="bottom" title="Latest development version">1.3.0.9030</span>
      </span>
    </div>

@@ -366,21 +366,6 @@
 <p>Group 2 consists of all microorganisms where the taxonomic phylum is Proteobacteria, Firmicutes, Actinobacteria or Sarcomastigophora, or where the taxonomic genus is <em>Aspergillus</em>, <em>Bacteroides</em>, <em>Candida</em>, <em>Capnocytophaga</em>, <em>Chryseobacterium</em>, <em>Cryptococcus</em>, <em>Elisabethkingia</em>, <em>Flavobacterium</em>, <em>Fusobacterium</em>, <em>Giardia</em>, <em>Leptotrichia</em>, <em>Mycoplasma</em>, <em>Prevotella</em>, <em>Rhodotorula</em>, <em>Treponema</em>, <em>Trichophyton</em> or <em>Ureaplasma</em>. This group consequently contains all less common and rare human pathogens.</p>
 <p>Group 3 (least prevalent microorganisms) consists of all other microorganisms. This group contains microorganisms most probably not found in humans.</p>

-<h3>Background on matching scores</h3>
-
-
-<p>With ambiguous user input, the returned results are chosen based on their matching score using <code><a href='mo_matching_score.html'>mo_matching_score()</a></code>. This matching score is based on four parameters:</p><ol>
-<li><p>The prevalence \(P\) is categorised into group 1, 2 and 3 as stated above;</p></li>
-<li><p>A kingdom index \(K\) is set as follows: Bacteria = 1, Fungi = 2, Protozoa = 3, Archaea = 4, and all others = 5;</p></li>
-<li><p>The level of uncertainty \(U\) needed to get to the result, as stated above (1 to 3);</p></li>
-<li><p>The <a href='https://en.wikipedia.org/wiki/Levenshtein_distance'>Levenshtein distance</a> \(L\) is the distance between the user input and all taxonomic full names, with the text length of the user input being the maximum distance. A modified version of the Levenshtein distance \(L'\) based on the text length of the full name \(F\) is calculated as:</p></li>
-</ol>
-
-<p>$$L' = 1 - \frac{0.5L}{F}$$</p>
-<p>The final matching score \(M\) is calculated as:
-$$M = L' \times \frac{1}{P K U} = \frac{F - 0.5L}{F P K U}$$</p>
-<p>All matches are sorted descending on their matching score and for all user input values, the top match will be returned.</p>
-
    <h2 class="hasAnchor" id="source"><a class="anchor" href="#source"></a>Source</h2>

    
@@ -399,6 +384,22 @@ $$M = L' \times \frac{1}{P K U} = \frac{F - 0.5L}{F P K U}$$</p>
 <p><img src='figures/lifecycle_stable.svg' style=margin-bottom:5px /> <br />
 The <a href='lifecycle.html'>lifecycle</a> of this function is <strong>stable</strong>. In a stable function, major changes are unlikely. This means that the unlying code will generally evolve by adding new arguments; removing arguments or changing the meaning of existing arguments will be avoided.</p>
 <p>If the unlying code needs breaking changes, they will occur gradually. For example, a parameter will be deprecated and first continue to work, but will emit an message informing you of the change. Next, typically after at least one newly released version on CRAN, the message will be transformed to an error.</p>
+    <h2 class="hasAnchor" id="matching-score-for-microorganisms"><a class="anchor" href="#matching-score-for-microorganisms"></a>Matching score for microorganisms</h2>
+
+    
+
+<p>With ambiguous user input in <code>as.mo()</code> and all the <code><a href='mo_property.html'>mo_*</a></code> functions, the returned results are chosen based on their matching score using <code><a href='mo_matching_score.html'>mo_matching_score()</a></code>. This matching score \(m\) is calculated as:</p>
+<p>$$m_{(x, n)} = \frac{l_{n} - 0.5 \times \min \begin{cases}l_{n} \\ \operatorname{lev}(x, n)\end{cases}}{l_{n} p k}$$</p>
+<p>where:</p><ul>
+<li><p>\(x\) is the user input;</p></li>
+<li><p>\(n\) is a taxonomic name (genus, species and subspecies);</p></li>
+<li><p>\(l_{n}\) is the length of the taxonomic name;</p></li>
+<li><p>\(\operatorname{lev}\) is the <a href='https://en.wikipedia.org/wiki/Levenshtein_distance'>Levenshtein distance</a> function;</p></li>
+<li><p>\(p\) is the human pathogenic prevalence, categorised into group \(1\), \(2\) and \(3\) (see <em>Details</em> in <code>?as.mo</code>), meaning that \(p = \{1, 2 , 3\}\);</p></li>
+<li><p>\(k\) is the kingdom index, set as follows: Bacteria = \(1\), Fungi = \(2\), Protozoa = \(3\), Archaea = \(4\), and all others = \(5\), meaning that \(k = \{1, 2 , 3, 4, 5\}\).</p></li>
+</ul>
+
+<p>All matches are sorted descending on their matching score and for all user input values, the top match will be returned.</p>
    <h2 class="hasAnchor" id="catalogue-of-life"><a class="anchor" href="#catalogue-of-life"></a>Catalogue of Life</h2>

    
--- a/docs/reference/index.html
+++ b/docs/reference/index.html
@@ -81,7 +81,7 @@
      </button>
      <span class="navbar-brand">
        <a class="navbar-link" href="../index.html">AMR (for R)</a>
-        <span class="version label label-default" data-toggle="tooltip" data-placement="bottom" title="Latest development version">1.3.0.9029</span>
+        <span class="version label label-default" data-toggle="tooltip" data-placement="bottom" title="Latest development version">1.3.0.9030</span>
      </span>
    </div>

--- a/docs/reference/join.html
+++ b/docs/reference/join.html
@@ -82,7 +82,7 @@
      </button>
      <span class="navbar-brand">
        <a class="navbar-link" href="../index.html">AMR (for R)</a>
-        <span class="version label label-default" data-toggle="tooltip" data-placement="bottom" title="Latest development version">1.3.0.9026</span>
+        <span class="version label label-default" data-toggle="tooltip" data-placement="bottom" title="Latest development version">1.3.0.9029</span>
      </span>
    </div>

@@ -263,7 +263,7 @@
    </tr>
    <tr>
      <th>by</th>
-      <td><p>a variable to join by - if left empty will search for a column with class <code><a href='as.mo.html'>mo</a></code> (created with <code><a href='as.mo.html'>as.mo()</a></code>) or will be <code>"mo"</code> if that column name exists in <code>x</code>, could otherwise be a column name of <code>x</code> with values that exist in <code>microorganisms$mo</code> (like <code>by = "bacteria_id"</code>), or another column in <a href='microorganisms.html'>microorganisms</a> (but then it should be named, like <code>by = c("my_genus_species" = "fullname")</code>)</p></td>
+      <td><p>a variable to join by - if left empty will search for a column with class <code><a href='as.mo.html'>mo</a></code> (created with <code><a href='as.mo.html'>as.mo()</a></code>) or will be <code>"mo"</code> if that column name exists in <code>x</code>, could otherwise be a column name of <code>x</code> with values that exist in <code>microorganisms$mo</code> (like <code>by = "bacteria_id"</code>), or another column in <a href='microorganisms.html'>microorganisms</a> (but then it should be named, like <code>by = c("bacteria_id" = "fullname")</code>)</p></td>
    </tr>
    <tr>
      <th>suffix</th>
@@ -278,7 +278,7 @@
    <h2 class="hasAnchor" id="details"><a class="anchor" href="#details"></a>Details</h2>

    <p><strong>Note:</strong> As opposed to the <code>join()</code> functions of <code>dplyr</code>, <a href='https://rdrr.io/r/base/character.html'>character</a> vectors are supported and at default existing columns will get a suffix <code>"2"</code> and the newly joined columns will not get a suffix.</p>
-<p>These functions rely on <code><a href='https://rdrr.io/r/base/merge.html'>merge()</a></code>, a base R function to do joins.</p>
+<p>If the <code>dplyr</code> package is installed, their join functions will be used. Otherwise, the much slower <code><a href='https://rdrr.io/r/base/merge.html'>merge()</a></code> function from base R will be used.</p>
    <h2 class="hasAnchor" id="stable-lifecycle"><a class="anchor" href="#stable-lifecycle"></a>Stable lifecycle</h2>

    
--- a/docs/reference/mo_matching_score.html
+++ b/docs/reference/mo_matching_score.html
@@ -82,7 +82,7 @@
      </button>
      <span class="navbar-brand">
        <a class="navbar-link" href="../index.html">AMR (for R)</a>
-        <span class="version label label-default" data-toggle="tooltip" data-placement="bottom" title="Latest development version">1.3.0.9028</span>
+        <span class="version label label-default" data-toggle="tooltip" data-placement="bottom" title="Latest development version">1.3.0.9030</span>
      </span>
    </div>

@@ -242,7 +242,7 @@
    <p>This helper function is used by <code><a href='as.mo.html'>as.mo()</a></code> to determine the most probable match of taxonomic records, based on user input.</p>
    </div>

-    <pre class="usage"><span class='fu'>mo_matching_score</span>(<span class='kw'>x</span>, <span class='kw'>fullname</span>, uncertainty = <span class='fl'>1</span>)</pre>
+    <pre class="usage"><span class='fu'>mo_matching_score</span>(<span class='kw'>x</span>, <span class='kw'>n</span>)</pre>

    <h2 class="hasAnchor" id="arguments"><a class="anchor" href="#arguments"></a>Arguments</h2>
    <table class="ref-arguments">
@@ -252,7 +252,7 @@
      <td><p>Any user input value(s)</p></td>
    </tr>
    <tr>
-      <th>fullname</th>
+      <th>n</th>
      <td><p>A full taxonomic name, that exists in <code><a href='microorganisms.html'>microorganisms$fullname</a></code></p></td>
    </tr>
    <tr>
@@ -261,22 +261,28 @@
    </tr>
    </table>

-    <h2 class="hasAnchor" id="details"><a class="anchor" href="#details"></a>Details</h2>
+    <h2 class="hasAnchor" id="matching-score-for-microorganisms"><a class="anchor" href="#matching-score-for-microorganisms"></a>Matching score for microorganisms</h2>

-    <p>The matching score is based on four parameters:</p><ol>
-<li><p>A human pathogenic prevalence \(P\), that is categorised into group 1, 2 and 3 (see <code><a href='as.mo.html'>as.mo()</a></code>);</p></li>
-<li><p>A kingdom index \(K\) is set as follows: Bacteria = 1, Fungi = 2, Protozoa = 3, Archaea = 4, and all others = 5;</p></li>
-<li><p>The level of uncertainty \(U\) that is needed to get to a result (1 to 3, see <code><a href='as.mo.html'>as.mo()</a></code>);</p></li>
-<li><p>The <a href='https://en.wikipedia.org/wiki/Levenshtein_distance'>Levenshtein distance</a> \(L\) is the distance between the user input and all taxonomic full names, with the text length of the user input being the maximum distance. A modified version of the Levenshtein distance \(L'\) based on the text length of the full name \(F\) is calculated as:</p></li>
-</ol>
+    

-<p>$$L' = 1 - \frac{0.5L}{F}$$</p>
-<p>The final matching score \(M\) is calculated as:
-$$M = L' \times \frac{1}{P K U} = \frac{F - 0.5L}{F P K U}$$</p>
+<p>With ambiguous user input in <code><a href='as.mo.html'>as.mo()</a></code> and all the <code><a href='mo_property.html'>mo_*</a></code> functions, the returned results are chosen based on their matching score using <code>mo_matching_score()</code>. This matching score \(m\) is calculated as:</p>
+<p>$$m_{(x, n)} = \frac{l_{n} - 0.5 \times \min \begin{cases}l_{n} \\ \operatorname{lev}(x, n)\end{cases}}{l_{n} p k}$$</p>
+<p>where:</p><ul>
+<li><p>\(x\) is the user input;</p></li>
+<li><p>\(n\) is a taxonomic name (genus, species and subspecies);</p></li>
+<li><p>\(l_{n}\) is the length of the taxonomic name;</p></li>
+<li><p>\(\operatorname{lev}\) is the <a href='https://en.wikipedia.org/wiki/Levenshtein_distance'>Levenshtein distance</a> function;</p></li>
+<li><p>\(p\) is the human pathogenic prevalence, categorised into group \(1\), \(2\) and \(3\) (see <em>Details</em> in <code><a href='as.mo.html'>?as.mo</a></code>), meaning that \(p = \{1, 2 , 3\}\);</p></li>
+<li><p>\(k\) is the kingdom index, set as follows: Bacteria = \(1\), Fungi = \(2\), Protozoa = \(3\), Archaea = \(4\), and all others = \(5\), meaning that \(k = \{1, 2 , 3, 4, 5\}\).</p></li>
+</ul>
+
+<p>All matches are sorted descending on their matching score and for all user input values, the top match will be returned.</p>

    <h2 class="hasAnchor" id="examples"><a class="anchor" href="#examples"></a>Examples</h2>
    <pre class="examples"><span class='fu'><a href='as.mo.html'>as.mo</a></span>(<span class='st'>"E. coli"</span>)
 <span class='fu'><a href='as.mo.html'>mo_uncertainties</a></span>()
+
+<span class='fu'>mo_matching_score</span>(<span class='st'>"E. coli"</span>, <span class='st'>"Escherichia coli"</span>)
 </pre>
  </div>
  <div class="col-md-3 hidden-xs hidden-sm" id="pkgdown-sidebar">
--- a/docs/reference/mo_property.html
+++ b/docs/reference/mo_property.html
@@ -82,7 +82,7 @@
      </button>
      <span class="navbar-brand">
        <a class="navbar-link" href="../index.html">AMR (for R)</a>
-        <span class="version label label-default" data-toggle="tooltip" data-placement="bottom" title="Latest development version">1.3.0.9026</span>
+        <span class="version label label-default" data-toggle="tooltip" data-placement="bottom" title="Latest development version">1.3.0.9030</span>
      </span>
    </div>

@@ -346,6 +346,22 @@
 <p><img src='figures/lifecycle_stable.svg' style=margin-bottom:5px /> <br />
 The <a href='lifecycle.html'>lifecycle</a> of this function is <strong>stable</strong>. In a stable function, major changes are unlikely. This means that the unlying code will generally evolve by adding new arguments; removing arguments or changing the meaning of existing arguments will be avoided.</p>
 <p>If the unlying code needs breaking changes, they will occur gradually. For example, a parameter will be deprecated and first continue to work, but will emit an message informing you of the change. Next, typically after at least one newly released version on CRAN, the message will be transformed to an error.</p>
+    <h2 class="hasAnchor" id="matching-score-for-microorganisms"><a class="anchor" href="#matching-score-for-microorganisms"></a>Matching score for microorganisms</h2>
+
+    
+
+<p>With ambiguous user input in <code><a href='as.mo.html'>as.mo()</a></code> and all the <code>mo_*</code> functions, the returned results are chosen based on their matching score using <code><a href='mo_matching_score.html'>mo_matching_score()</a></code>. This matching score \(m\) is calculated as:</p>
+<p>$$m_{(x, n)} = \frac{l_{n} - 0.5 \times \min \begin{cases}l_{n} \\ \operatorname{lev}(x, n)\end{cases}}{l_{n} p k}$$</p>
+<p>where:</p><ul>
+<li><p>\(x\) is the user input;</p></li>
+<li><p>\(n\) is a taxonomic name (genus, species and subspecies);</p></li>
+<li><p>\(l_{n}\) is the length of the taxonomic name;</p></li>
+<li><p>\(\operatorname{lev}\) is the <a href='https://en.wikipedia.org/wiki/Levenshtein_distance'>Levenshtein distance</a> function;</p></li>
+<li><p>\(p\) is the human pathogenic prevalence, categorised into group \(1\), \(2\) and \(3\) (see <em>Details</em> in <code><a href='as.mo.html'>?as.mo</a></code>), meaning that \(p = \{1, 2 , 3\}\);</p></li>
+<li><p>\(k\) is the kingdom index, set as follows: Bacteria = \(1\), Fungi = \(2\), Protozoa = \(3\), Archaea = \(4\), and all others = \(5\), meaning that \(k = \{1, 2 , 3, 4, 5\}\).</p></li>
+</ul>
+
+<p>All matches are sorted descending on their matching score and for all user input values, the top match will be returned.</p>
    <h2 class="hasAnchor" id="catalogue-of-life"><a class="anchor" href="#catalogue-of-life"></a>Catalogue of Life</h2>