mirror of https://github.com/msberends/AMR.git synced 2026-04-28 13:43:56 +02:00
Commit Graph

21 Commits

Author SHA1 Message Date
Matthijs Berends
8261b91b24 Fix custom reference_data support in as.sir() (#239) (PR #279)
* Fix custom reference_data support in as.sir() (#239)

- custom guideline names now correctly classify values as R: CLSI convention
  (>= breakpoint_R for MIC, <= for disk) applies only when guideline contains
  "CLSI"; all other guidelines including custom ones use the EUCAST convention
  (> breakpoint_R for MIC, < for disk)
- guideline argument is now optional when reference_data is manually set: if
  omitted or if its value does not match any row in the custom data, all rows
  in reference_data are used; if set to a value present in the data, only
  matching rows are filtered — useful for multi-guideline custom tables
- host = NA in custom reference_data now acts as a host-agnostic fallback
  when no host-specific breakpoint row exists for the current animal species
- updated reference_data argument documentation to explain these conventions
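
The row selection described in the list above can be sketched in base R as follows. This is only an illustration under stated assumptions: the helper names filter_custom_rows() and pick_host_rows() and the example table are hypothetical and are not taken from the package source.

```r
# Hypothetical custom breakpoint table; only the column names guideline,
# host and breakpoint_R come from the commit message.
my_bp <- data.frame(
  guideline = c("Lab A", "Lab A", "Lab B"),
  host = c("dogs", NA_character_, NA_character_),
  breakpoint_R = c(4, 8, 16),
  stringsAsFactors = FALSE
)

# Hypothetical helper: use all rows unless `guideline` matches a value in the data.
filter_custom_rows <- function(reference_data, guideline = NULL) {
  if (is.null(guideline) || !guideline %in% reference_data$guideline) {
    return(reference_data)
  }
  reference_data[which(reference_data$guideline == guideline), , drop = FALSE]
}

# Hypothetical helper: prefer host-specific rows, else fall back to host = NA.
pick_host_rows <- function(bp, host) {
  specific <- bp[which(bp$host == host), , drop = FALSE]
  if (nrow(specific) > 0) specific else bp[is.na(bp$host), , drop = FALSE]
}

lab_a <- filter_custom_rows(my_bp, "Lab A")
pick_host_rows(lab_a, "dogs")$breakpoint_R  # host-specific row
pick_host_rows(lab_a, "cats")$breakpoint_R  # host-agnostic fallback row
```

A guideline value that does not occur in the table, such as "EUCAST 2025" here, leaves all three rows in play.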

https://claude.ai/code/session_01Q8KtFFGG9qrjAgLJBbxG2U

* Refactor R-classification logic using custom_breakpoints_set flag

Introduce custom_breakpoints_set <- !identical(reference_data, AMR::clinical_breakpoints)
at the top of as_sir_method() and replace all identical() calls inside that
function with this variable.

In the case_when_AMR interpretation blocks (MIC and disk), the R-classification
now has three explicit arms:
- !custom_breakpoints_set & EUCAST guideline -> open interval (> / <)
- !custom_breakpoints_set & CLSI guideline  -> closed interval (>= / <=)
- custom_breakpoints_set                    -> open interval (> / <), always,
  regardless of the guideline name in the custom data (e.g. "CLSI_custom"
  must not accidentally trigger CLSI convention)
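
A minimal sketch of the three arms, assuming a hypothetical helper is_resistant() that is not part of the package; the real code uses case_when_AMR blocks inside as_sir_method().

```r
# Hypothetical helper illustrating the open vs closed interval convention.
is_resistant <- function(value, breakpoint_R, guideline,
                         custom_breakpoints_set, method = c("mic", "disk")) {
  method <- match.arg(method)
  # The CLSI convention applies only to built-in breakpoints with a CLSI guideline.
  clsi <- !custom_breakpoints_set && grepl("CLSI", guideline, fixed = TRUE)
  if (method == "mic") {
    if (clsi) value >= breakpoint_R else value > breakpoint_R
  } else {
    if (clsi) value <= breakpoint_R else value < breakpoint_R
  }
}

# A MIC exactly on the R breakpoint:
is_resistant(8, 8, "CLSI 2025", custom_breakpoints_set = FALSE)    # TRUE
is_resistant(8, 8, "EUCAST 2025", custom_breakpoints_set = FALSE)  # FALSE
is_resistant(8, 8, "CLSI_custom", custom_breakpoints_set = TRUE)   # FALSE
```

The last call shows why the flag matters: a custom guideline named "CLSI_custom" keeps the open interval even though its name contains "CLSI".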

https://claude.ai/code/session_01Q8KtFFGG9qrjAgLJBbxG2U

* Fix unit tests for custom reference_data (#239)

- Do not override my_bp$mo / my_bp$ab in tests: assigning plain character
  strips the <mo>/<ab> class, which check_reference_data() rejects. Use the
  mo/ab values already present in the source row instead.
- Use NA_character_ instead of NA for my_bp$host so the host column keeps
  its character class.
- Pass breakpoint_type = "animal" explicitly in the host-fallback test since
  the custom reference_data only contains animal-type breakpoints.

https://claude.ai/code/session_01Q8KtFFGG9qrjAgLJBbxG2U

* Add coerce_reference_data_columns() for lenient reference_data validation

check_reference_data() now returns the (possibly coerced) reference_data and
the call site captures the result so downstream code sees the fixed columns.

A new coerce_reference_data_columns() helper is called before the strict class
check inside check_reference_data(). It coerces columns to the expected types:
- mo  -> as.mo() if not already <mo> class
- ab  -> as.ab() if not already <ab> class
- character columns -> as.character() (e.g. host = NA becomes NA_character_)
- numeric columns  -> as.double()
- logical columns  -> as.logical()

This allows users to build a custom reference_data from a plain data.frame
without having to pre-apply as.mo()/as.ab() or worry about NA column types.
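
The type coercion for plain columns might look roughly like the sketch below. The function coerce_columns_sketch() and the `expected` mapping are hypothetical, and the mo/ab arms are left out because they need as.mo()/as.ab() from the package itself.

```r
# Hypothetical stand-in for the coercion step; `expected` maps column names
# to the type each column should have.
coerce_columns_sketch <- function(df, expected) {
  for (col in intersect(names(expected), names(df))) {
    df[[col]] <- switch(expected[[col]],
      character = as.character(df[[col]]),
      numeric   = as.double(df[[col]]),
      logical   = as.logical(df[[col]]),
      df[[col]]  # unknown type: leave the column untouched
    )
  }
  df
}

# host = NA is logical by default, breakpoint_R is integer, flag is character.
# The column name `flag` is made up for this example.
my_bp <- data.frame(host = NA, breakpoint_R = 8L, flag = "FALSE",
                    stringsAsFactors = FALSE)
fixed <- coerce_columns_sketch(
  my_bp,
  expected = c(host = "character", breakpoint_R = "numeric", flag = "logical")
)
sapply(fixed, class)
```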

Updated the reference_data roxygen argument to document the auto-coercion and
restored the tests to the simpler form that uses plain character assignments,
relying on the new coercion instead of workarounds.

https://claude.ai/code/session_01Q8KtFFGG9qrjAgLJBbxG2U

---------

Co-authored-by: Claude <noreply@anthropic.com>
2026-04-25 14:38:01 +02:00
Matthijs Berends
19157ce718 Fix parallel computing in as.sir.data.frame (#276)
* Fix parallel computing in as.sir.data.frame

Six bugs in parallel = TRUE mode:

1. PSOCK workers (Windows / R < 4.0) never had AMR loaded, so every
   exported/AMR function call failed. Added clusterEvalQ(cl, library(AMR))
   with a graceful fallback to sequential when the package cannot be loaded
   (e.g. dev-only load_all() environments).

2. clusterExport'd AMR_env was a frozen serialised copy; as.sir() on the
   worker wrote to AMR:::AMR_env while run_as_sir_column read from the stale
   copy, so the captured log was always wrong. Fixed by resolving AMR_env
   dynamically via get("AMR_env", envir = asNamespace("AMR")) inside the
   worker function, and removing AMR_env from clusterExport.

3. In the fork-based (mclapply) path each worker inherited the parent's full
   sir_interpretation_history. Capturing the whole log then combining across
   workers duplicated every pre-existing entry. Fixed by recording the log
   row count before the as.sir() call and slicing only the new rows
   afterwards.

4. run_as_sir_column used non-exported internals (%pm>%, pm_pull,
   as.sir.default) that are inaccessible on PSOCK workers after library(AMR).
   Replaced pipe chains with direct as.mic(as.character(x[, col, drop=TRUE]))
   and as.disk(...) calls, and changed as.sir.default() to as.sir() which
   dispatches correctly via S3.

5. With info = TRUE, worker forks printed per-column progress messages
   simultaneously, producing garbled interleaved console output. Per-column
   messages are now suppressed inside workers (effective_info = FALSE) while
   the outer "Running in parallel" / "DONE" messages still appear.

6. Malformed Unicode escape \u00a (3 hex digits) in the "DONE" banner was
   parsed by R as U+00AD (soft hyphen) + "ONE"; corrected to \u00a0
   (non-breaking space).
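
The fix for item 3 (record the log row count, then keep only the new rows) can be illustrated with a toy log. Everything here is a hypothetical stand-in: log_env, interpret_column() and run_and_capture_new_rows() only mimic the shape of the interpretation history and are not package code.

```r
# Toy log that already holds an entry before any worker runs.
log_env <- new.env()
log_env$history <- data.frame(column = "pre-existing", stringsAsFactors = FALSE)

# Stand-in for as.sir(): appends one row to the log per processed column.
interpret_column <- function(col) {
  log_env$history <- rbind(
    log_env$history,
    data.frame(column = col, stringsAsFactors = FALSE)
  )
  invisible(NULL)
}

# Record the row count first, then return only the rows added by this call.
run_and_capture_new_rows <- function(col) {
  n_before <- nrow(log_env$history)
  interpret_column(col)
  n_after <- nrow(log_env$history)
  if (n_after > n_before) {
    log_env$history[seq(n_before + 1, n_after), , drop = FALSE]
  } else {
    log_env$history[0, , drop = FALSE]
  }
}

captured <- lapply(c("AMX", "CIP"), run_and_capture_new_rows)
combined <- do.call(rbind, captured)
combined$column  # the pre-existing entry is not repeated
```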

https://claude.ai/code/session_012DXCXbZUC54Zij1z9bFiHR

* Add parallel computing tests to test-sir.R

Eight targeted tests verify correctness of the parallel as.sir() path:
identical SIR output vs sequential, matching log row counts, no
pre-existing history duplication, reproducibility across runs, results
consistency across max_cores values, single-column fallback, and no
per-column worker messages leaking when info = TRUE. All pass when only
1 core is available (parallel silently falls back to sequential).

https://claude.ai/code/session_012DXCXbZUC54Zij1z9bFiHR

* Fix as.sir() data.frame: preserve already-<sir> columns, exclude metadata

Issue #278: two related bugs in the column-detection / type-assignment pipeline.

Bug 1 – already-<sir> columns deleted on re-run
  Line 886 excluded already-sir columns from the type assignment (they
  stayed type "") causing the result loop to do x[,col] <- NULL, deleting
  them.  Fix: drop the !is.sir() guard so all untyped columns fall through
  to type "sir" and are re-processed correctly.

Bug 2 – metadata columns treated as antibiotics
  as.ab("patient") -> OXY, as.ab("ward") -> PRU.  The column detector
  accepted any column whose name matched an antibiotic code, regardless of
  content.  Fix: for name-matched columns that do not already carry an AMR
  class, also verify content looks like AMR data (all_valid_mics, all-
  numeric, or any SIR-like string).  all_valid_disks() is intentionally
  avoided here because it strips letters from strings (as.disk("Pt_1")==1).
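
A rough sketch of the content check from Bug 2, assuming a hypothetical helper looks_like_amr_data(). The regular expressions and the set of SIR-like strings are simplifications chosen for this example and do not reproduce all_valid_mics().

```r
# Hypothetical content check for a column whose name matched an antibiotic code.
looks_like_amr_data <- function(x) {
  x <- as.character(x[!is.na(x)])
  if (length(x) == 0) {
    return(FALSE)
  }
  all_numeric  <- all(grepl("^[0-9]+([.][0-9]+)?$", x))
  all_mic_like <- all(grepl("^(<=|>=|<|>)?[0-9]+([.][0-9]+)?$", x))
  any_sir_like <- any(toupper(x) %in% c("S", "I", "R"))
  all_numeric || all_mic_like || any_sir_like
}

looks_like_amr_data(c("Pt_1", "Pt_2"))       # FALSE: patient identifiers
looks_like_amr_data(c("<=0.5", "2", ">16"))  # TRUE: MIC-like values
looks_like_amr_data(c("S", "R", NA))         # TRUE: SIR-like values
```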

Also adds tools/benchmark_parallel.R: a standalone script that times
sequential vs parallel as.sir() across n=20/200/2000/20000 rows and
saves a ggplot2 PNG to tools/benchmark_parallel.png.

https://claude.ai/code/session_012DXCXbZUC54Zij1z9bFiHR

* Update benchmark: two-panel script with warm-up and column-count sweep

Previous single-panel benchmark was misleading: the first sequential run
paid one-time cache-warm-up cost (skewing n=20), and only 6 columns were
used so only 6 cores were ever active on a 16-core machine.

New two-panel design:
  Left  – vary rows with 16 fixed AB columns (shows memory-bandwidth
          saturation for large n)
  Right – vary columns with fixed rows (shows the real speedup profile:
          parallel wins when n_cols >> 1)

Also adds a warm-up pass before measurements to eliminate first-call bias.

https://claude.ai/code/session_012DXCXbZUC54Zij1z9bFiHR

* Optimise parallel as.sir(): row-batch mode when n_cols < n_cores

Previously parallel dispatch only parallelised by column, so a 6-column
dataset on a 16-core machine used at most 6 cores with the other 10 idle.
For large n this also caused memory-bandwidth saturation (each worker did
a full n-row scan of clinical_breakpoints simultaneously).

New row-batch mode (fork path, R >= 4.0, non-Windows):
  pieces_per_col = ceil(n_cores / n_cols)
  Jobs = n_cols × pieces_per_col  (≈ n_cores jobs total)
  Each job: one column × one row slice

Benefits:
  - All cores stay busy regardless of column count
  - Per-worker memory footprint shrinks by a factor of pieces_per_col
  - Breakpoints lookup cache pressure reduced per worker
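
The job layout can be sketched in base R as below. plan_row_batches() is a hypothetical helper; the way rows are divided into slices and the cap on the number of pieces are assumptions made for this example.

```r
# Hypothetical planner: one job per (column, row slice) pair.
plan_row_batches <- function(n_rows, n_cols, n_cores) {
  stopifnot(n_rows >= 1, n_cols >= 1, n_cores >= 1)
  pieces_per_col <- ceiling(n_cores / n_cols)
  # Assumption: never create more pieces than there are rows.
  pieces_per_col <- min(pieces_per_col, n_rows)
  # Assign each row index to one of pieces_per_col roughly equal slices.
  slice_id <- ceiling(seq_len(n_rows) * pieces_per_col / n_rows)
  row_slices <- split(seq_len(n_rows), slice_id)
  jobs <- expand.grid(col = seq_len(n_cols), piece = seq_along(row_slices))
  list(pieces_per_col = pieces_per_col, row_slices = row_slices, jobs = jobs)
}

plan <- plan_row_batches(n_rows = 20000, n_cols = 6, n_cores = 16)
plan$pieces_per_col       # 3
nrow(plan$jobs)           # 18
lengths(plan$row_slices)  # three slices that together cover all 20000 rows
```

With 6 columns on 16 cores this yields 18 jobs, close to the core count, instead of the 6 jobs that column-only dispatch produces.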

PSOCK path (Windows / R < 4.0) is unchanged: per-job serialisation
overhead makes row batching unprofitable there.

run_as_sir_column() gains an optional `rows` parameter (NULL = all rows,
backward-compatible). Results are reassembled via as.sir(c(as.character(.)))
which is safe for already-clean SIR values.

https://claude.ai/code/session_012DXCXbZUC54Zij1z9bFiHR

* Fix info=FALSE ignored when no breakpoints found in as_sir_method

Operator-precedence bug at line 1601:

  if (isTRUE(info) && nrow(df_unique) < 10 || nrow(breakpoints) == 0)

R evaluates && before ||, so this was equivalent to:

  (isTRUE(info) && nrow(df_unique) < 10) || (nrow(breakpoints) == 0)

When nrow(breakpoints) == 0 (e.g. cefoxitin / flucloxacillin / mupirocin
against E. coli in EUCAST) the intro message was always printed regardless
of info. Fix: add parentheses so info gates both conditions:

  isTRUE(info) && (nrow(df_unique) < 10 || nrow(breakpoints) == 0)

Also pass print = isTRUE(info) to progress_ticker so the progress bar
(which prints intro_txt as its title) is suppressed when info = FALSE.
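
The precedence issue can be reproduced with plain scalars. In this sketch n_unique and n_breakpoints stand in for nrow(df_unique) and nrow(breakpoints); the values are made up.

```r
info <- FALSE
n_unique <- 50       # stands in for nrow(df_unique)
n_breakpoints <- 0   # stands in for nrow(breakpoints)

# && binds tighter than ||, so the empty-breakpoints test bypasses `info`.
buggy <- isTRUE(info) && n_unique < 10 || n_breakpoints == 0

# With parentheses, `info` gates both conditions.
fixed <- isTRUE(info) && (n_unique < 10 || n_breakpoints == 0)

buggy  # TRUE: the message would be printed despite info = FALSE
fixed  # FALSE
```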

https://claude.ai/code/session_012DXCXbZUC54Zij1z9bFiHR

* Fix cli formatting in as.sir() messages

- stop_if for empty ab_cols: wrap as.mic() and as.disk() in
  {.help [{.fun ...}](...)} for clickable links in cli output
- Parallel mode message: use {.field col} formatting for column names
  and quotes = FALSE in vector_and(), consistent with the rest of the
  codebase (avoids double-quoting from both font_bold and quotes="'")

https://claude.ai/code/session_012DXCXbZUC54Zij1z9bFiHR

* Use font_bold() inside {.field} for column names in parallel message

Convention: paste0("{.field ", font_bold(col), "}") gives bold green
column names without quotation marks, consistent with the rest of the
codebase (e.g. the 'Cleaning values' message in run_as_sir_column).

https://claude.ai/code/session_012DXCXbZUC54Zij1z9bFiHR

* Add collapse = NULL to font_bold() for column name vectors

font_bold() without collapse = NULL joins a vector with "" into a single
string, breaking paste0() element-wise formatting for length > 1 vectors.
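
The effect of the collapse argument can be shown with a stand-in. bold_sketch() is hypothetical and only mimics the behaviour described above; the internal font_bold() may differ in detail.

```r
# Hypothetical stand-in: wraps text in ANSI bold codes and, by default,
# collapses the vector into a single string.
bold_sketch <- function(x, collapse = "") {
  paste0("\033[1m", x, "\033[22m", collapse = collapse)
}

cols <- c("AMX", "CIP")

# Default collapse: both names are merged into one string before wrapping.
joined <- paste0("{.field ", bold_sketch(cols), "}")

# collapse = NULL keeps one element per column name.
per_col <- paste0("{.field ", bold_sketch(cols, collapse = NULL), "}")

length(joined)   # 1
length(per_col)  # 2
```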

https://claude.ai/code/session_012DXCXbZUC54Zij1z9bFiHR

* Add tools/ to .Rbuildignore

Keeps the benchmark script out of the built package tarball.

https://claude.ai/code/session_012DXCXbZUC54Zij1z9bFiHR

---------

Co-authored-by: Claude <noreply@anthropic.com>
2026-04-25 00:34:38 +02:00
3a736bc484 (v3.0.1.9041) add breakpoints 2026 2026-03-30 10:01:49 +02:00
499c830ee7 (v3.0.1.9020) unit test fixes 2026-02-09 13:16:36 +01:00
225c73f7e7 (v3.0.1.9004) Revamp as.sir() interpretation for capped MICs
Fixes #243
Fixes #244
2025-12-15 13:18:13 +01:00
4d7c4ca52c (v3.0.0.9027) skimr update and as.ab warning - fixes #234, fixes #232 2025-09-10 13:32:52 +02:00
5e6bbdf3d1 (v2.1.1.9267) update ATCs 2025-05-01 11:48:49 +02:00
d2b3937a90 (v2.1.1.9257) adjust unit tests 2025-04-27 09:58:19 +02:00
f340e257fa (v2.1.1.9256) unit tests 2025-04-26 21:29:50 +02:00
be13934fe7 (v2.1.1.9249) unit test 2025-04-20 17:49:47 +02:00
492fe6872f (v2.1.1.9244) automated GPT training data 2025-04-19 15:57:12 +02:00
579025f678 (v2.1.1.9241) fix sir 2025-04-18 13:25:59 +02:00
40d7a971c3 (v2.1.1.9236) documentation 2025-04-12 11:46:42 +02:00
36fd99e1f4 (v2.1.1.9235) New website! 2025-04-08 15:54:30 +01:00
8deaf2c8eb (v2.1.1.9224) skip tests on cran 2025-03-20 23:29:21 +01:00
58d7aa8790 (v2.1.1.9199) fix eucast 2025-03-14 13:43:22 +01:00
861331b1df (v2.1.1.9196) fix eucast, unit tests 2025-03-13 15:38:39 +01:00
9aab129ea6 (v2.1.1.9195) add BTL-S, fix ranks in unknown microorganisms 2025-03-13 14:30:14 +01:00
f7938289eb (v2.1.1.9186) replace antibiotics with antimicrobials! 2025-03-07 20:43:26 +01:00
07efc292bc (v2.1.1.9163) cleanup 2025-02-27 14:04:29 +01:00
f03933940c (v2.1.1.9131) implement testthat 2025-01-27 21:43:10 +01:00