* Fix custom reference_data support in as.sir() (#239)
- custom guideline names now correctly classify values as R: CLSI convention
(>= breakpoint_R for MIC, <= for disk) applies only when guideline contains
"CLSI"; all other guidelines including custom ones use the EUCAST convention
(> breakpoint_R for MIC, < for disk)
- guideline argument is now optional when reference_data is manually set: if
omitted or if its value does not match any row in the custom data, all rows
in reference_data are used; if set to a value present in the data, only
matching rows are kept (useful for multi-guideline custom tables)
- host = NA in custom reference_data now acts as a host-agnostic fallback
when no host-specific breakpoint row exists for the current animal species
- updated reference_data argument documentation to explain these conventions
https://claude.ai/code/session_01Q8KtFFGG9qrjAgLJBbxG2U
* Refactor R-classification logic using custom_breakpoints_set flag
Introduce custom_breakpoints_set <- !identical(reference_data, AMR::clinical_breakpoints)
at the top of as_sir_method() and replace all identical() calls inside that
function with this variable.
In the case_when_AMR interpretation blocks (MIC and disk), the R-classification
now has three explicit arms:
- !custom_breakpoints_set & EUCAST guideline -> open interval (> / <)
- !custom_breakpoints_set & CLSI guideline -> closed interval (>= / <=)
- custom_breakpoints_set -> open interval (> / <), always,
regardless of the guideline name in the custom data (e.g. "CLSI_custom"
must not accidentally trigger CLSI convention)
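The three arms can be sketched for MIC values in base R as follows. This is
only an illustration: the helper is_R_mic() and its arguments are hypothetical
and not part of AMR, where the rule lives inside the case_when_AMR blocks.
Disk values mirror it with <= and < instead.

```r
# Hypothetical helper, not part of AMR: decides whether an MIC value is "R".
is_R_mic <- function(value, breakpoint_R, guideline, custom_breakpoints_set) {
  if (!custom_breakpoints_set && grepl("CLSI", guideline, fixed = TRUE)) {
    # built-in CLSI breakpoints: closed interval
    value >= breakpoint_R
  } else {
    # built-in EUCAST breakpoints and all custom data: open interval
    value > breakpoint_R
  }
}

is_R_mic(8, 8, "CLSI 2024", custom_breakpoints_set = FALSE)   # TRUE
is_R_mic(8, 8, "EUCAST 2024", custom_breakpoints_set = FALSE) # FALSE
is_R_mic(8, 8, "CLSI_custom", custom_breakpoints_set = TRUE)  # FALSE
```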
https://claude.ai/code/session_01Q8KtFFGG9qrjAgLJBbxG2U
* Fix unit tests for custom reference_data (#239)
- Do not override my_bp$mo / my_bp$ab in tests: assigning plain character
strips the <mo>/<ab> class, which check_reference_data() rejects. Use the
mo/ab values already present in the source row instead.
- Use NA_character_ instead of NA for my_bp$host so the host column keeps
its character class.
- Pass breakpoint_type = "animal" explicitly in the host-fallback test since
the custom reference_data only contains animal-type breakpoints.
https://claude.ai/code/session_01Q8KtFFGG9qrjAgLJBbxG2U
* Add coerce_reference_data_columns() for lenient reference_data validation
check_reference_data() now returns the (possibly coerced) reference_data and
the call site captures the result so downstream code sees the fixed columns.
A new coerce_reference_data_columns() helper is called before the strict class
check inside check_reference_data(). It coerces columns to the expected types:
- mo -> as.mo() if not already <mo> class
- ab -> as.ab() if not already <ab> class
- character columns -> as.character() (e.g. host = NA becomes NA_character_)
- numeric columns -> as.double()
- logical columns -> as.logical()
This allows users to build a custom reference_data from a plain data.frame
without having to pre-apply as.mo()/as.ab() or worry about NA column types.
Updated the reference_data roxygen argument to document the auto-coercion and
restored the tests to the simpler form that uses plain character assignments,
relying on the new coercion instead of workarounds.
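A rough base-R sketch of the coercion idea, under stated assumptions: the
function coerce_columns() and its `expected` argument are hypothetical, and
the mo/ab arms (as.mo(), as.ab()) are left out so the example runs without
AMR. It is not the actual coerce_reference_data_columns() implementation.

```r
# Hypothetical sketch: coerce columns to an expected base type.
coerce_columns <- function(df, expected) {
  # expected: named character vector, column name -> target type
  for (col in intersect(names(expected), names(df))) {
    df[[col]] <- switch(expected[[col]],
      character = as.character(df[[col]]),
      numeric = as.double(df[[col]]),
      logical = as.logical(df[[col]]),
      df[[col]] # unknown target type: leave the column untouched
    )
  }
  df
}

# host = NA starts out as a logical column, breakpoint_R as integer
bp <- data.frame(host = NA, breakpoint_R = 8L)
bp <- coerce_columns(bp, c(host = "character", breakpoint_R = "numeric"))
class(bp$host)          # "character"
typeof(bp$breakpoint_R) # "double"
```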
https://claude.ai/code/session_01Q8KtFFGG9qrjAgLJBbxG2U
---------
Co-authored-by: Claude <noreply@anthropic.com>
* Fix parallel computing in as.sir.data.frame
Six bugs in parallel = TRUE mode:
1. PSOCK workers (Windows / R < 4.0) never had AMR loaded, so every call to
an exported AMR function failed. Added clusterEvalQ(cl, library(AMR))
with a graceful fallback to sequential when the package cannot be loaded
(e.g. dev-only load_all() environments).
2. clusterExport'd AMR_env was a frozen serialised copy; as.sir() on the
worker wrote to AMR:::AMR_env while run_as_sir_column read from the stale
copy, so the captured log was always wrong. Fixed by resolving AMR_env
dynamically via get("AMR_env", envir = asNamespace("AMR")) inside the
worker function, and removing AMR_env from clusterExport.
3. In the fork-based (mclapply) path each worker inherited the parent's full
sir_interpretation_history. Capturing the whole log then combining across
workers duplicated every pre-existing entry. Fixed by recording the log
row count before the as.sir() call and slicing only the new rows
afterwards.
4. run_as_sir_column used non-exported internals (%pm>%, pm_pull,
as.sir.default) that are inaccessible on PSOCK workers after library(AMR).
Replaced pipe chains with direct as.mic(as.character(x[, col, drop=TRUE]))
and as.disk(...) calls, and changed as.sir.default() to as.sir() which
dispatches correctly via S3.
5. With info = TRUE, worker forks printed per-column progress messages
simultaneously, producing garbled interleaved console output. Per-column
messages are now suppressed inside workers (effective_info = FALSE) while
the outer "Running in parallel" / "DONE" messages still appear.
6. Malformed Unicode escape \u00a (3 hex digits) in the "DONE" banner was
parsed by R as U+00AD (soft hyphen) + "ONE"; corrected to a well-formed
four-digit escape (presumably \u00a0, a non-breaking space).
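The log slicing from item 3 can be sketched in base R. The names log_env,
interpret_column() and run_and_capture() are hypothetical stand-ins for the
shared history and the as.sir() call; only the technique of recording the row
count before the call and slicing afterwards comes from the fix.

```r
# Hypothetical stand-in for the shared log, with one pre-existing entry.
log_env <- new.env()
log_env$history <- data.frame(column = "pre-existing", stringsAsFactors = FALSE)

interpret_column <- function(col) {
  # stands in for as.sir(), which appends rows to the shared log
  log_env$history <- rbind(
    log_env$history,
    data.frame(column = col, stringsAsFactors = FALSE)
  )
}

run_and_capture <- function(col) {
  n_before <- nrow(log_env$history)
  interpret_column(col)
  n_after <- nrow(log_env$history)
  # keep only the rows added by this call, not the inherited history
  log_env$history[n_before + seq_len(n_after - n_before), , drop = FALSE]
}

new_rows <- run_and_capture("AMX")
nrow(new_rows) # 1
```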
https://claude.ai/code/session_012DXCXbZUC54Zij1z9bFiHR
* Add parallel computing tests to test-sir.R
Eight targeted tests verify correctness of the parallel as.sir() path:
identical SIR output vs sequential, matching log row counts, no
pre-existing history duplication, reproducibility across runs, results
consistency across max_cores values, single-column fallback, and no
per-column worker messages leaking when info = TRUE. All pass when only
1 core is available (parallel silently falls back to sequential).
https://claude.ai/code/session_012DXCXbZUC54Zij1z9bFiHR
* Fix as.sir() data.frame: preserve already-<sir> columns, exclude metadata
Issue #278: two related bugs in the column-detection / type-assignment pipeline.
Bug 1 – already-<sir> columns deleted on re-run
Line 886 excluded already-sir columns from the type assignment (they
stayed type "") causing the result loop to do x[,col] <- NULL, deleting
them. Fix: drop the !is.sir() guard so all untyped columns fall through
to type "sir" and are re-processed correctly.
Bug 2 – metadata columns treated as antibiotics
as.ab("patient") -> OXY, as.ab("ward") -> PRU. The column detector
accepted any column whose name matched an antibiotic code, regardless of
content. Fix: for name-matched columns that do not already carry an AMR
class, also verify content looks like AMR data (all_valid_mics, all-
numeric, or any SIR-like string). all_valid_disks() is intentionally
avoided here because it strips letters from strings (as.disk("Pt_1")==1).
Also adds tools/benchmark_parallel.R: a standalone script that times
sequential vs parallel as.sir() across n=20/200/2000/20000 rows and
saves a ggplot2 PNG to tools/benchmark_parallel.png.
https://claude.ai/code/session_012DXCXbZUC54Zij1z9bFiHR
* Update benchmark: two-panel script with warm-up and column-count sweep
Previous single-panel benchmark was misleading: the first sequential run
paid one-time cache-warm-up cost (skewing n=20), and only 6 columns were
used so only 6 cores were ever active on a 16-core machine.
New two-panel design:
Left – vary rows with 16 fixed AB columns (shows memory-bandwidth
saturation for large n)
Right – vary columns with fixed rows (shows the real speedup profile:
parallel wins when n_cols >> 1)
Also adds a warm-up pass before measurements to eliminate first-call bias.
https://claude.ai/code/session_012DXCXbZUC54Zij1z9bFiHR
* Optimise parallel as.sir(): row-batch mode when n_cols < n_cores
Previously parallel dispatch only parallelised by column, so a 6-column
dataset on a 16-core machine used at most 6 cores with the other 10 idle.
For large n this also caused memory-bandwidth saturation (each worker did
a full n-row scan of clinical_breakpoints simultaneously).
New row-batch mode (fork path, R >= 4.0, non-Windows):
pieces_per_col = ceiling(n_cores / n_cols)
Jobs = n_cols × pieces_per_col (≈ n_cores jobs total)
Each job: one column × one row slice
Benefits:
- All cores stay busy regardless of column count
- Per-worker memory footprint shrinks by a factor of pieces_per_col
- Breakpoints lookup cache pressure reduced per worker
PSOCK path (Windows / R < 4.0) is unchanged: per-job serialisation
overhead makes row batching unprofitable there.
run_as_sir_column() gains an optional `rows` parameter (NULL = all rows,
backward-compatible). Results are reassembled via as.sir(c(as.character(.)))
which is safe for already-clean SIR values.
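A minimal sketch of how such a job grid could be built in base R. The helper
make_row_batch_jobs() and the column names are hypothetical; only the
pieces_per_col formula and the one-column-by-one-row-slice job shape come from
the description above.

```r
# Hypothetical helper: builds one job per (column, row slice) pair.
make_row_batch_jobs <- function(cols, n_rows, n_cores) {
  pieces_per_col <- ceiling(n_cores / length(cols))
  pieces_per_col <- min(pieces_per_col, n_rows) # never more slices than rows
  # assign each row index to one of pieces_per_col contiguous slices
  piece_id <- ceiling(seq_len(n_rows) * pieces_per_col / n_rows)
  row_slices <- split(seq_len(n_rows), piece_id)
  jobs <- list()
  for (col in cols) {
    for (rows in row_slices) {
      jobs[[length(jobs) + 1L]] <- list(col = col, rows = rows)
    }
  }
  jobs
}

jobs <- make_row_batch_jobs(cols = paste0("ab", 1:6), n_rows = 1000, n_cores = 16)
length(jobs) # 18: 6 columns x ceiling(16 / 6) = 3 row slices
```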
https://claude.ai/code/session_012DXCXbZUC54Zij1z9bFiHR
* Fix info=FALSE ignored when no breakpoints found in as_sir_method
Operator-precedence bug at line 1601:
if (isTRUE(info) && nrow(df_unique) < 10 || nrow(breakpoints) == 0)
R evaluates && before ||, so this was equivalent to:
(isTRUE(info) && nrow(df_unique) < 10) || (nrow(breakpoints) == 0)
When nrow(breakpoints) == 0 (e.g. cefoxitin / flucloxacillin / mupirocin
against E. coli in EUCAST) the intro message was always printed regardless
of info. Fix: add parentheses so info gates both conditions:
isTRUE(info) && (nrow(df_unique) < 10 || nrow(breakpoints) == 0)
Also pass print = isTRUE(info) to progress_ticker so the progress bar
(which prints intro_txt as its title) is suppressed when info = FALSE.
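The precedence difference can be reproduced with plain values. The variables
n_unique and n_breakpoints are illustrative stand-ins for nrow(df_unique) and
nrow(breakpoints).

```r
info <- FALSE
n_unique <- 50      # stands in for nrow(df_unique)
n_breakpoints <- 0  # stands in for nrow(breakpoints)

# old condition: && binds tighter than ||, so info is bypassed
old_cond <- isTRUE(info) && n_unique < 10 || n_breakpoints == 0
# fixed condition: info gates both sub-conditions
fixed_cond <- isTRUE(info) && (n_unique < 10 || n_breakpoints == 0)

old_cond   # TRUE
fixed_cond # FALSE
```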
https://claude.ai/code/session_012DXCXbZUC54Zij1z9bFiHR
* Fix cli formatting in as.sir() messages
- stop_if for empty ab_cols: wrap as.mic() and as.disk() in
{.help [{.fun ...}](...)} for clickable links in cli output
- Parallel mode message: use {.field col} formatting for column names
and quotes = FALSE in vector_and(), consistent with the rest of the
codebase (avoids double-quoting from both font_bold and quotes="'")
https://claude.ai/code/session_012DXCXbZUC54Zij1z9bFiHR
* Use font_bold() inside {.field} for column names in parallel message
Convention: paste0("{.field ", font_bold(col), "}") gives bold green
column names without quotation marks, consistent with the rest of the
codebase (e.g. the 'Cleaning values' message in run_as_sir_column).
https://claude.ai/code/session_012DXCXbZUC54Zij1z9bFiHR
* Add collapse = NULL to font_bold() for column name vectors
font_bold() without collapse = NULL joins a vector with "" into a single
string, breaking paste0() element-wise formatting for length > 1 vectors.
https://claude.ai/code/session_012DXCXbZUC54Zij1z9bFiHR
* Add tools/ to .Rbuildignore
Keeps the benchmark script out of the built package tarball.
https://claude.ai/code/session_012DXCXbZUC54Zij1z9bFiHR
---------
Co-authored-by: Claude <noreply@anthropic.com>