Optimise parallel as.sir(): row-batch mode when n_cols < n_cores

Previously parallel dispatch only parallelised by column, so a 6-column dataset on a 16-core machine used at most 6 cores with the other 10 idle. For large n this also caused memory-bandwidth saturation (each worker did a full n-row scan of clinical_breakpoints simultaneously).

New row-batch mode (fork path, R >= 4.0, non-Windows):

  pieces_per_col = ceil(n_cores / n_cols)
  Jobs = n_cols × pieces_per_col  (≈ n_cores jobs total)
  Each job: one column × one row slice

Benefits:
- All cores stay busy regardless of column count
- Per-worker memory footprint shrinks by pieces_per_col ×
- Breakpoints lookup cache pressure reduced per worker

PSOCK path (Windows / R < 4.0) is unchanged: per-job serialisation overhead makes row batching unprofitable there.

run_as_sir_column() gains an optional `rows` parameter (NULL = all rows, backward-compatible). Results are reassembled via as.sir(c(as.character(.))) which is safe for already-clean SIR values.

https://claude.ai/code/session_012DXCXbZUC54Zij1z9bFiHR
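
The job layout described in the commit message (pieces_per_col = ceil(n_cores / n_cols), one job per column × row slice) can be sketched as below. This is an illustration only, not the code from this commit: `make_row_batch_jobs()` and its arguments are hypothetical names, and splitting rows into contiguous, near-equal slices is an assumption about how the row slices are formed.

```r
# Hypothetical helper (not part of the package): build the list of
# (column, row-slice) jobs for row-batch mode.
make_row_batch_jobs <- function(n_rows, n_cols, n_cores) {
  # Number of row slices per column, as stated in the commit message
  pieces_per_col <- ceiling(n_cores / n_cols)
  # Assumption: never create more slices than there are rows
  pieces_per_col <- max(1L, min(pieces_per_col, n_rows))

  # Assumption: rows are split into contiguous, near-equal slices
  piece_id <- ceiling(seq_len(n_rows) * pieces_per_col / n_rows)
  row_slices <- unname(split(seq_len(n_rows), piece_id))

  # One job per column x row slice
  jobs <- list()
  for (col in seq_len(n_cols)) {
    for (rows in row_slices) {
      jobs[[length(jobs) + 1L]] <- list(col = col, rows = rows)
    }
  }
  jobs
}

# The example from the commit message: 6 columns on a 16-core machine
jobs <- make_row_batch_jobs(n_rows = 1000, n_cols = 6, n_cores = 16)
length(jobs)  # 18 jobs: ceiling(16 / 6) = 3 slices for each of the 6 columns
```

With these numbers the job count (18) slightly exceeds n_cores (16), which matches the "≈ n_cores jobs total" wording above. Each job would then be handed to a worker, and the per-slice results for a column concatenated in slice order.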
2026-05-31 21:41:54 +02:00 · 2026-04-24 22:01:09 +00:00
parent d770469a63
commit 060449e234
3 changed files with 78 additions and 16 deletions
--- a/NEWS.md
+++ b/NEWS.md
@@ -37,6 +37,7 @@
 * Fixed BRMO classification by including bacterial complexes (#275)
 * Fixed `as.sir()` for data frames silently deleting columns whose AB class was already `<sir>` when called a second time (re-running on already-converted data) (#278)
 * Fixed `as.sir()` for data frames incorrectly treating metadata columns (e.g. `patient`, `ward`) as antibiotic columns when their names coincidentally matched an antibiotic code; column content is now validated against AMR data patterns before inclusion
+* Improved parallel computing in `as.sir()`: when the number of AB columns is smaller than the number of available cores, rows are now split into batches so all cores stay active (row-batch mode). Previously, a 6-column dataset on a 16-core machine would only use 6 cores; now all 16 are used, with each worker processing a smaller row slice (lower per-worker memory pressure)

 ### Updates
 * Extensive `cli` integration for better message handling and clickable links in messages and warnings (#191, #265)