Previously, parallel dispatch split work by column only, so a 6-column
dataset on a 16-core machine used at most 6 cores and left the other 10 idle.
For large n this also saturated memory bandwidth, because each worker did
a full n-row scan of clinical_breakpoints at the same time.
New row-batch mode (fork path, R >= 4.0, non-Windows):
pieces_per_col = ceil(n_cores / n_cols)
Jobs = n_cols × pieces_per_col (≈ n_cores jobs total, rounded up; e.g. 6
columns on 16 cores gives 3 pieces per column and 18 jobs)
Each job: one column × one row slice
Benefits:
- All cores stay busy regardless of column count
- Per-worker memory footprint shrinks by a factor of pieces_per_col
- Breakpoint lookup cache pressure is reduced per worker
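The job layout described above can be sketched in base R. This is a minimal illustration, not the package's implementation: make_row_batch_jobs() and its arguments are hypothetical names, and the contiguous, near-equal row slicing is an assumption.

```r
# Hypothetical helper: build one job per (column, row slice) pair.
make_row_batch_jobs <- function(n_rows, cols, n_cores) {
  pieces_per_col <- ceiling(n_cores / length(cols))
  # assumption: never create more slices than there are rows
  pieces_per_col <- max(1L, min(pieces_per_col, n_rows))
  # assign each row index to one of pieces_per_col contiguous slices
  slice_id <- ceiling(seq_len(n_rows) * pieces_per_col / n_rows)
  row_slices <- unname(split(seq_len(n_rows), slice_id))
  jobs <- list()
  for (col in cols) {
    for (rows in row_slices) {
      jobs[[length(jobs) + 1L]] <- list(col = col, rows = rows)
    }
  }
  jobs
}

# Conditions named above for the fork path (non-Windows, R >= 4.0).
use_fork <- .Platform$OS.type != "windows" && getRversion() >= "4.0.0"

# 6 columns on 16 cores: ceiling(16 / 6) = 3 slices per column, 18 jobs.
jobs <- make_row_batch_jobs(n_rows = 1000, cols = paste0("ab", 1:6), n_cores = 16)
length(jobs)
```

Each job carries only its own row indices, so a worker touches roughly n_rows / pieces_per_col rows instead of the full column.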
PSOCK path (Windows / R < 4.0) is unchanged: per-job serialisation
overhead makes row batching unprofitable there.
run_as_sir_column() gains an optional `rows` parameter (NULL = all rows,
backward-compatible). Results are reassembled via as.sir(c(as.character(.))),
which is safe for already-clean SIR values.
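The `rows` argument and the character-based reassembly can be illustrated with a toy stand-in. interpret_column() below is hypothetical and uses a made-up threshold rule rather than real breakpoints; an ordered factor stands in for the <sir> class so the sketch runs without the AMR package.

```r
# Hypothetical stand-in for run_as_sir_column(): interprets a whole column,
# or only a row slice of it when `rows` is given (NULL = all rows).
interpret_column <- function(values, rows = NULL) {
  if (!is.null(rows)) values <- values[rows]
  # toy rule for illustration only, NOT real breakpoint logic
  out <- ifelse(values <= 2, "S", ifelse(values <= 8, "I", "R"))
  factor(out, levels = c("S", "I", "R"), ordered = TRUE)
}

mic <- c(0.5, 1, 4, 16, 2, 8, 32, 0.25)
row_slices <- list(1:3, 4:6, 7:8)

# One piece per row slice, as a row-batch worker would return it.
pieces <- lapply(row_slices, function(r) interpret_column(mic, rows = r))

# Convert each piece to character before concatenating, then re-class once.
# With AMR loaded, the last step would be as.sir() on the character vector.
combined <- unlist(lapply(pieces, as.character))
reassembled <- factor(combined, levels = c("S", "I", "R"), ordered = TRUE)

identical(reassembled, interpret_column(mic))
```

Going through character avoids relying on how c() combines classed vectors, which is why the values must already be clean S/I/R strings at that point.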
https://claude.ai/code/session_012DXCXbZUC54Zij1z9bFiHR
Issue #278: two related bugs in the column-detection / type-assignment pipeline.
Bug 1 – already-<sir> columns deleted on re-run
Line 886 excluded already-<sir> columns from the type assignment (they
stayed type ""), which caused the result loop to run x[,col] <- NULL and
delete them. Fix: drop the !is.sir() guard so all untyped columns fall
through to type "sir" and are re-processed correctly.
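A toy reproduction of Bug 1, under stated assumptions: assign_types(), apply_results() and is_sir_toy() are hypothetical names that only mimic the shape of the logic described above, not the package's actual code.

```r
# Toy stand-in for is.sir(): a column counts as <sir> if it has class "sir".
is_sir_toy <- function(v) inherits(v, "sir")

# Assign a type to each candidate column. With skip_already_sir = TRUE the
# old guard is simulated and already-<sir> columns keep the empty type "".
assign_types <- function(df, cols, skip_already_sir) {
  types <- setNames(rep("", length(cols)), cols)
  for (col in cols) {
    if (skip_already_sir && is_sir_toy(df[[col]])) next
    types[[col]] <- "sir"
  }
  types
}

# Result loop: columns that ended up untyped are dropped.
apply_results <- function(df, types) {
  for (col in names(types)) {
    if (types[[col]] == "") df[, col] <- NULL
  }
  df
}

df <- data.frame(id = 1:3)
df$AMX <- structure(c("S", "R", "I"), class = "sir")

old <- apply_results(df, assign_types(df, "AMX", skip_already_sir = TRUE))
new <- apply_results(df, assign_types(df, "AMX", skip_already_sir = FALSE))

names(old)  # AMX is gone on re-run
names(new)  # AMX is kept
```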
Bug 2 – metadata columns treated as antibiotics
as.ab("patient") -> OXY, as.ab("ward") -> PRU. The column detector
accepted any column whose name matched an antibiotic code, regardless of
content. Fix: for name-matched columns that do not already carry an AMR
class, also verify that the content looks like AMR data (all_valid_mics(),
all-numeric, or any SIR-like string). all_valid_disks() is intentionally
avoided here because it strips letters from strings (as.disk("Pt_1") == 1).
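The content check for Bug 2 could look roughly like the sketch below. looks_like_amr_content() and strip_letters() are hypothetical helpers written in base R; the regular expressions are assumptions and are far simpler than the package's own validators.

```r
# Hypothetical content check: TRUE if the values look like MICs, plain
# numbers, or contain at least one SIR-like string.
looks_like_amr_content <- function(values) {
  v <- trimws(as.character(values))
  v <- v[!is.na(v) & v != ""]
  if (length(v) == 0) return(FALSE)
  all_numeric <- all(grepl("^[0-9]+([.][0-9]+)?$", v))
  # MIC-like: optional comparison sign, then a number, e.g. "<=0.5" or ">32"
  all_mic_like <- all(grepl("^(<=|>=|<|>|=)?[0-9]+([.][0-9]+)?$", v))
  any_sir_like <- any(toupper(v) %in% c("S", "I", "R"))
  all_numeric || all_mic_like || any_sir_like
}

# Why a letter-stripping check is too lenient: removing everything except
# digits and dots makes an identifier such as "Pt_1" look numeric.
strip_letters <- function(v) as.numeric(gsub("[^0-9.]", "", v))

looks_like_amr_content(c("Pt_1", "Pt_2"))      # metadata column
looks_like_amr_content(c("<=0.5", "2", ">32")) # MIC-like column
strip_letters("Pt_1")
```

A name match alone (for example a column called "patient") is then not enough; the column also has to pass this kind of content test before it is treated as an antibiotic.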
Also adds tools/benchmark_parallel.R: a standalone script that times
sequential vs parallel as.sir() across n=20/200/2000/20000 rows and
saves a ggplot2 PNG to tools/benchmark_parallel.png.
Eight targeted tests verify correctness of the parallel as.sir() path:
identical SIR output vs sequential, matching log row counts, no
duplication of pre-existing history, reproducibility across runs,
consistent results across max_cores values, single-column fallback, and no
per-column worker messages leaking when info = TRUE. All tests also pass when
only 1 core is available (parallel silently falls back to sequential).
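The "identical output vs sequential" check can be sketched with the base parallel package. This is an illustration only: interpret_slice() is a hypothetical toy classifier, and the real tests call as.sir() rather than this stand-in.

```r
library(parallel)

# Toy classifier standing in for the per-slice work (not real breakpoints).
interpret_slice <- function(values) {
  ifelse(values <= 2, "S", ifelse(values <= 8, "I", "R"))
}

set.seed(1)
mic <- sample(c(0.25, 0.5, 1, 2, 4, 8, 16, 32), 200, replace = TRUE)

# Four contiguous row slices of 50 rows each.
slices <- split(seq_along(mic), rep(1:4, each = 50))

# Forking is unavailable on Windows, where mclapply() requires mc.cores = 1.
cores <- if (.Platform$OS.type == "windows") 1L else 2L

par_out <- unlist(
  mclapply(slices, function(r) interpret_slice(mic[r]), mc.cores = cores),
  use.names = FALSE
)
seq_out <- interpret_slice(mic)

# The row-batched result must match the sequential result exactly.
stopifnot(identical(par_out, seq_out))
```

Because mclapply() returns results in input order, concatenating the slices restores the original row order without any extra bookkeeping.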