Optimise parallel as.sir(): row-batch mode when n_cols < n_cores

Previously parallel dispatch only parallelised by column, so a 6-column dataset on a 16-core machine used at most 6 cores with the other 10 idle. For large n this also caused memory-bandwidth saturation (each worker did a full n-row scan of clinical_breakpoints simultaneously). New row-batch mode (fork path, R >= 4.0, non-Windows): pieces_per_col = ceil(n_cores / n_cols) Jobs = n_cols × pieces_per_col (≈ n_cores jobs total) Each job: one column × one row slice Benefits: - All cores stay busy regardless of column count - Per-worker memory footprint shrinks by pieces_per_col × - Breakpoints lookup cache pressure reduced per worker PSOCK path (Windows / R < 4.0) is unchanged: per-job serialisation overhead makes row batching unprofitable there. run_as_sir_column() gains an optional `rows` parameter (NULL = all rows, backward-compatible). Results are reassembled via as.sir(c(as.character(.))) which is safe for already-clean SIR values. https://claude.ai/code/session_012DXCXbZUC54Zij1z9bFiHR
2026-07-19 21:10:58 +02:00 · 2026-04-24 22:01:09 +00:00
parent d770469a63
commit 060449e234
3 changed files with 78 additions and 16 deletions
--- a/tests/testthat/test-sir.R
+++ b/tests/testthat/test-sir.R
@@ -502,6 +502,21 @@ test_that("test-sir.R", {
  sir_single_par <- suppressMessages(as.sir(df_single, col_mo = "mo", info = FALSE, parallel = TRUE))
  expect_identical(sir_single_seq[["AMC"]], sir_single_par[["AMC"]])

+  # 9. row-batch mode (n_cols < n_cores): force row splitting via max_cores and
+  #    verify identical output to sequential for a dataset with 2 AB columns so
+  #    pieces_per_col = ceiling(max_cores / 2) >= 2 and row batching activates
+  df_wide <- data.frame(
+    mo  = "B_ESCHR_COLI",
+    AMC = as.mic(sample(c("1", "2", "4", "8"), n_par, TRUE)),
+    GEN = as.mic(sample(c("1", "2", "4", "8"), n_par, TRUE)),
+    stringsAsFactors = FALSE
+  )
+  sir_wide_seq <- suppressMessages(as.sir(df_wide, col_mo = "mo", info = FALSE))
+  sir_wide_par <- suppressMessages(as.sir(df_wide, col_mo = "mo", info = FALSE,
+                                          parallel = TRUE, max_cores = 8L))
+  expect_identical(sir_wide_seq[["AMC"]], sir_wide_par[["AMC"]])
+  expect_identical(sir_wide_seq[["GEN"]], sir_wide_par[["GEN"]])
+
  # 8. info = TRUE with parallel does not produce per-column worker messages
  #    (messages should only appear in the main process, not duplicated from workers)
  msgs <- capture.output(