mirror of https://github.com/msberends/AMR.git synced 2026-04-28 15:43:47 +02:00
Commit Graph

52 Commits

Author SHA1 Message Date
Matthijs Berends
19157ce718 Fix parallel computing in as.sir.data.frame (#276)
* Fix parallel computing in as.sir.data.frame

Six bugs in parallel = TRUE mode:

1. PSOCK workers (Windows / R < 4.0) never had AMR loaded, so every
   exported/AMR function call failed. Added clusterEvalQ(cl, library(AMR))
   with a graceful fallback to sequential when the package cannot be loaded
   (e.g. dev-only load_all() environments).

2. clusterExport'd AMR_env was a frozen serialised copy; as.sir() on the
   worker wrote to AMR:::AMR_env while run_as_sir_column read from the stale
   copy, so the captured log was always wrong. Fixed by resolving AMR_env
   dynamically via get("AMR_env", envir = asNamespace("AMR")) inside the
   worker function, and removing AMR_env from clusterExport.

3. In the fork-based (mclapply) path each worker inherited the parent's full
   sir_interpretation_history. Capturing the whole log then combining across
   workers duplicated every pre-existing entry. Fixed by recording the log
   row count before the as.sir() call and slicing only the new rows
   afterwards.

4. run_as_sir_column used non-exported internals (%pm>%, pm_pull,
   as.sir.default) that are inaccessible on PSOCK workers after library(AMR).
   Replaced pipe chains with direct as.mic(as.character(x[, col, drop=TRUE]))
   and as.disk(...) calls, and changed as.sir.default() to as.sir() which
   dispatches correctly via S3.

5. With info = TRUE, worker forks printed per-column progress messages
   simultaneously, producing garbled interleaved console output. Per-column
   messages are now suppressed inside workers (effective_info = FALSE) while
   the outer "Running in parallel" / "DONE" messages still appear.

6. Malformed Unicode escape \u00a (3 hex digits) in the "DONE" banner was
   parsed by R as U+00AD (soft hyphen) + "ONE"; corrected to \u00a0.
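
Items 3 and 6 can be illustrated in base R. This is a minimal sketch, not the package's code: `log_env`, `history` and `fake_as_sir()` are hypothetical stand-ins for AMR_env, sir_interpretation_history and as.sir(), and the second part assumes the intended banner character was a non-breaking space (U+00A0).

```r
# --- Item 3: keep only the log rows added by this call ---
# Hypothetical stand-in for the shared interpretation log
log_env <- new.env()
log_env$history <- data.frame(id = 1:3, note = "pre-existing")

# Hypothetical worker step that appends two rows to the log
fake_as_sir <- function(env) {
  env$history <- rbind(env$history, data.frame(id = 4:5, note = "new"))
  invisible(NULL)
}

n_before <- nrow(log_env$history)   # record the row count first
fake_as_sir(log_env)
n_after <- nrow(log_env$history)

# Slice only the rows that were added, ignoring inherited history
new_rows <- log_env$history[seq_len(n_after - n_before) + n_before, , drop = FALSE]

# --- Item 6: a 3-digit \u escape swallows a following hex digit ---
broken <- "\u00aDONE"    # "D" is a hex digit, so this is U+00AD then "ONE"
fixed  <- "\u00a0DONE"   # assumed fix: U+00A0 then "DONE"

broken_codes <- utf8ToInt(broken)   # code points of each character
fixed_codes  <- utf8ToInt(fixed)
```

Combining `new_rows` across workers then avoids repeating the three pre-existing entries, and comparing the two code point vectors shows where the "D" went.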

https://claude.ai/code/session_012DXCXbZUC54Zij1z9bFiHR

* Add parallel computing tests to test-sir.R

Eight targeted tests verify correctness of the parallel as.sir() path:
identical SIR output vs sequential, matching log row counts, no
pre-existing history duplication, reproducibility across runs, results
consistency across max_cores values, single-column fallback, and no
per-column worker messages leaking when info = TRUE. All pass when only
1 core is available (parallel silently falls back to sequential).

https://claude.ai/code/session_012DXCXbZUC54Zij1z9bFiHR

* Fix as.sir() data.frame: preserve already-<sir> columns, exclude metadata

Issue #278: two related bugs in the column-detection / type-assignment pipeline.

Bug 1 – already-<sir> columns deleted on re-run
  Line 886 excluded already-sir columns from the type assignment (they
  stayed type "") causing the result loop to do x[,col] <- NULL, deleting
  them.  Fix: drop the !is.sir() guard so all untyped columns fall through
  to type "sir" and are re-processed correctly.

Bug 2 – metadata columns treated as antibiotics
  as.ab("patient") -> OXY, as.ab("ward") -> PRU.  The column detector
  accepted any column whose name matched an antibiotic code, regardless of
  content.  Fix: for name-matched columns that do not already carry an AMR
  class, also verify content looks like AMR data (all_valid_mics, all-
  numeric, or any SIR-like string).  all_valid_disks() is intentionally
  avoided here because it strips letters from strings (as.disk("Pt_1")==1).
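
The content check in Bug 2 could look roughly like the sketch below. It uses base R only; `looks_like_amr_content()`, its regular expression and the list of SIR codes are assumptions made for illustration and are not the package's actual helpers.

```r
# Hypothetical content check: TRUE when values look like MIC/numeric
# data or contain at least one SIR-like string
looks_like_amr_content <- function(x) {
  x <- trimws(as.character(x[!is.na(x)]))
  if (length(x) == 0) return(FALSE)
  # Optional comparison sign followed by a number; covers plain numbers too
  mic_like <- all(grepl("^(<=|>=|<|>)?[0-9]+([.][0-9]+)?$", x))
  sir_like <- any(toupper(x) %in% c("S", "I", "R"))
  mic_like || sir_like
}

patient <- c("Pt_1", "Pt_2", "Pt_3")   # metadata column
amx     <- c("S", "R", "I")            # SIR-like content
mic     <- c("<=0.5", "2", ">=32")     # MIC-like content

is_patient_amr <- looks_like_amr_content(patient)
is_amx_amr     <- looks_like_amr_content(amx)
is_mic_amr     <- looks_like_amr_content(mic)

# Why a letter-stripping check is unsafe: an ID collapses to a number
stripped <- gsub("[^0-9]", "", "Pt_1")
```

Here only `patient` is rejected, while stripping non-digits from "Pt_1" leaves "1", which a letter-stripping validator would accept as a measurement.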

Also adds tools/benchmark_parallel.R: a standalone script that times
sequential vs parallel as.sir() across n=20/200/2000/20000 rows and
saves a ggplot2 PNG to tools/benchmark_parallel.png.

https://claude.ai/code/session_012DXCXbZUC54Zij1z9bFiHR

* Update benchmark: two-panel script with warm-up and column-count sweep

Previous single-panel benchmark was misleading: the first sequential run
paid one-time cache-warm-up cost (skewing n=20), and only 6 columns were
used so only 6 cores were ever active on a 16-core machine.

New two-panel design:
  Left  – vary rows with 16 fixed AB columns (shows memory-bandwidth
          saturation for large n)
  Right – vary columns with fixed rows (shows the real speedup profile:
          parallel wins when n_cols >> 1)

Also adds a warm-up pass before measurements to eliminate first-call bias.

https://claude.ai/code/session_012DXCXbZUC54Zij1z9bFiHR

* Optimise parallel as.sir(): row-batch mode when n_cols < n_cores

Previously parallel dispatch only parallelised by column, so a 6-column
dataset on a 16-core machine used at most 6 cores with the other 10 idle.
For large n this also caused memory-bandwidth saturation (each worker did
a full n-row scan of clinical_breakpoints simultaneously).

New row-batch mode (fork path, R >= 4.0, non-Windows):
  pieces_per_col = ceil(n_cores / n_cols)
  Jobs = n_cols × pieces_per_col  (≈ n_cores jobs total)
  Each job: one column × one row slice

Benefits:
  - All cores stay busy regardless of column count
  - Per-worker memory footprint shrinks by a factor of pieces_per_col
  - Breakpoints lookup cache pressure reduced per worker

PSOCK path (Windows / R < 4.0) is unchanged: per-job serialisation
overhead makes row batching unprofitable there.

run_as_sir_column() gains an optional `rows` parameter (NULL = all rows,
backward-compatible). Results are reassembled via as.sir(c(as.character(.)))
which is safe for already-clean SIR values.
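
The job layout can be sketched in base R as follows. This is an illustration under assumptions, not the package's implementation: the contiguous slicing scheme is one possible choice, and `toupper()` merely stands in for the per-slice as.sir() work.

```r
n_cores <- 16L
n_cols  <- 6L
n_rows  <- 20000L

# Number of row slices per column so that jobs roughly match cores
pieces_per_col <- ceiling(n_cores / n_cols)

# One job per (column, row slice) combination
jobs <- expand.grid(col = seq_len(n_cols), piece = seq_len(pieces_per_col))

# Contiguous row slices of near-equal size (assumed slicing scheme)
row_ids    <- seq_len(n_rows)
row_slices <- split(row_ids, ceiling(row_ids * pieces_per_col / n_rows))

# Stand-in workload for one column: process each slice separately
set.seed(1)
values <- sample(c("s", "i", "r"), n_rows, replace = TRUE)
pieces <- lapply(row_slices, function(rows) toupper(values[rows]))

# Reassemble the slices in their original row order
reassembled <- unlist(pieces, use.names = FALSE)
```

With 16 cores and 6 columns this gives 3 slices per column and 18 jobs, and the reassembled vector matches processing the column in one go.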

https://claude.ai/code/session_012DXCXbZUC54Zij1z9bFiHR

* Fix info=FALSE ignored when no breakpoints found in as_sir_method

Operator-precedence bug at line 1601:

  if (isTRUE(info) && nrow(df_unique) < 10 || nrow(breakpoints) == 0)

R evaluates && before ||, so this was equivalent to:

  (isTRUE(info) && nrow(df_unique) < 10) || (nrow(breakpoints) == 0)

When nrow(breakpoints) == 0 (e.g. cefoxitin / flucloxacillin / mupirocin
against E. coli in EUCAST) the intro message was always printed regardless
of info. Fix: add parentheses so info gates both conditions:

  isTRUE(info) && (nrow(df_unique) < 10 || nrow(breakpoints) == 0)

Also pass print = isTRUE(info) to progress_ticker so the progress bar
(which prints intro_txt as its title) is suppressed when info = FALSE.
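
The precedence difference is easy to reproduce in isolation. In this sketch `n_unique` and `n_breakpoints` are hypothetical stand-ins for nrow(df_unique) and nrow(breakpoints).

```r
info          <- FALSE
n_unique      <- 50L   # stands in for nrow(df_unique)
n_breakpoints <- 0L    # stands in for nrow(breakpoints)

# Old form: && binds tighter than ||, so the right-hand side alone
# can make the whole condition TRUE even when info is FALSE
old_condition <- isTRUE(info) && n_unique < 10 || n_breakpoints == 0

# Fixed form: info gates both alternatives
new_condition <- isTRUE(info) && (n_unique < 10 || n_breakpoints == 0)
```

With `info = FALSE` and no breakpoints, `old_condition` is TRUE while `new_condition` is FALSE, which matches the behaviour described above.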

https://claude.ai/code/session_012DXCXbZUC54Zij1z9bFiHR

* Fix cli formatting in as.sir() messages

- stop_if for empty ab_cols: wrap as.mic() and as.disk() in
  {.help [{.fun ...}](...)} for clickable links in cli output
- Parallel mode message: use {.field col} formatting for column names
  and quotes = FALSE in vector_and(), consistent with the rest of the
  codebase (avoids double-quoting from both font_bold and quotes="'")

https://claude.ai/code/session_012DXCXbZUC54Zij1z9bFiHR

* Use font_bold() inside {.field} for column names in parallel message

Convention: paste0("{.field ", font_bold(col), "}") gives bold green
column names without quotation marks, consistent with the rest of the
codebase (e.g. the 'Cleaning values' message in run_as_sir_column).

https://claude.ai/code/session_012DXCXbZUC54Zij1z9bFiHR

* Add collapse = NULL to font_bold() for column name vectors

font_bold() without collapse = NULL joins a vector with "" into a single
string, breaking paste0() element-wise formatting for length > 1 vectors.

https://claude.ai/code/session_012DXCXbZUC54Zij1z9bFiHR

* Add tools/ to .Rbuildignore

Keeps the benchmark script out of the built package tarball.

https://claude.ai/code/session_012DXCXbZUC54Zij1z9bFiHR

---------

Co-authored-by: Claude <noreply@anthropic.com>
2026-04-25 00:34:38 +02:00
Matthijs Berends
2c21eba04c add CLAUDE.md with project context for Claude Code (#261)
* add CLAUDE.md with project context for Claude Code

Provides development commands, architecture overview, file conventions,
custom S3 classes, data files, testing setup, and versioning guidelines
to help Claude Code assist effectively in this repository.

https://claude.ai/code/session_01L3fTxqsg3Gc6J1znpWN1Mx

* add CLAUDE.md to .Rbuildignore

Excludes the Claude Code context file from the R package build tarball.

https://claude.ai/code/session_01L3fTxqsg3Gc6J1znpWN1Mx

* document version-bump requirement for every PR in CLAUDE.md

Each PR must increment the .9zzz dev counter by 1 in both
DESCRIPTION (Version: field) and NEWS.md (top-level heading).

https://claude.ai/code/session_01L3fTxqsg3Gc6J1znpWN1Mx

---------

Co-authored-by: Claude <noreply@anthropic.com>
2026-02-27 17:13:11 +01:00
3ba1b8a10a (v3.0.0.9022) postpone new features - we like a clearly focussed bugfix release first 2025-09-03 15:39:44 +02:00
c70ac149ff new WISCA vignette 2025-04-30 17:33:03 +02:00
4a336d040c (v2.1.1.9250) Automated README and index.md 2025-04-21 15:37:26 +02:00
15fc72fc66 (v2.1.1.9121) support tidymodels 2024-12-19 20:17:15 +01:00
87271d261a add Python package to repo 2024-11-21 10:06:26 +01:00
5c4d8fcd2a (v2.1.1.9095) Python support 2024-10-15 17:12:55 +02:00
08f7256852 include unit tests again 2023-07-13 12:45:14 +01:00
ddd01f9410 test to remove unit tests from build 2023-07-13 09:42:37 +01:00
0bcf55d3b6 improve as.mo() 2023-05-24 15:55:53 +02:00
303d61b473 new tibble export 2022-08-27 20:49:37 +02:00
d6676e9443 disk documentation fix 2022-08-21 16:52:09 +02:00
952d16de33 new, automated website 2022-08-21 16:37:20 +02:00
7226b70c3d update languages 2022-08-20 20:17:14 +02:00
ccb09706e4 v1.8.1 2022-03-24 23:05:04 +01:00
f5dcf0ad58 v1.8.0 as accepted by CRAN 2022-01-07 16:27:13 +01:00
a2d249962f (v1.7.1.9023) Removed filter_ functions, new set_ab_names(), ATC code update, ab selector update, fixes #46 and fixed #47 2021-08-16 21:54:34 +02:00
d277d58475 (v1.6.0.9002) R-3.0 installation fix 2021-04-12 14:24:40 +02:00
1737d56ae4 (v1.5.0.9026) vignette update, support for GISA 2021-02-25 12:31:12 +01:00
286eaa9699 (v1.5.0.9010) MDRO vignette update, get_episode for < day 2021-01-24 14:48:56 +01:00
c8bcecf232 (v1.4.0.9037) random_* functions 2020-12-12 23:17:29 +01:00
791bb6d33f (v1.3.0) remove vignettes from CRAN 2020-07-31 11:39:56 +02:00
c5f7294381 (v1.3.0) skip more CRAN tests 2020-07-31 10:50:08 +02:00
76fc8e1b14 (v1.2.0.9026) move to github 2020-07-08 14:48:06 +02:00
e2d05cb1b0 (v0.8.0.9017) keywords update 2019-11-06 14:43:23 +01:00
10e6b225e7 (v0.7.1.9107) v0.8.0 2019-10-15 14:35:23 +02:00
00cdb498a0 (v0.7.1.9102) lintr 2019-10-11 17:21:02 +02:00
398c5bdc4f (v0.7.1.9073) as.mo() self-learning algorithm 2019-09-15 22:57:30 +02:00
2667fff8a7 (v0.6.1.9050) support staged install 2019-06-01 20:40:49 +02:00
461eec9bac cfta streptococci, codecov.yml 2019-04-09 14:59:17 +02:00
30b559827c documentation update 2019-04-07 22:40:02 +02:00
fb1fc3686c Catalogue of life 2019-02-20 00:04:48 +01:00
46dcc7e2e8 set_mo_source 2019-01-21 15:53:01 +01:00
b48e609afe gitlab pkg cache 2019-01-05 09:50:22 +01:00
b92c392dd4 gitlab ci 2019-01-04 12:13:02 +01:00
eab3c9dac8 gitlab ci 2019-01-04 10:41:18 +01:00
6652f7d82b fix warnings 2019-01-04 09:49:42 +01:00
fd646fe1fc introduction of Packrat 2018-12-30 08:40:40 +01:00
92a32b62a7 new website, freq updates 2018-12-29 22:24:19 +01:00
456f3e8773 gitlab pages fix 2018-12-23 21:33:44 +01:00
92d2553dfe gitlab pages 2018-12-23 21:26:21 +01:00
b937662a97 limits for scale_y_percent - Licence update 2018-12-16 22:45:12 +01:00
5cfa5bbfe3 v0.5.0 2018-11-30 16:16:04 +01:00
87757748bd switch to gitlab 2018-10-23 16:49:40 +02:00
029157b3be 163 new trade names, added ab_tradenames 2018-08-29 12:27:37 +02:00
e5193c7749 AppVeyor 2018-08-13 23:05:53 +02:00
1ba7d883fe new ggplot enhancement 2018-08-11 21:30:00 +02:00
fc30d3fb13 freq: support for table 2018-07-09 14:02:58 +02:00
uscloud
dcc26dd942 Update freq function 2018-05-22 16:34:22 +02:00