P281424/AMR - AMR - Gitea RUG

mirror of https://github.com/msberends/AMR.git synced 2026-04-28 12:23:54 +02:00

Author	SHA1	Message	Date
Matthijs Berends	8261b91b24	Fix custom reference_data support in as.sir() (#239) (PR #279)
* Fix custom reference_data support in as.sir() (#239)
  - custom guideline names now correctly classify values as R: CLSI convention (>= breakpoint_R for MIC, <= for disk) applies only when guideline contains "CLSI"; all other guidelines including custom ones use the EUCAST convention (> breakpoint_R for MIC, < for disk)
  - guideline argument is now optional when reference_data is manually set: if omitted or if its value does not match any row in the custom data, all rows in reference_data are used; if set to a value present in the data, only matching rows are used, which is useful for multi-guideline custom tables
  - host = NA in custom reference_data now acts as a host-agnostic fallback when no host-specific breakpoint row exists for the current animal species
  - updated reference_data argument documentation to explain these conventions
  https://claude.ai/code/session_01Q8KtFFGG9qrjAgLJBbxG2U
* Refactor R-classification logic using custom_breakpoints_set flag
  Introduce custom_breakpoints_set <- !identical(reference_data, AMR::clinical_breakpoints) at the top of as_sir_method() and replace all identical() calls inside that function with this variable. In the case_when_AMR interpretation blocks (MIC and disk), the R-classification now has three explicit arms:
  - !custom_breakpoints_set & EUCAST guideline -> open interval (> / <)
  - !custom_breakpoints_set & CLSI guideline -> closed interval (>= / <=)
  - custom_breakpoints_set -> open interval (> / <), always, regardless of the guideline name in the custom data (e.g. "CLSI_custom" must not accidentally trigger CLSI convention)
  https://claude.ai/code/session_01Q8KtFFGG9qrjAgLJBbxG2U
* Fix unit tests for custom reference_data (#239)
  - Do not override my_bp$mo / my_bp$ab in tests: assigning plain character strips the <mo>/<ab> class, which check_reference_data() rejects. Use the mo/ab values already present in the source row instead.
  - Use NA_character_ instead of NA for my_bp$host so the host column keeps its character class.
  - Pass breakpoint_type = "animal" explicitly in the host-fallback test since the custom reference_data only contains animal-type breakpoints.
  https://claude.ai/code/session_01Q8KtFFGG9qrjAgLJBbxG2U
* Add coerce_reference_data_columns() for lenient reference_data validation
  check_reference_data() now returns the (possibly coerced) reference_data and the call site captures the result so downstream code sees the fixed columns. A new coerce_reference_data_columns() helper is called before the strict class check inside check_reference_data(). It coerces columns to the expected types:
  - mo -> as.mo() if not already <mo> class
  - ab -> as.ab() if not already <ab> class
  - character columns -> as.character() (e.g. host = NA becomes NA_character_)
  - numeric columns -> as.double()
  - logical columns -> as.logical()
  This allows users to build a custom reference_data from a plain data.frame without having to pre-apply as.mo()/as.ab() or worry about NA column types. Updated the reference_data roxygen argument to document the auto-coercion and restored the tests to the simpler form that uses plain character assignments, relying on the new coercion instead of workarounds.
  https://claude.ai/code/session_01Q8KtFFGG9qrjAgLJBbxG2U
---------
Co-authored-by: Claude <noreply@anthropic.com>	2026-04-25 14:38:01 +02:00
Matthijs Berends	19157ce718	Fix parallel computing in as.sir.data.frame (#276)
* Fix parallel computing in as.sir.data.frame
  Six bugs in parallel = TRUE mode:
  1. PSOCK workers (Windows / R < 4.0) never had AMR loaded, so every exported/AMR function call failed. Added clusterEvalQ(cl, library(AMR)) with a graceful fallback to sequential when the package cannot be loaded (e.g. dev-only load_all() environments).
  2. clusterExport'd AMR_env was a frozen serialised copy; as.sir() on the worker wrote to AMR:::AMR_env while run_as_sir_column read from the stale copy, so the captured log was always wrong. Fixed by resolving AMR_env dynamically via get("AMR_env", envir = asNamespace("AMR")) inside the worker function, and removing AMR_env from clusterExport.
  3. In the fork-based (mclapply) path each worker inherited the parent's full sir_interpretation_history. Capturing the whole log then combining across workers duplicated every pre-existing entry. Fixed by recording the log row count before the as.sir() call and slicing only the new rows afterwards.
  4. run_as_sir_column used non-exported internals (%pm>%, pm_pull, as.sir.default) that are inaccessible on PSOCK workers after library(AMR). Replaced pipe chains with direct as.mic(as.character(x[, col, drop=TRUE])) and as.disk(...) calls, and changed as.sir.default() to as.sir() which dispatches correctly via S3.
  5. With info = TRUE, worker forks printed per-column progress messages simultaneously, producing garbled interleaved console output. Per-column messages are now suppressed inside workers (effective_info = FALSE) while the outer "Running in parallel" / "DONE" messages still appear.
  6. Malformed Unicode escape \u00a (3 hex digits) in the "DONE" banner was parsed by R as U+00AD (soft hyphen) + "ONE"; corrected to .
  https://claude.ai/code/session_012DXCXbZUC54Zij1z9bFiHR
* Add parallel computing tests to test-sir.R
  Eight targeted tests verify correctness of the parallel as.sir() path: identical SIR output vs sequential, matching log row counts, no pre-existing history duplication, reproducibility across runs, results consistency across max_cores values, single-column fallback, and no per-column worker messages leaking when info = TRUE. All pass when only 1 core is available (parallel silently falls back to sequential).
  https://claude.ai/code/session_012DXCXbZUC54Zij1z9bFiHR
* Fix as.sir() data.frame: preserve already-<sir> columns, exclude metadata
  Issue #278: two related bugs in the column-detection / type-assignment pipeline.
  Bug 1 – already-<sir> columns deleted on re-run
  Line 886 excluded already-sir columns from the type assignment (they stayed type "") causing the result loop to do x[,col] <- NULL, deleting them. Fix: drop the !is.sir() guard so all untyped columns fall through to type "sir" and are re-processed correctly.
  Bug 2 – metadata columns treated as antibiotics
  as.ab("patient") -> OXY, as.ab("ward") -> PRU. The column detector accepted any column whose name matched an antibiotic code, regardless of content. Fix: for name-matched columns that do not already carry an AMR class, also verify content looks like AMR data (all_valid_mics, all-numeric, or any SIR-like string). all_valid_disks() is intentionally avoided here because it strips letters from strings (as.disk("Pt_1")==1).
  Also adds tools/benchmark_parallel.R: a standalone script that times sequential vs parallel as.sir() across n=20/200/2000/20000 rows and saves a ggplot2 PNG to tools/benchmark_parallel.png.
  https://claude.ai/code/session_012DXCXbZUC54Zij1z9bFiHR
* Update benchmark: two-panel script with warm-up and column-count sweep
  Previous single-panel benchmark was misleading: the first sequential run paid one-time cache-warm-up cost (skewing n=20), and only 6 columns were used so only 6 cores were ever active on a 16-core machine. New two-panel design:
  Left – vary rows with 16 fixed AB columns (shows memory-bandwidth saturation for large n)
  Right – vary columns with fixed rows (shows the real speedup profile: parallel wins when n_cols >> 1)
  Also adds a warm-up pass before measurements to eliminate first-call bias.
  https://claude.ai/code/session_012DXCXbZUC54Zij1z9bFiHR
* Optimise parallel as.sir(): row-batch mode when n_cols < n_cores
  Previously parallel dispatch only parallelised by column, so a 6-column dataset on a 16-core machine used at most 6 cores with the other 10 idle. For large n this also caused memory-bandwidth saturation (each worker did a full n-row scan of clinical_breakpoints simultaneously).
  New row-batch mode (fork path, R >= 4.0, non-Windows):
    pieces_per_col = ceil(n_cores / n_cols)
    Jobs = n_cols × pieces_per_col (≈ n_cores jobs total)
    Each job: one column × one row slice
  Benefits:
  - All cores stay busy regardless of column count
  - Per-worker memory footprint shrinks by pieces_per_col ×
  - Breakpoints lookup cache pressure reduced per worker
  PSOCK path (Windows / R < 4.0) is unchanged: per-job serialisation overhead makes row batching unprofitable there. run_as_sir_column() gains an optional `rows` parameter (NULL = all rows, backward-compatible). Results are reassembled via as.sir(c(as.character(.))) which is safe for already-clean SIR values.
  https://claude.ai/code/session_012DXCXbZUC54Zij1z9bFiHR
* Fix info=FALSE ignored when no breakpoints found in as_sir_method
  Operator-precedence bug at line 1601:
    if (isTRUE(info) && nrow(df_unique) < 10 || nrow(breakpoints) == 0)
  R evaluates && before ||, so this was equivalent to:
    (isTRUE(info) && nrow(df_unique) < 10) || (nrow(breakpoints) == 0)
  When nrow(breakpoints) == 0 (e.g. cefoxitin / flucloxacillin / mupirocin against E. coli in EUCAST) the intro message was always printed regardless of info. Fix: add parentheses so info gates both conditions:
    isTRUE(info) && (nrow(df_unique) < 10 || nrow(breakpoints) == 0)
  Also pass print = isTRUE(info) to progress_ticker so the progress bar (which prints intro_txt as its title) is suppressed when info = FALSE.
  https://claude.ai/code/session_012DXCXbZUC54Zij1z9bFiHR
* Fix cli formatting in as.sir() messages
  - stop_if for empty ab_cols: wrap as.mic() and as.disk() in {.help [{.fun ...}](...)} for clickable links in cli output
  - Parallel mode message: use {.field col} formatting for column names and quotes = FALSE in vector_and(), consistent with the rest of the codebase (avoids double-quoting from both font_bold and quotes="'")
  https://claude.ai/code/session_012DXCXbZUC54Zij1z9bFiHR
* Use font_bold() inside {.field} for column names in parallel message
  Convention: paste0("{.field ", font_bold(col), "}") gives bold green column names without quotation marks, consistent with the rest of the codebase (e.g. the 'Cleaning values' message in run_as_sir_column).
  https://claude.ai/code/session_012DXCXbZUC54Zij1z9bFiHR
* Add collapse = NULL to font_bold() for column name vectors
  font_bold() without collapse = NULL joins a vector with "" into a single string, breaking paste0() element-wise formatting for length > 1 vectors.
  https://claude.ai/code/session_012DXCXbZUC54Zij1z9bFiHR
* Add tools/ to .Rbuildignore
  Keeps the benchmark script out of the built package tarball.
  https://claude.ai/code/session_012DXCXbZUC54Zij1z9bFiHR
---------
Co-authored-by: Claude <noreply@anthropic.com>	2026-04-25 00:34:38 +02:00
dr. M.S. (Matthijs) Berends	3a736bc484	(v3.0.1.9041) add breakpoints 2026	2026-03-30 10:01:49 +02:00
dr. M.S. (Matthijs) Berends	499c830ee7	(v3.0.1.9020) unit test fixes	2026-02-09 13:16:36 +01:00
dr. M.S. (Matthijs) Berends	225c73f7e7	(v3.0.1.9004) Revamp `as.sir()` interpretation for capped MICs Fixes #243 Fixes #244	2025-12-15 13:18:13 +01:00
dr. M.S. (Matthijs) Berends	4d7c4ca52c	(v3.0.0.9027) skimr update and as.ab warning - fixes #234, fixes #232	2025-09-10 13:32:52 +02:00
dr. M.S. (Matthijs) Berends	5e6bbdf3d1	(v2.1.1.9267) update ATCs	2025-05-01 11:48:49 +02:00
dr. M.S. (Matthijs) Berends	d2b3937a90	(v2.1.1.9257) adjust unit tests	2025-04-27 09:58:19 +02:00
dr. M.S. (Matthijs) Berends	f340e257fa	(v2.1.1.9256) unit tests	2025-04-26 21:29:50 +02:00
dr. M.S. (Matthijs) Berends	be13934fe7	(v2.1.1.9249) unit test	2025-04-20 17:49:47 +02:00
dr. M.S. (Matthijs) Berends	492fe6872f	(v2.1.1.9244) automated GPT training data	2025-04-19 15:57:12 +02:00
dr. M.S. (Matthijs) Berends	579025f678	(v2.1.1.9241) fix sir	2025-04-18 13:25:59 +02:00
dr. M.S. (Matthijs) Berends	40d7a971c3	(v2.1.1.9236) documentation	2025-04-12 11:46:42 +02:00
dr. M.S. (Matthijs) Berends	36fd99e1f4	(v2.1.1.9235) New website!	2025-04-08 15:54:30 +01:00
dr. M.S. (Matthijs) Berends	8deaf2c8eb	(v2.1.1.9224) skip tests on cran	2025-03-20 23:29:21 +01:00
dr. M.S. (Matthijs) Berends	58d7aa8790	(v2.1.1.9199) fix eucast	2025-03-14 13:43:22 +01:00
dr. M.S. (Matthijs) Berends	861331b1df	(v2.1.1.9196) fix eucast, unit tests	2025-03-13 15:38:39 +01:00
dr. M.S. (Matthijs) Berends	9aab129ea6	(v2.1.1.9195) add `BTL-S`, fix ranks in unknown microorganisms	2025-03-13 14:30:14 +01:00
dr. M.S. (Matthijs) Berends	f7938289eb	(v2.1.1.9186) replace `antibiotics` with `antimicrobials`!	2025-03-07 20:43:26 +01:00
dr. M.S. (Matthijs) Berends	07efc292bc	(v2.1.1.9163) cleanup	2025-02-27 14:04:29 +01:00
dr. M.S. (Matthijs) Berends	f03933940c	(v2.1.1.9131) implement testthat	2025-01-27 21:43:10 +01:00
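
The operator-precedence bug described in 19157ce718 (info = FALSE ignored when no breakpoints are found) can be reproduced outside the package. The following is a minimal sketch, not the code from as_sir_method(): the variables n_unique and n_breakpoints are hypothetical stand-ins for nrow(df_unique) and nrow(breakpoints).

```r
# Hypothetical stand-ins for the values inside as_sir_method()
info <- FALSE
n_unique <- 50L       # stands in for nrow(df_unique)
n_breakpoints <- 0L   # stands in for nrow(breakpoints)

# Original condition: && binds tighter than ||, so the right-hand side
# of || is evaluated even when info is FALSE
buggy <- isTRUE(info) && n_unique < 10 || n_breakpoints == 0

# Fixed condition: parentheses make info gate both sub-conditions
fixed <- isTRUE(info) && (n_unique < 10 || n_breakpoints == 0)

print(c(buggy = buggy, fixed = fixed))
```

With info = FALSE and zero breakpoint rows, the unparenthesised form evaluates to TRUE while the parenthesised form evaluates to FALSE, which matches the behaviour described in the commit message.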

21 Commits
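
The row-batch arithmetic described in 19157ce718 (pieces_per_col = ceil(n_cores / n_cols), one job per column and row slice) can be illustrated as follows. This is a rough sketch under assumptions: make_row_batch_jobs() is a hypothetical helper that does not exist in AMR, and the way rows are cut into slices here is one possible choice, not necessarily what the package does.

```r
# Hypothetical helper, for illustration only (not part of the AMR package)
make_row_batch_jobs <- function(n_rows, n_cols, n_cores) {
  # Number of row slices per column, as stated in the commit message
  pieces_per_col <- ceiling(n_cores / n_cols)
  # Assumption: rows are assigned to roughly equal contiguous slices
  piece_id <- ceiling(seq_len(n_rows) * pieces_per_col / n_rows)
  row_slices <- unname(split(seq_len(n_rows), piece_id))
  # One job per (column, row slice) combination
  jobs <- expand.grid(col = seq_len(n_cols), piece = seq_along(row_slices))
  list(pieces_per_col = pieces_per_col, row_slices = row_slices, jobs = jobs)
}

# The 6-column, 16-core case mentioned in the commit message
plan <- make_row_batch_jobs(n_rows = 20000, n_cols = 6, n_cores = 16)
print(plan$pieces_per_col)
print(nrow(plan$jobs))
```

For 6 columns on 16 cores this gives 3 slices per column and 18 jobs in total, so the job count is close to the core count instead of being capped at the number of columns.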