Dataset Model
DatasetModel
============

In-memory model orchestrating:
The dataset AP matrix (systems × topics).
Streaming NSGA-II execution for BEST/WORST and sampling for AVERAGE.
Compact caches so aggregation/info queries don’t retain large populations.
Output artifacts
-Fun: objective rows (K, correlation) streamed during runs.
-Var: genotype rows as Base64 masks ("B64:<...>") streamed with -Fun.
BEST/WORST → only when the per-K representative improves (monotone).
AVERAGE → exactly one row per K.
-Top: top solutions per cardinality (≤ 10 per K), maintained via block replacement.
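The "B64:<...>" mask format used by -Var rows can be illustrated with a small round trip. This is a minimal sketch in Python, not the model's own code: only the "B64:" prefix comes from the text above, while the function names and the LSB-first bit packing are assumptions made for illustration.

```python
import base64

PREFIX = "B64:"  # prefix taken from the text; the bit layout below is an assumption


def encode_mask(mask: list[bool]) -> str:
    """Pack a topic presence mask into bytes (LSB-first, assumed) and Base64-encode it."""
    packed = bytearray((len(mask) + 7) // 8)
    for i, present in enumerate(mask):
        if present:
            packed[i // 8] |= 1 << (i % 8)
    return PREFIX + base64.b64encode(bytes(packed)).decode("ascii")


def decode_mask(encoded: str, size: int) -> list[bool]:
    """Decode a "B64:<...>" string into a mask of exactly `size` entries.

    Positions beyond the stored bytes are reported as absent (False).
    """
    if not encoded.startswith(PREFIX):
        raise ValueError("missing B64: prefix")
    packed = base64.b64decode(encoded[len(PREFIX):])
    return [
        i // 8 < len(packed) and bool(packed[i // 8] >> (i % 8) & 1)
        for i in range(size)
    ]
```

Storing the mask as a short string rather than a boolean array per solution is what lets a row be written out and then forgotten, which matches the streaming behavior described above.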
Branches
AVERAGE:
For each K = 1..N, sample numberOfRepetitions subsets.
Correlate each subset's per-system mean vector with the full-set mean vector, then stream exactly one FUN/VAR row for that K.
Percentiles are computed from the same samples. TOP is unused.
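The AVERAGE branch can be sketched as follows. This is an illustrative Python outline under stated assumptions: Pearson correlation is assumed (the text does not name the correlation method), the single row per K is assumed to carry the mean of the sampled correlations, and the names `average_branch`, `repetitions`, and the percentile levels are hypothetical.

```python
import random
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either vector is constant."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5 if vx > 0 and vy > 0 else 0.0


def average_branch(ap, repetitions, percentiles=(5, 50, 95), seed=0):
    """ap: AP matrix, rows = systems, columns = topics.

    Yields exactly one (K, mean correlation, percentiles) row per K = 1..N.
    """
    rng = random.Random(seed)
    n_topics = len(ap[0])
    full_means = [mean(row) for row in ap]  # full-set mean AP per system
    for k in range(1, n_topics + 1):
        samples = []
        for _ in range(repetitions):
            topics = rng.sample(range(n_topics), k)  # random subset of K topics
            subset_means = [mean(row[t] for t in topics) for row in ap]
            samples.append(pearson(subset_means, full_means))
        samples.sort()
        # Percentiles come from the same samples used for the streamed row.
        pct = {
            p: samples[min(len(samples) - 1, p * len(samples) // 100)]
            for p in percentiles
        }
        yield k, mean(samples), pct
```

Because the function is a generator, each row can be written out as soon as it is produced, and only the samples for the current K are held in memory.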
BEST/WORST:
NSGA-II objective encoding (internal):
BEST → obj0 = +K, obj1 = -corr
WORST → obj0 = -K, obj1 = +corr
Per generation: select one representative per K (compared on the natural, un-negated scale), stream it only on improvement, and update a persistent TOP pool.
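The objective encoding and the stream-on-improvement rule can be sketched as below. This is a hedged Python illustration, not the model's code: it assumes that "improvement" means a higher correlation for BEST and a lower one for WORST, and the names `encode`, `decode`, `stream_generation`, and `emit` are hypothetical.

```python
def encode(target, k, corr):
    """Map natural-scale (K, corr) to the internal objectives given in the text."""
    if target == "BEST":
        return (+k, -corr)
    if target == "WORST":
        return (-k, +corr)
    raise ValueError(target)


def decode(target, obj):
    """Recover natural-scale (K, corr) from the internal objectives."""
    obj0, obj1 = obj
    return (obj0, -obj1) if target == "BEST" else (-obj0, obj1)


def stream_generation(target, population, corr_by_k, emit):
    """population: iterable of (obj0, obj1, mask) for one generation.

    Picks one representative per K on the natural scale and calls
    emit(k, corr, mask) only when it beats the cached value in corr_by_k.
    """
    if target == "BEST":
        better = lambda new, old: new > old  # assumed: higher correlation wins
    else:
        better = lambda new, old: new < old  # assumed: lower correlation wins
    reps = {}
    for obj0, obj1, mask in population:
        k, corr = decode(target, (obj0, obj1))
        if k not in reps or better(corr, reps[k][0]):
            reps[k] = (corr, mask)
    for k in sorted(reps):
        corr, mask = reps[k]
        if k not in corr_by_k or better(corr, corr_by_k[k]):
            corr_by_k[k] = corr  # cache only the representative value
            emit(k, corr, mask)
```

Since a row is emitted only when the cached value improves, the streamed sequence for each K is monotone, and the full population never needs to be retained after the generation is processed.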
RAM-focused design
After load/expansion, call sealData to drop boxed AP maps and keep dense PrecomputedData.
During runs, cache only:
per-K representative correlation: corrByK
per-K representative mask as Base64: repMaskB64ByK
TOP pool uses lightweight entries (TopEntry) without holding full solutions.
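A TOP pool built from lightweight entries might look like the sketch below. It is written in Python under stated assumptions: the fields of `TopEntry` are hypothetical (only the name appears in the text), entries are deduplicated by mask, and "block replacement" is interpreted here as swapping the whole per-K list whenever its content changes.

```python
from dataclasses import dataclass

TOP_LIMIT = 10  # "≤ 10 per K" in the text


@dataclass(frozen=True)
class TopEntry:
    """Hypothetical lightweight entry: natural-scale correlation plus an encoded mask."""
    corr: float
    mask_b64: str


def update_top_pool(pool, k, candidates, best=True):
    """Merge candidates into the block for K and keep at most TOP_LIMIT entries.

    pool: dict mapping K to a list of TopEntry.
    Returns True when the block for K was replaced.
    """
    merged = {e.mask_b64: e for e in pool.get(k, [])}
    for e in candidates:
        merged.setdefault(e.mask_b64, e)  # keep the first entry seen per mask
    # Highest correlation first for BEST, lowest first for WORST (assumed ordering).
    block = sorted(merged.values(), key=lambda e: e.corr, reverse=best)[:TOP_LIMIT]
    changed = block != pool.get(k, [])
    if changed:
        pool[k] = block  # replace the whole block at once
    return changed
```

Each entry holds a float and a short string, so the pool's footprint is bounded by the number of cardinalities times ten, regardless of population size.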
Properties
Boxed AP rows only during load/expansion. Cleared by sealData.
Wall-clock computing time in milliseconds (last run).
Full-set mean AP per system (boxed for API compatibility).
AVERAGE branch percentiles (materialized at the end).
Kept for API compatibility; no longer populated to avoid per-topic maps.
Functions
Drop per-run caches (representative masks, correlation map, TOP pools). Call after writers have closed and aggregations have consumed the model.
Drop percentile lists (often large) once final CSV/Parquet tables have been written.
Expand systems by either reverting to a prefix of the original systems (if the target count is smaller than the original) or by appending randomized systems with their AP rows and labels.
Expand topics by appending randomized columns to AP rows and labels.
Find cached correlation for a cardinality (if present).
O(1) presence lookup using the cached representative mask per K (decoded on demand).
Return the presence mask for K encoded as "B64:<base64>" (always sized to current numberOfTopics).
Return a copy of the representative presence mask for K, or null if none cached.
Return the presence mask for K sized exactly to expectedSize.