Upcoming Features & Roadmap
Here's a preview of the upcoming features I'm working on to make Wattlytics even more powerful and user-friendly:
Planned Feature Roadmap Overview
Wattlytics 2.0: Energy‑aware, heterogeneity‑aware total cost, performance and emissions modeling for HPC / AI clusters
| Phase | Key Features |
|---|---|
| Phase 1: Core Expansions | Benchmark/GPU database expansion; heterogeneous cluster modeling; utilization modeling |
| Phase 2: Advanced Analytics | Undervolting & throttling effects modeling; multi-year TCO projection with depreciation; Monte Carlo / uncertainty analysis; smart configuration guidance system (rule-based recommendation engine, budget & constraint-based optimizer, ML model for configuration recommendations) |
| Phase 3: User Experience & Data Integration | Fixed energy budget modeling; forecasting & scenario planning; cluster preset profiles; alternative price sourcing |
| Phase 4: Collaboration & Reporting | Collaboration mode; detailed PDF report generator; API/programmatic access; currency selector |
| Phase 5: Model Assessment & Extensions | Frequency factor modeling; cooling types & PUE modeling; capital vs. operational cost split; hardware vs. software maintenance split; software cost dependency on node count; data center temperature as a factor; CO₂ emission factor; Net Present Value (NPV) factor; CO₂ offsets & heat reuse modeling; risk, failure rates & replacement costs; thermal zoning or rack-level modeling; power capping / RAPL effect modeling; interconnect & network energy costs; billing mode toggle (runtime vs. energy) |
- Phase 1: Core expansions
- Benchmark/GPU database expansion: Support more and mixed workloads from different domains (ML, CFD, molecular dynamics), e.g., AI workloads (TensorFlow, PyTorch, MLPerf) and HPC codes (OpenFOAM, LAMMPS, ORCA, VASP (licensing issues), CP2K (⚠️ some functionality is not yet available on GPU), Quantum ESPRESSO (⚠️ a few kernels are hard to port to GPUs, but the performance-critical parts of QE are already in CUDA; it is still unclear whether to run one MPI process per GPU or to use all CPU cores for a given GPU), SIRIUS). Allow users to define custom workload mixes and save them as templates. Currently supports GROMACS and AMBER with data export (CSV/JSON), plus custom data uploads (CSV/JSON).
- Heterogeneous cluster modeling: For example, the Helma cluster has three parts: Helma 1, Helma 2, and a CPU partition. In multi-architecture clusters (CPU + multiple GPU generations), model interconnect costs/latencies and data-transfer energy overhead when jobs cross hardware types (e.g., CPU ↔ GPU ↔ GPU of a different generation).
- Utilization modeling: Rather than assuming GPUs operate constantly, simulate a job mix (some jobs heavy, some light) or utilization curves. Compute effective utilization (e.g. 70% usage, peaks, idle periods) more realistically. Add "Idle Power" or "Scheduling Inefficiencies" into models so users can toggle between ideal vs. real usage curves.
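The ideal-vs-real toggle could rest on a small effective-power model like the following sketch; the function name and the example job mix are illustrative assumptions, not the Wattlytics implementation:

```python
# Sketch of effective-utilization modeling: instead of assuming 100% GPU usage,
# blend a job mix (heavy jobs, light jobs, idle periods) into one average draw.

def effective_power(w_peak, w_idle, job_mix):
    """job_mix: list of (fraction_of_time, utilization) tuples summing to 1.0.
    Power is interpolated linearly between idle and peak draw."""
    return sum(frac * (w_idle + util * (w_peak - w_idle))
               for frac, util in job_mix)

# Assumed example mix: 40% heavy jobs, 30% light jobs, 30% idle
mix = [(0.4, 0.95), (0.3, 0.40), (0.3, 0.0)]
p_eff = effective_power(w_peak=700.0, w_idle=90.0, job_mix=mix)  # watts per GPU
p_ideal = 700.0                                                  # naive 100% usage
```

Comparing `p_eff` against `p_ideal` quantifies the "scheduling inefficiencies" gap the toggle would expose.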
- Phase 2: Advanced analytics
- Undervolting & throttling effects modeling: Allow exploring how undervolting/underclocking or thermal throttling affects performance, power, TCO. Add plots of "power vs. perf loss" to visualize trade-offs.
- Multi-year TCO projection with depreciation: Visualize how total cost accumulates over the system’s lifetime. A GPU cluster loses value over time, which affects accounting, TCO realism, ROI, and resale value (X-axis: 0 to lifetime in years; Y-axis: cumulative TCO as a line and depreciated asset value as a shaded area). Note: it is not yet clear how best to factor this in, but in a commercial setting declining asset value matters. Since hardware loses value over time while power costs stay constant, power may matter more than hardware cost over the cluster's lifetime. Allow users to select a depreciation method (linear, double-declining, etc.) and add resale value estimates.
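The selectable depreciation methods might look like this sketch; salvage handling and function names are assumptions:

```python
# Sketch of depreciated asset value over a cluster's lifetime, for the shaded
# "asset value" area. Salvage-value handling is a simplifying assumption.

def linear_value(cost, salvage, lifetime, year):
    """Straight-line depreciation: value falls evenly from cost to salvage."""
    year = min(year, lifetime)
    return cost - (cost - salvage) * year / lifetime

def double_declining_value(cost, salvage, lifetime, year):
    """Double-declining balance: a fixed rate of 2/lifetime is applied to the
    remaining value each year, never dropping below the salvage estimate."""
    value = cost
    rate = 2.0 / lifetime
    for _ in range(min(year, lifetime)):
        value = max(value * (1.0 - rate), salvage)
    return value
```

Plotting both curves against cumulative energy spend would show the point where operating cost overtakes residual hardware value.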
- Monte Carlo / uncertainty analysis: Select parameters with a ± range and run 1000+ randomized calculations. Display the mean, standard deviation, and box or violin plots of performance/TCO results. For example, one can tweak parameters such as the W prefactors in the power model or the node's baseline power to see how they affect performance per TCO. Let users choose a confidence level (95%, 99%) and export summary statistics. Add pre-defined scenarios (e.g., “Power Variability ±5%”).
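A minimal Monte Carlo loop over a toy TCO model, assuming ±5% uniform variability on node power and electricity price; all figures are illustrative, not Wattlytics defaults:

```python
# Monte Carlo sketch: perturb power-model inputs within ± ranges and summarize
# the resulting TCO distribution. Parameter names and values are illustrative.
import random
import statistics

def tco(node_power_w, price_eur_per_kwh, hours, hardware_cost_eur, nodes):
    """Toy TCO: hardware cost plus lifetime energy cost."""
    energy_kwh = node_power_w / 1000.0 * hours * nodes
    return hardware_cost_eur + energy_kwh * price_eur_per_kwh

random.seed(42)  # reproducible runs
samples = []
for _ in range(1000):
    power = 800.0 * random.uniform(0.95, 1.05)   # ±5% node power
    price = 0.30 * random.uniform(0.95, 1.05)    # ±5% electricity price
    samples.append(tco(power, price, hours=5 * 8760,
                       hardware_cost_eur=2_000_000, nodes=50))

mean = statistics.mean(samples)
stdev = statistics.stdev(samples)
ranked = sorted(samples)
p2_5, p97_5 = ranked[25], ranked[974]  # ~95% empirical interval
```

The empirical 2.5%/97.5% quantiles double as the exportable "95% confidence" band; a 99% band would use indices 5 and 994.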
- Smart configuration guidance system
- Configuration recommendation engine (rule-based): Suggest actionable ways to improve performance/TCO based on current inputs. If a user sets a constraint (power < X, emissions < Y), highlight or warn when configurations exceed it. (Current: a basic rule-based Smart Strategy Tip; future: adaptive, data-driven suggestions trained on user sessions or simulated configs.)
- Budget & constraint-based optimizer: Given a budget, power budget, or emission cap, automatically suggest optimal GPU combinations/cluster sizes to maximize target metrics (perf/TCO, perf/emissions, etc.). Possible backends: a MILP solver or heuristic methods that generate "good enough" cluster suggestions. Add constraint-check warnings with clear labels (e.g., ⚠️ “Exceeds Power Cap by 10%”). Input an annual CO₂ cap and see which clusters fit (Carbon Budget Planner).
- ML model for configuration recommendations: After collecting enough usage data, recommend configurations via learned patterns
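The constraint-based optimizer above could start as a brute-force search; all GPU prices, power draws, and relative-performance numbers below are placeholder assumptions, and a production backend would likely use a MILP solver instead:

```python
# Hypothetical brute-force optimizer: enumerate GPU counts under a budget and
# a power cap, keeping the configuration with the highest total performance.
# All GPU figures are placeholder assumptions, not measured or quoted prices.

GPUS = {  # name: (unit_price_eur, power_w, relative_perf)
    "A40":  (5_000, 300, 1.0),
    "A100": (12_000, 400, 2.2),
    "H100": (25_000, 700, 4.0),
}

def best_config(budget_eur, power_cap_w, max_count=64):
    best = None  # (total_perf, gpu_name, count)
    for name, (price, power, perf) in GPUS.items():
        for n in range(1, max_count + 1):
            if n * price > budget_eur or n * power > power_cap_w:
                break  # larger counts only get more expensive / hotter
            total_perf = n * perf
            if best is None or total_perf > best[0]:
                best = (total_perf, name, n)
    return best
```

Swapping the objective from total performance to perf/TCO or perf/emissions is a one-line change, which is why a declarative solver formulation scales better once constraints multiply.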
- Phase 3: User experience and data integration
- Fixed energy budget modeling: Model against a fixed energy envelope rather than a fixed monetary budget. Particularly relevant for larger installations with their own power plants (e.g., LANL), and to a lesser extent for the FAU power supplies at NHR@FAU. For example, given X MW of infrastructure, a cluster exceeding X MW will not work even if it fits the monetary budget. How far can you optimize power (by lowering the power cap or adjusting CPU/GPU clocks) before you are forced to switch to a different GPU? Given a power limit, how large can the system be while staying within the envelope? Add color bands or alerts on charts when a config exceeds the power envelope. Explore "Power Envelope vs. Perf Curve" visualizations.
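The "how large can the system be" question reduces to a short calculation; the PUE, reserve margin, and per-node draw below are illustrative assumptions:

```python
# Sketch: maximum cluster size under a fixed site power envelope.
# The reserve margin and PUE handling are simplifying assumptions.

def max_nodes(site_limit_w, node_power_w, pue, reserve_fraction=0.1):
    """Usable IT power = (site limit minus reserve) / PUE; the node count is
    the floor of usable IT power over per-node draw."""
    usable_it_w = site_limit_w * (1.0 - reserve_fraction) / pue
    return int(usable_it_w // node_power_w)

# 2 MW envelope, PUE 1.25, 4 kW nodes, 10% reserve
n = max_nodes(site_limit_w=2_000_000, node_power_w=4_000, pue=1.25)
```

Sweeping `node_power_w` downward (tighter power caps or lower clocks) traces exactly the "Power Envelope vs. Perf Curve" trade-off mentioned above.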
- Forecasting & scenario planning: Allow modeling of how GPU prices, electricity rates, grid carbon intensity might change over 5–10 years (e.g. falling GPU cost, rising carbon tax). Run forecasts / scenarios (optimistic, pessimistic). Let users select the trend type (Linear, Exponential), sensitivity bands, and duration (3, 5, 10 years).
- Cluster preset profiles: Add top 10 clusters from the Top500 or Green500 list and improve currently available Alex (A40, A100) and Helma (H100) profiles.
- Alternative price sourcing: Sourcing prices from deltacomputer.com when available; otherwise from alternate.de, sysgen.de, or internal records. VAT will be added if missing. One can always adjust the pricing (or take best option) if needed. Use crawlers or scraped data via public APIs where possible. Keep disclaimers — prices vary and VAT/local tax rules may apply.
- Phase 4: Collaboration and reporting
- Collaboration Mode: Share calculation setups with teammates via unique, bookmarkable public URLs with embedded settings (currently a simple auto-generated blog-summary share option). Let users log in, store configurations, revisit them, and manage scenario libraries; save, name, and version multiple scenarios; compare them over time; share with collaborators; and comment and diff versions. Let external contributors define workloads, pricing models, or cooling setups via plugins or config files.
- Detailed PDF report generator: Export project reports with charts and explanations and add a report format toggle (Summary vs. Detailed). Currently offers only a simple PDF report generator.
- API/Programmatic Access: Expose Wattlytics as an API for researchers and developers so other tools (cluster schedulers, infrastructure planners) can query TCO estimations programmatically. Currently limited to a private GitHub repository.
- Currency Selector: Support for USD, GBP, INR, and more. Currently only supports EUR.
- Phase 5: Model assessment and extensions; 🚦Overengineered?
- ✅ Frequency Factor
- Model: W_node_baseline(f_cCPU, f_ucCPU) = W_baseCPU + W_dynamicCPU + W_noncompute
- Model: W_gpu(f_GPU)
- Comment: f_cCPU, f_ucCPU, and f_GPU represent the core CPU frequency, uncore CPU frequency, and GPU graphics frequency, respectively. Example: on Alex, power usage splits as 11.6% CPU, 55% GPU, and 33.4% non-compute (including 6% fan and 27.4% other components). (Optional) Let advanced users input or tweak their own component-wise power ratios.
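The quoted Alex split could be exposed as overridable ratios, e.g. (the function name is illustrative, not the Wattlytics API):

```python
# The Alex node-power split quoted above (11.6% CPU, 55% GPU, 33.4% non-compute)
# as user-overridable ratios; defaults reflect the measured Alex shares.

def split_node_power(total_w, cpu_frac=0.116, gpu_frac=0.55, other_frac=0.334):
    """Break a node's total draw into component shares; ratios must sum to 1."""
    assert abs(cpu_frac + gpu_frac + other_frac - 1.0) < 1e-9
    return {"cpu": total_w * cpu_frac,
            "gpu": total_w * gpu_frac,
            "noncompute": total_w * other_frac}
```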
- ❓ Cooling Types & PUE Modeling
- Model: PUE = factor_air * PUE_Air + factor_water * PUE_Water, where factor_air + factor_water = 1
- Comment: Incorporates cooling types (air-cooled, water-cooled) with weighting factors for the fraction of energy consumed by the air-cooled and water-cooled parts of the cluster, to model heterogeneous clusters. Another variation: different codes on CPU and GPU partitions with varying PUEs require a weighted sum of benchmark power use. Incorporating PUE is a good start, but cooling infrastructure overhead, power losses, UPS/PDUs, and rack-level constraints deserve further modeling. Refine PUE as nonlinear and temperature-dependent, or even PUE = a + b * ambient_temp. Show a PUE vs. temperature curve (based on ASHRAE or DC operator data).
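A sketch of the energy-weighted PUE blend and the linear temperature variant, assuming the weights sum to 1; the coefficients in the test values are made-up examples:

```python
# Sketch of blended PUE for a mixed air/water-cooled cluster, plus the simple
# temperature-dependent variant. Coefficients a and b would come from fitting
# DC operator or ASHRAE data; nothing here is a measured value.

def blended_pue(frac_air, pue_air, pue_water):
    """frac_air = fraction of IT energy in the air-cooled partition;
    the remainder is assumed water-cooled."""
    return frac_air * pue_air + (1.0 - frac_air) * pue_water

def pue_linear_temp(a, b, ambient_temp_c):
    """First-order model: PUE = a + b * ambient_temp."""
    return a + b * ambient_temp_c
```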
- ❓ Maintenance Cost Split: Capital vs. Operational
- Comment: Capital cost applies to the first 3 years (as per DFG rules), while operational cost covers the remaining lifetime. Allow user-defined splits as a fallback.
- ❓ Hardware vs. Software Maintenance Split
- ❓ Software Cost Dependency on Node Count
- Comment: Many commercial licenses scale with node count, such as the Moya scheduler, SUSE Enterprise, or IBM file systems like GPFS. Include fields for "license per node vs. per core," "floating license pools," and "cost caps or institutional deals."
- ❓ Data Center Temperature as a Factor
- Comment: Temperature is a side effect of cooling infrastructure; efforts to reduce it (e.g., via chilled air or cooling systems) must be factored into the TCO.
- ❓ CO₂ Emission Factor
- Comment: One can show “performance per kg CO₂,” visualize "emissions over lifetime," or assume 100% green electricity usage at FAU.
- ❓ Net Present Value (NPV) Factor
- Comment: NPV compares the value of future cash flows to the initial investment. Clearly define cash flow periods in the model. Add simple toggle: “Include NPV” → ask for discount rate.
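The "Include NPV" toggle could reduce to a few lines; the figures in the usage below are example assumptions:

```python
# NPV sketch for the "Include NPV" toggle: discount future cash flows at a
# user-supplied rate against the initial hardware investment.

def npv(discount_rate, initial_cost, yearly_cashflows):
    """NPV = -initial_cost + sum over years t of cashflow_t / (1 + r)^t,
    with cash flows assumed to land at the end of each year."""
    return -initial_cost + sum(
        cf / (1.0 + discount_rate) ** t
        for t, cf in enumerate(yearly_cashflows, start=1))

# Example: 100k€ cluster returning 40k€/year of value for 3 years at 5%
result = npv(0.05, 100_000.0, [40_000.0, 40_000.0, 40_000.0])
```

Defining the cash-flow periods explicitly (end-of-year here) is exactly the modeling decision the comment above asks to pin down.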
- ❓ CO₂ Offsets & Heat Reuse: expand the model beyond a single "heat reuse revenue" factor
- Comment: District heating, thermal credits, carbon offsets. Show "Effective CO₂ impact after offsets."
- ❓ Risk, Failure Rates & Replacement Costs
- Comment: E.g., after X years some fraction of hardware fails, or maintenance increases.
- ❓ Thermal Zoning or Rack-Level Modeling
- Comment: Support for simulating heat hotspots or rack-level inefficiencies (especially important in mixed-density clusters). Suggest placement heuristics to reduce cooling imbalance and improve PUE at micro-level.
- ❓ Power Capping / RAPL Effect Modeling
- Comment: Integrate realistic models for RAPL or GPU power capping (e.g., NVIDIA nvidia-smi or AMD ROCm features). Consider frequency throttling, memory performance degradation, or latency effects. Especially useful for institutions using hard power limits for cluster management or within DC constraints.
- ❓ Interconnect & Network Energy Costs
- Comment: Extend modeling of network power costs, especially for multi-node MPI workloads. Include interconnect type (InfiniBand, Ethernet), bandwidth scaling, and latency penalties when mixing GPU generations or architectures (e.g., NVLink vs. PCIe-only). Factor into total system power and performance metrics.
- ⚠️ Conflict: Billed per Hour vs. per kWh
- Comment: A heavily optimized application may draw more power yet finish faster, lowering runtime-based costs while potentially raising the energy bill. Flag this in the documentation or expose it as a toggle: add a billing-method toggle (runtime-based vs. energy-based) and show how fast-but-hot vs. slow-and-cool configurations trade off differently.
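The billing-mode toggle boils down to comparing two cost functions on the same job; the rates and job profiles below are illustrative assumptions:

```python
# Sketch of the billing-mode conflict: the same job priced by runtime
# (node-hours) vs. by energy (kWh). Rates and job figures are illustrative.

def runtime_cost(hours, rate_per_node_hour, nodes=1):
    """Runtime-based billing: pay per node-hour regardless of power draw."""
    return hours * nodes * rate_per_node_hour

def energy_cost(hours, avg_power_w, price_per_kwh, nodes=1):
    """Energy-based billing: pay per kWh actually consumed."""
    return hours * nodes * (avg_power_w / 1000.0) * price_per_kwh

# Fast-but-hot: 2 h at 900 W; slow-and-cool: 3 h at 500 W, on one node
fast_rt, slow_rt = runtime_cost(2, 1.0), runtime_cost(3, 1.0)
fast_en, slow_en = energy_cost(2, 900, 0.30), energy_cost(3, 500, 0.30)
# Runtime billing favors the fast run; energy billing favors the cool run.
```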