Federated Heterogeneous Compute and Storage Infrastructure for PUNCH4NFDI

Analysis of the Compute4PUNCH and Storage4PUNCH concepts for federating diverse HPC, HTC, and storage resources across German research institutions.

1. Introduction

Particles, Universe, NuClei and Hadrons for the National Research Data Infrastructure (PUNCH4NFDI) is a major German consortium funded by the DFG (German Research Foundation). It represents approximately 9,000 scientists from particle, astro-, astroparticle, hadron, and nuclear physics communities. The consortium's prime goal is to establish a federated and FAIR (Findable, Accessible, Interoperable, Reusable) science data platform. A central challenge addressed is the federation of the highly heterogeneous computing (HPC, HTC, Cloud) and storage resources contributed "in-kind" by member institutions across Germany, enabling seamless, unified access for researchers.

2. Federated Heterogeneous Compute Infrastructure – Compute4PUNCH

The Compute4PUNCH concept is designed to provide transparent access to a diverse pool of compute resources without imposing significant changes on existing, operational systems at provider sites.

2.1. Core Architecture & Technologies

The federation is built on an HTCondor-based overlay batch system. The key innovation is the use of the COBalD/TARDIS resource meta-scheduler: COBalD tracks demand and utilization in the overlay, while TARDIS acts as a dynamic broker that provisions "pilot" jobs or containers on remote resources through provider-specific interfaces (e.g., SLURM, Kubernetes) and manages their lifecycle. The pilots join the HTCondor overlay, creating a single virtual, federated resource pool.

Access is secured via a token-based Authentication and Authorization Infrastructure (AAI), providing a standardized credential for all connected resources.
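
A minimal sketch of what submission into such an overlay pool could look like, using the HTCondor Python bindings. The resource requests, the `requirements` expression, and in particular the `+SingularityImage` attribute are illustrative assumptions; the actual attribute names and values depend on the overlay and site configuration.

```python
import htcondor

# Describe the job; the overlay pool decides on which provider's pilot slot
# it eventually runs, based on these requirements.
submit = htcondor.Submit({
    "executable": "run_analysis.sh",
    "arguments": "--config analysis.yaml",
    "request_cpus": "4",
    "request_memory": "8GB",
    "requirements": '(Arch == "X86_64")',
    # Site-specific custom attribute (illustrative): ask the execute-node
    # wrapper to start the job inside a container image delivered via CVMFS.
    "+SingularityImage": '"/cvmfs/unpacked.example.org/punch/analysis:latest"',
    "output": "job.out",
    "error": "job.err",
    "log": "job.log",
})

# Submit to the schedd of the HTCondor overlay; authentication and
# authorization are handled via the token obtained from the AAI.
schedd = htcondor.Schedd()
result = schedd.submit(submit, count=1)
print(f"Submitted cluster {result.cluster()}")
```

From the user's perspective this is an ordinary HTCondor submission; that the matching slot is a TARDIS-provisioned pilot on a remote HPC, HTC, or cloud resource remains transparent.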

2.2. User Access & Software Environment

Users interact with the system through familiar entry points:

  • Traditional login nodes for command-line access.
  • A centralized JupyterHub service for web-based interactive computing.
Software environment portability is addressed using container technologies (e.g., Docker, Singularity/Apptainer) and the CERN Virtual Machine File System (CVMFS), which delivers software stacks efficiently via caching.
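
As a rough illustration of how a containerized environment delivered through CVMFS could be executed on a worker node, assuming Apptainer is installed there; the image path under `/cvmfs` and the script name are hypothetical:

```python
import subprocess

# Hypothetical image published in a CVMFS repository; its files are fetched
# and cached on demand by the CVMFS client on the execute node.
image = "/cvmfs/unpacked.example.org/punch/analysis:latest"

# Run one analysis step inside the container, bind-mounting /cvmfs so the
# software stack inside the image can also load libraries from CVMFS.
subprocess.run(
    ["apptainer", "exec", "--bind", "/cvmfs", image, "python3", "process_events.py"],
    check=True,
)
```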

3. Federated Storage Infrastructure – Storage4PUNCH

Storage4PUNCH focuses on federating community storage systems, primarily based on dCache and XRootD technologies, which are standards in High-Energy Physics (HEP). The federation aims to provide a unified namespace and access protocol. The concept evaluates deeper integration through:

  • Storage federation protocols (e.g., based on XRootD's redirector federation or dCache's pool manager).
  • Caching layers to reduce latency and WAN traffic.
  • Metadata handling to improve data discoverability across the federation.
This creates a data lake accessible alongside the federated compute resources.
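
A minimal sketch of how an analysis job could read data through such a federation over the XRootD protocol, here using uproot; the redirector hostname, file path, tree, and branch names are hypothetical.

```python
import uproot  # resolving root:// URLs also requires the XRootD (or fsspec-xrootd) package

# Hypothetical federation redirector: the client contacts the redirector,
# which forwards it to whichever dCache/XRootD site actually holds the file.
url = "root://redirector.punch.example.org//punch/icecube/run_xyz.root"

with uproot.open(url) as f:
    events = f["Events"]                 # hypothetical tree name
    energies = events["energy"].array()  # hypothetical branch name
    print(f"Read {len(energies)} events through the federation")
```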

4. Technical Details & Mathematical Framework

The core scheduling logic can be modeled as an optimization problem. Let $R = \{r_1, r_2, ..., r_n\}$ be the set of heterogeneous resources, each with attributes like architecture, available cores $c_i$, memory $m_i$, and cost/priority factor $p_i$. A job $J$ has requirements $J_{req} = (c_{req}, m_{req}, arch_{req}, t_{req})$. The meta-scheduler's objective is to maximize overall utility or throughput.

A simplified scoring function for placing job $J$ on resource $r_i$ could be: $$ S(J, r_i) = \begin{cases} 0 & \text{if } r_i \text{ does not match } J_{req} \\ \alpha \cdot \frac{c_i}{c_{req}} + \beta \cdot \frac{m_i}{m_{req}} - \gamma \cdot p_i & \text{otherwise} \end{cases} $$ where $\alpha, \beta, \gamma$ are weighting coefficients. The COBalD/TARDIS system implements heuristics and real-time feedback loops to approximate such optimization dynamically, adjusting to resource availability and job queue states.
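
A direct transcription of this scoring function into code, for illustration; the weights and placeholder resources are not values from the paper, and the runtime requirement $t_{req}$ is omitted for brevity.

```python
from dataclasses import dataclass

@dataclass
class Resource:
    arch: str
    cores: int        # c_i
    memory_gb: float  # m_i
    priority: float   # p_i (cost/priority factor)

@dataclass
class JobRequirements:
    arch: str
    cores: int        # c_req
    memory_gb: float  # m_req

def score(job: JobRequirements, r: Resource,
          alpha: float = 1.0, beta: float = 1.0, gamma: float = 0.5) -> float:
    """S(J, r_i): zero if the resource cannot match the job, otherwise a
    weighted ratio of offered to requested capacity minus a cost term."""
    if r.arch != job.arch or r.cores < job.cores or r.memory_gb < job.memory_gb:
        return 0.0
    return (alpha * r.cores / job.cores
            + beta * r.memory_gb / job.memory_gb
            - gamma * r.priority)

# Pick the best-scoring resource for a job (placeholder data).
resources = [Resource("x86_64", 64, 256.0, 0.8), Resource("aarch64", 48, 128.0, 0.2)]
job = JobRequirements("x86_64", 8, 16.0)
best = max(resources, key=lambda r: score(job, r))
```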

5. Prototype Results & Performance

Chart Description (Conceptual): A line chart showing "Aggregate Compute Capacity Accessible Over Time." The x-axis is time (months). Two lines are shown: 1) "Individual Resource Pools (Disconnected)" – flat, staggered lines representing static capacity of individual sites. 2) "Federated Pool via Compute4PUNCH" – a higher, more dynamic line that increases as more sites are integrated and shows smaller fluctuations, demonstrating load balancing across the federation. The chart illustrates the key result: the federated system provides users with a larger, more resilient, and more efficiently utilized virtual resource pool than the sum of its isolated parts.

Initial prototypes successfully demonstrated job submission from a single entry point (JupyterHub) to multiple backend HTCondor pools and HPC clusters (e.g., at KIT, DESY). Jobs utilizing containerized environments via CVMFS were executed transparently on different architectures. Early metrics indicate a reduction in job waiting time for users by leveraging underutilized cycles across the federation, though inter-site data transfer latency remains a critical factor for data-intensive workloads.

6. Analysis Framework: A Conceptual Case Study

Scenario: A multi-messenger astrophysics analysis correlating data from a neutrino telescope (IceCube) and a gamma-ray observatory (CTA).

Workflow without Federation: The researcher must:
  1. Apply for separate compute allocations on an HPC cluster for simulation and an HTC farm for event processing.
  2. Manually transfer large datasets (TB-scale) between storage systems at different institutes.
  3. Manage disparate software environments and authentication methods.

Workflow with Compute4PUNCH/Storage4PUNCH:
  1. The researcher logs into the PUNCH JupyterHub with a single token.
  2. The analysis workflow is defined (e.g., using Snakemake or similar). Simulation tasks (HPC-suited) are automatically routed via TARDIS to appropriate HPC resources. High-throughput event processing tasks are sent to HTC farms.
  3. The workflow references data via the federated storage namespace (e.g., `punch://data/icecube/run_xyz.root`). The underlying XRootD/dCache federation handles location and transfer.
  4. All jobs pull a consistent software environment from CVMFS.
This case study demonstrates the transformative potential: the researcher focuses on science, not infrastructure logistics.
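
To make the contrast concrete, a toy sketch of how such a workflow could tag tasks for routing and reference data by logical name rather than physical location; every scheme, hostname, and task name here is hypothetical, and in practice the name resolution would be performed by the storage federation itself rather than in user code.

```python
def resolve(logical_name: str) -> str:
    """Map a hypothetical punch:// logical name to a physical XRootD URL."""
    prefix = "punch://data/"
    assert logical_name.startswith(prefix)
    return "root://redirector.punch.example.org//punch/" + logical_name[len(prefix):]

# The researcher declares only what each task needs; the overlay decides
# whether a pilot on an HPC or an HTC site runs it.
tasks = [
    {"name": "simulate_cta_showers", "kind": "hpc", "request_cpus": 64, "inputs": []},
    {"name": "process_icecube_events", "kind": "htc", "request_cpus": 4,
     "inputs": ["punch://data/icecube/run_xyz.root"]},
]

for task in tasks:
    urls = [resolve(name) for name in task["inputs"]]
    print(f'{task["name"]}: {task["kind"].upper()} resources, reads {urls or "no input data"}')
```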

7. Future Applications & Development Roadmap

The PUNCH4NFDI infrastructure lays the groundwork for several advanced applications:

  • Federated Machine Learning Training: Leveraging heterogeneous GPUs across sites for large-scale model training, potentially using frameworks like PyTorch or TensorFlow with federated learning algorithms adapted for the HTCondor/TARDIS backend.
  • Dynamic, Policy-Driven Workload Placement: Integrating carbon-aware scheduling, where jobs are routed to sites with high renewable energy availability, similar to concepts explored by the Green Algorithms initiative (a sketch of such a policy term closes this section).
  • Inter-Consortium Federation: Serving as a blueprint for connecting with other NFDI consortia or European initiatives like the European Open Science Cloud (EOSC), creating a pan-European research infrastructure.
  • Intelligent Data Caching & Pre-fetching: Using workflow provenance and predictive analytics to cache datasets proactively at compute sites, mitigating WAN latency, a challenge also central to projects like IRIS-HEP.
The roadmap includes hardening the production service, expanding the resource pool, integrating more sophisticated data management services, and developing higher-level workflow orchestration tools.
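
For the carbon-aware placement item above, one way such a policy could plug into the scoring approach from Section 4 is an additional penalty term for the estimated carbon intensity of each candidate site; the term and its weight below are purely illustrative assumptions.

```python
def carbon_aware_score(base_score: float, carbon_intensity_gco2_per_kwh: float,
                       delta: float = 0.002) -> float:
    """Illustrative policy layer: discount a placement score by the current
    carbon intensity of the candidate site's electricity supply."""
    if base_score <= 0.0:
        return 0.0  # resource does not match the job at all
    return base_score - delta * carbon_intensity_gco2_per_kwh

# Same base score, two sites with different grid carbon intensity.
print(carbon_aware_score(10.0, 80.0))   # mostly renewable supply -> 9.84
print(carbon_aware_score(10.0, 450.0))  # fossil-heavy supply     -> 9.1
```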

8. Analyst's Perspective: Core Insight, Logical Flow, Strengths & Flaws, Actionable Insights

Core Insight: PUNCH4NFDI isn't building a new supercomputer; it's building a virtualization and orchestration layer that turns Germany's fragmented, balkanized research compute landscape into a cohesive, user-centric utility. This is a classic "federation-over-replacement" strategy, prioritizing adoption and incrementalism over revolutionary change—a pragmatically brilliant move given the political and operational realities of publicly-funded institutions.

Logical Flow: The logic is sound: 1) Acknowledge heterogeneity and ownership (resources stay with institutes). 2) Impose minimal new requirements (use tokens, containers). 3) Insert a smart, adaptive middleware layer (COBalD/TARDIS) to abstract complexity. 4) Provide simple, modern user interfaces (JupyterHub). 5) Federate data similarly to complete the loop. It's a bottom-up integration playbook that other consortia should study.

Strengths & Flaws: Strengths: The use of battle-tested components (HTCondor, dCache, CVMFS) from the HEP community drastically reduces technical risk. The focus on AAI and containers tackles the two biggest adoption blockers: access and software. The COBalD/TARDIS choice is inspired—it's a lightweight, Python-based scheduler designed for this exact hybrid-cloud, opportunistic scenario. Critical Flaws: The elephant in the room is data mobility. Federating compute is easier than federating storage. The paper mentions caching and metadata evaluation, but the hard problems of consistent global namespace performance, WAN data transfer costs, and cross-site data policy enforcement are merely gestured at. Without a robust solution here, the federated compute pool will be hamstrung for data-intensive workloads. Furthermore, the success is utterly dependent on sustained "in-kind" contributions from members—a potentially fragile economic model.

Actionable Insights: 1. For PUNCH4NFDI: Double down on the data layer. Partner aggressively with projects like Rucio for data management and the Open Science Grid for operational experience. Develop clear SLAs with resource providers, especially regarding data egress costs. 2. For Competitors/Imitators: Don't just copy the architecture. The real lesson is in the governance and lightweight integration model. Start with a working prototype on a few willing sites and grow organically. 3. For Vendors & Funding Agencies: This model demonstrates that future research computing investment should fund integration middleware and software sustainability (like COBalD) as much as, if not more than, raw hardware. Fund the "glue."

In conclusion, PUNCH4NFDI's approach is a masterclass in pragmatic cyberinfrastructure engineering. It recognizes that the biggest bottleneck in scientific computing is often not FLOPS, but usability and access. If they can crack the federated data nut, they will have created a model with genuine potential to reshape not just German, but European, research computing.

9. References

  1. PUNCH4NFDI Consortium. (2024). PUNCH4NFDI White Paper. NFDI.
  2. Thain, D., Tannenbaum, T., & Livny, M. (2005). Distributed computing in practice: the Condor experience. Concurrency and Computation: Practice and Experience, 17(2-4), 323-356.
  3. Giffels, M., et al. (2023). COBalD/TARDIS - A dynamic resource overlay for opportunistic computing. Journal of Physics: Conference Series.
  4. Blomer, J., et al. (2011). The CernVM File System. Journal of Physics: Conference Series, 331(5), 052004.
  5. Zhu, J. Y., Park, T., Isola, P., & Efros, A. A. (2017). Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks. Proceedings of the IEEE International Conference on Computer Vision (ICCV). (Cited as an example of a transformative computational methodology that could leverage such federated infrastructure).
  6. dCache Collaboration. (2023). dCache: A distributed storage system. https://www.dcache.org.
  7. XRootD Collaboration. (2023). XRootD: High performance, scalable fault tolerant access to data. https://xrootd.slac.stanford.edu.
  8. European Open Science Cloud (EOSC). (2024). https://eosc-portal.eu.