Federated Heterogeneous Compute and Storage Infrastructure for PUNCH4NFDI

Analysis of the Compute4PUNCH and Storage4PUNCH concepts for federating diverse HPC, HTC, and storage resources across German research institutions.

1. Introduction

The PUNCH4NFDI (Particles, Universe, NuClei and Hadrons for the National Research Data Infrastructure) consortium, funded by the German Research Foundation (DFG), represents approximately 9,000 scientists from particle, astro-, astroparticle, hadron, and nuclear physics communities. Embedded within the broader NFDI initiative, its prime goal is to establish a federated and FAIR (Findable, Accessible, Interoperable, Reusable) science data platform. This platform aims to provide seamless access to the diverse computing and storage resources of the involved institutions, addressing common challenges posed by exponentially growing data volumes and computationally intensive analysis algorithms. This document focuses on the architectural concepts—Compute4PUNCH and Storage4PUNCH—designed to federate Germany's heterogeneous research infrastructure.

2. Federated Heterogeneous Compute Infrastructure – Compute4PUNCH

Compute4PUNCH addresses the challenge of effectively utilizing a wide array of in-kind contributed resources, including High-Throughput Compute (HTC), High-Performance Compute (HPC), and Cloud systems, distributed across Germany. These resources vary in architecture, operating systems, software stacks, and access policies. The core design principle is to create a unified overlay system with minimal intrusion on existing, operational resource providers.

2.1. Core Architecture & Integration

The federation is built around HTCondor as the central batch-system overlay. Heterogeneous resources are dynamically integrated using the COBalD/TARDIS resource meta-scheduler, which acts as an intelligent broker: it starts placeholder (pilot) jobs on suitable backends (e.g., Slurm or Kubernetes clusters) based on resource availability, job requirements, and policies, and these pilots then join the HTCondor pool as worker nodes. This creates a single, logical resource pool from physically disparate systems.
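As an illustration of how a user job might enter such an overlay pool, the following minimal sketch uses the HTCondor Python bindings; the resource requests, file names, and input URL are hypothetical placeholders rather than PUNCH-specific values.

```python
# Minimal sketch: submitting a job to a federated HTCondor overlay pool
# via the HTCondor Python bindings. All values are illustrative only.
import htcondor

# Connect to the submit-side scheduler (on a login node this would
# typically be the local schedd of the overlay pool).
schedd = htcondor.Schedd()

# Describe a single analysis job; the container payload and input URL
# are hypothetical placeholders.
job = htcondor.Submit({
    "executable": "/usr/bin/python3",
    "arguments": "analysis.py --input root://storage.example.org//punch/data/file.root",
    "request_cpus": "4",
    "request_memory": "8GB",
    "output": "job.out",
    "error": "job.err",
    "log": "job.log",
})

# Submit and report the assigned cluster ID.
result = schedd.submit(job)
print(f"Submitted cluster {result.cluster()}")
```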

2.2. User Access & Software Environment

User entry points are provided through traditional login nodes and a JupyterHub service. A token-based Authentication and Authorization Infrastructure (AAI) standardizes access. Software environment complexity is managed via container technologies (e.g., Docker, Singularity/Apptainer) and the CERN Virtual Machine File System (CVMFS), which delivers scalable, read-only software distributions to compute nodes globally.
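The interplay of containers and CVMFS can be sketched as follows; the repository path and image name are hypothetical, standing in for whatever the consortium actually publishes under /cvmfs.

```python
# Minimal sketch: running an analysis step inside a container whose image
# is distributed via CVMFS. The repository path and image name are
# hypothetical, not PUNCH-specific.
import os
import subprocess

image = "/cvmfs/sw.example.org/containers/punch-analysis.sif"  # hypothetical path

# Fail early if the CVMFS repository is not mounted on this node.
if not os.path.exists(image):
    raise RuntimeError(f"CVMFS image not available: {image}")

# Execute the payload inside the Apptainer/Singularity container,
# bind-mounting a scratch directory for outputs.
subprocess.run(
    ["apptainer", "exec", "--bind", "/scratch:/scratch", image,
     "python3", "analysis.py"],
    check=True,
)
```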

3. Federated Storage Infrastructure – Storage4PUNCH

Storage4PUNCH aims to federate community-supplied storage systems, primarily based on dCache and XRootD technologies, which are well established in High-Energy Physics (HEP). The federation employs common namespaces and protocols (such as the XRootD root protocol and WebDAV) to present a unified data-access layer. The concept also evaluates integrating caching solutions and metadata-handling services to improve data locality and discoverability across the federation.
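From the client side, federated access through a common namespace could look like the following sketch, which uses the XRootD Python bindings; the redirector hostname and namespace path are invented for illustration.

```python
# Minimal sketch: listing and reading data through a federated XRootD
# namespace with the XRootD Python bindings (pyxrootd). The redirector
# hostname and paths are hypothetical placeholders.
from XRootD import client
from XRootD.client.flags import OpenFlags

fs = client.FileSystem("root://redirector.example.org:1094")

# Query the federated namespace; the redirector forwards the request
# to whichever storage element actually holds the data.
status, listing = fs.dirlist("/punch/community/dataset")
if status.ok:
    for entry in listing:
        print(entry.name)

# Read the beginning of a file directly via the federation.
with client.File() as f:
    f.open("root://redirector.example.org:1094//punch/community/dataset/events.root",
           OpenFlags.READ)
    status, data = f.read(offset=0, size=1024)
```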

4. Technical Implementation & Components

4.1. Authentication & Authorization (AAI)

A token-based AAI (likely leveraging OAuth 2.0/OpenID Connect standards, similar to WLCG IAM or INDIGO IAM) provides a single sign-on experience. It maps community identities to local resource permissions, abstracting away heterogeneous local authentication systems (e.g., Kerberos, SSH keys).
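A minimal sketch of token-based access from a user's perspective, assuming an OIDC access token has already been obtained from the community AAI; the WebDAV endpoint and file path are hypothetical.

```python
# Minimal sketch: using an OIDC access token (e.g. obtained via oidc-agent
# or a device-code flow) as a bearer token to fetch a file over
# WebDAV/HTTPS. Endpoint and path are hypothetical.
import os
import requests

# In practice the token would come from the AAI client tooling;
# here it is simply read from the environment.
token = os.environ["BEARER_TOKEN"]

resp = requests.get(
    "https://webdav.example.org/punch/community/dataset/catalog.fits",
    headers={"Authorization": f"Bearer {token}"},
    timeout=60,
)
resp.raise_for_status()

with open("catalog.fits", "wb") as out:
    out.write(resp.content)
```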

4.2. Resource Meta-Scheduling: COBalD/TARDIS

COBalD (the Opportunistic Balancing Daemon) and TARDIS (the Transparent Adaptive Resource Dynamic Integration System) work in tandem: COBalD decides how many resources of which type the overlay pool should hold, while TARDIS manages the lifecycle of the corresponding "pilots" or "placeholder jobs" on the target resources. This decoupling allows for flexible policy enforcement (e.g., cost, fairness, priority) and dynamic adaptation to fluctuating resource states. The scheduling can be modeled as an optimization problem, minimizing a cost function $C_{total} = \sum_{i}(w_i \cdot T_i) + \lambda \cdot R$, where $T_i$ is the turnaround time for job $i$, $w_i$ is its priority weight, $R$ represents resource usage cost, and $\lambda$ is a balancing parameter.
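The cost function can be made concrete with a short sketch; the job weights, turnaround times, and resource cost below are toy numbers chosen purely to show how the terms combine.

```python
# Minimal sketch: evaluating the illustrative cost function
# C_total = sum_i(w_i * T_i) + lambda * R for a toy set of jobs.
# All numbers are invented for demonstration.
def total_cost(jobs, resource_cost, balance):
    """jobs: list of (priority_weight, turnaround_hours) tuples."""
    weighted_turnaround = sum(w * t for w, t in jobs)
    return weighted_turnaround + balance * resource_cost

jobs = [(1.0, 2.5), (0.5, 6.0), (2.0, 1.0)]   # (w_i, T_i)
print(total_cost(jobs, resource_cost=10.0, balance=0.3))  # 7.5 + 3.0 = 10.5
```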

4.3. Data & Software Layer

CVMFS is critical for software distribution. It uses a content-addressable storage model and aggressive caching (with Stratum 0/1 servers and local Squid caches) to deliver software repositories efficiently. The federation likely employs a hierarchy of CVMFS strata, with a central PUNCH repository Stratum 0 and institutional Stratum 1 mirrors. Data access follows a similar federated model, with storage elements (SEs) publishing their endpoints to a global directory (like Rucio or a simple REST service), allowing clients to resolve data locations transparently.
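A hedged sketch of the client-side replica resolution described here, assuming a simple REST catalogue; the endpoint, response format, and site names are assumptions, not part of the published design.

```python
# Minimal sketch: resolving the "closest" replica of a dataset from a
# hypothetical replica catalogue exposed as a simple REST service.
# Catalogue URL, response format, and site names are assumptions.
import requests

CATALOG = "https://catalog.example.org/api/replicas"  # hypothetical endpoint
LOCAL_SITE = "KIT"

def resolve(lfn: str) -> str:
    """Return an access URL for a logical file name, preferring the local site."""
    replicas = requests.get(CATALOG, params={"lfn": lfn}, timeout=30).json()
    # Assumed response format: [{"site": "KIT", "url": "root://..."}, ...]
    local = [r for r in replicas if r["site"] == LOCAL_SITE]
    return (local or replicas)[0]["url"]

print(resolve("/punch/sdss/imaging/run1234.fits"))
```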

5. Prototype Status & Initial Experiences

The document indicates that prototypes of both Compute4PUNCH and Storage4PUNCH are operational. Initial scientific applications have been executed, providing valuable feedback on performance, usability, and integration pain points. While specific benchmark numbers are not provided in the excerpt, successful execution implies basic functionality of the overlay batch system, AAI integration, and software delivery via CVMFS has been validated. The experiences are guiding refinements in policy configuration, error handling, and user documentation.

6. Key Insights & Strategic Analysis

Core Insight: PUNCH4NFDI isn't building a new supercomputer; it's engineering a "federation fabric" that pragmatically stitches together existing, fragmented resources. This is a strategic shift from monolithic infrastructure to agile, software-defined resource aggregation, mirroring trends in commercial cloud but tailored for the constraints and culture of publicly-funded academia.

Logical Flow: The architecture follows a clear, dependency-driven logic: 1) Unify Identity (AAI) to solve the "who" problem, 2) Abstract Resources (COBalD/TARDIS + HTCondor) to solve the "where" problem, and 3) Decouple Environment (Containers + CVMFS) to solve the "with what" problem. This layered abstraction is textbook systems engineering, reminiscent of the success of the Worldwide LHC Computing Grid (WLCG), but applied to a more diverse resource set.

Strengths & Flaws: The major strength is its non-disruptive adoption model. By using overlay technologies and respecting site autonomy, it lowers the barrier for resource providers, a crucial success factor for consortia. However, this is also its Achilles' heel. The performance overhead of meta-scheduling and the inherent complexity of debugging across heterogeneous, independently administered systems can be significant. The "minimal interference" mandate may limit the ability to implement advanced features like deep storage-compute coupling or dynamic network provisioning, potentially capping efficiency gains. Compared to a purpose-built, centralized system such as Google's Borg or a dedicated Kubernetes cluster, the federation will always have higher latency and lower utilization predictability.

Actionable Insights: For other consortia considering this path: 1) Invest heavily in monitoring and observability from day one. Tools like Grafana/Prometheus for the infrastructure and APM (Application Performance Monitoring) for user jobs are non-negotiable for managing complexity. 2) Standardize on a narrow set of container base images to reduce CVMFS maintenance burden. 3) Develop a clear, tiered support model that distinguishes federation-level issues from local site problems. The real test won't be technical feasibility—the HEP community has proven that—but operational sustainability and user satisfaction at scale.
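To make point 1 slightly more concrete, the following sketch exposes two illustrative federation metrics to Prometheus using the prometheus_client library; the metric names and values are invented and would in practice be derived from HTCondor and COBalD state.

```python
# Minimal sketch: exposing illustrative federation-level metrics to
# Prometheus with prometheus_client. Metric names and values are invented.
import random
import time
from prometheus_client import Gauge, start_http_server

idle_drones = Gauge("punch_idle_drones", "Idle pilot jobs per site", ["site"])
queued_jobs = Gauge("punch_queued_jobs", "Jobs waiting in the overlay pool")

start_http_server(8000)  # metrics served at http://localhost:8000/metrics

while True:
    # Random numbers stand in for values that a real exporter would
    # query from the batch system and meta-scheduler.
    idle_drones.labels(site="KIT").set(random.randint(0, 50))
    queued_jobs.set(random.randint(0, 500))
    time.sleep(15)
```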

7. Technical Deep Dive

Mathematical Model for Resource Scheduling: The COBalD/TARDIS system can be conceptualized as solving a constrained optimization problem. Let $J$ be the set of jobs, $R$ be the set of resources, and $S$ be the set of resource states (e.g., idle, busy, drained). The scheduler aims to maximize a utility function $U$ that considers job priority $p_j$, resource efficiency $e_{j,r}$, and cost $c_r$: $$\max \sum_{j \in J} \sum_{r \in R} x_{j,r} \cdot U(p_j, e_{j,r}, c_r)$$ subject to constraints: $$\sum_{j} x_{j,r} \leq C_r \quad \forall r \in R \quad \text{(Resource Capacity)}$$ $$\sum_{r} x_{j,r} \leq 1 \quad \forall j \in J \quad \text{(Job Assignment)}$$ $$x_{j,r} \in \{0,1\} \quad \text{(Binary Decision Variable)}$$ where $x_{j,r}=1$ if job $j$ is assigned to resource $r$. TARDIS dynamically manages the feasibility of assignments based on real-time state $S$.
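The integer program above is a conceptual model. The following sketch runs a simple greedy heuristic on toy data purely to illustrate the decision variables and capacity constraints; it is not how COBalD/TARDIS actually operates, and the utilities, capacities, and site names are invented.

```python
# Minimal sketch: a greedy heuristic for the binary assignment problem
# above. Utilities, capacities, and site names are toy values; this is
# an illustration of the model, not the COBalD/TARDIS algorithm.
def greedy_assign(jobs, capacity, utility):
    """jobs: iterable of job ids; capacity: {resource: free slots};
    utility(j, r): utility of placing job j on resource r."""
    assignment = {}
    for j in jobs:
        candidates = [r for r, cap in capacity.items() if cap > 0]
        if not candidates:
            break  # pool exhausted, remaining jobs stay queued
        best = max(candidates, key=lambda r: utility(j, r))
        assignment[j] = best
        capacity[best] -= 1  # enforce the resource-capacity constraint
    return assignment

capacity = {"HTC@KIT": 2, "HPC@Bielefeld": 1}
utility = lambda j, r: {"HTC@KIT": 1.0, "HPC@Bielefeld": 0.8}[r] - 0.1 * j
print(greedy_assign(range(4), capacity, utility))
# -> {0: 'HTC@KIT', 1: 'HTC@KIT', 2: 'HPC@Bielefeld'}
```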

Experimental Results & Diagram Description: While the provided PDF excerpt does not contain specific performance graphs, a typical evaluation would include diagrams comparing:
1. Job Throughput Over Time: A line chart showing the number of jobs completed per hour across the federated pool versus individual resource clusters, demonstrating the aggregation benefit.
2. Resource Utilization Heatmap: A grid visualization showing the percentage of CPUs/GPUs utilized across different resource providers (KIT, DESY, Bielefeld, etc.) over a week, highlighting load balancing effectiveness.
3. Job Startup Latency CDF: A Cumulative Distribution Function plot comparing the time from job submission to execution start in the federated system versus direct submission to a local batch system, revealing the meta-scheduling overhead.
4. Data Access Performance: A bar chart comparing read/write speeds for data accessed locally, from a federated storage element within the same region, and from a remote federated element, illustrating the impact of caching and network.

8. Analysis Framework & Conceptual Model

Case Study: Federated Analysis of Astronomical Survey Data
Scenario: A research group at Thüringer Landessternwarte Tautenburg needs to process 1 PB of imaging data from the Sloan Digital Sky Survey (SDSS) to identify galaxy clusters, a compute-intensive task requiring ~100,000 CPU-hours.
Process via Compute4PUNCH/Storage4PUNCH:
1. Authentication: The researcher logs into the PUNCH JupyterHub using their institutional credentials (via the token-based AAI).
2. Software Environment: Their Jupyter notebook kernel runs from a container image hosted on CVMFS, containing all necessary astronomy packages (Astropy, SExtractor, etc.).
3. Job Definition & Submission: They define a parameter sweep in the notebook, which a PUNCH client library submits as an HTCondor DAG (Directed Acyclic Graph) to the federated pool (a minimal sketch of this step follows the case study).
4. Resource Matching & Execution: COBalD/TARDIS evaluates the job requirements (CPU, memory, possibly GPU) and pilots them to available slots across, for example, HTC pools at KIT, HPC queues at Bielefeld University, and cloud nodes at DESY. Jobs read input data via the federated XRootD namespace from the closest storage location, possibly leveraging a cache.
5. Result Aggregation: Output files are written back to the federated storage. The researcher monitors progress via a unified web dashboard and finally aggregates the results in their notebook for analysis.
This case demonstrates the seamless integration of identity, compute, storage, and software management.
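As a hedged illustration of the submission step (3), the following sketch generates a plain HTCondor DAGMan input file for a tile-by-tile parameter sweep. The tile numbering and submit-file names are invented, and the PUNCH client library mentioned above is not modelled; only standard DAGMan syntax is produced.

```python
# Minimal sketch: writing an HTCondor DAGMan input file for a parameter
# sweep over (hypothetical) SDSS image tiles, plus a final merge node.
tiles = [f"tile_{i:04d}" for i in range(100)]

with open("sweep.dag", "w") as dag:
    for tile in tiles:
        # One node per tile, all sharing the same submit description.
        dag.write(f"JOB {tile} cluster_finder.sub\n")
        dag.write(f'VARS {tile} tile="{tile}"\n')
    # A merge node that depends on every tile job.
    dag.write("JOB merge merge_results.sub\n")
    dag.write(f"PARENT {' '.join(tiles)} CHILD merge\n")

# Submit with: condor_submit_dag sweep.dag
```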

9. Future Applications & Development Roadmap

The PUNCH4NFDI infrastructure lays the groundwork for several advanced applications:
1. Federated Machine Learning Training: The heterogeneous resource pool, including potential GPU clusters, could support distributed ML training frameworks like PyTorch or TensorFlow across institutional boundaries, addressing privacy-preserving training needs where data cannot be centralized.
2. Interactive Analysis & Visualization: Enhancing the JupyterHub service with scalable, backend-powered interactive visualization tools (e.g., Jupyter widgets connected to Dask clusters on the federation) for large dataset exploration.
3. Integration with External Clouds & HPC Centers: Extending the federation model to incorporate commercial cloud credits (e.g., AWS, GCP) or national supercomputing centers (e.g., JUWELS at JSC) via a common billing/accounting layer, creating a true hybrid cloud for science.
4. Metadata and Data Lake Integration: Moving beyond simple file federation to an integrated data lake architecture, where the storage layer is coupled with a unified metadata catalog (e.g., based on Rucio or iRODS), enabling data discovery and provenance tracking across communities.
5. Workflow-as-a-Service: Offering higher-level platform services like REANA (Reproducible Analysis Platform) or Apache Airflow on top of the federated infrastructure, allowing scientists to define and execute complex, reproducible analysis pipelines without managing the underlying infrastructure.

The development roadmap will likely focus on hardening the production service, expanding the resource pool, integrating more sophisticated data management tools, and developing user-friendly APIs and SDKs to lower the adoption barrier for non-expert users.

10. References

  1. PUNCH4NFDI Consortium. (2024). PUNCH4NFDI White Paper. [Internal Consortium Document].
  2. Thain, D., Tannenbaum, T., & Livny, M. (2005). Distributed computing in practice: the Condor experience. Concurrency and Computation: Practice and Experience, 17(2-4), 323-356. https://doi.org/10.1002/cpe.938
  3. Blomer, J., et al. (2011). Distribution of software in the CernVM file system with Parrot. Journal of Physics: Conference Series, 331(4), 042009. https://doi.org/10.1088/1742-6596/331/4/042009
  4. Giffels, M., et al. (2022). COBalD and TARDIS – Dynamic resource overlay for opportunistic computing. EPJ Web of Conferences, 251, 02009. https://doi.org/10.1051/epjconf/202225102009
  5. dCache Collaboration. (2023). dCache: A distributed storage data caching system. Retrieved from https://www.dcache.org/
  6. XRootD Collaboration. (2023). XRootD: High performance, scalable fault tolerant access to data. Retrieved from http://xrootd.org/
  7. Wilkinson, M. D., et al. (2016). The FAIR Guiding Principles for scientific data management and stewardship. Scientific Data, 3, 160018. https://doi.org/10.1038/sdata.2016.18
  8. Verma, A., et al. (2015). Large-scale cluster management at Google with Borg. Proceedings of the Tenth European Conference on Computer Systems (EuroSys '15). https://doi.org/10.1145/2741948.2741964