Enabling Data Services for HPC

Authors: Jerome Soumagne (HDF Group), Philip Carns (Argonne National Laboratory), Mohamad Chaarawi (Intel Corporation), Kevin Huck (University of Oregon), Manish Parashar (Rutgers University), Robert Ross (Argonne National Laboratory)

Abstract: Distributed data services can enhance HPC productivity by providing storage, analysis, and visualization capabilities not otherwise present in conventional parallel file systems. Such services are difficult to develop, maintain, and deploy in a scientific workflow, however, due to the complexities of specialized HPC networks, RDMA data transfers, protocol encoding, and fault tolerance.

This BoF will bring together a growing community of researchers, developers, vendors, and facility operators who are either using or developing HPC data services. Participants will share and discuss practical experiences, implementation examples, new features, and best practices to construct and deploy production-quality, high-performance distributed services.

Long Description: As software and hardware become more specialized, the HPC community has a growing need to build distributed services. These include I/O services such as dedicated storage systems, as well as middleware services that involve either data analysis or code coupling for monitoring and processing data. Given the constraints imposed by HPC systems, delivering performance, resilience, and adaptability to the end user requires base components designed to meet those requirements. However, due to the complexities of specialized HPC networks, RDMA data transfers, protocol encoding, and fault tolerance, this cannot be achieved, in most cases, without a substantial engineering effort.

Following up on last year's BoF, the six co-organizers of this BoF include two representatives from industry, two from academia, and two from government research. Collectively, they represent a diverse collection of prominent data service projects:

DataSpaces (http://dataspaces.org) is a programming system developed at Rutgers University that provides abstractions and services for interaction, coordination and data exchange to support extreme-scale in-situ workflows. Specifically, DataSpaces implements a scalable, semantically specialized shared space abstraction that is dynamically accessible by all components and services in an application workflow, supporting application/system-aware data placement and movement.

The Distributed Asynchronous Object Storage (DAOS, http://daos.io) is an open-source software-defined object store developed at Intel and designed from the ground up for massively distributed non-volatile memory (NVM). DAOS provides features such as transactional non-blocking I/O, end-to-end data integrity, fine-grained data control, and elastic storage to optimize performance and cost. DAOS uses a flexible micro-service architecture leveraging open-source components like Mercury, Argobots, ISA-L, PMDK, and SPDK.

The Mochi project (http://www.mcs.anl.gov/research/projects/mochi/) is a collaboration between ANL, the HDF Group, LANL, and CMU. The Mochi project is creating a new framework for rapid composition of specialized data services that leverage emerging network and storage technology to meet the needs of data-intensive scientific computing. It relies on the Mercury project (http://mercury-hpc.github.io/), a library that implements remote procedure calls (RPC) and hides the complexity of low-level HPC network layers.

The Scalable Observation System (SOS), with the reference implementation SOSflow developed at the University of Oregon, enables a broad set of online and in situ capabilities, including code steering via remote method invocation, data analysis, and visualization. SOSflow can couple together multiple sources of data, such as application components and operating environment measures, with multiple software libraries and performance tools, efficiently creating holistic views of performance at runtime.

The organizers are well qualified to help identify common challenges facing the broader HPC data services community and to engage expert speakers who will represent additional points of view. This BoF aims to: 1) give a representative overview of current practices for developing and deploying data services in HPC; 2) share experiences of higher-level services that have been built on top of these existing solutions; and 3) present future directions and assess potential user needs among the HPC community.

This BoF targets HPC middleware software developers, researchers, vendors and facility operators.

URL: https://hpc-data-services.github.io/bofs/
