DAPHNE: An Open and Extensible System Infrastructure for Integrated Data Analysis Pipelines

Published in CIDR, 2022

Patrick Damme (Graz University of Technology & Know-Center GmbH)*; Marius Birkenbach (KAI); Constantinos Bitsakos (NTUA); Matthias Boehm (Graz University of Technology); Philippe Bonnet (IT Univ Copenhagen, Denmark); Florina M. Ciorba (Technical University of Dresden, Germany / University of Basel, Switzerland); Mark Dokter (Know-Center GmbH); Pawel Dowgiallo (Intel); Ahmed Eleliemy (University of Basel); Christian Faerber (Intel Corporation); Georgios Goumas (National Technical University of Athens); Dirk Habich (TU Dresden); Niclas Hedam (IT University of Copenhagen); Marlies Hofer (AVL List GmbH); Wenjun Huang (German Aerospace Center); Kevin Innerebner (Graz University of Technology); Vasileios Karakostas (National Technical University of Athens (NTUA)); Roman Kern (KNOW-CENTER GmbH); Tomaž Kosar (University of Maribor); Alexander Krause (TU Dresden); Daniel Krems (AVL List GmbH); Andreas Laber (Infineon); Wolfgang Lehner (TU Dresden); Eric Mier (TU Dresden); Marcus Paradies (German Aerospace Center); Bernhard Peischl; Gabrielle Poerwawinata (University of Basel); Stratos Psomadakis (ICCS/NTUA); Tilmann Rabl (HPI, University of Potsdam); Piotr Ratuszniak (Intel Technology Poland); Pedro Silva (HPI, University of Potsdam); Nikolai Skuppin (German Aerospace Center (DLR)); Andreas Starzacher (Infineon); Benjamin Steinwender (KAI GmbH); Ilin Tolovski (Hasso Plattner Institute); Pinar Tozun (IT University of Copenhagen); Wojciech Ulatowski (Intel); Yuanyuan Wang (Technical University of Munich (TUM); German Aerospace Center (DLR)); Izajasz Wrosz (Intel); Aleš Zamuda (University of Maribor); Ce Zhang (ETH); Xiaoxiang Zhu (Technical University of Munich (TUM); German Aerospace Center (DLR))

Abstract. Integrated data analysis (IDA) pipelines, which combine data management (DM) and query processing, high-performance computing (HPC), and machine learning (ML) training and scoring, are becoming increasingly common in practice. Interestingly, systems in these areas share many compilation and runtime techniques, and the increasingly heterogeneous hardware infrastructure they use is converging as well. Yet, the programming paradigms, cluster resource management, data formats and representations, as well as execution strategies differ substantially. DAPHNE is an open and extensible system infrastructure for such IDA pipelines, including language abstractions, compilation and runtime techniques, multi-level scheduling, hardware (HW) accelerators, and computational storage for increasing productivity and eliminating unnecessary overheads. In this paper, we make a case for IDA pipelines and describe the overall DAPHNE system architecture, its key components, and the design of a vectorized execution engine for computational storage, HW accelerators, as well as local and distributed operations. Preliminary experiments that compare DAPHNE with MonetDB, Pandas, DuckDB, and TensorFlow show promising results.
