Modern high-resolution observational instruments and complex models of the Earth system and of physical, chemical, and biological processes generate hundreds of petabytes of scientific data per year. Digital data archives store such scientific data in private clouds for further investigation and long-term preservation, and disseminate it through data platforms via order-based catalogs. To reduce the total cost of ownership, such data platforms employ hierarchical storage management with large, disk-based caches and robotic tape libraries. Prefetching all the data from a slower storage layer is typically not possible due to the ad-hoc nature of scientific analyses and the size of the data sets required to achieve satisfactory results for long-term trend analysis and prediction. Consequently, data movement is one of the most time- and energy-consuming tasks in data-intensive scientific workflows. Near-data processing (NDP) has been advocated to reduce the amount of data to be transferred as early as possible. Unfortunately, for large-scale scientific data, this only benefits the faster layers of the storage hierarchy. In a deep storage hierarchy, as is common for active data archives, NDP can be even more beneficial if pushed further down the storage hierarchy.
We propose CryoDrill, an NDP framework that pushes parts of the computation of a data analysis workflow down the storage hierarchy to enable processing close to the data while minimizing wasteful data movement up the storage hierarchy. CryoDrill specifically targets complex data analysis tasks on large amounts of scientific data residing on cold storage devices, such as archival disks, massive-array-of-idle-disks systems, and robotic tape libraries. We plan to use in-storage processing resources by extending storage controllers to run simple computational tasks, e.g., filtering, data tiling and tile selection, and aggregation, directly within the storage device, or to exploit near-storage processing capabilities through FPGAs.
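To illustrate the kind of pushdown we have in mind, the following minimal Python sketch shows a filter-and-aggregate task that is shipped to a storage-side worker so that only the reduced result travels back up the hierarchy. All names here (Tile, PushdownRequest, StorageSideWorker) are illustrative assumptions for this sketch and do not reflect CryoDrill's actual interfaces or implementation.

```python
# Sketch: shipping a filter + aggregation down to a storage-side worker
# instead of moving raw tiles up the hierarchy. Hypothetical names only.
from dataclasses import dataclass
from statistics import mean
from typing import Callable, Iterable, List


@dataclass
class Tile:
    """A chunk of a larger scientific data set as laid out on the device."""
    tile_id: int
    values: List[float]


@dataclass
class PushdownRequest:
    """The computation that travels down instead of the data traveling up."""
    tile_ids: List[int]                          # tile selection
    predicate: Callable[[float], bool]           # per-value filter
    reduce: Callable[[Iterable[float]], float]   # aggregation over surviving values


class StorageSideWorker:
    """Runs on or next to the storage device, e.g., on a controller or FPGA host."""

    def __init__(self, tiles: List[Tile]):
        self._tiles = {t.tile_id: t for t in tiles}

    def execute(self, req: PushdownRequest) -> float:
        # Filter and aggregate locally; only a small result leaves the device.
        surviving = (v
                     for tid in req.tile_ids
                     for v in self._tiles[tid].values
                     if req.predicate(v))
        return req.reduce(surviving)


if __name__ == "__main__":
    worker = StorageSideWorker([
        Tile(0, [1.2, 5.7, 3.3]),
        Tile(1, [8.1, 0.4, 6.6]),
    ])
    # Mean of all values above a threshold in two selected tiles.
    req = PushdownRequest(tile_ids=[0, 1],
                          predicate=lambda v: v > 3.0,
                          reduce=mean)
    print(worker.execute(req))
```

In this sketch the client never sees the raw tiles; it only receives the aggregated scalar, which is the data-reduction effect CryoDrill aims to achieve deep in the storage hierarchy.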