Ophidia is a CMCC Foundation research project addressing big data challenges for eScience. It supports data-intensive analysis by exploiting advanced parallel computing techniques and smart data distribution methods. It relies on an array-based storage model and a hierarchical storage organization to partition and distribute multidimensional scientific datasets over multiple nodes. Although the Ophidia analytics framework was primarily designed to meet climate change analysis requirements, it can also be used in other scientific domains where data is multidimensional.
Ophidia key features:
1. Designed for eScience
The n-dimensionality of scientific datasets requires tools that support specific data types (e.g. arrays) and primitives (e.g. slicing, dicing, pivoting, drill-down, roll-up) to properly enable data access and analysis. Compared with general-purpose multidimensional analytics systems, scientific data poses higher computational demands, which calls for efficient parallel/distributed solutions to meet (near) real-time analytics requirements. Ophidia supports several data formats (e.g. NetCDF, FITS, SAC), which allows it to manage data from different scientific domains, as the sketch below illustrates.
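As an illustration, the PyOphidia Python bindings expose import and subsetting at the datacube level. The following is a minimal sketch, assuming a reachable Ophidia server, illustrative credentials, and a hypothetical NetCDF file /data/tasmax.nc holding a tasmax variable (parameter names follow the PyOphidia API and should be checked against the installed version):

```python
# Minimal PyOphidia sketch; server address, credentials and input file are
# illustrative assumptions, not actual deployment values.
from PyOphidia import cube

# Connect to a (hypothetical) Ophidia server instance
cube.Cube.setclient(username="oph-user", password="oph-passwd",
                    server="ophidia.example.org", port="11732")

# Import a NetCDF variable as a new datacube; "time" becomes the array dimension
tasmax = cube.Cube.importnc(src_path="/data/tasmax.nc",
                            measure="tasmax", imp_dim="time")

# Slicing/dicing: extract a subset along lat/lon/time (coordinate values)
subset = tasmax.subset(subset_dims="lat|lon|time",
                       subset_filter="-50:10|20:30|150:240",
                       subset_type="coord")
```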
2. Server-side approach
Most scientists currently follow a workflow based on the “search, locate, download, and analyze” sequence of steps. This workflow is not feasible at large scale and will fail for several reasons, such as: (i) ever-larger scientific datasets, (ii) time- and resource-consuming data downloads, and (iii) increased problem size and complexity requiring bigger computing facilities. At large scale, scientific discovery will strongly rely on data-intensive facilities close to the data storage, parallel I/O systems, and server-side analysis capabilities. Such an approach moves the analysis (and complexity) from the user’s desktop to the data centers and, accordingly, shifts the infrastructure focus from data sharing to data analysis.
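Continuing the hedged sketch above, the whole computation runs server-side and only the reduced result reaches the client:

```python
# Server-side analysis: the average is computed where the data resides,
# in parallel over the fragments; only the reduced values cross the network.
mean_tasmax = subset.reduce(operation="avg")

# Fetch just the (small) final result on the client
result = mean_tasmax.export_array()
```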
3. Parallel and distributed
The Ophidia analytics platform provides several MPI-based parallel operators to manipulate the entire set of fragments associated with a datacube as a whole. Relevant examples include datacube sub-setting (slicing and dicing), datacube aggregation, array-based primitives applied at the datacube level, datacube duplication, datacube pivoting, and NetCDF file import and export. To address scalability and enable parallelism, from a physical point of view a datacube in Ophidia is horizontally split into several chunks (called fragments) that are distributed across multiple I/O nodes. Each I/O node hosts a set of I/O servers optimized to manage n-dimensional arrays; each I/O server, in turn, manages a set of databases consisting of one or more fragments. Tuning the levels of this hierarchy can strongly affect performance: for a given datacube, the higher the product of the four levels (I/O nodes, I/O servers per node, databases per server, and fragments per database), the smaller each fragment will be, as the example below shows.
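As a back-of-the-envelope illustration of this trade-off, the sketch below uses purely hypothetical figures to show how the product of the four levels determines the fragment size:

```python
# Illustrative fragmentation arithmetic; all figures are hypothetical.
total_tuples = 10**8      # rows in the whole datacube

io_nodes = 4              # I/O nodes hosting the datacube
servers_per_node = 2      # I/O servers per node
dbs_per_server = 5        # databases per I/O server
frags_per_db = 25         # fragments per database

# The product of the four levels gives the total number of fragments...
num_fragments = io_nodes * servers_per_node * dbs_per_server * frags_per_db

# ...so a higher product means smaller (but more numerous) fragments.
tuples_per_fragment = total_tuples // num_fragments
print(num_fragments, tuples_per_fragment)   # 1000 fragments, 100000 tuples each
```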
4. Declarative
Ophidia exploits a declarative (query-like) approach to express and define data analysis tasks. Its declarative data analytics language allows the user to create, manage and manipulate datacubes, as well as analyze the data, by describing “what” should be performed rather than “how”, leaving the choice of implementation strategy to the system (see the sketch below). Moreover, Ophidia supports complex workflows/operational chains, with a specific syntax and a dedicated scheduler that exploits inter- and intra-task parallelism.
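For instance, a task is expressed as an operator name plus key-value arguments; parallelization and fragment handling are left to the framework. A minimal sketch, assuming a reachable server and an illustrative datacube PID, submits such a declarative request through the PyOphidia client:

```python
# Submitting a declarative, query-like request; credentials, server address
# and the datacube PID below are illustrative assumptions.
from PyOphidia import client

ophclient = client.Client(username="oph-user", password="oph-passwd",
                          server="ophidia.example.org", port="11732")

# "What" to compute (an average over the array dimension), not "how":
# scheduling, parallelism and fragment access are decided by the system.
ophclient.submit("oph_reduce cube=http://ophidia.example.org/1/1;operation=avg;ncores=4;")
```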
5. Extensible
Ophidia provides a large set of operators (50+) and primitives (100+) covering many types of data manipulation, such as sub-setting (slicing and dicing), data reduction, duplication, pivoting, and file import and export. The framework is also highly extensible: a minimal set of APIs makes it possible to develop new operators and primitives that implement additional algorithms and functionality, which plug into the framework as sketched below.
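As a hint of how primitives plug in, each primitive is an array-based function that an operator such as oph_apply evaluates on every fragment. The hedged line below, reusing the cube from the earlier sketch, applies the oph_sum_scalar primitive (the argument order shown follows the usual Ophidia convention of input type, output type, measure, operand):

```python
# Apply an array primitive to every fragment of the datacube via oph_apply;
# the target cube and the operand (273.15) are illustrative assumptions.
shifted = subset.apply(query="oph_sum_scalar('OPH_FLOAT','OPH_FLOAT',measure,273.15)")
```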
References:
- S. Fiore, C. Palazzo, A. D’Anca, D. Elia, E. Londero, C. Knapic, S. Monna, N. M. Marcucci, F. Aguilar, M. Płóciennik, J. E. M. De Lucas, G. Aloisio, “Big Data Analytics on Large-Scale Scientific Datasets in the INDIGO-DataCloud Project”. In Proceedings of the ACM International Conference on Computing Frontiers (CF ’17), May 15-17, 2017, Siena, Italy, pp. 343-348.
- A. D’Anca, C. Palazzo, D. Elia, S. Fiore, I. Bistinas, K. Böttcher, V. Bennett, G. Aloisio, “On the Use of In-memory Analytics Workflows to Compute eScience Indicators from Large Climate Datasets” 2017 17th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID), Madrid, May 14-17, 2017, pp. 1035-1043.
- S. Fiore, M. Płóciennik, C. M. Doutriaux, C. Palazzo, J. Boutte, T. Zok, D. Elia, M. Owsiak, A. D’Anca, Z. Shaheen, R. Bruno, M. Fargetta, M. Caballer, G. Moltó, I. Blanquer, R. Barbera, M. David, G. Donvito, D. N. Williams, V. Anantharaj, D. Salomoni, G. Aloisio, “Distributed and cloud-based multi-model analytics experiments on large volumes of climate change data in the earth system grid federation eco-system”. In Big Data (Big Data), 2016 IEEE International Conference on. IEEE, 2016, pp. 2911-2918.
- M. Płóciennik, S. Fiore, G. Donvito, M. Owsiak, M. Fargetta, R. Barbera, R. Bruno, E. Giorgio, D. N. Williams, G. Aloisio, “Two-level Dynamic Workflow Orchestration in the INDIGO DataCloud for Large-scale, Climate Change Data Analytics Experiments”, International Conference on Computational Science 2016, ICCS 2016, 6-8 June 2016, San Diego, California, USA. Procedia Computer Science, vol. 80, 2016, pp. 722-733.
- D. Elia, S. Fiore, A. D’Anca, C. Palazzo, I. Foster, D. N. Williams, G. Aloisio, “An in-memory based framework for scientific data analytics”. In Proceedings of the ACM International Conference on Computing Frontiers (CF ’16), May 16-19, 2016, Como, Italy, pp. 424-429.
- C. Palazzo, A. Mariello, S. Fiore, A. D’Anca, D. Elia, D. N. Williams, G. Aloisio, “A Workflow-Enabled Big Data Analytics Software Stack for eScience”, The Second International Symposium on Big Data Principles, Architectures & Applications (BDAA 2015), HPCS 2015, Amsterdam, The Netherlands, July 20-24, 2015, pp. 545-552.
- S. Fiore, M. Mancini, D. Elia, P. Nassisi, F. V. Brasileiro, I. Blanquer, I. A. A. Rufino, A.C. Seijmonsbergen, C. O. Galvao, V. P. Canhos, A. Mariello, C. Palazzo, A. Nuzzo, A. D’Anca, G. Aloisio, “Big data analytics for climate change and biodiversity in the EUBrazilCC federated cloud infrastructure”, Workshop on Analytics Platforms for the Cloud, In Proceedings of the 12th ACM International Conference on Computing Frontiers (CF ’15), May 18th, 2015, Ischia, Italy. Article 52, 8 pages.
- S. Fiore, A. D’Anca, D. Elia, C. Palazzo, I. Foster, D. Williams, G. Aloisio, “Ophidia: A Full Software Stack for Scientific Data Analytics”, Proceedings of the 2014 International Conference on High Performance Computing & Simulation (HPCS 2014), July 21-25, 2014, Bologna, Italy, pp. 343-350, ISBN: 978-1-4799-5311-0.
- S. Fiore, C. Palazzo, A. D’Anca, I. T. Foster, D. N. Williams, G. Aloisio, “A big data analytics framework for scientific data management”, IEEE BigData Conference 2013: 1-8.
- S. Fiore, A. D’Anca, C. Palazzo, I. T. Foster, D. N. Williams, G. Aloisio, “Ophidia: Toward Big Data Analytics for eScience”, ICCS 2013, June 5-7, 2013, Barcelona, Spain. Procedia Computer Science, vol. 18, pp. 2376-2385. Elsevier, 2013.
- G. Aloisio, S. Fiore, I. Foster, D. Williams, “Scientific big data analytics challenges at large scale”, Big Data and Extreme-scale Computing (BDEC), April 30 to May 1, 2013, Charleston, South Carolina, USA.