WP4 Dynamic scheduling for MLMC

Objectives

The objective of WP4 is to deploy, configure and optimize the programming models HyperLoom and COMPSs, together with their runtimes, on the supercomputer infrastructure. The work package will also define a common Python API for the different schedulers. The CFD solver backends and the MLMC engine will be designed to communicate through this API, so that different modules (or schedulers) can be tested independently. The API will also provide basic facilities for accessing fast local storage, in order to increase data IO capabilities and, at the same time, the energy efficiency of IO operations.

The scheduling systems will also be configured to provide a fault tolerance mechanism for handling critical events that occur during the simulation of any one scenario. This will include both detecting and recovering from hardware faults and handling software problems (for example, lack of convergence) in the scenario under consideration.
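The distinction between the two kinds of failure can be illustrated with a minimal Python sketch. It is not the mechanism offered by HyperLoom or COMPSs: the exception classes, the function `run_with_recovery` and the retry policy are all hypothetical names and choices introduced here for illustration. The assumption is that a hardware fault is worth retrying, while a scenario that fails to converge is discarded rather than rerun.

```python
class NodeFailure(Exception):
    """Hypothetical: the hardware running the task was lost."""


class NotConverged(Exception):
    """Hypothetical: the solver did not converge for this scenario."""


def run_with_recovery(task, scenario, max_retries=2):
    """Run task(scenario) and return a (status, value) pair.

    Hardware faults are retried up to max_retries times; a scenario
    that does not converge is discarded without retrying, since
    rerunning the same inputs is assumed to fail in the same way.
    """
    attempts = 0
    while True:
        try:
            return ("ok", task(scenario))
        except NodeFailure:
            attempts += 1
            if attempts > max_retries:
                return ("failed", None)
        except NotConverged:
            return ("discarded", None)


# Illustrative stand-ins for solver runs (not real solvers).
calls = {"n": 0}


def flaky_solver(scenario):
    """Fails on the first call only, simulating a transient node loss."""
    calls["n"] += 1
    if calls["n"] == 1:
        raise NodeFailure("simulated node loss")
    return scenario * 2


def diverging_solver(scenario):
    """Always reports lack of convergence."""
    raise NotConverged("simulated divergence")


def dead_node_solver(scenario):
    """Always reports a hardware fault."""
    raise NodeFailure("simulated permanent fault")
```

In this sketch the caller receives a status for each scenario and can decide whether to draw a replacement sample, which keeps the policy decision in the MLMC engine rather than in the scheduler.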

Tasks

Task 4.1: Common API definition

The first task focuses on the definition of a common API, tailored to the needs of MLMC. This API will be agreed upon by the partners and used extensively as the glue between the different packages. The API will be designed to allow the transparent substitution of one module by another, and to allow switching the implementation of the underlying scheduler.
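As an illustration of what switching the underlying scheduler could look like, the following Python sketch defines an abstract interface with two interchangeable backends. All names (`Scheduler`, `submit`, `gather`, `SerialScheduler`, `ThreadScheduler`, `solve_sample`, `run_level`) are hypothetical and do not describe the API to be agreed by the partners. The thread-pool backend is only a stand-in for a real runtime such as HyperLoom or COMPSs, whose actual interfaces are not used here.

```python
from abc import ABC, abstractmethod
from concurrent.futures import ThreadPoolExecutor


class Scheduler(ABC):
    """Hypothetical common interface; not the actual project API."""

    @abstractmethod
    def submit(self, func, *args):
        """Schedule func(*args) and return an opaque handle."""

    @abstractmethod
    def gather(self, handles):
        """Wait for all handles and return their results in order."""


class SerialScheduler(Scheduler):
    """Runs each task immediately in the calling thread."""

    def submit(self, func, *args):
        return func(*args)

    def gather(self, handles):
        return list(handles)


class ThreadScheduler(Scheduler):
    """Stand-in for a real runtime backend, built on a thread pool."""

    def __init__(self, workers=4):
        self._pool = ThreadPoolExecutor(max_workers=workers)

    def submit(self, func, *args):
        return self._pool.submit(func, *args)

    def gather(self, handles):
        return [h.result() for h in handles]


def solve_sample(level, seed):
    """Placeholder for one CFD solve; returns a deterministic number."""
    return level * 10 + seed


def run_level(scheduler, level, n_samples):
    """MLMC-side code: depends only on the Scheduler interface."""
    handles = [scheduler.submit(solve_sample, level, s) for s in range(n_samples)]
    return scheduler.gather(handles)
```

Because `run_level` only sees the abstract interface, calling `run_level(SerialScheduler(), 2, 3)` and `run_level(ThreadScheduler(), 2, 3)` exercises the same MLMC-side code with different backends, which is the kind of independent testing the common API is meant to enable.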

Task 4.2: Deployment of VVUQ framework, tools and solvers on the infrastructure

The VVUQ framework developed in the other WPs will be deployed on the supercomputer infrastructure. A detailed benchmark of the different components (both the CFD solvers and the MLMC engine) will be performed.

Task 4.3: Profiling

Profiling feedback will be provided to the partners involved. This will take advantage of BSC's profiling capabilities and will focus both on the problem scheduling and on the performance of the different domain solvers. It will also include exploring the possibility of using the low-level C API of COMPSs/HyperLoom within the production codes to increase the scalability of core routines within the solvers.

Task 4.4: Infrastructure benchmarking

In this task, the infrastructure and the execution frameworks will be tested on selected benchmarks in terms of scalability and optimal use of hardware and software. Applications for access to Tier-0 systems will be prepared and submitted (through the PRACE mechanism) in order to gain access to Europe's largest machines. As an alternative, access to Exascale Demonstrators will be requested if this appears to be competitive within the time frame of the project.

Task 4.5: Development of interface to fast local storage

This task will provide a simple C and Python API for taking advantage of fast local storage in IO operations. The interface will also aim to support next-generation local persistent storage.
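One possible shape for the Python side of such an interface is sketched below, using only the standard library. It assumes a write-locally-then-move pattern: output is written to node-local scratch space and moved to its final (typically shared) location once complete. The environment variable `FAST_LOCAL_DIR` and the names `local_scratch_root` and `staged_output` are hypothetical and are not part of any existing API.

```python
import os
import shutil
import tempfile
from contextlib import contextmanager


def local_scratch_root():
    """Return the node-local scratch directory.

    FAST_LOCAL_DIR is a hypothetical variable; if it is unset, the
    system temporary directory is used as a fallback.
    """
    return os.environ.get("FAST_LOCAL_DIR", tempfile.gettempdir())


@contextmanager
def staged_output(final_path):
    """Yield a path on local storage; move the file to final_path on success.

    If the body raises, nothing is moved and the local copy is removed.
    """
    workdir = tempfile.mkdtemp(dir=local_scratch_root())
    local_path = os.path.join(workdir, os.path.basename(final_path))
    try:
        yield local_path
        # Reached only if the body finished without raising.
        shutil.move(local_path, final_path)
    finally:
        shutil.rmtree(workdir, ignore_errors=True)
```

A solver would then write its results inside `with staged_output(path) as local_path:` and open `local_path` instead of `path`, so that many small writes hit local storage and the shared file system sees a single transfer per file.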

Task 4.6: Framework development and optimization

This task will run for the entire duration of the project. It will include continuous improvement of the scheduling mechanisms and the development of optimizations specific to UQ and MLMC.