ANDREAS - Artificial intelligence traiNing scheDuler foR disaggrEgAted resource clusterS
Horizon 2020
DEIB role: Participant
Start date: 01/05/2020
Duration: 10 months
Summary
ANDREAS (Artificial intelligence traiNing scheDuler foR disaggrEgAted resource clusterS) aims to address two key market needs: more efficient use of resources and lower power consumption. Today, Artificial Intelligence (AI) and Deep Learning (DL) methods are used in a wide range of applications and are supported by several platforms.
ANDREAS aims to develop advanced scheduling solutions that optimize the run-time performance of DL training and minimize the energy consumption of the training phase in aggregated and disaggregated GPU-based clusters.
The architecture envisioned in ANDREAS is based on a SLURM queue manager, a pool of servers, a pool of GPUs accessed through a switch, and an intelligent module, connected to the job scheduler, that predicts application energy consumption and performance. Training jobs are submitted to SLURM and are characterized by a deadline and a priority (i.e., a weight). Jobs are never rejected, but they may be delayed. The final goal is to minimize the weighted job tardiness within the power budget established by the System Administrator.
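As an illustration of this objective, the following minimal Python sketch evaluates a candidate schedule: it computes the weighted tardiness (each job's weight times its delay past the deadline, summed over all jobs) and checks the schedule against a power budget. The `Job` fields, the function names, and the assumption that each job draws constant power while it runs are hypothetical and chosen only for this example; they are not taken from the ANDREAS scheduler or from SLURM.

```python
from dataclasses import dataclass


@dataclass
class Job:
    # All fields are illustrative assumptions, not ANDREAS or SLURM data structures.
    name: str
    weight: float    # priority of the job
    deadline: float  # time by which the job should finish
    start: float     # scheduled start time
    end: float       # predicted completion time
    power: float     # assumed constant power draw while running (watts)


def weighted_tardiness(jobs):
    """Sum of weight * max(0, completion - deadline) over all jobs."""
    return sum(j.weight * max(0.0, j.end - j.deadline) for j in jobs)


def peak_power(jobs):
    """Highest total power drawn at any time.

    With constant per-job power, the total only increases when a job
    starts, so it is enough to evaluate it at every start time.
    """
    return max(
        (sum(k.power for k in jobs if k.start <= j.start < k.end) for j in jobs),
        default=0.0,
    )


def within_budget(jobs, power_budget):
    """True if the schedule never exceeds the power budget."""
    return peak_power(jobs) <= power_budget


schedule = [
    Job("a", weight=2.0, deadline=10.0, start=0.0, end=12.0, power=300.0),
    Job("b", weight=1.0, deadline=8.0, start=4.0, end=7.0, power=250.0),
]

print(weighted_tardiness(schedule))    # job "a" is 2 time units late with weight 2
print(within_budget(schedule, 600.0))  # both jobs overlap between t=4 and t=7
```

A scheduler pursuing the stated goal would search over start times and resource assignments for the schedule with the lowest weighted tardiness among those that stay within the budget; this sketch only scores one given schedule.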
ANDREAS is a 10-month project, and the team plans to build early prototypes of the solution by fall 2020.