[PDS/HPDA Seminar] 18/11/2022 from 10:00 to 11:00 at 1C27 – Ewa Turska (reading group) and François Trahay (team work)

During the PDS/HPDA Seminar of 18/11/2022 from 10:00 to 11:00, Ewa Turska will present a reading group talk and François Trahay will present a team work talk.

Visio: https://webconf.imt.fr/frontend/fra-vcg-byn-fxd

Location: 1C27

# Reading group: Leveraging Bagging for Evolving Data Streams (ECML PKDD’10)

Presented by Ewa Turska on 18/11/2022 at 10:00. Attending this presentation is mandatory for master students.

Paper: http://www.math.chalmers.se/Stat/Grundutb/GU/MSA220/S18/OnlineAndLeverageBagging.pdf

Full post: https://www.inf.telecom-sudparis.eu/pds/seminars_cpt/reading-group-10/

## Abstract
Bagging, boosting and Random Forests are classical ensemble methods used to improve the performance of single classifiers. They obtain superior performance by increasing the accuracy and diversity of the single classifiers. Attempts have been made to reproduce these methods in the more challenging context of evolving data streams. In this paper, we propose a new variant of bagging, called leveraging bagging. This method combines the simplicity of bagging with adding more randomization to the input and output of the classifiers. We test our method by performing an evaluation study on synthetic and real-world datasets comprising up to ten million examples.
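To make the input-randomization idea concrete, here is a minimal Python sketch (not the authors' implementation): in online bagging each learner trains on every streamed example k ~ Poisson(1) times, and leveraging bagging increases the Poisson parameter (the value 6.0 below is illustrative). The output randomization mentioned in the abstract, and the rest of the method, are deliberately left out; the `base_learners` are assumed to be any incremental classifiers exposing a scikit-learn-style `partial_fit`.

```python
import numpy as np

class LeveragingBaggingSketch:
    """Sketch of the input randomization behind leveraging bagging:
    each ensemble member sees every streamed example k ~ Poisson(lam) times
    (online bagging uses lam = 1; leveraging bagging uses a larger lam)."""

    def __init__(self, base_learners, lam=6.0, rng=None):
        self.learners = base_learners          # incremental classifiers with partial_fit
        self.lam = lam                          # illustrative value, not prescribed here
        self.rng = rng or np.random.default_rng()

    def partial_fit(self, x, y, classes):
        # Feed the incoming example to each learner a Poisson-distributed number of times.
        for learner in self.learners:
            k = self.rng.poisson(self.lam)
            for _ in range(k):
                learner.partial_fit([x], [y], classes=classes)

    def predict(self, x):
        # Majority vote over the ensemble members.
        votes = [learner.predict([x])[0] for learner in self.learners]
        values, counts = np.unique(votes, return_counts=True)
        return values[np.argmax(counts)]
```

For example, `base_learners` could be a list of `sklearn.linear_model.SGDClassifier` instances, with `partial_fit` called once per arriving stream example.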

# Team work: Why Globally Re-shuffle? Revisiting Data Shuffling in Large Scale Deep Learning (IPDPS’22)

Presented by François Trahay on 18/11/2022 at 10:30. Attending this presentation is mandatory for master students.

Paper: https://hal.archives-ouvertes.fr/hal-03599740/document

Full post: https://www.inf.telecom-sudparis.eu/pds/seminars_cpt/reading-group-9/

## Abstract
Stochastic gradient descent (SGD) is the most prevalent algorithm for training Deep Neural Networks (DNN). SGD iterates over the input dataset in each training epoch, processing data samples in a random access fashion. Because this puts enormous pressure on the I/O subsystem, the most common approach to distributed SGD in HPC environments is to replicate the entire dataset to node local SSDs. However, due to rapidly growing dataset sizes this approach has become increasingly infeasible. Surprisingly, the questions of why and to what extent random access is required have not received a lot of attention in the literature from an empirical standpoint. In this paper, we revisit data shuffling in DL workloads to investigate the viability of partitioning the dataset among workers and performing only a partial distributed exchange of samples in each training epoch. Through extensive experiments on up to 2,048 GPUs of ABCI and 4,096 compute nodes of Fugaku, we demonstrate that in practice validation accuracy of global shuffling can be maintained when carefully tuning the partial distributed exchange. We provide a solution implemented in PyTorch that enables users to control the proposed data exchange scheme.
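The sketch below is a minimal, single-process Python illustration of the partial-exchange idea, not the paper's PyTorch solution: partitions are plain lists of sample indices, the `exchange_fraction` knob and the single-random-peer exchange pattern are simplifying assumptions, and each worker still shuffles its own partition locally every epoch.

```python
import random

def partial_exchange(partitions, exchange_fraction, seed=0):
    """Sketch: each worker keeps its local partition of sample indices and, once per
    epoch, swaps only a fraction of them with one randomly chosen peer, instead of
    globally re-shuffling the whole dataset across all workers."""
    rng = random.Random(seed)
    n_workers = len(partitions)
    for src in range(n_workers):
        dst = rng.randrange(n_workers)
        if dst == src:
            continue
        k = int(exchange_fraction * len(partitions[src]))
        src_idx = rng.sample(range(len(partitions[src])), k)
        dst_idx = rng.sample(range(len(partitions[dst])), k)
        for i, j in zip(src_idx, dst_idx):
            # Swap the selected samples between the two partitions.
            partitions[src][i], partitions[dst][j] = partitions[dst][j], partitions[src][i]
    for part in partitions:
        rng.shuffle(part)          # local shuffle is still performed on every worker
    return partitions

# Hypothetical usage: 4 workers, 25% of each partition exchanged per epoch.
workers = [list(range(w * 8, (w + 1) * 8)) for w in range(4)]
workers = partial_exchange(workers, exchange_fraction=0.25)
```

In a real distributed setting the swap would be an inter-node communication step; here it is simulated in-process only to show how partial exchange trades I/O and network traffic against shuffling quality.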