During the PDS/HPDA Seminar of 10/2/2023 from 10:00 to 11:30, Victor Laforet, Hatem Mnaouer, and Dimitrije Panic will each present a reading group talk.
# Reading group: Lock Cohorting: A General Technique for Designing NUMA Locks (PPoPP’12)

Presented by Victor Laforet on 10/2/2023 at 10:00. Attending this presentation is mandatory for the master students.

Multicore machines are quickly shifting to NUMA and CC-NUMA architectures, making scalable NUMA-aware locking algorithms, ones that take into account the machines’ non-uniform memory and caching hierarchy, ever more important. This paper presents lock cohorting, a general new technique for designing NUMA-aware locks that is as simple as it is powerful. Lock cohorting allows one to transform any spin-lock algorithm, with minimal non-intrusive changes, into scalable NUMA-aware spin-locks. Our new cohorting technique allows us to easily create NUMA-aware versions of the TATAS-Backoff, CLH, MCS, and ticket locks, to name a few. Moreover, it allows us to derive a CLH-based cohort abortable lock, the first NUMA-aware queue lock to support abortability. We empirically compared the performance of cohort locks with prior NUMA-aware and classic NUMA-oblivious locks on a synthetic micro-benchmark, a real world key-value store application memcached, as well as the libc memory allocator. Our results demonstrate that cohort locks perform as well or better than known locks when the load is low and significantly out-perform them as the load increases.
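
For attendees who want a concrete picture before the talk, the following is a minimal sketch of the cohorting idea as it is usually described: one global lock shared by all NUMA nodes plus one local lock per node, with the global lock handed over between threads of the same node a bounded number of times before it is released to other nodes. It is an illustration under stated assumptions, not the paper's implementation: Python's `threading.Lock` stands in for the spin locks, the caller passes its NUMA node index explicitly, and the names `CohortLock`, `num_nodes`, and `max_handoffs` are hypothetical.

```python
import threading


class CohortLock:
    """Illustrative cohort lock: one global lock plus one local lock per node.

    Assumptions (not from the abstract): blocking threading.Lock objects
    replace spin locks, and the caller supplies its NUMA node index.
    """

    def __init__(self, num_nodes, max_handoffs=4):
        # threading.Lock may be released by a thread other than the one
        # that acquired it, which the global lock needs for handoffs.
        self._global = threading.Lock()
        self._local = [threading.Lock() for _ in range(num_nodes)]
        # Waiter counts let a releasing thread detect that its cohort
        # (threads of the same node) has someone waiting.
        self._waiters = [0] * num_nodes
        self._guard = [threading.Lock() for _ in range(num_nodes)]
        # The two fields below are only touched while holding the local lock.
        self._owns_global = [False] * num_nodes
        self._handoffs = [0] * num_nodes
        self._max_handoffs = max_handoffs

    def acquire(self, node):
        with self._guard[node]:
            self._waiters[node] += 1
        self._local[node].acquire()
        with self._guard[node]:
            self._waiters[node] -= 1
        if not self._owns_global[node]:
            # First thread of this cohort: compete for the global lock.
            self._global.acquire()
            self._owns_global[node] = True
            self._handoffs[node] = 0

    def release(self, node):
        with self._guard[node]:
            has_waiters = self._waiters[node] > 0
        if has_waiters and self._handoffs[node] < self._max_handoffs:
            # Hand over within the node: keep the global lock and
            # release only the local lock.
            self._handoffs[node] += 1
            self._local[node].release()
        else:
            # No local waiter, or handoff budget used up: let other nodes in.
            self._owns_global[node] = False
            self._global.release()
            self._local[node].release()
```

A thread running on node `n` would wrap its critical section in `lock.acquire(n)` and `lock.release(n)`. The handoff bound is a fairness knob in this sketch: a larger value keeps the lock on one node longer, a smaller one lets other nodes in sooner.
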
# Reading group: Building Blocks for Network-Accelerated Distributed File Systems (SC’22)

Presented by Hatem Mnaouer on 10/2/2023 at 10:30. Attending this presentation is mandatory for the master students.

High-performance clusters and datacenters pose increasingly demanding requirements on storage systems. If these systems do not operate at scale, applications are doomed to become I/O bound and waste compute cycles. To accelerate the data path to remote storage nodes, remote direct memory access (RDMA) has been embraced by storage systems to let data flow from the network to storage targets, reducing overall latency and CPU utilization. Yet, this approach still involves CPUs on the data path to enforce storage policies such as authentication, replication, and erasure coding. We show how storage policies can be offloaded to fully programmable SmartNICs, without involving host CPUs. By using PsPIN, an open-hardware SmartNIC, we show latency improvements for writes (up to 2x), data replication (up to 2x), and erasure coding (up to 2x), when compared to respective CPU- and RDMA-based alternatives.
# Reading group: Is it Nemo or Dory? Fast and accurate object detection for IoT and edge devices (IoT’21)

Presented by Dimitrije Panic on 10/2/2023 at 11:00. Attending this presentation is mandatory for the master students.

Current state-of-the-art object detection neural networks, such as YOLO and SSD, are trained and developed on server-class GPUs. These neural networks do not scale down well to resource-constrained devices, with both accuracy and precision taking a significant hit at the expense of speed. This is particularly concerning as object detection algorithms are often used for low-powered devices, such as surveillance and smart-home cameras, where accuracy is critical. Therefore, these devices generally tend to forward data to servers for processing, which adds network latency. In other cases, algorithms developed for these devices shrink neural networks to reduce computation at the expense of accuracy, which is often not acceptable. We create an alternative object detection scheme for static-camera systems such as those used in surveillance and smart-home settings. Our model does not require positions of objects in training data, enabling us to retain only the relevant parts of a video frame for training data and reduce data storage cost while preserving privacy of other subjects in the video. When evaluated on static-camera video feeds using an NVIDIA Jetson Nano (a hybrid Arm and GPU IoT embedded device platform), our method increases throughput compared to Tiny YOLOv3. Further, it takes half of the time that Tiny YOLOv3 takes when run on OpenCV’s CUDA-optimized DNN framework per frame while having high accuracy of bounding box prediction and classification. The lessons learned from this work also suggest other strategies for tailoring the deployment of deep learning algorithms at the edge.