Monday 8:00am-9:45am Room A DUAC Workshop 1A: International Workshop on Deployment and Use of Accelerators (DUAC) - Presentations Chair: Carlos Reaño (Queen's University Belfast) Bruno Meyer, Aurora Pozo, and Wagner M. Nunan Zola (Federal University of Paraná) Abstract Recent advances and applications of machine learning algorithms are becoming more common in different fields. It is expected that some applications require the processing of large datasets with those algorithms, which leads to high computational costs. Massively parallel GPU methods can be applied to overcome this limitation and reduce the execution time of these algorithms. The construction of approximate K-Nearest Neighbor Graphs (K-NNG) is frequently required for similarity search or other applications such as the t-SNE dimensionality reduction technique. The K-NNG represents the K closest points (neighbors) for each point in a set. In this paper, we propose and analyze an all-points K-Nearest Neighbor Graph construction algorithm on GPU called Warp-centric K-NNG (w-KNNG), which is based on the Random Projection Forest method. Usually, the construction or search for k-NN sets for high dimensional points presents challenges for its implementation on many-core processing units, due to the space limitation in maintaining these sets in high-speed shared memory. We present three warp-centric approaches for our algorithm that efficiently search and maintain the k-NN high dimensional point sets in global memory. In our experiments, the new methods allow the algorithm to achieve up to 639% faster execution when compared to the state-of-the-art FAISS library, considering an equivalent accuracy of approximate K-NNG. One of the new strategies (w-KNNG atomic) is more successful when applied to a smaller number of dimensions, while the tiled w-KNNG approach is successful in general scenarios for higher dimensional points. 8:30am: Explaining the Classification Performance of Supervised and Semi-Supervised Methods for Automated Sparse Matrix Format Selection Sunidhi Dhandhania (Indian Institute of Technology Kanpur); Akshay Deodhar (College of Engineering, Pune); Konstantin Pogorelov (Simula Research Laboratory); Swarnendu Biswas (Indian Institute of Technology Kanpur); and Johannes Langguth (Simula Research Laboratory) Abstract Sparse matrix-vector multiplication (SpMV) is a key kernel in a wide range of applications from many different domains, but it constitutes a performance bottleneck. Many sparse matrix storage formats along with corresponding SpMV algorithms have been proposed. The performance of SpMV varies widely depending on the selected format as well as the architecture and memory hierarchy of the processor. There is no easy method to determine the best format for a given matrix on a given machine. To address this, supervised Machine Learning (ML) techniques to automate the selection of the best formats have been proposed. However, existing supervised approaches suffer from several drawbacks. They depend on large representative datasets, generating training data by benchmarking is expensive, and retraining these systems to incorporate new classes of matrices or additional processor architectures is costly. Furthermore, many supervised systems are not explainable, i.e., it is hard to determine how and why they produce their results.
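For readers less familiar with sparse formats, the following is a minimal, generic sketch of the CSR (compressed sparse row) representation and the SpMV kernel that the format-selection work above targets; it is illustrative background only and is not taken from the paper.

    import numpy as np

    def dense_to_csr(A):
        """Convert a dense matrix to CSR arrays (values, column indices, row pointers)."""
        values, col_idx, row_ptr = [], [], [0]
        for row in A:
            nz = np.nonzero(row)[0]
            values.extend(row[nz])
            col_idx.extend(nz)
            row_ptr.append(len(values))
        return np.array(values), np.array(col_idx), np.array(row_ptr)

    def spmv_csr(values, col_idx, row_ptr, x):
        """y = A @ x using only the CSR arrays."""
        y = np.zeros(len(row_ptr) - 1)
        for i in range(len(y)):
            start, end = row_ptr[i], row_ptr[i + 1]
            y[i] = np.dot(values[start:end], x[col_idx[start:end]])
        return y

    A = np.array([[4.0, 0.0, 0.0], [0.0, 0.0, 2.0], [1.0, 0.0, 3.0]])
    x = np.array([1.0, 2.0, 3.0])
    vals, cols, ptrs = dense_to_csr(A)
    print(spmv_csr(vals, cols, ptrs, x))   # matches A @ x: [ 4.  6. 10.]

Other formats (COO, ELL, blocked variants) trade storage and access regularity differently, which is exactly why the best choice depends on the matrix and the machine.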
8:45am: An Intelligent Parallel Distributed Streaming Framework for Near Real-time Science Sensors and High Resolution Medical Images Samit Shivadekar and Jayalakshmi Mangalagiri (UMBC); Phuong Nguyen (UMBC, OpenKneck Inc); and David Chapman, Milton Halem, and Rahul Gite (UMBC) Abstract Our goals are to address challenges such as latency, scalability, throughput and heterogeneous data sources of streaming analytics and deep learning pipelines in science sensors and medical imaging applications. We present a prototype Intelligent Parallel Distributed Streaming Framework (IPDSF) that is capable of distributed streaming processing as well as performing distributed deep training in batch mode. IPDSF is designed to run streaming Artificial Intelligence (AI) analytic tasks using data parallelism, including partitions of multiple streams of short-time sensing data and high-resolution 3D medical images, and fine-grained task distribution. We will show the implementation of IPDSF for two real-world applications, (i) an Air Quality Index based on near real time streaming of aerosol Lidar backscatter and (ii) data generation of Covid-19 Computed Tomography (CT) scans using deep learning. We evaluate the latency, throughput, and scalability, and provide a quantitative evaluation of training and prediction compared against a baseline single instance. As a result, IPDSF scales to process thousands of streaming science sensors in parallel for the Air Quality Index application. IPDSF uses novel 3D conditional Generative Adversarial Network (cGAN) training using parallel distributed Graphics Processing Unit (GPU) nodes to generate realistic 3D high resolution Computed Tomography scans of Covid-19 patient lungs. We will show that IPDSF can reduce cGAN training time linearly with the number of GPUs. Henry Dietz (University of Kentucky) Abstract Quantum computers use quantum physics phenomena to create specialized hardware that can efficiently execute algorithms operating on entangled superposed data. That hardware must be attached to and controlled by a conventional host computer. However, it can be argued that the main benefit thus far has been from reformulating problems to make use of entangled superpositions rather than from use of exotic physics mechanisms to perform the computation -- such reformulations often have produced more efficient algorithms for conventional computers. Parallel bit pattern computing does not simulate quantum computing, but provides a way to use non-quantum, bit-level, massively-parallel, SIMD hardware to efficiently execute a broad class of algorithms leveraging superposition and entanglement. Tom Plano and Jeremy Buhler (Washington University) Abstract Streaming data-flow applications arise in many contexts where each item in a data stream must be processed within a bounded latency, or deadline, following its arrival. We consider applications whose behavior is irregular, in the sense that the application may reduce or amplify data volumes dynamically at various stages of its computation. Our implementation target for these applications is SIMD-capable processors such as GPUs. For such devices, organizing the computation so that a full-width SIMD vector of inputs can be processed at once makes the most efficient use of the processor. However, having parts of the computation wait while full vectors of input accumulate may incur more latency than the application's deadline allows.
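The tension described in the last abstract above, waiting to fill a full SIMD-width vector versus honoring per-item deadlines, can be illustrated with a toy batching sketch; the vector width, deadline value, and function names below are hypothetical and unrelated to the authors' system.

    import time
    from collections import deque

    VECTOR_WIDTH = 8      # hypothetical SIMD width
    DEADLINE_S = 0.010    # hypothetical per-item latency budget (10 ms)

    pending = deque()     # (arrival_time, item)

    def flush(batch):
        # Stand-in for launching one vector-wide processing step over the batch.
        print(f"processing batch of {len(batch)} items")

    def on_arrival(item, now=None):
        now = time.monotonic() if now is None else now
        pending.append((now, item))
        # Flush as soon as a full vector of inputs is available...
        if len(pending) >= VECTOR_WIDTH:
            flush([pending.popleft()[1] for _ in range(VECTOR_WIDTH)])

    def on_timer(now=None):
        # ...or when the oldest pending item would otherwise miss its deadline,
        # accepting a partially filled (less efficient) vector.
        now = time.monotonic() if now is None else now
        if pending and now - pending[0][0] >= DEADLINE_S:
            flush([t[1] for t in pending])
            pending.clear()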
9:30am: TurboBC: A Memory Efficient and Scalable GPU Based Betweenness Centrality (BC) Algorithm in the Language of Linear Algebra Oswaldo Artiles and Fahad Saeed (Florida International University) Abstract Betweenness centrality (BC) is a shortest path centrality metric used to measure the influence of individual vertices or edges on huge graphs that are used for modeling and analysis of the human brain, omics data, or social networks. The application of the BC algorithm to modern graphs must deal with the size of the graphs, as well as with highly irregular data-access patterns. These challenges are particularly important when the BC algorithm is implemented on Graphics Processing Units (GPU), due to the limited global memory of these processors, as well as the decrease in performance due to the load imbalance resulting from processing irregular data structures. In this paper, we present the first GPU based linear-algebraic formulation and implementation of BC, called TurboBC, a set of memory efficient BC algorithms that exhibits good performance and high scalability on unweighted, undirected or directed sparse graphs of arbitrary structure. Our experiments demonstrate that our TurboBC algorithms obtain more than 18 GTEPs and an average speedup of 31.9x over the sequential version of the BC algorithm, and are on average 1.7x and 2.2x faster than the state-of-the-art algorithms implemented in the high-performance GPU-based Gunrock and CPU-based Ligra libraries, respectively. These experiments also show that by minimizing their memory footprint, the TurboBC algorithms are able to compute the BC of relatively big graphs, for which the Gunrock algorithms ran out of memory. Monday 8:00am-10:15am Room B EMS Workshop 1B: International Workshop on Embedded Multicore Systems (ICPP-EMS) - Presentations Chair: YungChia Lin (MediaTek) 8:15am: ArchViMP – a Framework for Automatic Extraction of Concurrency-related Software Architectural Properties Monireh Pourjafarian (Technical University of Kaiserslautern) and Jasmin Jahic (University of Cambridge) Abstract Concurrent multithreaded programs are more complex than sequential ones due to inter-dependencies of threads over shared memory. Because of these dependencies, software architects and developers quickly become overwhelmed when trying to design and manage concurrent software. Existing approaches that try to support architecture efforts in this domain rely on the visualization of concurrency-related properties of software to ease its understanding, but they fail because i) the abstractions they use do not capture information of architectural significance, and because ii) raw visualization of the interdependencies does not scale. Zong-Ming Ke, Yun-Ze Li, and Da-Wei Chang (National Cheng Kung University) Abstract In-memory key-value caches are widely used by web applications to reduce latencies of database queries. In recent years, multilevel cell (MLC) technology has been used in various emerging non-volatile memories (NVM), allowing the building of large-scale main memory in a computing system. Building key-value caches on NVM, instead of DRAM, not only prevents cache data loss due to power outage but also improves the cache hit rate. However, it could lead to degraded performance due to the inferior write performance of NVM, compared to that of DRAM. To address this problem, previous studies focus on reducing the amount of NVM writes. In this paper, we propose Dual-KV, a method to improve the performance of MLC NVM key-value caches.
Different from previous studies, Dual-KV achieves the performance improvement by exploiting the write-latency/retention-time trade-off of MLC NVM. In MLC NVM, write latency increases with the retention time. The basic idea of Dual-KV is to adopt short-latency-and-short-retention writes (i.e., fast writes) for frequently updated data items while using long-latency-and-long-retention writes (i.e., slow writes) for the other data items. The experimental results show that Dual-KV improves throughput by up to 43% and 83% for single-thread and 16-thread YCSB workloads, respectively. Moreover, Dual-KV is designed to be integrated into existing key-value caches easily. Integrating Dual-KV into Memcached, a widely-used key-value cache, requires less than 30 lines of code modifications. Henry Kao (University of Toronto), Joshua San Miguel (University of Wisconsin-Madison), and Natalie Enright Jerger (University of Toronto) Abstract Coherence-induced cache misses are an important aspect limiting the scalability of shared memory parallel programs. Many coherence misses are avoidable, namely misses due to false sharing -- when different threads write to different memory addresses that are contained within the same cache block, causing unnecessary invalidations. Past work has proposed numerous ways to mitigate false sharing, from coherence protocols optimized for certain sharing patterns to software tools for false-sharing detection and repair. Our work leverages approximate computing and store value similarity in error-tolerant multi-threaded applications. We introduce a novel cache coherence protocol which implements approximate instructions and coherence states to allow some limited incoherence within approximatable shared data to mitigate both coherence misses and coherence traffic for various sharing patterns. For applications from the Phoenix and AxBench suites, we see dynamic energy improvements within the NoC and memory hierarchy of up to 50.1% and speedup of up to 37.3% with low output error for approximate applications that exhibit false sharing. Jyun-Kai Lai and Wuu Yang (National Yang Ming Chiao Tung University) Abstract Rabbit is an LLVM-based hybrid binary translator with many diverse optimizations to improve performance. In addition to platform-independent hyperchaining (indep), Rabbit also includes platform-dependent hyperchaining (dep) on both x86-64 and RISC-V architectures for both direct and indirect branches. The dep optimizations leverage architecture-specific instructions and patches to achieve the same effect as the indep optimization but gain more performance improvement. The experimental results show that the platform-dependent hyperchaining can achieve 1.08x and 1.05x speedup in comparison with platform-independent hyperchaining for direct and indirect branches, respectively. The experimental results also show that platform-dependent hyperchaining incurs little memory space overhead in comparison with platform-independent hyperchaining. Pin-Wei Liao, Wei-Chung Hsu, and Shih-Wei Liao (National Taiwan University) Abstract Edge inference has gained much popularity in recent years. Many AI accelerators have been proposed and extensively studied. Such devices are often packed with a large number of PEs (Processing Elements) and large amounts of on-chip SRAM. The key to successful AI acceleration is to effectively use the data transferred from off-chip DRAM to the on-chip SRAM.
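The Dual-KV policy summarized above (fast, short-retention writes for frequently updated items; slow, long-retention writes for the rest) can be sketched as a simple update-frequency heuristic. The threshold and write primitives below are hypothetical stand-ins, and the refresh of short-retention data before it expires is omitted.

    from collections import defaultdict

    HOT_THRESHOLD = 4          # hypothetical: updates seen before an item counts as "hot"
    update_counts = defaultdict(int)

    def nvm_write_fast(key, value):
        # Stand-in for a short-latency, short-retention MLC NVM write.
        pass

    def nvm_write_slow(key, value):
        # Stand-in for a long-latency, long-retention MLC NVM write.
        pass

    def put(key, value):
        update_counts[key] += 1
        if update_counts[key] >= HOT_THRESHOLD:
            # Frequently updated: short retention suffices, so take the fast write.
            nvm_write_fast(key, value)
        else:
            # Rarely updated: pay the slow write once to get long retention.
            nvm_write_slow(key, value)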
9:30am: Accelerate Binarized Neural Networks with Processing-in-Memory Enabled by RISC-V Custom Instructions Che-Chia Lin, Chao-Lin Lee, and Jenq-Kuen Lee (National Tsing Hua University) and Howard Wang and Ming-Yu Hung (MediaTek Inc.) Abstract As the speed of processing units grows rapidly, the bottleneck of a system's performance is usually the speed of memory, a situation known as the "Memory Wall". There are emerging technologies trying to take down the "Memory Wall", and one of them is Processing-in-Memory (PIM). Processing-in-Memory means that the data are processed inside the memory itself, so they do not need to take time to travel between the CPU and memory. Moreover, with only small modifications to memory devices, the memory can perform primitive bit-wise operations on the memory side. Binarized Neural Network (BNN), which replaces the convolution's multiplication and addition operations with bit-wise AND and population count operations, is therefore suited for utilizing PIM to gain performance. This work architects PIM AND, NOT, and population count operations and enables PIM operations to work under RISC-V custom instruction encodings. We also utilize TVM's support of BNN for application sources. In addition, we offer a new design for BNN's convolution in which a better memory layout is considered. With our design, the speedups range from 3.7x to 57.3x compared with a CPU-based system for the execution time of end-to-end BNN model inferences. 9:45am: Accelerating Neural Network Training using Arbitrary Precision Approximating Matrix Multiplication Algorithms Grey Ballard, Jack Weissenberger, and Luoping Zhang (Wake Forest University) Abstract Matrix multiplication is one of the bottleneck computations for training the weights within deep neural networks. To speed up the training phase, we propose to use faster algorithms for matrix multiplication known as Arbitrary Precision Approximating (APA) algorithms. APA algorithms perform asymptotically fewer arithmetic operations than the classical algorithm, but they compute an approximate result with an error that can be made arbitrarily small in exact arithmetic. Practical APA algorithms provide a significant reduction in computation time and still provide enough accuracy for many applications like neural network training. We demonstrate that APA algorithms can be efficiently implemented and parallelized for multicore CPUs to obtain up to 28% and 21% speedups over the fastest implementation of the classical algorithm using one core and 12 cores, respectively. Furthermore, using these algorithms to train a Multi-Layer Perceptron (MLP) network yields no significant reduction in the training or testing error. Our performance results on a large MLP network show overall sequential and multithreaded performance improvements of up to 25% and 13%, respectively. Hui-Hsin Liao, Chao-Lin Lee, and Jenq-Kuen Lee (National Tsing Hua University) and Wei-Chih Lai, Ming-Yu Hung, and Chung-Wen Huang (MediaTek Inc) Abstract Recently, machine learning has been widely adopted in various scenarios, especially in edge devices. These edge devices, such as smartphones or IoT devices, are usually powered by limited batteries. Therefore, how to increase performance and achieve power savings becomes one of the critical issues in the development of deep learning frameworks. Numerous optimizations and methodologies have been developed with the aim of improving CNN performance.
In this paper, we focus on the convolution layer in CNN, which is one of the most computationally demanding operators in neural networks. Therefore, improving the convolution will contribute significantly to the entire model. We identify opportunities for sparse convolution, in which certain matrices exhibit high sparsity. We propose a flow in TVM that provides sparse convolution with weight pruning. In our flow, we maximize the sparsity by pruning certain weights and retraining the model. With our proposed flow, TVM model runtime could achieve 11.42x speedup on average with ImageNet-based models compared to the original flow. Monday 8:00am-10:15am Room C P2S2 Workshop 1C: International Workshop on Parallel Programming Models and Systems Software for High-End Computing (P2S2) - Presentations Chair: John Leidel (Tactical Computing Laboratories LLC, Texas Tech University) Swati Singhal and Alan Sussman (University of Maryland, College Park) and Matthew Wolf, Kshitij Mehta, and Jong Choi (Oak Ridge National Lab) Abstract Modern scientific workflows are increasing in complexity with growth in computation power, incorporation of non-traditional computation methods, and advances in technologies enabling data streaming to support on-the-fly computation. These workflows have unpredictable runtime behaviors, and a fixed, predetermined resource assignment on supercomputers can be inefficient for overall performance and throughput. The inability to change resource assignments further limits scientists' ability to take advantage of science-driven opportunities or to respond to failures. Jonas Posner and Claudia Fohry (University of Kassel) Abstract Resource elasticity allows the resources of running jobs to be changed dynamically, which may significantly improve the throughput on supercomputers. Elasticity requires support from both job schedulers and user applications. Whereas the adaptation of traditional programs requires additional programmer effort, task-based programs can be made elastic in a transparent way. Md Maruf Hossain and Erik Saule (University of North Carolina at Charlotte) Abstract Graph analysis now percolates society with applications ranging from advertising and transportation to medical research. The structure of graphs is becoming more complex every day while they are getting larger. The increasing size of graph networks has made many of the classical algorithms reasonably slow. Fortunately, CPU architectures have evolved to adjust to new and more complex problems in terms of core-level parallelism and vector-level parallelism (SIMD-level). Zheming Jin and Jeff Vetter (ORNL) Abstract Sum reduction is a primitive operation in parallel computing, while SYCL is a promising programming model for heterogeneous computing. In this paper, we describe our SYCL implementations of integer sum reduction using atomic functions, shared local memory, vectorized memory accesses, and parameterized workload sizes. For a sufficiently large number of integers, tuning the SYCL implementation achieves 1.4X speedup over the open-source implementations on an Intel UHD630 integrated GPU. The SYCL reduction is 3% faster than the templated reduction in Thrust, and 0.3% faster than the device reduction in CUB on an Nvidia P100 GPU.
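As background for the sum-reduction abstract above: GPU reductions typically compute partial sums per work-group and then combine them. The following pure-Python sketch mirrors only that two-level structure; it says nothing about the SYCL, Thrust, or CUB implementations being compared.

    from concurrent.futures import ThreadPoolExecutor

    def chunked_sum(data, chunk_size=1024):
        """Two-level reduction: partial sums per chunk, then a final combine.

        Mirrors the structure of GPU reductions (work-group partial results
        followed by a global combine), not their performance characteristics.
        """
        chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
        with ThreadPoolExecutor() as pool:
            partial = list(pool.map(sum, chunks))   # "work-group" partial sums
        return sum(partial)                         # final combine on the "host"

    assert chunked_sum(list(range(100000))) == sum(range(100000))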
Andrew Worley (Tennessee Technological University); Prema Prema Soundararajan (University of Alabama at Birmingham); Derek Schafer (University of Tennessee at Chattanooga); Purushotham Bangalore (University of Alabama at Birmingham); Ryan Grant and Matthew Dosanjh (Sandia National Laboratories); Anthony Skjellum (University of Tennessee, Chattanooga; SimCenter); and Sheikh Ghafoor (Tennessee Technological University) Abstract Abstract Partitioned point-to-point communication primitives provide a performance-oriented mechanism to support a hybrid parallel programming model of message passing and shared memory multiprocessing. It offers a cogent alternative to the earlier endpoints proposal for addressing threaded messaging passing with MPI+X. As of now, the partitioned point-to-point communication feature set has been added to the upcoming MPI-4.0 standard. These primitives enable an MPI application to transfer parts of a data buffer in a single MPI message. This approach is designed to facilitate these partial contributions using multiple threads or tasks. It can also be used to pipeline communication in a single execution context. Partitioning of both send-side and receive-side buffers is supported. Tu Mai Anh Do, Loïc Pottier, and Rafael Ferreira da Silva (University of Southern California, Information Sciences Institute); Silvina Caíno-Lores and Michela Taufer (University of Tennessee at Knoxville); and Ewa Deelman (University of Southern California, Information Sciences Institute) Abstract Abstract Scientific breakthroughs in biomolecular methods and improvements in hardware technology have shifted from a single long-running simulation to a large set of shorter simulations running simultaneously, called an ensemble. In an ensemble, each independent simulation is usually coupled with several analyses that apply identical or distinct algorithms on data produced by the corresponding simulation. Today, in situ methods are used to analyze large volumes of data generated by scientific simulations at runtime. This work studies the execution of ensemble-based simulations paired with in situ analyses using in-memory staging methods. Because simulations and analyses forming an ensemble typically run concurrently, deploying an ensemble requires efficient co-location-aware strategies, making sure the data flow between simulations and analyses that form an in situ workflow is efficient. Using an ensemble of molecular dynamics in situ workflows with multiple simulations and analyses, we first show that collecting traditional metrics such as makespan, instructions per cycle, memory usage, or cache miss ratio is not sufficient to characterize the complex behaviors of ensembles. Thus, we propose a method to evaluate the performance of ensembles of workflows that captures resource usage (efficiency), resource allocation, and component placement. Experimental results demonstrate that our proposed method can effectively capture the performance of different component placements in an ensemble. By evaluating different co-location scenarios, our performance indicator demonstrates improvements of up to four orders of magnitude when co-locating simulation and coupled analyses within a single computational host. 
9:45am: FMSM: A Fuzzy Multi-keyword Search Scheme for Encrypted Cloud Data based on Multi-chain Network Heng He (School of Computer Science and Technology, Wuhan University of Science and Technology; Hubei Province Key Laboratory of Intelligent Information Processing and Real-time Industrial System); chengyu Liu (Wuhan University of Science and Technology, School of Computer Science and Technology); Xiaohu Zhou (School of Computing, Engineering, and Built Environment,Birmingham City University); and Ke Feng (School of Computer Science and Technology, Wuhan University of Science and Technology; Hubei Province Key Laboratory of Intelligent Information Processing and Real-time Industrial System) Abstract Abstract Searchable encryption is an efficient and secure solution that can perform search operation in ciphertext on cloud servers. It is widely used by a growing number of businesses and individuals. However, most existing searchable encryption schemes typically overestimate the security level of the cloud server, which is an untrusted and centralized third party in fact. The emergence of blockchain technology solves the trustworthiness problem of the third party with features such as decentralization, verifiability and immutability. The combination of searchable encryption and blockchain technology uses the smart contract instead of the cloud server to execute ciphertext search task, which improves security and efficiency of search significantly. We propose a Fuzzy Multi-keyword Search Scheme for Encrypted Cloud Data based on Multi-chain Network (FMSM). FMSM is a distributed searchable encryption scheme since we adopt a multi-chain architecture. It not only supports fuzzy multi-keyword search and dynamic updates for encrypted cloud data, but also supports parallel processing of data. In FMSM, counting bloom filter and MinHash algorithm are used to construct index vectors and query vectors on the corresponding two chains, besides, the search is done on a third chain. All of these calculation tasks are executed in the way of distributed computing, which makes the index construction and search process more efficient and the sorting results more accurate. The security analysis and performance evaluation show that the FMSM has high security, reliability, search efficiency and accuracy. Fady Ghanim (Oak Ridge National Labs), Wael Elwasif (Oak ridge National Labs), and David Bernholdt (Oak Ridge National Labs) Abstract Abstract The Parallel Random Access Machines (PRAM) abstraction is the simplest and most elegant algorithmic model for the design and analysis of parallel algorithms. It consists of different models categorized based on the underlying memory access mode used, the most powerful of which is the Concurrent Read Concurrent Write (CRCW) model. A PRAM algorithm describes a series of rounds, each of which consist of a collection of operations that can be executed concurrently within the same time step. However, the lack of support for concurrent memory accesses and the prevalence of asynchronous programming models, led to the belief that implementing CRCW PRAM algorithms is unattainable, and prompted many to avoid this model except for theoretical studies of optimal performance. 
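The FMSM abstract above builds its query and index vectors with a counting Bloom filter and MinHash. As generic background only (not the paper's construction), a minimal MinHash similarity estimate looks like this:

    import hashlib

    def minhash_signature(tokens, num_hashes=64):
        """MinHash signature: for each seeded hash, keep the minimum value over the set."""
        sig = []
        for seed in range(num_hashes):
            sig.append(min(
                int(hashlib.sha1(f"{seed}:{t}".encode()).hexdigest(), 16)
                for t in tokens
            ))
        return sig

    def estimated_jaccard(sig_a, sig_b):
        """Fraction of matching signature slots approximates Jaccard similarity."""
        return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

    a = minhash_signature({"cloud", "encrypted", "search", "index"})
    b = minhash_signature({"cloud", "encrypted", "query", "index"})
    print(estimated_jaccard(a, b))   # roughly the true Jaccard similarity (3/5 = 0.6)

Because the signatures are short fixed-length vectors, similarity can be estimated without revealing or comparing the full keyword sets, which is what makes such sketches attractive for encrypted search indexes.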
Monday 10:15am-11:30am Room A AWASN Workshop 2A: International Workshop on Applications of Wireless Ad hoc and Sensor Networks (AWASN) - Presentations Chair: Kazuya Sakai (Tokyo Metropolitan University) Amir Reza Ramtin and Don Towsley (University of Massachusetts Amherst) Abstract Self-stabilization is an excellent approach for adding fault tolerance to a distributed multi-agent system. However, two properties of self-stabilization theory, closure and convergence, may not be satisfied if agents are selfish. To guarantee closure in the presence of selfish agents, we propose fault-containment as a method to constrain legitimate configurations of the self-stabilizing system to be Nash equilibria. To guarantee convergence, we introduce probabilistic self-stabilization to set the probabilities of rules such that agents' self-interests are satisfied. We also assume selfish agents are capable of performing unauthorized actions at any time, which threatens both properties, and present a stepwise solution to handle it. As a case study, we consider the problem of distributed clustering and propose self-stabilizing algorithms for forming clusters. Simulation results show that our algorithms react correctly to rule deviations and outperform comparable schemes in terms of fairness and stabilization time. 10:45am: Automated Arrhythmia Detection using Hilbert-Huang Transform based Convolutional Neural Network Tzu-Chia Lin and Jie Zhang (National Central University) and Min-Te Sun (National Central University) Abstract In this paper, a novel approach to arrhythmia-based signal classification is introduced. The objective is to properly identify three classes of patients exhibiting normal sinus rhythm, atrial fibrillation, and other rhythm. The proposed method applies the Hilbert-Huang transform to the raw signal to generate a noise-free reconstruction of the original containing temporal variations, which serves as input for the classification mechanism to learn representative features. The features are directly learned by a Convolutional Neural Network, thus replacing traditional methods of relying on experts to handcraft features. To summarize, this paper contains two major processes: utilizing a nonlinear and nonstationary signal processing technique to produce the input, and feeding the reconstructed signal containing representative features to a CNN for the multi-classification task. The experimental results indicate the effectiveness of this method, removing the need for human involvement in the process of feature selection. Through analyses and simulations, the effectiveness of the proposed ECG-classification method is evaluated. 11am: New Evacuation Guidance Using Augmented Reality for Emergency Rescue Evacuation Support System (ERESS) Tomotaka Wada, Takuya Ikeda, and Yuta Kanayama (Kansai University) and Kazuhiro Ohtsuki (Kobe University) Abstract We are developing an Emergency Rescue Evacuation Support System (ERESS) for the purpose of evacuation support in the event of a local disaster. This system uses only terminals equipped with multiple sensors such as smartphones. It is a system that automatically detects disasters at an early stage and provides disaster evacuees with highly real-time disaster information. In this paper, we propose an evacuation guidance method using augmented reality (AR) in order to convey the evacuation route in the event of a disaster to the evacuees in an easy-to-understand manner.
By using AR, many people can intuitively confirm the direction in which they should evacuate, leading to quick and accurate evacuation. It is versatile because it properly synthesizes the 3D model prepared in advance. The performance of the proposed method is evaluated by an indoor demonstration experiment, and its effectiveness is shown. 11:15am: Analysis on Nursing Care Activity Related Stress Level for Reduction of Caregiving Workload Atsushi Miyaji, Tomokazu Matsui, Zhihua Zhang, Hyuckjin Choi, Manato Fujimoto, and Keiichi Yasumoto (Nara Institute of Science and Technology) Abstract In Japan, the demand for nursing homes is increasing due to the rapid aging of the population, while the shortage of caregivers has become a serious problem. This problem has been recognized as a social issue because it leads to an increase in the workload per caregiver. In response, we have been developing a platform that can easily collect nursing care activities in an effort to reduce the workload. In the process, we thought that the mental state (i.e., stress) of caregivers, which changes due to their care activities, might dramatically affect their work efficiency. The objective of this study is to obtain new knowledge for reducing the workload of caregivers by visualizing and analyzing their stress. In this paper, we ask caregivers to wear a heart-rate sensor, and measure objective stress indicators such as R-R interval (RRI) and low frequency (LF)/high frequency (HF) ratio obtained from each sensor, as well as subjective stress indicators obtained from questionnaires administered "before work," "during lunch breaks," and "after work." To be specific, we analyzed and visualized the changes in stress associated with care activities based on psychological indicators of caregivers collected in an actual nursing home. As a result, we found that stress tended to increase during certain care activities and that there were some relationships between those activities and stress indicators. Monday 10:15am-12:00pm Room B LLPP Workshop 2B: Workshop on LLVM in Parallel Processing (LLPP) - Presentations Chair: Johannes Doerfert (Argonne National Laboratory (ANL)) 10:15am: Welcome Remarks Jon Chesterfield (AMD) Abstract The remote procedure call (RPC) is a simple interface for executing code on a different machine. Almost none of the well-known problems inherent to RPC apply on a shared memory system. Further, a shared memory system is sufficient to implement an RPC library, so that said simple interface can be more widely available. Johannes Doerfert (Argonne National Laboratory), Joseph Huber (Oak Ridge National Laboratory), and Melanie Cornelius (Illinois Institute of Technology) Abstract Debugging an application is famously twice as hard as writing the application in the first place. While this sentiment predates modern GPU programming, it is all the more true when the application has to manage computation and memory across different architectures, memory spaces, and execution modes. Any subtle error in the application, the compiler, or the runtime system can lead to unexpected behavior that is hard to understand from the program output alone. While some tools for GPU debugging exist, their maturity and usefulness vary greatly. Furthermore, as OpenMP offloading puts an abstraction layer between the programmer and the underlying hardware, the information from a native GPU driver (debugging tool) is not always transferable.
As the OpenMP Tooling (OMPT) and Debug (OMPD) interfaces are still not ready to debug OpenMP offloading code in production, developers have a hard time comprehending the implementation state, error sources, and interplay of the OpenMP world with the foreign device runtimes, e.g., CUDA. Michael Kruse (Argonne National Laboratory) Abstract OpenMP 5.1 introduced the first loop nest transformation directives unroll and tile, and more are expected to be included in OpenMP 6.0. We discuss the two Abstract Syntax Tree (AST) representations used by Clang's implementation that is currently under development. The first representation is designed for compatibility with the existing implementation and stores the transformed loop nest in a shadow AST next to the syntactical AST. The second representation introduces a new meta AST-node OMPCanonicalLoop that guarantees that the semantic requirements of an OpenMP loop are met, and a CanonicalLoopInfo type that the OpenMPIRBuilder uses to represent literal and transformed loops. This second approach provides a better abstraction of loop semantics, removes the need for shadow AST nodes that are only relevant for code generation, and allows sharing the implementation with other front-ends such as flang, but depends on the OpenMPIRBuilder which is currently under development. Jiashu Wang, Xun Deng, Kai-Ting Amy Wang, and ZiChun Ye (Huawei Technologies Canada Co., Ltd.) Abstract We present an IR-to-IR Converter that is capable of converting from LLVM IR to Halide IR and MLIR's Affine Dialect to support generation of high-performance SYCL kernel code targeting accelerators with an explicit memory hierarchy design. The converter performs program reconstruction to raise the abstraction of the IR. Once the IR is raised to the level of Halide IR, AKG can be leveraged to generate performant DaVinci code. Alternatively, when the IR is raised to MLIR's Affine Dialect, existing MLIR Affine passes with minor modifications can be readily invoked. Subsequently, the IR is lowered back to LLVM IR through progressive lowering. LLVM's LLC is used to create the final binary for both cases. We extend upon SYCL's buffer, accessor and parallel_for abstractions to facilitate the IR raising process. Tarindu Jayatilaka (University of Moratuwa), Hideto Ueno (University of Tokyo), Giorgis Georgakoudis (Lawrence Livermore National Laboratory), EunJung Park (Los Alamos National Laboratory), and Johannes Doerfert (Argonne National Laboratory) Abstract Compilers come with a multitude of optimizations to choose from, and the chosen optimizations significantly affect the performance of the code being optimized. Selecting the optimal set of optimizations to apply and determining the order to run them is non-trivial. The current approach to solve this is to use a one-size-fits-all optimization sequence that is designed to perform reasonably well for any given source code. Such sequences are not designed to optimize depending on the nature of the code, which usually results in non-optimal performance. In this paper, we present preliminary work tackling the problem from the perspective of compile-time by adapting the optimization sequence to cater to different program types. We start by analyzing how the source code interacts with the passes, as well as how the passes interact with each other. We use this information and propose two potential methods driven by machine learning to run customized optimization sequences on the source code.
First, we look at how we can use a neural network to predict and skip passes that do not optimize the code to improve compilation time. Second, we look at how we can use clustering and predictive models to select custom pass pipelines. We believe that our approach has the potential to replace the current one-size-fits-all approach, with better optimization sequences that are tailored to perform better depending on the code. At the same time, it will allow testing the potential pipelines thoroughly, a practical requirement to gain confidence in the correctness of the compiler. Atmn Patel (University of Waterloo), Shilei Tian (Stony Brook University), Johannes Doerfert (Argonne National Laboratory), and Barbara Chapman (Stony Brook University) Abstract Abstract While parallel programming is hard, programming accelerators has always been even more complicated. One fundamental reason is the lack of mature tooling that can be used to inspect a program that executes on two different architectures. As GPU software stacks of different vendors provide vastly different experience for developers, it is clear that the gold standard for debugging is still host (CPU) execution with its myriad of mature tooling options. Monday 10:15am-12:00pm Room C PDADS Workshop 2C: International Workshop on Parallel and Distributed Algorithms for Decision Sciences (PDADS) - Presentations Chair: Sudip Seal (Oak Ridge National Laboratory) 10:30am: Domain Decomposition Preconditioners for Unstructured Network Problems in Parallel Vector Architectures Daniel Adrian Maldonado, Michel Schanen, François Pacaud, and Mihai Anitescu (Argonne National Laboratory) Abstract Abstract Complex network problems typically require the solution of very large unstructured linear systems with sparsity being their only defining characteristic. While Krylov iterative solvers have been shown to outperform direct solvers for very large systems, their performance is severely limited by the lack of robust and scalable preconditioners. Philippe Codognet (Sorbonne University / CNRS, University of Tokyo) Abstract Abstract We present experiments in solving combinatorial optimization and constraint satisfaction problems by means of Quantum Annealing. We describe how to model classic constraint problems such as N-queens and magic square as well as hard combinatorial problems such as the Costas Array Problem or the Quadratic Assignment Problem in terms of QUBO (Quadratic Unconstrained Binary Optimization). QUBO is the input language of quantum computers based on quantum annealing such as the D-Wave systems and of the "quantum-inspired" but classical devices such as Fujitsu's Digital Annealing Unit or Hitachi's CMOS Annealing Machine. We present preliminary results for solving these combinatorial optimization and constraint satisfaction problems by implementation on the D-Wave quantum computer. Vinay Gavirangaswamy (Western Michigan University, Cray- A Hewlett Packard Enterprise Company); Ajay Gupta (Western Michigan University); Vasilije Perovic (University of Rhode Island); and Hisham Saleh (Western Michigan University) Abstract Abstract Algorithms for ensemble methods (EM) based on bootstrap aggregation often perform copious amount of redundant computations (RC) thus limiting their practicality. Given this constraint, we propose a framework that views these algorithms as a collection of computational units (cu), a tightly coupled set of both mathematical operations and data. 
This view facilitates a reduction in RC (RRC), thereby allowing for faster execution plans. Inspired by the floor tiling approach in VLSI, we look to engineer solutions for RRC while possibly reconfiguring the underlying computing system’s compiler technology stack. We start by showing that under the assumption that the computational system has unbounded but finite memory (i.e., the memory is large enough to hold all intermediate values) and that each cu has a uniform cost, our approach reduces to a well-studied directed bandwidth problem for the directed acyclic graphs (DAGs). Next, we consider a more realistic scenario where the computing system has limited memory and concurrent execution while still assuming a uniform cost. Using a new notion of (𝑟, 𝑠) set cover of a DAG (nodes representing computational units and edges representing their interdependencies) we formulate the problem of reducing redundant computational steps in EM as a variation of a directed bandwidth problem. We show that the graph’s minimum bandwidth is closely related to memory requirements for studying RRC. Finally, our preliminary experimental results are supportive of the proposed approach for RRC and promising that it can be applied to a broader set of algorithms in decision sciences. 11:15am: Design Considerations for GPU-based Mixed Integer Programming on Parallel Computing Platforms Kalyan Perumalla and Maksudul Alam (Oak Ridge National Laboratory) Abstract Abstract Mixed Integer Programming (MIP) is a powerful abstraction in combinatorial optimization that finds real-life application across many significant sectors. The recent proliferation of graphical processing unit (GPU)-based accelerated computing architectures in large-scale parallel computing or supercomputing presents new opportunities as well as challenges in the advancement of MIP solver technology to effectively use the new accelerated computing platforms and scale to large parallel systems. Here, we recount the conventional processor-based strategies and focus on configurations where the most promising intersection lies between parallel MIP solver approaches and the specific strengths of accelerated parallel platforms. We note that the best potential lies in solving problems whose individual matrix sizes (of the linear program relaxation) fit entirely within one accelerator's memory and whose branch-and-bound (or branch-and-cut) trees cannot be fully contained within a small number of computational nodes. Additionally, we identify ideal features of computational linear algebra support on GPU accelerators that would help advance this direction of scalable parallel solution of MIP problems on GPU-based accelerated computing architectures. Gabriel George Baravdish, Jonas Unger, and Ehsan Miandji (Linköping University) Abstract Abstract In this paper, we propose a novel GPU application for highly parallel and computationally efficient compressed sensing of n-dimensional (nD) signal using the smoothed L0 (SL0) algorithm. The SL0-algorithm is known to have great success as a tool for finding sparse solutions in under-determined systems of linear equations. Here, we demonstrate the efficiency of our approach by showing several examples of nD tensor reconstructions. We also consider the traditional 1D compressed sensing, and compare the results. 
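For background on the smoothed-L0 (SL0) method used in the last abstract above: the generic published SL0 iteration alternates a gradient step on a Gaussian surrogate of the L0 norm with a projection back onto the constraint set. The NumPy sketch below shows that sequential iteration only; it says nothing about the authors' GPU implementation, and the parameter values are typical defaults rather than theirs.

    import numpy as np

    def sl0(A, x, sigma_decay=0.7, sigma_min=1e-3, mu=2.0, inner_iters=3):
        """Minimal smoothed-L0 (SL0) sketch for A @ s = x with sparse s."""
        A_pinv = np.linalg.pinv(A)
        s = A_pinv @ x                      # minimum-norm feasible starting point
        sigma = 2.0 * np.max(np.abs(s))
        while sigma > sigma_min:
            for _ in range(inner_iters):
                s = s - mu * s * np.exp(-s**2 / (2 * sigma**2))   # push small entries toward zero
                s = s - A_pinv @ (A @ s - x)                      # project back onto A @ s = x
            sigma *= sigma_decay
        return s

    rng = np.random.default_rng(0)
    A = rng.standard_normal((20, 50))
    true_s = np.zeros(50)
    true_s[[3, 17, 41]] = [1.5, -2.0, 0.7]
    rec = sl0(A, A @ true_s)
    print(np.round(rec[[3, 17, 41]], 2))    # close to the planted nonzero values

The same iteration extends to the nD tensor setting by flattening the measurement operator, which is where the massive elementwise work per iteration makes GPU parallelization attractive.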
Eduardo Romero-Gainza, Christopher Stewart, and Angela Li (The Ohio State University); Kyle Hale (Illinois Institute of Technology); and Nathaniel Morris (The Ohio State University) Abstract Memory mapping enhances decision tree implementations by enabling constant-time statistical inference, and is particularly effective when memory mapped tables fit in processor cache. However, memory mapping is more challenging when applied to random forests—ensembles of many trees—as the table sizes can easily outstrip cache capacity. We argue that careful system design for parallel and cache efficiency can make memory mapping effective for random forests. Our preliminary results show memory-mapped forests can speed up inference by a factor of up to 30×. Monday 12:00pm-1:00pm Room A Poster Poster Chair: Anthony Kougkas (Illinois Institute of Technology); Anne Benoit (ENS Lyon) 12pm: Boosting Compaction Performance of LSM-tree-based KV Stores in Multi-Near-Data Processing Systems Hui Sun (Anhui University), Qiang Wang (Anhui University), Yuhong Zhao and Yinliang Yue (Institute of Information Engineering, Chinese Academy of Sciences), and Song Fu (University of North Texas) Abstract In the era of big data, log-structured merge-tree (LSM-tree) based key-value stores (KV stores) are increasingly becoming an alternative to traditional relational databases because of their high throughput and scalability. KV stores are widely used in data-intensive applications to manage large-scale unstructured data. KV stores can trigger compaction processes to sort data, but this procedure can impair write performance. Existing work on near-data processing (NDP) framework-based KV stores mostly focuses on a single disk device, whereas an NDP cluster provides high performance owing to its advantages in storage and computing convergence. Thus, we propose a multi-near-data processing framework, named MStore, to boost the compaction performance of KV stores. MStore applies a multi-column LSM-tree data layout based on the key range for NDP clusters. In MStore, each column is placed in an NDP device; thus, multiple compaction tasks are conducted in parallel, which improves the compaction performance. Second, MStore enlarges the level width of the LSM-tree, which reduces the depth of the LSM-tree and decreases compaction operations, lessening the write amplification. Third, according to the data organization, we design a load-balancing mechanism that combines key-range-adjustment-based storage balancing with compaction-task-aware computation balancing. With the multi-NDP framework, MStore makes full use of parallel computing and I/O resources to boost compaction performance in KV stores. Compared with TStore, the performance of MStore with 4 NDP devices achieves a 3.88x improvement under random-write DB_Bench. Under YCSB-C with random-write Zipfian, the performance of MStore is 4.32x that of TStore. Guangli Dai, Pavan Kumar Paluri, Albert Mo Kim Cheng, and Panruo Wu (University of Houston) Abstract Many High-Performance Computing applications involve multi-process programming. While multiple processes running in parallel can greatly reduce the completion time of applications, there are many irregular applications where the workloads on different processes may vary significantly and thus are not balanced.
An irregular application is unbalanced in the sense that processes with smaller workloads have to be suspended to wait for processes with larger workloads when each process is allocated a sole CPU. Since such unbalanced workloads can lead to a tremendous waste of computational resources, we propose the RRP-Xen virtualization platform for these irregular applications. Specifically, RRP-Xen can cluster multiple processes with small workloads from one or more irregular applications onto the same CPU. Compared to other HPC cloud solutions, which also involve virtualization, RRP-Xen can offer real-time performance guarantees to each process. Consequently, the number of CPUs needed for deploying multiple irregular applications can be reduced while the completion of each irregular application experiences only minor delays. Renzhi Xiao, Dan Feng, and Yuchong Hu (Huazhong University of Science and Technology) Abstract Hashing is an in-memory index that provides fast search performance because it is a flat structure that can quickly locate a key-value item's position through a hash function. Non-volatile memory (NVM) technologies have promoted research on NVM-based hashing structures since NVM has the durability of hard disks and the performance of DRAM. However, current NVM-based hashing schemes must tackle data inconsistencies and avoid rapid wear due to NVM write reordering and limited endurance. In this paper, we propose a log-free and consistent chained hashing scheme for NVM, called ConHash. ConHash leverages log-free failure-atomic writes to reduce the overhead of data consistency and exploits the pre-allocation of buckets in the chain to resolve hash conflicts efficiently. ConHash uses a size function to obtain the number of keys in the hash table instead of a fixed count variable, which could otherwise cause rapid wear of NVM. Compared with state-of-the-art schemes, experimental results demonstrate that our ConHash decreases the insertion latency by up to 88.2% and the deletion latency by up to 82.6%. Md Maruf Hossain and Erik Saule (University of North Carolina at Charlotte) Abstract Graphs like social networks and web graphs evolve over time. Traditional graph analysis that expects a fixed data set is not sufficient for these kinds of temporal graphs. Streaming graph algorithms are a popular way to handle this kind of problem. In this poster, we show a performance analysis of PageRank on temporal graphs. Many algorithms have been proposed to compute PageRank on evolving graphs. Most previous studies show that incremental approaches are more popular for temporal graphs. 12:48pm: XHYPRE: A high-precision numerical software package for solving large-scale sparse linear equations Chuanying Li (Hunan University), Graillat Stef (Sorbonne Université), Hao Jiang (National University of Defense Technology), and Zhe Quan (Hunan University) Abstract Due to the inaccurate numerical calculations on high-performance computing platforms, we study how to control the cumulative effect of rounding errors. Specifically, in this paper, we first adopt error-free transformation technology to design high-precision SpMV. Then we design high-precision GMRES, BiCGSTAB, and PCG algorithms. Finally, based on the HYPRE software package, we design and implement a high-precision numerical software package, XHYPRE, with the high-precision algorithms above for large-scale sparse linear equations.
Extensive numerical experiments verify that the final calculation results are more reliable and accurate. Tuesday 8:15am-9:00am Webinar Keynote Keynote: Rick L. Stevens, DOE-ANL, Exascale and then what?: The next decade for HPC and AI Chair: Xian-He Sun (Illinois Institute of Technology) We are on the verge of deploying Exascale systems, which remarkably in the US are quite similar in architecture and programming model. But what comes after these initial exascale systems? Are we going to see AI accelerators break out into the mainstream? Are hybrid-HPC/AI surrogate models the future of scientific computing? And if so, what are the architecture and software implications? Is a hardware disaggregation model viable for HPC applications? Are quantum and neuromorphic systems likely to be on the major vendors' roadmaps for the next decade? Heterogeneous workflows connecting the edge to the core are becoming more important, yet our software stacks are unprepared and do not treat workflows as first-class applications. In this talk I'll try to make sense of trends and future directions and outline some important research problems for the next decade.
Rick Stevens is the Associate Laboratory Director of the Computing, Environment and Life Sciences Directorate at Argonne National Laboratory, and a Professor of Computer Science at the University of Chicago, with significant responsibility in delivering on the U.S. national initiative for Exascale computing and developing the DOE initiative in Artificial Intelligence (AI) for Science. His research spans the computational and computer sciences, from high-performance computing to the building of innovative tools and techniques for biological science and infectious disease research as well as approaches to advance deep learning to accelerate cancer and COVID-19 research. He also specializes in collaborative visualization technology and grid computing. At Argonne, he leads the Laboratory’s AI for Science initiative and is currently focusing on high-performance computing systems which includes collaborating with Intel and Cray to launch Argonne’s first exascale computer, Aurora, as well as the National AI Accelerator Testbed which brings together leading AI scientists to provide an open and unbiased environment for the evaluation of emerging AI accelerator technologies designed to accelerate training and inference for deep learning models.
Prof. Stevens is a member of the American Association for the Advancement of Science and has received many national honors for his research, including an R&D 100 award and most recently being named a Fellow of the Association for Computing Machinery (ACM) for his continuing contributions to high-performance computing. Tuesday 9:15am-10:15am Room A Conference Paper 1A: Memory Systems and NVM Chair: Xi Wang (RIOS Laboratory) Shizhi Jiang (University of Chinese Academy of Sciences; Institute of Software, Chinese Academy of Sciences) and Yiwei Ci, Qiusong Yang, and Mingshu Li (Institute of Software, Chinese Academy of Sciences) Abstract To learn complex memory access patterns effectively, many spatial data prefetchers have been proposed that characterize the patterns as fixed-length delta sequences. However, because complex patterns vary across workloads, it is difficult for fixed-length delta sequences to recognize them with both high coverage and accuracy. That is, longer delta sequences increase accuracy at a lower probability of pattern matching, while shorter delta sequences increase coverage at a higher probability of false predictions. A classical strategy is to introduce a multiple-matching mechanism associated with variable-length delta sequences, but the sequences then have to be redundantly stored in multiple tables. Jingwen Du, Fang Wang, Dan Feng, Weiguang Li, and Fan Li (Wuhan National Laboratory for Optoelectronics, Huazhong University of Science and Technology) Abstract For performance benefits, modern data centers increasingly tend to use NVM (Non-volatile Memory) as storage and utilize one-sided primitives of RDMA (Remote Direct Memory Access) to directly access NVM for I/O requests. However, one-sided RDMA currently does not have durability semantics, and data may partially exist in the NVM when crashes happen, thus causing unexpected data inconsistency. Mengya Lei, Fang Wang, Dan Feng, Fan Li, and Xueliang Wei (Huazhong University of Science and Technology) Abstract Data security is a very important issue in non-volatile memory (NVM) systems. Due to high security and low decryption latency, counter mode encryption (CME) is often used, which unfortunately requires extra effort to guarantee the crash consistency of counters. Existing schemes fail to efficiently guarantee the crash consistency of counters and data, thus resulting in inefficient recovery and high costs. In this paper, we propose a Crash-Consistency-Aware Encryption scheme (CCAE) for NVM systems. CCAE is inspired by two key observations. First, we observe that all used log entries in a transaction are encrypted the same number of times before being recycled, due to the write-appending feature. This motivates our shared counter optimization for log encryption. Second, we observe that the counters related to uncommitted data blocks only need to be restored to the latest value before the crash, or a newer one, to avoid one-time pad (OTP) reuse. This motivates our delayed counter persistency scheme for data encryption. Evaluation results show that compared with the state-of-the-art design, CCAE reduces NVM write traffic caused by counters by 67%, improves system performance by 14%, and decreases NVM energy consumption by 35%.
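As background for the CME scheme discussed in the CCAE abstract above: counter-mode encryption derives a one-time pad from a per-block counter and the block address and XORs it with the data, which is why a stale counter must never be reused for the same address after a crash. The sketch below is a toy illustration that substitutes SHA-256 for the AES pad generation; the key and block values are made up.

    import hashlib

    SECRET_KEY = b"demo-key"   # hypothetical key, for illustration only

    def otp(address, counter, length):
        """Derive a one-time pad from (key, block address, counter).

        Real CME uses AES over (address, counter); SHA-256 here is just a stand-in.
        """
        seed = SECRET_KEY + address.to_bytes(8, "little") + counter.to_bytes(8, "little")
        pad = b""
        while len(pad) < length:
            pad = pad + hashlib.sha256(seed + len(pad).to_bytes(4, "little")).digest()
        return pad[:length]

    def encrypt_block(plaintext, address, counter):
        pad = otp(address, counter, len(plaintext))
        return bytes(p ^ c for p, c in zip(plaintext, pad))

    # Each write bumps the counter so the pad is never reused for the same address.
    block, addr = b"persistent data.", 0x1000
    c1 = encrypt_block(block, addr, counter=1)
    c2 = encrypt_block(block, addr, counter=2)
    assert c1 != c2 and encrypt_block(c1, addr, counter=1) == block   # XOR is its own inverse

Because decryption needs the exact counter used at write time, persisting counters consistently with the data is the crash-consistency problem that CCAE targets.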
Bagus Hanindhito, Ruihao Li, and Dimitrios Gourounas (The University of Texas at Austin); Arash Fathi, Karan Govil, and Dimitar Trenev (ExxonMobil); and Andreas Gerstlauer and Lizy John (The University of Texas at Austin) Abstract Abstract Wave simulations are used in many applications: medical imaging, oil and gas exploration, earthquake hazard mitigation, and defense systems, among others. Most of these applications require repeated solutions of the wave equation on supercomputers. Minimizing time to solution and energy consumption is highly beneficial in this domain. Data movement overhead is one of the key bottlenecks that affect energy consumption. Tuesday 9:15am-10:15am Room B Conference Paper 1B: GPU Computing and Task-based Programming Models Chair: Tobias Weinzierl (Durham University) Seiya Kozakai (Hosei University), Noriyuki Fujimoto (Osaka Prefecture University), and Koichi Wada (Hosei University) Abstract Abstract In this paper, we propose integer sorting algorithms based on histograms and prefix sums, and we show that their GPU implementations are faster than the fastest sorting GPU implementations in the Thrust and/or CUB libraries for several input datasets. In particular, our algorithms are especially useful when the maximum value of the input data is smaller than the number of input items and/or the number of distinct input values is smaller than that maximum. 9:30am: Combining Dynamic Concurrency Throttling with Voltage and Frequency Scaling on Task-based Programming Models Antoni Navarro Muñoz (Barcelona Supercomputing Center (BSC-CNS), Universitat Politècnica de Catalunya); Arthur F. Lorenzon (Federal University of Pampa); Eduard Ayguadé Parra (Universitat Politècnica de Catalunya, Barcelona Supercomputing Center (BSC-CNS)); and Vicenç Beltran Querol (Barcelona Supercomputing Center (BSC-CNS)) Abstract Abstract Being on the verge of exascale performance has shifted the High Performance Computing (HPC) community from prioritizing application performance alone to including power-performance efficiency as a primary objective. Simultaneously, this has surfaced hardware and software efforts that employ techniques such as dynamic voltage and frequency scaling (DVFS) for core and uncore units, or dynamic concurrency throttling (DCT), to exploit hardware resources efficiently by saving energy while maintaining performance. These techniques are complementary, so they can be used together. However, employing them is not a straightforward task, as they have to be adjusted based on the workload, and it is even more complex to combine them properly. Thus, these techniques should be applied transparently by a runtime system, without relying on application developers. In this paper, we extend a task-based runtime system with an infrastructure that categorizes workloads based on their computational profile -- memory-bound, compute-bound, or balanced. This categorization is done online and with negligible overhead. With this additional information, we enhance the CPU manager and scheduler of OmpSs-2, a task-based parallel programming model, to automatically combine DVFS and DCT techniques based on workloads. Moreover, we show that our heuristics transparently improve energy efficiency on average by 15% with no significant performance loss, and either equal or surpass the energy efficiency of the best static configuration available.
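The DVFS/DCT combination described in the OmpSs-2 abstract above can be pictured with the small C++ sketch below, which classifies a workload from a measured arithmetic intensity and then picks a thread count (DCT) and core frequency (DVFS). The thresholds, frequencies, and function names are invented for illustration and are not the heuristics used by the paper's runtime.

#include <cstdio>

// Illustrative categorization of a workload from measured counters.
// The thresholds and knob settings below are assumptions for this sketch.
enum class Profile { MemoryBound, ComputeBound, Balanced };

Profile categorize(double flops, double bytes_moved) {
    double intensity = flops / bytes_moved;   // arithmetic intensity (flop/byte)
    if (intensity < 0.5)  return Profile::MemoryBound;
    if (intensity > 4.0)  return Profile::ComputeBound;
    return Profile::Balanced;
}

struct Knobs { int threads; int core_freq_mhz; };

// Combine DCT (thread count) and DVFS (core frequency) per profile:
// memory-bound phases tolerate lower core frequency and fewer threads,
// compute-bound phases want all cores at the highest frequency.
Knobs choose(Profile p, int max_threads) {
    switch (p) {
        case Profile::MemoryBound:  return { max_threads / 2, 1600 };
        case Profile::ComputeBound: return { max_threads,     2800 };
        default:                    return { max_threads,     2200 };
    }
}

int main() {
    Profile p = categorize(2.0e9, 8.0e9);     // 0.25 flop/byte -> memory-bound
    Knobs k = choose(p, 32);
    std::printf("threads=%d freq=%dMHz\n", k.threads, k.core_freq_mhz);
    return 0;
}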
Martin Koppehel (Otto-von-Guericke Universität Magdeburg), Tobias Groth and Sven Groppe (Universität zu Lübeck), and Thilo Pionteck (Otto-von-Guericke Universität Magdeburg) Abstract Abstract In this work we present an optimized version of the Adaptive Radix Tree (ART) index structure for GPUs. We analyze an existing GPU implementation of ART (GRT), identify bottlenecks, and present an optimized data structure and layout to improve the lookup and update performance. We show that our implementation outperforms the existing approach by a factor of up to 2 for lookups and up to 10 for updates on the same GPU. We also show that the sequential memory layout presented here is beneficial for lookup-intensive workloads on the CPU, outperforming the ART by up to 10 times. We analyze the impact of the memory architecture of the GPU, where it becomes visible that traditional GDDR6(X) is beneficial for index lookups due to its faster clock rates compared to High Bandwidth Memory (HBM). Yanhao Chen (Rutgers University) and Fei Hua, Yuwei Jin, and Eddy Z. Zhang (Rutgers University) Abstract Abstract Programming today's many-core processors is challenging. Due to the enormous amount of parallelism, synchronization is expensive. We need efficient data structures that provide automatic and scalable synchronization methods. In this paper, we focus on the priority queue data structure. We develop a heap-based priority queue implementation called BGPQ. BGPQ uses batched key nodes as the internal data representation, exploits both task parallelism and data parallelism, and is linearizable. We show that BGPQ achieves up to 88X and 11X speedup compared with state-of-the-art CPU and GPU parallel priority queue implementations, respectively. We also apply BGPQ to search problems, including 0-1 Knapsack and A*. We achieve up to 100X and 46X speedup, respectively, over the best CPU implementations. Tuesday 9:15am-10:15am Room C Conference Paper 1C: Resource Management and Infrastructure Chair: Yang You (National University of Singapore) Jinyu Yu, Dan Feng, Wei Tong, Pengze Lv, and Yufei Xiong (Wuhan National Laboratory for Optoelectronics, Huazhong University of Science and Technology) Abstract Abstract It is common to deploy multiple workloads in a cluster to achieve high resource utilization, which tends to bring more resource contention and performance interference. Therefore, guaranteeing the QoS of various services in mixed-workload deployment scenarios is a challenge. Existing solutions preempt resources from batch jobs to guarantee the QoS of latency-sensitive services and inevitably compromise the performance of batch jobs. On the other hand, cluster trace data show that most jobs do not take full advantage of their assigned resources. Thus, instead of preempting the resources of batch jobs, the resource requirements of latency-sensitive tasks can be met with the underutilized resources.
Longfang Zhou (Southwest University of Science and Technology, State Key Laboratory of Aerodynamics); Xiaorong Zhang (Southwest University of Science and Technology); Wenxiang Yang (College of Computer, National University of Defense Technology; Computational Aerodynamics Institute, China Aerodynamics Research and Development Center); Yongguo Han (Southwest University of Science and Technology); Fang Wang (Computational Aerodynamics Institute, China Aerodynamics Research and Development Center; State Key Laboratory of Aerodynamics); Yadong Wu (Sichuan University of Science and Engineering); and Jie Yu (State Key Laboratory of Aerodynamics; Computational Aerodynamics Institute, China Aerodynamics Research and Development Center) Abstract Abstract Supercomputers serve many parallel jobs by scheduling them and allocating computing resources. One popular scheduling strategy is First Come First Serve (FCFS). However, some resources always sit idle and are not effectively utilized, since they are insufficient for, and therefore reserved for, the head job in the waiting queue. To improve resource utilization, a common solution is to use backfilling, which allocates the reserved computing resources to a small, short job selected from the queue, on the premise of not delaying the original head job. Unfortunately, the job runtimes estimated by users are often overestimated. Previous studies extract features from historical job logs and predict runtime based on machine learning. However, traditional features (e.g., CPU, user, submission time, etc.) are insufficient to describe the characteristics of jobs. In this paper, we propose a novel runtime prediction framework called PREP. It explores a new feature named the job running path, which has important implications for the job's characteristics, such as the project it belongs to and the data sets and parameters it uses. As there is a strong correlation between a job's runtime and its running path, PREP groups jobs into separate clusters according to their running paths and trains a runtime prediction model for each job cluster. Final results demonstrate that adding the new feature achieves a high prediction accuracy of 88% and yields a better prediction effect than other methods, such as Last-2 and IRPA. Hongyan Li, Hang Lu, Jiawen Huang, Wenxu Wang, Mingzhe Zhang, and Wei Chen (State Key Laboratory of Computer Architecture, Institute of Computing Technology, Chinese Academy of Sciences); Liang Chang (University of Electronic Science and Technology of China); and Xiaowei Li (State Key Laboratory of Computer Architecture, Institute of Computing Technology, Chinese Academy of Sciences) Abstract Abstract Classic DNN pruning mostly leverages software-based methodologies to tackle the accuracy/speed tradeoff, which involves complicated procedures like critical parameter searching, fine-tuning, and sparse training to find the best plan. In this paper, we explore the opportunities of hardware runtime pruning and propose a hardware runtime pruning methodology, termed “BitX”, to empower versatile DNN inference. It targets the abundant useless bits in the parameters, and pinpoints and prunes these bits on-the-fly in the proposed BitX accelerator. The versatility of BitX lies in: (1) it requires no software effort; (2) it is orthogonal to software-based pruning; and (3) it supports multiple precisions (both floating point and fixed point).
Empirical studies on image classification and object detection models highlight the following results: (1) up to 4.82x speedup over the original non-pruned DNN and 14.76x over the software-pruned DNN; (2) up to 0.07% and 0.9% higher accuracy for the floating-point and fixed-point DNN, respectively; (3) 2.00x and 3.79x performance improvement over the state-of-the-art accelerators, with 0.039 mm2 area and 68.62 mW (32-bit floating point) or 36.41 mW (16-bit fixed point) power consumption under the TSMC 28 nm technology library. Jananie Jarachanthan and Li Chen (University of Louisiana at Lafayette), Fei Xu (East China Normal University), and Bo Li (Hong Kong University of Science and Technology) Abstract Abstract The salient pay-per-use nature of serverless computing has driven its continuous penetration as an alternative computing paradigm for various workloads. Yet, challenges arise and remain open when shifting machine learning workloads to the serverless environment. Specifically, the restriction on deployment size over serverless platforms, combined with the complexity of neural network models, makes it difficult to deploy large models in a single serverless function. In this paper, we aim to fully exploit the advantages of the serverless computing paradigm for machine learning workloads, with the goal of mitigating management effort and overall cost while meeting the response-time Service Level Objective (SLO). We design and implement AMPS-Inf, an autonomous framework customized for model inferencing in serverless computing. Driven by cost-efficiency and timely response, our proposed AMPS-Inf automatically generates the optimal execution and resource provisioning plans for inference workloads. The core of AMPS-Inf relies on the formulation and solution of a Mixed-Integer Quadratic Programming problem for model partitioning and resource provisioning, with the objective of minimizing cost without violating the response-time SLO. We deploy AMPS-Inf on the AWS Lambda platform, evaluate it with state-of-the-art pre-trained models in Keras including ResNet50, Inception-V3, and Xception, and compare it with Amazon SageMaker and three baselines. Experimental results demonstrate that AMPS-Inf achieves up to 98% cost saving without degrading response time performance. Tuesday 10:30am-11:30am Room A Conference Paper 2A: Storage Systems and Parallel I/O Chair: Suren Byna (Lawrence Berkeley National Laboratory) Haiwei Deng and Ranhao Jia (Department of Computer Science and Engineering, Shanghai Jiao Tong University) and Chentao Wu (Department of Computer Science and Engineering, Shanghai Jiao Tong University; Sichuan Research Institute, Shanghai Jiao Tong University) Abstract Abstract Erasure Codes (ECs) have been widely used in distributed storage systems to ensure data availability because of their low storage cost and high reliability. However, the update operations in erasure-coded storage systems can bring extremely high I/O latency and load imbalance due to the complexity of relationships between data and parity blocks. Although several methods such as Parity Logging (PL) and Log-Structured Array (LSA) have been proposed to improve the performance of updates, they either bring extra I/O operations or decrease the performance of file access. To address the above problems, we propose a novel Graph-assisted Out-of-place Update (GOOD) scheme to improve update performance, which avoids extra I/O operations and maintains the parallelism of file access.
The key idea of GOOD is to write the updated blocks into new places while keeping blocks of the same file distributed among different nodes. The update and garbage collection processes are guided by graphs constructed in advance. To demonstrate the effectiveness of GOOD, we conduct several experiments in Hadoop clusters. The results show that GOOD reduces the average response time by up to 55.33% and the number of I/O operations by up to 20.28% compared to existing methods. 10:45am: Multi-level Forwarding and Scheduling Recovery Technique in Heterogeneous Network for Erasure-coded Clusters Hai Zhou and Dan Feng (Huazhong University of Science and Technology, Wuhan National Laboratory for Optoelectronics) and Yuchong Hu (Huazhong University of Science and Technology, School of Computer Science and Technology) Abstract Abstract Erasure codes offer a storage-efficient redundancy mechanism for maintaining data availability guarantees in storage clusters, yet they also incur high network traffic consumption and recovery time during failure repair. Existing studies aim to reduce the recovery time in heterogeneous networks. However, the recovery time is always limited by low-bandwidth links between nodes, due to bandwidth heterogeneity. Recently, Bai et al. proposed a parallel pipeline tree technique, called PPT, to reduce recovery time by utilizing a special bandwidth gap to bypass the low-bandwidth link. However, we find that PPT’s gap-based bypassing method causes network congestion and competition. In this paper, we propose SMFRepair, a single-node multi-level forwarding repair technique that uses idle nodes to bypass low-bandwidth links without incurring network congestion and competition. Furthermore, a multi-node scheduling repair technique, called MSRepair, is proposed. MSRepair finds a recovery solution that schedules the parallel repair of multiple nodes and transfers data over links with as much bandwidth as possible, with the primary objective of minimizing the recovery time. Large-scale Mininet simulations and real Amazon EC2 experiments show that, compared to state-of-the-art repair techniques, the single-node recovery time can be reduced by up to 36.65%, and the multi-node recovery time can be reduced by up to 55.10%. Yang Zhou, Fang Wang, and Dan Feng (Huazhong University of Science and Technology, Wuhan National Laboratory for Optoelectronics) Abstract Abstract Disk failure has always been a major problem for data centers, leading to data loss. Current research uses supervised learning for offline training with a large number of labeled samples. However, these offline methods are no longer suitable for disk failure prediction tasks in the current big data environment. Behind this explosive amount of data, most methods do not take into account that the label values used in the model training phase may not be easy to obtain, or that the obtained label values are not completely accurate. These problems further restrict the development of supervised learning and offline modeling in disk failure prediction. In this paper, we propose ASLDP, a novel disk failure prediction method that uses active learning and semi-supervised learning. According to the characteristics of data in the disk life cycle, ASLDP carries out active learning on clearly labeled samples, which selects valuable samples and eliminates redundancy.
For samples that are unclearly labeled or unlabeled, ASLDP combines semi-supervised learning for pre-labeling and enhances generalization ability through active learning. The results on three realistic datasets demonstrate that ASLDP achieves stable failure detection rates of 80-85% with low false alarm rates, compared to current online learning methods. Furthermore, compared to current offline learning methods, ASLDP can overcome the problems of missing sample labels and data redundancy in massive data environments. Liangfeng Cheng, Yuchong Hu, and Zhaokang Ke (Huazhong University of Science and Technology) and Zhongjie Wu (Alibaba) Abstract Abstract Modern cloud-scale cold storage data centers have begun to support right-provisioning of a rack’s resources (power, cooling, etc.), which allows only a small fraction of all hard disks to be active (spinning) concurrently at any given time to reduce the cost of ownership. Data deduplication is a traditional approach that splits files into chunks and eliminates duplicate chunks, which can also cut costs for cold storage systems. However, when combined with right-provisioning, classical deduplication may cause a file to be deduplicated and stored across disks, some of which are not currently active, thus leading to unacceptable access performance caused by the spinning up and down of disks. Tuesday 10:30am-11:30am Room B Conference Paper 2B: Scheduling Algorithms and Optimizations Chair: Antonino Tumeo (Pacific Northwest National Laboratory (PNNL)) Lucas Perotin (ENS Lyon) and Hongyang Sun and Padma Raghavan (Vanderbilt University) Abstract Abstract The scheduling literature has traditionally focused on a single type of resource (e.g., computing nodes). However, scientific applications in modern High-Performance Computing (HPC) systems process large amounts of data and hence have diverse requirements for different types of resources (e.g., cores, cache, memory, I/O). All of these resources could potentially be exploited by the runtime scheduler to improve application performance. In this paper, we study multi-resource scheduling to minimize the makespan of computational workflows comprised of parallel jobs subject to precedence constraints. The jobs are assumed to be moldable, allowing the scheduler to flexibly select a variable set of resources before execution. We propose a multi-resource, list-based scheduling algorithm, and prove that, on a system with $d$ types of schedulable resources, our algorithm achieves an approximation ratio of $1.619d+2.545\sqrt{d}+1$ for any $d$, and a ratio of $d+O(\sqrt[3]{d^2})$ for large $d$. We also present improved results for independent jobs and for jobs with special precedence constraints (e.g., series-parallel graphs and trees). Finally, we prove a lower bound of $d$ on the approximation ratio of any list scheduling scheme with local priority considerations. To the best of our knowledge, these are the first approximation results for moldable workflows with multiple resource requirements. YuAng Chen and Yeh-ching Chung (Chinese University of Hong Kong, Shenzhen) Abstract Abstract PageRank, weighing the importance of vertices in a graph, serves as a fundamental algorithm for graph-structured tasks in a variety of domains. However, the processing capacity of multicore systems is oftentimes poorly utilized for large-scale PageRank due to irregular memory accesses and poor cache efficiency.
In this paper, we present HiPa, a novel hierarchical partitioning methodology to accelerate PageRank by utilizing the memory-cache architecture of the multicore system. For the shared memory, HiPa subdivides the graph based on the NUMA characteristics to reduce remote memory access while ensuring workload balance. For the private cache, HiPa further splits the graph into cacheable partitions to promote in-core computing and cache locality. Based on the partitioning strategy, systematic optimizations are proposed, such as thread management and a new data layout. These effectively alleviate thread migration and thread contention, thus enhancing the scalability of HiPa. The integration of NUMA- and cache-aware parallelism allows HiPa to harness the potential of multicore systems. The performance of HiPa is evaluated by comparing it with state-of-the-art graph frameworks and hand-optimized implementations. Over the best among them, HiPa achieves accelerations from 1.11x to 1.45x, and reductions in remote memory accesses from 1.87x to 3.90x. Moreover, we investigate the behavior of HiPa on different processor micro-architectures to push its performance closer to the hardware limit. Ali Eker, David Timmerman, Barry Williams, Kenneth Chiu, and Dmitry Ponomarev (State University of New York at Binghamton) Abstract Abstract The performance and scalability of Parallel Discrete Event Simulation (PDES) can be significantly impacted by temporarily inactive threads that occupy CPU resources but do no useful processing. A recent design called Demand-Driven PDES (DD-PDES) identifies such threads and de-schedules them from CPU cores to eliminate the unnecessary overhead. In this paper, we propose significant further improvements to DD-PDES. First, we introduce a new GVT (Global Virtual Time)-guided algorithm named GG-PDES that performs de-scheduling operations in a lock-free fashion and without relying on a centralized controller thread as was used previously. Second, we introduce the Dynamic CPU Affinity algorithm built on top of GG-PDES that adaptively pins simulation threads to CPU cores to achieve a balanced execution. We demonstrate that these optimizations can yield performance improvements in the range of 13% to 50% over the original DD-PDES system. Yubin Duan and Jie Wu (Temple University) Abstract Abstract Reducing the inference time of Deep Neural Networks (DNNs) is critical when running time-sensitive applications on mobile devices. Existing research has shown that partitioning a DNN and offloading a part of its computation to cloud servers can reduce the inference time. The single-DNN partition problem has been extensively investigated recently. However, in real-world applications, a mobile device usually generates multiple DNN inference jobs simultaneously, and little attention has been paid to this case. We aim to minimize the makespan of multiple DNNs by jointly optimizing their partitioning and scheduling. Our observations show that the local computation time on a mobile device follows an increasing function, while the communication workload for offloading usually decreases as more DNN layers are computed. Based on this, we first relax our problem to the continuous domain and show that partitioning all line-structure DNNs at the same layer is sufficient for makespan optimization. Then, for the discrete domain, two types of partitions are sufficient when the time difference between two adjacent partition layers is not drastic, subject to a given condition.
We present an algorithm based on binary search that efficiently finds optimal partition layers. We also extend our approach to general-structure DNNs and offer a heuristic solution. Experiments have been conducted to evaluate the performance of different partition and scheduling methods on sample DNNs. The results validate the optimality of our theoretical results. Tuesday 10:30am-11:30am Room C Conference Paper 2C: GPU-Accelerated Applications Chair: Taisuke Boku (University of Tsukuba) Zonghao Feng and Qiong Luo (Hong Kong University of Science and Technology) Abstract Abstract Sequence alignment is traditionally done between individual reads and a reference sequence. Recently, several sequence-to-graph aligners have been developed, because graph genomes provide more information than sequences. For example, variation graphs, which are created by augmenting linear genomes with known genetic variations, can significantly improve the quality of genotyping. However, existing sequence-to-graph aligners run on the CPU only, and the processing time is long. To speed up the processing, we propose a parallel sequence-to-graph alignment algorithm named HGA (Heterogeneous Graph Aligner) that runs on both the CPU and GPUs. Our algorithm achieves efficient CPU-GPU co-processing by dynamically distributing tasks to each processor. We design optimizations for frequent structures in genome graphs to reduce computational cost. Moreover, we propose the GCSR (Genome CSR) data structure for efficient genome graph processing on GPUs. We also apply architecture-aware optimizations to improve memory locality and increase throughput. Our experiments on a Xeon E5-2683 CPU and eight RTX 2080 Ti GPUs show that our algorithms outperform state-of-the-art aligners by up to 15.8 times and scale well on both the CPU and GPUs. The code of our HGA is available at https://github.com/RapidsAtHKUST/hga. 10:45am: Exploring HW/SW Co-Optimizations for Accelerating Large-scale Texture Identification on Distributed GPUs Junsong Wang (V-Origin), Xiaofan Zhang (University of Illinois at Urbana-Champaign), and Yubo Li and Yonghua Lin (V-Origin) Abstract Abstract Texture identification has been developed recently to support one-to-one verification and one-to-many search, which provides broader support than texture classification in real applications. It has demonstrated great potential to enable product traceability by identifying the unique texture information on the surface of the targeted objects. However, existing hardware acceleration schemes are not enough to support large-scale texture identification, especially for the search task, where the number of texture images being searched can reach millions, creating enormous compute and memory demands and making real-time texture identification infeasible. To address these problems, we propose a comprehensive toolset with joint optimization strategies from both hardware and software to deliver optimized GPU acceleration and enable large-scale texture identification with real-time responses. Novel technologies include: 1) a highly-optimized cuBLAS implementation for efficiently running the 2-nearest neighbors algorithm; 2) a hybrid cache design that incorporates host memory for streaming data toward GPUs, which delivers a 5× larger memory capacity while running the targeted workloads; 3) a batch process to fully exploit data reuse opportunities; and 4) asymmetric local feature extraction to reduce the feature matrix.
To the best of our knowledge, this work is the first implementation to provide real-time large-scale texture identification on GPUs. By exploring co-optimizations from both hardware and software, we can deliver 31× faster search and 20× larger feature cache capacity compared to a conventional CUDA implementation. We also demonstrate our proposed designs by building a distributed texture identification system with 14 Tesla P100 GPUs, which can complete 872,984 texture similarity comparisons in just one second. Robin Kobus, André Müller, and Daniel Jünger (Johannes Gutenberg University Mainz); Christian Hundt (NVIDIA AI Technology Center Luxembourg); and Bertil Schmidt (Johannes Gutenberg University Mainz) Abstract Abstract The cost of DNA sequencing has dropped exponentially over the past decade, making genomic data accessible to a growing number of scientists and engineers. In bioinformatics, localization of short DNA sequences (reads) within large genomic sequences is commonly facilitated by constructing index data structures which allow for efficient querying of substrings. Recent metagenomic classification pipelines annotate reads with taxonomic labels by analyzing their $k$-mer histograms with respect to a reference genome database. CPU-based index construction is often performed in a preprocessing phase due to the relatively high cost of building irregular data structures such as hash maps. However, the rapidly growing number of available reference genomes establishes the need for index construction and querying at interactive speeds. In this paper, we introduce MetaCache-GPU -- an ultra-fast metagenomic short read classifier specifically tailored to fit the massively parallel characteristics of CUDA-enabled accelerators. Our approach employs a novel hash table variant featuring efficient minhash fingerprinting of reads for locality-sensitive hashing and their rapid insertion using warp-aggregated operations. Our performance evaluation shows that MetaCache-GPU is able to build large reference databases in a matter of seconds, enabling instantaneous operability, while popular CPU-based tools such as Kraken2 require over an hour for index construction on the same data. In the context of an ever-growing number of reference genomes, MetaCache-GPU is the first metagenomic classifier that makes analysis pipelines with on-demand composition of large-scale reference genome sets practical. The source code is publicly available at https://github.com/muellan/metacache. Ricardo Nobre and Aleksandar Ilic (INESC-ID; Instituto Superior Técnico, Universidade de Lisboa); Sergio Santander-Jiménez (University of Extremadura); and Leonel Sousa (INESC-ID; Instituto Superior Técnico, Universidade de Lisboa) Abstract Abstract The investigation of highly-efficient parallel algorithms targeting modern heterogeneous systems can provide bioinformaticians with new mechanisms to find relations between genetics, phenotype, and environment. This paper proposes an approach for fourth-order exhaustive, i.e., as precise as possible, epistasis detection targeting modern heterogeneous systems. Being implemented in Data Parallel C++ / SYCL, the proposed approach relies on technologies and tools built around open standards, making it able to target different types of architectures and devices. As a means to show interoperability with hardware from different sources, we have included performance results obtained from execution on different systems.
Scaled to the number of samples, the proposed approach achieved a performance per GPU stream core of up to 236, 472, or 487 mega quads of SNPs processed per second on GPUs with the Gen9.5, Gen12, and Turing architectures. This metric is reported for different implementation variants combining the different considered optimizations. The proposed approach is able to target a wide range of CPU and GPU devices, enabling more users to access high-throughput epistasis detection software. Tuesday 11:45am-12:45pm Room A Conference Paper 3A: Performance Modeling and Evaluation Chair: Doru Thom Popovici (Lawrence Berkeley National Laboratory) Junyao Yang, Yuchen Wang, and Zhenlin Wang (Michigan Technological University) Abstract Abstract The Miss Ratio Curve (MRC) is an important metric and effective tool for caching system performance prediction and optimization. Since the Least Recently Used (LRU) replacement policy is the de facto policy for many existing caching systems, most previous studies on efficient MRC construction are predominantly focused on the LRU replacement policy. Recently, the random sampling-based replacement mechanism, as opposed to replacement relying on the rigid LRU data structure, has gained popularity due to its light weight and flexibility. To approximate LRU, at replacement time, the system randomly selects K objects and replaces the least recently used object among the sample. Redis implements this approximated LRU policy. We observe that there can exist a significant miss ratio gap between exact LRU and random sampling-based LRU under different sampling sizes K; therefore, existing LRU MRC construction techniques cannot be directly applied to random sampling-based LRU caches without loss of accuracy. 12pm: An Evaluation of Task-Parallel Frameworks for Sparse Solvers on Multicore and Manycore CPU Architectures Abdullah Alperen, Md Afibuzzaman, and Fazlay Rabbi (Michigan State University); M. Yusuf Ozkaya and Umit Catalyurek (Georgia Institute of Technology); and Hasan Metin Aktulga (Michigan State University) Abstract Abstract Recently, several task-parallel programming models have emerged to address the high synchronization and load imbalance issues as well as data movement overheads in modern shared memory architectures. OpenMP, the most commonly used shared memory parallel programming model, has added task execution support with dataflow dependencies. HPX and Regent are two more recent runtime systems that also support the dataflow execution model and extend it to distributed memory environments. We focus on the parallelization of sparse matrix computations on shared memory architectures. We evaluate the OpenMP, HPX, and Regent runtime systems in terms of performance and ease of implementation, and compare them against the traditional BSP model for two popular eigensolvers, Lanczos and LOBPCG. We give a general outline with regard to achieving parallelism using these runtime systems, and present a heuristic for tuning their performance to balance tasking overheads with the degree of parallelism that can be exposed. We then demonstrate their merits on two architectures, Intel Broadwell (a multicore processor) and AMD EPYC (a modern manycore processor). We observe that these frameworks achieve up to 13.7× fewer cache misses over an efficient BSP implementation across the L1, L2, and L3 cache layers. They also obtain up to 9.9× improvement in execution time over the same BSP implementation.
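The OpenMP dataflow dependencies mentioned in the last abstract above look like the following minimal sketch, in which two producer tasks and one consumer task are ordered through depend clauses rather than explicit barriers; the example is self-contained and is not taken from the paper's eigensolvers.

// Minimal illustration of OpenMP task dataflow dependencies.
// Compile with: g++ -fopenmp tasks.cpp
#include <cstdio>

int main() {
    double x = 0.0, y = 0.0;
    #pragma omp parallel
    #pragma omp single
    {
        // Producer task: writes x.
        #pragma omp task depend(out: x)
        x = 2.0;

        // Independent producer task: writes y, may run concurrently with the first.
        #pragma omp task depend(out: y)
        y = 3.0;

        // Consumer task: the runtime starts it only after both producers complete.
        #pragma omp task depend(in: x, y)
        std::printf("x + y = %f\n", x + y);

        #pragma omp taskwait
    }
    return 0;
}

With dependencies expressed this way, the runtime is free to overlap the two producers and only serializes the consumer, which is the kind of fine-grained ordering the evaluated frameworks exploit for sparse solvers.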
Guillem López-Paradís, Adrià Armejach, and Miquel Moretó (Barcelona Supercomputing Center, Universitat Politècnica de Catalunya) Abstract Abstract In recent years there has been a surge of interest in designing custom accelerators for power-efficient high-performance computing. However, available tools to simulate low-level RTL designs often neglect the target system in which the design will operate. This hinders proper testing and debugging of functionalities, and does not allow co-designing the accelerator to obtain a balanced and efficient architecture. Alexandre Denis, Emmanuel Jeannot, and Philippe Swartvagher (INRIA) Abstract Abstract Parallel runtime systems such as MPI or task-based libraries provide models to manage both computation and communication by allocating cores, scheduling threads, and executing communication algorithms. Efficiently implementing such models is challenging due to their interplay within the runtime system. In this paper, we assess interference between communications and computations when they run side by side. We study the impact of communications on computations, and conversely the impact of computations on communication performance. We consider two aspects: CPU frequency and memory contention. We have designed benchmarks to measure these phenomena. We show that CPU frequency variations caused by computation have a small impact on communication latency and bandwidth. However, we have observed on Intel, AMD, and ARM processors that memory contention may cause a severe slowdown of computation and communication when they occur at the same time. We have designed a benchmark with a tunable arithmetic intensity that shows how interference between communication and computation actually depends on the memory pressure of the application. Finally, we have observed up to 90% performance loss on communications with common HPC kernels such as CG and GEMM. Tuesday 11:45am-12:45pm Room B Conference Paper 3B: Parallelization and Code Generation Chair: Barbara Chapman (Stony Brook University) 11:45am: Automatic Code Generation and Optimization of Large-scale Stencil Computation on Many-core Processors Mingzhen Li, Yi Liu, Hailong Yang, Yongmin Hu, Qingxiao Sun, Bangduo Chen, Xin You, Xiaoyan Liu, Zhongzhi Luan, and Depei Qian (Beihang University) Abstract Abstract Stencil computation is an indispensable building block of many scientific applications and is widely used by the numerical solvers of partial differential equations (PDEs). Due to the complex computation patterns of different stencils and the various hardware targets (e.g., many-core processors), many domain-specific languages (DSLs) have been proposed to optimize stencil computation. However, existing stencil DSLs mostly focus on performance optimizations for homogeneous many-core processors such as CPUs and GPUs, and fail to embrace emerging heterogeneous many-core processors such as Sunway. In addition, few of them can support expressing stencils with multiple time dependencies and optimizations in both spatial and temporal dimensions. Moreover, most stencil DSLs are unable to generate code that runs efficiently at large scale, which limits their practical applicability. In this paper, we propose MSC, a new stencil DSL designed to express stencil computation in both spatial and temporal dimensions. It can generate high-performance stencil codes for large-scale execution on emerging many-core processors.
Specifically, we design several optimization primitives for improving parallelism and data locality, as well as a communication library for efficient halo exchange in large-scale execution. The experimental results show that MSC achieves better performance than state-of-the-art stencil DSLs. Jan-Patrick Lehr, Christian Bischof, Florian Dewald, Heiko Mantel, Mohammad Norouzi, and Felix Wolf (Technical University of Darmstadt) Abstract Abstract The size and complexity of high-performance computing applications present a serious challenge to manual reasoning about program behavior. The vastness and diversity of their code bases often break automatic analysis tools, which could otherwise be used. As a consequence, developers resort to mini-apps, i.e., trimmed-down proxies of the original programs that retain key performance characteristics. Unfortunately, their construction is difficult and time-consuming, which prevents their mass production. In this paper, we propose a systematic and tool-supported approach to extract mini-apps from large-scale applications that reduces the manual effort needed to create them. Our approach covers the stages of kernel identification, code extraction, data capture, and representativeness validation. We demonstrate it using an astrophysics simulation with ~8.5 million lines of code and extract a mini-app with only ~1,100 lines of code. For the mini-app, we evaluate the reduction of code complexity and the execution similarity, and show how it enables the tool-supported discovery of unexploited parallelization opportunities, reducing the simulation's runtime significantly. 12:15pm: Automatic Generation of High-Performance Inference Kernels for Graph Neural Networks on Multi-Core Systems Qiang Fu and H. Howie Huang (George Washington University) Abstract Abstract Graph neural networks are powerful in learning from high-dimensional graph-structured data, for which a number of frameworks such as DGL and PyTorch Geometric have been developed to facilitate the construction, training, and deployment of such models. Unfortunately, existing systems underperform when inferring on huge graph data on multi-core CPUs. Furthermore, traditional graph processing systems struggle with complexity issues due to their low-level programming interfaces. In this paper, we present Gin, a new compiler-based software framework optimized for graph neural network inference, which offers a user-friendly interface via an intuitive programming model for defining graph neural network models. Gin builds high-level dataflow graphs as intermediate representations, which are compiled into highly efficient code as well as binary inference kernels. Our evaluation shows that Gin significantly accelerates inference on billion-edge graphs, beating three state-of-the-art solutions, i.e., DGL, TensorFlow, and PyTorch Geometric, by 31.44x on average, with much higher CPU and memory bandwidth utilization. In addition, Gin is able to achieve considerable speedup (up to 7.6×) over the traditional graph processing system Ligra. Hannah Cartier (Rhodes College), James Dinan (NVIDIA), and D. Brian Larkins (Rhodes College) Abstract Abstract Applications that rely on sparse or irregular data are often challenging to scale on modern distributed-memory systems. As a result, these systems typically require continuous load balancing in order to maintain efficiency. Work stealing is a common technique to remedy imbalance.
In this work, we present a strategy for work stealing that reduces the amount of communication required for a steal operation by half. We show that, in exchange for a small amount of additional complexity to manage the local queue state, we can combine discovering and claiming work into a single step. Conventionally, work stealing uses a two-step process of discovering work and then claiming it. Our system, SAWS, provides a mechanism where both steps are performed in a single communication without the need for multiple synchronization messages. This reduction in communication is made possible by the novel application of atomic operations that manipulate a compact representation of task queue metadata. We demonstrate the effectiveness of this strategy using known benchmarks for testing dynamic load balancing systems and for performing unbalanced tree searches. Our results show that the reduction in communication reduces task acquisition time and steal time, which in turn improves overall performance on sparse computations. Tuesday 11:45am-12:45pm Room C Conference Paper 3C: Applications with Machine Learning Chair: Xingfu Wu (Argonne National Laboratory, University of Chicago) 11:45am: CNN+LSTM Accelerated Turbulent Flow Simulation with Link-Wise Artificial Compressibility Method Sijiang Fan (National University of Defense Technology, University of Manchester); Jiawei Fei, Xiao-Wei Guo, and Canqun Yang (National University of Defense Technology); and Alistair Revell (University of Manchester) Abstract Abstract The simulation of turbulent flow, the most common form of fluid motion, is indispensable in computational fluid dynamics (CFD). The synthetic eddy method (SEM) generates the turbulent inflow and is adopted as the inlet boundary condition of the simulation. However, SEM is time-consuming and can significantly slow down the simulation, occupying 58% of the whole computational time. This is highly inefficient, especially since SEM is only used at the inlet. In this paper, we propose an efficient alternative. In particular, we leverage CNN+LSTM to replace SEM for obtaining the turbulence statistics and combine it with the link-wise artificial compressibility method (LW-ACM), a fast numerical method in CFD. We validate the results predicted by CNN+LSTM and prove that our model can provide the correct turbulence statistics even after a long time. Experimental results show that our CNN+LSTM module achieves over 15× speedup compared with SEM, which greatly reduces the time consumption of turbulent inflow generation (from 58% to 7%). As a result, the whole time of turbulent flow simulation is more than halved. Compared with a newly released GPU-accelerated standard lattice Boltzmann method solver, our combination of CNN+LSTM and LW-ACM is about 8.6× faster. Among all studies reported to date, our work is the fastest implementation for simulating turbulent channel flow, an important step for the field of fast CFD analysis. 12pm: ComputeCOVID19+: Accelerating COVID-19 Diagnosis and Monitoring via High-Performance Deep Learning Garvit Goel, Atharva Gondhalekar, and Jingyuan Qi (Virginia Tech); Zhicheng Zhang (Stanford University); and Guohua Cao and Wu Feng (Virginia Tech) Abstract Abstract The COVID-19 pandemic has highlighted the importance of diagnosing and monitoring the disease as early and as accurately as possible.
However, the reverse-transcription polymerase chain reaction (RT-PCR) test results in two issues: (1) protracted turnaround time from sample collection to testing result and (2) compromised test accuracy, as low as 67%, due to when the test is administered and due to how the samples are collected, handled, and delivered to the lab to conduct the RT-PCR test. Thus, we present ComputeCOVID19+, our computed tomography-based framework to improve the testing speed and accuracy of COVID-19 (plus its variants) via a deep learning-based network for CT image enhancement called DDnet. To demonstrate its speed and accuracy, we evaluate ComputeCOVID19+ across many sources of computed tomography (CT) images and on many heterogeneous platforms, including multi-core CPU, many-core GPU, and even FPGA. Our results show that ComputeCOVID19+ can significantly shorten the turnaround time from days to minutes and improve the testing accuracy to 91%. Aymen Al Saadi (Rutgers University); Austin Clyde (Argonne, Univ. of Chicago); Shantenu Jha (Brookhaven National Laboratory, Rutgers University); Agastya Bhati (UCL); Alexander Brace (Argonne); Li Tan (Brookhaven National Laboratory); Anda Trifan (University of Illinois at Urbana Champaign, Argonne National Laboratory); Heng Ma (Argonne National Laboratory); Matteo Turilli (Rutgers University, Brookhaven National Laboratory); Yadu Babuji and Benjamin Blaiszik (University of Chicago, Argonne National Laboratory); Thomas Brettin (Argonne National Laboratory); Kyle Chard (University of Chicago); Ryan Chard (Argonne National Laboratory); Ian Foster (Argonne National Laboratory, University of Chicago); Thomas Gibbs and Kristopher Keipert (NVIDIA Inc.); Alexander Partin (Argonne National Laboratory); Ashka Shah (University of Chicago); Abraham Stern (NVIDIA Inc.); Aristeidis Tsaris (Oak Ridge National Laboratory); Huub Van Dam (Brookhaven National Laboratory); Arvind Ramanathan (Argonne); Peter Coveney (UCL); Rick Stevens (Argonne National Laboratory, University of Chicago); Hyungro Lee and Andre Merzky (Rutgers University); Mikhail Titov (Brookhaven National Laboratory); Dario Alfe (University College London); Dieter Kranzlmüller (Leibniz Research Centre); Shunzhou Wan (University College London); David Wifling and Gerald Mathias (Leibniz Research Centre); Zhuozhao Li (University of Chicago); Thorsten Kurth (NVIDIA Inc.); and Junqi Yin (Oak Ridge National Laboratory) Abstract Abstract The drug discovery process currently employed in the pharmaceutical industry typically requires about 10 years and $2-3 billion to deliver one new drug. This is both too expensive and too slow, especially in emergencies like the COVID-19 pandemic. In silico methodologies need to be improved to better select lead compounds that can proceed to later stages of the drug discovery protocol accelerating the entire process. No single methodological approach can achieve the necessary accuracy with the required efficiency. Here, we describe an Integrated Modeling PipEline for COVID Cure by Assessing Better LEads (IMPECCABLE) and how its multiple methodological innovations overcome this fundamental limitation. We also describe the development of computational infrastructure to support these innovations at scale and characterize the performance for throughput, peak performance, and scientific results. Individual workflow components deliver 100- to 1000 improvement over traditional methods. 
The integration of methods supported by scalable infrastructure speeds up drug discovery by orders of magnitude. IMPECCABLE has screened approximately 10^{11} ligands and has been used to discover a promising drug candidate. These capabilities have been used by the US-DOE National Virtual Biotechnology Laboratory and the EU Centre of Excellence in Computational Biomedicine. 12:30pm: Multi-Agent Reinforcement Learning based Distributed Renewable Energy Matching for Datacenters Haoyu Wang, Haiying Shen, Jiechao Gao, Kevin Zheng, and Xiaoying Li (University of Virginia) Abstract Abstract The rapid growth of cloud computing in recent decades has greatly increased the brown energy consumption of datacenters, and hence significantly increased carbon emissions, which negatively impact the environment as well as the monetary cost. However, the instability of renewable energy cannot guarantee support to a datacenter, and energy competition among different datacenters may lead to datacenter energy outages. In this paper, we focus on the problem of how to match different renewable energy generators to the datacenters of different cloud providers to minimize the carbon emission, monetary cost, and service level objective (SLO) violations due to renewable energy shortage. The challenges here are that the datacenters may compete when requesting energy, renewable energy generation is not stable, and the decision must be made quickly. There have been no previous efforts devoted to this problem. To solve the problem, we first test several machine learning techniques on long-term prediction accuracy for renewable energy generation and energy demand using real traces, and identify SARIMA for the prediction. We then propose a multi-agent reinforcement learning based method (MARL) for each datacenter to determine how much renewable energy to request from each generator based on the predicted results. The trace-driven experiments show that MARL outperforms other methods by increasing the SLO satisfaction ratio by up to 35%, and reducing the total monetary cost by up to 19% (0.33 billion dollars in 90 days) and the total carbon emission by 33%, and DGJP further improves the performance. Tuesday 7:00pm-8:00pm Webinar ICPP Bonfire Bonfire Gathering: Remember the Old Tradition Virtually! Chair: Wu Feng (Virginia Tech) ICPP started in 1972 as an intimate workshop-like gathering of intellectuals in parallel processing. As ICPP rapidly grew in stature and size, the ICPP bonfire joke-telling party came into being around 1976 to capture that original informal workshop-like setting, and lasted for more than a decade. This year’s virtual ICPP bonfire will reminisce about the history behind the bonfires, their social and professional impact on the parallel computing community, and the jokes that were told, all from the perspective of long-time ICPP bonfire attendees from the 1970s and 1980s, who are noted below.
HJ Siegel (moderator),
Bruce Berra,
Tom Casavant,
Hank Dietz,
Doug DeGroot,
Duncan Lawrie,
Diane Smith Wednesday 8:00am-9:00am Webinar Conference Paper Best Paper Candidates Chair: Felix Wolf (Technical University of Darmstadt) BP Hanfeng Liu (School of Science and Engineering, The Chinese University of Hong Kong, Shenzhen; Shenzhen Institute of Artificial Intelligence and Robotics for Society); Zeyi Wen (Department of Computer Science and Software Engineering, The University of Western Australia); and Wei Cai (School of Science and Engineering, The Chinese University of Hong Kong, Shenzhen; Shenzhen Institute of Artificial Intelligence and Robotics for Society) Abstract Abstract Particle Swarm Optimization (PSO) has been widely used in various optimization tasks (e.g., neural architecture search and autonomous vehicle navigation), because it can solve non-convex optimization problems with simplicity and efficacy. However, the PSO algorithm is often time-consuming to use, especially for high-dimensional problems, which hinders its applicability in time-critical applications. In this paper, we propose novel techniques to accelerate the PSO algorithm with GPUs. To mitigate the efficiency bottleneck, we formally model the PSO optimization as a process of element-wise operations on matrices. Based on the modeling, we develop an efficient GPU algorithm to perform the element-wise operations in massively parallel using the tensor cores and shared memory. Moreover, we propose a series of novel techniques to improve our proposed algorithm, including (i) GPU resource-aware thread creation to prevent creating too many threads when the number of particles/dimensions is large; (ii) designing parallel techniques to initialize swarm particles with fast random number generation; (iii) exploiting GPU memory caching to manage swarm information instead of allocating new memory and (iv) developing a schema to support customized swarm evaluation functions. We conduct extensive experiments on four optimization applications to study the efficiency of our algorithm called "FastPSO". Experimental results show that FastPSO consistently outperforms the existing CPU-based PSO libraries by two orders of magnitude, and transcends the existing GPU-based implementation by 5 to 7 times while achieving better or competitive optimization results. BP Yang Yang and Qiang Cao (Huazhong University of Science and Technology) Abstract Abstract The first commercial Non-Volatile Memory (NVM) (i.e., Intel Optane DC Persistent Memory) exhibits limited parallelism, especially for write operations, which is generally neglected by existing NVM-aware file systems. Besides, the concurrent control of file systems also limits their scalability on high-performance NVMs under mainstream multi-core architectures. To effectively exploit full parallelism inherent in both NVMs and multi-core processors to enhance the overall performance, this paper proposes a novel scalable persistent memory file system, called SPMFS. SPMFS first partitions global metadata structures of the file system into per-core structures to distribute load and relieve contention. Second, SPMFS presents a fine-grained range lock to support concurrent accesses upon a file. Finally, SPMFS designs a dedicated I/O thread pool to offer optimal parallelism inherent in underlying NVM regardless of varying user-threads. We implement an SPMFS prototype and evaluate it under a variety of workloads generated by IOtest, Filebench, FIO, and production traces from Alibaba Pangu. 
The experiments show that SPMFS provides better scalability and achieves up to 2.37× write throughput improvement over state-of-the-art kernel NVM-aware file systems (Ext4-DAX, NOVA, and PMFS) and user-space file systems (Strata and Libnvmmio), without sacrificing read performance. 8:30am: Exploiting system level heterogeneity to improve the performance of a GeoStatistics multi-phase task-based application BP Lucas Leandro Nesi (Institute of Informatics, Federal University of Rio Grande do Sul); Arnaud Legrand (University Grenoble Alpes, CNRS); and Lucas Mello Schnorr (Institute of Informatics, Federal University of Rio Grande do Sul) Abstract Abstract Heterogeneity is part of HPC infrastructures, not only at the intra-node level but also at the system level. Applications with multiple phases that have distinct resource necessities can take advantage of this inter-node heterogeneity to improve performance and reduce resource idleness. Such an application is ExaGeoStat, a task-based machine learning framework specifically designed for geostatistics data. This work presents strategies to efficiently distribute multi-phase applications over system-level heterogeneous resources. We both (1) improve application phase overlap by optimizing runtime and scheduling decisions and (2) compute the optimal distribution for all the phases using a linear program that leverages node heterogeneity while limiting communication overhead. The performance gains of our phase overlap improvements are between 36% and 50% compared to the original synchronous and homogeneous execution. We show that by adding some slow nodes to a homogeneous set of fast nodes, we can improve the performance by another 25% compared to a standard block-cyclic distribution, thereby harnessing any machine. BP Tanmoy Sen and Haiying Shen (University of Virginia) Abstract Abstract Applications running in edge computing systems seamlessly collect data, process data, and take actions accordingly. In many cases, the applications need to assist people in real time, and even have life-or-death consequences such as heart attack detection in healthcare and object detection in driving, which requires low job latency. Moreover, power and bandwidth are constrained resources in edge computing systems. Therefore, a challenge is how to handle data efficiently to reduce job latency while also reducing power and bandwidth consumption. Previous works only focus on where to store collected source data to reduce the communication latency for source data sharing. Noticing that intermediate and final processing results may be shared by many applications, we propose to store intermediate and final results for sharing to avoid duplicated computation. We also propose a data collection scheme that reduces data collection frequency based on context-related factors to achieve an optimal tradeoff between overhead and decision-making accuracy. We further propose data redundancy elimination to reduce the redundant data transmitted between edge and fog nodes. Our combined data operation strategies show significant improvement over the state-of-the-art methods in terms of job latency, power, and bandwidth consumption in experiments on both simulated and real edge environments.
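The result-sharing idea in the last abstract above, storing intermediate and final processing results so that other applications avoid recomputation, can be pictured with the small memoization cache below. The key format, the unbounded map, and the get_or_compute interface are assumptions made for this sketch rather than the paper's actual design.

#include <functional>
#include <string>
#include <unordered_map>

// Illustrative result-sharing cache: intermediate/final results are stored
// under a key naming the operation and its input, so that other applications
// on the same node can reuse them instead of recomputing.
class ResultCache {
public:
    std::string get_or_compute(const std::string& op,
                               const std::string& input_id,
                               const std::function<std::string()>& compute) {
        std::string key = op + "|" + input_id;
        auto it = cache_.find(key);
        if (it != cache_.end()) {
            return it->second;            // reuse a previously shared result
        }
        std::string result = compute();   // compute once, then share
        cache_.emplace(std::move(key), result);
        return result;
    }
private:
    std::unordered_map<std::string, std::string> cache_;
};

int main() {
    ResultCache cache;
    // The first caller pays the computation cost; later callers hit the cache.
    auto detect = [] { return std::string("objects: 3"); };
    std::string a = cache.get_or_compute("object_detection", "frame_42", detect);
    std::string b = cache.get_or_compute("object_detection", "frame_42", detect);
    return (a == b) ? 0 : 1;
}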
Wednesday 9:15am-10:15am Room A Panel Panel: Celebrating 50 Years of ICPP Chair: Rudolf Eigenmann (University of Delaware) Panelists: Keshav Pingali (University of Texas at Austin; Moderator), Susanne Hambrusch (Purdue University), David Padua (University of Illinois at Urbana-Champaign), HJ Siegel (Colorado State University), Guang Gao (University of Delaware), Lionel Ni (Hong Kong University of Science and Technology), and Ahmed Sameh (Purdue University). Wednesday 10:30am-11:30am Room A Conference Paper 4A: Graph Computing Chair: Erika Parsons (University of Washington, Bothell) 10:30am: Ascetic: Enhancing Cross-Iterations Data Efficiency in Out-of-Memory Graph Processing on GPUs Ruiqi Tang, Ziyi Zhao, and Kailun Wang (Nankai University); Xiaoli Gong (Nankai University); Jin Zhang (Nankai University); Wenwen Wang (University of Georgia); and Pen-Chung Yew (University of Minnesota at Twin Cities) Abstract Graph analytics are widely used in real-world applications, and GPUs are major accelerators for such applications. However, as graph sizes become significantly larger than the capacity of GPU memory, the performance can degrade significantly due to the heavy overhead of moving large amounts of data between CPU memory and GPU memory. Lin Zhu, Qiang-Sheng Hua, and Hai Jin (Huazhong University of Science and Technology) Abstract In this paper, we propose a parallel algorithm for computing all-pairs shortest paths (APSP) for sparse graphs on the distributed memory system with $p$ processors. To exploit the graph sparsity, we first preprocess the graph by utilizing several known algorithmic techniques in linear algebra such as fill-in reducing ordering and elimination tree parallelism. Then we map the preprocessed graph on the distributed memory system for both load balancing and communication reduction. Finally, we design a new scheduling strategy to minimize the communication cost. The bandwidth cost (communication volume) and the latency cost (number of messages) of our algorithm are $O(\frac{n^2 \log^2 p}{p}+|S|^2 \log^2 p)$ and $O(\log^2 p)$, respectively, where $S$ is a minimal vertex separator that partitions the graph into two components of roughly equal size. Compared with the state-of-the-art result for dense graphs where the bandwidth and latency costs are $O(\frac{n^2}{\sqrt{p}})$ and $O(\sqrt{p}\log^2 p)$, respectively, our algorithm reduces the latency cost by a factor of $O(\sqrt{p})$, and reduces the bandwidth cost by a factor of $O(\frac{\sqrt{p}}{\log^2 p})$ for sparse graphs with $|S|=O(\frac{n}{\sqrt{p}})$. We also present the bandwidth and latency cost lower bounds for computing APSP on sparse graphs, which are $\Omega(\frac{n^2}{p}+|S|^2)$ and $\Omega(\log^2 p)$, respectively. This implies that the bandwidth cost of our algorithm is nearly optimal and the latency cost is optimal. Huashan Yu (Department of Computer Science and Technology, Peking University) and Xiaolin Wang and Yingwei Luo (Department of Computer Science and Technology, Peking University; Peng Cheng Lab, Shenzhen) Abstract The Single-Source Shortest Path (SSSP) problem is to compute the shortest distances in a weighted graph from a source vertex to every other vertex. This paper focuses on parallel efficiency and scalability of SSSP computations on large-scale graphs. We propose an edge-fencing strategy to customize an SSSP algorithm’s schedule for every SSSP computation, and devise a path-centric SSSP algorithm with this strategy.
This strategy aims at reducing both the number of relaxed edges and the number of relaxations repeated on each edge. It exploits a few fence values to select the relaxed edges and schedule edge relaxations according to the lengths of the created paths. The path-centric algorithm works on a hierarchical graph model, and exploits the edge-fencing strategy to schedule edge relaxations in parallel settings. The hierarchical graph model quantifies the length distribution of shortest paths in large-scale graphs and provides appropriate fence values for every SSSP computation. The algorithm was evaluated on a wide range of synthetic graphs and real-world graphs. The experimental results suggest that our algorithm is efficient and scalable for graphs with skewed degree distributions, and its performance is relatively insensitive to the hierarchical graph model’s accuracy. Mohsen Koohi Esfahani, Peter Kilpatrick, and Hans Vandierendonck (Queen's University Belfast) Abstract The skewed degree distribution of real-world graphs is the main source of poor locality in graph processing, especially in the pull traversal order of a graph. This paper argues that different vertices in power-law graphs have different locality characteristics and therefore the traversal method should be adapted to these characteristics. However, conventional graph traversal methods, such as push and pull, traverse all vertices in the same manner. This paper argues that this leads to sub-optimal memory locality and hence poor efficiency. Wednesday 10:30am-11:30am Room B Conference Paper 4B: Storage Software and Optimizations Chair: Anthony Kougkas (Illinois Institute of Technology) Qiliang Li, Min Lyu, Liangliang Xu, Yinlong Xu, and Wei Wang (University of Science and Technology of China) Abstract In the era of explosive data growth, the RAID2.0 architecture with dozens or even hundreds of disks is commonly used to provide large-capacity data storage. Due to limited resources, such as memory and CPU, the reconstruction for disk failures in RAID2.0 is executed in batches. Traditional random data placement and recovery schemes make the I/O access highly skewed within a batch, which slows down the reconstruction speed. Jing Hu and Jianxi Chen (Wuhan National Laboratory for Optoelectronics, Huazhong University of Science and Technology); Yifeng Zhu (University of Maine); and Qing Yang, Zhouxuan Peng, and Ya Yu (Wuhan National Laboratory for Optoelectronics, Huazhong University of Science and Technology) Abstract Emerging persistent memory (PM) is a promising technology that provides near-DRAM performance and disk-like durability. However, many data structures designed based on DRAM, such as hash tables and B-trees, are sub-optimal for PM. Prior studies have shown that the scalability of hash tables on Intel Optane DC Persistent Memory Modules (DCPMM) degrades significantly due to expensive lock-based concurrency control and massive data movements during rehashing. This paper proposes a lock-free parallel multi-split extendible hashing scheme (PMEH), which eliminates the lock contention overhead, reduces data movements during rehashing, and ensures data consistency. Under the widely used YCSB workloads, the evaluation results show that compared to other state-of-the-art hashing schemes, PMEH is up to 1.38x faster for insertion and up to 1.9x faster for deletion, while reducing extra writes by 52%. In addition, PMEH can ensure instant recovery regardless of data size.
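For readers unfamiliar with the rehashing cost that PMEH targets, a textbook extendible hashing sketch (single-threaded and in-memory, with invented class names) shows where directory doubling and bucket splits move data; PMEH's lock-free, multi-split, persistent-memory design goes well beyond this toy:

class Bucket:
    def __init__(self, local_depth=1, capacity=4):
        self.local_depth = local_depth
        self.capacity = capacity
        self.items = {}

class ExtendibleHash:
    """Textbook extendible hashing, shown only to illustrate the directory/bucket-split
    rehashing that PM-optimized schemes such as PMEH try to make cheap and
    crash-consistent. This is not PMEH."""

    def __init__(self, bucket_capacity=4):
        self.global_depth = 1
        self.bucket_capacity = bucket_capacity
        self.directory = [Bucket(1, bucket_capacity), Bucket(1, bucket_capacity)]

    def _slot(self, key):
        return hash(key) & ((1 << self.global_depth) - 1)

    def get(self, key):
        return self.directory[self._slot(key)].items.get(key)

    def put(self, key, value):
        b = self.directory[self._slot(key)]
        if key in b.items or len(b.items) < b.capacity:
            b.items[key] = value
            return
        self._split(b)
        self.put(key, value)                    # retry after the split

    def _split(self, bucket):
        if bucket.local_depth == self.global_depth:
            self.directory = self.directory + self.directory   # double the directory
            self.global_depth += 1
        bucket.local_depth += 1
        new_bucket = Bucket(bucket.local_depth, bucket.capacity)
        high_bit = 1 << (bucket.local_depth - 1)
        # Redirect half of the directory slots, then rehash only this bucket's items.
        for i, b in enumerate(self.directory):
            if b is bucket and (i & high_bit):
                self.directory[i] = new_bucket
        old_items, bucket.items = bucket.items, {}
        for k, v in old_items.items():
            self.directory[self._slot(k)].items[k] = v

ht = ExtendibleHash(bucket_capacity=2)
for i in range(16):
    ht.put(f"key{i}", i)
print(ht.global_depth, ht.get("key7"))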
Jun Li, Minjun Li, and Zhigang Cai (Southwest University); Francois Trahay (Telecom SudParis); Mohamed Wahib (National Institute of Advanced Industrial Science and Technology, RIKEN Center for Computational Science); Balazs Gerofi (RIKEN Center for Computational Science); and Zhiming Liu, Min Huang, and Jianwei Liao (Southwest University) Abstract Modern high-density SSDs commonly designate a part of their capacity as a cache using an SLC (Single-level Cell)-mode region. Partial programming is then adopted for reducing space fragmentation in the SLC-mode pages, but it exacerbates program disturbance. This paper proposes a partial programming scheme (called intra-page update) by updating hot, small-size data inside a given page to minimize the negative impact induced by program disturbance. Moreover, we introduce a novel data movement principle to separate hot and cold write data in the SLC-mode cache when updating the data or carrying out garbage collections. As a result, the hot updated data can be kept in the SLC-mode cache and the cold data will be flushed onto the high-density SSD region. Simulation tests on several realistic disk traces show that our proposal improves both bit error rate and I/O performance compared to state-of-the-art methods, without a noticeable decrease in total endurance. Junhao Zhu (National University of Defense Technology); Kaixin Huang (ByteDance Inc., Shanghai Jiao Tong University); Xiaomin Zou (Huazhong University of Science and Technology); and Chenglong Huang, Nuo Xu, and Liang Fang (National University of Defense Technology) Abstract With high memory density, non-volatility, and DRAM-scale latency, non-volatile memory (NVM) brings an evolution to storage systems and durable data structures. Intel Optane DC Persistent Memory Module (AEP), the first commercial NVM product, shows features that differ from previous assumptions: higher read latency, lower bandwidth, and block access granularity compared with DRAM. It is therefore reasonable to build hybrid memory that exploits the complementary advantages of DRAM and NVM. In this paper, we present a read-efficient and write-optimized hashing scheme for hybrid DRAM-NVM memory, named HDNH (Hybrid DRAM-NVM Hashing). Our design can be summarized into three key points. First, we decouple the storage for data and metadata by placing key-value items in a non-volatile table for persistence while placing metadata in an Optimistic Compression Filter (OCF) to reduce excessive NVM accesses. Second, we design a hot table in DRAM to speed up search requests and propose an efficient replacement strategy called RAFL. Third, we develop a fine-grained optimistic concurrency mechanism to enable high-performance concurrent accesses on multi-core systems. Experimental results on the AEP platform show that HDNH outperforms its counterparts by up to 2.9x under various YCSB workloads. Wednesday 10:30am-11:30am Room C Conference Paper 4C: Algorithms and Applications Chair: Zhou Jin (China University of Petroleum, Beijing) Runtian Ren and Xueyan Tang (Nanyang Technological University) Abstract We consider two combinatorial optimization problems, named Generalized Skyline Interval Coloring (GSIC) and Dynamic Geometric Bin Packing (DGBP). The input to both problems is a set of interval jobs, with each job specified by a horizontal active time interval and a vertical size. For GSIC, each job is to be allocated a vertical interval of the specified size in the range $[0, +\infty)$.
For any two jobs with overlapping active intervals, their vertical intervals must not overlap. The instantaneous cost of an allocation at any time is defined as the highest point allocated to the active jobs. The target is to minimize the accumulated cost over time. For DGBP, each job is to be assigned to a machine of capacity $g$ and be allocated a vertical interval of the specified size in the range $[0, g)$. For any two jobs with overlapping active intervals, if they are assigned to the same machine, their vertical intervals must not overlap. The target is to minimize the total machine busy time, where the busy time of a machine is the time duration in which there is at least one active job in the machine. We develop $O(1)$-approximation algorithms for both problems in the offline setting and asymptotically optimal algorithms in the non-clairvoyant and clairvoyant online settings. Zhuoran Ji and Cho-Li Wang (The University of Hong Kong) Abstract DBSCAN is a popular clustering algorithm, which shows great success in many real-world applications. Its advantages come at the expense of massive computation, especially for computing the distance matrix. Driven by deep learning, many Artificial Intelligence (AI) chips have been developed. With efficient matrix multiplication units, AI chips can significantly accelerate the distance calculation. However, DBSCAN also needs to identify and count the neighbors for each point. This is challenging for most AI chips due to over-specialization. Moreover, the increasing data size and the limited device memory capacity force DBSCAN to run in a mini-batch manner. This results in a high data transfer overhead, which further hinders the performance of DBSCAN on AI chips. In this paper, we propose two novel techniques to address the challenges of accelerating the DBSCAN algorithm with AI chips: (1) new neighbor identification algorithms using bitwise operations only, while traditional solutions require the compare-and-select operations that are weakly supported in AI chips; and (2) two speculative execution strategies to reduce the data transfer overhead induced by mini-batches. Evaluations show that deploying distance matrix calculation to tensor cores achieves a 2.61x speedup on an NVIDIA RTX 3090. On Huawei Ascend 310, our neighbor identification algorithms achieve 17.88x the throughput of using CPUs for neighbor identification. The speculative execution strategies further reduce the execution time by 15.1% on average for normal datasets and up to 99.0% for sparse datasets. Nikita Mishin and Daniil Berezun (Saint Petersburg State University, JetBrains Research) and Alexander Tiskin (Saint Petersburg State University) Abstract The longest common subsequence (LCS) problem on a pair of strings is a classical problem in string algorithms. Its extension, the semi-local LCS problem, provides a more detailed comparison of the input strings, without any increase in asymptotic running time. Several semi-local LCS algorithms have been proposed previously; however, to the best of our knowledge, none have yet been implemented. In this paper, we explore a new hybrid approach to the semi-local LCS problem. We also propose a novel bit-parallel LCS algorithm. In the experimental part of the paper, we present an implementation of several existing and new parallel LCS algorithms and evaluate their performance.
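As background for the bit-parallel direction mentioned above, the classical bit-vector LCS-length computation (in the spirit of Crochemore et al.) packs a dynamic-programming row into machine words. The sketch below uses Python's arbitrary-precision integers in place of fixed-width words; it is not the authors' semi-local, hybrid, or parallel implementation:

def lcs_length(a: str, b: str) -> int:
    """Bit-parallel LCS length: one bit of V per position of `a`; the zero bits
    of V after processing `b` count the LCS contributions."""
    m = len(a)
    if m == 0 or len(b) == 0:
        return 0
    match = {}                                  # bit i of match[c] set iff a[i] == c
    for i, c in enumerate(a):
        match[c] = match.get(c, 0) | (1 << i)
    full = (1 << m) - 1
    V = full
    for c in b:
        U = V & match.get(c, 0)                 # candidate match positions
        V = ((V + U) | (V - U)) & full          # carry/borrow propagate the DP row
    return m - bin(V).count("1")

# Classic textbook example: LCS("ABCBDAB", "BDCABA") has length 4.
assert lcs_length("ABCBDAB", "BDCABA") == 4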
Zitong Li, Qiming Fang, and Grey Ballard (Wake Forest University) Abstract Tucker decomposition is a low-rank tensor approximation that generalizes a truncated matrix singular value decomposition (SVD). Existing parallel software has shown that Tucker decomposition is particularly effective at compressing terabyte-sized multidimensional scientific simulation datasets, computing reduced representations that satisfy a specified approximation error. The general approach is to get a low-rank approximation of the input data by performing a sequence of matrix SVDs of tensor unfoldings, which tend to be short-fat matrices. In the existing approach, the SVD is performed by computing the eigendecomposition of the Gram matrix of the unfolding. This method sacrifices some numerical stability in exchange for lower computation costs and easier parallelization. We propose using a numerically more stable though more computationally expensive way to compute the SVD by preprocessing with a QR decomposition step and computing an SVD of only the small triangular factor. The more numerically stable approach allows us to achieve nearly the same accuracy with half the working precision (for example, single rather than double precision). We demonstrate that our method scales as well as the existing approach, and the use of lower precision leads to an overall reduction in running time when using 10s to 1000s of processors. Using the same working precision, we are also able to compute Tucker decompositions of the scientific datasets with much smaller approximation error. Wednesday 11:45am-12:45pm Room A Conference Paper 5A: Linear Algebra Algorithms Chair: Matthew Knepley (University at Buffalo, Rice University) Yuan Tang and Weiguo Gao (Fudan University) Abstract Frigo et al. proposed an ideal cache model and a recursive technique to design sequential cache-efficient algorithms in a cache-oblivious fashion. Ballard et al. pointed out that it is a fundamental open problem to extend the technique to an arbitrary architecture. Ballard et al. raised another open question on how to parallelize Strassen's algorithm exactly and efficiently on an arbitrary number of processors. Christoph Klein and Robert Strzodka (University of Heidelberg, ZITI) Abstract Partial pivoting is the method of choice to ensure stability in matrix factorizations performed on CPUs. For sparse matrices, this has not been implemented on GPUs so far because of problems with data-dependent execution flow. This work incorporates scaled partial pivoting into a tridiagonal GPU solver in such a fashion that despite the data-dependent decisions no SIMD divergence occurs. The cost of the computation is completely hidden behind the data movement which itself runs at maximum bandwidth. Therefore, the cost of the tridiagonal GPU solver is no more than the minimally required data movement. For large single-precision systems with $2^{25}$ unknowns, speedups of 5x are reported in comparison to the numerically stable tridiagonal solver (\texttt{gtsv2}) of cuSPARSE. The proposed tridiagonal solver is also evaluated as a preconditioner for Krylov solvers of large sparse linear equation systems. As expected, it performs best for problems with strong anisotropies. Viviana Arrigoni, Filippo Maggioli, Annalisa Massini, and Emanuele Rodolà (Sapienza, University of Rome) Abstract The multiplication of a matrix by its transpose, A^TA, appears as an intermediate operation in the solution of a wide set of problems.
In this paper, we propose a new cache-oblivious algorithm (ATA) for computing this product, based upon the classical Strassen algorithm as a sub-routine. In particular, we decrease the computational cost to 2/3 of the time required by Strassen's algorithm, amounting to $\frac{14}{3}n^{\log_2 7}$ floating point operations. ATA works for generic rectangular matrices, and exploits the peculiar symmetry of the resulting product matrix for saving memory. In addition, we provide an extensive implementation study of ATA in a shared memory system, and extend its applicability to a distributed environment. To support our findings, we compare our algorithm with state-of-the-art solutions specialized in the computation of A^TA, as well as with solutions for the computation of the general A^TB product applied to this problem. Our experiments highlight good scalability with respect to both the matrix size and the number of involved processes, as well as favorable performance for both the parallel paradigms and the sequential implementation when compared with other methods in the literature. Chenhao Xie (Pacific Northwest National Laboratory); Jieyang Chen (Oak Ridge National Laboratory); Jesun Firoz (Pacific Northwest National Laboratory); Jiajia Li (Pacific Northwest National Laboratory, William & Mary); Shuaiwen Leon Song (University of Sydney); and Kevin Barker, Mark Raugas, and Ang Li (Pacific Northwest National Laboratory) Abstract Designing efficient and scalable sparse linear algebra kernels on modern multi-GPU based HPC systems is a daunting task due to significant irregular memory references and workload imbalance across the GPUs. This is particularly the case for the \textit{Sparse Triangular Solver (SpTRSV)}, which introduces additional two-dimensional computation dependencies among subsequent computation steps. Dependency information is exchanged and shared among GPUs, thus warranting efficient memory allocation, data partitioning, and workload distribution, as well as fine-grained communication and synchronization support. In this work, we demonstrate that directly adopting unified memory can adversely affect the performance of SpTRSV on multi-GPU architectures, despite linking via fast interconnects like NVLink and NVSwitch. Alternatively, we employ the latest NVSHMEM technology based on the Partitioned Global Address Space programming model to enable efficient fine-grained communication and drastic synchronization overhead reduction. Furthermore, to handle workload imbalance, we propose a malleable task-pool execution model which can further enhance the utilization of GPUs. By applying these techniques, our experiments on the NVIDIA multi-GPU supernode V100-DGX-1 and DGX-2 systems demonstrate that our design can achieve on average 3.53x (up to 9.86x) speedup on a DGX-1 system and 3.66x (up to 9.64x) speedup on a DGX-2 system with 4 GPUs over the Unified-Memory design. The comprehensive sensitivity and scalability studies also show that the proposed zero-copy SpTRSV is able to fully utilize the computing and communication resources of the multi-GPU system. Wednesday 11:45am-12:45pm Room B Conference Paper 5B: Data Analytics Systems and Runtime Chair: Rong Ge (Clemson University) Yijie Shen, Jin Xiong, and Dejun Jiang (Institute of Computing Technology, CAS; University of Chinese Academy of Sciences) Abstract MapReduce-based SQL processing frameworks, such as Hive and Spark SQL, are widely used to support big data analytics.
Currently these systems mainly adopt the record-at-a-time execution model, which is less efficient in terms of CPU utilization. In contrast, vectorized execution is able to make better use of the CPU cache by bulk-processing a record batch at a time. However, simply applying vectorized execution to MapReduce-based frameworks results in an inefficient vectorized shuffle. Moreover, existing vectorized execution does not make full use of the CPU cache for complex operators (e.g., Sort and Aggregation). In this paper, we present VEE, a thoroughly vectorized execution engine designed for SQL query processing on Spark. First, VEE designs a compact in-memory data layout and serialization-aware assembling for vectorized shuffle to expedite shuffle execution, since they reduce shuffle data footprint and related computations. Second, VEE applies in-memory record batch rearrangement for Sort and Aggregation to greatly reduce random memory access and increase query performance. Third, VEE carefully designs operator-aware batch length when handling different operators, which makes better use of the CPU cache and increases query performance. We conduct extensive performance evaluations. The experimental results show that the performance speedup of VEE against Spark is up to 72.7% and 25.0% on average for OLAP workloads (TPC-H). The vectorized execution technologies in VEE are also applicable to other MapReduce-based data analytics frameworks to improve their query performance. Bowen Yu and Huanqi Cao (Tsinghua University), Tianyi Shan (University of California San Diego), Haojie Wang (Tsinghua University), Xiongchao Tang (Sangfor Technologies Inc. and Tsinghua Shenzhen International Graduate School), and Wenguang Chen (Tsinghua University) Abstract Machine learning applications on Spark suffer from poor scalability. In this paper, we reveal that the key reason is the non-scalable reduction, which is restricted by the non-splittable object programming interface in Spark. This insight guides us to propose Sparker, Spark with Efficient Reduction. By providing a split aggregation interface, Sparker is able to perform split aggregation with scalable reduction while being backward compatible with existing applications. We implemented Sparker in 2,534 lines of code. Sparker can improve the aggregation performance by up to 6.47× and can improve the end-to-end performance of MLlib model training by up to 3.69× with a geometric mean of 1.81×. Qianwen Ye, Wuji Liu, and Chase Q. Wu (New Jersey Institute of Technology) Abstract An increasing number of big data applications in various domains generate datasets continuously, which must be processed for various purposes in a timely manner. As one of the most popular streaming data processing systems, Spark Streaming applies a batch-based mechanism, which receives real-time input data streams and divides the data into multiple batches before passing them to the Spark processing engine. As such, inappropriate system configurations including batch interval and executor count may lead to unstable states, thereby undermining the capability and efficiency of real-time computing. Hence, finding suitable configurations is crucial to the performance of such systems. Many machine learning- and search-based algorithms have been proposed to provide configuration recommendations for streaming applications where input data streams are fed at a constant speed, which, however, is extremely rare in practice.
Most real-life streaming applications process data streams arriving at a time-varying rate and hence require real-time system monitoring and continuous configuration adjustment, which still remains largely unexplored. We propose a novel streaming optimization scheme based on Simultaneous Perturbation Stochastic Approximation (SPSA), referred to as NoStop, which dynamically tunes system configurations to improve real-time system performance with negligible overhead and proven convergence. The performance superiority of NoStop is illustrated by real-life experiments in comparison with Bayesian Optimization and Spark Back Pressure solutions. Extensive experimental results show that NoStop is able to keep track of the changing pattern of input data in real time and provide optimal configuration settings to achieve the best system performance. Md Muhib Khan and Weikuan Yu (Florida State University) Abstract Spark is popular for its ability to enable high-performance data analytics applications on diverse systems. Its great versatility is achieved through numerous user- and system-level options, resulting in an exponential configuration space that, ironically, hinders optimal data analytics performance. The colossal complexity is caused by two main issues: the high dimensionality of the configuration space and the expensive black-box configuration-performance relationship. In this paper, we design and develop a robust tuning framework called ROBOTune that can tackle both issues and tune Spark applications quickly for efficient data analytics. Specifically, it performs parameter selection through a Random Forests-based model to reduce the dimensionality of the analytics configuration space. In addition, ROBOTune employs Bayesian Optimization to overcome the complex nature of the configuration-performance relationship and balance exploration and exploitation to efficiently locate a globally optimal or near-optimal configuration. Furthermore, ROBOTune strengthens Latin Hypercube Sampling with caching and memoization to enhance the coverage and effectiveness in the generation of sample configurations. Our evaluation results demonstrate that ROBOTune finds similar or better-performing configurations than contemporary tuning tools like BestConfig and Gunther while improving on search cost by 1.59× and 1.53× on average and up to 2.27× and 1.71×, respectively. Wednesday 11:45am-12:45pm Room C Conference Paper 5C: Applications and Performance Chair: Valerie Taylor (Argonne National Laboratory (ANL)) Hanpei Wu (SIST, ShanghaiTech University, China); Tongliang Deng (SenseTime Research, China); Yanliang Zou (SIST, ShanghaiTech University, China); Shu Yin (SIST, ShanghaiTech University, China; State Key Lab of High Performance Computing); Si Chen (West Chester University, PA, USA); and Tao Xie (San Diego State University, CA, USA) Abstract Visual Molecular Dynamics (VMD) has been widely used by numerous molecular dynamics (MD) applications to animate and analyze the trajectory of an MD simulation. One challenge faced by domain scientists, however, is how to filter out inactive data (i.e., data irrelevant to the subject) from the enormous output of an MD simulation. To solve this, we propose ADA (application-conscious data acquirer), a light-weight file system middleware that can perform application-conscious data pre-processing. It provides host CPUs with only the data needed instead of an entire raw dataset.
Next, we implement an ADA prototype, which is then integrated into three computing platforms: an SSD server, a nine-node OrangeFS storage cluster, and a fat-node server with 1 TB memory. Further, we evaluate ADA by running a computational biology application on the three platforms. Our experimental results show that compared to a traditional file system, an ADA-assisted file system improves data processing turnaround time by up to 13.4x and reduces memory usage for data rendering by up to 2.5x. Besides, ADA allows the 1 TB memory server to render more than 2x as many VMD graphs while reducing energy consumption by 3x. Kun Qiu, Harry Chang, Yang Hong, Wenjun Zhu, Xiang Wang, and Baoqian Li (Intel APAC) Abstract Deep Packet Inspection (DPI), which is one of the most important network techniques, has been widely utilized in current networking systems. By comparing the payloads of traffic to an existing signature database, DPI can identify whether traffic or a packet is harmful, or which application it belongs to. The literal matching engine, which plays a key role in DPI, is the primary determinant of the system performance. FDR, an engine that can match a character with multiple literals in only 3 SIMD operations, has been developed. However, FDR has a significant performance drop-off when the signature database is composed of small-scale literal rule sets, which account for more than 90% of modern databases. In this paper, we have designed Teddy, an engine that is highly optimized for small-scale literal rule sets. Compared with FDR, Teddy significantly increases efficiency on small-scale literal rule sets through a novel algorithm that can match up to 64 characters in parallel with only 16 SIMD operations. Meanwhile, to evaluate Teddy in real-world DPI systems, we have implemented Teddy in Hyperscan on AVX512 platforms. The evaluation results show that Teddy achieves up to a 35.29x performance increase over Aho-Corasick (AC) and a 2.16x performance increase over FDR in most cases. Jianda Wang and Yang Hu (University of Texas at Dallas) Abstract Nowadays, the Radio Access Network (RAN) is resorting to the Network Function Virtualization (NFV) paradigm to enhance its architectural viability. However, our characterization of virtual RAN (vRAN) on modern processors depicts a frustrating picture of Single-Instruction Multiple-Data (SIMD) acceleration. The data arrangement processes in the vRAN software pipeline do not align data for efficient SIMD processing across the pipeline. Specifically, existing data arrangement processes cannot fully utilize the ALU ports in modern processors, which leads to a high backend bound and fails to saturate the memory bandwidth between registers and the L1 cache. To overcome this burden, we thoroughly examine the state-of-the-art CPU architecture and find that there are idle ports which could be utilized by the process. Motivated by this observation, we propose the "Arithmetic Ports Consciousness Mechanism" (APCM), utilizing these idle ports to eliminate the backend bound and saturate the memory bandwidth. The APCM decreases the data arrangement's backend bound from 45% to 3% and improves its memory bandwidth utilization by 4x-16x. The CPU time of the data arrangement process can be reduced by 67% - 92% and the overall latency of the vRAN packet transmission is decreased by 12% - 20%.
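The FDR and Teddy engines described in the DPI abstract above are multi-literal SIMD matchers. The much simpler classical shift-or (bitap) matcher below illustrates the kind of bit-parallel state update such engines build on; it handles a single pattern, uses no SIMD, and is included purely for orientation:

def shift_or_find(pattern: str, text: str):
    """Classical shift-or exact matching: bit i of the state is 0 iff
    pattern[0..i] currently matches the text suffix; a 0 in the top bit
    signals a full match ending at the current position."""
    m = len(pattern)
    if m == 0:
        return []
    all_ones = (1 << m) - 1
    mask = {}                                   # per-character mismatch masks
    for i, c in enumerate(pattern):
        mask[c] = mask.get(c, all_ones) & ~(1 << i)
    state, hits = all_ones, []
    for j, c in enumerate(text):
        state = ((state << 1) | mask.get(c, all_ones)) & all_ones
        if state & (1 << (m - 1)) == 0:         # pattern ends at position j
            hits.append(j - m + 1)
    return hits

print(shift_or_find("abab", "xababab"))   # [1, 3]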
Marquita Ellis (The University of California at Berkeley, Lawrence Berkeley National Lab); Aydin Buluc (Lawrence Berkeley National Lab, The University of California at Berkeley); and Katherine Yelick (The University of California at Berkeley, Lawrence Berkeley National Lab) Abstract This work examines a data-intensive irregular application from genomics that represents a type of Generalized N-Body problem, one of the “seven giants” of the NRC Big Data motifs. In this problem, computations (genome alignments) are performed on sparse data-dependent pairs of inputs, with variable-cost computation and variable datum sizes. Unlike simulation-based N-Body problems, there is no inherent locality in the pairwise interactions, and the interaction sparsity depends on particular parameters of the input, which can also affect the quality of the output. We build on a pre-existing bulk-synchronous implementation, using collective communication in MPI, and implement a new asynchronous one, using cross-node RPCs in UPC++. We establish the intranode comparability and efficiency of both, scaling from one to all cores on a node. Then we evaluate the multinode scalability from 1 node to 512 nodes (32,768 cores) of NERSC’s Cray XC40 with Intel Xeon Phi “Knights Landing” nodes. With real workloads, we examine the load balance of the irregular computation and communication, and the costs of many small asynchronous messages versus a few large aggregated messages, in both latency and overall application memory footprint. While both implementations demonstrate good scaling, the study reveals some of the programming and architectural challenges for scaling this type of data-intensive irregular application, and contributes code that can be used in genomics pipelines or in benchmarking for data analytics more broadly. Thursday 8:15am-9:00am Webinar Keynote Keynote: Manish Parashar, U. Utah, NSF, Harnessing Advanced Cyberinfrastructure for Urgent Science Chair: Sameer Shende (University of Oregon) Advanced Cyberinfrastructure (CI) plays an essential role in our ability to respond to national emergencies, from pandemics to extreme weather events (e.g., hurricanes and tornadoes), wildfires, cyber-attacks, and nuclear disasters. This was demonstrated over the past year, as CI has played a critical role in supporting the urgent science needed to address the global COVID-19 pandemic, including advancing our understanding of the SARS-CoV-2 virus structure, its host interactions, developing strategies to mitigate its spread, and supporting early-stage drug development. Specifically, the COVID-19 HPC Consortium brought together international stakeholders across academia, government, industry, and non-profits to provide CI resources, services, and expertise, in an agile and expedient way, to help address the many dimensions of this pandemic. In this talk, I will discuss the role of CI in supporting urgent science using our response to the COVID-19 pandemic and the COVID-19 HPC Consortium as an example. I will also explore how we can leverage these experiences to better prepare for future emergencies.
Manish is the Office Director of the Office of Advanced Cyberinfrastructure at NSF. He joins NSF as an IPA from the University of Utah, where he is the Director of the Scientific Computing and Imaging (SCI) Institute, Chair in Computational Science and Engineering, and Professor in the School of Computing. His research interests are in the broad areas of Parallel and Distributed Computing and Computational and Data-Enabled Science and Engineering. Manish is a Fellow of AAAS, ACM, and the IEEE/IEEE Computer Society. For more information, please visit http://manishparashar.org.
Thursday 9:15am-10:15am Room A Conference Paper 6A: Networking and Routing Chair: Kevin A. Brown (Argonne National Laboratory (ANL)) Yiran Zhang (Tsinghua University, Beijing National Research Center for Information Science and Technology (BNRist)); Kun Qian (Alibaba); and Fengyuan Ren (Tsinghua University, Beijing National Research Center for Information Science and Technology (BNRist)) Abstract InfiniBand (IB) has become one of the most popular high-speed interconnects in High Performance Computing (HPC). The backpressure effect of credit-based link-layer flow control in IB introduces congestion spreading, which increases queueing delay and hurts application completion time. IB congestion control (IB CC) has been defined in the IB specification to address the congestion spreading problem. Nowadays, HPC clusters are increasingly being used to run diverse workloads with shared network infrastructure. The coexistence of message transfers from different applications imposes great challenges on IB CC. In this paper, we re-examine IB CC through fine-grained experimental observations and reveal several fundamental problems. Inspired by our understanding and insights, we present a new receiver-driven congestion control for InfiniBand (RR CC). RR CC includes two key mechanisms: receiver-driven congestion identification and receiver-driven rate regulation, which together eliminate both in-network congestion and endpoint congestion in one control loop. RR CC has far fewer parameters and requires no modifications to InfiniBand switches. Evaluations show that RR CC achieves better average/tail message latency and link utilization than IB CC under various scenarios. Sen Liu and Xiang Lin (Fudan University), Zehua Guo (Beijing Institute of Technology), Yi Wang (Peng Cheng Laboratory), Mohamed Adel Serhani (United Arab Emirates University), and Yang Xu (Fudan University) Abstract The traffic of modern data centers exhibits a long-tail distribution, in which massive delay-sensitive short flows and a small number of bandwidth-hungry long flows co-exist. These two types of flows could share the same bottleneck links in data center networks but have different or even opposite network requirements. Existing solutions try to realize a trade-off between the requirements of different flows by either prioritizing short flows or limiting the buffer used by long flows at switches or end-hosts. However, they do not consider the dynamic traffic change and suffer from performance degradation, resulting from severe queueing delay and massive packet drops for short flows under the current First-In-First-Out (FIFO) queueing mechanism. In this paper, we propose a novel buffer management scheme at switches, called Cut-in Queue (CQ), to achieve both low latency for short flows and high throughput for long flows. Based on network status in real time, CQ prioritizes short flows by dynamically cutting the short flows' packets into the head of long flows or evicting some enqueued long flows' packets, and enables high throughput for long flows in most cases. Evaluations on both a DPDK testbed and NS2 simulations show that CQ outperforms state-of-the-art buffer management schemes by reducing flow completion time by up to 73%.
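A toy model of the "cut-in" behavior described above: short-flow packets are inserted ahead of queued long-flow packets once the queue grows past a threshold. The threshold, flow tagging, and drop policy here are invented for illustration; CQ itself additionally evicts long-flow packets and adapts to real-time network status:

from collections import deque

class CutInQueue:
    """Toy switch buffer in which short-flow packets cut in ahead of queued
    long-flow packets once the queue passes a threshold. Illustrative only."""

    def __init__(self, capacity=64, cutin_threshold=8):
        self.buf = deque()                      # items are (flow_type, packet)
        self.capacity = capacity
        self.cutin_threshold = cutin_threshold

    def enqueue(self, flow_type, packet):
        if len(self.buf) >= self.capacity:
            return False                        # tail drop when the buffer is full
        if flow_type == "short" and len(self.buf) >= self.cutin_threshold:
            # Cut in: place right behind any short-flow packets already queued.
            pos = 0
            while pos < len(self.buf) and self.buf[pos][0] == "short":
                pos += 1
            self.buf.insert(pos, (flow_type, packet))
        else:
            self.buf.append((flow_type, packet))
        return True

    def dequeue(self):
        return self.buf.popleft() if self.buf else None

q = CutInQueue(capacity=4, cutin_threshold=1)
q.enqueue("long", "L1"); q.enqueue("long", "L2"); q.enqueue("short", "S1")
print([p for _, p in q.buf])   # ['S1', 'L1', 'L2']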
9:45am: sRouting: Towards a Better Flow Size Estimation Performance through Routing and Sketch Configuration Yang Shi and Mei Wen (National University of Defense Technology) Abstract Flow size estimation is highly important and beneficial to various applications, including traffic engineering and anomaly detection. Considering the resource constraints and performance requirements in place, sketches are widely used to accomplish this task. However, when sketches are applied in real-life networks, the measurement performance is often decreased due to many practical reasons, including partial deployment of sketches, unique traffic characteristics, etc. In this paper, we present sRouting, a practical framework that aims to better utilize the deployed sketches. sRouting improves the measurement performance by optimizing which flows are monitored and where. We first investigate the relationship between the accuracy of a given sketch and the total number of packets, then formulate the offline problem in sRouting as an integer linear programming problem. To solve this problem efficiently, we devise a rounding-based algorithm and provide its performance guarantees. Furthermore, to handle dynamic changes in the network, we design an online adjustment algorithm capable of responding appropriately to these changes. Through experiments using real traces and topologies, we demonstrate that sRouting can significantly improve the volume of monitored traffic and reduce the measurement error without negatively impacting the network throughput. En Wang, Dongming Luan, and Yongjian Yang (Jilin University); Zihe Wang (Renmin University of China); Pengmin Dong (Jilin University); Dawei Li (Montclair State University); Wenbin Liu (Jilin University); and Jie Wu (Temple University) Abstract Vehicular CrowdSensing (VCS) has become a powerful sensing paradigm by selecting users driving vehicles to perform tasks. In most existing research, the platform centrally allocates tasks according to the collected user information. We argue that the information collection process results in user privacy leakage, and the centralized allocation leads to high computational complexity. We propose to apply a distributed task allocation method in the widely-used route navigation system. The navigation system recommends several routes to a user and each route may cover some tasks. Then, the user distributively selects a route according to the route profit (task reward minus route cost). Since the task reward is shared by the users who pass it, the route selections of users may influence each other. Hence, it remains unclear how to design a distributed route navigation approach to reach an equilibrium state (i.e., each user is satisfied with the selected route), while guaranteeing a good total profit. To this end, we formulate the problem as a multi-user potential game and propose a distributed route navigation algorithm. The simulation results verify that the proposed algorithm achieves a Nash equilibrium, while achieving total user profit close to that of the optimal solution.
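The equilibrium the route-navigation abstract above aims for can be illustrated with generic best-response dynamics in a congestion-style game where each task's reward is split among the users covering it. The data layout and cost model below are invented for illustration; this is not the paper's algorithm:

def best_response_routes(routes_per_user, task_reward, route_cost, max_rounds=100):
    """Toy best-response dynamics for distributed route selection.

    routes_per_user[u] lists candidate routes, each a frozenset of task ids;
    task_reward[t] is split evenly among all users whose chosen route covers t;
    route_cost[(u, i)] is user u's travel cost for candidate i. In a finite
    congestion-style potential game such dynamics settle into a pure Nash
    equilibrium, which is what the loop below looks for."""
    choice = {u: 0 for u in routes_per_user}            # everyone starts on route 0

    def coverage(exclude):
        cov = {}
        for u, idx in choice.items():
            if u == exclude:
                continue
            for t in routes_per_user[u][idx]:
                cov[t] = cov.get(t, 0) + 1
        return cov

    def profit(u, idx, cov):
        reward = sum(task_reward[t] / (cov.get(t, 0) + 1)
                     for t in routes_per_user[u][idx])
        return reward - route_cost[(u, idx)]

    for _ in range(max_rounds):
        changed = False
        for u in routes_per_user:
            cov = coverage(exclude=u)                   # tasks covered by the others
            best = max(range(len(routes_per_user[u])),
                       key=lambda idx: profit(u, idx, cov))
            if best != choice[u]:
                choice[u], changed = best, True
        if not changed:                                 # no one wants to deviate
            break
    return choice

# Two users, two tasks: sharing task "A"'s reward pushes one user onto "B".
routes = {"u1": [frozenset({"A"}), frozenset({"B"})],
          "u2": [frozenset({"A"}), frozenset({"B"})]}
rewards = {"A": 10.0, "B": 6.0}
costs = {("u1", 0): 2.0, ("u1", 1): 1.0, ("u2", 0): 2.0, ("u2", 1): 1.0}
print(best_response_routes(routes, rewards, costs))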
Thursday 9:15am-10:15am Room B Conference Paper 6B: Machine Learning and Acceleration Chair: Riyadh Baghdadi (New York University Abu Dhabi, Massachusetts Institute of Technology (MIT)) Zhenwei Zhang, Qiang Qi, and Ruitao Shang (East China Normal University); Li Chen (University of Louisiana at Lafayette); and Fei Xu (East China Normal University) Abstract Optimizing performance for Distributed Deep Neural Network (DDNN) training has recently become increasingly compelling, as DNN models become more complex and training datasets grow larger. While existing works on communication scheduling mostly focus on overlapping the computation and communication to improve DDNN training performance, the GPU and network resources are still under-utilized in DDNN training clusters. To tackle this issue, in this paper, we design and implement a predictable communication scheduling strategy named Prophet to schedule the gradient transfer in an adequate order, with the aim of maximizing the GPU and network resource utilization. Leveraging our observed stepwise pattern of gradient transfer start time, Prophet first uses the monitored network bandwidth and the profiled time interval among gradients to predict the appropriate number of gradients that can be grouped into blocks. Then, these gradient blocks can be transferred one by one to guarantee high utilization of GPU and network resources while ensuring the priority of gradient transfer (i.e., low-priority gradients cannot preempt high-priority gradients in the network transfer). Prophet can make the forward propagation start as early as possible so as to greedily reduce the waiting (idle) time of GPU resources during the DDNN training process. Prototype experiments with representative DNN models trained on Amazon EC2 demonstrate that Prophet can improve the DDNN training performance by up to 40% compared with the state-of-the-art priority-based communication scheduling strategies, yet with negligible runtime performance overhead. Hao Lan (University of Toronto), Li Chen (University of Louisiana at Lafayette), and Baochun Li (University of Toronto) Abstract With the ever-increasing size and complexity of deep neural network models, it is difficult to fit and train a complete copy of the model on a single computational device with limited capability. Therefore, large neural networks are usually trained on a mixture of devices, including multiple CPUs and GPUs, of which the computational speed and efficiency are drastically affected by how these models are partitioned and placed on the devices. In this paper, we propose Mars, a novel design to find efficient placements for large models. Mars leverages a self-supervised graph neural network pre-training framework to generate node representations for operations, which is able to capture the topological properties of the computational graph. Then, a sequence-to-sequence neural network is applied to split large models into small segments so that Mars can predict the placements sequentially. Novel optimizations have been applied in the placer design to achieve the best possible performance in terms of the time needed to complete training the agent for placing models with very large sizes. We deployed and evaluated Mars on benchmarks involving Inception-V3, GNMT, and BERT models. Extensive experimental results show that Mars can achieve up to 27.2% and 2.7% speedup of per-step training time over the state-of-the-art for GNMT and BERT models, respectively.
We also show that with self-supervised graph neural network pre-training, our design achieves the fastest speed in discovering the optimal placement for Inception-V3. Dongsheng Li, Dan Huang, Zhiguang Chen, and Yutong Lu (Sun Yat-sen University) Abstract Convolutional Neural Networks have gained great success in deep learning applications and have been accelerated by dedicated convolution algorithms. The Winograd-based algorithm can greatly reduce the number of arithmetic operations required in convolution. However, our experiments show that existing implementations in deep learning libraries cannot achieve the expected parallel performance on ARM manycore CPUs with last-level cache (LLC). Compared to multicore processors, ARM manycore CPUs have more cores and more NUMA nodes, and their parallel performance is more easily restricted by memory bandwidth, cache contention, NUMA configuration, etc. In this paper, we propose an optimized implementation of the single-precision Winograd-based algorithm on ARM manycore CPUs. Our algorithm adjusts the data layout according to the input shape and is optimized for the characteristics of the ARM processor, thus reducing the transformation overhead and achieving high arithmetic intensity. We redesign the parallel algorithm for Winograd-based convolution to achieve a more efficient implementation for manycore CPUs. The experimental results with 32 cores show that for modern ConvNets, our implementation achieves speedups ranging from 3x to 5x over the state-of-the-art Winograd-based convolution on ARM processors. When conducted on a set of convolutional benchmarks executing on a 128-core system with 4 NUMA nodes, the results show that our implementation also achieves better performance than existing implementations on ARM processors. 10am: Hippie: A Data-Paralleled Pipeline Approach to Improve Memory-Efficiency and Scalability for Large DNN Training Xiangyu Ye, Zhiquan Lai, Shengwei Li, Lei Cai, Ding Sun, Linbo Qiao, and Dongsheng Li (National University of Defense Technology) Abstract With the increase of both data and parameter volume, it has become a big challenge to efficiently train large-scale DNN models on distributed platforms. Ordinary parallelism modes, i.e., data parallelism, model parallelism and pipeline parallelism, can no longer satisfy the efficient scaling of large DNN model training on multiple nodes. Meanwhile, the problem of excessive memory consumption seriously restricts GPU computing efficiency and training throughput. In this paper, we propose Hippie, a hybrid parallel training framework that integrates pipeline parallelism and data parallelism to improve the memory efficiency and scalability of large DNN training. Hippie adopts a hybrid parallel method based on hiding gradient communication, which improves the throughput and scalability of training. Meanwhile, Hippie introduces the last-stage pipeline scheduling and recomputation for specific layers to effectively reduce the memory overhead and ease the difficulties of training large DNN models on memory-constrained devices. To achieve a more reasonable evaluation of the optimization effect, we propose an index of memory efficiency (ME) to represent the tradeoff between throughput and memory overhead. We implement Hippie based on PyTorch and NCCL. Experiments on various models show that Hippie achieves above 90% scaling efficiency on a 16-GPU platform. Moreover, Hippie increases throughput by up to 80% while saving 57% of memory overhead, achieving 4.18× memory efficiency.
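Both Prophet and Hippie in this session reason about when groups of gradients are communicated. A generic gradient-bucketing helper (standard practice in data-parallel training, not either paper's mechanism) shows the basic grouping step that they refine with bandwidth prediction, priorities, and pipeline scheduling:

def group_gradients(grad_sizes_bytes, target_block_bytes=25 * 2**20):
    """Group per-tensor gradients (listed in the order they become ready during
    backpropagation) into communication blocks of roughly target_block_bytes,
    so each transfer is large enough to use the network well yet can still
    overlap with the remaining backward computation. Generic bucketing only;
    the block-size target here is an arbitrary illustrative default."""
    blocks, current, current_bytes = [], [], 0
    for idx, size in enumerate(grad_sizes_bytes):
        current.append(idx)
        current_bytes += size
        if current_bytes >= target_block_bytes:
            blocks.append(current)
            current, current_bytes = [], 0
    if current:
        blocks.append(current)
    return blocks

# e.g. four 10 MiB gradients and one 40 MiB gradient, 25 MiB target blocks
sizes = [10 * 2**20] * 4 + [40 * 2**20]
print(group_gradients(sizes))   # [[0, 1, 2], [3, 4]]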
Thursday 9:15am-10:15am Room C Conference Paper 6C: Data Structures and Applications Chair: Wei Xue (Tsinghua University) Zhengming Yi, Yiping Yao, and Kai Chen (National University of Defense Technology) Abstract Universal constructions are attractive as they can turn a sequential implementation of any data structure into a concurrent implementation. However, existing universal constructions have limitations, such as imposing high copying overhead or exhibiting poor scalability on NUMA systems, mainly due to their lack of NUMA-aware design principles. To overcome these limitations, this paper introduces CR, a universal construction that provides highly scalable updates on NUMA systems while offering fast read-side performance. CR achieves NUMA-awareness by utilizing delegation within a NUMA node and a global shared log to maintain the consistency of replicas of data structures across nodes. Using CR does not require expertise in concurrent data structure design. Our evaluation shows that CR has up to 11.2 times better performance compared to CX, a state-of-the-art universal construction, on our tested sequential data structures. To demonstrate the effectiveness and applicability of CR, we have applied CR to an in-memory database system. The database shows up to 18.1 times better performance compared to the original version. Yizhi Huang (Hunan University, Zhejiang Lab); Yanlong Yin (Zhejiang Lab); Yan Liu (Hunan University); Shuibing He (Zhejiang University, Zhejiang Lab); and Yang Bai and Renfa Li (Hunan University) Abstract This paper presents a heterogeneous collaborative computing framework for SGD-based Matrix Factorization, named HCC-MF. HCC-MF can train the feature matrix efficiently using multiple CPUs and GPUs. It performs collaborative computing with data parallelism, where a server CPU is in charge of management and synchronization and other heterogeneous worker CPUs and worker GPUs perform calculations on their data assignments. HCC-MF adopts two data partition strategies, “data partition with heterogeneous load balance” and “data partition with hidden synchronization.” We build a time cost model to guide the data distribution among multiple workers and we design several communication optimization techniques with consideration of data sets’ and processors’ characteristics. Experimental results indicate that HCC-MF can utilize more than 88% of the platform’s computing power, yielding a speedup of 2.9x compared with an advanced SGD-based MF implementation, CuMF_SGD, on large-scale data sets. 9:45am: FedCav: Contribution-aware Model Aggregation on Distributed Heterogeneous Data in Federated Learning Hui Zeng, Tongqing Zhou, Yeting Guo, and Zhiping Cai (College of Computer, National University of Defense Technology) and Fang Liu (School of Design, Hunan University) Abstract The emerging federated learning (FL) paradigm allows multiple distributed devices to cooperatively train models in parallel with the raw data retained locally. The locally-computed parameters will be transferred to a centralized server for aggregation. However, the vanilla aggregation method ignores the heterogeneity of the distributed data, which may lead to slow convergence and low training efficiency. Yet, existing data scheduling and improved aggregation methods either incur privacy concerns or fail to consider the fine-grained heterogeneity. We propose FedCav, a contribution-aware model aggregation algorithm that differentiates the merit of local updates and explicitly favors the model-informed contributions.
The intuition is that the local data showing higher inference loss is likely to facilitate better performance improvement. To this end, we design a novel global loss function with explicit optimization preference on informative local updates, theoretically prove its convex property, and use it to regulate the gradient descent process iteratively. Additionally, we propose to identify abnormal updates with fake loss by auditing historical local training statistics. The results of extensive experiments demonstrate that FedCav needs ~34% fewer training rounds to converge and achieves ~2.4% better inference accuracy than the baselines (i.e., FedAvg and FedProx). We also observe that FedCav can actively mitigate model replacement attacks, with agile recovery capability for the aggregation. Haosen Wen, Wentao Cai, Mingzhe Du, Louis Jenkins, Benjamin Valpey, and Michael L. Scott (University of Rochester) Abstract The emergence of fast, dense, nonvolatile main memory suggests that certain long-lived data might remain in its natural pointer-rich format across program runs and hardware reboots. Operations on such data must currently be instrumented with explicit write-back and fence instructions to ensure consistency in the wake of a crash. Techniques to minimize the cost of this instrumentation are an active topic of research. Thursday 10:30am-11:30am Room A Conference Paper 7A: Performance Optimization Chair: Michael Gerndt (Technical University Munich) Daichi Mukunoki (RIKEN Center for Computational Science), Katsuhisa Ozaki (Shibaura Institute of Technology), Takeshi Ogita (Tokyo Woman's Christian University), and Toshiyuki Imamura (RIKEN Center for Computational Science) Abstract Although IEEE 754-2008 binary128 (with a 15-bit exponent and 113-bit significand, i.e., quadruple-precision) is not currently implemented on x86 in hardware, software emulation is available on some compilers. However, its performance is significantly lower than that of binary64 operations, which are supported natively in hardware. This study proposes a fast implementation of matrix multiplication on matrices stored in the binary128 format on x86 CPUs. The proposed implementation utilizes the Ozaki scheme, which is an accurate matrix multiplication algorithm proposed by Ozaki et al. in 2012. This scheme enables one to perform most computations using the binary64 matrix multiplication (the DGEMM routine in Basic Linear Algebra Subprograms (BLAS)); it can exploit the high performance of highly-optimized vendor BLAS. Although the achievable performance depends on the input matrices (the inner-product dimension, the absolute range, and the significand bit length), the proposed implementation can achieve better performance and accuracy compared to naive matrix multiplication performed using GCC's binary128 emulation in many cases. In addition, we discuss GPU acceleration, performance on reduced-precision inputs, an implementation based on binary32 matrix multiplication (SGEMM), application to memory-intensive operations, and the possibility of a distributed parallel implementation. 10:45am: Regu2D: Accelerating Vectorization of SpMV on Intel Processors through 2D-partitioning and Regular Arrangement Xiang Fei and Youhui Zhang (Tsinghua University) Abstract Sparse matrix-vector multiplication (SpMV) is an elementary kernel of many high-performance computing (HPC) applications, and it is often one of their performance bottlenecks.
Accelerating SpMV on vector processors usually faces several issues, including irregular data accesses, memory bandwidth limitation, and the short vector problem. Based on a detailed analysis of the effects and interactions of various technologies introduced by state-of-the-art studies (ALBUS, CVR, CSR5, SELL-C-σ, etc.), we propose Regu2D, a comprehensive solution to accelerate vectorization of SpMV through three methods: adaptive 2D-partitioning, the regular arrangement of matrix elements, and indices compression. Dynamic programming algorithms are used to optimize the first two methods. We conduct experiments on Intel Xeon processors (Skylake architecture) which support AVX-512 SIMD instructions and use sparse matrices from the University of Florida Sparse Matrix Collection. Experiments show that Regu2D achieves an average speedup of 1.69X, 1.93X, 1.40X, and 1.20X over ALBUS, CVR, CSR5, and SELL-C-σ for 30 scale-free sparse matrices, respectively. For 16 HPC sparse matrices, Regu2D achieves an average speedup of 1.34X, 1.89X, 1.34X, and 1.50X over them, respectively. Enda Yu and Dezun Dong (National University of Defense Technology); Yemao Xu (National University of Defense Technology, Information and Communication Engineering Design Institute); and Shuo Ouyang and Xiangke Liao (National University of Defense Technology) Abstract Communication overhead is the key challenge for distributed training. Gradient compression is a widely used approach to reduce communication traffic. When combined with a parallel communication mechanism such as pipelining, the gradient compression technique can greatly alleviate the impact of communication overhead. However, two problems of the gradient compression technique remain to be solved. Firstly, gradient compression brings in extra computation cost, which will delay the next training iteration. Secondly, gradient compression usually leads to a decrease in convergence accuracy. In this paper, we combine the parallel mechanism with gradient quantization and delayed full-gradient compensation, and propose a new distributed optimization method named CD-SGD, which can hide the overhead of gradient compression, overlap part of the communication and obtain high convergence accuracy. The local update operation in CD-SGD allows the next iteration to be launched quickly without waiting for the completion of gradient compression and the current communication process. Besides, the accuracy loss caused by gradient compression is solved by the k-step correction method introduced in CD-SGD. We prove that CD-SGD has a convergence guarantee and achieves at least an $O(1/\sqrt{K}+1/K)$ convergence rate. We conduct extensive experiments on MXNet to verify the convergence properties and scaling performance of CD-SGD. Experimental results on a 16-GPU cluster show that the convergence accuracy of CD-SGD is close to or even slightly better than that of S-SGD, and its end-to-end time is 30% less than that of 2-bit gradient compression under a 56 Gbps bandwidth environment. 11:15am: Recursion Brings Speedup to Out-of-Core TensorCore-based Linear Algebra Algorithms: A Case Study of Classic Gram-Schmidt QR Factorization Shaoshuai Zhang and Panruo Wu (University of Houston) Abstract Out-of-core processing aims to handle large amounts of data when the memory is limited. There exist several out-of-core applications, including disk-memory and CPU-GPU processing.
11:15am: Recursion Brings Speedup to Out-of-Core TensorCore-based Linear Algebra Algorithms: A Case Study of Classic Gram-Schmidt QR Factorization
Shaoshuai Zhang and Panruo Wu (University of Houston)
Abstract Out-of-core processing aims to handle large amounts of data when memory is limited. Out-of-core settings include disk-memory and CPU-GPU processing. Ideally, out-of-core applications can be expected to approach the peak performance of the in-core computation, provided that data movement between the memory hierarchies is effectively overlapped with in-core computation. However, with the emergence of matrix accelerators such as TensorCore GPUs, the imbalance between the speed of computation and that of data movement is further exacerbated, to the point that even kernels of high computational intensity can be dominated by data movement cost. In such cases, algorithms need to be redesigned to reduce communication volume and to overlap data movement through pipelining. In this paper, we select classic Gram-Schmidt QR factorization as an example to illustrate our recursive strategy, which incurs less data movement and achieves a higher overlap ratio than the conventional blocked QR factorization algorithm. The results suggest that this technique can potentially be applied to a broader range of matrix computation kernels.
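To make the recursive structure concrete, here is a minimal in-core sketch of recursive block classical Gram-Schmidt QR in Python; numpy.linalg.qr stands in for the panel factorization (the paper targets TensorCore kernels), and the out-of-core staging and pipelining that the paper is actually about are not shown:

```python
import numpy as np

def recursive_cgs_qr(A, block=64):
    """Recursive block classical Gram-Schmidt: A = Q @ R with orthonormal Q."""
    m, n = A.shape
    if n <= block:
        return np.linalg.qr(A)              # base-case panel factorization
    half = n // 2
    Q1, R11 = recursive_cgs_qr(A[:, :half], block)
    R12 = Q1.T @ A[:, half:]                # projection: one large GEMM
    A2 = A[:, half:] - Q1 @ R12             # classical Gram-Schmidt update: another GEMM
    Q2, R22 = recursive_cgs_qr(A2, block)
    R = np.block([[R11, R12],
                  [np.zeros((n - half, half)), R22]])
    return np.hstack([Q1, Q2]), R

# Quick sanity check on a random tall matrix.
A = np.random.rand(500, 200)
Q, R = recursive_cgs_qr(A)
assert np.allclose(Q @ R, A)
```

A single classical Gram-Schmidt pass can lose orthogonality on ill-conditioned inputs; the point of the sketch is only that recursion turns most of the work into large GEMMs (R12 and the trailing update), which is what makes overlapping data movement effective in the out-of-core setting.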
Thursday 10:30am-11:30am Room B Conference Paper 7B: Machine Learning Algorithms Chair: Leonid Oliker (Lawrence Berkeley National Laboratory)
10:30am: Dubhe: Towards Data Unbiasedness with Homomorphic Encryption in Federated Learning Client Selection
Shulai Zhang, Zirui Li, Quan Chen, Wenli Zheng, Jingwen Leng, and Minyi Guo (Shanghai Jiao Tong University)
Abstract Federated learning (FL) is a distributed machine learning paradigm that allows clients to collaboratively train a model over their own local data. FL promises the privacy of clients, and its security can be strengthened by cryptographic methods such as additively homomorphic encryption (HE). However, the efficiency of FL can seriously suffer from statistical heterogeneity, both in the data distribution discrepancy among clients and in the skewness of the global distribution. We mathematically demonstrate the cause of performance degradation in FL and examine the performance of FL over various datasets. To tackle the statistical heterogeneity problem, we propose a pluggable system-level client selection method named Dubhe, which allows clients to proactively participate in training while preserving their privacy with the assistance of HE. Experimental results show that Dubhe is comparable with the optimal greedy method in classification accuracy, with negligible encryption and communication overhead.
Junhong Liu, Dongxu Yang, and Junjie Lai (NVIDIA)
Abstract Convolution is one of the primary time-consuming parts of convolutional neural networks (CNNs). State-of-the-art convolutional neural networks use small, 3 × 3 filters. Recent work on Winograd convolution can substantially reduce the computational complexity, making convolution fast. However, existing implementations of Winograd convolution are limited to small tiles, i.e., F(4 × 4, 3 × 3) and F(2 × 2, 3 × 3), and to single-precision data. In this paper, we propose an optimized mixed-precision F(6 × 6, 3 × 3) Winograd convolution implementation on NVIDIA Ampere GPUs using Tensor Cores. Our experiments show that the accuracy of mixed-precision F(6 × 6, 3 × 3) Winograd convolution is sufficient for inference with convolutional neural networks. Moreover, our method achieves up to 15.71x and 2.41x speedup on NVIDIA Ampere A100 compared with the state-of-the-art Winograd-based convolution and GEMM-based convolution in cuDNN 8.1.0, respectively. Furthermore, we integrate our F(6 × 6, 3 × 3) Winograd convolution implementation into NVIDIA TensorRT, a C++ inference library for GPUs provided by NVIDIA, as custom layer plugins, and we build the whole VGG network model using our custom Winograd convolution layers together with other layers supported by TensorRT. The experiments show that the accuracy of the whole VGG network using our F(6 × 6, 3 × 3) Winograd convolution is 71.24%, while the accuracy of the same network using FP32 computation is 71.22%.
Guangli Li (Institute of Computing Technology, Chinese Academy of Sciences; University of Chinese Academy of Sciences); Zhen Jia (Amazon); Xiaobing Feng (Institute of Computing Technology, Chinese Academy of Sciences); and Yida Wang (Amazon)
Abstract Low-precision computation, which is widely supported in contemporary hardware, is considered one of the most effective methods to accelerate convolutional neural networks. However, low-precision computation is not widely used to speed up Winograd, an algorithm for fast convolution computation, due to the numerical error introduced by combining the Winograd transformation with quantization. In this paper, we propose a low-precision Winograd convolution approach, LoWino, based on post-training quantization, which employs a linear quantization method in the Winograd domain to reduce the precision loss caused by the transformations. Moreover, we present an efficient implementation that integrates well-designed optimization techniques, thereby adequately exploiting the capability of low-precision computation on modern CPUs. We evaluate our approach on Intel Xeon Scalable Processors using representative convolutional layers from prevailing deep neural networks. Experimental results show that LoWino achieves up to 2.04$\times$ speedup over state-of-the-art implementations in the vendor library while maintaining accuracy at a reasonable level.
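Both of the two preceding entries build on the Winograd algorithm, which computes a small output tile as $Y = A^T\,[(G g G^T) \odot (B^T d B)]\,A$, replacing most multiplications with an elementwise product in a transformed domain. A minimal sketch for the classic F(2 × 2, 3 × 3) tile follows (these transform matrices are the standard published ones for this tile size, not the F(6 × 6, 3 × 3) variants used in the papers):

```python
import numpy as np

# Standard Winograd F(2x2, 3x3) transform matrices.
B_T = np.array([[1,  0, -1,  0],
                [0,  1,  1,  0],
                [0, -1,  1,  0],
                [0,  1,  0, -1]], dtype=np.float64)
G = np.array([[1.0,  0.0, 0.0],
              [0.5,  0.5, 0.5],
              [0.5, -0.5, 0.5],
              [0.0,  0.0, 1.0]])
A_T = np.array([[1, 1,  1,  0],
                [0, 1, -1, -1]], dtype=np.float64)

def winograd_f2x2_3x3(d, g):
    """One 4x4 input tile d and one 3x3 filter g -> one 2x2 output tile."""
    U = G @ g @ G.T                # filter transform
    V = B_T @ d @ B_T.T            # input transform
    return A_T @ (U * V) @ A_T.T   # elementwise product + output transform

# Check against direct (valid) correlation on the same tile.
d = np.random.rand(4, 4)
g = np.random.rand(3, 3)
direct = np.array([[np.sum(d[i:i + 3, j:j + 3] * g) for j in range(2)]
                   for i in range(2)])
assert np.allclose(winograd_f2x2_3x3(d, g), direct)
```

For larger tiles such as F(6 × 6, 3 × 3), the transforms grow to 8 × 8 and their entries span a much wider dynamic range, which is precisely what makes mixed- and low-precision variants delicate and motivates both papers above.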
Liang Gao (National University of Defense Technology); Li Li (ShenZhen Institutes of Advanced Technology, Chinese Academy of Sciences); Yingwen Chen (National University of Defense Technology); Wenli Zheng (Emerging Parallel Computing Center, Shanghai Jiao Tong University); ChengZhong Xu (University of Macau); and Ming Xu (National University of Defense Technology)
Abstract Federated learning is a novel machine learning framework that enables multiple devices to collaboratively train high-performance models while preserving data privacy. Federated learning is a kind of crowdsourced computation, in which a task publisher shares profit with workers in order to utilize their data and computing resources. Intuitively, devices have no incentive to participate in federated learning without rewards that match their expended resources. In addition, guarding against malicious workers is essential, because they may upload meaningless updates to obtain undeserved rewards or to damage the global model. To solve these problems effectively, we propose FIFL, a fair incentive mechanism for federated learning. FIFL rewards workers fairly to attract reliable and efficient ones, while punishing and eliminating malicious ones based on a dynamic real-time worker assessment mechanism. We evaluate the effectiveness of FIFL through theoretical analysis and comprehensive experiments. The evaluation results show that FIFL distributes rewards fairly according to workers' behaviour and quality. FIFL increases the system revenue by $0.2\%$ to $3.4\%$ in reliable federations compared with the baselines. In an unreliable scenario containing attackers that degrade the model's performance, the system revenue of FIFL outperforms the baselines by more than $46.7\%$.
Thursday 10:30am-11:30am Room C Conference Paper 7C: Virtualization and Stream Processing Chair: Ali Jannesari (Iowa State University)
Lulu Yao, Yongkun Li, and Jiawei Li (University of Science and Technology of China); Weijie Wu (Independent Researcher); and Yinlong Xu (University of Science and Technology of China)
Abstract In virtualized systems running multiple virtual machines, applications are often allocated more memory than they actually use in order to cope with peak memory demand under workload dynamics. Reclaiming inactive memory from virtual machines is therefore an effective way to enable memory overcommitment and reduce cost. However, existing schemes are based on one-shot adjustment, which may reclaim a large amount of memory in a single operation, and they are unaware of both memory access dynamics and the memory sensitivity of different applications, so they often cause excessive reclamation and severe performance slowdown. To address this issue, we propose PMA, a progressive memory adjustment scheme that takes both memory access dynamics and memory sensitivity into consideration and leverages virtual machine performance feedback to progressively reclaim inactive memory while avoiding performance slowdown. Moreover, PMA is designed on top of ballooning (i.e., the balloon driver), so it preserves the isolation between the host and virtual machines in full virtualization systems. We implement a prototype in host user space, and experiments show that PMA effectively limits the performance drop under memory overcommitment (e.g., the performance drop of every virtual machine stays within 10% for up to 33% memory overcommitment), which is close to the optimal case of having enough memory; PMA thus enables memory overcommitment with performance guarantees in full virtualization systems.
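PMA's central loop, reclaiming guest memory a small step at a time and backing off when the VM's performance feedback degrades, can be sketched as follows. The balloon-control and metric-collection callables are hypothetical placeholders, not the paper's prototype:

```python
import time

STEP_MB = 64             # reclaim granularity per round (illustrative)
MAX_SLOWDOWN = 0.10      # tolerate at most a 10% performance drop

def progressive_reclaim(vm, target_mb, balloon_inflate, balloon_deflate, perf_metric):
    """Progressively inflate the balloon of `vm` until `target_mb` is reclaimed,
    rolling back a step whenever performance feedback degrades too much."""
    baseline = perf_metric(vm)
    reclaimed = 0
    while reclaimed < target_mb:
        balloon_inflate(vm, STEP_MB)      # reclaim a small slice of guest memory
        time.sleep(1.0)                   # let the guest settle, then observe
        slowdown = 1.0 - perf_metric(vm) / baseline
        if slowdown > MAX_SLOWDOWN:
            balloon_deflate(vm, STEP_MB)  # give the last slice back and stop
            break
        reclaimed += STEP_MB
    return reclaimed
```

The real system additionally weighs memory access dynamics and per-application memory sensitivity when choosing step sizes and targets; none of that policy is modeled here.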
10:45am: Best VM Selection for Big Data Applications across Multiple Frameworks by Transfer Learning
Yuewen Wu and Heng Wu (Institute of Software, Chinese Academy of Sciences) and Yuanjia Xu, Yi Hu, Wenbo Zhang, Hua Zhong, and Tao Huang (Institute of Software, Chinese Academy of Sciences; University of Chinese Academy of Sciences)
Abstract Cloud providers offer a bewildering choice of VM types for a range of contemporary data processing frameworks today. However, existing performance modeling and machine learning efforts cannot pick optimal VM types for multiple frameworks simultaneously, since it is difficult for them to balance model accuracy against model training cost. We propose Vesta, a novel transfer learning approach, to address this challenge: (1) it abstracts knowledge of VM type selection through offline benchmarking on multiple frameworks; (2) it employs a two-layer bipartite graph to represent that knowledge across frameworks; (3) it minimizes training overhead by reusing the knowledge to select the best VM type for a given application. Compared with state-of-the-art efforts, our experiments on 30 applications of Hadoop, Hive, and Spark show that Vesta improves application performance by up to 51% while reducing training overhead by 85%.
Huiyao Mei, Hanhua Chen, Hai Jin, and Qiangsheng Hua (Huazhong University of Science and Technology) and Bing Bing Zhou (The University of Sydney)
Abstract Complete Event Trend (CET) detection over large-scale event streams is important and challenging in various applications such as financial services, real-time business analysis, and supply chain management. The potentially large number of partial intermediate results produced during complex event matching can incur prohibitively high memory cost for the processing system. The state-of-the-art scheme leverages compact graph encoding, which represents the common sub-sequences of different complex events as a common sub-graph, to achieve space efficiency when storing intermediate results. However, we show that such a design incurs unacceptable computation cost for the graph traversal needed whenever a new event arrives. To address this problem, we propose a novel attribute-based indexing (ABI) graph model to represent the relationships between events. By classifying the predicates and constructing the graph based on both the comparators in the predicates and the attribute values of the events, we achieve parallel event stream processing and efficient graph construction. Our design significantly reduces the total computation cost of graph construction from $O(n^2)$ to $O(n \log m)$, where $n$ is the number of events and $m$ is the number of attribute vertices. We further design several efficient traversal-based algorithms to extract CETs from the graph. We implement our design and conduct comprehensive experiments to evaluate its performance. The results show that our design outperforms state-of-the-art schemes by up to a couple of orders of magnitude.
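The drop from $O(n^2)$ to $O(n \log m)$ comes from indexing attribute values rather than linking each new event against all earlier events: locating the attribute vertices that satisfy a comparison predicate then becomes a logarithmic search. A minimal sketch of that indexing idea follows (a sorted list plus bisect; the event and predicate model is deliberately simplified and is not the paper's ABI graph):

```python
import bisect
from collections import defaultdict

class AttributeIndex:
    """One vertex per distinct attribute value, kept in sorted order, so a
    predicate such as 'attr < new_value' maps a new event to its predecessor
    attribute vertices via binary search instead of a scan over all events."""

    def __init__(self):
        self.values = []                      # sorted distinct attribute values
        self.events_at = defaultdict(list)    # value -> events carrying that value

    def insert(self, event_id, value):
        pos = bisect.bisect_left(self.values, value)
        if pos == len(self.values) or self.values[pos] != value:
            self.values.insert(pos, value)    # create a new attribute vertex
        self.events_at[value].append(event_id)

    def predecessors_less_than(self, value):
        """Attribute vertices whose value satisfies 'attr < value'."""
        pos = bisect.bisect_left(self.values, value)
        return self.values[:pos]              # link to vertices, not to every event
```

A production version would keep the vertices in a balanced search tree so insertion is also logarithmic; the list-plus-bisect form just keeps the sketch short.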
Stijn Schildermans and Kris Aerts (KU Leuven), Jianchen Shan (Hofstra University), and Xiaoning Ding (New Jersey Institute of Technology)
Abstract To this day, efficient timer management is a major challenge in virtualized environments. Contemporary timekeeping techniques in guest kernels frequently interact with timer hardware, which requires continual and costly hypervisor interference.
Monday 8:00am-9:45am Room A DUAC Workshop 1A: International Workshop on Deployment and Use of Accelerators (DUAC) - Presentations Chair: Carlos Reaño (Queen's University Belfast)
8:30am: Explaining the Classification Performance of Supervised and Semi-Supervised Methods for Automated Sparse Matrix Format Selection
Monday 8:00am-10:15am Room B EMS Workshop 1B: International Workshop on Embedded Multicore Systems (ICPP-EMS) - Presentations Chair: YungChia Lin (MediaTek)
8:15am: ArchViMP – a Framework for Automatic Extraction of Concurrency-related Software Architectural Properties
9:30am: Accelerate Binarized Neural Networks with Processing-in-Memory Enabled by RISC-V Custom Instructions
Monday 8:00am-10:15am Room C P2S2 Workshop 1C: International Workshop on Parallel Programming Models and Systems Software for High-End Computing (P2S2) - Presentations Chair: John Leidel (Tactical Computing Laboratories LLC, Texas Tech University)
Monday 10:15am-11:30am Room A AWASN Workshop 2A: International Workshop on Applications of Wireless Ad hoc and Sensor Networks (AWASN) - Presentations Chair: Kazuya Sakai (Tokyo Metropolitan University)
10:45am: Automated Arrhythmia Detection using Hilbert-Huang Transform based Convolutional Neural Network
Monday 10:15am-12:00pm Room B LLPP Workshop 2B: Workshop on LLVM in Parallel Processing (LLPP) - Presentations Chair: Johannes Doerfert (Argonne National Laboratory (ANL))
10:15am: Welcome Remarks
Monday 10:15am-12:00pm Room C PDADS Workshop 2C: International Workshop on Parallel and Distributed Algorithms for Decision Sciences (PDADS) - Presentations Chair: Sudip Seal (Oak Ridge National Laboratory)
10:30am: Domain Decomposition Preconditioners for Unstructured Network Problems in Parallel Vector Architectures
Tuesday 8:15am-9:00am Webinar Keynote: Rick L. Stevens, DOE-ANL, Exascale and then what?: The next decade for HPC and AI Chair: Xian-He Sun (Illinois Institute of Technology)
We are on the verge of deploying Exascale systems, which, remarkably, in the US are quite similar in architecture and programming model. But what comes after these initial exascale systems? Are we going to see AI accelerators break out into the mainstream? Are hybrid HPC/AI surrogate models the future of scientific computing, and if so, what are the architecture and software implications? Is a hardware disaggregation model viable for HPC applications? Are quantum and neuromorphic systems likely to be on the major vendors' roadmaps for the next decade? Heterogeneous workflows connecting the edge to the core are becoming more important, yet our software stacks are unprepared and do not treat workflows as first-class applications. In this talk I will try to make sense of these trends and future directions and outline some important research problems for the next decade.
Rick Stevens is the Associate Laboratory Director of the Computing, Environment and Life Sciences Directorate at Argonne National Laboratory and a Professor of Computer Science at the University of Chicago, with significant responsibility for delivering on the U.S. national initiative for Exascale computing and for developing the DOE initiative in Artificial Intelligence (AI) for Science. His research spans the computational and computer sciences, from high-performance computing to the building of innovative tools and techniques for biological science and infectious disease research, as well as approaches to advancing deep learning to accelerate cancer and COVID-19 research. He also specializes in collaborative visualization technology and grid computing. At Argonne, he leads the Laboratory’s AI for Science initiative and currently focuses on high-performance computing systems, including collaborating with Intel and Cray to launch Argonne’s first exascale computer, Aurora, as well as the National AI Accelerator Testbed, which brings together leading AI scientists to provide an open and unbiased environment for the evaluation of emerging AI accelerator technologies designed to accelerate training and inference for deep learning models.
Prof. Stevens is a member of the American Association for the Advancement of Science and has received many national honors for his research, including an R&D 100 award and, most recently, being named a Fellow of the Association for Computing Machinery (ACM) for his continuing contributions to high-performance computing.
Tuesday 9:15am-10:15am Room A Conference Paper 1A: Memory Systems and NVM Chair: Xi Wang (RIOS Laboratory)
Tuesday 9:15am-10:15am Room B Conference Paper 1B: GPU Computing and Task-based Programming Models Chair: Tobias Weinzierl (Durham University)
Tuesday 10:30am-11:30am Room A Conference Paper 2A: Storage Systems and Parallel I/O Chair: Suren Byna (Lawrence Berkeley National Laboratory)
Tuesday 10:30am-11:30am Room B Conference Paper 2B: Scheduling Algorithms and Optimizations Chair: Antonino Tumeo (Pacific Northwest National Laboratory (PNNL))
Tuesday 11:45am-12:45pm Room A Conference Paper 3A: Performance Modeling and Evaluation Chair: Doru Thom Popovici (Lawrence Berkeley National Laboratory)
Tuesday 11:45am-12:45pm Room B Conference Paper 3B: Parallelization and Code Generation Chair: Barbara Chapman (Stony Brook University)
11:45am: Automatic Code Generation and Optimization of Large-scale Stencil Computation on Many-core Processors
Tuesday 11:45am-12:45pm Room C Conference Paper 3C: Applications with Machine Learning Chair: Xingfu Wu (Argonne National Laboratory, University of Chicago)
11:45am: CNN+LSTM Accelerated Turbulent Flow Simulation with Link-Wise Artificial Compressibility Method
12pm: ComputeCOVID19+: Accelerating COVID-19 Diagnosis and Monitoring via High-Performance Deep Learning
Tuesday 7:00pm-8:00pm Webinar ICPP Bonfire Gathering: Remember the Old Tradition Virtually! Chair: Wu Feng (Virginia Tech)
ICPP started in 1972 as an intimate, workshop-like gathering of intellectuals in parallel processing. As ICPP rapidly grew in stature and size, the ICPP bonfire joke-telling party came into being around 1976 to recapture that original informal workshop-like setting, and it did so for more than a decade. This year’s virtual ICPP bonfire will reminisce about the history behind the bonfires, their social and professional impact on the parallel computing community, and the jokes that were told, all from the perspective of long-time ICPP bonfire attendees from the 1970s and 1980s, who are noted below.
HJ Siegel (moderator),
Bruce Berra,
Tom Casavant,
Hank Dietz,
Doug DeGroot,
Duncan Lawrie,
Diane Smith
Wednesday 9:15am-10:15am Room A Panel: Celebrating 50 Years of ICPP Chair: Rudolf Eigenmann (University of Delaware)
Panelists: Keshav Pingali (University of Texas at Austin; Moderator), Susanne Hambrusch (Purdue University), David Padua (University of Illinois at Urbana-Champaign), HJ Siegel (Colorado State University), Guang Gao (University of Delaware), Lionel Ni (Hong Kong University of Science and Technology), and Ahmed Sameh (Purdue University).
Wednesday 10:30am-11:30am Room A Conference Paper 4A: Graph Computing Chair: Erika Parsons (University of Washington, Bothell)
10:30am: Ascetic: Enhancing Cross-Iterations Data Efficiency in Out-of-Memory Graph Processing on GPUs
Wednesday 10:30am-11:30am Room B Conference Paper 4B: Storage Software and Optimizations Chair: Anthony Kougkas (Illinois Institute of Technology)
Wednesday 11:45am-12:45pm Room A Conference Paper 5A: Linear Algebra Algorithms Chair: Matthew Knepley (University at Buffalo, Rice University)
Wednesday 11:45am-12:45pm Room B Conference Paper 5B: Data Analytics Systems and Runtime Chair: Rong Ge (Clemson University)
Thursday 8:15am-9:00am Webinar Keynote: Manish Parashar, U. Utah, NSF, Harnessing Advanced Cyberinfrastructure for Urgent Science Chair: Sameer Shende (University of Oregon)
Advanced Cyberinfrastructure (CI) plays an essential role in our ability to respond to national emergencies, from pandemics to extreme weather events (e.g., hurricanes and tornadoes), wildfires, cyber-attacks, and nuclear disasters. This was demonstrated over the past year, as CI has played a critical role in supporting the urgent science needed to address the global COVID-19 pandemic, including advancing our understanding of the SARS-CoV-2 virus structure and its host interactions, developing strategies to mitigate its spread, and supporting early-stage drug development. Specifically, the COVID-19 HPC Consortium brought together international stakeholders across academia, government, industry, and non-profits to provide CI resources, services, and expertise, in an agile and expedient way, to help address the many dimensions of this pandemic. In this talk, I will discuss the role of CI in supporting urgent science, using our response to the COVID-19 pandemic and the COVID-19 HPC Consortium as examples. I will also explore how we can leverage these experiences to better prepare for future emergencies.
Manish is Office Director of the Office of Advanced Cyberinfrastructure at NSF. He joins NSF as an IPA from the University of Utah, where he is the Director of the Scientific Computing and Imaging (SCI) Institute, Chair in Computational Science and Engineering, and Professor in the School of Computing. His research interests are in the broad areas of Parallel and Distributed Computing and Computational and Data-Enabled Science and Engineering. Manish is a Fellow of AAAS, ACM, and IEEE/IEEE Computer Society. For more information, please visit http://manishparashar.org.
Thursday 9:15am-10:15am Room A Conference Paper 6A: Networking and Routing Chair: Kevin A. Brown (Argonne National Laboratory (ANL))
Thursday 9:15am-10:15am Room B Conference Paper 6B: Machine Learning and Acceleration Chair: Riyadh Baghdadi (New York University Abu Dhabi, Massachusetts Institute of Technology (MIT))
Thursday 10:30am-11:30am Room A Conference Paper 7A: Performance Optimization Chair: Michael Gerndt (Technical University Munich)
10:45am: Regu2D: Accelerating Vectorization of SpMV on Intel Processors through 2D-partitioning and Regular Arrangement
Thursday 10:30am-11:30am Room B Conference Paper 7B: Machine Learning Algorithms Chair: Leonid Oliker (Lawrence Berkeley National Laboratory)