Tuesday 9:00am-10:00am Straub Hall, Room 156 Keynote Keynote Chair: Mark Robins (Intel) AI and HPC: Challenges and Opportunities Mark Robins (Intel, Sr. Director / Head of the AI Strategy Office at Intel) Biography Biography Mark Robins (Intel, Sr. Director / Head of the AI Strategy Office at Intel) Mark Robins was recently named Sr. Director / Head of the AI Strategy Office at Intel, responsible for establishing and driving the AI strategy for the company. Previously, Mark served as Sr. Director / Head of AI Products at Intel, responsible for product management and planning for Intel’s hardware and software AI products. Mark served as VP Products for Nervana prior to the acquisition by Intel in July 2016. Prior to Nervana, Mark served as VP Products for Influitive and, before that, as VP Products for Chegg through their IPO in 2013. Before Chegg, Mark was co-founder/CEO of Grouply, a social networking startup funded by Reid Hoffman and O’Reilly Alphatech Ventures that was acquired in 2010. Before Grouply, Mark was Sr. Director of Product Management at Siebel Systems through its acquisition by Oracle in 2006. Mark started his career as a satellite communications systems engineer for Hughes Aircraft Company (now Boeing). Mark earned a BS and MS in electrical engineering from Cornell and Caltech, respectively, where he also studied neural networks. Mark holds an MBA from Harvard Business School. Abstract Abstract After several false starts in decades past, AI is now practical and poised to change our lives. We experience AI in action today when Facebook automatically identifies our friends in the photos we post or when Alexa accurately responds to our request to play a song or make a purchase. And AI-powered autonomous driving is right around the corner, as 37% of miles driven will be done by shared and autonomous vehicles by 2030, according to the 2017 PwC Strategy & Digital Auto Report. AI will also transform virtually all industries, including healthcare (e.g., enhanced diagnostics, drug discovery), financial services (e.g., algorithmic trading, fraud detection), and energy (e.g., oil exploration, smart grids). Tuesday 10:00am-10:30am Straub Hall, Room 156 Paper Best Paper Session Chair: Michele Weiland (University of Edinburgh) ImageNet Training in Minutes Yang You (UC Berkeley), Zhao Zhang (TACC), Cho-Jui Hsieh (UC Davis), and James Demmel and Kurt Keutzer (UC Berkeley) Abstract Abstract In this paper, we investigate large scale computers' capability of speeding up deep neural networks (DNN) training. Our approach is to use large batch size, powered by the Layer-wise Adaptive Rate Scaling (LARS) algorithm, for efficient usage of massive computing resources. Our approach is generic, as we empirically evaluate the effectiveness on two neural networks: AlexNet and ResNet-50 trained with the ImageNet-1k dataset while preserving the state-of-the-art test accuracy. Compared to the baseline of a previous study from a group of researchers at Facebook, our approach shows higher test accuracy on batch sizes that are larger than 16K. Using 2,048 Intel Xeon Platinum 8160 processors, we reduce the 100-epoch AlexNet training time from hours to 11 minutes. With 2,048 Intel Xeon Phi 7250 Processors, we reduce the 90-epoch ResNet-50 training time from hours to 20 minutes. Our implementation is open source and has been released in the Intel distribution of Caffe v1.0.7. Tuesday 11:00am-12:30pm Straub Hall, Room 156 Paper Monitoring and Network Analysis Chair: Martin Schulz (Technical University Munich) Ramin Izadpanah (University of Central Florida), Nichamon Naksinehaboon (Open Grid Computing), Jim Brandt and Ann Gentile (Sandia National Laboratories), and Damian Dechev (University of Central Florida) Abstract Abstract The growth of High Performance Computer (HPC) systems increases the complexity with respect to understanding resource utilization, system management, and performance issues. While raw performance data is increasingly exposed at the component level, the usefulness of the data is dependent on the ability to do meaningful analysis on actionable timescales. However, current system monitoring infrastructures largely focus on data collection, with analysis performed off-system in post-processing mode. This increases the time required to provide analysis and feedback to a variety of consumers. Arya Mazaheri and Felix Wolf (Technische Universitaet Darmstadt) and Ali Jannesari (Iowa State University) Abstract Abstract A critical factor for developing robust shared-memory applications is the efficient use of the cache and the communication between threads. Inappropriate data structures, algorithm design, and inefficient thread affinity may result in superfluous communication between threads/cores and severe performance problems. For this reason, state-of-the-art profiling tools focus on thread communication and behavior to present different metrics that enable programmers to write cache-friendly programs. The data shared between a pair of threads should be reused with a reasonable distance to preserve data locality. However, existing tools do not take into account the locality of communication events and mainly focus on analyzing the amount of communication instead. In this paper, we introduce a new method to analyze performance and communication bottlenecks that arise from data-access patterns and thread interactions of each code region. We propose new hardware-independent metrics to characterize thread communication and provide suggestions for applying appropriate optimizations on a specific code region. We evaluated our approach on the SPLASH and Rodinia benchmark suites. Experimental results validate the effectiveness of our approach by finding communication locality issues due to inefficient data structures and/or poor algorithm implementations. By applying the suggested optimizations, we improved the performance in Rodinia benchmarks by up to 56%. Furthermore, by varying the input size we demonstrated the ability of our method to assess the cache usage and scalability of a given application in terms of its inherent communication. Kevin A. Brown (Tokyo Institute of Technology, Tokyo Tech/AIST RWBC-Open Innovation Laboratory); Nikhil Jain (Lawrence Livermore National Laboratory); Satoshi Matsuoka (Tokyo Institute of Technology, RIKEN Center for Computational Sciences); Martin Schulz (Technische Universität München); and Abhinav Bhatele (Lawrence Livermore National Laboratory) Abstract Abstract Network congestion arising from simultaneous data transfers can be a significant performance bottleneck for many applications, especially when network resources are shared by multiple concurrently running jobs. Many studies have focused on the impact of network congestion on either MPI performance or I/O performance but the interaction between MPI and I/O traffic is rarely studied and not well understood. In this paper, we analyze and characterize the interference between MPI and I/O traffic on fat-tree networks, highlighting the role of important factors such as message sizes, communication intervals, and job sizes. We also investigate several strategies for reducing MPI-I/O interference, and the benefits and tradeoffs of each approach for different scenarios. Tuesday 11:00am-12:30pm Straub Hall, Room 145 Paper Task Placement Algorithms Chair: Jee Choi (IBM) Amelie Chi Zhou (Shenzhen University); Tien-Dat Phan (Dassault Systemes, ENS Rennes / IRISA); Shadi Ibrahim (Inria); and Bingsheng He (National University of Singapore) Abstract Abstract Many Big Data processing applications nowadays run on large-scale multi-tenant clusters. Due to hardware heterogeneity and resource contentions, straggler problem has become the norm rather than the exception in such clusters. To handle the straggler problem, speculative execution has emerged as one of the most widely used straggler mitigation techniques. Although a number of speculative execution mechanisms have been proposed, as we have observed from real-world traces, the questions of "when" and "where" to launch speculative copies have not been fully discussed and hence cause inefficiencies on the performance and energy of Big Data applications. In this paper, we propose a performance and an energy model to reveal the performance and energy variations with different speculative execution solutions. We further propose a window-based dynamic resource reservation and a heterogeneity-aware copy allocation technique to answer the "when" and "where" questions for speculative copies. Evaluations using real-world traces show that our proposed technique can improve the performance of Big Data applications by up to 30% and reduce the overall energy consumption by up to 34%. Maria Predari and Henning Meyerhenke (University of Cologne) and Roland Glantz (Karlsruhe Institute of Technology) Abstract Abstract In this paper we propose a new method to enhance a mapping of a parallel application's computational tasks to the processing elements of a parallel computer. The idea behind our method TiMEr is to enhance such a mapping by drawing on the observation that many topologies take the form of a partial cube. This class of graphs includes all rectangular and cubic meshes, any such torus with even extensions in each dimension, all hypercubes, and all trees. Haipeng Dai and Ke Sun (Nanjing University); Alex X. Liu (Michigan State University); and Lijun Zhang, Jiaqi Zheng, and Guihai Chen (Nanjing University) Abstract Abstract This paper studies the problem of cHarging tAsk Scheduling for direcTional wireless chargEr networks (HASTE), i.e., given a set of rotatable directional wireless chargers on a 2D area and a series of offline (online) charging tasks, scheduling the orientations of all the chargers with time in a centralized offline (distributed online) fashion to maximize the overall charging utility for all the tasks. We prove that HASTE is NP-hard. Then, we prove that a relaxed version of HASTE falls within the realm of maximizing a submodular function subject to a partition matroid constraint, and propose a centralized offline algorithm that achieves (1 − ρ)(1 −1/e) approximation ratio to address HASTE where ρ is the switching delay of chargers. Further, we propose a distributed online algorithm and prove it achieves 1/2(1 − ρ)(1 −1/e) competitive ratio. We conduct simulations, and field experiments on a testbed consisting of 8 off-the-shelf power transmitters and 8 rechargeable sensor nodes. The results show that our distributed online algorithm achieves 92.97% of the optimal charging utility, and outperforms the comparison algorithms by up to 26.19% in terms of charging utility. Tuesday 11:00am-12:30pm Straub Hall, Room 245 Paper Graph Applications Chair: Konstantinos Krommydas (Intel Corporation) Kun Qiu, Yuanyang Zhu, Jing Yuan, Jin Zhao, and Xin Wang (Fudan University) and Tilman Wolf (University of Massachusetts Amherst) Abstract Abstract Determining the shortest-path distance between vertices in a weighted graph is an important problem for a broad range of fields, such as context-aware search and route selection. While many efficient methods for querying shortest-path distance have been proposed, they are poorly suited for parallel architectures, such as multi-core CPUs or computer clusters, due to the strong task dependencies. In this paper, we propose ParaPLL, a new parallelism-friendly framework for fast shortest-path distance query on large-scale weighted graphs. ParaPLL exploits intra-node and inter-node parallelism by using shared memory and message passing paradigms respectively. We also design task assignment and synchronization policies, which allow ParaPLL to reach remarkable speedups compared to state-of-the-art solutions. Moreover, we also prove the correctness of ParaPLL. To the best of our knowledge, ParaPLL is the first parallel framework that utilizing pruned landmark labeling to accelerate shortest-path distance queries on large-scale weighted graphs. Our evaluation results show that ParaPLL is 9.46 times faster than the corresponding serial version on a weighted 0.3M-vertex graph using a 12-core computer. ParaPLL on a 6-node computer cluster can also achieve a speedup of up to 5.6 over the single-node implementation. Xianghao Xu and Fang Wang (Huazhong University of Science and Technology), Hong Jiang (University of Texas at Arlington), Yongli Cheng (FuZhou University), and Dan Feng and Yongxuan Zhang (Huazhong University of Science and Technology) Abstract Abstract In recent years, a number of out-of-core graph processing systems have been proposed to process graphs with billions of edges on just one commodity computer, due to their high cost efficiency. To obtain the better performance, these systems adopt a full I/O model that accesses all edges during the computation to avoid the ineffectiveness of random I/Os. Although this model ensures good I/O access locality, it loads a large number of useless edges when running graph algorithms that only require a small portion of edges in each iteration. A natural method to solve this problem is the on-demand I/O model that only accesses the active edges. However, this method only works well for the graph algorithms with very few active edges, since the I/O cost will grow rapidly as the number of active edges increases due to larger amount of random I/Os. In this paper, we present HUS-Graph, an efficient out-of-core graph processing system to address the above I/O issues and achieve a good balance between I/O amount and I/O access locality. HUSGraph first adopts a hybrid update strategy including two update models, Row-oriented Push (ROP) and Column-oriented Pull (COP). It can adaptively select the optimal update model for the graph algorithms that have different computation and I/O features, based on an I/O-based performance prediction method. Furthermore,HUSGraph proposes a dual-block representation to organize graph data, which ensures good access locality. Extensive experimental results Jianping Zeng and Hongfeng Yu (University of Nebraska-Lincoln) Abstract Abstract Community detection is essential to various graph analysis applications. Infomap is a graph clustering algorithm capable of achieving high-quality communities. However, it remains a very challenging problem to effectively apply Infomap on large graphs. By analyzing communication and workload patterns of Infomap and leveraging a distributed delegate partitioning and distribution method, we develop a new heuristic strategy to carefully coordinate the community constitution from the vertices of a graph in a distributed environment, and achieve the convergence of the distributed clustering algorithm. We have implemented our optimized algorithm using MPI (Message Passing Interface), which can be easily employed or extended to massively distributed computing systems. We analyze the correctness of our algorithm, and conduct an intensive experimental study to investigate the communication and computation cost of our distributed algorithm, which has not shown in previous work. The results demonstrate the scalability and the correctness of our distributed Infomap algorithm with large-scale real-world datasets. Tuesday 2:00pm-3:30pm Straub Hall, Room 156 Paper Performance Tools and Methodologies Chair: Sameer Shende (University of Oregon) Ajay Ramaswamy, Nalini Kumar, Aravind Neelakantan, Herman Lam, and Greg Stitt (University of Florida) Abstract Abstract With extremely large design spaces for algorithm and architecture to be explored, there is a need for fast and scalable performance modeling tools for preparing HPC application codes. Behavioral Emulation (BE) is a recent coarse-grained modeling and simulation methodology that has been proposed to solve this co-design problem. In this paper, we introduce a distributed parallel simulation library for Behavioral Emulation called BE-SST, integrated into the Structural Simulation Toolkit (SST). BE-SST provides simple interfaces and framework for development of coarse-grained BE models which can be extended to model new notional architectures. BE-SST also supports Monte Carlo simulations to generate meaningful distributions and summary statistics rather than a single datum for performance. In this paper, we present BE-SST simulations of two existing large DOE machines (Vulcan and Titan), which have been validated against actual testbed measurements and showed 5-10\% error. These validated system models (up to 128k cores) are used to make blind predictions of application performance on systems larger than the current machines (up to 512k cores) - a crucial simulator feature for design-space exploration of notional systems. We further studied BE-SST in terms of scalability and performance, simulating up to a million cores, with BE-SST running on more than 2k parallel processes. BE-SST shows good scalability with a linear increase in memory usage and simulation time with increase in simulated system size, and a peak speedup of 7x over single process simulation. With ease of use and good scaling, we assert that BE-SST can significantly speed up design-space exploration. Brian Kocoloski (Washington University in St. Louis) and John Lange (University of Pittsburgh) Abstract Abstract Performance variability is a major problem for extreme scale parallel computing applications that rely on bulk synchronization and collective communication. While this problem is most prominent in the context of exascale systems, it is increasingly impacting other communities such as machine learning and graph analytics. In this paper, we present an experimental performance analysis framework called varbench that is designed to precisely measure the prevalence of performance variability in a system, as well as to support workload characterization with respect to how and when a workload generates variability. We demonstrate several of varbench’s capabilities as they pertain to exascale-class systems, including its utility for discovering architectural trends, for performing crossarchitectural comparisons, and for understanding key statistical properties of performance distributions that have implications for how system software should be designed to mitigate variability François Trahay (SAMOVAR, CNRS, Télécom SudParis, Université Paris-Saclay); Manuel Selva (Univ. Grenoble Alpes, CNRS, Inria, Grenoble INP, LIG); Lionel Morel (Univ Grenoble Alpes, CEA, List); and Kevin Marquet (Univ Lyon, INSA Lyon, Inria, CITI) Abstract Abstract Non Uniform Memory Access (NUMA) architectures are nowadays common for running High-Performance Computing (HPC) applications. In such architectures, several distinct physical memories are assembled to create a single shared memory. Nevertheless, because there are several physical memories, access times to these memories are not uniform depending on the location of the core performing the memory request and on the location of the target memory. Hence, threads and data placement are crucial to efficiently exploit such architectures. To help taking decision about this placement, profiling tools are needed. In this work, we propose NUMA MeMory Analyzer (NumaMMA), a new profiling tool for understanding the memory access patterns of HPC applications. NumaMMA combines efficient collection of memory traces using hardware mechanisms with original visualization means allowing to see how memory access patterns evolve over time. The information reported by NumaMMA allows to understand the nature of these access patterns inside each object allocated by the application. We show how NumaMMA can help understanding the memory patterns of several HPC applications in order to optimize them and get speedups up to 28% over the standard non optimized version. Tuesday 2:00pm-3:30pm Straub Hall, Room 145 Paper Networking Algorithms Chair: Kamesh Madduri (Pennsylvania State University) Yang Chen and Jie Wu (Temple University) Abstract Abstract Network Function Virtualization (NFV) changes the way that we implement network services, or middleboxes, from expensive hard- wares to software functions. These software middleboxes, also called Virtual Network Functions (VNFs), run on switch-connected commodity servers. Efficiently placing such middleboxes is challenging because of their traffic-changing effects and dependency relations. Private (used by one single flow) middleboxes can save more link bandwidth resources while shared (used by multiple flows) middleboxes cut down server resource expenses. This paper formulates the resource usage trade-off between bandwidth consumption and cost of middlebox placement as a combined cost minimization problem. After proving the NP-hardness of our problem in general topologies, we narrow down to a specific kind of topology: tree-structured networks. We study two kinds of constraints: traffic-chaining ratio and middlebox dependency relations. With homogeneous flows, we propose optimal greedy algorithms for the placement of a single middlebox first, and then multiple middleboxes without order. We also introduce a dynamic programming algorithm for the placement of a totally-ordered middlebox set. A performance-guaranteed algorithm is designed to handle heterogeneous flows. Extensive simulations are conducted to evaluate the performance of our proposed algorithms in various scenarios. Xu Lin (Xidian University), Deke Guo (National University of Defense Technology), Yulong Shen (Xidian University), and Guoming Tang and Bangbang Ren (National University of Defense Technology) Abstract Abstract Network Function Virtualization (NFV) is an emerging technology, which enables service agility, flexibility and cost reduction by replacing traditional hardware middleboxes with Virtual Network Functions (VNFs) running on general-purpose servers. Service Function Chain (SFC) constitutes an end-to-end service by organizing a series of VNFs in a specific order. Particularly, hybrid SFC (SFC with parallel VNFs) is proposed to much reduce the traffic delay in sequential SFCs. Nevertheless, how to strategically select VNF instances and links in hybrid SFC embedding remains an open problem. In this paper, we target at the cost minimization and address the optimal hybrid SFC embedding problem. Specifically, we first develop a novel abstraction model for the hybrid SFC with Directed Acyclic Graph (DAG), which helps convert diverse hybrid SFCs to the standardized DAG-SFC form. Then, we formulate the optimal DAG-SFC embedding problem as an integer optimization model and propose a greedy method (called BBE) to solve the NP-hard problem. MBBE method is developed upon BBE method to further cut down the computation complexity in model solving. Extensive simulation results demonstrate the effectiveness of our approach for cost reduction in hybrid SFC embedding. Xiaoyu Wang, Haipeng Dai, Weijun Wang, Jiaqi Zheng, Guihai Chen, and Wanchun Dou (Nanjing University) and Xiaobing Wu (University of Canterbury) Abstract Abstract This paper considers the problem of Heterogeneous wIreless charger Placement with Obstacles (HIPO), i.e., given a number of heterogeneous rechargeable devices distributed on a 2D plane where exist obstacles of arbitrary shapes, deploying heterogeneous chargers with a given cardinality of each type, i.e., determining their positions and orientations, the combination of which we name as strategies, on the plane such that the rechargeable devices achieve maximized charging utility. After presenting our practical directional charging model, we first propose to use a piecewise constant function to approximate the nonlinear charging power, and divide the whole area into multi-feasible geometric areas in which a certain type of chargers have constant approximated charging power. Next, we propose the Practical Dominating Coverage Set extraction algorithm to reduce the unlimited solution space to a limited one by exacting a finite set of candidate strategies for all multi-feasible geometric areas. Finally, we prove the problem falls in the realm of maximizing a monotone submodular function subject to a partition matroid constraint, which allows a greedy algorithm with approximation ratio of $1/2-\epsilon$. We conduct both simulations and field experiments to evaluate the performance of our algorithm and other five comparison algorithms. The results show that our algorithm outperforms the comparison algorithms by at least 33.49% on average. Tuesday 2:00pm-3:30pm Straub Hall, Room 245 Paper Astronomy and Earth Systems Chair: Kevin Huck (University of Oregon) Thomas R. Devine (West Virginia University, Fairmont State University) and Katerina Goseva-Popstojanova and Di Pang (West Virginia University) Abstract Abstract Data collection for scientific applications is increasing exponentially and is forecasted to soon reach peta- and exabyte scales. Applications which process and analyze scientific data must be scalable and focus on execution performance to keep pace. In the field of radio astronomy, in addition to increasingly large datasets, tasks such as the identification of transient radio signals from extrasolar sources are computationally expensive. We present a scalable approach to radio pulsar detection written in Scala that parallelizes candidate identification to take advantage of in-memory task processing using Apache Spark on a YARN distributed system. Furthermore, we introduce a novel automated multiclass supervised machine learning technique that we combine with feature selection to reduce the time required for candidate classification. Experimental testing on a Beowulf cluster with 15 data nodes shows that the parallel implementation of the identification algorithm offers a speedup of up to 5X that of a similar multithreaded implementation. Further, we show that the combination of automated multiclass classification and feature selection speeds up the execution performance of the RandomForest machine learning algorithm by an average of 54% with less than a 2% average reduction in the algorithm's ability to correctly classify pulsars. The generalizability of these results is demonstrated by using two real-world radio astronomy data sets. Junmin Xiao, Shigang Li, and Baodong Wu (Institute of Computing Technology, Chinese Academy of Sciences); He Zhang (Institute of Atmospheric Physics, Chinese Academy of Sciences); and Kun Li, Erlin Yao, Yunquan Zhang, and Guangming Tan (Institute of Computing Technology, Chinese Academy of Sciences) Abstract Abstract Dynamical core is one of the most time-consuming parts in the global atmospheric general circulation model (AGCM), which is widely used for the numerical simulation of the dynamic evolution process of global atmosphere. Due to its complicated calculation procedures and the non-uniformity of latitude-longitude mesh, the parallelization suffers from high communication overhead. In this paper, we deduce the operator form of the calculating flow in the dynamical core of AGCM. Furthermore, it is abstracted out that the stencil and collection alternate action is the basic operation in the dynamic core. Based on the operator form of calculation flow, we propose the corresponding optimization strategy for each operator. In the end, we develop a communication-avoiding algorithm to reduce communication overhead in the dynamic core. Our experiments show that the communication-avoiding algorithm reduces the total runtime by 54\% at most for a 50 km resolution model running 10 years. Especially for communication reduction, the new algorithm achieves 1.4x speedup on average for the collective communication and 3.9x speedup on average for the communication involved in the stencil computation. Satish Puri and Anmol Paudel (Marquette University) Abstract Abstract In recent times, geospatial datasets are growing in terms of size, complexity and heterogeneity. High performance systems are needed to analyze such data to produce actionable insights in an efficient manner. For polygonal a.k.a vector datasets, operations such as I/O, data partitioning, communication, and load balancing becomes challenging in a cluster environment. In this work, we present MPI-Vector-IO, a parallel I/O library that we have designed using MPI-IO specifically for partitioning and reading irregular vector data formats such as Well Known Text. Our system can perform spatial in-memory indexing and join efficiently for an order of magnitude larger datasets compared to the existing literature. It makes MPI aware of spatial data, spatial primitives and provides support for spatial data types embedded within collective computation and communication using MPI message-passing library. These abstractions along with parallel I/O support are useful for parallel Geographic Information System (GIS) application development on HPC platforms. Tuesday 4:00pm-6:00pm Straub Hall, Room 156 Paper Performance on GPU Systems Chair: Sameer Shende (University of Oregon) Lionel Eyraud-Dubois (Inria) and Thomas Lambert (University of Manchester) Abstract Abstract We consider the problem of data allocation when performing matrix multiplication on a heterogeneous node, with multicores and GPUs. Classical (cyclic) allocations designed for homogeneous settings are not appropriate, but the advent of task-based runtime systems makes it possible to use more general allocations. Previous theoretical work has proposed square and cube partitioning algorithms aimed at minimizing data movement for matrix multiplication. We propose techniques to adapt these continuous square partitionings to allocating discrete tiles of a matrix, and strategies to adapt the static allocation at runtime. We use these techniques in an implementation of Matrix Multiplication based on the StarPU runtime system, and we show through extensive experiments that this implementation allows to consistently obtain a lower communication volume while improving slightly the execution time, compared to standard state-of-the-art dynamic strategies. Zhuohang Lai and Qiong Luo (Hong Kong University of Science and Technology) and Xiaoying Jia (Nvidia Corporation) Abstract Abstract Scatter and gather are two essential data-parallel primitives for memory-intensive applications. The performance challenge is in their irregular memory access patterns, especially on architectures with high memory latency, such as GPUs. Previous work has proposed multi-pass scatter and gather schemes to optimize their performance on earlier GPUs; on newer GPUs, nevertheless, anecdotal evidence showed that such schemes had little performance benefit on small datasets, and few studies have been conducted on larger datasets. Therefore, we propose a systematic study to re-evaluate the performance of multi-pass scatter and gather on three newer GPUs with various data sizes. Specifically, we micro-benchmark the undocumented Translation Lookaside Buffers (TLBs) on these GPUs to quantitatively analyze their performance impact. We then develop an analytical model to analyze the execution of irregular memory accesses and estimate the multi-pass performance. Our evaluation on the newer GPUs shows that (1) TLB caching can affect the performance of irregular memory accesses more significantly than data caching; (2) on datasets larger than the L3 TLB size, the multi-pass schemes, with a suitable number of passes, can reduce up to 87.8% of the execution time over the single-pass version due to better TLB locality. Our model can predict the multi-pass performance on various GPUs, with an average accuracy of 92.9%. It can further suggest a suitable number of passes for the best performance. Wei Tan (Citadel), Shiyu Chang and Liana Fong (IBM Research), Cheng Li (UIUC), Zijun Wang (IBM Research), and Liangliang Cao (HelloVera.AI) Abstract Abstract Matrix factorization (MF) discovers latent features from observations, which has shown great promises in the fields of collaborative filtering, data compression, feature extraction, word embedding, etc. While many problem-specific optimization techniques have been proposed, alternating least square (ALS) remains popular due to its general applicability (e.g. easy to handle positive-unlabeled inputs), fast convergence and parallelization capability. Current MF implementations are either optimized for a single machine or with a need of a large computer cluster but still running slow. This because a single machine provides limited compute power for large-scale data while multiple machines suffer from the network communication bottleneck. Andre Weissenberger (Johann Wolfgang Goethe University) and Bertil Schmidt (Johannes Gutenberg University of Mainz) Abstract Abstract Data compression is a fundamental building block in a wide range of applications. Besides its intended purpose to save valuable storage on hard disks, compression can be utilized to increase the effective bandwidth to attached storage as realized by state-of-the-art file systems. In the foreseeing future, on-the-fly compression and decompression will gain utmost importance for the processing of data-intensive applications such as streamed Deep Learning tasks or Next Generation Sequencing pipelines, which establishes the Need for fast parallel implementations. Huffman coding is an integral part of a number of compression methods. However, efficient parallel implementation of Huffman decompression is difficult due to inherent data dependencies (i.e. the location of a decoded symbol depends on its predecessors). In this paper, we present the first massively parallel decoder implementation that is compatible with Huffman’s original method by taking advantage of the self-synchronization property of Huffman codes. Our performance evaluation on three different CUDA-enabled GPUs (TITAN V, TITAN XP, GTX 1080) demonstrates speedups of over one order-of-magnitude compared to the state-of-art CPU-based Zstandard Huffman decoder. Tuesday 4:00pm-6:00pm Straub Hall, Room 145 Paper Algorithms Chair: Wu Feng (Virginia Tech) Rintu Panja and Sathish Vadhiyar (IISC Banagalore, IISC) Abstract Abstract Efficient processing of large-scale graph applications on heterogeneous CPU-GPU systems require effectively harnessing the combined power of both the CPU and GPU devices. Finding minimum spanning tree (MST) is an important graph application and is used in different domains. When applying MST algorithms for large-scale graphs across multiple nodes (or machines), the existing approaches use BSP (bulk synchronous parallel) model involving large-scale communications. In this paper, we propose a multi-node multi-device algorithm for MST, {\em MND-MST}, that uses a divide-and-conquer approach by partitioning the input graph across multiple nodes and devices and performing independent Boruvka's MST computations on the devices. The results from the different nodes are merged using a novel hybrid merging algorithm that ensures that the combined results on a node never exceeds it memory capacity. The algorithm also simultaneously harnesses both CPU and GPU devices. In our experiments, we show that our proposed algorithm shows 24-88% performance improvements over an existing BSP approach. We also show that the algorithm exhibits almost linear scalability, and the use of GPUs result in upto 23% improvement in performance over multi-node CPU-only performance. Zachary Blanco, Bangtian Liu, and Maryam Mehri Dehnavi (Rutgers University) Abstract Abstract Tensors, or N-dimensional arrays, are increasingly used to repre- sent multi-dimensional data. Sparse tensor decomposition algo- rithms are of particular interest in analyzing and compressing big datasets due to the fact that most of real-world data is sparse and multi-dimensional. However, state-of-the-art tensor decomposition algorithms are not scalable for overwhelmingly large and higher- order sparse tensors on distributed platforms. In this paper, we use the Mapreduce model and the Spark engine to implement tensor factorizations on distributed platforms. The proposed CSTF, Cloud- based Sparse Tensor Factorization, is a scalable distributed algorithm for tensor decompositions for large data. It uses the coordinate storage format (COO) to operate on the tensor non-zeros directly, thus, eliminating the need for tensor unfolding and the storage of intermediate data. Also, a novel queuing strategy (QCOO) is proposed to exploit the dependency and data reuse between a se- quence of tensor operations in tensor decomposition algorithms. Details on the key-value storage paradigm and Spark features used to implement the algorithm and the data reuse strategies are also provided. The queuing strategy reduces data communication costs by 35% for 3rd-order tensors and 31% for 4th-order tensors over the COO-based implementation respectively. Compared with the state- of-the-art work, BIGTensor, CSTF achieves 2.2× to 6.9× speedup for 3rd-order tensor decompositions. Saeed Soori (Rutgers University), Aditya Devarakonda (University of California Berkeley), Zachary Blanco (Rutgers University), James Demmel (University of California Berkeley), and Mert Gurbuzbalaban and Maryam Mehri Dehnavi (Rutgers University) Abstract Abstract Proximal Newton methods are iterative algorithms that solve l1-regularized least squares problems. Distributed-memory implementation of these methods have become popular since they enable the analysis of large-scale machine learning problems. However, the scalability of these methods is limited by the communication overhead on modern distributed architecture. We propose a stochastic variance-reduced proximal method along with iteration-overlapping and Hessian-reuse to find an efficient trade-off between computation complexity and data communication. The proposed RC-SFSITA algorithm reduces latency cost by a factor of k without altering bandwidth costs. RC-SFISTA is implemented on both MPI and Spark and compared to state-of-the-art framework, ProxCoCoA. The performance of RC-SFISTA is evaluated on 1 to 512 nodes for multiple benchmarks and demonstrates speedups of up to 12x compared to ProxCoCoA with scaling properties that outperform the original algorithm. Juan Zhao (College of Meteorology and Oceanology,National University of Defense Technology; College of Computer,National University of Defense Technology); Jun qiang Song, Min Zhu, and Jin cai Li (College of Meteorology and Oceanology,National University of Defense Technology); Zhenyu Huang (State Key Laboratory of Information Security, Institute of Information Engineering, Chinese Academy of Sciences); and Xiaoyong Li and Xiao li Ren (College of Meteorology and Oceanology,National University of Defense Technology) Abstract Abstract Solving Boolean polynomial systems as an important aspect of symbolic computation, plays a fundamental role in various real applications. Although there exist many efficient sequential algorithms for solving Boolean polynomial systems, they are inefficient or even unavailable when the problem scale becomes large, due to the computational complexity of the problem and the limited processing capability of a single node. In this paper we propose an efficient parallel characteristic set method called PBCS for solving Boolean polynomial systems under the high-performance computing environment. Specifically, PBCS takes full advantage of the state-of-the-art characteristic set method and achieves load balancing by dynamically reallocating tasks. Moreover, the performance is further improved by optimizing the parameter setting. Extensive experiments are conducted to demonstrate that PBCS is efficient and scalable for solving Boolean equations, especially for the equations rasing from stream ciphers that have block triangular structure. In addition, the algorithm has good scalability and can be extended to the size of thousands CPU cores with a stable speedup. Tuesday 4:00pm-6:00pm Straub Hall, Room 245 Paper Scheduling Algorithms Chair: Michele Weiland (University of Edinburgh) Li Han (East China Normal University, China; Univ Lyon, CNRS, ENS de Lyon, Inria, Université Claude-Bernard Lyon 1, LIP UMR5668 LYON Cedex 07 France); Valentin Le Fèvre (Univ Lyon, CNRS, ENS de Lyon, Inria, Université Claude-Bernard Lyon 1, LIP UMR5668 LYON Cedex 07 France); Louis-Claude Canon (Univ Lyon, CNRS, ENS de Lyon, Inria, Université Claude-Bernard Lyon 1, LIP UMR5668 LYON Cedex 07 France; FEMTO-ST, Université de Bourgogne Franche-Comté, France); Yves Robert (Univ Lyon, CNRS, ENS de Lyon, Inria, Université Claude-Bernard Lyon 1, LIP UMR5668 LYON Cedex 07 France; University of Tennessee Knoxville, USA); and Frédéric Vivien (Univ Lyon, CNRS, ENS de Lyon, Inria, Université Claude-Bernard Lyon 1, LIP UMR5668 LYON Cedex 07 France) Abstract Abstract This work deals with scheduling and checkpointing strategies to execute scientific workflows on failure-prone large-scale platforms. To the best of our knowledge, this work is the first to target fail-stop errors for arbitrary workflows. Most previous work addresses soft errors, which corrupt the task being executed by a processor but do not cause the entire memory of that processor to be lost, contrarily to fail-stop errors. We revisit classical mapping heuristics such as HEFT and MinMin and complement them with several checkpointing strategies. The objective is to derive an efficient trade-off between checkpointing every task (CkptAll), which is an overkill when failures are rare events, and checkpointing no task (CkptNone), which induces dramatic re-execution overhead even when only a few failures strike during execution. Contrarily to previous work, our approach applies to arbitrary workflows, not just special classes of dependence graphs such as M-SPGs (Minimal Series-Parallel Graphs). Extensive experiments report significant gain over both CkptAll and CkptNone, for a wide variety of workflows. Yibo Jin and Zhuzhong Qian (Nanjing University); Song Guo (Hong Kong Polytechnic University); and Sheng Zhang, Xiaoliang Wang, and Sanglu Lu (Nanjing University) Abstract Abstract Many organizations and companies have deployed not only datacenters but also large number of geo-distributed heterogeneous edges to provide fast data analytics services. Since large volume of data transmission across WAN can be costly, existing works mainly focus on pre-processing data in-place to avoid transmission. However, the heterogeneity of edges on either local computing capacity or network bandwidth limits the efficient use on scarce resource, which may result in long task completion time. To cope with dynamic demands on scarce resource, we take the heterogeneity of both computing capacity and network bandwidth of geo-distributed edges into consideration when assigning data analytical tasks and their associated data between the central datacenter and edges such that the overall latency can be reduced. We formulate the geo-distributed data-task joint scheduling problem (GJS), show its NP-hardness, and propose a near-optimal randomized scheduling algorithm (ran-GJS). ran-GJS can be proved concentrated around its optimum value with high probability, i.e., 1-O(e^{-t^{2}}) where t is the concentration bound by using Martingale Analysis. The experimental results obtained form both extensive simulations and Yarn-based prototype show that ran-GJS significantly speeds up the geo-distributed analytics with a gain on average completion time of at least 28% over state-of-the-art baseline algorithms. Less Provisioning: A Fine-Grained Resource Scaling Engine for Long-Running Services with Tail Latency Guarantees Binlei Cai and Rongqi Zhang (School of Computer Science and Technology, Tianjin University; Tianjin Key Laboratory of Advanced Networking); Laiping Zhao (School of Computer Software, Tianjin University; Tianjin Key Laboratory of Advanced Networking); and Keqiu Li (School of Computer Science and Technology, Tianjin University; Tianjin Key Laboratory of Advanced Networking) Abstract Abstract Modern resource management frameworks guarantee low tail latency for long-running services using the resource over-provisioning method, resulting in serious waste of resource and increasing the service costs greatly. To reduce the over-provisioning cost, we present EFRA, an elastic and fine-grained resource allocator that enables much more efficient resource provisioning while guaranteeing the tail latency Service Level Objective (SLO). EFRA achieves this through the cooperation of three key components running on a containerized platform: The period detector identifies the period features of the workload through a convolution-based time series analysis. The resource reservation component estimates the just-right amount of resources based on the period analysis through a top-K based collaborative filtering approach. The online reprovisioning component dynamically adjusts the resources for further enforcing the tail latency SLO. Testbed experiments show that EFRA is able to increase the average resource utilization to 43%, and save up to 66% resources while guaranteeing the same tail latency objective. Brandon Nesterenko and Qing Yi (UCCS) and Jia Rao (UTA) Abstract Abstract Traditional process scheduling in the operating system focuses on high CPU utilization while achieving fairness among the processes. However, this can lead to an inefficient usage of other hardware resources, e.g., the caches, which have limited capacity and is a scarce resource on most systems. This paper extends a traditional operating system scheduler to schedule processes more efficiently against hardware resources. Through the introduction of a new concept, a \textit{progress period}, which models the variation of resource access characteristics during application execution, our scheduling extension dynamically monitors the changes in resource access behavior of each process being scheduled, tracks their collective usage of hardware resources, and schedules the processes to decrease overall system power consumption without compromising performance. Testing this scheduling system on programs on an Intel(R) Xeon(R) E5-2420 CPU with twelve kernels from the BLAS suite and five applications from the SPLASH-2 benchmark suite yielded a 48\% maximum decrease in system energy consumption (average 12\%), and a 1.88x maximum increase in application performance (average 1.16x). Tuesday 6:30pm-9:02pm Erb Memorial Union (EMU) Ballroom, 2nd Floor Poster, Reception Poster Reception and Dinner Chair: Kengo Nakajima (University of Tokyo, RIKEN) Weidong Zhang, Chang Cui, and Yifeng Chen (Peking University) and Qifei Zhang (Zhejiang University) Abstract Abstract Many synchronous parallel algorithms like PageRank require a large number of iteration steps. The overheads of global synchronizations on general-purpose cluster take substantial proportion of the execution time for lightweight computations. We propose a variant of Bulk Synchronous Parallel (BSP), Delta-Stepping Synchronous Parallel (DSP), with fewer iteration steps. It achieves faster convergence process by taking full advantage of data locality. Teemu Kerola (University of Helsinki) Abstract Abstract A GPU-type special processor is proposed for sorting large data sets with, e.g., one billion keys. The parallel bubble sorter (PBS) processor implements parallel bubble sort in hardware with a very large set of special registers. A method to utilize the PBS for various large sorting problems is presented. Xi Wang, John D. Leidel, and Yong Chen (Texas Tech University) Abstract Abstract Arguably, many data-intensive applications pose significant challenges to conventional architectures and memory systems, especially when applications exhibit non-contiguous, irregular, and small memory access patterns. The long memory access latency can dramatically slow down the overall performance of applications. The growing desire of high memory bandwidth and low latency access stimulate the advent of novel 3D-staked memory devices such as the Hybrid Memory Cube (HMC), which provides significantly higher bandwidth compared with the conventional JEDEC DDR devices. Even though many existing studies have been devoted to achieving high bandwidth throughput of HMC, the bandwidth potential cannot be fully exploited due to the lack of highly efficient memory coalescing and interfacing methodology for HMC devices. In this research, we introduce a novel memory coalescer methodology that facilitates memory bandwidth efficiency and the overall performance through an efficient and scalable memory request coalescing interface for HMC. We present the design and implementation of this approach on RISC-V embedded cores with attached HMC devices. Our evaluation results show that the new memory coalescer eliminates 47.47% memory accesses to HMC and improves the overall performance by 13.14% on average. Peter McCorquodale, Phillip Colella, and Brian Van Straalen (Lawrence Berkeley National Laboratory) and Christos Kavouklis (Lawrence Livermore National Laboratory) Abstract Abstract This poster describes a new algorithm, Method of Local Corrections (MLC), and a high-performance implementation for solving Poisson's equation with infinite-domain boundary conditions, on locally-refined nested rectangular grids. MLC represents the potential as a linear superposition of small local discrete convolutions, with global coupling represented using a non-iterative form of geometric multigrid. Thus the data motion is comparable to that of only a single V-cycle of multigrid, and hence is an order of magnitude smaller than tradition multigrid iteration. The computational kernels where most of the time is spent are 3D FFTs on small domains. Our results show solution error that is fourth order in mesh spacing, down to a fixed barrier that can be made low using high-order Mehrstellen stencils. Strong scaling tests on 64 to 4096 cores on NERSC Cori I (Haswell) show over 60% efficiency, and weak scaling by replication tests over 64 to 32768 cores show 92% efficiency on the same platform. We find comparable solve times between HPGMG on a uniform grid with one billion grid points, and MLC on the same number of grid points adaptively distributed. Since MLC operates on a nested series of adaptive locally-refined grids, it is able to solve problems with much higher resolution at the finest level than an algorithm on a uniform grid. Cheng Qian (National University of Defense Technology, University of Pittsburgh); Bruce Childers (University of Pittsburgh); and Libo Huang, Qi Yu, and Zhiying Wang (National University of Defense Technology) Abstract Abstract Graph traversal is widely involved in lots of realistic scenarios such as road routing, social network and so on. Unfortunately graph traversal is quite time-consuming because of terrible spatial locality. Conventional prefetch technology and parallel framework do not bring much benefit. Hybrid Memory Cube (HMC) can serve as high bandwidth main memory as well as providing an enhanced logic layer with the ability of controlling memory access and processing in memory (PIM). Armed with the knowledge of graph's structure, we propose CGAcc in this paper. By deploying three prefetchers on the logic layer and making them in a pipeline way, CGAcc achieves average 2.8X speedup compared with conventional memory system with stream prefetcher. An Extensible Ecosystem of Tools Providing User Friendly HPC Access and Supporting Jupyter Notebooks Ben Glick and Jens Mache (Lewis & Clark College) Abstract Abstract High performance computing systems can be hard to use, and can require advanced knowledge of the command line, resource managers, and other details that scientists without a systems administration background may find distracting. In this project, we describe our extensible ecosystem of tools which has allowed students and researchers with minimal HPC backgrounds to use our system with virtually no training. Among the tools we support are (1) a web-based job submission and management system, (2) a GUI-based file transfer and sharing tool, and (3) a Jupyter notebook environment. Together, these tools and the interfaces around them allow users to run complete HPC workflows without any use of the command line. Performance measurements show that our ecosystem incurs only marginal overhead. In-Depth Reliability Characterization of NAND Flash based Solid State Drives in High Performance Computing Systems Shuwen Liang, Zhi Qiao, and Song Fu (University of North Texas) and Weisong Shi (Wayne State University) Abstract Abstract ABSTRACTNAND Tiago Barreto Goes Perez and Xiaobo Zhou (University of Colorado Colorado Springs) Abstract Abstract Spark is among the most popular parallel and distributed data analytics frameworks in use today. But just like its peers and predecessors, it still suffers from performance degradation caused by resource bottlenecks. The challenge is finding multiple methods that leverage awareness of resource bottlenecks to improve performance. Alvaro Fernandez (Instituto de Física Corpuscular (IFIC), Universidad de Valencia and CSIC); Juan Orduña (Departamento de Informática, Universiad de Valencia, SPAIN); and Santiago Gonzalez (Instituto de Física Corpuscular (IFIC), Universidad de Valencia and CSIC) Abstract Abstract Modern scientific collaborations, like the ATLAS experiment at CERN, produce large amounts of data that need cataloging to meet multiple use cases and search criteria. Several challenges arise in indexing and collecting billions of events, or particle collisions, from hundred of grid sites worldwide. In addition, we also face some challenges in the organization of the data storage layer of the catalog, that should be capable of handling mixed OLTP (high-volume transaction processing updates ) and OLAP (real-time analytical queries) use cases. Xinyu Chen and Trilce Estrada (Computer Science Department, University of New Mexico) Abstract Abstract We present KeyBin2, a key-based clustering method that is able to learn from distributed data in parallel. KeyBin2 uses random projections and discrete optimizations to efficiently clustering very high dimensional data. Because it is based on keys computed independently per dimension and per data point, KeyBin2 can scale linearly. We perform accuracy and scalability tests to evaluate our algorithm's performance using synthetic and real datasets. The experiments show that KeyBin2 outperforms other parallel clustering methods for problems with increased complexity. Finally, we present an application of KeyBin2 for in-situ clustering of protein folding trajectories. Sarunya Pumma (Virginia Tech), Min Si (Argonne National Laboratory), Wu-chun Feng (Virginia Tech), and Pavan Balaji (Argonne National Laboratory) Abstract Abstract As deep learning systems continue to grow in importance, several researchers have been analyzing approaches to make such systems efficient and scalable on high-performance computing platforms. As computational parallelism increases, however, data I/O becomes the major bottleneck limiting the overall system scalability. In this paper, we present a detailed analysis of the performance bottlenecks of Caffe on large supercomputing systems and propose an optimized I/O plugin, namely LMDBIO. Throughout the paper, we present six sophisticated data I/O techniques that address five major problems in state of the art I/O subsystem. Our experimental results show that Caffe/LMDBIO can improve the overall execution time of Caffe/LMDB by 77-fold compared with LMDB. Jose Manuel Monsalve Diaz (University of Delaware) Abstract Abstract On the quest of High Performance Computing systems capable of achieving high parallelism on scientific applications, a trend of adding acceleration devices with specialized hardware architectures can be observed. An evidence of this are the 102 supercomputers of the latest top 500 list that use an accelerator or co-processor. Furthermore, new programming languages and frameworks have been created to support accelerators. While each vendor may offer its own programming solution, traditional programming frameworks as OpenMP are working to solve this issue adapting the standard to support target devices. Although it is not the only option, all solutions share a common objective of increasing code portability across vendors and multiple architectural generations. However, maturity and correctness of these solution needs to be tested. In particular for OpenMP, there is a breach between the standard release and when compilers and vendors implement support for it. Hence, there is a need of a methodology to assess the quality of an implementation of the OpenMP standard, as well as evaluate system compliance during systems deployment. Such methodology should reveal bugs or conflicts on the interpretation of the standard, as well as provide metrics to characterize the quality level of an implementation. Consequently, this work presents a tests suite for OpenMP 4.5. This suite follows a well thought methodology that was established in advanced to provide a common and steady ground, allowing vendors, systems developers and programmers to asses the level of readiness and maturity of an OpenMP 4.5 implementation for a particular vendor. Sunimal Rathnayake and Yong Meng Teo (National University of Singapore) Abstract Abstract With cloud computing offering scalable resources and pay-per use pricing, it is an attractive platform for scalable computing where application execution is constrained only by the cost budget. Additionally, cloud applications are increasingly becoming larger due to recent advancements in big data processing and machine learning, among others. Due to the large number of resource configurations available on the cloud, it is a non-trivial task for cloud users to determine the largest size of the application executable for a given cost budget and execution time deadline. In addition, the challenge becomes multi-fold as resource demands of the application do not necessarily scale linearly with problem size. Given a cost budget and a time deadline, this paper introduces a measurement-driven analytical modeling approach to determine the largest Pareto-optimal problem size of an application and the corresponding cloud configuration for execution. We evaluate our approach with a subset of representative applications that exhibit range of resource demand growth patterns. We show the existence of cost-time-size Pareto-frontier with multiple sweet spots meeting user constraints. To characterize the cost-performance of cloud resources, we introduce Performance Cost Ratio (PCR) metric. Motivated by Gustafson's law, we investigate fixed-time scaling of applications on cloud and analyze the trade-offs between the application problem size, cost and time. Chad Wood (University of Oregon) Abstract Abstract Efficiently observing and interacting with complex scientific workflows at scale presents unique challenges. SOSflow helps meet them. This work presents the model and a design overview of the SOSflow In Situ Scalable Observation System, and shares some recent results oriented towards performance understanding of complex workflows at scale. Payal Nandy (University of Utah, university of utah); Eddie C. Davis (Boise State University); Mahdi S. Mohammadi (University of Arizona); Mary Hall (University of Utah); Michelle Strout (University of Arizona); and Catherine Olschanowsky (Boise State University) Abstract Abstract The inspector/executor paradigm permits using runtime information in concert with compiler optimization. An inspector collects information that is only available at runtime; this information is used by an optimized executor that was created at compile time. Inspectors are widely used in optimizing irregular computations, where information about data dependences, loop bounds, data structures, and memoryaccess patterns are collected at runtime and used to guide code transformation, parallelization, and data layout. Most research that uses inspectors relies on instantiating inspector templates, invoking inspector library code, or manually writing inspectors. This poster describes abstractions for generating inspectors for loop and data transformations for sparse matrix computations using the Sparse Polyhedral Framework(SPF). SPF is an extension of the polyhedral framework for transformation and code generation. SPF extends the polyhedral framework to represent runtime information with uninterpreted functions and inspector computations that explicitly realize such functions at runtime. It has previously been used to derive inspectors for data and iteration space reordering. This poster introduces data transformations intoSPF, such as conversions between sparse matrix formats, and show how prior work can be supported by SPF. We also discuss possible extensions to support inspector composition and incorporate other optimizations. This work represents a step towards creating composable inspectors in keeping with the composability of affine transformations on the executors. Md Shafayat Rahman (Florida State University) Abstract Abstract The performance of the interconnect network is massively important in the modern day supercomputer and data centers. As a PhD student in the FSU CS EXPLORER (EXtreme-scale comPuting, modeLing, netwORking & systEms Research) lab under the supervision of Dr. Xin Yuan, my research activity revolves around the analysis, improvement and performance evaluation of a number of topology and routing schemes widely used in the field of high performance computing. Over the years, I worked in a number of projects that is briefly described in my poster and in the extended research abstract. Ioannis Paraskevakos (Rutgers University) Abstract Abstract My research focuses on providing a common workload management middleware abstraction, on distributed resources, that will capture the commonalities of the data analysis between different domain sciences, and a resource management abstraction that will allow Big Data style analytics on HPCs in conjunction with HPC style simulations. One important component of these abstractions is the Pilot abstraction. The Pilot abstraction operates as a resource placeholder which can be used to execute a set of heterogeneous tasks within the placeholder on the same resource. I am using the Pilot-Abstraction and its implementation RADICAL-Pilot to support different types of data analysis as well as use it as a building block for the middleware abstraction defined before. Kenneth Weber and Justin A. Brew (University of Mount Union, Department of Computer Science) Abstract Abstract The modular integer greatest common divisor (GCD) algorithm Hui Guo, Libo Huang, Yashuai Lv, Qi Yu, Sheng Ma, and Zhiying Wang (National University of Defense Technology) Abstract Abstract Breadth First Search (BFS) is a key graph searching algorithm for many graph computing applications. However, memory divergence harms GPU’s efficiency dramatically due to the irregular memory access pattern of BFS. Memory prefetching can fetch useful data into on-chip memory in advance, but current prefetchers cannot deal with data-dependent memory accesses. In this paper, we propose DSAP, a data structure-aware prefetcher on GPU that generates prefetching requests based on the data-dependent relationship between graph data structures and the memory access pattern of BFS. Also, we introduce an adaptive prefetching depth management to make DSAP more efficient according to the limited on-chip memory storage. We evaluate six datasets from three different kinds of applications and DSAP can achieve geometrical mean speedups of 1.28x, up to 1.48x, compared to the GPU without using prefetching. Madhura Kumaraswamy and Michael Gerndt (Technical University of Munich) Abstract Abstract Optimizing the energy consumption is a major challenge in the current HPC landscape. The EU Horizon 2020 project READEX provides an auto-tuning framework to improve the energy-efficiency of HPC applications. It leverages the variation in the application behavior between individual time loop iterations, or phases as well as changing control flow within a single phase. Shehenaz Shaik and Sanjeev Baskiyar (Auburn University) Abstract Abstract Fog computing paradigm is emerging as complementary to cloud computing paradigm to realize deployment of large scale IoT environments supporting widely dispersed IoT devices, users, and corresponding applications, leveraging fog nodes of varied resource configurations located in vicinity. Researchers have shown the need of fog computing towards deployment of latency-critical and bandwidth-intensive applications. Management of resources and services is critical in widely dispersed fog environment with heterogeneous fog nodes to ensure optimal utilization of available infrastructure and energy resources on fog nodes and IoT devices. Towards efficient management of fog, we proposed Hierarchical and Autonomous Fog Architecture (HAFA) to organize heterogeneous fog nodes into a multi-layered connected hierarchy based on several parameters such as physical location, distance from IoT devices and/or users, node resource configuration, privacy and security. The initial results show the ease of search for an optimal fog node with required resource configuration towards deployment of application services. Kanika Sood and Boyana Norris (University of Oregon) and Elizabeth Jessup (University of Colorado Boulder) Abstract Abstract Scientific and engineering applications from different domains like astrophysics, robotics, computational fluid dynamics, thermodynamics, often involve the solution of large sparse linear systems. For large, sparse linear systems, iterative methods are the preferred choice of solution method. There are many software libraries that offer a variety of parallel preconditioned Newton-Krylov methods for solving sparse problems. However, the selection of an optimal Krylov method remains to be user's responsibility. In this work, we present the technique we propose for the optimal solver method suggestions based on the problem characteristics and the amount of communication involved in the Krylov methods. Yasodha Suriyakumar and Karen L. Karavanic (Portland State University) and Hamid Moradkhani (University of Alabama) Abstract Abstract We present the results of a performance study of a newly developed drought prediction code called DroughtHPC. Holistic view helps to identify bottlenecks and identify areas for improvement in the software. The significant performance bottlenecks identified include: the overhead of calls to the VIC hydrologic model from within a python loop; VIC code structures that precluded parallelization with OpenMP; and significant file accesses. We observed challenges in diagnosing the performance of the code due to the use of an external modeling code in combination with python, a fairly common scenario in scientific computation. To address these challenges we are developing PPerfG, a tool for visualizing Holistic HPC Workflows, and have implemented an initial prototype. Carl Yang (UC Davis, LBNL); Aydin Buluc (LBNL, UC Berkeley); and John D. Owens (UC Davis) Abstract Abstract Push-pull, also known as direction-optimized breadth-first-search (DOBFS), is a key optimization for making breadth-first-search (BFS) run efficiently. Linear algebra-based frameworks have advantages in conciseness, performance and portability. However, there is no work in literature describing how to implement it within a linear algebra-based framework. Our work shows that DOBFS fits well within the linear algebra-based framework. MOHAMMAD ALAUL HAQUE MONIL (UNIVERSITY OF OREGON) Abstract Abstract Finding an optimal value for a parameter that impacts application performance is a hard problem and often requires repetitive execution of the application. This auto-tuning effort by running the application many times causes wastage of HPC resources. In this research, we provide a preliminary study which demonstrates parameter value searching at runtime while ensuring better performance. We use APEX performance measurement library to implement an adaptive auto-tuning policy to tune parcel coalescing parameters based on a sampled counter of HPX runtime system. Our study shows APEX policy can converge on parcel coalescing parameters while reducing the network overhead and hence ensures reduction of execution time. Anmol Paudel and Satish Puri (Marquette University) Abstract Abstract Spatial data processing and analytics is a highly data- and compute-intensive task. Doing this in a HPC environment has remained a challenge due to scalability issues. The scalability issues usually arise due to the non-uniform nature of spatial data which makes load balancing inefficient. Existing algorithms and systems which are sequential in nature cannot be ported into a distributed system without tackling the issue of load balancing. Porting existing algorithms to a parallel system would also require finding their parallel counterpart and rewriting code in a new language like CUDA or refactoring them using pthreads. In our work, we present a framework for processing and analyzing big spatial data in a HPC environment by overcoming or diminishing the above mentioned issues. Our work includes a way to read the non-uniform spatial data into a MPI system using MPI-Vector-IO. This allows us to reduce the read time of big spatial data and get started on processing faster. ADLB is introduced as a potential solution to handling the issues of load balancing during processing of the big spatial data. Moreover, we propose using directive based programming to port the existing sequential code to run in parallel so that we can avoid rewriting or heavy refactoring. Using OpenMP will allow our code to run in parallel threads in the nodes extracting more performance when possible and using OpenACC we will be able to use any GPUs the nodes might have. Vjatsešlav Antoškin and Benson Muite (University of Tartu) Abstract Abstract Political redistricting is an important operation that is done to ensure a fair selection of electoral representatives. It can be formulated as a combinatorial optimization problem. In realistic cases, this problem can be challenging to solve due to the large number of solutions. The effectiveness of parallel computing to more effectively search the solution space is examined in specially designed test cases where the optimal solution is known. Hoang-Vu Dang and Marc Snir (University of Illinois at Urbana-Champaign) Abstract Abstract Communication hardware and software have a significant impact on the performance Suleyman Savas (Halmstad University) Abstract Abstract We propose a design method to develop domain-specific heterogeneous manycores by extending simple cores with custom instructions based on application requirements. We develop tools to automate the design method and facilitate development of the manycores. The tools generate competitive hardware/software co-design in terms of performance and hardware resource usage. Sajal Dash (Virginia Tech); Nick Kinney, Robin Varghese, and Harold Garner (Edward Via College of Osteopathic Medicine, Blacksburg); Wu-chun Feng (Virginia Tech); and Ramu Anandakrishnan (Edward Via College of Osteopathic Medicine, Blacksburg) Abstract Abstract Disruptions in certain molecular pathways due to combinations of genetic mutations (hits) are known to cause cancer. Due to a large number of mutations present in tumor cells, experimentally identifying these combinations is not possible except in very rare cases. Current computational approaches simply do not search for specific combinations of multiple genetic mutations. Instead, current algorithms search for sets of driver mutations based on mutation frequency and mutational signatures. Here, we present a fundamentally different approach for identifying carcinogenic mutations: we search for combinations of carcinogenic mutations (multi-hit combinations). By avoiding the convolution of different driver mutations associated with different individual instances of cancer, multi-hit combinations may be able to identify the specific cause for each cancer instance. We mapped the problem of identifying a set of multi-hit combinations to a weighted set cover problem. We use a greedy algorithm to identify sets of multi-hit combinations for seventeen cancer types for which there are at least two hundred matched tumor and blood-derived normal samples in the cancer genome atlas (TCGA). When tested using an independent validation dataset, these combinations are able to differentiate between tumor and normal tissue samples with 91% sensitivity (95% Con- fidence Interval (CI) = 89-92%) and 93% specificity (95% CI = 91-94%), on average for seventeen cancer types. The combinations identified by this method, with experimental validation, can aid in better diagnosis, provide insights into the etiology of various cancer types, and provide a rational basis for designing targeted combination therapies. Robert Lim (University of Oregon) Abstract Abstract Accelerator architectures specialize in executing SIMD (single instruction, multiple data) in lockstep. Because the majority of CUDA applications are parallelized loops, control flow information can provide an in-depth characterization of a kernel. Our methodology statically separates CUDA binaries into basic block regions and dynamically measures instruction andbasic block frequencies. Our approach captures this information in a control flow graph (CFG) and performs subgraph matching across various kernel’s CFGs to gain insights into an application’s resource requirements, based on the shape and traversal of the graph, instruction operations executed and registers allocated, among other information. The utility of our approach is demonstrated with SHOC and Rodinia application case studies on a variety of GPU architectures, revealing novel thread divergence characteristics that facilitate end users, autotuners, and compilers in generating high performing code. Eishi Arima and Toshihiro Hanawa (The University of Tokyo) and Martin Schulz (Technical University of Munich) Abstract Abstract High Performance Computing (HPC) systems are facing severe limitations in both power and memory bandwidth/capacity. Both limitations have been addressed individually: to exploit performance under a strict power constraint, power shifting, which allocates more power budget on the bottleneck component, is a promising approach; for memory bandwidth/capacity, the industry is gradually shifting towards hybrid main memory designs that comprise multiple technologies (e.g., DRAM and Non-Volatile RAM). However, few works look at the combination of both trends. Anuva Kulkarni, Franz Franchetti, and Jelena Kovacevic (Carnegie Mellon University) Abstract Abstract Extreme memory requirements and high communication overhead prevent scaling of large scale iterative simulations involving parallel Fast Fourier Transforms (FFTs) to higher grid sizes, which is necessary for high resolution analysis. To overcome these limitations, we propose an algorithm to run stress-strain simulations on CPU-GPU platforms for larger problem sizes using irregular domain decomposition and local FFTs. Early results show that our method lowers iteration cost without adversely impacting accuracy of the result. Aaron Goin, Ronald Cotton, and Xinghui Zhao (Washington State University) Abstract Abstract Distributed computing technologies have been used to facilitate machine learning applications, so that high-end hardware resources, such as clusters, GPUs, etc., can be leveraged to achieve high performance. In this paper, we present WebNN, a web-based distributed framework for deep learning. WebNN enables users to configure, distribute, and train neural networks asynchronously over the Internet, utilizing peer-owned resources with increased flexibility. Experiments have been carried out to evaluate WebNN, and the results show that WebNN is a effective solution for distributed training of models using pre-labeled batches, or client-provided data. Jason Hiebel, Laura E. Brown, and Zhenlin Wang (Michigan Technological University) Abstract Abstract Dynamically tuning or reconfiguring operating system and architectural resources at runtime can improve performance by adapting system operation to the current workload. We seek to address the problems of both system modeling and dynamic system configuration in the context of sequential decision processes with limited feedback. Random profiling is a technique for sequential decision processes with limited feedback, which can adequately and efficiently capture the relationships between workload behavior, system configuration, and performance. Along with machine learning methods, random profiling can be used for determining descriptive metrics for workload behavior and constructing policies which dynamically reconfigure the operating system and microarchitecture at runtime. We present three such problems: system event selection, dynamic paging mode selection, and dynamic hardware prefetcher configuration. Maciej Pawlik, Kamil Figiela, and Maciej Malawski (AGH University of Science and Technology) Abstract Abstract Function-as-a-Service services are still a novelty in portfolios of cloud service providers. There is a significant amount of research done on development of new use cases. One of the new proposals is to exploit the computing power of FaaS by running HPC applications [1]. While not all workloads can be adapted to run as cloud functions, means for executing workflow-type applications are available. In preparation to develop performance models for studied applications, a benchmarking framework was proposed in [2]. Saurabh Kalikar (Indian Institute of Technology Madras, India) and Rupesh Nasre (Indian Institute of Technology Madras) Abstract Abstract Hierarchies are special linked structures, where each child node denotes a specialization or a part of its parents. For instance, node representing a department in an academic hierarchy is a part of its parent institute. Conversely, a node representing an institute contains all its departments. Hierarchical locking is a way to lock a node in a hierarchy which implicitly locks its descendants. This is useful because the whole sub-hierarchy rooted at a node can be protected using a single lock. A node could be hierarchically locked only if (i) it is not locked by any other thread, and (ii) none of its descendants are currently locked by any other thread, and (iii) none of its ancestors are currently locked by any other thread. For instance, in Figure 1, hierarchical locking of node G requires that no other thread currently holds locks on (i) G itself, (ii) its descendant nodes M and N , and (iii) its ancestors A and C . However, as an implementation, hierarchical locking poses various challenges. We present our interval based hierarchical locking technique and show how it tackles all the challenges in hierarchical locking. Vignesh Adhinarayanan (Virginia Tech) Abstract Abstract Modern high-performance computing (HPC) systems are power-limited. For instance, the U.S. Department of Energy has set a power envelope of 20MW for the exascale supercomputer expected to arrive in 2022-23. Achieving this target requires a 10.8-fold increase in performance over today’s fastest supercomputer with only a 1.3-fold increase in power consumption. As a consequence, the architecture of an HPC system is changing rapidly---e.g., via heterogeneity and hardware overprovisioning. In my thesis, I address (i) modeling, (ii) management, and (iii) evaluation challenges concerning power and energy in this changing landscape of HPC systems. In this extended abstract, I focus on my work on modeling data movement power over interconnect wires. Wednesday 9:00am-10:00am Straub Hall, Room 156 Plenary Talk Plenary Chair: Manish Parashar (Rutgers University) Transforming Science through Cyberinfrastructure Manish Parashar (University of Rutgers and NSF OAC Office Director) Biography Biography Manish Parashar (University of Rutgers and NSF OAC Office Director) Manish is Office Director of the Office of Advanced Cyberinfrastructure at NSF. He joins NSF from Rutgers, The State University of New Jersey, where he is currently a Distinguished Professor and the founding Director of the Rutgers Discovery Informatics Institute. His research interests are in the broad areas of Parallel and Distributed Computing and Computational and Data-Enabled Science and Engineering. Manish is Fellow of AAAS, Fellow of IEEE/IEEE Computer Society and ACM Distinguished Scientist. Abstract Abstract NSF's Office of Advanced Cyberinfrastructure (OAC) seeks to foster the advanced cyberinfrastructure that is critical to the advancement of all areas of science and engineering research and education. For over a decade, OAC's (and ACI and OCI before it) investments have consistently enabled new innovations and discoveries. However, recent years are witnessing dramatic changes in nature and requirements of science applications, in the scale and pervasiveness of data, and in the landscape of technologies and resources. It is essential that the research cyberinfrastructure ecosystem evolve in response to these changes, and the HPC community, its campus leaders, researchers and professional experts all have important roles to play. This talk will present an overview of OAC and its programs and investments. It will also present a vision for evolving program and priorities to transform science in the 21st century. Wednesday 10:30am-12:30pm Straub Hall, Room 156 Paper Memory Performance Chair: Martin Schulz (Technical University Munich) Anne Benoit (ENS Lyon & Inria; Georgia Institute of Technology, Atlanta); Swann Perarnau (Argonne National Laboratory); Loïc Pottier (ENS Lyon & Inria); and Yves Robert (ENS Lyon & Inria; University of Tennessee, Knoxville) Abstract Abstract This work presents a realistic performance model to execute scientific workflows on high-bandwidth-memory architectures such as the Intel Knights Landing. We provide a detailed analysis of the execution time on such platforms, taking into account transfers from both fast and slow memory and their overlap with computations. We discuss several scheduling and mapping strategies: not only tasks must be assigned to computing resources, but also one has to decide which fraction of input and output data will reside in fast memory and which will have to stay in slow memory. We use extensive simulations to assess the impact of the mapping strategies on performance. We also conduct experiments for a simple 1D Gauss-Seidel kernel, which assess the accuracy of the model and further demonstrate the importance of a tuned memory management. Our model and results lay the foundations for further studies and experiments on dual-memory systems. Neil A. Butcher (Notre Dame); Stephen L. Olivier, Jonathan W. Berry, and Simon D. Hammond (Sandia National Labs); and Peter M. Kogge (Notre Dame) Abstract Abstract Technologies such as Multi-Channel DRAM (MCDRAM) or High Bandwidth Memory (HBM) provide significantly more bandwidth than conventional memory. This trend has raised questions about how applications should manage data transfers between levels. This paper focuses on evaluating different usage modes of the MCDRAM in Intel Knights Landing (KNL) manycore processors. We evaluate these usage modes with a sorting kernel and a sorting- based streaming benchmark. We develop a performance model for the benchmark and use experimental evidence to demonstrate the correctness of the model. The model projects near-optimal numbers of copy threads for memory bandwidth bound computations. We demonstrate on KNL up to a 1.9X speedup for sort when the problem does not fit in MCDRAM over an OpenMP GNU sort that does not use MCDRAM Mohamed Mohamedin and Sebastiano Peluso (Virginia Tech), Masoomeh Javidi Kishi (Lehigh University), Ahmed Hassan (Alexandria University), and Roberto Palmieri (Lehigh University) Abstract Abstract In this paper we present Nemo, a NUMA-aware Transactional Memory (TM) design and implementation optimized for promoting scalability in applications running on top of NUMA architectures. Nemo deploys a hybrid design where conflicting threads alternate the usage of single timestamps and vector clocks to identify inconsistent executions depending upon the source of conflict. We assessed the performance of Nemo by using both synthetic and well-known OLTP transactional workloads. Our approach offers improvements over the six state-of-the-art competitors we implemented. Kai Xu (School of Software, Shandong University); Robin Kobus (Institute for Computer Science, Johannes Gutenberg University); Yuandong Chan, Ping Gao, and Xiangxu Meng (School of Software, Shandong University); Yanjie Wei (Shenzhen Institutes of Advanced Technology, CAS); Bertil Schmidt (Johannes Gutenberg University of Mainz); and Weiguo Liu (School of Software, Shandong University) Abstract Abstract Modern high throughput sequencing platforms can produce large amounts of short read DNA data at low cost. Error correction is an important but time-consuming initial step when processing this data in order to improve the quality of downstream analyses. In this paper, we present a Scalable Parallel Error CorrecToR designed to improve the throughput of DNA error correction for Illumina reads on various parallel platforms. Our design is based on a k-spectrum approach where a Bloom filter is frequently probed as a key operation and is optimized towards AVX-512-based multi-core CPUs, Xeon Phi many-cores (both KNC and KNL), and heterogeneous compute clusters. A number of architecture-specific optimizations are employed to achieve high performance such as memory alignment, vectorized Bloom filter probing, and a stack-based iteration to eliminate recursion. Our experiments show that our optimizations result in speedups of up to 2.8, 5.2, and 9.3 on a CPU (Xeon W-2123), a KNC-based Xeon Phi (31S1P), and a KNL-based Xeon Phi (7210), respectively, compared to a multi-threaded CPU reference implementation for the error correction stage. Furthermore, when executed on the same hardware, SPECTR achieves a speedup of up to 1.7, 2.1, 2.4, and 6.4, compared to the state-of-the-art tools Lighter, BLESS2, RECKONER, and Musket, respectively. In addition, our MPI implementation exhibits an efficiency of around 86% when executed on 32 nodes of the Tianhe-2 supercomputer. SPECTR is available at https://github.com/Xu-Kai/SPECTR. Wednesday 10:30am-12:30pm Straub Hall, Room 145 Paper Machine Learning and Networks Chair: Peter Pirkelbauer (University of Alabama at Birmingham, Computer Science) Haitao Zhang, Bingchang Tang, Xin Geng, and Huadong Ma (Beijing University of Posts and Telecomm. (BUPT)) Abstract Abstract Hybrid CPU-GPU cluster has become a promising computing paradigm for large-scale video analytics. However, the uncertainty and variability of workloads and heterogeneous resources in the cluster can lead to the unbalanced use of the hybrid computing resources and further cause the performance degradation of the computing platform. This problem becomes more challenging with the computation complexity and dependencies of video tasks in the hybrid cluster. In this paper, we focus on the video workload parallelization problem with fine-grained task division and feature description in the hybrid CPU-GPU cluster. Firstly, for achieving high resource utilization and task throughput, we propose a two-stage video task scheduling approach based on deep reinforcement learning. In our approach, a task execution node is selected by the cluster-level scheduler for the mutually independent video tasks, and then the node-level scheduler assigns the interrelated video subtasks to the appropriate computing units. By using the deep Q-network, the two-stage scheduling model is online learned to perform the current optimal scheduling actions according to the runtime status of cluster environments, the characteristics of video tasks, and the dependencies between video tasks. Secondly, based on the transfer learning technology, a scheduling strategy generalization method is proposed to efficiently rebuild the task scheduling model referring to the existing model. Finally, we conduct the extensive experiments to analyze the impact of the model parameters on the scheduling actions, and then the experimental results also validate that our learning based task scheduling approach outperforms the other widely used methods. GLP4NN: A Convergence-invariant and Network-agnostic Light-weight Parallelization Framework for Deep Neural Networks on Modern GPUs Hao Fu and Shanjiang Tang (Tianjin University), Bingsheng He (National University of Singapore), and Ce Yu and Jizhou Sun (Tianjin University) Abstract Abstract In this paper, we propose a network-agnostic and convergence-invariant light-weight parallelization framework, namely \emph{GLP4NN}, to accelerate the training of Deep Neural Networks (DNNs) by taking advantage of emerging GPU features, especially concurrent kernel execution. To determine the number of concurrent kernels on the fly, we design and implement an analytical model in the kernel analyzer module, and a compact asynchronous resource tracker in the resource tracker module to collect kernels' runtime configurations with low memory and time overheads. We further develop a runtime scheduler module to dispatch workloads to GPU devices. Moreover, we also provide a pool-based stream manager for handling GPU work queues to avoid consuming too many CPU threads or processes. In our experiments, we integrate GLP4NN into Caffe to accelerate the batch-based training of four well-known networks on NVIDIA GPUs, which is a common method adopted by CNNs and RNNs. Experimental results show GLP4NN is able to achieve a speedup of up to 4X over the original implementation as well as keep the convergence property of networks. Xinyu Chen, Jeremy Benson, and Matt Peterson (University of New Mexico); Michela Taufer (University of Tennessee); and Trilce Estrada (University of New Mexico) Abstract Abstract We present KeyBin2, a key-based clustering method that is able to learn from distributed data in parallel. KeyBin2 uses random projections and discrete optimizations to efficiently clustering very high dimensional data. Because it is based on keys computed independently per dimension and per data point, KeyBin2 can scale linearly. We perform accuracy and scalability tests to evaluate our algorithm's performance using synthetic and real datasets. The experiments show that KeyBin2 outperforms other parallel clustering methods for problems with increased complexity. Finally, we present an application of KeyBin2 for in-situ clustering of protein folding trajectories. Jiang Xiao, Zhuang Xiong, Song Wu, Yusheng Yi, Hai Jin, and Kan Hu (Huazhong University of Science & Technology) Abstract Abstract Disk failure has become a major concern with the rapid expansion of storage systems in data centers. Based on SMART (Self-Monitoring, Analysis and Reporting Technology) attributes, many researchers derive disk failure prediction models using machine learning techniques. Despite of the significant development, the majorities rely on offline training and thereby hinder their adaption to the continuous update of forthcoming data, suffering from the ‘model aging’ problem. We are therefore motivated to uncover the root cause – the dynamic SMART distribution for ‘model aging’, aiming to resolve the performance degradation as to pave a comprehensive study in practice. In this paper, we introduce a novel disk failure prediction model based on Online Random Forests (ORFs). Our ORF-based model can automatically evolve with sequentially arrival data on- the-fly and thus is highly adaptive to the variance of SMART distribution over time. Moreover, it has favourable advantage against the offline counterparts in terms of superior prediction performance. Experiments on real-world datasets show that our ORF model converges rapidly to the offline random forests and achieves stable failure detection rates of 93-99% with low false alarm rates. Furthermore, we demonstrate the ability of our approach on maintaining stable prediction performance for the long-term usage in data centers. Wednesday 10:30am-12:30pm Straub Hall, Room 245 Paper Networking Chair: Federico Silla (UPV) Qian Liu and Haipeng Dai (State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, Jiangsu); Alex X. Liu (State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, Jiangsu; Department of Computer Science and Engineering, Michigan State University, East Lansing, MI, USA.); Qi Li (Air Force Engineering University); and Xiaoyu Wang and Jiaqi Zheng (State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, Jiangsu) Abstract Abstract This paper proposes a new counter architecture for network measurement called Cache Assisted and randomizEd ShAring counteRs (CAESAR). One of the greatest challenges for per- flow traffic measurement is designing an online measurement module to keep pace with the highly increased link speed. To address this challenge, we use a fast on-chip memory as the cache before the slow off-chip SRAM counters, thereby de- creasing the accesses per flow to off-chip counters to improve time efficiency without any packet loss. We use randomized sharing counters among multiple flows in SRAM to achieve a compact data structure with high storage efficiency. By de-noising the impact from other flows which share counters with a certain flow, we theoretically analyze the expectation and confidence interval of its estimated flow size accurate- ly. In this paper, we use the real-world network traces to conduct software simulations, as well as FPGA experiments on a Xilinx Virtex-7 FPGA chip to verify our theoretical findings. The results show that CAESAR is up to 92.4% and 90% faster than prior work CASE and RCS respectively, and CAESAR decreases over half the average relative errors of both CASE and RCS. Md Shafayat Rahman, Md Atiqul Mollah, Peyman Faizian, and Xin Yuan (Florida State University) Abstract Abstract The Slim Fly topology has recently been proposed for the future generation supercomputers. It has small diameter and relies on the Universal Globally Adaptive Load-balanced (UGAL) routing, which adapts the routes between minimal (MIN) routing and Valiant Load-Balancing (VLB) routing to exploit the network capacity. In this work, we show that the current Slim Fly is not load-balanced for both MIN routing and VLB routing, in that certain links in the network have a significantly higher probability to carry traffic than others. As such, hot spots are more likely to form on such links. We propose two approaches to address this problem and to make Slim Fly load-balanced: (1) modifying the topology by selectively increasing the bandwidth of the potential hot-spot links so that the original routing becomes load-balanced, and (2) modifying the routing scheme by using a weighted VLB routing to distribute the traffic in a more load balanced fashion than the original VLB routing on the original Slim Fly. The results of our performance analysis and simulation demonstrate that both approaches result in more effective Slim Fly than its current form. Jiayao Wang (University of Shanghai for Science and Technology); Abdullah Al-Mamun (University of Nevada, Reno); Tonglin Li (Lawrence Berkeley National Laboratory); Linhua Jiang (University of Shanghai for Science and Technology); and Dongfang Zhao (University of California, Davis; University of Nevada, Reno) Abstract Abstract In a wireless sensor network (WSN) where nodes are mostly battery-powered, queries' energy consumption and response time are two of the most important metrics as they represent the network's sustainability and performance, respectively. Conventional techniques used to focus only one of the two metrics and did not attempt to optimize both in a coordinated manner. This work aims to achieve both high sustainability and high performance of WSN queries at the same time. To that end, a new mechanism is proposed to construct the topology of a three-tier WSN. The proposed mechanism eliminates routing tables and employs a novel and efficient addressing scheme inspired by the Chinese Remainder Theorem (CRT). The CRT-based topology allows for query parallelism, an unprecedented feature in WSNs. On top of the new topology encoded by CRT, a new protocol is designed to parallelly preprocess collected data on sensor nodes by effectively aggregating and deduplicating data in a neighborhood cluster. Moreover, a new algorithm is devised to allow the queries and results to be transmitted through low-power and fault-tolerant paths using recursive elections over a subset of the entire power range. With all these new techniques combined, the proposed system outperforms the state-of-the-art from various perspectives: (i) the query response is improved by up to 53\%; (ii) the energy consumption is reduced by up to 70\%; and (iii) the reliability is increased by up to 39\%. Anping He (School of Information Science and Engineering, Lanzhou University); Hong Chen (Institute of Microelectronics, Tsinghua University,Beijing National Research Center for Information Science and Technology); Guangbo Feng (School of Information Science and Engineering, Lanzhou University); Jiling Zhang (School of Physical Science and Engineering, Lanzhou University); Pengfei Li (School of Information Science and Engineering, Lanzhou University); and Yong Hei (Institute of Microelectronics Chinese Academy of Sciences) Abstract Abstract We have implemented an asynchronous mesh network. This paper describes our innovative design using a Click controller. Compared to designs that use other asynchronous circuit families with C-elements and four-phase bundled data, our two-phase Click-based Bounded Bundled Data design is faster, but introduces phase skews when handling concurrent traffic at a single node. Instead of eliminating the phase skews, we use them as computation slots. Our network uses a novel asynchronous arbiter with a queue that can accept data from both the four cardinal directions as well as from a local source, five directions in all. We have implemented our network design in 1 × 1, 2 × 2 and 4 × 4 sizes, larger network could be implemented easier since the isomorphism and modularity of the routing nodes. Our experiments show that an initial data item passes through a node in 157ns v.s. 81ns for non-delaybranch and delay-branch designs separately. Following items take about 65% as long. But for a network, the average latency of a node keeps almost same for different paths. We believe that with the non-delay-branch designs, our asynchronous mesh network could offer 10.1M routes per second for a 1 × 1 network and 5.33M routes per second for 2 × 2 or 5.06M for 4 × 4 networks, and work at the rate of 17.3M, 10.1M and 11.7M with the enhanced delay-branch way. For both cases, its latency is approximately linear with scale. Wednesday 2:00pm-3:30pm Straub Hall, Room 156 Paper Performance Studies Chair: Filippo Mantovani (Barcelona Supercomputing Center) Meng Tang and Mohamed Gadou (University of Florida), Steven Rennich (Nvidia), Timothy A. Davis (Texas A&M University), and Sanjay Ranka (University of Florida) Abstract Abstract Scientific computing relies heavily on matrix factorization. Cholesky factorization is typically used to solve the linear equation system Ax=b where A is symmetric and positive definite. A large number of applications require operating on sparse matrices. A major overhead with factorization of sparse matrices on GPUs is addressing the cost of transferring the data from the CPU to the GPU. Additionally, the computational efficiency of factorization of small dense matrices has to be addressed. Jan Hückelheim (Imperial College London), Paul Hovland and Sri Hari Krishna Narayanan (Argonne National Laboratory), and Paulius Velesko (Intel Corporation) Abstract Abstract Ensemble computations are used to evaluate a function for multiple inputs, for example in uncertainty quantification. Embedded ensemble computations perform several evaluations within the same program, often enabling a reduced overall runtime by exploiting vectorisation and parallelisation opportunities that are not present in individual ensemble members. This is challenging if members take different control flow paths. We present a source-to-source transformation that turns a given C program into an embedded ensemble program that computes members in a single-instruction-multiple-data fashion using OpenMP SIMD pragmas. We use techniques from whole-function vectorisation, achieving effective vectorisation for moderate amounts of branch divergence, particularly on processors with masked instructions such as recent Xeon Phi or Skylake processors with AVX-512. Moritz von Looz, Charilaos Tzovas, and Henning Meyerhenke (University of Cologne) Abstract Abstract Mesh partitioning is an indispensable tool for efficient parallel numerical simulations. Its goal is to minimize communication between the processes of a simulation while achieving load balance. Established graph-based partitioning tools yield a high solution quality; however, their scalability is limited. Geometric approaches usually scale better, but their solution quality may be unsatisfactory for non-trivial mesh topologies. Wednesday 2:00pm-3:30pm Straub Hall, Room 145 Paper Machine Learning Chair: Wu Feng (Virginia Tech) Oguz Kaya (Inria Bordeaux), Ramakrishnan Kannan (Oak Ridge National Laboratory), and Grey Ballard (Wakeforest University) Abstract Abstract Non-negative matrix factorization (NMF), the problem of finding two non-negative low-rank factors whose product approximates an input matrix, is a useful tool for many data mining and scientific applications such as topic modeling in text mining and unmixing in microscopy. In this paper, we focus on scaling algorithms for NMF to very large sparse datasets and massively parallel machines by employing effective algorithms, communication patterns, and partitioning schemes that leverage the sparsity of the input matrix. We consider two previous works developed for related problems, one that uses a fine-grained partitioning strategy using a point-to-point communication pattern and one that uses a Cartesian, or checkerboard, partitioning strategy using a collective-based communication pattern. We show that a combination of the previous approaches balances the demands of the various computations within NMF algorithms and achieves high efficiency and scalability. From the experiments, we see that our proposed strategy runs up to 10x faster than the state of the art on real-world dataset Connor Imes (University of Chicago), Steven Hofmeyr (Lawrence Berkeley National Laboratory), and Henry Hoffmann (University of Chicago) Abstract Abstract Resource scheduling in high performance computing (HPC) usually aims to minimize application runtime rather than optimize for energy efficiency. Most existing research on reducing power and energy consumption imposes the constraint that little or no performance loss is allowed, which improves but still does not maximize energy efficiency. By optimizing for energy efficiency instead of application turnaround time, we can reduce the cost of running scientific applications. We propose using machine learning classification, driven by low-level hardware performance counters, to predict the most energy-efficient resource settings to use during application runtime, which unlike static resource scheduling dynamically adapts to changing application behavior. We evaluate our approach on a large shared-memory system using four complex bioinformatic HPC applications, decreasing energy consumption over the naive race scheduler by 20% on average, and by as much as 38%. An average increase in runtime of 31% is dominated by a 39% reduction in power consumption, from which we extrapolate the potential for a 24% increase in throughput for future over-provisioned, power-constrained clusters. This work demonstrates that low-overhead classification is suitable for dynamically optimizing energy efficiency during application runtime. Michael R. Wyatt and Stephen Herbein (University of Delaware); Todd Gamblin, Adam Moody, and Dong H. Ahn (Lawrence Livermore National Laboratory); and Michela Taufer (ACM) Abstract Abstract For job allocation decision, current batch schedulers have access to and use only information on the number of nodes and runtime because it is readily available at submission time from user job scripts. User-provided runtimes are typically inaccurate because users overestimate or lack understanding of job resource requirements. Beyond the number of nodes and runtime, other system resources, including IO and network, are not available but play a key role in system performance. There is the need for automatic, general, and scalable tools that provide accurate resource usage information to schedulers so that, by becoming resource-aware, they can better manage system resources. Wednesday 2:00pm-3:30pm Straub Hall, Room 245 Paper Materials and Molecular Dynamics Chair: Jose Canales (UCAM) Shigang Li, Baodong Wu, and Yunquan Zhang (Institute of Computing Technology, Chinese Academy of Sciences); Xiangmeng Wang, Jianjiang Li, and Changjun Hu (University of Science and Technology Beijing); and Jue Wang, Yangde Feng, and Ningming Nie (Computer Network Information Center, Chinese Academy of Sciences) Abstract Abstract The limitation of simulation scales leads to a gap between simulation results and physical phenomena. This paper reports our efforts on increasing the scalability of metal material microscopic damage simulation on the Sunway TaihuLight supercomputer. We use a multiscale modeling approach that couples Molecular Dynamics (MD) with Kinetic Monte Carlo (KMC). According to the characteristics of metal materials, we design a dedicated data structure to record the neighbor atoms for MD, which significantly reduces the memory consumption. Data compaction and double buffer are used to reduce the data transfer overhead between the main memory and the local store. We propose an on-demand communication strategy for KMC to remarkably reduce the communication overhead. We simulate 4*10^12 atoms on 6,656,000 master+slave cores using MD with 85% parallel efficiency. Using the coupled MD-KMC approach, we simulate 3.2*10^10 atoms in 19.2 days temporal scale on 6,240,000 master+slave cores with runtime of 8.6 hours. Combining Task-based Parallelism and Adaptive Mesh Refinement Techniques in Molecular Dynamics Simulations Raphaël PRAT and Laurent COLOMBET (CEA) and Raymond NAMYST (University of Bordeaux, INRIA) Abstract Abstract Modern parallel architectures require applications to generate massive parallelism so as to feed their large number of cores and their wide vector units. We revisit the extensively studied classical Molecular Dynamics N-body problem in the light of these hardware constraints. We use Adaptive Mesh Refinement techniques to store particles in memory, and optimize the force computation loop using multi-threading and vectorization-friendly data structures. Our design is guided by the need for load balancing and adaptavity raised by highly dynamic particle sets, as typically observed in simulations of strong shocks resulting in material micro-jetting. We analyze performance results on several simulation scenarios, over nodes equipped with Intel Xeon Phi Knights Landing (KNL) or Intel Xeon Skylake (SKL) processors. Performance obtained with our OpenMP implementation outperforms state-of-the-art implementations (LAMMPS) on both steady and micro-jetting particles simulations. In the latter case, our implementation is 4.7 times faster on KNL, and 2 times faster on SKL. Ioannis Paraskevakos (Rutgers University); Andre Luckow (Ludwig-Maximilians University); Mahzad Khoshlessan (Arizona State University); George Chantzialexiou (Rutgers University); Thomas E. Cheatham (University of Utah); Oliver Beckstein (Arizona State University); Geoffrey C. Fox (Indiana University); and Shantenu Jha (Rutgers University, Brookhaven National Laboratory) Abstract Abstract Different parallel frameworks for implementing data analysis applications have been proposed by the HPC and Big Data communities. In this paper, we investigate three task-parallel frameworks: Spark, Dask and RADICAL-Pilot with respect to their ability to support data analytics on HPC resources and compare them with MPI. We investigate the data analysis requirements of Molecular Dynamics (MD) simulations which are significant consumers of supercomputing cycles, producing immense amounts of data. A typical large-scale MD simulation of a physical system of O(100k) atoms over μsecs can produce from O(10) GB to O(1000) GBs of data. We propose and evaluate different approaches for parallelization of a representative set of MD trajectory analysis algorithms, in particular the computation of path similarity and leaflet identification. We evaluate Spark, Dask and RADICAL-Pilot with respect to their abstractions and runtime engine capabilities to support these algorithms. We provide a conceptual basis for comparing and understanding different frameworks that enable users to select the optimal system for each application. We also provide a quantitative performance analysis of the different algorithms across the three frameworks. Wednesday 4:00pm-5:30pm Straub Hall, Room 156 Paper Programming Models Chair: Olga Pearce (Lawrence Livermore National Laboratory) Brandon Hedden and Xinghui Zhao (Washington State University) Abstract Abstract The Actor model of concurrency provides a convenient way to build large-scale distributed systems, harnessing its inherent parallelism and the message-driven nature. As a result, the Actor model has been widely used in a variety of distributed applications, especially cloud-based applications. However, software bugs in actor systems are not well studied. In this paper, we present a comprehensive analysis on known bugs in actor-based systems. The contributions of this paper are two-fold. First, we have created an actor bug taxonomy, and carefully examined and categorized bugs reported in a number of real-world actor systems. Second, we have compared the characteristics actor bugs with bugs reported in cloud systems in general. Our results shed lights on understanding the benefits and unique challenges for utilizing the Actor model for system development, as well as deployment in open systems like clouds. A Framework for Auto-Parallelization and Code Generation: An Integrative Case Study with Legacy FORTRAN Codes Konstantinos Krommydas (Intel Corporation), Paul Sathre (Virginia Tech), Ruchira Sasanka (Intel Corporation), and Wu-chun Feng (Virginia Tech) Abstract Abstract GLAF, short for Grid-based Language and Auto-parallelization Framework, is a programming framework that seeks to democratize parallel programming by facilitating better productivity in parallel computing via an intuitive graphical programming interface (GPI) that automatically parallelizes and generates code in many languages. Originally, GLAF addressed program development from scratch via the GPI; but this unduly restricted GLAF’s utility to creating new codes only. Thus, this paper extends GLAF by enabling program development from pre-existing kernels of interest, which can then be easily and transparently integrated into existing legacy codes. Specifically, we address the theoretical and practical limitations of integration and interoperability of auto-generated parallel code within existing FORTRAN codes; enhance GLAF to overcome these limitations; and present an integrative case study and evaluation of the enhanced GLAF via the implementation of important kernels in two NASA codes: (1) the Synoptic Surface & Atmospheric Radiation Budget (SARB), part of the Clouds and the Earth’s Radiant Energy System (CERES), and (2) the Fully Unstructured Navier-Stokes (FUN3D) suite for computational fluid dynamics. Nathan Hjelm (Los Alamos National Lab, University of New Mexico); Matthew Dosanjh and Ryan Grant (Sandia National Laboratories); Taylor Groves (Lawrence Berkeley National Laboratory); Patrick Bridges (University of New Mexico); and Dorian Arnold (Emory University, University of New Mexico) Abstract Abstract One-sided communication is crucial to enabling communication concurrency. As core counts have increased, particularly with many-core architectures, one-sided (RMA) communication has been proposed to address the ever increasing contention at the %there are more processes trying to use a single network interface. The difficulty in using one-sided (RMA) communication with MPI is that the performance of MPI implementations using RMA with multiple concurrent threads is not well understood. Past studies have been done using MPI RMA in combination with multi-threading (RMA-MT) but they have been performed on older MPI implementations lacking RMA-MT optimizations. In addition prior work has only been done at smaller scale ($<=$512 cores). Wednesday 4:00pm-5:30pm Straub Hall, Room 145 Paper Resilience and Reliability Chair: Allen Malony (University of Oregon) Characterizing the Impact of Soft Errors Affecting Floating-point ALUs using RTL-level Fault Injection omer subasi (Pacific Northwest National Laboratory), Chun-Kai Chang and Mattan Erez (The University of Texas at Austin), and Sriram Krishnamoorthy (Pacific Northwest National Laboratory) Abstract Abstract In this paper, we study the propagation of soft errors affecting floating-point ALUs from the micro-architecture to architecturally visible registers. We perform an extensive evaluation to understand the impact of errors on the type of errors that can be observed by a user program. The analysis can be used to guide the design of software-based fault-injection techniques that match the anticipated micro-architectural behavior. Zhichao Yan and Hong Jiang (University of Texas at Arlington), Witawas Srisa-an (University of Nebraska at Lincoln), Sharad Seth (University of Nebraska-Lincoln), and Yujuan Tan (Chongqing University) Abstract Abstract Soft error is a type of transient errors that occur due in part to reductions in capacitance and operating voltages in modern electronic components. Recently, the problem of soft errors has become more prevalent due to several design factors, including aggressive device scaling and newer energy-efficient designs, thus significantly threatening the reliability of computer systems. Since the occurrence of soft errors is non-deterministic, detecting them and recovering from them can be quite challenging. A common way to detect soft errors is to execute two identical program instances and then compare their results. Although this approach is effective, it is not efficient as both non-trivial computation and memory resources must be invested to support such redundant executions. Kai Wu and Wenqian Dong (University of California Merced), Qiang Guan (Kent State University), Nathan DeBardeleben (Los Alamos National Laboratory), and Dong Li (University of California Merced) Abstract Abstract Understanding how the application is resilient to hardware and software errors is critical to high-performance computing. To evaluate application resilience, the application level fault injection is the most common method. However, the application level fault injection can be very expensive when running the application in parallel in large scales due to the high requirement for hardware resource during fault injection. Wednesday 4:00pm-5:30pm Straub Hall, Room 245 Paper Performance of Sparse Algorithms Chair: Sameer Shende (University of Oregon) Xinliang Wang, Ping Xu, and Wei Xue (Tsinghua University, National Supercomputer Center in Wuxi); Yulong Ao (School of Mathematical Sciences,Peking University); Chao Yang (School of Mathematical Sciences &National Engineering Laboratory forVideo Technology, Peking University); Haohuan Fu, Lin Gan, and Guangwen Yang (Tsinghua University, National Supercomputing Center in Wuxi); and Weimin Zheng (Tsinghua University) Abstract Abstract The sparse triangular solver, SpTRSV, is one of the most important kernels in many scientific and engineering applications. Efficiently parallelizing the SpTRSV on modern many-core architectures is considerably difficult due to the inherent dependency of tasks, and the frequent but discontinuous memory accesses. Achieving high performance of SpTRSV is even more challenging for SW26010, the new-generation customized heterogeneous many-core processor equipped in the Sunway TaihuLight supercomputer. The known parallel SpTRSVs have to be refactored to fit the single-thread and cacheless design of SW26010. In this work, we focus on how to design and implement fast SpTRSV for structured grid problems on SW26010. A generalized algorithm framework of parallel SpTRSV is proposed for best utilization of the features and flexibilities of SW26010 many-core architecture according to the fine-grained Producer-Consumer model. Moreover, a novel parallel structured-grid SpTRSV is presented by using direct data transfers across registers of the computing elements of SW26010. Experiments on four typical structured-grid triangular problems with different problem sizes demonstrate that our SpTRSV can achieve an average bandwidth utilization of 81.7\%, which leads to a speedup of 17 over serial method on the management processing element of SW26010. And experiments with linear sparse solvers show that this new SpTRSV can achieve superior performance over the latest Intel Xeon CPU and Intel KNL over DDR4 memory. Qiao Sun, Changyou Zhang, Changmao Wu, Leisheng Li, and Jiajia Zhang (Institute of Software, Chinese Academy of Sciences) Abstract Abstract SpMV (Sparse Matrix Vector multiplication), y = Ax, multiplies a sparse matrix with a dense vector and is a widely used computing primitive in the domain of HPC. On the newly SW26010 many-core platform, we propose a highly efficient CSR (Compressed Storage Row) based implementation of parallel SpMV, referred to as SW_CSR-SpMV. SpMV in the CSR format can be trivially parallelized but is considered a bandwidth bounded problem, and therefore to avoid redundant bandwidth usage becomes of great importance. The original problem is sequentially partitioned into row-slices, each of which can reside in the fast scratchpad memory, so that the loaded x'es can be reused; meanwhile, a dynamic look-ahead scheme is applied to avoid redundant memory access; we split the many-core mesh into smaller communication scope to facilitate the sharing of the common data across the working threads via the high speed on-mesh data bus. Beyond the above, to leverage massive parallelism balanced workload is ensured by both static and dynamic means. Performance evaluation is done on a benchmark of 36 frequently used sparse matrices in the fields of graph computing, computational fluid dynamics etc.. While the performance upper-bound is defined by the ratio between the minimal data access volume required against the practically optimal bandwidth, ignoring the computing overhead, SW_CSR-SpMV can achieve an efficiency of nearly 87%, maintaining over 75% for 1/3 of the testing matrices. SW_CSR-SpMV is further applied in a PETSc based application, a 1.75x-2.6x speedup is sustained in a multi-process environment on the Sunway TaiHuLight supercomputer. Hong Zhang and Richard T. Mills (Argonne National Laboratory), Karl Rupp (TU Wien), and Barry F. Smith (Argonne National Laboratory) Abstract Abstract Emerging many-core CPU architectures with high degrees of single-instruction, multiple data (SIMD) parallelism promise to enable increasingly ambitious simulations based on partial differential equations (PDEs) via extreme-scale computing. However, such architectures present several challenges to their efficient use. Here, we explore the efficient implementation of sparse matrix-vector (SpMV) multiplications---a critical kernel for the workhorse iterative linear solvers used in most PDE-based simulations---on recent CPU architectures from Intel as well as the second-generation Knights Landing Intel Xeon Phi, which features many CPU cores, wide SIMD lanes, and on-package high-bandwidth memory. Traditional SpMV algorithms use compressed sparse row storage format, which is a hindrance to exploiting wide SIMD lanes. We study alternative matrix formats and present an efficient optimized SpMV kernel, based on a sliced ELLPACK representation, which we have implemented in the PETSc library. In addition, we demonstrate the benefit of using this representation to accelerate preconditioned iterative solvers in realistic PDE-based simulations in parallel. Thursday 9:00am-10:00am Straub Hall, Room 156 Plenary Talk Plenary Chair: Mary Hall (University of Utah) Bringing Sparse Computations into the Optimization Light Mary Hall (University of Utah) Biography Biography Mary Hall (University of Utah) Mary Hall is a professor in the School of Computing at University of Utah. She received a PhD in Computer Science from Rice University. Her research focus brings together compiler optimizations targeting current and future high-performance architectures on real-world applications. Hall's prior work has developed compiler techniques for exploiting parallelism and locality on a diversity of architectures: automatic parallelization for SMPs, superword-level parallelism for multimedia extensions, processing-in-memory architectures, FPGAs and more recently many-core CPUs and GPUs. Professor Hall is an ACM Distinguished Scientist and ACM’s representative on the Computing Research Association Board of Directors. She is deeply interested in computing history, having served on the ACM History Committee for a decade and as chair from 2009-2014. She also actively participates in outreach programs to encourage the participation of women and underrepresented minorities in computer science. Abstract Abstract Scalable computations where the data is sparse — that is, a tiny subset of the data is populated — are widely represented in scientific computing and big data analytics. Sparse data are typically represented by sparse matrices and graphs, which reduce data storage and computation requirements (e.g., by storing only nonzero data elements) through the use of indirection matrices such as B in A[B[i]]. This indirection impacts performance on modern architectures. Whether the code is optimized by human programmer, compiler or hardware, any optimizations that must understand the memory access patterns for A require the values of B, which are not known until runtime. Over time, many optimizations of sparse computations have been developed that simplify the code and reduce the amount of indirection through custom data representations, but these representations must also rely on runtime knowledge of the nonzero structure of the sparse data. In this talk, we describe a framework for composing compile-time and runtime optimizations of sparse codes, and demonstrate its effectiveness at achieving performance comparable to manually-tuned code for applications in scientific computing and data analytics. Thursday 10:30am-12:30pm Straub Hall, Room 156 Paper Resource Management Chair: Taisuke Boku (University of Tsukuba) Donglin Yang, Wei Rang, and Dazhao Cheng (University of North Carolina at Charlotte) Abstract Abstract As MapReduce is becoming increasingly popular in large-scale data analysis, there is a growing need for moving MapReduce into multi-tenant clouds. However, there is an important challenge that the performance of MapReduce applications can be significantly influenced by the time-varying network bandwidth in a shared cluster. Although a few recent studies improve MapReduce performance by dynamic scheduling to reduce the shuffle traffic, most of them do not consider the impact by widely existing hierarchical network architectures in data centers. In this paper, we propose and design a Hierarchical topology (Hit) aware MapReduce scheduler to minimize overall data traffic cost and hence to reduce job execution time. We first formulate the problem as a Topology Aware Assignment (TAA) optimization problem while considering dynamic computing and communication resources in the cloud with hierarchical network architecture. We further develop a synergistic strategy to solve the TAA problem by using the stable matching theory, which ensures the preference of both individual tasks and hosting machines. Finally, we implement the proposed scheduler as a pluggable module on Hadoop YARN and evaluate its performance by testbed experiments and simulations. The experimental results show Hit-scheduler can improve job completion time by 28% and 11% compared to Capacity Scheduler and Probabilistic Network-Aware scheduler, respectively. Our simulations further demonstrate that Hit-scheduler can gain the traffic cost by 38% at most and improve the average shuffle flow traffic time by 32% compared to Capacity scheduler. Huazhe Zhang and Henry Hoffmann (University of Chicago) Abstract Abstract Large scale parallel machines are subject to system-wide power budgets (or caps). As these machines grow in capacity, they can concurrently execute dependent applications that were previously processed serially. Such application coupling saves IO and time as the applications now communicate at runtime instead of through disk. Such coupled applications are predicted to be a major workload for future exascale supercomputers; e.g., scientific simulations will execute concurrently with in situ analysis. While support for system-wide power caps has been widely studied, prior work does not consider the impact on coupled applications. Minghao Zhao and Zhenhua Li (Tsinghua University); Ennan Zhai (Yale University); Gareth Tyson (Queen Mary University of London); Chen Qian (University of California, Santa Cruz); Zhenyu Li (Institute of Computing Technology, Chinese Academy of Sciences); and Leiyu Zhao (Tsinghua University) Abstract Abstract Object storage clouds (e.g., Amazon S3) have become extremely popular due to their highly usable interface and cost-effectiveness. They are, therefore, widely used by various applications (e.g., Dropbox) to host user data. However, because object storage clouds are flat and lack the concept of a directory, it becomes necessary to maintain file meta-data and directory structure in a separate index cloud. This paper investigates the possibility of using a single object storage cloud to efficiently host the whole filesystem for users, including both the file content and directories, while avoiding meta-data loss caused by index cloud failures. We design a novel data structure, Hierarchical Hash (or H2), to natively enable the efficient mapping from filesystem operations to object-level operations. Based on H2, we implement a prototype system, H2Cloud, that can maintain large filesystems of users in an object storage cloud and support fast directory operations. Both theoretical analysis and real-world experiments confirm the efficacy of our solution: H2Cloud achieves faster directory operations than OpenStack Swift by orders of magnitude, and has similar performance to Dropbox but yet does not need a separate index cloud. Xuesong Li (Tsinghua University, High-Tech Institute of Xi'an); Wenxue Cheng, Tong Zhang, Jing Xie, and Fengyuan Ren (Tsinghua University); and Bailong Yang (High-Tech Institute of Xi'an) Abstract Abstract Recently, high performance packet I/O frameworks are expected an extensive application for their ability to process packets from 10Gbps or higher speed links. To achieve high throughput and low latency, high performance packet I/O frameworks usually employ busy polling technique. As busy polling will burn all CPU cycles even if there's no packet to process, these frameworks are quite power inefficient. Meanwhile, exploiting power management techniques such as DVFS and LPI in high performance packet I/O frameworks is challenging, because neither the OS nor the frameworks can provide information (e.g., the actual CPU utilization, available idle period, or the target frequency) required by power management techniques. In this paper, we establish an analytical model that can formulate the packet processing flow of high performance packet I/O to help address the above challenges. From the analytical model, we can deduce the actual CPU utilization and average idle period in different traffic load, and gain the insight to choose CPU frequency that can appropriately balance the power consumption and packet latency. Then, we propose two simple but effective approaches to conduct power conservation for high performance packet I/O: one with the aid of traffic information and the other without. Experiments with Intel DPDK show that both approaches can achieve significant power reduction (35.90% and 34.43% on average respectively) while incurring <1us of latency increase. Thursday 10:30am-12:30pm Straub Hall, Room 145 Paper Memory and Caching Chair: Sameer Shende (University of Oregon) Xi Wang, John D. Leidel, and Yong Chen (Texas Tech University) Abstract Abstract Arguably, many data-intensive applications pose significant challenges to conventional architectures and memory systems, especially when applications exhibit non-contiguous, irregular, and small memory access patterns. The long memory access latency can dramatically slow down the overall performance of applications. The growing desire of high memory bandwidth and low latency access stimulate the advent of novel 3D-staked memory devices such as the Hybrid Memory Cube (HMC), which provides significantly higher bandwidth compared with the conventional JEDEC DDR devices. Even though many existing studies have been devoted to achieving high bandwidth throughput of HMC, the bandwidth potential cannot be fully exploited due to the lack of highly efficient memory coalescing and interfacing methodology for HMC devices. In this research, we introduce a novel memory coalescer methodology that facilitates memory bandwidth efficiency and the overall performance through an efficient and scalable memory request coalescing interface for HMC. We present the design and implementation of this approach on RISC-V embedded cores with attached HMC devices. Our evaluation results show that the new memory coalescer eliminates 47.47% memory accesses to HMC and improves the overall performance by 13.14% on average. Muhammad Rafique and Zhichun Zhu (University of Illinois at Chicago) Abstract Abstract Prefetching is a well-studied technique where data is fetched from main memory ahead of time speculatively and stored in caches or dedicated prefetch buffer. With the introduction of Hybrid Memory Cube (HMC), a 3-D memory module with multiple memory layers stacked over a single logic layer using thousands of Through Silicon Vias (TSVs), huge internal bandwidth availability makes memory-side prefetching a more efficient approach to improving system performance. In this paper, we introduce a memory-side prefetching scheme for HMC based main memory system that utilizes its logic area and exploits the huge internal bandwidth provided by TSVs. Our scheme closely monitors the access pattern to memory banks and make intelligent prefetch decisions for rows with high utilization or causing row buffer conflicts. We also introduce a prefetch buffer management scheme that makes replacement decision within the prefetch buffer based on both the utilization and recency of the prefetched rows. Our simulation results indicate that our approach improves performance by 17.9% on average, compared to a baseline scheme that prefetches a whole row on every memory request. Our scheme also outperforms an existing memory-side prefetching scheme by 8.7% on average, which dynamically adjusts the prefetch degree based on the usefulness of prefetched data. Xian Zhu, Robert Wernsman, and Joseph Zambreno (Iowa State University) Abstract Abstract A modern Graphics Processing Unit (GPU) utilizes L1 Data (L1D) caches to reduce memory bandwidth requirements and latencies. However, the L1D cache can easily be overwhelmed by many memory requests from GPU function units, which can bottleneck GPU performance. It is shown that the performance of L1D caches is greatly reduced for many GPU applications as a large amount of L1D cache lines are replaced before they are re-referenced. By examining the cache access patterns of these applications, we observe L1D caches with low associativity have difficulty capturing data locality for GPU applications with diverse reuse patterns. These patterns result in frequent line replacements and low data re-usage. Yuanrong Wang and Xueqi Li (Institute of Computing Technology, Chinese Academy of Sciences; University of Chinese Academy of Sciences) and Dawei Zang, Guangming Tan, and Ninghui Sun (Institute of Computing Technology, Chinese Academy of Sciences) Abstract Abstract The deluge of genomics data is incurring prohibitively high computational costs. As an important building block for genomic data processing algorithms, FM-index search occupies most of the execution time in sequence alignment. Due to massive random streaming memory references relative to only small amount of computations, FM-index search algorithm exhibits extremely low efficiency on conventional architectures. This paper proposes Niubility, an accelerator for FM-index search in genomic sequence alignment. Based on our algorithm-architecture co-design analysis, we found that conventional architectures exploit low memory-level parallelism so that the available memory bandwidth cannot be fully utilized. Niubility accelerator customizes bit-wise operations and exploits data-level parallelism, that produces maximal concurrent memory accesses to saturate memory bandwidth. We implement an accelerator ASIC in an ST 28nm process that achieves up to 990x speedup over the state-of-the-art software. Thursday 10:30am-12:30pm Straub Hall, Room 245 Paper Runtime Systems Chair: Olga Pearce (Lawrence Livermore National Laboratory) D. Brian Larkins and John Snyder (Rhodes College) and James Dinan (Intel Corporation) Abstract Abstract Many HPC applications have successfully applied Partitioned Global Address Space (PGAS) parallel programming models to efficiently manage shared data that is distributed across multiple nodes in a distributed memory system. However, while the flat addressing model provided by PGAS systems is effective for regular array data, it renders such systems difficult to use with loosely-structured or sparse data. This work proposes a logically addressed PGLAS model that naturally supports a variety of data models through the automatic mapping of an application-defined key space onto the underlying distributed memory system. We present an efficient implementation of the PGLAS model built atop a parallel distributed hash table (PDHT) and demonstrate that this model is amenable to offloading using the Portals 4 network programming interface. We demonstrate the effectiveness of PDHT using representative applications from the computational chemistry and genomics domains. Results indicate that PGLAS models such as PDHT provide a promising new method for parallelizing applications with non-regular data. Hoang-Vu Dang and Marc Snir (University of Illinois) Abstract Abstract This paper describes FULT, a User-Level Threading scheduling system that uses bit-vectors to represent runnable threads. This system is aimed at efficient support of event driven task scheduling. We show a significant reduction in the cost of signal and wait primitives, high scalability, and similar performance for task spawning and other operations, compared conventional task schedulers that use work queues. Jason Hiebel, Laura E. Brown, and Zhenlin Wang (Michigan Technological University) Abstract Abstract Virtualization technology is a key component for data center management which allows for multiple users and applications to share a single, physical machine. Modern virtual machine monitors utilize both software and hardware-assisted paging for memory virtualization, however neither paging mode is always preferable. Previous studies have shown that dynamic selection, which at runtime selects paging modes according to relevant performance metrics, can be effective in tailoring memory virtualization to program workload. However, these approaches require low-level manual analysis, or depend on prior knowledge of workload characteristics and phasing. Matthew G. F. Dosanjh (Sandia National Labratories); S. Mahdieh Ghazimirsaeed (Queen’s University); Ryan E. Grant and Whit Schonbein (Sandia National Laboratories, University of New Mexico); Michael J. Levenhagen (Sandia National Laboratories); Patrick G. Bridges (Univeristy of New Mexico); and Ahmad Afsahi (Queen’s University) Abstract Abstract The performance critical path for MPI implementations relies on fast receive side operation, which in turn requires fast list traversal. The performance of list traversal is dependent on data-locality; whether the data is currently contained in a close-to-core cache due to its temporal locality or if its spacial locality allows for predictable pre- fetching. In this paper, we explore the effects of data locality on the MPI matching problem by examining both forms of locality. First, we explore spacial locality, by combining multiple entries into a single linked list element, we can control and modify this form of locality. Secondly, we explore temporal locality by utilizing a new technique called “hot caching”, a process that creates a thread to periodically access certain data, increasing its temporal locality. In this paper, we show that by increasing data locality, we can improve MPI perfor- mance on a variety of architectures up to 4x for micro-benchmarks and up to 2x for an application. Thursday 2:00pm-3:30pm Straub Hall, Room 156 Paper Parallel and Distributed Algorithms Chair: Kamesh Madduri (Pennsylvania State University) João Paulo de Araujo and Luciana Arantes (Sorbonne Université, CNRS, INRIA, LIP6); Elias P. Duarte Júnior (Federal University of Paraná); Luiz A. Rodrigues (Western Paraná State University); and Pierre Sens (Sorbonne Université, CNRS, INRIA, LIP6) Abstract Abstract A causal broadcast ensures that messages are delivered to all nodes (processes) preserving causal relation of the messages. In this paper, we propose a causal broadcast protocol for distributed systems whose nodes are logically organized in a virtual hypercube-like topology called VCube. Messages are broadcast by dynamically building spanning trees rooted in the message's source node. By using multiple trees, the contention bottleneck problem of a single root spanning tree approach is avoided. Furthermore, different trees can intersect at some node. Hence, by taking advantage of both the out-of-order reception of causally related messages at a node and these paths intersections, a node can delay to one or more of its children in the tree, the forwarding of the messages whose some causal dependencies it knows that the children in question can not satisfy yet. Such a delay does not induce any overhead. Experimental evaluation conducted on top of PeerSim simulator confirms the communication effectiveness of our causal broadcast protocol in terms of latency and message traffic reduction. Saurabh Kalikar and Rupesh Nasre (Indian Institute of Technology Madras) Abstract Abstract We study multi-granularity locking in hierarchies. Existing popular mechanisms for multi-granularity locking work on two extremes: fine-grained locking which maximizes concurrency, and coarse- grained locking which minimizes the locking cost. Between the two extremes, there exist several pareto-optimal options which provide a trade-o between the concurrency cost and the locking cost. These options are captured under multi-granularity locking (MGL) protocol, but there exists no methodical way to choose a good MGL option. In this work, we present a locking technique, NumLock, which chooses optimal locking combination to serve MGL requests. NumLock works in two phases: first, it generates few pareto-optimal MGL options and second, it quickly analyzes these options to report the optimal one. We propose a greedy algorithm to generate a subset of the pareto-optimal options. We further improve the concurrency by abstracting the non-planar hierarchy as logical planar sub-hierarchies using multiple intervals. Our study reveals that the lock manager must explore various MGL options for achieving the best parallel performance, and that the extreme options often used in practice are suboptimal. We validate our claim quantitatively using an XML database as well as STMBench7, and show that the intermediate MGL options perform better compared to the existing state-of-the-art methods. In particular, NumLock achieves 65% reduction in execution time for the XML database, and 25% throughput improvement on STMBench7. Fei Wang and Xiaofeng Gao (Shanghai Jiao Tong University), Jun Ye (Intel Asia-Pacificn R&D Ltd.), and Guihai Chen (Shanghai Jiao Tong University) Abstract Abstract Variance reduction (VR) techniques for convergence rate acceleration of stochastic gradient descent (SGD) algorithm have been developed with great efforts recently. VR's two variants, stochastic variance-reduced-gradient (SVRG-SGD) and importance sampling (IS-SGD) have achieved remarkable progresses. Meanwhile, asynchronous SGD (ASGD) is becoming more critical due to the ever-increasing scale of the optimization problems. The application of VR in ASGD to accelerate its convergence rate has therefore attracted much interest and SVRG-ASGDs are therefore proposed. However, we found that SVRG suffers dissatisfying performance in accelerating ASGD when the datasets are sparse and large-scale. In such case, SVRG-ASGD's iterative computation cost is magnitudes higher than ASGD which makes it very slow. On the other hand, IS achieves improved convergence rate with few extra computation cost and is invariant to the sparsity of dataset. This advantage makes it very suitable for the acceleration of ASGD for large-scale sparse datasets. In this paper we propose a novel IS-combined ASGD for effective convergence rate acceleration, namely, IS-ASGD. We theoretically prove the superior convergence bound of IS-ASGD. Experimental results also demonstrate our statements. Thursday 2:00pm-3:30pm Straub Hall, Room 145 Paper Performance of Graph Algorithms Chair: Allen Malony (University of Oregon) Yulin Che, Shixuan Sun, and Qiong Luo (Hong Kong University of Science and Technology) Abstract Abstract A common class of graph structural clustering algorithms, pioneered by SCAN (Structural Clustering Algorithm for Networks), not only find clusters among vertices but also classify vertices as cores, hubs and outliers. However, these algorithms suffer from efficiency issues due to the great amount of computation required on structural similarity among vertices. Pruning-based SCAN algorithms improve efficiency by reducing the amount of computation. Nevertheless, this structural similarity computation is still the performance bottleneck, especially on big graphs of billions of edges. In this paper, we propose to parallelize pruning-based SCAN algorithms on multi-core CPUs and Intel Xeon Phi Processors (KNL) with multiple threads and vectorized instructions. Specifically, we design ppSCAN, a multi-phase vertex computation based parallel algorithm, to avoid redundant computation and achieve scalability. Moreover, we propose a pivot-based vectorized set intersection algorithm for structural similarity computation. Experimental results show that ppSCAN is scalable on both CPU and KNL with respect to the number of threads. On the 1.8 billion-edge graph friendster, our ppSCAN completes within 65 seconds on KNL (64 physical cores with hyper-threading). This performance is 100x-130x faster than our single-threaded version, and up to 250x faster than pSCAN, the state-of-the-art sequential algorithm, on the same platform. Deepak Ajwani (Nokia Bell Laboratories, Dublin); Erika Duriakova (Insight Centre for Data Analytics; School of Computer Science and Informatics, University College Dublin); Neil Hurley (Insight Centre for Data Analytics; School of Computer Science, University College Dublin); and Ulrich Meyer and Alexander Schickedanz (Goethe University, Frankfurt) Abstract Abstract We consider the loop less k-shortest path (KSP) problem. Although this problem has been studied in the sequential setting for at least the last two decades, no good parallel implementations are known. In this paper, we provide (i) a first systematic empirical comparison of various KSP algorithms and heuristic optimisations, (ii) carefully engineer various parallel implementations of these sequential algorithms and (iii) perform an extensive study of these parallel implementations on a range of graph classes and multicore architectures to determine the best algorithm and parallelization strategy for different graph classes. Li Zhou (The Ohio State University), Ren Chen and Yinglong Xia (Huawei Research America), and Radu Teodorescu (The Ohio State University) Abstract Abstract Many big data analytics applications explore a set of related entities, which are naturally modeled as graph. However, graph processing is notorious for its performance challenges due to random data access patterns, especially for large data volumes. Solving these challenges is critical to the performance of industry-scale applications. In contrast to most prior works, which focus on accelerating a single graph processing task, in industrial practice we consider multiple graph processing tasks running concurrently, such as a group of queries issued simultaneously to the same graph. In this paper, we present an edge-set based graph traversal framework called C-Graph (i.e. Concurrent Graph), running on a distributed infrastructure, that achieves both high concurrency and efficiency for k-hop reachability queries. The proposed framework maintains global vertex states to facilitate graph traversals, and supports both synchronous and asynchronous communication. In this study, we decompose a set of graph processing tasks into local traversals and analyze their performance on C-Graph. More specifically, we optimize the organization of the physical edge-set and explore the shared subgraphs. We experimentally show that our proposed framework outperforms several baseline methods. Thursday 2:00pm-3:30pm Straub Hall, Room 245 Paper Storage Chair: Kevin Huck (University of Oregon) Zhirong Shen and Patrick P. C. Lee (The Chinese University of Hong Kong) Abstract Abstract The update performance in erasure-coded data centers is often bottlenecked by the constrained cross-rack bandwidth. We propose CAU, a cross-rack-aware update mechanism that aims to mitigate the cross-rack update traffic in erasure-coded data centers. CAU builds on three design elements: (i) selective parity updates, which select the appropriate parity update approach based on the update pattern and the data layout to reduce the cross-rack update traffic; (ii) data grouping, which relocates and groups updated data chunks in the same rack to further reduce the cross-rack update traffic; and (iii) interim replication, which stores a temporary replica for each newly updated data chunk. We evaluate CAU via trace-driven analysis, local cluster experiments, and Amazon EC2 experiments. We show that CAU enhances state-of-the-arts by mitigating the cross-rack update traffic as well as maintaining high update performance in both local cluster and geo-distributed environments. Duchy: Achieving Both SSD Durability and Controllable SMR Cleaning Overhead in Hybrid Storage Systems Xuchao Xie, Tianye Yang, Qiong Li, Dengping Wei, and Liquan Xiao (National University of Defense Technology) Abstract Abstract Integrating solid-state drives (SSDs) and shingled magnetic recording (SMR) drives can build cost-effective hybrid storage systems. However, both SSD and SMR drives endure inherent defects that are mutually exclusive. The write endurance of SSD is limited while SMR drives should prevent from writes due to the cleaning-caused performance degradation. In this paper, we propose Duchy, an endurable SSD caching scheme that simultaneously respects SMR constraints. Duchy filters ineffectual write traffic out of SSD without exacerbating the performance degradation of SMR drives. Meanwhile, Duchy leverages SSD to regulate the written zones in SMR drives to achieve controllable cleaning duration. Our experimental results indicate that compared with legacy SSD caching designs, only Duchy can achieve both system performance improvement and SSD write traffic reduction. Hua Wang and Xinbo Yi (Wuhan National Labo for Optoelectronics, HuaZhong University of Science and Technology); Ping Huang (Department of Computer and Information Sciences, Temple University); Bin Cheng (Shenzhen Tencent Computer System Co., Ltd.); and Ke Zhou (Wuhan National Labo for Optoelectronics, HuaZhong University of Science and Technology) Abstract Abstract SSD has been playing a significantly important role in caching systems due to its high performance-to-cost ratio. Since cache space is much smaller than that of the backend storage, write density (writes per unit time and space) of SSD cache is therefore much higher than that of HDD, which brings about great challenges to SSD’s lifetime. Meanwhile, under social network workloads, quite a few writes on SSD are unnecessary, e.g., Tencent’s photo caching shows that about 61% of total photos are just accessed once whereas they are still swapped into cache. Therefore, if we can predict this kind of photos proactively and prevent them from entering the cache, we can eliminate unnecessary SSD cache writes and improve cache space utilization. To cope with the challenge, we put forward a "one-time-access criteria" that is applied to cache space, and further propose a "one-time-access-exclusion" policy. Based on that, we design a prediction based classifier to facilitate the policy. Unlike the state-of-the-art history-based predictions, our prediction is non-history-oriented, which is challenging to achieve a good prediction accuracy. To address this issue, we integrate a decision tree into the classifier, extract social-related information as classifying features, and apply cost sensitive learning to improve classification precision. Due to these techniques, we attain a predication accuracy over 80%. Experimental results show that "one-time-access-exclusion" approach makes caching performance outstanding in most aspects, taking LRU for instance, hit rate is improved by 17%, cache writes are decreased by 79%, and the average access latency is dropped by 7.5%. Thursday 4:00pm-5:30pm Straub Hall, Room 156 Paper Matrix and Graph Algorithms Chair: Jee Choi (IBM) Carl Yang (University of California, Davis; Lawrence Berkeley National Laboratory); Aydin Buluc (Lawrence Berkeley National Laboratory); and John D. Owens (University of California, Davis) Abstract Abstract We factor Beamer's push-pull, also known as direction-optimized breadth-first-search (DOBFS) into 3 separable optimizations, and analyze them for generalizability, asymptotic speedup, and contribution to overall speedup. We demonstrate that masking is critical for high-performance and can be generalized to all graph algorithms where the sparsity pattern of the output is known a priori. We show that these graph algorithm optimizations, which together constitute DOBFS, can be neatly and separably described using linear algebra and can be expressed in the GraphBLAS linear algebra-based framework. We provide experimental evidence that with these optimizations, a DOBFS expressed in a linear-algebra-based graph framework attains competitive performance with state-of-the-art graph frameworks on the GPU and on a multi-threaded CPU, achieving 101 GTEPS on a Scale 22 RMAT graph. UHCL-Darknet: An OpenCL-based Deep Neural Network Framework for Heterogeneous Multi-/Many-core Clusters Longlong Liao (National University of Defense Technology, State Key Laboratory of High Performance Computing); Kenli Li (Hunan University, National Supercomputing Center in Changsha); Keqin Li (State University of New York, National Supercomputing Center in Changsha); Canqun Yang (National University of Defense Technology, State Key Laboratory of High Performance Computing); and Qi Tian (University of Texas at San Antonio) Abstract Abstract As the majority of popular deep neural network (DNN) frameworks focus on a closed format CUDA implementations based on one or more NVIDIA GPUs, they cannot efficiently leverage other devices in cluster mode to accelerate the training and inference of DNNs except NVIDIA GPUs. To accelerate DNNs using heterogeneous multi-/many-core clusters, we propose an OpenCL-based DNN framework called UHCL-Darknet. First, we design a unified OpenCL platform model for the heterogeneous cluster called UHCL, and an adaptive runtime system with the affinity-based dynamic scheduler for UHCL, enabling transparent utilization of a wide variety of vendor-specific OpenCL devices in the heterogeneous cluster. Then, we extend Darknet to UHCL by introducing the parallel optimization of DNNs, such as paralleling Winogrand-based convolutions and auto-tuning parameterized OpenCL kernels. The training and inference of art-of-data DNN models (e.g., YOLOv2, ResNet-50, and DenseNet-201) are executed on an experimental heterogeneous cluster. Results show that UHCL-Darknet is a scalable and portable DNN framework for heterogeneous clusters, and achieves 1.9x and 2.2x speedups on average respectively for the image throughput of data-parallel training and inference on the experimental heterogeneous cluster. Optimization of the Spherical Harmonics Transform based Tree Traversals in the Helmholtz FMM Algorithm Michael P. Lingg (Michigan State University, Belcan Engineering) and Stephen M. Hughey and H. M. Aktulga (Michigan State University) Abstract Abstract The fast multipole method (FMM) provides a fast and accurate method of solving large N-body problems with application in a broad range of physics problems. While there exists a large body of work for efficient implementation of the Laplace variant of the FMM algorithm, an optimized implementation of the Helmholtz variant of FMM which is used to study wave physics related phenomena is more complicated. In this study, we focus on the computationally expensive tree traversal operations of the Helmholtz FMM algorithm and describe a series of techniques to improve their performance. We demonstrate that the described techniques yield significant speedups (up to a factor of 4.71x) on a realistic surface geometry commonly encountered in Helmholtz problems. Thursday 4:00pm-5:30pm Straub Hall, Room 145 Paper Data Processing Chair: Taisuke Boku (University of Tsukuba) Song Wu and Zhiyi Liu (Huazhong University of Science & Technology); Shadi Ibrahim (Inria Rennes); and Lin Gu, Hai Jin, and Fei Chen (Huazhong University of Science & Technology) Abstract Abstract Existing stream processing frameworks operate either under data stream paradigm processing data record by record to favor low latency, or under operation stream paradigm processing data in micro-batches to desire high throughput. For complex and mutable data processing requirements, this dilemma makes the selection and deployment of stream processing frameworks in an embarrassing situation. Moreover, current data stream or operation stream paradigm cannot handle data burst efficiently, which probably results in noticeable performance degradation. This paper introduces a dual-paradigm stream processing, called DO (Data and Operation) that can adapt to stream data volatility. It enables data to be processed in micro-batches (i.e., operation stream) when data burst occurs to achieve high throughput, while data is processed record by record (i.e., data stream) in the remaining time to sustain low latency. DO embraces a method to detect data bursts, identify the main operations affected by the data burst and switch paradigms accordingly. Our insight behind DO’s design is that the trade-off between latency and throughput of stream processing frameworks can be dynamically achieved according to data communication among operations in a fine-grained manner (i.e., operation level) instead of framework level. We implement a prototype stream processing framework that adopts DO. Our experimental results show that our framework with DO can achieve 5x speedup over operation stream under low data stream sizes, and outperforms data stream on throughput by around 2.1x to 3.2x under data burst. Yusen Li (Nankai University); Xueyan Tang and Wentong Cai (Nanyang Technological University); Jiancong Tong (Baidu Inc.); Xiaoguang Liu and Gang Wang (Nankai University); and Chuansong Gao, Xuan Cao, Guanhui Geng, and Minghui Li (Baidu Inc.) Abstract Abstract With the rapid growth of the Web scale, large scale search engines have to set up a huge number of machines to place the index files of the Web contents. The index files are normally divided into smaller index shards which are often replicated so that queries can be processed in parallel. We observe that the index shard replication strategy could have a significant impact on the resource utilization of machines. In this paper, we investigate the index shard replication problem with the goal of improving the resource utilization of machines in search engine datacenters. We formulate the problem as a variant of the Multi-Dimensional Vector Bin Packing Problem and propose several strategies to approximate the optimal solution. The proposed strategies are evaluated by extensive experiments using data from real commercial search engines. The results demonstrate the effectiveness of the proposed strategies. Our work also yields many insights about the impact of different input properties on the performance and the key factors that each strategy is sensitive to. We believe that this paper will provide valuable guidance to the choice of the index shard replication strategy in practice. Chen Zhang and Qiang Cao (Huazhong University of Science and Technology) Abstract Abstract Surveillance video cameras are ubiquitous around us. Full-feature object-detection models such as YOLOv2 can automatically analyze surveillance videos in real-time with high accuracy while consuming huge computational resources. Directly applying these models for practical scenarios with large-scale deployed cameras requires prohibitively expensive computation. This, however, is both wasteful and unnecessary considering the fact that the concerned anomalous events occur rarely among these massive volumes of video streams. Therefore, in this paper, we propose a Fast Filtering System for Video Analytics (FFS-VA), a pipelined multi-stage video analyzing system, to make video analytics much cost-effective. FFS-VA is designed to filter out vast but non-target-object frames by two prepositive stream-specialized filters and a small full-function tiny-YOLO model, to drastically reduce the number of video frames arriving at the full-feature model in the back-end. FFS-VA presents a global feedback-queue mechanism to balance the processing rates of different filters in both intra-stream and inter-stream processes. FFS-VA also designs a dynamic batch technique to achieve an adjustable trade-off between throughput and latency. FFS-VA reasonably distributes all tasks on CPUs and GPUs to fully exploit the underlying hardware resources. We implement a FFS-VA prototype and evaluate FFS-VA against the state-of-the-art YOLOv2 under the same hardware and representative video workloads. The experimental results show that under a 10\% target-object occurrence rate on two GPUs, FFS-VA can support up to 30 concurrent video streams (7$\times$ more than YOLOv2) in the online case, and obtain 3$\times$ speedup when offline analyzing a stream, with an accuracy loss of less than 2\%. Thursday 4:00pm-5:30pm Straub Hall, Room 245 Paper I/O and File Systems Chair: Peter Pirkelbauer (University of Alabama at Birmingham, Computer Science) Ram Kesavan, Matthew Curtis-Maury, and Mrinal Bhattacharjee (NetApp) Abstract Abstract The WAFL® write allocator is responsible for assigning blocks on persistent storage to data in a way that maximizes both write throughput to the storage media and subsequent read performance of data. The ability to quickly and efficiently guide the write allocator towards desirable regions of available free space is critical to achieving that goal. This is influenced by several factors, such as any underlying RAID geometry, media-specific attributes such as erase-block size of solid state drives or zone size of shingled magnetic hard drives, and free space fragmentation. This paper presents and evaluates the techniques used by the WAFL write allocator to efficiently find regions of free space. Xiaoyi Zhang, Dan Feng, Yu Hua, Jianxi Chen, and Mandi Fu (Huazhong University of Science and Technology) Abstract Abstract The development of non-volatile memory technologies (NVMs) has attracted interest in designing data structures that are efficiently adapted to NVMs, which have asymmetric properties of reads and writes and limited write endurance compared with traditional DRAM. In this context, several NVM-friendly hashing schemes have been proposed to reduce extra writes to NVMs. However, these works neither consider the cost of cacheline flush and memory fence nor provide mechanisms to maintain data consistency in case of unexpected system failures. In this paper, we propose a write-efficient and consistent hashing scheme, called group hashing. The basic idea behind group hashing is to reduce the consistency cost while guaranteeing data consistency in case of unexpected system failures. Our group hashing consists of two major contributions: (1) We use 8-byte failure-atomic write to guarantee the data consistency, which eliminates the duplicate copy writes to NVMs, thus reducing the consistency cost of the hash table structure. (2) In order to improve CPU cache efficiency, our group hashing leverages a novel technique, i.e., group sharing, which divides the hash table into groups and deploys a contiguous memory space in each group to deal with hash collisions, thus reducing CPU cache misses to obtain higher performance in terms of request latency. We have implemented group hashing and evaluated the performance by using three real-world traces. Extensive experimental results demonstrate that our group hashing achieves low request latency as well as high CPU cache efficiency, compared with state-of-the-art NVM-based hashing schemes. Tiago Barreto Goes Perez and Xiaobo Zhou (University of Colorado Colorado Springs) and Dazhao Cheng (University of North Carolina, Charlotte) Abstract Abstract Optimizing memory cache usage is vital for performance of in-memory data-parallel frameworks such as Spark. Current data-analytic frameworks utilize the popular Least Recently Used (LRU) policy, which does not take advantage of data dependency information available in the application's directed acyclic graph (DAG). Recent research in dependency-aware caching, notably MemTune and Least Reference Count (LRC), have made important improvements to close this gap. But they do not fully leverage the DAG structure, which imparts information such as the time-spatial distribution of data references across the workflow, to further improve cache hit ratio and application runtime. |
Tuesday 11:00am-12:30pm Straub Hall, Room 156 Paper Monitoring and Network Analysis Chair: Martin Schulz (Technical University Munich) Tuesday 11:00am-12:30pm Straub Hall, Room 145 Paper Task Placement Algorithms Chair: Jee Choi (IBM) Tuesday 2:00pm-3:30pm Straub Hall, Room 156 Paper Performance Tools and Methodologies Chair: Sameer Shende (University of Oregon) Tuesday 2:00pm-3:30pm Straub Hall, Room 145 Paper Networking Algorithms Chair: Kamesh Madduri (Pennsylvania State University) Tuesday 4:00pm-6:00pm Straub Hall, Room 156 Paper Performance on GPU Systems Chair: Sameer Shende (University of Oregon) Tuesday 6:30pm-9:02pm Erb Memorial Union (EMU) Ballroom, 2nd Floor Poster, Reception Poster Reception and Dinner Chair: Kengo Nakajima (University of Tokyo, RIKEN) An Extensible Ecosystem of Tools Providing User Friendly HPC Access and Supporting Jupyter Notebooks pdf, pdf, pdf, pdfIn-Depth Reliability Characterization of NAND Flash based Solid State Drives in High Performance Computing Systems pdf, pdf, pdf, pdfWednesday 10:30am-12:30pm Straub Hall, Room 156 Paper Memory Performance Chair: Martin Schulz (Technical University Munich) Wednesday 10:30am-12:30pm Straub Hall, Room 145 Paper Machine Learning and Networks Chair: Peter Pirkelbauer (University of Alabama at Birmingham, Computer Science) Wednesday 2:00pm-3:30pm Straub Hall, Room 156 Paper Performance Studies Chair: Filippo Mantovani (Barcelona Supercomputing Center) Wednesday 2:00pm-3:30pm Straub Hall, Room 145 Paper Machine Learning Chair: Wu Feng (Virginia Tech) Wednesday 4:00pm-5:30pm Straub Hall, Room 156 Paper Programming Models Chair: Olga Pearce (Lawrence Livermore National Laboratory) Wednesday 4:00pm-5:30pm Straub Hall, Room 145 Paper Resilience and Reliability Chair: Allen Malony (University of Oregon) Thursday 10:30am-12:30pm Straub Hall, Room 156 Paper Resource Management Chair: Taisuke Boku (University of Tsukuba) Thursday 10:30am-12:30pm Straub Hall, Room 145 Paper Memory and Caching Chair: Sameer Shende (University of Oregon) Thursday 2:00pm-3:30pm Straub Hall, Room 156 Paper Parallel and Distributed Algorithms Chair: Kamesh Madduri (Pennsylvania State University) Thursday 2:00pm-3:30pm Straub Hall, Room 145 Paper Performance of Graph Algorithms Chair: Allen Malony (University of Oregon) Thursday 4:00pm-5:30pm Straub Hall, Room 156 Paper Matrix and Graph Algorithms Chair: Jee Choi (IBM) Thursday 4:00pm-5:30pm Straub Hall, Room 145 Paper Data Processing Chair: Taisuke Boku (University of Tsukuba) |