The poster session will be held on Tuesday, August 14th at 6:30pm in the EMU ballroom.

Delta-Stepping Synchronous Parallel Model

Zhang, Cui, Chen, Zhang

Many synchronous parallel algorithms like PageRank require a large number of iteration steps. The overheads of global synchronizations on general-purpose cluster take substantial proportion of the execution time for lightweight computations. We propose a variant of Bulk Synchronous Parallel (BSP), Delta-Stepping Synchronous Parallel (DSP), with fewer iteration steps. It achieves faster convergence process by taking full advantage of data locality.

Poster Flyer

Linear Time Sorting for Large Data Sets with Specialized Processor


A GPU-type special processor is proposed for sorting large data sets with, e.g., one billion keys. The parallel bubble sorter (PBS) processor implements parallel bubble sort in hardware with a very large set of special registers. A method to utilize the PBS for various large sorting problems is presented.

Poster Flyer

Exploring Memory Coalescing for 3D-Stacked Hybrid Memory Cube

Wang, Leidel, Chen

Arguably, many data-intensive applications pose significant challenges to conventional architectures and memory systems, especially when applications exhibit non-contiguous, irregular, and small memory access patterns. The long memory access latency can dramatically slow down the overall performance of applications. The growing desire of high memory bandwidth and low latency access stimulate the advent of novel 3D-staked memory devices such as the Hybrid Memory Cube (HMC), which provides significantly higher bandwidth compared with the conventional JEDEC DDR devices. Even though many existing studies have been devoted to achieving high bandwidth throughput of HMC, the bandwidth potential cannot be fully exploited due to the lack of highly efficient memory coalescing and interfacing methodology for HMC devices. In this research, we introduce a novel memory coalescer methodology that facilitates memory bandwidth efficiency and the overall performance through an efficient and scalable memory request coalescing interface for HMC. We present the design and implementation of this approach on RISC-V embedded cores with attached HMC devices. Our evaluation results show that the new memory coalescer eliminates 47.47% memory accesses to HMC and improves the overall performance by 13.14% on average.

Poster Flyer

A Low-Communication Method to Solve Poisson's Equation on Locally-Structured Grids

McCorquodale, Colella, Van Straalen, Kavouklis

This poster describes a new algorithm, Method of Local Corrections (MLC), and a high-performance implementation for solving Poisson's equation with infinite-domain boundary conditions, on locally-refined nested rectangular grids. MLC represents the potential as a linear superposition of small local discrete convolutions, with global coupling represented using a non-iterative form of geometric multigrid. Thus the data motion is comparable to that of only a single V-cycle of multigrid, and hence is an order of magnitude smaller than tradition multigrid iteration. The computational kernels where most of the time is spent are 3D FFTs on small domains. Our results show solution error that is fourth order in mesh spacing, down to a fixed barrier that can be made low using high-order Mehrstellen stencils. Strong scaling tests on 64 to 4096 cores on NERSC Cori I (Haswell) show over 60% efficiency, and weak scaling by replication tests over 64 to 32768 cores show 92% efficiency on the same platform. We find comparable solve times between HPGMG on a uniform grid with one billion grid points, and MLC on the same number of grid points adaptively distributed. Since MLC operates on a nested series of adaptive locally-refined grids, it is able to solve problems with much higher resolution at the finest level than an algorithm on a uniform grid.

Poster Flyer

CGAcc: CSR-based Graph Traversal Accelerator on HMC

Qian, Childers, Huang, Yu, Wang

Graph traversal is widely involved in lots of realistic scenarios such as road routing, social network and so on. Unfortunately graph traversal is quite time-consuming because of terrible spatial locality. Conventional prefetch technology and parallel framework do not bring much benefit. Hybrid Memory Cube (HMC) can serve as high bandwidth main memory as well as providing an enhanced logic layer with the ability of controlling memory access and processing in memory (PIM). Armed with the knowledge of graph's structure, we propose CGAcc in this paper. By deploying three prefetchers on the logic layer and making them in a pipeline way, CGAcc achieves average 2.8X speedup compared with conventional memory system with stream prefetcher.

Poster Flyer

An Extensible Ecosystem of Tools Providing User Friendly HPC Access and Supporting Jupyter Notebooks

Glick, Mache

High performance computing systems can be hard to use, and can require advanced knowledge of the command line, resource managers, and other details that scientists without a systems administration background may find distracting. In this project, we describe our extensible ecosystem of tools which has allowed students and researchers with minimal HPC backgrounds to use our system with virtually no training. Among the tools we support are (1) a web-based job submission and management system, (2) a GUI-based file transfer and sharing tool, and (3) a Jupyter notebook environment. Together, these tools and the interfaces around them allow users to run complete HPC workflows without any use of the command line. Performance measurements show that our ecosystem incurs only marginal overhead.

Poster Flyer

In-Depth Reliability Characterization of NAND Flash based Solid State Drives in High Performance Computing Systems

Liang, Qiao, Fu, Shi

ABSTRACTNAND flash based solid state drives (SSD) have been widely used in high performance computing (HPC) systems due to their better performance compared with the traditional hard disk drives. How- ever, little is known about the reliability characteristics of SSDs in production systems. Existing works study the statistical distri- butions of SSD failures in the field. They do not go deep into SSD drives and investigate the unique error types and health dynam- ics that distinguish SSDs from hard disk drives. In this paper, we explore the SSD-specific SMART (Self-Monitoring, Analysis, and Reporting Technology) attributes to conduct an in-depth analysis of SSD reliability in a production datacenter. Our dataset contains half a million records with more than twenty attributes. We lever- age machine learning technologies, specifically data clustering and correlation analysis methods, to discover groups of SSDs which have different health status and the relation among SSD-specific SMART attributes. Our results show that 1) Media wear affects the reliability of SSDs more than any other factors, and 2) SSDs transit from a health group to another which infers the reliability degrada- tion of those drives. To the best of our knowledge, this is the first study investigating SSD-specific SMART data to characterize SSD reliability in a production environment.

Poster Flyer

Leveraging Resource Bottleneck Awareness and Optimizations for Data Analytics Performance

Barreto Goes Perez, Zhou

Spark is among the most popular parallel and distributed data analytics frameworks in use today. But just like its peers and predecessors, it still suffers from performance degradation caused by resource bottlenecks. The challenge is finding multiple methods that leverage awareness of resource bottlenecks to improve performance. Among the research progress we have made so far, includes the proposal and development of two systems. The first, PETS, is a tuning system that is capable of tuning multiple parameters at the same time in a few iterations, taking advantage of a resource awareness feedback system. The second, MRD, is a cache memory management system which utilizes reference-distance information, to evict the furthest and least likely data in the cache to be used, while aggressively prefetching the nearest and most likely data that will be needed. Both systems have shown promising results, with significant performance gains over default systems, as well as improvements in related recent work.

Poster Flyer

Performance Improvements of an Event Index Distributed System

Fernandez, Orduña, Gonzalez

Modern scientific collaborations, like the ATLAS experiment at CERN, produce large amounts of data that need cataloging to meet multiple use cases and search criteria. Several challenges arise in indexing and collecting billions of events, or particle collisions, from hundred of grid sites worldwide. In addition, we also face some challenges in the organization of the data storage layer of the catalog, that should be capable of handling mixed OLTP (high-volume transaction processing updates ) and OLAP (real-time analytical queries) use cases. In order to overcome the challenges on the distributed data collection of events, in this thesis we have proposed , designed and implemented a distributed Producer/Consumer architecture, based on Object Store as a shared storage, and with dynamic data selection. Producers run at hundreds of grid sites worldwide indexing millions of files summing up Petabytes of data, and store a small quantity of metadata per event in an ObjectStore. Then, a reference to the data is sent to a Supervisor, that signals Consumers to retrieve the data at the desired granularity and consolidate at a central Hadoop based data backend. In the area of the internal organization of the data, we propose an architecture based on a NoSQL backend storage, and new data schemas to better accommodate related data (reprocessings), avoiding duplicate information, and improving navigation. We propose applying memory caching techniques to improve access times for recent loaded data, which is usually the most accessed data by the end-users use cases.

Poster Flyer

Designing Domain-Specific Heterogenous Manycores from Dataflow Programs


We propose a design method to develop domain-specific heterogeneous manycores by extending simple cores with custom instructions based on application requirements. We develop tools to automate the design method and facilitate development of the manycores. The tools generate competitive hardware/software co-design in terms of performance and hardware resource usage.

Poster Flyer

A HPC Framework for Big Spatial Data Processing and Analytics

Paudel, Puri

Spatial data processing and analytics is a highly data- and compute-intensive task. Doing this in a HPC environment has remained a challenge due to scalability issues. The scalability issues usually arise due to the non-uniform nature of spatial data which makes load balancing inefficient. Existing algorithms and systems which are sequential in nature cannot be ported into a distributed system without tackling the issue of load balancing. Porting existing algorithms to a parallel system would also require finding their parallel counterpart and rewriting code in a new language like CUDA or refactoring them using pthreads. In our work, we present a framework for processing and analyzing big spatial data in a HPC environment by overcoming or diminishing the above mentioned issues. Our work includes a way to read the non-uniform spatial data into a MPI system using MPI-Vector-IO. This allows us to reduce the read time of big spatial data and get started on processing faster. ADLB is introduced as a potential solution to handling the issues of load balancing during processing of the big spatial data. Moreover, we propose using directive based programming to port the existing sequential code to run in parallel so that we can avoid rewriting or heavy refactoring. Using OpenMP will allow our code to run in parallel threads in the nodes extracting more performance when possible and using OpenACC we will be able to use any GPUs the nodes might have.

Poster Flyer

A Computational Investigation of Redistricting Using Simulated Annealing

Antoškin, Muite

Political redistricting is an important operation that is done to ensure a fair selection of electoral representatives. It can be formulated as a combinatorial optimization problem. In realistic cases, this problem can be challenging to solve due to the large number of solutions. The effectiveness of parallel computing to more effectively search the solution space is examined in specially designed test cases where the optimal solution is known.

Poster Flyer

Utilization of Random Profiling for System Modeling and Dynamic Configuration

Hiebel, Brown, Wang

Dynamically tuning or reconfiguring operating system and architectural resources at runtime can improve performance by adapting system operation to the current workload. We seek to address the problems of both system modeling and dynamic system configuration in the context of sequential decision processes with limited feedback. Random profiling is a technique for sequential decision processes with limited feedback, which can adequately and efficiently capture the relationships between workload behavior, system configuration, and performance. Along with machine learning methods, random profiling can be used for determining descriptive metrics for workload behavior and constructing policies which dynamically reconfigure the operating system and microarchitecture at runtime. We present three such problems: system event selection, dynamic paging mode selection, and dynamic hardware prefetcher configuration.

Poster Flyer

Performance evaluation of parallel cloud functions

Pawlik, Figiela, Malawski

Function-as-a-Service services are still a novelty in portfolios of cloud service providers. There is a significant amount of research done on development of new use cases. One of the new proposals is to exploit the computing power of FaaS by running HPC applications [1]. While not all workloads can be adapted to run as cloud functions, means for executing workflow-type applications are available. In preparation to develop performance models for studied applications, a benchmarking framework was proposed in [2]. This poster presents work done by expanding the mentioned idea for benchmarking in terms of functionality and scope of tested infrastructures. The new framwork combines two aspects of previous benchmarking suite: testing infrastructure provisioning (simultaneous execution of multiple tasks) and floating point performance into one. This approach allows for obtaining a more complete view of studied performance, which includes factors like task start delay and influence of parallelism. The testing load is generated with Linpack, an industry standard benchmark with large set of comparable results. Tested cloud function providers include: Amazon, Google and IBM. Presented results provide an overview of performance that might be expected from FaaS infrastructures. While this research explores frontiers of HPC, future needs related to exascale and edge computing might bring HPC workloads to cloud functions. [1] Spillner et al. "FaaSter, Better, Cheaper: The Prospect of Serverless Scientific Computing and HPC." Latin American High Performance Computing Conference. 2017. [2] M. Malawski, K. Figiela, A. Gajek, A. Zima, Benchmarking heterogeneous cloud functions, in: Euro-Par 2017: Parallel Processing Workshops

Poster Flyer

Algorithm Design for Large Scale FFT-Based Simulations on CPU-GPU Platforms

Kulkarni, Franchetti, Kovacevic

Extreme memory requirements and high communication overhead prevent scaling of large scale iterative simulations involving parallel Fast Fourier Transforms (FFTs) to higher grid sizes, which is necessary for high resolution analysis. To overcome these limitations, we propose an algorithm to run stress-strain simulations on CPU-GPU platforms for larger problem sizes using irregular domain decomposition and local FFTs. Early results show that our method lowers iteration cost without adversely impacting accuracy of the result.

Poster Flyer

WebNN: A Distributed Framework for Deep Learning

Goin, Cotton, Zhao

Distributed computing technologies have been used to facilitate machine learning applications, so that high-end hardware resources, such as clusters, GPUs, etc., can be leveraged to achieve high performance. In this paper, we present WebNN, a web-based distributed framework for deep learning. WebNN enables users to configure, distribute, and train neural networks asynchronously over the Internet, utilizing peer-owned resources with increased flexibility. Experiments have been carried out to evaluate WebNN, and the results show that WebNN is a effective solution for distributed training of models using pre-labeled batches, or client-provided data.

Poster Flyer

Toward Footprint-Aware Power Shifting for Hybrid Memory Based Systems

Arima, Hanawa, Schulz

High Performance Computing (HPC) systems are facing severe limitations in both power and memory bandwidth/capacity. Both limitations have been addressed individually: to exploit performance under a strict power constraint, power shifting, which allocates more power budget on the bottleneck component, is a promising approach; for memory bandwidth/capacity, the industry is gradually shifting towards hybrid main memory designs that comprise multiple technologies (e.g., DRAM and Non-Volatile RAM). However, few works look at the combination of both trends. We propose a power management concept called footprint-aware power shifting, which explicitly targets hybrid main memory systems. The idea is based on the following observation: in spite of the system software's efforts to optimize the data allocations on such a system, the effective memory bandwidth decreases considerably when we scale the problem size of applications. As a result, the performance bottleneck changes among components depending on the footprint size, which then also changes the optimal power budget settings. We demonstrate the phenomenon and quantitatively show the impact of our power management approach.

Poster Flyer

Resource and Service Management in Fog Computing

Shaik, Baskiyar

Fog computing paradigm is emerging as complementary to cloud computing paradigm to realize deployment of large scale IoT environments supporting widely dispersed IoT devices, users, and corresponding applications, leveraging fog nodes of varied resource configurations located in vicinity. Researchers have shown the need of fog computing towards deployment of latency-critical and bandwidth-intensive applications. Management of resources and services is critical in widely dispersed fog environment with heterogeneous fog nodes to ensure optimal utilization of available infrastructure and energy resources on fog nodes and IoT devices. Towards efficient management of fog, we proposed Hierarchical and Autonomous Fog Architecture (HAFA) to organize heterogeneous fog nodes into a multi-layered connected hierarchy based on several parameters such as physical location, distance from IoT devices and/or users, node resource configuration, privacy and security. The initial results show the ease of search for an optimal fog node with required resource configuration towards deployment of application services.

Poster Flyer

Exploiting Inter-Phase Application Dynamism to Auto-Tune HPC Applications for Energy-Efficiency

Kumaraswamy, Gerndt

Optimizing the energy consumption is a major challenge in the current HPC landscape. The EU Horizon 2020 project READEX provides an auto-tuning framework to improve the energy-efficiency of HPC applications. It leverages the variation in the application behavior between individual time loop iterations, or phases as well as changing control flow within a single phase. READEX uses a two-step methodology, consisting of Design-Time Analysis and Runtime Application Tuning. At design-time, fine-granular instrumented program regions are filtered, and significant regions that have a tuning potential are determined. Then, the readex_interphase tuning plugin, performs three tuning steps, i.e., cluster analysis, default execution and verification to exploit the inter-loop dynamism. During cluster analysis, it creates a search space of system configurations using a random search strategy. It executes experiments to request for measurements for the phases as well as instances (rts's) of the significant regions. It uses DBSCAN with parameters minPts and eps, which is automatically determined using the elbow method to cluster phases. Normalized PAPI metrics - compute intensity and conditional branch instructions are used as features to group phases with similar behavior. Cluster-best tuning parameter settings are determined for each phase cluster. During default execution, the plugin executes using the default settings of the batch system. In the verification step, the actual savings with the switching overhead are verified against the computed theoretical savings. Finally, the static and dynamic savings are determined. The plugin was evaluated for miniMD and INDEED, and the obtained energy savings highlight the effectiveness of this methodology.

Poster Flyer

DSAP: Data Structure-Aware Prefetching for Breadth First Search on GPU

Guo, Huang, Lv, Yu, Ma, Wang

Breadth First Search (BFS) is a key graph searching algorithm for many graph computing applications. However, memory divergence harms GPU’s efficiency dramatically due to the irregular memory access pattern of BFS. Memory prefetching can fetch useful data into on-chip memory in advance, but current prefetchers cannot deal with data-dependent memory accesses. In this paper, we propose DSAP, a data structure-aware prefetcher on GPU that generates prefetching requests based on the data-dependent relationship between graph data structures and the memory access pattern of BFS. Also, we introduce an adaptive prefetching depth management to make DSAP more efficient according to the limited on-chip memory storage. We evaluate six datasets from three different kinds of applications and DSAP can achieve geometrical mean speedups of 1.28x, up to 1.48x, compared to the GPU without using prefetching.

Poster Flyer

Toward a Multi-GPU Implementation of the Modular Integer GCD Algorithm: Extended Abstract

Weber, Brew

The modular integer greatest common divisor (GCD) algorithmholds promise to provide superior performance to sequential algorithmson extremely large input. In order to demonstrate theefficacy of the algorithm, an implementation on a system withmultiple Graphics Processing Units (GPUs) is proposed, based ona single-GPU implementation described herein. The implementation’sperformance is analyzed to predict the size of input needed todemonstrate superior performance when compared to one popularsequential implementation of the integer GCD.

Poster Flyer

Middleware for Data Intensive Analytics on HPC


My research focuses on providing a common workload management middleware abstraction, on distributed resources, that will capture the commonalities of the data analysis between different domain sciences, and a resource management abstraction that will allow Big Data style analytics on HPCs in conjunction with HPC style simulations. One important component of these abstractions is the Pilot abstraction. The Pilot abstraction operates as a resource placeholder which can be used to execute a set of heterogeneous tasks within the placeholder on the same resource. I am using the Pilot-Abstraction and its implementation RADICAL-Pilot to support different types of data analysis as well as use it as a building block for the middleware abstraction defined before.

Poster Flyer

Topologies and Adaptive Routing on Large-Scale Interconnects


The performance of the interconnect network is massively important in the modern day supercomputer and data centers. As a PhD student in the FSU CS EXPLORER (EXtreme-scale comPuting, modeLing, netwORking & systEms Research) lab under the supervision of Dr. Xin Yuan, my research activity revolves around the analysis, improvement and performance evaluation of a number of topology and routing schemes widely used in the field of high performance computing. Over the years, I worked in a number of projects that is briefly described in my poster and in the extended research abstract.

Poster Flyer

Abstractions for Specifying Sparse Matrix Data Transformations

Nandy, Davis, Mohammadi, Hall, Strout, Olschanowsky

The inspector/executor paradigm permits using runtime information in concert with compiler optimization. An inspector collects information that is only available at runtime; this information is used by an optimized executor that was created at compile time. Inspectors are widely used in optimizing irregular computations, where information about data dependences, loop bounds, data structures, and memoryaccess patterns are collected at runtime and used to guide code transformation, parallelization, and data layout. Most research that uses inspectors relies on instantiating inspector templates, invoking inspector library code, or manually writing inspectors. This poster describes abstractions for generating inspectors for loop and data transformations for sparse matrix computations using the Sparse Polyhedral Framework(SPF). SPF is an extension of the polyhedral framework for transformation and code generation. SPF extends the polyhedral framework to represent runtime information with uninterpreted functions and inspector computations that explicitly realize such functions at runtime. It has previously been used to derive inspectors for data and iteration space reordering. This poster introduces data transformations intoSPF, such as conversions between sparse matrix formats, and show how prior work can be supported by SPF. We also discuss possible extensions to support inspector composition and incorporate other optimizations. This work represents a step towards creating composable inspectors in keeping with the composability of affine transformations on the executors.

Poster Flyer

SOSflow: A Scalable Observation System for Introspection and In Situ Analytics


Efficiently observing and interacting with complex scientific workflows at scale presents unique challenges. SOSflow helps meet them. This work presents the model and a design overview of the SOSflow In Situ Scalable Observation System, and shares some recent results oriented towards performance understanding of complex workflows at scale.

Poster Flyer

Cost-Time Performance of Scaling Applications on the Cloud

Rathnayake, Teo

With cloud computing offering scalable resources and pay-per use pricing, it is an attractive platform for scalable computing where application execution is constrained only by the cost budget. Additionally, cloud applications are increasingly becoming larger due to recent advancements in big data processing and machine learning, among others. Due to the large number of resource configurations available on the cloud, it is a non-trivial task for cloud users to determine the largest size of the application executable for a given cost budget and execution time deadline. In addition, the challenge becomes multi-fold as resource demands of the application do not necessarily scale linearly with problem size. Given a cost budget and a time deadline, this paper introduces a measurement-driven analytical modeling approach to determine the largest Pareto-optimal problem size of an application and the corresponding cloud configuration for execution. We evaluate our approach with a subset of representative applications that exhibit range of resource demand growth patterns. We show the existence of cost-time-size Pareto-frontier with multiple sweet spots meeting user constraints. To characterize the cost-performance of cloud resources, we introduce Performance Cost Ratio (PCR) metric. Motivated by Gustafson's law, we investigate fixed-time scaling of applications on cloud and analyze the trade-offs between the application problem size, cost and time.

Poster Flyer

OpenMP 4.5 Implementations: Evaluation & Verification of Offloading Features

Monsalve Diaz

On the quest of High Performance Computing systems capable of achieving high parallelism on scientific applications, a trend of adding acceleration devices with specialized hardware architectures can be observed. An evidence of this are the 102 supercomputers of the latest top 500 list that use an accelerator or co-processor. Furthermore, new programming languages and frameworks have been created to support accelerators. While each vendor may offer its own programming solution, traditional programming frameworks as OpenMP are working to solve this issue adapting the standard to support target devices. Although it is not the only option, all solutions share a common objective of increasing code portability across vendors and multiple architectural generations. However, maturity and correctness of these solution needs to be tested. In particular for OpenMP, there is a breach between the standard release and when compilers and vendors implement support for it. Hence, there is a need of a methodology to assess the quality of an implementation of the OpenMP standard, as well as evaluate system compliance during systems deployment. Such methodology should reveal bugs or conflicts on the interpretation of the standard, as well as provide metrics to characterize the quality level of an implementation. Consequently, this work presents a tests suite for OpenMP 4.5. This suite follows a well thought methodology that was established in advanced to provide a common and steady ground, allowing vendors, systems developers and programmers to asses the level of readiness and maturity of an OpenMP 4.5 implementation for a particular vendor.

Poster Flyer

Iterative Solver Selection Techniques for Sparse Linear Systems

Sood, Norris, Jessup

Scientific and engineering applications from different domains like astrophysics, robotics, computational fluid dynamics, thermodynamics, often involve the solution of large sparse linear systems. For large, sparse linear systems, iterative methods are the preferred choice of solution method. There are many software libraries that offer a variety of parallel preconditioned Newton-Krylov methods for solving sparse problems. However, the selection of an optimal Krylov method remains to be user's responsibility. In this work, we present the technique we propose for the optimal solver method suggestions based on the problem characteristics and the amount of communication involved in the Krylov methods. We classify different Krylov methods based on their performance using several supervised machine learning techniques. We solve each linear system with multiple solver-preconditioner combinations and capture the solver timing. We compute various features of these systems such as the trace, row/column variability. These properties along with the solver-preconditioner combination, together form the training set. The output of the model is a list of ``good'' solver recommendations. For problems that require high levels of parallelism (>1000 cores), we propose a method for modeling the performance of parallel preconditioned Krylov methods. With this model, we analyze the solver and preconditioner algorithms provided by PETSc identifying the operations that perform inter-process communication for each iteration. The result is a communication-wise ranking for the preconditioned and non-preconditioned cases. The solver ranking list obtained from this model is combined with the ML-generated solver list to find the intersection, which generates a ranked list of ``good'' solvers.

Poster Flyer

Performance Analysis of DroughtHPC and Holistic HPC Workflows

Suriyakumar, Karavanic, Moradkhani

We present the results of a performance study of a newly developed drought prediction code called DroughtHPC. Holistic view helps to identify bottlenecks and identify areas for improvement in the software. The significant performance bottlenecks identified include: the overhead of calls to the VIC hydrologic model from within a python loop; VIC code structures that precluded parallelization with OpenMP; and significant file accesses. We observed challenges in diagnosing the performance of the code due to the use of an external modeling code in combination with python, a fairly common scenario in scientific computation. To address these challenges we are developing PPerfG, a tool for visualizing Holistic HPC Workflows, and have implemented an initial prototype.

Poster Flyer

Push-Pull on Graphs is Column- and Row-based SpMV Plus Masks

Yang, Buluc, Owens

Push-pull, also known as direction-optimized breadth-first-search (DOBFS), is a key optimization for making breadth-first-search (BFS) run efficiently. Linear algebra-based frameworks have advantages in conciseness, performance and portability. However, there is no work in literature describing how to implement it within a linear algebra-based framework. Our work shows that DOBFS fits well within the linear algebra-based framework.

Poster Flyer

Adaptive auto-tuning in HPX using APEX


Finding an optimal value for a parameter that impacts application performance is a hard problem and often requires repetitive execution of the application. This auto-tuning effort by running the application many times causes wastage of HPC resources. In this research, we provide a preliminary study which demonstrates parameter value searching at runtime while ensuring better performance. We use APEX performance measurement library to implement an adaptive auto-tuning policy to tune parcel coalescing parameters based on a sampled counter of HPX runtime system. Our study shows APEX policy can converge on parcel coalescing parameters while reducing the network overhead and hence ensures reduction of execution time.

Poster Flyer

Interval based Framework for Locking in Hierarchies

Kalikar, Nasre

Hierarchies are special linked structures, where each child node denotes a specialization or a part of its parents. For instance, node representing a department in an academic hierarchy is a part of its parent institute. Conversely, a node representing an institute contains all its departments. Hierarchical locking is a way to lock a node in a hierarchy which implicitly locks its descendants. This is useful because the whole sub-hierarchy rooted at a node can be protected using a single lock. A node could be hierarchically locked only if (i) it is not locked by any other thread, and (ii) none of its descendants are currently locked by any other thread, and (iii) none of its ancestors are currently locked by any other thread. For instance, in Figure 1, hierarchical locking of node G requires that no other thread currently holds locks on (i) G itself, (ii) its descendant nodes M and N , and (iii) its ancestors A and C . However, as an implementation, hierarchical locking poses various challenges. We present our interval based hierarchical locking technique and show how it tackles all the challenges in hierarchical locking.

Poster Flyer

Models and Techniques for Green High-Performance Computing


Modern high-performance computing (HPC) systems are power-limited. For instance, the U.S. Department of Energy has set a power envelope of 20MW for the exascale supercomputer expected to arrive in 2022-23. Achieving this target requires a 10.8-fold increase in performance over today’s fastest supercomputer with only a 1.3-fold increase in power consumption. As a consequence, the architecture of an HPC system is changing rapidly---e.g., via heterogeneity and hardware overprovisioning. In my thesis, I address (i) modeling, (ii) management, and (iii) evaluation challenges concerning power and energy in this changing landscape of HPC systems. In this extended abstract, I focus on my work on modeling data movement power over interconnect wires.

Poster Flyer

Fast and generic concurrent message-passing

Dang, Snir

Communication hardware and software have a significant impact on the performanceof clusters and supercomputers. The Message-Passing Interface (MPI) has become ade-facto standard API for communication in HPC. However, it recently faces a newchallenge due to the emergence of many-core nodes and of programming models thatprovide dynamic task parallelism and assume large numbers of concurrent,light-weight threads. Using MPI atop of these languages/runtimes is inefficientbecause MPI implementation is not able to perform well with threads. Using MPIas a communication middleware is also not efficient since MPI has to providemany abstractions that are not needed for many of the frameworks, thus havingthe extra overhead.Our research focuses on improving the communication of applications andframeworks in three fronts. First, although MPI performance is lagging behind,we show that it can be improved using more advanced techniques of threadsynchronization and appropriate semantic relaxations. Our proposal andtechniques are being incorporated into the MPICH implementation. Second, wedevelop a generic and low-level communication interface (LCI) which targetsemerging applications such as asynchronous task-model where the current MPIsemantics are less ideal. LCI design allows for generic and low-overheadproducer and consumer matching, better thread integration, and maps more directto the network interface. In previous works LCI has improved thestate-of-the-art performance of graph analytics frameworks. Lastly, we developFULT, a fast User-level thread scheduling technique using bit-vectors to furtherlower the overheads of signal/wait operations and improves the inter-operationbetween thread and communication runtime.

Poster Flyer

I/O Bottleneck Investigation in Deep Learning Systems

Pumma, Si, Feng, Balaji

As deep learning systems continue to grow in importance, several researchers have been analyzing approaches to make such systems efficient and scalable on high-performance computing platforms. As computational parallelism increases, however, data I/O becomes the major bottleneck limiting the overall system scalability. In this paper, we present a detailed analysis of the performance bottlenecks of Caffe on large supercomputing systems and propose an optimized I/O plugin, namely LMDBIO. Throughout the paper, we present six sophisticated data I/O techniques that address five major problems in state of the art I/O subsystem. Our experimental results show that Caffe/LMDBIO can improve the overall execution time of Caffe/LMDB by 77-fold compared with LMDB.

Poster Flyer

Identifying Carcinogenic Multi-hit Combinations usingWeighted Set Cover Algorithm

Dash, Kinney, Varghese, Garner, Feng, Anandakrishnan

Disruptions in certain molecular pathways due to combinations of genetic mutations (hits) are known to cause cancer. Due to a large number of mutations present in tumor cells, experimentally identifying these combinations is not possible except in very rare cases. Current computational approaches simply do not search for specific combinations of multiple genetic mutations. Instead, current algorithms search for sets of driver mutations based on mutation frequency and mutational signatures. Here, we present a fundamentally different approach for identifying carcinogenic mutations: we search for combinations of carcinogenic mutations (multi-hit combinations). By avoiding the convolution of different driver mutations associated with different individual instances of cancer, multi-hit combinations may be able to identify the specific cause for each cancer instance. We mapped the problem of identifying a set of multi-hit combinations to a weighted set cover problem. We use a greedy algorithm to identify sets of multi-hit combinations for seventeen cancer types for which there are at least two hundred matched tumor and blood-derived normal samples in the cancer genome atlas (TCGA). When tested using an independent validation dataset, these combinations are able to differentiate between tumor and normal tissue samples with 91% sensitivity (95% Con- fidence Interval (CI) = 89-92%) and 93% specificity (95% CI = 91-94%), on average for seventeen cancer types. The combinations identified by this method, with experimental validation, can aid in better diagnosis, provide insights into the etiology of various cancer types, and provide a rational basis for designing targeted combination therapies.

Poster Flyer

Efficient Matching of GPU Kernel Subgraphs


Accelerator architectures specialize in executing SIMD (single instruction, multiple data) in lockstep. Because the majority of CUDA applications are parallelized loops, control flow information can provide an in-depth characterization of a kernel. Our methodology statically separates CUDA binaries into basic block regions and dynamically measures instruction andbasic block frequencies. Our approach captures this information in a control flow graph (CFG) and performs subgraph matching across various kernel’s CFGs to gain insights into an application’s resource requirements, based on the shape and traversal of the graph, instruction operations executed and registers allocated, among other information. The utility of our approach is demonstrated with SHOC and Rodinia application case studies on a variety of GPU architectures, revealing novel thread divergence characteristics that facilitate end users, autotuners, and compilers in generating high performing code.

Poster Flyer

KeyBin2: Distributed Clustering for Scalable and In-situ Analysis

Chen, Estrada

We present KeyBin2, a key-based clustering method that is able to learn from distributed data in parallel. KeyBin2 uses random projections and discrete optimizations to efficiently clustering very high dimensional data. Because it is based on keys computed independently per dimension and per data point, KeyBin2 can scale linearly. We perform accuracy and scalability tests to evaluate our algorithm's performance using synthetic and real datasets. The experiments show that KeyBin2 outperforms other parallel clustering methods for problems with increased complexity. Finally, we present an application of KeyBin2 for in-situ clustering of protein folding trajectories.

Poster Flyer