ICPP 2018 Program

Monday, August 13th

8:00am-5:00pm

ICPP Registration

Erb Memorial Union (EMU) Ballroom, 2nd Floor

Registration

9:00am-12:30pm

EMS Workshop

Maple, Ballroom Area, Erb Memorial Union (EMU), 2nd Floor

Workshop

QC Workshop

Oak, Ballroom Area, Erb Memorial Union (EMU), 2nd Floor

Workshop

9:00am-6:30pm

NextGenClouds Workshop

Straub Hall, Room 145

Workshop

P2S2 Workshop

Straub Hall, Room 156

Workshop

SRMPDS Workshop

Straub Hall, Room 245

Workshop

10:30am-11:00am

Break

Straub Hall Lobby

Break

12:30pm-1:30pm

Lunch

Erb Memorial Union (EMU) Ballroom, 2nd Floor

Lunch

2:00pm-3:30pm

AWASN Workshop

Maple, Ballroom Area, Erb Memorial Union (EMU), 2nd Floor

Workshop

2:00pm-5:00pm

BIO-HPC Workshop

Oak, Ballroom Area, Erb Memorial Union (EMU), 2nd Floor

Workshop

2:00pm-6:00pm

Introduction to Running AI Workloads on PowerAI

Gumwood, Ballroom Area, Erb Memorial Union (EMU), 2nd Floor

Tutorial

3:30pm-4:00pm

Break

Straub Hall Lobby

Break

Tuesday, August 14th

8:00am-10:30am

ICPP Registration

Erb Memorial Union (EMU) Ballroom, 2nd Floor

Registration

8:30am-9:00am

Welcome and Introduction

Straub Hall, Room 156

Allen Malony

Keynote

9:00am-10:00am

Keynote

Straub Hall, Room 156

Mark Robins

AI and HPC: Challenges and Opportunities

Keynote

10:00am-10:30am

Best Paper Session

Straub Hall, Room 156

Michele Weiland

ImageNet Training in Minutes

Paper

10:30am-11:00am

Break

Straub Hall Lobby

Break

11:00am-12:30pm

Graph Applications

Straub Hall, Room 245

Konstantinos Krommydas

ParaPLL: Fast Parallel Shortest-path Distance Query on Large-scale Weighted Graphs

HUS-Graph: I/O-Efficient Out-of-Core Graph Processing with Hybrid Update Strategy

A Distributed Infomap Algorithm for Scalable and High-Quality Community Detection

Paper

Monitoring and Network Analysis

Straub Hall, Room 156

Martin Schulz

Integrating Low-latency Analysis into HPC System Monitoring

Unveiling Thread Communication Bottlenecks Using Hardware-Independent Metrics

Interference between I/O and MPI Traffic on Fat-tree Networks

Paper

Task Placement Algorithms

Straub Hall, Room 145

Jee Choi

Energy-Efficient Speculative Execution using Advanced Reservation for Heterogeneous Clusters

Topology-induced Enhancement of Mappings

Charging Task Scheduling for Directional Wireless Charger Networks

Paper

12:30pm-2:00pm

ICPP Executive Meeting

Oak, Ballroom Area, Erb Memorial Union (EMU), 2nd Floor

Allen Malony

Lunch

Erb Memorial Union (EMU) Ballroom, 2nd Floor

Lunch

1:00pm-2:00pm

Ph.D. Forum Introduction

Maple, Ballroom Area, Erb Memorial Union (EMU), 2nd Floor

Ph.D. Forum

2:00pm-3:30pm

Astronomy and Earth Systems

Straub Hall, Room 245

Kevin Huck

Scalable Solutions for Automated Single Pulse Identification and Classification in Radio Astronomy

Communication-Avoiding for Dynamical Core of Atmospheric General Circulation Model

MPI-Vector-IO: Parallel I/O and Partitioning for Geospatial Vector Data

Paper

Networking Algorithms

Straub Hall, Room 145

Kamesh Madduri

NFV Middlebox Placement with Balanced Set-up Cost and Bandwidth Consumption

DAG-SFC: Minimize the Embedding Cost of SFC with Parallel VNFs

Heterogeneous Wireless Charger Placement with Obstacles

Paper

Performance Tools and Methodologies

Straub Hall, Room 156

Sameer Shende

Scalable Behavioral Emulation of Extreme-Scale Systems Using Structural Simulation Toolkit

Varbench: an Experimental Framework to Measure and Characterize Performance Variability

NumaMMA: NUMA MeMory Analyzer

Paper

3:30pm-4:00pm

Break

Straub Hall Lobby

Break

4:00pm-6:00pm

Algorithms

Straub Hall, Room 145

Wu Feng

MND-MST: A Multi-Node Multi-Device Parallel Boruvka's MST Algorithm

CSTF: Large-Scale Sparse Tensor Factorizations on Distributed Platforms

Reducing Communication in Proximal Newton Methods for Sparse Least Squares Problems

PBCS: An Efficient Parallel Characteristic Set Method for Solving Boolean Polynomial Systems

Paper

Performance on GPU Systems

Straub Hall, Room 156

Sameer Shende

Using Static Allocation Algorithms for Matrix Matrix Multiplication on Multicores and GPUs

Revisiting Multi-pass Scatter and Gather on GPUs

Matrix Factorization on GPUs with Memory Optimization and Approximate Computing

Massively Parallel Huffman Decoding on GPUs

Paper

Scheduling Algorithms

Straub Hall, Room 245

Michele Weiland

A Generic Approach to Scheduling and Checkpointing Workflows

ran-GJS: Orchestrating Data Analytics for Heterogeneous Geo-distributed Edges

Less Provisioning: A Fine-Grained Resource Scaling Engine for Long-Running Services with Tail Latency Guarantees

Improving Resource Utilization through Demand Aware Process Scheduling

Paper

6:30pm-9:30pm

Poster Reception and Dinner

Erb Memorial Union (EMU) Ballroom, 2nd Floor

Kengo Nakajima

Delta-Stepping Synchronous Parallel Model

Linear Time Sorting for Large Data Sets with Specialized Processor

Exploring Memory Coalescing for 3D-Stacked Hybrid Memory Cube

A Low-Communication Method to Solve Poisson's Equation on Locally-Structured Grids

CGAcc: CSR-based Graph Traversal Accelerator on HMC

An Extensible Ecosystem of Tools Providing User Friendly HPC Access and Supporting Jupyter Notebooks

In-Depth Reliability Characterization of NAND Flash based Solid State Drives in High Performance Computing Systems

Leveraging Resource Bottleneck Awareness and Optimizations for Data Analytics Performance

Performance Improvements of an Event Index Distributed System

KeyBin2: Distributed Clustering for Scalable and In-situ Analysis

I/O Bottleneck Investigation in Deep Learning Systems

OpenMP 4.5 Implementations: Evaluation & Verification of Offloading Features

Cost-Time Performance of Scaling Applications on the Cloud

SOSflow: A Scalable Observation System for Introspection and In Situ Analytics

Abstractions for Specifying Sparse Matrix Data Transformations

Topologies and Adaptive Routing on Large-Scale Interconnects

Middleware for Data Intensive Analytics on HPC

Toward a Multi-GPU Implementation of the Modular Integer GCD Algorithm: Extended Abstract

DSAP: Data Structure-Aware Prefetching for Breadth First Search on GPU

Exploiting Inter-Phase Application Dynamism to Auto-Tune HPC Applications for Energy-Efficiency

Resource and Service Management in Fog Computing

Iterative Solver Selection Techniques for Sparse Linear Systems

Performance Analysis of DroughtHPC and Holistic HPC Workflows

Push-Pull on Graphs is Column- and Row-based SpMV Plus Masks

Adaptive auto-tuning in HPX using APEX

A HPC Framework for Big Spatial Data Processing and Analytics

A Computational Investigation of Redistricting Using Simulated Annealing

Fast and generic concurrent message-passing

Designing Domain-Specific Heterogenous Manycores from Dataflow Programs

Identifying Carcinogenic Multi-hit Combinations using Weighted Set Cover Algorithm

Sajal Dash (Virginia Tech); Nick Kinney, Robin Varghese, and Harold Garner (Edward Via College of Osteopathic Medicine, Blacksburg); Wu-chun Feng (Virginia Tech); and Ramu Anandakrishnan (Edward Via College of Osteopathic Medicine, Blacksburg)

Abstract

Disruptions in certain molecular pathways due to combinations of genetic mutations (hits) are known to cause cancer. Due to the large number of mutations present in tumor cells, experimentally identifying these combinations is not possible except in very rare cases. Current computational approaches simply do not search for specific combinations of multiple genetic mutations. Instead, current algorithms search for sets of driver mutations based on mutation frequency and mutational signatures. Here, we present a fundamentally different approach for identifying carcinogenic mutations: we search for combinations of carcinogenic mutations (multi-hit combinations). By avoiding the convolution of different driver mutations associated with different individual instances of cancer, multi-hit combinations may be able to identify the specific cause for each cancer instance. We mapped the problem of identifying a set of multi-hit combinations to a weighted set cover problem. We use a greedy algorithm to identify sets of multi-hit combinations for seventeen cancer types for which there are at least two hundred matched tumor and blood-derived normal samples in The Cancer Genome Atlas (TCGA). When tested using an independent validation dataset, these combinations are able to differentiate between tumor and normal tissue samples with 91% sensitivity (95% Confidence Interval (CI) = 89-92%) and 93% specificity (95% CI = 91-94%), on average for seventeen cancer types. The combinations identified by this method, with experimental validation, can aid in better diagnosis, provide insights into the etiology of various cancer types, and provide a rational basis for designing targeted combination therapies.
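
As a rough illustration of the greedy weighted set cover step mentioned in the abstract, the Python sketch below repeatedly picks the subset with the lowest weight per newly covered element. It is a generic textbook heuristic, not the authors' implementation; the sample IDs, combination names, and weights are made up, and the abstract does not say how the weights are defined.

```python
def greedy_weighted_set_cover(universe, subsets, weights):
    """Pick subsets until every element of `universe` is covered.

    At each step, choose the subset with the lowest weight per newly
    covered element (the classic greedy heuristic).
    """
    uncovered = set(universe)
    chosen = []
    while uncovered:
        best_name, best_ratio = None, float("inf")
        # Sort names so that ties are broken deterministically.
        for name in sorted(subsets):
            gain = len(subsets[name] & uncovered)
            if gain == 0:
                continue
            ratio = weights[name] / gain
            if ratio < best_ratio:
                best_name, best_ratio = name, ratio
        if best_name is None:
            raise ValueError("remaining elements cannot be covered")
        chosen.append(best_name)
        uncovered -= subsets[best_name]
    return chosen


# Hypothetical toy data: tumor samples t1..t5 and, for each made-up
# gene combination, the samples in which that combination is mutated.
samples = {"t1", "t2", "t3", "t4", "t5"}
combos = {
    "A+B": {"t1", "t2", "t3"},
    "C+D": {"t3", "t4"},
    "E+F": {"t4", "t5"},
    "A+F": {"t1", "t5"},
}
costs = {"A+B": 1.0, "C+D": 1.0, "E+F": 1.0, "A+F": 2.0}  # assumed weights

selected = greedy_weighted_set_cover(samples, combos, costs)
print(selected)  # ['A+B', 'E+F']
```

In this toy case two combinations suffice to cover all five samples; the greedy choice is not guaranteed to be optimal in general.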

Efficient Matching of GPU Kernel Subgraphs

Toward Footprint-Aware Power Shifting for Hybrid Memory Based Systems

Algorithm Design for Large Scale FFT-Based Simulations on CPU-GPU Platforms

WebNN: A Distributed Framework for Deep Learning

Utilization of Random Profiling for System Modeling and Dynamic Configuration

Performance evaluation of parallel cloud functions

Interval based Framework for Locking in Hierarchies

Models and Techniques for Green High-Performance Computing

Poster, Reception

Wednesday, August 15th

8:00am-10:00am

ICPP Registration

Erb Memorial Union (EMU) Ballroom, 2nd Floor

Registration

9:00am-10:00am

Plenary

Straub Hall, Room 156

Manish Parashar

Transforming Science through Cyberinfrastructure

Plenary Talk

10:00am-10:30am

Break

Straub Hall Lobby

Break

10:30am-12:30pm

Machine Learning and Networks

Straub Hall, Room 145

Peter Pirkelbauer

Learning Driven Parallelization for Large-Scale Video Workload in Hybrid CPU-GPU Cluster

GLP4NN: A Convergence-invariant and Network-agnostic Light-weight Parallelization Framework for Deep Neural Networks on Modern GPUs

KeyBin2: Distributed Clustering for Scalable and In-Situ Analysis

Disk Failure Prediction in Data Centers via Online Learning

Paper

Memory Performance

Straub Hall, Room 156

Martin Schulz

A Performance Model to Execute Workflows on High-Bandwidth-Memory Architectures

Optimizing for KNL Usage Modes When Data Doesn’t Fit in MCDRAM

Nemo: NUMA-aware Concurrency Control for Scalable Transactional Memory

SPECTR: Scalable Parallel Short Read Error Correction on Multi-core and Many-core Architectures

Kai Xu (School of Software, Shandong University); Robin Kobus (Institute for Computer Science, Johannes Gutenberg University); Yuandong Chan, Ping Gao, and Xiangxu Meng (School of Software, Shandong University); Yanjie Wei (Shenzhen Institutes of Advanced Technology, CAS); Bertil Schmidt (Johannes Gutenberg University of Mainz); and Weiguo Liu (School of Software, Shandong University)

Abstract

Modern high throughput sequencing platforms can produce large amounts of short read DNA data at low cost. Error correction is an important but time-consuming initial step when processing this data in order to improve the quality of downstream analyses. In this paper, we present a Scalable Parallel Error CorrecToR designed to improve the throughput of DNA error correction for Illumina reads on various parallel platforms. Our design is based on a k-spectrum approach where a Bloom filter is frequently probed as a key operation and is optimized towards AVX-512-based multi-core CPUs, Xeon Phi many-cores (both KNC and KNL), and heterogeneous compute clusters. A number of architecture-specific optimizations are employed to achieve high performance such as memory alignment, vectorized Bloom filter probing, and a stack-based iteration to eliminate recursion. Our experiments show that our optimizations result in speedups of up to 2.8, 5.2, and 9.3 on a CPU (Xeon W-2123), a KNC-based Xeon Phi (31S1P), and a KNL-based Xeon Phi (7210), respectively, compared to a multi-threaded CPU reference implementation for the error correction stage. Furthermore, when executed on the same hardware, SPECTR achieves a speedup of up to 1.7, 2.1, 2.4, and 6.4, compared to the state-of-the-art tools Lighter, BLESS2, RECKONER, and Musket, respectively. In addition, our MPI implementation exhibits an efficiency of around 86% when executed on 32 nodes of the Tianhe-2 supercomputer. SPECTR is available at https://github.com/Xu-Kai/SPECTR.
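
To make the k-spectrum idea in the abstract concrete, the Python sketch below stores "trusted" k-mers in a small Bloom filter and flags the positions in a read whose k-mer is not found. This is only a scalar illustration under assumed parameters (filter size, hash count, k, and the example reads are invented); it does not reproduce SPECTR's vectorized probing or its correction logic.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter: no false negatives, occasional false positives."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive several hash values by salting BLAKE2b with the hash index.
        for seed in range(self.num_hashes):
            digest = hashlib.blake2b(
                item.encode(), digest_size=8, salt=seed.to_bytes(8, "little")
            ).digest()
            yield int.from_bytes(digest, "little") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item)
        )


def kmers(read, k):
    """Return all overlapping substrings of length k."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def untrusted_positions(read, k, trusted):
    """Start positions of k-mers that are absent from the trusted set."""
    return [i for i, kmer in enumerate(kmers(read, k)) if kmer not in trusted]


K = 4  # assumed k-mer length, chosen small for the example
reference = "ACGTACGGTCA"  # made-up error-free read
noisy = "ACGTACTGTCA"      # same read with one substitution at index 6

trusted = BloomFilter(num_bits=1 << 16, num_hashes=3)
for kmer in kmers(reference, K):
    trusted.add(kmer)

flagged = untrusted_positions(noisy, K, trusted)
print(flagged)
```

Only k-mers that overlap the substituted base can be flagged, because every other k-mer was inserted into the filter; a false positive could hide a flagged position but never add one.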

Paper

Networking

Straub Hall, Room 245

Federico Silla

Cache Assisted Randomized Sharing Counters in Network Measurement

Load-Balanced Slim Fly Networks

Toward Performant and Energy-efficient Queries in Three-tier Wireless Sensor Networks

Click-Based Asynchronous Mesh Network with Bounded Bundled Data

Anping He (School of Information Science and Engineering, Lanzhou University); Hong Chen (Institute of Microelectronics, Tsinghua University, Beijing National Research Center for Information Science and Technology); Guangbo Feng (School of Information Science and Engineering, Lanzhou University); Jiling Zhang (School of Physical Science and Engineering, Lanzhou University); Pengfei Li (School of Information Science and Engineering, Lanzhou University); and Yong Hei (Institute of Microelectronics, Chinese Academy of Sciences)

Abstract

We have implemented an asynchronous mesh network. This paper describes our innovative design using a Click controller. Compared to designs that use other asynchronous circuit families with C-elements and four-phase bundled data, our two-phase Click-based Bounded Bundled Data design is faster, but introduces phase skews when handling concurrent traffic at a single node. Instead of eliminating the phase skews, we use them as computation slots. Our network uses a novel asynchronous arbiter with a queue that can accept data from the four cardinal directions as well as from a local source, five directions in all. We have implemented our network design in 1 × 1, 2 × 2, and 4 × 4 sizes; larger networks could be implemented easily because of the isomorphism and modularity of the routing nodes. Our experiments show that an initial data item passes through a node in 157 ns vs. 81 ns for the non-delay-branch and delay-branch designs, respectively. Following items take about 65% as long. For a network, however, the average latency of a node stays almost the same for different paths. We believe that with the non-delay-branch design, our asynchronous mesh network could offer 10.1M routes per second for a 1 × 1 network, 5.33M routes per second for a 2 × 2 network, and 5.06M for a 4 × 4 network, and work at rates of 17.3M, 10.1M, and 11.7M with the enhanced delay-branch design. In both cases, latency is approximately linear with scale.

Paper

12:30pm-2:00pm

Lunch

Erb Memorial Union (EMU) Ballroom, 2nd Floor

Lunch

Ph.D. Forum Discussion

Erb Memorial Union (EMU) Ballroom, 2nd Floor

Ph.D. Forum

2:00pm-3:30pm

Machine Learning

Straub Hall, Room 145

Wu Feng

Partitioning and Communication Strategies for Sparse Non-negative Matrix Factorization

Energy-efficient Application Resource Scheduling using Machine Learning Classifiers

PRIONN: Predicting Runtime and IO using Neural Networks

Paper

Materials and Molecular Dynamics

Straub Hall, Room 245

Jose Canales

Massively Scaling the Metal Microscopic Damage Simulation on Sunway TaihuLight Supercomputer

Combining Task-based Parallelism and Adaptive Mesh Refinement Techniques in Molecular Dynamics Simulations

Task-parallel Analysis of Molecular Dynamics Trajectories

Paper

Performance Studies

Straub Hall, Room 156

Filippo Mantovani

A Multilevel Subtree Method for Single and Batched Sparse Cholesky Factorization

Vectorised Computation of Diverging Ensembles

Balanced k-means for Parallel Geometric Partitioning

Paper

3:30pm-4:00pm

Break

Straub Hall Lobby

Break

4:00pm-5:30pm

Performance of Sparse Algorithms

Straub Hall, Room 245

Sameer Shende

A Fast Sparse Triangular Solver for Structured-grid Problems on Sunway Many-core Processor SW26010

Xinliang Wang, Ping Xu, and Wei Xue (Tsinghua University, National Supercomputing Center in Wuxi); Yulong Ao (School of Mathematical Sciences, Peking University); Chao Yang (School of Mathematical Sciences & National Engineering Laboratory for Video Technology, Peking University); Haohuan Fu, Lin Gan, and Guangwen Yang (Tsinghua University, National Supercomputing Center in Wuxi); and Weimin Zheng (Tsinghua University)

Abstract

The sparse triangular solver, SpTRSV, is one of the most important kernels in many scientific and engineering applications. Efficiently parallelizing SpTRSV on modern many-core architectures is considerably difficult due to the inherent dependency of tasks and the frequent but discontinuous memory accesses. Achieving high performance of SpTRSV is even more challenging for SW26010, the new-generation customized heterogeneous many-core processor used in the Sunway TaihuLight supercomputer. The known parallel SpTRSVs have to be refactored to fit the single-thread and cacheless design of SW26010. In this work, we focus on how to design and implement a fast SpTRSV for structured-grid problems on SW26010. A generalized algorithm framework of parallel SpTRSV is proposed for best utilization of the features and flexibilities of the SW26010 many-core architecture according to the fine-grained Producer-Consumer model. Moreover, a novel parallel structured-grid SpTRSV is presented that uses direct data transfers across registers of the computing elements of SW26010. Experiments on four typical structured-grid triangular problems with different problem sizes demonstrate that our SpTRSV can achieve an average bandwidth utilization of 81.7%, which leads to a speedup of 17 over the serial method on the management processing element of SW26010. Experiments with linear sparse solvers show that this new SpTRSV can achieve superior performance over the latest Intel Xeon CPU and Intel KNL over DDR4 memory.

Bandwidth Reduced Parallel SpMV on the SW26010 Many-Core Platform

Vectorized Parallel Sparse Matrix-Vector Multiplication in PETSc Using AVX-512

Paper

Ph.D. Career Planning

Gumwood, Ballroom Area, Erb Memorial Union (EMU), 2nd Floor

Ph.D. Forum

Programming Models

Straub Hall, Room 156

Olga Pearce

A Comprehensive Study on Bugs in Actor Systems

A Framework for Auto-Parallelization and Code Generation: An Integrative Case Study with Legacy FORTRAN Codes

Improving MPI Multi-threaded RMA Communication Performance

Nathan Hjelm (Los Alamos National Lab, University of New Mexico); Matthew Dosanjh and Ryan Grant (Sandia National Laboratories); Taylor Groves (Lawrence Berkeley National Laboratory); Patrick Bridges (University of New Mexico); and Dorian Arnold (Emory University, University of New Mexico)

Abstract

One-sided communication is crucial to enabling communication concurrency. As core counts have increased, particularly with many-core architectures, one-sided (RMA) communication has been proposed to address the ever-increasing contention that arises when more processes try to use a single network interface. The difficulty in using one-sided (RMA) communication with MPI is that the performance of MPI implementations using RMA with multiple concurrent threads is not well understood. Past studies have used MPI RMA in combination with multi-threading (RMA-MT), but they were performed on older MPI implementations lacking RMA-MT optimizations. In addition, prior work has only been done at smaller scale (≤512 cores).

In this paper, we describe a new RMA implementation for Open MPI. The implementation targets scalability and multi-threaded performance. We describe the design and implementation of our RMA improvements and offer an evaluation that demonstrates scaling to 262,144 cores, the full size of a leading supercomputer installation. In contrast, the previous implementation failed to scale past approximately 4,096 cores. To evaluate this approach, we then compare against a vendor optimized MPI RMA-MT implementation with microbenchmarks, mini-applications, and a full astrophysics code at large scale on a many-core architecture. This is the first time that an evaluation at large scale on many-core architectures has been done for MPI RMA-MT (524,288 cores) and the first large scale application performance comparison between two different RMA-MT optimized MPI implementations. The results show interesting trade-offs between the Cray MPI and Open MPI RMA-MT optimized implementations.

Paper

Resilience and Reliability

Straub Hall, Room 145

Allen Malony

Characterizing the Impact of Soft Errors Affecting Floating-point ALUs using RTL-level Fault Injection

Leverage Redundancy in Hardware Transactional Memory to Improve Cache Reliability

Modeling Application Resilience in Large-scale Parallel Execution

Paper

6:30pm-9:30pm

Conference Dinner

Jordan Schnitzer Museum of Art (JSMA)

Reception

Thursday, August 16th

9:00am-10:00am

Plenary

Straub Hall, Room 156

Mary Hall

Bringing Sparse Computations into the Optimization Light

Plenary Talk

10:00am-10:30am

Break

Straub Hall Lobby

Break

10:30am-12:30pm

Memory and Caching

Straub Hall, Room 145

Sameer Shende

Memory Coalescing for Hybrid Memory Cube

CAMPS: Conflict-Aware Memory-Side Prefetching Scheme for Hybrid Memory Cube

Improving First Level Cache Efficiency for GPUs Using Dynamic Line Protection

Accelerating FM-index Search for Genomic Data Processing

Paper

Resource Management

Straub Hall, Room 156

Taisuke Boku

Joint Optimization of MapReduce Scheduling and Network Policy in Hierarchical Clouds

Performance & Energy Tradeoffs for Dependent Distributed Applications Under System-wide Power Caps

H2Cloud: Maintaining the Whole Filesystem in an Object Storage Cloud

Power Efficient High Performance Packet I/O

Paper

Runtime Systems

Straub Hall, Room 245

Olga Pearce

Efficient Runtime Support for a Partitioned Global Logical Address Space

FULT: Fast User-Level Thread Scheduling Using Bit-Vectors

Constructing Dynamic Policies for Paging Mode Selection

The Case for Semi-Permanent Cache Occupancy

Paper

12:30pm-2:00pm

Lunch

Erb Memorial Union (EMU) Ballroom, 2nd Floor

Lunch

2:00pm-3:30pm

Parallel and Distributed Algorithms

Straub Hall, Room 156

Kamesh Madduri

A Communication-Efficient Causal Broadcast Protocol

NumLock: Towards Optimal Multi-Granularity Locking in Hierarchies

IS-ASGD: Accelerating Asynchronous SGD using Importance Sampling

Paper

Performance of Graph Algorithms

Straub Hall, Room 145

Allen Malony

Parallelizing Pruning-based Graph Structural Clustering

An Empirical Comparison of k-Shortest Simple Path Algorithms on Multicores

Deepak Ajwani (Nokia Bell Laboratories, Dublin); Erika Duriakova (Insight Centre for Data Analytics; School of Computer Science and Informatics, University College Dublin); Neil Hurley (Insight Centre for Data Analytics; School of Computer Science, University College Dublin); and Ulrich Meyer and Alexander Schickedanz (Goethe University, Frankfurt)

Abstract

We consider the loopless k-shortest path (KSP) problem. Although this problem has been studied in the sequential setting for at least the last two decades, no good parallel implementations are known. In this paper, we (i) provide a first systematic empirical comparison of various KSP algorithms and heuristic optimisations, (ii) carefully engineer various parallel implementations of these sequential algorithms, and (iii) perform an extensive study of these parallel implementations on a range of graph classes and multicore architectures to determine the best algorithm and parallelization strategy for different graph classes.

We find that even though the worst-case complexity of the best undirected KSP algorithm, O(k(m + n log n)), is significantly better than that of the popular and considerably simpler directed KSP algorithm, O(kn(m + n log n)), the two algorithms are fairly competitive in terms of their empirical performance on small diameter graphs. Furthermore, we show that a few simple optimisations help to bridge the gap between these KSP algorithms even more. However, on moderate to large diameter graphs, the undirected KSP algorithm is considerably faster than the directed algorithms, both in sequential and parallel settings. In terms of the parallelisation strategy, simply replacing the shortest path subroutine with a parallel ∆-stepping algorithm can provide a good speed-up for many KSP algorithms on random graphs. In contrast, for graphs with skewed degree distribution, a more complex strategy of parallelizing the different deviations and then parallelizing the shortest path computation inside the deviations with the remaining threads provides better performance.

C-Graph: A Highly Efficient Concurrent Graph Reachability Query Framework

Paper

Storage

Straub Hall, Room 245

Kevin Huck

Cross-Rack-Aware Updates in Erasure-Coded Data Centers

Duchy: Achieving Both SSD Durability and Controllable SMR Cleaning Overhead in Hybrid Storage Systems

Efficient SSD Caching by Avoiding Unnecessary Writes using Machine Learning

Hua Wang and Xinbo Yi (Wuhan National Laboratory for Optoelectronics, Huazhong University of Science and Technology); Ping Huang (Department of Computer and Information Sciences, Temple University); Bin Cheng (Shenzhen Tencent Computer System Co., Ltd.); and Ke Zhou (Wuhan National Laboratory for Optoelectronics, Huazhong University of Science and Technology)

Abstract

SSDs play an important role in caching systems because of their high performance-to-cost ratio. Since cache space is much smaller than that of the backend storage, the write density (writes per unit time and space) of an SSD cache is much higher than that of HDD, which poses a great challenge to SSD lifetime. Meanwhile, under social network workloads, quite a few writes to SSD are unnecessary; e.g., Tencent’s photo caching shows that about 61% of all photos are accessed just once, yet they are still swapped into the cache. Therefore, if we can predict such photos proactively and prevent them from entering the cache, we can eliminate unnecessary SSD cache writes and improve cache space utilization. To cope with this challenge, we put forward a "one-time-access criterion" that is applied to cache space, and further propose a "one-time-access-exclusion" policy. Based on that, we design a prediction-based classifier to facilitate the policy. Unlike state-of-the-art history-based predictions, our prediction is non-history-oriented, which makes good prediction accuracy challenging to achieve. To address this issue, we integrate a decision tree into the classifier, extract social-related information as classifying features, and apply cost-sensitive learning to improve classification precision. With these techniques, we attain a prediction accuracy of over 80%. Experimental results show that the "one-time-access-exclusion" approach improves caching performance in most aspects: taking LRU as an example, the hit rate improves by 17%, cache writes decrease by 79%, and the average access latency drops by 7.5%.
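
The exclusion policy described in the abstract can be pictured as an admission check placed in front of an ordinary LRU cache. The Python sketch below is a simplified illustration only: the class name, the `predict_one_time` callback, and the `owner_followers` feature are hypothetical stand-ins, and a trivial threshold rule replaces the paper's decision-tree classifier.

```python
from collections import OrderedDict


class AdmissionLRUCache:
    """LRU cache that skips writing items predicted to be accessed once."""

    def __init__(self, capacity, predict_one_time):
        self.capacity = capacity
        self.predict_one_time = predict_one_time  # callable: features -> bool
        self.entries = OrderedDict()
        self.hits = 0
        self.writes = 0

    def get(self, key, features, load):
        if key in self.entries:
            self.entries.move_to_end(key)  # mark as most recently used
            self.hits += 1
            return self.entries[key]
        value = load(key)  # fetch from backend storage on a miss
        if not self.predict_one_time(features):
            self.entries[key] = value
            self.writes += 1
            if len(self.entries) > self.capacity:
                self.entries.popitem(last=False)  # evict least recently used
        return value


def predict_one_time(features):
    # Hypothetical rule standing in for a trained classifier.
    return features.get("owner_followers", 0) < 10


backend = {"p1": b"img1", "p2": b"img2", "p3": b"img3"}
cache = AdmissionLRUCache(capacity=2, predict_one_time=predict_one_time)

cache.get("p1", {"owner_followers": 500}, backend.__getitem__)  # miss, admitted
cache.get("p2", {"owner_followers": 3}, backend.__getitem__)    # miss, bypassed
cache.get("p1", {"owner_followers": 500}, backend.__getitem__)  # hit
print(cache.hits, cache.writes, list(cache.entries))
```

Items predicted to be one-time accesses are still served to the caller but never written to the cache, which is where the saving in cache writes would come from.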

Paper

3:30pm-4:00pm

Break

Straub Hall Lobby

Break

4:00pm-5:30pm

Data Processing

Straub Hall, Room 145

Taisuke Boku

Dual-Paradigm Stream Processing

Index Shard Replication Strategies for Improving Resource Utilization in Large Scale Search Engines

FFS-VA: A Fast Filtering System for Large-scale Video Analytics

Paper

I/O and File Systems

Straub Hall, Room 245

Peter Pirkelbauer

Efficient Search for Free Blocks in the WAFL File System

A Write-efficient and Consistent Hashing Scheme for Non-Volatile Memory

Reference-distance Eviction and Prefetching for Cache Management in Spark

Paper

Matrix and Graph Algorithms

Straub Hall, Room 156

Jee Choi

Implementing Push-Pull Efficiently in GraphBLAS

UHCL-Darknet: An OpenCL-based Deep Neural Network Framework for Heterogeneous Multi-/Many-core Clusters

Optimization of the Spherical Harmonics Transform based Tree Traversals in the Helmholtz FMM Algorithm

Paper