Agenda

Thursday, September 24, 2020

Session Video

Session Slides

Abstract

Open source, community-developed reusable scientific software represents a large and growing body of capabilities. Linux distributions, vendor software stacks and individual disciplined software product teams provide the scientific computing community with usable holistic software environments containing core open source software components. At the same time, new software capabilities make it into these distributions in a largely ad hoc fashion.

The Extreme-scale Scientific Software Stack (E4S), first announced in November 2018, along with its community-organized scientific software development kits (SDKs), is a new community effort to create lightweight cross-team coordination of scientific software development, delivery, and deployment, together with a set of support tools and processes targeted at improving scientific software quality via improved practices, policy, testing, and coordination.

E4S (https://e4s.io), which announced the release of Version 1.0 in November 2019, is an open architecture effort, welcoming teams that are developing technically compatible and high-quality products to participate in the community. E4S and the SDKs are sponsored by the US Department of Energy Exascale Computing Project (ECP), driven by our need to effectively develop, test, deliver, and deploy our open source software products on next-generation platforms for the scientific community.

In this presentation, we introduce E4S, discuss its design and implementation goals and show examples of success and challenges so far. We will also discuss our connection with other key community efforts we rely upon for our success and describe how collaboration around E4S can be realized.


Bio

Michael Heroux
Michael Heroux is a Senior Scientist at Sandia National Laboratories, Director of SW Technologies for the US DOE Exascale Computing Project (ECP) and Scientist in Residence at St. John’s University, MN. His research interests include all aspects of scalable scientific and engineering software for new and emerging parallel computing architectures.

He leads several projects in this field: ECP SW Technologies is an integrated effort to provide the software stack for ECP. The Trilinos Project (2004 R&D 100 winner) is an effort to provide reusable, scalable scientific software components. The Mantevo Project (2013 R&D 100 winner) is focused on the development of open source, portable mini-applications and mini-drivers for the co-design of future supercomputers and applications. HPCG is an official TOP500 benchmark for ranking computer systems, complementing LINPACK.

Session Video

Abstract

CSCS was an early adopter of GPUs, with the installation of Piz Daint in 2013 with NVIDIA K20X GPUs and an upgrade to P100 GPUs in 2016. As a result of our experiences, we have focused on improving the performance portability of key applications and libraries to ensure that we can take our user program to new architectures in the future. To this end, we have built a large team focused on performance-portable software development. Our efforts on refactoring codes for performance portability are bearing fruit as we prepare for the increased diversity of accelerated platforms, with GPUs from NVIDIA, AMD, and Intel as well as Arm CPUs coming online over the next two years. In this talk I will discuss the approaches we have taken and some lessons learnt from our efforts porting over five key applications to one of these new platforms, namely AMD GPUs.


Bio

Ben Cumming
Ben Cumming is a group lead in the scientific software and libraries unit at the Swiss National Supercomputing Centre (CSCS). His role at CSCS is to lead groups of developers who develop and support performance-portable HPC libraries and applications. He has worked on HPC simulation in a range of fields, including groundwater flow, numerical weather prediction, and neuroscience.

Session Video

Abstract

A brief overview of High Performance Computing and Computational Fluid Dynamics (CFD) at NASA is presented. Typical CFD simulations leveraging capacity and capability-class computing are shown. Ongoing development activities and challenges are also highlighted, including migration to emerging architectures and their application to real-world problems.


Bio

Eric Nielsen
Eric Nielsen is a Senior Research Scientist with the Computational AeroSciences Branch at NASA Langley Research Center in Hampton, Virginia. He received his PhD in Aerospace Engineering from Virginia Tech and has worked at Langley for the past 28 years. Dr. Nielsen specializes in the development of computational aerodynamics software for the world's most powerful computer systems. The software has been distributed to thousands of organizations around the country and supports major national research and engineering efforts at NASA, in industry, academia, the Department of Defense, and other government agencies. He has published extensively on the subject and has given presentations around the world on his work. He has served as the Principal Investigator on Agency-level projects at NASA as well as efforts sponsored by the Department of Energy. Dr. Nielsen is a recipient of NASA's Silver Achievement, Exceptional Achievement, and Exceptional Engineering Achievement Medals as well as NASA Langley's HJE Reid Award for best research publication.

Session Video

Abstract

The EuroHPC Joint Undertaking is a large European initiative to establish a broad high-performance computing ecosystem among its 32 European member states. It targets both hardware efforts, most prominently the European Processor Initiative (EPI) developing a new generation of energy-efficient processors, and software activities towards an integrated HPC development and application environment. As part of the latter, EuroHPC is currently starting several large-scale projects targeting a comprehensive ecosystem for European high-performance computing applications. Ultimately, all EuroHPC efforts will combine in the first European exascale-class system, to be deployed in the next few years.

In this talk I will give an overview of two of the newly formed software projects: DEEP-SEA, covering a comprehensive software stack for the first European exascale systems, and REGALE, providing adaptive resource management across workflows, with a special focus on power and energy management. Together, they tackle some of the major obstacles we face on our path to exascale, including the need for more application malleability through new system management and programming approaches, the extension of existing programming models and APIs, the development of new asynchronous programming approaches, the integration and efficient use of new memory technologies in an existing software stack, as well as system-wide performance monitoring and active steering.

The latter is closely tied to PowerStack, a community-driven effort to establish a vertically integrated capability to measure, steer, and control power and energy usage, and it will provide one of the first large implementations of these concepts. This will be a significant contribution to making exascale-class systems both usable and sustainable.


Bio

Martin Schulz
Martin Schulz is a Full Professor and Chair for Computer Architecture and Parallel Systems at the Technische Universität München (TUM), which he joined in 2017, as well as a member of the board of directors at the Leibniz Supercomputing Centre. Prior to that, he held positions at the Center for Applied Scientific Computing (CASC) at Lawrence Livermore National Laboratory (LLNL) and Cornell University. He earned his Doctorate in Computer Science in 2001 from TUM and a Master of Science in Computer Science from UIUC. Martin has published over 200 peer-reviewed papers and currently serves as the chair of the MPI Forum, the standardization body for the Message Passing Interface. His research interests include parallel and distributed architectures and applications; performance monitoring, modeling and analysis; memory system optimization; parallel programming paradigms; tool support for parallel programming; power-aware parallel computing; and fault tolerance at the application and system level. Martin was a recipient of the IEEE/ACM Gordon Bell Award in 2006 and an R&D 100 award in 2011.

Session Video

Abstract

E4S is a Spack-based software distribution that focuses on producing a comprehensive and coherent software stack for exascale. But what goes into building a software ecosystem? HPC software is notoriously difficult to manage due to the wide variety of environments it must support, and there is no one true “supported” configuration for any piece of software. Testing, building, and hardening software for so many deployment targets is extremely challenging. Spack is the central tooling that underlies E4S. Work in and around the Spack community has enabled the E4S team to manage a broad ecosystem of dependencies and to automate the builds of binaries and containers. This talk will cover existing features in Spack and how they work with E4S, as well as aspects of E4S that work for Spack. It will also discuss some of the harder software challenges looming just over the horizon as we approach exascale machines, as well as our plans to address them.


Bio

Todd Gamblin
Todd Gamblin is a Computer Scientist in Livermore Computing's Advanced Technology Office at Lawrence Livermore National Laboratory. He created Spack, a popular open source HPC package management tool with a rapidly growing community of contributors. He leads the Packaging Technologies Project in the U.S. Exascale Computing Project, LLNL's DevRAMP project on developer productivity, and an LLNL Strategic Initiative on software integration and dependency management. His research interests include dependency management, software engineering, parallel computing, performance measurement, and performance analysis.

Todd has been at LLNL since 2008. He received the Early Career Research Award from the U.S. Department of Energy in 2014, an R&D 100 award in 2019, and the LLNL Director's Science & Technology Award in 2020. He received Ph.D. and M.S. degrees in Computer Science from the University of North Carolina at Chapel Hill in 2009 and 2005, and his B.A. in Computer Science and Japanese from Williams College in 2002.

Session Video

Abstract

Supercomputing facilities are beginning to officially support the Extreme-scale Scientific Software Stack but the road to this adoption has been fraught with technical and policy potholes. While many such challenges have been encountered, facility teams have mostly been able to resolve these conflicts through feedback to Spack and the E4S project – but the solutions may not be what you think! I will talk about the Exascale Computing Project (ECP) Software Deployment at Facilities (SD) activities that have directly impacted the viability of E4S at DOE's Office of Science supercomputing facilities. This talk will also delve into the differences between software developer and facility values and how that insight has helped convince supercomputing facilities that supporting E4S is beneficial for their users. Continuous integration, software building, software testing, and software delivery concepts will be touched on with an eye towards bringing industry best practices to facility processes and procedures.


Bio

Ryan Adamson
Ryan is the group leader for the HPC Core Operations Group at the Oak Ridge Leadership Computing Facility. His group is responsible for delivering highly scalable and reliable Linux infrastructure, networking, and security services to the high-performance computing resources found in the supercomputing network.

Previously, he was a Senior Security Engineer at the Oak Ridge Leadership Computing Facility where he spent 10 years coordinating incident detection, security system deployment, security policy development, and supercomputing security architecture.

His background in Linux systems administration at scale helps him understand the nuances of system complexity when developing solutions to security and operational mandates. Overall, Ryan enjoys tackling technical challenges and tries to find creative ways to safely and efficiently enable research when OLCF security policy, technical computational limitations, or other roadblocks chafe against the OLCF research mission.

Ryan also leads the Software Deployment at Facilities (SD) area of the Exascale Computing Project. Its mission is to ensure that the AD and ST products funded by ECP are buildable, testable, and available for use by ECP at the Office of Science supercomputing facilities. Major components of this work include developing supercomputing-specific enhancements to continuous integration tools like GitLab server and runners. Other efforts include using Spack to install the scientific software included in E4S, along with CI/CD pipelines that automatically produce build artifacts that users at facilities can pull into their own from-source builds.

He is currently interested in understanding how to secure cloud compute platforms such as Kubernetes and OpenShift to an enterprise standard. He also develops and maintains an open source, PIV-compliant, secure password management and distribution tool called pkpass. Additionally, Ryan is one of the primary proponents of a message-based log aggregation platform to help OLCF collect instrumented system data and deliver it to consumers that can influence decision making, reporting, and security policy.

He holds a master's degree in computer science from the University of Tennessee. He was previously a certified GIAC Exploit Researcher and Advanced Penetration Tester and has taught several computer hardware and Linux systems administration courses at Pellissippi State Community College as an adjunct faculty member.

Session Video

Abstract

The US Department of Energy's Exascale Computing Project (ECP) has invested in preparing the LLVM compiler infrastructure for exascale computing, and simultaneously, the LLVM community has added new features to the project that are important for the HPC ecosystem. ECP has improved Clang and LLVM to better support OpenMP, Fortran, loop optimizations, and interprocedural analysis. Support for parallel programming models, such as HIP and SYCL, and new capabilities, such as MLIR, are being added by others. Even seemingly unrelated additions to LLVM, such as its new libc project, could have a significant impact on HPC environments. This development will continue, opening up even more opportunities for future enhancements.
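As a loose illustration of the kind of code this compiler work enables (not an example taken from the talk), the sketch below shows a small OpenMP target-offload loop in C++ of the sort a recent Clang with OpenMP support can compile; the kernel, sizes, and build flags are assumptions for illustration only.

    // Minimal OpenMP target-offload sketch (illustrative only).
    // With a recent Clang this can typically be built along the lines of:
    //   clang++ -O2 -fopenmp -fopenmp-targets=<offload-target> saxpy.cpp
    // where the exact offload target triple depends on the installation.
    #include <cstdio>
    #include <vector>

    int main() {
      const int n = 1 << 20;
      const float a = 2.0f;
      std::vector<float> x(n, 1.0f), y(n, 2.0f);
      float* xp = x.data();
      float* yp = y.data();

      // Map the arrays to the device, run the loop there, copy y back.
      #pragma omp target teams distribute parallel for \
          map(to: xp[0:n]) map(tofrom: yp[0:n])
      for (int i = 0; i < n; ++i) {
        yp[i] = a * xp[i] + yp[i];
      }

      std::printf("y[0] = %f\n", yp[0]);  // expect 4.0
      return 0;
    }

Without the -fopenmp-targets flag, the same pragma simply falls back to host execution, which is one reason directive-based models remain attractive for portable HPC codes.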


Bio

Hal Finkel
Hal Finkel graduated from Yale University in 2011 with a Ph.D. in theoretical physics focusing on numerical simulation of early-universe cosmology. He’s now the Lead for Compiler Technology and Programming Languages at the ALCF. Hal has contributed to the LLVM compiler infrastructure project for many years and is currently the code owner of the PowerPC backend and the pointer-aliasing-analysis subsystem, among others. As part of DOE's Exascale Computing Project (ECP), Hal is a PathForward technical lead, Co-PI for the PROTEAS-TUNE, Flang, Kokkos, and Proxy Apps projects, and a member of several other ECP-funded projects. Hal represents Argonne on the C++ Standards Committee and serves as vice-chair of the committee. He was the lead developer on the bgclang project, which provided LLVM/Clang on IBM Blue Gene/Q supercomputers. Hal also helps develop the Hardware/Hybrid Accelerated Cosmology Code (HACC), a two-time IEEE/ACM Gordon Bell Prize finalist. He has designed and implemented a tree-based force evaluation scheme and the I/O subsystem and contributed to many other HACC components.

Session Video

Abstract

How do you create a successful and sustainable open source community from scratch? That is a question that has lately become more pressing for the Kokkos team. The user community appears to be on a near-exponential growth curve for now, with, for example, the number of Slack members more than doubling every year and more ECP products depending on Kokkos than on Fortran, according to the ECP dependency list. Supporting the needs of this kind of user community goes well beyond what a small team at a single institution can provide. This talk will provide some insights into the Kokkos team's attempt to deal with this success story and some thoughts on challenges and opportunities for an effort to sustain this open source community in the long run.


Bio

Christian Trott
Christian Trott is a high performance computing expert with extensive experience designing and implementing software for modern HPC systems. He is a principal member of staff at Sandia National Laboratories, where he leads the Kokkos core team developing the performance portability programming model for C++ and heads Sandia's delegation to the ISO C++ standards committee. He also serves as adviser to numerous application teams, helping them redesign their codes using Kokkos and achieve performance portability for the next generation of supercomputers. Christian is a regular contributor to numerous scientific software projects, including LAMMPS and Trilinos. He earned a doctorate from the University of Technology Ilmenau in theoretical physics with a focus on computational materials research.

Session Slides

Abstract

This talk will focus on challenges in designing, developing, packaging, and deploying high-performance and scalable MPI and HPC cloud middleware for HPC clusters. We will discuss the designs, sample performance numbers, and best practices of using the MVAPICH2 libraries (http://mvapich.cse.ohio-state.edu) for HPC and DL/ML applications on modern HPC clusters, considering support for multi-core systems (x86, ARM, and OpenPOWER), high-performance networks (InfiniBand, Omni-Path, RoCE, AWS-EFA, and iWARP), and NVIDIA and AMD GPGPUs (including GPUDirect RDMA). We will also discuss our recent experiences in packaging and releasing the MVAPICH2, MVAPICH2-X, and MVAPICH2-GDR libraries using Spack environments.
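As a hedged sketch (not drawn from the talk), the short C++ program below uses only standard MPI calls of the kind the MVAPICH2 libraries implement, and it can be built with any MPI compiler wrapper such as mpicxx; the reduction chosen here is purely illustrative.

    // Minimal MPI sketch using standard MPI-3 calls (illustrative only).
    // Build (assumed wrapper name): mpicxx -O2 allreduce.cpp && mpirun -np 4 ./a.out
    #include <mpi.h>
    #include <cstdio>

    int main(int argc, char** argv) {
      MPI_Init(&argc, &argv);

      int rank = 0, size = 0;
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &size);

      // Each rank contributes its rank number; the sum is reduced to all ranks.
      int local = rank, global = 0;
      MPI_Allreduce(&local, &global, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);

      if (rank == 0) {
        std::printf("ranks: %d, sum of ranks: %d\n", size, global);
      }

      MPI_Finalize();
      return 0;
    }

The same source runs unchanged on top of different MPI implementations; middleware such as MVAPICH2 differentiates itself beneath this interface through network-, GPU-, and topology-aware optimizations.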


Bio

Hari Subramoni
Hari Subramoni received the Ph.D. degree in Computer Science from The Ohio State University, Columbus, OH, in 2013. He has been a research scientist in the Department of Computer Science and Engineering at The Ohio State University since September 2015. His current research interests include high performance interconnects and protocols, parallel computer architecture, network-based computing, exascale computing, network-topology-aware computing, QoS, power-aware LAN-WAN communication, fault tolerance, virtualization, big data, and cloud computing. He has published over 70 papers in international journals and conferences related to these research areas. Dr. Subramoni is currently doing research on, and working on the design and development of, the MVAPICH2, MVAPICH2-GDR, and MVAPICH2-X software packages. He is a member of IEEE. More details about Dr. Subramoni are available at http://www.cse.ohio-state.edu/~subramon

Session Video

Abstract

This talk will cover the current state of the software stack on the NSF's Frontera and Stampede2 systems at TACC, the evolution that led to this stack and how it is changing, and our current estimates of the future needs we foresee for the planned NSF Leadership-Class Computing Facility. Included will be connections to the OpenHPC and BigHPC software projects, as well as points of connection to the DOE's ECP software projects.


Bio

Dan Stanzione
Dr. Dan Stanzione, Associate Vice President for Research at The University of Texas at Austin and Executive Director of the Texas Advanced Computing Center (TACC), is a nationally recognized leader in high performance computing. He is the principal investigator (PI) for several projects including a National Science Foundation (NSF) grant to acquire and deploy Frontera, which is the fastest supercomputer at a U.S. university. Stanzione is also the PI of TACC's Stampede2 and Wrangler systems, supercomputers for high performance computing and for data-focused applications, respectively. He served for six years as the co-director of CyVerse, a large-scale NSF life sciences cyberinfrastructure in which TACC is a major partner. In addition, Stanzione was a co-principal investigator for TACC's Ranger and Lonestar supercomputers, large-scale NSF systems previously deployed at UT Austin. Stanzione received his bachelor's degree in electrical engineering and his master's degree and doctorate in computer engineering from Clemson University, where he later directed the supercomputing laboratory and served as an assistant research professor of electrical and computer engineering. He has previously served as an AAAS Science and Technology Policy Fellow in Washington DC and the director of the Fulton High Performance Computing Initiative at Arizona State University.

Session Video

Abstract

The development of increasingly complex computer architectures with greater performance potential gives developers of application codes, including multiphysics modeling and the coupling of simulations and data analytics, the opportunity to perform larger simulations and achieve more accurate solutions than ever before. Achieving high performance on these new heterogeneous architectures requires expert knowledge. To meet these challenges in a timely fashion and make the best use of these capabilities, application developers will need to rely on a variety of mathematical libraries that are developed by diverse independent teams throughout the HPC community. It is not sufficient for these libraries to individually deliver high performance on these architectures; they also need to work well when built and used in combination within the application. The extreme-scale scientific software development kit (xSDK), which includes more than twenty HPC math libraries, is being developed to provide such an infrastructure.
This talk will discuss what is needed to achieve an effective ecosystem of interoperable math libraries that can easily be built on top of large application codes, as well as efforts to provide sustainability, portability, and high performance of the xSDK math libraries.


Bio

Ulrike Yang
Ulrike Meier Yang leads the Mathematical Algorithms & Computing group at the Center for Applied Scientific Computing at Lawrence Livermore National Laboratory. Her research interests are numerical algorithms, particularly algebraic multigrid methods, high performance computing, and scientific software design. She contributes to the scalable linear solvers library hypre. She leads the xSDK project in the DOE Exascale Computing Project.

Session Video

Abstract

As the code complexity of HPC applications expands, development teams continually rely on detailed software operation workflows to automate the building and testing of their applications. HPC application development can become increasingly complex, and as a result difficult to maintain, when the target HPC platforms' environments are increasingly diverse and continually changing. Recently, containers have emerged as a new software deployment mechanism, and the latest support for containers in HPC environments now makes them attainable for application teams.

This talk introduces the Supercontainers project, a consolidated effort across the DOE to use a multi-level approach to accelerate the adoption of container technologies for exascale systems. A major tenet of the project is to ensure that container runtimes are well poised to take advantage of future HPC systems, including efforts to ensure container images can be scalable, interoperable, and well integrated into exascale supercomputers across the DOE. The project focuses on the foundational system software research needed to ensure containers can be deployed at scale. Supercontainers also aims to provide enhanced user and developer support to ensure containerized exascale applications and software are both efficient and performant. Fundamentally, containers have the potential to simplify HPC application deployment and improve overall build and testing efficiency for the first exascale systems.


Bio

Andrew Younge
Andrew J. Younge is a Computer Scientist in the Scalable System Software department at Sandia National Laboratories. Andrew currently serves as the Lead PI for the Supercontainers effort under the DOE Exascale Computing Project and is a key contributor to the Astra system, the world's first supercomputer based on the Arm processor deployed under Sandia's Vanguard program. His research interests include high performance computing, virtualization, distributed systems, and energy efficient computing. The focus of his research is on improving the usability and efficiency of system software for supercomputing systems. Prior to joining Sandia, Andrew held visiting positions at the MITRE Corporation, the University of Southern California's Information Sciences Institute, and the University of Maryland, College Park. He received his PhD in computer science from Indiana University in 2016 and his BS and MS in computer science from the Rochester Institute of Technology in 2008 and 2010 respectively.

Session Slides

Abstract

E4S simplifies the installation of scientific software in general and helps users whether they are installing software on a workstation or on an HPC system. Users can choose container-based deployment, either with the full E4S container or with their own customized version, applying recipes from the E4S project and E4S-developed Spack build caches to quickly create container distributions and deploy their software. This talk will describe the various components of E4S and show a brief demo.


Bio

Sameer Shende
Dr. Sameer Shende has helped develop the TAU Performance System, the Program Database Toolkit (PDT), the Extreme-scale Scientific Software Stack (E4S) [https://e4s.io] and the HPCLinux distro. His research interests include tools and techniques for performance instrumentation, measurement, analysis, runtime systems, HPC container runtimes, and compiler optimizations. He serves as a Research Associate Professor and the Director of the Performance Research Laboratory at the University of Oregon, and as the President and Director of ParaTools, Inc., ParaTools, SAS, and ParaTools, Ltd.

Session Slides

Abstract

We have been carrying out the FLAGSHIP 2020 project to develop the Japanese next-generation flagship supercomputer, Post-K, named “Fugaku”. In the project, we have designed a new Arm-SVE-enabled processor, called A64FX, as well as the system, including the interconnect, with our industry partner, Fujitsu. The processor is designed for energy efficiency and sustained application performance. The “Fugaku” system has been delivered and is now partially in use by early-access users. It is scheduled to be put into operation for public service around 2021. In this talk, the system software, programming models, and tools of the “Fugaku” system will be presented, along with an overview of its architecture and some preliminary performance results.


Bio

Mitsuhisa Sato
Mitsuhisa Sato received the M.S. and Ph.D. degrees in information science from the University of Tokyo in 1984 and 1990. From 2001, he was a professor in the Graduate School of Systems and Information Engineering at the University of Tsukuba, and he served as director of its Center for Computational Sciences from 2007 to 2013. Since October 2010, he has been the leader of the programming environment research team at RIKEN's Advanced Institute of Computational Science (AICS), since renamed R-CCS. Since 2014, he has led the architecture development team in the FLAGSHIP 2020 project to develop the Japanese flagship supercomputer “Fugaku” at RIKEN. Since 2018, he has served as a deputy director of the RIKEN Center for Computational Science. He is a Professor (Cooperative Graduate School Program) and Professor Emeritus of the University of Tsukuba.