The Supercomputer “Fugaku”
and
Software, programming models and tools

Mitsuhisa Sato Team Leader of Architecture Development Team

Deputy project leader, FLAGSHIP 2020 project
Deputy Director, RIKEN Center for Computational Science (R-CCS)

Professor (Cooperative Graduate School Program), University of Tsukuba
Missions
• Building the Japanese national flagship supercomputer “Fugaku” (a.k.a. post-K), and
• Developing a wide range of HPC applications to run on Fugaku, in order to solve social and scientific issues in Japan (the application development projects ended at the end of March 2020)

Overview of Fugaku architecture
Node: Manycore architecture
• Armv8-A + SVE (Scalable Vector Extension)
• SIMD Length: 512 bits
• # of Cores: 48 + (2/4 for OS) (> 3.0 TF / 48 core)
• Co-design with application developers and high memory bandwidth utilizing on-package stacked memory (HBM2)
  1 TB/s B/W
• Low power: 15GF/W (dgemm)

Network: TofuD
• Chip-Integrated NIC, 6D mesh/torus Interconnect

Status and Update
• March 2019: The system was named “Fugaku”
• Aug. 2019: The K computer was decommissioned: services stopped, and the system was shut down and removed from the computer room
• Oct. 2019: Access to the test chips started.
• Nov. 2019: Fujitsu announced the FX1000 and FX700, and a business collaboration with Cray.
• Nov. 2019: Fugaku clock frequency set to 2.0 GHz, with boost to 2.2 GHz.
• Nov. 2019: 1st position in the Green 500!
• Oct.-Nov. 2019: MEXT announced the Fugaku “early access program”, to begin around Q2/CY2020
• Dec. 2019: Delivery and installation of “Fugaku” started.
• May 2020: Delivery completed
• June 2020: 1st in Top500, HPCG, Graph 500, HPL-AI at ISC2020

Project schedule (quarterly timeline): Basic Design → Design and Implementation → Manufacturing, Installation and Tuning → Operation

Sep/24/2020
Fugaku won 1st position in 4 benchmarks!

<table>
<thead>
<tr>
<th>Benchmark</th>
<th>1st</th>
<th>Score</th>
<th>Unit</th>
<th>2nd</th>
<th>Score</th>
<th>1st/ 2nd</th>
</tr>
</thead>
<tbody>
<tr>
<td>TOP500 (LINPACK)</td>
<td>Fugaku</td>
<td>415.5</td>
<td>PFLOPS</td>
<td>Summit (US)</td>
<td>148.6</td>
<td>2.80</td>
</tr>
<tr>
<td>HPCG</td>
<td>Fugaku</td>
<td>13.4</td>
<td>PFLOPS</td>
<td>Summit (US)</td>
<td>2.93</td>
<td>4.57</td>
</tr>
<tr>
<td>HPL-AI</td>
<td>Fugaku</td>
<td>1.42</td>
<td>EFLOPS</td>
<td>Summit (US)</td>
<td>0.55</td>
<td>2.58</td>
</tr>
<tr>
<td>Graph500</td>
<td>Fugaku</td>
<td>70,980</td>
<td>GTEPS</td>
<td>太湖之光 TaihuLight (China)</td>
<td>23,756</td>
<td>2.99</td>
</tr>
</tbody>
</table>

About 2.6 to 4.6 times faster than the runner-up in every benchmark!
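The ratio column can be rechecked directly from the scores in the table. The following is a minimal sketch; the dictionary layout and the function name `speedup_ratios` are illustrative choices, not part of any benchmark tooling.

```python
# Scores copied from the table above: (1st-place score, 2nd-place score).
# Units differ per benchmark but cancel out in the ratio.
results = {
    "TOP500 (LINPACK)": (415.5, 148.6),  # PFLOPS
    "HPCG": (13.4, 2.93),                # PFLOPS
    "HPL-AI": (1.42, 0.55),              # EFLOPS
    "Graph500": (70980, 23756),          # GTEPS
}


def speedup_ratios(scores):
    """Return {benchmark: first / second}, rounded to two decimals."""
    return {name: round(first / second, 2)
            for name, (first, second) in scores.items()}


ratios = speedup_ratios(results)
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f}x")
```

The smallest ratio is HPL-AI and the largest is HPCG, which is where memory bandwidth matters most.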
**Prediction of conformation dynamics of proteins on the surface of SARS-CoV-2**

GENESIS MD simulations to interpolate the unknown, experimentally undetectable dynamic behavior of spike proteins, whose static behavior has been identified via cryo-EM

(Yuji Sugita, RIKEN)

**Fragment molecular orbital calculations for COVID-19 proteins**

Large-scale, detailed interaction analysis of COVID-19 proteins using Fragment Molecular Orbital (FMO) calculations with ABINIT-MP

(Yuji Mochizuki, Rikkyo University)

**Exploring new drug candidates for COVID-19**

Large-scale MD to search & identify therapeutic drug candidates showing high affinity for COVID-19 target proteins from 2000 existing drugs

(Yasushi Okuno, RIKEN / Kyoto University)

**Societal-Epidemiology**

Prediction and Countermeasure for Virus Droplet Infection under the Indoor Environment

Massively parallel simulation of droplet scattering with airflow and heat transfer in indoor environments such as commuter trains, offices, classrooms, and hospital rooms

(Makoto Tsubokura, RIKEN / Kobe University)

**Simulation analysis of pandemic phenomena**

Combining simulations & analytics of disease propagation with contact tracing apps, economic effects of lockdown, and reflections in social media, for effective mitigation policies

(Nobuyasu Ito, RIKEN)
KPIs on Fugaku development in FLAGSHIP 2020 project

Three KPIs (key performance indicators) were defined for Fugaku development

- **1. Extreme Power-Efficient System**
  - Maximum performance under a power consumption of 30 - 40 MW (for the system)
  - Approx. 15 GF/W (dgemm) confirmed by the prototype CPU => 1st in Green 500 !!!

- **2. Effective performance of target applications**
  - Expected to exceed 100 times the K computer’s performance in some applications
  - Estimated speedups: 125 times in GENESIS (MD application), 120 times in NICAM+LETKF (climate simulation and data assimilation)

- **3. Ease-of-use system for wide-range of users**
  - Co-design with application developers
  - A shared-memory system with high-bandwidth on-package memory should make existing OpenMP-MPI programs easy to port.
  - No programming effort for accelerators such as GPUs is required.
CPU Architecture: A64FX

- Armv8.2-A (AArch64 only) + SVE (Scalable Vector Extension)
  - FP64/FP32/FP16 (https://developer.arm.com/products/architecture/a-profile/docs)
- SVE 512-bit wide SIMD
- # of Cores: 48 + (2/4 for OS)
- Co-design with application developers and high memory bandwidth utilizing on-package stacked memory: HBM2(32GiB)
- Leading-edge Si-technology (7nm FinFET), low power logic design (approx. 15 GF/W (dgemm)), and power-controlling knobs
- Clock frequency: 2.0 GHz(normal), 2.2 GHz (boost)
- Peak performance
  - 3.0 TFLOPS@2GHz (>90% @ dgemm)
  - Memory B/W 1024GB/s (>80% stream)
  - Byte per Flops: 0.33
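The peak figures above follow from the core count, clock, and SIMD width. The sketch below reproduces them under one assumption that is not stated on this slide: each core has two 512-bit FMA pipelines, which is consistent with the 64 GFLOPS per core at 2 GHz quoted elsewhere in these slides. The function and parameter names are illustrative.

```python
def peak_tflops(cores=48, ghz=2.0, simd_bits=512, fma_pipes=2, elem_bits=64):
    """Theoretical peak in TFLOPS for one chip.

    Assumes every pipeline issues one fused multiply-add (2 flops)
    per SIMD lane per cycle; fma_pipes=2 is an assumption (see lead-in).
    """
    lanes = simd_bits // elem_bits            # 8 double-precision lanes
    flops_per_cycle = lanes * 2 * fma_pipes   # 32 flops per cycle per core
    return cores * ghz * flops_per_cycle / 1000.0


normal = peak_tflops(ghz=2.0)   # about 3.07 TFLOPS
boost = peak_tflops(ghz=2.2)    # about 3.38 TFLOPS

# Bytes per flop: memory bandwidth (GB/s) over peak (GFLOPS) at 2.0 GHz.
bytes_per_flop = 1024 / (normal * 1000.0)

print(f"normal: {normal:.3f} TFLOPS, boost: {boost:.4f} TFLOPS")
print(f"B/F: {bytes_per_flop:.2f}")
```

Passing `elem_bits=32` or `elem_bits=16` gives the corresponding FP32 and FP16 peaks under the same assumptions.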

The “common” programming model will be to run each MPI process on a NUMA node (CMG) with OpenMP-MPI hybrid programming.

OpenMP with 48 threads is also supported.
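To make the two layouts concrete, here is a small sketch that enumerates which cores each MPI rank would own. It assumes four CMGs of 12 compute cores per chip (48 / 4) and a simple contiguous core numbering; the numbering and the function name `hybrid_layout` are illustrative only and do not reflect the actual core IDs or any job-launcher syntax.

```python
CORES_PER_NODE = 48   # compute cores per A64FX chip
CMGS_PER_NODE = 4     # assumption: 4 CMGs (NUMA nodes) per chip


def hybrid_layout(num_nodes, ranks_per_node=CMGS_PER_NODE):
    """List the node, local rank, and core range for every MPI rank.

    ranks_per_node=4 models one rank per CMG with 12 OpenMP threads;
    ranks_per_node=1 models a single rank with 48 OpenMP threads.
    """
    threads = CORES_PER_NODE // ranks_per_node
    layout = []
    for rank in range(num_nodes * ranks_per_node):
        node, local_rank = divmod(rank, ranks_per_node)
        first_core = local_rank * threads
        layout.append({
            "rank": rank,
            "node": node,
            "local_rank": local_rank,
            "cores": range(first_core, first_core + threads),
        })
    return layout


for entry in hybrid_layout(num_nodes=2):
    cores = entry["cores"]
    print(f"rank {entry['rank']}: node {entry['node']}, "
          f"cores {cores.start}-{cores.stop - 1}")
```

Keeping each rank's threads inside one CMG keeps memory accesses local to that CMG's HBM2 stack, which is the motivation for the one-rank-per-CMG layout.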
• TSMC 7nm FinFET
• CoWoS technologies for HBM2
TofuD Interconnect

- 6 RDMA Engines
- Hardware barrier support
- Network operation offloading capability

<table>
<thead>
<tr>
<th></th>
<th>8B Put latency</th>
<th>1MiB Put throughput</th>
</tr>
</thead>
<tbody>
<tr>
<td>TofuD</td>
<td>0.49 – 0.54 usec</td>
<td>6.35 GB/s</td>
</tr>
</tbody>
</table>

TNI: Tofu Network Interface (RDMA engine)

TNR: Tofu Network Router, 2 lanes x 10 ports

40.8 GB/s (6.8 GB/s x 6)
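The bandwidth figures above can be related with a little arithmetic. The sketch below assumes that 6.8 GB/s is the peak rate of one RDMA engine (TNI) and that the factor 6 corresponds to the six RDMA engines; the constant and function names are illustrative.

```python
LINK_GBPS = 6.8            # per-engine peak from the slide (assumed meaning)
NUM_ENGINES = 6            # six RDMA engines (TNIs)
MEASURED_PUT_GBPS = 6.35   # measured 1 MiB Put throughput from the table


def aggregate_bandwidth(per_engine=LINK_GBPS, engines=NUM_ENGINES):
    """Total bandwidth in GB/s when all engines transfer concurrently."""
    return per_engine * engines


def link_efficiency(measured=MEASURED_PUT_GBPS, peak=LINK_GBPS):
    """Fraction of the per-engine peak reached by the measured Put."""
    return measured / peak


print(f"aggregate: {aggregate_bandwidth():.1f} GB/s")
print(f"1 MiB Put efficiency: {link_efficiency():.1%}")
```

Under these assumptions a single 1 MiB Put reaches roughly 93% of the per-engine peak.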

TofuD: MPI_Send/Receive Latency and BW

- **MPI PingPong**

![Graph showing latency and throughput for different message sizes.](image-url)
Fugaku prototype board and rack

Shelf: 48 CPUs (24 CMU)
Rack: 8 shelves = 384 CPUs (8x48)

Figure labels: A64FX™, 2 CPU / CMU, HBM2, Water, Electrical signals
Fugaku System Configuration

- **158,976** nodes
- **Two types of nodes**
- **3-level hierarchical storage system**
  - **1st Layer**
    - One in every 16 compute nodes, called a Compute & Storage I/O Node, has an SSD of about 1.6 TB
    - Services
      - Cache for global file system
      - Temporary file systems
        - Local file system for compute node
        - Shared file system for a job
  - **2nd Layer**
    - Fujitsu FEFS: Lustre-based global file system
  - **3rd Layer**
    - Cloud storage services
- Boost mode: 3.3792TF x 150k+ = 500+ PF

System diagram labels: I/O Network, Network, Shared File Systems, Cloud Storage Gateway Nodes, Pre/Post Processing Nodes (Large-Memory Nodes, Visualization Nodes), Intranet, Internet
Advances from the K computer

<table>
<thead>
<tr>
<th></th>
<th>K computer</th>
<th>Fugaku</th>
<th>ratio</th>
</tr>
</thead>
<tbody>
<tr>
<td># core</td>
<td>8</td>
<td>48</td>
<td></td>
</tr>
<tr>
<td>Si tech. (nm)</td>
<td>45</td>
<td>7</td>
<td></td>
</tr>
<tr>
<td>Core perf. (GFLOPS)</td>
<td>16</td>
<td>64(70)</td>
<td>4(4.4)</td>
</tr>
<tr>
<td>Chip(node) perf. (TFLOPS)</td>
<td>0.128</td>
<td>3.072(3.379)</td>
<td>24(26.4)</td>
</tr>
<tr>
<td>Memory BW (GB/s)</td>
<td>64</td>
<td>1024</td>
<td></td>
</tr>
<tr>
<td>B/F (Bytes/FLOP)</td>
<td>0.5</td>
<td>0.33</td>
<td></td>
</tr>
<tr>
<td>#node / rack</td>
<td>96</td>
<td>384</td>
<td>4</td>
</tr>
<tr>
<td>#node/system</td>
<td>82,944</td>
<td>158,976</td>
<td></td>
</tr>
<tr>
<td>System perf.(DP PFLOPS)</td>
<td>10.6</td>
<td>488(537)</td>
<td>42.3(52.2)</td>
</tr>
<tr>
<td>System perf.(SP PFLOPS)</td>
<td></td>
<td>977(1070)</td>
<td>84.6(104.4)</td>
</tr>
</tbody>
</table>

- SVE increases core performance
- Silicon technology and a scalable architecture (CMG) increase node performance
- HBM enables high bandwidth

Values in parentheses are for boost mode (2.2 GHz)

More than 7.6 M General-purpose cores!
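The system-level figures follow from the per-node values in the table. A minimal sketch, using only numbers from these slides; the names are illustrative, and the core count covers the 48 compute cores per node only, not the cores reserved for the OS.

```python
NODES = 158_976
CORES_PER_NODE = 48                               # compute cores only
NODE_TFLOPS = {"normal": 3.072, "boost": 3.3792}  # double precision


def system_peak_pflops(node_tflops, nodes=NODES):
    """Double-precision system peak in PFLOPS."""
    return node_tflops * nodes / 1000.0


total_cores = NODES * CORES_PER_NODE

for mode, tflops in NODE_TFLOPS.items():
    print(f"{mode}: {system_peak_pflops(tflops):.1f} PFLOPS")
print(f"compute cores: {total_cores:,}")
```

The results round to the 488 (537) PFLOPS shown in the table and confirm the count of more than 7.6 million cores.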
Benchmark Results on test chip A64FX

- **CloverLeaf (UK Mini-App Consortium), Fortran/C**
  - A hydrodynamics mini-app to solve the compressible Euler equations in 2D, using an explicit, second-order method
  - Stencil calculation

- **TeaLeaf (UK Mini-App Consortium), Fortran**
  - A mini-application to enable design-space explorations for iterative sparse linear solvers
  - [https://github.com/UK-MAC/TeaLeaf_ref.git](https://github.com/UK-MAC/TeaLeaf_ref.git)
  - Problem size: Benchmarks/tea_bm_5.in, end_step=10 -> 3

- **LULESH (LLNL), C**
  - Mini-app representative of simplified 3D Lagrangian hydrodynamics on an unstructured mesh, indirect memory access
## Benchmark Results on test chip A64FX

### Disclaimer:
The software used for the evaluation, such as the compiler, is still under development and its performance may be different when the supercomputer Fugaku starts its operation.

<table>
<thead>
<tr>
<th>Platform</th>
<th>Compiler</th>
<th>Compiler Options</th>
</tr>
</thead>
<tbody>
<tr>
<td>A64FX test chip (2.0 GHz)</td>
<td>Fujitsu compiler</td>
<td>-Kfast,openmp</td>
</tr>
<tr>
<td>ThunderX2 @ Apollo70, 28C/2S @ 2.0GHz</td>
<td>Arm HPC compiler 19.1</td>
<td>-Ofast -march=armv8-a(+sve)</td>
</tr>
<tr>
<td>Broadwell (Xeon E5-2680 v4), 14C/2S @ 2.4GHz</td>
<td>Intel compiler 2019.0.045</td>
<td>-O3 -qopenmp -march=native</td>
</tr>
<tr>
<td>Skylake (Xeon Gold 6126) @ Cygnus, Univ. of Tsukuba, 12C/2S @ 2.6GHz</td>
<td>Intel compiler 19.0.3.199</td>
<td>-O3 -qopenmp -march=native</td>
</tr>
</tbody>
</table>
- Evaluation using one CMG (NUMA node) without MPI
- Good scalability as the number of threads within a CMG increases.
- The performance of one CMG is comparable to that of an Intel processor. (The chip contains 4 CMGs!)
TeaLeaf

- Evaluation of the MPI program within one chip (up to 4 MPI processes)
- Varying the number of threads within a CMG
- The speedup is limited beyond 4 threads, possibly due to the memory bandwidth (?)
- More performance analysis is needed.

Figure: relative performance (normalized to 1 thread on A64FX). Xeon reference: Intel Xeon Gold 6126 @ Cygnus, Univ. of Tsukuba, 2.6 GHz, 12 cores x 2 sockets

Evaluation using one CMG (NUMA node) without MPI

The performance of one CMG is lower than that of ThunderX2 and Intel

We found low vectorization (the ratio of SIMD (SVE) instructions is only a few %)

More code tuning is needed to increase vectorization with SIMD
Fugaku / Fujitsu FX1000 System Software Stack

**Math Libraries**
Fujitsu: BLAS, LAPACK, ScaLAPACK, SSL II
RIKEN: EigenEXA, KMATH_FFT3D, Batched BLAS, ...

**Compiler and Script Languages**
Fortran, C/C++, OpenMP, Java, Python, ...
(Multiple Compilers supported: Fujitsu, Arm, GNU, LLVM/CLANG, PGI, ...)

**Tuning and Debugging Tools**
Fujitsu: Profiler, Debugger, GUI

**Red Hat Enterprise Linux 8 Libraries**

- **Process/Thread**
  - PIP

- **Low Level Communication**
  - uTofu, LLC

- **File I/O for Hierarchical Storage**
  - Lustre/LLIO

- **Virtualization & Container**
  - KVM, Singularity

- **File I/O**
  - DTF

- **Communication**
  - Fujitsu MPI
  - RIKEN MPI

- **Domain Spec. Lang.**
  - FDPS

- **High-level Prog. Lang.**
  - XMP

**Live Data Analytics**
Apache Flink, Kibana, ...

**Cloud Software Stack**
OpenStack, Kubernetes, NEWT...

**Batch Job and Management System**

**ObjectStore**
S3 Compatible

**Hierarchical File System**

**Open Source Management Tool**
Spack

~ 3000 Apps supported by Spack

Most applications will work with a simple recompile from an x86/RHEL environment. LLNL Spack automates this.
# OSS Application Porting @ Arm HPC Users Group

[http://arm-hpc.gitlab.io/](http://arm-hpc.gitlab.io/)

<table>
<thead>
<tr>
<th>Application</th>
<th>Lang.</th>
<th>GCC</th>
<th>LLVM</th>
<th>Arm</th>
<th>Fujitsu</th>
</tr>
</thead>
<tbody>
<tr>
<td>LAMMPS</td>
<td>C++</td>
<td>Modified</td>
<td>Modified</td>
<td>Modified</td>
<td>Modified</td>
</tr>
<tr>
<td>GROMACS</td>
<td>C</td>
<td>Modified</td>
<td>Modified</td>
<td>Modified</td>
<td>Modified</td>
</tr>
<tr>
<td>GAMESS*</td>
<td>Fortran</td>
<td>Modified</td>
<td>Modified</td>
<td>Modified</td>
<td>Modified</td>
</tr>
<tr>
<td>OpenFOAM</td>
<td>C++</td>
<td>Modified</td>
<td>Modified</td>
<td>Modified</td>
<td>Modified</td>
</tr>
<tr>
<td>NAMD</td>
<td>C++</td>
<td>Modified</td>
<td>Modified</td>
<td>Modified</td>
<td>Modified</td>
</tr>
<tr>
<td>WRF</td>
<td>Fortran</td>
<td>Modified</td>
<td>Modified</td>
<td>Modified</td>
<td>Modified</td>
</tr>
<tr>
<td>Quantum ESPRESSO</td>
<td>Fortran</td>
<td>OK as is</td>
<td>OK as is</td>
<td>OK as is</td>
<td>Modified</td>
</tr>
<tr>
<td>NWChem</td>
<td>Fortran</td>
<td>OK as is</td>
<td>Modified</td>
<td>Modified</td>
<td>Modified</td>
</tr>
<tr>
<td>ABINIT</td>
<td>Fortran</td>
<td>Modified</td>
<td>Modified</td>
<td>Modified</td>
<td>Modified</td>
</tr>
<tr>
<td>CP2K</td>
<td>Fortran</td>
<td>OK as is</td>
<td>Issues found</td>
<td>Issues found</td>
<td>Modified</td>
</tr>
<tr>
<td>NEST*</td>
<td>C++</td>
<td>OK as is</td>
<td>Modified</td>
<td>Modified</td>
<td>Modified</td>
</tr>
<tr>
<td>BLAST*</td>
<td>C++</td>
<td>OK as is</td>
<td>Modified</td>
<td>Modified</td>
<td>Modified</td>
</tr>
</tbody>
</table>
Low-power Design & Power Management

- 7nm FinFET (TSMC) with low-power logic design
- A64FX provides a power management function called “Power Knob”
  - FL pipeline usage: FLA only, EX pipeline usage: EXA only, Frequency reduction …
  - User programs can change the “Power Knob” settings for power optimization
  - The “Energy monitor” facility enables chip-level power monitoring and detailed power analysis of applications

- “Eco-mode”: FLA only, with lower “stand-by” power for ALUs
  - Reduces power consumption for memory-intensive apps.
  - 4 of the 9 target applications select “eco-mode” for maximum performance within the limit of our power capacity (even using HBM2!)

- Retention mode: a power state that deactivates the CPU while keeping the network alive
  - Large reduction of system power consumption at idle time
  - “Power Knobs” can be controlled through the Sandia Power API and by setting running modes.
    - We are now designing an accounting system that gives users an incentive to use the power knobs
    - “Power budget” as well as node-hour budget.
**System software and Programming models & languages for “Fugaku”**

- The standard programming model is OpenMP (within a NUMA node (CMG)) + MPI
  - Both Open MPI (by Fujitsu) and MPICH (by RIKEN) are supported.
  - OpenMP 4.x is supported by the Fujitsu compiler. An LLVM-based compiler and gcc are also available.
  - uTofu: low-level communication layer for the Tofu-D interconnect.
- Containers and virtual machines (KVM, Singularity, …)
- DL4Fugaku: AI framework for Fugaku, used in Chainer, PyTorch, TensorFlow
- Much open-source software will be ported using Spack

- System software, programming tools and math libraries developed by RIKEN
  - McKernel: Light-weight kernel enabling a jitter-less environment for large-scale parallel program execution.
  - XcalableMP: directive-based PGAS language
  - FDPS (Framework for Developing Particle Simulators): a DSL for particle simulation codes.
  - EigenExa: Eigenvalue math library for large-scale parallel systems.
What’s XcalableMP (XMP for short)?
- A PGAS programming model and language for distributed memory, proposed by XMP Spec WG
- XMP Spec WG is a special interest group that designs and drafts the specification of the XcalableMP language. It is now organized under the PC Cluster Consortium, Japan. It is mainly active in Japan, but open to everybody.

Project status (as of June 2019)
- XMP Spec Version 1.4 is available at the XMP site. New features: mixed OpenMP and OpenACC, libraries for collective communications.
- Reference implementation by U. Tsukuba and RIKEN AICS: Version 1.3.1 (C and Fortran90) is available for PC clusters, Cray XT and the K computer. It is a source-to-source compiler that generates code calling a runtime built on top of MPI and GASNet.
- HPCC Class 2 winner in 2013 and 2014

Language Features
- Directive-based language extensions for Fortran and C for PGAS model
- Global view programming with global-view distributed data structures for data parallelism
- SPMD execution model, as in MPI
- Pragmas for data distribution of global arrays.
- Work-mapping constructs to explicitly map work and loop iterations with affinity to data.
- Rich communication and sync directives such as “gmove” and “shadow”.
- Many concepts are inherited from HPF
- Co-array feature of CAF is adopted as a part of the language spec for local view programming (also defined in C).
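The idea behind global-view data distribution and work mapping can be illustrated without XMP itself. The sketch below is plain Python, not XMP syntax: it assumes a block distribution in which each node owns a contiguous chunk of ceil(N/P) elements of a global array, and an "owner computes" rule for loop iterations. All function names are hypothetical and not part of any XMP implementation.

```python
import math


def block_size(n, nodes):
    """Chunk length of a block distribution (assumed: ceil(n / nodes))."""
    return math.ceil(n / nodes)


def owner(i, n, nodes):
    """Node that owns global index i."""
    return i // block_size(n, nodes)


def local_range(rank, n, nodes):
    """Global indices owned by one node; may be empty for trailing nodes."""
    size = block_size(n, nodes)
    lo = min(rank * size, n)
    hi = min(lo + size, n)
    return range(lo, hi)


def owner_computes(rank, n, nodes, body):
    """Run the loop body only for the iterations this node owns."""
    return [body(i) for i in local_range(rank, n, nodes)]


# Every node runs the same program (SPMD) but touches only its own block.
N, P = 10, 4
for rank in range(P):
    print(rank, owner_computes(rank, N, P, lambda i: i * i))
```

In XMP the programmer writes the loop once over the global index space, and directives express the distribution and mapping that this sketch spells out by hand.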

The XcalableMP 1.x specification has now converged. We are moving to XcalableMP 2.0, with global task-based parallel programming and PGAS.
Performance of XcalableMP on Fugaku

- XcalableMP was adopted as a parallel programming language project to improve the productivity and performance of parallel programming.
- XcalableMP is now available on Fugaku, and its performance is enhanced by the Fugaku interconnect, Tofu-D.

Impact-3D (global view, stencil apps)
Fusion simulation code

QCD (Local view programming, Coarray)

NT-Chem (local view programming, Coarray)
FDPS: a framework for developing parallel particle simulation codes

- Developed by Prof. Makino’s group, R-CCS
- Basic idea: "abstract" code for
  - domain decomposition
  - particle exchange
  - parallel O(N log N) interaction calculation
- Implemented as a template class library in C++.
- A single program can run on a notebook, a cluster of Intel servers, and the entire K computer without change (though the interaction function needs some optimization)
- Works also on GPGPUs

Basic concept of FDPS. The user program gives the definitions of particle and interaction to FDPS, and calls ...

Gravitational N-body (270k/process)
Weak-scaling performance is quite good up to all nodes of the K computer
Challenges of programming for Fugaku

- **Task-based programming models for “Fugaku”**
  - To exploit SIMD and manycore parallelism on A64FX, and to enable overlapping of computation and communication
  - OpenMP 4.0 task + MPI (multithread-aware MPI)
  - XcalableMP 2.0 is being designed for task-based programming on a global address space (PGAS)

- **How to exploit SIMD**
  - SIMD is a key for performance on A64FX
  - OpenMP SIMD directives
  - Compiler optimization (Fujitsu compiler)
    - SWP: software pipelining, loop fission, …
  - OpenCL for SVE (Arm SIMD)
  - Comm Optimization by Low-level layer, uTofu.

Performance improvement by SWP in Livermore Kernels by Fujitsu compiler
Programming models for beyond “Fugaku”

- Programming support for accelerator-based heterogeneous parallel systems (incl. FPGA clusters)
  - (Fugaku has no accelerators.)
  - XcalableACC: integration of XcalableMP and OpenACC
  - Task-based offloading to accelerators
Concluding Remarks

- We are interested in fostering an ecosystem for Arm HPC
  - A64FX is the only processor supporting SVE!
- International collaborations are welcome
  - DOE-MEXT collaborations
    - Arm Arch collaboration (SNL/NNSA, U. Bristol)
    - Spack for Fugaku (LLNL, ongoing)
    - ECP software porting & evaluation (planned)
  - CEA@France, A*STAR@Singapore, U Oregon, …
- I am interested in an accounting system that gives users an incentive to use the “Power-knobs”.
  - “Power budget” as well as node-hour budget, to promote power-saving.