BEGIN:VCALENDAR
VERSION:2.0
PRODID:Linklings LLC
BEGIN:VTIMEZONE
TZID:America/Chicago
X-LIC-LOCATION:America/Chicago
BEGIN:DAYLIGHT
TZOFFSETFROM:-0600
TZOFFSETTO:-0500
TZNAME:CDT
DTSTART:19700308T020000
RRULE:FREQ=YEARLY;BYMONTH=3;BYDAY=2SU
END:DAYLIGHT
BEGIN:STANDARD
TZOFFSETFROM:-0500
TZOFFSETTO:-0600
TZNAME:CST
DTSTART:19701101T020000
RRULE:FREQ=YEARLY;BYMONTH=11;BYDAY=1SU
END:STANDARD
END:VTIMEZONE
BEGIN:VEVENT
DTSTAMP:20210808T235336Z
LOCATION:Room A
DTSTART;TZID=America/Chicago:20210812T103000
DTEND;TZID=America/Chicago:20210812T104500
UID:icpp_ICPP 2021_sess116_pap310@linklings.com
SUMMARY:Accurate Matrix Multiplication on Binary128 Format Accelerated by
  Ozaki Scheme
DESCRIPTION:Conference Paper\n\nAccurate Matrix Multiplication on
  Binary128 Format Accelerated by Ozaki Scheme\n\nMukunoki, Ozaki, Ogita,
  Imamura\n\nAlthough IEEE 754-2008 binary128 (with a 15-bit exponent and
  113-bit significand, i.e., quadruple precision) is not currently
  implemented on x86 in hardware, software emulation is available on some
  compilers. However, the emulation is significantly slower than the
  binary64 operation, which is supported natively in hardware. This study
  proposes a fast implementation of matrix multiplication on matrices
  stored in the binary128 format on x86 CPUs. The proposed implementation
  utilizes the Ozaki scheme, which is an accurate matrix multiplication
  algorithm proposed by Ozaki et al. in 2012. This scheme enables one to
  perform most computations using the binary64 matrix multiplication (the
  DGEMM routine in Basic Linear Algebra Subprograms (BLAS)); it can
  exploit the high performance of highly optimized vendor BLAS. Although
  the achievable performance depends on the input matrices (the
  inner-product dimension, the absolute range, and the significand bit
  length), the proposed implementation can achieve better performance and
  accuracy than naive matrix multiplication performed using GCC's
  binary128 emulation in many cases. In addition, we discuss GPU
  acceleration, performance on reduced-precision inputs, an
  implementation based on binary32 matrix multiplication (SGEMM),
  application to memory-intensive operations, and the possibility of a
  distributed parallel implementation.
END:VEVENT
END:VCALENDAR