BEGIN:VCALENDAR
VERSION:2.0
PRODID:Linklings LLC
BEGIN:VTIMEZONE
TZID:America/Denver
X-LIC-LOCATION:America/Denver
BEGIN:DAYLIGHT
TZOFFSETFROM:-0700
TZOFFSETTO:-0600
TZNAME:MDT
DTSTART:19700308T020000
RRULE:FREQ=YEARLY;BYMONTH=3;BYDAY=2SU
END:DAYLIGHT
BEGIN:STANDARD
TZOFFSETFROM:-0600
TZOFFSETTO:-0700
TZNAME:MST
DTSTART:19701101T020000
RRULE:FREQ=YEARLY;BYMONTH=11;BYDAY=1SU
END:STANDARD
END:VTIMEZONE
BEGIN:VEVENT
DTSTAMP:20200129T163557Z
LOCATION:505
DTSTART;TZID=America/Denver:20191118T163000
DTEND;TZID=America/Denver:20191118T164000
UID:submissions.supercomputing.org_SC19_sess128_ws_ia102@linklings.com
SUMMARY:Cascaded DMA Controller for Speedup of Indirect Memory Access in I
 rregular Applications
DESCRIPTION:Workshop\n\nCascaded DMA Controller for Speedup of Indirect Me
 mory Access in Irregular Applications\n\nKashimata, Kitamura, Kimura, Kasa
 hara\n\nIndirect memory accesses, which arise in sparse linear algebra ca
 lculations, are common in important real applications.  However, they ca
 use inefficient memory accesses and pipeline stalls, resulting in low ex
 ecution efficiency even with high memory bandwidth and ample computation
 al resources.  A key issue with an indirect memory access such a
 s A[B[i]] is that it requires two successive memory accesses: the inde
 x load (B[i]) followed by the data element access (A[B[i]]).  To addres
 s this, we propose the Cascaded-DMAC (CDMAC).  This CD
 MAC is intended to be added to each core of a multicore chip, alongsi
 de its CPU core, vector accelerator, and local data memory.  It transfe
 rs data between off-chip main memory and the in-core local data memo
 ry, which supplies data to the accelerator.  The key idea of the CDM
 AC is to cascade two DMACs so that the first loads the indices and the s
 econd then accesses the data elements using those indices.  Given an in
 dex array and an element array, this organization performs indirect mem
 ory accesses autonomously and enables efficient SIMD computation by lin
 ing up the sparse data in the local data memory.  We implemented a mult
 icore processor with the proposed CDMAC on an FPGA board.  An evaluatio
 n of sparse matrix-vector multiplication on the FPGA shows that the CDMA
 C achieves a maximum speedup of 17x over data transfer by the CPU.\n\nTag
 : Workshop Reg Pass, Algorithms, Architectures, Irregular A
 pplications\n\nRegistration Category: Workshop Reg Pass, Algorithms, Archi
 tectures, Irregular Applications
URL:https://sc19.supercomputing.org/presentation/?id=ws_ia102&sess=sess128
END:VEVENT
END:VCALENDAR