[ppl-accel] TMS - New Log in Task Meeting Minutes


  • From: TMS <tms AT charm.cs.illinois.edu>
  • To: ppl-accel AT cs.illinois.edu
  • Subject: [ppl-accel] TMS - New Log in Task Meeting Minutes
  • Date: Fri, 28 Jun 2019 17:02:33 -0500

Michael Robson has added a new log to Task Meeting Minutes

Log Entry:

Accel Meeting

11:00 - 12:30
In Attendance: Nikolai, Juan, Ronak, Michael, Jaemin

Agenda

GPU Overview

Technical Details

Nikolai's relevant work

  • GPU centric comm semantics
  • Similar to NCCL - inspiration
    • also open source on github
    • well engineered (code less so)
    • on nvidia's github
  • Comm as a cuda kernel
    • enq on stream - dependencies
    • streams = cpu threads (logical equivalent)
    • want same MPI semantics
  • CUDA aware MPI
    • unaware of user's computation on GPU
    • do synch manually
  • lack of comm/comp overlap - bsp model
    • launch overhead - 10us
    • latency on IB - 1 us
  • implementation
    • cpu bg thread
    • cpu follow progress via cuda events
    • gpu wait - 2 ways
    • launch spin wait kernel
      • 1 thread
      • mem address in unified memory - poll on an int, exit on nonzero
      • NCCL uses this - more portable (see the spin-wait sketch after this list)
    • cuda driver call - stream memory operations section
      • needs driver support (cuda kernel module flag)
      • mem address with producer and consumer
      • bit 1/0, counter, etc
      • cuStreamWaitValue32 - basically the opposite of events
      • not just launching a kernel
      • managed by the SM scheduler or the cuda kernel driver on the CPU side
      • cuStreamWriteValue32 - can implement events using this (see the driver API sketch after this list)
  • approach - aluminum on github under nikolai at livermore
    • use by some apps at LLNL
    • wrapper around NCCL
    • MPI collectives + send/recv + send + recv
  • integration - associate MPI stream with communicator - one user stream
    • equivalent to setting an attribute on the communicator
    • communicator associated with stream
    • semantics basically same on CPU
    • could do this on CPU for threads/chares - notions of stream of exec
    • don't support multiple streams with one comm
    • could do this if careful
  • assumption
    • 1 rank per gpu
  • comm thread impl
    • c++ posix thread
    • bound appropriately - hwloc
    • need to initiate MPI operations in the right order
    • similar to bcast/reductions in charm
    • runtime could get around this
    • tags (if they were supported - boo MPI) would get around this
  • mostly proof of concept
    • using GPU direct RDMA would improve perf and remove issues
    • have to be writing IB code to use it
    • could use MVAPICH GDR
    • lots of bugs in MPI distros with multithreaded GPU usage
  • recap
    • comm another cuda kernel
    • runtime does 'magic' to make it work like non-blocking MPI
    • up to runtime implementer to make magic happen / efficiently
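
Below is a minimal sketch of the spin-wait variant noted above (a one-thread kernel polling an int in unified memory and exiting on nonzero). The names and structure are illustrative, not Aluminum's or NCCL's actual code, and writing the flag from the host while the kernel runs assumes a platform with concurrent managed access:

#include <cuda_runtime.h>
#include <cstdio>

// One-thread kernel that spins on a flag in managed memory and returns
// once the CPU comm thread sets it to a nonzero value.
__global__ void spin_wait_kernel(volatile int *flag) {
    while (*flag == 0) { /* poll */ }
}

int main() {
    int *flag;
    cudaMallocManaged(&flag, sizeof(int));
    *flag = 0;

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Enqueue the wait on the user's stream; kernels enqueued after this
    // point will not start until the flag is released.
    spin_wait_kernel<<<1, 1, 0, stream>>>(flag);

    // ...the CPU comm thread would complete the MPI operation here, then:
    *flag = 1;                      // release the GPU (needs concurrent managed access)

    cudaStreamSynchronize(stream);  // the spin-wait kernel has now exited
    printf("stream released\n");
    cudaFree(flag);
    cudaStreamDestroy(stream);
    return 0;
}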
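
And a sketch of the stream memory operations variant noted above (CUDA driver API). It assumes a context and stream already exist, that stream memops are enabled in the driver, and that flag_dptr is a device-accessible pointer to the shared 32-bit word (e.g. host memory registered and mapped via cuMemHostRegister / cuMemHostGetDevicePointer); error handling is omitted:

#include <cuda.h>

// flag_dptr: device-accessible address of a 32-bit word shared between the
// CPU comm thread (producer) and the GPU stream (consumer).
void enqueue_wait_and_signal(CUstream stream, CUdeviceptr flag_dptr) {
    // Block the stream until the word reaches 1 - no kernel launch needed;
    // the wait is handled by the scheduler / kernel driver.
    cuStreamWaitValue32(stream, flag_dptr, 1, CU_STREAM_WAIT_VALUE_GEQ);

    // ...kernels enqueued on the stream here run only after the comm thread
    // writes the flag...

    // Opposite direction: signal the CPU (or another stream) that preceding
    // work on this stream has completed.
    cuStreamWriteValue32(stream, flag_dptr, 2, CU_STREAM_WRITE_VALUE_DEFAULT);
}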

What Marc/Pavan want moving forward

  • GPU oriented comm work
    • can even do from charm side
    • def want from MPI side
  • want to put semantics in MPI
  • endpoints MPI proposal
    • openmp + GPU + MPI
    • probably won't happen

Other projects

  • pitched to MVAPICH - want a paper
  • MPI forum - need proof of concept in a (major) MPI distribution
  • some nvidia (affiliated) projects
  • NCCL - p2p on roadmap for future - want to expand to SciCo
  • historically - pushing IB verbs to GPU
  • NVSHMEM - openshmem semantics on GPU - public v intranode (private v avail internode)
    • puts/gets inside GPU kernels
    • cuda scheduler can deschedule threads as if they were pending device mem reads
    • natural fine grained comm/comp overlap (see the NVSHMEM sketch after this list)
  • TensorFlow - does internally (maybe some other rts)
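
A minimal sketch of the device-initiated puts described above, assuming an NVSHMEM installation and launcher; the neighbor exchange and the simple 1 PE : 1 GPU mapping are illustrative, not taken from any of the projects discussed:

#include <nvshmem.h>
#include <cuda_runtime.h>

// Each PE writes its own rank into its neighbor's symmetric buffer directly
// from device code - no CPU involvement in the transfer itself.
__global__ void put_to_neighbor(int *sym_buf) {
    int me = nvshmem_my_pe();
    int npes = nvshmem_n_pes();
    nvshmem_int_p(sym_buf, me, (me + 1) % npes);  // single-element put
    nvshmem_quiet();                              // wait for the put to complete
}

int main() {
    nvshmem_init();
    // Simplistic 1 PE : 1 GPU mapping; real code would map via the node team.
    cudaSetDevice(nvshmem_my_pe());
    int *sym_buf = (int *)nvshmem_malloc(sizeof(int));

    put_to_neighbor<<<1, 1>>>(sym_buf);
    cudaDeviceSynchronize();
    nvshmem_barrier_all();

    nvshmem_free(sym_buf);
    nvshmem_finalize();
    return 0;
}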

Charm plans

  • GPU aware Charm
  • local completion aware sends

Next Steps

  • getting up to speed on GPUs - nvidia programming guide (Juan)
  • optional: PTX ISA

Stray

  • Marc also retiring today
  • 32 GB V100
  • where does nvidia rt run?
    • some on the device
    • a lot on the CPU

Stray 2

  • Pavan on committee?
  • charm messages as GPU kernels? - extend Nikolai's work
