[ppl-accel] TMS - New Log in Task Meeting Minutes


  • From: TMS <tms AT charm.cs.illinois.edu>
  • To: ppl-accel AT cs.illinois.edu
  • Subject: [ppl-accel] TMS - New Log in Task Meeting Minutes
  • Date: Fri, 28 Jun 2019 17:02:33 -0500

Michael Robson has added a new log to Task Meeting Minutes

Log Entry:

Accel Meeting

11:00 - 12:30
In Attendance: Nikolai, Juan, Ronak, Michael, Jaemin

Agenda

GPU Overview

Technical Details

Nikolai's relevant work

  • GPU centric comm semantics
  • Similar to NCCL - inspiration
    • also open source on github
    • well engineered (code less so)
    • on nvidia's github
  • Comm as a cuda kernel
    • enq on stream - dependencies
    • streams = cpu threads (logical equivalent)
    • want same MPI semantics
  • CUDA aware MPI
    • unaware of user's computation on GPU
    • do synch manually
  • lack of comm/comp overlap - bsp model
    • launch overhead - 10us
    • latency on IB - 1 us
  • implementation
    • cpu bg thread
    • cpu follow progress via cuda events
    • gpu wait - 2 ways
    • launch spin wait kernel
      • 1 thread
      • mem address in unified memory - poll on an int, exit on nonzero
      • NCCL uses this - more portable (see the spin-wait sketch after this list)
    • cuda driver call - stream memory operations section
      • needs driver support (cuda kernel module flag)
      • mem address with producer and consumer
      • bit 1/0, counter, etc
      • cuStreamWaitValue32 - basically the opposite of events
      • not just launching a kernel
      • managed by the SM scheduler or the cuda kernel driver on the CPU side
      • cuStreamWriteValue32 - can implement events using this (see the driver API sketch after this list)
  • approach - aluminum on github under nikolai at livermore
    • use by some apps at LLNL
    • wrapper around NCCL
    • MPI collectives + send/recv + send + recv
  • integration - associate MPI stream with communicator - one user stream
    • equivalent to setting an attribute on the communicator
    • communicator associated with stream
    • semantics basically same on CPU
    • could do this on CPU for threads/chares - notions of stream of exec
    • don't support multiple streams with one comm
    • could do this if careful
  • assumption
    • 1 rank per gpu
  • comm thread impl
    • c++ posix thread
    • bound appropriately - hwloc
    • need to initiate MPI operations in the right order
    • similar to bcast/reductions in charm
    • runtime could get around this
    • tags (if they were supported - boo MPI) would get around this
  • mostly proof of concept
    • using GPU direct RDMA would improve perf and remove issues
    • have to be writing IB code to use it
    • could use MVAPICH GDR
    • lots of bugs in MPI distros with multithreaded GPU usage
  • recap
    • comm another cuda kernel
    • runtime does 'magic' to make it work like non-blocking MPI
    • up to runtime implementer to make magic happen / efficiently
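
Below is a minimal sketch of the spin-wait variant noted above (a one-thread kernel polling an int in unified memory and exiting on nonzero). The names and structure are illustrative, not Aluminum's or NCCL's actual code, and writing the flag from the host while the kernel runs assumes a platform with concurrent managed access:

#include <cuda_runtime.h>
#include <cstdio>

// One-thread kernel that spins on a flag in managed memory and returns
// once the CPU comm thread sets it to a nonzero value.
__global__ void spin_wait_kernel(volatile int *flag) {
    while (*flag == 0) { /* poll */ }
}

int main() {
    int *flag;
    cudaMallocManaged(&flag, sizeof(int));
    *flag = 0;

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Enqueue the wait on the user's stream; kernels enqueued after this
    // point will not start until the flag is released.
    spin_wait_kernel<<<1, 1, 0, stream>>>(flag);

    // ...the CPU comm thread would complete the MPI operation here, then:
    *flag = 1;                      // release the GPU (needs concurrent managed access)

    cudaStreamSynchronize(stream);  // the spin-wait kernel has now exited
    printf("stream released\n");
    cudaFree(flag);
    cudaStreamDestroy(stream);
    return 0;
}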
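
And a sketch of the stream memory operations variant noted above (CUDA driver API). It assumes a context and stream already exist, that stream memops are enabled in the driver, and that flag_dptr is a device-accessible pointer to the shared 32-bit word (e.g. host memory registered and mapped via cuMemHostRegister / cuMemHostGetDevicePointer); error handling is omitted:

#include <cuda.h>

// flag_dptr: device-accessible address of a 32-bit word shared between the
// CPU comm thread (producer) and the GPU stream (consumer).
void enqueue_wait_and_signal(CUstream stream, CUdeviceptr flag_dptr) {
    // Block the stream until the word reaches 1 - no kernel launch needed;
    // the wait is handled by the scheduler / kernel driver.
    cuStreamWaitValue32(stream, flag_dptr, 1, CU_STREAM_WAIT_VALUE_GEQ);

    // ...kernels enqueued on the stream here run only after the comm thread
    // writes the flag...

    // Opposite direction: signal the CPU (or another stream) that preceding
    // work on this stream has completed.
    cuStreamWriteValue32(stream, flag_dptr, 2, CU_STREAM_WRITE_VALUE_DEFAULT);
}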

What Marc/Pavan want moving forward

  • GPU oriented comm work
    • can even do from charm side
    • def want from MPI side
  • want to put semantics in MPI
  • endpoints MPI proposal
    • openmp + GPU + MPI
    • probably won't happen

Other projects

  • pitched to MVAPICH - want a paper
  • MPI forum - need proof of concept in a (major) MPI distribution
  • some nvidia (affiliated) projects
  • NCCL - p2p on roadmap for future - want to expand to SciCo
  • historically - pushing IB verbs to GPU
  • NVSHMEM - openshmem semantics on GPU - public v intranode (private v avail internode)
    • puts/gets inside GPU kernels
    • cuda scheduler can deschedule threads as if they were pending device mem reads
    • natural fine grained comm/comp overlap (see the NVSHMEM sketch after this list)
  • TensorFlow - does internally (maybe some other rts)
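
A minimal sketch of the device-initiated puts described above, assuming an NVSHMEM installation and launcher; the neighbor exchange and the simple 1 PE : 1 GPU mapping are illustrative, not taken from any of the projects discussed:

#include <nvshmem.h>
#include <cuda_runtime.h>

// Each PE writes its own rank into its neighbor's symmetric buffer directly
// from device code - no CPU involvement in the transfer itself.
__global__ void put_to_neighbor(int *sym_buf) {
    int me = nvshmem_my_pe();
    int npes = nvshmem_n_pes();
    nvshmem_int_p(sym_buf, me, (me + 1) % npes);  // single-element put
    nvshmem_quiet();                              // wait for the put to complete
}

int main() {
    nvshmem_init();
    // Simplistic 1 PE : 1 GPU mapping; real code would map via the node team.
    cudaSetDevice(nvshmem_my_pe());
    int *sym_buf = (int *)nvshmem_malloc(sizeof(int));

    put_to_neighbor<<<1, 1>>>(sym_buf);
    cudaDeviceSynchronize();
    nvshmem_barrier_all();

    nvshmem_free(sym_buf);
    nvshmem_finalize();
    return 0;
}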

Charm plans

  • GPU aware Charm
  • local completion aware sends

Next Steps

  • getting up to speed on GPUs - nvidia programming guide (Juan)
  • optional: PTX ISA

Stray

  • Marc also retiring today
  • 32 GB V100
  • where does nvidia rt run?
    • some on the device
    • a lot on the CPU

Stray 2

  • Pavan on committee?
  • charm messages as GPU kernels? - extend Nikolai's work
