
[ppl-accel] [TMS] Log updated in task Accel Minutes


  • From: Michael Robson <nosborm AT gmail.com>
  • To: ppl-accel AT cs.illinois.edu
  • Subject: [ppl-accel] [TMS] Log updated in task Accel Minutes
  • Date: Wed, 30 Jul 2014 11:23:13 -0500
  • List-archive: <http://lists.cs.uiuc.edu/pipermail/ppl-accel/>
  • List-id: <ppl-accel.cs.uiuc.edu>


A log has been updated with new data by Michael Robson
The text of the log is:
Accelerator Meeting @ 10:00-11:00 AM in 4102 SC
In Attendance: Prof. Kale, Lukasz, Michael, Ronak, Harshit

Summary of Work
-Michael - read papers (Cosmic, gCharm, Xeon Phi, etc.) and met with Ronak to discuss future plans
-Ronak - read papers and the COI API
-Harshit - integration of GPUManager/cuBLAS for OpenATOM
-Lukasz - "Predicting performance of GPU kernels", PPoPP 2010 (Sarah, from Wen-Mei's group, with Sanjay Patel and Bill Gropp)
--Put on accel wiki page

Status
-GPU Manager
--Relatively OK
--Possible problems in SMP mode
---Lukasz's goal (post-thesis): get SMP mode working
----Needs access to Blue Waters/Titan/etc.
-Dave's tools
--Somewhat unknown
--Integrated into the mainline

-Missing: memory management that automatically moves buffers between CPU and GPU (idea sketched below)
--May already exist in Dave's tools (Harshit says yes)
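
A minimal sketch of that idea (hypothetical names, not the GPUManager or Dave's actual API): a buffer object that tracks which copy is current and copies lazily.

  // Illustrative only: a tiny host/device buffer tracker that copies lazily.
  // Names (ManagedBuffer, toDevice, toHost) are made up for this sketch.
  #include <cuda_runtime.h>
  #include <cstddef>

  class ManagedBuffer {
    void*  host_  = nullptr;
    void*  dev_   = nullptr;
    size_t bytes_ = 0;
    enum Where { HOST, DEVICE } valid_ = HOST;   // which copy is current
  public:
    ManagedBuffer(void* host, size_t bytes) : host_(host), bytes_(bytes) {}
    ~ManagedBuffer() { if (dev_) cudaFree(dev_); }

    // Call before launching a kernel that reads/writes this buffer.
    void* toDevice() {
      if (!dev_) cudaMalloc(&dev_, bytes_);
      if (valid_ == HOST) cudaMemcpy(dev_, host_, bytes_, cudaMemcpyHostToDevice);
      valid_ = DEVICE;
      return dev_;
    }
    // Call before CPU code touches the buffer again.
    void* toHost() {
      if (valid_ == DEVICE) cudaMemcpy(host_, dev_, bytes_, cudaMemcpyDeviceToHost);
      valid_ = HOST;
      return host_;
    }
  };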

-gCharm
--Stack is different from the GPU Manager/Dave stack

OpenATOM
-Need to get GPU support working quickly
--Review due
-Focus on getting an optimized DGEMM working first (cuBLAS sketch below)
-There may be other places in the code where we can offload
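
For reference, a bare-bones sketch of offloading a DGEMM via the cuBLAS host API (square, column-major matrices assumed; the actual OpenATOM integration would presumably go through the GPUManager/Dave stack).

  // Sketch: offload C = alpha*A*B + beta*C to cuBLAS (column-major, square n x n).
  #include <cublas_v2.h>
  #include <cuda_runtime.h>

  void gpu_dgemm(int n, const double* A, const double* B, double* C) {
    size_t bytes = size_t(n) * n * sizeof(double);
    double *dA, *dB, *dC;
    cudaMalloc(&dA, bytes); cudaMalloc(&dB, bytes); cudaMalloc(&dC, bytes);
    cudaMemcpy(dA, A, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, B, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dC, C, bytes, cudaMemcpyHostToDevice);

    cublasHandle_t h;
    cublasCreate(&h);
    const double alpha = 1.0, beta = 0.0;
    cublasDgemm(h, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                &alpha, dA, n, dB, n, &beta, dC, n);
    cublasDestroy(h);

    cudaMemcpy(C, dC, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
  }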

Load Balancing on GPUs

Xeon Phi Long Term
-Going to be substantially different in the future
-Sockets
-Things that may pay off:
--Somewhat heterogeneous PPN

Our interest/unique stuff:
-Load balancing (heterogeneous)
-Different performance behaviors on different hardware
-& persistent applications
--Given this, how do we divide work? (proportional-split sketch after this list)
-Dave
--Auto-tuned approach for his thesis, determines the %
-Pritish
--Balances between them, but very specific
-If the work is all the same, it's simpler
-A mix of work makes it more interesting
-Example: take Blue Waters with some GPU nodes and some without
--Run a single app efficiently using all the hardware
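
A hedged sketch of the simplest possible division (not an existing Charm++ balancer): split the work statically in proportion to measured per-node speeds.

  // Sketch: split `total` work units proportionally to measured per-node speeds.
  #include <vector>
  #include <numeric>

  std::vector<int> proportionalSplit(int total, const std::vector<double>& speed) {
    double sum = std::accumulate(speed.begin(), speed.end(), 0.0);
    std::vector<int> share(speed.size());
    int assigned = 0;
    for (size_t i = 0; i < speed.size(); ++i) {
      share[i] = int(total * speed[i] / sum);
      assigned += share[i];
    }
    share.back() += total - assigned;   // hand rounding leftovers to the last node
    return share;
  }
  // e.g. GPU nodes measured 4x faster than CPU-only nodes:
  // proportionalSplit(1000, {4.0, 4.0, 1.0, 1.0}) -> {400, 400, 100, 100}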

Hybrid Machine Layer
-Currently supported
--Use the network fabric to communicate between nodes
--Use shared memory to communicate locally
-How do we want this to look in SMP mode?
-Possible project: vary the number of comm threads used in SMP mode* (Lukasz's idea; generic sketch after this list)

-Mini-project: use the API in the machine layer
--Might be a good place to start
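
A generic illustration of the comm-thread structure (not actual Charm++ machine-layer code); the constructor argument is the knob the project would vary.

  // Generic sketch: dedicated communication threads draining a shared send queue.
  #include <atomic>
  #include <mutex>
  #include <queue>
  #include <thread>
  #include <vector>

  struct Msg { /* payload omitted */ };

  class CommPool {
    std::queue<Msg> q_;
    std::mutex m_;
    std::atomic<bool> done_{false};
    std::vector<std::thread> threads_;

    void drain() {
      while (!done_) {
        Msg msg;
        {
          std::lock_guard<std::mutex> lk(m_);
          if (q_.empty()) continue;      // busy-poll; a real layer would back off
          msg = q_.front(); q_.pop();
        }
        // ... hand `msg` to the network fabric here ...
      }
    }
  public:
    explicit CommPool(int nCommThreads) {          // the knob under study
      for (int i = 0; i < nCommThreads; ++i)
        threads_.emplace_back(&CommPool::drain, this);
    }
    void send(const Msg& m) { std::lock_guard<std::mutex> lk(m_); q_.push(m); }
    ~CommPool() { done_ = true; for (auto& t : threads_) t.join(); }
  };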

Ronak's idea: measure the work to know where to assign it (rough sketch below)
-A way to capture how a method is using memory bandwidth?*
-And the network
-Lukasz: this doesn't capture memory bandwidth vs. CPU speed
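
A rough sketch of what per-method measurement could look like with plain timers and hand-counted bytes (hypothetical helper names; real measurement would likely need hardware counters, per Lukasz's point above).

  // Sketch: hand-instrumented per-method accounting of time and bytes touched,
  // giving an effective bandwidth figure to feed a work-assignment heuristic.
  #include <chrono>
  #include <cstdio>

  struct MethodStats { double seconds = 0; double bytes = 0; };

  template <class F>
  void measured(MethodStats& s, double bytesTouched, F&& body) {
    auto t0 = std::chrono::steady_clock::now();
    body();
    auto t1 = std::chrono::steady_clock::now();
    s.seconds += std::chrono::duration<double>(t1 - t0).count();
    s.bytes   += bytesTouched;
  }

  // Usage idea: wrap a method body and report effective GB/s afterwards.
  // measured(stats, n * sizeof(double) * 3, [&]{ daxpy(n, a, x, y); });
  // printf("effective bandwidth: %.2f GB/s\n", stats.bytes / stats.seconds / 1e9);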

Hetero LB
-Two existing balancers currently take CPU speed differences into account
-Some notion of trying to affect the load (speed-aware greedy sketch below)
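
For flavor, a minimal speed-aware greedy assignment in the spirit of such balancers (not their actual code): each object goes to the PE with the smallest predicted finish time given that PE's relative speed.

  // Sketch: greedy heterogeneity-aware assignment. Each PE has a relative speed;
  // each object a load; objects go to the PE with the earliest predicted finish.
  #include <vector>
  #include <algorithm>

  std::vector<int> heteroGreedy(const std::vector<double>& objLoad,
                                const std::vector<double>& peSpeed) {
    std::vector<double> finish(peSpeed.size(), 0.0);
    std::vector<int> assign(objLoad.size());
    // Place big objects first so they don't get stranded on slow PEs.
    std::vector<int> order(objLoad.size());
    for (size_t i = 0; i < order.size(); ++i) order[i] = int(i);
    std::sort(order.begin(), order.end(),
              [&](int a, int b) { return objLoad[a] > objLoad[b]; });
    for (int o : order) {
      int best = 0;
      for (size_t p = 1; p < peSpeed.size(); ++p)
        if (finish[p] + objLoad[o] / peSpeed[p] <
            finish[best] + objLoad[o] / peSpeed[best])
          best = int(p);
      finish[best] += objLoad[o] / peSpeed[best];
      assign[o] = best;
    }
    return assign;
  }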

Work Breakdown (from Eric):
1. Add tile description distribution to startup and command-line argument parsing
2. Integrate it into the pe, rank, and node offset computation macros (toy prefix-sum version after this list)
3. Rationalize the use of cminodesize to distinguish between mycminodesize and cminodesize(thatnodeoverthere)
4. Make sure we interact with the cpuaffinity and physnode APIs correctly
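
A toy version of items 1-2 (hypothetical names, not the actual Converse macros): with a per-node PE-count table, the pe-to-node and node-to-first-pe lookups become prefix-sum searches instead of assuming one uniform node size.

  // Sketch: pe/rank/node offset computation when nodes may have different PE counts.
  // Helper names (nodeOf, rankOf, firstPeOf) are made up for this illustration.
  #include <vector>
  #include <algorithm>

  struct TileDesc {
    std::vector<int> firstPe;   // prefix sums: firstPe[n] = first global PE on node n
    explicit TileDesc(const std::vector<int>& peCountPerNode) {
      firstPe.resize(peCountPerNode.size() + 1, 0);
      for (size_t n = 0; n < peCountPerNode.size(); ++n)
        firstPe[n + 1] = firstPe[n] + peCountPerNode[n];
    }
    int nodeOf(int pe) const {            // which node owns this PE
      return int(std::upper_bound(firstPe.begin(), firstPe.end(), pe)
                 - firstPe.begin()) - 1;
    }
    int rankOf(int pe) const { return pe - firstPe[nodeOf(pe)]; }
    int firstPeOf(int node) const { return firstPe[node]; }
  };
  // e.g. TileDesc t({16, 16, 4}); t.nodeOf(33) == 2, t.rankOf(33) == 1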

Optimizations
-SIMD, vectorization, etc.
-Worthwhile long term
-See if the compiler can vectorize it (example below)
--If not, rewrite it so that it can be
-Some of Dave's work might include this
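
A small example of that check-then-rewrite step (generic; the flags mentioned assume GCC, so check the compiler actually in use).

  // Sketch: helping auto-vectorization along. With GCC, compile with -O3 and
  // inspect the vectorizer report via -fopt-info-vec.

  // The compiler must assume y and x may alias, which can block vectorization.
  void axpy_v1(float* y, const float* x, int n, float a) {
    for (int i = 0; i < n; ++i)
      y[i] = a * x[i] + y[i];
  }

  // __restrict promises the arrays do not overlap, so the loop is a clean SIMD
  // candidate; if the report still says "not vectorized", restructure further.
  void axpy_v2(float* __restrict y, const float* __restrict x, int n, float a) {
    for (int i = 0; i < n; ++i)
      y[i] = a * x[i] + y[i];
  }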
Two Short Benchmarks
-CPU-limited
-Memory-limited
-Run something like Jacobi to see whether those ratios correlate (sketch below)
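
A rough sketch of the two benchmarks (nothing Charm++-specific): one compute-bound kernel and one memory-bandwidth-bound kernel; the per-machine ratio of their times is what to correlate against Jacobi runs.

  // Sketch: two micro-kernels, one compute-bound and one memory-bandwidth-bound.
  #include <chrono>
  #include <cstdio>
  #include <vector>

  double timeIt(void (*kernel)(std::vector<double>&), std::vector<double>& v) {
    auto t0 = std::chrono::steady_clock::now();
    kernel(v);
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(t1 - t0).count();
  }

  // Compute-bound: a long serial dependency chain, data stays in registers.
  void cpuKernel(std::vector<double>& v) {
    double x = 1.0000001;
    for (long i = 0; i < 200000000L; ++i) x = x * 1.0000001 + 1e-9;
    v[0] = x;                            // keep the result live
  }

  // Memory-bound: STREAM-like sweep over an array far larger than cache.
  void memKernel(std::vector<double>& v) {
    size_t n = v.size() / 2;
    for (size_t i = 0; i < n; ++i) v[i] = 2.5 * v[i + n] + 1.0;
  }

  int main() {
    std::vector<double> v(1 << 26, 1.0);   // 64M doubles, ~512 MB
    printf("cpu-bound: %.3f s, mem-bound: %.3f s\n",
           timeIt(cpuKernel, v), timeIt(memKernel, v));
  }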

Possible Projects
-Machine layer
-Profiling various codes for the ratio of speed between hardware types
-Two big pieces
--Hetero LB
--Multi-PPN
-Optimization/vectorization

-Long term GPU work?
--Memory management (also for Phi?)

Action Items
-Start reading Dave's Thesis
-Update wiki page with pointers to helpful papers
-Develop two short benchmarks

To view this item, click on or cut-paste
https://charm.cs.illinois.edu/private/tms/listlog.php?param=1490#12620

--
Message generated by TMS




