[charm] CharmLU optimization for heterogeneous run (Broadwell+Intel Xeon Phi 7250)


  • From: Ekaterina Tutlyaeva <xgl AT rsc-tech.ru>
  • To: charm AT cs.uiuc.edu
  • Subject: [charm] CharmLU optimization for heterogeneous run (Broadwell+Intel Xeon Phi 7250)
  • Date: Fri, 20 Jan 2017 12:31:50 +0300

Dear support,

I'm trying to get the best results for CharmLU in a heterogeneous environment of 2 nodes with different CPUs.
First node: Broadwell
    2 CPUs per node
    Intel(R) Xeon(R) CPU E5-2698 v4 @ 2.20GHz;
    20 cores, 40 threads each (http://ark.intel.com/ru/products/91753/Intel-Xeon-Processor-E5-2698-v4-50M-Cache-2_20-GHz)
    (theoretical peak is about 665.6 GFlops double precision);
    RAM: 128 GB DDR4/2133 MHz
Second node: Intel Knights Landing
    1 CPU
    Intel Xeon Phi 7250 @ 1.40 GHz
    68 cores, 272 threads
    (theoretical peak 3046.4 GFlops double precision)
    RAM: 16 GB Intel MCDRAM + 192 GB DDR4


The best result I have been able to get so far in this combined environment is 661.186 GFlops (while the theoretical peak is about 3700 GFlops!).
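
Just to show my arithmetic for that 3700 figure (simply the sum of the two per-node peaks listed above, ignoring interconnect and load-imbalance effects):

    665.6 + 3046.4 = 3712.0 GFlops
    661.186 / 3712.0 ≈ 0.178

so I am currently at roughly 18% of the combined peak.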

Would you be so kind as to give me some hints on how I can optimize my runs and get a little bit closer to the theoretical values with CharmLU in my environment? Maybe there are some scheduler features or optimizations I am missing?

What am I doing wrong? What can I optimize?
Maybe there is a more appropriate scheduler strategy that I could choose? (I've tried the +MetaLB scheduler.) Should I manually set the process mapping scheme?
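
For reference, my understanding is that a balancer is selected at run time with flags of this form, appended to the charmrun line (RefineLB below is only an illustrative name from the manual, not something I have settled on or measured):

    +balancer RefineLB +MetaLB

Please correct me if that is not how CharmLU expects the load balancer to be driven.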

The execution parameters for the best run (661 GFlops):
./charmrun --bootstrap ssh -machinefile hosts +p108 ./charmlu 144000 360 1200000000 120 3
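
To make the mapping question above concrete: would manually pinning threads with flags along these lines be the right direction? The hostnames in the sketched hosts file and the +pemap/+commap values are placeholders I made up for illustration (and I am not sure how a single +pemap is supposed to cover two node types with different core counts, 40 vs. 68):

    # hosts -- placeholder names, my real file just lists the two nodes
    node-bdw
    node-knl

    ./charmrun --bootstrap ssh -machinefile hosts +p108 ./charmlu 144000 360 1200000000 120 3 +pemap 0-39 +commap 40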

I've tried different block sizes (from 120 up to 560) and
different numbers of processes (min = 88, max = 172, while the total number of physical cores is 108 = 68 (KNL) + 20*2 (2x Broadwell)).
144000 is the maximum matrix size that I can use; larger values crash with a segfault.

1500000000 is the maximum memory threshold that I can use; larger values end with a segfault** (while a memory threshold limit of 1200000000 gives the best benchmark values).
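
For context, my rough estimate of the bare matrix footprint at N = 144000 (just the N^2 doubles, ignoring pivoting workspace and runtime overhead):

    144000^2 * 8 bytes = 165,888,000,000 bytes, i.e. about 166 GB

which has to be distributed across the 128 GB Broadwell node and the 208 GB on the KNL node, so at larger sizes I suspect I simply run out of memory somewhere, though I have not confirmed that.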

The mapping scheme 3 (2D tiling) and the pivot batch size of 120 were found empirically (on smaller matrix sizes these values gave the best results). Maybe you could recommend the best ranges for these values to benchmark in my environment?

The Send Limit: 2 parameter stays unchanged; maybe I should do something with it?


My CharmLU compilation options:
OPT       = -O3 -axCORE-AVX2,MIC-AVX512


config.mk uses MKL for math:
SEND_LIM  = 2
BLAS_INC  =  -I${MKLROOT}/include
BLAS_LD   =  -L${MKLROOT}/lib/intel64 -lmkl_scalapack_lp64 -lmkl_intel_lp64 -lmkl_sequential -lmkl_core -lmkl_blacs_intelmpi_lp64
BLAS_LIBS = -lpthread -lm -ldl


I'm using charm-6.7.1.
The build options for charm:

./build LIBS  mpi-linux-x86_64 smp -j14 --with-refnum-type=int -axCORE-AVX2,MIC-AVX512




Thank you very much for your time!! Sorry for the long letter...

Best regards,
Ekaterina

ps:
** [1] Stack Traceback: (for Mem Threshold (MB): 1600000000)
  [1:0] _ZN14BlockScheduler13registerBlockE9CkIndex2D+0x1f2  [0x5490e2]
  [1:1] _ZN5LUBlk4initE8LUConfig12CProxy_LUMgr21CProxy_BlockScheduler10CkCallbackS3_S3_+0x133  [0x539ea3]
  [1:2] _ZN5LUBlk8_when_20EPN13Closure_LUBlk18startup_25_closureE+0x1f4  [0x4d8944]
  [1:3] _ZN13CkIndex_LUBlk35_call_schedulerReady_CkReductionMsgEPvS0_+0x259  [0x53c8e9]
  [1:4] CkDeliverMessageReadonly+0x118  [0x5a84e8]
  [1:5]   [0x614c81]
  [1:6] _ZN15CkIndex_CkArray29_call_recvBroadcast_CkMessageEPvS0_+0x410  [0x680800]
  [1:7] CkDeliverMessageFree+0x8a  [0x5ca18a]
  [1:8]   [0x5b29f8]
  [1:9] CsdScheduler+0x59f  [0x80249f]
  [1:10]   [0x7e8db7]
  [1:11]   [0x7e638a]
  [1:12] +0x7dc5  [0x2b5204c08dc5]



