charm - Re: [charm] Optimization

charm AT lists.cs.illinois.edu

Subject: Charm++ parallel programming system

Re: [charm] Optimization


  • From: "Bohm, Eric J" <ebohm AT illinois.edu>
  • To: Dan Kokron <dkokron AT gmail.com>
  • Cc: "charm AT cs.illinois.edu" <charm AT cs.illinois.edu>
  • Subject: Re: [charm] Optimization
  • Date: Thu, 27 Aug 2020 13:53:51 +0000

The poor performance may be due to thread oversubscription. Regardless, +ppn 38 gives a high ratio of worker threads to communication threads (38 workers per communication thread) for NAMD.

Jim suggested: "May want to change pemap from 1-19,21-39,41-59,61-79 to 1-19+40,21-39+40 depending on how OS is mapping hyperthreads, or use logical to be safe: +pemap L2-39,42-79 +commap L0,40"
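
To make the difference between the two pemap strings concrete, here is a minimal sketch that expands each string into a CPU list and reports which sockets each process's worker threads would land on. It rests on several assumptions that are not stated in this thread: that `+N` in a pemap item adds, for each core in the range, a second entry offset by N (the behavior described in the Charm++ manual); that each process on a node takes consecutive blocks of `ppn` entries from the list; and that the OS numbers CPUs 0-19 and 40-59 on socket 0 and 20-39 and 60-79 on socket 1. The function names are hypothetical, and strides, runs and the `L` (logical) prefix are not handled.

```python
def expand_pemap(spec):
    """Expand comma-separated 'lo-hi' ranges with optional '+offset' suffixes.

    Assumption: 'lo-hi+N' yields lo, lo+N, lo+1, lo+1+N, ... (offsets are
    relative to each core in the range). Strides, runs and the 'L' prefix
    are deliberately not supported in this sketch.
    """
    cpus = []
    for item in spec.split(","):
        base, *offsets = item.split("+")
        lo, _, hi = base.partition("-")
        lo = int(lo)
        hi = int(hi) if hi else lo
        for cpu in range(lo, hi + 1):
            cpus.append(cpu)
            for off in offsets:
                cpus.append(cpu + int(off))
    return cpus


def sockets_per_process(spec, ppn, cores_per_socket=20, sockets=2):
    """Return, per process, the sorted list of sockets its worker threads touch.

    Assumptions: each process takes the next `ppn` entries of the expanded
    list, and hyperthread siblings are numbered core + (cores_per_socket * sockets).
    """
    cpus = expand_pemap(spec)
    hw_cores = cores_per_socket * sockets
    result = []
    for start in range(0, len(cpus), ppn):
        chunk = cpus[start:start + ppn]
        result.append(sorted({(cpu % hw_cores) // cores_per_socket for cpu in chunk}))
    return result


for spec in ("1-19,21-39,41-59,61-79", "1-19+40,21-39+40"):
    print(spec, sockets_per_process(spec, ppn=38))
```

Under those assumptions, the original map gives each process workers on both sockets (`[[0, 1], [0, 1]]`), while the suggested map keeps each process's 38 workers on one socket (`[[0], [1]]`). If the OS enumerates hyperthreads differently, the result changes, which is why the logical form is offered as the safer option.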

On Aug 26, 2020, at 10:22 AM, Dan Kokron <dkokron AT gmail.com> wrote:

I am working with researchers who are running COVID-19 simulations using NAMD.  I have performed an extensive search of the parameter space trying to find the best performance for their case.  I am asking this question here because the performance of this case depends greatly on communication.

Eric Bohm suggested that UCX+SMP would provide the best scaling, yet that configuration (or my use of it) falls significantly behind native UCX.  See attached.



Hardware:
multi-node Xeon (Skylake); each node has 2 Gold 6148 processors (40 hardware cores with HT enabled)
nodes are connected with EDR Infiniband

Software:
NAMD git/master
CHARM++ 6.10.2
HPCX 2.7.0 (OpenMPI + UCX-1.9.0)
Intel 2019.5.281 compiler

CHARM++ for the UCX+SMP build was built with
setenv base_charm_opts "-O3 -ip -g -xCORE-AVX512"
./build charm++ ucx-linux-x86_64 icc ompipmix smp --suffix avx512 --with-production $base_charm_opts --basedir=$HPCX_UCX_DIR --basedir=$HPCX_MPI_DIR -j12

The native UCX build was the same except without the 'smp' option.

The UCX+SMP build of NAMD was built with
FLOATOPTS = -ip -O3 -xCORE-AVX512 -qopt-zmm-usage=high -fp-model fast=2 -no-prec-div -qoverride-limits -DNAMD_DISABLE_SSE -qopenmp-simd -DNAMD_AVXTILES
./config Linux-x86_64-icc --charm-arch ucx-linux-x86_64-ompipmix-smp-icc-avx512 --with-fftw3 --fftw-prefix /fftw-3.3.8/install/namd --charm-opts -verbose

Simulation Case:
1764532 atoms (See attached output listing)

UCX+SMP launch
mpiexec -np $nsockets --map-by ppr:1:socket --bind-to core -x UCX_TLS="rc,xpmem,self"  /Linux-x86_64-icc-ucx-smp-xpmem-avx512/namd2 +ppn 38 +pemap 1-19,21-39,41-59,61-79 +commap 0,20 +setcpuaffinity +showcpuaffinity restart.namd

Would you expect native UCX to outperform UCX+SMP in this scenario?
Can you suggest some ways to improve the performance of my UCX+SMP build?

Dan
<NAMD_UCX+SMP_vs_Native_UCX.png><skxOnskx.UCX.XPMEM.SMP.48s>



