Re: [charm] It builds, but it doesn't run (was Re: Build error in ckrdma.C)


  • From: Nitin Bhat <nitin.bhat.k AT gmail.com>
  • To: charm AT cs.illinois.edu
  • Subject: Re: [charm] It builds, but it doesn't run (was Re: Build error in ckrdma.C)
  • Date: Fri, 12 Jul 2019 15:28:22 -0500

Forwarding this to the charm mailing list for archival.


On Jul 12, 2019, at 3:24 PM, Nitin Bhat <nitin.bhat.k AT gmail.com> wrote:

Hi Gerardo, 

That’s great to know. 

I’ll open an issue in our repository and we’ll dig deeper into why Simple PMI doesn’t work in a Slurm-allocated environment.

You’re very welcome. Let us know if you run into any other issues or have any questions. 

Regards, 
Nitin

On Jul 12, 2019, at 3:05 PM, Gerardo Cisneros-Stoianowski <gerardo AT mellanox.com> wrote:

Nitin,

Your suggestion to use slurmpmi fixed the issue and it appears charm is now running.

Regarding the use of ompipmix, that would not have been much of a problem either, since HPC-X 2.4 comes not only with UCX but also with Open MPI 4.0, so the alternative options would simply have been "ompipmix --basedir=$MPI_HOME".

Thanks again.

Saludos,

Gerardo
-- 
Gerardo Cisneros-Stoianowski, Ph.D.
Senior Engineer, HPC Applications Performance
Mellanox Technologies



From: Nitin Bhat <nitin.bhat.k AT gmail.com>
Sent: Friday, July 12, 2019 2:15 PM
To: Gerardo Cisneros-Stoianowski
Cc: Mikhail Brinskii
Subject: Re: It builds, but it doesn't run (was Re: [charm] Build error in ckrdma.C)
 
Hi Gerardo, 

Thanks for getting back. 

-1 looks like PMI_FAIL from here (http://formalverification.cs.utah.edu/sawaya/html/d1/df0/pmi_8h-source.html). I wonder why that operation is failing.
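For reference, the Simple PMI header linked above defines the two basic return codes as follows (quoting from memory, so please double-check against the header itself):

    #define PMI_SUCCESS  0   /* operation completed successfully */
    #define PMI_FAIL    -1   /* operation failed */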

Can you try using SlurmPMI and/or PMIx (would require OpenMPI) and check if you still get the error? 

To use SlurmPMI for process management, you can build Charm (ChaNGa in this case) as:
./build ChaNGa ucx-linux-x86_64 icc ifort slurmpmi 2>&1 

Similarly, to build with OpenMPI, you would have to download and install OpenMPI (or use a pre-installed version on the cluster) and then build as:
./build ChaNGa ucx-linux-x86_64 icc ifort ompipmix --basedir=<path to OpenMPI> 2>&1 

Thanks,
Nitin

On Jul 12, 2019, at 1:41 PM, Gerardo Cisneros-Stoianowski <gerardo AT mellanox.com> wrote:

Nitin,
 
Thanks for bringing Mikhail into the thread.  I inserted a print statement as instructed; this is what I got:
 
[gerardo@thor035 simplearrayhello]$ ./charmrun +p2 hello
 
Running on 2 processors:  hello
charmrun>  /usr/bin/setarch x86_64 -R  mpirun -np 2  hello
PMI_KVS_Put returned -1
PMI_KVS_Put returned -1
------- Partition 0 Processor 0 Exiting: Called CmiAbort ------
Reason: UCX: UcxInitEps: runtime_kvs_put error failed: 5
[. . . followed by stack trace as before . . .]
 
To answer your question, I'm building and running on an actual compute node under a Slurm allocation, not the login node.  Perhaps the implicit mpirun command is missing one or more options.
 
Saludos,
 
Gerardo
-- 
Gerardo Cisneros-Stoianowski, Ph.D.
Senior Engineer, HPC Applications Performance
Mellanox Technologies

From: Nitin Bhat <nitin.bhat.k AT gmail.com>
Sent: Friday, July 12, 2019 1:22 PM
To: Gerardo Cisneros-Stoianowski
Cc: Mikhail Brinskii
Subject: Re: It builds, but it doesn't run (was Re: [charm] Build error in ckrdma.C)
 
Hi Gerardo,  
 
I haven’t seen that error before, so I’m cc’ing Mikhail, who might be familiar with it. It looks like the program is crashing during the early stages of PMI initialization (inside PMI_KVS_Put).
 
To help us debug the error, can you add this print statement on line 169 of src/arch/util/proc_management/runtime-pmi.C (right after the PMI_KVS_Put call) and run your program? The return code will help identify the error.
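Something along these lines (just a sketch of the debug print; the exact variable names around that line in runtime-pmi.C may differ slightly):

    ret = PMI_KVS_Put(kvsname, key, value);      /* existing call around line 169 */
    printf("PMI_KVS_Put returned %d\n", ret);    /* temporary debug print */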
 
Additionally, I wanted to check whether you’re running the program on one of the login nodes of the cluster. If so, can you try running on one of the compute nodes (the ones allocated by the job scheduler) and check whether you’re still seeing the error? 
 
Thanks,
Nitin
 
 
On Jul 12, 2019, at 11:59 AM, Gerardo Cisneros-Stoianowski <gerardo AT mellanox.com> wrote:
 
Nitin,
 
I noticed that your fixes were merged, so I updated my copy and tried the build again.  This time it completed, but when I tried running the test suggested at the end of the build output, the run failed immediately, as follows:
 
pushd ucx-linux-x86_64-ifort-icc/tests/charm++/simplearrayhello
make
. . .
./charmrun +p2 hello
 
Running on 2 processors:  hello
charmrun>  /usr/bin/setarch x86_64 -R  mpirun -np 2  hello
------- Partition 0 Processor 0 Exiting: Called CmiAbort ------
Reason: UCX: UcxInitEps: runtime_kvs_put error failed: 5
------- Partition 0 Processor 0 Exiting: Called CmiAbort ------
Reason: UCX: UcxInitEps: runtime_kvs_put error failed: 5
[0] Stack Traceback:
  [0:0] _Z14CmiAbortHelperPKcS0_S0_ii+0x57  [0x5a0297]
[0] Stack Traceback:
  [0:0] _Z14CmiAbortHelperPKcS0_S0_ii+0x57  [0x5a0297]
  [0:1] CmiAbort+0x1a  [0x5a023a]
  [0:2]   [0x59dc88]
  [0:3] _Z8LrtsInitPiPPPcS_S_+0x388  [0x59acb8]
  [0:4] ConverseInit+0x2a3  [0x5a0663]
  [0:1] CmiAbort+0x1a  [0x5a023a]
  [0:2]   [0x59dc88]
  [0:3] _Z8LrtsInitPiPPPcS_S_+0x388  [0x59acb8]
  [0:4] ConverseInit+0x2a3  [0x5a0663]
  [0:5] charm_main+0x27  [0x489de7]
  [0:6] __libc_start_main+0xf5  [0x2aaaaeef13d5]
  [0:7]   [0x485a09]
  [0:5] charm_main+0x27  [0x489de7]
  [0:6] __libc_start_main+0xf5  [0x2aaaaeef13d5]
  [0:7]   [0x485a09]
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
 
I did not send this to the charm list because I first wanted to ask you how to proceed.
 
Saludos,
 
Gerardo
 

From: Nitin Bhat <nitin.bhat.k AT gmail.com>
Sent: Friday, July 12, 2019 9:05 AM
To: Gerardo Cisneros-Stoianowski
Cc: charm AT cs.illinois.edu
Subject: Re: [charm] Build error in ckrdma.C
 
Hi Gerardo,  
 
Thanks for letting us know about the issue you’re seeing. 
 
This was caused by a recent change that added code for de-registration of buffers. However, since CMK_ONESIDED_IMPL (support for the Zerocopy API) is currently disabled for the UCX layer, we should also have disabled CMK_REG_REQUIRED there. 
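To illustrate the intent (this is only a sketch, not the actual contents of the PR), the idea is to tie the registration requirement to the Zerocopy support flag in the UCX layer's configuration, roughly like:

    /* sketch: only require buffer registration when the Zerocopy path is compiled in */
    #if CMK_ONESIDED_IMPL
    #define CMK_REG_REQUIRED 1
    #else
    #define CMK_REG_REQUIRED 0
    #endif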
 
I’ve added a PR (https://github.com/UIUC-PPL/charm/pull/2395) to fix this issue. It should be merged by today after another internal review.
Let us know if you see any other issues. 
 
Thanks,
Nitin
 
 
On Jul 11, 2019, at 5:18 PM, Gerardo Cisneros-Stoianowski <gerardo AT mellanox.com> wrote:
 
Hello.
 
While building the development version of Charm++ on an IB-enabled cluster of Intel BDW processors, I ran into the following errors in ckrdma.C (possibly due to a missing #include):
 
../bin/charmc   -I.   -c -o ckrdma.o ckrdma.C
ckrdma.C(107): error: identifier "deregisterBuffer" is undefined
        deregisterBuffer(source);
        ^
 
ckrdma.C(116): error: identifier "deregisterBuffer" is undefined
        deregisterBuffer(*this);
        ^
 
ckrdma.C(252): error: identifier "deregisterBuffer" is undefined
        deregisterBuffer(destination);
        ^
 
ckrdma.C(261): error: identifier "deregisterBuffer" is undefined
        deregisterBuffer(*this);
        ^
 
ckrdma.C(366): error: identifier "deregisterDestBuffer" is undefined
        deregisterDestBuffer(info);
        ^
 
ckrdma.C(377): error: identifier "CmiInvokeRemoteDeregAckHandler" is undefined
        CmiInvokeRemoteDeregAckHandler(info->srcPe, info);
        ^
 
ckrdma.C(386): error: identifier "deregisterSrcBuffer" is undefined
        deregisterSrcBuffer(info);
        ^
 
ckrdma.C(397): error: identifier "CmiInvokeRemoteDeregAckHandler" is undefined
        CmiInvokeRemoteDeregAckHandler(info->destPe, info);
        ^
 
compilation aborted for ckrdma.C (code 2)
Fatal Error by charmc in directory /global/home/users/gerardo/tools/charm/ucx-linux-x86_64-ifort-icc/tmp
   Command icpc -fpic -DCMK_GFORTRAN -I../bin/../include -I/usr/include/ -I./proc_management/ -I./proc_management/simple_pmi/ -D__CHARMC__=1 -I. -fno-stack-protector -std=c++11 -c ckrdma.C -o ckrdma.o returned error code 2
charmc exiting...
gmake: *** [ckrdma.o] Error 1
-------------------------------------------------
 
The command line I used for my build was the following:
 
./build ChaNGa ucx-linux-x86_64 icc ifort 2>&1 | tee build_chng_ucx_i185h240.log
 
I’m using Intel 2018.5.274 compilers and HPC-X 2.4.0 (or, more specifically, UCX 1.6.0).
 
Any help or suggestion will be most appreciated.
 
Saludos,
 
Gerardo
-- 
Gerardo Cisneros-Stoianowski, Ph.D.
Senior Engineer, HPC Applications Performance
Mellanox Technologies
+52-55-55637958



