
charm - [charm] FW: PAMI_Context_advance throws an error after PAMI_Rput call

charm AT lists.cs.illinois.edu

Subject: Charm++ parallel programming system




  • From: "Bhat, Nitin" <nbhat4 AT illinois.edu>
  • To: "charm AT lists.cs.illinois.edu" <charm AT lists.cs.illinois.edu>
  • Subject: [charm] FW: PAMI_Context_advance throws an error after PAMI_Rput call
  • Date: Thu, 14 Sep 2017 14:43:10 +0000
  • Accept-language: en-US
  • Authentication-results: illinois.edu; spf=pass smtp.mailfrom=nbhat4 AT illinois.edu

Forwarding the original email regarding the PAMI error as it didn’t make it to the charm mailing list the first time I sent it.

 

From: Nitin Bhat <nitin AT hpccharm.com>
Date: Wednesday, September 13, 2017 at 4:25 PM
To: "charm AT lists.cs.illinois.edu" <charm AT lists.cs.illinois.edu>
Cc: Sameer Kumar <sameermanepalli AT gmail.com>
Subject: PAMI_Context_advance throws an error after PAMI_Rput call

 

Hello, 

 

I am getting an error while working with RDMA calls in the PAMI communication library on BG/Q Vesta and need help debugging it.

 

I get the error when I build charm with "./build charm++ pamilrts-bluegeneq --with-production -j16 -g". Surprisingly, the error does not reproduce when I build charm with optimization disabled (-O0): "./build charm++ pamilrts-bluegeneq -j16 -O0 -g".

 

Specifically, the crash is in PAMI_Context_advance, which is called some time after PAMI_Rput. I have attached the job output and the stack trace that I obtained from bgq_stack.

 

I see that the error occurs after the completion function (the done_fn that I pass to the PAMI_Rput call) has finished executing.

Additionally, the stack trace points to /bgsys/source/srcV1R2M4.29840/comm/sys/buildtools/pami/p2p/protocols/rput/PutRdma.h:149, which belongs to the complete_simple() method. The line reported there is:

put->simple.done_fn (context, put->simple.cookie, PAMI_SUCCESS);

 

But I’m not sure why invoking the done function throws an error. The done_fn does nothing other than print “completion fn beg” and “completion fn end”, as seen in the job output.
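For readers unfamiliar with the pattern, the completion step quoted above can be modeled in a few lines of stand-alone C. This is only a sketch and does not use PAMI: `ctx_t`, `result_t`, `event_fn_t`, `put_state_t`, `count_done` and `demo_completions` are hypothetical stand-ins invented for illustration, and the NULL check is an assumption rather than something taken from PutRdma.h. It shows the one thing the callback contract relies on: the function pointer and the cookie stored at post time must both still be valid when the completion step runs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-ins for the PAMI types; NOT the real pami.h definitions. */
typedef void *ctx_t;
typedef enum { RESULT_SUCCESS = 0 } result_t;
typedef void (*event_fn_t)(ctx_t context, void *cookie, result_t result);

/* Models the "simple" part of a put request: a callback plus its cookie. */
typedef struct {
    event_fn_t done_fn;
    void      *cookie;
} put_state_t;

/* Models the completion step: invoke the stored callback, if any.
   The NULL check is an assumption made for this sketch. */
static void complete_simple(ctx_t context, put_state_t *put)
{
    if (put->done_fn != NULL)
        put->done_fn(context, put->cookie, RESULT_SUCCESS);
}

/* Example callback: prints the same markers as the job output and
   counts successful completions through the cookie. */
static void count_done(ctx_t context, void *cookie, result_t result)
{
    (void)context;
    printf("completion fn beg\n");
    if (result == RESULT_SUCCESS)
        ++*(int *)cookie;
    printf("completion fn end\n");
}

/* Posts one modeled put and completes it immediately.
   Returns the number of completions observed. */
static int demo_completions(void)
{
    int completed = 0;
    put_state_t put = { count_done, &completed };

    assert(put.done_fn != NULL);
    /* Completion is synchronous here, so stack storage is safe. In real
       asynchronous code the state and cookie must outlive the operation. */
    complete_simple(NULL, &put);
    return completed;
}
```

Calling `demo_completions()` from a `main()` prints the two marker lines once and returns 1. In an asynchronous setting the same code would only be safe if `put` and `completed` lived in storage that remains valid until the callback has run.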

 

Interestingly, things work just fine and I don’t see any crash at PAMI_Context_advance when my program uses PAMI_Rget (instead of PAMI_Rput).

 

Any pointers for debugging this error? Are there any restrictions on calling PAMI_Context_advance after Rput/Rget calls?

 

Thanks,

Nitin Bhat

Software Engineer, 

Charmworks Inc.

[nbhat@vestalac1 simple_rput]$ bgq_stack simple_rput core.0
------------------------------------------------------------------------
Program : ./simple_rput
------------------------------------------------------------------------
+++ID Rank: 0, TGID: 1, Core: 0, HWTID:0 TID: 1 State: RUN

000000000126a890
00000456.long_branch_r2off.memcpy+0
/bgsys/source/srcV1R2M4.29840/comm/sys/buildtools/pami/p2p/protocols/rput/PutRdma.h:149

000000000125f2a0
_ZN4PAMI6Device5Shmem13RecPacketWorkINS_4Fifo8WrapFifoINS3_10FifoPacketILj32ELj160EEENS_14BoundedCounter3BGQ17IndirectL2BoundedELj128ENS_6Wakeup3BGQEEEE21advance_with_callbackEPv
/bgsys/source/srcV1R2M4.29840/comm/sys/buildtools/pami/components/devices/shmem/ShmemWork.h:73

000000000137f29c
_ZN4PAMI6Device7Generic13GenericThread13executeThreadEPv
/bgsys/source/srcV1R2M4.29840/comm/sys/buildtools/pami/components/devices/generic/AdvanceThread.h:100

000000000124d614
PAMI_Context_advance
/bgsys/source/srcV1R2M4.29840/comm/sys/buildtools/pami/api/c/pami.cc:508

00000000011d3bd8
CmiGetNonLocal
/gpfs/vesta-home/nbhat/software/charm_2/pamilrts-bluegeneq/tmp/machine.c:894

00000000011dc25c
CsdNextMessage
/gpfs/vesta-home/nbhat/software/charm_2/pamilrts-bluegeneq/tmp/convcore.c:1779

00000000011dc498
CsdScheduleForever
/gpfs/vesta-home/nbhat/software/charm_2/pamilrts-bluegeneq/tmp/convcore.c:1904

00000000011dc168
CsdScheduler
/gpfs/vesta-home/nbhat/software/charm_2/pamilrts-bluegeneq/tmp/convcore.c:1843

00000000011d30b0
ConverseRunPE
/gpfs/vesta-home/nbhat/software/charm_2/pamilrts-bluegeneq/tmp/machine-common-core.c:1297

00000000011d21f8
ConverseInit
/gpfs/vesta-home/nbhat/software/charm_2/pamilrts-bluegeneq/tmp/machine-common-core.c:1198

0000000001014314
main
/gpfs/vesta-home/nbhat/software/charm_2/pamilrts-bluegeneq/tmp/main.C:18

0000000001515898
generic_start_main
/bgsys/drivers/V1R2M4/ppc64/toolchain/gnu/glibc-2.12.2/csu/../csu/libc-start.c:226

0000000001515b94
__libc_start_main
/bgsys/drivers/V1R2M4/ppc64/toolchain/gnu/glibc-2.12.2/csu/../sysdeps/unix/sysv/linux/powerpc/libc-start.c:194

0000000000000000
??
??:0

[nbhat@vestalac1 simple_rput]$ runjob --block $COBALT_PARTNAME -p 2 -n 2 : ./simple_rput 10
2017-09-13 20:54:57.777 (INFO ) [0xfff982ac9d0] ibm.runjob.AbstractOptions: using properties file /bgsys/local/etc/bg.properties
2017-09-13 20:54:57.778 (INFO ) [0xfff982ac9d0] ibm.runjob.AbstractOptions: max open file descriptors: 65536
2017-09-13 20:54:57.778 (INFO ) [0xfff982ac9d0] ibm.runjob.AbstractOptions: core file limit: 18446744073709551615
2017-09-13 20:54:57.780 (INFO ) [0xfff982ac9d0] 45473:tatu.runjob.client: scheduler job id is 572020
2017-09-13 20:54:57.788 (INFO ) [0xfff77983610] 45473:tatu.runjob.monitor: monitor started
2017-09-13 20:54:58.194 (INFO ) [0xfff77983610] 45473:tatu.runjob.monitor: task record 1859875 created
2017-09-13 20:54:58.195 (INFO ) [0xfff982ac9d0] VST-22020-33131-32:45473:ibm.runjob.client.options.Parser: set local socket to runjob_mux from properties file
2017-09-13 20:55:00.420 (INFO ) [0xfff982ac9d0] VST-22020-33131-32:2030146:ibm.runjob.client.Job: job 2030146 started
Choosing optimized barrier algorithm name I0:MultiSync2Device:SHMEM:GI
Charm++> Running in non-SMP mode: numPes 2
Converse/Charm++ Commit ID: v6.8.0-3-gb18b724
[0] isomalloc.c> Disabling isomalloc because isomalloc disabled by conv-mach
CharmLB> Load balancer assumes all CPUs are same.
Charm++> Running on 1 unique compute nodes (64-way SMP).
[0][0][0] ckrdma.C, rput beginning
[0][0][0] ckrdma.C, calling rput
[0][0][0] before putData
[0][0][0] after putData
[0][0][0] ckrdma.C, done calling rput
[0][0][0] completion fn beg
[0][0][0] completion fn end
2017-09-13 20:55:01.215 (INFO ) [0xfff77983610] 45473:tatu.runjob.monitor: tracklib completed
2017-09-13 20:55:01.822 (WARN ) [0xfff982ac9d0] VST-22020-33131-32:2030146:ibm.runjob.client.Job: terminated by signal 4
2017-09-13 20:55:01.822 (WARN ) [0xfff982ac9d0] VST-22020-33131-32:2030146:ibm.runjob.client.Job: abnormal termination by signal 4 from rank 0
2017-09-13 20:55:01.822 (INFO ) [0xfff982ac9d0] tatu.runjob.client: task terminated by signal 4
2017-09-13 20:55:01.822 (INFO ) [0xfff77983610] 45473:tatu.runjob.monitor: monitor terminating
2017-09-13 20:55:01.826 (INFO ) [0xfff982ac9d0] tatu.runjob.client: monitor completed
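
A note on reading the log above: the job is reported as terminated by signal 4. Under Linux signal numbering, signal 4 is SIGILL (illegal instruction), not a segmentation fault. Whether the BG/Q compute node kernel uses the same numbering is an assumption here. The small C sketch below, with the hypothetical helper names `describe_signal` and `is_illegal_instruction`, shows one way to turn a signal number from such a log into a readable name on a Linux login node.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <string.h>

/* Returns the C library's human-readable description of a signal number.
   The returned string must not be modified or freed by the caller. */
static const char *describe_signal(int signo)
{
    assert(signo > 0);
    return strsignal(signo);
}

/* Returns nonzero when the number matches SIGILL on the build platform. */
static int is_illegal_instruction(int signo)
{
    return signo == SIGILL;
}
```

On a Linux system, `describe_signal(4)` yields a description of the illegal-instruction signal and `is_illegal_instruction(4)` is true. This only identifies the signal and says nothing by itself about why it was raised.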


