Charm++ parallel programming system

Re: [charm] PAMI_Context_advance throws an error after PAMI_Rput call


  • From: Sameer Kumar <sameermanepalli AT gmail.com>
  • To: Nitin Bhat <nitin AT hpccharm.com>
  • Cc: "charm AT lists.cs.illinois.edu" <charm AT lists.cs.illinois.edu>
  • Subject: Re: [charm] PAMI_Context_advance throws an error after PAMI_Rput call
  • Date: Thu, 14 Sep 2017 05:15:23 -0400
  • Authentication-results: illinois.edu; spf=pass smtp.mailfrom=sameermanepalli AT gmail.com

This looks like a user error triggering a crash in memcpy over shared memory (shmem). Can you verify that the buffer addresses are correct? Also verify the calls to PAMI_Memregion_create.
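
A minimal sketch of the kind of address check suggested here, in plain C with no PAMI calls. The `region_record` struct and the three helper functions are hypothetical names invented for illustration; they are not part of PAMI or Charm++. The sketch assumes the application records, for each registered region, the base address, the number of bytes it asked to register, and the number of bytes the registration call reported back, and then checks every transfer against that record before issuing it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical bookkeeping record (not a PAMI type): what the application
 * remembers about one registered memory region. */
typedef struct {
    uintptr_t base;      /* address passed when the region was registered */
    size_t    bytes_in;  /* bytes requested at registration time */
    size_t    bytes_out; /* bytes the library reported as registered */
} region_record;

/* True if the whole requested range was registered. */
bool region_fully_registered(const region_record *r) {
    return r->bytes_out >= r->bytes_in;
}

/* True if [offset, offset + len) lies inside the registered part of the
 * region. Written so that offset + len cannot overflow. */
bool transfer_in_bounds(const region_record *r, size_t offset, size_t len) {
    return offset <= r->bytes_out && len <= r->bytes_out - offset;
}

/* Computes the offset of addr relative to the region base.
 * Returns false (and leaves *offset untouched) if addr is below the base. */
bool offset_of(const region_record *r, const void *addr, size_t *offset) {
    uintptr_t a = (uintptr_t)addr;
    if (a < r->base) {
        return false;
    }
    *offset = (size_t)(a - r->base);
    return true;
}
```

Running these checks on both the local and the remote side before each put or get, and logging the base, offset, and length when one fails, is one way to narrow down whether a bad address or a short registration is involved.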

Sent from my iPhone

On 13-Sep-2017, at 5:25 pm, Nitin Bhat <nitin AT hpccharm.com> wrote:

Hello,

I am getting an error while working with RDMA calls in the PAMI communication library on BG/Q Vesta and need help debugging it.

I get the error when I build charm with “./build charm++ pamilrts-bluegeneq --with-production -j16 -g”. Surprisingly, the error does not reproduce when I build charm with -O0, using “./build charm++ pamilrts-bluegeneq -j16 -O0 -g”.

Specifically, the crash is at PAMI_Context_advance, which is called sometime after calling PAMI_Rput. I have attached the job output and the stack trace that I obtained from bgqstack.

I see that the error occurs after the completion function executes (the done_fn that I pass to the PAMI_Rput call).

Additionally, the stack trace shows that the error occurs at /bgsys/source/srcV1R2M4.29840/comm/sys/buildtools/pami/p2p/protocols/rput/PutRdma.h:149, inside the complete_simple() method. The line reported is

put->simple.done_fn (context, put->simple.cookie, PAMI_SUCCESS);

But I’m not sure why the invocation of the done function is throwing an error. I am not doing anything in the done_fn other than printing “completion fn beg” and “completion fn end”, as seen in the job output.

Interestingly, things work just fine and I don’t see any crash at PAMI_Context_advance when my program uses PAMI_Rget (instead of PAMI_Rput).

Any pointers for debugging this error? Are there any restrictions on calling PAMI_Context_advance after Rput/Rget calls?

Thanks,

Nitin Bhat

Software Engineer, 

Charmworks Inc.

<stack_trace.txt>
<job_output.txt>

