charm - [charm] Fwd: backtrace of ChaNGa process

charm AT lists.cs.illinois.edu

Subject: Charm++ parallel programming system

  • From: Shad Kirmani <sxk5292 AT cse.psu.edu>
  • To: charm AT cs.uiuc.edu, Jason Holmes <jholmes AT psu.edu>, Padma Raghavan <raghavan AT cse.psu.edu>
  • Subject: [charm] Fwd: backtrace of ChaNGa process
  • Date: Mon, 26 Mar 2012 13:22:32 -0400
  • List-archive: <http://lists.cs.uiuc.edu/pipermail/charm>
  • List-id: CHARM parallel programming system <charm.cs.uiuc.edu>

Hello,

Occasionally, when ChaNGa is built for ibverbs, the processes hang for a long time at job startup.  A backtrace of one hung process looks like this:

#0  0x00000038daa0b795 in pthread_spin_lock () from /lib64/libpthread.so.0
#1  0x00002b93ecee7a7b in ibv_cmd_create_qp ()
  from /usr/lib64/libmlx4-rdmav2.so
#2  0x000000000061add0 in recvBarrierMessage ()
#3  0x000000000061b882 in CmiBarrier ()
#4  0x00000000006206ec in CmiTimerInit ()
#5  0x00000000006216ec in ConverseCommonInit ()
#6  0x000000000061d723 in ConverseInit ()
#7  0x00000000005afd4c in main ()
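
For comparing backtraces across ranks, a small script can help. The following is only a minimal sketch, not part of ChaNGa or Charm++: the helper names (parse_frames, gdb_command) are hypothetical, and it assumes gdb is installed on the compute node and allowed to attach to the process. It builds a non-interactive gdb invocation and extracts the frame numbers and function names from text like the backtrace above.

```python
import re

# Matches gdb frame lines such as
#   "#0  0x00000038daa0b795 in pthread_spin_lock () from /lib64/libpthread.so.0"
# The address part is optional because gdb omits it for some frames.
FRAME_RE = re.compile(r"^#(\d+)\s+(?:0x[0-9a-fA-F]+\s+in\s+)?([^\s(]+)\s*\(")


def parse_frames(backtrace):
    """Return a list of (frame_number, function_name) tuples."""
    frames = []
    for line in backtrace.splitlines():
        match = FRAME_RE.match(line.strip())
        if match:  # continuation lines ("from /usr/lib64/...") are skipped
            frames.append((int(match.group(1)), match.group(2)))
    return frames


def gdb_command(pid):
    """Build an argument list that attaches gdb to pid in batch mode
    and prints a backtrace of every thread (hypothetical helper)."""
    return ["gdb", "-p", str(pid), "-batch", "-ex", "thread apply all bt"]


# The backtrace quoted in this message, used as sample input.
SAMPLE = """\
#0  0x00000038daa0b795 in pthread_spin_lock () from /lib64/libpthread.so.0
#1  0x00002b93ecee7a7b in ibv_cmd_create_qp ()
  from /usr/lib64/libmlx4-rdmav2.so
#2  0x000000000061add0 in recvBarrierMessage ()
#3  0x000000000061b882 in CmiBarrier ()
#4  0x00000000006206ec in CmiTimerInit ()
#5  0x00000000006216ec in ConverseCommonInit ()
#6  0x000000000061d723 in ConverseInit ()
#7  0x00000000005afd4c in main ()
"""

if __name__ == "__main__":
    for number, function in parse_frames(SAMPLE):
        print(number, function)
```

The list returned by gdb_command could be passed to subprocess.run on each node, and the parsed frames compared to see whether every hung rank is waiting in the same call.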

With the verbose flag added to charmrun, the hang occurs right after it says that all nodes are connected:

...
Charmrun> Waiting for 62-th client to connect.
Charmrun> Waiting for 63-th client to connect.
Charmrun> All clients connected.
Charmrun> IP tables sent.
Charmrun> node programs all connected

We did not see these hangs when ChaNGa was built on mpi-linux-x86_64 instead of net-linux-x86_64 with ibverbs.  When the hang occurs, it either clears after some time and the job then runs normally, or it lasts long enough that we give up and kill the job.

This is on a Red Hat Enterprise Linux 5 system using libibverbs-1.1.3-2.

Thanks,
Shad
