
charm - Re: [charm] Deadlock detection

charm AT lists.cs.illinois.edu

Subject: Charm++ parallel programming system

  • From: Jozsef Bakosi <jbakosi AT lanl.gov>
  • To: Vinicius Freitas <vinicius.mct.freitas AT gmail.com>, charm <charm AT lists.cs.illinois.edu>
  • Subject: Re: [charm] Deadlock detection
  • Date: Wed, 27 Jun 2018 09:58:57 -0600

Hi Vinicius,

This is useful. I have put in a quiescence detection callback, which is not
triggered most of the time. But I managed to trigger it by running a bunch of
regression tests, scheduled in parallel on a shared-memory machine, in an
infinite shell while-loop, while also randomly loading all CPUs in the
background and using Charm++'s randomized message queues. I'm assuming that
triggering quiescence in this scenario means that I successfully hit an error
that would otherwise lead to a deadlock. (This might very well be a conjecture
with some wishful thinking.)

Am I correct that quiescence should not happen in a "normal" application that
is not specifically designed with quiescence in mind?
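
To make the question concrete, here is a toy model (plain C++, not Charm++
code) of the Charm++ manual's definition of quiescence: no processor executing
an entry point, no messages awaiting processing, none in flight. The names
ToyScheduler, Collector, and simulate are made up for illustration. In this
sketch, a receiver that waits for a contribution nobody sends leaves the queue
empty, so the toy scheduler reaches quiescence without the receiver ever
finishing. Whether the real runtime behaves the same way for a given deadlock
depends on what causes it.

```cpp
#include <cassert>
#include <deque>
#include <functional>
#include <utility>

// Toy single-process model of a message-driven scheduler (NOT Charm++ code).
struct ToyScheduler {
  std::deque<std::function<void()>> queue;  // messages awaiting processing

  void send(std::function<void()> handler) { queue.push_back(std::move(handler)); }

  // Deliver messages until none are left. Returning from this function is the
  // toy analogue of quiescence: nothing is executing and nothing is queued.
  void runUntilQuiescence() {
    while (!queue.empty()) {
      auto handler = std::move(queue.front());
      queue.pop_front();
      handler();  // a handler may call send() and enqueue more work
    }
  }
};

// Hypothetical receiver that needs `expected` contributions before it is done.
struct Collector {
  int expected;
  int received = 0;
  bool done = false;

  explicit Collector(int n) : expected(n) { assert(expected > 0); }

  void contribute() {
    if (++received == expected) done = true;
  }
};

// Returns true if the collector finished before the scheduler went quiescent.
bool simulate(int expected, int senders) {
  ToyScheduler sched;
  Collector c(expected);
  for (int i = 0; i < senders; ++i) sched.send([&c] { c.contribute(); });
  sched.runUntilQuiescence();
  return c.done;
}
```

Under these assumptions, simulate(3, 3) returns true, while simulate(3, 2)
returns false: the collector is still "waiting", yet no message exists anywhere
in the system, which is exactly the state the quoted definition describes.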

Also, for more information, I built Charm++ with the following build flags:

--enable-error-checking --with-prio-type=int --enable-randomized-msgq
--suffix randq-debug

So, assuming I'm on the right path, I would now like to get more meaningful
state out of my application than the trace below. Does anyone have any idea
how to get, say, an entry method name into such a trace?

=======================================================================
Stack trace (most recent call last):
#23 Object "/lib/x86_64-linux-gnu/ld-2.27.so", at 0xffffffffffffffff, in
#22 Object ".../Main/inciter", at 0x4c13a9, in _start
#21 Source "../csu/libc-start.c", line 310, in __libc_start_main
#20 Source
".../clang-x86_64/charm/mpi-linux-x86_64-randq-debug/tmp/main.C", line 9, in
main
6: int stack_top=0;
7: memory_stack_top = &stack_top;
8:
> 9: ConverseInit(argc, argv, (CmiStartFn) _initCharm, 0, 0);
10:
11: return 0;
12: }
#19 Source "./machine-common-core.c", line 1432, in ConverseInit
#18 Source "./machine-common-core.c", line 1539, in ConverseRunPE
#17 Source
".../clang-x86_64/charm/mpi-linux-x86_64-randq-debug/tmp/convcore.c", line
1847, in CsdScheduler
1845: int CsdScheduler(int maxmsgs)
1846: {
>1847: if (maxmsgs<0) CsdScheduleForever();
1848: else if (maxmsgs==0)
1849: CsdSchedulePoll();
1850: else /*(maxmsgs>0)*/
#16 Source
".../clang-x86_64/charm/mpi-linux-x86_64-randq-debug/tmp/convcore.c", line
1911, in CsdScheduleForever
1908: msg = CsdNextMessage(&state);
1909: if (msg!=NULL) { /*A message is available-- process it*/
1910: if (isIdle) {isIdle=0;CsdEndIdle();}
>1911: SCHEDULE_MESSAGE
1912:
1913: #if CMK_CELL
1914: if (progressCount <= 0) {
#15 Source
".../clang-x86_64/charm/mpi-linux-x86_64-randq-debug/tmp/convcore.c", line
1663, in CmiHandleMessage
1660: CmiAbort("Msg handler does not exist, possible race
condition during init\n");
1661: }
1662: #endif
>1663: (h->hdlr)(msg,h->userPtr);
1664: #if CMK_TRACE_ENABLED
1665: /* setMemoryStatus(0) */ /* charmdebug */
1666: //_LOG_E_HANDLER_END(handlerIdx); /* projector */
#14 Source ".../clang-x86_64/charm/mpi-linux-x86_64-randq-debug/tmp/ck.C",
line 1329, in _processHandler
1326: case ForChareMsg :
1327: TELLMSGTYPE(CkPrintf("proc[%d]: _processHandler with msg
type: ForChareMsg\n", CkMyPe());)
1328: if(env->isPacked()) CkUnpackMessage(&env);
>1329: _processForPlainChareMsg(ck,env);
1330: break;
1331: case ForVidMsg :
1332: TELLMSGTYPE(CkPrintf("proc[%d]: _processHandler with msg
type: ForVidMsg\n", CkMyPe());)
#13 Source ".../clang-x86_64/charm/mpi-linux-x86_64-randq-debug/tmp/ck.C",
line 1025, in _processForPlainChareMsg
1022: obj = env->getObjPtr();
1023: #endif
1024: }
>1025: _invokeEntry(epIdx,env,obj);
1026: _STATS_RECORD_PROCESS_MSG_1();
1027: }
#12 Source ".../clang-x86_64/charm/mpi-linux-x86_64-randq-debug/tmp/ck.C",
line 652, in _invokeEntry
649: _TRACE_BEGIN_EXECUTE(env, obj);
650: if(_entryTable[epIdx]->appWork)
651: _TRACE_BEGIN_APPWORK();
> 652: _invokeEntryNoTrace(epIdx,env,obj);
653: if(_entryTable[epIdx]->appWork)
654: _TRACE_END_APPWORK();
655: _TRACE_END_EXECUTE();
#11 Source ".../clang-x86_64/charm/mpi-linux-x86_64-randq-debug/tmp/ck.C",
line 641, in _invokeEntryNoTrace
638: {
639: void *msg = EnvToUsr(env);
640: _SET_USED(env, 0);
> 641: CkDeliverMessageFree(epIdx,msg,obj);
642: }
643:
644: static inline void _invokeEntry(int epIdx,envelope *env,void
*obj)
#10 Source ".../clang-x86_64/charm/mpi-linux-x86_64-randq-debug/tmp/ck.C",
line 597, in CkDeliverMessageFree
594: #if CMK_CHARMDEBUG
595: CpdBeforeEp(epIdx, obj, msg);
596: #endif
> 597: _entryTable[epIdx]->call(msg, obj);
598: #if CMK_CHARMDEBUG
599: CpdAfterEp(epIdx);
600: #endif
#9 Source "RNGTest/../Main/inciter.def.h", line 290, in
_call_quiescence_void
#8 Source "../Main/Inciter.C", line 216, in quiescence
213: }
214:
215: void quiescence() {
> 216: Throw( "Quiescence detected" );
217: }
218:
219: private:
#7 Object "/usr/lib/x86_64-linux-gnu/libc++abi.so.1.0", at 0x7fffe4c294ae,
in __cxa_throw
#6 Object "/usr/lib/x86_64-linux-gnu/libc++abi.so.1.0", at 0x7fffe4c29515,
in
#5 Object "/usr/lib/x86_64-linux-gnu/libc++abi.so.1.0", at 0x7fffe4c26762,
in
#4 Source "../Base/ProcessException.C", line 79, in __invoke
76: //
*****************************************************************************
77: {
78: // override Charm++'s terminate handler
> 79: std::set_terminate( [](){
80: tk::Exception("Terminate was called").handleException();
81: CkAbort( "Signal caught" );
82: } );
#3 Source "../Base/ProcessException.C", line 80, in operator()
77: {
78: // override Charm++'s terminate handler
79: std::set_terminate( [](){
> 80: tk::Exception("Terminate was called").handleException();
81: CkAbort( "Signal caught" );
82: } );
#2 Source "../Base/Exception.C", line 193, in handleException
190: #ifdef HAS_BACKWARD
191: fprintf( stderr, ">>>\n>>> =========== STACK TRACE
==========\n>>>\n" );
192: using namespace backward;
> 193: StackTrace st; st.load_here(64);
194: Printer p; p.print( st, stderr );
195: fprintf( stderr, ">>>\n>>> ======= END OF STACK TRACE
=======\n>>>\n" );
196: #endif
#1 Source ".../clang-x86_64/include/backward.hpp", line 750, in load_here
747: return 0;
748: }
749: _stacktrace.resize(depth);
> 750: size_t trace_cnt = details::unwind(callback(*this),
depth);
751: _stacktrace.resize(trace_cnt);
752: skip_n_firsts(0);
753: return size();
#0 Source ".../clang-x86_64/include/backward.hpp", line 734, in
unwind<backward::StackTraceImpl<backward::system_tag::linux_tag>::callback>
731: template <typename F>
732: size_t unwind(F f, size_t depth) {
733: Unwinder<F> unwinder;
> 734: return unwinder(f, depth);
735: }
736:
737: } // namespace details
=======================================================================
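
One application-level way to get more than a stack trace at this point is to
record, while the program runs, which messages each object is still expecting,
and to print that record from the quiescence handler. The sketch below is
plain C++ and only an illustration: PendingTracker and its methods are
hypothetical names, not part of Charm++, and how the per-object records would
be collected in a real Charm++ run is left open.

```cpp
#include <cassert>
#include <map>
#include <sstream>
#include <string>

// Hypothetical helper (not a Charm++ API): bookkeeping of expected messages.
class PendingTracker {
 public:
  // Call when an object starts waiting for `count` messages under `label`,
  // e.g. a label naming the entry method that should receive them.
  void expect(const std::string& label, int count = 1) {
    assert(count > 0);
    pending_[label] += count;
  }

  // Call at the top of the entry method that receives the message.
  void arrived(const std::string& label) {
    auto it = pending_.find(label);
    if (it == pending_.end()) {  // nobody registered interest in this label
      ++unexpected_[label];
      return;
    }
    if (--it->second == 0) pending_.erase(it);
  }

  // True if nothing is being waited for.
  bool empty() const { return pending_.empty(); }

  // Human-readable summary, intended to be printed from a quiescence handler.
  std::string report() const {
    std::ostringstream os;
    for (const auto& kv : pending_)
      os << "still waiting: " << kv.first << " x" << kv.second << '\n';
    for (const auto& kv : unexpected_)
      os << "unexpected: " << kv.first << " x" << kv.second << '\n';
    return os.str();
  }

 private:
  std::map<std::string, int> pending_;     // label -> messages still expected
  std::map<std::string, int> unexpected_;  // label -> messages nobody expected
};
```

For example, after expect("Worker::recvGhost", 2) followed by a single
arrived("Worker::recvGhost"), report() names the entry method that is still
short of one message. The label is made up; any string that identifies the
waiting entry method would do.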

Thanks,
Jozsef

On 06.26.2018 21:51, Vinicius Freitas wrote:
> Jozsef,
>
> That depends on what a deadlock is for you, in your specific application.
> There is a reason why quiescence detection is not usually recommended.
> Entry methods don't usually "wait to be called". If in your specific
> application you are specifically waiting for something to happen inside an
> entry method, then no, quiescence will not happen.
>
> So, for your first question, a deadlock may or may not be a quiescence,
> depending on what is causing your deadlock. Investigating this would be
> important for you if it is not viable to eliminate the deadlock completely.
> As for your second question, I've only used quiescence detection with
> callbacks in Charm++. This way, I would only know where I am in the
> application's execution flow; information on what happened would only be
> available in data that was manipulated beforehand, during the operations
> that caused the deadlock.
>
>
> Hope I was of help,
> Best regards,
>
> 2018-06-26 21:41 GMT-03:00 Jozsef Bakosi
> <jbakosi AT lanl.gov>:
>
> > Thanks, Vinicius,
> >
> > I also thought about quiescence detection, but the manual says:
> >
> > "In Charm++, quiescence is defined as the state in which no processor is
> > executing an entry point, no messages are awaiting processing, and there
> > are no messages in-flight."
> >
> > Does a deadlock qualify as quiescence? I think if there is a deadlock,
> > messages may be awaiting processing, i.e., entry methods are waiting to
> > be called, but for some reason they are not called. In other words, would
> > quiescence happen for a deadlock?
> >
> > Also, even if the answer is yes, how and what kind of information do I
> > get at the application level once my function is called after quiescence?
> > How would that help debugging what led to that function call?
> >
> > Thanks,
> > Jozsef
> >
> > On 06.26.2018 21:06, Vinicius Freitas wrote:
> > > Jozsef,
> > >
> > > Charm has a quiescence detection mechanism that might help you. As soon
> > > as you start quiescence detection in Charm, you also declare the method
> > > it will reduce to (which every chare will then execute), and it will
> > > only be triggered once NOTHING is happening on the system. Would this
> > > help you in any way?
> > >
> > > Vinicius F.
> > >
> > > On Tue, Jun 26, 2018 at 20:04, Jozsef Bakosi <jbakosi AT lanl.gov>
> > > wrote:
> > >
> > > > Hi folks,
> > > >
> > > > From time to time I run into asynchronous logic errors (that I'm
> > > > pretty sure are my fault) that non-deterministically produce
> > > > deadlocks without an error message.
> > > >
> > > > I wonder what tools I have available that I can use to diagnose such
> > > > problems.
> > > >
> > > > Somehow, it would be great if I could detect from the runtime system
> > > > that some messages are being waited on but nothing is really
> > > > happening, in which case I could dump some messages and their
> > > > labels/ids/etc. That would help identify at least what entry methods
> > > > are waiting for messages, or something similar.
> > > >
> > > > What do you suggest? How do you deal with such errors?
> > > >
> > > > Thanks,
> > > > Jozsef


