charm - Re: [charm] Deadlock detection

charm AT lists.cs.illinois.edu

Subject: Charm++ parallel programming system

  • From: Jozsef Bakosi <jbakosi AT lanl.gov>
  • To: Phil Miller <mille121 AT illinois.edu>
  • Cc: Vinicius Freitas <vinicius.mct.freitas AT gmail.com>, charm <charm AT lists.cs.illinois.edu>
  • Subject: Re: [charm] Deadlock detection
  • Date: Thu, 28 Jun 2018 07:02:38 -0600

Thanks Phil,

I think I managed to track the problem down with the quiescence detection
feature, using printouts to guess where the random deadlock happened.

I would like to understand quiescence detection in Charm++ better. Can I
assume that if an algorithm is NOT specifically designed to rely on
quiescence, quiescence occurring is (likely) a sign of an error? If so, how
likely? How does quiescence detection really work in Charm++? Is it based on
some defined frequency, such that it is triggered if a period passes with no
messages? I'm wondering if I should keep the quiescence detection in the code
to increase the chances of catching random errors.

Thanks,
Jozsef

On 06.27.2018 11:32, Phil Miller wrote:
> Hi Jozsef,
>
> You could modify your objects' existing control flow to set a flag or
> counter indicating where they are, and add a 'debugging output' method to
> them to print those counters/flags that would get called by the QD callback.
>
> For the development team, this suggests a simplistic addition to SDAG to
> have an entry method that will dump the object's SDAG state, so that users
> would only have to write their own if they need something more specific.
> Additionally, this suggests a useful feature idea for CharmDebug in being
> able to examine this state interactively. Drawing from Alinea's tools, it
> would be good to visualize and cluster objects' SDAG state, making it easy
> to examine and compare representatives and outliers.
>
> Phil
>
> On Wed, Jun 27, 2018 at 10:58 AM, Jozsef Bakosi <jbakosi AT lanl.gov> wrote:
>
> > Hi Vinicius,
> >
> > This is useful. I have put in a quiescence detection callback, which is
> > not triggered most of the time. But I managed to trigger it by running a
> > bunch of regression tests, scheduled in parallel on a shared-memory
> > machine, in an infinite shell while-loop, while also randomly loading all
> > CPUs in the background and using Charm++ randomized message queues. I'm
> > assuming that triggering quiescence in this scenario means that I
> > successfully hit an error that would otherwise lead to a deadlock. (This
> > might very well be a conjecture with some wishful thinking.)
> >
> > Am I correct that quiescence should not happen in a "normal" application
> > that is not specifically designed with quiescence in mind?
> >
> > Also, for more information, I built Charm++ with the following build
> > flags:
> >
> > --enable-error-checking --with-prio-type=int --enable-randomized-msgq --suffix randq-debug
> >
> > So assuming I'm on the right path, now I would like to get a more
> > meaningful state out of my application besides the trace below. Does
> > anyone have any clue how to get perhaps an entry method name into such a
> > trace?
> >
> > =======================================================================
> > Stack trace (most recent call last):
> > #23 Object "/lib/x86_64-linux-gnu/ld-2.27.so", at 0xffffffffffffffff, in
> > #22 Object ".../Main/inciter", at 0x4c13a9, in _start
> > #21 Source "../csu/libc-start.c", line 310, in __libc_start_main
> > #20 Source ".../clang-x86_64/charm/mpi-linux-x86_64-randq-debug/tmp/main.C", line 9, in main
> > 6: int stack_top=0;
> > 7: memory_stack_top = &stack_top;
> > 8:
> > > 9: ConverseInit(argc, argv, (CmiStartFn) _initCharm, 0, 0);
> > 10:
> > 11: return 0;
> > 12: }
> > #19 Source "./machine-common-core.c", line 1432, in ConverseInit
> > #18 Source "./machine-common-core.c", line 1539, in ConverseRunPE
> > #17 Source ".../clang-x86_64/charm/mpi-linux-x86_64-randq-debug/tmp/convcore.c", line 1847, in CsdScheduler
> > 1845: int CsdScheduler(int maxmsgs)
> > 1846: {
> > >1847: if (maxmsgs<0) CsdScheduleForever();
> > 1848: else if (maxmsgs==0)
> > 1849: CsdSchedulePoll();
> > 1850: else /*(maxmsgs>0)*/
> > #16 Source ".../clang-x86_64/charm/mpi-linux-x86_64-randq-debug/tmp/convcore.c", line 1911, in CsdScheduleForever
> > 1908: msg = CsdNextMessage(&state);
> > 1909: if (msg!=NULL) { /*A message is available-- process it*/
> > 1910: if (isIdle) {isIdle=0;CsdEndIdle();}
> > >1911: SCHEDULE_MESSAGE
> > 1912:
> > 1913: #if CMK_CELL
> > 1914: if (progressCount <= 0) {
> > #15 Source ".../clang-x86_64/charm/mpi-linux-x86_64-randq-debug/tmp/convcore.c", line 1663, in CmiHandleMessage
> > 1660: CmiAbort("Msg handler does not exist, possible race condition during init\n");
> > 1661: }
> > 1662: #endif
> > >1663: (h->hdlr)(msg,h->userPtr);
> > 1664: #if CMK_TRACE_ENABLED
> > 1665: /* setMemoryStatus(0) */ /* charmdebug */
> > 1666: //_LOG_E_HANDLER_END(handlerIdx); /* projector */
> > #14 Source ".../clang-x86_64/charm/mpi-linux-x86_64-randq-debug/tmp/ck.C", line 1329, in _processHandler
> > 1326: case ForChareMsg :
> > 1327: TELLMSGTYPE(CkPrintf("proc[%d]: _processHandler with msg type: ForChareMsg\n", CkMyPe());)
> > 1328: if(env->isPacked()) CkUnpackMessage(&env);
> > >1329: _processForPlainChareMsg(ck,env);
> > 1330: break;
> > 1331: case ForVidMsg :
> > 1332: TELLMSGTYPE(CkPrintf("proc[%d]: _processHandler with msg type: ForVidMsg\n", CkMyPe());)
> > #13 Source ".../clang-x86_64/charm/mpi-linux-x86_64-randq-debug/tmp/ck.C", line 1025, in _processForPlainChareMsg
> > 1022: obj = env->getObjPtr();
> > 1023: #endif
> > 1024: }
> > >1025: _invokeEntry(epIdx,env,obj);
> > 1026: _STATS_RECORD_PROCESS_MSG_1();
> > 1027: }
> > #12 Source ".../clang-x86_64/charm/mpi-linux-x86_64-randq-debug/tmp/ck.C", line 652, in _invokeEntry
> > 649: _TRACE_BEGIN_EXECUTE(env, obj);
> > 650: if(_entryTable[epIdx]->appWork)
> > 651: _TRACE_BEGIN_APPWORK();
> > > 652: _invokeEntryNoTrace(epIdx,env,obj);
> > 653: if(_entryTable[epIdx]->appWork)
> > 654: _TRACE_END_APPWORK();
> > 655: _TRACE_END_EXECUTE();
> > #11 Source ".../clang-x86_64/charm/mpi-linux-x86_64-randq-debug/tmp/ck.C", line 641, in _invokeEntryNoTrace
> > 638: {
> > 639: void *msg = EnvToUsr(env);
> > 640: _SET_USED(env, 0);
> > > 641: CkDeliverMessageFree(epIdx,msg,obj);
> > 642: }
> > 643:
> > 644: static inline void _invokeEntry(int epIdx,envelope *env,void *obj)
> > #10 Source ".../clang-x86_64/charm/mpi-linux-x86_64-randq-debug/tmp/ck.C", line 597, in CkDeliverMessageFree
> > 594: #if CMK_CHARMDEBUG
> > 595: CpdBeforeEp(epIdx, obj, msg);
> > 596: #endif
> > > 597: _entryTable[epIdx]->call(msg, obj);
> > 598: #if CMK_CHARMDEBUG
> > 599: CpdAfterEp(epIdx);
> > 600: #endif
> > #9 Source "RNGTest/../Main/inciter.def.h", line 290, in _call_quiescence_void
> > #8 Source "../Main/Inciter.C", line 216, in quiescence
> > 213: }
> > 214:
> > 215: void quiescence() {
> > > 216: Throw( "Quiescence detected" );
> > 217: }
> > 218:
> > 219: private:
> > #7 Object "/usr/lib/x86_64-linux-gnu/libc++abi.so.1.0", at 0x7fffe4c294ae, in __cxa_throw
> > #6 Object "/usr/lib/x86_64-linux-gnu/libc++abi.so.1.0", at 0x7fffe4c29515, in
> > #5 Object "/usr/lib/x86_64-linux-gnu/libc++abi.so.1.0", at 0x7fffe4c26762, in
> > #4 Source "../Base/ProcessException.C", line 79, in __invoke
> > 76: // *****************************************************************************
> > 77: {
> > 78: // override Charm++'s terminate handler
> > > 79: std::set_terminate( [](){
> > 80: tk::Exception("Terminate was called").handleException();
> > 81: CkAbort( "Signal caught" );
> > 82: } );
> > #3 Source "../Base/ProcessException.C", line 80, in operator()
> > 77: {
> > 78: // override Charm++'s terminate handler
> > 79: std::set_terminate( [](){
> > > 80: tk::Exception("Terminate was called").handleException();
> > 81: CkAbort( "Signal caught" );
> > 82: } );
> > #2 Source "../Base/Exception.C", line 193, in handleException
> > 190: #ifdef HAS_BACKWARD
> > 191: fprintf( stderr, ">>>\n>>> =========== STACK TRACE ==========\n>>>\n" );
> > 192: using namespace backward;
> > > 193: StackTrace st; st.load_here(64);
> > 194: Printer p; p.print( st, stderr );
> > 195: fprintf( stderr, ">>>\n>>> ======= END OF STACK TRACE =======\n>>>\n" );
> > 196: #endif
> > #1 Source ".../clang-x86_64/include/backward.hpp", line 750, in load_here
> > 747: return 0;
> > 748: }
> > 749: _stacktrace.resize(depth);
> > > 750: size_t trace_cnt = details::unwind(callback(*this), depth);
> > 751: _stacktrace.resize(trace_cnt);
> > 752: skip_n_firsts(0);
> > 753: return size();
> > #0 Source ".../clang-x86_64/include/backward.hpp", line 734, in unwind<backward::StackTraceImpl<backward::system_tag::linux_tag>::callback>
> > 731: template <typename F>
> > 732: size_t unwind(F f, size_t depth) {
> > 733: Unwinder<F> unwinder;
> > > 734: return unwinder(f, depth);
> > 735: }
> > 736:
> > 737: } // namespace details
> > =======================================================================
> >
> > Thanks,
> > Jozsef
> >
> > On 06.26.2018 21:51, Vinicius Freitas wrote:
> > > Jozsef,
> > >
> > > That depends on what a deadlock is for you, in your specific
> > > application. There is a reason why Quiescence Detection is not usually
> > > recommended. Entry methods don't usually "wait to be called". If in your
> > > specific application you are specifically waiting for something to
> > > happen inside an entry method, then no, quiescence will not happen.
> > >
> > > So, for your first question, a deadlock may or may not be a quiescence,
> > > depending on what is causing your deadlock. Investigating this would be
> > > important for you if it is not viable to completely eliminate it.
> > > As for your second question, I've only used quiescence detection with
> > > callbacks in Charm++. This way, I would only know where I am in the
> > > application's execution flow, but information on what happened would
> > > only be available in data that had been manipulated beforehand, during
> > > the operations that caused the deadlock.
> > >
> > >
> > > Hope I was of help,
> > > Best regards,
> > >
> > > 2018-06-26 21:41 GMT-03:00 Jozsef Bakosi <jbakosi AT lanl.gov>:
> > >
> > > > Thanks, Vinicius,
> > > >
> > > > I also thought about quiescence detection, but the manual says:
> > > >
> > > > "In Charm++, quiescence is defined as the state in which no processor
> > > > is executing an entry point, no messages are awaiting processing, and
> > > > there are no messages in-flight."
> > > >
> > > > Does a deadlock qualify as quiescence? I think if there is a deadlock,
> > > > messages may be awaiting processing, i.e., entry methods are waiting to
> > > > be called, but for some reason they are not called. In other words,
> > > > would quiescence happen for a deadlock?
> > > >
> > > > Also, even if the answer is yes, how and what kind of information do I
> > > > get at the application level after my function is called after
> > > > quiescence? How would that help debugging what led to that function
> > > > call?
> > > >
> > > > Thanks,
> > > > Jozsef
> > > >
> > > > On 06.26.2018 21:06, Vinicius Freitas wrote:
> > > > > Jozsef,
> > > > >
> > > > > Charm has a Quiescence detection mechanism that might help you. As
> > > > > soon as you start a Quiescence detection in Charm, you will also
> > > > > declare which method it will reduce to (and then, every chare will
> > > > > execute), and it will only be triggered once NOTHING is happening on
> > > > > the system. Would this help you in any way?
> > > > >
> > > > > Vinicius F.
> > > > >
> > > > > On Tue, Jun 26, 2018 at 20:04, Jozsef Bakosi <jbakosi AT lanl.gov> wrote:
> > > > >
> > > > > > Hi folks,
> > > > > >
> > > > > > From time to time I run into asynchronous logic errors (that I'm
> > > > > > pretty sure are my fault) that non-deterministically produce
> > > > > > deadlocks without an error message.
> > > > > >
> > > > > > I wonder what tools I have available that I can use to diagnose
> > > > > > such problems.
> > > > > >
> > > > > > Somehow, it would be great if I could detect from the runtime
> > > > > > system that some messages are being waited on, but nothing really
> > > > > > happens, in which case I could dump some messages and their
> > > > > > labels/ids/etc. That would help identify at least what entry
> > > > > > methods are waiting for messages, or something similar.
> > > > > >
> > > > > > What do you suggest? How do you deal with such errors?
> > > > > >
> > > > > > Thanks,
> > > > > > Jozsef


